6: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
9: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
8: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
10: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
6: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_
6: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0
8: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_
8: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0
6: START 2068238: Fri Nov 25 17:30:28 EET 2022
8: START 2068238: Fri Nov 25 17:30:28 EET 2022
7: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
4: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
9: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_
10: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_
1: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
16: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
2: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
3: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
23: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
11: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
30: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
18: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
4: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_
9: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0
9: START 2068238: Fri Nov 25 17:30:28 EET 2022
10: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0
10: START 2068238: Fri Nov 25 17:30:28 EET 2022
4: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0
4: START 2068238: Fri Nov 25 17:30:28 EET 2022
27: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
31: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
21: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
0: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
26: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
25: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
5: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
14: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
13: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
22: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
12: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
15: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
19: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
20: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
28: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
24: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
17: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
29: Model parameters: d_model 1792 ffw_size 7168 kv_size 128 n_heads 14 n_layers 26
23: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_
23: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0
23: START 2068238: Fri Nov 25 17:30:28 EET 2022
7: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_
7: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0
7: START 2068238: Fri Nov 25 17:30:28 EET 2022
18: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_
18: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0
18: START 2068238: Fri Nov 25 17:30:28 EET 2022
9:
9:
9: ======================= ROCm System Management Interface =======================
9: ================================= Concise Info =================================
9: GPU  Temp   AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
9: 0    47.0c  92.0W   800Mhz  1600Mhz  0%   auto  560.0W  0%     0%
9: 1    46.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W    0%     0%
9: 2    41.0c  85.0W   800Mhz  1600Mhz  0%   auto  560.0W  0%     0%
9: 3    43.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W    0%     0%
9: 4    44.0c  89.0W   800Mhz  1600Mhz  0%   auto  560.0W  0%     0%
9: 5    45.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W    0%     0%
9: 6    38.0c  91.0W   800Mhz  1600Mhz  0%   auto  560.0W  0%     0%
9: 7    43.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W    0%     0%
9: ================================================================================
9: ============================= End of ROCm SMI Log ==============================
6:
6:
6: ======================= ROCm System Management Interface =======================
6: ================================= Concise Info =================================
6: GPU  Temp   AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
6: 0    46.0c  87.0W   800Mhz  1600Mhz  0%   auto  560.0W  0%     0%
6: 1    46.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W    0%     0%
6: 2    38.0c  90.0W   800Mhz  1600Mhz  0%   auto  560.0W  0%     0%
6: 3    44.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W    0%     0%
6: 4    39.0c  89.0W   800Mhz  1600Mhz  0%   auto  560.0W  0%     0%
6: 5    46.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W    0%     0%
6: 6    40.0c  87.0W   800Mhz  1600Mhz  0%   auto  560.0W  0%     0%
6: 7    46.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W    0%     0%
6: ================================================================================
6: ============================= End of ROCm SMI Log ==============================
8:
8:
8: ======================= ROCm System Management Interface =======================
8: ================================= Concise Info =================================
8: GPU  Temp   AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
8: 0    41.0c  93.0W   800Mhz  1600Mhz  0%   auto  560.0W  0%     0%
8: 1    41.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W    0%     0%
8: 2    43.0c  87.0W   800Mhz  1600Mhz  0%   auto  560.0W  0%     0%
8: 3    47.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W    0%     0%
8: 4    41.0c  89.0W   800Mhz  1600Mhz  0%   auto  560.0W  0%     0%
8: 5    43.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W    0%     0%
8: 6    45.0c  90.0W   800Mhz  1600Mhz  0%   auto  560.0W  0%     0%
8: 7    45.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W    0%     0%
8: ================================================================================
8: ============================= End of ROCm SMI Log ==============================
10:
10:
10: ======================= ROCm System Management Interface =======================
10: ================================= Concise Info =================================
10: GPU  Temp   AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
10: 0    46.0c  84.0W   800Mhz  1600Mhz  0%   auto  560.0W  0%     0%
10: 1    47.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W    0%     0%
10: 2    47.0c  83.0W   800Mhz  1600Mhz  0%   auto  560.0W  0%     0%
10: 3    46.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W    0%     0%
10: 4    42.0c  84.0W   800Mhz  1600Mhz  0%   auto  560.0W  0%     0%
10: 5    44.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W    0%     0%
10: 6    37.0c  88.0W   800Mhz  1600Mhz  0%   auto  560.0W  0%     0%
10: 7    41.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W    0%     0%
10: ================================================================================
10: ============================= End of ROCm SMI Log ==============================
4:
4:
4: ======================= ROCm System Management Interface =======================
4: ================================= Concise Info =================================
4: GPU  Temp   AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
4: 0    44.0c  92.0W   800Mhz  1600Mhz  0%   auto  560.0W  0%     0%
4: 1    44.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W    0%     0%
4: 2    38.0c  84.0W   800Mhz  1600Mhz  0%   auto  560.0W  0%     0%
4: 3    46.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W    0%     0%
4: 4    46.0c  76.0W   800Mhz  1600Mhz  0%   auto  560.0W  0%     0%
4: 5    49.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W    0%     0%
4: 6    38.0c  89.0W   800Mhz  1600Mhz  0%   auto  560.0W  0%     0%
4: 7    44.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W    0%     0%
4: ================================================================================
4: ============================= End of ROCm SMI Log ==============================
11: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_
11: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0
11: START 2068238: Fri Nov 25 17:30:28 EET 2022
1: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_
1: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0
1: START 2068238: Fri Nov 25 17:30:28 EET 2022
16: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_
16: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0
16: START 2068238: Fri Nov 25 17:30:28 EET 2022
30: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_
30: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0
30: START 2068238: Fri Nov 25 17:30:28 EET 2022
2: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_
2: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0
2: START 2068238: Fri Nov 25 17:30:28 EET 2022
3: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_
3: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0
3: START 2068238: Fri Nov 25 17:30:28 EET 2022
27: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_
27: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0
27: START 2068238: Fri Nov 25 17:30:28 EET 2022
31: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_
31: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0
21: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_
21: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0
31: START 2068238: Fri Nov 25 17:30:28 EET 2022
21: START 2068238: Fri Nov 25 17:30:28 EET 2022
0: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_
0: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0
0: START 2068238: Fri Nov 25 17:30:28 EET 2022
26: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_
26: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0
25: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_
25: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0
26: START 2068238: Fri Nov 25 17:30:28 EET 2022
25: START 2068238: Fri Nov 25 17:30:28 EET 2022
5: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_
5: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0
5: START 2068238: Fri Nov 25 17:30:28 EET 2022
14: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_
14: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0
14: START 2068238: Fri Nov 25 17:30:28 EET 2022
13: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_
13: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0
13: START 2068238: Fri Nov 25 17:30:28 EET 2022
22: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_
22: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0
22: START 2068238: Fri Nov 25 17:30:28 EET 2022
12: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5
--lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_ 12: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0 12: START 2068238: Fri Nov 25 17:30:28 EET 2022 15: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_ 15: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0 15: START 2068238: Fri Nov 25 17:30:28 EET 2022 19: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 
--num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_ 19: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0 20: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path 
/scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_ 20: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0 19: START 2068238: Fri Nov 25 17:30:28 EET 2022 20: START 2068238: Fri Nov 25 17:30:28 EET 2022 28: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_ 28: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0 28: START 2068238: Fri Nov 25 17:30:28 EET 2022 24: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 
2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_ 24: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0 24: START 2068238: Fri Nov 25 17:30:28 EET 2022 17: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 --hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_ 17: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0 17: START 2068238: Fri Nov 25 17:30:28 EET 2022 29: Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 26 
--hidden-size 1792 --num-attention-heads 14 --kv-channels 128 --ffn-hidden-size 7168 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 256 --train-samples 44_416_143 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b1long --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 44_416_143 --lr-warmup-samples 444_161 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b1long --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b1long --load checkpoints_1b1long --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_ 29: document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2068238.json --zero-stage 0 29: START 2068238: Fri Nov 25 17:30:28 EET 2022 7: 7: 7: ======================= ROCm System Management Interface ======================= 7: ================================= Concise Info ================================= 7: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 7: 0 45.0c 93.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 7: 1 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 7: 2 45.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 7: 3 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 7: 4 42.0c 90.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 7: 5 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 7: 6 38.0c 93.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 7: 7 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 7: ================================================================================ 7: ============================= End of ROCm SMI Log ============================== 23: 23: 23: ======================= ROCm System Management Interface ======================= 23: 
================================= Concise Info ================================= 23: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 23: 0 39.0c 93.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 23: 1 51.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 23: 2 41.0c 90.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 23: 3 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 23: 4 44.0c 83.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 23: 5 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 23: 6 40.0c 82.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 23: 7 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 23: ================================================================================ 23: ============================= End of ROCm SMI Log ============================== 18: 18: 18: ======================= ROCm System Management Interface ======================= 18: ================================= Concise Info ================================= 18: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 18: 0 47.0c 90.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 18: 1 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 18: 2 45.0c 86.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 18: 3 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 18: 4 41.0c 84.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 18: 5 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 18: 6 35.0c 87.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 18: 7 41.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 18: ================================================================================ 18: ============================= End of ROCm SMI Log ============================== 11: 11: 11: ======================= ROCm System Management Interface ======================= 11: ================================= Concise Info ================================= 11: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 11: 0 41.0c 91.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 11: 1 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 11: 2 42.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 11: 3 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 11: 4 41.0c 
85.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 11: 5 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 11: 6 38.0c 91.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 11: 7 41.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 11: ================================================================================ 11: ============================= End of ROCm SMI Log ============================== 1: 1: 1: ======================= ROCm System Management Interface ======================= 1: ================================= Concise Info ================================= 1: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 1: 0 45.0c 94.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 1: 1 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 1: 2 42.0c 83.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 1: 3 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 1: 4 41.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 1: 5 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 1: 6 37.0c 90.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 1: 7 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 1: ================================================================================ 1: ============================= End of ROCm SMI Log ============================== 16: 16: 16: ======================= ROCm System Management Interface ======================= 16: ================================= Concise Info ================================= 16: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 16: 0 43.0c 97.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 16: 1 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 16: 2 41.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 16: 3 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 16: 4 37.0c 84.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 16: 5 42.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 16: 6 40.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 16: 7 39.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 16: ================================================================================ 16: ============================= End of ROCm SMI Log ============================== 30: 
30: 30: ======================= ROCm System Management Interface ======================= 30: ================================= Concise Info ================================= 30: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 30: 0 41.0c 95.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 30: 1 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 30: 2 43.0c 87.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 30: 3 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 30: 4 44.0c 81.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 30: 5 49.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 30: 6 39.0c 87.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 30: 7 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 30: ================================================================================ 30: ============================= End of ROCm SMI Log ============================== 2: 2: 2: ======================= ROCm System Management Interface ======================= 2: ================================= Concise Info ================================= 2: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 2: 0 40.0c 98.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 2: 1 50.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 2: 2 40.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 2: 3 50.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 2: 4 40.0c 93.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 2: 5 42.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 2: 6 38.0c 85.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 2: 7 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 2: ================================================================================ 2: ============================= End of ROCm SMI Log ============================== 3: 3: 3: ======================= ROCm System Management Interface ======================= 3: ================================= Concise Info ================================= 3: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 3: 0 43.0c 94.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 3: 1 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 3: 2 44.0c 80.0W 800Mhz 1600Mhz 0% auto 
560.0W 0% 0% 3: 3 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 3: 4 40.0c 84.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 3: 5 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 3: 6 36.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 3: 7 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 3: ================================================================================ 3: ============================= End of ROCm SMI Log ============================== 27: 27: 27: ======================= ROCm System Management Interface ======================= 27: ================================= Concise Info ================================= 27: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 27: 0 42.0c 93.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 27: 1 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 27: 2 40.0c 83.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 27: 3 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 27: 4 38.0c 84.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 27: 5 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 27: 6 38.0c 84.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 27: 7 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 27: ================================================================================ 27: ============================= End of ROCm SMI Log ============================== 21: 21: 21: ======================= ROCm System Management Interface ======================= 21: ================================= Concise Info ================================= 21: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 21: 0 43.0c 98.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 21: 1 50.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 21: 2 35.0c 83.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 21: 3 51.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 21: 4 46.0c 84.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 21: 5 42.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 21: 6 46.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 21: 7 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 21: ================================================================================ 21: 
============================= End of ROCm SMI Log ============================== 31: 31: 31: ======================= ROCm System Management Interface ======================= 31: ================================= Concise Info ================================= 31: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 31: 0 47.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 31: 1 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 31: 2 43.0c 83.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 31: 3 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 31: 4 39.0c 79.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 31: 5 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 31: 6 38.0c 85.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 31: 7 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 31: ================================================================================ 31: ============================= End of ROCm SMI Log ============================== 26: 26: 26: ======================= ROCm System Management Interface ======================= 26: ================================= Concise Info ================================= 26: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 26: 0 44.0c 87.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 26: 1 49.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 26: 2 41.0c 91.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 26: 3 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 26: 4 41.0c 83.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 26: 5 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 26: 6 43.0c 87.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 26: 7 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 26: ================================================================================ 26: ============================= End of ROCm SMI Log ============================== 25: 25: 25: ======================= ROCm System Management Interface ======================= 25: ================================= Concise Info ================================= 25: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 25: 0 45.0c 89.0W 800Mhz 1600Mhz 0% auto 
560.0W 0% 0% 25: 1 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 25: 2 40.0c 92.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 25: 3 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 25: 4 41.0c 77.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 25: 5 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 25: 6 40.0c 91.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 25: 7 41.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 25: ================================================================================ 25: ============================= End of ROCm SMI Log ============================== 14: 14: 14: ======================= ROCm System Management Interface ======================= 14: ================================= Concise Info ================================= 14: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 14: 0 46.0c 93.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 14: 1 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 14: 2 44.0c 81.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 14: 3 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 14: 4 43.0c 86.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 14: 5 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 14: 6 43.0c 91.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 14: 7 41.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 14: ================================================================================ 14: ============================= End of ROCm SMI Log ============================== 5: 5: 5: ======================= ROCm System Management Interface ======================= 5: ================================= Concise Info ================================= 5: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 5: 0 45.0c 91.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 5: 1 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 5: 2 36.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 5: 3 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 5: 4 46.0c 83.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 5: 5 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 5: 6 37.0c 87.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 5: 7 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 
0% 0% 5: ================================================================================ 5: ============================= End of ROCm SMI Log ============================== 0: 0: 0: ======================= ROCm System Management Interface ======================= 0: ================================= Concise Info ================================= 0: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 0: 0 44.0c 91.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 0: 1 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 0: 2 37.0c 91.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 0: 3 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 0: 4 40.0c 86.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 0: 5 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 0: 6 43.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 0: 7 42.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 0: ================================================================================ 0: ============================= End of ROCm SMI Log ============================== 13: 13: 13: ======================= ROCm System Management Interface ======================= 13: ================================= Concise Info ================================= 13: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 13: 0 42.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 13: 1 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 13: 2 45.0c 79.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 13: 3 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 13: 4 43.0c 81.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 13: 5 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 13: 6 37.0c 77.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 13: 7 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 13: ================================================================================ 13: ============================= End of ROCm SMI Log ============================== 22: 22: 22: ======================= ROCm System Management Interface ======================= 22: ================================= Concise Info ================================= 22: GPU Temp AvgPwr 
SCLK MCLK Fan Perf PwrCap VRAM% GPU% 22: 0 47.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 22: 1 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 22: 2 40.0c 81.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 22: 3 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 22: 4 38.0c 82.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 22: 5 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 22: 6 39.0c 82.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 22: 7 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 22: ================================================================================ 22: ============================= End of ROCm SMI Log ============================== 12: 12: 12: ======================= ROCm System Management Interface ======================= 12: ================================= Concise Info ================================= 12: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 12: 0 44.0c 96.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 12: 1 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 12: 2 38.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 12: 3 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 12: 4 36.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 12: 5 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 12: 6 41.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 12: 7 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 12: ================================================================================ 12: ============================= End of ROCm SMI Log ============================== 15: 15: 15: ======================= ROCm System Management Interface ======================= 15: ================================= Concise Info ================================= 15: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 15: 0 41.0c 93.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 15: 1 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 15: 2 37.0c 81.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 15: 3 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 15: 4 39.0c 84.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 15: 5 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 15: 6 
41.0c 91.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 15: 7 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 15: ================================================================================ 15: ============================= End of ROCm SMI Log ============================== 19: 19: 19: ======================= ROCm System Management Interface ======================= 19: ================================= Concise Info ================================= 19: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 19: 0 46.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 19: 1 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 19: 2 43.0c 84.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 19: 3 49.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 19: 4 45.0c 80.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 19: 5 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 19: 6 40.0c 83.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 19: 7 41.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 19: ================================================================================ 19: ============================= End of ROCm SMI Log ============================== 20: 20: 20: ======================= ROCm System Management Interface ======================= 20: ================================= Concise Info ================================= 20: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 20: 0 43.0c 90.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 20: 1 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 20: 2 41.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 20: 3 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 20: 4 40.0c 86.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 20: 5 42.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 20: 6 41.0c 77.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 20: 7 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 20: ================================================================================ 20: ============================= End of ROCm SMI Log ============================== 28: 28: 28: ======================= ROCm System Management Interface 
======================= 28: ================================= Concise Info ================================= 28: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 28: 0 44.0c 98.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 28: 1 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 28: 2 43.0c 83.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 28: 3 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 28: 4 45.0c 79.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 28: 5 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 28: 6 40.0c 91.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 28: 7 40.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 28: ================================================================================ 28: ============================= End of ROCm SMI Log ============================== 24: 24: 24: ======================= ROCm System Management Interface ======================= 24: ================================= Concise Info ================================= 24: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 24: 0 45.0c 92.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 24: 1 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 24: 2 39.0c 81.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 24: 3 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 24: 4 43.0c 81.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 24: 5 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 24: 6 40.0c 87.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 24: 7 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 24: ================================================================================ 24: ============================= End of ROCm SMI Log ============================== 17: 17: 17: ======================= ROCm System Management Interface ======================= 17: ================================= Concise Info ================================= 17: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 17: 0 48.0c 86.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 17: 1 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 17: 2 43.0c 86.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 17: 3 46.0c N/A 800Mhz 1600Mhz 0% 
auto 0.0W 0% 0%
17: 4 46.0c 80.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
17: 5 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
17: 6 42.0c 87.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
17: 7 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
17: ================================================================================
17: ============================= End of ROCm SMI Log ==============================
29:
29:
29: ======================= ROCm System Management Interface =======================
29: ================================= Concise Info =================================
29: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
29: 0 43.0c 91.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
29: 1 50.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
29: 2 42.0c 86.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
29: 3 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
29: 4 38.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
29: 5 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
29: 6 34.0c 86.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0%
29: 7 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0%
29: ================================================================================
29: ============================= End of ROCm SMI Log ==============================
6: Launching on nid006253 (6/32), master nid006168 port 9999, GPUs 8, CUDA: True
9: Launching on nid006256 (9/32), master nid006168 port 9999, GPUs 8, CUDA: True
11: Launching on nid006258 (11/32), master nid006168 port 9999, GPUs 8, CUDA: True
7: Launching on nid006254 (7/32), master nid006168 port 9999, GPUs 8, CUDA: True
8: Launching on nid006255 (8/32), master nid006168 port 9999, GPUs 8, CUDA: True
23: Launching on nid006270 (23/32), master nid006168 port 9999, GPUs 8, CUDA: True
4: Launching on nid006251 (4/32), master nid006168 port 9999, GPUs 8, CUDA: True
16: Launching on nid006263 (16/32), master nid006168 port 9999, GPUs 8, CUDA: True
27: Launching on nid006274 (27/32), master nid006168 port 9999, GPUs 8, CUDA: True
31: Launching on nid006278 (31/32), master nid006168 port 9999, GPUs 8, CUDA: True
3: Launching on nid006171 (3/32), master nid006168 port 9999, GPUs 8, CUDA: True
26: Launching on nid006273 (26/32), master nid006168 port 9999, GPUs 8, CUDA: True
13: Launching on nid006260 (13/32), master nid006168 port 9999, GPUs 8, CUDA: True
14: Launching on nid006261 (14/32), master nid006168 port 9999, GPUs 8, CUDA: True
18: Launching on nid006265 (18/32), master nid006168 port 9999, GPUs 8, CUDA: True
21: Launching on nid006268 (21/32), master nid006168 port 9999, GPUs 8, CUDA: True
30: Launching on nid006277 (30/32), master nid006168 port 9999, GPUs 8, CUDA: True
17: Launching on nid006264 (17/32), master nid006168 port 9999, GPUs 8, CUDA: True
12: Launching on nid006259 (12/32), master nid006168 port 9999, GPUs 8, CUDA: True
29: Launching on nid006276 (29/32), master nid006168 port 9999, GPUs 8, CUDA: True
22: Launching on nid006269 (22/32), master nid006168 port 9999, GPUs 8, CUDA: True
10: Launching on nid006257 (10/32), master nid006168 port 9999, GPUs 8, CUDA: True
2: Launching on nid006170 (2/32), master nid006168 port 9999, GPUs 8, CUDA: True
19: Launching on nid006266 (19/32), master nid006168 port 9999, GPUs 8, CUDA: True
24: Launching on nid006271 (24/32), master nid006168 port 9999, GPUs 8, CUDA: True
1: Launching on nid006169 (1/32), master nid006168 port 9999, GPUs 8, CUDA: True
25: Launching on nid006272 (25/32), master nid006168 port 9999, GPUs 8, CUDA: True
15: Launching on nid006262 (15/32), master nid006168 port 9999, GPUs 8, CUDA: True
5: Launching on nid006252 (5/32), master nid006168 port 9999, GPUs 8, CUDA: True
28: Launching on nid006275 (28/32), master nid006168 port 9999, GPUs 8, CUDA: True
20: Launching on nid006267 (20/32), master nid006168 port 9999, GPUs 8, CUDA: True
0: Launching on nid006168 (0/32), master nid006168 port 9999, GPUs 8, CUDA: True
0: using world size: 256, data-parallel-size: 256, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
0: accumulate and all-reduce 
gradients in fp32 for bfloat16 data type. 0: using torch.bfloat16 for parameters ... 0: ------------------------ arguments ------------------------ 0: abort_on_unmet_fused_kernel_constraints ......... False 0: accumulate_allreduce_grads_in_fp32 .............. True 0: adam_beta1 ...................................... 0.9 0: adam_beta2 ...................................... 0.999 0: adam_eps ........................................ 1e-08 0: adlr_autoresume ................................. False 0: adlr_autoresume_interval ........................ 1000 0: apply_query_key_layer_scaling ................... True 0: apply_residual_connection_post_layernorm ........ False 0: attention_dropout ............................... 0.1 0: attention_softmax_in_fp32 ....................... False 0: bert_binary_head ................................ True 0: bert_load ....................................... None 0: bf16 ............................................ True 0: bias_dropout_fusion ............................. True 0: bias_gelu_fusion ................................ True 0: biencoder_projection_dim ........................ 0 0: biencoder_shared_query_context_model ............ False 0: block_data_path ................................. None 0: checkpoint_activations .......................... False 0: checkpoint_in_cpu ............................... False 0: checkpoint_num_layers ........................... 1 0: clip_grad ....................................... 1.0 0: codecarbon_dir .................................. None 0: consumed_train_samples .......................... 0 0: consumed_train_tokens ........................... 0 0: consumed_valid_samples .......................... 0 0: contigious_checkpointing ........................ False 0: cpu_optimizer ................................... False 0: cpu_torch_adam .................................. False 0: curriculum_learning ............................. False 0: data_impl ....................................... 
mmap 0: data_parallel_size .............................. 256 0: data_path ....................................... ['/scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document'] 0: dataloader_type ................................. single 0: DDP_impl ........................................ local 0: decoder_seq_length .............................. None 0: deepscale ....................................... False 0: deepscale_config ................................ None 0: deepspeed ....................................... True 0: deepspeed_activation_checkpointing .............. False 0: deepspeed_config ................................ ds_configs/2068238.json 0: deepspeed_mpi ................................... False 0: distribute_checkpointed_activations ............. False 0: distributed_backend ............................. nccl 0: embed_layernorm ................................. False 0: embedding_path .................................. None 0: encoder_seq_length .............................. 2048 0: eod_mask_loss ................................... False 0: eval_interval ................................... 1000 0: eval_iters ...................................... 1 0: eval_only ....................................... None 0: evidence_data_path .............................. None 0: exit_duration_in_mins ........................... None 0: exit_interval ................................... None 0: ffn_hidden_size ................................. 7168 0: finetune ........................................ False 0: fp16 ............................................ False 0: fp16_lm_cross_entropy ........................... False 0: fp32_residual_connection ........................ False 0: gigaflos_no_embeds .............................. 0 0: global_batch_size ............................... 256 0: glu_activation .................................. None 0: hidden_dropout .................................. 
0.1 0: hidden_size ..................................... 1792 0: hysteresis ...................................... 2 0: ict_head_size ................................... None 0: ict_load ........................................ None 0: img_dim ......................................... 224 0: indexer_batch_size .............................. 128 0: indexer_log_interval ............................ 1000 0: inference ....................................... False 0: init_method_std ................................. 0.02 0: init_method_xavier_uniform ...................... False 0: initial_loss_scale .............................. 4294967296 0: kill_switch_path ................................ kill-switch-1b1long 0: kv_channels ..................................... 128 0: layer_norm_fusion ............................... True 0: layernorm_epsilon ............................... 1e-05 0: lazy_mpu_init ................................... None 0: load ............................................ checkpoints_1b1long 0: local_rank ...................................... None 0: log_batch_size_to_tensorboard ................... True 0: log_interval .................................... 10 0: log_learning_rate_to_tensorboard ................ True 0: log_level ....................................... None 0: log_level_replica ............................... None 0: log_loss_scale_to_tensorboard ................... True 0: log_num_zeros_in_grad ........................... False 0: log_params_norm ................................. False 0: log_path ........................................ None 0: log_timers_to_tensorboard ....................... True 0: log_validation_ppl_to_tensorboard ............... True 0: loss_on_targets_only ............................ False 0: loss_scale ...................................... None 0: loss_scale_window ............................... 1000 0: lr .............................................. 
0.0002 0: lr_decay_iters .................................. None 0: lr_decay_samples ................................ 44416143 0: lr_decay_style .................................. cosine 0: lr_decay_tokens ................................. None 0: lr_warmup_fraction .............................. None 0: lr_warmup_iters ................................. 0 0: lr_warmup_samples ............................... 444161 0: make_vocab_size_divisible_by .................... 128 0: mask_prob ....................................... 0.15 0: masked_softmax_fusion ........................... True 0: max_position_embeddings ......................... 2048 0: mean_noise_span_length .......................... None 0: memory_centric_tiled_linear ..................... False 0: merge_file ...................................... gpt2/merges.txt 0: micro_batch_size ................................ 1 0: min_loss_scale .................................. 1.0 0: min_lr .......................................... 2e-05 0: mmap_warmup ..................................... False 0: no_load_optim ................................... None 0: no_load_rng ..................................... None 0: no_save_optim ................................... None 0: no_save_rng ..................................... None 0: noise_density ................................... None 0: num_attention_heads ............................. 14 0: num_channels .................................... 3 0: num_classes ..................................... 1000 0: num_layers ...................................... 26 0: num_layers_per_virtual_pipeline_stage ........... None 0: num_workers ..................................... 2 0: onnx_safe ....................................... None 0: openai_gelu ..................................... False 0: optimizer ....................................... adam 0: optimizer_fusion ................................ True 0: override_lr_scheduler ........................... 
False 0: pad_vocab_size_to ............................... None 0: params_dtype .................................... torch.bfloat16 0: partition_activations ........................... False 0: patch_dim ....................................... 16 0: pipeline_model_parallel_size .................... 1 0: position_embedding_type ......................... PositionEmbeddingType.absolute 0: pp_partition_method ............................. None 0: profile_backward ................................ False 0: query_in_block_prob ............................. 0.1 0: rampup_batch_size ............................... None 0: rank ............................................ 0 0: remote_device ................................... none 0: reset_attention_mask ............................ False 0: reset_position_ids .............................. False 0: retriever_report_topk_accuracies ................ [] 0: retriever_score_scaling ......................... False 0: retriever_seq_length ............................ 256 0: reweight_loss_based_on_position_frequency ....... False 0: sample_rate ..................................... 1.0 0: save ............................................ checkpoints_1b1long 0: save_interval ................................... 1000 0: scatter_gather_tensors_in_pipeline .............. True 0: scattered_embeddings ............................ False 0: seed ............................................ 1234 0: seq_length ...................................... 2048 0: sgd_momentum .................................... 0.9 0: short_seq_prob .................................. 0.1 0: skip_train_iteration_range ...................... None 0: split ........................................... 949,50,1 0: split_transformers .............................. False 0: sync_tp_duplicated_parameters ................... False 0: synchronize_each_layer .......................... False 0: tensor_model_parallel_size ...................... 
1 0: tensorboard_dir ................................. tensorboard_1b1long 0: tensorboard_log_interval ........................ 1 0: tensorboard_queue_size .......................... 5 0: test_weighted_split_names ....................... None 0: test_weighted_split_paths ....................... None 0: test_weighted_split_paths_path .................. None 0: test_weighted_split_splits ...................... None 0: test_weighted_split_weights ..................... None 0: tile_factor ..................................... 1 0: titles_data_path ................................ None 0: tokenizer_name_or_path .......................... None 0: tokenizer_type .................................. GPT2BPETokenizer 0: train_iters ..................................... None 0: train_samples ................................... 44416143 0: train_tokens .................................... None 0: train_weighted_split_paths ...................... None 0: train_weighted_split_paths_path ................. None 0: universal_checkpoint ............................ False 0: use_bnb_optimizer ............................... False 0: use_checkpoint_lr_scheduler ..................... False 0: use_contiguous_buffers_in_ddp ................... True 0: use_cpu_initialization .......................... None 0: use_one_sent_docs ............................... False 0: use_pin_memory .................................. False 0: valid_num_workers ............................... 2 0: valid_weighted_split_names ...................... None 0: valid_weighted_split_paths ...................... None 0: valid_weighted_split_paths_path ................. None 0: valid_weighted_split_splits ..................... None 0: valid_weighted_split_weights .................... None 0: virtual_pipeline_model_parallel_size ............ None 0: vocab_extra_ids ................................. 0 0: vocab_file ...................................... 
gpt2/vocab.json 0: weight_decay .................................... 0.1 0: world_size ...................................... 256 0: zero_allgather_bucket_size ...................... 0.0 0: zero_contigious_gradients ....................... False 0: zero_reduce_bucket_size ......................... 0.0 0: zero_reduce_scatter ............................. False 0: zero_stage ...................................... 0 0: -------------------- end of arguments --------------------- 0: setting number of micro-batches to constant 1 0: > building GPT2BPETokenizer tokenizer ... 0: > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304) 0: DeepSpeed general environment info: 0: torch install path ............... ['/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch'] 0: torch version .................... 1.13.0+rocm5.2 0: torch cuda version ............... None 0: torch hip version ................ 5.2.21151-afdc89f8 0: nvcc version ..................... None 0: deepspeed install path ........... ['/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed'] 0: deepspeed info ................... 0.7.5, unknown, unknown 0: deepspeed wheel compiled w. ...... torch 1.13, hip 5.1 0: **** Git info for Megatron: git_hash=unknown git_branch=unknown **** 0: > initializing torch distributed ... 0: [2022-11-25 17:30:49,780] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl 31: > setting tensorboard ... 0: > initializing tensor model parallel with size 1 0: > initializing pipeline model parallel with size 1 0: > setting random seeds to 1234 ... 0: > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234 0: > compiling dataset index builder ... 
0: make: Entering directory '/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/data' 0: make: Nothing to be done for 'default'. 0: make: Leaving directory '/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/data' 0: >>> done with dataset index builder. Compilation time: 0.087 seconds 0: WARNING: constraints for invoking optimized fused softmax kernel are not met. We default back to unfused kernel invocations. 0: > compiling and loading fused kernels ... 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax.cpp -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax_hip.cpp [skipped, already hipified] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax_hip.h [skipped, already hipified] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/compat.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/compat.h [skipped, no changes] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/type_shim.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/type_shim.h [skipped, no changes] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax_cuda.cu -> 
/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax_hip.hip [skipped, already hipified] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/type_shim.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/type_shim.h [skipped, no changes] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/compat.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/compat.h [skipped, no changes] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax_hip.h [skipped, already hipified] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax_hip.h [skipped, already hipified] 0: Total number of unsupported CUDA function calls: 0 0: 0: 0: Total number of replaced kernel launches: 87 0: ninja: no work to do. 
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax.cpp -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax_hip.cpp [skipped, already hipified] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax_cuda.cu -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax_hip.hip [skipped, already hipified] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/type_shim.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/type_shim.h [skipped, no changes] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/compat.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/compat.h [skipped, no changes] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax_hip.h [skipped, already hipified] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax_hip.h [skipped, already hipified] 0: Total number of unsupported CUDA function calls: 0 0: 0: 0: Total number of replaced kernel launches: 63 0: ninja: no work to do. 
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/layer_norm_cuda.cpp -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/layer_norm_cuda.cpp [skipped, no changes] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/layer_norm_cuda_kernel.cu -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/layer_norm_hip_kernel.hip [skipped, already hipified] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/type_shim.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/type_shim.h [skipped, no changes] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/compat.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/compat.h [skipped, no changes] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax_hip.h [skipped, already hipified] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax_hip.h [skipped, already hipified] 0: Total number of unsupported CUDA function calls: 0 0: 0: 0: Total number of replaced kernel launches: 67 0: [1/1] c++ layer_norm_cuda.o layer_norm_hip_kernel.cuda.o -shared 
-L/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/lib -lc10 -lc10_hip -ltorch_cpu -ltorch_hip -ltorch -ltorch_python -L/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib -lamdhip64 -o fused_mix_prec_layer_norm_cuda.so 0: >>> done with compiling and loading fused kernels. Compilation time: 22.271 seconds 0: time to initialize megatron (seconds): 78.252 0: [after megatron is initialized] datetime: 2022-11-25 17:31:26 0: building GPT model ... 0: [2022-11-25 17:31:26,317] [INFO] [utils.py:827:see_memory_usage] Before Building Model 0: [2022-11-25 17:31:26,318] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB 0: [2022-11-25 17:31:26,318] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 28.28 GB, percent = 5.6% 0: SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None 0: Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=1, model=0): 1, ProcessCoord(pipe=0, data=2, model=0): 2, ProcessCoord(pipe=0, data=3, model=0): 3, ProcessCoord(pipe=0, data=4, model=0): 4, ProcessCoord(pipe=0, data=5, model=0): 5, ProcessCoord(pipe=0, data=6, model=0): 6, ProcessCoord(pipe=0, data=7, model=0): 7, ProcessCoord(pipe=0, data=8, model=0): 8, ProcessCoord(pipe=0, data=9, model=0): 9, ProcessCoord(pipe=0, data=10, model=0): 10, ProcessCoord(pipe=0, data=11, model=0): 11, ProcessCoord(pipe=0, data=12, model=0): 12, ProcessCoord(pipe=0, data=13, model=0): 13, ProcessCoord(pipe=0, data=14, model=0): 14, ProcessCoord(pipe=0, data=15, model=0): 15, ProcessCoord(pipe=0, data=16, model=0): 16, ProcessCoord(pipe=0, data=17, model=0): 17, ProcessCoord(pipe=0, data=18, model=0): 18, ProcessCoord(pipe=0, data=19, model=0): 19, ProcessCoord(pipe=0, data=20, model=0): 20, ProcessCoord(pipe=0, data=21, model=0): 21, ProcessCoord(pipe=0, data=22, model=0): 22, ProcessCoord(pi 0: pe=0, data=23, model=0): 23, ProcessCoord(pipe=0, data=24, 
model=0): 24, ProcessCoord(pipe=0, data=25, model=0): 25, ProcessCoord(pipe=0, data=26, model=0): 26, ProcessCoord(pipe=0, data=27, model=0): 27, ProcessCoord(pipe=0, data=28, model=0): 28, ProcessCoord(pipe=0, data=29, model=0): 29, ProcessCoord(pipe=0, data=30, model=0): 30, ProcessCoord(pipe=0, data=31, model=0): 31, ProcessCoord(pipe=0, data=32, model=0): 32, ProcessCoord(pipe=0, data=33, model=0): 33, ProcessCoord(pipe=0, data=34, model=0): 34, ProcessCoord(pipe=0, data=35, model=0): 35, ProcessCoord(pipe=0, data=36, model=0): 36, ProcessCoord(pipe=0, data=37, model=0): 37, ProcessCoord(pipe=0, data=38, model=0): 38, ProcessCoord(pipe=0, data=39, model=0): 39, ProcessCoord(pipe=0, data=40, model=0): 40, ProcessCoord(pipe=0, data=41, model=0): 41, ProcessCoord(pipe=0, data=42, model=0): 42, ProcessCoord(pipe=0, data=43, model=0): 43, ProcessCoord(pipe=0, data=44, model=0): 44, ProcessCoord(pipe=0, data=45, model=0): 45, ProcessCoord(pipe=0, data=4 0: 6, model=0): 46, ProcessCoord(pipe=0, data=47, model=0): 47, ProcessCoord(pipe=0, data=48, model=0): 48, ProcessCoord(pipe=0, data=49, model=0): 49, ProcessCoord(pipe=0, data=50, model=0): 50, ProcessCoord(pipe=0, data=51, model=0): 51, ProcessCoord(pipe=0, data=52, model=0): 52, ProcessCoord(pipe=0, data=53, model=0): 53, ProcessCoord(pipe=0, data=54, model=0): 54, ProcessCoord(pipe=0, data=55, model=0): 55, ProcessCoord(pipe=0, data=56, model=0): 56, ProcessCoord(pipe=0, data=57, model=0): 57, ProcessCoord(pipe=0, data=58, model=0): 58, ProcessCoord(pipe=0, data=59, model=0): 59, ProcessCoord(pipe=0, data=60, model=0): 60, ProcessCoord(pipe=0, data=61, model=0): 61, ProcessCoord(pipe=0, data=62, model=0): 62, ProcessCoord(pipe=0, data=63, model=0): 63, ProcessCoord(pipe=0, data=64, model=0): 64, ProcessCoord(pipe=0, data=65, model=0): 65, ProcessCoord(pipe=0, data=66, model=0): 66, ProcessCoord(pipe=0, data=67, model=0): 67, ProcessCoord(pipe=0, data=68, model=0): 68, ProcessCoord(pipe=0, data=69, model=0): 0: 
69, ProcessCoord(pipe=0, data=70, model=0): 70, ProcessCoord(pipe=0, data=71, model=0): 71, ProcessCoord(pipe=0, data=72, model=0): 72, ProcessCoord(pipe=0, data=73, model=0): 73, ProcessCoord(pipe=0, data=74, model=0): 74, ProcessCoord(pipe=0, data=75, model=0): 75, ProcessCoord(pipe=0, data=76, model=0): 76, ProcessCoord(pipe=0, data=77, model=0): 77, ProcessCoord(pipe=0, data=78, model=0): 78, ProcessCoord(pipe=0, data=79, model=0): 79, ProcessCoord(pipe=0, data=80, model=0): 80, ProcessCoord(pipe=0, data=81, model=0): 81, ProcessCoord(pipe=0, data=82, model=0): 82, ProcessCoord(pipe=0, data=83, model=0): 83, ProcessCoord(pipe=0, data=84, model=0): 84, ProcessCoord(pipe=0, data=85, model=0): 85, ProcessCoord(pipe=0, data=86, model=0): 86, ProcessCoord(pipe=0, data=87, model=0): 87, ProcessCoord(pipe=0, data=88, model=0): 88, ProcessCoord(pipe=0, data=89, model=0): 89, ProcessCoord(pipe=0, data=90, model=0): 90, ProcessCoord(pipe=0, data=91, model=0): 91, ProcessCoord(pipe=0, data=92, model=0): 92, Process 0: Coord(pipe=0, data=93, model=0): 93, ProcessCoord(pipe=0, data=94, model=0): 94, ProcessCoord(pipe=0, data=95, model=0): 95, ProcessCoord(pipe=0, data=96, model=0): 96, ProcessCoord(pipe=0, data=97, model=0): 97, ProcessCoord(pipe=0, data=98, model=0): 98, ProcessCoord(pipe=0, data=99, model=0): 99, ProcessCoord(pipe=0, data=100, model=0): 100, ProcessCoord(pipe=0, data=101, model=0): 101, ProcessCoord(pipe=0, data=102, model=0): 102, ProcessCoord(pipe=0, data=103, model=0): 103, ProcessCoord(pipe=0, data=104, model=0): 104, ProcessCoord(pipe=0, data=105, model=0): 105, ProcessCoord(pipe=0, data=106, model=0): 106, ProcessCoord(pipe=0, data=107, model=0): 107, ProcessCoord(pipe=0, data=108, model=0): 108, ProcessCoord(pipe=0, data=109, model=0): 109, ProcessCoord(pipe=0, data=110, model=0): 110, ProcessCoord(pipe=0, data=111, model=0): 111, ProcessCoord(pipe=0, data=112, model=0): 112, ProcessCoord(pipe=0, data=113, model=0): 113, ProcessCoord(pipe=0, 
data=114, model=0): 114, ProcessCoord(pipe=0, data=115, mo 0: del=0): 115, ProcessCoord(pipe=0, data=116, model=0): 116, ProcessCoord(pipe=0, data=117, model=0): 117, ProcessCoord(pipe=0, data=118, model=0): 118, ProcessCoord(pipe=0, data=119, model=0): 119, ProcessCoord(pipe=0, data=120, model=0): 120, ProcessCoord(pipe=0, data=121, model=0): 121, ProcessCoord(pipe=0, data=122, model=0): 122, ProcessCoord(pipe=0, data=123, model=0): 123, ProcessCoord(pipe=0, data=124, model=0): 124, ProcessCoord(pipe=0, data=125, model=0): 125, ProcessCoord(pipe=0, data=126, model=0): 126, ProcessCoord(pipe=0, data=127, model=0): 127, ProcessCoord(pipe=0, data=128, model=0): 128, ProcessCoord(pipe=0, data=129, model=0): 129, ProcessCoord(pipe=0, data=130, model=0): 130, ProcessCoord(pipe=0, data=131, model=0): 131, ProcessCoord(pipe=0, data=132, model=0): 132, ProcessCoord(pipe=0, data=133, model=0): 133, ProcessCoord(pipe=0, data=134, model=0): 134, ProcessCoord(pipe=0, data=135, model=0): 135, ProcessCoord(pipe=0, data=136, model=0): 136, ProcessCoord(pipe=0, data=137, model=0): 137, 0: ProcessCoord(pipe=0, data=138, model=0): 138, ProcessCoord(pipe=0, data=139, model=0): 139, ProcessCoord(pipe=0, data=140, model=0): 140, ProcessCoord(pipe=0, data=141, model=0): 141, ProcessCoord(pipe=0, data=142, model=0): 142, ProcessCoord(pipe=0, data=143, model=0): 143, ProcessCoord(pipe=0, data=144, model=0): 144, ProcessCoord(pipe=0, data=145, model=0): 145, ProcessCoord(pipe=0, data=146, model=0): 146, ProcessCoord(pipe=0, data=147, model=0): 147, ProcessCoord(pipe=0, data=148, model=0): 148, ProcessCoord(pipe=0, data=149, model=0): 149, ProcessCoord(pipe=0, data=150, model=0): 150, ProcessCoord(pipe=0, data=151, model=0): 151, ProcessCoord(pipe=0, data=152, model=0): 152, ProcessCoord(pipe=0, data=153, model=0): 153, ProcessCoord(pipe=0, data=154, model=0): 154, ProcessCoord(pipe=0, data=155, model=0): 155, ProcessCoord(pipe=0, data=156, model=0): 156, ProcessCoord(pipe=0, data=157, 
model=0): 157, ProcessCoord(pipe=0, data=158, model=0): 158, ProcessCoord(pipe=0, data=159, model=0): 159, ProcessCoor 0: d(pipe=0, data=160, model=0): 160, ProcessCoord(pipe=0, data=161, model=0): 161, ProcessCoord(pipe=0, data=162, model=0): 162, ProcessCoord(pipe=0, data=163, model=0): 163, ProcessCoord(pipe=0, data=164, model=0): 164, ProcessCoord(pipe=0, data=165, model=0): 165, ProcessCoord(pipe=0, data=166, model=0): 166, ProcessCoord(pipe=0, data=167, model=0): 167, ProcessCoord(pipe=0, data=168, model=0): 168, ProcessCoord(pipe=0, data=169, model=0): 169, ProcessCoord(pipe=0, data=170, model=0): 170, ProcessCoord(pipe=0, data=171, model=0): 171, ProcessCoord(pipe=0, data=172, model=0): 172, ProcessCoord(pipe=0, data=173, model=0): 173, ProcessCoord(pipe=0, data=174, model=0): 174, ProcessCoord(pipe=0, data=175, model=0): 175, ProcessCoord(pipe=0, data=176, model=0): 176, ProcessCoord(pipe=0, data=177, model=0): 177, ProcessCoord(pipe=0, data=178, model=0): 178, ProcessCoord(pipe=0, data=179, model=0): 179, ProcessCoord(pipe=0, data=180, model=0): 180, ProcessCoord(pipe=0, data=181, model=0): 181, ProcessCoord(pipe=0, da 0: ta=182, model=0): 182, ProcessCoord(pipe=0, data=183, model=0): 183, ProcessCoord(pipe=0, data=184, model=0): 184, ProcessCoord(pipe=0, data=185, model=0): 185, ProcessCoord(pipe=0, data=186, model=0): 186, ProcessCoord(pipe=0, data=187, model=0): 187, ProcessCoord(pipe=0, data=188, model=0): 188, ProcessCoord(pipe=0, data=189, model=0): 189, ProcessCoord(pipe=0, data=190, model=0): 190, ProcessCoord(pipe=0, data=191, model=0): 191, ProcessCoord(pipe=0, data=192, model=0): 192, ProcessCoord(pipe=0, data=193, model=0): 193, ProcessCoord(pipe=0, data=194, model=0): 194, ProcessCoord(pipe=0, data=195, model=0): 195, ProcessCoord(pipe=0, data=196, model=0): 196, ProcessCoord(pipe=0, data=197, model=0): 197, ProcessCoord(pipe=0, data=198, model=0): 198, ProcessCoord(pipe=0, data=199, model=0): 199, ProcessCoord(pipe=0, data=200, model=0): 
200, ProcessCoord(pipe=0, data=201, model=0): 201, ProcessCoord(pipe=0, data=202, model=0): 202, ProcessCoord(pipe=0, data=203, model=0): 203, ProcessCoord(pipe=0, data=204, mode 0: l=0): 204, ProcessCoord(pipe=0, data=205, model=0): 205, ProcessCoord(pipe=0, data=206, model=0): 206, ProcessCoord(pipe=0, data=207, model=0): 207, ProcessCoord(pipe=0, data=208, model=0): 208, ProcessCoord(pipe=0, data=209, model=0): 209, ProcessCoord(pipe=0, data=210, model=0): 210, ProcessCoord(pipe=0, data=211, model=0): 211, ProcessCoord(pipe=0, data=212, model=0): 212, ProcessCoord(pipe=0, data=213, model=0): 213, ProcessCoord(pipe=0, data=214, model=0): 214, ProcessCoord(pipe=0, data=215, model=0): 215, ProcessCoord(pipe=0, data=216, model=0): 216, ProcessCoord(pipe=0, data=217, model=0): 217, ProcessCoord(pipe=0, data=218, model=0): 218, ProcessCoord(pipe=0, data=219, model=0): 219, ProcessCoord(pipe=0, data=220, model=0): 220, ProcessCoord(pipe=0, data=221, model=0): 221, ProcessCoord(pipe=0, data=222, model=0): 222, ProcessCoord(pipe=0, data=223, model=0): 223, ProcessCoord(pipe=0, data=224, model=0): 224, ProcessCoord(pipe=0, data=225, model=0): 225, ProcessCoord(pipe=0, data=226, model=0): 226, P 0: rocessCoord(pipe=0, data=227, model=0): 227, ProcessCoord(pipe=0, data=228, model=0): 228, ProcessCoord(pipe=0, data=229, model=0): 229, ProcessCoord(pipe=0, data=230, model=0): 230, ProcessCoord(pipe=0, data=231, model=0): 231, ProcessCoord(pipe=0, data=232, model=0): 232, ProcessCoord(pipe=0, data=233, model=0): 233, ProcessCoord(pipe=0, data=234, model=0): 234, ProcessCoord(pipe=0, data=235, model=0): 235, ProcessCoord(pipe=0, data=236, model=0): 236, ProcessCoord(pipe=0, data=237, model=0): 237, ProcessCoord(pipe=0, data=238, model=0): 238, ProcessCoord(pipe=0, data=239, model=0): 239, ProcessCoord(pipe=0, data=240, model=0): 240, ProcessCoord(pipe=0, data=241, model=0): 241, ProcessCoord(pipe=0, data=242, model=0): 242, ProcessCoord(pipe=0, data=243, model=0): 243, 
ProcessCoord(pipe=0, data=244, model=0): 244, ProcessCoord(pipe=0, data=245, model=0): 245, ProcessCoord(pipe=0, data=246, model=0): 246, ProcessCoord(pipe=0, data=247, model=0): 247, ProcessCoord(pipe=0, data=248, model=0): 248, ProcessCoord( 0: pipe=0, data=249, model=0): 249, ProcessCoord(pipe=0, data=250, model=0): 250, ProcessCoord(pipe=0, data=251, model=0): 251, ProcessCoord(pipe=0, data=252, model=0): 252, ProcessCoord(pipe=0, data=253, model=0): 253, ProcessCoord(pipe=0, data=254, model=0): 254, ProcessCoord(pipe=0, data=255, model=0): 255} 0: [2022-11-25 17:31:34,782] [INFO] [module.py:366:_partition_layers] Partitioning pipeline stages with method type:transformer 0: stage=0 layers=33 0: 0: _to_float16 0: 1: EmbeddingPipe 0: 2: 0: 3: ParallelTransformerLayerPipe 0: 4: ParallelTransformerLayerPipe 0: 5: ParallelTransformerLayerPipe 0: 6: ParallelTransformerLayerPipe 0: 7: ParallelTransformerLayerPipe 0: 8: ParallelTransformerLayerPipe 0: 9: ParallelTransformerLayerPipe 0: 10: ParallelTransformerLayerPipe 0: 11: ParallelTransformerLayerPipe 0: 12: ParallelTransformerLayerPipe 0: 13: ParallelTransformerLayerPipe 0: 14: ParallelTransformerLayerPipe 0: 15: ParallelTransformerLayerPipe 0: 16: ParallelTransformerLayerPipe 0: 17: ParallelTransformerLayerPipe 0: 18: ParallelTransformerLayerPipe 0: 19: ParallelTransformerLayerPipe 0: 20: ParallelTransformerLayerPipe 0: 21: ParallelTransformerLayerPipe 0: 22: ParallelTransformerLayerPipe 0: 23: ParallelTransformerLayerPipe 0: 24: ParallelTransformerLayerPipe 0: 25: ParallelTransformerLayerPipe 0: 26: ParallelTransformerLayerPipe 0: 27: ParallelTransformerLayerPipe 0: 28: ParallelTransformerLayerPipe 0: 29: undo 0: 30: MixedFusedLayerNorm 0: 31: EmbeddingPipe 0: 32: float16_to_fp32 0: loss: CrossEntropy 0: [2022-11-25 17:31:34,966] [INFO] [utils.py:827:see_memory_usage] After Building Model 0: [2022-11-25 17:31:34,966] [INFO] [utils.py:828:see_memory_usage] MA 2.05 GB Max_MA 2.05 GB CA 2.19 GB Max_CA 2 GB 0: 
[2022-11-25 17:31:34,967] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 28.35 GB, percent = 5.6% 0: setting training iterations to 173500 0: > learning rate decay style: cosine 0: DeepSpeed is enabled. 0: [2022-11-25 17:31:34,969] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.7.5, git-hash=unknown, git-branch=unknown 0: [2022-11-25 17:31:56,214] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False 0: [2022-11-25 17:31:56,214] [INFO] [logging.py:68:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer 0: [2022-11-25 17:31:56,214] [INFO] [logging.py:68:log_dist] [Rank 0] Using client Optimizer as basic optimizer 0: [2022-11-25 17:31:56,226] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam 0: [2022-11-25 17:31:56,226] [INFO] [logging.py:68:log_dist] [Rank 0] Creating BF16 optimizer 0: [2022-11-25 17:31:56,266] [INFO] [utils.py:827:see_memory_usage] begin bf16_optimizer 0: [2022-11-25 17:31:56,267] [INFO] [utils.py:828:see_memory_usage] MA 2.04 GB Max_MA 2.06 GB CA 2.19 GB Max_CA 2 GB 0: [2022-11-25 17:31:56,267] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 29.06 GB, percent = 5.8% 1: ninja: no work to do. 
1: Time to load utils op: 0.18664169311523438 seconds 0: Time to load utils op: 0.21123909950256348 seconds 5: Time to load utils op: 0.20827341079711914 seconds 3: Time to load utils op: 0.2086935043334961 seconds 2: Time to load utils op: 0.20888948440551758 seconds 8: Time to load utils op: 0.20869970321655273 seconds 10: Time to load utils op: 0.2092437744140625 seconds 11: Time to load utils op: 0.2084503173828125 seconds 13: Time to load utils op: 0.2090916633605957 seconds 23: Time to load utils op: 0.20903253555297852 seconds 18: Time to load utils op: 0.21053338050842285 seconds 0: [2022-11-25 17:31:56,511] [INFO] [utils.py:827:see_memory_usage] before initializing group 0 0: [2022-11-25 17:31:56,512] [INFO] [utils.py:828:see_memory_usage] MA 2.04 GB Max_MA 2.04 GB CA 2.19 GB Max_CA 2 GB 0: [2022-11-25 17:31:56,512] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 29.06 GB, percent = 5.8% 31: Time to load utils op: 0.20872807502746582 seconds 0: [2022-11-25 17:31:56,579] [INFO] [utils.py:827:see_memory_usage] after initializing group 0 0: [2022-11-25 17:31:56,580] [INFO] [utils.py:828:see_memory_usage] MA 4.22 GB Max_MA 4.22 GB CA 5.44 GB Max_CA 5 GB 0: [2022-11-25 17:31:56,580] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 29.06 GB, percent = 5.8% 0: [2022-11-25 17:31:56,611] [INFO] [utils.py:827:see_memory_usage] before initializing group 1 0: [2022-11-25 17:31:56,612] [INFO] [utils.py:828:see_memory_usage] MA 4.22 GB Max_MA 4.22 GB CA 5.44 GB Max_CA 5 GB 0: [2022-11-25 17:31:56,612] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 29.06 GB, percent = 5.8% 0: [2022-11-25 17:31:56,645] [INFO] [utils.py:827:see_memory_usage] after initializing group 1 0: [2022-11-25 17:31:56,645] [INFO] [utils.py:828:see_memory_usage] MA 6.14 GB Max_MA 6.14 GB CA 8.31 GB Max_CA 8 GB 0: [2022-11-25 17:31:56,645] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 29.06 GB, percent = 5.8% 0: [2022-11-25 
17:31:56,676] [INFO] [utils.py:827:see_memory_usage] before initializing group 2 0: [2022-11-25 17:31:56,677] [INFO] [utils.py:828:see_memory_usage] MA 6.14 GB Max_MA 6.14 GB CA 8.31 GB Max_CA 8 GB 0: [2022-11-25 17:31:56,677] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 29.06 GB, percent = 5.8% 0: ninja: no work to do. 0: Time to load utils op: 0.14670634269714355 seconds 30: Time to load utils op: 0.1103520393371582 seconds 30: Time to load utils op: 0.1103818416595459 secondsTime to load utils op: 0.1103825569152832 seconds 30: Time to load utils op: 0.11039566993713379 seconds 30: 30: Time to load utils op: 0.11039996147155762 secondsTime to load utils op: 0.11041474342346191 seconds 30: 30: Time to load utils op: 0.11042308807373047 seconds 30: Time to load utils op: 0.11041712760925293 seconds 29: Time to load utils op: 0.11178731918334961 seconds 29: Time to load utils op: 0.1117856502532959 secondsTime to load utils op: 0.11179089546203613 seconds 29: 29: Time to load utils op: 0.11179304122924805 secondsTime to load utils op: 0.11179590225219727 seconds 29: 29: Time to load utils op: 0.11181139945983887 secondsTime to load utils op: 0.11180281639099121 seconds 29: 29: Time to load utils op: 0.11181163787841797 seconds 0: [2022-11-25 17:31:56,714] [INFO] [utils.py:827:see_memory_usage] after initializing group 2 0: [2022-11-25 17:31:56,715] [INFO] [utils.py:828:see_memory_usage] MA 6.14 GB Max_MA 6.14 GB CA 8.31 GB Max_CA 8 GB 0: [2022-11-25 17:31:56,715] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 29.06 GB, percent = 5.8% 0: Time to load utils op: 0.0007715225219726562 seconds 1: Time to load utils op: 0.0005631446838378906 seconds 2: Time to load utils op: 0.00047326087951660156 seconds 3: Time to load utils op: 0.0009925365447998047 seconds 5: Time to load utils op: 0.00042819976806640625 seconds 8: Time to load utils op: 0.0004298686981201172 seconds 0: [2022-11-25 17:31:56,747] [INFO] 
[utils.py:827:see_memory_usage] before initialize_optimizer 0: [2022-11-25 17:31:56,747] [INFO] [utils.py:828:see_memory_usage] MA 6.14 GB Max_MA 6.14 GB CA 8.31 GB Max_CA 8 GB 0: [2022-11-25 17:31:56,748] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 29.06 GB, percent = 5.8% 10: Time to load utils op: 0.0009729862213134766 seconds 11: Time to load utils op: 0.0004570484161376953 seconds 13: Time to load utils op: 0.0004363059997558594 seconds 18: Time to load utils op: 0.000431060791015625 seconds 0: Time to load utils op: 0.20233368873596191 secondsTime to load utils op: 0.20255589485168457 secondsTime to load utils op: 0.20235371589660645 seconds 0: 0: 0: Time to load utils op: 0.2024376392364502 seconds 0: Time to load utils op: 0.2019796371459961 seconds 0: Time to load utils op: 0.2017834186553955 seconds 1: Time to load utils op: 0.202345609664917 seconds 1: Time to load utils op: 0.20297598838806152 seconds 1: Time to load utils op: 0.20188665390014648 seconds 1: Time to load utils op: 0.20177769660949707 seconds 1: Time to load utils op: 0.20224690437316895 seconds 1: Time to load utils op: 0.2029895782470703 secondsTime to load utils op: 0.20306634902954102 seconds 1: 2: Time to load utils op: 0.2030181884765625 seconds 2: Time to load utils op: 0.20214009284973145 seconds 2: Time to load utils op: 0.20323681831359863 seconds 2: Time to load utils op: 0.20219683647155762 seconds 2: Time to load utils op: 0.2023937702178955 seconds 23: Time to load utils op: 0.00048661231994628906 seconds 2: Time to load utils op: 0.2025454044342041 secondsTime to load utils op: 0.20253896713256836 seconds 2: 3: Time to load utils op: 0.20259785652160645 seconds 3: Time to load utils op: 0.20220327377319336 seconds 3: Time to load utils op: 0.20349764823913574 secondsTime to load utils op: 0.20294618606567383 secondsTime to load utils op: 0.20280098915100098 seconds 3: Time to load utils op: 0.20347285270690918 seconds 3: 3: 3: Time to load utils op: 
0.20309114456176758 seconds 5: Time to load utils op: 0.20322632789611816 seconds 5: Time to load utils op: 0.20339632034301758 seconds 5: Time to load utils op: 0.20362401008605957 seconds 5: Time to load utils op: 0.20364141464233398 secondsTime to load utils op: 0.2039637565612793 secondsTime to load utils op: 0.20410990715026855 secondsTime to load utils op: 0.20400571823120117 seconds 5: 5: 5: 8: Time to load utils op: 0.20360803604125977 seconds 8: Time to load utils op: 0.2029111385345459 seconds 8: Time to load utils op: 0.20369505882263184 seconds 8: Time to load utils op: 0.20307087898254395 seconds 8: Time to load utils op: 0.2031264305114746 secondsTime to load utils op: 0.2030797004699707 seconds 8: 8: Time to load utils op: 0.20384430885314941 seconds 10: Time to load utils op: 0.2029867172241211 seconds 10: Time to load utils op: 0.20323824882507324 seconds 11: Time to load utils op: 0.20299220085144043 seconds 11: Time to load utils op: 0.20351243019104004 seconds 11: Time to load utils op: 0.20363426208496094 secondsTime to load utils op: 0.20319676399230957 seconds 11: 11: Time to load utils op: 0.2033400535583496 secondsTime to load utils op: 0.20337390899658203 seconds 11: 11: Time to load utils op: 0.20374083518981934 seconds 10: Time to load utils op: 0.20246577262878418 seconds 13: Time to load utils op: 0.2021934986114502 seconds 13: Time to load utils op: 0.20288825035095215 secondsTime to load utils op: 0.20291972160339355 secondsTime to load utils op: 0.2032310962677002 seconds 13: 13: 13: Time to load utils op: 0.2033073902130127 secondsTime to load utils op: 0.20302081108093262 seconds 13: 13: Time to load utils op: 0.20231151580810547 seconds 10: Time to load utils op: 0.20205974578857422 seconds 10: Time to load utils op: 0.20206165313720703 seconds 10: Time to load utils op: 0.20196533203125 seconds 4: Time to load utils op: 0.21302008628845215 secondsTime to load utils op: 0.21302390098571777 secondsTime to load utils op: 
0.21301746368408203 secondsTime to load utils op: 0.21301746368408203 seconds 4: 4: 4: 4: Time to load utils op: 0.21303033828735352 secondsTime to load utils op: 0.21303009986877441 seconds 4: 4: Time to load utils op: 0.21299958229064941 seconds 4: Time to load utils op: 0.2130424976348877 seconds 10: Time to load utils op: 0.20263981819152832 seconds 6: Time to load utils op: 0.21189665794372559 seconds 6: Time to load utils op: 0.2119290828704834 seconds 6: Time to load utils op: 0.2117924690246582 seconds 6: Time to load utils op: 0.21195030212402344 seconds 6: Time to load utils op: 0.21194720268249512 seconds 6: Time to load utils op: 0.21195769309997559 seconds 6: Time to load utils op: 0.21196913719177246 secondsTime to load utils op: 0.21196651458740234 seconds 6: 29: Time to load utils op: 0.0008599758148193359 seconds 7: Time to load utils op: 0.21073317527770996 secondsTime to load utils op: 0.21073222160339355 seconds 7: 7: Time to load utils op: 0.2107546329498291 seconds 7: Time to load utils op: 0.21076107025146484 seconds 7: Time to load utils op: 0.21076679229736328 seconds 29: Time to load utils op: 0.0011398792266845703 seconds 7: Time to load utils op: 0.2107715606689453 secondsTime to load utils op: 0.21077609062194824 seconds 7: 7: Time to load utils op: 0.21078205108642578 seconds 29: Time to load utils op: 0.001291036605834961 seconds 29: Time to load utils op: 0.0012805461883544922 seconds 0: Time to load utils op: 0.00045871734619140625 seconds 29: Time to load utils op: 0.0013070106506347656 seconds 2: Time to load utils op: 0.0003581047058105469 seconds 29: Time to load utils op: 0.0012793540954589844 seconds 29: Time to load utils op: 0.0012378692626953125 seconds 0: Time to load utils op: 0.0003993511199951172 secondsTime to load utils op: 0.00042057037353515625 seconds 0: 29: Time to load utils op: 0.0013384819030761719 seconds 1: Time to load utils op: 0.0004069805145263672 secondsTime to load utils op: 0.0003921985626220703 
seconds 1: 2: Time to load utils op: 0.0003285408020019531 seconds 1: Time to load utils op: 0.00034356117248535156 seconds 0: Time to load utils op: 0.0003917217254638672 seconds 0: Time to load utils op: 0.0003962516784667969 seconds 31: Time to load utils op: 0.00044608116149902344 seconds 2: Time to load utils op: 0.0003368854522705078 seconds 1: Time to load utils op: 0.0003936290740966797 seconds 0: Time to load utils op: 0.0003840923309326172 seconds 1: Time to load utils op: 0.0003345012664794922 seconds 3: Time to load utils op: 0.0003616809844970703 seconds 1: Time to load utils op: 0.0003256797790527344 seconds 2: Time to load utils op: 0.00037479400634765625 seconds 3: Time to load utils op: 0.00033855438232421875 seconds 2: Time to load utils op: 0.00033855438232421875 seconds 18: Time to load utils op: 0.20336008071899414 seconds 2: Time to load utils op: 0.00032448768615722656 seconds 18: Time to load utils op: 0.2025437355041504 secondsTime to load utils op: 0.20258021354675293 seconds 18: 1: Time to load utils op: 0.000362396240234375 seconds 18: Time to load utils op: 0.20198678970336914 seconds 2: Time to load utils op: 0.0003273487091064453 seconds 3: Time to load utils op: 0.0003495216369628906 seconds 18: Time to load utils op: 0.20203804969787598 seconds 18: Time to load utils op: 0.20212531089782715 seconds 3: Time to load utils op: 0.0003402233123779297 seconds 3: Time to load utils op: 0.0003459453582763672 seconds 3: Time to load utils op: 0.0003516674041748047 seconds 3: Time to load utils op: 0.0003268718719482422 seconds 9: Time to load utils op: 0.21191716194152832 secondsTime to load utils op: 0.2119135856628418 secondsTime to load utils op: 0.21192073822021484 seconds 9: Time to load utils op: 0.2119276523590088 seconds 9: 9: 9: Time to load utils op: 0.21191620826721191 seconds 9: Time to load utils op: 0.21193337440490723 secondsTime to load utils op: 0.2119441032409668 seconds 9: 9: Time to load utils op: 0.21194124221801758 
seconds 18: Time to load utils op: 0.20186352729797363 seconds 5: Time to load utils op: 0.0003447532653808594 seconds 5: Time to load utils op: 0.0003437995910644531 seconds 5: Time to load utils op: 0.0003917217254638672 seconds 5: Time to load utils op: 0.0003674030303955078 seconds 12: Time to load utils op: 0.21107792854309082 seconds 12: Time to load utils op: 0.2110738754272461 seconds 12: Time to load utils op: 0.21108150482177734 seconds 12: Time to load utils op: 0.2110908031463623 seconds 12: Time to load utils op: 0.21109604835510254 secondsTime to load utils op: 0.21109867095947266 seconds 12: 12: Time to load utils op: 0.2111063003540039 secondsTime to load utils op: 0.21111249923706055 seconds 12: 5: Time to load utils op: 0.0003426074981689453 seconds 5: Time to load utils op: 0.0003650188446044922 seconds 30: Time to load utils op: 0.0008306503295898438 seconds 30: Time to load utils op: 0.0010154247283935547 seconds 30: Time to load utils op: 0.0011544227600097656 seconds 30: Time to load utils op: 0.0010747909545898438 seconds 30: Time to load utils op: 0.0011484622955322266 seconds 30: Time to load utils op: 0.0011789798736572266 seconds 30: Time to load utils op: 0.0011334419250488281 seconds 30: Time to load utils op: 0.00125885009765625 seconds 23: Time to load utils op: 0.20248103141784668 secondsTime to load utils op: 0.20253825187683105 seconds 23: 23: Time to load utils op: 0.20294785499572754 seconds 23: Time to load utils op: 0.20380592346191406 secondsTime to load utils op: 0.20270276069641113 seconds 23: 23: Time to load utils op: 0.20406699180603027 seconds 23: Time to load utils op: 0.20390748977661133 seconds 5: Time to load utils op: 0.00038933753967285156 seconds 8: Time to load utils op: 0.00035381317138671875 seconds 14: Time to load utils op: 0.21149516105651855 seconds 14: Time to load utils op: 0.21150517463684082 seconds 14: Time to load utils op: 0.21152114868164062 secondsTime to load utils op: 0.21152973175048828 seconds 
14: 8: Time to load utils op: 0.0003368854522705078 seconds 14: Time to load utils op: 0.21154212951660156 seconds 14: Time to load utils op: 0.2115495204925537 seconds 14: Time to load utils op: 0.21155428886413574 seconds 14: Time to load utils op: 0.2115616798400879 seconds 8: Time to load utils op: 0.0003654956817626953 seconds 8: Time to load utils op: 0.0003724098205566406 seconds 15: Time to load utils op: 0.21165919303894043 seconds 8: Time to load utils op: 0.0003592967987060547 secondsTime to load utils op: 0.0003495216369628906 seconds 8: 15: Time to load utils op: 0.21166753768920898 secondsTime to load utils op: 0.21166515350341797 secondsTime to load utils op: 0.21166133880615234 seconds 15: Time to load utils op: 0.21167230606079102 seconds 15: 15: 15: Time to load utils op: 0.21168994903564453 secondsTime to load utils op: 0.21167850494384766 seconds 15: 15: Time to load utils op: 0.21169519424438477 seconds 8: Time to load utils op: 0.00039076805114746094 seconds 16: Time to load utils op: 0.21048831939697266 seconds 16: Time to load utils op: 0.21049761772155762 seconds 16: Time to load utils op: 0.2104952335357666 seconds 16: Time to load utils op: 0.2105121612548828 seconds 16: Time to load utils op: 0.21052026748657227 seconds 16: Time to load utils op: 0.21051931381225586 seconds 16: Time to load utils op: 0.21053409576416016 seconds 16: Time to load utils op: 0.21053123474121094 seconds 17: Time to load utils op: 0.21106362342834473 seconds 17: Time to load utils op: 0.2110767364501953 seconds 17: Time to load utils op: 0.21108198165893555 seconds 17: Time to load utils op: 0.21109819412231445 seconds 17: Time to load utils op: 0.21110010147094727 seconds 17: Time to load utils op: 0.2111060619354248 seconds 17: Time to load utils op: 0.21111059188842773 seconds 17: Time to load utils op: 0.2111191749572754 seconds 10: Time to load utils op: 0.0003490447998046875 seconds 11: Time to load utils op: 0.0003535747528076172 seconds 11: Time to 
load utils op: 0.0003573894500732422 seconds 10: Time to load utils op: 0.00040721893310546875 seconds 11: Time to load utils op: 0.0003590583801269531 seconds 11: Time to load utils op: 0.00034689903259277344 seconds 11: Time to load utils op: 0.0003466606140136719 seconds 11: Time to load utils op: 0.00037741661071777344 seconds 11: Time to load utils op: 0.00036025047302246094 seconds 19: Time to load utils op: 0.21034598350524902 seconds 19: Time to load utils op: 0.20491647720336914 secondsTime to load utils op: 0.20569992065429688 seconds 19: 19: Time to load utils op: 0.2047581672668457 secondsTime to load utils op: 0.20529985427856445 seconds 19: 19: Time to load utils op: 0.2056427001953125 seconds 19: Time to load utils op: 0.21010875701904297 seconds 19: Time to load utils op: 0.2046968936920166 seconds 13: Time to load utils op: 0.00032591819763183594 seconds 13: Time to load utils op: 0.0004513263702392578 seconds 13: Time to load utils op: 0.0003764629364013672 seconds 10: Time to load utils op: 0.0002989768981933594 seconds 20: Time to load utils op: 0.2113513946533203 secondsTime to load utils op: 0.21134710311889648 secondsTime to load utils op: 0.21135210990905762 seconds 20: 20: 20: Time to load utils op: 0.21135568618774414 seconds 20: Time to load utils op: 0.21136140823364258 secondsTime to load utils op: 0.21137070655822754 secondsTime to load utils op: 0.21136474609375 seconds 20: Time to load utils op: 0.21135377883911133 seconds 20: 20: 13: Time to load utils op: 0.0003864765167236328 seconds 21: Time to load utils op: 0.21122264862060547 seconds 21: Time to load utils op: 0.2112288475036621 seconds 21: Time to load utils op: 0.21124529838562012 seconds 21: Time to load utils op: 0.21124911308288574 seconds 13: Time to load utils op: 0.0003802776336669922 seconds 21: Time to load utils op: 0.21126794815063477 secondsTime to load utils op: 0.211256742477417 seconds 21: 21: Time to load utils op: 0.21126341819763184 seconds 21: Time to load 
utils op: 0.21127748489379883 seconds 13: Time to load utils op: 0.0003867149353027344 seconds 13: Time to load utils op: 0.00042247772216796875 seconds 31: Time to load utils op: 0.20421957969665527 seconds 31: Time to load utils op: 0.20383858680725098 secondsTime to load utils op: 0.2032456398010254 seconds 31: 10: Time to load utils op: 0.00036978721618652344 seconds 31: Time to load utils op: 0.2039639949798584 secondsTime to load utils op: 0.20356202125549316 seconds 31: 31: Time to load utils op: 0.204329252243042 secondsTime to load utils op: 0.20348620414733887 seconds 31: 10: Time to load utils op: 0.0003807544708251953 seconds 22: Time to load utils op: 0.21105289459228516 seconds 22: Time to load utils op: 0.2110607624053955 seconds 22: Time to load utils op: 0.21112680435180664 seconds 22: Time to load utils op: 0.211137056350708 seconds 10: Time to load utils op: 0.0003743171691894531 seconds 22: Time to load utils op: 0.21114325523376465 seconds 22: Time to load utils op: 0.21115589141845703 secondsTime to load utils op: 0.2111530303955078 seconds 22: 22: Time to load utils op: 0.2111661434173584 seconds 18: Time to load utils op: 0.0003020763397216797 seconds 24: Time to load utils op: 0.21121907234191895 seconds 24: Time to load utils op: 0.21122145652770996 seconds 24: Time to load utils op: 0.21126794815063477 seconds 24: Time to load utils op: 0.21128273010253906 seconds 24: Time to load utils op: 0.21128153800964355 secondsTime to load utils op: 0.21128487586975098 seconds 24: 24: Time to load utils op: 0.21129441261291504 seconds 24: Time to load utils op: 0.21120119094848633 seconds 18: Time to load utils op: 0.00040340423583984375 seconds 18: Time to load utils op: 0.0003714561462402344 seconds 18: Time to load utils op: 0.0003647804260253906 seconds 10: Time to load utils op: 0.0003840923309326172 seconds 18: Time to load utils op: 0.0003826618194580078 seconds 0: [2022-11-25 17:31:56,804] [INFO] [utils.py:827:see_memory_usage] end 
initialize_optimizer 25: Time to load utils op: 0.21279406547546387 secondsTime to load utils op: 0.21279168128967285 seconds 25: 0: [2022-11-25 17:31:56,805] [INFO] [utils.py:828:see_memory_usage] MA 6.17 GB Max_MA 6.17 GB CA 8.31 GB Max_CA 8 GB 25: Time to load utils op: 0.2127997875213623 seconds 25: Time to load utils op: 0.21282076835632324 secondsTime to load utils op: 0.21280837059020996 seconds 25: 25: Time to load utils op: 0.21282625198364258 secondsTime to load utils op: 0.21283507347106934 seconds 25: 25: Time to load utils op: 0.21283793449401855 seconds 0: [2022-11-25 17:31:56,805] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 29.2 GB, percent = 5.8% 18: Time to load utils op: 0.00034427642822265625 seconds 18: Time to load utils op: 0.00038313865661621094 seconds 26: Time to load utils op: 0.21145367622375488 seconds 26: Time to load utils op: 0.2114734649658203 seconds 26: Time to load utils op: 0.2114717960357666 seconds 26: Time to load utils op: 0.2114865779876709 seconds 26: Time to load utils op: 0.21149849891662598 secondsTime to load utils op: 0.2115159034729004 seconds 26: 26: Time to load utils op: 0.21150660514831543 seconds 26: Time to load utils op: 0.2115192413330078 seconds 27: Time to load utils op: 0.21144390106201172 seconds 27: Time to load utils op: 0.21145963668823242 seconds 27: Time to load utils op: 0.2114725112915039 seconds 27: Time to load utils op: 0.2115013599395752 secondsTime to load utils op: 0.2114872932434082 seconds 27: 27: Time to load utils op: 0.21149706840515137 secondsTime to load utils op: 0.2115013599395752 seconds 27: Time to load utils op: 0.21150469779968262 seconds 27: 28: Time to load utils op: 0.21170401573181152 seconds 28: Time to load utils op: 0.21174025535583496 seconds 28: Time to load utils op: 0.2117478847503662 seconds 28: Time to load utils op: 0.2117617130279541 seconds 28: Time to load utils op: 0.21177291870117188 seconds 28: Time to load utils op: 0.2117764949798584 
secondsTime to load utils op: 0.21178722381591797 seconds 28: 28: Time to load utils op: 0.21178174018859863 seconds 23: Time to load utils op: 0.0003600120544433594 seconds 23: Time to load utils op: 0.00034809112548828125 seconds 23: Time to load utils op: 0.000301361083984375 seconds 23: Time to load utils op: 0.0003962516784667969 seconds 23: Time to load utils op: 0.00036215782165527344 seconds 23: Time to load utils op: 0.0003695487976074219 seconds 23: Time to load utils op: 0.00037741661071777344 seconds 31: Time to load utils op: 0.0003612041473388672 seconds 31: Time to load utils op: 0.0003857612609863281 secondsTime to load utils op: 0.00038123130798339844 seconds 31: 31: Time to load utils op: 0.000396728515625 seconds 31: Time to load utils op: 0.00036072731018066406 seconds 31: Time to load utils op: 0.00036644935607910156 seconds 31: Time to load utils op: 0.00034880638122558594 seconds 7: Time to load utils op: 0.0007526874542236328 seconds 6: Time to load utils op: 0.0005848407745361328 seconds 4: Time to load utils op: 0.00115966796875 seconds 6: Time to load utils op: 0.0008175373077392578 seconds 6: Time to load utils op: 0.0007271766662597656 seconds 6: Time to load utils op: 0.0007622241973876953 seconds 6: Time to load utils op: 0.0007843971252441406 seconds 7: Time to load utils op: 0.0010755062103271484 seconds 6: Time to load utils op: 0.0009431838989257812 seconds 6: Time to load utils op: 0.0008783340454101562 seconds 6: Time to load utils op: 0.001047372817993164 seconds 7: Time to load utils op: 0.0012328624725341797 seconds 7: Time to load utils op: 0.0013399124145507812 seconds 7: Time to load utils op: 0.0012557506561279297 seconds 4: Time to load utils op: 0.0014638900756835938 seconds 7: Time to load utils op: 0.001300811767578125 seconds 7: Time to load utils op: 0.0012784004211425781 seconds 4: Time to load utils op: 0.0014362335205078125 secondsTime to load utils op: 0.001390218734741211 seconds 4: Time to load utils op: 
0.0014388561248779297 seconds 4: 7: Time to load utils op: 0.001310110092163086 seconds 4: Time to load utils op: 0.0014119148254394531 seconds 4: Time to load utils op: 0.0013301372528076172 seconds 4: Time to load utils op: 0.0014748573303222656 seconds 12: Time to load utils op: 0.0007076263427734375 seconds 12: Time to load utils op: 0.0007796287536621094 seconds 12: Time to load utils op: 0.0009415149688720703 seconds 12: Time to load utils op: 0.0008342266082763672 seconds 12: Time to load utils op: 0.0010154247283935547 seconds 12: Time to load utils op: 0.0009717941284179688 seconds 12: Time to load utils op: 0.0009522438049316406 seconds 12: Time to load utils op: 0.0010046958923339844 seconds 15: Time to load utils op: 0.0011665821075439453 seconds 15: Time to load utils op: 0.0010688304901123047 seconds 16: Time to load utils op: 0.0010192394256591797 seconds 22: Time to load utils op: 0.0010046958923339844 seconds 17: Time to load utils op: 0.0007321834564208984 seconds 14: Time to load utils op: 0.0009889602661132812 seconds 14: Time to load utils op: 0.0007147789001464844 seconds 16: Time to load utils op: 0.0011930465698242188 seconds 20: Time to load utils op: 0.0008039474487304688 seconds 17: Time to load utils op: 0.0008080005645751953 seconds 14: Time to load utils op: 0.0007839202880859375 seconds 20: Time to load utils op: 0.0006208419799804688 seconds 17: Time to load utils op: 0.0008990764617919922 seconds 15: Time to load utils op: 0.0013804435729980469 seconds 15: Time to load utils op: 0.00133514404296875 seconds 20: Time to load utils op: 0.0009860992431640625 seconds 15: Time to load utils op: 0.0012557506561279297 seconds 24: Time to load utils op: 0.0009570121765136719 seconds 17: Time to load utils op: 0.0010828971862792969 seconds 16: Time to load utils op: 0.0014297962188720703 seconds 14: Time to load utils op: 0.0010821819305419922 seconds 22: Time to load utils op: 0.0015020370483398438 seconds 16: Time to load utils op: 
0.001344442367553711 seconds 15: Time to load utils op: 0.001287221908569336 seconds 15: Time to load utils op: 0.0012862682342529297 seconds 24: Time to load utils op: 0.0008904933929443359 seconds 14: Time to load utils op: 0.0010552406311035156 seconds 22: Time to load utils op: 0.0015063285827636719 seconds 16: Time to load utils op: 0.0013644695281982422 secondsTime to load utils op: 0.0014064311981201172 seconds 16: 15: Time to load utils op: 0.0012898445129394531 seconds 14: Time to load utils op: 0.0011444091796875 seconds 20: Time to load utils op: 0.00110626220703125 seconds 24: Time to load utils op: 0.0011188983917236328 seconds 22: Time to load utils op: 0.0014224052429199219 seconds 22: Time to load utils op: 0.0014085769653320312 seconds 20: Time to load utils op: 0.001081705093383789 secondsTime to load utils op: 0.0010988712310791016 seconds 20: 14: Time to load utils op: 0.0011026859283447266 seconds 22: Time to load utils op: 0.001443624496459961 seconds 17: Time to load utils op: 0.0012371540069580078 seconds 16: Time to load utils op: 0.0014162063598632812 seconds 22: Time to load utils op: 0.0014443397521972656 seconds 17: Time to load utils op: 0.0012106895446777344 seconds 16: Time to load utils op: 0.0014503002166748047 seconds 20: Time to load utils op: 0.0011157989501953125 seconds 22: Time to load utils op: 0.0015094280242919922 seconds 20: Time to load utils op: 0.0011799335479736328 seconds 14: Time to load utils op: 0.0011701583862304688 seconds 17: Time to load utils op: 0.0011746883392333984 seconds 24: Time to load utils op: 0.0012903213500976562 seconds 17: Time to load utils op: 0.0012462139129638672 seconds 24: Time to load utils op: 0.0012691020965576172 seconds 24: Time to load utils op: 0.001251220703125 seconds 24: Time to load utils op: 0.00124359130859375 seconds 24: Time to load utils op: 0.001329183578491211 seconds 0: [2022-11-25 17:31:56,857] [INFO] [utils.py:827:see_memory_usage] end bf16_optimizer 0: [2022-11-25 
17:31:56,858] [INFO] [utils.py:828:see_memory_usage] MA 6.17 GB Max_MA 6.17 GB CA 8.31 GB Max_CA 8 GB 0: [2022-11-25 17:31:56,858] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 29.21 GB, percent = 5.8% 0: [2022-11-25 17:31:56,858] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam 0: [2022-11-25 17:31:56,858] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed using client LR scheduler 0: [2022-11-25 17:31:56,858] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed LR Scheduler = 0: [2022-11-25 17:31:56,858] [INFO] [logging.py:68:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 0: [2022-11-25 17:31:56,859] [INFO] [config.py:1007:print] DeepSpeedEngine configuration: 0: [2022-11-25 17:31:56,859] [INFO] [config.py:1011:print] activation_checkpointing_config { 0: "partition_activations": false, 0: "contiguous_memory_optimization": false, 0: "cpu_checkpointing": false, 0: "number_checkpoints": null, 0: "synchronize_checkpoint_boundary": false, 0: "profile": false 0: } 0: [2022-11-25 17:31:56,859] [INFO] [config.py:1011:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} 0: [2022-11-25 17:31:56,859] [INFO] [config.py:1011:print] amp_enabled .................. False 0: [2022-11-25 17:31:56,859] [INFO] [config.py:1011:print] amp_params ................... False 0: [2022-11-25 17:31:56,859] [INFO] [config.py:1011:print] autotuning_config ............ 
{ 0: "enabled": false, 0: "start_step": null, 0: "end_step": null, 0: "metric_path": null, 0: "arg_mappings": null, 0: "metric": "throughput", 0: "model_info": null, 0: "results_dir": "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/autotuning_results", 0: "exps_dir": "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/autotuning_exps", 0: "overwrite": true, 0: "fast": true, 0: "start_profile_step": 3, 0: "end_profile_step": 5, 0: "tuner_type": "gridsearch", 0: "tuner_early_stopping": 5, 0: "tuner_num_trials": 50, 0: "model_info_path": null, 0: "mp_size": 1, 0: "max_train_batch_size": null, 0: "min_train_batch_size": 1, 0: "max_train_micro_batch_size_per_gpu": 1.024000e+03, 0: "min_train_micro_batch_size_per_gpu": 1, 0: "num_tuning_micro_batch_sizes": 3 0: } 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] bfloat16_enabled ............. True 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] checkpoint_parallel_write_pipeline False 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] checkpoint_tag_validation_enabled True 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] checkpoint_tag_validation_fail False 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] comms_config ................. 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] communication_data_type ...... None 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] compression_config ........... 
{'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_pa 0: rameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] curriculum_enabled ........... False 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] curriculum_params ............ False 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] dataloader_drop_last ......... False 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] disable_allgather ............ False 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] dump_state ................... False 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] dynamic_loss_scale_args ...... None 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] eigenvalue_enabled ........... False 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] eigenvalue_gas_boundary_resolution 1 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] eigenvalue_layer_name ........ 
bert.encoder.layer 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] eigenvalue_layer_num ......... 0 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] eigenvalue_max_iter .......... 100 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] eigenvalue_stability ......... 1e-06 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] eigenvalue_tol ............... 0.01 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] eigenvalue_verbose ........... False 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] elasticity_enabled ........... False 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] flops_profiler_config ........ { 0: "enabled": false, 0: "profile_step": 1, 0: "module_depth": -1, 0: "top_modules": 1, 0: "detailed": true, 0: "output_file": null 0: } 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] fp16_auto_cast ............... None 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] fp16_enabled ................. False 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] fp16_master_weights_and_gradients False 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] global_rank .................. 0 9: Time to load utils op: 0.0007939338684082031 seconds 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] gradient_accumulation_steps .. 1 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] gradient_clipping ............ 1.0 0: [2022-11-25 17:31:56,860] [INFO] [config.py:1011:print] gradient_predivide_factor .... 1.0 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] initial_dynamic_scale ........ 1 9: Time to load utils op: 0.0008261203765869141 seconds 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] load_universal_checkpoint .... False 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] loss_scale ................... 1.0 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] memory_breakdown ............. 
False 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] monitor_config ............... 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] nebula_config ................ { 0: "enabled": false, 0: "persistent_storage_path": null, 0: "persistent_time_interval": 100, 0: "num_of_version_in_retention": 2, 0: "enable_nebula_load": true, 0: "load_path": null 0: } 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] optimizer_legacy_fusion ...... False 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] optimizer_name ............... None 9: Time to load utils op: 0.0009090900421142578 seconds 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] optimizer_params ............. None 9: Time to load utils op: 0.0010273456573486328 seconds 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] pld_enabled .................. False 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] pld_params ................... False 9: Time to load utils op: 0.0010080337524414062 seconds 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] prescale_gradients ........... False 9: Time to load utils op: 0.0009303092956542969 secondsTime to load utils op: 0.0007641315460205078 seconds 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] scheduler_name ............... None 9: 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] scheduler_params ............. None 9: Time to load utils op: 0.0009899139404296875 seconds 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] sparse_attention ............. None 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] sparse_gradients_enabled ..... False 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] steps_per_print .............. 
2000 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] train_batch_size ............. 256 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] train_micro_batch_size_per_gpu 1 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] use_node_local_storage ....... False 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] wall_clock_breakdown ......... False 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] world_size ................... 256 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] zero_allow_untested_optimizer False 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] zero_enabled ................. False 0: [2022-11-25 17:31:56,861] [INFO] [config.py:1011:print] zero_optimization_stage ...... 
0 0: [2022-11-25 17:31:56,862] [INFO] [config.py:996:print_user_config] json = { 0: "train_micro_batch_size_per_gpu": 1, 0: "train_batch_size": 256, 0: "gradient_clipping": 1.0, 0: "zero_optimization": { 0: "stage": 0 0: }, 0: "bf16": { 0: "enabled": true 0: }, 0: "steps_per_print": 2.000000e+03, 0: "wall_clock_breakdown": false 0: } 28: Time to load utils op: 0.0005805492401123047 seconds 0: Time to load utils op: 0.0004851818084716797 seconds 28: Time to load utils op: 0.0007977485656738281 seconds 0: [2022-11-25 17:31:56,862] [INFO] [engine.py:87:__init__] CONFIG: micro_batches=1 micro_batch_size=1 28: Time to load utils op: 0.0008504390716552734 seconds 27: Time to load utils op: 0.0008416175842285156 seconds 28: Time to load utils op: 0.0009794235229492188 seconds 27: Time to load utils op: 0.0011131763458251953 seconds 25: Time to load utils op: 0.0009107589721679688 seconds 28: Time to load utils op: 0.0012154579162597656 secondsTime to load utils op: 0.0011849403381347656 seconds 28: 28: Time to load utils op: 0.0011968612670898438 seconds 28: Time to load utils op: 0.0012001991271972656 seconds 27: Time to load utils op: 0.0013127326965332031 secondsTime to load utils op: 0.0013344287872314453 seconds 27: 27: Time to load utils op: 0.0012996196746826172 seconds 27: Time to load utils op: 0.0012657642364501953 seconds 27: Time to load utils op: 0.0014009475708007812 seconds 27: Time to load utils op: 0.0013625621795654297 seconds 25: Time to load utils op: 0.0012421607971191406 seconds 25: Time to load utils op: 0.0012671947479248047 seconds 25: Time to load utils op: 0.0011496543884277344 seconds 25: Time to load utils op: 0.0011451244354248047 secondsTime to load utils op: 0.0011789798736572266 seconds 25: 25: Time to load utils op: 0.0011582374572753906 seconds 25: Time to load utils op: 0.0011861324310302734 seconds 19: Time to load utils op: 0.00047850608825683594 seconds 19: Time to load utils op: 0.00047898292541503906 seconds 19: Time to load utils 
op: 0.0005128383636474609 seconds 21: Time to load utils op: 0.0007379055023193359 seconds 19: Time to load utils op: 0.0005240440368652344 seconds 19: Time to load utils op: 0.0005464553833007812 seconds 19: Time to load utils op: 0.0005738735198974609 seconds 21: Time to load utils op: 0.001024007797241211 seconds 19: Time to load utils op: 0.0006122589111328125 seconds 21: Time to load utils op: 0.0008411407470703125 seconds 19: Time to load utils op: 0.0006158351898193359 seconds 21: Time to load utils op: 0.0009520053863525391 seconds 21: Time to load utils op: 0.001157999038696289 seconds 21: Time to load utils op: 0.0010483264923095703 seconds 21: Time to load utils op: 0.0009999275207519531 seconds 21: Time to load utils op: 0.0012042522430419922 seconds 0: [2022-11-25 17:31:56,882] [INFO] [engine.py:145:__init__] RANK=0 STAGE=0 LAYERS=33 [0, 33) STAGE_PARAMS=1096338432 (1096.338M) TOTAL_PARAMS=1096338432 (1096.338M) UNIQUE_PARAMS=1096338432 (1096.338M) 26: Time to load utils op: 0.00090789794921875 seconds 26: Time to load utils op: 0.0008656978607177734 seconds 26: Time to load utils op: 0.0010433197021484375 seconds 26: Time to load utils op: 0.0012655258178710938 seconds 26: Time to load utils op: 0.0011906623840332031 seconds 26: Time to load utils op: 0.0011913776397705078 seconds 26: Time to load utils op: 0.0011587142944335938 seconds 26: Time to load utils op: 0.0011980533599853516 seconds 0: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
0: WARNING: could not find the metadata file checkpoints_1b1long 0: will not load any checkpoints and will start from random 0: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 0: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 0: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 16: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 28: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 28: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 28: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
24: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 24: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 24: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 31: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 29: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 30: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 30: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
16: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 20: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 20: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 23: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 23: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 28: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 24: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
26: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 0: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 0: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 8: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 16: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 16: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 12: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
12: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 28: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 28: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 14: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 29: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 22: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 22: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
27: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 27: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 7: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 10: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 12: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 15: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 15: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
20: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 23: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 23: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 11: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 28: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 24: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 14: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
31: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 30: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 18: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 26: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 19: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 19: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 0: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
6: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 6: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 7: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 4: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 8: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 8: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 16: [2022-11-25 17:31:56,947] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
8: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 10: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 1: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 2: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 13: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 25: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 11: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
31: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 29: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 22: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 30: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 18: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 6: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 5: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
5: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 4: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 9: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 9: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 1: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 25: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 11: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
31: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 17: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 17: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 21: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 19: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 27: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 10: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
10: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 25: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 17: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 21: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 18: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 19: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 27: [2022-11-25 17:31:56,948] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b1long/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
31: time (ms) | load-checkpoint: 8.66
0: estimated model parameters: 1.096338432
0: estimated model parameters without embeddings: 1.002523648
0: [after model, optimizer, and learning rate scheduler are built] datetime: 2022-11-25 17:31:57
0: > building train, validation, and test datasets ...
0: > datasets target sizes (minimum size):
0: train: 44416143
0: validation: 44544
0: test: 256
0: > building train, validation, and test datasets for GPT ...
0: > building dataset index ...
0: reading sizes...
0: reading pointers...
0: reading document index...
0: creating numpy buffer of mmap...
0: creating memory view of numpy buffer...
0: > finished creating indexed dataset in 0.010186 seconds
0: number of documents: 210604984
0: > dataset split:
0: train:
0: document indices in [0, 199864130) total of 199864130 documents
0: validation:
0: document indices in [199864130, 210394379) total of 10530249 documents
0: test:
0: document indices in [210394379, 210604984) total of 210605 documents
0: > WARNING: could not find index map files, building the indices on rank 0 ...
0: > only one epoch required, setting separate_last_epoch to False
0: > elasped time to build and save doc-idx mapping (seconds): 14.457500
0: using:
0: number of documents: 199864130
0: number of epochs: 1
0: sequence length: 2048
0: total number of samples: 173377816
0: > elasped time to build and save sample-idx mapping (seconds): 4.167411
0: > building shuffle index with split [0, 173377816) and [173377816, 173377816) ...
0: > elasped time to build and save shuffle-idx mapping (seconds): 10.276205
0: > loading doc-idx mapping from /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_train_indexmap_44416143ns_2048sl_1234s_doc_idx.npy
0: > loading sample-idx mapping from /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_train_indexmap_44416143ns_2048sl_1234s_sample_idx.npy
0: > loading shuffle-idx mapping from /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_train_indexmap_44416143ns_2048sl_1234s_shuffle_idx.npy
0: loaded indexed file in 0.128 seconds
0: total number of samples: 173377817
0: total number of epochs: 1
0: > WARNING: could not find index map files, building the indices on rank 0 ...
0: > only one epoch required, setting separate_last_epoch to False
0: > elasped time to build and save doc-idx mapping (seconds): 0.482712
0: using:
0: number of documents: 10530249
0: number of epochs: 1
0: sequence length: 2048
0: total number of samples: 9118344
0: > elasped time to build and save sample-idx mapping (seconds): 0.210911
0: > building shuffle index with split [0, 9118344) and [9118344, 9118344) ...
0: > elasped time to build and save shuffle-idx mapping (seconds): 0.267297
0: > loading doc-idx mapping from /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_valid_indexmap_44544ns_2048sl_1234s_doc_idx.npy
0: > loading sample-idx mapping from /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_valid_indexmap_44544ns_2048sl_1234s_sample_idx.npy
0: > loading shuffle-idx mapping from /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_valid_indexmap_44544ns_2048sl_1234s_shuffle_idx.npy
0: loaded indexed file in 0.026 seconds
0: total number of samples: 9118345
0: total number of epochs: 1
0: > loading doc-idx mapping from /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_test_indexmap_256ns_2048sl_1234s_doc_idx.npy
0: > loading sample-idx mapping from /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_test_indexmap_256ns_2048sl_1234s_sample_idx.npy
0: > loading shuffle-idx mapping from /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_test_indexmap_256ns_2048sl_1234s_shuffle_idx.npy
0: loaded indexed file in 0.010 seconds
0: total number of samples: 182928
0: total number of epochs: 1
0: > finished creating GPT datasets ...
0: [after dataloaders are built] datetime: 2022-11-25 17:32:44
0: done with setup ...
0: training ...
0: Number of parameters: [tensor rank - pipeline rank] w/ and w/o embeddings:
31: time (ms) | model-and-optimizer-setup: 31002.96 | train/valid/test-data-iterators-setup: 47260.76
0: [000-000] 1.0963B / 1.0025B
0: [before the start of training step] datetime: 2022-11-25 17:32:44
0: [Rank 0] (after 10 iterations) memory (MB) | allocated: 8825.99951171875 | max allocated: 22728.953125 | reserved: 24902.0 | max reserved: 24902.0
31: iteration 10/ 173500 | consumed samples: 2560 | consumed tokens: 5242880 | elapsed time per iteration (s): 2.42 | learning rate: 1.153E-06 | global batch size: 256 | lm loss: 1.085415E+01 | grad norm: 26.848 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 105.587 | TFLOPs: 6.39 |
31: iteration 20/ 173500 | consumed samples: 5120 | consumed tokens: 10485760 | elapsed time per iteration (s): 0.78 | learning rate: 2.305E-06 | global batch size: 256 | lm loss: 9.305616E+00 | grad norm: 12.473 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.505 | TFLOPs: 19.81 |
31: iteration 30/ 173500 | consumed samples: 7680 | consumed tokens: 15728640 | elapsed time per iteration (s): 0.82 | learning rate: 3.458E-06 | global batch size: 256 | lm loss: 8.706579E+00 | grad norm: 4.349 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.023 | TFLOPs: 18.82 |
31: iteration 40/ 173500 | consumed samples: 10240 | consumed tokens: 20971520 | elapsed time per iteration (s): 0.87 | learning rate: 4.611E-06 | global batch size: 256 | lm loss: 8.416602E+00 | grad norm: 2.820 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.868 | TFLOPs: 17.84 |
31: iteration 50/ 173500 | consumed samples: 12800 | consumed tokens: 26214400 | elapsed time per iteration (s): 0.87 | learning rate: 5.764E-06 | global batch size: 256 | lm loss: 8.205109E+00 | grad norm: 3.087 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.829 | TFLOPs: 17.72 |
31: iteration 60/ 173500 | consumed samples: 15360 | consumed tokens: 31457280 | elapsed time per iteration (s): 0.86 | learning rate: 6.916E-06 | global batch size: 256 | lm loss: 8.035940E+00 | grad norm: 2.850 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.910 | TFLOPs: 17.96 |
31: iteration 70/ 173500 | consumed samples: 17920 | consumed tokens: 36700160 | elapsed time per iteration (s): 0.84 | learning rate: 8.069E-06 | global batch size: 256 | lm loss: 7.821154E+00 | grad norm: 2.301 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.762 | TFLOPs: 18.44 |
31: iteration 80/ 173500 | consumed samples: 20480 | consumed tokens: 41943040 | elapsed time per iteration (s): 0.80 | learning rate: 9.222E-06 | global batch size: 256 | lm loss: 7.582549E+00 | grad norm: 3.852 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.183 | TFLOPs: 19.31 |
31: iteration 90/ 173500 | consumed samples: 23040 | consumed tokens: 47185920 | elapsed time per iteration (s): 0.75 | learning rate: 1.037E-05 | global batch size: 256 | lm loss: 7.434130E+00 | grad norm: 3.241 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.710 | TFLOPs: 20.73 |
31: iteration 100/ 173500 | consumed samples: 25600 | consumed tokens: 52428800 | elapsed time per iteration (s): 0.76 | learning rate: 1.153E-05 | global batch size: 256 | lm loss: 7.250718E+00 | grad norm: 4.877 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.863 | TFLOPs: 20.32 |
31: iteration 110/ 173500 | consumed samples: 28160 | consumed tokens: 57671680 | elapsed time per iteration (s): 0.80 | learning rate: 1.268E-05 | global batch size: 256 | lm loss: 7.074741E+00 | grad norm: 4.095 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.947 | TFLOPs: 19.42 |
31: iteration 120/ 173500 | consumed samples: 30720 | consumed tokens: 62914560 | elapsed time per iteration (s): 0.78 | learning rate: 1.383E-05 | global batch size: 256 | lm loss: 6.925228E+00 | grad norm: 2.788 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.302 | TFLOPs: 19.98 |
31: iteration 130/ 173500 | consumed samples: 33280 | consumed tokens: 68157440 | elapsed time per iteration (s): 0.78 | learning rate: 1.499E-05 | global batch size: 256 | lm loss: 6.759454E+00 | grad norm: 2.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.561 | TFLOPs: 19.82 |
31: iteration 140/ 173500 | consumed samples: 35840 | consumed tokens: 73400320 | elapsed time per iteration (s): 0.79 | learning rate: 1.614E-05 | global batch size: 256 | lm loss: 6.649731E+00 | grad norm: 2.800 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.061 | TFLOPs: 19.67 |
31: iteration 150/ 173500 | consumed samples: 38400 | consumed tokens: 78643200 | elapsed time per iteration (s): 0.83 | learning rate: 1.729E-05 | global batch size: 256 | lm loss: 6.572528E+00 | grad norm: 2.528 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.268 | TFLOPs: 18.71 |
31: iteration 160/ 173500 | consumed samples: 40960 | consumed tokens: 83886080 | elapsed time per iteration (s): 0.82 | learning rate: 1.844E-05 | global batch size: 256 | lm loss: 6.424799E+00 | grad norm: 3.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.350 | TFLOPs: 18.84 |
31: iteration 170/ 173500 | consumed samples: 43520 | consumed tokens: 89128960 | elapsed time per iteration (s): 0.77 | learning rate: 1.960E-05 | global batch size: 256 | lm loss: 6.396594E+00 | grad norm: 2.876 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.213 | TFLOPs: 20.10 |
31: iteration 180/ 173500 | consumed samples: 46080 | consumed tokens: 94371840 | elapsed time per iteration (s): 0.77 | learning rate: 2.075E-05 | global batch size: 256 | lm loss: 6.286553E+00 | grad norm: 2.457 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.341 | TFLOPs: 20.23 |
31: iteration 190/ 173500 | consumed samples: 48640 | consumed tokens: 99614720 | elapsed time per iteration (s): 0.86 | learning rate: 2.190E-05 | global batch size: 256 | lm loss: 6.245541E+00 | grad norm: 3.369 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.968 | TFLOPs: 18.03 |
31: iteration 200/ 173500 | consumed samples: 51200 | consumed tokens: 104857600 | elapsed time per iteration (s): 0.84 | learning rate: 2.305E-05 | global batch size: 256 | lm loss: 6.178689E+00 | grad norm: 2.293 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.064 | TFLOPs: 18.40 |
31: iteration 210/ 173500 | consumed samples: 53760 | consumed tokens: 110100480 | elapsed time per iteration (s): 0.82 | learning rate: 2.421E-05 | global batch size: 256 | lm loss: 6.137992E+00 | grad norm: 2.378 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.739 | TFLOPs: 18.92 |
31: iteration 220/ 173500 | consumed samples: 56320 | consumed tokens: 115343360 | elapsed time per iteration (s): 0.76 | learning rate: 2.536E-05 | global batch size: 256 | lm loss: 6.060030E+00 | grad norm: 3.686 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.700 | TFLOPs: 20.31 |
31: iteration 230/ 173500 | consumed samples: 58880 | consumed tokens: 120586240 | elapsed time per iteration (s): 0.78 | learning rate: 2.651E-05 | global batch size: 256 | lm loss: 6.035353E+00 | grad norm: 3.803 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.443 | TFLOPs: 19.75 |
31: iteration 240/ 173500 | consumed samples: 61440 | consumed tokens: 125829120 | elapsed time per iteration (s): 0.77 | learning rate: 2.767E-05 | global batch size: 256 | lm loss: 6.028471E+00 | grad norm: 2.641 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.481 | TFLOPs: 20.17 |
31: iteration 250/ 173500 | consumed samples: 64000 | consumed tokens: 131072000 | elapsed time per iteration (s): 0.80 | learning rate: 2.882E-05 | global batch size: 256 | lm loss: 5.938224E+00 | grad norm: 3.506 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.898 | TFLOPs: 19.35 |
31: iteration 260/ 173500 | consumed samples: 66560 | consumed tokens: 136314880 | elapsed time per iteration (s): 0.85 | learning rate: 2.997E-05 | global batch size: 256 | lm loss: 5.920851E+00 | grad norm: 2.323 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.286 | TFLOPs: 18.23 |
31: iteration 270/ 173500 | consumed samples: 69120 | consumed tokens: 141557760 | elapsed time per iteration (s): 0.83 | learning rate: 3.112E-05 | global batch size: 256 | lm loss: 5.846789E+00 | grad norm: 2.588 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.831 | TFLOPs: 18.62 |
31: iteration 280/ 173500 | consumed samples: 71680 | consumed tokens: 146800640 | elapsed time per iteration (s): 0.79 | learning rate: 3.228E-05 | global batch size: 256 | lm loss: 5.831240E+00 | grad norm: 2.839 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.784 | TFLOPs: 19.53 |
31: iteration 290/ 173500 | consumed samples: 74240 | consumed tokens: 152043520 | elapsed time per iteration (s): 0.76 | learning rate: 3.343E-05 | global batch size: 256 | lm loss: 5.766938E+00 | grad norm: 3.082 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.483 | TFLOPs: 20.30 |
31: iteration 300/ 173500 | consumed samples: 76800 | consumed tokens: 157286400 | elapsed time per iteration (s): 0.77 | learning rate: 3.458E-05 | global batch size: 256 | lm loss: 5.687265E+00 | grad norm: 2.655 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.749 | TFLOPs: 20.19 |
31: iteration 310/ 173500 | consumed samples: 79360 | consumed tokens: 162529280 | elapsed time per iteration (s): 0.77 | learning rate: 3.573E-05 | global batch size: 256 | lm loss: 5.708453E+00 | grad norm: 2.515 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.377 | TFLOPs: 19.99 |
31: iteration 320/ 173500 | consumed samples: 81920 | consumed tokens: 167772160 | elapsed time per iteration (s): 0.79 | learning rate: 3.689E-05 | global batch size: 256 | lm loss: 5.685188E+00 | grad norm: 2.377 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.524 | TFLOPs: 19.69 |
31: iteration 330/ 173500 | consumed samples: 84480 | consumed tokens: 173015040 | elapsed time per iteration (s): 0.84 | learning rate: 3.804E-05 | global batch size: 256 | lm loss: 5.640014E+00 | grad norm: 2.485 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.486 | TFLOPs: 18.36 |
31: iteration 340/ 173500 | consumed samples: 87040 | consumed tokens: 178257920 | elapsed time per iteration (s): 0.82 | learning rate: 3.919E-05 | global batch size: 256 | lm loss: 5.609632E+00 | grad norm: 2.075 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.713 | TFLOPs: 18.86 |
31: iteration 350/ 173500 | consumed samples: 89600 | consumed tokens: 183500800 | elapsed time per iteration (s): 0.81 | learning rate: 4.035E-05 | global batch size: 256 | lm loss: 5.558034E+00 | grad norm: 2.629 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.737 | TFLOPs: 19.10 |
31: iteration 360/ 173500 | consumed samples: 92160 | consumed tokens: 188743680 | elapsed time per iteration (s): 0.85 | learning rate: 4.150E-05 | global batch size: 256 | lm loss: 5.516315E+00 | grad norm: 3.350 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.794 | TFLOPs: 18.14 |
31: iteration 370/ 173500 | consumed samples: 94720 | consumed tokens: 193986560 | elapsed time per iteration (s): 0.77 | learning rate: 4.265E-05 | global batch size: 256 | lm loss: 5.447181E+00 | grad norm: 1.837 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.501 | TFLOPs: 20.05 |
31: iteration 380/ 173500 | consumed samples: 97280 | consumed tokens: 199229440 | elapsed time per iteration (s): 0.86 | learning rate: 4.380E-05 | global batch size: 256 | lm loss: 5.457804E+00 | grad norm: 2.751 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.196 | TFLOPs: 17.98 |
31: iteration 390/ 173500 | consumed samples: 99840 | consumed tokens: 204472320 | elapsed time per iteration (s): 0.77 | learning rate: 4.496E-05 | global batch size: 256 | lm loss: 5.404614E+00 | grad norm: 2.292 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.870 | TFLOPs: 20.02 |
31: iteration 400/ 173500 | consumed samples: 102400 | consumed tokens: 209715200 | elapsed time per iteration (s): 0.74 | learning rate: 4.611E-05 | global batch size: 256 | lm loss: 5.417352E+00 | grad norm: 3.377 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.910 | TFLOPs: 20.81 |
31: iteration 410/ 173500 | consumed samples: 104960 | consumed tokens: 214958080 | elapsed time per iteration (s): 0.78 | learning rate: 4.726E-05 | global batch size: 256 | lm loss: 5.375229E+00 | grad norm: 2.689 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.428 | TFLOPs: 19.81 |
31: iteration 420/ 173500 | consumed samples: 107520 | consumed tokens: 220200960 | elapsed time per iteration (s): 0.73 | learning rate: 4.841E-05 | global batch size: 256 | lm loss: 5.408162E+00 | grad norm: 2.328 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.424 | TFLOPs: 21.20 |
31: iteration 430/ 173500 | consumed samples: 110080 | consumed tokens: 225443840 | elapsed time per iteration (s): 0.79 | learning rate: 4.957E-05 | global batch size: 256 | lm loss: 5.332573E+00 | grad norm: 2.438 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.468 | TFLOPs: 19.57 |
31: iteration 440/ 173500 | consumed samples: 112640 | consumed tokens: 230686720 | elapsed time per iteration (s): 0.74 | learning rate: 5.072E-05 | global batch size: 256 | lm loss: 5.292148E+00 | grad norm: 2.289 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.900 | TFLOPs: 20.87 |
31: iteration 450/ 173500 | consumed samples: 115200 | consumed tokens: 235929600 | elapsed time per iteration (s): 0.81 | learning rate: 5.187E-05 | global batch size: 256 | lm loss: 5.250002E+00 | grad norm: 1.950 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.371 | TFLOPs: 19.02 |
31: iteration 460/ 173500 | consumed samples: 117760 | consumed tokens: 241172480 | elapsed time per iteration (s): 0.78 | learning rate: 5.303E-05 | global batch size: 256 | lm loss: 5.259367E+00 | grad norm: 2.239 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.150 | TFLOPs: 19.85 |
31: iteration 470/ 173500 | consumed samples: 120320 | consumed tokens: 246415360 | elapsed time per iteration (s): 0.75 | learning rate: 5.418E-05 | global batch size: 256 | lm loss: 5.234134E+00 | grad norm: 2.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.056 | TFLOPs: 20.75 |
31: iteration 480/ 173500 | consumed samples: 122880 | consumed tokens: 251658240 | elapsed time per iteration (s): 0.77 | learning rate: 5.533E-05 | global batch size: 256 | lm loss: 5.217163E+00 | grad norm: 2.314 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.792 | TFLOPs: 20.07 |
31: iteration 490/ 173500 | consumed samples: 125440 | consumed tokens: 256901120 | elapsed time per iteration (s): 0.92 | learning rate: 5.648E-05 | global batch size: 256 | lm loss: 5.154974E+00 | grad norm: 2.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 278.449 | TFLOPs: 16.85 |
31: iteration 500/ 173500 | consumed samples: 128000 | consumed tokens: 262144000 | elapsed time per iteration (s): 0.85 | learning rate: 5.764E-05 | global batch size: 256 | lm loss: 5.134572E+00 | grad norm: 1.574 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.303 | TFLOPs: 18.17 |
31: iteration 510/ 173500 | consumed samples: 130560 | consumed tokens: 267386880 | elapsed time per iteration (s): 0.78 | learning rate: 5.879E-05 | global batch size: 256 | lm loss: 5.119096E+00 | grad norm: 1.920 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.244 | TFLOPs: 19.80 |
31: iteration 520/ 173500 | 
consumed samples: 133120 | consumed tokens: 272629760 | elapsed time per iteration (s): 0.78 | learning rate: 5.994E-05 | global batch size: 256 | lm loss: 5.098163E+00 | grad norm: 2.520 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.970 | TFLOPs: 19.84 | 31: iteration 530/ 173500 | consumed samples: 135680 | consumed tokens: 277872640 | elapsed time per iteration (s): 0.82 | learning rate: 6.109E-05 | global batch size: 256 | lm loss: 5.034930E+00 | grad norm: 1.820 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.754 | TFLOPs: 18.86 | 31: iteration 540/ 173500 | consumed samples: 138240 | consumed tokens: 283115520 | elapsed time per iteration (s): 0.84 | learning rate: 6.225E-05 | global batch size: 256 | lm loss: 5.083284E+00 | grad norm: 1.593 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.729 | TFLOPs: 18.44 | 31: iteration 550/ 173500 | consumed samples: 140800 | consumed tokens: 288358400 | elapsed time per iteration (s): 0.76 | learning rate: 6.340E-05 | global batch size: 256 | lm loss: 5.064396E+00 | grad norm: 2.066 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.259 | TFLOPs: 20.34 | 31: iteration 560/ 173500 | consumed samples: 143360 | consumed tokens: 293601280 | elapsed time per iteration (s): 0.79 | learning rate: 6.455E-05 | global batch size: 256 | lm loss: 5.068308E+00 | grad norm: 1.686 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.239 | TFLOPs: 19.68 | 31: iteration 570/ 173500 | consumed samples: 145920 | consumed tokens: 298844160 | elapsed time per iteration (s): 0.79 | learning rate: 6.571E-05 | global batch size: 256 | lm loss: 5.017149E+00 | grad norm: 1.793 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 325.235 | TFLOPs: 19.68 | 31: iteration 580/ 173500 | consumed samples: 148480 | consumed tokens: 304087040 | elapsed time per iteration (s): 0.79 | learning rate: 6.686E-05 | global batch size: 256 | lm loss: 4.971856E+00 | grad norm: 1.512 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.564 | TFLOPs: 19.70 | 31: iteration 590/ 173500 | consumed samples: 151040 | consumed tokens: 309329920 | elapsed time per iteration (s): 0.80 | learning rate: 6.801E-05 | global batch size: 256 | lm loss: 4.920555E+00 | grad norm: 2.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.989 | TFLOPs: 19.30 | 31: iteration 600/ 173500 | consumed samples: 153600 | consumed tokens: 314572800 | elapsed time per iteration (s): 0.82 | learning rate: 6.916E-05 | global batch size: 256 | lm loss: 4.913213E+00 | grad norm: 1.985 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.086 | TFLOPs: 18.82 | 31: iteration 610/ 173500 | consumed samples: 156160 | consumed tokens: 319815680 | elapsed time per iteration (s): 0.78 | learning rate: 7.032E-05 | global batch size: 256 | lm loss: 4.898056E+00 | grad norm: 1.592 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.192 | TFLOPs: 19.85 | 31: iteration 620/ 173500 | consumed samples: 158720 | consumed tokens: 325058560 | elapsed time per iteration (s): 0.83 | learning rate: 7.147E-05 | global batch size: 256 | lm loss: 4.846072E+00 | grad norm: 1.639 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.037 | TFLOPs: 18.76 | 31: iteration 630/ 173500 | consumed samples: 161280 | consumed tokens: 330301440 | elapsed time per iteration (s): 0.77 | learning rate: 7.262E-05 | global batch size: 256 | lm loss: 4.931440E+00 | grad norm: 2.080 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.125 | TFLOPs: 20.21 | 31: iteration 640/ 173500 | consumed samples: 163840 | consumed tokens: 335544320 | elapsed time per iteration (s): 0.77 | learning rate: 7.378E-05 | global batch size: 256 | lm loss: 4.835787E+00 | grad norm: 1.599 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.939 | TFLOPs: 20.02 | 31: iteration 650/ 173500 | consumed samples: 166400 | consumed tokens: 340787200 | elapsed time per iteration (s): 0.76 | learning rate: 7.493E-05 | global batch size: 256 | lm loss: 4.852417E+00 | grad norm: 1.600 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.685 | TFLOPs: 20.43 | 31: iteration 660/ 173500 | consumed samples: 168960 | consumed tokens: 346030080 | elapsed time per iteration (s): 0.80 | learning rate: 7.608E-05 | global batch size: 256 | lm loss: 4.791671E+00 | grad norm: 1.434 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.817 | TFLOPs: 19.47 | 31: iteration 670/ 173500 | consumed samples: 171520 | consumed tokens: 351272960 | elapsed time per iteration (s): 0.80 | learning rate: 7.723E-05 | global batch size: 256 | lm loss: 4.738846E+00 | grad norm: 1.312 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.918 | TFLOPs: 19.29 | 31: iteration 680/ 173500 | consumed samples: 174080 | consumed tokens: 356515840 | elapsed time per iteration (s): 0.78 | learning rate: 7.839E-05 | global batch size: 256 | lm loss: 4.746957E+00 | grad norm: 1.415 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.762 | TFLOPs: 19.89 | 31: iteration 690/ 173500 | consumed samples: 176640 | consumed tokens: 361758720 | elapsed time per iteration (s): 0.78 | learning rate: 7.954E-05 | global 
batch size: 256 | lm loss: 4.730325E+00 | grad norm: 1.453 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.061 | TFLOPs: 19.97 | 31: iteration 700/ 173500 | consumed samples: 179200 | consumed tokens: 367001600 | elapsed time per iteration (s): 0.78 | learning rate: 8.069E-05 | global batch size: 256 | lm loss: 4.706831E+00 | grad norm: 1.461 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.503 | TFLOPs: 19.75 | 31: iteration 710/ 173500 | consumed samples: 181760 | consumed tokens: 372244480 | elapsed time per iteration (s): 0.78 | learning rate: 8.184E-05 | global batch size: 256 | lm loss: 4.688029E+00 | grad norm: 1.400 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.127 | TFLOPs: 19.97 | 31: iteration 720/ 173500 | consumed samples: 184320 | consumed tokens: 377487360 | elapsed time per iteration (s): 0.77 | learning rate: 8.300E-05 | global batch size: 256 | lm loss: 4.693392E+00 | grad norm: 1.383 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.217 | TFLOPs: 20.16 | 31: iteration 730/ 173500 | consumed samples: 186880 | consumed tokens: 382730240 | elapsed time per iteration (s): 0.82 | learning rate: 8.415E-05 | global batch size: 256 | lm loss: 4.648584E+00 | grad norm: 1.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.412 | TFLOPs: 18.90 | 31: iteration 740/ 173500 | consumed samples: 189440 | consumed tokens: 387973120 | elapsed time per iteration (s): 0.73 | learning rate: 8.530E-05 | global batch size: 256 | lm loss: 4.618730E+00 | grad norm: 1.492 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.296 | TFLOPs: 21.31 | 31: iteration 750/ 173500 | consumed samples: 192000 | consumed tokens: 393216000 | 
elapsed time per iteration (s): 0.77 | learning rate: 8.646E-05 | global batch size: 256 | lm loss: 4.552285E+00 | grad norm: 1.319 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.222 | TFLOPs: 20.22 | 31: iteration 760/ 173500 | consumed samples: 194560 | consumed tokens: 398458880 | elapsed time per iteration (s): 0.75 | learning rate: 8.761E-05 | global batch size: 256 | lm loss: 4.576573E+00 | grad norm: 1.509 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.942 | TFLOPs: 20.75 | 31: iteration 770/ 173500 | consumed samples: 197120 | consumed tokens: 403701760 | elapsed time per iteration (s): 0.82 | learning rate: 8.876E-05 | global batch size: 256 | lm loss: 4.567830E+00 | grad norm: 1.365 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.151 | TFLOPs: 18.94 | 31: iteration 780/ 173500 | consumed samples: 199680 | consumed tokens: 408944640 | elapsed time per iteration (s): 0.78 | learning rate: 8.991E-05 | global batch size: 256 | lm loss: 4.499974E+00 | grad norm: 1.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.617 | TFLOPs: 19.88 | 31: iteration 790/ 173500 | consumed samples: 202240 | consumed tokens: 414187520 | elapsed time per iteration (s): 0.77 | learning rate: 9.107E-05 | global batch size: 256 | lm loss: 4.466917E+00 | grad norm: 1.351 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.347 | TFLOPs: 20.05 | 31: iteration 800/ 173500 | consumed samples: 204800 | consumed tokens: 419430400 | elapsed time per iteration (s): 0.79 | learning rate: 9.222E-05 | global batch size: 256 | lm loss: 4.429627E+00 | grad norm: 1.357 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.367 | TFLOPs: 19.68 | 31: iteration 
810/ 173500 | consumed samples: 207360 | consumed tokens: 424673280 | elapsed time per iteration (s): 0.74 | learning rate: 9.337E-05 | global batch size: 256 | lm loss: 4.378968E+00 | grad norm: 1.235 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.638 | TFLOPs: 20.85 | 31: iteration 820/ 173500 | consumed samples: 209920 | consumed tokens: 429916160 | elapsed time per iteration (s): 0.78 | learning rate: 9.452E-05 | global batch size: 256 | lm loss: 4.380173E+00 | grad norm: 1.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.555 | TFLOPs: 19.94 | 31: iteration 830/ 173500 | consumed samples: 212480 | consumed tokens: 435159040 | elapsed time per iteration (s): 0.77 | learning rate: 9.568E-05 | global batch size: 256 | lm loss: 4.373691E+00 | grad norm: 1.342 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.438 | TFLOPs: 19.99 | 31: iteration 840/ 173500 | consumed samples: 215040 | consumed tokens: 440401920 | elapsed time per iteration (s): 0.79 | learning rate: 9.683E-05 | global batch size: 256 | lm loss: 4.272437E+00 | grad norm: 1.263 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.183 | TFLOPs: 19.49 | 31: iteration 850/ 173500 | consumed samples: 217600 | consumed tokens: 445644800 | elapsed time per iteration (s): 0.80 | learning rate: 9.798E-05 | global batch size: 256 | lm loss: 4.242782E+00 | grad norm: 1.356 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.745 | TFLOPs: 19.40 | 31: iteration 860/ 173500 | consumed samples: 220160 | consumed tokens: 450887680 | elapsed time per iteration (s): 0.80 | learning rate: 9.914E-05 | global batch size: 256 | lm loss: 4.209338E+00 | grad norm: 1.365 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 318.526 | TFLOPs: 19.27 | 31: iteration 870/ 173500 | consumed samples: 222720 | consumed tokens: 456130560 | elapsed time per iteration (s): 0.78 | learning rate: 1.003E-04 | global batch size: 256 | lm loss: 4.204113E+00 | grad norm: 1.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.892 | TFLOPs: 19.78 | 31: iteration 880/ 173500 | consumed samples: 225280 | consumed tokens: 461373440 | elapsed time per iteration (s): 0.79 | learning rate: 1.014E-04 | global batch size: 256 | lm loss: 4.165678E+00 | grad norm: 1.595 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.877 | TFLOPs: 19.53 | 31: iteration 890/ 173500 | consumed samples: 227840 | consumed tokens: 466616320 | elapsed time per iteration (s): 0.75 | learning rate: 1.026E-04 | global batch size: 256 | lm loss: 4.090152E+00 | grad norm: 1.370 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.165 | TFLOPs: 20.52 | 31: iteration 900/ 173500 | consumed samples: 230400 | consumed tokens: 471859200 | elapsed time per iteration (s): 0.81 | learning rate: 1.037E-04 | global batch size: 256 | lm loss: 4.045837E+00 | grad norm: 1.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.917 | TFLOPs: 19.17 | 31: iteration 910/ 173500 | consumed samples: 232960 | consumed tokens: 477102080 | elapsed time per iteration (s): 0.78 | learning rate: 1.049E-04 | global batch size: 256 | lm loss: 4.039030E+00 | grad norm: 1.261 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.262 | TFLOPs: 19.98 | 31: iteration 920/ 173500 | consumed samples: 235520 | consumed tokens: 482344960 | elapsed time per iteration (s): 0.76 | learning rate: 1.061E-04 | global batch size: 256 | lm loss: 4.012318E+00 | grad norm: 
1.371 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.922 | TFLOPs: 20.32 | 31: iteration 930/ 173500 | consumed samples: 238080 | consumed tokens: 487587840 | elapsed time per iteration (s): 0.77 | learning rate: 1.072E-04 | global batch size: 256 | lm loss: 3.934585E+00 | grad norm: 1.051 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.052 | TFLOPs: 20.03 | 31: iteration 940/ 173500 | consumed samples: 240640 | consumed tokens: 492830720 | elapsed time per iteration (s): 0.83 | learning rate: 1.084E-04 | global batch size: 256 | lm loss: 3.901472E+00 | grad norm: 1.246 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.913 | TFLOPs: 18.63 | 31: iteration 950/ 173500 | consumed samples: 243200 | consumed tokens: 498073600 | elapsed time per iteration (s): 0.77 | learning rate: 1.095E-04 | global batch size: 256 | lm loss: 3.908261E+00 | grad norm: 1.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.402 | TFLOPs: 20.05 | 31: iteration 960/ 173500 | consumed samples: 245760 | consumed tokens: 503316480 | elapsed time per iteration (s): 0.76 | learning rate: 1.107E-04 | global batch size: 256 | lm loss: 3.859402E+00 | grad norm: 1.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.689 | TFLOPs: 20.37 | 31: iteration 970/ 173500 | consumed samples: 248320 | consumed tokens: 508559360 | elapsed time per iteration (s): 0.76 | learning rate: 1.118E-04 | global batch size: 256 | lm loss: 3.844562E+00 | grad norm: 1.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.668 | TFLOPs: 20.43 | 31: iteration 980/ 173500 | consumed samples: 250880 | consumed tokens: 513802240 | elapsed time per iteration (s): 0.76 | learning rate: 
1.130E-04 | global batch size: 256 | lm loss: 3.771047E+00 | grad norm: 0.983 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.852 | TFLOPs: 20.50 | 31: iteration 990/ 173500 | consumed samples: 253440 | consumed tokens: 519045120 | elapsed time per iteration (s): 0.78 | learning rate: 1.141E-04 | global batch size: 256 | lm loss: 3.776937E+00 | grad norm: 1.096 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.274 | TFLOPs: 19.86 | 31: iteration 1000/ 173500 | consumed samples: 256000 | consumed tokens: 524288000 | elapsed time per iteration (s): 0.77 | learning rate: 1.153E-04 | global batch size: 256 | lm loss: 3.738673E+00 | grad norm: 1.062 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.428 | TFLOPs: 19.99 | 31: ------------------------------------------------------------------------------------------ 31: valid loss at iteration 1000 | lm loss value: 3.669847E+00 | lm loss PPL: 3.924591E+01 | 31: ------------------------------------------------------------------------------------------ 0: saving checkpoint at iteration 1000 to checkpoints_1b1long 0: [2022-11-25 17:46:13,797] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step1000 is begin to save! 0: [2022-11-25 17:46:13,925] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_01-model_00-model_states.pt... 0: [2022-11-25 17:46:14,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_01-model_00-model_states.pt. 0: [2022-11-25 17:46:14,127] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_03-model_00-model_states.pt... 
0: [2022-11-25 17:46:14,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_03-model_00-model_states.pt. 0: [2022-11-25 17:46:14,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_04-model_00-model_states.pt... 0: [2022-11-25 17:46:14,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_04-model_00-model_states.pt. 0: [2022-11-25 17:46:14,305] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_05-model_00-model_states.pt... 0: [2022-11-25 17:46:14,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_05-model_00-model_states.pt. 0: [2022-11-25 17:46:14,375] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_06-model_00-model_states.pt... 0: [2022-11-25 17:46:14,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_06-model_00-model_states.pt. 0: [2022-11-25 17:46:14,448] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_07-model_00-model_states.pt... 0: [2022-11-25 17:46:14,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_07-model_00-model_states.pt. 0: [2022-11-25 17:46:14,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_08-model_00-model_states.pt... 0: [2022-11-25 17:46:14,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_08-model_00-model_states.pt. 0: [2022-11-25 17:46:14,598] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_09-model_00-model_states.pt... 
0: [2022-11-25 17:46:14,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_09-model_00-model_states.pt. 0: [2022-11-25 17:46:14,674] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_10-model_00-model_states.pt... 0: [2022-11-25 17:46:14,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_10-model_00-model_states.pt. 0: [2022-11-25 17:46:14,747] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_11-model_00-model_states.pt... 0: [2022-11-25 17:46:14,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_11-model_00-model_states.pt. 0: [2022-11-25 17:46:14,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_12-model_00-model_states.pt... 0: [2022-11-25 17:46:14,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_12-model_00-model_states.pt. 0: [2022-11-25 17:46:14,942] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_13-model_00-model_states.pt... 0: [2022-11-25 17:46:15,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_13-model_00-model_states.pt. 0: [2022-11-25 17:46:15,017] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_14-model_00-model_states.pt... 0: [2022-11-25 17:46:15,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_14-model_00-model_states.pt. 0: [2022-11-25 17:46:15,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_15-model_00-model_states.pt... 
0: [2022-11-25 17:46:15,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_15-model_00-model_states.pt. 0: [2022-11-25 17:46:15,226] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_16-model_00-model_states.pt... 0: [2022-11-25 17:46:15,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_16-model_00-model_states.pt. 0: [2022-11-25 17:46:15,299] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_17-model_00-model_states.pt... 0: [2022-11-25 17:46:15,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_17-model_00-model_states.pt. 0: [2022-11-25 17:46:15,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_18-model_00-model_states.pt... 0: [2022-11-25 17:46:15,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_18-model_00-model_states.pt. 0: [2022-11-25 17:46:15,447] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_19-model_00-model_states.pt... 0: [2022-11-25 17:46:15,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_19-model_00-model_states.pt. 0: [2022-11-25 17:46:15,517] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_20-model_00-model_states.pt... 0: [2022-11-25 17:46:15,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_20-model_00-model_states.pt. 0: [2022-11-25 17:46:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_21-model_00-model_states.pt... 
0: [2022-11-25 17:46:15,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_21-model_00-model_states.pt. 0: [2022-11-25 17:46:15,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_22-model_00-model_states.pt... 0: [2022-11-25 17:46:15,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_22-model_00-model_states.pt. 0: [2022-11-25 17:46:15,787] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_23-model_00-model_states.pt... 0: [2022-11-25 17:46:15,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_23-model_00-model_states.pt. 0: [2022-11-25 17:46:15,863] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_24-model_00-model_states.pt... 0: [2022-11-25 17:46:15,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_24-model_00-model_states.pt. 0: [2022-11-25 17:46:15,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_25-model_00-model_states.pt... 0: [2022-11-25 17:46:16,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_25-model_00-model_states.pt. 0: [2022-11-25 17:46:16,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_26-model_00-model_states.pt... 0: [2022-11-25 17:46:16,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_26-model_00-model_states.pt. 0: [2022-11-25 17:46:16,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_27-model_00-model_states.pt... 
0: [2022-11-25 17:46:16,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_27-model_00-model_states.pt. 0: [2022-11-25 17:46:16,198] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_28-model_00-model_states.pt... 0: [2022-11-25 17:46:16,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_28-model_00-model_states.pt. 0: [2022-11-25 17:46:16,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/layer_30-model_00-model_states.pt... 0: [2022-11-25 17:46:16,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/layer_30-model_00-model_states.pt. 0: [2022-11-25 17:46:16,275] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step1000/mp_rank_00_model_states.pt 0: [2022-11-25 17:46:16,275] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/mp_rank_00_model_states.pt... 0: [2022-11-25 17:46:16,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/mp_rank_00_model_states.pt. 0: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
6: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
9: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
2: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
12: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 
23: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 
24: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 
29: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-25 17:46:16,352] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-25 17:46:16,352] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 
18: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
5: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
8: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
12: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 
23: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 
14: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-25 17:46:16,352] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-25 17:46:16,352] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 
30: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 
18: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 
0: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
8: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
3: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 23: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 
11: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 14: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-25 17:46:16,352] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 
18: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
7: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 8: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 15: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 23: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 
11: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 22: [2022-11-25 17:46:16,352] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-25 17:46:16,352] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-25 17:46:16,352] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 18: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 0: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
5: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 10: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 3: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 11: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 17: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 18: [2022-11-25 17:46:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 0: [2022-11-25 17:46:16,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
0: [2022-11-25 17:46:16,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 17:46:16,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 1: [2022-11-25 17:46:16,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 0: [2022-11-25 17:46:16,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 1: [2022-11-25 17:46:16,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 17:46:16,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 23: [2022-11-25 17:46:16,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-25 17:46:16,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-25 17:46:16,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 1: [2022-11-25 17:46:16,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 17:46:16,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 17:46:16,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
16: [2022-11-25 17:46:16,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-25 17:46:16,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-25 17:46:16,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 26: [2022-11-25 17:46:16,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-25 17:46:16,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-25 17:46:16,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 29: [2022-11-25 17:46:16,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-25 17:46:16,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-25 17:46:16,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 23: [2022-11-25 17:46:16,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-25 17:46:16,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 18: [2022-11-25 17:46:16,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
23: [2022-11-25 17:46:16,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 18: [2022-11-25 17:46:16,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 28: [2022-11-25 17:46:16,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 18: [2022-11-25 17:46:16,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 28: [2022-11-25 17:46:16,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-25 17:46:16,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 31: [2022-11-25 17:46:16,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-25 17:46:16,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-25 17:46:16,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 27: [2022-11-25 17:46:16,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 14: [2022-11-25 17:46:16,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
27: [2022-11-25 17:46:16,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-25 17:46:16,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 14: [2022-11-25 17:46:16,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 17:46:16,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 11: [2022-11-25 17:46:16,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-25 17:46:16,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 17:46:16,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 16: [2022-11-25 17:46:16,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-25 17:46:16,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-25 17:46:16,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 4: [2022-11-25 17:46:16,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-25 17:46:16,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 17:46:16,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 19: [2022-11-25 17:46:16,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-25 17:46:16,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 26: [2022-11-25 17:46:16,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-25 17:46:16,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-25 17:46:16,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 19: [2022-11-25 17:46:16,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 18: [2022-11-25 17:46:16,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-25 17:46:16,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-25 17:46:16,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 23: [2022-11-25 17:46:16,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-25 17:46:16,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 18: [2022-11-25 17:46:16,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 23: [2022-11-25 17:46:16,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 18: [2022-11-25 17:46:16,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-25 17:46:16,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 11: [2022-11-25 17:46:16,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 19: [2022-11-25 17:46:16,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 11: [2022-11-25 17:46:16,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 17:46:16,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 19: [2022-11-25 17:46:16,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-25 17:46:16,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 28: [2022-11-25 17:46:16,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
28: [2022-11-25 17:46:16,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-25 17:46:16,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 2: [2022-11-25 17:46:16,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 10: [2022-11-25 17:46:16,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 2: [2022-11-25 17:46:16,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 17:46:16,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 10: [2022-11-25 17:46:16,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 17:46:16,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 16: [2022-11-25 17:46:16,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-25 17:46:16,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-25 17:46:16,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 12: [2022-11-25 17:46:16,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-25 17:46:16,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 26: [2022-11-25 17:46:16,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 12: [2022-11-25 17:46:16,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 26: [2022-11-25 17:46:16,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-25 17:46:16,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 30: [2022-11-25 17:46:16,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-25 17:46:16,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-25 17:46:16,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 13: [2022-11-25 17:46:16,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 14: [2022-11-25 17:46:16,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
13: [2022-11-25 17:46:16,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 14: [2022-11-25 17:46:16,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 13: [2022-11-25 17:46:16,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 14: [2022-11-25 17:46:16,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 8: [2022-11-25 17:46:16,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-25 17:46:16,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 17:46:16,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 5: [2022-11-25 17:46:16,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 17:46:16,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 17:46:16,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 27: [2022-11-25 17:46:16,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 2: [2022-11-25 17:46:16,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
27: [2022-11-25 17:46:16,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-25 17:46:16,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 2: [2022-11-25 17:46:16,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 31: [2022-11-25 17:46:16,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-25 17:46:16,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-25 17:46:16,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 2: [2022-11-25 17:46:16,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 25: [2022-11-25 17:46:16,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-25 17:46:16,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-25 17:46:16,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-25 17:46:16,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-25 17:46:16,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
25: [2022-11-25 17:46:16,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 17: [2022-11-25 17:46:16,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-25 17:46:16,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 22: [2022-11-25 17:46:16,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 17: [2022-11-25 17:46:16,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 22: [2022-11-25 17:46:16,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-25 17:46:16,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 11: [2022-11-25 17:46:16,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 10: [2022-11-25 17:46:16,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 11: [2022-11-25 17:46:16,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 10: [2022-11-25 17:46:16,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 11: [2022-11-25 17:46:16,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
26: [2022-11-25 17:46:16,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 10: [2022-11-25 17:46:16,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 11: [2022-11-25 17:46:16,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 26: [2022-11-25 17:46:16,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 11: [2022-11-25 17:46:16,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 17:46:16,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 26: [2022-11-25 17:46:16,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 3: [2022-11-25 17:46:16,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 17:46:16,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 17:46:16,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 17:46:16,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
3: [2022-11-25 17:46:16,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 17:46:16,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 13: [2022-11-25 17:46:16,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 17:46:16,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 19: [2022-11-25 17:46:16,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 13: [2022-11-25 17:46:16,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 19: [2022-11-25 17:46:16,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-25 17:46:16,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 30: [2022-11-25 17:46:16,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-25 17:46:16,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-25 17:46:16,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 15: [2022-11-25 17:46:16,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-25 17:46:16,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 17:46:16,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 17:46:16,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 14: [2022-11-25 17:46:16,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 15: [2022-11-25 17:46:16,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 17:46:16,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 14: [2022-11-25 17:46:16,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 17:46:16,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 15: [2022-11-25 17:46:16,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-25 17:46:16,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 10: [2022-11-25 17:46:16,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 15: [2022-11-25 17:46:16,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
9: [2022-11-25 17:46:16,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 10: [2022-11-25 17:46:16,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 9: [2022-11-25 17:46:16,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 10: [2022-11-25 17:46:16,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 24: [2022-11-25 17:46:16,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 9: [2022-11-25 17:46:16,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 24: [2022-11-25 17:46:16,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-25 17:46:16,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 1: [2022-11-25 17:46:16,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 6: [2022-11-25 17:46:16,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 1: [2022-11-25 17:46:16,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 17:46:16,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
6: [2022-11-25 17:46:16,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 17:46:16,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 17:46:16,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 17:46:16,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 6: [2022-11-25 17:46:16,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 0: [2022-11-25 17:46:16,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 17:46:16,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 17:46:16,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 0: [2022-11-25 17:46:16,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 17:46:16,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-25 17:46:16,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 17:46:16,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 17:46:16,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 0: [2022-11-25 17:46:16,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 1: [2022-11-25 17:46:16,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 18: [2022-11-25 17:46:16,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-25 17:46:16,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 1: [2022-11-25 17:46:16,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 18: [2022-11-25 17:46:16,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 1: [2022-11-25 17:46:16,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 7: [2022-11-25 17:46:16,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-25 17:46:16,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 17:46:16,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 17:46:16,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 7: [2022-11-25 17:46:16,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 17:46:16,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 24: [2022-11-25 17:46:16,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 5: [2022-11-25 17:46:16,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 17:46:16,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 24: [2022-11-25 17:46:16,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 5: [2022-11-25 17:46:16,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 24: [2022-11-25 17:46:16,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 4: [2022-11-25 17:46:16,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-25 17:46:16,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 17:46:16,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 2: [2022-11-25 17:46:16,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 17:46:16,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 17:46:16,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 3: [2022-11-25 17:46:16,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 17:46:16,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 17:46:16,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 7: [2022-11-25 17:46:16,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 17:46:16,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 29: [2022-11-25 17:46:16,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 7: [2022-11-25 17:46:16,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
29: [2022-11-25 17:46:16,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-25 17:46:16,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 30: [2022-11-25 17:46:16,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-25 17:46:16,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-25 17:46:16,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 25: [2022-11-25 17:46:16,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-25 17:46:16,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-25 17:46:16,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 
25: [2022-11-25 17:46:16,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-25 17:46:16,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-25 17:46:16,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-25 17:46:16,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 25: [2022-11-25 17:46:16,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 25: [2022-11-25 17:46:16,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 30: [2022-11-25 17:46:16,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-25 17:46:16,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-25 17:46:16,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 28: [2022-11-25 17:46:16,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-25 17:46:16,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-25 17:46:16,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
7: [2022-11-25 17:46:16,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 17:46:16,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 31: [2022-11-25 17:46:16,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-25 17:46:16,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 7: [2022-11-25 17:46:16,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 31: [2022-11-25 17:46:16,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-25 17:46:16,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-25 17:46:16,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 31: [2022-11-25 17:46:16,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 14: [2022-11-25 17:46:16,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 21: [2022-11-25 17:46:16,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
21: [2022-11-25 17:46:16,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 14: [2022-11-25 17:46:16,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 21: [2022-11-25 17:46:16,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 14: [2022-11-25 17:46:16,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 2: [2022-11-25 17:46:16,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 17:46:16,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 17:46:16,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 12: [2022-11-25 17:46:16,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-25 17:46:16,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-25 17:46:16,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 17:46:16,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 17:46:16,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
12: [2022-11-25 17:46:16,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 2: [2022-11-25 17:46:16,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 17:46:16,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 17:46:16,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 3: [2022-11-25 17:46:16,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 17:46:16,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 17:46:16,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 19: [2022-11-25 17:46:16,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-25 17:46:16,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-25 17:46:16,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 5: [2022-11-25 17:46:16,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-25 17:46:16,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 17:46:16,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 4: [2022-11-25 17:46:16,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 17:46:16,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 17:46:16,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 21: [2022-11-25 17:46:16,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-25 17:46:16,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-25 17:46:16,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 12: [2022-11-25 17:46:16,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 21: [2022-11-25 17:46:16,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-25 17:46:16,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
12: [2022-11-25 17:46:16,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 21: [2022-11-25 17:46:16,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 12: [2022-11-25 17:46:16,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 20: [2022-11-25 17:46:16,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-25 17:46:16,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-25 17:46:16,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-25 17:46:16,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-25 17:46:16,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-25 17:46:16,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-25 17:46:16,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 20: [2022-11-25 17:46:16,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 20: [2022-11-25 17:46:16,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
16: [2022-11-25 17:46:16,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 11: [2022-11-25 17:46:16,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 16: [2022-11-25 17:46:16,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-25 17:46:16,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 11: [2022-11-25 17:46:16,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 17:46:16,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 4: [2022-11-25 17:46:16,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 17:46:16,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 17:46:16,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 27: [2022-11-25 17:46:16,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 18: [2022-11-25 17:46:16,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 27: [2022-11-25 17:46:16,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
27: [2022-11-25 17:46:16,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-25 17:46:16,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 18: [2022-11-25 17:46:16,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 27: [2022-11-25 17:46:16,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 18: [2022-11-25 17:46:16,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 27: [2022-11-25 17:46:16,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 5: [2022-11-25 17:46:16,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 17:46:16,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 17:46:16,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 3: [2022-11-25 17:46:16,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 17:46:16,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 17:46:16,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
0: [2022-11-25 17:46:16,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 17:46:16,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 23: [2022-11-25 17:46:16,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-25 17:46:16,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-25 17:46:16,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 11: [2022-11-25 17:46:16,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-25 17:46:16,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 17:46:16,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 31: [2022-11-25 17:46:16,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-25 17:46:16,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-25 17:46:16,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 15: [2022-11-25 17:46:16,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-25 17:46:16,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 17:46:16,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 3: [2022-11-25 17:46:16,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 17:46:16,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 17:46:16,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 22: [2022-11-25 17:46:16,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-25 17:46:16,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 8: [2022-11-25 17:46:16,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 22: [2022-11-25 17:46:16,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 8: [2022-11-25 17:46:16,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 17:46:16,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 21: [2022-11-25 17:46:16,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-25 17:46:16,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-25 17:46:16,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 22: [2022-11-25 17:46:16,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 21: [2022-11-25 17:46:16,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 16: [2022-11-25 17:46:16,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 22: [2022-11-25 17:46:16,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-25 17:46:16,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 21: [2022-11-25 17:46:16,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-25 17:46:16,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 16: [2022-11-25 17:46:16,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-25 17:46:16,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 29: [2022-11-25 17:46:16,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
22: [2022-11-25 17:46:16,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 29: [2022-11-25 17:46:16,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 22: [2022-11-25 17:46:16,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 9: [2022-11-25 17:46:16,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 22: [2022-11-25 17:46:16,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 29: [2022-11-25 17:46:16,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 9: [2022-11-25 17:46:16,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 17:46:16,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 22: [2022-11-25 17:46:16,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-25 17:46:16,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-25 17:46:16,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 27: [2022-11-25 17:46:16,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
27: [2022-11-25 17:46:16,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-25 17:46:16,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 29: [2022-11-25 17:46:16,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-25 17:46:16,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-25 17:46:16,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 10: [2022-11-25 17:46:16,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-25 17:46:16,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 17:46:16,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 15: [2022-11-25 17:46:16,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-25 17:46:16,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 17:46:16,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 7: [2022-11-25 17:46:16,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
5: [2022-11-25 17:46:16,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 7: [2022-11-25 17:46:16,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 5: [2022-11-25 17:46:16,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 7: [2022-11-25 17:46:16,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 5: [2022-11-25 17:46:16,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 28: [2022-11-25 17:46:16,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-25 17:46:16,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 9: [2022-11-25 17:46:16,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-25 17:46:16,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 20: [2022-11-25 17:46:16,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-25 17:46:16,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
9: [2022-11-25 17:46:16,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 17:46:16,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 20: [2022-11-25 17:46:16,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-25 17:46:16,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 9: [2022-11-25 17:46:16,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 9: [2022-11-25 17:46:16,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 20: [2022-11-25 17:46:16,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 20: [2022-11-25 17:46:16,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 16: [2022-11-25 17:46:16,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-25 17:46:16,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-25 17:46:16,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 28: [2022-11-25 17:46:16,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
13: [2022-11-25 17:46:16,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-25 17:46:16,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 17:46:16,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 13: [2022-11-25 17:46:16,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-25 17:46:16,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 4: [2022-11-25 17:46:16,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 13: [2022-11-25 17:46:16,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 4: [2022-11-25 17:46:16,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 17:46:16,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 13: [2022-11-25 17:46:16,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-25 17:46:16,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 17:46:16,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
14: [2022-11-25 17:46:16,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-25 17:46:16,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 17:46:16,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 10: [2022-11-25 17:46:16,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 17:46:16,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 17:46:16,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 23: [2022-11-25 17:46:16,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-25 17:46:16,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-25 17:46:16,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 8: [2022-11-25 17:46:16,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 17:46:16,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 17:46:16,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
19: [2022-11-25 17:46:16,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-25 17:46:16,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-25 17:46:16,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 17: [2022-11-25 17:46:16,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-25 17:46:16,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-25 17:46:16,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 6: [2022-11-25 17:46:16,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 28: [2022-11-25 17:46:16,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 6: [2022-11-25 17:46:16,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 17:46:16,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 28: [2022-11-25 17:46:16,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-25 17:46:16,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
17: [2022-11-25 17:46:16,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-25 17:46:16,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-25 17:46:16,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 15: [2022-11-25 17:46:16,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 17:46:16,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 17:46:16,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 26: [2022-11-25 17:46:16,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-25 17:46:16,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-25 17:46:16,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 4: [2022-11-25 17:46:16,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 16: [2022-11-25 17:46:16,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 
4: [2022-11-25 17:46:16,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 17:46:16,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 16: [2022-11-25 17:46:16,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-25 17:46:16,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 22: [2022-11-25 17:46:16,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 20: [2022-11-25 17:46:16,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 22: [2022-11-25 17:46:16,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 20: [2022-11-25 17:46:16,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 22: [2022-11-25 17:46:16,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 20: [2022-11-25 17:46:16,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 6: [2022-11-25 17:46:16,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-25 17:46:16,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 17:46:16,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 6: [2022-11-25 17:46:16,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 17:46:16,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 17:46:16,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 0: [2022-11-25 17:46:16,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 17:46:16,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 17:46:16,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 24: [2022-11-25 17:46:16,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 12: [2022-11-25 17:46:16,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 17:46:16,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 17:46:16,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
24: [2022-11-25 17:46:16,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-25 17:46:16,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 20: [2022-11-25 17:46:16,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 5: [2022-11-25 17:46:16,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 20: [2022-11-25 17:46:16,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 29: [2022-11-25 17:46:16,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 5: [2022-11-25 17:46:16,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 20: [2022-11-25 17:46:16,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 29: [2022-11-25 17:46:16,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 5: [2022-11-25 17:46:16,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 29: [2022-11-25 17:46:16,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 1: [2022-11-25 17:46:16,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-25 17:46:16,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 17:46:16,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 8: [2022-11-25 17:46:16,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 21: [2022-11-25 17:46:16,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-25 17:46:16,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 8: [2022-11-25 17:46:16,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 21: [2022-11-25 17:46:16,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 8: [2022-11-25 17:46:16,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 17: [2022-11-25 17:46:16,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-25 17:46:16,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-25 17:46:16,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 6: [2022-11-25 17:46:16,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-25 17:46:16,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 17:46:16,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 7: [2022-11-25 17:46:16,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 17:46:16,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 17:46:16,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 0: [2022-11-25 17:46:16,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 17:46:16,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 17:46:16,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 1: [2022-11-25 17:46:16,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 17:46:16,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 17:46:16,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 4: [2022-11-25 17:46:16,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-25 17:46:16,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 17:46:16,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 8: [2022-11-25 17:46:16,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-25 17:46:16,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 17:46:16,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 15: [2022-11-25 17:46:16,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-25 17:46:16,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 17:46:16,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 30: [2022-11-25 17:46:16,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-25 17:46:16,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-25 17:46:16,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 8: [2022-11-25 17:46:16,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-25 17:46:16,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 17:46:16,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 20: [2022-11-25 17:46:16,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-25 17:46:16,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-25 17:46:16,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 0: [2022-11-25 17:46:16,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 17:46:16,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 27: [2022-11-25 17:46:16,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 0: [2022-11-25 17:46:16,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 27: [2022-11-25 17:46:16,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-25 17:46:16,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 26: [2022-11-25 17:46:16,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-25 17:46:16,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-25 17:46:16,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 2: [2022-11-25 17:46:16,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 17:46:16,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 17:46:16,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 17: [2022-11-25 17:46:16,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-25 17:46:16,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-25 17:46:16,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 12: [2022-11-25 17:46:16,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 17:46:16,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 17:46:16,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 13: [2022-11-25 17:46:16,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-25 17:46:16,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 17:46:16,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 31: [2022-11-25 17:46:16,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-25 17:46:16,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-25 17:46:16,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 4: [2022-11-25 17:46:16,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 17:46:16,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 17:46:16,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 10: [2022-11-25 17:46:16,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 17:46:16,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 17:46:16,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 23: [2022-11-25 17:46:16,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
23: [2022-11-25 17:46:16,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-25 17:46:16,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 9: [2022-11-25 17:46:16,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-25 17:46:16,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 17:46:16,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 14: [2022-11-25 17:46:16,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-25 17:46:16,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 17:46:16,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 30: [2022-11-25 17:46:16,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-25 17:46:16,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-25 17:46:16,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 11: [2022-11-25 17:46:16,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
9: [2022-11-25 17:46:16,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 11: [2022-11-25 17:46:16,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 17:46:16,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 9: [2022-11-25 17:46:16,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 17:46:16,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 6: [2022-11-25 17:46:16,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 17:46:16,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 17:46:16,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 16: [2022-11-25 17:46:16,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-25 17:46:16,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-25 17:46:16,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 19: [2022-11-25 17:46:16,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
19: [2022-11-25 17:46:16,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-25 17:46:16,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 29: [2022-11-25 17:46:16,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-25 17:46:16,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-25 17:46:16,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 18: [2022-11-25 17:46:16,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-25 17:46:16,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-25 17:46:16,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 3: [2022-11-25 17:46:16,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 25: [2022-11-25 17:46:16,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 3: [2022-11-25 17:46:16,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 17:46:16,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
25: [2022-11-25 17:46:16,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-25 17:46:16,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 8: [2022-11-25 17:46:16,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-25 17:46:16,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 17:46:16,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 27: [2022-11-25 17:46:16,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 28: [2022-11-25 17:46:16,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 27: [2022-11-25 17:46:16,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-25 17:46:16,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 7: [2022-11-25 17:46:16,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 21: [2022-11-25 17:46:16,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
7: [2022-11-25 17:46:16,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 1: [2022-11-25 17:46:16,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 21: [2022-11-25 17:46:16,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 7: [2022-11-25 17:46:16,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 21: [2022-11-25 17:46:16,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 1: [2022-11-25 17:46:16,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 17:46:16,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 15: [2022-11-25 17:46:16,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-25 17:46:16,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 17:46:16,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 25: [2022-11-25 17:46:16,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
25: [2022-11-25 17:46:16,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-25 17:46:16,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 24: [2022-11-25 17:46:16,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-25 17:46:16,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-25 17:46:16,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 17: [2022-11-25 17:46:16,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-25 17:46:16,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-25 17:46:16,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 22: [2022-11-25 17:46:16,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-25 17:46:16,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-25 17:46:16,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 24: [2022-11-25 17:46:16,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
28: [2022-11-25 17:46:16,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 24: [2022-11-25 17:46:16,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 28: [2022-11-25 17:46:16,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 24: [2022-11-25 17:46:16,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 31: [2022-11-25 17:46:16,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-25 17:46:16,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-25 17:46:16,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 26: [2022-11-25 17:46:16,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-25 17:46:16,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-25 17:46:16,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 17: [2022-11-25 17:46:16,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 
17: [2022-11-25 17:46:16,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-25 17:46:16,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 5: [2022-11-25 17:46:16,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 17:46:16,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 17:46:16,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 18: [2022-11-25 17:46:16,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-25 17:46:16,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-25 17:46:16,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 28: [2022-11-25 17:46:16,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-25 17:46:16,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-25 17:46:16,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 30: [2022-11-25 17:46:16,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-25 17:46:16,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-25 17:46:16,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 3: [2022-11-25 17:46:16,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 12: [2022-11-25 17:46:16,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 3: [2022-11-25 17:46:16,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 17:46:16,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 12: [2022-11-25 17:46:16,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 17:46:16,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 22: [2022-11-25 17:46:16,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-25 17:46:16,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-25 17:46:16,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 13: [2022-11-25 17:46:16,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-25 17:46:16,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 17:46:16,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 11: [2022-11-25 17:46:16,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-25 17:46:16,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 17:46:16,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 6: [2022-11-25 17:46:16,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 31: [2022-11-25 17:46:16,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 6: [2022-11-25 17:46:16,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 12: [2022-11-25 17:46:16,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 31: [2022-11-25 17:46:16,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 6: [2022-11-25 17:46:16,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 31: [2022-11-25 17:46:16,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
12: [2022-11-25 17:46:16,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 17:46:16,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 2: [2022-11-25 17:46:16,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 17:46:16,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 17:46:16,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 10: [2022-11-25 17:46:16,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 27: [2022-11-25 17:46:16,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 10: [2022-11-25 17:46:16,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 27: [2022-11-25 17:46:16,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 10: [2022-11-25 17:46:16,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 27: [2022-11-25 17:46:16,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 28: [2022-11-25 17:46:16,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-25 17:46:16,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 24: [2022-11-25 17:46:16,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-25 17:46:16,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-25 17:46:16,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 7: [2022-11-25 17:46:16,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 17:46:16,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 17:46:16,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 30: [2022-11-25 17:46:16,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-25 17:46:16,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-25 17:46:16,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 19: [2022-11-25 17:46:16,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
19: [2022-11-25 17:46:16,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 23: [2022-11-25 17:46:16,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 19: [2022-11-25 17:46:16,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 21: [2022-11-25 17:46:16,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 23: [2022-11-25 17:46:16,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-25 17:46:16,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 21: [2022-11-25 17:46:16,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-25 17:46:16,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 1: [2022-11-25 17:46:16,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 29: [2022-11-25 17:46:16,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
1: [2022-11-25 17:46:16,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 29: [2022-11-25 17:46:16,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-25 17:46:16,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 1: [2022-11-25 17:46:16,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 18: [2022-11-25 17:46:16,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-25 17:46:16,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-25 17:46:16,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 28: [2022-11-25 17:46:16,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 8: [2022-11-25 17:46:16,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-25 17:46:16,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 17:46:16,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 23: [2022-11-25 17:46:16,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-25 17:46:16,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-25 17:46:16,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 17: [2022-11-25 17:46:16,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-25 17:46:16,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-25 17:46:16,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 9: [2022-11-25 17:46:16,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-25 17:46:16,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 17:46:16,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 17:46:16,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 17:46:16,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 9: [2022-11-25 17:46:16,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 24: [2022-11-25 17:46:16,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 
24: [2022-11-25 17:46:16,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-25 17:46:16,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 25: [2022-11-25 17:46:16,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-25 17:46:16,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-25 17:46:16,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 13: [2022-11-25 17:46:16,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 14: [2022-11-25 17:46:16,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-25 17:46:16,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 17:46:16,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 26: [2022-11-25 17:46:16,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-25 17:46:16,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-25 17:46:16,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
13: [2022-11-25 17:46:16,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 17:46:16,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 2: [2022-11-25 17:46:16,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 17:46:16,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 17:46:16,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 14: [2022-11-25 17:46:16,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 17:46:16,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 17:46:16,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 24: [2022-11-25 17:46:16,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-25 17:46:16,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-25 17:46:16,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 10: [2022-11-25 17:46:16,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-25 17:46:16,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 17:46:16,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 19: [2022-11-25 17:46:16,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-25 17:46:16,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-25 17:46:16,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 29: [2022-11-25 17:46:16,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-25 17:46:16,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-25 17:46:16,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 5: [2022-11-25 17:46:16,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 17:46:16,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step1000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 17:46:16,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
0: successfully saved checkpoint at iteration 1000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2841.22 31: iteration 1010/ 173500 | consumed samples: 258560 | consumed tokens: 529530880 | elapsed time per iteration (s): 1.09 | learning rate: 1.164E-04 | global batch size: 256 | lm loss: 3.756234E+00 | grad norm: 1.069 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.017 | TFLOPs: 14.22 | 31: iteration 1020/ 173500 | consumed samples: 261120 | consumed tokens: 534773760 | elapsed time per iteration (s): 0.77 | learning rate: 1.176E-04 | global batch size: 256 | lm loss: 3.708591E+00 | grad norm: 1.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.266 | TFLOPs: 20.16 | 31: iteration 1030/ 173500 | consumed samples: 263680 | consumed tokens: 540016640 | elapsed time per iteration (s): 0.80 | learning rate: 1.187E-04 | global batch size: 256 | lm loss: 3.685621E+00 | grad norm: 0.972 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.466 | TFLOPs: 19.45 | 31: iteration 1040/ 173500 | consumed samples: 266240 | consumed tokens: 545259520 | elapsed time per iteration (s): 0.83 | learning rate: 1.199E-04 | global batch size: 256 | lm loss: 3.726999E+00 | grad norm: 1.041 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.701 | TFLOPs: 18.55 | 31: iteration 1050/ 173500 | consumed samples: 268800 | consumed tokens: 550502400 | elapsed time per iteration (s): 0.75 | learning rate: 1.210E-04 | global batch size: 256 | lm loss: 3.679913E+00 | grad norm: 0.977 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.589 | TFLOPs: 20.73 | 31: iteration 1060/ 173500 | consumed samples: 271360 | consumed tokens: 555745280 | elapsed time per iteration (s): 0.84 | learning rate: 1.222E-04 | 
global batch size: 256 | lm loss: 3.622175E+00 | grad norm: 0.863 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.553 | TFLOPs: 18.49 | 31: iteration 1070/ 173500 | consumed samples: 273920 | consumed tokens: 560988160 | elapsed time per iteration (s): 0.78 | learning rate: 1.233E-04 | global batch size: 256 | lm loss: 3.598551E+00 | grad norm: 0.864 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.211 | TFLOPs: 19.92 | 31: iteration 1080/ 173500 | consumed samples: 276480 | consumed tokens: 566231040 | elapsed time per iteration (s): 0.81 | learning rate: 1.245E-04 | global batch size: 256 | lm loss: 3.531116E+00 | grad norm: 0.749 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.717 | TFLOPs: 19.22 | 31: iteration 1090/ 173500 | consumed samples: 279040 | consumed tokens: 571473920 | elapsed time per iteration (s): 0.77 | learning rate: 1.256E-04 | global batch size: 256 | lm loss: 3.565641E+00 | grad norm: 0.820 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.391 | TFLOPs: 20.17 | 31: iteration 1100/ 173500 | consumed samples: 281600 | consumed tokens: 576716800 | elapsed time per iteration (s): 0.79 | learning rate: 1.268E-04 | global batch size: 256 | lm loss: 3.589804E+00 | grad norm: 0.787 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.807 | TFLOPs: 19.65 | 31: iteration 1110/ 173500 | consumed samples: 284160 | consumed tokens: 581959680 | elapsed time per iteration (s): 0.77 | learning rate: 1.280E-04 | global batch size: 256 | lm loss: 3.526612E+00 | grad norm: 0.659 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.502 | TFLOPs: 20.24 | 31: iteration 1120/ 173500 | consumed samples: 286720 | consumed tokens: 
587202560 | elapsed time per iteration (s): 0.79 | learning rate: 1.291E-04 | global batch size: 256 | lm loss: 3.541742E+00 | grad norm: 0.714 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.238 | TFLOPs: 19.49 | 31: iteration 1130/ 173500 | consumed samples: 289280 | consumed tokens: 592445440 | elapsed time per iteration (s): 0.74 | learning rate: 1.303E-04 | global batch size: 256 | lm loss: 3.535495E+00 | grad norm: 0.683 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.731 | TFLOPs: 20.79 | 31: iteration 1140/ 173500 | consumed samples: 291840 | consumed tokens: 597688320 | elapsed time per iteration (s): 0.73 | learning rate: 1.314E-04 | global batch size: 256 | lm loss: 3.477818E+00 | grad norm: 1.038 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.688 | TFLOPs: 21.16 | 31: iteration 1150/ 173500 | consumed samples: 294400 | consumed tokens: 602931200 | elapsed time per iteration (s): 0.76 | learning rate: 1.326E-04 | global batch size: 256 | lm loss: 3.521810E+00 | grad norm: 0.721 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.936 | TFLOPs: 20.38 | 31: iteration 1160/ 173500 | consumed samples: 296960 | consumed tokens: 608174080 | elapsed time per iteration (s): 0.80 | learning rate: 1.337E-04 | global batch size: 256 | lm loss: 3.491086E+00 | grad norm: 0.686 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.836 | TFLOPs: 19.47 | 31: iteration 1170/ 173500 | consumed samples: 299520 | consumed tokens: 613416960 | elapsed time per iteration (s): 0.78 | learning rate: 1.349E-04 | global batch size: 256 | lm loss: 3.477417E+00 | grad norm: 0.627 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.451 | TFLOPs: 
19.75 | 31: iteration 1180/ 173500 | consumed samples: 302080 | consumed tokens: 618659840 | elapsed time per iteration (s): 0.79 | learning rate: 1.360E-04 | global batch size: 256 | lm loss: 3.493528E+00 | grad norm: 0.761 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.394 | TFLOPs: 19.69 | 31: iteration 1190/ 173500 | consumed samples: 304640 | consumed tokens: 623902720 | elapsed time per iteration (s): 0.75 | learning rate: 1.372E-04 | global batch size: 256 | lm loss: 3.486912E+00 | grad norm: 0.705 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.416 | TFLOPs: 20.59 | 31: iteration 1200/ 173500 | consumed samples: 307200 | consumed tokens: 629145600 | elapsed time per iteration (s): 0.75 | learning rate: 1.383E-04 | global batch size: 256 | lm loss: 3.441988E+00 | grad norm: 0.649 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.243 | TFLOPs: 20.52 | 31: iteration 1210/ 173500 | consumed samples: 309760 | consumed tokens: 634388480 | elapsed time per iteration (s): 0.77 | learning rate: 1.395E-04 | global batch size: 256 | lm loss: 3.423140E+00 | grad norm: 0.592 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.409 | TFLOPs: 20.23 | 31: iteration 1220/ 173500 | consumed samples: 312320 | consumed tokens: 639631360 | elapsed time per iteration (s): 0.76 | learning rate: 1.406E-04 | global batch size: 256 | lm loss: 3.352572E+00 | grad norm: 0.717 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.059 | TFLOPs: 20.51 | 31: iteration 1230/ 173500 | consumed samples: 314880 | consumed tokens: 644874240 | elapsed time per iteration (s): 0.77 | learning rate: 1.418E-04 | global batch size: 256 | lm loss: 3.403121E+00 | grad norm: 0.626 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 331.531 | TFLOPs: 20.06 | 31: iteration 1240/ 173500 | consumed samples: 317440 | consumed tokens: 650117120 | elapsed time per iteration (s): 0.73 | learning rate: 1.429E-04 | global batch size: 256 | lm loss: 3.373479E+00 | grad norm: 0.792 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.544 | TFLOPs: 21.21 | 31: iteration 1250/ 173500 | consumed samples: 320000 | consumed tokens: 655360000 | elapsed time per iteration (s): 0.87 | learning rate: 1.441E-04 | global batch size: 256 | lm loss: 3.374149E+00 | grad norm: 0.696 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.663 | TFLOPs: 17.83 | 31: iteration 1260/ 173500 | consumed samples: 322560 | consumed tokens: 660602880 | elapsed time per iteration (s): 0.82 | learning rate: 1.452E-04 | global batch size: 256 | lm loss: 3.304882E+00 | grad norm: 0.574 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.366 | TFLOPs: 18.96 | 31: iteration 1270/ 173500 | consumed samples: 325120 | consumed tokens: 665845760 | elapsed time per iteration (s): 0.83 | learning rate: 1.464E-04 | global batch size: 256 | lm loss: 3.365686E+00 | grad norm: 0.828 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.980 | TFLOPs: 18.63 | 31: iteration 1280/ 173500 | consumed samples: 327680 | consumed tokens: 671088640 | elapsed time per iteration (s): 0.72 | learning rate: 1.476E-04 | global batch size: 256 | lm loss: 3.379929E+00 | grad norm: 0.808 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.172 | TFLOPs: 21.37 | 31: iteration 1290/ 173500 | consumed samples: 330240 | consumed tokens: 676331520 | elapsed time per iteration (s): 0.78 | learning rate: 1.487E-04 | global batch size: 256 | 
lm loss: 3.247747E+00 | grad norm: 0.561 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.189 | TFLOPs: 19.92 | 31: iteration 1300/ 173500 | consumed samples: 332800 | consumed tokens: 681574400 | elapsed time per iteration (s): 0.86 | learning rate: 1.499E-04 | global batch size: 256 | lm loss: 3.316988E+00 | grad norm: 0.575 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.083 | TFLOPs: 18.09 | 31: iteration 1310/ 173500 | consumed samples: 335360 | consumed tokens: 686817280 | elapsed time per iteration (s): 0.77 | learning rate: 1.510E-04 | global batch size: 256 | lm loss: 3.287576E+00 | grad norm: 0.622 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.026 | TFLOPs: 20.21 | 31: iteration 1320/ 173500 | consumed samples: 337920 | consumed tokens: 692060160 | elapsed time per iteration (s): 0.84 | learning rate: 1.522E-04 | global batch size: 256 | lm loss: 3.284733E+00 | grad norm: 0.697 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.647 | TFLOPs: 18.49 | 31: iteration 1330/ 173500 | consumed samples: 340480 | consumed tokens: 697303040 | elapsed time per iteration (s): 0.78 | learning rate: 1.533E-04 | global batch size: 256 | lm loss: 3.282394E+00 | grad norm: 0.616 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.794 | TFLOPs: 19.89 | 31: iteration 1340/ 173500 | consumed samples: 343040 | consumed tokens: 702545920 | elapsed time per iteration (s): 0.75 | learning rate: 1.545E-04 | global batch size: 256 | lm loss: 3.291673E+00 | grad norm: 0.746 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.070 | TFLOPs: 20.57 | 31: iteration 1350/ 173500 | consumed samples: 345600 | consumed tokens: 707788800 | elapsed time 
per iteration (s): 0.83 | learning rate: 1.556E-04 | global batch size: 256 | lm loss: 3.269534E+00 | grad norm: 0.620 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.299 | TFLOPs: 18.77 | 31: iteration 1360/ 173500 | consumed samples: 348160 | consumed tokens: 713031680 | elapsed time per iteration (s): 0.78 | learning rate: 1.568E-04 | global batch size: 256 | lm loss: 3.220264E+00 | grad norm: 0.656 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.356 | TFLOPs: 19.86 | 31: iteration 1370/ 173500 | consumed samples: 350720 | consumed tokens: 718274560 | elapsed time per iteration (s): 0.76 | learning rate: 1.579E-04 | global batch size: 256 | lm loss: 3.259959E+00 | grad norm: 0.684 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.762 | TFLOPs: 20.43 | 31: iteration 1380/ 173500 | consumed samples: 353280 | consumed tokens: 723517440 | elapsed time per iteration (s): 0.73 | learning rate: 1.591E-04 | global batch size: 256 | lm loss: 3.259259E+00 | grad norm: 0.594 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.400 | TFLOPs: 21.08 | 31: iteration 1390/ 173500 | consumed samples: 355840 | consumed tokens: 728760320 | elapsed time per iteration (s): 0.79 | learning rate: 1.602E-04 | global batch size: 256 | lm loss: 3.246161E+00 | grad norm: 0.564 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.058 | TFLOPs: 19.48 | 31: iteration 1400/ 173500 | consumed samples: 358400 | consumed tokens: 734003200 | elapsed time per iteration (s): 0.76 | learning rate: 1.614E-04 | global batch size: 256 | lm loss: 3.278462E+00 | grad norm: 0.502 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.902 | TFLOPs: 20.38 | 31: iteration 1410/ 
173500 | consumed samples: 360960 | consumed tokens: 739246080 | elapsed time per iteration (s): 0.82 | learning rate: 1.625E-04 | global batch size: 256 | lm loss: 3.234595E+00 | grad norm: 0.584 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.563 | TFLOPs: 18.97 | 31: iteration 1420/ 173500 | consumed samples: 363520 | consumed tokens: 744488960 | elapsed time per iteration (s): 0.78 | learning rate: 1.637E-04 | global batch size: 256 | lm loss: 3.183151E+00 | grad norm: 0.601 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.842 | TFLOPs: 19.95 | 31: iteration 1430/ 173500 | consumed samples: 366080 | consumed tokens: 749731840 | elapsed time per iteration (s): 0.90 | learning rate: 1.648E-04 | global batch size: 256 | lm loss: 3.183851E+00 | grad norm: 0.602 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 283.415 | TFLOPs: 17.15 | 31: iteration 1440/ 173500 | consumed samples: 368640 | consumed tokens: 754974720 | elapsed time per iteration (s): 0.81 | learning rate: 1.660E-04 | global batch size: 256 | lm loss: 3.217923E+00 | grad norm: 0.799 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.285 | TFLOPs: 19.07 | 31: iteration 1450/ 173500 | consumed samples: 371200 | consumed tokens: 760217600 | elapsed time per iteration (s): 0.81 | learning rate: 1.671E-04 | global batch size: 256 | lm loss: 3.208133E+00 | grad norm: 0.546 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.474 | TFLOPs: 19.15 | 31: iteration 1460/ 173500 | consumed samples: 373760 | consumed tokens: 765460480 | elapsed time per iteration (s): 0.75 | learning rate: 1.683E-04 | global batch size: 256 | lm loss: 3.210076E+00 | grad norm: 0.689 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 342.118 | TFLOPs: 20.70 | 31: iteration 1470/ 173500 | consumed samples: 376320 | consumed tokens: 770703360 | elapsed time per iteration (s): 0.73 | learning rate: 1.695E-04 | global batch size: 256 | lm loss: 3.173593E+00 | grad norm: 0.662 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.014 | TFLOPs: 21.11 | 31: iteration 1480/ 173500 | consumed samples: 378880 | consumed tokens: 775946240 | elapsed time per iteration (s): 0.79 | learning rate: 1.706E-04 | global batch size: 256 | lm loss: 3.201689E+00 | grad norm: 0.631 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.522 | TFLOPs: 19.63 | 31: iteration 1490/ 173500 | consumed samples: 381440 | consumed tokens: 781189120 | elapsed time per iteration (s): 0.81 | learning rate: 1.718E-04 | global batch size: 256 | lm loss: 3.219464E+00 | grad norm: 0.587 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.548 | TFLOPs: 19.21 | 31: iteration 1500/ 173500 | consumed samples: 384000 | consumed tokens: 786432000 | elapsed time per iteration (s): 0.81 | learning rate: 1.729E-04 | global batch size: 256 | lm loss: 3.186907E+00 | grad norm: 0.558 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.678 | TFLOPs: 19.04 | 31: iteration 1510/ 173500 | consumed samples: 386560 | consumed tokens: 791674880 | elapsed time per iteration (s): 0.80 | learning rate: 1.741E-04 | global batch size: 256 | lm loss: 3.185997E+00 | grad norm: 0.618 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.613 | TFLOPs: 19.40 | 31: iteration 1520/ 173500 | consumed samples: 389120 | consumed tokens: 796917760 | elapsed time per iteration (s): 0.78 | learning rate: 1.752E-04 | global batch size: 256 | lm loss: 3.194712E+00 | grad 
norm: 0.638 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.964 | TFLOPs: 19.90 | 31: iteration 1530/ 173500 | consumed samples: 391680 | consumed tokens: 802160640 | elapsed time per iteration (s): 0.81 | learning rate: 1.764E-04 | global batch size: 256 | lm loss: 3.166024E+00 | grad norm: 0.478 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.827 | TFLOPs: 19.23 | 31: iteration 1540/ 173500 | consumed samples: 394240 | consumed tokens: 807403520 | elapsed time per iteration (s): 0.79 | learning rate: 1.775E-04 | global batch size: 256 | lm loss: 3.168983E+00 | grad norm: 0.470 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.732 | TFLOPs: 19.71 | 31: iteration 1550/ 173500 | consumed samples: 396800 | consumed tokens: 812646400 | elapsed time per iteration (s): 0.81 | learning rate: 1.787E-04 | global batch size: 256 | lm loss: 3.133713E+00 | grad norm: 0.544 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.166 | TFLOPs: 19.07 | 31: iteration 1560/ 173500 | consumed samples: 399360 | consumed tokens: 817889280 | elapsed time per iteration (s): 0.81 | learning rate: 1.798E-04 | global batch size: 256 | lm loss: 3.113584E+00 | grad norm: 0.500 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.153 | TFLOPs: 19.01 | 31: iteration 1570/ 173500 | consumed samples: 401920 | consumed tokens: 823132160 | elapsed time per iteration (s): 0.80 | learning rate: 1.810E-04 | global batch size: 256 | lm loss: 3.100190E+00 | grad norm: 0.429 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.173 | TFLOPs: 19.25 | 31: iteration 1580/ 173500 | consumed samples: 404480 | consumed tokens: 828375040 | elapsed time per iteration (s): 0.79 | 
learning rate: 1.821E-04 | global batch size: 256 | lm loss: 3.121869E+00 | grad norm: 0.420 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.147 | TFLOPs: 19.55 | 31: iteration 1590/ 173500 | consumed samples: 407040 | consumed tokens: 833617920 | elapsed time per iteration (s): 0.82 | learning rate: 1.833E-04 | global batch size: 256 | lm loss: 3.124240E+00 | grad norm: 0.506 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.004 | TFLOPs: 18.88 | 31: iteration 1600/ 173500 | consumed samples: 409600 | consumed tokens: 838860800 | elapsed time per iteration (s): 0.80 | learning rate: 1.844E-04 | global batch size: 256 | lm loss: 3.095133E+00 | grad norm: 0.583 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.164 | TFLOPs: 19.25 | 31: iteration 1610/ 173500 | consumed samples: 412160 | consumed tokens: 844103680 | elapsed time per iteration (s): 0.80 | learning rate: 1.856E-04 | global batch size: 256 | lm loss: 3.085391E+00 | grad norm: 0.498 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.882 | TFLOPs: 19.35 | 31: iteration 1620/ 173500 | consumed samples: 414720 | consumed tokens: 849346560 | elapsed time per iteration (s): 0.80 | learning rate: 1.867E-04 | global batch size: 256 | lm loss: 3.073762E+00 | grad norm: 0.516 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.367 | TFLOPs: 19.44 | 31: iteration 1630/ 173500 | consumed samples: 417280 | consumed tokens: 854589440 | elapsed time per iteration (s): 0.78 | learning rate: 1.879E-04 | global batch size: 256 | lm loss: 3.084845E+00 | grad norm: 0.416 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.975 | TFLOPs: 19.90 | 31: iteration 1640/ 173500 | consumed samples: 
419840 | consumed tokens: 859832320 | elapsed time per iteration (s): 0.87 | learning rate: 1.890E-04 | global batch size: 256 | lm loss: 3.056468E+00 | grad norm: 0.494 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.859 | TFLOPs: 17.84 | 31: iteration 1650/ 173500 | consumed samples: 422400 | consumed tokens: 865075200 | elapsed time per iteration (s): 0.80 | learning rate: 1.902E-04 | global batch size: 256 | lm loss: 3.116882E+00 | grad norm: 0.494 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.020 | TFLOPs: 19.30 | 31: iteration 1660/ 173500 | consumed samples: 424960 | consumed tokens: 870318080 | elapsed time per iteration (s): 0.74 | learning rate: 1.914E-04 | global batch size: 256 | lm loss: 3.045102E+00 | grad norm: 0.556 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.014 | TFLOPs: 20.99 | 31: iteration 1670/ 173500 | consumed samples: 427520 | consumed tokens: 875560960 | elapsed time per iteration (s): 0.75 | learning rate: 1.925E-04 | global batch size: 256 | lm loss: 3.077728E+00 | grad norm: 0.479 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.877 | TFLOPs: 20.68 | 31: iteration 1680/ 173500 | consumed samples: 430080 | consumed tokens: 880803840 | elapsed time per iteration (s): 0.78 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 3.044245E+00 | grad norm: 0.451 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.702 | TFLOPs: 19.95 | 31: iteration 1690/ 173500 | consumed samples: 432640 | consumed tokens: 886046720 | elapsed time per iteration (s): 0.77 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 3.089516E+00 | grad norm: 0.493 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 330.583 | TFLOPs: 20.00 | 31: iteration 1700/ 173500 | consumed samples: 435200 | consumed tokens: 891289600 | elapsed time per iteration (s): 0.81 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 3.083948E+00 | grad norm: 0.453 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.689 | TFLOPs: 19.10 | 31: iteration 1710/ 173500 | consumed samples: 437760 | consumed tokens: 896532480 | elapsed time per iteration (s): 0.75 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 3.051568E+00 | grad norm: 0.523 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.889 | TFLOPs: 20.62 | 31: iteration 1720/ 173500 | consumed samples: 440320 | consumed tokens: 901775360 | elapsed time per iteration (s): 0.78 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 3.047103E+00 | grad norm: 0.556 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.297 | TFLOPs: 19.98 | 31: iteration 1730/ 173500 | consumed samples: 442880 | consumed tokens: 907018240 | elapsed time per iteration (s): 0.74 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 3.035634E+00 | grad norm: 0.478 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.744 | TFLOPs: 20.92 | 31: iteration 1740/ 173500 | consumed samples: 445440 | consumed tokens: 912261120 | elapsed time per iteration (s): 0.75 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.070422E+00 | grad norm: 0.482 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.818 | TFLOPs: 20.68 | 31: iteration 1750/ 173500 | consumed samples: 448000 | consumed tokens: 917504000 | elapsed time per iteration (s): 0.79 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.041185E+00 | grad norm: 0.400 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.697 | TFLOPs: 19.64 | 31: iteration 1760/ 173500 | consumed samples: 450560 | consumed tokens: 922746880 | elapsed time per iteration (s): 0.75 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.067645E+00 | grad norm: 0.928 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.698 | TFLOPs: 20.67 | 31: iteration 1770/ 173500 | consumed samples: 453120 | consumed tokens: 927989760 | elapsed time per iteration (s): 0.76 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.092207E+00 | grad norm: 0.803 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.050 | TFLOPs: 20.39 | 31: iteration 1780/ 173500 | consumed samples: 455680 | consumed tokens: 933232640 | elapsed time per iteration (s): 0.77 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.104421E+00 | grad norm: 0.900 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.296 | TFLOPs: 20.16 | 31: iteration 1790/ 173500 | consumed samples: 458240 | consumed tokens: 938475520 | elapsed time per iteration (s): 0.75 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.148592E+00 | grad norm: 0.800 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.971 | TFLOPs: 20.75 | 31: iteration 1800/ 173500 | consumed samples: 460800 | consumed tokens: 943718400 | elapsed time per iteration (s): 0.74 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.050351E+00 | grad norm: 0.427 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.948 | TFLOPs: 20.87 | 31: iteration 1810/ 173500 | consumed samples: 463360 | consumed tokens: 948961280 | elapsed time per iteration (s): 0.79 | learning rate: 2.000E-04 | global 
batch size: 256 | lm loss: 3.053655E+00 | grad norm: 0.402 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.061 | TFLOPs: 19.60 | 31: iteration 1820/ 173500 | consumed samples: 465920 | consumed tokens: 954204160 | elapsed time per iteration (s): 0.81 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.018104E+00 | grad norm: 0.394 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.688 | TFLOPs: 19.16 | 31: iteration 1830/ 173500 | consumed samples: 468480 | consumed tokens: 959447040 | elapsed time per iteration (s): 0.74 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.998170E+00 | grad norm: 0.377 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.408 | TFLOPs: 20.84 | 31: iteration 1840/ 173500 | consumed samples: 471040 | consumed tokens: 964689920 | elapsed time per iteration (s): 0.75 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.993073E+00 | grad norm: 0.402 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.012 | TFLOPs: 20.63 | 31: iteration 1850/ 173500 | consumed samples: 473600 | consumed tokens: 969932800 | elapsed time per iteration (s): 0.78 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.995125E+00 | grad norm: 0.384 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.190 | TFLOPs: 19.85 | 31: iteration 1860/ 173500 | consumed samples: 476160 | consumed tokens: 975175680 | elapsed time per iteration (s): 0.75 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.951845E+00 | grad norm: 0.365 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.286 | TFLOPs: 20.59 | 31: iteration 1870/ 173500 | consumed samples: 478720 | consumed tokens: 
980418560 | elapsed time per iteration (s): 0.79 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.960922E+00 | grad norm: 0.404 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.053 | TFLOPs: 19.73 | 31: iteration 1880/ 173500 | consumed samples: 481280 | consumed tokens: 985661440 | elapsed time per iteration (s): 0.81 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.940913E+00 | grad norm: 0.390 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.259 | TFLOPs: 19.01 | 31: iteration 1890/ 173500 | consumed samples: 483840 | consumed tokens: 990904320 | elapsed time per iteration (s): 0.76 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.986229E+00 | grad norm: 0.386 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.462 | TFLOPs: 20.36 | 31: iteration 1900/ 173500 | consumed samples: 486400 | consumed tokens: 996147200 | elapsed time per iteration (s): 0.75 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.921597E+00 | grad norm: 0.422 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.248 | TFLOPs: 20.64 | 31: iteration 1910/ 173500 | consumed samples: 488960 | consumed tokens: 1001390080 | elapsed time per iteration (s): 0.73 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.977050E+00 | grad norm: 0.364 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.040 | TFLOPs: 21.12 | 31: iteration 1920/ 173500 | consumed samples: 491520 | consumed tokens: 1006632960 | elapsed time per iteration (s): 0.73 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.935286E+00 | grad norm: 0.350 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.113 | TFLOPs: 
21.24 | 31: iteration 1930/ 173500 | consumed samples: 494080 | consumed tokens: 1011875840 | elapsed time per iteration (s): 0.76 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.950245E+00 | grad norm: 0.370 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.017 | TFLOPs: 20.51 | 31: iteration 1940/ 173500 | consumed samples: 496640 | consumed tokens: 1017118720 | elapsed time per iteration (s): 0.74 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.940349E+00 | grad norm: 0.455 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.014 | TFLOPs: 20.99 | 31: iteration 1950/ 173500 | consumed samples: 499200 | consumed tokens: 1022361600 | elapsed time per iteration (s): 0.77 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.917379E+00 | grad norm: 0.359 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.575 | TFLOPs: 20.18 | 31: iteration 1960/ 173500 | consumed samples: 501760 | consumed tokens: 1027604480 | elapsed time per iteration (s): 0.77 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.972225E+00 | grad norm: 0.458 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.810 | TFLOPs: 20.01 | 31: iteration 1970/ 173500 | consumed samples: 504320 | consumed tokens: 1032847360 | elapsed time per iteration (s): 0.76 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.978897E+00 | grad norm: 0.418 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.749 | TFLOPs: 20.31 | 31: iteration 1980/ 173500 | consumed samples: 506880 | consumed tokens: 1038090240 | elapsed time per iteration (s): 0.77 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.943868E+00 | grad norm: 0.408 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 334.079 | TFLOPs: 20.21 | 31: iteration 1990/ 173500 | consumed samples: 509440 | consumed tokens: 1043333120 | elapsed time per iteration (s): 0.78 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.966101E+00 | grad norm: 0.367 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.132 | TFLOPs: 19.79 | 0: [2022-11-25 17:59:17,040] [INFO] [logging.py:68:log_dist] [Rank 0] step=2000, skipped=0, lr=[0.00019999894289482022, 0.00019999894289482022, 0.00019999894289482022], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 2000/ 173500 | consumed samples: 512000 | consumed tokens: 1048576000 | elapsed time per iteration (s): 0.75 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.950897E+00 | grad norm: 0.352 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.447 | TFLOPs: 20.54 | 0: steps: 2000 loss: 3.0262 iter time (s): 0.791 samples/sec: 323.489 31: ------------------------------------------------------------------------------------------ 31: valid loss at iteration 2000 | lm loss value: 2.837246E+00 | lm loss PPL: 1.706870E+01 | 31: ------------------------------------------------------------------------------------------ 0: saving checkpoint at iteration 2000 to checkpoints_1b1long 0: [2022-11-25 17:59:17,331] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step2000 is begin to save! 0: [2022-11-25 17:59:17,339] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_01-model_00-model_states.pt... 0: [2022-11-25 17:59:17,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_01-model_00-model_states.pt. 
0: [2022-11-25 17:59:17,532] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_03-model_00-model_states.pt... 0: [2022-11-25 17:59:17,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_03-model_00-model_states.pt. 0: [2022-11-25 17:59:17,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_04-model_00-model_states.pt... 0: [2022-11-25 17:59:17,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_04-model_00-model_states.pt. 0: [2022-11-25 17:59:17,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_05-model_00-model_states.pt... 0: [2022-11-25 17:59:17,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_05-model_00-model_states.pt. 0: [2022-11-25 17:59:17,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_06-model_00-model_states.pt... 0: [2022-11-25 17:59:17,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_06-model_00-model_states.pt. 0: [2022-11-25 17:59:17,827] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_07-model_00-model_states.pt... 0: [2022-11-25 17:59:17,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_07-model_00-model_states.pt. 0: [2022-11-25 17:59:17,900] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_08-model_00-model_states.pt... 0: [2022-11-25 17:59:17,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_08-model_00-model_states.pt. 
0: [2022-11-25 17:59:17,971] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_09-model_00-model_states.pt... 0: [2022-11-25 17:59:18,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_09-model_00-model_states.pt. 0: [2022-11-25 17:59:18,043] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_10-model_00-model_states.pt... 0: [2022-11-25 17:59:18,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_10-model_00-model_states.pt. 0: [2022-11-25 17:59:18,119] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_11-model_00-model_states.pt... 0: [2022-11-25 17:59:18,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_11-model_00-model_states.pt. 0: [2022-11-25 17:59:18,189] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_12-model_00-model_states.pt... 0: [2022-11-25 17:59:18,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_12-model_00-model_states.pt. 0: [2022-11-25 17:59:18,263] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_13-model_00-model_states.pt... 0: [2022-11-25 17:59:18,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_13-model_00-model_states.pt. 0: [2022-11-25 17:59:18,335] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_14-model_00-model_states.pt... 0: [2022-11-25 17:59:18,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_14-model_00-model_states.pt. 
0: [2022-11-25 17:59:18,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_15-model_00-model_states.pt... 0: [2022-11-25 17:59:18,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_15-model_00-model_states.pt. 0: [2022-11-25 17:59:18,485] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_16-model_00-model_states.pt... 0: [2022-11-25 17:59:18,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_16-model_00-model_states.pt. 0: [2022-11-25 17:59:18,555] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_17-model_00-model_states.pt... 0: [2022-11-25 17:59:18,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_17-model_00-model_states.pt. 0: [2022-11-25 17:59:18,631] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_18-model_00-model_states.pt... 0: [2022-11-25 17:59:18,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_18-model_00-model_states.pt. 0: [2022-11-25 17:59:18,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_19-model_00-model_states.pt... 0: [2022-11-25 17:59:18,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_19-model_00-model_states.pt. 0: [2022-11-25 17:59:18,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_20-model_00-model_states.pt... 0: [2022-11-25 17:59:18,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_20-model_00-model_states.pt. 
0: [2022-11-25 17:59:18,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_21-model_00-model_states.pt... 0: [2022-11-25 17:59:18,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_21-model_00-model_states.pt. 0: [2022-11-25 17:59:18,920] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_22-model_00-model_states.pt... 0: [2022-11-25 17:59:18,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_22-model_00-model_states.pt. 0: [2022-11-25 17:59:18,990] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_23-model_00-model_states.pt... 0: [2022-11-25 17:59:19,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_23-model_00-model_states.pt. 0: [2022-11-25 17:59:19,061] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_24-model_00-model_states.pt... 0: [2022-11-25 17:59:19,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_24-model_00-model_states.pt. 0: [2022-11-25 17:59:19,135] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_25-model_00-model_states.pt... 0: [2022-11-25 17:59:19,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_25-model_00-model_states.pt. 0: [2022-11-25 17:59:19,206] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_26-model_00-model_states.pt... 0: [2022-11-25 17:59:19,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_26-model_00-model_states.pt. 
0: [2022-11-25 17:59:19,279] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_27-model_00-model_states.pt... 0: [2022-11-25 17:59:19,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_27-model_00-model_states.pt. 0: [2022-11-25 17:59:19,349] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_28-model_00-model_states.pt... 0: [2022-11-25 17:59:19,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_28-model_00-model_states.pt. 0: [2022-11-25 17:59:19,424] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/layer_30-model_00-model_states.pt... 0: [2022-11-25 17:59:19,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/layer_30-model_00-model_states.pt. 0: [2022-11-25 17:59:19,426] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step2000/mp_rank_00_model_states.pt 0: [2022-11-25 17:59:19,426] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/mp_rank_00_model_states.pt... 0: [2022-11-25 17:59:19,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/mp_rank_00_model_states.pt. 0: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
0: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 5: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
7: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 9: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 16: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 
16: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
3: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 20: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 
20: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 11: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 24: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 
24: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 
22: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 18: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 
18: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 0: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
6: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
4: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
10: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 
2: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
12: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 
25: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 
28: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 
29: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 
17: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
26: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 
27: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 10: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 
1: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 2: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 12: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 
23: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 
22: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 17: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 27: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 6: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 10: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 
15: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 25: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 31: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 17: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 27: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 17: [2022-11-25 17:59:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 
2: [2022-11-25 17:59:19,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 26: [2022-11-25 17:59:19,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 2: [2022-11-25 17:59:19,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 26: [2022-11-25 17:59:19,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 2: [2022-11-25 17:59:19,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 26: [2022-11-25 17:59:19,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 11: [2022-11-25 17:59:19,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-25 17:59:19,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 17:59:19,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 20: [2022-11-25 17:59:19,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 5: [2022-11-25 17:59:19,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
20: [2022-11-25 17:59:19,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 5: [2022-11-25 17:59:19,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 20: [2022-11-25 17:59:19,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 5: [2022-11-25 17:59:19,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 25: [2022-11-25 17:59:19,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-25 17:59:19,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-25 17:59:19,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 18: [2022-11-25 17:59:19,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-25 17:59:19,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-25 17:59:19,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 21: [2022-11-25 17:59:19,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
21: [2022-11-25 17:59:19,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-25 17:59:19,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 17: [2022-11-25 17:59:19,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 12: [2022-11-25 17:59:19,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-25 17:59:19,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 17:59:19,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 24: [2022-11-25 17:59:19,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 17: [2022-11-25 17:59:19,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 4: [2022-11-25 17:59:19,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 17:59:19,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 17:59:19,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
24: [2022-11-25 17:59:19,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 17: [2022-11-25 17:59:19,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 24: [2022-11-25 17:59:19,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 19: [2022-11-25 17:59:19,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 14: [2022-11-25 17:59:19,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 17:59:19,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 19: [2022-11-25 17:59:19,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-25 17:59:19,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 14: [2022-11-25 17:59:19,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 23: [2022-11-25 17:59:19,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-25 17:59:19,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-25 17:59:19,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
9: [2022-11-25 17:59:19,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-25 17:59:19,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 17:59:19,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 7: [2022-11-25 17:59:19,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 17:59:19,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 17:59:19,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 0: [2022-11-25 17:59:19,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 1: [2022-11-25 17:59:19,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 17:59:19,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 17:59:19,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 3: [2022-11-25 17:59:19,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-25 17:59:19,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 17:59:19,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 27: [2022-11-25 17:59:19,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 13: [2022-11-25 17:59:19,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 18: [2022-11-25 17:59:19,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 27: [2022-11-25 17:59:19,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 13: [2022-11-25 17:59:19,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 17:59:19,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 27: [2022-11-25 17:59:19,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 18: [2022-11-25 17:59:19,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-25 17:59:19,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 22: [2022-11-25 17:59:19,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
22: [2022-11-25 17:59:19,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-25 17:59:19,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 16: [2022-11-25 17:59:19,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-25 17:59:19,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-25 17:59:19,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 31: [2022-11-25 17:59:19,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-25 17:59:19,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-25 17:59:19,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 19: [2022-11-25 17:59:19,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-25 17:59:19,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 4: [2022-11-25 17:59:19,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 19: [2022-11-25 17:59:19,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
4: [2022-11-25 17:59:19,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 12: [2022-11-25 17:59:19,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 4: [2022-11-25 17:59:19,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 12: [2022-11-25 17:59:19,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 17:59:19,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 7: [2022-11-25 17:59:19,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 17:59:19,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 1: [2022-11-25 17:59:19,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 7: [2022-11-25 17:59:19,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 1: [2022-11-25 17:59:19,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 17:59:19,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 3: [2022-11-25 17:59:19,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-25 17:59:19,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 17:59:19,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 31: [2022-11-25 17:59:19,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-25 17:59:19,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-25 17:59:19,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 13: [2022-11-25 17:59:19,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 17:59:19,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 17:59:19,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 17: [2022-11-25 17:59:19,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-25 17:59:19,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-25 17:59:19,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 17: [2022-11-25 17:59:19,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 
17: [2022-11-25 17:59:19,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-25 17:59:19,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 12: [2022-11-25 17:59:19,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 28: [2022-11-25 17:59:19,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 12: [2022-11-25 17:59:19,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 17:59:19,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 28: [2022-11-25 17:59:19,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-25 17:59:19,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 18: [2022-11-25 17:59:19,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 27: [2022-11-25 17:59:19,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
18: [2022-11-25 17:59:19,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 27: [2022-11-25 17:59:19,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 18: [2022-11-25 17:59:19,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 27: [2022-11-25 17:59:19,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 4: [2022-11-25 17:59:19,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 17:59:19,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 17:59:19,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 25: [2022-11-25 17:59:19,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-25 17:59:19,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-25 17:59:19,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 3: [2022-11-25 17:59:19,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-25 17:59:19,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 17:59:19,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 1: [2022-11-25 17:59:19,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 5: [2022-11-25 17:59:19,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 1: [2022-11-25 17:59:19,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 5: [2022-11-25 17:59:19,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 1: [2022-11-25 17:59:19,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 5: [2022-11-25 17:59:19,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 19: [2022-11-25 17:59:19,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-25 17:59:19,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-25 17:59:19,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 3: [2022-11-25 17:59:19,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-25 17:59:19,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 17:59:19,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 31: [2022-11-25 17:59:19,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 0: [2022-11-25 17:59:19,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 31: [2022-11-25 17:59:19,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-25 17:59:19,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 0: [2022-11-25 17:59:19,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 17: [2022-11-25 17:59:19,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 0: [2022-11-25 17:59:19,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 17: [2022-11-25 17:59:19,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-25 17:59:19,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 0: [2022-11-25 17:59:19,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-25 17:59:19,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 17:59:19,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 23: [2022-11-25 17:59:19,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 31: [2022-11-25 17:59:19,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-25 17:59:19,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-25 17:59:19,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 23: [2022-11-25 17:59:19,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-25 17:59:19,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 27: [2022-11-25 17:59:19,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-25 17:59:19,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-25 17:59:19,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 4: [2022-11-25 17:59:19,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-25 17:59:19,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 17:59:19,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 22: [2022-11-25 17:59:19,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-25 17:59:19,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-25 17:59:19,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 21: [2022-11-25 17:59:19,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-25 17:59:19,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-25 17:59:19,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 18: [2022-11-25 17:59:19,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-25 17:59:19,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-25 17:59:19,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 29: [2022-11-25 17:59:19,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
0: [2022-11-25 17:59:19,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 29: [2022-11-25 17:59:19,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-25 17:59:19,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 0: [2022-11-25 17:59:19,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 25: [2022-11-25 17:59:19,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-25 17:59:19,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-25 17:59:19,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 22: [2022-11-25 17:59:19,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-25 17:59:19,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-25 17:59:19,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 14: [2022-11-25 17:59:19,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-25 17:59:19,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 17:59:19,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 13: [2022-11-25 17:59:19,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-25 17:59:19,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 17:59:19,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 15: [2022-11-25 17:59:19,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 21: [2022-11-25 17:59:19,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-25 17:59:19,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-25 17:59:19,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 15: [2022-11-25 17:59:19,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 17:59:19,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 1: [2022-11-25 17:59:19,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
27: [2022-11-25 17:59:19,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 1: [2022-11-25 17:59:19,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 27: [2022-11-25 17:59:19,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 1: [2022-11-25 17:59:19,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 29: [2022-11-25 17:59:19,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 27: [2022-11-25 17:59:19,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 29: [2022-11-25 17:59:19,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-25 17:59:19,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 19: [2022-11-25 17:59:19,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-25 17:59:19,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-25 17:59:19,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 21: [2022-11-25 17:59:19,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-25 17:59:19,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-25 17:59:19,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 8: [2022-11-25 17:59:19,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-25 17:59:19,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-25 17:59:19,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 17:59:19,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 17:59:19,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 8: [2022-11-25 17:59:19,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 7: [2022-11-25 17:59:19,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 17:59:19,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 17:59:19,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 12: [2022-11-25 17:59:19,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-25 17:59:19,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 17:59:19,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 24: [2022-11-25 17:59:19,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-25 17:59:19,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-25 17:59:19,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 22: [2022-11-25 17:59:19,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-25 17:59:19,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-25 17:59:19,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 9: [2022-11-25 17:59:19,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-25 17:59:19,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 17:59:19,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 25: [2022-11-25 17:59:19,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
25: [2022-11-25 17:59:19,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-25 17:59:19,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 0: [2022-11-25 17:59:19,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 17:59:19,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 17:59:19,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 14: [2022-11-25 17:59:19,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-25 17:59:19,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 17:59:19,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 26: [2022-11-25 17:59:19,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-25 17:59:19,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-25 17:59:19,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 2: [2022-11-25 17:59:19,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-25 17:59:19,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 17:59:19,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 7: [2022-11-25 17:59:19,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 20: [2022-11-25 17:59:19,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 7: [2022-11-25 17:59:19,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 20: [2022-11-25 17:59:19,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 7: [2022-11-25 17:59:19,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 20: [2022-11-25 17:59:19,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 28: [2022-11-25 17:59:19,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-25 17:59:19,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-25 17:59:19,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 8: [2022-11-25 17:59:19,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-25 17:59:19,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 17:59:19,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 24: [2022-11-25 17:59:19,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-25 17:59:19,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 30: [2022-11-25 17:59:19,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 24: [2022-11-25 17:59:19,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 30: [2022-11-25 17:59:19,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-25 17:59:19,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 9: [2022-11-25 17:59:19,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-25 17:59:19,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 17:59:19,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 16: [2022-11-25 17:59:19,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
16: [2022-11-25 17:59:19,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-25 17:59:19,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 14: [2022-11-25 17:59:19,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-25 17:59:19,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 17:59:19,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 24: [2022-11-25 17:59:19,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-25 17:59:19,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-25 17:59:19,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 20: [2022-11-25 17:59:19,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-25 17:59:19,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-25 17:59:19,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 29: [2022-11-25 17:59:19,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
29: [2022-11-25 17:59:19,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-25 17:59:19,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 1: [2022-11-25 17:59:19,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 31: [2022-11-25 17:59:19,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 8: [2022-11-25 17:59:19,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 31: [2022-11-25 17:59:19,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 8: [2022-11-25 17:59:19,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 1: [2022-11-25 17:59:19,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 8: [2022-11-25 17:59:19,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 1: [2022-11-25 17:59:19,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 31: [2022-11-25 17:59:19,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 28: [2022-11-25 17:59:19,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
28: [2022-11-25 17:59:19,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-25 17:59:19,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 30: [2022-11-25 17:59:19,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-25 17:59:19,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-25 17:59:19,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 9: [2022-11-25 17:59:19,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 17:59:19,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 17:59:19,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 10: [2022-11-25 17:59:19,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 17:59:19,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 17:59:19,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 2: [2022-11-25 17:59:19,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-25 17:59:19,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 17:59:19,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 11: [2022-11-25 17:59:19,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-25 17:59:19,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 17:59:19,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 3: [2022-11-25 17:59:19,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 17:59:19,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 17:59:19,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 15: [2022-11-25 17:59:19,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-25 17:59:19,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 17:59:19,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 10: [2022-11-25 17:59:19,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-25 17:59:19,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 17:59:19,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 30: [2022-11-25 17:59:19,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-25 17:59:19,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-25 17:59:19,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 9: [2022-11-25 17:59:19,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-25 17:59:19,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 25: [2022-11-25 17:59:19,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 9: [2022-11-25 17:59:19,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 25: [2022-11-25 17:59:19,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-25 17:59:19,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 16: [2022-11-25 17:59:19,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
16: [2022-11-25 17:59:19,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-25 17:59:19,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 26: [2022-11-25 17:59:19,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-25 17:59:19,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-25 17:59:19,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 5: [2022-11-25 17:59:19,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 17:59:19,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 17:59:19,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 17: [2022-11-25 17:59:19,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-25 17:59:19,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-25 17:59:19,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 21: [2022-11-25 17:59:19,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-25 17:59:19,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-25 17:59:19,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 20: [2022-11-25 17:59:19,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-25 17:59:19,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-25 17:59:19,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 10: [2022-11-25 17:59:19,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-25 17:59:19,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 17:59:19,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 23: [2022-11-25 17:59:19,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-25 17:59:19,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-25 17:59:19,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 27: [2022-11-25 17:59:19,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-25 17:59:19,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-25 17:59:19,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 6: [2022-11-25 17:59:19,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 17:59:19,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 17:59:19,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 4: [2022-11-25 17:59:19,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 17:59:19,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 17:59:19,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 5: [2022-11-25 17:59:19,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 17:59:19,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 17:59:19,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 14: [2022-11-25 17:59:19,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-25 17:59:19,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 17:59:19,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 11: [2022-11-25 17:59:19,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 17:59:19,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 17:59:19,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 16: [2022-11-25 17:59:19,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-25 17:59:19,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-25 17:59:19,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 18: [2022-11-25 17:59:19,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-25 17:59:19,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-25 17:59:19,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 24: [2022-11-25 17:59:19,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
24: [2022-11-25 17:59:19,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-25 17:59:19,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 26: [2022-11-25 17:59:19,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-25 17:59:19,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-25 17:59:19,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 10: [2022-11-25 17:59:19,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-25 17:59:19,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 17:59:19,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 16: [2022-11-25 17:59:19,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-25 17:59:19,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-25 17:59:19,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 15: [2022-11-25 17:59:19,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-25 17:59:19,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 17:59:19,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-25 17:59:19,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 15: [2022-11-25 17:59:19,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 17:59:19,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 12: [2022-11-25 17:59:19,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 17:59:19,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 17:59:19,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 6: [2022-11-25 17:59:19,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 17:59:19,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 17:59:19,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 22: [2022-11-25 17:59:19,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
22: [2022-11-25 17:59:19,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-25 17:59:19,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 6: [2022-11-25 17:59:19,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 17:59:19,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 17:59:19,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 7: [2022-11-25 17:59:19,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 17:59:19,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 17:59:19,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 28: [2022-11-25 17:59:19,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-25 17:59:19,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-25 17:59:19,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 29: [2022-11-25 17:59:19,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-25 17:59:19,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-25 17:59:19,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 15: [2022-11-25 17:59:19,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-25 17:59:19,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 17:59:19,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 11: [2022-11-25 17:59:19,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-25 17:59:19,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 17:59:19,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 29: [2022-11-25 17:59:19,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-25 17:59:19,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-25 17:59:19,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 30: [2022-11-25 17:59:19,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-25 17:59:19,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-25 17:59:19,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 13: [2022-11-25 17:59:19,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-25 17:59:19,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 17:59:19,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 8: [2022-11-25 17:59:19,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-25 17:59:19,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 17:59:19,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 2: [2022-11-25 17:59:19,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 17:59:19,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 17:59:19,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 5: [2022-11-25 17:59:19,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-25 17:59:19,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 17:59:19,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 0: [2022-11-25 17:59:19,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 17:59:19,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 17:59:19,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 3: [2022-11-25 17:59:19,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 25: [2022-11-25 17:59:19,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 3: [2022-11-25 17:59:19,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 17:59:19,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 25: [2022-11-25 17:59:19,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-25 17:59:19,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 31: [2022-11-25 17:59:19,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-25 17:59:19,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-25 17:59:19,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 1: [2022-11-25 17:59:19,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 20: [2022-11-25 17:59:19,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 1: [2022-11-25 17:59:19,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 17:59:19,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 20: [2022-11-25 17:59:19,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-25 17:59:19,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 5: [2022-11-25 17:59:19,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 17:59:19,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 17:59:19,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 17: [2022-11-25 17:59:19,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-25 17:59:19,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-25 17:59:19,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 21: [2022-11-25 17:59:19,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-25 17:59:19,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-25 17:59:19,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 9: [2022-11-25 17:59:19,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-25 17:59:19,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 19: [2022-11-25 17:59:19,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 9: [2022-11-25 17:59:19,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 19: [2022-11-25 17:59:19,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-25 17:59:19,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 6: [2022-11-25 17:59:19,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-25 17:59:19,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 17:59:19,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 10: [2022-11-25 17:59:19,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-25 17:59:19,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 17:59:19,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 28: [2022-11-25 17:59:19,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-25 17:59:19,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-25 17:59:19,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 10: [2022-11-25 17:59:19,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-25 17:59:19,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 17:59:19,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 14: [2022-11-25 17:59:19,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-25 17:59:19,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 17:59:19,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 26: [2022-11-25 17:59:19,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-25 17:59:19,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-25 17:59:19,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 6: [2022-11-25 17:59:19,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 17:59:19,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 17:59:19,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 24: [2022-11-25 17:59:19,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-25 17:59:19,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-25 17:59:19,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 30: [2022-11-25 17:59:19,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-25 17:59:19,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-25 17:59:19,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 12: [2022-11-25 17:59:19,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 17:59:19,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 17:59:19,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 26: [2022-11-25 17:59:19,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-25 17:59:19,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-25 17:59:19,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 4: [2022-11-25 17:59:19,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 17:59:19,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 17:59:19,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 22: [2022-11-25 17:59:19,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
22: [2022-11-25 17:59:19,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-25 17:59:19,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 18: [2022-11-25 17:59:19,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-25 17:59:19,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-25 17:59:19,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 16: [2022-11-25 17:59:19,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-25 17:59:19,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-25 17:59:19,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 19: [2022-11-25 17:59:19,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-25 17:59:19,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-25 17:59:19,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 7: [2022-11-25 17:59:19,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-25 17:59:19,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 17:59:19,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 20: [2022-11-25 17:59:19,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-25 17:59:19,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-25 17:59:19,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 13: [2022-11-25 17:59:19,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 17:59:19,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 17:59:19,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 15: [2022-11-25 17:59:19,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 30: [2022-11-25 17:59:19,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
15: [2022-11-25 17:59:19,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 30: [2022-11-25 17:59:19,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 15: [2022-11-25 17:59:19,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 30: [2022-11-25 17:59:19,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 1: [2022-11-25 17:59:19,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 17:59:19,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 2: [2022-11-25 17:59:19,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 5: [2022-11-25 17:59:19,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 1: [2022-11-25 17:59:19,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 2: [2022-11-25 17:59:19,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 5: [2022-11-25 17:59:19,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 17:59:19,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
2: [2022-11-25 17:59:19,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 29: [2022-11-25 17:59:19,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-25 17:59:19,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-25 17:59:19,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 3: [2022-11-25 17:59:19,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 17:59:19,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 17:59:19,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 25: [2022-11-25 17:59:19,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-25 17:59:19,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-25 17:59:19,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 21: [2022-11-25 17:59:19,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-25 17:59:19,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-25 17:59:19,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 8: [2022-11-25 17:59:19,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 10: [2022-11-25 17:59:19,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 8: [2022-11-25 17:59:19,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 10: [2022-11-25 17:59:19,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 8: [2022-11-25 17:59:19,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 10: [2022-11-25 17:59:19,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 26: [2022-11-25 17:59:19,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-25 17:59:19,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 20: [2022-11-25 17:59:19,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 26: [2022-11-25 17:59:19,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
11: [2022-11-25 17:59:19,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 20: [2022-11-25 17:59:19,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 11: [2022-11-25 17:59:19,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 20: [2022-11-25 17:59:19,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 11: [2022-11-25 17:59:19,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 11: [2022-11-25 17:59:19,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-25 17:59:19,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 17:59:19,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 0: [2022-11-25 17:59:19,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 17:59:19,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 17:59:19,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 17: [2022-11-25 17:59:19,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 
17: [2022-11-25 17:59:19,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-25 17:59:19,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 31: [2022-11-25 17:59:19,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-25 17:59:19,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-25 17:59:19,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 6: [2022-11-25 17:59:19,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 17:59:19,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 17:59:19,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 27: [2022-11-25 17:59:19,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-25 17:59:19,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-25 17:59:19,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 14: [2022-11-25 17:59:19,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-25 17:59:19,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 17:59:19,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 30: [2022-11-25 17:59:19,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-25 17:59:19,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-25 17:59:19,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 28: [2022-11-25 17:59:19,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-25 17:59:19,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-25 17:59:19,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 4: [2022-11-25 17:59:19,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 17:59:19,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 17:59:19,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 11: [2022-11-25 17:59:19,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-25 17:59:19,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 17:59:19,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 9: [2022-11-25 17:59:19,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 17:59:19,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 17:59:19,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 5: [2022-11-25 17:59:19,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 17:59:19,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 27: [2022-11-25 17:59:19,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 5: [2022-11-25 17:59:19,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 7: [2022-11-25 17:59:19,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 18: [2022-11-25 17:59:19,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
27: [2022-11-25 17:59:19,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 7: [2022-11-25 17:59:19,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 18: [2022-11-25 17:59:19,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 27: [2022-11-25 17:59:19,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 7: [2022-11-25 17:59:19,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 16: [2022-11-25 17:59:19,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 18: [2022-11-25 17:59:19,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 19: [2022-11-25 17:59:19,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-25 17:59:19,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-25 17:59:19,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 16: [2022-11-25 17:59:19,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-25 17:59:19,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
12: [2022-11-25 17:59:19,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-25 17:59:19,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 17:59:19,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 22: [2022-11-25 17:59:19,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-25 17:59:19,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-25 17:59:19,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 2: [2022-11-25 17:59:19,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 17:59:19,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 17:59:19,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 25: [2022-11-25 17:59:19,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-25 17:59:19,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-25 17:59:19,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
13: [2022-11-25 17:59:19,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 17:59:19,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 17:59:19,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 0: [2022-11-25 17:59:19,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 17:59:19,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 3: [2022-11-25 17:59:19,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 0: [2022-11-25 17:59:19,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 3: [2022-11-25 17:59:19,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 17:59:19,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 15: [2022-11-25 17:59:19,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 17:59:19,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 17:59:19,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
1: [2022-11-25 17:59:19,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 17:59:19,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 17:59:19,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 20: [2022-11-25 17:59:19,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-25 17:59:19,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-25 17:59:19,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 6: [2022-11-25 17:59:19,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 17:59:19,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 17:59:19,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 22: [2022-11-25 17:59:19,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-25 17:59:19,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-25 17:59:19,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
2: [2022-11-25 17:59:19,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 17:59:19,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 9: [2022-11-25 17:59:19,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 2: [2022-11-25 17:59:19,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 9: [2022-11-25 17:59:19,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 17:59:19,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 19: [2022-11-25 17:59:19,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 21: [2022-11-25 17:59:19,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 10: [2022-11-25 17:59:19,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 19: [2022-11-25 17:59:19,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 21: [2022-11-25 17:59:19,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 19: [2022-11-25 17:59:19,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
21: [2022-11-25 17:59:19,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 10: [2022-11-25 17:59:19,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 17:59:19,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 8: [2022-11-25 17:59:19,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-25 17:59:19,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 17:59:19,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 15: [2022-11-25 17:59:19,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-25 17:59:19,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 17:59:19,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 12: [2022-11-25 17:59:19,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 4: [2022-11-25 17:59:19,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
12: [2022-11-25 17:59:19,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 4: [2022-11-25 17:59:19,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 12: [2022-11-25 17:59:19,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 4: [2022-11-25 17:59:19,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 0: [2022-11-25 17:59:19,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 17:59:19,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 17:59:19,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 7: [2022-11-25 17:59:19,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 17:59:19,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 17:59:19,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 30: [2022-11-25 17:59:19,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-25 17:59:19,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-25 17:59:19,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 6: [2022-11-25 17:59:19,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 17:59:19,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 17:59:19,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 24: [2022-11-25 17:59:19,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 13: [2022-11-25 17:59:19,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 18: [2022-11-25 17:59:19,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 11: [2022-11-25 17:59:19,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 24: [2022-11-25 17:59:19,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 13: [2022-11-25 17:59:19,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 24: [2022-11-25 17:59:19,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
18: [2022-11-25 17:59:19,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 13: [2022-11-25 17:59:19,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 11: [2022-11-25 17:59:19,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 18: [2022-11-25 17:59:19,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 11: [2022-11-25 17:59:19,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 31: [2022-11-25 17:59:19,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 14: [2022-11-25 17:59:19,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 31: [2022-11-25 17:59:19,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 14: [2022-11-25 17:59:19,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 31: [2022-11-25 17:59:19,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 14: [2022-11-25 17:59:19,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 8: [2022-11-25 17:59:19,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-25 17:59:19,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 17:59:19,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 26: [2022-11-25 17:59:19,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-25 17:59:19,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-25 17:59:19,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 16: [2022-11-25 17:59:19,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-25 17:59:19,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 17: [2022-11-25 17:59:19,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 16: [2022-11-25 17:59:19,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 17: [2022-11-25 17:59:19,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-25 17:59:19,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 28: [2022-11-25 17:59:19,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
28: [2022-11-25 17:59:19,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-25 17:59:19,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 24: [2022-11-25 17:59:19,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-25 17:59:19,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-25 17:59:19,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 2: [2022-11-25 17:59:19,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 17:59:19,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 17:59:19,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 27: [2022-11-25 17:59:19,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-25 17:59:19,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-25 17:59:19,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 28: [2022-11-25 17:59:19,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
28: [2022-11-25 17:59:19,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-25 17:59:19,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 23: [2022-11-25 17:59:19,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-25 17:59:19,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-25 17:59:19,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 29: [2022-11-25 17:59:19,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-25 17:59:19,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-25 17:59:19,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 13: [2022-11-25 17:59:19,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-25 17:59:19,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 17:59:19,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 23: [2022-11-25 17:59:19,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-25 17:59:19,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-25 17:59:19,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 29: [2022-11-25 17:59:19,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-25 17:59:19,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-25 17:59:19,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 23: [2022-11-25 17:59:19,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-25 17:59:19,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-25 17:59:19,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-25 17:59:19,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-25 17:59:19,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
23: [2022-11-25 17:59:19,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-25 17:59:19,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step2000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-25 17:59:19,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 23: [2022-11-25 17:59:19,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 0: successfully saved checkpoint at iteration 2000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2537.20 31: iteration 2010/ 173500 | consumed samples: 514560 | consumed tokens: 1053818880 | elapsed time per iteration (s): 1.09 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.960323E+00 | grad norm: 0.399 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.863 | TFLOPs: 14.21 | 31: iteration 2020/ 173500 | consumed samples: 517120 | consumed tokens: 1059061760 | elapsed time per iteration (s): 0.83 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.975179E+00 | grad norm: 0.387 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.351 | TFLOPs: 18.71 | 31: iteration 2030/ 173500 | consumed samples: 519680 | consumed tokens: 1064304640 | elapsed time per iteration (s): 0.76 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.896446E+00 | grad norm: 0.376 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.497 | TFLOPs: 20.42 | 31: iteration 2040/ 173500 | consumed samples: 522240 | consumed tokens: 1069547520 | elapsed time per iteration (s): 0.83 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.916744E+00 | grad norm: 0.386 
| num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.476 | TFLOPs: 18.72 | 31: iteration 2050/ 173500 | consumed samples: 524800 | consumed tokens: 1074790400 | elapsed time per iteration (s): 0.82 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.942574E+00 | grad norm: 0.389 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.868 | TFLOPs: 18.87 | 31: iteration 2060/ 173500 | consumed samples: 527360 | consumed tokens: 1080033280 | elapsed time per iteration (s): 0.81 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.914760E+00 | grad norm: 0.354 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.907 | TFLOPs: 19.17 | 31: iteration 2070/ 173500 | consumed samples: 529920 | consumed tokens: 1085276160 | elapsed time per iteration (s): 0.82 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.889465E+00 | grad norm: 0.513 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.398 | TFLOPs: 18.90 | 31: iteration 2080/ 173500 | consumed samples: 532480 | consumed tokens: 1090519040 | elapsed time per iteration (s): 0.83 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.914519E+00 | grad norm: 0.371 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.161 | TFLOPs: 18.70 | 31: iteration 2090/ 173500 | consumed samples: 535040 | consumed tokens: 1095761920 | elapsed time per iteration (s): 0.82 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.905186E+00 | grad norm: 0.348 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.362 | TFLOPs: 18.90 | 31: iteration 2100/ 173500 | consumed samples: 537600 | consumed tokens: 1101004800 | elapsed time per iteration (s): 0.83 | learning 
rate: 2.000E-04 | global batch size: 256 | lm loss: 2.895928E+00 | grad norm: 0.483 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.923 | TFLOPs: 18.69 | 31: iteration 2110/ 173500 | consumed samples: 540160 | consumed tokens: 1106247680 | elapsed time per iteration (s): 0.81 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.883172E+00 | grad norm: 0.663 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.503 | TFLOPs: 19.15 | 31: iteration 2120/ 173500 | consumed samples: 542720 | consumed tokens: 1111490560 | elapsed time per iteration (s): 0.84 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.919219E+00 | grad norm: 1.002 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.590 | TFLOPs: 18.37 | 31: iteration 2130/ 173500 | consumed samples: 545280 | consumed tokens: 1116733440 | elapsed time per iteration (s): 0.82 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.984364E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.016 | TFLOPs: 18.88 | 31: iteration 2140/ 173500 | consumed samples: 547840 | consumed tokens: 1121976320 | elapsed time per iteration (s): 0.82 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.936100E+00 | grad norm: 0.449 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.366 | TFLOPs: 18.96 | 31: iteration 2150/ 173500 | consumed samples: 550400 | consumed tokens: 1127219200 | elapsed time per iteration (s): 0.82 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.901306E+00 | grad norm: 0.377 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.129 | TFLOPs: 18.94 | 31: iteration 2160/ 173500 | consumed samples: 
552960 | consumed tokens: 1132462080 | elapsed time per iteration (s): 0.78 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.891993E+00 | grad norm: 0.385 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.653 | TFLOPs: 19.76 | 31: iteration 2170/ 173500 | consumed samples: 555520 | consumed tokens: 1137704960 | elapsed time per iteration (s): 0.80 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.900589E+00 | grad norm: 0.340 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.680 | TFLOPs: 19.28 | 31: iteration 2180/ 173500 | consumed samples: 558080 | consumed tokens: 1142947840 | elapsed time per iteration (s): 0.80 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.903010E+00 | grad norm: 0.346 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.363 | TFLOPs: 19.26 | 31: iteration 2190/ 173500 | consumed samples: 560640 | consumed tokens: 1148190720 | elapsed time per iteration (s): 0.87 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.880862E+00 | grad norm: 0.382 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.839 | TFLOPs: 17.84 | 31: iteration 2200/ 173500 | consumed samples: 563200 | consumed tokens: 1153433600 | elapsed time per iteration (s): 0.84 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.903207E+00 | grad norm: 0.389 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.496 | TFLOPs: 18.54 | 31: iteration 2210/ 173500 | consumed samples: 565760 | consumed tokens: 1158676480 | elapsed time per iteration (s): 0.84 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.854187E+00 | grad norm: 0.359 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 304.609 | TFLOPs: 18.43 | 31: iteration 2220/ 173500 | consumed samples: 568320 | consumed tokens: 1163919360 | elapsed time per iteration (s): 0.82 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.916046E+00 | grad norm: 0.356 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.210 | TFLOPs: 18.89 | 31: iteration 2230/ 173500 | consumed samples: 570880 | consumed tokens: 1169162240 | elapsed time per iteration (s): 0.81 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.867826E+00 | grad norm: 0.396 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.885 | TFLOPs: 19.11 | 31: iteration 2240/ 173500 | consumed samples: 573440 | consumed tokens: 1174405120 | elapsed time per iteration (s): 0.83 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.890277E+00 | grad norm: 0.358 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.771 | TFLOPs: 18.62 | 31: iteration 2250/ 173500 | consumed samples: 576000 | consumed tokens: 1179648000 | elapsed time per iteration (s): 0.75 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.852076E+00 | grad norm: 0.351 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.134 | TFLOPs: 20.64 | 31: iteration 2260/ 173500 | consumed samples: 578560 | consumed tokens: 1184890880 | elapsed time per iteration (s): 0.72 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.817622E+00 | grad norm: 0.335 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.032 | TFLOPs: 21.48 | 31: iteration 2270/ 173500 | consumed samples: 581120 | consumed tokens: 1190133760 | elapsed time per iteration (s): 0.75 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.819803E+00 | grad norm: 0.937 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.637 | TFLOPs: 20.55 | 31: iteration 2280/ 173500 | consumed samples: 583680 | consumed tokens: 1195376640 | elapsed time per iteration (s): 0.73 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.852748E+00 | grad norm: 1.425 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.521 | TFLOPs: 21.33 | 31: iteration 2290/ 173500 | consumed samples: 586240 | consumed tokens: 1200619520 | elapsed time per iteration (s): 0.72 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.901935E+00 | grad norm: 0.558 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.214 | TFLOPs: 21.37 | 31: iteration 2300/ 173500 | consumed samples: 588800 | consumed tokens: 1205862400 | elapsed time per iteration (s): 0.73 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.862732E+00 | grad norm: 0.344 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.797 | TFLOPs: 21.22 | 31: iteration 2310/ 173500 | consumed samples: 591360 | consumed tokens: 1211105280 | elapsed time per iteration (s): 0.75 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.875617E+00 | grad norm: 0.339 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.483 | TFLOPs: 20.72 | 31: iteration 2320/ 173500 | consumed samples: 593920 | consumed tokens: 1216348160 | elapsed time per iteration (s): 0.80 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.862542E+00 | grad norm: 0.352 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.577 | TFLOPs: 19.45 | 31: iteration 2330/ 173500 | consumed samples: 596480 | consumed tokens: 1221591040 | elapsed time per iteration (s): 0.75 | learning rate: 
2.000E-04 | global batch size: 256 | lm loss: 2.788658E+00 | grad norm: 0.324 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.941 | TFLOPs: 20.63 | 31: iteration 2340/ 173500 | consumed samples: 599040 | consumed tokens: 1226833920 | elapsed time per iteration (s): 0.72 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.801870E+00 | grad norm: 0.327 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.231 | TFLOPs: 21.43 | 31: iteration 2350/ 173500 | consumed samples: 601600 | consumed tokens: 1232076800 | elapsed time per iteration (s): 0.77 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.827744E+00 | grad norm: 0.330 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.152 | TFLOPs: 20.22 | 31: iteration 2360/ 173500 | consumed samples: 604160 | consumed tokens: 1237319680 | elapsed time per iteration (s): 0.80 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.844817E+00 | grad norm: 0.382 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.702 | TFLOPs: 19.46 | 31: iteration 2370/ 173500 | consumed samples: 606720 | consumed tokens: 1242562560 | elapsed time per iteration (s): 0.72 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.854373E+00 | grad norm: 0.431 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.930 | TFLOPs: 21.59 | 31: iteration 2380/ 173500 | consumed samples: 609280 | consumed tokens: 1247805440 | elapsed time per iteration (s): 0.74 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.836240E+00 | grad norm: 0.346 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.205 | TFLOPs: 20.82 | 31: iteration 2390/ 173500 | consumed samples: 611840 | 
consumed tokens: 1253048320 | elapsed time per iteration (s): 0.75 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.820245E+00 | grad norm: 0.298 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.738 | TFLOPs: 20.61 | 31: iteration 2400/ 173500 | consumed samples: 614400 | consumed tokens: 1258291200 | elapsed time per iteration (s): 0.83 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.861726E+00 | grad norm: 0.362 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.982 | TFLOPs: 18.75 | 31: iteration 2410/ 173500 | consumed samples: 616960 | consumed tokens: 1263534080 | elapsed time per iteration (s): 0.77 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.823024E+00 | grad norm: 0.332 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.332 | TFLOPs: 20.23 | 31: iteration 2420/ 173500 | consumed samples: 619520 | consumed tokens: 1268776960 | elapsed time per iteration (s): 0.76 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.813733E+00 | grad norm: 0.327 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.942 | TFLOPs: 20.26 | 31: iteration 2430/ 173500 | consumed samples: 622080 | consumed tokens: 1274019840 | elapsed time per iteration (s): 0.75 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.817613E+00 | grad norm: 0.329 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.285 | TFLOPs: 20.53 | 31: iteration 2440/ 173500 | consumed samples: 624640 | consumed tokens: 1279262720 | elapsed time per iteration (s): 0.80 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.790187E+00 | grad norm: 0.385 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
321.032 | TFLOPs: 19.42 | 31: iteration 2450/ 173500 | consumed samples: 627200 | consumed tokens: 1284505600 | elapsed time per iteration (s): 0.77 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.815623E+00 | grad norm: 0.435 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.409 | TFLOPs: 20.23 | 31: iteration 2460/ 173500 | consumed samples: 629760 | consumed tokens: 1289748480 | elapsed time per iteration (s): 0.74 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.934521E+00 | grad norm: 2.406 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.389 | TFLOPs: 21.02 | 31: iteration 2470/ 173500 | consumed samples: 632320 | consumed tokens: 1294991360 | elapsed time per iteration (s): 0.78 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.976373E+00 | grad norm: 0.893 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.420 | TFLOPs: 19.81 | 31: iteration 2480/ 173500 | consumed samples: 634880 | consumed tokens: 1300234240 | elapsed time per iteration (s): 0.81 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.919860E+00 | grad norm: 0.534 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.721 | TFLOPs: 19.10 | 31: iteration 2490/ 173500 | consumed samples: 637440 | consumed tokens: 1305477120 | elapsed time per iteration (s): 0.81 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.858065E+00 | grad norm: 0.358 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.861 | TFLOPs: 19.17 | 31: iteration 2500/ 173500 | consumed samples: 640000 | consumed tokens: 1310720000 | elapsed time per iteration (s): 0.80 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.878511E+00 | grad norm: 1.140 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.673 | TFLOPs: 19.28 | 31: iteration 2510/ 173500 | consumed samples: 642560 | consumed tokens: 1315962880 | elapsed time per iteration (s): 0.83 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.874315E+00 | grad norm: 0.304 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.520 | TFLOPs: 18.73 | 31: iteration 2520/ 173500 | consumed samples: 645120 | consumed tokens: 1321205760 | elapsed time per iteration (s): 0.83 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.810692E+00 | grad norm: 0.308 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.256 | TFLOPs: 18.77 | 31: iteration 2530/ 173500 | consumed samples: 647680 | consumed tokens: 1326448640 | elapsed time per iteration (s): 0.84 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.812484E+00 | grad norm: 0.290 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.520 | TFLOPs: 18.54 | 31: iteration 2540/ 173500 | consumed samples: 650240 | consumed tokens: 1331691520 | elapsed time per iteration (s): 0.83 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.794417E+00 | grad norm: 0.295 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.409 | TFLOPs: 18.66 | 31: iteration 2550/ 173500 | consumed samples: 652800 | consumed tokens: 1336934400 | elapsed time per iteration (s): 0.78 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.768341E+00 | grad norm: 0.322 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.045 | TFLOPs: 19.85 | 31: iteration 2560/ 173500 | consumed samples: 655360 | consumed tokens: 1342177280 | elapsed time per iteration (s): 0.83 | learning rate: 2.000E-04 | 
global batch size: 256 | lm loss: 2.776170E+00 | grad norm: 0.310 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.349 | TFLOPs: 18.65 | 31: iteration 2570/ 173500 | consumed samples: 657920 | consumed tokens: 1347420160 | elapsed time per iteration (s): 0.81 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.786794E+00 | grad norm: 0.308 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.717 | TFLOPs: 19.16 | 31: iteration 2580/ 173500 | consumed samples: 660480 | consumed tokens: 1352663040 | elapsed time per iteration (s): 0.88 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.788876E+00 | grad norm: 0.300 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.698 | TFLOPs: 17.65 | 31: iteration 2590/ 173500 | consumed samples: 663040 | consumed tokens: 1357905920 | elapsed time per iteration (s): 0.77 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.808180E+00 | grad norm: 0.295 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.787 | TFLOPs: 20.19 | 31: iteration 2600/ 173500 | consumed samples: 665600 | consumed tokens: 1363148800 | elapsed time per iteration (s): 0.76 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.756349E+00 | grad norm: 0.330 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.519 | TFLOPs: 20.30 | 31: iteration 2610/ 173500 | consumed samples: 668160 | consumed tokens: 1368391680 | elapsed time per iteration (s): 0.76 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.794527E+00 | grad norm: 0.319 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.961 | TFLOPs: 20.51 | 31: iteration 2620/ 173500 | consumed samples: 670720 | consumed 
tokens: 1373634560 | elapsed time per iteration (s): 0.73 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.800557E+00 | grad norm: 0.309 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.920 | TFLOPs: 21.35 | 31: iteration 2630/ 173500 | consumed samples: 673280 | consumed tokens: 1378877440 | elapsed time per iteration (s): 0.79 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.773430E+00 | grad norm: 0.308 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.832 | TFLOPs: 19.59 | 31: iteration 2640/ 173500 | consumed samples: 675840 | consumed tokens: 1384120320 | elapsed time per iteration (s): 0.82 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.817957E+00 | grad norm: 0.324 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.836 | TFLOPs: 18.99 | 31: iteration 2650/ 173500 | consumed samples: 678400 | consumed tokens: 1389363200 | elapsed time per iteration (s): 0.78 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.785960E+00 | grad norm: 0.311 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.229 | TFLOPs: 19.92 | 31: iteration 2660/ 173500 | consumed samples: 680960 | consumed tokens: 1394606080 | elapsed time per iteration (s): 0.77 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.755702E+00 | grad norm: 0.321 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.644 | TFLOPs: 20.18 | 31: iteration 2670/ 173500 | consumed samples: 683520 | consumed tokens: 1399848960 | elapsed time per iteration (s): 0.78 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.761998E+00 | grad norm: 0.299 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.937 
| TFLOPs: 19.84 | 31: iteration 2680/ 173500 | consumed samples: 686080 | consumed tokens: 1405091840 | elapsed time per iteration (s): 0.81 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.727244E+00 | grad norm: 0.315 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.029 | TFLOPs: 19.06 | 31: iteration 2690/ 173500 | consumed samples: 688640 | consumed tokens: 1410334720 | elapsed time per iteration (s): 0.81 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.752884E+00 | grad norm: 0.302 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.586 | TFLOPs: 19.03 | 31: iteration 2700/ 173500 | consumed samples: 691200 | consumed tokens: 1415577600 | elapsed time per iteration (s): 0.81 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.762974E+00 | grad norm: 0.301 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.737 | TFLOPs: 19.10 | 31: iteration 2710/ 173500 | consumed samples: 693760 | consumed tokens: 1420820480 | elapsed time per iteration (s): 0.83 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.724105E+00 | grad norm: 0.298 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.411 | TFLOPs: 18.60 | 31: iteration 2720/ 173500 | consumed samples: 696320 | consumed tokens: 1426063360 | elapsed time per iteration (s): 0.84 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.782314E+00 | grad norm: 0.304 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.981 | TFLOPs: 18.33 | 31: iteration 2730/ 173500 | consumed samples: 698880 | consumed tokens: 1431306240 | elapsed time per iteration (s): 0.83 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.738964E+00 | grad norm: 0.283 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.013 | TFLOPs: 18.57 | 31: iteration 2740/ 173500 | consumed samples: 701440 | consumed tokens: 1436549120 | elapsed time per iteration (s): 0.78 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.716635E+00 | grad norm: 0.327 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.273 | TFLOPs: 19.74 | 31: iteration 2750/ 173500 | consumed samples: 704000 | consumed tokens: 1441792000 | elapsed time per iteration (s): 0.83 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.725617E+00 | grad norm: 0.310 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.050 | TFLOPs: 18.64 | 31: iteration 2760/ 173500 | consumed samples: 706560 | consumed tokens: 1447034880 | elapsed time per iteration (s): 0.84 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.737435E+00 | grad norm: 0.295 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.060 | TFLOPs: 18.46 | 31: iteration 2770/ 173500 | consumed samples: 709120 | consumed tokens: 1452277760 | elapsed time per iteration (s): 0.88 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.760093E+00 | grad norm: 0.328 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.624 | TFLOPs: 17.52 | 31: iteration 2780/ 173500 | consumed samples: 711680 | consumed tokens: 1457520640 | elapsed time per iteration (s): 0.85 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.748253E+00 | grad norm: 0.300 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.153 | TFLOPs: 18.22 | 31: iteration 2790/ 173500 | consumed samples: 714240 | consumed tokens: 1462763520 | elapsed time per iteration (s): 0.86 | learning rate: 2.000E-04 | global batch 
size: 256 | lm loss: 2.760229E+00 | grad norm: 0.303 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.421 | TFLOPs: 18.05 | 31: iteration 2800/ 173500 | consumed samples: 716800 | consumed tokens: 1468006400 | elapsed time per iteration (s): 0.82 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.780395E+00 | grad norm: 0.289 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.781 | TFLOPs: 18.98 | 31: iteration 2810/ 173500 | consumed samples: 719360 | consumed tokens: 1473249280 | elapsed time per iteration (s): 0.85 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.699778E+00 | grad norm: 0.292 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.567 | TFLOPs: 18.18 | 31: iteration 2820/ 173500 | consumed samples: 721920 | consumed tokens: 1478492160 | elapsed time per iteration (s): 0.81 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.771693E+00 | grad norm: 0.275 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.767 | TFLOPs: 19.04 | 31: iteration 2830/ 173500 | consumed samples: 724480 | consumed tokens: 1483735040 | elapsed time per iteration (s): 0.84 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.762297E+00 | grad norm: 0.278 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.989 | TFLOPs: 18.39 | 31: iteration 2840/ 173500 | consumed samples: 727040 | consumed tokens: 1488977920 | elapsed time per iteration (s): 0.79 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.718637E+00 | grad norm: 0.288 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.011 | TFLOPs: 19.60 | 31: iteration 2850/ 173500 | consumed samples: 729600 | consumed tokens: 
1494220800 | elapsed time per iteration (s): 0.76 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.739277E+00 | grad norm: 0.281 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.064 | TFLOPs: 20.51 | 31: iteration 2860/ 173500 | consumed samples: 732160 | consumed tokens: 1499463680 | elapsed time per iteration (s): 0.77 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.735255E+00 | grad norm: 0.282 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.384 | TFLOPs: 20.05 | 31: iteration 2870/ 173500 | consumed samples: 734720 | consumed tokens: 1504706560 | elapsed time per iteration (s): 0.75 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.743756E+00 | grad norm: 0.318 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.586 | TFLOPs: 20.73 | 31: iteration 2880/ 173500 | consumed samples: 737280 | consumed tokens: 1509949440 | elapsed time per iteration (s): 0.78 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.732587E+00 | grad norm: 0.308 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.348 | TFLOPs: 19.80 | 31: iteration 2890/ 173500 | consumed samples: 739840 | consumed tokens: 1515192320 | elapsed time per iteration (s): 0.75 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.714799E+00 | grad norm: 0.289 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.043 | TFLOPs: 20.69 | 31: iteration 2900/ 173500 | consumed samples: 742400 | consumed tokens: 1520435200 | elapsed time per iteration (s): 0.78 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.734483E+00 | grad norm: 2.349 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.172 | 
TFLOPs: 19.85 | 31: iteration 2910/ 173500 | consumed samples: 744960 | consumed tokens: 1525678080 | elapsed time per iteration (s): 0.73 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.786115E+00 | grad norm: 0.381 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.959 | TFLOPs: 21.11 | 31: iteration 2920/ 173500 | consumed samples: 747520 | consumed tokens: 1530920960 | elapsed time per iteration (s): 0.79 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.719038E+00 | grad norm: 0.295 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.336 | TFLOPs: 19.56 | 31: iteration 2930/ 173500 | consumed samples: 750080 | consumed tokens: 1536163840 | elapsed time per iteration (s): 0.85 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.702718E+00 | grad norm: 0.306 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.705 | TFLOPs: 18.19 | 31: iteration 2940/ 173500 | consumed samples: 752640 | consumed tokens: 1541406720 | elapsed time per iteration (s): 0.83 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.704169E+00 | grad norm: 0.294 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.619 | TFLOPs: 18.55 | 31: iteration 2950/ 173500 | consumed samples: 755200 | consumed tokens: 1546649600 | elapsed time per iteration (s): 0.82 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.694225E+00 | grad norm: 0.260 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.357 | TFLOPs: 18.84 | 31: iteration 2960/ 173500 | consumed samples: 757760 | consumed tokens: 1551892480 | elapsed time per iteration (s): 0.86 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.743419E+00 | grad norm: 0.273 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.649 | TFLOPs: 18.01 | 31: iteration 2970/ 173500 | consumed samples: 760320 | consumed tokens: 1557135360 | elapsed time per iteration (s): 0.83 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.747363E+00 | grad norm: 0.271 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.733 | TFLOPs: 18.68 | 31: iteration 2980/ 173500 | consumed samples: 762880 | consumed tokens: 1562378240 | elapsed time per iteration (s): 0.85 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.720155E+00 | grad norm: 0.269 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.875 | TFLOPs: 18.32 | 31: iteration 2990/ 173500 | consumed samples: 765440 | consumed tokens: 1567621120 | elapsed time per iteration (s): 0.82 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.709246E+00 | grad norm: 0.279 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.856 | TFLOPs: 18.87 | 31: iteration 3000/ 173500 | consumed samples: 768000 | consumed tokens: 1572864000 | elapsed time per iteration (s): 0.83 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.712372E+00 | grad norm: 0.285 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.643 | TFLOPs: 18.73 | 0: saving checkpoint at iteration 3000 to checkpoints_1b1long 31: ------------------------------------------------------------------------------------------ 31: valid loss at iteration 3000 | lm loss value: 2.658626E+00 | lm loss PPL: 1.427666E+01 | 31: ------------------------------------------------------------------------------------------ 0: [2022-11-25 18:12:39,425] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step3000 is begin to save! 
0: [2022-11-25 18:12:39,437] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_01-model_00-model_states.pt... 0: [2022-11-25 18:12:39,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_01-model_00-model_states.pt. 0: [2022-11-25 18:12:39,631] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_03-model_00-model_states.pt... 0: [2022-11-25 18:12:39,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_03-model_00-model_states.pt. 0: [2022-11-25 18:12:39,711] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_04-model_00-model_states.pt... 0: [2022-11-25 18:12:39,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_04-model_00-model_states.pt. 0: [2022-11-25 18:12:39,786] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_05-model_00-model_states.pt... 0: [2022-11-25 18:12:39,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_05-model_00-model_states.pt. 0: [2022-11-25 18:12:39,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_06-model_00-model_states.pt... 0: [2022-11-25 18:12:39,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_06-model_00-model_states.pt. 0: [2022-11-25 18:12:39,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_07-model_00-model_states.pt... 0: [2022-11-25 18:12:40,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_07-model_00-model_states.pt. 
0: [2022-11-25 18:12:40,003] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_08-model_00-model_states.pt... 0: [2022-11-25 18:12:40,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_08-model_00-model_states.pt. 0: [2022-11-25 18:12:40,080] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_09-model_00-model_states.pt... 0: [2022-11-25 18:12:40,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_09-model_00-model_states.pt. 0: [2022-11-25 18:12:40,153] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_10-model_00-model_states.pt... 0: [2022-11-25 18:12:40,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_10-model_00-model_states.pt. 0: [2022-11-25 18:12:40,226] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_11-model_00-model_states.pt... 0: [2022-11-25 18:12:40,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_11-model_00-model_states.pt. 0: [2022-11-25 18:12:40,297] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_12-model_00-model_states.pt... 0: [2022-11-25 18:12:40,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_12-model_00-model_states.pt. 0: [2022-11-25 18:12:40,371] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_13-model_00-model_states.pt... 0: [2022-11-25 18:12:40,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_13-model_00-model_states.pt. 
0: [2022-11-25 18:12:40,447] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_14-model_00-model_states.pt... 0: [2022-11-25 18:12:40,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_14-model_00-model_states.pt. 0: [2022-11-25 18:12:40,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_15-model_00-model_states.pt... 0: [2022-11-25 18:12:40,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_15-model_00-model_states.pt. 0: [2022-11-25 18:12:40,590] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_16-model_00-model_states.pt... 0: [2022-11-25 18:12:40,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_16-model_00-model_states.pt. 0: [2022-11-25 18:12:40,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_17-model_00-model_states.pt... 0: [2022-11-25 18:12:40,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_17-model_00-model_states.pt. 0: [2022-11-25 18:12:40,741] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_18-model_00-model_states.pt... 0: [2022-11-25 18:12:40,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_18-model_00-model_states.pt. 0: [2022-11-25 18:12:40,814] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_19-model_00-model_states.pt... 0: [2022-11-25 18:12:40,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_19-model_00-model_states.pt. 
0: [2022-11-25 18:12:40,886] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_20-model_00-model_states.pt... 0: [2022-11-25 18:12:40,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_20-model_00-model_states.pt. 0: [2022-11-25 18:12:40,960] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_21-model_00-model_states.pt... 0: [2022-11-25 18:12:41,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_21-model_00-model_states.pt. 0: [2022-11-25 18:12:41,034] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_22-model_00-model_states.pt... 0: [2022-11-25 18:12:41,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_22-model_00-model_states.pt. 0: [2022-11-25 18:12:41,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_23-model_00-model_states.pt... 0: [2022-11-25 18:12:41,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_23-model_00-model_states.pt. 0: [2022-11-25 18:12:41,189] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_24-model_00-model_states.pt... 0: [2022-11-25 18:12:41,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_24-model_00-model_states.pt. 0: [2022-11-25 18:12:41,263] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_25-model_00-model_states.pt... 0: [2022-11-25 18:12:41,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_25-model_00-model_states.pt. 
0: [2022-11-25 18:12:41,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_26-model_00-model_states.pt... 0: [2022-11-25 18:12:41,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_26-model_00-model_states.pt. 0: [2022-11-25 18:12:41,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_27-model_00-model_states.pt... 0: [2022-11-25 18:12:41,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_27-model_00-model_states.pt. 0: [2022-11-25 18:12:41,487] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_28-model_00-model_states.pt... 0: [2022-11-25 18:12:41,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_28-model_00-model_states.pt. 0: [2022-11-25 18:12:41,556] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/layer_30-model_00-model_states.pt... 0: [2022-11-25 18:12:41,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/layer_30-model_00-model_states.pt. 0: [2022-11-25 18:12:41,561] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step3000/mp_rank_00_model_states.pt 0: [2022-11-25 18:12:41,561] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/mp_rank_00_model_states.pt... 0: [2022-11-25 18:12:41,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/mp_rank_00_model_states.pt. 0: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
6: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
7: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
10: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
2: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
3: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 
20: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 
11: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 
24: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 
29: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 
17: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 
26: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 
27: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
4: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 
8: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 
13: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 
20: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 
24: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 
30: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 
21: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 
27: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 9: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 
1: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 20: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 
20: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 11: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 24: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 
30: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 16: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
22: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:12:41,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:12:41,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:12:41,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 18:12:41,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 3: [2022-11-25 18:12:41,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 11: [2022-11-25 18:12:41,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:12:41,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 18:12:41,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 11: [2022-11-25 18:12:41,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 18:12:41,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 31: [2022-11-25 18:12:41,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 
22: [2022-11-25 18:12:41,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:12:41,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 22: [2022-11-25 18:12:41,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 31: [2022-11-25 18:12:41,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 22: [2022-11-25 18:12:41,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 17: [2022-11-25 18:12:41,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-25 18:12:41,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-25 18:12:41,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 7: [2022-11-25 18:12:41,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:12:41,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
26: [2022-11-25 18:12:41,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 7: [2022-11-25 18:12:41,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 26: [2022-11-25 18:12:41,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 7: [2022-11-25 18:12:41,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 8: [2022-11-25 18:12:41,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 27: [2022-11-25 18:12:41,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 8: [2022-11-25 18:12:41,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 18:12:41,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 27: [2022-11-25 18:12:41,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 29: [2022-11-25 18:12:41,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 27: [2022-11-25 18:12:41,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
29: [2022-11-25 18:12:41,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 4: [2022-11-25 18:12:41,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:12:41,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:12:41,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 4: [2022-11-25 18:12:41,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 18:12:41,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 10: [2022-11-25 18:12:41,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 19: [2022-11-25 18:12:41,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:12:41,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 19: [2022-11-25 18:12:41,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-25 18:12:41,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 13: [2022-11-25 18:12:41,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
9: [2022-11-25 18:12:41,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:12:41,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 18:12:41,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 9: [2022-11-25 18:12:41,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 18:12:41,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 6: [2022-11-25 18:12:41,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:12:41,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 18:12:41,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 5: [2022-11-25 18:12:41,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 23: [2022-11-25 18:12:41,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 11: [2022-11-25 18:12:41,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:12:41,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
5: [2022-11-25 18:12:41,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 23: [2022-11-25 18:12:41,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 5: [2022-11-25 18:12:41,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 23: [2022-11-25 18:12:41,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 11: [2022-11-25 18:12:41,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 18: [2022-11-25 18:12:41,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 5: [2022-11-25 18:12:41,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 11: [2022-11-25 18:12:41,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 18: [2022-11-25 18:12:41,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 5: [2022-11-25 18:12:41,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 18:12:41,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 13: [2022-11-25 18:12:41,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-25 18:12:41,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 18:12:41,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 13: [2022-11-25 18:12:41,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:12:41,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 18:12:41,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 0: [2022-11-25 18:12:41,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:12:41,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 18:12:41,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 12: [2022-11-25 18:12:41,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:12:41,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 22: [2022-11-25 18:12:41,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 19: [2022-11-25 18:12:41,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
12: [2022-11-25 18:12:41,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 25: [2022-11-25 18:12:41,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 22: [2022-11-25 18:12:41,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 19: [2022-11-25 18:12:41,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 12: [2022-11-25 18:12:41,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 25: [2022-11-25 18:12:41,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 22: [2022-11-25 18:12:41,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 21: [2022-11-25 18:12:41,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 19: [2022-11-25 18:12:41,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 21: [2022-11-25 18:12:41,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-25 18:12:41,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 9: [2022-11-25 18:12:41,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
11: [2022-11-25 18:12:41,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:12:41,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 9: [2022-11-25 18:12:41,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 31: [2022-11-25 18:12:41,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 9: [2022-11-25 18:12:41,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 11: [2022-11-25 18:12:41,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 31: [2022-11-25 18:12:41,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 11: [2022-11-25 18:12:41,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 5: [2022-11-25 18:12:41,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:12:41,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 14: [2022-11-25 18:12:41,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-25 18:12:41,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
29: [2022-11-25 18:12:41,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:12:41,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 5: [2022-11-25 18:12:41,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 15: [2022-11-25 18:12:41,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 14: [2022-11-25 18:12:41,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 29: [2022-11-25 18:12:41,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 5: [2022-11-25 18:12:41,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 14: [2022-11-25 18:12:41,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 29: [2022-11-25 18:12:41,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 14: [2022-11-25 18:12:41,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 10: [2022-11-25 18:12:41,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 14: [2022-11-25 18:12:41,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
10: [2022-11-25 18:12:41,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 18:12:41,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 0: [2022-11-25 18:12:41,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:12:41,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:12:41,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 17: [2022-11-25 18:12:41,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 19: [2022-11-25 18:12:41,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:12:41,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 31: [2022-11-25 18:12:41,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 0: [2022-11-25 18:12:41,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 25: [2022-11-25 18:12:41,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 31: [2022-11-25 18:12:41,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
19: [2022-11-25 18:12:41,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 0: [2022-11-25 18:12:41,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 19: [2022-11-25 18:12:41,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 10: [2022-11-25 18:12:41,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:12:41,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:12:41,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 13: [2022-11-25 18:12:41,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 10: [2022-11-25 18:12:41,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 13: [2022-11-25 18:12:41,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 6: [2022-11-25 18:12:41,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:12:41,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:12:41,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
12: [2022-11-25 18:12:41,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:12:41,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 20: [2022-11-25 18:12:41,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:12:41,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 23: [2022-11-25 18:12:41,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 11: [2022-11-25 18:12:41,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 14: [2022-11-25 18:12:41,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:12:41,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:12:41,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 22: [2022-11-25 18:12:41,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
17: [2022-11-25 18:12:41,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 18: [2022-11-25 18:12:41,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:12:41,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 5: [2022-11-25 18:12:41,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 18:12:41,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 7: [2022-11-25 18:12:41,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 12: [2022-11-25 18:12:41,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 15: [2022-11-25 18:12:41,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 20: [2022-11-25 18:12:41,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 25: [2022-11-25 18:12:41,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 23: [2022-11-25 18:12:41,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 
11: [2022-11-25 18:12:41,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 14: [2022-11-25 18:12:41,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 31: [2022-11-25 18:12:41,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 22: [2022-11-25 18:12:41,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 17: [2022-11-25 18:12:41,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 6: [2022-11-25 18:12:41,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 7: [2022-11-25 18:12:41,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 12: [2022-11-25 18:12:41,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 15: [2022-11-25 18:12:41,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 20: [2022-11-25 18:12:41,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 25: [2022-11-25 18:12:41,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 23: [2022-11-25 18:12:41,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 11: [2022-11-25 18:12:41,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 14: [2022-11-25 18:12:41,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
31: [2022-11-25 18:12:41,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 29: [2022-11-25 18:12:41,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 22: [2022-11-25 18:12:41,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 17: [2022-11-25 18:12:41,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:12:41,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 6: [2022-11-25 18:12:41,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:12:41,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 20: [2022-11-25 18:12:41,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 23: [2022-11-25 18:12:41,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 14: [2022-11-25 18:12:41,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:12:41,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
17: [2022-11-25 18:12:41,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 18: [2022-11-25 18:12:41,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 6: [2022-11-25 18:12:41,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 15: [2022-11-25 18:12:41,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 20: [2022-11-25 18:12:41,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 23: [2022-11-25 18:12:41,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 14: [2022-11-25 18:12:41,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 17: [2022-11-25 18:12:41,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 6: [2022-11-25 18:12:41,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 15: [2022-11-25 18:12:41,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 20: [2022-11-25 18:12:41,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 23: [2022-11-25 18:12:41,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 14: [2022-11-25 18:12:41,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
20: [2022-11-25 18:12:41,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 23: [2022-11-25 18:12:41,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 20: [2022-11-25 18:12:41,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 23: [2022-11-25 18:12:41,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 20: [2022-11-25 18:12:41,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 23: [2022-11-25 18:12:41,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 0: [2022-11-25 18:12:41,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:12:41,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 18:12:41,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 0: [2022-11-25 18:12:41,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 9: [2022-11-25 18:12:41,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-25 18:12:41,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 18:12:41,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 28: [2022-11-25 18:12:41,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 21: [2022-11-25 18:12:41,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 19: [2022-11-25 18:12:41,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 21: [2022-11-25 18:12:41,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 28: [2022-11-25 18:12:41,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-25 18:12:41,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 21: [2022-11-25 18:12:41,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 19: [2022-11-25 18:12:41,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-25 18:12:41,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
0: [2022-11-25 18:12:41,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 5: [2022-11-25 18:12:41,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:12:41,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:12:41,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:12:41,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 4: [2022-11-25 18:12:41,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 16: [2022-11-25 18:12:41,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 5: [2022-11-25 18:12:41,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 4: [2022-11-25 18:12:41,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 16: [2022-11-25 18:12:41,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 5: [2022-11-25 18:12:41,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 29: [2022-11-25 18:12:41,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
12: [2022-11-25 18:12:41,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:12:41,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-25 18:12:41,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 15: [2022-11-25 18:12:41,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:12:41,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 15: [2022-11-25 18:12:41,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 12: [2022-11-25 18:12:41,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 15: [2022-11-25 18:12:41,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 2: [2022-11-25 18:12:41,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:12:41,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 18:12:41,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 22: [2022-11-25 18:12:41,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
22: [2022-11-25 18:12:41,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-25 18:12:41,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 12: [2022-11-25 18:12:41,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:12:41,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:12:41,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 8: [2022-11-25 18:12:41,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:12:41,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 18: [2022-11-25 18:12:41,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 12: [2022-11-25 18:12:41,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 8: [2022-11-25 18:12:41,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 18:12:41,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 30: [2022-11-25 18:12:41,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 
30: [2022-11-25 18:12:41,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-25 18:12:41,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 7: [2022-11-25 18:12:41,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:12:41,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:12:41,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 16: [2022-11-25 18:12:41,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 7: [2022-11-25 18:12:41,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 16: [2022-11-25 18:12:41,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 2: [2022-11-25 18:12:41,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 9: [2022-11-25 18:12:41,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
2: [2022-11-25 18:12:41,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 9: [2022-11-25 18:12:41,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 2: [2022-11-25 18:12:41,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 9: [2022-11-25 18:12:41,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 7: [2022-11-25 18:12:41,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:12:41,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 18:12:41,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 24: [2022-11-25 18:12:41,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-25 18:12:41,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-25 18:12:41,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 24: [2022-11-25 18:12:41,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
24: [2022-11-25 18:12:41,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-25 18:12:41,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 21: [2022-11-25 18:12:41,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:12:41,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 21: [2022-11-25 18:12:41,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 4: [2022-11-25 18:12:41,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 21: [2022-11-25 18:12:41,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 26: [2022-11-25 18:12:41,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 4: [2022-11-25 18:12:41,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 26: [2022-11-25 18:12:41,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 4: [2022-11-25 18:12:41,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 8: [2022-11-25 18:12:41,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
14: [2022-11-25 18:12:41,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 30: [2022-11-25 18:12:41,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 27: [2022-11-25 18:12:41,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 8: [2022-11-25 18:12:41,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 30: [2022-11-25 18:12:41,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 8: [2022-11-25 18:12:41,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 14: [2022-11-25 18:12:41,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 30: [2022-11-25 18:12:41,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 27: [2022-11-25 18:12:41,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 14: [2022-11-25 18:12:41,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 27: [2022-11-25 18:12:41,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 24: [2022-11-25 18:12:41,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
24: [2022-11-25 18:12:41,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-25 18:12:41,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 16: [2022-11-25 18:12:41,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:12:41,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-25 18:12:41,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 10: [2022-11-25 18:12:41,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:12:41,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 27: [2022-11-25 18:12:41,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
10: [2022-11-25 18:12:41,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 1: [2022-11-25 18:12:41,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 27: [2022-11-25 18:12:41,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 10: [2022-11-25 18:12:41,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 1: [2022-11-25 18:12:41,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 27: [2022-11-25 18:12:41,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 17: [2022-11-25 18:12:41,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-25 18:12:41,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-25 18:12:41,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 21: [2022-11-25 18:12:41,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-25 18:12:41,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-25 18:12:41,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
3: [2022-11-25 18:12:41,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:12:41,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 18:12:41,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 26: [2022-11-25 18:12:41,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:12:41,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-25 18:12:41,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 8: [2022-11-25 18:12:41,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 24: [2022-11-25 18:12:41,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:12:41,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 8: [2022-11-25 18:12:41,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 24: [2022-11-25 18:12:41,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 8: [2022-11-25 18:12:41,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
24: [2022-11-25 18:12:41,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 26: [2022-11-25 18:12:41,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 24: [2022-11-25 18:12:41,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:12:41,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 24: [2022-11-25 18:12:41,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-25 18:12:41,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 10: [2022-11-25 18:12:41,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:12:41,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:12:41,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 10: [2022-11-25 18:12:41,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 25: [2022-11-25 18:12:41,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 10: [2022-11-25 18:12:41,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
1: [2022-11-25 18:12:41,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:12:41,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:12:41,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 18:12:41,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 18:12:41,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 1: [2022-11-25 18:12:41,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 27: [2022-11-25 18:12:41,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-25 18:12:41,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-25 18:12:41,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 6: [2022-11-25 18:12:41,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:12:41,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 18:12:41,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
23: [2022-11-25 18:12:41,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-25 18:12:41,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-25 18:12:41,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 11: [2022-11-25 18:12:41,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-25 18:12:41,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 18:12:41,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 4: [2022-11-25 18:12:41,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:12:41,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 30: [2022-11-25 18:12:41,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
4: [2022-11-25 18:12:41,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 3: [2022-11-25 18:12:41,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 30: [2022-11-25 18:12:41,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 4: [2022-11-25 18:12:41,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 3: [2022-11-25 18:12:41,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 30: [2022-11-25 18:12:41,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 6: [2022-11-25 18:12:41,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:12:41,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:12:41,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:12:41,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 18:12:41,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
1: [2022-11-25 18:12:41,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 13: [2022-11-25 18:12:41,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 1: [2022-11-25 18:12:41,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 13: [2022-11-25 18:12:41,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 31: [2022-11-25 18:12:41,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:12:41,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-25 18:12:41,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 3: [2022-11-25 18:12:41,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 20: [2022-11-25 18:12:41,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 17: [2022-11-25 18:12:41,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:12:41,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
3: [2022-11-25 18:12:41,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 20: [2022-11-25 18:12:41,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 3: [2022-11-25 18:12:41,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 20: [2022-11-25 18:12:41,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 17: [2022-11-25 18:12:41,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 26: [2022-11-25 18:12:41,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 3: [2022-11-25 18:12:41,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 17: [2022-11-25 18:12:41,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 26: [2022-11-25 18:12:41,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 3: [2022-11-25 18:12:41,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 18:12:41,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 0: [2022-11-25 18:12:41,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
16: [2022-11-25 18:12:41,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:12:41,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 20: [2022-11-25 18:12:41,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 30: [2022-11-25 18:12:41,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 19: [2022-11-25 18:12:41,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:12:41,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 16: [2022-11-25 18:12:41,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 2: [2022-11-25 18:12:41,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 30: [2022-11-25 18:12:41,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 0: [2022-11-25 18:12:41,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 16: [2022-11-25 18:12:41,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 2: [2022-11-25 18:12:41,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
20: [2022-11-25 18:12:41,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 30: [2022-11-25 18:12:41,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 19: [2022-11-25 18:12:41,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 20: [2022-11-25 18:12:41,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 19: [2022-11-25 18:12:41,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 9: [2022-11-25 18:12:41,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-25 18:12:41,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 30: [2022-11-25 18:12:41,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 9: [2022-11-25 18:12:41,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 30: [2022-11-25 18:12:41,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-25 18:12:41,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 21: [2022-11-25 18:12:41,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
27: [2022-11-25 18:12:41,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 21: [2022-11-25 18:12:41,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-25 18:12:41,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 27: [2022-11-25 18:12:41,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-25 18:12:41,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 8: [2022-11-25 18:12:41,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:12:41,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:12:41,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:12:41,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:12:41,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 22: [2022-11-25 18:12:41,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
18: [2022-11-25 18:12:41,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 8: [2022-11-25 18:12:41,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 2: [2022-11-25 18:12:41,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 12: [2022-11-25 18:12:41,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 15: [2022-11-25 18:12:41,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 29: [2022-11-25 18:12:41,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 22: [2022-11-25 18:12:41,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 7: [2022-11-25 18:12:41,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 8: [2022-11-25 18:12:41,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 2: [2022-11-25 18:12:41,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 12: [2022-11-25 18:12:41,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 15: [2022-11-25 18:12:41,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
29: [2022-11-25 18:12:41,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 18: [2022-11-25 18:12:41,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 22: [2022-11-25 18:12:41,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 18: [2022-11-25 18:12:41,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 7: [2022-11-25 18:12:41,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 18:12:41,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 18: [2022-11-25 18:12:41,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:12:41,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:12:41,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 16: [2022-11-25 18:12:41,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 18: [2022-11-25 18:12:41,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 16: [2022-11-25 18:12:41,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
2: [2022-11-25 18:12:41,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:12:41,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 18:12:41,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 5: [2022-11-25 18:12:41,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:12:41,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:12:41,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 5: [2022-11-25 18:12:41,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 25: [2022-11-25 18:12:41,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 5: [2022-11-25 18:12:41,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 13: [2022-11-25 18:12:41,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 14: [2022-11-25 18:12:41,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
13: [2022-11-25 18:12:41,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 18:12:41,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 14: [2022-11-25 18:12:41,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 18:12:41,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 1: [2022-11-25 18:12:41,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:12:41,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 18:12:41,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 23: [2022-11-25 18:12:41,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-25 18:12:41,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-25 18:12:41,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 10: [2022-11-25 18:12:41,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-25 18:12:41,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 18:12:41,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 4: [2022-11-25 18:12:41,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 11: [2022-11-25 18:12:41,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:12:41,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 18:12:41,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 11: [2022-11-25 18:12:41,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 18:12:41,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 6: [2022-11-25 18:12:41,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:12:41,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 18:12:41,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 31: [2022-11-25 18:12:41,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
26: [2022-11-25 18:12:41,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:12:41,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-25 18:12:41,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 26: [2022-11-25 18:12:41,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-25 18:12:41,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 16: [2022-11-25 18:12:41,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 24: [2022-11-25 18:12:41,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-25 18:12:41,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-25 18:12:41,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 16: [2022-11-25 18:12:41,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-25 18:12:41,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 3: [2022-11-25 18:12:41,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-25 18:12:41,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 18:12:41,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 2: [2022-11-25 18:12:41,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 28: [2022-11-25 18:12:41,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:12:41,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 20: [2022-11-25 18:12:41,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:12:41,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 28: [2022-11-25 18:12:41,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-25 18:12:41,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 20: [2022-11-25 18:12:41,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-25 18:12:41,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 0: [2022-11-25 18:12:41,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-25 18:12:41,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 18:12:41,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 17: [2022-11-25 18:12:41,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:12:41,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 30: [2022-11-25 18:12:41,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 17: [2022-11-25 18:12:41,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 21: [2022-11-25 18:12:41,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:12:41,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 19: [2022-11-25 18:12:41,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 27: [2022-11-25 18:12:41,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 
7: [2022-11-25 18:12:41,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 30: [2022-11-25 18:12:41,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 17: [2022-11-25 18:12:41,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 21: [2022-11-25 18:12:41,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 18: [2022-11-25 18:12:41,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 19: [2022-11-25 18:12:41,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 7: [2022-11-25 18:12:41,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 30: [2022-11-25 18:12:41,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 21: [2022-11-25 18:12:41,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 18: [2022-11-25 18:12:41,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 19: [2022-11-25 18:12:41,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 27: [2022-11-25 18:12:41,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-25 18:12:41,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
13: [2022-11-25 18:12:41,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:12:41,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 18:12:41,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 15: [2022-11-25 18:12:41,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:12:41,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 18:12:41,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 5: [2022-11-25 18:12:41,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:12:41,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 18:12:41,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 1: [2022-11-25 18:12:41,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:12:41,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 18:12:41,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
29: [2022-11-25 18:12:41,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:12:41,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 22: [2022-11-25 18:12:41,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:12:41,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 22: [2022-11-25 18:12:41,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-25 18:12:41,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 3: [2022-11-25 18:12:41,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:12:41,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 24: [2022-11-25 18:12:41,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
3: [2022-11-25 18:12:41,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 25: [2022-11-25 18:12:41,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 24: [2022-11-25 18:12:41,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 3: [2022-11-25 18:12:41,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 25: [2022-11-25 18:12:41,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 24: [2022-11-25 18:12:41,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 10: [2022-11-25 18:12:41,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:12:41,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 18:12:41,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 8: [2022-11-25 18:12:41,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 14: [2022-11-25 18:12:41,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
8: [2022-11-25 18:12:41,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 14: [2022-11-25 18:12:41,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 8: [2022-11-25 18:12:41,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 14: [2022-11-25 18:12:41,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 23: [2022-11-25 18:12:41,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-25 18:12:41,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-25 18:12:41,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 11: [2022-11-25 18:12:41,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:12:41,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 11: [2022-11-25 18:12:41,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 18:12:41,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
6: [2022-11-25 18:12:41,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 18:12:41,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 31: [2022-11-25 18:12:41,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:12:41,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:12:41,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 26: [2022-11-25 18:12:41,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 31: [2022-11-25 18:12:41,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 26: [2022-11-25 18:12:41,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 0: [2022-11-25 18:12:41,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:12:41,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 9: [2022-11-25 18:12:41,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 17: [2022-11-25 18:12:41,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
21: [2022-11-25 18:12:41,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:12:41,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 4: [2022-11-25 18:12:41,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 9: [2022-11-25 18:12:41,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 17: [2022-11-25 18:12:41,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 21: [2022-11-25 18:12:41,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 0: [2022-11-25 18:12:41,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 4: [2022-11-25 18:12:41,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 9: [2022-11-25 18:12:41,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 17: [2022-11-25 18:12:41,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 21: [2022-11-25 18:12:41,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 8: [2022-11-25 18:12:41,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-25 18:12:41,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 18:12:41,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 27: [2022-11-25 18:12:41,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-25 18:12:41,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-25 18:12:41,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 7: [2022-11-25 18:12:41,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 9: [2022-11-25 18:12:41,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:12:41,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:12:41,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 28: [2022-11-25 18:12:41,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:12:41,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
22: [2022-11-25 18:12:41,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 30: [2022-11-25 18:12:41,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:12:41,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:12:41,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 9: [2022-11-25 18:12:41,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 13: [2022-11-25 18:12:41,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 15: [2022-11-25 18:12:41,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 28: [2022-11-25 18:12:41,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 29: [2022-11-25 18:12:41,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 18: [2022-11-25 18:12:41,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 7: [2022-11-25 18:12:41,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
9: [2022-11-25 18:12:41,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 13: [2022-11-25 18:12:41,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 15: [2022-11-25 18:12:41,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 28: [2022-11-25 18:12:41,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 29: [2022-11-25 18:12:41,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 22: [2022-11-25 18:12:41,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 30: [2022-11-25 18:12:41,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 18: [2022-11-25 18:12:41,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 22: [2022-11-25 18:12:41,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 30: [2022-11-25 18:12:41,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 29: [2022-11-25 18:12:41,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:12:41,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 28: [2022-11-25 18:12:41,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
29: [2022-11-25 18:12:41,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 28: [2022-11-25 18:12:41,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 30: [2022-11-25 18:12:41,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 28: [2022-11-25 18:12:41,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 30: [2022-11-25 18:12:41,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 0: [2022-11-25 18:12:41,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 30: [2022-11-25 18:12:41,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 0: [2022-11-25 18:12:41,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 18:12:41,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 4: [2022-11-25 18:12:41,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:12:41,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
4: [2022-11-25 18:12:41,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 18:12:41,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 12: [2022-11-25 18:12:41,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 18:12:41,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 11: [2022-11-25 18:12:41,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:12:41,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 11: [2022-11-25 18:12:41,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 26: [2022-11-25 18:12:41,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-25 18:12:41,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 11: [2022-11-25 18:12:41,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 19: [2022-11-25 18:12:41,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
19: [2022-11-25 18:12:41,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-25 18:12:41,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 6: [2022-11-25 18:12:41,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:12:41,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 18:12:41,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 16: [2022-11-25 18:12:41,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 20: [2022-11-25 18:12:41,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 22: [2022-11-25 18:12:41,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:12:41,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-25 18:12:41,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
20: [2022-11-25 18:12:41,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 22: [2022-11-25 18:12:41,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 20: [2022-11-25 18:12:41,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 22: [2022-11-25 18:12:41,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 31: [2022-11-25 18:12:41,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:12:41,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-25 18:12:41,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 5: [2022-11-25 18:12:41,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 8: [2022-11-25 18:12:41,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:12:41,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:12:41,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:12:41,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
25: [2022-11-25 18:12:41,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 27: [2022-11-25 18:12:41,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:12:41,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 8: [2022-11-25 18:12:41,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 2: [2022-11-25 18:12:41,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 5: [2022-11-25 18:12:41,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 8: [2022-11-25 18:12:41,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 2: [2022-11-25 18:12:41,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
15: [2022-11-25 18:12:41,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 25: [2022-11-25 18:12:41,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-25 18:12:41,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 27: [2022-11-25 18:12:41,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 1: [2022-11-25 18:12:41,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:12:41,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 15: [2022-11-25 18:12:41,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 25: [2022-11-25 18:12:41,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 25: [2022-11-25 18:12:41,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 27: [2022-11-25 18:12:41,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
1: [2022-11-25 18:12:41,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 2: [2022-11-25 18:12:41,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 18:12:41,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 21: [2022-11-25 18:12:41,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:12:41,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 21: [2022-11-25 18:12:41,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-25 18:12:41,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 24: [2022-11-25 18:12:41,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:12:41,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 24: [2022-11-25 18:12:41,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 10: [2022-11-25 18:12:41,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 24: [2022-11-25 18:12:41,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
10: [2022-11-25 18:12:41,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 7: [2022-11-25 18:12:41,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:12:41,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:12:41,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:12:41,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 7: [2022-11-25 18:12:41,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 18: [2022-11-25 18:12:41,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 7: [2022-11-25 18:12:41,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 16: [2022-11-25 18:12:41,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-25 18:12:41,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 9: [2022-11-25 18:12:41,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-25 18:12:41,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 18:12:41,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 3: [2022-11-25 18:12:41,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 28: [2022-11-25 18:12:41,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:12:41,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 28: [2022-11-25 18:12:41,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 3: [2022-11-25 18:12:41,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 28: [2022-11-25 18:12:41,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 23: [2022-11-25 18:12:41,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-25 18:12:41,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-25 18:12:41,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 19: [2022-11-25 18:12:41,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
19: [2022-11-25 18:12:41,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-25 18:12:41,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 20: [2022-11-25 18:12:41,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-25 18:12:41,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-25 18:12:41,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 14: [2022-11-25 18:12:41,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 18:12:41,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 18:12:41,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 4: [2022-11-25 18:12:41,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:12:41,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 18:12:41,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 17: [2022-11-25 18:12:41,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-25 18:12:41,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-25 18:12:41,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 12: [2022-11-25 18:12:41,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:12:41,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 18:12:41,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 12: [2022-11-25 18:12:41,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:12:41,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 18:12:41,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 28: [2022-11-25 18:12:41,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-25 18:12:41,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-25 18:12:41,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 28: [2022-11-25 18:12:41,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
28: [2022-11-25 18:12:41,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-25 18:12:41,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 28: [2022-11-25 18:12:42,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-25 18:12:42,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step3000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-25 18:12:42,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 0: successfully saved checkpoint at iteration 3000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2692.57 31: iteration 3010/ 173500 | consumed samples: 770560 | consumed tokens: 1578106880 | elapsed time per iteration (s): 1.12 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.691135E+00 | grad norm: 0.271 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.939 | TFLOPs: 13.85 | 31: iteration 3020/ 173500 | consumed samples: 773120 | consumed tokens: 1583349760 | elapsed time per iteration (s): 0.80 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.709500E+00 | grad norm: 0.272 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.005 | TFLOPs: 19.42 | 31: iteration 3030/ 173500 | consumed samples: 775680 | consumed tokens: 1588592640 | elapsed time per iteration (s): 0.83 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.668910E+00 | grad norm: 0.318 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.125 | TFLOPs: 18.76 | 31: iteration 3040/ 173500 | consumed 
samples: 778240 | consumed tokens: 1593835520 | elapsed time per iteration (s): 0.79 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.708531E+00 | grad norm: 0.305 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.584 | TFLOPs: 19.52 | 31: iteration 3050/ 173500 | consumed samples: 780800 | consumed tokens: 1599078400 | elapsed time per iteration (s): 0.79 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.712682E+00 | grad norm: 0.284 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.259 | TFLOPs: 19.56 | 31: iteration 3060/ 173500 | consumed samples: 783360 | consumed tokens: 1604321280 | elapsed time per iteration (s): 0.80 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.727946E+00 | grad norm: 0.285 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.144 | TFLOPs: 19.31 | 31: iteration 3070/ 173500 | consumed samples: 785920 | consumed tokens: 1609564160 | elapsed time per iteration (s): 0.81 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.729058E+00 | grad norm: 0.293 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.148 | TFLOPs: 19.01 | 31: iteration 3080/ 173500 | consumed samples: 788480 | consumed tokens: 1614807040 | elapsed time per iteration (s): 0.78 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.696879E+00 | grad norm: 0.278 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.242 | TFLOPs: 19.74 | 31: iteration 3090/ 173500 | consumed samples: 791040 | consumed tokens: 1620049920 | elapsed time per iteration (s): 0.81 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.696381E+00 | grad norm: 0.361 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 316.535 | TFLOPs: 19.15 | 31: iteration 3100/ 173500 | consumed samples: 793600 | consumed tokens: 1625292800 | elapsed time per iteration (s): 0.81 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.724104E+00 | grad norm: 0.268 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.201 | TFLOPs: 19.19 | 31: iteration 3110/ 173500 | consumed samples: 796160 | consumed tokens: 1630535680 | elapsed time per iteration (s): 0.78 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.658731E+00 | grad norm: 0.284 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.313 | TFLOPs: 19.86 | 31: iteration 3120/ 173500 | consumed samples: 798720 | consumed tokens: 1635778560 | elapsed time per iteration (s): 0.82 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.700621E+00 | grad norm: 0.274 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.355 | TFLOPs: 18.90 | 31: iteration 3130/ 173500 | consumed samples: 801280 | consumed tokens: 1641021440 | elapsed time per iteration (s): 0.81 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.694710E+00 | grad norm: 0.319 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.817 | TFLOPs: 19.05 | 31: iteration 3140/ 173500 | consumed samples: 803840 | consumed tokens: 1646264320 | elapsed time per iteration (s): 0.85 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.769720E+00 | grad norm: 10.671 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.008 | TFLOPs: 18.15 | 31: iteration 3150/ 173500 | consumed samples: 806400 | consumed tokens: 1651507200 | elapsed time per iteration (s): 0.89 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.072229E+00 | grad norm: 1.735 
| num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.245 | TFLOPs: 17.44 | 31: iteration 3160/ 173500 | consumed samples: 808960 | consumed tokens: 1656750080 | elapsed time per iteration (s): 0.80 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.819166E+00 | grad norm: 0.549 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.237 | TFLOPs: 19.43 | 31: iteration 3170/ 173500 | consumed samples: 811520 | consumed tokens: 1661992960 | elapsed time per iteration (s): 0.79 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.755410E+00 | grad norm: 0.331 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.618 | TFLOPs: 19.58 | 31: iteration 3180/ 173500 | consumed samples: 814080 | consumed tokens: 1667235840 | elapsed time per iteration (s): 0.74 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.738843E+00 | grad norm: 0.294 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.377 | TFLOPs: 20.83 | 31: iteration 3190/ 173500 | consumed samples: 816640 | consumed tokens: 1672478720 | elapsed time per iteration (s): 0.82 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.679614E+00 | grad norm: 0.284 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.358 | TFLOPs: 18.96 | 31: iteration 3200/ 173500 | consumed samples: 819200 | consumed tokens: 1677721600 | elapsed time per iteration (s): 0.77 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.677180E+00 | grad norm: 0.279 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.804 | TFLOPs: 20.07 | 31: iteration 3210/ 173500 | consumed samples: 821760 | consumed tokens: 1682964480 | elapsed time per iteration (s): 0.83 | learning 
rate: 2.000E-04 | global batch size: 256 | lm loss: 2.720224E+00 | grad norm: 0.278 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.997 | TFLOPs: 18.63 | 31: iteration 3220/ 173500 | consumed samples: 824320 | consumed tokens: 1688207360 | elapsed time per iteration (s): 0.73 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.678671E+00 | grad norm: 0.290 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.516 | TFLOPs: 21.27 | 31: iteration 3230/ 173500 | consumed samples: 826880 | consumed tokens: 1693450240 | elapsed time per iteration (s): 0.80 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.703848E+00 | grad norm: 0.260 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.080 | TFLOPs: 19.42 | 31: iteration 3240/ 173500 | consumed samples: 829440 | consumed tokens: 1698693120 | elapsed time per iteration (s): 0.76 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.675766E+00 | grad norm: 0.282 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.042 | TFLOPs: 20.45 | 31: iteration 3250/ 173500 | consumed samples: 832000 | consumed tokens: 1703936000 | elapsed time per iteration (s): 0.80 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.657895E+00 | grad norm: 0.282 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.792 | TFLOPs: 19.41 | 31: iteration 3260/ 173500 | consumed samples: 834560 | consumed tokens: 1709178880 | elapsed time per iteration (s): 0.78 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.681714E+00 | grad norm: 0.264 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.564 | TFLOPs: 19.82 | 31: iteration 3270/ 173500 | consumed samples: 
837120 | consumed tokens: 1714421760 | elapsed time per iteration (s): 0.79 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.684441E+00 | grad norm: 0.266 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.077 | TFLOPs: 19.48 | 31: iteration 3280/ 173500 | consumed samples: 839680 | consumed tokens: 1719664640 | elapsed time per iteration (s): 0.78 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.671411E+00 | grad norm: 0.256 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.982 | TFLOPs: 19.90 | 31: iteration 3290/ 173500 | consumed samples: 842240 | consumed tokens: 1724907520 | elapsed time per iteration (s): 0.74 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.643089E+00 | grad norm: 0.360 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.989 | TFLOPs: 21.05 | 31: iteration 3300/ 173500 | consumed samples: 844800 | consumed tokens: 1730150400 | elapsed time per iteration (s): 0.73 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.698905E+00 | grad norm: 0.458 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.778 | TFLOPs: 21.22 | 31: iteration 3310/ 173500 | consumed samples: 847360 | consumed tokens: 1735393280 | elapsed time per iteration (s): 0.77 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.698639E+00 | grad norm: 0.816 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.117 | TFLOPs: 20.03 | 31: iteration 3320/ 173500 | consumed samples: 849920 | consumed tokens: 1740636160 | elapsed time per iteration (s): 0.72 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.719447E+00 | grad norm: 0.300 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 354.143 | TFLOPs: 21.42 | 31: iteration 3330/ 173500 | consumed samples: 852480 | consumed tokens: 1745879040 | elapsed time per iteration (s): 0.80 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.715738E+00 | grad norm: 0.263 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.709 | TFLOPs: 19.28 | 31: iteration 3340/ 173500 | consumed samples: 855040 | consumed tokens: 1751121920 | elapsed time per iteration (s): 0.83 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.701229E+00 | grad norm: 0.264 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.700 | TFLOPs: 18.55 | 31: iteration 3350/ 173500 | consumed samples: 857600 | consumed tokens: 1756364800 | elapsed time per iteration (s): 0.88 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.685644E+00 | grad norm: 0.254 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.926 | TFLOPs: 17.60 | 31: iteration 3360/ 173500 | consumed samples: 860160 | consumed tokens: 1761607680 | elapsed time per iteration (s): 0.83 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.673629E+00 | grad norm: 0.296 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.982 | TFLOPs: 18.69 | 31: iteration 3370/ 173500 | consumed samples: 862720 | consumed tokens: 1766850560 | elapsed time per iteration (s): 0.83 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.692010E+00 | grad norm: 0.249 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.818 | TFLOPs: 18.56 | 31: iteration 3380/ 173500 | consumed samples: 865280 | consumed tokens: 1772093440 | elapsed time per iteration (s): 0.78 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.671740E+00 | grad norm: 0.291 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.867 | TFLOPs: 19.84 | 31: iteration 3390/ 173500 | consumed samples: 867840 | consumed tokens: 1777336320 | elapsed time per iteration (s): 0.84 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.666887E+00 | grad norm: 0.272 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.189 | TFLOPs: 18.40 | 31: iteration 3400/ 173500 | consumed samples: 870400 | consumed tokens: 1782579200 | elapsed time per iteration (s): 0.80 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.641018E+00 | grad norm: 0.278 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.738 | TFLOPs: 19.40 | 31: iteration 3410/ 173500 | consumed samples: 872960 | consumed tokens: 1787822080 | elapsed time per iteration (s): 0.85 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.700440E+00 | grad norm: 0.273 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.504 | TFLOPs: 18.30 | 31: iteration 3420/ 173500 | consumed samples: 875520 | consumed tokens: 1793064960 | elapsed time per iteration (s): 0.81 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.685283E+00 | grad norm: 0.282 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.706 | TFLOPs: 19.04 | 31: iteration 3430/ 173500 | consumed samples: 878080 | consumed tokens: 1798307840 | elapsed time per iteration (s): 0.82 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.641966E+00 | grad norm: 0.270 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.423 | TFLOPs: 18.78 | 31: iteration 3440/ 173500 | consumed samples: 880640 | consumed tokens: 1803550720 | elapsed time per iteration (s): 0.80 | learning rate: 
2.000E-04 | global batch size: 256 | lm loss: 2.614536E+00 | grad norm: 0.283 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.941 | TFLOPs: 19.36 | 31: iteration 3450/ 173500 | consumed samples: 883200 | consumed tokens: 1808793600 | elapsed time per iteration (s): 0.82 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.676817E+00 | grad norm: 0.259 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.701 | TFLOPs: 18.86 | 31: iteration 3460/ 173500 | consumed samples: 885760 | consumed tokens: 1814036480 | elapsed time per iteration (s): 0.82 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.649281E+00 | grad norm: 0.293 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.915 | TFLOPs: 18.93 | 31: iteration 3470/ 173500 | consumed samples: 888320 | consumed tokens: 1819279360 | elapsed time per iteration (s): 0.83 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.677197E+00 | grad norm: 0.270 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.724 | TFLOPs: 18.56 | 31: iteration 3480/ 173500 | consumed samples: 890880 | consumed tokens: 1824522240 | elapsed time per iteration (s): 0.81 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.664370E+00 | grad norm: 0.272 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.713 | TFLOPs: 19.22 | 31: iteration 3490/ 173500 | consumed samples: 893440 | consumed tokens: 1829765120 | elapsed time per iteration (s): 0.79 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.635408E+00 | grad norm: 0.264 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.610 | TFLOPs: 19.64 | 31: iteration 3500/ 173500 | consumed samples: 896000 | 
consumed tokens: 1835008000 | elapsed time per iteration (s): 0.81 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.634587E+00 | grad norm: 0.249 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.285 | TFLOPs: 19.01 | 31: iteration 3510/ 173500 | consumed samples: 898560 | consumed tokens: 1840250880 | elapsed time per iteration (s): 0.78 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.650463E+00 | grad norm: 0.275 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.635 | TFLOPs: 19.76 | 31: iteration 3520/ 173500 | consumed samples: 901120 | consumed tokens: 1845493760 | elapsed time per iteration (s): 0.83 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.628980E+00 | grad norm: 0.281 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.093 | TFLOPs: 18.70 | 31: iteration 3530/ 173500 | consumed samples: 903680 | consumed tokens: 1850736640 | elapsed time per iteration (s): 0.86 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.642033E+00 | grad norm: 0.277 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.905 | TFLOPs: 17.96 | 31: iteration 3540/ 173500 | consumed samples: 906240 | consumed tokens: 1855979520 | elapsed time per iteration (s): 0.83 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.685176E+00 | grad norm: 1.512 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.331 | TFLOPs: 18.71 | 31: iteration 3550/ 173500 | consumed samples: 908800 | consumed tokens: 1861222400 | elapsed time per iteration (s): 0.73 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.871120E+00 | grad norm: 1.474 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
348.712 | TFLOPs: 21.10 | 31: iteration 3560/ 173500 | consumed samples: 911360 | consumed tokens: 1866465280 | elapsed time per iteration (s): 0.78 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.858775E+00 | grad norm: 1.026 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.060 | TFLOPs: 19.91 | 31: iteration 3570/ 173500 | consumed samples: 913920 | consumed tokens: 1871708160 | elapsed time per iteration (s): 0.78 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.782814E+00 | grad norm: 0.337 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.474 | TFLOPs: 19.81 | 31: iteration 3580/ 173500 | consumed samples: 916480 | consumed tokens: 1876951040 | elapsed time per iteration (s): 0.76 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.732027E+00 | grad norm: 0.305 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.824 | TFLOPs: 20.32 | 31: iteration 3590/ 173500 | consumed samples: 919040 | consumed tokens: 1882193920 | elapsed time per iteration (s): 0.73 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.662892E+00 | grad norm: 0.261 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.425 | TFLOPs: 21.26 | 31: iteration 3600/ 173500 | consumed samples: 921600 | consumed tokens: 1887436800 | elapsed time per iteration (s): 0.76 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.666022E+00 | grad norm: 0.254 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.285 | TFLOPs: 20.40 | 31: iteration 3610/ 173500 | consumed samples: 924160 | consumed tokens: 1892679680 | elapsed time per iteration (s): 0.78 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.679718E+00 | grad norm: 0.232 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.565 | TFLOPs: 19.88 | 31: iteration 3620/ 173500 | consumed samples: 926720 | consumed tokens: 1897922560 | elapsed time per iteration (s): 0.78 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.638793E+00 | grad norm: 0.247 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.925 | TFLOPs: 19.84 | 31: iteration 3630/ 173500 | consumed samples: 929280 | consumed tokens: 1903165440 | elapsed time per iteration (s): 0.80 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.668119E+00 | grad norm: 0.265 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.529 | TFLOPs: 19.27 | 31: iteration 3640/ 173500 | consumed samples: 931840 | consumed tokens: 1908408320 | elapsed time per iteration (s): 0.77 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.676598E+00 | grad norm: 0.238 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.827 | TFLOPs: 20.07 | 31: iteration 3650/ 173500 | consumed samples: 934400 | consumed tokens: 1913651200 | elapsed time per iteration (s): 0.81 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.633832E+00 | grad norm: 0.249 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.213 | TFLOPs: 19.01 | 31: iteration 3660/ 173500 | consumed samples: 936960 | consumed tokens: 1918894080 | elapsed time per iteration (s): 0.75 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.636190E+00 | grad norm: 0.242 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.152 | TFLOPs: 20.52 | 31: iteration 3670/ 173500 | consumed samples: 939520 | consumed tokens: 1924136960 | elapsed time per iteration (s): 0.78 | learning rate: 1.999E-04 | 
global batch size: 256 | lm loss: 2.633429E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.232 | TFLOPs: 19.92 | 31: iteration 3680/ 173500 | consumed samples: 942080 | consumed tokens: 1929379840 | elapsed time per iteration (s): 0.79 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.726895E+00 | grad norm: 0.369 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.741 | TFLOPs: 19.59 | 31: iteration 3690/ 173500 | consumed samples: 944640 | consumed tokens: 1934622720 | elapsed time per iteration (s): 0.80 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.692758E+00 | grad norm: 0.257 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.636 | TFLOPs: 19.28 | 31: iteration 3700/ 173500 | consumed samples: 947200 | consumed tokens: 1939865600 | elapsed time per iteration (s): 0.79 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.654785E+00 | grad norm: 0.243 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.819 | TFLOPs: 19.53 | 31: iteration 3710/ 173500 | consumed samples: 949760 | consumed tokens: 1945108480 | elapsed time per iteration (s): 0.78 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.654210E+00 | grad norm: 0.240 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.768 | TFLOPs: 19.77 | 31: iteration 3720/ 173500 | consumed samples: 952320 | consumed tokens: 1950351360 | elapsed time per iteration (s): 0.80 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.661644E+00 | grad norm: 0.237 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.198 | TFLOPs: 19.25 | 31: iteration 3730/ 173500 | consumed samples: 954880 | consumed 
tokens: 1955594240 | elapsed time per iteration (s): 0.77 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.649137E+00 | grad norm: 0.248 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.626 | TFLOPs: 20.18 | 31: iteration 3740/ 173500 | consumed samples: 957440 | consumed tokens: 1960837120 | elapsed time per iteration (s): 0.77 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.628872E+00 | grad norm: 0.239 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.268 | TFLOPs: 20.22 | 31: iteration 3750/ 173500 | consumed samples: 960000 | consumed tokens: 1966080000 | elapsed time per iteration (s): 0.81 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.657244E+00 | grad norm: 0.252 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.794 | TFLOPs: 19.23 | 31: iteration 3760/ 173500 | consumed samples: 962560 | consumed tokens: 1971322880 | elapsed time per iteration (s): 0.78 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.607284E+00 | grad norm: 0.240 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.987 | TFLOPs: 19.84 | 31: iteration 3770/ 173500 | consumed samples: 965120 | consumed tokens: 1976565760 | elapsed time per iteration (s): 0.77 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.633347E+00 | grad norm: 0.230 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.127 | TFLOPs: 20.03 | 31: iteration 3780/ 173500 | consumed samples: 967680 | consumed tokens: 1981808640 | elapsed time per iteration (s): 0.78 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.627719E+00 | grad norm: 0.240 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.704 
| TFLOPs: 19.95 | 31: iteration 3790/ 173500 | consumed samples: 970240 | consumed tokens: 1987051520 | elapsed time per iteration (s): 0.79 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.637014E+00 | grad norm: 0.238 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.712 | TFLOPs: 19.52 | 31: iteration 3800/ 173500 | consumed samples: 972800 | consumed tokens: 1992294400 | elapsed time per iteration (s): 0.81 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.638295E+00 | grad norm: 0.226 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.403 | TFLOPs: 19.02 | 31: iteration 3810/ 173500 | consumed samples: 975360 | consumed tokens: 1997537280 | elapsed time per iteration (s): 0.86 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.625142E+00 | grad norm: 0.240 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.145 | TFLOPs: 17.98 | 31: iteration 3820/ 173500 | consumed samples: 977920 | consumed tokens: 2002780160 | elapsed time per iteration (s): 0.77 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.638219E+00 | grad norm: 0.243 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.723 | TFLOPs: 20.07 | 31: iteration 3830/ 173500 | consumed samples: 980480 | consumed tokens: 2008023040 | elapsed time per iteration (s): 0.84 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.654180E+00 | grad norm: 0.270 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.271 | TFLOPs: 18.35 | 31: iteration 3840/ 173500 | consumed samples: 983040 | consumed tokens: 2013265920 | elapsed time per iteration (s): 0.79 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.621746E+00 | grad norm: 0.243 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.193 | TFLOPs: 19.67 | 31: iteration 3850/ 173500 | consumed samples: 985600 | consumed tokens: 2018508800 | elapsed time per iteration (s): 0.80 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.595617E+00 | grad norm: 0.241 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.332 | TFLOPs: 19.26 | 31: iteration 3860/ 173500 | consumed samples: 988160 | consumed tokens: 2023751680 | elapsed time per iteration (s): 0.78 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.613779E+00 | grad norm: 0.252 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.977 | TFLOPs: 19.96 | 31: iteration 3870/ 173500 | consumed samples: 990720 | consumed tokens: 2028994560 | elapsed time per iteration (s): 0.74 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.583381E+00 | grad norm: 0.249 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.797 | TFLOPs: 20.80 | 31: iteration 3880/ 173500 | consumed samples: 993280 | consumed tokens: 2034237440 | elapsed time per iteration (s): 0.78 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.633810E+00 | grad norm: 0.245 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.624 | TFLOPs: 19.94 | 31: iteration 3890/ 173500 | consumed samples: 995840 | consumed tokens: 2039480320 | elapsed time per iteration (s): 0.79 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.631982E+00 | grad norm: 0.240 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.799 | TFLOPs: 19.53 | 31: iteration 3900/ 173500 | consumed samples: 998400 | consumed tokens: 2044723200 | elapsed time per iteration (s): 0.75 | learning rate: 1.999E-04 | global batch 
size: 256 | lm loss: 2.596708E+00 | grad norm: 0.310 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.662 | TFLOPs: 20.55 | 31: iteration 3910/ 173500 | consumed samples: 1000960 | consumed tokens: 2049966080 | elapsed time per iteration (s): 0.82 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.617095E+00 | grad norm: 0.249 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.633 | TFLOPs: 18.97 | 31: iteration 3920/ 173500 | consumed samples: 1003520 | consumed tokens: 2055208960 | elapsed time per iteration (s): 0.82 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.599180E+00 | grad norm: 0.237 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.888 | TFLOPs: 18.99 | 31: iteration 3930/ 173500 | consumed samples: 1006080 | consumed tokens: 2060451840 | elapsed time per iteration (s): 0.75 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.639936E+00 | grad norm: 0.250 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.181 | TFLOPs: 20.64 | 31: iteration 3940/ 173500 | consumed samples: 1008640 | consumed tokens: 2065694720 | elapsed time per iteration (s): 0.82 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.587834E+00 | grad norm: 2.509 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.507 | TFLOPs: 18.85 | 31: iteration 3950/ 173500 | consumed samples: 1011200 | consumed tokens: 2070937600 | elapsed time per iteration (s): 0.76 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.618128E+00 | grad norm: 0.306 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.283 | TFLOPs: 20.34 | 31: iteration 3960/ 173500 | consumed samples: 1013760 | consumed tokens: 
2076180480 | elapsed time per iteration (s): 0.77 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.584662E+00 | grad norm: 0.252 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.814 | TFLOPs: 20.19 | 31: iteration 3970/ 173500 | consumed samples: 1016320 | consumed tokens: 2081423360 | elapsed time per iteration (s): 0.81 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.640003E+00 | grad norm: 0.290 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.458 | TFLOPs: 19.21 | 31: iteration 3980/ 173500 | consumed samples: 1018880 | consumed tokens: 2086666240 | elapsed time per iteration (s): 0.78 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.615067E+00 | grad norm: 0.249 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.840 | TFLOPs: 19.83 | 31: iteration 3990/ 173500 | consumed samples: 1021440 | consumed tokens: 2091909120 | elapsed time per iteration (s): 0.80 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.628431E+00 | grad norm: 0.239 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.933 | TFLOPs: 19.42 | 0: [2022-11-25 18:25:57,655] [INFO] [logging.py:68:log_dist] [Rank 0] step=4000, skipped=0, lr=[0.00019992278300259638, 0.00019992278300259638, 0.00019992278300259638], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 4000/ 173500 | consumed samples: 1024000 | consumed tokens: 2097152000 | elapsed time per iteration (s): 0.83 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.587053E+00 | grad norm: 0.228 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.426 | TFLOPs: 18.72 | 0: steps: 4000 loss: 2.5488 iter time (s): 0.794 samples/sec: 322.417 31: 
------------------------------------------------------------------------------------------ 31: valid loss at iteration 4000 | lm loss value: 2.673924E+00 | lm loss PPL: 1.449675E+01 | 31: ------------------------------------------------------------------------------------------ 0: saving checkpoint at iteration 4000 to checkpoints_1b1long 0: [2022-11-25 18:25:57,944] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step4000 is begin to save! 0: [2022-11-25 18:25:57,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_01-model_00-model_states.pt... 0: [2022-11-25 18:25:58,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_01-model_00-model_states.pt. 0: [2022-11-25 18:25:58,149] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_03-model_00-model_states.pt... 0: [2022-11-25 18:25:58,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_03-model_00-model_states.pt. 0: [2022-11-25 18:25:58,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_04-model_00-model_states.pt... 0: [2022-11-25 18:25:58,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_04-model_00-model_states.pt. 0: [2022-11-25 18:25:58,299] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_05-model_00-model_states.pt... 0: [2022-11-25 18:25:58,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_05-model_00-model_states.pt. 0: [2022-11-25 18:25:58,374] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_06-model_00-model_states.pt... 
0: [2022-11-25 18:25:58,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_06-model_00-model_states.pt. 0: [2022-11-25 18:25:58,448] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_07-model_00-model_states.pt... 0: [2022-11-25 18:25:58,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_07-model_00-model_states.pt. 0: [2022-11-25 18:25:58,523] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_08-model_00-model_states.pt... 0: [2022-11-25 18:25:58,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_08-model_00-model_states.pt. 0: [2022-11-25 18:25:58,594] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_09-model_00-model_states.pt... 0: [2022-11-25 18:25:58,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_09-model_00-model_states.pt. 0: [2022-11-25 18:25:58,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_10-model_00-model_states.pt... 0: [2022-11-25 18:25:58,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_10-model_00-model_states.pt. 0: [2022-11-25 18:25:58,742] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_11-model_00-model_states.pt... 0: [2022-11-25 18:25:58,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_11-model_00-model_states.pt. 0: [2022-11-25 18:25:58,817] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_12-model_00-model_states.pt... 
0: [2022-11-25 18:25:58,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_12-model_00-model_states.pt. 0: [2022-11-25 18:25:58,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_13-model_00-model_states.pt... 0: [2022-11-25 18:25:58,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_13-model_00-model_states.pt. 0: [2022-11-25 18:25:58,971] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_14-model_00-model_states.pt... 0: [2022-11-25 18:25:59,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_14-model_00-model_states.pt. 0: [2022-11-25 18:25:59,047] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_15-model_00-model_states.pt... 0: [2022-11-25 18:25:59,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_15-model_00-model_states.pt. 0: [2022-11-25 18:25:59,124] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_16-model_00-model_states.pt... 0: [2022-11-25 18:25:59,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_16-model_00-model_states.pt. 0: [2022-11-25 18:25:59,200] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_17-model_00-model_states.pt... 0: [2022-11-25 18:25:59,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_17-model_00-model_states.pt. 0: [2022-11-25 18:25:59,275] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_18-model_00-model_states.pt... 
0: [2022-11-25 18:25:59,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_18-model_00-model_states.pt. 0: [2022-11-25 18:25:59,354] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_19-model_00-model_states.pt... 0: [2022-11-25 18:25:59,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_19-model_00-model_states.pt. 0: [2022-11-25 18:25:59,431] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_20-model_00-model_states.pt... 0: [2022-11-25 18:25:59,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_20-model_00-model_states.pt. 0: [2022-11-25 18:25:59,510] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_21-model_00-model_states.pt... 0: [2022-11-25 18:25:59,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_21-model_00-model_states.pt. 0: [2022-11-25 18:25:59,588] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_22-model_00-model_states.pt... 0: [2022-11-25 18:25:59,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_22-model_00-model_states.pt. 0: [2022-11-25 18:25:59,659] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_23-model_00-model_states.pt... 0: [2022-11-25 18:25:59,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_23-model_00-model_states.pt. 0: [2022-11-25 18:25:59,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_24-model_00-model_states.pt... 
0: [2022-11-25 18:25:59,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_24-model_00-model_states.pt. 0: [2022-11-25 18:25:59,818] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_25-model_00-model_states.pt... 0: [2022-11-25 18:25:59,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_25-model_00-model_states.pt. 0: [2022-11-25 18:25:59,892] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_26-model_00-model_states.pt... 0: [2022-11-25 18:25:59,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_26-model_00-model_states.pt. 0: [2022-11-25 18:25:59,970] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_27-model_00-model_states.pt... 0: [2022-11-25 18:26:00,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_27-model_00-model_states.pt. 0: [2022-11-25 18:26:00,045] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_28-model_00-model_states.pt... 0: [2022-11-25 18:26:00,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_28-model_00-model_states.pt. 0: [2022-11-25 18:26:00,121] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/layer_30-model_00-model_states.pt... 0: [2022-11-25 18:26:00,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/layer_30-model_00-model_states.pt. 
0: [2022-11-25 18:26:00,125] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step4000/mp_rank_00_model_states.pt 0: [2022-11-25 18:26:00,126] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/mp_rank_00_model_states.pt... 0: [2022-11-25 18:26:00,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/mp_rank_00_model_states.pt. 0: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
7: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 
9: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
16: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 
13: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 
20: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 
23: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
14: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 
22: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 
21: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 
19: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
5: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 
10: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
13: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
15: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 
28: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 
29: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 
17: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 
0: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
8: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
20: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 
22: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 
0: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 9: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 10: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 20: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 
28: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:26:00,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:26:00,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:26:00,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:26:00,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 2: [2022-11-25 18:26:00,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 1: [2022-11-25 18:26:00,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
2: [2022-11-25 18:26:00,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 10: [2022-11-25 18:26:00,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:26:00,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 20: [2022-11-25 18:26:00,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:26:00,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 20: [2022-11-25 18:26:00,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-25 18:26:00,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 29: [2022-11-25 18:26:00,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:26:00,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-25 18:26:00,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 25: [2022-11-25 18:26:00,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
25: [2022-11-25 18:26:00,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-25 18:26:00,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 11: [2022-11-25 18:26:00,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-25 18:26:00,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 18:26:00,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 31: [2022-11-25 18:26:00,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:26:00,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-25 18:26:00,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 0: [2022-11-25 18:26:00,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:26:00,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 18:26:00,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 21: [2022-11-25 18:26:00,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
27: [2022-11-25 18:26:00,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 21: [2022-11-25 18:26:00,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-25 18:26:00,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 27: [2022-11-25 18:26:00,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 28: [2022-11-25 18:26:00,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-25 18:26:00,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 27: [2022-11-25 18:26:00,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 28: [2022-11-25 18:26:00,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 13: [2022-11-25 18:26:00,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:26:00,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 3: [2022-11-25 18:26:00,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:26:00,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
3: [2022-11-25 18:26:00,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 18:26:00,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 7: [2022-11-25 18:26:00,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:26:00,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 18:26:00,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 30: [2022-11-25 18:26:00,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-25 18:26:00,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-25 18:26:00,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 24: [2022-11-25 18:26:00,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 22: [2022-11-25 18:26:00,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-25 18:26:00,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-25 18:26:00,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
24: [2022-11-25 18:26:00,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 5: [2022-11-25 18:26:00,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 24: [2022-11-25 18:26:00,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 5: [2022-11-25 18:26:00,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 18:26:00,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 0: [2022-11-25 18:26:00,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:26:00,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 18:26:00,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 15: [2022-11-25 18:26:00,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:26:00,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 18:26:00,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 15: [2022-11-25 18:26:00,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-25 18:26:00,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 18:26:00,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 24: [2022-11-25 18:26:00,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-25 18:26:00,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-25 18:26:00,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 8: [2022-11-25 18:26:00,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-25 18:26:00,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 18:26:00,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 20: [2022-11-25 18:26:00,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-25 18:26:00,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-25 18:26:00,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 11: [2022-11-25 18:26:00,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-25 18:26:00,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 18:26:00,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 12: [2022-11-25 18:26:00,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:26:00,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 18:26:00,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 13: [2022-11-25 18:26:00,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 19: [2022-11-25 18:26:00,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:26:00,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 19: [2022-11-25 18:26:00,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 13: [2022-11-25 18:26:00,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 19: [2022-11-25 18:26:00,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 4: [2022-11-25 18:26:00,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-25 18:26:00,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 18:26:00,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 4: [2022-11-25 18:26:00,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:26:00,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 18:26:00,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 24: [2022-11-25 18:26:00,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 20: [2022-11-25 18:26:00,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 24: [2022-11-25 18:26:00,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-25 18:26:00,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 20: [2022-11-25 18:26:00,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-25 18:26:00,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 7: [2022-11-25 18:26:00,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-25 18:26:00,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:26:00,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 18:26:00,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 18:26:00,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 7: [2022-11-25 18:26:00,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 8: [2022-11-25 18:26:00,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-25 18:26:00,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 18:26:00,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-25 18:26:00,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 8: [2022-11-25 18:26:00,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 18:26:00,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 10: [2022-11-25 18:26:00,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
0: [2022-11-25 18:26:00,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 30: [2022-11-25 18:26:00,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:26:00,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 0: [2022-11-25 18:26:00,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 30: [2022-11-25 18:26:00,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-25 18:26:00,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 0: [2022-11-25 18:26:00,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 10: [2022-11-25 18:26:00,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 30: [2022-11-25 18:26:00,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-25 18:26:00,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-25 18:26:00,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 21: [2022-11-25 18:26:00,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-25 18:26:00,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-25 18:26:00,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 21: [2022-11-25 18:26:00,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-25 18:26:00,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-25 18:26:00,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 13: [2022-11-25 18:26:00,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:26:00,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 18:26:00,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 12: [2022-11-25 18:26:00,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:26:00,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 18:26:00,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 26: [2022-11-25 18:26:00,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
26: [2022-11-25 18:26:00,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 11: [2022-11-25 18:26:00,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:26:00,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 11: [2022-11-25 18:26:00,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 18:26:00,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 29: [2022-11-25 18:26:00,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:26:00,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-25 18:26:00,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 20: [2022-11-25 18:26:00,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-25 18:26:00,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-25 18:26:00,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 14: [2022-11-25 18:26:00,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-25 18:26:00,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 18:26:00,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 5: [2022-11-25 18:26:00,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:26:00,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 18:26:00,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 15: [2022-11-25 18:26:00,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:26:00,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 18:26:00,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 3: [2022-11-25 18:26:00,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:26:00,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 18:26:00,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 21: [2022-11-25 18:26:00,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-25 18:26:00,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-25 18:26:00,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 2: [2022-11-25 18:26:00,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:26:00,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 18:26:00,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 11: [2022-11-25 18:26:00,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-25 18:26:00,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 18:26:00,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 23: [2022-11-25 18:26:00,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-25 18:26:00,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-25 18:26:00,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 8: [2022-11-25 18:26:00,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-25 18:26:00,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 18:26:00,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 18: [2022-11-25 18:26:00,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:26:00,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-25 18:26:00,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 24: [2022-11-25 18:26:00,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-25 18:26:00,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-25 18:26:00,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 27: [2022-11-25 18:26:00,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-25 18:26:00,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-25 18:26:00,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 3: [2022-11-25 18:26:00,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-25 18:26:00,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 18:26:00,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 6: [2022-11-25 18:26:00,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:26:00,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 18:26:00,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 0: [2022-11-25 18:26:00,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:26:00,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:26:00,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:26:00,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 12: [2022-11-25 18:26:00,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 26: [2022-11-25 18:26:00,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 12: [2022-11-25 18:26:00,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
9: [2022-11-25 18:26:00,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 18:26:00,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 18:26:00,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 16: [2022-11-25 18:26:00,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:26:00,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-25 18:26:00,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 17: [2022-11-25 18:26:00,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-25 18:26:00,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-25 18:26:00,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 29: [2022-11-25 18:26:00,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:26:00,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-25 18:26:00,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
25: [2022-11-25 18:26:00,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:26:00,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-25 18:26:00,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 31: [2022-11-25 18:26:00,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:26:00,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-25 18:26:00,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 18: [2022-11-25 18:26:00,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:26:00,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 1: [2022-11-25 18:26:00,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:26:00,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 1: [2022-11-25 18:26:00,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 18:26:00,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
27: [2022-11-25 18:26:00,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-25 18:26:00,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 13: [2022-11-25 18:26:00,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 27: [2022-11-25 18:26:00,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 0: [2022-11-25 18:26:00,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 18:26:00,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 13: [2022-11-25 18:26:00,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 18:26:00,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 12: [2022-11-25 18:26:00,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:26:00,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 18:26:00,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 4: [2022-11-25 18:26:00,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-25 18:26:00,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 18:26:00,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 9: [2022-11-25 18:26:00,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 18:26:00,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 18:26:00,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 7: [2022-11-25 18:26:00,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:26:00,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 18:26:00,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 18: [2022-11-25 18:26:00,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:26:00,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-25 18:26:00,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 30: [2022-11-25 18:26:00,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-25 18:26:00,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 14: [2022-11-25 18:26:00,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 30: [2022-11-25 18:26:00,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 14: [2022-11-25 18:26:00,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 18:26:00,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 31: [2022-11-25 18:26:00,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:26:00,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-25 18:26:00,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 26: [2022-11-25 18:26:00,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:26:00,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-25 18:26:00,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 9: [2022-11-25 18:26:00,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-25 18:26:00,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 18:26:00,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 2: [2022-11-25 18:26:00,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:26:00,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:26:00,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 5: [2022-11-25 18:26:00,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 2: [2022-11-25 18:26:00,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 5: [2022-11-25 18:26:00,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 22: [2022-11-25 18:26:00,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-25 18:26:00,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-25 18:26:00,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 4: [2022-11-25 18:26:00,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-25 18:26:00,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 18:26:00,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 6: [2022-11-25 18:26:00,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:26:00,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 18:26:00,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 15: [2022-11-25 18:26:00,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:26:00,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 18:26:00,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 3: [2022-11-25 18:26:00,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:26:00,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 18:26:00,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 19: [2022-11-25 18:26:00,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 
19: [2022-11-25 18:26:00,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-25 18:26:00,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 22: [2022-11-25 18:26:00,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-25 18:26:00,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-25 18:26:00,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 25: [2022-11-25 18:26:00,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:26:00,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-25 18:26:00,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 28: [2022-11-25 18:26:00,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-25 18:26:00,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-25 18:26:00,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 29: [2022-11-25 18:26:00,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 
29: [2022-11-25 18:26:00,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-25 18:26:00,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 6: [2022-11-25 18:26:00,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:26:00,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 18:26:00,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 5: [2022-11-25 18:26:00,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:26:00,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 18:26:00,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 17: [2022-11-25 18:26:00,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:26:00,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 23: [2022-11-25 18:26:00,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:26:00,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 
6: [2022-11-25 18:26:00,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 17: [2022-11-25 18:26:00,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 23: [2022-11-25 18:26:00,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 17: [2022-11-25 18:26:00,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 26: [2022-11-25 18:26:00,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 6: [2022-11-25 18:26:00,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 23: [2022-11-25 18:26:00,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 27: [2022-11-25 18:26:00,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:26:00,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 27: [2022-11-25 18:26:00,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-25 18:26:00,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 1: [2022-11-25 18:26:00,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-25 18:26:00,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 18:26:00,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 18: [2022-11-25 18:26:00,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:26:00,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-25 18:26:00,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 31: [2022-11-25 18:26:00,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:26:00,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-25 18:26:00,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 2: [2022-11-25 18:26:00,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:26:00,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 18:26:00,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 16: [2022-11-25 18:26:00,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
16: [2022-11-25 18:26:00,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-25 18:26:00,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 22: [2022-11-25 18:26:00,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-25 18:26:00,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-25 18:26:00,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 10: [2022-11-25 18:26:00,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:26:00,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 14: [2022-11-25 18:26:00,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:26:00,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 14: [2022-11-25 18:26:00,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 18:26:00,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 20: [2022-11-25 18:26:00,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 
20: [2022-11-25 18:26:00,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-25 18:26:00,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 1: [2022-11-25 18:26:00,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:26:00,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 18:26:00,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 19: [2022-11-25 18:26:00,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-25 18:26:00,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-25 18:26:00,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-25 18:26:00,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-25 18:26:00,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 19: [2022-11-25 18:26:00,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 23: [2022-11-25 18:26:00,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-25 18:26:00,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-25 18:26:00,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-25 18:26:00,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-25 18:26:00,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 23: [2022-11-25 18:26:00,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 8: [2022-11-25 18:26:00,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-25 18:26:00,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 18:26:00,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 21: [2022-11-25 18:26:00,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-25 18:26:00,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-25 18:26:00,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 15: [2022-11-25 18:26:00,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-25 18:26:00,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 18:26:00,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 1: [2022-11-25 18:26:00,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:26:00,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 17: [2022-11-25 18:26:00,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:26:00,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 17: [2022-11-25 18:26:00,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-25 18:26:00,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 22: [2022-11-25 18:26:00,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 11: [2022-11-25 18:26:00,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
22: [2022-11-25 18:26:00,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 11: [2022-11-25 18:26:00,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 22: [2022-11-25 18:26:00,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 11: [2022-11-25 18:26:00,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 23: [2022-11-25 18:26:00,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-25 18:26:00,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-25 18:26:00,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 13: [2022-11-25 18:26:00,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:26:00,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 18:26:00,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 30: [2022-11-25 18:26:00,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-25 18:26:00,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-25 18:26:00,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 25: [2022-11-25 18:26:00,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:26:00,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-25 18:26:00,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 3: [2022-11-25 18:26:00,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:26:00,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 18:26:00,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 29: [2022-11-25 18:26:00,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 28: [2022-11-25 18:26:00,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:26:00,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 4: [2022-11-25 18:26:00,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
29: [2022-11-25 18:26:00,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 28: [2022-11-25 18:26:00,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 5: [2022-11-25 18:26:00,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:26:00,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 28: [2022-11-25 18:26:00,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 5: [2022-11-25 18:26:00,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 4: [2022-11-25 18:26:00,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 5: [2022-11-25 18:26:00,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 0: [2022-11-25 18:26:00,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:26:00,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 18:26:00,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 26: [2022-11-25 18:26:00,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-25 18:26:00,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-25 18:26:00,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 16: [2022-11-25 18:26:00,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:26:00,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-25 18:26:00,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 31: [2022-11-25 18:26:00,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:26:00,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-25 18:26:00,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 12: [2022-11-25 18:26:00,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:26:00,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 18:26:00,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 14: [2022-11-25 18:26:00,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-25 18:26:00,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 18:26:00,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 19: [2022-11-25 18:26:00,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-25 18:26:00,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-25 18:26:00,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 2: [2022-11-25 18:26:00,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:26:00,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 18:26:00,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 7: [2022-11-25 18:26:00,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:26:00,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 18:26:00,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 25: [2022-11-25 18:26:00,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
25: [2022-11-25 18:26:00,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-25 18:26:00,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 28: [2022-11-25 18:26:00,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-25 18:26:00,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 27: [2022-11-25 18:26:00,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-25 18:26:00,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-25 18:26:00,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 9: [2022-11-25 18:26:00,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-25 18:26:00,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 18:26:00,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 6: [2022-11-25 18:26:00,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-25 18:26:00,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 18:26:00,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 10: [2022-11-25 18:26:00,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:26:00,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 17: [2022-11-25 18:26:00,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:26:00,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 17: [2022-11-25 18:26:00,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-25 18:26:00,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 28: [2022-11-25 18:26:00,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 14: [2022-11-25 18:26:00,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-25 18:26:00,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 18:26:00,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
16: [2022-11-25 18:26:00,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:26:00,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-25 18:26:00,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 16: [2022-11-25 18:26:00,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:26:00,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-25 18:26:00,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 27: [2022-11-25 18:26:00,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-25 18:26:00,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-25 18:26:00,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 2: [2022-11-25 18:26:00,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:26:00,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 18:26:00,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
1: [2022-11-25 18:26:00,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:26:00,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 18:26:00,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 24: [2022-11-25 18:26:00,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-25 18:26:00,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-25 18:26:00,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 29: [2022-11-25 18:26:00,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:26:00,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-25 18:26:00,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 18: [2022-11-25 18:26:00,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:26:00,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-25 18:26:00,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
8: [2022-11-25 18:26:00,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-25 18:26:00,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 18:26:00,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 15: [2022-11-25 18:26:00,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:26:00,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 18:26:00,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 21: [2022-11-25 18:26:00,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-25 18:26:00,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-25 18:26:00,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 17: [2022-11-25 18:26:00,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-25 18:26:00,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-25 18:26:00,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
5: [2022-11-25 18:26:00,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:26:00,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 20: [2022-11-25 18:26:00,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:26:00,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 20: [2022-11-25 18:26:00,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-25 18:26:00,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 11: [2022-11-25 18:26:00,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-25 18:26:00,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 18:26:00,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 26: [2022-11-25 18:26:00,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:26:00,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-25 18:26:00,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
30: [2022-11-25 18:26:00,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-25 18:26:00,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-25 18:26:00,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 18: [2022-11-25 18:26:00,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:26:00,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-25 18:26:00,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 25: [2022-11-25 18:26:00,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:26:00,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-25 18:26:00,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 0: [2022-11-25 18:26:00,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:26:00,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 12: [2022-11-25 18:26:00,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
0: [2022-11-25 18:26:00,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 12: [2022-11-25 18:26:00,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 18:26:00,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 13: [2022-11-25 18:26:00,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:26:00,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 18:26:00,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 19: [2022-11-25 18:26:00,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-25 18:26:00,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 22: [2022-11-25 18:26:00,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 19: [2022-11-25 18:26:00,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 22: [2022-11-25 18:26:00,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-25 18:26:00,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
3: [2022-11-25 18:26:00,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:26:00,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 18:26:00,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 4: [2022-11-25 18:26:00,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:26:00,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 18:26:00,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 6: [2022-11-25 18:26:00,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:26:00,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 18:26:00,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 16: [2022-11-25 18:26:00,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:26:00,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-25 18:26:00,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
9: [2022-11-25 18:26:00,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-25 18:26:00,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 18:26:00,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 14: [2022-11-25 18:26:00,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-25 18:26:00,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 18:26:00,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 1: [2022-11-25 18:26:00,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:26:00,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 27: [2022-11-25 18:26:00,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:26:00,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:26:00,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
27: [2022-11-25 18:26:00,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 10: [2022-11-25 18:26:00,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 18:26:00,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 17: [2022-11-25 18:26:00,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 27: [2022-11-25 18:26:00,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 17: [2022-11-25 18:26:00,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-25 18:26:00,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 31: [2022-11-25 18:26:00,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:26:00,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-25 18:26:00,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 28: [2022-11-25 18:26:00,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-25 18:26:00,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-25 18:26:00,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 28: [2022-11-25 18:26:00,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-25 18:26:00,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-25 18:26:00,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 7: [2022-11-25 18:26:00,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:26:00,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 18:26:00,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 9: [2022-11-25 18:26:00,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-25 18:26:00,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 18:26:00,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 24: [2022-11-25 18:26:00,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
24: [2022-11-25 18:26:00,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-25 18:26:00,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 23: [2022-11-25 18:26:00,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-25 18:26:00,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-25 18:26:00,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 8: [2022-11-25 18:26:00,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-25 18:26:00,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 18:26:00,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 15: [2022-11-25 18:26:00,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:26:00,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 18:26:00,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 5: [2022-11-25 18:26:00,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-25 18:26:00,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 18:26:00,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 20: [2022-11-25 18:26:00,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:26:00,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 20: [2022-11-25 18:26:00,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 26: [2022-11-25 18:26:00,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 20: [2022-11-25 18:26:00,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 26: [2022-11-25 18:26:00,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 30: [2022-11-25 18:26:00,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-25 18:26:00,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-25 18:26:00,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 25: [2022-11-25 18:26:00,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
25: [2022-11-25 18:26:00,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-25 18:26:00,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 4: [2022-11-25 18:26:00,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:26:00,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 18:26:00,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 13: [2022-11-25 18:26:00,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:26:00,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 18:26:00,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 29: [2022-11-25 18:26:00,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:26:00,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-25 18:26:00,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 0: [2022-11-25 18:26:00,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-25 18:26:00,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 18:26:00,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 10: [2022-11-25 18:26:00,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:26:00,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:26:00,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 12: [2022-11-25 18:26:00,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 19: [2022-11-25 18:26:00,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:26:00,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 12: [2022-11-25 18:26:00,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 19: [2022-11-25 18:26:00,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-25 18:26:00,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 22: [2022-11-25 18:26:00,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
22: [2022-11-25 18:26:00,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-25 18:26:00,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 3: [2022-11-25 18:26:00,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:26:00,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 18:26:00,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 10: [2022-11-25 18:26:00,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 11: [2022-11-25 18:26:00,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:26:00,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 11: [2022-11-25 18:26:00,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 18:26:00,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 10: [2022-11-25 18:26:00,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 6: [2022-11-25 18:26:00,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-25 18:26:00,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 18:26:00,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 14: [2022-11-25 18:26:00,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 18:26:00,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 18:26:00,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 18: [2022-11-25 18:26:00,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:26:00,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-25 18:26:00,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 16: [2022-11-25 18:26:00,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:26:00,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-25 18:26:00,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 28: [2022-11-25 18:26:00,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
28: [2022-11-25 18:26:00,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-25 18:26:00,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 13: [2022-11-25 18:26:00,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:26:00,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 18:26:00,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 11: [2022-11-25 18:26:00,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 17: [2022-11-25 18:26:00,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 11: [2022-11-25 18:26:00,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 18:26:00,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 17: [2022-11-25 18:26:00,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-25 18:26:00,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-25 18:26:00,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-25 18:26:00,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 17: [2022-11-25 18:26:00,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 4: [2022-11-25 18:26:00,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:26:00,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 3: [2022-11-25 18:26:00,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:26:00,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 3: [2022-11-25 18:26:00,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 18:26:00,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 27: [2022-11-25 18:26:00,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
27: [2022-11-25 18:26:00,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-25 18:26:00,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 10: [2022-11-25 18:26:00,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:26:00,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 18:26:00,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 19: [2022-11-25 18:26:00,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 22: [2022-11-25 18:26:00,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 19: [2022-11-25 18:26:00,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 22: [2022-11-25 18:26:00,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 19: [2022-11-25 18:26:00,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 22: [2022-11-25 18:26:00,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 20: [2022-11-25 18:26:00,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 
20: [2022-11-25 18:26:00,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 12: [2022-11-25 18:26:00,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 20: [2022-11-25 18:26:00,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 12: [2022-11-25 18:26:00,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 1: [2022-11-25 18:26:00,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:26:00,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 1: [2022-11-25 18:26:00,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 18:26:00,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 30: [2022-11-25 18:26:00,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-25 18:26:00,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-25 18:26:00,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 0: [2022-11-25 18:26:00,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-25 18:26:00,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 18:26:00,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 24: [2022-11-25 18:26:00,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-25 18:26:00,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-25 18:26:00,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 5: [2022-11-25 18:26:00,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:26:00,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 18:26:00,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 8: [2022-11-25 18:26:00,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 18:26:00,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 18:26:00,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 16: [2022-11-25 18:26:00,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-25 18:26:00,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-25 18:26:00,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 15: [2022-11-25 18:26:00,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:26:00,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 18:26:00,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 31: [2022-11-25 18:26:00,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:26:00,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-25 18:26:00,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 6: [2022-11-25 18:26:00,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:26:00,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 18:26:00,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 18: [2022-11-25 18:26:00,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
18: [2022-11-25 18:26:00,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-25 18:26:00,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 7: [2022-11-25 18:26:00,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:26:00,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 18:26:00,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 14: [2022-11-25 18:26:00,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 9: [2022-11-25 18:26:00,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 14: [2022-11-25 18:26:00,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 9: [2022-11-25 18:26:00,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 14: [2022-11-25 18:26:00,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 9: [2022-11-25 18:26:00,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 28: [2022-11-25 18:26:00,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
28: [2022-11-25 18:26:00,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-25 18:26:00,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 31: [2022-11-25 18:26:00,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:26:00,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-25 18:26:00,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 2: [2022-11-25 18:26:00,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:26:00,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 18:26:00,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 25: [2022-11-25 18:26:00,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:26:00,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-25 18:26:00,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 23: [2022-11-25 18:26:00,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-25 18:26:00,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-25 18:26:00,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 29: [2022-11-25 18:26:00,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:26:00,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-25 18:26:00,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 23: [2022-11-25 18:26:00,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-25 18:26:00,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-25 18:26:00,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 26: [2022-11-25 18:26:00,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:26:00,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-25 18:26:00,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 7: [2022-11-25 18:26:00,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-25 18:26:00,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 18:26:00,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 24: [2022-11-25 18:26:00,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:26:00,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 24: [2022-11-25 18:26:00,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-25 18:26:00,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 2: [2022-11-25 18:26:00,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 18:26:00,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 9: [2022-11-25 18:26:00,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-25 18:26:00,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 18:26:00,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 21: [2022-11-25 18:26:00,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-25 18:26:00,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-25 18:26:00,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 21: [2022-11-25 18:26:00,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-25 18:26:00,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step4000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-25 18:26:00,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 0: successfully saved checkpoint at iteration 4000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2978.77 31: iteration 4010/ 173500 | consumed samples: 1026560 | consumed tokens: 2102394880 | elapsed time per iteration (s): 1.07 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.594829E+00 | grad norm: 0.239 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.217 | TFLOPs: 14.53 | 31: iteration 4020/ 173500 | consumed samples: 1029120 | consumed tokens: 2107637760 | elapsed time per iteration (s): 0.82 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.625271E+00 | grad norm: 0.247 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.636 | TFLOPs: 18.97 | 31: iteration 4030/ 173500 | consumed samples: 1031680 | consumed tokens: 2112880640 | elapsed time per iteration (s): 0.84 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.593004E+00 | grad norm: 0.236 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.983 | TFLOPs: 18.45 | 31: iteration 4040/ 173500 | consumed 
samples: 1034240 | consumed tokens: 2118123520 | elapsed time per iteration (s): 0.77 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.566806E+00 | grad norm: 0.245 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.489 | TFLOPs: 19.99 | 31: iteration 4050/ 173500 | consumed samples: 1036800 | consumed tokens: 2123366400 | elapsed time per iteration (s): 0.81 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.576059E+00 | grad norm: 0.235 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.020 | TFLOPs: 19.18 | 31: iteration 4060/ 173500 | consumed samples: 1039360 | consumed tokens: 2128609280 | elapsed time per iteration (s): 0.77 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.613313E+00 | grad norm: 0.226 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.878 | TFLOPs: 20.20 | 31: iteration 4070/ 173500 | consumed samples: 1041920 | consumed tokens: 2133852160 | elapsed time per iteration (s): 0.76 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.624164E+00 | grad norm: 0.250 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.165 | TFLOPs: 20.46 | 31: iteration 4080/ 173500 | consumed samples: 1044480 | consumed tokens: 2139095040 | elapsed time per iteration (s): 0.76 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.592934E+00 | grad norm: 0.234 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.381 | TFLOPs: 20.35 | 31: iteration 4090/ 173500 | consumed samples: 1047040 | consumed tokens: 2144337920 | elapsed time per iteration (s): 0.74 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.601471E+00 | grad norm: 0.237 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 
0 | samples per second: 344.540 | TFLOPs: 20.84 | 31: iteration 4100/ 173500 | consumed samples: 1049600 | consumed tokens: 2149580800 | elapsed time per iteration (s): 0.76 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.593599E+00 | grad norm: 0.241 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.670 | TFLOPs: 20.37 | 31: iteration 4110/ 173500 | consumed samples: 1052160 | consumed tokens: 2154823680 | elapsed time per iteration (s): 0.75 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.559525E+00 | grad norm: 0.294 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.983 | TFLOPs: 20.63 | 31: iteration 4120/ 173500 | consumed samples: 1054720 | consumed tokens: 2160066560 | elapsed time per iteration (s): 0.78 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.553627E+00 | grad norm: 0.235 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.204 | TFLOPs: 19.73 | 31: iteration 4130/ 173500 | consumed samples: 1057280 | consumed tokens: 2165309440 | elapsed time per iteration (s): 0.78 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.559316E+00 | grad norm: 0.241 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.486 | TFLOPs: 19.81 | 31: iteration 4140/ 173500 | consumed samples: 1059840 | consumed tokens: 2170552320 | elapsed time per iteration (s): 0.80 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.575375E+00 | grad norm: 0.241 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.830 | TFLOPs: 19.29 | 31: iteration 4150/ 173500 | consumed samples: 1062400 | consumed tokens: 2175795200 | elapsed time per iteration (s): 0.75 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.587325E+00 | grad 
norm: 0.233 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.921 | TFLOPs: 20.56 | 31: iteration 4160/ 173500 | consumed samples: 1064960 | consumed tokens: 2181038080 | elapsed time per iteration (s): 0.81 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.586562E+00 | grad norm: 0.250 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.499 | TFLOPs: 19.15 | 31: iteration 4170/ 173500 | consumed samples: 1067520 | consumed tokens: 2186280960 | elapsed time per iteration (s): 0.80 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.561454E+00 | grad norm: 0.255 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.001 | TFLOPs: 19.48 | 31: iteration 4180/ 173500 | consumed samples: 1070080 | consumed tokens: 2191523840 | elapsed time per iteration (s): 0.80 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.581708E+00 | grad norm: 0.246 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.196 | TFLOPs: 19.25 | 31: iteration 4190/ 173500 | consumed samples: 1072640 | consumed tokens: 2196766720 | elapsed time per iteration (s): 0.81 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.551950E+00 | grad norm: 0.283 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.963 | TFLOPs: 19.11 | 31: iteration 4200/ 173500 | consumed samples: 1075200 | consumed tokens: 2202009600 | elapsed time per iteration (s): 0.79 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.588389E+00 | grad norm: 0.245 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.151 | TFLOPs: 19.67 | 31: iteration 4210/ 173500 | consumed samples: 1077760 | consumed tokens: 2207252480 | elapsed time per iteration (s): 
0.81 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.583687E+00 | grad norm: 0.251 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.988 | TFLOPs: 19.24 | 31: iteration 4220/ 173500 | consumed samples: 1080320 | consumed tokens: 2212495360 | elapsed time per iteration (s): 0.76 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.601712E+00 | grad norm: 0.249 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.693 | TFLOPs: 20.31 | 31: iteration 4230/ 173500 | consumed samples: 1082880 | consumed tokens: 2217738240 | elapsed time per iteration (s): 0.84 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.578954E+00 | grad norm: 0.251 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.401 | TFLOPs: 18.54 | 31: iteration 4240/ 173500 | consumed samples: 1085440 | consumed tokens: 2222981120 | elapsed time per iteration (s): 0.80 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.578106E+00 | grad norm: 0.242 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.091 | TFLOPs: 19.43 | 31: iteration 4250/ 173500 | consumed samples: 1088000 | consumed tokens: 2228224000 | elapsed time per iteration (s): 0.74 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.558361E+00 | grad norm: 0.229 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.363 | TFLOPs: 21.01 | 31: iteration 4260/ 173500 | consumed samples: 1090560 | consumed tokens: 2233466880 | elapsed time per iteration (s): 0.78 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.582487E+00 | grad norm: 0.228 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.613 | TFLOPs: 19.82 | 31: iteration 4270/ 173500 | 
consumed samples: 1093120 | consumed tokens: 2238709760 | elapsed time per iteration (s): 0.75 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.541633E+00 | grad norm: 0.239 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.407 | TFLOPs: 20.65 | 31: iteration 4280/ 173500 | consumed samples: 1095680 | consumed tokens: 2243952640 | elapsed time per iteration (s): 0.76 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.577404E+00 | grad norm: 0.251 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.292 | TFLOPs: 20.47 | 31: iteration 4290/ 173500 | consumed samples: 1098240 | consumed tokens: 2249195520 | elapsed time per iteration (s): 0.79 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.559375E+00 | grad norm: 0.236 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.527 | TFLOPs: 19.63 | 31: iteration 4300/ 173500 | consumed samples: 1100800 | consumed tokens: 2254438400 | elapsed time per iteration (s): 0.79 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.576643E+00 | grad norm: 0.239 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.288 | TFLOPs: 19.68 | 31: iteration 4310/ 173500 | consumed samples: 1103360 | consumed tokens: 2259681280 | elapsed time per iteration (s): 0.74 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.565716E+00 | grad norm: 0.236 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.213 | TFLOPs: 21.07 | 31: iteration 4320/ 173500 | consumed samples: 1105920 | consumed tokens: 2264924160 | elapsed time per iteration (s): 0.77 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.562486E+00 | grad norm: 0.233 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 332.119 | TFLOPs: 20.09 | 31: iteration 4330/ 173500 | consumed samples: 1108480 | consumed tokens: 2270167040 | elapsed time per iteration (s): 0.81 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.538820E+00 | grad norm: 0.242 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.119 | TFLOPs: 19.06 | 31: iteration 4340/ 173500 | consumed samples: 1111040 | consumed tokens: 2275409920 | elapsed time per iteration (s): 0.80 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.600313E+00 | grad norm: 0.239 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.207 | TFLOPs: 19.43 | 31: iteration 4350/ 173500 | consumed samples: 1113600 | consumed tokens: 2280652800 | elapsed time per iteration (s): 0.78 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.570625E+00 | grad norm: 0.234 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.065 | TFLOPs: 19.97 | 31: iteration 4360/ 173500 | consumed samples: 1116160 | consumed tokens: 2285895680 | elapsed time per iteration (s): 0.82 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.563011E+00 | grad norm: 0.244 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.820 | TFLOPs: 18.80 | 31: iteration 4370/ 173500 | consumed samples: 1118720 | consumed tokens: 2291138560 | elapsed time per iteration (s): 0.78 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.597006E+00 | grad norm: 0.225 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.762 | TFLOPs: 19.95 | 31: iteration 4380/ 173500 | consumed samples: 1121280 | consumed tokens: 2296381440 | elapsed time per iteration (s): 0.75 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 
2.560552E+00 | grad norm: 0.221 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.544 | TFLOPs: 20.78 | 31: iteration 4390/ 173500 | consumed samples: 1123840 | consumed tokens: 2301624320 | elapsed time per iteration (s): 0.78 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.591154E+00 | grad norm: 0.243 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.328 | TFLOPs: 19.92 | 31: iteration 4400/ 173500 | consumed samples: 1126400 | consumed tokens: 2306867200 | elapsed time per iteration (s): 0.75 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.570127E+00 | grad norm: 0.232 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.788 | TFLOPs: 20.62 | 31: iteration 4410/ 173500 | consumed samples: 1128960 | consumed tokens: 2312110080 | elapsed time per iteration (s): 0.77 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.546703E+00 | grad norm: 0.240 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.118 | TFLOPs: 20.15 | 31: iteration 4420/ 173500 | consumed samples: 1131520 | consumed tokens: 2317352960 | elapsed time per iteration (s): 0.76 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.557124E+00 | grad norm: 0.238 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.919 | TFLOPs: 20.32 | 31: iteration 4430/ 173500 | consumed samples: 1134080 | consumed tokens: 2322595840 | elapsed time per iteration (s): 0.78 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.533814E+00 | grad norm: 0.267 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.968 | TFLOPs: 19.90 | 31: iteration 4440/ 173500 | consumed samples: 1136640 | consumed tokens: 2327838720 | elapsed 
time per iteration (s): 0.83 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.568521E+00 | grad norm: 0.232 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.631 | TFLOPs: 18.67 | 31: iteration 4450/ 173500 | consumed samples: 1139200 | consumed tokens: 2333081600 | elapsed time per iteration (s): 0.77 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.570855E+00 | grad norm: 0.234 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.340 | TFLOPs: 20.23 | 31: iteration 4460/ 173500 | consumed samples: 1141760 | consumed tokens: 2338324480 | elapsed time per iteration (s): 0.78 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.510610E+00 | grad norm: 0.236 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.620 | TFLOPs: 19.88 | 31: iteration 4470/ 173500 | consumed samples: 1144320 | consumed tokens: 2343567360 | elapsed time per iteration (s): 0.73 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.549262E+00 | grad norm: 0.258 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.842 | TFLOPs: 21.10 | 31: iteration 4480/ 173500 | consumed samples: 1146880 | consumed tokens: 2348810240 | elapsed time per iteration (s): 0.83 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.568789E+00 | grad norm: 0.246 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.687 | TFLOPs: 18.55 | 31: iteration 4490/ 173500 | consumed samples: 1149440 | consumed tokens: 2354053120 | elapsed time per iteration (s): 0.73 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.523027E+00 | grad norm: 0.244 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.503 | TFLOPs: 21.14 | 31: 
iteration 4500/ 173500 | consumed samples: 1152000 | consumed tokens: 2359296000 | elapsed time per iteration (s): 0.80 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.554301E+00 | grad norm: 0.234 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.340 | TFLOPs: 19.26 | 31: iteration 4510/ 173500 | consumed samples: 1154560 | consumed tokens: 2364538880 | elapsed time per iteration (s): 0.76 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.536400E+00 | grad norm: 0.257 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.860 | TFLOPs: 20.26 | 31: iteration 4520/ 173500 | consumed samples: 1157120 | consumed tokens: 2369781760 | elapsed time per iteration (s): 0.89 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.658251E+00 | grad norm: 4.828 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.062 | TFLOPs: 17.43 | 31: iteration 4530/ 173500 | consumed samples: 1159680 | consumed tokens: 2375024640 | elapsed time per iteration (s): 0.82 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.841803E+00 | grad norm: 1.320 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.721 | TFLOPs: 18.86 | 31: iteration 4540/ 173500 | consumed samples: 1162240 | consumed tokens: 2380267520 | elapsed time per iteration (s): 0.84 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.748450E+00 | grad norm: 0.996 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.403 | TFLOPs: 18.54 | 31: iteration 4550/ 173500 | consumed samples: 1164800 | consumed tokens: 2385510400 | elapsed time per iteration (s): 0.81 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.665420E+00 | grad norm: 0.353 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 315.433 | TFLOPs: 19.08 | 31: iteration 4560/ 173500 | consumed samples: 1167360 | consumed tokens: 2390753280 | elapsed time per iteration (s): 0.83 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.625631E+00 | grad norm: 0.265 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.306 | TFLOPs: 18.71 | 31: iteration 4570/ 173500 | consumed samples: 1169920 | consumed tokens: 2395996160 | elapsed time per iteration (s): 0.86 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.584924E+00 | grad norm: 0.243 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.245 | TFLOPs: 17.98 | 31: iteration 4580/ 173500 | consumed samples: 1172480 | consumed tokens: 2401239040 | elapsed time per iteration (s): 0.84 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.641249E+00 | grad norm: 0.458 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.414 | TFLOPs: 18.42 | 31: iteration 4590/ 173500 | consumed samples: 1175040 | consumed tokens: 2406481920 | elapsed time per iteration (s): 0.81 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.578412E+00 | grad norm: 0.298 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.639 | TFLOPs: 19.10 | 31: iteration 4600/ 173500 | consumed samples: 1177600 | consumed tokens: 2411724800 | elapsed time per iteration (s): 0.80 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.618262E+00 | grad norm: 0.250 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.870 | TFLOPs: 19.41 | 31: iteration 4610/ 173500 | consumed samples: 1180160 | consumed tokens: 2416967680 | elapsed time per iteration (s): 0.82 | learning rate: 1.999E-04 | global batch 
size: 256 | lm loss: 2.575931E+00 | grad norm: 0.241 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.047 | TFLOPs: 18.88 | 31: iteration 4620/ 173500 | consumed samples: 1182720 | consumed tokens: 2422210560 | elapsed time per iteration (s): 0.86 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.577765E+00 | grad norm: 0.242 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.275 | TFLOPs: 18.11 | 31: iteration 4630/ 173500 | consumed samples: 1185280 | consumed tokens: 2427453440 | elapsed time per iteration (s): 0.79 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.589180E+00 | grad norm: 0.233 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.746 | TFLOPs: 19.53 | 31: iteration 4640/ 173500 | consumed samples: 1187840 | consumed tokens: 2432696320 | elapsed time per iteration (s): 0.91 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.596253E+00 | grad norm: 0.223 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 280.811 | TFLOPs: 16.99 | 31: iteration 4650/ 173500 | consumed samples: 1190400 | consumed tokens: 2437939200 | elapsed time per iteration (s): 0.80 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.552785E+00 | grad norm: 0.229 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.850 | TFLOPs: 19.29 | 31: iteration 4660/ 173500 | consumed samples: 1192960 | consumed tokens: 2443182080 | elapsed time per iteration (s): 0.81 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.548692E+00 | grad norm: 0.215 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.637 | TFLOPs: 19.03 | 31: iteration 4670/ 173500 | consumed samples: 1195520 | consumed tokens: 
2448424960 | elapsed time per iteration (s): 0.81 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.530038E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.382 | TFLOPs: 19.14 | 31: iteration 4680/ 173500 | consumed samples: 1198080 | consumed tokens: 2453667840 | elapsed time per iteration (s): 0.80 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.544437E+00 | grad norm: 0.223 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.112 | TFLOPs: 19.24 | 31: iteration 4690/ 173500 | consumed samples: 1200640 | consumed tokens: 2458910720 | elapsed time per iteration (s): 0.77 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.567949E+00 | grad norm: 0.231 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.883 | TFLOPs: 20.08 | 31: iteration 4700/ 173500 | consumed samples: 1203200 | consumed tokens: 2464153600 | elapsed time per iteration (s): 0.75 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.551384E+00 | grad norm: 0.215 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.017 | TFLOPs: 20.57 | 31: iteration 4710/ 173500 | consumed samples: 1205760 | consumed tokens: 2469396480 | elapsed time per iteration (s): 0.71 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.526370E+00 | grad norm: 0.218 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 359.051 | TFLOPs: 21.72 | 31: iteration 4720/ 173500 | consumed samples: 1208320 | consumed tokens: 2474639360 | elapsed time per iteration (s): 0.74 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.545863E+00 | grad norm: 0.228 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.677 | 
TFLOPs: 21.03 | 31: iteration 4730/ 173500 | consumed samples: 1210880 | consumed tokens: 2479882240 | elapsed time per iteration (s): 0.76 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.520924E+00 | grad norm: 0.213 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.118 | TFLOPs: 20.27 | 31: iteration 4740/ 173500 | consumed samples: 1213440 | consumed tokens: 2485125120 | elapsed time per iteration (s): 0.71 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.543975E+00 | grad norm: 0.235 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 360.502 | TFLOPs: 21.81 | 31: iteration 4750/ 173500 | consumed samples: 1216000 | consumed tokens: 2490368000 | elapsed time per iteration (s): 0.72 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.536641E+00 | grad norm: 0.239 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.193 | TFLOPs: 21.49 | 31: iteration 4760/ 173500 | consumed samples: 1218560 | consumed tokens: 2495610880 | elapsed time per iteration (s): 0.86 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.541686E+00 | grad norm: 0.220 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.192 | TFLOPs: 17.92 | 31: iteration 4770/ 173500 | consumed samples: 1221120 | consumed tokens: 2500853760 | elapsed time per iteration (s): 0.78 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.568420E+00 | grad norm: 0.229 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.800 | TFLOPs: 19.83 | 31: iteration 4780/ 173500 | consumed samples: 1223680 | consumed tokens: 2506096640 | elapsed time per iteration (s): 0.78 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.522776E+00 | grad norm: 0.235 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.733 | TFLOPs: 19.83 | 31: iteration 4790/ 173500 | consumed samples: 1226240 | consumed tokens: 2511339520 | elapsed time per iteration (s): 0.74 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.546238E+00 | grad norm: 0.222 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.342 | TFLOPs: 20.95 | 31: iteration 4800/ 173500 | consumed samples: 1228800 | consumed tokens: 2516582400 | elapsed time per iteration (s): 0.86 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.559692E+00 | grad norm: 0.217 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.917 | TFLOPs: 18.08 | 31: iteration 4810/ 173500 | consumed samples: 1231360 | consumed tokens: 2521825280 | elapsed time per iteration (s): 0.85 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.545048E+00 | grad norm: 0.219 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.493 | TFLOPs: 18.30 | 31: iteration 4820/ 173500 | consumed samples: 1233920 | consumed tokens: 2527068160 | elapsed time per iteration (s): 0.78 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.529520E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.543 | TFLOPs: 19.76 | 31: iteration 4830/ 173500 | consumed samples: 1236480 | consumed tokens: 2532311040 | elapsed time per iteration (s): 0.77 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.517578E+00 | grad norm: 0.226 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.702 | TFLOPs: 20.01 | 31: iteration 4840/ 173500 | consumed samples: 1239040 | consumed tokens: 2537553920 | elapsed time per iteration (s): 0.84 | learning rate: 
1.999E-04 | global batch size: 256 | lm loss: 2.494054E+00 | grad norm: 0.222 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.356 | TFLOPs: 18.35 | 31: iteration 4850/ 173500 | consumed samples: 1241600 | consumed tokens: 2542796800 | elapsed time per iteration (s): 0.76 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.529747E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.649 | TFLOPs: 20.37 | 31: iteration 4860/ 173500 | consumed samples: 1244160 | consumed tokens: 2548039680 | elapsed time per iteration (s): 0.81 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.560695E+00 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.262 | TFLOPs: 19.13 | 31: iteration 4870/ 173500 | consumed samples: 1246720 | consumed tokens: 2553282560 | elapsed time per iteration (s): 0.82 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.518886E+00 | grad norm: 0.218 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.048 | TFLOPs: 18.88 | 31: iteration 4880/ 173500 | consumed samples: 1249280 | consumed tokens: 2558525440 | elapsed time per iteration (s): 0.81 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.565259E+00 | grad norm: 1.849 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.207 | TFLOPs: 19.07 | 31: iteration 4890/ 173500 | consumed samples: 1251840 | consumed tokens: 2563768320 | elapsed time per iteration (s): 0.81 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.630914E+00 | grad norm: 0.435 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.116 | TFLOPs: 19.06 | 31: iteration 4900/ 173500 | consumed samples: 
1254400 | consumed tokens: 2569011200 | elapsed time per iteration (s): 0.85 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.549747E+00 | grad norm: 0.241 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.618 | TFLOPs: 18.25 | 31: iteration 4910/ 173500 | consumed samples: 1256960 | consumed tokens: 2574254080 | elapsed time per iteration (s): 0.82 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.520198E+00 | grad norm: 0.227 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.640 | TFLOPs: 18.79 | 31: iteration 4920/ 173500 | consumed samples: 1259520 | consumed tokens: 2579496960 | elapsed time per iteration (s): 0.85 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.558726E+00 | grad norm: 0.237 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.908 | TFLOPs: 18.33 | 31: iteration 4930/ 173500 | consumed samples: 1262080 | consumed tokens: 2584739840 | elapsed time per iteration (s): 0.82 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.530375E+00 | grad norm: 0.223 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.604 | TFLOPs: 18.97 | 31: iteration 4940/ 173500 | consumed samples: 1264640 | consumed tokens: 2589982720 | elapsed time per iteration (s): 0.83 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.538850E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.937 | TFLOPs: 18.75 | 31: iteration 4950/ 173500 | consumed samples: 1267200 | consumed tokens: 2595225600 | elapsed time per iteration (s): 0.82 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.558998E+00 | grad norm: 0.219 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 312.315 | TFLOPs: 18.89 | 31: iteration 4960/ 173500 | consumed samples: 1269760 | consumed tokens: 2600468480 | elapsed time per iteration (s): 0.79 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.526814E+00 | grad norm: 0.221 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.597 | TFLOPs: 19.64 | 31: iteration 4970/ 173500 | consumed samples: 1272320 | consumed tokens: 2605711360 | elapsed time per iteration (s): 0.78 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.522760E+00 | grad norm: 0.220 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.559 | TFLOPs: 19.82 | 31: iteration 4980/ 173500 | consumed samples: 1274880 | consumed tokens: 2610954240 | elapsed time per iteration (s): 0.73 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.535806E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.698 | TFLOPs: 21.10 | 31: iteration 4990/ 173500 | consumed samples: 1277440 | consumed tokens: 2616197120 | elapsed time per iteration (s): 0.76 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.512693E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.656 | TFLOPs: 20.25 | 31: iteration 5000/ 173500 | consumed samples: 1280000 | consumed tokens: 2621440000 | elapsed time per iteration (s): 0.75 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.507177E+00 | grad norm: 0.232 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.599 | TFLOPs: 20.67 | 31: ------------------------------------------------------------------------------------------ 31: valid loss at iteration 5000 | lm loss value: 2.541687E+00 | lm loss PPL: 1.270107E+01 | 31: 
------------------------------------------------------------------------------------------ 0: saving checkpoint at iteration 5000 to checkpoints_1b1long 0: [2022-11-25 18:39:12,451] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step5000 is begin to save! 0: [2022-11-25 18:39:12,464] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_01-model_00-model_states.pt... 0: [2022-11-25 18:39:12,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_01-model_00-model_states.pt. 0: [2022-11-25 18:39:12,701] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_03-model_00-model_states.pt... 0: [2022-11-25 18:39:12,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_03-model_00-model_states.pt. 0: [2022-11-25 18:39:12,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_04-model_00-model_states.pt... 0: [2022-11-25 18:39:12,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_04-model_00-model_states.pt. 0: [2022-11-25 18:39:12,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_05-model_00-model_states.pt... 0: [2022-11-25 18:39:12,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_05-model_00-model_states.pt. 0: [2022-11-25 18:39:12,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_06-model_00-model_states.pt... 0: [2022-11-25 18:39:13,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_06-model_00-model_states.pt. 
0: [2022-11-25 18:39:13,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_07-model_00-model_states.pt... 0: [2022-11-25 18:39:13,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_07-model_00-model_states.pt. 0: [2022-11-25 18:39:13,090] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_08-model_00-model_states.pt... 0: [2022-11-25 18:39:13,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_08-model_00-model_states.pt. 0: [2022-11-25 18:39:13,166] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_09-model_00-model_states.pt... 0: [2022-11-25 18:39:13,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_09-model_00-model_states.pt. 0: [2022-11-25 18:39:13,242] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_10-model_00-model_states.pt... 0: [2022-11-25 18:39:13,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_10-model_00-model_states.pt. 0: [2022-11-25 18:39:13,318] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_11-model_00-model_states.pt... 0: [2022-11-25 18:39:13,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_11-model_00-model_states.pt. 0: [2022-11-25 18:39:13,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_12-model_00-model_states.pt... 0: [2022-11-25 18:39:13,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_12-model_00-model_states.pt. 
0: [2022-11-25 18:39:13,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_13-model_00-model_states.pt... 0: [2022-11-25 18:39:13,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_13-model_00-model_states.pt. 0: [2022-11-25 18:39:13,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_14-model_00-model_states.pt... 0: [2022-11-25 18:39:13,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_14-model_00-model_states.pt. 0: [2022-11-25 18:39:13,628] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_15-model_00-model_states.pt... 0: [2022-11-25 18:39:13,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_15-model_00-model_states.pt. 0: [2022-11-25 18:39:13,706] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_16-model_00-model_states.pt... 0: [2022-11-25 18:39:13,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_16-model_00-model_states.pt. 0: [2022-11-25 18:39:13,784] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_17-model_00-model_states.pt... 0: [2022-11-25 18:39:13,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_17-model_00-model_states.pt. 0: [2022-11-25 18:39:13,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_18-model_00-model_states.pt... 0: [2022-11-25 18:39:13,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_18-model_00-model_states.pt. 
0: [2022-11-25 18:39:13,936] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_19-model_00-model_states.pt... 0: [2022-11-25 18:39:14,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_19-model_00-model_states.pt. 0: [2022-11-25 18:39:14,009] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_20-model_00-model_states.pt... 0: [2022-11-25 18:39:14,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_20-model_00-model_states.pt. 0: [2022-11-25 18:39:14,089] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_21-model_00-model_states.pt... 0: [2022-11-25 18:39:14,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_21-model_00-model_states.pt. 0: [2022-11-25 18:39:14,163] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_22-model_00-model_states.pt... 0: [2022-11-25 18:39:14,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_22-model_00-model_states.pt. 0: [2022-11-25 18:39:14,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_23-model_00-model_states.pt... 0: [2022-11-25 18:39:14,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_23-model_00-model_states.pt. 0: [2022-11-25 18:39:14,330] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_24-model_00-model_states.pt... 0: [2022-11-25 18:39:14,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_24-model_00-model_states.pt. 
0: [2022-11-25 18:39:14,406] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_25-model_00-model_states.pt... 0: [2022-11-25 18:39:14,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_25-model_00-model_states.pt. 0: [2022-11-25 18:39:14,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_26-model_00-model_states.pt... 0: [2022-11-25 18:39:14,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_26-model_00-model_states.pt. 0: [2022-11-25 18:39:14,562] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_27-model_00-model_states.pt... 0: [2022-11-25 18:39:14,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_27-model_00-model_states.pt. 0: [2022-11-25 18:39:14,638] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_28-model_00-model_states.pt... 0: [2022-11-25 18:39:14,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_28-model_00-model_states.pt. 0: [2022-11-25 18:39:14,713] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/layer_30-model_00-model_states.pt... 0: [2022-11-25 18:39:14,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/layer_30-model_00-model_states.pt. 0: [2022-11-25 18:39:14,716] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step5000/mp_rank_00_model_states.pt 0: [2022-11-25 18:39:14,716] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/mp_rank_00_model_states.pt... 
0: [2022-11-25 18:39:14,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/mp_rank_00_model_states.pt. 0: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
7: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
8: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 
16: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
3: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
15: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 
25: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 
28: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 
31: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 
30: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 
26: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 
6: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 
9: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 
16: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 
12: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 
11: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 
31: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 
30: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 
19: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
5: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
2: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 15: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 23: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 24: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
14: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 
26: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 23: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 14: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 
0: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 14: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 14: [2022-11-25 18:39:14,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:39:14,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-25 18:39:14,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-25 18:39:14,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 20: [2022-11-25 18:39:14,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-25 18:39:14,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-25 18:39:14,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 13: [2022-11-25 18:39:14,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-25 18:39:14,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 18:39:14,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 10: [2022-11-25 18:39:14,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:39:14,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 18:39:14,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 8: [2022-11-25 18:39:14,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-25 18:39:14,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 18:39:14,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 0: [2022-11-25 18:39:14,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:39:14,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 18:39:14,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 19: [2022-11-25 18:39:14,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
2: [2022-11-25 18:39:14,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 19: [2022-11-25 18:39:14,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-25 18:39:14,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 2: [2022-11-25 18:39:14,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 18:39:14,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 26: [2022-11-25 18:39:14,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:39:14,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 3: [2022-11-25 18:39:14,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:39:14,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 3: [2022-11-25 18:39:14,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 18:39:14,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 4: [2022-11-25 18:39:14,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-25 18:39:14,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 18:39:14,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 16: [2022-11-25 18:39:14,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:39:14,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 15: [2022-11-25 18:39:14,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:39:14,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 15: [2022-11-25 18:39:14,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 18:39:14,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 12: [2022-11-25 18:39:14,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:39:14,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 18:39:14,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 24: [2022-11-25 18:39:14,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-25 18:39:14,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-25 18:39:14,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 11: [2022-11-25 18:39:14,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 30: [2022-11-25 18:39:14,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-25 18:39:14,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-25 18:39:14,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 11: [2022-11-25 18:39:14,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 18:39:14,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 29: [2022-11-25 18:39:14,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:39:14,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-25 18:39:14,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 27: [2022-11-25 18:39:14,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
27: [2022-11-25 18:39:14,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 12: [2022-11-25 18:39:14,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:39:14,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 18:39:14,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 27: [2022-11-25 18:39:14,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 4: [2022-11-25 18:39:14,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:39:14,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 18:39:14,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 27: [2022-11-25 18:39:14,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-25 18:39:14,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 8: [2022-11-25 18:39:14,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 27: [2022-11-25 18:39:14,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
8: [2022-11-25 18:39:14,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 18:39:14,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 29: [2022-11-25 18:39:14,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:39:14,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-25 18:39:14,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 17: [2022-11-25 18:39:14,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:39:14,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:39:14,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 5: [2022-11-25 18:39:14,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:39:14,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
17: [2022-11-25 18:39:14,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 5: [2022-11-25 18:39:14,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 17: [2022-11-25 18:39:14,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 5: [2022-11-25 18:39:14,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 6: [2022-11-25 18:39:14,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:39:14,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 18:39:14,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 9: [2022-11-25 18:39:14,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-25 18:39:14,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 18:39:14,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 12: [2022-11-25 18:39:14,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:39:14,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
8: [2022-11-25 18:39:14,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:39:14,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 8: [2022-11-25 18:39:14,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 12: [2022-11-25 18:39:14,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 4: [2022-11-25 18:39:14,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 8: [2022-11-25 18:39:14,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 3: [2022-11-25 18:39:14,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:39:14,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 3: [2022-11-25 18:39:14,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 18:39:14,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 29: [2022-11-25 18:39:14,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 21: [2022-11-25 18:39:14,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
29: [2022-11-25 18:39:14,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 21: [2022-11-25 18:39:14,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-25 18:39:14,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 21: [2022-11-25 18:39:14,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-25 18:39:14,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:39:14,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:39:14,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 21: [2022-11-25 18:39:14,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-25 18:39:14,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 3: [2022-11-25 18:39:14,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 21: [2022-11-25 18:39:14,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 21: [2022-11-25 18:39:14,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
3: [2022-11-25 18:39:14,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 24: [2022-11-25 18:39:14,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:39:14,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:39:14,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-25 18:39:14,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 24: [2022-11-25 18:39:14,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 26: [2022-11-25 18:39:14,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:39:14,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 24: [2022-11-25 18:39:14,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 30: [2022-11-25 18:39:14,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:39:14,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
30: [2022-11-25 18:39:14,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-25 18:39:14,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 10: [2022-11-25 18:39:14,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 24: [2022-11-25 18:39:14,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 14: [2022-11-25 18:39:14,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:39:14,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 14: [2022-11-25 18:39:14,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 10: [2022-11-25 18:39:14,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 14: [2022-11-25 18:39:14,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-25 18:39:14,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
14: [2022-11-25 18:39:14,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 24: [2022-11-25 18:39:14,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 14: [2022-11-25 18:39:14,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 10: [2022-11-25 18:39:14,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 24: [2022-11-25 18:39:14,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 10: [2022-11-25 18:39:14,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 18:39:14,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 4: [2022-11-25 18:39:14,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:39:14,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 18:39:14,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 19: [2022-11-25 18:39:14,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:39:14,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
19: [2022-11-25 18:39:14,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 23: [2022-11-25 18:39:14,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:39:14,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 19: [2022-11-25 18:39:14,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 23: [2022-11-25 18:39:14,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-25 18:39:14,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 7: [2022-11-25 18:39:14,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 16: [2022-11-25 18:39:14,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:39:14,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-25 18:39:14,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 28: [2022-11-25 18:39:14,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-25 18:39:14,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-25 18:39:14,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 19: [2022-11-25 18:39:14,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-25 18:39:14,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-25 18:39:14,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 28: [2022-11-25 18:39:14,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-25 18:39:14,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-25 18:39:14,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 2: [2022-11-25 18:39:14,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 8: [2022-11-25 18:39:14,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:39:14,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 9: [2022-11-25 18:39:14,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
2: [2022-11-25 18:39:14,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 0: [2022-11-25 18:39:14,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 9: [2022-11-25 18:39:14,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 8: [2022-11-25 18:39:14,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 2: [2022-11-25 18:39:14,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 9: [2022-11-25 18:39:14,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 8: [2022-11-25 18:39:14,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 0: [2022-11-25 18:39:14,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 14: [2022-11-25 18:39:14,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 18:39:14,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 18:39:14,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 9: [2022-11-25 18:39:14,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-25 18:39:14,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 18:39:14,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 3: [2022-11-25 18:39:14,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:39:14,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 18:39:14,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 0: [2022-11-25 18:39:14,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:39:14,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 18:39:14,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 23: [2022-11-25 18:39:14,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:39:14,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 23: [2022-11-25 18:39:14,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 7: [2022-11-25 18:39:14,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
23: [2022-11-25 18:39:14,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 7: [2022-11-25 18:39:14,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 18:39:14,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 12: [2022-11-25 18:39:14,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:39:14,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 18:39:14,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 21: [2022-11-25 18:39:14,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-25 18:39:14,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 26: [2022-11-25 18:39:14,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 21: [2022-11-25 18:39:14,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 26: [2022-11-25 18:39:14,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-25 18:39:14,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
30: [2022-11-25 18:39:14,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-25 18:39:14,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-25 18:39:14,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 5: [2022-11-25 18:39:14,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:39:14,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 18:39:14,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 0: [2022-11-25 18:39:14,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 18:39:14,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 24: [2022-11-25 18:39:14,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 9: [2022-11-25 18:39:14,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:39:14,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
9: [2022-11-25 18:39:14,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 1: [2022-11-25 18:39:14,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 24: [2022-11-25 18:39:14,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 9: [2022-11-25 18:39:14,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 18: [2022-11-25 18:39:14,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-25 18:39:14,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 1: [2022-11-25 18:39:14,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:39:14,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:39:14,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 24: [2022-11-25 18:39:14,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
1: [2022-11-25 18:39:14,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 18:39:14,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 18:39:14,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 1: [2022-11-25 18:39:14,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 1: [2022-11-25 18:39:14,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 30: [2022-11-25 18:39:14,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-25 18:39:14,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-25 18:39:14,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 13: [2022-11-25 18:39:14,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:39:14,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 18:39:14,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 10: [2022-11-25 18:39:14,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
19: [2022-11-25 18:39:14,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:39:14,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 19: [2022-11-25 18:39:14,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 10: [2022-11-25 18:39:14,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 19: [2022-11-25 18:39:14,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 31: [2022-11-25 18:39:14,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:39:14,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:39:14,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-25 18:39:14,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-25 18:39:14,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-25 18:39:14,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-25 18:39:14,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 31: [2022-11-25 18:39:14,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 31: [2022-11-25 18:39:14,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 14: [2022-11-25 18:39:14,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-25 18:39:14,885] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 18:39:14,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 25: [2022-11-25 18:39:14,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:39:14,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:39:14,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
25: [2022-11-25 18:39:14,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 22: [2022-11-25 18:39:14,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:39:14,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-25 18:39:14,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-25 18:39:14,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 22: [2022-11-25 18:39:14,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-25 18:39:14,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 25: [2022-11-25 18:39:14,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 25: [2022-11-25 18:39:14,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 28: [2022-11-25 18:39:14,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-25 18:39:14,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-25 18:39:14,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
7: [2022-11-25 18:39:14,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:39:14,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 18:39:14,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 29: [2022-11-25 18:39:14,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:39:14,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:39:14,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 29: [2022-11-25 18:39:14,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 5: [2022-11-25 18:39:14,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 29: [2022-11-25 18:39:14,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 16: [2022-11-25 18:39:14,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:39:14,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-25 18:39:14,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
22: [2022-11-25 18:39:14,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-25 18:39:14,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-25 18:39:14,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 11: [2022-11-25 18:39:14,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-25 18:39:14,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 18:39:14,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 17: [2022-11-25 18:39:14,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-25 18:39:14,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-25 18:39:14,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 23: [2022-11-25 18:39:14,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-25 18:39:14,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-25 18:39:14,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
6: [2022-11-25 18:39:14,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:39:14,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 18:39:14,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 12: [2022-11-25 18:39:14,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:39:14,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 18:39:14,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 15: [2022-11-25 18:39:14,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:39:14,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:39:14,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 2: [2022-11-25 18:39:14,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 15: [2022-11-25 18:39:14,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 2: [2022-11-25 18:39:14,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
4: [2022-11-25 18:39:14,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:39:14,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 18:39:14,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 11: [2022-11-25 18:39:14,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-25 18:39:14,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 18:39:14,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 19: [2022-11-25 18:39:14,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-25 18:39:14,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-25 18:39:14,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 6: [2022-11-25 18:39:14,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:39:14,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
6: [2022-11-25 18:39:14,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 18:39:14,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 7: [2022-11-25 18:39:14,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 18:39:14,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 20: [2022-11-25 18:39:14,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-25 18:39:14,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-25 18:39:14,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 6: [2022-11-25 18:39:14,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:39:14,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 18:39:14,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 27: [2022-11-25 18:39:14,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
27: [2022-11-25 18:39:14,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-25 18:39:14,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 15: [2022-11-25 18:39:14,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:39:14,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 18:39:14,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 18: [2022-11-25 18:39:14,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:39:14,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-25 18:39:14,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 22: [2022-11-25 18:39:14,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-25 18:39:14,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-25 18:39:14,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 17: [2022-11-25 18:39:14,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 
17: [2022-11-25 18:39:14,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-25 18:39:14,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 20: [2022-11-25 18:39:14,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-25 18:39:14,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-25 18:39:14,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 3: [2022-11-25 18:39:14,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:39:14,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 18:39:14,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 27: [2022-11-25 18:39:14,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-25 18:39:14,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-25 18:39:14,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 15: [2022-11-25 18:39:14,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-25 18:39:14,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 18:39:14,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 18: [2022-11-25 18:39:14,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:39:14,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-25 18:39:14,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 6: [2022-11-25 18:39:14,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:39:14,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 18:39:14,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 13: [2022-11-25 18:39:14,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:39:14,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 18:39:14,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 20: [2022-11-25 18:39:14,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 
20: [2022-11-25 18:39:14,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-25 18:39:14,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-25 18:39:14,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-25 18:39:14,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 20: [2022-11-25 18:39:14,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 0: [2022-11-25 18:39:14,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:39:14,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 18:39:14,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 13: [2022-11-25 18:39:14,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:39:14,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 18:39:14,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 5: [2022-11-25 18:39:14,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-25 18:39:14,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 18:39:14,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 7: [2022-11-25 18:39:14,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:39:14,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 18:39:14,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 10: [2022-11-25 18:39:14,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:39:14,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 18:39:14,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 17: [2022-11-25 18:39:14,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-25 18:39:14,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-25 18:39:14,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 8: [2022-11-25 18:39:14,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-25 18:39:14,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 18:39:14,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 30: [2022-11-25 18:39:14,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-25 18:39:14,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-25 18:39:14,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 1: [2022-11-25 18:39:14,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:39:14,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 18:39:14,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 28: [2022-11-25 18:39:14,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:39:14,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 28: [2022-11-25 18:39:14,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-25 18:39:14,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
31: [2022-11-25 18:39:14,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-25 18:39:14,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 2: [2022-11-25 18:39:14,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:39:14,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 18:39:14,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 11: [2022-11-25 18:39:14,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-25 18:39:14,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-25 18:39:14,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 18:39:14,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 18:39:14,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 11: [2022-11-25 18:39:14,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 9: [2022-11-25 18:39:14,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-25 18:39:14,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 18:39:14,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 29: [2022-11-25 18:39:14,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:39:14,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-25 18:39:14,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 14: [2022-11-25 18:39:14,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 18:39:14,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 18:39:14,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 7: [2022-11-25 18:39:14,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:39:14,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 18:39:14,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 24: [2022-11-25 18:39:14,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
23: [2022-11-25 18:39:14,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 24: [2022-11-25 18:39:14,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 21: [2022-11-25 18:39:14,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 23: [2022-11-25 18:39:14,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 21: [2022-11-25 18:39:14,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 23: [2022-11-25 18:39:14,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 21: [2022-11-25 18:39:14,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 24: [2022-11-25 18:39:14,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 15: [2022-11-25 18:39:14,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:39:14,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 18:39:14,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 5: [2022-11-25 18:39:14,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-25 18:39:14,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 18:39:14,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 4: [2022-11-25 18:39:14,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:39:14,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 18:39:14,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 16: [2022-11-25 18:39:14,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:39:14,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-25 18:39:14,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 18: [2022-11-25 18:39:14,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:39:14,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 20: [2022-11-25 18:39:14,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:39:14,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
20: [2022-11-25 18:39:14,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-25 18:39:14,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 27: [2022-11-25 18:39:14,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-25 18:39:14,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-25 18:39:14,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 19: [2022-11-25 18:39:14,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-25 18:39:14,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-25 18:39:14,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 12: [2022-11-25 18:39:14,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:39:14,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 18:39:14,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 13: [2022-11-25 18:39:14,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-25 18:39:14,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 18:39:14,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 6: [2022-11-25 18:39:14,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:39:14,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 18:39:14,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 3: [2022-11-25 18:39:14,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:39:14,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 18:39:14,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 9: [2022-11-25 18:39:14,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 18:39:14,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 18:39:14,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 8: [2022-11-25 18:39:14,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-25 18:39:14,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 18:39:14,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 27: [2022-11-25 18:39:14,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-25 18:39:14,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-25 18:39:14,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 17: [2022-11-25 18:39:14,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-25 18:39:14,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 18: [2022-11-25 18:39:14,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:39:14,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-25 18:39:14,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 17: [2022-11-25 18:39:14,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 21: [2022-11-25 18:39:14,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
21: [2022-11-25 18:39:14,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-25 18:39:14,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 13: [2022-11-25 18:39:14,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:39:14,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 18:39:14,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 0: [2022-11-25 18:39:14,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 24: [2022-11-25 18:39:14,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:39:14,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 18:39:14,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 24: [2022-11-25 18:39:14,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-25 18:39:14,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 26: [2022-11-25 18:39:14,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
26: [2022-11-25 18:39:14,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-25 18:39:14,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 22: [2022-11-25 18:39:14,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-25 18:39:14,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-25 18:39:14,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 1: [2022-11-25 18:39:14,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:39:14,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 28: [2022-11-25 18:39:14,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:39:14,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 28: [2022-11-25 18:39:14,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-25 18:39:14,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 29: [2022-11-25 18:39:14,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 
29: [2022-11-25 18:39:14,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-25 18:39:14,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 30: [2022-11-25 18:39:14,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-25 18:39:14,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-25 18:39:14,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 2: [2022-11-25 18:39:14,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:39:14,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:39:14,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 31: [2022-11-25 18:39:14,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 2: [2022-11-25 18:39:14,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 31: [2022-11-25 18:39:14,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 23: [2022-11-25 18:39:14,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
23: [2022-11-25 18:39:14,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-25 18:39:14,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 15: [2022-11-25 18:39:14,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:39:14,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 18:39:14,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 5: [2022-11-25 18:39:14,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:39:14,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 18:39:14,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 14: [2022-11-25 18:39:14,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-25 18:39:14,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 18:39:14,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 26: [2022-11-25 18:39:14,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-25 18:39:14,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-25 18:39:14,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 6: [2022-11-25 18:39:14,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:39:14,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 18:39:14,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 4: [2022-11-25 18:39:14,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:39:14,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 18:39:14,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 7: [2022-11-25 18:39:14,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:39:14,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 18:39:14,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 19: [2022-11-25 18:39:14,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-25 18:39:14,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-25 18:39:14,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 5: [2022-11-25 18:39:14,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:39:14,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 18:39:14,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 13: [2022-11-25 18:39:14,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:39:14,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 18:39:14,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 22: [2022-11-25 18:39:14,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-25 18:39:14,982] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-25 18:39:14,982] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 12: [2022-11-25 18:39:14,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-25 18:39:14,982] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 18:39:14,982] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 24: [2022-11-25 18:39:14,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-25 18:39:14,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 20: [2022-11-25 18:39:14,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:39:14,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:39:14,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 20: [2022-11-25 18:39:14,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 24: [2022-11-25 18:39:14,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
10: [2022-11-25 18:39:14,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 3: [2022-11-25 18:39:14,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 20: [2022-11-25 18:39:14,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 10: [2022-11-25 18:39:14,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 3: [2022-11-25 18:39:14,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 0: [2022-11-25 18:39:14,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:39:14,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 18:39:14,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 16: [2022-11-25 18:39:14,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:39:14,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-25 18:39:14,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 27: [2022-11-25 18:39:14,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 
18: [2022-11-25 18:39:14,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 27: [2022-11-25 18:39:14,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 18: [2022-11-25 18:39:14,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-25 18:39:14,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 27: [2022-11-25 18:39:14,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 9: [2022-11-25 18:39:14,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-25 18:39:14,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 18:39:14,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 17: [2022-11-25 18:39:14,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-25 18:39:14,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-25 18:39:14,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 8: [2022-11-25 18:39:14,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-25 18:39:14,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 18:39:14,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 28: [2022-11-25 18:39:14,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-25 18:39:14,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-25 18:39:14,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 11: [2022-11-25 18:39:14,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-25 18:39:14,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 18:39:14,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 26: [2022-11-25 18:39:14,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:39:14,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-25 18:39:14,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 14: [2022-11-25 18:39:14,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-25 18:39:14,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 18:39:14,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 15: [2022-11-25 18:39:14,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:39:14,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 18:39:14,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 31: [2022-11-25 18:39:15,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:39:15,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-25 18:39:15,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 13: [2022-11-25 18:39:15,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:39:15,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 16: [2022-11-25 18:39:15,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:39:15,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
16: [2022-11-25 18:39:15,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-25 18:39:15,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 5: [2022-11-25 18:39:15,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 22: [2022-11-25 18:39:15,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:39:15,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 22: [2022-11-25 18:39:15,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 5: [2022-11-25 18:39:15,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 22: [2022-11-25 18:39:15,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 3: [2022-11-25 18:39:15,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:39:15,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 18:39:15,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 4: [2022-11-25 18:39:15,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-25 18:39:15,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 18:39:15,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 21: [2022-11-25 18:39:15,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-25 18:39:15,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-25 18:39:15,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 19: [2022-11-25 18:39:15,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-25 18:39:15,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-25 18:39:15,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 12: [2022-11-25 18:39:15,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:39:15,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 18:39:15,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 18: [2022-11-25 18:39:15,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-25 18:39:15,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-25 18:39:15,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 20: [2022-11-25 18:39:15,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 24: [2022-11-25 18:39:15,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 20: [2022-11-25 18:39:15,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-25 18:39:15,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 24: [2022-11-25 18:39:15,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-25 18:39:15,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 22: [2022-11-25 18:39:15,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-25 18:39:15,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-25 18:39:15,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 1: [2022-11-25 18:39:15,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-25 18:39:15,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 18:39:15,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 9: [2022-11-25 18:39:15,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:39:15,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 9: [2022-11-25 18:39:15,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 10: [2022-11-25 18:39:15,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 9: [2022-11-25 18:39:15,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 10: [2022-11-25 18:39:15,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 10: [2022-11-25 18:39:15,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:39:15,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 18:39:15,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 6: [2022-11-25 18:39:15,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-25 18:39:15,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 18:39:15,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 17: [2022-11-25 18:39:15,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-25 18:39:15,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-25 18:39:15,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 8: [2022-11-25 18:39:15,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:39:15,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 8: [2022-11-25 18:39:15,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 18:39:15,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 0: [2022-11-25 18:39:15,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 18:39:15,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 2: [2022-11-25 18:39:15,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-25 18:39:15,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 18:39:15,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 7: [2022-11-25 18:39:15,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:39:15,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 18:39:15,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 29: [2022-11-25 18:39:15,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:39:15,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-25 18:39:15,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 17: [2022-11-25 18:39:15,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-25 18:39:15,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-25 18:39:15,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 11: [2022-11-25 18:39:15,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-25 18:39:15,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 18:39:15,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 23: [2022-11-25 18:39:15,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-25 18:39:15,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-25 18:39:15,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 21: [2022-11-25 18:39:15,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-25 18:39:15,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-25 18:39:15,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 14: [2022-11-25 18:39:15,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-25 18:39:15,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 18:39:15,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 30: [2022-11-25 18:39:15,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-25 18:39:15,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-25 18:39:15,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 26: [2022-11-25 18:39:15,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:39:15,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-25 18:39:15,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 31: [2022-11-25 18:39:15,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:39:15,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 1: [2022-11-25 18:39:15,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:39:15,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 1: [2022-11-25 18:39:15,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 18:39:15,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 27: [2022-11-25 18:39:15,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
27: [2022-11-25 18:39:15,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-25 18:39:15,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 15: [2022-11-25 18:39:15,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:39:15,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 18:39:15,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 16: [2022-11-25 18:39:15,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:39:15,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-25 18:39:15,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 28: [2022-11-25 18:39:15,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-25 18:39:15,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-25 18:39:15,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 11: [2022-11-25 18:39:15,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-25 18:39:15,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 18:39:15,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 23: [2022-11-25 18:39:15,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-25 18:39:15,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-25 18:39:15,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 18: [2022-11-25 18:39:15,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:39:15,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-25 18:39:15,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 2: [2022-11-25 18:39:15,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:39:15,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 18:39:15,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 28: [2022-11-25 18:39:15,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
28: [2022-11-25 18:39:15,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-25 18:39:15,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 2: [2022-11-25 18:39:15,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:39:15,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 18:39:15,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 31: [2022-11-25 18:39:15,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:39:15,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-25 18:39:15,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 1: [2022-11-25 18:39:15,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:39:15,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 18:39:15,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 23: [2022-11-25 18:39:15,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
23: [2022-11-25 18:39:15,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-25 18:39:15,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 30: [2022-11-25 18:39:15,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-25 18:39:15,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-25 18:39:15,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 29: [2022-11-25 18:39:15,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:39:15,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-25 18:39:15,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 25: [2022-11-25 18:39:15,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:39:15,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-25 18:39:15,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 25: [2022-11-25 18:39:15,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
25: [2022-11-25 18:39:15,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-25 18:39:15,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 25: [2022-11-25 18:39:15,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:39:15,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-25 18:39:15,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:39:15,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:39:15,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 25: [2022-11-25 18:39:15,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-25 18:39:15,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step5000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-25 18:39:15,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 25: [2022-11-25 18:39:15,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
0: successfully saved checkpoint at iteration 5000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2748.21 31: iteration 5010/ 173500 | consumed samples: 1282560 | consumed tokens: 2626682880 | elapsed time per iteration (s): 1.01 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.517073E+00 | grad norm: 0.212 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 252.241 | TFLOPs: 15.26 | 31: iteration 5020/ 173500 | consumed samples: 1285120 | consumed tokens: 2631925760 | elapsed time per iteration (s): 0.77 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.497706E+00 | grad norm: 0.216 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.976 | TFLOPs: 20.02 | 31: iteration 5030/ 173500 | consumed samples: 1287680 | consumed tokens: 2637168640 | elapsed time per iteration (s): 0.80 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.510973E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.380 | TFLOPs: 19.32 | 31: iteration 5040/ 173500 | consumed samples: 1290240 | consumed tokens: 2642411520 | elapsed time per iteration (s): 0.80 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.490908E+00 | grad norm: 0.216 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.674 | TFLOPs: 19.46 | 31: iteration 5050/ 173500 | consumed samples: 1292800 | consumed tokens: 2647654400 | elapsed time per iteration (s): 0.78 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.486190E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.239 | TFLOPs: 19.86 | 31: iteration 5060/ 173500 | consumed samples: 1295360 | consumed tokens: 2652897280 | elapsed time per iteration (s): 0.74 | learning rate: 
1.998E-04 | global batch size: 256 | lm loss: 2.490105E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.845 | TFLOPs: 20.98 | 31: iteration 5070/ 173500 | consumed samples: 1297920 | consumed tokens: 2658140160 | elapsed time per iteration (s): 0.76 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.509750E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.337 | TFLOPs: 20.47 | 31: iteration 5080/ 173500 | consumed samples: 1300480 | consumed tokens: 2663383040 | elapsed time per iteration (s): 0.75 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.532042E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.197 | TFLOPs: 20.52 | 31: iteration 5090/ 173500 | consumed samples: 1303040 | consumed tokens: 2668625920 | elapsed time per iteration (s): 0.76 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.511882E+00 | grad norm: 0.213 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.031 | TFLOPs: 20.39 | 31: iteration 5100/ 173500 | consumed samples: 1305600 | consumed tokens: 2673868800 | elapsed time per iteration (s): 0.77 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.539404E+00 | grad norm: 0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.090 | TFLOPs: 20.15 | 31: iteration 5110/ 173500 | consumed samples: 1308160 | consumed tokens: 2679111680 | elapsed time per iteration (s): 0.82 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.518160E+00 | grad norm: 0.222 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.497 | TFLOPs: 18.91 | 31: iteration 5120/ 173500 | consumed samples: 
1310720 | consumed tokens: 2684354560 | elapsed time per iteration (s): 0.77 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.526326E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.018 | TFLOPs: 20.03 | 31: iteration 5130/ 173500 | consumed samples: 1313280 | consumed tokens: 2689597440 | elapsed time per iteration (s): 0.77 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.522643E+00 | grad norm: 0.212 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.489 | TFLOPs: 20.24 | 31: iteration 5140/ 173500 | consumed samples: 1315840 | consumed tokens: 2694840320 | elapsed time per iteration (s): 0.76 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.508045E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.240 | TFLOPs: 20.46 | 31: iteration 5150/ 173500 | consumed samples: 1318400 | consumed tokens: 2700083200 | elapsed time per iteration (s): 0.78 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.521301E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.632 | TFLOPs: 19.76 | 31: iteration 5160/ 173500 | consumed samples: 1320960 | consumed tokens: 2705326080 | elapsed time per iteration (s): 0.76 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.523638E+00 | grad norm: 0.224 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.002 | TFLOPs: 20.45 | 31: iteration 5170/ 173500 | consumed samples: 1323520 | consumed tokens: 2710568960 | elapsed time per iteration (s): 0.76 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.505160E+00 | grad norm: 0.217 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 336.813 | TFLOPs: 20.38 | 31: iteration 5180/ 173500 | consumed samples: 1326080 | consumed tokens: 2715811840 | elapsed time per iteration (s): 0.73 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.511987E+00 | grad norm: 0.225 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.715 | TFLOPs: 21.16 | 31: iteration 5190/ 173500 | consumed samples: 1328640 | consumed tokens: 2721054720 | elapsed time per iteration (s): 0.73 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.492102E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.568 | TFLOPs: 21.15 | 31: iteration 5200/ 173500 | consumed samples: 1331200 | consumed tokens: 2726297600 | elapsed time per iteration (s): 0.74 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.501920E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.450 | TFLOPs: 20.90 | 31: iteration 5210/ 173500 | consumed samples: 1333760 | consumed tokens: 2731540480 | elapsed time per iteration (s): 0.75 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.482447E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.907 | TFLOPs: 20.68 | 31: iteration 5220/ 173500 | consumed samples: 1336320 | consumed tokens: 2736783360 | elapsed time per iteration (s): 0.77 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.492237E+00 | grad norm: 0.228 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.527 | TFLOPs: 20.00 | 31: iteration 5230/ 173500 | consumed samples: 1338880 | consumed tokens: 2742026240 | elapsed time per iteration (s): 0.74 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.532377E+00 | grad norm: 
0.252 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.553 | TFLOPs: 20.84 | 31: iteration 5240/ 173500 | consumed samples: 1341440 | consumed tokens: 2747269120 | elapsed time per iteration (s): 0.74 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.486221E+00 | grad norm: 0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.820 | TFLOPs: 21.04 | 31: iteration 5250/ 173500 | consumed samples: 1344000 | consumed tokens: 2752512000 | elapsed time per iteration (s): 0.80 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.510012E+00 | grad norm: 0.220 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.285 | TFLOPs: 19.38 | 31: iteration 5260/ 173500 | consumed samples: 1346560 | consumed tokens: 2757754880 | elapsed time per iteration (s): 0.72 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.523151E+00 | grad norm: 0.222 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.734 | TFLOPs: 21.46 | 31: iteration 5270/ 173500 | consumed samples: 1349120 | consumed tokens: 2762997760 | elapsed time per iteration (s): 0.78 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.464222E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.132 | TFLOPs: 19.73 | 31: iteration 5280/ 173500 | consumed samples: 1351680 | consumed tokens: 2768240640 | elapsed time per iteration (s): 0.84 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.515331E+00 | grad norm: 0.217 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.813 | TFLOPs: 18.50 | 31: iteration 5290/ 173500 | consumed samples: 1354240 | consumed tokens: 2773483520 | elapsed time per iteration (s): 0.88 
| learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.481397E+00 | grad norm: 0.215 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.814 | TFLOPs: 17.65 | 31: iteration 5300/ 173500 | consumed samples: 1356800 | consumed tokens: 2778726400 | elapsed time per iteration (s): 0.79 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.484777E+00 | grad norm: 0.217 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.144 | TFLOPs: 19.49 | 31: iteration 5310/ 173500 | consumed samples: 1359360 | consumed tokens: 2783969280 | elapsed time per iteration (s): 0.80 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.522642E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.595 | TFLOPs: 19.46 | 31: iteration 5320/ 173500 | consumed samples: 1361920 | consumed tokens: 2789212160 | elapsed time per iteration (s): 0.78 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.494109E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.322 | TFLOPs: 19.74 | 31: iteration 5330/ 173500 | consumed samples: 1364480 | consumed tokens: 2794455040 | elapsed time per iteration (s): 0.80 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.500174E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.528 | TFLOPs: 19.45 | 31: iteration 5340/ 173500 | consumed samples: 1367040 | consumed tokens: 2799697920 | elapsed time per iteration (s): 0.82 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.520557E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.706 | TFLOPs: 18.80 | 31: iteration 5350/ 173500 | 
consumed samples: 1369600 | consumed tokens: 2804940800 | elapsed time per iteration (s): 0.82 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.473726E+00 | grad norm: 0.216 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.252 | TFLOPs: 18.83 | 31: iteration 5360/ 173500 | consumed samples: 1372160 | consumed tokens: 2810183680 | elapsed time per iteration (s): 0.74 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.524390E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.686 | TFLOPs: 20.85 | 31: iteration 5370/ 173500 | consumed samples: 1374720 | consumed tokens: 2815426560 | elapsed time per iteration (s): 0.82 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.508981E+00 | grad norm: 0.213 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.245 | TFLOPs: 18.83 | 31: iteration 5380/ 173500 | consumed samples: 1377280 | consumed tokens: 2820669440 | elapsed time per iteration (s): 0.74 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.468519E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.531 | TFLOPs: 21.02 | 31: iteration 5390/ 173500 | consumed samples: 1379840 | consumed tokens: 2825912320 | elapsed time per iteration (s): 0.76 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.503751E+00 | grad norm: 0.228 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.375 | TFLOPs: 20.41 | 31: iteration 5400/ 173500 | consumed samples: 1382400 | consumed tokens: 2831155200 | elapsed time per iteration (s): 0.74 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.467305E+00 | grad norm: 0.212 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 348.122 | TFLOPs: 21.06 | 31: iteration 5410/ 173500 | consumed samples: 1384960 | consumed tokens: 2836398080 | elapsed time per iteration (s): 0.77 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.501822E+00 | grad norm: 0.253 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.741 | TFLOPs: 20.13 | 31: iteration 5420/ 173500 | consumed samples: 1387520 | consumed tokens: 2841640960 | elapsed time per iteration (s): 0.73 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.492360E+00 | grad norm: 0.225 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.662 | TFLOPs: 21.34 | 31: iteration 5430/ 173500 | consumed samples: 1390080 | consumed tokens: 2846883840 | elapsed time per iteration (s): 0.84 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.512800E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.710 | TFLOPs: 18.43 | 31: iteration 5440/ 173500 | consumed samples: 1392640 | consumed tokens: 2852126720 | elapsed time per iteration (s): 0.74 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.494415E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.396 | TFLOPs: 21.02 | 31: iteration 5450/ 173500 | consumed samples: 1395200 | consumed tokens: 2857369600 | elapsed time per iteration (s): 0.86 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.498802E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.348 | TFLOPs: 17.99 | 31: iteration 5460/ 173500 | consumed samples: 1397760 | consumed tokens: 2862612480 | elapsed time per iteration (s): 0.75 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 
2.517169E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.056 | TFLOPs: 20.57 | 31: iteration 5470/ 173500 | consumed samples: 1400320 | consumed tokens: 2867855360 | elapsed time per iteration (s): 0.74 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.494574E+00 | grad norm: 0.215 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.839 | TFLOPs: 20.92 | 31: iteration 5480/ 173500 | consumed samples: 1402880 | consumed tokens: 2873098240 | elapsed time per iteration (s): 0.81 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.491294E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.544 | TFLOPs: 19.21 | 31: iteration 5490/ 173500 | consumed samples: 1405440 | consumed tokens: 2878341120 | elapsed time per iteration (s): 0.77 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.465084E+00 | grad norm: 0.219 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.294 | TFLOPs: 20.22 | 31: iteration 5500/ 173500 | consumed samples: 1408000 | consumed tokens: 2883584000 | elapsed time per iteration (s): 0.81 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.522560E+00 | grad norm: 0.212 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.582 | TFLOPs: 19.15 | 31: iteration 5510/ 173500 | consumed samples: 1410560 | consumed tokens: 2888826880 | elapsed time per iteration (s): 0.79 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.488873E+00 | grad norm: 0.223 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.874 | TFLOPs: 19.65 | 31: iteration 5520/ 173500 | consumed samples: 1413120 | consumed tokens: 2894069760 | elapsed 
time per iteration (s): 0.78 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.518589E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.521 | TFLOPs: 19.75 | 31: iteration 5530/ 173500 | consumed samples: 1415680 | consumed tokens: 2899312640 | elapsed time per iteration (s): 0.75 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.487925E+00 | grad norm: 0.220 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.756 | TFLOPs: 20.74 | 31: iteration 5540/ 173500 | consumed samples: 1418240 | consumed tokens: 2904555520 | elapsed time per iteration (s): 0.73 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.528610E+00 | grad norm: 0.212 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.529 | TFLOPs: 21.33 | 31: iteration 5550/ 173500 | consumed samples: 1420800 | consumed tokens: 2909798400 | elapsed time per iteration (s): 0.78 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.494900E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.365 | TFLOPs: 19.93 | 31: iteration 5560/ 173500 | consumed samples: 1423360 | consumed tokens: 2915041280 | elapsed time per iteration (s): 0.74 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.486470E+00 | grad norm: 0.219 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.630 | TFLOPs: 20.97 | 31: iteration 5570/ 173500 | consumed samples: 1425920 | consumed tokens: 2920284160 | elapsed time per iteration (s): 0.74 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.513760E+00 | grad norm: 0.264 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.886 | TFLOPs: 21.05 | 31: 
iteration 5580/ 173500 | consumed samples: 1428480 | consumed tokens: 2925527040 | elapsed time per iteration (s): 0.75 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.476903E+00 | grad norm: 0.217 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.839 | TFLOPs: 20.68 | 31: iteration 5590/ 173500 | consumed samples: 1431040 | consumed tokens: 2930769920 | elapsed time per iteration (s): 0.75 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.484559E+00 | grad norm: 0.223 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.545 | TFLOPs: 20.66 | 31: iteration 5600/ 173500 | consumed samples: 1433600 | consumed tokens: 2936012800 | elapsed time per iteration (s): 0.72 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.470666E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.488 | TFLOPs: 21.57 | 31: iteration 5610/ 173500 | consumed samples: 1436160 | consumed tokens: 2941255680 | elapsed time per iteration (s): 0.74 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.523031E+00 | grad norm: 0.218 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.804 | TFLOPs: 21.04 | 31: iteration 5620/ 173500 | consumed samples: 1438720 | consumed tokens: 2946498560 | elapsed time per iteration (s): 0.75 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.472231E+00 | grad norm: 0.218 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.084 | TFLOPs: 20.57 | 31: iteration 5630/ 173500 | consumed samples: 1441280 | consumed tokens: 2951741440 | elapsed time per iteration (s): 0.80 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.519674E+00 | grad norm: 0.210 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 321.359 | TFLOPs: 19.44 | 31: iteration 5640/ 173500 | consumed samples: 1443840 | consumed tokens: 2956984320 | elapsed time per iteration (s): 0.73 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.443230E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.430 | TFLOPs: 21.14 | 31: iteration 5650/ 173500 | consumed samples: 1446400 | consumed tokens: 2962227200 | elapsed time per iteration (s): 0.75 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.491485E+00 | grad norm: 0.212 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.421 | TFLOPs: 20.72 | 31: iteration 5660/ 173500 | consumed samples: 1448960 | consumed tokens: 2967470080 | elapsed time per iteration (s): 0.76 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.475306E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.716 | TFLOPs: 20.31 | 31: iteration 5670/ 173500 | consumed samples: 1451520 | consumed tokens: 2972712960 | elapsed time per iteration (s): 0.80 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.507658E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.897 | TFLOPs: 19.29 | 31: iteration 5680/ 173500 | consumed samples: 1454080 | consumed tokens: 2977955840 | elapsed time per iteration (s): 0.79 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.490866E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.745 | TFLOPs: 19.65 | 31: iteration 5690/ 173500 | consumed samples: 1456640 | consumed tokens: 2983198720 | elapsed time per iteration (s): 0.78 | learning rate: 1.998E-04 | global batch 
size: 256 | lm loss: 2.530081E+00 | grad norm: 0.220 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.742 | TFLOPs: 19.95 |
31: iteration 5700/ 173500 | consumed samples: 1459200 | consumed tokens: 2988441600 | elapsed time per iteration (s): 0.73 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.507811E+00 | grad norm: 0.228 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.538 | TFLOPs: 21.27 |
31: iteration 5710/ 173500 | consumed samples: 1461760 | consumed tokens: 2993684480 | elapsed time per iteration (s): 0.80 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.477015E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.751 | TFLOPs: 19.47 |
31: iteration 5720/ 173500 | consumed samples: 1464320 | consumed tokens: 2998927360 | elapsed time per iteration (s): 0.76 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.461539E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.816 | TFLOPs: 20.38 |
31: iteration 5730/ 173500 | consumed samples: 1466880 | consumed tokens: 3004170240 | elapsed time per iteration (s): 0.76 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.505730E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.304 | TFLOPs: 20.47 |
31: iteration 5740/ 173500 | consumed samples: 1469440 | consumed tokens: 3009413120 | elapsed time per iteration (s): 0.77 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.476435E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.448 | TFLOPs: 20.17 |
31: iteration 5750/ 173500 | consumed samples: 1472000 | consumed tokens: 3014656000 | elapsed time per iteration (s): 0.74 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.437033E+00 | grad norm: 0.213 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.338 | TFLOPs: 21.01 |
31: iteration 5760/ 173500 | consumed samples: 1474560 | consumed tokens: 3019898880 | elapsed time per iteration (s): 0.75 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.476549E+00 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.343 | TFLOPs: 20.77 |
31: iteration 5770/ 173500 | consumed samples: 1477120 | consumed tokens: 3025141760 | elapsed time per iteration (s): 0.79 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.468469E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.950 | TFLOPs: 19.60 |
31: iteration 5780/ 173500 | consumed samples: 1479680 | consumed tokens: 3030384640 | elapsed time per iteration (s): 0.77 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.470583E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.543 | TFLOPs: 20.00 |
31: iteration 5790/ 173500 | consumed samples: 1482240 | consumed tokens: 3035627520 | elapsed time per iteration (s): 0.81 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.480692E+00 | grad norm: 0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.429 | TFLOPs: 19.08 |
31: iteration 5800/ 173500 | consumed samples: 1484800 | consumed tokens: 3040870400 | elapsed time per iteration (s): 0.80 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.498557E+00 | grad norm: 0.215 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.496 | TFLOPs: 19.39 |
31: iteration 5810/ 173500 | consumed samples: 1487360 | consumed tokens: 3046113280 | elapsed time per iteration (s): 0.81 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.465431E+00 | grad norm: 0.217 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.525 | TFLOPs: 19.03 |
31: iteration 5820/ 173500 | consumed samples: 1489920 | consumed tokens: 3051356160 | elapsed time per iteration (s): 0.77 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.497910E+00 | grad norm: 0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.581 | TFLOPs: 20.12 |
31: iteration 5830/ 173500 | consumed samples: 1492480 | consumed tokens: 3056599040 | elapsed time per iteration (s): 0.84 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.464883E+00 | grad norm: 0.383 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.454 | TFLOPs: 18.42 |
31: iteration 5840/ 173500 | consumed samples: 1495040 | consumed tokens: 3061841920 | elapsed time per iteration (s): 0.82 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.482684E+00 | grad norm: 0.223 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.590 | TFLOPs: 18.91 |
31: iteration 5850/ 173500 | consumed samples: 1497600 | consumed tokens: 3067084800 | elapsed time per iteration (s): 0.81 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.490276E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.795 | TFLOPs: 19.10 |
31: iteration 5860/ 173500 | consumed samples: 1500160 | consumed tokens: 3072327680 | elapsed time per iteration (s): 0.80 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.480186E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.770 | TFLOPs: 19.41 |
31: iteration 5870/ 173500 | consumed samples: 1502720 | consumed tokens: 3077570560 | elapsed time per iteration (s): 0.79 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.480683E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.420 | TFLOPs: 19.51 |
31: iteration 5880/ 173500 | consumed samples: 1505280 | consumed tokens: 3082813440 | elapsed time per iteration (s): 0.78 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.477347E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.250 | TFLOPs: 19.86 |
31: iteration 5890/ 173500 | consumed samples: 1507840 | consumed tokens: 3088056320 | elapsed time per iteration (s): 0.80 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.440149E+00 | grad norm: 0.223 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.673 | TFLOPs: 19.34 |
31: iteration 5900/ 173500 | consumed samples: 1510400 | consumed tokens: 3093299200 | elapsed time per iteration (s): 0.79 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.452513E+00 | grad norm: 0.220 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.605 | TFLOPs: 19.58 |
31: iteration 5910/ 173500 | consumed samples: 1512960 | consumed tokens: 3098542080 | elapsed time per iteration (s): 0.83 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.469598E+00 | grad norm: 0.229 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.068 | TFLOPs: 18.64 |
31: iteration 5920/ 173500 | consumed samples: 1515520 | consumed tokens: 3103784960 | elapsed time per iteration (s): 0.79 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.488436E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.985 | TFLOPs: 19.54 |
31: iteration 5930/ 173500 | consumed samples: 1518080 | consumed tokens: 3109027840 | elapsed time per iteration (s): 0.85 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.467128E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.758 | TFLOPs: 18.20 |
31: iteration 5940/ 173500 | consumed samples: 1520640 | consumed tokens: 3114270720 | elapsed time per iteration (s): 0.94 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.472848E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 273.251 | TFLOPs: 16.53 |
31: iteration 5950/ 173500 | consumed samples: 1523200 | consumed tokens: 3119513600 | elapsed time per iteration (s): 0.84 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.738561E+00 | grad norm: 1.458 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.385 | TFLOPs: 18.54 |
31: iteration 5960/ 173500 | consumed samples: 1525760 | consumed tokens: 3124756480 | elapsed time per iteration (s): 0.84 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.890635E+00 | grad norm: 2.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.302 | TFLOPs: 18.53 |
31: iteration 5970/ 173500 | consumed samples: 1528320 | consumed tokens: 3129999360 | elapsed time per iteration (s): 0.81 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.889027E+00 | grad norm: 3.480 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.467 | TFLOPs: 19.15 |
31: iteration 5980/ 173500 | consumed samples: 1530880 | consumed tokens: 3135242240 | elapsed time per iteration (s): 0.81 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.823361E+00 | grad norm: 0.870 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.797 | TFLOPs: 19.10 |
31: iteration 5990/ 173500 | consumed samples: 1533440 | consumed tokens: 3140485120 | elapsed time per iteration (s): 0.83 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.702221E+00 | grad norm: 0.421 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.995 | TFLOPs: 18.75 |
0: [2022-11-25 18:52:14,361] [INFO] [logging.py:68:log_dist] [Rank 0] step=6000, skipped=0, lr=[0.0001997263111243839, 0.0001997263111243839, 0.0001997263111243839], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
31: iteration 6000/ 173500 | consumed samples: 1536000 | consumed tokens: 3145728000 | elapsed time per iteration (s): 0.80 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.592437E+00 | grad norm: 0.346 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.695 | TFLOPs: 19.46 |
0: steps: 6000 loss: 2.5802 iter time (s): 0.782 samples/sec: 327.447
31: ------------------------------------------------------------------------------------------
31: valid loss at iteration 6000 | lm loss value: 2.576070E+00 | lm loss PPL: 1.314538E+01 |
31: ------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 6000 to checkpoints_1b1long
0: [2022-11-25 18:52:14,747] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step6000 is begin to save!
0: [2022-11-25 18:52:14,756] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_01-model_00-model_states.pt...
0: [2022-11-25 18:52:14,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_01-model_00-model_states.pt.
0: [2022-11-25 18:52:14,954] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_03-model_00-model_states.pt...
0: [2022-11-25 18:52:15,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_03-model_00-model_states.pt.
0: [2022-11-25 18:52:15,033] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_04-model_00-model_states.pt...
0: [2022-11-25 18:52:15,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_04-model_00-model_states.pt.
0: [2022-11-25 18:52:15,107] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_05-model_00-model_states.pt...
0: [2022-11-25 18:52:15,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_05-model_00-model_states.pt.
0: [2022-11-25 18:52:15,184] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_06-model_00-model_states.pt...
0: [2022-11-25 18:52:15,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_06-model_00-model_states.pt.
0: [2022-11-25 18:52:15,257] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_07-model_00-model_states.pt...
0: [2022-11-25 18:52:15,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_07-model_00-model_states.pt.
0: [2022-11-25 18:52:15,327] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_08-model_00-model_states.pt...
0: [2022-11-25 18:52:15,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_08-model_00-model_states.pt.
0: [2022-11-25 18:52:15,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_09-model_00-model_states.pt...
0: [2022-11-25 18:52:15,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_09-model_00-model_states.pt.
0: [2022-11-25 18:52:15,477] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_10-model_00-model_states.pt...
0: [2022-11-25 18:52:15,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_10-model_00-model_states.pt.
0: [2022-11-25 18:52:15,552] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_11-model_00-model_states.pt...
0: [2022-11-25 18:52:15,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_11-model_00-model_states.pt.
0: [2022-11-25 18:52:15,627] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_12-model_00-model_states.pt...
0: [2022-11-25 18:52:15,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_12-model_00-model_states.pt.
0: [2022-11-25 18:52:15,703] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_13-model_00-model_states.pt...
0: [2022-11-25 18:52:15,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_13-model_00-model_states.pt.
0: [2022-11-25 18:52:15,777] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_14-model_00-model_states.pt...
0: [2022-11-25 18:52:15,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_14-model_00-model_states.pt.
0: [2022-11-25 18:52:15,851] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_15-model_00-model_states.pt...
0: [2022-11-25 18:52:15,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_15-model_00-model_states.pt.
0: [2022-11-25 18:52:15,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_16-model_00-model_states.pt...
0: [2022-11-25 18:52:15,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_16-model_00-model_states.pt.
0: [2022-11-25 18:52:15,998] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_17-model_00-model_states.pt...
0: [2022-11-25 18:52:16,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_17-model_00-model_states.pt.
0: [2022-11-25 18:52:16,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_18-model_00-model_states.pt...
0: [2022-11-25 18:52:16,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_18-model_00-model_states.pt.
0: [2022-11-25 18:52:16,144] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_19-model_00-model_states.pt...
0: [2022-11-25 18:52:16,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_19-model_00-model_states.pt.
0: [2022-11-25 18:52:16,218] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_20-model_00-model_states.pt...
0: [2022-11-25 18:52:16,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_20-model_00-model_states.pt.
0: [2022-11-25 18:52:16,289] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_21-model_00-model_states.pt...
0: [2022-11-25 18:52:16,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_21-model_00-model_states.pt.
0: [2022-11-25 18:52:16,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_22-model_00-model_states.pt...
0: [2022-11-25 18:52:16,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_22-model_00-model_states.pt.
0: [2022-11-25 18:52:16,438] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_23-model_00-model_states.pt...
0: [2022-11-25 18:52:16,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_23-model_00-model_states.pt.
0: [2022-11-25 18:52:16,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_24-model_00-model_states.pt...
0: [2022-11-25 18:52:16,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_24-model_00-model_states.pt.
0: [2022-11-25 18:52:16,584] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_25-model_00-model_states.pt...
0: [2022-11-25 18:52:16,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_25-model_00-model_states.pt.
0: [2022-11-25 18:52:16,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_26-model_00-model_states.pt...
0: [2022-11-25 18:52:16,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_26-model_00-model_states.pt.
0: [2022-11-25 18:52:16,732] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_27-model_00-model_states.pt...
0: [2022-11-25 18:52:16,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_27-model_00-model_states.pt.
0: [2022-11-25 18:52:16,806] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_28-model_00-model_states.pt...
0: [2022-11-25 18:52:16,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_28-model_00-model_states.pt.
0: [2022-11-25 18:52:16,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/layer_30-model_00-model_states.pt...
0: [2022-11-25 18:52:16,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/layer_30-model_00-model_states.pt.
0: [2022-11-25 18:52:16,880] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step6000/mp_rank_00_model_states.pt
0: [2022-11-25 18:52:16,880] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/mp_rank_00_model_states.pt...
0: [2022-11-25 18:52:16,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/mp_rank_00_model_states.pt.
0: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt...
0: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt...
0: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
6: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt...
6: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt...
6: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt...
6: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt...
6: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt...
7: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt...
7: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt...
7: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt...
7: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt...
7: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt...
4: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt...
4: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt...
4: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt...
4: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt...
9: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt...
9: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt...
9: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt...
9: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt...
9: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt...
8: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt...
8: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt...
8: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt...
8: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt...
8: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt...
10: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt...
10: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt...
10: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt...
10: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt...
10: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt...
1: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt...
1: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt...
1: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt...
1: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt...
1: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt...
16: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt...
16: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt...
16: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt...
16: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt...
16: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt...
2: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt...
2: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt...
2: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt...
2: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt...
2: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt...
13: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt...
13: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt...
13: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt...
13: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt...
13: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt...
3: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt...
3: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt...
3: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt...
3: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt...
3: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt...
12: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt...
12: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt...
12: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt...
12: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt...
15: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt...
15: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt...
15: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt...
15: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt...
15: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt...
20: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt...
20: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt...
20: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt...
20: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt...
20: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt...
25: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt...
25: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt...
25: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt...
25: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt...
25: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt...
23: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt...
23: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt...
23: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt...
23: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt...
23: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt...
11: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt...
11: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt...
11: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt...
11: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt...
11: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt...
28: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt...
28: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt...
28: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt...
28: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt...
28: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt...
24: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt...
24: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt...
24: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt...
24: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt...
24: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt...
14: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt...
14: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt...
14: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt...
14: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt...
14: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt...
31: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt...
31: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt...
31: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt...
31: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt...
31: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt...
29: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt...
29: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt...
29: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt...
29: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt...
29: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt...
22: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt...
22: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 
17: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 
18: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 
19: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
5: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 
8: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 
3: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 
23: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 
31: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 
26: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 
8: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 3: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 20: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 
11: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 24: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 
19: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 7: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 2: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 25: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 14: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 22: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 17: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 
26: [2022-11-25 18:52:16,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 23: [2022-11-25 18:52:17,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-25 18:52:17,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-25 18:52:17,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 23: [2022-11-25 18:52:17,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-25 18:52:17,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-25 18:52:17,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 18: [2022-11-25 18:52:17,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:52:17,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:52:17,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-25 18:52:17,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-25 18:52:17,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
18: [2022-11-25 18:52:17,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 31: [2022-11-25 18:52:17,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:52:17,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-25 18:52:17,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 31: [2022-11-25 18:52:17,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:52:17,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-25 18:52:17,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 1: [2022-11-25 18:52:17,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:52:17,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 18:52:17,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 7: [2022-11-25 18:52:17,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-25 18:52:17,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 18:52:17,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 7: [2022-11-25 18:52:17,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:52:17,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 18:52:17,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 7: [2022-11-25 18:52:17,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:52:17,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:52:17,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 18:52:17,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 18:52:17,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 7: [2022-11-25 18:52:17,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 1: [2022-11-25 18:52:17,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-25 18:52:17,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 18:52:17,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 18: [2022-11-25 18:52:17,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:52:17,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-25 18:52:17,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 31: [2022-11-25 18:52:17,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:52:17,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-25 18:52:17,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 18: [2022-11-25 18:52:17,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:52:17,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-25 18:52:17,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 31: [2022-11-25 18:52:17,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-25 18:52:17,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-25 18:52:17,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 13: [2022-11-25 18:52:17,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:52:17,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 18:52:17,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 3: [2022-11-25 18:52:17,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:52:17,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 18:52:17,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 26: [2022-11-25 18:52:17,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:52:17,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-25 18:52:17,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 6: [2022-11-25 18:52:17,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
12: [2022-11-25 18:52:17,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:52:17,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 12: [2022-11-25 18:52:17,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 6: [2022-11-25 18:52:17,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 12: [2022-11-25 18:52:17,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 7: [2022-11-25 18:52:17,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:52:17,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 18:52:17,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 16: [2022-11-25 18:52:17,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:52:17,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-25 18:52:17,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-25 18:52:17,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-25 18:52:17,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 16: [2022-11-25 18:52:17,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 25: [2022-11-25 18:52:17,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:52:17,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-25 18:52:17,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 28: [2022-11-25 18:52:17,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-25 18:52:17,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-25 18:52:17,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 1: [2022-11-25 18:52:17,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-25 18:52:17,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 22: [2022-11-25 18:52:17,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-25 18:52:17,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:52:17,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 22: [2022-11-25 18:52:17,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-25 18:52:17,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-25 18:52:17,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 22: [2022-11-25 18:52:17,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 25: [2022-11-25 18:52:17,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:52:17,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-25 18:52:17,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 12: [2022-11-25 18:52:17,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-25 18:52:17,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 18:52:17,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 0: [2022-11-25 18:52:17,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:52:17,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:52:17,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:52:17,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 18:52:17,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 18:52:17,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 31: [2022-11-25 18:52:17,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:52:17,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 0: [2022-11-25 18:52:17,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 0: [2022-11-25 18:52:17,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
0: [2022-11-25 18:52:17,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 31: [2022-11-25 18:52:17,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 0: [2022-11-25 18:52:17,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:52:17,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:52:17,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 24: [2022-11-25 18:52:17,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-25 18:52:17,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-25 18:52:17,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:52:17,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
24: [2022-11-25 18:52:17,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-25 18:52:17,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-25 18:52:17,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-25 18:52:17,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 24: [2022-11-25 18:52:17,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 24: [2022-11-25 18:52:17,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 0: [2022-11-25 18:52:17,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:52:17,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 18:52:17,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 7: [2022-11-25 18:52:17,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:52:17,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 18:52:17,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
21: [2022-11-25 18:52:17,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-25 18:52:17,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-25 18:52:17,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-25 18:52:17,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-25 18:52:17,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-25 18:52:17,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-25 18:52:17,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 21: [2022-11-25 18:52:17,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 21: [2022-11-25 18:52:17,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 0: [2022-11-25 18:52:17,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 28: [2022-11-25 18:52:17,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-25 18:52:17,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
0: [2022-11-25 18:52:17,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 28: [2022-11-25 18:52:17,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 0: [2022-11-25 18:52:17,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 28: [2022-11-25 18:52:17,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-25 18:52:17,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-25 18:52:17,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 18: [2022-11-25 18:52:17,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:52:17,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-25 18:52:17,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 9: [2022-11-25 18:52:17,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 18:52:17,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-25 18:52:17,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-25 18:52:17,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 18:52:17,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 18:52:17,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 18:52:17,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 9: [2022-11-25 18:52:17,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 9: [2022-11-25 18:52:17,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 16: [2022-11-25 18:52:17,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:52:17,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-25 18:52:17,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 3: [2022-11-25 18:52:17,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:52:17,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
3: [2022-11-25 18:52:17,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 12: [2022-11-25 18:52:17,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 3: [2022-11-25 18:52:17,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 12: [2022-11-25 18:52:17,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 23: [2022-11-25 18:52:17,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-25 18:52:17,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-25 18:52:17,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 5: [2022-11-25 18:52:17,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:52:17,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:52:17,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 20: [2022-11-25 18:52:17,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 
5: [2022-11-25 18:52:17,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 20: [2022-11-25 18:52:17,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 5: [2022-11-25 18:52:17,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 18:52:17,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 15: [2022-11-25 18:52:17,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:52:17,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:52:17,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 20: [2022-11-25 18:52:17,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 5: [2022-11-25 18:52:17,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 10: [2022-11-25 18:52:17,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
15: [2022-11-25 18:52:17,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 5: [2022-11-25 18:52:17,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 5: [2022-11-25 18:52:17,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 10: [2022-11-25 18:52:17,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:52:17,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:52:17,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 18:52:17,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 10: [2022-11-25 18:52:17,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 15: [2022-11-25 18:52:17,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 10: [2022-11-25 18:52:17,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 18:52:17,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
10: [2022-11-25 18:52:17,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 15: [2022-11-25 18:52:17,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 10: [2022-11-25 18:52:17,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 10: [2022-11-25 18:52:17,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 15: [2022-11-25 18:52:17,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 23: [2022-11-25 18:52:17,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-25 18:52:17,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-25 18:52:17,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 11: [2022-11-25 18:52:17,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-25 18:52:17,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-25 18:52:17,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-25 18:52:17,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 18:52:17,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 11: [2022-11-25 18:52:17,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 18:52:17,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 18:52:17,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 11: [2022-11-25 18:52:17,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 17: [2022-11-25 18:52:17,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-25 18:52:17,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-25 18:52:17,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 
17: [2022-11-25 18:52:17,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-25 18:52:17,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-25 18:52:17,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-25 18:52:17,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 17: [2022-11-25 18:52:17,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 17: [2022-11-25 18:52:17,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 4: [2022-11-25 18:52:17,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:52:17,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 18:52:17,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 15: [2022-11-25 18:52:17,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:52:17,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 18:52:17,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
14: [2022-11-25 18:52:17,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-25 18:52:17,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 18:52:17,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 20: [2022-11-25 18:52:17,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-25 18:52:17,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-25 18:52:17,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 2: [2022-11-25 18:52:17,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:52:17,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:52:17,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 18:52:17,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 18:52:17,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 2: [2022-11-25 18:52:17,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
8: [2022-11-25 18:52:17,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-25 18:52:17,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-25 18:52:17,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-25 18:52:17,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-25 18:52:17,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 18:52:17,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 18:52:17,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 18:52:17,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 18:52:17,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 8: [2022-11-25 18:52:17,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 7: [2022-11-25 18:52:17,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 8: [2022-11-25 18:52:17,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
7: [2022-11-25 18:52:17,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 8: [2022-11-25 18:52:17,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 27: [2022-11-25 18:52:17,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-25 18:52:17,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:52:17,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 27: [2022-11-25 18:52:17,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-25 18:52:17,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-25 18:52:17,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-25 18:52:17,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-25 18:52:17,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 27: [2022-11-25 18:52:17,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-25 18:52:17,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
27: [2022-11-25 18:52:17,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-25 18:52:17,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 27: [2022-11-25 18:52:17,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 23: [2022-11-25 18:52:17,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-25 18:52:17,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-25 18:52:17,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 3: [2022-11-25 18:52:17,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:52:17,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 18:52:17,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 25: [2022-11-25 18:52:17,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:52:17,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-25 18:52:17,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
25: [2022-11-25 18:52:17,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 25: [2022-11-25 18:52:17,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-25 18:52:17,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 3: [2022-11-25 18:52:17,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:52:17,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 18:52:17,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 4: [2022-11-25 18:52:17,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:52:17,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 18:52:17,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 2: [2022-11-25 18:52:17,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:52:17,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 18:52:17,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
18: [2022-11-25 18:52:17,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-25 18:52:17,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-25 18:52:17,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 20: [2022-11-25 18:52:17,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-25 18:52:17,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-25 18:52:17,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 19: [2022-11-25 18:52:17,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-25 18:52:17,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-25 18:52:17,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-25 18:52:17,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 22: [2022-11-25 18:52:17,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
19: [2022-11-25 18:52:17,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-25 18:52:17,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-25 18:52:17,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-25 18:52:17,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-25 18:52:17,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 19: [2022-11-25 18:52:17,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 19: [2022-11-25 18:52:17,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 22: [2022-11-25 18:52:17,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 19: [2022-11-25 18:52:17,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 22: [2022-11-25 18:52:17,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 28: [2022-11-25 18:52:17,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
28: [2022-11-25 18:52:17,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-25 18:52:17,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 29: [2022-11-25 18:52:17,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:52:17,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:52:17,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:52:17,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:52:17,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-25 18:52:17,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-25 18:52:17,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-25 18:52:17,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 29: [2022-11-25 18:52:17,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
29: [2022-11-25 18:52:17,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-25 18:52:17,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 29: [2022-11-25 18:52:17,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 13: [2022-11-25 18:52:17,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:52:17,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 18:52:17,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 2: [2022-11-25 18:52:17,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:52:17,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 18:52:17,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 22: [2022-11-25 18:52:17,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-25 18:52:17,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-25 18:52:17,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
5: [2022-11-25 18:52:17,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:52:17,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 18:52:17,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 11: [2022-11-25 18:52:17,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 18:52:17,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 18:52:17,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 16: [2022-11-25 18:52:17,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:52:17,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-25 18:52:17,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 4: [2022-11-25 18:52:17,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:52:17,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 18:52:17,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
4: [2022-11-25 18:52:17,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:52:17,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 18:52:17,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 2: [2022-11-25 18:52:17,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:52:17,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 18:52:17,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 13: [2022-11-25 18:52:17,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:52:17,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 18:52:17,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 9: [2022-11-25 18:52:17,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-25 18:52:17,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 18:52:17,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
28: [2022-11-25 18:52:17,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:52:17,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:52:17,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 18:52:17,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 28: [2022-11-25 18:52:17,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-25 18:52:17,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 2: [2022-11-25 18:52:17,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:52:17,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 18:52:17,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 31: [2022-11-25 18:52:17,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:52:17,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-25 18:52:17,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
24: [2022-11-25 18:52:17,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-25 18:52:17,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-25 18:52:17,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 21: [2022-11-25 18:52:17,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-25 18:52:17,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-25 18:52:17,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 23: [2022-11-25 18:52:17,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:52:17,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 23: [2022-11-25 18:52:17,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 6: [2022-11-25 18:52:17,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 23: [2022-11-25 18:52:17,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 6: [2022-11-25 18:52:17,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
20: [2022-11-25 18:52:17,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-25 18:52:17,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-25 18:52:17,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 6: [2022-11-25 18:52:17,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:52:17,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 18:52:17,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 10: [2022-11-25 18:52:17,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:52:17,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 18:52:17,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 17: [2022-11-25 18:52:17,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-25 18:52:17,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-25 18:52:17,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
13: [2022-11-25 18:52:17,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:52:17,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 18:52:17,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 0: [2022-11-25 18:52:17,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 18:52:17,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 30: [2022-11-25 18:52:17,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-25 18:52:17,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-25 18:52:17,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-25 18:52:17,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-25 18:52:17,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-25 18:52:17,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-25 18:52:17,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-25 18:52:17,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-25 18:52:17,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 30: [2022-11-25 18:52:17,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 30: [2022-11-25 18:52:17,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 30: [2022-11-25 18:52:17,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 6: [2022-11-25 18:52:17,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:52:17,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 18:52:17,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 31: [2022-11-25 18:52:17,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
28: [2022-11-25 18:52:17,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 31: [2022-11-25 18:52:17,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 28: [2022-11-25 18:52:17,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-25 18:52:17,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 31: [2022-11-25 18:52:17,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 2: [2022-11-25 18:52:17,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:52:17,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 18:52:17,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 8: [2022-11-25 18:52:17,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 18:52:17,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 18:52:17,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 27: [2022-11-25 18:52:17,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
27: [2022-11-25 18:52:17,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-25 18:52:17,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 26: [2022-11-25 18:52:17,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:52:17,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-25 18:52:17,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 29: [2022-11-25 18:52:17,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:52:17,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-25 18:52:17,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 3: [2022-11-25 18:52:17,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:52:17,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 18:52:17,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 18: [2022-11-25 18:52:17,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
18: [2022-11-25 18:52:17,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-25 18:52:17,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 19: [2022-11-25 18:52:17,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 2: [2022-11-25 18:52:17,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 19: [2022-11-25 18:52:17,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-25 18:52:17,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 2: [2022-11-25 18:52:17,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 18:52:17,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 13: [2022-11-25 18:52:17,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:52:17,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 18:52:17,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 25: [2022-11-25 18:52:17,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 
25: [2022-11-25 18:52:17,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-25 18:52:17,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 30: [2022-11-25 18:52:17,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-25 18:52:17,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-25 18:52:17,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 28: [2022-11-25 18:52:17,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-25 18:52:17,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-25 18:52:17,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 26: [2022-11-25 18:52:17,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:52:17,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-25 18:52:17,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 18: [2022-11-25 18:52:17,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-25 18:52:17,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-25 18:52:17,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 0: [2022-11-25 18:52:17,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 18:52:17,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 18:52:17,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 20: [2022-11-25 18:52:17,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-25 18:52:17,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-25 18:52:17,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 17: [2022-11-25 18:52:17,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-25 18:52:17,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-25 18:52:17,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 16: [2022-11-25 18:52:17,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
16: [2022-11-25 18:52:17,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-25 18:52:17,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 5: [2022-11-25 18:52:17,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:52:17,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 18:52:17,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 22: [2022-11-25 18:52:17,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-25 18:52:17,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-25 18:52:17,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 4: [2022-11-25 18:52:17,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:52:17,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 28: [2022-11-25 18:52:17,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:52:17,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
28: [2022-11-25 18:52:17,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-25 18:52:17,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 11: [2022-11-25 18:52:17,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 24: [2022-11-25 18:52:17,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 11: [2022-11-25 18:52:17,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 24: [2022-11-25 18:52:17,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 11: [2022-11-25 18:52:17,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 24: [2022-11-25 18:52:17,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 6: [2022-11-25 18:52:17,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:52:17,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 18:52:17,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 31: [2022-11-25 18:52:17,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
31: [2022-11-25 18:52:17,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-25 18:52:17,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 15: [2022-11-25 18:52:17,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:52:17,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 18:52:17,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 7: [2022-11-25 18:52:17,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 18:52:17,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 18:52:17,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 12: [2022-11-25 18:52:17,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:52:17,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 18:52:17,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 14: [2022-11-25 18:52:17,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-25 18:52:17,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 18:52:17,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 9: [2022-11-25 18:52:17,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-25 18:52:17,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 18:52:17,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 26: [2022-11-25 18:52:17,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:52:17,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-25 18:52:17,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 23: [2022-11-25 18:52:17,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-25 18:52:17,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-25 18:52:17,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 21: [2022-11-25 18:52:17,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-25 18:52:17,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-25 18:52:17,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 14: [2022-11-25 18:52:17,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 18:52:17,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 18:52:17,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 10: [2022-11-25 18:52:17,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:52:17,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:52:17,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 10: [2022-11-25 18:52:17,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 18:52:17,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 26: [2022-11-25 18:52:17,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 30: [2022-11-25 18:52:17,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-25 18:52:17,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-25 18:52:17,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 8: [2022-11-25 18:52:17,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-25 18:52:17,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 18:52:17,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 3: [2022-11-25 18:52:17,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:52:17,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 18:52:17,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 27: [2022-11-25 18:52:17,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-25 18:52:17,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-25 18:52:17,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 25: [2022-11-25 18:52:17,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
25: [2022-11-25 18:52:17,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-25 18:52:17,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 19: [2022-11-25 18:52:17,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-25 18:52:17,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-25 18:52:17,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 29: [2022-11-25 18:52:17,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:52:17,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-25 18:52:17,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 15: [2022-11-25 18:52:17,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:52:17,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 18:52:17,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 13: [2022-11-25 18:52:17,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-25 18:52:17,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 18:52:17,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 6: [2022-11-25 18:52:17,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:52:17,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 18:52:17,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 20: [2022-11-25 18:52:17,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-25 18:52:17,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-25 18:52:17,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 23: [2022-11-25 18:52:17,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:52:17,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
23: [2022-11-25 18:52:17,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 16: [2022-11-25 18:52:17,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 23: [2022-11-25 18:52:17,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 16: [2022-11-25 18:52:17,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 14: [2022-11-25 18:52:17,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 18:52:17,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 18:52:17,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 4: [2022-11-25 18:52:17,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:52:17,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 18:52:17,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 30: [2022-11-25 18:52:17,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-25 18:52:17,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-25 18:52:17,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 12: [2022-11-25 18:52:17,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:52:17,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 8: [2022-11-25 18:52:17,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:52:17,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 8: [2022-11-25 18:52:17,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 18:52:17,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 5: [2022-11-25 18:52:17,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:52:17,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 18:52:17,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 24: [2022-11-25 18:52:17,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
24: [2022-11-25 18:52:17,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-25 18:52:17,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 14: [2022-11-25 18:52:17,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-25 18:52:17,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 22: [2022-11-25 18:52:17,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 14: [2022-11-25 18:52:17,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 22: [2022-11-25 18:52:17,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-25 18:52:17,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 3: [2022-11-25 18:52:17,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:52:17,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 18:52:17,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 21: [2022-11-25 18:52:17,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
21: [2022-11-25 18:52:17,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-25 18:52:17,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 10: [2022-11-25 18:52:17,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:52:17,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 18:52:17,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 14: [2022-11-25 18:52:17,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-25 18:52:17,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 18:52:17,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 11: [2022-11-25 18:52:17,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-25 18:52:17,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 18:52:17,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 26: [2022-11-25 18:52:17,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
26: [2022-11-25 18:52:17,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 16: [2022-11-25 18:52:17,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 17: [2022-11-25 18:52:17,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:52:17,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 17: [2022-11-25 18:52:17,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 26: [2022-11-25 18:52:17,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 16: [2022-11-25 18:52:17,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 17: [2022-11-25 18:52:17,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 25: [2022-11-25 18:52:17,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:52:17,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-25 18:52:17,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 9: [2022-11-25 18:52:17,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-25 18:52:17,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 18:52:17,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 20: [2022-11-25 18:52:17,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-25 18:52:17,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-25 18:52:17,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 22: [2022-11-25 18:52:17,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-25 18:52:17,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-25 18:52:17,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 13: [2022-11-25 18:52:17,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:52:17,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 18:52:17,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 27: [2022-11-25 18:52:17,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
27: [2022-11-25 18:52:17,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-25 18:52:17,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 12: [2022-11-25 18:52:17,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:52:17,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 18:52:17,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 15: [2022-11-25 18:52:17,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:52:17,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 18:52:17,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 3: [2022-11-25 18:52:17,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 18:52:17,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 18:52:17,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 4: [2022-11-25 18:52:17,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-25 18:52:17,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 18:52:17,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 29: [2022-11-25 18:52:17,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:52:17,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-25 18:52:17,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 26: [2022-11-25 18:52:17,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:52:17,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-25 18:52:17,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 5: [2022-11-25 18:52:17,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:52:17,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 30: [2022-11-25 18:52:17,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:52:17,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
30: [2022-11-25 18:52:17,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-25 18:52:17,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 12: [2022-11-25 18:52:17,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-25 18:52:17,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 18:52:17,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 17: [2022-11-25 18:52:17,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-25 18:52:17,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-25 18:52:17,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 21: [2022-11-25 18:52:17,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-25 18:52:17,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-25 18:52:17,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 6: [2022-11-25 18:52:17,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-25 18:52:17,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 18:52:17,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 14: [2022-11-25 18:52:17,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-25 18:52:17,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 18:52:17,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 29: [2022-11-25 18:52:17,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-25 18:52:17,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-25 18:52:17,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 25: [2022-11-25 18:52:17,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:52:17,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 13: [2022-11-25 18:52:17,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 25: [2022-11-25 18:52:17,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
13: [2022-11-25 18:52:17,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 9: [2022-11-25 18:52:17,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 13: [2022-11-25 18:52:17,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 9: [2022-11-25 18:52:17,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 18:52:17,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 20: [2022-11-25 18:52:17,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:52:17,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-25 18:52:17,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 20: [2022-11-25 18:52:17,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 16: [2022-11-25 18:52:17,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 20: [2022-11-25 18:52:17,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 24: [2022-11-25 18:52:17,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-25 18:52:17,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-25 18:52:17,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-25 18:52:17,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-25 18:52:17,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 24: [2022-11-25 18:52:17,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 26: [2022-11-25 18:52:17,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-25 18:52:17,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-25 18:52:17,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 22: [2022-11-25 18:52:17,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-25 18:52:17,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-25 18:52:17,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 19: [2022-11-25 18:52:17,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-25 18:52:17,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-25 18:52:17,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-25 18:52:17,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-25 18:52:17,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 19: [2022-11-25 18:52:17,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 11: [2022-11-25 18:52:17,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-25 18:52:17,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 18:52:17,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 10: [2022-11-25 18:52:17,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:52:17,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 18:52:17,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 5: [2022-11-25 18:52:17,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-25 18:52:17,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 14: [2022-11-25 18:52:17,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 5: [2022-11-25 18:52:17,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 14: [2022-11-25 18:52:17,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 18:52:17,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 9: [2022-11-25 18:52:17,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 18:52:17,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 18:52:17,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 17: [2022-11-25 18:52:17,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-25 18:52:17,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-25 18:52:17,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 6: [2022-11-25 18:52:17,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-25 18:52:17,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 27: [2022-11-25 18:52:17,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 6: [2022-11-25 18:52:17,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 27: [2022-11-25 18:52:17,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-25 18:52:17,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 11: [2022-11-25 18:52:17,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-25 18:52:17,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 18:52:17,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 15: [2022-11-25 18:52:17,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-25 18:52:17,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 18:52:17,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 8: [2022-11-25 18:52:17,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-25 18:52:17,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 18:52:17,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 21: [2022-11-25 18:52:17,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-25 18:52:17,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-25 18:52:17,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 10: [2022-11-25 18:52:17,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-25 18:52:17,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 18:52:17,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 4: [2022-11-25 18:52:17,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 18:52:17,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 18:52:17,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 1: [2022-11-25 18:52:17,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-25 18:52:17,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:52:17,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 18:52:17,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 18:52:17,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:52:17,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 18:52:17,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 1: [2022-11-25 18:52:17,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 1: [2022-11-25 18:52:17,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 18:52:17,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 18:52:17,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 1: [2022-11-25 18:52:17,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 1: [2022-11-25 18:52:17,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-25 18:52:17,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step6000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 18:52:17,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 0: successfully saved checkpoint at iteration 6000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2636.89 31: iteration 6010/ 173500 | consumed samples: 1538560 | consumed tokens: 3150970880 | elapsed time per iteration (s): 1.11 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.571791E+00 | grad norm: 0.215 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.787 | TFLOPs: 13.96 | 31: iteration 6020/ 173500 | consumed samples: 1541120 | consumed tokens: 3156213760 | elapsed time per iteration (s): 0.80 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.517999E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.346 | TFLOPs: 19.32 | 31: iteration 6030/ 173500 | consumed samples: 1543680 | consumed tokens: 3161456640 | elapsed time per iteration (s): 0.80 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.499246E+00 | grad norm: 0.236 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.863 | TFLOPs: 19.41 | 31: iteration 6040/ 173500 | consumed samples: 1546240 | consumed tokens: 3166699520 | elapsed time per iteration (s): 0.81 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.534823E+00 | grad norm: 0.215 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.526 | TFLOPs: 19.15 | 31: iteration 6050/ 173500 | consumed samples: 1548800 | consumed tokens: 3171942400 | elapsed time per iteration (s): 0.85 | learning rate: 1.997E-04 | global batch size: 256 | lm 
loss: 2.478799E+00 | grad norm: 0.275 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.078 | TFLOPs: 18.15 | 31: iteration 6060/ 173500 | consumed samples: 1551360 | consumed tokens: 3177185280 | elapsed time per iteration (s): 0.79 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.484883E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.324 | TFLOPs: 19.50 | 31: iteration 6070/ 173500 | consumed samples: 1553920 | consumed tokens: 3182428160 | elapsed time per iteration (s): 0.82 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.516194E+00 | grad norm: 0.216 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.607 | TFLOPs: 18.91 | 31: iteration 6080/ 173500 | consumed samples: 1556480 | consumed tokens: 3187671040 | elapsed time per iteration (s): 0.84 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.505541E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.788 | TFLOPs: 18.44 | 31: iteration 6090/ 173500 | consumed samples: 1559040 | consumed tokens: 3192913920 | elapsed time per iteration (s): 0.75 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.480883E+00 | grad norm: 0.228 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.913 | TFLOPs: 20.68 | 31: iteration 6100/ 173500 | consumed samples: 1561600 | consumed tokens: 3198156800 | elapsed time per iteration (s): 0.79 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.502426E+00 | grad norm: 0.224 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.194 | TFLOPs: 19.67 | 31: iteration 6110/ 173500 | consumed samples: 1564160 | consumed tokens: 3203399680 | 
elapsed time per iteration (s): 0.77 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.470410E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.358 | TFLOPs: 20.17 | 31: iteration 6120/ 173500 | consumed samples: 1566720 | consumed tokens: 3208642560 | elapsed time per iteration (s): 0.78 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.450205E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.912 | TFLOPs: 19.78 | 31: iteration 6130/ 173500 | consumed samples: 1569280 | consumed tokens: 3213885440 | elapsed time per iteration (s): 0.84 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.483003E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.840 | TFLOPs: 18.38 | 31: iteration 6140/ 173500 | consumed samples: 1571840 | consumed tokens: 3219128320 | elapsed time per iteration (s): 0.81 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.466935E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.106 | TFLOPs: 19.06 | 31: iteration 6150/ 173500 | consumed samples: 1574400 | consumed tokens: 3224371200 | elapsed time per iteration (s): 0.84 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.470074E+00 | grad norm: 0.229 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.947 | TFLOPs: 18.51 | 31: iteration 6160/ 173500 | consumed samples: 1576960 | consumed tokens: 3229614080 | elapsed time per iteration (s): 0.84 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.511013E+00 | grad norm: 0.330 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.549 | TFLOPs: 18.36 
| 31: iteration 6170/ 173500 | consumed samples: 1579520 | consumed tokens: 3234856960 | elapsed time per iteration (s): 0.81 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.504516E+00 | grad norm: 0.236 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.723 | TFLOPs: 19.04 | 31: iteration 6180/ 173500 | consumed samples: 1582080 | consumed tokens: 3240099840 | elapsed time per iteration (s): 0.83 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.478622E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.929 | TFLOPs: 18.69 | 31: iteration 6190/ 173500 | consumed samples: 1584640 | consumed tokens: 3245342720 | elapsed time per iteration (s): 0.80 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.502306E+00 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.451 | TFLOPs: 19.45 | 31: iteration 6200/ 173500 | consumed samples: 1587200 | consumed tokens: 3250585600 | elapsed time per iteration (s): 0.82 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.484902E+00 | grad norm: 0.231 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.124 | TFLOPs: 18.82 | 31: iteration 6210/ 173500 | consumed samples: 1589760 | consumed tokens: 3255828480 | elapsed time per iteration (s): 0.78 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.515743E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.556 | TFLOPs: 19.94 | 31: iteration 6220/ 173500 | consumed samples: 1592320 | consumed tokens: 3261071360 | elapsed time per iteration (s): 0.81 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.488304E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 315.751 | TFLOPs: 19.10 | 31: iteration 6230/ 173500 | consumed samples: 1594880 | consumed tokens: 3266314240 | elapsed time per iteration (s): 0.80 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.503111E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.653 | TFLOPs: 19.46 | 31: iteration 6240/ 173500 | consumed samples: 1597440 | consumed tokens: 3271557120 | elapsed time per iteration (s): 0.77 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.497679E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.805 | TFLOPs: 20.01 | 31: iteration 6250/ 173500 | consumed samples: 1600000 | consumed tokens: 3276800000 | elapsed time per iteration (s): 0.80 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.455688E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.764 | TFLOPs: 19.34 | 31: iteration 6260/ 173500 | consumed samples: 1602560 | consumed tokens: 3282042880 | elapsed time per iteration (s): 0.78 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.490689E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.443 | TFLOPs: 19.75 | 31: iteration 6270/ 173500 | consumed samples: 1605120 | consumed tokens: 3287285760 | elapsed time per iteration (s): 0.78 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.461667E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.982 | TFLOPs: 19.78 | 31: iteration 6280/ 173500 | consumed samples: 1607680 | consumed tokens: 3292528640 | elapsed time per iteration (s): 0.78 | learning rate: 1.997E-04 | global batch 
size: 256 | lm loss: 2.456489E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.287 | TFLOPs: 19.92 | 31: iteration 6290/ 173500 | consumed samples: 1610240 | consumed tokens: 3297771520 | elapsed time per iteration (s): 0.80 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.450241E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.827 | TFLOPs: 19.35 | 31: iteration 6300/ 173500 | consumed samples: 1612800 | consumed tokens: 3303014400 | elapsed time per iteration (s): 0.80 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.459138E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.832 | TFLOPs: 19.35 | 31: iteration 6310/ 173500 | consumed samples: 1615360 | consumed tokens: 3308257280 | elapsed time per iteration (s): 0.77 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.445894E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.927 | TFLOPs: 20.02 | 31: iteration 6320/ 173500 | consumed samples: 1617920 | consumed tokens: 3313500160 | elapsed time per iteration (s): 0.80 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.467785E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.983 | TFLOPs: 19.42 | 31: iteration 6330/ 173500 | consumed samples: 1620480 | consumed tokens: 3318743040 | elapsed time per iteration (s): 0.78 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.494463E+00 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.090 | TFLOPs: 19.85 | 31: iteration 6340/ 173500 | consumed samples: 1623040 | consumed tokens: 
3323985920 | elapsed time per iteration (s): 0.78 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.426352E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.281 | TFLOPs: 19.74 | 31: iteration 6350/ 173500 | consumed samples: 1625600 | consumed tokens: 3329228800 | elapsed time per iteration (s): 0.81 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.461271E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.763 | TFLOPs: 19.10 | 31: iteration 6360/ 173500 | consumed samples: 1628160 | consumed tokens: 3334471680 | elapsed time per iteration (s): 0.82 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.457786E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.315 | TFLOPs: 18.83 | 31: iteration 6370/ 173500 | consumed samples: 1630720 | consumed tokens: 3339714560 | elapsed time per iteration (s): 0.83 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.477913E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.011 | TFLOPs: 18.63 | 31: iteration 6380/ 173500 | consumed samples: 1633280 | consumed tokens: 3344957440 | elapsed time per iteration (s): 0.80 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.443755E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.196 | TFLOPs: 19.25 | 31: iteration 6390/ 173500 | consumed samples: 1635840 | consumed tokens: 3350200320 | elapsed time per iteration (s): 0.83 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.453707E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.632 | 
TFLOPs: 18.73 | 31: iteration 6400/ 173500 | consumed samples: 1638400 | consumed tokens: 3355443200 | elapsed time per iteration (s): 0.79 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.453955E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.746 | TFLOPs: 19.71 | 31: iteration 6410/ 173500 | consumed samples: 1640960 | consumed tokens: 3360686080 | elapsed time per iteration (s): 0.80 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.474200E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.081 | TFLOPs: 19.24 | 31: iteration 6420/ 173500 | consumed samples: 1643520 | consumed tokens: 3365928960 | elapsed time per iteration (s): 0.80 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.433254E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.018 | TFLOPs: 19.36 | 31: iteration 6430/ 173500 | consumed samples: 1646080 | consumed tokens: 3371171840 | elapsed time per iteration (s): 0.78 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.453252E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.023 | TFLOPs: 19.91 | 31: iteration 6440/ 173500 | consumed samples: 1648640 | consumed tokens: 3376414720 | elapsed time per iteration (s): 0.82 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.455310E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.310 | TFLOPs: 18.83 | 31: iteration 6450/ 173500 | consumed samples: 1651200 | consumed tokens: 3381657600 | elapsed time per iteration (s): 0.79 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.450424E+00 | grad norm: 0.190 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.461 | TFLOPs: 19.51 | 31: iteration 6460/ 173500 | consumed samples: 1653760 | consumed tokens: 3386900480 | elapsed time per iteration (s): 0.80 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.416161E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.382 | TFLOPs: 19.26 | 31: iteration 6470/ 173500 | consumed samples: 1656320 | consumed tokens: 3392143360 | elapsed time per iteration (s): 0.85 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.433536E+00 | grad norm: 0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.514 | TFLOPs: 18.12 | 31: iteration 6480/ 173500 | consumed samples: 1658880 | consumed tokens: 3397386240 | elapsed time per iteration (s): 0.83 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.460084E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.426 | TFLOPs: 18.66 | 31: iteration 6490/ 173500 | consumed samples: 1661440 | consumed tokens: 3402629120 | elapsed time per iteration (s): 0.75 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.441385E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.009 | TFLOPs: 20.69 | 31: iteration 6500/ 173500 | consumed samples: 1664000 | consumed tokens: 3407872000 | elapsed time per iteration (s): 0.80 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.419964E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.681 | TFLOPs: 19.34 | 31: iteration 6510/ 173500 | consumed samples: 1666560 | consumed tokens: 3413114880 | elapsed time per iteration (s): 0.76 | learning rate: 
1.997E-04 | global batch size: 256 | lm loss: 2.449721E+00 | grad norm: 0.280 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.970 | TFLOPs: 20.45 | 31: iteration 6520/ 173500 | consumed samples: 1669120 | consumed tokens: 3418357760 | elapsed time per iteration (s): 0.77 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.461178E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.612 | TFLOPs: 20.24 | 31: iteration 6530/ 173500 | consumed samples: 1671680 | consumed tokens: 3423600640 | elapsed time per iteration (s): 0.75 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.428314E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.280 | TFLOPs: 20.65 | 31: iteration 6540/ 173500 | consumed samples: 1674240 | consumed tokens: 3428843520 | elapsed time per iteration (s): 0.79 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.433756E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.384 | TFLOPs: 19.62 | 31: iteration 6550/ 173500 | consumed samples: 1676800 | consumed tokens: 3434086400 | elapsed time per iteration (s): 0.73 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.463258E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.002 | TFLOPs: 21.17 | 31: iteration 6560/ 173500 | consumed samples: 1679360 | consumed tokens: 3439329280 | elapsed time per iteration (s): 0.76 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.447651E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.781 | TFLOPs: 20.31 | 31: iteration 6570/ 173500 | consumed samples: 
1681920 | consumed tokens: 3444572160 | elapsed time per iteration (s): 0.78 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.448961E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.459 | TFLOPs: 19.87 | 31: iteration 6580/ 173500 | consumed samples: 1684480 | consumed tokens: 3449815040 | elapsed time per iteration (s): 0.79 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.456686E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.755 | TFLOPs: 19.59 | 31: iteration 6590/ 173500 | consumed samples: 1687040 | consumed tokens: 3455057920 | elapsed time per iteration (s): 0.81 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.441774E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.402 | TFLOPs: 19.14 | 31: iteration 6600/ 173500 | consumed samples: 1689600 | consumed tokens: 3460300800 | elapsed time per iteration (s): 0.77 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.444259E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.351 | TFLOPs: 20.23 | 31: iteration 6610/ 173500 | consumed samples: 1692160 | consumed tokens: 3465543680 | elapsed time per iteration (s): 0.82 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.446906E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.213 | TFLOPs: 18.89 | 31: iteration 6620/ 173500 | consumed samples: 1694720 | consumed tokens: 3470786560 | elapsed time per iteration (s): 0.75 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.611140E+00 | grad norm: 1.737 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 343.101 | TFLOPs: 20.76 | 31: iteration 6630/ 173500 | consumed samples: 1697280 | consumed tokens: 3476029440 | elapsed time per iteration (s): 0.73 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.564015E+00 | grad norm: 0.431 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.641 | TFLOPs: 21.15 | 31: iteration 6640/ 173500 | consumed samples: 1699840 | consumed tokens: 3481272320 | elapsed time per iteration (s): 0.78 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.480476E+00 | grad norm: 0.230 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.001 | TFLOPs: 19.84 | 31: iteration 6650/ 173500 | consumed samples: 1702400 | consumed tokens: 3486515200 | elapsed time per iteration (s): 0.77 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.443456E+00 | grad norm: 0.213 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.860 | TFLOPs: 20.14 | 31: iteration 6660/ 173500 | consumed samples: 1704960 | consumed tokens: 3491758080 | elapsed time per iteration (s): 0.82 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.485322E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.792 | TFLOPs: 18.80 | 31: iteration 6670/ 173500 | consumed samples: 1707520 | consumed tokens: 3497000960 | elapsed time per iteration (s): 0.81 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.446462E+00 | grad norm: 0.240 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.063 | TFLOPs: 19.12 | 31: iteration 6680/ 173500 | consumed samples: 1710080 | consumed tokens: 3502243840 | elapsed time per iteration (s): 0.79 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.445148E+00 | grad norm: 
0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.399 | TFLOPs: 19.69 | 31: iteration 6690/ 173500 | consumed samples: 1712640 | consumed tokens: 3507486720 | elapsed time per iteration (s): 0.76 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.457207E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.930 | TFLOPs: 20.50 | 31: iteration 6700/ 173500 | consumed samples: 1715200 | consumed tokens: 3512729600 | elapsed time per iteration (s): 0.82 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.460506E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.052 | TFLOPs: 18.88 | 31: iteration 6710/ 173500 | consumed samples: 1717760 | consumed tokens: 3517972480 | elapsed time per iteration (s): 0.89 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.471698E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.807 | TFLOPs: 17.47 | 31: iteration 6720/ 173500 | consumed samples: 1720320 | consumed tokens: 3523215360 | elapsed time per iteration (s): 0.82 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.426513E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.843 | TFLOPs: 18.87 | 31: iteration 6730/ 173500 | consumed samples: 1722880 | consumed tokens: 3528458240 | elapsed time per iteration (s): 0.75 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.469852E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.854 | TFLOPs: 20.56 | 31: iteration 6740/ 173500 | consumed samples: 1725440 | consumed tokens: 3533701120 | elapsed time per iteration (s): 0.76 
| learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.445346E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.446 | TFLOPs: 20.29 | 31: iteration 6750/ 173500 | consumed samples: 1728000 | consumed tokens: 3538944000 | elapsed time per iteration (s): 0.80 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.421943E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.097 | TFLOPs: 19.30 | 31: iteration 6760/ 173500 | consumed samples: 1730560 | consumed tokens: 3544186880 | elapsed time per iteration (s): 0.75 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.440991E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.623 | TFLOPs: 20.55 | 31: iteration 6770/ 173500 | consumed samples: 1733120 | consumed tokens: 3549429760 | elapsed time per iteration (s): 0.73 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.403041E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.822 | TFLOPs: 21.10 | 31: iteration 6780/ 173500 | consumed samples: 1735680 | consumed tokens: 3554672640 | elapsed time per iteration (s): 0.73 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.415804E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.236 | TFLOPs: 21.13 | 31: iteration 6790/ 173500 | consumed samples: 1738240 | consumed tokens: 3559915520 | elapsed time per iteration (s): 0.73 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.436629E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.203 | TFLOPs: 21.31 | 31: iteration 6800/ 173500 | 
consumed samples: 1740800 | consumed tokens: 3565158400 | elapsed time per iteration (s): 0.74 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.483702E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.804 | TFLOPs: 20.80 | 31: iteration 6810/ 173500 | consumed samples: 1743360 | consumed tokens: 3570401280 | elapsed time per iteration (s): 0.74 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.428499E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.157 | TFLOPs: 20.94 | 31: iteration 6820/ 173500 | consumed samples: 1745920 | consumed tokens: 3575644160 | elapsed time per iteration (s): 0.76 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.428509E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.836 | TFLOPs: 20.44 | 31: iteration 6830/ 173500 | consumed samples: 1748480 | consumed tokens: 3580887040 | elapsed time per iteration (s): 0.76 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.443719E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.442 | TFLOPs: 20.47 | 31: iteration 6840/ 173500 | consumed samples: 1751040 | consumed tokens: 3586129920 | elapsed time per iteration (s): 0.80 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.435594E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.771 | TFLOPs: 19.28 | 31: iteration 6850/ 173500 | consumed samples: 1753600 | consumed tokens: 3591372800 | elapsed time per iteration (s): 0.83 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.429937E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 307.278 | TFLOPs: 18.59 | 31: iteration 6860/ 173500 | consumed samples: 1756160 | consumed tokens: 3596615680 | elapsed time per iteration (s): 0.81 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.413427E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.835 | TFLOPs: 19.17 | 31: iteration 6870/ 173500 | consumed samples: 1758720 | consumed tokens: 3601858560 | elapsed time per iteration (s): 0.77 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.461867E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.457 | TFLOPs: 19.99 | 31: iteration 6880/ 173500 | consumed samples: 1761280 | consumed tokens: 3607101440 | elapsed time per iteration (s): 0.78 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.450098E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.671 | TFLOPs: 19.88 | 31: iteration 6890/ 173500 | consumed samples: 1763840 | consumed tokens: 3612344320 | elapsed time per iteration (s): 0.77 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.385521E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.397 | TFLOPs: 20.05 | 31: iteration 6900/ 173500 | consumed samples: 1766400 | consumed tokens: 3617587200 | elapsed time per iteration (s): 0.75 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.394336E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.560 | TFLOPs: 20.72 | 31: iteration 6910/ 173500 | consumed samples: 1768960 | consumed tokens: 3622830080 | elapsed time per iteration (s): 0.78 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 
2.428455E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.369 | TFLOPs: 19.74 | 31: iteration 6920/ 173500 | consumed samples: 1771520 | consumed tokens: 3628072960 | elapsed time per iteration (s): 0.81 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.424338E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.535 | TFLOPs: 19.21 | 31: iteration 6930/ 173500 | consumed samples: 1774080 | consumed tokens: 3633315840 | elapsed time per iteration (s): 0.76 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.418342E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.975 | TFLOPs: 20.39 | 31: iteration 6940/ 173500 | consumed samples: 1776640 | consumed tokens: 3638558720 | elapsed time per iteration (s): 0.76 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.437292E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.857 | TFLOPs: 20.32 | 31: iteration 6950/ 173500 | consumed samples: 1779200 | consumed tokens: 3643801600 | elapsed time per iteration (s): 0.82 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.412548E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.029 | TFLOPs: 19.00 | 31: iteration 6960/ 173500 | consumed samples: 1781760 | consumed tokens: 3649044480 | elapsed time per iteration (s): 0.76 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.453617E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.439 | TFLOPs: 20.41 | 31: iteration 6970/ 173500 | consumed samples: 1784320 | consumed tokens: 3654287360 | elapsed 
time per iteration (s): 0.77 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.413955E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.475 | TFLOPs: 20.11 | 31: iteration 6980/ 173500 | consumed samples: 1786880 | consumed tokens: 3659530240 | elapsed time per iteration (s): 0.83 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.392517E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.650 | TFLOPs: 18.73 | 31: iteration 6990/ 173500 | consumed samples: 1789440 | consumed tokens: 3664773120 | elapsed time per iteration (s): 0.74 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.447829E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.653 | TFLOPs: 20.79 | 31: iteration 7000/ 173500 | consumed samples: 1792000 | consumed tokens: 3670016000 | elapsed time per iteration (s): 0.89 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.432733E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.704 | TFLOPs: 17.41 | 0: saving checkpoint at iteration 7000 to checkpoints_1b1long 31: ------------------------------------------------------------------------------------------ 31: valid loss at iteration 7000 | lm loss value: 2.392945E+00 | lm loss PPL: 1.094568E+01 | 31: ------------------------------------------------------------------------------------------ 0: [2022-11-25 19:05:28,848] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step7000 is begin to save! 0: [2022-11-25 19:05:28,858] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_01-model_00-model_states.pt... 
0: [2022-11-25 19:05:29,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_01-model_00-model_states.pt.
0: [2022-11-25 19:05:29,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_03-model_00-model_states.pt...
0: [2022-11-25 19:05:29,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_03-model_00-model_states.pt.
0: [2022-11-25 19:05:29,144] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_04-model_00-model_states.pt...
0: [2022-11-25 19:05:29,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_04-model_00-model_states.pt.
0: [2022-11-25 19:05:29,225] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_05-model_00-model_states.pt...
0: [2022-11-25 19:05:29,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_05-model_00-model_states.pt.
0: [2022-11-25 19:05:29,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_06-model_00-model_states.pt...
0: [2022-11-25 19:05:29,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_06-model_00-model_states.pt.
0: [2022-11-25 19:05:29,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_07-model_00-model_states.pt...
0: [2022-11-25 19:05:29,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_07-model_00-model_states.pt.
0: [2022-11-25 19:05:29,449] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_08-model_00-model_states.pt...
0: [2022-11-25 19:05:29,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_08-model_00-model_states.pt.
0: [2022-11-25 19:05:29,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_09-model_00-model_states.pt...
0: [2022-11-25 19:05:29,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_09-model_00-model_states.pt.
0: [2022-11-25 19:05:29,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_10-model_00-model_states.pt...
0: [2022-11-25 19:05:29,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_10-model_00-model_states.pt.
0: [2022-11-25 19:05:29,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_11-model_00-model_states.pt...
0: [2022-11-25 19:05:29,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_11-model_00-model_states.pt.
0: [2022-11-25 19:05:29,769] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_12-model_00-model_states.pt...
0: [2022-11-25 19:05:29,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_12-model_00-model_states.pt.
0: [2022-11-25 19:05:29,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_13-model_00-model_states.pt...
0: [2022-11-25 19:05:29,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_13-model_00-model_states.pt.
0: [2022-11-25 19:05:29,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_14-model_00-model_states.pt...
0: [2022-11-25 19:05:30,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_14-model_00-model_states.pt.
0: [2022-11-25 19:05:30,006] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_15-model_00-model_states.pt...
0: [2022-11-25 19:05:30,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_15-model_00-model_states.pt.
0: [2022-11-25 19:05:30,081] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_16-model_00-model_states.pt...
0: [2022-11-25 19:05:30,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_16-model_00-model_states.pt.
0: [2022-11-25 19:05:30,155] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_17-model_00-model_states.pt...
0: [2022-11-25 19:05:30,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_17-model_00-model_states.pt.
0: [2022-11-25 19:05:30,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_18-model_00-model_states.pt...
0: [2022-11-25 19:05:30,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_18-model_00-model_states.pt.
0: [2022-11-25 19:05:30,312] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_19-model_00-model_states.pt...
0: [2022-11-25 19:05:30,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_19-model_00-model_states.pt.
0: [2022-11-25 19:05:30,387] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_20-model_00-model_states.pt...
0: [2022-11-25 19:05:30,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_20-model_00-model_states.pt.
0: [2022-11-25 19:05:30,462] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_21-model_00-model_states.pt...
0: [2022-11-25 19:05:30,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_21-model_00-model_states.pt.
0: [2022-11-25 19:05:30,536] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_22-model_00-model_states.pt...
0: [2022-11-25 19:05:30,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_22-model_00-model_states.pt.
0: [2022-11-25 19:05:30,609] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_23-model_00-model_states.pt...
0: [2022-11-25 19:05:30,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_23-model_00-model_states.pt.
0: [2022-11-25 19:05:30,689] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_24-model_00-model_states.pt...
0: [2022-11-25 19:05:30,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_24-model_00-model_states.pt.
0: [2022-11-25 19:05:30,762] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_25-model_00-model_states.pt...
0: [2022-11-25 19:05:30,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_25-model_00-model_states.pt.
0: [2022-11-25 19:05:30,838] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_26-model_00-model_states.pt...
0: [2022-11-25 19:05:30,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_26-model_00-model_states.pt. 0: [2022-11-25 19:05:30,911] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_27-model_00-model_states.pt... 0: [2022-11-25 19:05:30,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_27-model_00-model_states.pt. 0: [2022-11-25 19:05:30,986] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_28-model_00-model_states.pt... 0: [2022-11-25 19:05:31,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_28-model_00-model_states.pt. 0: [2022-11-25 19:05:31,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/layer_30-model_00-model_states.pt... 0: [2022-11-25 19:05:31,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/layer_30-model_00-model_states.pt. 0: [2022-11-25 19:05:31,066] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step7000/mp_rank_00_model_states.pt 0: [2022-11-25 19:05:31,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/mp_rank_00_model_states.pt... 0: [2022-11-25 19:05:31,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/mp_rank_00_model_states.pt. 0: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
0: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
7: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 
9: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
1: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
2: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
12: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 
25: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 
28: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
14: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 
22: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 
21: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 
26: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 
5: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
10: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
13: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 
25: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 
29: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 
21: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
7: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 16: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
3: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
31: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 
26: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 
15: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 24: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 30: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 21: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 26: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 24: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 
30: [2022-11-25 19:05:31,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-25 19:05:31,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:05:31,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 17: [2022-11-25 19:05:31,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 13: [2022-11-25 19:05:31,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 17: [2022-11-25 19:05:31,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 13: [2022-11-25 19:05:31,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 18: [2022-11-25 19:05:31,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-25 19:05:31,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-25 19:05:31,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 19:05:31,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:05:31,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
1: [2022-11-25 19:05:31,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:05:31,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 1: [2022-11-25 19:05:31,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 10: [2022-11-25 19:05:31,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 1: [2022-11-25 19:05:31,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 19:05:31,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 19:05:31,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 6: [2022-11-25 19:05:31,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 16: [2022-11-25 19:05:31,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 24: [2022-11-25 19:05:31,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:05:31,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 31: [2022-11-25 19:05:31,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 
21: [2022-11-25 19:05:31,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-25 19:05:31,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-25 19:05:31,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 27: [2022-11-25 19:05:31,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:05:31,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 16: [2022-11-25 19:05:31,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 24: [2022-11-25 19:05:31,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 14: [2022-11-25 19:05:31,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 31: [2022-11-25 19:05:31,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 27: [2022-11-25 19:05:31,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 6: [2022-11-25 19:05:31,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
16: [2022-11-25 19:05:31,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 24: [2022-11-25 19:05:31,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 14: [2022-11-25 19:05:31,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 31: [2022-11-25 19:05:31,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 27: [2022-11-25 19:05:31,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 26: [2022-11-25 19:05:31,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-25 19:05:31,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-25 19:05:31,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 7: [2022-11-25 19:05:31,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:05:31,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:05:31,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 15: [2022-11-25 19:05:31,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 28: [2022-11-25 19:05:31,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
7: [2022-11-25 19:05:31,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 3: [2022-11-25 19:05:31,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 12: [2022-11-25 19:05:31,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 15: [2022-11-25 19:05:31,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 28: [2022-11-25 19:05:31,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 7: [2022-11-25 19:05:31,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 3: [2022-11-25 19:05:31,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 12: [2022-11-25 19:05:31,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 15: [2022-11-25 19:05:31,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 28: [2022-11-25 19:05:31,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 8: [2022-11-25 19:05:31,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-25 19:05:31,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 19:05:31,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 9: [2022-11-25 19:05:31,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 20: [2022-11-25 19:05:31,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 30: [2022-11-25 19:05:31,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 21: [2022-11-25 19:05:31,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 19: [2022-11-25 19:05:31,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
9: [2022-11-25 19:05:31,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 20: [2022-11-25 19:05:31,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 30: [2022-11-25 19:05:31,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 21: [2022-11-25 19:05:31,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 19: [2022-11-25 19:05:31,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 9: [2022-11-25 19:05:31,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 20: [2022-11-25 19:05:31,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 30: [2022-11-25 19:05:31,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 21: [2022-11-25 19:05:31,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 19: [2022-11-25 19:05:31,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 5: [2022-11-25 19:05:31,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-25 19:05:31,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 25: [2022-11-25 19:05:31,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:05:31,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 29: [2022-11-25 19:05:31,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:05:31,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 25: [2022-11-25 19:05:31,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 14: [2022-11-25 19:05:31,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 29: [2022-11-25 19:05:31,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 25: [2022-11-25 19:05:31,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 14: [2022-11-25 19:05:31,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 29: [2022-11-25 19:05:31,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 8: [2022-11-25 19:05:31,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
1: [2022-11-25 19:05:31,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:05:31,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 22: [2022-11-25 19:05:31,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 27: [2022-11-25 19:05:31,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 8: [2022-11-25 19:05:31,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 1: [2022-11-25 19:05:31,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 23: [2022-11-25 19:05:31,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 22: [2022-11-25 19:05:31,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 27: [2022-11-25 19:05:31,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 8: [2022-11-25 19:05:31,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 1: [2022-11-25 19:05:31,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
23: [2022-11-25 19:05:31,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 22: [2022-11-25 19:05:31,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 27: [2022-11-25 19:05:31,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 22: [2022-11-25 19:05:31,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 29: [2022-11-25 19:05:31,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 22: [2022-11-25 19:05:31,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-25 19:05:31,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 24: [2022-11-25 19:05:31,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 29: [2022-11-25 19:05:31,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 24: [2022-11-25 19:05:31,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 29: [2022-11-25 19:05:31,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 24: [2022-11-25 19:05:31,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
5: [2022-11-25 19:05:31,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:05:31,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:05:31,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:05:31,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:05:31,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 7: [2022-11-25 19:05:31,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 1: [2022-11-25 19:05:31,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 12: [2022-11-25 19:05:31,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 5: [2022-11-25 19:05:31,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 7: [2022-11-25 19:05:31,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 1: [2022-11-25 19:05:31,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 12: [2022-11-25 19:05:31,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
0: [2022-11-25 19:05:31,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:05:31,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:05:31,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:05:31,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:05:31,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 31: [2022-11-25 19:05:31,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 30: [2022-11-25 19:05:31,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 26: [2022-11-25 19:05:31,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
6: [2022-11-25 19:05:31,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 13: [2022-11-25 19:05:31,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 23: [2022-11-25 19:05:31,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 11: [2022-11-25 19:05:31,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 31: [2022-11-25 19:05:31,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 30: [2022-11-25 19:05:31,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 26: [2022-11-25 19:05:31,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 0: [2022-11-25 19:05:31,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 6: [2022-11-25 19:05:31,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 13: [2022-11-25 19:05:31,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 23: [2022-11-25 19:05:31,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
11: [2022-11-25 19:05:31,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 31: [2022-11-25 19:05:31,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 30: [2022-11-25 19:05:31,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 26: [2022-11-25 19:05:31,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 19:05:31,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 13: [2022-11-25 19:05:31,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:05:31,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 29: [2022-11-25 19:05:31,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:05:31,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 11: [2022-11-25 19:05:31,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 29: [2022-11-25 19:05:31,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 13: [2022-11-25 19:05:31,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 11: [2022-11-25 19:05:31,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
29: [2022-11-25 19:05:31,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 14: [2022-11-25 19:05:31,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:05:31,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 19:05:31,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 21: [2022-11-25 19:05:31,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-25 19:05:31,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-25 19:05:31,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 6: [2022-11-25 19:05:31,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:05:31,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:05:31,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:05:31,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 20: [2022-11-25 19:05:31,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 
25: [2022-11-25 19:05:31,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 31: [2022-11-25 19:05:31,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 19: [2022-11-25 19:05:31,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 27: [2022-11-25 19:05:31,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:05:31,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 4: [2022-11-25 19:05:31,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 9: [2022-11-25 19:05:31,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 12: [2022-11-25 19:05:31,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 20: [2022-11-25 19:05:31,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 25: [2022-11-25 19:05:31,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 31: [2022-11-25 19:05:31,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved 
checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 19: [2022-11-25 19:05:31,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 27: [2022-11-25 19:05:31,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 6: [2022-11-25 19:05:31,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 4: [2022-11-25 19:05:31,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 9: [2022-11-25 19:05:31,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 12: [2022-11-25 19:05:31,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 20: [2022-11-25 19:05:31,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 25: [2022-11-25 19:05:31,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 31: [2022-11-25 19:05:31,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 19: [2022-11-25 19:05:31,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 27: [2022-11-25 19:05:31,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 2: [2022-11-25 19:05:31,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:05:31,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-25 19:05:31,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 19:05:31,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 19:05:31,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 26: [2022-11-25 19:05:31,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:05:31,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 26: [2022-11-25 19:05:31,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-25 19:05:31,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 5: [2022-11-25 19:05:31,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 8: [2022-11-25 19:05:31,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:05:31,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:05:31,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 24: [2022-11-25 19:05:31,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
22: [2022-11-25 19:05:31,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:05:31,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 8: [2022-11-25 19:05:31,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 3: [2022-11-25 19:05:31,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 23: [2022-11-25 19:05:31,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 24: [2022-11-25 19:05:31,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 22: [2022-11-25 19:05:31,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 5: [2022-11-25 19:05:31,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 8: [2022-11-25 19:05:31,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 3: [2022-11-25 19:05:31,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 23: [2022-11-25 19:05:31,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 24: [2022-11-25 19:05:31,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
22: [2022-11-25 19:05:31,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 25: [2022-11-25 19:05:31,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 22: [2022-11-25 19:05:31,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 25: [2022-11-25 19:05:31,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 22: [2022-11-25 19:05:31,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 25: [2022-11-25 19:05:31,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 22: [2022-11-25 19:05:31,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 19:05:31,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 15: [2022-11-25 19:05:31,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 20: [2022-11-25 19:05:31,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 30: [2022-11-25 19:05:31,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
15: [2022-11-25 19:05:31,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 20: [2022-11-25 19:05:31,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 30: [2022-11-25 19:05:31,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 0: [2022-11-25 19:05:31,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 15: [2022-11-25 19:05:31,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 20: [2022-11-25 19:05:31,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 30: [2022-11-25 19:05:31,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 19:05:31,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 19: [2022-11-25 19:05:31,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:05:31,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-25 19:05:31,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 19: [2022-11-25 19:05:31,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-25 19:05:31,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 9: [2022-11-25 19:05:31,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 11: [2022-11-25 19:05:31,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:05:31,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 17: [2022-11-25 19:05:31,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-25 19:05:31,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-25 19:05:31,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 4: [2022-11-25 19:05:31,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:05:31,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
4: [2022-11-25 19:05:31,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 3: [2022-11-25 19:05:31,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 4: [2022-11-25 19:05:31,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 3: [2022-11-25 19:05:31,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 15: [2022-11-25 19:05:31,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:05:31,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 26: [2022-11-25 19:05:31,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-25 19:05:31,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 15: [2022-11-25 19:05:31,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 26: [2022-11-25 19:05:31,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 15: [2022-11-25 19:05:31,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 19:05:31,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
11: [2022-11-25 19:05:31,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 17: [2022-11-25 19:05:31,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:05:31,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 17: [2022-11-25 19:05:31,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 0: [2022-11-25 19:05:31,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 11: [2022-11-25 19:05:31,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 17: [2022-11-25 19:05:31,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 19:05:31,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 17: [2022-11-25 19:05:31,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-25 19:05:31,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-25 19:05:31,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 4: [2022-11-25 19:05:31,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-25 19:05:31,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 19:05:31,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 10: [2022-11-25 19:05:31,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:05:31,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 19:05:31,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 9: [2022-11-25 19:05:31,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 16: [2022-11-25 19:05:31,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 28: [2022-11-25 19:05:31,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
9: [2022-11-25 19:05:31,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 16: [2022-11-25 19:05:31,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 28: [2022-11-25 19:05:31,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 9: [2022-11-25 19:05:31,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 16: [2022-11-25 19:05:31,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 28: [2022-11-25 19:05:31,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 14: [2022-11-25 19:05:31,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:05:31,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 19:05:31,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 10: [2022-11-25 19:05:31,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 15: [2022-11-25 19:05:31,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 27: [2022-11-25 19:05:31,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
15: [2022-11-25 19:05:31,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 27: [2022-11-25 19:05:31,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 10: [2022-11-25 19:05:31,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 15: [2022-11-25 19:05:31,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 27: [2022-11-25 19:05:31,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 10: [2022-11-25 19:05:31,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 13: [2022-11-25 19:05:31,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:05:31,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 19:05:31,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 16: [2022-11-25 19:05:31,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:05:31,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 28: [2022-11-25 19:05:31,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
16: [2022-11-25 19:05:31,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 12: [2022-11-25 19:05:31,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 16: [2022-11-25 19:05:31,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 12: [2022-11-25 19:05:31,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 1: [2022-11-25 19:05:31,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:05:31,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 19:05:31,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 28: [2022-11-25 19:05:31,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-25 19:05:31,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 5: [2022-11-25 19:05:31,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:05:31,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 29: [2022-11-25 19:05:31,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 
4: [2022-11-25 19:05:31,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 29: [2022-11-25 19:05:31,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 4: [2022-11-25 19:05:31,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 29: [2022-11-25 19:05:31,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 5: [2022-11-25 19:05:31,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 8: [2022-11-25 19:05:31,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:05:31,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 8: [2022-11-25 19:05:31,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 19:05:31,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 7: [2022-11-25 19:05:31,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 21: [2022-11-25 19:05:31,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
7: [2022-11-25 19:05:31,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 21: [2022-11-25 19:05:31,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 7: [2022-11-25 19:05:31,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 21: [2022-11-25 19:05:31,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 24: [2022-11-25 19:05:31,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-25 19:05:31,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-25 19:05:31,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 23: [2022-11-25 19:05:31,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:05:31,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-25 19:05:31,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 2: [2022-11-25 19:05:31,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-25 19:05:31,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 19:05:31,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 30: [2022-11-25 19:05:31,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-25 19:05:31,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-25 19:05:31,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 31: [2022-11-25 19:05:31,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-25 19:05:31,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-25 19:05:31,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 6: [2022-11-25 19:05:31,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 18: [2022-11-25 19:05:31,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
6: [2022-11-25 19:05:31,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 18: [2022-11-25 19:05:31,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 6: [2022-11-25 19:05:31,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 18: [2022-11-25 19:05:31,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 19: [2022-11-25 19:05:31,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-25 19:05:31,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-25 19:05:31,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 10: [2022-11-25 19:05:31,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 16: [2022-11-25 19:05:31,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 20: [2022-11-25 19:05:31,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
10: [2022-11-25 19:05:31,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 16: [2022-11-25 19:05:31,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 20: [2022-11-25 19:05:31,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 10: [2022-11-25 19:05:31,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 16: [2022-11-25 19:05:31,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 20: [2022-11-25 19:05:31,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 3: [2022-11-25 19:05:31,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:05:31,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 19:05:31,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 1: [2022-11-25 19:05:31,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:05:31,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 19:05:31,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
0: [2022-11-25 19:05:31,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 26: [2022-11-25 19:05:31,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-25 19:05:31,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-25 19:05:31,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 22: [2022-11-25 19:05:31,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-25 19:05:31,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-25 19:05:31,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 15: [2022-11-25 19:05:31,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:05:31,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 15: [2022-11-25 19:05:31,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 11: [2022-11-25 19:05:31,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 15: [2022-11-25 19:05:31,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
11: [2022-11-25 19:05:31,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 28: [2022-11-25 19:05:31,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-25 19:05:31,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-25 19:05:31,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 7: [2022-11-25 19:05:31,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 25: [2022-11-25 19:05:31,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:05:31,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 25: [2022-11-25 19:05:31,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 7: [2022-11-25 19:05:31,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 25: [2022-11-25 19:05:31,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 4: [2022-11-25 19:05:31,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-25 19:05:31,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 19:05:31,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 19:05:31,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 17: [2022-11-25 19:05:31,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-25 19:05:31,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 0: [2022-11-25 19:05:31,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 17: [2022-11-25 19:05:31,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 9: [2022-11-25 19:05:31,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:05:31,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 19:05:31,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 28: [2022-11-25 19:05:31,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-25 19:05:31,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-25 19:05:31,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 18: [2022-11-25 19:05:31,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 27: [2022-11-25 19:05:31,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 18: [2022-11-25 19:05:31,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-25 19:05:31,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 27: [2022-11-25 19:05:31,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-25 19:05:31,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 14: [2022-11-25 19:05:31,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:05:31,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 19:05:31,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 13: [2022-11-25 19:05:31,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-25 19:05:31,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 19:05:31,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 12: [2022-11-25 19:05:31,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:05:31,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 19:05:31,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 10: [2022-11-25 19:05:31,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:05:31,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 19:05:31,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 5: [2022-11-25 19:05:31,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:05:31,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 19:05:31,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 16: [2022-11-25 19:05:31,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
29: [2022-11-25 19:05:31,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 16: [2022-11-25 19:05:31,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 29: [2022-11-25 19:05:31,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 16: [2022-11-25 19:05:31,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 29: [2022-11-25 19:05:31,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 7: [2022-11-25 19:05:31,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:05:31,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 19:05:31,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 8: [2022-11-25 19:05:31,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-25 19:05:31,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 19:05:31,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 21: [2022-11-25 19:05:31,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
18: [2022-11-25 19:05:31,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 21: [2022-11-25 19:05:31,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 18: [2022-11-25 19:05:31,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 21: [2022-11-25 19:05:31,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 18: [2022-11-25 19:05:31,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 24: [2022-11-25 19:05:31,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-25 19:05:31,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-25 19:05:31,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 6: [2022-11-25 19:05:31,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:05:31,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 19: [2022-11-25 19:05:31,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
6: [2022-11-25 19:05:31,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 2: [2022-11-25 19:05:31,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 19: [2022-11-25 19:05:31,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 6: [2022-11-25 19:05:31,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 2: [2022-11-25 19:05:31,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 19: [2022-11-25 19:05:31,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 23: [2022-11-25 19:05:31,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:05:31,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-25 19:05:31,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 20: [2022-11-25 19:05:31,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-25 19:05:31,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-25 19:05:31,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
3: [2022-11-25 19:05:31,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 31: [2022-11-25 19:05:31,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 30: [2022-11-25 19:05:31,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:05:31,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 31: [2022-11-25 19:05:31,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 30: [2022-11-25 19:05:31,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 3: [2022-11-25 19:05:31,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 31: [2022-11-25 19:05:31,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 30: [2022-11-25 19:05:31,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 19:05:31,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 25: [2022-11-25 19:05:31,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 22: [2022-11-25 19:05:31,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
25: [2022-11-25 19:05:31,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 22: [2022-11-25 19:05:31,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 0: [2022-11-25 19:05:31,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 22: [2022-11-25 19:05:31,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 19:05:31,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 25: [2022-11-25 19:05:31,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 1: [2022-11-25 19:05:31,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:05:31,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 19:05:31,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 18: [2022-11-25 19:05:31,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 26: [2022-11-25 19:05:31,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
18: [2022-11-25 19:05:31,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-25 19:05:31,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 26: [2022-11-25 19:05:31,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-25 19:05:31,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 11: [2022-11-25 19:05:31,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:05:31,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 19:05:31,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 9: [2022-11-25 19:05:31,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 15: [2022-11-25 19:05:31,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 17: [2022-11-25 19:05:31,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:05:31,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 19:05:31,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
15: [2022-11-25 19:05:31,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 17: [2022-11-25 19:05:31,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 15: [2022-11-25 19:05:31,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 17: [2022-11-25 19:05:31,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 4: [2022-11-25 19:05:31,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 28: [2022-11-25 19:05:31,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:05:31,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 28: [2022-11-25 19:05:31,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 4: [2022-11-25 19:05:31,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 28: [2022-11-25 19:05:31,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 5: [2022-11-25 19:05:31,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 27: [2022-11-25 19:05:31,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 
5: [2022-11-25 19:05:31,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 14: [2022-11-25 19:05:31,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:05:31,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 27: [2022-11-25 19:05:31,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 14: [2022-11-25 19:05:31,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 27: [2022-11-25 19:05:31,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 14: [2022-11-25 19:05:31,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 13: [2022-11-25 19:05:31,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:05:31,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 19:05:31,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 18: [2022-11-25 19:05:31,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 8: [2022-11-25 19:05:31,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
16: [2022-11-25 19:05:31,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 18: [2022-11-25 19:05:31,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 16: [2022-11-25 19:05:31,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 18: [2022-11-25 19:05:31,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 8: [2022-11-25 19:05:31,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 16: [2022-11-25 19:05:31,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 8: [2022-11-25 19:05:31,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 12: [2022-11-25 19:05:31,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:05:31,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 6: [2022-11-25 19:05:31,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:05:31,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:05:31,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
12: [2022-11-25 19:05:31,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 29: [2022-11-25 19:05:31,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 21: [2022-11-25 19:05:31,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:05:31,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 21: [2022-11-25 19:05:31,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 6: [2022-11-25 19:05:31,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 7: [2022-11-25 19:05:31,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 10: [2022-11-25 19:05:31,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 29: [2022-11-25 19:05:31,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 21: [2022-11-25 19:05:31,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 7: [2022-11-25 19:05:31,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 10: [2022-11-25 19:05:31,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
29: [2022-11-25 19:05:31,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 19:05:31,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:05:31,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 20: [2022-11-25 19:05:31,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:05:31,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 30: [2022-11-25 19:05:31,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:05:31,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 30: [2022-11-25 19:05:31,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 0: [2022-11-25 19:05:31,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 1: [2022-11-25 19:05:31,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 20: [2022-11-25 19:05:31,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 23: [2022-11-25 19:05:31,336] [INFO] [torch_checkpoint_engine.py:27:commit] 
[Torch] Checkpoint global_step7000 is ready now! 30: [2022-11-25 19:05:31,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 19:05:31,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 1: [2022-11-25 19:05:31,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 20: [2022-11-25 19:05:31,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 26: [2022-11-25 19:05:31,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-25 19:05:31,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-25 19:05:31,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 25: [2022-11-25 19:05:31,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 19: [2022-11-25 19:05:31,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 25: [2022-11-25 19:05:31,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-25 19:05:31,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
19: [2022-11-25 19:05:31,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-25 19:05:31,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 3: [2022-11-25 19:05:31,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 15: [2022-11-25 19:05:31,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 22: [2022-11-25 19:05:31,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:05:31,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 22: [2022-11-25 19:05:31,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 3: [2022-11-25 19:05:31,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 15: [2022-11-25 19:05:31,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 22: [2022-11-25 19:05:31,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 15: [2022-11-25 19:05:31,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 11: [2022-11-25 19:05:31,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-25 19:05:31,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 19:05:31,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 4: [2022-11-25 19:05:31,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:05:31,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:05:31,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 31: [2022-11-25 19:05:31,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 17: [2022-11-25 19:05:31,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:05:31,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 2: [2022-11-25 19:05:31,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 28: [2022-11-25 19:05:31,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:05:31,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
9: [2022-11-25 19:05:31,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 2: [2022-11-25 19:05:31,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 28: [2022-11-25 19:05:31,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 31: [2022-11-25 19:05:31,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 17: [2022-11-25 19:05:31,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 9: [2022-11-25 19:05:31,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 28: [2022-11-25 19:05:31,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 31: [2022-11-25 19:05:31,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 17: [2022-11-25 19:05:31,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 18: [2022-11-25 19:05:31,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-25 19:05:31,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-25 19:05:31,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
14: [2022-11-25 19:05:31,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:05:31,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 19:05:31,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 13: [2022-11-25 19:05:31,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 24: [2022-11-25 19:05:31,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-25 19:05:31,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 13: [2022-11-25 19:05:31,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 24: [2022-11-25 19:05:31,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 13: [2022-11-25 19:05:31,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 5: [2022-11-25 19:05:31,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:05:31,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 19:05:31,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
16: [2022-11-25 19:05:31,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-25 19:05:31,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-25 19:05:31,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 12: [2022-11-25 19:05:31,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:05:31,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 19:05:31,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 7: [2022-11-25 19:05:31,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:05:31,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:05:31,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 10: [2022-11-25 19:05:31,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 7: [2022-11-25 19:05:31,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 10: [2022-11-25 19:05:31,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
29: [2022-11-25 19:05:31,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-25 19:05:31,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-25 19:05:31,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 21: [2022-11-25 19:05:31,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 27: [2022-11-25 19:05:31,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 21: [2022-11-25 19:05:31,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-25 19:05:31,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 27: [2022-11-25 19:05:31,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-25 19:05:31,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 19:05:31,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:05:31,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 15: [2022-11-25 19:05:31,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
17: [2022-11-25 19:05:31,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 0: [2022-11-25 19:05:31,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 5: [2022-11-25 19:05:31,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 15: [2022-11-25 19:05:31,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 17: [2022-11-25 19:05:31,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 0: [2022-11-25 19:05:31,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 5: [2022-11-25 19:05:31,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 15: [2022-11-25 19:05:31,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 17: [2022-11-25 19:05:31,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 4: [2022-11-25 19:05:31,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:05:31,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:05:31,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
4: [2022-11-25 19:05:31,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 13: [2022-11-25 19:05:31,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:05:31,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 1: [2022-11-25 19:05:31,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 11: [2022-11-25 19:05:31,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 1: [2022-11-25 19:05:31,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 13: [2022-11-25 19:05:31,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 11: [2022-11-25 19:05:31,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 13: [2022-11-25 19:05:31,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 22: [2022-11-25 19:05:31,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:05:31,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
22: [2022-11-25 19:05:31,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 9: [2022-11-25 19:05:31,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 22: [2022-11-25 19:05:31,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 9: [2022-11-25 19:05:31,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 26: [2022-11-25 19:05:31,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:05:31,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 26: [2022-11-25 19:05:31,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 12: [2022-11-25 19:05:31,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 26: [2022-11-25 19:05:31,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 12: [2022-11-25 19:05:31,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 14: [2022-11-25 19:05:31,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-25 19:05:31,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 19:05:31,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 16: [2022-11-25 19:05:31,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-25 19:05:31,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-25 19:05:31,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 8: [2022-11-25 19:05:31,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:05:31,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:05:31,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:05:31,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 8: [2022-11-25 19:05:31,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 2: [2022-11-25 19:05:31,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 3: [2022-11-25 19:05:31,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
8: [2022-11-25 19:05:31,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 2: [2022-11-25 19:05:31,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 21: [2022-11-25 19:05:31,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-25 19:05:31,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-25 19:05:31,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 20: [2022-11-25 19:05:31,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-25 19:05:31,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-25 19:05:31,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 7: [2022-11-25 19:05:31,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 8: [2022-11-25 19:05:31,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:05:31,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 25: [2022-11-25 19:05:31,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
28: [2022-11-25 19:05:31,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 24: [2022-11-25 19:05:31,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-25 19:05:31,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 30: [2022-11-25 19:05:31,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:05:31,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 8: [2022-11-25 19:05:31,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 25: [2022-11-25 19:05:31,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 24: [2022-11-25 19:05:31,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-25 19:05:31,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 30: [2022-11-25 19:05:31,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 7: [2022-11-25 19:05:31,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
8: [2022-11-25 19:05:31,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 10: [2022-11-25 19:05:31,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 25: [2022-11-25 19:05:31,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 24: [2022-11-25 19:05:31,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 24: [2022-11-25 19:05:31,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 30: [2022-11-25 19:05:31,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 10: [2022-11-25 19:05:31,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 28: [2022-11-25 19:05:31,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-25 19:05:31,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 18: [2022-11-25 19:05:31,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-25 19:05:31,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-25 19:05:31,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 20: [2022-11-25 19:05:31,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
25: [2022-11-25 19:05:31,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:05:31,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 25: [2022-11-25 19:05:31,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 23: [2022-11-25 19:05:31,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:05:31,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 25: [2022-11-25 19:05:31,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 23: [2022-11-25 19:05:31,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-25 19:05:31,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 23: [2022-11-25 19:05:31,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 20: [2022-11-25 19:05:31,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-25 19:05:31,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 31: [2022-11-25 19:05:31,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-25 19:05:31,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-25 19:05:31,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 2: [2022-11-25 19:05:31,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:05:31,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 19:05:31,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 31: [2022-11-25 19:05:31,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-25 19:05:31,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-25 19:05:31,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 29: [2022-11-25 19:05:31,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-25 19:05:31,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-25 19:05:31,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 3: [2022-11-25 19:05:31,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-25 19:05:31,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 19:05:31,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 30: [2022-11-25 19:05:31,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-25 19:05:31,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-25 19:05:31,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 2: [2022-11-25 19:05:31,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:05:31,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 19:05:31,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 19: [2022-11-25 19:05:31,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-25 19:05:31,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-25 19:05:31,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 19: [2022-11-25 19:05:31,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-25 19:05:31,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-25 19:05:31,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 6: [2022-11-25 19:05:31,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:05:31,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:05:31,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 19:05:31,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 19:05:31,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 6: [2022-11-25 19:05:31,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 27: [2022-11-25 19:05:31,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-25 19:05:31,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step7000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-25 19:05:31,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
0: successfully saved checkpoint at iteration 7000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2570.57 31: iteration 7010/ 173500 | consumed samples: 1794560 | consumed tokens: 3675258880 | elapsed time per iteration (s): 1.10 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.407776E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.507 | TFLOPs: 14.07 | 31: iteration 7020/ 173500 | consumed samples: 1797120 | consumed tokens: 3680501760 | elapsed time per iteration (s): 0.77 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.457025E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.292 | TFLOPs: 20.10 | 31: iteration 7030/ 173500 | consumed samples: 1799680 | consumed tokens: 3685744640 | elapsed time per iteration (s): 0.77 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.399484E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.379 | TFLOPs: 20.23 | 31: iteration 7040/ 173500 | consumed samples: 1802240 | consumed tokens: 3690987520 | elapsed time per iteration (s): 0.84 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.437786E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.551 | TFLOPs: 18.42 | 31: iteration 7050/ 173500 | consumed samples: 1804800 | consumed tokens: 3696230400 | elapsed time per iteration (s): 0.79 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.437724E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.843 | TFLOPs: 19.53 | 31: iteration 7060/ 173500 | consumed samples: 1807360 | consumed tokens: 3701473280 | elapsed time per iteration (s): 0.78 | learning rate: 
1.996E-04 | global batch size: 256 | lm loss: 2.408552E+00 | grad norm: 0.251 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.496 | TFLOPs: 19.75 | 31: iteration 7070/ 173500 | consumed samples: 1809920 | consumed tokens: 3706716160 | elapsed time per iteration (s): 0.80 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.426172E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.912 | TFLOPs: 19.41 | 31: iteration 7080/ 173500 | consumed samples: 1812480 | consumed tokens: 3711959040 | elapsed time per iteration (s): 0.80 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.407081E+00 | grad norm: 0.213 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.464 | TFLOPs: 19.33 | 31: iteration 7090/ 173500 | consumed samples: 1815040 | consumed tokens: 3717201920 | elapsed time per iteration (s): 0.76 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.431461E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.198 | TFLOPs: 20.40 | 31: iteration 7100/ 173500 | consumed samples: 1817600 | consumed tokens: 3722444800 | elapsed time per iteration (s): 0.79 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.394204E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.860 | TFLOPs: 19.53 | 31: iteration 7110/ 173500 | consumed samples: 1820160 | consumed tokens: 3727687680 | elapsed time per iteration (s): 0.81 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.451241E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.280 | TFLOPs: 19.01 | 31: iteration 7120/ 173500 | consumed samples: 
1822720 | consumed tokens: 3732930560 | elapsed time per iteration (s): 0.78 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.589594E+00 | grad norm: 15.927 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.328 | TFLOPs: 19.74 | 31: iteration 7130/ 173500 | consumed samples: 1825280 | consumed tokens: 3738173440 | elapsed time per iteration (s): 0.81 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.539331E+00 | grad norm: 0.359 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.640 | TFLOPs: 19.10 | 31: iteration 7140/ 173500 | consumed samples: 1827840 | consumed tokens: 3743416320 | elapsed time per iteration (s): 0.83 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.579652E+00 | grad norm: 0.347 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.632 | TFLOPs: 18.61 | 31: iteration 7150/ 173500 | consumed samples: 1830400 | consumed tokens: 3748659200 | elapsed time per iteration (s): 0.79 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.481176E+00 | grad norm: 0.256 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.364 | TFLOPs: 19.68 | 31: iteration 7160/ 173500 | consumed samples: 1832960 | consumed tokens: 3753902080 | elapsed time per iteration (s): 0.74 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.480926E+00 | grad norm: 0.280 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.762 | TFLOPs: 20.86 | 31: iteration 7170/ 173500 | consumed samples: 1835520 | consumed tokens: 3759144960 | elapsed time per iteration (s): 0.77 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.471852E+00 | grad norm: 0.212 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 331.806 | TFLOPs: 20.07 | 31: iteration 7180/ 173500 | consumed samples: 1838080 | consumed tokens: 3764387840 | elapsed time per iteration (s): 0.75 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.446671E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.631 | TFLOPs: 20.73 | 31: iteration 7190/ 173500 | consumed samples: 1840640 | consumed tokens: 3769630720 | elapsed time per iteration (s): 0.78 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.415255E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.137 | TFLOPs: 19.85 | 31: iteration 7200/ 173500 | consumed samples: 1843200 | consumed tokens: 3774873600 | elapsed time per iteration (s): 0.78 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.420667E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.611 | TFLOPs: 19.82 | 31: iteration 7210/ 173500 | consumed samples: 1845760 | consumed tokens: 3780116480 | elapsed time per iteration (s): 0.76 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.430837E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.030 | TFLOPs: 20.45 | 31: iteration 7220/ 173500 | consumed samples: 1848320 | consumed tokens: 3785359360 | elapsed time per iteration (s): 0.78 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.426650E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.159 | TFLOPs: 19.85 | 31: iteration 7230/ 173500 | consumed samples: 1850880 | consumed tokens: 3790602240 | elapsed time per iteration (s): 0.80 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.436281E+00 | grad norm: 
0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.715 | TFLOPs: 19.34 | 31: iteration 7240/ 173500 | consumed samples: 1853440 | consumed tokens: 3795845120 | elapsed time per iteration (s): 0.78 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.444317E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.158 | TFLOPs: 19.97 | 31: iteration 7250/ 173500 | consumed samples: 1856000 | consumed tokens: 3801088000 | elapsed time per iteration (s): 0.81 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.455424E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.070 | TFLOPs: 19.06 | 31: iteration 7260/ 173500 | consumed samples: 1858560 | consumed tokens: 3806330880 | elapsed time per iteration (s): 0.75 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.402569E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.695 | TFLOPs: 20.61 | 31: iteration 7270/ 173500 | consumed samples: 1861120 | consumed tokens: 3811573760 | elapsed time per iteration (s): 0.76 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.441723E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.024 | TFLOPs: 20.45 | 31: iteration 7280/ 173500 | consumed samples: 1863680 | consumed tokens: 3816816640 | elapsed time per iteration (s): 0.74 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.443414E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.740 | TFLOPs: 20.86 | 31: iteration 7290/ 173500 | consumed samples: 1866240 | consumed tokens: 3822059520 | elapsed time per iteration (s): 0.75 
| learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.429789E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.873 | TFLOPs: 20.74 | 31: iteration 7300/ 173500 | consumed samples: 1868800 | consumed tokens: 3827302400 | elapsed time per iteration (s): 0.74 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.437215E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.536 | TFLOPs: 21.03 | 31: iteration 7310/ 173500 | consumed samples: 1871360 | consumed tokens: 3832545280 | elapsed time per iteration (s): 0.74 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.426075E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.061 | TFLOPs: 21.00 | 31: iteration 7320/ 173500 | consumed samples: 1873920 | consumed tokens: 3837788160 | elapsed time per iteration (s): 0.79 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.410886E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.607 | TFLOPs: 19.64 | 31: iteration 7330/ 173500 | consumed samples: 1876480 | consumed tokens: 3843031040 | elapsed time per iteration (s): 0.79 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.425383E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.969 | TFLOPs: 19.60 | 31: iteration 7340/ 173500 | consumed samples: 1879040 | consumed tokens: 3848273920 | elapsed time per iteration (s): 0.77 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.395883E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.501 | TFLOPs: 20.05 | 31: iteration 7350/ 173500 | 
consumed samples: 1881600 | consumed tokens: 3853516800 | elapsed time per iteration (s): 0.74 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.416019E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.114 | TFLOPs: 21.00 | 31: iteration 7360/ 173500 | consumed samples: 1884160 | consumed tokens: 3858759680 | elapsed time per iteration (s): 0.77 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.430437E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.662 | TFLOPs: 20.06 | 31: iteration 7370/ 173500 | consumed samples: 1886720 | consumed tokens: 3864002560 | elapsed time per iteration (s): 0.75 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.413195E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.590 | TFLOPs: 20.60 | 31: iteration 7380/ 173500 | consumed samples: 1889280 | consumed tokens: 3869245440 | elapsed time per iteration (s): 0.71 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.417224E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.791 | TFLOPs: 21.71 | 31: iteration 7390/ 173500 | consumed samples: 1891840 | consumed tokens: 3874488320 | elapsed time per iteration (s): 0.76 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.386527E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.952 | TFLOPs: 20.51 | 31: iteration 7400/ 173500 | consumed samples: 1894400 | consumed tokens: 3879731200 | elapsed time per iteration (s): 0.77 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.425123E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 334.466 | TFLOPs: 20.23 | 31: iteration 7410/ 173500 | consumed samples: 1896960 | consumed tokens: 3884974080 | elapsed time per iteration (s): 0.78 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.431038E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.446 | TFLOPs: 19.93 | 31: iteration 7420/ 173500 | consumed samples: 1899520 | consumed tokens: 3890216960 | elapsed time per iteration (s): 0.78 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.384222E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.372 | TFLOPs: 19.74 | 31: iteration 7430/ 173500 | consumed samples: 1902080 | consumed tokens: 3895459840 | elapsed time per iteration (s): 0.78 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.412799E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.291 | TFLOPs: 19.92 | 31: iteration 7440/ 173500 | consumed samples: 1904640 | consumed tokens: 3900702720 | elapsed time per iteration (s): 0.80 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.364777E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.679 | TFLOPs: 19.46 | 31: iteration 7450/ 173500 | consumed samples: 1907200 | consumed tokens: 3905945600 | elapsed time per iteration (s): 0.77 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.410170E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.106 | TFLOPs: 20.15 | 31: iteration 7460/ 173500 | consumed samples: 1909760 | consumed tokens: 3911188480 | elapsed time per iteration (s): 0.76 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 
2.422171E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.779 | TFLOPs: 20.43 | 31: iteration 7470/ 173500 | consumed samples: 1912320 | consumed tokens: 3916431360 | elapsed time per iteration (s): 0.76 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.388789E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.407 | TFLOPs: 20.35 | 31: iteration 7480/ 173500 | consumed samples: 1914880 | consumed tokens: 3921674240 | elapsed time per iteration (s): 0.75 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.387462E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.794 | TFLOPs: 20.62 | 31: iteration 7490/ 173500 | consumed samples: 1917440 | consumed tokens: 3926917120 | elapsed time per iteration (s): 0.81 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.417781E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.805 | TFLOPs: 19.17 | 31: iteration 7500/ 173500 | consumed samples: 1920000 | consumed tokens: 3932160000 | elapsed time per iteration (s): 0.78 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.379821E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.660 | TFLOPs: 19.76 | 31: iteration 7510/ 173500 | consumed samples: 1922560 | consumed tokens: 3937402880 | elapsed time per iteration (s): 0.77 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.387961E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.224 | TFLOPs: 20.16 | 31: iteration 7520/ 173500 | consumed samples: 1925120 | consumed tokens: 3942645760 | elapsed 
time per iteration (s): 0.78 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.431937E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.399 | TFLOPs: 19.93 | 31: iteration 7530/ 173500 | consumed samples: 1927680 | consumed tokens: 3947888640 | elapsed time per iteration (s): 0.80 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.433281E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.214 | TFLOPs: 19.37 | 31: iteration 7540/ 173500 | consumed samples: 1930240 | consumed tokens: 3953131520 | elapsed time per iteration (s): 0.80 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.415315E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.730 | TFLOPs: 19.40 | 31: iteration 7550/ 173500 | consumed samples: 1932800 | consumed tokens: 3958374400 | elapsed time per iteration (s): 0.78 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.391582E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.794 | TFLOPs: 19.83 | 31: iteration 7560/ 173500 | consumed samples: 1935360 | consumed tokens: 3963617280 | elapsed time per iteration (s): 0.79 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.394262E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.816 | TFLOPs: 19.53 | 31: iteration 7570/ 173500 | consumed samples: 1937920 | consumed tokens: 3968860160 | elapsed time per iteration (s): 0.78 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.405996E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.369 | TFLOPs: 19.80 | 31: 
iteration 7580/ 173500 | consumed samples: 1940480 | consumed tokens: 3974103040 | elapsed time per iteration (s): 0.77 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.431946E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.673 | TFLOPs: 20.00 | 31: iteration 7590/ 173500 | consumed samples: 1943040 | consumed tokens: 3979345920 | elapsed time per iteration (s): 0.80 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.433916E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.996 | TFLOPs: 19.48 | 31: iteration 7600/ 173500 | consumed samples: 1945600 | consumed tokens: 3984588800 | elapsed time per iteration (s): 0.75 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.400196E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.342 | TFLOPs: 20.77 | 31: iteration 7610/ 173500 | consumed samples: 1948160 | consumed tokens: 3989831680 | elapsed time per iteration (s): 0.77 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.445929E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.703 | TFLOPs: 20.07 | 31: iteration 7620/ 173500 | consumed samples: 1950720 | consumed tokens: 3995074560 | elapsed time per iteration (s): 0.79 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.368337E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.952 | TFLOPs: 19.54 | 31: iteration 7630/ 173500 | consumed samples: 1953280 | consumed tokens: 4000317440 | elapsed time per iteration (s): 0.83 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.389751E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 308.521 | TFLOPs: 18.66 | 31: iteration 7640/ 173500 | consumed samples: 1955840 | consumed tokens: 4005560320 | elapsed time per iteration (s): 0.80 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.418049E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.727 | TFLOPs: 19.34 | 31: iteration 7650/ 173500 | consumed samples: 1958400 | consumed tokens: 4010803200 | elapsed time per iteration (s): 0.84 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.416152E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.745 | TFLOPs: 18.50 | 31: iteration 7660/ 173500 | consumed samples: 1960960 | consumed tokens: 4016046080 | elapsed time per iteration (s): 0.84 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.365497E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.717 | TFLOPs: 18.50 | 31: iteration 7670/ 173500 | consumed samples: 1963520 | consumed tokens: 4021288960 | elapsed time per iteration (s): 0.83 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.403151E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.281 | TFLOPs: 18.71 | 31: iteration 7680/ 173500 | consumed samples: 1966080 | consumed tokens: 4026531840 | elapsed time per iteration (s): 0.86 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.387385E+00 | grad norm: 0.294 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.439 | TFLOPs: 18.05 | 31: iteration 7690/ 173500 | consumed samples: 1968640 | consumed tokens: 4031774720 | elapsed time per iteration (s): 0.83 | learning rate: 1.995E-04 | global batch 
size: 256 | lm loss: 2.420667E+00 | grad norm: 0.217 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.270 | TFLOPs: 18.71 | 31: iteration 7700/ 173500 | consumed samples: 1971200 | consumed tokens: 4037017600 | elapsed time per iteration (s): 0.81 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.399080E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.449 | TFLOPs: 19.14 | 31: iteration 7710/ 173500 | consumed samples: 1973760 | consumed tokens: 4042260480 | elapsed time per iteration (s): 0.82 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.427383E+00 | grad norm: 1.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.662 | TFLOPs: 18.79 | 31: iteration 7720/ 173500 | consumed samples: 1976320 | consumed tokens: 4047503360 | elapsed time per iteration (s): 0.89 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.431523E+00 | grad norm: 0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.600 | TFLOPs: 17.34 | 31: iteration 7730/ 173500 | consumed samples: 1978880 | consumed tokens: 4052746240 | elapsed time per iteration (s): 0.86 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.441389E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.369 | TFLOPs: 17.99 | 31: iteration 7740/ 173500 | consumed samples: 1981440 | consumed tokens: 4057989120 | elapsed time per iteration (s): 0.77 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.415712E+00 | grad norm: 0.212 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.082 | TFLOPs: 20.15 | 31: iteration 7750/ 173500 | consumed samples: 1984000 | consumed tokens: 
4063232000 | elapsed time per iteration (s): 0.82 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.411814E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.470 | TFLOPs: 18.78 | 31: iteration 7760/ 173500 | consumed samples: 1986560 | consumed tokens: 4068474880 | elapsed time per iteration (s): 0.83 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.418111E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.979 | TFLOPs: 18.69 | 31: iteration 7770/ 173500 | consumed samples: 1989120 | consumed tokens: 4073717760 | elapsed time per iteration (s): 0.80 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.403735E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.197 | TFLOPs: 19.25 | 31: iteration 7780/ 173500 | consumed samples: 1991680 | consumed tokens: 4078960640 | elapsed time per iteration (s): 0.76 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.443582E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.998 | TFLOPs: 20.39 | 31: iteration 7790/ 173500 | consumed samples: 1994240 | consumed tokens: 4084203520 | elapsed time per iteration (s): 0.73 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.396617E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.645 | TFLOPs: 21.21 | 31: iteration 7800/ 173500 | consumed samples: 1996800 | consumed tokens: 4089446400 | elapsed time per iteration (s): 0.82 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.393062E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.074 | 
TFLOPs: 18.94 | 31: iteration 7810/ 173500 | consumed samples: 1999360 | consumed tokens: 4094689280 | elapsed time per iteration (s): 0.77 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.384628E+00 | grad norm: 0.213 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.714 | TFLOPs: 20.13 | 31: iteration 7820/ 173500 | consumed samples: 2001920 | consumed tokens: 4099932160 | elapsed time per iteration (s): 0.81 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.431943E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.971 | TFLOPs: 19.12 | 31: iteration 7830/ 173500 | consumed samples: 2004480 | consumed tokens: 4105175040 | elapsed time per iteration (s): 0.79 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.387407E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.805 | TFLOPs: 19.65 | 31: iteration 7840/ 173500 | consumed samples: 2007040 | consumed tokens: 4110417920 | elapsed time per iteration (s): 0.84 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.389913E+00 | grad norm: 0.269 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.333 | TFLOPs: 18.47 | 31: iteration 7850/ 173500 | consumed samples: 2009600 | consumed tokens: 4115660800 | elapsed time per iteration (s): 0.81 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.432721E+00 | grad norm: 0.322 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.430 | TFLOPs: 19.20 | 31: iteration 7860/ 173500 | consumed samples: 2012160 | consumed tokens: 4120903680 | elapsed time per iteration (s): 0.82 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.419502E+00 | grad norm: 0.697 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.626 | TFLOPs: 18.85 | 31: iteration 7870/ 173500 | consumed samples: 2014720 | consumed tokens: 4126146560 | elapsed time per iteration (s): 0.78 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.611353E+00 | grad norm: 1.522 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.840 | TFLOPs: 19.77 | 31: iteration 7880/ 173500 | consumed samples: 2017280 | consumed tokens: 4131389440 | elapsed time per iteration (s): 0.80 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.568429E+00 | grad norm: 0.586 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.206 | TFLOPs: 19.43 | 31: iteration 7890/ 173500 | consumed samples: 2019840 | consumed tokens: 4136632320 | elapsed time per iteration (s): 0.77 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.495552E+00 | grad norm: 0.282 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.666 | TFLOPs: 20.00 | 31: iteration 7900/ 173500 | consumed samples: 2022400 | consumed tokens: 4141875200 | elapsed time per iteration (s): 0.77 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.427970E+00 | grad norm: 0.218 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.229 | TFLOPs: 20.10 | 31: iteration 7910/ 173500 | consumed samples: 2024960 | consumed tokens: 4147118080 | elapsed time per iteration (s): 0.82 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.455713E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.811 | TFLOPs: 18.80 | 31: iteration 7920/ 173500 | consumed samples: 2027520 | consumed tokens: 4152360960 | elapsed time per iteration (s): 0.82 | learning rate: 
1.994E-04 | global batch size: 256 | lm loss: 2.408054E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.489 | TFLOPs: 18.97 | 31: iteration 7930/ 173500 | consumed samples: 2030080 | consumed tokens: 4157603840 | elapsed time per iteration (s): 0.80 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.444890E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.224 | TFLOPs: 19.37 | 31: iteration 7940/ 173500 | consumed samples: 2032640 | consumed tokens: 4162846720 | elapsed time per iteration (s): 0.83 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.382885E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.658 | TFLOPs: 18.61 | 31: iteration 7950/ 173500 | consumed samples: 2035200 | consumed tokens: 4168089600 | elapsed time per iteration (s): 0.79 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.383477E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.833 | TFLOPs: 19.71 | 31: iteration 7960/ 173500 | consumed samples: 2037760 | consumed tokens: 4173332480 | elapsed time per iteration (s): 0.82 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.450934E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.370 | TFLOPs: 18.96 | 31: iteration 7970/ 173500 | consumed samples: 2040320 | consumed tokens: 4178575360 | elapsed time per iteration (s): 0.84 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.414157E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.875 | TFLOPs: 18.50 | 31: iteration 7980/ 173500 | consumed samples: 
2042880 | consumed tokens: 4183818240 | elapsed time per iteration (s): 0.77 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.403847E+00 | grad norm: 0.218 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.363 | TFLOPs: 20.23 |
31: iteration 7990/ 173500 | consumed samples: 2045440 | consumed tokens: 4189061120 | elapsed time per iteration (s): 0.84 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.437326E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.761 | TFLOPs: 18.50 |
0: [2022-11-25 19:18:41,049] [INFO] [logging.py:68:log_dist] [Rank 0] step=8000, skipped=0, lr=[0.00019940979012929202, 0.00019940979012929202, 0.00019940979012929202], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
31: iteration 8000/ 173500 | consumed samples: 2048000 | consumed tokens: 4194304000 | elapsed time per iteration (s): 0.79 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.401987E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.653 | TFLOPs: 19.70 |
0: steps: 8000 loss: 2.3987 iter time (s): 0.788 samples/sec: 324.937
31: ------------------------------------------------------------------------------------------
31: valid loss at iteration 8000 | lm loss value: 2.384506E+00 | lm loss PPL: 1.085370E+01 |
31: ------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 8000 to checkpoints_1b1long
0: [2022-11-25 19:18:41,297] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step8000 is begin to save!
0: [2022-11-25 19:18:41,307] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_01-model_00-model_states.pt...
0: [2022-11-25 19:18:41,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_01-model_00-model_states.pt. 0: [2022-11-25 19:18:41,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_03-model_00-model_states.pt... 0: [2022-11-25 19:18:41,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_03-model_00-model_states.pt. 0: [2022-11-25 19:18:41,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_04-model_00-model_states.pt... 0: [2022-11-25 19:18:41,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_04-model_00-model_states.pt. 0: [2022-11-25 19:18:41,649] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_05-model_00-model_states.pt... 0: [2022-11-25 19:18:41,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_05-model_00-model_states.pt. 0: [2022-11-25 19:18:41,722] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_06-model_00-model_states.pt... 0: [2022-11-25 19:18:41,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_06-model_00-model_states.pt. 0: [2022-11-25 19:18:41,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_07-model_00-model_states.pt... 0: [2022-11-25 19:18:41,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_07-model_00-model_states.pt. 0: [2022-11-25 19:18:41,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_08-model_00-model_states.pt... 
0: [2022-11-25 19:18:41,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_08-model_00-model_states.pt. 0: [2022-11-25 19:18:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_09-model_00-model_states.pt... 0: [2022-11-25 19:18:42,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_09-model_00-model_states.pt. 0: [2022-11-25 19:18:42,014] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_10-model_00-model_states.pt... 0: [2022-11-25 19:18:42,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_10-model_00-model_states.pt. 0: [2022-11-25 19:18:42,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_11-model_00-model_states.pt... 0: [2022-11-25 19:18:42,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_11-model_00-model_states.pt. 0: [2022-11-25 19:18:42,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_12-model_00-model_states.pt... 0: [2022-11-25 19:18:42,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_12-model_00-model_states.pt. 0: [2022-11-25 19:18:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_13-model_00-model_states.pt... 0: [2022-11-25 19:18:42,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_13-model_00-model_states.pt. 0: [2022-11-25 19:18:42,311] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_14-model_00-model_states.pt... 
0: [2022-11-25 19:18:42,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_14-model_00-model_states.pt. 0: [2022-11-25 19:18:42,382] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_15-model_00-model_states.pt... 0: [2022-11-25 19:18:42,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_15-model_00-model_states.pt. 0: [2022-11-25 19:18:42,459] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_16-model_00-model_states.pt... 0: [2022-11-25 19:18:42,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_16-model_00-model_states.pt. 0: [2022-11-25 19:18:42,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_17-model_00-model_states.pt... 0: [2022-11-25 19:18:42,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_17-model_00-model_states.pt. 0: [2022-11-25 19:18:42,606] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_18-model_00-model_states.pt... 0: [2022-11-25 19:18:42,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_18-model_00-model_states.pt. 0: [2022-11-25 19:18:42,682] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_19-model_00-model_states.pt... 0: [2022-11-25 19:18:42,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_19-model_00-model_states.pt. 0: [2022-11-25 19:18:42,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_20-model_00-model_states.pt... 
0: [2022-11-25 19:18:42,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_20-model_00-model_states.pt. 0: [2022-11-25 19:18:42,829] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_21-model_00-model_states.pt... 0: [2022-11-25 19:18:42,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_21-model_00-model_states.pt. 0: [2022-11-25 19:18:42,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_22-model_00-model_states.pt... 0: [2022-11-25 19:18:42,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_22-model_00-model_states.pt. 0: [2022-11-25 19:18:42,974] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_23-model_00-model_states.pt... 0: [2022-11-25 19:18:43,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_23-model_00-model_states.pt. 0: [2022-11-25 19:18:43,049] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_24-model_00-model_states.pt... 0: [2022-11-25 19:18:43,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_24-model_00-model_states.pt. 0: [2022-11-25 19:18:43,124] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_25-model_00-model_states.pt... 0: [2022-11-25 19:18:43,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_25-model_00-model_states.pt. 0: [2022-11-25 19:18:43,197] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_26-model_00-model_states.pt... 
0: [2022-11-25 19:18:43,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_26-model_00-model_states.pt. 0: [2022-11-25 19:18:43,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_27-model_00-model_states.pt... 0: [2022-11-25 19:18:43,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_27-model_00-model_states.pt. 0: [2022-11-25 19:18:43,340] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_28-model_00-model_states.pt... 0: [2022-11-25 19:18:43,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_28-model_00-model_states.pt. 0: [2022-11-25 19:18:43,415] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/layer_30-model_00-model_states.pt... 0: [2022-11-25 19:18:43,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/layer_30-model_00-model_states.pt. 0: [2022-11-25 19:18:43,417] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step8000/mp_rank_00_model_states.pt 0: [2022-11-25 19:18:43,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/mp_rank_00_model_states.pt... 0: [2022-11-25 19:18:43,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/mp_rank_00_model_states.pt. 0: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
0: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
4: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
10: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
2: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
12: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 
23: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 
24: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 
29: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 
17: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 
26: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
5: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 
8: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
2: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
15: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 
23: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
31: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 
18: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 
0: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
10: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
15: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 
29: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
0: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 20: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
24: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 22: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 17: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 26: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 31: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 21: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 
21: [2022-11-25 19:18:43,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:18:43,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:18:43,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 19:18:43,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 15: [2022-11-25 19:18:43,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:18:43,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 15: [2022-11-25 19:18:43,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 14: [2022-11-25 19:18:43,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 15: [2022-11-25 19:18:43,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 14: [2022-11-25 19:18:43,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 19: [2022-11-25 19:18:43,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-25 19:18:43,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-25 19:18:43,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 8: [2022-11-25 19:18:43,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 31: [2022-11-25 19:18:43,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 8: [2022-11-25 19:18:43,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 19:18:43,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 31: [2022-11-25 19:18:43,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-25 19:18:43,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 5: [2022-11-25 19:18:43,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 25: [2022-11-25 19:18:43,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
5: [2022-11-25 19:18:43,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 25: [2022-11-25 19:18:43,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 5: [2022-11-25 19:18:43,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 25: [2022-11-25 19:18:43,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 26: [2022-11-25 19:18:43,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-25 19:18:43,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-25 19:18:43,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 6: [2022-11-25 19:18:43,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:18:43,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:18:43,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 28: [2022-11-25 19:18:43,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 24: [2022-11-25 19:18:43,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
29: [2022-11-25 19:18:43,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 30: [2022-11-25 19:18:43,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:18:43,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 4: [2022-11-25 19:18:43,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 2: [2022-11-25 19:18:43,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 24: [2022-11-25 19:18:43,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 29: [2022-11-25 19:18:43,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 30: [2022-11-25 19:18:43,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 6: [2022-11-25 19:18:43,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 4: [2022-11-25 19:18:43,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 2: [2022-11-25 19:18:43,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 24: [2022-11-25 19:18:43,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
29: [2022-11-25 19:18:43,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 30: [2022-11-25 19:18:43,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 5: [2022-11-25 19:18:43,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:18:43,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 8: [2022-11-25 19:18:43,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:18:43,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 16: [2022-11-25 19:18:43,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:18:43,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:18:43,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 28: [2022-11-25 19:18:43,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 22: [2022-11-25 19:18:43,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 17: [2022-11-25 19:18:43,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 
5: [2022-11-25 19:18:43,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 9: [2022-11-25 19:18:43,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 8: [2022-11-25 19:18:43,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 10: [2022-11-25 19:18:43,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 16: [2022-11-25 19:18:43,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-25 19:18:43,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 13: [2022-11-25 19:18:43,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 12: [2022-11-25 19:18:43,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 28: [2022-11-25 19:18:43,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
22: [2022-11-25 19:18:43,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 17: [2022-11-25 19:18:43,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 5: [2022-11-25 19:18:43,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 9: [2022-11-25 19:18:43,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 8: [2022-11-25 19:18:43,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 10: [2022-11-25 19:18:43,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 13: [2022-11-25 19:18:43,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 12: [2022-11-25 19:18:43,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 22: [2022-11-25 19:18:43,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 17: [2022-11-25 19:18:43,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 21: [2022-11-25 19:18:43,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-25 19:18:43,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-25 19:18:43,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
1: [2022-11-25 19:18:43,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 17: [2022-11-25 19:18:43,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-25 19:18:43,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 1: [2022-11-25 19:18:43,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 17: [2022-11-25 19:18:43,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 1: [2022-11-25 19:18:43,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 2: [2022-11-25 19:18:43,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:18:43,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:18:43,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 26: [2022-11-25 19:18:43,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 19: [2022-11-25 19:18:43,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
2: [2022-11-25 19:18:43,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 11: [2022-11-25 19:18:43,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 14: [2022-11-25 19:18:43,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 26: [2022-11-25 19:18:43,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 19: [2022-11-25 19:18:43,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 2: [2022-11-25 19:18:43,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 11: [2022-11-25 19:18:43,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 14: [2022-11-25 19:18:43,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 26: [2022-11-25 19:18:43,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 19: [2022-11-25 19:18:43,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 11: [2022-11-25 19:18:43,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-25 19:18:43,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 19:18:43,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 1: [2022-11-25 19:18:43,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 20: [2022-11-25 19:18:43,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:18:43,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:18:43,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 20: [2022-11-25 19:18:43,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:18:43,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 1: [2022-11-25 19:18:43,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 20: [2022-11-25 19:18:43,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 23: [2022-11-25 19:18:43,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 20: [2022-11-25 19:18:43,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
20: [2022-11-25 19:18:43,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-25 19:18:43,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 24: [2022-11-25 19:18:43,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-25 19:18:43,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-25 19:18:43,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 5: [2022-11-25 19:18:43,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:18:43,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 19:18:43,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 4: [2022-11-25 19:18:43,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:18:43,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 19:18:43,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 25: [2022-11-25 19:18:43,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
25: [2022-11-25 19:18:43,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-25 19:18:43,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 10: [2022-11-25 19:18:43,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 16: [2022-11-25 19:18:43,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 17: [2022-11-25 19:18:43,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:18:43,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 16: [2022-11-25 19:18:43,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 17: [2022-11-25 19:18:43,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 10: [2022-11-25 19:18:43,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 16: [2022-11-25 19:18:43,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 2: [2022-11-25 19:18:43,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 17: [2022-11-25 19:18:43,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
2: [2022-11-25 19:18:43,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 19:18:43,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 23: [2022-11-25 19:18:43,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:18:43,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:18:43,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:18:43,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 14: [2022-11-25 19:18:43,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 10: [2022-11-25 19:18:43,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 23: [2022-11-25 19:18:43,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 14: [2022-11-25 19:18:43,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 10: [2022-11-25 19:18:43,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 26: [2022-11-25 19:18:43,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
26: [2022-11-25 19:18:43,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-25 19:18:43,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 20: [2022-11-25 19:18:43,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 31: [2022-11-25 19:18:43,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-25 19:18:43,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 20: [2022-11-25 19:18:43,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 31: [2022-11-25 19:18:43,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 20: [2022-11-25 19:18:43,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 11: [2022-11-25 19:18:43,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:18:43,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 22: [2022-11-25 19:18:43,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:18:43,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
22: [2022-11-25 19:18:43,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-25 19:18:43,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 0: [2022-11-25 19:18:43,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 19:18:43,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:18:43,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 8: [2022-11-25 19:18:43,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:18:43,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:18:43,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 25: [2022-11-25 19:18:43,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 31: [2022-11-25 19:18:43,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 29: [2022-11-25 19:18:43,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
17: [2022-11-25 19:18:43,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 19: [2022-11-25 19:18:43,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 0: [2022-11-25 19:18:43,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 4: [2022-11-25 19:18:43,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 8: [2022-11-25 19:18:43,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 13: [2022-11-25 19:18:43,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 3: [2022-11-25 19:18:43,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 25: [2022-11-25 19:18:43,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 31: [2022-11-25 19:18:43,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 29: [2022-11-25 19:18:43,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 17: [2022-11-25 19:18:43,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved 
checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 19: [2022-11-25 19:18:43,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 0: [2022-11-25 19:18:43,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 4: [2022-11-25 19:18:43,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 8: [2022-11-25 19:18:43,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 13: [2022-11-25 19:18:43,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 3: [2022-11-25 19:18:43,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 20: [2022-11-25 19:18:43,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 25: [2022-11-25 19:18:43,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 31: [2022-11-25 19:18:43,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 29: [2022-11-25 19:18:43,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 17: [2022-11-25 19:18:43,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 19: [2022-11-25 19:18:43,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 0: [2022-11-25 19:18:43,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
13: [2022-11-25 19:18:43,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:18:43,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 20: [2022-11-25 19:18:43,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 0: [2022-11-25 19:18:43,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 13: [2022-11-25 19:18:43,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 3: [2022-11-25 19:18:43,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 20: [2022-11-25 19:18:43,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 0: [2022-11-25 19:18:43,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:18:43,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 3: [2022-11-25 19:18:43,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 0: [2022-11-25 19:18:43,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 19:18:43,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
0: [2022-11-25 19:18:43,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 19:18:43,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 19:18:43,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 5: [2022-11-25 19:18:43,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:18:43,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 29: [2022-11-25 19:18:43,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 22: [2022-11-25 19:18:43,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 18: [2022-11-25 19:18:43,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:18:43,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 7: [2022-11-25 19:18:43,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 19:18:43,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
29: [2022-11-25 19:18:43,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 22: [2022-11-25 19:18:43,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 18: [2022-11-25 19:18:43,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 5: [2022-11-25 19:18:43,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 7: [2022-11-25 19:18:43,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 29: [2022-11-25 19:18:43,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 22: [2022-11-25 19:18:43,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 18: [2022-11-25 19:18:43,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 7: [2022-11-25 19:18:43,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 19:18:43,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 23: [2022-11-25 19:18:43,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:18:43,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
24: [2022-11-25 19:18:43,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:18:43,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 11: [2022-11-25 19:18:43,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 24: [2022-11-25 19:18:43,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 21: [2022-11-25 19:18:43,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:18:43,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 11: [2022-11-25 19:18:43,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 24: [2022-11-25 19:18:43,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 21: [2022-11-25 19:18:43,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-25 19:18:43,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 16: [2022-11-25 19:18:43,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 26: [2022-11-25 19:18:43,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
16: [2022-11-25 19:18:43,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 26: [2022-11-25 19:18:43,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-25 19:18:43,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 16: [2022-11-25 19:18:43,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 1: [2022-11-25 19:18:43,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 16: [2022-11-25 19:18:43,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 15: [2022-11-25 19:18:43,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:18:43,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
1: [2022-11-25 19:18:43,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 16: [2022-11-25 19:18:43,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 15: [2022-11-25 19:18:43,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 23: [2022-11-25 19:18:43,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 1: [2022-11-25 19:18:43,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 16: [2022-11-25 19:18:43,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 15: [2022-11-25 19:18:43,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 23: [2022-11-25 19:18:43,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 10: [2022-11-25 19:18:43,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:18:43,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:18:43,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
10: [2022-11-25 19:18:43,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 1: [2022-11-25 19:18:43,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 2: [2022-11-25 19:18:43,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 10: [2022-11-25 19:18:43,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 1: [2022-11-25 19:18:43,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 2: [2022-11-25 19:18:43,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 14: [2022-11-25 19:18:43,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:18:43,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 19:18:43,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 9: [2022-11-25 19:18:43,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:18:43,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:18:43,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
9: [2022-11-25 19:18:43,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 25: [2022-11-25 19:18:43,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 19: [2022-11-25 19:18:43,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:18:43,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:18:43,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 9: [2022-11-25 19:18:43,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 25: [2022-11-25 19:18:43,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 19: [2022-11-25 19:18:43,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 6: [2022-11-25 19:18:43,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 7: [2022-11-25 19:18:43,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 25: [2022-11-25 19:18:43,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 19: [2022-11-25 19:18:43,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
6: [2022-11-25 19:18:43,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 19:18:43,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 6: [2022-11-25 19:18:43,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 6: [2022-11-25 19:18:43,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:18:43,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 8: [2022-11-25 19:18:43,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:18:43,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 28: [2022-11-25 19:18:43,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 31: [2022-11-25 19:18:43,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 22: [2022-11-25 19:18:43,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 30: [2022-11-25 19:18:43,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 18: [2022-11-25 19:18:43,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
6: [2022-11-25 19:18:43,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 4: [2022-11-25 19:18:43,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 8: [2022-11-25 19:18:43,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 3: [2022-11-25 19:18:43,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 28: [2022-11-25 19:18:43,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 31: [2022-11-25 19:18:43,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 22: [2022-11-25 19:18:43,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 30: [2022-11-25 19:18:43,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 18: [2022-11-25 19:18:43,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 6: [2022-11-25 19:18:43,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 4: [2022-11-25 19:18:43,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
8: [2022-11-25 19:18:43,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 3: [2022-11-25 19:18:43,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 28: [2022-11-25 19:18:43,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 31: [2022-11-25 19:18:43,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 22: [2022-11-25 19:18:43,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 30: [2022-11-25 19:18:43,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 18: [2022-11-25 19:18:43,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 29: [2022-11-25 19:18:43,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:18:43,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 28: [2022-11-25 19:18:43,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:18:43,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 28: [2022-11-25 19:18:43,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 1: [2022-11-25 19:18:43,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
28: [2022-11-25 19:18:43,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 29: [2022-11-25 19:18:43,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-25 19:18:43,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 5: [2022-11-25 19:18:43,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 30: [2022-11-25 19:18:43,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 18: [2022-11-25 19:18:43,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:18:43,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 30: [2022-11-25 19:18:43,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 18: [2022-11-25 19:18:43,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 5: [2022-11-25 19:18:43,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 30: [2022-11-25 19:18:43,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 18: [2022-11-25 19:18:43,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
18: [2022-11-25 19:18:43,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-25 19:18:43,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-25 19:18:43,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 0: [2022-11-25 19:18:43,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:18:43,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 21: [2022-11-25 19:18:43,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:18:43,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 21: [2022-11-25 19:18:43,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 0: [2022-11-25 19:18:43,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 12: [2022-11-25 19:18:43,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 21: [2022-11-25 19:18:43,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 0: [2022-11-25 19:18:43,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
9: [2022-11-25 19:18:43,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 8: [2022-11-25 19:18:43,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:18:43,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 8: [2022-11-25 19:18:43,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 24: [2022-11-25 19:18:43,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:18:43,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 8: [2022-11-25 19:18:43,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 24: [2022-11-25 19:18:43,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-25 19:18:43,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 6: [2022-11-25 19:18:43,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 29: [2022-11-25 19:18:43,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
6: [2022-11-25 19:18:43,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 29: [2022-11-25 19:18:43,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 6: [2022-11-25 19:18:43,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 29: [2022-11-25 19:18:43,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 9: [2022-11-25 19:18:43,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 15: [2022-11-25 19:18:43,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:18:43,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 15: [2022-11-25 19:18:43,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 9: [2022-11-25 19:18:43,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 15: [2022-11-25 19:18:43,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 21: [2022-11-25 19:18:43,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
21: [2022-11-25 19:18:43,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-25 19:18:43,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 12: [2022-11-25 19:18:43,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:18:43,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 19:18:43,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 13: [2022-11-25 19:18:43,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:18:43,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 19:18:43,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 26: [2022-11-25 19:18:43,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-25 19:18:43,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-25 19:18:43,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 4: [2022-11-25 19:18:43,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
30: [2022-11-25 19:18:43,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:18:43,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 30: [2022-11-25 19:18:43,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 4: [2022-11-25 19:18:43,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 30: [2022-11-25 19:18:43,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 27: [2022-11-25 19:18:43,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-25 19:18:43,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-25 19:18:43,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 9: [2022-11-25 19:18:43,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:18:43,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 19:18:43,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 12: [2022-11-25 19:18:43,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-25 19:18:43,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 19:18:43,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 3: [2022-11-25 19:18:43,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:18:43,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:18:43,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 19:18:43,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 19:18:43,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 3: [2022-11-25 19:18:43,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 22: [2022-11-25 19:18:43,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-25 19:18:43,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-25 19:18:43,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 18: [2022-11-25 19:18:43,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
27: [2022-11-25 19:18:43,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 18: [2022-11-25 19:18:43,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 27: [2022-11-25 19:18:43,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 15: [2022-11-25 19:18:43,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 18: [2022-11-25 19:18:43,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 27: [2022-11-25 19:18:43,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 15: [2022-11-25 19:18:43,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 19:18:43,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 21: [2022-11-25 19:18:43,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-25 19:18:43,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-25 19:18:43,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 15: [2022-11-25 19:18:43,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-25 19:18:43,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 19:18:43,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 24: [2022-11-25 19:18:43,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-25 19:18:43,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-25 19:18:43,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 20: [2022-11-25 19:18:43,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:18:43,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 20: [2022-11-25 19:18:43,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 23: [2022-11-25 19:18:43,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 20: [2022-11-25 19:18:43,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 23: [2022-11-25 19:18:43,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 14: [2022-11-25 19:18:43,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
30: [2022-11-25 19:18:43,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 17: [2022-11-25 19:18:43,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:18:43,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:18:43,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 30: [2022-11-25 19:18:43,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 17: [2022-11-25 19:18:43,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 2: [2022-11-25 19:18:43,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 14: [2022-11-25 19:18:43,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 30: [2022-11-25 19:18:43,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 17: [2022-11-25 19:18:43,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 2: [2022-11-25 19:18:43,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 31: [2022-11-25 19:18:43,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
31: [2022-11-25 19:18:43,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-25 19:18:43,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 19: [2022-11-25 19:18:43,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-25 19:18:43,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-25 19:18:43,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 25: [2022-11-25 19:18:43,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-25 19:18:43,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-25 19:18:43,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 11: [2022-11-25 19:18:43,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:18:43,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 19:18:43,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 10: [2022-11-25 19:18:43,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-25 19:18:43,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 19:18:43,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 16: [2022-11-25 19:18:43,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-25 19:18:43,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-25 19:18:43,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 1: [2022-11-25 19:18:43,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 0: [2022-11-25 19:18:43,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:18:43,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:18:43,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 12: [2022-11-25 19:18:43,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:18:43,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
12: [2022-11-25 19:18:43,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 6: [2022-11-25 19:18:43,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 12: [2022-11-25 19:18:43,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 6: [2022-11-25 19:18:43,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 7: [2022-11-25 19:18:43,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:18:43,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 19:18:43,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 27: [2022-11-25 19:18:43,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-25 19:18:43,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-25 19:18:43,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 8: [2022-11-25 19:18:43,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:18:43,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
29: [2022-11-25 19:18:43,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 8: [2022-11-25 19:18:43,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 29: [2022-11-25 19:18:43,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-25 19:18:43,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 5: [2022-11-25 19:18:43,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 8: [2022-11-25 19:18:43,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 5: [2022-11-25 19:18:43,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 19:18:43,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 4: [2022-11-25 19:18:43,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:18:43,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 3: [2022-11-25 19:18:43,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 26: [2022-11-25 19:18:43,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
4: [2022-11-25 19:18:43,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 13: [2022-11-25 19:18:43,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 3: [2022-11-25 19:18:43,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 4: [2022-11-25 19:18:43,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 3: [2022-11-25 19:18:43,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 26: [2022-11-25 19:18:43,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-25 19:18:43,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 0: [2022-11-25 19:18:43,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 15: [2022-11-25 19:18:43,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 0: [2022-11-25 19:18:43,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 15: [2022-11-25 19:18:43,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 19:18:43,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
28: [2022-11-25 19:18:43,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 18: [2022-11-25 19:18:43,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 28: [2022-11-25 19:18:43,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-25 19:18:43,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 18: [2022-11-25 19:18:43,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-25 19:18:43,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 28: [2022-11-25 19:18:43,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 22: [2022-11-25 19:18:43,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 28: [2022-11-25 19:18:43,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 22: [2022-11-25 19:18:43,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 28: [2022-11-25 19:18:43,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 22: [2022-11-25 19:18:43,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
9: [2022-11-25 19:18:43,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:18:43,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 19:18:43,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 24: [2022-11-25 19:18:43,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-25 19:18:43,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-25 19:18:43,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 17: [2022-11-25 19:18:43,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-25 19:18:43,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-25 19:18:43,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 23: [2022-11-25 19:18:43,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:18:43,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-25 19:18:43,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
14: [2022-11-25 19:18:43,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:18:43,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 19:18:43,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 11: [2022-11-25 19:18:43,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 31: [2022-11-25 19:18:43,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:18:43,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:18:43,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 31: [2022-11-25 19:18:43,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 11: [2022-11-25 19:18:43,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 31: [2022-11-25 19:18:43,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 2: [2022-11-25 19:18:43,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 19:18:43,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
25: [2022-11-25 19:18:43,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-25 19:18:43,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-25 19:18:43,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 7: [2022-11-25 19:18:43,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:18:43,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 16: [2022-11-25 19:18:43,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 28: [2022-11-25 19:18:43,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 19: [2022-11-25 19:18:43,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
7: [2022-11-25 19:18:43,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 10: [2022-11-25 19:18:43,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 16: [2022-11-25 19:18:43,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 28: [2022-11-25 19:18:43,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 19: [2022-11-25 19:18:43,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 7: [2022-11-25 19:18:43,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 10: [2022-11-25 19:18:43,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 16: [2022-11-25 19:18:43,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 28: [2022-11-25 19:18:43,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 19: [2022-11-25 19:18:43,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 30: [2022-11-25 19:18:43,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-25 19:18:43,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 1: [2022-11-25 19:18:43,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 30: [2022-11-25 19:18:43,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 1: [2022-11-25 19:18:43,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 19:18:43,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 0: [2022-11-25 19:18:43,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:18:43,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:18:43,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 29: [2022-11-25 19:18:43,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 0: [2022-11-25 19:18:43,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 6: [2022-11-25 19:18:43,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 0: [2022-11-25 19:18:43,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
6: [2022-11-25 19:18:43,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 12: [2022-11-25 19:18:43,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 29: [2022-11-25 19:18:43,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 12: [2022-11-25 19:18:43,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 20: [2022-11-25 19:18:43,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 29: [2022-11-25 19:18:43,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 8: [2022-11-25 19:18:43,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 20: [2022-11-25 19:18:43,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-25 19:18:43,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 8: [2022-11-25 19:18:43,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 19:18:43,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 5: [2022-11-25 19:18:43,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-25 19:18:43,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 19:18:43,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 26: [2022-11-25 19:18:43,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-25 19:18:43,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-25 19:18:43,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 3: [2022-11-25 19:18:43,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 28: [2022-11-25 19:18:43,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 22: [2022-11-25 19:18:43,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
3: [2022-11-25 19:18:43,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 28: [2022-11-25 19:18:43,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 22: [2022-11-25 19:18:43,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 3: [2022-11-25 19:18:43,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 28: [2022-11-25 19:18:43,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 22: [2022-11-25 19:18:43,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 13: [2022-11-25 19:18:43,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:18:43,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 19:18:43,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 15: [2022-11-25 19:18:43,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-25 19:18:43,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 19:18:43,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
9: [2022-11-25 19:18:43,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:18:43,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 19:18:43,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 18: [2022-11-25 19:18:43,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-25 19:18:43,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-25 19:18:43,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 23: [2022-11-25 19:18:43,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 24: [2022-11-25 19:18:43,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:18:43,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:18:43,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 24: [2022-11-25 19:18:43,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 23: [2022-11-25 19:18:43,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
24: [2022-11-25 19:18:43,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 14: [2022-11-25 19:18:43,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 19:18:43,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 27: [2022-11-25 19:18:43,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-25 19:18:43,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-25 19:18:43,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 17: [2022-11-25 19:18:43,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 21: [2022-11-25 19:18:43,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 27: [2022-11-25 19:18:43,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
21: [2022-11-25 19:18:43,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 27: [2022-11-25 19:18:43,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 17: [2022-11-25 19:18:43,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 21: [2022-11-25 19:18:43,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 27: [2022-11-25 19:18:43,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 17: [2022-11-25 19:18:43,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 11: [2022-11-25 19:18:43,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:18:43,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 19:18:43,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 31: [2022-11-25 19:18:43,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-25 19:18:43,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-25 19:18:43,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
19: [2022-11-25 19:18:43,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-25 19:18:43,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-25 19:18:43,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 20: [2022-11-25 19:18:43,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-25 19:18:43,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-25 19:18:43,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 2: [2022-11-25 19:18:43,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:18:43,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 19:18:43,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 16: [2022-11-25 19:18:43,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-25 19:18:43,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-25 19:18:43,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
12: [2022-11-25 19:18:43,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 27: [2022-11-25 19:18:43,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-25 19:18:43,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 12: [2022-11-25 19:18:43,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 27: [2022-11-25 19:18:43,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 12: [2022-11-25 19:18:43,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 1: [2022-11-25 19:18:43,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:18:43,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:18:43,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 19:18:43,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 10: [2022-11-25 19:18:43,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 19:18:43,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
0: [2022-11-25 19:18:43,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:18:43,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:18:43,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 0: [2022-11-25 19:18:43,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 23: [2022-11-25 19:18:43,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 0: [2022-11-25 19:18:43,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 6: [2022-11-25 19:18:43,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:18:43,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:18:43,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 20: [2022-11-25 19:18:43,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 25: [2022-11-25 19:18:43,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
28: [2022-11-25 19:18:43,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:18:43,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 31: [2022-11-25 19:18:43,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 22: [2022-11-25 19:18:43,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 17: [2022-11-25 19:18:43,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 18: [2022-11-25 19:18:43,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:18:43,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 4: [2022-11-25 19:18:43,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 9: [2022-11-25 19:18:43,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
13: [2022-11-25 19:18:43,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 20: [2022-11-25 19:18:43,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 25: [2022-11-25 19:18:43,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 28: [2022-11-25 19:18:43,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 14: [2022-11-25 19:18:43,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 22: [2022-11-25 19:18:43,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 17: [2022-11-25 19:18:43,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 6: [2022-11-25 19:18:43,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 4: [2022-11-25 19:18:43,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 13: [2022-11-25 19:18:43,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 20: [2022-11-25 19:18:43,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 25: [2022-11-25 19:18:43,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
28: [2022-11-25 19:18:43,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 14: [2022-11-25 19:18:43,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 31: [2022-11-25 19:18:43,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 22: [2022-11-25 19:18:43,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 17: [2022-11-25 19:18:43,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 18: [2022-11-25 19:18:43,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 9: [2022-11-25 19:18:43,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 31: [2022-11-25 19:18:43,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 18: [2022-11-25 19:18:43,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 9: [2022-11-25 19:18:43,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 2: [2022-11-25 19:18:43,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:18:43,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 19:18:43,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
16: [2022-11-25 19:18:43,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-25 19:18:43,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-25 19:18:43,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 29: [2022-11-25 19:18:43,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-25 19:18:43,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-25 19:18:43,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 19: [2022-11-25 19:18:43,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-25 19:18:43,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-25 19:18:43,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 21: [2022-11-25 19:18:43,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-25 19:18:43,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-25 19:18:43,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
7: [2022-11-25 19:18:43,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 15: [2022-11-25 19:18:43,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:18:43,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 15: [2022-11-25 19:18:43,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 7: [2022-11-25 19:18:43,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 15: [2022-11-25 19:18:43,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 10: [2022-11-25 19:18:43,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:18:43,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:18:43,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 19:18:43,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 5: [2022-11-25 19:18:43,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 19:18:43,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
8: [2022-11-25 19:18:43,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:18:43,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 24: [2022-11-25 19:18:43,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 26: [2022-11-25 19:18:43,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 24: [2022-11-25 19:18:43,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 8: [2022-11-25 19:18:43,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 12: [2022-11-25 19:18:43,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 24: [2022-11-25 19:18:43,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 8: [2022-11-25 19:18:43,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 12: [2022-11-25 19:18:43,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 26: [2022-11-25 19:18:43,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-25 19:18:43,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
30: [2022-11-25 19:18:43,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-25 19:18:43,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-25 19:18:43,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 4: [2022-11-25 19:18:43,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 27: [2022-11-25 19:18:43,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:18:43,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 19:18:43,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 27: [2022-11-25 19:18:43,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-25 19:18:43,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 7: [2022-11-25 19:18:43,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:18:43,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 19:18:43,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
21: [2022-11-25 19:18:43,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-25 19:18:43,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-25 19:18:43,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 25: [2022-11-25 19:18:43,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-25 19:18:43,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-25 19:18:43,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 7: [2022-11-25 19:18:43,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:18:43,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 19:18:43,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 11: [2022-11-25 19:18:43,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:18:43,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 19:18:43,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
30: [2022-11-25 19:18:43,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-25 19:18:43,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-25 19:18:43,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 3: [2022-11-25 19:18:43,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:18:43,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 19:18:43,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 27: [2022-11-25 19:18:43,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-25 19:18:43,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step8000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-25 19:18:43,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
0: successfully saved checkpoint at iteration 8000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2510.75 31: iteration 8010/ 173500 | consumed samples: 2050560 | consumed tokens: 4199546880 | elapsed time per iteration (s): 1.10 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.438595E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.780 | TFLOPs: 14.08 | 31: iteration 8020/ 173500 | consumed samples: 2053120 | consumed tokens: 4204789760 | elapsed time per iteration (s): 0.82 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.405981E+00 | grad norm: 0.232 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.864 | TFLOPs: 18.87 | 31: iteration 8030/ 173500 | consumed samples: 2055680 | consumed tokens: 4210032640 | elapsed time per iteration (s): 0.81 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.392444E+00 | grad norm: 0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.031 | TFLOPs: 19.06 | 31: iteration 8040/ 173500 | consumed samples: 2058240 | consumed tokens: 4215275520 | elapsed time per iteration (s): 0.80 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.397299E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.697 | TFLOPs: 19.28 | 31: iteration 8050/ 173500 | consumed samples: 2060800 | consumed tokens: 4220518400 | elapsed time per iteration (s): 0.81 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.382822E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.296 | TFLOPs: 19.20 | 31: iteration 8060/ 173500 | consumed samples: 2063360 | consumed tokens: 4225761280 | elapsed time per iteration (s): 0.77 | learning rate: 
1.994E-04 | global batch size: 256 | lm loss: 2.426650E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.538 | TFLOPs: 20.00 | 31: iteration 8070/ 173500 | consumed samples: 2065920 | consumed tokens: 4231004160 | elapsed time per iteration (s): 0.82 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.383863E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.587 | TFLOPs: 18.91 | 31: iteration 8080/ 173500 | consumed samples: 2068480 | consumed tokens: 4236247040 | elapsed time per iteration (s): 0.79 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.382955E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.734 | TFLOPs: 19.52 | 31: iteration 8090/ 173500 | consumed samples: 2071040 | consumed tokens: 4241489920 | elapsed time per iteration (s): 0.84 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.375524E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.953 | TFLOPs: 18.45 | 31: iteration 8100/ 173500 | consumed samples: 2073600 | consumed tokens: 4246732800 | elapsed time per iteration (s): 0.81 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.395570E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.504 | TFLOPs: 19.15 | 31: iteration 8110/ 173500 | consumed samples: 2076160 | consumed tokens: 4251975680 | elapsed time per iteration (s): 0.80 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.380583E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.447 | TFLOPs: 19.39 | 31: iteration 8120/ 173500 | consumed samples: 
2078720 | consumed tokens: 4257218560 | elapsed time per iteration (s): 0.80 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.386650E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.060 | TFLOPs: 19.42 | 31: iteration 8130/ 173500 | consumed samples: 2081280 | consumed tokens: 4262461440 | elapsed time per iteration (s): 0.84 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.384926E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.037 | TFLOPs: 18.45 | 31: iteration 8140/ 173500 | consumed samples: 2083840 | consumed tokens: 4267704320 | elapsed time per iteration (s): 0.79 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.396789E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.213 | TFLOPs: 19.55 | 31: iteration 8150/ 173500 | consumed samples: 2086400 | consumed tokens: 4272947200 | elapsed time per iteration (s): 0.81 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.380708E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.673 | TFLOPs: 19.04 | 31: iteration 8160/ 173500 | consumed samples: 2088960 | consumed tokens: 4278190080 | elapsed time per iteration (s): 0.77 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.400235E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.530 | TFLOPs: 20.06 | 31: iteration 8170/ 173500 | consumed samples: 2091520 | consumed tokens: 4283432960 | elapsed time per iteration (s): 0.83 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.382305E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 309.567 | TFLOPs: 18.73 | 31: iteration 8180/ 173500 | consumed samples: 2094080 | consumed tokens: 4288675840 | elapsed time per iteration (s): 0.85 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.375696E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.786 | TFLOPs: 18.26 | 31: iteration 8190/ 173500 | consumed samples: 2096640 | consumed tokens: 4293918720 | elapsed time per iteration (s): 0.79 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.408108E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.979 | TFLOPs: 19.66 | 31: iteration 8200/ 173500 | consumed samples: 2099200 | consumed tokens: 4299161600 | elapsed time per iteration (s): 0.81 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.442578E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.626 | TFLOPs: 19.16 | 31: iteration 8210/ 173500 | consumed samples: 2101760 | consumed tokens: 4304404480 | elapsed time per iteration (s): 0.81 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.399909E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.728 | TFLOPs: 19.04 | 31: iteration 8220/ 173500 | consumed samples: 2104320 | consumed tokens: 4309647360 | elapsed time per iteration (s): 0.83 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.388078E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.570 | TFLOPs: 18.67 | 31: iteration 8230/ 173500 | consumed samples: 2106880 | consumed tokens: 4314890240 | elapsed time per iteration (s): 0.80 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.357067E+00 | grad norm: 
0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.683 | TFLOPs: 19.34 | 31: iteration 8240/ 173500 | consumed samples: 2109440 | consumed tokens: 4320133120 | elapsed time per iteration (s): 0.88 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.384197E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.029 | TFLOPs: 17.61 | 31: iteration 8250/ 173500 | consumed samples: 2112000 | consumed tokens: 4325376000 | elapsed time per iteration (s): 0.83 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.411732E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.628 | TFLOPs: 18.67 | 31: iteration 8260/ 173500 | consumed samples: 2114560 | consumed tokens: 4330618880 | elapsed time per iteration (s): 0.86 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.391864E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.819 | TFLOPs: 17.96 | 31: iteration 8270/ 173500 | consumed samples: 2117120 | consumed tokens: 4335861760 | elapsed time per iteration (s): 0.87 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.386517E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.203 | TFLOPs: 17.74 | 31: iteration 8280/ 173500 | consumed samples: 2119680 | consumed tokens: 4341104640 | elapsed time per iteration (s): 0.76 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.380489E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.602 | TFLOPs: 20.42 | 31: iteration 8290/ 173500 | consumed samples: 2122240 | consumed tokens: 4346347520 | elapsed time per iteration (s): 0.78 
| learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.426883E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.149 | TFLOPs: 19.85 | 31: iteration 8300/ 173500 | consumed samples: 2124800 | consumed tokens: 4351590400 | elapsed time per iteration (s): 0.74 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.399801E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.164 | TFLOPs: 21.00 | 31: iteration 8310/ 173500 | consumed samples: 2127360 | consumed tokens: 4356833280 | elapsed time per iteration (s): 0.79 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.378477E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.314 | TFLOPs: 19.68 | 31: iteration 8320/ 173500 | consumed samples: 2129920 | consumed tokens: 4362076160 | elapsed time per iteration (s): 0.77 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.420264E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.593 | TFLOPs: 20.18 | 31: iteration 8330/ 173500 | consumed samples: 2132480 | consumed tokens: 4367319040 | elapsed time per iteration (s): 0.76 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.413562E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.693 | TFLOPs: 20.25 | 31: iteration 8340/ 173500 | consumed samples: 2135040 | consumed tokens: 4372561920 | elapsed time per iteration (s): 0.75 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.413512E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.509 | TFLOPs: 20.78 | 31: iteration 8350/ 173500 | 
consumed samples: 2137600 | consumed tokens: 4377804800 | elapsed time per iteration (s): 0.75 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.391899E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.674 | TFLOPs: 20.55 | 31: iteration 8360/ 173500 | consumed samples: 2140160 | consumed tokens: 4383047680 | elapsed time per iteration (s): 0.77 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.395695E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.932 | TFLOPs: 20.02 | 31: iteration 8370/ 173500 | consumed samples: 2142720 | consumed tokens: 4388290560 | elapsed time per iteration (s): 0.75 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.354023E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.091 | TFLOPs: 20.76 | 31: iteration 8380/ 173500 | consumed samples: 2145280 | consumed tokens: 4393533440 | elapsed time per iteration (s): 0.78 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.381279E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.183 | TFLOPs: 19.79 | 31: iteration 8390/ 173500 | consumed samples: 2147840 | consumed tokens: 4398776320 | elapsed time per iteration (s): 0.75 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.376909E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.121 | TFLOPs: 20.58 | 31: iteration 8400/ 173500 | consumed samples: 2150400 | consumed tokens: 4404019200 | elapsed time per iteration (s): 0.78 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.359411E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 329.513 | TFLOPs: 19.93 | 31: iteration 8410/ 173500 | consumed samples: 2152960 | consumed tokens: 4409262080 | elapsed time per iteration (s): 0.81 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.343180E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.300 | TFLOPs: 19.14 | 31: iteration 8420/ 173500 | consumed samples: 2155520 | consumed tokens: 4414504960 | elapsed time per iteration (s): 0.79 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.364829E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.698 | TFLOPs: 19.64 | 31: iteration 8430/ 173500 | consumed samples: 2158080 | consumed tokens: 4419747840 | elapsed time per iteration (s): 0.80 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.396123E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.400 | TFLOPs: 19.38 | 31: iteration 8440/ 173500 | consumed samples: 2160640 | consumed tokens: 4424990720 | elapsed time per iteration (s): 0.81 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.374467E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.386 | TFLOPs: 19.20 | 31: iteration 8450/ 173500 | consumed samples: 2163200 | consumed tokens: 4430233600 | elapsed time per iteration (s): 0.82 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.384725E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.407 | TFLOPs: 18.90 | 31: iteration 8460/ 173500 | consumed samples: 2165760 | consumed tokens: 4435476480 | elapsed time per iteration (s): 0.81 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 
2.405667E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.888 | TFLOPs: 19.23 | 31: iteration 8470/ 173500 | consumed samples: 2168320 | consumed tokens: 4440719360 | elapsed time per iteration (s): 0.77 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.367077E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.347 | TFLOPs: 19.99 | 31: iteration 8480/ 173500 | consumed samples: 2170880 | consumed tokens: 4445962240 | elapsed time per iteration (s): 0.77 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.364568E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.523 | TFLOPs: 20.06 | 31: iteration 8490/ 173500 | consumed samples: 2173440 | consumed tokens: 4451205120 | elapsed time per iteration (s): 0.75 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.391605E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.958 | TFLOPs: 20.69 | 31: iteration 8500/ 173500 | consumed samples: 2176000 | consumed tokens: 4456448000 | elapsed time per iteration (s): 0.83 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.394530E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.203 | TFLOPs: 18.59 | 31: iteration 8510/ 173500 | consumed samples: 2178560 | consumed tokens: 4461690880 | elapsed time per iteration (s): 0.76 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.332994E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.352 | TFLOPs: 20.35 | 31: iteration 8520/ 173500 | consumed samples: 2181120 | consumed tokens: 4466933760 | elapsed 
time per iteration (s): 0.74 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.357063E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.847 | TFLOPs: 20.92 | 31: iteration 8530/ 173500 | consumed samples: 2183680 | consumed tokens: 4472176640 | elapsed time per iteration (s): 0.81 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.361059E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.867 | TFLOPs: 19.17 | 31: iteration 8540/ 173500 | consumed samples: 2186240 | consumed tokens: 4477419520 | elapsed time per iteration (s): 0.84 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.371526E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.488 | TFLOPs: 18.48 | 31: iteration 8550/ 173500 | consumed samples: 2188800 | consumed tokens: 4482662400 | elapsed time per iteration (s): 0.83 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.369472E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.865 | TFLOPs: 18.75 | 31: iteration 8560/ 173500 | consumed samples: 2191360 | consumed tokens: 4487905280 | elapsed time per iteration (s): 0.85 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.377821E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.828 | TFLOPs: 18.20 | 31: iteration 8570/ 173500 | consumed samples: 2193920 | consumed tokens: 4493148160 | elapsed time per iteration (s): 0.82 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.373162E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.383 | TFLOPs: 18.84 | 31: 
iteration 8580/ 173500 | consumed samples: 2196480 | consumed tokens: 4498391040 | elapsed time per iteration (s): 0.81 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.375928E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.475 | TFLOPs: 19.15 | 31: iteration 8590/ 173500 | consumed samples: 2199040 | consumed tokens: 4503633920 | elapsed time per iteration (s): 0.80 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.402073E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.400 | TFLOPs: 19.32 | 31: iteration 8600/ 173500 | consumed samples: 2201600 | consumed tokens: 4508876800 | elapsed time per iteration (s): 0.73 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.389631E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.012 | TFLOPs: 21.17 | 31: iteration 8610/ 173500 | consumed samples: 2204160 | consumed tokens: 4514119680 | elapsed time per iteration (s): 0.76 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.365128E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.773 | TFLOPs: 20.31 | 31: iteration 8620/ 173500 | consumed samples: 2206720 | consumed tokens: 4519362560 | elapsed time per iteration (s): 0.76 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.401796E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.593 | TFLOPs: 20.48 | 31: iteration 8630/ 173500 | consumed samples: 2209280 | consumed tokens: 4524605440 | elapsed time per iteration (s): 0.80 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.358043E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 320.213 | TFLOPs: 19.37 | 31: iteration 8640/ 173500 | consumed samples: 2211840 | consumed tokens: 4529848320 | elapsed time per iteration (s): 0.78 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.350055E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.025 | TFLOPs: 19.91 | 31: iteration 8650/ 173500 | consumed samples: 2214400 | consumed tokens: 4535091200 | elapsed time per iteration (s): 0.77 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.342681E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.418 | TFLOPs: 19.99 | 31: iteration 8660/ 173500 | consumed samples: 2216960 | consumed tokens: 4540334080 | elapsed time per iteration (s): 0.76 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.376179E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.373 | TFLOPs: 20.35 | 31: iteration 8670/ 173500 | consumed samples: 2219520 | consumed tokens: 4545576960 | elapsed time per iteration (s): 0.77 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.366698E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.224 | TFLOPs: 20.10 | 31: iteration 8680/ 173500 | consumed samples: 2222080 | consumed tokens: 4550819840 | elapsed time per iteration (s): 0.73 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.366022E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.633 | TFLOPs: 21.09 | 31: iteration 8690/ 173500 | consumed samples: 2224640 | consumed tokens: 4556062720 | elapsed time per iteration (s): 0.74 | learning rate: 1.993E-04 | global batch 
size: 256 | lm loss: 2.354248E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.258 | TFLOPs: 20.95 | 31: iteration 8700/ 173500 | consumed samples: 2227200 | consumed tokens: 4561305600 | elapsed time per iteration (s): 0.72 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.364103E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.468 | TFLOPs: 21.63 | 31: iteration 8710/ 173500 | consumed samples: 2229760 | consumed tokens: 4566548480 | elapsed time per iteration (s): 0.73 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.348243E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.022 | TFLOPs: 21.36 | 31: iteration 8720/ 173500 | consumed samples: 2232320 | consumed tokens: 4571791360 | elapsed time per iteration (s): 0.79 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.349339E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.729 | TFLOPs: 19.65 | 31: iteration 8730/ 173500 | consumed samples: 2234880 | consumed tokens: 4577034240 | elapsed time per iteration (s): 0.76 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.391106E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.700 | TFLOPs: 20.49 | 31: iteration 8740/ 173500 | consumed samples: 2237440 | consumed tokens: 4582277120 | elapsed time per iteration (s): 0.77 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.337396E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.481 | TFLOPs: 20.05 | 31: iteration 8750/ 173500 | consumed samples: 2240000 | consumed tokens: 
4587520000 | elapsed time per iteration (s): 0.75 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.379506E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.013 | TFLOPs: 20.69 | 31: iteration 8760/ 173500 | consumed samples: 2242560 | consumed tokens: 4592762880 | elapsed time per iteration (s): 0.80 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.367515E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.661 | TFLOPs: 19.34 | 31: iteration 8770/ 173500 | consumed samples: 2245120 | consumed tokens: 4598005760 | elapsed time per iteration (s): 0.74 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.339872E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.402 | TFLOPs: 20.84 | 31: iteration 8780/ 173500 | consumed samples: 2247680 | consumed tokens: 4603248640 | elapsed time per iteration (s): 0.77 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.357655E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.451 | TFLOPs: 20.23 | 31: iteration 8790/ 173500 | consumed samples: 2250240 | consumed tokens: 4608491520 | elapsed time per iteration (s): 0.78 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.348664E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.921 | TFLOPs: 19.78 | 31: iteration 8800/ 173500 | consumed samples: 2252800 | consumed tokens: 4613734400 | elapsed time per iteration (s): 0.77 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.371961E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.545 | 
TFLOPs: 20.18 | 31: iteration 8810/ 173500 | consumed samples: 2255360 | consumed tokens: 4618977280 | elapsed time per iteration (s): 0.75 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.352111E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.772 | TFLOPs: 20.74 | 31: iteration 8820/ 173500 | consumed samples: 2257920 | consumed tokens: 4624220160 | elapsed time per iteration (s): 0.77 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.370317E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.739 | TFLOPs: 20.13 | 31: iteration 8830/ 173500 | consumed samples: 2260480 | consumed tokens: 4629463040 | elapsed time per iteration (s): 0.72 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.371249E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.661 | TFLOPs: 21.58 | 31: iteration 8840/ 173500 | consumed samples: 2263040 | consumed tokens: 4634705920 | elapsed time per iteration (s): 0.75 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.375806E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.519 | TFLOPs: 20.54 | 31: iteration 8850/ 173500 | consumed samples: 2265600 | consumed tokens: 4639948800 | elapsed time per iteration (s): 0.76 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.358457E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.452 | TFLOPs: 20.41 | 31: iteration 8860/ 173500 | consumed samples: 2268160 | consumed tokens: 4645191680 | elapsed time per iteration (s): 0.77 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.368457E+00 | grad norm: 0.168 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.476 | TFLOPs: 20.05 | 31: iteration 8870/ 173500 | consumed samples: 2270720 | consumed tokens: 4650434560 | elapsed time per iteration (s): 0.80 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.346749E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.336 | TFLOPs: 19.32 | 31: iteration 8880/ 173500 | consumed samples: 2273280 | consumed tokens: 4655677440 | elapsed time per iteration (s): 0.75 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.360045E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.097 | TFLOPs: 20.57 | 31: iteration 8890/ 173500 | consumed samples: 2275840 | consumed tokens: 4660920320 | elapsed time per iteration (s): 0.76 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.337283E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.823 | TFLOPs: 20.26 | 31: iteration 8900/ 173500 | consumed samples: 2278400 | consumed tokens: 4666163200 | elapsed time per iteration (s): 0.77 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.369154E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.066 | TFLOPs: 20.21 | 31: iteration 8910/ 173500 | consumed samples: 2280960 | consumed tokens: 4671406080 | elapsed time per iteration (s): 0.75 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.362233E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.413 | TFLOPs: 20.59 | 31: iteration 8920/ 173500 | consumed samples: 2283520 | consumed tokens: 4676648960 | elapsed time per iteration (s): 0.77 | learning rate: 
1.992E-04 | global batch size: 256 | lm loss: 2.342583E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.144 | TFLOPs: 20.21 | 31: iteration 8930/ 173500 | consumed samples: 2286080 | consumed tokens: 4681891840 | elapsed time per iteration (s): 0.76 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.351026E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.928 | TFLOPs: 20.38 | 31: iteration 8940/ 173500 | consumed samples: 2288640 | consumed tokens: 4687134720 | elapsed time per iteration (s): 0.78 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.349572E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.882 | TFLOPs: 19.96 | 31: iteration 8950/ 173500 | consumed samples: 2291200 | consumed tokens: 4692377600 | elapsed time per iteration (s): 0.77 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.359422E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.898 | TFLOPs: 20.08 | 31: iteration 8960/ 173500 | consumed samples: 2293760 | consumed tokens: 4697620480 | elapsed time per iteration (s): 0.78 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.350521E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.780 | TFLOPs: 19.95 | 31: iteration 8970/ 173500 | consumed samples: 2296320 | consumed tokens: 4702863360 | elapsed time per iteration (s): 0.81 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.357535E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.940 | TFLOPs: 19.17 | 31: iteration 8980/ 173500 | consumed samples: 
2298880 | consumed tokens: 4708106240 | elapsed time per iteration (s): 0.79 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.338898E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.224 | TFLOPs: 19.68 | 31: iteration 8990/ 173500 | consumed samples: 2301440 | consumed tokens: 4713349120 | elapsed time per iteration (s): 0.78 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.345887E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.811 | TFLOPs: 19.83 | 31: iteration 9000/ 173500 | consumed samples: 2304000 | consumed tokens: 4718592000 | elapsed time per iteration (s): 0.74 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.352961E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.386 | TFLOPs: 20.83 | 31: ------------------------------------------------------------------------------------------ 31: valid loss at iteration 9000 | lm loss value: 2.253114E+00 | lm loss PPL: 9.517329E+00 | 31: ------------------------------------------------------------------------------------------ 0: saving checkpoint at iteration 9000 to checkpoints_1b1long 0: [2022-11-25 19:31:49,233] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step9000 is begin to save! 0: [2022-11-25 19:31:49,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_01-model_00-model_states.pt... 0: [2022-11-25 19:31:49,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_01-model_00-model_states.pt. 0: [2022-11-25 19:31:49,438] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_03-model_00-model_states.pt... 
0: [2022-11-25 19:31:49,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_03-model_00-model_states.pt. 0: [2022-11-25 19:31:49,516] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_04-model_00-model_states.pt... 0: [2022-11-25 19:31:49,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_04-model_00-model_states.pt. 0: [2022-11-25 19:31:49,590] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_05-model_00-model_states.pt... 0: [2022-11-25 19:31:49,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_05-model_00-model_states.pt. 0: [2022-11-25 19:31:49,664] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_06-model_00-model_states.pt... 0: [2022-11-25 19:31:49,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_06-model_00-model_states.pt. 0: [2022-11-25 19:31:49,735] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_07-model_00-model_states.pt... 0: [2022-11-25 19:31:49,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_07-model_00-model_states.pt. 0: [2022-11-25 19:31:49,809] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_08-model_00-model_states.pt... 0: [2022-11-25 19:31:49,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_08-model_00-model_states.pt. 0: [2022-11-25 19:31:49,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_09-model_00-model_states.pt... 
0: [2022-11-25 19:31:49,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_09-model_00-model_states.pt. 0: [2022-11-25 19:31:49,959] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_10-model_00-model_states.pt... 0: [2022-11-25 19:31:50,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_10-model_00-model_states.pt. 0: [2022-11-25 19:31:50,030] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_11-model_00-model_states.pt... 0: [2022-11-25 19:31:50,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_11-model_00-model_states.pt. 0: [2022-11-25 19:31:50,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_12-model_00-model_states.pt... 0: [2022-11-25 19:31:50,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_12-model_00-model_states.pt. 0: [2022-11-25 19:31:50,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_13-model_00-model_states.pt... 0: [2022-11-25 19:31:50,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_13-model_00-model_states.pt. 0: [2022-11-25 19:31:50,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_14-model_00-model_states.pt... 0: [2022-11-25 19:31:50,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_14-model_00-model_states.pt. 0: [2022-11-25 19:31:50,328] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_15-model_00-model_states.pt... 
0: [2022-11-25 19:31:50,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_15-model_00-model_states.pt. 0: [2022-11-25 19:31:50,403] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_16-model_00-model_states.pt... 0: [2022-11-25 19:31:50,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_16-model_00-model_states.pt. 0: [2022-11-25 19:31:50,477] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_17-model_00-model_states.pt... 0: [2022-11-25 19:31:50,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_17-model_00-model_states.pt. 0: [2022-11-25 19:31:50,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_18-model_00-model_states.pt... 0: [2022-11-25 19:31:50,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_18-model_00-model_states.pt. 0: [2022-11-25 19:31:50,625] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_19-model_00-model_states.pt... 0: [2022-11-25 19:31:50,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_19-model_00-model_states.pt. 0: [2022-11-25 19:31:50,701] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_20-model_00-model_states.pt... 0: [2022-11-25 19:31:50,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_20-model_00-model_states.pt. 0: [2022-11-25 19:31:50,771] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_21-model_00-model_states.pt... 
0: [2022-11-25 19:31:50,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_21-model_00-model_states.pt. 0: [2022-11-25 19:31:50,847] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_22-model_00-model_states.pt... 0: [2022-11-25 19:31:50,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_22-model_00-model_states.pt. 0: [2022-11-25 19:31:50,919] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_23-model_00-model_states.pt... 0: [2022-11-25 19:31:50,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_23-model_00-model_states.pt. 0: [2022-11-25 19:31:50,996] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_24-model_00-model_states.pt... 0: [2022-11-25 19:31:51,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_24-model_00-model_states.pt. 0: [2022-11-25 19:31:51,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_25-model_00-model_states.pt... 0: [2022-11-25 19:31:51,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_25-model_00-model_states.pt. 0: [2022-11-25 19:31:51,143] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_26-model_00-model_states.pt... 0: [2022-11-25 19:31:51,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_26-model_00-model_states.pt. 0: [2022-11-25 19:31:51,218] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_27-model_00-model_states.pt... 
0: [2022-11-25 19:31:51,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_27-model_00-model_states.pt. 0: [2022-11-25 19:31:51,293] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_28-model_00-model_states.pt... 0: [2022-11-25 19:31:51,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_28-model_00-model_states.pt. 0: [2022-11-25 19:31:51,368] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/layer_30-model_00-model_states.pt... 0: [2022-11-25 19:31:51,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/layer_30-model_00-model_states.pt. 0: [2022-11-25 19:31:51,370] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step9000/mp_rank_00_model_states.pt 0: [2022-11-25 19:31:51,370] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/mp_rank_00_model_states.pt... 0: [2022-11-25 19:31:51,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/mp_rank_00_model_states.pt. 0: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
5: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 
9: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 
2: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
15: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
11: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 
31: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 
21: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 
19: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
5: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
8: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 
16: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
15: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 
28: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 
29: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 
18: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
6: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 
2: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 
25: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 
29: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 
18: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 16: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 
2: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 25: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 28: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 
29: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 17: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 26: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 
25: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 28: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 29: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 21: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 25: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 29: [2022-11-25 19:31:51,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 19: [2022-11-25 19:31:51,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-25 19:31:51,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-25 19:31:51,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 13: [2022-11-25 19:31:51,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:31:51,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:31:51,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 5: [2022-11-25 19:31:51,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 13: [2022-11-25 19:31:51,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 5: [2022-11-25 19:31:51,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 12: [2022-11-25 19:31:51,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:31:51,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 19:31:51,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 11: [2022-11-25 19:31:51,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-25 19:31:51,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 19:31:51,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 18: [2022-11-25 19:31:51,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-25 19:31:51,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-25 19:31:51,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 23: [2022-11-25 19:31:51,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 27: [2022-11-25 19:31:51,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:31:51,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-25 19:31:51,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 27: [2022-11-25 19:31:51,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-25 19:31:51,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 0: [2022-11-25 19:31:51,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-25 19:31:51,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 20: [2022-11-25 19:31:51,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 0: [2022-11-25 19:31:51,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 20: [2022-11-25 19:31:51,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 15: [2022-11-25 19:31:51,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-25 19:31:51,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 20: [2022-11-25 19:31:51,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 15: [2022-11-25 19:31:51,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 4: [2022-11-25 19:31:51,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:31:51,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 19:31:51,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 30: [2022-11-25 19:31:51,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
30: [2022-11-25 19:31:51,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 31: [2022-11-25 19:31:51,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 30: [2022-11-25 19:31:51,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 31: [2022-11-25 19:31:51,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-25 19:31:51,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 22: [2022-11-25 19:31:51,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-25 19:31:51,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-25 19:31:51,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 7: [2022-11-25 19:31:51,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:31:51,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 19:31:51,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 1: [2022-11-25 19:31:51,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-25 19:31:51,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 19:31:51,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 9: [2022-11-25 19:31:51,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:31:51,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 16: [2022-11-25 19:31:51,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-25 19:31:51,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 9: [2022-11-25 19:31:51,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 16: [2022-11-25 19:31:51,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 24: [2022-11-25 19:31:51,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:31:51,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 24: [2022-11-25 19:31:51,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 29: [2022-11-25 19:31:51,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
24: [2022-11-25 19:31:51,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 29: [2022-11-25 19:31:51,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 24: [2022-11-25 19:31:51,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 29: [2022-11-25 19:31:51,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 1: [2022-11-25 19:31:51,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 24: [2022-11-25 19:31:51,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 1: [2022-11-25 19:31:51,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 24: [2022-11-25 19:31:51,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 17: [2022-11-25 19:31:51,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 8: [2022-11-25 19:31:51,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
17: [2022-11-25 19:31:51,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 8: [2022-11-25 19:31:51,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 19:31:51,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 17: [2022-11-25 19:31:51,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 14: [2022-11-25 19:31:51,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:31:51,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 19:31:51,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 3: [2022-11-25 19:31:51,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:31:51,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 19:31:51,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 10: [2022-11-25 19:31:51,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:31:51,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-25 19:31:51,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 19:31:51,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 19:31:51,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 10: [2022-11-25 19:31:51,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 6: [2022-11-25 19:31:51,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:31:51,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:31:51,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 23: [2022-11-25 19:31:51,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:31:51,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 19:31:51,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 23: [2022-11-25 19:31:51,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 6: [2022-11-25 19:31:51,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
23: [2022-11-25 19:31:51,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 26: [2022-11-25 19:31:51,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-25 19:31:51,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-25 19:31:51,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 20: [2022-11-25 19:31:51,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-25 19:31:51,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-25 19:31:51,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 9: [2022-11-25 19:31:51,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:31:51,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 14: [2022-11-25 19:31:51,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:31:51,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 28: [2022-11-25 19:31:51,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
14: [2022-11-25 19:31:51,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 12: [2022-11-25 19:31:51,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 15: [2022-11-25 19:31:51,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:31:51,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 15: [2022-11-25 19:31:51,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 19: [2022-11-25 19:31:51,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:31:51,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 15: [2022-11-25 19:31:51,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 12: [2022-11-25 19:31:51,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 19: [2022-11-25 19:31:51,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-25 19:31:51,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 2: [2022-11-25 19:31:51,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-25 19:31:51,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 19:31:51,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 31: [2022-11-25 19:31:51,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:31:51,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 31: [2022-11-25 19:31:51,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 22: [2022-11-25 19:31:51,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:31:51,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 31: [2022-11-25 19:31:51,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 22: [2022-11-25 19:31:51,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 2: [2022-11-25 19:31:51,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 22: [2022-11-25 19:31:51,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 16: [2022-11-25 19:31:51,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
16: [2022-11-25 19:31:51,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-25 19:31:51,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 0: [2022-11-25 19:31:51,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 19:31:51,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 19:31:51,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 23: [2022-11-25 19:31:51,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:31:51,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-25 19:31:51,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 21: [2022-11-25 19:31:51,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-25 19:31:51,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-25 19:31:51,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-25 19:31:51,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
21: [2022-11-25 19:31:51,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-25 19:31:51,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 26: [2022-11-25 19:31:51,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-25 19:31:51,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 8: [2022-11-25 19:31:51,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 26: [2022-11-25 19:31:51,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 8: [2022-11-25 19:31:51,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 19:31:51,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 10: [2022-11-25 19:31:51,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:31:51,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 25: [2022-11-25 19:31:51,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:31:51,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
25: [2022-11-25 19:31:51,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 17: [2022-11-25 19:31:51,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-25 19:31:51,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 25: [2022-11-25 19:31:51,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 17: [2022-11-25 19:31:51,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 17: [2022-11-25 19:31:51,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-25 19:31:51,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 1: [2022-11-25 19:31:51,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 17: [2022-11-25 19:31:51,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 1: [2022-11-25 19:31:51,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 0: [2022-11-25 19:31:51,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:31:51,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
0: [2022-11-25 19:31:51,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 19:31:51,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 18: [2022-11-25 19:31:51,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-25 19:31:51,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-25 19:31:51,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-25 19:31:51,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-25 19:31:51,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 18: [2022-11-25 19:31:51,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 13: [2022-11-25 19:31:51,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:31:51,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 19:31:51,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 15: [2022-11-25 19:31:51,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
13: [2022-11-25 19:31:51,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 7: [2022-11-25 19:31:51,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 15: [2022-11-25 19:31:51,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 13: [2022-11-25 19:31:51,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 15: [2022-11-25 19:31:51,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 7: [2022-11-25 19:31:51,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 13: [2022-11-25 19:31:51,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 7: [2022-11-25 19:31:51,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 28: [2022-11-25 19:31:51,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-25 19:31:51,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 28: [2022-11-25 19:31:51,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
28: [2022-11-25 19:31:51,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 22: [2022-11-25 19:31:51,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 28: [2022-11-25 19:31:51,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 22: [2022-11-25 19:31:51,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-25 19:31:51,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 2: [2022-11-25 19:31:51,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:31:51,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 19:31:51,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 3: [2022-11-25 19:31:51,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:31:51,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 19:31:51,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 29: [2022-11-25 19:31:51,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
29: [2022-11-25 19:31:51,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-25 19:31:51,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 29: [2022-11-25 19:31:51,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-25 19:31:51,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 9: [2022-11-25 19:31:51,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 29: [2022-11-25 19:31:51,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 9: [2022-11-25 19:31:51,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 19:31:51,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 20: [2022-11-25 19:31:51,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-25 19:31:51,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-25 19:31:51,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 8: [2022-11-25 19:31:51,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-25 19:31:51,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 19:31:51,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 31: [2022-11-25 19:31:51,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-25 19:31:51,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-25 19:31:51,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 6: [2022-11-25 19:31:51,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:31:51,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 19:31:51,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 12: [2022-11-25 19:31:51,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:31:51,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 19:31:51,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 30: [2022-11-25 19:31:51,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 
30: [2022-11-25 19:31:51,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-25 19:31:51,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 14: [2022-11-25 19:31:51,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:31:51,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 19:31:51,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 21: [2022-11-25 19:31:51,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-25 19:31:51,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 7: [2022-11-25 19:31:51,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 21: [2022-11-25 19:31:51,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 7: [2022-11-25 19:31:51,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 19:31:51,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 25: [2022-11-25 19:31:51,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
25: [2022-11-25 19:31:51,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-25 19:31:51,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-25 19:31:51,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-25 19:31:51,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 25: [2022-11-25 19:31:51,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 3: [2022-11-25 19:31:51,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:31:51,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 19:31:51,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 0: [2022-11-25 19:31:51,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 26: [2022-11-25 19:31:51,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-25 19:31:51,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-25 19:31:51,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
1: [2022-11-25 19:31:51,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:31:51,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 19:31:51,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 11: [2022-11-25 19:31:51,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:31:51,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:31:51,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 6: [2022-11-25 19:31:51,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 11: [2022-11-25 19:31:51,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 6: [2022-11-25 19:31:51,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 27: [2022-11-25 19:31:51,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-25 19:31:51,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-25 19:31:51,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
18: [2022-11-25 19:31:51,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-25 19:31:51,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-25 19:31:51,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 30: [2022-11-25 19:31:51,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-25 19:31:51,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-25 19:31:51,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 28: [2022-11-25 19:31:51,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 0: [2022-11-25 19:31:51,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 19:31:51,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 28: [2022-11-25 19:31:51,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-25 19:31:51,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 24: [2022-11-25 19:31:51,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 
24: [2022-11-25 19:31:51,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-25 19:31:51,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 11: [2022-11-25 19:31:51,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:31:51,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 19:31:51,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 16: [2022-11-25 19:31:51,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-25 19:31:51,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-25 19:31:51,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 15: [2022-11-25 19:31:51,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-25 19:31:51,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 19:31:51,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 4: [2022-11-25 19:31:51,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-25 19:31:51,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 19:31:51,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 27: [2022-11-25 19:31:51,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-25 19:31:51,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-25 19:31:51,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 19: [2022-11-25 19:31:51,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-25 19:31:51,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-25 19:31:51,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 12: [2022-11-25 19:31:51,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:31:51,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 19:31:51,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 16: [2022-11-25 19:31:51,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-25 19:31:51,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-25 19:31:51,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 5: [2022-11-25 19:31:51,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 24: [2022-11-25 19:31:51,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-25 19:31:51,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 5: [2022-11-25 19:31:51,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 19:31:51,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 24: [2022-11-25 19:31:51,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 14: [2022-11-25 19:31:51,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:31:51,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 19:31:51,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 4: [2022-11-25 19:31:51,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-25 19:31:51,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 19:31:51,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 22: [2022-11-25 19:31:51,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-25 19:31:51,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-25 19:31:51,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 13: [2022-11-25 19:31:51,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:31:51,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 19:31:51,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 21: [2022-11-25 19:31:51,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-25 19:31:51,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-25 19:31:51,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 19: [2022-11-25 19:31:51,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
19: [2022-11-25 19:31:51,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-25 19:31:51,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 23: [2022-11-25 19:31:51,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:31:51,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-25 19:31:51,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 11: [2022-11-25 19:31:51,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:31:51,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 19:31:51,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 27: [2022-11-25 19:31:51,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-25 19:31:51,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-25 19:31:51,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 20: [2022-11-25 19:31:51,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-25 19:31:51,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-25 19:31:51,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 17: [2022-11-25 19:31:51,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-25 19:31:51,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-25 19:31:51,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 29: [2022-11-25 19:31:51,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-25 19:31:51,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-25 19:31:51,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 31: [2022-11-25 19:31:51,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:31:51,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
31: [2022-11-25 19:31:51,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 7: [2022-11-25 19:31:51,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 31: [2022-11-25 19:31:51,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 7: [2022-11-25 19:31:51,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 5: [2022-11-25 19:31:51,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:31:51,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 19:31:51,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 9: [2022-11-25 19:31:51,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:31:51,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 19:31:51,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 28: [2022-11-25 19:31:51,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 
28: [2022-11-25 19:31:51,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-25 19:31:51,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 2: [2022-11-25 19:31:51,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:31:51,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 19:31:51,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 4: [2022-11-25 19:31:51,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:31:51,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 19:31:51,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 8: [2022-11-25 19:31:51,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 19:31:51,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 19:31:51,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 30: [2022-11-25 19:31:51,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-25 19:31:51,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-25 19:31:51,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 25: [2022-11-25 19:31:51,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-25 19:31:51,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-25 19:31:51,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 3: [2022-11-25 19:31:51,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:31:51,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 19:31:51,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 18: [2022-11-25 19:31:51,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-25 19:31:51,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-25 19:31:51,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 19: [2022-11-25 19:31:51,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
19: [2022-11-25 19:31:51,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-25 19:31:51,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 1: [2022-11-25 19:31:51,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:31:51,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 19:31:51,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 0: [2022-11-25 19:31:51,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 19:31:51,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 19:31:51,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 10: [2022-11-25 19:31:51,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:31:51,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 24: [2022-11-25 19:31:51,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:31:51,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
10: [2022-11-25 19:31:51,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:31:51,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 6: [2022-11-25 19:31:51,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 10: [2022-11-25 19:31:51,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 6: [2022-11-25 19:31:51,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 10: [2022-11-25 19:31:51,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 24: [2022-11-25 19:31:51,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-25 19:31:51,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 12: [2022-11-25 19:31:51,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:31:51,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 19:31:51,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 5: [2022-11-25 19:31:51,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-25 19:31:51,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 19:31:51,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 16: [2022-11-25 19:31:51,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-25 19:31:51,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-25 19:31:51,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 22: [2022-11-25 19:31:51,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-25 19:31:51,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-25 19:31:51,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 11: [2022-11-25 19:31:51,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 26: [2022-11-25 19:31:51,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
11: [2022-11-25 19:31:51,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 26: [2022-11-25 19:31:51,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 11: [2022-11-25 19:31:51,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 26: [2022-11-25 19:31:51,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 15: [2022-11-25 19:31:51,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 19:31:51,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 19:31:51,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 13: [2022-11-25 19:31:51,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:31:51,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 19:31:51,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 14: [2022-11-25 19:31:51,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-25 19:31:51,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 19:31:51,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 21: [2022-11-25 19:31:51,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-25 19:31:51,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-25 19:31:51,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 31: [2022-11-25 19:31:51,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-25 19:31:51,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-25 19:31:51,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 17: [2022-11-25 19:31:51,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:31:51,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
17: [2022-11-25 19:31:51,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 23: [2022-11-25 19:31:51,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-25 19:31:51,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 17: [2022-11-25 19:31:51,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 29: [2022-11-25 19:31:51,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-25 19:31:51,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-25 19:31:51,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 4: [2022-11-25 19:31:51,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 8: [2022-11-25 19:31:51,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:31:51,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 8: [2022-11-25 19:31:51,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 4: [2022-11-25 19:31:51,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
8: [2022-11-25 19:31:51,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 28: [2022-11-25 19:31:51,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-25 19:31:51,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-25 19:31:51,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 20: [2022-11-25 19:31:51,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-25 19:31:51,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-25 19:31:51,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 3: [2022-11-25 19:31:51,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:31:51,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 19:31:51,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 7: [2022-11-25 19:31:51,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-25 19:31:51,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 19:31:51,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 2: [2022-11-25 19:31:51,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:31:51,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 19:31:51,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 5: [2022-11-25 19:31:51,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:31:51,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 19:31:51,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 9: [2022-11-25 19:31:51,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:31:51,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 27: [2022-11-25 19:31:51,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:31:51,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
27: [2022-11-25 19:31:51,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-25 19:31:51,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 25: [2022-11-25 19:31:51,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-25 19:31:51,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-25 19:31:51,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 19: [2022-11-25 19:31:51,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-25 19:31:51,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-25 19:31:51,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 30: [2022-11-25 19:31:51,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-25 19:31:51,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-25 19:31:51,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 10: [2022-11-25 19:31:51,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-25 19:31:51,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 19:31:51,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 18: [2022-11-25 19:31:51,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-25 19:31:51,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-25 19:31:51,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 6: [2022-11-25 19:31:51,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:31:51,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 19:31:51,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 1: [2022-11-25 19:31:51,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:31:51,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 19:31:51,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 5: [2022-11-25 19:31:51,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-25 19:31:51,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 19:31:51,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 0: [2022-11-25 19:31:51,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 19:31:51,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 19:31:51,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 16: [2022-11-25 19:31:51,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-25 19:31:51,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-25 19:31:51,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 12: [2022-11-25 19:31:51,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:31:51,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 19:31:51,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 24: [2022-11-25 19:31:51,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-25 19:31:51,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-25 19:31:51,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 22: [2022-11-25 19:31:51,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-25 19:31:51,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-25 19:31:51,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 27: [2022-11-25 19:31:51,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 15: [2022-11-25 19:31:51,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 27: [2022-11-25 19:31:51,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 15: [2022-11-25 19:31:51,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 27: [2022-11-25 19:31:51,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 15: [2022-11-25 19:31:51,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 13: [2022-11-25 19:31:51,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-25 19:31:51,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 19:31:51,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 26: [2022-11-25 19:31:51,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-25 19:31:51,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-25 19:31:51,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 14: [2022-11-25 19:31:51,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:31:51,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 19:31:51,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 21: [2022-11-25 19:31:51,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-25 19:31:51,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-25 19:31:51,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 23: [2022-11-25 19:31:51,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
20: [2022-11-25 19:31:51,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:31:51,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-25 19:31:51,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 20: [2022-11-25 19:31:51,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 17: [2022-11-25 19:31:51,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 20: [2022-11-25 19:31:51,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 17: [2022-11-25 19:31:51,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-25 19:31:51,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 26: [2022-11-25 19:31:51,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-25 19:31:51,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-25 19:31:51,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 28: [2022-11-25 19:31:51,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
28: [2022-11-25 19:31:51,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-25 19:31:51,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 29: [2022-11-25 19:31:51,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-25 19:31:51,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-25 19:31:51,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 31: [2022-11-25 19:31:51,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-25 19:31:51,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-25 19:31:51,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 30: [2022-11-25 19:31:51,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-25 19:31:51,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-25 19:31:51,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 4: [2022-11-25 19:31:51,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-25 19:31:51,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 19:31:51,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 8: [2022-11-25 19:31:51,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-25 19:31:51,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 19:31:51,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 9: [2022-11-25 19:31:51,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:31:51,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 19:31:51,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 25: [2022-11-25 19:31:51,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-25 19:31:51,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-25 19:31:51,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 18: [2022-11-25 19:31:51,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
18: [2022-11-25 19:31:51,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-25 19:31:51,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 0: [2022-11-25 19:31:51,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 19:31:51,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 19:31:51,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 3: [2022-11-25 19:31:51,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:31:51,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 19:31:51,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 1: [2022-11-25 19:31:51,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:31:51,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 7: [2022-11-25 19:31:51,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-25 19:31:51,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 1: [2022-11-25 19:31:51,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 7: [2022-11-25 19:31:51,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 6: [2022-11-25 19:31:51,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:31:51,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 19:31:51,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 5: [2022-11-25 19:31:51,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:31:51,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 19:31:51,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 19: [2022-11-25 19:31:51,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-25 19:31:51,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-25 19:31:51,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
11: [2022-11-25 19:31:51,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:31:51,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 19:31:51,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 24: [2022-11-25 19:31:51,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-25 19:31:51,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-25 19:31:51,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 10: [2022-11-25 19:31:51,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:31:51,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 19:31:51,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 15: [2022-11-25 19:31:51,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 19:31:51,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 19:31:51,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
16: [2022-11-25 19:31:51,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 22: [2022-11-25 19:31:51,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 16: [2022-11-25 19:31:51,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 22: [2022-11-25 19:31:51,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 16: [2022-11-25 19:31:51,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 22: [2022-11-25 19:31:51,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 12: [2022-11-25 19:31:51,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:31:51,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 19:31:51,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 14: [2022-11-25 19:31:51,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 26: [2022-11-25 19:31:51,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
14: [2022-11-25 19:31:51,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 26: [2022-11-25 19:31:51,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 14: [2022-11-25 19:31:51,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 26: [2022-11-25 19:31:51,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 13: [2022-11-25 19:31:51,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:31:51,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 19:31:51,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 21: [2022-11-25 19:31:51,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-25 19:31:51,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-25 19:31:51,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 27: [2022-11-25 19:31:51,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
27: [2022-11-25 19:31:51,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-25 19:31:51,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 31: [2022-11-25 19:31:51,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-25 19:31:51,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-25 19:31:51,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 2: [2022-11-25 19:31:51,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:31:51,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 19:31:51,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 8: [2022-11-25 19:31:51,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 20: [2022-11-25 19:31:51,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 17: [2022-11-25 19:31:51,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 
8: [2022-11-25 19:31:51,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 19:31:51,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 20: [2022-11-25 19:31:51,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 28: [2022-11-25 19:31:51,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 17: [2022-11-25 19:31:51,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 20: [2022-11-25 19:31:51,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 28: [2022-11-25 19:31:51,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 17: [2022-11-25 19:31:51,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 28: [2022-11-25 19:31:51,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 23: [2022-11-25 19:31:51,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:31:51,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-25 19:31:51,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
8: [2022-11-25 19:31:51,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-25 19:31:51,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 19:31:51,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 29: [2022-11-25 19:31:51,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-25 19:31:51,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-25 19:31:51,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 30: [2022-11-25 19:31:51,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-25 19:31:51,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-25 19:31:51,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 4: [2022-11-25 19:31:51,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:31:51,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 19:31:51,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
27: [2022-11-25 19:31:51,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:31:51,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:31:51,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 19:31:51,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 25: [2022-11-25 19:31:51,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 27: [2022-11-25 19:31:51,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 25: [2022-11-25 19:31:51,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-25 19:31:51,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 27: [2022-11-25 19:31:51,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 7: [2022-11-25 19:31:51,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:31:51,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 19:31:51,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
18: [2022-11-25 19:31:51,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-25 19:31:51,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-25 19:31:51,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 30: [2022-11-25 19:31:51,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-25 19:31:51,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 15: [2022-11-25 19:31:51,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 30: [2022-11-25 19:31:51,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 15: [2022-11-25 19:31:51,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 19:31:51,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 31: [2022-11-25 19:31:51,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-25 19:31:51,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-25 19:31:51,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
28: [2022-11-25 19:31:51,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-25 19:31:51,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-25 19:31:51,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 21: [2022-11-25 19:31:51,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-25 19:31:51,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-25 19:31:51,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 29: [2022-11-25 19:31:51,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-25 19:31:51,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-25 19:31:51,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 17: [2022-11-25 19:31:51,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-25 19:31:51,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 11: [2022-11-25 19:31:51,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
17: [2022-11-25 19:31:51,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 11: [2022-11-25 19:31:51,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 19:31:51,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 20: [2022-11-25 19:31:51,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-25 19:31:51,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-25 19:31:51,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 24: [2022-11-25 19:31:51,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-25 19:31:51,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-25 19:31:51,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 10: [2022-11-25 19:31:51,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:31:51,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 19:31:51,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
22: [2022-11-25 19:31:51,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:31:51,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 22: [2022-11-25 19:31:51,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 25: [2022-11-25 19:31:51,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 23: [2022-11-25 19:31:51,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 22: [2022-11-25 19:31:51,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 25: [2022-11-25 19:31:51,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 23: [2022-11-25 19:31:51,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 25: [2022-11-25 19:31:51,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 7: [2022-11-25 19:31:51,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:31:51,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 19: [2022-11-25 19:31:51,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
7: [2022-11-25 19:31:51,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 1: [2022-11-25 19:31:51,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 16: [2022-11-25 19:31:51,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 19: [2022-11-25 19:31:51,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 16: [2022-11-25 19:31:51,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 19: [2022-11-25 19:31:51,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 1: [2022-11-25 19:31:51,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 16: [2022-11-25 19:31:51,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 1: [2022-11-25 19:31:51,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 0: [2022-11-25 19:31:51,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 19:31:51,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 19:31:51,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
9: [2022-11-25 19:31:51,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:31:51,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 6: [2022-11-25 19:31:51,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:31:51,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 6: [2022-11-25 19:31:51,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 19:31:51,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 13: [2022-11-25 19:31:51,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:31:51,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 19:31:51,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 12: [2022-11-25 19:31:51,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:31:51,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 19:31:51,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
11: [2022-11-25 19:31:51,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:31:51,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 19:31:51,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 2: [2022-11-25 19:31:51,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:31:51,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 19:31:51,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 9: [2022-11-25 19:31:51,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:31:51,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 19:31:51,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 2: [2022-11-25 19:31:51,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:31:51,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 19:31:51,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
3: [2022-11-25 19:31:51,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:31:51,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:31:51,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 19:31:51,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 19:31:51,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 3: [2022-11-25 19:31:51,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 5: [2022-11-25 19:31:51,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:31:51,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 19:31:51,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 4: [2022-11-25 19:31:51,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:31:51,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 19:31:51,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
26: [2022-11-25 19:31:51,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-25 19:31:51,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step9000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-25 19:31:51,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 0: successfully saved checkpoint at iteration 9000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2517.28 31: iteration 9010/ 173500 | consumed samples: 2306560 | consumed tokens: 4723834880 | elapsed time per iteration (s): 0.99 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.370507E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 258.408 | TFLOPs: 15.63 | 31: iteration 9020/ 173500 | consumed samples: 2309120 | consumed tokens: 4729077760 | elapsed time per iteration (s): 0.79 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.344283E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.244 | TFLOPs: 19.68 | 31: iteration 9030/ 173500 | consumed samples: 2311680 | consumed tokens: 4734320640 | elapsed time per iteration (s): 0.74 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.375264E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.066 | TFLOPs: 20.88 | 31: iteration 9040/ 173500 | consumed samples: 2314240 | consumed tokens: 4739563520 | elapsed time per iteration (s): 0.79 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.364709E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.596 | TFLOPs: 19.58 | 31: 
iteration 9050/ 173500 | consumed samples: 2316800 | consumed tokens: 4744806400 | elapsed time per iteration (s): 0.74 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.379878E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.302 | TFLOPs: 20.83 | 31: iteration 9060/ 173500 | consumed samples: 2319360 | consumed tokens: 4750049280 | elapsed time per iteration (s): 0.74 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.375178E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.284 | TFLOPs: 20.95 | 31: iteration 9070/ 173500 | consumed samples: 2321920 | consumed tokens: 4755292160 | elapsed time per iteration (s): 0.72 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.360581E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.348 | TFLOPs: 21.50 | 31: iteration 9080/ 173500 | consumed samples: 2324480 | consumed tokens: 4760535040 | elapsed time per iteration (s): 0.74 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.386436E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.155 | TFLOPs: 20.94 | 31: iteration 9090/ 173500 | consumed samples: 2327040 | consumed tokens: 4765777920 | elapsed time per iteration (s): 0.76 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.344515E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.869 | TFLOPs: 20.50 | 31: iteration 9100/ 173500 | consumed samples: 2329600 | consumed tokens: 4771020800 | elapsed time per iteration (s): 0.76 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.371136E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 335.888 | TFLOPs: 20.32 | 31: iteration 9110/ 173500 | consumed samples: 2332160 | consumed tokens: 4776263680 | elapsed time per iteration (s): 0.75 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.337952E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.386 | TFLOPs: 20.59 | 31: iteration 9120/ 173500 | consumed samples: 2334720 | consumed tokens: 4781506560 | elapsed time per iteration (s): 0.75 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.363753E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.661 | TFLOPs: 20.55 | 31: iteration 9130/ 173500 | consumed samples: 2337280 | consumed tokens: 4786749440 | elapsed time per iteration (s): 0.74 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.358422E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.190 | TFLOPs: 20.82 | 31: iteration 9140/ 173500 | consumed samples: 2339840 | consumed tokens: 4791992320 | elapsed time per iteration (s): 0.76 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.351265E+00 | grad norm: 0.243 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.551 | TFLOPs: 20.48 | 31: iteration 9150/ 173500 | consumed samples: 2342400 | consumed tokens: 4797235200 | elapsed time per iteration (s): 0.73 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 7.205484E+00 | grad norm: 8.710 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.905 | TFLOPs: 21.29 | 31: iteration 9160/ 173500 | consumed samples: 2344960 | consumed tokens: 4802478080 | elapsed time per iteration (s): 0.74 | learning rate: 1.992E-04 | global batch 
size: 256 | lm loss: 8.779391E+00 | grad norm: 9.562 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.969 | TFLOPs: 21.05 | 31: iteration 9170/ 173500 | consumed samples: 2347520 | consumed tokens: 4807720960 | elapsed time per iteration (s): 0.79 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 7.566447E+00 | grad norm: 1.450 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.408 | TFLOPs: 19.50 | 31: iteration 9180/ 173500 | consumed samples: 2350080 | consumed tokens: 4812963840 | elapsed time per iteration (s): 0.78 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 6.931259E+00 | grad norm: 0.719 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.056 | TFLOPs: 19.79 | 31: iteration 9190/ 173500 | consumed samples: 2352640 | consumed tokens: 4818206720 | elapsed time per iteration (s): 0.81 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 6.636311E+00 | grad norm: 0.409 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.073 | TFLOPs: 19.18 | 31: iteration 9200/ 173500 | consumed samples: 2355200 | consumed tokens: 4823449600 | elapsed time per iteration (s): 0.80 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 6.379452E+00 | grad norm: 0.605 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.927 | TFLOPs: 19.35 | 31: iteration 9210/ 173500 | consumed samples: 2357760 | consumed tokens: 4828692480 | elapsed time per iteration (s): 0.79 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 6.218834E+00 | grad norm: 0.713 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.546 | TFLOPs: 19.51 | 31: iteration 9220/ 173500 | consumed samples: 2360320 | consumed tokens: 
4833935360 | elapsed time per iteration (s): 0.80 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 6.083807E+00 | grad norm: 0.673 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.735 | TFLOPs: 19.40 | 31: iteration 9230/ 173500 | consumed samples: 2362880 | consumed tokens: 4839178240 | elapsed time per iteration (s): 0.80 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 5.962058E+00 | grad norm: 0.730 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.900 | TFLOPs: 19.47 | 31: iteration 9240/ 173500 | consumed samples: 2365440 | consumed tokens: 4844421120 | elapsed time per iteration (s): 0.82 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 5.865632E+00 | grad norm: 0.593 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.224 | TFLOPs: 18.83 | 31: iteration 9250/ 173500 | consumed samples: 2368000 | consumed tokens: 4849664000 | elapsed time per iteration (s): 0.81 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 5.717315E+00 | grad norm: 0.627 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.221 | TFLOPs: 19.19 | 31: iteration 9260/ 173500 | consumed samples: 2370560 | consumed tokens: 4854906880 | elapsed time per iteration (s): 0.81 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 5.659204E+00 | grad norm: 0.639 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.393 | TFLOPs: 19.14 | 31: iteration 9270/ 173500 | consumed samples: 2373120 | consumed tokens: 4860149760 | elapsed time per iteration (s): 0.82 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 5.493419E+00 | grad norm: 0.667 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.863 | 
TFLOPs: 18.81 | 31: iteration 9280/ 173500 | consumed samples: 2375680 | consumed tokens: 4865392640 | elapsed time per iteration (s): 0.81 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 5.470169E+00 | grad norm: 0.522 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.313 | TFLOPs: 19.14 | 31: iteration 9290/ 173500 | consumed samples: 2378240 | consumed tokens: 4870635520 | elapsed time per iteration (s): 0.82 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 5.325307E+00 | grad norm: 0.490 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.061 | TFLOPs: 18.88 | 31: iteration 9300/ 173500 | consumed samples: 2380800 | consumed tokens: 4875878400 | elapsed time per iteration (s): 0.81 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 5.215206E+00 | grad norm: 0.497 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.994 | TFLOPs: 19.06 | 31: iteration 9310/ 173500 | consumed samples: 2383360 | consumed tokens: 4881121280 | elapsed time per iteration (s): 0.78 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 5.165458E+00 | grad norm: 1.222 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.747 | TFLOPs: 19.95 | 31: iteration 9320/ 173500 | consumed samples: 2385920 | consumed tokens: 4886364160 | elapsed time per iteration (s): 0.78 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 5.134792E+00 | grad norm: 0.604 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.181 | TFLOPs: 19.73 | 31: iteration 9330/ 173500 | consumed samples: 2388480 | consumed tokens: 4891607040 | elapsed time per iteration (s): 0.73 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 4.996962E+00 | grad norm: 0.443 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.895 | TFLOPs: 21.23 | 31: iteration 9340/ 173500 | consumed samples: 2391040 | consumed tokens: 4896849920 | elapsed time per iteration (s): 0.77 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 4.893347E+00 | grad norm: 0.424 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.428 | TFLOPs: 20.23 | 31: iteration 9350/ 173500 | consumed samples: 2393600 | consumed tokens: 4902092800 | elapsed time per iteration (s): 0.75 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 4.768248E+00 | grad norm: 0.681 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.250 | TFLOPs: 20.52 | 31: iteration 9360/ 173500 | consumed samples: 2396160 | consumed tokens: 4907335680 | elapsed time per iteration (s): 0.74 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 4.601183E+00 | grad norm: 0.818 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.248 | TFLOPs: 21.01 | 31: iteration 9370/ 173500 | consumed samples: 2398720 | consumed tokens: 4912578560 | elapsed time per iteration (s): 0.75 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 4.486154E+00 | grad norm: 1.002 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.579 | TFLOPs: 20.60 | 31: iteration 9380/ 173500 | consumed samples: 2401280 | consumed tokens: 4917821440 | elapsed time per iteration (s): 0.75 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 4.328894E+00 | grad norm: 0.866 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.889 | TFLOPs: 20.62 | 31: iteration 9390/ 173500 | consumed samples: 2403840 | consumed tokens: 4923064320 | elapsed time per iteration (s): 0.76 | learning rate: 
1.991E-04 | global batch size: 256 | lm loss: 4.020229E+00 | grad norm: 1.312 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.069 | TFLOPs: 20.27 | 31: iteration 9400/ 173500 | consumed samples: 2406400 | consumed tokens: 4928307200 | elapsed time per iteration (s): 0.75 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 3.660592E+00 | grad norm: 1.099 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.103 | TFLOPs: 20.58 | 31: iteration 9410/ 173500 | consumed samples: 2408960 | consumed tokens: 4933550080 | elapsed time per iteration (s): 0.75 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 3.114828E+00 | grad norm: 0.698 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.396 | TFLOPs: 20.59 | 31: iteration 9420/ 173500 | consumed samples: 2411520 | consumed tokens: 4938792960 | elapsed time per iteration (s): 0.75 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.816258E+00 | grad norm: 0.355 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.307 | TFLOPs: 20.59 | 31: iteration 9430/ 173500 | consumed samples: 2414080 | consumed tokens: 4944035840 | elapsed time per iteration (s): 0.77 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.585622E+00 | grad norm: 0.299 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.621 | TFLOPs: 20.18 | 31: iteration 9440/ 173500 | consumed samples: 2416640 | consumed tokens: 4949278720 | elapsed time per iteration (s): 0.75 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.529126E+00 | grad norm: 0.239 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.050 | TFLOPs: 20.57 | 31: iteration 9450/ 173500 | consumed samples: 
2419200 | consumed tokens: 4954521600 | elapsed time per iteration (s): 0.75 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.527104E+00 | grad norm: 0.227 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.440 | TFLOPs: 20.72 | 31: iteration 9460/ 173500 | consumed samples: 2421760 | consumed tokens: 4959764480 | elapsed time per iteration (s): 0.76 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.474907E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.138 | TFLOPs: 20.27 | 31: iteration 9470/ 173500 | consumed samples: 2424320 | consumed tokens: 4965007360 | elapsed time per iteration (s): 0.74 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.463407E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.215 | TFLOPs: 20.88 | 31: iteration 9480/ 173500 | consumed samples: 2426880 | consumed tokens: 4970250240 | elapsed time per iteration (s): 0.76 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.402957E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.836 | TFLOPs: 20.38 | 31: iteration 9490/ 173500 | consumed samples: 2429440 | consumed tokens: 4975493120 | elapsed time per iteration (s): 0.72 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.397576E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.506 | TFLOPs: 21.51 | 31: iteration 9500/ 173500 | consumed samples: 2432000 | consumed tokens: 4980736000 | elapsed time per iteration (s): 0.75 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.422652E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 339.491 | TFLOPs: 20.54 | 31: iteration 9510/ 173500 | consumed samples: 2434560 | consumed tokens: 4985978880 | elapsed time per iteration (s): 0.72 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.414286E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.957 | TFLOPs: 21.47 | 31: iteration 9520/ 173500 | consumed samples: 2437120 | consumed tokens: 4991221760 | elapsed time per iteration (s): 0.71 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.389828E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.746 | TFLOPs: 21.70 | 31: iteration 9530/ 173500 | consumed samples: 2439680 | consumed tokens: 4996464640 | elapsed time per iteration (s): 0.80 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.424342E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.031 | TFLOPs: 19.36 | 31: iteration 9540/ 173500 | consumed samples: 2442240 | consumed tokens: 5001707520 | elapsed time per iteration (s): 0.76 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.419783E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.832 | TFLOPs: 20.38 | 31: iteration 9550/ 173500 | consumed samples: 2444800 | consumed tokens: 5006950400 | elapsed time per iteration (s): 0.81 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.408623E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.374 | TFLOPs: 19.02 | 31: iteration 9560/ 173500 | consumed samples: 2447360 | consumed tokens: 5012193280 | elapsed time per iteration (s): 0.81 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.400825E+00 | grad norm: 
0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.448 | TFLOPs: 19.14 | 31: iteration 9570/ 173500 | consumed samples: 2449920 | consumed tokens: 5017436160 | elapsed time per iteration (s): 0.81 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.334129E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.767 | TFLOPs: 19.10 | 31: iteration 9580/ 173500 | consumed samples: 2452480 | consumed tokens: 5022679040 | elapsed time per iteration (s): 0.86 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.394038E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.482 | TFLOPs: 17.94 | 31: iteration 9590/ 173500 | consumed samples: 2455040 | consumed tokens: 5027921920 | elapsed time per iteration (s): 0.75 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.362536E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.293 | TFLOPs: 20.53 | 31: iteration 9600/ 173500 | consumed samples: 2457600 | consumed tokens: 5033164800 | elapsed time per iteration (s): 0.80 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.369442E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.076 | TFLOPs: 19.30 | 31: iteration 9610/ 173500 | consumed samples: 2460160 | consumed tokens: 5038407680 | elapsed time per iteration (s): 0.83 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.363860E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.223 | TFLOPs: 18.77 | 31: iteration 9620/ 173500 | consumed samples: 2462720 | consumed tokens: 5043650560 | elapsed time per iteration (s): 0.84 
| learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.419440E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.591 | TFLOPs: 18.37 | 31: iteration 9630/ 173500 | consumed samples: 2465280 | consumed tokens: 5048893440 | elapsed time per iteration (s): 0.80 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.391812E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.096 | TFLOPs: 19.43 | 31: iteration 9640/ 173500 | consumed samples: 2467840 | consumed tokens: 5054136320 | elapsed time per iteration (s): 0.81 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.348912E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.291 | TFLOPs: 19.07 | 31: iteration 9650/ 173500 | consumed samples: 2470400 | consumed tokens: 5059379200 | elapsed time per iteration (s): 0.82 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.385233E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.381 | TFLOPs: 18.90 | 31: iteration 9660/ 173500 | consumed samples: 2472960 | consumed tokens: 5064622080 | elapsed time per iteration (s): 0.77 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.386756E+00 | grad norm: 0.360 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.487 | TFLOPs: 20.05 | 31: iteration 9670/ 173500 | consumed samples: 2475520 | consumed tokens: 5069864960 | elapsed time per iteration (s): 0.79 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.356309E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.511 | TFLOPs: 19.63 | 31: iteration 9680/ 173500 | 
consumed samples: 2478080 | consumed tokens: 5075107840 | elapsed time per iteration (s): 0.80 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.357804E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.054 | TFLOPs: 19.30 | 31: iteration 9690/ 173500 | consumed samples: 2480640 | consumed tokens: 5080350720 | elapsed time per iteration (s): 0.76 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.359062E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.117 | TFLOPs: 20.27 | 31: iteration 9700/ 173500 | consumed samples: 2483200 | consumed tokens: 5085593600 | elapsed time per iteration (s): 0.79 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.368087E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.609 | TFLOPs: 19.64 | 31: iteration 9710/ 173500 | consumed samples: 2485760 | consumed tokens: 5090836480 | elapsed time per iteration (s): 0.78 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.361622E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.955 | TFLOPs: 19.78 | 31: iteration 9720/ 173500 | consumed samples: 2488320 | consumed tokens: 5096079360 | elapsed time per iteration (s): 0.79 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.366290E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.081 | TFLOPs: 19.55 | 31: iteration 9730/ 173500 | consumed samples: 2490880 | consumed tokens: 5101322240 | elapsed time per iteration (s): 0.77 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.411749E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 334.522 | TFLOPs: 20.24 | 31: iteration 9740/ 173500 | consumed samples: 2493440 | consumed tokens: 5106565120 | elapsed time per iteration (s): 0.82 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.381277E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.385 | TFLOPs: 18.78 | 31: iteration 9750/ 173500 | consumed samples: 2496000 | consumed tokens: 5111808000 | elapsed time per iteration (s): 0.80 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.358106E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.980 | TFLOPs: 19.30 | 31: iteration 9760/ 173500 | consumed samples: 2498560 | consumed tokens: 5117050880 | elapsed time per iteration (s): 0.88 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.351118E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.134 | TFLOPs: 17.67 | 31: iteration 9770/ 173500 | consumed samples: 2501120 | consumed tokens: 5122293760 | elapsed time per iteration (s): 0.76 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.364583E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.105 | TFLOPs: 20.45 | 31: iteration 9780/ 173500 | consumed samples: 2503680 | consumed tokens: 5127536640 | elapsed time per iteration (s): 0.75 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.389724E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.148 | TFLOPs: 20.58 | 31: iteration 9790/ 173500 | consumed samples: 2506240 | consumed tokens: 5132779520 | elapsed time per iteration (s): 0.80 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 
2.348043E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.620 | TFLOPs: 19.28 | 31: iteration 9800/ 173500 | consumed samples: 2508800 | consumed tokens: 5138022400 | elapsed time per iteration (s): 0.73 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.375081E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.280 | TFLOPs: 21.13 | 31: iteration 9810/ 173500 | consumed samples: 2511360 | consumed tokens: 5143265280 | elapsed time per iteration (s): 0.76 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.369850E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.940 | TFLOPs: 20.26 | 31: iteration 9820/ 173500 | consumed samples: 2513920 | consumed tokens: 5148508160 | elapsed time per iteration (s): 0.77 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.342139E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.489 | TFLOPs: 20.05 | 31: iteration 9830/ 173500 | consumed samples: 2516480 | consumed tokens: 5153751040 | elapsed time per iteration (s): 0.79 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.367531E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.435 | TFLOPs: 19.69 | 31: iteration 9840/ 173500 | consumed samples: 2519040 | consumed tokens: 5158993920 | elapsed time per iteration (s): 0.78 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.323001E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.857 | TFLOPs: 19.77 | 31: iteration 9850/ 173500 | consumed samples: 2521600 | consumed tokens: 5164236800 | elapsed 
time per iteration (s): 0.79 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.364693E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.451 | TFLOPs: 19.63 | 31: iteration 9860/ 173500 | consumed samples: 2524160 | consumed tokens: 5169479680 | elapsed time per iteration (s): 0.78 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.378255E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.709 | TFLOPs: 19.95 | 31: iteration 9870/ 173500 | consumed samples: 2526720 | consumed tokens: 5174722560 | elapsed time per iteration (s): 0.79 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.327324E+00 | grad norm: 0.247 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.511 | TFLOPs: 19.63 | 31: iteration 9880/ 173500 | consumed samples: 2529280 | consumed tokens: 5179965440 | elapsed time per iteration (s): 0.80 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.356753E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.962 | TFLOPs: 19.36 | 31: iteration 9890/ 173500 | consumed samples: 2531840 | consumed tokens: 5185208320 | elapsed time per iteration (s): 0.81 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.388976E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.947 | TFLOPs: 19.17 | 31: iteration 9900/ 173500 | consumed samples: 2534400 | consumed tokens: 5190451200 | elapsed time per iteration (s): 0.76 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.383521E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.336 | TFLOPs: 20.47 | 31: 
iteration 9910/ 173500 | consumed samples: 2536960 | consumed tokens: 5195694080 | elapsed time per iteration (s): 0.77 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.367513E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.972 | TFLOPs: 20.02 | 31: iteration 9920/ 173500 | consumed samples: 2539520 | consumed tokens: 5200936960 | elapsed time per iteration (s): 0.82 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.390805E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.122 | TFLOPs: 18.94 | 31: iteration 9930/ 173500 | consumed samples: 2542080 | consumed tokens: 5206179840 | elapsed time per iteration (s): 0.72 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.363062E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.434 | TFLOPs: 21.38 | 31: iteration 9940/ 173500 | consumed samples: 2544640 | consumed tokens: 5211422720 | elapsed time per iteration (s): 0.78 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.339211E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.055 | TFLOPs: 19.85 | 31: iteration 9950/ 173500 | consumed samples: 2547200 | consumed tokens: 5216665600 | elapsed time per iteration (s): 0.73 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.335499E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.091 | TFLOPs: 21.12 | 31: iteration 9960/ 173500 | consumed samples: 2549760 | consumed tokens: 5221908480 | elapsed time per iteration (s): 0.77 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.363217E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 333.931 | TFLOPs: 20.20 | 31: iteration 9970/ 173500 | consumed samples: 2552320 | consumed tokens: 5227151360 | elapsed time per iteration (s): 0.79 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.323375E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.806 | TFLOPs: 19.71 | 31: iteration 9980/ 173500 | consumed samples: 2554880 | consumed tokens: 5232394240 | elapsed time per iteration (s): 0.76 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.371524E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.088 | TFLOPs: 20.39 | 31: iteration 9990/ 173500 | consumed samples: 2557440 | consumed tokens: 5237637120 | elapsed time per iteration (s): 119.56 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.371034E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.141 | TFLOPs: 0.13 | 0: [2022-11-25 20:17:30,206] [INFO] [logging.py:68:log_dist] [Rank 0] step=10000, skipped=0, lr=[0.00019897364350587667, 0.00019897364350587667, 0.00019897364350587667], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 10000/ 173500 | consumed samples: 2560000 | consumed tokens: 5242880000 | elapsed time per iteration (s): 78.26 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.377430E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 3.271 | TFLOPs: 0.20 | 0: steps: 10000 loss: 2.3446 iter time (s): 1.759 samples/sec: 145.500 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 10000 | lm loss value: 2.285973E+00 | lm loss PPL: 9.835254E+00 | 31: 
------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 10000 to checkpoints_1b1long 0: [2022-11-25 20:17:30,589] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step10000 is begin to save! 0: [2022-11-25 20:17:30,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_01-model_00-model_states.pt... 0: [2022-11-25 20:17:30,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_01-model_00-model_states.pt. 0: [2022-11-25 20:17:30,823] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_03-model_00-model_states.pt... 0: [2022-11-25 20:17:31,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_03-model_00-model_states.pt. 0: [2022-11-25 20:17:31,040] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_04-model_00-model_states.pt... 0: [2022-11-25 20:17:31,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_04-model_00-model_states.pt. 0: [2022-11-25 20:17:31,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_05-model_00-model_states.pt... 0: [2022-11-25 20:17:31,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_05-model_00-model_states.pt. 0: [2022-11-25 20:17:31,191] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_06-model_00-model_states.pt... 0: [2022-11-25 20:17:31,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_06-model_00-model_states.pt. 
0: [2022-11-25 20:17:31,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_07-model_00-model_states.pt... 0: [2022-11-25 20:17:31,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_07-model_00-model_states.pt. 0: [2022-11-25 20:17:31,347] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_08-model_00-model_states.pt... 0: [2022-11-25 20:17:31,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_08-model_00-model_states.pt. 0: [2022-11-25 20:17:31,428] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_09-model_00-model_states.pt... 0: [2022-11-25 20:17:31,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_09-model_00-model_states.pt. 0: [2022-11-25 20:17:31,507] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_10-model_00-model_states.pt... 0: [2022-11-25 20:17:31,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_10-model_00-model_states.pt. 0: [2022-11-25 20:17:31,584] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_11-model_00-model_states.pt... 0: [2022-11-25 20:17:31,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_11-model_00-model_states.pt. 0: [2022-11-25 20:17:31,666] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_12-model_00-model_states.pt... 0: [2022-11-25 20:17:31,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_12-model_00-model_states.pt. 
0: [2022-11-25 20:17:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_13-model_00-model_states.pt... 0: [2022-11-25 20:17:31,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_13-model_00-model_states.pt. 0: [2022-11-25 20:17:31,818] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_14-model_00-model_states.pt... 0: [2022-11-25 20:17:31,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_14-model_00-model_states.pt. 0: [2022-11-25 20:17:31,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_15-model_00-model_states.pt... 0: [2022-11-25 20:17:31,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_15-model_00-model_states.pt. 0: [2022-11-25 20:17:31,967] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_16-model_00-model_states.pt... 0: [2022-11-25 20:17:32,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_16-model_00-model_states.pt. 0: [2022-11-25 20:17:32,042] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_17-model_00-model_states.pt... 0: [2022-11-25 20:17:32,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_17-model_00-model_states.pt. 0: [2022-11-25 20:17:32,117] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_18-model_00-model_states.pt... 0: [2022-11-25 20:17:32,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_18-model_00-model_states.pt. 
0: [2022-11-25 20:17:32,189] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_19-model_00-model_states.pt... 0: [2022-11-25 20:17:32,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_19-model_00-model_states.pt. 0: [2022-11-25 20:17:32,266] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_20-model_00-model_states.pt... 0: [2022-11-25 20:17:32,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_20-model_00-model_states.pt. 0: [2022-11-25 20:17:32,339] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_21-model_00-model_states.pt... 0: [2022-11-25 20:17:32,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_21-model_00-model_states.pt. 0: [2022-11-25 20:17:32,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_22-model_00-model_states.pt... 0: [2022-11-25 20:17:32,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_22-model_00-model_states.pt. 0: [2022-11-25 20:17:32,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_23-model_00-model_states.pt... 0: [2022-11-25 20:17:32,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_23-model_00-model_states.pt. 0: [2022-11-25 20:17:32,562] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_24-model_00-model_states.pt... 0: [2022-11-25 20:17:32,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_24-model_00-model_states.pt. 
0: [2022-11-25 20:17:32,637] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_25-model_00-model_states.pt... 0: [2022-11-25 20:17:32,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_25-model_00-model_states.pt. 0: [2022-11-25 20:17:32,712] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_26-model_00-model_states.pt... 0: [2022-11-25 20:17:32,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_26-model_00-model_states.pt. 0: [2022-11-25 20:17:32,786] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_27-model_00-model_states.pt... 0: [2022-11-25 20:17:32,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_27-model_00-model_states.pt. 0: [2022-11-25 20:17:32,861] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_28-model_00-model_states.pt... 0: [2022-11-25 20:17:32,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_28-model_00-model_states.pt. 0: [2022-11-25 20:17:32,936] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/layer_30-model_00-model_states.pt... 0: [2022-11-25 20:17:32,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/layer_30-model_00-model_states.pt. 0: [2022-11-25 20:17:32,938] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step10000/mp_rank_00_model_states.pt 0: [2022-11-25 20:17:32,938] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/mp_rank_00_model_states.pt... 
0: [2022-11-25 20:17:32,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/mp_rank_00_model_states.pt. 0: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
4: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
10: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
13: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 
20: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 
11: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
14: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 
30: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 
18: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 
6: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
9: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
13: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 
20: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
28: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 
22: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 
21: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 
19: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
5: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
10: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
13: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 25: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 
28: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 30: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 
21: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
8: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 25: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 24: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 29: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 21: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 
0: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:17:33,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:17:33,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:17:33,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 20:17:33,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 13: [2022-11-25 20:17:33,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:17:33,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 20:17:33,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 3: [2022-11-25 20:17:33,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-25 20:17:33,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 20:17:33,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 5: [2022-11-25 20:17:33,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:17:33,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 20:17:33,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 3: [2022-11-25 20:17:33,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:17:33,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:17:33,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 5: [2022-11-25 20:17:33,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:17:33,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 3: [2022-11-25 20:17:33,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
5: [2022-11-25 20:17:33,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 20:17:33,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 5: [2022-11-25 20:17:33,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 5: [2022-11-25 20:17:33,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:17:33,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 20:17:33,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 25: [2022-11-25 20:17:33,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-25 20:17:33,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-25 20:17:33,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 3: [2022-11-25 20:17:33,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:17:33,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 20:17:33,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
8: [2022-11-25 20:17:33,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:17:33,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 20:17:33,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 25: [2022-11-25 20:17:33,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-25 20:17:33,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-25 20:17:33,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-25 20:17:33,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 25: [2022-11-25 20:17:33,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-25 20:17:33,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 14: [2022-11-25 20:17:33,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:17:33,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
14: [2022-11-25 20:17:33,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 20:17:33,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 7: [2022-11-25 20:17:33,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 20:17:33,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 5: [2022-11-25 20:17:33,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:17:33,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 20:17:33,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 7: [2022-11-25 20:17:33,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:17:33,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 20:17:33,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 28: [2022-11-25 20:17:33,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-25 20:17:33,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
6: [2022-11-25 20:17:33,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:17:33,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:17:33,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 20:17:33,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 6: [2022-11-25 20:17:33,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 20:17:33,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 6: [2022-11-25 20:17:33,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:17:33,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 20:17:33,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 14: [2022-11-25 20:17:33,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:17:33,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 20:17:33,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
7: [2022-11-25 20:17:33,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:17:33,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 20:17:33,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 2: [2022-11-25 20:17:33,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:17:33,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 20:17:33,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 6: [2022-11-25 20:17:33,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:17:33,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 20:17:33,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 30: [2022-11-25 20:17:33,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-25 20:17:33,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-25 20:17:33,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
3: [2022-11-25 20:17:33,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:17:33,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:17:33,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 20:17:33,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 3: [2022-11-25 20:17:33,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 20:17:33,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 2: [2022-11-25 20:17:33,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:17:33,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 20:17:33,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 28: [2022-11-25 20:17:33,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-25 20:17:33,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-25 20:17:33,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
28: [2022-11-25 20:17:33,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 28: [2022-11-25 20:17:33,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-25 20:17:33,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-25 20:17:33,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 13: [2022-11-25 20:17:33,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:17:33,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 20:17:33,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 13: [2022-11-25 20:17:33,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:17:33,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 9: [2022-11-25 20:17:33,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:17:33,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
9: [2022-11-25 20:17:33,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 20:17:33,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 13: [2022-11-25 20:17:33,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:17:33,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 20:17:33,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 10: [2022-11-25 20:17:33,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:17:33,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:17:33,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-25 20:17:33,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 20:17:33,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 20:17:33,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 20:17:33,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 10: [2022-11-25 20:17:33,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 10: [2022-11-25 20:17:33,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 30: [2022-11-25 20:17:33,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-25 20:17:33,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-25 20:17:33,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 8: [2022-11-25 20:17:33,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:17:33,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 20:17:33,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
8: [2022-11-25 20:17:33,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:17:33,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 20:17:33,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 0: [2022-11-25 20:17:33,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:17:33,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 20:17:33,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 2: [2022-11-25 20:17:33,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:17:33,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 20:17:33,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 0: [2022-11-25 20:17:33,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:17:33,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 20:17:33,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
0: [2022-11-25 20:17:33,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:17:33,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 20:17:33,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 14: [2022-11-25 20:17:33,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:17:33,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 20:17:33,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 0: [2022-11-25 20:17:33,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:17:33,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 20:17:33,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 0: [2022-11-25 20:17:33,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 18: [2022-11-25 20:17:33,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-25 20:17:33,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
18: [2022-11-25 20:17:33,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-25 20:17:33,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-25 20:17:33,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-25 20:17:33,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-25 20:17:33,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 18: [2022-11-25 20:17:33,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 18: [2022-11-25 20:17:33,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 15: [2022-11-25 20:17:33,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:17:33,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 20:17:33,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 24: [2022-11-25 20:17:33,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-25 20:17:33,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-25 20:17:33,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 9: [2022-11-25 20:17:33,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:17:33,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 20:17:33,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 31: [2022-11-25 20:17:33,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-25 20:17:33,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-25 20:17:33,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-25 20:17:33,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-25 20:17:33,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-25 20:17:33,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-25 20:17:33,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-25 20:17:33,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-25 20:17:33,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 31: [2022-11-25 20:17:33,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 31: [2022-11-25 20:17:33,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 31: [2022-11-25 20:17:33,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 7: [2022-11-25 20:17:33,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:17:33,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 20:17:33,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 7: [2022-11-25 20:17:33,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-25 20:17:33,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 20:17:33,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 30: [2022-11-25 20:17:33,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-25 20:17:33,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-25 20:17:33,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 23: [2022-11-25 20:17:33,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-25 20:17:33,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-25 20:17:33,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 25: [2022-11-25 20:17:33,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-25 20:17:33,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-25 20:17:33,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 2: [2022-11-25 20:17:33,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-25 20:17:33,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 20:17:33,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 12: [2022-11-25 20:17:33,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:17:33,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:17:33,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:17:33,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 20:17:33,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 12: [2022-11-25 20:17:33,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 20:17:33,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 20:17:33,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 12: [2022-11-25 20:17:33,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 15: [2022-11-25 20:17:33,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-25 20:17:33,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 20:17:33,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 15: [2022-11-25 20:17:33,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:17:33,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 20:17:33,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 24: [2022-11-25 20:17:33,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-25 20:17:33,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-25 20:17:33,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-25 20:17:33,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-25 20:17:33,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 24: [2022-11-25 20:17:33,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 19: [2022-11-25 20:17:33,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-25 20:17:33,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-25 20:17:33,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 21: [2022-11-25 20:17:33,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-25 20:17:33,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-25 20:17:33,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-25 20:17:33,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-25 20:17:33,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-25 20:17:33,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-25 20:17:33,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 21: [2022-11-25 20:17:33,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 21: [2022-11-25 20:17:33,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 8: [2022-11-25 20:17:33,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
19: [2022-11-25 20:17:33,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-25 20:17:33,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 8: [2022-11-25 20:17:33,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 19: [2022-11-25 20:17:33,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 8: [2022-11-25 20:17:33,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 9: [2022-11-25 20:17:33,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:17:33,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 20:17:33,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 19: [2022-11-25 20:17:33,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-25 20:17:33,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-25 20:17:33,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 11: [2022-11-25 20:17:33,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-25 20:17:33,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 20:17:33,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 22: [2022-11-25 20:17:33,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-25 20:17:33,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-25 20:17:33,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-25 20:17:33,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-25 20:17:33,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 22: [2022-11-25 20:17:33,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 29: [2022-11-25 20:17:33,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-25 20:17:33,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-25 20:17:33,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-25 20:17:33,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-25 20:17:33,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-25 20:17:33,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-25 20:17:33,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-25 20:17:33,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-25 20:17:33,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-25 20:17:33,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-25 20:17:33,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 29: [2022-11-25 20:17:33,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 29: [2022-11-25 20:17:33,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 29: [2022-11-25 20:17:33,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 29: [2022-11-25 20:17:33,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 2: [2022-11-25 20:17:33,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-25 20:17:33,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 1: [2022-11-25 20:17:33,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:17:33,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 0: [2022-11-25 20:17:33,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:17:33,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 20:17:33,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 1: [2022-11-25 20:17:33,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:17:33,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:17:33,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 20:17:33,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
1: [2022-11-25 20:17:33,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 20:17:33,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 20:17:33,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 1: [2022-11-25 20:17:33,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 16: [2022-11-25 20:17:33,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-25 20:17:33,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-25 20:17:33,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-25 20:17:33,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-25 20:17:33,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-25 20:17:33,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-25 20:17:33,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-25 20:17:33,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-25 20:17:33,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-25 20:17:33,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-25 20:17:33,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 16: [2022-11-25 20:17:33,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 16: [2022-11-25 20:17:33,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 16: [2022-11-25 20:17:33,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 16: [2022-11-25 20:17:33,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 4: [2022-11-25 20:17:33,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:17:33,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-25 20:17:33,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:17:33,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 20:17:33,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 20:17:33,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 20:17:33,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 4: [2022-11-25 20:17:33,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 4: [2022-11-25 20:17:33,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 13: [2022-11-25 20:17:33,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:17:33,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 20:17:33,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 22: [2022-11-25 20:17:33,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
22: [2022-11-25 20:17:33,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-25 20:17:33,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 27: [2022-11-25 20:17:33,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-25 20:17:33,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-25 20:17:33,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-25 20:17:33,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-25 20:17:33,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-25 20:17:33,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 8: [2022-11-25 20:17:33,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:17:33,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 27: [2022-11-25 20:17:33,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
27: [2022-11-25 20:17:33,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 27: [2022-11-25 20:17:33,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 8: [2022-11-25 20:17:33,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 17: [2022-11-25 20:17:33,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-25 20:17:33,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-25 20:17:33,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-25 20:17:33,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-25 20:17:33,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-25 20:17:33,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-25 20:17:33,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-25 20:17:33,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-25 20:17:33,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
17: [2022-11-25 20:17:33,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 17: [2022-11-25 20:17:33,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 17: [2022-11-25 20:17:33,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 20: [2022-11-25 20:17:33,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-25 20:17:33,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-25 20:17:33,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-25 20:17:33,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-25 20:17:33,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-25 20:17:33,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-25 20:17:33,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 20: [2022-11-25 20:17:33,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 20: [2022-11-25 20:17:33,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
11: [2022-11-25 20:17:33,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:17:33,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 20:17:33,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 23: [2022-11-25 20:17:33,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-25 20:17:33,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-25 20:17:33,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-25 20:17:33,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 23: [2022-11-25 20:17:33,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-25 20:17:33,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 11: [2022-11-25 20:17:33,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:17:33,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 20:17:33,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
31: [2022-11-25 20:17:33,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-25 20:17:33,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-25 20:17:33,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 0: [2022-11-25 20:17:33,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 5: [2022-11-25 20:17:33,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:17:33,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 5: [2022-11-25 20:17:33,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 20:17:33,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 26: [2022-11-25 20:17:33,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-25 20:17:33,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-25 20:17:33,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-25 20:17:33,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
26: [2022-11-25 20:17:33,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-25 20:17:33,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-25 20:17:33,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-25 20:17:33,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-25 20:17:33,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-25 20:17:33,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-25 20:17:33,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 26: [2022-11-25 20:17:33,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 26: [2022-11-25 20:17:33,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 26: [2022-11-25 20:17:33,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 26: [2022-11-25 20:17:33,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 25: [2022-11-25 20:17:33,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
25: [2022-11-25 20:17:33,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-25 20:17:33,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 17: [2022-11-25 20:17:33,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-25 20:17:33,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-25 20:17:33,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 25: [2022-11-25 20:17:33,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-25 20:17:33,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-25 20:17:33,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 17: [2022-11-25 20:17:33,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-25 20:17:33,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-25 20:17:33,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 18: [2022-11-25 20:17:33,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
18: [2022-11-25 20:17:33,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-25 20:17:33,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 28: [2022-11-25 20:17:33,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-25 20:17:33,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-25 20:17:33,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 9: [2022-11-25 20:17:33,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:17:33,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 20:17:33,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 31: [2022-11-25 20:17:33,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:17:33,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 31: [2022-11-25 20:17:33,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-25 20:17:33,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
1: [2022-11-25 20:17:33,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 20:17:33,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 30: [2022-11-25 20:17:33,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:17:33,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 30: [2022-11-25 20:17:33,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-25 20:17:33,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 6: [2022-11-25 20:17:33,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 20:17:33,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 16: [2022-11-25 20:17:33,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-25 20:17:33,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 12: [2022-11-25 20:17:33,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 16: [2022-11-25 20:17:33,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
12: [2022-11-25 20:17:33,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 20:17:33,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 14: [2022-11-25 20:17:33,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:17:33,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 20:17:33,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 19: [2022-11-25 20:17:33,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-25 20:17:33,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-25 20:17:33,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 11: [2022-11-25 20:17:33,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:17:33,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 20:17:33,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 29: [2022-11-25 20:17:33,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-25 20:17:33,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-25 20:17:33,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 10: [2022-11-25 20:17:33,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 21: [2022-11-25 20:17:33,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:17:33,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 21: [2022-11-25 20:17:33,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 10: [2022-11-25 20:17:33,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 21: [2022-11-25 20:17:33,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 24: [2022-11-25 20:17:33,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-25 20:17:33,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-25 20:17:33,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 26: [2022-11-25 20:17:33,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
26: [2022-11-25 20:17:33,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-25 20:17:33,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 0: [2022-11-25 20:17:33,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:17:33,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 20:17:33,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 8: [2022-11-25 20:17:33,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:17:33,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 25: [2022-11-25 20:17:33,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:17:33,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 25: [2022-11-25 20:17:33,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-25 20:17:33,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 22: [2022-11-25 20:17:33,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-25 20:17:33,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 23: [2022-11-25 20:17:33,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 22: [2022-11-25 20:17:33,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 23: [2022-11-25 20:17:33,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-25 20:17:33,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 13: [2022-11-25 20:17:33,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:17:33,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:17:33,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 15: [2022-11-25 20:17:33,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 13: [2022-11-25 20:17:33,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 15: [2022-11-25 20:17:33,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 7: [2022-11-25 20:17:33,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-25 20:17:33,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 20:17:33,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 4: [2022-11-25 20:17:33,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:17:33,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 20:17:33,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 2: [2022-11-25 20:17:33,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:17:33,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 20:17:33,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 20: [2022-11-25 20:17:33,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-25 20:17:33,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-25 20:17:33,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 27: [2022-11-25 20:17:33,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
27: [2022-11-25 20:17:33,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-25 20:17:33,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 5: [2022-11-25 20:17:33,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:17:33,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 20:17:33,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 6: [2022-11-25 20:17:33,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:17:33,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 20:17:33,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 18: [2022-11-25 20:17:33,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-25 20:17:33,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-25 20:17:33,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 3: [2022-11-25 20:17:33,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
17: [2022-11-25 20:17:33,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:17:33,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 20:17:33,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 17: [2022-11-25 20:17:33,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-25 20:17:33,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 19: [2022-11-25 20:17:33,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-25 20:17:33,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-25 20:17:33,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 31: [2022-11-25 20:17:33,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-25 20:17:33,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-25 20:17:33,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 14: [2022-11-25 20:17:33,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-25 20:17:33,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 20:17:33,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 28: [2022-11-25 20:17:33,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-25 20:17:33,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-25 20:17:33,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 30: [2022-11-25 20:17:33,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-25 20:17:33,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-25 20:17:33,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 9: [2022-11-25 20:17:33,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:17:33,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 20:17:33,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 12: [2022-11-25 20:17:33,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-25 20:17:33,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 20:17:33,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 16: [2022-11-25 20:17:33,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-25 20:17:33,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-25 20:17:33,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 1: [2022-11-25 20:17:33,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:17:33,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 20:17:33,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 21: [2022-11-25 20:17:33,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-25 20:17:33,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-25 20:17:33,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 10: [2022-11-25 20:17:33,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-25 20:17:33,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 20:17:33,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 29: [2022-11-25 20:17:33,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-25 20:17:33,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-25 20:17:33,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 24: [2022-11-25 20:17:33,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-25 20:17:33,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-25 20:17:33,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 23: [2022-11-25 20:17:33,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-25 20:17:33,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-25 20:17:33,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 0: [2022-11-25 20:17:33,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-25 20:17:33,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 20:17:33,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 13: [2022-11-25 20:17:33,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:17:33,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 20:17:33,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 22: [2022-11-25 20:17:33,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-25 20:17:33,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-25 20:17:33,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 8: [2022-11-25 20:17:33,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:17:33,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 20:17:33,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 4: [2022-11-25 20:17:33,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-25 20:17:33,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 26: [2022-11-25 20:17:33,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:17:33,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 26: [2022-11-25 20:17:33,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-25 20:17:33,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 11: [2022-11-25 20:17:33,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:17:33,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 20:17:33,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 15: [2022-11-25 20:17:33,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:17:33,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 20:17:33,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 7: [2022-11-25 20:17:33,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-25 20:17:33,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 20:17:33,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 18: [2022-11-25 20:17:33,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-25 20:17:33,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-25 20:17:33,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 27: [2022-11-25 20:17:33,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-25 20:17:33,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-25 20:17:33,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 25: [2022-11-25 20:17:33,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-25 20:17:33,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-25 20:17:33,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 3: [2022-11-25 20:17:33,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-25 20:17:33,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 20:17:33,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 2: [2022-11-25 20:17:33,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:17:33,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 20:17:33,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 5: [2022-11-25 20:17:33,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:17:33,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 20: [2022-11-25 20:17:33,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:17:33,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 20: [2022-11-25 20:17:33,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-25 20:17:33,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 9: [2022-11-25 20:17:33,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-25 20:17:33,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 20:17:33,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 17: [2022-11-25 20:17:33,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 28: [2022-11-25 20:17:33,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 17: [2022-11-25 20:17:33,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 28: [2022-11-25 20:17:33,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 17: [2022-11-25 20:17:33,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 14: [2022-11-25 20:17:33,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 28: [2022-11-25 20:17:33,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 14: [2022-11-25 20:17:33,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 20:17:33,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 6: [2022-11-25 20:17:33,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-25 20:17:33,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 20:17:33,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 30: [2022-11-25 20:17:33,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-25 20:17:33,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-25 20:17:33,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 31: [2022-11-25 20:17:33,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-25 20:17:33,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-25 20:17:33,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 16: [2022-11-25 20:17:33,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-25 20:17:33,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-25 20:17:33,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 21: [2022-11-25 20:17:33,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
21: [2022-11-25 20:17:33,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-25 20:17:33,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 12: [2022-11-25 20:17:33,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:17:33,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 20:17:33,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 7: [2022-11-25 20:17:33,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:17:33,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 11: [2022-11-25 20:17:33,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:17:33,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 11: [2022-11-25 20:17:33,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 20:17:33,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 1: [2022-11-25 20:17:33,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-25 20:17:33,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 20:17:33,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 10: [2022-11-25 20:17:33,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:17:33,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 20:17:33,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 29: [2022-11-25 20:17:33,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-25 20:17:33,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-25 20:17:33,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 19: [2022-11-25 20:17:33,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-25 20:17:33,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 2: [2022-11-25 20:17:33,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 19: [2022-11-25 20:17:33,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
2: [2022-11-25 20:17:33,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 20:17:33,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 23: [2022-11-25 20:17:33,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 22: [2022-11-25 20:17:33,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 23: [2022-11-25 20:17:33,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-25 20:17:33,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 22: [2022-11-25 20:17:33,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-25 20:17:33,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 26: [2022-11-25 20:17:33,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-25 20:17:33,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-25 20:17:33,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 24: [2022-11-25 20:17:33,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 
24: [2022-11-25 20:17:33,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-25 20:17:33,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 8: [2022-11-25 20:17:33,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:17:33,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 20:17:33,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 13: [2022-11-25 20:17:33,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:17:33,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 20:17:33,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 27: [2022-11-25 20:17:33,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-25 20:17:33,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-25 20:17:33,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 4: [2022-11-25 20:17:33,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-25 20:17:33,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 20:17:33,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 15: [2022-11-25 20:17:33,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:17:33,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 20:17:33,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 3: [2022-11-25 20:17:33,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:17:33,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 20:17:33,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 18: [2022-11-25 20:17:33,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-25 20:17:33,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-25 20:17:33,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 20: [2022-11-25 20:17:33,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 
20: [2022-11-25 20:17:33,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-25 20:17:33,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 9: [2022-11-25 20:17:33,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:17:33,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 20:17:33,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 1: [2022-11-25 20:17:33,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:17:33,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 20:17:33,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 30: [2022-11-25 20:17:33,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-25 20:17:33,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-25 20:17:33,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 14: [2022-11-25 20:17:33,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-25 20:17:33,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 20:17:33,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 12: [2022-11-25 20:17:33,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:17:33,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 20:17:33,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 21: [2022-11-25 20:17:33,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:17:33,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 21: [2022-11-25 20:17:33,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 11: [2022-11-25 20:17:33,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 21: [2022-11-25 20:17:33,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 11: [2022-11-25 20:17:33,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 24: [2022-11-25 20:17:33,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
24: [2022-11-25 20:17:33,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-25 20:17:33,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 22: [2022-11-25 20:17:33,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-25 20:17:33,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-25 20:17:33,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 19: [2022-11-25 20:17:33,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-25 20:17:33,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-25 20:17:33,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 23: [2022-11-25 20:17:33,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-25 20:17:33,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-25 20:17:33,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 10: [2022-11-25 20:17:33,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-25 20:17:33,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 20:17:33,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 28: [2022-11-25 20:17:33,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-25 20:17:33,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-25 20:17:33,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 21: [2022-11-25 20:17:33,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-25 20:17:33,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-25 20:17:33,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 4: [2022-11-25 20:17:33,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:17:33,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 20:17:33,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 30: [2022-11-25 20:17:33,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 
30: [2022-11-25 20:17:33,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-25 20:17:33,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 23: [2022-11-25 20:17:33,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-25 20:17:33,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-25 20:17:33,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 12: [2022-11-25 20:17:33,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:17:33,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 20:17:33,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 15: [2022-11-25 20:17:33,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 28: [2022-11-25 20:17:33,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:17:33,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 20:17:33,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
15: [2022-11-25 20:17:33,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:17:33,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 28: [2022-11-25 20:17:33,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 24: [2022-11-25 20:17:33,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:17:33,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 28: [2022-11-25 20:17:33,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 24: [2022-11-25 20:17:33,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-25 20:17:33,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 14: [2022-11-25 20:17:33,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 27: [2022-11-25 20:17:33,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:17:33,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
14: [2022-11-25 20:17:33,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 10: [2022-11-25 20:17:33,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 14: [2022-11-25 20:17:33,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 27: [2022-11-25 20:17:33,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 22: [2022-11-25 20:17:33,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 27: [2022-11-25 20:17:33,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 10: [2022-11-25 20:17:33,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 22: [2022-11-25 20:17:33,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-25 20:17:33,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 11: [2022-11-25 20:17:33,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:17:33,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 20:17:33,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
18: [2022-11-25 20:17:33,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-25 20:17:33,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-25 20:17:33,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 4: [2022-11-25 20:17:33,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:17:33,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 20:17:33,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 9: [2022-11-25 20:17:33,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:17:33,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 20:17:33,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 27: [2022-11-25 20:17:33,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 20: [2022-11-25 20:17:33,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
20: [2022-11-25 20:17:33,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 27: [2022-11-25 20:17:33,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 20: [2022-11-25 20:17:33,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 27: [2022-11-25 20:17:33,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 20: [2022-11-25 20:17:33,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-25 20:17:33,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-25 20:17:33,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 19: [2022-11-25 20:17:33,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-25 20:17:33,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-25 20:17:33,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 1: [2022-11-25 20:17:33,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-25 20:17:33,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step10000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 20:17:33,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 0: successfully saved checkpoint at iteration 10000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2797.05 31: iteration 10010/ 173500 | consumed samples: 2562560 | consumed tokens: 5248122880 | elapsed time per iteration (s): 1.16 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.346301E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 219.857 | TFLOPs: 13.30 | 31: iteration 10020/ 173500 | consumed samples: 2565120 | consumed tokens: 5253365760 | elapsed time per iteration (s): 0.84 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.371176E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.419 | TFLOPs: 18.48 | 31: iteration 10030/ 173500 | consumed samples: 2567680 | consumed tokens: 5258608640 | elapsed time per iteration (s): 0.87 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.372076E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.359 | TFLOPs: 17.81 | 31: iteration 10040/ 173500 | consumed samples: 2570240 | consumed tokens: 5263851520 | elapsed time per iteration (s): 0.82 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.371116E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.933 | TFLOPs: 18.81 | 31: iteration 10050/ 173500 | consumed samples: 2572800 | consumed tokens: 5269094400 | elapsed time per iteration (s): 1.25 | learning rate: 1.990E-04 | global batch size: 
256 | lm loss: 2.349578E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 205.266 | TFLOPs: 12.42 | 31: iteration 10060/ 173500 | consumed samples: 2575360 | consumed tokens: 5274337280 | elapsed time per iteration (s): 15.48 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.353800E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 16.543 | TFLOPs: 1.00 | 31: iteration 10070/ 173500 | consumed samples: 2577920 | consumed tokens: 5279580160 | elapsed time per iteration (s): 0.86 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.354038E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.039 | TFLOPs: 17.97 | 31: iteration 10080/ 173500 | consumed samples: 2580480 | consumed tokens: 5284823040 | elapsed time per iteration (s): 0.87 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.318551E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.899 | TFLOPs: 17.72 | 31: iteration 10090/ 173500 | consumed samples: 2583040 | consumed tokens: 5290065920 | elapsed time per iteration (s): 0.85 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.330491E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.578 | TFLOPs: 18.18 | 31: iteration 10100/ 173500 | consumed samples: 2585600 | consumed tokens: 5295308800 | elapsed time per iteration (s): 0.83 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.359327E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.563 | TFLOPs: 18.73 | 31: iteration 10110/ 173500 | consumed samples: 2588160 | consumed tokens: 
5300551680 | elapsed time per iteration (s): 0.83 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.333929E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.892 | TFLOPs: 18.63 | 31: iteration 10120/ 173500 | consumed samples: 2590720 | consumed tokens: 5305794560 | elapsed time per iteration (s): 0.82 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.321626E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.368 | TFLOPs: 18.90 | 31: iteration 10130/ 173500 | consumed samples: 2593280 | consumed tokens: 5311037440 | elapsed time per iteration (s): 0.85 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.344788E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.231 | TFLOPs: 18.28 | 31: iteration 10140/ 173500 | consumed samples: 2595840 | consumed tokens: 5316280320 | elapsed time per iteration (s): 0.84 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.358302E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.634 | TFLOPs: 18.43 | 31: iteration 10150/ 173500 | consumed samples: 2598400 | consumed tokens: 5321523200 | elapsed time per iteration (s): 0.81 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.324755E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.689 | TFLOPs: 19.16 | 31: iteration 10160/ 173500 | consumed samples: 2600960 | consumed tokens: 5326766080 | elapsed time per iteration (s): 0.82 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.347257E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
311.479 | TFLOPs: 18.84 | 31: iteration 10170/ 173500 | consumed samples: 2603520 | consumed tokens: 5332008960 | elapsed time per iteration (s): 0.84 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.369527E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.338 | TFLOPs: 18.35 | 31: iteration 10180/ 173500 | consumed samples: 2606080 | consumed tokens: 5337251840 | elapsed time per iteration (s): 0.86 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.341445E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.443 | TFLOPs: 17.93 | 31: iteration 10190/ 173500 | consumed samples: 2608640 | consumed tokens: 5342494720 | elapsed time per iteration (s): 0.87 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.354604E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.247 | TFLOPs: 17.86 | 31: iteration 10200/ 173500 | consumed samples: 2611200 | consumed tokens: 5347737600 | elapsed time per iteration (s): 0.82 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.339079E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.067 | TFLOPs: 18.82 | 31: iteration 10210/ 173500 | consumed samples: 2613760 | consumed tokens: 5352980480 | elapsed time per iteration (s): 0.85 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.342462E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.860 | TFLOPs: 18.32 | 31: iteration 10220/ 173500 | consumed samples: 2616320 | consumed tokens: 5358223360 | elapsed time per iteration (s): 0.86 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.322584E+00 | grad norm: 0.164 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.381 | TFLOPs: 17.93 | 31: iteration 10230/ 173500 | consumed samples: 2618880 | consumed tokens: 5363466240 | elapsed time per iteration (s): 0.86 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.338878E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.939 | TFLOPs: 17.96 | 31: iteration 10240/ 173500 | consumed samples: 2621440 | consumed tokens: 5368709120 | elapsed time per iteration (s): 0.86 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.361374E+00 | grad norm: 0.539 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.366 | TFLOPs: 17.93 | 31: iteration 10250/ 173500 | consumed samples: 2624000 | consumed tokens: 5373952000 | elapsed time per iteration (s): 0.86 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.545679E+00 | grad norm: 0.443 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.291 | TFLOPs: 17.99 | 31: iteration 10260/ 173500 | consumed samples: 2626560 | consumed tokens: 5379194880 | elapsed time per iteration (s): 0.80 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.360189E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.476 | TFLOPs: 19.27 | 31: iteration 10270/ 173500 | consumed samples: 2629120 | consumed tokens: 5384437760 | elapsed time per iteration (s): 0.81 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.365007E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.301 | TFLOPs: 19.14 | 31: iteration 10280/ 173500 | consumed samples: 2631680 | consumed tokens: 5389680640 | elapsed time per iteration (s): 0.84 | 
learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.364343E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.691 | TFLOPs: 18.43 | 31: iteration 10290/ 173500 | consumed samples: 2634240 | consumed tokens: 5394923520 | elapsed time per iteration (s): 0.78 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.372901E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.952 | TFLOPs: 19.78 | 31: iteration 10300/ 173500 | consumed samples: 2636800 | consumed tokens: 5400166400 | elapsed time per iteration (s): 0.82 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.325529E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.486 | TFLOPs: 18.84 | 31: iteration 10310/ 173500 | consumed samples: 2639360 | consumed tokens: 5405409280 | elapsed time per iteration (s): 0.89 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.354848E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.848 | TFLOPs: 17.35 | 31: iteration 10320/ 173500 | consumed samples: 2641920 | consumed tokens: 5410652160 | elapsed time per iteration (s): 0.84 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.372821E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.946 | TFLOPs: 18.39 | 31: iteration 10330/ 173500 | consumed samples: 2644480 | consumed tokens: 5415895040 | elapsed time per iteration (s): 0.80 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.330558E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.976 | TFLOPs: 19.30 | 31: iteration 10340/ 173500 | 
consumed samples: 2647040 | consumed tokens: 5421137920 | elapsed time per iteration (s): 0.83 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.372578E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.680 | TFLOPs: 18.67 | 31: iteration 10350/ 173500 | consumed samples: 2649600 | consumed tokens: 5426380800 | elapsed time per iteration (s): 0.86 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.344235E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.083 | TFLOPs: 17.91 | 31: iteration 10360/ 173500 | consumed samples: 2652160 | consumed tokens: 5431623680 | elapsed time per iteration (s): 0.89 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.349869E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.194 | TFLOPs: 17.44 | 31: iteration 10370/ 173500 | consumed samples: 2654720 | consumed tokens: 5436866560 | elapsed time per iteration (s): 0.81 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.343555E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.259 | TFLOPs: 19.01 | 31: iteration 10380/ 173500 | consumed samples: 2657280 | consumed tokens: 5442109440 | elapsed time per iteration (s): 0.83 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.339013E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.590 | TFLOPs: 18.73 | 31: iteration 10390/ 173500 | consumed samples: 2659840 | consumed tokens: 5447352320 | elapsed time per iteration (s): 0.82 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.326926E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 311.211 | TFLOPs: 18.83 | 31: iteration 10400/ 173500 | consumed samples: 2662400 | consumed tokens: 5452595200 | elapsed time per iteration (s): 0.87 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.339166E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.063 | TFLOPs: 17.73 | 31: iteration 10410/ 173500 | consumed samples: 2664960 | consumed tokens: 5457838080 | elapsed time per iteration (s): 0.83 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.330587E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.011 | TFLOPs: 18.75 | 31: iteration 10420/ 173500 | consumed samples: 2667520 | consumed tokens: 5463080960 | elapsed time per iteration (s): 0.85 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.353189E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.752 | TFLOPs: 18.26 | 31: iteration 10430/ 173500 | consumed samples: 2670080 | consumed tokens: 5468323840 | elapsed time per iteration (s): 0.81 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.347483E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.372 | TFLOPs: 19.14 | 31: iteration 10440/ 173500 | consumed samples: 2672640 | consumed tokens: 5473566720 | elapsed time per iteration (s): 0.81 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.338709E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.675 | TFLOPs: 19.16 | 31: iteration 10450/ 173500 | consumed samples: 2675200 | consumed tokens: 5478809600 | elapsed time per iteration (s): 0.83 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 
2.325791E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.506 | TFLOPs: 18.66 | 31: iteration 10460/ 173500 | consumed samples: 2677760 | consumed tokens: 5484052480 | elapsed time per iteration (s): 0.85 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.332900E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.959 | TFLOPs: 18.27 | 31: iteration 10470/ 173500 | consumed samples: 2680320 | consumed tokens: 5489295360 | elapsed time per iteration (s): 0.85 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.336375E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.820 | TFLOPs: 18.26 | 31: iteration 10480/ 173500 | consumed samples: 2682880 | consumed tokens: 5494538240 | elapsed time per iteration (s): 0.80 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.332670E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.637 | TFLOPs: 19.40 | 31: iteration 10490/ 173500 | consumed samples: 2685440 | consumed tokens: 5499781120 | elapsed time per iteration (s): 0.80 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.327533E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.643 | TFLOPs: 19.40 | 31: iteration 10500/ 173500 | consumed samples: 2688000 | consumed tokens: 5505024000 | elapsed time per iteration (s): 0.82 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.331170E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.625 | TFLOPs: 18.79 | 31: iteration 10510/ 173500 | consumed samples: 2690560 | consumed tokens: 5510266880 | 
elapsed time per iteration (s): 0.77 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.346182E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.470 | TFLOPs: 20.05 | 31: iteration 10520/ 173500 | consumed samples: 2693120 | consumed tokens: 5515509760 | elapsed time per iteration (s): 0.87 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.377106E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.846 | TFLOPs: 17.84 | 31: iteration 10530/ 173500 | consumed samples: 2695680 | consumed tokens: 5520752640 | elapsed time per iteration (s): 0.78 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.334117E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.358 | TFLOPs: 19.74 | 31: iteration 10540/ 173500 | consumed samples: 2698240 | consumed tokens: 5525995520 | elapsed time per iteration (s): 0.84 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.327119E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.561 | TFLOPs: 18.49 | 31: iteration 10550/ 173500 | consumed samples: 2700800 | consumed tokens: 5531238400 | elapsed time per iteration (s): 0.83 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.316981E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.781 | TFLOPs: 18.56 | 31: iteration 10560/ 173500 | consumed samples: 2703360 | consumed tokens: 5536481280 | elapsed time per iteration (s): 0.85 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.323986E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.238 | TFLOPs: 
18.16 | 31: iteration 10570/ 173500 | consumed samples: 2705920 | consumed tokens: 5541724160 | elapsed time per iteration (s): 0.80 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.321797E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.999 | TFLOPs: 19.30 | 31: iteration 10580/ 173500 | consumed samples: 2708480 | consumed tokens: 5546967040 | elapsed time per iteration (s): 0.86 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.327467E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.445 | TFLOPs: 17.99 | 31: iteration 10590/ 173500 | consumed samples: 2711040 | consumed tokens: 5552209920 | elapsed time per iteration (s): 0.81 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.377214E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.326 | TFLOPs: 19.02 | 31: iteration 10600/ 173500 | consumed samples: 2713600 | consumed tokens: 5557452800 | elapsed time per iteration (s): 0.86 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.333193E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.333 | TFLOPs: 18.05 | 31: iteration 10610/ 173500 | consumed samples: 2716160 | consumed tokens: 5562695680 | elapsed time per iteration (s): 0.85 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.310281E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.660 | TFLOPs: 18.19 | 31: iteration 10620/ 173500 | consumed samples: 2718720 | consumed tokens: 5567938560 | elapsed time per iteration (s): 0.79 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.323096E+00 | grad norm: 0.163 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.868 | TFLOPs: 19.65 | 31: iteration 10630/ 173500 | consumed samples: 2721280 | consumed tokens: 5573181440 | elapsed time per iteration (s): 0.81 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.305956E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.829 | TFLOPs: 19.11 | 31: iteration 10640/ 173500 | consumed samples: 2723840 | consumed tokens: 5578424320 | elapsed time per iteration (s): 0.83 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.336115E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.262 | TFLOPs: 18.65 | 31: iteration 10650/ 173500 | consumed samples: 2726400 | consumed tokens: 5583667200 | elapsed time per iteration (s): 0.80 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.340664E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.232 | TFLOPs: 19.25 | 31: iteration 10660/ 173500 | consumed samples: 2728960 | consumed tokens: 5588910080 | elapsed time per iteration (s): 0.89 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.332905E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.104 | TFLOPs: 17.43 | 31: iteration 10670/ 173500 | consumed samples: 2731520 | consumed tokens: 5594152960 | elapsed time per iteration (s): 0.86 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.351892E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.726 | TFLOPs: 17.95 | 31: iteration 10680/ 173500 | consumed samples: 2734080 | consumed tokens: 5599395840 | elapsed time per iteration (s): 0.81 | learning rate: 1.988E-04 
| global batch size: 256 | lm loss: 2.327330E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.498 | TFLOPs: 19.03 | 31: iteration 10690/ 173500 | consumed samples: 2736640 | consumed tokens: 5604638720 | elapsed time per iteration (s): 0.83 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.343498E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.880 | TFLOPs: 18.75 | 31: iteration 10700/ 173500 | consumed samples: 2739200 | consumed tokens: 5609881600 | elapsed time per iteration (s): 0.80 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.328385E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.324 | TFLOPs: 19.26 | 31: iteration 10710/ 173500 | consumed samples: 2741760 | consumed tokens: 5615124480 | elapsed time per iteration (s): 0.84 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.349587E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.092 | TFLOPs: 18.46 | 31: iteration 10720/ 173500 | consumed samples: 2744320 | consumed tokens: 5620367360 | elapsed time per iteration (s): 0.80 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.357998E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.163 | TFLOPs: 19.31 | 31: iteration 10730/ 173500 | consumed samples: 2746880 | consumed tokens: 5625610240 | elapsed time per iteration (s): 0.77 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.325274E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.856 | TFLOPs: 20.20 | 31: iteration 10740/ 173500 | consumed samples: 2749440 | 
consumed tokens: 5630853120 | elapsed time per iteration (s): 0.76 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.306613E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.219 | TFLOPs: 20.40 | 31: iteration 10750/ 173500 | consumed samples: 2752000 | consumed tokens: 5636096000 | elapsed time per iteration (s): 0.72 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.331899E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.081 | TFLOPs: 21.60 | 31: iteration 10760/ 173500 | consumed samples: 2754560 | consumed tokens: 5641338880 | elapsed time per iteration (s): 3.25 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.335074E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 78.691 | TFLOPs: 4.76 | 31: iteration 10770/ 173500 | consumed samples: 2757120 | consumed tokens: 5646581760 | elapsed time per iteration (s): 0.80 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.298267E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.331 | TFLOPs: 19.32 | 31: iteration 10780/ 173500 | consumed samples: 2759680 | consumed tokens: 5651824640 | elapsed time per iteration (s): 0.75 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.294865E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.246 | TFLOPs: 20.52 | 31: iteration 10790/ 173500 | consumed samples: 2762240 | consumed tokens: 5657067520 | elapsed time per iteration (s): 0.74 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.339565E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 343.914 | TFLOPs: 20.81 | 31: iteration 10800/ 173500 | consumed samples: 2764800 | consumed tokens: 5662310400 | elapsed time per iteration (s): 0.75 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.303129E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.710 | TFLOPs: 20.61 | 31: iteration 10810/ 173500 | consumed samples: 2767360 | consumed tokens: 5667553280 | elapsed time per iteration (s): 0.75 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.312523E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.747 | TFLOPs: 20.67 | 31: iteration 10820/ 173500 | consumed samples: 2769920 | consumed tokens: 5672796160 | elapsed time per iteration (s): 0.76 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.319681E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.505 | TFLOPs: 20.48 | 31: iteration 10830/ 173500 | consumed samples: 2772480 | consumed tokens: 5678039040 | elapsed time per iteration (s): 0.74 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.325287E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.347 | TFLOPs: 21.01 | 31: iteration 10840/ 173500 | consumed samples: 2775040 | consumed tokens: 5683281920 | elapsed time per iteration (s): 0.77 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.320184E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.120 | TFLOPs: 20.03 | 31: iteration 10850/ 173500 | consumed samples: 2777600 | consumed tokens: 5688524800 | elapsed time per iteration (s): 0.80 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.288173E+00 | grad norm: 0.159 
| num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.986 | TFLOPs: 19.30 | 31: iteration 10860/ 173500 | consumed samples: 2780160 | consumed tokens: 5693767680 | elapsed time per iteration (s): 0.78 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.349404E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.235 | TFLOPs: 19.92 | 31: iteration 10870/ 173500 | consumed samples: 2782720 | consumed tokens: 5699010560 | elapsed time per iteration (s): 0.78 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.329675E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.423 | TFLOPs: 19.87 | 31: iteration 10880/ 173500 | consumed samples: 2785280 | consumed tokens: 5704253440 | elapsed time per iteration (s): 0.79 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.339910E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.716 | TFLOPs: 19.58 | 31: iteration 10890/ 173500 | consumed samples: 2787840 | consumed tokens: 5709496320 | elapsed time per iteration (s): 0.73 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.322531E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.836 | TFLOPs: 21.22 | 31: iteration 10900/ 173500 | consumed samples: 2790400 | consumed tokens: 5714739200 | elapsed time per iteration (s): 0.78 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.316218E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.140 | TFLOPs: 19.91 | 31: iteration 10910/ 173500 | consumed samples: 2792960 | consumed tokens: 5719982080 | elapsed time per iteration (s): 0.76 
| learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.347115E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.012 | TFLOPs: 20.45 | 31: iteration 10920/ 173500 | consumed samples: 2795520 | consumed tokens: 5725224960 | elapsed time per iteration (s): 0.74 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.299213E+00 | grad norm: 0.228 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.365 | TFLOPs: 20.83 | 31: iteration 10930/ 173500 | consumed samples: 2798080 | consumed tokens: 5730467840 | elapsed time per iteration (s): 0.74 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.308577E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.182 | TFLOPs: 21.00 | 31: iteration 10940/ 173500 | consumed samples: 2800640 | consumed tokens: 5735710720 | elapsed time per iteration (s): 0.80 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.281526E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.602 | TFLOPs: 19.34 | 31: iteration 10950/ 173500 | consumed samples: 2803200 | consumed tokens: 5740953600 | elapsed time per iteration (s): 0.78 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.341683E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.766 | TFLOPs: 19.83 | 31: iteration 10960/ 173500 | consumed samples: 2805760 | consumed tokens: 5746196480 | elapsed time per iteration (s): 0.76 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.314026E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.650 | TFLOPs: 20.31 | 31: iteration 10970/ 173500 | 
consumed samples: 2808320 | consumed tokens: 5751439360 | elapsed time per iteration (s): 0.79 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.319953E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.738 | TFLOPs: 19.71 | 31: iteration 10980/ 173500 | consumed samples: 2810880 | consumed tokens: 5756682240 | elapsed time per iteration (s): 0.81 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.316626E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.293 | TFLOPs: 19.13 | 31: iteration 10990/ 173500 | consumed samples: 2813440 | consumed tokens: 5761925120 | elapsed time per iteration (s): 0.78 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.328095E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.172 | TFLOPs: 19.91 | 31: iteration 11000/ 173500 | consumed samples: 2816000 | consumed tokens: 5767168000 | elapsed time per iteration (s): 0.81 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.326951E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.344 | TFLOPs: 19.20 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 11000 | lm loss value: 2.257739E+00 | lm loss PPL: 9.561445E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 11000 to checkpoints_1b1long 0: [2022-11-25 20:34:04,948] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step11000 is begin to save! 
0: [2022-11-25 20:34:04,959] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_01-model_00-model_states.pt... 0: [2022-11-25 20:34:05,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_01-model_00-model_states.pt. 0: [2022-11-25 20:34:05,158] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_03-model_00-model_states.pt... 0: [2022-11-25 20:34:05,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_03-model_00-model_states.pt. 0: [2022-11-25 20:34:05,240] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_04-model_00-model_states.pt... 0: [2022-11-25 20:34:05,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_04-model_00-model_states.pt. 0: [2022-11-25 20:34:05,322] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_05-model_00-model_states.pt... 0: [2022-11-25 20:34:05,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_05-model_00-model_states.pt. 0: [2022-11-25 20:34:05,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_06-model_00-model_states.pt... 0: [2022-11-25 20:34:05,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_06-model_00-model_states.pt. 0: [2022-11-25 20:34:05,471] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_07-model_00-model_states.pt... 0: [2022-11-25 20:34:05,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_07-model_00-model_states.pt. 
0: [2022-11-25 20:34:05,542] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_08-model_00-model_states.pt... 0: [2022-11-25 20:34:05,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_08-model_00-model_states.pt. 0: [2022-11-25 20:34:05,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_09-model_00-model_states.pt... 0: [2022-11-25 20:34:05,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_09-model_00-model_states.pt. 0: [2022-11-25 20:34:05,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_10-model_00-model_states.pt... 0: [2022-11-25 20:34:05,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_10-model_00-model_states.pt. 0: [2022-11-25 20:34:05,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_11-model_00-model_states.pt... 0: [2022-11-25 20:34:05,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_11-model_00-model_states.pt. 0: [2022-11-25 20:34:05,836] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_12-model_00-model_states.pt... 0: [2022-11-25 20:34:05,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_12-model_00-model_states.pt. 0: [2022-11-25 20:34:05,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_13-model_00-model_states.pt... 0: [2022-11-25 20:34:05,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_13-model_00-model_states.pt. 
0: [2022-11-25 20:34:05,985] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_14-model_00-model_states.pt... 0: [2022-11-25 20:34:06,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_14-model_00-model_states.pt. 0: [2022-11-25 20:34:06,062] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_15-model_00-model_states.pt... 0: [2022-11-25 20:34:06,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_15-model_00-model_states.pt. 0: [2022-11-25 20:34:06,135] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_16-model_00-model_states.pt... 0: [2022-11-25 20:34:06,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_16-model_00-model_states.pt. 0: [2022-11-25 20:34:06,207] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_17-model_00-model_states.pt... 0: [2022-11-25 20:34:06,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_17-model_00-model_states.pt. 0: [2022-11-25 20:34:06,283] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_18-model_00-model_states.pt... 0: [2022-11-25 20:34:06,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_18-model_00-model_states.pt. 0: [2022-11-25 20:34:06,358] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_19-model_00-model_states.pt... 0: [2022-11-25 20:34:06,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_19-model_00-model_states.pt. 
0: [2022-11-25 20:34:06,433] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_20-model_00-model_states.pt... 0: [2022-11-25 20:34:06,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_20-model_00-model_states.pt. 0: [2022-11-25 20:34:06,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_21-model_00-model_states.pt... 0: [2022-11-25 20:34:06,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_21-model_00-model_states.pt. 0: [2022-11-25 20:34:06,583] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_22-model_00-model_states.pt... 0: [2022-11-25 20:34:06,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_22-model_00-model_states.pt. 0: [2022-11-25 20:34:06,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_23-model_00-model_states.pt... 0: [2022-11-25 20:34:06,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_23-model_00-model_states.pt. 0: [2022-11-25 20:34:06,733] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_24-model_00-model_states.pt... 0: [2022-11-25 20:34:06,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_24-model_00-model_states.pt. 0: [2022-11-25 20:34:06,807] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_25-model_00-model_states.pt... 0: [2022-11-25 20:34:06,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_25-model_00-model_states.pt. 
0: [2022-11-25 20:34:06,882] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_26-model_00-model_states.pt... 0: [2022-11-25 20:34:06,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_26-model_00-model_states.pt. 0: [2022-11-25 20:34:06,958] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_27-model_00-model_states.pt... 0: [2022-11-25 20:34:07,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_27-model_00-model_states.pt. 0: [2022-11-25 20:34:07,030] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_28-model_00-model_states.pt... 0: [2022-11-25 20:34:07,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_28-model_00-model_states.pt. 0: [2022-11-25 20:34:07,104] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/layer_30-model_00-model_states.pt... 0: [2022-11-25 20:34:07,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/layer_30-model_00-model_states.pt. 0: [2022-11-25 20:34:07,107] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step11000/mp_rank_00_model_states.pt 0: [2022-11-25 20:34:07,107] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/mp_rank_00_model_states.pt... 0: [2022-11-25 20:34:07,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/mp_rank_00_model_states.pt. 0: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
0: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 
9: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
1: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
15: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 
24: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 
22: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 
18: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 27: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 
0: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
1: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 
12: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 
25: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 
24: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 
22: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 
26: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 27: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
5: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 
16: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
12: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 20: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
28: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 
17: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 26: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 27: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 
16: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 20: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 23: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 
24: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 29: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 30: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 21: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
7: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 29: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 19: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 
19: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 29: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 19: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 
19: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-25 20:34:07,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:34:07,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:34:07,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:34:07,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:34:07,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 16: [2022-11-25 20:34:07,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:34:07,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 25: [2022-11-25 20:34:07,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 23: [2022-11-25 20:34:07,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:34:07,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
14: [2022-11-25 20:34:07,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 22: [2022-11-25 20:34:07,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 30: [2022-11-25 20:34:07,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 21: [2022-11-25 20:34:07,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 19: [2022-11-25 20:34:07,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 27: [2022-11-25 20:34:07,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 
6: [2022-11-25 20:34:07,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 9: [2022-11-25 20:34:07,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 8: [2022-11-25 20:34:07,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 16: [2022-11-25 20:34:07,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 12: [2022-11-25 20:34:07,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 25: [2022-11-25 20:34:07,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 23: [2022-11-25 20:34:07,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 11: [2022-11-25 20:34:07,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 14: [2022-11-25 20:34:07,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 22: [2022-11-25 20:34:07,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 30: [2022-11-25 20:34:07,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved 
checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 21: [2022-11-25 20:34:07,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 19: [2022-11-25 20:34:07,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 27: [2022-11-25 20:34:07,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 6: [2022-11-25 20:34:07,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 9: [2022-11-25 20:34:07,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 8: [2022-11-25 20:34:07,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 16: [2022-11-25 20:34:07,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 12: [2022-11-25 20:34:07,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 25: [2022-11-25 20:34:07,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 23: [2022-11-25 20:34:07,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 11: [2022-11-25 20:34:07,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 14: [2022-11-25 20:34:07,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 22: [2022-11-25 20:34:07,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
30: [2022-11-25 20:34:07,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 21: [2022-11-25 20:34:07,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 19: [2022-11-25 20:34:07,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 27: [2022-11-25 20:34:07,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 16: [2022-11-25 20:34:07,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-25 20:34:07,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-25 20:34:07,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 21: [2022-11-25 20:34:07,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-25 20:34:07,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-25 20:34:07,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 10: [2022-11-25 20:34:07,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:34:07,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 20:34:07,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
30: [2022-11-25 20:34:07,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-25 20:34:07,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-25 20:34:07,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 6: [2022-11-25 20:34:07,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:34:07,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:34:07,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 5: [2022-11-25 20:34:07,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 6: [2022-11-25 20:34:07,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 5: [2022-11-25 20:34:07,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 31: [2022-11-25 20:34:07,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-25 20:34:07,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-25 20:34:07,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
17: [2022-11-25 20:34:07,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-25 20:34:07,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-25 20:34:07,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 5: [2022-11-25 20:34:07,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:34:07,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:34:07,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:34:07,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:34:07,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:34:07,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:34:07,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:34:07,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-25 20:34:07,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 20: [2022-11-25 20:34:07,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 25: [2022-11-25 20:34:07,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:34:07,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 28: [2022-11-25 20:34:07,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 24: [2022-11-25 20:34:07,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:34:07,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 29: [2022-11-25 20:34:07,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 22: [2022-11-25 20:34:07,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
5: [2022-11-25 20:34:07,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 4: [2022-11-25 20:34:07,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 9: [2022-11-25 20:34:07,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 8: [2022-11-25 20:34:07,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 13: [2022-11-25 20:34:07,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 20:34:07,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 3: [2022-11-25 20:34:07,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 15: [2022-11-25 20:34:07,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 20:34:07,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 20: [2022-11-25 20:34:07,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 25: [2022-11-25 20:34:07,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved 
checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 11: [2022-11-25 20:34:07,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 28: [2022-11-25 20:34:07,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 24: [2022-11-25 20:34:07,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 14: [2022-11-25 20:34:07,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 29: [2022-11-25 20:34:07,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 22: [2022-11-25 20:34:07,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 5: [2022-11-25 20:34:07,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 4: [2022-11-25 20:34:07,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 9: [2022-11-25 20:34:07,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 8: [2022-11-25 20:34:07,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 1: [2022-11-25 20:34:07,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
13: [2022-11-25 20:34:07,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 13: [2022-11-25 20:34:07,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 3: [2022-11-25 20:34:07,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 15: [2022-11-25 20:34:07,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 15: [2022-11-25 20:34:07,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 20: [2022-11-25 20:34:07,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 25: [2022-11-25 20:34:07,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 11: [2022-11-25 20:34:07,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 28: [2022-11-25 20:34:07,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 24: [2022-11-25 20:34:07,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 14: [2022-11-25 20:34:07,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 29: [2022-11-25 20:34:07,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 22: [2022-11-25 20:34:07,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 21: [2022-11-25 20:34:07,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:34:07,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
1: [2022-11-25 20:34:07,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 28: [2022-11-25 20:34:07,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 24: [2022-11-25 20:34:07,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-25 20:34:07,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-25 20:34:07,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 29: [2022-11-25 20:34:07,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 21: [2022-11-25 20:34:07,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 4: [2022-11-25 20:34:07,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 1: [2022-11-25 20:34:07,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
28: [2022-11-25 20:34:07,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 29: [2022-11-25 20:34:07,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 21: [2022-11-25 20:34:07,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 4: [2022-11-25 20:34:07,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 28: [2022-11-25 20:34:07,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 29: [2022-11-25 20:34:07,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 4: [2022-11-25 20:34:07,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:34:07,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 20:34:07,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 22: [2022-11-25 20:34:07,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:34:07,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-25 20:34:07,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 22: [2022-11-25 20:34:07,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-25 20:34:07,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 12: [2022-11-25 20:34:07,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 15: [2022-11-25 20:34:07,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:34:07,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 20:34:07,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 31: [2022-11-25 20:34:07,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-25 20:34:07,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-25 20:34:07,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 19: [2022-11-25 20:34:07,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
19: [2022-11-25 20:34:07,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-25 20:34:07,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 20: [2022-11-25 20:34:07,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-25 20:34:07,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-25 20:34:07,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 12: [2022-11-25 20:34:07,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:34:07,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 20:34:07,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 3: [2022-11-25 20:34:07,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 23: [2022-11-25 20:34:07,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 24: [2022-11-25 20:34:07,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
23: [2022-11-25 20:34:07,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 24: [2022-11-25 20:34:07,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 3: [2022-11-25 20:34:07,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 23: [2022-11-25 20:34:07,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 24: [2022-11-25 20:34:07,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 3: [2022-11-25 20:34:07,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 23: [2022-11-25 20:34:07,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-25 20:34:07,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-25 20:34:07,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 30: [2022-11-25 20:34:07,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-25 20:34:07,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-25 20:34:07,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
24: [2022-11-25 20:34:07,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-25 20:34:07,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-25 20:34:07,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 4: [2022-11-25 20:34:07,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:34:07,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:34:07,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:34:07,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:34:07,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:34:07,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 25: [2022-11-25 20:34:07,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 28: [2022-11-25 20:34:07,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 
31: [2022-11-25 20:34:07,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 29: [2022-11-25 20:34:07,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 27: [2022-11-25 20:34:07,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:34:07,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 9: [2022-11-25 20:34:07,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 8: [2022-11-25 20:34:07,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 10: [2022-11-25 20:34:07,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 3: [2022-11-25 20:34:07,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 15: [2022-11-25 20:34:07,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 25: [2022-11-25 20:34:07,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 28: [2022-11-25 20:34:07,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved 
checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 31: [2022-11-25 20:34:07,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 29: [2022-11-25 20:34:07,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 27: [2022-11-25 20:34:07,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 4: [2022-11-25 20:34:07,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 9: [2022-11-25 20:34:07,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 8: [2022-11-25 20:34:07,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 10: [2022-11-25 20:34:07,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 3: [2022-11-25 20:34:07,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 15: [2022-11-25 20:34:07,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 25: [2022-11-25 20:34:07,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 28: [2022-11-25 20:34:07,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 31: [2022-11-25 20:34:07,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 29: [2022-11-25 20:34:07,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
27: [2022-11-25 20:34:07,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 8: [2022-11-25 20:34:07,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:34:07,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 20:34:07,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 5: [2022-11-25 20:34:07,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:34:07,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:34:07,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 26: [2022-11-25 20:34:07,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 
5: [2022-11-25 20:34:07,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 11: [2022-11-25 20:34:07,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 14: [2022-11-25 20:34:07,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 26: [2022-11-25 20:34:07,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:34:07,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 11: [2022-11-25 20:34:07,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 14: [2022-11-25 20:34:07,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 26: [2022-11-25 20:34:07,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 23: [2022-11-25 20:34:07,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 26: [2022-11-25 20:34:07,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-25 20:34:07,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
23: [2022-11-25 20:34:07,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 26: [2022-11-25 20:34:07,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 23: [2022-11-25 20:34:07,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 30: [2022-11-25 20:34:07,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-25 20:34:07,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-25 20:34:07,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 3: [2022-11-25 20:34:07,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 18: [2022-11-25 20:34:07,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:34:07,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 18: [2022-11-25 20:34:07,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 3: [2022-11-25 20:34:07,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 18: [2022-11-25 20:34:07,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
0: [2022-11-25 20:34:07,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:34:07,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:34:07,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:34:07,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:34:07,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:34:07,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 28: [2022-11-25 20:34:07,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 22: [2022-11-25 20:34:07,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 17: [2022-11-25 20:34:07,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 27: [2022-11-25 20:34:07,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:34:07,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-25 20:34:07,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 6: [2022-11-25 20:34:07,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 5: [2022-11-25 20:34:07,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 9: [2022-11-25 20:34:07,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 1: [2022-11-25 20:34:07,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 28: [2022-11-25 20:34:07,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 22: [2022-11-25 20:34:07,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 17: [2022-11-25 20:34:07,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 27: [2022-11-25 20:34:07,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 0: [2022-11-25 20:34:07,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 6: [2022-11-25 20:34:07,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
5: [2022-11-25 20:34:07,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 9: [2022-11-25 20:34:07,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 1: [2022-11-25 20:34:07,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 28: [2022-11-25 20:34:07,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 22: [2022-11-25 20:34:07,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 17: [2022-11-25 20:34:07,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 27: [2022-11-25 20:34:07,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 0: [2022-11-25 20:34:07,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 1: [2022-11-25 20:34:07,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 17: [2022-11-25 20:34:07,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:34:07,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 1: [2022-11-25 20:34:07,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 17: [2022-11-25 20:34:07,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-25 20:34:07,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
17: [2022-11-25 20:34:07,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-25 20:34:07,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-25 20:34:07,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 0: [2022-11-25 20:34:07,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 20:34:07,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 7: [2022-11-25 20:34:07,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:34:07,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:34:07,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 20:34:07,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 19: [2022-11-25 20:34:07,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:34:07,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 7: [2022-11-25 20:34:07,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
19: [2022-11-25 20:34:07,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-25 20:34:07,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 21: [2022-11-25 20:34:07,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 25: [2022-11-25 20:34:07,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-25 20:34:07,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 21: [2022-11-25 20:34:07,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 25: [2022-11-25 20:34:07,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 2: [2022-11-25 20:34:07,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:34:07,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-25 20:34:07,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 20:34:07,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 21: [2022-11-25 20:34:07,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 2: [2022-11-25 20:34:07,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 2: [2022-11-25 20:34:07,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 26: [2022-11-25 20:34:07,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-25 20:34:07,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-25 20:34:07,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 12: [2022-11-25 20:34:07,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:34:07,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 20:34:07,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 7: [2022-11-25 20:34:07,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-25 20:34:07,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 20:34:07,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 20: [2022-11-25 20:34:07,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-25 20:34:07,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-25 20:34:07,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-25 20:34:07,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 20: [2022-11-25 20:34:07,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-25 20:34:07,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 31: [2022-11-25 20:34:07,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-25 20:34:07,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-25 20:34:07,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 0: [2022-11-25 20:34:07,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-25 20:34:07,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 20:34:07,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 5: [2022-11-25 20:34:07,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:34:07,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:34:07,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 29: [2022-11-25 20:34:07,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:34:07,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 1: [2022-11-25 20:34:07,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 14: [2022-11-25 20:34:07,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 29: [2022-11-25 20:34:07,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 1: [2022-11-25 20:34:07,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
14: [2022-11-25 20:34:07,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 29: [2022-11-25 20:34:07,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 5: [2022-11-25 20:34:07,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 1: [2022-11-25 20:34:07,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:34:07,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 20:34:07,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 13: [2022-11-25 20:34:07,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:34:07,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 20:34:07,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 13: [2022-11-25 20:34:07,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:34:07,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 20:34:07,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
11: [2022-11-25 20:34:07,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 22: [2022-11-25 20:34:07,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:34:07,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 22: [2022-11-25 20:34:07,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 11: [2022-11-25 20:34:07,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 22: [2022-11-25 20:34:07,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 10: [2022-11-25 20:34:07,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:34:07,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 20:34:07,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 2: [2022-11-25 20:34:07,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:34:07,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 20:34:07,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
6: [2022-11-25 20:34:07,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:34:07,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 20:34:07,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 8: [2022-11-25 20:34:07,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:34:07,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 28: [2022-11-25 20:34:07,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:34:07,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 19: [2022-11-25 20:34:07,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
8: [2022-11-25 20:34:07,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 2: [2022-11-25 20:34:07,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 28: [2022-11-25 20:34:07,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 14: [2022-11-25 20:34:07,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 19: [2022-11-25 20:34:07,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 8: [2022-11-25 20:34:07,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 2: [2022-11-25 20:34:07,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 28: [2022-11-25 20:34:07,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 19: [2022-11-25 20:34:07,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 14: [2022-11-25 20:34:07,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 6: [2022-11-25 20:34:07,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-25 20:34:07,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 20:34:07,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 19: [2022-11-25 20:34:07,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-25 20:34:07,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-25 20:34:07,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 10: [2022-11-25 20:34:07,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:34:07,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 20:34:07,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 9: [2022-11-25 20:34:07,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:34:07,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 20:34:07,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 27: [2022-11-25 20:34:07,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
27: [2022-11-25 20:34:07,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-25 20:34:07,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 18: [2022-11-25 20:34:07,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-25 20:34:07,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-25 20:34:07,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 15: [2022-11-25 20:34:07,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 24: [2022-11-25 20:34:07,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 17: [2022-11-25 20:34:07,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 
15: [2022-11-25 20:34:07,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 24: [2022-11-25 20:34:07,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 17: [2022-11-25 20:34:07,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 15: [2022-11-25 20:34:07,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 24: [2022-11-25 20:34:07,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 17: [2022-11-25 20:34:07,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 16: [2022-11-25 20:34:07,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-25 20:34:07,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-25 20:34:07,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 16: [2022-11-25 20:34:07,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-25 20:34:07,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-25 20:34:07,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
4: [2022-11-25 20:34:07,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 26: [2022-11-25 20:34:07,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:34:07,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 26: [2022-11-25 20:34:07,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 4: [2022-11-25 20:34:07,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 26: [2022-11-25 20:34:07,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 25: [2022-11-25 20:34:07,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-25 20:34:07,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-25 20:34:07,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 3: [2022-11-25 20:34:07,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 30: [2022-11-25 20:34:07,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
3: [2022-11-25 20:34:07,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 30: [2022-11-25 20:34:07,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 3: [2022-11-25 20:34:07,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 30: [2022-11-25 20:34:07,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 7: [2022-11-25 20:34:07,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 18: [2022-11-25 20:34:07,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:34:07,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 18: [2022-11-25 20:34:07,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 7: [2022-11-25 20:34:07,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 18: [2022-11-25 20:34:07,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 18: [2022-11-25 20:34:07,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-25 20:34:07,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
18: [2022-11-25 20:34:07,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-25 20:34:07,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-25 20:34:07,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 18: [2022-11-25 20:34:07,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 13: [2022-11-25 20:34:07,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:34:07,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:34:07,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 10: [2022-11-25 20:34:07,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 13: [2022-11-25 20:34:07,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 10: [2022-11-25 20:34:07,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 2: [2022-11-25 20:34:07,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-25 20:34:07,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 20:34:07,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 0: [2022-11-25 20:34:07,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 23: [2022-11-25 20:34:07,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:34:07,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 23: [2022-11-25 20:34:07,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 11: [2022-11-25 20:34:07,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 0: [2022-11-25 20:34:07,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 23: [2022-11-25 20:34:07,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 11: [2022-11-25 20:34:07,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 0: [2022-11-25 20:34:07,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 16: [2022-11-25 20:34:07,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
21: [2022-11-25 20:34:07,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 16: [2022-11-25 20:34:07,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 21: [2022-11-25 20:34:07,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 16: [2022-11-25 20:34:07,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 21: [2022-11-25 20:34:07,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 19: [2022-11-25 20:34:07,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 27: [2022-11-25 20:34:07,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-25 20:34:07,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 19: [2022-11-25 20:34:07,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 27: [2022-11-25 20:34:07,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 19: [2022-11-25 20:34:07,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 12: [2022-11-25 20:34:07,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
31: [2022-11-25 20:34:07,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 29: [2022-11-25 20:34:07,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:34:07,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 31: [2022-11-25 20:34:07,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 29: [2022-11-25 20:34:07,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 12: [2022-11-25 20:34:07,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 31: [2022-11-25 20:34:07,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 29: [2022-11-25 20:34:07,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 20: [2022-11-25 20:34:07,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-25 20:34:07,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-25 20:34:07,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 1: [2022-11-25 20:34:07,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
22: [2022-11-25 20:34:07,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:34:07,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 22: [2022-11-25 20:34:07,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-25 20:34:07,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 1: [2022-11-25 20:34:07,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 28: [2022-11-25 20:34:07,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-25 20:34:07,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-25 20:34:07,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 6: [2022-11-25 20:34:07,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:34:07,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 20:34:07,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 8: [2022-11-25 20:34:07,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
14: [2022-11-25 20:34:07,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:34:07,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:34:07,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 2: [2022-11-25 20:34:07,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:34:07,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 8: [2022-11-25 20:34:07,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 2: [2022-11-25 20:34:07,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 14: [2022-11-25 20:34:07,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 9: [2022-11-25 20:34:07,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 20:34:07,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 2: [2022-11-25 20:34:07,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 7: [2022-11-25 20:34:07,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
15: [2022-11-25 20:34:07,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:34:07,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 24: [2022-11-25 20:34:07,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 17: [2022-11-25 20:34:07,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 26: [2022-11-25 20:34:07,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:34:07,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 15: [2022-11-25 20:34:07,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 24: [2022-11-25 20:34:07,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 17: [2022-11-25 20:34:07,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 26: [2022-11-25 20:34:07,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 7: [2022-11-25 20:34:07,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
15: [2022-11-25 20:34:07,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 25: [2022-11-25 20:34:07,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:34:07,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 24: [2022-11-25 20:34:07,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 17: [2022-11-25 20:34:07,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 26: [2022-11-25 20:34:07,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 25: [2022-11-25 20:34:07,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 11: [2022-11-25 20:34:07,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 25: [2022-11-25 20:34:07,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 4: [2022-11-25 20:34:07,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:34:07,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 20:34:07,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 18: [2022-11-25 20:34:07,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
18: [2022-11-25 20:34:07,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-25 20:34:07,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 10: [2022-11-25 20:34:07,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 30: [2022-11-25 20:34:07,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:34:07,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 30: [2022-11-25 20:34:07,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 10: [2022-11-25 20:34:07,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 30: [2022-11-25 20:34:07,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 0: [2022-11-25 20:34:07,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:34:07,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 20:34:07,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 3: [2022-11-25 20:34:07,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-25 20:34:07,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 20:34:07,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 12: [2022-11-25 20:34:07,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:34:07,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 20:34:07,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 21: [2022-11-25 20:34:07,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-25 20:34:07,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-25 20:34:07,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 13: [2022-11-25 20:34:07,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:34:07,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 20:34:07,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 31: [2022-11-25 20:34:07,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-25 20:34:07,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-25 20:34:07,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 29: [2022-11-25 20:34:07,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-25 20:34:07,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-25 20:34:07,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 27: [2022-11-25 20:34:07,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-25 20:34:07,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 20: [2022-11-25 20:34:07,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-25 20:34:07,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 27: [2022-11-25 20:34:07,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 20: [2022-11-25 20:34:07,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 16: [2022-11-25 20:34:07,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
16: [2022-11-25 20:34:07,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-25 20:34:07,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 19: [2022-11-25 20:34:07,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-25 20:34:07,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-25 20:34:07,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 6: [2022-11-25 20:34:07,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:34:07,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 9: [2022-11-25 20:34:07,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:34:07,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 22: [2022-11-25 20:34:07,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
9: [2022-11-25 20:34:07,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 22: [2022-11-25 20:34:07,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 9: [2022-11-25 20:34:07,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 14: [2022-11-25 20:34:07,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 22: [2022-11-25 20:34:07,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 14: [2022-11-25 20:34:07,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 6: [2022-11-25 20:34:07,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 5: [2022-11-25 20:34:07,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 23: [2022-11-25 20:34:07,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-25 20:34:07,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 5: [2022-11-25 20:34:07,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 23: [2022-11-25 20:34:07,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
5: [2022-11-25 20:34:07,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 20:34:07,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 20:34:07,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 5: [2022-11-25 20:34:07,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 15: [2022-11-25 20:34:07,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:34:07,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 28: [2022-11-25 20:34:07,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:34:07,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 28: [2022-11-25 20:34:07,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-25 20:34:07,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 17: [2022-11-25 20:34:07,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 
17: [2022-11-25 20:34:07,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-25 20:34:07,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 4: [2022-11-25 20:34:07,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:34:07,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 20:34:07,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 1: [2022-11-25 20:34:07,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:34:07,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 20:34:07,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 24: [2022-11-25 20:34:07,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-25 20:34:07,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-25 20:34:07,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 26: [2022-11-25 20:34:07,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-25 20:34:07,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-25 20:34:07,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 2: [2022-11-25 20:34:07,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:34:07,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 20:34:07,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 18: [2022-11-25 20:34:07,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-25 20:34:07,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-25 20:34:07,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 3: [2022-11-25 20:34:07,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 25: [2022-11-25 20:34:07,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
3: [2022-11-25 20:34:07,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 25: [2022-11-25 20:34:07,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 3: [2022-11-25 20:34:07,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 25: [2022-11-25 20:34:07,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 30: [2022-11-25 20:34:07,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-25 20:34:07,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-25 20:34:07,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 0: [2022-11-25 20:34:07,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:34:07,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 21: [2022-11-25 20:34:07,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-25 20:34:07,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 0: [2022-11-25 20:34:07,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 10: [2022-11-25 20:34:07,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 21: [2022-11-25 20:34:07,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 0: [2022-11-25 20:34:07,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 10: [2022-11-25 20:34:07,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 13: [2022-11-25 20:34:07,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:34:07,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 19: [2022-11-25 20:34:07,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:34:07,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 20:34:07,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
12: [2022-11-25 20:34:07,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 19: [2022-11-25 20:34:07,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 12: [2022-11-25 20:34:07,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 19: [2022-11-25 20:34:07,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 11: [2022-11-25 20:34:07,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:34:07,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 20:34:07,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 7: [2022-11-25 20:34:07,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:34:07,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 16: [2022-11-25 20:34:07,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 23: [2022-11-25 20:34:07,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 29: [2022-11-25 20:34:07,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
27: [2022-11-25 20:34:07,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:34:07,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 16: [2022-11-25 20:34:07,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 23: [2022-11-25 20:34:07,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 7: [2022-11-25 20:34:07,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 1: [2022-11-25 20:34:07,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 16: [2022-11-25 20:34:07,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 23: [2022-11-25 20:34:07,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 29: [2022-11-25 20:34:07,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 27: [2022-11-25 20:34:07,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 1: [2022-11-25 20:34:07,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 29: [2022-11-25 20:34:07,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
27: [2022-11-25 20:34:07,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 22: [2022-11-25 20:34:07,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-25 20:34:07,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-25 20:34:07,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 14: [2022-11-25 20:34:07,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:34:07,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:34:07,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 31: [2022-11-25 20:34:07,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 24: [2022-11-25 20:34:07,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
9: [2022-11-25 20:34:07,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 24: [2022-11-25 20:34:07,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 14: [2022-11-25 20:34:07,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 31: [2022-11-25 20:34:07,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 9: [2022-11-25 20:34:07,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 24: [2022-11-25 20:34:07,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 31: [2022-11-25 20:34:07,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 8: [2022-11-25 20:34:07,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:34:07,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 20:34:07,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 6: [2022-11-25 20:34:07,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-25 20:34:07,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 20:34:07,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 25: [2022-11-25 20:34:07,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-25 20:34:07,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-25 20:34:07,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 5: [2022-11-25 20:34:07,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:34:07,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 20: [2022-11-25 20:34:07,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 28: [2022-11-25 20:34:07,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 30: [2022-11-25 20:34:07,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 18: [2022-11-25 20:34:07,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
5: [2022-11-25 20:34:07,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 2: [2022-11-25 20:34:07,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 20: [2022-11-25 20:34:07,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 28: [2022-11-25 20:34:07,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 30: [2022-11-25 20:34:07,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 5: [2022-11-25 20:34:07,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 2: [2022-11-25 20:34:07,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 20: [2022-11-25 20:34:07,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 30: [2022-11-25 20:34:07,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 18: [2022-11-25 20:34:07,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-25 20:34:07,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 7: [2022-11-25 20:34:07,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
4: [2022-11-25 20:34:07,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:34:07,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 28: [2022-11-25 20:34:07,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 17: [2022-11-25 20:34:07,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:34:07,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 4: [2022-11-25 20:34:07,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 17: [2022-11-25 20:34:07,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 7: [2022-11-25 20:34:07,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 4: [2022-11-25 20:34:07,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 11: [2022-11-25 20:34:07,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 20:34:07,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 17: [2022-11-25 20:34:07,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
31: [2022-11-25 20:34:07,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-25 20:34:07,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-25 20:34:07,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 3: [2022-11-25 20:34:07,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:34:07,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 20:34:07,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 26: [2022-11-25 20:34:07,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-25 20:34:07,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 29: [2022-11-25 20:34:07,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 21: [2022-11-25 20:34:07,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
21: [2022-11-25 20:34:07,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 29: [2022-11-25 20:34:07,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 21: [2022-11-25 20:34:07,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 29: [2022-11-25 20:34:07,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 26: [2022-11-25 20:34:07,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 0: [2022-11-25 20:34:07,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:34:07,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:34:07,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 20: [2022-11-25 20:34:07,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:34:07,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 20:34:07,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
10: [2022-11-25 20:34:07,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 12: [2022-11-25 20:34:07,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 20: [2022-11-25 20:34:07,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 10: [2022-11-25 20:34:07,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 12: [2022-11-25 20:34:07,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 20: [2022-11-25 20:34:07,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 7: [2022-11-25 20:34:07,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:34:07,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 20:34:07,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 16: [2022-11-25 20:34:07,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 26: [2022-11-25 20:34:07,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 27: [2022-11-25 20:34:07,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
16: [2022-11-25 20:34:07,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-25 20:34:07,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 26: [2022-11-25 20:34:07,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-25 20:34:07,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 27: [2022-11-25 20:34:07,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-25 20:34:07,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 13: [2022-11-25 20:34:07,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:34:07,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 20:34:07,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 8: [2022-11-25 20:34:07,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:34:07,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 20:34:07,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
15: [2022-11-25 20:34:07,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:34:07,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 20:34:07,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 23: [2022-11-25 20:34:07,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-25 20:34:07,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step11000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-25 20:34:07,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 0: successfully saved checkpoint at iteration 11000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2481.46 31: iteration 11010/ 173500 | consumed samples: 2818560 | consumed tokens: 5772410880 | elapsed time per iteration (s): 1.04 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.303839E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.021 | TFLOPs: 14.82 | 31: iteration 11020/ 173500 | consumed samples: 2821120 | consumed tokens: 5777653760 | elapsed time per iteration (s): 0.81 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.352710E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.625 | TFLOPs: 19.22 | 31: iteration 11030/ 173500 | consumed samples: 2823680 | consumed tokens: 5782896640 | elapsed time per iteration (s): 0.80 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.293601E+00 | 
grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.481 | TFLOPs: 19.45 | 31: iteration 11040/ 173500 | consumed samples: 2826240 | consumed tokens: 5788139520 | elapsed time per iteration (s): 0.83 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.319509E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.047 | TFLOPs: 18.70 | 31: iteration 11050/ 173500 | consumed samples: 2828800 | consumed tokens: 5793382400 | elapsed time per iteration (s): 0.87 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.336462E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.787 | TFLOPs: 17.89 | 31: iteration 11060/ 173500 | consumed samples: 2831360 | consumed tokens: 5798625280 | elapsed time per iteration (s): 0.72 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.300184E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.538 | TFLOPs: 21.45 | 31: iteration 11070/ 173500 | consumed samples: 2833920 | consumed tokens: 5803868160 | elapsed time per iteration (s): 0.85 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.308818E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.227 | TFLOPs: 18.22 | 31: iteration 11080/ 173500 | consumed samples: 2836480 | consumed tokens: 5809111040 | elapsed time per iteration (s): 0.81 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.328611E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.873 | TFLOPs: 19.23 | 31: iteration 11090/ 173500 | consumed samples: 2839040 | consumed tokens: 5814353920 | elapsed time per 
iteration (s): 0.86 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.310374E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.344 | TFLOPs: 18.11 | 31: iteration 11100/ 173500 | consumed samples: 2841600 | consumed tokens: 5819596800 | elapsed time per iteration (s): 0.78 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.335678E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.867 | TFLOPs: 19.90 | 31: iteration 11110/ 173500 | consumed samples: 2844160 | consumed tokens: 5824839680 | elapsed time per iteration (s): 0.79 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.342899E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.928 | TFLOPs: 19.54 | 31: iteration 11120/ 173500 | consumed samples: 2846720 | consumed tokens: 5830082560 | elapsed time per iteration (s): 0.84 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.329130E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.087 | TFLOPs: 18.46 | 31: iteration 11130/ 173500 | consumed samples: 2849280 | consumed tokens: 5835325440 | elapsed time per iteration (s): 0.73 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.292634E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.062 | TFLOPs: 21.18 | 31: iteration 11140/ 173500 | consumed samples: 2851840 | consumed tokens: 5840568320 | elapsed time per iteration (s): 0.74 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.320683E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.853 | TFLOPs: 20.80 | 31: 
iteration 11150/ 173500 | consumed samples: 2854400 | consumed tokens: 5845811200 | elapsed time per iteration (s): 0.78 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.307284E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.916 | TFLOPs: 19.84 | 31: iteration 11160/ 173500 | consumed samples: 2856960 | consumed tokens: 5851054080 | elapsed time per iteration (s): 0.75 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.327191E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.498 | TFLOPs: 20.60 | 31: iteration 11170/ 173500 | consumed samples: 2859520 | consumed tokens: 5856296960 | elapsed time per iteration (s): 0.78 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.315644E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.057 | TFLOPs: 19.91 | 31: iteration 11180/ 173500 | consumed samples: 2862080 | consumed tokens: 5861539840 | elapsed time per iteration (s): 0.81 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.320565E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.673 | TFLOPs: 19.10 | 31: iteration 11190/ 173500 | consumed samples: 2864640 | consumed tokens: 5866782720 | elapsed time per iteration (s): 0.82 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.269518E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.313 | TFLOPs: 18.89 | 31: iteration 11200/ 173500 | consumed samples: 2867200 | consumed tokens: 5872025600 | elapsed time per iteration (s): 2.67 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.358375E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 95.838 | TFLOPs: 5.80 | 31: iteration 11210/ 173500 | consumed samples: 2869760 | consumed tokens: 5877268480 | elapsed time per iteration (s): 0.75 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.321835E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.105 | TFLOPs: 20.64 | 31: iteration 11220/ 173500 | consumed samples: 2872320 | consumed tokens: 5882511360 | elapsed time per iteration (s): 0.77 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.346468E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.860 | TFLOPs: 20.20 | 31: iteration 11230/ 173500 | consumed samples: 2874880 | consumed tokens: 5887754240 | elapsed time per iteration (s): 0.80 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.329734E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.002 | TFLOPs: 19.30 | 31: iteration 11240/ 173500 | consumed samples: 2877440 | consumed tokens: 5892997120 | elapsed time per iteration (s): 0.77 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.337401E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.077 | TFLOPs: 20.21 | 31: iteration 11250/ 173500 | consumed samples: 2880000 | consumed tokens: 5898240000 | elapsed time per iteration (s): 0.87 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.329019E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.767 | TFLOPs: 17.83 | 31: iteration 11260/ 173500 | consumed samples: 2882560 | consumed tokens: 5903482880 | elapsed time per iteration (s): 0.85 | learning rate: 1.986E-04 | global 
batch size: 256 | lm loss: 2.314667E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.782 | TFLOPs: 18.32 | 31: iteration 11270/ 173500 | consumed samples: 2885120 | consumed tokens: 5908725760 | elapsed time per iteration (s): 0.79 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.318081E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.770 | TFLOPs: 19.59 | 31: iteration 11280/ 173500 | consumed samples: 2887680 | consumed tokens: 5913968640 | elapsed time per iteration (s): 0.78 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.350989E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.782 | TFLOPs: 19.89 | 31: iteration 11290/ 173500 | consumed samples: 2890240 | consumed tokens: 5919211520 | elapsed time per iteration (s): 0.76 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.308717E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.307 | TFLOPs: 20.41 | 31: iteration 11300/ 173500 | consumed samples: 2892800 | consumed tokens: 5924454400 | elapsed time per iteration (s): 0.76 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.297697E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.004 | TFLOPs: 20.33 | 31: iteration 11310/ 173500 | consumed samples: 2895360 | consumed tokens: 5929697280 | elapsed time per iteration (s): 0.77 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.309005E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.534 | TFLOPs: 20.00 | 31: iteration 11320/ 173500 | consumed samples: 2897920 | consumed 
tokens: 5934940160 | elapsed time per iteration (s): 0.75 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.302816E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.179 | TFLOPs: 20.70 | 31: iteration 11330/ 173500 | consumed samples: 2900480 | consumed tokens: 5940183040 | elapsed time per iteration (s): 0.79 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.300806E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.951 | TFLOPs: 19.54 | 31: iteration 11340/ 173500 | consumed samples: 2903040 | consumed tokens: 5945425920 | elapsed time per iteration (s): 0.77 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.321418E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.881 | TFLOPs: 20.20 | 31: iteration 11350/ 173500 | consumed samples: 2905600 | consumed tokens: 5950668800 | elapsed time per iteration (s): 0.81 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.326180E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.837 | TFLOPs: 19.11 | 31: iteration 11360/ 173500 | consumed samples: 2908160 | consumed tokens: 5955911680 | elapsed time per iteration (s): 0.76 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.337587E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.773 | TFLOPs: 20.31 | 31: iteration 11370/ 173500 | consumed samples: 2910720 | consumed tokens: 5961154560 | elapsed time per iteration (s): 0.82 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.319438E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 312.632 | TFLOPs: 18.91 | 31: iteration 11380/ 173500 | consumed samples: 2913280 | consumed tokens: 5966397440 | elapsed time per iteration (s): 0.76 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.315232E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.569 | TFLOPs: 20.48 | 31: iteration 11390/ 173500 | consumed samples: 2915840 | consumed tokens: 5971640320 | elapsed time per iteration (s): 0.83 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.308547E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.085 | TFLOPs: 18.58 | 31: iteration 11400/ 173500 | consumed samples: 2918400 | consumed tokens: 5976883200 | elapsed time per iteration (s): 0.78 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.344751E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.346 | TFLOPs: 19.86 | 31: iteration 11410/ 173500 | consumed samples: 2920960 | consumed tokens: 5982126080 | elapsed time per iteration (s): 0.81 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.287947E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.862 | TFLOPs: 19.11 | 31: iteration 11420/ 173500 | consumed samples: 2923520 | consumed tokens: 5987368960 | elapsed time per iteration (s): 0.75 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.285276E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.293 | TFLOPs: 20.77 | 31: iteration 11430/ 173500 | consumed samples: 2926080 | consumed tokens: 5992611840 | elapsed time per iteration (s): 0.77 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.304716E+00 | grad norm: 0.155 
| num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.516 | TFLOPs: 20.12 | 31: iteration 11440/ 173500 | consumed samples: 2928640 | consumed tokens: 5997854720 | elapsed time per iteration (s): 0.74 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.334143E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.352 | TFLOPs: 21.01 | 31: iteration 11450/ 173500 | consumed samples: 2931200 | consumed tokens: 6003097600 | elapsed time per iteration (s): 0.79 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.294444E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.950 | TFLOPs: 19.60 | 31: iteration 11460/ 173500 | consumed samples: 2933760 | consumed tokens: 6008340480 | elapsed time per iteration (s): 0.78 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.310736E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.403 | TFLOPs: 19.87 | 31: iteration 11470/ 173500 | consumed samples: 2936320 | consumed tokens: 6013583360 | elapsed time per iteration (s): 0.87 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.328303E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.510 | TFLOPs: 17.76 | 31: iteration 11480/ 173500 | consumed samples: 2938880 | consumed tokens: 6018826240 | elapsed time per iteration (s): 0.74 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.338491E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.824 | TFLOPs: 20.86 | 31: iteration 11490/ 173500 | consumed samples: 2941440 | consumed tokens: 6024069120 | elapsed time per iteration (s): 0.81 
| learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.321452E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.033 | TFLOPs: 19.18 | 31: iteration 11500/ 173500 | consumed samples: 2944000 | consumed tokens: 6029312000 | elapsed time per iteration (s): 0.79 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.336271E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.168 | TFLOPs: 19.61 | 31: iteration 11510/ 173500 | consumed samples: 2946560 | consumed tokens: 6034554880 | elapsed time per iteration (s): 0.80 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.334819E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.678 | TFLOPs: 19.34 | 31: iteration 11520/ 173500 | consumed samples: 2949120 | consumed tokens: 6039797760 | elapsed time per iteration (s): 0.76 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.313223E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.393 | TFLOPs: 20.35 | 31: iteration 11530/ 173500 | consumed samples: 2951680 | consumed tokens: 6045040640 | elapsed time per iteration (s): 0.81 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.284864E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.772 | TFLOPs: 19.04 | 31: iteration 11540/ 173500 | consumed samples: 2954240 | consumed tokens: 6050283520 | elapsed time per iteration (s): 0.77 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.330089E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.998 | TFLOPs: 20.02 | 31: iteration 11550/ 173500 | 
consumed samples: 2956800 | consumed tokens: 6055526400 | elapsed time per iteration (s): 0.82 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.325362E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.540 | TFLOPs: 18.97 | 31: iteration 11560/ 173500 | consumed samples: 2959360 | consumed tokens: 6060769280 | elapsed time per iteration (s): 0.85 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.326307E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.145 | TFLOPs: 18.16 | 31: iteration 11570/ 173500 | consumed samples: 2961920 | consumed tokens: 6066012160 | elapsed time per iteration (s): 0.85 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.305989E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.586 | TFLOPs: 18.31 | 31: iteration 11580/ 173500 | consumed samples: 2964480 | consumed tokens: 6071255040 | elapsed time per iteration (s): 0.76 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.286223E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.656 | TFLOPs: 20.31 | 31: iteration 11590/ 173500 | consumed samples: 2967040 | consumed tokens: 6076497920 | elapsed time per iteration (s): 0.74 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.320594E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.942 | TFLOPs: 20.93 | 31: iteration 11600/ 173500 | consumed samples: 2969600 | consumed tokens: 6081740800 | elapsed time per iteration (s): 0.77 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.321736E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 334.146 | TFLOPs: 20.21 | 31: iteration 11610/ 173500 | consumed samples: 2972160 | consumed tokens: 6086983680 | elapsed time per iteration (s): 0.75 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.320424E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.738 | TFLOPs: 20.55 | 31: iteration 11620/ 173500 | consumed samples: 2974720 | consumed tokens: 6092226560 | elapsed time per iteration (s): 0.74 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.351763E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.273 | TFLOPs: 20.95 | 31: iteration 11630/ 173500 | consumed samples: 2977280 | consumed tokens: 6097469440 | elapsed time per iteration (s): 0.78 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.303948E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.904 | TFLOPs: 19.78 | 31: iteration 11640/ 173500 | consumed samples: 2979840 | consumed tokens: 6102712320 | elapsed time per iteration (s): 0.88 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.334138E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.102 | TFLOPs: 17.61 | 31: iteration 11650/ 173500 | consumed samples: 2982400 | consumed tokens: 6107955200 | elapsed time per iteration (s): 0.78 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.287659E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.757 | TFLOPs: 19.89 | 31: iteration 11660/ 173500 | consumed samples: 2984960 | consumed tokens: 6113198080 | elapsed time per iteration (s): 0.76 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 
2.319442E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.049 | TFLOPs: 20.33 | 31: iteration 11670/ 173500 | consumed samples: 2987520 | consumed tokens: 6118440960 | elapsed time per iteration (s): 0.77 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.294081E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.614 | TFLOPs: 20.06 | 31: iteration 11680/ 173500 | consumed samples: 2990080 | consumed tokens: 6123683840 | elapsed time per iteration (s): 0.76 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.333333E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.552 | TFLOPs: 20.48 | 31: iteration 11690/ 173500 | consumed samples: 2992640 | consumed tokens: 6128926720 | elapsed time per iteration (s): 0.76 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.326359E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.279 | TFLOPs: 20.28 | 31: iteration 11700/ 173500 | consumed samples: 2995200 | consumed tokens: 6134169600 | elapsed time per iteration (s): 0.75 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.291845E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.585 | TFLOPs: 20.60 | 31: iteration 11710/ 173500 | consumed samples: 2997760 | consumed tokens: 6139412480 | elapsed time per iteration (s): 0.80 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.314844E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.403 | TFLOPs: 19.32 | 31: iteration 11720/ 173500 | consumed samples: 3000320 | consumed tokens: 6144655360 | 
elapsed time per iteration (s): 0.78 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.302109E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.277 | TFLOPs: 19.74 | 31: iteration 11730/ 173500 | consumed samples: 3002880 | consumed tokens: 6149898240 | elapsed time per iteration (s): 0.81 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.305455E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.002 | TFLOPs: 19.12 | 31: iteration 11740/ 173500 | consumed samples: 3005440 | consumed tokens: 6155141120 | elapsed time per iteration (s): 0.80 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.310055E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.465 | TFLOPs: 19.39 | 31: iteration 11750/ 173500 | consumed samples: 3008000 | consumed tokens: 6160384000 | elapsed time per iteration (s): 0.86 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.285143E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.602 | TFLOPs: 18.00 | 31: iteration 11760/ 173500 | consumed samples: 3010560 | consumed tokens: 6165626880 | elapsed time per iteration (s): 0.82 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.318143E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.519 | TFLOPs: 18.97 | 31: iteration 11770/ 173500 | consumed samples: 3013120 | consumed tokens: 6170869760 | elapsed time per iteration (s): 0.80 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.326445E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.138 | TFLOPs: 
19.25 | 31: iteration 11780/ 173500 | consumed samples: 3015680 | consumed tokens: 6176112640 | elapsed time per iteration (s): 0.82 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.291549E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.809 | TFLOPs: 18.92 | 31: iteration 11790/ 173500 | consumed samples: 3018240 | consumed tokens: 6181355520 | elapsed time per iteration (s): 0.80 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.305429E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.091 | TFLOPs: 19.24 | 31: iteration 11800/ 173500 | consumed samples: 3020800 | consumed tokens: 6186598400 | elapsed time per iteration (s): 0.81 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.305228E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.640 | TFLOPs: 19.10 | 31: iteration 11810/ 173500 | consumed samples: 3023360 | consumed tokens: 6191841280 | elapsed time per iteration (s): 0.77 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.294647E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.760 | TFLOPs: 20.19 | 31: iteration 11820/ 173500 | consumed samples: 3025920 | consumed tokens: 6197084160 | elapsed time per iteration (s): 0.79 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.308258E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.942 | TFLOPs: 19.60 | 31: iteration 11830/ 173500 | consumed samples: 3028480 | consumed tokens: 6202327040 | elapsed time per iteration (s): 0.77 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.251213E+00 | grad norm: 0.166 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.172 | TFLOPs: 20.16 | 31: iteration 11840/ 173500 | consumed samples: 3031040 | consumed tokens: 6207569920 | elapsed time per iteration (s): 0.84 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.292829E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.290 | TFLOPs: 18.41 | 31: iteration 11850/ 173500 | consumed samples: 3033600 | consumed tokens: 6212812800 | elapsed time per iteration (s): 0.80 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.324956E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.362 | TFLOPs: 19.32 | 31: iteration 11860/ 173500 | consumed samples: 3036160 | consumed tokens: 6218055680 | elapsed time per iteration (s): 0.76 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.258452E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.993 | TFLOPs: 20.27 | 31: iteration 11870/ 173500 | consumed samples: 3038720 | consumed tokens: 6223298560 | elapsed time per iteration (s): 0.82 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.305973E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.703 | TFLOPs: 18.86 | 31: iteration 11880/ 173500 | consumed samples: 3041280 | consumed tokens: 6228541440 | elapsed time per iteration (s): 0.85 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.304649E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.444 | TFLOPs: 18.18 | 31: iteration 11890/ 173500 | consumed samples: 3043840 | consumed tokens: 6233784320 | elapsed time per iteration (s): 0.80 | learning rate: 1.985E-04 
| global batch size: 256 | lm loss: 2.317130E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.775 | TFLOPs: 19.41 | 31: iteration 11900/ 173500 | consumed samples: 3046400 | consumed tokens: 6239027200 | elapsed time per iteration (s): 0.81 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.279440E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.631 | TFLOPs: 19.16 | 31: iteration 11910/ 173500 | consumed samples: 3048960 | consumed tokens: 6244270080 | elapsed time per iteration (s): 0.83 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.333210E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.831 | TFLOPs: 18.68 | 31: iteration 11920/ 173500 | consumed samples: 3051520 | consumed tokens: 6249512960 | elapsed time per iteration (s): 0.87 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.329882E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.644 | TFLOPs: 17.76 | 31: iteration 11930/ 173500 | consumed samples: 3054080 | consumed tokens: 6254755840 | elapsed time per iteration (s): 0.87 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.329352E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.292 | TFLOPs: 17.80 | 31: iteration 11940/ 173500 | consumed samples: 3056640 | consumed tokens: 6259998720 | elapsed time per iteration (s): 0.86 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.325553E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.866 | TFLOPs: 18.02 | 31: iteration 11950/ 173500 | consumed samples: 3059200 | 
consumed tokens: 6265241600 | elapsed time per iteration (s): 0.84 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.266938E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.330 | TFLOPs: 18.47 | 31: iteration 11960/ 173500 | consumed samples: 3061760 | consumed tokens: 6270484480 | elapsed time per iteration (s): 0.83 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.306343E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.746 | TFLOPs: 18.74 | 31: iteration 11970/ 173500 | consumed samples: 3064320 | consumed tokens: 6275727360 | elapsed time per iteration (s): 0.78 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.304325E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.582 | TFLOPs: 19.82 | 31: iteration 11980/ 173500 | consumed samples: 3066880 | consumed tokens: 6280970240 | elapsed time per iteration (s): 0.81 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.333908E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.235 | TFLOPs: 19.07 | 31: iteration 11990/ 173500 | consumed samples: 3069440 | consumed tokens: 6286213120 | elapsed time per iteration (s): 0.80 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.306974E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.697 | TFLOPs: 19.46 | 0: [2022-11-25 20:47:42,126] [INFO] [logging.py:68:log_dist] [Rank 0] step=12000, skipped=0, lr=[0.0001984184547955352, 0.0001984184547955352, 0.0001984184547955352], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 12000/ 173500 | consumed samples: 3072000 | consumed tokens: 6291456000 | elapsed 
time per iteration (s): 0.81 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.329754E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.179 | TFLOPs: 19.07 | 0: steps: 12000 loss: 2.3818 iter time (s): 0.900 samples/sec: 284.440 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 12000 | lm loss value: 2.290657E+00 | lm loss PPL: 9.881430E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 12000 to checkpoints_1b1long 0: [2022-11-25 20:47:42,383] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step12000 is begin to save! 0: [2022-11-25 20:47:42,407] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_01-model_00-model_states.pt... 0: [2022-11-25 20:47:42,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_01-model_00-model_states.pt. 0: [2022-11-25 20:47:42,624] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_03-model_00-model_states.pt... 0: [2022-11-25 20:47:42,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_03-model_00-model_states.pt. 0: [2022-11-25 20:47:42,705] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_04-model_00-model_states.pt... 0: [2022-11-25 20:47:42,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_04-model_00-model_states.pt. 0: [2022-11-25 20:47:42,784] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_05-model_00-model_states.pt... 
0: [2022-11-25 20:47:42,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_05-model_00-model_states.pt. 0: [2022-11-25 20:47:42,861] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_06-model_00-model_states.pt... 0: [2022-11-25 20:47:42,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_06-model_00-model_states.pt. 0: [2022-11-25 20:47:42,937] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_07-model_00-model_states.pt... 0: [2022-11-25 20:47:43,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_07-model_00-model_states.pt. 0: [2022-11-25 20:47:43,014] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_08-model_00-model_states.pt... 0: [2022-11-25 20:47:43,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_08-model_00-model_states.pt. 0: [2022-11-25 20:47:43,092] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_09-model_00-model_states.pt... 0: [2022-11-25 20:47:43,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_09-model_00-model_states.pt. 0: [2022-11-25 20:47:43,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_10-model_00-model_states.pt... 0: [2022-11-25 20:47:43,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_10-model_00-model_states.pt. 0: [2022-11-25 20:47:43,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_11-model_00-model_states.pt... 
0: [2022-11-25 20:47:43,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_11-model_00-model_states.pt. 0: [2022-11-25 20:47:43,321] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_12-model_00-model_states.pt... 0: [2022-11-25 20:47:43,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_12-model_00-model_states.pt. 0: [2022-11-25 20:47:43,401] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_13-model_00-model_states.pt... 0: [2022-11-25 20:47:43,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_13-model_00-model_states.pt. 0: [2022-11-25 20:47:43,477] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_14-model_00-model_states.pt... 0: [2022-11-25 20:47:43,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_14-model_00-model_states.pt. 0: [2022-11-25 20:47:43,555] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_15-model_00-model_states.pt... 0: [2022-11-25 20:47:43,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_15-model_00-model_states.pt. 0: [2022-11-25 20:47:43,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_16-model_00-model_states.pt... 0: [2022-11-25 20:47:43,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_16-model_00-model_states.pt. 0: [2022-11-25 20:47:43,705] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_17-model_00-model_states.pt... 
0: [2022-11-25 20:47:43,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_17-model_00-model_states.pt. 0: [2022-11-25 20:47:43,780] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_18-model_00-model_states.pt... 0: [2022-11-25 20:47:43,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_18-model_00-model_states.pt. 0: [2022-11-25 20:47:43,856] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_19-model_00-model_states.pt... 0: [2022-11-25 20:47:43,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_19-model_00-model_states.pt. 0: [2022-11-25 20:47:43,935] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_20-model_00-model_states.pt... 0: [2022-11-25 20:47:44,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_20-model_00-model_states.pt. 0: [2022-11-25 20:47:44,009] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_21-model_00-model_states.pt... 0: [2022-11-25 20:47:44,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_21-model_00-model_states.pt. 0: [2022-11-25 20:47:44,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_22-model_00-model_states.pt... 0: [2022-11-25 20:47:44,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_22-model_00-model_states.pt. 0: [2022-11-25 20:47:44,160] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_23-model_00-model_states.pt... 
0: [2022-11-25 20:47:44,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_23-model_00-model_states.pt. 0: [2022-11-25 20:47:44,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_24-model_00-model_states.pt... 0: [2022-11-25 20:47:44,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_24-model_00-model_states.pt. 0: [2022-11-25 20:47:44,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_25-model_00-model_states.pt... 0: [2022-11-25 20:47:44,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_25-model_00-model_states.pt. 0: [2022-11-25 20:47:44,566] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_26-model_00-model_states.pt... 0: [2022-11-25 20:47:44,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_26-model_00-model_states.pt. 0: [2022-11-25 20:47:44,643] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_27-model_00-model_states.pt... 0: [2022-11-25 20:47:44,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_27-model_00-model_states.pt. 0: [2022-11-25 20:47:44,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_28-model_00-model_states.pt... 0: [2022-11-25 20:47:44,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_28-model_00-model_states.pt. 0: [2022-11-25 20:47:44,794] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/layer_30-model_00-model_states.pt... 
0: [2022-11-25 20:47:44,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/layer_30-model_00-model_states.pt. 0: [2022-11-25 20:47:44,796] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step12000/mp_rank_00_model_states.pt 0: [2022-11-25 20:47:44,796] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/mp_rank_00_model_states.pt... 0: [2022-11-25 20:47:44,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/mp_rank_00_model_states.pt. 0: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
7: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
9: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
1: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
3: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 
20: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 
24: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 
30: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 
18: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
6: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
4: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 
16: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
12: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 
23: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 
29: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 
21: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 
27: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 
16: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 
23: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 
22: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
5: [2022-11-25 20:47:44,878] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 25: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 24: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 31: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 
31: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 26: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-25 20:47:44,878] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 
0: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 25: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 31: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 30: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 
8: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 29: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 30: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 29: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:47:44,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:47:44,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:47:44,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 20:47:44,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 4: [2022-11-25 20:47:44,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-25 20:47:44,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 20:47:44,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 8: [2022-11-25 20:47:44,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:47:44,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 20:47:44,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 23: [2022-11-25 20:47:44,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 24: [2022-11-25 20:47:44,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 23: [2022-11-25 20:47:44,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 22: [2022-11-25 20:47:44,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 23: [2022-11-25 20:47:44,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
22: [2022-11-25 20:47:44,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 24: [2022-11-25 20:47:44,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 22: [2022-11-25 20:47:44,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 24: [2022-11-25 20:47:44,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 0: [2022-11-25 20:47:44,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:47:44,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 20:47:44,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 13: [2022-11-25 20:47:44,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:47:44,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 20:47:44,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 3: [2022-11-25 20:47:44,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-25 20:47:44,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 20:47:44,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 17: [2022-11-25 20:47:44,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 30: [2022-11-25 20:47:44,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 17: [2022-11-25 20:47:44,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-25 20:47:44,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 6: [2022-11-25 20:47:44,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 30: [2022-11-25 20:47:44,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-25 20:47:44,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 6: [2022-11-25 20:47:44,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 20:47:44,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 6: [2022-11-25 20:47:44,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-25 20:47:44,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 14: [2022-11-25 20:47:44,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:47:44,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 14: [2022-11-25 20:47:44,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 20:47:44,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 14: [2022-11-25 20:47:44,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:47:44,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:47:44,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 4: [2022-11-25 20:47:44,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 14: [2022-11-25 20:47:44,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 4: [2022-11-25 20:47:44,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 19: [2022-11-25 20:47:44,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-25 20:47:44,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-25 20:47:44,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 5: [2022-11-25 20:47:44,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:47:44,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 22: [2022-11-25 20:47:44,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:47:44,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 22: [2022-11-25 20:47:44,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-25 20:47:44,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 28: [2022-11-25 20:47:44,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 19: [2022-11-25 20:47:44,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:47:44,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
19: [2022-11-25 20:47:44,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 7: [2022-11-25 20:47:44,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 19: [2022-11-25 20:47:44,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 7: [2022-11-25 20:47:44,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 8: [2022-11-25 20:47:44,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 27: [2022-11-25 20:47:44,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:47:44,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 2: [2022-11-25 20:47:44,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:47:44,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 2: [2022-11-25 20:47:44,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 20:47:44,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
27: [2022-11-25 20:47:44,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 9: [2022-11-25 20:47:44,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:47:44,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 20:47:44,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 30: [2022-11-25 20:47:44,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 27: [2022-11-25 20:47:44,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 30: [2022-11-25 20:47:44,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-25 20:47:44,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 0: [2022-11-25 20:47:44,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 24: [2022-11-25 20:47:44,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:47:44,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:47:44,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
11: [2022-11-25 20:47:44,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 24: [2022-11-25 20:47:44,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 7: [2022-11-25 20:47:44,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 24: [2022-11-25 20:47:44,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 29: [2022-11-25 20:47:44,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:47:44,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 7: [2022-11-25 20:47:44,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 11: [2022-11-25 20:47:44,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 29: [2022-11-25 20:47:44,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 11: [2022-11-25 20:47:44,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 29: [2022-11-25 20:47:44,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 0: [2022-11-25 20:47:44,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
21: [2022-11-25 20:47:44,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-25 20:47:44,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-25 20:47:44,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 11: [2022-11-25 20:47:44,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:47:44,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 20:47:44,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 20: [2022-11-25 20:47:44,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-25 20:47:44,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-25 20:47:44,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 10: [2022-11-25 20:47:44,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:47:44,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 20:47:44,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
19: [2022-11-25 20:47:44,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-25 20:47:44,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-25 20:47:44,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 10: [2022-11-25 20:47:44,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:47:44,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 22: [2022-11-25 20:47:44,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:47:44,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 22: [2022-11-25 20:47:44,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 10: [2022-11-25 20:47:44,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 22: [2022-11-25 20:47:44,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 10: [2022-11-25 20:47:44,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 10: [2022-11-25 20:47:44,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
3: [2022-11-25 20:47:44,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:47:44,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:47:44,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 20:47:44,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 20:47:44,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 3: [2022-11-25 20:47:44,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 9: [2022-11-25 20:47:44,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 30: [2022-11-25 20:47:44,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-25 20:47:44,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 9: [2022-11-25 20:47:44,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 30: [2022-11-25 20:47:44,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 0: [2022-11-25 20:47:44,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
9: [2022-11-25 20:47:44,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 21: [2022-11-25 20:47:44,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-25 20:47:44,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-25 20:47:44,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 0: [2022-11-25 20:47:44,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 20:47:44,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 4: [2022-11-25 20:47:44,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:47:44,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 20:47:44,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 6: [2022-11-25 20:47:44,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:47:44,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 20:47:44,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
20: [2022-11-25 20:47:44,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-25 20:47:44,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-25 20:47:44,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-25 20:47:44,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 20: [2022-11-25 20:47:44,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-25 20:47:44,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 22: [2022-11-25 20:47:44,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-25 20:47:44,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-25 20:47:44,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 8: [2022-11-25 20:47:44,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:47:44,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 20:47:44,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
4: [2022-11-25 20:47:44,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:47:44,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 15: [2022-11-25 20:47:44,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:47:44,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 4: [2022-11-25 20:47:44,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 15: [2022-11-25 20:47:44,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 13: [2022-11-25 20:47:44,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:47:44,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 20:47:44,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 30: [2022-11-25 20:47:44,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-25 20:47:44,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-25 20:47:44,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
9: [2022-11-25 20:47:44,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:47:44,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 20:47:44,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 28: [2022-11-25 20:47:44,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-25 20:47:44,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 28: [2022-11-25 20:47:44,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-25 20:47:44,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-25 20:47:44,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 28: [2022-11-25 20:47:44,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 23: [2022-11-25 20:47:44,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
28: [2022-11-25 20:47:44,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 23: [2022-11-25 20:47:44,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 28: [2022-11-25 20:47:44,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 28: [2022-11-25 20:47:44,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 23: [2022-11-25 20:47:44,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 11: [2022-11-25 20:47:44,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 28: [2022-11-25 20:47:44,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 11: [2022-11-25 20:47:44,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 28: [2022-11-25 20:47:44,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 11: [2022-11-25 20:47:44,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 26: [2022-11-25 20:47:44,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-25 20:47:44,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-25 20:47:44,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 3: [2022-11-25 20:47:44,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:47:44,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 9: [2022-11-25 20:47:44,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:47:44,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 9: [2022-11-25 20:47:44,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 20:47:44,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 23: [2022-11-25 20:47:44,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-25 20:47:44,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-25 20:47:44,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 23: [2022-11-25 20:47:44,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-25 20:47:44,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-25 20:47:44,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 7: [2022-11-25 20:47:44,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:47:44,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:47:44,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 20:47:44,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 20:47:44,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 7: [2022-11-25 20:47:44,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 19: [2022-11-25 20:47:44,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-25 20:47:44,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-25 20:47:44,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 0: [2022-11-25 20:47:44,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
2: [2022-11-25 20:47:44,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:47:44,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:47:44,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:47:44,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 2: [2022-11-25 20:47:44,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 0: [2022-11-25 20:47:44,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 2: [2022-11-25 20:47:44,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 20:47:44,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 2: [2022-11-25 20:47:44,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 20:47:44,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 2: [2022-11-25 20:47:44,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 15: [2022-11-25 20:47:44,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-25 20:47:44,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 20:47:44,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:47:44,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 15: [2022-11-25 20:47:44,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 20:47:44,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 8: [2022-11-25 20:47:44,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:47:44,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 20:47:44,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 25: [2022-11-25 20:47:44,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-25 20:47:44,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-25 20:47:44,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 25: [2022-11-25 20:47:44,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
25: [2022-11-25 20:47:44,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-25 20:47:44,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 16: [2022-11-25 20:47:44,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-25 20:47:44,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-25 20:47:44,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 29: [2022-11-25 20:47:44,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-25 20:47:44,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-25 20:47:44,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 11: [2022-11-25 20:47:44,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:47:44,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 20:47:44,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 21: [2022-11-25 20:47:44,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-25 20:47:44,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-25 20:47:44,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 21: [2022-11-25 20:47:44,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-25 20:47:44,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-25 20:47:44,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 18: [2022-11-25 20:47:44,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-25 20:47:44,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 13: [2022-11-25 20:47:44,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 18: [2022-11-25 20:47:44,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 13: [2022-11-25 20:47:44,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 20:47:44,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 10: [2022-11-25 20:47:44,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-25 20:47:44,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 20:47:44,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 6: [2022-11-25 20:47:44,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:47:44,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 20:47:44,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 17: [2022-11-25 20:47:44,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-25 20:47:44,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-25 20:47:44,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 17: [2022-11-25 20:47:44,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-25 20:47:44,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-25 20:47:44,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 17: [2022-11-25 20:47:44,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-25 20:47:44,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-25 20:47:44,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 26: [2022-11-25 20:47:44,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-25 20:47:44,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-25 20:47:44,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 26: [2022-11-25 20:47:44,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-25 20:47:44,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-25 20:47:44,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 5: [2022-11-25 20:47:44,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:47:44,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 20:47:44,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 26: [2022-11-25 20:47:44,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
26: [2022-11-25 20:47:44,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-25 20:47:44,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 25: [2022-11-25 20:47:44,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-25 20:47:44,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-25 20:47:44,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 13: [2022-11-25 20:47:44,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:47:44,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 20:47:44,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 20: [2022-11-25 20:47:44,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-25 20:47:44,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-25 20:47:44,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 31: [2022-11-25 20:47:44,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
31: [2022-11-25 20:47:44,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-25 20:47:44,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-25 20:47:44,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-25 20:47:44,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-25 20:47:44,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-25 20:47:44,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-25 20:47:44,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-25 20:47:44,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 31: [2022-11-25 20:47:44,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 31: [2022-11-25 20:47:44,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 31: [2022-11-25 20:47:44,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 29: [2022-11-25 20:47:44,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
29: [2022-11-25 20:47:44,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-25 20:47:44,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 22: [2022-11-25 20:47:44,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:47:44,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:47:44,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:47:44,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 22: [2022-11-25 20:47:44,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-25 20:47:44,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
1: [2022-11-25 20:47:44,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 20:47:44,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 20:47:44,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 20:47:44,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 1: [2022-11-25 20:47:44,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 1: [2022-11-25 20:47:44,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 5: [2022-11-25 20:47:44,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:47:44,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 20:47:44,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 16: [2022-11-25 20:47:44,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-25 20:47:44,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-25 20:47:44,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
0: [2022-11-25 20:47:44,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 20:47:44,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 24: [2022-11-25 20:47:44,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-25 20:47:44,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-25 20:47:44,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 29: [2022-11-25 20:47:44,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-25 20:47:44,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-25 20:47:44,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 18: [2022-11-25 20:47:44,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-25 20:47:44,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-25 20:47:44,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 2: [2022-11-25 20:47:44,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-25 20:47:44,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 20:47:44,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 5: [2022-11-25 20:47:44,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:47:44,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 20:47:44,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 12: [2022-11-25 20:47:44,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:47:44,982] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 20:47:44,982] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 14: [2022-11-25 20:47:44,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:47:44,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 20:47:44,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 16: [2022-11-25 20:47:44,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 
16: [2022-11-25 20:47:44,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-25 20:47:44,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 27: [2022-11-25 20:47:44,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:47:44,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 27: [2022-11-25 20:47:44,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 12: [2022-11-25 20:47:44,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 27: [2022-11-25 20:47:44,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 12: [2022-11-25 20:47:44,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 27: [2022-11-25 20:47:44,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-25 20:47:44,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-25 20:47:44,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 18: [2022-11-25 20:47:44,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
18: [2022-11-25 20:47:44,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-25 20:47:44,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 5: [2022-11-25 20:47:44,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:47:44,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 20:47:44,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 27: [2022-11-25 20:47:45,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-25 20:47:45,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-25 20:47:45,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 24: [2022-11-25 20:47:45,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-25 20:47:45,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-25 20:47:45,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 14: [2022-11-25 20:47:45,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-25 20:47:45,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 20:47:45,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 14: [2022-11-25 20:47:45,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:47:45,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 20:47:45,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 31: [2022-11-25 20:47:45,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-25 20:47:45,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-25 20:47:45,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 4: [2022-11-25 20:47:45,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:47:45,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 20:47:45,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 9: [2022-11-25 20:47:45,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-25 20:47:45,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 20:47:45,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 3: [2022-11-25 20:47:45,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:47:45,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 20:47:45,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 23: [2022-11-25 20:47:45,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-25 20:47:45,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-25 20:47:45,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 26: [2022-11-25 20:47:45,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-25 20:47:45,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 28: [2022-11-25 20:47:45,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 26: [2022-11-25 20:47:45,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
28: [2022-11-25 20:47:45,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-25 20:47:45,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 29: [2022-11-25 20:47:45,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-25 20:47:45,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 21: [2022-11-25 20:47:45,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 29: [2022-11-25 20:47:45,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 21: [2022-11-25 20:47:45,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 30: [2022-11-25 20:47:45,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 21: [2022-11-25 20:47:45,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 30: [2022-11-25 20:47:45,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-25 20:47:45,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 8: [2022-11-25 20:47:45,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-25 20:47:45,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 20:47:45,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 16: [2022-11-25 20:47:45,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-25 20:47:45,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-25 20:47:45,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 12: [2022-11-25 20:47:45,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:47:45,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 20:47:45,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 6: [2022-11-25 20:47:45,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:47:45,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 20:47:45,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 1: [2022-11-25 20:47:45,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-25 20:47:45,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 20:47:45,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 17: [2022-11-25 20:47:45,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-25 20:47:45,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-25 20:47:45,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 25: [2022-11-25 20:47:45,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-25 20:47:45,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 0: [2022-11-25 20:47:45,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:47:45,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 25: [2022-11-25 20:47:45,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 0: [2022-11-25 20:47:45,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 7: [2022-11-25 20:47:45,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-25 20:47:45,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 20:47:45,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 15: [2022-11-25 20:47:45,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:47:45,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 20:47:45,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 10: [2022-11-25 20:47:45,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:47:45,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:47:45,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:47:45,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 20:47:45,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
14: [2022-11-25 20:47:45,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 5: [2022-11-25 20:47:45,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 14: [2022-11-25 20:47:45,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 5: [2022-11-25 20:47:45,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 2: [2022-11-25 20:47:45,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:47:45,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 20:47:45,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 3: [2022-11-25 20:47:45,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 24: [2022-11-25 20:47:45,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:47:45,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 20:47:45,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
24: [2022-11-25 20:47:45,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-25 20:47:45,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 11: [2022-11-25 20:47:45,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:47:45,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 20:47:45,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 20: [2022-11-25 20:47:45,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-25 20:47:45,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-25 20:47:45,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 23: [2022-11-25 20:47:45,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 22: [2022-11-25 20:47:45,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
23: [2022-11-25 20:47:45,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 22: [2022-11-25 20:47:45,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 23: [2022-11-25 20:47:45,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 22: [2022-11-25 20:47:45,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 26: [2022-11-25 20:47:45,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-25 20:47:45,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-25 20:47:45,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 30: [2022-11-25 20:47:45,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-25 20:47:45,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-25 20:47:45,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 8: [2022-11-25 20:47:45,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-25 20:47:45,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 20:47:45,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 27: [2022-11-25 20:47:45,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-25 20:47:45,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-25 20:47:45,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 10: [2022-11-25 20:47:45,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:47:45,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 20:47:45,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 13: [2022-11-25 20:47:45,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:47:45,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 20:47:45,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 28: [2022-11-25 20:47:45,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
28: [2022-11-25 20:47:45,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-25 20:47:45,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 29: [2022-11-25 20:47:45,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:47:45,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:47:45,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 29: [2022-11-25 20:47:45,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 9: [2022-11-25 20:47:45,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 29: [2022-11-25 20:47:45,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 12: [2022-11-25 20:47:45,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:47:45,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 20:47:45,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 11: [2022-11-25 20:47:45,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-25 20:47:45,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 20:47:45,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 4: [2022-11-25 20:47:45,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:47:45,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 20:47:45,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 17: [2022-11-25 20:47:45,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-25 20:47:45,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-25 20:47:45,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 24: [2022-11-25 20:47:45,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 20: [2022-11-25 20:47:45,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 24: [2022-11-25 20:47:45,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-25 20:47:45,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
20: [2022-11-25 20:47:45,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 27: [2022-11-25 20:47:45,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 20: [2022-11-25 20:47:45,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 27: [2022-11-25 20:47:45,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-25 20:47:45,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 21: [2022-11-25 20:47:45,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-25 20:47:45,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-25 20:47:45,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 31: [2022-11-25 20:47:45,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-25 20:47:45,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-25 20:47:45,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 19: [2022-11-25 20:47:45,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-25 20:47:45,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-25 20:47:45,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 19: [2022-11-25 20:47:45,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-25 20:47:45,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-25 20:47:45,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 12: [2022-11-25 20:47:45,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:47:45,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 20:47:45,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 18: [2022-11-25 20:47:45,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-25 20:47:45,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-25 20:47:45,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 14: [2022-11-25 20:47:45,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-25 20:47:45,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 3: [2022-11-25 20:47:45,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:47:45,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 3: [2022-11-25 20:47:45,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 20:47:45,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 24: [2022-11-25 20:47:45,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 23: [2022-11-25 20:47:45,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 24: [2022-11-25 20:47:45,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 23: [2022-11-25 20:47:45,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 24: [2022-11-25 20:47:45,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 23: [2022-11-25 20:47:45,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 8: [2022-11-25 20:47:45,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-25 20:47:45,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 20:47:45,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 21: [2022-11-25 20:47:45,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-25 20:47:45,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 16: [2022-11-25 20:47:45,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 21: [2022-11-25 20:47:45,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 16: [2022-11-25 20:47:45,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-25 20:47:45,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 2: [2022-11-25 20:47:45,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:47:45,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 20:47:45,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 9: [2022-11-25 20:47:45,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-25 20:47:45,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 20:47:45,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 4: [2022-11-25 20:47:45,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:47:45,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 20:47:45,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 6: [2022-11-25 20:47:45,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:47:45,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 20:47:45,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 28: [2022-11-25 20:47:45,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:47:45,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
28: [2022-11-25 20:47:45,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 0: [2022-11-25 20:47:45,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 28: [2022-11-25 20:47:45,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 0: [2022-11-25 20:47:45,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 23: [2022-11-25 20:47:45,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-25 20:47:45,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-25 20:47:45,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 10: [2022-11-25 20:47:45,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:47:45,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 20:47:45,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 9: [2022-11-25 20:47:45,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-25 20:47:45,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 20:47:45,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 1: [2022-11-25 20:47:45,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:47:45,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 3: [2022-11-25 20:47:45,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 22: [2022-11-25 20:47:45,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:47:45,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 22: [2022-11-25 20:47:45,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 1: [2022-11-25 20:47:45,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 3: [2022-11-25 20:47:45,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 22: [2022-11-25 20:47:45,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 8: [2022-11-25 20:47:45,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-25 20:47:45,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 20:47:45,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 31: [2022-11-25 20:47:45,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-25 20:47:45,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-25 20:47:45,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-25 20:47:45,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-25 20:47:45,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 31: [2022-11-25 20:47:45,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 14: [2022-11-25 20:47:45,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:47:45,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 20:47:45,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 22: [2022-11-25 20:47:45,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
21: [2022-11-25 20:47:45,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 22: [2022-11-25 20:47:45,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 21: [2022-11-25 20:47:45,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 22: [2022-11-25 20:47:45,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 21: [2022-11-25 20:47:45,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 0: [2022-11-25 20:47:45,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:47:45,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 20:47:45,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 24: [2022-11-25 20:47:45,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-25 20:47:45,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-25 20:47:45,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 17: [2022-11-25 20:47:45,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 
17: [2022-11-25 20:47:45,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-25 20:47:45,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 13: [2022-11-25 20:47:45,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:47:45,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 20:47:45,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 26: [2022-11-25 20:47:45,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-25 20:47:45,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-25 20:47:45,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 26: [2022-11-25 20:47:45,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-25 20:47:45,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-25 20:47:45,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 25: [2022-11-25 20:47:45,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
4: [2022-11-25 20:47:45,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:47:45,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 25: [2022-11-25 20:47:45,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 4: [2022-11-25 20:47:45,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 25: [2022-11-25 20:47:45,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 19: [2022-11-25 20:47:45,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 25: [2022-11-25 20:47:45,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 19: [2022-11-25 20:47:45,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 25: [2022-11-25 20:47:45,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 19: [2022-11-25 20:47:45,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 5: [2022-11-25 20:47:45,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 25: [2022-11-25 20:47:45,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
5: [2022-11-25 20:47:45,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 20:47:45,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 17: [2022-11-25 20:47:45,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-25 20:47:45,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-25 20:47:45,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 2: [2022-11-25 20:47:45,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:47:45,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 28: [2022-11-25 20:47:45,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:47:45,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 11: [2022-11-25 20:47:45,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
28: [2022-11-25 20:47:45,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 11: [2022-11-25 20:47:45,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 20:47:45,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 28: [2022-11-25 20:47:45,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 11: [2022-11-25 20:47:45,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 11: [2022-11-25 20:47:45,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 20:47:45,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 10: [2022-11-25 20:47:45,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:47:45,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 12: [2022-11-25 20:47:45,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 27: [2022-11-25 20:47:45,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
27: [2022-11-25 20:47:45,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:47:45,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 12: [2022-11-25 20:47:45,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 27: [2022-11-25 20:47:45,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-25 20:47:45,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 12: [2022-11-25 20:47:45,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 27: [2022-11-25 20:47:45,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 27: [2022-11-25 20:47:45,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 19: [2022-11-25 20:47:45,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-25 20:47:45,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-25 20:47:45,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 6: [2022-11-25 20:47:45,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-25 20:47:45,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 20:47:45,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 29: [2022-11-25 20:47:45,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-25 20:47:45,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-25 20:47:45,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 15: [2022-11-25 20:47:45,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:47:45,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 20:47:45,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 18: [2022-11-25 20:47:45,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-25 20:47:45,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-25 20:47:45,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 12: [2022-11-25 20:47:45,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-25 20:47:45,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 5: [2022-11-25 20:47:45,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:47:45,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 20:47:45,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 12: [2022-11-25 20:47:45,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 13: [2022-11-25 20:47:45,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:47:45,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 20:47:45,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 13: [2022-11-25 20:47:45,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:47:45,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 20:47:45,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 25: [2022-11-25 20:47:45,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
25: [2022-11-25 20:47:45,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-25 20:47:45,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 7: [2022-11-25 20:47:45,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:47:45,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 20:47:45,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:47:45,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:47:45,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 7: [2022-11-25 20:47:45,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 20:47:45,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 20:47:45,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 7: [2022-11-25 20:47:45,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 6: [2022-11-25 20:47:45,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-25 20:47:45,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 29: [2022-11-25 20:47:45,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:47:45,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 29: [2022-11-25 20:47:45,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-25 20:47:45,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 16: [2022-11-25 20:47:45,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:47:45,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 16: [2022-11-25 20:47:45,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 15: [2022-11-25 20:47:45,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 16: [2022-11-25 20:47:45,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 15: [2022-11-25 20:47:45,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 15: [2022-11-25 20:47:45,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-25 20:47:45,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 20:47:45,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 18: [2022-11-25 20:47:45,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-25 20:47:45,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-25 20:47:45,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 15: [2022-11-25 20:47:45,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:47:45,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 20:47:45,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 1: [2022-11-25 20:47:45,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:47:45,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 20:47:45,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 25: [2022-11-25 20:47:45,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
25: [2022-11-25 20:47:45,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-25 20:47:45,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 16: [2022-11-25 20:47:45,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-25 20:47:45,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-25 20:47:45,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-25 20:47:45,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-25 20:47:45,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 16: [2022-11-25 20:47:45,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 18: [2022-11-25 20:47:45,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-25 20:47:45,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-25 20:47:45,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 18: [2022-11-25 20:47:45,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
18: [2022-11-25 20:47:45,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-25 20:47:45,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 1: [2022-11-25 20:47:45,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:47:45,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 20:47:45,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:47:45,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 20:47:45,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 1: [2022-11-25 20:47:45,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 30: [2022-11-25 20:47:45,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-25 20:47:45,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-25 20:47:45,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 30: [2022-11-25 20:47:45,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 
30: [2022-11-25 20:47:45,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-25 20:47:45,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 20: [2022-11-25 20:47:45,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-25 20:47:45,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-25 20:47:45,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 20: [2022-11-25 20:47:45,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-25 20:47:45,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step12000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-25 20:47:45,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
0: successfully saved checkpoint at iteration 12000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2957.32 31: iteration 12010/ 173500 | consumed samples: 3074560 | consumed tokens: 6296698880 | elapsed time per iteration (s): 1.12 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.305211E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.314 | TFLOPs: 13.81 | 31: iteration 12020/ 173500 | consumed samples: 3077120 | consumed tokens: 6301941760 | elapsed time per iteration (s): 0.85 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.267276E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.608 | TFLOPs: 18.31 | 31: iteration 12030/ 173500 | consumed samples: 3079680 | consumed tokens: 6307184640 | elapsed time per iteration (s): 0.85 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.281079E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.012 | TFLOPs: 18.15 | 31: iteration 12040/ 173500 | consumed samples: 3082240 | consumed tokens: 6312427520 | elapsed time per iteration (s): 0.84 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.297998E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.053 | TFLOPs: 18.52 | 31: iteration 12050/ 173500 | consumed samples: 3084800 | consumed tokens: 6317670400 | elapsed time per iteration (s): 0.80 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.280074E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.417 | TFLOPs: 19.26 |
31: iteration 12060/ 173500 | consumed samples: 3087360 | consumed tokens: 6322913280 | elapsed time per iteration (s): 0.85 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.287884E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.873 | TFLOPs: 18.14 | 31: iteration 12070/ 173500 | consumed samples: 3089920 | consumed tokens: 6328156160 | elapsed time per iteration (s): 0.81 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.312940E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.142 | TFLOPs: 19.00 | 31: iteration 12080/ 173500 | consumed samples: 3092480 | consumed tokens: 6333399040 | elapsed time per iteration (s): 0.78 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.294468E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.409 | TFLOPs: 19.75 | 31: iteration 12090/ 173500 | consumed samples: 3095040 | consumed tokens: 6338641920 | elapsed time per iteration (s): 0.81 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.316769E+00 | grad norm: 3.377 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.340 | TFLOPs: 19.02 | 31: iteration 12100/ 173500 | consumed samples: 3097600 | consumed tokens: 6343884800 | elapsed time per iteration (s): 0.85 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 6.494936E+00 | grad norm: 4.020 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.281 | TFLOPs: 18.17 | 31: iteration 12110/ 173500 | consumed samples: 3100160 | consumed tokens: 6349127680 | elapsed time per iteration (s): 0.84 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 7.374171E+00 | grad norm: 1.832 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.449 | TFLOPs: 18.42 |
31: iteration 12120/ 173500 | consumed samples: 3102720 | consumed tokens: 6354370560 | elapsed time per iteration (s): 0.82 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 6.626413E+00 | grad norm: 1.291 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.002 | TFLOPs: 18.88 | 31: iteration 12130/ 173500 | consumed samples: 3105280 | consumed tokens: 6359613440 | elapsed time per iteration (s): 0.76 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 6.203856E+00 | grad norm: 0.876 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.273 | TFLOPs: 20.46 | 31: iteration 12140/ 173500 | consumed samples: 3107840 | consumed tokens: 6364856320 | elapsed time per iteration (s): 0.82 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 5.875702E+00 | grad norm: 1.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.039 | TFLOPs: 18.94 | 31: iteration 12150/ 173500 | consumed samples: 3110400 | consumed tokens: 6370099200 | elapsed time per iteration (s): 0.77 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 5.676043E+00 | grad norm: 0.521 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.146 | TFLOPs: 20.21 | 31: iteration 12160/ 173500 | consumed samples: 3112960 | consumed tokens: 6375342080 | elapsed time per iteration (s): 0.81 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 5.434065E+00 | grad norm: 0.495 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.954 | TFLOPs: 19.11 |
31: iteration 12170/ 173500 | consumed samples: 3115520 | consumed tokens: 6380584960 | elapsed time per iteration (s): 0.79 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 5.281439E+00 | grad norm: 0.542 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.562 | TFLOPs: 19.70 | 31: iteration 12180/ 173500 | consumed samples: 3118080 | consumed tokens: 6385827840 | elapsed time per iteration (s): 0.75 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 5.127155E+00 | grad norm: 0.510 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.891 | TFLOPs: 20.74 | 31: iteration 12190/ 173500 | consumed samples: 3120640 | consumed tokens: 6391070720 | elapsed time per iteration (s): 0.77 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 4.964106E+00 | grad norm: 0.401 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.695 | TFLOPs: 20.07 | 31: iteration 12200/ 173500 | consumed samples: 3123200 | consumed tokens: 6396313600 | elapsed time per iteration (s): 0.76 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 4.755995E+00 | grad norm: 0.992 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.660 | TFLOPs: 20.49 | 31: iteration 12210/ 173500 | consumed samples: 3125760 | consumed tokens: 6401556480 | elapsed time per iteration (s): 0.74 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 4.575487E+00 | grad norm: 0.847 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.350 | TFLOPs: 20.89 | 31: iteration 12220/ 173500 | consumed samples: 3128320 | consumed tokens: 6406799360 | elapsed time per iteration (s): 0.74 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 4.147025E+00 | grad norm: 1.057 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.059 | TFLOPs: 20.88 |
31: iteration 12230/ 173500 | consumed samples: 3130880 | consumed tokens: 6412042240 | elapsed time per iteration (s): 0.78 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 3.628024E+00 | grad norm: 1.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.832 | TFLOPs: 19.83 | 31: iteration 12240/ 173500 | consumed samples: 3133440 | consumed tokens: 6417285120 | elapsed time per iteration (s): 0.77 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 3.155402E+00 | grad norm: 0.568 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.691 | TFLOPs: 20.19 | 31: iteration 12250/ 173500 | consumed samples: 3136000 | consumed tokens: 6422528000 | elapsed time per iteration (s): 0.81 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.860653E+00 | grad norm: 0.466 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.962 | TFLOPs: 19.11 | 31: iteration 12260/ 173500 | consumed samples: 3138560 | consumed tokens: 6427770880 | elapsed time per iteration (s): 0.77 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.656636E+00 | grad norm: 0.307 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.333 | TFLOPs: 20.11 | 31: iteration 12270/ 173500 | consumed samples: 3141120 | consumed tokens: 6433013760 | elapsed time per iteration (s): 0.77 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.532144E+00 | grad norm: 0.283 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.045 | TFLOPs: 20.03 | 31: iteration 12280/ 173500 | consumed samples: 3143680 | consumed tokens: 6438256640 | elapsed time per iteration (s): 0.76 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.462138E+00 | grad norm: 0.277 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.285 | TFLOPs: 20.34 |
31: iteration 12290/ 173500 | consumed samples: 3146240 | consumed tokens: 6443499520 | elapsed time per iteration (s): 0.75 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.437219E+00 | grad norm: 0.231 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.501 | TFLOPs: 20.66 | 31: iteration 12300/ 173500 | consumed samples: 3148800 | consumed tokens: 6448742400 | elapsed time per iteration (s): 0.76 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.401092E+00 | grad norm: 0.373 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.812 | TFLOPs: 20.32 | 31: iteration 12310/ 173500 | consumed samples: 3151360 | consumed tokens: 6453985280 | elapsed time per iteration (s): 0.78 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.405912E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.296 | TFLOPs: 19.92 | 31: iteration 12320/ 173500 | consumed samples: 3153920 | consumed tokens: 6459228160 | elapsed time per iteration (s): 0.78 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.359804E+00 | grad norm: 0.236 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.929 | TFLOPs: 19.84 | 31: iteration 12330/ 173500 | consumed samples: 3156480 | consumed tokens: 6464471040 | elapsed time per iteration (s): 0.73 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.389628E+00 | grad norm: 0.237 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.799 | TFLOPs: 21.28 | 31: iteration 12340/ 173500 | consumed samples: 3159040 | consumed tokens: 6469713920 | elapsed time per iteration (s): 0.77 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.369552E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.367 | TFLOPs:
19.99 | 31: iteration 12350/ 173500 | consumed samples: 3161600 | consumed tokens: 6474956800 | elapsed time per iteration (s): 0.77 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.357997E+00 | grad norm: 0.225 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.819 | TFLOPs: 20.20 | 31: iteration 12360/ 173500 | consumed samples: 3164160 | consumed tokens: 6480199680 | elapsed time per iteration (s): 0.78 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.353341E+00 | grad norm: 0.239 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.305 | TFLOPs: 19.74 | 31: iteration 12370/ 173500 | consumed samples: 3166720 | consumed tokens: 6485442560 | elapsed time per iteration (s): 0.77 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.338209E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.745 | TFLOPs: 20.13 | 31: iteration 12380/ 173500 | consumed samples: 3169280 | consumed tokens: 6490685440 | elapsed time per iteration (s): 0.72 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.346867E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.248 | TFLOPs: 21.37 | 31: iteration 12390/ 173500 | consumed samples: 3171840 | consumed tokens: 6495928320 | elapsed time per iteration (s): 0.84 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.310281E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.762 | TFLOPs: 18.38 | 31: iteration 12400/ 173500 | consumed samples: 3174400 | consumed tokens: 6501171200 | elapsed time per iteration (s): 0.77 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.357955E+00 | grad norm: 0.185 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.160 | TFLOPs: 20.16 | 31: iteration 12410/ 173500 | consumed samples: 3176960 | consumed tokens: 6506414080 | elapsed time per iteration (s): 0.77 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.360217E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.389 | TFLOPs: 20.11 | 31: iteration 12420/ 173500 | consumed samples: 3179520 | consumed tokens: 6511656960 | elapsed time per iteration (s): 0.81 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.327647E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.614 | TFLOPs: 19.03 | 31: iteration 12430/ 173500 | consumed samples: 3182080 | consumed tokens: 6516899840 | elapsed time per iteration (s): 0.84 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.334868E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.423 | TFLOPs: 18.48 | 31: iteration 12440/ 173500 | consumed samples: 3184640 | consumed tokens: 6522142720 | elapsed time per iteration (s): 0.77 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.306397E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.656 | TFLOPs: 20.06 | 31: iteration 12450/ 173500 | consumed samples: 3187200 | consumed tokens: 6527385600 | elapsed time per iteration (s): 0.77 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.325282E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.929 | TFLOPs: 20.20 | 31: iteration 12460/ 173500 | consumed samples: 3189760 | consumed tokens: 6532628480 | elapsed time per iteration (s): 0.77 | learning rate: 1.983E-04 
| global batch size: 256 | lm loss: 2.352081E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.839 | TFLOPs: 20.01 | 31: iteration 12470/ 173500 | consumed samples: 3192320 | consumed tokens: 6537871360 | elapsed time per iteration (s): 0.75 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.322300E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.900 | TFLOPs: 20.62 | 31: iteration 12480/ 173500 | consumed samples: 3194880 | consumed tokens: 6543114240 | elapsed time per iteration (s): 0.72 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.297374E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.040 | TFLOPs: 21.66 | 31: iteration 12490/ 173500 | consumed samples: 3197440 | consumed tokens: 6548357120 | elapsed time per iteration (s): 0.79 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.294965E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.765 | TFLOPs: 19.53 | 31: iteration 12500/ 173500 | consumed samples: 3200000 | consumed tokens: 6553600000 | elapsed time per iteration (s): 0.79 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.296455E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.404 | TFLOPs: 19.69 | 31: iteration 12510/ 173500 | consumed samples: 3202560 | consumed tokens: 6558842880 | elapsed time per iteration (s): 0.83 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.298475E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.812 | TFLOPs: 18.68 | 31: iteration 12520/ 173500 | consumed samples: 3205120 | 
consumed tokens: 6564085760 | elapsed time per iteration (s): 0.80 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.323911E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.465 | TFLOPs: 19.45 | 31: iteration 12530/ 173500 | consumed samples: 3207680 | consumed tokens: 6569328640 | elapsed time per iteration (s): 0.79 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.300788E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.632 | TFLOPs: 19.52 | 31: iteration 12540/ 173500 | consumed samples: 3210240 | consumed tokens: 6574571520 | elapsed time per iteration (s): 0.83 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.283549E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.284 | TFLOPs: 18.71 | 31: iteration 12550/ 173500 | consumed samples: 3212800 | consumed tokens: 6579814400 | elapsed time per iteration (s): 0.89 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.314167E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.424 | TFLOPs: 17.39 | 31: iteration 12560/ 173500 | consumed samples: 3215360 | consumed tokens: 6585057280 | elapsed time per iteration (s): 0.83 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.347469E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.275 | TFLOPs: 18.77 | 31: iteration 12570/ 173500 | consumed samples: 3217920 | consumed tokens: 6590300160 | elapsed time per iteration (s): 0.91 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.310246E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 279.875 | TFLOPs: 16.93 | 31: iteration 12580/ 173500 | consumed samples: 3220480 | consumed tokens: 6595543040 | elapsed time per iteration (s): 0.83 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.311846E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.419 | TFLOPs: 18.66 | 31: iteration 12590/ 173500 | consumed samples: 3223040 | consumed tokens: 6600785920 | elapsed time per iteration (s): 0.89 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.330195E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.780 | TFLOPs: 17.35 | 31: iteration 12600/ 173500 | consumed samples: 3225600 | consumed tokens: 6606028800 | elapsed time per iteration (s): 0.84 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.305006E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.768 | TFLOPs: 18.44 | 31: iteration 12610/ 173500 | consumed samples: 3228160 | consumed tokens: 6611271680 | elapsed time per iteration (s): 0.82 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.298586E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.204 | TFLOPs: 18.83 | 31: iteration 12620/ 173500 | consumed samples: 3230720 | consumed tokens: 6616514560 | elapsed time per iteration (s): 0.84 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.279057E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.788 | TFLOPs: 18.44 | 31: iteration 12630/ 173500 | consumed samples: 3233280 | consumed tokens: 6621757440 | elapsed time per iteration (s): 0.79 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.333577E+00 | grad norm: 
0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.708 | TFLOPs: 19.58 | 31: iteration 12640/ 173500 | consumed samples: 3235840 | consumed tokens: 6627000320 | elapsed time per iteration (s): 0.81 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.310468E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.176 | TFLOPs: 19.13 | 31: iteration 12650/ 173500 | consumed samples: 3238400 | consumed tokens: 6632243200 | elapsed time per iteration (s): 0.81 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.287766E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.127 | TFLOPs: 19.06 | 31: iteration 12660/ 173500 | consumed samples: 3240960 | consumed tokens: 6637486080 | elapsed time per iteration (s): 0.84 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.298470E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.305 | TFLOPs: 18.53 | 31: iteration 12670/ 173500 | consumed samples: 3243520 | consumed tokens: 6642728960 | elapsed time per iteration (s): 0.84 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.297162E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.381 | TFLOPs: 18.35 | 31: iteration 12680/ 173500 | consumed samples: 3246080 | consumed tokens: 6647971840 | elapsed time per iteration (s): 0.85 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.328156E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.592 | TFLOPs: 18.31 | 31: iteration 12690/ 173500 | consumed samples: 3248640 | consumed tokens: 6653214720 | elapsed time per iteration (s): 
0.83 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.289050E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.457 | TFLOPs: 18.66 | 31: iteration 12700/ 173500 | consumed samples: 3251200 | consumed tokens: 6658457600 | elapsed time per iteration (s): 0.82 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.304768E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.468 | TFLOPs: 18.96 | 31: iteration 12710/ 173500 | consumed samples: 3253760 | consumed tokens: 6663700480 | elapsed time per iteration (s): 0.83 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.290837E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.005 | TFLOPs: 18.63 | 31: iteration 12720/ 173500 | consumed samples: 3256320 | consumed tokens: 6668943360 | elapsed time per iteration (s): 0.79 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.287376E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.613 | TFLOPs: 19.58 | 31: iteration 12730/ 173500 | consumed samples: 3258880 | consumed tokens: 6674186240 | elapsed time per iteration (s): 0.84 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.278748E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.229 | TFLOPs: 18.34 | 31: iteration 12740/ 173500 | consumed samples: 3261440 | consumed tokens: 6679429120 | elapsed time per iteration (s): 0.79 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.289276E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.888 | TFLOPs: 19.65 | 31: iteration 12750/ 
173500 | consumed samples: 3264000 | consumed tokens: 6684672000 | elapsed time per iteration (s): 0.82 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.324759E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.962 | TFLOPs: 18.99 | 31: iteration 12760/ 173500 | consumed samples: 3266560 | consumed tokens: 6689914880 | elapsed time per iteration (s): 0.81 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.298110E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.387 | TFLOPs: 19.20 | 31: iteration 12770/ 173500 | consumed samples: 3269120 | consumed tokens: 6695157760 | elapsed time per iteration (s): 0.81 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.286181E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.014 | TFLOPs: 19.12 | 31: iteration 12780/ 173500 | consumed samples: 3271680 | consumed tokens: 6700400640 | elapsed time per iteration (s): 0.81 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.301632E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.185 | TFLOPs: 19.07 | 31: iteration 12790/ 173500 | consumed samples: 3274240 | consumed tokens: 6705643520 | elapsed time per iteration (s): 0.82 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.287434E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.120 | TFLOPs: 18.94 | 31: iteration 12800/ 173500 | consumed samples: 3276800 | consumed tokens: 6710886400 | elapsed time per iteration (s): 0.84 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.306506E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 304.802 | TFLOPs: 18.44 | 31: iteration 12810/ 173500 | consumed samples: 3279360 | consumed tokens: 6716129280 | elapsed time per iteration (s): 0.82 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.298727E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.182 | TFLOPs: 18.83 | 31: iteration 12820/ 173500 | consumed samples: 3281920 | consumed tokens: 6721372160 | elapsed time per iteration (s): 0.80 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.290949E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.276 | TFLOPs: 19.44 | 31: iteration 12830/ 173500 | consumed samples: 3284480 | consumed tokens: 6726615040 | elapsed time per iteration (s): 0.81 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.295524E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.997 | TFLOPs: 19.12 | 31: iteration 12840/ 173500 | consumed samples: 3287040 | consumed tokens: 6731857920 | elapsed time per iteration (s): 0.83 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.292779E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.939 | TFLOPs: 18.75 | 31: iteration 12850/ 173500 | consumed samples: 3289600 | consumed tokens: 6737100800 | elapsed time per iteration (s): 0.79 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.254268E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.272 | TFLOPs: 19.68 | 31: iteration 12860/ 173500 | consumed samples: 3292160 | consumed tokens: 6742343680 | elapsed time per iteration (s): 0.84 | learning rate: 1.981E-04 | global batch size: 256 | 
lm loss: 2.284946E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.280 | TFLOPs: 18.53 | 31: iteration 12870/ 173500 | consumed samples: 3294720 | consumed tokens: 6747586560 | elapsed time per iteration (s): 0.80 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.283472E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.804 | TFLOPs: 19.29 | 31: iteration 12880/ 173500 | consumed samples: 3297280 | consumed tokens: 6752829440 | elapsed time per iteration (s): 0.83 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.300455E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.888 | TFLOPs: 18.63 | 31: iteration 12890/ 173500 | consumed samples: 3299840 | consumed tokens: 6758072320 | elapsed time per iteration (s): 0.84 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.301060E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.511 | TFLOPs: 18.54 | 31: iteration 12900/ 173500 | consumed samples: 3302400 | consumed tokens: 6763315200 | elapsed time per iteration (s): 0.81 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.294256E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.893 | TFLOPs: 19.17 | 31: iteration 12910/ 173500 | consumed samples: 3304960 | consumed tokens: 6768558080 | elapsed time per iteration (s): 0.85 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.293489E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.877 | TFLOPs: 18.20 | 31: iteration 12920/ 173500 | consumed samples: 3307520 | consumed tokens: 
6773800960 | elapsed time per iteration (s): 0.78 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.331364E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.722 | TFLOPs: 19.95 | 31: iteration 12930/ 173500 | consumed samples: 3310080 | consumed tokens: 6779043840 | elapsed time per iteration (s): 0.80 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.302187E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.040 | TFLOPs: 19.30 | 31: iteration 12940/ 173500 | consumed samples: 3312640 | consumed tokens: 6784286720 | elapsed time per iteration (s): 0.80 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.304458E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.818 | TFLOPs: 19.29 | 31: iteration 12950/ 173500 | consumed samples: 3315200 | consumed tokens: 6789529600 | elapsed time per iteration (s): 0.79 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.283033E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.686 | TFLOPs: 19.52 | 31: iteration 12960/ 173500 | consumed samples: 3317760 | consumed tokens: 6794772480 | elapsed time per iteration (s): 0.83 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.262208E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.031 | TFLOPs: 18.57 | 31: iteration 12970/ 173500 | consumed samples: 3320320 | consumed tokens: 6800015360 | elapsed time per iteration (s): 0.73 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.274457E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
353.076 | TFLOPs: 21.36 | 31: iteration 12980/ 173500 | consumed samples: 3322880 | consumed tokens: 6805258240 | elapsed time per iteration (s): 0.77 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.306498E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.625 | TFLOPs: 20.06 | 31: iteration 12990/ 173500 | consumed samples: 3325440 | consumed tokens: 6810501120 | elapsed time per iteration (s): 0.77 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.259422E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.806 | TFLOPs: 20.07 | 31: iteration 13000/ 173500 | consumed samples: 3328000 | consumed tokens: 6815744000 | elapsed time per iteration (s): 0.78 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.299077E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.669 | TFLOPs: 19.94 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 13000 | lm loss value: 2.246349E+00 | lm loss PPL: 9.453158E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 13000 to checkpoints_1b1long 0: [2022-11-25 21:01:07,514] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step13000 is begin to save! 0: [2022-11-25 21:01:07,524] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_01-model_00-model_states.pt... 0: [2022-11-25 21:01:07,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_01-model_00-model_states.pt. 
0: [2022-11-25 21:01:07,723] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_03-model_00-model_states.pt... 0: [2022-11-25 21:01:07,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_03-model_00-model_states.pt. 0: [2022-11-25 21:01:07,804] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_04-model_00-model_states.pt... 0: [2022-11-25 21:01:07,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_04-model_00-model_states.pt. 0: [2022-11-25 21:01:07,880] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_05-model_00-model_states.pt... 0: [2022-11-25 21:01:07,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_05-model_00-model_states.pt. 0: [2022-11-25 21:01:07,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_06-model_00-model_states.pt... 0: [2022-11-25 21:01:08,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_06-model_00-model_states.pt. 0: [2022-11-25 21:01:08,024] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_07-model_00-model_states.pt... 0: [2022-11-25 21:01:08,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_07-model_00-model_states.pt. 0: [2022-11-25 21:01:08,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_08-model_00-model_states.pt... 0: [2022-11-25 21:01:08,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_08-model_00-model_states.pt. 
0: [2022-11-25 21:01:08,170] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_09-model_00-model_states.pt... 0: [2022-11-25 21:01:08,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_09-model_00-model_states.pt. 0: [2022-11-25 21:01:08,242] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_10-model_00-model_states.pt... 0: [2022-11-25 21:01:08,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_10-model_00-model_states.pt. 0: [2022-11-25 21:01:08,315] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_11-model_00-model_states.pt... 0: [2022-11-25 21:01:08,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_11-model_00-model_states.pt. 0: [2022-11-25 21:01:08,388] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_12-model_00-model_states.pt... 0: [2022-11-25 21:01:08,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_12-model_00-model_states.pt. 0: [2022-11-25 21:01:08,466] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_13-model_00-model_states.pt... 0: [2022-11-25 21:01:08,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_13-model_00-model_states.pt. 0: [2022-11-25 21:01:08,539] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_14-model_00-model_states.pt... 0: [2022-11-25 21:01:08,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_14-model_00-model_states.pt. 
0: [2022-11-25 21:01:08,614] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_15-model_00-model_states.pt... 0: [2022-11-25 21:01:08,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_15-model_00-model_states.pt. 0: [2022-11-25 21:01:08,686] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_16-model_00-model_states.pt... 0: [2022-11-25 21:01:08,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_16-model_00-model_states.pt. 0: [2022-11-25 21:01:08,760] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_17-model_00-model_states.pt... 0: [2022-11-25 21:01:08,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_17-model_00-model_states.pt. 0: [2022-11-25 21:01:08,834] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_18-model_00-model_states.pt... 0: [2022-11-25 21:01:08,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_18-model_00-model_states.pt. 0: [2022-11-25 21:01:08,905] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_19-model_00-model_states.pt... 0: [2022-11-25 21:01:08,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_19-model_00-model_states.pt. 0: [2022-11-25 21:01:08,980] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_20-model_00-model_states.pt... 0: [2022-11-25 21:01:09,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_20-model_00-model_states.pt. 
0: [2022-11-25 21:01:09,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_21-model_00-model_states.pt... 0: [2022-11-25 21:01:09,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_21-model_00-model_states.pt. 0: [2022-11-25 21:01:09,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_22-model_00-model_states.pt... 0: [2022-11-25 21:01:09,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_22-model_00-model_states.pt. 0: [2022-11-25 21:01:09,200] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_23-model_00-model_states.pt... 0: [2022-11-25 21:01:09,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_23-model_00-model_states.pt. 0: [2022-11-25 21:01:09,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_24-model_00-model_states.pt... 0: [2022-11-25 21:01:09,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_24-model_00-model_states.pt. 0: [2022-11-25 21:01:09,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_25-model_00-model_states.pt... 0: [2022-11-25 21:01:09,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_25-model_00-model_states.pt. 0: [2022-11-25 21:01:09,419] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_26-model_00-model_states.pt... 0: [2022-11-25 21:01:09,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_26-model_00-model_states.pt. 
0: [2022-11-25 21:01:09,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_27-model_00-model_states.pt... 0: [2022-11-25 21:01:09,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_27-model_00-model_states.pt. 0: [2022-11-25 21:01:09,567] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_28-model_00-model_states.pt... 0: [2022-11-25 21:01:09,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_28-model_00-model_states.pt. 0: [2022-11-25 21:01:09,641] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/layer_30-model_00-model_states.pt... 0: [2022-11-25 21:01:09,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/layer_30-model_00-model_states.pt. 0: [2022-11-25 21:01:09,643] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step13000/mp_rank_00_model_states.pt 0: [2022-11-25 21:01:09,644] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/mp_rank_00_model_states.pt... 0: [2022-11-25 21:01:09,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/mp_rank_00_model_states.pt. 0: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
5: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 
1: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 
3: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 
25: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 
28: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 
31: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 
30: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 
18: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 
0: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
7: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 
8: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
16: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 
13: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 
15: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 
24: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 
17: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 
19: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 
8: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
20: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 
22: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 
19: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 
15: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 
0: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:01:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 
28: [2022-11-25 21:01:09,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:01:09,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-25 21:01:09,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 5: [2022-11-25 21:01:09,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:01:09,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 21:01:09,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 13: [2022-11-25 21:01:09,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:01:09,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 21:01:09,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 20: [2022-11-25 21:01:09,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:01:09,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-25 21:01:09,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
17: [2022-11-25 21:01:09,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:01:09,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-25 21:01:09,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 16: [2022-11-25 21:01:09,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:01:09,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 6: [2022-11-25 21:01:09,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:01:09,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 6: [2022-11-25 21:01:09,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 21:01:09,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 5: [2022-11-25 21:01:09,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:01:09,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 21:01:09,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
21: [2022-11-25 21:01:09,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:01:09,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-25 21:01:09,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 3: [2022-11-25 21:01:09,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:01:09,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:01:09,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 3: [2022-11-25 21:01:09,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 26: [2022-11-25 21:01:09,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 3: [2022-11-25 21:01:09,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 31: [2022-11-25 21:01:09,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:01:09,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
31: [2022-11-25 21:01:09,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 9: [2022-11-25 21:01:09,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 31: [2022-11-25 21:01:09,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 9: [2022-11-25 21:01:09,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 22: [2022-11-25 21:01:09,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:01:09,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-25 21:01:09,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 10: [2022-11-25 21:01:09,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:01:09,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 21:01:09,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 18: [2022-11-25 21:01:09,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
18: [2022-11-25 21:01:09,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-25 21:01:09,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 10: [2022-11-25 21:01:09,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:01:09,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:01:09,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 21:01:09,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 25: [2022-11-25 21:01:09,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-25 21:01:09,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 0: [2022-11-25 21:01:09,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:01:09,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:01:09,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-25 21:01:09,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
1: [2022-11-25 21:01:09,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:01:09,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:01:09,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-25 21:01:09,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 1: [2022-11-25 21:01:09,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 8: [2022-11-25 21:01:09,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:01:09,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 8: [2022-11-25 21:01:09,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 21:01:09,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 14: [2022-11-25 21:01:09,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:01:09,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 30: [2022-11-25 21:01:09,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
14: [2022-11-25 21:01:09,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 30: [2022-11-25 21:01:09,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-25 21:01:09,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 30: [2022-11-25 21:01:09,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:01:09,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:01:09,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:01:09,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 30: [2022-11-25 21:01:09,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 19: [2022-11-25 21:01:09,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 12: [2022-11-25 21:01:09,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 30: [2022-11-25 21:01:09,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 19: [2022-11-25 21:01:09,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
24: [2022-11-25 21:01:09,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:01:09,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-25 21:01:09,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 2: [2022-11-25 21:01:09,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:01:09,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:01:09,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 21:01:09,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 21:01:09,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 2: [2022-11-25 21:01:09,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 1: [2022-11-25 21:01:09,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:01:09,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-25 21:01:09,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 1: [2022-11-25 21:01:09,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 21:01:09,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 13: [2022-11-25 21:01:09,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 19: [2022-11-25 21:01:09,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:01:09,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 8: [2022-11-25 21:01:09,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:01:09,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 19: [2022-11-25 21:01:09,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 8: [2022-11-25 21:01:09,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 23: [2022-11-25 21:01:09,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:01:09,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-25 21:01:09,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:01:09,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 7: [2022-11-25 21:01:09,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 23: [2022-11-25 21:01:09,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 7: [2022-11-25 21:01:09,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 21:01:09,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 7: [2022-11-25 21:01:09,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 24: [2022-11-25 21:01:09,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:01:09,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-25 21:01:09,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 17: [2022-11-25 21:01:09,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 
17: [2022-11-25 21:01:09,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-25 21:01:09,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 25: [2022-11-25 21:01:09,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:01:09,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-25 21:01:09,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 16: [2022-11-25 21:01:09,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:01:09,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-25 21:01:09,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 23: [2022-11-25 21:01:09,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:01:09,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-25 21:01:09,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 12: [2022-11-25 21:01:09,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-25 21:01:09,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 21:01:09,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 14: [2022-11-25 21:01:09,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:01:09,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 21:01:09,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 21: [2022-11-25 21:01:09,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:01:09,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:01:09,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-25 21:01:09,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 0: [2022-11-25 21:01:09,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 21:01:09,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 20: [2022-11-25 21:01:09,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 
26: [2022-11-25 21:01:09,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:01:09,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 26: [2022-11-25 21:01:09,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 20: [2022-11-25 21:01:09,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 26: [2022-11-25 21:01:09,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 26: [2022-11-25 21:01:09,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:01:09,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-25 21:01:09,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 3: [2022-11-25 21:01:09,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:01:09,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 21:01:09,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 29: [2022-11-25 21:01:09,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
29: [2022-11-25 21:01:09,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-25 21:01:09,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 15: [2022-11-25 21:01:09,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:01:09,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 21:01:09,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 18: [2022-11-25 21:01:09,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:01:09,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-25 21:01:09,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 10: [2022-11-25 21:01:09,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:01:09,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 21:01:09,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 22: [2022-11-25 21:01:09,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-25 21:01:09,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-25 21:01:09,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 28: [2022-11-25 21:01:09,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:01:09,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:01:09,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-25 21:01:09,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 31: [2022-11-25 21:01:09,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-25 21:01:09,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-25 21:01:09,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 28: [2022-11-25 21:01:09,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-25 21:01:09,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 25: [2022-11-25 21:01:09,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
25: [2022-11-25 21:01:09,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-25 21:01:09,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 31: [2022-11-25 21:01:09,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-25 21:01:09,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-25 21:01:09,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 6: [2022-11-25 21:01:09,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:01:09,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:01:09,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
6: [2022-11-25 21:01:09,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 7: [2022-11-25 21:01:09,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 19: [2022-11-25 21:01:09,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 6: [2022-11-25 21:01:09,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 7: [2022-11-25 21:01:09,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 19: [2022-11-25 21:01:09,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 1: [2022-11-25 21:01:09,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:01:09,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 15: [2022-11-25 21:01:09,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:01:09,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 21:01:09,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 1: [2022-11-25 21:01:09,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
23: [2022-11-25 21:01:09,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:01:09,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-25 21:01:09,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 22: [2022-11-25 21:01:09,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:01:09,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-25 21:01:09,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 6: [2022-11-25 21:01:09,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:01:09,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 21:01:09,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 30: [2022-11-25 21:01:09,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:01:09,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-25 21:01:09,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
12: [2022-11-25 21:01:09,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:01:09,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 15: [2022-11-25 21:01:09,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:01:09,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 13: [2022-11-25 21:01:09,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:01:09,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 15: [2022-11-25 21:01:09,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 13: [2022-11-25 21:01:09,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 21:01:09,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 16: [2022-11-25 21:01:09,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:01:09,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-25 21:01:09,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
14: [2022-11-25 21:01:09,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:01:09,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 21:01:09,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 19: [2022-11-25 21:01:09,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:01:09,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-25 21:01:09,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 25: [2022-11-25 21:01:09,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:01:09,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-25 21:01:09,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 10: [2022-11-25 21:01:09,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:01:09,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 21:01:09,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
24: [2022-11-25 21:01:09,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:01:09,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 30: [2022-11-25 21:01:09,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:01:09,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 30: [2022-11-25 21:01:09,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-25 21:01:09,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 18: [2022-11-25 21:01:09,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:01:09,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:01:09,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 13: [2022-11-25 21:01:09,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 18: [2022-11-25 21:01:09,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 13: [2022-11-25 21:01:09,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
20: [2022-11-25 21:01:09,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:01:09,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-25 21:01:09,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 1: [2022-11-25 21:01:09,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:01:09,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 21:01:09,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 26: [2022-11-25 21:01:09,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:01:09,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-25 21:01:09,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 17: [2022-11-25 21:01:09,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:01:09,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-25 21:01:09,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
17: [2022-11-25 21:01:09,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:01:09,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-25 21:01:09,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 31: [2022-11-25 21:01:09,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-25 21:01:09,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-25 21:01:09,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 8: [2022-11-25 21:01:09,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:01:09,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 21:01:09,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 23: [2022-11-25 21:01:09,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:01:09,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-25 21:01:09,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
12: [2022-11-25 21:01:09,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:01:09,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 3: [2022-11-25 21:01:09,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:01:09,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 3: [2022-11-25 21:01:09,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 21:01:09,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 7: [2022-11-25 21:01:09,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:01:09,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 21:01:09,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 6: [2022-11-25 21:01:09,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:01:09,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 3: [2022-11-25 21:01:09,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
6: [2022-11-25 21:01:09,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 3: [2022-11-25 21:01:09,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 21:01:09,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 15: [2022-11-25 21:01:09,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:01:09,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 21:01:09,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 16: [2022-11-25 21:01:09,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:01:09,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-25 21:01:09,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 24: [2022-11-25 21:01:09,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:01:09,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-25 21:01:09,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
27: [2022-11-25 21:01:09,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:01:09,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-25 21:01:09,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 4: [2022-11-25 21:01:09,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:01:09,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 21:01:09,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 14: [2022-11-25 21:01:09,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:01:09,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 21:01:09,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 29: [2022-11-25 21:01:09,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:01:09,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-25 21:01:09,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
8: [2022-11-25 21:01:09,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:01:09,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 21:01:09,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 22: [2022-11-25 21:01:09,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:01:09,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-25 21:01:09,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 0: [2022-11-25 21:01:09,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:01:09,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 21:01:09,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 28: [2022-11-25 21:01:09,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:01:09,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-25 21:01:09,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
27: [2022-11-25 21:01:09,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:01:09,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-25 21:01:09,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 27: [2022-11-25 21:01:09,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:01:09,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-25 21:01:09,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 2: [2022-11-25 21:01:09,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:01:09,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 21:01:09,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 2: [2022-11-25 21:01:09,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:01:09,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
2: [2022-11-25 21:01:09,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 27: [2022-11-25 21:01:09,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 2: [2022-11-25 21:01:09,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 4: [2022-11-25 21:01:09,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:01:09,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 4: [2022-11-25 21:01:09,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 21:01:09,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 2: [2022-11-25 21:01:09,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:01:09,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 21:01:09,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 0: [2022-11-25 21:01:09,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-25 21:01:09,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 21:01:09,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 28: [2022-11-25 21:01:09,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:01:09,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-25 21:01:09,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 6: [2022-11-25 21:01:09,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:01:09,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:01:09,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 9: [2022-11-25 21:01:09,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 6: [2022-11-25 21:01:09,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 9: [2022-11-25 21:01:09,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 21: [2022-11-25 21:01:09,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-25 21:01:09,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-25 21:01:09,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 4: [2022-11-25 21:01:09,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:01:09,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:01:09,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 21:01:09,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 21:01:09,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 4: [2022-11-25 21:01:09,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 0: [2022-11-25 21:01:09,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 21:01:09,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 11: [2022-11-25 21:01:09,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-25 21:01:09,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 21:01:09,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 9: [2022-11-25 21:01:09,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:01:09,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 21:01:09,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 10: [2022-11-25 21:01:09,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:01:09,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 21:01:09,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 20: [2022-11-25 21:01:09,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:01:09,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-25 21:01:09,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 25: [2022-11-25 21:01:09,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
25: [2022-11-25 21:01:09,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-25 21:01:09,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 0: [2022-11-25 21:01:09,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:01:09,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 21:01:09,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 5: [2022-11-25 21:01:09,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:01:09,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 21:01:09,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 26: [2022-11-25 21:01:09,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:01:09,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-25 21:01:09,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 11: [2022-11-25 21:01:09,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-25 21:01:09,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 21:01:09,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 11: [2022-11-25 21:01:09,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:01:09,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 21:01:09,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 17: [2022-11-25 21:01:09,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:01:09,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-25 21:01:09,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 30: [2022-11-25 21:01:09,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:01:09,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-25 21:01:09,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 27: [2022-11-25 21:01:09,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-25 21:01:09,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-25 21:01:09,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 1: [2022-11-25 21:01:09,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:01:09,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 21:01:09,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 11: [2022-11-25 21:01:09,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:01:09,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 21:01:09,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 28: [2022-11-25 21:01:09,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:01:09,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-25 21:01:09,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 29: [2022-11-25 21:01:09,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-25 21:01:09,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-25 21:01:09,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 9: [2022-11-25 21:01:09,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:01:09,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 21:01:09,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 23: [2022-11-25 21:01:09,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:01:09,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-25 21:01:09,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 18: [2022-11-25 21:01:09,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:01:09,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-25 21:01:09,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 31: [2022-11-25 21:01:09,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 
31: [2022-11-25 21:01:09,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-25 21:01:09,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 19: [2022-11-25 21:01:09,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:01:09,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-25 21:01:09,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 16: [2022-11-25 21:01:09,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:01:09,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-25 21:01:09,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 13: [2022-11-25 21:01:09,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:01:09,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 21:01:09,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 5: [2022-11-25 21:01:09,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-25 21:01:09,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 21:01:09,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 12: [2022-11-25 21:01:09,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:01:09,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 21:01:09,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 15: [2022-11-25 21:01:09,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:01:09,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 18: [2022-11-25 21:01:09,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:01:09,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 18: [2022-11-25 21:01:09,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-25 21:01:09,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 3: [2022-11-25 21:01:09,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-25 21:01:09,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 21:01:09,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 29: [2022-11-25 21:01:09,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:01:09,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-25 21:01:09,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 4: [2022-11-25 21:01:09,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:01:09,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 21:01:09,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 7: [2022-11-25 21:01:09,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:01:09,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 21:01:09,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 29: [2022-11-25 21:01:09,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
29: [2022-11-25 21:01:09,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-25 21:01:09,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 8: [2022-11-25 21:01:09,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:01:09,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:01:09,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 21:01:09,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 24: [2022-11-25 21:01:09,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-25 21:01:09,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 22: [2022-11-25 21:01:09,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:01:09,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-25 21:01:09,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 6: [2022-11-25 21:01:09,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-25 21:01:09,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 21:01:09,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 21: [2022-11-25 21:01:09,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:01:09,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:01:09,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-25 21:01:09,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 14: [2022-11-25 21:01:09,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 21:01:09,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 2: [2022-11-25 21:01:09,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:01:09,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:01:09,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 21:01:09,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
10: [2022-11-25 21:01:09,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 21:01:09,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 20: [2022-11-25 21:01:09,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:01:09,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-25 21:01:09,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 28: [2022-11-25 21:01:09,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:01:09,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-25 21:01:09,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 25: [2022-11-25 21:01:09,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:01:09,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-25 21:01:09,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 0: [2022-11-25 21:01:09,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-25 21:01:09,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 21:01:09,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 5: [2022-11-25 21:01:09,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:01:09,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 21:01:09,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 26: [2022-11-25 21:01:09,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:01:09,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-25 21:01:09,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 11: [2022-11-25 21:01:09,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:01:09,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 21:01:09,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 30: [2022-11-25 21:01:09,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
30: [2022-11-25 21:01:09,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-25 21:01:09,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 17: [2022-11-25 21:01:09,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:01:09,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-25 21:01:09,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 9: [2022-11-25 21:01:09,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:01:09,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 21:01:09,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 27: [2022-11-25 21:01:09,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:01:09,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-25 21:01:09,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 1: [2022-11-25 21:01:09,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
19: [2022-11-25 21:01:09,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:01:09,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 19: [2022-11-25 21:01:09,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 1: [2022-11-25 21:01:09,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 19: [2022-11-25 21:01:09,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 13: [2022-11-25 21:01:09,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:01:09,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:01:09,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 21:01:09,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 23: [2022-11-25 21:01:09,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-25 21:01:09,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 12: [2022-11-25 21:01:09,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-25 21:01:09,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 21:01:09,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 3: [2022-11-25 21:01:09,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:01:09,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 21:01:09,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 18: [2022-11-25 21:01:09,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:01:09,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 31: [2022-11-25 21:01:09,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:01:09,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-25 21:01:09,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 16: [2022-11-25 21:01:09,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-25 21:01:09,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
31: [2022-11-25 21:01:09,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-25 21:01:09,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 15: [2022-11-25 21:01:09,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:01:09,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 21:01:09,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 29: [2022-11-25 21:01:09,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:01:09,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-25 21:01:09,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 4: [2022-11-25 21:01:09,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:01:09,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 21:01:09,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 24: [2022-11-25 21:01:09,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-25 21:01:09,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-25 21:01:09,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 21: [2022-11-25 21:01:09,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:01:09,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-25 21:01:09,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 8: [2022-11-25 21:01:09,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:01:09,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:01:09,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 21:01:09,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 6: [2022-11-25 21:01:09,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 21:01:09,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 14: [2022-11-25 21:01:09,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-25 21:01:09,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 21:01:09,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 5: [2022-11-25 21:01:09,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:01:09,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 21:01:09,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 2: [2022-11-25 21:01:09,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:01:09,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 21:01:09,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 20: [2022-11-25 21:01:09,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:01:09,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 26: [2022-11-25 21:01:09,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:01:09,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
26: [2022-11-25 21:01:09,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-25 21:01:09,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 28: [2022-11-25 21:01:09,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:01:09,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-25 21:01:09,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 25: [2022-11-25 21:01:09,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:01:09,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-25 21:01:09,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 1: [2022-11-25 21:01:09,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:01:09,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 
1: [2022-11-25 21:01:09,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 17: [2022-11-25 21:01:09,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 1: [2022-11-25 21:01:09,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 17: [2022-11-25 21:01:09,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 9: [2022-11-25 21:01:09,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:01:09,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 21:01:09,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 7: [2022-11-25 21:01:09,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:01:09,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:01:09,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 13: [2022-11-25 21:01:09,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:01:09,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
13: [2022-11-25 21:01:09,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 27: [2022-11-25 21:01:09,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 13: [2022-11-25 21:01:09,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 27: [2022-11-25 21:01:09,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 12: [2022-11-25 21:01:09,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:01:09,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 21:01:09,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 0: [2022-11-25 21:01:09,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:01:09,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 21:01:09,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 19: [2022-11-25 21:01:09,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
19: [2022-11-25 21:01:09,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-25 21:01:09,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 23: [2022-11-25 21:01:09,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:01:09,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-25 21:01:09,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 10: [2022-11-25 21:01:09,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:01:09,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 21:01:09,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 3: [2022-11-25 21:01:09,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:01:09,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 21:01:09,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 30: [2022-11-25 21:01:09,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 
31: [2022-11-25 21:01:09,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:01:09,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-25 21:01:09,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 31: [2022-11-25 21:01:09,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-25 21:01:09,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 22: [2022-11-25 21:01:09,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:01:09,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-25 21:01:09,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 18: [2022-11-25 21:01:09,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:01:09,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-25 21:01:09,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 15: [2022-11-25 21:01:09,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-25 21:01:09,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 21:01:09,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 5: [2022-11-25 21:01:09,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:01:09,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 21:01:09,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 9: [2022-11-25 21:01:09,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:01:09,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 21:01:09,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 29: [2022-11-25 21:01:09,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:01:09,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-25 21:01:09,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 21: [2022-11-25 21:01:09,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-25 21:01:09,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-25 21:01:09,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 22: [2022-11-25 21:01:09,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:01:09,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-25 21:01:09,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 13: [2022-11-25 21:01:09,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:01:09,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 21:01:09,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 7: [2022-11-25 21:01:09,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:01:09,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:01:09,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
14: [2022-11-25 21:01:09,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 7: [2022-11-25 21:01:09,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 14: [2022-11-25 21:01:09,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 30: [2022-11-25 21:01:09,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 7: [2022-11-25 21:01:09,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 30: [2022-11-25 21:01:09,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 1: [2022-11-25 21:01:09,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:01:09,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 21:01:09,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 19: [2022-11-25 21:01:09,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:01:09,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-25 21:01:09,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
18: [2022-11-25 21:01:09,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:01:09,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-25 21:01:09,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 8: [2022-11-25 21:01:09,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:01:09,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:01:09,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 21:01:09,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 21:01:09,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 8: [2022-11-25 21:01:09,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 24: [2022-11-25 21:01:09,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 31: [2022-11-25 21:01:09,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
24: [2022-11-25 21:01:09,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-25 21:01:09,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 31: [2022-11-25 21:01:09,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-25 21:01:09,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 27: [2022-11-25 21:01:09,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:01:09,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-25 21:01:09,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 20: [2022-11-25 21:01:09,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:01:09,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-25 21:01:09,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 6: [2022-11-25 21:01:09,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-25 21:01:09,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 21:01:09,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 25: [2022-11-25 21:01:09,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:01:09,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:01:09,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:01:09,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 29: [2022-11-25 21:01:09,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 4: [2022-11-25 21:01:09,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 29: [2022-11-25 21:01:09,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 26: [2022-11-25 21:01:09,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:01:09,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:01:09,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
25: [2022-11-25 21:01:09,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 26: [2022-11-25 21:01:09,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 3: [2022-11-25 21:01:09,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 26: [2022-11-25 21:01:09,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 3: [2022-11-25 21:01:09,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 12: [2022-11-25 21:01:09,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:01:09,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 21:01:09,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 9: [2022-11-25 21:01:09,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:01:09,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 21:01:09,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 15: [2022-11-25 21:01:09,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
0: [2022-11-25 21:01:09,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:01:09,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 21:01:09,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 0: [2022-11-25 21:01:09,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 10: [2022-11-25 21:01:09,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:01:09,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 10: [2022-11-25 21:01:09,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 21:01:09,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 14: [2022-11-25 21:01:09,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:01:09,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 21:01:09,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 5: [2022-11-25 21:01:09,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-25 21:01:09,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 21:01:09,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 24: [2022-11-25 21:01:09,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:01:09,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-25 21:01:09,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 2: [2022-11-25 21:01:09,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:01:09,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 21:01:09,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 23: [2022-11-25 21:01:09,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:01:09,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-25 21:01:09,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 22: [2022-11-25 21:01:09,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
22: [2022-11-25 21:01:09,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-25 21:01:09,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 4: [2022-11-25 21:01:09,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:01:09,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 28: [2022-11-25 21:01:09,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:01:09,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 28: [2022-11-25 21:01:09,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-25 21:01:09,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 16: [2022-11-25 21:01:09,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:01:09,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-25 21:01:09,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 7: [2022-11-25 21:01:09,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-25 21:01:09,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 21:01:09,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 16: [2022-11-25 21:01:09,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:01:09,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-25 21:01:09,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 17: [2022-11-25 21:01:09,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:01:09,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-25 21:01:09,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 11: [2022-11-25 21:01:10,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:01:10,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 21:01:10,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 11: [2022-11-25 21:01:10,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-25 21:01:10,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 21:01:10,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 11: [2022-11-25 21:01:10,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:01:10,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step13000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 21:01:10,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 0: successfully saved checkpoint at iteration 13000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2763.59 31: iteration 13010/ 173500 | consumed samples: 3330560 | consumed tokens: 6820986880 | elapsed time per iteration (s): 1.05 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.305014E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.630 | TFLOPs: 14.74 | 31: iteration 13020/ 173500 | consumed samples: 3333120 | consumed tokens: 6826229760 | elapsed time per iteration (s): 0.82 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.283697E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.119 | TFLOPs: 18.82 | 31: iteration 13030/ 173500 | consumed samples: 3335680 | consumed tokens: 6831472640 | elapsed time per iteration (s): 1.06 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.270870E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.405 | TFLOPs: 14.60 | 31: iteration 13040/ 173500 | 
consumed samples: 3338240 | consumed tokens: 6836715520 | elapsed time per iteration (s): 0.81 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.291435E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.585 | TFLOPs: 19.15 | 31: iteration 13050/ 173500 | consumed samples: 3340800 | consumed tokens: 6841958400 | elapsed time per iteration (s): 0.74 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.253961E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.495 | TFLOPs: 20.90 | 31: iteration 13060/ 173500 | consumed samples: 3343360 | consumed tokens: 6847201280 | elapsed time per iteration (s): 0.84 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.279734E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.179 | TFLOPs: 18.34 | 31: iteration 13070/ 173500 | consumed samples: 3345920 | consumed tokens: 6852444160 | elapsed time per iteration (s): 0.80 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.295971E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.148 | TFLOPs: 19.43 | 31: iteration 13080/ 173500 | consumed samples: 3348480 | consumed tokens: 6857687040 | elapsed time per iteration (s): 0.80 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.312758E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.117 | TFLOPs: 19.43 | 31: iteration 13090/ 173500 | consumed samples: 3351040 | consumed tokens: 6862929920 | elapsed time per iteration (s): 0.75 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.281107E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 342.928 | TFLOPs: 20.75 | 31: iteration 13100/ 173500 | consumed samples: 3353600 | consumed tokens: 6868172800 | elapsed time per iteration (s): 0.78 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.313731E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.268 | TFLOPs: 19.74 | 31: iteration 13110/ 173500 | consumed samples: 3356160 | consumed tokens: 6873415680 | elapsed time per iteration (s): 0.73 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.281323E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.235 | TFLOPs: 21.19 | 31: iteration 13120/ 173500 | consumed samples: 3358720 | consumed tokens: 6878658560 | elapsed time per iteration (s): 0.73 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.290077E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.945 | TFLOPs: 21.23 | 31: iteration 13130/ 173500 | consumed samples: 3361280 | consumed tokens: 6883901440 | elapsed time per iteration (s): 0.83 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.267618E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.905 | TFLOPs: 18.57 | 31: iteration 13140/ 173500 | consumed samples: 3363840 | consumed tokens: 6889144320 | elapsed time per iteration (s): 0.76 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.244446E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.681 | TFLOPs: 20.37 | 31: iteration 13150/ 173500 | consumed samples: 3366400 | consumed tokens: 6894387200 | elapsed time per iteration (s): 0.73 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 
2.293170E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.997 | TFLOPs: 21.36 | 31: iteration 13160/ 173500 | consumed samples: 3368960 | consumed tokens: 6899630080 | elapsed time per iteration (s): 0.79 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.295793E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.463 | TFLOPs: 19.63 | 31: iteration 13170/ 173500 | consumed samples: 3371520 | consumed tokens: 6904872960 | elapsed time per iteration (s): 0.75 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.286204E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.475 | TFLOPs: 20.72 | 31: iteration 13180/ 173500 | consumed samples: 3374080 | consumed tokens: 6910115840 | elapsed time per iteration (s): 0.81 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.255316E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.469 | TFLOPs: 19.02 | 31: iteration 13190/ 173500 | consumed samples: 3376640 | consumed tokens: 6915358720 | elapsed time per iteration (s): 0.75 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.294560E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.176 | TFLOPs: 20.64 | 31: iteration 13200/ 173500 | consumed samples: 3379200 | consumed tokens: 6920601600 | elapsed time per iteration (s): 0.82 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.282937E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.687 | TFLOPs: 18.92 | 31: iteration 13210/ 173500 | consumed samples: 3381760 | consumed tokens: 6925844480 | 
elapsed time per iteration (s): 0.76 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.254302E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.933 | TFLOPs: 20.44 | 31: iteration 13220/ 173500 | consumed samples: 3384320 | consumed tokens: 6931087360 | elapsed time per iteration (s): 0.80 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.303704E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.922 | TFLOPs: 19.48 | 31: iteration 13230/ 173500 | consumed samples: 3386880 | consumed tokens: 6936330240 | elapsed time per iteration (s): 0.77 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.293402E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.433 | TFLOPs: 20.05 | 31: iteration 13240/ 173500 | consumed samples: 3389440 | consumed tokens: 6941573120 | elapsed time per iteration (s): 0.73 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.286084E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.050 | TFLOPs: 21.30 | 31: iteration 13250/ 173500 | consumed samples: 3392000 | consumed tokens: 6946816000 | elapsed time per iteration (s): 0.76 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.291768E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.894 | TFLOPs: 20.32 | 31: iteration 13260/ 173500 | consumed samples: 3394560 | consumed tokens: 6952058880 | elapsed time per iteration (s): 0.79 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.300927E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.487 | TFLOPs: 
19.51 | 31: iteration 13270/ 173500 | consumed samples: 3397120 | consumed tokens: 6957301760 | elapsed time per iteration (s): 0.80 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.302271E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.434 | TFLOPs: 19.39 | 31: iteration 13280/ 173500 | consumed samples: 3399680 | consumed tokens: 6962544640 | elapsed time per iteration (s): 0.80 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.257277E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.012 | TFLOPs: 19.30 | 31: iteration 13290/ 173500 | consumed samples: 3402240 | consumed tokens: 6967787520 | elapsed time per iteration (s): 0.78 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.302622E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.277 | TFLOPs: 19.86 | 31: iteration 13300/ 173500 | consumed samples: 3404800 | consumed tokens: 6973030400 | elapsed time per iteration (s): 0.82 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.292819E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.702 | TFLOPs: 18.86 | 31: iteration 13310/ 173500 | consumed samples: 3407360 | consumed tokens: 6978273280 | elapsed time per iteration (s): 0.91 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.294430E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 281.665 | TFLOPs: 17.04 | 31: iteration 13320/ 173500 | consumed samples: 3409920 | consumed tokens: 6983516160 | elapsed time per iteration (s): 0.84 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.278388E+00 | grad norm: 0.159 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.065 | TFLOPs: 18.33 | 31: iteration 13330/ 173500 | consumed samples: 3412480 | consumed tokens: 6988759040 | elapsed time per iteration (s): 0.79 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.277385E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.306 | TFLOPs: 19.68 | 31: iteration 13340/ 173500 | consumed samples: 3415040 | consumed tokens: 6994001920 | elapsed time per iteration (s): 0.84 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.278785E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.015 | TFLOPs: 18.33 | 31: iteration 13350/ 173500 | consumed samples: 3417600 | consumed tokens: 6999244800 | elapsed time per iteration (s): 0.80 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.275207E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.340 | TFLOPs: 19.26 | 31: iteration 13360/ 173500 | consumed samples: 3420160 | consumed tokens: 7004487680 | elapsed time per iteration (s): 0.84 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.284807E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.887 | TFLOPs: 18.38 | 31: iteration 13370/ 173500 | consumed samples: 3422720 | consumed tokens: 7009730560 | elapsed time per iteration (s): 0.83 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.282373E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.654 | TFLOPs: 18.55 | 31: iteration 13380/ 173500 | consumed samples: 3425280 | consumed tokens: 7014973440 | elapsed time per iteration (s): 0.84 | learning rate: 1.980E-04 
| global batch size: 256 | lm loss: 2.278179E+00 | grad norm: 0.270 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.680 | TFLOPs: 18.43 | 31: iteration 13390/ 173500 | consumed samples: 3427840 | consumed tokens: 7020216320 | elapsed time per iteration (s): 0.79 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.285934E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.923 | TFLOPs: 19.60 | 31: iteration 13400/ 173500 | consumed samples: 3430400 | consumed tokens: 7025459200 | elapsed time per iteration (s): 0.84 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.288238E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.956 | TFLOPs: 18.39 | 31: iteration 13410/ 173500 | consumed samples: 3432960 | consumed tokens: 7030702080 | elapsed time per iteration (s): 0.81 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.311751E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.837 | TFLOPs: 19.17 | 31: iteration 13420/ 173500 | consumed samples: 3435520 | consumed tokens: 7035944960 | elapsed time per iteration (s): 0.86 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.321083E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.742 | TFLOPs: 18.01 | 31: iteration 13430/ 173500 | consumed samples: 3438080 | consumed tokens: 7041187840 | elapsed time per iteration (s): 1.19 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.267711E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 214.704 | TFLOPs: 12.99 | 31: iteration 13440/ 173500 | consumed samples: 3440640 | 
consumed tokens: 7046430720 | elapsed time per iteration (s): 0.79 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.280240E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.526 | TFLOPs: 19.51 | 31: iteration 13450/ 173500 | consumed samples: 3443200 | consumed tokens: 7051673600 | elapsed time per iteration (s): 0.78 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.292089E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.496 | TFLOPs: 19.87 | 31: iteration 13460/ 173500 | consumed samples: 3445760 | consumed tokens: 7056916480 | elapsed time per iteration (s): 0.81 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.305664E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.767 | TFLOPs: 19.04 | 31: iteration 13470/ 173500 | consumed samples: 3448320 | consumed tokens: 7062159360 | elapsed time per iteration (s): 0.81 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.255733E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.486 | TFLOPs: 19.15 | 31: iteration 13480/ 173500 | consumed samples: 3450880 | consumed tokens: 7067402240 | elapsed time per iteration (s): 0.81 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.275366E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.539 | TFLOPs: 19.21 | 31: iteration 13490/ 173500 | consumed samples: 3453440 | consumed tokens: 7072645120 | elapsed time per iteration (s): 0.79 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.265645E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 324.101 | TFLOPs: 19.61 | 31: iteration 13500/ 173500 | consumed samples: 3456000 | consumed tokens: 7077888000 | elapsed time per iteration (s): 0.83 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.258363E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.372 | TFLOPs: 18.66 | 31: iteration 13510/ 173500 | consumed samples: 3458560 | consumed tokens: 7083130880 | elapsed time per iteration (s): 0.82 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.297226E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.605 | TFLOPs: 18.97 | 31: iteration 13520/ 173500 | consumed samples: 3461120 | consumed tokens: 7088373760 | elapsed time per iteration (s): 0.80 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.299631E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.397 | TFLOPs: 19.26 | 31: iteration 13530/ 173500 | consumed samples: 3463680 | consumed tokens: 7093616640 | elapsed time per iteration (s): 0.79 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.266674E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.361 | TFLOPs: 19.50 | 31: iteration 13540/ 173500 | consumed samples: 3466240 | consumed tokens: 7098859520 | elapsed time per iteration (s): 0.88 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.303637E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.525 | TFLOPs: 17.58 | 31: iteration 13550/ 173500 | consumed samples: 3468800 | consumed tokens: 7104102400 | elapsed time per iteration (s): 0.83 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.285085E+00 | grad norm: 
0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.973 | TFLOPs: 18.57 | 31: iteration 13560/ 173500 | consumed samples: 3471360 | consumed tokens: 7109345280 | elapsed time per iteration (s): 0.81 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.274303E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.773 | TFLOPs: 19.16 | 31: iteration 13570/ 173500 | consumed samples: 3473920 | consumed tokens: 7114588160 | elapsed time per iteration (s): 0.81 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.291911E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.848 | TFLOPs: 19.17 | 31: iteration 13580/ 173500 | consumed samples: 3476480 | consumed tokens: 7119831040 | elapsed time per iteration (s): 0.83 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.279757E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.180 | TFLOPs: 18.64 | 31: iteration 13590/ 173500 | consumed samples: 3479040 | consumed tokens: 7125073920 | elapsed time per iteration (s): 0.86 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.269309E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.970 | TFLOPs: 18.09 | 31: iteration 13600/ 173500 | consumed samples: 3481600 | consumed tokens: 7130316800 | elapsed time per iteration (s): 0.85 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.318515E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.417 | TFLOPs: 18.30 | 31: iteration 13610/ 173500 | consumed samples: 3484160 | consumed tokens: 7135559680 | elapsed time per iteration (s): 
0.81 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.285376E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.081 | TFLOPs: 19.12 | 31: iteration 13620/ 173500 | consumed samples: 3486720 | consumed tokens: 7140802560 | elapsed time per iteration (s): 0.83 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.266346E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.174 | TFLOPs: 18.70 | 31: iteration 13630/ 173500 | consumed samples: 3489280 | consumed tokens: 7146045440 | elapsed time per iteration (s): 0.78 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.285356E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.305 | TFLOPs: 19.92 | 31: iteration 13640/ 173500 | consumed samples: 3491840 | consumed tokens: 7151288320 | elapsed time per iteration (s): 0.82 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.260212E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.372 | TFLOPs: 18.84 | 31: iteration 13650/ 173500 | consumed samples: 3494400 | consumed tokens: 7156531200 | elapsed time per iteration (s): 0.83 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.296285E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.518 | TFLOPs: 18.60 | 31: iteration 13660/ 173500 | consumed samples: 3496960 | consumed tokens: 7161774080 | elapsed time per iteration (s): 0.87 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.275806E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.397 | TFLOPs: 17.81 | 31: iteration 13670/ 
173500 | consumed samples: 3499520 | consumed tokens: 7167016960 | elapsed time per iteration (s): 0.81 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.300425E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.314 | TFLOPs: 19.14 | 31: iteration 13680/ 173500 | consumed samples: 3502080 | consumed tokens: 7172259840 | elapsed time per iteration (s): 0.80 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.269654E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.342 | TFLOPs: 19.44 | 31: iteration 13690/ 173500 | consumed samples: 3504640 | consumed tokens: 7177502720 | elapsed time per iteration (s): 0.80 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.291383E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.473 | TFLOPs: 19.33 | 31: iteration 13700/ 173500 | consumed samples: 3507200 | consumed tokens: 7182745600 | elapsed time per iteration (s): 0.85 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.309109E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.055 | TFLOPs: 18.21 | 31: iteration 13710/ 173500 | consumed samples: 3509760 | consumed tokens: 7187988480 | elapsed time per iteration (s): 0.81 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.277607E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.785 | TFLOPs: 19.04 | 31: iteration 13720/ 173500 | consumed samples: 3512320 | consumed tokens: 7193231360 | elapsed time per iteration (s): 0.81 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.254842E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 316.540 | TFLOPs: 19.15 | 31: iteration 13730/ 173500 | consumed samples: 3514880 | consumed tokens: 7198474240 | elapsed time per iteration (s): 0.85 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.285253E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.027 | TFLOPs: 18.15 | 31: iteration 13740/ 173500 | consumed samples: 3517440 | consumed tokens: 7203717120 | elapsed time per iteration (s): 0.86 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.291073E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.600 | TFLOPs: 18.06 | 31: iteration 13750/ 173500 | consumed samples: 3520000 | consumed tokens: 7208960000 | elapsed time per iteration (s): 0.85 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.285910E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.931 | TFLOPs: 18.33 | 31: iteration 13760/ 173500 | consumed samples: 3522560 | consumed tokens: 7214202880 | elapsed time per iteration (s): 0.81 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.257027E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.774 | TFLOPs: 19.04 | 31: iteration 13770/ 173500 | consumed samples: 3525120 | consumed tokens: 7219445760 | elapsed time per iteration (s): 0.82 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.317116E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.748 | TFLOPs: 18.98 | 31: iteration 13780/ 173500 | consumed samples: 3527680 | consumed tokens: 7224688640 | elapsed time per iteration (s): 0.85 | learning rate: 1.978E-04 | global batch size: 256 | 
lm loss: 2.259090E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.814 | TFLOPs: 18.14 | 31: iteration 13790/ 173500 | consumed samples: 3530240 | consumed tokens: 7229931520 | elapsed time per iteration (s): 0.79 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.289703E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.239 | TFLOPs: 19.49 | 31: iteration 13800/ 173500 | consumed samples: 3532800 | consumed tokens: 7235174400 | elapsed time per iteration (s): 0.80 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.293368E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.077 | TFLOPs: 19.36 | 31: iteration 13810/ 173500 | consumed samples: 3535360 | consumed tokens: 7240417280 | elapsed time per iteration (s): 0.84 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.260898E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.457 | TFLOPs: 18.48 | 31: iteration 13820/ 173500 | consumed samples: 3537920 | consumed tokens: 7245660160 | elapsed time per iteration (s): 0.79 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.272036E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.435 | TFLOPs: 19.63 | 31: iteration 13830/ 173500 | consumed samples: 3540480 | consumed tokens: 7250903040 | elapsed time per iteration (s): 0.78 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.294530E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.536 | TFLOPs: 19.82 | 31: iteration 13840/ 173500 | consumed samples: 3543040 | consumed tokens: 
7256145920 | elapsed time per iteration (s): 0.75 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.265613E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.297 | TFLOPs: 20.65 | 31: iteration 13850/ 173500 | consumed samples: 3545600 | consumed tokens: 7261388800 | elapsed time per iteration (s): 0.84 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.288703E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.720 | TFLOPs: 18.43 | 31: iteration 13860/ 173500 | consumed samples: 3548160 | consumed tokens: 7266631680 | elapsed time per iteration (s): 0.83 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.321547E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.870 | TFLOPs: 18.75 | 31: iteration 13870/ 173500 | consumed samples: 3550720 | consumed tokens: 7271874560 | elapsed time per iteration (s): 0.84 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.286994E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.033 | TFLOPs: 18.39 | 31: iteration 13880/ 173500 | consumed samples: 3553280 | consumed tokens: 7277117440 | elapsed time per iteration (s): 0.81 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.245724E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.386 | TFLOPs: 19.02 | 31: iteration 13890/ 173500 | consumed samples: 3555840 | consumed tokens: 7282360320 | elapsed time per iteration (s): 0.82 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.271540E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
310.935 | TFLOPs: 18.81 | 31: iteration 13900/ 173500 | consumed samples: 3558400 | consumed tokens: 7287603200 | elapsed time per iteration (s): 0.81 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.256631E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.970 | TFLOPs: 19.24 | 31: iteration 13910/ 173500 | consumed samples: 3560960 | consumed tokens: 7292846080 | elapsed time per iteration (s): 0.79 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.293236E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.853 | TFLOPs: 19.65 | 31: iteration 13920/ 173500 | consumed samples: 3563520 | consumed tokens: 7298088960 | elapsed time per iteration (s): 0.79 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.263169E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.245 | TFLOPs: 19.68 | 31: iteration 13930/ 173500 | consumed samples: 3566080 | consumed tokens: 7303331840 | elapsed time per iteration (s): 0.74 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.276247E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.627 | TFLOPs: 20.97 | 31: iteration 13940/ 173500 | consumed samples: 3568640 | consumed tokens: 7308574720 | elapsed time per iteration (s): 0.80 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.299104E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.548 | TFLOPs: 19.39 | 31: iteration 13950/ 173500 | consumed samples: 3571200 | consumed tokens: 7313817600 | elapsed time per iteration (s): 0.78 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.293753E+00 | grad norm: 0.157 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.698 | TFLOPs: 19.89 | 31: iteration 13960/ 173500 | consumed samples: 3573760 | consumed tokens: 7319060480 | elapsed time per iteration (s): 0.75 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.278951E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.888 | TFLOPs: 20.68 | 31: iteration 13970/ 173500 | consumed samples: 3576320 | consumed tokens: 7324303360 | elapsed time per iteration (s): 0.80 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.265211E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.385 | TFLOPs: 19.44 | 31: iteration 13980/ 173500 | consumed samples: 3578880 | consumed tokens: 7329546240 | elapsed time per iteration (s): 0.76 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.279489E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.579 | TFLOPs: 20.36 | 31: iteration 13990/ 173500 | consumed samples: 3581440 | consumed tokens: 7334789120 | elapsed time per iteration (s): 0.73 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.248582E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.511 | TFLOPs: 21.21 | 0: [2022-11-25 21:14:40,162] [INFO] [logging.py:68:log_dist] [Rank 0] step=14000, skipped=0, lr=[0.00019774496681175836, 0.00019774496681175836, 0.00019774496681175836], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 14000/ 173500 | consumed samples: 3584000 | consumed tokens: 7340032000 | elapsed time per iteration (s): 0.74 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.255094E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 348.138 | TFLOPs: 21.06 |
0: steps: 14000 loss: 2.2727 iter time (s): 0.803 samples/sec: 318.852
31: -------------------------------------------------------------------------------------------
31: valid loss at iteration 14000 | lm loss value: 2.214038E+00 | lm loss PPL: 9.152597E+00 |
31: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 14000 to checkpoints_1b1long
0: [2022-11-25 21:14:40,429] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step14000 is begin to save!
0: [2022-11-25 21:14:40,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_01-model_00-model_states.pt...
0: [2022-11-25 21:14:40,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_01-model_00-model_states.pt.
0: [2022-11-25 21:14:40,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_03-model_00-model_states.pt...
0: [2022-11-25 21:14:40,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_03-model_00-model_states.pt.
0: [2022-11-25 21:14:40,713] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_04-model_00-model_states.pt...
0: [2022-11-25 21:14:40,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_04-model_00-model_states.pt.
0: [2022-11-25 21:14:40,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_05-model_00-model_states.pt...
0: [2022-11-25 21:14:40,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_05-model_00-model_states.pt.
0: [2022-11-25 21:14:40,862] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_06-model_00-model_states.pt... 0: [2022-11-25 21:14:40,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_06-model_00-model_states.pt. 0: [2022-11-25 21:14:40,933] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_07-model_00-model_states.pt... 0: [2022-11-25 21:14:41,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_07-model_00-model_states.pt. 0: [2022-11-25 21:14:41,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_08-model_00-model_states.pt... 0: [2022-11-25 21:14:41,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_08-model_00-model_states.pt. 0: [2022-11-25 21:14:41,086] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_09-model_00-model_states.pt... 0: [2022-11-25 21:14:41,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_09-model_00-model_states.pt. 0: [2022-11-25 21:14:41,158] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_10-model_00-model_states.pt... 0: [2022-11-25 21:14:41,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_10-model_00-model_states.pt. 0: [2022-11-25 21:14:41,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_11-model_00-model_states.pt... 0: [2022-11-25 21:14:41,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_11-model_00-model_states.pt. 
0: [2022-11-25 21:14:41,305] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_12-model_00-model_states.pt... 0: [2022-11-25 21:14:41,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_12-model_00-model_states.pt. 0: [2022-11-25 21:14:41,380] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_13-model_00-model_states.pt... 0: [2022-11-25 21:14:41,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_13-model_00-model_states.pt. 0: [2022-11-25 21:14:41,454] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_14-model_00-model_states.pt... 0: [2022-11-25 21:14:41,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_14-model_00-model_states.pt. 0: [2022-11-25 21:14:41,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_15-model_00-model_states.pt... 0: [2022-11-25 21:14:41,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_15-model_00-model_states.pt. 0: [2022-11-25 21:14:41,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_16-model_00-model_states.pt... 0: [2022-11-25 21:14:41,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_16-model_00-model_states.pt. 0: [2022-11-25 21:14:41,677] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_17-model_00-model_states.pt... 0: [2022-11-25 21:14:41,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_17-model_00-model_states.pt. 
0: [2022-11-25 21:14:41,752] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_18-model_00-model_states.pt... 0: [2022-11-25 21:14:41,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_18-model_00-model_states.pt. 0: [2022-11-25 21:14:41,827] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_19-model_00-model_states.pt... 0: [2022-11-25 21:14:41,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_19-model_00-model_states.pt. 0: [2022-11-25 21:14:41,901] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_20-model_00-model_states.pt... 0: [2022-11-25 21:14:41,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_20-model_00-model_states.pt. 0: [2022-11-25 21:14:41,976] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_21-model_00-model_states.pt... 0: [2022-11-25 21:14:42,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_21-model_00-model_states.pt. 0: [2022-11-25 21:14:42,051] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_22-model_00-model_states.pt... 0: [2022-11-25 21:14:42,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_22-model_00-model_states.pt. 0: [2022-11-25 21:14:42,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_23-model_00-model_states.pt... 0: [2022-11-25 21:14:42,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_23-model_00-model_states.pt. 
0: [2022-11-25 21:14:42,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_24-model_00-model_states.pt... 0: [2022-11-25 21:14:42,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_24-model_00-model_states.pt. 0: [2022-11-25 21:14:42,294] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_25-model_00-model_states.pt... 0: [2022-11-25 21:14:42,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_25-model_00-model_states.pt. 0: [2022-11-25 21:14:42,368] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_26-model_00-model_states.pt... 0: [2022-11-25 21:14:42,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_26-model_00-model_states.pt. 0: [2022-11-25 21:14:42,441] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_27-model_00-model_states.pt... 0: [2022-11-25 21:14:42,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_27-model_00-model_states.pt. 0: [2022-11-25 21:14:42,519] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_28-model_00-model_states.pt... 0: [2022-11-25 21:14:42,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_28-model_00-model_states.pt. 0: [2022-11-25 21:14:42,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/layer_30-model_00-model_states.pt... 0: [2022-11-25 21:14:42,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/layer_30-model_00-model_states.pt. 
0: [2022-11-25 21:14:42,596] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step14000/mp_rank_00_model_states.pt 0: [2022-11-25 21:14:42,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/mp_rank_00_model_states.pt... 0: [2022-11-25 21:14:42,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/mp_rank_00_model_states.pt. 0: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
4: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
8: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 
16: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
15: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 
23: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 
14: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 
29: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 
21: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 
19: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
0: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
4: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 
10: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
13: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 
15: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 
11: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 
29: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 
18: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 
0: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 
13: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
11: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 
22: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
12: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 
0: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:14:42,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:14:42,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:14:42,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:14:42,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-25 21:14:42,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 27: [2022-11-25 21:14:42,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
23: [2022-11-25 21:14:42,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:14:42,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-25 21:14:42,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 23: [2022-11-25 21:14:42,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 14: [2022-11-25 21:14:42,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:14:42,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 14: [2022-11-25 21:14:42,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 21:14:42,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 4: [2022-11-25 21:14:42,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:14:42,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 21:14:42,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 15: [2022-11-25 21:14:42,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-25 21:14:42,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 21:14:42,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 26: [2022-11-25 21:14:42,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:14:42,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-25 21:14:42,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 7: [2022-11-25 21:14:42,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:14:42,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:14:42,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:14:42,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
7: [2022-11-25 21:14:42,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 8: [2022-11-25 21:14:42,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 19: [2022-11-25 21:14:42,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 7: [2022-11-25 21:14:42,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 8: [2022-11-25 21:14:42,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 10: [2022-11-25 21:14:42,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 19: [2022-11-25 21:14:42,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 10: [2022-11-25 21:14:42,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 6: [2022-11-25 21:14:42,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:14:42,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 21:14:42,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 8: [2022-11-25 21:14:42,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-25 21:14:42,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 21:14:42,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 11: [2022-11-25 21:14:42,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:14:42,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 21:14:42,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 13: [2022-11-25 21:14:42,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:14:42,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 21:14:42,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 9: [2022-11-25 21:14:42,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:14:42,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
9: [2022-11-25 21:14:42,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 23: [2022-11-25 21:14:42,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 9: [2022-11-25 21:14:42,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 23: [2022-11-25 21:14:42,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 22: [2022-11-25 21:14:42,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:14:42,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:14:42,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-25 21:14:42,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 1: [2022-11-25 21:14:42,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 21:14:42,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 3: [2022-11-25 21:14:42,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-25 21:14:42,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 21:14:42,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 7: [2022-11-25 21:14:42,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:14:42,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 21:14:42,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 10: [2022-11-25 21:14:42,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:14:42,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:14:42,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 21:14:42,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 15: [2022-11-25 21:14:42,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
26: [2022-11-25 21:14:42,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 15: [2022-11-25 21:14:42,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 31: [2022-11-25 21:14:42,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:14:42,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:14:42,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 15: [2022-11-25 21:14:42,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 31: [2022-11-25 21:14:42,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 18: [2022-11-25 21:14:42,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 31: [2022-11-25 21:14:42,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 18: [2022-11-25 21:14:42,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 2: [2022-11-25 21:14:42,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:14:42,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-25 21:14:42,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 21:14:42,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 2: [2022-11-25 21:14:42,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 21:14:42,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 16: [2022-11-25 21:14:42,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:14:42,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-25 21:14:42,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 20: [2022-11-25 21:14:42,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:14:42,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-25 21:14:42,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 5: [2022-11-25 21:14:42,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-25 21:14:42,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 21:14:42,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 31: [2022-11-25 21:14:42,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-25 21:14:42,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-25 21:14:42,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 27: [2022-11-25 21:14:42,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:14:42,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:14:42,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 26: [2022-11-25 21:14:42,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 27: [2022-11-25 21:14:42,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 26: [2022-11-25 21:14:42,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 19: [2022-11-25 21:14:42,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-25 21:14:42,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-25 21:14:42,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 19: [2022-11-25 21:14:42,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:14:42,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 5: [2022-11-25 21:14:42,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:14:42,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:14:42,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 5: [2022-11-25 21:14:42,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 13: [2022-11-25 21:14:42,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 5: [2022-11-25 21:14:42,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 13: [2022-11-25 21:14:42,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 9: [2022-11-25 21:14:42,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-25 21:14:42,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 21:14:42,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 18: [2022-11-25 21:14:42,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:14:42,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:14:42,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 14: [2022-11-25 21:14:42,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 18: [2022-11-25 21:14:42,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 14: [2022-11-25 21:14:42,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 11: [2022-11-25 21:14:42,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:14:42,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
11: [2022-11-25 21:14:42,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 4: [2022-11-25 21:14:42,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 11: [2022-11-25 21:14:42,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 4: [2022-11-25 21:14:42,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 5: [2022-11-25 21:14:42,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:14:42,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 21:14:42,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 31: [2022-11-25 21:14:42,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-25 21:14:42,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-25 21:14:42,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 28: [2022-11-25 21:14:42,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-25 21:14:42,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
18: [2022-11-25 21:14:42,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:14:42,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 15: [2022-11-25 21:14:42,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:14:42,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 15: [2022-11-25 21:14:42,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 21:14:42,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 3: [2022-11-25 21:14:42,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:14:42,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 21:14:42,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 25: [2022-11-25 21:14:42,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:14:42,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-25 21:14:42,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
16: [2022-11-25 21:14:42,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:14:42,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-25 21:14:42,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 1: [2022-11-25 21:14:42,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:14:42,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 28: [2022-11-25 21:14:42,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:14:42,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 23: [2022-11-25 21:14:42,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:14:42,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-25 21:14:42,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 8: [2022-11-25 21:14:42,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-25 21:14:42,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 21:14:42,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 4: [2022-11-25 21:14:42,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:14:42,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 21:14:42,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 3: [2022-11-25 21:14:42,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:14:42,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 21:14:42,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 26: [2022-11-25 21:14:42,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:14:42,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-25 21:14:42,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 20: [2022-11-25 21:14:42,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
20: [2022-11-25 21:14:42,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-25 21:14:42,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 10: [2022-11-25 21:14:42,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:14:42,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 21:14:42,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 8: [2022-11-25 21:14:42,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:14:42,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 21:14:42,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 19: [2022-11-25 21:14:42,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:14:42,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-25 21:14:42,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 9: [2022-11-25 21:14:42,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-25 21:14:42,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 21:14:42,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 2: [2022-11-25 21:14:42,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:14:42,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:14:42,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:14:42,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 5: [2022-11-25 21:14:42,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 1: [2022-11-25 21:14:42,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:14:42,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 2: [2022-11-25 21:14:42,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 5: [2022-11-25 21:14:42,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 16: [2022-11-25 21:14:42,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
18: [2022-11-25 21:14:42,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:14:42,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:14:42,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 1: [2022-11-25 21:14:42,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 18: [2022-11-25 21:14:42,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 1: [2022-11-25 21:14:42,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 21:14:42,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 1: [2022-11-25 21:14:42,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 7: [2022-11-25 21:14:42,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:14:42,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 21:14:42,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 11: [2022-11-25 21:14:42,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
22: [2022-11-25 21:14:42,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:14:42,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 22: [2022-11-25 21:14:42,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:14:42,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 11: [2022-11-25 21:14:42,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 22: [2022-11-25 21:14:42,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-25 21:14:42,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 22: [2022-11-25 21:14:42,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 7: [2022-11-25 21:14:42,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:14:42,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 21:14:42,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 25: [2022-11-25 21:14:42,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 
25: [2022-11-25 21:14:42,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 20: [2022-11-25 21:14:42,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:14:42,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 25: [2022-11-25 21:14:42,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 20: [2022-11-25 21:14:42,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 6: [2022-11-25 21:14:42,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:14:42,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:14:42,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 21:14:42,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 21:14:42,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 6: [2022-11-25 21:14:42,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 9: [2022-11-25 21:14:42,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-25 21:14:42,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 21:14:42,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 20: [2022-11-25 21:14:42,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:14:42,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:14:42,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 3: [2022-11-25 21:14:42,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 21:14:42,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 20: [2022-11-25 21:14:42,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 31: [2022-11-25 21:14:42,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-25 21:14:42,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-25 21:14:42,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
28: [2022-11-25 21:14:42,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-25 21:14:42,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 28: [2022-11-25 21:14:42,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:14:42,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-25 21:14:42,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 28: [2022-11-25 21:14:42,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:14:42,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-25 21:14:42,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 6: [2022-11-25 21:14:42,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:14:42,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 21:14:42,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 4: [2022-11-25 21:14:42,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-25 21:14:42,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 21:14:42,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 15: [2022-11-25 21:14:42,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:14:42,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 21:14:42,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 29: [2022-11-25 21:14:42,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:14:42,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:14:42,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:14:42,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 
29: [2022-11-25 21:14:42,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-25 21:14:42,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-25 21:14:42,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-25 21:14:42,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-25 21:14:42,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 29: [2022-11-25 21:14:42,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 29: [2022-11-25 21:14:42,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 29: [2022-11-25 21:14:42,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 11: [2022-11-25 21:14:42,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:14:42,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 21:14:42,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 17: [2022-11-25 21:14:42,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 
17: [2022-11-25 21:14:42,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:14:42,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:14:42,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:14:42,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-25 21:14:42,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-25 21:14:42,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-25 21:14:42,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-25 21:14:42,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 17: [2022-11-25 21:14:42,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 17: [2022-11-25 21:14:42,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 17: [2022-11-25 21:14:42,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 19: [2022-11-25 21:14:42,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
19: [2022-11-25 21:14:42,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-25 21:14:42,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 2: [2022-11-25 21:14:42,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:14:42,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 21:14:42,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 25: [2022-11-25 21:14:42,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:14:42,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-25 21:14:42,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 22: [2022-11-25 21:14:42,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:14:42,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-25 21:14:42,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 1: [2022-11-25 21:14:42,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-25 21:14:42,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 21:14:42,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 25: [2022-11-25 21:14:42,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:14:42,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-25 21:14:42,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 26: [2022-11-25 21:14:42,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:14:42,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:14:42,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 25: [2022-11-25 21:14:42,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 26: [2022-11-25 21:14:42,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 25: [2022-11-25 21:14:42,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 6: [2022-11-25 21:14:42,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-25 21:14:42,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 21:14:42,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 27: [2022-11-25 21:14:42,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:14:42,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-25 21:14:42,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 14: [2022-11-25 21:14:42,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:14:42,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 21:14:42,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 0: [2022-11-25 21:14:42,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:14:42,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 21:14:42,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 0: [2022-11-25 21:14:42,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-25 21:14:42,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 21:14:42,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 0: [2022-11-25 21:14:42,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:14:42,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 21:14:42,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 13: [2022-11-25 21:14:42,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:14:42,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 21:14:42,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 18: [2022-11-25 21:14:42,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:14:42,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
18: [2022-11-25 21:14:42,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 10: [2022-11-25 21:14:42,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 18: [2022-11-25 21:14:42,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 10: [2022-11-25 21:14:42,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 14: [2022-11-25 21:14:42,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:14:42,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 21:14:42,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 31: [2022-11-25 21:14:42,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:14:42,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:14:42,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 31: [2022-11-25 21:14:42,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-25 21:14:42,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
0: [2022-11-25 21:14:42,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 21:14:42,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 27: [2022-11-25 21:14:42,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:14:42,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-25 21:14:42,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 15: [2022-11-25 21:14:42,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:14:42,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 21:14:42,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 0: [2022-11-25 21:14:42,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 21:14:42,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 23: [2022-11-25 21:14:42,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
23: [2022-11-25 21:14:42,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-25 21:14:42,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 13: [2022-11-25 21:14:42,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:14:42,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 21:14:42,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 24: [2022-11-25 21:14:42,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:14:42,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 10: [2022-11-25 21:14:42,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:14:42,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 30: [2022-11-25 21:14:42,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:14:42,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 21:14:42,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
30: [2022-11-25 21:14:42,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-25 21:14:42,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 12: [2022-11-25 21:14:42,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:14:42,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 21:14:42,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 21: [2022-11-25 21:14:42,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:14:42,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-25 21:14:42,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 9: [2022-11-25 21:14:42,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:14:42,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 21:14:42,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 5: [2022-11-25 21:14:42,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-25 21:14:42,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 21:14:42,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 23: [2022-11-25 21:14:42,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:14:42,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-25 21:14:42,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 8: [2022-11-25 21:14:42,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:14:42,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 21:14:42,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 30: [2022-11-25 21:14:42,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:14:42,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-25 21:14:42,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 28: [2022-11-25 21:14:42,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
28: [2022-11-25 21:14:42,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-25 21:14:42,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 24: [2022-11-25 21:14:42,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:14:42,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-25 21:14:42,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 21: [2022-11-25 21:14:42,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:14:42,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-25 21:14:42,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 29: [2022-11-25 21:14:42,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:14:42,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-25 21:14:42,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 11: [2022-11-25 21:14:42,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-25 21:14:42,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 21:14:42,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 12: [2022-11-25 21:14:42,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:14:42,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 21:14:42,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 14: [2022-11-25 21:14:42,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:14:42,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 21:14:42,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 4: [2022-11-25 21:14:42,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:14:42,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 21:14:42,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 17: [2022-11-25 21:14:42,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-25 21:14:42,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-25 21:14:42,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 16: [2022-11-25 21:14:42,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:14:42,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-25 21:14:42,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 20: [2022-11-25 21:14:42,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:14:42,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-25 21:14:42,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 22: [2022-11-25 21:14:42,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:14:42,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-25 21:14:42,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 27: [2022-11-25 21:14:42,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
27: [2022-11-25 21:14:42,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-25 21:14:42,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 13: [2022-11-25 21:14:42,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:14:42,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 21:14:42,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 3: [2022-11-25 21:14:42,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:14:42,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 21:14:42,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 1: [2022-11-25 21:14:42,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:14:42,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 21:14:42,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 2: [2022-11-25 21:14:42,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-25 21:14:42,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 21:14:42,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 6: [2022-11-25 21:14:42,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:14:42,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 21:14:42,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 19: [2022-11-25 21:14:42,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:14:42,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-25 21:14:42,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 18: [2022-11-25 21:14:42,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:14:42,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-25 21:14:42,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 31: [2022-11-25 21:14:42,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
31: [2022-11-25 21:14:42,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-25 21:14:42,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 26: [2022-11-25 21:14:42,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:14:42,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-25 21:14:42,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 0: [2022-11-25 21:14:42,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:14:42,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 21:14:42,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 25: [2022-11-25 21:14:42,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:14:42,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-25 21:14:42,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 5: [2022-11-25 21:14:42,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-25 21:14:42,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 21:14:42,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 9: [2022-11-25 21:14:42,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:14:42,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 21:14:42,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 8: [2022-11-25 21:14:42,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:14:42,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 21:14:42,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 10: [2022-11-25 21:14:42,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:14:42,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 21:14:42,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 23: [2022-11-25 21:14:42,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-25 21:14:42,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-25 21:14:42,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 24: [2022-11-25 21:14:42,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:14:42,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-25 21:14:42,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 4: [2022-11-25 21:14:42,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:14:42,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 21:14:42,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 16: [2022-11-25 21:14:42,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:14:42,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 14: [2022-11-25 21:14:42,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:14:42,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
14: [2022-11-25 21:14:42,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 21:14:42,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 6: [2022-11-25 21:14:42,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:14:42,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 21:14:42,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 13: [2022-11-25 21:14:42,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:14:42,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:14:42,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 21:14:42,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 31: [2022-11-25 21:14:42,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
26: [2022-11-25 21:14:42,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 31: [2022-11-25 21:14:42,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 26: [2022-11-25 21:14:42,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 31: [2022-11-25 21:14:42,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 20: [2022-11-25 21:14:42,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:14:42,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:14:42,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 2: [2022-11-25 21:14:42,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 21:14:42,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 20: [2022-11-25 21:14:42,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 29: [2022-11-25 21:14:42,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
29: [2022-11-25 21:14:42,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-25 21:14:42,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 12: [2022-11-25 21:14:42,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:14:42,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 21:14:42,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 3: [2022-11-25 21:14:42,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:14:42,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 21:14:42,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 11: [2022-11-25 21:14:42,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:14:42,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 21:14:42,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 30: [2022-11-25 21:14:42,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-25 21:14:42,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 7: [2022-11-25 21:14:42,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:14:42,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 21:14:42,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 30: [2022-11-25 21:14:42,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 9: [2022-11-25 21:14:42,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:14:42,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 22: [2022-11-25 21:14:42,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:14:42,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 22: [2022-11-25 21:14:42,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-25 21:14:42,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 10: [2022-11-25 21:14:42,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-25 21:14:42,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 21:14:42,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 25: [2022-11-25 21:14:42,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:14:42,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-25 21:14:42,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 28: [2022-11-25 21:14:42,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:14:42,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-25 21:14:42,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 23: [2022-11-25 21:14:42,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:14:42,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-25 21:14:42,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 18: [2022-11-25 21:14:42,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
18: [2022-11-25 21:14:42,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-25 21:14:42,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 5: [2022-11-25 21:14:42,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:14:42,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:14:42,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 19: [2022-11-25 21:14:42,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 5: [2022-11-25 21:14:42,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 19: [2022-11-25 21:14:42,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 8: [2022-11-25 21:14:42,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:14:42,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 21:14:42,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 15: [2022-11-25 21:14:42,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-25 21:14:42,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 21:14:42,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 0: [2022-11-25 21:14:42,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:14:42,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 21:14:42,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 13: [2022-11-25 21:14:42,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:14:42,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:14:42,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:14:42,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 11: [2022-11-25 21:14:42,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 13: [2022-11-25 21:14:42,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
16: [2022-11-25 21:14:42,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 11: [2022-11-25 21:14:42,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 16: [2022-11-25 21:14:42,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 21: [2022-11-25 21:14:42,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:14:42,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-25 21:14:42,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 27: [2022-11-25 21:14:42,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:14:42,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-25 21:14:42,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 9: [2022-11-25 21:14:42,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:14:42,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 21:14:42,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
21: [2022-11-25 21:14:42,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:14:42,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-25 21:14:42,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 26: [2022-11-25 21:14:42,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:14:42,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-25 21:14:42,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 4: [2022-11-25 21:14:42,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:14:42,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 21:14:42,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 31: [2022-11-25 21:14:42,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:14:42,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
31: [2022-11-25 21:14:42,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-25 21:14:42,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 25: [2022-11-25 21:14:42,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-25 21:14:42,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 14: [2022-11-25 21:14:42,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:14:42,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:14:42,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 14: [2022-11-25 21:14:42,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 29: [2022-11-25 21:14:42,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 14: [2022-11-25 21:14:42,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 27: [2022-11-25 21:14:42,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
27: [2022-11-25 21:14:42,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 5: [2022-11-25 21:14:42,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:14:42,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:14:42,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 27: [2022-11-25 21:14:42,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 5: [2022-11-25 21:14:42,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 3: [2022-11-25 21:14:42,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 19: [2022-11-25 21:14:42,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:14:42,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:14:42,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 6: [2022-11-25 21:14:42,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 21:14:42,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
10: [2022-11-25 21:14:42,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:14:42,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-25 21:14:42,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 10: [2022-11-25 21:14:42,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 21:14:42,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 8: [2022-11-25 21:14:42,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:14:42,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 21:14:42,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 24: [2022-11-25 21:14:42,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:14:42,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-25 21:14:42,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 1: [2022-11-25 21:14:42,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-25 21:14:42,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 21:14:42,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 28: [2022-11-25 21:14:42,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:14:42,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-25 21:14:42,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 2: [2022-11-25 21:14:42,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:14:42,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 21:14:42,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 22: [2022-11-25 21:14:42,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:14:42,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:14:42,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-25 21:14:42,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
15: [2022-11-25 21:14:42,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 21:14:42,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 15: [2022-11-25 21:14:42,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:14:42,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 21:14:42,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 20: [2022-11-25 21:14:42,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:14:42,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-25 21:14:42,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 7: [2022-11-25 21:14:42,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:14:42,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 21:14:42,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:14:42,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
7: [2022-11-25 21:14:42,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 21:14:42,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 23: [2022-11-25 21:14:42,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:14:42,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-25 21:14:42,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 28: [2022-11-25 21:14:42,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:14:42,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-25 21:14:42,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 7: [2022-11-25 21:14:42,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:14:42,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 21:14:42,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 0: [2022-11-25 21:14:42,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-25 21:14:42,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 21:14:42,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 11: [2022-11-25 21:14:42,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:14:42,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 21:14:42,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 29: [2022-11-25 21:14:42,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:14:42,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 3: [2022-11-25 21:14:42,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:14:42,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 18: [2022-11-25 21:14:42,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
3: [2022-11-25 21:14:42,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 18: [2022-11-25 21:14:42,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 3: [2022-11-25 21:14:42,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 18: [2022-11-25 21:14:42,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 16: [2022-11-25 21:14:42,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:14:42,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-25 21:14:42,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 2: [2022-11-25 21:14:42,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:14:42,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 30: [2022-11-25 21:14:42,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:14:42,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
30: [2022-11-25 21:14:42,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-25 21:14:42,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 13: [2022-11-25 21:14:42,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:14:42,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 21:14:42,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 12: [2022-11-25 21:14:42,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:14:42,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 21:14:42,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 27: [2022-11-25 21:14:42,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:14:42,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-25 21:14:42,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 20: [2022-11-25 21:14:42,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-25 21:14:42,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-25 21:14:42,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 12: [2022-11-25 21:14:42,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:14:42,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:14:42,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 21:14:42,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 4: [2022-11-25 21:14:42,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 21:14:42,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 14: [2022-11-25 21:14:42,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:14:42,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 21:14:42,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 17: [2022-11-25 21:14:42,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 
17: [2022-11-25 21:14:42,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-25 21:14:42,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 24: [2022-11-25 21:14:42,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:14:42,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:14:42,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 22: [2022-11-25 21:14:42,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-25 21:14:42,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 24: [2022-11-25 21:14:42,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 30: [2022-11-25 21:14:42,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:14:42,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-25 21:14:42,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 1: [2022-11-25 21:14:42,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
21: [2022-11-25 21:14:42,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:14:42,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 21: [2022-11-25 21:14:42,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-25 21:14:42,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 1: [2022-11-25 21:14:42,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 17: [2022-11-25 21:14:42,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:14:42,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-25 21:14:42,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 12: [2022-11-25 21:14:42,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:14:42,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
12: [2022-11-25 21:14:42,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 24: [2022-11-25 21:14:42,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-25 21:14:42,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 12: [2022-11-25 21:14:42,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 21: [2022-11-25 21:14:42,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:14:42,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-25 21:14:42,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 12: [2022-11-25 21:14:42,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:14:42,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 21:14:42,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 24: [2022-11-25 21:14:42,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
24: [2022-11-25 21:14:42,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-25 21:14:42,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 24: [2022-11-25 21:14:42,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:14:42,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:14:42,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:14:42,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 30: [2022-11-25 21:14:42,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 21: [2022-11-25 21:14:42,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 24: [2022-11-25 21:14:42,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 30: [2022-11-25 21:14:42,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 21: [2022-11-25 21:14:42,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 12: [2022-11-25 21:14:42,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-25 21:14:42,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 21:14:42,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 17: [2022-11-25 21:14:42,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:14:42,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-25 21:14:42,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 21: [2022-11-25 21:14:42,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:14:42,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-25 21:14:42,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 30: [2022-11-25 21:14:42,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:14:42,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-25 21:14:42,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 30: [2022-11-25 21:14:42,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-25 21:14:42,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step14000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt
30: [2022-11-25 21:14:42,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now!
0: successfully saved checkpoint at iteration 14000 to checkpoints_1b1long
31: time (ms) | save-checkpoint: 2532.14
31: iteration 14010/ 173500 | consumed samples: 3586560 | consumed tokens: 7345274880 | elapsed time per iteration (s): 1.04 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.310754E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.274 | TFLOPs: 14.84 |
31: iteration 14020/ 173500 | consumed samples: 3589120 | consumed tokens: 7350517760 | elapsed time per iteration (s): 0.74 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.288066E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.764 | TFLOPs: 20.80 |
31: iteration 14030/ 173500 | consumed samples: 3591680 | consumed tokens: 7355760640 | elapsed time per iteration (s): 0.79 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.294024E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.741 | TFLOPs: 19.65 |
31: iteration 14040/ 173500 | consumed samples: 3594240 | consumed tokens: 7361003520 | elapsed time per iteration (s): 0.77 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.265778E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.287 | TFLOPs: 20.10 |
31: iteration 14050/ 173500 | consumed samples: 3596800 | consumed tokens: 7366246400 | elapsed time per iteration (s): 0.83 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.256250E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.608 | TFLOPs: 18.67 |
31: iteration 14060/ 173500 | consumed samples: 3599360 | consumed tokens: 7371489280 | elapsed time per iteration (s): 0.82 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.258300E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.613 | TFLOPs: 18.79 |
31: iteration 14070/ 173500 | consumed samples: 3601920 | consumed tokens: 7376732160 | elapsed time per iteration (s): 0.86 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.240616E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.577 | TFLOPs: 18.06 |
31: iteration 14080/ 173500 | consumed samples: 3604480 | consumed tokens: 7381975040 | elapsed time per iteration (s): 0.87 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.289080E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.876 | TFLOPs: 17.84 |
31: iteration 14090/ 173500 | consumed samples: 3607040 | consumed tokens: 7387217920 | elapsed time per iteration (s): 0.81 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.249946E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.070 | TFLOPs: 19.06 |
31: iteration 14100/ 173500 | consumed samples: 3609600 | consumed tokens: 7392460800 | elapsed time per iteration (s): 0.84 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.280100E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.154 | TFLOPs: 18.34 |
31: iteration 14110/ 173500 | consumed samples: 3612160 | consumed tokens: 7397703680 | elapsed time per iteration (s): 0.81 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.282637E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.520 | TFLOPs: 19.03 |
31: iteration 14120/ 173500 | consumed samples: 3614720 | consumed tokens: 7402946560 | elapsed time per iteration (s): 0.79 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.281873E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.319 | TFLOPs: 19.62 |
31: iteration 14130/ 173500 | consumed samples: 3617280 | consumed tokens: 7408189440 | elapsed time per iteration (s): 0.79 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.299296E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.617 | TFLOPs: 19.58 |
31: iteration 14140/ 173500 | consumed samples: 3619840 | consumed tokens: 7413432320 | elapsed time per iteration (s): 0.77 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.254193E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.159 | TFLOPs: 20.09 |
31: iteration 14150/ 173500 | consumed samples: 3622400 | consumed tokens: 7418675200 | elapsed time per iteration (s): 0.74 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.294505E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.158 | TFLOPs: 20.82 |
31: iteration 14160/ 173500 | consumed samples: 3624960 | consumed tokens: 7423918080 | elapsed time per iteration (s): 0.78 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.304824E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.405 | TFLOPs: 19.93 |
31: iteration 14170/ 173500 | consumed samples: 3627520 | consumed tokens: 7429160960 | elapsed time per iteration (s): 0.82 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.266402E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.652 | TFLOPs: 18.98 |
31: iteration 14180/ 173500 | consumed samples: 3630080 | consumed tokens: 7434403840 | elapsed time per iteration (s): 0.76 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.310743E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.116 | TFLOPs: 20.33 |
31: iteration 14190/ 173500 | consumed samples: 3632640 | consumed tokens: 7439646720 | elapsed time per iteration (s): 0.81 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.272794E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.945 | TFLOPs: 19.23 |
31: iteration 14200/ 173500 | consumed samples: 3635200 | consumed tokens: 7444889600 | elapsed time per iteration (s): 0.76 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.276276E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.930 | TFLOPs: 20.26 |
31: iteration 14210/ 173500 | consumed samples: 3637760 | consumed tokens: 7450132480 | elapsed time per iteration (s): 0.82 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.253081E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.055 | TFLOPs: 19.00 |
31: iteration 14220/ 173500 | consumed samples: 3640320 | consumed tokens: 7455375360 | elapsed time per iteration (s): 0.75 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.286594E+00 | grad norm: 0.165 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.078 | TFLOPs: 20.51 | 31: iteration 14230/ 173500 | consumed samples: 3642880 | consumed tokens: 7460618240 | elapsed time per iteration (s): 0.74 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.267040E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.462 | TFLOPs: 21.02 | 31: iteration 14240/ 173500 | consumed samples: 3645440 | consumed tokens: 7465861120 | elapsed time per iteration (s): 0.76 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.285763E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.561 | TFLOPs: 20.36 | 31: iteration 14250/ 173500 | consumed samples: 3648000 | consumed tokens: 7471104000 | elapsed time per iteration (s): 0.75 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.283530E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.403 | TFLOPs: 20.53 | 31: iteration 14260/ 173500 | consumed samples: 3650560 | consumed tokens: 7476346880 | elapsed time per iteration (s): 0.73 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.296711E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.786 | TFLOPs: 21.34 | 31: iteration 14270/ 173500 | consumed samples: 3653120 | consumed tokens: 7481589760 | elapsed time per iteration (s): 0.77 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.253401E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.958 | TFLOPs: 20.14 | 31: iteration 14280/ 173500 | consumed samples: 3655680 | consumed tokens: 7486832640 | elapsed time per iteration (s): 0.74 | 
learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.268114E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.061 | TFLOPs: 20.94 | 31: iteration 14290/ 173500 | consumed samples: 3658240 | consumed tokens: 7492075520 | elapsed time per iteration (s): 0.82 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.284433E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.032 | TFLOPs: 18.82 | 31: iteration 14300/ 173500 | consumed samples: 3660800 | consumed tokens: 7497318400 | elapsed time per iteration (s): 0.72 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.230682E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.782 | TFLOPs: 21.46 | 31: iteration 14310/ 173500 | consumed samples: 3663360 | consumed tokens: 7502561280 | elapsed time per iteration (s): 0.76 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.234125E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.528 | TFLOPs: 20.30 | 31: iteration 14320/ 173500 | consumed samples: 3665920 | consumed tokens: 7507804160 | elapsed time per iteration (s): 0.74 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.239463E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.498 | TFLOPs: 20.90 | 31: iteration 14330/ 173500 | consumed samples: 3668480 | consumed tokens: 7513047040 | elapsed time per iteration (s): 1.03 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.252303E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.881 | TFLOPs: 15.00 | 31: iteration 14340/ 173500 | 
consumed samples: 3671040 | consumed tokens: 7518289920 | elapsed time per iteration (s): 0.75 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.272766E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.686 | TFLOPs: 20.61 | 31: iteration 14350/ 173500 | consumed samples: 3673600 | consumed tokens: 7523532800 | elapsed time per iteration (s): 0.78 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.295470E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.792 | TFLOPs: 19.83 | 31: iteration 14360/ 173500 | consumed samples: 3676160 | consumed tokens: 7528775680 | elapsed time per iteration (s): 0.86 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.300411E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.111 | TFLOPs: 18.10 | 31: iteration 14370/ 173500 | consumed samples: 3678720 | consumed tokens: 7534018560 | elapsed time per iteration (s): 0.88 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.298869E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.032 | TFLOPs: 17.67 | 31: iteration 14380/ 173500 | consumed samples: 3681280 | consumed tokens: 7539261440 | elapsed time per iteration (s): 0.75 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.255912E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.006 | TFLOPs: 20.63 | 31: iteration 14390/ 173500 | consumed samples: 3683840 | consumed tokens: 7544504320 | elapsed time per iteration (s): 0.77 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.280724E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 333.369 | TFLOPs: 20.17 | 31: iteration 14400/ 173500 | consumed samples: 3686400 | consumed tokens: 7549747200 | elapsed time per iteration (s): 0.78 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.235214E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.615 | TFLOPs: 19.88 | 31: iteration 14410/ 173500 | consumed samples: 3688960 | consumed tokens: 7554990080 | elapsed time per iteration (s): 0.73 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.233593E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.046 | TFLOPs: 21.18 | 31: iteration 14420/ 173500 | consumed samples: 3691520 | consumed tokens: 7560232960 | elapsed time per iteration (s): 0.74 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.263656E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.690 | TFLOPs: 20.91 | 31: iteration 14430/ 173500 | consumed samples: 3694080 | consumed tokens: 7565475840 | elapsed time per iteration (s): 0.76 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.273856E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.963 | TFLOPs: 20.32 | 31: iteration 14440/ 173500 | consumed samples: 3696640 | consumed tokens: 7570718720 | elapsed time per iteration (s): 0.81 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.237494E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.096 | TFLOPs: 19.06 | 31: iteration 14450/ 173500 | consumed samples: 3699200 | consumed tokens: 7575961600 | elapsed time per iteration (s): 0.85 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 
2.255269E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.922 | TFLOPs: 18.33 | 31: iteration 14460/ 173500 | consumed samples: 3701760 | consumed tokens: 7581204480 | elapsed time per iteration (s): 0.82 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.281876E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.235 | TFLOPs: 18.83 | 31: iteration 14470/ 173500 | consumed samples: 3704320 | consumed tokens: 7586447360 | elapsed time per iteration (s): 0.84 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.286308E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.511 | TFLOPs: 18.42 | 31: iteration 14480/ 173500 | consumed samples: 3706880 | consumed tokens: 7591690240 | elapsed time per iteration (s): 0.84 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.249482E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.232 | TFLOPs: 18.53 | 31: iteration 14490/ 173500 | consumed samples: 3709440 | consumed tokens: 7596933120 | elapsed time per iteration (s): 0.84 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.247513E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.433 | TFLOPs: 18.36 | 31: iteration 14500/ 173500 | consumed samples: 3712000 | consumed tokens: 7602176000 | elapsed time per iteration (s): 0.82 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.266648E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.939 | TFLOPs: 18.93 | 31: iteration 14510/ 173500 | consumed samples: 3714560 | consumed tokens: 7607418880 | 
elapsed time per iteration (s): 0.83 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.281699E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.197 | TFLOPs: 18.65 | 31: iteration 14520/ 173500 | consumed samples: 3717120 | consumed tokens: 7612661760 | elapsed time per iteration (s): 0.80 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.319052E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.039 | TFLOPs: 19.24 | 31: iteration 14530/ 173500 | consumed samples: 3719680 | consumed tokens: 7617904640 | elapsed time per iteration (s): 0.85 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.263216E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.393 | TFLOPs: 18.17 | 31: iteration 14540/ 173500 | consumed samples: 3722240 | consumed tokens: 7623147520 | elapsed time per iteration (s): 0.79 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.275888E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.894 | TFLOPs: 19.66 | 31: iteration 14550/ 173500 | consumed samples: 3724800 | consumed tokens: 7628390400 | elapsed time per iteration (s): 0.76 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.242257E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.960 | TFLOPs: 20.39 | 31: iteration 14560/ 173500 | consumed samples: 3727360 | consumed tokens: 7633633280 | elapsed time per iteration (s): 0.76 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.258337E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.218 | TFLOPs: 
20.46 | 31: iteration 14570/ 173500 | consumed samples: 3729920 | consumed tokens: 7638876160 | elapsed time per iteration (s): 0.78 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.260294E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.965 | TFLOPs: 19.78 | 31: iteration 14580/ 173500 | consumed samples: 3732480 | consumed tokens: 7644119040 | elapsed time per iteration (s): 0.74 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.239196E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.614 | TFLOPs: 20.91 | 31: iteration 14590/ 173500 | consumed samples: 3735040 | consumed tokens: 7649361920 | elapsed time per iteration (s): 0.77 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.277941E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.040 | TFLOPs: 20.03 | 31: iteration 14600/ 173500 | consumed samples: 3737600 | consumed tokens: 7654604800 | elapsed time per iteration (s): 0.77 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.288222E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.420 | TFLOPs: 20.17 | 31: iteration 14610/ 173500 | consumed samples: 3740160 | consumed tokens: 7659847680 | elapsed time per iteration (s): 0.81 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.270514E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.629 | TFLOPs: 19.16 | 31: iteration 14620/ 173500 | consumed samples: 3742720 | consumed tokens: 7665090560 | elapsed time per iteration (s): 0.80 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.281586E+00 | grad norm: 0.155 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.140 | TFLOPs: 19.43 | 31: iteration 14630/ 173500 | consumed samples: 3745280 | consumed tokens: 7670333440 | elapsed time per iteration (s): 0.82 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.242128E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.912 | TFLOPs: 18.99 | 31: iteration 14640/ 173500 | consumed samples: 3747840 | consumed tokens: 7675576320 | elapsed time per iteration (s): 0.89 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.252289E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.703 | TFLOPs: 17.41 | 31: iteration 14650/ 173500 | consumed samples: 3750400 | consumed tokens: 7680819200 | elapsed time per iteration (s): 0.89 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.271221E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.588 | TFLOPs: 17.46 | 31: iteration 14660/ 173500 | consumed samples: 3752960 | consumed tokens: 7686062080 | elapsed time per iteration (s): 0.89 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.246912E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.587 | TFLOPs: 17.46 | 31: iteration 14670/ 173500 | consumed samples: 3755520 | consumed tokens: 7691304960 | elapsed time per iteration (s): 0.82 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.257818E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.063 | TFLOPs: 19.00 | 31: iteration 14680/ 173500 | consumed samples: 3758080 | consumed tokens: 7696547840 | elapsed time per iteration (s): 0.82 | learning rate: 1.975E-04 
| global batch size: 256 | lm loss: 2.280074E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.325 | TFLOPs: 18.96 | 31: iteration 14690/ 173500 | consumed samples: 3760640 | consumed tokens: 7701790720 | elapsed time per iteration (s): 0.79 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.250810E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.249 | TFLOPs: 19.62 | 31: iteration 14700/ 173500 | consumed samples: 3763200 | consumed tokens: 7707033600 | elapsed time per iteration (s): 0.78 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.283123E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.408 | TFLOPs: 19.93 | 31: iteration 14710/ 173500 | consumed samples: 3765760 | consumed tokens: 7712276480 | elapsed time per iteration (s): 0.83 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.254241E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.706 | TFLOPs: 18.74 | 31: iteration 14720/ 173500 | consumed samples: 3768320 | consumed tokens: 7717519360 | elapsed time per iteration (s): 0.75 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.269748E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.654 | TFLOPs: 20.61 | 31: iteration 14730/ 173500 | consumed samples: 3770880 | consumed tokens: 7722762240 | elapsed time per iteration (s): 0.73 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.271183E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.074 | TFLOPs: 21.30 | 31: iteration 14740/ 173500 | consumed samples: 3773440 | 
consumed tokens: 7728005120 | elapsed time per iteration (s): 0.76 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.268438E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.760 | TFLOPs: 20.31 | 31: iteration 14750/ 173500 | consumed samples: 3776000 | consumed tokens: 7733248000 | elapsed time per iteration (s): 0.77 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.272061E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.854 | TFLOPs: 20.20 | 31: iteration 14760/ 173500 | consumed samples: 3778560 | consumed tokens: 7738490880 | elapsed time per iteration (s): 0.74 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.249131E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.640 | TFLOPs: 20.91 | 31: iteration 14770/ 173500 | consumed samples: 3781120 | consumed tokens: 7743733760 | elapsed time per iteration (s): 0.76 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.257435E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.623 | TFLOPs: 20.36 | 31: iteration 14780/ 173500 | consumed samples: 3783680 | consumed tokens: 7748976640 | elapsed time per iteration (s): 0.80 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.289094E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.196 | TFLOPs: 19.43 | 31: iteration 14790/ 173500 | consumed samples: 3786240 | consumed tokens: 7754219520 | elapsed time per iteration (s): 0.76 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.281413E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 335.745 | TFLOPs: 20.31 | 31: iteration 14800/ 173500 | consumed samples: 3788800 | consumed tokens: 7759462400 | elapsed time per iteration (s): 0.79 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.253288E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.755 | TFLOPs: 19.71 | 31: iteration 14810/ 173500 | consumed samples: 3791360 | consumed tokens: 7764705280 | elapsed time per iteration (s): 0.79 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.293417E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.056 | TFLOPs: 19.60 | 31: iteration 14820/ 173500 | consumed samples: 3793920 | consumed tokens: 7769948160 | elapsed time per iteration (s): 0.81 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.243478E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.421 | TFLOPs: 19.20 | 31: iteration 14830/ 173500 | consumed samples: 3796480 | consumed tokens: 7775191040 | elapsed time per iteration (s): 0.83 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.270250E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.169 | TFLOPs: 18.70 | 31: iteration 14840/ 173500 | consumed samples: 3799040 | consumed tokens: 7780433920 | elapsed time per iteration (s): 0.78 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.252633E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.367 | TFLOPs: 19.74 | 31: iteration 14850/ 173500 | consumed samples: 3801600 | consumed tokens: 7785676800 | elapsed time per iteration (s): 0.78 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.251085E+00 | grad norm: 
0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.699 | TFLOPs: 19.89 | 31: iteration 14860/ 173500 | consumed samples: 3804160 | consumed tokens: 7790919680 | elapsed time per iteration (s): 0.77 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.242855E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.105 | TFLOPs: 20.15 | 31: iteration 14870/ 173500 | consumed samples: 3806720 | consumed tokens: 7796162560 | elapsed time per iteration (s): 0.77 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.284375E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.521 | TFLOPs: 20.18 | 31: iteration 14880/ 173500 | consumed samples: 3809280 | consumed tokens: 7801405440 | elapsed time per iteration (s): 0.77 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.254613E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.962 | TFLOPs: 20.02 | 31: iteration 14890/ 173500 | consumed samples: 3811840 | consumed tokens: 7806648320 | elapsed time per iteration (s): 0.80 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.239631E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.990 | TFLOPs: 19.48 | 31: iteration 14900/ 173500 | consumed samples: 3814400 | consumed tokens: 7811891200 | elapsed time per iteration (s): 0.78 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.239495E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.839 | TFLOPs: 19.83 | 31: iteration 14910/ 173500 | consumed samples: 3816960 | consumed tokens: 7817134080 | elapsed time per iteration (s): 
0.76 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.278092E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.864 | TFLOPs: 20.38 | 31: iteration 14920/ 173500 | consumed samples: 3819520 | consumed tokens: 7822376960 | elapsed time per iteration (s): 0.76 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.240347E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.637 | TFLOPs: 20.37 | 31: iteration 14930/ 173500 | consumed samples: 3822080 | consumed tokens: 7827619840 | elapsed time per iteration (s): 0.78 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.245379E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.279 | TFLOPs: 19.98 | 31: iteration 14940/ 173500 | consumed samples: 3824640 | consumed tokens: 7832862720 | elapsed time per iteration (s): 0.85 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.266896E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.894 | TFLOPs: 18.14 | 31: iteration 14950/ 173500 | consumed samples: 3827200 | consumed tokens: 7838105600 | elapsed time per iteration (s): 0.77 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.239681E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.022 | TFLOPs: 20.21 | 31: iteration 14960/ 173500 | consumed samples: 3829760 | consumed tokens: 7843348480 | elapsed time per iteration (s): 0.78 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.216727E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.266 | TFLOPs: 19.80 | 31: iteration 14970/ 
173500 | consumed samples: 3832320 | consumed tokens: 7848591360 | elapsed time per iteration (s): 0.78 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.253212E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.719 | TFLOPs: 19.77 |
31: iteration 14980/ 173500 | consumed samples: 3834880 | consumed tokens: 7853834240 | elapsed time per iteration (s): 0.75 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.261270E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.546 | TFLOPs: 20.66 |
31: iteration 14990/ 173500 | consumed samples: 3837440 | consumed tokens: 7859077120 | elapsed time per iteration (s): 0.78 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.303340E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.977 | TFLOPs: 19.96 |
31: iteration 15000/ 173500 | consumed samples: 3840000 | consumed tokens: 7864320000 | elapsed time per iteration (s): 0.76 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.249610E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.984 | TFLOPs: 20.27 |
31: -------------------------------------------------------------------------------------------
31: valid loss at iteration 15000 | lm loss value: 2.260330E+00 | lm loss PPL: 9.586254E+00 |
31: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 15000 to checkpoints_1b1long
0: [2022-11-25 21:27:55,179] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step15000 is begin to save!
0: [2022-11-25 21:27:55,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_01-model_00-model_states.pt... 0: [2022-11-25 21:27:55,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_01-model_00-model_states.pt. 0: [2022-11-25 21:27:55,428] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_03-model_00-model_states.pt... 0: [2022-11-25 21:27:55,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_03-model_00-model_states.pt. 0: [2022-11-25 21:27:55,504] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_04-model_00-model_states.pt... 0: [2022-11-25 21:27:55,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_04-model_00-model_states.pt. 0: [2022-11-25 21:27:55,586] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_05-model_00-model_states.pt... 0: [2022-11-25 21:27:55,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_05-model_00-model_states.pt. 0: [2022-11-25 21:27:55,664] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_06-model_00-model_states.pt... 0: [2022-11-25 21:27:55,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_06-model_00-model_states.pt. 0: [2022-11-25 21:27:55,742] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_07-model_00-model_states.pt... 0: [2022-11-25 21:27:55,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_07-model_00-model_states.pt. 
0: [2022-11-25 21:27:55,822] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_08-model_00-model_states.pt... 0: [2022-11-25 21:27:55,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_08-model_00-model_states.pt. 0: [2022-11-25 21:27:55,899] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_09-model_00-model_states.pt... 0: [2022-11-25 21:27:55,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_09-model_00-model_states.pt. 0: [2022-11-25 21:27:55,975] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_10-model_00-model_states.pt... 0: [2022-11-25 21:27:56,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_10-model_00-model_states.pt. 0: [2022-11-25 21:27:56,050] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_11-model_00-model_states.pt... 0: [2022-11-25 21:27:56,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_11-model_00-model_states.pt. 0: [2022-11-25 21:27:56,127] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_12-model_00-model_states.pt... 0: [2022-11-25 21:27:56,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_12-model_00-model_states.pt. 0: [2022-11-25 21:27:56,203] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_13-model_00-model_states.pt... 0: [2022-11-25 21:27:56,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_13-model_00-model_states.pt. 
0: [2022-11-25 21:27:56,277] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_14-model_00-model_states.pt... 0: [2022-11-25 21:27:56,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_14-model_00-model_states.pt. 0: [2022-11-25 21:27:56,354] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_15-model_00-model_states.pt... 0: [2022-11-25 21:27:56,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_15-model_00-model_states.pt. 0: [2022-11-25 21:27:56,427] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_16-model_00-model_states.pt... 0: [2022-11-25 21:27:56,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_16-model_00-model_states.pt. 0: [2022-11-25 21:27:56,504] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_17-model_00-model_states.pt... 0: [2022-11-25 21:27:56,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_17-model_00-model_states.pt. 0: [2022-11-25 21:27:56,578] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_18-model_00-model_states.pt... 0: [2022-11-25 21:27:56,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_18-model_00-model_states.pt. 0: [2022-11-25 21:27:56,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_19-model_00-model_states.pt... 0: [2022-11-25 21:27:56,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_19-model_00-model_states.pt. 
0: [2022-11-25 21:27:56,730] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_20-model_00-model_states.pt... 0: [2022-11-25 21:27:56,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_20-model_00-model_states.pt. 0: [2022-11-25 21:27:56,806] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_21-model_00-model_states.pt... 0: [2022-11-25 21:27:56,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_21-model_00-model_states.pt. 0: [2022-11-25 21:27:56,882] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_22-model_00-model_states.pt... 0: [2022-11-25 21:27:56,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_22-model_00-model_states.pt. 0: [2022-11-25 21:27:56,957] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_23-model_00-model_states.pt... 0: [2022-11-25 21:27:57,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_23-model_00-model_states.pt. 0: [2022-11-25 21:27:57,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_24-model_00-model_states.pt... 0: [2022-11-25 21:27:57,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_24-model_00-model_states.pt. 0: [2022-11-25 21:27:57,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_25-model_00-model_states.pt... 0: [2022-11-25 21:27:57,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_25-model_00-model_states.pt. 
0: [2022-11-25 21:27:57,183] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_26-model_00-model_states.pt... 0: [2022-11-25 21:27:57,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_26-model_00-model_states.pt. 0: [2022-11-25 21:27:57,258] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_27-model_00-model_states.pt... 0: [2022-11-25 21:27:57,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_27-model_00-model_states.pt. 0: [2022-11-25 21:27:57,335] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_28-model_00-model_states.pt... 0: [2022-11-25 21:27:57,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_28-model_00-model_states.pt. 0: [2022-11-25 21:27:57,411] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/layer_30-model_00-model_states.pt... 0: [2022-11-25 21:27:57,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/layer_30-model_00-model_states.pt. 0: [2022-11-25 21:27:57,413] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step15000/mp_rank_00_model_states.pt 0: [2022-11-25 21:27:57,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/mp_rank_00_model_states.pt... 0: [2022-11-25 21:27:57,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/mp_rank_00_model_states.pt. 0: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
6: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
4: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 
2: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
12: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 
20: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 
28: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 
29: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 
21: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
26: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 
27: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
8: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 
16: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 
12: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 
11: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
14: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 
17: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 
6: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 
10: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
3: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 
28: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 
17: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
9: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 
27: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 
5: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:27:57,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:27:57,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:27:57,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-25 21:27:57,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 4: [2022-11-25 21:27:57,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:27:57,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 21:27:57,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 4: [2022-11-25 21:27:57,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:27:57,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 21:27:57,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 25: [2022-11-25 21:27:57,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
25: [2022-11-25 21:27:57,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-25 21:27:57,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 29: [2022-11-25 21:27:57,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:27:57,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-25 21:27:57,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 8: [2022-11-25 21:27:57,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:27:57,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:27:57,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 21:27:57,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 24: [2022-11-25 21:27:57,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-25 21:27:57,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 0: [2022-11-25 21:27:57,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-25 21:27:57,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 21:27:57,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 28: [2022-11-25 21:27:57,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:27:57,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:27:57,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-25 21:27:57,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 16: [2022-11-25 21:27:57,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:27:57,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-25 21:27:57,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 11: [2022-11-25 21:27:57,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:27:57,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 21:27:57,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
29: [2022-11-25 21:27:57,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:27:57,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-25 21:27:57,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 11: [2022-11-25 21:27:57,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:27:57,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 21:27:57,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 27: [2022-11-25 21:27:57,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:27:57,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 7: [2022-11-25 21:27:57,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:27:57,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 7: [2022-11-25 21:27:57,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 21:27:57,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
27: [2022-11-25 21:27:57,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:27:57,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 7: [2022-11-25 21:27:57,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:27:57,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 7: [2022-11-25 21:27:57,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 21:27:57,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 6: [2022-11-25 21:27:57,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:27:57,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 21:27:57,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 1: [2022-11-25 21:27:57,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:27:57,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 21:27:57,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
13: [2022-11-25 21:27:57,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:27:57,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 21:27:57,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 10: [2022-11-25 21:27:57,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:27:57,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:27:57,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 21:27:57,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 17: [2022-11-25 21:27:57,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:27:57,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 10: [2022-11-25 21:27:57,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 17: [2022-11-25 21:27:57,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-25 21:27:57,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
17: [2022-11-25 21:27:57,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:27:57,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-25 21:27:57,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 19: [2022-11-25 21:27:57,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:27:57,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-25 21:27:57,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 21: [2022-11-25 21:27:57,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:27:57,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:27:57,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-25 21:27:57,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-25 21:27:57,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-25 21:27:57,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-25 21:27:57,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 21: [2022-11-25 21:27:57,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 21: [2022-11-25 21:27:57,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 15: [2022-11-25 21:27:57,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:27:57,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:27:57,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 21:27:57,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 21:27:57,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 15: [2022-11-25 21:27:57,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
4: [2022-11-25 21:27:57,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:27:57,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 21:27:57,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 21: [2022-11-25 21:27:57,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:27:57,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-25 21:27:57,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 28: [2022-11-25 21:27:57,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-25 21:27:57,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 28: [2022-11-25 21:27:57,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:27:57,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-25 21:27:57,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 28: [2022-11-25 21:27:57,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
19: [2022-11-25 21:27:57,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:27:57,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 12: [2022-11-25 21:27:57,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:27:57,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 12: [2022-11-25 21:27:57,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 21:27:57,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 12: [2022-11-25 21:27:57,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:27:57,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 21:27:57,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 0: [2022-11-25 21:27:57,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:27:57,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-25 21:27:57,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 0: [2022-11-25 21:27:57,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 21:27:57,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 11: [2022-11-25 21:27:57,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 8: [2022-11-25 21:27:57,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:27:57,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:27:57,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 21:27:57,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 8: [2022-11-25 21:27:57,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:27:57,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 21:27:57,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 24: [2022-11-25 21:27:57,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
24: [2022-11-25 21:27:57,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-25 21:27:57,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 30: [2022-11-25 21:27:57,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:27:57,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:27:57,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:27:57,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-25 21:27:57,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-25 21:27:57,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-25 21:27:57,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 30: [2022-11-25 21:27:57,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 30: [2022-11-25 21:27:57,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 6: [2022-11-25 21:27:57,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-25 21:27:57,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 21:27:57,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 11: [2022-11-25 21:27:57,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:27:57,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 21:27:57,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 29: [2022-11-25 21:27:57,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:27:57,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-25 21:27:57,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 7: [2022-11-25 21:27:57,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:27:57,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 21:27:57,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 23: [2022-11-25 21:27:57,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
23: [2022-11-25 21:27:57,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-25 21:27:57,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 12: [2022-11-25 21:27:57,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:27:57,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 31: [2022-11-25 21:27:57,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:27:57,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 31: [2022-11-25 21:27:57,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-25 21:27:57,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 31: [2022-11-25 21:27:57,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-25 21:27:57,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-25 21:27:57,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
28: [2022-11-25 21:27:57,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-25 21:27:57,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 28: [2022-11-25 21:27:57,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:27:57,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-25 21:27:57,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 1: [2022-11-25 21:27:57,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:27:57,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 21:27:57,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 10: [2022-11-25 21:27:57,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:27:57,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 2: [2022-11-25 21:27:57,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:27:57,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
2: [2022-11-25 21:27:57,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 21:27:57,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 22: [2022-11-25 21:27:57,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:27:57,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:27:57,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-25 21:27:57,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 12: [2022-11-25 21:27:57,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 21:27:57,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 30: [2022-11-25 21:27:57,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:27:57,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:27:57,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
2: [2022-11-25 21:27:57,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 30: [2022-11-25 21:27:57,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 10: [2022-11-25 21:27:57,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 2: [2022-11-25 21:27:57,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 31: [2022-11-25 21:27:57,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:27:57,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 31: [2022-11-25 21:27:57,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 30: [2022-11-25 21:27:57,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 31: [2022-11-25 21:27:57,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 27: [2022-11-25 21:27:57,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:27:57,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-25 21:27:57,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
14: [2022-11-25 21:27:57,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:27:57,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 21:27:57,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 9: [2022-11-25 21:27:57,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:27:57,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 21:27:57,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 14: [2022-11-25 21:27:57,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:27:57,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 21:27:57,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 24: [2022-11-25 21:27:57,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:27:57,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-25 21:27:57,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
5: [2022-11-25 21:27:57,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:27:57,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 21:27:57,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 4: [2022-11-25 21:27:57,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:27:57,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 21:27:57,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 8: [2022-11-25 21:27:57,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:27:57,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 21:27:57,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 14: [2022-11-25 21:27:57,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:27:57,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 21:27:57,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
16: [2022-11-25 21:27:57,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:27:57,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 5: [2022-11-25 21:27:57,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:27:57,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:27:57,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 5: [2022-11-25 21:27:57,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 21:27:57,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 21:27:57,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 5: [2022-11-25 21:27:57,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 13: [2022-11-25 21:27:57,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:27:57,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-25 21:27:57,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 21:27:57,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 21:27:57,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 13: [2022-11-25 21:27:57,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 29: [2022-11-25 21:27:57,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:27:57,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-25 21:27:57,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 6: [2022-11-25 21:27:57,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:27:57,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 21:27:57,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 23: [2022-11-25 21:27:57,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
23: [2022-11-25 21:27:57,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-25 21:27:57,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 16: [2022-11-25 21:27:57,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:27:57,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:27:57,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-25 21:27:57,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-25 21:27:57,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 16: [2022-11-25 21:27:57,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 7: [2022-11-25 21:27:57,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:27:57,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 21:27:57,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 22: [2022-11-25 21:27:57,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
22: [2022-11-25 21:27:57,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-25 21:27:57,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 1: [2022-11-25 21:27:57,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:27:57,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 21:27:57,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 23: [2022-11-25 21:27:57,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:27:57,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-25 21:27:57,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 19: [2022-11-25 21:27:57,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:27:57,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-25 21:27:57,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 19: [2022-11-25 21:27:57,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
19: [2022-11-25 21:27:57,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-25 21:27:57,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 3: [2022-11-25 21:27:57,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:27:57,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:27:57,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:27:57,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 21:27:57,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 21:27:57,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 21:27:57,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 3: [2022-11-25 21:27:57,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 15: [2022-11-25 21:27:57,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:27:57,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
15: [2022-11-25 21:27:57,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 21:27:57,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 27: [2022-11-25 21:27:57,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:27:57,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-25 21:27:57,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 15: [2022-11-25 21:27:57,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:27:57,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 21:27:57,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 10: [2022-11-25 21:27:57,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:27:57,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 21:27:57,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 17: [2022-11-25 21:27:57,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-25 21:27:57,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-25 21:27:57,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 17: [2022-11-25 21:27:57,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:27:57,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-25 21:27:57,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 0: [2022-11-25 21:27:57,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:27:57,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 21:27:57,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 23: [2022-11-25 21:27:57,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:27:57,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-25 21:27:57,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 18: [2022-11-25 21:27:57,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
18: [2022-11-25 21:27:57,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:27:57,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:27:57,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-25 21:27:57,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:27:57,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-25 21:27:57,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-25 21:27:57,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 18: [2022-11-25 21:27:57,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 18: [2022-11-25 21:27:57,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 18: [2022-11-25 21:27:57,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-25 21:27:57,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 20: [2022-11-25 21:27:57,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-25 21:27:57,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-25 21:27:57,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 22: [2022-11-25 21:27:57,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:27:57,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-25 21:27:57,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 6: [2022-11-25 21:27:57,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:27:57,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 21:27:57,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 24: [2022-11-25 21:27:57,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:27:57,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-25 21:27:57,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 21: [2022-11-25 21:27:57,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-25 21:27:57,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-25 21:27:57,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 2: [2022-11-25 21:27:57,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:27:57,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 21:27:57,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 5: [2022-11-25 21:27:57,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:27:57,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:27:57,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 21:27:57,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 21:27:57,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 5: [2022-11-25 21:27:57,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 1: [2022-11-25 21:27:57,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-25 21:27:57,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 21:27:57,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 28: [2022-11-25 21:27:57,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:27:57,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-25 21:27:57,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 13: [2022-11-25 21:27:57,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:27:57,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 21:27:57,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 20: [2022-11-25 21:27:57,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:27:57,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-25 21:27:57,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 4: [2022-11-25 21:27:57,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-25 21:27:57,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 21:27:57,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 11: [2022-11-25 21:27:57,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:27:57,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 21:27:57,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 2: [2022-11-25 21:27:57,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:27:57,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 21:27:57,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 30: [2022-11-25 21:27:57,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:27:57,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-25 21:27:57,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
0: [2022-11-25 21:27:57,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 21:27:57,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 1: [2022-11-25 21:27:57,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:27:57,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 21:27:57,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 6: [2022-11-25 21:27:57,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:27:57,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 21:27:57,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 18: [2022-11-25 21:27:57,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:27:57,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-25 21:27:57,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 0: [2022-11-25 21:27:57,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-25 21:27:57,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 21:27:57,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 27: [2022-11-25 21:27:57,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:27:57,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-25 21:27:57,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 16: [2022-11-25 21:27:57,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:27:57,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-25 21:27:57,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 25: [2022-11-25 21:27:57,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:27:57,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-25 21:27:57,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 26: [2022-11-25 21:27:57,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-25 21:27:57,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-25 21:27:57,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 25: [2022-11-25 21:27:57,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:27:57,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-25 21:27:57,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 8: [2022-11-25 21:27:57,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:27:57,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 21:27:57,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 9: [2022-11-25 21:27:57,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:27:57,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 21:27:57,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 17: [2022-11-25 21:27:57,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-25 21:27:57,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 7: [2022-11-25 21:27:57,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:27:57,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 17: [2022-11-25 21:27:57,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 7: [2022-11-25 21:27:57,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 26: [2022-11-25 21:27:57,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:27:57,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:27:57,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-25 21:27:57,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-25 21:27:57,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 26: [2022-11-25 21:27:57,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 22: [2022-11-25 21:27:57,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-25 21:27:57,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 31: [2022-11-25 21:27:57,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:27:57,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 31: [2022-11-25 21:27:57,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-25 21:27:57,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 3: [2022-11-25 21:27:57,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:27:57,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 21:27:57,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 29: [2022-11-25 21:27:57,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:27:57,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-25 21:27:57,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 14: [2022-11-25 21:27:57,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-25 21:27:57,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 23: [2022-11-25 21:27:57,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:27:57,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 23: [2022-11-25 21:27:57,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-25 21:27:57,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 12: [2022-11-25 21:27:57,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:27:57,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 21:27:57,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 21: [2022-11-25 21:27:57,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:27:57,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-25 21:27:57,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 2: [2022-11-25 21:27:57,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-25 21:27:57,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 21:27:57,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 15: [2022-11-25 21:27:57,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:27:57,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 21:27:57,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 26: [2022-11-25 21:27:57,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:27:57,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-25 21:27:57,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 5: [2022-11-25 21:27:57,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:27:57,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 21:27:57,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 24: [2022-11-25 21:27:57,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
24: [2022-11-25 21:27:57,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-25 21:27:57,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 1: [2022-11-25 21:27:57,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:27:57,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:27:57,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 21:27:57,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 1: [2022-11-25 21:27:57,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 21:27:57,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 9: [2022-11-25 21:27:57,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:27:57,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 21:27:57,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:27:57,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
9: [2022-11-25 21:27:57,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 21:27:57,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 13: [2022-11-25 21:27:57,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:27:57,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 21:27:57,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 28: [2022-11-25 21:27:57,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:27:57,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:27:57,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-25 21:27:57,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:27:57,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 19: [2022-11-25 21:27:57,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
19: [2022-11-25 21:27:57,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 28: [2022-11-25 21:27:57,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 19: [2022-11-25 21:27:57,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 30: [2022-11-25 21:27:57,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:27:57,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-25 21:27:57,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 4: [2022-11-25 21:27:57,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:27:57,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 21:27:57,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 6: [2022-11-25 21:27:57,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:27:57,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 21:27:57,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
16: [2022-11-25 21:27:57,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:27:57,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-25 21:27:57,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 11: [2022-11-25 21:27:57,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:27:57,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 21:27:57,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 18: [2022-11-25 21:27:57,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:27:57,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-25 21:27:57,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 0: [2022-11-25 21:27:57,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:27:57,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 21:27:57,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
8: [2022-11-25 21:27:57,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:27:57,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 21:27:57,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 27: [2022-11-25 21:27:57,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:27:57,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-25 21:27:57,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 7: [2022-11-25 21:27:57,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:27:57,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 21:27:57,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 12: [2022-11-25 21:27:57,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:27:57,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 21:27:57,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
25: [2022-11-25 21:27:57,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:27:57,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-25 21:27:57,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 26: [2022-11-25 21:27:57,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:27:57,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-25 21:27:57,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 15: [2022-11-25 21:27:57,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:27:57,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 21:27:57,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 9: [2022-11-25 21:27:57,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-25 21:27:57,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 20: [2022-11-25 21:27:57,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:27:57,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 20: [2022-11-25 21:27:57,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-25 21:27:57,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 20: [2022-11-25 21:27:57,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:27:57,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-25 21:27:57,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 23: [2022-11-25 21:27:57,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:27:57,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-25 21:27:57,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 29: [2022-11-25 21:27:57,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
29: [2022-11-25 21:27:57,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-25 21:27:57,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 17: [2022-11-25 21:27:57,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:27:57,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-25 21:27:57,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 3: [2022-11-25 21:27:57,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:27:57,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:27:57,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 3: [2022-11-25 21:27:57,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 22: [2022-11-25 21:27:57,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 3: [2022-11-25 21:27:57,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 14: [2022-11-25 21:27:57,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-25 21:27:57,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 21:27:57,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 1: [2022-11-25 21:27:57,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:27:57,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 21:27:57,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 31: [2022-11-25 21:27:57,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-25 21:27:57,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-25 21:27:57,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 13: [2022-11-25 21:27:57,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:27:57,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 21:27:57,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 24: [2022-11-25 21:27:57,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
24: [2022-11-25 21:27:57,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-25 21:27:57,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 26: [2022-11-25 21:27:57,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:27:57,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:27:57,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:27:57,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 21: [2022-11-25 21:27:57,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 5: [2022-11-25 21:27:57,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 21: [2022-11-25 21:27:57,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 26: [2022-11-25 21:27:57,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 5: [2022-11-25 21:27:57,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 2: [2022-11-25 21:27:57,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
10: [2022-11-25 21:27:57,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:27:57,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 10: [2022-11-25 21:27:57,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 2: [2022-11-25 21:27:57,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 10: [2022-11-25 21:27:57,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 18: [2022-11-25 21:27:57,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:27:57,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-25 21:27:57,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 20: [2022-11-25 21:27:57,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:27:57,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-25 21:27:57,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 6: [2022-11-25 21:27:57,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-25 21:27:57,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 21:27:57,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 4: [2022-11-25 21:27:57,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:27:57,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 21:27:57,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 16: [2022-11-25 21:27:57,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:27:57,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-25 21:27:57,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 2: [2022-11-25 21:27:57,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:27:57,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 21:27:57,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 27: [2022-11-25 21:27:57,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 
27: [2022-11-25 21:27:57,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-25 21:27:57,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 8: [2022-11-25 21:27:57,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:27:57,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 21:27:57,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 30: [2022-11-25 21:27:57,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:27:57,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-25 21:27:57,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 0: [2022-11-25 21:27:57,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:27:57,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 21:27:57,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 11: [2022-11-25 21:27:57,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-25 21:27:57,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 21:27:57,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 26: [2022-11-25 21:27:57,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:27:57,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-25 21:27:57,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 9: [2022-11-25 21:27:57,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:27:57,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 21:27:57,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 17: [2022-11-25 21:27:57,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:27:57,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-25 21:27:57,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 15: [2022-11-25 21:27:57,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-25 21:27:57,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 21:27:57,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 25: [2022-11-25 21:27:57,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:27:57,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-25 21:27:57,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 7: [2022-11-25 21:27:57,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:27:57,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 21:27:57,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 5: [2022-11-25 21:27:57,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:27:57,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 21:27:57,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 12: [2022-11-25 21:27:57,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-25 21:27:57,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 21:27:57,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 31: [2022-11-25 21:27:57,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-25 21:27:57,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-25 21:27:57,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 23: [2022-11-25 21:27:57,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:27:57,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-25 21:27:57,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 28: [2022-11-25 21:27:57,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:27:57,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-25 21:27:57,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
3: [2022-11-25 21:27:57,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:27:57,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 21:27:57,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 29: [2022-11-25 21:27:57,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:27:57,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 24: [2022-11-25 21:27:57,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:27:57,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 24: [2022-11-25 21:27:57,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-25 21:27:57,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 21: [2022-11-25 21:27:57,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:27:57,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-25 21:27:57,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
1: [2022-11-25 21:27:57,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:27:57,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 21:27:57,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 13: [2022-11-25 21:27:57,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:27:57,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 21:27:57,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 14: [2022-11-25 21:27:57,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:27:57,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 21:27:57,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 25: [2022-11-25 21:27:57,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:27:57,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-25 21:27:57,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
28: [2022-11-25 21:27:57,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 28: [2022-11-25 21:27:57,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-25 21:27:57,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 6: [2022-11-25 21:27:57,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:27:57,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 21:27:57,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 19: [2022-11-25 21:27:57,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:27:57,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-25 21:27:57,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 2: [2022-11-25 21:27:57,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:27:57,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 21:27:57,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
0: [2022-11-25 21:27:57,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:27:57,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 21:27:57,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 10: [2022-11-25 21:27:57,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:27:57,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 21:27:57,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 7: [2022-11-25 21:27:57,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:27:57,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 21:27:57,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 27: [2022-11-25 21:27:57,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:27:57,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-25 21:27:57,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
20: [2022-11-25 21:27:57,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:27:57,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:27:57,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 18: [2022-11-25 21:27:57,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 20: [2022-11-25 21:27:57,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 18: [2022-11-25 21:27:57,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 31: [2022-11-25 21:27:57,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-25 21:27:57,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-25 21:27:57,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 30: [2022-11-25 21:27:57,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:27:57,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-25 21:27:57,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
22: [2022-11-25 21:27:57,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:27:57,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-25 21:27:57,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 29: [2022-11-25 21:27:57,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:27:57,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-25 21:27:57,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 16: [2022-11-25 21:27:57,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:27:57,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-25 21:27:57,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 9: [2022-11-25 21:27:57,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-25 21:27:57,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 23: [2022-11-25 21:27:57,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:27:57,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 8: [2022-11-25 21:27:57,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:27:57,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 8: [2022-11-25 21:27:57,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 23: [2022-11-25 21:27:57,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 26: [2022-11-25 21:27:57,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:27:57,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 26: [2022-11-25 21:27:57,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-25 21:27:57,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 11: [2022-11-25 21:27:57,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
4: [2022-11-25 21:27:57,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:27:57,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 4: [2022-11-25 21:27:57,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 15: [2022-11-25 21:27:57,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:27:57,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 4: [2022-11-25 21:27:57,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 15: [2022-11-25 21:27:57,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 21:27:57,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 19: [2022-11-25 21:27:57,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:27:57,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-25 21:27:57,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 17: [2022-11-25 21:27:57,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 
17: [2022-11-25 21:27:57,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-25 21:27:57,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 3: [2022-11-25 21:27:57,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:27:57,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 21:27:57,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 22: [2022-11-25 21:27:57,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:27:57,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-25 21:27:57,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 31: [2022-11-25 21:27:57,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-25 21:27:57,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-25 21:27:57,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 12: [2022-11-25 21:27:57,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
13: [2022-11-25 21:27:57,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:27:57,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 13: [2022-11-25 21:27:57,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 12: [2022-11-25 21:27:57,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 13: [2022-11-25 21:27:57,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 20: [2022-11-25 21:27:57,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:27:57,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:27:57,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 14: [2022-11-25 21:27:57,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 20: [2022-11-25 21:27:57,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 14: [2022-11-25 21:27:57,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 14: [2022-11-25 21:27:57,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-25 21:27:57,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 21:27:57,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 3: [2022-11-25 21:27:57,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:27:57,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 21:27:57,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 25: [2022-11-25 21:27:57,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:27:57,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-25 21:27:57,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 9: [2022-11-25 21:27:57,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:27:57,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 21:27:57,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 20: [2022-11-25 21:27:57,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 
20: [2022-11-25 21:27:57,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-25 21:27:57,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 22: [2022-11-25 21:27:57,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:27:57,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step15000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-25 21:27:57,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 0: successfully saved checkpoint at iteration 15000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2694.48 31: iteration 15010/ 173500 | consumed samples: 3842560 | consumed tokens: 7869562880 | elapsed time per iteration (s): 1.09 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.265557E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.808 | TFLOPs: 14.21 | 31: iteration 15020/ 173500 | consumed samples: 3845120 | consumed tokens: 7874805760 | elapsed time per iteration (s): 0.83 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.246537E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.150 | TFLOPs: 18.76 | 31: iteration 15030/ 173500 | consumed samples: 3847680 | consumed tokens: 7880048640 | elapsed time per iteration (s): 0.76 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.274681E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.487 | TFLOPs: 20.42 | 31: iteration 15040/ 173500 | 
consumed samples: 3850240 | consumed tokens: 7885291520 | elapsed time per iteration (s): 0.77 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.298930E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.404 | TFLOPs: 20.05 | 31: iteration 15050/ 173500 | consumed samples: 3852800 | consumed tokens: 7890534400 | elapsed time per iteration (s): 0.73 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.240773E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.116 | TFLOPs: 21.12 | 31: iteration 15060/ 173500 | consumed samples: 3855360 | consumed tokens: 7895777280 | elapsed time per iteration (s): 0.74 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.239542E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.689 | TFLOPs: 20.97 | 31: iteration 15070/ 173500 | consumed samples: 3857920 | consumed tokens: 7901020160 | elapsed time per iteration (s): 0.79 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.257297E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.668 | TFLOPs: 19.52 | 31: iteration 15080/ 173500 | consumed samples: 3860480 | consumed tokens: 7906263040 | elapsed time per iteration (s): 0.81 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.243705E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.619 | TFLOPs: 19.09 | 31: iteration 15090/ 173500 | consumed samples: 3863040 | consumed tokens: 7911505920 | elapsed time per iteration (s): 0.73 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.247249E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 349.439 | TFLOPs: 21.14 | 31: iteration 15100/ 173500 | consumed samples: 3865600 | consumed tokens: 7916748800 | elapsed time per iteration (s): 0.78 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.244817E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.273 | TFLOPs: 19.92 | 31: iteration 15110/ 173500 | consumed samples: 3868160 | consumed tokens: 7921991680 | elapsed time per iteration (s): 0.78 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.267858E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.521 | TFLOPs: 19.87 | 31: iteration 15120/ 173500 | consumed samples: 3870720 | consumed tokens: 7927234560 | elapsed time per iteration (s): 0.77 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.255050E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.474 | TFLOPs: 20.05 | 31: iteration 15130/ 173500 | consumed samples: 3873280 | consumed tokens: 7932477440 | elapsed time per iteration (s): 0.79 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.238246E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.030 | TFLOPs: 19.66 | 31: iteration 15140/ 173500 | consumed samples: 3875840 | consumed tokens: 7937720320 | elapsed time per iteration (s): 0.80 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.280975E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.233 | TFLOPs: 19.31 | 31: iteration 15150/ 173500 | consumed samples: 3878400 | consumed tokens: 7942963200 | elapsed time per iteration (s): 0.79 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 
2.237134E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.739 | TFLOPs: 19.59 | 31: iteration 15160/ 173500 | consumed samples: 3880960 | consumed tokens: 7948206080 | elapsed time per iteration (s): 0.82 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.253230E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.388 | TFLOPs: 18.84 | 31: iteration 15170/ 173500 | consumed samples: 3883520 | consumed tokens: 7953448960 | elapsed time per iteration (s): 0.76 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.273144E+00 | grad norm: 0.373 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.564 | TFLOPs: 20.42 | 31: iteration 15180/ 173500 | consumed samples: 3886080 | consumed tokens: 7958691840 | elapsed time per iteration (s): 0.82 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 6.961064E+00 | grad norm: 3.582 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.268 | TFLOPs: 18.95 | 31: iteration 15190/ 173500 | consumed samples: 3888640 | consumed tokens: 7963934720 | elapsed time per iteration (s): 0.80 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 7.429153E+00 | grad norm: 1.003 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.575 | TFLOPs: 19.33 | 31: iteration 15200/ 173500 | consumed samples: 3891200 | consumed tokens: 7969177600 | elapsed time per iteration (s): 0.79 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 6.582848E+00 | grad norm: 0.630 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.597 | TFLOPs: 19.64 | 31: iteration 15210/ 173500 | consumed samples: 3893760 | consumed tokens: 7974420480 | 
elapsed time per iteration (s): 0.81 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 6.205035E+00 | grad norm: 0.741 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.525 | TFLOPs: 19.21 | 31: iteration 15220/ 173500 | consumed samples: 3896320 | consumed tokens: 7979663360 | elapsed time per iteration (s): 0.76 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 5.984016E+00 | grad norm: 0.538 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.872 | TFLOPs: 20.38 | 31: iteration 15230/ 173500 | consumed samples: 3898880 | consumed tokens: 7984906240 | elapsed time per iteration (s): 0.76 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 5.731346E+00 | grad norm: 0.969 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.761 | TFLOPs: 20.43 | 31: iteration 15240/ 173500 | consumed samples: 3901440 | consumed tokens: 7990149120 | elapsed time per iteration (s): 1.10 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 5.513887E+00 | grad norm: 0.516 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.530 | TFLOPs: 14.07 | 31: iteration 15250/ 173500 | consumed samples: 3904000 | consumed tokens: 7995392000 | elapsed time per iteration (s): 0.75 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 5.300014E+00 | grad norm: 0.510 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.436 | TFLOPs: 20.72 | 31: iteration 15260/ 173500 | consumed samples: 3906560 | consumed tokens: 8000634880 | elapsed time per iteration (s): 0.87 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 5.207876E+00 | grad norm: 0.495 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.635 | TFLOPs: 
17.76 | 31: iteration 15270/ 173500 | consumed samples: 3909120 | consumed tokens: 8005877760 | elapsed time per iteration (s): 0.74 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 4.889173E+00 | grad norm: 0.859 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.544 | TFLOPs: 20.84 | 31: iteration 15280/ 173500 | consumed samples: 3911680 | consumed tokens: 8011120640 | elapsed time per iteration (s): 0.75 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 4.576165E+00 | grad norm: 1.336 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.904 | TFLOPs: 20.74 | 31: iteration 15290/ 173500 | consumed samples: 3914240 | consumed tokens: 8016363520 | elapsed time per iteration (s): 0.75 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 4.069460E+00 | grad norm: 1.351 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.786 | TFLOPs: 20.68 | 31: iteration 15300/ 173500 | consumed samples: 3916800 | consumed tokens: 8021606400 | elapsed time per iteration (s): 1.94 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 3.536905E+00 | grad norm: 0.896 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 132.208 | TFLOPs: 8.00 | 31: iteration 15310/ 173500 | consumed samples: 3919360 | consumed tokens: 8026849280 | elapsed time per iteration (s): 0.75 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 3.107882E+00 | grad norm: 0.703 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.891 | TFLOPs: 20.68 | 31: iteration 15320/ 173500 | consumed samples: 3921920 | consumed tokens: 8032092160 | elapsed time per iteration (s): 0.78 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.795938E+00 | grad norm: 0.408 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.799 | TFLOPs: 19.95 | 31: iteration 15330/ 173500 | consumed samples: 3924480 | consumed tokens: 8037335040 | elapsed time per iteration (s): 0.77 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.592071E+00 | grad norm: 0.390 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.026 | TFLOPs: 20.09 | 31: iteration 15340/ 173500 | consumed samples: 3927040 | consumed tokens: 8042577920 | elapsed time per iteration (s): 0.74 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.508429E+00 | grad norm: 0.466 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.791 | TFLOPs: 20.98 | 31: iteration 15350/ 173500 | consumed samples: 3929600 | consumed tokens: 8047820800 | elapsed time per iteration (s): 0.85 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.420542E+00 | grad norm: 0.217 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.427 | TFLOPs: 18.24 | 31: iteration 15360/ 173500 | consumed samples: 3932160 | consumed tokens: 8053063680 | elapsed time per iteration (s): 0.77 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.378542E+00 | grad norm: 0.267 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.046 | TFLOPs: 20.21 | 31: iteration 15370/ 173500 | consumed samples: 3934720 | consumed tokens: 8058306560 | elapsed time per iteration (s): 0.79 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.363146E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.393 | TFLOPs: 19.69 | 31: iteration 15380/ 173500 | consumed samples: 3937280 | consumed tokens: 8063549440 | elapsed time per iteration (s): 0.85 | learning rate: 1.972E-04 
| global batch size: 256 | lm loss: 2.312221E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.758 | TFLOPs: 18.32 | 31: iteration 15390/ 173500 | consumed samples: 3939840 | consumed tokens: 8068792320 | elapsed time per iteration (s): 0.79 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.340224E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.845 | TFLOPs: 19.65 | 31: iteration 15400/ 173500 | consumed samples: 3942400 | consumed tokens: 8074035200 | elapsed time per iteration (s): 0.74 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.301648E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.366 | TFLOPs: 20.95 | 31: iteration 15410/ 173500 | consumed samples: 3944960 | consumed tokens: 8079278080 | elapsed time per iteration (s): 0.77 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.338880E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.686 | TFLOPs: 20.01 | 31: iteration 15420/ 173500 | consumed samples: 3947520 | consumed tokens: 8084520960 | elapsed time per iteration (s): 0.76 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.325014E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.794 | TFLOPs: 20.38 | 31: iteration 15430/ 173500 | consumed samples: 3950080 | consumed tokens: 8089763840 | elapsed time per iteration (s): 0.80 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.295955E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.948 | TFLOPs: 19.42 | 31: iteration 15440/ 173500 | consumed samples: 3952640 | 
consumed tokens: 8095006720 | elapsed time per iteration (s): 0.76 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.290983E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.920 | TFLOPs: 20.44 | 31: iteration 15450/ 173500 | consumed samples: 3955200 | consumed tokens: 8100249600 | elapsed time per iteration (s): 0.76 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.301986E+00 | grad norm: 0.233 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.247 | TFLOPs: 20.40 | 31: iteration 15460/ 173500 | consumed samples: 3957760 | consumed tokens: 8105492480 | elapsed time per iteration (s): 0.78 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.296815E+00 | grad norm: 0.212 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.639 | TFLOPs: 19.76 | 31: iteration 15470/ 173500 | consumed samples: 3960320 | consumed tokens: 8110735360 | elapsed time per iteration (s): 0.78 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.293293E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.282 | TFLOPs: 19.80 | 31: iteration 15480/ 173500 | consumed samples: 3962880 | consumed tokens: 8115978240 | elapsed time per iteration (s): 0.77 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.293841E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.328 | TFLOPs: 20.11 | 31: iteration 15490/ 173500 | consumed samples: 3965440 | consumed tokens: 8121221120 | elapsed time per iteration (s): 0.78 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.281651E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 329.456 | TFLOPs: 19.93 | 31: iteration 15500/ 173500 | consumed samples: 3968000 | consumed tokens: 8126464000 | elapsed time per iteration (s): 0.74 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.283049E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.076 | TFLOPs: 21.00 | 31: iteration 15510/ 173500 | consumed samples: 3970560 | consumed tokens: 8131706880 | elapsed time per iteration (s): 0.82 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.290385E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.841 | TFLOPs: 18.81 | 31: iteration 15520/ 173500 | consumed samples: 3973120 | consumed tokens: 8136949760 | elapsed time per iteration (s): 0.77 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.281414E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.358 | TFLOPs: 20.23 | 31: iteration 15530/ 173500 | consumed samples: 3975680 | consumed tokens: 8142192640 | elapsed time per iteration (s): 0.82 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.298750E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.099 | TFLOPs: 18.88 | 31: iteration 15540/ 173500 | consumed samples: 3978240 | consumed tokens: 8147435520 | elapsed time per iteration (s): 0.78 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.290561E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.953 | TFLOPs: 19.78 | 31: iteration 15550/ 173500 | consumed samples: 3980800 | consumed tokens: 8152678400 | elapsed time per iteration (s): 0.80 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.257167E+00 | grad norm: 
0.271 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.819 | TFLOPs: 19.47 | 31: iteration 15560/ 173500 | consumed samples: 3983360 | consumed tokens: 8157921280 | elapsed time per iteration (s): 0.76 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.281677E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.040 | TFLOPs: 20.39 | 31: iteration 15570/ 173500 | consumed samples: 3985920 | consumed tokens: 8163164160 | elapsed time per iteration (s): 0.75 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.266387E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.543 | TFLOPs: 20.72 | 31: iteration 15580/ 173500 | consumed samples: 3988480 | consumed tokens: 8168407040 | elapsed time per iteration (s): 0.73 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.271022E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.631 | TFLOPs: 21.15 | 31: iteration 15590/ 173500 | consumed samples: 3991040 | consumed tokens: 8173649920 | elapsed time per iteration (s): 0.75 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.295189E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.648 | TFLOPs: 20.55 | 31: iteration 15600/ 173500 | consumed samples: 3993600 | consumed tokens: 8178892800 | elapsed time per iteration (s): 0.78 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.273607E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.420 | TFLOPs: 19.81 | 31: iteration 15610/ 173500 | consumed samples: 3996160 | consumed tokens: 8184135680 | elapsed time per iteration (s): 
0.74 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.285312E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.400 | TFLOPs: 20.90 | 31: iteration 15620/ 173500 | consumed samples: 3998720 | consumed tokens: 8189378560 | elapsed time per iteration (s): 0.74 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.248393E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.493 | TFLOPs: 21.02 | 31: iteration 15630/ 173500 | consumed samples: 4001280 | consumed tokens: 8194621440 | elapsed time per iteration (s): 0.78 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.280077E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.769 | TFLOPs: 19.83 | 31: iteration 15640/ 173500 | consumed samples: 4003840 | consumed tokens: 8199864320 | elapsed time per iteration (s): 0.75 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.272032E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.444 | TFLOPs: 20.78 | 31: iteration 15650/ 173500 | consumed samples: 4006400 | consumed tokens: 8205107200 | elapsed time per iteration (s): 0.76 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.303766E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.829 | TFLOPs: 20.44 | 31: iteration 15660/ 173500 | consumed samples: 4008960 | consumed tokens: 8210350080 | elapsed time per iteration (s): 0.77 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.239153E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.157 | TFLOPs: 20.22 | 31: iteration 15670/ 
173500 | consumed samples: 4011520 | consumed tokens: 8215592960 | elapsed time per iteration (s): 0.80 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.264681E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.738 | TFLOPs: 19.46 | 31: iteration 15680/ 173500 | consumed samples: 4014080 | consumed tokens: 8220835840 | elapsed time per iteration (s): 0.85 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.251149E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.336 | TFLOPs: 18.23 | 31: iteration 15690/ 173500 | consumed samples: 4016640 | consumed tokens: 8226078720 | elapsed time per iteration (s): 0.78 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.289221E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.609 | TFLOPs: 19.82 | 31: iteration 15700/ 173500 | consumed samples: 4019200 | consumed tokens: 8231321600 | elapsed time per iteration (s): 0.73 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.268130E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.777 | TFLOPs: 21.10 | 31: iteration 15710/ 173500 | consumed samples: 4021760 | consumed tokens: 8236564480 | elapsed time per iteration (s): 0.82 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.278046E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.390 | TFLOPs: 18.90 | 31: iteration 15720/ 173500 | consumed samples: 4024320 | consumed tokens: 8241807360 | elapsed time per iteration (s): 0.74 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.285543E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 346.340 | TFLOPs: 20.95 | 31: iteration 15730/ 173500 | consumed samples: 4026880 | consumed tokens: 8247050240 | elapsed time per iteration (s): 0.77 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.273717E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.201 | TFLOPs: 20.10 | 31: iteration 15740/ 173500 | consumed samples: 4029440 | consumed tokens: 8252293120 | elapsed time per iteration (s): 0.76 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.269541E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.688 | TFLOPs: 20.43 | 31: iteration 15750/ 173500 | consumed samples: 4032000 | consumed tokens: 8257536000 | elapsed time per iteration (s): 0.81 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.260110E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.156 | TFLOPs: 19.01 | 31: iteration 15760/ 173500 | consumed samples: 4034560 | consumed tokens: 8262778880 | elapsed time per iteration (s): 0.76 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.271285E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.598 | TFLOPs: 20.42 | 31: iteration 15770/ 173500 | consumed samples: 4037120 | consumed tokens: 8268021760 | elapsed time per iteration (s): 0.80 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.268027E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.370 | TFLOPs: 19.44 | 31: iteration 15780/ 173500 | consumed samples: 4039680 | consumed tokens: 8273264640 | elapsed time per iteration (s): 0.75 | learning rate: 1.970E-04 | global batch size: 256 | 
lm loss: 2.246407E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.126 | TFLOPs: 20.64 | 31: iteration 15790/ 173500 | consumed samples: 4042240 | consumed tokens: 8278507520 | elapsed time per iteration (s): 0.81 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.244043E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.650 | TFLOPs: 19.04 | 31: iteration 15800/ 173500 | consumed samples: 4044800 | consumed tokens: 8283750400 | elapsed time per iteration (s): 0.79 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.240786E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.373 | TFLOPs: 19.62 | 31: iteration 15810/ 173500 | consumed samples: 4047360 | consumed tokens: 8288993280 | elapsed time per iteration (s): 0.76 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.224179E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.405 | TFLOPs: 20.35 | 31: iteration 15820/ 173500 | consumed samples: 4049920 | consumed tokens: 8294236160 | elapsed time per iteration (s): 0.77 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.280451E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.835 | TFLOPs: 20.01 | 31: iteration 15830/ 173500 | consumed samples: 4052480 | consumed tokens: 8299479040 | elapsed time per iteration (s): 0.81 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.260613E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.305 | TFLOPs: 19.08 | 31: iteration 15840/ 173500 | consumed samples: 4055040 | consumed tokens: 
8304721920 | elapsed time per iteration (s): 0.76 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.273075E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.043 | TFLOPs: 20.33 | 31: iteration 15850/ 173500 | consumed samples: 4057600 | consumed tokens: 8309964800 | elapsed time per iteration (s): 0.83 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.270916E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.530 | TFLOPs: 18.60 | 31: iteration 15860/ 173500 | consumed samples: 4060160 | consumed tokens: 8315207680 | elapsed time per iteration (s): 0.74 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.254829E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.545 | TFLOPs: 20.84 | 31: iteration 15870/ 173500 | consumed samples: 4062720 | consumed tokens: 8320450560 | elapsed time per iteration (s): 0.81 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.238665E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.580 | TFLOPs: 19.15 | 31: iteration 15880/ 173500 | consumed samples: 4065280 | consumed tokens: 8325693440 | elapsed time per iteration (s): 0.82 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.249148E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.799 | TFLOPs: 18.86 | 31: iteration 15890/ 173500 | consumed samples: 4067840 | consumed tokens: 8330936320 | elapsed time per iteration (s): 0.78 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.256886E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
329.219 | TFLOPs: 19.92 | 31: iteration 15900/ 173500 | consumed samples: 4070400 | consumed tokens: 8336179200 | elapsed time per iteration (s): 0.82 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.265081E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.246 | TFLOPs: 18.95 | 31: iteration 15910/ 173500 | consumed samples: 4072960 | consumed tokens: 8341422080 | elapsed time per iteration (s): 0.73 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.266208E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.087 | TFLOPs: 21.36 | 31: iteration 15920/ 173500 | consumed samples: 4075520 | consumed tokens: 8346664960 | elapsed time per iteration (s): 0.79 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.221431E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.743 | TFLOPs: 19.53 | 31: iteration 15930/ 173500 | consumed samples: 4078080 | consumed tokens: 8351907840 | elapsed time per iteration (s): 0.79 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.246285E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.763 | TFLOPs: 19.71 | 31: iteration 15940/ 173500 | consumed samples: 4080640 | consumed tokens: 8357150720 | elapsed time per iteration (s): 0.82 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.252236E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.468 | TFLOPs: 18.96 | 31: iteration 15950/ 173500 | consumed samples: 4083200 | consumed tokens: 8362393600 | elapsed time per iteration (s): 0.75 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.264360E+00 | grad norm: 0.164 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.373 | TFLOPs: 20.77 | 31: iteration 15960/ 173500 | consumed samples: 4085760 | consumed tokens: 8367636480 | elapsed time per iteration (s): 0.80 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.241636E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.920 | TFLOPs: 19.48 | 31: iteration 15970/ 173500 | consumed samples: 4088320 | consumed tokens: 8372879360 | elapsed time per iteration (s): 0.78 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.257141E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.171 | TFLOPs: 19.79 | 31: iteration 15980/ 173500 | consumed samples: 4090880 | consumed tokens: 8378122240 | elapsed time per iteration (s): 0.77 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.277350E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.315 | TFLOPs: 20.10 | 31: iteration 15990/ 173500 | consumed samples: 4093440 | consumed tokens: 8383365120 | elapsed time per iteration (s): 0.84 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.269415E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.003 | TFLOPs: 18.39 | 0: [2022-11-25 21:41:11,596] [INFO] [logging.py:68:log_dist] [Rank 0] step=16000, skipped=0, lr=[0.00019695408064628468, 0.00019695408064628468, 0.00019695408064628468], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 16000/ 173500 | consumed samples: 4096000 | consumed tokens: 8388608000 | elapsed time per iteration (s): 0.80 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.979819E+00 | grad norm: 8.781 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 319.647 | TFLOPs: 19.34 | 0: steps: 16000 loss: 3.3599 iter time (s): 0.790 samples/sec: 323.933 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 16000 | lm loss value: 2.551089E+00 | lm loss PPL: 1.282106E+01 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 16000 to checkpoints_1b1long 0: [2022-11-25 21:41:11,849] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step16000 is begin to save! 0: [2022-11-25 21:41:12,024] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_01-model_00-model_states.pt... 0: [2022-11-25 21:41:12,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_01-model_00-model_states.pt. 0: [2022-11-25 21:41:12,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_03-model_00-model_states.pt... 0: [2022-11-25 21:41:12,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_03-model_00-model_states.pt. 0: [2022-11-25 21:41:12,307] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_04-model_00-model_states.pt... 0: [2022-11-25 21:41:12,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_04-model_00-model_states.pt. 0: [2022-11-25 21:41:12,381] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_05-model_00-model_states.pt... 0: [2022-11-25 21:41:12,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_05-model_00-model_states.pt. 
0: [2022-11-25 21:41:12,453] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_06-model_00-model_states.pt... 0: [2022-11-25 21:41:12,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_06-model_00-model_states.pt. 0: [2022-11-25 21:41:12,529] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_07-model_00-model_states.pt... 0: [2022-11-25 21:41:12,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_07-model_00-model_states.pt. 0: [2022-11-25 21:41:12,603] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_08-model_00-model_states.pt... 0: [2022-11-25 21:41:12,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_08-model_00-model_states.pt. 0: [2022-11-25 21:41:12,677] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_09-model_00-model_states.pt... 0: [2022-11-25 21:41:12,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_09-model_00-model_states.pt. 0: [2022-11-25 21:41:12,751] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_10-model_00-model_states.pt... 0: [2022-11-25 21:41:12,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_10-model_00-model_states.pt. 0: [2022-11-25 21:41:12,824] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_11-model_00-model_states.pt... 0: [2022-11-25 21:41:12,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_11-model_00-model_states.pt. 
0: [2022-11-25 21:41:12,900] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_12-model_00-model_states.pt... 0: [2022-11-25 21:41:12,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_12-model_00-model_states.pt. 0: [2022-11-25 21:41:12,975] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_13-model_00-model_states.pt... 0: [2022-11-25 21:41:13,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_13-model_00-model_states.pt. 0: [2022-11-25 21:41:13,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_14-model_00-model_states.pt... 0: [2022-11-25 21:41:13,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_14-model_00-model_states.pt. 0: [2022-11-25 21:41:13,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_15-model_00-model_states.pt... 0: [2022-11-25 21:41:13,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_15-model_00-model_states.pt. 0: [2022-11-25 21:41:13,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_16-model_00-model_states.pt... 0: [2022-11-25 21:41:13,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_16-model_00-model_states.pt. 0: [2022-11-25 21:41:13,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_17-model_00-model_states.pt... 0: [2022-11-25 21:41:13,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_17-model_00-model_states.pt. 
0: [2022-11-25 21:41:13,344] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_18-model_00-model_states.pt... 0: [2022-11-25 21:41:13,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_18-model_00-model_states.pt. 0: [2022-11-25 21:41:13,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_19-model_00-model_states.pt... 0: [2022-11-25 21:41:13,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_19-model_00-model_states.pt. 0: [2022-11-25 21:41:13,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_20-model_00-model_states.pt... 0: [2022-11-25 21:41:13,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_20-model_00-model_states.pt. 0: [2022-11-25 21:41:13,564] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_21-model_00-model_states.pt... 0: [2022-11-25 21:41:13,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_21-model_00-model_states.pt. 0: [2022-11-25 21:41:13,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_22-model_00-model_states.pt... 0: [2022-11-25 21:41:13,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_22-model_00-model_states.pt. 0: [2022-11-25 21:41:13,710] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_23-model_00-model_states.pt... 0: [2022-11-25 21:41:13,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_23-model_00-model_states.pt. 
0: [2022-11-25 21:41:13,784] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_24-model_00-model_states.pt... 0: [2022-11-25 21:41:13,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_24-model_00-model_states.pt. 0: [2022-11-25 21:41:13,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_25-model_00-model_states.pt... 0: [2022-11-25 21:41:13,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_25-model_00-model_states.pt. 0: [2022-11-25 21:41:13,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_26-model_00-model_states.pt... 0: [2022-11-25 21:41:14,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_26-model_00-model_states.pt. 0: [2022-11-25 21:41:14,008] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_27-model_00-model_states.pt... 0: [2022-11-25 21:41:14,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_27-model_00-model_states.pt. 0: [2022-11-25 21:41:14,082] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_28-model_00-model_states.pt... 0: [2022-11-25 21:41:14,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_28-model_00-model_states.pt. 0: [2022-11-25 21:41:14,156] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/layer_30-model_00-model_states.pt... 0: [2022-11-25 21:41:14,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/layer_30-model_00-model_states.pt. 
0: [2022-11-25 21:41:14,159] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step16000/mp_rank_00_model_states.pt 0: [2022-11-25 21:41:14,159] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/mp_rank_00_model_states.pt... 0: [2022-11-25 21:41:14,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/mp_rank_00_model_states.pt. 0: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 
22: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 
21: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 
19: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
5: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 
8: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
2: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 
3: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 
20: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
11: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
14: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 
17: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 
27: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 
7: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
10: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 
13: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 
25: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 
14: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 
21: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
4: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 
3: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
14: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 
10: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 
0: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:41:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:41:14,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:41:14,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-25 21:41:14,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 1: [2022-11-25 21:41:14,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-25 21:41:14,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 0: [2022-11-25 21:41:14,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:41:14,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 21:41:14,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 1: [2022-11-25 21:41:14,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 6: [2022-11-25 21:41:14,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:41:14,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 21:41:14,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 30: [2022-11-25 21:41:14,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:41:14,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 20: [2022-11-25 21:41:14,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:41:14,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
20: [2022-11-25 21:41:14,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-25 21:41:14,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 31: [2022-11-25 21:41:14,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-25 21:41:14,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-25 21:41:14,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 29: [2022-11-25 21:41:14,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:41:14,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-25 21:41:14,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 3: [2022-11-25 21:41:14,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:41:14,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 21:41:14,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 22: [2022-11-25 21:41:14,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
22: [2022-11-25 21:41:14,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-25 21:41:14,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 30: [2022-11-25 21:41:14,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:41:14,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-25 21:41:14,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 14: [2022-11-25 21:41:14,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:41:14,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 21:41:14,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 9: [2022-11-25 21:41:14,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:41:14,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 21:41:14,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 6: [2022-11-25 21:41:14,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-25 21:41:14,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 21:41:14,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 25: [2022-11-25 21:41:14,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:41:14,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:41:14,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 21:41:14,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 25: [2022-11-25 21:41:14,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-25 21:41:14,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 14: [2022-11-25 21:41:14,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:41:14,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 21:41:14,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 17: [2022-11-25 21:41:14,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 
17: [2022-11-25 21:41:14,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-25 21:41:14,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 17: [2022-11-25 21:41:14,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:41:14,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-25 21:41:14,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 6: [2022-11-25 21:41:14,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:41:14,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:41:14,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 6: [2022-11-25 21:41:14,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 4: [2022-11-25 21:41:14,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 6: [2022-11-25 21:41:14,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 11: [2022-11-25 21:41:14,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-25 21:41:14,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 21:41:14,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 20: [2022-11-25 21:41:14,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:41:14,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-25 21:41:14,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 27: [2022-11-25 21:41:14,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:41:14,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:41:14,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 27: [2022-11-25 21:41:14,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 19: [2022-11-25 21:41:14,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 27: [2022-11-25 21:41:14,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 12: [2022-11-25 21:41:14,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-25 21:41:14,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 0: [2022-11-25 21:41:14,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:41:14,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:41:14,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:41:14,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 16: [2022-11-25 21:41:14,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 0: [2022-11-25 21:41:14,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 16: [2022-11-25 21:41:14,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 0: [2022-11-25 21:41:14,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 1: [2022-11-25 21:41:14,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 24: [2022-11-25 21:41:14,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:41:14,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
24: [2022-11-25 21:41:14,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 1: [2022-11-25 21:41:14,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:41:14,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 1: [2022-11-25 21:41:14,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 21:41:14,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 29: [2022-11-25 21:41:14,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:41:14,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:41:14,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 0: [2022-11-25 21:41:14,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 21:41:14,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 7: [2022-11-25 21:41:14,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-25 21:41:14,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 21:41:14,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 29: [2022-11-25 21:41:14,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 17: [2022-11-25 21:41:14,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:41:14,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:41:14,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 17: [2022-11-25 21:41:14,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-25 21:41:14,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 19: [2022-11-25 21:41:14,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 21: [2022-11-25 21:41:14,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:41:14,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-25 21:41:14,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
27: [2022-11-25 21:41:14,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:41:14,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:41:14,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 30: [2022-11-25 21:41:14,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 27: [2022-11-25 21:41:14,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 30: [2022-11-25 21:41:14,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 11: [2022-11-25 21:41:14,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:41:14,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 21:41:14,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 6: [2022-11-25 21:41:14,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:41:14,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 21:41:14,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
8: [2022-11-25 21:41:14,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:41:14,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:41:14,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 21:41:14,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 16: [2022-11-25 21:41:14,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-25 21:41:14,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 27: [2022-11-25 21:41:14,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:41:14,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 4: [2022-11-25 21:41:14,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:41:14,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 27: [2022-11-25 21:41:14,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 4: [2022-11-25 21:41:14,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
22: [2022-11-25 21:41:14,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:41:14,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-25 21:41:14,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 18: [2022-11-25 21:41:14,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:41:14,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:41:14,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-25 21:41:14,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 18: [2022-11-25 21:41:14,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-25 21:41:14,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 14: [2022-11-25 21:41:14,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:41:14,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 21:41:14,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
1: [2022-11-25 21:41:14,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:41:14,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 21:41:14,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 17: [2022-11-25 21:41:14,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:41:14,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-25 21:41:14,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 7: [2022-11-25 21:41:14,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:41:14,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:41:14,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 21:41:14,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 20: [2022-11-25 21:41:14,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:41:14,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
7: [2022-11-25 21:41:14,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 20: [2022-11-25 21:41:14,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-25 21:41:14,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 3: [2022-11-25 21:41:14,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:41:14,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:41:14,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:41:14,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 21:41:14,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 21:41:14,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 3: [2022-11-25 21:41:14,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 20: [2022-11-25 21:41:14,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 3: [2022-11-25 21:41:14,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
20: [2022-11-25 21:41:14,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 3: [2022-11-25 21:41:14,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 21:41:14,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 12: [2022-11-25 21:41:14,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:41:14,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 21:41:14,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 27: [2022-11-25 21:41:14,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:41:14,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:41:14,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 11: [2022-11-25 21:41:14,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 27: [2022-11-25 21:41:14,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 11: [2022-11-25 21:41:14,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
9: [2022-11-25 21:41:14,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:41:14,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 21:41:14,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 8: [2022-11-25 21:41:14,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:41:14,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 7: [2022-11-25 21:41:14,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:41:14,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 7: [2022-11-25 21:41:14,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 21:41:14,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 18: [2022-11-25 21:41:14,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:41:14,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-25 21:41:14,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
19: [2022-11-25 21:41:14,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:41:14,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-25 21:41:14,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 18: [2022-11-25 21:41:14,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:41:14,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-25 21:41:14,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 24: [2022-11-25 21:41:14,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:41:14,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:41:14,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 24: [2022-11-25 21:41:14,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 12: [2022-11-25 21:41:14,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 24: [2022-11-25 21:41:14,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
9: [2022-11-25 21:41:14,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:41:14,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 21:41:14,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 31: [2022-11-25 21:41:14,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-25 21:41:14,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-25 21:41:14,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 24: [2022-11-25 21:41:14,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:41:14,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-25 21:41:14,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 24: [2022-11-25 21:41:14,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:41:14,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-25 21:41:14,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
22: [2022-11-25 21:41:14,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:41:14,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-25 21:41:14,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 29: [2022-11-25 21:41:14,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:41:14,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-25 21:41:14,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 30: [2022-11-25 21:41:14,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:41:14,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 29: [2022-11-25 21:41:14,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:41:14,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
29: [2022-11-25 21:41:14,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 8: [2022-11-25 21:41:14,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:41:14,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 8: [2022-11-25 21:41:14,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 21:41:14,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 14: [2022-11-25 21:41:14,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:41:14,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 21:41:14,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 25: [2022-11-25 21:41:14,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:41:14,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-25 21:41:14,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 4: [2022-11-25 21:41:14,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-25 21:41:14,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 21:41:14,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 25: [2022-11-25 21:41:14,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:41:14,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-25 21:41:14,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 25: [2022-11-25 21:41:14,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:41:14,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-25 21:41:14,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 21: [2022-11-25 21:41:14,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:41:14,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-25 21:41:14,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 5: [2022-11-25 21:41:14,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-25 21:41:14,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:41:14,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:41:14,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 21:41:14,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 21:41:14,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 21: [2022-11-25 21:41:14,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:41:14,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 21: [2022-11-25 21:41:14,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 5: [2022-11-25 21:41:14,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 5: [2022-11-25 21:41:14,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 21: [2022-11-25 21:41:14,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 0: [2022-11-25 21:41:14,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
1: [2022-11-25 21:41:14,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:41:14,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 1: [2022-11-25 21:41:14,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 21:41:14,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 0: [2022-11-25 21:41:14,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 20: [2022-11-25 21:41:14,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:41:14,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-25 21:41:14,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 16: [2022-11-25 21:41:14,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:41:14,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
16: [2022-11-25 21:41:14,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 18: [2022-11-25 21:41:14,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 16: [2022-11-25 21:41:14,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 18: [2022-11-25 21:41:14,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 4: [2022-11-25 21:41:14,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:41:14,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 21:41:14,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 9: [2022-11-25 21:41:14,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:41:14,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 21:41:14,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 8: [2022-11-25 21:41:14,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-25 21:41:14,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 21:41:14,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 31: [2022-11-25 21:41:14,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-25 21:41:14,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-25 21:41:14,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 31: [2022-11-25 21:41:14,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-25 21:41:14,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-25 21:41:14,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 6: [2022-11-25 21:41:14,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:41:14,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 28: [2022-11-25 21:41:14,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
28: [2022-11-25 21:41:14,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:41:14,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:41:14,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:41:14,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 28: [2022-11-25 21:41:14,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-25 21:41:14,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-25 21:41:14,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-25 21:41:14,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-25 21:41:14,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 28: [2022-11-25 21:41:14,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 28: [2022-11-25 21:41:14,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 28: [2022-11-25 21:41:14,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
19: [2022-11-25 21:41:14,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:41:14,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-25 21:41:14,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 11: [2022-11-25 21:41:14,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:41:14,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 21:41:14,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 3: [2022-11-25 21:41:14,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:41:14,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 21:41:14,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 17: [2022-11-25 21:41:14,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:41:14,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-25 21:41:14,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
22: [2022-11-25 21:41:14,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:41:14,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-25 21:41:14,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 21: [2022-11-25 21:41:14,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:41:14,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-25 21:41:14,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 25: [2022-11-25 21:41:14,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:41:14,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:41:14,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:41:14,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:41:14,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
25: [2022-11-25 21:41:14,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 2: [2022-11-25 21:41:14,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 21:41:14,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 21:41:14,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 21:41:14,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 25: [2022-11-25 21:41:14,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 2: [2022-11-25 21:41:14,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 2: [2022-11-25 21:41:14,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 2: [2022-11-25 21:41:14,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 2: [2022-11-25 21:41:14,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 21: [2022-11-25 21:41:14,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
21: [2022-11-25 21:41:14,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-25 21:41:14,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 15: [2022-11-25 21:41:14,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:41:14,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:41:14,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:41:14,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:41:14,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:41:14,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 21:41:14,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 21:41:14,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 15: [2022-11-25 21:41:14,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
15: [2022-11-25 21:41:14,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 21:41:14,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 21:41:14,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 21:41:14,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 15: [2022-11-25 21:41:14,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 15: [2022-11-25 21:41:14,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 9: [2022-11-25 21:41:14,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:41:14,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 21:41:14,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 24: [2022-11-25 21:41:14,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:41:14,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-25 21:41:14,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
27: [2022-11-25 21:41:14,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:41:14,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-25 21:41:14,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 5: [2022-11-25 21:41:14,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:41:14,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 21:41:14,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 16: [2022-11-25 21:41:14,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:41:14,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-25 21:41:14,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 29: [2022-11-25 21:41:14,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:41:14,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-25 21:41:14,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
5: [2022-11-25 21:41:14,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:41:14,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 21:41:14,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 14: [2022-11-25 21:41:14,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:41:14,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 21:41:14,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 0: [2022-11-25 21:41:14,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:41:14,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 21:41:14,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 30: [2022-11-25 21:41:14,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:41:14,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-25 21:41:14,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
31: [2022-11-25 21:41:14,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-25 21:41:14,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-25 21:41:14,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 12: [2022-11-25 21:41:14,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:41:14,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 21:41:14,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 28: [2022-11-25 21:41:14,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:41:14,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-25 21:41:14,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 1: [2022-11-25 21:41:14,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:41:14,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 21:41:14,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
7: [2022-11-25 21:41:14,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:41:14,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 21:41:14,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 6: [2022-11-25 21:41:14,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:41:14,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 21:41:14,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 2: [2022-11-25 21:41:14,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:41:14,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 21:41:14,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 17: [2022-11-25 21:41:14,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:41:14,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-25 21:41:14,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
8: [2022-11-25 21:41:14,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:41:14,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 21:41:14,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 19: [2022-11-25 21:41:14,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:41:14,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-25 21:41:14,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 4: [2022-11-25 21:41:14,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:41:14,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 21:41:14,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 3: [2022-11-25 21:41:14,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:41:14,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 21:41:14,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
16: [2022-11-25 21:41:14,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:41:14,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-25 21:41:14,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 15: [2022-11-25 21:41:14,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:41:14,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 21:41:14,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 21: [2022-11-25 21:41:14,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:41:14,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-25 21:41:14,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 9: [2022-11-25 21:41:14,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:41:14,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 21:41:14,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
18: [2022-11-25 21:41:14,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:41:14,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-25 21:41:14,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 29: [2022-11-25 21:41:14,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:41:14,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-25 21:41:14,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 20: [2022-11-25 21:41:14,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:41:14,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-25 21:41:14,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 5: [2022-11-25 21:41:14,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:41:14,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 21:41:14,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
25: [2022-11-25 21:41:14,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:41:14,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-25 21:41:14,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 27: [2022-11-25 21:41:14,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:41:14,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:41:14,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 0: [2022-11-25 21:41:14,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:41:14,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 27: [2022-11-25 21:41:14,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 0: [2022-11-25 21:41:14,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 27: [2022-11-25 21:41:14,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 0: [2022-11-25 21:41:14,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
31: [2022-11-25 21:41:14,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-25 21:41:14,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-25 21:41:14,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 7: [2022-11-25 21:41:14,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:41:14,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:41:14,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 21:41:14,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 24: [2022-11-25 21:41:14,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-25 21:41:14,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 6: [2022-11-25 21:41:14,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:41:14,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 21:41:14,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
28: [2022-11-25 21:41:14,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:41:14,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:41:14,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 21:41:14,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 18: [2022-11-25 21:41:14,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:41:14,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-25 21:41:14,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 12: [2022-11-25 21:41:14,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:41:14,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 21:41:14,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 9: [2022-11-25 21:41:14,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-25 21:41:14,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 21:41:14,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 17: [2022-11-25 21:41:14,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:41:14,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-25 21:41:14,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 2: [2022-11-25 21:41:14,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:41:14,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 15: [2022-11-25 21:41:14,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:41:14,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 15: [2022-11-25 21:41:14,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 21:41:14,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 20: [2022-11-25 21:41:14,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
20: [2022-11-25 21:41:14,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-25 21:41:14,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 19: [2022-11-25 21:41:14,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:41:14,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-25 21:41:14,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 8: [2022-11-25 21:41:14,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:41:14,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 21:41:14,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 3: [2022-11-25 21:41:14,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:41:14,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 21:41:14,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 21: [2022-11-25 21:41:14,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
4: [2022-11-25 21:41:14,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:41:14,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 4: [2022-11-25 21:41:14,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 21: [2022-11-25 21:41:14,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 4: [2022-11-25 21:41:14,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 28: [2022-11-25 21:41:14,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-25 21:41:14,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 24: [2022-11-25 21:41:14,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:41:14,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-25 21:41:14,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 11: [2022-11-25 21:41:14,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-25 21:41:14,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 21:41:14,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 11: [2022-11-25 21:41:14,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:41:14,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 21:41:14,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 16: [2022-11-25 21:41:14,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:41:14,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-25 21:41:14,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 27: [2022-11-25 21:41:14,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:41:14,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-25 21:41:14,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 27: [2022-11-25 21:41:14,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 5: [2022-11-25 21:41:14,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 27: [2022-11-25 21:41:14,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 14: [2022-11-25 21:41:14,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:41:14,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 21:41:14,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 6: [2022-11-25 21:41:14,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:41:14,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:41:14,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 6: [2022-11-25 21:41:14,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 22: [2022-11-25 21:41:14,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
6: [2022-11-25 21:41:14,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 14: [2022-11-25 21:41:14,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:41:14,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 21:41:14,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 25: [2022-11-25 21:41:14,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:41:14,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-25 21:41:14,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 22: [2022-11-25 21:41:14,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:41:14,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-25 21:41:14,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 2: [2022-11-25 21:41:14,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:41:14,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
2: [2022-11-25 21:41:14,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 15: [2022-11-25 21:41:14,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 2: [2022-11-25 21:41:14,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 15: [2022-11-25 21:41:14,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 19: [2022-11-25 21:41:14,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:41:14,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 21: [2022-11-25 21:41:14,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:41:14,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 1: [2022-11-25 21:41:14,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:41:14,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-25 21:41:14,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
1: [2022-11-25 21:41:14,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 7: [2022-11-25 21:41:14,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:41:14,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 29: [2022-11-25 21:41:14,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:41:14,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 29: [2022-11-25 21:41:14,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 7: [2022-11-25 21:41:14,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 29: [2022-11-25 21:41:14,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 9: [2022-11-25 21:41:14,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:41:14,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 21:41:14,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 0: [2022-11-25 21:41:14,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-25 21:41:14,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 21:41:14,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 17: [2022-11-25 21:41:14,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:41:14,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 3: [2022-11-25 21:41:14,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:41:14,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 17: [2022-11-25 21:41:14,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 5: [2022-11-25 21:41:14,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:41:14,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 5: [2022-11-25 21:41:14,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 20: [2022-11-25 21:41:14,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:41:14,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
20: [2022-11-25 21:41:14,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-25 21:41:14,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 8: [2022-11-25 21:41:14,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:41:14,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:41:14,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 12: [2022-11-25 21:41:14,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 8: [2022-11-25 21:41:14,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 12: [2022-11-25 21:41:14,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 31: [2022-11-25 21:41:14,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-25 21:41:14,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-25 21:41:14,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 4: [2022-11-25 21:41:14,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-25 21:41:14,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 21:41:14,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 30: [2022-11-25 21:41:14,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:41:14,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-25 21:41:14,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 28: [2022-11-25 21:41:14,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:41:14,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:41:14,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-25 21:41:14,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 16: [2022-11-25 21:41:14,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-25 21:41:14,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 14: [2022-11-25 21:41:14,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-25 21:41:14,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 21:41:14,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 25: [2022-11-25 21:41:14,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:41:14,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 27: [2022-11-25 21:41:14,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:41:14,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 27: [2022-11-25 21:41:14,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-25 21:41:14,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 22: [2022-11-25 21:41:14,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:41:14,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-25 21:41:14,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 7: [2022-11-25 21:41:14,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-25 21:41:14,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 21:41:14,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 0: [2022-11-25 21:41:14,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:41:14,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:41:14,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-25 21:41:14,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 11: [2022-11-25 21:41:14,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:41:14,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:41:14,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 21:41:14,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
28: [2022-11-25 21:41:14,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 30: [2022-11-25 21:41:14,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:41:14,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 30: [2022-11-25 21:41:14,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-25 21:41:14,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 24: [2022-11-25 21:41:14,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:41:14,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-25 21:41:14,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 12: [2022-11-25 21:41:14,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:41:14,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 21:41:14,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 18: [2022-11-25 21:41:14,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
18: [2022-11-25 21:41:14,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-25 21:41:14,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 2: [2022-11-25 21:41:14,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:41:14,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 21:41:14,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 0: [2022-11-25 21:41:14,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 21:41:14,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 4: [2022-11-25 21:41:14,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:41:14,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 21:41:14,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 8: [2022-11-25 21:41:14,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-25 21:41:14,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 31: [2022-11-25 21:41:14,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:41:14,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 31: [2022-11-25 21:41:14,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-25 21:41:14,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 11: [2022-11-25 21:41:14,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:41:14,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 21:41:14,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 30: [2022-11-25 21:41:14,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:41:14,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-25 21:41:14,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 16: [2022-11-25 21:41:14,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
16: [2022-11-25 21:41:14,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-25 21:41:14,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 10: [2022-11-25 21:41:14,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:41:14,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:41:14,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:41:14,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 21:41:14,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 21:41:14,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:41:14,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 10: [2022-11-25 21:41:14,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 21:41:14,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-25 21:41:14,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:41:14,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 10: [2022-11-25 21:41:14,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 21:41:14,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 10: [2022-11-25 21:41:14,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 21:41:14,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 21:41:14,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 10: [2022-11-25 21:41:14,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 10: [2022-11-25 21:41:14,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 23: [2022-11-25 21:41:14,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:41:14,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:41:14,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-25 21:41:14,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-25 21:41:14,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-25 21:41:14,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-25 21:41:14,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 23: [2022-11-25 21:41:14,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 23: [2022-11-25 21:41:14,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 23: [2022-11-25 21:41:14,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:41:14,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:41:14,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-25 21:41:14,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-25 21:41:14,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-25 21:41:14,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-25 21:41:14,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 23: [2022-11-25 21:41:14,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 23: [2022-11-25 21:41:14,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 26: [2022-11-25 21:41:14,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:41:14,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:41:14,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
26: [2022-11-25 21:41:14,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-25 21:41:14,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-25 21:41:14,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-25 21:41:14,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 26: [2022-11-25 21:41:14,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 26: [2022-11-25 21:41:14,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 23: [2022-11-25 21:41:14,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:41:14,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-25 21:41:14,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 26: [2022-11-25 21:41:14,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:41:14,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-25 21:41:14,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
26: [2022-11-25 21:41:14,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:41:14,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:41:14,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:41:14,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:41:14,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-25 21:41:14,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-25 21:41:14,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-25 21:41:14,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-25 21:41:14,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 26: [2022-11-25 21:41:14,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 26: [2022-11-25 21:41:14,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 26: [2022-11-25 21:41:14,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
23: [2022-11-25 21:41:14,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:41:14,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-25 21:41:14,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 13: [2022-11-25 21:41:14,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:41:14,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:41:14,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:41:14,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 21:41:14,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 21:41:14,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 21:41:14,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 13: [2022-11-25 21:41:14,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 13: [2022-11-25 21:41:14,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
13: [2022-11-25 21:41:14,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:41:14,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:41:14,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 21:41:14,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 21:41:14,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 13: [2022-11-25 21:41:14,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 13: [2022-11-25 21:41:14,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:41:14,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 21:41:14,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 13: [2022-11-25 21:41:14,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:41:14,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 21:41:14,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
10: [2022-11-25 21:41:14,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:41:14,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 21:41:14,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 10: [2022-11-25 21:41:14,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:41:14,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 21:41:14,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 13: [2022-11-25 21:41:14,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:41:14,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step16000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 21:41:14,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
0: successfully saved checkpoint at iteration 16000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2727.23 31: iteration 16010/ 173500 | consumed samples: 4098560 | consumed tokens: 8393850880 | elapsed time per iteration (s): 1.05 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.460850E+00 | grad norm: 0.343 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.967 | TFLOPs: 14.70 | 31: iteration 16020/ 173500 | consumed samples: 4101120 | consumed tokens: 8399093760 | elapsed time per iteration (s): 0.81 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.310946E+00 | grad norm: 0.213 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.521 | TFLOPs: 19.03 | 31: iteration 16030/ 173500 | consumed samples: 4103680 | consumed tokens: 8404336640 | elapsed time per iteration (s): 0.75 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.273293E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.569 | TFLOPs: 20.60 | 31: iteration 16040/ 173500 | consumed samples: 4106240 | consumed tokens: 8409579520 | elapsed time per iteration (s): 0.73 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.301368E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.047 | TFLOPs: 21.18 | 31: iteration 16050/ 173500 | consumed samples: 4108800 | consumed tokens: 8414822400 | elapsed time per iteration (s): 0.77 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.235238E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.939 | TFLOPs: 20.02 | 31: iteration 16060/ 173500 | consumed samples: 4111360 | consumed tokens: 8420065280 | elapsed time per iteration (s): 0.78 | learning 
rate: 1.969E-04 | global batch size: 256 | lm loss: 2.297281E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.617 | TFLOPs: 19.88 | 31: iteration 16070/ 173500 | consumed samples: 4113920 | consumed tokens: 8425308160 | elapsed time per iteration (s): 0.82 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.261746E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.706 | TFLOPs: 18.80 | 31: iteration 16080/ 173500 | consumed samples: 4116480 | consumed tokens: 8430551040 | elapsed time per iteration (s): 0.77 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.254497E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.842 | TFLOPs: 20.14 | 31: iteration 16090/ 173500 | consumed samples: 4119040 | consumed tokens: 8435793920 | elapsed time per iteration (s): 0.84 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.256864E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.113 | TFLOPs: 18.40 | 31: iteration 16100/ 173500 | consumed samples: 4121600 | consumed tokens: 8441036800 | elapsed time per iteration (s): 0.81 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.238381E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.419 | TFLOPs: 19.08 | 31: iteration 16110/ 173500 | consumed samples: 4124160 | consumed tokens: 8446279680 | elapsed time per iteration (s): 0.81 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.269029E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.883 | TFLOPs: 19.17 | 31: iteration 16120/ 173500 | consumed 
samples: 4126720 | consumed tokens: 8451522560 | elapsed time per iteration (s): 0.84 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.278125E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.160 | TFLOPs: 18.46 | 31: iteration 16130/ 173500 | consumed samples: 4129280 | consumed tokens: 8456765440 | elapsed time per iteration (s): 0.85 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.246852E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.596 | TFLOPs: 18.19 | 31: iteration 16140/ 173500 | consumed samples: 4131840 | consumed tokens: 8462008320 | elapsed time per iteration (s): 0.91 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.233942E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 281.433 | TFLOPs: 17.03 | 31: iteration 16150/ 173500 | consumed samples: 4134400 | consumed tokens: 8467251200 | elapsed time per iteration (s): 0.84 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.265855E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.376 | TFLOPs: 18.53 | 31: iteration 16160/ 173500 | consumed samples: 4136960 | consumed tokens: 8472494080 | elapsed time per iteration (s): 0.85 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.258095E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.943 | TFLOPs: 18.21 | 31: iteration 16170/ 173500 | consumed samples: 4139520 | consumed tokens: 8477736960 | elapsed time per iteration (s): 0.81 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.243370E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 316.999 | TFLOPs: 19.18 | 31: iteration 16180/ 173500 | consumed samples: 4142080 | consumed tokens: 8482979840 | elapsed time per iteration (s): 0.80 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.269801E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.530 | TFLOPs: 19.45 | 31: iteration 16190/ 173500 | consumed samples: 4144640 | consumed tokens: 8488222720 | elapsed time per iteration (s): 0.84 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.233566E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.805 | TFLOPs: 18.44 | 31: iteration 16200/ 173500 | consumed samples: 4147200 | consumed tokens: 8493465600 | elapsed time per iteration (s): 0.82 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.247029E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.585 | TFLOPs: 18.79 | 31: iteration 16210/ 173500 | consumed samples: 4149760 | consumed tokens: 8498708480 | elapsed time per iteration (s): 0.85 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.247515E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.782 | TFLOPs: 18.32 | 31: iteration 16220/ 173500 | consumed samples: 4152320 | consumed tokens: 8503951360 | elapsed time per iteration (s): 0.73 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.232099E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.963 | TFLOPs: 21.29 | 31: iteration 16230/ 173500 | consumed samples: 4154880 | consumed tokens: 8509194240 | elapsed time per iteration (s): 0.80 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 
2.245734E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.484 | TFLOPs: 19.39 | 31: iteration 16240/ 173500 | consumed samples: 4157440 | consumed tokens: 8514437120 | elapsed time per iteration (s): 0.73 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.234669E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.624 | TFLOPs: 21.27 | 31: iteration 16250/ 173500 | consumed samples: 4160000 | consumed tokens: 8519680000 | elapsed time per iteration (s): 0.80 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.259844E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.894 | TFLOPs: 19.41 | 31: iteration 16260/ 173500 | consumed samples: 4162560 | consumed tokens: 8524922880 | elapsed time per iteration (s): 0.73 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.237949E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.897 | TFLOPs: 21.23 | 31: iteration 16270/ 173500 | consumed samples: 4165120 | consumed tokens: 8530165760 | elapsed time per iteration (s): 0.78 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.228842E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.352 | TFLOPs: 19.92 | 31: iteration 16280/ 173500 | consumed samples: 4167680 | consumed tokens: 8535408640 | elapsed time per iteration (s): 0.79 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.242038E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.354 | TFLOPs: 19.56 | 31: iteration 16290/ 173500 | consumed samples: 4170240 | consumed tokens: 8540651520 | 
elapsed time per iteration (s): 0.81 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.254920E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.187 | TFLOPs: 19.07 | 31: iteration 16300/ 173500 | consumed samples: 4172800 | consumed tokens: 8545894400 | elapsed time per iteration (s): 0.76 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.237812E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.889 | TFLOPs: 20.32 | 31: iteration 16310/ 173500 | consumed samples: 4175360 | consumed tokens: 8551137280 | elapsed time per iteration (s): 0.76 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.260846E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.571 | TFLOPs: 20.48 | 31: iteration 16320/ 173500 | consumed samples: 4177920 | consumed tokens: 8556380160 | elapsed time per iteration (s): 0.76 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.263469E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.691 | TFLOPs: 20.49 | 31: iteration 16330/ 173500 | consumed samples: 4180480 | consumed tokens: 8561623040 | elapsed time per iteration (s): 0.76 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.234448E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.235 | TFLOPs: 20.34 | 31: iteration 16340/ 173500 | consumed samples: 4183040 | consumed tokens: 8566865920 | elapsed time per iteration (s): 0.76 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.253948E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.971 | TFLOPs: 
20.26 | 31: iteration 16350/ 173500 | consumed samples: 4185600 | consumed tokens: 8572108800 | elapsed time per iteration (s): 0.80 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.239037E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.142 | TFLOPs: 19.43 | 31: iteration 16360/ 173500 | consumed samples: 4188160 | consumed tokens: 8577351680 | elapsed time per iteration (s): 0.80 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.257948E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.640 | TFLOPs: 19.34 | 31: iteration 16370/ 173500 | consumed samples: 4190720 | consumed tokens: 8582594560 | elapsed time per iteration (s): 0.79 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.264622E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.026 | TFLOPs: 19.72 | 31: iteration 16380/ 173500 | consumed samples: 4193280 | consumed tokens: 8587837440 | elapsed time per iteration (s): 0.84 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.270185E+00 | grad norm: 0.223 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.823 | TFLOPs: 18.50 | 31: iteration 16390/ 173500 | consumed samples: 4195840 | consumed tokens: 8593080320 | elapsed time per iteration (s): 0.78 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.267994E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.305 | TFLOPs: 19.92 | 31: iteration 16400/ 173500 | consumed samples: 4198400 | consumed tokens: 8598323200 | elapsed time per iteration (s): 0.74 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.276714E+00 | grad norm: 0.265 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.444 | TFLOPs: 20.96 | 31: iteration 16410/ 173500 | consumed samples: 4200960 | consumed tokens: 8603566080 | elapsed time per iteration (s): 0.82 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.249360E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.798 | TFLOPs: 18.80 | 31: iteration 16420/ 173500 | consumed samples: 4203520 | consumed tokens: 8608808960 | elapsed time per iteration (s): 0.73 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.243726E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.474 | TFLOPs: 21.26 | 31: iteration 16430/ 173500 | consumed samples: 4206080 | consumed tokens: 8614051840 | elapsed time per iteration (s): 0.78 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.254556E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.230 | TFLOPs: 19.92 | 31: iteration 16440/ 173500 | consumed samples: 4208640 | consumed tokens: 8619294720 | elapsed time per iteration (s): 0.76 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.245181E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.266 | TFLOPs: 20.40 | 31: iteration 16450/ 173500 | consumed samples: 4211200 | consumed tokens: 8624537600 | elapsed time per iteration (s): 0.78 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.233725E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.910 | TFLOPs: 19.96 | 31: iteration 16460/ 173500 | consumed samples: 4213760 | consumed tokens: 8629780480 | elapsed time per iteration (s): 0.81 | learning rate: 1.968E-04 
| global batch size: 256 | lm loss: 2.237379E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.503 | TFLOPs: 19.21 | 31: iteration 16470/ 173500 | consumed samples: 4216320 | consumed tokens: 8635023360 | elapsed time per iteration (s): 0.77 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.251797E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.886 | TFLOPs: 20.14 | 31: iteration 16480/ 173500 | consumed samples: 4218880 | consumed tokens: 8640266240 | elapsed time per iteration (s): 0.80 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.268519E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.967 | TFLOPs: 19.30 | 31: iteration 16490/ 173500 | consumed samples: 4221440 | consumed tokens: 8645509120 | elapsed time per iteration (s): 0.79 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.233178E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.157 | TFLOPs: 19.61 | 31: iteration 16500/ 173500 | consumed samples: 4224000 | consumed tokens: 8650752000 | elapsed time per iteration (s): 0.79 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.258714E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.797 | TFLOPs: 19.53 | 31: iteration 16510/ 173500 | consumed samples: 4226560 | consumed tokens: 8655994880 | elapsed time per iteration (s): 0.80 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.238248E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.356 | TFLOPs: 19.32 | 31: iteration 16520/ 173500 | consumed samples: 4229120 | 
consumed tokens: 8661237760 | elapsed time per iteration (s): 0.76 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.256336E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.928 | TFLOPs: 20.26 | 31: iteration 16530/ 173500 | consumed samples: 4231680 | consumed tokens: 8666480640 | elapsed time per iteration (s): 0.81 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.250035E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.708 | TFLOPs: 19.22 | 31: iteration 16540/ 173500 | consumed samples: 4234240 | consumed tokens: 8671723520 | elapsed time per iteration (s): 0.78 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.239154E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.366 | TFLOPs: 19.80 | 31: iteration 16550/ 173500 | consumed samples: 4236800 | consumed tokens: 8676966400 | elapsed time per iteration (s): 0.81 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.259280E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.069 | TFLOPs: 19.06 | 31: iteration 16560/ 173500 | consumed samples: 4239360 | consumed tokens: 8682209280 | elapsed time per iteration (s): 0.77 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.219560E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.107 | TFLOPs: 20.21 | 31: iteration 16570/ 173500 | consumed samples: 4241920 | consumed tokens: 8687452160 | elapsed time per iteration (s): 0.76 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.225247E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 334.941 | TFLOPs: 20.26 | 31: iteration 16580/ 173500 | consumed samples: 4244480 | consumed tokens: 8692695040 | elapsed time per iteration (s): 0.74 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.259841E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.322 | TFLOPs: 20.95 | 31: iteration 16590/ 173500 | consumed samples: 4247040 | consumed tokens: 8697937920 | elapsed time per iteration (s): 0.75 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.261670E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.455 | TFLOPs: 20.78 | 31: iteration 16600/ 173500 | consumed samples: 4249600 | consumed tokens: 8703180800 | elapsed time per iteration (s): 0.75 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.209487E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.455 | TFLOPs: 20.60 | 31: iteration 16610/ 173500 | consumed samples: 4252160 | consumed tokens: 8708423680 | elapsed time per iteration (s): 0.79 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.265172E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.484 | TFLOPs: 19.51 | 31: iteration 16620/ 173500 | consumed samples: 4254720 | consumed tokens: 8713666560 | elapsed time per iteration (s): 0.77 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.246244E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.210 | TFLOPs: 20.10 | 31: iteration 16630/ 173500 | consumed samples: 4257280 | consumed tokens: 8718909440 | elapsed time per iteration (s): 0.74 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.248411E+00 | grad norm: 
0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.404 | TFLOPs: 21.02 | 31: iteration 16640/ 173500 | consumed samples: 4259840 | consumed tokens: 8724152320 | elapsed time per iteration (s): 0.77 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.240489E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.007 | TFLOPs: 20.15 | 31: iteration 16650/ 173500 | consumed samples: 4262400 | consumed tokens: 8729395200 | elapsed time per iteration (s): 0.73 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.229437E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.860 | TFLOPs: 21.17 | 31: iteration 16660/ 173500 | consumed samples: 4264960 | consumed tokens: 8734638080 | elapsed time per iteration (s): 0.80 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.239451E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.918 | TFLOPs: 19.35 | 31: iteration 16670/ 173500 | consumed samples: 4267520 | consumed tokens: 8739880960 | elapsed time per iteration (s): 0.71 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.222026E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 362.286 | TFLOPs: 21.92 | 31: iteration 16680/ 173500 | consumed samples: 4270080 | consumed tokens: 8745123840 | elapsed time per iteration (s): 0.75 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.247649E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.930 | TFLOPs: 20.56 | 31: iteration 16690/ 173500 | consumed samples: 4272640 | consumed tokens: 8750366720 | elapsed time per iteration (s): 
0.76 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.216564E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.731 | TFLOPs: 20.37 | 31: iteration 16700/ 173500 | consumed samples: 4275200 | consumed tokens: 8755609600 | elapsed time per iteration (s): 0.76 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.252018E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.033 | TFLOPs: 20.45 | 31: iteration 16710/ 173500 | consumed samples: 4277760 | consumed tokens: 8760852480 | elapsed time per iteration (s): 0.77 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.245700E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.736 | TFLOPs: 20.01 | 31: iteration 16720/ 173500 | consumed samples: 4280320 | consumed tokens: 8766095360 | elapsed time per iteration (s): 0.84 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.193625E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.114 | TFLOPs: 18.46 | 31: iteration 16730/ 173500 | consumed samples: 4282880 | consumed tokens: 8771338240 | elapsed time per iteration (s): 0.78 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.272058E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.679 | TFLOPs: 19.94 | 31: iteration 16740/ 173500 | consumed samples: 4285440 | consumed tokens: 8776581120 | elapsed time per iteration (s): 0.75 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.243751E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.554 | TFLOPs: 20.78 | 31: iteration 16750/ 
173500 | consumed samples: 4288000 | consumed tokens: 8781824000 | elapsed time per iteration (s): 0.82 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.250188E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.288 | TFLOPs: 18.95 | 31: iteration 16760/ 173500 | consumed samples: 4290560 | consumed tokens: 8787066880 | elapsed time per iteration (s): 0.74 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.247716E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.466 | TFLOPs: 20.84 | 31: iteration 16770/ 173500 | consumed samples: 4293120 | consumed tokens: 8792309760 | elapsed time per iteration (s): 0.77 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.220293E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.418 | TFLOPs: 20.23 | 31: iteration 16780/ 173500 | consumed samples: 4295680 | consumed tokens: 8797552640 | elapsed time per iteration (s): 0.90 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.243030E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.286 | TFLOPs: 17.20 | 31: iteration 16790/ 173500 | consumed samples: 4298240 | consumed tokens: 8802795520 | elapsed time per iteration (s): 0.75 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.215833E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.851 | TFLOPs: 20.56 | 31: iteration 16800/ 173500 | consumed samples: 4300800 | consumed tokens: 8808038400 | elapsed time per iteration (s): 0.78 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.241750E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 327.229 | TFLOPs: 19.80 | 31: iteration 16810/ 173500 | consumed samples: 4303360 | consumed tokens: 8813281280 | elapsed time per iteration (s): 0.80 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.217447E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.841 | TFLOPs: 19.29 | 31: iteration 16820/ 173500 | consumed samples: 4305920 | consumed tokens: 8818524160 | elapsed time per iteration (s): 0.75 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.215246E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.362 | TFLOPs: 20.71 | 31: iteration 16830/ 173500 | consumed samples: 4308480 | consumed tokens: 8823767040 | elapsed time per iteration (s): 0.78 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.236403E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.454 | TFLOPs: 19.75 | 31: iteration 16840/ 173500 | consumed samples: 4311040 | consumed tokens: 8829009920 | elapsed time per iteration (s): 0.77 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.215844E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.990 | TFLOPs: 20.08 | 31: iteration 16850/ 173500 | consumed samples: 4313600 | consumed tokens: 8834252800 | elapsed time per iteration (s): 0.75 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.234620E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.168 | TFLOPs: 20.70 | 31: iteration 16860/ 173500 | consumed samples: 4316160 | consumed tokens: 8839495680 | elapsed time per iteration (s): 0.81 | learning rate: 1.966E-04 | global batch size: 256 | 
lm loss: 2.246607E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.420 | TFLOPs: 19.02 | 31: iteration 16870/ 173500 | consumed samples: 4318720 | consumed tokens: 8844738560 | elapsed time per iteration (s): 0.78 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.252390E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.382 | TFLOPs: 19.93 | 31: iteration 16880/ 173500 | consumed samples: 4321280 | consumed tokens: 8849981440 | elapsed time per iteration (s): 0.75 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.215208E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.317 | TFLOPs: 20.65 | 31: iteration 16890/ 173500 | consumed samples: 4323840 | consumed tokens: 8855224320 | elapsed time per iteration (s): 0.79 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.248983E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.975 | TFLOPs: 19.72 | 31: iteration 16900/ 173500 | consumed samples: 4326400 | consumed tokens: 8860467200 | elapsed time per iteration (s): 0.77 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.223653E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.315 | TFLOPs: 20.23 | 31: iteration 16910/ 173500 | consumed samples: 4328960 | consumed tokens: 8865710080 | elapsed time per iteration (s): 0.73 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.216532E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.146 | TFLOPs: 21.24 | 31: iteration 16920/ 173500 | consumed samples: 4331520 | consumed tokens: 
8870952960 | elapsed time per iteration (s): 0.80 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.215071E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.346 | TFLOPs: 19.44 | 31: iteration 16930/ 173500 | consumed samples: 4334080 | consumed tokens: 8876195840 | elapsed time per iteration (s): 0.78 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.218909E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.240 | TFLOPs: 19.98 | 31: iteration 16940/ 173500 | consumed samples: 4336640 | consumed tokens: 8881438720 | elapsed time per iteration (s): 0.79 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.235097E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.014 | TFLOPs: 19.54 | 31: iteration 16950/ 173500 | consumed samples: 4339200 | consumed tokens: 8886681600 | elapsed time per iteration (s): 0.73 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.215320E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.972 | TFLOPs: 21.17 | 31: iteration 16960/ 173500 | consumed samples: 4341760 | consumed tokens: 8891924480 | elapsed time per iteration (s): 0.78 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.232893E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.346 | TFLOPs: 19.92 | 31: iteration 16970/ 173500 | consumed samples: 4344320 | consumed tokens: 8897167360 | elapsed time per iteration (s): 0.84 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.232345E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
303.327 | TFLOPs: 18.35 | 31: iteration 16980/ 173500 | consumed samples: 4346880 | consumed tokens: 8902410240 | elapsed time per iteration (s): 0.86 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.262066E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.692 | TFLOPs: 18.07 | 31: iteration 16990/ 173500 | consumed samples: 4349440 | consumed tokens: 8907653120 | elapsed time per iteration (s): 0.73 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.252157E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.133 | TFLOPs: 21.18 | 31: iteration 17000/ 173500 | consumed samples: 4352000 | consumed tokens: 8912896000 | elapsed time per iteration (s): 0.79 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.248174E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.141 | TFLOPs: 19.49 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 17000 | lm loss value: 2.232138E+00 | lm loss PPL: 9.319772E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 17000 to checkpoints_1b1long 0: [2022-11-25 21:54:18,540] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step17000 is begin to save! 0: [2022-11-25 21:54:18,553] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_01-model_00-model_states.pt... 0: [2022-11-25 21:54:18,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_01-model_00-model_states.pt. 
0: [2022-11-25 21:54:18,781] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_03-model_00-model_states.pt... 0: [2022-11-25 21:54:18,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_03-model_00-model_states.pt. 0: [2022-11-25 21:54:18,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_04-model_00-model_states.pt... 0: [2022-11-25 21:54:18,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_04-model_00-model_states.pt. 0: [2022-11-25 21:54:18,951] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_05-model_00-model_states.pt... 0: [2022-11-25 21:54:19,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_05-model_00-model_states.pt. 0: [2022-11-25 21:54:19,036] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_06-model_00-model_states.pt... 0: [2022-11-25 21:54:19,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_06-model_00-model_states.pt. 0: [2022-11-25 21:54:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_07-model_00-model_states.pt... 0: [2022-11-25 21:54:19,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_07-model_00-model_states.pt. 0: [2022-11-25 21:54:19,190] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_08-model_00-model_states.pt... 0: [2022-11-25 21:54:19,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_08-model_00-model_states.pt. 
0: [2022-11-25 21:54:19,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_09-model_00-model_states.pt... 0: [2022-11-25 21:54:19,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_09-model_00-model_states.pt. 0: [2022-11-25 21:54:19,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_10-model_00-model_states.pt... 0: [2022-11-25 21:54:19,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_10-model_00-model_states.pt. 0: [2022-11-25 21:54:19,430] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_11-model_00-model_states.pt... 0: [2022-11-25 21:54:19,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_11-model_00-model_states.pt. 0: [2022-11-25 21:54:19,503] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_12-model_00-model_states.pt... 0: [2022-11-25 21:54:19,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_12-model_00-model_states.pt. 0: [2022-11-25 21:54:19,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_13-model_00-model_states.pt... 0: [2022-11-25 21:54:19,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_13-model_00-model_states.pt. 0: [2022-11-25 21:54:19,654] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_14-model_00-model_states.pt... 0: [2022-11-25 21:54:19,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_14-model_00-model_states.pt. 
0: [2022-11-25 21:54:19,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_15-model_00-model_states.pt... 0: [2022-11-25 21:54:19,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_15-model_00-model_states.pt. 0: [2022-11-25 21:54:19,803] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_16-model_00-model_states.pt... 0: [2022-11-25 21:54:19,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_16-model_00-model_states.pt. 0: [2022-11-25 21:54:19,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_17-model_00-model_states.pt... 0: [2022-11-25 21:54:19,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_17-model_00-model_states.pt. 0: [2022-11-25 21:54:19,951] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_18-model_00-model_states.pt... 0: [2022-11-25 21:54:20,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_18-model_00-model_states.pt. 0: [2022-11-25 21:54:20,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_19-model_00-model_states.pt... 0: [2022-11-25 21:54:20,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_19-model_00-model_states.pt. 0: [2022-11-25 21:54:20,096] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_20-model_00-model_states.pt... 0: [2022-11-25 21:54:20,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_20-model_00-model_states.pt. 
0: [2022-11-25 21:54:20,173] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_21-model_00-model_states.pt... 0: [2022-11-25 21:54:20,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_21-model_00-model_states.pt. 0: [2022-11-25 21:54:20,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_22-model_00-model_states.pt... 0: [2022-11-25 21:54:20,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_22-model_00-model_states.pt. 0: [2022-11-25 21:54:20,318] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_23-model_00-model_states.pt... 0: [2022-11-25 21:54:20,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_23-model_00-model_states.pt. 0: [2022-11-25 21:54:20,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_24-model_00-model_states.pt... 0: [2022-11-25 21:54:20,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_24-model_00-model_states.pt. 0: [2022-11-25 21:54:20,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_25-model_00-model_states.pt... 0: [2022-11-25 21:54:20,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_25-model_00-model_states.pt. 0: [2022-11-25 21:54:20,542] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_26-model_00-model_states.pt... 0: [2022-11-25 21:54:20,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_26-model_00-model_states.pt. 
0: [2022-11-25 21:54:20,616] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_27-model_00-model_states.pt... 0: [2022-11-25 21:54:20,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_27-model_00-model_states.pt. 0: [2022-11-25 21:54:20,687] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_28-model_00-model_states.pt... 0: [2022-11-25 21:54:20,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_28-model_00-model_states.pt. 0: [2022-11-25 21:54:20,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/layer_30-model_00-model_states.pt... 0: [2022-11-25 21:54:20,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/layer_30-model_00-model_states.pt. 0: [2022-11-25 21:54:20,764] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step17000/mp_rank_00_model_states.pt 0: [2022-11-25 21:54:20,764] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/mp_rank_00_model_states.pt... 0: [2022-11-25 21:54:20,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/mp_rank_00_model_states.pt. 0: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 
30: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 
26: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
6: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
4: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
16: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
12: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 
25: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 
24: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 
17: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 
19: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
5: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 
8: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 
16: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 
3: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 
25: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 
28: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 
29: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 
27: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
10: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 
25: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 
29: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 
5: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 16: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 23: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 22: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 
17: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 26: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 28: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 31: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 17: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 24: [2022-11-25 21:54:20,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 
8: [2022-11-25 21:54:20,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:54:20,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 21:54:20,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 6: [2022-11-25 21:54:20,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:54:20,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 21:54:20,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 15: [2022-11-25 21:54:20,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:54:20,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:54:20,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 21:54:20,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 19: [2022-11-25 21:54:20,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-25 21:54:20,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
9: [2022-11-25 21:54:20,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:54:20,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 21:54:20,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 0: [2022-11-25 21:54:20,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:54:20,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 21:54:20,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 19: [2022-11-25 21:54:20,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:54:20,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-25 21:54:20,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 13: [2022-11-25 21:54:20,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:54:20,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 21:54:20,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
21: [2022-11-25 21:54:20,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:54:20,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-25 21:54:20,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 24: [2022-11-25 21:54:20,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:54:20,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:54:20,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-25 21:54:20,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 24: [2022-11-25 21:54:20,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-25 21:54:20,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 20: [2022-11-25 21:54:20,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:54:20,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-25 21:54:20,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
26: [2022-11-25 21:54:20,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:54:20,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-25 21:54:20,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 20: [2022-11-25 21:54:20,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:54:20,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-25 21:54:20,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 18: [2022-11-25 21:54:20,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:54:20,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:54:20,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-25 21:54:20,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-25 21:54:20,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
15: [2022-11-25 21:54:20,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:54:20,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 15: [2022-11-25 21:54:20,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 6: [2022-11-25 21:54:20,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:54:20,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:54:20,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 6: [2022-11-25 21:54:20,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 9: [2022-11-25 21:54:20,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 6: [2022-11-25 21:54:20,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 9: [2022-11-25 21:54:20,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 29: [2022-11-25 21:54:20,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 
29: [2022-11-25 21:54:20,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 14: [2022-11-25 21:54:20,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:54:20,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 14: [2022-11-25 21:54:20,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 21:54:20,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 8: [2022-11-25 21:54:20,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:54:20,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 21:54:20,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 27: [2022-11-25 21:54:20,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:54:20,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-25 21:54:20,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 8: [2022-11-25 21:54:20,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-25 21:54:20,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 21:54:20,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 7: [2022-11-25 21:54:20,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:54:20,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 21:54:20,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 9: [2022-11-25 21:54:20,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:54:20,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 21:54:20,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 7: [2022-11-25 21:54:20,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:54:20,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 21:54:20,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 11: [2022-11-25 21:54:20,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-25 21:54:20,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 21:54:20,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 13: [2022-11-25 21:54:20,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:54:20,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 21:54:20,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 20: [2022-11-25 21:54:20,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:54:20,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-25 21:54:20,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 27: [2022-11-25 21:54:20,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:54:20,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-25 21:54:20,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 0: [2022-11-25 21:54:20,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-25 21:54:20,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:54:20,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 21:54:20,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 16: [2022-11-25 21:54:20,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:54:20,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-25 21:54:20,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 18: [2022-11-25 21:54:20,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:54:20,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:54:20,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 3: [2022-11-25 21:54:20,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 18: [2022-11-25 21:54:20,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 3: [2022-11-25 21:54:20,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
24: [2022-11-25 21:54:20,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:54:20,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:54:20,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 21:54:20,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 24: [2022-11-25 21:54:20,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-25 21:54:20,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 21: [2022-11-25 21:54:20,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:54:20,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-25 21:54:20,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 14: [2022-11-25 21:54:20,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:54:20,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 21:54:20,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
19: [2022-11-25 21:54:20,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:54:20,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-25 21:54:20,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 24: [2022-11-25 21:54:20,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:54:20,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:54:20,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:54:20,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 26: [2022-11-25 21:54:20,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-25 21:54:20,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 2: [2022-11-25 21:54:20,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:54:20,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
2: [2022-11-25 21:54:20,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 26: [2022-11-25 21:54:20,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 26: [2022-11-25 21:54:20,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 2: [2022-11-25 21:54:20,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 10: [2022-11-25 21:54:20,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:54:20,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:54:20,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 21:54:20,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 21:54:20,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 10: [2022-11-25 21:54:20,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 30: [2022-11-25 21:54:20,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-25 21:54:20,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-25 21:54:20,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 19: [2022-11-25 21:54:20,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:54:20,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-25 21:54:20,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 11: [2022-11-25 21:54:20,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:54:20,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 21:54:20,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 20: [2022-11-25 21:54:20,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:54:20,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-25 21:54:20,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 11: [2022-11-25 21:54:20,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-25 21:54:20,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 21:54:20,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 18: [2022-11-25 21:54:20,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:54:20,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:54:20,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 30: [2022-11-25 21:54:20,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 18: [2022-11-25 21:54:20,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 30: [2022-11-25 21:54:20,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 6: [2022-11-25 21:54:20,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:54:20,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 21:54:20,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 2: [2022-11-25 21:54:20,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
25: [2022-11-25 21:54:20,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:54:20,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:54:20,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 25: [2022-11-25 21:54:20,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-25 21:54:20,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-25 21:54:20,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 25: [2022-11-25 21:54:20,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 16: [2022-11-25 21:54:20,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:54:20,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 16: [2022-11-25 21:54:20,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-25 21:54:20,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 26: [2022-11-25 21:54:20,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
26: [2022-11-25 21:54:20,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-25 21:54:20,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 3: [2022-11-25 21:54:20,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:54:20,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 21:54:20,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 22: [2022-11-25 21:54:20,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:54:20,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:54:20,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-25 21:54:20,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-25 21:54:20,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-25 21:54:20,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-25 21:54:20,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 22: [2022-11-25 21:54:20,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 22: [2022-11-25 21:54:20,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 21: [2022-11-25 21:54:20,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:54:20,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:54:20,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-25 21:54:20,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-25 21:54:20,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 30: [2022-11-25 21:54:20,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
21: [2022-11-25 21:54:20,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 30: [2022-11-25 21:54:20,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-25 21:54:20,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 14: [2022-11-25 21:54:20,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:54:20,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 21:54:20,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 7: [2022-11-25 21:54:20,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:54:20,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 21:54:20,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 5: [2022-11-25 21:54:20,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:54:20,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 21:54:20,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
5: [2022-11-25 21:54:20,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:54:20,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 29: [2022-11-25 21:54:20,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:54:20,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 29: [2022-11-25 21:54:20,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 5: [2022-11-25 21:54:20,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:54:20,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 5: [2022-11-25 21:54:20,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 21:54:20,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 29: [2022-11-25 21:54:20,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:54:20,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-25 21:54:20,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
10: [2022-11-25 21:54:20,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:54:20,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 21:54:20,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 24: [2022-11-25 21:54:20,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:54:20,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-25 21:54:20,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 6: [2022-11-25 21:54:20,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:54:20,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 21:54:20,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 17: [2022-11-25 21:54:20,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-25 21:54:20,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-25 21:54:20,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:54:20,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 17: [2022-11-25 21:54:20,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-25 21:54:20,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 9: [2022-11-25 21:54:20,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:54:20,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:54:20,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 0: [2022-11-25 21:54:20,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 9: [2022-11-25 21:54:20,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 0: [2022-11-25 21:54:20,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 10: [2022-11-25 21:54:20,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-25 21:54:20,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 21:54:20,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 27: [2022-11-25 21:54:20,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:54:20,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:54:20,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-25 21:54:20,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-25 21:54:20,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 27: [2022-11-25 21:54:20,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 31: [2022-11-25 21:54:20,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-25 21:54:20,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-25 21:54:20,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 
31: [2022-11-25 21:54:20,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-25 21:54:20,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-25 21:54:20,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-25 21:54:20,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 31: [2022-11-25 21:54:20,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 13: [2022-11-25 21:54:20,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 31: [2022-11-25 21:54:20,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 13: [2022-11-25 21:54:20,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 21:54:20,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 0: [2022-11-25 21:54:20,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 21:54:20,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 12: [2022-11-25 21:54:20,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-25 21:54:20,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:54:20,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:54:20,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:54:20,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 21:54:20,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 21:54:20,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 21:54:20,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 21:54:20,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 12: [2022-11-25 21:54:20,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 12: [2022-11-25 21:54:20,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 12: [2022-11-25 21:54:20,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 11: [2022-11-25 21:54:20,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-25 21:54:20,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 21:54:20,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 7: [2022-11-25 21:54:20,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:54:20,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 21:54:20,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 15: [2022-11-25 21:54:20,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:54:20,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 5: [2022-11-25 21:54:20,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:54:20,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 5: [2022-11-25 21:54:20,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 21:54:20,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 23: [2022-11-25 21:54:20,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
23: [2022-11-25 21:54:20,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:54:20,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:54:20,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-25 21:54:20,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:54:20,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-25 21:54:20,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-25 21:54:20,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 23: [2022-11-25 21:54:20,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 23: [2022-11-25 21:54:20,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 23: [2022-11-25 21:54:20,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-25 21:54:20,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 4: [2022-11-25 21:54:20,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-25 21:54:20,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:54:20,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:54:20,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:54:20,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 21:54:20,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 21:54:20,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 21:54:20,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 21:54:20,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 4: [2022-11-25 21:54:20,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 4: [2022-11-25 21:54:20,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 4: [2022-11-25 21:54:20,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 14: [2022-11-25 21:54:20,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-25 21:54:20,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 21:54:20,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 3: [2022-11-25 21:54:20,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:54:20,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 21:54:20,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 1: [2022-11-25 21:54:20,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:54:20,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:54:20,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:54:20,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-25 21:54:20,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 21:54:20,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 21:54:20,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 21:54:20,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 21:54:20,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 1: [2022-11-25 21:54:20,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 1: [2022-11-25 21:54:20,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 1: [2022-11-25 21:54:20,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 13: [2022-11-25 21:54:20,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:54:20,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 21:54:20,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 31: [2022-11-25 21:54:20,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
31: [2022-11-25 21:54:20,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-25 21:54:20,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 16: [2022-11-25 21:54:20,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:54:20,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:54:20,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-25 21:54:20,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 25: [2022-11-25 21:54:20,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-25 21:54:20,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 17: [2022-11-25 21:54:20,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:54:20,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-25 21:54:20,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 19: [2022-11-25 21:54:20,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-25 21:54:20,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-25 21:54:20,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 1: [2022-11-25 21:54:20,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:54:20,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 21:54:20,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 2: [2022-11-25 21:54:20,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:54:20,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 21:54:20,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 22: [2022-11-25 21:54:20,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:54:20,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-25 21:54:20,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 20: [2022-11-25 21:54:20,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 
20: [2022-11-25 21:54:20,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-25 21:54:20,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 8: [2022-11-25 21:54:20,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:54:20,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 21:54:20,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 29: [2022-11-25 21:54:20,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:54:20,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-25 21:54:20,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 3: [2022-11-25 21:54:20,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:54:20,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 21:54:20,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 29: [2022-11-25 21:54:20,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-25 21:54:20,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-25 21:54:20,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 10: [2022-11-25 21:54:21,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:54:21,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 21:54:21,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 9: [2022-11-25 21:54:21,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:54:21,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 21:54:21,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 0: [2022-11-25 21:54:21,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:54:21,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 21:54:21,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 6: [2022-11-25 21:54:21,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-25 21:54:21,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 21:54:21,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 8: [2022-11-25 21:54:21,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:54:21,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 21:54:21,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 15: [2022-11-25 21:54:21,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:54:21,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 21:54:21,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 4: [2022-11-25 21:54:21,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:54:21,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 21:54:21,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 12: [2022-11-25 21:54:21,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-25 21:54:21,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 21:54:21,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 27: [2022-11-25 21:54:21,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:54:21,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:54:21,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 3: [2022-11-25 21:54:21,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 21:54:21,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 27: [2022-11-25 21:54:21,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 18: [2022-11-25 21:54:21,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:54:21,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-25 21:54:21,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 14: [2022-11-25 21:54:21,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-25 21:54:21,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 21:54:21,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 30: [2022-11-25 21:54:21,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:54:21,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 23: [2022-11-25 21:54:21,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:54:21,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 23: [2022-11-25 21:54:21,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-25 21:54:21,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 25: [2022-11-25 21:54:21,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:54:21,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 7: [2022-11-25 21:54:21,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:54:21,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
7: [2022-11-25 21:54:21,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 24: [2022-11-25 21:54:21,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:54:21,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 24: [2022-11-25 21:54:21,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-25 21:54:21,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 13: [2022-11-25 21:54:21,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:54:21,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 21:54:21,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 5: [2022-11-25 21:54:21,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:54:21,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 21:54:21,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 11: [2022-11-25 21:54:21,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
21: [2022-11-25 21:54:21,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:54:21,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 21: [2022-11-25 21:54:21,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 11: [2022-11-25 21:54:21,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 21: [2022-11-25 21:54:21,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 26: [2022-11-25 21:54:21,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:54:21,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 17: [2022-11-25 21:54:21,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:54:21,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 31: [2022-11-25 21:54:21,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
17: [2022-11-25 21:54:21,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 31: [2022-11-25 21:54:21,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 17: [2022-11-25 21:54:21,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 31: [2022-11-25 21:54:21,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 19: [2022-11-25 21:54:21,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:54:21,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-25 21:54:21,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 1: [2022-11-25 21:54:21,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:54:21,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 21:54:21,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 2: [2022-11-25 21:54:21,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:54:21,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
2: [2022-11-25 21:54:21,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 16: [2022-11-25 21:54:21,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-25 21:54:21,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 2: [2022-11-25 21:54:21,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 22: [2022-11-25 21:54:21,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:54:21,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-25 21:54:21,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 10: [2022-11-25 21:54:21,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:54:21,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:54:21,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 21:54:21,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
15: [2022-11-25 21:54:21,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 21:54:21,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 29: [2022-11-25 21:54:21,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:54:21,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-25 21:54:21,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 20: [2022-11-25 21:54:21,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:54:21,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-25 21:54:21,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 6: [2022-11-25 21:54:21,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:54:21,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 21:54:21,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 9: [2022-11-25 21:54:21,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-25 21:54:21,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 21:54:21,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 4: [2022-11-25 21:54:21,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:54:21,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 21:54:21,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 18: [2022-11-25 21:54:21,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:54:21,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:54:21,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 26: [2022-11-25 21:54:21,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 18: [2022-11-25 21:54:21,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 26: [2022-11-25 21:54:21,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 3: [2022-11-25 21:54:21,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-25 21:54:21,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 21:54:21,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 12: [2022-11-25 21:54:21,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:54:21,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 21:54:21,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 28: [2022-11-25 21:54:21,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:54:21,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-25 21:54:21,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 21: [2022-11-25 21:54:21,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:54:21,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-25 21:54:21,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 0: [2022-11-25 21:54:21,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-25 21:54:21,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 21:54:21,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 27: [2022-11-25 21:54:21,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:54:21,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-25 21:54:21,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 30: [2022-11-25 21:54:21,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:54:21,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-25 21:54:21,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 24: [2022-11-25 21:54:21,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:54:21,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-25 21:54:21,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 8: [2022-11-25 21:54:21,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-25 21:54:21,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 21:54:21,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 11: [2022-11-25 21:54:21,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:54:21,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 21:54:21,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 23: [2022-11-25 21:54:21,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:54:21,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-25 21:54:21,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 5: [2022-11-25 21:54:21,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:54:21,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 21:54:21,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 31: [2022-11-25 21:54:21,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
31: [2022-11-25 21:54:21,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-25 21:54:21,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 7: [2022-11-25 21:54:21,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:54:21,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 21:54:21,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 17: [2022-11-25 21:54:21,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:54:21,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-25 21:54:21,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 25: [2022-11-25 21:54:21,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:54:21,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 14: [2022-11-25 21:54:21,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:54:21,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
14: [2022-11-25 21:54:21,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 21:54:21,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 1: [2022-11-25 21:54:21,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:54:21,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 21:54:21,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 2: [2022-11-25 21:54:21,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:54:21,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 21:54:21,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 16: [2022-11-25 21:54:21,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:54:21,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-25 21:54:21,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 13: [2022-11-25 21:54:21,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-25 21:54:21,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 21:54:21,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 22: [2022-11-25 21:54:21,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:54:21,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-25 21:54:21,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 10: [2022-11-25 21:54:21,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:54:21,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 21:54:21,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 9: [2022-11-25 21:54:21,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:54:21,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 21:54:21,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 15: [2022-11-25 21:54:21,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-25 21:54:21,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 21:54:21,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 19: [2022-11-25 21:54:21,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:54:21,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-25 21:54:21,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 20: [2022-11-25 21:54:21,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-25 21:54:21,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-25 21:54:21,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 30: [2022-11-25 21:54:21,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:54:21,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-25 21:54:21,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 28: [2022-11-25 21:54:21,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
28: [2022-11-25 21:54:21,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-25 21:54:21,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 18: [2022-11-25 21:54:21,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:54:21,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-25 21:54:21,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 29: [2022-11-25 21:54:21,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-25 21:54:21,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-25 21:54:21,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 4: [2022-11-25 21:54:21,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:54:21,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 21:54:21,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 3: [2022-11-25 21:54:21,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-25 21:54:21,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 21:54:21,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 12: [2022-11-25 21:54:21,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:54:21,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 21:54:21,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 6: [2022-11-25 21:54:21,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:54:21,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 21:54:21,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 0: [2022-11-25 21:54:21,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:54:21,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 21:54:21,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 26: [2022-11-25 21:54:21,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-25 21:54:21,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-25 21:54:21,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 21: [2022-11-25 21:54:21,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:54:21,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-25 21:54:21,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 27: [2022-11-25 21:54:21,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:54:21,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-25 21:54:21,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 23: [2022-11-25 21:54:21,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:54:21,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-25 21:54:21,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 11: [2022-11-25 21:54:21,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-25 21:54:21,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 21:54:21,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 24: [2022-11-25 21:54:21,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:54:21,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-25 21:54:21,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 5: [2022-11-25 21:54:21,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:54:21,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 21:54:21,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 8: [2022-11-25 21:54:21,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:54:21,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 21:54:21,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 1: [2022-11-25 21:54:21,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-25 21:54:21,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 7: [2022-11-25 21:54:21,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:54:21,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 7: [2022-11-25 21:54:21,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 21:54:21,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 31: [2022-11-25 21:54:21,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-25 21:54:21,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-25 21:54:21,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 14: [2022-11-25 21:54:21,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:54:21,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 21:54:21,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 10: [2022-11-25 21:54:21,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-25 21:54:21,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 21:54:21,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 25: [2022-11-25 21:54:21,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:54:21,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-25 21:54:21,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 2: [2022-11-25 21:54:21,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:54:21,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 21:54:21,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 9: [2022-11-25 21:54:21,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:54:21,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 15: [2022-11-25 21:54:21,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:54:21,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
15: [2022-11-25 21:54:21,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 20: [2022-11-25 21:54:21,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:54:21,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 20: [2022-11-25 21:54:21,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-25 21:54:21,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 0: [2022-11-25 21:54:21,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:54:21,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 21:54:21,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 29: [2022-11-25 21:54:21,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:54:21,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-25 21:54:21,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 29: [2022-11-25 21:54:21,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-25 21:54:21,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 13: [2022-11-25 21:54:21,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 18: [2022-11-25 21:54:21,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-25 21:54:21,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-25 21:54:21,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 19: [2022-11-25 21:54:21,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:54:21,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 30: [2022-11-25 21:54:21,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 19: [2022-11-25 21:54:21,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
30: [2022-11-25 21:54:21,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 6: [2022-11-25 21:54:21,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 30: [2022-11-25 21:54:21,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 6: [2022-11-25 21:54:21,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 21:54:21,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 3: [2022-11-25 21:54:21,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:54:21,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-25 21:54:21,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 3: [2022-11-25 21:54:21,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 23: [2022-11-25 21:54:21,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 3: [2022-11-25 21:54:21,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 5: [2022-11-25 21:54:21,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-25 21:54:21,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 21:54:21,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 21: [2022-11-25 21:54:21,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-25 21:54:21,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-25 21:54:21,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 26: [2022-11-25 21:54:21,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-25 21:54:21,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-25 21:54:21,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 31: [2022-11-25 21:54:21,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-25 21:54:21,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 4: [2022-11-25 21:54:21,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-25 21:54:21,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 31: [2022-11-25 21:54:21,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 4: [2022-11-25 21:54:21,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 12: [2022-11-25 21:54:21,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:54:21,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 21:54:21,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 11: [2022-11-25 21:54:21,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:54:21,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 21:54:21,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 27: [2022-11-25 21:54:21,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-25 21:54:21,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-25 21:54:21,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
22: [2022-11-25 21:54:21,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:54:21,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-25 21:54:21,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 24: [2022-11-25 21:54:21,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-25 21:54:21,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-25 21:54:21,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 17: [2022-11-25 21:54:21,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:54:21,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-25 21:54:21,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 16: [2022-11-25 21:54:21,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:54:21,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-25 21:54:21,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
7: [2022-11-25 21:54:21,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:54:21,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 21:54:21,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 17: [2022-11-25 21:54:21,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:54:21,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-25 21:54:21,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 8: [2022-11-25 21:54:21,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:54:21,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 21:54:21,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 14: [2022-11-25 21:54:21,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:54:21,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 21:54:21,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
22: [2022-11-25 21:54:21,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-25 21:54:21,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-25 21:54:21,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 13: [2022-11-25 21:54:21,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:54:21,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 21:54:21,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 2: [2022-11-25 21:54:21,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:54:21,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 21:54:21,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 17: [2022-11-25 21:54:21,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-25 21:54:21,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-25 21:54:21,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
16: [2022-11-25 21:54:21,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:54:21,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-25 21:54:21,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 2: [2022-11-25 21:54:21,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:54:21,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 21:54:21,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 25: [2022-11-25 21:54:21,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:54:21,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-25 21:54:21,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 16: [2022-11-25 21:54:21,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-25 21:54:21,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-25 21:54:21,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
25: [2022-11-25 21:54:21,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-25 21:54:21,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-25 21:54:21,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 28: [2022-11-25 21:54:21,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:54:21,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:54:21,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-25 21:54:21,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-25 21:54:21,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 28: [2022-11-25 21:54:21,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 28: [2022-11-25 21:54:21,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:54:21,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-25 21:54:21,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
28: [2022-11-25 21:54:21,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 0: successfully saved checkpoint at iteration 17000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2627.45 28: [2022-11-25 21:54:21,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-25 21:54:21,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 28: [2022-11-25 21:54:21,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:54:21,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-25 21:54:21,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 28: [2022-11-25 21:54:21,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-25 21:54:21,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step17000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-25 21:54:21,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
31: iteration 17010/ 173500 | consumed samples: 4354560 | consumed tokens: 8918138880 | elapsed time per iteration (s): 1.06 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.244783E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.531 | TFLOPs: 14.55 | 31: iteration 17020/ 173500 | consumed samples: 4357120 | consumed tokens: 8923381760 | elapsed time per iteration (s): 0.81 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.255454E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.856 | TFLOPs: 19.23 | 31: iteration 17030/ 173500 | consumed samples: 4359680 | consumed tokens: 8928624640 | elapsed time per iteration (s): 0.83 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.252186E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.231 | TFLOPs: 18.71 | 31: iteration 17040/ 173500 | consumed samples: 4362240 | consumed tokens: 8933867520 | elapsed time per iteration (s): 0.77 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.250733E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.518 | TFLOPs: 20.24 | 31: iteration 17050/ 173500 | consumed samples: 4364800 | consumed tokens: 8939110400 | elapsed time per iteration (s): 1.14 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.257610E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 224.273 | TFLOPs: 13.57 | 31: iteration 17060/ 173500 | consumed samples: 4367360 | consumed tokens: 8944353280 | elapsed time per iteration (s): 0.82 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.185905E+00 | grad norm: 0.140 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.932 | TFLOPs: 18.87 | 31: iteration 17070/ 173500 | consumed samples: 4369920 | consumed tokens: 8949596160 | elapsed time per iteration (s): 0.76 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.238375E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.370 | TFLOPs: 20.47 | 31: iteration 17080/ 173500 | consumed samples: 4372480 | consumed tokens: 8954839040 | elapsed time per iteration (s): 0.79 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.241236E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.338 | TFLOPs: 19.50 | 31: iteration 17090/ 173500 | consumed samples: 4375040 | consumed tokens: 8960081920 | elapsed time per iteration (s): 0.77 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.238162E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.615 | TFLOPs: 20.18 | 31: iteration 17100/ 173500 | consumed samples: 4377600 | consumed tokens: 8965324800 | elapsed time per iteration (s): 0.80 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.234659E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.208 | TFLOPs: 19.37 | 31: iteration 17110/ 173500 | consumed samples: 4380160 | consumed tokens: 8970567680 | elapsed time per iteration (s): 0.72 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.224343E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.425 | TFLOPs: 21.44 | 31: iteration 17120/ 173500 | consumed samples: 4382720 | consumed tokens: 8975810560 | elapsed time per iteration (s): 0.80 | learning rate: 1.965E-04 | 
global batch size: 256 | lm loss: 2.267719E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.368 | TFLOPs: 19.38 | 31: iteration 17130/ 173500 | consumed samples: 4385280 | consumed tokens: 8981053440 | elapsed time per iteration (s): 0.79 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.217860E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.836 | TFLOPs: 19.65 | 31: iteration 17140/ 173500 | consumed samples: 4387840 | consumed tokens: 8986296320 | elapsed time per iteration (s): 0.80 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.230693E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.321 | TFLOPs: 19.32 | 31: iteration 17150/ 173500 | consumed samples: 4390400 | consumed tokens: 8991539200 | elapsed time per iteration (s): 0.77 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.234251E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.142 | TFLOPs: 20.09 | 31: iteration 17160/ 173500 | consumed samples: 4392960 | consumed tokens: 8996782080 | elapsed time per iteration (s): 0.77 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.227125E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.793 | TFLOPs: 20.07 | 31: iteration 17170/ 173500 | consumed samples: 4395520 | consumed tokens: 9002024960 | elapsed time per iteration (s): 0.75 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.250498E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.493 | TFLOPs: 20.60 | 31: iteration 17180/ 173500 | consumed samples: 4398080 | 
consumed tokens: 9007267840 | elapsed time per iteration (s): 0.74 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.225976E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.980 | TFLOPs: 20.87 | 31: iteration 17190/ 173500 | consumed samples: 4400640 | consumed tokens: 9012510720 | elapsed time per iteration (s): 0.76 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.252271E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.637 | TFLOPs: 20.49 | 31: iteration 17200/ 173500 | consumed samples: 4403200 | consumed tokens: 9017753600 | elapsed time per iteration (s): 0.77 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.195262E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.491 | TFLOPs: 20.18 | 31: iteration 17210/ 173500 | consumed samples: 4405760 | consumed tokens: 9022996480 | elapsed time per iteration (s): 0.74 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.213989E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.167 | TFLOPs: 20.88 | 31: iteration 17220/ 173500 | consumed samples: 4408320 | consumed tokens: 9028239360 | elapsed time per iteration (s): 0.80 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.212940E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.630 | TFLOPs: 19.46 | 31: iteration 17230/ 173500 | consumed samples: 4410880 | consumed tokens: 9033482240 | elapsed time per iteration (s): 0.82 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.256213E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 312.653 | TFLOPs: 18.91 | 31: iteration 17240/ 173500 | consumed samples: 4413440 | consumed tokens: 9038725120 | elapsed time per iteration (s): 0.79 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.254992E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.420 | TFLOPs: 19.57 | 31: iteration 17250/ 173500 | consumed samples: 4416000 | consumed tokens: 9043968000 | elapsed time per iteration (s): 0.80 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.223087E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.137 | TFLOPs: 19.25 | 31: iteration 17260/ 173500 | consumed samples: 4418560 | consumed tokens: 9049210880 | elapsed time per iteration (s): 0.83 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.233036E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.577 | TFLOPs: 18.73 | 31: iteration 17270/ 173500 | consumed samples: 4421120 | consumed tokens: 9054453760 | elapsed time per iteration (s): 0.77 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.254159E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.452 | TFLOPs: 20.11 | 31: iteration 17280/ 173500 | consumed samples: 4423680 | consumed tokens: 9059696640 | elapsed time per iteration (s): 0.80 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.235876E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.441 | TFLOPs: 19.39 | 31: iteration 17290/ 173500 | consumed samples: 4426240 | consumed tokens: 9064939520 | elapsed time per iteration (s): 0.82 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.240789E+00 | grad norm: 
0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.361 | TFLOPs: 18.96 | 31: iteration 17300/ 173500 | consumed samples: 4428800 | consumed tokens: 9070182400 | elapsed time per iteration (s): 0.72 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.243703E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.796 | TFLOPs: 21.40 | 31: iteration 17310/ 173500 | consumed samples: 4431360 | consumed tokens: 9075425280 | elapsed time per iteration (s): 0.77 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.213442E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.337 | TFLOPs: 20.11 | 31: iteration 17320/ 173500 | consumed samples: 4433920 | consumed tokens: 9080668160 | elapsed time per iteration (s): 0.74 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.222438E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.504 | TFLOPs: 21.02 | 31: iteration 17330/ 173500 | consumed samples: 4436480 | consumed tokens: 9085911040 | elapsed time per iteration (s): 0.75 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.245039E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.507 | TFLOPs: 20.72 | 31: iteration 17340/ 173500 | consumed samples: 4439040 | consumed tokens: 9091153920 | elapsed time per iteration (s): 0.76 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.221413E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.263 | TFLOPs: 20.34 | 31: iteration 17350/ 173500 | consumed samples: 4441600 | consumed tokens: 9096396800 | elapsed time per iteration (s): 
0.79 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.213318E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.414 | TFLOPs: 19.69 | 31: iteration 17360/ 173500 | consumed samples: 4444160 | consumed tokens: 9101639680 | elapsed time per iteration (s): 0.80 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.205406E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.055 | TFLOPs: 19.36 | 31: iteration 17370/ 173500 | consumed samples: 4446720 | consumed tokens: 9106882560 | elapsed time per iteration (s): 0.75 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.210569E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.127 | TFLOPs: 20.58 | 31: iteration 17380/ 173500 | consumed samples: 4449280 | consumed tokens: 9112125440 | elapsed time per iteration (s): 0.77 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.241214E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.392 | TFLOPs: 20.05 | 31: iteration 17390/ 173500 | consumed samples: 4451840 | consumed tokens: 9117368320 | elapsed time per iteration (s): 0.78 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.260141E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.627 | TFLOPs: 19.94 | 31: iteration 17400/ 173500 | consumed samples: 4454400 | consumed tokens: 9122611200 | elapsed time per iteration (s): 0.81 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.237647E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.763 | TFLOPs: 19.04 | 31: iteration 17410/ 
173500 | consumed samples: 4456960 | consumed tokens: 9127854080 | elapsed time per iteration (s): 0.74 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.227982E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.781 | TFLOPs: 20.86 | 31: iteration 17420/ 173500 | consumed samples: 4459520 | consumed tokens: 9133096960 | elapsed time per iteration (s): 0.77 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.229322E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.395 | TFLOPs: 20.11 | 31: iteration 17430/ 173500 | consumed samples: 4462080 | consumed tokens: 9138339840 | elapsed time per iteration (s): 0.76 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.248949E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.063 | TFLOPs: 20.27 | 31: iteration 17440/ 173500 | consumed samples: 4464640 | consumed tokens: 9143582720 | elapsed time per iteration (s): 0.73 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.215984E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.830 | TFLOPs: 21.10 | 31: iteration 17450/ 173500 | consumed samples: 4467200 | consumed tokens: 9148825600 | elapsed time per iteration (s): 0.77 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.242557E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.740 | TFLOPs: 20.19 | 31: iteration 17460/ 173500 | consumed samples: 4469760 | consumed tokens: 9154068480 | elapsed time per iteration (s): 0.79 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.248135E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 324.055 | TFLOPs: 19.60 | 31: iteration 17470/ 173500 | consumed samples: 4472320 | consumed tokens: 9159311360 | elapsed time per iteration (s): 0.81 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.200813E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.863 | TFLOPs: 19.11 | 31: iteration 17480/ 173500 | consumed samples: 4474880 | consumed tokens: 9164554240 | elapsed time per iteration (s): 0.82 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.233615E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.338 | TFLOPs: 18.84 | 31: iteration 17490/ 173500 | consumed samples: 4477440 | consumed tokens: 9169797120 | elapsed time per iteration (s): 0.80 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.254308E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.038 | TFLOPs: 19.42 | 31: iteration 17500/ 173500 | consumed samples: 4480000 | consumed tokens: 9175040000 | elapsed time per iteration (s): 0.84 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.209020E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.067 | TFLOPs: 18.46 | 31: iteration 17510/ 173500 | consumed samples: 4482560 | consumed tokens: 9180282880 | elapsed time per iteration (s): 0.80 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.212043E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.505 | TFLOPs: 19.27 | 31: iteration 17520/ 173500 | consumed samples: 4485120 | consumed tokens: 9185525760 | elapsed time per iteration (s): 0.82 | learning rate: 1.963E-04 | global batch size: 256 | 
lm loss: 2.233476E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.964 | TFLOPs: 18.87 | 31: iteration 17530/ 173500 | consumed samples: 4487680 | consumed tokens: 9190768640 | elapsed time per iteration (s): 0.79 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.262315E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.415 | TFLOPs: 19.57 | 31: iteration 17540/ 173500 | consumed samples: 4490240 | consumed tokens: 9196011520 | elapsed time per iteration (s): 0.80 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.264223E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.819 | TFLOPs: 19.47 | 31: iteration 17550/ 173500 | consumed samples: 4492800 | consumed tokens: 9201254400 | elapsed time per iteration (s): 0.78 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.239555E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.231 | TFLOPs: 19.80 | 31: iteration 17560/ 173500 | consumed samples: 4495360 | consumed tokens: 9206497280 | elapsed time per iteration (s): 0.82 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.217765E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.568 | TFLOPs: 18.91 | 31: iteration 17570/ 173500 | consumed samples: 4497920 | consumed tokens: 9211740160 | elapsed time per iteration (s): 0.79 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.217516E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.755 | TFLOPs: 19.65 | 31: iteration 17580/ 173500 | consumed samples: 4500480 | consumed tokens: 
9216983040 | elapsed time per iteration (s): 0.82 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.245173E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.864 | TFLOPs: 18.87 | 31: iteration 17590/ 173500 | consumed samples: 4503040 | consumed tokens: 9222225920 | elapsed time per iteration (s): 0.86 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.214161E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.549 | TFLOPs: 18.00 | 31: iteration 17600/ 173500 | consumed samples: 4505600 | consumed tokens: 9227468800 | elapsed time per iteration (s): 0.75 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.250442E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.517 | TFLOPs: 20.72 | 31: iteration 17610/ 173500 | consumed samples: 4508160 | consumed tokens: 9232711680 | elapsed time per iteration (s): 0.77 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.235172E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.817 | TFLOPs: 20.20 | 31: iteration 17620/ 173500 | consumed samples: 4510720 | consumed tokens: 9237954560 | elapsed time per iteration (s): 0.75 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.245126E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.269 | TFLOPs: 20.52 | 31: iteration 17630/ 173500 | consumed samples: 4513280 | consumed tokens: 9243197440 | elapsed time per iteration (s): 0.73 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.216050E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
350.821 | TFLOPs: 21.22 | 31: iteration 17640/ 173500 | consumed samples: 4515840 | consumed tokens: 9248440320 | elapsed time per iteration (s): 0.76 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.215889E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.329 | TFLOPs: 20.47 | 31: iteration 17650/ 173500 | consumed samples: 4518400 | consumed tokens: 9253683200 | elapsed time per iteration (s): 0.75 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.239759E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.787 | TFLOPs: 20.74 | 31: iteration 17660/ 173500 | consumed samples: 4520960 | consumed tokens: 9258926080 | elapsed time per iteration (s): 0.76 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.243039E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.217 | TFLOPs: 20.34 | 31: iteration 17670/ 173500 | consumed samples: 4523520 | consumed tokens: 9264168960 | elapsed time per iteration (s): 0.78 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.191421E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.770 | TFLOPs: 19.83 | 31: iteration 17680/ 173500 | consumed samples: 4526080 | consumed tokens: 9269411840 | elapsed time per iteration (s): 0.77 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.228010E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.541 | TFLOPs: 20.06 | 31: iteration 17690/ 173500 | consumed samples: 4528640 | consumed tokens: 9274654720 | elapsed time per iteration (s): 0.79 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.258805E+00 | grad norm: 0.154 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.840 | TFLOPs: 19.71 | 31: iteration 17700/ 173500 | consumed samples: 4531200 | consumed tokens: 9279897600 | elapsed time per iteration (s): 0.87 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.234968E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.992 | TFLOPs: 17.73 | 31: iteration 17710/ 173500 | consumed samples: 4533760 | consumed tokens: 9285140480 | elapsed time per iteration (s): 0.80 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.227139E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.290 | TFLOPs: 19.26 | 31: iteration 17720/ 173500 | consumed samples: 4536320 | consumed tokens: 9290383360 | elapsed time per iteration (s): 0.86 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.232456E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.394 | TFLOPs: 17.93 | 31: iteration 17730/ 173500 | consumed samples: 4538880 | consumed tokens: 9295626240 | elapsed time per iteration (s): 0.82 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.190270E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.060 | TFLOPs: 18.82 | 31: iteration 17740/ 173500 | consumed samples: 4541440 | consumed tokens: 9300869120 | elapsed time per iteration (s): 0.86 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.227291E+00 | grad norm: 0.783 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.849 | TFLOPs: 17.96 | 31: iteration 17750/ 173500 | consumed samples: 4544000 | consumed tokens: 9306112000 | elapsed time per iteration (s): 0.82 | 
learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.656059E+00 | grad norm: 23.958 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.382 | TFLOPs: 18.90 | 31: iteration 17760/ 173500 | consumed samples: 4546560 | consumed tokens: 9311354880 | elapsed time per iteration (s): 0.85 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.507148E+00 | grad norm: 0.815 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.487 | TFLOPs: 18.24 | 31: iteration 17770/ 173500 | consumed samples: 4549120 | consumed tokens: 9316597760 | elapsed time per iteration (s): 0.83 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.366125E+00 | grad norm: 0.429 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.197 | TFLOPs: 18.58 | 31: iteration 17780/ 173500 | consumed samples: 4551680 | consumed tokens: 9321840640 | elapsed time per iteration (s): 0.83 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.305513E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.706 | TFLOPs: 18.62 | 31: iteration 17790/ 173500 | consumed samples: 4554240 | consumed tokens: 9327083520 | elapsed time per iteration (s): 0.81 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.241764E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.950 | TFLOPs: 19.24 | 31: iteration 17800/ 173500 | consumed samples: 4556800 | consumed tokens: 9332326400 | elapsed time per iteration (s): 0.80 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.265152E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.167 | TFLOPs: 19.25 | 31: iteration 17810/ 173500 | 
consumed samples: 4559360 | consumed tokens: 9337569280 | elapsed time per iteration (s): 0.82 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.255145E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.137 | TFLOPs: 18.88 | 31: iteration 17820/ 173500 | consumed samples: 4561920 | consumed tokens: 9342812160 | elapsed time per iteration (s): 0.83 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.243770E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.381 | TFLOPs: 18.66 | 31: iteration 17830/ 173500 | consumed samples: 4564480 | consumed tokens: 9348055040 | elapsed time per iteration (s): 0.82 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.253749E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.817 | TFLOPs: 18.99 | 31: iteration 17840/ 173500 | consumed samples: 4567040 | consumed tokens: 9353297920 | elapsed time per iteration (s): 0.82 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.235425E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.643 | TFLOPs: 18.85 | 31: iteration 17850/ 173500 | consumed samples: 4569600 | consumed tokens: 9358540800 | elapsed time per iteration (s): 0.83 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.234994E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.436 | TFLOPs: 18.72 | 31: iteration 17860/ 173500 | consumed samples: 4572160 | consumed tokens: 9363783680 | elapsed time per iteration (s): 0.85 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.229800E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 302.793 | TFLOPs: 18.32 | 31: iteration 17870/ 173500 | consumed samples: 4574720 | consumed tokens: 9369026560 | elapsed time per iteration (s): 0.80 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.436003E+00 | grad norm: 1.370 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.286 | TFLOPs: 19.26 | 31: iteration 17880/ 173500 | consumed samples: 4577280 | consumed tokens: 9374269440 | elapsed time per iteration (s): 0.81 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.312058E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.734 | TFLOPs: 19.16 | 31: iteration 17890/ 173500 | consumed samples: 4579840 | consumed tokens: 9379512320 | elapsed time per iteration (s): 0.85 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.250193E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.402 | TFLOPs: 18.29 | 31: iteration 17900/ 173500 | consumed samples: 4582400 | consumed tokens: 9384755200 | elapsed time per iteration (s): 0.83 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.295479E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.609 | TFLOPs: 18.55 | 31: iteration 17910/ 173500 | consumed samples: 4584960 | consumed tokens: 9389998080 | elapsed time per iteration (s): 0.86 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.201673E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.994 | TFLOPs: 18.09 | 31: iteration 17920/ 173500 | consumed samples: 4587520 | consumed tokens: 9395240960 | elapsed time per iteration (s): 0.82 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 
2.240535E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.645 | TFLOPs: 18.79 | 31: iteration 17930/ 173500 | consumed samples: 4590080 | consumed tokens: 9400483840 | elapsed time per iteration (s): 0.83 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.257695E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.602 | TFLOPs: 18.73 | 31: iteration 17940/ 173500 | consumed samples: 4592640 | consumed tokens: 9405726720 | elapsed time per iteration (s): 0.84 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.241028E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.078 | TFLOPs: 18.52 | 31: iteration 17950/ 173500 | consumed samples: 4595200 | consumed tokens: 9410969600 | elapsed time per iteration (s): 0.83 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.248962E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.694 | TFLOPs: 18.55 | 31: iteration 17960/ 173500 | consumed samples: 4597760 | consumed tokens: 9416212480 | elapsed time per iteration (s): 0.84 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.223838E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.327 | TFLOPs: 18.41 | 31: iteration 17970/ 173500 | consumed samples: 4600320 | consumed tokens: 9421455360 | elapsed time per iteration (s): 0.82 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.231507E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.149 | TFLOPs: 18.88 | 31: iteration 17980/ 173500 | consumed samples: 4602880 | consumed tokens: 9426698240 | 
elapsed time per iteration (s): 0.86 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.251197E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.815 | TFLOPs: 17.96 |
31: iteration 17990/ 173500 | consumed samples: 4605440 | consumed tokens: 9431941120 | elapsed time per iteration (s): 0.79 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.235984E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.672 | TFLOPs: 19.58 |
0: [2022-11-25 22:07:40,912] [INFO] [logging.py:68:log_dist] [Rank 0] step=18000, skipped=0, lr=[0.00019604685446348677, 0.00019604685446348677, 0.00019604685446348677], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
31: iteration 18000/ 173500 | consumed samples: 4608000 | consumed tokens: 9437184000 | elapsed time per iteration (s): 0.80 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.237797E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.241 | TFLOPs: 19.25 |
0: steps: 18000 loss: 2.2020 iter time (s): 0.789 samples/sec: 324.330
31: -------------------------------------------------------------------------------------------
31: valid loss at iteration 18000 | lm loss value: 2.094586E+00 | lm loss PPL: 8.122077E+00 |
31: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 18000 to checkpoints_1b1long
0: [2022-11-25 22:07:41,163] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step18000 is begin to save!
0: [2022-11-25 22:07:41,170] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_01-model_00-model_states.pt...
0: [2022-11-25 22:07:41,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_01-model_00-model_states.pt. 0: [2022-11-25 22:07:41,371] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_03-model_00-model_states.pt... 0: [2022-11-25 22:07:41,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_03-model_00-model_states.pt. 0: [2022-11-25 22:07:41,454] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_04-model_00-model_states.pt... 0: [2022-11-25 22:07:41,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_04-model_00-model_states.pt. 0: [2022-11-25 22:07:41,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_05-model_00-model_states.pt... 0: [2022-11-25 22:07:41,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_05-model_00-model_states.pt. 0: [2022-11-25 22:07:41,603] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_06-model_00-model_states.pt... 0: [2022-11-25 22:07:41,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_06-model_00-model_states.pt. 0: [2022-11-25 22:07:41,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_07-model_00-model_states.pt... 0: [2022-11-25 22:07:41,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_07-model_00-model_states.pt. 0: [2022-11-25 22:07:41,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_08-model_00-model_states.pt... 
0: [2022-11-25 22:07:41,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_08-model_00-model_states.pt. 0: [2022-11-25 22:07:41,841] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_09-model_00-model_states.pt... 0: [2022-11-25 22:07:41,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_09-model_00-model_states.pt. 0: [2022-11-25 22:07:41,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_10-model_00-model_states.pt... 0: [2022-11-25 22:07:41,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_10-model_00-model_states.pt. 0: [2022-11-25 22:07:41,991] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_11-model_00-model_states.pt... 0: [2022-11-25 22:07:42,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_11-model_00-model_states.pt. 0: [2022-11-25 22:07:42,065] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_12-model_00-model_states.pt... 0: [2022-11-25 22:07:42,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_12-model_00-model_states.pt. 0: [2022-11-25 22:07:42,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_13-model_00-model_states.pt... 0: [2022-11-25 22:07:42,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_13-model_00-model_states.pt. 0: [2022-11-25 22:07:42,212] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_14-model_00-model_states.pt... 
0: [2022-11-25 22:07:42,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_14-model_00-model_states.pt. 0: [2022-11-25 22:07:42,283] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_15-model_00-model_states.pt... 0: [2022-11-25 22:07:42,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_15-model_00-model_states.pt. 0: [2022-11-25 22:07:42,360] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_16-model_00-model_states.pt... 0: [2022-11-25 22:07:42,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_16-model_00-model_states.pt. 0: [2022-11-25 22:07:42,430] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_17-model_00-model_states.pt... 0: [2022-11-25 22:07:42,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_17-model_00-model_states.pt. 0: [2022-11-25 22:07:42,505] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_18-model_00-model_states.pt... 0: [2022-11-25 22:07:42,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_18-model_00-model_states.pt. 0: [2022-11-25 22:07:42,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_19-model_00-model_states.pt... 0: [2022-11-25 22:07:42,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_19-model_00-model_states.pt. 0: [2022-11-25 22:07:42,650] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_20-model_00-model_states.pt... 
0: [2022-11-25 22:07:42,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_20-model_00-model_states.pt. 0: [2022-11-25 22:07:42,727] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_21-model_00-model_states.pt... 0: [2022-11-25 22:07:42,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_21-model_00-model_states.pt. 0: [2022-11-25 22:07:42,801] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_22-model_00-model_states.pt... 0: [2022-11-25 22:07:42,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_22-model_00-model_states.pt. 0: [2022-11-25 22:07:42,875] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_23-model_00-model_states.pt... 0: [2022-11-25 22:07:42,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_23-model_00-model_states.pt. 0: [2022-11-25 22:07:42,949] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_24-model_00-model_states.pt... 0: [2022-11-25 22:07:43,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_24-model_00-model_states.pt. 0: [2022-11-25 22:07:43,020] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_25-model_00-model_states.pt... 0: [2022-11-25 22:07:43,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_25-model_00-model_states.pt. 0: [2022-11-25 22:07:43,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_26-model_00-model_states.pt... 
0: [2022-11-25 22:07:43,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_26-model_00-model_states.pt. 0: [2022-11-25 22:07:43,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_27-model_00-model_states.pt... 0: [2022-11-25 22:07:43,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_27-model_00-model_states.pt. 0: [2022-11-25 22:07:43,242] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_28-model_00-model_states.pt... 0: [2022-11-25 22:07:43,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_28-model_00-model_states.pt. 0: [2022-11-25 22:07:43,314] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/layer_30-model_00-model_states.pt... 0: [2022-11-25 22:07:43,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/layer_30-model_00-model_states.pt. 0: [2022-11-25 22:07:43,319] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step18000/mp_rank_00_model_states.pt 0: [2022-11-25 22:07:43,319] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/mp_rank_00_model_states.pt... 0: [2022-11-25 22:07:43,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/mp_rank_00_model_states.pt. 0: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 
5: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
9: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 
2: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
12: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 
25: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
28: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 
29: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 
21: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 
19: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 
4: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 
1: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 
3: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 
25: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
14: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 
30: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 
27: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
9: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
3: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
14: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 30: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 
21: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 
5: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 
25: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 30: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 
26: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 30: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 
0: [2022-11-25 22:07:43,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:07:43,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:07:43,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 22:07:43,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 24: [2022-11-25 22:07:43,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-25 22:07:43,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-25 22:07:43,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 13: [2022-11-25 22:07:43,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:07:43,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 22:07:43,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 23: [2022-11-25 22:07:43,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-25 22:07:43,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-25 22:07:43,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 12: [2022-11-25 22:07:43,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:07:43,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:07:43,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 6: [2022-11-25 22:07:43,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 12: [2022-11-25 22:07:43,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 6: [2022-11-25 22:07:43,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 15: [2022-11-25 22:07:43,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:07:43,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 22:07:43,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 18: [2022-11-25 22:07:43,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-25 22:07:43,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-25 22:07:43,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 26: [2022-11-25 22:07:43,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:07:43,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-25 22:07:43,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 28: [2022-11-25 22:07:43,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:07:43,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:07:43,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 22:07:43,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 18: [2022-11-25 22:07:43,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:07:43,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-25 22:07:43,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 
29: [2022-11-25 22:07:43,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:07:43,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-25 22:07:43,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 10: [2022-11-25 22:07:43,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:07:43,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 22:07:43,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 28: [2022-11-25 22:07:43,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-25 22:07:43,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 28: [2022-11-25 22:07:43,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:07:43,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:07:43,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 22:07:43,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 
4: [2022-11-25 22:07:43,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:07:43,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:07:43,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 27: [2022-11-25 22:07:43,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:07:43,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 0: [2022-11-25 22:07:43,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 27: [2022-11-25 22:07:43,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-25 22:07:43,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 0: [2022-11-25 22:07:43,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 16: [2022-11-25 22:07:43,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:07:43,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
16: [2022-11-25 22:07:43,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 25: [2022-11-25 22:07:43,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 16: [2022-11-25 22:07:43,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 20: [2022-11-25 22:07:43,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:07:43,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 25: [2022-11-25 22:07:43,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 0: [2022-11-25 22:07:43,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 20: [2022-11-25 22:07:43,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-25 22:07:43,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 25: [2022-11-25 22:07:43,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 20: [2022-11-25 22:07:43,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 24: [2022-11-25 22:07:43,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-25 22:07:43,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 20: [2022-11-25 22:07:43,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 24: [2022-11-25 22:07:43,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 26: [2022-11-25 22:07:43,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:07:43,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 26: [2022-11-25 22:07:43,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-25 22:07:43,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 30: [2022-11-25 22:07:43,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-25 22:07:43,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-25 22:07:43,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 9: [2022-11-25 22:07:43,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:07:43,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
22: [2022-11-25 22:07:43,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:07:43,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 22: [2022-11-25 22:07:43,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 9: [2022-11-25 22:07:43,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 22: [2022-11-25 22:07:43,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-25 22:07:43,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 22: [2022-11-25 22:07:43,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 15: [2022-11-25 22:07:43,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:07:43,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
15: [2022-11-25 22:07:43,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 28: [2022-11-25 22:07:43,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 9: [2022-11-25 22:07:43,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 15: [2022-11-25 22:07:43,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 28: [2022-11-25 22:07:43,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 9: [2022-11-25 22:07:43,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 31: [2022-11-25 22:07:43,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-25 22:07:43,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-25 22:07:43,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-25 22:07:43,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-25 22:07:43,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 31: [2022-11-25 22:07:43,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 
17: [2022-11-25 22:07:43,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 23: [2022-11-25 22:07:43,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-25 22:07:43,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 17: [2022-11-25 22:07:43,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 23: [2022-11-25 22:07:43,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 17: [2022-11-25 22:07:43,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 17: [2022-11-25 22:07:43,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-25 22:07:43,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-25 22:07:43,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 17: [2022-11-25 22:07:43,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:07:43,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
17: [2022-11-25 22:07:43,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 7: [2022-11-25 22:07:43,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 17: [2022-11-25 22:07:43,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 7: [2022-11-25 22:07:43,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 10: [2022-11-25 22:07:43,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:07:43,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 22:07:43,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 21: [2022-11-25 22:07:43,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-25 22:07:43,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-25 22:07:43,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 30: [2022-11-25 22:07:43,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-25 22:07:43,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-25 22:07:43,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 11: [2022-11-25 22:07:43,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:07:43,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 22:07:43,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 2: [2022-11-25 22:07:43,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:07:43,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 22:07:43,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 11: [2022-11-25 22:07:43,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:07:43,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 
11: [2022-11-25 22:07:43,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 20: [2022-11-25 22:07:43,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 11: [2022-11-25 22:07:43,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 20: [2022-11-25 22:07:43,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 27: [2022-11-25 22:07:43,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:07:43,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-25 22:07:43,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 28: [2022-11-25 22:07:43,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-25 22:07:43,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-25 22:07:43,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 0: [2022-11-25 22:07:43,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 24: [2022-11-25 22:07:43,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
24: [2022-11-25 22:07:43,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 0: [2022-11-25 22:07:43,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 24: [2022-11-25 22:07:43,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 0: [2022-11-25 22:07:43,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 19: [2022-11-25 22:07:43,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-25 22:07:43,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-25 22:07:43,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 18: [2022-11-25 22:07:43,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:07:43,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-25 22:07:43,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 23: [2022-11-25 22:07:43,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
23: [2022-11-25 22:07:43,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-25 22:07:43,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 1: [2022-11-25 22:07:43,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:07:43,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 22:07:43,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 22: [2022-11-25 22:07:43,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:07:43,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-25 22:07:43,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 21: [2022-11-25 22:07:43,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-25 22:07:43,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-25 22:07:43,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 6: [2022-11-25 22:07:43,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-25 22:07:43,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 9: [2022-11-25 22:07:43,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:07:43,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 9: [2022-11-25 22:07:43,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 22:07:43,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 19: [2022-11-25 22:07:43,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-25 22:07:43,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-25 22:07:43,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 1: [2022-11-25 22:07:43,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 21: [2022-11-25 22:07:43,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:07:43,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 12: [2022-11-25 22:07:43,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
21: [2022-11-25 22:07:43,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 1: [2022-11-25 22:07:43,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 12: [2022-11-25 22:07:43,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 21: [2022-11-25 22:07:43,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 12: [2022-11-25 22:07:43,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 20: [2022-11-25 22:07:43,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:07:43,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-25 22:07:43,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 15: [2022-11-25 22:07:43,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:07:43,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 10: [2022-11-25 22:07:43,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:07:43,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
10: [2022-11-25 22:07:43,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 15: [2022-11-25 22:07:43,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 9: [2022-11-25 22:07:43,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 10: [2022-11-25 22:07:43,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 9: [2022-11-25 22:07:43,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 15: [2022-11-25 22:07:43,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:07:43,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 22:07:43,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 5: [2022-11-25 22:07:43,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:07:43,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 22:07:43,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 7: [2022-11-25 22:07:43,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-25 22:07:43,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 22:07:43,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 2: [2022-11-25 22:07:43,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:07:43,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 22:07:43,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 18: [2022-11-25 22:07:43,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:07:43,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:07:43,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 6: [2022-11-25 22:07:43,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 22:07:43,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 2: [2022-11-25 22:07:43,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:07:43,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 
2: [2022-11-25 22:07:43,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 22:07:43,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 4: [2022-11-25 22:07:43,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:07:43,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 22:07:43,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 28: [2022-11-25 22:07:43,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-25 22:07:43,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-25 22:07:43,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 26: [2022-11-25 22:07:43,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:07:43,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-25 22:07:43,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 8: [2022-11-25 22:07:43,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-25 22:07:43,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:07:43,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:07:43,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 22:07:43,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 22:07:43,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 22:07:43,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 8: [2022-11-25 22:07:43,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 8: [2022-11-25 22:07:43,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 8: [2022-11-25 22:07:43,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:07:43,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 22:07:43,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 23: [2022-11-25 22:07:43,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-25 22:07:43,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-25 22:07:43,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 16: [2022-11-25 22:07:43,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-25 22:07:43,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-25 22:07:43,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-25 22:07:43,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-25 22:07:43,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 16: [2022-11-25 22:07:43,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 1: [2022-11-25 22:07:43,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 16: [2022-11-25 22:07:43,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-25 22:07:43,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-25 22:07:43,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 
1: [2022-11-25 22:07:43,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 24: [2022-11-25 22:07:43,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:07:43,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 24: [2022-11-25 22:07:43,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-25 22:07:43,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 25: [2022-11-25 22:07:43,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-25 22:07:43,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-25 22:07:43,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 6: [2022-11-25 22:07:43,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:07:43,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 22:07:43,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 5: [2022-11-25 22:07:43,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
25: [2022-11-25 22:07:43,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:07:43,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 7: [2022-11-25 22:07:43,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 25: [2022-11-25 22:07:43,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 5: [2022-11-25 22:07:43,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 7: [2022-11-25 22:07:43,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 22:07:43,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 25: [2022-11-25 22:07:43,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 27: [2022-11-25 22:07:43,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:07:43,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-25 22:07:43,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 27: [2022-11-25 22:07:43,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
27: [2022-11-25 22:07:43,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-25 22:07:43,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 10: [2022-11-25 22:07:43,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:07:43,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:07:43,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 22: [2022-11-25 22:07:43,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 10: [2022-11-25 22:07:43,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 22: [2022-11-25 22:07:43,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 19: [2022-11-25 22:07:43,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:07:43,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
19: [2022-11-25 22:07:43,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 26: [2022-11-25 22:07:43,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 19: [2022-11-25 22:07:43,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 26: [2022-11-25 22:07:43,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 12: [2022-11-25 22:07:43,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:07:43,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 22:07:43,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 30: [2022-11-25 22:07:43,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-25 22:07:43,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-25 22:07:43,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 11: [2022-11-25 22:07:43,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:07:43,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-25 22:07:43,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 22:07:43,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 22:07:43,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 11: [2022-11-25 22:07:43,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 17: [2022-11-25 22:07:43,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-25 22:07:43,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-25 22:07:43,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 5: [2022-11-25 22:07:43,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:07:43,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 22:07:43,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 30: [2022-11-25 22:07:43,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-25 22:07:43,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-25 22:07:43,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 4: [2022-11-25 22:07:43,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:07:43,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 22:07:43,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 31: [2022-11-25 22:07:43,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-25 22:07:43,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-25 22:07:43,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 2: [2022-11-25 22:07:43,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:07:43,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 22:07:43,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 7: [2022-11-25 22:07:43,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-25 22:07:43,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 22:07:43,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 25: [2022-11-25 22:07:43,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-25 22:07:43,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-25 22:07:43,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 27: [2022-11-25 22:07:43,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:07:43,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-25 22:07:43,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 0: [2022-11-25 22:07:43,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:07:43,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 22:07:43,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 14: [2022-11-25 22:07:43,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-25 22:07:43,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:07:43,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:07:43,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:07:43,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 22:07:43,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 22:07:43,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 22:07:43,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 22:07:43,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 14: [2022-11-25 22:07:43,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 14: [2022-11-25 22:07:43,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 14: [2022-11-25 22:07:43,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 31: [2022-11-25 22:07:43,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 
31: [2022-11-25 22:07:43,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-25 22:07:43,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 0: [2022-11-25 22:07:43,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:07:43,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 22:07:43,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 13: [2022-11-25 22:07:43,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:07:43,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 22:07:43,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 12: [2022-11-25 22:07:43,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:07:43,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 22:07:43,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 22: [2022-11-25 22:07:43,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
22: [2022-11-25 22:07:43,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-25 22:07:43,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 5: [2022-11-25 22:07:43,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:07:43,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 22:07:43,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 3: [2022-11-25 22:07:43,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:07:43,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 22:07:43,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 25: [2022-11-25 22:07:43,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-25 22:07:43,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-25 22:07:43,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 13: [2022-11-25 22:07:43,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-25 22:07:43,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 22:07:43,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 18: [2022-11-25 22:07:43,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:07:43,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-25 22:07:43,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 17: [2022-11-25 22:07:43,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-25 22:07:43,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-25 22:07:43,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 13: [2022-11-25 22:07:43,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:07:43,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 22:07:43,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 4: [2022-11-25 22:07:43,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-25 22:07:43,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 22:07:43,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 28: [2022-11-25 22:07:43,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:07:43,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:07:43,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-25 22:07:43,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 28: [2022-11-25 22:07:43,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-25 22:07:43,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 9: [2022-11-25 22:07:43,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:07:43,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 22:07:43,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 6: [2022-11-25 22:07:43,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-25 22:07:43,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 22:07:43,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 29: [2022-11-25 22:07:43,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:07:43,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-25 22:07:43,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 21: [2022-11-25 22:07:43,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-25 22:07:43,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-25 22:07:43,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 13: [2022-11-25 22:07:43,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:07:43,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 22:07:43,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 15: [2022-11-25 22:07:43,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-25 22:07:43,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 22:07:43,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 19: [2022-11-25 22:07:43,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-25 22:07:43,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-25 22:07:43,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 3: [2022-11-25 22:07:43,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:07:43,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 22:07:43,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 24: [2022-11-25 22:07:43,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-25 22:07:43,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-25 22:07:43,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 30: [2022-11-25 22:07:43,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-25 22:07:43,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-25 22:07:43,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 29: [2022-11-25 22:07:43,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:07:43,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-25 22:07:43,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 31: [2022-11-25 22:07:43,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-25 22:07:43,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-25 22:07:43,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 8: [2022-11-25 22:07:43,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:07:43,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 22:07:43,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 1: [2022-11-25 22:07:43,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-25 22:07:43,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 22:07:43,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 14: [2022-11-25 22:07:43,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:07:43,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 22:07:43,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 12: [2022-11-25 22:07:43,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:07:43,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 22:07:43,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 11: [2022-11-25 22:07:43,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:07:43,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 22:07:43,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 3: [2022-11-25 22:07:43,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-25 22:07:43,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:07:43,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 22:07:43,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 22:07:43,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 3: [2022-11-25 22:07:43,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 26: [2022-11-25 22:07:43,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:07:43,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-25 22:07:43,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 7: [2022-11-25 22:07:43,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:07:43,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 22:07:43,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 2: [2022-11-25 22:07:43,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-25 22:07:43,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 23: [2022-11-25 22:07:43,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:07:43,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 23: [2022-11-25 22:07:43,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-25 22:07:43,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 10: [2022-11-25 22:07:43,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:07:43,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 22:07:43,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 22: [2022-11-25 22:07:43,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:07:43,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-25 22:07:43,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 5: [2022-11-25 22:07:43,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-25 22:07:43,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 22:07:43,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 0: [2022-11-25 22:07:43,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:07:43,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 22:07:43,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 25: [2022-11-25 22:07:43,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 17: [2022-11-25 22:07:43,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-25 22:07:43,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 25: [2022-11-25 22:07:43,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 17: [2022-11-25 22:07:43,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 25: [2022-11-25 22:07:43,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 3: [2022-11-25 22:07:43,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-25 22:07:43,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 22:07:43,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 28: [2022-11-25 22:07:43,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-25 22:07:43,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-25 22:07:43,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 20: [2022-11-25 22:07:43,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:07:43,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-25 22:07:43,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 4: [2022-11-25 22:07:43,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:07:43,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 22:07:43,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 13: [2022-11-25 22:07:43,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-25 22:07:43,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 22:07:43,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 8: [2022-11-25 22:07:43,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:07:43,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 22:07:43,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 6: [2022-11-25 22:07:43,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:07:43,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 22:07:43,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 24: [2022-11-25 22:07:43,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-25 22:07:43,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-25 22:07:43,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 18: [2022-11-25 22:07:43,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
18: [2022-11-25 22:07:43,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-25 22:07:43,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 31: [2022-11-25 22:07:43,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-25 22:07:43,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-25 22:07:43,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 1: [2022-11-25 22:07:43,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:07:43,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 22:07:43,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 14: [2022-11-25 22:07:43,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:07:43,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 22:07:43,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 21: [2022-11-25 22:07:43,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-25 22:07:43,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-25 22:07:43,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 27: [2022-11-25 22:07:43,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:07:43,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:07:43,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 19: [2022-11-25 22:07:43,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:07:43,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 12: [2022-11-25 22:07:43,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 19: [2022-11-25 22:07:43,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-25 22:07:43,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 12: [2022-11-25 22:07:43,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 9: [2022-11-25 22:07:43,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-25 22:07:43,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 16: [2022-11-25 22:07:43,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:07:43,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 16: [2022-11-25 22:07:43,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-25 22:07:43,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 15: [2022-11-25 22:07:43,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:07:43,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 22:07:43,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 16: [2022-11-25 22:07:43,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-25 22:07:43,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-25 22:07:43,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 30: [2022-11-25 22:07:43,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-25 22:07:43,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-25 22:07:43,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 26: [2022-11-25 22:07:43,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:07:43,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-25 22:07:43,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 7: [2022-11-25 22:07:43,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:07:43,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 22:07:43,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 2: [2022-11-25 22:07:43,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:07:43,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 22:07:43,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 10: [2022-11-25 22:07:43,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-25 22:07:43,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 22:07:43,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 23: [2022-11-25 22:07:43,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-25 22:07:43,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-25 22:07:43,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 11: [2022-11-25 22:07:43,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:07:43,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 22:07:43,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 29: [2022-11-25 22:07:43,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:07:43,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
29: [2022-11-25 22:07:43,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-25 22:07:43,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-25 22:07:43,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 29: [2022-11-25 22:07:43,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 25: [2022-11-25 22:07:43,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-25 22:07:43,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 22: [2022-11-25 22:07:43,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 25: [2022-11-25 22:07:43,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 22: [2022-11-25 22:07:43,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-25 22:07:43,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 4: [2022-11-25 22:07:43,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-25 22:07:43,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 22:07:43,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 0: [2022-11-25 22:07:43,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:07:43,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 22:07:43,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 20: [2022-11-25 22:07:43,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:07:43,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-25 22:07:43,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 28: [2022-11-25 22:07:43,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-25 22:07:43,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-25 22:07:43,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 3: [2022-11-25 22:07:43,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-25 22:07:43,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 22:07:43,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 13: [2022-11-25 22:07:43,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:07:43,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 22:07:43,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 5: [2022-11-25 22:07:43,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:07:43,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 22:07:43,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 8: [2022-11-25 22:07:43,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:07:43,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 29: [2022-11-25 22:07:43,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-25 22:07:43,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 8: [2022-11-25 22:07:43,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 29: [2022-11-25 22:07:43,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 30: [2022-11-25 22:07:43,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:07:43,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 30: [2022-11-25 22:07:43,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 9: [2022-11-25 22:07:43,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 30: [2022-11-25 22:07:43,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 9: [2022-11-25 22:07:43,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 24: [2022-11-25 22:07:43,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-25 22:07:43,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-25 22:07:43,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 
6: [2022-11-25 22:07:43,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:07:43,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 22:07:43,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 18: [2022-11-25 22:07:43,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:07:43,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-25 22:07:43,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 15: [2022-11-25 22:07:43,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:07:43,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 22:07:43,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 27: [2022-11-25 22:07:43,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:07:43,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-25 22:07:43,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 
19: [2022-11-25 22:07:43,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-25 22:07:43,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-25 22:07:43,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 12: [2022-11-25 22:07:43,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:07:43,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 31: [2022-11-25 22:07:43,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:07:43,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 31: [2022-11-25 22:07:43,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-25 22:07:43,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 16: [2022-11-25 22:07:43,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:07:43,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
16: [2022-11-25 22:07:43,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-25 22:07:43,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 26: [2022-11-25 22:07:43,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-25 22:07:43,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 21: [2022-11-25 22:07:43,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-25 22:07:43,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-25 22:07:43,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 1: [2022-11-25 22:07:43,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:07:43,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 22:07:43,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 23: [2022-11-25 22:07:43,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
23: [2022-11-25 22:07:43,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-25 22:07:43,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 7: [2022-11-25 22:07:43,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:07:43,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 22:07:43,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 2: [2022-11-25 22:07:43,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:07:43,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 14: [2022-11-25 22:07:43,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:07:43,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 2: [2022-11-25 22:07:43,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 14: [2022-11-25 22:07:43,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 11: [2022-11-25 22:07:43,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-25 22:07:43,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 22:07:43,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 17: [2022-11-25 22:07:43,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-25 22:07:43,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-25 22:07:43,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 22: [2022-11-25 22:07:43,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:07:43,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-25 22:07:43,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 5: [2022-11-25 22:07:43,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:07:43,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 22:07:43,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 10: [2022-11-25 22:07:43,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-25 22:07:43,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 22:07:43,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 29: [2022-11-25 22:07:43,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:07:43,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-25 22:07:43,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 3: [2022-11-25 22:07:43,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:07:43,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 22:07:43,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 8: [2022-11-25 22:07:43,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:07:43,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 22:07:43,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 6: [2022-11-25 22:07:43,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-25 22:07:43,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 22:07:43,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 4: [2022-11-25 22:07:43,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:07:43,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 22:07:43,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 27: [2022-11-25 22:07:43,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:07:43,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 17: [2022-11-25 22:07:43,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:07:43,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 17: [2022-11-25 22:07:43,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-25 22:07:43,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 0: [2022-11-25 22:07:43,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
20: [2022-11-25 22:07:43,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:07:43,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 18: [2022-11-25 22:07:43,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:07:43,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 18: [2022-11-25 22:07:43,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-25 22:07:43,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 15: [2022-11-25 22:07:43,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:07:43,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 22:07:43,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 19: [2022-11-25 22:07:43,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-25 22:07:43,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-25 22:07:43,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 
28: [2022-11-25 22:07:43,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-25 22:07:43,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-25 22:07:43,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 9: [2022-11-25 22:07:43,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:07:43,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 22:07:43,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 31: [2022-11-25 22:07:43,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-25 22:07:43,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-25 22:07:43,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 19: [2022-11-25 22:07:43,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-25 22:07:43,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-25 22:07:43,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 
0: [2022-11-25 22:07:43,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 22:07:43,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 25: [2022-11-25 22:07:43,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-25 22:07:43,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-25 22:07:43,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 23: [2022-11-25 22:07:43,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-25 22:07:43,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-25 22:07:43,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 12: [2022-11-25 22:07:43,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:07:43,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 22:07:43,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 24: [2022-11-25 22:07:43,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
14: [2022-11-25 22:07:43,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:07:43,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:07:43,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 22:07:43,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 1: [2022-11-25 22:07:43,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 24: [2022-11-25 22:07:43,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 26: [2022-11-25 22:07:43,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 24: [2022-11-25 22:07:43,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 26: [2022-11-25 22:07:43,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 1: [2022-11-25 22:07:43,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 22:07:43,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 16: [2022-11-25 22:07:43,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
16: [2022-11-25 22:07:43,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-25 22:07:43,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 13: [2022-11-25 22:07:43,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:07:43,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 22:07:43,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 30: [2022-11-25 22:07:43,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-25 22:07:43,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-25 22:07:43,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 2: [2022-11-25 22:07:43,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:07:43,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 22:07:43,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 21: [2022-11-25 22:07:43,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-25 22:07:43,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-25 22:07:43,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-25 22:07:43,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 21: [2022-11-25 22:07:43,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-25 22:07:43,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 7: [2022-11-25 22:07:43,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:07:43,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 22:07:43,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 11: [2022-11-25 22:07:43,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:07:43,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 22:07:43,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 1: [2022-11-25 22:07:43,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-25 22:07:43,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 22:07:43,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 10: [2022-11-25 22:07:43,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:07:43,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 22:07:43,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 29: [2022-11-25 22:07:43,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:07:43,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step18000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-25 22:07:43,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 
0: successfully saved checkpoint at iteration 18000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2477.69 31: iteration 18010/ 173500 | consumed samples: 4610560 | consumed tokens: 9442426880 | elapsed time per iteration (s): 1.09 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.256425E+00 | grad norm: 0.582 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.539 | TFLOPs: 14.25 | 31: iteration 18020/ 173500 | consumed samples: 4613120 | consumed tokens: 9447669760 | elapsed time per iteration (s): 0.84 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.252108E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.194 | TFLOPs: 18.46 | 31: iteration 18030/ 173500 | consumed samples: 4615680 | consumed tokens: 9452912640 | elapsed time per iteration (s): 0.84 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.219011E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.384 | TFLOPs: 18.41 | 31: iteration 18040/ 173500 | consumed samples: 4618240 | consumed tokens: 9458155520 | elapsed time per iteration (s): 0.80 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.239558E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.909 | TFLOPs: 19.41 | 31: iteration 18050/ 173500 | consumed samples: 4620800 | consumed tokens: 9463398400 | elapsed time per iteration (s): 0.76 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.236611E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.171 | TFLOPs: 20.40 | 31: iteration 18060/ 173500 | consumed samples: 4623360 | consumed tokens: 9468641280 | elapsed time per iteration (s): 0.87 | learning 
rate: 1.960E-04 | global batch size: 256 | lm loss: 2.212325E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.247 | TFLOPs: 17.80 | 31: iteration 18070/ 173500 | consumed samples: 4625920 | consumed tokens: 9473884160 | elapsed time per iteration (s): 0.73 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.245794E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.873 | TFLOPs: 21.17 | 31: iteration 18080/ 173500 | consumed samples: 4628480 | consumed tokens: 9479127040 | elapsed time per iteration (s): 0.81 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.218466E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.517 | TFLOPs: 19.03 | 31: iteration 18090/ 173500 | consumed samples: 4631040 | consumed tokens: 9484369920 | elapsed time per iteration (s): 0.79 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.245001E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.790 | TFLOPs: 19.59 | 31: iteration 18100/ 173500 | consumed samples: 4633600 | consumed tokens: 9489612800 | elapsed time per iteration (s): 0.79 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.228131E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.944 | TFLOPs: 19.60 | 31: iteration 18110/ 173500 | consumed samples: 4636160 | consumed tokens: 9494855680 | elapsed time per iteration (s): 0.83 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.258526E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.301 | TFLOPs: 18.59 | 31: iteration 18120/ 173500 | consumed 
samples: 4638720 | consumed tokens: 9500098560 | elapsed time per iteration (s): 0.86 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.258035E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.607 | TFLOPs: 18.00 | 31: iteration 18130/ 173500 | consumed samples: 4641280 | consumed tokens: 9505341440 | elapsed time per iteration (s): 0.87 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.245562E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.244 | TFLOPs: 17.80 | 31: iteration 18140/ 173500 | consumed samples: 4643840 | consumed tokens: 9510584320 | elapsed time per iteration (s): 0.79 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.215063E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.677 | TFLOPs: 19.58 | 31: iteration 18150/ 173500 | consumed samples: 4646400 | consumed tokens: 9515827200 | elapsed time per iteration (s): 0.76 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.204683E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.148 | TFLOPs: 20.40 | 31: iteration 18160/ 173500 | consumed samples: 4648960 | consumed tokens: 9521070080 | elapsed time per iteration (s): 0.77 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.189223E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.892 | TFLOPs: 20.20 | 31: iteration 18170/ 173500 | consumed samples: 4651520 | consumed tokens: 9526312960 | elapsed time per iteration (s): 0.76 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.210015E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 335.963 | TFLOPs: 20.32 | 31: iteration 18180/ 173500 | consumed samples: 4654080 | consumed tokens: 9531555840 | elapsed time per iteration (s): 0.79 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.218184E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.975 | TFLOPs: 19.72 | 31: iteration 18190/ 173500 | consumed samples: 4656640 | consumed tokens: 9536798720 | elapsed time per iteration (s): 0.77 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.214708E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.636 | TFLOPs: 20.06 | 31: iteration 18200/ 173500 | consumed samples: 4659200 | consumed tokens: 9542041600 | elapsed time per iteration (s): 0.75 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.244687E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.681 | TFLOPs: 20.61 | 31: iteration 18210/ 173500 | consumed samples: 4661760 | consumed tokens: 9547284480 | elapsed time per iteration (s): 0.72 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.193908E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.266 | TFLOPs: 21.43 | 31: iteration 18220/ 173500 | consumed samples: 4664320 | consumed tokens: 9552527360 | elapsed time per iteration (s): 0.74 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.253740E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.169 | TFLOPs: 20.94 | 31: iteration 18230/ 173500 | consumed samples: 4666880 | consumed tokens: 9557770240 | elapsed time per iteration (s): 0.78 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 
2.212238E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.474 | TFLOPs: 19.75 | 31: iteration 18240/ 173500 | consumed samples: 4669440 | consumed tokens: 9563013120 | elapsed time per iteration (s): 0.78 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.204844E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.970 | TFLOPs: 19.84 | 31: iteration 18250/ 173500 | consumed samples: 4672000 | consumed tokens: 9568256000 | elapsed time per iteration (s): 0.76 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.225941E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.846 | TFLOPs: 20.50 | 31: iteration 18260/ 173500 | consumed samples: 4674560 | consumed tokens: 9573498880 | elapsed time per iteration (s): 0.79 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.192906E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.088 | TFLOPs: 19.61 | 31: iteration 18270/ 173500 | consumed samples: 4677120 | consumed tokens: 9578741760 | elapsed time per iteration (s): 0.76 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.212483E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.764 | TFLOPs: 20.25 | 31: iteration 18280/ 173500 | consumed samples: 4679680 | consumed tokens: 9583984640 | elapsed time per iteration (s): 0.78 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.240359E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.651 | TFLOPs: 19.88 | 31: iteration 18290/ 173500 | consumed samples: 4682240 | consumed tokens: 9589227520 | 
elapsed time per iteration (s): 0.80 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.223355E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.493 | TFLOPs: 19.39 | 31: iteration 18300/ 173500 | consumed samples: 4684800 | consumed tokens: 9594470400 | elapsed time per iteration (s): 0.82 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.244649E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.565 | TFLOPs: 18.91 | 31: iteration 18310/ 173500 | consumed samples: 4687360 | consumed tokens: 9599713280 | elapsed time per iteration (s): 0.78 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.255729E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.099 | TFLOPs: 19.85 | 31: iteration 18320/ 173500 | consumed samples: 4689920 | consumed tokens: 9604956160 | elapsed time per iteration (s): 0.77 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.205400E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.253 | TFLOPs: 20.22 | 31: iteration 18330/ 173500 | consumed samples: 4692480 | consumed tokens: 9610199040 | elapsed time per iteration (s): 0.77 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.226902E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.940 | TFLOPs: 20.02 | 31: iteration 18340/ 173500 | consumed samples: 4695040 | consumed tokens: 9615441920 | elapsed time per iteration (s): 0.75 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.230847E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.133 | TFLOPs: 
20.58 | 31: iteration 18350/ 173500 | consumed samples: 4697600 | consumed tokens: 9620684800 | elapsed time per iteration (s): 0.74 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.234318E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.931 | TFLOPs: 20.87 | 31: iteration 18360/ 173500 | consumed samples: 4700160 | consumed tokens: 9625927680 | elapsed time per iteration (s): 0.75 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.199723E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.935 | TFLOPs: 20.57 | 31: iteration 18370/ 173500 | consumed samples: 4702720 | consumed tokens: 9631170560 | elapsed time per iteration (s): 0.74 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.259105E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.018 | TFLOPs: 20.87 | 31: iteration 18380/ 173500 | consumed samples: 4705280 | consumed tokens: 9636413440 | elapsed time per iteration (s): 0.79 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.229556E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.755 | TFLOPs: 19.53 | 31: iteration 18390/ 173500 | consumed samples: 4707840 | consumed tokens: 9641656320 | elapsed time per iteration (s): 0.76 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.247619E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.970 | TFLOPs: 20.45 | 31: iteration 18400/ 173500 | consumed samples: 4710400 | consumed tokens: 9646899200 | elapsed time per iteration (s): 0.74 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.218026E+00 | grad norm: 0.158 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.481 | TFLOPs: 20.96 | 31: iteration 18410/ 173500 | consumed samples: 4712960 | consumed tokens: 9652142080 | elapsed time per iteration (s): 0.80 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.196182E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.706 | TFLOPs: 19.34 | 31: iteration 18420/ 173500 | consumed samples: 4715520 | consumed tokens: 9657384960 | elapsed time per iteration (s): 0.77 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.221732E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.613 | TFLOPs: 20.00 | 31: iteration 18430/ 173500 | consumed samples: 4718080 | consumed tokens: 9662627840 | elapsed time per iteration (s): 0.76 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.237467E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.611 | TFLOPs: 20.36 | 31: iteration 18440/ 173500 | consumed samples: 4720640 | consumed tokens: 9667870720 | elapsed time per iteration (s): 0.81 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.245080E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.385 | TFLOPs: 19.20 | 31: iteration 18450/ 173500 | consumed samples: 4723200 | consumed tokens: 9673113600 | elapsed time per iteration (s): 0.81 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.208352E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.118 | TFLOPs: 19.06 | 31: iteration 18460/ 173500 | consumed samples: 4725760 | consumed tokens: 9678356480 | elapsed time per iteration (s): 0.73 | learning rate: 1.958E-04 
| global batch size: 256 | lm loss: 2.195531E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.402 | TFLOPs: 21.32 | 31: iteration 18470/ 173500 | consumed samples: 4728320 | consumed tokens: 9683599360 | elapsed time per iteration (s): 0.77 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.214408E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.576 | TFLOPs: 20.00 | 31: iteration 18480/ 173500 | consumed samples: 4730880 | consumed tokens: 9688842240 | elapsed time per iteration (s): 0.78 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.202604E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.221 | TFLOPs: 19.80 | 31: iteration 18490/ 173500 | consumed samples: 4733440 | consumed tokens: 9694085120 | elapsed time per iteration (s): 0.80 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.219223E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.828 | TFLOPs: 19.29 | 31: iteration 18500/ 173500 | consumed samples: 4736000 | consumed tokens: 9699328000 | elapsed time per iteration (s): 0.72 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.228072E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.603 | TFLOPs: 21.51 | 31: iteration 18510/ 173500 | consumed samples: 4738560 | consumed tokens: 9704570880 | elapsed time per iteration (s): 0.75 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.217108E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.993 | TFLOPs: 20.69 | 31: iteration 18520/ 173500 | consumed samples: 4741120 | 
consumed tokens: 9709813760 | elapsed time per iteration (s): 0.75 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.194619E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.699 | TFLOPs: 20.61 | 31: iteration 18530/ 173500 | consumed samples: 4743680 | consumed tokens: 9715056640 | elapsed time per iteration (s): 0.78 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.220218E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.383 | TFLOPs: 19.93 | 31: iteration 18540/ 173500 | consumed samples: 4746240 | consumed tokens: 9720299520 | elapsed time per iteration (s): 0.75 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.238085E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.548 | TFLOPs: 20.66 | 31: iteration 18550/ 173500 | consumed samples: 4748800 | consumed tokens: 9725542400 | elapsed time per iteration (s): 0.74 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.240313E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.478 | TFLOPs: 21.02 | 31: iteration 18560/ 173500 | consumed samples: 4751360 | consumed tokens: 9730785280 | elapsed time per iteration (s): 0.78 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.231817E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.908 | TFLOPs: 19.90 | 31: iteration 18570/ 173500 | consumed samples: 4753920 | consumed tokens: 9736028160 | elapsed time per iteration (s): 0.75 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.194984E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 341.660 | TFLOPs: 20.67 | 31: iteration 18580/ 173500 | consumed samples: 4756480 | consumed tokens: 9741271040 | elapsed time per iteration (s): 0.77 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.235721E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.887 | TFLOPs: 20.08 | 31: iteration 18590/ 173500 | consumed samples: 4759040 | consumed tokens: 9746513920 | elapsed time per iteration (s): 0.76 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.212477E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.723 | TFLOPs: 20.31 | 31: iteration 18600/ 173500 | consumed samples: 4761600 | consumed tokens: 9751756800 | elapsed time per iteration (s): 0.77 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.211017E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.797 | TFLOPs: 20.19 | 31: iteration 18610/ 173500 | consumed samples: 4764160 | consumed tokens: 9756999680 | elapsed time per iteration (s): 0.77 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.229753E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.417 | TFLOPs: 20.05 | 31: iteration 18620/ 173500 | consumed samples: 4766720 | consumed tokens: 9762242560 | elapsed time per iteration (s): 0.76 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.221784E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.064 | TFLOPs: 20.33 | 31: iteration 18630/ 173500 | consumed samples: 4769280 | consumed tokens: 9767485440 | elapsed time per iteration (s): 0.80 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.231787E+00 | grad norm: 
0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.942 | TFLOPs: 19.36 | 31: iteration 18640/ 173500 | consumed samples: 4771840 | consumed tokens: 9772728320 | elapsed time per iteration (s): 0.80 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.221504E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.910 | TFLOPs: 19.35 | 31: iteration 18650/ 173500 | consumed samples: 4774400 | consumed tokens: 9777971200 | elapsed time per iteration (s): 0.80 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.220057E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.559 | TFLOPs: 19.27 | 31: iteration 18660/ 173500 | consumed samples: 4776960 | consumed tokens: 9783214080 | elapsed time per iteration (s): 0.76 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.243308E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.180 | TFLOPs: 20.34 | 31: iteration 18670/ 173500 | consumed samples: 4779520 | consumed tokens: 9788456960 | elapsed time per iteration (s): 0.76 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.199791E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.127 | TFLOPs: 20.46 | 31: iteration 18680/ 173500 | consumed samples: 4782080 | consumed tokens: 9793699840 | elapsed time per iteration (s): 0.80 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.265078E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.787 | TFLOPs: 19.35 | 31: iteration 18690/ 173500 | consumed samples: 4784640 | consumed tokens: 9798942720 | elapsed time per iteration (s): 
0.78 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.217115E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.474 | TFLOPs: 19.93 | 31: iteration 18700/ 173500 | consumed samples: 4787200 | consumed tokens: 9804185600 | elapsed time per iteration (s): 0.76 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.226366E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.670 | TFLOPs: 20.37 | 31: iteration 18710/ 173500 | consumed samples: 4789760 | consumed tokens: 9809428480 | elapsed time per iteration (s): 0.82 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.235503E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.194 | TFLOPs: 18.95 | 31: iteration 18720/ 173500 | consumed samples: 4792320 | consumed tokens: 9814671360 | elapsed time per iteration (s): 0.73 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.222487E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.326 | TFLOPs: 21.19 | 31: iteration 18730/ 173500 | consumed samples: 4794880 | consumed tokens: 9819914240 | elapsed time per iteration (s): 0.77 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.221272E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.007 | TFLOPs: 20.03 | 31: iteration 18740/ 173500 | consumed samples: 4797440 | consumed tokens: 9825157120 | elapsed time per iteration (s): 0.76 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.225047E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.678 | TFLOPs: 20.43 | 31: iteration 18750/ 
173500 | consumed samples: 4800000 | consumed tokens: 9830400000 | elapsed time per iteration (s): 0.73 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.215892E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.803 | TFLOPs: 21.10 | 31: iteration 18760/ 173500 | consumed samples: 4802560 | consumed tokens: 9835642880 | elapsed time per iteration (s): 0.76 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.234896E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.162 | TFLOPs: 20.28 | 31: iteration 18770/ 173500 | consumed samples: 4805120 | consumed tokens: 9840885760 | elapsed time per iteration (s): 0.74 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.181083E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.623 | TFLOPs: 20.91 | 31: iteration 18780/ 173500 | consumed samples: 4807680 | consumed tokens: 9846128640 | elapsed time per iteration (s): 0.76 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.240723E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.630 | TFLOPs: 20.37 | 31: iteration 18790/ 173500 | consumed samples: 4810240 | consumed tokens: 9851371520 | elapsed time per iteration (s): 0.78 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.217461E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.990 | TFLOPs: 19.90 | 31: iteration 18800/ 173500 | consumed samples: 4812800 | consumed tokens: 9856614400 | elapsed time per iteration (s): 0.81 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.225238E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 317.639 | TFLOPs: 19.22 | 31: iteration 18810/ 173500 | consumed samples: 4815360 | consumed tokens: 9861857280 | elapsed time per iteration (s): 0.83 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.188481E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.677 | TFLOPs: 18.55 | 31: iteration 18820/ 173500 | consumed samples: 4817920 | consumed tokens: 9867100160 | elapsed time per iteration (s): 0.81 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.230761E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.588 | TFLOPs: 19.21 | 31: iteration 18830/ 173500 | consumed samples: 4820480 | consumed tokens: 9872343040 | elapsed time per iteration (s): 0.82 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.223573E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.073 | TFLOPs: 18.88 | 31: iteration 18840/ 173500 | consumed samples: 4823040 | consumed tokens: 9877585920 | elapsed time per iteration (s): 0.78 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.194923E+00 | grad norm: 4.285 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.692 | TFLOPs: 19.89 | 31: iteration 18850/ 173500 | consumed samples: 4825600 | consumed tokens: 9882828800 | elapsed time per iteration (s): 0.75 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.233345E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.844 | TFLOPs: 20.74 | 31: iteration 18860/ 173500 | consumed samples: 4828160 | consumed tokens: 9888071680 | elapsed time per iteration (s): 0.75 | learning rate: 1.956E-04 | global batch size: 256 | 
lm loss: 2.223816E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.952 | TFLOPs: 20.63 | 31: iteration 18870/ 173500 | consumed samples: 4830720 | consumed tokens: 9893314560 | elapsed time per iteration (s): 0.73 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.227390E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.067 | TFLOPs: 21.30 | 31: iteration 18880/ 173500 | consumed samples: 4833280 | consumed tokens: 9898557440 | elapsed time per iteration (s): 0.74 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.222086E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.178 | TFLOPs: 21.00 | 31: iteration 18890/ 173500 | consumed samples: 4835840 | consumed tokens: 9903800320 | elapsed time per iteration (s): 0.80 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.240703E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.069 | TFLOPs: 19.42 | 31: iteration 18900/ 173500 | consumed samples: 4838400 | consumed tokens: 9909043200 | elapsed time per iteration (s): 0.80 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.211613E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.169 | TFLOPs: 19.31 | 31: iteration 18910/ 173500 | consumed samples: 4840960 | consumed tokens: 9914286080 | elapsed time per iteration (s): 0.76 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.225238E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.247 | TFLOPs: 20.28 | 31: iteration 18920/ 173500 | consumed samples: 4843520 | consumed tokens: 
9919528960 | elapsed time per iteration (s): 0.72 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.199200E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.408 | TFLOPs: 21.56 |
31: iteration 18930/ 173500 | consumed samples: 4846080 | consumed tokens: 9924771840 | elapsed time per iteration (s): 0.76 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.214956E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.257 | TFLOPs: 20.40 |
31: iteration 18940/ 173500 | consumed samples: 4848640 | consumed tokens: 9930014720 | elapsed time per iteration (s): 0.72 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.258665E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.185 | TFLOPs: 21.37 |
31: iteration 18950/ 173500 | consumed samples: 4851200 | consumed tokens: 9935257600 | elapsed time per iteration (s): 1.80 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.217405E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 142.190 | TFLOPs: 8.60 |
31: iteration 18960/ 173500 | consumed samples: 4853760 | consumed tokens: 9940500480 | elapsed time per iteration (s): 0.75 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.189770E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.680 | TFLOPs: 20.55 |
31: iteration 18970/ 173500 | consumed samples: 4856320 | consumed tokens: 9945743360 | elapsed time per iteration (s): 0.78 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.239557E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.638 | TFLOPs: 19.76 |
31: iteration 18980/ 173500 | consumed samples: 4858880 | consumed tokens: 9950986240 | elapsed time per iteration (s): 0.77 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.236427E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.004 | TFLOPs: 20.15 |
31: iteration 18990/ 173500 | consumed samples: 4861440 | consumed tokens: 9956229120 | elapsed time per iteration (s): 0.86 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.228488E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.255 | TFLOPs: 18.10 |
31: iteration 19000/ 173500 | consumed samples: 4864000 | consumed tokens: 9961472000 | elapsed time per iteration (s): 0.78 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.240499E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.624 | TFLOPs: 19.82 |
31: -------------------------------------------------------------------------------------------
31: valid loss at iteration 19000 | lm loss value: 2.162020E+00 | lm loss PPL: 8.688671E+00 |
31: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 19000 to checkpoints_1b1long
0: [2022-11-25 22:20:50,404] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step19000 is begin to save!
0: [2022-11-25 22:20:50,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_01-model_00-model_states.pt...
0: [2022-11-25 22:20:50,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_01-model_00-model_states.pt.
0: [2022-11-25 22:20:50,609] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_03-model_00-model_states.pt... 0: [2022-11-25 22:20:50,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_03-model_00-model_states.pt. 0: [2022-11-25 22:20:50,689] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_04-model_00-model_states.pt... 0: [2022-11-25 22:20:50,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_04-model_00-model_states.pt. 0: [2022-11-25 22:20:50,764] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_05-model_00-model_states.pt... 0: [2022-11-25 22:20:50,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_05-model_00-model_states.pt. 0: [2022-11-25 22:20:50,837] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_06-model_00-model_states.pt... 0: [2022-11-25 22:20:50,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_06-model_00-model_states.pt. 0: [2022-11-25 22:20:50,911] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_07-model_00-model_states.pt... 0: [2022-11-25 22:20:50,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_07-model_00-model_states.pt. 0: [2022-11-25 22:20:50,985] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_08-model_00-model_states.pt... 0: [2022-11-25 22:20:51,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_08-model_00-model_states.pt. 
0: [2022-11-25 22:20:51,060] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_09-model_00-model_states.pt... 0: [2022-11-25 22:20:51,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_09-model_00-model_states.pt. 0: [2022-11-25 22:20:51,134] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_10-model_00-model_states.pt... 0: [2022-11-25 22:20:51,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_10-model_00-model_states.pt. 0: [2022-11-25 22:20:51,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_11-model_00-model_states.pt... 0: [2022-11-25 22:20:51,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_11-model_00-model_states.pt. 0: [2022-11-25 22:20:51,284] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_12-model_00-model_states.pt... 0: [2022-11-25 22:20:51,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_12-model_00-model_states.pt. 0: [2022-11-25 22:20:51,357] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_13-model_00-model_states.pt... 0: [2022-11-25 22:20:51,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_13-model_00-model_states.pt. 0: [2022-11-25 22:20:51,432] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_14-model_00-model_states.pt... 0: [2022-11-25 22:20:51,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_14-model_00-model_states.pt. 
0: [2022-11-25 22:20:51,506] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_15-model_00-model_states.pt... 0: [2022-11-25 22:20:51,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_15-model_00-model_states.pt. 0: [2022-11-25 22:20:51,578] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_16-model_00-model_states.pt... 0: [2022-11-25 22:20:51,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_16-model_00-model_states.pt. 0: [2022-11-25 22:20:51,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_17-model_00-model_states.pt... 0: [2022-11-25 22:20:51,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_17-model_00-model_states.pt. 0: [2022-11-25 22:20:51,727] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_18-model_00-model_states.pt... 0: [2022-11-25 22:20:51,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_18-model_00-model_states.pt. 0: [2022-11-25 22:20:51,801] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_19-model_00-model_states.pt... 0: [2022-11-25 22:20:51,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_19-model_00-model_states.pt. 0: [2022-11-25 22:20:51,874] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_20-model_00-model_states.pt... 0: [2022-11-25 22:20:51,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_20-model_00-model_states.pt. 
0: [2022-11-25 22:20:51,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_21-model_00-model_states.pt... 0: [2022-11-25 22:20:52,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_21-model_00-model_states.pt. 0: [2022-11-25 22:20:52,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_22-model_00-model_states.pt... 0: [2022-11-25 22:20:52,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_22-model_00-model_states.pt. 0: [2022-11-25 22:20:52,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_23-model_00-model_states.pt... 0: [2022-11-25 22:20:52,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_23-model_00-model_states.pt. 0: [2022-11-25 22:20:52,170] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_24-model_00-model_states.pt... 0: [2022-11-25 22:20:52,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_24-model_00-model_states.pt. 0: [2022-11-25 22:20:52,242] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_25-model_00-model_states.pt... 0: [2022-11-25 22:20:52,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_25-model_00-model_states.pt. 0: [2022-11-25 22:20:52,314] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_26-model_00-model_states.pt... 0: [2022-11-25 22:20:52,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_26-model_00-model_states.pt. 
0: [2022-11-25 22:20:52,390] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_27-model_00-model_states.pt... 0: [2022-11-25 22:20:52,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_27-model_00-model_states.pt. 0: [2022-11-25 22:20:52,464] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_28-model_00-model_states.pt... 0: [2022-11-25 22:20:52,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_28-model_00-model_states.pt. 0: [2022-11-25 22:20:52,538] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/layer_30-model_00-model_states.pt... 0: [2022-11-25 22:20:52,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/layer_30-model_00-model_states.pt. 0: [2022-11-25 22:20:52,540] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step19000/mp_rank_00_model_states.pt 0: [2022-11-25 22:20:52,540] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/mp_rank_00_model_states.pt... 0: [2022-11-25 22:20:52,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/mp_rank_00_model_states.pt. 0: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
5: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
9: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 
16: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
3: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
20: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 
24: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 
31: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 
17: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 
19: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
7: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
1: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 
23: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 
14: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 
30: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 
18: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 
0: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 
7: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
16: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 
15: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
31: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 
27: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 
13: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 
30: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:20:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 
17: [2022-11-25 22:20:52,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-25 22:20:52,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-25 22:20:52,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 0: [2022-11-25 22:20:52,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 30: [2022-11-25 22:20:52,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-25 22:20:52,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-25 22:20:52,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 3: [2022-11-25 22:20:52,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:20:52,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
3: [2022-11-25 22:20:52,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 18: [2022-11-25 22:20:52,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 3: [2022-11-25 22:20:52,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 18: [2022-11-25 22:20:52,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 26: [2022-11-25 22:20:52,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:20:52,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:20:52,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-25 22:20:52,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 15: [2022-11-25 22:20:52,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 22:20:52,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 6: [2022-11-25 22:20:52,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-25 22:20:52,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 22:20:52,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 22: [2022-11-25 22:20:52,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:20:52,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:20:52,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-25 22:20:52,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 12: [2022-11-25 22:20:52,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 22:20:52,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 23: [2022-11-25 22:20:52,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-25 22:20:52,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-25 22:20:52,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 16: [2022-11-25 22:20:52,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
16: [2022-11-25 22:20:52,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-25 22:20:52,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 4: [2022-11-25 22:20:52,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:20:52,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 22:20:52,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 2: [2022-11-25 22:20:52,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:20:52,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:20:52,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
2: [2022-11-25 22:20:52,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 11: [2022-11-25 22:20:52,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 14: [2022-11-25 22:20:52,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 2: [2022-11-25 22:20:52,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 11: [2022-11-25 22:20:52,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 14: [2022-11-25 22:20:52,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 1: [2022-11-25 22:20:52,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:20:52,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 22:20:52,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 2: [2022-11-25 22:20:52,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:20:52,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 25: [2022-11-25 22:20:52,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
22: [2022-11-25 22:20:52,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:20:52,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 2: [2022-11-25 22:20:52,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 6: [2022-11-25 22:20:52,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:20:52,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 6: [2022-11-25 22:20:52,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 25: [2022-11-25 22:20:52,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 1: [2022-11-25 22:20:52,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:20:52,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:20:52,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 3: [2022-11-25 22:20:52,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 22:20:52,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
1: [2022-11-25 22:20:52,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 25: [2022-11-25 22:20:52,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 1: [2022-11-25 22:20:52,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 1: [2022-11-25 22:20:52,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:20:52,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:20:52,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 13: [2022-11-25 22:20:52,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 22:20:52,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 1: [2022-11-25 22:20:52,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 12: [2022-11-25 22:20:52,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:20:52,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 22:20:52,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
4: [2022-11-25 22:20:52,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:20:52,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 22:20:52,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 26: [2022-11-25 22:20:52,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:20:52,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-25 22:20:52,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 9: [2022-11-25 22:20:52,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:20:52,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 22:20:52,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 9: [2022-11-25 22:20:52,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:20:52,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 22:20:52,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
0: [2022-11-25 22:20:52,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:20:52,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:20:52,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 0: [2022-11-25 22:20:52,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 22:20:52,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 28: [2022-11-25 22:20:52,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-25 22:20:52,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-25 22:20:52,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 15: [2022-11-25 22:20:52,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 16: [2022-11-25 22:20:52,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-25 22:20:52,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-25 22:20:52,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
26: [2022-11-25 22:20:52,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:20:52,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-25 22:20:52,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 14: [2022-11-25 22:20:52,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:20:52,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 22:20:52,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 12: [2022-11-25 22:20:52,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 25: [2022-11-25 22:20:52,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:20:52,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
12: [2022-11-25 22:20:52,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 25: [2022-11-25 22:20:52,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 12: [2022-11-25 22:20:52,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 25: [2022-11-25 22:20:52,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 27: [2022-11-25 22:20:52,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-25 22:20:52,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 10: [2022-11-25 22:20:52,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:20:52,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 22:20:52,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 18: [2022-11-25 22:20:52,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:20:52,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 3: [2022-11-25 22:20:52,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
18: [2022-11-25 22:20:52,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 3: [2022-11-25 22:20:52,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 22:20:52,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 30: [2022-11-25 22:20:52,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-25 22:20:52,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-25 22:20:52,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 11: [2022-11-25 22:20:52,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:20:52,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 22:20:52,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 30: [2022-11-25 22:20:52,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-25 22:20:52,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-25 22:20:52,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
28: [2022-11-25 22:20:52,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-25 22:20:52,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-25 22:20:52,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 28: [2022-11-25 22:20:52,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-25 22:20:52,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-25 22:20:52,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 17: [2022-11-25 22:20:52,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-25 22:20:52,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-25 22:20:52,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 16: [2022-11-25 22:20:52,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-25 22:20:52,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-25 22:20:52,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
31: [2022-11-25 22:20:52,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-25 22:20:52,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-25 22:20:52,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 2: [2022-11-25 22:20:52,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:20:52,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 22:20:52,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 25: [2022-11-25 22:20:52,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:20:52,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:20:52,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-25 22:20:52,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 25: [2022-11-25 22:20:52,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-25 22:20:52,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
27: [2022-11-25 22:20:52,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:20:52,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-25 22:20:52,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 9: [2022-11-25 22:20:52,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:20:52,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 22:20:52,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 13: [2022-11-25 22:20:52,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:20:52,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 22:20:52,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 22: [2022-11-25 22:20:52,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:20:52,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-25 22:20:52,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
13: [2022-11-25 22:20:52,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:20:52,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:20:52,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 3: [2022-11-25 22:20:52,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 13: [2022-11-25 22:20:52,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 3: [2022-11-25 22:20:52,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 26: [2022-11-25 22:20:52,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:20:52,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-25 22:20:52,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 28: [2022-11-25 22:20:52,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-25 22:20:52,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-25 22:20:52,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
1: [2022-11-25 22:20:52,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:20:52,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 22:20:52,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 27: [2022-11-25 22:20:52,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:20:52,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-25 22:20:52,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 15: [2022-11-25 22:20:52,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:20:52,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 22:20:52,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 11: [2022-11-25 22:20:52,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:20:52,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 22:20:52,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
15: [2022-11-25 22:20:52,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:20:52,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 22:20:52,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 6: [2022-11-25 22:20:52,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:20:52,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:20:52,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 31: [2022-11-25 22:20:52,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:20:52,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 31: [2022-11-25 22:20:52,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 0: [2022-11-25 22:20:52,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:20:52,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 31: [2022-11-25 22:20:52,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
5: [2022-11-25 22:20:52,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 0: [2022-11-25 22:20:52,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 6: [2022-11-25 22:20:52,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 5: [2022-11-25 22:20:52,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 14: [2022-11-25 22:20:52,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:20:52,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 6: [2022-11-25 22:20:52,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 5: [2022-11-25 22:20:52,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:20:52,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:20:52,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 14: [2022-11-25 22:20:52,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 5: [2022-11-25 22:20:52,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
14: [2022-11-25 22:20:52,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 5: [2022-11-25 22:20:52,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:20:52,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 5: [2022-11-25 22:20:52,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:20:52,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 14: [2022-11-25 22:20:52,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 5: [2022-11-25 22:20:52,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 5: [2022-11-25 22:20:52,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 22:20:52,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 10: [2022-11-25 22:20:52,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:20:52,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-25 22:20:52,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 22:20:52,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 22:20:52,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 10: [2022-11-25 22:20:52,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 23: [2022-11-25 22:20:52,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-25 22:20:52,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-25 22:20:52,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-25 22:20:52,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-25 22:20:52,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 23: [2022-11-25 22:20:52,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 13: [2022-11-25 22:20:52,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-25 22:20:52,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 4: [2022-11-25 22:20:52,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:20:52,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 4: [2022-11-25 22:20:52,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 22:20:52,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 30: [2022-11-25 22:20:52,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:20:52,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 30: [2022-11-25 22:20:52,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 4: [2022-11-25 22:20:52,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 30: [2022-11-25 22:20:52,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 4: [2022-11-25 22:20:52,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 9: [2022-11-25 22:20:52,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-25 22:20:52,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 22:20:52,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 10: [2022-11-25 22:20:52,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:20:52,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 22:20:52,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 2: [2022-11-25 22:20:52,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:20:52,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 22:20:52,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 23: [2022-11-25 22:20:52,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-25 22:20:52,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-25 22:20:52,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 16: [2022-11-25 22:20:52,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 
16: [2022-11-25 22:20:52,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-25 22:20:52,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 31: [2022-11-25 22:20:52,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-25 22:20:52,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-25 22:20:52,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 25: [2022-11-25 22:20:52,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-25 22:20:52,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-25 22:20:52,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 0: [2022-11-25 22:20:52,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:20:52,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 22:20:52,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 22: [2022-11-25 22:20:52,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
22: [2022-11-25 22:20:52,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-25 22:20:52,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 12: [2022-11-25 22:20:52,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:20:52,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 22:20:52,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 18: [2022-11-25 22:20:52,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:20:52,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-25 22:20:52,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 17: [2022-11-25 22:20:52,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-25 22:20:52,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-25 22:20:52,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 17: [2022-11-25 22:20:52,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 
17: [2022-11-25 22:20:52,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-25 22:20:52,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 29: [2022-11-25 22:20:52,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:20:52,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:20:52,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:20:52,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:20:52,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-25 22:20:52,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-25 22:20:52,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-25 22:20:52,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
29: [2022-11-25 22:20:52,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-25 22:20:52,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 29: [2022-11-25 22:20:52,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 29: [2022-11-25 22:20:52,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 31: [2022-11-25 22:20:52,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-25 22:20:52,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-25 22:20:52,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 0: [2022-11-25 22:20:52,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 22:20:52,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 11: [2022-11-25 22:20:52,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:20:52,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 22:20:52,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
27: [2022-11-25 22:20:52,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:20:52,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-25 22:20:52,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 24: [2022-11-25 22:20:52,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-25 22:20:52,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-25 22:20:52,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-25 22:20:52,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-25 22:20:52,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-25 22:20:52,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-25 22:20:52,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
24: [2022-11-25 22:20:52,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-25 22:20:52,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-25 22:20:52,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 24: [2022-11-25 22:20:52,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 24: [2022-11-25 22:20:52,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 20: [2022-11-25 22:20:52,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:20:52,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:20:52,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:20:52,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-25 22:20:52,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-25 22:20:52,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-25 22:20:52,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-25 22:20:52,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-25 22:20:52,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 20: [2022-11-25 22:20:52,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 20: [2022-11-25 22:20:52,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 20: [2022-11-25 22:20:52,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 3: [2022-11-25 22:20:52,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:20:52,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 22:20:52,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 1: [2022-11-25 22:20:52,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-25 22:20:52,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 22:20:52,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 14: [2022-11-25 22:20:52,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:20:52,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 22:20:52,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 29: [2022-11-25 22:20:52,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:20:52,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-25 22:20:52,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 15: [2022-11-25 22:20:52,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:20:52,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 22:20:52,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 5: [2022-11-25 22:20:52,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-25 22:20:52,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 22:20:52,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 19: [2022-11-25 22:20:52,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-25 22:20:52,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-25 22:20:52,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-25 22:20:52,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-25 22:20:52,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-25 22:20:52,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-25 22:20:52,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-25 22:20:52,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-25 22:20:52,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-25 22:20:52,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-25 22:20:52,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 19: [2022-11-25 22:20:52,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 19: [2022-11-25 22:20:52,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 19: [2022-11-25 22:20:52,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 19: [2022-11-25 22:20:52,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 12: [2022-11-25 22:20:52,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:20:52,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 
12: [2022-11-25 22:20:52,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 20: [2022-11-25 22:20:52,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 12: [2022-11-25 22:20:52,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 20: [2022-11-25 22:20:52,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 26: [2022-11-25 22:20:52,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:20:52,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-25 22:20:52,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 30: [2022-11-25 22:20:52,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-25 22:20:52,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-25 22:20:52,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 21: [2022-11-25 22:20:52,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-25 22:20:52,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
21: [2022-11-25 22:20:52,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-25 22:20:52,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-25 22:20:52,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-25 22:20:52,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-25 22:20:52,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-25 22:20:52,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-25 22:20:52,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-25 22:20:52,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-25 22:20:52,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 21: [2022-11-25 22:20:52,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 21: [2022-11-25 22:20:52,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 21: [2022-11-25 22:20:52,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
21: [2022-11-25 22:20:52,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 2: [2022-11-25 22:20:52,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:20:52,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 22:20:52,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 9: [2022-11-25 22:20:52,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:20:52,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 22:20:52,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 6: [2022-11-25 22:20:52,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:20:52,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 22:20:52,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 28: [2022-11-25 22:20:52,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-25 22:20:52,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 4: [2022-11-25 22:20:52,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:20:52,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 22:20:52,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 27: [2022-11-25 22:20:52,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:20:52,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-25 22:20:52,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 16: [2022-11-25 22:20:52,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-25 22:20:52,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-25 22:20:52,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 25: [2022-11-25 22:20:52,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
25: [2022-11-25 22:20:52,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-25 22:20:52,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 10: [2022-11-25 22:20:52,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:20:52,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 22:20:52,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 28: [2022-11-25 22:20:52,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 18: [2022-11-25 22:20:52,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:20:52,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-25 22:20:52,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 0: [2022-11-25 22:20:52,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:20:52,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 22:20:52,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
22: [2022-11-25 22:20:52,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:20:52,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 13: [2022-11-25 22:20:52,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:20:52,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 13: [2022-11-25 22:20:52,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 22:20:52,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 11: [2022-11-25 22:20:52,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:20:52,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 24: [2022-11-25 22:20:52,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:20:52,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 24: [2022-11-25 22:20:52,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-25 22:20:52,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
23: [2022-11-25 22:20:52,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-25 22:20:52,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-25 22:20:52,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 17: [2022-11-25 22:20:52,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-25 22:20:52,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-25 22:20:52,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 1: [2022-11-25 22:20:52,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:20:52,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 22:20:52,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 19: [2022-11-25 22:20:52,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-25 22:20:52,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-25 22:20:52,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
14: [2022-11-25 22:20:52,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:20:52,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 22:20:52,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 3: [2022-11-25 22:20:52,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:20:52,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 22:20:52,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 12: [2022-11-25 22:20:52,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:20:52,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 22:20:52,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 21: [2022-11-25 22:20:52,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-25 22:20:52,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-25 22:20:52,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
29: [2022-11-25 22:20:52,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:20:52,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-25 22:20:52,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 30: [2022-11-25 22:20:52,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-25 22:20:52,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-25 22:20:52,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 28: [2022-11-25 22:20:52,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:20:52,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:20:52,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 22:20:52,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 20: [2022-11-25 22:20:52,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 
20: [2022-11-25 22:20:52,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-25 22:20:52,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 31: [2022-11-25 22:20:52,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-25 22:20:52,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-25 22:20:52,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 10: [2022-11-25 22:20:52,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:20:52,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 22:20:52,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 28: [2022-11-25 22:20:52,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-25 22:20:52,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 15: [2022-11-25 22:20:52,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-25 22:20:52,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 22:20:52,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 9: [2022-11-25 22:20:52,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:20:52,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 22:20:52,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 2: [2022-11-25 22:20:52,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:20:52,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 25: [2022-11-25 22:20:52,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:20:52,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:20:52,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 26: [2022-11-25 22:20:52,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-25 22:20:52,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
25: [2022-11-25 22:20:52,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-25 22:20:52,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 6: [2022-11-25 22:20:52,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:20:52,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 4: [2022-11-25 22:20:52,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:20:52,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 4: [2022-11-25 22:20:52,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 22:20:52,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 27: [2022-11-25 22:20:52,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:20:52,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-25 22:20:52,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 16: [2022-11-25 22:20:52,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
16: [2022-11-25 22:20:52,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-25 22:20:52,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 22: [2022-11-25 22:20:52,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:20:52,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-25 22:20:52,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 17: [2022-11-25 22:20:52,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-25 22:20:52,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-25 22:20:52,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 11: [2022-11-25 22:20:52,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:20:52,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 22:20:52,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 18: [2022-11-25 22:20:52,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
1: [2022-11-25 22:20:52,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:20:52,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-25 22:20:52,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 1: [2022-11-25 22:20:52,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 24: [2022-11-25 22:20:52,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:20:52,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 24: [2022-11-25 22:20:52,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 0: [2022-11-25 22:20:52,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:20:52,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 24: [2022-11-25 22:20:52,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 0: [2022-11-25 22:20:52,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 14: [2022-11-25 22:20:52,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-25 22:20:52,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 22:20:52,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 13: [2022-11-25 22:20:52,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:20:52,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 22:20:52,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 31: [2022-11-25 22:20:52,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-25 22:20:52,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-25 22:20:52,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 23: [2022-11-25 22:20:52,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-25 22:20:52,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-25 22:20:52,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 3: [2022-11-25 22:20:52,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-25 22:20:52,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 22:20:52,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 5: [2022-11-25 22:20:52,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:20:52,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 22:20:52,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 12: [2022-11-25 22:20:52,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:20:52,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 22:20:52,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 29: [2022-11-25 22:20:52,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:20:52,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-25 22:20:52,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 19: [2022-11-25 22:20:52,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
19: [2022-11-25 22:20:52,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-25 22:20:52,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 9: [2022-11-25 22:20:52,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:20:52,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 22:20:52,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 8: [2022-11-25 22:20:52,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:20:52,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 22:20:52,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 15: [2022-11-25 22:20:52,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:20:52,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 22:20:52,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 20: [2022-11-25 22:20:52,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
20: [2022-11-25 22:20:52,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-25 22:20:52,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 21: [2022-11-25 22:20:52,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-25 22:20:52,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-25 22:20:52,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 7: [2022-11-25 22:20:52,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 25: [2022-11-25 22:20:52,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:20:52,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 22:20:52,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 30: [2022-11-25 22:20:52,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 25: [2022-11-25 22:20:52,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-25 22:20:52,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
30: [2022-11-25 22:20:52,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-25 22:20:52,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 6: [2022-11-25 22:20:52,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:20:52,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:20:52,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 2: [2022-11-25 22:20:52,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 6: [2022-11-25 22:20:52,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 2: [2022-11-25 22:20:52,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 26: [2022-11-25 22:20:52,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:20:52,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-25 22:20:52,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 8: [2022-11-25 22:20:52,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-25 22:20:52,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 22:20:52,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 10: [2022-11-25 22:20:52,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:20:52,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:20:52,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 22: [2022-11-25 22:20:52,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 10: [2022-11-25 22:20:52,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 22: [2022-11-25 22:20:52,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 4: [2022-11-25 22:20:52,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:20:52,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 22:20:52,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 27: [2022-11-25 22:20:52,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 
7: [2022-11-25 22:20:52,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:20:52,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 22:20:52,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 27: [2022-11-25 22:20:52,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-25 22:20:52,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 11: [2022-11-25 22:20:52,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:20:52,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 22:20:52,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 16: [2022-11-25 22:20:52,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-25 22:20:52,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-25 22:20:52,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 0: [2022-11-25 22:20:52,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-25 22:20:52,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 22:20:52,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 24: [2022-11-25 22:20:52,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-25 22:20:52,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-25 22:20:52,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 18: [2022-11-25 22:20:52,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:20:52,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-25 22:20:52,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 1: [2022-11-25 22:20:52,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:20:52,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 22:20:52,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 19: [2022-11-25 22:20:52,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
19: [2022-11-25 22:20:52,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-25 22:20:52,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 29: [2022-11-25 22:20:52,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:20:52,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-25 22:20:52,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 6: [2022-11-25 22:20:52,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:20:52,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 22:20:52,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 3: [2022-11-25 22:20:52,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:20:52,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 22:20:52,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 12: [2022-11-25 22:20:52,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-25 22:20:52,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 22:20:52,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 14: [2022-11-25 22:20:52,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:20:52,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 5: [2022-11-25 22:20:52,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:20:52,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 14: [2022-11-25 22:20:52,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 5: [2022-11-25 22:20:52,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 31: [2022-11-25 22:20:52,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-25 22:20:52,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-25 22:20:52,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 23: [2022-11-25 22:20:52,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
23: [2022-11-25 22:20:52,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 30: [2022-11-25 22:20:52,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 23: [2022-11-25 22:20:52,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 30: [2022-11-25 22:20:52,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-25 22:20:52,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 20: [2022-11-25 22:20:52,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:20:52,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-25 22:20:52,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 17: [2022-11-25 22:20:52,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-25 22:20:52,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-25 22:20:52,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 2: [2022-11-25 22:20:52,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-25 22:20:52,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 22:20:52,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 28: [2022-11-25 22:20:52,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:20:52,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:20:52,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 22:20:52,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 13: [2022-11-25 22:20:52,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:20:52,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 22:20:52,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 21: [2022-11-25 22:20:52,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-25 22:20:52,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-25 22:20:52,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
15: [2022-11-25 22:20:52,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:20:52,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 22:20:52,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 28: [2022-11-25 22:20:52,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-25 22:20:52,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 16: [2022-11-25 22:20:52,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-25 22:20:52,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-25 22:20:52,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 4: [2022-11-25 22:20:52,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:20:52,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 22:20:52,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 10: [2022-11-25 22:20:52,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-25 22:20:52,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 22:20:52,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 26: [2022-11-25 22:20:52,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:20:52,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-25 22:20:52,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 28: [2022-11-25 22:20:52,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:20:52,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:20:52,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 22:20:52,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 25: [2022-11-25 22:20:52,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 
25: [2022-11-25 22:20:52,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 27: [2022-11-25 22:20:52,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 25: [2022-11-25 22:20:52,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 17: [2022-11-25 22:20:52,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:20:52,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-25 22:20:52,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 17: [2022-11-25 22:20:52,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-25 22:20:52,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 22: [2022-11-25 22:20:52,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:20:52,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-25 22:20:52,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 0: [2022-11-25 22:20:52,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-25 22:20:52,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 22:20:52,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 28: [2022-11-25 22:20:52,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-25 22:20:52,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 18: [2022-11-25 22:20:52,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:20:52,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-25 22:20:52,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 11: [2022-11-25 22:20:52,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:20:52,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 22:20:52,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 24: [2022-11-25 22:20:52,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
24: [2022-11-25 22:20:52,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-25 22:20:52,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 31: [2022-11-25 22:20:52,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-25 22:20:52,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-25 22:20:52,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 23: [2022-11-25 22:20:52,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-25 22:20:52,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-25 22:20:52,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 8: [2022-11-25 22:20:52,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:20:52,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 22:20:52,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 7: [2022-11-25 22:20:52,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-25 22:20:52,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 22:20:52,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 7: [2022-11-25 22:20:52,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:20:52,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 22:20:52,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 8: [2022-11-25 22:20:52,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:20:52,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 22:20:52,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 8: [2022-11-25 22:20:52,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:20:52,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-25 22:20:52,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 22:20:52,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 22:20:52,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 8: [2022-11-25 22:20:52,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 8: [2022-11-25 22:20:52,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:20:52,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:20:52,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 22:20:52,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 22:20:52,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 8: [2022-11-25 22:20:52,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 7: [2022-11-25 22:20:52,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:20:52,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-25 22:20:52,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 22:20:52,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 22:20:52,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 7: [2022-11-25 22:20:52,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 7: [2022-11-25 22:20:52,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:20:52,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 22:20:52,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:20:52,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 7: [2022-11-25 22:20:52,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step19000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 22:20:52,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
0: successfully saved checkpoint at iteration 19000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2491.68 31: iteration 19010/ 173500 | consumed samples: 4866560 | consumed tokens: 9966714880 | elapsed time per iteration (s): 1.03 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.242816E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.683 | TFLOPs: 14.98 | 31: iteration 19020/ 173500 | consumed samples: 4869120 | consumed tokens: 9971957760 | elapsed time per iteration (s): 0.83 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.235697E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.570 | TFLOPs: 18.67 | 31: iteration 19030/ 173500 | consumed samples: 4871680 | consumed tokens: 9977200640 | elapsed time per iteration (s): 0.81 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.205458E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.648 | TFLOPs: 19.16 | 31: iteration 19040/ 173500 | consumed samples: 4874240 | consumed tokens: 9982443520 | elapsed time per iteration (s): 0.74 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.242999E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.709 | TFLOPs: 20.85 | 31: iteration 19050/ 173500 | consumed samples: 4876800 | consumed tokens: 9987686400 | elapsed time per iteration (s): 0.74 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.210526E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.286 | TFLOPs: 21.07 | 31: iteration 19060/ 173500 | consumed samples: 4879360 | consumed tokens: 9992929280 | elapsed time per iteration (s): 0.79 | learning 
rate: 1.955E-04 | global batch size: 256 | lm loss: 2.209203E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.912 | TFLOPs: 19.66 | 31: iteration 19070/ 173500 | consumed samples: 4881920 | consumed tokens: 9998172160 | elapsed time per iteration (s): 0.77 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.197529E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.856 | TFLOPs: 20.02 | 31: iteration 19080/ 173500 | consumed samples: 4884480 | consumed tokens: 10003415040 | elapsed time per iteration (s): 0.79 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.225762E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.970 | TFLOPs: 19.72 | 31: iteration 19090/ 173500 | consumed samples: 4887040 | consumed tokens: 10008657920 | elapsed time per iteration (s): 0.74 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.244343E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.512 | TFLOPs: 20.96 | 31: iteration 19100/ 173500 | consumed samples: 4889600 | consumed tokens: 10013900800 | elapsed time per iteration (s): 0.79 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.246672E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.252 | TFLOPs: 19.62 | 31: iteration 19110/ 173500 | consumed samples: 4892160 | consumed tokens: 10019143680 | elapsed time per iteration (s): 0.78 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.249542E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.532 | TFLOPs: 19.81 | 31: iteration 19120/ 173500 | 
consumed samples: 4894720 | consumed tokens: 10024386560 | elapsed time per iteration (s): 0.77 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.214527E+00 | grad norm: 0.237 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.061 | TFLOPs: 20.03 | 31: iteration 19130/ 173500 | consumed samples: 4897280 | consumed tokens: 10029629440 | elapsed time per iteration (s): 0.78 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.208061E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.286 | TFLOPs: 19.98 | 31: iteration 19140/ 173500 | consumed samples: 4899840 | consumed tokens: 10034872320 | elapsed time per iteration (s): 0.80 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.211586E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.931 | TFLOPs: 19.29 | 31: iteration 19150/ 173500 | consumed samples: 4902400 | consumed tokens: 10040115200 | elapsed time per iteration (s): 0.74 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.206595E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.756 | TFLOPs: 20.80 | 31: iteration 19160/ 173500 | consumed samples: 4904960 | consumed tokens: 10045358080 | elapsed time per iteration (s): 0.79 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.230660E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.457 | TFLOPs: 19.63 | 31: iteration 19170/ 173500 | consumed samples: 4907520 | consumed tokens: 10050600960 | elapsed time per iteration (s): 0.82 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.207946E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 311.995 | TFLOPs: 18.87 | 31: iteration 19180/ 173500 | consumed samples: 4910080 | consumed tokens: 10055843840 | elapsed time per iteration (s): 0.78 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.224723E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.076 | TFLOPs: 19.97 | 31: iteration 19190/ 173500 | consumed samples: 4912640 | consumed tokens: 10061086720 | elapsed time per iteration (s): 0.73 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.223152E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.412 | TFLOPs: 21.14 | 31: iteration 19200/ 173500 | consumed samples: 4915200 | consumed tokens: 10066329600 | elapsed time per iteration (s): 0.75 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.225709E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.316 | TFLOPs: 20.65 | 31: iteration 19210/ 173500 | consumed samples: 4917760 | consumed tokens: 10071572480 | elapsed time per iteration (s): 0.81 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.214692E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.447 | TFLOPs: 19.14 | 31: iteration 19220/ 173500 | consumed samples: 4920320 | consumed tokens: 10076815360 | elapsed time per iteration (s): 0.80 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.219101E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.606 | TFLOPs: 19.40 | 31: iteration 19230/ 173500 | consumed samples: 4922880 | consumed tokens: 10082058240 | elapsed time per iteration (s): 0.74 | learning rate: 1.954E-04 | global batch size: 
256 | lm loss: 2.218785E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.095 | TFLOPs: 20.94 | 31: iteration 19240/ 173500 | consumed samples: 4925440 | consumed tokens: 10087301120 | elapsed time per iteration (s): 0.83 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.223835E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.058 | TFLOPs: 18.76 | 31: iteration 19250/ 173500 | consumed samples: 4928000 | consumed tokens: 10092544000 | elapsed time per iteration (s): 0.82 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.246999E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.091 | TFLOPs: 19.00 | 31: iteration 19260/ 173500 | consumed samples: 4930560 | consumed tokens: 10097786880 | elapsed time per iteration (s): 0.84 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.209986E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.422 | TFLOPs: 18.54 | 31: iteration 19270/ 173500 | consumed samples: 4933120 | consumed tokens: 10103029760 | elapsed time per iteration (s): 0.85 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.191137E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.837 | TFLOPs: 18.14 | 31: iteration 19280/ 173500 | consumed samples: 4935680 | consumed tokens: 10108272640 | elapsed time per iteration (s): 0.87 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.210697E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.957 | TFLOPs: 17.84 | 31: iteration 19290/ 173500 | consumed samples: 4938240 | consumed 
tokens: 10113515520 | elapsed time per iteration (s): 0.81 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.226217E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.270 | TFLOPs: 19.13 | 31: iteration 19300/ 173500 | consumed samples: 4940800 | consumed tokens: 10118758400 | elapsed time per iteration (s): 0.81 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.215126E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.472 | TFLOPs: 19.02 | 31: iteration 19310/ 173500 | consumed samples: 4943360 | consumed tokens: 10124001280 | elapsed time per iteration (s): 0.83 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.207039E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.867 | TFLOPs: 18.69 | 31: iteration 19320/ 173500 | consumed samples: 4945920 | consumed tokens: 10129244160 | elapsed time per iteration (s): 0.81 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.197349E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.591 | TFLOPs: 19.21 | 31: iteration 19330/ 173500 | consumed samples: 4948480 | consumed tokens: 10134487040 | elapsed time per iteration (s): 0.79 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.192840E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.342 | TFLOPs: 19.50 | 31: iteration 19340/ 173500 | consumed samples: 4951040 | consumed tokens: 10139729920 | elapsed time per iteration (s): 0.82 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.219770E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 310.867 | TFLOPs: 18.81 | 31: iteration 19350/ 173500 | consumed samples: 4953600 | consumed tokens: 10144972800 | elapsed time per iteration (s): 0.84 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.212684E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.396 | TFLOPs: 18.42 | 31: iteration 19360/ 173500 | consumed samples: 4956160 | consumed tokens: 10150215680 | elapsed time per iteration (s): 0.84 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.205858E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.364 | TFLOPs: 18.41 | 31: iteration 19370/ 173500 | consumed samples: 4958720 | consumed tokens: 10155458560 | elapsed time per iteration (s): 0.82 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.207816E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.392 | TFLOPs: 18.78 | 31: iteration 19380/ 173500 | consumed samples: 4961280 | consumed tokens: 10160701440 | elapsed time per iteration (s): 0.78 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.191528E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.586 | TFLOPs: 19.88 | 31: iteration 19390/ 173500 | consumed samples: 4963840 | consumed tokens: 10165944320 | elapsed time per iteration (s): 0.74 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.217359E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.891 | TFLOPs: 20.80 | 31: iteration 19400/ 173500 | consumed samples: 4966400 | consumed tokens: 10171187200 | elapsed time per iteration (s): 0.84 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.216753E+00 | grad norm: 
0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.116 | TFLOPs: 18.34 | 31: iteration 19410/ 173500 | consumed samples: 4968960 | consumed tokens: 10176430080 | elapsed time per iteration (s): 0.77 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.217741E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.986 | TFLOPs: 20.08 | 31: iteration 19420/ 173500 | consumed samples: 4971520 | consumed tokens: 10181672960 | elapsed time per iteration (s): 0.82 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.225010E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.094 | TFLOPs: 18.88 | 31: iteration 19430/ 173500 | consumed samples: 4974080 | consumed tokens: 10186915840 | elapsed time per iteration (s): 0.77 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.234813E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.860 | TFLOPs: 20.14 | 31: iteration 19440/ 173500 | consumed samples: 4976640 | consumed tokens: 10192158720 | elapsed time per iteration (s): 0.79 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.217772E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.325 | TFLOPs: 19.68 | 31: iteration 19450/ 173500 | consumed samples: 4979200 | consumed tokens: 10197401600 | elapsed time per iteration (s): 0.76 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.232646E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.730 | TFLOPs: 20.25 | 31: iteration 19460/ 173500 | consumed samples: 4981760 | consumed tokens: 10202644480 | elapsed time per 
iteration (s): 0.83 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.205571E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.112 | TFLOPs: 18.58 | 31: iteration 19470/ 173500 | consumed samples: 4984320 | consumed tokens: 10207887360 | elapsed time per iteration (s): 0.73 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.192019E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.446 | TFLOPs: 21.26 | 31: iteration 19480/ 173500 | consumed samples: 4986880 | consumed tokens: 10213130240 | elapsed time per iteration (s): 0.74 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.199267E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.613 | TFLOPs: 20.85 | 31: iteration 19490/ 173500 | consumed samples: 4989440 | consumed tokens: 10218373120 | elapsed time per iteration (s): 0.78 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.216848E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.585 | TFLOPs: 19.88 | 31: iteration 19500/ 173500 | consumed samples: 4992000 | consumed tokens: 10223616000 | elapsed time per iteration (s): 0.77 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.201780E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.998 | TFLOPs: 20.15 | 31: iteration 19510/ 173500 | consumed samples: 4994560 | consumed tokens: 10228858880 | elapsed time per iteration (s): 0.74 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.226669E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.276 | TFLOPs: 21.07 | 31: 
iteration 19520/ 173500 | consumed samples: 4997120 | consumed tokens: 10234101760 | elapsed time per iteration (s): 0.76 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.225554E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.762 | TFLOPs: 20.31 | 31: iteration 19530/ 173500 | consumed samples: 4999680 | consumed tokens: 10239344640 | elapsed time per iteration (s): 0.72 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.197823E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.563 | TFLOPs: 21.45 | 31: iteration 19540/ 173500 | consumed samples: 5002240 | consumed tokens: 10244587520 | elapsed time per iteration (s): 0.79 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.217677E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.956 | TFLOPs: 19.72 | 31: iteration 19550/ 173500 | consumed samples: 5004800 | consumed tokens: 10249830400 | elapsed time per iteration (s): 0.77 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.239766E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.827 | TFLOPs: 20.01 | 31: iteration 19560/ 173500 | consumed samples: 5007360 | consumed tokens: 10255073280 | elapsed time per iteration (s): 0.92 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.206432E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 279.053 | TFLOPs: 16.88 | 31: iteration 19570/ 173500 | consumed samples: 5009920 | consumed tokens: 10260316160 | elapsed time per iteration (s): 0.74 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.196164E+00 | grad norm: 0.155 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.189 | TFLOPs: 21.06 | 31: iteration 19580/ 173500 | consumed samples: 5012480 | consumed tokens: 10265559040 | elapsed time per iteration (s): 0.81 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.219638E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.336 | TFLOPs: 19.14 | 31: iteration 19590/ 173500 | consumed samples: 5015040 | consumed tokens: 10270801920 | elapsed time per iteration (s): 0.79 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.193591E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.764 | TFLOPs: 19.59 | 31: iteration 19600/ 173500 | consumed samples: 5017600 | consumed tokens: 10276044800 | elapsed time per iteration (s): 0.83 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.187742E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.804 | TFLOPs: 18.68 | 31: iteration 19610/ 173500 | consumed samples: 5020160 | consumed tokens: 10281287680 | elapsed time per iteration (s): 0.83 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.217941E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.684 | TFLOPs: 18.74 | 31: iteration 19620/ 173500 | consumed samples: 5022720 | consumed tokens: 10286530560 | elapsed time per iteration (s): 0.81 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.224196E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.529 | TFLOPs: 19.21 | 31: iteration 19630/ 173500 | consumed samples: 5025280 | consumed tokens: 10291773440 | elapsed time per iteration (s): 0.79 | learning rate: 
1.952E-04 | global batch size: 256 | lm loss: 2.222038E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.923 | TFLOPs: 19.54 | 31: iteration 19640/ 173500 | consumed samples: 5027840 | consumed tokens: 10297016320 | elapsed time per iteration (s): 0.73 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.236284E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.629 | TFLOPs: 21.33 | 31: iteration 19650/ 173500 | consumed samples: 5030400 | consumed tokens: 10302259200 | elapsed time per iteration (s): 0.77 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.185927E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.387 | TFLOPs: 20.23 | 31: iteration 19660/ 173500 | consumed samples: 5032960 | consumed tokens: 10307502080 | elapsed time per iteration (s): 0.76 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.215012E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.106 | TFLOPs: 20.45 | 31: iteration 19670/ 173500 | consumed samples: 5035520 | consumed tokens: 10312744960 | elapsed time per iteration (s): 0.72 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.199865E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.637 | TFLOPs: 21.39 | 31: iteration 19680/ 173500 | consumed samples: 5038080 | consumed tokens: 10317987840 | elapsed time per iteration (s): 0.79 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.191525E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.451 | TFLOPs: 19.57 | 31: iteration 19690/ 173500 | consumed 
samples: 5040640 | consumed tokens: 10323230720 | elapsed time per iteration (s): 0.80 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.197258E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.491 | TFLOPs: 19.45 | 31: iteration 19700/ 173500 | consumed samples: 5043200 | consumed tokens: 10328473600 | elapsed time per iteration (s): 0.81 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.206103E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.850 | TFLOPs: 19.11 | 31: iteration 19710/ 173500 | consumed samples: 5045760 | consumed tokens: 10333716480 | elapsed time per iteration (s): 0.82 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.198995E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.407 | TFLOPs: 18.84 | 31: iteration 19720/ 173500 | consumed samples: 5048320 | consumed tokens: 10338959360 | elapsed time per iteration (s): 0.82 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.187634E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.472 | TFLOPs: 18.78 | 31: iteration 19730/ 173500 | consumed samples: 5050880 | consumed tokens: 10344202240 | elapsed time per iteration (s): 0.84 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.204455E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.149 | TFLOPs: 18.34 | 31: iteration 19740/ 173500 | consumed samples: 5053440 | consumed tokens: 10349445120 | elapsed time per iteration (s): 0.83 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.235864E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 307.299 | TFLOPs: 18.59 | 31: iteration 19750/ 173500 | consumed samples: 5056000 | consumed tokens: 10354688000 | elapsed time per iteration (s): 0.85 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.211271E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.307 | TFLOPs: 18.17 | 31: iteration 19760/ 173500 | consumed samples: 5058560 | consumed tokens: 10359930880 | elapsed time per iteration (s): 0.85 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.204014E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.675 | TFLOPs: 18.25 | 31: iteration 19770/ 173500 | consumed samples: 5061120 | consumed tokens: 10365173760 | elapsed time per iteration (s): 0.84 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.210720E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.506 | TFLOPs: 18.42 | 31: iteration 19780/ 173500 | consumed samples: 5063680 | consumed tokens: 10370416640 | elapsed time per iteration (s): 0.83 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.219230E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.558 | TFLOPs: 18.61 | 31: iteration 19790/ 173500 | consumed samples: 5066240 | consumed tokens: 10375659520 | elapsed time per iteration (s): 0.80 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.209838E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.130 | TFLOPs: 19.31 | 31: iteration 19800/ 173500 | consumed samples: 5068800 | consumed tokens: 10380902400 | elapsed time per iteration (s): 0.85 | learning rate: 1.951E-04 | global batch size: 256 | lm 
loss: 2.183185E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.379 | TFLOPs: 18.23 | 31: iteration 19810/ 173500 | consumed samples: 5071360 | consumed tokens: 10386145280 | elapsed time per iteration (s): 0.82 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.215350E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.462 | TFLOPs: 18.90 | 31: iteration 19820/ 173500 | consumed samples: 5073920 | consumed tokens: 10391388160 | elapsed time per iteration (s): 0.82 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.211074E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.204 | TFLOPs: 18.89 | 31: iteration 19830/ 173500 | consumed samples: 5076480 | consumed tokens: 10396631040 | elapsed time per iteration (s): 0.78 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.183663E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.293 | TFLOPs: 19.92 | 31: iteration 19840/ 173500 | consumed samples: 5079040 | consumed tokens: 10401873920 | elapsed time per iteration (s): 0.80 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.189868E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.242 | TFLOPs: 19.25 | 31: iteration 19850/ 173500 | consumed samples: 5081600 | consumed tokens: 10407116800 | elapsed time per iteration (s): 0.80 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.227331E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.636 | TFLOPs: 19.46 | 31: iteration 19860/ 173500 | consumed samples: 5084160 | consumed tokens: 
10412359680 | elapsed time per iteration (s): 0.80 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.223388E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.128 | TFLOPs: 19.25 | 31: iteration 19870/ 173500 | consumed samples: 5086720 | consumed tokens: 10417602560 | elapsed time per iteration (s): 0.83 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.230668E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.588 | TFLOPs: 18.73 | 31: iteration 19880/ 173500 | consumed samples: 5089280 | consumed tokens: 10422845440 | elapsed time per iteration (s): 0.78 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.207900E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.502 | TFLOPs: 19.93 | 31: iteration 19890/ 173500 | consumed samples: 5091840 | consumed tokens: 10428088320 | elapsed time per iteration (s): 0.79 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.207841E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.203 | TFLOPs: 19.67 | 31: iteration 19900/ 173500 | consumed samples: 5094400 | consumed tokens: 10433331200 | elapsed time per iteration (s): 0.76 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.211602E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.805 | TFLOPs: 20.44 | 31: iteration 19910/ 173500 | consumed samples: 5096960 | consumed tokens: 10438574080 | elapsed time per iteration (s): 0.77 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.227759E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
334.101 | TFLOPs: 20.21 | 31: iteration 19920/ 173500 | consumed samples: 5099520 | consumed tokens: 10443816960 | elapsed time per iteration (s): 0.77 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.234099E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.013 | TFLOPs: 20.15 | 31: iteration 19930/ 173500 | consumed samples: 5102080 | consumed tokens: 10449059840 | elapsed time per iteration (s): 0.78 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.176165E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.471 | TFLOPs: 19.75 | 31: iteration 19940/ 173500 | consumed samples: 5104640 | consumed tokens: 10454302720 | elapsed time per iteration (s): 0.79 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.233800E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.417 | TFLOPs: 19.63 | 31: iteration 19950/ 173500 | consumed samples: 5107200 | consumed tokens: 10459545600 | elapsed time per iteration (s): 0.77 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.186139E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.755 | TFLOPs: 20.13 | 31: iteration 19960/ 173500 | consumed samples: 5109760 | consumed tokens: 10464788480 | elapsed time per iteration (s): 0.76 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.183587E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.828 | TFLOPs: 20.50 | 31: iteration 19970/ 173500 | consumed samples: 5112320 | consumed tokens: 10470031360 | elapsed time per iteration (s): 0.76 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.221106E+00 | grad norm: 0.145 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.821 | TFLOPs: 20.26 |
31: iteration 19980/ 173500 | consumed samples: 5114880 | consumed tokens: 10475274240 | elapsed time per iteration (s): 0.76 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.226983E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.157 | TFLOPs: 20.34 |
31: iteration 19990/ 173500 | consumed samples: 5117440 | consumed tokens: 10480517120 | elapsed time per iteration (s): 0.79 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.220743E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.939 | TFLOPs: 19.54 |
0: [2022-11-25 22:34:05,128] [INFO] [logging.py:68:log_dist] [Rank 0] step=20000, skipped=0, lr=[0.00019502450208460265, 0.00019502450208460265, 0.00019502450208460265], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
31: iteration 20000/ 173500 | consumed samples: 5120000 | consumed tokens: 10485760000 | elapsed time per iteration (s): 0.74 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.238767E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.771 | TFLOPs: 20.92 |
0: steps: 20000 loss: 2.2042 iter time (s): 0.787 samples/sec: 325.314
31: -------------------------------------------------------------------------------------------
31: valid loss at iteration 20000 | lm loss value: 2.125318E+00 | lm loss PPL: 8.375563E+00 |
31: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 20000 to checkpoints_1b1long
0: [2022-11-25 22:34:05,420] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step20000 is begin to save!
0: [2022-11-25 22:34:05,433] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_01-model_00-model_states.pt... 0: [2022-11-25 22:34:05,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_01-model_00-model_states.pt. 0: [2022-11-25 22:34:05,687] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_03-model_00-model_states.pt... 0: [2022-11-25 22:34:05,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_03-model_00-model_states.pt. 0: [2022-11-25 22:34:05,777] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_04-model_00-model_states.pt... 0: [2022-11-25 22:34:05,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_04-model_00-model_states.pt. 0: [2022-11-25 22:34:05,851] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_05-model_00-model_states.pt... 0: [2022-11-25 22:34:05,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_05-model_00-model_states.pt. 0: [2022-11-25 22:34:05,927] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_06-model_00-model_states.pt... 0: [2022-11-25 22:34:06,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_06-model_00-model_states.pt. 0: [2022-11-25 22:34:06,006] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_07-model_00-model_states.pt... 0: [2022-11-25 22:34:06,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_07-model_00-model_states.pt. 
0: [2022-11-25 22:34:06,081] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_08-model_00-model_states.pt... 0: [2022-11-25 22:34:06,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_08-model_00-model_states.pt. 0: [2022-11-25 22:34:06,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_09-model_00-model_states.pt... 0: [2022-11-25 22:34:06,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_09-model_00-model_states.pt. 0: [2022-11-25 22:34:06,241] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_10-model_00-model_states.pt... 0: [2022-11-25 22:34:06,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_10-model_00-model_states.pt. 0: [2022-11-25 22:34:06,319] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_11-model_00-model_states.pt... 0: [2022-11-25 22:34:06,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_11-model_00-model_states.pt. 0: [2022-11-25 22:34:06,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_12-model_00-model_states.pt... 0: [2022-11-25 22:34:06,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_12-model_00-model_states.pt. 0: [2022-11-25 22:34:06,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_13-model_00-model_states.pt... 0: [2022-11-25 22:34:06,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_13-model_00-model_states.pt. 
0: [2022-11-25 22:34:06,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_14-model_00-model_states.pt... 0: [2022-11-25 22:34:06,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_14-model_00-model_states.pt. 0: [2022-11-25 22:34:06,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_15-model_00-model_states.pt... 0: [2022-11-25 22:34:06,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_15-model_00-model_states.pt. 0: [2022-11-25 22:34:06,702] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_16-model_00-model_states.pt... 0: [2022-11-25 22:34:06,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_16-model_00-model_states.pt. 0: [2022-11-25 22:34:06,779] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_17-model_00-model_states.pt... 0: [2022-11-25 22:34:06,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_17-model_00-model_states.pt. 0: [2022-11-25 22:34:06,853] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_18-model_00-model_states.pt... 0: [2022-11-25 22:34:06,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_18-model_00-model_states.pt. 0: [2022-11-25 22:34:06,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_19-model_00-model_states.pt... 0: [2022-11-25 22:34:07,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_19-model_00-model_states.pt. 
0: [2022-11-25 22:34:07,009] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_20-model_00-model_states.pt... 0: [2022-11-25 22:34:07,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_20-model_00-model_states.pt. 0: [2022-11-25 22:34:07,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_21-model_00-model_states.pt... 0: [2022-11-25 22:34:07,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_21-model_00-model_states.pt. 0: [2022-11-25 22:34:07,160] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_22-model_00-model_states.pt... 0: [2022-11-25 22:34:07,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_22-model_00-model_states.pt. 0: [2022-11-25 22:34:07,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_23-model_00-model_states.pt... 0: [2022-11-25 22:34:07,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_23-model_00-model_states.pt. 0: [2022-11-25 22:34:07,312] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_24-model_00-model_states.pt... 0: [2022-11-25 22:34:07,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_24-model_00-model_states.pt. 0: [2022-11-25 22:34:07,388] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_25-model_00-model_states.pt... 0: [2022-11-25 22:34:07,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_25-model_00-model_states.pt. 
0: [2022-11-25 22:34:07,464] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_26-model_00-model_states.pt... 0: [2022-11-25 22:34:07,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_26-model_00-model_states.pt. 0: [2022-11-25 22:34:07,537] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_27-model_00-model_states.pt... 0: [2022-11-25 22:34:07,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_27-model_00-model_states.pt. 0: [2022-11-25 22:34:07,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_28-model_00-model_states.pt... 0: [2022-11-25 22:34:07,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_28-model_00-model_states.pt. 0: [2022-11-25 22:34:07,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/layer_30-model_00-model_states.pt... 0: [2022-11-25 22:34:07,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/layer_30-model_00-model_states.pt. 0: [2022-11-25 22:34:07,692] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step20000/mp_rank_00_model_states.pt 0: [2022-11-25 22:34:07,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/mp_rank_00_model_states.pt... 0: [2022-11-25 22:34:07,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/mp_rank_00_model_states.pt. 0: [2022-11-25 22:34:07,769] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
0: [2022-11-25 22:34:07,769] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 
8: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 
16: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
12: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 
20: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
28: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 
22: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 
21: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 
26: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:34:07,769] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:34:07,769] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
5: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
4: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 
16: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 
15: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 
11: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
14: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 
30: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 
27: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
5: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 
16: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 
11: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 
30: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:34:07,769] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
4: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 
29: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 30: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 30: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 
9: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 30: [2022-11-25 22:34:07,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:34:07,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:34:07,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 22:34:07,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 9: [2022-11-25 22:34:07,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:34:07,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 22:34:07,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 13: [2022-11-25 22:34:07,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:34:07,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 22:34:07,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 10: [2022-11-25 22:34:07,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-25 22:34:07,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 22:34:07,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 12: [2022-11-25 22:34:07,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:34:07,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 22:34:07,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 29: [2022-11-25 22:34:07,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:34:07,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-25 22:34:07,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 21: [2022-11-25 22:34:07,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-25 22:34:07,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-25 22:34:07,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 23: [2022-11-25 22:34:07,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-25 22:34:07,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-25 22:34:07,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 8: [2022-11-25 22:34:07,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:34:07,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 1: [2022-11-25 22:34:07,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 31: [2022-11-25 22:34:07,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-25 22:34:07,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:34:07,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:34:07,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 1: [2022-11-25 22:34:07,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 22:34:07,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 25: [2022-11-25 22:34:07,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
31: [2022-11-25 22:34:07,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 26: [2022-11-25 22:34:07,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 31: [2022-11-25 22:34:07,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-25 22:34:07,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 26: [2022-11-25 22:34:07,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 25: [2022-11-25 22:34:07,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 31: [2022-11-25 22:34:07,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 12: [2022-11-25 22:34:07,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:34:07,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 25: [2022-11-25 22:34:07,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 12: [2022-11-25 22:34:07,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 11: [2022-11-25 22:34:07,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-25 22:34:07,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 22:34:07,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 8: [2022-11-25 22:34:07,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:34:07,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:34:07,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 9: [2022-11-25 22:34:07,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 8: [2022-11-25 22:34:07,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 9: [2022-11-25 22:34:07,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 15: [2022-11-25 22:34:07,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:34:07,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 22:34:07,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 15: [2022-11-25 22:34:07,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-25 22:34:07,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 22:34:07,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 10: [2022-11-25 22:34:07,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:34:07,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:34:07,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 11: [2022-11-25 22:34:07,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 7: [2022-11-25 22:34:07,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:34:07,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:34:07,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 11: [2022-11-25 22:34:07,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
7: [2022-11-25 22:34:07,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 22:34:07,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 22:34:07,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 7: [2022-11-25 22:34:07,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 4: [2022-11-25 22:34:07,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:34:07,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 22:34:07,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 13: [2022-11-25 22:34:07,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:34:07,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 22:34:07,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 19: [2022-11-25 22:34:07,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
19: [2022-11-25 22:34:07,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-25 22:34:07,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 19: [2022-11-25 22:34:07,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-25 22:34:07,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-25 22:34:07,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-25 22:34:07,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 19: [2022-11-25 22:34:07,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-25 22:34:07,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 29: [2022-11-25 22:34:07,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:34:07,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-25 22:34:07,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 28: [2022-11-25 22:34:07,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-25 22:34:07,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 29: [2022-11-25 22:34:07,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:34:07,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 3: [2022-11-25 22:34:07,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:34:07,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 3: [2022-11-25 22:34:07,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 22:34:07,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 28: [2022-11-25 22:34:07,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 28: [2022-11-25 22:34:07,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-25 22:34:07,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-25 22:34:07,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 20: [2022-11-25 22:34:07,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-25 22:34:07,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-25 22:34:07,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 0: [2022-11-25 22:34:07,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:34:07,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 22:34:07,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 5: [2022-11-25 22:34:07,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:34:07,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 22:34:07,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 5: [2022-11-25 22:34:07,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:34:07,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 22:34:07,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 5: [2022-11-25 22:34:07,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-25 22:34:07,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 22:34:07,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 11: [2022-11-25 22:34:07,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:34:07,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 22:34:07,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 7: [2022-11-25 22:34:07,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:34:07,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:34:07,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 7: [2022-11-25 22:34:07,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 13: [2022-11-25 22:34:07,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 7: [2022-11-25 22:34:07,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 8: [2022-11-25 22:34:07,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-25 22:34:07,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 22:34:07,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 1: [2022-11-25 22:34:07,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:34:07,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 22:34:07,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 25: [2022-11-25 22:34:07,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-25 22:34:07,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-25 22:34:07,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 2: [2022-11-25 22:34:07,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:34:07,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:34:07,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
2: [2022-11-25 22:34:07,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 22:34:07,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 4: [2022-11-25 22:34:07,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 2: [2022-11-25 22:34:07,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 2: [2022-11-25 22:34:07,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 4: [2022-11-25 22:34:07,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 9: [2022-11-25 22:34:07,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 25: [2022-11-25 22:34:07,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:34:07,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:34:07,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 26: [2022-11-25 22:34:07,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 9: [2022-11-25 22:34:07,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
2: [2022-11-25 22:34:07,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 25: [2022-11-25 22:34:07,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 23: [2022-11-25 22:34:07,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:34:07,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 2: [2022-11-25 22:34:07,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 25: [2022-11-25 22:34:07,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 2: [2022-11-25 22:34:07,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 25: [2022-11-25 22:34:07,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 23: [2022-11-25 22:34:07,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 21: [2022-11-25 22:34:07,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 23: [2022-11-25 22:34:07,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
25: [2022-11-25 22:34:07,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-25 22:34:07,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 21: [2022-11-25 22:34:07,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 30: [2022-11-25 22:34:07,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-25 22:34:07,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-25 22:34:07,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 21: [2022-11-25 22:34:07,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 30: [2022-11-25 22:34:07,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-25 22:34:07,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-25 22:34:07,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-25 22:34:07,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 30: [2022-11-25 22:34:07,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
30: [2022-11-25 22:34:07,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 3: [2022-11-25 22:34:07,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:34:07,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 22:34:07,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 23: [2022-11-25 22:34:07,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 28: [2022-11-25 22:34:07,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 23: [2022-11-25 22:34:07,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 10: [2022-11-25 22:34:07,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 23: [2022-11-25 22:34:07,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 10: [2022-11-25 22:34:07,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 28: [2022-11-25 22:34:07,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 10: [2022-11-25 22:34:07,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
14: [2022-11-25 22:34:07,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:34:07,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 28: [2022-11-25 22:34:07,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 14: [2022-11-25 22:34:07,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 2: [2022-11-25 22:34:07,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 14: [2022-11-25 22:34:07,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 2: [2022-11-25 22:34:07,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 20: [2022-11-25 22:34:07,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:34:07,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-25 22:34:07,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 10: [2022-11-25 22:34:07,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-25 22:34:07,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 22:34:07,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 20: [2022-11-25 22:34:07,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:34:07,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-25 22:34:07,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 31: [2022-11-25 22:34:07,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:34:07,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 31: [2022-11-25 22:34:07,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 8: [2022-11-25 22:34:07,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 31: [2022-11-25 22:34:07,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 8: [2022-11-25 22:34:07,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 15: [2022-11-25 22:34:07,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-25 22:34:07,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 9: [2022-11-25 22:34:07,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:34:07,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 9: [2022-11-25 22:34:07,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 5: [2022-11-25 22:34:07,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:34:07,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 5: [2022-11-25 22:34:07,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 22:34:07,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 23: [2022-11-25 22:34:07,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-25 22:34:07,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-25 22:34:07,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 29: [2022-11-25 22:34:07,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
29: [2022-11-25 22:34:07,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-25 22:34:07,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 14: [2022-11-25 22:34:07,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:34:07,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 22:34:07,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:34:07,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 14: [2022-11-25 22:34:07,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 12: [2022-11-25 22:34:07,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:34:07,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 12: [2022-11-25 22:34:07,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 22:34:07,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 18: [2022-11-25 22:34:07,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
18: [2022-11-25 22:34:07,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:34:07,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:34:07,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:34:07,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-25 22:34:07,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-25 22:34:07,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-25 22:34:07,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-25 22:34:07,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 18: [2022-11-25 22:34:07,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 18: [2022-11-25 22:34:07,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 18: [2022-11-25 22:34:07,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 12: [2022-11-25 22:34:07,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-25 22:34:07,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 22:34:07,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 19: [2022-11-25 22:34:07,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-25 22:34:07,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-25 22:34:07,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 19: [2022-11-25 22:34:07,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-25 22:34:07,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-25 22:34:07,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 7: [2022-11-25 22:34:07,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:34:07,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
7: [2022-11-25 22:34:07,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 20: [2022-11-25 22:34:07,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 7: [2022-11-25 22:34:07,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 13: [2022-11-25 22:34:07,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:34:07,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 13: [2022-11-25 22:34:07,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 22:34:07,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 3: [2022-11-25 22:34:07,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:34:07,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 22:34:07,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 3: [2022-11-25 22:34:07,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-25 22:34:07,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 22:34:07,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 29: [2022-11-25 22:34:07,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:34:07,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-25 22:34:07,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 6: [2022-11-25 22:34:07,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:34:07,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:34:07,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:34:07,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-25 22:34:07,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 22:34:07,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 20: [2022-11-25 22:34:07,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:34:07,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 22:34:07,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 22:34:07,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 20: [2022-11-25 22:34:07,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 6: [2022-11-25 22:34:07,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 20: [2022-11-25 22:34:07,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 6: [2022-11-25 22:34:07,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 6: [2022-11-25 22:34:07,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 1: [2022-11-25 22:34:07,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-25 22:34:07,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:34:07,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:34:07,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 22:34:07,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 4: [2022-11-25 22:34:07,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 22:34:07,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 1: [2022-11-25 22:34:07,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 1: [2022-11-25 22:34:07,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 1: [2022-11-25 22:34:07,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 17: [2022-11-25 22:34:07,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-25 22:34:07,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 
1: [2022-11-25 22:34:07,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 15: [2022-11-25 22:34:07,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 17: [2022-11-25 22:34:07,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 1: [2022-11-25 22:34:07,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 17: [2022-11-25 22:34:07,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-25 22:34:07,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 15: [2022-11-25 22:34:07,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 17: [2022-11-25 22:34:07,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 15: [2022-11-25 22:34:07,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 17: [2022-11-25 22:34:07,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-25 22:34:07,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-25 22:34:07,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
17: [2022-11-25 22:34:07,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-25 22:34:07,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-25 22:34:07,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 26: [2022-11-25 22:34:07,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:34:07,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-25 22:34:07,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 26: [2022-11-25 22:34:07,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:34:07,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-25 22:34:07,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 21: [2022-11-25 22:34:07,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-25 22:34:07,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-25 22:34:07,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
9: [2022-11-25 22:34:07,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:34:07,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 22:34:07,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 11: [2022-11-25 22:34:07,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:34:07,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 4: [2022-11-25 22:34:07,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:34:07,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 4: [2022-11-25 22:34:07,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 22:34:07,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 31: [2022-11-25 22:34:07,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-25 22:34:07,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-25 22:34:07,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
0: [2022-11-25 22:34:07,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:34:07,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:34:07,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 22:34:07,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 22:34:07,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 0: [2022-11-25 22:34:07,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 28: [2022-11-25 22:34:07,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-25 22:34:07,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-25 22:34:07,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 30: [2022-11-25 22:34:07,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:34:07,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-25 22:34:07,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 30: [2022-11-25 22:34:07,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 14: [2022-11-25 22:34:07,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 30: [2022-11-25 22:34:07,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 2: [2022-11-25 22:34:07,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:34:07,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 22:34:07,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 16: [2022-11-25 22:34:07,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-25 22:34:07,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-25 22:34:07,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-25 22:34:07,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
16: [2022-11-25 22:34:07,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-25 22:34:07,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 16: [2022-11-25 22:34:07,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-25 22:34:07,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-25 22:34:07,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-25 22:34:07,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 16: [2022-11-25 22:34:07,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 16: [2022-11-25 22:34:07,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 14: [2022-11-25 22:34:07,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:34:07,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 21: [2022-11-25 22:34:07,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 25: [2022-11-25 22:34:07,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
14: [2022-11-25 22:34:07,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 25: [2022-11-25 22:34:07,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 21: [2022-11-25 22:34:07,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 25: [2022-11-25 22:34:07,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 21: [2022-11-25 22:34:07,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 21: [2022-11-25 22:34:07,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-25 22:34:07,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-25 22:34:07,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 7: [2022-11-25 22:34:07,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:34:07,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 22:34:07,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 27: [2022-11-25 22:34:07,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
27: [2022-11-25 22:34:07,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:34:07,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:34:07,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:34:07,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-25 22:34:07,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-25 22:34:07,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-25 22:34:07,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-25 22:34:07,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 27: [2022-11-25 22:34:07,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 27: [2022-11-25 22:34:07,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 27: [2022-11-25 22:34:07,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 15: [2022-11-25 22:34:07,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-25 22:34:07,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 22:34:07,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 22: [2022-11-25 22:34:07,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:34:07,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:34:07,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-25 22:34:07,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-25 22:34:07,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 22: [2022-11-25 22:34:07,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 22: [2022-11-25 22:34:07,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:34:07,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-25 22:34:07,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 24: [2022-11-25 22:34:07,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
24: [2022-11-25 22:34:07,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-25 22:34:07,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 24: [2022-11-25 22:34:07,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-25 22:34:07,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-25 22:34:07,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-25 22:34:07,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-25 22:34:07,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 24: [2022-11-25 22:34:07,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 10: [2022-11-25 22:34:07,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:34:07,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 22:34:07,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 0: [2022-11-25 22:34:07,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-25 22:34:07,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 22:34:07,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 17: [2022-11-25 22:34:07,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-25 22:34:07,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-25 22:34:07,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 11: [2022-11-25 22:34:07,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:34:07,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 22:34:07,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 18: [2022-11-25 22:34:07,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:34:07,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-25 22:34:07,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 5: [2022-11-25 22:34:07,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-25 22:34:07,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 22:34:07,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 8: [2022-11-25 22:34:07,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:34:07,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 22:34:07,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 6: [2022-11-25 22:34:07,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:34:07,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 22:34:07,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 13: [2022-11-25 22:34:07,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:34:07,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 22:34:07,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 4: [2022-11-25 22:34:07,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-25 22:34:07,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 22:34:07,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 12: [2022-11-25 22:34:07,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:34:07,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 22:34:07,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 30: [2022-11-25 22:34:07,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-25 22:34:07,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-25 22:34:07,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 16: [2022-11-25 22:34:07,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 23: [2022-11-25 22:34:07,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
16: [2022-11-25 22:34:07,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 23: [2022-11-25 22:34:07,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 16: [2022-11-25 22:34:07,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 23: [2022-11-25 22:34:07,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 26: [2022-11-25 22:34:07,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:34:07,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-25 22:34:07,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 31: [2022-11-25 22:34:07,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-25 22:34:07,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-25 22:34:07,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 28: [2022-11-25 22:34:07,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
28: [2022-11-25 22:34:07,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-25 22:34:07,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 27: [2022-11-25 22:34:07,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:34:07,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-25 22:34:07,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 22: [2022-11-25 22:34:07,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:34:07,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-25 22:34:07,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 3: [2022-11-25 22:34:07,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:34:07,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 22:34:07,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 24: [2022-11-25 22:34:07,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-25 22:34:07,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-25 22:34:07,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 19: [2022-11-25 22:34:07,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-25 22:34:07,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-25 22:34:07,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 21: [2022-11-25 22:34:07,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-25 22:34:07,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-25 22:34:07,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 7: [2022-11-25 22:34:07,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:34:07,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 22:34:07,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 9: [2022-11-25 22:34:07,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
2: [2022-11-25 22:34:07,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:34:07,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 9: [2022-11-25 22:34:07,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 2: [2022-11-25 22:34:07,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 9: [2022-11-25 22:34:07,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 29: [2022-11-25 22:34:07,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 25: [2022-11-25 22:34:07,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:34:07,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-25 22:34:07,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 25: [2022-11-25 22:34:07,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-25 22:34:07,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 15: [2022-11-25 22:34:07,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-25 22:34:07,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 22:34:07,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 18: [2022-11-25 22:34:07,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:34:07,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-25 22:34:07,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 1: [2022-11-25 22:34:07,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:34:07,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:34:07,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 22:34:07,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 10: [2022-11-25 22:34:07,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 22:34:07,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 14: [2022-11-25 22:34:07,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-25 22:34:07,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 22:34:07,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 20: [2022-11-25 22:34:07,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:34:07,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:34:07,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 8: [2022-11-25 22:34:07,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 20: [2022-11-25 22:34:07,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 8: [2022-11-25 22:34:07,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 6: [2022-11-25 22:34:07,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:34:07,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 22:34:07,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 0: [2022-11-25 22:34:07,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
4: [2022-11-25 22:34:07,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:34:07,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 22:34:07,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 11: [2022-11-25 22:34:07,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:34:07,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 22:34:07,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 17: [2022-11-25 22:34:07,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 30: [2022-11-25 22:34:07,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 17: [2022-11-25 22:34:07,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 30: [2022-11-25 22:34:07,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 17: [2022-11-25 22:34:07,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 30: [2022-11-25 22:34:07,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
13: [2022-11-25 22:34:07,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:34:07,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 22:34:07,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 26: [2022-11-25 22:34:07,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:34:07,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-25 22:34:07,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 16: [2022-11-25 22:34:07,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-25 22:34:07,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 5: [2022-11-25 22:34:07,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 16: [2022-11-25 22:34:07,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 5: [2022-11-25 22:34:07,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 22:34:07,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
23: [2022-11-25 22:34:07,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-25 22:34:07,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-25 22:34:07,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 28: [2022-11-25 22:34:07,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-25 22:34:07,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-25 22:34:07,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 12: [2022-11-25 22:34:07,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:34:07,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 22:34:07,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 19: [2022-11-25 22:34:07,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-25 22:34:07,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-25 22:34:07,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
31: [2022-11-25 22:34:07,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:34:07,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 31: [2022-11-25 22:34:07,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 27: [2022-11-25 22:34:07,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-25 22:34:07,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 31: [2022-11-25 22:34:07,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 24: [2022-11-25 22:34:07,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-25 22:34:07,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-25 22:34:07,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 7: [2022-11-25 22:34:07,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:34:07,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 22:34:07,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
9: [2022-11-25 22:34:07,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:34:07,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 3: [2022-11-25 22:34:07,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:34:07,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 3: [2022-11-25 22:34:07,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 22:34:07,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 22: [2022-11-25 22:34:07,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:34:07,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-25 22:34:07,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 0: [2022-11-25 22:34:07,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 22:34:07,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 14: [2022-11-25 22:34:07,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-25 22:34:07,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 22:34:07,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 18: [2022-11-25 22:34:07,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:34:07,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-25 22:34:07,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 2: [2022-11-25 22:34:07,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:34:07,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 21: [2022-11-25 22:34:07,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:34:07,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 21: [2022-11-25 22:34:07,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-25 22:34:07,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 15: [2022-11-25 22:34:07,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-25 22:34:07,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 1: [2022-11-25 22:34:07,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:34:07,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 1: [2022-11-25 22:34:07,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 29: [2022-11-25 22:34:07,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:34:07,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 29: [2022-11-25 22:34:07,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-25 22:34:07,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 10: [2022-11-25 22:34:07,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 25: [2022-11-25 22:34:07,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 
10: [2022-11-25 22:34:07,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 25: [2022-11-25 22:34:07,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 10: [2022-11-25 22:34:07,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 25: [2022-11-25 22:34:07,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 17: [2022-11-25 22:34:07,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-25 22:34:07,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-25 22:34:07,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 6: [2022-11-25 22:34:07,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:34:07,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 11: [2022-11-25 22:34:07,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:34:07,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
11: [2022-11-25 22:34:07,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 22:34:07,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 0: [2022-11-25 22:34:07,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:34:07,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 22:34:07,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 8: [2022-11-25 22:34:07,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:34:07,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 22:34:07,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 13: [2022-11-25 22:34:07,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:34:07,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 22:34:07,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 4: [2022-11-25 22:34:07,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-25 22:34:07,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 22:34:07,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 5: [2022-11-25 22:34:07,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:34:07,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 22:34:07,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 23: [2022-11-25 22:34:07,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-25 22:34:07,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-25 22:34:07,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 26: [2022-11-25 22:34:07,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:34:07,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-25 22:34:07,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 12: [2022-11-25 22:34:07,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
30: [2022-11-25 22:34:07,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-25 22:34:07,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 12: [2022-11-25 22:34:07,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 30: [2022-11-25 22:34:07,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 12: [2022-11-25 22:34:07,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 16: [2022-11-25 22:34:07,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-25 22:34:07,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-25 22:34:07,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 31: [2022-11-25 22:34:07,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-25 22:34:07,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-25 22:34:07,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 28: [2022-11-25 22:34:07,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-25 22:34:07,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-25 22:34:07,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 24: [2022-11-25 22:34:07,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-25 22:34:07,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 18: [2022-11-25 22:34:07,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 24: [2022-11-25 22:34:07,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 18: [2022-11-25 22:34:07,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-25 22:34:07,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 4: [2022-11-25 22:34:07,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:34:07,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 22:34:07,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 13: [2022-11-25 22:34:07,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-25 22:34:07,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 22:34:07,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 30: [2022-11-25 22:34:07,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 19: [2022-11-25 22:34:07,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 30: [2022-11-25 22:34:07,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 19: [2022-11-25 22:34:07,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 15: [2022-11-25 22:34:07,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 30: [2022-11-25 22:34:07,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 19: [2022-11-25 22:34:07,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 15: [2022-11-25 22:34:07,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 22:34:07,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 7: [2022-11-25 22:34:07,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
3: [2022-11-25 22:34:07,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:34:07,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 22:34:07,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 1: [2022-11-25 22:34:07,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:34:07,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 22:34:07,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 1: [2022-11-25 22:34:07,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 22:34:07,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 0: [2022-11-25 22:34:07,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:34:07,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 22:34:07,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 6: [2022-11-25 22:34:07,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-25 22:34:07,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 22:34:07,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 16: [2022-11-25 22:34:07,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:34:07,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 16: [2022-11-25 22:34:07,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 22: [2022-11-25 22:34:07,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 16: [2022-11-25 22:34:07,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 22: [2022-11-25 22:34:07,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 22: [2022-11-25 22:34:07,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:34:07,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 31: [2022-11-25 22:34:07,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:34:07,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
31: [2022-11-25 22:34:07,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-25 22:34:07,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 8: [2022-11-25 22:34:07,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:34:07,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 22:34:07,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 21: [2022-11-25 22:34:07,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-25 22:34:07,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-25 22:34:07,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 25: [2022-11-25 22:34:07,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:34:07,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
25: [2022-11-25 22:34:07,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 9: [2022-11-25 22:34:07,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 25: [2022-11-25 22:34:07,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 9: [2022-11-25 22:34:07,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 26: [2022-11-25 22:34:07,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:34:07,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-25 22:34:07,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 11: [2022-11-25 22:34:07,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:34:07,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 22:34:07,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 10: [2022-11-25 22:34:07,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-25 22:34:07,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 22:34:07,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 29: [2022-11-25 22:34:07,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:34:07,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:34:07,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 27: [2022-11-25 22:34:07,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 29: [2022-11-25 22:34:07,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 2: [2022-11-25 22:34:07,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:34:07,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 2: [2022-11-25 22:34:07,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 22:34:07,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 17: [2022-11-25 22:34:07,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 
17: [2022-11-25 22:34:07,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt
17: [2022-11-25 22:34:07,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
5: [2022-11-25 22:34:07,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt.
5: [2022-11-25 22:34:07,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt
14: [2022-11-25 22:34:07,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt.
5: [2022-11-25 22:34:07,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
14: [2022-11-25 22:34:07,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt
20: [2022-11-25 22:34:07,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt.
14: [2022-11-25 22:34:07,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
20: [2022-11-25 22:34:07,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt
20: [2022-11-25 22:34:07,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
23: [2022-11-25 22:34:08,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt.
23: [2022-11-25 22:34:08,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt
23: [2022-11-25 22:34:08,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
28: [2022-11-25 22:34:08,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt.
28: [2022-11-25 22:34:08,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt
28: [2022-11-25 22:34:08,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
12: [2022-11-25 22:34:08,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt.
12: [2022-11-25 22:34:08,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt
12: [2022-11-25 22:34:08,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
20: [2022-11-25 22:34:08,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt.
20: [2022-11-25 22:34:08,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt
20: [2022-11-25 22:34:08,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
3: [2022-11-25 22:34:08,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt.
3: [2022-11-25 22:34:08,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt
3: [2022-11-25 22:34:08,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
24: [2022-11-25 22:34:08,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt.
24: [2022-11-25 22:34:08,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt
24: [2022-11-25 22:34:08,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
24: [2022-11-25 22:34:08,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt.
24: [2022-11-25 22:34:08,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt
24: [2022-11-25 22:34:08,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
22: [2022-11-25 22:34:08,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt.
22: [2022-11-25 22:34:08,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt
22: [2022-11-25 22:34:08,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now!
27: [2022-11-25 22:34:08,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt.
27: [2022-11-25 22:34:08,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step20000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-25 22:34:08,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 0: successfully saved checkpoint at iteration 20000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2597.19 31: iteration 20010/ 173500 | consumed samples: 5122560 | consumed tokens: 10491002880 | elapsed time per iteration (s): 1.01 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.214044E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 254.324 | TFLOPs: 15.39 | 31: iteration 20020/ 173500 | consumed samples: 5125120 | consumed tokens: 10496245760 | elapsed time per iteration (s): 0.74 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.188499E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.764 | TFLOPs: 20.86 | 31: iteration 20030/ 173500 | consumed samples: 5127680 | consumed tokens: 10501488640 | elapsed time per iteration (s): 0.80 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.216256E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.254 | TFLOPs: 19.37 | 31: iteration 20040/ 173500 | consumed samples: 5130240 | consumed tokens: 10506731520 | elapsed time per iteration (s): 0.74 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.216593E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.635 | TFLOPs: 20.85 | 31: iteration 20050/ 173500 | consumed samples: 5132800 | consumed tokens: 10511974400 | elapsed time per iteration (s): 0.75 | learning rate: 1.950E-04 | global batch 
size: 256 | lm loss: 2.195753E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.267 | TFLOPs: 20.77 | 31: iteration 20060/ 173500 | consumed samples: 5135360 | consumed tokens: 10517217280 | elapsed time per iteration (s): 0.78 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.214399E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.526 | TFLOPs: 19.81 | 31: iteration 20070/ 173500 | consumed samples: 5137920 | consumed tokens: 10522460160 | elapsed time per iteration (s): 0.82 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.180648E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.148 | TFLOPs: 18.88 | 31: iteration 20080/ 173500 | consumed samples: 5140480 | consumed tokens: 10527703040 | elapsed time per iteration (s): 0.73 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.194799E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.004 | TFLOPs: 21.36 | 31: iteration 20090/ 173500 | consumed samples: 5143040 | consumed tokens: 10532945920 | elapsed time per iteration (s): 0.79 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.167803E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.448 | TFLOPs: 19.63 | 31: iteration 20100/ 173500 | consumed samples: 5145600 | consumed tokens: 10538188800 | elapsed time per iteration (s): 0.81 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.225739E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.961 | TFLOPs: 19.18 | 31: iteration 20110/ 173500 | consumed samples: 5148160 | consumed 
tokens: 10543431680 | elapsed time per iteration (s): 0.78 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.184858E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.948 | TFLOPs: 19.90 | 31: iteration 20120/ 173500 | consumed samples: 5150720 | consumed tokens: 10548674560 | elapsed time per iteration (s): 0.82 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.195927E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.713 | TFLOPs: 18.86 | 31: iteration 20130/ 173500 | consumed samples: 5153280 | consumed tokens: 10553917440 | elapsed time per iteration (s): 0.75 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.207058E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.390 | TFLOPs: 20.53 | 31: iteration 20140/ 173500 | consumed samples: 5155840 | consumed tokens: 10559160320 | elapsed time per iteration (s): 0.82 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.199316E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.649 | TFLOPs: 18.97 | 31: iteration 20150/ 173500 | consumed samples: 5158400 | consumed tokens: 10564403200 | elapsed time per iteration (s): 0.76 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.216445E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.917 | TFLOPs: 20.38 | 31: iteration 20160/ 173500 | consumed samples: 5160960 | consumed tokens: 10569646080 | elapsed time per iteration (s): 0.77 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.244947E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 330.685 | TFLOPs: 20.01 | 31: iteration 20170/ 173500 | consumed samples: 5163520 | consumed tokens: 10574888960 | elapsed time per iteration (s): 0.74 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.202975E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.620 | TFLOPs: 20.85 | 31: iteration 20180/ 173500 | consumed samples: 5166080 | consumed tokens: 10580131840 | elapsed time per iteration (s): 0.73 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.208553E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.754 | TFLOPs: 21.16 | 31: iteration 20190/ 173500 | consumed samples: 5168640 | consumed tokens: 10585374720 | elapsed time per iteration (s): 0.79 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.207688E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.390 | TFLOPs: 19.50 | 31: iteration 20200/ 173500 | consumed samples: 5171200 | consumed tokens: 10590617600 | elapsed time per iteration (s): 0.75 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.214447E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.582 | TFLOPs: 20.60 | 31: iteration 20210/ 173500 | consumed samples: 5173760 | consumed tokens: 10595860480 | elapsed time per iteration (s): 0.77 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.164346E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.222 | TFLOPs: 20.16 | 31: iteration 20220/ 173500 | consumed samples: 5176320 | consumed tokens: 10601103360 | elapsed time per iteration (s): 0.75 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.186388E+00 | grad norm: 
0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.297 | TFLOPs: 20.71 | 31: iteration 20230/ 173500 | consumed samples: 5178880 | consumed tokens: 10606346240 | elapsed time per iteration (s): 0.78 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.204291E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.071 | TFLOPs: 19.97 | 31: iteration 20240/ 173500 | consumed samples: 5181440 | consumed tokens: 10611589120 | elapsed time per iteration (s): 0.83 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.202098E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.224 | TFLOPs: 18.65 | 31: iteration 20250/ 173500 | consumed samples: 5184000 | consumed tokens: 10616832000 | elapsed time per iteration (s): 0.78 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.192168E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.411 | TFLOPs: 19.75 | 31: iteration 20260/ 173500 | consumed samples: 5186560 | consumed tokens: 10622074880 | elapsed time per iteration (s): 0.78 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.198047E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.296 | TFLOPs: 19.98 | 31: iteration 20270/ 173500 | consumed samples: 5189120 | consumed tokens: 10627317760 | elapsed time per iteration (s): 0.80 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.201384E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.507 | TFLOPs: 19.45 | 31: iteration 20280/ 173500 | consumed samples: 5191680 | consumed tokens: 10632560640 | elapsed time per 
iteration (s): 0.75 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.210766E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.878 | TFLOPs: 20.74 | 31: iteration 20290/ 173500 | consumed samples: 5194240 | consumed tokens: 10637803520 | elapsed time per iteration (s): 0.74 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.240288E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.418 | TFLOPs: 21.02 | 31: iteration 20300/ 173500 | consumed samples: 5196800 | consumed tokens: 10643046400 | elapsed time per iteration (s): 0.77 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.190886E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.264 | TFLOPs: 20.10 | 31: iteration 20310/ 173500 | consumed samples: 5199360 | consumed tokens: 10648289280 | elapsed time per iteration (s): 0.77 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.192688E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.388 | TFLOPs: 20.23 | 31: iteration 20320/ 173500 | consumed samples: 5201920 | consumed tokens: 10653532160 | elapsed time per iteration (s): 0.75 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.206920E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.400 | TFLOPs: 20.59 | 31: iteration 20330/ 173500 | consumed samples: 5204480 | consumed tokens: 10658775040 | elapsed time per iteration (s): 0.77 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.224115E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.959 | TFLOPs: 20.20 | 31: 
iteration 20340/ 173500 | consumed samples: 5207040 | consumed tokens: 10664017920 | elapsed time per iteration (s): 0.77 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.195324E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.243 | TFLOPs: 20.04 | 31: iteration 20350/ 173500 | consumed samples: 5209600 | consumed tokens: 10669260800 | elapsed time per iteration (s): 0.76 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.166531E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.057 | TFLOPs: 20.33 | 31: iteration 20360/ 173500 | consumed samples: 5212160 | consumed tokens: 10674503680 | elapsed time per iteration (s): 0.77 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.210209E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.445 | TFLOPs: 20.11 | 31: iteration 20370/ 173500 | consumed samples: 5214720 | consumed tokens: 10679746560 | elapsed time per iteration (s): 0.82 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.185426E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.646 | TFLOPs: 18.97 | 31: iteration 20380/ 173500 | consumed samples: 5217280 | consumed tokens: 10684989440 | elapsed time per iteration (s): 0.76 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.209193E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.664 | TFLOPs: 20.25 | 31: iteration 20390/ 173500 | consumed samples: 5219840 | consumed tokens: 10690232320 | elapsed time per iteration (s): 0.78 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.181522E+00 | grad norm: 0.148 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.347 | TFLOPs: 19.92 | 31: iteration 20400/ 173500 | consumed samples: 5222400 | consumed tokens: 10695475200 | elapsed time per iteration (s): 0.76 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.230264E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.560 | TFLOPs: 20.48 | 31: iteration 20410/ 173500 | consumed samples: 5224960 | consumed tokens: 10700718080 | elapsed time per iteration (s): 0.81 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.187340E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.708 | TFLOPs: 19.04 | 31: iteration 20420/ 173500 | consumed samples: 5227520 | consumed tokens: 10705960960 | elapsed time per iteration (s): 0.72 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.217867E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.099 | TFLOPs: 21.48 | 31: iteration 20430/ 173500 | consumed samples: 5230080 | consumed tokens: 10711203840 | elapsed time per iteration (s): 0.76 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.209489E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.856 | TFLOPs: 20.50 | 31: iteration 20440/ 173500 | consumed samples: 5232640 | consumed tokens: 10716446720 | elapsed time per iteration (s): 0.82 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.210497E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.422 | TFLOPs: 18.96 | 31: iteration 20450/ 173500 | consumed samples: 5235200 | consumed tokens: 10721689600 | elapsed time per iteration (s): 0.81 | learning rate: 
1.948E-04 | global batch size: 256 | lm loss: 2.198202E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.298 | TFLOPs: 19.07 | 31: iteration 20460/ 173500 | consumed samples: 5237760 | consumed tokens: 10726932480 | elapsed time per iteration (s): 0.77 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.206270E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.629 | TFLOPs: 20.00 | 31: iteration 20470/ 173500 | consumed samples: 5240320 | consumed tokens: 10732175360 | elapsed time per iteration (s): 0.81 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.871437E+00 | grad norm: 3.067 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.852 | TFLOPs: 19.17 | 31: iteration 20480/ 173500 | consumed samples: 5242880 | consumed tokens: 10737418240 | elapsed time per iteration (s): 0.89 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.740733E+00 | grad norm: 15.621 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.684 | TFLOPs: 17.40 | 31: iteration 20490/ 173500 | consumed samples: 5245440 | consumed tokens: 10742661120 | elapsed time per iteration (s): 0.88 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.535756E+00 | grad norm: 1.832 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.914 | TFLOPs: 17.60 | 31: iteration 20500/ 173500 | consumed samples: 5248000 | consumed tokens: 10747904000 | elapsed time per iteration (s): 0.83 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.370435E+00 | grad norm: 0.310 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.823 | TFLOPs: 18.74 | 31: iteration 20510/ 173500 | consumed 
samples: 5250560 | consumed tokens: 10753146880 | elapsed time per iteration (s): 0.87 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.271646E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.096 | TFLOPs: 17.85 | 31: iteration 20520/ 173500 | consumed samples: 5253120 | consumed tokens: 10758389760 | elapsed time per iteration (s): 0.81 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.293511E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.201 | TFLOPs: 19.13 | 31: iteration 20530/ 173500 | consumed samples: 5255680 | consumed tokens: 10763632640 | elapsed time per iteration (s): 0.87 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.245475E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.371 | TFLOPs: 17.75 | 31: iteration 20540/ 173500 | consumed samples: 5258240 | consumed tokens: 10768875520 | elapsed time per iteration (s): 0.83 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.211047E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.521 | TFLOPs: 18.66 | 31: iteration 20550/ 173500 | consumed samples: 5260800 | consumed tokens: 10774118400 | elapsed time per iteration (s): 0.86 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.237099E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.308 | TFLOPs: 18.05 | 31: iteration 20560/ 173500 | consumed samples: 5263360 | consumed tokens: 10779361280 | elapsed time per iteration (s): 0.84 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.220410E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 303.851 | TFLOPs: 18.38 | 31: iteration 20570/ 173500 | consumed samples: 5265920 | consumed tokens: 10784604160 | elapsed time per iteration (s): 0.80 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.208720E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.810 | TFLOPs: 19.41 | 31: iteration 20580/ 173500 | consumed samples: 5268480 | consumed tokens: 10789847040 | elapsed time per iteration (s): 0.82 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.188505E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.573 | TFLOPs: 18.79 | 31: iteration 20590/ 173500 | consumed samples: 5271040 | consumed tokens: 10795089920 | elapsed time per iteration (s): 0.80 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.218499E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.076 | TFLOPs: 19.24 | 31: iteration 20600/ 173500 | consumed samples: 5273600 | consumed tokens: 10800332800 | elapsed time per iteration (s): 0.81 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.195356E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.690 | TFLOPs: 19.04 | 31: iteration 20610/ 173500 | consumed samples: 5276160 | consumed tokens: 10805575680 | elapsed time per iteration (s): 0.81 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.199463E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.883 | TFLOPs: 19.11 | 31: iteration 20620/ 173500 | consumed samples: 5278720 | consumed tokens: 10810818560 | elapsed time per iteration (s): 0.81 | learning rate: 1.947E-04 | global batch size: 256 | lm 
loss: 2.189470E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.783 | TFLOPs: 19.04 | 31: iteration 20630/ 173500 | consumed samples: 5281280 | consumed tokens: 10816061440 | elapsed time per iteration (s): 0.81 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.238144E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.619 | TFLOPs: 19.09 | 31: iteration 20640/ 173500 | consumed samples: 5283840 | consumed tokens: 10821304320 | elapsed time per iteration (s): 0.81 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.216216E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.828 | TFLOPs: 19.17 | 31: iteration 20650/ 173500 | consumed samples: 5286400 | consumed tokens: 10826547200 | elapsed time per iteration (s): 0.78 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.209704E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.483 | TFLOPs: 19.93 | 31: iteration 20660/ 173500 | consumed samples: 5288960 | consumed tokens: 10831790080 | elapsed time per iteration (s): 0.77 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.234853E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.308 | TFLOPs: 20.16 | 31: iteration 20670/ 173500 | consumed samples: 5291520 | consumed tokens: 10837032960 | elapsed time per iteration (s): 0.78 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.210465E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.257 | TFLOPs: 19.74 | 31: iteration 20680/ 173500 | consumed samples: 5294080 | consumed tokens: 
10842275840 | elapsed time per iteration (s): 0.76 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.193439E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.243 | TFLOPs: 20.34 | 31: iteration 20690/ 173500 | consumed samples: 5296640 | consumed tokens: 10847518720 | elapsed time per iteration (s): 0.76 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.203668E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.653 | TFLOPs: 20.25 | 31: iteration 20700/ 173500 | consumed samples: 5299200 | consumed tokens: 10852761600 | elapsed time per iteration (s): 0.84 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.214373E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.576 | TFLOPs: 18.43 | 31: iteration 20710/ 173500 | consumed samples: 5301760 | consumed tokens: 10858004480 | elapsed time per iteration (s): 0.76 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.187987E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.242 | TFLOPs: 20.34 | 31: iteration 20720/ 173500 | consumed samples: 5304320 | consumed tokens: 10863247360 | elapsed time per iteration (s): 0.76 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.193892E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.371 | TFLOPs: 20.29 | 31: iteration 20730/ 173500 | consumed samples: 5306880 | consumed tokens: 10868490240 | elapsed time per iteration (s): 0.75 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.207463E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
340.778 | TFLOPs: 20.62 | 31: iteration 20740/ 173500 | consumed samples: 5309440 | consumed tokens: 10873733120 | elapsed time per iteration (s): 0.76 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.193438E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.593 | TFLOPs: 20.36 | 31: iteration 20750/ 173500 | consumed samples: 5312000 | consumed tokens: 10878976000 | elapsed time per iteration (s): 0.76 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.207171E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.664 | TFLOPs: 20.31 | 31: iteration 20760/ 173500 | consumed samples: 5314560 | consumed tokens: 10884218880 | elapsed time per iteration (s): 0.75 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.212121E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.472 | TFLOPs: 20.72 | 31: iteration 20770/ 173500 | consumed samples: 5317120 | consumed tokens: 10889461760 | elapsed time per iteration (s): 0.83 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.203639E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.278 | TFLOPs: 18.59 | 31: iteration 20780/ 173500 | consumed samples: 5319680 | consumed tokens: 10894704640 | elapsed time per iteration (s): 1.02 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.203677E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.057 | TFLOPs: 15.13 | 31: iteration 20790/ 173500 | consumed samples: 5322240 | consumed tokens: 10899947520 | elapsed time per iteration (s): 0.73 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.207009E+00 | grad norm: 0.140 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.314 | TFLOPs: 21.07 | 31: iteration 20800/ 173500 | consumed samples: 5324800 | consumed tokens: 10905190400 | elapsed time per iteration (s): 0.74 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.229896E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.941 | TFLOPs: 20.87 | 31: iteration 20810/ 173500 | consumed samples: 5327360 | consumed tokens: 10910433280 | elapsed time per iteration (s): 0.72 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.203817E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.592 | TFLOPs: 21.63 | 31: iteration 20820/ 173500 | consumed samples: 5329920 | consumed tokens: 10915676160 | elapsed time per iteration (s): 0.78 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.204894E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.673 | TFLOPs: 19.82 | 31: iteration 20830/ 173500 | consumed samples: 5332480 | consumed tokens: 10920919040 | elapsed time per iteration (s): 0.75 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.172286E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.731 | TFLOPs: 20.67 | 31: iteration 20840/ 173500 | consumed samples: 5335040 | consumed tokens: 10926161920 | elapsed time per iteration (s): 0.72 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.194274E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.485 | TFLOPs: 21.51 | 31: iteration 20850/ 173500 | consumed samples: 5337600 | consumed tokens: 10931404800 | elapsed time per iteration (s): 
0.78 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.208738E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.510 | TFLOPs: 19.87 | 31: iteration 20860/ 173500 | consumed samples: 5340160 | consumed tokens: 10936647680 | elapsed time per iteration (s): 0.73 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.210820E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.376 | TFLOPs: 21.08 | 31: iteration 20870/ 173500 | consumed samples: 5342720 | consumed tokens: 10941890560 | elapsed time per iteration (s): 0.76 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.222053E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.009 | TFLOPs: 20.27 | 31: iteration 20880/ 173500 | consumed samples: 5345280 | consumed tokens: 10947133440 | elapsed time per iteration (s): 0.75 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.208941E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.502 | TFLOPs: 20.54 | 31: iteration 20890/ 173500 | consumed samples: 5347840 | consumed tokens: 10952376320 | elapsed time per iteration (s): 0.81 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.191505E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.793 | TFLOPs: 19.04 | 31: iteration 20900/ 173500 | consumed samples: 5350400 | consumed tokens: 10957619200 | elapsed time per iteration (s): 0.80 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.193721E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.058 | TFLOPs: 19.30 | 31: iteration 20910/ 
173500 | consumed samples: 5352960 | consumed tokens: 10962862080 | elapsed time per iteration (s): 0.79 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.191521E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.817 | TFLOPs: 19.65 | 31: iteration 20920/ 173500 | consumed samples: 5355520 | consumed tokens: 10968104960 | elapsed time per iteration (s): 0.81 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.196567E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.217 | TFLOPs: 19.13 | 31: iteration 20930/ 173500 | consumed samples: 5358080 | consumed tokens: 10973347840 | elapsed time per iteration (s): 0.77 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.200642E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.070 | TFLOPs: 20.03 | 31: iteration 20940/ 173500 | consumed samples: 5360640 | consumed tokens: 10978590720 | elapsed time per iteration (s): 0.76 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.175609E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.679 | TFLOPs: 20.31 | 31: iteration 20950/ 173500 | consumed samples: 5363200 | consumed tokens: 10983833600 | elapsed time per iteration (s): 0.83 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.189601E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.044 | TFLOPs: 18.58 | 31: iteration 20960/ 173500 | consumed samples: 5365760 | consumed tokens: 10989076480 | elapsed time per iteration (s): 0.79 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.194200E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 
0 | number of nan iterations: 0 | samples per second: 325.797 | TFLOPs: 19.71 | 31: iteration 20970/ 173500 | consumed samples: 5368320 | consumed tokens: 10994319360 | elapsed time per iteration (s): 0.81 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.187495E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.100 | TFLOPs: 19.12 | 31: iteration 20980/ 173500 | consumed samples: 5370880 | consumed tokens: 10999562240 | elapsed time per iteration (s): 0.78 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.175666E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.504 | TFLOPs: 19.87 | 31: iteration 20990/ 173500 | consumed samples: 5373440 | consumed tokens: 11004805120 | elapsed time per iteration (s): 0.78 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.202938E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.286 | TFLOPs: 19.86 | 31: iteration 21000/ 173500 | consumed samples: 5376000 | consumed tokens: 11010048000 | elapsed time per iteration (s): 0.80 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.207759E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.819 | TFLOPs: 19.29 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 21000 | lm loss value: 2.172492E+00 | lm loss PPL: 8.780135E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 21000 to checkpoints_1b1long 0: [2022-11-25 22:47:14,390] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step21000 is begin to save! 
0: [2022-11-25 22:47:14,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_01-model_00-model_states.pt... 0: [2022-11-25 22:47:14,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_01-model_00-model_states.pt. 0: [2022-11-25 22:47:14,630] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_03-model_00-model_states.pt... 0: [2022-11-25 22:47:14,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_03-model_00-model_states.pt. 0: [2022-11-25 22:47:14,710] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_04-model_00-model_states.pt... 0: [2022-11-25 22:47:14,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_04-model_00-model_states.pt. 0: [2022-11-25 22:47:14,787] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_05-model_00-model_states.pt... 0: [2022-11-25 22:47:14,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_05-model_00-model_states.pt. 0: [2022-11-25 22:47:14,861] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_06-model_00-model_states.pt... 0: [2022-11-25 22:47:14,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_06-model_00-model_states.pt. 0: [2022-11-25 22:47:14,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_07-model_00-model_states.pt... 0: [2022-11-25 22:47:15,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_07-model_00-model_states.pt. 
0: [2022-11-25 22:47:15,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_08-model_00-model_states.pt... 0: [2022-11-25 22:47:15,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_08-model_00-model_states.pt. 0: [2022-11-25 22:47:15,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_09-model_00-model_states.pt... 0: [2022-11-25 22:47:15,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_09-model_00-model_states.pt. 0: [2022-11-25 22:47:15,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_10-model_00-model_states.pt... 0: [2022-11-25 22:47:15,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_10-model_00-model_states.pt. 0: [2022-11-25 22:47:15,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_11-model_00-model_states.pt... 0: [2022-11-25 22:47:15,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_11-model_00-model_states.pt. 0: [2022-11-25 22:47:15,321] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_12-model_00-model_states.pt... 0: [2022-11-25 22:47:15,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_12-model_00-model_states.pt. 0: [2022-11-25 22:47:15,399] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_13-model_00-model_states.pt... 0: [2022-11-25 22:47:15,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_13-model_00-model_states.pt. 
0: [2022-11-25 22:47:15,477] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_14-model_00-model_states.pt... 0: [2022-11-25 22:47:15,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_14-model_00-model_states.pt. 0: [2022-11-25 22:47:15,553] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_15-model_00-model_states.pt... 0: [2022-11-25 22:47:15,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_15-model_00-model_states.pt. 0: [2022-11-25 22:47:15,628] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_16-model_00-model_states.pt... 0: [2022-11-25 22:47:15,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_16-model_00-model_states.pt. 0: [2022-11-25 22:47:15,701] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_17-model_00-model_states.pt... 0: [2022-11-25 22:47:15,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_17-model_00-model_states.pt. 0: [2022-11-25 22:47:15,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_18-model_00-model_states.pt... 0: [2022-11-25 22:47:15,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_18-model_00-model_states.pt. 0: [2022-11-25 22:47:15,852] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_19-model_00-model_states.pt... 0: [2022-11-25 22:47:15,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_19-model_00-model_states.pt. 
0: [2022-11-25 22:47:15,927] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_20-model_00-model_states.pt... 0: [2022-11-25 22:47:16,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_20-model_00-model_states.pt. 0: [2022-11-25 22:47:16,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_21-model_00-model_states.pt... 0: [2022-11-25 22:47:16,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_21-model_00-model_states.pt. 0: [2022-11-25 22:47:16,075] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_22-model_00-model_states.pt... 0: [2022-11-25 22:47:16,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_22-model_00-model_states.pt. 0: [2022-11-25 22:47:16,149] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_23-model_00-model_states.pt... 0: [2022-11-25 22:47:16,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_23-model_00-model_states.pt. 0: [2022-11-25 22:47:16,229] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_24-model_00-model_states.pt... 0: [2022-11-25 22:47:16,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_24-model_00-model_states.pt. 0: [2022-11-25 22:47:16,302] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_25-model_00-model_states.pt... 0: [2022-11-25 22:47:16,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_25-model_00-model_states.pt. 
0: [2022-11-25 22:47:16,379] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_26-model_00-model_states.pt... 0: [2022-11-25 22:47:16,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_26-model_00-model_states.pt. 0: [2022-11-25 22:47:16,464] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_27-model_00-model_states.pt... 0: [2022-11-25 22:47:16,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_27-model_00-model_states.pt. 0: [2022-11-25 22:47:16,539] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_28-model_00-model_states.pt... 0: [2022-11-25 22:47:16,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_28-model_00-model_states.pt. 0: [2022-11-25 22:47:16,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/layer_30-model_00-model_states.pt... 0: [2022-11-25 22:47:16,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/layer_30-model_00-model_states.pt. 0: [2022-11-25 22:47:16,617] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step21000/mp_rank_00_model_states.pt 0: [2022-11-25 22:47:16,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/mp_rank_00_model_states.pt... 0: [2022-11-25 22:47:16,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/mp_rank_00_model_states.pt. 0: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
6: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
9: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
1: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
2: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
20: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
11: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 
31: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 
22: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 
18: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 
0: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
4: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 
8: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
3: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
15: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 
11: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 
31: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 
21: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 
27: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 
8: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
20: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 29: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 
22: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
8: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 16: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 28: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 22: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 
30: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 18: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 30: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 26: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 
7: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 25: [2022-11-25 22:47:16,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:47:16,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:47:16,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 22:47:16,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 0: [2022-11-25 22:47:16,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:47:16,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:47:16,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 22:47:16,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 9: [2022-11-25 22:47:16,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:47:16,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 22:47:16,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
22: [2022-11-25 22:47:16,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:47:16,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-25 22:47:16,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 30: [2022-11-25 22:47:16,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-25 22:47:16,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-25 22:47:16,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 19: [2022-11-25 22:47:16,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-25 22:47:16,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-25 22:47:16,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 0: [2022-11-25 22:47:16,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:47:16,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 22:47:16,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
25: [2022-11-25 22:47:16,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-25 22:47:16,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-25 22:47:16,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 3: [2022-11-25 22:47:16,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:47:16,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 22:47:16,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 19: [2022-11-25 22:47:16,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-25 22:47:16,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 1: [2022-11-25 22:47:16,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 19: [2022-11-25 22:47:16,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 1: [2022-11-25 22:47:16,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 8: [2022-11-25 22:47:16,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
1: [2022-11-25 22:47:16,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 8: [2022-11-25 22:47:16,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 22:47:16,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 7: [2022-11-25 22:47:16,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 30: [2022-11-25 22:47:16,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:47:16,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 30: [2022-11-25 22:47:16,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 7: [2022-11-25 22:47:16,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 30: [2022-11-25 22:47:16,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 0: [2022-11-25 22:47:16,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:47:16,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 22:47:16,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
9: [2022-11-25 22:47:16,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:47:16,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 20: [2022-11-25 22:47:16,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:47:16,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 20: [2022-11-25 22:47:16,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:47:16,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-25 22:47:16,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-25 22:47:16,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 6: [2022-11-25 22:47:16,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:47:16,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 6: [2022-11-25 22:47:16,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 22:47:16,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
31: [2022-11-25 22:47:16,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 30: [2022-11-25 22:47:16,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 31: [2022-11-25 22:47:16,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-25 22:47:16,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 30: [2022-11-25 22:47:16,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-25 22:47:16,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 8: [2022-11-25 22:47:16,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:47:16,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 31: [2022-11-25 22:47:16,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:47:16,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 31: [2022-11-25 22:47:16,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-25 22:47:16,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
24: [2022-11-25 22:47:16,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 16: [2022-11-25 22:47:16,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:47:16,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 16: [2022-11-25 22:47:16,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 22: [2022-11-25 22:47:16,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 16: [2022-11-25 22:47:16,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 22: [2022-11-25 22:47:16,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 24: [2022-11-25 22:47:16,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 16: [2022-11-25 22:47:16,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-25 22:47:16,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-25 22:47:16,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 24: [2022-11-25 22:47:16,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
24: [2022-11-25 22:47:16,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:47:16,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 24: [2022-11-25 22:47:16,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 6: [2022-11-25 22:47:16,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 24: [2022-11-25 22:47:16,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 6: [2022-11-25 22:47:16,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 25: [2022-11-25 22:47:16,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-25 22:47:16,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-25 22:47:16,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 6: [2022-11-25 22:47:16,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:47:16,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 22:47:16,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
8: [2022-11-25 22:47:16,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:47:16,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 22:47:16,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 24: [2022-11-25 22:47:16,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-25 22:47:16,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 22: [2022-11-25 22:47:16,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:47:16,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-25 22:47:16,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 24: [2022-11-25 22:47:16,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 31: [2022-11-25 22:47:16,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-25 22:47:16,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-25 22:47:16,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
20: [2022-11-25 22:47:16,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:47:16,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-25 22:47:16,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 13: [2022-11-25 22:47:16,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:47:16,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:47:16,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:47:16,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 22:47:16,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 22:47:16,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 13: [2022-11-25 22:47:16,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
1: [2022-11-25 22:47:16,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 31: [2022-11-25 22:47:16,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:47:16,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 31: [2022-11-25 22:47:16,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-25 22:47:16,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 4: [2022-11-25 22:47:16,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:47:16,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:47:16,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 22:47:16,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 4: [2022-11-25 22:47:16,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 22:47:16,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 21: [2022-11-25 22:47:16,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-25 22:47:16,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-25 22:47:16,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-25 22:47:16,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-25 22:47:16,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 21: [2022-11-25 22:47:16,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 30: [2022-11-25 22:47:16,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-25 22:47:16,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-25 22:47:16,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 9: [2022-11-25 22:47:16,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:47:16,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 22:47:16,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 19: [2022-11-25 22:47:16,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
19: [2022-11-25 22:47:16,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-25 22:47:16,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-25 22:47:16,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-25 22:47:16,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 19: [2022-11-25 22:47:16,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 1: [2022-11-25 22:47:16,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:47:16,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:47:16,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 22:47:16,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 1: [2022-11-25 22:47:16,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 22:47:16,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 9: [2022-11-25 22:47:16,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-25 22:47:16,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 22:47:16,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 20: [2022-11-25 22:47:16,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:47:16,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-25 22:47:16,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 23: [2022-11-25 22:47:16,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:47:16,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 23: [2022-11-25 22:47:16,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 7: [2022-11-25 22:47:16,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 23: [2022-11-25 22:47:16,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 7: [2022-11-25 22:47:16,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 8: [2022-11-25 22:47:16,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-25 22:47:16,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 22:47:16,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 7: [2022-11-25 22:47:16,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:47:16,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 22:47:16,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 0: [2022-11-25 22:47:16,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:47:16,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 22:47:16,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 21: [2022-11-25 22:47:16,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-25 22:47:16,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-25 22:47:16,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 3: [2022-11-25 22:47:16,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
13: [2022-11-25 22:47:16,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:47:16,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 3: [2022-11-25 22:47:16,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 13: [2022-11-25 22:47:16,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 3: [2022-11-25 22:47:16,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 13: [2022-11-25 22:47:16,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:47:16,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 22:47:16,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 10: [2022-11-25 22:47:16,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:47:16,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:47:16,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-25 22:47:16,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 6: [2022-11-25 22:47:16,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:47:16,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 22:47:16,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 22:47:16,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 10: [2022-11-25 22:47:16,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 10: [2022-11-25 22:47:16,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 6: [2022-11-25 22:47:16,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 12: [2022-11-25 22:47:16,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:47:16,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 12: [2022-11-25 22:47:16,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 22:47:16,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
3: [2022-11-25 22:47:16,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:47:16,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 22: [2022-11-25 22:47:16,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:47:16,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 22: [2022-11-25 22:47:16,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-25 22:47:16,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 6: [2022-11-25 22:47:16,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:47:16,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:47:16,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 4: [2022-11-25 22:47:16,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 6: [2022-11-25 22:47:16,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 4: [2022-11-25 22:47:16,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
1: [2022-11-25 22:47:16,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:47:16,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:47:16,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:47:16,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-25 22:47:16,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 1: [2022-11-25 22:47:16,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 20: [2022-11-25 22:47:16,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 1: [2022-11-25 22:47:16,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 24: [2022-11-25 22:47:16,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-25 22:47:16,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 20: [2022-11-25 22:47:16,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 24: [2022-11-25 22:47:16,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
25: [2022-11-25 22:47:16,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-25 22:47:16,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-25 22:47:16,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 17: [2022-11-25 22:47:16,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-25 22:47:16,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-25 22:47:16,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 17: [2022-11-25 22:47:16,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-25 22:47:16,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-25 22:47:16,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 17: [2022-11-25 22:47:16,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-25 22:47:16,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 29: [2022-11-25 22:47:16,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 17: [2022-11-25 22:47:16,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 17: [2022-11-25 22:47:16,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:47:16,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 17: [2022-11-25 22:47:16,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 29: [2022-11-25 22:47:16,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 17: [2022-11-25 22:47:16,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 29: [2022-11-25 22:47:16,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-25 22:47:16,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 29: [2022-11-25 22:47:16,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 16: [2022-11-25 22:47:16,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
16: [2022-11-25 22:47:16,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-25 22:47:16,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 16: [2022-11-25 22:47:16,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-25 22:47:16,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-25 22:47:16,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 2: [2022-11-25 22:47:16,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:47:16,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:47:16,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:47:16,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-25 22:47:16,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 22:47:16,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 22:47:16,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 22:47:16,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 22:47:16,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 2: [2022-11-25 22:47:16,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 2: [2022-11-25 22:47:16,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 2: [2022-11-25 22:47:16,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 23: [2022-11-25 22:47:16,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-25 22:47:16,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-25 22:47:16,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 25: [2022-11-25 22:47:16,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 
25: [2022-11-25 22:47:16,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-25 22:47:16,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 30: [2022-11-25 22:47:16,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-25 22:47:16,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-25 22:47:16,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 26: [2022-11-25 22:47:16,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:47:16,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:47:16,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
26: [2022-11-25 22:47:16,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-25 22:47:16,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-25 22:47:16,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-25 22:47:16,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 26: [2022-11-25 22:47:16,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 26: [2022-11-25 22:47:16,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 11: [2022-11-25 22:47:16,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:47:16,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:47:16,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:47:16,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-25 22:47:16,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 22:47:16,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 22:47:16,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 11: [2022-11-25 22:47:16,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 22:47:16,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 11: [2022-11-25 22:47:16,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 23: [2022-11-25 22:47:16,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:47:16,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 11: [2022-11-25 22:47:16,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 23: [2022-11-25 22:47:16,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-25 22:47:16,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 12: [2022-11-25 22:47:16,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-25 22:47:16,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 22:47:16,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 3: [2022-11-25 22:47:16,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:47:16,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 22:47:16,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 31: [2022-11-25 22:47:16,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-25 22:47:16,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-25 22:47:16,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 27: [2022-11-25 22:47:16,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:47:16,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:47:16,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
27: [2022-11-25 22:47:16,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-25 22:47:16,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-25 22:47:16,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 27: [2022-11-25 22:47:16,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 27: [2022-11-25 22:47:16,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-25 22:47:16,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 21: [2022-11-25 22:47:16,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-25 22:47:16,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-25 22:47:16,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-25 22:47:16,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 28: [2022-11-25 22:47:16,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
28: [2022-11-25 22:47:16,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-25 22:47:16,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-25 22:47:16,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 21: [2022-11-25 22:47:16,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 21: [2022-11-25 22:47:16,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 28: [2022-11-25 22:47:16,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-25 22:47:16,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-25 22:47:16,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-25 22:47:16,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-25 22:47:16,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 28: [2022-11-25 22:47:16,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 28: [2022-11-25 22:47:16,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
28: [2022-11-25 22:47:16,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 16: [2022-11-25 22:47:16,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-25 22:47:16,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-25 22:47:16,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 7: [2022-11-25 22:47:16,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:47:16,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 22:47:16,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 0: [2022-11-25 22:47:16,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:47:16,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 22:47:16,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 14: [2022-11-25 22:47:16,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-25 22:47:16,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 22:47:16,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 18: [2022-11-25 22:47:16,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:47:16,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:47:16,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:47:16,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-25 22:47:16,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-25 22:47:16,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-25 22:47:16,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 18: [2022-11-25 22:47:16,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 18: [2022-11-25 22:47:16,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 18: [2022-11-25 22:47:16,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-25 22:47:16,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-25 22:47:16,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 18: [2022-11-25 22:47:16,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:47:16,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-25 22:47:16,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 0: [2022-11-25 22:47:16,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 22:47:16,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 9: [2022-11-25 22:47:16,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:47:16,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 22:47:16,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 14: [2022-11-25 22:47:16,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-25 22:47:16,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 22:47:16,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 13: [2022-11-25 22:47:16,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:47:16,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 22:47:16,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 25: [2022-11-25 22:47:16,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:47:16,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 25: [2022-11-25 22:47:16,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 1: [2022-11-25 22:47:16,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 25: [2022-11-25 22:47:16,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 1: [2022-11-25 22:47:16,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 5: [2022-11-25 22:47:16,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-25 22:47:16,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:47:16,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:47:16,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:47:16,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:47:16,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 22:47:16,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 22:47:16,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 22:47:16,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 22:47:16,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 22:47:16,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 5: [2022-11-25 22:47:16,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
5: [2022-11-25 22:47:16,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 5: [2022-11-25 22:47:16,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 5: [2022-11-25 22:47:16,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 15: [2022-11-25 22:47:16,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:47:16,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:47:16,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:47:16,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:47:16,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-25 22:47:16,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 22:47:16,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 22:47:16,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 22:47:16,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 22:47:16,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 22:47:16,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 15: [2022-11-25 22:47:16,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 15: [2022-11-25 22:47:16,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 15: [2022-11-25 22:47:16,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 15: [2022-11-25 22:47:16,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 28: [2022-11-25 22:47:16,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 
28: [2022-11-25 22:47:16,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-25 22:47:16,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 3: [2022-11-25 22:47:16,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:47:16,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 22:47:16,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 29: [2022-11-25 22:47:16,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:47:16,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-25 22:47:16,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 19: [2022-11-25 22:47:16,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-25 22:47:16,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-25 22:47:16,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 26: [2022-11-25 22:47:16,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
26: [2022-11-25 22:47:16,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-25 22:47:16,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 4: [2022-11-25 22:47:16,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:47:16,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 22:47:16,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 2: [2022-11-25 22:47:16,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:47:16,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 22:47:16,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 14: [2022-11-25 22:47:16,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:47:16,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 22:47:16,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 23: [2022-11-25 22:47:16,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
23: [2022-11-25 22:47:16,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-25 22:47:16,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 8: [2022-11-25 22:47:16,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:47:16,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 22:47:16,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 22: [2022-11-25 22:47:16,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:47:16,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-25 22:47:16,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 14: [2022-11-25 22:47:16,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:47:16,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 22:47:16,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 24: [2022-11-25 22:47:16,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
24: [2022-11-25 22:47:16,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 11: [2022-11-25 22:47:16,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:47:16,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 22:47:16,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 24: [2022-11-25 22:47:16,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 12: [2022-11-25 22:47:16,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:47:16,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 22:47:16,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 10: [2022-11-25 22:47:16,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:47:16,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 22:47:16,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 27: [2022-11-25 22:47:16,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
27: [2022-11-25 22:47:16,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-25 22:47:16,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 6: [2022-11-25 22:47:16,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:47:16,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 22:47:16,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 3: [2022-11-25 22:47:16,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:47:16,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 22:47:16,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 7: [2022-11-25 22:47:16,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:47:16,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 22:47:16,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 5: [2022-11-25 22:47:16,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-25 22:47:16,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 22:47:16,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 30: [2022-11-25 22:47:16,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-25 22:47:16,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-25 22:47:16,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 16: [2022-11-25 22:47:16,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-25 22:47:16,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-25 22:47:16,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 0: [2022-11-25 22:47:16,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:47:16,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 22:47:16,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 31: [2022-11-25 22:47:16,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-25 22:47:16,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-25 22:47:16,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 20: [2022-11-25 22:47:16,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:47:16,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-25 22:47:16,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 21: [2022-11-25 22:47:16,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-25 22:47:16,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-25 22:47:16,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 15: [2022-11-25 22:47:16,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:47:16,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 22:47:16,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 17: [2022-11-25 22:47:16,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-25 22:47:16,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-25 22:47:16,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 9: [2022-11-25 22:47:16,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:47:16,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 22:47:16,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 13: [2022-11-25 22:47:16,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:47:16,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 22:47:16,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 2: [2022-11-25 22:47:16,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:47:16,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
2: [2022-11-25 22:47:16,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 26: [2022-11-25 22:47:16,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 2: [2022-11-25 22:47:16,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 26: [2022-11-25 22:47:16,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 25: [2022-11-25 22:47:16,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 19: [2022-11-25 22:47:16,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-25 22:47:16,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-25 22:47:16,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 25: [2022-11-25 22:47:16,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-25 22:47:16,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 18: [2022-11-25 22:47:16,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
18: [2022-11-25 22:47:16,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-25 22:47:16,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 29: [2022-11-25 22:47:16,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:47:16,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-25 22:47:16,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 4: [2022-11-25 22:47:16,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:47:16,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 22:47:16,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 28: [2022-11-25 22:47:16,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-25 22:47:16,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-25 22:47:16,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 23: [2022-11-25 22:47:16,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-25 22:47:16,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-25 22:47:16,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 1: [2022-11-25 22:47:16,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:47:16,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 22:47:16,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 14: [2022-11-25 22:47:16,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:47:16,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 22:47:16,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 8: [2022-11-25 22:47:16,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:47:16,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 22:47:16,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 11: [2022-11-25 22:47:16,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-25 22:47:16,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 22:47:16,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 10: [2022-11-25 22:47:16,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:47:16,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 22:47:16,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 12: [2022-11-25 22:47:16,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:47:16,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:47:16,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 22:47:16,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 22: [2022-11-25 22:47:16,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-25 22:47:16,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 27: [2022-11-25 22:47:16,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
27: [2022-11-25 22:47:16,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-25 22:47:16,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 24: [2022-11-25 22:47:16,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-25 22:47:16,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 17: [2022-11-25 22:47:16,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 24: [2022-11-25 22:47:16,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 17: [2022-11-25 22:47:16,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-25 22:47:16,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 0: [2022-11-25 22:47:16,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:47:16,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 22:47:16,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 30: [2022-11-25 22:47:16,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
6: [2022-11-25 22:47:16,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:47:16,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 30: [2022-11-25 22:47:16,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 5: [2022-11-25 22:47:16,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 30: [2022-11-25 22:47:16,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 6: [2022-11-25 22:47:16,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 5: [2022-11-25 22:47:16,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 6: [2022-11-25 22:47:16,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 7: [2022-11-25 22:47:16,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:47:16,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 22:47:16,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 20: [2022-11-25 22:47:16,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
31: [2022-11-25 22:47:16,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:47:16,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-25 22:47:16,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 31: [2022-11-25 22:47:16,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-25 22:47:16,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 21: [2022-11-25 22:47:16,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-25 22:47:16,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-25 22:47:16,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 3: [2022-11-25 22:47:16,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:47:16,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 22:47:16,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 15: [2022-11-25 22:47:16,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-25 22:47:16,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 22:47:16,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 13: [2022-11-25 22:47:16,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:47:16,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 22:47:16,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 16: [2022-11-25 22:47:16,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-25 22:47:16,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-25 22:47:16,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 18: [2022-11-25 22:47:16,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 19: [2022-11-25 22:47:16,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:47:16,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-25 22:47:16,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
19: [2022-11-25 22:47:16,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-25 22:47:16,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 28: [2022-11-25 22:47:16,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:47:16,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 28: [2022-11-25 22:47:16,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 2: [2022-11-25 22:47:16,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 28: [2022-11-25 22:47:16,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 9: [2022-11-25 22:47:16,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 2: [2022-11-25 22:47:16,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 9: [2022-11-25 22:47:16,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 2: [2022-11-25 22:47:16,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 25: [2022-11-25 22:47:16,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
25: [2022-11-25 22:47:16,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 4: [2022-11-25 22:47:16,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 25: [2022-11-25 22:47:16,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 4: [2022-11-25 22:47:16,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 22:47:16,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 22: [2022-11-25 22:47:16,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:47:16,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:47:16,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-25 22:47:16,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 0: [2022-11-25 22:47:16,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 22:47:16,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 29: [2022-11-25 22:47:16,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-25 22:47:16,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-25 22:47:16,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 1: [2022-11-25 22:47:16,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:47:16,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 22:47:16,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 18: [2022-11-25 22:47:16,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:47:16,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 24: [2022-11-25 22:47:16,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:47:16,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 31: [2022-11-25 22:47:16,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 18: [2022-11-25 22:47:16,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
6: [2022-11-25 22:47:16,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 24: [2022-11-25 22:47:16,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 6: [2022-11-25 22:47:16,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 24: [2022-11-25 22:47:16,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 31: [2022-11-25 22:47:16,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-25 22:47:16,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 27: [2022-11-25 22:47:16,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:47:16,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 21: [2022-11-25 22:47:16,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:47:16,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 21: [2022-11-25 22:47:16,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 10: [2022-11-25 22:47:16,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
21: [2022-11-25 22:47:16,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 10: [2022-11-25 22:47:16,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 22:47:16,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 30: [2022-11-25 22:47:16,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:47:16,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 30: [2022-11-25 22:47:16,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 14: [2022-11-25 22:47:16,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 30: [2022-11-25 22:47:16,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 26: [2022-11-25 22:47:16,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:47:16,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 16: [2022-11-25 22:47:16,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:47:16,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
26: [2022-11-25 22:47:16,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-25 22:47:16,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 8: [2022-11-25 22:47:16,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 22:47:16,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 16: [2022-11-25 22:47:16,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-25 22:47:16,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 13: [2022-11-25 22:47:16,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:47:16,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 22:47:16,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 15: [2022-11-25 22:47:16,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:47:16,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
15: [2022-11-25 22:47:16,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 22:47:16,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 19: [2022-11-25 22:47:16,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:47:16,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 22:47:16,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 19: [2022-11-25 22:47:16,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-25 22:47:16,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 20: [2022-11-25 22:47:16,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-25 22:47:16,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-25 22:47:16,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 12: [2022-11-25 22:47:16,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-25 22:47:16,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 22:47:16,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 28: [2022-11-25 22:47:16,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-25 22:47:16,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 29: [2022-11-25 22:47:16,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 28: [2022-11-25 22:47:16,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 29: [2022-11-25 22:47:16,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-25 22:47:16,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 2: [2022-11-25 22:47:16,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:47:16,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
2: [2022-11-25 22:47:16,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 17: [2022-11-25 22:47:16,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:47:16,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 26: [2022-11-25 22:47:16,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:47:16,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 17: [2022-11-25 22:47:16,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 1: [2022-11-25 22:47:16,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 26: [2022-11-25 22:47:16,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-25 22:47:16,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 17: [2022-11-25 22:47:16,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 5: [2022-11-25 22:47:16,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-25 22:47:16,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 22:47:16,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 7: [2022-11-25 22:47:16,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:47:16,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:47:16,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 3: [2022-11-25 22:47:16,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 7: [2022-11-25 22:47:16,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 3: [2022-11-25 22:47:16,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 23: [2022-11-25 22:47:16,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:47:16,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 23: [2022-11-25 22:47:16,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-25 22:47:16,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
4: [2022-11-25 22:47:16,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 22:47:16,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 11: [2022-11-25 22:47:16,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:47:16,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 22:47:16,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 8: [2022-11-25 22:47:16,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:47:16,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 22:47:16,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 22: [2022-11-25 22:47:16,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-25 22:47:16,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-25 22:47:16,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 23: [2022-11-25 22:47:16,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-25 22:47:16,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-25 22:47:16,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 25: [2022-11-25 22:47:16,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-25 22:47:16,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-25 22:47:16,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 10: [2022-11-25 22:47:16,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:47:16,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 22:47:16,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 14: [2022-11-25 22:47:16,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:47:16,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 22:47:16,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 27: [2022-11-25 22:47:16,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
27: [2022-11-25 22:47:16,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-25 22:47:16,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 4: [2022-11-25 22:47:16,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 24: [2022-11-25 22:47:16,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-25 22:47:16,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-25 22:47:16,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 4: [2022-11-25 22:47:16,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 22:47:16,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 23: [2022-11-25 22:47:16,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-25 22:47:16,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-25 22:47:16,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 11: [2022-11-25 22:47:16,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-25 22:47:16,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 22:47:16,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 29: [2022-11-25 22:47:16,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-25 22:47:16,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-25 22:47:16,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 27: [2022-11-25 22:47:16,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-25 22:47:16,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-25 22:47:16,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 26: [2022-11-25 22:47:16,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-25 22:47:16,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-25 22:47:16,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 12: [2022-11-25 22:47:16,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-25 22:47:16,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 22:47:16,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 17: [2022-11-25 22:47:16,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-25 22:47:16,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-25 22:47:16,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 10: [2022-11-25 22:47:16,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:47:16,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 22:47:16,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 12: [2022-11-25 22:47:16,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:47:16,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step21000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 22:47:16,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
0: successfully saved checkpoint at iteration 21000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2542.71 31: iteration 21010/ 173500 | consumed samples: 5378560 | consumed tokens: 11015290880 | elapsed time per iteration (s): 1.01 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.217666E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 252.780 | TFLOPs: 15.29 | 31: iteration 21020/ 173500 | consumed samples: 5381120 | consumed tokens: 11020533760 | elapsed time per iteration (s): 0.82 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.210642E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.865 | TFLOPs: 18.81 | 31: iteration 21030/ 173500 | consumed samples: 5383680 | consumed tokens: 11025776640 | elapsed time per iteration (s): 0.79 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.226899E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.339 | TFLOPs: 19.56 | 31: iteration 21040/ 173500 | consumed samples: 5386240 | consumed tokens: 11031019520 | elapsed time per iteration (s): 0.79 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.184129E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.111 | TFLOPs: 19.55 | 31: iteration 21050/ 173500 | consumed samples: 5388800 | consumed tokens: 11036262400 | elapsed time per iteration (s): 0.80 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.201941E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.009 | TFLOPs: 19.36 | 31: iteration 21060/ 173500 | consumed samples: 5391360 | consumed tokens: 11041505280 | elapsed time per iteration (s): 0.87 | 
learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.228381E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.479 | TFLOPs: 17.88 | 31: iteration 21070/ 173500 | consumed samples: 5393920 | consumed tokens: 11046748160 | elapsed time per iteration (s): 0.81 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.181924E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.973 | TFLOPs: 19.12 | 31: iteration 21080/ 173500 | consumed samples: 5396480 | consumed tokens: 11051991040 | elapsed time per iteration (s): 0.85 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.178365E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.686 | TFLOPs: 18.31 | 31: iteration 21090/ 173500 | consumed samples: 5399040 | consumed tokens: 11057233920 | elapsed time per iteration (s): 0.77 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.181089E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.647 | TFLOPs: 20.06 | 31: iteration 21100/ 173500 | consumed samples: 5401600 | consumed tokens: 11062476800 | elapsed time per iteration (s): 0.83 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.185768E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.344 | TFLOPs: 18.59 | 31: iteration 21110/ 173500 | consumed samples: 5404160 | consumed tokens: 11067719680 | elapsed time per iteration (s): 0.84 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.220589E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.170 | TFLOPs: 18.52 | 31: iteration 21120/ 173500 
| consumed samples: 5406720 | consumed tokens: 11072962560 | elapsed time per iteration (s): 0.77 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.220141E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.626 | TFLOPs: 20.06 | 31: iteration 21130/ 173500 | consumed samples: 5409280 | consumed tokens: 11078205440 | elapsed time per iteration (s): 0.81 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.182895E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.815 | TFLOPs: 19.11 | 31: iteration 21140/ 173500 | consumed samples: 5411840 | consumed tokens: 11083448320 | elapsed time per iteration (s): 0.76 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.199135E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.789 | TFLOPs: 20.31 | 31: iteration 21150/ 173500 | consumed samples: 5414400 | consumed tokens: 11088691200 | elapsed time per iteration (s): 0.76 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.216741E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.813 | TFLOPs: 20.50 | 31: iteration 21160/ 173500 | consumed samples: 5416960 | consumed tokens: 11093934080 | elapsed time per iteration (s): 0.77 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.203190E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.007 | TFLOPs: 20.09 | 31: iteration 21170/ 173500 | consumed samples: 5419520 | consumed tokens: 11099176960 | elapsed time per iteration (s): 0.77 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.183555E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 332.674 | TFLOPs: 20.13 | 31: iteration 21180/ 173500 | consumed samples: 5422080 | consumed tokens: 11104419840 | elapsed time per iteration (s): 0.75 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.183557E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.765 | TFLOPs: 20.55 | 31: iteration 21190/ 173500 | consumed samples: 5424640 | consumed tokens: 11109662720 | elapsed time per iteration (s): 0.80 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.200076E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.231 | TFLOPs: 19.43 | 31: iteration 21200/ 173500 | consumed samples: 5427200 | consumed tokens: 11114905600 | elapsed time per iteration (s): 0.79 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.186202E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.471 | TFLOPs: 19.51 | 31: iteration 21210/ 173500 | consumed samples: 5429760 | consumed tokens: 11120148480 | elapsed time per iteration (s): 0.81 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.178220E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.686 | TFLOPs: 19.16 | 31: iteration 21220/ 173500 | consumed samples: 5432320 | consumed tokens: 11125391360 | elapsed time per iteration (s): 0.76 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.193895E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.643 | TFLOPs: 20.31 | 31: iteration 21230/ 173500 | consumed samples: 5434880 | consumed tokens: 11130634240 | elapsed time per iteration (s): 0.75 | learning rate: 1.943E-04 | global batch size: 
256 | lm loss: 2.175222E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.852 | TFLOPs: 20.68 | 31: iteration 21240/ 173500 | consumed samples: 5437440 | consumed tokens: 11135877120 | elapsed time per iteration (s): 0.77 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.164417E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.343 | TFLOPs: 20.05 | 31: iteration 21250/ 173500 | consumed samples: 5440000 | consumed tokens: 11141120000 | elapsed time per iteration (s): 0.76 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.226729E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.623 | TFLOPs: 20.49 | 31: iteration 21260/ 173500 | consumed samples: 5442560 | consumed tokens: 11146362880 | elapsed time per iteration (s): 0.77 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.202244E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.040 | TFLOPs: 20.03 | 31: iteration 21270/ 173500 | consumed samples: 5445120 | consumed tokens: 11151605760 | elapsed time per iteration (s): 0.77 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.180523E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.889 | TFLOPs: 20.14 | 31: iteration 21280/ 173500 | consumed samples: 5447680 | consumed tokens: 11156848640 | elapsed time per iteration (s): 0.78 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.194203E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.090 | TFLOPs: 19.97 | 31: iteration 21290/ 173500 | consumed samples: 5450240 | consumed 
tokens: 11162091520 | elapsed time per iteration (s): 0.74 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.237410E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.062 | TFLOPs: 21.06 | 31: iteration 21300/ 173500 | consumed samples: 5452800 | consumed tokens: 11167334400 | elapsed time per iteration (s): 0.82 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.191020E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.966 | TFLOPs: 18.87 | 31: iteration 21310/ 173500 | consumed samples: 5455360 | consumed tokens: 11172577280 | elapsed time per iteration (s): 0.73 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.184531E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.394 | TFLOPs: 21.08 | 31: iteration 21320/ 173500 | consumed samples: 5457920 | consumed tokens: 11177820160 | elapsed time per iteration (s): 0.85 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.200510E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.327 | TFLOPs: 18.29 | 31: iteration 21330/ 173500 | consumed samples: 5460480 | consumed tokens: 11183063040 | elapsed time per iteration (s): 0.75 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.180284E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.478 | TFLOPs: 20.60 | 31: iteration 21340/ 173500 | consumed samples: 5463040 | consumed tokens: 11188305920 | elapsed time per iteration (s): 0.77 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.189534E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 330.763 | TFLOPs: 20.01 | 31: iteration 21350/ 173500 | consumed samples: 5465600 | consumed tokens: 11193548800 | elapsed time per iteration (s): 0.78 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.214693E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.842 | TFLOPs: 19.89 | 31: iteration 21360/ 173500 | consumed samples: 5468160 | consumed tokens: 11198791680 | elapsed time per iteration (s): 0.74 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.214170E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.230 | TFLOPs: 21.01 | 31: iteration 21370/ 173500 | consumed samples: 5470720 | consumed tokens: 11204034560 | elapsed time per iteration (s): 0.76 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.217057E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.951 | TFLOPs: 20.32 | 31: iteration 21380/ 173500 | consumed samples: 5473280 | consumed tokens: 11209277440 | elapsed time per iteration (s): 0.78 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.165837E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.306 | TFLOPs: 19.86 | 31: iteration 21390/ 173500 | consumed samples: 5475840 | consumed tokens: 11214520320 | elapsed time per iteration (s): 0.72 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.200153E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.708 | TFLOPs: 21.52 | 31: iteration 21400/ 173500 | consumed samples: 5478400 | consumed tokens: 11219763200 | elapsed time per iteration (s): 0.81 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.193893E+00 | grad norm: 
0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.736 | TFLOPs: 19.16 | 31: iteration 21410/ 173500 | consumed samples: 5480960 | consumed tokens: 11225006080 | elapsed time per iteration (s): 0.75 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.217408E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.150 | TFLOPs: 20.70 | 31: iteration 21420/ 173500 | consumed samples: 5483520 | consumed tokens: 11230248960 | elapsed time per iteration (s): 0.74 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.195751E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.758 | TFLOPs: 20.80 | 31: iteration 21430/ 173500 | consumed samples: 5486080 | consumed tokens: 11235491840 | elapsed time per iteration (s): 0.73 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.225146E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.016 | TFLOPs: 21.36 | 31: iteration 21440/ 173500 | consumed samples: 5488640 | consumed tokens: 11240734720 | elapsed time per iteration (s): 0.73 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.212724E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.339 | TFLOPs: 21.07 | 31: iteration 21450/ 173500 | consumed samples: 5491200 | consumed tokens: 11245977600 | elapsed time per iteration (s): 0.72 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.203486E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.539 | TFLOPs: 21.39 | 31: iteration 21460/ 173500 | consumed samples: 5493760 | consumed tokens: 11251220480 | elapsed time per 
iteration (s): 0.80 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.182855E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.217 | TFLOPs: 19.37 | 31: iteration 21470/ 173500 | consumed samples: 5496320 | consumed tokens: 11256463360 | elapsed time per iteration (s): 0.78 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.184115E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.943 | TFLOPs: 19.84 | 31: iteration 21480/ 173500 | consumed samples: 5498880 | consumed tokens: 11261706240 | elapsed time per iteration (s): 0.75 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.190714E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.851 | TFLOPs: 20.56 | 31: iteration 21490/ 173500 | consumed samples: 5501440 | consumed tokens: 11266949120 | elapsed time per iteration (s): 0.76 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.219317E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.567 | TFLOPs: 20.36 | 31: iteration 21500/ 173500 | consumed samples: 5504000 | consumed tokens: 11272192000 | elapsed time per iteration (s): 0.77 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.197461E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.852 | TFLOPs: 20.20 | 31: iteration 21510/ 173500 | consumed samples: 5506560 | consumed tokens: 11277434880 | elapsed time per iteration (s): 0.74 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.163557E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.290 | TFLOPs: 20.83 | 31: 
iteration 21520/ 173500 | consumed samples: 5509120 | consumed tokens: 11282677760 | elapsed time per iteration (s): 0.77 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.204578E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.311 | TFLOPs: 20.22 | 31: iteration 21530/ 173500 | consumed samples: 5511680 | consumed tokens: 11287920640 | elapsed time per iteration (s): 0.79 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.169450E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.929 | TFLOPs: 19.54 | 31: iteration 21540/ 173500 | consumed samples: 5514240 | consumed tokens: 11293163520 | elapsed time per iteration (s): 0.82 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.155349E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.983 | TFLOPs: 18.87 | 31: iteration 21550/ 173500 | consumed samples: 5516800 | consumed tokens: 11298406400 | elapsed time per iteration (s): 0.71 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.207765E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 362.310 | TFLOPs: 21.92 | 31: iteration 21560/ 173500 | consumed samples: 5519360 | consumed tokens: 11303649280 | elapsed time per iteration (s): 0.76 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.182372E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.979 | TFLOPs: 20.51 | 31: iteration 21570/ 173500 | consumed samples: 5521920 | consumed tokens: 11308892160 | elapsed time per iteration (s): 0.76 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.182788E+00 | grad norm: 0.140 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.026 | TFLOPs: 20.27 | 31: iteration 21580/ 173500 | consumed samples: 5524480 | consumed tokens: 11314135040 | elapsed time per iteration (s): 0.73 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.193790E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.616 | TFLOPs: 21.09 | 31: iteration 21590/ 173500 | consumed samples: 5527040 | consumed tokens: 11319377920 | elapsed time per iteration (s): 0.73 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.159513E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.513 | TFLOPs: 21.08 | 31: iteration 21600/ 173500 | consumed samples: 5529600 | consumed tokens: 11324620800 | elapsed time per iteration (s): 0.75 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.204186E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.654 | TFLOPs: 20.73 | 31: iteration 21610/ 173500 | consumed samples: 5532160 | consumed tokens: 11329863680 | elapsed time per iteration (s): 0.80 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.216128E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.452 | TFLOPs: 19.45 | 31: iteration 21620/ 173500 | consumed samples: 5534720 | consumed tokens: 11335106560 | elapsed time per iteration (s): 0.75 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.210945E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.153 | TFLOPs: 20.58 | 31: iteration 21630/ 173500 | consumed samples: 5537280 | consumed tokens: 11340349440 | elapsed time per iteration (s): 0.76 | learning rate: 
1.941E-04 | global batch size: 256 | lm loss: 2.221609E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.495 | TFLOPs: 20.36 | 31: iteration 21640/ 173500 | consumed samples: 5539840 | consumed tokens: 11345592320 | elapsed time per iteration (s): 0.80 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.205834E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.920 | TFLOPs: 19.41 | 31: iteration 21650/ 173500 | consumed samples: 5542400 | consumed tokens: 11350835200 | elapsed time per iteration (s): 0.77 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.182872E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.806 | TFLOPs: 20.13 | 31: iteration 21660/ 173500 | consumed samples: 5544960 | consumed tokens: 11356078080 | elapsed time per iteration (s): 0.87 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.199269E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.694 | TFLOPs: 17.77 | 31: iteration 21670/ 173500 | consumed samples: 5547520 | consumed tokens: 11361320960 | elapsed time per iteration (s): 0.81 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.201364E+00 | grad norm: 1.761 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.915 | TFLOPs: 19.11 | 31: iteration 21680/ 173500 | consumed samples: 5550080 | consumed tokens: 11366563840 | elapsed time per iteration (s): 0.76 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.305944E+00 | grad norm: 0.281 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.906 | TFLOPs: 20.50 | 31: iteration 21690/ 173500 | consumed 
samples: 5552640 | consumed tokens: 11371806720 | elapsed time per iteration (s): 0.83 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.209035E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.823 | TFLOPs: 18.56 | 31: iteration 21700/ 173500 | consumed samples: 5555200 | consumed tokens: 11377049600 | elapsed time per iteration (s): 0.75 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.184101E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.435 | TFLOPs: 20.78 | 31: iteration 21710/ 173500 | consumed samples: 5557760 | consumed tokens: 11382292480 | elapsed time per iteration (s): 0.82 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.192311E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.411 | TFLOPs: 18.78 | 31: iteration 21720/ 173500 | consumed samples: 5560320 | consumed tokens: 11387535360 | elapsed time per iteration (s): 0.75 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.200396E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.339 | TFLOPs: 20.65 | 31: iteration 21730/ 173500 | consumed samples: 5562880 | consumed tokens: 11392778240 | elapsed time per iteration (s): 0.78 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.191848E+00 | grad norm: 0.273 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.491 | TFLOPs: 19.87 | 31: iteration 21740/ 173500 | consumed samples: 5565440 | consumed tokens: 11398021120 | elapsed time per iteration (s): 0.79 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.194501E+00 | grad norm: 0.547 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 325.081 | TFLOPs: 19.67 | 31: iteration 21750/ 173500 | consumed samples: 5568000 | consumed tokens: 11403264000 | elapsed time per iteration (s): 0.74 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.207243E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.173 | TFLOPs: 20.94 | 31: iteration 21760/ 173500 | consumed samples: 5570560 | consumed tokens: 11408506880 | elapsed time per iteration (s): 0.77 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.219334E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.135 | TFLOPs: 20.03 | 31: iteration 21770/ 173500 | consumed samples: 5573120 | consumed tokens: 11413749760 | elapsed time per iteration (s): 0.71 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.196373E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 359.083 | TFLOPs: 21.72 | 31: iteration 21780/ 173500 | consumed samples: 5575680 | consumed tokens: 11418992640 | elapsed time per iteration (s): 0.78 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.186567E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.525 | TFLOPs: 19.75 | 31: iteration 21790/ 173500 | consumed samples: 5578240 | consumed tokens: 11424235520 | elapsed time per iteration (s): 0.75 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.202596E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.161 | TFLOPs: 20.58 | 31: iteration 21800/ 173500 | consumed samples: 5580800 | consumed tokens: 11429478400 | elapsed time per iteration (s): 0.75 | learning rate: 1.940E-04 | global batch size: 256 | lm 
loss: 2.201832E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.490 | TFLOPs: 20.78 | 31: iteration 21810/ 173500 | consumed samples: 5583360 | consumed tokens: 11434721280 | elapsed time per iteration (s): 0.76 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.187672E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.706 | TFLOPs: 20.37 | 31: iteration 21820/ 173500 | consumed samples: 5585920 | consumed tokens: 11439964160 | elapsed time per iteration (s): 0.77 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.198407E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.269 | TFLOPs: 20.10 | 31: iteration 21830/ 173500 | consumed samples: 5588480 | consumed tokens: 11445207040 | elapsed time per iteration (s): 0.77 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.205402E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.996 | TFLOPs: 20.02 | 31: iteration 21840/ 173500 | consumed samples: 5591040 | consumed tokens: 11450449920 | elapsed time per iteration (s): 0.76 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.205111E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.327 | TFLOPs: 20.47 | 31: iteration 21850/ 173500 | consumed samples: 5593600 | consumed tokens: 11455692800 | elapsed time per iteration (s): 0.78 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.219177E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.135 | TFLOPs: 19.85 | 31: iteration 21860/ 173500 | consumed samples: 5596160 | consumed tokens: 
11460935680 | elapsed time per iteration (s): 0.77 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.220668E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.122 | TFLOPs: 20.15 | 31: iteration 21870/ 173500 | consumed samples: 5598720 | consumed tokens: 11466178560 | elapsed time per iteration (s): 1.13 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.212751E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.087 | TFLOPs: 13.74 | 31: iteration 21880/ 173500 | consumed samples: 5601280 | consumed tokens: 11471421440 | elapsed time per iteration (s): 0.71 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.178596E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 359.115 | TFLOPs: 21.73 | 31: iteration 21890/ 173500 | consumed samples: 5603840 | consumed tokens: 11476664320 | elapsed time per iteration (s): 0.80 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.172486E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.004 | TFLOPs: 19.48 | 31: iteration 21900/ 173500 | consumed samples: 5606400 | consumed tokens: 11481907200 | elapsed time per iteration (s): 0.81 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.211169E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.307 | TFLOPs: 19.14 | 31: iteration 21910/ 173500 | consumed samples: 5608960 | consumed tokens: 11487150080 | elapsed time per iteration (s): 0.75 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.201682E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
339.885 | TFLOPs: 20.56 | 31: iteration 21920/ 173500 | consumed samples: 5611520 | consumed tokens: 11492392960 | elapsed time per iteration (s): 0.77 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.175042E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.950 | TFLOPs: 20.08 | 31: iteration 21930/ 173500 | consumed samples: 5614080 | consumed tokens: 11497635840 | elapsed time per iteration (s): 0.73 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.178392E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.977 | TFLOPs: 21.29 | 31: iteration 21940/ 173500 | consumed samples: 5616640 | consumed tokens: 11502878720 | elapsed time per iteration (s): 0.84 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.219077E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.123 | TFLOPs: 18.52 | 31: iteration 21950/ 173500 | consumed samples: 5619200 | consumed tokens: 11508121600 | elapsed time per iteration (s): 0.81 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.216001E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.572 | TFLOPs: 19.21 | 31: iteration 21960/ 173500 | consumed samples: 5621760 | consumed tokens: 11513364480 | elapsed time per iteration (s): 0.74 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.204772E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.268 | TFLOPs: 21.01 | 31: iteration 21970/ 173500 | consumed samples: 5624320 | consumed tokens: 11518607360 | elapsed time per iteration (s): 0.76 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.203930E+00 | grad norm: 0.140 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.471 | TFLOPs: 20.30 | 31: iteration 21980/ 173500 | consumed samples: 5626880 | consumed tokens: 11523850240 | elapsed time per iteration (s): 0.76 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.182528E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.865 | TFLOPs: 20.26 | 31: iteration 21990/ 173500 | consumed samples: 5629440 | consumed tokens: 11529093120 | elapsed time per iteration (s): 0.85 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.224278E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.533 | TFLOPs: 18.12 | 0: [2022-11-25 23:00:14,743] [INFO] [logging.py:68:log_dist] [Rank 0] step=22000, skipped=0, lr=[0.00019388839136370641, 0.00019388839136370641, 0.00019388839136370641], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 22000/ 173500 | consumed samples: 5632000 | consumed tokens: 11534336000 | elapsed time per iteration (s): 0.76 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.181656E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.725 | TFLOPs: 20.31 | 0: steps: 22000 loss: 2.1663 iter time (s): 0.779 samples/sec: 328.628 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 22000 | lm loss value: 2.142201E+00 | lm loss PPL: 8.518169E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 22000 to checkpoints_1b1long 0: [2022-11-25 23:00:14,996] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step22000 is begin to save! 
0: [2022-11-25 23:00:15,003] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_01-model_00-model_states.pt... 0: [2022-11-25 23:00:15,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_01-model_00-model_states.pt. 0: [2022-11-25 23:00:15,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_03-model_00-model_states.pt... 0: [2022-11-25 23:00:15,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_03-model_00-model_states.pt. 0: [2022-11-25 23:00:15,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_04-model_00-model_states.pt... 0: [2022-11-25 23:00:15,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_04-model_00-model_states.pt. 0: [2022-11-25 23:00:15,384] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_05-model_00-model_states.pt... 0: [2022-11-25 23:00:15,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_05-model_00-model_states.pt. 0: [2022-11-25 23:00:15,461] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_06-model_00-model_states.pt... 0: [2022-11-25 23:00:15,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_06-model_00-model_states.pt. 0: [2022-11-25 23:00:15,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_07-model_00-model_states.pt... 0: [2022-11-25 23:00:15,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_07-model_00-model_states.pt. 
0: [2022-11-25 23:00:15,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_08-model_00-model_states.pt... 0: [2022-11-25 23:00:15,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_08-model_00-model_states.pt. 0: [2022-11-25 23:00:15,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_09-model_00-model_states.pt... 0: [2022-11-25 23:00:15,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_09-model_00-model_states.pt. 0: [2022-11-25 23:00:15,769] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_10-model_00-model_states.pt... 0: [2022-11-25 23:00:15,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_10-model_00-model_states.pt. 0: [2022-11-25 23:00:15,844] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_11-model_00-model_states.pt... 0: [2022-11-25 23:00:15,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_11-model_00-model_states.pt. 0: [2022-11-25 23:00:15,918] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_12-model_00-model_states.pt... 0: [2022-11-25 23:00:15,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_12-model_00-model_states.pt. 0: [2022-11-25 23:00:15,999] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_13-model_00-model_states.pt... 0: [2022-11-25 23:00:16,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_13-model_00-model_states.pt. 
0: [2022-11-25 23:00:16,073] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_14-model_00-model_states.pt... 0: [2022-11-25 23:00:16,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_14-model_00-model_states.pt. 0: [2022-11-25 23:00:16,152] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_15-model_00-model_states.pt... 0: [2022-11-25 23:00:16,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_15-model_00-model_states.pt. 0: [2022-11-25 23:00:16,225] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_16-model_00-model_states.pt... 0: [2022-11-25 23:00:16,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_16-model_00-model_states.pt. 0: [2022-11-25 23:00:16,303] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_17-model_00-model_states.pt... 0: [2022-11-25 23:00:16,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_17-model_00-model_states.pt. 0: [2022-11-25 23:00:16,381] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_18-model_00-model_states.pt... 0: [2022-11-25 23:00:16,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_18-model_00-model_states.pt. 0: [2022-11-25 23:00:16,458] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_19-model_00-model_states.pt... 0: [2022-11-25 23:00:16,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_19-model_00-model_states.pt. 
0: [2022-11-25 23:00:16,532] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_20-model_00-model_states.pt... 0: [2022-11-25 23:00:16,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_20-model_00-model_states.pt. 0: [2022-11-25 23:00:16,608] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_21-model_00-model_states.pt... 0: [2022-11-25 23:00:16,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_21-model_00-model_states.pt. 0: [2022-11-25 23:00:16,686] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_22-model_00-model_states.pt... 0: [2022-11-25 23:00:16,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_22-model_00-model_states.pt. 0: [2022-11-25 23:00:16,760] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_23-model_00-model_states.pt... 0: [2022-11-25 23:00:16,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_23-model_00-model_states.pt. 0: [2022-11-25 23:00:16,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_24-model_00-model_states.pt... 0: [2022-11-25 23:00:16,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_24-model_00-model_states.pt. 0: [2022-11-25 23:00:16,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_25-model_00-model_states.pt... 0: [2022-11-25 23:00:17,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_25-model_00-model_states.pt. 
0: [2022-11-25 23:00:17,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_26-model_00-model_states.pt... 0: [2022-11-25 23:00:17,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_26-model_00-model_states.pt. 0: [2022-11-25 23:00:17,089] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_27-model_00-model_states.pt... 0: [2022-11-25 23:00:17,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_27-model_00-model_states.pt. 0: [2022-11-25 23:00:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_28-model_00-model_states.pt... 0: [2022-11-25 23:00:17,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_28-model_00-model_states.pt. 0: [2022-11-25 23:00:17,243] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/layer_30-model_00-model_states.pt... 0: [2022-11-25 23:00:17,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/layer_30-model_00-model_states.pt. 0: [2022-11-25 23:00:17,248] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step22000/mp_rank_00_model_states.pt 0: [2022-11-25 23:00:17,248] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/mp_rank_00_model_states.pt... 0: [2022-11-25 23:00:17,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/mp_rank_00_model_states.pt. 0: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 
6: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
7: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 
8: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
1: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
13: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
15: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 
23: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 
24: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 
22: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 
18: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 
19: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
9: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 
2: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 
20: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 
24: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 
31: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 
17: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 
27: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
9: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 
25: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 
30: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
6: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 
28: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 
3: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:00:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:00:17,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:00:17,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 23:00:17,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 21: [2022-11-25 23:00:17,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:00:17,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-25 23:00:17,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 14: [2022-11-25 23:00:17,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:00:17,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 23:00:17,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 4: [2022-11-25 23:00:17,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-25 23:00:17,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 23:00:17,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 26: [2022-11-25 23:00:17,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:00:17,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-25 23:00:17,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 9: [2022-11-25 23:00:17,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:00:17,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 23:00:17,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 10: [2022-11-25 23:00:17,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:00:17,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 23:00:17,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 6: [2022-11-25 23:00:17,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
24: [2022-11-25 23:00:17,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:00:17,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:00:17,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 27: [2022-11-25 23:00:17,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-25 23:00:17,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 6: [2022-11-25 23:00:17,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 24: [2022-11-25 23:00:17,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 6: [2022-11-25 23:00:17,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 19: [2022-11-25 23:00:17,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:00:17,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-25 23:00:17,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 2: [2022-11-25 23:00:17,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-25 23:00:17,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 23:00:17,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 30: [2022-11-25 23:00:17,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:00:17,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-25 23:00:17,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 14: [2022-11-25 23:00:17,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:00:17,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 23:00:17,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 6: [2022-11-25 23:00:17,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:00:17,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 16: [2022-11-25 23:00:17,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:00:17,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
16: [2022-11-25 23:00:17,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:00:17,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-25 23:00:17,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 16: [2022-11-25 23:00:17,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-25 23:00:17,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 19: [2022-11-25 23:00:17,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:00:17,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-25 23:00:17,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 29: [2022-11-25 23:00:17,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:00:17,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-25 23:00:17,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 30: [2022-11-25 23:00:17,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 
30: [2022-11-25 23:00:17,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-25 23:00:17,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 0: [2022-11-25 23:00:17,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:00:17,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:00:17,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:00:17,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 0: [2022-11-25 23:00:17,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 14: [2022-11-25 23:00:17,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 0: [2022-11-25 23:00:17,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 14: [2022-11-25 23:00:17,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 29: [2022-11-25 23:00:17,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 1: [2022-11-25 23:00:17,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-25 23:00:17,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 23:00:17,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 1: [2022-11-25 23:00:17,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:00:17,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 23:00:17,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 10: [2022-11-25 23:00:17,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:00:17,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 23:00:17,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 25: [2022-11-25 23:00:17,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:00:17,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 24: [2022-11-25 23:00:17,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:00:17,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
24: [2022-11-25 23:00:17,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 25: [2022-11-25 23:00:17,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:00:17,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 25: [2022-11-25 23:00:17,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-25 23:00:17,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 26: [2022-11-25 23:00:17,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:00:17,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-25 23:00:17,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 11: [2022-11-25 23:00:17,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:00:17,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:00:17,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-25 23:00:17,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
11: [2022-11-25 23:00:17,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 23:00:17,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 11: [2022-11-25 23:00:17,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:00:17,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 23:00:17,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 21: [2022-11-25 23:00:17,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:00:17,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-25 23:00:17,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 26: [2022-11-25 23:00:17,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:00:17,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:00:17,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:00:17,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
22: [2022-11-25 23:00:17,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:00:17,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:00:17,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 25: [2022-11-25 23:00:17,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 26: [2022-11-25 23:00:17,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 25: [2022-11-25 23:00:17,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 22: [2022-11-25 23:00:17,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 18: [2022-11-25 23:00:17,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 22: [2022-11-25 23:00:17,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-25 23:00:17,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 21: [2022-11-25 23:00:17,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
18: [2022-11-25 23:00:17,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 22: [2022-11-25 23:00:17,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 21: [2022-11-25 23:00:17,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 22: [2022-11-25 23:00:17,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 21: [2022-11-25 23:00:17,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 22: [2022-11-25 23:00:17,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 18: [2022-11-25 23:00:17,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:00:17,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-25 23:00:17,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 28: [2022-11-25 23:00:17,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:00:17,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-25 23:00:17,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 4: [2022-11-25 23:00:17,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
28: [2022-11-25 23:00:17,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:00:17,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 23:00:17,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 9: [2022-11-25 23:00:17,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:00:17,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:00:17,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 9: [2022-11-25 23:00:17,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 15: [2022-11-25 23:00:17,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 9: [2022-11-25 23:00:17,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 29: [2022-11-25 23:00:17,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:00:17,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-25 23:00:17,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
18: [2022-11-25 23:00:17,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:00:17,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-25 23:00:17,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 27: [2022-11-25 23:00:17,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:00:17,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-25 23:00:17,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 29: [2022-11-25 23:00:17,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:00:17,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-25 23:00:17,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 30: [2022-11-25 23:00:17,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:00:17,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-25 23:00:17,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
1: [2022-11-25 23:00:17,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:00:17,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 23:00:17,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 11: [2022-11-25 23:00:17,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:00:17,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 23:00:17,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 24: [2022-11-25 23:00:17,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:00:17,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-25 23:00:17,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 19: [2022-11-25 23:00:17,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:00:17,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-25 23:00:17,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-25 23:00:17,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-25 23:00:17,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 19: [2022-11-25 23:00:17,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 6: [2022-11-25 23:00:17,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:00:17,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 23:00:17,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 4: [2022-11-25 23:00:17,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:00:17,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 23:00:17,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 9: [2022-11-25 23:00:17,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-25 23:00:17,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 23:00:17,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 10: [2022-11-25 23:00:17,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:00:17,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:00:17,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 23:00:17,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 23:00:17,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 10: [2022-11-25 23:00:17,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 0: [2022-11-25 23:00:17,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:00:17,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 16: [2022-11-25 23:00:17,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:00:17,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
16: [2022-11-25 23:00:17,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:00:17,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-25 23:00:17,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 16: [2022-11-25 23:00:17,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-25 23:00:17,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 15: [2022-11-25 23:00:17,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:00:17,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 14: [2022-11-25 23:00:17,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:00:17,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 14: [2022-11-25 23:00:17,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 23:00:17,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 0: [2022-11-25 23:00:17,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
3: [2022-11-25 23:00:17,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:00:17,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:00:17,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:00:17,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:00:17,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 23:00:17,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 23:00:17,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 23:00:17,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 23:00:17,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 3: [2022-11-25 23:00:17,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 3: [2022-11-25 23:00:17,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 3: [2022-11-25 23:00:17,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
28: [2022-11-25 23:00:17,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-25 23:00:17,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 28: [2022-11-25 23:00:17,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:00:17,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:00:17,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 22: [2022-11-25 23:00:17,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 28: [2022-11-25 23:00:17,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 22: [2022-11-25 23:00:17,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 11: [2022-11-25 23:00:17,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:00:17,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:00:17,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 23:00:17,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
1: [2022-11-25 23:00:17,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 17: [2022-11-25 23:00:17,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:00:17,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 17: [2022-11-25 23:00:17,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-25 23:00:17,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 17: [2022-11-25 23:00:17,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:00:17,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-25 23:00:17,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 6: [2022-11-25 23:00:17,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:00:17,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 21: [2022-11-25 23:00:17,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:00:17,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
21: [2022-11-25 23:00:17,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-25 23:00:17,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 25: [2022-11-25 23:00:17,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:00:17,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:00:17,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 23:00:17,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 25: [2022-11-25 23:00:17,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-25 23:00:17,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 0: [2022-11-25 23:00:17,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 23:00:17,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 30: [2022-11-25 23:00:17,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-25 23:00:17,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-25 23:00:17,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 5: [2022-11-25 23:00:17,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:00:17,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:00:17,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:00:17,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 9: [2022-11-25 23:00:17,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 5: [2022-11-25 23:00:17,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 23:00:17,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 9: [2022-11-25 23:00:17,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 5: [2022-11-25 23:00:17,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 26: [2022-11-25 23:00:17,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
26: [2022-11-25 23:00:17,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 8: [2022-11-25 23:00:17,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:00:17,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:00:17,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 8: [2022-11-25 23:00:17,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 23:00:17,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 23:00:17,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 8: [2022-11-25 23:00:17,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 31: [2022-11-25 23:00:17,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:00:17,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:00:17,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-25 23:00:17,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-25 23:00:17,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-25 23:00:17,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-25 23:00:17,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 31: [2022-11-25 23:00:17,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 31: [2022-11-25 23:00:17,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 23: [2022-11-25 23:00:17,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:00:17,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:00:17,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:00:17,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
23: [2022-11-25 23:00:17,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-25 23:00:17,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-25 23:00:17,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-25 23:00:17,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-25 23:00:17,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 23: [2022-11-25 23:00:17,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 23: [2022-11-25 23:00:17,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 23: [2022-11-25 23:00:17,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 27: [2022-11-25 23:00:17,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:00:17,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-25 23:00:17,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 24: [2022-11-25 23:00:17,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-25 23:00:17,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 4: [2022-11-25 23:00:17,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:00:17,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 4: [2022-11-25 23:00:17,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 23:00:17,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 7: [2022-11-25 23:00:17,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:00:17,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:00:17,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:00:17,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-25 23:00:17,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 23:00:17,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 23:00:17,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 23:00:17,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 23:00:17,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 7: [2022-11-25 23:00:17,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 7: [2022-11-25 23:00:17,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 7: [2022-11-25 23:00:17,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 15: [2022-11-25 23:00:17,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:00:17,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 23:00:17,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 12: [2022-11-25 23:00:17,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-25 23:00:17,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:00:17,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:00:17,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 23:00:17,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 23:00:17,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 23:00:17,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:00:17,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 17: [2022-11-25 23:00:17,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:00:17,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 12: [2022-11-25 23:00:17,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
12: [2022-11-25 23:00:17,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 17: [2022-11-25 23:00:17,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 12: [2022-11-25 23:00:17,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 17: [2022-11-25 23:00:17,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 18: [2022-11-25 23:00:17,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:00:17,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 8: [2022-11-25 23:00:17,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:00:17,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 8: [2022-11-25 23:00:17,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 23:00:17,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 31: [2022-11-25 23:00:17,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 
31: [2022-11-25 23:00:17,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-25 23:00:17,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 8: [2022-11-25 23:00:17,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:00:17,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 23:00:17,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 20: [2022-11-25 23:00:17,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:00:17,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:00:17,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-25 23:00:17,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-25 23:00:17,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 20: [2022-11-25 23:00:17,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 19: [2022-11-25 23:00:17,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
19: [2022-11-25 23:00:17,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 2: [2022-11-25 23:00:17,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:00:17,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 2: [2022-11-25 23:00:17,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 23:00:17,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 1: [2022-11-25 23:00:17,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:00:17,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 23:00:17,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 13: [2022-11-25 23:00:17,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:00:17,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 23:00:17,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 20: [2022-11-25 23:00:17,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-25 23:00:17,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-25 23:00:17,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 3: [2022-11-25 23:00:17,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:00:17,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 23:00:17,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 21: [2022-11-25 23:00:17,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:00:17,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-25 23:00:17,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 22: [2022-11-25 23:00:17,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:00:17,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
22: [2022-11-25 23:00:17,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 2: [2022-11-25 23:00:17,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 22: [2022-11-25 23:00:17,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 2: [2022-11-25 23:00:17,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 13: [2022-11-25 23:00:17,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:00:17,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 23:00:17,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 13: [2022-11-25 23:00:17,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:00:17,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:00:17,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 23:00:17,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 23:00:17,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
13: [2022-11-25 23:00:17,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 10: [2022-11-25 23:00:17,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:00:17,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 23:00:17,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 16: [2022-11-25 23:00:17,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:00:17,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-25 23:00:17,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 28: [2022-11-25 23:00:17,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:00:17,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-25 23:00:17,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 0: [2022-11-25 23:00:17,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-25 23:00:17,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 23:00:17,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 5: [2022-11-25 23:00:17,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:00:17,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 23:00:17,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 5: [2022-11-25 23:00:17,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:00:17,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:00:17,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 11: [2022-11-25 23:00:17,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 5: [2022-11-25 23:00:17,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 11: [2022-11-25 23:00:17,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 12: [2022-11-25 23:00:17,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-25 23:00:17,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 23:00:17,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 26: [2022-11-25 23:00:17,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:00:17,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-25 23:00:17,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 20: [2022-11-25 23:00:17,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:00:17,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-25 23:00:17,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 29: [2022-11-25 23:00:17,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:00:17,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-25 23:00:17,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 25: [2022-11-25 23:00:17,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
25: [2022-11-25 23:00:17,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-25 23:00:17,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 6: [2022-11-25 23:00:17,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:00:17,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 23:00:17,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 30: [2022-11-25 23:00:17,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:00:17,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-25 23:00:17,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 23: [2022-11-25 23:00:17,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:00:17,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-25 23:00:17,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 9: [2022-11-25 23:00:17,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-25 23:00:17,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 23:00:17,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 4: [2022-11-25 23:00:17,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:00:17,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 23:00:17,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 7: [2022-11-25 23:00:17,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:00:17,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 23:00:17,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 5: [2022-11-25 23:00:17,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:00:17,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 23:00:17,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 15: [2022-11-25 23:00:17,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-25 23:00:17,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 23:00:17,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 17: [2022-11-25 23:00:17,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:00:17,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-25 23:00:17,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 14: [2022-11-25 23:00:17,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:00:17,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 23:00:17,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 18: [2022-11-25 23:00:17,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:00:17,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:00:17,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-25 23:00:17,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
31: [2022-11-25 23:00:17,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-25 23:00:17,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 24: [2022-11-25 23:00:17,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:00:17,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-25 23:00:17,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 19: [2022-11-25 23:00:17,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:00:17,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-25 23:00:17,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 3: [2022-11-25 23:00:17,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:00:17,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 22: [2022-11-25 23:00:17,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:00:17,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
22: [2022-11-25 23:00:17,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-25 23:00:17,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 2: [2022-11-25 23:00:17,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:00:17,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 23:00:17,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 8: [2022-11-25 23:00:17,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:00:17,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 23:00:17,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 21: [2022-11-25 23:00:17,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:00:17,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-25 23:00:17,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 1: [2022-11-25 23:00:17,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-25 23:00:17,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 23:00:17,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 16: [2022-11-25 23:00:17,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:00:17,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-25 23:00:17,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 29: [2022-11-25 23:00:17,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:00:17,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-25 23:00:17,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 11: [2022-11-25 23:00:17,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:00:17,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 0: [2022-11-25 23:00:17,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:00:17,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
14: [2022-11-25 23:00:17,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:00:17,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 23:00:17,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 14: [2022-11-25 23:00:17,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 23:00:17,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 10: [2022-11-25 23:00:17,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:00:17,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 23:00:17,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 27: [2022-11-25 23:00:17,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:00:17,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:00:17,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-25 23:00:17,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
20: [2022-11-25 23:00:17,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-25 23:00:17,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 12: [2022-11-25 23:00:17,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:00:17,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 23:00:17,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 30: [2022-11-25 23:00:17,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:00:17,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-25 23:00:17,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 26: [2022-11-25 23:00:17,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:00:17,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-25 23:00:17,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 28: [2022-11-25 23:00:17,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
28: [2022-11-25 23:00:17,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 25: [2022-11-25 23:00:17,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:00:17,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 25: [2022-11-25 23:00:17,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-25 23:00:17,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 23: [2022-11-25 23:00:17,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:00:17,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-25 23:00:17,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 6: [2022-11-25 23:00:17,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:00:17,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 23:00:17,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 7: [2022-11-25 23:00:17,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-25 23:00:17,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 23:00:17,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 9: [2022-11-25 23:00:17,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:00:17,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 23:00:17,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 8: [2022-11-25 23:00:17,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:00:17,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 23:00:17,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 27: [2022-11-25 23:00:17,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:00:17,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-25 23:00:17,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 18: [2022-11-25 23:00:17,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
18: [2022-11-25 23:00:17,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-25 23:00:17,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 13: [2022-11-25 23:00:17,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:00:17,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 23:00:17,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 24: [2022-11-25 23:00:17,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:00:17,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-25 23:00:17,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 4: [2022-11-25 23:00:17,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:00:17,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 23:00:17,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 2: [2022-11-25 23:00:17,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
5: [2022-11-25 23:00:17,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:00:17,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 2: [2022-11-25 23:00:17,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 5: [2022-11-25 23:00:17,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 2: [2022-11-25 23:00:17,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 17: [2022-11-25 23:00:17,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:00:17,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-25 23:00:17,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 15: [2022-11-25 23:00:17,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:00:17,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
15: [2022-11-25 23:00:17,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 22: [2022-11-25 23:00:17,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 15: [2022-11-25 23:00:17,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 22: [2022-11-25 23:00:17,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 31: [2022-11-25 23:00:17,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:00:17,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-25 23:00:17,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 13: [2022-11-25 23:00:17,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:00:17,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 23:00:17,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 16: [2022-11-25 23:00:17,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:00:17,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
16: [2022-11-25 23:00:17,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 14: [2022-11-25 23:00:17,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:00:17,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 16: [2022-11-25 23:00:17,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 14: [2022-11-25 23:00:17,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 19: [2022-11-25 23:00:17,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 14: [2022-11-25 23:00:17,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 3: [2022-11-25 23:00:17,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:00:17,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 23:00:17,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 10: [2022-11-25 23:00:17,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-25 23:00:17,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 23:00:17,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 5: [2022-11-25 23:00:17,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:00:17,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 23:00:17,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 21: [2022-11-25 23:00:17,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:00:17,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-25 23:00:17,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 1: [2022-11-25 23:00:17,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:00:17,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 23:00:17,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 28: [2022-11-25 23:00:17,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 
26: [2022-11-25 23:00:17,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:00:17,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 29: [2022-11-25 23:00:17,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:00:17,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 29: [2022-11-25 23:00:17,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-25 23:00:17,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 11: [2022-11-25 23:00:17,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:00:17,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 23:00:17,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 28: [2022-11-25 23:00:17,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-25 23:00:17,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 0: [2022-11-25 23:00:17,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-25 23:00:17,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 23:00:17,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 25: [2022-11-25 23:00:17,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:00:17,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-25 23:00:17,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 30: [2022-11-25 23:00:17,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:00:17,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-25 23:00:17,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 20: [2022-11-25 23:00:17,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:00:17,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-25 23:00:17,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 6: [2022-11-25 23:00:17,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-25 23:00:17,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 23:00:17,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 12: [2022-11-25 23:00:17,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:00:17,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 23:00:17,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 23: [2022-11-25 23:00:17,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:00:17,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-25 23:00:17,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 7: [2022-11-25 23:00:17,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:00:17,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 23:00:17,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 9: [2022-11-25 23:00:17,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-25 23:00:17,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 23:00:17,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 16: [2022-11-25 23:00:17,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:00:17,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-25 23:00:17,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 3: [2022-11-25 23:00:17,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:00:17,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:00:17,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:00:17,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 18: [2022-11-25 23:00:17,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 27: [2022-11-25 23:00:17,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
8: [2022-11-25 23:00:17,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 3: [2022-11-25 23:00:17,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 18: [2022-11-25 23:00:17,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 8: [2022-11-25 23:00:17,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 27: [2022-11-25 23:00:17,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-25 23:00:17,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 1: [2022-11-25 23:00:17,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:00:17,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 22: [2022-11-25 23:00:17,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:00:17,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 14: [2022-11-25 23:00:17,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
22: [2022-11-25 23:00:17,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 14: [2022-11-25 23:00:17,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 22: [2022-11-25 23:00:17,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 14: [2022-11-25 23:00:17,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 2: [2022-11-25 23:00:17,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:00:17,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 23:00:17,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 4: [2022-11-25 23:00:17,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:00:17,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 23:00:17,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 24: [2022-11-25 23:00:17,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
24: [2022-11-25 23:00:17,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-25 23:00:17,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 19: [2022-11-25 23:00:17,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:00:17,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-25 23:00:17,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 5: [2022-11-25 23:00:17,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:00:17,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 23:00:17,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 21: [2022-11-25 23:00:17,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:00:17,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-25 23:00:17,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 11: [2022-11-25 23:00:17,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:00:17,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 10: [2022-11-25 23:00:17,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 21: [2022-11-25 23:00:17,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 11: [2022-11-25 23:00:17,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 23:00:17,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 0: [2022-11-25 23:00:17,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:00:17,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 23:00:17,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 31: [2022-11-25 23:00:17,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 
31: [2022-11-25 23:00:17,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-25 23:00:17,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 29: [2022-11-25 23:00:17,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:00:17,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-25 23:00:17,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 13: [2022-11-25 23:00:17,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:00:17,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:00:17,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:00:17,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 7: [2022-11-25 23:00:17,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 13: [2022-11-25 23:00:17,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
24: [2022-11-25 23:00:17,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 7: [2022-11-25 23:00:17,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 24: [2022-11-25 23:00:17,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 26: [2022-11-25 23:00:17,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:00:17,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-25 23:00:17,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 12: [2022-11-25 23:00:17,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:00:17,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 23:00:17,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 28: [2022-11-25 23:00:17,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:00:17,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-25 23:00:17,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
25: [2022-11-25 23:00:17,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:00:17,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 2: [2022-11-25 23:00:17,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:00:17,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 25: [2022-11-25 23:00:17,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 2: [2022-11-25 23:00:17,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 18: [2022-11-25 23:00:17,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:00:17,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-25 23:00:17,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 30: [2022-11-25 23:00:17,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:00:17,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-25 23:00:17,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
27: [2022-11-25 23:00:17,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:00:17,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-25 23:00:17,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 20: [2022-11-25 23:00:17,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:00:17,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-25 23:00:17,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 23: [2022-11-25 23:00:17,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:00:17,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:00:17,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-25 23:00:17,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 15: [2022-11-25 23:00:17,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 23:00:17,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
28: [2022-11-25 23:00:17,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:00:17,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:00:17,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-25 23:00:17,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 9: [2022-11-25 23:00:17,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:00:17,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 17: [2022-11-25 23:00:17,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:00:17,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 17: [2022-11-25 23:00:17,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-25 23:00:17,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 28: [2022-11-25 23:00:17,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-25 23:00:17,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
20: [2022-11-25 23:00:17,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:00:17,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-25 23:00:17,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 4: [2022-11-25 23:00:17,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:00:17,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 13: [2022-11-25 23:00:17,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:00:17,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 13: [2022-11-25 23:00:17,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 23:00:17,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 15: [2022-11-25 23:00:17,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:00:17,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 23:00:17,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
6: [2022-11-25 23:00:17,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:00:17,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 23:00:17,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 15: [2022-11-25 23:00:17,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:00:17,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 23:00:17,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 17: [2022-11-25 23:00:17,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:00:17,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-25 23:00:17,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 17: [2022-11-25 23:00:17,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:00:17,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-25 23:00:17,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
8: [2022-11-25 23:00:17,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:00:17,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step22000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 23:00:17,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 0: successfully saved checkpoint at iteration 22000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2597.19 31: iteration 22010/ 173500 | consumed samples: 5634560 | consumed tokens: 11539578880 | elapsed time per iteration (s): 1.02 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.192376E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.169 | TFLOPs: 15.13 | 31: iteration 22020/ 173500 | consumed samples: 5637120 | consumed tokens: 11544821760 | elapsed time per iteration (s): 0.78 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.192244E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.412 | TFLOPs: 19.75 | 31: iteration 22030/ 173500 | consumed samples: 5639680 | consumed tokens: 11550064640 | elapsed time per iteration (s): 0.81 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.199792E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.097 | TFLOPs: 19.12 | 31: iteration 22040/ 173500 | consumed samples: 5642240 | consumed tokens: 11555307520 | elapsed time per iteration (s): 0.77 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.160049E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.718 | TFLOPs: 20.13 
| 31: iteration 22050/ 173500 | consumed samples: 5644800 | consumed tokens: 11560550400 | elapsed time per iteration (s): 0.75 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.182992E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.855 | TFLOPs: 20.56 | 31: iteration 22060/ 173500 | consumed samples: 5647360 | consumed tokens: 11565793280 | elapsed time per iteration (s): 5.06 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.196184E+00 | grad norm: 0.637 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 50.619 | TFLOPs: 3.06 | 31: iteration 22070/ 173500 | consumed samples: 5649920 | consumed tokens: 11571036160 | elapsed time per iteration (s): 0.83 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.189955E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.526 | TFLOPs: 18.67 | 31: iteration 22080/ 173500 | consumed samples: 5652480 | consumed tokens: 11576279040 | elapsed time per iteration (s): 0.79 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.210933E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.387 | TFLOPs: 19.56 | 31: iteration 22090/ 173500 | consumed samples: 5655040 | consumed tokens: 11581521920 | elapsed time per iteration (s): 0.80 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.209536E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.110 | TFLOPs: 19.43 | 31: iteration 22100/ 173500 | consumed samples: 5657600 | consumed tokens: 11586764800 | elapsed time per iteration (s): 0.74 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.205412E+00 | grad norm: 0.150 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.806 | TFLOPs: 21.04 | 31: iteration 22110/ 173500 | consumed samples: 5660160 | consumed tokens: 11592007680 | elapsed time per iteration (s): 0.77 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.206980E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.489 | TFLOPs: 20.05 | 31: iteration 22120/ 173500 | consumed samples: 5662720 | consumed tokens: 11597250560 | elapsed time per iteration (s): 0.76 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.199195E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.789 | TFLOPs: 20.25 | 31: iteration 22130/ 173500 | consumed samples: 5665280 | consumed tokens: 11602493440 | elapsed time per iteration (s): 0.74 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.209692E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.034 | TFLOPs: 20.87 | 31: iteration 22140/ 173500 | consumed samples: 5667840 | consumed tokens: 11607736320 | elapsed time per iteration (s): 0.78 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.158233E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.688 | TFLOPs: 19.88 | 31: iteration 22150/ 173500 | consumed samples: 5670400 | consumed tokens: 11612979200 | elapsed time per iteration (s): 0.82 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.182099E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.685 | TFLOPs: 18.86 | 31: iteration 22160/ 173500 | consumed samples: 5672960 | consumed tokens: 11618222080 | elapsed time per iteration (s): 0.83 | learning rate: 
1.938E-04 | global batch size: 256 | lm loss: 2.214775E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.092 | TFLOPs: 18.70 | 31: iteration 22170/ 173500 | consumed samples: 5675520 | consumed tokens: 11623464960 | elapsed time per iteration (s): 0.73 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.164663E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.182 | TFLOPs: 21.31 | 31: iteration 22180/ 173500 | consumed samples: 5678080 | consumed tokens: 11628707840 | elapsed time per iteration (s): 0.76 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.199726E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.087 | TFLOPs: 20.33 | 31: iteration 22190/ 173500 | consumed samples: 5680640 | consumed tokens: 11633950720 | elapsed time per iteration (s): 0.76 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.211960E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.130 | TFLOPs: 20.40 | 31: iteration 22200/ 173500 | consumed samples: 5683200 | consumed tokens: 11639193600 | elapsed time per iteration (s): 0.80 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.187242E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.338 | TFLOPs: 19.32 | 31: iteration 22210/ 173500 | consumed samples: 5685760 | consumed tokens: 11644436480 | elapsed time per iteration (s): 0.77 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.279273E+00 | grad norm: 9.617 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.570 | TFLOPs: 20.18 | 31: iteration 22220/ 173500 | consumed 
samples: 5688320 | consumed tokens: 11649679360 | elapsed time per iteration (s): 3.10 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.241362E+00 | grad norm: 0.353 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 82.463 | TFLOPs: 4.99 | 31: iteration 22230/ 173500 | consumed samples: 5690880 | consumed tokens: 11654922240 | elapsed time per iteration (s): 0.77 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.247463E+00 | grad norm: 0.847 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.755 | TFLOPs: 20.13 | 31: iteration 22240/ 173500 | consumed samples: 5693440 | consumed tokens: 11660165120 | elapsed time per iteration (s): 0.79 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.248462E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.307 | TFLOPs: 19.62 | 31: iteration 22250/ 173500 | consumed samples: 5696000 | consumed tokens: 11665408000 | elapsed time per iteration (s): 0.81 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.180690E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.295 | TFLOPs: 19.01 | 31: iteration 22260/ 173500 | consumed samples: 5698560 | consumed tokens: 11670650880 | elapsed time per iteration (s): 12.09 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.215303E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 21.168 | TFLOPs: 1.28 | 31: iteration 22270/ 173500 | consumed samples: 5701120 | consumed tokens: 11675893760 | elapsed time per iteration (s): 0.75 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.228239E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 340.514 | TFLOPs: 20.60 | 31: iteration 22280/ 173500 | consumed samples: 5703680 | consumed tokens: 11681136640 | elapsed time per iteration (s): 0.78 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.220921E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.551 | TFLOPs: 19.82 | 31: iteration 22290/ 173500 | consumed samples: 5706240 | consumed tokens: 11686379520 | elapsed time per iteration (s): 0.77 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.241213E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.442 | TFLOPs: 20.11 | 31: iteration 22300/ 173500 | consumed samples: 5708800 | consumed tokens: 11691622400 | elapsed time per iteration (s): 0.75 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.181434E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.762 | TFLOPs: 20.55 | 31: iteration 22310/ 173500 | consumed samples: 5711360 | consumed tokens: 11696865280 | elapsed time per iteration (s): 0.76 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.194776E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.249 | TFLOPs: 20.40 | 31: iteration 22320/ 173500 | consumed samples: 5713920 | consumed tokens: 11702108160 | elapsed time per iteration (s): 0.77 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.175641E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.946 | TFLOPs: 20.14 | 31: iteration 22330/ 173500 | consumed samples: 5716480 | consumed tokens: 11707351040 | elapsed time per iteration (s): 0.74 | learning rate: 1.937E-04 | global batch size: 256 | lm 
loss: 2.201836E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.044 | TFLOPs: 20.81 | 31: iteration 22340/ 173500 | consumed samples: 5719040 | consumed tokens: 11712593920 | elapsed time per iteration (s): 0.81 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.179944E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.685 | TFLOPs: 19.22 | 31: iteration 22350/ 173500 | consumed samples: 5721600 | consumed tokens: 11717836800 | elapsed time per iteration (s): 0.79 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.150556E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.541 | TFLOPs: 19.63 | 31: iteration 22360/ 173500 | consumed samples: 5724160 | consumed tokens: 11723079680 | elapsed time per iteration (s): 0.76 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.192401E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.842 | TFLOPs: 20.32 | 31: iteration 22370/ 173500 | consumed samples: 5726720 | consumed tokens: 11728322560 | elapsed time per iteration (s): 0.77 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.187187E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.586 | TFLOPs: 20.12 | 31: iteration 22380/ 173500 | consumed samples: 5729280 | consumed tokens: 11733565440 | elapsed time per iteration (s): 0.77 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.177974E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.184 | TFLOPs: 20.04 | 31: iteration 22390/ 173500 | consumed samples: 5731840 | consumed tokens: 
11738808320 | elapsed time per iteration (s): 0.77 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.142473E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.960 | TFLOPs: 20.08 | 31: iteration 22400/ 173500 | consumed samples: 5734400 | consumed tokens: 11744051200 | elapsed time per iteration (s): 0.75 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.182724E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.664 | TFLOPs: 20.61 | 31: iteration 22410/ 173500 | consumed samples: 5736960 | consumed tokens: 11749294080 | elapsed time per iteration (s): 0.83 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.183415E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.474 | TFLOPs: 18.60 | 31: iteration 22420/ 173500 | consumed samples: 5739520 | consumed tokens: 11754536960 | elapsed time per iteration (s): 0.77 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.160540E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.103 | TFLOPs: 20.03 | 31: iteration 22430/ 173500 | consumed samples: 5742080 | consumed tokens: 11759779840 | elapsed time per iteration (s): 0.77 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.157904E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.214 | TFLOPs: 20.04 | 31: iteration 22440/ 173500 | consumed samples: 5744640 | consumed tokens: 11765022720 | elapsed time per iteration (s): 0.75 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.189969E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
339.746 | TFLOPs: 20.55 | 31: iteration 22450/ 173500 | consumed samples: 5747200 | consumed tokens: 11770265600 | elapsed time per iteration (s): 0.79 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.206245E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.484 | TFLOPs: 19.69 | 31: iteration 22460/ 173500 | consumed samples: 5749760 | consumed tokens: 11775508480 | elapsed time per iteration (s): 0.82 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.213998E+00 | grad norm: 0.227 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.402 | TFLOPs: 18.90 | 31: iteration 22470/ 173500 | consumed samples: 5752320 | consumed tokens: 11780751360 | elapsed time per iteration (s): 0.74 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.216525E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.535 | TFLOPs: 20.84 | 31: iteration 22480/ 173500 | consumed samples: 5754880 | consumed tokens: 11785994240 | elapsed time per iteration (s): 0.78 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.187561E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.602 | TFLOPs: 19.94 | 31: iteration 22490/ 173500 | consumed samples: 5757440 | consumed tokens: 11791237120 | elapsed time per iteration (s): 0.77 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.175361E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.974 | TFLOPs: 20.14 | 31: iteration 22500/ 173500 | consumed samples: 5760000 | consumed tokens: 11796480000 | elapsed time per iteration (s): 0.78 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.195339E+00 | grad norm: 0.146 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.828 | TFLOPs: 19.95 | 31: iteration 22510/ 173500 | consumed samples: 5762560 | consumed tokens: 11801722880 | elapsed time per iteration (s): 0.82 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.203205E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.481 | TFLOPs: 18.78 | 31: iteration 22520/ 173500 | consumed samples: 5765120 | consumed tokens: 11806965760 | elapsed time per iteration (s): 0.82 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.202386E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.121 | TFLOPs: 18.94 | 31: iteration 22530/ 173500 | consumed samples: 5767680 | consumed tokens: 11812208640 | elapsed time per iteration (s): 0.78 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.206289E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.024 | TFLOPs: 19.91 | 31: iteration 22540/ 173500 | consumed samples: 5770240 | consumed tokens: 11817451520 | elapsed time per iteration (s): 0.79 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.188989E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.391 | TFLOPs: 19.56 | 31: iteration 22550/ 173500 | consumed samples: 5772800 | consumed tokens: 11822694400 | elapsed time per iteration (s): 0.76 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.224594E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.124 | TFLOPs: 20.27 | 31: iteration 22560/ 173500 | consumed samples: 5775360 | consumed tokens: 11827937280 | elapsed time per iteration (s): 
0.78 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.180846E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.664 | TFLOPs: 19.94 | 31: iteration 22570/ 173500 | consumed samples: 5777920 | consumed tokens: 11833180160 | elapsed time per iteration (s): 0.78 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.181773E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.560 | TFLOPs: 19.76 | 31: iteration 22580/ 173500 | consumed samples: 5780480 | consumed tokens: 11838423040 | elapsed time per iteration (s): 0.78 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.166156E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.431 | TFLOPs: 19.81 | 31: iteration 22590/ 173500 | consumed samples: 5783040 | consumed tokens: 11843665920 | elapsed time per iteration (s): 0.78 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.185366E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.044 | TFLOPs: 19.91 | 31: iteration 22600/ 173500 | consumed samples: 5785600 | consumed tokens: 11848908800 | elapsed time per iteration (s): 0.80 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.192771E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.655 | TFLOPs: 19.34 | 31: iteration 22610/ 173500 | consumed samples: 5788160 | consumed tokens: 11854151680 | elapsed time per iteration (s): 0.76 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.168843E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.329 | TFLOPs: 20.35 | 31: iteration 22620/ 
173500 | consumed samples: 5790720 | consumed tokens: 11859394560 | elapsed time per iteration (s): 0.82 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.180528E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.317 | TFLOPs: 18.95 | 31: iteration 22630/ 173500 | consumed samples: 5793280 | consumed tokens: 11864637440 | elapsed time per iteration (s): 0.82 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.209706E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.526 | TFLOPs: 18.97 | 31: iteration 22640/ 173500 | consumed samples: 5795840 | consumed tokens: 11869880320 | elapsed time per iteration (s): 0.74 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.175884E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.760 | TFLOPs: 20.92 | 31: iteration 22650/ 173500 | consumed samples: 5798400 | consumed tokens: 11875123200 | elapsed time per iteration (s): 0.74 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.184559E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.985 | TFLOPs: 20.99 | 31: iteration 22660/ 173500 | consumed samples: 5800960 | consumed tokens: 11880366080 | elapsed time per iteration (s): 0.76 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.182825E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.087 | TFLOPs: 20.27 | 31: iteration 22670/ 173500 | consumed samples: 5803520 | consumed tokens: 11885608960 | elapsed time per iteration (s): 0.84 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.179185E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 
0 | number of nan iterations: 0 | samples per second: 306.350 | TFLOPs: 18.53 | 31: iteration 22680/ 173500 | consumed samples: 5806080 | consumed tokens: 11890851840 | elapsed time per iteration (s): 0.77 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.204510E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.909 | TFLOPs: 20.14 | 31: iteration 22690/ 173500 | consumed samples: 5808640 | consumed tokens: 11896094720 | elapsed time per iteration (s): 0.77 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.215259E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.286 | TFLOPs: 20.04 | 31: iteration 22700/ 173500 | consumed samples: 5811200 | consumed tokens: 11901337600 | elapsed time per iteration (s): 0.82 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.188472E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.564 | TFLOPs: 18.91 | 31: iteration 22710/ 173500 | consumed samples: 5813760 | consumed tokens: 11906580480 | elapsed time per iteration (s): 0.78 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.211410E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.300 | TFLOPs: 19.86 | 31: iteration 22720/ 173500 | consumed samples: 5816320 | consumed tokens: 11911823360 | elapsed time per iteration (s): 0.79 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.173299E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.396 | TFLOPs: 19.63 | 31: iteration 22730/ 173500 | consumed samples: 5818880 | consumed tokens: 11917066240 | elapsed time per iteration (s): 0.79 | learning rate: 1.934E-04 | global batch 
size: 256 | lm loss: 2.167353E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.224 | TFLOPs: 19.55 | 31: iteration 22740/ 173500 | consumed samples: 5821440 | consumed tokens: 11922309120 | elapsed time per iteration (s): 0.76 | learning rate: 1.934E-04 | global batch size: 256 | lm loss: 2.170777E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.997 | TFLOPs: 20.27 | 31: iteration 22750/ 173500 | consumed samples: 5824000 | consumed tokens: 11927552000 | elapsed time per iteration (s): 0.78 | learning rate: 1.934E-04 | global batch size: 256 | lm loss: 2.158479E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.088 | TFLOPs: 19.79 | 31: iteration 22760/ 173500 | consumed samples: 5826560 | consumed tokens: 11932794880 | elapsed time per iteration (s): 0.80 | learning rate: 1.934E-04 | global batch size: 256 | lm loss: 2.219846E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.919 | TFLOPs: 19.35 | 31: iteration 22770/ 173500 | consumed samples: 5829120 | consumed tokens: 11938037760 | elapsed time per iteration (s): 0.82 | learning rate: 1.934E-04 | global batch size: 256 | lm loss: 2.197421E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.744 | TFLOPs: 18.92 | 31: iteration 22780/ 173500 | consumed samples: 5831680 | consumed tokens: 11943280640 | elapsed time per iteration (s): 0.80 | learning rate: 1.934E-04 | global batch size: 256 | lm loss: 2.197147E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.160 | TFLOPs: 19.31 | 31: iteration 22790/ 173500 | consumed samples: 5834240 | consumed 
tokens: 11948523520 | elapsed time per iteration (s): 0.81 | learning rate: 1.934E-04 | global batch size: 256 | lm loss: 2.203581E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.722 | TFLOPs: 19.10 | 31: iteration 22800/ 173500 | consumed samples: 5836800 | consumed tokens: 11953766400 | elapsed time per iteration (s): 0.83 | learning rate: 1.934E-04 | global batch size: 256 | lm loss: 2.184922E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.802 | TFLOPs: 18.56 | 31: iteration 22810/ 173500 | consumed samples: 5839360 | consumed tokens: 11959009280 | elapsed time per iteration (s): 0.86 | learning rate: 1.934E-04 | global batch size: 256 | lm loss: 2.189959E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.187 | TFLOPs: 17.98 | 31: iteration 22820/ 173500 | consumed samples: 5841920 | consumed tokens: 11964252160 | elapsed time per iteration (s): 0.84 | learning rate: 1.934E-04 | global batch size: 256 | lm loss: 2.188132E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.771 | TFLOPs: 18.38 | 31: iteration 22830/ 173500 | consumed samples: 5844480 | consumed tokens: 11969495040 | elapsed time per iteration (s): 0.83 | learning rate: 1.934E-04 | global batch size: 256 | lm loss: 2.191940E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.883 | TFLOPs: 18.57 | 31: iteration 22840/ 173500 | consumed samples: 5847040 | consumed tokens: 11974737920 | elapsed time per iteration (s): 0.82 | learning rate: 1.934E-04 | global batch size: 256 | lm loss: 2.168641E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 312.083 | TFLOPs: 18.88 | 31: iteration 22850/ 173500 | consumed samples: 5849600 | consumed tokens: 11979980800 | elapsed time per iteration (s): 0.82 | learning rate: 1.934E-04 | global batch size: 256 | lm loss: 2.193193E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.521 | TFLOPs: 18.85 | 31: iteration 22860/ 173500 | consumed samples: 5852160 | consumed tokens: 11985223680 | elapsed time per iteration (s): 0.79 | learning rate: 1.934E-04 | global batch size: 256 | lm loss: 2.166221E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.908 | TFLOPs: 19.54 | 31: iteration 22870/ 173500 | consumed samples: 5854720 | consumed tokens: 11990466560 | elapsed time per iteration (s): 0.81 | learning rate: 1.934E-04 | global batch size: 256 | lm loss: 2.196000E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.780 | TFLOPs: 19.10 | 31: iteration 22880/ 173500 | consumed samples: 5857280 | consumed tokens: 11995709440 | elapsed time per iteration (s): 0.82 | learning rate: 1.934E-04 | global batch size: 256 | lm loss: 2.205695E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.678 | TFLOPs: 18.80 | 31: iteration 22890/ 173500 | consumed samples: 5859840 | consumed tokens: 12000952320 | elapsed time per iteration (s): 0.83 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.180281E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.123 | TFLOPs: 18.58 | 31: iteration 22900/ 173500 | consumed samples: 5862400 | consumed tokens: 12006195200 | elapsed time per iteration (s): 0.82 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.204031E+00 | grad norm: 
0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.200 | TFLOPs: 18.83 | 31: iteration 22910/ 173500 | consumed samples: 5864960 | consumed tokens: 12011438080 | elapsed time per iteration (s): 0.80 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.177790E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.425 | TFLOPs: 19.32 | 31: iteration 22920/ 173500 | consumed samples: 5867520 | consumed tokens: 12016680960 | elapsed time per iteration (s): 0.80 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.191741E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.224 | TFLOPs: 19.31 | 31: iteration 22930/ 173500 | consumed samples: 5870080 | consumed tokens: 12021923840 | elapsed time per iteration (s): 0.81 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.184198E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.769 | TFLOPs: 19.22 | 31: iteration 22940/ 173500 | consumed samples: 5872640 | consumed tokens: 12027166720 | elapsed time per iteration (s): 0.80 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.147224E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.162 | TFLOPs: 19.31 | 31: iteration 22950/ 173500 | consumed samples: 5875200 | consumed tokens: 12032409600 | elapsed time per iteration (s): 0.81 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.173580E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.037 | TFLOPs: 19.18 | 31: iteration 22960/ 173500 | consumed samples: 5877760 | consumed tokens: 12037652480 | elapsed time per 
iteration (s): 0.77 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.195643E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.354 | TFLOPs: 19.99 | 31: iteration 22970/ 173500 | consumed samples: 5880320 | consumed tokens: 12042895360 | elapsed time per iteration (s): 0.77 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.158117E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.699 | TFLOPs: 20.01 | 31: iteration 22980/ 173500 | consumed samples: 5882880 | consumed tokens: 12048138240 | elapsed time per iteration (s): 0.70 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.167588E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 366.713 | TFLOPs: 22.19 | 31: iteration 22990/ 173500 | consumed samples: 5885440 | consumed tokens: 12053381120 | elapsed time per iteration (s): 0.76 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.173248E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.402 | TFLOPs: 20.35 | 31: iteration 23000/ 173500 | consumed samples: 5888000 | consumed tokens: 12058624000 | elapsed time per iteration (s): 0.76 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.179472E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.732 | TFLOPs: 20.25 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 23000 | lm loss value: 2.134733E+00 | lm loss PPL: 8.454792E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 23000 to checkpoints_1b1long 0: 
[2022-11-25 23:16:22,635] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step23000 is begin to save! 0: [2022-11-25 23:16:22,647] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_01-model_00-model_states.pt... 0: [2022-11-25 23:16:22,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_01-model_00-model_states.pt. 0: [2022-11-25 23:16:22,847] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_03-model_00-model_states.pt... 0: [2022-11-25 23:16:22,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_03-model_00-model_states.pt. 0: [2022-11-25 23:16:22,928] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_04-model_00-model_states.pt... 0: [2022-11-25 23:16:23,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_04-model_00-model_states.pt. 0: [2022-11-25 23:16:23,004] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_05-model_00-model_states.pt... 0: [2022-11-25 23:16:23,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_05-model_00-model_states.pt. 0: [2022-11-25 23:16:23,077] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_06-model_00-model_states.pt... 0: [2022-11-25 23:16:23,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_06-model_00-model_states.pt. 0: [2022-11-25 23:16:23,153] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_07-model_00-model_states.pt... 
0: [2022-11-25 23:16:23,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_07-model_00-model_states.pt. 0: [2022-11-25 23:16:23,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_08-model_00-model_states.pt... 0: [2022-11-25 23:16:23,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_08-model_00-model_states.pt. 0: [2022-11-25 23:16:23,298] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_09-model_00-model_states.pt... 0: [2022-11-25 23:16:23,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_09-model_00-model_states.pt. 0: [2022-11-25 23:16:23,373] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_10-model_00-model_states.pt... 0: [2022-11-25 23:16:23,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_10-model_00-model_states.pt. 0: [2022-11-25 23:16:23,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_11-model_00-model_states.pt... 0: [2022-11-25 23:16:23,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_11-model_00-model_states.pt. 0: [2022-11-25 23:16:23,519] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_12-model_00-model_states.pt... 0: [2022-11-25 23:16:23,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_12-model_00-model_states.pt. 0: [2022-11-25 23:16:23,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_13-model_00-model_states.pt... 
0: [2022-11-25 23:16:23,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_13-model_00-model_states.pt. 0: [2022-11-25 23:16:23,666] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_14-model_00-model_states.pt... 0: [2022-11-25 23:16:23,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_14-model_00-model_states.pt. 0: [2022-11-25 23:16:23,739] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_15-model_00-model_states.pt... 0: [2022-11-25 23:16:23,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_15-model_00-model_states.pt. 0: [2022-11-25 23:16:23,812] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_16-model_00-model_states.pt... 0: [2022-11-25 23:16:23,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_16-model_00-model_states.pt. 0: [2022-11-25 23:16:23,883] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_17-model_00-model_states.pt... 0: [2022-11-25 23:16:23,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_17-model_00-model_states.pt. 0: [2022-11-25 23:16:23,958] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_18-model_00-model_states.pt... 0: [2022-11-25 23:16:24,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_18-model_00-model_states.pt. 0: [2022-11-25 23:16:24,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_19-model_00-model_states.pt... 
0: [2022-11-25 23:16:24,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_19-model_00-model_states.pt. 0: [2022-11-25 23:16:24,103] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_20-model_00-model_states.pt... 0: [2022-11-25 23:16:24,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_20-model_00-model_states.pt. 0: [2022-11-25 23:16:24,177] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_21-model_00-model_states.pt... 0: [2022-11-25 23:16:24,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_21-model_00-model_states.pt. 0: [2022-11-25 23:16:24,250] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_22-model_00-model_states.pt... 0: [2022-11-25 23:16:24,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_22-model_00-model_states.pt. 0: [2022-11-25 23:16:24,321] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_23-model_00-model_states.pt... 0: [2022-11-25 23:16:24,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_23-model_00-model_states.pt. 0: [2022-11-25 23:16:24,395] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_24-model_00-model_states.pt... 0: [2022-11-25 23:16:24,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_24-model_00-model_states.pt. 0: [2022-11-25 23:16:24,466] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_25-model_00-model_states.pt... 
0: [2022-11-25 23:16:24,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_25-model_00-model_states.pt. 0: [2022-11-25 23:16:24,541] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_26-model_00-model_states.pt... 0: [2022-11-25 23:16:24,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_26-model_00-model_states.pt. 0: [2022-11-25 23:16:24,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_27-model_00-model_states.pt... 0: [2022-11-25 23:16:24,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_27-model_00-model_states.pt. 0: [2022-11-25 23:16:24,686] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_28-model_00-model_states.pt... 0: [2022-11-25 23:16:24,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_28-model_00-model_states.pt. 0: [2022-11-25 23:16:24,760] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/layer_30-model_00-model_states.pt... 0: [2022-11-25 23:16:24,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/layer_30-model_00-model_states.pt. 0: [2022-11-25 23:16:24,762] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step23000/mp_rank_00_model_states.pt 0: [2022-11-25 23:16:24,762] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/mp_rank_00_model_states.pt... 0: [2022-11-25 23:16:24,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/mp_rank_00_model_states.pt. 
0: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 
23: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 
24: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 
30: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 
19: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
7: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 
8: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
2: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
3: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 
23: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 
29: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 
17: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 
27: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 
5: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
1: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 
13: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 
11: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 
26: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
1: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 
24: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
6: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
31: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 
4: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
8: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:16:24,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:16:24,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:16:24,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 23:16:24,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 0: [2022-11-25 23:16:24,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:16:24,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:16:24,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-25 23:16:24,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 10: [2022-11-25 23:16:24,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-25 23:16:24,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 23:16:24,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 26: [2022-11-25 23:16:24,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:16:24,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-25 23:16:24,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 24: [2022-11-25 23:16:24,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:16:24,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-25 23:16:24,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 3: [2022-11-25 23:16:24,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:16:24,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 23:16:24,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 23: [2022-11-25 23:16:24,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-25 23:16:24,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-25 23:16:24,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 29: [2022-11-25 23:16:24,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:16:24,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:16:24,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-25 23:16:24,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 17: [2022-11-25 23:16:24,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-25 23:16:24,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 17: [2022-11-25 23:16:24,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:16:24,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 3: [2022-11-25 23:16:24,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:16:24,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
3: [2022-11-25 23:16:24,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 23:16:24,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 31: [2022-11-25 23:16:24,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:16:24,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:16:24,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:16:24,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:16:24,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-25 23:16:24,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 12: [2022-11-25 23:16:24,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:16:24,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
30: [2022-11-25 23:16:24,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 12: [2022-11-25 23:16:24,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 31: [2022-11-25 23:16:24,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 30: [2022-11-25 23:16:24,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 27: [2022-11-25 23:16:24,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 12: [2022-11-25 23:16:24,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 27: [2022-11-25 23:16:24,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 10: [2022-11-25 23:16:24,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:16:24,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 23:16:24,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 28: [2022-11-25 23:16:24,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:16:24,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
11: [2022-11-25 23:16:24,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:16:24,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 23:16:24,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 28: [2022-11-25 23:16:24,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-25 23:16:24,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-25 23:16:24,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 28: [2022-11-25 23:16:24,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 7: [2022-11-25 23:16:24,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:16:24,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:16:24,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:16:24,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 23:16:24,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
22: [2022-11-25 23:16:24,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-25 23:16:24,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-25 23:16:24,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 22: [2022-11-25 23:16:24,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 7: [2022-11-25 23:16:24,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:16:24,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 23:16:24,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 4: [2022-11-25 23:16:24,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:16:24,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:16:24,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 23:16:24,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
15: [2022-11-25 23:16:24,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 23:16:24,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 4: [2022-11-25 23:16:24,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:16:24,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 23:16:24,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 3: [2022-11-25 23:16:24,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:16:24,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 23:16:24,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 12: [2022-11-25 23:16:24,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:16:24,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 0: [2022-11-25 23:16:24,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:16:24,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
0: [2022-11-25 23:16:24,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 23:16:24,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 13: [2022-11-25 23:16:24,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:16:24,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 23:16:24,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 13: [2022-11-25 23:16:24,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:16:24,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 23:16:24,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 20: [2022-11-25 23:16:24,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:16:24,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 17: [2022-11-25 23:16:24,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:16:24,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
11: [2022-11-25 23:16:24,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:16:24,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 20: [2022-11-25 23:16:24,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:16:24,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 17: [2022-11-25 23:16:24,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-25 23:16:24,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 20: [2022-11-25 23:16:24,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-25 23:16:24,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 20: [2022-11-25 23:16:24,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:16:24,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-25 23:16:24,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 7: [2022-11-25 23:16:24,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
2: [2022-11-25 23:16:24,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:16:24,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:16:24,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:16:24,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 2: [2022-11-25 23:16:24,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 29: [2022-11-25 23:16:24,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:16:24,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 2: [2022-11-25 23:16:24,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 24: [2022-11-25 23:16:24,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:16:24,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 2: [2022-11-25 23:16:24,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
29: [2022-11-25 23:16:24,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 2: [2022-11-25 23:16:24,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 11: [2022-11-25 23:16:24,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:16:24,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 29: [2022-11-25 23:16:24,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 27: [2022-11-25 23:16:24,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 2: [2022-11-25 23:16:24,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:16:24,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 24: [2022-11-25 23:16:24,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 2: [2022-11-25 23:16:24,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 11: [2022-11-25 23:16:24,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 2: [2022-11-25 23:16:24,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
28: [2022-11-25 23:16:24,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:16:24,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:16:24,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 23:16:24,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 31: [2022-11-25 23:16:24,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:16:24,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:16:24,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:16:24,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 22: [2022-11-25 23:16:24,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 30: [2022-11-25 23:16:24,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 31: [2022-11-25 23:16:24,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
22: [2022-11-25 23:16:24,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 30: [2022-11-25 23:16:24,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 25: [2022-11-25 23:16:24,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:16:24,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-25 23:16:24,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 25: [2022-11-25 23:16:24,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:16:24,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-25 23:16:24,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 26: [2022-11-25 23:16:24,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:16:24,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
26: [2022-11-25 23:16:24,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 13: [2022-11-25 23:16:24,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 23:16:24,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 15: [2022-11-25 23:16:24,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:16:24,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 12: [2022-11-25 23:16:24,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:16:24,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:16:24,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 27: [2022-11-25 23:16:24,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 12: [2022-11-25 23:16:24,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 15: [2022-11-25 23:16:24,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 23:16:24,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
27: [2022-11-25 23:16:24,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 28: [2022-11-25 23:16:24,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 22: [2022-11-25 23:16:24,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:16:24,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 22: [2022-11-25 23:16:24,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 28: [2022-11-25 23:16:24,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:16:24,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 28: [2022-11-25 23:16:24,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-25 23:16:24,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 24: [2022-11-25 23:16:24,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:16:24,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-25 23:16:24,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
20: [2022-11-25 23:16:24,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:16:24,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:16:24,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 20: [2022-11-25 23:16:24,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 24: [2022-11-25 23:16:24,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 20: [2022-11-25 23:16:24,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 3: [2022-11-25 23:16:24,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:16:24,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 26: [2022-11-25 23:16:24,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:16:24,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 26: [2022-11-25 23:16:24,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-25 23:16:24,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
11: [2022-11-25 23:16:24,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:16:24,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 23:16:24,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 29: [2022-11-25 23:16:24,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:16:24,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-25 23:16:24,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 31: [2022-11-25 23:16:24,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:16:24,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-25 23:16:24,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 30: [2022-11-25 23:16:24,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:16:24,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-25 23:16:24,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
30: [2022-11-25 23:16:24,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:16:24,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-25 23:16:24,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 15: [2022-11-25 23:16:24,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:16:24,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 23:16:24,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 7: [2022-11-25 23:16:24,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:16:24,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 23:16:24,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 17: [2022-11-25 23:16:24,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:16:24,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:16:24,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
23: [2022-11-25 23:16:24,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:16:24,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:16:24,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:16:24,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 23:16:24,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 13: [2022-11-25 23:16:24,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 23: [2022-11-25 23:16:24,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 17: [2022-11-25 23:16:24,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 0: [2022-11-25 23:16:24,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 13: [2022-11-25 23:16:24,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 23: [2022-11-25 23:16:24,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
23: [2022-11-25 23:16:24,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 17: [2022-11-25 23:16:24,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 0: [2022-11-25 23:16:24,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 23: [2022-11-25 23:16:24,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 25: [2022-11-25 23:16:24,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:16:24,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 23: [2022-11-25 23:16:24,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:16:24,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 25: [2022-11-25 23:16:24,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 23: [2022-11-25 23:16:24,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 24: [2022-11-25 23:16:24,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
24: [2022-11-25 23:16:24,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-25 23:16:24,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 27: [2022-11-25 23:16:24,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:16:24,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-25 23:16:24,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 15: [2022-11-25 23:16:24,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:16:24,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 23:16:24,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 4: [2022-11-25 23:16:24,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:16:24,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-25 23:16:24,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 23:16:24,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 23:16:24,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 4: [2022-11-25 23:16:24,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 8: [2022-11-25 23:16:24,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:16:24,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:16:24,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:16:24,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 23:16:24,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 23:16:24,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 23:16:24,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
8: [2022-11-25 23:16:24,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:16:24,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 8: [2022-11-25 23:16:24,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 8: [2022-11-25 23:16:24,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 23:16:24,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 25: [2022-11-25 23:16:24,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:16:24,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-25 23:16:24,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 2: [2022-11-25 23:16:24,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:16:24,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 23:16:24,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 12: [2022-11-25 23:16:24,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-25 23:16:24,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 23:16:24,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 0: [2022-11-25 23:16:24,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 23:16:24,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 14: [2022-11-25 23:16:24,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:16:24,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:16:24,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:16:24,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:16:24,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 23:16:24,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 23:16:24,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
14: [2022-11-25 23:16:24,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 23:16:24,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 23:16:24,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 14: [2022-11-25 23:16:24,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 14: [2022-11-25 23:16:24,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 18: [2022-11-25 23:16:24,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:16:24,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:16:24,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:16:24,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
18: [2022-11-25 23:16:24,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-25 23:16:24,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-25 23:16:24,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-25 23:16:24,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-25 23:16:24,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 18: [2022-11-25 23:16:24,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 18: [2022-11-25 23:16:24,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 18: [2022-11-25 23:16:24,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 5: [2022-11-25 23:16:24,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:16:24,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 23:16:24,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 5: [2022-11-25 23:16:24,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-25 23:16:24,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 23:16:24,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 5: [2022-11-25 23:16:24,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:16:24,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:16:24,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 23:16:24,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 23:16:24,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 5: [2022-11-25 23:16:24,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 5: [2022-11-25 23:16:24,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:16:24,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 23:16:24,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 10: [2022-11-25 23:16:24,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-25 23:16:24,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 23:16:24,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 9: [2022-11-25 23:16:24,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:16:24,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:16:24,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:16:24,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:16:24,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:16:24,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 23:16:24,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
9: [2022-11-25 23:16:24,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 23:16:24,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 23:16:24,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 23:16:24,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 23:16:24,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 9: [2022-11-25 23:16:24,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 9: [2022-11-25 23:16:24,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 9: [2022-11-25 23:16:24,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 20: [2022-11-25 23:16:24,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:16:24,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-25 23:16:24,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 11: [2022-11-25 23:16:24,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-25 23:16:24,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 23:16:24,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 29: [2022-11-25 23:16:24,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:16:24,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-25 23:16:24,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 4: [2022-11-25 23:16:24,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:16:24,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 23:16:24,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 0: [2022-11-25 23:16:24,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:16:24,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 23:16:24,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 17: [2022-11-25 23:16:24,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-25 23:16:24,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-25 23:16:24,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 29: [2022-11-25 23:16:24,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:16:24,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-25 23:16:24,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 26: [2022-11-25 23:16:24,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:16:24,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-25 23:16:24,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 27: [2022-11-25 23:16:24,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:16:24,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-25 23:16:24,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 3: [2022-11-25 23:16:24,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-25 23:16:24,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 23:16:24,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 8: [2022-11-25 23:16:24,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:16:24,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 23:16:24,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 13: [2022-11-25 23:16:24,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:16:24,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 23:16:24,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 28: [2022-11-25 23:16:24,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:16:24,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-25 23:16:24,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 30: [2022-11-25 23:16:24,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-25 23:16:24,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-25 23:16:24,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 25: [2022-11-25 23:16:24,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:16:24,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-25 23:16:24,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 23: [2022-11-25 23:16:24,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:16:24,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-25 23:16:24,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 15: [2022-11-25 23:16:24,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:16:24,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 23:16:24,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 7: [2022-11-25 23:16:24,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-25 23:16:24,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 23:16:24,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 2: [2022-11-25 23:16:24,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:16:24,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 23:16:24,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 31: [2022-11-25 23:16:24,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:16:24,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 5: [2022-11-25 23:16:24,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:16:24,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 5: [2022-11-25 23:16:24,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 23:16:24,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 18: [2022-11-25 23:16:24,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
18: [2022-11-25 23:16:24,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-25 23:16:24,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 22: [2022-11-25 23:16:24,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:16:24,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-25 23:16:24,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 11: [2022-11-25 23:16:25,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:16:25,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 0: [2022-11-25 23:16:25,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:16:25,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 0: [2022-11-25 23:16:25,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 23:16:25,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 10: [2022-11-25 23:16:25,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-25 23:16:25,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 23:16:25,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 17: [2022-11-25 23:16:25,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:16:25,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-25 23:16:25,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 26: [2022-11-25 23:16:25,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:16:25,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:16:25,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-25 23:16:25,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 24: [2022-11-25 23:16:25,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-25 23:16:25,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 29: [2022-11-25 23:16:25,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 
29: [2022-11-25 23:16:25,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-25 23:16:25,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 9: [2022-11-25 23:16:25,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:16:25,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 23:16:25,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 27: [2022-11-25 23:16:25,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:16:25,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-25 23:16:25,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 20: [2022-11-25 23:16:25,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:16:25,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-25 23:16:25,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 3: [2022-11-25 23:16:25,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-25 23:16:25,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 23:16:25,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 8: [2022-11-25 23:16:25,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:16:25,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 4: [2022-11-25 23:16:25,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:16:25,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 4: [2022-11-25 23:16:25,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 23:16:25,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 23: [2022-11-25 23:16:25,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:16:25,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-25 23:16:25,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 31: [2022-11-25 23:16:25,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
31: [2022-11-25 23:16:25,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-25 23:16:25,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 13: [2022-11-25 23:16:25,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:16:25,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 23:16:25,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 25: [2022-11-25 23:16:25,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:16:25,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 14: [2022-11-25 23:16:25,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:16:25,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 14: [2022-11-25 23:16:25,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 23:16:25,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 30: [2022-11-25 23:16:25,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 
30: [2022-11-25 23:16:25,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-25 23:16:25,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 28: [2022-11-25 23:16:25,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:16:25,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-25 23:16:25,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 7: [2022-11-25 23:16:25,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:16:25,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 23:16:25,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 15: [2022-11-25 23:16:25,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:16:25,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 23:16:25,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 22: [2022-11-25 23:16:25,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
22: [2022-11-25 23:16:25,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 18: [2022-11-25 23:16:25,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:16:25,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 18: [2022-11-25 23:16:25,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-25 23:16:25,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 2: [2022-11-25 23:16:25,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:16:25,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 23:16:25,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 14: [2022-11-25 23:16:25,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:16:25,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 23:16:25,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 5: [2022-11-25 23:16:25,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-25 23:16:25,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 23:16:25,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 11: [2022-11-25 23:16:25,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:16:25,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 23:16:25,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 10: [2022-11-25 23:16:25,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:16:25,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 23:16:25,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 0: [2022-11-25 23:16:25,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:16:25,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 23:16:25,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 24: [2022-11-25 23:16:25,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
24: [2022-11-25 23:16:25,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-25 23:16:25,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 29: [2022-11-25 23:16:25,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:16:25,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-25 23:16:25,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 4: [2022-11-25 23:16:25,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:16:25,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 23:16:25,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 20: [2022-11-25 23:16:25,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:16:25,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-25 23:16:25,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 17: [2022-11-25 23:16:25,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 
17: [2022-11-25 23:16:25,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-25 23:16:25,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 26: [2022-11-25 23:16:25,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:16:25,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-25 23:16:25,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 27: [2022-11-25 23:16:25,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:16:25,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-25 23:16:25,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 9: [2022-11-25 23:16:25,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:16:25,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 23:16:25,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 3: [2022-11-25 23:16:25,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-25 23:16:25,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 23:16:25,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 13: [2022-11-25 23:16:25,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:16:25,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 23:16:25,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 8: [2022-11-25 23:16:25,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:16:25,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 23:16:25,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 12: [2022-11-25 23:16:25,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:16:25,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 23:16:25,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 31: [2022-11-25 23:16:25,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 
31: [2022-11-25 23:16:25,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-25 23:16:25,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 28: [2022-11-25 23:16:25,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:16:25,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-25 23:16:25,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 23: [2022-11-25 23:16:25,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:16:25,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-25 23:16:25,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 18: [2022-11-25 23:16:25,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:16:25,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-25 23:16:25,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 25: [2022-11-25 23:16:25,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
25: [2022-11-25 23:16:25,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-25 23:16:25,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 30: [2022-11-25 23:16:25,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:16:25,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-25 23:16:25,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 15: [2022-11-25 23:16:25,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:16:25,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 23:16:25,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 19: [2022-11-25 23:16:25,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:16:25,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-25 23:16:25,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 19: [2022-11-25 23:16:25,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
19: [2022-11-25 23:16:25,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-25 23:16:25,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 1: [2022-11-25 23:16:25,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:16:25,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 23:16:25,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 0: [2022-11-25 23:16:25,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:16:25,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 23:16:25,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 18: [2022-11-25 23:16:25,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:16:25,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-25 23:16:25,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 4: [2022-11-25 23:16:25,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-25 23:16:25,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 23:16:25,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 17: [2022-11-25 23:16:25,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:16:25,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-25 23:16:25,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 5: [2022-11-25 23:16:25,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:16:25,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 23:16:25,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 23: [2022-11-25 23:16:25,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:16:25,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-25 23:16:25,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 11: [2022-11-25 23:16:25,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-25 23:16:25,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 23:16:25,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 24: [2022-11-25 23:16:25,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:16:25,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-25 23:16:25,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 28: [2022-11-25 23:16:25,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:16:25,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 7: [2022-11-25 23:16:25,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:16:25,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 7: [2022-11-25 23:16:25,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 25: [2022-11-25 23:16:25,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:16:25,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
25: [2022-11-25 23:16:25,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-25 23:16:25,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 10: [2022-11-25 23:16:25,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:16:25,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:16:25,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 19: [2022-11-25 23:16:25,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 10: [2022-11-25 23:16:25,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 19: [2022-11-25 23:16:25,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 12: [2022-11-25 23:16:25,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:16:25,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 1: [2022-11-25 23:16:25,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:16:25,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
1: [2022-11-25 23:16:25,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 30: [2022-11-25 23:16:25,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:16:25,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 30: [2022-11-25 23:16:25,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 26: [2022-11-25 23:16:25,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:16:25,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 27: [2022-11-25 23:16:25,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:16:25,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-25 23:16:25,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 27: [2022-11-25 23:16:25,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-25 23:16:25,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 13: [2022-11-25 23:16:25,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-25 23:16:25,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 23:16:25,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 22: [2022-11-25 23:16:25,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:16:25,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-25 23:16:25,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 3: [2022-11-25 23:16:25,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:16:25,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:16:25,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 20: [2022-11-25 23:16:25,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 3: [2022-11-25 23:16:25,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 20: [2022-11-25 23:16:25,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 14: [2022-11-25 23:16:25,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
29: [2022-11-25 23:16:25,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:16:25,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 23:16:25,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 29: [2022-11-25 23:16:25,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-25 23:16:25,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 2: [2022-11-25 23:16:25,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:16:25,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 23:16:25,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 8: [2022-11-25 23:16:25,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:16:25,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 23:16:25,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 9: [2022-11-25 23:16:25,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-25 23:16:25,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 23:16:25,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 16: [2022-11-25 23:16:25,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:16:25,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:16:25,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-25 23:16:25,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-25 23:16:25,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 16: [2022-11-25 23:16:25,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 1: [2022-11-25 23:16:25,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:16:25,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
1: [2022-11-25 23:16:25,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 31: [2022-11-25 23:16:25,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 1: [2022-11-25 23:16:25,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 31: [2022-11-25 23:16:25,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 14: [2022-11-25 23:16:25,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:16:25,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 23:16:25,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 15: [2022-11-25 23:16:25,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:16:25,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:16:25,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 23:16:25,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
1: [2022-11-25 23:16:25,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 23:16:25,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 22: [2022-11-25 23:16:25,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:16:25,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-25 23:16:25,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 6: [2022-11-25 23:16:25,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:16:25,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 23:16:25,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 6: [2022-11-25 23:16:25,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:16:25,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:16:25,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 23:16:25,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
6: [2022-11-25 23:16:25,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 23:16:25,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 7: [2022-11-25 23:16:25,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:16:25,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 16: [2022-11-25 23:16:25,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:16:25,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 16: [2022-11-25 23:16:25,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-25 23:16:25,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 12: [2022-11-25 23:16:25,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:16:25,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-25 23:16:25,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 12: [2022-11-25 23:16:25,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 19: [2022-11-25 23:16:25,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 12: [2022-11-25 23:16:25,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 19: [2022-11-25 23:16:25,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:16:25,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-25 23:16:25,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 16: [2022-11-25 23:16:25,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:16:25,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-25 23:16:25,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:16:25,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
16: [2022-11-25 23:16:25,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-25 23:16:25,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 16: [2022-11-25 23:16:25,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:16:25,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-25 23:16:25,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 2: [2022-11-25 23:16:25,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:16:25,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 23:16:25,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 6: [2022-11-25 23:16:25,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:16:25,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 23:16:25,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 6: [2022-11-25 23:16:25,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-25 23:16:25,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 23:16:25,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 1: [2022-11-25 23:16:25,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:16:25,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 23:16:25,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 12: [2022-11-25 23:16:25,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:16:25,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 23:16:25,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 21: [2022-11-25 23:16:25,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:16:25,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:16:25,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:16:25,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-25 23:16:25,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:16:25,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:16:25,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-25 23:16:25,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-25 23:16:25,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-25 23:16:25,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-25 23:16:25,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-25 23:16:25,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-25 23:16:25,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 21: [2022-11-25 23:16:25,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 21: [2022-11-25 23:16:25,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 21: [2022-11-25 23:16:25,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
21: [2022-11-25 23:16:25,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 21: [2022-11-25 23:16:25,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 6: [2022-11-25 23:16:25,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:16:25,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 23:16:25,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 19: [2022-11-25 23:16:25,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:16:25,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-25 23:16:25,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 16: [2022-11-25 23:16:25,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:16:25,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-25 23:16:25,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 6: [2022-11-25 23:16:25,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-25 23:16:25,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 23:16:25,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 21: [2022-11-25 23:16:25,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:16:25,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-25 23:16:25,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 1: [2022-11-25 23:16:25,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:16:25,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 23:16:25,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 1: [2022-11-25 23:16:25,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:16:25,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 23:16:25,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 16: [2022-11-25 23:16:25,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
16: [2022-11-25 23:16:25,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-25 23:16:25,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 6: [2022-11-25 23:16:25,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:16:25,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 23:16:25,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 19: [2022-11-25 23:16:25,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:16:25,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:16:25,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-25 23:16:25,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-25 23:16:25,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 19: [2022-11-25 23:16:25,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 21: [2022-11-25 23:16:25,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
21: [2022-11-25 23:16:25,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-25 23:16:25,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 1: [2022-11-25 23:16:25,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:16:25,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step23000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 23:16:25,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 0: successfully saved checkpoint at iteration 23000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2567.38 31: iteration 23010/ 173500 | consumed samples: 5890560 | consumed tokens: 12063866880 | elapsed time per iteration (s): 1.04 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.189692E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.723 | TFLOPs: 14.93 | 31: iteration 23020/ 173500 | consumed samples: 5893120 | consumed tokens: 12069109760 | elapsed time per iteration (s): 0.77 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.173501E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.493 | TFLOPs: 20.24 | 31: iteration 23030/ 173500 | consumed samples: 5895680 | consumed tokens: 12074352640 | elapsed time per iteration (s): 0.79 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.173636E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.386 | TFLOPs: 19.56 | 31: iteration 23040/ 173500 | 
consumed samples: 5898240 | consumed tokens: 12079595520 | elapsed time per iteration (s): 0.75 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.175819E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.471 | TFLOPs: 20.60 | 31: iteration 23050/ 173500 | consumed samples: 5900800 | consumed tokens: 12084838400 | elapsed time per iteration (s): 0.76 | learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.154141E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.252 | TFLOPs: 20.28 | 31: iteration 23060/ 173500 | consumed samples: 5903360 | consumed tokens: 12090081280 | elapsed time per iteration (s): 0.77 | learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.202796E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.318 | TFLOPs: 20.23 | 31: iteration 23070/ 173500 | consumed samples: 5905920 | consumed tokens: 12095324160 | elapsed time per iteration (s): 0.75 | learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.174138E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.532 | TFLOPs: 20.72 | 31: iteration 23080/ 173500 | consumed samples: 5908480 | consumed tokens: 12100567040 | elapsed time per iteration (s): 0.76 | learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.201620E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.923 | TFLOPs: 20.44 | 31: iteration 23090/ 173500 | consumed samples: 5911040 | consumed tokens: 12105809920 | elapsed time per iteration (s): 0.75 | learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.155788E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 339.909 | TFLOPs: 20.56 | 31: iteration 23100/ 173500 | consumed samples: 5913600 | consumed tokens: 12111052800 | elapsed time per iteration (s): 0.76 | learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.171385E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.937 | TFLOPs: 20.26 | 31: iteration 23110/ 173500 | consumed samples: 5916160 | consumed tokens: 12116295680 | elapsed time per iteration (s): 0.76 | learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.165712E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.425 | TFLOPs: 20.35 | 31: iteration 23120/ 173500 | consumed samples: 5918720 | consumed tokens: 12121538560 | elapsed time per iteration (s): 0.80 | learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.156242E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.647 | TFLOPs: 19.46 | 31: iteration 23130/ 173500 | consumed samples: 5921280 | consumed tokens: 12126781440 | elapsed time per iteration (s): 0.74 | learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.221379E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.154 | TFLOPs: 21.06 | 31: iteration 23140/ 173500 | consumed samples: 5923840 | consumed tokens: 12132024320 | elapsed time per iteration (s): 0.75 | learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.171112E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.828 | TFLOPs: 20.56 | 31: iteration 23150/ 173500 | consumed samples: 5926400 | consumed tokens: 12137267200 | elapsed time per iteration (s): 0.78 | learning rate: 1.932E-04 | global batch size: 
256 | lm loss: 2.156700E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.872 | TFLOPs: 19.96 | 31: iteration 23160/ 173500 | consumed samples: 5928960 | consumed tokens: 12142510080 | elapsed time per iteration (s): 0.80 | learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.144379E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.745 | TFLOPs: 19.46 | 31: iteration 23170/ 173500 | consumed samples: 5931520 | consumed tokens: 12147752960 | elapsed time per iteration (s): 0.78 | learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.198683E+00 | grad norm: 0.252 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.561 | TFLOPs: 19.82 | 31: iteration 23180/ 173500 | consumed samples: 5934080 | consumed tokens: 12152995840 | elapsed time per iteration (s): 0.82 | learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.185450E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.512 | TFLOPs: 18.97 | 31: iteration 23190/ 173500 | consumed samples: 5936640 | consumed tokens: 12158238720 | elapsed time per iteration (s): 0.76 | learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.207281E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.153 | TFLOPs: 20.34 | 31: iteration 23200/ 173500 | consumed samples: 5939200 | consumed tokens: 12163481600 | elapsed time per iteration (s): 0.78 | learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.184784E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.166 | TFLOPs: 19.85 | 31: iteration 23210/ 173500 | consumed samples: 5941760 | consumed 
tokens: 12168724480 | elapsed time per iteration (s): 0.84 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 2.165486E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.819 | TFLOPs: 18.50 | 31: iteration 23220/ 173500 | consumed samples: 5944320 | consumed tokens: 12173967360 | elapsed time per iteration (s): 0.74 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 2.178614E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.768 | TFLOPs: 20.86 | 31: iteration 23230/ 173500 | consumed samples: 5946880 | consumed tokens: 12179210240 | elapsed time per iteration (s): 0.86 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 2.171137E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.339 | TFLOPs: 17.99 | 31: iteration 23240/ 173500 | consumed samples: 5949440 | consumed tokens: 12184453120 | elapsed time per iteration (s): 0.74 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 2.174726E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.408 | TFLOPs: 20.84 | 31: iteration 23250/ 173500 | consumed samples: 5952000 | consumed tokens: 12189696000 | elapsed time per iteration (s): 0.76 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 2.197163E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.405 | TFLOPs: 20.35 | 31: iteration 23260/ 173500 | consumed samples: 5954560 | consumed tokens: 12194938880 | elapsed time per iteration (s): 0.80 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 2.169619E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 321.653 | TFLOPs: 19.46 | 31: iteration 23270/ 173500 | consumed samples: 5957120 | consumed tokens: 12200181760 | elapsed time per iteration (s): 0.82 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 2.198256E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.782 | TFLOPs: 18.80 | 31: iteration 23280/ 173500 | consumed samples: 5959680 | consumed tokens: 12205424640 | elapsed time per iteration (s): 0.73 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 2.180122E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.170 | TFLOPs: 21.12 | 31: iteration 23290/ 173500 | consumed samples: 5962240 | consumed tokens: 12210667520 | elapsed time per iteration (s): 0.80 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 2.176145E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.862 | TFLOPs: 19.41 | 31: iteration 23300/ 173500 | consumed samples: 5964800 | consumed tokens: 12215910400 | elapsed time per iteration (s): 1.08 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 2.198994E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.164 | TFLOPs: 14.29 | 31: iteration 23310/ 173500 | consumed samples: 5967360 | consumed tokens: 12221153280 | elapsed time per iteration (s): 0.85 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 2.180870E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.467 | TFLOPs: 18.24 | 31: iteration 23320/ 173500 | consumed samples: 5969920 | consumed tokens: 12226396160 | elapsed time per iteration (s): 0.78 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 2.191442E+00 | grad norm: 
0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.498 | TFLOPs: 19.75 | 31: iteration 23330/ 173500 | consumed samples: 5972480 | consumed tokens: 12231639040 | elapsed time per iteration (s): 0.84 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 2.165227E+00 | grad norm: 0.287 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.575 | TFLOPs: 18.43 | 31: iteration 23340/ 173500 | consumed samples: 5975040 | consumed tokens: 12236881920 | elapsed time per iteration (s): 0.81 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 2.190460E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.930 | TFLOPs: 19.17 | 31: iteration 23350/ 173500 | consumed samples: 5977600 | consumed tokens: 12242124800 | elapsed time per iteration (s): 0.78 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 2.186458E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.444 | TFLOPs: 19.75 | 31: iteration 23360/ 173500 | consumed samples: 5980160 | consumed tokens: 12247367680 | elapsed time per iteration (s): 0.77 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 2.190451E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.790 | TFLOPs: 20.01 | 31: iteration 23370/ 173500 | consumed samples: 5982720 | consumed tokens: 12252610560 | elapsed time per iteration (s): 0.80 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.171738E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.760 | TFLOPs: 19.34 | 31: iteration 23380/ 173500 | consumed samples: 5985280 | consumed tokens: 12257853440 | elapsed time per 
iteration (s): 0.78 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.192475E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.882 | TFLOPs: 19.96 | 31: iteration 23390/ 173500 | consumed samples: 5987840 | consumed tokens: 12263096320 | elapsed time per iteration (s): 0.79 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.180378E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.448 | TFLOPs: 19.57 | 31: iteration 23400/ 173500 | consumed samples: 5990400 | consumed tokens: 12268339200 | elapsed time per iteration (s): 0.79 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.149402E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.949 | TFLOPs: 19.54 | 31: iteration 23410/ 173500 | consumed samples: 5992960 | consumed tokens: 12273582080 | elapsed time per iteration (s): 0.83 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.173815E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.749 | TFLOPs: 18.74 | 31: iteration 23420/ 173500 | consumed samples: 5995520 | consumed tokens: 12278824960 | elapsed time per iteration (s): 0.77 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.185521E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.936 | TFLOPs: 20.14 | 31: iteration 23430/ 173500 | consumed samples: 5998080 | consumed tokens: 12284067840 | elapsed time per iteration (s): 0.76 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.162843E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.897 | TFLOPs: 20.50 | 31: 
iteration 23440/ 173500 | consumed samples: 6000640 | consumed tokens: 12289310720 | elapsed time per iteration (s): 0.79 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.184309E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.747 | TFLOPs: 19.53 | 31: iteration 23450/ 173500 | consumed samples: 6003200 | consumed tokens: 12294553600 | elapsed time per iteration (s): 0.74 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.164466E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.347 | TFLOPs: 20.83 | 31: iteration 23460/ 173500 | consumed samples: 6005760 | consumed tokens: 12299796480 | elapsed time per iteration (s): 0.75 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.164721E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.916 | TFLOPs: 20.69 | 31: iteration 23470/ 173500 | consumed samples: 6008320 | consumed tokens: 12305039360 | elapsed time per iteration (s): 0.77 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.196072E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.433 | TFLOPs: 20.05 | 31: iteration 23480/ 173500 | consumed samples: 6010880 | consumed tokens: 12310282240 | elapsed time per iteration (s): 0.74 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.194134E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.862 | TFLOPs: 20.86 | 31: iteration 23490/ 173500 | consumed samples: 6013440 | consumed tokens: 12315525120 | elapsed time per iteration (s): 0.79 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.182988E+00 | grad norm: 0.149 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.934 | TFLOPs: 19.60 | 31: iteration 23500/ 173500 | consumed samples: 6016000 | consumed tokens: 12320768000 | elapsed time per iteration (s): 0.77 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.188959E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.421 | TFLOPs: 19.99 | 31: iteration 23510/ 173500 | consumed samples: 6018560 | consumed tokens: 12326010880 | elapsed time per iteration (s): 0.78 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.151898E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.514 | TFLOPs: 19.75 | 31: iteration 23520/ 173500 | consumed samples: 6021120 | consumed tokens: 12331253760 | elapsed time per iteration (s): 0.74 | learning rate: 1.929E-04 | global batch size: 256 | lm loss: 2.199444E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.787 | TFLOPs: 20.98 | 31: iteration 23530/ 173500 | consumed samples: 6023680 | consumed tokens: 12336496640 | elapsed time per iteration (s): 0.83 | learning rate: 1.929E-04 | global batch size: 256 | lm loss: 2.183848E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.162 | TFLOPs: 18.70 | 31: iteration 23540/ 173500 | consumed samples: 6026240 | consumed tokens: 12341739520 | elapsed time per iteration (s): 0.84 | learning rate: 1.929E-04 | global batch size: 256 | lm loss: 2.259173E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.055 | TFLOPs: 18.52 | 31: iteration 23550/ 173500 | consumed samples: 6028800 | consumed tokens: 12346982400 | elapsed time per iteration (s): 0.76 | learning rate: 
1.929E-04 | global batch size: 256 | lm loss: 2.213420E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.475 | TFLOPs: 20.48 | 31: iteration 23560/ 173500 | consumed samples: 6031360 | consumed tokens: 12352225280 | elapsed time per iteration (s): 0.79 | learning rate: 1.929E-04 | global batch size: 256 | lm loss: 2.203775E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.615 | TFLOPs: 19.64 | 31: iteration 23570/ 173500 | consumed samples: 6033920 | consumed tokens: 12357468160 | elapsed time per iteration (s): 0.74 | learning rate: 1.929E-04 | global batch size: 256 | lm loss: 2.172694E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.750 | TFLOPs: 20.92 | 31: iteration 23580/ 173500 | consumed samples: 6036480 | consumed tokens: 12362711040 | elapsed time per iteration (s): 0.75 | learning rate: 1.929E-04 | global batch size: 256 | lm loss: 2.167473E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.655 | TFLOPs: 20.61 | 31: iteration 23590/ 173500 | consumed samples: 6039040 | consumed tokens: 12367953920 | elapsed time per iteration (s): 0.73 | learning rate: 1.929E-04 | global batch size: 256 | lm loss: 2.151466E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.191 | TFLOPs: 21.19 | 31: iteration 23600/ 173500 | consumed samples: 6041600 | consumed tokens: 12373196800 | elapsed time per iteration (s): 0.75 | learning rate: 1.929E-04 | global batch size: 256 | lm loss: 2.180539E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.293 | TFLOPs: 20.77 | 31: iteration 23610/ 173500 | consumed 
samples: 6044160 | consumed tokens: 12378439680 | elapsed time per iteration (s): 0.78 | learning rate: 1.929E-04 | global batch size: 256 | lm loss: 2.191364E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.002 | TFLOPs: 19.84 | 31: iteration 23620/ 173500 | consumed samples: 6046720 | consumed tokens: 12383682560 | elapsed time per iteration (s): 0.77 | learning rate: 1.929E-04 | global batch size: 256 | lm loss: 2.202708E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.986 | TFLOPs: 20.08 | 31: iteration 23630/ 173500 | consumed samples: 6049280 | consumed tokens: 12388925440 | elapsed time per iteration (s): 0.80 | learning rate: 1.929E-04 | global batch size: 256 | lm loss: 2.159719E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.384 | TFLOPs: 19.26 | 31: iteration 23640/ 173500 | consumed samples: 6051840 | consumed tokens: 12394168320 | elapsed time per iteration (s): 1.57 | learning rate: 1.929E-04 | global batch size: 256 | lm loss: 2.173934E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 163.012 | TFLOPs: 9.86 | 31: iteration 23650/ 173500 | consumed samples: 6054400 | consumed tokens: 12399411200 | elapsed time per iteration (s): 0.75 | learning rate: 1.929E-04 | global batch size: 256 | lm loss: 2.142799E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.773 | TFLOPs: 20.62 | 31: iteration 23660/ 173500 | consumed samples: 6056960 | consumed tokens: 12404654080 | elapsed time per iteration (s): 0.78 | learning rate: 1.929E-04 | global batch size: 256 | lm loss: 2.175491E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 329.278 | TFLOPs: 19.92 | 31: iteration 23670/ 173500 | consumed samples: 6059520 | consumed tokens: 12409896960 | elapsed time per iteration (s): 0.83 | learning rate: 1.929E-04 | global batch size: 256 | lm loss: 2.175669E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.734 | TFLOPs: 18.62 | 31: iteration 23680/ 173500 | consumed samples: 6062080 | consumed tokens: 12415139840 | elapsed time per iteration (s): 1.48 | learning rate: 1.928E-04 | global batch size: 256 | lm loss: 2.167675E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 173.069 | TFLOPs: 10.47 | 31: iteration 23690/ 173500 | consumed samples: 6064640 | consumed tokens: 12420382720 | elapsed time per iteration (s): 0.84 | learning rate: 1.928E-04 | global batch size: 256 | lm loss: 2.185789E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.158 | TFLOPs: 18.34 | 31: iteration 23700/ 173500 | consumed samples: 6067200 | consumed tokens: 12425625600 | elapsed time per iteration (s): 0.77 | learning rate: 1.928E-04 | global batch size: 256 | lm loss: 2.184895E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.732 | TFLOPs: 20.07 | 31: iteration 23710/ 173500 | consumed samples: 6069760 | consumed tokens: 12430868480 | elapsed time per iteration (s): 0.82 | learning rate: 1.928E-04 | global batch size: 256 | lm loss: 2.174956E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.386 | TFLOPs: 18.78 | 31: iteration 23720/ 173500 | consumed samples: 6072320 | consumed tokens: 12436111360 | elapsed time per iteration (s): 0.84 | learning rate: 1.928E-04 | global batch size: 256 | lm 
loss: 2.170791E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.881 | TFLOPs: 18.44 | 31: iteration 23730/ 173500 | consumed samples: 6074880 | consumed tokens: 12441354240 | elapsed time per iteration (s): 0.82 | learning rate: 1.928E-04 | global batch size: 256 | lm loss: 2.194825E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.507 | TFLOPs: 18.91 | 31: iteration 23740/ 173500 | consumed samples: 6077440 | consumed tokens: 12446597120 | elapsed time per iteration (s): 0.82 | learning rate: 1.928E-04 | global batch size: 256 | lm loss: 2.139414E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.986 | TFLOPs: 18.81 | 31: iteration 23750/ 173500 | consumed samples: 6080000 | consumed tokens: 12451840000 | elapsed time per iteration (s): 0.88 | learning rate: 1.928E-04 | global batch size: 256 | lm loss: 2.149802E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.218 | TFLOPs: 17.62 | 31: iteration 23760/ 173500 | consumed samples: 6082560 | consumed tokens: 12457082880 | elapsed time per iteration (s): 0.80 | learning rate: 1.928E-04 | global batch size: 256 | lm loss: 2.170250E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.957 | TFLOPs: 19.42 | 31: iteration 23770/ 173500 | consumed samples: 6085120 | consumed tokens: 12462325760 | elapsed time per iteration (s): 0.77 | learning rate: 1.928E-04 | global batch size: 256 | lm loss: 2.193987E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.047 | TFLOPs: 20.09 | 31: iteration 23780/ 173500 | consumed samples: 6087680 | consumed tokens: 
12467568640 | elapsed time per iteration (s): 0.78 | learning rate: 1.928E-04 | global batch size: 256 | lm loss: 2.197308E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.896 | TFLOPs: 19.78 | 31: iteration 23790/ 173500 | consumed samples: 6090240 | consumed tokens: 12472811520 | elapsed time per iteration (s): 0.76 | learning rate: 1.928E-04 | global batch size: 256 | lm loss: 2.201463E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.017 | TFLOPs: 20.27 | 31: iteration 23800/ 173500 | consumed samples: 6092800 | consumed tokens: 12478054400 | elapsed time per iteration (s): 0.78 | learning rate: 1.928E-04 | global batch size: 256 | lm loss: 2.177231E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.216 | TFLOPs: 19.86 | 31: iteration 23810/ 173500 | consumed samples: 6095360 | consumed tokens: 12483297280 | elapsed time per iteration (s): 0.80 | learning rate: 1.928E-04 | global batch size: 256 | lm loss: 2.160352E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.509 | TFLOPs: 19.45 | 31: iteration 23820/ 173500 | consumed samples: 6097920 | consumed tokens: 12488540160 | elapsed time per iteration (s): 0.85 | learning rate: 1.928E-04 | global batch size: 256 | lm loss: 2.166180E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.669 | TFLOPs: 18.31 | 31: iteration 23830/ 173500 | consumed samples: 6100480 | consumed tokens: 12493783040 | elapsed time per iteration (s): 0.81 | learning rate: 1.928E-04 | global batch size: 256 | lm loss: 2.175177E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
314.401 | TFLOPs: 19.02 | 31: iteration 23840/ 173500 | consumed samples: 6103040 | consumed tokens: 12499025920 | elapsed time per iteration (s): 0.80 | learning rate: 1.927E-04 | global batch size: 256 | lm loss: 2.145592E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.154 | TFLOPs: 19.37 | 31: iteration 23850/ 173500 | consumed samples: 6105600 | consumed tokens: 12504268800 | elapsed time per iteration (s): 0.77 | learning rate: 1.927E-04 | global batch size: 256 | lm loss: 2.180276E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.137 | TFLOPs: 20.09 | 31: iteration 23860/ 173500 | consumed samples: 6108160 | consumed tokens: 12509511680 | elapsed time per iteration (s): 0.75 | learning rate: 1.927E-04 | global batch size: 256 | lm loss: 2.353834E+00 | grad norm: 38.252 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.070 | TFLOPs: 20.75 | 31: iteration 23870/ 173500 | consumed samples: 6110720 | consumed tokens: 12514754560 | elapsed time per iteration (s): 0.75 | learning rate: 1.927E-04 | global batch size: 256 | lm loss: 2.841932E+00 | grad norm: 21.422 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.108 | TFLOPs: 20.58 | 31: iteration 23880/ 173500 | consumed samples: 6113280 | consumed tokens: 12519997440 | elapsed time per iteration (s): 0.79 | learning rate: 1.927E-04 | global batch size: 256 | lm loss: 2.825562E+00 | grad norm: 1.104 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.884 | TFLOPs: 19.53 | 31: iteration 23890/ 173500 | consumed samples: 6115840 | consumed tokens: 12525240320 | elapsed time per iteration (s): 0.76 | learning rate: 1.927E-04 | global batch size: 256 | lm loss: 2.316967E+00 | grad norm: 0.290 
| num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.538 | TFLOPs: 20.36 | 31: iteration 23900/ 173500 | consumed samples: 6118400 | consumed tokens: 12530483200 | elapsed time per iteration (s): 0.76 | learning rate: 1.927E-04 | global batch size: 256 | lm loss: 2.277719E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.973 | TFLOPs: 20.33 | 31: iteration 23910/ 173500 | consumed samples: 6120960 | consumed tokens: 12535726080 | elapsed time per iteration (s): 0.83 | learning rate: 1.927E-04 | global batch size: 256 | lm loss: 2.210058E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.435 | TFLOPs: 18.66 | 31: iteration 23920/ 173500 | consumed samples: 6123520 | consumed tokens: 12540968960 | elapsed time per iteration (s): 0.76 | learning rate: 1.927E-04 | global batch size: 256 | lm loss: 2.186435E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.012 | TFLOPs: 20.39 | 31: iteration 23930/ 173500 | consumed samples: 6126080 | consumed tokens: 12546211840 | elapsed time per iteration (s): 0.82 | learning rate: 1.927E-04 | global batch size: 256 | lm loss: 2.213535E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.671 | TFLOPs: 18.98 | 31: iteration 23940/ 173500 | consumed samples: 6128640 | consumed tokens: 12551454720 | elapsed time per iteration (s): 0.79 | learning rate: 1.927E-04 | global batch size: 256 | lm loss: 2.189301E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.943 | TFLOPs: 19.72 | 31: iteration 23950/ 173500 | consumed samples: 6131200 | consumed tokens: 12556697600 | elapsed time per iteration (s): 
0.79 | learning rate: 1.927E-04 | global batch size: 256 | lm loss: 2.186217E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.380 | TFLOPs: 19.50 | 31: iteration 23960/ 173500 | consumed samples: 6133760 | consumed tokens: 12561940480 | elapsed time per iteration (s): 0.81 | learning rate: 1.927E-04 | global batch size: 256 | lm loss: 2.165108E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.740 | TFLOPs: 19.04 | 31: iteration 23970/ 173500 | consumed samples: 6136320 | consumed tokens: 12567183360 | elapsed time per iteration (s): 0.81 | learning rate: 1.927E-04 | global batch size: 256 | lm loss: 2.197189E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.088 | TFLOPs: 19.12 | 31: iteration 23980/ 173500 | consumed samples: 6138880 | consumed tokens: 12572426240 | elapsed time per iteration (s): 0.76 | learning rate: 1.927E-04 | global batch size: 256 | lm loss: 2.194922E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.857 | TFLOPs: 20.44 | 31: iteration 23990/ 173500 | consumed samples: 6141440 | consumed tokens: 12577669120 | elapsed time per iteration (s): 0.75 | learning rate: 1.926E-04 | global batch size: 256 | lm loss: 2.193819E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.829 | TFLOPs: 20.74 | 0: [2022-11-25 23:29:46,360] [INFO] [logging.py:68:log_dist] [Rank 0] step=24000, skipped=0, lr=[0.00019264004235759096, 0.00019264004235759096, 0.00019264004235759096], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 24000/ 173500 | consumed samples: 6144000 | consumed tokens: 12582912000 | elapsed time per iteration (s): 0.74 | learning rate: 1.926E-04 
| global batch size: 256 | lm loss: 2.180327E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.469 | TFLOPs: 21.02 | 0: steps: 24000 loss: 2.2596 iter time (s): 0.880 samples/sec: 290.746 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 24000 | lm loss value: 2.130871E+00 | lm loss PPL: 8.422198E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 24000 to checkpoints_1b1long 0: [2022-11-25 23:29:46,674] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step24000 is begin to save! 0: [2022-11-25 23:29:46,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_01-model_00-model_states.pt... 0: [2022-11-25 23:29:46,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_01-model_00-model_states.pt. 0: [2022-11-25 23:29:46,879] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_03-model_00-model_states.pt... 0: [2022-11-25 23:29:46,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_03-model_00-model_states.pt. 0: [2022-11-25 23:29:46,960] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_04-model_00-model_states.pt... 0: [2022-11-25 23:29:47,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_04-model_00-model_states.pt. 0: [2022-11-25 23:29:47,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_05-model_00-model_states.pt... 
0: [2022-11-25 23:29:47,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_05-model_00-model_states.pt. 0: [2022-11-25 23:29:47,113] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_06-model_00-model_states.pt... 0: [2022-11-25 23:29:47,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_06-model_00-model_states.pt. 0: [2022-11-25 23:29:47,189] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_07-model_00-model_states.pt... 0: [2022-11-25 23:29:47,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_07-model_00-model_states.pt. 0: [2022-11-25 23:29:47,262] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_08-model_00-model_states.pt... 0: [2022-11-25 23:29:47,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_08-model_00-model_states.pt. 0: [2022-11-25 23:29:47,336] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_09-model_00-model_states.pt... 0: [2022-11-25 23:29:47,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_09-model_00-model_states.pt. 0: [2022-11-25 23:29:47,409] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_10-model_00-model_states.pt... 0: [2022-11-25 23:29:47,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_10-model_00-model_states.pt. 0: [2022-11-25 23:29:47,484] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_11-model_00-model_states.pt... 
0: [2022-11-25 23:29:47,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_11-model_00-model_states.pt. 0: [2022-11-25 23:29:47,557] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_12-model_00-model_states.pt... 0: [2022-11-25 23:29:47,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_12-model_00-model_states.pt. 0: [2022-11-25 23:29:47,634] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_13-model_00-model_states.pt... 0: [2022-11-25 23:29:47,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_13-model_00-model_states.pt. 0: [2022-11-25 23:29:47,706] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_14-model_00-model_states.pt... 0: [2022-11-25 23:29:47,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_14-model_00-model_states.pt. 0: [2022-11-25 23:29:47,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_15-model_00-model_states.pt... 0: [2022-11-25 23:29:47,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_15-model_00-model_states.pt. 0: [2022-11-25 23:29:47,855] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_16-model_00-model_states.pt... 0: [2022-11-25 23:29:47,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_16-model_00-model_states.pt. 0: [2022-11-25 23:29:47,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_17-model_00-model_states.pt... 
0: [2022-11-25 23:29:48,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_17-model_00-model_states.pt. 0: [2022-11-25 23:29:48,003] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_18-model_00-model_states.pt... 0: [2022-11-25 23:29:48,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_18-model_00-model_states.pt. 0: [2022-11-25 23:29:48,078] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_19-model_00-model_states.pt... 0: [2022-11-25 23:29:48,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_19-model_00-model_states.pt. 0: [2022-11-25 23:29:48,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_20-model_00-model_states.pt... 0: [2022-11-25 23:29:48,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_20-model_00-model_states.pt. 0: [2022-11-25 23:29:48,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_21-model_00-model_states.pt... 0: [2022-11-25 23:29:48,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_21-model_00-model_states.pt. 0: [2022-11-25 23:29:48,299] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_22-model_00-model_states.pt... 0: [2022-11-25 23:29:48,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_22-model_00-model_states.pt. 0: [2022-11-25 23:29:48,373] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_23-model_00-model_states.pt... 
0: [2022-11-25 23:29:48,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_23-model_00-model_states.pt. 0: [2022-11-25 23:29:48,447] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_24-model_00-model_states.pt... 0: [2022-11-25 23:29:48,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_24-model_00-model_states.pt. 0: [2022-11-25 23:29:48,519] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_25-model_00-model_states.pt... 0: [2022-11-25 23:29:48,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_25-model_00-model_states.pt. 0: [2022-11-25 23:29:48,595] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_26-model_00-model_states.pt... 0: [2022-11-25 23:29:48,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_26-model_00-model_states.pt. 0: [2022-11-25 23:29:48,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_27-model_00-model_states.pt... 0: [2022-11-25 23:29:48,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_27-model_00-model_states.pt. 0: [2022-11-25 23:29:48,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_28-model_00-model_states.pt... 0: [2022-11-25 23:29:48,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_28-model_00-model_states.pt. 0: [2022-11-25 23:29:48,815] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/layer_30-model_00-model_states.pt... 
0: [2022-11-25 23:29:48,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/layer_30-model_00-model_states.pt. 0: [2022-11-25 23:29:48,817] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step24000/mp_rank_00_model_states.pt 0: [2022-11-25 23:29:48,817] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/mp_rank_00_model_states.pt... 0: [2022-11-25 23:29:48,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/mp_rank_00_model_states.pt. 0: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
7: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 
9: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
1: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 
13: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
15: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
11: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 
29: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 
17: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 
19: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
5: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
1: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
13: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 
25: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 
28: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 
17: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 
19: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
7: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
13: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 
24: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 
22: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 
5: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 
14: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 
24: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:29:48,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:29:48,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:29:48,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-25 23:29:48,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 18: [2022-11-25 23:29:48,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
18: [2022-11-25 23:29:48,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-25 23:29:48,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 15: [2022-11-25 23:29:48,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:29:48,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 23:29:48,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 6: [2022-11-25 23:29:48,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:29:48,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 23:29:48,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 22: [2022-11-25 23:29:48,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:29:48,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 12: [2022-11-25 23:29:48,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:29:48,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
12: [2022-11-25 23:29:48,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 23:29:48,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 19: [2022-11-25 23:29:48,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:29:48,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-25 23:29:48,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 23: [2022-11-25 23:29:48,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:29:48,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-25 23:29:48,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 21: [2022-11-25 23:29:48,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:29:48,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-25 23:29:48,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 19: [2022-11-25 23:29:48,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 
19: [2022-11-25 23:29:48,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-25 23:29:48,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 17: [2022-11-25 23:29:48,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:29:48,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-25 23:29:48,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 26: [2022-11-25 23:29:48,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:29:48,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-25 23:29:48,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 16: [2022-11-25 23:29:48,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:29:48,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 9: [2022-11-25 23:29:48,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:29:48,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
9: [2022-11-25 23:29:48,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 27: [2022-11-25 23:29:48,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:29:48,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 27: [2022-11-25 23:29:48,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-25 23:29:48,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 0: [2022-11-25 23:29:48,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:29:48,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 23:29:48,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 29: [2022-11-25 23:29:48,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:29:48,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
29: [2022-11-25 23:29:48,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 16: [2022-11-25 23:29:48,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 29: [2022-11-25 23:29:48,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 16: [2022-11-25 23:29:48,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 10: [2022-11-25 23:29:48,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:29:48,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:29:48,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 23:29:48,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 10: [2022-11-25 23:29:48,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 23:29:48,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 3: [2022-11-25 23:29:48,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-25 23:29:48,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 23:29:48,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 10: [2022-11-25 23:29:48,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:29:48,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 23:29:48,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 18: [2022-11-25 23:29:48,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:29:48,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-25 23:29:48,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 17: [2022-11-25 23:29:48,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:29:48,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 3: [2022-11-25 23:29:48,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:29:48,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
3: [2022-11-25 23:29:48,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 25: [2022-11-25 23:29:48,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:29:48,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 26: [2022-11-25 23:29:48,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:29:48,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 26: [2022-11-25 23:29:48,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-25 23:29:48,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 8: [2022-11-25 23:29:48,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:29:48,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 25: [2022-11-25 23:29:48,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 8: [2022-11-25 23:29:48,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 12: [2022-11-25 23:29:48,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-25 23:29:48,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 23:29:48,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 1: [2022-11-25 23:29:48,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:29:48,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 23:29:48,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 28: [2022-11-25 23:29:48,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:29:48,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:29:48,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-25 23:29:48,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-25 23:29:48,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 28: [2022-11-25 23:29:48,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 11: [2022-11-25 23:29:48,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-25 23:29:48,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 23:29:48,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 25: [2022-11-25 23:29:48,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:29:48,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-25 23:29:48,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 29: [2022-11-25 23:29:48,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:29:48,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-25 23:29:48,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 5: [2022-11-25 23:29:48,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:29:48,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 23:29:48,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 21: [2022-11-25 23:29:48,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
21: [2022-11-25 23:29:48,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-25 23:29:48,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 27: [2022-11-25 23:29:48,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:29:48,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-25 23:29:48,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 6: [2022-11-25 23:29:48,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:29:48,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:29:48,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 13: [2022-11-25 23:29:48,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:29:48,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
22: [2022-11-25 23:29:48,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 0: [2022-11-25 23:29:48,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:29:48,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 13: [2022-11-25 23:29:48,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 23: [2022-11-25 23:29:48,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:29:48,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 13: [2022-11-25 23:29:48,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 23: [2022-11-25 23:29:48,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 22: [2022-11-25 23:29:48,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:29:48,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:29:48,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
23: [2022-11-25 23:29:48,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:29:48,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 10: [2022-11-25 23:29:48,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 13: [2022-11-25 23:29:48,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 23: [2022-11-25 23:29:48,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 22: [2022-11-25 23:29:48,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 10: [2022-11-25 23:29:48,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 23: [2022-11-25 23:29:48,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-25 23:29:48,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 5: [2022-11-25 23:29:48,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:29:48,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 23:29:48,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
25: [2022-11-25 23:29:48,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:29:48,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 1: [2022-11-25 23:29:48,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:29:48,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 1: [2022-11-25 23:29:48,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 23:29:48,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 17: [2022-11-25 23:29:48,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:29:48,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:29:48,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 12: [2022-11-25 23:29:48,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:29:48,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
18: [2022-11-25 23:29:48,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 12: [2022-11-25 23:29:48,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 18: [2022-11-25 23:29:48,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 12: [2022-11-25 23:29:48,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 15: [2022-11-25 23:29:48,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:29:48,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:29:48,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 23:29:48,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 23:29:48,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 15: [2022-11-25 23:29:48,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 8: [2022-11-25 23:29:48,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-25 23:29:48,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 23:29:48,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 13: [2022-11-25 23:29:48,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:29:48,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:29:48,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 9: [2022-11-25 23:29:48,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 23:29:48,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 13: [2022-11-25 23:29:48,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 3: [2022-11-25 23:29:48,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:29:48,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 23:29:48,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 4: [2022-11-25 23:29:48,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-25 23:29:48,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 23:29:48,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 8: [2022-11-25 23:29:48,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:29:48,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 23:29:48,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 29: [2022-11-25 23:29:48,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:29:48,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 16: [2022-11-25 23:29:48,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:29:48,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 3: [2022-11-25 23:29:48,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:29:48,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 16: [2022-11-25 23:29:48,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
3: [2022-11-25 23:29:48,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 23:29:48,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 0: [2022-11-25 23:29:48,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:29:48,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:29:48,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-25 23:29:48,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 0: [2022-11-25 23:29:48,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 23:29:48,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 10: [2022-11-25 23:29:48,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:29:48,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 23:29:48,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 0: [2022-11-25 23:29:48,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-25 23:29:48,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 14: [2022-11-25 23:29:48,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:29:48,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:29:48,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 14: [2022-11-25 23:29:48,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 23:29:48,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 23:29:48,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 14: [2022-11-25 23:29:48,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 31: [2022-11-25 23:29:48,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:29:48,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-25 23:29:48,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:29:48,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
31: [2022-11-25 23:29:48,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 11: [2022-11-25 23:29:48,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:29:48,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 11: [2022-11-25 23:29:48,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 23:29:48,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 5: [2022-11-25 23:29:48,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:29:48,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 23:29:48,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 16: [2022-11-25 23:29:48,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:29:48,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-25 23:29:48,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 27: [2022-11-25 23:29:48,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
21: [2022-11-25 23:29:48,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:29:48,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 27: [2022-11-25 23:29:48,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 28: [2022-11-25 23:29:48,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:29:48,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 27: [2022-11-25 23:29:48,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 7: [2022-11-25 23:29:48,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:29:48,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 23:29:48,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 1: [2022-11-25 23:29:48,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:29:48,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 23:29:48,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
28: [2022-11-25 23:29:48,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-25 23:29:48,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 12: [2022-11-25 23:29:48,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:29:48,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 23:29:48,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 9: [2022-11-25 23:29:48,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:29:48,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 23:29:48,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 7: [2022-11-25 23:29:48,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:29:48,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-25 23:29:48,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 23:29:48,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 23:29:48,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 7: [2022-11-25 23:29:48,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 19: [2022-11-25 23:29:48,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:29:48,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:29:48,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 29: [2022-11-25 23:29:48,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 19: [2022-11-25 23:29:48,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 29: [2022-11-25 23:29:48,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 18: [2022-11-25 23:29:48,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:29:48,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
18: [2022-11-25 23:29:48,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 6: [2022-11-25 23:29:48,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 18: [2022-11-25 23:29:48,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 6: [2022-11-25 23:29:48,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 11: [2022-11-25 23:29:48,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:29:48,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 23:29:48,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 19: [2022-11-25 23:29:48,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:29:48,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-25 23:29:48,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 25: [2022-11-25 23:29:48,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 
25: [2022-11-25 23:29:48,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 23: [2022-11-25 23:29:48,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:29:48,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 25: [2022-11-25 23:29:48,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 23: [2022-11-25 23:29:48,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 31: [2022-11-25 23:29:48,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:29:48,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:29:48,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 15: [2022-11-25 23:29:48,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
31: [2022-11-25 23:29:48,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 15: [2022-11-25 23:29:48,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 31: [2022-11-25 23:29:48,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 22: [2022-11-25 23:29:48,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 15: [2022-11-25 23:29:48,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 8: [2022-11-25 23:29:48,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:29:48,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 23:29:48,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 17: [2022-11-25 23:29:48,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:29:48,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-25 23:29:48,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
0: [2022-11-25 23:29:48,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 23:29:48,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 14: [2022-11-25 23:29:48,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:29:48,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:29:48,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 9: [2022-11-25 23:29:48,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 23:29:48,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 14: [2022-11-25 23:29:48,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 30: [2022-11-25 23:29:48,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:29:48,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:29:48,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-25 23:29:48,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-25 23:29:48,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-25 23:29:48,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-25 23:29:48,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 30: [2022-11-25 23:29:48,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 30: [2022-11-25 23:29:48,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 21: [2022-11-25 23:29:48,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:29:48,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-25 23:29:48,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 26: [2022-11-25 23:29:48,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:29:48,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-25 23:29:48,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
2: [2022-11-25 23:29:48,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:29:48,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:29:48,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:29:48,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 23:29:48,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 23:29:48,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 23:29:48,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 2: [2022-11-25 23:29:48,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 2: [2022-11-25 23:29:48,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 13: [2022-11-25 23:29:48,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:29:48,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 23:29:48,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
27: [2022-11-25 23:29:48,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:29:48,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-25 23:29:48,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 28: [2022-11-25 23:29:48,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:29:48,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-25 23:29:48,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 16: [2022-11-25 23:29:48,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:29:48,982] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-25 23:29:48,982] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 15: [2022-11-25 23:29:48,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:29:48,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 23:29:48,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
6: [2022-11-25 23:29:48,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:29:48,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 23:29:48,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 20: [2022-11-25 23:29:48,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:29:48,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:29:48,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:29:48,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:29:48,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-25 23:29:48,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-25 23:29:48,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
20: [2022-11-25 23:29:48,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-25 23:29:48,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 20: [2022-11-25 23:29:48,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-25 23:29:48,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 20: [2022-11-25 23:29:48,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 5: [2022-11-25 23:29:48,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:29:48,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 23:29:48,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 24: [2022-11-25 23:29:48,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:29:48,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:29:48,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:29:48,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
24: [2022-11-25 23:29:48,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-25 23:29:48,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-25 23:29:48,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 26: [2022-11-25 23:29:48,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:29:48,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 24: [2022-11-25 23:29:48,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 24: [2022-11-25 23:29:48,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 24: [2022-11-25 23:29:48,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-25 23:29:48,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 26: [2022-11-25 23:29:48,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-25 23:29:48,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 4: [2022-11-25 23:29:48,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-25 23:29:48,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 23:29:48,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 26: [2022-11-25 23:29:48,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:29:48,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-25 23:29:48,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 11: [2022-11-25 23:29:48,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:29:48,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 23:29:48,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 5: [2022-11-25 23:29:48,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:29:48,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 23:29:48,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 25: [2022-11-25 23:29:49,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
25: [2022-11-25 23:29:49,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-25 23:29:49,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 0: [2022-11-25 23:29:49,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:29:49,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 23:29:49,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 1: [2022-11-25 23:29:49,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:29:49,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 23:29:49,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 20: [2022-11-25 23:29:49,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:29:49,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-25 23:29:49,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 4: [2022-11-25 23:29:49,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-25 23:29:49,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 23:29:49,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 1: [2022-11-25 23:29:49,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:29:49,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:29:49,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 1: [2022-11-25 23:29:49,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 13: [2022-11-25 23:29:49,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 1: [2022-11-25 23:29:49,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 23: [2022-11-25 23:29:49,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:29:49,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-25 23:29:49,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 3: [2022-11-25 23:29:49,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-25 23:29:49,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 23:29:49,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 10: [2022-11-25 23:29:49,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:29:49,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 23:29:49,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 19: [2022-11-25 23:29:49,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:29:49,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-25 23:29:49,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 8: [2022-11-25 23:29:49,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:29:49,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 23:29:49,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 4: [2022-11-25 23:29:49,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-25 23:29:49,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 23:29:49,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 6: [2022-11-25 23:29:49,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:29:49,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 23:29:49,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 18: [2022-11-25 23:29:49,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:29:49,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-25 23:29:49,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 27: [2022-11-25 23:29:49,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:29:49,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
27: [2022-11-25 23:29:49,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 7: [2022-11-25 23:29:49,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 27: [2022-11-25 23:29:49,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 7: [2022-11-25 23:29:49,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 9: [2022-11-25 23:29:49,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:29:49,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 23:29:49,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 17: [2022-11-25 23:29:49,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:29:49,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-25 23:29:49,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 29: [2022-11-25 23:29:49,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-25 23:29:49,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-25 23:29:49,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 21: [2022-11-25 23:29:49,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:29:49,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:29:49,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 14: [2022-11-25 23:29:49,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 21: [2022-11-25 23:29:49,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 14: [2022-11-25 23:29:49,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 31: [2022-11-25 23:29:49,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:29:49,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-25 23:29:49,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 16: [2022-11-25 23:29:49,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
16: [2022-11-25 23:29:49,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-25 23:29:49,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 24: [2022-11-25 23:29:49,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:29:49,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-25 23:29:49,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 2: [2022-11-25 23:29:49,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:29:49,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:29:49,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 23:29:49,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 26: [2022-11-25 23:29:49,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-25 23:29:49,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 28: [2022-11-25 23:29:49,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-25 23:29:49,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 22: [2022-11-25 23:29:49,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:29:49,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 22: [2022-11-25 23:29:49,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-25 23:29:49,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 15: [2022-11-25 23:29:49,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:29:49,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 23:29:49,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 4: [2022-11-25 23:29:49,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:29:49,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 23:29:49,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:29:49,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
4: [2022-11-25 23:29:49,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 23:29:49,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 30: [2022-11-25 23:29:49,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:29:49,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-25 23:29:49,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 0: [2022-11-25 23:29:49,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:29:49,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 23:29:49,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 11: [2022-11-25 23:29:49,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:29:49,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 23:29:49,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 25: [2022-11-25 23:29:49,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
25: [2022-11-25 23:29:49,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-25 23:29:49,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 5: [2022-11-25 23:29:49,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:29:49,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:29:49,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 19: [2022-11-25 23:29:49,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-25 23:29:49,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 5: [2022-11-25 23:29:49,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 8: [2022-11-25 23:29:49,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:29:49,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 23:29:49,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 3: [2022-11-25 23:29:49,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-25 23:29:49,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 23:29:49,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 1: [2022-11-25 23:29:49,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:29:49,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 23:29:49,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 13: [2022-11-25 23:29:49,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:29:49,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 23:29:49,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 20: [2022-11-25 23:29:49,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:29:49,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-25 23:29:49,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 10: [2022-11-25 23:29:49,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-25 23:29:49,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 23:29:49,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 12: [2022-11-25 23:29:49,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:29:49,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:29:49,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 18: [2022-11-25 23:29:49,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 12: [2022-11-25 23:29:49,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 18: [2022-11-25 23:29:49,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 23: [2022-11-25 23:29:49,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:29:49,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
23: [2022-11-25 23:29:49,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 27: [2022-11-25 23:29:49,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:29:49,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 21: [2022-11-25 23:29:49,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-25 23:29:49,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 27: [2022-11-25 23:29:49,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-25 23:29:49,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 17: [2022-11-25 23:29:49,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:29:49,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-25 23:29:49,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 9: [2022-11-25 23:29:49,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-25 23:29:49,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 23:29:49,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 6: [2022-11-25 23:29:49,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:29:49,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 23:29:49,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 15: [2022-11-25 23:29:49,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:29:49,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 23:29:49,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 7: [2022-11-25 23:29:49,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:29:49,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 23:29:49,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 31: [2022-11-25 23:29:49,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 
31: [2022-11-25 23:29:49,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-25 23:29:49,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 14: [2022-11-25 23:29:49,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:29:49,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 23:29:49,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 12: [2022-11-25 23:29:49,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:29:49,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:29:49,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 22: [2022-11-25 23:29:49,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-25 23:29:49,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 12: [2022-11-25 23:29:49,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 28: [2022-11-25 23:29:49,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
28: [2022-11-25 23:29:49,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-25 23:29:49,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 29: [2022-11-25 23:29:49,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:29:49,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-25 23:29:49,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 24: [2022-11-25 23:29:49,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:29:49,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-25 23:29:49,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 16: [2022-11-25 23:29:49,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:29:49,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-25 23:29:49,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 26: [2022-11-25 23:29:49,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-25 23:29:49,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-25 23:29:49,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 2: [2022-11-25 23:29:49,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:29:49,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 23:29:49,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 11: [2022-11-25 23:29:49,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:29:49,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 23:29:49,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 25: [2022-11-25 23:29:49,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:29:49,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-25 23:29:49,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 5: [2022-11-25 23:29:49,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-25 23:29:49,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 23:29:49,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 0: [2022-11-25 23:29:49,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:29:49,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 23:29:49,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 30: [2022-11-25 23:29:49,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:29:49,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-25 23:29:49,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 19: [2022-11-25 23:29:49,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:29:49,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-25 23:29:49,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 20: [2022-11-25 23:29:49,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
20: [2022-11-25 23:29:49,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-25 23:29:49,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 3: [2022-11-25 23:29:49,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:29:49,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 23:29:49,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 13: [2022-11-25 23:29:49,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:29:49,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 23:29:49,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 8: [2022-11-25 23:29:49,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:29:49,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 23:29:49,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 1: [2022-11-25 23:29:49,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-25 23:29:49,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 23:29:49,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 6: [2022-11-25 23:29:49,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:29:49,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 23:29:49,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 12: [2022-11-25 23:29:49,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:29:49,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 23:29:49,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 21: [2022-11-25 23:29:49,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:29:49,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-25 23:29:49,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 17: [2022-11-25 23:29:49,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 
17: [2022-11-25 23:29:49,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-25 23:29:49,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 23: [2022-11-25 23:29:49,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:29:49,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-25 23:29:49,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 10: [2022-11-25 23:29:49,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:29:49,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 23:29:49,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 9: [2022-11-25 23:29:49,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:29:49,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 23:29:49,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 4: [2022-11-25 23:29:49,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-25 23:29:49,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 23:29:49,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 18: [2022-11-25 23:29:49,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:29:49,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:29:49,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 16: [2022-11-25 23:29:49,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:29:49,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 31: [2022-11-25 23:29:49,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 16: [2022-11-25 23:29:49,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 31: [2022-11-25 23:29:49,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 16: [2022-11-25 23:29:49,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 29: [2022-11-25 23:29:49,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
27: [2022-11-25 23:29:49,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:29:49,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-25 23:29:49,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 22: [2022-11-25 23:29:49,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:29:49,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-25 23:29:49,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 22: [2022-11-25 23:29:49,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-25 23:29:49,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 26: [2022-11-25 23:29:49,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:29:49,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-25 23:29:49,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 12: [2022-11-25 23:29:49,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-25 23:29:49,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 23:29:49,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 25: [2022-11-25 23:29:49,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:29:49,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:29:49,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:29:49,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-25 23:29:49,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 0: [2022-11-25 23:29:49,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 1: [2022-11-25 23:29:49,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 0: [2022-11-25 23:29:49,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 1: [2022-11-25 23:29:49,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 13: [2022-11-25 23:29:49,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-25 23:29:49,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 23:29:49,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 6: [2022-11-25 23:29:49,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:29:49,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:29:49,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:29:49,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 15: [2022-11-25 23:29:49,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 6: [2022-11-25 23:29:49,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 15: [2022-11-25 23:29:49,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 5: [2022-11-25 23:29:49,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 23:29:49,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 7: [2022-11-25 23:29:49,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-25 23:29:49,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 14: [2022-11-25 23:29:49,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:29:49,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 7: [2022-11-25 23:29:49,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 14: [2022-11-25 23:29:49,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 11: [2022-11-25 23:29:49,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:29:49,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 23:29:49,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 28: [2022-11-25 23:29:49,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:29:49,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-25 23:29:49,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 10: [2022-11-25 23:29:49,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-25 23:29:49,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 23:29:49,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 9: [2022-11-25 23:29:49,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:29:49,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:29:49,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 23:29:49,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 24: [2022-11-25 23:29:49,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-25 23:29:49,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 23: [2022-11-25 23:29:49,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:29:49,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:29:49,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-25 23:29:49,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
27: [2022-11-25 23:29:49,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-25 23:29:49,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 3: [2022-11-25 23:29:49,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:29:49,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 23:29:49,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 8: [2022-11-25 23:29:49,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:29:49,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 23:29:49,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 2: [2022-11-25 23:29:49,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:29:49,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 23:29:49,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 17: [2022-11-25 23:29:49,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 
17: [2022-11-25 23:29:49,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-25 23:29:49,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 4: [2022-11-25 23:29:49,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:29:49,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:29:49,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:29:49,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 31: [2022-11-25 23:29:49,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 30: [2022-11-25 23:29:49,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 4: [2022-11-25 23:29:49,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 23:29:49,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 31: [2022-11-25 23:29:49,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 19: [2022-11-25 23:29:49,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
18: [2022-11-25 23:29:49,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:29:49,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 18: [2022-11-25 23:29:49,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 19: [2022-11-25 23:29:49,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 14: [2022-11-25 23:29:49,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:29:49,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 14: [2022-11-25 23:29:49,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 21: [2022-11-25 23:29:49,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:29:49,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 21: [2022-11-25 23:29:49,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-25 23:29:49,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 20: [2022-11-25 23:29:49,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 
20: [2022-11-25 23:29:49,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-25 23:29:49,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 24: [2022-11-25 23:29:49,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:29:49,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 30: [2022-11-25 23:29:49,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:29:49,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 30: [2022-11-25 23:29:49,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-25 23:29:49,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 22: [2022-11-25 23:29:49,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:29:49,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-25 23:29:49,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 7: [2022-11-25 23:29:49,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-25 23:29:49,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 23:29:49,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 14: [2022-11-25 23:29:49,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:29:49,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:29:49,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 23:29:49,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 29: [2022-11-25 23:29:49,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-25 23:29:49,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 7: [2022-11-25 23:29:49,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:29:49,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 23:29:49,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 30: [2022-11-25 23:29:49,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
30: [2022-11-25 23:29:49,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-25 23:29:49,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 2: [2022-11-25 23:29:49,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:29:49,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 23:29:49,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 2: [2022-11-25 23:29:49,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:29:49,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step24000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 23:29:49,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
0: successfully saved checkpoint at iteration 24000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2455.56 31: iteration 24010/ 173500 | consumed samples: 6146560 | consumed tokens: 12588154880 | elapsed time per iteration (s): 1.01 | learning rate: 1.926E-04 | global batch size: 256 | lm loss: 2.165317E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 254.522 | TFLOPs: 15.40 | 31: iteration 24020/ 173500 | consumed samples: 6149120 | consumed tokens: 12593397760 | elapsed time per iteration (s): 0.75 | learning rate: 1.926E-04 | global batch size: 256 | lm loss: 2.180338E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.419 | TFLOPs: 20.65 | 31: iteration 24030/ 173500 | consumed samples: 6151680 | consumed tokens: 12598640640 | elapsed time per iteration (s): 0.77 | learning rate: 1.926E-04 | global batch size: 256 | lm loss: 2.198118E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.216 | TFLOPs: 20.16 | 31: iteration 24040/ 173500 | consumed samples: 6154240 | consumed tokens: 12603883520 | elapsed time per iteration (s): 0.75 | learning rate: 1.926E-04 | global batch size: 256 | lm loss: 2.175836E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.846 | TFLOPs: 20.74 | 31: iteration 24050/ 173500 | consumed samples: 6156800 | consumed tokens: 12609126400 | elapsed time per iteration (s): 0.75 | learning rate: 1.926E-04 | global batch size: 256 | lm loss: 2.190724E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.558 | TFLOPs: 20.54 | 31: iteration 24060/ 173500 | consumed samples: 6159360 | consumed tokens: 12614369280 | elapsed time per iteration (s): 0.79 | 
learning rate: 1.926E-04 | global batch size: 256 | lm loss: 2.187310E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.008 | TFLOPs: 19.54 | 31: iteration 24070/ 173500 | consumed samples: 6161920 | consumed tokens: 12619612160 | elapsed time per iteration (s): 0.76 | learning rate: 1.926E-04 | global batch size: 256 | lm loss: 2.169661E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.552 | TFLOPs: 20.48 | 31: iteration 24080/ 173500 | consumed samples: 6164480 | consumed tokens: 12624855040 | elapsed time per iteration (s): 0.84 | learning rate: 1.926E-04 | global batch size: 256 | lm loss: 2.195560E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.431 | TFLOPs: 18.42 | 31: iteration 24090/ 173500 | consumed samples: 6167040 | consumed tokens: 12630097920 | elapsed time per iteration (s): 0.78 | learning rate: 1.926E-04 | global batch size: 256 | lm loss: 2.172507E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.156 | TFLOPs: 19.79 | 31: iteration 24100/ 173500 | consumed samples: 6169600 | consumed tokens: 12635340800 | elapsed time per iteration (s): 0.81 | learning rate: 1.926E-04 | global batch size: 256 | lm loss: 2.151230E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.277 | TFLOPs: 19.19 | 31: iteration 24110/ 173500 | consumed samples: 6172160 | consumed tokens: 12640583680 | elapsed time per iteration (s): 0.73 | learning rate: 1.926E-04 | global batch size: 256 | lm loss: 2.188759E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.510 | TFLOPs: 21.27 | 31: iteration 24120/ 173500 
| consumed samples: 6174720 | consumed tokens: 12645826560 | elapsed time per iteration (s): 0.81 | learning rate: 1.926E-04 | global batch size: 256 | lm loss: 2.187878E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.090 | TFLOPs: 19.06 | 31: iteration 24130/ 173500 | consumed samples: 6177280 | consumed tokens: 12651069440 | elapsed time per iteration (s): 0.76 | learning rate: 1.926E-04 | global batch size: 256 | lm loss: 2.183894E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.945 | TFLOPs: 20.38 | 31: iteration 24140/ 173500 | consumed samples: 6179840 | consumed tokens: 12656312320 | elapsed time per iteration (s): 0.75 | learning rate: 1.925E-04 | global batch size: 256 | lm loss: 2.151585E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.141 | TFLOPs: 20.58 | 31: iteration 24150/ 173500 | consumed samples: 6182400 | consumed tokens: 12661555200 | elapsed time per iteration (s): 0.73 | learning rate: 1.925E-04 | global batch size: 256 | lm loss: 2.179319E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.481 | TFLOPs: 21.08 | 31: iteration 24160/ 173500 | consumed samples: 6184960 | consumed tokens: 12666798080 | elapsed time per iteration (s): 0.79 | learning rate: 1.925E-04 | global batch size: 256 | lm loss: 2.186493E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.520 | TFLOPs: 19.69 | 31: iteration 24170/ 173500 | consumed samples: 6187520 | consumed tokens: 12672040960 | elapsed time per iteration (s): 0.74 | learning rate: 1.925E-04 | global batch size: 256 | lm loss: 2.174192E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 344.733 | TFLOPs: 20.86 | 31: iteration 24180/ 173500 | consumed samples: 6190080 | consumed tokens: 12677283840 | elapsed time per iteration (s): 0.74 | learning rate: 1.925E-04 | global batch size: 256 | lm loss: 2.183192E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.856 | TFLOPs: 20.86 | 31: iteration 24190/ 173500 | consumed samples: 6192640 | consumed tokens: 12682526720 | elapsed time per iteration (s): 0.78 | learning rate: 1.925E-04 | global batch size: 256 | lm loss: 2.171735E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.580 | TFLOPs: 19.76 | 31: iteration 24200/ 173500 | consumed samples: 6195200 | consumed tokens: 12687769600 | elapsed time per iteration (s): 0.76 | learning rate: 1.925E-04 | global batch size: 256 | lm loss: 2.191684E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.902 | TFLOPs: 20.38 | 31: iteration 24210/ 173500 | consumed samples: 6197760 | consumed tokens: 12693012480 | elapsed time per iteration (s): 0.73 | learning rate: 1.925E-04 | global batch size: 256 | lm loss: 2.174952E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.960 | TFLOPs: 21.11 | 31: iteration 24220/ 173500 | consumed samples: 6200320 | consumed tokens: 12698255360 | elapsed time per iteration (s): 0.76 | learning rate: 1.925E-04 | global batch size: 256 | lm loss: 2.149874E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.123 | TFLOPs: 20.40 | 31: iteration 24230/ 173500 | consumed samples: 6202880 | consumed tokens: 12703498240 | elapsed time per iteration (s): 0.75 | learning rate: 1.925E-04 | global batch size: 
256 | lm loss: 2.165834E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.199 | TFLOPs: 20.70 | 31: iteration 24240/ 173500 | consumed samples: 6205440 | consumed tokens: 12708741120 | elapsed time per iteration (s): 0.78 | learning rate: 1.925E-04 | global batch size: 256 | lm loss: 2.176035E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.720 | TFLOPs: 19.83 | 31: iteration 24250/ 173500 | consumed samples: 6208000 | consumed tokens: 12713984000 | elapsed time per iteration (s): 0.74 | learning rate: 1.925E-04 | global batch size: 256 | lm loss: 2.181609E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.739 | TFLOPs: 21.04 | 31: iteration 24260/ 173500 | consumed samples: 6210560 | consumed tokens: 12719226880 | elapsed time per iteration (s): 0.77 | learning rate: 1.925E-04 | global batch size: 256 | lm loss: 2.171785E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.059 | TFLOPs: 20.09 | 31: iteration 24270/ 173500 | consumed samples: 6213120 | consumed tokens: 12724469760 | elapsed time per iteration (s): 0.74 | learning rate: 1.925E-04 | global batch size: 256 | lm loss: 2.191857E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.424 | TFLOPs: 21.02 | 31: iteration 24280/ 173500 | consumed samples: 6215680 | consumed tokens: 12729712640 | elapsed time per iteration (s): 0.81 | learning rate: 1.925E-04 | global batch size: 256 | lm loss: 2.179200E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.858 | TFLOPs: 19.05 | 31: iteration 24290/ 173500 | consumed samples: 6218240 | consumed 
tokens: 12734955520 | elapsed time per iteration (s): 0.78 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.144447E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.789 | TFLOPs: 19.89 | 31: iteration 24300/ 173500 | consumed samples: 6220800 | consumed tokens: 12740198400 | elapsed time per iteration (s): 0.82 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.165992E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.469 | TFLOPs: 18.90 | 31: iteration 24310/ 173500 | consumed samples: 6223360 | consumed tokens: 12745441280 | elapsed time per iteration (s): 0.80 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.161802E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.710 | TFLOPs: 19.34 | 31: iteration 24320/ 173500 | consumed samples: 6225920 | consumed tokens: 12750684160 | elapsed time per iteration (s): 0.78 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.173631E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.611 | TFLOPs: 19.82 | 31: iteration 24330/ 173500 | consumed samples: 6228480 | consumed tokens: 12755927040 | elapsed time per iteration (s): 0.75 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.171747E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.785 | TFLOPs: 20.62 | 31: iteration 24340/ 173500 | consumed samples: 6231040 | consumed tokens: 12761169920 | elapsed time per iteration (s): 0.75 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.181869E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 341.375 | TFLOPs: 20.65 | 31: iteration 24350/ 173500 | consumed samples: 6233600 | consumed tokens: 12766412800 | elapsed time per iteration (s): 0.77 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.177774E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.617 | TFLOPs: 20.24 | 31: iteration 24360/ 173500 | consumed samples: 6236160 | consumed tokens: 12771655680 | elapsed time per iteration (s): 0.76 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.168192E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.132 | TFLOPs: 20.27 | 31: iteration 24370/ 173500 | consumed samples: 6238720 | consumed tokens: 12776898560 | elapsed time per iteration (s): 0.73 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.179344E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.398 | TFLOPs: 21.08 | 31: iteration 24380/ 173500 | consumed samples: 6241280 | consumed tokens: 12782141440 | elapsed time per iteration (s): 0.81 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.143240E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.300 | TFLOPs: 19.07 | 31: iteration 24390/ 173500 | consumed samples: 6243840 | consumed tokens: 12787384320 | elapsed time per iteration (s): 0.74 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.149538E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.469 | TFLOPs: 20.84 | 31: iteration 24400/ 173500 | consumed samples: 6246400 | consumed tokens: 12792627200 | elapsed time per iteration (s): 0.77 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.187810E+00 | grad norm: 
0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.775 | TFLOPs: 20.07 | 31: iteration 24410/ 173500 | consumed samples: 6248960 | consumed tokens: 12797870080 | elapsed time per iteration (s): 0.77 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.178199E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.886 | TFLOPs: 20.08 | 31: iteration 24420/ 173500 | consumed samples: 6251520 | consumed tokens: 12803112960 | elapsed time per iteration (s): 0.86 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.188334E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.920 | TFLOPs: 18.02 | 31: iteration 24430/ 173500 | consumed samples: 6254080 | consumed tokens: 12808355840 | elapsed time per iteration (s): 0.83 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.188307E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.107 | TFLOPs: 18.76 | 31: iteration 24440/ 173500 | consumed samples: 6256640 | consumed tokens: 12813598720 | elapsed time per iteration (s): 0.83 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.176493E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.053 | TFLOPs: 18.64 | 31: iteration 24450/ 173500 | consumed samples: 6259200 | consumed tokens: 12818841600 | elapsed time per iteration (s): 0.81 | learning rate: 1.923E-04 | global batch size: 256 | lm loss: 2.156308E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.632 | TFLOPs: 19.03 | 31: iteration 24460/ 173500 | consumed samples: 6261760 | consumed tokens: 12824084480 | elapsed time per 
iteration (s): 0.83 | learning rate: 1.923E-04 | global batch size: 256 | lm loss: 2.194566E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.390 | TFLOPs: 18.60 | 31: iteration 24470/ 173500 | consumed samples: 6264320 | consumed tokens: 12829327360 | elapsed time per iteration (s): 0.82 | learning rate: 1.923E-04 | global batch size: 256 | lm loss: 2.175334E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.209 | TFLOPs: 18.83 | 31: iteration 24480/ 173500 | consumed samples: 6266880 | consumed tokens: 12834570240 | elapsed time per iteration (s): 0.84 | learning rate: 1.923E-04 | global batch size: 256 | lm loss: 2.191634E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.927 | TFLOPs: 18.39 | 31: iteration 24490/ 173500 | consumed samples: 6269440 | consumed tokens: 12839813120 | elapsed time per iteration (s): 0.85 | learning rate: 1.923E-04 | global batch size: 256 | lm loss: 2.181666E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.691 | TFLOPs: 18.31 | 31: iteration 24500/ 173500 | consumed samples: 6272000 | consumed tokens: 12845056000 | elapsed time per iteration (s): 0.82 | learning rate: 1.923E-04 | global batch size: 256 | lm loss: 2.185539E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.468 | TFLOPs: 18.78 | 31: iteration 24510/ 173500 | consumed samples: 6274560 | consumed tokens: 12850298880 | elapsed time per iteration (s): 0.82 | learning rate: 1.923E-04 | global batch size: 256 | lm loss: 2.196850E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.279 | TFLOPs: 18.89 | 31: 
iteration 24520/ 173500 | consumed samples: 6277120 | consumed tokens: 12855541760 | elapsed time per iteration (s): 0.80 | learning rate: 1.923E-04 | global batch size: 256 | lm loss: 2.213141E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.148 | TFLOPs: 19.43 | 31: iteration 24530/ 173500 | consumed samples: 6279680 | consumed tokens: 12860784640 | elapsed time per iteration (s): 0.81 | learning rate: 1.923E-04 | global batch size: 256 | lm loss: 2.176056E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.055 | TFLOPs: 19.12 | 31: iteration 24540/ 173500 | consumed samples: 6282240 | consumed tokens: 12866027520 | elapsed time per iteration (s): 0.82 | learning rate: 1.923E-04 | global batch size: 256 | lm loss: 2.196276E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.754 | TFLOPs: 18.86 | 31: iteration 24550/ 173500 | consumed samples: 6284800 | consumed tokens: 12871270400 | elapsed time per iteration (s): 0.80 | learning rate: 1.923E-04 | global batch size: 256 | lm loss: 2.176348E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.053 | TFLOPs: 19.24 | 31: iteration 24560/ 173500 | consumed samples: 6287360 | consumed tokens: 12876513280 | elapsed time per iteration (s): 0.84 | learning rate: 1.923E-04 | global batch size: 256 | lm loss: 2.183122E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.508 | TFLOPs: 18.42 | 31: iteration 24570/ 173500 | consumed samples: 6289920 | consumed tokens: 12881756160 | elapsed time per iteration (s): 0.80 | learning rate: 1.923E-04 | global batch size: 256 | lm loss: 2.204766E+00 | grad norm: 0.138 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.070 | TFLOPs: 19.30 | 31: iteration 24580/ 173500 | consumed samples: 6292480 | consumed tokens: 12886999040 | elapsed time per iteration (s): 1.05 | learning rate: 1.923E-04 | global batch size: 256 | lm loss: 2.195026E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.547 | TFLOPs: 14.79 | 31: iteration 24590/ 173500 | consumed samples: 6295040 | consumed tokens: 12892241920 | elapsed time per iteration (s): 0.81 | learning rate: 1.923E-04 | global batch size: 256 | lm loss: 2.131714E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.585 | TFLOPs: 19.03 | 31: iteration 24600/ 173500 | consumed samples: 6297600 | consumed tokens: 12897484800 | elapsed time per iteration (s): 0.82 | learning rate: 1.922E-04 | global batch size: 256 | lm loss: 2.170013E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.344 | TFLOPs: 18.84 | 31: iteration 24610/ 173500 | consumed samples: 6300160 | consumed tokens: 12902727680 | elapsed time per iteration (s): 0.80 | learning rate: 1.922E-04 | global batch size: 256 | lm loss: 2.177875E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.442 | TFLOPs: 19.26 | 31: iteration 24620/ 173500 | consumed samples: 6302720 | consumed tokens: 12907970560 | elapsed time per iteration (s): 0.81 | learning rate: 1.922E-04 | global batch size: 256 | lm loss: 2.209778E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.942 | TFLOPs: 19.17 | 31: iteration 24630/ 173500 | consumed samples: 6305280 | consumed tokens: 12913213440 | elapsed time per iteration (s): 0.76 | learning rate: 
1.922E-04 | global batch size: 256 | lm loss: 2.170953E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.054 | TFLOPs: 20.27 | 31: iteration 24640/ 173500 | consumed samples: 6307840 | consumed tokens: 12918456320 | elapsed time per iteration (s): 0.84 | learning rate: 1.922E-04 | global batch size: 256 | lm loss: 2.196508E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.154 | TFLOPs: 18.34 | 31: iteration 24650/ 173500 | consumed samples: 6310400 | consumed tokens: 12923699200 | elapsed time per iteration (s): 0.85 | learning rate: 1.922E-04 | global batch size: 256 | lm loss: 2.186346E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.959 | TFLOPs: 18.21 | 31: iteration 24660/ 173500 | consumed samples: 6312960 | consumed tokens: 12928942080 | elapsed time per iteration (s): 0.82 | learning rate: 1.922E-04 | global batch size: 256 | lm loss: 2.180795E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.409 | TFLOPs: 18.90 | 31: iteration 24670/ 173500 | consumed samples: 6315520 | consumed tokens: 12934184960 | elapsed time per iteration (s): 0.77 | learning rate: 1.922E-04 | global batch size: 256 | lm loss: 2.184216E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.260 | TFLOPs: 20.10 | 31: iteration 24680/ 173500 | consumed samples: 6318080 | consumed tokens: 12939427840 | elapsed time per iteration (s): 0.76 | learning rate: 1.922E-04 | global batch size: 256 | lm loss: 2.166856E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.061 | TFLOPs: 20.51 | 31: iteration 24690/ 173500 | consumed 
samples: 6320640 | consumed tokens: 12944670720 | elapsed time per iteration (s): 0.85 | learning rate: 1.922E-04 | global batch size: 256 | lm loss: 2.204192E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.627 | TFLOPs: 18.13 | 31: iteration 24700/ 173500 | consumed samples: 6323200 | consumed tokens: 12949913600 | elapsed time per iteration (s): 0.81 | learning rate: 1.922E-04 | global batch size: 256 | lm loss: 2.195083E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.413 | TFLOPs: 19.02 | 31: iteration 24710/ 173500 | consumed samples: 6325760 | consumed tokens: 12955156480 | elapsed time per iteration (s): 0.80 | learning rate: 1.922E-04 | global batch size: 256 | lm loss: 2.122887E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.580 | TFLOPs: 19.45 | 31: iteration 24720/ 173500 | consumed samples: 6328320 | consumed tokens: 12960399360 | elapsed time per iteration (s): 0.80 | learning rate: 1.922E-04 | global batch size: 256 | lm loss: 2.173385E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.604 | TFLOPs: 19.27 | 31: iteration 24730/ 173500 | consumed samples: 6330880 | consumed tokens: 12965642240 | elapsed time per iteration (s): 0.78 | learning rate: 1.922E-04 | global batch size: 256 | lm loss: 2.151880E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.408 | TFLOPs: 19.81 | 31: iteration 24740/ 173500 | consumed samples: 6333440 | consumed tokens: 12970885120 | elapsed time per iteration (s): 0.79 | learning rate: 1.922E-04 | global batch size: 256 | lm loss: 2.169157E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 323.409 | TFLOPs: 19.57 | 31: iteration 24750/ 173500 | consumed samples: 6336000 | consumed tokens: 12976128000 | elapsed time per iteration (s): 0.77 | learning rate: 1.921E-04 | global batch size: 256 | lm loss: 2.152196E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.189 | TFLOPs: 20.10 | 31: iteration 24760/ 173500 | consumed samples: 6338560 | consumed tokens: 12981370880 | elapsed time per iteration (s): 0.81 | learning rate: 1.921E-04 | global batch size: 256 | lm loss: 2.152612E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.876 | TFLOPs: 19.05 | 31: iteration 24770/ 173500 | consumed samples: 6341120 | consumed tokens: 12986613760 | elapsed time per iteration (s): 0.73 | learning rate: 1.921E-04 | global batch size: 256 | lm loss: 2.181011E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.037 | TFLOPs: 21.30 | 31: iteration 24780/ 173500 | consumed samples: 6343680 | consumed tokens: 12991856640 | elapsed time per iteration (s): 0.80 | learning rate: 1.921E-04 | global batch size: 256 | lm loss: 2.172121E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.297 | TFLOPs: 19.44 | 31: iteration 24790/ 173500 | consumed samples: 6346240 | consumed tokens: 12997099520 | elapsed time per iteration (s): 0.80 | learning rate: 1.921E-04 | global batch size: 256 | lm loss: 2.174600E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.110 | TFLOPs: 19.37 | 31: iteration 24800/ 173500 | consumed samples: 6348800 | consumed tokens: 13002342400 | elapsed time per iteration (s): 0.78 | learning rate: 1.921E-04 | global batch size: 256 | lm 
loss: 2.170595E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.478 | TFLOPs: 19.75 | 31: iteration 24810/ 173500 | consumed samples: 6351360 | consumed tokens: 13007585280 | elapsed time per iteration (s): 0.74 | learning rate: 1.921E-04 | global batch size: 256 | lm loss: 2.191780E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.763 | TFLOPs: 20.98 | 31: iteration 24820/ 173500 | consumed samples: 6353920 | consumed tokens: 13012828160 | elapsed time per iteration (s): 0.74 | learning rate: 1.921E-04 | global batch size: 256 | lm loss: 2.180196E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.654 | TFLOPs: 20.85 | 31: iteration 24830/ 173500 | consumed samples: 6356480 | consumed tokens: 13018071040 | elapsed time per iteration (s): 0.77 | learning rate: 1.921E-04 | global batch size: 256 | lm loss: 2.182093E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.825 | TFLOPs: 20.07 | 31: iteration 24840/ 173500 | consumed samples: 6359040 | consumed tokens: 13023313920 | elapsed time per iteration (s): 0.74 | learning rate: 1.921E-04 | global batch size: 256 | lm loss: 2.153554E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.384 | TFLOPs: 21.02 | 31: iteration 24850/ 173500 | consumed samples: 6361600 | consumed tokens: 13028556800 | elapsed time per iteration (s): 0.77 | learning rate: 1.921E-04 | global batch size: 256 | lm loss: 2.180782E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.656 | TFLOPs: 20.12 | 31: iteration 24860/ 173500 | consumed samples: 6364160 | consumed tokens: 
13033799680 | elapsed time per iteration (s): 0.82 | learning rate: 1.921E-04 | global batch size: 256 | lm loss: 2.129805E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.013 | TFLOPs: 18.94 | 31: iteration 24870/ 173500 | consumed samples: 6366720 | consumed tokens: 13039042560 | elapsed time per iteration (s): 0.76 | learning rate: 1.921E-04 | global batch size: 256 | lm loss: 2.171295E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.542 | TFLOPs: 20.36 | 31: iteration 24880/ 173500 | consumed samples: 6369280 | consumed tokens: 13044285440 | elapsed time per iteration (s): 0.82 | learning rate: 1.921E-04 | global batch size: 256 | lm loss: 2.158931E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.619 | TFLOPs: 18.97 | 31: iteration 24890/ 173500 | consumed samples: 6371840 | consumed tokens: 13049528320 | elapsed time per iteration (s): 0.77 | learning rate: 1.920E-04 | global batch size: 256 | lm loss: 2.170617E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.402 | TFLOPs: 19.99 | 31: iteration 24900/ 173500 | consumed samples: 6374400 | consumed tokens: 13054771200 | elapsed time per iteration (s): 0.83 | learning rate: 1.920E-04 | global batch size: 256 | lm loss: 2.147742E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.318 | TFLOPs: 18.71 | 31: iteration 24910/ 173500 | consumed samples: 6376960 | consumed tokens: 13060014080 | elapsed time per iteration (s): 0.82 | learning rate: 1.920E-04 | global batch size: 256 | lm loss: 2.148516E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
312.091 | TFLOPs: 18.88 | 31: iteration 24920/ 173500 | consumed samples: 6379520 | consumed tokens: 13065256960 | elapsed time per iteration (s): 0.78 | learning rate: 1.920E-04 | global batch size: 256 | lm loss: 2.165811E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.105 | TFLOPs: 19.85 | 31: iteration 24930/ 173500 | consumed samples: 6382080 | consumed tokens: 13070499840 | elapsed time per iteration (s): 0.78 | learning rate: 1.920E-04 | global batch size: 256 | lm loss: 2.166556E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.456 | TFLOPs: 19.81 | 31: iteration 24940/ 173500 | consumed samples: 6384640 | consumed tokens: 13075742720 | elapsed time per iteration (s): 0.75 | learning rate: 1.920E-04 | global batch size: 256 | lm loss: 2.158433E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.293 | TFLOPs: 20.71 | 31: iteration 24950/ 173500 | consumed samples: 6387200 | consumed tokens: 13080985600 | elapsed time per iteration (s): 0.81 | learning rate: 1.920E-04 | global batch size: 256 | lm loss: 2.168296E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.890 | TFLOPs: 19.11 | 31: iteration 24960/ 173500 | consumed samples: 6389760 | consumed tokens: 13086228480 | elapsed time per iteration (s): 0.78 | learning rate: 1.920E-04 | global batch size: 256 | lm loss: 2.156172E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.872 | TFLOPs: 19.96 | 31: iteration 24970/ 173500 | consumed samples: 6392320 | consumed tokens: 13091471360 | elapsed time per iteration (s): 0.74 | learning rate: 1.920E-04 | global batch size: 256 | lm loss: 2.149093E+00 | grad norm: 0.144 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.156 | TFLOPs: 21.06 |
31: iteration 24980/ 173500 | consumed samples: 6394880 | consumed tokens: 13096714240 | elapsed time per iteration (s): 0.76 | learning rate: 1.920E-04 | global batch size: 256 | lm loss: 2.148911E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.573 | TFLOPs: 20.30 |
31: iteration 24990/ 173500 | consumed samples: 6397440 | consumed tokens: 13101957120 | elapsed time per iteration (s): 0.80 | learning rate: 1.920E-04 | global batch size: 256 | lm loss: 2.167140E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.185 | TFLOPs: 19.43 |
31: iteration 25000/ 173500 | consumed samples: 6400000 | consumed tokens: 13107200000 | elapsed time per iteration (s): 0.78 | learning rate: 1.920E-04 | global batch size: 256 | lm loss: 2.166212E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.367 | TFLOPs: 19.93 |
31: -------------------------------------------------------------------------------------------
31: valid loss at iteration 25000 | lm loss value: 2.126208E+00 | lm loss PPL: 8.383021E+00 |
31: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 25000 to checkpoints_1b1long
0: [2022-11-25 23:42:57,735] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step25000 is begin to save!
0: [2022-11-25 23:42:57,746] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_01-model_00-model_states.pt...
0: [2022-11-25 23:42:57,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_01-model_00-model_states.pt.
0: [2022-11-25 23:42:57,981] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_03-model_00-model_states.pt... 0: [2022-11-25 23:42:58,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_03-model_00-model_states.pt. 0: [2022-11-25 23:42:58,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_04-model_00-model_states.pt... 0: [2022-11-25 23:42:58,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_04-model_00-model_states.pt. 0: [2022-11-25 23:42:58,150] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_05-model_00-model_states.pt... 0: [2022-11-25 23:42:58,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_05-model_00-model_states.pt. 0: [2022-11-25 23:42:58,228] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_06-model_00-model_states.pt... 0: [2022-11-25 23:42:58,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_06-model_00-model_states.pt. 0: [2022-11-25 23:42:58,307] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_07-model_00-model_states.pt... 0: [2022-11-25 23:42:58,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_07-model_00-model_states.pt. 0: [2022-11-25 23:42:58,392] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_08-model_00-model_states.pt... 0: [2022-11-25 23:42:58,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_08-model_00-model_states.pt. 
0: [2022-11-25 23:42:58,469] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_09-model_00-model_states.pt... 0: [2022-11-25 23:42:58,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_09-model_00-model_states.pt. 0: [2022-11-25 23:42:58,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_10-model_00-model_states.pt... 0: [2022-11-25 23:42:58,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_10-model_00-model_states.pt. 0: [2022-11-25 23:42:58,631] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_11-model_00-model_states.pt... 0: [2022-11-25 23:42:58,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_11-model_00-model_states.pt. 0: [2022-11-25 23:42:58,708] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_12-model_00-model_states.pt... 0: [2022-11-25 23:42:58,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_12-model_00-model_states.pt. 0: [2022-11-25 23:42:58,784] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_13-model_00-model_states.pt... 0: [2022-11-25 23:42:58,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_13-model_00-model_states.pt. 0: [2022-11-25 23:42:58,861] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_14-model_00-model_states.pt... 0: [2022-11-25 23:42:58,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_14-model_00-model_states.pt. 
0: [2022-11-25 23:42:58,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_15-model_00-model_states.pt... 0: [2022-11-25 23:42:59,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_15-model_00-model_states.pt. 0: [2022-11-25 23:42:59,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_16-model_00-model_states.pt... 0: [2022-11-25 23:42:59,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_16-model_00-model_states.pt. 0: [2022-11-25 23:42:59,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_17-model_00-model_states.pt... 0: [2022-11-25 23:42:59,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_17-model_00-model_states.pt. 0: [2022-11-25 23:42:59,160] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_18-model_00-model_states.pt... 0: [2022-11-25 23:42:59,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_18-model_00-model_states.pt. 0: [2022-11-25 23:42:59,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_19-model_00-model_states.pt... 0: [2022-11-25 23:42:59,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_19-model_00-model_states.pt. 0: [2022-11-25 23:42:59,309] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_20-model_00-model_states.pt... 0: [2022-11-25 23:42:59,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_20-model_00-model_states.pt. 
0: [2022-11-25 23:42:59,386] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_21-model_00-model_states.pt... 0: [2022-11-25 23:42:59,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_21-model_00-model_states.pt. 0: [2022-11-25 23:42:59,464] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_22-model_00-model_states.pt... 0: [2022-11-25 23:42:59,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_22-model_00-model_states.pt. 0: [2022-11-25 23:42:59,537] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_23-model_00-model_states.pt... 0: [2022-11-25 23:42:59,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_23-model_00-model_states.pt. 0: [2022-11-25 23:42:59,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_24-model_00-model_states.pt... 0: [2022-11-25 23:42:59,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_24-model_00-model_states.pt. 0: [2022-11-25 23:42:59,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_25-model_00-model_states.pt... 0: [2022-11-25 23:42:59,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_25-model_00-model_states.pt. 0: [2022-11-25 23:42:59,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_26-model_00-model_states.pt... 0: [2022-11-25 23:42:59,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_26-model_00-model_states.pt. 
0: [2022-11-25 23:42:59,837] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_27-model_00-model_states.pt... 0: [2022-11-25 23:42:59,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_27-model_00-model_states.pt. 0: [2022-11-25 23:42:59,914] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_28-model_00-model_states.pt... 0: [2022-11-25 23:42:59,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_28-model_00-model_states.pt. 0: [2022-11-25 23:42:59,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/layer_30-model_00-model_states.pt... 0: [2022-11-25 23:42:59,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/layer_30-model_00-model_states.pt. 0: [2022-11-25 23:42:59,992] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step25000/mp_rank_00_model_states.pt 0: [2022-11-25 23:42:59,992] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/mp_rank_00_model_states.pt... 0: [2022-11-25 23:42:59,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/mp_rank_00_model_states.pt. 0: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 
5: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
4: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 
16: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 
12: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 
25: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 
24: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 
29: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 
21: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 
27: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
5: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 
10: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
13: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 
23: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
14: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 
17: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 
19: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 
10: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 
20: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 
29: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
5: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
20: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 
27: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 
17: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:43:00,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 
0: [2022-11-25 23:43:00,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:43:00,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:43:00,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-25 23:43:00,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 11: [2022-11-25 23:43:00,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:43:00,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 23:43:00,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 4: [2022-11-25 23:43:00,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:43:00,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 23:43:00,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 3: [2022-11-25 23:43:00,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-25 23:43:00,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 23:43:00,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 8: [2022-11-25 23:43:00,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:43:00,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 23:43:00,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 9: [2022-11-25 23:43:00,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:43:00,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:43:00,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 18: [2022-11-25 23:43:00,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 9: [2022-11-25 23:43:00,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 18: [2022-11-25 23:43:00,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 1: [2022-11-25 23:43:00,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-25 23:43:00,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 23:43:00,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 21: [2022-11-25 23:43:00,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:43:00,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-25 23:43:00,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 29: [2022-11-25 23:43:00,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:43:00,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-25 23:43:00,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 4: [2022-11-25 23:43:00,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:43:00,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
4: [2022-11-25 23:43:00,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 15: [2022-11-25 23:43:00,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 22: [2022-11-25 23:43:00,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:43:00,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 15: [2022-11-25 23:43:00,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 22: [2022-11-25 23:43:00,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 13: [2022-11-25 23:43:00,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:43:00,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 13: [2022-11-25 23:43:00,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 22: [2022-11-25 23:43:00,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:43:00,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
22: [2022-11-25 23:43:00,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 13: [2022-11-25 23:43:00,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 3: [2022-11-25 23:43:00,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 22: [2022-11-25 23:43:00,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 3: [2022-11-25 23:43:00,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 30: [2022-11-25 23:43:00,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:43:00,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:43:00,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-25 23:43:00,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-25 23:43:00,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 30: [2022-11-25 23:43:00,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 16: [2022-11-25 23:43:00,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
16: [2022-11-25 23:43:00,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-25 23:43:00,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 20: [2022-11-25 23:43:00,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:43:00,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 1: [2022-11-25 23:43:00,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:43:00,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:43:00,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 18: [2022-11-25 23:43:00,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
12: [2022-11-25 23:43:00,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 18: [2022-11-25 23:43:00,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 1: [2022-11-25 23:43:00,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 12: [2022-11-25 23:43:00,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 18: [2022-11-25 23:43:00,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 1: [2022-11-25 23:43:00,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 10: [2022-11-25 23:43:00,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:43:00,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 23:43:00,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 19: [2022-11-25 23:43:00,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:43:00,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-25 23:43:00,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
13: [2022-11-25 23:43:00,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:43:00,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 23:43:00,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 3: [2022-11-25 23:43:00,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:43:00,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 23:43:00,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 27: [2022-11-25 23:43:00,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:43:00,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:43:00,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 28: [2022-11-25 23:43:00,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:43:00,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
27: [2022-11-25 23:43:00,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-25 23:43:00,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 5: [2022-11-25 23:43:00,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:43:00,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:43:00,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:43:00,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 23:43:00,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 8: [2022-11-25 23:43:00,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:43:00,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:43:00,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 12: [2022-11-25 23:43:00,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-25 23:43:00,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:43:00,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 9: [2022-11-25 23:43:00,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:43:00,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 12: [2022-11-25 23:43:00,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 23:43:00,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 5: [2022-11-25 23:43:00,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 23:43:00,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 5: [2022-11-25 23:43:00,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 9: [2022-11-25 23:43:00,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 12: [2022-11-25 23:43:00,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 5: [2022-11-25 23:43:00,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
9: [2022-11-25 23:43:00,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 12: [2022-11-25 23:43:00,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 19: [2022-11-25 23:43:00,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:43:00,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 19: [2022-11-25 23:43:00,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-25 23:43:00,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 14: [2022-11-25 23:43:00,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:43:00,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 30: [2022-11-25 23:43:00,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:43:00,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 30: [2022-11-25 23:43:00,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-25 23:43:00,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
4: [2022-11-25 23:43:00,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:43:00,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 23:43:00,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 1: [2022-11-25 23:43:00,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:43:00,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 0: [2022-11-25 23:43:00,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:43:00,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 1: [2022-11-25 23:43:00,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 0: [2022-11-25 23:43:00,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 21: [2022-11-25 23:43:00,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:43:00,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-25 23:43:00,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
0: [2022-11-25 23:43:00,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:43:00,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:43:00,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:43:00,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:43:00,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:43:00,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 10: [2022-11-25 23:43:00,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 16: [2022-11-25 23:43:00,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 15: [2022-11-25 23:43:00,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 0: [2022-11-25 23:43:00,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 8: [2022-11-25 23:43:00,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
10: [2022-11-25 23:43:00,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 16: [2022-11-25 23:43:00,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 15: [2022-11-25 23:43:00,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 0: [2022-11-25 23:43:00,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 16: [2022-11-25 23:43:00,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:43:00,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-25 23:43:00,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 19: [2022-11-25 23:43:00,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:43:00,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-25 23:43:00,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 21: [2022-11-25 23:43:00,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:43:00,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
21: [2022-11-25 23:43:00,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-25 23:43:00,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 11: [2022-11-25 23:43:00,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 23:43:00,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:43:00,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:43:00,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 9: [2022-11-25 23:43:00,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:43:00,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 11: [2022-11-25 23:43:00,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 9: [2022-11-25 23:43:00,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 15: [2022-11-25 23:43:00,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 9: [2022-11-25 23:43:00,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
11: [2022-11-25 23:43:00,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 29: [2022-11-25 23:43:00,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:43:00,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 11: [2022-11-25 23:43:00,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:43:00,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 9: [2022-11-25 23:43:00,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:43:00,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 22: [2022-11-25 23:43:00,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:43:00,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 13: [2022-11-25 23:43:00,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:43:00,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 14: [2022-11-25 23:43:00,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
22: [2022-11-25 23:43:00,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 9: [2022-11-25 23:43:00,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 13: [2022-11-25 23:43:00,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 14: [2022-11-25 23:43:00,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 22: [2022-11-25 23:43:00,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 13: [2022-11-25 23:43:00,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 14: [2022-11-25 23:43:00,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 14: [2022-11-25 23:43:00,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:43:00,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:43:00,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 22: [2022-11-25 23:43:00,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 14: [2022-11-25 23:43:00,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
22: [2022-11-25 23:43:00,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 18: [2022-11-25 23:43:00,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:43:00,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:43:00,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:43:00,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 4: [2022-11-25 23:43:00,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 18: [2022-11-25 23:43:00,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 0: [2022-11-25 23:43:00,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 4: [2022-11-25 23:43:00,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 0: [2022-11-25 23:43:00,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 10: [2022-11-25 23:43:00,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-25 23:43:00,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 23:43:00,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 20: [2022-11-25 23:43:00,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:43:00,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-25 23:43:00,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 25: [2022-11-25 23:43:00,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:43:00,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:43:00,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:43:00,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
25: [2022-11-25 23:43:00,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-25 23:43:00,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 7: [2022-11-25 23:43:00,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:43:00,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:43:00,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 25: [2022-11-25 23:43:00,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 25: [2022-11-25 23:43:00,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-25 23:43:00,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 7: [2022-11-25 23:43:00,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 25: [2022-11-25 23:43:00,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 25: [2022-11-25 23:43:00,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 21: [2022-11-25 23:43:00,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
7: [2022-11-25 23:43:00,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 3: [2022-11-25 23:43:00,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:43:00,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 7: [2022-11-25 23:43:00,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 3: [2022-11-25 23:43:00,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 21: [2022-11-25 23:43:00,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 7: [2022-11-25 23:43:00,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 3: [2022-11-25 23:43:00,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 28: [2022-11-25 23:43:00,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-25 23:43:00,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 28: [2022-11-25 23:43:00,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-25 23:43:00,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-25 23:43:00,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 28: [2022-11-25 23:43:00,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:43:00,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 20: [2022-11-25 23:43:00,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:43:00,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 20: [2022-11-25 23:43:00,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-25 23:43:00,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 14: [2022-11-25 23:43:00,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:43:00,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
14: [2022-11-25 23:43:00,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 5: [2022-11-25 23:43:00,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 14: [2022-11-25 23:43:00,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 5: [2022-11-25 23:43:00,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 15: [2022-11-25 23:43:00,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:43:00,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 23:43:00,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 20: [2022-11-25 23:43:00,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:43:00,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-25 23:43:00,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 29: [2022-11-25 23:43:00,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
29: [2022-11-25 23:43:00,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-25 23:43:00,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 30: [2022-11-25 23:43:00,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:43:00,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-25 23:43:00,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 10: [2022-11-25 23:43:00,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:43:00,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 23:43:00,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 16: [2022-11-25 23:43:00,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:43:00,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-25 23:43:00,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 23: [2022-11-25 23:43:00,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-25 23:43:00,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:43:00,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:43:00,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-25 23:43:00,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-25 23:43:00,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 23: [2022-11-25 23:43:00,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 23: [2022-11-25 23:43:00,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-25 23:43:00,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 18: [2022-11-25 23:43:00,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:43:00,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-25 23:43:00,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 13: [2022-11-25 23:43:00,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-25 23:43:00,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 23:43:00,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 31: [2022-11-25 23:43:00,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:43:00,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:43:00,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-25 23:43:00,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 23: [2022-11-25 23:43:00,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-25 23:43:00,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 4: [2022-11-25 23:43:00,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:43:00,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 23:43:00,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 17: [2022-11-25 23:43:00,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 
17: [2022-11-25 23:43:00,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-25 23:43:00,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 17: [2022-11-25 23:43:00,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:43:00,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-25 23:43:00,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 17: [2022-11-25 23:43:00,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:43:00,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-25 23:43:00,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 17: [2022-11-25 23:43:00,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:43:00,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-25 23:43:00,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 1: [2022-11-25 23:43:00,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-25 23:43:00,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 23:43:00,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 17: [2022-11-25 23:43:00,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:43:00,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:43:00,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 12: [2022-11-25 23:43:00,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 23:43:00,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 17: [2022-11-25 23:43:00,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 2: [2022-11-25 23:43:00,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:43:00,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:43:00,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-25 23:43:00,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 23:43:00,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 23:43:00,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 23:43:00,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 2: [2022-11-25 23:43:00,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 2: [2022-11-25 23:43:00,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 31: [2022-11-25 23:43:00,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:43:00,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-25 23:43:00,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:43:00,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:43:00,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
27: [2022-11-25 23:43:00,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 31: [2022-11-25 23:43:00,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 27: [2022-11-25 23:43:00,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 31: [2022-11-25 23:43:00,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:43:00,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:43:00,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:43:00,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 7: [2022-11-25 23:43:00,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 23:43:00,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 31: [2022-11-25 23:43:00,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-25 23:43:00,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 7: [2022-11-25 23:43:00,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
7: [2022-11-25 23:43:00,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 28: [2022-11-25 23:43:00,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:43:00,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-25 23:43:00,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 29: [2022-11-25 23:43:00,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:43:00,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-25 23:43:00,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 2: [2022-11-25 23:43:00,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:43:00,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 23:43:00,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 0: [2022-11-25 23:43:00,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 11: [2022-11-25 23:43:00,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
0: [2022-11-25 23:43:00,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 11: [2022-11-25 23:43:00,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 23:43:00,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 27: [2022-11-25 23:43:00,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:43:00,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-25 23:43:00,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 20: [2022-11-25 23:43:00,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:43:00,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-25 23:43:00,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 5: [2022-11-25 23:43:00,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:43:00,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 23:43:00,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
25: [2022-11-25 23:43:00,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:43:00,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-25 23:43:00,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 6: [2022-11-25 23:43:00,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:43:00,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:43:00,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:43:00,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:43:00,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-25 23:43:00,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 23:43:00,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 23:43:00,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 23:43:00,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 23:43:00,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 23:43:00,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 6: [2022-11-25 23:43:00,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 6: [2022-11-25 23:43:00,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 6: [2022-11-25 23:43:00,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 6: [2022-11-25 23:43:00,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 7: [2022-11-25 23:43:00,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-25 23:43:00,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 23:43:00,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 19: [2022-11-25 23:43:00,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:43:00,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-25 23:43:00,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 8: [2022-11-25 23:43:00,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:43:00,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 23:43:00,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 2: [2022-11-25 23:43:00,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:43:00,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 23:43:00,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 29: [2022-11-25 23:43:00,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
29: [2022-11-25 23:43:00,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-25 23:43:00,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 10: [2022-11-25 23:43:00,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:43:00,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 23:43:00,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 15: [2022-11-25 23:43:00,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:43:00,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 23:43:00,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 9: [2022-11-25 23:43:00,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:43:00,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 23:43:00,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 14: [2022-11-25 23:43:00,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
22: [2022-11-25 23:43:00,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:43:00,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 22: [2022-11-25 23:43:00,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 14: [2022-11-25 23:43:00,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 22: [2022-11-25 23:43:00,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 3: [2022-11-25 23:43:00,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:43:00,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 23:43:00,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 13: [2022-11-25 23:43:00,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:43:00,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 23:43:00,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 0: [2022-11-25 23:43:00,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-25 23:43:00,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 23:43:00,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 27: [2022-11-25 23:43:00,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:43:00,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-25 23:43:00,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 18: [2022-11-25 23:43:00,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:43:00,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-25 23:43:00,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 31: [2022-11-25 23:43:00,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:43:00,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-25 23:43:00,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 16: [2022-11-25 23:43:00,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-25 23:43:00,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-25 23:43:00,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 6: [2022-11-25 23:43:00,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:43:00,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 12: [2022-11-25 23:43:00,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:43:00,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 12: [2022-11-25 23:43:00,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 23:43:00,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 1: [2022-11-25 23:43:00,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:43:00,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 23:43:00,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 30: [2022-11-25 23:43:00,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-25 23:43:00,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-25 23:43:00,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 28: [2022-11-25 23:43:00,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:43:00,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-25 23:43:00,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 21: [2022-11-25 23:43:00,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:43:00,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-25 23:43:00,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 23: [2022-11-25 23:43:00,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:43:00,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-25 23:43:00,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 25: [2022-11-25 23:43:00,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 
25: [2022-11-25 23:43:00,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-25 23:43:00,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 4: [2022-11-25 23:43:00,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:43:00,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 23:43:00,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 9: [2022-11-25 23:43:00,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:43:00,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 23:43:00,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 19: [2022-11-25 23:43:00,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:43:00,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-25 23:43:00,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 20: [2022-11-25 23:43:00,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 
20: [2022-11-25 23:43:00,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 8: [2022-11-25 23:43:00,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:43:00,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 8: [2022-11-25 23:43:00,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 23:43:00,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 7: [2022-11-25 23:43:00,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:43:00,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 23:43:00,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 10: [2022-11-25 23:43:00,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:43:00,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 23:43:00,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 5: [2022-11-25 23:43:00,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-25 23:43:00,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 23:43:00,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 17: [2022-11-25 23:43:00,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:43:00,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-25 23:43:00,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 3: [2022-11-25 23:43:00,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:43:00,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 23:43:00,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 18: [2022-11-25 23:43:00,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:43:00,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
18: [2022-11-25 23:43:00,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 22: [2022-11-25 23:43:00,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-25 23:43:00,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 18: [2022-11-25 23:43:00,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 14: [2022-11-25 23:43:00,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:43:00,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:43:00,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 29: [2022-11-25 23:43:00,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-25 23:43:00,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 14: [2022-11-25 23:43:00,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 15: [2022-11-25 23:43:00,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-25 23:43:00,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 23:43:00,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 0: [2022-11-25 23:43:00,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:43:00,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 23:43:00,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 30: [2022-11-25 23:43:00,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:43:00,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:43:00,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 2: [2022-11-25 23:43:00,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 30: [2022-11-25 23:43:00,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 2: [2022-11-25 23:43:00,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 27: [2022-11-25 23:43:00,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-25 23:43:00,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-25 23:43:00,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 31: [2022-11-25 23:43:00,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:43:00,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-25 23:43:00,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 13: [2022-11-25 23:43:00,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:43:00,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 23:43:00,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 1: [2022-11-25 23:43:00,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:43:00,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 23:43:00,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 12: [2022-11-25 23:43:00,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-25 23:43:00,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 23:43:00,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 6: [2022-11-25 23:43:00,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:43:00,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 23:43:00,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 21: [2022-11-25 23:43:00,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:43:00,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-25 23:43:00,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 16: [2022-11-25 23:43:00,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:43:00,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-25 23:43:00,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 28: [2022-11-25 23:43:00,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-25 23:43:00,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-25 23:43:00,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 23: [2022-11-25 23:43:00,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:43:00,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-25 23:43:00,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 11: [2022-11-25 23:43:00,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:43:00,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 23:43:00,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 20: [2022-11-25 23:43:00,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:43:00,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-25 23:43:00,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 4: [2022-11-25 23:43:00,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-25 23:43:00,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 23:43:00,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 25: [2022-11-25 23:43:00,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:43:00,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-25 23:43:00,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 8: [2022-11-25 23:43:00,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:43:00,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 23:43:00,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 5: [2022-11-25 23:43:00,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:43:00,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 23:43:00,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 19: [2022-11-25 23:43:00,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
19: [2022-11-25 23:43:00,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-25 23:43:00,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 11: [2022-11-25 23:43:00,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:43:00,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 23:43:00,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 7: [2022-11-25 23:43:00,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:43:00,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 23:43:00,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 17: [2022-11-25 23:43:00,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:43:00,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-25 23:43:00,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 9: [2022-11-25 23:43:00,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-25 23:43:00,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 23:43:00,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 18: [2022-11-25 23:43:00,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:43:00,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-25 23:43:00,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 10: [2022-11-25 23:43:00,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:43:00,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 23:43:00,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 2: [2022-11-25 23:43:00,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:43:00,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 23:43:00,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 24: [2022-11-25 23:43:00,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
24: [2022-11-25 23:43:00,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:43:00,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 29: [2022-11-25 23:43:00,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:43:00,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-25 23:43:00,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 24: [2022-11-25 23:43:00,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 29: [2022-11-25 23:43:00,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-25 23:43:00,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 31: [2022-11-25 23:43:00,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:43:00,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-25 23:43:00,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 15: [2022-11-25 23:43:00,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-25 23:43:00,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 23:43:00,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 22: [2022-11-25 23:43:00,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:43:00,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:43:00,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 3: [2022-11-25 23:43:00,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 22: [2022-11-25 23:43:00,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 3: [2022-11-25 23:43:00,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 12: [2022-11-25 23:43:00,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:43:00,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 23:43:00,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 1: [2022-11-25 23:43:00,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-25 23:43:00,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 30: [2022-11-25 23:43:00,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:43:00,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 30: [2022-11-25 23:43:00,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-25 23:43:00,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 0: [2022-11-25 23:43:00,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:43:00,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 23:43:00,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 13: [2022-11-25 23:43:00,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:43:00,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 23:43:00,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 11: [2022-11-25 23:43:00,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-25 23:43:00,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 27: [2022-11-25 23:43:00,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:43:00,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 27: [2022-11-25 23:43:00,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-25 23:43:00,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 6: [2022-11-25 23:43:00,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:43:00,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 23:43:00,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 3: [2022-11-25 23:43:00,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:43:00,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 23:43:00,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 14: [2022-11-25 23:43:00,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-25 23:43:00,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 23:43:00,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 9: [2022-11-25 23:43:00,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:43:00,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:43:00,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 23:43:00,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 19: [2022-11-25 23:43:00,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-25 23:43:00,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 12: [2022-11-25 23:43:00,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:43:00,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 7: [2022-11-25 23:43:00,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:43:00,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
7: [2022-11-25 23:43:00,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 23:43:00,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 30: [2022-11-25 23:43:00,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:43:00,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-25 23:43:00,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 31: [2022-11-25 23:43:00,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:43:00,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-25 23:43:00,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 25: [2022-11-25 23:43:00,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:43:00,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:43:00,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-25 23:43:00,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
2: [2022-11-25 23:43:00,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 23:43:00,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 18: [2022-11-25 23:43:00,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:43:00,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-25 23:43:00,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 4: [2022-11-25 23:43:00,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:43:00,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:43:00,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 23:43:00,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 8: [2022-11-25 23:43:00,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:43:00,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 23:43:00,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
17: [2022-11-25 23:43:00,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:43:00,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 23:43:00,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 17: [2022-11-25 23:43:00,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-25 23:43:00,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 22: [2022-11-25 23:43:00,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:43:00,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-25 23:43:00,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 29: [2022-11-25 23:43:00,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:43:00,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-25 23:43:00,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 5: [2022-11-25 23:43:00,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-25 23:43:00,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 23:43:00,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 20: [2022-11-25 23:43:00,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:43:00,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:43:00,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-25 23:43:00,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 28: [2022-11-25 23:43:00,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-25 23:43:00,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 16: [2022-11-25 23:43:00,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:43:00,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-25 23:43:00,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 16: [2022-11-25 23:43:00,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 21: [2022-11-25 23:43:00,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 16: [2022-11-25 23:43:00,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 14: [2022-11-25 23:43:00,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:43:00,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 23:43:00,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 23: [2022-11-25 23:43:00,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:43:00,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-25 23:43:00,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 0: [2022-11-25 23:43:00,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-25 23:43:00,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 23:43:00,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 13: [2022-11-25 23:43:00,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:43:00,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:43:00,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 15: [2022-11-25 23:43:00,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 13: [2022-11-25 23:43:00,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 15: [2022-11-25 23:43:00,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 27: [2022-11-25 23:43:00,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:43:00,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-25 23:43:00,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 24: [2022-11-25 23:43:00,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-25 23:43:00,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-25 23:43:00,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 24: [2022-11-25 23:43:00,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:43:00,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-25 23:43:00,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 23: [2022-11-25 23:43:00,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:43:00,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:43:00,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 23: [2022-11-25 23:43:00,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-25 23:43:00,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 21: [2022-11-25 23:43:00,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 16: [2022-11-25 23:43:00,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
16: [2022-11-25 23:43:00,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-25 23:43:00,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 1: [2022-11-25 23:43:00,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:43:00,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 23:43:00,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 28: [2022-11-25 23:43:00,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:43:00,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-25 23:43:00,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 24: [2022-11-25 23:43:00,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:43:00,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 
24: [2022-11-25 23:43:00,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-25 23:43:00,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-25 23:43:00,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 24: [2022-11-25 23:43:00,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 24: [2022-11-25 23:43:00,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:43:00,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:43:00,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-25 23:43:00,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-25 23:43:00,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 24: [2022-11-25 23:43:00,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 26: [2022-11-25 23:43:00,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:43:00,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
26: [2022-11-25 23:43:00,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:43:00,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-25 23:43:00,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-25 23:43:00,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-25 23:43:00,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 26: [2022-11-25 23:43:00,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 26: [2022-11-25 23:43:00,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 26: [2022-11-25 23:43:00,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:43:00,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-25 23:43:00,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 26: [2022-11-25 23:43:00,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:43:00,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 
26: [2022-11-25 23:43:00,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:43:00,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:43:00,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-25 23:43:00,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-25 23:43:00,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-25 23:43:00,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 26: [2022-11-25 23:43:00,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step25000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-25 23:43:00,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 26: [2022-11-25 23:43:00,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 26: [2022-11-25 23:43:00,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
0: successfully saved checkpoint at iteration 25000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2712.27 31: iteration 25010/ 173500 | consumed samples: 6402560 | consumed tokens: 13112442880 | elapsed time per iteration (s): 1.12 | learning rate: 1.920E-04 | global batch size: 256 | lm loss: 2.154578E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.951 | TFLOPs: 13.85 | 31: iteration 25020/ 173500 | consumed samples: 6405120 | consumed tokens: 13117685760 | elapsed time per iteration (s): 0.81 | learning rate: 1.920E-04 | global batch size: 256 | lm loss: 2.160258E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.995 | TFLOPs: 19.06 | 31: iteration 25030/ 173500 | consumed samples: 6407680 | consumed tokens: 13122928640 | elapsed time per iteration (s): 0.80 | learning rate: 1.920E-04 | global batch size: 256 | lm loss: 2.162877E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.725 | TFLOPs: 19.46 | 31: iteration 25040/ 173500 | consumed samples: 6410240 | consumed tokens: 13128171520 | elapsed time per iteration (s): 0.82 | learning rate: 1.919E-04 | global batch size: 256 | lm loss: 2.179882E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.639 | TFLOPs: 18.79 | 31: iteration 25050/ 173500 | consumed samples: 6412800 | consumed tokens: 13133414400 | elapsed time per iteration (s): 0.86 | learning rate: 1.919E-04 | global batch size: 256 | lm loss: 2.185340E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.670 | TFLOPs: 18.07 | 31: iteration 25060/ 173500 | consumed samples: 6415360 | consumed tokens: 13138657280 | elapsed time per iteration (s): 0.81 | 
learning rate: 1.919E-04 | global batch size: 256 | lm loss: 2.173014E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.797 | TFLOPs: 19.23 | 31: iteration 25070/ 173500 | consumed samples: 6417920 | consumed tokens: 13143900160 | elapsed time per iteration (s): 0.82 | learning rate: 1.919E-04 | global batch size: 256 | lm loss: 2.161431E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.814 | TFLOPs: 18.80 | 31: iteration 25080/ 173500 | consumed samples: 6420480 | consumed tokens: 13149143040 | elapsed time per iteration (s): 0.80 | learning rate: 1.919E-04 | global batch size: 256 | lm loss: 2.180189E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.149 | TFLOPs: 19.37 | 31: iteration 25090/ 173500 | consumed samples: 6423040 | consumed tokens: 13154385920 | elapsed time per iteration (s): 0.80 | learning rate: 1.919E-04 | global batch size: 256 | lm loss: 2.168156E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.045 | TFLOPs: 19.30 | 31: iteration 25100/ 173500 | consumed samples: 6425600 | consumed tokens: 13159628800 | elapsed time per iteration (s): 0.84 | learning rate: 1.919E-04 | global batch size: 256 | lm loss: 2.173979E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.408 | TFLOPs: 18.42 | 31: iteration 25110/ 173500 | consumed samples: 6428160 | consumed tokens: 13164871680 | elapsed time per iteration (s): 0.83 | learning rate: 1.919E-04 | global batch size: 256 | lm loss: 2.164209E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.852 | TFLOPs: 18.56 | 31: iteration 25120/ 173500 
| consumed samples: 6430720 | consumed tokens: 13170114560 | elapsed time per iteration (s): 0.82 | learning rate: 1.919E-04 | global batch size: 256 | lm loss: 2.154126E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.745 | TFLOPs: 18.80 | 31: iteration 25130/ 173500 | consumed samples: 6433280 | consumed tokens: 13175357440 | elapsed time per iteration (s): 0.72 | learning rate: 1.919E-04 | global batch size: 256 | lm loss: 2.165755E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.782 | TFLOPs: 21.52 | 31: iteration 25140/ 173500 | consumed samples: 6435840 | consumed tokens: 13180600320 | elapsed time per iteration (s): 0.82 | learning rate: 1.919E-04 | global batch size: 256 | lm loss: 2.162697E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.581 | TFLOPs: 18.91 | 31: iteration 25150/ 173500 | consumed samples: 6438400 | consumed tokens: 13185843200 | elapsed time per iteration (s): 0.78 | learning rate: 1.919E-04 | global batch size: 256 | lm loss: 2.180177E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.461 | TFLOPs: 19.93 | 31: iteration 25160/ 173500 | consumed samples: 6440960 | consumed tokens: 13191086080 | elapsed time per iteration (s): 0.77 | learning rate: 1.919E-04 | global batch size: 256 | lm loss: 2.183201E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.845 | TFLOPs: 20.02 | 31: iteration 25170/ 173500 | consumed samples: 6443520 | consumed tokens: 13196328960 | elapsed time per iteration (s): 0.80 | learning rate: 1.919E-04 | global batch size: 256 | lm loss: 2.170547E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 322.012 | TFLOPs: 19.48 |
31: iteration 25180/ 173500 | consumed samples: 6446080 | consumed tokens: 13201571840 | elapsed time per iteration (s): 0.80 | learning rate: 1.919E-04 | global batch size: 256 | lm loss: 2.160967E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.027 | TFLOPs: 19.30 |
31: iteration 25190/ 173500 | consumed samples: 6448640 | consumed tokens: 13206814720 | elapsed time per iteration (s): 0.78 | learning rate: 1.918E-04 | global batch size: 256 | lm loss: 2.181530E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.727 | TFLOPs: 19.95 |
31: iteration 25200/ 173500 | consumed samples: 6451200 | consumed tokens: 13212057600 | elapsed time per iteration (s): 0.76 | learning rate: 1.918E-04 | global batch size: 256 | lm loss: 2.169319E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.438 | TFLOPs: 20.47 |
31: iteration 25210/ 173500 | consumed samples: 6453760 | consumed tokens: 13217300480 | elapsed time per iteration (s): 0.82 | learning rate: 1.918E-04 | global batch size: 256 | lm loss: 2.179920E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.599 | TFLOPs: 18.79 |
31: iteration 25220/ 173500 | consumed samples: 6456320 | consumed tokens: 13222543360 | elapsed time per iteration (s): 0.77 | learning rate: 1.918E-04 | global batch size: 256 | lm loss: 2.181264E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.418 | TFLOPs: 19.99 |
31: iteration 25230/ 173500 | consumed samples: 6458880 | consumed tokens: 13227786240 | elapsed time per iteration (s): 0.77 | learning rate: 1.918E-04 | global batch size: 256 | lm loss: 2.154949E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.039 | TFLOPs: 20.09 |
31: iteration 25240/ 173500 | consumed samples: 6461440 | consumed tokens: 13233029120 | elapsed time per iteration (s): 0.78 | learning rate: 1.918E-04 | global batch size: 256 | lm loss: 2.170034E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.670 | TFLOPs: 19.82 |
31: iteration 25250/ 173500 | consumed samples: 6464000 | consumed tokens: 13238272000 | elapsed time per iteration (s): 0.86 | learning rate: 1.918E-04 | global batch size: 256 | lm loss: 2.157599E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.178 | TFLOPs: 17.92 |
31: iteration 25260/ 173500 | consumed samples: 6466560 | consumed tokens: 13243514880 | elapsed time per iteration (s): 0.77 | learning rate: 1.918E-04 | global batch size: 256 | lm loss: 2.163153E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.621 | TFLOPs: 20.00 |
31: iteration 25270/ 173500 | consumed samples: 6469120 | consumed tokens: 13248757760 | elapsed time per iteration (s): 0.77 | learning rate: 1.918E-04 | global batch size: 256 | lm loss: 2.183780E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.890 | TFLOPs: 20.14 |
31: iteration 25280/ 173500 | consumed samples: 6471680 | consumed tokens: 13254000640 | elapsed time per iteration (s): 0.82 | learning rate: 1.918E-04 | global batch size: 256 | lm loss: 2.190405E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.542 | TFLOPs: 18.79 |
31: iteration 25290/ 173500 | consumed samples: 6474240 | consumed tokens: 13259243520 | elapsed time per iteration (s): 0.81 | learning rate: 1.918E-04 | global batch size: 256 | lm loss: 2.115607E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.894 | TFLOPs: 19.17 |
31: iteration 25300/ 173500 | consumed samples: 6476800 | consumed tokens: 13264486400 | elapsed time per iteration (s): 0.82 | learning rate: 1.918E-04 | global batch size: 256 | lm loss: 2.177348E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.820 | TFLOPs: 18.86 |
31: iteration 25310/ 173500 | consumed samples: 6479360 | consumed tokens: 13269729280 | elapsed time per iteration (s): 0.82 | learning rate: 1.918E-04 | global batch size: 256 | lm loss: 2.158307E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.994 | TFLOPs: 18.94 |
31: iteration 25320/ 173500 | consumed samples: 6481920 | consumed tokens: 13274972160 | elapsed time per iteration (s): 0.89 | learning rate: 1.918E-04 | global batch size: 256 | lm loss: 2.164459E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.590 | TFLOPs: 17.46 |
31: iteration 25330/ 173500 | consumed samples: 6484480 | consumed tokens: 13280215040 | elapsed time per iteration (s): 0.79 | learning rate: 1.917E-04 | global batch size: 256 | lm loss: 2.165451E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.636 | TFLOPs: 19.64 |
31: iteration 25340/ 173500 | consumed samples: 6487040 | consumed tokens: 13285457920 | elapsed time per iteration (s): 0.83 | learning rate: 1.917E-04 | global batch size: 256 | lm loss: 2.169879E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.285 | TFLOPs: 18.59 |
31: iteration 25350/ 173500 | consumed samples: 6489600 | consumed tokens: 13290700800 | elapsed time per iteration (s): 0.83 | learning rate: 1.917E-04 | global batch size: 256 | lm loss: 2.159454E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.313 | TFLOPs: 18.65 |
31: iteration 25360/ 173500 | consumed samples: 6492160 | consumed tokens: 13295943680 | elapsed time per iteration (s): 0.81 | learning rate: 1.917E-04 | global batch size: 256 | lm loss: 2.159000E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.084 | TFLOPs: 19.12 |
31: iteration 25370/ 173500 | consumed samples: 6494720 | consumed tokens: 13301186560 | elapsed time per iteration (s): 0.82 | learning rate: 1.917E-04 | global batch size: 256 | lm loss: 2.150644E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.117 | TFLOPs: 18.88 |
31: iteration 25380/ 173500 | consumed samples: 6497280 | consumed tokens: 13306429440 | elapsed time per iteration (s): 0.81 | learning rate: 1.917E-04 | global batch size: 256 | lm loss: 2.181305E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.903 | TFLOPs: 19.05 |
31: iteration 25390/ 173500 | consumed samples: 6499840 | consumed tokens: 13311672320 | elapsed time per iteration (s): 0.82 | learning rate: 1.917E-04 | global batch size: 256 | lm loss: 2.199996E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.224 | TFLOPs: 18.89 |
31: iteration 25400/ 173500 | consumed samples: 6502400 | consumed tokens: 13316915200 | elapsed time per iteration (s): 0.82 | learning rate: 1.917E-04 | global batch size: 256 | lm loss: 2.178223E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.780 | TFLOPs: 18.98 |
31: iteration 25410/ 173500 | consumed samples: 6504960 | consumed tokens: 13322158080 | elapsed time per iteration (s): 0.93 | learning rate: 1.917E-04 | global batch size: 256 | lm loss: 2.138784E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 275.491 | TFLOPs: 16.67 |
31: iteration 25420/ 173500 | consumed samples: 6507520 | consumed tokens: 13327400960 | elapsed time per iteration (s): 0.86 | learning rate: 1.917E-04 | global batch size: 256 | lm loss: 2.149031E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.796 | TFLOPs: 18.08 |
31: iteration 25430/ 173500 | consumed samples: 6510080 | consumed tokens: 13332643840 | elapsed time per iteration (s): 0.87 | learning rate: 1.917E-04 | global batch size: 256 | lm loss: 2.186055E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.135 | TFLOPs: 17.85 |
31: iteration 25440/ 173500 | consumed samples: 6512640 | consumed tokens: 13337886720 | elapsed time per iteration (s): 0.85 | learning rate: 1.917E-04 | global batch size: 256 | lm loss: 2.154660E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.292 | TFLOPs: 18.23 |
31: iteration 25450/ 173500 | consumed samples: 6515200 | consumed tokens: 13343129600 | elapsed time per iteration (s): 0.80 | learning rate: 1.917E-04 | global batch size: 256 | lm loss: 2.160933E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.228 | TFLOPs: 19.43 |
31: iteration 25460/ 173500 | consumed samples: 6517760 | consumed tokens: 13348372480 | elapsed time per iteration (s): 0.82 | learning rate: 1.917E-04 | global batch size: 256 | lm loss: 2.188036E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.328 | TFLOPs: 18.77 |
31: iteration 25470/ 173500 | consumed samples: 6520320 | consumed tokens: 13353615360 | elapsed time per iteration (s): 0.80 | learning rate: 1.917E-04 | global batch size: 256 | lm loss: 2.174459E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.126 | TFLOPs: 19.31 |
31: iteration 25480/ 173500 | consumed samples: 6522880 | consumed tokens: 13358858240 | elapsed time per iteration (s): 0.80 | learning rate: 1.916E-04 | global batch size: 256 | lm loss: 2.172744E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.743 | TFLOPs: 19.40 |
31: iteration 25490/ 173500 | consumed samples: 6525440 | consumed tokens: 13364101120 | elapsed time per iteration (s): 0.83 | learning rate: 1.916E-04 | global batch size: 256 | lm loss: 2.143230E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.482 | TFLOPs: 18.72 |
31: iteration 25500/ 173500 | consumed samples: 6528000 | consumed tokens: 13369344000 | elapsed time per iteration (s): 0.87 | learning rate: 1.916E-04 | global batch size: 256 | lm loss: 2.165878E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.391 | TFLOPs: 17.87 |
31: iteration 25510/ 173500 | consumed samples: 6530560 | consumed tokens: 13374586880 | elapsed time per iteration (s): 0.85 | learning rate: 1.916E-04 | global batch size: 256 | lm loss: 2.135338E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.887 | TFLOPs: 18.26 |
31: iteration 25520/ 173500 | consumed samples: 6533120 | consumed tokens: 13379829760 | elapsed time per iteration (s): 0.79 | learning rate: 1.916E-04 | global batch size: 256 | lm loss: 2.155013E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.084 | TFLOPs: 19.55 |
31: iteration 25530/ 173500 | consumed samples: 6535680 | consumed tokens: 13385072640 | elapsed time per iteration (s): 0.82 | learning rate: 1.916E-04 | global batch size: 256 | lm loss: 2.161743E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.700 | TFLOPs: 18.98 |
31: iteration 25540/ 173500 | consumed samples: 6538240 | consumed tokens: 13390315520 | elapsed time per iteration (s): 0.80 | learning rate: 1.916E-04 | global batch size: 256 | lm loss: 2.159846E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.823 | TFLOPs: 19.29 |
31: iteration 25550/ 173500 | consumed samples: 6540800 | consumed tokens: 13395558400 | elapsed time per iteration (s): 0.87 | learning rate: 1.916E-04 | global batch size: 256 | lm loss: 2.160402E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.469 | TFLOPs: 17.81 |
31: iteration 25560/ 173500 | consumed samples: 6543360 | consumed tokens: 13400801280 | elapsed time per iteration (s): 0.81 | learning rate: 1.916E-04 | global batch size: 256 | lm loss: 2.173097E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.053 | TFLOPs: 19.06 |
31: iteration 25570/ 173500 | consumed samples: 6545920 | consumed tokens: 13406044160 | elapsed time per iteration (s): 0.82 | learning rate: 1.916E-04 | global batch size: 256 | lm loss: 2.155481E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.939 | TFLOPs: 18.99 |
31: iteration 25580/ 173500 | consumed samples: 6548480 | consumed tokens: 13411287040 | elapsed time per iteration (s): 0.82 | learning rate: 1.916E-04 | global batch size: 256 | lm loss: 2.142287E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.346 | TFLOPs: 18.84 |
31: iteration 25590/ 173500 | consumed samples: 6551040 | consumed tokens: 13416529920 | elapsed time per iteration (s): 0.91 | learning rate: 1.916E-04 | global batch size: 256 | lm loss: 2.149141E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.609 | TFLOPs: 17.10 |
31: iteration 25600/ 173500 | consumed samples: 6553600 | consumed tokens: 13421772800 | elapsed time per iteration (s): 0.85 | learning rate: 1.916E-04 | global batch size: 256 | lm loss: 2.149614E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.413 | TFLOPs: 18.30 |
31: iteration 25610/ 173500 | consumed samples: 6556160 | consumed tokens: 13427015680 | elapsed time per iteration (s): 0.85 | learning rate: 1.916E-04 | global batch size: 256 | lm loss: 2.164513E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.660 | TFLOPs: 18.19 |
31: iteration 25620/ 173500 | consumed samples: 6558720 | consumed tokens: 13432258560 | elapsed time per iteration (s): 0.76 | learning rate: 1.915E-04 | global batch size: 256 | lm loss: 2.190937E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.265 | TFLOPs: 20.34 |
31: iteration 25630/ 173500 | consumed samples: 6561280 | consumed tokens: 13437501440 | elapsed time per iteration (s): 0.75 | learning rate: 1.915E-04 | global batch size: 256 | lm loss: 2.185113E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.476 | TFLOPs: 20.60 |
31: iteration 25640/ 173500 | consumed samples: 6563840 | consumed tokens: 13442744320 | elapsed time per iteration (s): 0.78 | learning rate: 1.915E-04 | global batch size: 256 | lm loss: 2.166211E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.442 | TFLOPs: 19.75 |
31: iteration 25650/ 173500 | consumed samples: 6566400 | consumed tokens: 13447987200 | elapsed time per iteration (s): 0.74 | learning rate: 1.915E-04 | global batch size: 256 | lm loss: 2.181594E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.766 | TFLOPs: 20.80 |
31: iteration 25660/ 173500 | consumed samples: 6568960 | consumed tokens: 13453230080 | elapsed time per iteration (s): 0.83 | learning rate: 1.915E-04 | global batch size: 256 | lm loss: 2.182504E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.482 | TFLOPs: 18.72 |
31: iteration 25670/ 173500 | consumed samples: 6571520 | consumed tokens: 13458472960 | elapsed time per iteration (s): 0.88 | learning rate: 1.915E-04 | global batch size: 256 | lm loss: 2.173153E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.194 | TFLOPs: 17.68 |
31: iteration 25680/ 173500 | consumed samples: 6574080 | consumed tokens: 13463715840 | elapsed time per iteration (s): 0.78 | learning rate: 1.915E-04 | global batch size: 256 | lm loss: 2.156603E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.031 | TFLOPs: 19.85 |
31: iteration 25690/ 173500 | consumed samples: 6576640 | consumed tokens: 13468958720 | elapsed time per iteration (s): 0.79 | learning rate: 1.915E-04 | global batch size: 256 | lm loss: 2.165732E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.960 | TFLOPs: 19.60 |
31: iteration 25700/ 173500 | consumed samples: 6579200 | consumed tokens: 13474201600 | elapsed time per iteration (s): 0.79 | learning rate: 1.915E-04 | global batch size: 256 | lm loss: 2.183588E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.792 | TFLOPs: 19.59 |
31: iteration 25710/ 173500 | consumed samples: 6581760 | consumed tokens: 13479444480 | elapsed time per iteration (s): 0.78 | learning rate: 1.915E-04 | global batch size: 256 | lm loss: 2.140696E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.496 | TFLOPs: 19.81 |
31: iteration 25720/ 173500 | consumed samples: 6584320 | consumed tokens: 13484687360 | elapsed time per iteration (s): 0.77 | learning rate: 1.915E-04 | global batch size: 256 | lm loss: 2.163683E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.584 | TFLOPs: 20.06 |
31: iteration 25730/ 173500 | consumed samples: 6586880 | consumed tokens: 13489930240 | elapsed time per iteration (s): 0.82 | learning rate: 1.915E-04 | global batch size: 256 | lm loss: 2.161923E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.118 | TFLOPs: 18.82 |
31: iteration 25740/ 173500 | consumed samples: 6589440 | consumed tokens: 13495173120 | elapsed time per iteration (s): 0.77 | learning rate: 1.915E-04 | global batch size: 256 | lm loss: 2.166768E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.600 | TFLOPs: 20.24 |
31: iteration 25750/ 173500 | consumed samples: 6592000 | consumed tokens: 13500416000 | elapsed time per iteration (s): 0.77 | learning rate: 1.915E-04 | global batch size: 256 | lm loss: 2.169569E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.750 | TFLOPs: 20.19 |
31: iteration 25760/ 173500 | consumed samples: 6594560 | consumed tokens: 13505658880 | elapsed time per iteration (s): 0.78 | learning rate: 1.914E-04 | global batch size: 256 | lm loss: 2.194031E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.258 | TFLOPs: 19.74 |
31: iteration 25770/ 173500 | consumed samples: 6597120 | consumed tokens: 13510901760 | elapsed time per iteration (s): 0.74 | learning rate: 1.914E-04 | global batch size: 256 | lm loss: 2.167399E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.171 | TFLOPs: 20.88 |
31: iteration 25780/ 173500 | consumed samples: 6599680 | consumed tokens: 13516144640 | elapsed time per iteration (s): 0.75 | learning rate: 1.914E-04 | global batch size: 256 | lm loss: 2.154896E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.056 | TFLOPs: 20.63 |
31: iteration 25790/ 173500 | consumed samples: 6602240 | consumed tokens: 13521387520 | elapsed time per iteration (s): 0.73 | learning rate: 1.914E-04 | global batch size: 256 | lm loss: 2.186487E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.128 | TFLOPs: 21.18 |
31: iteration 25800/ 173500 | consumed samples: 6604800 | consumed tokens: 13526630400 | elapsed time per iteration (s): 0.74 | learning rate: 1.914E-04 | global batch size: 256 | lm loss: 2.161693E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.026 | TFLOPs: 20.87 |
31: iteration 25810/ 173500 | consumed samples: 6607360 | consumed tokens: 13531873280 | elapsed time per iteration (s): 0.76 | learning rate: 1.914E-04 | global batch size: 256 | lm loss: 2.184533E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.351 | TFLOPs: 20.35 |
31: iteration 25820/ 173500 | consumed samples: 6609920 | consumed tokens: 13537116160 | elapsed time per iteration (s): 0.78 | learning rate: 1.914E-04 | global batch size: 256 | lm loss: 2.181133E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.672 | TFLOPs: 19.76 |
31: iteration 25830/ 173500 | consumed samples: 6612480 | consumed tokens: 13542359040 | elapsed time per iteration (s): 0.80 | learning rate: 1.914E-04 | global batch size: 256 | lm loss: 2.167313E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.889 | TFLOPs: 19.35 |
31: iteration 25840/ 173500 | consumed samples: 6615040 | consumed tokens: 13547601920 | elapsed time per iteration (s): 0.78 | learning rate: 1.914E-04 | global batch size: 256 | lm loss: 2.162998E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.904 | TFLOPs: 19.90 |
31: iteration 25850/ 173500 | consumed samples: 6617600 | consumed tokens: 13552844800 | elapsed time per iteration (s): 0.85 | learning rate: 1.914E-04 | global batch size: 256 | lm loss: 2.183568E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.672 | TFLOPs: 18.25 |
31: iteration 25860/ 173500 | consumed samples: 6620160 | consumed tokens: 13558087680 | elapsed time per iteration (s): 0.84 | learning rate: 1.914E-04 | global batch size: 256 | lm loss: 2.177548E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.865 | TFLOPs: 18.50 |
31: iteration 25870/ 173500 | consumed samples: 6622720 | consumed tokens: 13563330560 | elapsed time per iteration (s): 0.83 | learning rate: 1.914E-04 | global batch size: 256 | lm loss: 2.164701E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.088 | TFLOPs: 18.64 |
31: iteration 25880/ 173500 | consumed samples: 6625280 | consumed tokens: 13568573440 | elapsed time per iteration (s): 0.82 | learning rate: 1.914E-04 | global batch size: 256 | lm loss: 2.142556E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.785 | TFLOPs: 18.80 |
31: iteration 25890/ 173500 | consumed samples: 6627840 | consumed tokens: 13573816320 | elapsed time per iteration (s): 0.86 | learning rate: 1.914E-04 | global batch size: 256 | lm loss: 2.148963E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.216 | TFLOPs: 18.04 |
31: iteration 25900/ 173500 | consumed samples: 6630400 | consumed tokens: 13579059200 | elapsed time per iteration (s): 0.77 | learning rate: 1.914E-04 | global batch size: 256 | lm loss: 2.173362E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.153 | TFLOPs: 20.03 |
31: iteration 25910/ 173500 | consumed samples: 6632960 | consumed tokens: 13584302080 | elapsed time per iteration (s): 0.72 | learning rate: 1.913E-04 | global batch size: 256 | lm loss: 2.171757E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.241 | TFLOPs: 21.61 |
31: iteration 25920/ 173500 | consumed samples: 6635520 | consumed tokens: 13589544960 | elapsed time per iteration (s): 0.77 | learning rate: 1.913E-04 | global batch size: 256 | lm loss: 2.157427E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.286 | TFLOPs: 20.22 |
31: iteration 25930/ 173500 | consumed samples: 6638080 | consumed tokens: 13594787840 | elapsed time per iteration (s): 0.74 | learning rate: 1.913E-04 | global batch size: 256 | lm loss: 2.172640E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.902 | TFLOPs: 21.05 |
31: iteration 25940/ 173500 | consumed samples: 6640640 | consumed tokens: 13600030720 | elapsed time per iteration (s): 0.74 | learning rate: 1.913E-04 | global batch size: 256 | lm loss: 2.153957E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.067 | TFLOPs: 20.82 |
31: iteration 25950/ 173500 | consumed samples: 6643200 | consumed tokens: 13605273600 | elapsed time per iteration (s): 0.81 | learning rate: 1.913E-04 | global batch size: 256 | lm loss: 2.173379E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.384 | TFLOPs: 19.14 |
31: iteration 25960/ 173500 | consumed samples: 6645760 | consumed tokens: 13610516480 | elapsed time per iteration (s): 0.82 | learning rate: 1.913E-04 | global batch size: 256 | lm loss: 2.158445E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.739 | TFLOPs: 18.80 |
31: iteration 25970/ 173500 | consumed samples: 6648320 | consumed tokens: 13615759360 | elapsed time per iteration (s): 0.83 | learning rate: 1.913E-04 | global batch size: 256 | lm loss: 2.175004E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.021 | TFLOPs: 18.70 |
31: iteration 25980/ 173500 | consumed samples: 6650880 | consumed tokens: 13621002240 | elapsed time per iteration (s): 0.80 | learning rate: 1.913E-04 | global batch size: 256 | lm loss: 2.146356E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.518 | TFLOPs: 19.27 |
31: iteration 25990/ 173500 | consumed samples: 6653440 | consumed tokens: 13626245120 | elapsed time per iteration (s): 0.80 | learning rate: 1.913E-04 | global batch size: 256 | lm loss: 2.177748E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.859 | TFLOPs: 19.29 |
0: [2022-11-25 23:56:26,598] [INFO] [logging.py:68:log_dist] [Rank 0] step=26000, skipped=0, lr=[0.00019128112529201118, 0.00019128112529201118, 0.00019128112529201118], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
31: iteration 26000/ 173500 | consumed samples: 6656000 | consumed tokens: 13631488000 | elapsed time per iteration (s): 0.82 | learning rate: 1.913E-04 | global batch size: 256 | lm loss: 2.184205E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.384 | TFLOPs: 18.90 |
0: steps: 26000 loss: 2.2173 iter time (s): 0.794 samples/sec: 322.218
31: -------------------------------------------------------------------------------------------
31: valid loss at iteration 26000 | lm loss value: 2.083334E+00 | lm loss PPL: 8.031198E+00 |
0: saving checkpoint at iteration 26000 to checkpoints_1b1long
31: -------------------------------------------------------------------------------------------
0: [2022-11-25 23:56:26,858] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step26000 is begin to save!
0: [2022-11-25 23:56:26,869] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_01-model_00-model_states.pt...
0: [2022-11-25 23:56:27,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_01-model_00-model_states.pt.
0: [2022-11-25 23:56:27,155] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_03-model_00-model_states.pt...
0: [2022-11-25 23:56:27,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_03-model_00-model_states.pt.
0: [2022-11-25 23:56:27,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_04-model_00-model_states.pt...
0: [2022-11-25 23:56:27,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_04-model_00-model_states.pt.
0: [2022-11-25 23:56:27,314] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_05-model_00-model_states.pt...
0: [2022-11-25 23:56:27,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_05-model_00-model_states.pt.
0: [2022-11-25 23:56:27,389] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_06-model_00-model_states.pt...
0: [2022-11-25 23:56:27,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_06-model_00-model_states.pt.
0: [2022-11-25 23:56:27,467] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_07-model_00-model_states.pt...
0: [2022-11-25 23:56:27,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_07-model_00-model_states.pt.
0: [2022-11-25 23:56:27,540] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_08-model_00-model_states.pt...
0: [2022-11-25 23:56:27,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_08-model_00-model_states.pt.
0: [2022-11-25 23:56:27,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_09-model_00-model_states.pt...
0: [2022-11-25 23:56:27,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_09-model_00-model_states.pt.
0: [2022-11-25 23:56:27,697] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_10-model_00-model_states.pt...
0: [2022-11-25 23:56:27,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_10-model_00-model_states.pt.
0: [2022-11-25 23:56:27,774] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_11-model_00-model_states.pt...
0: [2022-11-25 23:56:27,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_11-model_00-model_states.pt.
0: [2022-11-25 23:56:27,849] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_12-model_00-model_states.pt...
0: [2022-11-25 23:56:27,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_12-model_00-model_states.pt.
0: [2022-11-25 23:56:27,926] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_13-model_00-model_states.pt...
0: [2022-11-25 23:56:28,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_13-model_00-model_states.pt.
0: [2022-11-25 23:56:28,003] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_14-model_00-model_states.pt...
0: [2022-11-25 23:56:28,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_14-model_00-model_states.pt.
0: [2022-11-25 23:56:28,078] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_15-model_00-model_states.pt...
0: [2022-11-25 23:56:28,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_15-model_00-model_states.pt.
0: [2022-11-25 23:56:28,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_16-model_00-model_states.pt...
0: [2022-11-25 23:56:28,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_16-model_00-model_states.pt.
0: [2022-11-25 23:56:28,226] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_17-model_00-model_states.pt...
0: [2022-11-25 23:56:28,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_17-model_00-model_states.pt.
0: [2022-11-25 23:56:28,303] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_18-model_00-model_states.pt...
0: [2022-11-25 23:56:28,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_18-model_00-model_states.pt.
0: [2022-11-25 23:56:28,377] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_19-model_00-model_states.pt...
0: [2022-11-25 23:56:28,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_19-model_00-model_states.pt.
0: [2022-11-25 23:56:28,450] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_20-model_00-model_states.pt...
0: [2022-11-25 23:56:28,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_20-model_00-model_states.pt.
0: [2022-11-25 23:56:28,526] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_21-model_00-model_states.pt...
0: [2022-11-25 23:56:28,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_21-model_00-model_states.pt.
0: [2022-11-25 23:56:28,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_22-model_00-model_states.pt...
0: [2022-11-25 23:56:28,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_22-model_00-model_states.pt.
0: [2022-11-25 23:56:28,678] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_23-model_00-model_states.pt...
0: [2022-11-25 23:56:28,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_23-model_00-model_states.pt.
0: [2022-11-25 23:56:28,752] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_24-model_00-model_states.pt...
0: [2022-11-25 23:56:28,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_24-model_00-model_states.pt.
0: [2022-11-25 23:56:28,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_25-model_00-model_states.pt...
0: [2022-11-25 23:56:28,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_25-model_00-model_states.pt.
0: [2022-11-25 23:56:28,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_26-model_00-model_states.pt...
0: [2022-11-25 23:56:28,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_26-model_00-model_states.pt.
0: [2022-11-25 23:56:28,977] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_27-model_00-model_states.pt...
0: [2022-11-25 23:56:29,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_27-model_00-model_states.pt.
0: [2022-11-25 23:56:29,055] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_28-model_00-model_states.pt...
0: [2022-11-25 23:56:29,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_28-model_00-model_states.pt.
0: [2022-11-25 23:56:29,127] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/layer_30-model_00-model_states.pt...
0: [2022-11-25 23:56:29,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/layer_30-model_00-model_states.pt.
0: [2022-11-25 23:56:29,132] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step26000/mp_rank_00_model_states.pt
0: [2022-11-25 23:56:29,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/mp_rank_00_model_states.pt...
0: [2022-11-25 23:56:29,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/mp_rank_00_model_states.pt.
0: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt...
0: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
7: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 
8: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
2: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
15: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 
23: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 
24: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 
22: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 
18: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 
19: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 
4: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
10: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 
2: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
15: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
14: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 
17: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 
19: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 
10: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 
23: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 
22: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 
27: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
15: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 25: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 27: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
9: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 23: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 28: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 
0: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 30: [2022-11-25 23:56:29,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 21: [2022-11-25 23:56:29,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:56:29,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-25 23:56:29,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 1: [2022-11-25 23:56:29,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:56:29,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 19: [2022-11-25 23:56:29,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:56:29,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 19: [2022-11-25 23:56:29,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-25 23:56:29,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 23: [2022-11-25 23:56:29,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-25 23:56:29,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-25 23:56:29,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 20: [2022-11-25 23:56:29,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:56:29,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-25 23:56:29,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 31: [2022-11-25 23:56:29,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:56:29,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-25 23:56:29,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 25: [2022-11-25 23:56:29,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:56:29,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-25 23:56:29,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 14: [2022-11-25 23:56:29,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-25 23:56:29,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 23:56:29,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 0: [2022-11-25 23:56:29,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:56:29,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 23:56:29,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 8: [2022-11-25 23:56:29,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:56:29,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 23:56:29,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 31: [2022-11-25 23:56:29,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:56:29,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-25 23:56:29,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 19: [2022-11-25 23:56:29,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
19: [2022-11-25 23:56:29,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-25 23:56:29,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 9: [2022-11-25 23:56:29,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:56:29,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 23:56:29,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 20: [2022-11-25 23:56:29,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:56:29,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:56:29,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
20: [2022-11-25 23:56:29,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 30: [2022-11-25 23:56:29,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 9: [2022-11-25 23:56:29,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 20: [2022-11-25 23:56:29,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 30: [2022-11-25 23:56:29,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 9: [2022-11-25 23:56:29,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 6: [2022-11-25 23:56:29,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:56:29,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:56:29,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 23:56:29,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 23:56:29,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 6: [2022-11-25 23:56:29,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
30: [2022-11-25 23:56:29,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:56:29,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 21: [2022-11-25 23:56:29,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:56:29,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 21: [2022-11-25 23:56:29,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-25 23:56:29,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 24: [2022-11-25 23:56:29,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:56:29,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:56:29,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-25 23:56:29,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-25 23:56:29,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 24: [2022-11-25 23:56:29,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
14: [2022-11-25 23:56:29,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:56:29,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:56:29,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 23:56:29,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 17: [2022-11-25 23:56:29,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-25 23:56:29,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 17: [2022-11-25 23:56:29,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:56:29,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:56:29,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 22: [2022-11-25 23:56:29,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 17: [2022-11-25 23:56:29,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 22: [2022-11-25 23:56:29,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
22: [2022-11-25 23:56:29,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:56:29,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-25 23:56:29,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 8: [2022-11-25 23:56:29,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:56:29,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:56:29,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 23:56:29,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 23: [2022-11-25 23:56:29,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-25 23:56:29,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 7: [2022-11-25 23:56:29,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:56:29,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 23:56:29,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
1: [2022-11-25 23:56:29,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:56:29,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 23:56:29,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 7: [2022-11-25 23:56:29,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:56:29,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 23: [2022-11-25 23:56:29,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:56:29,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 12: [2022-11-25 23:56:29,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:56:29,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 7: [2022-11-25 23:56:29,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:56:29,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 23: [2022-11-25 23:56:29,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
7: [2022-11-25 23:56:29,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 8: [2022-11-25 23:56:29,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:56:29,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:56:29,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 7: [2022-11-25 23:56:29,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 3: [2022-11-25 23:56:29,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 8: [2022-11-25 23:56:29,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 10: [2022-11-25 23:56:29,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:56:29,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 8: [2022-11-25 23:56:29,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 10: [2022-11-25 23:56:29,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 23:56:29,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
27: [2022-11-25 23:56:29,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:56:29,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:56:29,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 6: [2022-11-25 23:56:29,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:56:29,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 23:56:29,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 1: [2022-11-25 23:56:29,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 27: [2022-11-25 23:56:29,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 1: [2022-11-25 23:56:29,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 21: [2022-11-25 23:56:29,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:56:29,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-25 23:56:29,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
20: [2022-11-25 23:56:29,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:56:29,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-25 23:56:29,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 15: [2022-11-25 23:56:29,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:56:29,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:56:29,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 23:56:29,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 23:56:29,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 15: [2022-11-25 23:56:29,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 14: [2022-11-25 23:56:29,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:56:29,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
14: [2022-11-25 23:56:29,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 19: [2022-11-25 23:56:29,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-25 23:56:29,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 14: [2022-11-25 23:56:29,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 25: [2022-11-25 23:56:29,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:56:29,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 0: [2022-11-25 23:56:29,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:56:29,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 24: [2022-11-25 23:56:29,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:56:29,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 23:56:29,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 31: [2022-11-25 23:56:29,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
24: [2022-11-25 23:56:29,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 31: [2022-11-25 23:56:29,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-25 23:56:29,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 24: [2022-11-25 23:56:29,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 9: [2022-11-25 23:56:29,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:56:29,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 23:56:29,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 9: [2022-11-25 23:56:29,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:56:29,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 23:56:29,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 20: [2022-11-25 23:56:29,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-25 23:56:29,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-25 23:56:29,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 29: [2022-11-25 23:56:29,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:56:29,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-25 23:56:29,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 10: [2022-11-25 23:56:29,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:56:29,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 23:56:29,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 3: [2022-11-25 23:56:29,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:56:29,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 23:56:29,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 13: [2022-11-25 23:56:29,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-25 23:56:29,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 23:56:29,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 13: [2022-11-25 23:56:29,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:56:29,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 23:56:29,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 29: [2022-11-25 23:56:29,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:56:29,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 13: [2022-11-25 23:56:29,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:56:29,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 13: [2022-11-25 23:56:29,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-25 23:56:29,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 23:56:29,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 23:56:29,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 13: [2022-11-25 23:56:29,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 23: [2022-11-25 23:56:29,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:56:29,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-25 23:56:29,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 19: [2022-11-25 23:56:29,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:56:29,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-25 23:56:29,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 6: [2022-11-25 23:56:29,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-25 23:56:29,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 23:56:29,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 12: [2022-11-25 23:56:29,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:56:29,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 23:56:29,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 21: [2022-11-25 23:56:29,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:56:29,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-25 23:56:29,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 2: [2022-11-25 23:56:29,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:56:29,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:56:29,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:56:29,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-25 23:56:29,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:56:29,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 4: [2022-11-25 23:56:29,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 23:56:29,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 2: [2022-11-25 23:56:29,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 23:56:29,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 4: [2022-11-25 23:56:29,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:56:29,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 2: [2022-11-25 23:56:29,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 4: [2022-11-25 23:56:29,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 2: [2022-11-25 23:56:29,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 2: [2022-11-25 23:56:29,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
4: [2022-11-25 23:56:29,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 23:56:29,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 30: [2022-11-25 23:56:29,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:56:29,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-25 23:56:29,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 8: [2022-11-25 23:56:29,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:56:29,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 23:56:29,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 2: [2022-11-25 23:56:29,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:56:29,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 23:56:29,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 22: [2022-11-25 23:56:29,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
22: [2022-11-25 23:56:29,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-25 23:56:29,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 22: [2022-11-25 23:56:29,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:56:29,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:56:29,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 10: [2022-11-25 23:56:29,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 22: [2022-11-25 23:56:29,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 10: [2022-11-25 23:56:29,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 31: [2022-11-25 23:56:29,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:56:29,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-25 23:56:29,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 18: [2022-11-25 23:56:29,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
18: [2022-11-25 23:56:29,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-25 23:56:29,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 17: [2022-11-25 23:56:29,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:56:29,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-25 23:56:29,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 3: [2022-11-25 23:56:29,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:56:29,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 23:56:29,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 29: [2022-11-25 23:56:29,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:56:29,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 17: [2022-11-25 23:56:29,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:56:29,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
17: [2022-11-25 23:56:29,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-25 23:56:29,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 7: [2022-11-25 23:56:29,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:56:29,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 23:56:29,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 30: [2022-11-25 23:56:29,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:56:29,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-25 23:56:29,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 0: [2022-11-25 23:56:29,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:56:29,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:56:29,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 23:56:29,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
26: [2022-11-25 23:56:29,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:56:29,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:56:29,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:56:29,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:56:29,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-25 23:56:29,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-25 23:56:29,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-25 23:56:29,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-25 23:56:29,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 26: [2022-11-25 23:56:29,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 26: [2022-11-25 23:56:29,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 26: [2022-11-25 23:56:29,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
12: [2022-11-25 23:56:29,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:56:29,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 23:56:29,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 0: [2022-11-25 23:56:29,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:56:29,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 23:56:29,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 1: [2022-11-25 23:56:29,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:56:29,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 23:56:29,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 24: [2022-11-25 23:56:29,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:56:29,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-25 23:56:29,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
25: [2022-11-25 23:56:29,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:56:29,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 19: [2022-11-25 23:56:29,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:56:29,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 19: [2022-11-25 23:56:29,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-25 23:56:29,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 18: [2022-11-25 23:56:29,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:56:29,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:56:29,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-25 23:56:29,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-25 23:56:29,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 18: [2022-11-25 23:56:29,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
11: [2022-11-25 23:56:29,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:56:29,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:56:29,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 23:56:29,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 11: [2022-11-25 23:56:29,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 23:56:29,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 18: [2022-11-25 23:56:29,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:56:29,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-25 23:56:29,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 18: [2022-11-25 23:56:29,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:56:29,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-25 23:56:29,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
26: [2022-11-25 23:56:29,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:56:29,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-25 23:56:29,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 1: [2022-11-25 23:56:29,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:56:29,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 20: [2022-11-25 23:56:29,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:56:29,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 1: [2022-11-25 23:56:29,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 20: [2022-11-25 23:56:29,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 25: [2022-11-25 23:56:29,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:56:29,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-25 23:56:29,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
25: [2022-11-25 23:56:29,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:56:29,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-25 23:56:29,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 27: [2022-11-25 23:56:29,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:56:29,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-25 23:56:29,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 17: [2022-11-25 23:56:29,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:56:29,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-25 23:56:29,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 13: [2022-11-25 23:56:29,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:56:29,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 23:56:29,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
5: [2022-11-25 23:56:29,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:56:29,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:56:29,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:56:29,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:56:29,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 23:56:29,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 23:56:29,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:56:29,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 23:56:29,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 5: [2022-11-25 23:56:29,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 5: [2022-11-25 23:56:29,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 23:56:29,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
5: [2022-11-25 23:56:29,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 11: [2022-11-25 23:56:29,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:56:29,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 5: [2022-11-25 23:56:29,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 11: [2022-11-25 23:56:29,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 23:56:29,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 15: [2022-11-25 23:56:29,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:56:29,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 23:56:29,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 16: [2022-11-25 23:56:29,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:56:29,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:56:29,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
16: [2022-11-25 23:56:29,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:56:29,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-25 23:56:29,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-25 23:56:29,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-25 23:56:29,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-25 23:56:29,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 16: [2022-11-25 23:56:29,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 16: [2022-11-25 23:56:29,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 16: [2022-11-25 23:56:29,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 27: [2022-11-25 23:56:29,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:56:29,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-25 23:56:29,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
13: [2022-11-25 23:56:29,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 23:56:29,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 9: [2022-11-25 23:56:29,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:56:29,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 23:56:29,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 2: [2022-11-25 23:56:29,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:56:29,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 23:56:29,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 14: [2022-11-25 23:56:29,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:56:29,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 23:56:29,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 15: [2022-11-25 23:56:29,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-25 23:56:29,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 23:56:29,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 11: [2022-11-25 23:56:29,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:56:29,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 23:56:29,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 24: [2022-11-25 23:56:29,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:56:29,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-25 23:56:29,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 28: [2022-11-25 23:56:29,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:56:29,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:56:29,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:56:29,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
28: [2022-11-25 23:56:29,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:56:29,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-25 23:56:29,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-25 23:56:29,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-25 23:56:29,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-25 23:56:29,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-25 23:56:29,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 28: [2022-11-25 23:56:29,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 28: [2022-11-25 23:56:29,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 28: [2022-11-25 23:56:29,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 28: [2022-11-25 23:56:29,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 21: [2022-11-25 23:56:29,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-25 23:56:29,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-25 23:56:29,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 0: [2022-11-25 23:56:29,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:56:29,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 23:56:29,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 31: [2022-11-25 23:56:29,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:56:29,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-25 23:56:29,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 6: [2022-11-25 23:56:29,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:56:29,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 23:56:29,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 10: [2022-11-25 23:56:29,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-25 23:56:29,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 23:56:29,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 22: [2022-11-25 23:56:29,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:56:29,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-25 23:56:29,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 29: [2022-11-25 23:56:29,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:56:29,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-25 23:56:29,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 7: [2022-11-25 23:56:29,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:56:29,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:56:29,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:56:29,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
7: [2022-11-25 23:56:29,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 4: [2022-11-25 23:56:29,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 12: [2022-11-25 23:56:29,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 15: [2022-11-25 23:56:29,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 7: [2022-11-25 23:56:29,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 4: [2022-11-25 23:56:29,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 12: [2022-11-25 23:56:29,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 15: [2022-11-25 23:56:29,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 23: [2022-11-25 23:56:29,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:56:29,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 27: [2022-11-25 23:56:29,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:56:29,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
8: [2022-11-25 23:56:29,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:56:29,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:56:29,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 23:56:29,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 3: [2022-11-25 23:56:29,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 27: [2022-11-25 23:56:29,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 3: [2022-11-25 23:56:29,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 27: [2022-11-25 23:56:29,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 16: [2022-11-25 23:56:29,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:56:29,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-25 23:56:29,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 20: [2022-11-25 23:56:29,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
20: [2022-11-25 23:56:29,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-25 23:56:29,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 1: [2022-11-25 23:56:29,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:56:29,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 23:56:29,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 25: [2022-11-25 23:56:29,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:56:29,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:56:29,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 5: [2022-11-25 23:56:29,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 23:56:29,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 25: [2022-11-25 23:56:29,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 19: [2022-11-25 23:56:29,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-25 23:56:29,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-25 23:56:29,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 18: [2022-11-25 23:56:29,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:56:29,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-25 23:56:29,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 9: [2022-11-25 23:56:29,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:56:29,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 23:56:29,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 26: [2022-11-25 23:56:29,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:56:29,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-25 23:56:29,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 21: [2022-11-25 23:56:29,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-25 23:56:29,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-25 23:56:29,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 31: [2022-11-25 23:56:29,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:56:29,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-25 23:56:29,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 2: [2022-11-25 23:56:29,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:56:29,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 23:56:29,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 22: [2022-11-25 23:56:29,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:56:29,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 14: [2022-11-25 23:56:29,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:56:29,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
0: [2022-11-25 23:56:29,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:56:29,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 23:56:29,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 30: [2022-11-25 23:56:29,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:56:29,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 30: [2022-11-25 23:56:29,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-25 23:56:29,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 0: [2022-11-25 23:56:29,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 28: [2022-11-25 23:56:29,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:56:29,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:56:29,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-25 23:56:29,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
24: [2022-11-25 23:56:29,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:56:29,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-25 23:56:29,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 6: [2022-11-25 23:56:29,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:56:29,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 23:56:29,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 10: [2022-11-25 23:56:29,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:56:29,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 23:56:29,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 29: [2022-11-25 23:56:29,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:56:29,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-25 23:56:29,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
30: [2022-11-25 23:56:29,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:56:29,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-25 23:56:29,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 3: [2022-11-25 23:56:29,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:56:29,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 23:56:29,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 12: [2022-11-25 23:56:29,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:56:29,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 23:56:29,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 15: [2022-11-25 23:56:29,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:56:29,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 8: [2022-11-25 23:56:29,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
15: [2022-11-25 23:56:29,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 4: [2022-11-25 23:56:29,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:56:29,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 8: [2022-11-25 23:56:29,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 4: [2022-11-25 23:56:29,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 8: [2022-11-25 23:56:29,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 7: [2022-11-25 23:56:29,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:56:29,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 23:56:29,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 16: [2022-11-25 23:56:29,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:56:29,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-25 23:56:29,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
11: [2022-11-25 23:56:29,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:56:29,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 23:56:29,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:56:29,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 1: [2022-11-25 23:56:29,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:56:29,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 11: [2022-11-25 23:56:29,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 28: [2022-11-25 23:56:29,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 11: [2022-11-25 23:56:29,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 1: [2022-11-25 23:56:29,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 23:56:29,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 27: [2022-11-25 23:56:29,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
27: [2022-11-25 23:56:29,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-25 23:56:29,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 23: [2022-11-25 23:56:29,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:56:29,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-25 23:56:29,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 20: [2022-11-25 23:56:29,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:56:29,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-25 23:56:29,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 5: [2022-11-25 23:56:29,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:56:29,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 23:56:29,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 19: [2022-11-25 23:56:29,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
25: [2022-11-25 23:56:29,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 19: [2022-11-25 23:56:29,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-25 23:56:29,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 25: [2022-11-25 23:56:29,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-25 23:56:29,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 13: [2022-11-25 23:56:29,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:56:29,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 23:56:29,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 9: [2022-11-25 23:56:29,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:56:29,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 23:56:29,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 18: [2022-11-25 23:56:29,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-25 23:56:29,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-25 23:56:29,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 17: [2022-11-25 23:56:29,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:56:29,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-25 23:56:29,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 13: [2022-11-25 23:56:29,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:56:29,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 23:56:29,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 21: [2022-11-25 23:56:29,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 26: [2022-11-25 23:56:29,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:56:29,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-25 23:56:29,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
26: [2022-11-25 23:56:29,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-25 23:56:29,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 31: [2022-11-25 23:56:29,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:56:29,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-25 23:56:29,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 0: [2022-11-25 23:56:29,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:56:29,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 23:56:29,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 2: [2022-11-25 23:56:29,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:56:29,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 23:56:29,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 14: [2022-11-25 23:56:29,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-25 23:56:29,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 23:56:29,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 7: [2022-11-25 23:56:29,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:56:29,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 23:56:29,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 28: [2022-11-25 23:56:29,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:56:29,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:56:29,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-25 23:56:29,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 23: [2022-11-25 23:56:29,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-25 23:56:29,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 6: [2022-11-25 23:56:29,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-25 23:56:29,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 23:56:29,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 24: [2022-11-25 23:56:29,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:56:29,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-25 23:56:29,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 22: [2022-11-25 23:56:29,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-25 23:56:29,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-25 23:56:29,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 30: [2022-11-25 23:56:29,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:56:29,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-25 23:56:29,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 15: [2022-11-25 23:56:29,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
2: [2022-11-25 23:56:29,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:56:29,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 23:56:29,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 2: [2022-11-25 23:56:29,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 5: [2022-11-25 23:56:29,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:56:29,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 5: [2022-11-25 23:56:29,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 23:56:29,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 11: [2022-11-25 23:56:29,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:56:29,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:56:29,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 23:56:29,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
29: [2022-11-25 23:56:29,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-25 23:56:29,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 25: [2022-11-25 23:56:29,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-25 23:56:29,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-25 23:56:29,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 10: [2022-11-25 23:56:29,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:56:29,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:56:29,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 23:56:29,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 13: [2022-11-25 23:56:29,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 23:56:29,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 14: [2022-11-25 23:56:29,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
7: [2022-11-25 23:56:29,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:56:29,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:56:29,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 14: [2022-11-25 23:56:29,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 7: [2022-11-25 23:56:29,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 4: [2022-11-25 23:56:29,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 14: [2022-11-25 23:56:29,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 7: [2022-11-25 23:56:29,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 27: [2022-11-25 23:56:29,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:56:29,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-25 23:56:29,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 1: [2022-11-25 23:56:29,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
0: [2022-11-25 23:56:29,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 21: [2022-11-25 23:56:29,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:56:29,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 1: [2022-11-25 23:56:29,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 23:56:29,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 21: [2022-11-25 23:56:29,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 0: [2022-11-25 23:56:29,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 21: [2022-11-25 23:56:29,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 20: [2022-11-25 23:56:29,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-25 23:56:29,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-25 23:56:29,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 28: [2022-11-25 23:56:29,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
9: [2022-11-25 23:56:29,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 28: [2022-11-25 23:56:29,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 9: [2022-11-25 23:56:29,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 19: [2022-11-25 23:56:29,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:56:29,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 19: [2022-11-25 23:56:29,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-25 23:56:29,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 22: [2022-11-25 23:56:29,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:56:29,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 
22: [2022-11-25 23:56:29,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 16: [2022-11-25 23:56:29,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 22: [2022-11-25 23:56:29,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 16: [2022-11-25 23:56:29,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 23: [2022-11-25 23:56:29,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-25 23:56:29,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-25 23:56:29,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 31: [2022-11-25 23:56:29,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:56:29,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 26: [2022-11-25 23:56:29,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 31: [2022-11-25 23:56:29,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
26: [2022-11-25 23:56:29,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-25 23:56:29,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 28: [2022-11-25 23:56:29,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 6: [2022-11-25 23:56:29,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:56:29,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 23:56:29,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 18: [2022-11-25 23:56:29,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-25 23:56:29,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-25 23:56:29,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 24: [2022-11-25 23:56:29,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-25 23:56:29,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-25 23:56:29,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
8: [2022-11-25 23:56:29,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:56:29,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 23:56:29,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 17: [2022-11-25 23:56:29,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-25 23:56:29,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-25 23:56:29,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 15: [2022-11-25 23:56:29,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:56:29,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:56:29,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 15: [2022-11-25 23:56:29,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 12: [2022-11-25 23:56:29,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 15: [2022-11-25 23:56:29,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
10: [2022-11-25 23:56:29,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:56:29,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 23:56:29,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 4: [2022-11-25 23:56:29,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:56:29,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:56:29,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 23:56:29,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 23:56:29,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 4: [2022-11-25 23:56:29,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 11: [2022-11-25 23:56:29,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:56:29,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 23:56:29,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
8: [2022-11-25 23:56:29,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:56:29,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 23:56:29,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 29: [2022-11-25 23:56:29,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 30: [2022-11-25 23:56:29,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:56:29,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-25 23:56:29,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 30: [2022-11-25 23:56:29,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-25 23:56:29,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 16: [2022-11-25 23:56:29,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-25 23:56:29,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-25 23:56:29,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
10: [2022-11-25 23:56:29,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:56:29,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:56:29,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 23:56:29,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 12: [2022-11-25 23:56:29,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 23:56:29,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 3: [2022-11-25 23:56:29,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:56:29,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 23:56:29,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 29: [2022-11-25 23:56:29,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-25 23:56:29,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-25 23:56:29,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
27: [2022-11-25 23:56:29,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:56:29,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-25 23:56:29,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 12: [2022-11-25 23:56:29,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:56:29,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 23:56:29,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 27: [2022-11-25 23:56:29,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-25 23:56:29,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-25 23:56:29,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 3: [2022-11-25 23:56:29,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:56:29,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-25 23:56:29,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 23:56:29,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step26000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 23:56:29,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 3: [2022-11-25 23:56:29,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 0: successfully saved checkpoint at iteration 26000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2596.16 31: iteration 26010/ 173500 | consumed samples: 6658560 | consumed tokens: 13636730880 | elapsed time per iteration (s): 1.03 | learning rate: 1.913E-04 | global batch size: 256 | lm loss: 2.148468E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.973 | TFLOPs: 15.06 | 31: iteration 26020/ 173500 | consumed samples: 6661120 | consumed tokens: 13641973760 | elapsed time per iteration (s): 0.77 | learning rate: 1.913E-04 | global batch size: 256 | lm loss: 2.166554E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.451 | TFLOPs: 20.05 | 31: iteration 26030/ 173500 | consumed samples: 6663680 | consumed tokens: 13647216640 | elapsed time per iteration (s): 0.77 | learning rate: 1.913E-04 | global batch size: 256 | lm loss: 2.191616E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.129 | TFLOPs: 20.15 | 31: iteration 26040/ 173500 | consumed samples: 6666240 | consumed tokens: 13652459520 | elapsed time per iteration (s): 0.80 | learning rate: 1.913E-04 | global batch size: 256 | lm loss: 2.150409E+00 | grad 
norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.109 | TFLOPs: 19.37 | 31: iteration 26050/ 173500 | consumed samples: 6668800 | consumed tokens: 13657702400 | elapsed time per iteration (s): 0.75 | learning rate: 1.912E-04 | global batch size: 256 | lm loss: 2.163010E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.180 | TFLOPs: 20.64 | 31: iteration 26060/ 173500 | consumed samples: 6671360 | consumed tokens: 13662945280 | elapsed time per iteration (s): 0.79 | learning rate: 1.912E-04 | global batch size: 256 | lm loss: 2.194195E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.725 | TFLOPs: 19.52 | 31: iteration 26070/ 173500 | consumed samples: 6673920 | consumed tokens: 13668188160 | elapsed time per iteration (s): 0.86 | learning rate: 1.912E-04 | global batch size: 256 | lm loss: 2.157734E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.106 | TFLOPs: 18.10 | 31: iteration 26080/ 173500 | consumed samples: 6676480 | consumed tokens: 13673431040 | elapsed time per iteration (s): 0.76 | learning rate: 1.912E-04 | global batch size: 256 | lm loss: 2.165757E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.040 | TFLOPs: 20.33 | 31: iteration 26090/ 173500 | consumed samples: 6679040 | consumed tokens: 13678673920 | elapsed time per iteration (s): 0.76 | learning rate: 1.912E-04 | global batch size: 256 | lm loss: 2.145633E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.402 | TFLOPs: 20.35 | 31: iteration 26100/ 173500 | consumed samples: 6681600 | consumed tokens: 13683916800 | elapsed time per 
iteration (s): 0.86 | learning rate: 1.912E-04 | global batch size: 256 | lm loss: 2.169445E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.189 | TFLOPs: 18.04 | 31: iteration 26110/ 173500 | consumed samples: 6684160 | consumed tokens: 13689159680 | elapsed time per iteration (s): 0.84 | learning rate: 1.912E-04 | global batch size: 256 | lm loss: 2.171185E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.424 | TFLOPs: 18.54 | 31: iteration 26120/ 173500 | consumed samples: 6686720 | consumed tokens: 13694402560 | elapsed time per iteration (s): 0.74 | learning rate: 1.912E-04 | global batch size: 256 | lm loss: 2.157520E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.711 | TFLOPs: 20.91 | 31: iteration 26130/ 173500 | consumed samples: 6689280 | consumed tokens: 13699645440 | elapsed time per iteration (s): 0.78 | learning rate: 1.912E-04 | global batch size: 256 | lm loss: 2.157121E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.842 | TFLOPs: 19.77 | 31: iteration 26140/ 173500 | consumed samples: 6691840 | consumed tokens: 13704888320 | elapsed time per iteration (s): 0.84 | learning rate: 1.912E-04 | global batch size: 256 | lm loss: 2.153206E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.252 | TFLOPs: 18.47 | 31: iteration 26150/ 173500 | consumed samples: 6694400 | consumed tokens: 13710131200 | elapsed time per iteration (s): 0.81 | learning rate: 1.912E-04 | global batch size: 256 | lm loss: 2.147279E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.697 | TFLOPs: 19.22 | 31: 
iteration 26160/ 173500 | consumed samples: 6696960 | consumed tokens: 13715374080 | elapsed time per iteration (s): 0.85 | learning rate: 1.912E-04 | global batch size: 256 | lm loss: 2.127138E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.365 | TFLOPs: 18.29 | 31: iteration 26170/ 173500 | consumed samples: 6699520 | consumed tokens: 13720616960 | elapsed time per iteration (s): 0.72 | learning rate: 1.912E-04 | global batch size: 256 | lm loss: 2.158919E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.806 | TFLOPs: 21.46 | 31: iteration 26180/ 173500 | consumed samples: 6702080 | consumed tokens: 13725859840 | elapsed time per iteration (s): 0.77 | learning rate: 1.912E-04 | global batch size: 256 | lm loss: 2.156679E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.890 | TFLOPs: 20.14 | 31: iteration 26190/ 173500 | consumed samples: 6704640 | consumed tokens: 13731102720 | elapsed time per iteration (s): 0.81 | learning rate: 1.911E-04 | global batch size: 256 | lm loss: 2.160301E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.640 | TFLOPs: 19.16 | 31: iteration 26200/ 173500 | consumed samples: 6707200 | consumed tokens: 13736345600 | elapsed time per iteration (s): 0.81 | learning rate: 1.911E-04 | global batch size: 256 | lm loss: 2.167994E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.320 | TFLOPs: 19.02 | 31: iteration 26210/ 173500 | consumed samples: 6709760 | consumed tokens: 13741588480 | elapsed time per iteration (s): 2.78 | learning rate: 1.911E-04 | global batch size: 256 | lm loss: 2.170232E+00 | grad norm: 0.150 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 92.034 | TFLOPs: 5.57 | 31: iteration 26220/ 173500 | consumed samples: 6712320 | consumed tokens: 13746831360 | elapsed time per iteration (s): 0.72 | learning rate: 1.911E-04 | global batch size: 256 | lm loss: 2.188915E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.339 | TFLOPs: 21.44 | 31: iteration 26230/ 173500 | consumed samples: 6714880 | consumed tokens: 13752074240 | elapsed time per iteration (s): 0.79 | learning rate: 1.911E-04 | global batch size: 256 | lm loss: 2.151362E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.914 | TFLOPs: 19.66 | 31: iteration 26240/ 173500 | consumed samples: 6717440 | consumed tokens: 13757317120 | elapsed time per iteration (s): 0.85 | learning rate: 1.911E-04 | global batch size: 256 | lm loss: 2.148383E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.579 | TFLOPs: 18.24 | 31: iteration 26250/ 173500 | consumed samples: 6720000 | consumed tokens: 13762560000 | elapsed time per iteration (s): 0.81 | learning rate: 1.911E-04 | global batch size: 256 | lm loss: 2.148402E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.566 | TFLOPs: 19.03 | 31: iteration 26260/ 173500 | consumed samples: 6722560 | consumed tokens: 13767802880 | elapsed time per iteration (s): 0.81 | learning rate: 1.911E-04 | global batch size: 256 | lm loss: 2.150699E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.451 | TFLOPs: 19.08 | 31: iteration 26270/ 173500 | consumed samples: 6725120 | consumed tokens: 13773045760 | elapsed time per iteration (s): 0.81 | learning rate: 
1.911E-04 | global batch size: 256 | lm loss: 2.153250E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.301 | TFLOPs: 19.14 | 31: iteration 26280/ 173500 | consumed samples: 6727680 | consumed tokens: 13778288640 | elapsed time per iteration (s): 0.82 | learning rate: 1.911E-04 | global batch size: 256 | lm loss: 2.160048E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.540 | TFLOPs: 18.97 | 31: iteration 26290/ 173500 | consumed samples: 6730240 | consumed tokens: 13783531520 | elapsed time per iteration (s): 0.81 | learning rate: 1.911E-04 | global batch size: 256 | lm loss: 2.179309E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.062 | TFLOPs: 19.18 | 31: iteration 26300/ 173500 | consumed samples: 6732800 | consumed tokens: 13788774400 | elapsed time per iteration (s): 0.86 | learning rate: 1.911E-04 | global batch size: 256 | lm loss: 2.163106E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.883 | TFLOPs: 17.96 | 31: iteration 26310/ 173500 | consumed samples: 6735360 | consumed tokens: 13794017280 | elapsed time per iteration (s): 0.83 | learning rate: 1.911E-04 | global batch size: 256 | lm loss: 2.168655E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.275 | TFLOPs: 18.59 | 31: iteration 26320/ 173500 | consumed samples: 6737920 | consumed tokens: 13799260160 | elapsed time per iteration (s): 0.85 | learning rate: 1.911E-04 | global batch size: 256 | lm loss: 2.171061E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.462 | TFLOPs: 18.12 | 31: iteration 26330/ 173500 | consumed 
samples: 6740480 | consumed tokens: 13804503040 | elapsed time per iteration (s): 0.81 | learning rate: 1.910E-04 | global batch size: 256 | lm loss: 2.194585E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.488 | TFLOPs: 19.21 | 31: iteration 26340/ 173500 | consumed samples: 6743040 | consumed tokens: 13809745920 | elapsed time per iteration (s): 0.89 | learning rate: 1.910E-04 | global batch size: 256 | lm loss: 2.178124E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.261 | TFLOPs: 17.50 | 31: iteration 26350/ 173500 | consumed samples: 6745600 | consumed tokens: 13814988800 | elapsed time per iteration (s): 1.15 | learning rate: 1.910E-04 | global batch size: 256 | lm loss: 2.160855E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 223.538 | TFLOPs: 13.52 | 31: iteration 26360/ 173500 | consumed samples: 6748160 | consumed tokens: 13820231680 | elapsed time per iteration (s): 0.90 | learning rate: 1.910E-04 | global batch size: 256 | lm loss: 2.129441E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.672 | TFLOPs: 17.28 | 31: iteration 26370/ 173500 | consumed samples: 6750720 | consumed tokens: 13825474560 | elapsed time per iteration (s): 0.92 | learning rate: 1.910E-04 | global batch size: 256 | lm loss: 2.168709E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 277.817 | TFLOPs: 16.81 | 31: iteration 26380/ 173500 | consumed samples: 6753280 | consumed tokens: 13830717440 | elapsed time per iteration (s): 0.80 | learning rate: 1.910E-04 | global batch size: 256 | lm loss: 2.177352E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 319.744 | TFLOPs: 19.34 | 31: iteration 26390/ 173500 | consumed samples: 6755840 | consumed tokens: 13835960320 | elapsed time per iteration (s): 0.91 | learning rate: 1.910E-04 | global batch size: 256 | lm loss: 2.188090E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.532 | TFLOPs: 17.09 | 31: iteration 26400/ 173500 | consumed samples: 6758400 | consumed tokens: 13841203200 | elapsed time per iteration (s): 0.83 | learning rate: 1.910E-04 | global batch size: 256 | lm loss: 2.149082E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.665 | TFLOPs: 18.73 | 31: iteration 26410/ 173500 | consumed samples: 6760960 | consumed tokens: 13846446080 | elapsed time per iteration (s): 0.81 | learning rate: 1.910E-04 | global batch size: 256 | lm loss: 2.185142E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.941 | TFLOPs: 19.17 | 31: iteration 26420/ 173500 | consumed samples: 6763520 | consumed tokens: 13851688960 | elapsed time per iteration (s): 0.79 | learning rate: 1.910E-04 | global batch size: 256 | lm loss: 2.149590E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.174 | TFLOPs: 19.61 | 31: iteration 26430/ 173500 | consumed samples: 6766080 | consumed tokens: 13856931840 | elapsed time per iteration (s): 0.83 | learning rate: 1.910E-04 | global batch size: 256 | lm loss: 2.117280E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.119 | TFLOPs: 18.70 | 31: iteration 26440/ 173500 | consumed samples: 6768640 | consumed tokens: 13862174720 | elapsed time per iteration (s): 0.79 | learning rate: 1.910E-04 | global batch size: 256 | lm 
loss: 2.215109E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.999 | TFLOPs: 19.60 | 31: iteration 26450/ 173500 | consumed samples: 6771200 | consumed tokens: 13867417600 | elapsed time per iteration (s): 0.79 | learning rate: 1.910E-04 | global batch size: 256 | lm loss: 2.151540E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.742 | TFLOPs: 19.71 | 31: iteration 26460/ 173500 | consumed samples: 6773760 | consumed tokens: 13872660480 | elapsed time per iteration (s): 0.81 | learning rate: 1.910E-04 | global batch size: 256 | lm loss: 2.163993E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.175 | TFLOPs: 19.07 | 31: iteration 26470/ 173500 | consumed samples: 6776320 | consumed tokens: 13877903360 | elapsed time per iteration (s): 0.81 | learning rate: 1.909E-04 | global batch size: 256 | lm loss: 2.144714E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.957 | TFLOPs: 19.18 | 31: iteration 26480/ 173500 | consumed samples: 6778880 | consumed tokens: 13883146240 | elapsed time per iteration (s): 0.85 | learning rate: 1.909E-04 | global batch size: 256 | lm loss: 2.152727E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.442 | TFLOPs: 18.30 | 31: iteration 26490/ 173500 | consumed samples: 6781440 | consumed tokens: 13888389120 | elapsed time per iteration (s): 0.88 | learning rate: 1.909E-04 | global batch size: 256 | lm loss: 2.144842E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.387 | TFLOPs: 17.69 | 31: iteration 26500/ 173500 | consumed samples: 6784000 | consumed tokens: 
13893632000 | elapsed time per iteration (s): 0.82 | learning rate: 1.909E-04 | global batch size: 256 | lm loss: 2.162518E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.702 | TFLOPs: 18.86 | 31: iteration 26510/ 173500 | consumed samples: 6786560 | consumed tokens: 13898874880 | elapsed time per iteration (s): 0.78 | learning rate: 1.909E-04 | global batch size: 256 | lm loss: 2.168115E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.535 | TFLOPs: 19.82 | 31: iteration 26520/ 173500 | consumed samples: 6789120 | consumed tokens: 13904117760 | elapsed time per iteration (s): 0.83 | learning rate: 1.909E-04 | global batch size: 256 | lm loss: 2.164527E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.700 | TFLOPs: 18.74 | 31: iteration 26530/ 173500 | consumed samples: 6791680 | consumed tokens: 13909360640 | elapsed time per iteration (s): 0.86 | learning rate: 1.909E-04 | global batch size: 256 | lm loss: 2.163590E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.603 | TFLOPs: 18.00 | 31: iteration 26540/ 173500 | consumed samples: 6794240 | consumed tokens: 13914603520 | elapsed time per iteration (s): 0.84 | learning rate: 1.909E-04 | global batch size: 256 | lm loss: 2.146007E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.748 | TFLOPs: 18.44 | 31: iteration 26550/ 173500 | consumed samples: 6796800 | consumed tokens: 13919846400 | elapsed time per iteration (s): 0.86 | learning rate: 1.909E-04 | global batch size: 256 | lm loss: 2.177670E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
298.815 | TFLOPs: 18.08 | 31: iteration 26560/ 173500 | consumed samples: 6799360 | consumed tokens: 13925089280 | elapsed time per iteration (s): 0.83 | learning rate: 1.909E-04 | global batch size: 256 | lm loss: 2.158577E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.919 | TFLOPs: 18.63 | 31: iteration 26570/ 173500 | consumed samples: 6801920 | consumed tokens: 13930332160 | elapsed time per iteration (s): 0.86 | learning rate: 1.909E-04 | global batch size: 256 | lm loss: 2.125416E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.394 | TFLOPs: 17.93 | 31: iteration 26580/ 173500 | consumed samples: 6804480 | consumed tokens: 13935575040 | elapsed time per iteration (s): 0.79 | learning rate: 1.909E-04 | global batch size: 256 | lm loss: 2.168035E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.157 | TFLOPs: 19.55 | 31: iteration 26590/ 173500 | consumed samples: 6807040 | consumed tokens: 13940817920 | elapsed time per iteration (s): 0.79 | learning rate: 1.909E-04 | global batch size: 256 | lm loss: 2.144280E+00 | grad norm: 0.218 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.008 | TFLOPs: 19.60 | 31: iteration 26600/ 173500 | consumed samples: 6809600 | consumed tokens: 13946060800 | elapsed time per iteration (s): 0.82 | learning rate: 1.909E-04 | global batch size: 256 | lm loss: 2.159964E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.863 | TFLOPs: 18.99 | 31: iteration 26610/ 173500 | consumed samples: 6812160 | consumed tokens: 13951303680 | elapsed time per iteration (s): 0.82 | learning rate: 1.908E-04 | global batch size: 256 | lm loss: 2.157027E+00 | grad norm: 0.134 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.901 | TFLOPs: 18.99 | 31: iteration 26620/ 173500 | consumed samples: 6814720 | consumed tokens: 13956546560 | elapsed time per iteration (s): 0.85 | learning rate: 1.908E-04 | global batch size: 256 | lm loss: 2.161576E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.075 | TFLOPs: 18.21 | 31: iteration 26630/ 173500 | consumed samples: 6817280 | consumed tokens: 13961789440 | elapsed time per iteration (s): 0.79 | learning rate: 1.908E-04 | global batch size: 256 | lm loss: 2.158350E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.329 | TFLOPs: 19.62 | 31: iteration 26640/ 173500 | consumed samples: 6819840 | consumed tokens: 13967032320 | elapsed time per iteration (s): 0.75 | learning rate: 1.908E-04 | global batch size: 256 | lm loss: 2.171147E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.014 | TFLOPs: 20.69 | 31: iteration 26650/ 173500 | consumed samples: 6822400 | consumed tokens: 13972275200 | elapsed time per iteration (s): 0.78 | learning rate: 1.908E-04 | global batch size: 256 | lm loss: 2.145450E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.931 | TFLOPs: 19.84 | 31: iteration 26660/ 173500 | consumed samples: 6824960 | consumed tokens: 13977518080 | elapsed time per iteration (s): 0.74 | learning rate: 1.908E-04 | global batch size: 256 | lm loss: 2.140259E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.616 | TFLOPs: 20.85 | 31: iteration 26670/ 173500 | consumed samples: 6827520 | consumed tokens: 13982760960 | elapsed time per iteration (s): 
0.80 | learning rate: 1.908E-04 | global batch size: 256 | lm loss: 2.176604E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.183 | TFLOPs: 19.31 | 31: iteration 26680/ 173500 | consumed samples: 6830080 | consumed tokens: 13988003840 | elapsed time per iteration (s): 0.79 | learning rate: 1.908E-04 | global batch size: 256 | lm loss: 2.188109E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.992 | TFLOPs: 19.60 | 31: iteration 26690/ 173500 | consumed samples: 6832640 | consumed tokens: 13993246720 | elapsed time per iteration (s): 0.81 | learning rate: 1.908E-04 | global batch size: 256 | lm loss: 2.133048E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.920 | TFLOPs: 19.23 | 31: iteration 26700/ 173500 | consumed samples: 6835200 | consumed tokens: 13998489600 | elapsed time per iteration (s): 0.84 | learning rate: 1.908E-04 | global batch size: 256 | lm loss: 2.173683E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.260 | TFLOPs: 18.41 | 31: iteration 26710/ 173500 | consumed samples: 6837760 | consumed tokens: 14003732480 | elapsed time per iteration (s): 0.82 | learning rate: 1.908E-04 | global batch size: 256 | lm loss: 2.145506E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.417 | TFLOPs: 18.90 | 31: iteration 26720/ 173500 | consumed samples: 6840320 | consumed tokens: 14008975360 | elapsed time per iteration (s): 0.84 | learning rate: 1.908E-04 | global batch size: 256 | lm loss: 2.165947E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.332 | TFLOPs: 18.53 | 31: iteration 26730/ 
173500 | consumed samples: 6842880 | consumed tokens: 14014218240 | elapsed time per iteration (s): 0.81 | learning rate: 1.908E-04 | global batch size: 256 | lm loss: 2.158014E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.675 | TFLOPs: 19.16 | 31: iteration 26740/ 173500 | consumed samples: 6845440 | consumed tokens: 14019461120 | elapsed time per iteration (s): 0.78 | learning rate: 1.908E-04 | global batch size: 256 | lm loss: 2.147117E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.129 | TFLOPs: 19.73 | 31: iteration 26750/ 173500 | consumed samples: 6848000 | consumed tokens: 14024704000 | elapsed time per iteration (s): 0.78 | learning rate: 1.907E-04 | global batch size: 256 | lm loss: 2.338070E+00 | grad norm: 20.882 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.034 | TFLOPs: 19.78 | 31: iteration 26760/ 173500 | consumed samples: 6850560 | consumed tokens: 14029946880 | elapsed time per iteration (s): 0.78 | learning rate: 1.907E-04 | global batch size: 256 | lm loss: 3.285589E+00 | grad norm: 1.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.995 | TFLOPs: 19.78 | 31: iteration 26770/ 173500 | consumed samples: 6853120 | consumed tokens: 14035189760 | elapsed time per iteration (s): 0.79 | learning rate: 1.907E-04 | global batch size: 256 | lm loss: 2.360604E+00 | grad norm: 0.599 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.749 | TFLOPs: 19.65 | 31: iteration 26780/ 173500 | consumed samples: 6855680 | consumed tokens: 14040432640 | elapsed time per iteration (s): 0.74 | learning rate: 1.907E-04 | global batch size: 256 | lm loss: 2.242756E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 
0 | number of nan iterations: 0 | samples per second: 345.310 | TFLOPs: 20.89 | 31: iteration 26790/ 173500 | consumed samples: 6858240 | consumed tokens: 14045675520 | elapsed time per iteration (s): 0.87 | learning rate: 1.907E-04 | global batch size: 256 | lm loss: 2.178382E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.386 | TFLOPs: 17.81 | 31: iteration 26800/ 173500 | consumed samples: 6860800 | consumed tokens: 14050918400 | elapsed time per iteration (s): 0.78 | learning rate: 1.907E-04 | global batch size: 256 | lm loss: 2.186586E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.911 | TFLOPs: 19.96 | 31: iteration 26810/ 173500 | consumed samples: 6863360 | consumed tokens: 14056161280 | elapsed time per iteration (s): 0.76 | learning rate: 1.907E-04 | global batch size: 256 | lm loss: 2.154370E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.120 | TFLOPs: 20.46 | 31: iteration 26820/ 173500 | consumed samples: 6865920 | consumed tokens: 14061404160 | elapsed time per iteration (s): 0.75 | learning rate: 1.907E-04 | global batch size: 256 | lm loss: 2.173267E+00 | grad norm: 0.560 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.554 | TFLOPs: 20.72 | 31: iteration 26830/ 173500 | consumed samples: 6868480 | consumed tokens: 14066647040 | elapsed time per iteration (s): 0.75 | learning rate: 1.907E-04 | global batch size: 256 | lm loss: 2.150490E+00 | grad norm: 0.212 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.931 | TFLOPs: 20.63 | 31: iteration 26840/ 173500 | consumed samples: 6871040 | consumed tokens: 14071889920 | elapsed time per iteration (s): 0.79 | learning rate: 1.907E-04 | global batch 
size: 256 | lm loss: 2.169032E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.342 | TFLOPs: 19.68 | 31: iteration 26850/ 173500 | consumed samples: 6873600 | consumed tokens: 14077132800 | elapsed time per iteration (s): 0.75 | learning rate: 1.907E-04 | global batch size: 256 | lm loss: 2.179337E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.814 | TFLOPs: 20.56 | 31: iteration 26860/ 173500 | consumed samples: 6876160 | consumed tokens: 14082375680 | elapsed time per iteration (s): 0.77 | learning rate: 1.907E-04 | global batch size: 256 | lm loss: 2.161700E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.107 | TFLOPs: 20.03 | 31: iteration 26870/ 173500 | consumed samples: 6878720 | consumed tokens: 14087618560 | elapsed time per iteration (s): 0.77 | learning rate: 1.907E-04 | global batch size: 256 | lm loss: 2.164217E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.751 | TFLOPs: 20.13 | 31: iteration 26880/ 173500 | consumed samples: 6881280 | consumed tokens: 14092861440 | elapsed time per iteration (s): 0.74 | learning rate: 1.906E-04 | global batch size: 256 | lm loss: 2.165925E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.807 | TFLOPs: 20.86 | 31: iteration 26890/ 173500 | consumed samples: 6883840 | consumed tokens: 14098104320 | elapsed time per iteration (s): 0.72 | learning rate: 1.906E-04 | global batch size: 256 | lm loss: 2.154535E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.156 | TFLOPs: 21.43 | 31: iteration 26900/ 173500 | consumed samples: 6886400 | consumed 
tokens: 14103347200 | elapsed time per iteration (s): 0.83 | learning rate: 1.906E-04 | global batch size: 256 | lm loss: 2.178157E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.189 | TFLOPs: 18.58 | 31: iteration 26910/ 173500 | consumed samples: 6888960 | consumed tokens: 14108590080 | elapsed time per iteration (s): 0.74 | learning rate: 1.906E-04 | global batch size: 256 | lm loss: 2.135557E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.385 | TFLOPs: 21.02 | 31: iteration 26920/ 173500 | consumed samples: 6891520 | consumed tokens: 14113832960 | elapsed time per iteration (s): 0.79 | learning rate: 1.906E-04 | global batch size: 256 | lm loss: 2.171594E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.086 | TFLOPs: 19.73 | 31: iteration 26930/ 173500 | consumed samples: 6894080 | consumed tokens: 14119075840 | elapsed time per iteration (s): 0.77 | learning rate: 1.906E-04 | global batch size: 256 | lm loss: 2.164532E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.721 | TFLOPs: 20.19 | 31: iteration 26940/ 173500 | consumed samples: 6896640 | consumed tokens: 14124318720 | elapsed time per iteration (s): 0.84 | learning rate: 1.906E-04 | global batch size: 256 | lm loss: 2.155277E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.068 | TFLOPs: 18.52 | 31: iteration 26950/ 173500 | consumed samples: 6899200 | consumed tokens: 14129561600 | elapsed time per iteration (s): 0.79 | learning rate: 1.906E-04 | global batch size: 256 | lm loss: 2.162280E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 322.448 | TFLOPs: 19.51 | 31: iteration 26960/ 173500 | consumed samples: 6901760 | consumed tokens: 14134804480 | elapsed time per iteration (s): 0.86 | learning rate: 1.906E-04 | global batch size: 256 | lm loss: 2.153282E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.165 | TFLOPs: 18.10 | 31: iteration 26970/ 173500 | consumed samples: 6904320 | consumed tokens: 14140047360 | elapsed time per iteration (s): 0.77 | learning rate: 1.906E-04 | global batch size: 256 | lm loss: 2.143340E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.437 | TFLOPs: 20.05 | 31: iteration 26980/ 173500 | consumed samples: 6906880 | consumed tokens: 14145290240 | elapsed time per iteration (s): 0.83 | learning rate: 1.906E-04 | global batch size: 256 | lm loss: 2.164468E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.158 | TFLOPs: 18.58 | 31: iteration 26990/ 173500 | consumed samples: 6909440 | consumed tokens: 14150533120 | elapsed time per iteration (s): 0.81 | learning rate: 1.906E-04 | global batch size: 256 | lm loss: 2.177239E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.737 | TFLOPs: 19.04 | 31: iteration 27000/ 173500 | consumed samples: 6912000 | consumed tokens: 14155776000 | elapsed time per iteration (s): 0.80 | learning rate: 1.906E-04 | global batch size: 256 | lm loss: 2.165610E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.772 | TFLOPs: 19.28 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 27000 | lm loss value: 2.132582E+00 | lm loss PPL: 8.436620E+00 | 31: 
------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 27000 to checkpoints_1b1long 0: [2022-11-26 00:10:17,503] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step27000 is begin to save! 0: [2022-11-26 00:10:17,517] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_01-model_00-model_states.pt... 0: [2022-11-26 00:10:17,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_01-model_00-model_states.pt. 0: [2022-11-26 00:10:17,741] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_03-model_00-model_states.pt... 0: [2022-11-26 00:10:17,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_03-model_00-model_states.pt. 0: [2022-11-26 00:10:17,824] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_04-model_00-model_states.pt... 0: [2022-11-26 00:10:17,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_04-model_00-model_states.pt. 0: [2022-11-26 00:10:17,904] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_05-model_00-model_states.pt... 0: [2022-11-26 00:10:17,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_05-model_00-model_states.pt. 0: [2022-11-26 00:10:17,982] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_06-model_00-model_states.pt... 0: [2022-11-26 00:10:18,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_06-model_00-model_states.pt. 
0: [2022-11-26 00:10:18,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_07-model_00-model_states.pt... 0: [2022-11-26 00:10:18,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_07-model_00-model_states.pt. 0: [2022-11-26 00:10:18,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_08-model_00-model_states.pt... 0: [2022-11-26 00:10:18,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_08-model_00-model_states.pt. 0: [2022-11-26 00:10:18,211] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_09-model_00-model_states.pt... 0: [2022-11-26 00:10:18,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_09-model_00-model_states.pt. 0: [2022-11-26 00:10:18,290] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_10-model_00-model_states.pt... 0: [2022-11-26 00:10:18,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_10-model_00-model_states.pt. 0: [2022-11-26 00:10:18,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_11-model_00-model_states.pt... 0: [2022-11-26 00:10:18,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_11-model_00-model_states.pt. 0: [2022-11-26 00:10:18,442] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_12-model_00-model_states.pt... 0: [2022-11-26 00:10:18,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_12-model_00-model_states.pt. 
0: [2022-11-26 00:10:18,522] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_13-model_00-model_states.pt... 0: [2022-11-26 00:10:18,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_13-model_00-model_states.pt. 0: [2022-11-26 00:10:18,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_14-model_00-model_states.pt... 0: [2022-11-26 00:10:18,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_14-model_00-model_states.pt. 0: [2022-11-26 00:10:18,675] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_15-model_00-model_states.pt... 0: [2022-11-26 00:10:18,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_15-model_00-model_states.pt. 0: [2022-11-26 00:10:18,752] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_16-model_00-model_states.pt... 0: [2022-11-26 00:10:18,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_16-model_00-model_states.pt. 0: [2022-11-26 00:10:18,827] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_17-model_00-model_states.pt... 0: [2022-11-26 00:10:18,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_17-model_00-model_states.pt. 0: [2022-11-26 00:10:18,904] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_18-model_00-model_states.pt... 0: [2022-11-26 00:10:18,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_18-model_00-model_states.pt. 
0: [2022-11-26 00:10:18,982] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_19-model_00-model_states.pt... 0: [2022-11-26 00:10:19,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_19-model_00-model_states.pt. 0: [2022-11-26 00:10:19,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_20-model_00-model_states.pt... 0: [2022-11-26 00:10:19,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_20-model_00-model_states.pt. 0: [2022-11-26 00:10:19,134] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_21-model_00-model_states.pt... 0: [2022-11-26 00:10:19,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_21-model_00-model_states.pt. 0: [2022-11-26 00:10:19,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_22-model_00-model_states.pt... 0: [2022-11-26 00:10:19,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_22-model_00-model_states.pt. 0: [2022-11-26 00:10:19,287] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_23-model_00-model_states.pt... 0: [2022-11-26 00:10:19,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_23-model_00-model_states.pt. 0: [2022-11-26 00:10:19,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_24-model_00-model_states.pt... 0: [2022-11-26 00:10:19,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_24-model_00-model_states.pt. 
0: [2022-11-26 00:10:19,438] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_25-model_00-model_states.pt... 0: [2022-11-26 00:10:19,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_25-model_00-model_states.pt. 0: [2022-11-26 00:10:19,515] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_26-model_00-model_states.pt... 0: [2022-11-26 00:10:19,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_26-model_00-model_states.pt. 0: [2022-11-26 00:10:19,594] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_27-model_00-model_states.pt... 0: [2022-11-26 00:10:19,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_27-model_00-model_states.pt. 0: [2022-11-26 00:10:19,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_28-model_00-model_states.pt... 0: [2022-11-26 00:10:19,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_28-model_00-model_states.pt. 0: [2022-11-26 00:10:19,746] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/layer_30-model_00-model_states.pt... 0: [2022-11-26 00:10:19,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/layer_30-model_00-model_states.pt. 0: [2022-11-26 00:10:19,748] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step27000/mp_rank_00_model_states.pt 0: [2022-11-26 00:10:19,748] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/mp_rank_00_model_states.pt... 
0: [2022-11-26 00:10:19,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/mp_rank_00_model_states.pt. 0: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
9: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 
1: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
15: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 
11: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 
29: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 
21: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 
19: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
7: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 
8: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 
16: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
13: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 
20: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
28: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 
22: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 
19: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
7: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
15: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 28: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
31: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
26: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 
12: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 28: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 
30: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
15: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 28: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 
28: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:10:19,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:10:19,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:10:19,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 00:10:19,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 22: [2022-11-26 00:10:19,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:10:19,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 00:10:19,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 2: [2022-11-26 00:10:19,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-26 00:10:19,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 00:10:19,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 25: [2022-11-26 00:10:19,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:10:19,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:10:19,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 2: [2022-11-26 00:10:19,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 00:10:19,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 25: [2022-11-26 00:10:19,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 25: [2022-11-26 00:10:19,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:10:19,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 00:10:19,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 28: [2022-11-26 00:10:19,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
28: [2022-11-26 00:10:19,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 00:10:19,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 00:10:19,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 00:10:19,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 28: [2022-11-26 00:10:19,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 26: [2022-11-26 00:10:19,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:10:19,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 00:10:19,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 13: [2022-11-26 00:10:19,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:10:19,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 00:10:19,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 14: [2022-11-26 00:10:19,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
27: [2022-11-26 00:10:19,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:10:19,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 00:10:19,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 14: [2022-11-26 00:10:19,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 27: [2022-11-26 00:10:19,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 14: [2022-11-26 00:10:19,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 00:10:19,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 27: [2022-11-26 00:10:19,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 0: [2022-11-26 00:10:19,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:10:19,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:10:19,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 00:10:19,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
21: [2022-11-26 00:10:19,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:10:19,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 00:10:19,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 3: [2022-11-26 00:10:19,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:10:19,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:10:19,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 8: [2022-11-26 00:10:19,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 00:10:19,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 3: [2022-11-26 00:10:19,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 30: [2022-11-26 00:10:19,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 00:10:19,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 00:10:19,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
16: [2022-11-26 00:10:19,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 00:10:19,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 00:10:19,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 16: [2022-11-26 00:10:19,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 00:10:19,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 00:10:19,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 4: [2022-11-26 00:10:19,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:10:19,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 00:10:19,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 21: [2022-11-26 00:10:19,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
21: [2022-11-26 00:10:19,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 24: [2022-11-26 00:10:19,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:10:19,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 30: [2022-11-26 00:10:19,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:10:19,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 24: [2022-11-26 00:10:19,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 2: [2022-11-26 00:10:19,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 24: [2022-11-26 00:10:19,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 30: [2022-11-26 00:10:19,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 2: [2022-11-26 00:10:19,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 30: [2022-11-26 00:10:19,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 9: [2022-11-26 00:10:19,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 00:10:19,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:10:19,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 00:10:19,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 00:10:19,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 9: [2022-11-26 00:10:19,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 4: [2022-11-26 00:10:19,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:10:19,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 27: [2022-11-26 00:10:19,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:10:19,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 27: [2022-11-26 00:10:19,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 00:10:19,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 0: [2022-11-26 00:10:19,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-26 00:10:19,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 1: [2022-11-26 00:10:19,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:10:19,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 1: [2022-11-26 00:10:19,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 00:10:19,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 26: [2022-11-26 00:10:19,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:10:19,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:10:19,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 00:10:19,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 19: [2022-11-26 00:10:19,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 12: [2022-11-26 00:10:19,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:10:19,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
12: [2022-11-26 00:10:19,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:10:19,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:10:19,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 00:10:19,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 12: [2022-11-26 00:10:19,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 00:10:19,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 00:10:19,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 30: [2022-11-26 00:10:19,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:10:19,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:10:19,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
30: [2022-11-26 00:10:19,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 10: [2022-11-26 00:10:19,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 30: [2022-11-26 00:10:19,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 10: [2022-11-26 00:10:19,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 25: [2022-11-26 00:10:19,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:10:19,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 27: [2022-11-26 00:10:19,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:10:19,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 27: [2022-11-26 00:10:19,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 00:10:19,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 8: [2022-11-26 00:10:19,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 00:10:19,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 19: [2022-11-26 00:10:19,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:10:19,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 19: [2022-11-26 00:10:19,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 00:10:19,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 10: [2022-11-26 00:10:19,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:10:19,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 00:10:19,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 26: [2022-11-26 00:10:19,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:10:19,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 00:10:19,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 6: [2022-11-26 00:10:19,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 00:10:19,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 00:10:19,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 24: [2022-11-26 00:10:19,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:10:19,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 24: [2022-11-26 00:10:19,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 15: [2022-11-26 00:10:19,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:10:19,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 24: [2022-11-26 00:10:19,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 3: [2022-11-26 00:10:19,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:10:19,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 11: [2022-11-26 00:10:19,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
3: [2022-11-26 00:10:19,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 15: [2022-11-26 00:10:19,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 11: [2022-11-26 00:10:19,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:10:19,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:10:19,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 15: [2022-11-26 00:10:19,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:10:19,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 14: [2022-11-26 00:10:19,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 8: [2022-11-26 00:10:19,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:10:19,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 11: [2022-11-26 00:10:19,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 14: [2022-11-26 00:10:19,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
8: [2022-11-26 00:10:19,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 15: [2022-11-26 00:10:19,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 11: [2022-11-26 00:10:19,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:10:19,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 11: [2022-11-26 00:10:19,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 00:10:19,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 24: [2022-11-26 00:10:19,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 00:10:19,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 00:10:19,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 4: [2022-11-26 00:10:19,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:10:19,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
4: [2022-11-26 00:10:19,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 00:10:19,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 22: [2022-11-26 00:10:19,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 00:10:19,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 19: [2022-11-26 00:10:19,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:10:19,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 00:10:19,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 10: [2022-11-26 00:10:19,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:10:19,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:10:19,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 26: [2022-11-26 00:10:19,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 10: [2022-11-26 00:10:19,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
26: [2022-11-26 00:10:19,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 0: [2022-11-26 00:10:19,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:10:19,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:10:19,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 0: [2022-11-26 00:10:19,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 22: [2022-11-26 00:10:19,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 0: [2022-11-26 00:10:19,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 22: [2022-11-26 00:10:19,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:10:19,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 00:10:19,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 25: [2022-11-26 00:10:19,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
25: [2022-11-26 00:10:19,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 00:10:19,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 2: [2022-11-26 00:10:19,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:10:19,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 00:10:19,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 16: [2022-11-26 00:10:19,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 00:10:19,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 00:10:19,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 00:10:19,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 00:10:19,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 16: [2022-11-26 00:10:19,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 4: [2022-11-26 00:10:19,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 00:10:19,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 21: [2022-11-26 00:10:19,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:10:19,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 00:10:19,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 4: [2022-11-26 00:10:19,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 28: [2022-11-26 00:10:19,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 00:10:19,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 00:10:19,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 28: [2022-11-26 00:10:19,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 00:10:19,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 00:10:19,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 1: [2022-11-26 00:10:19,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
30: [2022-11-26 00:10:19,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:10:19,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 30: [2022-11-26 00:10:19,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 00:10:19,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 1: [2022-11-26 00:10:19,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 24: [2022-11-26 00:10:19,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 00:10:19,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 00:10:19,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 12: [2022-11-26 00:10:19,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:10:19,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 00:10:19,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 10: [2022-11-26 00:10:19,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 00:10:19,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 00:10:19,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 3: [2022-11-26 00:10:19,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:10:19,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 14: [2022-11-26 00:10:19,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:10:19,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 14: [2022-11-26 00:10:19,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 00:10:19,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 27: [2022-11-26 00:10:19,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 00:10:19,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 00:10:19,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 5: [2022-11-26 00:10:19,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 00:10:19,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 00:10:19,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 5: [2022-11-26 00:10:19,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:10:19,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 00:10:19,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 5: [2022-11-26 00:10:19,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:10:19,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 00:10:19,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 1: [2022-11-26 00:10:19,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:10:19,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:10:19,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 00:10:19,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
1: [2022-11-26 00:10:19,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:10:19,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 00:10:19,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 15: [2022-11-26 00:10:19,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:10:19,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 15: [2022-11-26 00:10:19,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 1: [2022-11-26 00:10:19,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 15: [2022-11-26 00:10:19,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 11: [2022-11-26 00:10:19,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:10:19,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 00:10:19,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 0: [2022-11-26 00:10:19,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-26 00:10:19,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 00:10:19,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 9: [2022-11-26 00:10:19,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:10:19,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:10:19,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 00:10:19,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 00:10:19,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 9: [2022-11-26 00:10:19,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 5: [2022-11-26 00:10:19,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:10:19,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 00:10:19,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 13: [2022-11-26 00:10:19,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 00:10:19,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 00:10:19,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 13: [2022-11-26 00:10:19,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:10:19,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 00:10:19,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 13: [2022-11-26 00:10:19,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:10:19,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 00:10:19,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 19: [2022-11-26 00:10:19,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:10:19,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 00:10:19,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
0: [2022-11-26 00:10:19,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 00:10:19,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 8: [2022-11-26 00:10:19,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:10:19,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 00:10:19,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 29: [2022-11-26 00:10:19,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:10:19,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:10:19,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:10:19,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-26 00:10:19,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 00:10:19,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 00:10:19,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 00:10:19,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 6: [2022-11-26 00:10:19,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:10:19,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 6: [2022-11-26 00:10:19,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 29: [2022-11-26 00:10:19,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 29: [2022-11-26 00:10:19,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 6: [2022-11-26 00:10:19,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 29: [2022-11-26 00:10:19,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 7: [2022-11-26 00:10:19,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 00:10:19,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:10:19,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:10:19,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:10:19,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 00:10:19,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 00:10:19,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 00:10:19,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 00:10:19,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 7: [2022-11-26 00:10:19,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 7: [2022-11-26 00:10:19,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 7: [2022-11-26 00:10:19,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 6: [2022-11-26 00:10:19,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 00:10:19,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 1: [2022-11-26 00:10:19,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:10:19,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:10:19,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 6: [2022-11-26 00:10:19,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 00:10:19,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 1: [2022-11-26 00:10:19,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 00:10:19,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 6: [2022-11-26 00:10:19,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:10:19,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 00:10:19,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 5: [2022-11-26 00:10:19,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 00:10:19,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 00:10:19,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 23: [2022-11-26 00:10:19,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 00:10:19,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 00:10:19,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 00:10:19,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 00:10:19,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 00:10:19,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 00:10:19,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 23: [2022-11-26 00:10:19,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 23: [2022-11-26 00:10:19,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 23: [2022-11-26 00:10:19,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-26 00:10:19,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 00:10:19,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 22: [2022-11-26 00:10:19,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:10:19,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 00:10:19,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 28: [2022-11-26 00:10:19,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:10:19,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:10:19,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 00:10:19,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:10:19,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 21: [2022-11-26 00:10:19,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 00:10:19,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
2: [2022-11-26 00:10:19,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:10:19,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 00:10:19,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 28: [2022-11-26 00:10:19,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 00:10:19,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 27: [2022-11-26 00:10:19,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 00:10:19,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 00:10:19,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 23: [2022-11-26 00:10:19,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 00:10:19,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 00:10:19,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 9: [2022-11-26 00:10:19,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 00:10:19,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 00:10:19,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 4: [2022-11-26 00:10:19,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:10:19,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 00:10:19,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 11: [2022-11-26 00:10:19,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:10:19,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 00:10:19,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 26: [2022-11-26 00:10:19,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:10:19,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 00:10:19,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 15: [2022-11-26 00:10:19,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-26 00:10:19,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 00:10:19,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 25: [2022-11-26 00:10:19,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:10:19,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 00:10:19,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 13: [2022-11-26 00:10:19,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:10:19,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 00:10:19,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 0: [2022-11-26 00:10:19,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:10:19,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 00:10:19,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 7: [2022-11-26 00:10:19,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 00:10:19,982] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 00:10:19,982] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 3: [2022-11-26 00:10:19,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:10:19,982] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 00:10:19,982] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 30: [2022-11-26 00:10:19,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 00:10:19,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 00:10:19,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 8: [2022-11-26 00:10:19,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:10:19,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 00:10:19,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 12: [2022-11-26 00:10:19,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 00:10:19,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 00:10:19,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 14: [2022-11-26 00:10:19,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:10:19,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 00:10:19,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 10: [2022-11-26 00:10:19,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:10:19,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 24: [2022-11-26 00:10:19,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:10:19,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 24: [2022-11-26 00:10:19,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 00:10:19,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 29: [2022-11-26 00:10:19,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
29: [2022-11-26 00:10:19,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 00:10:19,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 6: [2022-11-26 00:10:19,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:10:19,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 00:10:19,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 16: [2022-11-26 00:10:19,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 00:10:19,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 00:10:19,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 22: [2022-11-26 00:10:19,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:10:19,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 00:10:19,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 1: [2022-11-26 00:10:19,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 00:10:19,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 00:10:19,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 19: [2022-11-26 00:10:19,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:10:19,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 00:10:19,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 2: [2022-11-26 00:10:19,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:10:19,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 00:10:19,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 5: [2022-11-26 00:10:19,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:10:19,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 00:10:19,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 28: [2022-11-26 00:10:20,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
28: [2022-11-26 00:10:20,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 00:10:20,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 15: [2022-11-26 00:10:20,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:10:20,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 00:10:20,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 21: [2022-11-26 00:10:20,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:10:20,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 00:10:20,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 23: [2022-11-26 00:10:20,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 00:10:20,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 00:10:20,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 25: [2022-11-26 00:10:20,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 
25: [2022-11-26 00:10:20,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 00:10:20,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 30: [2022-11-26 00:10:20,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 00:10:20,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 00:10:20,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 4: [2022-11-26 00:10:20,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:10:20,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 00:10:20,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 11: [2022-11-26 00:10:20,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:10:20,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 9: [2022-11-26 00:10:20,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:10:20,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
9: [2022-11-26 00:10:20,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 00:10:20,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 10: [2022-11-26 00:10:20,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:10:20,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:10:20,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 00:10:20,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 8: [2022-11-26 00:10:20,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 00:10:20,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 26: [2022-11-26 00:10:20,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:10:20,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 00:10:20,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 7: [2022-11-26 00:10:20,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 00:10:20,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 00:10:20,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 3: [2022-11-26 00:10:20,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:10:20,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 00:10:20,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 27: [2022-11-26 00:10:20,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 00:10:20,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 00:10:20,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 13: [2022-11-26 00:10:20,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:10:20,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 00:10:20,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 16: [2022-11-26 00:10:20,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
16: [2022-11-26 00:10:20,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 00:10:20,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 29: [2022-11-26 00:10:20,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:10:20,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 00:10:20,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 17: [2022-11-26 00:10:20,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:10:20,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 00:10:20,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 18: [2022-11-26 00:10:20,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 00:10:20,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 19: [2022-11-26 00:10:20,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
24: [2022-11-26 00:10:20,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 18: [2022-11-26 00:10:20,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 19: [2022-11-26 00:10:20,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 00:10:20,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 24: [2022-11-26 00:10:20,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 00:10:20,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 12: [2022-11-26 00:10:20,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:10:20,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 00:10:20,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 0: [2022-11-26 00:10:20,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:10:20,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 00:10:20,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
20: [2022-11-26 00:10:20,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 00:10:20,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 00:10:20,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 14: [2022-11-26 00:10:20,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:10:20,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 00:10:20,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 22: [2022-11-26 00:10:20,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:10:20,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 00:10:20,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 6: [2022-11-26 00:10:20,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:10:20,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 00:10:20,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
1: [2022-11-26 00:10:20,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:10:20,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 00:10:20,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 18: [2022-11-26 00:10:20,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 00:10:20,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 00:10:20,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 5: [2022-11-26 00:10:20,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:10:20,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 00:10:20,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 21: [2022-11-26 00:10:20,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:10:20,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 00:10:20,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
20: [2022-11-26 00:10:20,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 00:10:20,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 00:10:20,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 28: [2022-11-26 00:10:20,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 00:10:20,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 00:10:20,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 17: [2022-11-26 00:10:20,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:10:20,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 00:10:20,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 31: [2022-11-26 00:10:20,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 00:10:20,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 
31: [2022-11-26 00:10:20,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 00:10:20,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 00:10:20,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 31: [2022-11-26 00:10:20,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 2: [2022-11-26 00:10:20,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:10:20,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 00:10:20,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 27: [2022-11-26 00:10:20,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 00:10:20,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 00:10:20,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 25: [2022-11-26 00:10:20,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
25: [2022-11-26 00:10:20,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 00:10:20,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 9: [2022-11-26 00:10:20,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:10:20,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 0: [2022-11-26 00:10:20,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:10:20,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 0: [2022-11-26 00:10:20,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 00:10:20,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 23: [2022-11-26 00:10:20,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:10:20,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 00:10:20,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 23: [2022-11-26 00:10:20,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 7: [2022-11-26 00:10:20,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 23: [2022-11-26 00:10:20,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 4: [2022-11-26 00:10:20,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:10:20,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 00:10:20,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 15: [2022-11-26 00:10:20,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:10:20,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:10:20,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 00:10:20,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
14: [2022-11-26 00:10:20,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 00:10:20,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 11: [2022-11-26 00:10:20,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:10:20,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 00:10:20,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 26: [2022-11-26 00:10:20,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:10:20,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 00:10:20,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 3: [2022-11-26 00:10:20,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:10:20,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 00:10:20,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 13: [2022-11-26 00:10:20,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 00:10:20,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 00:10:20,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 10: [2022-11-26 00:10:20,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:10:20,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 00:10:20,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 30: [2022-11-26 00:10:20,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 00:10:20,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 00:10:20,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 8: [2022-11-26 00:10:20,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:10:20,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 00:10:20,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 6: [2022-11-26 00:10:20,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 00:10:20,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 00:10:20,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 1: [2022-11-26 00:10:20,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:10:20,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:10:20,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 00:10:20,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 1: [2022-11-26 00:10:20,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 00:10:20,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 29: [2022-11-26 00:10:20,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:10:20,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 00:10:20,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 16: [2022-11-26 00:10:20,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-26 00:10:20,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 00:10:20,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 24: [2022-11-26 00:10:20,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:10:20,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:10:20,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 00:10:20,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 24: [2022-11-26 00:10:20,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 00:10:20,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 5: [2022-11-26 00:10:20,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:10:20,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 00:10:20,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 2: [2022-11-26 00:10:20,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
22: [2022-11-26 00:10:20,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:10:20,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 00:10:20,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 22: [2022-11-26 00:10:20,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 00:10:20,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 28: [2022-11-26 00:10:20,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 00:10:20,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 00:10:20,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 0: [2022-11-26 00:10:20,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:10:20,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 00:10:20,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 15: [2022-11-26 00:10:20,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 00:10:20,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 24: [2022-11-26 00:10:20,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:10:20,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 24: [2022-11-26 00:10:20,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 00:10:20,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 25: [2022-11-26 00:10:20,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:10:20,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 26: [2022-11-26 00:10:20,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:10:20,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 26: [2022-11-26 00:10:20,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 29: [2022-11-26 00:10:20,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:10:20,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
16: [2022-11-26 00:10:20,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:10:20,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 00:10:20,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 16: [2022-11-26 00:10:20,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 21: [2022-11-26 00:10:20,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:10:20,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 16: [2022-11-26 00:10:20,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 21: [2022-11-26 00:10:20,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 7: [2022-11-26 00:10:20,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:10:20,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 00:10:20,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 23: [2022-11-26 00:10:20,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
23: [2022-11-26 00:10:20,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 00:10:20,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 9: [2022-11-26 00:10:20,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 30: [2022-11-26 00:10:20,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:10:20,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 30: [2022-11-26 00:10:20,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 00:10:20,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 9: [2022-11-26 00:10:20,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 12: [2022-11-26 00:10:20,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:10:20,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 00:10:20,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 8: [2022-11-26 00:10:20,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 00:10:20,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 00:10:20,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 27: [2022-11-26 00:10:20,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:10:20,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:10:20,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 00:10:20,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 27: [2022-11-26 00:10:20,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 00:10:20,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 11: [2022-11-26 00:10:20,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:10:20,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
11: [2022-11-26 00:10:20,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 14: [2022-11-26 00:10:20,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 00:10:20,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 11: [2022-11-26 00:10:20,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 3: [2022-11-26 00:10:20,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:10:20,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 00:10:20,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 13: [2022-11-26 00:10:20,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:10:20,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 00:10:20,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 20: [2022-11-26 00:10:20,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 
20: [2022-11-26 00:10:20,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 00:10:20,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 10: [2022-11-26 00:10:20,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:10:20,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 00:10:20,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 18: [2022-11-26 00:10:20,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 00:10:20,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 00:10:20,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 18: [2022-11-26 00:10:20,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 00:10:20,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 00:10:20,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 18: [2022-11-26 00:10:20,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
18: [2022-11-26 00:10:20,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 00:10:20,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 18: [2022-11-26 00:10:20,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 00:10:20,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 00:10:20,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 20: [2022-11-26 00:10:20,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 00:10:20,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 00:10:20,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 31: [2022-11-26 00:10:20,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 00:10:20,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 00:10:20,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 18: [2022-11-26 00:10:20,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
18: [2022-11-26 00:10:20,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 00:10:20,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 20: [2022-11-26 00:10:20,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 00:10:20,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 00:10:20,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 20: [2022-11-26 00:10:20,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 00:10:20,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 00:10:20,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-26 00:10:20,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 00:10:20,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 00:10:20,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 00:10:20,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 20: [2022-11-26 00:10:20,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 20: [2022-11-26 00:10:20,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 31: [2022-11-26 00:10:20,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 00:10:20,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 00:10:20,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 00:10:20,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 00:10:20,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 31: [2022-11-26 00:10:20,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
31: [2022-11-26 00:10:20,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 00:10:20,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 00:10:20,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 31: [2022-11-26 00:10:20,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 00:10:20,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 00:10:20,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 18: [2022-11-26 00:10:20,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 00:10:20,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 00:10:20,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 17: [2022-11-26 00:10:20,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:10:20,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 00:10:20,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
17: [2022-11-26 00:10:20,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:10:20,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:10:20,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:10:20,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:10:20,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 00:10:20,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 00:10:20,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:10:20,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 00:10:20,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 00:10:20,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 17: [2022-11-26 00:10:20,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
17: [2022-11-26 00:10:20,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 17: [2022-11-26 00:10:20,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 17: [2022-11-26 00:10:20,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 00:10:20,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 31: [2022-11-26 00:10:20,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 00:10:20,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step27000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 00:10:20,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
0: successfully saved checkpoint at iteration 27000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2691.95 31: iteration 27010/ 173500 | consumed samples: 6914560 | consumed tokens: 14161018880 | elapsed time per iteration (s): 1.08 | learning rate: 1.906E-04 | global batch size: 256 | lm loss: 2.169772E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.295 | TFLOPs: 14.36 | 31: iteration 27020/ 173500 | consumed samples: 6917120 | consumed tokens: 14166261760 | elapsed time per iteration (s): 0.73 | learning rate: 1.905E-04 | global batch size: 256 | lm loss: 2.137774E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.373 | TFLOPs: 21.20 | 31: iteration 27030/ 173500 | consumed samples: 6919680 | consumed tokens: 14171504640 | elapsed time per iteration (s): 0.74 | learning rate: 1.905E-04 | global batch size: 256 | lm loss: 2.142391E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.776 | TFLOPs: 20.80 | 31: iteration 27040/ 173500 | consumed samples: 6922240 | consumed tokens: 14176747520 | elapsed time per iteration (s): 0.81 | learning rate: 1.905E-04 | global batch size: 256 | lm loss: 2.145975E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.478 | TFLOPs: 19.15 | 31: iteration 27050/ 173500 | consumed samples: 6924800 | consumed tokens: 14181990400 | elapsed time per iteration (s): 0.83 | learning rate: 1.905E-04 | global batch size: 256 | lm loss: 2.189779E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.961 | TFLOPs: 18.63 | 31: iteration 27060/ 173500 | consumed samples: 6927360 | consumed tokens: 14187233280 | elapsed time per iteration (s): 0.76 | 
learning rate: 1.905E-04 | global batch size: 256 | lm loss: 2.175387E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.278 | TFLOPs: 20.46 | 31: iteration 27070/ 173500 | consumed samples: 6929920 | consumed tokens: 14192476160 | elapsed time per iteration (s): 0.82 | learning rate: 1.905E-04 | global batch size: 256 | lm loss: 2.155490E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.146 | TFLOPs: 18.94 | 31: iteration 27080/ 173500 | consumed samples: 6932480 | consumed tokens: 14197719040 | elapsed time per iteration (s): 0.78 | learning rate: 1.905E-04 | global batch size: 256 | lm loss: 2.147085E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.611 | TFLOPs: 19.94 | 31: iteration 27090/ 173500 | consumed samples: 6935040 | consumed tokens: 14202961920 | elapsed time per iteration (s): 0.78 | learning rate: 1.905E-04 | global batch size: 256 | lm loss: 2.157713E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.674 | TFLOPs: 19.88 | 31: iteration 27100/ 173500 | consumed samples: 6937600 | consumed tokens: 14208204800 | elapsed time per iteration (s): 0.87 | learning rate: 1.905E-04 | global batch size: 256 | lm loss: 2.142924E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.953 | TFLOPs: 17.78 | 31: iteration 27110/ 173500 | consumed samples: 6940160 | consumed tokens: 14213447680 | elapsed time per iteration (s): 0.76 | learning rate: 1.905E-04 | global batch size: 256 | lm loss: 2.174130E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.642 | TFLOPs: 20.37 | 31: iteration 27120/ 173500 
| consumed samples: 6942720 | consumed tokens: 14218690560 | elapsed time per iteration (s): 0.80 | learning rate: 1.905E-04 | global batch size: 256 | lm loss: 2.146568E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.297 | TFLOPs: 19.26 | 31: iteration 27130/ 173500 | consumed samples: 6945280 | consumed tokens: 14223933440 | elapsed time per iteration (s): 0.77 | learning rate: 1.905E-04 | global batch size: 256 | lm loss: 2.165318E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.580 | TFLOPs: 20.18 | 31: iteration 27140/ 173500 | consumed samples: 6947840 | consumed tokens: 14229176320 | elapsed time per iteration (s): 0.75 | learning rate: 1.905E-04 | global batch size: 256 | lm loss: 2.122834E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.235 | TFLOPs: 20.58 | 31: iteration 27150/ 173500 | consumed samples: 6950400 | consumed tokens: 14234419200 | elapsed time per iteration (s): 0.75 | learning rate: 1.905E-04 | global batch size: 256 | lm loss: 2.153568E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.664 | TFLOPs: 20.61 | 31: iteration 27160/ 173500 | consumed samples: 6952960 | consumed tokens: 14239662080 | elapsed time per iteration (s): 0.75 | learning rate: 1.904E-04 | global batch size: 256 | lm loss: 2.179129E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.615 | TFLOPs: 20.79 | 31: iteration 27170/ 173500 | consumed samples: 6955520 | consumed tokens: 14244904960 | elapsed time per iteration (s): 0.74 | learning rate: 1.904E-04 | global batch size: 256 | lm loss: 2.150368E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 344.998 | TFLOPs: 20.87 | 31: iteration 27180/ 173500 | consumed samples: 6958080 | consumed tokens: 14250147840 | elapsed time per iteration (s): 0.73 | learning rate: 1.904E-04 | global batch size: 256 | lm loss: 2.151186E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.867 | TFLOPs: 21.35 | 31: iteration 27190/ 173500 | consumed samples: 6960640 | consumed tokens: 14255390720 | elapsed time per iteration (s): 0.81 | learning rate: 1.904E-04 | global batch size: 256 | lm loss: 2.158256E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.532 | TFLOPs: 19.15 | 31: iteration 27200/ 173500 | consumed samples: 6963200 | consumed tokens: 14260633600 | elapsed time per iteration (s): 0.77 | learning rate: 1.904E-04 | global batch size: 256 | lm loss: 2.161052E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.402 | TFLOPs: 20.23 | 31: iteration 27210/ 173500 | consumed samples: 6965760 | consumed tokens: 14265876480 | elapsed time per iteration (s): 0.79 | learning rate: 1.904E-04 | global batch size: 256 | lm loss: 2.160382E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.791 | TFLOPs: 19.59 | 31: iteration 27220/ 173500 | consumed samples: 6968320 | consumed tokens: 14271119360 | elapsed time per iteration (s): 0.74 | learning rate: 1.904E-04 | global batch size: 256 | lm loss: 2.155696E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.518 | TFLOPs: 21.02 | 31: iteration 27230/ 173500 | consumed samples: 6970880 | consumed tokens: 14276362240 | elapsed time per iteration (s): 2.37 | learning rate: 1.904E-04 | global batch size: 
256 | lm loss: 2.158827E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 108.174 | TFLOPs: 6.54 | 31: iteration 27240/ 173500 | consumed samples: 6973440 | consumed tokens: 14281605120 | elapsed time per iteration (s): 0.75 | learning rate: 1.904E-04 | global batch size: 256 | lm loss: 2.134605E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.378 | TFLOPs: 20.77 | 31: iteration 27250/ 173500 | consumed samples: 6976000 | consumed tokens: 14286848000 | elapsed time per iteration (s): 0.78 | learning rate: 1.904E-04 | global batch size: 256 | lm loss: 2.150781E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.726 | TFLOPs: 19.77 | 31: iteration 27260/ 173500 | consumed samples: 6978560 | consumed tokens: 14292090880 | elapsed time per iteration (s): 0.83 | learning rate: 1.904E-04 | global batch size: 256 | lm loss: 2.140566E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.821 | TFLOPs: 18.56 | 31: iteration 27270/ 173500 | consumed samples: 6981120 | consumed tokens: 14297333760 | elapsed time per iteration (s): 0.80 | learning rate: 1.904E-04 | global batch size: 256 | lm loss: 2.136571E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.209 | TFLOPs: 19.43 | 31: iteration 27280/ 173500 | consumed samples: 6983680 | consumed tokens: 14302576640 | elapsed time per iteration (s): 2.19 | learning rate: 1.904E-04 | global batch size: 256 | lm loss: 2.155346E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 117.039 | TFLOPs: 7.08 | 31: iteration 27290/ 173500 | consumed samples: 6986240 | consumed tokens: 
14307819520 | elapsed time per iteration (s): 0.74 | learning rate: 1.903E-04 | global batch size: 256 | lm loss: 2.158046E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.150 | TFLOPs: 20.94 | 31: iteration 27300/ 173500 | consumed samples: 6988800 | consumed tokens: 14313062400 | elapsed time per iteration (s): 0.77 | learning rate: 1.903E-04 | global batch size: 256 | lm loss: 2.162772E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.103 | TFLOPs: 20.21 | 31: iteration 27310/ 173500 | consumed samples: 6991360 | consumed tokens: 14318305280 | elapsed time per iteration (s): 0.71 | learning rate: 1.903E-04 | global batch size: 256 | lm loss: 2.162852E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 359.295 | TFLOPs: 21.74 | 31: iteration 27320/ 173500 | consumed samples: 6993920 | consumed tokens: 14323548160 | elapsed time per iteration (s): 0.79 | learning rate: 1.903E-04 | global batch size: 256 | lm loss: 2.161885E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.766 | TFLOPs: 19.71 | 31: iteration 27330/ 173500 | consumed samples: 6996480 | consumed tokens: 14328791040 | elapsed time per iteration (s): 0.74 | learning rate: 1.903E-04 | global batch size: 256 | lm loss: 2.153412E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.866 | TFLOPs: 20.80 | 31: iteration 27340/ 173500 | consumed samples: 6999040 | consumed tokens: 14334033920 | elapsed time per iteration (s): 0.78 | learning rate: 1.903E-04 | global batch size: 256 | lm loss: 2.149601E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
329.437 | TFLOPs: 19.93 | 31: iteration 27350/ 173500 | consumed samples: 7001600 | consumed tokens: 14339276800 | elapsed time per iteration (s): 0.72 | learning rate: 1.903E-04 | global batch size: 256 | lm loss: 2.145734E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.642 | TFLOPs: 21.45 | 31: iteration 27360/ 173500 | consumed samples: 7004160 | consumed tokens: 14344519680 | elapsed time per iteration (s): 0.73 | learning rate: 1.903E-04 | global batch size: 256 | lm loss: 2.155309E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.253 | TFLOPs: 21.13 | 31: iteration 27370/ 173500 | consumed samples: 7006720 | consumed tokens: 14349762560 | elapsed time per iteration (s): 0.75 | learning rate: 1.903E-04 | global batch size: 256 | lm loss: 2.193932E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.717 | TFLOPs: 20.61 | 31: iteration 27380/ 173500 | consumed samples: 7009280 | consumed tokens: 14355005440 | elapsed time per iteration (s): 0.74 | learning rate: 1.903E-04 | global batch size: 256 | lm loss: 2.132788E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.760 | TFLOPs: 20.86 | 31: iteration 27390/ 173500 | consumed samples: 7011840 | consumed tokens: 14360248320 | elapsed time per iteration (s): 0.74 | learning rate: 1.903E-04 | global batch size: 256 | lm loss: 2.159209E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.348 | TFLOPs: 20.83 | 31: iteration 27400/ 173500 | consumed samples: 7014400 | consumed tokens: 14365491200 | elapsed time per iteration (s): 0.76 | learning rate: 1.903E-04 | global batch size: 256 | lm loss: 2.139350E+00 | grad norm: 0.142 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.963 | TFLOPs: 20.26 | 31: iteration 27410/ 173500 | consumed samples: 7016960 | consumed tokens: 14370734080 | elapsed time per iteration (s): 0.83 | learning rate: 1.903E-04 | global batch size: 256 | lm loss: 2.147451E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.496 | TFLOPs: 18.66 | 31: iteration 27420/ 173500 | consumed samples: 7019520 | consumed tokens: 14375976960 | elapsed time per iteration (s): 0.80 | learning rate: 1.903E-04 | global batch size: 256 | lm loss: 2.165670E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.473 | TFLOPs: 19.33 | 31: iteration 27430/ 173500 | consumed samples: 7022080 | consumed tokens: 14381219840 | elapsed time per iteration (s): 0.82 | learning rate: 1.902E-04 | global batch size: 256 | lm loss: 2.162568E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.814 | TFLOPs: 18.92 | 31: iteration 27440/ 173500 | consumed samples: 7024640 | consumed tokens: 14386462720 | elapsed time per iteration (s): 0.84 | learning rate: 1.902E-04 | global batch size: 256 | lm loss: 2.160548E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.580 | TFLOPs: 18.37 | 31: iteration 27450/ 173500 | consumed samples: 7027200 | consumed tokens: 14391705600 | elapsed time per iteration (s): 0.81 | learning rate: 1.902E-04 | global batch size: 256 | lm loss: 2.149274E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.319 | TFLOPs: 19.02 | 31: iteration 27460/ 173500 | consumed samples: 7029760 | consumed tokens: 14396948480 | elapsed time per iteration (s): 
0.83 | learning rate: 1.902E-04 | global batch size: 256 | lm loss: 2.173450E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.023 | TFLOPs: 18.63 | 31: iteration 27470/ 173500 | consumed samples: 7032320 | consumed tokens: 14402191360 | elapsed time per iteration (s): 0.81 | learning rate: 1.902E-04 | global batch size: 256 | lm loss: 2.147591E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.736 | TFLOPs: 19.22 | 31: iteration 27480/ 173500 | consumed samples: 7034880 | consumed tokens: 14407434240 | elapsed time per iteration (s): 0.79 | learning rate: 1.902E-04 | global batch size: 256 | lm loss: 2.153840E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.112 | TFLOPs: 19.73 | 31: iteration 27490/ 173500 | consumed samples: 7037440 | consumed tokens: 14412677120 | elapsed time per iteration (s): 0.76 | learning rate: 1.902E-04 | global batch size: 256 | lm loss: 2.166988E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.264 | TFLOPs: 20.34 | 31: iteration 27500/ 173500 | consumed samples: 7040000 | consumed tokens: 14417920000 | elapsed time per iteration (s): 0.82 | learning rate: 1.902E-04 | global batch size: 256 | lm loss: 2.167745E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.957 | TFLOPs: 18.87 | 31: iteration 27510/ 173500 | consumed samples: 7042560 | consumed tokens: 14423162880 | elapsed time per iteration (s): 0.84 | learning rate: 1.902E-04 | global batch size: 256 | lm loss: 2.173670E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.524 | TFLOPs: 18.48 | 31: iteration 27520/ 
173500 | consumed samples: 7045120 | consumed tokens: 14428405760 | elapsed time per iteration (s): 0.83 | learning rate: 1.902E-04 | global batch size: 256 | lm loss: 2.138250E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.058 | TFLOPs: 18.76 | 31: iteration 27530/ 173500 | consumed samples: 7047680 | consumed tokens: 14433648640 | elapsed time per iteration (s): 0.83 | learning rate: 1.902E-04 | global batch size: 256 | lm loss: 2.161764E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.609 | TFLOPs: 18.73 | 31: iteration 27540/ 173500 | consumed samples: 7050240 | consumed tokens: 14438891520 | elapsed time per iteration (s): 0.81 | learning rate: 1.902E-04 | global batch size: 256 | lm loss: 2.137807E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.447 | TFLOPs: 19.08 | 31: iteration 27550/ 173500 | consumed samples: 7052800 | consumed tokens: 14444134400 | elapsed time per iteration (s): 0.81 | learning rate: 1.902E-04 | global batch size: 256 | lm loss: 2.143053E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.805 | TFLOPs: 19.17 | 31: iteration 27560/ 173500 | consumed samples: 7055360 | consumed tokens: 14449377280 | elapsed time per iteration (s): 0.85 | learning rate: 1.901E-04 | global batch size: 256 | lm loss: 2.159217E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.417 | TFLOPs: 18.23 | 31: iteration 27570/ 173500 | consumed samples: 7057920 | consumed tokens: 14454620160 | elapsed time per iteration (s): 0.82 | learning rate: 1.901E-04 | global batch size: 256 | lm loss: 2.142769E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 
0 | number of nan iterations: 0 | samples per second: 310.808 | TFLOPs: 18.80 | 31: iteration 27580/ 173500 | consumed samples: 7060480 | consumed tokens: 14459863040 | elapsed time per iteration (s): 0.78 | learning rate: 1.901E-04 | global batch size: 256 | lm loss: 2.152048E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.320 | TFLOPs: 19.80 | 31: iteration 27590/ 173500 | consumed samples: 7063040 | consumed tokens: 14465105920 | elapsed time per iteration (s): 0.81 | learning rate: 1.901E-04 | global batch size: 256 | lm loss: 2.159556E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.510 | TFLOPs: 19.03 | 31: iteration 27600/ 173500 | consumed samples: 7065600 | consumed tokens: 14470348800 | elapsed time per iteration (s): 0.78 | learning rate: 1.901E-04 | global batch size: 256 | lm loss: 2.168450E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.801 | TFLOPs: 19.89 | 31: iteration 27610/ 173500 | consumed samples: 7068160 | consumed tokens: 14475591680 | elapsed time per iteration (s): 0.74 | learning rate: 1.901E-04 | global batch size: 256 | lm loss: 2.149108E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.604 | TFLOPs: 21.03 | 31: iteration 27620/ 173500 | consumed samples: 7070720 | consumed tokens: 14480834560 | elapsed time per iteration (s): 0.78 | learning rate: 1.901E-04 | global batch size: 256 | lm loss: 2.150853E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.245 | TFLOPs: 19.80 | 31: iteration 27630/ 173500 | consumed samples: 7073280 | consumed tokens: 14486077440 | elapsed time per iteration (s): 0.78 | learning rate: 1.901E-04 | global batch 
size: 256 | lm loss: 2.163221E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.497 | TFLOPs: 19.81 | 31: iteration 27640/ 173500 | consumed samples: 7075840 | consumed tokens: 14491320320 | elapsed time per iteration (s): 0.77 | learning rate: 1.901E-04 | global batch size: 256 | lm loss: 2.161898E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.279 | TFLOPs: 20.04 | 31: iteration 27650/ 173500 | consumed samples: 7078400 | consumed tokens: 14496563200 | elapsed time per iteration (s): 0.79 | learning rate: 1.901E-04 | global batch size: 256 | lm loss: 2.155403E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.907 | TFLOPs: 19.66 | 31: iteration 27660/ 173500 | consumed samples: 7080960 | consumed tokens: 14501806080 | elapsed time per iteration (s): 0.82 | learning rate: 1.901E-04 | global batch size: 256 | lm loss: 2.149378E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.432 | TFLOPs: 18.84 | 31: iteration 27670/ 173500 | consumed samples: 7083520 | consumed tokens: 14507048960 | elapsed time per iteration (s): 0.85 | learning rate: 1.901E-04 | global batch size: 256 | lm loss: 2.170453E+00 | grad norm: 0.217 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.626 | TFLOPs: 18.31 | 31: iteration 27680/ 173500 | consumed samples: 7086080 | consumed tokens: 14512291840 | elapsed time per iteration (s): 0.80 | learning rate: 1.901E-04 | global batch size: 256 | lm loss: 2.166459E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.794 | TFLOPs: 19.47 | 31: iteration 27690/ 173500 | consumed samples: 7088640 | consumed 
tokens: 14517534720 | elapsed time per iteration (s): 0.82 | learning rate: 1.900E-04 | global batch size: 256 | lm loss: 2.175110E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.870 | TFLOPs: 18.93 | 31: iteration 27700/ 173500 | consumed samples: 7091200 | consumed tokens: 14522777600 | elapsed time per iteration (s): 0.79 | learning rate: 1.900E-04 | global batch size: 256 | lm loss: 2.142333E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.491 | TFLOPs: 19.63 | 31: iteration 27710/ 173500 | consumed samples: 7093760 | consumed tokens: 14528020480 | elapsed time per iteration (s): 0.81 | learning rate: 1.900E-04 | global batch size: 256 | lm loss: 2.185218E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.892 | TFLOPs: 19.11 | 31: iteration 27720/ 173500 | consumed samples: 7096320 | consumed tokens: 14533263360 | elapsed time per iteration (s): 0.80 | learning rate: 1.900E-04 | global batch size: 256 | lm loss: 2.170977E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.676 | TFLOPs: 19.40 | 31: iteration 27730/ 173500 | consumed samples: 7098880 | consumed tokens: 14538506240 | elapsed time per iteration (s): 0.81 | learning rate: 1.900E-04 | global batch size: 256 | lm loss: 2.153678E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.989 | TFLOPs: 19.12 | 31: iteration 27740/ 173500 | consumed samples: 7101440 | consumed tokens: 14543749120 | elapsed time per iteration (s): 0.84 | learning rate: 1.900E-04 | global batch size: 256 | lm loss: 2.158848E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 304.190 | TFLOPs: 18.40 | 31: iteration 27750/ 173500 | consumed samples: 7104000 | consumed tokens: 14548992000 | elapsed time per iteration (s): 0.78 | learning rate: 1.900E-04 | global batch size: 256 | lm loss: 2.174388E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.861 | TFLOPs: 19.83 | 31: iteration 27760/ 173500 | consumed samples: 7106560 | consumed tokens: 14554234880 | elapsed time per iteration (s): 0.80 | learning rate: 1.900E-04 | global batch size: 256 | lm loss: 2.161124E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.410 | TFLOPs: 19.26 | 31: iteration 27770/ 173500 | consumed samples: 7109120 | consumed tokens: 14559477760 | elapsed time per iteration (s): 0.77 | learning rate: 1.900E-04 | global batch size: 256 | lm loss: 2.134531E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.883 | TFLOPs: 20.02 | 31: iteration 27780/ 173500 | consumed samples: 7111680 | consumed tokens: 14564720640 | elapsed time per iteration (s): 0.76 | learning rate: 1.900E-04 | global batch size: 256 | lm loss: 2.186676E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.732 | TFLOPs: 20.31 | 31: iteration 27790/ 173500 | consumed samples: 7114240 | consumed tokens: 14569963520 | elapsed time per iteration (s): 0.75 | learning rate: 1.900E-04 | global batch size: 256 | lm loss: 2.145616E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.458 | TFLOPs: 20.78 | 31: iteration 27800/ 173500 | consumed samples: 7116800 | consumed tokens: 14575206400 | elapsed time per iteration (s): 0.80 | learning rate: 1.900E-04 | global batch size: 256 | lm loss: 2.143757E+00 | grad norm: 
0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.507 | TFLOPs: 19.39 | 31: iteration 27810/ 173500 | consumed samples: 7119360 | consumed tokens: 14580449280 | elapsed time per iteration (s): 0.76 | learning rate: 1.900E-04 | global batch size: 256 | lm loss: 2.138966E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.083 | TFLOPs: 20.45 | 31: iteration 27820/ 173500 | consumed samples: 7121920 | consumed tokens: 14585692160 | elapsed time per iteration (s): 0.73 | learning rate: 1.899E-04 | global batch size: 256 | lm loss: 2.142136E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.121 | TFLOPs: 21.24 | 31: iteration 27830/ 173500 | consumed samples: 7124480 | consumed tokens: 14590935040 | elapsed time per iteration (s): 0.76 | learning rate: 1.899E-04 | global batch size: 256 | lm loss: 2.163810E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.396 | TFLOPs: 20.47 | 31: iteration 27840/ 173500 | consumed samples: 7127040 | consumed tokens: 14596177920 | elapsed time per iteration (s): 0.74 | learning rate: 1.899E-04 | global batch size: 256 | lm loss: 2.167702E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.397 | TFLOPs: 21.02 | 31: iteration 27850/ 173500 | consumed samples: 7129600 | consumed tokens: 14601420800 | elapsed time per iteration (s): 0.82 | learning rate: 1.899E-04 | global batch size: 256 | lm loss: 2.173246E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.117 | TFLOPs: 18.82 | 31: iteration 27860/ 173500 | consumed samples: 7132160 | consumed tokens: 14606663680 | elapsed time per 
iteration (s): 0.75 | learning rate: 1.899E-04 | global batch size: 256 | lm loss: 2.175225E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.946 | TFLOPs: 20.69 | 31: iteration 27870/ 173500 | consumed samples: 7134720 | consumed tokens: 14611906560 | elapsed time per iteration (s): 0.74 | learning rate: 1.899E-04 | global batch size: 256 | lm loss: 2.133740E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.937 | TFLOPs: 20.87 | 31: iteration 27880/ 173500 | consumed samples: 7137280 | consumed tokens: 14617149440 | elapsed time per iteration (s): 0.74 | learning rate: 1.899E-04 | global batch size: 256 | lm loss: 2.180795E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.268 | TFLOPs: 21.07 | 31: iteration 27890/ 173500 | consumed samples: 7139840 | consumed tokens: 14622392320 | elapsed time per iteration (s): 0.77 | learning rate: 1.899E-04 | global batch size: 256 | lm loss: 2.132385E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.043 | TFLOPs: 20.09 | 31: iteration 27900/ 173500 | consumed samples: 7142400 | consumed tokens: 14627635200 | elapsed time per iteration (s): 0.76 | learning rate: 1.899E-04 | global batch size: 256 | lm loss: 2.108406E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.649 | TFLOPs: 20.25 | 31: iteration 27910/ 173500 | consumed samples: 7144960 | consumed tokens: 14632878080 | elapsed time per iteration (s): 0.78 | learning rate: 1.899E-04 | global batch size: 256 | lm loss: 2.148894E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.816 | TFLOPs: 19.83 | 31: 
iteration 27920/ 173500 | consumed samples: 7147520 | consumed tokens: 14638120960 | elapsed time per iteration (s): 0.75 | learning rate: 1.899E-04 | global batch size: 256 | lm loss: 2.130060E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.556 | TFLOPs: 20.66 | 31: iteration 27930/ 173500 | consumed samples: 7150080 | consumed tokens: 14643363840 | elapsed time per iteration (s): 0.75 | learning rate: 1.899E-04 | global batch size: 256 | lm loss: 2.144755E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.874 | TFLOPs: 20.56 | 31: iteration 27940/ 173500 | consumed samples: 7152640 | consumed tokens: 14648606720 | elapsed time per iteration (s): 0.77 | learning rate: 1.899E-04 | global batch size: 256 | lm loss: 2.195570E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.924 | TFLOPs: 20.08 | 31: iteration 27950/ 173500 | consumed samples: 7155200 | consumed tokens: 14653849600 | elapsed time per iteration (s): 0.75 | learning rate: 1.899E-04 | global batch size: 256 | lm loss: 2.155696E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.142 | TFLOPs: 20.64 | 31: iteration 27960/ 173500 | consumed samples: 7157760 | consumed tokens: 14659092480 | elapsed time per iteration (s): 0.78 | learning rate: 1.898E-04 | global batch size: 256 | lm loss: 2.145851E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.454 | TFLOPs: 19.87 | 31: iteration 27970/ 173500 | consumed samples: 7160320 | consumed tokens: 14664335360 | elapsed time per iteration (s): 0.74 | learning rate: 1.898E-04 | global batch size: 256 | lm loss: 2.148321E+00 | grad norm: 0.137 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.538 | TFLOPs: 20.90 | 31: iteration 27980/ 173500 | consumed samples: 7162880 | consumed tokens: 14669578240 | elapsed time per iteration (s): 0.74 | learning rate: 1.898E-04 | global batch size: 256 | lm loss: 2.160346E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.680 | TFLOPs: 21.03 | 31: iteration 27990/ 173500 | consumed samples: 7165440 | consumed tokens: 14674821120 | elapsed time per iteration (s): 0.79 | learning rate: 1.898E-04 | global batch size: 256 | lm loss: 2.146404E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.619 | TFLOPs: 19.64 | 0: [2022-11-26 00:23:50,449] [INFO] [logging.py:68:log_dist] [Rank 0] step=28000, skipped=0, lr=[0.00018981345832700956, 0.00018981345832700956, 0.00018981345832700956], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 28000/ 173500 | consumed samples: 7168000 | consumed tokens: 14680064000 | elapsed time per iteration (s): 0.74 | learning rate: 1.898E-04 | global batch size: 256 | lm loss: 2.149340E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.483 | TFLOPs: 20.96 | 0: steps: 28000 loss: 2.1952 iter time (s): 0.817 samples/sec: 313.483 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 28000 | lm loss value: 2.158474E+00 | lm loss PPL: 8.657915E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 28000 to checkpoints_1b1long 0: [2022-11-26 00:23:50,715] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step28000 is begin to save! 
0: [2022-11-26 00:23:50,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_01-model_00-model_states.pt... 0: [2022-11-26 00:23:50,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_01-model_00-model_states.pt. 0: [2022-11-26 00:23:50,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_03-model_00-model_states.pt... 0: [2022-11-26 00:23:51,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_03-model_00-model_states.pt. 0: [2022-11-26 00:23:51,014] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_04-model_00-model_states.pt... 0: [2022-11-26 00:23:51,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_04-model_00-model_states.pt. 0: [2022-11-26 00:23:51,089] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_05-model_00-model_states.pt... 0: [2022-11-26 00:23:51,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_05-model_00-model_states.pt. 0: [2022-11-26 00:23:51,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_06-model_00-model_states.pt... 0: [2022-11-26 00:23:51,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_06-model_00-model_states.pt. 0: [2022-11-26 00:23:51,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_07-model_00-model_states.pt... 0: [2022-11-26 00:23:51,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 00:23:51,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_08-model_00-model_states.pt... 0: [2022-11-26 00:23:51,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_08-model_00-model_states.pt. 0: [2022-11-26 00:23:51,398] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_09-model_00-model_states.pt... 0: [2022-11-26 00:23:51,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_09-model_00-model_states.pt. 0: [2022-11-26 00:23:51,477] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_10-model_00-model_states.pt... 0: [2022-11-26 00:23:51,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_10-model_00-model_states.pt. 0: [2022-11-26 00:23:51,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_11-model_00-model_states.pt... 0: [2022-11-26 00:23:51,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_11-model_00-model_states.pt. 0: [2022-11-26 00:23:51,631] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_12-model_00-model_states.pt... 0: [2022-11-26 00:23:51,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_12-model_00-model_states.pt. 0: [2022-11-26 00:23:51,708] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_13-model_00-model_states.pt... 0: [2022-11-26 00:23:51,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 00:23:51,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_14-model_00-model_states.pt... 0: [2022-11-26 00:23:51,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_14-model_00-model_states.pt. 0: [2022-11-26 00:23:51,863] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_15-model_00-model_states.pt... 0: [2022-11-26 00:23:51,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_15-model_00-model_states.pt. 0: [2022-11-26 00:23:51,939] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_16-model_00-model_states.pt... 0: [2022-11-26 00:23:52,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_16-model_00-model_states.pt. 0: [2022-11-26 00:23:52,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_17-model_00-model_states.pt... 0: [2022-11-26 00:23:52,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_17-model_00-model_states.pt. 0: [2022-11-26 00:23:52,092] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_18-model_00-model_states.pt... 0: [2022-11-26 00:23:52,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_18-model_00-model_states.pt. 0: [2022-11-26 00:23:52,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_19-model_00-model_states.pt... 0: [2022-11-26 00:23:52,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 00:23:52,247] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_20-model_00-model_states.pt... 0: [2022-11-26 00:23:52,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_20-model_00-model_states.pt. 0: [2022-11-26 00:23:52,321] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_21-model_00-model_states.pt... 0: [2022-11-26 00:23:52,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_21-model_00-model_states.pt. 0: [2022-11-26 00:23:52,400] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_22-model_00-model_states.pt... 0: [2022-11-26 00:23:52,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_22-model_00-model_states.pt. 0: [2022-11-26 00:23:52,474] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_23-model_00-model_states.pt... 0: [2022-11-26 00:23:52,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_23-model_00-model_states.pt. 0: [2022-11-26 00:23:52,553] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_24-model_00-model_states.pt... 0: [2022-11-26 00:23:52,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_24-model_00-model_states.pt. 0: [2022-11-26 00:23:52,628] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_25-model_00-model_states.pt... 0: [2022-11-26 00:23:52,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 00:23:52,712] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_26-model_00-model_states.pt... 0: [2022-11-26 00:23:52,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_26-model_00-model_states.pt. 0: [2022-11-26 00:23:52,795] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_27-model_00-model_states.pt... 0: [2022-11-26 00:23:52,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_27-model_00-model_states.pt. 0: [2022-11-26 00:23:52,872] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_28-model_00-model_states.pt... 0: [2022-11-26 00:23:52,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_28-model_00-model_states.pt. 0: [2022-11-26 00:23:52,949] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/layer_30-model_00-model_states.pt... 0: [2022-11-26 00:23:52,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/layer_30-model_00-model_states.pt. 0: [2022-11-26 00:23:52,952] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step28000/mp_rank_00_model_states.pt 0: [2022-11-26 00:23:52,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/mp_rank_00_model_states.pt... 0: [2022-11-26 00:23:52,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/mp_rank_00_model_states.pt. 0: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
5: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
8: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
3: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 
20: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 
24: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 
21: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 
19: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 
5: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
9: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 
16: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
3: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 
25: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 
14: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 
17: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 
19: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
4: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
13: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
28: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 
30: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 
27: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
2: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 28: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 
22: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
10: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 28: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 28: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 
28: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:23:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:23:53,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:23:53,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 7: [2022-11-26 00:23:53,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:23:53,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 7: [2022-11-26 00:23:53,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 00:23:53,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 23: [2022-11-26 00:23:53,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 00:23:53,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 00:23:53,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 31: [2022-11-26 00:23:53,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 
31: [2022-11-26 00:23:53,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 00:23:53,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 17: [2022-11-26 00:23:53,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:23:53,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 00:23:53,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 1: [2022-11-26 00:23:53,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:23:53,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 5: [2022-11-26 00:23:53,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:23:53,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 11: [2022-11-26 00:23:53,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:23:53,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 28: [2022-11-26 00:23:53,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
5: [2022-11-26 00:23:53,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 11: [2022-11-26 00:23:53,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 00:23:53,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 5: [2022-11-26 00:23:53,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 11: [2022-11-26 00:23:53,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 11: [2022-11-26 00:23:53,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 0: [2022-11-26 00:23:53,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:23:53,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:23:53,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 00:23:53,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 29: [2022-11-26 00:23:53,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
29: [2022-11-26 00:23:53,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 00:23:53,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 2: [2022-11-26 00:23:53,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:23:53,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 00:23:53,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 2: [2022-11-26 00:23:53,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:23:53,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 00:23:53,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 30: [2022-11-26 00:23:53,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 00:23:53,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 00:23:53,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 30: [2022-11-26 00:23:53,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
30: [2022-11-26 00:23:53,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 00:23:53,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 17: [2022-11-26 00:23:53,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:23:53,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 00:23:53,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 26: [2022-11-26 00:23:53,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:23:53,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 00:23:53,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 24: [2022-11-26 00:23:53,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 00:23:53,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 00:23:53,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 15: [2022-11-26 00:23:53,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 00:23:53,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 00:23:53,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 0: [2022-11-26 00:23:53,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:23:53,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:23:53,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 7: [2022-11-26 00:23:53,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 00:23:53,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 0: [2022-11-26 00:23:53,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 12: [2022-11-26 00:23:53,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:23:53,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
12: [2022-11-26 00:23:53,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 26: [2022-11-26 00:23:53,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 10: [2022-11-26 00:23:53,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:23:53,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:23:53,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 10: [2022-11-26 00:23:53,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 00:23:53,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 12: [2022-11-26 00:23:53,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 10: [2022-11-26 00:23:53,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 10: [2022-11-26 00:23:53,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 4: [2022-11-26 00:23:53,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 00:23:53,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 00:23:53,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 15: [2022-11-26 00:23:53,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:23:53,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 00:23:53,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 9: [2022-11-26 00:23:53,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:23:53,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 20: [2022-11-26 00:23:53,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:23:53,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 20: [2022-11-26 00:23:53,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 00:23:53,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 9: [2022-11-26 00:23:53,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 00:23:53,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 00:23:53,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:23:53,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 9: [2022-11-26 00:23:53,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 00:23:53,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 12: [2022-11-26 00:23:53,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:23:53,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 00:23:53,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 29: [2022-11-26 00:23:53,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:23:53,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
29: [2022-11-26 00:23:53,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 00:23:53,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 00:23:53,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 29: [2022-11-26 00:23:53,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 28: [2022-11-26 00:23:53,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 00:23:53,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 28: [2022-11-26 00:23:53,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 00:23:53,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 00:23:53,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 11: [2022-11-26 00:23:53,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:23:53,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 00:23:53,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
15: [2022-11-26 00:23:53,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:23:53,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 00:23:53,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 31: [2022-11-26 00:23:53,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 00:23:53,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 00:23:53,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 11: [2022-11-26 00:23:53,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:23:53,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 00:23:53,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 2: [2022-11-26 00:23:53,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-26 00:23:53,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 30: [2022-11-26 00:23:53,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:23:53,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 30: [2022-11-26 00:23:53,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 00:23:53,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 17: [2022-11-26 00:23:53,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:23:53,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:23:53,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 17: [2022-11-26 00:23:53,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 22: [2022-11-26 00:23:53,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 5: [2022-11-26 00:23:53,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 00:23:53,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 00:23:53,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 23: [2022-11-26 00:23:53,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:23:53,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 5: [2022-11-26 00:23:53,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 23: [2022-11-26 00:23:53,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 00:23:53,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 00:23:53,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 5: [2022-11-26 00:23:53,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 23: [2022-11-26 00:23:53,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 5: [2022-11-26 00:23:53,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 23: [2022-11-26 00:23:53,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
10: [2022-11-26 00:23:53,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:23:53,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 1: [2022-11-26 00:23:53,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:23:53,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 26: [2022-11-26 00:23:53,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:23:53,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 24: [2022-11-26 00:23:53,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:23:53,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 00:23:53,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 1: [2022-11-26 00:23:53,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 6: [2022-11-26 00:23:53,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:23:53,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
24: [2022-11-26 00:23:53,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 6: [2022-11-26 00:23:53,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 00:23:53,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 00:23:53,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 6: [2022-11-26 00:23:53,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 4: [2022-11-26 00:23:53,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 24: [2022-11-26 00:23:53,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 4: [2022-11-26 00:23:53,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 00:23:53,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 24: [2022-11-26 00:23:53,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 00:23:53,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 31: [2022-11-26 00:23:53,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
24: [2022-11-26 00:23:53,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 31: [2022-11-26 00:23:53,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 00:23:53,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 25: [2022-11-26 00:23:53,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:23:53,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:23:53,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 00:23:53,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 00:23:53,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 25: [2022-11-26 00:23:53,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 25: [2022-11-26 00:23:53,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:23:53,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 00:23:53,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
0: [2022-11-26 00:23:53,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:23:53,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:23:53,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:23:53,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 00:23:53,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 1: [2022-11-26 00:23:53,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:23:53,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 7: [2022-11-26 00:23:53,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 7: [2022-11-26 00:23:53,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 1: [2022-11-26 00:23:53,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 00:23:53,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 20: [2022-11-26 00:23:53,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
0: [2022-11-26 00:23:53,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 12: [2022-11-26 00:23:53,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 20: [2022-11-26 00:23:53,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 22: [2022-11-26 00:23:53,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:23:53,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:23:53,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 20: [2022-11-26 00:23:53,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 22: [2022-11-26 00:23:53,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 12: [2022-11-26 00:23:53,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 20: [2022-11-26 00:23:53,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:23:53,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
0: [2022-11-26 00:23:53,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 20: [2022-11-26 00:23:53,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:23:53,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 20: [2022-11-26 00:23:53,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 00:23:53,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 00:23:53,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 20: [2022-11-26 00:23:53,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 26: [2022-11-26 00:23:53,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:23:53,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 13: [2022-11-26 00:23:53,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:23:53,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:23:53,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
13: [2022-11-26 00:23:53,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 00:23:53,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 00:23:53,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 13: [2022-11-26 00:23:53,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 15: [2022-11-26 00:23:53,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:23:53,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 00:23:53,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 16: [2022-11-26 00:23:53,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 00:23:53,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 00:23:53,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 00:23:53,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 00:23:53,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
16: [2022-11-26 00:23:53,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 25: [2022-11-26 00:23:53,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:23:53,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 00:23:53,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 30: [2022-11-26 00:23:53,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 00:23:53,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 00:23:53,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 9: [2022-11-26 00:23:53,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:23:53,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 00:23:53,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 28: [2022-11-26 00:23:53,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:23:53,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 00:23:53,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 00:23:53,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 8: [2022-11-26 00:23:53,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:23:53,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:23:53,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 00:23:53,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 00:23:53,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 8: [2022-11-26 00:23:53,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 12: [2022-11-26 00:23:53,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:23:53,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 00:23:53,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 4: [2022-11-26 00:23:53,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 00:23:53,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 00:23:53,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 24: [2022-11-26 00:23:53,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 31: [2022-11-26 00:23:53,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 00:23:53,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 00:23:53,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 24: [2022-11-26 00:23:53,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 00:23:53,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 16: [2022-11-26 00:23:53,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:23:53,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 16: [2022-11-26 00:23:53,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 00:23:53,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
17: [2022-11-26 00:23:53,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 00:23:53,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 13: [2022-11-26 00:23:53,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:23:53,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 00:23:53,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 28: [2022-11-26 00:23:53,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 00:23:53,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 28: [2022-11-26 00:23:53,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 00:23:53,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 00:23:53,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 2: [2022-11-26 00:23:53,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 00:23:53,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 00:23:53,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 29: [2022-11-26 00:23:53,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:23:53,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 00:23:53,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 1: [2022-11-26 00:23:53,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:23:53,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 00:23:53,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 8: [2022-11-26 00:23:53,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:23:53,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 00:23:53,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 6: [2022-11-26 00:23:53,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 00:23:53,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 00:23:53,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 22: [2022-11-26 00:23:53,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:23:53,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 00:23:53,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 19: [2022-11-26 00:23:53,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:23:53,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:23:53,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 
19: [2022-11-26 00:23:53,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 00:23:53,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 00:23:53,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 00:23:53,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 19: [2022-11-26 00:23:53,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 19: [2022-11-26 00:23:53,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 21: [2022-11-26 00:23:53,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:23:53,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:23:53,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:23:53,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-26 00:23:53,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 00:23:53,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 00:23:53,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 21: [2022-11-26 00:23:53,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 21: [2022-11-26 00:23:53,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 00:23:53,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 00:23:53,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 21: [2022-11-26 00:23:53,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 5: [2022-11-26 00:23:53,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:23:53,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 00:23:53,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 23: [2022-11-26 00:23:53,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
23: [2022-11-26 00:23:53,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 00:23:53,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 3: [2022-11-26 00:23:53,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:23:53,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:23:53,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:23:53,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 00:23:53,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 00:23:53,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:23:53,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 00:23:53,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 3: [2022-11-26 00:23:53,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 3: [2022-11-26 00:23:53,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
3: [2022-11-26 00:23:53,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 00:23:53,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 9: [2022-11-26 00:23:53,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:23:53,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 00:23:53,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 18: [2022-11-26 00:23:53,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 00:23:53,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 00:23:53,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 00:23:53,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
18: [2022-11-26 00:23:53,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 00:23:53,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 00:23:53,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 00:23:53,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 00:23:53,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 18: [2022-11-26 00:23:53,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 18: [2022-11-26 00:23:53,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 24: [2022-11-26 00:23:53,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 18: [2022-11-26 00:23:53,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 24: [2022-11-26 00:23:53,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 00:23:53,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 14: [2022-11-26 00:23:53,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 00:23:53,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:23:53,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:23:53,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:23:53,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 00:23:53,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 00:23:53,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 00:23:53,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 00:23:53,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 14: [2022-11-26 00:23:53,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 14: [2022-11-26 00:23:53,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 14: [2022-11-26 00:23:53,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 27: [2022-11-26 00:23:53,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
27: [2022-11-26 00:23:53,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 00:23:53,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 00:23:53,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 00:23:53,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 00:23:53,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 00:23:53,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 00:23:53,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 00:23:53,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 27: [2022-11-26 00:23:53,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 27: [2022-11-26 00:23:53,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 27: [2022-11-26 00:23:53,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 20: [2022-11-26 00:23:53,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 
20: [2022-11-26 00:23:53,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 00:23:53,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 15: [2022-11-26 00:23:53,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:23:53,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 00:23:53,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 0: [2022-11-26 00:23:53,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:23:53,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 00:23:53,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 5: [2022-11-26 00:23:53,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:23:53,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 00:23:53,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 18: [2022-11-26 00:23:53,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
18: [2022-11-26 00:23:53,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 00:23:53,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 29: [2022-11-26 00:23:53,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:23:53,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 00:23:53,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 0: [2022-11-26 00:23:53,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 00:23:53,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 31: [2022-11-26 00:23:53,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 00:23:53,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 00:23:53,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 12: [2022-11-26 00:23:53,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 00:23:53,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 00:23:53,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 7: [2022-11-26 00:23:53,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:23:53,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 00:23:53,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 2: [2022-11-26 00:23:53,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:23:53,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 00:23:53,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 11: [2022-11-26 00:23:53,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:23:53,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 00:23:53,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 27: [2022-11-26 00:23:53,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
26: [2022-11-26 00:23:53,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 27: [2022-11-26 00:23:53,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 26: [2022-11-26 00:23:53,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 27: [2022-11-26 00:23:53,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 26: [2022-11-26 00:23:53,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 3: [2022-11-26 00:23:53,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:23:53,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 00:23:53,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 14: [2022-11-26 00:23:53,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:23:53,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 00:23:53,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 4: [2022-11-26 00:23:53,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 00:23:53,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 00:23:53,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 28: [2022-11-26 00:23:53,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 00:23:53,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 21: [2022-11-26 00:23:53,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:23:53,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 00:23:53,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 16: [2022-11-26 00:23:53,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 00:23:53,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 00:23:53,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 30: [2022-11-26 00:23:53,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-26 00:23:53,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 28: [2022-11-26 00:23:53,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 30: [2022-11-26 00:23:53,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 6: [2022-11-26 00:23:53,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:23:53,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 00:23:53,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 13: [2022-11-26 00:23:53,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:23:53,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 00:23:53,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 23: [2022-11-26 00:23:53,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 00:23:53,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 00:23:53,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
8: [2022-11-26 00:23:53,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:23:53,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 00:23:53,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 22: [2022-11-26 00:23:53,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:23:53,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 00:23:53,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 17: [2022-11-26 00:23:53,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:23:53,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 00:23:53,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 10: [2022-11-26 00:23:53,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:23:53,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 00:23:53,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
25: [2022-11-26 00:23:53,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:23:53,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 00:23:53,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 19: [2022-11-26 00:23:53,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:23:53,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 00:23:53,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 0: [2022-11-26 00:23:53,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:23:53,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 00:23:53,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 9: [2022-11-26 00:23:53,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:23:53,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 00:23:53,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
29: [2022-11-26 00:23:53,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:23:53,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 00:23:53,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 20: [2022-11-26 00:23:53,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 00:23:53,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 00:23:53,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 24: [2022-11-26 00:23:53,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 00:23:53,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 00:23:53,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 2: [2022-11-26 00:23:53,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:23:53,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 00:23:53,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
1: [2022-11-26 00:23:53,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:23:53,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 00:23:53,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 11: [2022-11-26 00:23:53,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:23:53,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 00:23:53,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 3: [2022-11-26 00:23:53,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:23:53,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 00:23:53,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 15: [2022-11-26 00:23:53,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:23:53,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 00:23:53,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
12: [2022-11-26 00:23:53,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:23:53,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 00:23:53,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 7: [2022-11-26 00:23:53,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:23:53,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:23:53,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 00:23:53,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 26: [2022-11-26 00:23:53,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 00:23:53,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 27: [2022-11-26 00:23:53,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 00:23:53,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 00:23:53,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
18: [2022-11-26 00:23:53,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 00:23:53,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 00:23:53,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 14: [2022-11-26 00:23:53,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:23:53,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 00:23:53,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 31: [2022-11-26 00:23:53,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 00:23:53,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 00:23:53,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 5: [2022-11-26 00:23:53,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:23:53,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 00:23:53,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
30: [2022-11-26 00:23:53,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 00:23:53,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 00:23:53,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 4: [2022-11-26 00:23:53,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:23:53,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 00:23:53,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 6: [2022-11-26 00:23:53,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:23:53,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 00:23:53,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 16: [2022-11-26 00:23:53,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 00:23:53,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 00:23:53,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
13: [2022-11-26 00:23:53,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:23:53,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 00:23:53,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 8: [2022-11-26 00:23:53,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:23:53,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:23:53,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 00:23:53,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 21: [2022-11-26 00:23:53,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 00:23:53,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 28: [2022-11-26 00:23:53,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 00:23:53,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 00:23:53,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
0: [2022-11-26 00:23:53,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:23:53,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 00:23:53,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 23: [2022-11-26 00:23:53,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 00:23:53,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 00:23:53,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 17: [2022-11-26 00:23:53,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:23:53,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 00:23:53,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 24: [2022-11-26 00:23:53,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 00:23:53,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 00:23:53,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
22: [2022-11-26 00:23:53,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:23:53,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 00:23:53,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 20: [2022-11-26 00:23:53,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 00:23:53,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 00:23:53,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 19: [2022-11-26 00:23:53,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:23:53,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 00:23:53,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 15: [2022-11-26 00:23:53,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:23:53,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 00:23:53,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
1: [2022-11-26 00:23:53,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:23:53,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 00:23:53,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 7: [2022-11-26 00:23:53,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:23:53,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 00:23:53,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 9: [2022-11-26 00:23:53,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:23:53,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 00:23:53,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 18: [2022-11-26 00:23:53,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 00:23:53,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 00:23:53,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
2: [2022-11-26 00:23:53,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:23:53,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 00:23:53,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 5: [2022-11-26 00:23:53,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:23:53,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 00:23:53,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 31: [2022-11-26 00:23:53,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 00:23:53,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 00:23:53,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 11: [2022-11-26 00:23:53,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:23:53,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 00:23:53,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
12: [2022-11-26 00:23:53,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:23:53,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 00:23:53,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 26: [2022-11-26 00:23:53,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:23:53,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 00:23:53,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 3: [2022-11-26 00:23:53,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:23:53,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 00:23:53,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 27: [2022-11-26 00:23:53,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 00:23:53,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 00:23:53,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
28: [2022-11-26 00:23:53,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 00:23:53,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 00:23:53,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 14: [2022-11-26 00:23:53,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:23:53,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 00:23:53,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 25: [2022-11-26 00:23:53,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:23:53,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 00:23:53,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 30: [2022-11-26 00:23:53,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 00:23:53,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 00:23:53,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
10: [2022-11-26 00:23:53,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:23:53,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 00:23:53,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 4: [2022-11-26 00:23:53,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:23:53,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 00:23:53,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 23: [2022-11-26 00:23:53,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 00:23:53,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 00:23:53,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 21: [2022-11-26 00:23:53,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:23:53,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 00:23:53,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
29: [2022-11-26 00:23:53,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:23:53,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 00:23:53,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 7: [2022-11-26 00:23:53,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:23:53,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 00:23:53,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 17: [2022-11-26 00:23:53,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:23:53,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 00:23:53,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 13: [2022-11-26 00:23:53,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:23:53,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 00:23:53,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
0: [2022-11-26 00:23:53,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:23:53,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 00:23:53,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 25: [2022-11-26 00:23:53,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:23:53,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 00:23:53,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 15: [2022-11-26 00:23:53,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:23:53,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 00:23:53,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 11: [2022-11-26 00:23:53,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:23:53,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 00:23:53,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
31: [2022-11-26 00:23:53,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 00:23:53,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 00:23:53,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 5: [2022-11-26 00:23:53,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:23:53,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 00:23:53,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 30: [2022-11-26 00:23:53,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 00:23:53,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 00:23:53,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 23: [2022-11-26 00:23:53,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 00:23:53,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 00:23:53,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
12: [2022-11-26 00:23:53,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:23:53,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 00:23:53,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 1: [2022-11-26 00:23:53,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:23:53,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 00:23:53,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 27: [2022-11-26 00:23:53,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:23:53,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 27: [2022-11-26 00:23:53,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 4: [2022-11-26 00:23:53,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 27: [2022-11-26 00:23:53,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 26: [2022-11-26 00:23:53,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
4: [2022-11-26 00:23:53,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 28: [2022-11-26 00:23:53,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:23:53,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 00:23:53,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 28: [2022-11-26 00:23:53,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 00:23:53,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 24: [2022-11-26 00:23:53,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 00:23:53,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 16: [2022-11-26 00:23:53,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 00:23:53,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 24: [2022-11-26 00:23:53,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 16: [2022-11-26 00:23:53,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
10: [2022-11-26 00:23:53,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:23:53,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 00:23:53,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 22: [2022-11-26 00:23:53,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:23:53,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 29: [2022-11-26 00:23:53,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:23:53,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 29: [2022-11-26 00:23:53,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 00:23:53,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 25: [2022-11-26 00:23:53,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:23:53,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 00:23:53,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
3: [2022-11-26 00:23:53,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:23:53,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 14: [2022-11-26 00:23:53,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:23:53,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 14: [2022-11-26 00:23:53,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 21: [2022-11-26 00:23:53,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:23:53,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 00:23:53,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 14: [2022-11-26 00:23:53,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 18: [2022-11-26 00:23:53,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 00:23:53,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 00:23:53,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
8: [2022-11-26 00:23:53,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:23:53,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 00:23:53,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 13: [2022-11-26 00:23:53,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:23:53,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 00:23:53,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 9: [2022-11-26 00:23:53,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:23:53,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 00:23:53,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 6: [2022-11-26 00:23:53,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:23:53,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 00:23:53,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
20: [2022-11-26 00:23:53,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 00:23:53,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 00:23:53,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 19: [2022-11-26 00:23:53,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:23:53,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 00:23:53,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 17: [2022-11-26 00:23:53,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:23:53,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 00:23:53,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 22: [2022-11-26 00:23:53,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:23:53,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 00:23:53,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
10: [2022-11-26 00:23:53,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:23:53,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 00:23:53,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 19: [2022-11-26 00:23:53,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:23:53,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 00:23:53,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 16: [2022-11-26 00:23:53,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 00:23:53,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 00:23:53,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 00:23:53,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 00:23:53,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 16: [2022-11-26 00:23:53,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
6: [2022-11-26 00:23:53,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:23:53,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 00:23:53,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 13: [2022-11-26 00:23:53,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:23:53,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 00:23:53,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 8: [2022-11-26 00:23:53,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:23:53,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:23:53,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 00:23:53,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 00:23:53,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 8: [2022-11-26 00:23:53,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
22: [2022-11-26 00:23:53,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:23:53,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 00:23:53,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 19: [2022-11-26 00:23:53,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:23:53,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 00:23:53,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 6: [2022-11-26 00:23:53,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:23:53,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 00:23:53,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 2: [2022-11-26 00:23:53,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:23:53,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step28000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 00:23:53,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
0: successfully saved checkpoint at iteration 28000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2564.20 31: iteration 28010/ 173500 | consumed samples: 7170560 | consumed tokens: 14685306880 | elapsed time per iteration (s): 1.01 | learning rate: 1.898E-04 | global batch size: 256 | lm loss: 2.170906E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 254.315 | TFLOPs: 15.39 | 31: iteration 28020/ 173500 | consumed samples: 7173120 | consumed tokens: 14690549760 | elapsed time per iteration (s): 0.73 | learning rate: 1.898E-04 | global batch size: 256 | lm loss: 2.119345E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.546 | TFLOPs: 21.15 | 31: iteration 28030/ 173500 | consumed samples: 7175680 | consumed tokens: 14695792640 | elapsed time per iteration (s): 0.75 | learning rate: 1.898E-04 | global batch size: 256 | lm loss: 2.160053E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.398 | TFLOPs: 20.65 | 31: iteration 28040/ 173500 | consumed samples: 7178240 | consumed tokens: 14701035520 | elapsed time per iteration (s): 0.75 | learning rate: 1.898E-04 | global batch size: 256 | lm loss: 2.156826E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.036 | TFLOPs: 20.63 | 31: iteration 28050/ 173500 | consumed samples: 7180800 | consumed tokens: 14706278400 | elapsed time per iteration (s): 0.76 | learning rate: 1.898E-04 | global batch size: 256 | lm loss: 2.127387E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.730 | TFLOPs: 20.31 | 31: iteration 28060/ 173500 | consumed samples: 7183360 | consumed tokens: 14711521280 | elapsed time per iteration (s): 0.75 | 
learning rate: 1.898E-04 | global batch size: 256 | lm loss: 2.139752E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.682 | TFLOPs: 20.67 | 31: iteration 28070/ 173500 | consumed samples: 7185920 | consumed tokens: 14716764160 | elapsed time per iteration (s): 0.80 | learning rate: 1.898E-04 | global batch size: 256 | lm loss: 2.180075E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.252 | TFLOPs: 19.25 | 31: iteration 28080/ 173500 | consumed samples: 7188480 | consumed tokens: 14722007040 | elapsed time per iteration (s): 0.74 | learning rate: 1.898E-04 | global batch size: 256 | lm loss: 2.144809E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.257 | TFLOPs: 21.01 | 31: iteration 28090/ 173500 | consumed samples: 7191040 | consumed tokens: 14727249920 | elapsed time per iteration (s): 0.77 | learning rate: 1.897E-04 | global batch size: 256 | lm loss: 2.149815E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.091 | TFLOPs: 20.15 | 31: iteration 28100/ 173500 | consumed samples: 7193600 | consumed tokens: 14732492800 | elapsed time per iteration (s): 0.76 | learning rate: 1.897E-04 | global batch size: 256 | lm loss: 2.166529E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.517 | TFLOPs: 20.48 | 31: iteration 28110/ 173500 | consumed samples: 7196160 | consumed tokens: 14737735680 | elapsed time per iteration (s): 0.77 | learning rate: 1.897E-04 | global batch size: 256 | lm loss: 2.150016E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.511 | TFLOPs: 20.06 | 31: iteration 28120/ 173500 
| consumed samples: 7198720 | consumed tokens: 14742978560 | elapsed time per iteration (s): 0.81 | learning rate: 1.897E-04 | global batch size: 256 | lm loss: 2.147973E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.086 | TFLOPs: 19.18 | 31: iteration 28130/ 173500 | consumed samples: 7201280 | consumed tokens: 14748221440 | elapsed time per iteration (s): 0.75 | learning rate: 1.897E-04 | global batch size: 256 | lm loss: 2.174862E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.321 | TFLOPs: 20.59 | 31: iteration 28140/ 173500 | consumed samples: 7203840 | consumed tokens: 14753464320 | elapsed time per iteration (s): 0.75 | learning rate: 1.897E-04 | global batch size: 256 | lm loss: 2.163986E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.528 | TFLOPs: 20.60 | 31: iteration 28150/ 173500 | consumed samples: 7206400 | consumed tokens: 14758707200 | elapsed time per iteration (s): 0.76 | learning rate: 1.897E-04 | global batch size: 256 | lm loss: 2.158764E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.292 | TFLOPs: 20.47 | 31: iteration 28160/ 173500 | consumed samples: 7208960 | consumed tokens: 14763950080 | elapsed time per iteration (s): 0.78 | learning rate: 1.897E-04 | global batch size: 256 | lm loss: 2.127994E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.787 | TFLOPs: 19.83 | 31: iteration 28170/ 173500 | consumed samples: 7211520 | consumed tokens: 14769192960 | elapsed time per iteration (s): 0.76 | learning rate: 1.897E-04 | global batch size: 256 | lm loss: 2.144054E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 338.140 | TFLOPs: 20.46 | 31: iteration 28180/ 173500 | consumed samples: 7214080 | consumed tokens: 14774435840 | elapsed time per iteration (s): 0.80 | learning rate: 1.897E-04 | global batch size: 256 | lm loss: 2.130219E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.362 | TFLOPs: 19.38 | 31: iteration 28190/ 173500 | consumed samples: 7216640 | consumed tokens: 14779678720 | elapsed time per iteration (s): 0.78 | learning rate: 1.897E-04 | global batch size: 256 | lm loss: 2.153719E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.448 | TFLOPs: 19.87 | 31: iteration 28200/ 173500 | consumed samples: 7219200 | consumed tokens: 14784921600 | elapsed time per iteration (s): 0.85 | learning rate: 1.897E-04 | global batch size: 256 | lm loss: 2.118211E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.980 | TFLOPs: 18.27 | 31: iteration 28210/ 173500 | consumed samples: 7221760 | consumed tokens: 14790164480 | elapsed time per iteration (s): 0.75 | learning rate: 1.897E-04 | global batch size: 256 | lm loss: 2.135413E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.910 | TFLOPs: 20.56 | 31: iteration 28220/ 173500 | consumed samples: 7224320 | consumed tokens: 14795407360 | elapsed time per iteration (s): 0.79 | learning rate: 1.896E-04 | global batch size: 256 | lm loss: 2.167407E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.862 | TFLOPs: 19.53 | 31: iteration 28230/ 173500 | consumed samples: 7226880 | consumed tokens: 14800650240 | elapsed time per iteration (s): 0.81 | learning rate: 1.896E-04 | global batch size: 
256 | lm loss: 2.138246E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.151 | TFLOPs: 19.19 | 31: iteration 28240/ 173500 | consumed samples: 7229440 | consumed tokens: 14805893120 | elapsed time per iteration (s): 0.78 | learning rate: 1.896E-04 | global batch size: 256 | lm loss: 2.139736E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.380 | TFLOPs: 19.87 | 31: iteration 28250/ 173500 | consumed samples: 7232000 | consumed tokens: 14811136000 | elapsed time per iteration (s): 0.78 | learning rate: 1.896E-04 | global batch size: 256 | lm loss: 2.147657E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.151 | TFLOPs: 19.79 | 31: iteration 28260/ 173500 | consumed samples: 7234560 | consumed tokens: 14816378880 | elapsed time per iteration (s): 0.78 | learning rate: 1.896E-04 | global batch size: 256 | lm loss: 2.143371E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.423 | TFLOPs: 19.93 | 31: iteration 28270/ 173500 | consumed samples: 7237120 | consumed tokens: 14821621760 | elapsed time per iteration (s): 0.81 | learning rate: 1.896E-04 | global batch size: 256 | lm loss: 2.132393E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.713 | TFLOPs: 19.04 | 31: iteration 28280/ 173500 | consumed samples: 7239680 | consumed tokens: 14826864640 | elapsed time per iteration (s): 0.78 | learning rate: 1.896E-04 | global batch size: 256 | lm loss: 2.155711E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.426 | TFLOPs: 19.87 | 31: iteration 28290/ 173500 | consumed samples: 7242240 | consumed 
tokens: 14832107520 | elapsed time per iteration (s): 0.80 | learning rate: 1.896E-04 | global batch size: 256 | lm loss: 2.162398E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.396 | TFLOPs: 19.26 | 31: iteration 28300/ 173500 | consumed samples: 7244800 | consumed tokens: 14837350400 | elapsed time per iteration (s): 0.78 | learning rate: 1.896E-04 | global batch size: 256 | lm loss: 2.144200E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.323 | TFLOPs: 19.80 | 31: iteration 28310/ 173500 | consumed samples: 7247360 | consumed tokens: 14842593280 | elapsed time per iteration (s): 0.79 | learning rate: 1.896E-04 | global batch size: 256 | lm loss: 2.175120E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.613 | TFLOPs: 19.64 | 31: iteration 28320/ 173500 | consumed samples: 7249920 | consumed tokens: 14847836160 | elapsed time per iteration (s): 0.78 | learning rate: 1.896E-04 | global batch size: 256 | lm loss: 2.201909E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.045 | TFLOPs: 19.85 | 31: iteration 28330/ 173500 | consumed samples: 7252480 | consumed tokens: 14853079040 | elapsed time per iteration (s): 0.78 | learning rate: 1.896E-04 | global batch size: 256 | lm loss: 2.171860E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.075 | TFLOPs: 19.85 | 31: iteration 28340/ 173500 | consumed samples: 7255040 | consumed tokens: 14858321920 | elapsed time per iteration (s): 0.76 | learning rate: 1.896E-04 | global batch size: 256 | lm loss: 2.154988E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 335.662 | TFLOPs: 20.31 | 31: iteration 28350/ 173500 | consumed samples: 7257600 | consumed tokens: 14863564800 | elapsed time per iteration (s): 0.77 | learning rate: 1.895E-04 | global batch size: 256 | lm loss: 2.138605E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.161 | TFLOPs: 20.09 | 31: iteration 28360/ 173500 | consumed samples: 7260160 | consumed tokens: 14868807680 | elapsed time per iteration (s): 0.87 | learning rate: 1.895E-04 | global batch size: 256 | lm loss: 2.158678E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.828 | TFLOPs: 17.72 | 31: iteration 28370/ 173500 | consumed samples: 7262720 | consumed tokens: 14874050560 | elapsed time per iteration (s): 0.77 | learning rate: 1.895E-04 | global batch size: 256 | lm loss: 2.166050E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.956 | TFLOPs: 20.20 | 31: iteration 28380/ 173500 | consumed samples: 7265280 | consumed tokens: 14879293440 | elapsed time per iteration (s): 0.81 | learning rate: 1.895E-04 | global batch size: 256 | lm loss: 2.131331E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.671 | TFLOPs: 19.04 | 31: iteration 28390/ 173500 | consumed samples: 7267840 | consumed tokens: 14884536320 | elapsed time per iteration (s): 0.77 | learning rate: 1.895E-04 | global batch size: 256 | lm loss: 2.143303E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.746 | TFLOPs: 20.13 | 31: iteration 28400/ 173500 | consumed samples: 7270400 | consumed tokens: 14889779200 | elapsed time per iteration (s): 0.78 | learning rate: 1.895E-04 | global batch size: 256 | lm loss: 2.168738E+00 | grad norm: 
0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.099 | TFLOPs: 19.91 | 31: iteration 28410/ 173500 | consumed samples: 7272960 | consumed tokens: 14895022080 | elapsed time per iteration (s): 0.76 | learning rate: 1.895E-04 | global batch size: 256 | lm loss: 2.148174E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.533 | TFLOPs: 20.30 | 31: iteration 28420/ 173500 | consumed samples: 7275520 | consumed tokens: 14900264960 | elapsed time per iteration (s): 0.76 | learning rate: 1.895E-04 | global batch size: 256 | lm loss: 2.176355E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.081 | TFLOPs: 20.39 | 31: iteration 28430/ 173500 | consumed samples: 7278080 | consumed tokens: 14905507840 | elapsed time per iteration (s): 0.74 | learning rate: 1.895E-04 | global batch size: 256 | lm loss: 2.125394E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.860 | TFLOPs: 21.04 | 31: iteration 28440/ 173500 | consumed samples: 7280640 | consumed tokens: 14910750720 | elapsed time per iteration (s): 0.78 | learning rate: 1.895E-04 | global batch size: 256 | lm loss: 2.139425E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.794 | TFLOPs: 19.95 | 31: iteration 28450/ 173500 | consumed samples: 7283200 | consumed tokens: 14915993600 | elapsed time per iteration (s): 0.74 | learning rate: 1.895E-04 | global batch size: 256 | lm loss: 2.124413E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.976 | TFLOPs: 20.93 | 31: iteration 28460/ 173500 | consumed samples: 7285760 | consumed tokens: 14921236480 | elapsed time per 
iteration (s): 0.83 | learning rate: 1.895E-04 | global batch size: 256 | lm loss: 2.145421E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.603 | TFLOPs: 18.55 | 31: iteration 28470/ 173500 | consumed samples: 7288320 | consumed tokens: 14926479360 | elapsed time per iteration (s): 0.78 | learning rate: 1.895E-04 | global batch size: 256 | lm loss: 2.112811E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.405 | TFLOPs: 19.93 | 31: iteration 28480/ 173500 | consumed samples: 7290880 | consumed tokens: 14931722240 | elapsed time per iteration (s): 0.75 | learning rate: 1.894E-04 | global batch size: 256 | lm loss: 2.154210E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.185 | TFLOPs: 20.76 | 31: iteration 28490/ 173500 | consumed samples: 7293440 | consumed tokens: 14936965120 | elapsed time per iteration (s): 0.80 | learning rate: 1.894E-04 | global batch size: 256 | lm loss: 2.152785E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.313 | TFLOPs: 19.44 | 31: iteration 28500/ 173500 | consumed samples: 7296000 | consumed tokens: 14942208000 | elapsed time per iteration (s): 0.77 | learning rate: 1.894E-04 | global batch size: 256 | lm loss: 2.160541E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.101 | TFLOPs: 20.15 | 31: iteration 28510/ 173500 | consumed samples: 7298560 | consumed tokens: 14947450880 | elapsed time per iteration (s): 0.74 | learning rate: 1.894E-04 | global batch size: 256 | lm loss: 2.155270E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.519 | TFLOPs: 20.84 | 31: 
iteration 28520/ 173500 | consumed samples: 7301120 | consumed tokens: 14952693760 | elapsed time per iteration (s): 0.75 | learning rate: 1.894E-04 | global batch size: 256 | lm loss: 2.168026E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.152 | TFLOPs: 20.58 | 31: iteration 28530/ 173500 | consumed samples: 7303680 | consumed tokens: 14957936640 | elapsed time per iteration (s): 0.75 | learning rate: 1.894E-04 | global batch size: 256 | lm loss: 2.132161E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.987 | TFLOPs: 20.57 | 31: iteration 28540/ 173500 | consumed samples: 7306240 | consumed tokens: 14963179520 | elapsed time per iteration (s): 0.78 | learning rate: 1.894E-04 | global batch size: 256 | lm loss: 2.157523E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.312 | TFLOPs: 19.92 | 31: iteration 28550/ 173500 | consumed samples: 7308800 | consumed tokens: 14968422400 | elapsed time per iteration (s): 0.76 | learning rate: 1.894E-04 | global batch size: 256 | lm loss: 2.155729E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.996 | TFLOPs: 20.39 | 31: iteration 28560/ 173500 | consumed samples: 7311360 | consumed tokens: 14973665280 | elapsed time per iteration (s): 0.75 | learning rate: 1.894E-04 | global batch size: 256 | lm loss: 2.155499E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.761 | TFLOPs: 20.74 | 31: iteration 28570/ 173500 | consumed samples: 7313920 | consumed tokens: 14978908160 | elapsed time per iteration (s): 0.73 | learning rate: 1.894E-04 | global batch size: 256 | lm loss: 2.130099E+00 | grad norm: 0.140 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.342 | TFLOPs: 21.07 | 31: iteration 28580/ 173500 | consumed samples: 7316480 | consumed tokens: 14984151040 | elapsed time per iteration (s): 0.74 | learning rate: 1.894E-04 | global batch size: 256 | lm loss: 2.152358E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.697 | TFLOPs: 20.91 | 31: iteration 28590/ 173500 | consumed samples: 7319040 | consumed tokens: 14989393920 | elapsed time per iteration (s): 0.74 | learning rate: 1.894E-04 | global batch size: 256 | lm loss: 2.142726E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.695 | TFLOPs: 20.79 | 31: iteration 28600/ 173500 | consumed samples: 7321600 | consumed tokens: 14994636800 | elapsed time per iteration (s): 0.75 | learning rate: 1.894E-04 | global batch size: 256 | lm loss: 2.162767E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.787 | TFLOPs: 20.68 | 31: iteration 28610/ 173500 | consumed samples: 7324160 | consumed tokens: 14999879680 | elapsed time per iteration (s): 0.88 | learning rate: 1.893E-04 | global batch size: 256 | lm loss: 2.148490E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.423 | TFLOPs: 17.51 | 31: iteration 28620/ 173500 | consumed samples: 7326720 | consumed tokens: 15005122560 | elapsed time per iteration (s): 0.78 | learning rate: 1.893E-04 | global batch size: 256 | lm loss: 2.145012E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.501 | TFLOPs: 19.93 | 31: iteration 28630/ 173500 | consumed samples: 7329280 | consumed tokens: 15010365440 | elapsed time per iteration (s): 0.73 | learning rate: 
1.893E-04 | global batch size: 256 | lm loss: 2.148186E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.383 | TFLOPs: 21.14 | 31: iteration 28640/ 173500 | consumed samples: 7331840 | consumed tokens: 15015608320 | elapsed time per iteration (s): 0.72 | learning rate: 1.893E-04 | global batch size: 256 | lm loss: 2.161024E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.774 | TFLOPs: 21.40 | 31: iteration 28650/ 173500 | consumed samples: 7334400 | consumed tokens: 15020851200 | elapsed time per iteration (s): 0.77 | learning rate: 1.893E-04 | global batch size: 256 | lm loss: 2.134359E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.978 | TFLOPs: 20.20 | 31: iteration 28660/ 173500 | consumed samples: 7336960 | consumed tokens: 15026094080 | elapsed time per iteration (s): 0.77 | learning rate: 1.893E-04 | global batch size: 256 | lm loss: 2.163703E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.946 | TFLOPs: 20.08 | 31: iteration 28670/ 173500 | consumed samples: 7339520 | consumed tokens: 15031336960 | elapsed time per iteration (s): 1.54 | learning rate: 1.893E-04 | global batch size: 256 | lm loss: 2.146852E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 166.494 | TFLOPs: 10.07 | 31: iteration 28680/ 173500 | consumed samples: 7342080 | consumed tokens: 15036579840 | elapsed time per iteration (s): 0.79 | learning rate: 1.893E-04 | global batch size: 256 | lm loss: 2.157540E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.502 | TFLOPs: 19.63 | 31: iteration 28690/ 173500 | consumed 
samples: 7344640 | consumed tokens: 15041822720 | elapsed time per iteration (s): 0.75 | learning rate: 1.893E-04 | global batch size: 256 | lm loss: 2.137419E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.911 | TFLOPs: 20.56 | 31: iteration 28700/ 173500 | consumed samples: 7347200 | consumed tokens: 15047065600 | elapsed time per iteration (s): 0.80 | learning rate: 1.893E-04 | global batch size: 256 | lm loss: 2.164232E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.538 | TFLOPs: 19.33 | 31: iteration 28710/ 173500 | consumed samples: 7349760 | consumed tokens: 15052308480 | elapsed time per iteration (s): 0.75 | learning rate: 1.893E-04 | global batch size: 256 | lm loss: 2.136490E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.203 | TFLOPs: 20.70 | 31: iteration 28720/ 173500 | consumed samples: 7352320 | consumed tokens: 15057551360 | elapsed time per iteration (s): 0.83 | learning rate: 1.893E-04 | global batch size: 256 | lm loss: 2.157515E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.436 | TFLOPs: 18.72 | 31: iteration 28730/ 173500 | consumed samples: 7354880 | consumed tokens: 15062794240 | elapsed time per iteration (s): 0.78 | learning rate: 1.893E-04 | global batch size: 256 | lm loss: 2.167751E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.119 | TFLOPs: 19.97 | 31: iteration 28740/ 173500 | consumed samples: 7357440 | consumed tokens: 15068037120 | elapsed time per iteration (s): 0.80 | learning rate: 1.892E-04 | global batch size: 256 | lm loss: 2.176848E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 319.780 | TFLOPs: 19.35 | 31: iteration 28750/ 173500 | consumed samples: 7360000 | consumed tokens: 15073280000 | elapsed time per iteration (s): 0.75 | learning rate: 1.892E-04 | global batch size: 256 | lm loss: 2.172545E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.217 | TFLOPs: 20.58 | 31: iteration 28760/ 173500 | consumed samples: 7362560 | consumed tokens: 15078522880 | elapsed time per iteration (s): 0.75 | learning rate: 1.892E-04 | global batch size: 256 | lm loss: 2.155764E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.506 | TFLOPs: 20.72 | 31: iteration 28770/ 173500 | consumed samples: 7365120 | consumed tokens: 15083765760 | elapsed time per iteration (s): 0.78 | learning rate: 1.892E-04 | global batch size: 256 | lm loss: 2.151203E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.851 | TFLOPs: 19.96 | 31: iteration 28780/ 173500 | consumed samples: 7367680 | consumed tokens: 15089008640 | elapsed time per iteration (s): 0.83 | learning rate: 1.892E-04 | global batch size: 256 | lm loss: 2.157703E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.828 | TFLOPs: 18.62 | 31: iteration 28790/ 173500 | consumed samples: 7370240 | consumed tokens: 15094251520 | elapsed time per iteration (s): 0.81 | learning rate: 1.892E-04 | global batch size: 256 | lm loss: 2.147656E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.857 | TFLOPs: 19.23 | 31: iteration 28800/ 173500 | consumed samples: 7372800 | consumed tokens: 15099494400 | elapsed time per iteration (s): 0.72 | learning rate: 1.892E-04 | global batch size: 256 | lm 
loss: 2.154261E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.507 | TFLOPs: 21.45 | 31: iteration 28810/ 173500 | consumed samples: 7375360 | consumed tokens: 15104737280 | elapsed time per iteration (s): 0.74 | learning rate: 1.892E-04 | global batch size: 256 | lm loss: 2.139549E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.590 | TFLOPs: 20.91 | 31: iteration 28820/ 173500 | consumed samples: 7377920 | consumed tokens: 15109980160 | elapsed time per iteration (s): 0.78 | learning rate: 1.892E-04 | global batch size: 256 | lm loss: 2.128741E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.932 | TFLOPs: 19.90 | 31: iteration 28830/ 173500 | consumed samples: 7380480 | consumed tokens: 15115223040 | elapsed time per iteration (s): 0.75 | learning rate: 1.892E-04 | global batch size: 256 | lm loss: 2.155163E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.245 | TFLOPs: 20.58 | 31: iteration 28840/ 173500 | consumed samples: 7383040 | consumed tokens: 15120465920 | elapsed time per iteration (s): 0.75 | learning rate: 1.892E-04 | global batch size: 256 | lm loss: 2.132368E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.622 | TFLOPs: 20.67 | 31: iteration 28850/ 173500 | consumed samples: 7385600 | consumed tokens: 15125708800 | elapsed time per iteration (s): 0.73 | learning rate: 1.892E-04 | global batch size: 256 | lm loss: 2.161348E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.805 | TFLOPs: 21.16 | 31: iteration 28860/ 173500 | consumed samples: 7388160 | consumed tokens: 
15130951680 | elapsed time per iteration (s): 0.75 | learning rate: 1.891E-04 | global batch size: 256 | lm loss: 2.144921E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.972 | TFLOPs: 20.57 | 31: iteration 28870/ 173500 | consumed samples: 7390720 | consumed tokens: 15136194560 | elapsed time per iteration (s): 0.76 | learning rate: 1.891E-04 | global batch size: 256 | lm loss: 2.143851E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.679 | TFLOPs: 20.31 | 31: iteration 28880/ 173500 | consumed samples: 7393280 | consumed tokens: 15141437440 | elapsed time per iteration (s): 0.77 | learning rate: 1.891E-04 | global batch size: 256 | lm loss: 2.157286E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.512 | TFLOPs: 20.06 | 31: iteration 28890/ 173500 | consumed samples: 7395840 | consumed tokens: 15146680320 | elapsed time per iteration (s): 0.73 | learning rate: 1.891E-04 | global batch size: 256 | lm loss: 2.156515E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.169 | TFLOPs: 21.24 | 31: iteration 28900/ 173500 | consumed samples: 7398400 | consumed tokens: 15151923200 | elapsed time per iteration (s): 0.76 | learning rate: 1.891E-04 | global batch size: 256 | lm loss: 2.151955E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.836 | TFLOPs: 20.26 | 31: iteration 28910/ 173500 | consumed samples: 7400960 | consumed tokens: 15157166080 | elapsed time per iteration (s): 0.79 | learning rate: 1.891E-04 | global batch size: 256 | lm loss: 2.156889E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
323.402 | TFLOPs: 19.56 | 31: iteration 28920/ 173500 | consumed samples: 7403520 | consumed tokens: 15162408960 | elapsed time per iteration (s): 0.76 | learning rate: 1.891E-04 | global batch size: 256 | lm loss: 2.166413E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.407 | TFLOPs: 20.29 | 31: iteration 28930/ 173500 | consumed samples: 7406080 | consumed tokens: 15167651840 | elapsed time per iteration (s): 0.78 | learning rate: 1.891E-04 | global batch size: 256 | lm loss: 2.147642E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.173 | TFLOPs: 19.91 | 31: iteration 28940/ 173500 | consumed samples: 7408640 | consumed tokens: 15172894720 | elapsed time per iteration (s): 0.81 | learning rate: 1.891E-04 | global batch size: 256 | lm loss: 2.170320E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.319 | TFLOPs: 19.20 | 31: iteration 28950/ 173500 | consumed samples: 7411200 | consumed tokens: 15178137600 | elapsed time per iteration (s): 0.81 | learning rate: 1.891E-04 | global batch size: 256 | lm loss: 2.167934E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.419 | TFLOPs: 19.14 | 31: iteration 28960/ 173500 | consumed samples: 7413760 | consumed tokens: 15183380480 | elapsed time per iteration (s): 0.79 | learning rate: 1.891E-04 | global batch size: 256 | lm loss: 2.122172E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.808 | TFLOPs: 19.65 | 31: iteration 28970/ 173500 | consumed samples: 7416320 | consumed tokens: 15188623360 | elapsed time per iteration (s): 0.81 | learning rate: 1.891E-04 | global batch size: 256 | lm loss: 2.184843E+00 | grad norm: 0.135 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.177 | TFLOPs: 19.13 | 31: iteration 28980/ 173500 | consumed samples: 7418880 | consumed tokens: 15193866240 | elapsed time per iteration (s): 0.79 | learning rate: 1.891E-04 | global batch size: 256 | lm loss: 2.178088E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.961 | TFLOPs: 19.54 | 31: iteration 28990/ 173500 | consumed samples: 7421440 | consumed tokens: 15199109120 | elapsed time per iteration (s): 0.83 | learning rate: 1.890E-04 | global batch size: 256 | lm loss: 2.157400E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.805 | TFLOPs: 18.62 | 31: iteration 29000/ 173500 | consumed samples: 7424000 | consumed tokens: 15204352000 | elapsed time per iteration (s): 0.81 | learning rate: 1.890E-04 | global batch size: 256 | lm loss: 2.134703E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.468 | TFLOPs: 19.09 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 29000 | lm loss value: 2.204160E+00 | lm loss PPL: 9.062636E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 29000 to checkpoints_1b1long 0: [2022-11-26 00:36:54,863] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step29000 is begin to save! 0: [2022-11-26 00:36:54,873] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_01-model_00-model_states.pt... 0: [2022-11-26 00:36:55,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_01-model_00-model_states.pt. 
0: [2022-11-26 00:36:55,082] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_03-model_00-model_states.pt... 0: [2022-11-26 00:36:55,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_03-model_00-model_states.pt. 0: [2022-11-26 00:36:55,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_04-model_00-model_states.pt... 0: [2022-11-26 00:36:55,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_04-model_00-model_states.pt. 0: [2022-11-26 00:36:55,243] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_05-model_00-model_states.pt... 0: [2022-11-26 00:36:55,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_05-model_00-model_states.pt. 0: [2022-11-26 00:36:55,320] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_06-model_00-model_states.pt... 0: [2022-11-26 00:36:55,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_06-model_00-model_states.pt. 0: [2022-11-26 00:36:55,398] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_07-model_00-model_states.pt... 0: [2022-11-26 00:36:55,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_07-model_00-model_states.pt. 0: [2022-11-26 00:36:55,472] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_08-model_00-model_states.pt... 0: [2022-11-26 00:36:55,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_08-model_00-model_states.pt. 
0: [2022-11-26 00:36:55,556] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_09-model_00-model_states.pt... 0: [2022-11-26 00:36:55,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_09-model_00-model_states.pt. 0: [2022-11-26 00:36:55,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_10-model_00-model_states.pt... 0: [2022-11-26 00:36:55,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_10-model_00-model_states.pt. 0: [2022-11-26 00:36:55,709] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_11-model_00-model_states.pt... 0: [2022-11-26 00:36:55,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_11-model_00-model_states.pt. 0: [2022-11-26 00:36:55,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_12-model_00-model_states.pt... 0: [2022-11-26 00:36:55,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_12-model_00-model_states.pt. 0: [2022-11-26 00:36:55,863] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_13-model_00-model_states.pt... 0: [2022-11-26 00:36:55,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_13-model_00-model_states.pt. 0: [2022-11-26 00:36:55,938] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_14-model_00-model_states.pt... 0: [2022-11-26 00:36:56,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_14-model_00-model_states.pt. 
0: [2022-11-26 00:36:56,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_15-model_00-model_states.pt... 0: [2022-11-26 00:36:56,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_15-model_00-model_states.pt. 0: [2022-11-26 00:36:56,096] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_16-model_00-model_states.pt... 0: [2022-11-26 00:36:56,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_16-model_00-model_states.pt. 0: [2022-11-26 00:36:56,173] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_17-model_00-model_states.pt... 0: [2022-11-26 00:36:56,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_17-model_00-model_states.pt. 0: [2022-11-26 00:36:56,250] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_18-model_00-model_states.pt... 0: [2022-11-26 00:36:56,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_18-model_00-model_states.pt. 0: [2022-11-26 00:36:56,328] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_19-model_00-model_states.pt... 0: [2022-11-26 00:36:56,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_19-model_00-model_states.pt. 0: [2022-11-26 00:36:56,405] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_20-model_00-model_states.pt... 0: [2022-11-26 00:36:56,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_20-model_00-model_states.pt. 
0: [2022-11-26 00:36:56,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_21-model_00-model_states.pt... 0: [2022-11-26 00:36:56,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_21-model_00-model_states.pt. 0: [2022-11-26 00:36:56,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_22-model_00-model_states.pt... 0: [2022-11-26 00:36:56,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_22-model_00-model_states.pt. 0: [2022-11-26 00:36:56,633] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_23-model_00-model_states.pt... 0: [2022-11-26 00:36:56,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_23-model_00-model_states.pt. 0: [2022-11-26 00:36:56,711] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_24-model_00-model_states.pt... 0: [2022-11-26 00:36:56,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_24-model_00-model_states.pt. 0: [2022-11-26 00:36:56,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_25-model_00-model_states.pt... 0: [2022-11-26 00:36:56,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_25-model_00-model_states.pt. 0: [2022-11-26 00:36:56,864] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_26-model_00-model_states.pt... 0: [2022-11-26 00:36:56,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_26-model_00-model_states.pt. 
0: [2022-11-26 00:36:56,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_27-model_00-model_states.pt... 0: [2022-11-26 00:36:57,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_27-model_00-model_states.pt. 0: [2022-11-26 00:36:57,019] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_28-model_00-model_states.pt... 0: [2022-11-26 00:36:57,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_28-model_00-model_states.pt. 0: [2022-11-26 00:36:57,098] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/layer_30-model_00-model_states.pt... 0: [2022-11-26 00:36:57,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/layer_30-model_00-model_states.pt. 0: [2022-11-26 00:36:57,100] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step29000/mp_rank_00_model_states.pt 0: [2022-11-26 00:36:57,100] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/mp_rank_00_model_states.pt... 0: [2022-11-26 00:36:57,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/mp_rank_00_model_states.pt. 0: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
5: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
4: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
1: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
2: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
15: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 
25: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 
28: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 
22: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 
18: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
6: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
8: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
3: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 
25: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 
24: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 
17: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 
27: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 
8: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
15: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 
29: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 
19: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 
13: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 
17: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
15: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:36:57,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 
23: [2022-11-26 00:36:57,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 00:36:57,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 00:36:57,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 21: [2022-11-26 00:36:57,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:36:57,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 00:36:57,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 17: [2022-11-26 00:36:57,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:36:57,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 00:36:57,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 25: [2022-11-26 00:36:57,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:36:57,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:36:57,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 00:36:57,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 00:36:57,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 00:36:57,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 25: [2022-11-26 00:36:57,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 7: [2022-11-26 00:36:57,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 25: [2022-11-26 00:36:57,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 19: [2022-11-26 00:36:57,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:36:57,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 00:36:57,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 19: [2022-11-26 00:36:57,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:36:57,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 00:36:57,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
3: [2022-11-26 00:36:57,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:36:57,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 00:36:57,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 16: [2022-11-26 00:36:57,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 00:36:57,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 00:36:57,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 8: [2022-11-26 00:36:57,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 16: [2022-11-26 00:36:57,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:36:57,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:36:57,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 16: [2022-11-26 00:36:57,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 8: [2022-11-26 00:36:57,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
16: [2022-11-26 00:36:57,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 26: [2022-11-26 00:36:57,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 00:36:57,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 20: [2022-11-26 00:36:57,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 00:36:57,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 25: [2022-11-26 00:36:57,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 20: [2022-11-26 00:36:57,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 25: [2022-11-26 00:36:57,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 15: [2022-11-26 00:36:57,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:36:57,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 15: [2022-11-26 00:36:57,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-26 00:36:57,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 00:36:57,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 00:36:57,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 15: [2022-11-26 00:36:57,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 10: [2022-11-26 00:36:57,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:36:57,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 00:36:57,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 1: [2022-11-26 00:36:57,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 18: [2022-11-26 00:36:57,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 00:36:57,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 23: [2022-11-26 00:36:57,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 18: [2022-11-26 00:36:57,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
1: [2022-11-26 00:36:57,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 23: [2022-11-26 00:36:57,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 1: [2022-11-26 00:36:57,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 23: [2022-11-26 00:36:57,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 22: [2022-11-26 00:36:57,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:36:57,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 00:36:57,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 26: [2022-11-26 00:36:57,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:36:57,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 00:36:57,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 12: [2022-11-26 00:36:57,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-26 00:36:57,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 00:36:57,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 21: [2022-11-26 00:36:57,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:36:57,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 29: [2022-11-26 00:36:57,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 30: [2022-11-26 00:36:57,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:36:57,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 29: [2022-11-26 00:36:57,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 00:36:57,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 30: [2022-11-26 00:36:57,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 00:36:57,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 12: [2022-11-26 00:36:57,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 00:36:57,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 8: [2022-11-26 00:36:57,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:36:57,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 8: [2022-11-26 00:36:57,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 00:36:57,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 22: [2022-11-26 00:36:57,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:36:57,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 00:36:57,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 28: [2022-11-26 00:36:57,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 00:36:57,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 00:36:57,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 27: [2022-11-26 00:36:57,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
27: [2022-11-26 00:36:57,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 28: [2022-11-26 00:36:57,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 27: [2022-11-26 00:36:57,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 15: [2022-11-26 00:36:57,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 28: [2022-11-26 00:36:57,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 27: [2022-11-26 00:36:57,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 15: [2022-11-26 00:36:57,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 28: [2022-11-26 00:36:57,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 27: [2022-11-26 00:36:57,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 15: [2022-11-26 00:36:57,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 27: [2022-11-26 00:36:57,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 20: [2022-11-26 00:36:57,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
20: [2022-11-26 00:36:57,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 00:36:57,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 3: [2022-11-26 00:36:57,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:36:57,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 30: [2022-11-26 00:36:57,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:36:57,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 30: [2022-11-26 00:36:57,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 00:36:57,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 1: [2022-11-26 00:36:57,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:36:57,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 00:36:57,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 1: [2022-11-26 00:36:57,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 8: [2022-11-26 00:36:57,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 26: [2022-11-26 00:36:57,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:36:57,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 26: [2022-11-26 00:36:57,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 00:36:57,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 30: [2022-11-26 00:36:57,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 00:36:57,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 00:36:57,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 2: [2022-11-26 00:36:57,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:36:57,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
2: [2022-11-26 00:36:57,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:36:57,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 00:36:57,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 2: [2022-11-26 00:36:57,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 00:36:57,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 1: [2022-11-26 00:36:57,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 23: [2022-11-26 00:36:57,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:36:57,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:36:57,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 00:36:57,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 23: [2022-11-26 00:36:57,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 00:36:57,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
1: [2022-11-26 00:36:57,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 25: [2022-11-26 00:36:57,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 29: [2022-11-26 00:36:57,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:36:57,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:36:57,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 21: [2022-11-26 00:36:57,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 29: [2022-11-26 00:36:57,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 21: [2022-11-26 00:36:57,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 1: [2022-11-26 00:36:57,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 25: [2022-11-26 00:36:57,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 10: [2022-11-26 00:36:57,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-26 00:36:57,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 00:36:57,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 11: [2022-11-26 00:36:57,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:36:57,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:36:57,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 11: [2022-11-26 00:36:57,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 18: [2022-11-26 00:36:57,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:36:57,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 11: [2022-11-26 00:36:57,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 18: [2022-11-26 00:36:57,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 00:36:57,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 18: [2022-11-26 00:36:57,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-26 00:36:57,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 00:36:57,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 11: [2022-11-26 00:36:57,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:36:57,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 2: [2022-11-26 00:36:57,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:36:57,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 2: [2022-11-26 00:36:57,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 00:36:57,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 16: [2022-11-26 00:36:57,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:36:57,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
16: [2022-11-26 00:36:57,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 19: [2022-11-26 00:36:57,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 16: [2022-11-26 00:36:57,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 19: [2022-11-26 00:36:57,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 0: [2022-11-26 00:36:57,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 20: [2022-11-26 00:36:57,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:36:57,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 20: [2022-11-26 00:36:57,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 00:36:57,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 0: [2022-11-26 00:36:57,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 25: [2022-11-26 00:36:57,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:36:57,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
0: [2022-11-26 00:36:57,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:36:57,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 3: [2022-11-26 00:36:57,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 30: [2022-11-26 00:36:57,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:36:57,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 3: [2022-11-26 00:36:57,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 0: [2022-11-26 00:36:57,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 3: [2022-11-26 00:36:57,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 25: [2022-11-26 00:36:57,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 22: [2022-11-26 00:36:57,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:36:57,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
22: [2022-11-26 00:36:57,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 30: [2022-11-26 00:36:57,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 7: [2022-11-26 00:36:57,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 22: [2022-11-26 00:36:57,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 30: [2022-11-26 00:36:57,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 7: [2022-11-26 00:36:57,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 20: [2022-11-26 00:36:57,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 00:36:57,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 00:36:57,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 31: [2022-11-26 00:36:57,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 18: [2022-11-26 00:36:57,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
31: [2022-11-26 00:36:57,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 00:36:57,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 18: [2022-11-26 00:36:57,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 00:36:57,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 27: [2022-11-26 00:36:57,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 00:36:57,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 00:36:57,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 8: [2022-11-26 00:36:57,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:36:57,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 00:36:57,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 2: [2022-11-26 00:36:57,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-26 00:36:57,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 00:36:57,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 31: [2022-11-26 00:36:57,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 00:36:57,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 00:36:57,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 31: [2022-11-26 00:36:57,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 00:36:57,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 00:36:57,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 00:36:57,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 31: [2022-11-26 00:36:57,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 26: [2022-11-26 00:36:57,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 31: [2022-11-26 00:36:57,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
6: [2022-11-26 00:36:57,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:36:57,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 00:36:57,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 6: [2022-11-26 00:36:57,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 00:36:57,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 19: [2022-11-26 00:36:57,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:36:57,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:36:57,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 7: [2022-11-26 00:36:57,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 19: [2022-11-26 00:36:57,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 7: [2022-11-26 00:36:57,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 16: [2022-11-26 00:36:57,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 
16: [2022-11-26 00:36:57,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 00:36:57,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 12: [2022-11-26 00:36:57,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:36:57,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 00:36:57,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 0: [2022-11-26 00:36:57,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 00:36:57,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 10: [2022-11-26 00:36:57,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:36:57,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 00:36:57,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 17: [2022-11-26 00:36:57,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-26 00:36:57,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 00:36:57,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 17: [2022-11-26 00:36:57,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:36:57,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 00:36:57,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 17: [2022-11-26 00:36:57,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:36:57,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 00:36:57,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 29: [2022-11-26 00:36:57,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:36:57,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 13: [2022-11-26 00:36:57,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:36:57,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
13: [2022-11-26 00:36:57,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 00:36:57,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 13: [2022-11-26 00:36:57,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:36:57,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 00:36:57,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 13: [2022-11-26 00:36:57,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:36:57,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:36:57,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 00:36:57,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 00:36:57,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 13: [2022-11-26 00:36:57,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 27: [2022-11-26 00:36:57,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
27: [2022-11-26 00:36:57,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 00:36:57,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 10: [2022-11-26 00:36:57,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:36:57,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 00:36:57,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 3: [2022-11-26 00:36:57,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:36:57,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:36:57,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 22: [2022-11-26 00:36:57,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 3: [2022-11-26 00:36:57,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 12: [2022-11-26 00:36:57,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:36:57,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
12: [2022-11-26 00:36:57,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 00:36:57,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 21: [2022-11-26 00:36:57,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:36:57,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 00:36:57,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 20: [2022-11-26 00:36:57,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 00:36:57,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 00:36:57,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 23: [2022-11-26 00:36:57,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 00:36:57,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 00:36:57,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 28: [2022-11-26 00:36:57,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-26 00:36:57,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 00:36:57,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 28: [2022-11-26 00:36:57,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:36:57,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:36:57,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 00:36:57,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 29: [2022-11-26 00:36:57,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:36:57,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 00:36:57,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 5: [2022-11-26 00:36:57,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:36:57,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 00:36:57,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
5: [2022-11-26 00:36:57,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:36:57,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 00:36:57,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 5: [2022-11-26 00:36:57,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:36:57,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 00:36:57,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 5: [2022-11-26 00:36:57,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:36:57,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 00:36:57,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 11: [2022-11-26 00:36:57,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:36:57,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 00:36:57,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
11: [2022-11-26 00:36:57,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:36:57,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 00:36:57,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 6: [2022-11-26 00:36:57,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:36:57,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:36:57,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:36:57,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 00:36:57,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 00:36:57,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 00:36:57,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 6: [2022-11-26 00:36:57,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 6: [2022-11-26 00:36:57,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
9: [2022-11-26 00:36:57,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:36:57,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:36:57,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:36:57,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:36:57,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 00:36:57,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 00:36:57,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 00:36:57,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 00:36:57,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 9: [2022-11-26 00:36:57,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 9: [2022-11-26 00:36:57,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 9: [2022-11-26 00:36:57,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
5: [2022-11-26 00:36:57,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:36:57,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 00:36:57,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 9: [2022-11-26 00:36:57,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:36:57,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 00:36:57,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 4: [2022-11-26 00:36:57,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:36:57,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:36:57,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:36:57,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:36:57,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
15: [2022-11-26 00:36:57,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 4: [2022-11-26 00:36:57,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 00:36:57,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 15: [2022-11-26 00:36:57,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 4: [2022-11-26 00:36:57,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 4: [2022-11-26 00:36:57,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 4: [2022-11-26 00:36:57,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 00:36:57,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 00:36:57,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 4: [2022-11-26 00:36:57,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 24: [2022-11-26 00:36:57,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 00:36:57,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
24: [2022-11-26 00:36:57,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 00:36:57,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 00:36:57,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 00:36:57,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 00:36:57,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 00:36:57,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 24: [2022-11-26 00:36:57,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 24: [2022-11-26 00:36:57,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 00:36:57,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 24: [2022-11-26 00:36:57,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 1: [2022-11-26 00:36:57,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
28: [2022-11-26 00:36:57,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 00:36:57,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 1: [2022-11-26 00:36:57,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 00:36:57,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 17: [2022-11-26 00:36:57,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:36:57,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 00:36:57,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 18: [2022-11-26 00:36:57,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 00:36:57,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 00:36:57,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 16: [2022-11-26 00:36:57,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
16: [2022-11-26 00:36:57,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 00:36:57,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 25: [2022-11-26 00:36:57,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:36:57,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 00:36:57,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 15: [2022-11-26 00:36:57,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:36:57,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 00:36:57,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 8: [2022-11-26 00:36:57,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:36:57,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 00:36:57,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 11: [2022-11-26 00:36:57,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 00:36:57,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 00:36:57,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 2: [2022-11-26 00:36:57,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:36:57,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 00:36:57,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 7: [2022-11-26 00:36:57,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:36:57,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 00:36:57,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 21: [2022-11-26 00:36:57,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:36:57,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 00:36:57,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 0: [2022-11-26 00:36:57,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 00:36:57,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 00:36:57,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 6: [2022-11-26 00:36:57,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:36:57,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 00:36:57,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 30: [2022-11-26 00:36:57,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 00:36:57,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 00:36:57,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 13: [2022-11-26 00:36:57,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:36:57,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 00:36:57,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 31: [2022-11-26 00:36:57,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
31: [2022-11-26 00:36:57,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 00:36:57,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 23: [2022-11-26 00:36:57,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 00:36:57,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 00:36:57,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 3: [2022-11-26 00:36:57,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:36:57,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:36:57,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 29: [2022-11-26 00:36:57,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 3: [2022-11-26 00:36:57,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 29: [2022-11-26 00:36:57,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 10: [2022-11-26 00:36:57,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 00:36:57,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 00:36:57,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 22: [2022-11-26 00:36:57,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:36:57,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 00:36:57,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 4: [2022-11-26 00:36:57,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:36:57,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 00:36:57,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 27: [2022-11-26 00:36:57,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:36:57,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 00:36:57,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 27: [2022-11-26 00:36:57,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 00:36:57,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 12: [2022-11-26 00:36:57,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 20: [2022-11-26 00:36:57,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 00:36:57,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 00:36:57,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 19: [2022-11-26 00:36:57,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:36:57,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 00:36:57,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 16: [2022-11-26 00:36:57,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 24: [2022-11-26 00:36:57,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
16: [2022-11-26 00:36:57,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 00:36:57,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 28: [2022-11-26 00:36:57,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 24: [2022-11-26 00:36:57,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 28: [2022-11-26 00:36:57,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 24: [2022-11-26 00:36:57,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 28: [2022-11-26 00:36:57,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 1: [2022-11-26 00:36:57,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:36:57,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:36:57,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 
1: [2022-11-26 00:36:57,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 25: [2022-11-26 00:36:57,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 17: [2022-11-26 00:36:57,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 1: [2022-11-26 00:36:57,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 25: [2022-11-26 00:36:57,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 17: [2022-11-26 00:36:57,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 9: [2022-11-26 00:36:57,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:36:57,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 00:36:57,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 15: [2022-11-26 00:36:57,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:36:57,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 8: [2022-11-26 00:36:57,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
15: [2022-11-26 00:36:57,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 5: [2022-11-26 00:36:57,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:36:57,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 5: [2022-11-26 00:36:57,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 00:36:57,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 8: [2022-11-26 00:36:57,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 6: [2022-11-26 00:36:57,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:36:57,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 00:36:57,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 7: [2022-11-26 00:36:57,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:36:57,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 00:36:57,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
0: [2022-11-26 00:36:57,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:36:57,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 00:36:57,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 21: [2022-11-26 00:36:57,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:36:57,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 00:36:57,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 30: [2022-11-26 00:36:57,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 23: [2022-11-26 00:36:57,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 00:36:57,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 30: [2022-11-26 00:36:57,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 23: [2022-11-26 00:36:57,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 30: [2022-11-26 00:36:57,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
18: [2022-11-26 00:36:57,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 00:36:57,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 00:36:57,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 11: [2022-11-26 00:36:57,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:36:57,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 00:36:57,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 2: [2022-11-26 00:36:57,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:36:57,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 00:36:57,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 31: [2022-11-26 00:36:57,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 00:36:57,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 00:36:57,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
4: [2022-11-26 00:36:57,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:36:57,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 00:36:57,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 29: [2022-11-26 00:36:57,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:36:57,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 00:36:57,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 13: [2022-11-26 00:36:57,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:36:57,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 00:36:57,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 3: [2022-11-26 00:36:57,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:36:57,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 00:36:57,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
12: [2022-11-26 00:36:57,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:36:57,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 00:36:57,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 22: [2022-11-26 00:36:57,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:36:57,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 00:36:57,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 10: [2022-11-26 00:36:57,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:36:57,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 00:36:57,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 27: [2022-11-26 00:36:57,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 00:36:57,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 00:36:57,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
20: [2022-11-26 00:36:57,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 00:36:57,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 00:36:57,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 19: [2022-11-26 00:36:57,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:36:57,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 00:36:57,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 26: [2022-11-26 00:36:57,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:36:57,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 00:36:57,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 26: [2022-11-26 00:36:57,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:36:57,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 00:36:57,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
1: [2022-11-26 00:36:57,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:36:57,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 00:36:57,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 28: [2022-11-26 00:36:57,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:36:57,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 28: [2022-11-26 00:36:57,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 25: [2022-11-26 00:36:57,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 00:36:57,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 15: [2022-11-26 00:36:57,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:36:57,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 00:36:57,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 9: [2022-11-26 00:36:57,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 00:36:57,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 00:36:57,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 24: [2022-11-26 00:36:57,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 00:36:57,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 00:36:57,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 16: [2022-11-26 00:36:57,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 00:36:57,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 00:36:57,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 5: [2022-11-26 00:36:57,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:36:57,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 00:36:57,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 28: [2022-11-26 00:36:57,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
17: [2022-11-26 00:36:57,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:36:57,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 00:36:57,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 2: [2022-11-26 00:36:57,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:36:57,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 00:36:57,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 7: [2022-11-26 00:36:57,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:36:57,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 00:36:57,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 8: [2022-11-26 00:36:57,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:36:57,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
8: [2022-11-26 00:36:57,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 00:36:57,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 21: [2022-11-26 00:36:57,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 00:36:57,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 0: [2022-11-26 00:36:57,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:36:57,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 00:36:57,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 11: [2022-11-26 00:36:57,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:36:57,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 00:36:57,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 18: [2022-11-26 00:36:57,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
18: [2022-11-26 00:36:57,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 00:36:57,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 6: [2022-11-26 00:36:57,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:36:57,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 00:36:57,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 12: [2022-11-26 00:36:57,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:36:57,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 23: [2022-11-26 00:36:57,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:36:57,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 23: [2022-11-26 00:36:57,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 00:36:57,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 31: [2022-11-26 00:36:57,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
24: [2022-11-26 00:36:57,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 31: [2022-11-26 00:36:57,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 00:36:57,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 24: [2022-11-26 00:36:57,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 00:36:57,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 4: [2022-11-26 00:36:57,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:36:57,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 00:36:57,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 14: [2022-11-26 00:36:57,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:36:57,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 00:36:57,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 13: [2022-11-26 00:36:57,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 00:36:57,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 00:36:57,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 22: [2022-11-26 00:36:57,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:36:57,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 00:36:57,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 26: [2022-11-26 00:36:57,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:36:57,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 00:36:57,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 3: [2022-11-26 00:36:57,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:36:57,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 00:36:57,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 10: [2022-11-26 00:36:57,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-26 00:36:57,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 15: [2022-11-26 00:36:57,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:36:57,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 15: [2022-11-26 00:36:57,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 00:36:57,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 14: [2022-11-26 00:36:57,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:36:57,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 27: [2022-11-26 00:36:57,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:36:57,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 27: [2022-11-26 00:36:57,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 00:36:57,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 25: [2022-11-26 00:36:57,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
25: [2022-11-26 00:36:57,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 00:36:57,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 9: [2022-11-26 00:36:57,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 16: [2022-11-26 00:36:57,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:36:57,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 16: [2022-11-26 00:36:57,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 9: [2022-11-26 00:36:57,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 16: [2022-11-26 00:36:57,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 23: [2022-11-26 00:36:57,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 00:36:57,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 00:36:57,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 20: [2022-11-26 00:36:57,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
20: [2022-11-26 00:36:57,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 00:36:57,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 1: [2022-11-26 00:36:57,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:36:57,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 00:36:57,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 26: [2022-11-26 00:36:57,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:36:57,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 00:36:57,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 29: [2022-11-26 00:36:57,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:36:57,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 00:36:57,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 8: [2022-11-26 00:36:57,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-26 00:36:57,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 00:36:57,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 19: [2022-11-26 00:36:57,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:36:57,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 00:36:57,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 24: [2022-11-26 00:36:57,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:36:57,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 24: [2022-11-26 00:36:57,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 17: [2022-11-26 00:36:57,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 24: [2022-11-26 00:36:57,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 17: [2022-11-26 00:36:57,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 4: [2022-11-26 00:36:57,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 00:36:57,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 00:36:57,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 31: [2022-11-26 00:36:57,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 00:36:57,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 00:36:57,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 28: [2022-11-26 00:36:57,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 00:36:57,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 00:36:57,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 2: [2022-11-26 00:36:57,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:36:57,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 00:36:57,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 0: [2022-11-26 00:36:57,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 00:36:57,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 00:36:57,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 3: [2022-11-26 00:36:57,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:36:57,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 00:36:57,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 5: [2022-11-26 00:36:57,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 30: [2022-11-26 00:36:57,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:36:57,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 30: [2022-11-26 00:36:57,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 5: [2022-11-26 00:36:57,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 30: [2022-11-26 00:36:57,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 7: [2022-11-26 00:36:57,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 00:36:57,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 00:36:57,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 12: [2022-11-26 00:36:57,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:36:57,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 00:36:57,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 21: [2022-11-26 00:36:57,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:36:57,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:36:57,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 00:36:57,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 10: [2022-11-26 00:36:57,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 00:36:57,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 22: [2022-11-26 00:36:57,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
22: [2022-11-26 00:36:57,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 00:36:57,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 11: [2022-11-26 00:36:57,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:36:57,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 00:36:57,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 18: [2022-11-26 00:36:57,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 27: [2022-11-26 00:36:57,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 18: [2022-11-26 00:36:57,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 00:36:57,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 27: [2022-11-26 00:36:57,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 00:36:57,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 6: [2022-11-26 00:36:57,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 00:36:57,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 00:36:57,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 13: [2022-11-26 00:36:57,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:36:57,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 00:36:57,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 29: [2022-11-26 00:36:57,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:36:57,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 00:36:57,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 28: [2022-11-26 00:36:57,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 00:36:57,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 00:36:57,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 14: [2022-11-26 00:36:57,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-26 00:36:57,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 00:36:57,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 14: [2022-11-26 00:36:57,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:36:57,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 00:36:57,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 14: [2022-11-26 00:36:57,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:36:57,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 00:36:57,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 14: [2022-11-26 00:36:57,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:36:57,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 00:36:57,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 30: [2022-11-26 00:36:57,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-26 00:36:57,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 00:36:57,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 14: [2022-11-26 00:36:57,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:36:57,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 00:36:57,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 14: [2022-11-26 00:36:57,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:36:57,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step29000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 00:36:57,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
0: successfully saved checkpoint at iteration 29000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2556.33 31: iteration 29010/ 173500 | consumed samples: 7426560 | consumed tokens: 15209594880 | elapsed time per iteration (s): 1.06 | learning rate: 1.890E-04 | global batch size: 256 | lm loss: 2.111278E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.586 | TFLOPs: 14.62 | 31: iteration 29020/ 173500 | consumed samples: 7429120 | consumed tokens: 15214837760 | elapsed time per iteration (s): 0.80 | learning rate: 1.890E-04 | global batch size: 256 | lm loss: 2.155448E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.124 | TFLOPs: 19.25 | 31: iteration 29030/ 173500 | consumed samples: 7431680 | consumed tokens: 15220080640 | elapsed time per iteration (s): 0.85 | learning rate: 1.890E-04 | global batch size: 256 | lm loss: 2.141341E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.156 | TFLOPs: 18.16 | 31: iteration 29040/ 173500 | consumed samples: 7434240 | consumed tokens: 15225323520 | elapsed time per iteration (s): 0.78 | learning rate: 1.890E-04 | global batch size: 256 | lm loss: 2.135237E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.839 | TFLOPs: 19.77 | 31: iteration 29050/ 173500 | consumed samples: 7436800 | consumed tokens: 15230566400 | elapsed time per iteration (s): 1.02 | learning rate: 1.890E-04 | global batch size: 256 | lm loss: 2.156762E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.403 | TFLOPs: 15.21 | 31: iteration 29060/ 173500 | consumed samples: 7439360 | consumed tokens: 15235809280 | elapsed time per iteration (s): 0.76 | 
learning rate: 1.890E-04 | global batch size: 256 | lm loss: 2.157908E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.328 | TFLOPs: 20.47 | 31: iteration 29070/ 173500 | consumed samples: 7441920 | consumed tokens: 15241052160 | elapsed time per iteration (s): 0.75 | learning rate: 1.890E-04 | global batch size: 256 | lm loss: 2.135238E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.106 | TFLOPs: 20.64 | 31: iteration 29080/ 173500 | consumed samples: 7444480 | consumed tokens: 15246295040 | elapsed time per iteration (s): 0.75 | learning rate: 1.890E-04 | global batch size: 256 | lm loss: 2.168147E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.543 | TFLOPs: 20.66 | 31: iteration 29090/ 173500 | consumed samples: 7447040 | consumed tokens: 15251537920 | elapsed time per iteration (s): 0.77 | learning rate: 1.890E-04 | global batch size: 256 | lm loss: 2.153845E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.584 | TFLOPs: 20.00 | 31: iteration 29100/ 173500 | consumed samples: 7449600 | consumed tokens: 15256780800 | elapsed time per iteration (s): 0.77 | learning rate: 1.890E-04 | global batch size: 256 | lm loss: 2.121440E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.507 | TFLOPs: 20.06 | 31: iteration 29110/ 173500 | consumed samples: 7452160 | consumed tokens: 15262023680 | elapsed time per iteration (s): 0.74 | learning rate: 1.890E-04 | global batch size: 256 | lm loss: 2.150566E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.093 | TFLOPs: 21.06 | 31: iteration 29120/ 173500 
| consumed samples: 7454720 | consumed tokens: 15267266560 | elapsed time per iteration (s): 0.78 | learning rate: 1.889E-04 | global batch size: 256 | lm loss: 2.130304E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.913 | TFLOPs: 19.96 | 31: iteration 29130/ 173500 | consumed samples: 7457280 | consumed tokens: 15272509440 | elapsed time per iteration (s): 0.86 | learning rate: 1.889E-04 | global batch size: 256 | lm loss: 2.146920E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.187 | TFLOPs: 18.04 | 31: iteration 29140/ 173500 | consumed samples: 7459840 | consumed tokens: 15277752320 | elapsed time per iteration (s): 0.75 | learning rate: 1.889E-04 | global batch size: 256 | lm loss: 2.135444E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.998 | TFLOPs: 20.63 | 31: iteration 29150/ 173500 | consumed samples: 7462400 | consumed tokens: 15282995200 | elapsed time per iteration (s): 0.75 | learning rate: 1.889E-04 | global batch size: 256 | lm loss: 2.161101E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.214 | TFLOPs: 20.76 | 31: iteration 29160/ 173500 | consumed samples: 7464960 | consumed tokens: 15288238080 | elapsed time per iteration (s): 0.74 | learning rate: 1.889E-04 | global batch size: 256 | lm loss: 2.140821E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.337 | TFLOPs: 20.95 | 31: iteration 29170/ 173500 | consumed samples: 7467520 | consumed tokens: 15293480960 | elapsed time per iteration (s): 0.79 | learning rate: 1.889E-04 | global batch size: 256 | lm loss: 2.133688E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 324.229 | TFLOPs: 19.62 | 31: iteration 29180/ 173500 | consumed samples: 7470080 | consumed tokens: 15298723840 | elapsed time per iteration (s): 0.78 | learning rate: 1.889E-04 | global batch size: 256 | lm loss: 2.158357E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.073 | TFLOPs: 19.91 | 31: iteration 29190/ 173500 | consumed samples: 7472640 | consumed tokens: 15303966720 | elapsed time per iteration (s): 0.78 | learning rate: 1.889E-04 | global batch size: 256 | lm loss: 2.144674E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.405 | TFLOPs: 19.87 | 31: iteration 29200/ 173500 | consumed samples: 7475200 | consumed tokens: 15309209600 | elapsed time per iteration (s): 0.74 | learning rate: 1.889E-04 | global batch size: 256 | lm loss: 2.103506E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.020 | TFLOPs: 20.87 | 31: iteration 29210/ 173500 | consumed samples: 7477760 | consumed tokens: 15314452480 | elapsed time per iteration (s): 0.75 | learning rate: 1.889E-04 | global batch size: 256 | lm loss: 2.166799E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.380 | TFLOPs: 20.65 | 31: iteration 29220/ 173500 | consumed samples: 7480320 | consumed tokens: 15319695360 | elapsed time per iteration (s): 0.77 | learning rate: 1.889E-04 | global batch size: 256 | lm loss: 2.149116E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.718 | TFLOPs: 20.07 | 31: iteration 29230/ 173500 | consumed samples: 7482880 | consumed tokens: 15324938240 | elapsed time per iteration (s): 0.78 | learning rate: 1.889E-04 | global batch size: 
256 | lm loss: 2.120328E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.351 | TFLOPs: 19.92 | 31: iteration 29240/ 173500 | consumed samples: 7485440 | consumed tokens: 15330181120 | elapsed time per iteration (s): 0.77 | learning rate: 1.888E-04 | global batch size: 256 | lm loss: 2.141631E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.430 | TFLOPs: 20.17 | 31: iteration 29250/ 173500 | consumed samples: 7488000 | consumed tokens: 15335424000 | elapsed time per iteration (s): 0.81 | learning rate: 1.888E-04 | global batch size: 256 | lm loss: 2.121396E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.821 | TFLOPs: 19.23 | 31: iteration 29260/ 173500 | consumed samples: 7490560 | consumed tokens: 15340666880 | elapsed time per iteration (s): 0.77 | learning rate: 1.888E-04 | global batch size: 256 | lm loss: 2.136933E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.301 | TFLOPs: 20.04 | 31: iteration 29270/ 173500 | consumed samples: 7493120 | consumed tokens: 15345909760 | elapsed time per iteration (s): 0.76 | learning rate: 1.888E-04 | global batch size: 256 | lm loss: 2.147587E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.465 | TFLOPs: 20.42 | 31: iteration 29280/ 173500 | consumed samples: 7495680 | consumed tokens: 15351152640 | elapsed time per iteration (s): 0.74 | learning rate: 1.888E-04 | global batch size: 256 | lm loss: 2.157043E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.593 | TFLOPs: 20.91 | 31: iteration 29290/ 173500 | consumed samples: 7498240 | consumed 
tokens: 15356395520 | elapsed time per iteration (s): 0.74 | learning rate: 1.888E-04 | global batch size: 256 | lm loss: 2.139029E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.801 | TFLOPs: 20.80 | 31: iteration 29300/ 173500 | consumed samples: 7500800 | consumed tokens: 15361638400 | elapsed time per iteration (s): 0.78 | learning rate: 1.888E-04 | global batch size: 256 | lm loss: 2.152228E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.590 | TFLOPs: 19.94 | 31: iteration 29310/ 173500 | consumed samples: 7503360 | consumed tokens: 15366881280 | elapsed time per iteration (s): 0.74 | learning rate: 1.888E-04 | global batch size: 256 | lm loss: 2.121149E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.606 | TFLOPs: 21.03 | 31: iteration 29320/ 173500 | consumed samples: 7505920 | consumed tokens: 15372124160 | elapsed time per iteration (s): 0.77 | learning rate: 1.888E-04 | global batch size: 256 | lm loss: 2.168953E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.142 | TFLOPs: 20.09 | 31: iteration 29330/ 173500 | consumed samples: 7508480 | consumed tokens: 15377367040 | elapsed time per iteration (s): 0.84 | learning rate: 1.888E-04 | global batch size: 256 | lm loss: 2.142836E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.081 | TFLOPs: 18.40 | 31: iteration 29340/ 173500 | consumed samples: 7511040 | consumed tokens: 15382609920 | elapsed time per iteration (s): 0.80 | learning rate: 1.888E-04 | global batch size: 256 | lm loss: 2.122618E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 321.727 | TFLOPs: 19.46 | 31: iteration 29350/ 173500 | consumed samples: 7513600 | consumed tokens: 15387852800 | elapsed time per iteration (s): 0.80 | learning rate: 1.888E-04 | global batch size: 256 | lm loss: 2.127641E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.738 | TFLOPs: 19.46 | 31: iteration 29360/ 173500 | consumed samples: 7516160 | consumed tokens: 15393095680 | elapsed time per iteration (s): 0.77 | learning rate: 1.888E-04 | global batch size: 256 | lm loss: 2.159169E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.770 | TFLOPs: 20.19 | 31: iteration 29370/ 173500 | consumed samples: 7518720 | consumed tokens: 15398338560 | elapsed time per iteration (s): 0.83 | learning rate: 1.887E-04 | global batch size: 256 | lm loss: 2.137370E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.153 | TFLOPs: 18.64 | 31: iteration 29380/ 173500 | consumed samples: 7521280 | consumed tokens: 15403581440 | elapsed time per iteration (s): 0.79 | learning rate: 1.887E-04 | global batch size: 256 | lm loss: 2.162240E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.788 | TFLOPs: 19.71 | 31: iteration 29390/ 173500 | consumed samples: 7523840 | consumed tokens: 15408824320 | elapsed time per iteration (s): 0.77 | learning rate: 1.887E-04 | global batch size: 256 | lm loss: 2.143186E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.495 | TFLOPs: 20.24 | 31: iteration 29400/ 173500 | consumed samples: 7526400 | consumed tokens: 15414067200 | elapsed time per iteration (s): 0.77 | learning rate: 1.887E-04 | global batch size: 256 | lm loss: 2.169682E+00 | grad norm: 
0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.800 | TFLOPs: 20.19 | 31: iteration 29410/ 173500 | consumed samples: 7528960 | consumed tokens: 15419310080 | elapsed time per iteration (s): 0.76 | learning rate: 1.887E-04 | global batch size: 256 | lm loss: 2.102759E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.090 | TFLOPs: 20.27 | 31: iteration 29420/ 173500 | consumed samples: 7531520 | consumed tokens: 15424552960 | elapsed time per iteration (s): 0.78 | learning rate: 1.887E-04 | global batch size: 256 | lm loss: 2.107315E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.493 | TFLOPs: 19.81 | 31: iteration 29430/ 173500 | consumed samples: 7534080 | consumed tokens: 15429795840 | elapsed time per iteration (s): 0.78 | learning rate: 1.887E-04 | global batch size: 256 | lm loss: 2.165631E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.259 | TFLOPs: 19.98 | 31: iteration 29440/ 173500 | consumed samples: 7536640 | consumed tokens: 15435038720 | elapsed time per iteration (s): 0.73 | learning rate: 1.887E-04 | global batch size: 256 | lm loss: 2.148466E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.649 | TFLOPs: 21.33 | 31: iteration 29450/ 173500 | consumed samples: 7539200 | consumed tokens: 15440281600 | elapsed time per iteration (s): 0.76 | learning rate: 1.887E-04 | global batch size: 256 | lm loss: 2.140580E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.099 | TFLOPs: 20.39 | 31: iteration 29460/ 173500 | consumed samples: 7541760 | consumed tokens: 15445524480 | elapsed time per 
iteration (s): 0.78 | learning rate: 1.887E-04 | global batch size: 256 | lm loss: 2.168698E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.018 | TFLOPs: 19.97 | 31: iteration 29470/ 173500 | consumed samples: 7544320 | consumed tokens: 15450767360 | elapsed time per iteration (s): 0.82 | learning rate: 1.887E-04 | global batch size: 256 | lm loss: 2.156897E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.400 | TFLOPs: 18.96 | 31: iteration 29480/ 173500 | consumed samples: 7546880 | consumed tokens: 15456010240 | elapsed time per iteration (s): 0.74 | learning rate: 1.887E-04 | global batch size: 256 | lm loss: 2.160646E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.706 | TFLOPs: 21.04 | 31: iteration 29490/ 173500 | consumed samples: 7549440 | consumed tokens: 15461253120 | elapsed time per iteration (s): 0.81 | learning rate: 1.887E-04 | global batch size: 256 | lm loss: 2.135569E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.196 | TFLOPs: 19.13 | 31: iteration 29500/ 173500 | consumed samples: 7552000 | consumed tokens: 15466496000 | elapsed time per iteration (s): 0.76 | learning rate: 1.886E-04 | global batch size: 256 | lm loss: 2.162904E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.349 | TFLOPs: 20.35 | 31: iteration 29510/ 173500 | consumed samples: 7554560 | consumed tokens: 15471738880 | elapsed time per iteration (s): 0.80 | learning rate: 1.886E-04 | global batch size: 256 | lm loss: 2.159777E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.509 | TFLOPs: 19.33 | 31: 
iteration 29520/ 173500 | consumed samples: 7557120 | consumed tokens: 15476981760 | elapsed time per iteration (s): 0.76 | learning rate: 1.886E-04 | global batch size: 256 | lm loss: 2.156236E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.671 | TFLOPs: 20.31 | 31: iteration 29530/ 173500 | consumed samples: 7559680 | consumed tokens: 15482224640 | elapsed time per iteration (s): 0.81 | learning rate: 1.886E-04 | global batch size: 256 | lm loss: 2.134211E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.868 | TFLOPs: 19.23 | 31: iteration 29540/ 173500 | consumed samples: 7562240 | consumed tokens: 15487467520 | elapsed time per iteration (s): 0.73 | learning rate: 1.886E-04 | global batch size: 256 | lm loss: 2.167932E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.817 | TFLOPs: 21.10 | 31: iteration 29550/ 173500 | consumed samples: 7564800 | consumed tokens: 15492710400 | elapsed time per iteration (s): 0.84 | learning rate: 1.886E-04 | global batch size: 256 | lm loss: 2.130083E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.343 | TFLOPs: 18.35 | 31: iteration 29560/ 173500 | consumed samples: 7567360 | consumed tokens: 15497953280 | elapsed time per iteration (s): 0.79 | learning rate: 1.886E-04 | global batch size: 256 | lm loss: 2.139685E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.830 | TFLOPs: 19.65 | 31: iteration 29570/ 173500 | consumed samples: 7569920 | consumed tokens: 15503196160 | elapsed time per iteration (s): 0.77 | learning rate: 1.886E-04 | global batch size: 256 | lm loss: 2.138544E+00 | grad norm: 0.138 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.045 | TFLOPs: 20.21 | 31: iteration 29580/ 173500 | consumed samples: 7572480 | consumed tokens: 15508439040 | elapsed time per iteration (s): 0.76 | learning rate: 1.886E-04 | global batch size: 256 | lm loss: 2.158611E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.076 | TFLOPs: 20.33 | 31: iteration 29590/ 173500 | consumed samples: 7575040 | consumed tokens: 15513681920 | elapsed time per iteration (s): 0.75 | learning rate: 1.886E-04 | global batch size: 256 | lm loss: 2.131545E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.734 | TFLOPs: 20.67 | 31: iteration 29600/ 173500 | consumed samples: 7577600 | consumed tokens: 15518924800 | elapsed time per iteration (s): 0.75 | learning rate: 1.886E-04 | global batch size: 256 | lm loss: 2.173526E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.732 | TFLOPs: 20.55 | 31: iteration 29610/ 173500 | consumed samples: 7580160 | consumed tokens: 15524167680 | elapsed time per iteration (s): 0.79 | learning rate: 1.886E-04 | global batch size: 256 | lm loss: 2.127551E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.967 | TFLOPs: 19.66 | 31: iteration 29620/ 173500 | consumed samples: 7582720 | consumed tokens: 15529410560 | elapsed time per iteration (s): 0.77 | learning rate: 1.885E-04 | global batch size: 256 | lm loss: 2.150227E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.610 | TFLOPs: 20.18 | 31: iteration 29630/ 173500 | consumed samples: 7585280 | consumed tokens: 15534653440 | elapsed time per iteration (s): 0.85 | learning rate: 
1.885E-04 | global batch size: 256 | lm loss: 2.124895E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.490 | TFLOPs: 18.30 | 31: iteration 29640/ 173500 | consumed samples: 7587840 | consumed tokens: 15539896320 | elapsed time per iteration (s): 0.78 | learning rate: 1.885E-04 | global batch size: 256 | lm loss: 2.146060E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.117 | TFLOPs: 19.91 | 31: iteration 29650/ 173500 | consumed samples: 7590400 | consumed tokens: 15545139200 | elapsed time per iteration (s): 0.78 | learning rate: 1.885E-04 | global batch size: 256 | lm loss: 2.131112E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.195 | TFLOPs: 19.73 | 31: iteration 29660/ 173500 | consumed samples: 7592960 | consumed tokens: 15550382080 | elapsed time per iteration (s): 0.76 | learning rate: 1.885E-04 | global batch size: 256 | lm loss: 2.144838E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.155 | TFLOPs: 20.28 | 31: iteration 29670/ 173500 | consumed samples: 7595520 | consumed tokens: 15555624960 | elapsed time per iteration (s): 0.80 | learning rate: 1.885E-04 | global batch size: 256 | lm loss: 2.143406E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.965 | TFLOPs: 19.48 | 31: iteration 29680/ 173500 | consumed samples: 7598080 | consumed tokens: 15560867840 | elapsed time per iteration (s): 0.76 | learning rate: 1.885E-04 | global batch size: 256 | lm loss: 2.161883E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.845 | TFLOPs: 20.50 | 31: iteration 29690/ 173500 | consumed 
samples: 7600640 | consumed tokens: 15566110720 | elapsed time per iteration (s): 0.81 | learning rate: 1.885E-04 | global batch size: 256 | lm loss: 2.116077E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.733 | TFLOPs: 19.04 | 31: iteration 29700/ 173500 | consumed samples: 7603200 | consumed tokens: 15571353600 | elapsed time per iteration (s): 0.78 | learning rate: 1.885E-04 | global batch size: 256 | lm loss: 2.137483E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.858 | TFLOPs: 19.77 | 31: iteration 29710/ 173500 | consumed samples: 7605760 | consumed tokens: 15576596480 | elapsed time per iteration (s): 0.79 | learning rate: 1.885E-04 | global batch size: 256 | lm loss: 2.139493E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.922 | TFLOPs: 19.72 | 31: iteration 29720/ 173500 | consumed samples: 7608320 | consumed tokens: 15581839360 | elapsed time per iteration (s): 0.79 | learning rate: 1.885E-04 | global batch size: 256 | lm loss: 2.140546E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.416 | TFLOPs: 19.51 | 31: iteration 29730/ 173500 | consumed samples: 7610880 | consumed tokens: 15587082240 | elapsed time per iteration (s): 0.82 | learning rate: 1.885E-04 | global batch size: 256 | lm loss: 2.160570E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.946 | TFLOPs: 18.93 | 31: iteration 29740/ 173500 | consumed samples: 7613440 | consumed tokens: 15592325120 | elapsed time per iteration (s): 0.78 | learning rate: 1.884E-04 | global batch size: 256 | lm loss: 2.149002E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 326.430 | TFLOPs: 19.75 | 31: iteration 29750/ 173500 | consumed samples: 7616000 | consumed tokens: 15597568000 | elapsed time per iteration (s): 0.81 | learning rate: 1.884E-04 | global batch size: 256 | lm loss: 2.139769E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.671 | TFLOPs: 19.16 | 31: iteration 29760/ 173500 | consumed samples: 7618560 | consumed tokens: 15602810880 | elapsed time per iteration (s): 0.81 | learning rate: 1.884E-04 | global batch size: 256 | lm loss: 2.127057E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.528 | TFLOPs: 19.15 | 31: iteration 29770/ 173500 | consumed samples: 7621120 | consumed tokens: 15608053760 | elapsed time per iteration (s): 0.78 | learning rate: 1.884E-04 | global batch size: 256 | lm loss: 2.144679E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.058 | TFLOPs: 19.85 | 31: iteration 29780/ 173500 | consumed samples: 7623680 | consumed tokens: 15613296640 | elapsed time per iteration (s): 0.74 | learning rate: 1.884E-04 | global batch size: 256 | lm loss: 2.100052E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.474 | TFLOPs: 21.02 | 31: iteration 29790/ 173500 | consumed samples: 7626240 | consumed tokens: 15618539520 | elapsed time per iteration (s): 0.80 | learning rate: 1.884E-04 | global batch size: 256 | lm loss: 2.146963E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.939 | TFLOPs: 19.36 | 31: iteration 29800/ 173500 | consumed samples: 7628800 | consumed tokens: 15623782400 | elapsed time per iteration (s): 0.77 | learning rate: 1.884E-04 | global batch size: 256 | lm 
loss: 2.167540E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.779 | TFLOPs: 20.07 | 31: iteration 29810/ 173500 | consumed samples: 7631360 | consumed tokens: 15629025280 | elapsed time per iteration (s): 0.78 | learning rate: 1.884E-04 | global batch size: 256 | lm loss: 2.126874E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.500 | TFLOPs: 19.87 | 31: iteration 29820/ 173500 | consumed samples: 7633920 | consumed tokens: 15634268160 | elapsed time per iteration (s): 0.78 | learning rate: 1.884E-04 | global batch size: 256 | lm loss: 2.128455E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.380 | TFLOPs: 19.87 | 31: iteration 29830/ 173500 | consumed samples: 7636480 | consumed tokens: 15639511040 | elapsed time per iteration (s): 0.83 | learning rate: 1.884E-04 | global batch size: 256 | lm loss: 2.154910E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.853 | TFLOPs: 18.62 | 31: iteration 29840/ 173500 | consumed samples: 7639040 | consumed tokens: 15644753920 | elapsed time per iteration (s): 0.74 | learning rate: 1.884E-04 | global batch size: 256 | lm loss: 2.118678E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.033 | TFLOPs: 20.81 | 31: iteration 29850/ 173500 | consumed samples: 7641600 | consumed tokens: 15649996800 | elapsed time per iteration (s): 0.76 | learning rate: 1.884E-04 | global batch size: 256 | lm loss: 2.112721E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.441 | TFLOPs: 20.47 | 31: iteration 29860/ 173500 | consumed samples: 7644160 | consumed tokens: 
15655239680 | elapsed time per iteration (s): 0.76 | learning rate: 1.884E-04 | global batch size: 256 | lm loss: 2.130104E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.037 | TFLOPs: 20.27 | 31: iteration 29870/ 173500 | consumed samples: 7646720 | consumed tokens: 15660482560 | elapsed time per iteration (s): 0.78 | learning rate: 1.883E-04 | global batch size: 256 | lm loss: 2.121754E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.352 | TFLOPs: 19.74 | 31: iteration 29880/ 173500 | consumed samples: 7649280 | consumed tokens: 15665725440 | elapsed time per iteration (s): 0.76 | learning rate: 1.883E-04 | global batch size: 256 | lm loss: 2.139598E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.769 | TFLOPs: 20.31 | 31: iteration 29890/ 173500 | consumed samples: 7651840 | consumed tokens: 15670968320 | elapsed time per iteration (s): 0.75 | learning rate: 1.883E-04 | global batch size: 256 | lm loss: 2.140372E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.577 | TFLOPs: 20.66 | 31: iteration 29900/ 173500 | consumed samples: 7654400 | consumed tokens: 15676211200 | elapsed time per iteration (s): 0.77 | learning rate: 1.883E-04 | global batch size: 256 | lm loss: 2.159305E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.972 | TFLOPs: 20.08 | 31: iteration 29910/ 173500 | consumed samples: 7656960 | consumed tokens: 15681454080 | elapsed time per iteration (s): 0.72 | learning rate: 1.883E-04 | global batch size: 256 | lm loss: 2.134860E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
357.572 | TFLOPs: 21.63 | 31: iteration 29920/ 173500 | consumed samples: 7659520 | consumed tokens: 15686696960 | elapsed time per iteration (s): 0.76 | learning rate: 1.883E-04 | global batch size: 256 | lm loss: 2.143470E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.913 | TFLOPs: 20.26 | 31: iteration 29930/ 173500 | consumed samples: 7662080 | consumed tokens: 15691939840 | elapsed time per iteration (s): 0.76 | learning rate: 1.883E-04 | global batch size: 256 | lm loss: 2.125937E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.012 | TFLOPs: 20.51 | 31: iteration 29940/ 173500 | consumed samples: 7664640 | consumed tokens: 15697182720 | elapsed time per iteration (s): 0.83 | learning rate: 1.883E-04 | global batch size: 256 | lm loss: 2.121528E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.339 | TFLOPs: 18.71 | 31: iteration 29950/ 173500 | consumed samples: 7667200 | consumed tokens: 15702425600 | elapsed time per iteration (s): 0.75 | learning rate: 1.883E-04 | global batch size: 256 | lm loss: 2.144388E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.419 | TFLOPs: 20.53 | 31: iteration 29960/ 173500 | consumed samples: 7669760 | consumed tokens: 15707668480 | elapsed time per iteration (s): 0.79 | learning rate: 1.883E-04 | global batch size: 256 | lm loss: 2.141039E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.159 | TFLOPs: 19.67 | 31: iteration 29970/ 173500 | consumed samples: 7672320 | consumed tokens: 15712911360 | elapsed time per iteration (s): 0.81 | learning rate: 1.883E-04 | global batch size: 256 | lm loss: 2.124932E+00 | grad norm: 0.133 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.711 | TFLOPs: 19.04 | 31: iteration 29980/ 173500 | consumed samples: 7674880 | consumed tokens: 15718154240 | elapsed time per iteration (s): 0.77 | learning rate: 1.883E-04 | global batch size: 256 | lm loss: 2.131488E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.638 | TFLOPs: 20.06 | 31: iteration 29990/ 173500 | consumed samples: 7677440 | consumed tokens: 15723397120 | elapsed time per iteration (s): 0.83 | learning rate: 1.882E-04 | global batch size: 256 | lm loss: 2.147488E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.790 | TFLOPs: 18.62 | 0: [2022-11-26 00:49:57,447] [INFO] [logging.py:68:log_dist] [Rank 0] step=30000, skipped=0, lr=[0.00018823900512431258, 0.00018823900512431258, 0.00018823900512431258], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 30000/ 173500 | consumed samples: 7680000 | consumed tokens: 15728640000 | elapsed time per iteration (s): 0.77 | learning rate: 1.882E-04 | global batch size: 256 | lm loss: 2.154408E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.637 | TFLOPs: 20.24 | 0: steps: 30000 loss: 2.1293 iter time (s): 0.778 samples/sec: 328.944 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 30000 | lm loss value: 2.181812E+00 | lm loss PPL: 8.862346E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 30000 to checkpoints_1b1long 0: [2022-11-26 00:49:57,950] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step30000 is begin to save! 
0: [2022-11-26 00:49:57,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_01-model_00-model_states.pt... 0: [2022-11-26 00:49:58,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_01-model_00-model_states.pt. 0: [2022-11-26 00:49:58,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_03-model_00-model_states.pt... 0: [2022-11-26 00:49:58,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_03-model_00-model_states.pt. 0: [2022-11-26 00:49:58,299] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_04-model_00-model_states.pt... 0: [2022-11-26 00:49:58,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_04-model_00-model_states.pt. 0: [2022-11-26 00:49:58,379] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_05-model_00-model_states.pt... 0: [2022-11-26 00:49:58,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_05-model_00-model_states.pt. 0: [2022-11-26 00:49:58,469] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_06-model_00-model_states.pt... 0: [2022-11-26 00:49:58,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_06-model_00-model_states.pt. 0: [2022-11-26 00:49:58,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_07-model_00-model_states.pt... 0: [2022-11-26 00:49:58,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 00:49:58,630] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_08-model_00-model_states.pt... 0: [2022-11-26 00:49:58,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_08-model_00-model_states.pt. 0: [2022-11-26 00:49:58,712] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_09-model_00-model_states.pt... 0: [2022-11-26 00:49:58,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_09-model_00-model_states.pt. 0: [2022-11-26 00:49:58,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_10-model_00-model_states.pt... 0: [2022-11-26 00:49:58,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_10-model_00-model_states.pt. 0: [2022-11-26 00:49:58,873] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_11-model_00-model_states.pt... 0: [2022-11-26 00:49:58,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_11-model_00-model_states.pt. 0: [2022-11-26 00:49:58,959] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_12-model_00-model_states.pt... 0: [2022-11-26 00:49:59,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_12-model_00-model_states.pt. 0: [2022-11-26 00:49:59,040] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_13-model_00-model_states.pt... 0: [2022-11-26 00:49:59,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 00:49:59,120] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_14-model_00-model_states.pt... 0: [2022-11-26 00:49:59,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_14-model_00-model_states.pt. 0: [2022-11-26 00:49:59,200] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_15-model_00-model_states.pt... 0: [2022-11-26 00:49:59,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_15-model_00-model_states.pt. 0: [2022-11-26 00:49:59,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_16-model_00-model_states.pt... 0: [2022-11-26 00:49:59,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_16-model_00-model_states.pt. 0: [2022-11-26 00:49:59,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_17-model_00-model_states.pt... 0: [2022-11-26 00:49:59,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_17-model_00-model_states.pt. 0: [2022-11-26 00:49:59,446] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_18-model_00-model_states.pt... 0: [2022-11-26 00:49:59,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_18-model_00-model_states.pt. 0: [2022-11-26 00:49:59,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_19-model_00-model_states.pt... 0: [2022-11-26 00:49:59,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 00:49:59,606] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_20-model_00-model_states.pt... 0: [2022-11-26 00:49:59,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_20-model_00-model_states.pt. 0: [2022-11-26 00:49:59,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_21-model_00-model_states.pt... 0: [2022-11-26 00:49:59,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_21-model_00-model_states.pt. 0: [2022-11-26 00:49:59,762] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_22-model_00-model_states.pt... 0: [2022-11-26 00:49:59,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_22-model_00-model_states.pt. 0: [2022-11-26 00:49:59,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_23-model_00-model_states.pt... 0: [2022-11-26 00:49:59,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_23-model_00-model_states.pt. 0: [2022-11-26 00:49:59,923] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_24-model_00-model_states.pt... 0: [2022-11-26 00:50:00,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_24-model_00-model_states.pt. 0: [2022-11-26 00:50:00,007] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_25-model_00-model_states.pt... 0: [2022-11-26 00:50:00,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 00:50:00,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_26-model_00-model_states.pt... 0: [2022-11-26 00:50:00,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_26-model_00-model_states.pt. 0: [2022-11-26 00:50:00,166] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_27-model_00-model_states.pt... 0: [2022-11-26 00:50:00,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_27-model_00-model_states.pt. 0: [2022-11-26 00:50:00,246] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_28-model_00-model_states.pt... 0: [2022-11-26 00:50:00,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_28-model_00-model_states.pt. 0: [2022-11-26 00:50:00,327] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/layer_30-model_00-model_states.pt... 0: [2022-11-26 00:50:00,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/layer_30-model_00-model_states.pt. 0: [2022-11-26 00:50:00,329] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step30000/mp_rank_00_model_states.pt 0: [2022-11-26 00:50:00,329] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/mp_rank_00_model_states.pt... 0: [2022-11-26 00:50:00,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/mp_rank_00_model_states.pt. 0: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
0: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
4: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
10: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
2: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
15: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
28: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 
22: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 
21: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 
0: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
4: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
16: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
15: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 
23: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
31: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 
30: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 
27: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
10: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 
25: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 28: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
31: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 
26: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 
9: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 20: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 
28: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 22: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
13: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 31: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 30: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 19: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 24: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 
31: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 18: [2022-11-26 00:50:00,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:50:00,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:50:00,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 00:50:00,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 4: [2022-11-26 00:50:00,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:50:00,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 28: [2022-11-26 00:50:00,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:50:00,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 28: [2022-11-26 00:50:00,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 00:50:00,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 10: [2022-11-26 00:50:00,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-26 00:50:00,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 00:50:00,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 12: [2022-11-26 00:50:00,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:50:00,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 00:50:00,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 11: [2022-11-26 00:50:00,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:50:00,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 00:50:00,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 3: [2022-11-26 00:50:00,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:50:00,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 00:50:00,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: [2022-11-26 00:50:00,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
23: [2022-11-26 00:50:00,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 00:50:00,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 00:50:00,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 13: [2022-11-26 00:50:00,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:50:00,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 00:50:00,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 1: [2022-11-26 00:50:00,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:50:00,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 00:50:00,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 30: [2022-11-26 00:50:00,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 00:50:00,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 00:50:00,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
21: [2022-11-26 00:50:00,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:50:00,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:50:00,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 4: [2022-11-26 00:50:00,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 31: [2022-11-26 00:50:00,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:50:00,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 4: [2022-11-26 00:50:00,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 31: [2022-11-26 00:50:00,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 00:50:00,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 23: [2022-11-26 00:50:00,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 00:50:00,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 00:50:00,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
16: [2022-11-26 00:50:00,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 00:50:00,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 00:50:00,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 14: [2022-11-26 00:50:00,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:50:00,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 7: [2022-11-26 00:50:00,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:50:00,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:50:00,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 7: [2022-11-26 00:50:00,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:50:00,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 7: [2022-11-26 00:50:00,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 2: [2022-11-26 00:50:00,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
7: [2022-11-26 00:50:00,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 00:50:00,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 9: [2022-11-26 00:50:00,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:50:00,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:50:00,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 9: [2022-11-26 00:50:00,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 15: [2022-11-26 00:50:00,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 9: [2022-11-26 00:50:00,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 15: [2022-11-26 00:50:00,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 8: [2022-11-26 00:50:00,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:50:00,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 00:50:00,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
9: [2022-11-26 00:50:00,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:50:00,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 00:50:00,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 18: [2022-11-26 00:50:00,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 00:50:00,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 00:50:00,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: [2022-11-26 00:50:00,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:50:00,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 00:50:00,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 24: [2022-11-26 00:50:00,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 00:50:00,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 00:50:00,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
11: [2022-11-26 00:50:00,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:50:00,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 00:50:00,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 24: [2022-11-26 00:50:00,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:50:00,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:50:00,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 24: [2022-11-26 00:50:00,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 13: [2022-11-26 00:50:00,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 24: [2022-11-26 00:50:00,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 1: [2022-11-26 00:50:00,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:50:00,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 00:50:00,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
5: [2022-11-26 00:50:00,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:50:00,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 00:50:00,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 26: [2022-11-26 00:50:00,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:50:00,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 00:50:00,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 29: [2022-11-26 00:50:00,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:50:00,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:50:00,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 3: [2022-11-26 00:50:00,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 29: [2022-11-26 00:50:00,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 3: [2022-11-26 00:50:00,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
30: [2022-11-26 00:50:00,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 18: [2022-11-26 00:50:00,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 30: [2022-11-26 00:50:00,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 18: [2022-11-26 00:50:00,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 30: [2022-11-26 00:50:00,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 18: [2022-11-26 00:50:00,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 4: [2022-11-26 00:50:00,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:50:00,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:50:00,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 00:50:00,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 8: [2022-11-26 00:50:00,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 00:50:00,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
16: [2022-11-26 00:50:00,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 00:50:00,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 00:50:00,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 28: [2022-11-26 00:50:00,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:50:00,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:50:00,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 00:50:00,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 27: [2022-11-26 00:50:00,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 00:50:00,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 11: [2022-11-26 00:50:00,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 27: [2022-11-26 00:50:00,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
11: [2022-11-26 00:50:00,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 00:50:00,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 23: [2022-11-26 00:50:00,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 27: [2022-11-26 00:50:00,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 23: [2022-11-26 00:50:00,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 27: [2022-11-26 00:50:00,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 23: [2022-11-26 00:50:00,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 27: [2022-11-26 00:50:00,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 14: [2022-11-26 00:50:00,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:50:00,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 00:50:00,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 10: [2022-11-26 00:50:00,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-26 00:50:00,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 00:50:00,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 7: [2022-11-26 00:50:00,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:50:00,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 00:50:00,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 2: [2022-11-26 00:50:00,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:50:00,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 00:50:00,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 24: [2022-11-26 00:50:00,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 00:50:00,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 26: [2022-11-26 00:50:00,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 24: [2022-11-26 00:50:00,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
9: [2022-11-26 00:50:00,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:50:00,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 9: [2022-11-26 00:50:00,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 13: [2022-11-26 00:50:00,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:50:00,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 9: [2022-11-26 00:50:00,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 13: [2022-11-26 00:50:00,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 00:50:00,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 15: [2022-11-26 00:50:00,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:50:00,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 00:50:00,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 29: [2022-11-26 00:50:00,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
3: [2022-11-26 00:50:00,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:50:00,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 29: [2022-11-26 00:50:00,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 10: [2022-11-26 00:50:00,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:50:00,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 10: [2022-11-26 00:50:00,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 1: [2022-11-26 00:50:00,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:50:00,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 10: [2022-11-26 00:50:00,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 16: [2022-11-26 00:50:00,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
1: [2022-11-26 00:50:00,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 16: [2022-11-26 00:50:00,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 1: [2022-11-26 00:50:00,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 16: [2022-11-26 00:50:00,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 12: [2022-11-26 00:50:00,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:50:00,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 00:50:00,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 22: [2022-11-26 00:50:00,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:50:00,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 00:50:00,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 21: [2022-11-26 00:50:00,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:50:00,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
21: [2022-11-26 00:50:00,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 29: [2022-11-26 00:50:00,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 21: [2022-11-26 00:50:00,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 29: [2022-11-26 00:50:00,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 22: [2022-11-26 00:50:00,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:50:00,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 12: [2022-11-26 00:50:00,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:50:00,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 22: [2022-11-26 00:50:00,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 12: [2022-11-26 00:50:00,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 6: [2022-11-26 00:50:00,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:50:00,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 00:50:00,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:50:00,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:50:00,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 00:50:00,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 4: [2022-11-26 00:50:00,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 22: [2022-11-26 00:50:00,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:50:00,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 4: [2022-11-26 00:50:00,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 22: [2022-11-26 00:50:00,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 6: [2022-11-26 00:50:00,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 6: [2022-11-26 00:50:00,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 22: [2022-11-26 00:50:00,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
6: [2022-11-26 00:50:00,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 21: [2022-11-26 00:50:00,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:50:00,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:50:00,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 7: [2022-11-26 00:50:00,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 21: [2022-11-26 00:50:00,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 7: [2022-11-26 00:50:00,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 5: [2022-11-26 00:50:00,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:50:00,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 00:50:00,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 18: [2022-11-26 00:50:00,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
28: [2022-11-26 00:50:00,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 18: [2022-11-26 00:50:00,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 28: [2022-11-26 00:50:00,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 30: [2022-11-26 00:50:00,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 28: [2022-11-26 00:50:00,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 00:50:00,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 18: [2022-11-26 00:50:00,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 28: [2022-11-26 00:50:00,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 8: [2022-11-26 00:50:00,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 23: [2022-11-26 00:50:00,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
30: [2022-11-26 00:50:00,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 8: [2022-11-26 00:50:00,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 23: [2022-11-26 00:50:00,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 30: [2022-11-26 00:50:00,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 8: [2022-11-26 00:50:00,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 23: [2022-11-26 00:50:00,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 29: [2022-11-26 00:50:00,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:50:00,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 00:50:00,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 22: [2022-11-26 00:50:00,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:50:00,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 00:50:00,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
30: [2022-11-26 00:50:00,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 00:50:00,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 00:50:00,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 14: [2022-11-26 00:50:00,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:50:00,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 00:50:00,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 10: [2022-11-26 00:50:00,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:50:00,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 00:50:00,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 8: [2022-11-26 00:50:00,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 00:50:00,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 2: [2022-11-26 00:50:00,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:50:00,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 2: [2022-11-26 00:50:00,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 00:50:00,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 31: [2022-11-26 00:50:00,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:50:00,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 31: [2022-11-26 00:50:00,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 2: [2022-11-26 00:50:00,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 31: [2022-11-26 00:50:00,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 2: [2022-11-26 00:50:00,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 17: [2022-11-26 00:50:00,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-26 00:50:00,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:50:00,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:50:00,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:50:00,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 00:50:00,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 00:50:00,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 00:50:00,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 00:50:00,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 17: [2022-11-26 00:50:00,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 17: [2022-11-26 00:50:00,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 17: [2022-11-26 00:50:00,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 1: [2022-11-26 00:50:00,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-26 00:50:00,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 26: [2022-11-26 00:50:00,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:50:00,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 1: [2022-11-26 00:50:00,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 26: [2022-11-26 00:50:00,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 27: [2022-11-26 00:50:00,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 00:50:00,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 00:50:00,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: [2022-11-26 00:50:00,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:50:00,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 00:50:00,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 28: [2022-11-26 00:50:00,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
25: [2022-11-26 00:50:00,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:50:00,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:50:00,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:50:00,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:50:00,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 00:50:00,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 00:50:00,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 00:50:00,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 25: [2022-11-26 00:50:00,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 25: [2022-11-26 00:50:00,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 00:50:00,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 25: [2022-11-26 00:50:00,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
11: [2022-11-26 00:50:00,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 31: [2022-11-26 00:50:00,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:50:00,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 31: [2022-11-26 00:50:00,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 9: [2022-11-26 00:50:00,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:50:00,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 9: [2022-11-26 00:50:00,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 31: [2022-11-26 00:50:00,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 9: [2022-11-26 00:50:00,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 5: [2022-11-26 00:50:00,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:50:00,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 00:50:00,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
28: [2022-11-26 00:50:00,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 00:50:00,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 6: [2022-11-26 00:50:00,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 20: [2022-11-26 00:50:00,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 00:50:00,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:50:00,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 00:50:00,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 20: [2022-11-26 00:50:00,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 00:50:00,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-26 00:50:00,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 00:50:00,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 00:50:00,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 00:50:00,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 00:50:00,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 20: [2022-11-26 00:50:00,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 20: [2022-11-26 00:50:00,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 20: [2022-11-26 00:50:00,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: [2022-11-26 00:50:00,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:50:00,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 00:50:00,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 21: [2022-11-26 00:50:00,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-26 00:50:00,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 00:50:00,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 26: [2022-11-26 00:50:00,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:50:00,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 00:50:00,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 18: [2022-11-26 00:50:00,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 00:50:00,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 00:50:00,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 12: [2022-11-26 00:50:00,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:50:00,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 00:50:00,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 19: [2022-11-26 00:50:00,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-26 00:50:00,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:50:00,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:50:00,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:50:00,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 00:50:00,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 00:50:00,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 00:50:00,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 00:50:00,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: [2022-11-26 00:50:00,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 19: [2022-11-26 00:50:00,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 19: [2022-11-26 00:50:00,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 19: [2022-11-26 00:50:00,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
0: [2022-11-26 00:50:00,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 13: [2022-11-26 00:50:00,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:50:00,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 00:50:00,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 16: [2022-11-26 00:50:00,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 00:50:00,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 00:50:00,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 14: [2022-11-26 00:50:00,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:50:00,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 00:50:00,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 3: [2022-11-26 00:50:00,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 00:50:00,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 00:50:00,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 24: [2022-11-26 00:50:00,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 00:50:00,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 00:50:00,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 15: [2022-11-26 00:50:00,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:50:00,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 00:50:00,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 27: [2022-11-26 00:50:00,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 00:50:00,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 00:50:00,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 31: [2022-11-26 00:50:00,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-26 00:50:00,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 00:50:00,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 4: [2022-11-26 00:50:00,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:50:00,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 00:50:00,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 5: [2022-11-26 00:50:00,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:50:00,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 00:50:00,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 20: [2022-11-26 00:50:00,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 00:50:00,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 00:50:00,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 29: [2022-11-26 00:50:00,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-26 00:50:00,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 00:50:00,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 10: [2022-11-26 00:50:00,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:50:00,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 00:50:00,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 19: [2022-11-26 00:50:00,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:50:00,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 00:50:00,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 1: [2022-11-26 00:50:00,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:50:00,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 00:50:00,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 23: [2022-11-26 00:50:00,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-26 00:50:00,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 00:50:00,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 28: [2022-11-26 00:50:00,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 00:50:00,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 00:50:00,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: [2022-11-26 00:50:00,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:50:00,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 00:50:00,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 17: [2022-11-26 00:50:00,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:50:00,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
17: [2022-11-26 00:50:00,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 11: [2022-11-26 00:50:00,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 17: [2022-11-26 00:50:00,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 2: [2022-11-26 00:50:00,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:50:00,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 2: [2022-11-26 00:50:00,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 00:50:00,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 22: [2022-11-26 00:50:00,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:50:00,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 00:50:00,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 9: [2022-11-26 00:50:00,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 00:50:00,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 00:50:00,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 25: [2022-11-26 00:50:00,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:50:00,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 7: [2022-11-26 00:50:00,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:50:00,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 7: [2022-11-26 00:50:00,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 00:50:00,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 8: [2022-11-26 00:50:00,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:50:00,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 00:50:00,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 24: [2022-11-26 00:50:00,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
24: [2022-11-26 00:50:00,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 00:50:00,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 21: [2022-11-26 00:50:00,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:50:00,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 00:50:00,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 3: [2022-11-26 00:50:00,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:50:00,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 00:50:00,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 6: [2022-11-26 00:50:00,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:50:00,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 00:50:00,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 16: [2022-11-26 00:50:00,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
18: [2022-11-26 00:50:00,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 16: [2022-11-26 00:50:00,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 18: [2022-11-26 00:50:00,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 16: [2022-11-26 00:50:00,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 18: [2022-11-26 00:50:00,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 12: [2022-11-26 00:50:00,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:50:00,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 00:50:00,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 13: [2022-11-26 00:50:00,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:50:00,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 00:50:00,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 31: [2022-11-26 00:50:00,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 
31: [2022-11-26 00:50:00,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 00:50:00,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 15: [2022-11-26 00:50:00,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:50:00,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 00:50:00,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 30: [2022-11-26 00:50:00,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 00:50:00,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 00:50:00,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 26: [2022-11-26 00:50:00,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:50:00,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 14: [2022-11-26 00:50:00,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:50:00,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
14: [2022-11-26 00:50:00,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 00:50:00,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 27: [2022-11-26 00:50:00,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 00:50:00,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 00:50:00,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 5: [2022-11-26 00:50:00,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:50:00,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 00:50:00,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 4: [2022-11-26 00:50:00,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:50:00,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 00:50:00,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 20: [2022-11-26 00:50:00,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 
20: [2022-11-26 00:50:00,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 00:50:00,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 10: [2022-11-26 00:50:00,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:50:00,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 00:50:00,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 29: [2022-11-26 00:50:00,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:50:00,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 00:50:00,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 23: [2022-11-26 00:50:00,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 00:50:00,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 00:50:00,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 19: [2022-11-26 00:50:00,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-26 00:50:00,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 00:50:00,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 17: [2022-11-26 00:50:00,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:50:00,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 00:50:00,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 28: [2022-11-26 00:50:00,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:50:00,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:50:00,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 00:50:00,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 28: [2022-11-26 00:50:00,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 00:50:00,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 22: [2022-11-26 00:50:00,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-26 00:50:00,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 00:50:00,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 25: [2022-11-26 00:50:00,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:50:00,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 00:50:00,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 7: [2022-11-26 00:50:00,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:50:00,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 00:50:00,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 21: [2022-11-26 00:50:00,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:50:00,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 00:50:00,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 11: [2022-11-26 00:50:00,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 00:50:00,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 00:50:00,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: [2022-11-26 00:50:00,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:50:00,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 00:50:00,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 9: [2022-11-26 00:50:00,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:50:00,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 00:50:00,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 2: [2022-11-26 00:50:00,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:50:00,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:50:00,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 3: [2022-11-26 00:50:00,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
8: [2022-11-26 00:50:00,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 2: [2022-11-26 00:50:00,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 13: [2022-11-26 00:50:00,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:50:00,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 3: [2022-11-26 00:50:00,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 13: [2022-11-26 00:50:00,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 3: [2022-11-26 00:50:00,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 12: [2022-11-26 00:50:00,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:50:00,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 12: [2022-11-26 00:50:00,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 00:50:00,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 16: [2022-11-26 00:50:00,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
16: [2022-11-26 00:50:00,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 18: [2022-11-26 00:50:00,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 16: [2022-11-26 00:50:00,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 18: [2022-11-26 00:50:00,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 00:50:00,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 14: [2022-11-26 00:50:00,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:50:00,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 00:50:00,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 26: [2022-11-26 00:50:00,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:50:00,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 00:50:00,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 15: [2022-11-26 00:50:00,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-26 00:50:00,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 00:50:00,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 6: [2022-11-26 00:50:00,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:50:00,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 00:50:00,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 30: [2022-11-26 00:50:00,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 27: [2022-11-26 00:50:00,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 30: [2022-11-26 00:50:00,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 00:50:00,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 27: [2022-11-26 00:50:00,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 00:50:00,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 4: [2022-11-26 00:50:00,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 00:50:00,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 00:50:00,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 24: [2022-11-26 00:50:00,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 00:50:00,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 00:50:00,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 31: [2022-11-26 00:50:00,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 20: [2022-11-26 00:50:00,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 31: [2022-11-26 00:50:00,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 00:50:00,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 20: [2022-11-26 00:50:00,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 00:50:00,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 5: [2022-11-26 00:50:00,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 00:50:00,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 00:50:00,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 29: [2022-11-26 00:50:00,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:50:00,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 00:50:00,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 10: [2022-11-26 00:50:00,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:50:00,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 00:50:00,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 17: [2022-11-26 00:50:00,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:50:00,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 00:50:00,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 28: [2022-11-26 00:50:00,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
28: [2022-11-26 00:50:00,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 00:50:00,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 23: [2022-11-26 00:50:00,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 00:50:00,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 00:50:00,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 19: [2022-11-26 00:50:00,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:50:00,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 00:50:00,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 1: [2022-11-26 00:50:00,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:50:00,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 25: [2022-11-26 00:50:00,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:50:00,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
25: [2022-11-26 00:50:00,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 00:50:00,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 22: [2022-11-26 00:50:00,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:50:00,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 00:50:00,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: [2022-11-26 00:50:00,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:50:00,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 00:50:00,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 7: [2022-11-26 00:50:00,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:50:00,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 00:50:00,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 8: [2022-11-26 00:50:00,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 00:50:00,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 00:50:00,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 24: [2022-11-26 00:50:00,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 00:50:00,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 00:50:00,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 11: [2022-11-26 00:50:00,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:50:00,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 00:50:00,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 6: [2022-11-26 00:50:00,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:50:00,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 00:50:00,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 13: [2022-11-26 00:50:00,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 00:50:00,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 00:50:00,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 6: [2022-11-26 00:50:00,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:50:00,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 00:50:00,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 30: [2022-11-26 00:50:00,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 00:50:00,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 00:50:00,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 16: [2022-11-26 00:50:00,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 18: [2022-11-26 00:50:00,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
16: [2022-11-26 00:50:00,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 18: [2022-11-26 00:50:00,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 16: [2022-11-26 00:50:00,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 18: [2022-11-26 00:50:00,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 2: [2022-11-26 00:50:00,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:50:00,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:50:00,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 00:50:00,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 9: [2022-11-26 00:50:00,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 00:50:00,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 13: [2022-11-26 00:50:00,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 19: [2022-11-26 00:50:00,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
19: [2022-11-26 00:50:00,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 13: [2022-11-26 00:50:00,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 19: [2022-11-26 00:50:00,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 13: [2022-11-26 00:50:00,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: [2022-11-26 00:50:00,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:50:00,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 00:50:00,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 16: [2022-11-26 00:50:00,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 00:50:00,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 00:50:00,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 26: [2022-11-26 00:50:00,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-26 00:50:00,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 00:50:00,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 1: [2022-11-26 00:50:00,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:50:00,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 00:50:00,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 22: [2022-11-26 00:50:00,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 00:50:00,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 00:50:00,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 27: [2022-11-26 00:50:00,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 18: [2022-11-26 00:50:00,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 24: [2022-11-26 00:50:00,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
27: [2022-11-26 00:50:00,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 18: [2022-11-26 00:50:00,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 27: [2022-11-26 00:50:00,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 24: [2022-11-26 00:50:00,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 18: [2022-11-26 00:50:00,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 24: [2022-11-26 00:50:00,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 15: [2022-11-26 00:50:00,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:50:00,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:50:00,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:50:00,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 11: [2022-11-26 00:50:00,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 15: [2022-11-26 00:50:00,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
7: [2022-11-26 00:50:00,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 11: [2022-11-26 00:50:00,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 7: [2022-11-26 00:50:00,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 25: [2022-11-26 00:50:00,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:50:00,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 4: [2022-11-26 00:50:00,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 25: [2022-11-26 00:50:00,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 4: [2022-11-26 00:50:00,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 00:50:00,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 23: [2022-11-26 00:50:00,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:50:00,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
23: [2022-11-26 00:50:00,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 12: [2022-11-26 00:50:00,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 23: [2022-11-26 00:50:00,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 12: [2022-11-26 00:50:00,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:50:00,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 12: [2022-11-26 00:50:00,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 00:50:00,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 27: [2022-11-26 00:50:00,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 00:50:00,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 8: [2022-11-26 00:50:00,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 27: [2022-11-26 00:50:00,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
8: [2022-11-26 00:50:00,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 00:50:00,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 21: [2022-11-26 00:50:00,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:50:00,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 30: [2022-11-26 00:50:00,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 21: [2022-11-26 00:50:00,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 30: [2022-11-26 00:50:00,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 00:50:00,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 5: [2022-11-26 00:50:00,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 29: [2022-11-26 00:50:00,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
5: [2022-11-26 00:50:00,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 29: [2022-11-26 00:50:00,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 5: [2022-11-26 00:50:00,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 29: [2022-11-26 00:50:00,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 3: [2022-11-26 00:50:00,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:50:00,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 00:50:00,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 2: [2022-11-26 00:50:00,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:50:00,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:50:00,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 17: [2022-11-26 00:50:00,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 2: [2022-11-26 00:50:00,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
21: [2022-11-26 00:50:00,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 17: [2022-11-26 00:50:00,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 21: [2022-11-26 00:50:00,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 00:50:00,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 31: [2022-11-26 00:50:00,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 00:50:00,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 00:50:00,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 26: [2022-11-26 00:50:00,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 00:50:00,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 00:50:00,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 28: [2022-11-26 00:50:00,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
28: [2022-11-26 00:50:00,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 00:50:00,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 9: [2022-11-26 00:50:00,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:50:00,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 00:50:00,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 20: [2022-11-26 00:50:00,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 00:50:00,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 00:50:00,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 31: [2022-11-26 00:50:00,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 00:50:00,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 00:50:00,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 14: [2022-11-26 00:50:00,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 00:50:00,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:50:00,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 00:50:00,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 00:50:00,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 14: [2022-11-26 00:50:00,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 3: [2022-11-26 00:50:00,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:50:00,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 00:50:00,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 10: [2022-11-26 00:50:00,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:50:00,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 00:50:00,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 15: [2022-11-26 00:50:00,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-26 00:50:00,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step30000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 00:50:00,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: successfully saved checkpoint at iteration 30000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2946.49 31: iteration 30010/ 173500 | consumed samples: 7682560 | consumed tokens: 15733882880 | elapsed time per iteration (s): 1.07 | learning rate: 1.882E-04 | global batch size: 256 | lm loss: 2.132077E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.041 | TFLOPs: 14.52 | 31: iteration 30020/ 173500 | consumed samples: 7685120 | consumed tokens: 15739125760 | elapsed time per iteration (s): 0.78 | learning rate: 1.882E-04 | global batch size: 256 | lm loss: 2.153677E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.360 | TFLOPs: 19.80 | 31: iteration 30030/ 173500 | consumed samples: 7687680 | consumed tokens: 15744368640 | elapsed time per iteration (s): 0.81 | learning rate: 1.882E-04 | global batch size: 256 | lm loss: 2.154313E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.770 | TFLOPs: 19.22 | 31: iteration 30040/ 173500 | consumed samples: 7690240 | consumed tokens: 15749611520 | elapsed time per iteration (s): 0.80 | learning rate: 1.882E-04 | global batch size: 256 | lm loss: 2.129871E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.027 | TFLOPs: 19.36 | 31: iteration 30050/ 173500 | consumed samples: 7692800 | consumed tokens: 15754854400 | elapsed time per iteration (s): 1.17 | learning rate: 1.882E-04 | global batch 
size: 256 | lm loss: 2.159493E+00 | grad norm: 0.376 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 218.602 | TFLOPs: 13.22 | 31: iteration 30060/ 173500 | consumed samples: 7695360 | consumed tokens: 15760097280 | elapsed time per iteration (s): 0.78 | learning rate: 1.882E-04 | global batch size: 256 | lm loss: 2.169090E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.513 | TFLOPs: 19.93 | 31: iteration 30070/ 173500 | consumed samples: 7697920 | consumed tokens: 15765340160 | elapsed time per iteration (s): 0.77 | learning rate: 1.882E-04 | global batch size: 256 | lm loss: 2.177340E+00 | grad norm: 0.400 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.535 | TFLOPs: 20.06 | 31: iteration 30080/ 173500 | consumed samples: 7700480 | consumed tokens: 15770583040 | elapsed time per iteration (s): 0.77 | learning rate: 1.882E-04 | global batch size: 256 | lm loss: 2.701341E+00 | grad norm: 3.619 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.256 | TFLOPs: 20.16 | 31: iteration 30090/ 173500 | consumed samples: 7703040 | consumed tokens: 15775825920 | elapsed time per iteration (s): 0.76 | learning rate: 1.882E-04 | global batch size: 256 | lm loss: 2.259341E+00 | grad norm: 0.300 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.146 | TFLOPs: 20.28 | 31: iteration 30100/ 173500 | consumed samples: 7705600 | consumed tokens: 15781068800 | elapsed time per iteration (s): 0.79 | learning rate: 1.882E-04 | global batch size: 256 | lm loss: 2.171513E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.912 | TFLOPs: 19.66 | 31: iteration 30110/ 173500 | consumed samples: 7708160 | consumed 
tokens: 15786311680 | elapsed time per iteration (s): 0.75 | learning rate: 1.881E-04 | global batch size: 256 | lm loss: 2.182213E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.755 | TFLOPs: 20.68 | 31: iteration 30120/ 173500 | consumed samples: 7710720 | consumed tokens: 15791554560 | elapsed time per iteration (s): 0.79 | learning rate: 1.881E-04 | global batch size: 256 | lm loss: 2.149308E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.789 | TFLOPs: 19.53 | 31: iteration 30130/ 173500 | consumed samples: 7713280 | consumed tokens: 15796797440 | elapsed time per iteration (s): 0.76 | learning rate: 1.881E-04 | global batch size: 256 | lm loss: 2.156975E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.486 | TFLOPs: 20.42 | 31: iteration 30140/ 173500 | consumed samples: 7715840 | consumed tokens: 15802040320 | elapsed time per iteration (s): 0.76 | learning rate: 1.881E-04 | global batch size: 256 | lm loss: 2.156400E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.175 | TFLOPs: 20.28 | 31: iteration 30150/ 173500 | consumed samples: 7718400 | consumed tokens: 15807283200 | elapsed time per iteration (s): 0.74 | learning rate: 1.881E-04 | global batch size: 256 | lm loss: 2.154264E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.265 | TFLOPs: 20.83 | 31: iteration 30160/ 173500 | consumed samples: 7720960 | consumed tokens: 15812526080 | elapsed time per iteration (s): 0.80 | learning rate: 1.881E-04 | global batch size: 256 | lm loss: 2.157937E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 321.643 | TFLOPs: 19.46 | 31: iteration 30170/ 173500 | consumed samples: 7723520 | consumed tokens: 15817768960 | elapsed time per iteration (s): 0.80 | learning rate: 1.881E-04 | global batch size: 256 | lm loss: 2.143022E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.717 | TFLOPs: 19.46 | 31: iteration 30180/ 173500 | consumed samples: 7726080 | consumed tokens: 15823011840 | elapsed time per iteration (s): 0.80 | learning rate: 1.881E-04 | global batch size: 256 | lm loss: 2.150163E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.065 | TFLOPs: 19.30 | 31: iteration 30190/ 173500 | consumed samples: 7728640 | consumed tokens: 15828254720 | elapsed time per iteration (s): 0.82 | learning rate: 1.881E-04 | global batch size: 256 | lm loss: 2.134092E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.503 | TFLOPs: 18.91 | 31: iteration 30200/ 173500 | consumed samples: 7731200 | consumed tokens: 15833497600 | elapsed time per iteration (s): 0.77 | learning rate: 1.881E-04 | global batch size: 256 | lm loss: 2.152759E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.928 | TFLOPs: 20.08 | 31: iteration 30210/ 173500 | consumed samples: 7733760 | consumed tokens: 15838740480 | elapsed time per iteration (s): 0.90 | learning rate: 1.881E-04 | global batch size: 256 | lm loss: 2.121407E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.376 | TFLOPs: 17.20 | 31: iteration 30220/ 173500 | consumed samples: 7736320 | consumed tokens: 15843983360 | elapsed time per iteration (s): 0.85 | learning rate: 1.881E-04 | global batch size: 256 | lm loss: 2.148708E+00 | grad norm: 
0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.099 | TFLOPs: 18.28 | 31: iteration 30230/ 173500 | consumed samples: 7738880 | consumed tokens: 15849226240 | elapsed time per iteration (s): 0.84 | learning rate: 1.881E-04 | global batch size: 256 | lm loss: 2.141068E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.863 | TFLOPs: 18.50 | 31: iteration 30240/ 173500 | consumed samples: 7741440 | consumed tokens: 15854469120 | elapsed time per iteration (s): 0.84 | learning rate: 1.880E-04 | global batch size: 256 | lm loss: 2.114943E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.764 | TFLOPs: 18.38 | 31: iteration 30250/ 173500 | consumed samples: 7744000 | consumed tokens: 15859712000 | elapsed time per iteration (s): 0.78 | learning rate: 1.880E-04 | global batch size: 256 | lm loss: 2.170309E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.774 | TFLOPs: 19.77 | 31: iteration 30260/ 173500 | consumed samples: 7746560 | consumed tokens: 15864954880 | elapsed time per iteration (s): 0.84 | learning rate: 1.880E-04 | global batch size: 256 | lm loss: 2.163444E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.861 | TFLOPs: 18.50 | 31: iteration 30270/ 173500 | consumed samples: 7749120 | consumed tokens: 15870197760 | elapsed time per iteration (s): 0.81 | learning rate: 1.880E-04 | global batch size: 256 | lm loss: 2.144828E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.892 | TFLOPs: 19.05 | 31: iteration 30280/ 173500 | consumed samples: 7751680 | consumed tokens: 15875440640 | elapsed time per 
iteration (s): 0.78 | learning rate: 1.880E-04 | global batch size: 256 | lm loss: 2.161476E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.563 | TFLOPs: 19.88 | 31: iteration 30290/ 173500 | consumed samples: 7754240 | consumed tokens: 15880683520 | elapsed time per iteration (s): 0.80 | learning rate: 1.880E-04 | global batch size: 256 | lm loss: 2.164359E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.238 | TFLOPs: 19.25 | 31: iteration 30300/ 173500 | consumed samples: 7756800 | consumed tokens: 15885926400 | elapsed time per iteration (s): 0.79 | learning rate: 1.880E-04 | global batch size: 256 | lm loss: 2.118347E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.138 | TFLOPs: 19.55 | 31: iteration 30310/ 173500 | consumed samples: 7759360 | consumed tokens: 15891169280 | elapsed time per iteration (s): 0.76 | learning rate: 1.880E-04 | global batch size: 256 | lm loss: 2.120121E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.146 | TFLOPs: 20.34 | 31: iteration 30320/ 173500 | consumed samples: 7761920 | consumed tokens: 15896412160 | elapsed time per iteration (s): 0.75 | learning rate: 1.880E-04 | global batch size: 256 | lm loss: 2.159379E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.469 | TFLOPs: 20.54 | 31: iteration 30330/ 173500 | consumed samples: 7764480 | consumed tokens: 15901655040 | elapsed time per iteration (s): 1.01 | learning rate: 1.880E-04 | global batch size: 256 | lm loss: 2.133337E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 253.429 | TFLOPs: 15.33 | 31: 
iteration 30340/ 173500 | consumed samples: 7767040 | consumed tokens: 15906897920 | elapsed time per iteration (s): 0.76 | learning rate: 1.880E-04 | global batch size: 256 | lm loss: 2.143669E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.008 | TFLOPs: 20.51 | 31: iteration 30350/ 173500 | consumed samples: 7769600 | consumed tokens: 15912140800 | elapsed time per iteration (s): 0.78 | learning rate: 1.880E-04 | global batch size: 256 | lm loss: 2.120895E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.319 | TFLOPs: 19.74 | 31: iteration 30360/ 173500 | consumed samples: 7772160 | consumed tokens: 15917383680 | elapsed time per iteration (s): 0.77 | learning rate: 1.879E-04 | global batch size: 256 | lm loss: 2.156558E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.413 | TFLOPs: 20.17 | 31: iteration 30370/ 173500 | consumed samples: 7774720 | consumed tokens: 15922626560 | elapsed time per iteration (s): 0.80 | learning rate: 1.879E-04 | global batch size: 256 | lm loss: 2.138135E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.767 | TFLOPs: 19.35 | 31: iteration 30380/ 173500 | consumed samples: 7777280 | consumed tokens: 15927869440 | elapsed time per iteration (s): 0.83 | learning rate: 1.879E-04 | global batch size: 256 | lm loss: 2.162252E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.672 | TFLOPs: 18.67 | 31: iteration 30390/ 173500 | consumed samples: 7779840 | consumed tokens: 15933112320 | elapsed time per iteration (s): 0.81 | learning rate: 1.879E-04 | global batch size: 256 | lm loss: 2.162883E+00 | grad norm: 0.144 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.844 | TFLOPs: 19.23 | 31: iteration 30400/ 173500 | consumed samples: 7782400 | consumed tokens: 15938355200 | elapsed time per iteration (s): 0.75 | learning rate: 1.879E-04 | global batch size: 256 | lm loss: 2.148747E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.284 | TFLOPs: 20.59 | 31: iteration 30410/ 173500 | consumed samples: 7784960 | consumed tokens: 15943598080 | elapsed time per iteration (s): 1.33 | learning rate: 1.879E-04 | global batch size: 256 | lm loss: 2.122828E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 191.990 | TFLOPs: 11.61 | 31: iteration 30420/ 173500 | consumed samples: 7787520 | consumed tokens: 15948840960 | elapsed time per iteration (s): 0.76 | learning rate: 1.879E-04 | global batch size: 256 | lm loss: 2.129849E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.376 | TFLOPs: 20.35 | 31: iteration 30430/ 173500 | consumed samples: 7790080 | consumed tokens: 15954083840 | elapsed time per iteration (s): 0.73 | learning rate: 1.879E-04 | global batch size: 256 | lm loss: 2.124125E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.193 | TFLOPs: 21.19 | 31: iteration 30440/ 173500 | consumed samples: 7792640 | consumed tokens: 15959326720 | elapsed time per iteration (s): 0.77 | learning rate: 1.879E-04 | global batch size: 256 | lm loss: 2.157702E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.057 | TFLOPs: 20.03 | 31: iteration 30450/ 173500 | consumed samples: 7795200 | consumed tokens: 15964569600 | elapsed time per iteration (s): 0.78 | learning rate: 
1.879E-04 | global batch size: 256 | lm loss: 2.121269E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.666 | TFLOPs: 19.88 | 31: iteration 30460/ 173500 | consumed samples: 7797760 | consumed tokens: 15969812480 | elapsed time per iteration (s): 0.76 | learning rate: 1.879E-04 | global batch size: 256 | lm loss: 2.118116E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.663 | TFLOPs: 20.25 | 31: iteration 30470/ 173500 | consumed samples: 7800320 | consumed tokens: 15975055360 | elapsed time per iteration (s): 0.79 | learning rate: 1.879E-04 | global batch size: 256 | lm loss: 2.149091E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.466 | TFLOPs: 19.57 | 31: iteration 30480/ 173500 | consumed samples: 7802880 | consumed tokens: 15980298240 | elapsed time per iteration (s): 0.75 | learning rate: 1.878E-04 | global batch size: 256 | lm loss: 2.132664E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.567 | TFLOPs: 20.66 | 31: iteration 30490/ 173500 | consumed samples: 7805440 | consumed tokens: 15985541120 | elapsed time per iteration (s): 0.75 | learning rate: 1.878E-04 | global batch size: 256 | lm loss: 2.147464E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.908 | TFLOPs: 20.62 | 31: iteration 30500/ 173500 | consumed samples: 7808000 | consumed tokens: 15990784000 | elapsed time per iteration (s): 0.87 | learning rate: 1.878E-04 | global batch size: 256 | lm loss: 2.121321E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.329 | TFLOPs: 17.87 | 31: iteration 30510/ 173500 | consumed 
samples: 7810560 | consumed tokens: 15996026880 | elapsed time per iteration (s): 0.80 | learning rate: 1.878E-04 | global batch size: 256 | lm loss: 2.144808E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.432 | TFLOPs: 19.26 | 31: iteration 30520/ 173500 | consumed samples: 7813120 | consumed tokens: 16001269760 | elapsed time per iteration (s): 0.84 | learning rate: 1.878E-04 | global batch size: 256 | lm loss: 2.137157E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.780 | TFLOPs: 18.44 | 31: iteration 30530/ 173500 | consumed samples: 7815680 | consumed tokens: 16006512640 | elapsed time per iteration (s): 0.94 | learning rate: 1.878E-04 | global batch size: 256 | lm loss: 2.126934E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 271.692 | TFLOPs: 16.44 | 31: iteration 30540/ 173500 | consumed samples: 7818240 | consumed tokens: 16011755520 | elapsed time per iteration (s): 0.85 | learning rate: 1.878E-04 | global batch size: 256 | lm loss: 2.107206E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.658 | TFLOPs: 18.13 | 31: iteration 30550/ 173500 | consumed samples: 7820800 | consumed tokens: 16016998400 | elapsed time per iteration (s): 0.83 | learning rate: 1.878E-04 | global batch size: 256 | lm loss: 2.153195E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.035 | TFLOPs: 18.57 | 31: iteration 30560/ 173500 | consumed samples: 7823360 | consumed tokens: 16022241280 | elapsed time per iteration (s): 0.77 | learning rate: 1.878E-04 | global batch size: 256 | lm loss: 2.154068E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 333.660 | TFLOPs: 20.19 | 31: iteration 30570/ 173500 | consumed samples: 7825920 | consumed tokens: 16027484160 | elapsed time per iteration (s): 0.79 | learning rate: 1.878E-04 | global batch size: 256 | lm loss: 2.171154E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.031 | TFLOPs: 19.54 | 31: iteration 30580/ 173500 | consumed samples: 7828480 | consumed tokens: 16032727040 | elapsed time per iteration (s): 0.75 | learning rate: 1.878E-04 | global batch size: 256 | lm loss: 2.100459E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.168 | TFLOPs: 20.52 | 31: iteration 30590/ 173500 | consumed samples: 7831040 | consumed tokens: 16037969920 | elapsed time per iteration (s): 0.79 | learning rate: 1.878E-04 | global batch size: 256 | lm loss: 2.145551E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.338 | TFLOPs: 19.68 | 31: iteration 30600/ 173500 | consumed samples: 7833600 | consumed tokens: 16043212800 | elapsed time per iteration (s): 0.77 | learning rate: 1.877E-04 | global batch size: 256 | lm loss: 2.143290E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.152 | TFLOPs: 20.22 | 31: iteration 30610/ 173500 | consumed samples: 7836160 | consumed tokens: 16048455680 | elapsed time per iteration (s): 0.77 | learning rate: 1.877E-04 | global batch size: 256 | lm loss: 2.134413E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.985 | TFLOPs: 20.02 | 31: iteration 30620/ 173500 | consumed samples: 7838720 | consumed tokens: 16053698560 | elapsed time per iteration (s): 0.88 | learning rate: 1.877E-04 | global batch size: 256 | lm 
loss: 2.138912E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.923 | TFLOPs: 17.66 | 31: iteration 30630/ 173500 | consumed samples: 7841280 | consumed tokens: 16058941440 | elapsed time per iteration (s): 0.73 | learning rate: 1.877E-04 | global batch size: 256 | lm loss: 2.118856E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.835 | TFLOPs: 21.16 | 31: iteration 30640/ 173500 | consumed samples: 7843840 | consumed tokens: 16064184320 | elapsed time per iteration (s): 0.72 | learning rate: 1.877E-04 | global batch size: 256 | lm loss: 2.112468E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.044 | TFLOPs: 21.42 | 31: iteration 30650/ 173500 | consumed samples: 7846400 | consumed tokens: 16069427200 | elapsed time per iteration (s): 0.74 | learning rate: 1.877E-04 | global batch size: 256 | lm loss: 2.143732E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.935 | TFLOPs: 20.87 | 31: iteration 30660/ 173500 | consumed samples: 7848960 | consumed tokens: 16074670080 | elapsed time per iteration (s): 0.76 | learning rate: 1.877E-04 | global batch size: 256 | lm loss: 2.135904E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.412 | TFLOPs: 20.47 | 31: iteration 30670/ 173500 | consumed samples: 7851520 | consumed tokens: 16079912960 | elapsed time per iteration (s): 0.83 | learning rate: 1.877E-04 | global batch size: 256 | lm loss: 2.147988E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.000 | TFLOPs: 18.75 | 31: iteration 30680/ 173500 | consumed samples: 7854080 | consumed tokens: 
16085155840 | elapsed time per iteration (s): 0.79 | learning rate: 1.877E-04 | global batch size: 256 | lm loss: 2.180680E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.877 | TFLOPs: 19.53 | 31: iteration 30690/ 173500 | consumed samples: 7856640 | consumed tokens: 16090398720 | elapsed time per iteration (s): 0.78 | learning rate: 1.877E-04 | global batch size: 256 | lm loss: 2.157595E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.898 | TFLOPs: 19.78 | 31: iteration 30700/ 173500 | consumed samples: 7859200 | consumed tokens: 16095641600 | elapsed time per iteration (s): 0.75 | learning rate: 1.877E-04 | global batch size: 256 | lm loss: 2.150612E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.597 | TFLOPs: 20.67 | 31: iteration 30710/ 173500 | consumed samples: 7861760 | consumed tokens: 16100884480 | elapsed time per iteration (s): 0.78 | learning rate: 1.877E-04 | global batch size: 256 | lm loss: 2.133205E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.493 | TFLOPs: 19.81 | 31: iteration 30720/ 173500 | consumed samples: 7864320 | consumed tokens: 16106127360 | elapsed time per iteration (s): 0.80 | learning rate: 1.876E-04 | global batch size: 256 | lm loss: 2.130033E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.086 | TFLOPs: 19.36 | 31: iteration 30730/ 173500 | consumed samples: 7866880 | consumed tokens: 16111370240 | elapsed time per iteration (s): 0.84 | learning rate: 1.876E-04 | global batch size: 256 | lm loss: 2.136431E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
304.676 | TFLOPs: 18.43 | 31: iteration 30740/ 173500 | consumed samples: 7869440 | consumed tokens: 16116613120 | elapsed time per iteration (s): 0.75 | learning rate: 1.876E-04 | global batch size: 256 | lm loss: 2.119904E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.512 | TFLOPs: 20.54 | 31: iteration 30750/ 173500 | consumed samples: 7872000 | consumed tokens: 16121856000 | elapsed time per iteration (s): 0.75 | learning rate: 1.876E-04 | global batch size: 256 | lm loss: 2.132059E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.419 | TFLOPs: 20.59 | 31: iteration 30760/ 173500 | consumed samples: 7874560 | consumed tokens: 16127098880 | elapsed time per iteration (s): 0.74 | learning rate: 1.876E-04 | global batch size: 256 | lm loss: 2.128246E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.870 | TFLOPs: 20.86 | 31: iteration 30770/ 173500 | consumed samples: 7877120 | consumed tokens: 16132341760 | elapsed time per iteration (s): 0.78 | learning rate: 1.876E-04 | global batch size: 256 | lm loss: 2.113267E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.252 | TFLOPs: 19.74 | 31: iteration 30780/ 173500 | consumed samples: 7879680 | consumed tokens: 16137584640 | elapsed time per iteration (s): 0.77 | learning rate: 1.876E-04 | global batch size: 256 | lm loss: 2.140705E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.033 | TFLOPs: 20.21 | 31: iteration 30790/ 173500 | consumed samples: 7882240 | consumed tokens: 16142827520 | elapsed time per iteration (s): 0.81 | learning rate: 1.876E-04 | global batch size: 256 | lm loss: 2.093614E+00 | grad norm: 0.135 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.490 | TFLOPs: 19.15 | 31: iteration 30800/ 173500 | consumed samples: 7884800 | consumed tokens: 16148070400 | elapsed time per iteration (s): 0.72 | learning rate: 1.876E-04 | global batch size: 256 | lm loss: 2.139608E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.238 | TFLOPs: 21.61 | 31: iteration 30810/ 173500 | consumed samples: 7887360 | consumed tokens: 16153313280 | elapsed time per iteration (s): 0.78 | learning rate: 1.876E-04 | global batch size: 256 | lm loss: 2.117333E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.901 | TFLOPs: 19.78 | 31: iteration 30820/ 173500 | consumed samples: 7889920 | consumed tokens: 16158556160 | elapsed time per iteration (s): 0.84 | learning rate: 1.876E-04 | global batch size: 256 | lm loss: 2.131594E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.409 | TFLOPs: 18.36 | 31: iteration 30830/ 173500 | consumed samples: 7892480 | consumed tokens: 16163799040 | elapsed time per iteration (s): 0.81 | learning rate: 1.876E-04 | global batch size: 256 | lm loss: 2.118800E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.357 | TFLOPs: 19.14 | 31: iteration 30840/ 173500 | consumed samples: 7895040 | consumed tokens: 16169041920 | elapsed time per iteration (s): 0.79 | learning rate: 1.875E-04 | global batch size: 256 | lm loss: 2.150584E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.990 | TFLOPs: 19.54 | 31: iteration 30850/ 173500 | consumed samples: 7897600 | consumed tokens: 16174284800 | elapsed time per iteration (s): 
0.83 | learning rate: 1.875E-04 | global batch size: 256 | lm loss: 2.139608E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.075 | TFLOPs: 18.64 | 31: iteration 30860/ 173500 | consumed samples: 7900160 | consumed tokens: 16179527680 | elapsed time per iteration (s): 0.82 | learning rate: 1.875E-04 | global batch size: 256 | lm loss: 2.128239E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.276 | TFLOPs: 18.89 | 31: iteration 30870/ 173500 | consumed samples: 7902720 | consumed tokens: 16184770560 | elapsed time per iteration (s): 0.95 | learning rate: 1.875E-04 | global batch size: 256 | lm loss: 2.151003E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 270.394 | TFLOPs: 16.36 | 31: iteration 30880/ 173500 | consumed samples: 7905280 | consumed tokens: 16190013440 | elapsed time per iteration (s): 0.75 | learning rate: 1.875E-04 | global batch size: 256 | lm loss: 2.116739E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.045 | TFLOPs: 20.57 | 31: iteration 30890/ 173500 | consumed samples: 7907840 | consumed tokens: 16195256320 | elapsed time per iteration (s): 0.76 | learning rate: 1.875E-04 | global batch size: 256 | lm loss: 2.135718E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.846 | TFLOPs: 20.38 | 31: iteration 30900/ 173500 | consumed samples: 7910400 | consumed tokens: 16200499200 | elapsed time per iteration (s): 0.79 | learning rate: 1.875E-04 | global batch size: 256 | lm loss: 2.118300E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.988 | TFLOPs: 19.66 | 31: iteration 30910/ 
173500 | consumed samples: 7912960 | consumed tokens: 16205742080 | elapsed time per iteration (s): 0.80 | learning rate: 1.875E-04 | global batch size: 256 | lm loss: 2.118682E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.017 | TFLOPs: 19.24 | 31: iteration 30920/ 173500 | consumed samples: 7915520 | consumed tokens: 16210984960 | elapsed time per iteration (s): 0.82 | learning rate: 1.875E-04 | global batch size: 256 | lm loss: 2.143316E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.875 | TFLOPs: 18.99 | 31: iteration 30930/ 173500 | consumed samples: 7918080 | consumed tokens: 16216227840 | elapsed time per iteration (s): 0.77 | learning rate: 1.875E-04 | global batch size: 256 | lm loss: 2.127253E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.906 | TFLOPs: 20.02 | 31: iteration 30940/ 173500 | consumed samples: 7920640 | consumed tokens: 16221470720 | elapsed time per iteration (s): 1.51 | learning rate: 1.875E-04 | global batch size: 256 | lm loss: 2.139207E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 169.492 | TFLOPs: 10.25 | 31: iteration 30950/ 173500 | consumed samples: 7923200 | consumed tokens: 16226713600 | elapsed time per iteration (s): 0.84 | learning rate: 1.875E-04 | global batch size: 256 | lm loss: 2.115063E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.546 | TFLOPs: 18.36 | 31: iteration 30960/ 173500 | consumed samples: 7925760 | consumed tokens: 16231956480 | elapsed time per iteration (s): 0.74 | learning rate: 1.874E-04 | global batch size: 256 | lm loss: 2.149592E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 
0 | number of nan iterations: 0 | samples per second: 346.215 | TFLOPs: 20.95 | 31: iteration 30970/ 173500 | consumed samples: 7928320 | consumed tokens: 16237199360 | elapsed time per iteration (s): 0.73 | learning rate: 1.874E-04 | global batch size: 256 | lm loss: 2.141746E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.766 | TFLOPs: 21.10 | 31: iteration 30980/ 173500 | consumed samples: 7930880 | consumed tokens: 16242442240 | elapsed time per iteration (s): 0.74 | learning rate: 1.874E-04 | global batch size: 256 | lm loss: 2.098689E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.788 | TFLOPs: 20.98 | 31: iteration 30990/ 173500 | consumed samples: 7933440 | consumed tokens: 16247685120 | elapsed time per iteration (s): 0.78 | learning rate: 1.874E-04 | global batch size: 256 | lm loss: 2.129172E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.948 | TFLOPs: 19.84 | 31: iteration 31000/ 173500 | consumed samples: 7936000 | consumed tokens: 16252928000 | elapsed time per iteration (s): 0.81 | learning rate: 1.874E-04 | global batch size: 256 | lm loss: 2.137651E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.314 | TFLOPs: 19.02 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 31000 | lm loss value: 2.045087E+00 | lm loss PPL: 7.729832E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 31000 to checkpoints_1b1long 0: [2022-11-26 01:03:29,397] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step31000 is begin to save! 
0: [2022-11-26 01:03:29,408] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_01-model_00-model_states.pt... 0: [2022-11-26 01:03:29,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_01-model_00-model_states.pt. 0: [2022-11-26 01:03:29,631] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_03-model_00-model_states.pt... 0: [2022-11-26 01:03:29,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_03-model_00-model_states.pt. 0: [2022-11-26 01:03:29,713] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_04-model_00-model_states.pt... 0: [2022-11-26 01:03:29,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_04-model_00-model_states.pt. 0: [2022-11-26 01:03:29,788] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_05-model_00-model_states.pt... 0: [2022-11-26 01:03:29,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_05-model_00-model_states.pt. 0: [2022-11-26 01:03:29,862] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_06-model_00-model_states.pt... 0: [2022-11-26 01:03:29,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_06-model_00-model_states.pt. 0: [2022-11-26 01:03:29,937] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_07-model_00-model_states.pt... 0: [2022-11-26 01:03:30,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 01:03:30,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_08-model_00-model_states.pt... 0: [2022-11-26 01:03:30,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_08-model_00-model_states.pt. 0: [2022-11-26 01:03:30,085] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_09-model_00-model_states.pt... 0: [2022-11-26 01:03:30,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_09-model_00-model_states.pt. 0: [2022-11-26 01:03:30,159] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_10-model_00-model_states.pt... 0: [2022-11-26 01:03:30,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_10-model_00-model_states.pt. 0: [2022-11-26 01:03:30,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_11-model_00-model_states.pt... 0: [2022-11-26 01:03:30,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_11-model_00-model_states.pt. 0: [2022-11-26 01:03:30,315] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_12-model_00-model_states.pt... 0: [2022-11-26 01:03:30,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_12-model_00-model_states.pt. 0: [2022-11-26 01:03:30,387] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_13-model_00-model_states.pt... 0: [2022-11-26 01:03:30,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 01:03:30,464] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_14-model_00-model_states.pt... 0: [2022-11-26 01:03:30,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_14-model_00-model_states.pt. 0: [2022-11-26 01:03:30,540] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_15-model_00-model_states.pt... 0: [2022-11-26 01:03:30,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_15-model_00-model_states.pt. 0: [2022-11-26 01:03:30,614] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_16-model_00-model_states.pt... 0: [2022-11-26 01:03:30,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_16-model_00-model_states.pt. 0: [2022-11-26 01:03:30,692] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_17-model_00-model_states.pt... 0: [2022-11-26 01:03:30,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_17-model_00-model_states.pt. 0: [2022-11-26 01:03:30,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_18-model_00-model_states.pt... 0: [2022-11-26 01:03:30,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_18-model_00-model_states.pt. 0: [2022-11-26 01:03:30,843] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_19-model_00-model_states.pt... 0: [2022-11-26 01:03:30,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 01:03:30,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_20-model_00-model_states.pt... 0: [2022-11-26 01:03:30,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_20-model_00-model_states.pt. 0: [2022-11-26 01:03:30,992] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_21-model_00-model_states.pt... 0: [2022-11-26 01:03:31,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_21-model_00-model_states.pt. 0: [2022-11-26 01:03:31,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_22-model_00-model_states.pt... 0: [2022-11-26 01:03:31,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_22-model_00-model_states.pt. 0: [2022-11-26 01:03:31,142] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_23-model_00-model_states.pt... 0: [2022-11-26 01:03:31,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_23-model_00-model_states.pt. 0: [2022-11-26 01:03:31,217] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_24-model_00-model_states.pt... 0: [2022-11-26 01:03:31,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_24-model_00-model_states.pt. 0: [2022-11-26 01:03:31,290] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_25-model_00-model_states.pt... 0: [2022-11-26 01:03:31,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 01:03:31,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_26-model_00-model_states.pt... 0: [2022-11-26 01:03:31,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_26-model_00-model_states.pt. 0: [2022-11-26 01:03:31,438] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_27-model_00-model_states.pt... 0: [2022-11-26 01:03:31,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_27-model_00-model_states.pt. 0: [2022-11-26 01:03:31,515] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_28-model_00-model_states.pt... 0: [2022-11-26 01:03:31,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_28-model_00-model_states.pt. 0: [2022-11-26 01:03:31,588] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/layer_30-model_00-model_states.pt... 0: [2022-11-26 01:03:31,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/layer_30-model_00-model_states.pt. 0: [2022-11-26 01:03:31,592] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step31000/mp_rank_00_model_states.pt 0: [2022-11-26 01:03:31,592] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/mp_rank_00_model_states.pt... 0: [2022-11-26 01:03:31,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/mp_rank_00_model_states.pt. 0: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
7: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 
20: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 
26: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
5: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 
8: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 
16: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 
12: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 
25: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 
24: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 
17: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 
26: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 
6: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
9: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
2: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 
20: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
28: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 
22: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
19: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
10: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 
23: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 
30: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 
8: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 
30: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:03:31,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 
2: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:03:31,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:03:31,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:03:31,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 01:03:31,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 0: [2022-11-26 01:03:31,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:03:31,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:03:31,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 01:03:31,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 4: [2022-11-26 01:03:31,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:03:31,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 01:03:31,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
12: [2022-11-26 01:03:31,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:03:31,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 01:03:31,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 15: [2022-11-26 01:03:31,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:03:31,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 01:03:31,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 6: [2022-11-26 01:03:31,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:03:31,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 9: [2022-11-26 01:03:31,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:03:31,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 9: [2022-11-26 01:03:31,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 01:03:31,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
20: [2022-11-26 01:03:31,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:03:31,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 1: [2022-11-26 01:03:31,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:03:31,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 1: [2022-11-26 01:03:31,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 12: [2022-11-26 01:03:31,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:03:31,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 01:03:31,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 1: [2022-11-26 01:03:31,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 9: [2022-11-26 01:03:31,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:03:31,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 01:03:31,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
21: [2022-11-26 01:03:31,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:03:31,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 15: [2022-11-26 01:03:31,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:03:31,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 15: [2022-11-26 01:03:31,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 01:03:31,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 19: [2022-11-26 01:03:31,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:03:31,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 01:03:31,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 13: [2022-11-26 01:03:31,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:03:31,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
13: [2022-11-26 01:03:31,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 22: [2022-11-26 01:03:31,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 13: [2022-11-26 01:03:31,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 22: [2022-11-26 01:03:31,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 22: [2022-11-26 01:03:31,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:03:31,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 01:03:31,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 27: [2022-11-26 01:03:31,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:03:31,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 01:03:31,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 6: [2022-11-26 01:03:31,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 01:03:31,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 01:03:31,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 17: [2022-11-26 01:03:31,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:03:31,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 01:03:31,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 13: [2022-11-26 01:03:31,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:03:31,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 01:03:31,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 23: [2022-11-26 01:03:31,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:03:31,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 01:03:31,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 1: [2022-11-26 01:03:31,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
19: [2022-11-26 01:03:31,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:03:31,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 1: [2022-11-26 01:03:31,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 25: [2022-11-26 01:03:31,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:03:31,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 19: [2022-11-26 01:03:31,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 25: [2022-11-26 01:03:31,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 01:03:31,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 24: [2022-11-26 01:03:31,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:03:31,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 01:03:31,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 0: [2022-11-26 01:03:31,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
31: [2022-11-26 01:03:31,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:03:31,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 31: [2022-11-26 01:03:31,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:03:31,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 0: [2022-11-26 01:03:31,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 31: [2022-11-26 01:03:31,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 31: [2022-11-26 01:03:31,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 4: [2022-11-26 01:03:31,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:03:31,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 4: [2022-11-26 01:03:31,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 01:03:31,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 17: [2022-11-26 01:03:31,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-26 01:03:31,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 01:03:31,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 13: [2022-11-26 01:03:31,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:03:31,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 01:03:31,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 19: [2022-11-26 01:03:31,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:03:31,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 01:03:31,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 9: [2022-11-26 01:03:31,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:03:31,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 01:03:31,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 8: [2022-11-26 01:03:31,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
7: [2022-11-26 01:03:31,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:03:31,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 7: [2022-11-26 01:03:31,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 8: [2022-11-26 01:03:31,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 14: [2022-11-26 01:03:31,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:03:31,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 14: [2022-11-26 01:03:31,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 01:03:31,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 1: [2022-11-26 01:03:31,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:03:31,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 23: [2022-11-26 01:03:31,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-26 01:03:31,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 1: [2022-11-26 01:03:31,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 23: [2022-11-26 01:03:31,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 25: [2022-11-26 01:03:31,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:03:31,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 01:03:31,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 31: [2022-11-26 01:03:31,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:03:31,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 01:03:31,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 21: [2022-11-26 01:03:31,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:03:31,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 01:03:31,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
15: [2022-11-26 01:03:31,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:03:31,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 01:03:31,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 7: [2022-11-26 01:03:31,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:03:31,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 01:03:31,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 18: [2022-11-26 01:03:31,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:03:31,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 01:03:31,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 22: [2022-11-26 01:03:31,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:03:31,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 01:03:31,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
24: [2022-11-26 01:03:31,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:03:31,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 27: [2022-11-26 01:03:31,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:03:31,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 27: [2022-11-26 01:03:31,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 01:03:31,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 14: [2022-11-26 01:03:31,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:03:31,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 01:03:31,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 17: [2022-11-26 01:03:31,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:03:31,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 01:03:31,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
27: [2022-11-26 01:03:31,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:03:31,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 01:03:31,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 7: [2022-11-26 01:03:31,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:03:31,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:03:31,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 01:03:31,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 0: [2022-11-26 01:03:31,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 01:03:31,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 9: [2022-11-26 01:03:31,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:03:31,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
9: [2022-11-26 01:03:31,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 01:03:31,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 12: [2022-11-26 01:03:31,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 01:03:31,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 6: [2022-11-26 01:03:31,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:03:31,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:03:31,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 6: [2022-11-26 01:03:31,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 01:03:31,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 4: [2022-11-26 01:03:31,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 6: [2022-11-26 01:03:31,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 01:03:31,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 01:03:31,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 20: [2022-11-26 01:03:31,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:03:31,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:03:31,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 20: [2022-11-26 01:03:31,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 4: [2022-11-26 01:03:31,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 20: [2022-11-26 01:03:31,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 24: [2022-11-26 01:03:31,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:03:31,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 01:03:31,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 20: [2022-11-26 01:03:31,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-26 01:03:31,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 17: [2022-11-26 01:03:31,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:03:31,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 14: [2022-11-26 01:03:31,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:03:31,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 14: [2022-11-26 01:03:31,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 17: [2022-11-26 01:03:31,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 14: [2022-11-26 01:03:31,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 31: [2022-11-26 01:03:31,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:03:31,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:03:31,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 01:03:31,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
15: [2022-11-26 01:03:31,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 21: [2022-11-26 01:03:31,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:03:31,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 21: [2022-11-26 01:03:31,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 01:03:31,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 19: [2022-11-26 01:03:31,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:03:31,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 01:03:31,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 13: [2022-11-26 01:03:31,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:03:31,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 01:03:31,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 29: [2022-11-26 01:03:31,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-26 01:03:31,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:03:31,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 01:03:31,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 01:03:31,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 29: [2022-11-26 01:03:31,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 21: [2022-11-26 01:03:31,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:03:31,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 01:03:31,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 12: [2022-11-26 01:03:31,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:03:31,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 27: [2022-11-26 01:03:31,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-26 01:03:31,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 12: [2022-11-26 01:03:31,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 27: [2022-11-26 01:03:31,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 22: [2022-11-26 01:03:31,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:03:31,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 01:03:31,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 14: [2022-11-26 01:03:31,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:03:31,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 01:03:31,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 3: [2022-11-26 01:03:31,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:03:31,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 01:03:31,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
18: [2022-11-26 01:03:31,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:03:31,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:03:31,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:03:31,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 7: [2022-11-26 01:03:31,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 3: [2022-11-26 01:03:31,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 18: [2022-11-26 01:03:31,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 7: [2022-11-26 01:03:31,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 3: [2022-11-26 01:03:31,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 25: [2022-11-26 01:03:31,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:03:31,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 24: [2022-11-26 01:03:31,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
25: [2022-11-26 01:03:31,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 24: [2022-11-26 01:03:31,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 01:03:31,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 18: [2022-11-26 01:03:31,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:03:31,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:03:31,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 5: [2022-11-26 01:03:31,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:03:31,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:03:31,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:03:31,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:03:31,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 18: [2022-11-26 01:03:31,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
5: [2022-11-26 01:03:31,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 5: [2022-11-26 01:03:31,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 01:03:31,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 18: [2022-11-26 01:03:31,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 5: [2022-11-26 01:03:31,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 18: [2022-11-26 01:03:31,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 5: [2022-11-26 01:03:31,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 5: [2022-11-26 01:03:31,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 5: [2022-11-26 01:03:31,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 8: [2022-11-26 01:03:31,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:03:31,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 1: [2022-11-26 01:03:31,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
8: [2022-11-26 01:03:31,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 1: [2022-11-26 01:03:31,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 01:03:31,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 25: [2022-11-26 01:03:31,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:03:31,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:03:31,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 01:03:31,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 10: [2022-11-26 01:03:31,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:03:31,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 01:03:31,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:03:31,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
10: [2022-11-26 01:03:31,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 01:03:31,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 01:03:31,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 10: [2022-11-26 01:03:31,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 0: [2022-11-26 01:03:31,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:03:31,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 01:03:31,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 23: [2022-11-26 01:03:31,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:03:31,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 01:03:31,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 8: [2022-11-26 01:03:31,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 01:03:31,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 01:03:31,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 3: [2022-11-26 01:03:31,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:03:31,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 01:03:31,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 29: [2022-11-26 01:03:31,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:03:31,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 4: [2022-11-26 01:03:31,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:03:31,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 29: [2022-11-26 01:03:31,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 4: [2022-11-26 01:03:31,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 12: [2022-11-26 01:03:31,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 01:03:31,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 01:03:31,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 6: [2022-11-26 01:03:31,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:03:31,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 01:03:31,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 0: [2022-11-26 01:03:31,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 01:03:31,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 2: [2022-11-26 01:03:31,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:03:31,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:03:31,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:03:31,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 01:03:31,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 01:03:31,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 01:03:31,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 2: [2022-11-26 01:03:31,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 2: [2022-11-26 01:03:31,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 01:03:31,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 01:03:31,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 2: [2022-11-26 01:03:31,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 23: [2022-11-26 01:03:31,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:03:31,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 19: [2022-11-26 01:03:31,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:03:31,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
19: [2022-11-26 01:03:31,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 01:03:31,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 18: [2022-11-26 01:03:31,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:03:31,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 26: [2022-11-26 01:03:31,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:03:31,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:03:31,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:03:31,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
26: [2022-11-26 01:03:31,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 01:03:31,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 01:03:31,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 01:03:31,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 26: [2022-11-26 01:03:31,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 26: [2022-11-26 01:03:31,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 20: [2022-11-26 01:03:31,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:03:31,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 01:03:31,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 20: [2022-11-26 01:03:31,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:03:31,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 01:03:31,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
11: [2022-11-26 01:03:31,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:03:31,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:03:31,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:03:31,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 01:03:31,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 01:03:31,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 01:03:31,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 11: [2022-11-26 01:03:31,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 11: [2022-11-26 01:03:31,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 30: [2022-11-26 01:03:31,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:03:31,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:03:31,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
30: [2022-11-26 01:03:31,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:03:31,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 01:03:31,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 01:03:31,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 01:03:31,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 01:03:31,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 30: [2022-11-26 01:03:31,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 30: [2022-11-26 01:03:31,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 30: [2022-11-26 01:03:31,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 17: [2022-11-26 01:03:31,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:03:31,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 01:03:31,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
27: [2022-11-26 01:03:31,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:03:31,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 01:03:31,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 28: [2022-11-26 01:03:31,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:03:31,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:03:31,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:03:31,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 
28: [2022-11-26 01:03:31,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 01:03:31,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 01:03:31,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 01:03:31,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 01:03:31,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 28: [2022-11-26 01:03:31,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 28: [2022-11-26 01:03:31,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 28: [2022-11-26 01:03:31,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 16: [2022-11-26 01:03:31,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:03:31,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:03:31,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:03:31,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
16: [2022-11-26 01:03:31,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:03:31,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 01:03:31,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 01:03:31,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 01:03:31,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 01:03:31,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 16: [2022-11-26 01:03:31,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 16: [2022-11-26 01:03:31,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 01:03:31,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 16: [2022-11-26 01:03:31,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 16: [2022-11-26 01:03:31,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 8: [2022-11-26 01:03:31,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 01:03:31,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 01:03:31,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 15: [2022-11-26 01:03:31,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:03:31,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 01:03:31,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 14: [2022-11-26 01:03:31,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:03:31,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 01:03:31,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 2: [2022-11-26 01:03:31,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:03:31,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 01:03:31,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 0: [2022-11-26 01:03:31,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
28: [2022-11-26 01:03:31,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:03:31,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 01:03:31,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 31: [2022-11-26 01:03:31,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:03:31,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 01:03:31,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 5: [2022-11-26 01:03:31,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:03:31,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 01:03:31,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 28: [2022-11-26 01:03:31,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 01:03:31,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 22: [2022-11-26 01:03:31,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
22: [2022-11-26 01:03:31,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 01:03:31,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 21: [2022-11-26 01:03:31,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:03:31,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 01:03:31,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 30: [2022-11-26 01:03:31,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:03:31,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 01:03:31,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 7: [2022-11-26 01:03:31,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:03:31,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:03:31,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 01:03:31,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
9: [2022-11-26 01:03:31,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:03:31,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 9: [2022-11-26 01:03:31,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 23: [2022-11-26 01:03:31,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 9: [2022-11-26 01:03:31,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 13: [2022-11-26 01:03:31,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:03:31,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:03:31,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 01:03:31,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 25: [2022-11-26 01:03:31,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:03:31,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 01:03:31,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 01:03:31,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 25: [2022-11-26 01:03:31,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 01:03:31,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 1: [2022-11-26 01:03:31,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:03:31,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 13: [2022-11-26 01:03:31,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 1: [2022-11-26 01:03:31,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 13: [2022-11-26 01:03:31,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 29: [2022-11-26 01:03:31,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:03:31,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 01:03:31,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
11: [2022-11-26 01:03:31,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:03:31,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 01:03:31,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 20: [2022-11-26 01:03:31,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:03:31,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 01:03:31,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 10: [2022-11-26 01:03:31,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:03:31,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 01:03:31,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 26: [2022-11-26 01:03:31,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:03:31,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 01:03:31,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
4: [2022-11-26 01:03:31,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:03:31,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 01:03:31,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 27: [2022-11-26 01:03:31,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:03:31,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 01:03:31,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 12: [2022-11-26 01:03:31,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:03:31,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 01:03:31,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 19: [2022-11-26 01:03:31,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:03:31,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 01:03:31,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
18: [2022-11-26 01:03:31,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:03:31,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 01:03:31,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 15: [2022-11-26 01:03:31,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:03:31,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 01:03:31,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 16: [2022-11-26 01:03:31,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:03:31,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 01:03:31,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 17: [2022-11-26 01:03:31,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:03:31,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 01:03:31,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
8: [2022-11-26 01:03:31,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:03:31,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:03:31,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:03:31,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 2: [2022-11-26 01:03:31,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 8: [2022-11-26 01:03:31,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 2: [2022-11-26 01:03:31,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 0: [2022-11-26 01:03:31,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 01:03:31,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 21: [2022-11-26 01:03:31,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:03:31,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 01:03:31,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
31: [2022-11-26 01:03:31,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:03:31,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 01:03:31,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 6: [2022-11-26 01:03:31,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:03:31,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 01:03:31,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 28: [2022-11-26 01:03:31,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:03:31,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 01:03:31,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 14: [2022-11-26 01:03:31,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:03:31,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 01:03:31,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
7: [2022-11-26 01:03:31,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:03:31,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 01:03:31,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 13: [2022-11-26 01:03:31,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:03:31,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 01:03:31,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 5: [2022-11-26 01:03:31,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:03:31,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 01:03:31,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 30: [2022-11-26 01:03:31,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:03:31,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 01:03:31,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
9: [2022-11-26 01:03:31,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:03:31,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 01:03:31,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 24: [2022-11-26 01:03:31,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:03:31,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 01:03:31,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 6: [2022-11-26 01:03:31,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:03:31,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 01:03:31,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 22: [2022-11-26 01:03:31,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:03:31,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
23: [2022-11-26 01:03:31,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 01:03:31,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 22: [2022-11-26 01:03:31,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 01:03:31,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 1: [2022-11-26 01:03:31,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:03:31,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 25: [2022-11-26 01:03:31,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:03:31,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 25: [2022-11-26 01:03:31,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 11: [2022-11-26 01:03:31,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:03:31,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:03:31,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
11: [2022-11-26 01:03:31,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 29: [2022-11-26 01:03:31,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 11: [2022-11-26 01:03:31,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 29: [2022-11-26 01:03:31,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 20: [2022-11-26 01:03:31,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:03:31,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 01:03:31,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 26: [2022-11-26 01:03:31,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:03:31,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 01:03:31,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 3: [2022-11-26 01:03:31,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:03:31,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
3: [2022-11-26 01:03:31,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 01:03:31,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 4: [2022-11-26 01:03:31,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 01:03:31,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 10: [2022-11-26 01:03:31,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:03:31,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 01:03:31,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 15: [2022-11-26 01:03:31,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:03:31,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 27: [2022-11-26 01:03:31,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:03:31,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
27: [2022-11-26 01:03:31,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 01:03:31,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 16: [2022-11-26 01:03:31,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:03:31,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:03:31,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 01:03:31,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 18: [2022-11-26 01:03:31,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 19: [2022-11-26 01:03:31,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:03:31,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 19: [2022-11-26 01:03:31,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 01:03:31,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 5: [2022-11-26 01:03:31,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 01:03:31,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 01:03:31,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 12: [2022-11-26 01:03:31,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:03:31,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 01:03:31,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 2: [2022-11-26 01:03:31,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:03:31,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 01:03:31,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 14: [2022-11-26 01:03:31,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:03:31,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:03:31,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 01:03:31,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
13: [2022-11-26 01:03:31,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 0: [2022-11-26 01:03:31,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:03:31,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 31: [2022-11-26 01:03:31,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:03:31,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 31: [2022-11-26 01:03:31,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 0: [2022-11-26 01:03:31,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 31: [2022-11-26 01:03:31,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 21: [2022-11-26 01:03:31,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:03:31,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 01:03:31,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 7: [2022-11-26 01:03:31,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-26 01:03:31,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 01:03:31,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 28: [2022-11-26 01:03:31,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:03:31,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:03:31,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 01:03:31,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 28: [2022-11-26 01:03:31,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 01:03:31,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 8: [2022-11-26 01:03:31,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:03:31,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 01:03:31,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 17: [2022-11-26 01:03:31,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 
17: [2022-11-26 01:03:31,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 01:03:31,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 30: [2022-11-26 01:03:31,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:03:31,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 01:03:31,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 22: [2022-11-26 01:03:31,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:03:31,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 01:03:31,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 9: [2022-11-26 01:03:31,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:03:31,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 01:03:31,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 1: [2022-11-26 01:03:31,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 01:03:31,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 01:03:31,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 15: [2022-11-26 01:03:31,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:03:31,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 01:03:31,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 0: [2022-11-26 01:03:31,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:03:31,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 01:03:31,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 24: [2022-11-26 01:03:31,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:03:31,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
24: [2022-11-26 01:03:31,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 16: [2022-11-26 01:03:31,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 24: [2022-11-26 01:03:31,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 16: [2022-11-26 01:03:31,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 22: [2022-11-26 01:03:31,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:03:31,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 01:03:31,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 29: [2022-11-26 01:03:31,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:03:31,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 01:03:31,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 3: [2022-11-26 01:03:31,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 01:03:31,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 01:03:31,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 7: [2022-11-26 01:03:31,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:03:31,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 9: [2022-11-26 01:03:31,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:03:31,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 9: [2022-11-26 01:03:31,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 30: [2022-11-26 01:03:31,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:03:31,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 9: [2022-11-26 01:03:31,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 5: [2022-11-26 01:03:31,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:03:31,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
21: [2022-11-26 01:03:31,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:03:31,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 21: [2022-11-26 01:03:31,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 5: [2022-11-26 01:03:31,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 21: [2022-11-26 01:03:31,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 12: [2022-11-26 01:03:31,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:03:31,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 01:03:31,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 4: [2022-11-26 01:03:31,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:03:31,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 01:03:31,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 18: [2022-11-26 01:03:31,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
18: [2022-11-26 01:03:31,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 01:03:31,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 20: [2022-11-26 01:03:31,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:03:31,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 01:03:31,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 27: [2022-11-26 01:03:31,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:03:31,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 10: [2022-11-26 01:03:31,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:03:31,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 31: [2022-11-26 01:03:31,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:03:31,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
27: [2022-11-26 01:03:31,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 10: [2022-11-26 01:03:31,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 19: [2022-11-26 01:03:31,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 31: [2022-11-26 01:03:31,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 01:03:31,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 19: [2022-11-26 01:03:31,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 8: [2022-11-26 01:03:31,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:03:31,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 01:03:31,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 23: [2022-11-26 01:03:31,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:03:31,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 01:03:31,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
28: [2022-11-26 01:03:31,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:03:31,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 01:03:31,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 17: [2022-11-26 01:03:31,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:03:31,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 01:03:31,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 26: [2022-11-26 01:03:31,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:03:31,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 01:03:31,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 25: [2022-11-26 01:03:31,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:03:31,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 01:03:31,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
24: [2022-11-26 01:03:31,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:03:31,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 01:03:31,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 10: [2022-11-26 01:03:31,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:03:31,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 01:03:31,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:03:31,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 10: [2022-11-26 01:03:31,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 01:03:31,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 3: [2022-11-26 01:03:31,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:03:31,885] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 01:03:31,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
3: [2022-11-26 01:03:31,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:03:31,885] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 01:03:31,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 23: [2022-11-26 01:03:31,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:03:31,885] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 01:03:31,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 29: [2022-11-26 01:03:31,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:03:31,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 01:03:31,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 25: [2022-11-26 01:03:31,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:03:31,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 01:03:31,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
14: [2022-11-26 01:03:31,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:03:31,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 01:03:31,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 29: [2022-11-26 01:03:31,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:03:31,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 01:03:31,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 26: [2022-11-26 01:03:31,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:03:31,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 01:03:31,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:03:31,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 26: [2022-11-26 01:03:31,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 01:03:31,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
11: [2022-11-26 01:03:31,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:03:31,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 01:03:31,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 11: [2022-11-26 01:03:31,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:03:31,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 01:03:31,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 11: [2022-11-26 01:03:31,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:03:31,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 01:03:31,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 2: [2022-11-26 01:03:31,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:03:31,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 01:03:31,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
13: [2022-11-26 01:03:31,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:03:31,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step31000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 01:03:31,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 0: successfully saved checkpoint at iteration 31000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2550.29 31: iteration 31010/ 173500 | consumed samples: 7938560 | consumed tokens: 16258170880 | elapsed time per iteration (s): 1.08 | learning rate: 1.874E-04 | global batch size: 256 | lm loss: 2.145527E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.934 | TFLOPs: 14.39 | 31: iteration 31020/ 173500 | consumed samples: 7941120 | consumed tokens: 16263413760 | elapsed time per iteration (s): 0.81 | learning rate: 1.874E-04 | global batch size: 256 | lm loss: 2.184339E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.737 | TFLOPs: 19.16 | 31: iteration 31030/ 173500 | consumed samples: 7943680 | consumed tokens: 16268656640 | elapsed time per iteration (s): 0.74 | learning rate: 1.874E-04 | global batch size: 256 | lm loss: 2.129058E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.440 | TFLOPs: 20.84 | 31: iteration 31040/ 173500 | consumed samples: 7946240 | consumed tokens: 16273899520 | elapsed time per iteration (s): 0.77 | learning rate: 1.874E-04 | global batch size: 256 | lm loss: 2.194039E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.136 | TFLOPs: 
20.21 | 31: iteration 31050/ 173500 | consumed samples: 7948800 | consumed tokens: 16279142400 | elapsed time per iteration (s): 0.79 | learning rate: 1.874E-04 | global batch size: 256 | lm loss: 2.126947E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.245 | TFLOPs: 19.68 | 31: iteration 31060/ 173500 | consumed samples: 7951360 | consumed tokens: 16284385280 | elapsed time per iteration (s): 0.75 | learning rate: 1.874E-04 | global batch size: 256 | lm loss: 2.122393E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.621 | TFLOPs: 20.67 | 31: iteration 31070/ 173500 | consumed samples: 7953920 | consumed tokens: 16289628160 | elapsed time per iteration (s): 0.76 | learning rate: 1.874E-04 | global batch size: 256 | lm loss: 2.172457E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.941 | TFLOPs: 20.32 | 31: iteration 31080/ 173500 | consumed samples: 7956480 | consumed tokens: 16294871040 | elapsed time per iteration (s): 0.77 | learning rate: 1.873E-04 | global batch size: 256 | lm loss: 2.151953E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.170 | TFLOPs: 20.03 | 31: iteration 31090/ 173500 | consumed samples: 7959040 | consumed tokens: 16300113920 | elapsed time per iteration (s): 0.71 | learning rate: 1.873E-04 | global batch size: 256 | lm loss: 2.156647E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.379 | TFLOPs: 21.68 | 31: iteration 31100/ 173500 | consumed samples: 7961600 | consumed tokens: 16305356800 | elapsed time per iteration (s): 0.74 | learning rate: 1.873E-04 | global batch size: 256 | lm loss: 2.123875E+00 | grad norm: 0.141 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.541 | TFLOPs: 20.90 | 31: iteration 31110/ 173500 | consumed samples: 7964160 | consumed tokens: 16310599680 | elapsed time per iteration (s): 0.74 | learning rate: 1.873E-04 | global batch size: 256 | lm loss: 2.147681E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.051 | TFLOPs: 21.06 | 31: iteration 31120/ 173500 | consumed samples: 7966720 | consumed tokens: 16315842560 | elapsed time per iteration (s): 0.84 | learning rate: 1.873E-04 | global batch size: 256 | lm loss: 2.128434E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.043 | TFLOPs: 18.51 | 31: iteration 31130/ 173500 | consumed samples: 7969280 | consumed tokens: 16321085440 | elapsed time per iteration (s): 0.80 | learning rate: 1.873E-04 | global batch size: 256 | lm loss: 2.163268E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.880 | TFLOPs: 19.29 | 31: iteration 31140/ 173500 | consumed samples: 7971840 | consumed tokens: 16326328320 | elapsed time per iteration (s): 0.78 | learning rate: 1.873E-04 | global batch size: 256 | lm loss: 2.134454E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.038 | TFLOPs: 19.91 | 31: iteration 31150/ 173500 | consumed samples: 7974400 | consumed tokens: 16331571200 | elapsed time per iteration (s): 0.76 | learning rate: 1.873E-04 | global batch size: 256 | lm loss: 2.118771E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.292 | TFLOPs: 20.41 | 31: iteration 31160/ 173500 | consumed samples: 7976960 | consumed tokens: 16336814080 | elapsed time per iteration (s): 0.74 | learning 
rate: 1.873E-04 | global batch size: 256 | lm loss: 2.126503E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.683 | TFLOPs: 20.79 | 31: iteration 31170/ 173500 | consumed samples: 7979520 | consumed tokens: 16342056960 | elapsed time per iteration (s): 0.81 | learning rate: 1.873E-04 | global batch size: 256 | lm loss: 2.136530E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.466 | TFLOPs: 19.08 | 31: iteration 31180/ 173500 | consumed samples: 7982080 | consumed tokens: 16347299840 | elapsed time per iteration (s): 0.78 | learning rate: 1.873E-04 | global batch size: 256 | lm loss: 2.149996E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.870 | TFLOPs: 19.90 | 31: iteration 31190/ 173500 | consumed samples: 7984640 | consumed tokens: 16352542720 | elapsed time per iteration (s): 0.82 | learning rate: 1.873E-04 | global batch size: 256 | lm loss: 2.127032E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.481 | TFLOPs: 18.84 | 31: iteration 31200/ 173500 | consumed samples: 7987200 | consumed tokens: 16357785600 | elapsed time per iteration (s): 0.80 | learning rate: 1.872E-04 | global batch size: 256 | lm loss: 2.135233E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.501 | TFLOPs: 19.39 | 31: iteration 31210/ 173500 | consumed samples: 7989760 | consumed tokens: 16363028480 | elapsed time per iteration (s): 0.80 | learning rate: 1.872E-04 | global batch size: 256 | lm loss: 2.125799E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.756 | TFLOPs: 19.28 | 31: iteration 31220/ 173500 | 
consumed samples: 7992320 | consumed tokens: 16368271360 | elapsed time per iteration (s): 0.81 | learning rate: 1.872E-04 | global batch size: 256 | lm loss: 2.150610E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.033 | TFLOPs: 19.12 | 31: iteration 31230/ 173500 | consumed samples: 7994880 | consumed tokens: 16373514240 | elapsed time per iteration (s): 0.83 | learning rate: 1.872E-04 | global batch size: 256 | lm loss: 2.128198E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.738 | TFLOPs: 18.56 | 31: iteration 31240/ 173500 | consumed samples: 7997440 | consumed tokens: 16378757120 | elapsed time per iteration (s): 0.74 | learning rate: 1.872E-04 | global batch size: 256 | lm loss: 2.150091E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.864 | TFLOPs: 20.86 | 31: iteration 31250/ 173500 | consumed samples: 8000000 | consumed tokens: 16384000000 | elapsed time per iteration (s): 0.88 | learning rate: 1.872E-04 | global batch size: 256 | lm loss: 2.124998E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.606 | TFLOPs: 17.58 | 31: iteration 31260/ 173500 | consumed samples: 8002560 | consumed tokens: 16389242880 | elapsed time per iteration (s): 0.79 | learning rate: 1.872E-04 | global batch size: 256 | lm loss: 2.104691E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.821 | TFLOPs: 19.53 | 31: iteration 31270/ 173500 | consumed samples: 8005120 | consumed tokens: 16394485760 | elapsed time per iteration (s): 0.81 | learning rate: 1.872E-04 | global batch size: 256 | lm loss: 2.103296E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 314.906 | TFLOPs: 19.05 | 31: iteration 31280/ 173500 | consumed samples: 8007680 | consumed tokens: 16399728640 | elapsed time per iteration (s): 0.78 | learning rate: 1.872E-04 | global batch size: 256 | lm loss: 2.112773E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.999 | TFLOPs: 19.96 | 31: iteration 31290/ 173500 | consumed samples: 8010240 | consumed tokens: 16404971520 | elapsed time per iteration (s): 0.77 | learning rate: 1.872E-04 | global batch size: 256 | lm loss: 2.106816E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.830 | TFLOPs: 20.14 | 31: iteration 31300/ 173500 | consumed samples: 8012800 | consumed tokens: 16410214400 | elapsed time per iteration (s): 0.79 | learning rate: 1.872E-04 | global batch size: 256 | lm loss: 2.138559E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.999 | TFLOPs: 19.72 | 31: iteration 31310/ 173500 | consumed samples: 8015360 | consumed tokens: 16415457280 | elapsed time per iteration (s): 0.74 | learning rate: 1.872E-04 | global batch size: 256 | lm loss: 2.098922E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.207 | TFLOPs: 21.01 | 31: iteration 31320/ 173500 | consumed samples: 8017920 | consumed tokens: 16420700160 | elapsed time per iteration (s): 0.83 | learning rate: 1.871E-04 | global batch size: 256 | lm loss: 2.149614E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.116 | TFLOPs: 18.76 | 31: iteration 31330/ 173500 | consumed samples: 8020480 | consumed tokens: 16425943040 | elapsed time per iteration (s): 0.75 | learning rate: 1.871E-04 | global batch size: 
256 | lm loss: 2.115714E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.446 | TFLOPs: 20.72 | 31: iteration 31340/ 173500 | consumed samples: 8023040 | consumed tokens: 16431185920 | elapsed time per iteration (s): 0.81 | learning rate: 1.871E-04 | global batch size: 256 | lm loss: 2.155234E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.389 | TFLOPs: 19.14 | 31: iteration 31350/ 173500 | consumed samples: 8025600 | consumed tokens: 16436428800 | elapsed time per iteration (s): 0.75 | learning rate: 1.871E-04 | global batch size: 256 | lm loss: 2.118862E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.989 | TFLOPs: 20.69 | 31: iteration 31360/ 173500 | consumed samples: 8028160 | consumed tokens: 16441671680 | elapsed time per iteration (s): 0.78 | learning rate: 1.871E-04 | global batch size: 256 | lm loss: 2.155041E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.078 | TFLOPs: 19.91 | 31: iteration 31370/ 173500 | consumed samples: 8030720 | consumed tokens: 16446914560 | elapsed time per iteration (s): 0.75 | learning rate: 1.871E-04 | global batch size: 256 | lm loss: 2.116824E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.111 | TFLOPs: 20.64 | 31: iteration 31380/ 173500 | consumed samples: 8033280 | consumed tokens: 16452157440 | elapsed time per iteration (s): 0.81 | learning rate: 1.871E-04 | global batch size: 256 | lm loss: 2.115551E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.794 | TFLOPs: 19.10 | 31: iteration 31390/ 173500 | consumed samples: 8035840 | consumed 
tokens: 16457400320 | elapsed time per iteration (s): 0.76 | learning rate: 1.871E-04 | global batch size: 256 | lm loss: 2.151169E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.015 | TFLOPs: 20.27 | 31: iteration 31400/ 173500 | consumed samples: 8038400 | consumed tokens: 16462643200 | elapsed time per iteration (s): 0.89 | learning rate: 1.871E-04 | global batch size: 256 | lm loss: 2.126578E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.654 | TFLOPs: 17.34 | 31: iteration 31410/ 173500 | consumed samples: 8040960 | consumed tokens: 16467886080 | elapsed time per iteration (s): 0.81 | learning rate: 1.871E-04 | global batch size: 256 | lm loss: 2.132743E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.045 | TFLOPs: 19.06 | 31: iteration 31420/ 173500 | consumed samples: 8043520 | consumed tokens: 16473128960 | elapsed time per iteration (s): 0.74 | learning rate: 1.871E-04 | global batch size: 256 | lm loss: 2.114776E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.663 | TFLOPs: 20.85 | 31: iteration 31430/ 173500 | consumed samples: 8046080 | consumed tokens: 16478371840 | elapsed time per iteration (s): 0.77 | learning rate: 1.870E-04 | global batch size: 256 | lm loss: 2.151546E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.747 | TFLOPs: 20.07 | 31: iteration 31440/ 173500 | consumed samples: 8048640 | consumed tokens: 16483614720 | elapsed time per iteration (s): 0.76 | learning rate: 1.870E-04 | global batch size: 256 | lm loss: 2.137094E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 337.161 | TFLOPs: 20.40 | 31: iteration 31450/ 173500 | consumed samples: 8051200 | consumed tokens: 16488857600 | elapsed time per iteration (s): 0.74 | learning rate: 1.870E-04 | global batch size: 256 | lm loss: 2.112295E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.226 | TFLOPs: 20.82 | 31: iteration 31460/ 173500 | consumed samples: 8053760 | consumed tokens: 16494100480 | elapsed time per iteration (s): 0.74 | learning rate: 1.870E-04 | global batch size: 256 | lm loss: 2.158997E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.699 | TFLOPs: 20.79 | 31: iteration 31470/ 173500 | consumed samples: 8056320 | consumed tokens: 16499343360 | elapsed time per iteration (s): 0.77 | learning rate: 1.870E-04 | global batch size: 256 | lm loss: 2.164194E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.976 | TFLOPs: 20.20 | 31: iteration 31480/ 173500 | consumed samples: 8058880 | consumed tokens: 16504586240 | elapsed time per iteration (s): 0.77 | learning rate: 1.870E-04 | global batch size: 256 | lm loss: 2.140746E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.012 | TFLOPs: 20.21 | 31: iteration 31490/ 173500 | consumed samples: 8061440 | consumed tokens: 16509829120 | elapsed time per iteration (s): 0.74 | learning rate: 1.870E-04 | global batch size: 256 | lm loss: 2.116143E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.445 | TFLOPs: 20.90 | 31: iteration 31500/ 173500 | consumed samples: 8064000 | consumed tokens: 16515072000 | elapsed time per iteration (s): 0.77 | learning rate: 1.870E-04 | global batch size: 256 | lm loss: 2.127631E+00 | grad norm: 
0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.949 | TFLOPs: 20.08 | 31: iteration 31510/ 173500 | consumed samples: 8066560 | consumed tokens: 16520314880 | elapsed time per iteration (s): 0.75 | learning rate: 1.870E-04 | global batch size: 256 | lm loss: 2.168989E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.593 | TFLOPs: 20.60 | 31: iteration 31520/ 173500 | consumed samples: 8069120 | consumed tokens: 16525557760 | elapsed time per iteration (s): 0.76 | learning rate: 1.870E-04 | global batch size: 256 | lm loss: 2.090252E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.336 | TFLOPs: 20.29 | 31: iteration 31530/ 173500 | consumed samples: 8071680 | consumed tokens: 16530800640 | elapsed time per iteration (s): 0.77 | learning rate: 1.870E-04 | global batch size: 256 | lm loss: 2.107198E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.383 | TFLOPs: 19.99 | 31: iteration 31540/ 173500 | consumed samples: 8074240 | consumed tokens: 16536043520 | elapsed time per iteration (s): 0.73 | learning rate: 1.870E-04 | global batch size: 256 | lm loss: 2.129601E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.695 | TFLOPs: 21.22 | 31: iteration 31550/ 173500 | consumed samples: 8076800 | consumed tokens: 16541286400 | elapsed time per iteration (s): 0.76 | learning rate: 1.869E-04 | global batch size: 256 | lm loss: 2.133336E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.670 | TFLOPs: 20.43 | 31: iteration 31560/ 173500 | consumed samples: 8079360 | consumed tokens: 16546529280 | elapsed time per 
iteration (s): 0.78 | learning rate: 1.869E-04 | global batch size: 256 | lm loss: 2.136184E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.009 | TFLOPs: 19.84 | 31: iteration 31570/ 173500 | consumed samples: 8081920 | consumed tokens: 16551772160 | elapsed time per iteration (s): 0.77 | learning rate: 1.869E-04 | global batch size: 256 | lm loss: 2.122049E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.604 | TFLOPs: 20.24 | 31: iteration 31580/ 173500 | consumed samples: 8084480 | consumed tokens: 16557015040 | elapsed time per iteration (s): 0.74 | learning rate: 1.869E-04 | global batch size: 256 | lm loss: 2.138715E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.383 | TFLOPs: 20.89 | 31: iteration 31590/ 173500 | consumed samples: 8087040 | consumed tokens: 16562257920 | elapsed time per iteration (s): 0.74 | learning rate: 1.869E-04 | global batch size: 256 | lm loss: 2.115026E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.414 | TFLOPs: 20.96 | 31: iteration 31600/ 173500 | consumed samples: 8089600 | consumed tokens: 16567500800 | elapsed time per iteration (s): 0.76 | learning rate: 1.869E-04 | global batch size: 256 | lm loss: 2.111099E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.861 | TFLOPs: 20.32 | 31: iteration 31610/ 173500 | consumed samples: 8092160 | consumed tokens: 16572743680 | elapsed time per iteration (s): 0.76 | learning rate: 1.869E-04 | global batch size: 256 | lm loss: 2.113318E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.132 | TFLOPs: 20.34 | 31: 
iteration 31620/ 173500 | consumed samples: 8094720 | consumed tokens: 16577986560 | elapsed time per iteration (s): 0.79 | learning rate: 1.869E-04 | global batch size: 256 | lm loss: 2.140109E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.750 | TFLOPs: 19.53 | 31: iteration 31630/ 173500 | consumed samples: 8097280 | consumed tokens: 16583229440 | elapsed time per iteration (s): 0.74 | learning rate: 1.869E-04 | global batch size: 256 | lm loss: 2.130799E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.745 | TFLOPs: 20.86 | 31: iteration 31640/ 173500 | consumed samples: 8099840 | consumed tokens: 16588472320 | elapsed time per iteration (s): 0.77 | learning rate: 1.869E-04 | global batch size: 256 | lm loss: 2.137608E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.256 | TFLOPs: 20.04 | 31: iteration 31650/ 173500 | consumed samples: 8102400 | consumed tokens: 16593715200 | elapsed time per iteration (s): 0.75 | learning rate: 1.869E-04 | global batch size: 256 | lm loss: 2.131364E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.099 | TFLOPs: 20.70 | 31: iteration 31660/ 173500 | consumed samples: 8104960 | consumed tokens: 16598958080 | elapsed time per iteration (s): 0.79 | learning rate: 1.869E-04 | global batch size: 256 | lm loss: 2.116023E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.874 | TFLOPs: 19.65 | 31: iteration 31670/ 173500 | consumed samples: 8107520 | consumed tokens: 16604200960 | elapsed time per iteration (s): 0.81 | learning rate: 1.868E-04 | global batch size: 256 | lm loss: 2.137717E+00 | grad norm: 0.142 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.146 | TFLOPs: 19.13 | 31: iteration 31680/ 173500 | consumed samples: 8110080 | consumed tokens: 16609443840 | elapsed time per iteration (s): 0.83 | learning rate: 1.868E-04 | global batch size: 256 | lm loss: 2.121660E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.619 | TFLOPs: 18.55 | 31: iteration 31690/ 173500 | consumed samples: 8112640 | consumed tokens: 16614686720 | elapsed time per iteration (s): 0.80 | learning rate: 1.868E-04 | global batch size: 256 | lm loss: 2.143954E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.502 | TFLOPs: 19.27 | 31: iteration 31700/ 173500 | consumed samples: 8115200 | consumed tokens: 16619929600 | elapsed time per iteration (s): 0.81 | learning rate: 1.868E-04 | global batch size: 256 | lm loss: 2.107654E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.769 | TFLOPs: 19.16 | 31: iteration 31710/ 173500 | consumed samples: 8117760 | consumed tokens: 16625172480 | elapsed time per iteration (s): 0.80 | learning rate: 1.868E-04 | global batch size: 256 | lm loss: 2.129665E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.758 | TFLOPs: 19.34 | 31: iteration 31720/ 173500 | consumed samples: 8120320 | consumed tokens: 16630415360 | elapsed time per iteration (s): 0.79 | learning rate: 1.868E-04 | global batch size: 256 | lm loss: 2.158310E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.546 | TFLOPs: 19.51 | 31: iteration 31730/ 173500 | consumed samples: 8122880 | consumed tokens: 16635658240 | elapsed time per iteration (s): 0.86 | learning rate: 
1.868E-04 | global batch size: 256 | lm loss: 2.149230E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.139 | TFLOPs: 18.10 | 31: iteration 31740/ 173500 | consumed samples: 8125440 | consumed tokens: 16640901120 | elapsed time per iteration (s): 0.79 | learning rate: 1.868E-04 | global batch size: 256 | lm loss: 2.148344E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.160 | TFLOPs: 19.49 | 31: iteration 31750/ 173500 | consumed samples: 8128000 | consumed tokens: 16646144000 | elapsed time per iteration (s): 0.81 | learning rate: 1.868E-04 | global batch size: 256 | lm loss: 2.120581E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.050 | TFLOPs: 19.12 | 31: iteration 31760/ 173500 | consumed samples: 8130560 | consumed tokens: 16651386880 | elapsed time per iteration (s): 0.81 | learning rate: 1.868E-04 | global batch size: 256 | lm loss: 2.156668E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.387 | TFLOPs: 19.20 | 31: iteration 31770/ 173500 | consumed samples: 8133120 | consumed tokens: 16656629760 | elapsed time per iteration (s): 0.81 | learning rate: 1.868E-04 | global batch size: 256 | lm loss: 2.125133E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.359 | TFLOPs: 19.14 | 31: iteration 31780/ 173500 | consumed samples: 8135680 | consumed tokens: 16661872640 | elapsed time per iteration (s): 0.96 | learning rate: 1.867E-04 | global batch size: 256 | lm loss: 2.138227E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 266.998 | TFLOPs: 16.15 | 31: iteration 31790/ 173500 | consumed 
samples: 8138240 | consumed tokens: 16667115520 | elapsed time per iteration (s): 0.88 | learning rate: 1.867E-04 | global batch size: 256 | lm loss: 2.138653E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.454 | TFLOPs: 17.63 | 31: iteration 31800/ 173500 | consumed samples: 8140800 | consumed tokens: 16672358400 | elapsed time per iteration (s): 0.88 | learning rate: 1.867E-04 | global batch size: 256 | lm loss: 2.136008E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.357 | TFLOPs: 17.57 | 31: iteration 31810/ 173500 | consumed samples: 8143360 | consumed tokens: 16677601280 | elapsed time per iteration (s): 0.80 | learning rate: 1.867E-04 | global batch size: 256 | lm loss: 2.124039E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.820 | TFLOPs: 19.29 | 31: iteration 31820/ 173500 | consumed samples: 8145920 | consumed tokens: 16682844160 | elapsed time per iteration (s): 0.86 | learning rate: 1.867E-04 | global batch size: 256 | lm loss: 2.133659E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.071 | TFLOPs: 17.91 | 31: iteration 31830/ 173500 | consumed samples: 8148480 | consumed tokens: 16688087040 | elapsed time per iteration (s): 0.79 | learning rate: 1.867E-04 | global batch size: 256 | lm loss: 2.156271E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.622 | TFLOPs: 19.58 | 31: iteration 31840/ 173500 | consumed samples: 8151040 | consumed tokens: 16693329920 | elapsed time per iteration (s): 2.27 | learning rate: 1.867E-04 | global batch size: 256 | lm loss: 2.124982E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 112.850 | TFLOPs: 6.83 | 31: iteration 31850/ 173500 | consumed samples: 8153600 | consumed tokens: 16698572800 | elapsed time per iteration (s): 0.80 | learning rate: 1.867E-04 | global batch size: 256 | lm loss: 2.132241E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.454 | TFLOPs: 19.27 | 31: iteration 31860/ 173500 | consumed samples: 8156160 | consumed tokens: 16703815680 | elapsed time per iteration (s): 0.83 | learning rate: 1.867E-04 | global batch size: 256 | lm loss: 2.099833E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.011 | TFLOPs: 18.75 | 31: iteration 31870/ 173500 | consumed samples: 8158720 | consumed tokens: 16709058560 | elapsed time per iteration (s): 0.79 | learning rate: 1.867E-04 | global batch size: 256 | lm loss: 2.136936E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.191 | TFLOPs: 19.61 | 31: iteration 31880/ 173500 | consumed samples: 8161280 | consumed tokens: 16714301440 | elapsed time per iteration (s): 0.80 | learning rate: 1.867E-04 | global batch size: 256 | lm loss: 2.127830E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.310 | TFLOPs: 19.38 | 31: iteration 31890/ 173500 | consumed samples: 8163840 | consumed tokens: 16719544320 | elapsed time per iteration (s): 0.84 | learning rate: 1.867E-04 | global batch size: 256 | lm loss: 2.101402E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.466 | TFLOPs: 18.48 | 31: iteration 31900/ 173500 | consumed samples: 8166400 | consumed tokens: 16724787200 | elapsed time per iteration (s): 2.51 | learning rate: 1.866E-04 | global batch size: 256 | lm loss: 
2.158400E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 102.180 | TFLOPs: 6.18 | 31: iteration 31910/ 173500 | consumed samples: 8168960 | consumed tokens: 16730030080 | elapsed time per iteration (s): 0.89 | learning rate: 1.866E-04 | global batch size: 256 | lm loss: 2.138363E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.124 | TFLOPs: 17.49 | 31: iteration 31920/ 173500 | consumed samples: 8171520 | consumed tokens: 16735272960 | elapsed time per iteration (s): 0.82 | learning rate: 1.866E-04 | global batch size: 256 | lm loss: 2.103120E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.106 | TFLOPs: 18.94 | 31: iteration 31930/ 173500 | consumed samples: 8174080 | consumed tokens: 16740515840 | elapsed time per iteration (s): 0.81 | learning rate: 1.866E-04 | global batch size: 256 | lm loss: 2.129313E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.672 | TFLOPs: 19.10 | 31: iteration 31940/ 173500 | consumed samples: 8176640 | consumed tokens: 16745758720 | elapsed time per iteration (s): 0.84 | learning rate: 1.866E-04 | global batch size: 256 | lm loss: 2.144461E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.791 | TFLOPs: 18.50 | 31: iteration 31950/ 173500 | consumed samples: 8179200 | consumed tokens: 16751001600 | elapsed time per iteration (s): 0.83 | learning rate: 1.866E-04 | global batch size: 256 | lm loss: 2.114084E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.987 | TFLOPs: 18.63 | 31: iteration 31960/ 173500 | consumed samples: 8181760 | consumed tokens: 16756244480 | 
elapsed time per iteration (s): 0.81 | learning rate: 1.866E-04 | global batch size: 256 | lm loss: 2.097401E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.977 | TFLOPs: 19.06 | 31: iteration 31970/ 173500 | consumed samples: 8184320 | consumed tokens: 16761487360 | elapsed time per iteration (s): 0.82 | learning rate: 1.866E-04 | global batch size: 256 | lm loss: 2.147191E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.931 | TFLOPs: 18.81 | 31: iteration 31980/ 173500 | consumed samples: 8186880 | consumed tokens: 16766730240 | elapsed time per iteration (s): 0.87 | learning rate: 1.866E-04 | global batch size: 256 | lm loss: 2.127657E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.259 | TFLOPs: 17.74 | 31: iteration 31990/ 173500 | consumed samples: 8189440 | consumed tokens: 16771973120 | elapsed time per iteration (s): 0.81 | learning rate: 1.866E-04 | global batch size: 256 | lm loss: 2.129136E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.978 | TFLOPs: 19.06 | 0: [2022-11-26 01:17:16,464] [INFO] [logging.py:68:log_dist] [Rank 0] step=32000, skipped=0, lr=[0.00018655987222005428, 0.00018655987222005428, 0.00018655987222005428], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 32000/ 173500 | consumed samples: 8192000 | consumed tokens: 16777216000 | elapsed time per iteration (s): 0.85 | learning rate: 1.866E-04 | global batch size: 256 | lm loss: 2.108348E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.295 | TFLOPs: 18.23 | 0: steps: 32000 loss: 2.0342 iter time (s): 0.814 samples/sec: 314.478 31: 
------------------------------------------------------------------------------------------- 31: valid loss at iteration 32000 | lm loss value: 2.002957E+00 | lm loss PPL: 7.410937E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 32000 to checkpoints_1b1long 0: [2022-11-26 01:17:16,825] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step32000 is begin to save! 0: [2022-11-26 01:17:16,838] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_01-model_00-model_states.pt... 0: [2022-11-26 01:17:17,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_01-model_00-model_states.pt. 0: [2022-11-26 01:17:17,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_03-model_00-model_states.pt... 0: [2022-11-26 01:17:17,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_03-model_00-model_states.pt. 0: [2022-11-26 01:17:17,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_04-model_00-model_states.pt... 0: [2022-11-26 01:17:17,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_04-model_00-model_states.pt. 0: [2022-11-26 01:17:17,213] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_05-model_00-model_states.pt... 0: [2022-11-26 01:17:17,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_05-model_00-model_states.pt. 0: [2022-11-26 01:17:17,294] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_06-model_00-model_states.pt... 
0: [2022-11-26 01:17:17,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_06-model_00-model_states.pt. 0: [2022-11-26 01:17:17,371] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_07-model_00-model_states.pt... 0: [2022-11-26 01:17:17,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_07-model_00-model_states.pt. 0: [2022-11-26 01:17:17,448] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_08-model_00-model_states.pt... 0: [2022-11-26 01:17:17,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_08-model_00-model_states.pt. 0: [2022-11-26 01:17:17,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_09-model_00-model_states.pt... 0: [2022-11-26 01:17:17,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_09-model_00-model_states.pt. 0: [2022-11-26 01:17:17,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_10-model_00-model_states.pt... 0: [2022-11-26 01:17:17,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_10-model_00-model_states.pt. 0: [2022-11-26 01:17:17,689] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_11-model_00-model_states.pt... 0: [2022-11-26 01:17:17,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_11-model_00-model_states.pt. 0: [2022-11-26 01:17:17,764] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_12-model_00-model_states.pt... 
0: [2022-11-26 01:17:17,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_12-model_00-model_states.pt. 0: [2022-11-26 01:17:17,843] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_13-model_00-model_states.pt... 0: [2022-11-26 01:17:17,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_13-model_00-model_states.pt. 0: [2022-11-26 01:17:17,919] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_14-model_00-model_states.pt... 0: [2022-11-26 01:17:17,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_14-model_00-model_states.pt. 0: [2022-11-26 01:17:17,991] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_15-model_00-model_states.pt... 0: [2022-11-26 01:17:18,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_15-model_00-model_states.pt. 0: [2022-11-26 01:17:18,069] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_16-model_00-model_states.pt... 0: [2022-11-26 01:17:18,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_16-model_00-model_states.pt. 0: [2022-11-26 01:17:18,142] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_17-model_00-model_states.pt... 0: [2022-11-26 01:17:18,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_17-model_00-model_states.pt. 0: [2022-11-26 01:17:18,218] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_18-model_00-model_states.pt... 
0: [2022-11-26 01:17:18,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_18-model_00-model_states.pt. 0: [2022-11-26 01:17:18,290] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_19-model_00-model_states.pt... 0: [2022-11-26 01:17:18,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_19-model_00-model_states.pt. 0: [2022-11-26 01:17:18,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_20-model_00-model_states.pt... 0: [2022-11-26 01:17:18,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_20-model_00-model_states.pt. 0: [2022-11-26 01:17:18,442] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_21-model_00-model_states.pt... 0: [2022-11-26 01:17:18,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_21-model_00-model_states.pt. 0: [2022-11-26 01:17:18,516] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_22-model_00-model_states.pt... 0: [2022-11-26 01:17:18,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_22-model_00-model_states.pt. 0: [2022-11-26 01:17:18,591] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_23-model_00-model_states.pt... 0: [2022-11-26 01:17:18,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_23-model_00-model_states.pt. 0: [2022-11-26 01:17:18,666] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_24-model_00-model_states.pt... 
0: [2022-11-26 01:17:18,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_24-model_00-model_states.pt. 0: [2022-11-26 01:17:18,741] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_25-model_00-model_states.pt... 0: [2022-11-26 01:17:18,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_25-model_00-model_states.pt. 0: [2022-11-26 01:17:18,816] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_26-model_00-model_states.pt... 0: [2022-11-26 01:17:18,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_26-model_00-model_states.pt. 0: [2022-11-26 01:17:18,890] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_27-model_00-model_states.pt... 0: [2022-11-26 01:17:18,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_27-model_00-model_states.pt. 0: [2022-11-26 01:17:18,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_28-model_00-model_states.pt... 0: [2022-11-26 01:17:19,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_28-model_00-model_states.pt. 0: [2022-11-26 01:17:19,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/layer_30-model_00-model_states.pt... 0: [2022-11-26 01:17:19,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/layer_30-model_00-model_states.pt. 
0: [2022-11-26 01:17:19,041] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step32000/mp_rank_00_model_states.pt 0: [2022-11-26 01:17:19,041] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/mp_rank_00_model_states.pt... 0: [2022-11-26 01:17:19,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/mp_rank_00_model_states.pt. 0: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
7: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
1: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
3: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 
25: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 
28: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 
29: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 
30: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 
18: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 
27: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
4: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 
1: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
3: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 
25: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 
24: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 
30: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 
26: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
6: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
8: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
15: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 
31: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 
5: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 
31: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 
26: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:17:19,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:17:19,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:17:19,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 01:17:19,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 1: [2022-11-26 01:17:19,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 01:17:19,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 01:17:19,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 23: [2022-11-26 01:17:19,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:17:19,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 01:17:19,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 3: [2022-11-26 01:17:19,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:17:19,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 01:17:19,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 12: [2022-11-26 01:17:19,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:17:19,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
12: [2022-11-26 01:17:19,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 14: [2022-11-26 01:17:19,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 12: [2022-11-26 01:17:19,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 14: [2022-11-26 01:17:19,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 16: [2022-11-26 01:17:19,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:17:19,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 27: [2022-11-26 01:17:19,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:17:19,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 27: [2022-11-26 01:17:19,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 01:17:19,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 9: [2022-11-26 01:17:19,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-26 01:17:19,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 01:17:19,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 10: [2022-11-26 01:17:19,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:17:19,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:17:19,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 15: [2022-11-26 01:17:19,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 10: [2022-11-26 01:17:19,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 15: [2022-11-26 01:17:19,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 4: [2022-11-26 01:17:19,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:17:19,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 01:17:19,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 23: [2022-11-26 01:17:19,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
11: [2022-11-26 01:17:19,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:17:19,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:17:19,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 11: [2022-11-26 01:17:19,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 01:17:19,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 23: [2022-11-26 01:17:19,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 11: [2022-11-26 01:17:19,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 11: [2022-11-26 01:17:19,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 21: [2022-11-26 01:17:19,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:17:19,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 01:17:19,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 24: [2022-11-26 01:17:19,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
24: [2022-11-26 01:17:19,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:17:19,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:17:19,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 18: [2022-11-26 01:17:19,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 24: [2022-11-26 01:17:19,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 01:17:19,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 18: [2022-11-26 01:17:19,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 24: [2022-11-26 01:17:19,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 2: [2022-11-26 01:17:19,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:17:19,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 01:17:19,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 27: [2022-11-26 01:17:19,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
27: [2022-11-26 01:17:19,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 01:17:19,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 19: [2022-11-26 01:17:19,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:17:19,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 01:17:19,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 9: [2022-11-26 01:17:19,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:17:19,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:17:19,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 01:17:19,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 0: [2022-11-26 01:17:19,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:17:19,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-26 01:17:19,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 01:17:19,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 01:17:19,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 0: [2022-11-26 01:17:19,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 17: [2022-11-26 01:17:19,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:17:19,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:17:19,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 01:17:19,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 17: [2022-11-26 01:17:19,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:17:19,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 25: [2022-11-26 01:17:19,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 17: [2022-11-26 01:17:19,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
25: [2022-11-26 01:17:19,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 12: [2022-11-26 01:17:19,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:17:19,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 01:17:19,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 14: [2022-11-26 01:17:19,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:17:19,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:17:19,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 10: [2022-11-26 01:17:19,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 14: [2022-11-26 01:17:19,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 10: [2022-11-26 01:17:19,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 3: [2022-11-26 01:17:19,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 01:17:19,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 01:17:19,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 13: [2022-11-26 01:17:19,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:17:19,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:17:19,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 19: [2022-11-26 01:17:19,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 13: [2022-11-26 01:17:19,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 19: [2022-11-26 01:17:19,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 13: [2022-11-26 01:17:19,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:17:19,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 01:17:19,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 18: [2022-11-26 01:17:19,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
11: [2022-11-26 01:17:19,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:17:19,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 11: [2022-11-26 01:17:19,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 22: [2022-11-26 01:17:19,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:17:19,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 11: [2022-11-26 01:17:19,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 22: [2022-11-26 01:17:19,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 01:17:19,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 24: [2022-11-26 01:17:19,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:17:19,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
24: [2022-11-26 01:17:19,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 22: [2022-11-26 01:17:19,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 01:17:19,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 24: [2022-11-26 01:17:19,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 26: [2022-11-26 01:17:19,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:17:19,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 01:17:19,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 26: [2022-11-26 01:17:19,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:17:19,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 01:17:19,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 16: [2022-11-26 01:17:19,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 
16: [2022-11-26 01:17:19,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 4: [2022-11-26 01:17:19,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:17:19,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 4: [2022-11-26 01:17:19,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 01:17:19,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 2: [2022-11-26 01:17:19,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:17:19,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 01:17:19,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 1: [2022-11-26 01:17:19,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:17:19,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
22: [2022-11-26 01:17:19,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 1: [2022-11-26 01:17:19,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 22: [2022-11-26 01:17:19,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 1: [2022-11-26 01:17:19,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 17: [2022-11-26 01:17:19,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:17:19,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:17:19,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 17: [2022-11-26 01:17:19,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 10: [2022-11-26 01:17:19,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 17: [2022-11-26 01:17:19,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 15: [2022-11-26 01:17:19,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-26 01:17:19,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 01:17:19,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 5: [2022-11-26 01:17:19,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:17:19,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 01:17:19,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 5: [2022-11-26 01:17:19,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:17:19,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 01:17:19,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 5: [2022-11-26 01:17:19,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:17:19,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 01:17:19,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 1: [2022-11-26 01:17:19,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
9: [2022-11-26 01:17:19,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:17:19,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 01:17:19,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 1: [2022-11-26 01:17:19,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 01:17:19,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 1: [2022-11-26 01:17:19,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:17:19,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 01:17:19,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 28: [2022-11-26 01:17:19,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 01:17:19,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 26: [2022-11-26 01:17:19,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:17:19,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
28: [2022-11-26 01:17:19,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 01:17:19,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 28: [2022-11-26 01:17:19,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:17:19,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:17:19,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 27: [2022-11-26 01:17:19,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:17:19,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 14: [2022-11-26 01:17:19,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 26: [2022-11-26 01:17:19,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 28: [2022-11-26 01:17:19,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 14: [2022-11-26 01:17:19,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 25: [2022-11-26 01:17:19,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
25: [2022-11-26 01:17:19,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 27: [2022-11-26 01:17:19,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 25: [2022-11-26 01:17:19,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 27: [2022-11-26 01:17:19,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 13: [2022-11-26 01:17:19,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:17:19,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 01:17:19,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 3: [2022-11-26 01:17:19,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:17:19,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:17:19,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 15: [2022-11-26 01:17:19,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 3: [2022-11-26 01:17:19,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
15: [2022-11-26 01:17:19,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 4: [2022-11-26 01:17:19,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:17:19,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 01:17:19,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 21: [2022-11-26 01:17:19,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:17:19,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 01:17:19,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:17:19,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 21: [2022-11-26 01:17:19,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 23: [2022-11-26 01:17:19,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:17:19,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 23: [2022-11-26 01:17:19,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-26 01:17:19,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 01:17:19,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 01:17:19,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 23: [2022-11-26 01:17:19,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 5: [2022-11-26 01:17:19,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:17:19,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 01:17:19,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 11: [2022-11-26 01:17:19,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:17:19,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 0: [2022-11-26 01:17:19,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:17:19,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 19: [2022-11-26 01:17:19,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-26 01:17:19,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:17:19,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 01:17:19,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 01:17:19,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 19: [2022-11-26 01:17:19,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 27: [2022-11-26 01:17:19,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:17:19,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 12: [2022-11-26 01:17:19,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:17:19,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 27: [2022-11-26 01:17:19,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 12: [2022-11-26 01:17:19,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 31: [2022-11-26 01:17:19,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 
31: [2022-11-26 01:17:19,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 01:17:19,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 28: [2022-11-26 01:17:19,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:17:19,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:17:19,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:17:19,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 01:17:19,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:17:19,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 13: [2022-11-26 01:17:19,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
18: [2022-11-26 01:17:19,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 16: [2022-11-26 01:17:19,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 13: [2022-11-26 01:17:19,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 18: [2022-11-26 01:17:19,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 13: [2022-11-26 01:17:19,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 10: [2022-11-26 01:17:19,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:17:19,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 10: [2022-11-26 01:17:19,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 15: [2022-11-26 01:17:19,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:17:19,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 15: [2022-11-26 01:17:19,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 01:17:19,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
17: [2022-11-26 01:17:19,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:17:19,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:17:19,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 31: [2022-11-26 01:17:19,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:17:19,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:17:19,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 01:17:19,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 3: [2022-11-26 01:17:19,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 31: [2022-11-26 01:17:19,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 17: [2022-11-26 01:17:19,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 3: [2022-11-26 01:17:19,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 31: [2022-11-26 01:17:19,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
24: [2022-11-26 01:17:19,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:17:19,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 01:17:19,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 9: [2022-11-26 01:17:19,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:17:19,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 18: [2022-11-26 01:17:19,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:17:19,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 2: [2022-11-26 01:17:19,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:17:19,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 2: [2022-11-26 01:17:19,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 18: [2022-11-26 01:17:19,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 2: [2022-11-26 01:17:19,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
0: [2022-11-26 01:17:19,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:17:19,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 01:17:19,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 25: [2022-11-26 01:17:19,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:17:19,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 01:17:19,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 2: [2022-11-26 01:17:19,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:17:19,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 01:17:19,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 21: [2022-11-26 01:17:19,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:17:19,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 01:17:19,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
26: [2022-11-26 01:17:19,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:17:19,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 01:17:19,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 25: [2022-11-26 01:17:19,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:17:19,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 01:17:19,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 1: [2022-11-26 01:17:19,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:17:19,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 01:17:19,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 14: [2022-11-26 01:17:19,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 01:17:19,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 22: [2022-11-26 01:17:19,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:17:19,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 22: [2022-11-26 01:17:19,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 01:17:19,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 28: [2022-11-26 01:17:19,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 01:17:19,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 9: [2022-11-26 01:17:19,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:17:19,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 01:17:19,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 11: [2022-11-26 01:17:19,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-26 01:17:19,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 01:17:19,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 12: [2022-11-26 01:17:19,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:17:19,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 01:17:19,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 4: [2022-11-26 01:17:19,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:17:19,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 01:17:19,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 7: [2022-11-26 01:17:19,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:17:19,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:17:19,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:17:19,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-26 01:17:19,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 01:17:19,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 01:17:19,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 01:17:19,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 01:17:19,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 7: [2022-11-26 01:17:19,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 7: [2022-11-26 01:17:19,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 7: [2022-11-26 01:17:19,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 31: [2022-11-26 01:17:19,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:17:19,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 01:17:19,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
0: [2022-11-26 01:17:19,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 01:17:19,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 5: [2022-11-26 01:17:19,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:17:19,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 01:17:19,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 6: [2022-11-26 01:17:19,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:17:19,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 01:17:19,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 8: [2022-11-26 01:17:19,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:17:19,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:17:19,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 01:17:19,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 01:17:19,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 01:17:19,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 01:17:19,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:17:19,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 8: [2022-11-26 01:17:19,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 8: [2022-11-26 01:17:19,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 25: [2022-11-26 01:17:19,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:17:19,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 01:17:19,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 25: [2022-11-26 01:17:19,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 01:17:19,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
15: [2022-11-26 01:17:19,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:17:19,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 01:17:19,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 16: [2022-11-26 01:17:19,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:17:19,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 01:17:19,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 18: [2022-11-26 01:17:19,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:17:19,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 01:17:19,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 0: [2022-11-26 01:17:19,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:17:19,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
0: [2022-11-26 01:17:19,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 23: [2022-11-26 01:17:19,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 01:17:19,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 0: [2022-11-26 01:17:19,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 8: [2022-11-26 01:17:19,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:17:19,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 01:17:19,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 17: [2022-11-26 01:17:19,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:17:19,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 01:17:19,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 14: [2022-11-26 01:17:19,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-26 01:17:19,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 01:17:19,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 4: [2022-11-26 01:17:19,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:17:19,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 01:17:19,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 12: [2022-11-26 01:17:19,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:17:19,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 01:17:19,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 29: [2022-11-26 01:17:19,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:17:19,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:17:19,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
29: [2022-11-26 01:17:19,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 01:17:19,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 29: [2022-11-26 01:17:19,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 01:17:19,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 01:17:19,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:17:19,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 29: [2022-11-26 01:17:19,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 29: [2022-11-26 01:17:19,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 01:17:19,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 7: [2022-11-26 01:17:19,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:17:19,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 01:17:19,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
2: [2022-11-26 01:17:19,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:17:19,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 01:17:19,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 6: [2022-11-26 01:17:19,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:17:19,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 01:17:19,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 28: [2022-11-26 01:17:19,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:17:19,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 01:17:19,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 6: [2022-11-26 01:17:19,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:17:19,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 01:17:19,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
10: [2022-11-26 01:17:19,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:17:19,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 01:17:19,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 3: [2022-11-26 01:17:19,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:17:19,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 01:17:19,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 24: [2022-11-26 01:17:19,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:17:19,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 01:17:19,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 19: [2022-11-26 01:17:19,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:17:19,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 01:17:19,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
13: [2022-11-26 01:17:19,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:17:19,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 01:17:19,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 31: [2022-11-26 01:17:19,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:17:19,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 01:17:19,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 27: [2022-11-26 01:17:19,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:17:19,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 01:17:19,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 0: [2022-11-26 01:17:19,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:17:19,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 01:17:19,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
22: [2022-11-26 01:17:19,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:17:19,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:17:19,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 21: [2022-11-26 01:17:19,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 22: [2022-11-26 01:17:19,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 21: [2022-11-26 01:17:19,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 26: [2022-11-26 01:17:19,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:17:19,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 01:17:19,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 6: [2022-11-26 01:17:19,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:17:19,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 01:17:19,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
29: [2022-11-26 01:17:19,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:17:19,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 01:17:19,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 23: [2022-11-26 01:17:19,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:17:19,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 01:17:19,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 9: [2022-11-26 01:17:19,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:17:19,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 01:17:19,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 1: [2022-11-26 01:17:19,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:17:19,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 01:17:19,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
15: [2022-11-26 01:17:19,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:17:19,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:17:19,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 25: [2022-11-26 01:17:19,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 15: [2022-11-26 01:17:19,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 25: [2022-11-26 01:17:19,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 5: [2022-11-26 01:17:19,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:17:19,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:17:19,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 16: [2022-11-26 01:17:19,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 5: [2022-11-26 01:17:19,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 16: [2022-11-26 01:17:19,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
6: [2022-11-26 01:17:19,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:17:19,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 14: [2022-11-26 01:17:19,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:17:19,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 14: [2022-11-26 01:17:19,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 01:17:19,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 11: [2022-11-26 01:17:19,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:17:19,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 01:17:19,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 18: [2022-11-26 01:17:19,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:17:19,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 01:17:19,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
8: [2022-11-26 01:17:19,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:17:19,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 01:17:19,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 7: [2022-11-26 01:17:19,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:17:19,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 01:17:19,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 12: [2022-11-26 01:17:19,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:17:19,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:17:19,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 10: [2022-11-26 01:17:19,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 12: [2022-11-26 01:17:19,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 10: [2022-11-26 01:17:19,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
2: [2022-11-26 01:17:19,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:17:19,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 01:17:19,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 24: [2022-11-26 01:17:19,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:17:19,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 01:17:19,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 17: [2022-11-26 01:17:19,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:17:19,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 01:17:19,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 3: [2022-11-26 01:17:19,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:17:19,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:17:19,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
3: [2022-11-26 01:17:19,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 19: [2022-11-26 01:17:19,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 4: [2022-11-26 01:17:19,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 3: [2022-11-26 01:17:19,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 19: [2022-11-26 01:17:19,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 4: [2022-11-26 01:17:19,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 0: [2022-11-26 01:17:19,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:17:19,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 28: [2022-11-26 01:17:19,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:17:19,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 28: [2022-11-26 01:17:19,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 01:17:19,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
22: [2022-11-26 01:17:19,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:17:19,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 23: [2022-11-26 01:17:19,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:17:19,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 23: [2022-11-26 01:17:19,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 01:17:19,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 27: [2022-11-26 01:17:19,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:17:19,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 01:17:19,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 26: [2022-11-26 01:17:19,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:17:19,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 01:17:19,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
21: [2022-11-26 01:17:19,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:17:19,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:17:19,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 6: [2022-11-26 01:17:19,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 01:17:19,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 21: [2022-11-26 01:17:19,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 29: [2022-11-26 01:17:19,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:17:19,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 01:17:19,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 31: [2022-11-26 01:17:19,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:17:19,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 01:17:19,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
13: [2022-11-26 01:17:19,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:17:19,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 01:17:19,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 9: [2022-11-26 01:17:19,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:17:19,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 01:17:19,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 15: [2022-11-26 01:17:19,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:17:19,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 01:17:19,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 1: [2022-11-26 01:17:19,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:17:19,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 01:17:19,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
5: [2022-11-26 01:17:19,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:17:19,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 01:17:19,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 30: [2022-11-26 01:17:19,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:17:19,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:17:19,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:17:19,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:17:19,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 01:17:19,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:17:19,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
30: [2022-11-26 01:17:19,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 01:17:19,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 01:17:19,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 01:17:19,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 01:17:19,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 30: [2022-11-26 01:17:19,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 30: [2022-11-26 01:17:19,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 30: [2022-11-26 01:17:19,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 16: [2022-11-26 01:17:19,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:17:19,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 01:17:19,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 25: [2022-11-26 01:17:19,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
11: [2022-11-26 01:17:19,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:17:19,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 25: [2022-11-26 01:17:19,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 01:17:19,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 11: [2022-11-26 01:17:19,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 18: [2022-11-26 01:17:19,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:17:19,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 01:17:19,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 8: [2022-11-26 01:17:19,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:17:19,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 01:17:19,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 4: [2022-11-26 01:17:19,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
24: [2022-11-26 01:17:19,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:17:19,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 01:17:19,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 24: [2022-11-26 01:17:19,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 17: [2022-11-26 01:17:19,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:17:19,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 7: [2022-11-26 01:17:19,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:17:19,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 7: [2022-11-26 01:17:19,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 17: [2022-11-26 01:17:19,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 7: [2022-11-26 01:17:19,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 14: [2022-11-26 01:17:19,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-26 01:17:19,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 01:17:19,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 19: [2022-11-26 01:17:19,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:17:19,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 01:17:19,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 2: [2022-11-26 01:17:19,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:17:19,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 01:17:19,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 30: [2022-11-26 01:17:19,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:17:19,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 01:17:19,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 0: [2022-11-26 01:17:19,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 01:17:19,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 01:17:19,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 12: [2022-11-26 01:17:19,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:17:19,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 01:17:19,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 2: [2022-11-26 01:17:19,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:17:19,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 01:17:19,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 23: [2022-11-26 01:17:19,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:17:19,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:17:19,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 01:17:19,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
15: [2022-11-26 01:17:19,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 01:17:19,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 28: [2022-11-26 01:17:19,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:17:19,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 01:17:19,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 6: [2022-11-26 01:17:19,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:17:19,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 01:17:19,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 10: [2022-11-26 01:17:19,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:17:19,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 01:17:19,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 24: [2022-11-26 01:17:19,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
8: [2022-11-26 01:17:19,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:17:19,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 01:17:19,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 8: [2022-11-26 01:17:19,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 01:17:19,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 27: [2022-11-26 01:17:19,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:17:19,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 01:17:19,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 25: [2022-11-26 01:17:19,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:17:19,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:17:19,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 01:17:19,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
3: [2022-11-26 01:17:19,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 01:17:19,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 9: [2022-11-26 01:17:19,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:17:19,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:17:19,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 01:17:19,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 16: [2022-11-26 01:17:19,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:17:19,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 01:17:19,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 16: [2022-11-26 01:17:19,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 01:17:19,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 1: [2022-11-26 01:17:19,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
21: [2022-11-26 01:17:19,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:17:19,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:17:19,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 01:17:19,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 11: [2022-11-26 01:17:19,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:17:19,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 01:17:19,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 26: [2022-11-26 01:17:19,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 11: [2022-11-26 01:17:19,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 26: [2022-11-26 01:17:19,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 11: [2022-11-26 01:17:19,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 7: [2022-11-26 01:17:19,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 01:17:19,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 01:17:19,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 17: [2022-11-26 01:17:19,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:17:19,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:17:19,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 5: [2022-11-26 01:17:19,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:17:19,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 22: [2022-11-26 01:17:19,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 5: [2022-11-26 01:17:19,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 22: [2022-11-26 01:17:19,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 5: [2022-11-26 01:17:19,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 13: [2022-11-26 01:17:19,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 01:17:19,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 01:17:19,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 19: [2022-11-26 01:17:19,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:17:19,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 01:17:19,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 10: [2022-11-26 01:17:19,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:17:19,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:17:19,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 10: [2022-11-26 01:17:19,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 01:17:19,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 18: [2022-11-26 01:17:19,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 31: [2022-11-26 01:17:19,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 
31: [2022-11-26 01:17:19,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 01:17:19,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 29: [2022-11-26 01:17:19,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:17:19,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 01:17:19,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 14: [2022-11-26 01:17:19,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:17:19,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 01:17:19,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 4: [2022-11-26 01:17:19,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:17:19,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 01:17:19,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 3: [2022-11-26 01:17:19,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 01:17:19,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 01:17:19,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 28: [2022-11-26 01:17:19,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:17:19,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 01:17:19,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 26: [2022-11-26 01:17:19,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:17:19,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 01:17:19,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 27: [2022-11-26 01:17:19,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:17:19,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 01:17:19,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 31: [2022-11-26 01:17:19,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 
31: [2022-11-26 01:17:19,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 01:17:19,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 13: [2022-11-26 01:17:19,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:17:19,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:17:19,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 01:17:19,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 21: [2022-11-26 01:17:19,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 01:17:19,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 22: [2022-11-26 01:17:19,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:17:19,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 01:17:19,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 29: [2022-11-26 01:17:19,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
29: [2022-11-26 01:17:19,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 01:17:19,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 30: [2022-11-26 01:17:19,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:17:19,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 01:17:19,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 30: [2022-11-26 01:17:19,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:17:19,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 01:17:19,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 20: [2022-11-26 01:17:19,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:17:19,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:17:19,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-26 01:17:19,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 01:17:19,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 20: [2022-11-26 01:17:19,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 01:17:19,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 01:17:19,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 20: [2022-11-26 01:17:19,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 20: [2022-11-26 01:17:19,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:17:19,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 01:17:19,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 20: [2022-11-26 01:17:19,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:17:19,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 01:17:19,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
20: [2022-11-26 01:17:19,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:17:19,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 01:17:19,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 20: [2022-11-26 01:17:19,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:17:19,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 01:17:19,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 20: [2022-11-26 01:17:19,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:17:19,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step32000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 01:17:19,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
0: successfully saved checkpoint at iteration 32000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2584.92 31: iteration 32010/ 173500 | consumed samples: 8194560 | consumed tokens: 16782458880 | elapsed time per iteration (s): 1.14 | learning rate: 1.866E-04 | global batch size: 256 | lm loss: 2.127704E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 225.115 | TFLOPs: 13.62 | 31: iteration 32020/ 173500 | consumed samples: 8197120 | consumed tokens: 16787701760 | elapsed time per iteration (s): 0.82 | learning rate: 1.865E-04 | global batch size: 256 | lm loss: 2.107634E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.824 | TFLOPs: 18.99 | 31: iteration 32030/ 173500 | consumed samples: 8199680 | consumed tokens: 16792944640 | elapsed time per iteration (s): 0.77 | learning rate: 1.865E-04 | global batch size: 256 | lm loss: 2.132906E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.242 | TFLOPs: 20.22 | 31: iteration 32040/ 173500 | consumed samples: 8202240 | consumed tokens: 16798187520 | elapsed time per iteration (s): 0.74 | learning rate: 1.865E-04 | global batch size: 256 | lm loss: 2.157520E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.694 | TFLOPs: 20.85 | 31: iteration 32050/ 173500 | consumed samples: 8204800 | consumed tokens: 16803430400 | elapsed time per iteration (s): 0.75 | learning rate: 1.865E-04 | global batch size: 256 | lm loss: 2.157686E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.703 | TFLOPs: 20.61 | 31: iteration 32060/ 173500 | consumed samples: 8207360 | consumed tokens: 16808673280 | elapsed time per iteration (s): 0.78 | 
learning rate: 1.865E-04 | global batch size: 256 | lm loss: 2.120222E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.471 | TFLOPs: 19.87 | 31: iteration 32070/ 173500 | consumed samples: 8209920 | consumed tokens: 16813916160 | elapsed time per iteration (s): 0.84 | learning rate: 1.865E-04 | global batch size: 256 | lm loss: 2.131325E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.815 | TFLOPs: 18.50 | 31: iteration 32080/ 173500 | consumed samples: 8212480 | consumed tokens: 16819159040 | elapsed time per iteration (s): 0.77 | learning rate: 1.865E-04 | global batch size: 256 | lm loss: 2.153010E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.785 | TFLOPs: 20.01 | 31: iteration 32090/ 173500 | consumed samples: 8215040 | consumed tokens: 16824401920 | elapsed time per iteration (s): 0.78 | learning rate: 1.865E-04 | global batch size: 256 | lm loss: 2.148174E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.609 | TFLOPs: 19.88 | 31: iteration 32100/ 173500 | consumed samples: 8217600 | consumed tokens: 16829644800 | elapsed time per iteration (s): 0.78 | learning rate: 1.865E-04 | global batch size: 256 | lm loss: 2.136447E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.313 | TFLOPs: 19.74 | 31: iteration 32110/ 173500 | consumed samples: 8220160 | consumed tokens: 16834887680 | elapsed time per iteration (s): 0.74 | learning rate: 1.865E-04 | global batch size: 256 | lm loss: 2.128485E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.006 | TFLOPs: 21.05 | 31: iteration 32120/ 173500 
| consumed samples: 8222720 | consumed tokens: 16840130560 | elapsed time per iteration (s): 0.74 | learning rate: 1.865E-04 | global batch size: 256 | lm loss: 2.125694E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.025 | TFLOPs: 20.87 | 31: iteration 32130/ 173500 | consumed samples: 8225280 | consumed tokens: 16845373440 | elapsed time per iteration (s): 0.79 | learning rate: 1.864E-04 | global batch size: 256 | lm loss: 2.173090E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.310 | TFLOPs: 19.68 | 31: iteration 32140/ 173500 | consumed samples: 8227840 | consumed tokens: 16850616320 | elapsed time per iteration (s): 0.77 | learning rate: 1.864E-04 | global batch size: 256 | lm loss: 2.093979E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.966 | TFLOPs: 20.20 | 31: iteration 32150/ 173500 | consumed samples: 8230400 | consumed tokens: 16855859200 | elapsed time per iteration (s): 0.77 | learning rate: 1.864E-04 | global batch size: 256 | lm loss: 2.116512E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.312 | TFLOPs: 20.23 | 31: iteration 32160/ 173500 | consumed samples: 8232960 | consumed tokens: 16861102080 | elapsed time per iteration (s): 0.75 | learning rate: 1.864E-04 | global batch size: 256 | lm loss: 2.150676E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.377 | TFLOPs: 20.77 | 31: iteration 32170/ 173500 | consumed samples: 8235520 | consumed tokens: 16866344960 | elapsed time per iteration (s): 0.81 | learning rate: 1.864E-04 | global batch size: 256 | lm loss: 2.114788E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 317.625 | TFLOPs: 19.22 | 31: iteration 32180/ 173500 | consumed samples: 8238080 | consumed tokens: 16871587840 | elapsed time per iteration (s): 0.76 | learning rate: 1.864E-04 | global batch size: 256 | lm loss: 2.122053E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.689 | TFLOPs: 20.49 | 31: iteration 32190/ 173500 | consumed samples: 8240640 | consumed tokens: 16876830720 | elapsed time per iteration (s): 0.77 | learning rate: 1.864E-04 | global batch size: 256 | lm loss: 2.125861E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.328 | TFLOPs: 20.23 | 31: iteration 32200/ 173500 | consumed samples: 8243200 | consumed tokens: 16882073600 | elapsed time per iteration (s): 0.82 | learning rate: 1.864E-04 | global batch size: 256 | lm loss: 2.142013E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.412 | TFLOPs: 18.84 | 31: iteration 32210/ 173500 | consumed samples: 8245760 | consumed tokens: 16887316480 | elapsed time per iteration (s): 0.76 | learning rate: 1.864E-04 | global batch size: 256 | lm loss: 2.131083E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.910 | TFLOPs: 20.44 | 31: iteration 32220/ 173500 | consumed samples: 8248320 | consumed tokens: 16892559360 | elapsed time per iteration (s): 0.74 | learning rate: 1.864E-04 | global batch size: 256 | lm loss: 2.120583E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.915 | TFLOPs: 20.99 | 31: iteration 32230/ 173500 | consumed samples: 8250880 | consumed tokens: 16897802240 | elapsed time per iteration (s): 0.75 | learning rate: 1.864E-04 | global batch size: 
256 | lm loss: 2.120016E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.382 | TFLOPs: 20.71 | 31: iteration 32240/ 173500 | consumed samples: 8253440 | consumed tokens: 16903045120 | elapsed time per iteration (s): 0.81 | learning rate: 1.864E-04 | global batch size: 256 | lm loss: 2.622397E+00 | grad norm: 1.663 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.443 | TFLOPs: 19.02 | 31: iteration 32250/ 173500 | consumed samples: 8256000 | consumed tokens: 16908288000 | elapsed time per iteration (s): 0.82 | learning rate: 1.863E-04 | global batch size: 256 | lm loss: 2.323869E+00 | grad norm: 0.369 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.834 | TFLOPs: 18.93 | 31: iteration 32260/ 173500 | consumed samples: 8258560 | consumed tokens: 16913530880 | elapsed time per iteration (s): 0.74 | learning rate: 1.863E-04 | global batch size: 256 | lm loss: 2.270432E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.430 | TFLOPs: 21.02 | 31: iteration 32270/ 173500 | consumed samples: 8261120 | consumed tokens: 16918773760 | elapsed time per iteration (s): 0.88 | learning rate: 1.863E-04 | global batch size: 256 | lm loss: 2.178877E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.971 | TFLOPs: 17.66 | 31: iteration 32280/ 173500 | consumed samples: 8263680 | consumed tokens: 16924016640 | elapsed time per iteration (s): 0.83 | learning rate: 1.863E-04 | global batch size: 256 | lm loss: 2.142207E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.798 | TFLOPs: 18.74 | 31: iteration 32290/ 173500 | consumed samples: 8266240 | consumed 
tokens: 16929259520 | elapsed time per iteration (s): 0.80 | learning rate: 1.863E-04 | global batch size: 256 | lm loss: 2.166176E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.988 | TFLOPs: 19.42 | 31: iteration 32300/ 173500 | consumed samples: 8268800 | consumed tokens: 16934502400 | elapsed time per iteration (s): 0.80 | learning rate: 1.863E-04 | global batch size: 256 | lm loss: 2.144547E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.877 | TFLOPs: 19.41 | 31: iteration 32310/ 173500 | consumed samples: 8271360 | consumed tokens: 16939745280 | elapsed time per iteration (s): 0.79 | learning rate: 1.863E-04 | global batch size: 256 | lm loss: 2.143979E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.210 | TFLOPs: 19.67 | 31: iteration 32320/ 173500 | consumed samples: 8273920 | consumed tokens: 16944988160 | elapsed time per iteration (s): 0.78 | learning rate: 1.863E-04 | global batch size: 256 | lm loss: 2.163726E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.182 | TFLOPs: 19.98 | 31: iteration 32330/ 173500 | consumed samples: 8276480 | consumed tokens: 16950231040 | elapsed time per iteration (s): 0.74 | learning rate: 1.863E-04 | global batch size: 256 | lm loss: 2.159921E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.193 | TFLOPs: 21.06 | 31: iteration 32340/ 173500 | consumed samples: 8279040 | consumed tokens: 16955473920 | elapsed time per iteration (s): 0.78 | learning rate: 1.863E-04 | global batch size: 256 | lm loss: 2.160544E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 329.959 | TFLOPs: 19.96 | 31: iteration 32350/ 173500 | consumed samples: 8281600 | consumed tokens: 16960716800 | elapsed time per iteration (s): 0.78 | learning rate: 1.863E-04 | global batch size: 256 | lm loss: 2.154005E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.860 | TFLOPs: 19.83 | 31: iteration 32360/ 173500 | consumed samples: 8284160 | consumed tokens: 16965959680 | elapsed time per iteration (s): 0.83 | learning rate: 1.862E-04 | global batch size: 256 | lm loss: 2.129160E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.960 | TFLOPs: 18.75 | 31: iteration 32370/ 173500 | consumed samples: 8286720 | consumed tokens: 16971202560 | elapsed time per iteration (s): 0.78 | learning rate: 1.862E-04 | global batch size: 256 | lm loss: 2.134135E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.061 | TFLOPs: 19.85 | 31: iteration 32380/ 173500 | consumed samples: 8289280 | consumed tokens: 16976445440 | elapsed time per iteration (s): 0.81 | learning rate: 1.862E-04 | global batch size: 256 | lm loss: 2.128918E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.746 | TFLOPs: 19.16 | 31: iteration 32390/ 173500 | consumed samples: 8291840 | consumed tokens: 16981688320 | elapsed time per iteration (s): 0.80 | learning rate: 1.862E-04 | global batch size: 256 | lm loss: 2.131176E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.742 | TFLOPs: 19.28 | 31: iteration 32400/ 173500 | consumed samples: 8294400 | consumed tokens: 16986931200 | elapsed time per iteration (s): 0.80 | learning rate: 1.862E-04 | global batch size: 256 | lm loss: 2.116762E+00 | grad norm: 
0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.922 | TFLOPs: 19.29 | 31: iteration 32410/ 173500 | consumed samples: 8296960 | consumed tokens: 16992174080 | elapsed time per iteration (s): 0.83 | learning rate: 1.862E-04 | global batch size: 256 | lm loss: 2.121517E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.088 | TFLOPs: 18.58 | 31: iteration 32420/ 173500 | consumed samples: 8299520 | consumed tokens: 16997416960 | elapsed time per iteration (s): 0.78 | learning rate: 1.862E-04 | global batch size: 256 | lm loss: 2.141203E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.580 | TFLOPs: 19.82 | 31: iteration 32430/ 173500 | consumed samples: 8302080 | consumed tokens: 17002659840 | elapsed time per iteration (s): 0.82 | learning rate: 1.862E-04 | global batch size: 256 | lm loss: 2.128056E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.596 | TFLOPs: 18.91 | 31: iteration 32440/ 173500 | consumed samples: 8304640 | consumed tokens: 17007902720 | elapsed time per iteration (s): 0.77 | learning rate: 1.862E-04 | global batch size: 256 | lm loss: 2.153741E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.547 | TFLOPs: 20.24 | 31: iteration 32450/ 173500 | consumed samples: 8307200 | consumed tokens: 17013145600 | elapsed time per iteration (s): 0.80 | learning rate: 1.862E-04 | global batch size: 256 | lm loss: 2.118748E+00 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.262 | TFLOPs: 19.25 | 31: iteration 32460/ 173500 | consumed samples: 8309760 | consumed tokens: 17018388480 | elapsed time per 
iteration (s): 0.79 | learning rate: 1.862E-04 | global batch size: 256 | lm loss: 2.135432E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.826 | TFLOPs: 19.71 | 31: iteration 32470/ 173500 | consumed samples: 8312320 | consumed tokens: 17023631360 | elapsed time per iteration (s): 0.81 | learning rate: 1.862E-04 | global batch size: 256 | lm loss: 2.143175E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.023 | TFLOPs: 19.06 | 31: iteration 32480/ 173500 | consumed samples: 8314880 | consumed tokens: 17028874240 | elapsed time per iteration (s): 0.80 | learning rate: 1.861E-04 | global batch size: 256 | lm loss: 2.131350E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.888 | TFLOPs: 19.35 | 31: iteration 32490/ 173500 | consumed samples: 8317440 | consumed tokens: 17034117120 | elapsed time per iteration (s): 0.80 | learning rate: 1.861E-04 | global batch size: 256 | lm loss: 2.085392E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.013 | TFLOPs: 19.36 | 31: iteration 32500/ 173500 | consumed samples: 8320000 | consumed tokens: 17039360000 | elapsed time per iteration (s): 0.82 | learning rate: 1.861E-04 | global batch size: 256 | lm loss: 2.111430E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.571 | TFLOPs: 18.97 | 31: iteration 32510/ 173500 | consumed samples: 8322560 | consumed tokens: 17044602880 | elapsed time per iteration (s): 0.81 | learning rate: 1.861E-04 | global batch size: 256 | lm loss: 2.113008E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.584 | TFLOPs: 19.21 | 31: 
iteration 32520/ 173500 | consumed samples: 8325120 | consumed tokens: 17049845760 | elapsed time per iteration (s): 0.81 | learning rate: 1.861E-04 | global batch size: 256 | lm loss: 2.128770E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.647 | TFLOPs: 19.16 | 31: iteration 32530/ 173500 | consumed samples: 8327680 | consumed tokens: 17055088640 | elapsed time per iteration (s): 0.80 | learning rate: 1.861E-04 | global batch size: 256 | lm loss: 2.115928E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.909 | TFLOPs: 19.41 | 31: iteration 32540/ 173500 | consumed samples: 8330240 | consumed tokens: 17060331520 | elapsed time per iteration (s): 0.81 | learning rate: 1.861E-04 | global batch size: 256 | lm loss: 2.115549E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.231 | TFLOPs: 19.13 | 31: iteration 32550/ 173500 | consumed samples: 8332800 | consumed tokens: 17065574400 | elapsed time per iteration (s): 0.76 | learning rate: 1.861E-04 | global batch size: 256 | lm loss: 2.150715E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.791 | TFLOPs: 20.31 | 31: iteration 32560/ 173500 | consumed samples: 8335360 | consumed tokens: 17070817280 | elapsed time per iteration (s): 0.79 | learning rate: 1.861E-04 | global batch size: 256 | lm loss: 2.130252E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.063 | TFLOPs: 19.67 | 31: iteration 32570/ 173500 | consumed samples: 8337920 | consumed tokens: 17076060160 | elapsed time per iteration (s): 0.77 | learning rate: 1.861E-04 | global batch size: 256 | lm loss: 2.119053E+00 | grad norm: 0.134 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.936 | TFLOPs: 20.14 | 31: iteration 32580/ 173500 | consumed samples: 8340480 | consumed tokens: 17081303040 | elapsed time per iteration (s): 0.84 | learning rate: 1.861E-04 | global batch size: 256 | lm loss: 2.113548E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.300 | TFLOPs: 18.47 | 31: iteration 32590/ 173500 | consumed samples: 8343040 | consumed tokens: 17086545920 | elapsed time per iteration (s): 0.75 | learning rate: 1.860E-04 | global batch size: 256 | lm loss: 2.134815E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.919 | TFLOPs: 20.75 | 31: iteration 32600/ 173500 | consumed samples: 8345600 | consumed tokens: 17091788800 | elapsed time per iteration (s): 0.74 | learning rate: 1.860E-04 | global batch size: 256 | lm loss: 2.121692E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.220 | TFLOPs: 20.82 | 31: iteration 32610/ 173500 | consumed samples: 8348160 | consumed tokens: 17097031680 | elapsed time per iteration (s): 0.76 | learning rate: 1.860E-04 | global batch size: 256 | lm loss: 2.090622E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.034 | TFLOPs: 20.33 | 31: iteration 32620/ 173500 | consumed samples: 8350720 | consumed tokens: 17102274560 | elapsed time per iteration (s): 0.79 | learning rate: 1.860E-04 | global batch size: 256 | lm loss: 2.130239E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.198 | TFLOPs: 19.49 | 31: iteration 32630/ 173500 | consumed samples: 8353280 | consumed tokens: 17107517440 | elapsed time per iteration (s): 0.75 | learning rate: 
1.860E-04 | global batch size: 256 | lm loss: 2.110670E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.729 | TFLOPs: 20.55 | 31: iteration 32640/ 173500 | consumed samples: 8355840 | consumed tokens: 17112760320 | elapsed time per iteration (s): 0.83 | learning rate: 1.860E-04 | global batch size: 256 | lm loss: 2.112907E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.871 | TFLOPs: 18.69 | 31: iteration 32650/ 173500 | consumed samples: 8358400 | consumed tokens: 17118003200 | elapsed time per iteration (s): 0.71 | learning rate: 1.860E-04 | global batch size: 256 | lm loss: 2.095515E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.455 | TFLOPs: 21.69 | 31: iteration 32660/ 173500 | consumed samples: 8360960 | consumed tokens: 17123246080 | elapsed time per iteration (s): 0.81 | learning rate: 1.860E-04 | global batch size: 256 | lm loss: 2.091407E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.481 | TFLOPs: 19.15 | 31: iteration 32670/ 173500 | consumed samples: 8363520 | consumed tokens: 17128488960 | elapsed time per iteration (s): 0.78 | learning rate: 1.860E-04 | global batch size: 256 | lm loss: 2.134782E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.412 | TFLOPs: 19.81 | 31: iteration 32680/ 173500 | consumed samples: 8366080 | consumed tokens: 17133731840 | elapsed time per iteration (s): 0.74 | learning rate: 1.860E-04 | global batch size: 256 | lm loss: 2.132282E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.331 | TFLOPs: 20.95 | 31: iteration 32690/ 173500 | consumed 
samples: 8368640 | consumed tokens: 17138974720 | elapsed time per iteration (s): 0.77 | learning rate: 1.860E-04 | global batch size: 256 | lm loss: 2.118683E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.181 | TFLOPs: 20.22 | 31: iteration 32700/ 173500 | consumed samples: 8371200 | consumed tokens: 17144217600 | elapsed time per iteration (s): 0.83 | learning rate: 1.859E-04 | global batch size: 256 | lm loss: 2.117142E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.736 | TFLOPs: 18.68 | 31: iteration 32710/ 173500 | consumed samples: 8373760 | consumed tokens: 17149460480 | elapsed time per iteration (s): 0.81 | learning rate: 1.859E-04 | global batch size: 256 | lm loss: 2.125281E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.526 | TFLOPs: 19.03 | 31: iteration 32720/ 173500 | consumed samples: 8376320 | consumed tokens: 17154703360 | elapsed time per iteration (s): 0.80 | learning rate: 1.859E-04 | global batch size: 256 | lm loss: 2.104965E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.547 | TFLOPs: 19.45 | 31: iteration 32730/ 173500 | consumed samples: 8378880 | consumed tokens: 17159946240 | elapsed time per iteration (s): 0.74 | learning rate: 1.859E-04 | global batch size: 256 | lm loss: 2.130697E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.861 | TFLOPs: 20.98 | 31: iteration 32740/ 173500 | consumed samples: 8381440 | consumed tokens: 17165189120 | elapsed time per iteration (s): 0.79 | learning rate: 1.859E-04 | global batch size: 256 | lm loss: 2.112446E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 323.681 | TFLOPs: 19.58 | 31: iteration 32750/ 173500 | consumed samples: 8384000 | consumed tokens: 17170432000 | elapsed time per iteration (s): 0.88 | learning rate: 1.859E-04 | global batch size: 256 | lm loss: 2.137556E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.604 | TFLOPs: 17.52 | 31: iteration 32760/ 173500 | consumed samples: 8386560 | consumed tokens: 17175674880 | elapsed time per iteration (s): 0.91 | learning rate: 1.859E-04 | global batch size: 256 | lm loss: 2.124542E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 281.210 | TFLOPs: 17.01 | 31: iteration 32770/ 173500 | consumed samples: 8389120 | consumed tokens: 17180917760 | elapsed time per iteration (s): 0.81 | learning rate: 1.859E-04 | global batch size: 256 | lm loss: 2.134840E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.571 | TFLOPs: 19.15 | 31: iteration 32780/ 173500 | consumed samples: 8391680 | consumed tokens: 17186160640 | elapsed time per iteration (s): 0.79 | learning rate: 1.859E-04 | global batch size: 256 | lm loss: 2.109275E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.931 | TFLOPs: 19.60 | 31: iteration 32790/ 173500 | consumed samples: 8394240 | consumed tokens: 17191403520 | elapsed time per iteration (s): 0.82 | learning rate: 1.859E-04 | global batch size: 256 | lm loss: 2.131262E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.411 | TFLOPs: 18.90 | 31: iteration 32800/ 173500 | consumed samples: 8396800 | consumed tokens: 17196646400 | elapsed time per iteration (s): 0.77 | learning rate: 1.859E-04 | global batch size: 256 | lm 
loss: 2.119407E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.534 | TFLOPs: 20.00 | 31: iteration 32810/ 173500 | consumed samples: 8399360 | consumed tokens: 17201889280 | elapsed time per iteration (s): 0.83 | learning rate: 1.859E-04 | global batch size: 256 | lm loss: 2.139497E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.118 | TFLOPs: 18.64 | 31: iteration 32820/ 173500 | consumed samples: 8401920 | consumed tokens: 17207132160 | elapsed time per iteration (s): 0.79 | learning rate: 1.858E-04 | global batch size: 256 | lm loss: 2.142992E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.367 | TFLOPs: 19.50 | 31: iteration 32830/ 173500 | consumed samples: 8404480 | consumed tokens: 17212375040 | elapsed time per iteration (s): 0.77 | learning rate: 1.858E-04 | global batch size: 256 | lm loss: 2.161086E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.214 | TFLOPs: 20.10 | 31: iteration 32840/ 173500 | consumed samples: 8407040 | consumed tokens: 17217617920 | elapsed time per iteration (s): 0.77 | learning rate: 1.858E-04 | global batch size: 256 | lm loss: 2.139079E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.153 | TFLOPs: 20.15 | 31: iteration 32850/ 173500 | consumed samples: 8409600 | consumed tokens: 17222860800 | elapsed time per iteration (s): 0.78 | learning rate: 1.858E-04 | global batch size: 256 | lm loss: 2.106625E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.765 | TFLOPs: 19.83 | 31: iteration 32860/ 173500 | consumed samples: 8412160 | consumed tokens: 
17228103680 | elapsed time per iteration (s): 0.76 | learning rate: 1.858E-04 | global batch size: 256 | lm loss: 2.135330E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.913 | TFLOPs: 20.44 | 31: iteration 32870/ 173500 | consumed samples: 8414720 | consumed tokens: 17233346560 | elapsed time per iteration (s): 0.83 | learning rate: 1.858E-04 | global batch size: 256 | lm loss: 2.115661E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.962 | TFLOPs: 18.69 | 31: iteration 32880/ 173500 | consumed samples: 8417280 | consumed tokens: 17238589440 | elapsed time per iteration (s): 0.83 | learning rate: 1.858E-04 | global batch size: 256 | lm loss: 2.120347E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.607 | TFLOPs: 18.67 | 31: iteration 32890/ 173500 | consumed samples: 8419840 | consumed tokens: 17243832320 | elapsed time per iteration (s): 0.85 | learning rate: 1.858E-04 | global batch size: 256 | lm loss: 2.107331E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.598 | TFLOPs: 18.31 | 31: iteration 32900/ 173500 | consumed samples: 8422400 | consumed tokens: 17249075200 | elapsed time per iteration (s): 0.81 | learning rate: 1.858E-04 | global batch size: 256 | lm loss: 2.117536E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.398 | TFLOPs: 19.02 | 31: iteration 32910/ 173500 | consumed samples: 8424960 | consumed tokens: 17254318080 | elapsed time per iteration (s): 0.79 | learning rate: 1.858E-04 | global batch size: 256 | lm loss: 2.117213E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
324.133 | TFLOPs: 19.61 | 31: iteration 32920/ 173500 | consumed samples: 8427520 | consumed tokens: 17259560960 | elapsed time per iteration (s): 0.80 | learning rate: 1.858E-04 | global batch size: 256 | lm loss: 2.109191E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.879 | TFLOPs: 19.41 | 31: iteration 32930/ 173500 | consumed samples: 8430080 | consumed tokens: 17264803840 | elapsed time per iteration (s): 0.77 | learning rate: 1.857E-04 | global batch size: 256 | lm loss: 2.156540E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.635 | TFLOPs: 20.24 | 31: iteration 32940/ 173500 | consumed samples: 8432640 | consumed tokens: 17270046720 | elapsed time per iteration (s): 0.73 | learning rate: 1.857E-04 | global batch size: 256 | lm loss: 2.151015E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.608 | TFLOPs: 21.21 | 31: iteration 32950/ 173500 | consumed samples: 8435200 | consumed tokens: 17275289600 | elapsed time per iteration (s): 0.78 | learning rate: 1.857E-04 | global batch size: 256 | lm loss: 2.148445E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.636 | TFLOPs: 19.82 | 31: iteration 32960/ 173500 | consumed samples: 8437760 | consumed tokens: 17280532480 | elapsed time per iteration (s): 0.75 | learning rate: 1.857E-04 | global batch size: 256 | lm loss: 2.121701E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.427 | TFLOPs: 20.78 | 31: iteration 32970/ 173500 | consumed samples: 8440320 | consumed tokens: 17285775360 | elapsed time per iteration (s): 0.74 | learning rate: 1.857E-04 | global batch size: 256 | lm loss: 2.132589E+00 | grad norm: 0.136 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.063 | TFLOPs: 21.06 | 31: iteration 32980/ 173500 | consumed samples: 8442880 | consumed tokens: 17291018240 | elapsed time per iteration (s): 0.77 | learning rate: 1.857E-04 | global batch size: 256 | lm loss: 2.149618E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.550 | TFLOPs: 20.06 | 31: iteration 32990/ 173500 | consumed samples: 8445440 | consumed tokens: 17296261120 | elapsed time per iteration (s): 0.77 | learning rate: 1.857E-04 | global batch size: 256 | lm loss: 2.099663E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.855 | TFLOPs: 20.20 | 31: iteration 33000/ 173500 | consumed samples: 8448000 | consumed tokens: 17301504000 | elapsed time per iteration (s): 0.75 | learning rate: 1.857E-04 | global batch size: 256 | lm loss: 2.156413E+00 | grad norm: 0.419 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.269 | TFLOPs: 20.52 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 33000 | lm loss value: 2.092074E+00 | lm loss PPL: 8.101700E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 33000 to checkpoints_1b1long 0: [2022-11-26 01:30:27,798] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step33000 is begin to save! 0: [2022-11-26 01:30:27,810] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_01-model_00-model_states.pt... 0: [2022-11-26 01:30:28,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_01-model_00-model_states.pt. 
0: [2022-11-26 01:30:28,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_03-model_00-model_states.pt... 0: [2022-11-26 01:30:28,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_03-model_00-model_states.pt. 0: [2022-11-26 01:30:28,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_04-model_00-model_states.pt... 0: [2022-11-26 01:30:28,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_04-model_00-model_states.pt. 0: [2022-11-26 01:30:28,188] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_05-model_00-model_states.pt... 0: [2022-11-26 01:30:28,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_05-model_00-model_states.pt. 0: [2022-11-26 01:30:28,267] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_06-model_00-model_states.pt... 0: [2022-11-26 01:30:28,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_06-model_00-model_states.pt. 0: [2022-11-26 01:30:28,341] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_07-model_00-model_states.pt... 0: [2022-11-26 01:30:28,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_07-model_00-model_states.pt. 0: [2022-11-26 01:30:28,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_08-model_00-model_states.pt... 0: [2022-11-26 01:30:28,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_08-model_00-model_states.pt. 
0: [2022-11-26 01:30:28,490] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_09-model_00-model_states.pt... 0: [2022-11-26 01:30:28,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_09-model_00-model_states.pt. 0: [2022-11-26 01:30:28,567] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_10-model_00-model_states.pt... 0: [2022-11-26 01:30:28,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_10-model_00-model_states.pt. 0: [2022-11-26 01:30:28,644] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_11-model_00-model_states.pt... 0: [2022-11-26 01:30:28,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_11-model_00-model_states.pt. 0: [2022-11-26 01:30:28,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_12-model_00-model_states.pt... 0: [2022-11-26 01:30:28,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_12-model_00-model_states.pt. 0: [2022-11-26 01:30:28,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_13-model_00-model_states.pt... 0: [2022-11-26 01:30:28,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_13-model_00-model_states.pt. 0: [2022-11-26 01:30:28,867] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_14-model_00-model_states.pt... 0: [2022-11-26 01:30:28,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_14-model_00-model_states.pt. 
0: [2022-11-26 01:30:28,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_15-model_00-model_states.pt... 0: [2022-11-26 01:30:29,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_15-model_00-model_states.pt. 0: [2022-11-26 01:30:29,014] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_16-model_00-model_states.pt... 0: [2022-11-26 01:30:29,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_16-model_00-model_states.pt. 0: [2022-11-26 01:30:29,091] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_17-model_00-model_states.pt... 0: [2022-11-26 01:30:29,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_17-model_00-model_states.pt. 0: [2022-11-26 01:30:29,165] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_18-model_00-model_states.pt... 0: [2022-11-26 01:30:29,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_18-model_00-model_states.pt. 0: [2022-11-26 01:30:29,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_19-model_00-model_states.pt... 0: [2022-11-26 01:30:29,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_19-model_00-model_states.pt. 0: [2022-11-26 01:30:29,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_20-model_00-model_states.pt... 0: [2022-11-26 01:30:29,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_20-model_00-model_states.pt. 
0: [2022-11-26 01:30:29,386] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_21-model_00-model_states.pt... 0: [2022-11-26 01:30:29,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_21-model_00-model_states.pt. 0: [2022-11-26 01:30:29,460] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_22-model_00-model_states.pt... 0: [2022-11-26 01:30:29,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_22-model_00-model_states.pt. 0: [2022-11-26 01:30:29,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_23-model_00-model_states.pt... 0: [2022-11-26 01:30:29,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_23-model_00-model_states.pt. 0: [2022-11-26 01:30:29,606] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_24-model_00-model_states.pt... 0: [2022-11-26 01:30:29,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_24-model_00-model_states.pt. 0: [2022-11-26 01:30:29,684] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_25-model_00-model_states.pt... 0: [2022-11-26 01:30:29,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_25-model_00-model_states.pt. 0: [2022-11-26 01:30:29,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_26-model_00-model_states.pt... 0: [2022-11-26 01:30:29,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_26-model_00-model_states.pt. 
0: [2022-11-26 01:30:29,832] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_27-model_00-model_states.pt... 0: [2022-11-26 01:30:29,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_27-model_00-model_states.pt. 0: [2022-11-26 01:30:29,904] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_28-model_00-model_states.pt... 0: [2022-11-26 01:30:29,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_28-model_00-model_states.pt. 0: [2022-11-26 01:30:29,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/layer_30-model_00-model_states.pt... 0: [2022-11-26 01:30:29,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/layer_30-model_00-model_states.pt. 0: [2022-11-26 01:30:29,982] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step33000/mp_rank_00_model_states.pt 0: [2022-11-26 01:30:29,982] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/mp_rank_00_model_states.pt... 0: [2022-11-26 01:30:29,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/mp_rank_00_model_states.pt. 0: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
7: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 
3: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 
20: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 
18: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 
19: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
13: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 
31: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
0: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
4: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
16: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
15: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 
23: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
31: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 
21: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 
6: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
10: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
13: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 
11: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 
29: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 
19: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
25: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 
21: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 
28: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:30:30,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 
29: [2022-11-26 01:30:30,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:30:30,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 01:30:30,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 0: [2022-11-26 01:30:30,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:30:30,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:30:30,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 01:30:30,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 18: [2022-11-26 01:30:30,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:30:30,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 01:30:30,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 24: [2022-11-26 01:30:30,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
24: [2022-11-26 01:30:30,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 16: [2022-11-26 01:30:30,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:30:30,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 7: [2022-11-26 01:30:30,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:30:30,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 7: [2022-11-26 01:30:30,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 16: [2022-11-26 01:30:30,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 7: [2022-11-26 01:30:30,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 23: [2022-11-26 01:30:30,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:30:30,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 01:30:30,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 0: [2022-11-26 01:30:30,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-26 01:30:30,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 01:30:30,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 21: [2022-11-26 01:30:30,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:30:30,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 01:30:30,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 30: [2022-11-26 01:30:30,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:30:30,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 01:30:30,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 10: [2022-11-26 01:30:30,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:30:30,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 01:30:30,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 29: [2022-11-26 01:30:30,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-26 01:30:30,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 12: [2022-11-26 01:30:30,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:30:30,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 12: [2022-11-26 01:30:30,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 01:30:30,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 4: [2022-11-26 01:30:30,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:30:30,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 01:30:30,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 25: [2022-11-26 01:30:30,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:30:30,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:30:30,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 01:30:30,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
25: [2022-11-26 01:30:30,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 01:30:30,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 10: [2022-11-26 01:30:30,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:30:30,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:30:30,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:30:30,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 10: [2022-11-26 01:30:30,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 7: [2022-11-26 01:30:30,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 10: [2022-11-26 01:30:30,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 20: [2022-11-26 01:30:30,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 01:30:30,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 18: [2022-11-26 01:30:30,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-26 01:30:30,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 01:30:30,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 26: [2022-11-26 01:30:30,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:30:30,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 01:30:30,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 26: [2022-11-26 01:30:30,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:30:30,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 01:30:30,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 31: [2022-11-26 01:30:30,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:30:30,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 01:30:30,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 6: [2022-11-26 01:30:30,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 01:30:30,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 01:30:30,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 15: [2022-11-26 01:30:30,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:30:30,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 9: [2022-11-26 01:30:30,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:30:30,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:30:30,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 17: [2022-11-26 01:30:30,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:30:30,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:30:30,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 01:30:30,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 01:30:30,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
9: [2022-11-26 01:30:30,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 17: [2022-11-26 01:30:30,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 01:30:30,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 01:30:30,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 17: [2022-11-26 01:30:30,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 8: [2022-11-26 01:30:30,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:30:30,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:30:30,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 01:30:30,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 24: [2022-11-26 01:30:30,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 01:30:30,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 11: [2022-11-26 01:30:30,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-26 01:30:30,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 01:30:30,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 11: [2022-11-26 01:30:30,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:30:30,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 01:30:30,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 0: [2022-11-26 01:30:30,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:30:30,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 01:30:30,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 16: [2022-11-26 01:30:30,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:30:30,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 6: [2022-11-26 01:30:30,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 01:30:30,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 16: [2022-11-26 01:30:30,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 6: [2022-11-26 01:30:30,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 15: [2022-11-26 01:30:30,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:30:30,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 01:30:30,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 5: [2022-11-26 01:30:30,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:30:30,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:30:30,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 31: [2022-11-26 01:30:30,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 5: [2022-11-26 01:30:30,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 5: [2022-11-26 01:30:30,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
31: [2022-11-26 01:30:30,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 5: [2022-11-26 01:30:30,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 01:30:30,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 10: [2022-11-26 01:30:30,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:30:30,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 01:30:30,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 13: [2022-11-26 01:30:30,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:30:30,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 01:30:30,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 2: [2022-11-26 01:30:30,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:30:30,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 25: [2022-11-26 01:30:30,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
2: [2022-11-26 01:30:30,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 25: [2022-11-26 01:30:30,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 01:30:30,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 2: [2022-11-26 01:30:30,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:30:30,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 01:30:30,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 8: [2022-11-26 01:30:30,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:30:30,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 01:30:30,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 18: [2022-11-26 01:30:30,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:30:30,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 01:30:30,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
23: [2022-11-26 01:30:30,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:30:30,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:30:30,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 0: [2022-11-26 01:30:30,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 01:30:30,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 23: [2022-11-26 01:30:30,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 9: [2022-11-26 01:30:30,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:30:30,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 01:30:30,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 30: [2022-11-26 01:30:30,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:30:30,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
30: [2022-11-26 01:30:30,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 01:30:30,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 13: [2022-11-26 01:30:30,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 01:30:30,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 12: [2022-11-26 01:30:30,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:30:30,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:30:30,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:30:30,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
12: [2022-11-26 01:30:30,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 3: [2022-11-26 01:30:30,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 01:30:30,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 12: [2022-11-26 01:30:30,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 3: [2022-11-26 01:30:30,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 01:30:30,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 3: [2022-11-26 01:30:30,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 5: [2022-11-26 01:30:30,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:30:30,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 5: [2022-11-26 01:30:30,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 21: [2022-11-26 01:30:30,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:30:30,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
21: [2022-11-26 01:30:30,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 01:30:30,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 19: [2022-11-26 01:30:30,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:30:30,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 01:30:30,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 29: [2022-11-26 01:30:30,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:30:30,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 01:30:30,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 19: [2022-11-26 01:30:30,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:30:30,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 01:30:30,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 20: [2022-11-26 01:30:30,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-26 01:30:30,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 01:30:30,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 4: [2022-11-26 01:30:30,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:30:30,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 01:30:30,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 4: [2022-11-26 01:30:30,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:30:30,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:30:30,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-26 01:30:30,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 4: [2022-11-26 01:30:30,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 13: [2022-11-26 01:30:30,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 21: [2022-11-26 01:30:30,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 16: [2022-11-26 01:30:30,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:30:30,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 4: [2022-11-26 01:30:30,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 16: [2022-11-26 01:30:30,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 01:30:30,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 18: [2022-11-26 01:30:30,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:30:30,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
18: [2022-11-26 01:30:30,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 26: [2022-11-26 01:30:30,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 18: [2022-11-26 01:30:30,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 26: [2022-11-26 01:30:30,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 10: [2022-11-26 01:30:30,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:30:30,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:30:30,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 11: [2022-11-26 01:30:30,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:30:30,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 10: [2022-11-26 01:30:30,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 11: [2022-11-26 01:30:30,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 6: [2022-11-26 01:30:30,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
11: [2022-11-26 01:30:30,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 23: [2022-11-26 01:30:30,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:30:30,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 01:30:30,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 7: [2022-11-26 01:30:30,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:30:30,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 01:30:30,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 12: [2022-11-26 01:30:30,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:30:30,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 
12: [2022-11-26 01:30:30,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 19: [2022-11-26 01:30:30,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 16: [2022-11-26 01:30:30,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:30:30,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 19: [2022-11-26 01:30:30,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 16: [2022-11-26 01:30:30,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 01:30:30,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 19: [2022-11-26 01:30:30,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:30:30,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:30:30,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 01:30:30,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
9: [2022-11-26 01:30:30,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 01:30:30,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 17: [2022-11-26 01:30:30,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:30:30,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 01:30:30,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:30:30,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 17: [2022-11-26 01:30:30,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 6: [2022-11-26 01:30:30,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:30:30,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 6: [2022-11-26 01:30:30,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 01:30:30,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 7: [2022-11-26 01:30:30,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
8: [2022-11-26 01:30:30,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:30:30,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 01:30:30,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 8: [2022-11-26 01:30:30,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 01:30:30,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 11: [2022-11-26 01:30:30,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:30:30,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 01:30:30,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 24: [2022-11-26 01:30:30,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:30:30,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:30:30,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 01:30:30,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 4: [2022-11-26 01:30:30,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:30:30,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 24: [2022-11-26 01:30:30,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 01:30:30,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 31: [2022-11-26 01:30:30,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:30:30,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 24: [2022-11-26 01:30:30,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 24: [2022-11-26 01:30:30,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 31: [2022-11-26 01:30:30,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 4: [2022-11-26 01:30:30,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 3: [2022-11-26 01:30:30,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
31: [2022-11-26 01:30:30,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 3: [2022-11-26 01:30:30,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 01:30:30,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 31: [2022-11-26 01:30:30,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:30:30,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 01:30:30,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 28: [2022-11-26 01:30:30,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:30:30,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:30:30,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:30:30,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 01:30:30,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 22: [2022-11-26 01:30:30,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
22: [2022-11-26 01:30:30,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:30:30,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:30:30,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:30:30,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 01:30:30,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 01:30:30,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 01:30:30,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 01:30:30,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 22: [2022-11-26 01:30:30,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 22: [2022-11-26 01:30:30,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 22: [2022-11-26 01:30:30,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 26: [2022-11-26 01:30:30,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-26 01:30:30,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 01:30:30,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 14: [2022-11-26 01:30:30,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:30:30,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:30:30,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:30:30,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:30:30,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 01:30:30,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 14: [2022-11-26 01:30:30,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 01:30:30,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 28: [2022-11-26 01:30:30,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
28: [2022-11-26 01:30:30,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:30:30,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 14: [2022-11-26 01:30:30,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 01:30:30,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 28: [2022-11-26 01:30:30,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 14: [2022-11-26 01:30:30,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 14: [2022-11-26 01:30:30,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 28: [2022-11-26 01:30:30,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 14: [2022-11-26 01:30:30,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 14: [2022-11-26 01:30:30,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 5: [2022-11-26 01:30:30,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:30:30,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 01:30:30,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
25: [2022-11-26 01:30:30,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:30:30,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 01:30:30,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 30: [2022-11-26 01:30:30,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:30:30,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 15: [2022-11-26 01:30:30,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:30:30,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:30:30,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 15: [2022-11-26 01:30:30,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 01:30:30,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 30: [2022-11-26 01:30:30,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 15: [2022-11-26 01:30:30,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 30: [2022-11-26 01:30:30,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 15: [2022-11-26 01:30:30,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 01:30:30,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 29: [2022-11-26 01:30:30,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:30:30,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 01:30:30,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 20: [2022-11-26 01:30:30,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:30:30,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 01:30:30,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
1: [2022-11-26 01:30:30,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:30:30,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:30:30,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:30:30,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 01:30:30,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 01:30:30,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 1: [2022-11-26 01:30:30,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 1: [2022-11-26 01:30:30,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 01:30:30,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 20: [2022-11-26 01:30:30,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:30:30,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 01:30:30,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
21: [2022-11-26 01:30:30,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:30:30,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 01:30:30,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 27: [2022-11-26 01:30:30,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:30:30,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:30:30,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:30:30,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
27: [2022-11-26 01:30:30,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 01:30:30,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 01:30:30,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 01:30:30,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 01:30:30,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 27: [2022-11-26 01:30:30,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 27: [2022-11-26 01:30:30,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 27: [2022-11-26 01:30:30,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 25: [2022-11-26 01:30:30,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:30:30,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 01:30:30,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
0: [2022-11-26 01:30:30,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 01:30:30,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 6: [2022-11-26 01:30:30,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:30:30,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 01:30:30,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 18: [2022-11-26 01:30:30,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:30:30,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 01:30:30,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 19: [2022-11-26 01:30:30,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:30:30,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 01:30:30,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 26: [2022-11-26 01:30:30,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 
29: [2022-11-26 01:30:30,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:30:30,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 29: [2022-11-26 01:30:30,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 01:30:30,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 26: [2022-11-26 01:30:30,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 23: [2022-11-26 01:30:30,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:30:30,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:30:30,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 01:30:30,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 17: [2022-11-26 01:30:30,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 01:30:30,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 0: [2022-11-26 01:30:30,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-26 01:30:30,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 01:30:30,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 25: [2022-11-26 01:30:30,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:30:30,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 01:30:30,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 7: [2022-11-26 01:30:30,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:30:30,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 01:30:30,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 14: [2022-11-26 01:30:30,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:30:30,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 01:30:30,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 5: [2022-11-26 01:30:30,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 01:30:30,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 01:30:30,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 10: [2022-11-26 01:30:30,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:30:30,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 01:30:30,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 27: [2022-11-26 01:30:30,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:30:30,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 01:30:30,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 9: [2022-11-26 01:30:30,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:30:30,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 01:30:30,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 30: [2022-11-26 01:30:30,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 
30: [2022-11-26 01:30:30,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 01:30:30,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 12: [2022-11-26 01:30:30,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:30:30,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 01:30:30,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 11: [2022-11-26 01:30:30,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:30:30,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 01:30:30,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 28: [2022-11-26 01:30:30,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:30:30,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 01:30:30,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 2: [2022-11-26 01:30:30,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 01:30:30,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 01:30:30,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 24: [2022-11-26 01:30:30,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:30:30,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:30:30,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 01:30:30,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 24: [2022-11-26 01:30:30,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 01:30:30,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 31: [2022-11-26 01:30:30,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:30:30,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:30:30,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 01:30:30,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
16: [2022-11-26 01:30:30,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 01:30:30,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 13: [2022-11-26 01:30:30,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:30:30,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 3: [2022-11-26 01:30:30,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:30:30,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 3: [2022-11-26 01:30:30,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 01:30:30,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 4: [2022-11-26 01:30:30,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:30:30,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 01:30:30,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 21: [2022-11-26 01:30:30,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-26 01:30:30,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 01:30:30,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 1: [2022-11-26 01:30:30,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:30:30,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:30:30,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 8: [2022-11-26 01:30:30,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 01:30:30,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 1: [2022-11-26 01:30:30,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 15: [2022-11-26 01:30:30,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:30:30,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 01:30:30,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 20: [2022-11-26 01:30:30,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
20: [2022-11-26 01:30:30,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 01:30:30,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 6: [2022-11-26 01:30:30,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:30:30,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 01:30:30,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 19: [2022-11-26 01:30:30,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:30:30,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 01:30:30,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 29: [2022-11-26 01:30:30,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:30:30,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 01:30:30,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 18: [2022-11-26 01:30:30,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
18: [2022-11-26 01:30:30,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 01:30:30,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 0: [2022-11-26 01:30:30,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:30:30,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:30:30,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 23: [2022-11-26 01:30:30,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 01:30:30,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 0: [2022-11-26 01:30:30,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 10: [2022-11-26 01:30:30,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:30:30,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 01:30:30,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 26: [2022-11-26 01:30:30,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
26: [2022-11-26 01:30:30,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 01:30:30,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 17: [2022-11-26 01:30:30,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:30:30,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 01:30:30,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 7: [2022-11-26 01:30:30,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:30:30,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:30:30,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 7: [2022-11-26 01:30:30,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 14: [2022-11-26 01:30:30,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 7: [2022-11-26 01:30:30,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 25: [2022-11-26 01:30:30,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
25: [2022-11-26 01:30:30,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 01:30:30,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 30: [2022-11-26 01:30:30,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:30:30,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 01:30:30,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 2: [2022-11-26 01:30:30,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:30:30,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 01:30:30,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 12: [2022-11-26 01:30:30,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:30:30,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 01:30:30,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 11: [2022-11-26 01:30:30,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 01:30:30,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 01:30:30,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 28: [2022-11-26 01:30:30,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:30:30,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:30:30,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 01:30:30,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 16: [2022-11-26 01:30:30,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:30:30,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 01:30:30,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 28: [2022-11-26 01:30:30,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 01:30:30,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 24: [2022-11-26 01:30:30,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
31: [2022-11-26 01:30:30,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:30:30,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:30:30,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 31: [2022-11-26 01:30:30,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 9: [2022-11-26 01:30:30,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 24: [2022-11-26 01:30:30,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 31: [2022-11-26 01:30:30,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 9: [2022-11-26 01:30:30,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 27: [2022-11-26 01:30:30,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:30:30,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 01:30:30,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 22: [2022-11-26 01:30:30,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-26 01:30:30,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 01:30:30,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 21: [2022-11-26 01:30:30,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:30:30,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 01:30:30,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 17: [2022-11-26 01:30:30,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:30:30,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 01:30:30,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 15: [2022-11-26 01:30:30,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:30:30,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 01:30:30,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 1: [2022-11-26 01:30:30,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
6: [2022-11-26 01:30:30,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:30:30,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:30:30,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 6: [2022-11-26 01:30:30,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 01:30:30,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 4: [2022-11-26 01:30:30,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 01:30:30,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 1: [2022-11-26 01:30:30,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 3: [2022-11-26 01:30:30,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:30:30,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 01:30:30,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 25: [2022-11-26 01:30:30,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
13: [2022-11-26 01:30:30,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:30:30,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 01:30:30,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 25: [2022-11-26 01:30:30,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 01:30:30,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 8: [2022-11-26 01:30:30,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:30:30,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 01:30:30,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 20: [2022-11-26 01:30:30,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:30:30,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 01:30:30,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 0: [2022-11-26 01:30:30,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 01:30:30,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 01:30:30,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 19: [2022-11-26 01:30:30,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:30:30,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 01:30:30,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 18: [2022-11-26 01:30:30,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:30:30,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 01:30:30,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 29: [2022-11-26 01:30:30,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:30:30,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 01:30:30,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 26: [2022-11-26 01:30:30,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
26: [2022-11-26 01:30:30,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 01:30:30,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 7: [2022-11-26 01:30:30,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:30:30,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:30:30,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 23: [2022-11-26 01:30:30,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 7: [2022-11-26 01:30:30,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 23: [2022-11-26 01:30:30,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 14: [2022-11-26 01:30:30,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:30:30,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 01:30:30,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 10: [2022-11-26 01:30:30,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 01:30:30,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 01:30:30,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 11: [2022-11-26 01:30:30,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:30:30,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 01:30:30,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 30: [2022-11-26 01:30:30,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:30:30,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:30:30,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 12: [2022-11-26 01:30:30,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 30: [2022-11-26 01:30:30,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 12: [2022-11-26 01:30:30,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 9: [2022-11-26 01:30:30,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-26 01:30:30,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 01:30:30,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 5: [2022-11-26 01:30:30,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:30:30,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 01:30:30,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 31: [2022-11-26 01:30:30,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:30:30,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 01:30:30,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 28: [2022-11-26 01:30:30,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:30:30,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:30:30,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 01:30:30,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
27: [2022-11-26 01:30:30,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:30:30,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 27: [2022-11-26 01:30:30,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 28: [2022-11-26 01:30:30,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 27: [2022-11-26 01:30:30,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 24: [2022-11-26 01:30:30,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:30:30,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 01:30:30,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 4: [2022-11-26 01:30:30,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:30:30,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 01:30:30,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 0: [2022-11-26 01:30:30,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 01:30:30,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 01:30:30,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 13: [2022-11-26 01:30:30,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:30:30,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 01:30:30,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 6: [2022-11-26 01:30:30,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:30:30,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 01:30:30,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 25: [2022-11-26 01:30:30,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:30:30,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 01:30:30,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 12: [2022-11-26 01:30:30,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-26 01:30:30,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 01:30:30,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 29: [2022-11-26 01:30:30,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:30:30,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 01:30:30,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 3: [2022-11-26 01:30:30,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:30:30,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 01:30:30,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 7: [2022-11-26 01:30:30,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:30:30,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 01:30:30,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 11: [2022-11-26 01:30:30,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 01:30:30,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 01:30:30,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 30: [2022-11-26 01:30:30,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:30:30,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:30:30,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 16: [2022-11-26 01:30:30,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:30:30,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 30: [2022-11-26 01:30:30,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 16: [2022-11-26 01:30:30,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 26: [2022-11-26 01:30:30,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 16: [2022-11-26 01:30:30,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 5: [2022-11-26 01:30:30,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 01:30:30,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 27: [2022-11-26 01:30:30,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:30:30,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 20: [2022-11-26 01:30:30,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:30:30,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 01:30:30,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 15: [2022-11-26 01:30:30,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:30:30,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
27: [2022-11-26 01:30:30,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 15: [2022-11-26 01:30:30,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 9: [2022-11-26 01:30:30,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 15: [2022-11-26 01:30:30,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 21: [2022-11-26 01:30:30,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:30:30,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 27: [2022-11-26 01:30:30,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 21: [2022-11-26 01:30:30,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 01:30:30,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 10: [2022-11-26 01:30:30,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:30:30,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 01:30:30,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
14: [2022-11-26 01:30:30,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:30:30,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:30:30,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 22: [2022-11-26 01:30:30,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 1: [2022-11-26 01:30:30,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:30:30,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 22: [2022-11-26 01:30:30,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 18: [2022-11-26 01:30:30,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:30:30,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 01:30:30,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 1: [2022-11-26 01:30:30,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 01:30:30,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
31: [2022-11-26 01:30:30,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:30:30,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 01:30:30,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 24: [2022-11-26 01:30:30,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:30:30,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 01:30:30,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 19: [2022-11-26 01:30:30,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:30:30,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 01:30:30,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 2: [2022-11-26 01:30:30,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:30:30,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 01:30:30,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
23: [2022-11-26 01:30:30,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:30:30,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:30:30,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 01:30:30,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 17: [2022-11-26 01:30:30,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:30:30,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 01:30:30,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 4: [2022-11-26 01:30:30,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:30:30,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 01:30:30,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 1: [2022-11-26 01:30:30,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 01:30:30,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 01:30:30,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 28: [2022-11-26 01:30:30,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 01:30:30,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 21: [2022-11-26 01:30:30,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:30:30,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 01:30:30,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 15: [2022-11-26 01:30:30,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:30:30,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:30:30,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 15: [2022-11-26 01:30:30,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 22: [2022-11-26 01:30:30,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
15: [2022-11-26 01:30:30,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 8: [2022-11-26 01:30:30,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:30:30,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:30:30,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 01:30:30,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 3: [2022-11-26 01:30:30,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 01:30:30,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 28: [2022-11-26 01:30:30,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:30:30,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:30:30,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 01:30:30,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 8: [2022-11-26 01:30:30,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 01:30:30,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 01:30:30,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 2: [2022-11-26 01:30:30,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:30:30,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 01:30:30,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 8: [2022-11-26 01:30:30,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:30:30,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 01:30:30,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 1: [2022-11-26 01:30:30,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:30:30,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 01:30:30,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
1: [2022-11-26 01:30:30,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 01:30:30,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 13: [2022-11-26 01:30:30,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:30:30,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 01:30:30,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 13: [2022-11-26 01:30:30,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:30:30,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step33000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 01:30:30,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
0: successfully saved checkpoint at iteration 33000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2523.12 31: iteration 33010/ 173500 | consumed samples: 8450560 | consumed tokens: 17306746880 | elapsed time per iteration (s): 1.04 | learning rate: 1.857E-04 | global batch size: 256 | lm loss: 2.144494E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.565 | TFLOPs: 14.92 | 31: iteration 33020/ 173500 | consumed samples: 8453120 | consumed tokens: 17311989760 | elapsed time per iteration (s): 0.75 | learning rate: 1.857E-04 | global batch size: 256 | lm loss: 2.141479E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.277 | TFLOPs: 20.53 | 31: iteration 33030/ 173500 | consumed samples: 8455680 | consumed tokens: 17317232640 | elapsed time per iteration (s): 0.84 | learning rate: 1.857E-04 | global batch size: 256 | lm loss: 2.133925E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.819 | TFLOPs: 18.50 | 31: iteration 33040/ 173500 | consumed samples: 8458240 | consumed tokens: 17322475520 | elapsed time per iteration (s): 0.79 | learning rate: 1.856E-04 | global batch size: 256 | lm loss: 2.101377E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.542 | TFLOPs: 19.51 | 31: iteration 33050/ 173500 | consumed samples: 8460800 | consumed tokens: 17327718400 | elapsed time per iteration (s): 0.78 | learning rate: 1.856E-04 | global batch size: 256 | lm loss: 2.095807E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.335 | TFLOPs: 19.74 | 31: iteration 33060/ 173500 | consumed samples: 8463360 | consumed tokens: 17332961280 | elapsed time per iteration (s): 0.76 | 
learning rate: 1.856E-04 | global batch size: 256 | lm loss: 2.114687E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.784 | TFLOPs: 20.37 | 31: iteration 33070/ 173500 | consumed samples: 8465920 | consumed tokens: 17338204160 | elapsed time per iteration (s): 0.75 | learning rate: 1.856E-04 | global batch size: 256 | lm loss: 2.118257E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.616 | TFLOPs: 20.73 | 31: iteration 33080/ 173500 | consumed samples: 8468480 | consumed tokens: 17343447040 | elapsed time per iteration (s): 0.78 | learning rate: 1.856E-04 | global batch size: 256 | lm loss: 2.089916E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.679 | TFLOPs: 19.82 | 31: iteration 33090/ 173500 | consumed samples: 8471040 | consumed tokens: 17348689920 | elapsed time per iteration (s): 0.77 | learning rate: 1.856E-04 | global batch size: 256 | lm loss: 2.126838E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.014 | TFLOPs: 20.21 | 31: iteration 33100/ 173500 | consumed samples: 8473600 | consumed tokens: 17353932800 | elapsed time per iteration (s): 1.27 | learning rate: 1.856E-04 | global batch size: 256 | lm loss: 2.123779E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 201.220 | TFLOPs: 12.17 | 31: iteration 33110/ 173500 | consumed samples: 8476160 | consumed tokens: 17359175680 | elapsed time per iteration (s): 0.78 | learning rate: 1.856E-04 | global batch size: 256 | lm loss: 2.123240E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.143 | TFLOPs: 19.79 | 31: iteration 33120/ 173500 
| consumed samples: 8478720 | consumed tokens: 17364418560 | elapsed time per iteration (s): 0.74 | learning rate: 1.856E-04 | global batch size: 256 | lm loss: 2.087498E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.630 | TFLOPs: 20.85 | 31: iteration 33130/ 173500 | consumed samples: 8481280 | consumed tokens: 17369661440 | elapsed time per iteration (s): 0.73 | learning rate: 1.856E-04 | global batch size: 256 | lm loss: 2.099986E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.934 | TFLOPs: 21.17 | 31: iteration 33140/ 173500 | consumed samples: 8483840 | consumed tokens: 17374904320 | elapsed time per iteration (s): 0.74 | learning rate: 1.856E-04 | global batch size: 256 | lm loss: 2.100969E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.629 | TFLOPs: 20.85 | 31: iteration 33150/ 173500 | consumed samples: 8486400 | consumed tokens: 17380147200 | elapsed time per iteration (s): 0.83 | learning rate: 1.855E-04 | global batch size: 256 | lm loss: 2.111868E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.960 | TFLOPs: 18.75 | 31: iteration 33160/ 173500 | consumed samples: 8488960 | consumed tokens: 17385390080 | elapsed time per iteration (s): 0.81 | learning rate: 1.855E-04 | global batch size: 256 | lm loss: 2.129785E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.275 | TFLOPs: 19.19 | 31: iteration 33170/ 173500 | consumed samples: 8491520 | consumed tokens: 17390632960 | elapsed time per iteration (s): 0.86 | learning rate: 1.855E-04 | global batch size: 256 | lm loss: 2.120463E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 296.216 | TFLOPs: 17.92 | 31: iteration 33180/ 173500 | consumed samples: 8494080 | consumed tokens: 17395875840 | elapsed time per iteration (s): 0.84 | learning rate: 1.855E-04 | global batch size: 256 | lm loss: 2.123260E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.793 | TFLOPs: 18.44 | 31: iteration 33190/ 173500 | consumed samples: 8496640 | consumed tokens: 17401118720 | elapsed time per iteration (s): 0.81 | learning rate: 1.855E-04 | global batch size: 256 | lm loss: 2.121750E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.974 | TFLOPs: 19.12 | 31: iteration 33200/ 173500 | consumed samples: 8499200 | consumed tokens: 17406361600 | elapsed time per iteration (s): 0.82 | learning rate: 1.855E-04 | global batch size: 256 | lm loss: 2.113108E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.504 | TFLOPs: 18.78 | 31: iteration 33210/ 173500 | consumed samples: 8501760 | consumed tokens: 17411604480 | elapsed time per iteration (s): 0.83 | learning rate: 1.855E-04 | global batch size: 256 | lm loss: 2.127199E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.824 | TFLOPs: 18.74 | 31: iteration 33220/ 173500 | consumed samples: 8504320 | consumed tokens: 17416847360 | elapsed time per iteration (s): 0.81 | learning rate: 1.855E-04 | global batch size: 256 | lm loss: 2.123924E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.416 | TFLOPs: 19.02 | 31: iteration 33230/ 173500 | consumed samples: 8506880 | consumed tokens: 17422090240 | elapsed time per iteration (s): 0.83 | learning rate: 1.855E-04 | global batch size: 
256 | lm loss: 2.118472E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.223 | TFLOPs: 18.71 | 31: iteration 33240/ 173500 | consumed samples: 8509440 | consumed tokens: 17427333120 | elapsed time per iteration (s): 0.80 | learning rate: 1.855E-04 | global batch size: 256 | lm loss: 2.142152E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.138 | TFLOPs: 19.37 | 31: iteration 33250/ 173500 | consumed samples: 8512000 | consumed tokens: 17432576000 | elapsed time per iteration (s): 0.85 | learning rate: 1.855E-04 | global batch size: 256 | lm loss: 2.123071E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.421 | TFLOPs: 18.11 | 31: iteration 33260/ 173500 | consumed samples: 8514560 | consumed tokens: 17437818880 | elapsed time per iteration (s): 0.79 | learning rate: 1.854E-04 | global batch size: 256 | lm loss: 2.090888E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.777 | TFLOPs: 19.71 | 31: iteration 33270/ 173500 | consumed samples: 8517120 | consumed tokens: 17443061760 | elapsed time per iteration (s): 0.79 | learning rate: 1.854E-04 | global batch size: 256 | lm loss: 2.155259E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.678 | TFLOPs: 19.58 | 31: iteration 33280/ 173500 | consumed samples: 8519680 | consumed tokens: 17448304640 | elapsed time per iteration (s): 0.81 | learning rate: 1.854E-04 | global batch size: 256 | lm loss: 2.162077E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.971 | TFLOPs: 19.12 | 31: iteration 33290/ 173500 | consumed samples: 8522240 | consumed 
tokens: 17453547520 | elapsed time per iteration (s): 0.80 | learning rate: 1.854E-04 | global batch size: 256 | lm loss: 2.141950E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.149 | TFLOPs: 19.43 | 31: iteration 33300/ 173500 | consumed samples: 8524800 | consumed tokens: 17458790400 | elapsed time per iteration (s): 0.81 | learning rate: 1.854E-04 | global batch size: 256 | lm loss: 2.097194E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.744 | TFLOPs: 19.22 | 31: iteration 33310/ 173500 | consumed samples: 8527360 | consumed tokens: 17464033280 | elapsed time per iteration (s): 0.86 | learning rate: 1.854E-04 | global batch size: 256 | lm loss: 2.160925E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.374 | TFLOPs: 17.99 | 31: iteration 33320/ 173500 | consumed samples: 8529920 | consumed tokens: 17469276160 | elapsed time per iteration (s): 0.76 | learning rate: 1.854E-04 | global batch size: 256 | lm loss: 2.152182E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.101 | TFLOPs: 20.27 | 31: iteration 33330/ 173500 | consumed samples: 8532480 | consumed tokens: 17474519040 | elapsed time per iteration (s): 0.79 | learning rate: 1.854E-04 | global batch size: 256 | lm loss: 2.126710E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.065 | TFLOPs: 19.54 | 31: iteration 33340/ 173500 | consumed samples: 8535040 | consumed tokens: 17479761920 | elapsed time per iteration (s): 0.75 | learning rate: 1.854E-04 | global batch size: 256 | lm loss: 2.131222E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 341.964 | TFLOPs: 20.69 | 31: iteration 33350/ 173500 | consumed samples: 8537600 | consumed tokens: 17485004800 | elapsed time per iteration (s): 0.80 | learning rate: 1.854E-04 | global batch size: 256 | lm loss: 2.097397E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.145 | TFLOPs: 19.43 | 31: iteration 33360/ 173500 | consumed samples: 8540160 | consumed tokens: 17490247680 | elapsed time per iteration (s): 0.77 | learning rate: 1.854E-04 | global batch size: 256 | lm loss: 2.119689E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.887 | TFLOPs: 20.08 | 31: iteration 33370/ 173500 | consumed samples: 8542720 | consumed tokens: 17495490560 | elapsed time per iteration (s): 0.79 | learning rate: 1.854E-04 | global batch size: 256 | lm loss: 2.123990E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.662 | TFLOPs: 19.52 | 31: iteration 33380/ 173500 | consumed samples: 8545280 | consumed tokens: 17500733440 | elapsed time per iteration (s): 0.83 | learning rate: 1.853E-04 | global batch size: 256 | lm loss: 2.147723E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.285 | TFLOPs: 18.77 | 31: iteration 33390/ 173500 | consumed samples: 8547840 | consumed tokens: 17505976320 | elapsed time per iteration (s): 0.73 | learning rate: 1.853E-04 | global batch size: 256 | lm loss: 2.133891E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.669 | TFLOPs: 21.09 | 31: iteration 33400/ 173500 | consumed samples: 8550400 | consumed tokens: 17511219200 | elapsed time per iteration (s): 0.74 | learning rate: 1.853E-04 | global batch size: 256 | lm loss: 2.126570E+00 | grad norm: 
0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.608 | TFLOPs: 20.85 | 31: iteration 33410/ 173500 | consumed samples: 8552960 | consumed tokens: 17516462080 | elapsed time per iteration (s): 0.76 | learning rate: 1.853E-04 | global batch size: 256 | lm loss: 2.137725E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.462 | TFLOPs: 20.42 | 31: iteration 33420/ 173500 | consumed samples: 8555520 | consumed tokens: 17521704960 | elapsed time per iteration (s): 0.73 | learning rate: 1.853E-04 | global batch size: 256 | lm loss: 2.089305E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.012 | TFLOPs: 21.30 | 31: iteration 33430/ 173500 | consumed samples: 8558080 | consumed tokens: 17526947840 | elapsed time per iteration (s): 0.76 | learning rate: 1.853E-04 | global batch size: 256 | lm loss: 2.120366E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.332 | TFLOPs: 20.41 | 31: iteration 33440/ 173500 | consumed samples: 8560640 | consumed tokens: 17532190720 | elapsed time per iteration (s): 0.79 | learning rate: 1.853E-04 | global batch size: 256 | lm loss: 2.099420E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.879 | TFLOPs: 19.65 | 31: iteration 33450/ 173500 | consumed samples: 8563200 | consumed tokens: 17537433600 | elapsed time per iteration (s): 0.74 | learning rate: 1.853E-04 | global batch size: 256 | lm loss: 2.154762E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.906 | TFLOPs: 20.87 | 31: iteration 33460/ 173500 | consumed samples: 8565760 | consumed tokens: 17542676480 | elapsed time per 
iteration (s): 0.76 | learning rate: 1.853E-04 | global batch size: 256 | lm loss: 2.144266E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.428 | TFLOPs: 20.35 | 31: iteration 33470/ 173500 | consumed samples: 8568320 | consumed tokens: 17547919360 | elapsed time per iteration (s): 0.76 | learning rate: 1.853E-04 | global batch size: 256 | lm loss: 2.112228E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.213 | TFLOPs: 20.28 | 31: iteration 33480/ 173500 | consumed samples: 8570880 | consumed tokens: 17553162240 | elapsed time per iteration (s): 0.76 | learning rate: 1.853E-04 | global batch size: 256 | lm loss: 2.123931E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.993 | TFLOPs: 20.33 | 31: iteration 33490/ 173500 | consumed samples: 8573440 | consumed tokens: 17558405120 | elapsed time per iteration (s): 0.73 | learning rate: 1.852E-04 | global batch size: 256 | lm loss: 2.133385E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.393 | TFLOPs: 21.08 | 31: iteration 33500/ 173500 | consumed samples: 8576000 | consumed tokens: 17563648000 | elapsed time per iteration (s): 0.82 | learning rate: 1.852E-04 | global batch size: 256 | lm loss: 2.119022E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.949 | TFLOPs: 18.93 | 31: iteration 33510/ 173500 | consumed samples: 8578560 | consumed tokens: 17568890880 | elapsed time per iteration (s): 0.78 | learning rate: 1.852E-04 | global batch size: 256 | lm loss: 2.124466E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.131 | TFLOPs: 19.91 | 31: 
iteration 33520/ 173500 | consumed samples: 8581120 | consumed tokens: 17574133760 | elapsed time per iteration (s): 0.79 | learning rate: 1.852E-04 | global batch size: 256 | lm loss: 2.109666E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.128 | TFLOPs: 19.55 | 31: iteration 33530/ 173500 | consumed samples: 8583680 | consumed tokens: 17579376640 | elapsed time per iteration (s): 0.76 | learning rate: 1.852E-04 | global batch size: 256 | lm loss: 2.115720E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.399 | TFLOPs: 20.47 | 31: iteration 33540/ 173500 | consumed samples: 8586240 | consumed tokens: 17584619520 | elapsed time per iteration (s): 0.79 | learning rate: 1.852E-04 | global batch size: 256 | lm loss: 2.123454E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.492 | TFLOPs: 19.63 | 31: iteration 33550/ 173500 | consumed samples: 8588800 | consumed tokens: 17589862400 | elapsed time per iteration (s): 0.82 | learning rate: 1.852E-04 | global batch size: 256 | lm loss: 2.148309E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.775 | TFLOPs: 18.86 | 31: iteration 33560/ 173500 | consumed samples: 8591360 | consumed tokens: 17595105280 | elapsed time per iteration (s): 0.77 | learning rate: 1.852E-04 | global batch size: 256 | lm loss: 2.107276E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.102 | TFLOPs: 20.15 | 31: iteration 33570/ 173500 | consumed samples: 8593920 | consumed tokens: 17600348160 | elapsed time per iteration (s): 0.77 | learning rate: 1.852E-04 | global batch size: 256 | lm loss: 2.132416E+00 | grad norm: 0.143 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.765 | TFLOPs: 20.19 | 31: iteration 33580/ 173500 | consumed samples: 8596480 | consumed tokens: 17605591040 | elapsed time per iteration (s): 0.72 | learning rate: 1.852E-04 | global batch size: 256 | lm loss: 2.123507E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.020 | TFLOPs: 21.48 | 31: iteration 33590/ 173500 | consumed samples: 8599040 | consumed tokens: 17610833920 | elapsed time per iteration (s): 0.74 | learning rate: 1.852E-04 | global batch size: 256 | lm loss: 2.130402E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.016 | TFLOPs: 20.87 | 31: iteration 33600/ 173500 | consumed samples: 8601600 | consumed tokens: 17616076800 | elapsed time per iteration (s): 0.80 | learning rate: 1.851E-04 | global batch size: 256 | lm loss: 2.112873E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.258 | TFLOPs: 19.37 | 31: iteration 33610/ 173500 | consumed samples: 8604160 | consumed tokens: 17621319680 | elapsed time per iteration (s): 0.74 | learning rate: 1.851E-04 | global batch size: 256 | lm loss: 2.131635E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.382 | TFLOPs: 20.83 | 31: iteration 33620/ 173500 | consumed samples: 8606720 | consumed tokens: 17626562560 | elapsed time per iteration (s): 0.77 | learning rate: 1.851E-04 | global batch size: 256 | lm loss: 2.111623E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.114 | TFLOPs: 20.03 | 31: iteration 33630/ 173500 | consumed samples: 8609280 | consumed tokens: 17631805440 | elapsed time per iteration (s): 0.77 | learning rate: 
1.851E-04 | global batch size: 256 | lm loss: 2.110158E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.440 | TFLOPs: 20.11 | 31: iteration 33640/ 173500 | consumed samples: 8611840 | consumed tokens: 17637048320 | elapsed time per iteration (s): 0.81 | learning rate: 1.851E-04 | global batch size: 256 | lm loss: 2.136126E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.221 | TFLOPs: 19.19 | 31: iteration 33650/ 173500 | consumed samples: 8614400 | consumed tokens: 17642291200 | elapsed time per iteration (s): 0.78 | learning rate: 1.851E-04 | global batch size: 256 | lm loss: 2.130762E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.439 | TFLOPs: 19.87 | 31: iteration 33660/ 173500 | consumed samples: 8616960 | consumed tokens: 17647534080 | elapsed time per iteration (s): 0.81 | learning rate: 1.851E-04 | global batch size: 256 | lm loss: 2.125444E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.043 | TFLOPs: 19.12 | 31: iteration 33670/ 173500 | consumed samples: 8619520 | consumed tokens: 17652776960 | elapsed time per iteration (s): 0.84 | learning rate: 1.851E-04 | global batch size: 256 | lm loss: 2.105508E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.101 | TFLOPs: 18.40 | 31: iteration 33680/ 173500 | consumed samples: 8622080 | consumed tokens: 17658019840 | elapsed time per iteration (s): 0.81 | learning rate: 1.851E-04 | global batch size: 256 | lm loss: 2.116070E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.570 | TFLOPs: 19.15 | 31: iteration 33690/ 173500 | consumed 
samples: 8624640 | consumed tokens: 17663262720 | elapsed time per iteration (s): 0.82 | learning rate: 1.851E-04 | global batch size: 256 | lm loss: 2.111536E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.563 | TFLOPs: 18.85 | 31: iteration 33700/ 173500 | consumed samples: 8627200 | consumed tokens: 17668505600 | elapsed time per iteration (s): 0.80 | learning rate: 1.851E-04 | global batch size: 256 | lm loss: 2.140908E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.508 | TFLOPs: 19.45 | 31: iteration 33710/ 173500 | consumed samples: 8629760 | consumed tokens: 17673748480 | elapsed time per iteration (s): 0.80 | learning rate: 1.850E-04 | global batch size: 256 | lm loss: 2.156019E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.669 | TFLOPs: 19.34 | 31: iteration 33720/ 173500 | consumed samples: 8632320 | consumed tokens: 17678991360 | elapsed time per iteration (s): 0.82 | learning rate: 1.850E-04 | global batch size: 256 | lm loss: 2.116322E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.100 | TFLOPs: 18.82 | 31: iteration 33730/ 173500 | consumed samples: 8634880 | consumed tokens: 17684234240 | elapsed time per iteration (s): 0.82 | learning rate: 1.850E-04 | global batch size: 256 | lm loss: 2.142693E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.391 | TFLOPs: 18.84 | 31: iteration 33740/ 173500 | consumed samples: 8637440 | consumed tokens: 17689477120 | elapsed time per iteration (s): 0.82 | learning rate: 1.850E-04 | global batch size: 256 | lm loss: 2.130397E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 311.345 | TFLOPs: 18.84 | 31: iteration 33750/ 173500 | consumed samples: 8640000 | consumed tokens: 17694720000 | elapsed time per iteration (s): 0.84 | learning rate: 1.850E-04 | global batch size: 256 | lm loss: 2.132523E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.971 | TFLOPs: 18.45 | 31: iteration 33760/ 173500 | consumed samples: 8642560 | consumed tokens: 17699962880 | elapsed time per iteration (s): 0.83 | learning rate: 1.850E-04 | global batch size: 256 | lm loss: 2.163160E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.783 | TFLOPs: 18.56 | 31: iteration 33770/ 173500 | consumed samples: 8645120 | consumed tokens: 17705205760 | elapsed time per iteration (s): 0.84 | learning rate: 1.850E-04 | global batch size: 256 | lm loss: 2.125592E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.741 | TFLOPs: 18.50 | 31: iteration 33780/ 173500 | consumed samples: 8647680 | consumed tokens: 17710448640 | elapsed time per iteration (s): 0.83 | learning rate: 1.850E-04 | global batch size: 256 | lm loss: 2.122165E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.344 | TFLOPs: 18.71 | 31: iteration 33790/ 173500 | consumed samples: 8650240 | consumed tokens: 17715691520 | elapsed time per iteration (s): 0.81 | learning rate: 1.850E-04 | global batch size: 256 | lm loss: 2.138470E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.106 | TFLOPs: 19.12 | 31: iteration 33800/ 173500 | consumed samples: 8652800 | consumed tokens: 17720934400 | elapsed time per iteration (s): 0.83 | learning rate: 1.850E-04 | global batch size: 256 | lm 
loss: 2.126184E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.966 | TFLOPs: 18.75 | 31: iteration 33810/ 173500 | consumed samples: 8655360 | consumed tokens: 17726177280 | elapsed time per iteration (s): 0.83 | learning rate: 1.850E-04 | global batch size: 256 | lm loss: 2.143212E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.088 | TFLOPs: 18.70 | 31: iteration 33820/ 173500 | consumed samples: 8657920 | consumed tokens: 17731420160 | elapsed time per iteration (s): 0.84 | learning rate: 1.849E-04 | global batch size: 256 | lm loss: 2.127333E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.211 | TFLOPs: 18.52 | 31: iteration 33830/ 173500 | consumed samples: 8660480 | consumed tokens: 17736663040 | elapsed time per iteration (s): 0.85 | learning rate: 1.849E-04 | global batch size: 256 | lm loss: 2.140985E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.970 | TFLOPs: 18.21 | 31: iteration 33840/ 173500 | consumed samples: 8663040 | consumed tokens: 17741905920 | elapsed time per iteration (s): 0.84 | learning rate: 1.849E-04 | global batch size: 256 | lm loss: 2.155273E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.880 | TFLOPs: 18.44 | 31: iteration 33850/ 173500 | consumed samples: 8665600 | consumed tokens: 17747148800 | elapsed time per iteration (s): 0.83 | learning rate: 1.849E-04 | global batch size: 256 | lm loss: 2.156997E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.201 | TFLOPs: 18.71 | 31: iteration 33860/ 173500 | consumed samples: 8668160 | consumed tokens: 
17752391680 | elapsed time per iteration (s): 0.80 | learning rate: 1.849E-04 | global batch size: 256 | lm loss: 2.101925E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.323 | TFLOPs: 19.26 | 31: iteration 33870/ 173500 | consumed samples: 8670720 | consumed tokens: 17757634560 | elapsed time per iteration (s): 0.80 | learning rate: 1.849E-04 | global batch size: 256 | lm loss: 2.118828E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.144 | TFLOPs: 19.31 | 31: iteration 33880/ 173500 | consumed samples: 8673280 | consumed tokens: 17762877440 | elapsed time per iteration (s): 0.82 | learning rate: 1.849E-04 | global batch size: 256 | lm loss: 2.104293E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.678 | TFLOPs: 18.86 | 31: iteration 33890/ 173500 | consumed samples: 8675840 | consumed tokens: 17768120320 | elapsed time per iteration (s): 0.82 | learning rate: 1.849E-04 | global batch size: 256 | lm loss: 2.113377E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.310 | TFLOPs: 18.89 | 31: iteration 33900/ 173500 | consumed samples: 8678400 | consumed tokens: 17773363200 | elapsed time per iteration (s): 0.82 | learning rate: 1.849E-04 | global batch size: 256 | lm loss: 2.114562E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.910 | TFLOPs: 18.93 | 31: iteration 33910/ 173500 | consumed samples: 8680960 | consumed tokens: 17778606080 | elapsed time per iteration (s): 0.81 | learning rate: 1.849E-04 | global batch size: 256 | lm loss: 2.133645E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
316.146 | TFLOPs: 19.13 | 31: iteration 33920/ 173500 | consumed samples: 8683520 | consumed tokens: 17783848960 | elapsed time per iteration (s): 0.80 | learning rate: 1.849E-04 | global batch size: 256 | lm loss: 2.122584E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.731 | TFLOPs: 19.40 | 31: iteration 33930/ 173500 | consumed samples: 8686080 | consumed tokens: 17789091840 | elapsed time per iteration (s): 0.78 | learning rate: 1.848E-04 | global batch size: 256 | lm loss: 2.098135E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.828 | TFLOPs: 19.83 | 31: iteration 33940/ 173500 | consumed samples: 8688640 | consumed tokens: 17794334720 | elapsed time per iteration (s): 0.85 | learning rate: 1.848E-04 | global batch size: 256 | lm loss: 2.117066E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.668 | TFLOPs: 18.31 | 31: iteration 33950/ 173500 | consumed samples: 8691200 | consumed tokens: 17799577600 | elapsed time per iteration (s): 0.80 | learning rate: 1.848E-04 | global batch size: 256 | lm loss: 2.119698E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.316 | TFLOPs: 19.32 | 31: iteration 33960/ 173500 | consumed samples: 8693760 | consumed tokens: 17804820480 | elapsed time per iteration (s): 0.84 | learning rate: 1.848E-04 | global batch size: 256 | lm loss: 2.123427E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.405 | TFLOPs: 18.48 | 31: iteration 33970/ 173500 | consumed samples: 8696320 | consumed tokens: 17810063360 | elapsed time per iteration (s): 0.82 | learning rate: 1.848E-04 | global batch size: 256 | lm loss: 2.097015E+00 | grad norm: 0.133 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.949 | TFLOPs: 18.81 | 31: iteration 33980/ 173500 | consumed samples: 8698880 | consumed tokens: 17815306240 | elapsed time per iteration (s): 0.83 | learning rate: 1.848E-04 | global batch size: 256 | lm loss: 2.117338E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.435 | TFLOPs: 18.66 | 31: iteration 33990/ 173500 | consumed samples: 8701440 | consumed tokens: 17820549120 | elapsed time per iteration (s): 0.82 | learning rate: 1.848E-04 | global batch size: 256 | lm loss: 2.126251E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.338 | TFLOPs: 18.96 | 0: [2022-11-26 01:43:52,490] [INFO] [logging.py:68:log_dist] [Rank 0] step=34000, skipped=0, lr=[0.00018477830620634072, 0.00018477830620634072, 0.00018477830620634072], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 34000/ 173500 | consumed samples: 8704000 | consumed tokens: 17825792000 | elapsed time per iteration (s): 0.92 | learning rate: 1.848E-04 | global batch size: 256 | lm loss: 2.137839E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 278.195 | TFLOPs: 16.83 | 0: steps: 34000 loss: 2.1802 iter time (s): 0.793 samples/sec: 322.906 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 34000 | lm loss value: 2.064054E+00 | lm loss PPL: 7.877840E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 34000 to checkpoints_1b1long 0: [2022-11-26 01:43:52,841] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step34000 is begin to save! 
0: [2022-11-26 01:43:53,035] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_01-model_00-model_states.pt... 0: [2022-11-26 01:43:53,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_01-model_00-model_states.pt. 0: [2022-11-26 01:43:53,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_03-model_00-model_states.pt... 0: [2022-11-26 01:43:53,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_03-model_00-model_states.pt. 0: [2022-11-26 01:43:53,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_04-model_00-model_states.pt... 0: [2022-11-26 01:43:53,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_04-model_00-model_states.pt. 0: [2022-11-26 01:43:53,386] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_05-model_00-model_states.pt... 0: [2022-11-26 01:43:53,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_05-model_00-model_states.pt. 0: [2022-11-26 01:43:53,459] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_06-model_00-model_states.pt... 0: [2022-11-26 01:43:53,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_06-model_00-model_states.pt. 0: [2022-11-26 01:43:53,531] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_07-model_00-model_states.pt... 0: [2022-11-26 01:43:53,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 01:43:53,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_08-model_00-model_states.pt... 0: [2022-11-26 01:43:53,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_08-model_00-model_states.pt. 0: [2022-11-26 01:43:53,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_09-model_00-model_states.pt... 0: [2022-11-26 01:43:53,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_09-model_00-model_states.pt. 0: [2022-11-26 01:43:53,751] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_10-model_00-model_states.pt... 0: [2022-11-26 01:43:53,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_10-model_00-model_states.pt. 0: [2022-11-26 01:43:53,827] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_11-model_00-model_states.pt... 0: [2022-11-26 01:43:53,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_11-model_00-model_states.pt. 0: [2022-11-26 01:43:53,897] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_12-model_00-model_states.pt... 0: [2022-11-26 01:43:53,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_12-model_00-model_states.pt. 0: [2022-11-26 01:43:53,970] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_13-model_00-model_states.pt... 0: [2022-11-26 01:43:54,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 01:43:54,043] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_14-model_00-model_states.pt... 0: [2022-11-26 01:43:54,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_14-model_00-model_states.pt. 0: [2022-11-26 01:43:54,118] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_15-model_00-model_states.pt... 0: [2022-11-26 01:43:54,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_15-model_00-model_states.pt. 0: [2022-11-26 01:43:54,189] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_16-model_00-model_states.pt... 0: [2022-11-26 01:43:54,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_16-model_00-model_states.pt. 0: [2022-11-26 01:43:54,264] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_17-model_00-model_states.pt... 0: [2022-11-26 01:43:54,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_17-model_00-model_states.pt. 0: [2022-11-26 01:43:54,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_18-model_00-model_states.pt... 0: [2022-11-26 01:43:54,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_18-model_00-model_states.pt. 0: [2022-11-26 01:43:54,411] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_19-model_00-model_states.pt... 0: [2022-11-26 01:43:54,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 01:43:54,483] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_20-model_00-model_states.pt... 0: [2022-11-26 01:43:54,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_20-model_00-model_states.pt. 0: [2022-11-26 01:43:54,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_21-model_00-model_states.pt... 0: [2022-11-26 01:43:54,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_21-model_00-model_states.pt. 0: [2022-11-26 01:43:54,631] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_22-model_00-model_states.pt... 0: [2022-11-26 01:43:54,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_22-model_00-model_states.pt. 0: [2022-11-26 01:43:54,701] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_23-model_00-model_states.pt... 0: [2022-11-26 01:43:54,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_23-model_00-model_states.pt. 0: [2022-11-26 01:43:54,774] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_24-model_00-model_states.pt... 0: [2022-11-26 01:43:54,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_24-model_00-model_states.pt. 0: [2022-11-26 01:43:54,849] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_25-model_00-model_states.pt... 0: [2022-11-26 01:43:54,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 01:43:54,923] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_26-model_00-model_states.pt... 0: [2022-11-26 01:43:54,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_26-model_00-model_states.pt. 0: [2022-11-26 01:43:54,994] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_27-model_00-model_states.pt... 0: [2022-11-26 01:43:55,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_27-model_00-model_states.pt. 0: [2022-11-26 01:43:55,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_28-model_00-model_states.pt... 0: [2022-11-26 01:43:55,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_28-model_00-model_states.pt. 0: [2022-11-26 01:43:55,142] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/layer_30-model_00-model_states.pt... 0: [2022-11-26 01:43:55,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/layer_30-model_00-model_states.pt. 0: [2022-11-26 01:43:55,144] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step34000/mp_rank_00_model_states.pt 0: [2022-11-26 01:43:55,144] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/mp_rank_00_model_states.pt... 0: [2022-11-26 01:43:55,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/mp_rank_00_model_states.pt. 0: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
0: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
9: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 
16: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
3: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 
25: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 
28: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
31: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 
21: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
26: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
5: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 
1: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
13: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 
23: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
31: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 
30: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 
26: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
7: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 
10: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
15: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
31: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 
26: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 
8: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 
19: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 
0: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:43:55,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:43:55,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:43:55,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 01:43:55,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 8: [2022-11-26 01:43:55,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:43:55,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 01:43:55,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 16: [2022-11-26 01:43:55,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:43:55,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 01:43:55,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 0: [2022-11-26 01:43:55,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 01:43:55,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 01:43:55,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 10: [2022-11-26 01:43:55,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:43:55,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 01:43:55,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 26: [2022-11-26 01:43:55,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:43:55,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 5: [2022-11-26 01:43:55,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:43:55,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:43:55,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-26 01:43:55,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 01:43:55,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 
25: [2022-11-26 01:43:55,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 01:43:55,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 15: [2022-11-26 01:43:55,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:43:55,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 01:43:55,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 26: [2022-11-26 01:43:55,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:43:55,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:43:55,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 6: [2022-11-26 01:43:55,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 26: [2022-11-26 01:43:55,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 6: [2022-11-26 01:43:55,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 9: [2022-11-26 01:43:55,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-26 01:43:55,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 01:43:55,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 13: [2022-11-26 01:43:55,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:43:55,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:43:55,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:43:55,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 01:43:55,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 01:43:55,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 13: [2022-11-26 01:43:55,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 31: [2022-11-26 01:43:55,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:43:55,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 01:43:55,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 
11: [2022-11-26 01:43:55,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:43:55,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 01:43:55,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 19: [2022-11-26 01:43:55,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:43:55,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 01:43:55,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 24: [2022-11-26 01:43:55,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:43:55,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:43:55,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 24: [2022-11-26 01:43:55,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 10: [2022-11-26 01:43:55,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 
20: [2022-11-26 01:43:55,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:43:55,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:43:55,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 20: [2022-11-26 01:43:55,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 01:43:55,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 01:43:55,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 1: [2022-11-26 01:43:55,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:43:55,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 1: [2022-11-26 01:43:55,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 30: [2022-11-26 01:43:55,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:43:55,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:43:55,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 
30: [2022-11-26 01:43:55,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 01:43:55,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 01:43:55,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 30: [2022-11-26 01:43:55,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 12: [2022-11-26 01:43:55,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:43:55,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:43:55,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 5: [2022-11-26 01:43:55,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 12: [2022-11-26 01:43:55,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-26 01:43:55,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 7: [2022-11-26 01:43:55,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 01:43:55,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 01:43:55,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 1: [2022-11-26 01:43:55,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:43:55,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 01:43:55,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 11: [2022-11-26 01:43:55,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:43:55,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 01:43:55,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 17: [2022-11-26 01:43:55,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:43:55,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
17: [2022-11-26 01:43:55,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 0: [2022-11-26 01:43:55,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 17: [2022-11-26 01:43:55,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 0: [2022-11-26 01:43:55,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 17: [2022-11-26 01:43:55,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:43:55,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 21: [2022-11-26 01:43:55,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:43:55,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 01:43:55,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 17: [2022-11-26 01:43:55,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 29: [2022-11-26 01:43:55,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
29: [2022-11-26 01:43:55,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 01:43:55,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 19: [2022-11-26 01:43:55,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:43:55,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 01:43:55,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 8: [2022-11-26 01:43:55,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:43:55,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 01:43:55,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 25: [2022-11-26 01:43:55,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:43:55,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 01:43:55,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 9: [2022-11-26 01:43:55,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-26 01:43:55,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 29: [2022-11-26 01:43:55,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:43:55,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 01:43:55,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 9: [2022-11-26 01:43:55,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 23: [2022-11-26 01:43:55,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:43:55,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 01:43:55,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 17: [2022-11-26 01:43:55,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:43:55,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 01:43:55,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 17: [2022-11-26 01:43:55,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 6: [2022-11-26 01:43:55,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 23: [2022-11-26 01:43:55,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:43:55,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 23: [2022-11-26 01:43:55,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 01:43:55,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 26: [2022-11-26 01:43:55,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:43:55,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 01:43:55,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 16: [2022-11-26 01:43:55,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:43:55,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 
16: [2022-11-26 01:43:55,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 20: [2022-11-26 01:43:55,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 16: [2022-11-26 01:43:55,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 20: [2022-11-26 01:43:55,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 31: [2022-11-26 01:43:55,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:43:55,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:43:55,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 01:43:55,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 16: [2022-11-26 01:43:55,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 01:43:55,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 8: [2022-11-26 01:43:55,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 01:43:55,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 01:43:55,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 2: [2022-11-26 01:43:55,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:43:55,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:43:55,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 01:43:55,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 01:43:55,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 2: [2022-11-26 01:43:55,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 9: [2022-11-26 01:43:55,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:43:55,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 01:43:55,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 11: [2022-11-26 01:43:55,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-26 01:43:55,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 01:43:55,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 29: [2022-11-26 01:43:55,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:43:55,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 12: [2022-11-26 01:43:55,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:43:55,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 12: [2022-11-26 01:43:55,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 0: [2022-11-26 01:43:55,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:43:55,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 24: [2022-11-26 01:43:55,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:43:55,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 01:43:55,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 
24: [2022-11-26 01:43:55,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 8: [2022-11-26 01:43:55,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:43:55,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:43:55,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:43:55,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 12: [2022-11-26 01:43:55,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 24: [2022-11-26 01:43:55,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 7: [2022-11-26 01:43:55,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:43:55,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 8: [2022-11-26 01:43:55,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 12: [2022-11-26 01:43:55,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 7: [2022-11-26 01:43:55,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 
7: [2022-11-26 01:43:55,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 01:43:55,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 1: [2022-11-26 01:43:55,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:43:55,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 01:43:55,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 6: [2022-11-26 01:43:55,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:43:55,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 01:43:55,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 24: [2022-11-26 01:43:55,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:43:55,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 01:43:55,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 21: [2022-11-26 01:43:55,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-26 01:43:55,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 01:43:55,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 15: [2022-11-26 01:43:55,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:43:55,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 01:43:55,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 20: [2022-11-26 01:43:55,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:43:55,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 01:43:55,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 28: [2022-11-26 01:43:55,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 01:43:55,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 28: [2022-11-26 01:43:55,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-26 01:43:55,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 01:43:55,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 3: [2022-11-26 01:43:55,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:43:55,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:43:55,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 28: [2022-11-26 01:43:55,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 3: [2022-11-26 01:43:55,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 28: [2022-11-26 01:43:55,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 29: [2022-11-26 01:43:55,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:43:55,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 01:43:55,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-26 01:43:55,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
31: [2022-11-26 01:43:55,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:43:55,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 01:43:55,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-26 01:43:55,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 01:43:55,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-26 01:43:55,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:43:55,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:43:55,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 23: [2022-11-26 01:43:55,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 5: [2022-11-26 01:43:55,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 23: [2022-11-26 01:43:55,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 25: [2022-11-26 01:43:55,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 
25: [2022-11-26 01:43:55,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 01:43:55,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 10: [2022-11-26 01:43:55,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:43:55,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 01:43:55,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 10: [2022-11-26 01:43:55,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:43:55,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:43:55,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 01:43:55,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 15: [2022-11-26 01:43:55,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 01:43:55,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 13: [2022-11-26 01:43:55,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 01:43:55,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 23: [2022-11-26 01:43:55,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:43:55,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 13: [2022-11-26 01:43:55,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 23: [2022-11-26 01:43:55,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 19: [2022-11-26 01:43:55,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:43:55,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:43:55,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 13: [2022-11-26 01:43:55,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 19: [2022-11-26 01:43:55,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 13: [2022-11-26 01:43:55,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 21: [2022-11-26 01:43:55,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
21: [2022-11-26 01:43:55,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 01:43:55,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 17: [2022-11-26 01:43:55,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:43:55,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 11: [2022-11-26 01:43:55,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:43:55,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 11: [2022-11-26 01:43:55,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 01:43:55,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 30: [2022-11-26 01:43:55,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:43:55,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 01:43:55,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 28: [2022-11-26 01:43:55,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
9: [2022-11-26 01:43:55,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:43:55,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 01:43:55,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 28: [2022-11-26 01:43:55,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 01:43:55,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 19: [2022-11-26 01:43:55,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:43:55,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 30: [2022-11-26 01:43:55,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:43:55,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:43:55,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 
30: [2022-11-26 01:43:55,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 26: [2022-11-26 01:43:55,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 30: [2022-11-26 01:43:55,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 26: [2022-11-26 01:43:55,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 6: [2022-11-26 01:43:55,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:43:55,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 01:43:55,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 12: [2022-11-26 01:43:55,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:43:55,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 01:43:55,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 2: [2022-11-26 01:43:55,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:43:55,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 01:43:55,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 01:43:55,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 01:43:55,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 2: [2022-11-26 01:43:55,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 15: [2022-11-26 01:43:55,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:43:55,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 01:43:55,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 16: [2022-11-26 01:43:55,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:43:55,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 01:43:55,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 7: [2022-11-26 01:43:55,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 01:43:55,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 01:43:55,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 25: [2022-11-26 01:43:55,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:43:55,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 0: [2022-11-26 01:43:55,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:43:55,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:43:55,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:43:55,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:43:55,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 01:43:55,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 01:43:55,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 01:43:55,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 4: [2022-11-26 01:43:55,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 01:43:55,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 25: [2022-11-26 01:43:55,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 4: [2022-11-26 01:43:55,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 01:43:55,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 4: [2022-11-26 01:43:55,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 0: [2022-11-26 01:43:55,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:43:55,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
0: [2022-11-26 01:43:55,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 31: [2022-11-26 01:43:55,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:43:55,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 1: [2022-11-26 01:43:55,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 01:43:55,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 31: [2022-11-26 01:43:55,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 01:43:55,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 18: [2022-11-26 01:43:55,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:43:55,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:43:55,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:43:55,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
18: [2022-11-26 01:43:55,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 01:43:55,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 01:43:55,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 01:43:55,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 01:43:55,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 18: [2022-11-26 01:43:55,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 18: [2022-11-26 01:43:55,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 18: [2022-11-26 01:43:55,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 27: [2022-11-26 01:43:55,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:43:55,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 01:43:55,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 3: [2022-11-26 01:43:55,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 01:43:55,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 01:43:55,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 24: [2022-11-26 01:43:55,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:43:55,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 01:43:55,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 17: [2022-11-26 01:43:55,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:43:55,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 01:43:55,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 27: [2022-11-26 01:43:55,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:43:55,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
27: [2022-11-26 01:43:55,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 01:43:55,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 01:43:55,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 27: [2022-11-26 01:43:55,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 3: [2022-11-26 01:43:55,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:43:55,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 01:43:55,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 1: [2022-11-26 01:43:55,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:43:55,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:43:55,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 12: [2022-11-26 01:43:55,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 1: [2022-11-26 01:43:55,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 
12: [2022-11-26 01:43:55,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 0: [2022-11-26 01:43:55,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 01:43:55,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 14: [2022-11-26 01:43:55,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:43:55,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 01:43:55,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:43:55,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:43:55,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 26: [2022-11-26 01:43:55,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
14: [2022-11-26 01:43:55,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 01:43:55,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 26: [2022-11-26 01:43:55,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 14: [2022-11-26 01:43:55,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 14: [2022-11-26 01:43:55,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 26: [2022-11-26 01:43:55,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-26 01:43:55,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:43:55,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 01:43:55,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 3: [2022-11-26 01:43:55,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:43:55,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 01:43:55,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 
20: [2022-11-26 01:43:55,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:43:55,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 01:43:55,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 18: [2022-11-26 01:43:55,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:43:55,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 01:43:55,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 10: [2022-11-26 01:43:55,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:43:55,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 01:43:55,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 23: [2022-11-26 01:43:55,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:43:55,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 01:43:55,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 
9: [2022-11-26 01:43:55,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:43:55,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 01:43:55,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 21: [2022-11-26 01:43:55,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:43:55,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 01:43:55,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 29: [2022-11-26 01:43:55,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:43:55,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 01:43:55,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 30: [2022-11-26 01:43:55,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
30: [2022-11-26 01:43:55,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 2: [2022-11-26 01:43:55,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:43:55,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 2: [2022-11-26 01:43:55,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 01:43:55,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 13: [2022-11-26 01:43:55,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:43:55,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 01:43:55,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 8: [2022-11-26 01:43:55,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:43:55,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 01:43:55,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 0: [2022-11-26 01:43:55,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 01:43:55,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 19: [2022-11-26 01:43:55,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:43:55,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 19: [2022-11-26 01:43:55,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 01:43:55,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 25: [2022-11-26 01:43:55,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:43:55,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 28: [2022-11-26 01:43:55,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:43:55,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 28: [2022-11-26 01:43:55,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 01:43:55,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 15: [2022-11-26 01:43:55,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 01:43:55,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 01:43:55,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 27: [2022-11-26 01:43:55,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:43:55,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 01:43:55,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 16: [2022-11-26 01:43:55,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:43:55,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 01:43:55,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 7: [2022-11-26 01:43:55,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:43:55,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 01:43:55,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 4: [2022-11-26 01:43:55,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-26 01:43:55,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 01:43:55,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 11: [2022-11-26 01:43:55,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:43:55,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 01:43:55,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 14: [2022-11-26 01:43:55,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:43:55,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 01:43:55,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 31: [2022-11-26 01:43:55,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:43:55,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 01:43:55,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 24: [2022-11-26 01:43:55,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
24: [2022-11-26 01:43:55,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 01:43:55,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 3: [2022-11-26 01:43:55,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:43:55,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 01:43:55,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 17: [2022-11-26 01:43:55,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:43:55,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 01:43:55,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 1: [2022-11-26 01:43:55,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:43:55,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 01:43:55,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 9: [2022-11-26 01:43:55,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 01:43:55,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 01:43:55,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 20: [2022-11-26 01:43:55,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:43:55,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 01:43:55,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-26 01:43:55,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:43:55,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 01:43:55,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 10: [2022-11-26 01:43:55,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:43:55,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:43:55,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 01:43:55,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 
26: [2022-11-26 01:43:55,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 01:43:55,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 22: [2022-11-26 01:43:55,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:43:55,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:43:55,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:43:55,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 01:43:55,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 01:43:55,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 01:43:55,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 22: [2022-11-26 01:43:55,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 22: [2022-11-26 01:43:55,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 21: [2022-11-26 01:43:55,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-26 01:43:55,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 01:43:55,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 22: [2022-11-26 01:43:55,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:43:55,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 01:43:55,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 18: [2022-11-26 01:43:55,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:43:55,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 23: [2022-11-26 01:43:55,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:43:55,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 23: [2022-11-26 01:43:55,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 01:43:55,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 13: [2022-11-26 01:43:55,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 01:43:55,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 01:43:55,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 29: [2022-11-26 01:43:55,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:43:55,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 01:43:55,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 30: [2022-11-26 01:43:55,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:43:55,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 01:43:55,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 2: [2022-11-26 01:43:55,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:43:55,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 01:43:55,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 8: [2022-11-26 01:43:55,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-26 01:43:55,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 01:43:55,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 7: [2022-11-26 01:43:55,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:43:55,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 01:43:55,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 15: [2022-11-26 01:43:55,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:43:55,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 01:43:55,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 27: [2022-11-26 01:43:55,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:43:55,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 01:43:55,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 25: [2022-11-26 01:43:55,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
25: [2022-11-26 01:43:55,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 6: [2022-11-26 01:43:55,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:43:55,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 6: [2022-11-26 01:43:55,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 01:43:55,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 6: [2022-11-26 01:43:55,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:43:55,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 01:43:55,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 0: [2022-11-26 01:43:55,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:43:55,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 01:43:55,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 9: [2022-11-26 01:43:55,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
19: [2022-11-26 01:43:55,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:43:55,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 01:43:55,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 19: [2022-11-26 01:43:55,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 26: [2022-11-26 01:43:55,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:43:55,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 4: [2022-11-26 01:43:55,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:43:55,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 4: [2022-11-26 01:43:55,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 10: [2022-11-26 01:43:55,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:43:55,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 4: [2022-11-26 01:43:55,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 
10: [2022-11-26 01:43:55,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 1: [2022-11-26 01:43:55,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:43:55,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 31: [2022-11-26 01:43:55,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:43:55,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 01:43:55,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 14: [2022-11-26 01:43:55,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:43:55,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 31: [2022-11-26 01:43:55,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 14: [2022-11-26 01:43:55,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 31: [2022-11-26 01:43:55,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 16: [2022-11-26 01:43:55,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 
20: [2022-11-26 01:43:55,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:43:55,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 20: [2022-11-26 01:43:55,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 16: [2022-11-26 01:43:55,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 20: [2022-11-26 01:43:55,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 17: [2022-11-26 01:43:55,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:43:55,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 18: [2022-11-26 01:43:55,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:43:55,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 18: [2022-11-26 01:43:55,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 01:43:55,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 24: [2022-11-26 01:43:55,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
24: [2022-11-26 01:43:55,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 01:43:55,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 28: [2022-11-26 01:43:55,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:43:55,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 01:43:55,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 29: [2022-11-26 01:43:55,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:43:55,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 01:43:55,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 21: [2022-11-26 01:43:55,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:43:55,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 01:43:55,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-26 01:43:55,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 01:43:55,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 01:43:55,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 23: [2022-11-26 01:43:55,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:43:55,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 01:43:55,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 22: [2022-11-26 01:43:55,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:43:55,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 11: [2022-11-26 01:43:55,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:43:55,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 11: [2022-11-26 01:43:55,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 01:43:55,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 13: [2022-11-26 01:43:55,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 01:43:55,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 01:43:55,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 2: [2022-11-26 01:43:55,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:43:55,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 01:43:55,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 3: [2022-11-26 01:43:55,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:43:55,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 01:43:55,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 30: [2022-11-26 01:43:55,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:43:55,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 01:43:55,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 28: [2022-11-26 01:43:55,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 
28: [2022-11-26 01:43:55,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 01:43:55,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 12: [2022-11-26 01:43:55,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:43:55,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 01:43:55,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 22: [2022-11-26 01:43:55,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:43:55,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 01:43:55,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 8: [2022-11-26 01:43:55,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:43:55,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 12: [2022-11-26 01:43:55,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-26 01:43:55,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 8: [2022-11-26 01:43:55,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 12: [2022-11-26 01:43:55,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 19: [2022-11-26 01:43:55,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:43:55,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 01:43:55,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 24: [2022-11-26 01:43:55,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:43:55,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 01:43:55,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 25: [2022-11-26 01:43:55,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:43:55,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 01:43:55,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 
0: [2022-11-26 01:43:55,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:43:55,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 01:43:55,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 16: [2022-11-26 01:43:55,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:43:55,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 01:43:55,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 15: [2022-11-26 01:43:55,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:43:55,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 01:43:55,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 4: [2022-11-26 01:43:55,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:43:55,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
4: [2022-11-26 01:43:55,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 01:43:55,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 14: [2022-11-26 01:43:55,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 01:43:55,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 6: [2022-11-26 01:43:55,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:43:55,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:43:55,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 01:43:55,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 11: [2022-11-26 01:43:55,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 01:43:55,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 31: [2022-11-26 01:43:55,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-26 01:43:55,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 01:43:55,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 21: [2022-11-26 01:43:55,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:43:55,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 01:43:55,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 10: [2022-11-26 01:43:55,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:43:55,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:43:55,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 01:43:55,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 26: [2022-11-26 01:43:55,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 01:43:55,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 27: [2022-11-26 01:43:55,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
3: [2022-11-26 01:43:55,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:43:55,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 3: [2022-11-26 01:43:55,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 27: [2022-11-26 01:43:55,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 3: [2022-11-26 01:43:55,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 9: [2022-11-26 01:43:55,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:43:55,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 01:43:55,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 7: [2022-11-26 01:43:55,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:43:55,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 01:43:55,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 1: [2022-11-26 01:43:55,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 01:43:55,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 17: [2022-11-26 01:43:55,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:43:55,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 17: [2022-11-26 01:43:55,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 01:43:55,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 29: [2022-11-26 01:43:55,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:43:55,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 01:43:55,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 12: [2022-11-26 01:43:55,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:43:55,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 01:43:55,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-26 01:43:55,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 01:43:55,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 01:43:55,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 20: [2022-11-26 01:43:55,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:43:55,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 01:43:55,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 23: [2022-11-26 01:43:55,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:43:55,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 01:43:55,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 13: [2022-11-26 01:43:55,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:43:55,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 01:43:55,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 18: [2022-11-26 01:43:55,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
18: [2022-11-26 01:43:55,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 01:43:55,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 19: [2022-11-26 01:43:55,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:43:55,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 01:43:55,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 6: [2022-11-26 01:43:55,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:43:55,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 01:43:55,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 2: [2022-11-26 01:43:55,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:43:55,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 01:43:55,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 30: [2022-11-26 01:43:55,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
30: [2022-11-26 01:43:55,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 01:43:55,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 4: [2022-11-26 01:43:55,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:43:55,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 25: [2022-11-26 01:43:55,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:43:55,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:43:55,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 25: [2022-11-26 01:43:55,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 01:43:55,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 28: [2022-11-26 01:43:55,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 01:43:55,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 15: [2022-11-26 01:43:55,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 01:43:55,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 01:43:55,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 27: [2022-11-26 01:43:55,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:43:55,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 01:43:55,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 16: [2022-11-26 01:43:55,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:43:55,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 01:43:55,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 8: [2022-11-26 01:43:55,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:43:55,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 01:43:55,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 11: [2022-11-26 01:43:55,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 01:43:55,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 01:43:55,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 24: [2022-11-26 01:43:55,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:43:55,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 01:43:55,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 31: [2022-11-26 01:43:55,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:43:55,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 01:43:55,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 14: [2022-11-26 01:43:55,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:43:55,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 01:43:55,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 3: [2022-11-26 01:43:55,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 01:43:55,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 01:43:55,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 7: [2022-11-26 01:43:55,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:43:55,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 01:43:55,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 27: [2022-11-26 01:43:55,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:43:55,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 01:43:55,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 14: [2022-11-26 01:43:55,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:43:55,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 01:43:55,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 22: [2022-11-26 01:43:55,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
22: [2022-11-26 01:43:55,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:43:55,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 01:43:55,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step34000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 01:43:55,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 22: [2022-11-26 01:43:55,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 0: successfully saved checkpoint at iteration 34000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2860.23 31: iteration 34010/ 173500 | consumed samples: 8706560 | consumed tokens: 17831034880 | elapsed time per iteration (s): 1.13 | learning rate: 1.848E-04 | global batch size: 256 | lm loss: 2.142795E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.266 | TFLOPs: 13.75 | 31: iteration 34020/ 173500 | consumed samples: 8709120 | consumed tokens: 17836277760 | elapsed time per iteration (s): 0.81 | learning rate: 1.848E-04 | global batch size: 256 | lm loss: 2.129407E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.206 | TFLOPs: 19.01 | 31: iteration 34030/ 173500 | consumed samples: 8711680 | consumed tokens: 17841520640 | elapsed time per iteration (s): 0.80 | learning rate: 1.848E-04 | global batch size: 256 | lm loss: 2.120125E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.928 | TFLOPs: 19.29 | 31: iteration 34040/ 173500 
| consumed samples: 8714240 | consumed tokens: 17846763520 | elapsed time per iteration (s): 0.85 | learning rate: 1.847E-04 | global batch size: 256 | lm loss: 2.130753E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.662 | TFLOPs: 18.19 | 31: iteration 34050/ 173500 | consumed samples: 8716800 | consumed tokens: 17852006400 | elapsed time per iteration (s): 0.82 | learning rate: 1.847E-04 | global batch size: 256 | lm loss: 2.120963E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.779 | TFLOPs: 18.98 | 31: iteration 34060/ 173500 | consumed samples: 8719360 | consumed tokens: 17857249280 | elapsed time per iteration (s): 0.73 | learning rate: 1.847E-04 | global batch size: 256 | lm loss: 2.093635E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.059 | TFLOPs: 21.30 | 31: iteration 34070/ 173500 | consumed samples: 8721920 | consumed tokens: 17862492160 | elapsed time per iteration (s): 0.76 | learning rate: 1.847E-04 | global batch size: 256 | lm loss: 2.100617E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.425 | TFLOPs: 20.41 | 31: iteration 34080/ 173500 | consumed samples: 8724480 | consumed tokens: 17867735040 | elapsed time per iteration (s): 0.74 | learning rate: 1.847E-04 | global batch size: 256 | lm loss: 2.141537E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.432 | TFLOPs: 20.84 | 31: iteration 34090/ 173500 | consumed samples: 8727040 | consumed tokens: 17872977920 | elapsed time per iteration (s): 0.77 | learning rate: 1.847E-04 | global batch size: 256 | lm loss: 2.138025E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 330.998 | TFLOPs: 20.02 | 31: iteration 34100/ 173500 | consumed samples: 8729600 | consumed tokens: 17878220800 | elapsed time per iteration (s): 0.79 | learning rate: 1.847E-04 | global batch size: 256 | lm loss: 2.119928E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.963 | TFLOPs: 19.60 | 31: iteration 34110/ 173500 | consumed samples: 8732160 | consumed tokens: 17883463680 | elapsed time per iteration (s): 0.80 | learning rate: 1.847E-04 | global batch size: 256 | lm loss: 2.106987E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.461 | TFLOPs: 19.39 | 31: iteration 34120/ 173500 | consumed samples: 8734720 | consumed tokens: 17888706560 | elapsed time per iteration (s): 0.78 | learning rate: 1.847E-04 | global batch size: 256 | lm loss: 2.123773E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.361 | TFLOPs: 19.87 | 31: iteration 34130/ 173500 | consumed samples: 8737280 | consumed tokens: 17893949440 | elapsed time per iteration (s): 0.78 | learning rate: 1.847E-04 | global batch size: 256 | lm loss: 2.144114E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.634 | TFLOPs: 19.94 | 31: iteration 34140/ 173500 | consumed samples: 8739840 | consumed tokens: 17899192320 | elapsed time per iteration (s): 0.77 | learning rate: 1.846E-04 | global batch size: 256 | lm loss: 2.092491E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.757 | TFLOPs: 20.07 | 31: iteration 34150/ 173500 | consumed samples: 8742400 | consumed tokens: 17904435200 | elapsed time per iteration (s): 0.74 | learning rate: 1.846E-04 | global batch size: 
256 | lm loss: 2.116784E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.997 | TFLOPs: 20.99 | 31: iteration 34160/ 173500 | consumed samples: 8744960 | consumed tokens: 17909678080 | elapsed time per iteration (s): 0.73 | learning rate: 1.846E-04 | global batch size: 256 | lm loss: 2.112035E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.966 | TFLOPs: 21.17 | 31: iteration 34170/ 173500 | consumed samples: 8747520 | consumed tokens: 17914920960 | elapsed time per iteration (s): 0.79 | learning rate: 1.846E-04 | global batch size: 256 | lm loss: 2.123605E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.189 | TFLOPs: 19.61 | 31: iteration 34180/ 173500 | consumed samples: 8750080 | consumed tokens: 17920163840 | elapsed time per iteration (s): 0.77 | learning rate: 1.846E-04 | global batch size: 256 | lm loss: 2.153843E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.862 | TFLOPs: 20.20 | 31: iteration 34190/ 173500 | consumed samples: 8752640 | consumed tokens: 17925406720 | elapsed time per iteration (s): 0.82 | learning rate: 1.846E-04 | global batch size: 256 | lm loss: 2.132395E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.546 | TFLOPs: 18.97 | 31: iteration 34200/ 173500 | consumed samples: 8755200 | consumed tokens: 17930649600 | elapsed time per iteration (s): 0.80 | learning rate: 1.846E-04 | global batch size: 256 | lm loss: 2.096330E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.430 | TFLOPs: 19.26 | 31: iteration 34210/ 173500 | consumed samples: 8757760 | consumed 
tokens: 17935892480 | elapsed time per iteration (s): 0.81 | learning rate: 1.846E-04 | global batch size: 256 | lm loss: 2.114344E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.611 | TFLOPs: 19.21 | 31: iteration 34220/ 173500 | consumed samples: 8760320 | consumed tokens: 17941135360 | elapsed time per iteration (s): 0.81 | learning rate: 1.846E-04 | global batch size: 256 | lm loss: 2.124185E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.030 | TFLOPs: 19.18 | 31: iteration 34230/ 173500 | consumed samples: 8762880 | consumed tokens: 17946378240 | elapsed time per iteration (s): 0.82 | learning rate: 1.846E-04 | global batch size: 256 | lm loss: 2.096651E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.211 | TFLOPs: 18.83 | 31: iteration 34240/ 173500 | consumed samples: 8765440 | consumed tokens: 17951621120 | elapsed time per iteration (s): 0.77 | learning rate: 1.846E-04 | global batch size: 256 | lm loss: 2.103296E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.847 | TFLOPs: 20.02 | 31: iteration 34250/ 173500 | consumed samples: 8768000 | consumed tokens: 17956864000 | elapsed time per iteration (s): 0.79 | learning rate: 1.845E-04 | global batch size: 256 | lm loss: 2.116234E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.312 | TFLOPs: 19.68 | 31: iteration 34260/ 173500 | consumed samples: 8770560 | consumed tokens: 17962106880 | elapsed time per iteration (s): 0.82 | learning rate: 1.845E-04 | global batch size: 256 | lm loss: 2.120430E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 312.152 | TFLOPs: 18.88 | 31: iteration 34270/ 173500 | consumed samples: 8773120 | consumed tokens: 17967349760 | elapsed time per iteration (s): 0.78 | learning rate: 1.845E-04 | global batch size: 256 | lm loss: 2.138577E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.354 | TFLOPs: 19.93 | 31: iteration 34280/ 173500 | consumed samples: 8775680 | consumed tokens: 17972592640 | elapsed time per iteration (s): 0.80 | learning rate: 1.845E-04 | global batch size: 256 | lm loss: 2.130787E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.657 | TFLOPs: 19.34 | 31: iteration 34290/ 173500 | consumed samples: 8778240 | consumed tokens: 17977835520 | elapsed time per iteration (s): 0.79 | learning rate: 1.845E-04 | global batch size: 256 | lm loss: 2.143913E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.070 | TFLOPs: 19.48 | 31: iteration 34300/ 173500 | consumed samples: 8780800 | consumed tokens: 17983078400 | elapsed time per iteration (s): 0.83 | learning rate: 1.845E-04 | global batch size: 256 | lm loss: 2.142558E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.477 | TFLOPs: 18.66 | 31: iteration 34310/ 173500 | consumed samples: 8783360 | consumed tokens: 17988321280 | elapsed time per iteration (s): 0.78 | learning rate: 1.845E-04 | global batch size: 256 | lm loss: 2.120504E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.742 | TFLOPs: 19.83 | 31: iteration 34320/ 173500 | consumed samples: 8785920 | consumed tokens: 17993564160 | elapsed time per iteration (s): 0.82 | learning rate: 1.845E-04 | global batch size: 256 | lm loss: 2.162878E+00 | grad norm: 
0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.051 | TFLOPs: 18.82 | 31: iteration 34330/ 173500 | consumed samples: 8788480 | consumed tokens: 17998807040 | elapsed time per iteration (s): 0.81 | learning rate: 1.845E-04 | global batch size: 256 | lm loss: 2.141603E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.876 | TFLOPs: 19.11 | 31: iteration 34340/ 173500 | consumed samples: 8791040 | consumed tokens: 18004049920 | elapsed time per iteration (s): 0.87 | learning rate: 1.845E-04 | global batch size: 256 | lm loss: 2.098120E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.230 | TFLOPs: 17.86 | 31: iteration 34350/ 173500 | consumed samples: 8793600 | consumed tokens: 18009292800 | elapsed time per iteration (s): 0.78 | learning rate: 1.845E-04 | global batch size: 256 | lm loss: 2.100450E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.603 | TFLOPs: 19.88 | 31: iteration 34360/ 173500 | consumed samples: 8796160 | consumed tokens: 18014535680 | elapsed time per iteration (s): 0.82 | learning rate: 1.844E-04 | global batch size: 256 | lm loss: 2.094095E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.574 | TFLOPs: 18.97 | 31: iteration 34370/ 173500 | consumed samples: 8798720 | consumed tokens: 18019778560 | elapsed time per iteration (s): 0.82 | learning rate: 1.844E-04 | global batch size: 256 | lm loss: 2.137230E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.778 | TFLOPs: 18.86 | 31: iteration 34380/ 173500 | consumed samples: 8801280 | consumed tokens: 18025021440 | elapsed time per 
iteration (s): 0.82 | learning rate: 1.844E-04 | global batch size: 256 | lm loss: 2.144821E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.049 | TFLOPs: 18.88 | 31: iteration 34390/ 173500 | consumed samples: 8803840 | consumed tokens: 18030264320 | elapsed time per iteration (s): 0.81 | learning rate: 1.844E-04 | global batch size: 256 | lm loss: 2.097001E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.963 | TFLOPs: 19.18 | 31: iteration 34400/ 173500 | consumed samples: 8806400 | consumed tokens: 18035507200 | elapsed time per iteration (s): 0.85 | learning rate: 1.844E-04 | global batch size: 256 | lm loss: 2.144676E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.314 | TFLOPs: 18.29 | 31: iteration 34410/ 173500 | consumed samples: 8808960 | consumed tokens: 18040750080 | elapsed time per iteration (s): 0.83 | learning rate: 1.844E-04 | global batch size: 256 | lm loss: 2.137638E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.362 | TFLOPs: 18.66 | 31: iteration 34420/ 173500 | consumed samples: 8811520 | consumed tokens: 18045992960 | elapsed time per iteration (s): 0.81 | learning rate: 1.844E-04 | global batch size: 256 | lm loss: 2.126895E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.103 | TFLOPs: 19.18 | 31: iteration 34430/ 173500 | consumed samples: 8814080 | consumed tokens: 18051235840 | elapsed time per iteration (s): 0.83 | learning rate: 1.844E-04 | global batch size: 256 | lm loss: 2.139570E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.656 | TFLOPs: 18.55 | 31: 
iteration 34440/ 173500 | consumed samples: 8816640 | consumed tokens: 18056478720 | elapsed time per iteration (s): 0.82 | learning rate: 1.844E-04 | global batch size: 256 | lm loss: 2.126890E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.733 | TFLOPs: 18.86 | 31: iteration 34450/ 173500 | consumed samples: 8819200 | consumed tokens: 18061721600 | elapsed time per iteration (s): 0.79 | learning rate: 1.844E-04 | global batch size: 256 | lm loss: 2.130504E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.235 | TFLOPs: 19.55 | 31: iteration 34460/ 173500 | consumed samples: 8821760 | consumed tokens: 18066964480 | elapsed time per iteration (s): 0.78 | learning rate: 1.844E-04 | global batch size: 256 | lm loss: 2.095610E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.429 | TFLOPs: 19.87 | 31: iteration 34470/ 173500 | consumed samples: 8824320 | consumed tokens: 18072207360 | elapsed time per iteration (s): 0.79 | learning rate: 1.843E-04 | global batch size: 256 | lm loss: 2.095777E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.121 | TFLOPs: 19.49 | 31: iteration 34480/ 173500 | consumed samples: 8826880 | consumed tokens: 18077450240 | elapsed time per iteration (s): 0.81 | learning rate: 1.843E-04 | global batch size: 256 | lm loss: 2.095688E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.007 | TFLOPs: 19.06 | 31: iteration 34490/ 173500 | consumed samples: 8829440 | consumed tokens: 18082693120 | elapsed time per iteration (s): 0.78 | learning rate: 1.843E-04 | global batch size: 256 | lm loss: 2.118214E+00 | grad norm: 0.137 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.857 | TFLOPs: 19.83 | 31: iteration 34500/ 173500 | consumed samples: 8832000 | consumed tokens: 18087936000 | elapsed time per iteration (s): 0.83 | learning rate: 1.843E-04 | global batch size: 256 | lm loss: 2.081131E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.535 | TFLOPs: 18.67 | 31: iteration 34510/ 173500 | consumed samples: 8834560 | consumed tokens: 18093178880 | elapsed time per iteration (s): 0.78 | learning rate: 1.843E-04 | global batch size: 256 | lm loss: 2.094462E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.433 | TFLOPs: 19.75 | 31: iteration 34520/ 173500 | consumed samples: 8837120 | consumed tokens: 18098421760 | elapsed time per iteration (s): 1.10 | learning rate: 1.843E-04 | global batch size: 256 | lm loss: 2.106432E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.310 | TFLOPs: 14.11 | 31: iteration 34530/ 173500 | consumed samples: 8839680 | consumed tokens: 18103664640 | elapsed time per iteration (s): 0.78 | learning rate: 1.843E-04 | global batch size: 256 | lm loss: 2.125307E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.767 | TFLOPs: 19.89 | 31: iteration 34540/ 173500 | consumed samples: 8842240 | consumed tokens: 18108907520 | elapsed time per iteration (s): 0.75 | learning rate: 1.843E-04 | global batch size: 256 | lm loss: 2.102010E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.128 | TFLOPs: 20.76 | 31: iteration 34550/ 173500 | consumed samples: 8844800 | consumed tokens: 18114150400 | elapsed time per iteration (s): 0.81 | learning rate: 
1.843E-04 | global batch size: 256 | lm loss: 2.135218E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.748 | TFLOPs: 19.10 | 31: iteration 34560/ 173500 | consumed samples: 8847360 | consumed tokens: 18119393280 | elapsed time per iteration (s): 0.80 | learning rate: 1.843E-04 | global batch size: 256 | lm loss: 2.095054E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.150 | TFLOPs: 19.43 | 31: iteration 34570/ 173500 | consumed samples: 8849920 | consumed tokens: 18124636160 | elapsed time per iteration (s): 0.75 | learning rate: 1.843E-04 | global batch size: 256 | lm loss: 2.118712E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.927 | TFLOPs: 20.56 | 31: iteration 34580/ 173500 | consumed samples: 8852480 | consumed tokens: 18129879040 | elapsed time per iteration (s): 0.79 | learning rate: 1.842E-04 | global batch size: 256 | lm loss: 2.137097E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.076 | TFLOPs: 19.73 | 31: iteration 34590/ 173500 | consumed samples: 8855040 | consumed tokens: 18135121920 | elapsed time per iteration (s): 0.76 | learning rate: 1.842E-04 | global batch size: 256 | lm loss: 2.088726E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.566 | TFLOPs: 20.36 | 31: iteration 34600/ 173500 | consumed samples: 8857600 | consumed tokens: 18140364800 | elapsed time per iteration (s): 0.80 | learning rate: 1.842E-04 | global batch size: 256 | lm loss: 2.111711E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.454 | TFLOPs: 19.33 | 31: iteration 34610/ 173500 | consumed 
samples: 8860160 | consumed tokens: 18145607680 | elapsed time per iteration (s): 0.77 | learning rate: 1.842E-04 | global batch size: 256 | lm loss: 2.068701E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.346 | TFLOPs: 20.11 | 31: iteration 34620/ 173500 | consumed samples: 8862720 | consumed tokens: 18150850560 | elapsed time per iteration (s): 0.81 | learning rate: 1.842E-04 | global batch size: 256 | lm loss: 2.103518E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.555 | TFLOPs: 19.15 | 31: iteration 34630/ 173500 | consumed samples: 8865280 | consumed tokens: 18156093440 | elapsed time per iteration (s): 0.81 | learning rate: 1.842E-04 | global batch size: 256 | lm loss: 2.110733E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.960 | TFLOPs: 19.18 | 31: iteration 34640/ 173500 | consumed samples: 8867840 | consumed tokens: 18161336320 | elapsed time per iteration (s): 0.75 | learning rate: 1.842E-04 | global batch size: 256 | lm loss: 2.125019E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.126 | TFLOPs: 20.58 | 31: iteration 34650/ 173500 | consumed samples: 8870400 | consumed tokens: 18166579200 | elapsed time per iteration (s): 0.74 | learning rate: 1.842E-04 | global batch size: 256 | lm loss: 2.125765E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.872 | TFLOPs: 20.92 | 31: iteration 34660/ 173500 | consumed samples: 8872960 | consumed tokens: 18171822080 | elapsed time per iteration (s): 0.79 | learning rate: 1.842E-04 | global batch size: 256 | lm loss: 2.115462E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 326.098 | TFLOPs: 19.73 | 31: iteration 34670/ 173500 | consumed samples: 8875520 | consumed tokens: 18177064960 | elapsed time per iteration (s): 0.75 | learning rate: 1.842E-04 | global batch size: 256 | lm loss: 2.124628E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.094 | TFLOPs: 20.57 | 31: iteration 34680/ 173500 | consumed samples: 8878080 | consumed tokens: 18182307840 | elapsed time per iteration (s): 0.77 | learning rate: 1.841E-04 | global batch size: 256 | lm loss: 2.119481E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.849 | TFLOPs: 20.20 | 31: iteration 34690/ 173500 | consumed samples: 8880640 | consumed tokens: 18187550720 | elapsed time per iteration (s): 0.76 | learning rate: 1.841E-04 | global batch size: 256 | lm loss: 2.120827E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.820 | TFLOPs: 20.26 | 31: iteration 34700/ 173500 | consumed samples: 8883200 | consumed tokens: 18192793600 | elapsed time per iteration (s): 0.77 | learning rate: 1.841E-04 | global batch size: 256 | lm loss: 2.107226E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.051 | TFLOPs: 20.21 | 31: iteration 34710/ 173500 | consumed samples: 8885760 | consumed tokens: 18198036480 | elapsed time per iteration (s): 0.73 | learning rate: 1.841E-04 | global batch size: 256 | lm loss: 2.120516E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.334 | TFLOPs: 21.07 | 31: iteration 34720/ 173500 | consumed samples: 8888320 | consumed tokens: 18203279360 | elapsed time per iteration (s): 0.73 | learning rate: 1.841E-04 | global batch size: 256 | lm 
loss: 2.127976E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.432 | TFLOPs: 21.20 | 31: iteration 34730/ 173500 | consumed samples: 8890880 | consumed tokens: 18208522240 | elapsed time per iteration (s): 0.76 | learning rate: 1.841E-04 | global batch size: 256 | lm loss: 2.134925E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.533 | TFLOPs: 20.36 | 31: iteration 34740/ 173500 | consumed samples: 8893440 | consumed tokens: 18213765120 | elapsed time per iteration (s): 0.76 | learning rate: 1.841E-04 | global batch size: 256 | lm loss: 2.126164E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.984 | TFLOPs: 20.39 | 31: iteration 34750/ 173500 | consumed samples: 8896000 | consumed tokens: 18219008000 | elapsed time per iteration (s): 0.75 | learning rate: 1.841E-04 | global batch size: 256 | lm loss: 2.123279E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.269 | TFLOPs: 20.77 | 31: iteration 34760/ 173500 | consumed samples: 8898560 | consumed tokens: 18224250880 | elapsed time per iteration (s): 0.75 | learning rate: 1.841E-04 | global batch size: 256 | lm loss: 2.112571E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.398 | TFLOPs: 20.53 | 31: iteration 34770/ 173500 | consumed samples: 8901120 | consumed tokens: 18229493760 | elapsed time per iteration (s): 0.81 | learning rate: 1.841E-04 | global batch size: 256 | lm loss: 2.102416E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.504 | TFLOPs: 19.09 | 31: iteration 34780/ 173500 | consumed samples: 8903680 | consumed tokens: 
18234736640 | elapsed time per iteration (s): 0.76 | learning rate: 1.841E-04 | global batch size: 256 | lm loss: 2.085741E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.179 | TFLOPs: 20.28 | 31: iteration 34790/ 173500 | consumed samples: 8906240 | consumed tokens: 18239979520 | elapsed time per iteration (s): 0.72 | learning rate: 1.840E-04 | global batch size: 256 | lm loss: 2.106193E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.119 | TFLOPs: 21.42 | 31: iteration 34800/ 173500 | consumed samples: 8908800 | consumed tokens: 18245222400 | elapsed time per iteration (s): 0.82 | learning rate: 1.840E-04 | global batch size: 256 | lm loss: 2.112329E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.685 | TFLOPs: 18.98 | 31: iteration 34810/ 173500 | consumed samples: 8911360 | consumed tokens: 18250465280 | elapsed time per iteration (s): 0.80 | learning rate: 1.840E-04 | global batch size: 256 | lm loss: 2.122541E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.165 | TFLOPs: 19.31 | 31: iteration 34820/ 173500 | consumed samples: 8913920 | consumed tokens: 18255708160 | elapsed time per iteration (s): 0.78 | learning rate: 1.840E-04 | global batch size: 256 | lm loss: 2.092826E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.554 | TFLOPs: 19.94 | 31: iteration 34830/ 173500 | consumed samples: 8916480 | consumed tokens: 18260951040 | elapsed time per iteration (s): 0.80 | learning rate: 1.840E-04 | global batch size: 256 | lm loss: 2.098315E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
318.843 | TFLOPs: 19.29 | 31: iteration 34840/ 173500 | consumed samples: 8919040 | consumed tokens: 18266193920 | elapsed time per iteration (s): 0.76 | learning rate: 1.840E-04 | global batch size: 256 | lm loss: 2.121701E+00 | grad norm: 0.649 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.651 | TFLOPs: 20.31 | 31: iteration 34850/ 173500 | consumed samples: 8921600 | consumed tokens: 18271436800 | elapsed time per iteration (s): 0.76 | learning rate: 1.840E-04 | global batch size: 256 | lm loss: 2.232840E+00 | grad norm: 3.861 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.892 | TFLOPs: 20.44 | 31: iteration 34860/ 173500 | consumed samples: 8924160 | consumed tokens: 18276679680 | elapsed time per iteration (s): 0.75 | learning rate: 1.840E-04 | global batch size: 256 | lm loss: 2.478349E+00 | grad norm: 0.593 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.919 | TFLOPs: 20.62 | 31: iteration 34870/ 173500 | consumed samples: 8926720 | consumed tokens: 18281922560 | elapsed time per iteration (s): 0.79 | learning rate: 1.840E-04 | global batch size: 256 | lm loss: 2.217457E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.277 | TFLOPs: 19.50 | 31: iteration 34880/ 173500 | consumed samples: 8929280 | consumed tokens: 18287165440 | elapsed time per iteration (s): 0.75 | learning rate: 1.840E-04 | global batch size: 256 | lm loss: 2.123820E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.188 | TFLOPs: 20.52 | 31: iteration 34890/ 173500 | consumed samples: 8931840 | consumed tokens: 18292408320 | elapsed time per iteration (s): 0.81 | learning rate: 1.840E-04 | global batch size: 256 | lm loss: 2.155113E+00 | grad norm: 0.146 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.172 | TFLOPs: 19.19 | 31: iteration 34900/ 173500 | consumed samples: 8934400 | consumed tokens: 18297651200 | elapsed time per iteration (s): 0.75 | learning rate: 1.839E-04 | global batch size: 256 | lm loss: 2.145146E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.935 | TFLOPs: 20.69 | 31: iteration 34910/ 173500 | consumed samples: 8936960 | consumed tokens: 18302894080 | elapsed time per iteration (s): 1.10 | learning rate: 1.839E-04 | global batch size: 256 | lm loss: 2.139353E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.697 | TFLOPs: 14.08 | 31: iteration 34920/ 173500 | consumed samples: 8939520 | consumed tokens: 18308136960 | elapsed time per iteration (s): 0.74 | learning rate: 1.839E-04 | global batch size: 256 | lm loss: 2.111205E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.451 | TFLOPs: 20.90 | 31: iteration 34930/ 173500 | consumed samples: 8942080 | consumed tokens: 18313379840 | elapsed time per iteration (s): 0.74 | learning rate: 1.839E-04 | global batch size: 256 | lm loss: 2.130982E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.653 | TFLOPs: 20.85 | 31: iteration 34940/ 173500 | consumed samples: 8944640 | consumed tokens: 18318622720 | elapsed time per iteration (s): 0.80 | learning rate: 1.839E-04 | global batch size: 256 | lm loss: 2.114034E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.812 | TFLOPs: 19.47 | 31: iteration 34950/ 173500 | consumed samples: 8947200 | consumed tokens: 18323865600 | elapsed time per iteration (s): 
0.84 | learning rate: 1.839E-04 | global batch size: 256 | lm loss: 2.122319E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.175 | TFLOPs: 18.40 | 31: iteration 34960/ 173500 | consumed samples: 8949760 | consumed tokens: 18329108480 | elapsed time per iteration (s): 0.79 | learning rate: 1.839E-04 | global batch size: 256 | lm loss: 2.125619E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.466 | TFLOPs: 19.57 | 31: iteration 34970/ 173500 | consumed samples: 8952320 | consumed tokens: 18334351360 | elapsed time per iteration (s): 0.81 | learning rate: 1.839E-04 | global batch size: 256 | lm loss: 2.149448E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.201 | TFLOPs: 19.13 | 31: iteration 34980/ 173500 | consumed samples: 8954880 | consumed tokens: 18339594240 | elapsed time per iteration (s): 0.86 | learning rate: 1.839E-04 | global batch size: 256 | lm loss: 2.140285E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.190 | TFLOPs: 17.92 | 31: iteration 34990/ 173500 | consumed samples: 8957440 | consumed tokens: 18344837120 | elapsed time per iteration (s): 0.83 | learning rate: 1.839E-04 | global batch size: 256 | lm loss: 2.159209E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.978 | TFLOPs: 18.57 | 31: iteration 35000/ 173500 | consumed samples: 8960000 | consumed tokens: 18350080000 | elapsed time per iteration (s): 0.75 | learning rate: 1.838E-04 | global batch size: 256 | lm loss: 2.150227E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.856 | TFLOPs: 20.74 | 31: 
------------------------------------------------------------------------------------------- 31: valid loss at iteration 35000 | lm loss value: 2.144013E+00 | lm loss PPL: 8.533612E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 35000 to checkpoints_1b1long 0: [2022-11-26 01:57:09,909] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step35000 is begin to save! 0: [2022-11-26 01:57:09,922] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_01-model_00-model_states.pt... 0: [2022-11-26 01:57:10,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_01-model_00-model_states.pt. 0: [2022-11-26 01:57:10,168] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_03-model_00-model_states.pt... 0: [2022-11-26 01:57:10,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_03-model_00-model_states.pt. 0: [2022-11-26 01:57:10,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_04-model_00-model_states.pt... 0: [2022-11-26 01:57:10,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_04-model_00-model_states.pt. 0: [2022-11-26 01:57:10,330] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_05-model_00-model_states.pt... 0: [2022-11-26 01:57:10,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_05-model_00-model_states.pt. 0: [2022-11-26 01:57:10,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_06-model_00-model_states.pt... 
0: [2022-11-26 01:57:10,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_06-model_00-model_states.pt. 0: [2022-11-26 01:57:10,495] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_07-model_00-model_states.pt... 0: [2022-11-26 01:57:10,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_07-model_00-model_states.pt. 0: [2022-11-26 01:57:10,578] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_08-model_00-model_states.pt... 0: [2022-11-26 01:57:10,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_08-model_00-model_states.pt. 0: [2022-11-26 01:57:10,659] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_09-model_00-model_states.pt... 0: [2022-11-26 01:57:10,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_09-model_00-model_states.pt. 0: [2022-11-26 01:57:10,733] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_10-model_00-model_states.pt... 0: [2022-11-26 01:57:10,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_10-model_00-model_states.pt. 0: [2022-11-26 01:57:10,808] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_11-model_00-model_states.pt... 0: [2022-11-26 01:57:10,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_11-model_00-model_states.pt. 0: [2022-11-26 01:57:10,883] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_12-model_00-model_states.pt... 
0: [2022-11-26 01:57:10,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_12-model_00-model_states.pt. 0: [2022-11-26 01:57:10,959] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_13-model_00-model_states.pt... 0: [2022-11-26 01:57:11,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_13-model_00-model_states.pt. 0: [2022-11-26 01:57:11,037] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_14-model_00-model_states.pt... 0: [2022-11-26 01:57:11,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_14-model_00-model_states.pt. 0: [2022-11-26 01:57:11,110] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_15-model_00-model_states.pt... 0: [2022-11-26 01:57:11,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_15-model_00-model_states.pt. 0: [2022-11-26 01:57:11,188] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_16-model_00-model_states.pt... 0: [2022-11-26 01:57:11,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_16-model_00-model_states.pt. 0: [2022-11-26 01:57:11,266] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_17-model_00-model_states.pt... 0: [2022-11-26 01:57:11,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_17-model_00-model_states.pt. 0: [2022-11-26 01:57:11,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_18-model_00-model_states.pt... 
0: [2022-11-26 01:57:11,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_18-model_00-model_states.pt. 0: [2022-11-26 01:57:11,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_19-model_00-model_states.pt... 0: [2022-11-26 01:57:11,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_19-model_00-model_states.pt. 0: [2022-11-26 01:57:11,490] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_20-model_00-model_states.pt... 0: [2022-11-26 01:57:11,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_20-model_00-model_states.pt. 0: [2022-11-26 01:57:11,562] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_21-model_00-model_states.pt... 0: [2022-11-26 01:57:11,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_21-model_00-model_states.pt. 0: [2022-11-26 01:57:11,639] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_22-model_00-model_states.pt... 0: [2022-11-26 01:57:11,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_22-model_00-model_states.pt. 0: [2022-11-26 01:57:11,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_23-model_00-model_states.pt... 0: [2022-11-26 01:57:11,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_23-model_00-model_states.pt. 0: [2022-11-26 01:57:11,790] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_24-model_00-model_states.pt... 
0: [2022-11-26 01:57:11,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_24-model_00-model_states.pt. 0: [2022-11-26 01:57:11,875] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_25-model_00-model_states.pt... 0: [2022-11-26 01:57:11,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_25-model_00-model_states.pt. 0: [2022-11-26 01:57:11,949] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_26-model_00-model_states.pt... 0: [2022-11-26 01:57:12,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_26-model_00-model_states.pt. 0: [2022-11-26 01:57:12,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_27-model_00-model_states.pt... 0: [2022-11-26 01:57:12,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_27-model_00-model_states.pt. 0: [2022-11-26 01:57:12,101] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_28-model_00-model_states.pt... 0: [2022-11-26 01:57:12,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_28-model_00-model_states.pt. 0: [2022-11-26 01:57:12,173] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/layer_30-model_00-model_states.pt... 0: [2022-11-26 01:57:12,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/layer_30-model_00-model_states.pt. 
0: [2022-11-26 01:57:12,177] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step35000/mp_rank_00_model_states.pt 0: [2022-11-26 01:57:12,177] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/mp_rank_00_model_states.pt... 0: [2022-11-26 01:57:12,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/mp_rank_00_model_states.pt. 0: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
15: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 
29: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 
0: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
4: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 
10: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 
2: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
12: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 
25: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 
24: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 
22: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 
21: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 
27: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 
5: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
8: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
3: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 
25: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 
14: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 
18: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 
7: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 
12: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 
31: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 
8: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 18: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 
26: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 27: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 28: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 26: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:57:12,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 21: [2022-11-26 01:57:12,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:57:12,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:57:12,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 19: [2022-11-26 01:57:12,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 21: [2022-11-26 01:57:12,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
19: [2022-11-26 01:57:12,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 0: [2022-11-26 01:57:12,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:57:12,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:57:12,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:57:12,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 7: [2022-11-26 01:57:12,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 8: [2022-11-26 01:57:12,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:57:12,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:57:12,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 7: [2022-11-26 01:57:12,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
8: [2022-11-26 01:57:12,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 3: [2022-11-26 01:57:12,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 8: [2022-11-26 01:57:12,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 3: [2022-11-26 01:57:12,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 10: [2022-11-26 01:57:12,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:57:12,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 01:57:12,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 13: [2022-11-26 01:57:12,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:57:12,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 01:57:12,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 6: [2022-11-26 01:57:12,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:57:12,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
6: [2022-11-26 01:57:12,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 01:57:12,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 24: [2022-11-26 01:57:12,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 01:57:12,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 28: [2022-11-26 01:57:12,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:57:12,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:57:12,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 01:57:12,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 9: [2022-11-26 01:57:12,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:57:12,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 01:57:12,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 3: [2022-11-26 01:57:12,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
15: [2022-11-26 01:57:12,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:57:12,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 15: [2022-11-26 01:57:12,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 16: [2022-11-26 01:57:12,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:57:12,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 15: [2022-11-26 01:57:12,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 16: [2022-11-26 01:57:12,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 01:57:12,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 23: [2022-11-26 01:57:12,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:57:12,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 01:57:12,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 15: [2022-11-26 01:57:12,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 01:57:12,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 01:57:12,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 11: [2022-11-26 01:57:12,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:57:12,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 01:57:12,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 29: [2022-11-26 01:57:12,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:57:12,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:57:12,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 01:57:12,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 01:57:12,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 29: [2022-11-26 01:57:12,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 12: [2022-11-26 01:57:12,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 01:57:12,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 11: [2022-11-26 01:57:12,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:57:12,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 2: [2022-11-26 01:57:12,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:57:12,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 2: [2022-11-26 01:57:12,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 11: [2022-11-26 01:57:12,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 2: [2022-11-26 01:57:12,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 1: [2022-11-26 01:57:12,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:57:12,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:57:12,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 01:57:12,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
1: [2022-11-26 01:57:12,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 01:57:12,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 18: [2022-11-26 01:57:12,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:57:12,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 01:57:12,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 14: [2022-11-26 01:57:12,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:57:12,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 01:57:12,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 24: [2022-11-26 01:57:12,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:57:12,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 27: [2022-11-26 01:57:12,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
27: [2022-11-26 01:57:12,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 01:57:12,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 9: [2022-11-26 01:57:12,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:57:12,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 01:57:12,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 24: [2022-11-26 01:57:12,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 19: [2022-11-26 01:57:12,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:57:12,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 7: [2022-11-26 01:57:12,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:57:12,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 7: [2022-11-26 01:57:12,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 01:57:12,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
13: [2022-11-26 01:57:12,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:57:12,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:57:12,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 22: [2022-11-26 01:57:12,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 13: [2022-11-26 01:57:12,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 22: [2022-11-26 01:57:12,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 22: [2022-11-26 01:57:12,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:57:12,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 6: [2022-11-26 01:57:12,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:57:12,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 12: [2022-11-26 01:57:12,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
6: [2022-11-26 01:57:12,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 12: [2022-11-26 01:57:12,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 6: [2022-11-26 01:57:12,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 12: [2022-11-26 01:57:12,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 6: [2022-11-26 01:57:12,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:57:12,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 25: [2022-11-26 01:57:12,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:57:12,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 25: [2022-11-26 01:57:12,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 01:57:12,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:57:12,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
25: [2022-11-26 01:57:12,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 01:57:12,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 26: [2022-11-26 01:57:12,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:57:12,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:57:12,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 16: [2022-11-26 01:57:12,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 26: [2022-11-26 01:57:12,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 16: [2022-11-26 01:57:12,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 15: [2022-11-26 01:57:12,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:57:12,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:57:12,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 01:57:12,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
27: [2022-11-26 01:57:12,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 01:57:12,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 1: [2022-11-26 01:57:12,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:57:12,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 8: [2022-11-26 01:57:12,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:57:12,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 8: [2022-11-26 01:57:12,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 01:57:12,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 26: [2022-11-26 01:57:12,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:57:12,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:57:12,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 2: [2022-11-26 01:57:12,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
30: [2022-11-26 01:57:12,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 26: [2022-11-26 01:57:12,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 2: [2022-11-26 01:57:12,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 30: [2022-11-26 01:57:12,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 2: [2022-11-26 01:57:12,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 30: [2022-11-26 01:57:12,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:57:12,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 01:57:12,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 28: [2022-11-26 01:57:12,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 01:57:12,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 12: [2022-11-26 01:57:12,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:57:12,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
12: [2022-11-26 01:57:12,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 28: [2022-11-26 01:57:12,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:57:12,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 31: [2022-11-26 01:57:12,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:57:12,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 28: [2022-11-26 01:57:12,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 1: [2022-11-26 01:57:12,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 28: [2022-11-26 01:57:12,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 21: [2022-11-26 01:57:12,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:57:12,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 21: [2022-11-26 01:57:12,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 31: [2022-11-26 01:57:12,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
21: [2022-11-26 01:57:12,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 3: [2022-11-26 01:57:12,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:57:12,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:57:12,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 10: [2022-11-26 01:57:12,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 3: [2022-11-26 01:57:12,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 10: [2022-11-26 01:57:12,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 27: [2022-11-26 01:57:12,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:57:12,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 01:57:12,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 21: [2022-11-26 01:57:12,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-26 01:57:12,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 31: [2022-11-26 01:57:12,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:57:12,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 19: [2022-11-26 01:57:12,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:57:12,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 19: [2022-11-26 01:57:12,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 31: [2022-11-26 01:57:12,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 19: [2022-11-26 01:57:12,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 18: [2022-11-26 01:57:12,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:57:12,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 01:57:12,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 23: [2022-11-26 01:57:12,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
23: [2022-11-26 01:57:12,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 01:57:12,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 9: [2022-11-26 01:57:12,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:57:12,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 01:57:12,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 0: [2022-11-26 01:57:12,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:57:12,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 01:57:12,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 28: [2022-11-26 01:57:12,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:57:12,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:57:12,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 01:57:12,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
8: [2022-11-26 01:57:12,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:57:12,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 01:57:12,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 25: [2022-11-26 01:57:12,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:57:12,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 14: [2022-11-26 01:57:12,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:57:12,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 01:57:12,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 25: [2022-11-26 01:57:12,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 10: [2022-11-26 01:57:12,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:57:12,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 01:57:12,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
8: [2022-11-26 01:57:12,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:57:12,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 01:57:12,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 16: [2022-11-26 01:57:12,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:57:12,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 7: [2022-11-26 01:57:12,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:57:12,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 7: [2022-11-26 01:57:12,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 01:57:12,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 11: [2022-11-26 01:57:12,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:57:12,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 01:57:12,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
11: [2022-11-26 01:57:12,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:57:12,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 01:57:12,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 25: [2022-11-26 01:57:12,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:57:12,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 24: [2022-11-26 01:57:12,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:57:12,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 24: [2022-11-26 01:57:12,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 01:57:12,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 31: [2022-11-26 01:57:12,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:57:12,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 01:57:12,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
19: [2022-11-26 01:57:12,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:57:12,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 19: [2022-11-26 01:57:12,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 28: [2022-11-26 01:57:12,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 19: [2022-11-26 01:57:12,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 28: [2022-11-26 01:57:12,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:57:12,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:57:12,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:57:12,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:57:12,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 27: [2022-11-26 01:57:12,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
1: [2022-11-26 01:57:12,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 2: [2022-11-26 01:57:12,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 15: [2022-11-26 01:57:12,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:57:12,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 27: [2022-11-26 01:57:12,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 0: [2022-11-26 01:57:12,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 1: [2022-11-26 01:57:12,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 2: [2022-11-26 01:57:12,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 15: [2022-11-26 01:57:12,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 27: [2022-11-26 01:57:12,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 15: [2022-11-26 01:57:12,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 0: [2022-11-26 01:57:12,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
20: [2022-11-26 01:57:12,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:57:12,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:57:12,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:57:12,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 01:57:12,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 01:57:12,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 26: [2022-11-26 01:57:12,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:57:12,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:57:12,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 20: [2022-11-26 01:57:12,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 20: [2022-11-26 01:57:12,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
26: [2022-11-26 01:57:12,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 01:57:12,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 01:57:12,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 26: [2022-11-26 01:57:12,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 31: [2022-11-26 01:57:12,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:57:12,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 01:57:12,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 29: [2022-11-26 01:57:12,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:57:12,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 01:57:12,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 29: [2022-11-26 01:57:12,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
29: [2022-11-26 01:57:12,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 01:57:12,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 7: [2022-11-26 01:57:12,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:57:12,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 01:57:12,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 18: [2022-11-26 01:57:12,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:57:12,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:57:12,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 01:57:12,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 3: [2022-11-26 01:57:12,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 01:57:12,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 12: [2022-11-26 01:57:12,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 01:57:12,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 01:57:12,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 30: [2022-11-26 01:57:12,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:57:12,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 01:57:12,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 20: [2022-11-26 01:57:12,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:57:12,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 01:57:12,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 30: [2022-11-26 01:57:12,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:57:12,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 01:57:12,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 16: [2022-11-26 01:57:12,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
16: [2022-11-26 01:57:12,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 01:57:12,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 0: [2022-11-26 01:57:12,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:57:12,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 01:57:12,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 19: [2022-11-26 01:57:12,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:57:12,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 01:57:12,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 9: [2022-11-26 01:57:12,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:57:12,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 01:57:12,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 10: [2022-11-26 01:57:12,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 01:57:12,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 01:57:12,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 21: [2022-11-26 01:57:12,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:57:12,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 01:57:12,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 14: [2022-11-26 01:57:12,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:57:12,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 01:57:12,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 22: [2022-11-26 01:57:12,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:57:12,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 01:57:12,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 22: [2022-11-26 01:57:12,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
22: [2022-11-26 01:57:12,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 01:57:12,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 24: [2022-11-26 01:57:12,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:57:12,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 01:57:12,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 13: [2022-11-26 01:57:12,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:57:12,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 01:57:12,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 23: [2022-11-26 01:57:12,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:57:12,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 01:57:12,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 17: [2022-11-26 01:57:12,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 
17: [2022-11-26 01:57:12,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:57:12,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:57:12,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 01:57:12,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 01:57:12,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 01:57:12,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 17: [2022-11-26 01:57:12,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 17: [2022-11-26 01:57:12,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 0: [2022-11-26 01:57:12,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 17: [2022-11-26 01:57:12,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:57:12,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
17: [2022-11-26 01:57:12,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 01:57:12,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 1: [2022-11-26 01:57:12,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:57:12,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 01:57:12,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 6: [2022-11-26 01:57:12,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:57:12,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:57:12,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 01:57:12,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 2: [2022-11-26 01:57:12,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 01:57:12,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 6: [2022-11-26 01:57:12,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 01:57:12,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 01:57:12,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 4: [2022-11-26 01:57:12,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:57:12,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:57:12,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:57:12,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 01:57:12,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 01:57:12,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 01:57:12,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 4: [2022-11-26 01:57:12,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 4: [2022-11-26 01:57:12,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 5: [2022-11-26 01:57:12,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-26 01:57:12,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:57:12,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 01:57:12,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 5: [2022-11-26 01:57:12,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 01:57:12,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 5: [2022-11-26 01:57:12,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:57:12,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 01:57:12,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 29: [2022-11-26 01:57:12,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:57:12,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 01:57:12,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 9: [2022-11-26 01:57:12,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-26 01:57:12,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 01:57:12,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 23: [2022-11-26 01:57:12,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:57:12,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 01:57:12,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 15: [2022-11-26 01:57:12,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:57:12,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 01:57:12,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 20: [2022-11-26 01:57:12,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:57:12,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 01:57:12,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 12: [2022-11-26 01:57:12,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 01:57:12,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 01:57:12,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 25: [2022-11-26 01:57:12,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:57:12,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 01:57:12,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 8: [2022-11-26 01:57:12,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:57:12,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 01:57:12,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 21: [2022-11-26 01:57:12,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:57:12,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 01:57:12,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 18: [2022-11-26 01:57:12,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
18: [2022-11-26 01:57:12,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 01:57:12,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 17: [2022-11-26 01:57:12,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:57:12,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 01:57:12,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 16: [2022-11-26 01:57:12,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:57:12,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 01:57:12,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 24: [2022-11-26 01:57:12,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:57:12,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
24: [2022-11-26 01:57:12,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 11: [2022-11-26 01:57:12,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 01:57:12,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 24: [2022-11-26 01:57:12,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 27: [2022-11-26 01:57:12,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:57:12,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 01:57:12,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 30: [2022-11-26 01:57:12,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:57:12,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 01:57:12,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 13: [2022-11-26 01:57:12,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 01:57:12,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 01:57:12,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 0: [2022-11-26 01:57:12,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:57:12,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 01:57:12,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 31: [2022-11-26 01:57:12,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:57:12,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 01:57:12,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 3: [2022-11-26 01:57:12,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:57:12,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 01:57:12,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 22: [2022-11-26 01:57:12,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-26 01:57:12,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 01:57:12,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 10: [2022-11-26 01:57:12,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:57:12,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 01:57:12,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 7: [2022-11-26 01:57:12,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:57:12,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 01:57:12,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 26: [2022-11-26 01:57:12,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:57:12,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 01:57:12,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 2: [2022-11-26 01:57:12,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-26 01:57:12,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 01:57:12,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 14: [2022-11-26 01:57:12,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:57:12,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:57:12,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 01:57:12,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 6: [2022-11-26 01:57:12,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 01:57:12,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 4: [2022-11-26 01:57:12,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:57:12,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 01:57:12,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 5: [2022-11-26 01:57:12,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-26 01:57:12,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 01:57:12,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 28: [2022-11-26 01:57:12,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:57:12,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:57:12,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 28: [2022-11-26 01:57:12,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 1: [2022-11-26 01:57:12,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 28: [2022-11-26 01:57:12,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 25: [2022-11-26 01:57:12,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:57:12,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 01:57:12,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 15: [2022-11-26 01:57:12,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-26 01:57:12,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 01:57:12,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 19: [2022-11-26 01:57:12,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:57:12,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 01:57:12,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 29: [2022-11-26 01:57:12,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:57:12,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 01:57:12,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 20: [2022-11-26 01:57:12,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:57:12,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 01:57:12,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 23: [2022-11-26 01:57:12,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-26 01:57:12,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 01:57:12,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 8: [2022-11-26 01:57:12,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:57:12,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 01:57:12,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 12: [2022-11-26 01:57:12,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:57:12,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 01:57:12,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 21: [2022-11-26 01:57:12,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:57:12,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 01:57:12,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 27: [2022-11-26 01:57:12,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-26 01:57:12,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 01:57:12,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 18: [2022-11-26 01:57:12,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:57:12,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 01:57:12,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 17: [2022-11-26 01:57:12,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:57:12,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 01:57:12,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 16: [2022-11-26 01:57:12,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 01:57:12,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 01:57:12,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 30: [2022-11-26 01:57:12,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 
30: [2022-11-26 01:57:12,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 01:57:12,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 11: [2022-11-26 01:57:12,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:57:12,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 01:57:12,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 24: [2022-11-26 01:57:12,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:57:12,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 01:57:12,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 7: [2022-11-26 01:57:12,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:57:12,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 01:57:12,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 31: [2022-11-26 01:57:12,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
31: [2022-11-26 01:57:12,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 01:57:12,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 10: [2022-11-26 01:57:12,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:57:12,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 01:57:12,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 9: [2022-11-26 01:57:12,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:57:12,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 01:57:12,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 22: [2022-11-26 01:57:12,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:57:12,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 01:57:12,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 13: [2022-11-26 01:57:12,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 01:57:12,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 01:57:12,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 0: [2022-11-26 01:57:12,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:57:12,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 01:57:12,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 3: [2022-11-26 01:57:12,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:57:12,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 01:57:12,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 19: [2022-11-26 01:57:12,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:57:12,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 01:57:12,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 15: [2022-11-26 01:57:12,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-26 01:57:12,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 01:57:12,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 25: [2022-11-26 01:57:12,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 01:57:12,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 01:57:12,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 28: [2022-11-26 01:57:12,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:57:12,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 01:57:12,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 2: [2022-11-26 01:57:12,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:57:12,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 01:57:12,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 8: [2022-11-26 01:57:12,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-26 01:57:12,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 26: [2022-11-26 01:57:12,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:57:12,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 26: [2022-11-26 01:57:12,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 01:57:12,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 6: [2022-11-26 01:57:12,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:57:12,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:57:12,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 01:57:12,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 14: [2022-11-26 01:57:12,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 1: [2022-11-26 01:57:12,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:57:12,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
1: [2022-11-26 01:57:12,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 01:57:12,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 5: [2022-11-26 01:57:12,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:57:12,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 01:57:12,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 29: [2022-11-26 01:57:12,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:57:12,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:57:12,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 01:57:12,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 12: [2022-11-26 01:57:12,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 01:57:12,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 23: [2022-11-26 01:57:12,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
23: [2022-11-26 01:57:12,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 01:57:12,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 4: [2022-11-26 01:57:12,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:57:12,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 01:57:12,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 7: [2022-11-26 01:57:12,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:57:12,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 01:57:12,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 17: [2022-11-26 01:57:12,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:57:12,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 01:57:12,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 18: [2022-11-26 01:57:12,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
18: [2022-11-26 01:57:12,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 01:57:12,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 21: [2022-11-26 01:57:12,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:57:12,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 01:57:12,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 27: [2022-11-26 01:57:12,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:57:12,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 01:57:12,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 11: [2022-11-26 01:57:12,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:57:12,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 01:57:12,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 10: [2022-11-26 01:57:12,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 01:57:12,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 01:57:12,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 31: [2022-11-26 01:57:12,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:57:12,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 01:57:12,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 24: [2022-11-26 01:57:12,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 01:57:12,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 01:57:12,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 30: [2022-11-26 01:57:12,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 01:57:12,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 01:57:12,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 16: [2022-11-26 01:57:12,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 
16: [2022-11-26 01:57:12,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 01:57:12,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 22: [2022-11-26 01:57:12,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:57:12,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 01:57:12,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 3: [2022-11-26 01:57:12,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:57:12,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 01:57:12,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 26: [2022-11-26 01:57:12,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:57:12,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 01:57:12,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 9: [2022-11-26 01:57:12,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-26 01:57:12,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 01:57:12,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 0: [2022-11-26 01:57:12,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:57:12,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 01:57:12,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 13: [2022-11-26 01:57:12,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 19: [2022-11-26 01:57:12,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:57:12,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 19: [2022-11-26 01:57:12,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 13: [2022-11-26 01:57:12,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 19: [2022-11-26 01:57:12,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 10: [2022-11-26 01:57:12,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
7: [2022-11-26 01:57:12,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:57:12,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 10: [2022-11-26 01:57:12,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 7: [2022-11-26 01:57:12,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 10: [2022-11-26 01:57:12,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 4: [2022-11-26 01:57:12,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:57:12,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:57:12,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 8: [2022-11-26 01:57:12,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:57:12,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
14: [2022-11-26 01:57:12,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 8: [2022-11-26 01:57:12,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 14: [2022-11-26 01:57:12,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 8: [2022-11-26 01:57:12,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 15: [2022-11-26 01:57:12,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:57:12,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 01:57:12,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 18: [2022-11-26 01:57:12,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 01:57:12,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 01:57:12,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 25: [2022-11-26 01:57:12,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:57:12,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
25: [2022-11-26 01:57:12,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 01:57:12,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 20: [2022-11-26 01:57:12,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 3: [2022-11-26 01:57:12,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:57:12,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 3: [2022-11-26 01:57:12,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 01:57:12,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 6: [2022-11-26 01:57:12,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:57:12,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 01:57:12,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 30: [2022-11-26 01:57:12,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-26 01:57:12,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 01:57:12,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 17: [2022-11-26 01:57:12,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 23: [2022-11-26 01:57:12,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 17: [2022-11-26 01:57:12,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 01:57:12,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 23: [2022-11-26 01:57:12,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 01:57:12,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 31: [2022-11-26 01:57:12,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:57:12,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 16: [2022-11-26 01:57:12,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
1: [2022-11-26 01:57:12,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 31: [2022-11-26 01:57:12,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 16: [2022-11-26 01:57:12,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 1: [2022-11-26 01:57:12,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 16: [2022-11-26 01:57:12,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 1: [2022-11-26 01:57:12,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 9: [2022-11-26 01:57:12,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:57:12,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:57:12,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 24: [2022-11-26 01:57:12,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 21: [2022-11-26 01:57:12,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 01:57:12,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
9: [2022-11-26 01:57:12,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 24: [2022-11-26 01:57:12,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 01:57:12,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 11: [2022-11-26 01:57:12,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 22: [2022-11-26 01:57:12,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:57:12,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 22: [2022-11-26 01:57:12,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 11: [2022-11-26 01:57:12,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 22: [2022-11-26 01:57:12,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 29: [2022-11-26 01:57:12,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 01:57:12,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 01:57:12,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
27: [2022-11-26 01:57:12,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 01:57:12,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 01:57:12,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 28: [2022-11-26 01:57:12,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:57:12,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 01:57:12,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 26: [2022-11-26 01:57:12,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:57:12,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 2: [2022-11-26 01:57:12,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 26: [2022-11-26 01:57:12,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 2: [2022-11-26 01:57:12,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 01:57:12,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
20: [2022-11-26 01:57:12,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:57:12,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 20: [2022-11-26 01:57:12,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 14: [2022-11-26 01:57:12,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 20: [2022-11-26 01:57:12,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 14: [2022-11-26 01:57:12,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 13: [2022-11-26 01:57:12,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:57:12,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 01:57:12,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 28: [2022-11-26 01:57:12,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 01:57:12,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 01:57:12,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
12: [2022-11-26 01:57:12,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:57:12,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 01:57:12,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 0: [2022-11-26 01:57:12,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:57:12,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 01:57:12,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 5: [2022-11-26 01:57:12,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:57:12,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 01:57:12,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 2: [2022-11-26 01:57:12,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:57:12,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 01:57:12,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
4: [2022-11-26 01:57:12,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:57:12,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 01:57:12,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 4: [2022-11-26 01:57:12,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:57:12,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 01:57:12,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 5: [2022-11-26 01:57:12,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:57:12,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 01:57:12,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 5: [2022-11-26 01:57:12,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:57:12,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step35000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 01:57:12,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
0: successfully saved checkpoint at iteration 35000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2606.83 31: iteration 35010/ 173500 | consumed samples: 8962560 | consumed tokens: 18355322880 | elapsed time per iteration (s): 1.02 | learning rate: 1.838E-04 | global batch size: 256 | lm loss: 2.153268E+00 | grad norm: 0.700 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.474 | TFLOPs: 15.21 | 31: iteration 35020/ 173500 | consumed samples: 8965120 | consumed tokens: 18360565760 | elapsed time per iteration (s): 0.78 | learning rate: 1.838E-04 | global batch size: 256 | lm loss: 2.196967E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.485 | TFLOPs: 19.81 | 31: iteration 35030/ 173500 | consumed samples: 8967680 | consumed tokens: 18365808640 | elapsed time per iteration (s): 0.76 | learning rate: 1.838E-04 | global batch size: 256 | lm loss: 2.165328E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.057 | TFLOPs: 20.27 | 31: iteration 35040/ 173500 | consumed samples: 8970240 | consumed tokens: 18371051520 | elapsed time per iteration (s): 0.75 | learning rate: 1.838E-04 | global batch size: 256 | lm loss: 2.134950E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.517 | TFLOPs: 20.72 | 31: iteration 35050/ 173500 | consumed samples: 8972800 | consumed tokens: 18376294400 | elapsed time per iteration (s): 0.78 | learning rate: 1.838E-04 | global batch size: 256 | lm loss: 2.153524E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.436 | TFLOPs: 19.93 | 31: iteration 35060/ 173500 | consumed samples: 8975360 | consumed tokens: 18381537280 | elapsed time per iteration (s): 0.78 | 
learning rate: 1.838E-04 | global batch size: 256 | lm loss: 2.120165E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.210 | TFLOPs: 19.80 | 31: iteration 35070/ 173500 | consumed samples: 8977920 | consumed tokens: 18386780160 | elapsed time per iteration (s): 0.79 | learning rate: 1.838E-04 | global batch size: 256 | lm loss: 2.138029E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.487 | TFLOPs: 19.57 | 31: iteration 35080/ 173500 | consumed samples: 8980480 | consumed tokens: 18392023040 | elapsed time per iteration (s): 0.76 | learning rate: 1.838E-04 | global batch size: 256 | lm loss: 2.097329E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.706 | TFLOPs: 20.31 | 31: iteration 35090/ 173500 | consumed samples: 8983040 | consumed tokens: 18397265920 | elapsed time per iteration (s): 3.52 | learning rate: 1.838E-04 | global batch size: 256 | lm loss: 2.115243E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 72.734 | TFLOPs: 4.40 | 31: iteration 35100/ 173500 | consumed samples: 8985600 | consumed tokens: 18402508800 | elapsed time per iteration (s): 0.78 | learning rate: 1.838E-04 | global batch size: 256 | lm loss: 2.095662E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.815 | TFLOPs: 19.77 | 31: iteration 35110/ 173500 | consumed samples: 8988160 | consumed tokens: 18407751680 | elapsed time per iteration (s): 0.82 | learning rate: 1.837E-04 | global batch size: 256 | lm loss: 2.103148E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.509 | TFLOPs: 18.91 | 31: iteration 35120/ 173500 | 
consumed samples: 8990720 | consumed tokens: 18412994560 | elapsed time per iteration (s): 0.79 | learning rate: 1.837E-04 | global batch size: 256 | lm loss: 2.115582E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.625 | TFLOPs: 19.52 | 31: iteration 35130/ 173500 | consumed samples: 8993280 | consumed tokens: 18418237440 | elapsed time per iteration (s): 0.75 | learning rate: 1.837E-04 | global batch size: 256 | lm loss: 2.128434E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.304 | TFLOPs: 20.59 | 31: iteration 35140/ 173500 | consumed samples: 8995840 | consumed tokens: 18423480320 | elapsed time per iteration (s): 0.79 | learning rate: 1.837E-04 | global batch size: 256 | lm loss: 2.126215E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.224 | TFLOPs: 19.55 | 31: iteration 35150/ 173500 | consumed samples: 8998400 | consumed tokens: 18428723200 | elapsed time per iteration (s): 0.76 | learning rate: 1.837E-04 | global batch size: 256 | lm loss: 2.123855E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.905 | TFLOPs: 20.44 | 31: iteration 35160/ 173500 | consumed samples: 9000960 | consumed tokens: 18433966080 | elapsed time per iteration (s): 0.81 | learning rate: 1.837E-04 | global batch size: 256 | lm loss: 2.110567E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.005 | TFLOPs: 19.12 | 31: iteration 35170/ 173500 | consumed samples: 9003520 | consumed tokens: 18439208960 | elapsed time per iteration (s): 0.74 | learning rate: 1.837E-04 | global batch size: 256 | lm loss: 2.098735E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 344.452 | TFLOPs: 20.84 | 31: iteration 35180/ 173500 | consumed samples: 9006080 | consumed tokens: 18444451840 | elapsed time per iteration (s): 0.81 | learning rate: 1.837E-04 | global batch size: 256 | lm loss: 2.112627E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.978 | TFLOPs: 19.06 | 31: iteration 35190/ 173500 | consumed samples: 9008640 | consumed tokens: 18449694720 | elapsed time per iteration (s): 0.76 | learning rate: 1.837E-04 | global batch size: 256 | lm loss: 2.129848E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.241 | TFLOPs: 20.34 | 31: iteration 35200/ 173500 | consumed samples: 9011200 | consumed tokens: 18454937600 | elapsed time per iteration (s): 0.80 | learning rate: 1.837E-04 | global batch size: 256 | lm loss: 2.131634E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.240 | TFLOPs: 19.31 | 31: iteration 35210/ 173500 | consumed samples: 9013760 | consumed tokens: 18460180480 | elapsed time per iteration (s): 0.78 | learning rate: 1.837E-04 | global batch size: 256 | lm loss: 2.119434E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.269 | TFLOPs: 19.92 | 31: iteration 35220/ 173500 | consumed samples: 9016320 | consumed tokens: 18465423360 | elapsed time per iteration (s): 0.79 | learning rate: 1.836E-04 | global batch size: 256 | lm loss: 2.123529E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.448 | TFLOPs: 19.63 | 31: iteration 35230/ 173500 | consumed samples: 9018880 | consumed tokens: 18470666240 | elapsed time per iteration (s): 0.78 | learning rate: 1.836E-04 | global batch size: 
256 | lm loss: 2.121705E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.047 | TFLOPs: 19.85 | 31: iteration 35240/ 173500 | consumed samples: 9021440 | consumed tokens: 18475909120 | elapsed time per iteration (s): 0.79 | learning rate: 1.836E-04 | global batch size: 256 | lm loss: 2.106170E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.885 | TFLOPs: 19.72 | 31: iteration 35250/ 173500 | consumed samples: 9024000 | consumed tokens: 18481152000 | elapsed time per iteration (s): 0.76 | learning rate: 1.836E-04 | global batch size: 256 | lm loss: 2.117970E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.468 | TFLOPs: 20.29 | 31: iteration 35260/ 173500 | consumed samples: 9026560 | consumed tokens: 18486394880 | elapsed time per iteration (s): 0.73 | learning rate: 1.836E-04 | global batch size: 256 | lm loss: 2.098602E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.597 | TFLOPs: 21.15 | 31: iteration 35270/ 173500 | consumed samples: 9029120 | consumed tokens: 18491637760 | elapsed time per iteration (s): 0.74 | learning rate: 1.836E-04 | global batch size: 256 | lm loss: 2.109696E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.419 | TFLOPs: 20.96 | 31: iteration 35280/ 173500 | consumed samples: 9031680 | consumed tokens: 18496880640 | elapsed time per iteration (s): 0.80 | learning rate: 1.836E-04 | global batch size: 256 | lm loss: 2.134779E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.385 | TFLOPs: 19.32 | 31: iteration 35290/ 173500 | consumed samples: 9034240 | consumed 
tokens: 18502123520 | elapsed time per iteration (s): 2.15 | learning rate: 1.836E-04 | global batch size: 256 | lm loss: 2.094643E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.824 | TFLOPs: 7.19 | 31: iteration 35300/ 173500 | consumed samples: 9036800 | consumed tokens: 18507366400 | elapsed time per iteration (s): 0.83 | learning rate: 1.836E-04 | global batch size: 256 | lm loss: 2.129615E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.993 | TFLOPs: 18.69 | 31: iteration 35310/ 173500 | consumed samples: 9039360 | consumed tokens: 18512609280 | elapsed time per iteration (s): 0.78 | learning rate: 1.836E-04 | global batch size: 256 | lm loss: 2.105584E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.606 | TFLOPs: 19.88 | 31: iteration 35320/ 173500 | consumed samples: 9041920 | consumed tokens: 18517852160 | elapsed time per iteration (s): 0.86 | learning rate: 1.835E-04 | global batch size: 256 | lm loss: 2.140123E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.797 | TFLOPs: 18.02 | 31: iteration 35330/ 173500 | consumed samples: 9044480 | consumed tokens: 18523095040 | elapsed time per iteration (s): 0.94 | learning rate: 1.835E-04 | global batch size: 256 | lm loss: 2.090384E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 273.569 | TFLOPs: 16.55 | 31: iteration 35340/ 173500 | consumed samples: 9047040 | consumed tokens: 18528337920 | elapsed time per iteration (s): 0.81 | learning rate: 1.835E-04 | global batch size: 256 | lm loss: 2.131421E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 316.625 | TFLOPs: 19.15 | 31: iteration 35350/ 173500 | consumed samples: 9049600 | consumed tokens: 18533580800 | elapsed time per iteration (s): 0.72 | learning rate: 1.835E-04 | global batch size: 256 | lm loss: 2.112675E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.887 | TFLOPs: 21.65 | 31: iteration 35360/ 173500 | consumed samples: 9052160 | consumed tokens: 18538823680 | elapsed time per iteration (s): 0.78 | learning rate: 1.835E-04 | global batch size: 256 | lm loss: 2.100821E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.261 | TFLOPs: 19.74 | 31: iteration 35370/ 173500 | consumed samples: 9054720 | consumed tokens: 18544066560 | elapsed time per iteration (s): 0.75 | learning rate: 1.835E-04 | global batch size: 256 | lm loss: 2.101332E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.532 | TFLOPs: 20.78 | 31: iteration 35380/ 173500 | consumed samples: 9057280 | consumed tokens: 18549309440 | elapsed time per iteration (s): 0.78 | learning rate: 1.835E-04 | global batch size: 256 | lm loss: 2.123652E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.801 | TFLOPs: 19.77 | 31: iteration 35390/ 173500 | consumed samples: 9059840 | consumed tokens: 18554552320 | elapsed time per iteration (s): 0.72 | learning rate: 1.835E-04 | global batch size: 256 | lm loss: 2.133949E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.095 | TFLOPs: 21.42 | 31: iteration 35400/ 173500 | consumed samples: 9062400 | consumed tokens: 18559795200 | elapsed time per iteration (s): 0.76 | learning rate: 1.835E-04 | global batch size: 256 | lm loss: 2.134080E+00 | grad norm: 
0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.393 | TFLOPs: 20.41 | 31: iteration 35410/ 173500 | consumed samples: 9064960 | consumed tokens: 18565038080 | elapsed time per iteration (s): 0.76 | learning rate: 1.835E-04 | global batch size: 256 | lm loss: 2.096600E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.163 | TFLOPs: 20.46 | 31: iteration 35420/ 173500 | consumed samples: 9067520 | consumed tokens: 18570280960 | elapsed time per iteration (s): 0.77 | learning rate: 1.835E-04 | global batch size: 256 | lm loss: 2.125973E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.377 | TFLOPs: 19.99 | 31: iteration 35430/ 173500 | consumed samples: 9070080 | consumed tokens: 18575523840 | elapsed time per iteration (s): 0.78 | learning rate: 1.834E-04 | global batch size: 256 | lm loss: 2.109072E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.972 | TFLOPs: 19.96 | 31: iteration 35440/ 173500 | consumed samples: 9072640 | consumed tokens: 18580766720 | elapsed time per iteration (s): 0.78 | learning rate: 1.834E-04 | global batch size: 256 | lm loss: 2.082857E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.559 | TFLOPs: 19.82 | 31: iteration 35450/ 173500 | consumed samples: 9075200 | consumed tokens: 18586009600 | elapsed time per iteration (s): 0.74 | learning rate: 1.834E-04 | global batch size: 256 | lm loss: 2.085677E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.929 | TFLOPs: 20.81 | 31: iteration 35460/ 173500 | consumed samples: 9077760 | consumed tokens: 18591252480 | elapsed time per 
iteration (s): 0.76 | learning rate: 1.834E-04 | global batch size: 256 | lm loss: 2.099537E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.204 | TFLOPs: 20.28 | 31: iteration 35470/ 173500 | consumed samples: 9080320 | consumed tokens: 18596495360 | elapsed time per iteration (s): 0.76 | learning rate: 1.834E-04 | global batch size: 256 | lm loss: 2.118569E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.854 | TFLOPs: 20.32 | 31: iteration 35480/ 173500 | consumed samples: 9082880 | consumed tokens: 18601738240 | elapsed time per iteration (s): 0.75 | learning rate: 1.834E-04 | global batch size: 256 | lm loss: 2.129994E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.217 | TFLOPs: 20.64 | 31: iteration 35490/ 173500 | consumed samples: 9085440 | consumed tokens: 18606981120 | elapsed time per iteration (s): 0.75 | learning rate: 1.834E-04 | global batch size: 256 | lm loss: 2.141066E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.198 | TFLOPs: 20.70 | 31: iteration 35500/ 173500 | consumed samples: 9088000 | consumed tokens: 18612224000 | elapsed time per iteration (s): 0.81 | learning rate: 1.834E-04 | global batch size: 256 | lm loss: 2.114731E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.890 | TFLOPs: 19.23 | 31: iteration 35510/ 173500 | consumed samples: 9090560 | consumed tokens: 18617466880 | elapsed time per iteration (s): 0.80 | learning rate: 1.834E-04 | global batch size: 256 | lm loss: 2.115293E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.140 | TFLOPs: 19.37 | 31: 
iteration 35520/ 173500 | consumed samples: 9093120 | consumed tokens: 18622709760 | elapsed time per iteration (s): 0.83 | learning rate: 1.834E-04 | global batch size: 256 | lm loss: 2.113300E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.340 | TFLOPs: 18.71 | 31: iteration 35530/ 173500 | consumed samples: 9095680 | consumed tokens: 18627952640 | elapsed time per iteration (s): 0.79 | learning rate: 1.833E-04 | global batch size: 256 | lm loss: 2.108611E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.062 | TFLOPs: 19.73 | 31: iteration 35540/ 173500 | consumed samples: 9098240 | consumed tokens: 18633195520 | elapsed time per iteration (s): 0.80 | learning rate: 1.833E-04 | global batch size: 256 | lm loss: 2.099792E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.183 | TFLOPs: 19.43 | 31: iteration 35550/ 173500 | consumed samples: 9100800 | consumed tokens: 18638438400 | elapsed time per iteration (s): 0.84 | learning rate: 1.833E-04 | global batch size: 256 | lm loss: 2.129761E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.870 | TFLOPs: 18.50 | 31: iteration 35560/ 173500 | consumed samples: 9103360 | consumed tokens: 18643681280 | elapsed time per iteration (s): 0.81 | learning rate: 1.833E-04 | global batch size: 256 | lm loss: 2.125889E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.521 | TFLOPs: 19.09 | 31: iteration 35570/ 173500 | consumed samples: 9105920 | consumed tokens: 18648924160 | elapsed time per iteration (s): 0.80 | learning rate: 1.833E-04 | global batch size: 256 | lm loss: 2.140051E+00 | grad norm: 0.134 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.192 | TFLOPs: 19.31 | 31: iteration 35580/ 173500 | consumed samples: 9108480 | consumed tokens: 18654167040 | elapsed time per iteration (s): 0.79 | learning rate: 1.833E-04 | global batch size: 256 | lm loss: 2.104734E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.269 | TFLOPs: 19.56 | 31: iteration 35590/ 173500 | consumed samples: 9111040 | consumed tokens: 18659409920 | elapsed time per iteration (s): 0.79 | learning rate: 1.833E-04 | global batch size: 256 | lm loss: 2.068317E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.689 | TFLOPs: 19.58 | 31: iteration 35600/ 173500 | consumed samples: 9113600 | consumed tokens: 18664652800 | elapsed time per iteration (s): 0.81 | learning rate: 1.833E-04 | global batch size: 256 | lm loss: 2.124296E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.169 | TFLOPs: 19.07 | 31: iteration 35610/ 173500 | consumed samples: 9116160 | consumed tokens: 18669895680 | elapsed time per iteration (s): 0.82 | learning rate: 1.833E-04 | global batch size: 256 | lm loss: 2.119579E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.433 | TFLOPs: 18.84 | 31: iteration 35620/ 173500 | consumed samples: 9118720 | consumed tokens: 18675138560 | elapsed time per iteration (s): 0.83 | learning rate: 1.833E-04 | global batch size: 256 | lm loss: 2.106065E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.756 | TFLOPs: 18.62 | 31: iteration 35630/ 173500 | consumed samples: 9121280 | consumed tokens: 18680381440 | elapsed time per iteration (s): 0.76 | learning rate: 
1.833E-04 | global batch size: 256 | lm loss: 2.120902E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.686 | TFLOPs: 20.25 | 31: iteration 35640/ 173500 | consumed samples: 9123840 | consumed tokens: 18685624320 | elapsed time per iteration (s): 0.80 | learning rate: 1.832E-04 | global batch size: 256 | lm loss: 2.107339E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.957 | TFLOPs: 19.30 | 31: iteration 35650/ 173500 | consumed samples: 9126400 | consumed tokens: 18690867200 | elapsed time per iteration (s): 0.83 | learning rate: 1.832E-04 | global batch size: 256 | lm loss: 2.102551E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.897 | TFLOPs: 18.75 | 31: iteration 35660/ 173500 | consumed samples: 9128960 | consumed tokens: 18696110080 | elapsed time per iteration (s): 0.80 | learning rate: 1.832E-04 | global batch size: 256 | lm loss: 2.127419E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.720 | TFLOPs: 19.46 | 31: iteration 35670/ 173500 | consumed samples: 9131520 | consumed tokens: 18701352960 | elapsed time per iteration (s): 0.83 | learning rate: 1.832E-04 | global batch size: 256 | lm loss: 2.111866E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.952 | TFLOPs: 18.63 | 31: iteration 35680/ 173500 | consumed samples: 9134080 | consumed tokens: 18706595840 | elapsed time per iteration (s): 0.79 | learning rate: 1.832E-04 | global batch size: 256 | lm loss: 2.125633E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.568 | TFLOPs: 19.70 | 31: iteration 35690/ 173500 | consumed 
samples: 9136640 | consumed tokens: 18711838720 | elapsed time per iteration (s): 0.83 | learning rate: 1.832E-04 | global batch size: 256 | lm loss: 2.084352E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.681 | TFLOPs: 18.67 | 31: iteration 35700/ 173500 | consumed samples: 9139200 | consumed tokens: 18717081600 | elapsed time per iteration (s): 0.81 | learning rate: 1.832E-04 | global batch size: 256 | lm loss: 2.099089E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.470 | TFLOPs: 19.09 | 31: iteration 35710/ 173500 | consumed samples: 9141760 | consumed tokens: 18722324480 | elapsed time per iteration (s): 0.76 | learning rate: 1.832E-04 | global batch size: 256 | lm loss: 2.127685E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.996 | TFLOPs: 20.33 | 31: iteration 35720/ 173500 | consumed samples: 9144320 | consumed tokens: 18727567360 | elapsed time per iteration (s): 0.74 | learning rate: 1.832E-04 | global batch size: 256 | lm loss: 2.080995E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.026 | TFLOPs: 20.99 | 31: iteration 35730/ 173500 | consumed samples: 9146880 | consumed tokens: 18732810240 | elapsed time per iteration (s): 0.76 | learning rate: 1.832E-04 | global batch size: 256 | lm loss: 2.135610E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.113 | TFLOPs: 20.45 | 31: iteration 35740/ 173500 | consumed samples: 9149440 | consumed tokens: 18738053120 | elapsed time per iteration (s): 0.78 | learning rate: 1.831E-04 | global batch size: 256 | lm loss: 2.123870E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 329.912 | TFLOPs: 19.96 | 31: iteration 35750/ 173500 | consumed samples: 9152000 | consumed tokens: 18743296000 | elapsed time per iteration (s): 0.77 | learning rate: 1.831E-04 | global batch size: 256 | lm loss: 2.122109E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.394 | TFLOPs: 20.05 | 31: iteration 35760/ 173500 | consumed samples: 9154560 | consumed tokens: 18748538880 | elapsed time per iteration (s): 0.77 | learning rate: 1.831E-04 | global batch size: 256 | lm loss: 2.103018E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.482 | TFLOPs: 20.11 | 31: iteration 35770/ 173500 | consumed samples: 9157120 | consumed tokens: 18753781760 | elapsed time per iteration (s): 0.74 | learning rate: 1.831E-04 | global batch size: 256 | lm loss: 2.097672E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.023 | TFLOPs: 20.99 | 31: iteration 35780/ 173500 | consumed samples: 9159680 | consumed tokens: 18759024640 | elapsed time per iteration (s): 0.74 | learning rate: 1.831E-04 | global batch size: 256 | lm loss: 2.101078E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.421 | TFLOPs: 21.02 | 31: iteration 35790/ 173500 | consumed samples: 9162240 | consumed tokens: 18764267520 | elapsed time per iteration (s): 0.77 | learning rate: 1.831E-04 | global batch size: 256 | lm loss: 2.094562E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.540 | TFLOPs: 20.24 | 31: iteration 35800/ 173500 | consumed samples: 9164800 | consumed tokens: 18769510400 | elapsed time per iteration (s): 0.78 | learning rate: 1.831E-04 | global batch size: 256 | lm 
loss: 2.114076E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.722 | TFLOPs: 19.95 | 31: iteration 35810/ 173500 | consumed samples: 9167360 | consumed tokens: 18774753280 | elapsed time per iteration (s): 0.80 | learning rate: 1.831E-04 | global batch size: 256 | lm loss: 2.118951E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.942 | TFLOPs: 19.42 | 31: iteration 35820/ 173500 | consumed samples: 9169920 | consumed tokens: 18779996160 | elapsed time per iteration (s): 0.75 | learning rate: 1.831E-04 | global batch size: 256 | lm loss: 2.104529E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.926 | TFLOPs: 20.56 | 31: iteration 35830/ 173500 | consumed samples: 9172480 | consumed tokens: 18785239040 | elapsed time per iteration (s): 0.80 | learning rate: 1.831E-04 | global batch size: 256 | lm loss: 2.083234E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.083 | TFLOPs: 19.24 | 31: iteration 35840/ 173500 | consumed samples: 9175040 | consumed tokens: 18790481920 | elapsed time per iteration (s): 0.79 | learning rate: 1.831E-04 | global batch size: 256 | lm loss: 2.086239E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.096 | TFLOPs: 19.49 | 31: iteration 35850/ 173500 | consumed samples: 9177600 | consumed tokens: 18795724800 | elapsed time per iteration (s): 0.84 | learning rate: 1.830E-04 | global batch size: 256 | lm loss: 2.127374E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.449 | TFLOPs: 18.36 | 31: iteration 35860/ 173500 | consumed samples: 9180160 | consumed tokens: 
18800967680 | elapsed time per iteration (s): 0.85 | learning rate: 1.830E-04 | global batch size: 256 | lm loss: 2.107107E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.671 | TFLOPs: 18.25 | 31: iteration 35870/ 173500 | consumed samples: 9182720 | consumed tokens: 18806210560 | elapsed time per iteration (s): 0.82 | learning rate: 1.830E-04 | global batch size: 256 | lm loss: 2.142856E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.259 | TFLOPs: 18.83 | 31: iteration 35880/ 173500 | consumed samples: 9185280 | consumed tokens: 18811453440 | elapsed time per iteration (s): 0.77 | learning rate: 1.830E-04 | global batch size: 256 | lm loss: 2.104789E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.119 | TFLOPs: 20.03 | 31: iteration 35890/ 173500 | consumed samples: 9187840 | consumed tokens: 18816696320 | elapsed time per iteration (s): 0.77 | learning rate: 1.830E-04 | global batch size: 256 | lm loss: 2.107402E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.830 | TFLOPs: 20.14 | 31: iteration 35900/ 173500 | consumed samples: 9190400 | consumed tokens: 18821939200 | elapsed time per iteration (s): 0.73 | learning rate: 1.830E-04 | global batch size: 256 | lm loss: 2.113956E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.813 | TFLOPs: 21.16 | 31: iteration 35910/ 173500 | consumed samples: 9192960 | consumed tokens: 18827182080 | elapsed time per iteration (s): 0.83 | learning rate: 1.830E-04 | global batch size: 256 | lm loss: 2.102857E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
309.306 | TFLOPs: 18.71 | 31: iteration 35920/ 173500 | consumed samples: 9195520 | consumed tokens: 18832424960 | elapsed time per iteration (s): 0.76 | learning rate: 1.830E-04 | global batch size: 256 | lm loss: 2.125921E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.205 | TFLOPs: 20.34 | 31: iteration 35930/ 173500 | consumed samples: 9198080 | consumed tokens: 18837667840 | elapsed time per iteration (s): 0.74 | learning rate: 1.830E-04 | global batch size: 256 | lm loss: 2.092577E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.181 | TFLOPs: 20.88 | 31: iteration 35940/ 173500 | consumed samples: 9200640 | consumed tokens: 18842910720 | elapsed time per iteration (s): 0.77 | learning rate: 1.830E-04 | global batch size: 256 | lm loss: 2.100408E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.040 | TFLOPs: 20.21 | 31: iteration 35950/ 173500 | consumed samples: 9203200 | consumed tokens: 18848153600 | elapsed time per iteration (s): 0.78 | learning rate: 1.829E-04 | global batch size: 256 | lm loss: 2.125151E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.618 | TFLOPs: 19.82 | 31: iteration 35960/ 173500 | consumed samples: 9205760 | consumed tokens: 18853396480 | elapsed time per iteration (s): 0.77 | learning rate: 1.829E-04 | global batch size: 256 | lm loss: 2.133140E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.819 | TFLOPs: 20.07 | 31: iteration 35970/ 173500 | consumed samples: 9208320 | consumed tokens: 18858639360 | elapsed time per iteration (s): 0.74 | learning rate: 1.829E-04 | global batch size: 256 | lm loss: 2.117566E+00 | grad norm: 0.135 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.682 | TFLOPs: 20.85 | 31: iteration 35980/ 173500 | consumed samples: 9210880 | consumed tokens: 18863882240 | elapsed time per iteration (s): 0.76 | learning rate: 1.829E-04 | global batch size: 256 | lm loss: 2.093823E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.695 | TFLOPs: 20.37 | 31: iteration 35990/ 173500 | consumed samples: 9213440 | consumed tokens: 18869125120 | elapsed time per iteration (s): 0.80 | learning rate: 1.829E-04 | global batch size: 256 | lm loss: 2.095325E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.262 | TFLOPs: 19.44 | 0: [2022-11-26 02:10:57,472] [INFO] [logging.py:68:log_dist] [Rank 0] step=36000, skipped=0, lr=[0.00018289669072542715, 0.00018289669072542715, 0.00018289669072542715], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 36000/ 173500 | consumed samples: 9216000 | consumed tokens: 18874368000 | elapsed time per iteration (s): 0.87 | learning rate: 1.829E-04 | global batch size: 256 | lm loss: 2.127676E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.750 | TFLOPs: 17.89 | 0: steps: 36000 loss: 2.1610 iter time (s): 0.807 samples/sec: 317.267 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 36000 | lm loss value: 2.073182E+00 | lm loss PPL: 7.950077E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 36000 to checkpoints_1b1long 0: [2022-11-26 02:10:57,722] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step36000 is begin to save! 
0: [2022-11-26 02:10:57,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_01-model_00-model_states.pt... 0: [2022-11-26 02:10:57,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_01-model_00-model_states.pt. 0: [2022-11-26 02:10:57,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_03-model_00-model_states.pt... 0: [2022-11-26 02:10:58,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_03-model_00-model_states.pt. 0: [2022-11-26 02:10:58,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_04-model_00-model_states.pt... 0: [2022-11-26 02:10:58,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_04-model_00-model_states.pt. 0: [2022-11-26 02:10:58,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_05-model_00-model_states.pt... 0: [2022-11-26 02:10:58,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_05-model_00-model_states.pt. 0: [2022-11-26 02:10:58,207] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_06-model_00-model_states.pt... 0: [2022-11-26 02:10:58,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_06-model_00-model_states.pt. 0: [2022-11-26 02:10:58,284] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_07-model_00-model_states.pt... 0: [2022-11-26 02:10:58,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 02:10:58,360] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_08-model_00-model_states.pt... 0: [2022-11-26 02:10:58,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_08-model_00-model_states.pt. 0: [2022-11-26 02:10:58,437] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_09-model_00-model_states.pt... 0: [2022-11-26 02:10:58,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_09-model_00-model_states.pt. 0: [2022-11-26 02:10:58,512] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_10-model_00-model_states.pt... 0: [2022-11-26 02:10:58,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_10-model_00-model_states.pt. 0: [2022-11-26 02:10:58,587] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_11-model_00-model_states.pt... 0: [2022-11-26 02:10:58,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_11-model_00-model_states.pt. 0: [2022-11-26 02:10:58,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_12-model_00-model_states.pt... 0: [2022-11-26 02:10:58,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_12-model_00-model_states.pt. 0: [2022-11-26 02:10:58,743] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_13-model_00-model_states.pt... 0: [2022-11-26 02:10:58,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 02:10:58,816] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_14-model_00-model_states.pt... 0: [2022-11-26 02:10:58,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_14-model_00-model_states.pt. 0: [2022-11-26 02:10:58,892] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_15-model_00-model_states.pt... 0: [2022-11-26 02:10:58,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_15-model_00-model_states.pt. 0: [2022-11-26 02:10:58,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_16-model_00-model_states.pt... 0: [2022-11-26 02:10:59,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_16-model_00-model_states.pt. 0: [2022-11-26 02:10:59,039] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_17-model_00-model_states.pt... 0: [2022-11-26 02:10:59,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_17-model_00-model_states.pt. 0: [2022-11-26 02:10:59,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_18-model_00-model_states.pt... 0: [2022-11-26 02:10:59,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_18-model_00-model_states.pt. 0: [2022-11-26 02:10:59,187] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_19-model_00-model_states.pt... 0: [2022-11-26 02:10:59,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 02:10:59,262] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_20-model_00-model_states.pt... 0: [2022-11-26 02:10:59,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_20-model_00-model_states.pt. 0: [2022-11-26 02:10:59,335] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_21-model_00-model_states.pt... 0: [2022-11-26 02:10:59,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_21-model_00-model_states.pt. 0: [2022-11-26 02:10:59,410] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_22-model_00-model_states.pt... 0: [2022-11-26 02:10:59,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_22-model_00-model_states.pt. 0: [2022-11-26 02:10:59,485] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_23-model_00-model_states.pt... 0: [2022-11-26 02:10:59,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_23-model_00-model_states.pt. 0: [2022-11-26 02:10:59,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_24-model_00-model_states.pt... 0: [2022-11-26 02:10:59,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_24-model_00-model_states.pt. 0: [2022-11-26 02:10:59,634] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_25-model_00-model_states.pt... 0: [2022-11-26 02:10:59,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 02:10:59,710] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_26-model_00-model_states.pt... 0: [2022-11-26 02:10:59,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_26-model_00-model_states.pt. 0: [2022-11-26 02:10:59,784] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_27-model_00-model_states.pt... 0: [2022-11-26 02:10:59,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_27-model_00-model_states.pt. 0: [2022-11-26 02:10:59,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_28-model_00-model_states.pt... 0: [2022-11-26 02:10:59,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_28-model_00-model_states.pt. 0: [2022-11-26 02:10:59,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/layer_30-model_00-model_states.pt... 0: [2022-11-26 02:10:59,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/layer_30-model_00-model_states.pt. 0: [2022-11-26 02:10:59,936] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step36000/mp_rank_00_model_states.pt 0: [2022-11-26 02:10:59,936] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/mp_rank_00_model_states.pt... 0: [2022-11-26 02:10:59,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/mp_rank_00_model_states.pt. 0: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
5: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
10: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 
13: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 
25: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 
28: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
31: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 
18: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
6: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
9: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
16: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
12: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 
23: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 
29: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 
17: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 
18: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
5: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
10: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 
15: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 
14: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 
21: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
7: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 
28: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 
0: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 
20: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:11:00,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:11:00,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:11:00,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 02:11:00,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 22: [2022-11-26 02:11:00,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:11:00,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 02:11:00,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 31: [2022-11-26 02:11:00,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 02:11:00,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 02:11:00,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 12: [2022-11-26 02:11:00,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 02:11:00,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 1: [2022-11-26 02:11:00,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:11:00,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 1: [2022-11-26 02:11:00,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 02:11:00,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 4: [2022-11-26 02:11:00,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:11:00,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:11:00,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 02:11:00,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 17: [2022-11-26 02:11:00,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 02:11:00,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 21: [2022-11-26 02:11:00,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
21: [2022-11-26 02:11:00,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 02:11:00,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 7: [2022-11-26 02:11:00,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:11:00,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 02:11:00,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 15: [2022-11-26 02:11:00,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:11:00,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 02:11:00,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 25: [2022-11-26 02:11:00,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 31: [2022-11-26 02:11:00,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:11:00,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
31: [2022-11-26 02:11:00,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 14: [2022-11-26 02:11:00,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 25: [2022-11-26 02:11:00,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 14: [2022-11-26 02:11:00,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 31: [2022-11-26 02:11:00,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 25: [2022-11-26 02:11:00,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 14: [2022-11-26 02:11:00,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:11:00,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 02:11:00,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 13: [2022-11-26 02:11:00,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:11:00,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 02:11:00,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
22: [2022-11-26 02:11:00,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:11:00,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 02:11:00,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 27: [2022-11-26 02:11:00,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:11:00,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 25: [2022-11-26 02:11:00,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 27: [2022-11-26 02:11:00,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 15: [2022-11-26 02:11:00,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 25: [2022-11-26 02:11:00,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 27: [2022-11-26 02:11:00,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 15: [2022-11-26 02:11:00,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 25: [2022-11-26 02:11:00,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
24: [2022-11-26 02:11:00,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:11:00,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 25: [2022-11-26 02:11:00,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 02:11:00,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 13: [2022-11-26 02:11:00,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:11:00,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 25: [2022-11-26 02:11:00,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 24: [2022-11-26 02:11:00,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 02:11:00,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 13: [2022-11-26 02:11:00,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 24: [2022-11-26 02:11:00,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 24: [2022-11-26 02:11:00,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
23: [2022-11-26 02:11:00,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:11:00,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 02:11:00,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 1: [2022-11-26 02:11:00,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:11:00,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 02:11:00,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 27: [2022-11-26 02:11:00,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 02:11:00,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 02:11:00,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 21: [2022-11-26 02:11:00,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:11:00,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
21: [2022-11-26 02:11:00,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 02:11:00,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 19: [2022-11-26 02:11:00,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:11:00,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 02:11:00,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 20: [2022-11-26 02:11:00,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 02:11:00,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 02:11:00,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 7: [2022-11-26 02:11:00,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:11:00,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
7: [2022-11-26 02:11:00,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 4: [2022-11-26 02:11:00,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 7: [2022-11-26 02:11:00,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 4: [2022-11-26 02:11:00,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 23: [2022-11-26 02:11:00,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:11:00,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 02:11:00,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 20: [2022-11-26 02:11:00,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 02:11:00,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 02:11:00,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 21: [2022-11-26 02:11:00,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-26 02:11:00,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 02:11:00,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 12: [2022-11-26 02:11:00,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:11:00,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 02:11:00,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 12: [2022-11-26 02:11:00,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:11:00,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 02:11:00,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 0: [2022-11-26 02:11:00,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:11:00,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 02:11:00,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 8: [2022-11-26 02:11:00,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-26 02:11:00,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 02:11:00,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 7: [2022-11-26 02:11:00,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:11:00,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:11:00,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 30: [2022-11-26 02:11:00,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 7: [2022-11-26 02:11:00,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 30: [2022-11-26 02:11:00,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 30: [2022-11-26 02:11:00,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:11:00,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 02:11:00,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 4: [2022-11-26 02:11:00,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 02:11:00,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 02:11:00,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 24: [2022-11-26 02:11:00,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:11:00,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 02:11:00,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 23: [2022-11-26 02:11:00,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:11:00,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 02:11:00,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 14: [2022-11-26 02:11:00,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:11:00,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
14: [2022-11-26 02:11:00,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 30: [2022-11-26 02:11:00,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 14: [2022-11-26 02:11:00,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 30: [2022-11-26 02:11:00,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 22: [2022-11-26 02:11:00,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:11:00,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 27: [2022-11-26 02:11:00,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:11:00,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 1: [2022-11-26 02:11:00,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 27: [2022-11-26 02:11:00,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 02:11:00,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
1: [2022-11-26 02:11:00,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 02:11:00,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 13: [2022-11-26 02:11:00,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:11:00,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 02:11:00,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 26: [2022-11-26 02:11:00,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 02:11:00,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 02:11:00,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 31: [2022-11-26 02:11:00,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 02:11:00,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 02:11:00,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 3: [2022-11-26 02:11:00,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 02:11:00,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 31: [2022-11-26 02:11:00,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:11:00,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 31: [2022-11-26 02:11:00,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 02:11:00,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 20: [2022-11-26 02:11:00,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 02:11:00,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 02:11:00,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 1: [2022-11-26 02:11:00,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:11:00,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 02:11:00,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 14: [2022-11-26 02:11:00,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
3: [2022-11-26 02:11:00,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:11:00,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 02:11:00,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 3: [2022-11-26 02:11:00,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 02:11:00,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 7: [2022-11-26 02:11:00,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:11:00,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 18: [2022-11-26 02:11:00,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:11:00,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 18: [2022-11-26 02:11:00,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 02:11:00,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 12: [2022-11-26 02:11:00,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 02:11:00,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 02:11:00,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 19: [2022-11-26 02:11:00,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:11:00,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:11:00,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 02:11:00,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 02:11:00,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 19: [2022-11-26 02:11:00,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 30: [2022-11-26 02:11:00,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:11:00,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
30: [2022-11-26 02:11:00,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 19: [2022-11-26 02:11:00,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 30: [2022-11-26 02:11:00,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 19: [2022-11-26 02:11:00,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 10: [2022-11-26 02:11:00,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:11:00,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 02:11:00,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 20: [2022-11-26 02:11:00,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 02:11:00,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 02:11:00,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 10: [2022-11-26 02:11:00,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-26 02:11:00,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 02:11:00,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 15: [2022-11-26 02:11:00,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:11:00,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 02:11:00,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 26: [2022-11-26 02:11:00,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 02:11:00,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 02:11:00,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 6: [2022-11-26 02:11:00,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:11:00,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 02:11:00,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 29: [2022-11-26 02:11:00,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
6: [2022-11-26 02:11:00,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 29: [2022-11-26 02:11:00,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 02:11:00,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 6: [2022-11-26 02:11:00,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 24: [2022-11-26 02:11:00,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 29: [2022-11-26 02:11:00,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 6: [2022-11-26 02:11:00,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 29: [2022-11-26 02:11:00,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 02:11:00,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 9: [2022-11-26 02:11:00,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:11:00,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
9: [2022-11-26 02:11:00,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 24: [2022-11-26 02:11:00,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 22: [2022-11-26 02:11:00,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 9: [2022-11-26 02:11:00,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 22: [2022-11-26 02:11:00,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 9: [2022-11-26 02:11:00,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:11:00,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 9: [2022-11-26 02:11:00,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 02:11:00,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 13: [2022-11-26 02:11:00,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:11:00,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 02:11:00,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
15: [2022-11-26 02:11:00,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:11:00,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 02:11:00,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 26: [2022-11-26 02:11:00,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 02:11:00,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 02:11:00,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 18: [2022-11-26 02:11:00,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 25: [2022-11-26 02:11:00,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 18: [2022-11-26 02:11:00,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 02:11:00,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 25: [2022-11-26 02:11:00,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 02:11:00,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
27: [2022-11-26 02:11:00,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:11:00,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:11:00,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 27: [2022-11-26 02:11:00,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 10: [2022-11-26 02:11:00,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 23: [2022-11-26 02:11:00,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 27: [2022-11-26 02:11:00,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 23: [2022-11-26 02:11:00,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 8: [2022-11-26 02:11:00,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:11:00,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 8: [2022-11-26 02:11:00,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 02:11:00,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 02:11:00,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 17: [2022-11-26 02:11:00,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:11:00,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 8: [2022-11-26 02:11:00,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 0: [2022-11-26 02:11:00,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:11:00,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:11:00,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 02:11:00,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
0: [2022-11-26 02:11:00,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 8: [2022-11-26 02:11:00,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 17: [2022-11-26 02:11:00,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:11:00,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 0: [2022-11-26 02:11:00,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 8: [2022-11-26 02:11:00,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 17: [2022-11-26 02:11:00,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 17: [2022-11-26 02:11:00,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:11:00,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 02:11:00,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 21: [2022-11-26 02:11:00,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
21: [2022-11-26 02:11:00,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 02:11:00,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 4: [2022-11-26 02:11:00,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:11:00,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 02:11:00,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 20: [2022-11-26 02:11:00,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:11:00,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 20: [2022-11-26 02:11:00,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 02:11:00,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 6: [2022-11-26 02:11:00,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 02:11:00,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 29: [2022-11-26 02:11:00,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-26 02:11:00,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 02:11:00,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 11: [2022-11-26 02:11:00,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:11:00,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:11:00,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:11:00,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 02:11:00,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 02:11:00,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 02:11:00,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 11: [2022-11-26 02:11:00,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 11: [2022-11-26 02:11:00,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 9: [2022-11-26 02:11:00,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-26 02:11:00,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 3: [2022-11-26 02:11:00,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:11:00,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 9: [2022-11-26 02:11:00,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 3: [2022-11-26 02:11:00,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 17: [2022-11-26 02:11:00,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:11:00,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 02:11:00,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 1: [2022-11-26 02:11:00,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:11:00,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 02:11:00,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 28: [2022-11-26 02:11:00,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
28: [2022-11-26 02:11:00,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 02:11:00,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 02:11:00,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:11:00,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 02:11:00,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 5: [2022-11-26 02:11:00,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:11:00,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:11:00,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-26 02:11:00,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 02:11:00,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 02:11:00,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 02:11:00,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 5: [2022-11-26 02:11:00,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 5: [2022-11-26 02:11:00,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:11:00,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 5: [2022-11-26 02:11:00,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 02:11:00,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 18: [2022-11-26 02:11:00,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 02:11:00,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
18: [2022-11-26 02:11:00,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 02:11:00,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 02:11:00,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 18: [2022-11-26 02:11:00,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 28: [2022-11-26 02:11:00,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 02:11:00,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 02:11:00,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 02:11:00,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 02:11:00,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 28: [2022-11-26 02:11:00,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 28: [2022-11-26 02:11:00,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 28: [2022-11-26 02:11:00,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
2: [2022-11-26 02:11:00,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:11:00,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:11:00,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:11:00,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:11:00,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 02:11:00,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 02:11:00,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 02:11:00,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 02:11:00,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 2: [2022-11-26 02:11:00,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 2: [2022-11-26 02:11:00,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 2: [2022-11-26 02:11:00,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
25: [2022-11-26 02:11:00,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 02:11:00,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 02:11:00,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 14: [2022-11-26 02:11:00,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:11:00,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 02:11:00,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 31: [2022-11-26 02:11:00,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 02:11:00,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 02:11:00,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 5: [2022-11-26 02:11:00,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:11:00,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 02:11:00,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
16: [2022-11-26 02:11:00,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:11:00,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:11:00,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:11:00,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:11:00,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 02:11:00,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 02:11:00,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 02:11:00,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 02:11:00,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 16: [2022-11-26 02:11:00,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 16: [2022-11-26 02:11:00,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 16: [2022-11-26 02:11:00,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
12: [2022-11-26 02:11:00,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:11:00,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 02:11:00,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 27: [2022-11-26 02:11:00,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 02:11:00,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 02:11:00,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 21: [2022-11-26 02:11:00,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:11:00,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 02:11:00,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 23: [2022-11-26 02:11:00,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:11:00,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 02:11:00,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
19: [2022-11-26 02:11:00,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:11:00,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 02:11:00,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 28: [2022-11-26 02:11:00,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 02:11:00,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 02:11:00,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 24: [2022-11-26 02:11:00,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:11:00,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 02:11:00,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 0: [2022-11-26 02:11:00,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:11:00,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 02:11:00,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
30: [2022-11-26 02:11:00,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:11:00,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 02:11:00,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 15: [2022-11-26 02:11:00,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:11:00,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 02:11:00,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 18: [2022-11-26 02:11:00,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 02:11:00,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 02:11:00,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 7: [2022-11-26 02:11:00,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:11:00,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 02:11:00,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
2: [2022-11-26 02:11:00,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:11:00,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 02:11:00,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 16: [2022-11-26 02:11:00,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:11:00,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 02:11:00,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 4: [2022-11-26 02:11:00,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:11:00,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 02:11:00,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 9: [2022-11-26 02:11:00,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:11:00,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 02:11:00,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
26: [2022-11-26 02:11:00,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 02:11:00,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 02:11:00,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 6: [2022-11-26 02:11:00,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:11:00,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 02:11:00,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 11: [2022-11-26 02:11:00,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:11:00,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 02:11:00,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 1: [2022-11-26 02:11:00,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 29: [2022-11-26 02:11:00,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
29: [2022-11-26 02:11:00,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 02:11:00,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 1: [2022-11-26 02:11:00,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 02:11:00,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 8: [2022-11-26 02:11:00,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:11:00,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 22: [2022-11-26 02:11:00,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:11:00,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 22: [2022-11-26 02:11:00,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 02:11:00,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 20: [2022-11-26 02:11:00,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 
20: [2022-11-26 02:11:00,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 02:11:00,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 10: [2022-11-26 02:11:00,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:11:00,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:11:00,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 3: [2022-11-26 02:11:00,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 10: [2022-11-26 02:11:00,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 3: [2022-11-26 02:11:00,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 13: [2022-11-26 02:11:00,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:11:00,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 02:11:00,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 25: [2022-11-26 02:11:00,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
25: [2022-11-26 02:11:00,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 02:11:00,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 5: [2022-11-26 02:11:00,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:11:00,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 02:11:00,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 19: [2022-11-26 02:11:00,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 31: [2022-11-26 02:11:00,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:11:00,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 31: [2022-11-26 02:11:00,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 19: [2022-11-26 02:11:00,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 31: [2022-11-26 02:11:00,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 17: [2022-11-26 02:11:00,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-26 02:11:00,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 02:11:00,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 12: [2022-11-26 02:11:00,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:11:00,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 02:11:00,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 27: [2022-11-26 02:11:00,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 02:11:00,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 02:11:00,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 24: [2022-11-26 02:11:00,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:11:00,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 02:11:00,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 15: [2022-11-26 02:11:00,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 02:11:00,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 02:11:00,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 23: [2022-11-26 02:11:00,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:11:00,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 02:11:00,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 0: [2022-11-26 02:11:00,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:11:00,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 02:11:00,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 21: [2022-11-26 02:11:00,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:11:00,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 02:11:00,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 30: [2022-11-26 02:11:00,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
28: [2022-11-26 02:11:00,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:11:00,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 28: [2022-11-26 02:11:00,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 02:11:00,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 30: [2022-11-26 02:11:00,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 18: [2022-11-26 02:11:00,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 02:11:00,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 02:11:00,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 22: [2022-11-26 02:11:00,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:11:00,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 02:11:00,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 4: [2022-11-26 02:11:00,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 02:11:00,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 02:11:00,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 2: [2022-11-26 02:11:00,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:11:00,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 14: [2022-11-26 02:11:00,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:11:00,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 14: [2022-11-26 02:11:00,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 02:11:00,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 7: [2022-11-26 02:11:00,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:11:00,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 02:11:00,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 8: [2022-11-26 02:11:00,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-26 02:11:00,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 02:11:00,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 26: [2022-11-26 02:11:00,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 02:11:00,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 02:11:00,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 9: [2022-11-26 02:11:00,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:11:00,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 02:11:00,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 6: [2022-11-26 02:11:00,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:11:00,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 02:11:00,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 29: [2022-11-26 02:11:00,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
29: [2022-11-26 02:11:00,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 02:11:00,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 10: [2022-11-26 02:11:00,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:11:00,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 02:11:00,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 16: [2022-11-26 02:11:00,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:11:00,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 02:11:00,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 11: [2022-11-26 02:11:00,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:11:00,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 02:11:00,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 25: [2022-11-26 02:11:00,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
1: [2022-11-26 02:11:00,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:11:00,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 25: [2022-11-26 02:11:00,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 1: [2022-11-26 02:11:00,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 25: [2022-11-26 02:11:00,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 20: [2022-11-26 02:11:00,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 02:11:00,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 02:11:00,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 13: [2022-11-26 02:11:00,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:11:00,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 02:11:00,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 31: [2022-11-26 02:11:00,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-26 02:11:00,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 02:11:00,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 5: [2022-11-26 02:11:00,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:11:00,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 02:11:00,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 14: [2022-11-26 02:11:00,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:11:00,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 02:11:00,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 12: [2022-11-26 02:11:00,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:11:00,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 02:11:00,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 3: [2022-11-26 02:11:00,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 02:11:00,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 02:11:00,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 17: [2022-11-26 02:11:00,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:11:00,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 02:11:00,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 19: [2022-11-26 02:11:00,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:11:00,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 02:11:00,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 23: [2022-11-26 02:11:00,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:11:00,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 02:11:00,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 27: [2022-11-26 02:11:00,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
27: [2022-11-26 02:11:00,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 02:11:00,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 21: [2022-11-26 02:11:00,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:11:00,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 02:11:00,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 7: [2022-11-26 02:11:00,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:11:00,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:11:00,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 02:11:00,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 24: [2022-11-26 02:11:00,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 02:11:00,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 28: [2022-11-26 02:11:00,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
28: [2022-11-26 02:11:00,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 02:11:00,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 0: [2022-11-26 02:11:00,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:11:00,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 02:11:00,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 2: [2022-11-26 02:11:00,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:11:00,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 02:11:00,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 18: [2022-11-26 02:11:00,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 02:11:00,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 02:11:00,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 4: [2022-11-26 02:11:00,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 02:11:00,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 02:11:00,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 22: [2022-11-26 02:11:00,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:11:00,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 02:11:00,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 15: [2022-11-26 02:11:00,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:11:00,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 02:11:00,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 30: [2022-11-26 02:11:00,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:11:00,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 02:11:00,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 11: [2022-11-26 02:11:00,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 02:11:00,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 02:11:00,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 25: [2022-11-26 02:11:00,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 02:11:00,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 02:11:00,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 13: [2022-11-26 02:11:00,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:11:00,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 02:11:00,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 6: [2022-11-26 02:11:00,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:11:00,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 02:11:00,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 26: [2022-11-26 02:11:00,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
26: [2022-11-26 02:11:00,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 02:11:00,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 20: [2022-11-26 02:11:00,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 02:11:00,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 02:11:00,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 9: [2022-11-26 02:11:00,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:11:00,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:11:00,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 8: [2022-11-26 02:11:00,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 02:11:00,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 9: [2022-11-26 02:11:00,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 1: [2022-11-26 02:11:00,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 02:11:00,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 02:11:00,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 3: [2022-11-26 02:11:00,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:11:00,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 02:11:00,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 17: [2022-11-26 02:11:00,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:11:00,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 31: [2022-11-26 02:11:00,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:11:00,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 31: [2022-11-26 02:11:00,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 02:11:00,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 29: [2022-11-26 02:11:00,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
29: [2022-11-26 02:11:00,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 02:11:00,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 15: [2022-11-26 02:11:00,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:11:00,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 23: [2022-11-26 02:11:00,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:11:00,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:11:00,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 19: [2022-11-26 02:11:00,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 23: [2022-11-26 02:11:00,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 02:11:00,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 19: [2022-11-26 02:11:00,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 24: [2022-11-26 02:11:00,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
24: [2022-11-26 02:11:00,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 02:11:00,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 16: [2022-11-26 02:11:00,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:11:00,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 02:11:00,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 5: [2022-11-26 02:11:00,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:11:00,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 02:11:00,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 18: [2022-11-26 02:11:00,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 02:11:00,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 02:11:00,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 22: [2022-11-26 02:11:00,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-26 02:11:00,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 02:11:00,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 30: [2022-11-26 02:11:00,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:11:00,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 27: [2022-11-26 02:11:00,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:11:00,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 27: [2022-11-26 02:11:00,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 02:11:00,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 2: [2022-11-26 02:11:00,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:11:00,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 02:11:00,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 0: [2022-11-26 02:11:00,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 02:11:00,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 28: [2022-11-26 02:11:00,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:11:00,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 10: [2022-11-26 02:11:00,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:11:00,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 02:11:00,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 21: [2022-11-26 02:11:00,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:11:00,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 02:11:00,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 11: [2022-11-26 02:11:00,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:11:00,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 02:11:00,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
7: [2022-11-26 02:11:00,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:11:00,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 02:11:00,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 4: [2022-11-26 02:11:00,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 28: [2022-11-26 02:11:00,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 02:11:00,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 4: [2022-11-26 02:11:00,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 02:11:00,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 14: [2022-11-26 02:11:00,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:11:00,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 02:11:00,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 13: [2022-11-26 02:11:00,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-26 02:11:00,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 02:11:00,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 26: [2022-11-26 02:11:00,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 02:11:00,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 02:11:00,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 9: [2022-11-26 02:11:00,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:11:00,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 02:11:00,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 10: [2022-11-26 02:11:00,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:11:00,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 02:11:00,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 16: [2022-11-26 02:11:00,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
16: [2022-11-26 02:11:00,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 02:11:00,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 11: [2022-11-26 02:11:00,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:11:00,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 02:11:00,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 3: [2022-11-26 02:11:00,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:11:00,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 02:11:00,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 9: [2022-11-26 02:11:00,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:11:00,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 02:11:00,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 29: [2022-11-26 02:11:00,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 
29: [2022-11-26 02:11:00,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 02:11:00,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 3: [2022-11-26 02:11:00,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:11:00,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:11:00,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 02:11:00,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 6: [2022-11-26 02:11:00,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 02:11:00,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:11:00,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 6: [2022-11-26 02:11:00,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 02:11:00,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 26: [2022-11-26 02:11:00,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-26 02:11:00,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 02:11:00,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 8: [2022-11-26 02:11:00,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:11:00,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 02:11:00,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 10: [2022-11-26 02:11:00,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:11:00,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 02:11:00,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 12: [2022-11-26 02:11:00,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:11:00,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 02:11:00,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 29: [2022-11-26 02:11:00,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-26 02:11:00,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step36000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 02:11:00,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 0: successfully saved checkpoint at iteration 36000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2546.91 31: iteration 36010/ 173500 | consumed samples: 9218560 | consumed tokens: 18879610880 | elapsed time per iteration (s): 1.14 | learning rate: 1.829E-04 | global batch size: 256 | lm loss: 2.098383E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 223.882 | TFLOPs: 13.54 | 31: iteration 36020/ 173500 | consumed samples: 9221120 | consumed tokens: 18884853760 | elapsed time per iteration (s): 0.86 | learning rate: 1.829E-04 | global batch size: 256 | lm loss: 2.107309E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.259 | TFLOPs: 17.98 | 31: iteration 36030/ 173500 | consumed samples: 9223680 | consumed tokens: 18890096640 | elapsed time per iteration (s): 0.90 | learning rate: 1.829E-04 | global batch size: 256 | lm loss: 2.091564E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.948 | TFLOPs: 17.24 | 31: iteration 36040/ 173500 | consumed samples: 9226240 | consumed tokens: 18895339520 | elapsed time per iteration (s): 0.81 | learning rate: 1.829E-04 | global batch size: 256 | lm loss: 2.103294E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.071 | TFLOPs: 19.06 | 31: iteration 36050/ 173500 | consumed samples: 9228800 | consumed tokens: 18900582400 | elapsed time per iteration (s): 0.96 | learning rate: 1.828E-04 | global batch 
size: 256 | lm loss: 2.123013E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 267.379 | TFLOPs: 16.18 | 31: iteration 36060/ 173500 | consumed samples: 9231360 | consumed tokens: 18905825280 | elapsed time per iteration (s): 0.93 | learning rate: 1.828E-04 | global batch size: 256 | lm loss: 2.129807E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 274.763 | TFLOPs: 16.62 | 31: iteration 36070/ 173500 | consumed samples: 9233920 | consumed tokens: 18911068160 | elapsed time per iteration (s): 0.75 | learning rate: 1.828E-04 | global batch size: 256 | lm loss: 2.115696E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.472 | TFLOPs: 20.78 | 31: iteration 36080/ 173500 | consumed samples: 9236480 | consumed tokens: 18916311040 | elapsed time per iteration (s): 0.86 | learning rate: 1.828E-04 | global batch size: 256 | lm loss: 2.100167E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.108 | TFLOPs: 17.91 | 31: iteration 36090/ 173500 | consumed samples: 9239040 | consumed tokens: 18921553920 | elapsed time per iteration (s): 0.86 | learning rate: 1.828E-04 | global batch size: 256 | lm loss: 2.096895E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.903 | TFLOPs: 18.02 | 31: iteration 36100/ 173500 | consumed samples: 9241600 | consumed tokens: 18926796800 | elapsed time per iteration (s): 0.76 | learning rate: 1.828E-04 | global batch size: 256 | lm loss: 2.138659E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.318 | TFLOPs: 20.29 | 31: iteration 36110/ 173500 | consumed samples: 9244160 | consumed 
tokens: 18932039680 | elapsed time per iteration (s): 0.77 | learning rate: 1.828E-04 | global batch size: 256 | lm loss: 2.119495E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.376 | TFLOPs: 20.05 | 31: iteration 36120/ 173500 | consumed samples: 9246720 | consumed tokens: 18937282560 | elapsed time per iteration (s): 0.77 | learning rate: 1.828E-04 | global batch size: 256 | lm loss: 2.106382E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.507 | TFLOPs: 20.12 | 31: iteration 36130/ 173500 | consumed samples: 9249280 | consumed tokens: 18942525440 | elapsed time per iteration (s): 0.79 | learning rate: 1.828E-04 | global batch size: 256 | lm loss: 2.086400E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.377 | TFLOPs: 19.62 | 31: iteration 36140/ 173500 | consumed samples: 9251840 | consumed tokens: 18947768320 | elapsed time per iteration (s): 0.74 | learning rate: 1.828E-04 | global batch size: 256 | lm loss: 2.093649E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.683 | TFLOPs: 20.85 | 31: iteration 36150/ 173500 | consumed samples: 9254400 | consumed tokens: 18953011200 | elapsed time per iteration (s): 0.78 | learning rate: 1.828E-04 | global batch size: 256 | lm loss: 2.145379E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.308 | TFLOPs: 19.86 | 31: iteration 36160/ 173500 | consumed samples: 9256960 | consumed tokens: 18958254080 | elapsed time per iteration (s): 0.78 | learning rate: 1.827E-04 | global batch size: 256 | lm loss: 2.133876E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 328.692 | TFLOPs: 19.89 | 31: iteration 36170/ 173500 | consumed samples: 9259520 | consumed tokens: 18963496960 | elapsed time per iteration (s): 0.77 | learning rate: 1.827E-04 | global batch size: 256 | lm loss: 2.125488E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.226 | TFLOPs: 20.16 | 31: iteration 36180/ 173500 | consumed samples: 9262080 | consumed tokens: 18968739840 | elapsed time per iteration (s): 0.77 | learning rate: 1.827E-04 | global batch size: 256 | lm loss: 2.111431E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.554 | TFLOPs: 20.18 | 31: iteration 36190/ 173500 | consumed samples: 9264640 | consumed tokens: 18973982720 | elapsed time per iteration (s): 0.79 | learning rate: 1.827E-04 | global batch size: 256 | lm loss: 2.113534E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.779 | TFLOPs: 19.65 | 31: iteration 36200/ 173500 | consumed samples: 9267200 | consumed tokens: 18979225600 | elapsed time per iteration (s): 0.74 | learning rate: 1.827E-04 | global batch size: 256 | lm loss: 2.125627E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.080 | TFLOPs: 20.94 | 31: iteration 36210/ 173500 | consumed samples: 9269760 | consumed tokens: 18984468480 | elapsed time per iteration (s): 1.76 | learning rate: 1.827E-04 | global batch size: 256 | lm loss: 2.098156E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 145.452 | TFLOPs: 8.80 | 31: iteration 36220/ 173500 | consumed samples: 9272320 | consumed tokens: 18989711360 | elapsed time per iteration (s): 0.81 | learning rate: 1.827E-04 | global batch size: 256 | lm loss: 2.135630E+00 | grad norm: 
0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.830 | TFLOPs: 19.17 | 31: iteration 36230/ 173500 | consumed samples: 9274880 | consumed tokens: 18994954240 | elapsed time per iteration (s): 0.72 | learning rate: 1.827E-04 | global batch size: 256 | lm loss: 2.125464E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.730 | TFLOPs: 21.40 | 31: iteration 36240/ 173500 | consumed samples: 9277440 | consumed tokens: 19000197120 | elapsed time per iteration (s): 0.76 | learning rate: 1.827E-04 | global batch size: 256 | lm loss: 2.127229E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.063 | TFLOPs: 20.39 | 31: iteration 36250/ 173500 | consumed samples: 9280000 | consumed tokens: 19005440000 | elapsed time per iteration (s): 0.75 | learning rate: 1.827E-04 | global batch size: 256 | lm loss: 2.099960E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.722 | TFLOPs: 20.61 | 31: iteration 36260/ 173500 | consumed samples: 9282560 | consumed tokens: 19010682880 | elapsed time per iteration (s): 0.74 | learning rate: 1.826E-04 | global batch size: 256 | lm loss: 2.090684E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.006 | TFLOPs: 20.87 | 31: iteration 36270/ 173500 | consumed samples: 9285120 | consumed tokens: 19015925760 | elapsed time per iteration (s): 0.79 | learning rate: 1.826E-04 | global batch size: 256 | lm loss: 2.126578E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.715 | TFLOPs: 19.64 | 31: iteration 36280/ 173500 | consumed samples: 9287680 | consumed tokens: 19021168640 | elapsed time per 
iteration (s): 0.74 | learning rate: 1.826E-04 | global batch size: 256 | lm loss: 2.102576E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.297 | TFLOPs: 20.89 | 31: iteration 36290/ 173500 | consumed samples: 9290240 | consumed tokens: 19026411520 | elapsed time per iteration (s): 0.79 | learning rate: 1.826E-04 | global batch size: 256 | lm loss: 2.131478E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.410 | TFLOPs: 19.69 | 31: iteration 36300/ 173500 | consumed samples: 9292800 | consumed tokens: 19031654400 | elapsed time per iteration (s): 0.77 | learning rate: 1.826E-04 | global batch size: 256 | lm loss: 2.125763E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.198 | TFLOPs: 20.22 | 31: iteration 36310/ 173500 | consumed samples: 9295360 | consumed tokens: 19036897280 | elapsed time per iteration (s): 0.80 | learning rate: 1.826E-04 | global batch size: 256 | lm loss: 2.120880E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.561 | TFLOPs: 19.39 | 31: iteration 36320/ 173500 | consumed samples: 9297920 | consumed tokens: 19042140160 | elapsed time per iteration (s): 0.79 | learning rate: 1.826E-04 | global batch size: 256 | lm loss: 2.100345E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.319 | TFLOPs: 19.56 | 31: iteration 36330/ 173500 | consumed samples: 9300480 | consumed tokens: 19047383040 | elapsed time per iteration (s): 0.86 | learning rate: 1.826E-04 | global batch size: 256 | lm loss: 2.085619E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.422 | TFLOPs: 17.99 | 31: 
iteration 36340/ 173500 | consumed samples: 9303040 | consumed tokens: 19052625920 | elapsed time per iteration (s): 0.83 | learning rate: 1.826E-04 | global batch size: 256 | lm loss: 2.117587E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.379 | TFLOPs: 18.60 | 31: iteration 36350/ 173500 | consumed samples: 9305600 | consumed tokens: 19057868800 | elapsed time per iteration (s): 0.85 | learning rate: 1.826E-04 | global batch size: 256 | lm loss: 2.086848E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.606 | TFLOPs: 18.31 | 31: iteration 36360/ 173500 | consumed samples: 9308160 | consumed tokens: 19063111680 | elapsed time per iteration (s): 0.82 | learning rate: 1.825E-04 | global batch size: 256 | lm loss: 2.122434E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.059 | TFLOPs: 18.88 | 31: iteration 36370/ 173500 | consumed samples: 9310720 | consumed tokens: 19068354560 | elapsed time per iteration (s): 0.82 | learning rate: 1.825E-04 | global batch size: 256 | lm loss: 2.113745E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.010 | TFLOPs: 18.94 | 31: iteration 36380/ 173500 | consumed samples: 9313280 | consumed tokens: 19073597440 | elapsed time per iteration (s): 0.87 | learning rate: 1.825E-04 | global batch size: 256 | lm loss: 2.122686E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.521 | TFLOPs: 17.76 | 31: iteration 36390/ 173500 | consumed samples: 9315840 | consumed tokens: 19078840320 | elapsed time per iteration (s): 0.73 | learning rate: 1.825E-04 | global batch size: 256 | lm loss: 2.105800E+00 | grad norm: 0.137 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.801 | TFLOPs: 21.16 | 31: iteration 36400/ 173500 | consumed samples: 9318400 | consumed tokens: 19084083200 | elapsed time per iteration (s): 0.82 | learning rate: 1.825E-04 | global batch size: 256 | lm loss: 2.126678E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.201 | TFLOPs: 18.95 | 31: iteration 36410/ 173500 | consumed samples: 9320960 | consumed tokens: 19089326080 | elapsed time per iteration (s): 0.79 | learning rate: 1.825E-04 | global batch size: 256 | lm loss: 2.095716E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.139 | TFLOPs: 19.67 | 31: iteration 36420/ 173500 | consumed samples: 9323520 | consumed tokens: 19094568960 | elapsed time per iteration (s): 0.81 | learning rate: 1.825E-04 | global batch size: 256 | lm loss: 2.099381E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.191 | TFLOPs: 19.19 | 31: iteration 36430/ 173500 | consumed samples: 9326080 | consumed tokens: 19099811840 | elapsed time per iteration (s): 0.76 | learning rate: 1.825E-04 | global batch size: 256 | lm loss: 2.110517E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.888 | TFLOPs: 20.50 | 31: iteration 36440/ 173500 | consumed samples: 9328640 | consumed tokens: 19105054720 | elapsed time per iteration (s): 0.80 | learning rate: 1.825E-04 | global batch size: 256 | lm loss: 2.095393E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.924 | TFLOPs: 19.48 | 31: iteration 36450/ 173500 | consumed samples: 9331200 | consumed tokens: 19110297600 | elapsed time per iteration (s): 0.78 | learning rate: 
1.825E-04 | global batch size: 256 | lm loss: 2.126580E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.890 | TFLOPs: 19.84 | 31: iteration 36460/ 173500 | consumed samples: 9333760 | consumed tokens: 19115540480 | elapsed time per iteration (s): 0.77 | learning rate: 1.825E-04 | global batch size: 256 | lm loss: 2.107285E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.659 | TFLOPs: 20.00 | 31: iteration 36470/ 173500 | consumed samples: 9336320 | consumed tokens: 19120783360 | elapsed time per iteration (s): 0.76 | learning rate: 1.824E-04 | global batch size: 256 | lm loss: 2.096395E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.037 | TFLOPs: 20.33 | 31: iteration 36480/ 173500 | consumed samples: 9338880 | consumed tokens: 19126026240 | elapsed time per iteration (s): 0.73 | learning rate: 1.824E-04 | global batch size: 256 | lm loss: 2.116490E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.480 | TFLOPs: 21.08 | 31: iteration 36490/ 173500 | consumed samples: 9341440 | consumed tokens: 19131269120 | elapsed time per iteration (s): 0.75 | learning rate: 1.824E-04 | global batch size: 256 | lm loss: 2.129416E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.798 | TFLOPs: 20.74 | 31: iteration 36500/ 173500 | consumed samples: 9344000 | consumed tokens: 19136512000 | elapsed time per iteration (s): 0.80 | learning rate: 1.824E-04 | global batch size: 256 | lm loss: 2.085970E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.160 | TFLOPs: 19.25 | 31: iteration 36510/ 173500 | consumed 
samples: 9346560 | consumed tokens: 19141754880 | elapsed time per iteration (s): 0.80 | learning rate: 1.824E-04 | global batch size: 256 | lm loss: 2.119207E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.166 | TFLOPs: 19.37 | 31: iteration 36520/ 173500 | consumed samples: 9349120 | consumed tokens: 19146997760 | elapsed time per iteration (s): 0.84 | learning rate: 1.824E-04 | global batch size: 256 | lm loss: 2.081916E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.759 | TFLOPs: 18.38 | 31: iteration 36530/ 173500 | consumed samples: 9351680 | consumed tokens: 19152240640 | elapsed time per iteration (s): 0.80 | learning rate: 1.824E-04 | global batch size: 256 | lm loss: 2.113786E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.603 | TFLOPs: 19.46 | 31: iteration 36540/ 173500 | consumed samples: 9354240 | consumed tokens: 19157483520 | elapsed time per iteration (s): 0.75 | learning rate: 1.824E-04 | global batch size: 256 | lm loss: 2.111099E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.173 | TFLOPs: 20.70 | 31: iteration 36550/ 173500 | consumed samples: 9356800 | consumed tokens: 19162726400 | elapsed time per iteration (s): 0.77 | learning rate: 1.824E-04 | global batch size: 256 | lm loss: 2.106537E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.450 | TFLOPs: 20.11 | 31: iteration 36560/ 173500 | consumed samples: 9359360 | consumed tokens: 19167969280 | elapsed time per iteration (s): 0.73 | learning rate: 1.824E-04 | global batch size: 256 | lm loss: 2.127259E+00 | grad norm: 0.718 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 348.349 | TFLOPs: 21.07 | 31: iteration 36570/ 173500 | consumed samples: 9361920 | consumed tokens: 19173212160 | elapsed time per iteration (s): 0.73 | learning rate: 1.823E-04 | global batch size: 256 | lm loss: 2.124704E+00 | grad norm: 0.436 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.997 | TFLOPs: 21.23 | 31: iteration 36580/ 173500 | consumed samples: 9364480 | consumed tokens: 19178455040 | elapsed time per iteration (s): 0.74 | learning rate: 1.823E-04 | global batch size: 256 | lm loss: 2.114406E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.379 | TFLOPs: 20.83 | 31: iteration 36590/ 173500 | consumed samples: 9367040 | consumed tokens: 19183697920 | elapsed time per iteration (s): 0.76 | learning rate: 1.823E-04 | global batch size: 256 | lm loss: 2.113264E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.558 | TFLOPs: 20.48 | 31: iteration 36600/ 173500 | consumed samples: 9369600 | consumed tokens: 19188940800 | elapsed time per iteration (s): 0.76 | learning rate: 1.823E-04 | global batch size: 256 | lm loss: 2.083043E+00 | grad norm: 0.417 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.693 | TFLOPs: 20.37 | 31: iteration 36610/ 173500 | consumed samples: 9372160 | consumed tokens: 19194183680 | elapsed time per iteration (s): 0.77 | learning rate: 1.823E-04 | global batch size: 256 | lm loss: 2.151885E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.874 | TFLOPs: 20.02 | 31: iteration 36620/ 173500 | consumed samples: 9374720 | consumed tokens: 19199426560 | elapsed time per iteration (s): 0.84 | learning rate: 1.823E-04 | global batch size: 256 | lm 
loss: 2.119148E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.077 | TFLOPs: 18.40 | 31: iteration 36630/ 173500 | consumed samples: 9377280 | consumed tokens: 19204669440 | elapsed time per iteration (s): 0.78 | learning rate: 1.823E-04 | global batch size: 256 | lm loss: 2.069949E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.520 | TFLOPs: 19.87 | 31: iteration 36640/ 173500 | consumed samples: 9379840 | consumed tokens: 19209912320 | elapsed time per iteration (s): 0.85 | learning rate: 1.823E-04 | global batch size: 256 | lm loss: 2.116895E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.784 | TFLOPs: 18.14 | 31: iteration 36650/ 173500 | consumed samples: 9382400 | consumed tokens: 19215155200 | elapsed time per iteration (s): 0.72 | learning rate: 1.823E-04 | global batch size: 256 | lm loss: 2.093530E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.756 | TFLOPs: 21.46 | 31: iteration 36660/ 173500 | consumed samples: 9384960 | consumed tokens: 19220398080 | elapsed time per iteration (s): 0.99 | learning rate: 1.823E-04 | global batch size: 256 | lm loss: 2.110881E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 258.991 | TFLOPs: 15.67 | 31: iteration 36670/ 173500 | consumed samples: 9387520 | consumed tokens: 19225640960 | elapsed time per iteration (s): 0.78 | learning rate: 1.822E-04 | global batch size: 256 | lm loss: 2.126363E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.341 | TFLOPs: 19.86 | 31: iteration 36680/ 173500 | consumed samples: 9390080 | consumed tokens: 
19230883840 | elapsed time per iteration (s): 0.75 | learning rate: 1.822E-04 | global batch size: 256 | lm loss: 2.113102E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.858 | TFLOPs: 20.68 | 31: iteration 36690/ 173500 | consumed samples: 9392640 | consumed tokens: 19236126720 | elapsed time per iteration (s): 0.77 | learning rate: 1.822E-04 | global batch size: 256 | lm loss: 2.107745E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.076 | TFLOPs: 20.03 | 31: iteration 36700/ 173500 | consumed samples: 9395200 | consumed tokens: 19241369600 | elapsed time per iteration (s): 0.73 | learning rate: 1.822E-04 | global batch size: 256 | lm loss: 2.106255E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.833 | TFLOPs: 21.16 | 31: iteration 36710/ 173500 | consumed samples: 9397760 | consumed tokens: 19246612480 | elapsed time per iteration (s): 0.76 | learning rate: 1.822E-04 | global batch size: 256 | lm loss: 2.097499E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.632 | TFLOPs: 20.43 | 31: iteration 36720/ 173500 | consumed samples: 9400320 | consumed tokens: 19251855360 | elapsed time per iteration (s): 0.79 | learning rate: 1.822E-04 | global batch size: 256 | lm loss: 2.096465E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.777 | TFLOPs: 19.71 | 31: iteration 36730/ 173500 | consumed samples: 9402880 | consumed tokens: 19257098240 | elapsed time per iteration (s): 0.80 | learning rate: 1.822E-04 | global batch size: 256 | lm loss: 2.115827E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
320.422 | TFLOPs: 19.38 | 31: iteration 36740/ 173500 | consumed samples: 9405440 | consumed tokens: 19262341120 | elapsed time per iteration (s): 0.72 | learning rate: 1.822E-04 | global batch size: 256 | lm loss: 2.128008E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.062 | TFLOPs: 21.60 | 31: iteration 36750/ 173500 | consumed samples: 9408000 | consumed tokens: 19267584000 | elapsed time per iteration (s): 0.75 | learning rate: 1.822E-04 | global batch size: 256 | lm loss: 2.099479E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.426 | TFLOPs: 20.59 | 31: iteration 36760/ 173500 | consumed samples: 9410560 | consumed tokens: 19272826880 | elapsed time per iteration (s): 0.85 | learning rate: 1.822E-04 | global batch size: 256 | lm loss: 2.109731E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.131 | TFLOPs: 18.28 | 31: iteration 36770/ 173500 | consumed samples: 9413120 | consumed tokens: 19278069760 | elapsed time per iteration (s): 0.82 | learning rate: 1.821E-04 | global batch size: 256 | lm loss: 2.101071E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.125 | TFLOPs: 18.88 | 31: iteration 36780/ 173500 | consumed samples: 9415680 | consumed tokens: 19283312640 | elapsed time per iteration (s): 0.87 | learning rate: 1.821E-04 | global batch size: 256 | lm loss: 2.132646E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.689 | TFLOPs: 17.89 | 31: iteration 36790/ 173500 | consumed samples: 9418240 | consumed tokens: 19288555520 | elapsed time per iteration (s): 0.75 | learning rate: 1.821E-04 | global batch size: 256 | lm loss: 2.118134E+00 | grad norm: 0.137 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.159 | TFLOPs: 20.52 | 31: iteration 36800/ 173500 | consumed samples: 9420800 | consumed tokens: 19293798400 | elapsed time per iteration (s): 0.82 | learning rate: 1.821E-04 | global batch size: 256 | lm loss: 2.094238E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.115 | TFLOPs: 18.88 | 31: iteration 36810/ 173500 | consumed samples: 9423360 | consumed tokens: 19299041280 | elapsed time per iteration (s): 0.79 | learning rate: 1.821E-04 | global batch size: 256 | lm loss: 2.111593E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.130 | TFLOPs: 19.61 | 31: iteration 36820/ 173500 | consumed samples: 9425920 | consumed tokens: 19304284160 | elapsed time per iteration (s): 0.77 | learning rate: 1.821E-04 | global batch size: 256 | lm loss: 2.117119E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.270 | TFLOPs: 20.22 | 31: iteration 36830/ 173500 | consumed samples: 9428480 | consumed tokens: 19309527040 | elapsed time per iteration (s): 0.77 | learning rate: 1.821E-04 | global batch size: 256 | lm loss: 2.119619E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.026 | TFLOPs: 20.21 | 31: iteration 36840/ 173500 | consumed samples: 9431040 | consumed tokens: 19314769920 | elapsed time per iteration (s): 0.83 | learning rate: 1.821E-04 | global batch size: 256 | lm loss: 2.094823E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.530 | TFLOPs: 18.60 | 31: iteration 36850/ 173500 | consumed samples: 9433600 | consumed tokens: 19320012800 | elapsed time per iteration (s): 
0.80 | learning rate: 1.821E-04 | global batch size: 256 | lm loss: 2.129401E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.860 | TFLOPs: 19.47 | 31: iteration 36860/ 173500 | consumed samples: 9436160 | consumed tokens: 19325255680 | elapsed time per iteration (s): 0.76 | learning rate: 1.821E-04 | global batch size: 256 | lm loss: 2.117746E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.917 | TFLOPs: 20.32 | 31: iteration 36870/ 173500 | consumed samples: 9438720 | consumed tokens: 19330498560 | elapsed time per iteration (s): 0.74 | learning rate: 1.820E-04 | global batch size: 256 | lm loss: 2.085899E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.033 | TFLOPs: 20.93 | 31: iteration 36880/ 173500 | consumed samples: 9441280 | consumed tokens: 19335741440 | elapsed time per iteration (s): 0.74 | learning rate: 1.820E-04 | global batch size: 256 | lm loss: 2.076716E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.126 | TFLOPs: 20.82 | 31: iteration 36890/ 173500 | consumed samples: 9443840 | consumed tokens: 19340984320 | elapsed time per iteration (s): 0.78 | learning rate: 1.820E-04 | global batch size: 256 | lm loss: 2.095779E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.793 | TFLOPs: 19.89 | 31: iteration 36900/ 173500 | consumed samples: 9446400 | consumed tokens: 19346227200 | elapsed time per iteration (s): 0.76 | learning rate: 1.820E-04 | global batch size: 256 | lm loss: 2.132117E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.252 | TFLOPs: 20.46 | 31: iteration 36910/ 
173500 | consumed samples: 9448960 | consumed tokens: 19351470080 | elapsed time per iteration (s): 0.75 | learning rate: 1.820E-04 | global batch size: 256 | lm loss: 2.136090E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.103 | TFLOPs: 20.51 | 31: iteration 36920/ 173500 | consumed samples: 9451520 | consumed tokens: 19356712960 | elapsed time per iteration (s): 0.76 | learning rate: 1.820E-04 | global batch size: 256 | lm loss: 2.106192E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.844 | TFLOPs: 20.32 | 31: iteration 36930/ 173500 | consumed samples: 9454080 | consumed tokens: 19361955840 | elapsed time per iteration (s): 0.73 | learning rate: 1.820E-04 | global batch size: 256 | lm loss: 2.120541E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.816 | TFLOPs: 21.16 | 31: iteration 36940/ 173500 | consumed samples: 9456640 | consumed tokens: 19367198720 | elapsed time per iteration (s): 0.76 | learning rate: 1.820E-04 | global batch size: 256 | lm loss: 2.143911E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.048 | TFLOPs: 20.39 | 31: iteration 36950/ 173500 | consumed samples: 9459200 | consumed tokens: 19372441600 | elapsed time per iteration (s): 0.74 | learning rate: 1.820E-04 | global batch size: 256 | lm loss: 2.110143E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.362 | TFLOPs: 21.01 | 31: iteration 36960/ 173500 | consumed samples: 9461760 | consumed tokens: 19377684480 | elapsed time per iteration (s): 0.76 | learning rate: 1.820E-04 | global batch size: 256 | lm loss: 2.090187E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 
0 | number of nan iterations: 0 | samples per second: 335.735 | TFLOPs: 20.31 | 31: iteration 36970/ 173500 | consumed samples: 9464320 | consumed tokens: 19382927360 | elapsed time per iteration (s): 0.74 | learning rate: 1.819E-04 | global batch size: 256 | lm loss: 2.129398E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.662 | TFLOPs: 20.79 | 31: iteration 36980/ 173500 | consumed samples: 9466880 | consumed tokens: 19388170240 | elapsed time per iteration (s): 0.78 | learning rate: 1.819E-04 | global batch size: 256 | lm loss: 2.119165E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.852 | TFLOPs: 19.77 | 31: iteration 36990/ 173500 | consumed samples: 9469440 | consumed tokens: 19393413120 | elapsed time per iteration (s): 0.77 | learning rate: 1.819E-04 | global batch size: 256 | lm loss: 2.093251E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.959 | TFLOPs: 20.02 | 31: iteration 37000/ 173500 | consumed samples: 9472000 | consumed tokens: 19398656000 | elapsed time per iteration (s): 0.74 | learning rate: 1.819E-04 | global batch size: 256 | lm loss: 2.101559E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.473 | TFLOPs: 20.84 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 37000 | lm loss value: 2.109146E+00 | lm loss PPL: 8.241201E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 37000 to checkpoints_1b1long 0: [2022-11-26 02:24:17,520] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step37000 is begin to save! 
0: [2022-11-26 02:24:17,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_01-model_00-model_states.pt... 0: [2022-11-26 02:24:17,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_01-model_00-model_states.pt. 0: [2022-11-26 02:24:17,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_03-model_00-model_states.pt... 0: [2022-11-26 02:24:17,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_03-model_00-model_states.pt. 0: [2022-11-26 02:24:17,837] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_04-model_00-model_states.pt... 0: [2022-11-26 02:24:17,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_04-model_00-model_states.pt. 0: [2022-11-26 02:24:17,911] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_05-model_00-model_states.pt... 0: [2022-11-26 02:24:17,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_05-model_00-model_states.pt. 0: [2022-11-26 02:24:17,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_06-model_00-model_states.pt... 0: [2022-11-26 02:24:18,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_06-model_00-model_states.pt. 0: [2022-11-26 02:24:18,060] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_07-model_00-model_states.pt... 0: [2022-11-26 02:24:18,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 02:24:18,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_08-model_00-model_states.pt... 0: [2022-11-26 02:24:18,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_08-model_00-model_states.pt. 0: [2022-11-26 02:24:18,211] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_09-model_00-model_states.pt... 0: [2022-11-26 02:24:18,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_09-model_00-model_states.pt. 0: [2022-11-26 02:24:18,285] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_10-model_00-model_states.pt... 0: [2022-11-26 02:24:18,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_10-model_00-model_states.pt. 0: [2022-11-26 02:24:18,360] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_11-model_00-model_states.pt... 0: [2022-11-26 02:24:18,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_11-model_00-model_states.pt. 0: [2022-11-26 02:24:18,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_12-model_00-model_states.pt... 0: [2022-11-26 02:24:18,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_12-model_00-model_states.pt. 0: [2022-11-26 02:24:18,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_13-model_00-model_states.pt... 0: [2022-11-26 02:24:18,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 02:24:18,584] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_14-model_00-model_states.pt... 0: [2022-11-26 02:24:18,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_14-model_00-model_states.pt. 0: [2022-11-26 02:24:18,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_15-model_00-model_states.pt... 0: [2022-11-26 02:24:18,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_15-model_00-model_states.pt. 0: [2022-11-26 02:24:18,730] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_16-model_00-model_states.pt... 0: [2022-11-26 02:24:18,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_16-model_00-model_states.pt. 0: [2022-11-26 02:24:18,806] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_17-model_00-model_states.pt... 0: [2022-11-26 02:24:18,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_17-model_00-model_states.pt. 0: [2022-11-26 02:24:18,880] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_18-model_00-model_states.pt... 0: [2022-11-26 02:24:18,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_18-model_00-model_states.pt. 0: [2022-11-26 02:24:18,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_19-model_00-model_states.pt... 0: [2022-11-26 02:24:19,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 02:24:19,029] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_20-model_00-model_states.pt... 0: [2022-11-26 02:24:19,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_20-model_00-model_states.pt. 0: [2022-11-26 02:24:19,101] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_21-model_00-model_states.pt... 0: [2022-11-26 02:24:19,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_21-model_00-model_states.pt. 0: [2022-11-26 02:24:19,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_22-model_00-model_states.pt... 0: [2022-11-26 02:24:19,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_22-model_00-model_states.pt. 0: [2022-11-26 02:24:19,248] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_23-model_00-model_states.pt... 0: [2022-11-26 02:24:19,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_23-model_00-model_states.pt. 0: [2022-11-26 02:24:19,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_24-model_00-model_states.pt... 0: [2022-11-26 02:24:19,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_24-model_00-model_states.pt. 0: [2022-11-26 02:24:19,398] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_25-model_00-model_states.pt... 0: [2022-11-26 02:24:19,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 02:24:19,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_26-model_00-model_states.pt... 0: [2022-11-26 02:24:19,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_26-model_00-model_states.pt. 0: [2022-11-26 02:24:19,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_27-model_00-model_states.pt... 0: [2022-11-26 02:24:19,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_27-model_00-model_states.pt. 0: [2022-11-26 02:24:19,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_28-model_00-model_states.pt... 0: [2022-11-26 02:24:19,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_28-model_00-model_states.pt. 0: [2022-11-26 02:24:19,696] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/layer_30-model_00-model_states.pt... 0: [2022-11-26 02:24:19,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/layer_30-model_00-model_states.pt. 0: [2022-11-26 02:24:19,699] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step37000/mp_rank_00_model_states.pt 0: [2022-11-26 02:24:19,699] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/mp_rank_00_model_states.pt... 0: [2022-11-26 02:24:19,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/mp_rank_00_model_states.pt. 0: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
6: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
4: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
1: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 
13: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
12: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 
25: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 
28: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 
22: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 
21: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 
26: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
6: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
4: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
1: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 
3: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
11: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 
31: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 
17: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 
19: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
8: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
20: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
31: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 
27: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 
24: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 
9: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:24:19,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:24:19,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 02:24:19,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 02:24:19,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 3: [2022-11-26 02:24:19,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:24:19,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:24:19,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 3: [2022-11-26 02:24:19,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 14: [2022-11-26 02:24:19,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 3: [2022-11-26 02:24:19,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 5: [2022-11-26 02:24:19,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-26 02:24:19,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 02:24:19,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 0: [2022-11-26 02:24:19,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:24:19,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:24:19,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 02:24:19,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 28: [2022-11-26 02:24:19,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 02:24:19,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 02:24:19,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 2: [2022-11-26 02:24:19,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:24:19,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 02:24:19,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
0: [2022-11-26 02:24:19,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:24:19,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 02:24:19,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 20: [2022-11-26 02:24:19,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 02:24:19,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 02:24:19,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 3: [2022-11-26 02:24:19,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:24:19,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 02:24:19,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 22: [2022-11-26 02:24:19,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:24:19,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
22: [2022-11-26 02:24:19,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 30: [2022-11-26 02:24:19,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 22: [2022-11-26 02:24:19,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 30: [2022-11-26 02:24:19,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 12: [2022-11-26 02:24:19,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:24:19,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 30: [2022-11-26 02:24:19,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:24:19,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 30: [2022-11-26 02:24:19,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 02:24:19,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 12: [2022-11-26 02:24:19,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 02:24:19,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 02:24:19,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 7: [2022-11-26 02:24:19,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:24:19,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 02:24:19,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 5: [2022-11-26 02:24:19,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:24:19,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 02:24:19,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 21: [2022-11-26 02:24:19,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:24:19,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 17: [2022-11-26 02:24:19,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:24:19,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 
21: [2022-11-26 02:24:19,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 13: [2022-11-26 02:24:19,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:24:19,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 02:24:19,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 28: [2022-11-26 02:24:19,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:24:19,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 02:24:19,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 02:24:19,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 17: [2022-11-26 02:24:19,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 14: [2022-11-26 02:24:19,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:24:19,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 02:24:19,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
24: [2022-11-26 02:24:19,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:24:19,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:24:19,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:24:19,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 21: [2022-11-26 02:24:19,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 30: [2022-11-26 02:24:19,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 21: [2022-11-26 02:24:19,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 26: [2022-11-26 02:24:19,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:24:19,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 26: [2022-11-26 02:24:19,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 24: [2022-11-26 02:24:19,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 26: [2022-11-26 02:24:19,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
13: [2022-11-26 02:24:19,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:24:19,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 02:24:19,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 19: [2022-11-26 02:24:19,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:24:19,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 02:24:19,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 19: [2022-11-26 02:24:19,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:24:19,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 02:24:19,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 22: [2022-11-26 02:24:19,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:24:19,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 02:24:19,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
0: [2022-11-26 02:24:19,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:24:19,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 02:24:19,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 1: [2022-11-26 02:24:19,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:24:19,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:24:19,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 02:24:19,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 7: [2022-11-26 02:24:19,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 02:24:19,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 12: [2022-11-26 02:24:19,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:24:19,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 24: [2022-11-26 02:24:19,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
12: [2022-11-26 02:24:19,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 24: [2022-11-26 02:24:19,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 02:24:19,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 5: [2022-11-26 02:24:19,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:24:19,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 02:24:19,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 24: [2022-11-26 02:24:19,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:24:19,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 02:24:19,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 20: [2022-11-26 02:24:19,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 02:24:19,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 02:24:19,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
20: [2022-11-26 02:24:19,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:24:19,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:24:19,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 20: [2022-11-26 02:24:19,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 21: [2022-11-26 02:24:19,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 20: [2022-11-26 02:24:19,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 7: [2022-11-26 02:24:19,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:24:19,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 12: [2022-11-26 02:24:19,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:24:19,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:24:19,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:24:19,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
12: [2022-11-26 02:24:19,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 14: [2022-11-26 02:24:19,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 12: [2022-11-26 02:24:19,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 14: [2022-11-26 02:24:19,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 22: [2022-11-26 02:24:19,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 26: [2022-11-26 02:24:19,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:24:19,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 26: [2022-11-26 02:24:19,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 02:24:19,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 5: [2022-11-26 02:24:19,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:24:19,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 02:24:19,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
22: [2022-11-26 02:24:19,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:24:19,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 02:24:19,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 9: [2022-11-26 02:24:19,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:24:19,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:24:19,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:24:19,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:24:19,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 02:24:19,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 02:24:19,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 02:24:19,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 9: [2022-11-26 02:24:19,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
9: [2022-11-26 02:24:19,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 24: [2022-11-26 02:24:19,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:24:19,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 02:24:19,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 24: [2022-11-26 02:24:19,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 02:24:19,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 17: [2022-11-26 02:24:19,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:24:19,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:24:19,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 02:24:19,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 17: [2022-11-26 02:24:19,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 02:24:19,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 0: [2022-11-26 02:24:19,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 17: [2022-11-26 02:24:19,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 17: [2022-11-26 02:24:19,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 13: [2022-11-26 02:24:19,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:24:19,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:24:19,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:24:19,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 02:24:19,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
3: [2022-11-26 02:24:19,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 02:24:19,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 20: [2022-11-26 02:24:19,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 02:24:19,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 02:24:19,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 28: [2022-11-26 02:24:19,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 02:24:19,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 28: [2022-11-26 02:24:19,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 02:24:19,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 02:24:19,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 28: [2022-11-26 02:24:19,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
28: [2022-11-26 02:24:19,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 02:24:19,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 19: [2022-11-26 02:24:19,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:24:19,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 02:24:19,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 5: [2022-11-26 02:24:19,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:24:19,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 21: [2022-11-26 02:24:19,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:24:19,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 21: [2022-11-26 02:24:19,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 02:24:19,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 1: [2022-11-26 02:24:19,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 02:24:19,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 02:24:19,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 14: [2022-11-26 02:24:19,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:24:19,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 02:24:19,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 17: [2022-11-26 02:24:19,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:24:19,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 02:24:19,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 20: [2022-11-26 02:24:19,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 02:24:19,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 02:24:19,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 26: [2022-11-26 02:24:19,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-26 02:24:19,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 02:24:19,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 19: [2022-11-26 02:24:19,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:24:19,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 02:24:19,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 6: [2022-11-26 02:24:19,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:24:19,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:24:19,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:24:19,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 6: [2022-11-26 02:24:19,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:24:19,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 13: [2022-11-26 02:24:19,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
6: [2022-11-26 02:24:19,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 02:24:19,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 02:24:19,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 6: [2022-11-26 02:24:19,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 6: [2022-11-26 02:24:19,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 02:24:19,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 6: [2022-11-26 02:24:19,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 13: [2022-11-26 02:24:19,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:24:19,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:24:19,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
13: [2022-11-26 02:24:19,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 2: [2022-11-26 02:24:19,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 02:24:19,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 13: [2022-11-26 02:24:19,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 2: [2022-11-26 02:24:19,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 2: [2022-11-26 02:24:19,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 2: [2022-11-26 02:24:19,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:24:19,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 02:24:19,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 8: [2022-11-26 02:24:19,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:24:19,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:24:19,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 02:24:19,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:24:19,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 02:24:19,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 02:24:19,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 02:24:19,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 02:24:19,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 8: [2022-11-26 02:24:19,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 8: [2022-11-26 02:24:19,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 8: [2022-11-26 02:24:19,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 30: [2022-11-26 02:24:19,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:24:19,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 02:24:19,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
0: [2022-11-26 02:24:19,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 02:24:19,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 1: [2022-11-26 02:24:19,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:24:19,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 02:24:19,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:24:19,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 1: [2022-11-26 02:24:19,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 02:24:19,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 19: [2022-11-26 02:24:19,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:24:19,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 02:24:19,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 1: [2022-11-26 02:24:19,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-26 02:24:19,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 02:24:19,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 24: [2022-11-26 02:24:19,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:24:19,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 02:24:19,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 26: [2022-11-26 02:24:19,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 02:24:19,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 02:24:19,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 22: [2022-11-26 02:24:19,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:24:19,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 02:24:19,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 28: [2022-11-26 02:24:19,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
0: [2022-11-26 02:24:19,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 28: [2022-11-26 02:24:19,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 0: [2022-11-26 02:24:19,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 02:24:19,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 28: [2022-11-26 02:24:19,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 30: [2022-11-26 02:24:19,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:24:19,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 02:24:19,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 9: [2022-11-26 02:24:19,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:24:19,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 02:24:19,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 12: [2022-11-26 02:24:19,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-26 02:24:19,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 02:24:19,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 6: [2022-11-26 02:24:19,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:24:19,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 02:24:19,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 13: [2022-11-26 02:24:19,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:24:19,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 02:24:19,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 21: [2022-11-26 02:24:19,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:24:19,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 02:24:19,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 3: [2022-11-26 02:24:19,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 02:24:19,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 02:24:19,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 2: [2022-11-26 02:24:19,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:24:19,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 02:24:19,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 18: [2022-11-26 02:24:19,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 02:24:19,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 02:24:19,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 7: [2022-11-26 02:24:19,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:24:19,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 02:24:19,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 18: [2022-11-26 02:24:19,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
18: [2022-11-26 02:24:19,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 8: [2022-11-26 02:24:19,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 18: [2022-11-26 02:24:19,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 8: [2022-11-26 02:24:19,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 02:24:19,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 14: [2022-11-26 02:24:19,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:24:19,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 02:24:19,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 18: [2022-11-26 02:24:19,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 02:24:19,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 02:24:19,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 4: [2022-11-26 02:24:19,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 02:24:19,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:24:19,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 02:24:19,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 02:24:19,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 4: [2022-11-26 02:24:19,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 29: [2022-11-26 02:24:19,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 02:24:19,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 02:24:19,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 29: [2022-11-26 02:24:19,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 02:24:19,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 02:24:19,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 18: [2022-11-26 02:24:19,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-26 02:24:19,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 02:24:19,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 29: [2022-11-26 02:24:19,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 02:24:19,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 02:24:19,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 02:24:19,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 02:24:19,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 02:24:19,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 02:24:19,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 29: [2022-11-26 02:24:19,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 29: [2022-11-26 02:24:19,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 27: [2022-11-26 02:24:19,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 
27: [2022-11-26 02:24:19,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 02:24:19,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 4: [2022-11-26 02:24:19,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:24:19,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:24:19,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:24:19,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 02:24:19,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 02:24:19,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 27: [2022-11-26 02:24:19,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:24:19,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 02:24:19,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 4: [2022-11-26 02:24:19,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
27: [2022-11-26 02:24:19,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 20: [2022-11-26 02:24:19,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 02:24:19,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 27: [2022-11-26 02:24:19,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 20: [2022-11-26 02:24:19,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 27: [2022-11-26 02:24:19,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 02:24:19,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 02:24:19,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 10: [2022-11-26 02:24:19,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:24:19,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:24:19,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-26 02:24:19,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 02:24:19,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 02:24:19,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 02:24:19,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 10: [2022-11-26 02:24:19,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 10: [2022-11-26 02:24:19,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 5: [2022-11-26 02:24:19,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:24:19,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 02:24:19,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 10: [2022-11-26 02:24:19,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:24:19,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 02:24:19,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 02:24:19,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 02:24:19,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 10: [2022-11-26 02:24:19,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 25: [2022-11-26 02:24:19,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 02:24:19,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 02:24:19,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 17: [2022-11-26 02:24:19,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:24:19,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 25: [2022-11-26 02:24:19,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:24:19,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
25: [2022-11-26 02:24:19,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 02:24:19,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 18: [2022-11-26 02:24:19,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 02:24:19,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 02:24:19,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 25: [2022-11-26 02:24:19,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 18: [2022-11-26 02:24:19,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 02:24:19,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 25: [2022-11-26 02:24:19,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 18: [2022-11-26 02:24:19,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 25: [2022-11-26 02:24:19,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 27: [2022-11-26 02:24:19,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
27: [2022-11-26 02:24:19,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 02:24:19,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 27: [2022-11-26 02:24:19,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 02:24:19,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 02:24:19,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 25: [2022-11-26 02:24:19,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 02:24:19,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 02:24:19,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 02:24:19,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 25: [2022-11-26 02:24:19,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 02:24:19,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 02:24:19,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
25: [2022-11-26 02:24:19,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 02:24:19,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 28: [2022-11-26 02:24:19,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 02:24:19,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 02:24:19,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 11: [2022-11-26 02:24:19,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:24:19,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:24:19,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 02:24:19,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 02:24:19,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 02:24:19,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 02:24:19,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 11: [2022-11-26 02:24:19,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 11: [2022-11-26 02:24:19,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 11: [2022-11-26 02:24:19,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:24:19,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 02:24:19,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 11: [2022-11-26 02:24:19,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:24:19,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 02:24:19,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
26: [2022-11-26 02:24:19,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 02:24:19,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 02:24:19,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 30: [2022-11-26 02:24:19,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:24:19,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 02:24:19,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 19: [2022-11-26 02:24:19,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:24:19,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 02:24:19,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 22: [2022-11-26 02:24:19,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:24:19,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-26 02:24:19,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 22: [2022-11-26 02:24:19,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 9: [2022-11-26 02:24:19,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 22: [2022-11-26 02:24:19,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 24: [2022-11-26 02:24:19,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:24:19,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 02:24:19,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 12: [2022-11-26 02:24:19,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:24:19,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 02:24:19,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 1: [2022-11-26 02:24:19,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:24:19,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 02:24:19,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 1: [2022-11-26 02:24:19,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 3: [2022-11-26 02:24:19,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 1: [2022-11-26 02:24:19,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 2: [2022-11-26 02:24:19,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:24:19,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 02:24:19,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 6: [2022-11-26 02:24:19,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:24:19,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 02:24:19,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 31: [2022-11-26 02:24:19,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 02:24:19,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 
31: [2022-11-26 02:24:19,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 02:24:19,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 02:24:19,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 02:24:19,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 02:24:19,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 02:24:19,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 02:24:19,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 02:24:19,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 02:24:19,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 31: [2022-11-26 02:24:19,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 31: [2022-11-26 02:24:19,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 31: [2022-11-26 02:24:19,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
31: [2022-11-26 02:24:19,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 31: [2022-11-26 02:24:19,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 02:24:19,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 02:24:19,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 0: [2022-11-26 02:24:19,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:24:19,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 02:24:19,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 13: [2022-11-26 02:24:19,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:24:19,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 02:24:19,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 8: [2022-11-26 02:24:19,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 02:24:19,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 02:24:19,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 29: [2022-11-26 02:24:19,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 02:24:19,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 02:24:19,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 4: [2022-11-26 02:24:19,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:24:19,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 02:24:19,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 7: [2022-11-26 02:24:19,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:24:19,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 02:24:19,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 21: [2022-11-26 02:24:19,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-26 02:24:19,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 02:24:19,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 27: [2022-11-26 02:24:19,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 02:24:19,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 02:24:19,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 11: [2022-11-26 02:24:19,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:24:19,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 02:24:19,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 17: [2022-11-26 02:24:19,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:24:19,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 02:24:19,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 5: [2022-11-26 02:24:19,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
14: [2022-11-26 02:24:19,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:24:19,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 02:24:19,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 14: [2022-11-26 02:24:19,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 02:24:19,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 10: [2022-11-26 02:24:19,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:24:19,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 02:24:19,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 20: [2022-11-26 02:24:20,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 02:24:20,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 02:24:20,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 18: [2022-11-26 02:24:20,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
18: [2022-11-26 02:24:20,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 02:24:20,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 25: [2022-11-26 02:24:20,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 02:24:20,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 02:24:20,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 0: [2022-11-26 02:24:20,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:24:20,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 02:24:20,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 1: [2022-11-26 02:24:20,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:24:20,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 02:24:20,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 28: [2022-11-26 02:24:20,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 
28: [2022-11-26 02:24:20,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 26: [2022-11-26 02:24:20,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 02:24:20,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 02:24:20,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 9: [2022-11-26 02:24:20,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:24:20,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 02:24:20,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 22: [2022-11-26 02:24:20,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:24:20,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 02:24:20,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 19: [2022-11-26 02:24:20,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
19: [2022-11-26 02:24:20,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 02:24:20,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 12: [2022-11-26 02:24:20,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:24:20,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 02:24:20,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 6: [2022-11-26 02:24:20,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:24:20,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 28: [2022-11-26 02:24:20,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 6: [2022-11-26 02:24:20,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 24: [2022-11-26 02:24:20,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:24:20,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 02:24:20,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
2: [2022-11-26 02:24:20,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:24:20,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:24:20,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 3: [2022-11-26 02:24:20,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 2: [2022-11-26 02:24:20,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 3: [2022-11-26 02:24:20,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 30: [2022-11-26 02:24:20,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:24:20,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 02:24:20,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 4: [2022-11-26 02:24:20,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:24:20,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 02:24:20,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
8: [2022-11-26 02:24:20,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:24:20,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 02:24:20,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 13: [2022-11-26 02:24:20,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:24:20,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 02:24:20,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 31: [2022-11-26 02:24:20,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 02:24:20,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 02:24:20,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 29: [2022-11-26 02:24:20,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 02:24:20,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 02:24:20,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
7: [2022-11-26 02:24:20,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:24:20,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:24:20,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 21: [2022-11-26 02:24:20,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 7: [2022-11-26 02:24:20,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 17: [2022-11-26 02:24:20,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:24:20,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 17: [2022-11-26 02:24:20,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 02:24:20,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 0: [2022-11-26 02:24:20,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:24:20,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 02:24:20,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
20: [2022-11-26 02:24:20,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 02:24:20,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 02:24:20,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 18: [2022-11-26 02:24:20,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 02:24:20,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 02:24:20,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 5: [2022-11-26 02:24:20,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:24:20,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 02:24:20,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 21: [2022-11-26 02:24:20,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:24:20,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 02:24:20,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
14: [2022-11-26 02:24:20,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:24:20,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 02:24:20,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 28: [2022-11-26 02:24:20,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:24:20,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:24:20,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 02:24:20,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 28: [2022-11-26 02:24:20,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 02:24:20,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 8: [2022-11-26 02:24:20,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:24:20,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
8: [2022-11-26 02:24:20,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 13: [2022-11-26 02:24:20,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 8: [2022-11-26 02:24:20,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 13: [2022-11-26 02:24:20,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 1: [2022-11-26 02:24:20,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:24:20,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 02:24:20,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 11: [2022-11-26 02:24:20,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:24:20,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:24:20,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 9: [2022-11-26 02:24:20,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 11: [2022-11-26 02:24:20,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
9: [2022-11-26 02:24:20,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 4: [2022-11-26 02:24:20,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:24:20,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 02:24:20,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 6: [2022-11-26 02:24:20,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 25: [2022-11-26 02:24:20,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:24:20,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 02:24:20,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 7: [2022-11-26 02:24:20,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 25: [2022-11-26 02:24:20,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 7: [2022-11-26 02:24:20,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 25: [2022-11-26 02:24:20,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
7: [2022-11-26 02:24:20,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 10: [2022-11-26 02:24:20,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:24:20,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 02:24:20,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 3: [2022-11-26 02:24:20,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:24:20,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:24:20,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 2: [2022-11-26 02:24:20,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 3: [2022-11-26 02:24:20,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 2: [2022-11-26 02:24:20,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 12: [2022-11-26 02:24:20,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:24:20,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
12: [2022-11-26 02:24:20,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 02:24:20,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 24: [2022-11-26 02:24:20,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 02:24:20,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 29: [2022-11-26 02:24:20,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 02:24:20,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 02:24:20,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 14: [2022-11-26 02:24:20,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:24:20,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 26: [2022-11-26 02:24:20,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 
14: [2022-11-26 02:24:20,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 22: [2022-11-26 02:24:20,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 26: [2022-11-26 02:24:20,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 22: [2022-11-26 02:24:20,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 14: [2022-11-26 02:24:20,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 26: [2022-11-26 02:24:20,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 27: [2022-11-26 02:24:20,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 31: [2022-11-26 02:24:20,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 02:24:20,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 27: [2022-11-26 02:24:20,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 31: [2022-11-26 02:24:20,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 27: [2022-11-26 02:24:20,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
10: [2022-11-26 02:24:20,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:24:20,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 02:24:20,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 19: [2022-11-26 02:24:20,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:24:20,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 27: [2022-11-26 02:24:20,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:24:20,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 27: [2022-11-26 02:24:20,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 02:24:20,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 11: [2022-11-26 02:24:20,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:24:20,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 02:24:20,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
15: [2022-11-26 02:24:20,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:24:20,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:24:20,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 02:24:20,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 15: [2022-11-26 02:24:20,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 02:24:20,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 15: [2022-11-26 02:24:20,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:24:20,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 02:24:20,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 15: [2022-11-26 02:24:20,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:24:20,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 02:24:20,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
15: [2022-11-26 02:24:20,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:24:20,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 02:24:20,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 15: [2022-11-26 02:24:20,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:24:20,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 02:24:20,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 15: [2022-11-26 02:24:20,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:24:20,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 02:24:20,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 15: [2022-11-26 02:24:20,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:24:20,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 02:24:20,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
23: [2022-11-26 02:24:20,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:24:20,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:24:20,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:24:20,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:24:20,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 02:24:20,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 02:24:20,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 23: [2022-11-26 02:24:20,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 02:24:20,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 02:24:20,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 23: [2022-11-26 02:24:20,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 23: [2022-11-26 02:24:20,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
23: [2022-11-26 02:24:20,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:24:20,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 02:24:20,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:24:20,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 23: [2022-11-26 02:24:20,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 02:24:20,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 23: [2022-11-26 02:24:20,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:24:20,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 02:24:20,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 23: [2022-11-26 02:24:20,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:24:20,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 02:24:20,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
16: [2022-11-26 02:24:20,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:24:20,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:24:20,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:24:20,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 02:24:20,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 02:24:20,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 02:24:20,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 16: [2022-11-26 02:24:20,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 16: [2022-11-26 02:24:20,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 16: [2022-11-26 02:24:20,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:24:20,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 02:24:20,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
16: [2022-11-26 02:24:20,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:24:20,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:24:20,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:24:20,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 02:24:20,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 02:24:20,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 02:24:20,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 16: [2022-11-26 02:24:20,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 16: [2022-11-26 02:24:20,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:24:20,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 16: [2022-11-26 02:24:20,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step37000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 02:24:20,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
0: successfully saved checkpoint at iteration 37000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2610.72 31: iteration 37010/ 173500 | consumed samples: 9474560 | consumed tokens: 19403898880 | elapsed time per iteration (s): 1.03 | learning rate: 1.819E-04 | global batch size: 256 | lm loss: 2.123871E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.910 | TFLOPs: 15.00 | 31: iteration 37020/ 173500 | consumed samples: 9477120 | consumed tokens: 19409141760 | elapsed time per iteration (s): 0.84 | learning rate: 1.819E-04 | global batch size: 256 | lm loss: 2.084976E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.984 | TFLOPs: 18.45 | 31: iteration 37030/ 173500 | consumed samples: 9479680 | consumed tokens: 19414384640 | elapsed time per iteration (s): 0.79 | learning rate: 1.819E-04 | global batch size: 256 | lm loss: 2.124669E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.651 | TFLOPs: 19.64 | 31: iteration 37040/ 173500 | consumed samples: 9482240 | consumed tokens: 19419627520 | elapsed time per iteration (s): 0.88 | learning rate: 1.819E-04 | global batch size: 256 | lm loss: 2.098216E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.166 | TFLOPs: 17.68 | 31: iteration 37050/ 173500 | consumed samples: 9484800 | consumed tokens: 19424870400 | elapsed time per iteration (s): 0.75 | learning rate: 1.819E-04 | global batch size: 256 | lm loss: 2.122343E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.984 | TFLOPs: 20.69 | 31: iteration 37060/ 173500 | consumed samples: 9487360 | consumed tokens: 19430113280 | elapsed time per iteration (s): 0.78 | 
learning rate: 1.819E-04 | global batch size: 256 | lm loss: 2.099353E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.478 | TFLOPs: 19.93 | 31: iteration 37070/ 173500 | consumed samples: 9489920 | consumed tokens: 19435356160 | elapsed time per iteration (s): 0.79 | learning rate: 1.818E-04 | global batch size: 256 | lm loss: 2.101018E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.242 | TFLOPs: 19.62 | 31: iteration 37080/ 173500 | consumed samples: 9492480 | consumed tokens: 19440599040 | elapsed time per iteration (s): 0.77 | learning rate: 1.818E-04 | global batch size: 256 | lm loss: 2.102341E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.617 | TFLOPs: 20.18 | 31: iteration 37090/ 173500 | consumed samples: 9495040 | consumed tokens: 19445841920 | elapsed time per iteration (s): 0.74 | learning rate: 1.818E-04 | global batch size: 256 | lm loss: 2.121192E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.554 | TFLOPs: 20.84 | 31: iteration 37100/ 173500 | consumed samples: 9497600 | consumed tokens: 19451084800 | elapsed time per iteration (s): 0.77 | learning rate: 1.818E-04 | global batch size: 256 | lm loss: 2.100448E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.140 | TFLOPs: 20.09 | 31: iteration 37110/ 173500 | consumed samples: 9500160 | consumed tokens: 19456327680 | elapsed time per iteration (s): 0.73 | learning rate: 1.818E-04 | global batch size: 256 | lm loss: 2.124667E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.079 | TFLOPs: 21.30 | 31: iteration 37120/ 173500 
| consumed samples: 9502720 | consumed tokens: 19461570560 | elapsed time per iteration (s): 0.75 | learning rate: 1.818E-04 | global batch size: 256 | lm loss: 2.086230E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.448 | TFLOPs: 20.78 | 31: iteration 37130/ 173500 | consumed samples: 9505280 | consumed tokens: 19466813440 | elapsed time per iteration (s): 0.92 | learning rate: 1.818E-04 | global batch size: 256 | lm loss: 2.142436E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 278.615 | TFLOPs: 16.86 | 31: iteration 37140/ 173500 | consumed samples: 9507840 | consumed tokens: 19472056320 | elapsed time per iteration (s): 0.81 | learning rate: 1.818E-04 | global batch size: 256 | lm loss: 2.095976E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.740 | TFLOPs: 19.16 | 31: iteration 37150/ 173500 | consumed samples: 9510400 | consumed tokens: 19477299200 | elapsed time per iteration (s): 0.76 | learning rate: 1.818E-04 | global batch size: 256 | lm loss: 2.108811E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.176 | TFLOPs: 20.40 | 31: iteration 37160/ 173500 | consumed samples: 9512960 | consumed tokens: 19482542080 | elapsed time per iteration (s): 0.79 | learning rate: 1.818E-04 | global batch size: 256 | lm loss: 2.120804E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.731 | TFLOPs: 19.71 | 31: iteration 37170/ 173500 | consumed samples: 9515520 | consumed tokens: 19487784960 | elapsed time per iteration (s): 0.82 | learning rate: 1.818E-04 | global batch size: 256 | lm loss: 2.086471E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 313.361 | TFLOPs: 18.96 | 31: iteration 37180/ 173500 | consumed samples: 9518080 | consumed tokens: 19493027840 | elapsed time per iteration (s): 2.17 | learning rate: 1.817E-04 | global batch size: 256 | lm loss: 2.107169E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 118.067 | TFLOPs: 7.14 | 31: iteration 37190/ 173500 | consumed samples: 9520640 | consumed tokens: 19498270720 | elapsed time per iteration (s): 0.75 | learning rate: 1.817E-04 | global batch size: 256 | lm loss: 2.101871E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.735 | TFLOPs: 20.55 | 31: iteration 37200/ 173500 | consumed samples: 9523200 | consumed tokens: 19503513600 | elapsed time per iteration (s): 0.79 | learning rate: 1.817E-04 | global batch size: 256 | lm loss: 2.094848E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.560 | TFLOPs: 19.70 | 31: iteration 37210/ 173500 | consumed samples: 9525760 | consumed tokens: 19508756480 | elapsed time per iteration (s): 0.82 | learning rate: 1.817E-04 | global batch size: 256 | lm loss: 2.105063E+00 | grad norm: 0.225 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.809 | TFLOPs: 18.80 | 31: iteration 37220/ 173500 | consumed samples: 9528320 | consumed tokens: 19513999360 | elapsed time per iteration (s): 0.79 | learning rate: 1.817E-04 | global batch size: 256 | lm loss: 2.117381E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.378 | TFLOPs: 19.50 | 31: iteration 37230/ 173500 | consumed samples: 9530880 | consumed tokens: 19519242240 | elapsed time per iteration (s): 0.79 | learning rate: 1.817E-04 | global batch size: 
256 | lm loss: 2.124355E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.157 | TFLOPs: 19.55 | 31: iteration 37240/ 173500 | consumed samples: 9533440 | consumed tokens: 19524485120 | elapsed time per iteration (s): 0.88 | learning rate: 1.817E-04 | global batch size: 256 | lm loss: 2.079014E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.558 | TFLOPs: 17.64 | 31: iteration 37250/ 173500 | consumed samples: 9536000 | consumed tokens: 19529728000 | elapsed time per iteration (s): 0.81 | learning rate: 1.817E-04 | global batch size: 256 | lm loss: 2.116775E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.470 | TFLOPs: 19.02 | 31: iteration 37260/ 173500 | consumed samples: 9538560 | consumed tokens: 19534970880 | elapsed time per iteration (s): 0.82 | learning rate: 1.817E-04 | global batch size: 256 | lm loss: 2.090561E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.193 | TFLOPs: 18.95 | 31: iteration 37270/ 173500 | consumed samples: 9541120 | consumed tokens: 19540213760 | elapsed time per iteration (s): 0.79 | learning rate: 1.817E-04 | global batch size: 256 | lm loss: 2.130879E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.339 | TFLOPs: 19.50 | 31: iteration 37280/ 173500 | consumed samples: 9543680 | consumed tokens: 19545456640 | elapsed time per iteration (s): 0.83 | learning rate: 1.816E-04 | global batch size: 256 | lm loss: 2.113634E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.912 | TFLOPs: 18.63 | 31: iteration 37290/ 173500 | consumed samples: 9546240 | consumed 
tokens: 19550699520 | elapsed time per iteration (s): 0.81 | learning rate: 1.816E-04 | global batch size: 256 | lm loss: 2.108965E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.683 | TFLOPs: 19.04 | 31: iteration 37300/ 173500 | consumed samples: 9548800 | consumed tokens: 19555942400 | elapsed time per iteration (s): 0.81 | learning rate: 1.816E-04 | global batch size: 256 | lm loss: 2.102416E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.252 | TFLOPs: 19.01 | 31: iteration 37310/ 173500 | consumed samples: 9551360 | consumed tokens: 19561185280 | elapsed time per iteration (s): 0.82 | learning rate: 1.816E-04 | global batch size: 256 | lm loss: 2.107211E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.771 | TFLOPs: 18.80 | 31: iteration 37320/ 173500 | consumed samples: 9553920 | consumed tokens: 19566428160 | elapsed time per iteration (s): 0.81 | learning rate: 1.816E-04 | global batch size: 256 | lm loss: 2.105309E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.275 | TFLOPs: 19.07 | 31: iteration 37330/ 173500 | consumed samples: 9556480 | consumed tokens: 19571671040 | elapsed time per iteration (s): 0.75 | learning rate: 1.816E-04 | global batch size: 256 | lm loss: 2.106391E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.376 | TFLOPs: 20.59 | 31: iteration 37340/ 173500 | consumed samples: 9559040 | consumed tokens: 19576913920 | elapsed time per iteration (s): 0.76 | learning rate: 1.816E-04 | global batch size: 256 | lm loss: 2.092623E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 335.202 | TFLOPs: 20.28 | 31: iteration 37350/ 173500 | consumed samples: 9561600 | consumed tokens: 19582156800 | elapsed time per iteration (s): 0.80 | learning rate: 1.816E-04 | global batch size: 256 | lm loss: 2.106714E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.781 | TFLOPs: 19.47 | 31: iteration 37360/ 173500 | consumed samples: 9564160 | consumed tokens: 19587399680 | elapsed time per iteration (s): 0.87 | learning rate: 1.816E-04 | global batch size: 256 | lm loss: 2.119329E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.398 | TFLOPs: 17.87 | 31: iteration 37370/ 173500 | consumed samples: 9566720 | consumed tokens: 19592642560 | elapsed time per iteration (s): 0.79 | learning rate: 1.816E-04 | global batch size: 256 | lm loss: 2.095123E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.781 | TFLOPs: 19.53 | 31: iteration 37380/ 173500 | consumed samples: 9569280 | consumed tokens: 19597885440 | elapsed time per iteration (s): 0.81 | learning rate: 1.815E-04 | global batch size: 256 | lm loss: 2.141093E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.953 | TFLOPs: 19.11 | 31: iteration 37390/ 173500 | consumed samples: 9571840 | consumed tokens: 19603128320 | elapsed time per iteration (s): 0.78 | learning rate: 1.815E-04 | global batch size: 256 | lm loss: 2.094710E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.509 | TFLOPs: 19.75 | 31: iteration 37400/ 173500 | consumed samples: 9574400 | consumed tokens: 19608371200 | elapsed time per iteration (s): 0.77 | learning rate: 1.815E-04 | global batch size: 256 | lm loss: 2.094886E+00 | grad norm: 
0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.345 | TFLOPs: 20.23 | 31: iteration 37410/ 173500 | consumed samples: 9576960 | consumed tokens: 19613614080 | elapsed time per iteration (s): 0.76 | learning rate: 1.815E-04 | global batch size: 256 | lm loss: 2.072098E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.837 | TFLOPs: 20.38 | 31: iteration 37420/ 173500 | consumed samples: 9579520 | consumed tokens: 19618856960 | elapsed time per iteration (s): 0.78 | learning rate: 1.815E-04 | global batch size: 256 | lm loss: 2.088524E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.063 | TFLOPs: 19.97 | 31: iteration 37430/ 173500 | consumed samples: 9582080 | consumed tokens: 19624099840 | elapsed time per iteration (s): 0.77 | learning rate: 1.815E-04 | global batch size: 256 | lm loss: 2.099176E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.146 | TFLOPs: 20.09 | 31: iteration 37440/ 173500 | consumed samples: 9584640 | consumed tokens: 19629342720 | elapsed time per iteration (s): 0.80 | learning rate: 1.815E-04 | global batch size: 256 | lm loss: 2.088327E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.915 | TFLOPs: 19.41 | 31: iteration 37450/ 173500 | consumed samples: 9587200 | consumed tokens: 19634585600 | elapsed time per iteration (s): 0.81 | learning rate: 1.815E-04 | global batch size: 256 | lm loss: 2.114659E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.102 | TFLOPs: 19.06 | 31: iteration 37460/ 173500 | consumed samples: 9589760 | consumed tokens: 19639828480 | elapsed time per 
iteration (s): 0.84 | learning rate: 1.815E-04 | global batch size: 256 | lm loss: 2.111171E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.337 | TFLOPs: 18.41 | 31: iteration 37470/ 173500 | consumed samples: 9592320 | consumed tokens: 19645071360 | elapsed time per iteration (s): 0.84 | learning rate: 1.815E-04 | global batch size: 256 | lm loss: 2.127572E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.583 | TFLOPs: 18.55 | 31: iteration 37480/ 173500 | consumed samples: 9594880 | consumed tokens: 19650314240 | elapsed time per iteration (s): 0.82 | learning rate: 1.814E-04 | global batch size: 256 | lm loss: 2.102344E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.157 | TFLOPs: 18.88 | 31: iteration 37490/ 173500 | consumed samples: 9597440 | consumed tokens: 19655557120 | elapsed time per iteration (s): 0.83 | learning rate: 1.814E-04 | global batch size: 256 | lm loss: 2.110256E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.366 | TFLOPs: 18.66 | 31: iteration 37500/ 173500 | consumed samples: 9600000 | consumed tokens: 19660800000 | elapsed time per iteration (s): 0.82 | learning rate: 1.814E-04 | global batch size: 256 | lm loss: 2.122410E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.630 | TFLOPs: 18.91 | 31: iteration 37510/ 173500 | consumed samples: 9602560 | consumed tokens: 19666042880 | elapsed time per iteration (s): 0.83 | learning rate: 1.814E-04 | global batch size: 256 | lm loss: 2.113678E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.482 | TFLOPs: 18.72 | 31: 
iteration 37520/ 173500 | consumed samples: 9605120 | consumed tokens: 19671285760 | elapsed time per iteration (s): 0.80 | learning rate: 1.814E-04 | global batch size: 256 | lm loss: 2.078175E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.728 | TFLOPs: 19.28 | 31: iteration 37530/ 173500 | consumed samples: 9607680 | consumed tokens: 19676528640 | elapsed time per iteration (s): 0.84 | learning rate: 1.814E-04 | global batch size: 256 | lm loss: 2.102885E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.179 | TFLOPs: 18.40 | 31: iteration 37540/ 173500 | consumed samples: 9610240 | consumed tokens: 19681771520 | elapsed time per iteration (s): 0.81 | learning rate: 1.814E-04 | global batch size: 256 | lm loss: 2.046083E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.776 | TFLOPs: 19.10 | 31: iteration 37550/ 173500 | consumed samples: 9612800 | consumed tokens: 19687014400 | elapsed time per iteration (s): 1.46 | learning rate: 1.814E-04 | global batch size: 256 | lm loss: 2.089096E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 175.374 | TFLOPs: 10.61 | 31: iteration 37560/ 173500 | consumed samples: 9615360 | consumed tokens: 19692257280 | elapsed time per iteration (s): 0.79 | learning rate: 1.814E-04 | global batch size: 256 | lm loss: 2.071420E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.888 | TFLOPs: 19.72 | 31: iteration 37570/ 173500 | consumed samples: 9617920 | consumed tokens: 19697500160 | elapsed time per iteration (s): 0.82 | learning rate: 1.814E-04 | global batch size: 256 | lm loss: 2.102341E+00 | grad norm: 0.132 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.081 | TFLOPs: 18.88 | 31: iteration 37580/ 173500 | consumed samples: 9620480 | consumed tokens: 19702743040 | elapsed time per iteration (s): 0.85 | learning rate: 1.813E-04 | global batch size: 256 | lm loss: 2.131798E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.623 | TFLOPs: 18.31 | 31: iteration 37590/ 173500 | consumed samples: 9623040 | consumed tokens: 19707985920 | elapsed time per iteration (s): 0.85 | learning rate: 1.813E-04 | global batch size: 256 | lm loss: 2.088199E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.801 | TFLOPs: 18.26 | 31: iteration 37600/ 173500 | consumed samples: 9625600 | consumed tokens: 19713228800 | elapsed time per iteration (s): 0.81 | learning rate: 1.813E-04 | global batch size: 256 | lm loss: 2.124307E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.994 | TFLOPs: 19.24 | 31: iteration 37610/ 173500 | consumed samples: 9628160 | consumed tokens: 19718471680 | elapsed time per iteration (s): 0.83 | learning rate: 1.813E-04 | global batch size: 256 | lm loss: 2.103067E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.124 | TFLOPs: 18.70 | 31: iteration 37620/ 173500 | consumed samples: 9630720 | consumed tokens: 19723714560 | elapsed time per iteration (s): 0.86 | learning rate: 1.813E-04 | global batch size: 256 | lm loss: 2.126750E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.310 | TFLOPs: 18.11 | 31: iteration 37630/ 173500 | consumed samples: 9633280 | consumed tokens: 19728957440 | elapsed time per iteration (s): 0.85 | learning rate: 
1.813E-04 | global batch size: 256 | lm loss: 2.085585E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.884 | TFLOPs: 18.32 | 31: iteration 37640/ 173500 | consumed samples: 9635840 | consumed tokens: 19734200320 | elapsed time per iteration (s): 0.82 | learning rate: 1.813E-04 | global batch size: 256 | lm loss: 2.098374E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.719 | TFLOPs: 18.98 | 31: iteration 37650/ 173500 | consumed samples: 9638400 | consumed tokens: 19739443200 | elapsed time per iteration (s): 0.73 | learning rate: 1.813E-04 | global batch size: 256 | lm loss: 2.103011E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.385 | TFLOPs: 21.26 | 31: iteration 37660/ 173500 | consumed samples: 9640960 | consumed tokens: 19744686080 | elapsed time per iteration (s): 0.75 | learning rate: 1.813E-04 | global batch size: 256 | lm loss: 2.086165E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.879 | TFLOPs: 20.56 | 31: iteration 37670/ 173500 | consumed samples: 9643520 | consumed tokens: 19749928960 | elapsed time per iteration (s): 0.77 | learning rate: 1.813E-04 | global batch size: 256 | lm loss: 2.082685E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.998 | TFLOPs: 20.21 | 31: iteration 37680/ 173500 | consumed samples: 9646080 | consumed tokens: 19755171840 | elapsed time per iteration (s): 0.87 | learning rate: 1.812E-04 | global batch size: 256 | lm loss: 2.101637E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.477 | TFLOPs: 17.82 | 31: iteration 37690/ 173500 | consumed 
samples: 9648640 | consumed tokens: 19760414720 | elapsed time per iteration (s): 0.76 | learning rate: 1.812E-04 | global batch size: 256 | lm loss: 2.073702E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.760 | TFLOPs: 20.49 | 31: iteration 37700/ 173500 | consumed samples: 9651200 | consumed tokens: 19765657600 | elapsed time per iteration (s): 0.80 | learning rate: 1.812E-04 | global batch size: 256 | lm loss: 2.139391E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.693 | TFLOPs: 19.46 | 31: iteration 37710/ 173500 | consumed samples: 9653760 | consumed tokens: 19770900480 | elapsed time per iteration (s): 0.82 | learning rate: 1.812E-04 | global batch size: 256 | lm loss: 2.137617E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.385 | TFLOPs: 18.78 | 31: iteration 37720/ 173500 | consumed samples: 9656320 | consumed tokens: 19776143360 | elapsed time per iteration (s): 0.74 | learning rate: 1.812E-04 | global batch size: 256 | lm loss: 2.131555E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.574 | TFLOPs: 21.03 | 31: iteration 37730/ 173500 | consumed samples: 9658880 | consumed tokens: 19781386240 | elapsed time per iteration (s): 0.78 | learning rate: 1.812E-04 | global batch size: 256 | lm loss: 2.107165E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.302 | TFLOPs: 19.86 | 31: iteration 37740/ 173500 | consumed samples: 9661440 | consumed tokens: 19786629120 | elapsed time per iteration (s): 0.77 | learning rate: 1.812E-04 | global batch size: 256 | lm loss: 2.111947E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 331.471 | TFLOPs: 20.05 | 31: iteration 37750/ 173500 | consumed samples: 9664000 | consumed tokens: 19791872000 | elapsed time per iteration (s): 0.85 | learning rate: 1.812E-04 | global batch size: 256 | lm loss: 2.103358E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.043 | TFLOPs: 18.21 | 31: iteration 37760/ 173500 | consumed samples: 9666560 | consumed tokens: 19797114880 | elapsed time per iteration (s): 0.77 | learning rate: 1.812E-04 | global batch size: 256 | lm loss: 2.097212E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.313 | TFLOPs: 20.04 | 31: iteration 37770/ 173500 | consumed samples: 9669120 | consumed tokens: 19802357760 | elapsed time per iteration (s): 0.83 | learning rate: 1.812E-04 | global batch size: 256 | lm loss: 2.096081E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.738 | TFLOPs: 18.56 | 31: iteration 37780/ 173500 | consumed samples: 9671680 | consumed tokens: 19807600640 | elapsed time per iteration (s): 0.83 | learning rate: 1.811E-04 | global batch size: 256 | lm loss: 2.105903E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.889 | TFLOPs: 18.75 | 31: iteration 37790/ 173500 | consumed samples: 9674240 | consumed tokens: 19812843520 | elapsed time per iteration (s): 0.79 | learning rate: 1.811E-04 | global batch size: 256 | lm loss: 2.094308E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.688 | TFLOPs: 19.64 | 31: iteration 37800/ 173500 | consumed samples: 9676800 | consumed tokens: 19818086400 | elapsed time per iteration (s): 0.78 | learning rate: 1.811E-04 | global batch size: 256 | lm 
loss: 2.088143E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.937 | TFLOPs: 19.84 | 31: iteration 37810/ 173500 | consumed samples: 9679360 | consumed tokens: 19823329280 | elapsed time per iteration (s): 0.77 | learning rate: 1.811E-04 | global batch size: 256 | lm loss: 2.066123E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.274 | TFLOPs: 20.16 | 31: iteration 37820/ 173500 | consumed samples: 9681920 | consumed tokens: 19828572160 | elapsed time per iteration (s): 0.76 | learning rate: 1.811E-04 | global batch size: 256 | lm loss: 2.092449E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.978 | TFLOPs: 20.33 | 31: iteration 37830/ 173500 | consumed samples: 9684480 | consumed tokens: 19833815040 | elapsed time per iteration (s): 0.75 | learning rate: 1.811E-04 | global batch size: 256 | lm loss: 2.128797E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.073 | TFLOPs: 20.69 | 31: iteration 37840/ 173500 | consumed samples: 9687040 | consumed tokens: 19839057920 | elapsed time per iteration (s): 0.72 | learning rate: 1.811E-04 | global batch size: 256 | lm loss: 2.080301E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.338 | TFLOPs: 21.38 | 31: iteration 37850/ 173500 | consumed samples: 9689600 | consumed tokens: 19844300800 | elapsed time per iteration (s): 0.75 | learning rate: 1.811E-04 | global batch size: 256 | lm loss: 2.097958E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.235 | TFLOPs: 20.58 | 31: iteration 37860/ 173500 | consumed samples: 9692160 | consumed tokens: 
19849543680 | elapsed time per iteration (s): 0.76 | learning rate: 1.811E-04 | global batch size: 256 | lm loss: 2.120009E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.345 | TFLOPs: 20.29 | 31: iteration 37870/ 173500 | consumed samples: 9694720 | consumed tokens: 19854786560 | elapsed time per iteration (s): 0.84 | learning rate: 1.810E-04 | global batch size: 256 | lm loss: 2.109564E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.621 | TFLOPs: 18.43 | 31: iteration 37880/ 173500 | consumed samples: 9697280 | consumed tokens: 19860029440 | elapsed time per iteration (s): 0.85 | learning rate: 1.810E-04 | global batch size: 256 | lm loss: 2.106279E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.146 | TFLOPs: 18.28 | 31: iteration 37890/ 173500 | consumed samples: 9699840 | consumed tokens: 19865272320 | elapsed time per iteration (s): 0.80 | learning rate: 1.810E-04 | global batch size: 256 | lm loss: 2.116068E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.831 | TFLOPs: 19.47 | 31: iteration 37900/ 173500 | consumed samples: 9702400 | consumed tokens: 19870515200 | elapsed time per iteration (s): 0.78 | learning rate: 1.810E-04 | global batch size: 256 | lm loss: 2.084107E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.739 | TFLOPs: 19.83 | 31: iteration 37910/ 173500 | consumed samples: 9704960 | consumed tokens: 19875758080 | elapsed time per iteration (s): 0.79 | learning rate: 1.810E-04 | global batch size: 256 | lm loss: 2.092194E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
325.278 | TFLOPs: 19.68 | 31: iteration 37920/ 173500 | consumed samples: 9707520 | consumed tokens: 19881000960 | elapsed time per iteration (s): 0.73 | learning rate: 1.810E-04 | global batch size: 256 | lm loss: 2.132505E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.517 | TFLOPs: 21.08 | 31: iteration 37930/ 173500 | consumed samples: 9710080 | consumed tokens: 19886243840 | elapsed time per iteration (s): 0.84 | learning rate: 1.810E-04 | global batch size: 256 | lm loss: 2.089202E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.851 | TFLOPs: 18.44 | 31: iteration 37940/ 173500 | consumed samples: 9712640 | consumed tokens: 19891486720 | elapsed time per iteration (s): 0.76 | learning rate: 1.810E-04 | global batch size: 256 | lm loss: 2.103649E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.180 | TFLOPs: 20.34 | 31: iteration 37950/ 173500 | consumed samples: 9715200 | consumed tokens: 19896729600 | elapsed time per iteration (s): 0.77 | learning rate: 1.810E-04 | global batch size: 256 | lm loss: 2.116822E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.889 | TFLOPs: 20.20 | 31: iteration 37960/ 173500 | consumed samples: 9717760 | consumed tokens: 19901972480 | elapsed time per iteration (s): 0.78 | learning rate: 1.810E-04 | global batch size: 256 | lm loss: 2.116677E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.944 | TFLOPs: 19.78 | 31: iteration 37970/ 173500 | consumed samples: 9720320 | consumed tokens: 19907215360 | elapsed time per iteration (s): 0.75 | learning rate: 1.809E-04 | global batch size: 256 | lm loss: 2.104176E+00 | grad norm: 0.130 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.964 | TFLOPs: 20.63 | 31: iteration 37980/ 173500 | consumed samples: 9722880 | consumed tokens: 19912458240 | elapsed time per iteration (s): 0.74 | learning rate: 1.809E-04 | global batch size: 256 | lm loss: 2.092371E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.547 | TFLOPs: 21.03 | 31: iteration 37990/ 173500 | consumed samples: 9725440 | consumed tokens: 19917701120 | elapsed time per iteration (s): 0.74 | learning rate: 1.809E-04 | global batch size: 256 | lm loss: 2.109027E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.285 | TFLOPs: 20.95 | 0: [2022-11-26 02:37:56,077] [INFO] [logging.py:68:log_dist] [Rank 0] step=38000, skipped=0, lr=[0.00018091754328052937, 0.00018091754328052937, 0.00018091754328052937], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 38000/ 173500 | consumed samples: 9728000 | consumed tokens: 19922944000 | elapsed time per iteration (s): 0.77 | learning rate: 1.809E-04 | global batch size: 256 | lm loss: 2.105015E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.544 | TFLOPs: 20.12 | 0: steps: 38000 loss: 2.1812 iter time (s): 0.804 samples/sec: 318.388 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 38000 | lm loss value: 2.113086E+00 | lm loss PPL: 8.273735E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 38000 to checkpoints_1b1long 0: [2022-11-26 02:37:56,353] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step38000 is begin to save! 
0: [2022-11-26 02:37:56,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_01-model_00-model_states.pt... 0: [2022-11-26 02:37:56,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_01-model_00-model_states.pt. 0: [2022-11-26 02:37:56,573] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_03-model_00-model_states.pt... 0: [2022-11-26 02:37:56,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_03-model_00-model_states.pt. 0: [2022-11-26 02:37:56,651] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_04-model_00-model_states.pt... 0: [2022-11-26 02:37:56,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_04-model_00-model_states.pt. 0: [2022-11-26 02:37:56,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_05-model_00-model_states.pt... 0: [2022-11-26 02:37:56,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_05-model_00-model_states.pt. 0: [2022-11-26 02:37:56,800] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_06-model_00-model_states.pt... 0: [2022-11-26 02:37:56,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_06-model_00-model_states.pt. 0: [2022-11-26 02:37:56,875] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_07-model_00-model_states.pt... 0: [2022-11-26 02:37:56,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 02:37:56,951] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_08-model_00-model_states.pt... 0: [2022-11-26 02:37:57,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_08-model_00-model_states.pt. 0: [2022-11-26 02:37:57,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_09-model_00-model_states.pt... 0: [2022-11-26 02:37:57,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_09-model_00-model_states.pt. 0: [2022-11-26 02:37:57,103] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_10-model_00-model_states.pt... 0: [2022-11-26 02:37:57,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_10-model_00-model_states.pt. 0: [2022-11-26 02:37:57,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_11-model_00-model_states.pt... 0: [2022-11-26 02:37:57,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_11-model_00-model_states.pt. 0: [2022-11-26 02:37:57,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_12-model_00-model_states.pt... 0: [2022-11-26 02:37:57,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_12-model_00-model_states.pt. 0: [2022-11-26 02:37:57,330] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_13-model_00-model_states.pt... 0: [2022-11-26 02:37:57,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 02:37:57,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_14-model_00-model_states.pt... 0: [2022-11-26 02:37:57,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_14-model_00-model_states.pt. 0: [2022-11-26 02:37:57,481] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_15-model_00-model_states.pt... 0: [2022-11-26 02:37:57,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_15-model_00-model_states.pt. 0: [2022-11-26 02:37:57,557] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_16-model_00-model_states.pt... 0: [2022-11-26 02:37:57,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_16-model_00-model_states.pt. 0: [2022-11-26 02:37:57,631] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_17-model_00-model_states.pt... 0: [2022-11-26 02:37:57,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_17-model_00-model_states.pt. 0: [2022-11-26 02:37:57,709] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_18-model_00-model_states.pt... 0: [2022-11-26 02:37:57,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_18-model_00-model_states.pt. 0: [2022-11-26 02:37:57,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_19-model_00-model_states.pt... 0: [2022-11-26 02:37:57,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 02:37:57,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_20-model_00-model_states.pt... 0: [2022-11-26 02:37:57,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_20-model_00-model_states.pt. 0: [2022-11-26 02:37:57,933] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_21-model_00-model_states.pt... 0: [2022-11-26 02:37:58,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_21-model_00-model_states.pt. 0: [2022-11-26 02:37:58,009] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_22-model_00-model_states.pt... 0: [2022-11-26 02:37:58,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_22-model_00-model_states.pt. 0: [2022-11-26 02:37:58,085] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_23-model_00-model_states.pt... 0: [2022-11-26 02:37:58,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_23-model_00-model_states.pt. 0: [2022-11-26 02:37:58,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_24-model_00-model_states.pt... 0: [2022-11-26 02:37:58,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_24-model_00-model_states.pt. 0: [2022-11-26 02:37:58,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_25-model_00-model_states.pt... 0: [2022-11-26 02:37:58,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 02:37:58,314] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_26-model_00-model_states.pt... 0: [2022-11-26 02:37:58,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_26-model_00-model_states.pt. 0: [2022-11-26 02:37:58,389] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_27-model_00-model_states.pt... 0: [2022-11-26 02:37:58,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_27-model_00-model_states.pt. 0: [2022-11-26 02:37:58,465] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_28-model_00-model_states.pt... 0: [2022-11-26 02:37:58,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_28-model_00-model_states.pt. 0: [2022-11-26 02:37:58,540] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/layer_30-model_00-model_states.pt... 0: [2022-11-26 02:37:58,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/layer_30-model_00-model_states.pt. 0: [2022-11-26 02:37:58,543] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step38000/mp_rank_00_model_states.pt 0: [2022-11-26 02:37:58,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/mp_rank_00_model_states.pt... 0: [2022-11-26 02:37:58,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/mp_rank_00_model_states.pt. 0: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
0: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
4: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 
16: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
15: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
11: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 
31: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 
17: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 
19: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
9: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
13: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
25: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 
24: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 
21: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 
5: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 
16: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
12: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 
14: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 
21: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 
7: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
20: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 
17: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
8: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 
27: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 
8: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:37:58,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:37:58,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 02:37:58,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 02:37:58,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 25: [2022-11-26 02:37:58,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 02:37:58,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 02:37:58,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 20: [2022-11-26 02:37:58,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 02:37:58,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 02:37:58,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 4: [2022-11-26 02:37:58,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 02:37:58,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 02:37:58,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 26: [2022-11-26 02:37:58,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 02:37:58,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 02:37:58,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 19: [2022-11-26 02:37:58,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:37:58,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 02:37:58,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 16: [2022-11-26 02:37:58,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:37:58,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 02:37:58,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 4: [2022-11-26 02:37:58,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 02:37:58,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 02:37:58,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 10: [2022-11-26 02:37:58,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:37:58,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:37:58,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 21: [2022-11-26 02:37:58,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 10: [2022-11-26 02:37:58,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 21: [2022-11-26 02:37:58,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 16: [2022-11-26 02:37:58,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:37:58,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 02:37:58,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 27: [2022-11-26 02:37:58,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
27: [2022-11-26 02:37:58,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 02:37:58,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 23: [2022-11-26 02:37:58,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:37:58,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 02:37:58,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 19: [2022-11-26 02:37:58,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:37:58,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 02:37:58,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 23: [2022-11-26 02:37:58,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:37:58,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 02:37:58,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 7: [2022-11-26 02:37:58,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 02:37:58,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 02:37:58,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 7: [2022-11-26 02:37:58,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:37:58,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 02:37:58,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 25: [2022-11-26 02:37:58,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:37:58,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 25: [2022-11-26 02:37:58,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 12: [2022-11-26 02:37:58,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 02:37:58,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 25: [2022-11-26 02:37:58,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 15: [2022-11-26 02:37:58,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 02:37:58,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 02:37:58,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 20: [2022-11-26 02:37:58,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 02:37:58,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 02:37:58,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 12: [2022-11-26 02:37:58,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:37:58,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 02:37:58,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 1: [2022-11-26 02:37:58,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:37:58,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:37:58,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 02:37:58,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
1: [2022-11-26 02:37:58,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 02:37:58,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:37:58,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 7: [2022-11-26 02:37:58,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:37:58,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 1: [2022-11-26 02:37:58,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 7: [2022-11-26 02:37:58,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 1: [2022-11-26 02:37:58,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 1: [2022-11-26 02:37:58,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:37:58,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 6: [2022-11-26 02:37:58,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 02:37:58,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 1: [2022-11-26 02:37:58,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 6: [2022-11-26 02:37:58,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 9: [2022-11-26 02:37:58,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:37:58,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:37:58,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:37:58,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 02:37:58,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 02:37:58,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 9: [2022-11-26 02:37:58,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 24: [2022-11-26 02:37:58,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 27: [2022-11-26 02:37:58,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
10: [2022-11-26 02:37:58,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:37:58,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 27: [2022-11-26 02:37:58,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 02:37:58,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 10: [2022-11-26 02:37:58,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 02:37:58,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 15: [2022-11-26 02:37:58,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:37:58,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 02:37:58,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 6: [2022-11-26 02:37:58,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:37:58,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 02:37:58,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
4: [2022-11-26 02:37:58,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:37:58,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 02:37:58,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 18: [2022-11-26 02:37:58,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 02:37:58,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 02:37:58,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 27: [2022-11-26 02:37:58,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
18: [2022-11-26 02:37:58,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 02:37:58,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 27: [2022-11-26 02:37:58,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 18: [2022-11-26 02:37:58,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 27: [2022-11-26 02:37:58,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 18: [2022-11-26 02:37:58,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 18: [2022-11-26 02:37:58,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 18: [2022-11-26 02:37:58,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 20: [2022-11-26 02:37:58,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 02:37:58,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 02:37:58,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 19: [2022-11-26 02:37:58,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
20: [2022-11-26 02:37:58,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:37:58,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 02:37:58,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 20: [2022-11-26 02:37:58,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 02:37:58,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 31: [2022-11-26 02:37:58,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 27: [2022-11-26 02:37:58,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 02:37:58,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 02:37:58,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 23: [2022-11-26 02:37:58,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
31: [2022-11-26 02:37:58,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 23: [2022-11-26 02:37:58,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 02:37:58,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 31: [2022-11-26 02:37:58,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 31: [2022-11-26 02:37:58,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 02:37:58,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 02:37:58,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 11: [2022-11-26 02:37:58,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:37:58,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 02:37:58,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 1: [2022-11-26 02:37:58,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-26 02:37:58,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 02:37:58,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 7: [2022-11-26 02:37:58,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:37:58,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 02:37:58,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 12: [2022-11-26 02:37:58,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:37:58,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 02:37:58,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 16: [2022-11-26 02:37:58,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:37:58,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 02:37:58,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 26: [2022-11-26 02:37:58,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
26: [2022-11-26 02:37:58,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 02:37:58,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 12: [2022-11-26 02:37:58,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:37:58,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 02:37:58,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 4: [2022-11-26 02:37:58,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:37:58,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 02:37:58,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 26: [2022-11-26 02:37:58,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:37:58,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
26: [2022-11-26 02:37:58,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 24: [2022-11-26 02:37:58,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 02:37:58,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 26: [2022-11-26 02:37:58,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 30: [2022-11-26 02:37:58,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:37:58,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:37:58,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:37:58,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:37:58,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 
30: [2022-11-26 02:37:58,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 02:37:58,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 19: [2022-11-26 02:37:58,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 30: [2022-11-26 02:37:58,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 19: [2022-11-26 02:37:58,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 30: [2022-11-26 02:37:58,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 30: [2022-11-26 02:37:58,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 30: [2022-11-26 02:37:58,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 02:37:58,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 30: [2022-11-26 02:37:58,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 30: [2022-11-26 02:37:58,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-26 02:37:58,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 02:37:58,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 10: [2022-11-26 02:37:58,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:37:58,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 02:37:58,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 25: [2022-11-26 02:37:58,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:37:58,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:37:58,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 02:37:58,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 25: [2022-11-26 02:37:58,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 16: [2022-11-26 02:37:58,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-26 02:37:58,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 25: [2022-11-26 02:37:58,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 02:37:58,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 25: [2022-11-26 02:37:58,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 16: [2022-11-26 02:37:58,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 25: [2022-11-26 02:37:58,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 26: [2022-11-26 02:37:58,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 02:37:58,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 02:37:58,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 02:37:58,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 26: [2022-11-26 02:37:58,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 02:37:58,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
12: [2022-11-26 02:37:58,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:37:58,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 02:37:58,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 21: [2022-11-26 02:37:58,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:37:58,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 02:37:58,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 1: [2022-11-26 02:37:58,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:37:58,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 02:37:58,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 15: [2022-11-26 02:37:58,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 02:37:58,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 27: [2022-11-26 02:37:58,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:37:58,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 27: [2022-11-26 02:37:58,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 02:37:58,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 7: [2022-11-26 02:37:58,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:37:58,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 02:37:58,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 16: [2022-11-26 02:37:58,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:37:58,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 02:37:58,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 10: [2022-11-26 02:37:58,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-26 02:37:58,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 25: [2022-11-26 02:37:58,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:37:58,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 25: [2022-11-26 02:37:58,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 02:37:58,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 24: [2022-11-26 02:37:58,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:37:58,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 02:37:58,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 22: [2022-11-26 02:37:58,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:37:58,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 02:37:58,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 22: [2022-11-26 02:37:58,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
9: [2022-11-26 02:37:58,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:37:58,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 9: [2022-11-26 02:37:58,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 22: [2022-11-26 02:37:58,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 22: [2022-11-26 02:37:58,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:37:58,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:37:58,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 22: [2022-11-26 02:37:58,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 4: [2022-11-26 02:37:58,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 22: [2022-11-26 02:37:58,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 4: [2022-11-26 02:37:58,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 22: [2022-11-26 02:37:58,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-26 02:37:58,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 02:37:58,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 20: [2022-11-26 02:37:58,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:37:58,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:37:58,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 20: [2022-11-26 02:37:58,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 15: [2022-11-26 02:37:58,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 20: [2022-11-26 02:37:58,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 15: [2022-11-26 02:37:58,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:37:58,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 02:37:58,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 31: [2022-11-26 02:37:58,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
31: [2022-11-26 02:37:58,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 02:37:58,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 20: [2022-11-26 02:37:58,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:37:58,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 20: [2022-11-26 02:37:58,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 21: [2022-11-26 02:37:58,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 20: [2022-11-26 02:37:58,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 21: [2022-11-26 02:37:58,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 21: [2022-11-26 02:37:58,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:37:58,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
21: [2022-11-26 02:37:58,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 19: [2022-11-26 02:37:58,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 21: [2022-11-26 02:37:58,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:37:58,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 19: [2022-11-26 02:37:58,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 21: [2022-11-26 02:37:58,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 31: [2022-11-26 02:37:58,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 02:37:58,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:37:58,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 31: [2022-11-26 02:37:58,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 02:37:58,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 02:37:58,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
31: [2022-11-26 02:37:58,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 23: [2022-11-26 02:37:58,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:37:58,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:37:58,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 02:37:58,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 02:37:58,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 23: [2022-11-26 02:37:58,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 22: [2022-11-26 02:37:58,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:37:58,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 02:37:58,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 0: [2022-11-26 02:37:58,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:37:58,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 02:37:58,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:37:58,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:37:58,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 02:37:58,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 02:37:58,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 02:37:58,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 02:37:58,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 0: [2022-11-26 02:37:58,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 0: [2022-11-26 02:37:58,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 0: [2022-11-26 02:37:58,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 0: [2022-11-26 02:37:58,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-26 02:37:58,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 02:37:58,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 24: [2022-11-26 02:37:58,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:37:58,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 02:37:58,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 6: [2022-11-26 02:37:58,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:37:58,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 02:37:58,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 5: [2022-11-26 02:37:58,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:37:58,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:37:58,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 02:37:58,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
5: [2022-11-26 02:37:58,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 02:37:58,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 5: [2022-11-26 02:37:58,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:37:58,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 02:37:58,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 5: [2022-11-26 02:37:58,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:37:58,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:37:58,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 02:37:58,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 5: [2022-11-26 02:37:58,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
19: [2022-11-26 02:37:58,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 5: [2022-11-26 02:37:58,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 19: [2022-11-26 02:37:58,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 5: [2022-11-26 02:37:58,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 18: [2022-11-26 02:37:58,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:37:58,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 18: [2022-11-26 02:37:58,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 5: [2022-11-26 02:37:58,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 18: [2022-11-26 02:37:58,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 5: [2022-11-26 02:37:58,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 5: [2022-11-26 02:37:58,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-26 02:37:58,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 02:37:58,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 9: [2022-11-26 02:37:58,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:37:58,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:37:58,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 02:37:58,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 02:37:58,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 9: [2022-11-26 02:37:58,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 24: [2022-11-26 02:37:58,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:37:58,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 02:37:58,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 11: [2022-11-26 02:37:58,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-26 02:37:58,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 02:37:58,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 11: [2022-11-26 02:37:58,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:37:58,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 02:37:58,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 11: [2022-11-26 02:37:58,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:37:58,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 02:37:58,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 11: [2022-11-26 02:37:58,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:37:58,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 02:37:58,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 3: [2022-11-26 02:37:58,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 02:37:58,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:37:58,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:37:58,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:37:58,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:37:58,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 02:37:58,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 02:37:58,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 02:37:58,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 02:37:58,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 02:37:58,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 3: [2022-11-26 02:37:58,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
3: [2022-11-26 02:37:58,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 3: [2022-11-26 02:37:58,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 3: [2022-11-26 02:37:58,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 29: [2022-11-26 02:37:58,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 02:37:58,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 02:37:58,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 02:37:58,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 02:37:58,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 02:37:58,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 02:37:58,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 02:37:58,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 02:37:58,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
29: [2022-11-26 02:37:58,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 02:37:58,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 02:37:58,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 29: [2022-11-26 02:37:58,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 29: [2022-11-26 02:37:58,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 29: [2022-11-26 02:37:58,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 14: [2022-11-26 02:37:58,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:37:58,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:37:58,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:37:58,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:37:58,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-26 02:37:58,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 02:37:58,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 02:37:58,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 02:37:58,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 02:37:58,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 02:37:58,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 14: [2022-11-26 02:37:58,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 14: [2022-11-26 02:37:58,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 14: [2022-11-26 02:37:58,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 14: [2022-11-26 02:37:58,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 1: [2022-11-26 02:37:58,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 02:37:58,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 02:37:58,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 18: [2022-11-26 02:37:58,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 02:37:58,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 02:37:58,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 11: [2022-11-26 02:37:58,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:37:58,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 02:37:58,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 4: [2022-11-26 02:37:58,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:37:58,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 02:37:58,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 16: [2022-11-26 02:37:58,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 
26: [2022-11-26 02:37:58,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:37:58,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 02:37:58,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 26: [2022-11-26 02:37:58,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 02:37:58,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 10: [2022-11-26 02:37:58,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:37:58,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 02:37:58,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 29: [2022-11-26 02:37:58,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 02:37:58,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 02:37:58,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 27: [2022-11-26 02:37:58,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
27: [2022-11-26 02:37:58,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 02:37:58,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 0: [2022-11-26 02:37:58,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:37:58,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 02:37:58,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 23: [2022-11-26 02:37:58,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:37:58,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 02:37:58,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 20: [2022-11-26 02:37:58,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 25: [2022-11-26 02:37:58,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
20: [2022-11-26 02:37:58,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 25: [2022-11-26 02:37:58,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 20: [2022-11-26 02:37:58,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 25: [2022-11-26 02:37:58,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 30: [2022-11-26 02:37:58,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:37:58,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 02:37:58,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 12: [2022-11-26 02:37:58,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 31: [2022-11-26 02:37:58,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:37:58,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
31: [2022-11-26 02:37:58,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 12: [2022-11-26 02:37:58,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 21: [2022-11-26 02:37:58,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 12: [2022-11-26 02:37:58,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 31: [2022-11-26 02:37:58,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 21: [2022-11-26 02:37:58,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 3: [2022-11-26 02:37:58,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:37:58,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 02:37:58,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 22: [2022-11-26 02:37:58,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:37:58,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 02:37:58,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
9: [2022-11-26 02:37:58,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:37:58,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 02:37:58,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 15: [2022-11-26 02:37:58,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:37:58,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 02:37:58,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 7: [2022-11-26 02:37:58,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:37:58,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 02:37:58,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 14: [2022-11-26 02:37:58,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:37:58,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 02:37:58,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
6: [2022-11-26 02:37:58,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:37:58,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 02:37:58,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 24: [2022-11-26 02:37:58,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:37:58,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 02:37:58,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 5: [2022-11-26 02:37:58,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:37:58,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 02:37:58,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 19: [2022-11-26 02:37:58,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:37:58,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 02:37:58,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
1: [2022-11-26 02:37:58,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:37:58,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 02:37:58,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 4: [2022-11-26 02:37:58,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:37:58,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 02:37:58,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 18: [2022-11-26 02:37:58,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 02:37:58,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 02:37:58,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 11: [2022-11-26 02:37:58,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:37:58,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 02:37:58,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
23: [2022-11-26 02:37:58,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:37:58,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 02:37:58,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 29: [2022-11-26 02:37:58,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 02:37:58,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 02:37:58,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 26: [2022-11-26 02:37:58,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 02:37:58,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 02:37:58,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 16: [2022-11-26 02:37:58,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:37:58,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 02:37:58,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
27: [2022-11-26 02:37:58,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 02:37:58,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 02:37:58,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 31: [2022-11-26 02:37:58,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 02:37:58,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 02:37:58,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 25: [2022-11-26 02:37:58,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 02:37:58,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 02:37:58,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 21: [2022-11-26 02:37:58,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:37:58,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 02:37:58,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
5: [2022-11-26 02:37:58,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:37:58,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 02:37:58,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 12: [2022-11-26 02:37:58,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:37:58,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 02:37:58,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 20: [2022-11-26 02:37:58,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 02:37:58,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 02:37:58,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 22: [2022-11-26 02:37:58,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:37:58,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 02:37:58,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
7: [2022-11-26 02:37:58,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:37:58,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 02:37:58,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 4: [2022-11-26 02:37:58,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:37:58,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:37:58,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 02:37:58,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 24: [2022-11-26 02:37:58,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:37:58,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 02:37:58,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 24: [2022-11-26 02:37:58,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 02:37:58,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
22: [2022-11-26 02:37:58,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:37:58,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 02:37:58,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 9: [2022-11-26 02:37:58,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:37:58,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:37:58,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 14: [2022-11-26 02:37:58,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:37:58,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:37:58,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
14: [2022-11-26 02:37:58,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 30: [2022-11-26 02:37:58,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 14: [2022-11-26 02:37:58,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 31: [2022-11-26 02:37:58,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 29: [2022-11-26 02:37:58,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:37:58,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 29: [2022-11-26 02:37:58,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 3: [2022-11-26 02:37:58,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 31: [2022-11-26 02:37:58,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 29: [2022-11-26 02:37:58,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 3: [2022-11-26 02:37:58,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 31: [2022-11-26 02:37:58,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
10: [2022-11-26 02:37:58,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:37:58,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 10: [2022-11-26 02:37:58,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 02:37:58,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 10: [2022-11-26 02:37:58,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:37:58,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 02:37:58,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 16: [2022-11-26 02:37:58,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 27: [2022-11-26 02:37:58,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:37:58,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 02:37:58,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
27: [2022-11-26 02:37:58,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 02:37:58,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 6: [2022-11-26 02:37:58,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:37:58,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 02:37:58,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 30: [2022-11-26 02:37:58,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:37:58,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 02:37:58,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 26: [2022-11-26 02:37:58,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 02:37:58,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 02:37:58,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 24: [2022-11-26 02:37:58,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-26 02:37:58,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 02:37:58,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 23: [2022-11-26 02:37:58,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:37:58,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:37:58,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 18: [2022-11-26 02:37:58,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:37:58,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 21: [2022-11-26 02:37:58,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 23: [2022-11-26 02:37:58,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 21: [2022-11-26 02:37:58,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
18: [2022-11-26 02:37:58,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 1: [2022-11-26 02:37:58,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 02:37:58,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 18: [2022-11-26 02:37:58,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 25: [2022-11-26 02:37:58,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:37:58,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:37:58,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 02:37:58,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 25: [2022-11-26 02:37:58,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 02:37:58,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 15: [2022-11-26 02:37:58,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:37:58,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-26 02:37:58,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 02:37:58,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 14: [2022-11-26 02:37:58,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:37:58,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 15: [2022-11-26 02:37:58,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 14: [2022-11-26 02:37:58,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 02:37:58,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 3: [2022-11-26 02:37:58,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:37:58,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 02:37:58,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 6: [2022-11-26 02:37:58,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 02:37:58,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 02:37:58,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 12: [2022-11-26 02:37:58,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:37:58,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 02:37:58,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 7: [2022-11-26 02:37:58,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:37:58,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 02:37:58,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 9: [2022-11-26 02:37:58,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:37:58,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 02:37:58,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 0: [2022-11-26 02:37:58,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 02:37:58,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 02:37:58,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 0: [2022-11-26 02:37:58,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 02:37:58,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 28: [2022-11-26 02:37:58,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:37:58,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:37:58,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:37:58,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:37:58,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-26 02:37:58,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 02:37:58,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 02:37:58,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 02:37:58,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 17: [2022-11-26 02:37:58,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 17: [2022-11-26 02:37:58,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 17: [2022-11-26 02:37:58,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 02:37:58,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 28: [2022-11-26 02:37:58,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 02:37:58,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 28: [2022-11-26 02:37:58,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
28: [2022-11-26 02:37:58,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 02:37:58,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 28: [2022-11-26 02:37:58,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 02:37:58,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 02:37:58,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 28: [2022-11-26 02:37:58,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 02:37:58,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 02:37:58,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 28: [2022-11-26 02:37:58,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 02:37:58,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 02:37:58,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 28: [2022-11-26 02:37:58,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-26 02:37:58,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 02:37:58,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 02:37:58,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 02:37:58,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 02:37:58,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 02:37:58,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 28: [2022-11-26 02:37:58,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 28: [2022-11-26 02:37:58,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 17: [2022-11-26 02:37:58,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:37:58,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 02:37:58,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 17: [2022-11-26 02:37:58,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 
17: [2022-11-26 02:37:58,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:37:58,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:37:58,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 02:37:58,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 02:37:58,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 17: [2022-11-26 02:37:58,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 02:37:58,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 17: [2022-11-26 02:37:58,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 8: [2022-11-26 02:37:58,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:37:58,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 02:37:58,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 8: [2022-11-26 02:37:58,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 02:37:58,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 02:37:58,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 8: [2022-11-26 02:37:58,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:37:58,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 02:37:58,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 8: [2022-11-26 02:37:58,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:37:58,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 02:37:58,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 8: [2022-11-26 02:37:58,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:37:58,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 02:37:58,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 2: [2022-11-26 02:37:58,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 02:37:58,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:37:58,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:37:58,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:37:58,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:37:58,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:37:58,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-26 02:37:58,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 02:37:58,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 02:37:58,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 02:37:58,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 02:37:58,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 02:37:58,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 02:37:58,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 2: [2022-11-26 02:37:58,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 02:37:58,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 2: [2022-11-26 02:37:58,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 2: [2022-11-26 02:37:58,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 2: [2022-11-26 02:37:58,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
2: [2022-11-26 02:37:58,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 2: [2022-11-26 02:37:58,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 2: [2022-11-26 02:37:58,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:37:58,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 02:37:58,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 8: [2022-11-26 02:37:58,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:37:58,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:37:58,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:37:58,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 02:37:58,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 02:37:58,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 02:37:58,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
8: [2022-11-26 02:37:58,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 8: [2022-11-26 02:37:58,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 13: [2022-11-26 02:37:58,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:37:58,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 02:37:58,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 13: [2022-11-26 02:37:58,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:37:58,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 02:37:58,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 13: [2022-11-26 02:37:58,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:37:58,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 02:37:58,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 02:37:58,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 02:37:58,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 13: [2022-11-26 02:37:58,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 13: [2022-11-26 02:37:58,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:37:58,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:37:58,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:37:58,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 02:37:58,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 02:37:58,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 02:37:58,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 02:37:58,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 13: [2022-11-26 02:37:58,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 13: [2022-11-26 02:37:58,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step38000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 02:37:58,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 13: [2022-11-26 02:37:58,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
0: successfully saved checkpoint at iteration 38000 to checkpoints_1b1long
31: time (ms) | save-checkpoint: 2637.87
31: iteration 38010/ 173500 | consumed samples: 9730560 | consumed tokens: 19928186880 | elapsed time per iteration (s): 1.02 | learning rate: 1.809E-04 | global batch size: 256 | lm loss: 2.101484E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.545 | TFLOPs: 15.22 |
31: iteration 38020/ 173500 | consumed samples: 9733120 | consumed tokens: 19933429760 | elapsed time per iteration (s): 0.77 | learning rate: 1.809E-04 | global batch size: 256 | lm loss: 2.062349E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.644 | TFLOPs: 20.06 |
31: iteration 38030/ 173500 | consumed samples: 9735680 | consumed tokens: 19938672640 | elapsed time per iteration (s): 0.76 | learning rate: 1.809E-04 | global batch size: 256 | lm loss: 2.102841E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.199 | TFLOPs: 20.46 |
31: iteration 38040/ 173500 | consumed samples: 9738240 | consumed tokens: 19943915520 | elapsed time per iteration (s): 0.75 | learning rate: 1.809E-04 | global batch size: 256 | lm loss: 2.073189E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.520 | TFLOPs: 20.54 |
31: iteration 38050/ 173500 | consumed samples: 9740800 | consumed tokens: 19949158400 | elapsed time per iteration (s): 0.75 | learning rate: 1.809E-04 | global batch size: 256 | lm loss: 2.110745E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.338 | TFLOPs: 20.59 |
31: iteration 38060/ 173500 | consumed samples: 9743360 | consumed tokens: 19954401280 | elapsed time per iteration (s): 0.80 | learning rate: 1.809E-04 | global batch size: 256 | lm loss: 2.101810E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.335 | TFLOPs: 19.26 |
31: iteration 38070/ 173500 | consumed samples: 9745920 | consumed tokens: 19959644160 | elapsed time per iteration (s): 0.81 | learning rate: 1.808E-04 | global batch size: 256 | lm loss: 2.090150E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.054 | TFLOPs: 19.18 |
31: iteration 38080/ 173500 | consumed samples: 9748480 | consumed tokens: 19964887040 | elapsed time per iteration (s): 0.81 | learning rate: 1.808E-04 | global batch size: 256 | lm loss: 2.088945E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.207 | TFLOPs: 19.07 |
31: iteration 38090/ 173500 | consumed samples: 9751040 | consumed tokens: 19970129920 | elapsed time per iteration (s): 0.85 | learning rate: 1.808E-04 | global batch size: 256 | lm loss: 2.088022E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.992 | TFLOPs: 18.15 |
31: iteration 38100/ 173500 | consumed samples: 9753600 | consumed tokens: 19975372800 | elapsed time per iteration (s): 0.82 | learning rate: 1.808E-04 | global batch size: 256 | lm loss: 2.094992E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.623 | TFLOPs: 18.97 |
31: iteration 38110/ 173500 | consumed samples: 9756160 | consumed tokens: 19980615680 | elapsed time per iteration (s): 0.80 | learning rate: 1.808E-04 | global batch size: 256 | lm loss: 2.108662E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.206 | TFLOPs: 19.31 |
31: iteration 38120/ 173500 | consumed samples: 9758720 | consumed tokens: 19985858560 | elapsed time per iteration (s): 0.86 | learning rate: 1.808E-04 | global batch size: 256 | lm loss: 2.113453E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.440 | TFLOPs: 17.93 |
31: iteration 38130/ 173500 | consumed samples: 9761280 | consumed tokens: 19991101440 | elapsed time per iteration (s): 0.88 | learning rate: 1.808E-04 | global batch size: 256 | lm loss: 2.119046E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.451 | TFLOPs: 17.69 |
31: iteration 38140/ 173500 | consumed samples: 9763840 | consumed tokens: 19996344320 | elapsed time per iteration (s): 0.82 | learning rate: 1.808E-04 | global batch size: 256 | lm loss: 2.123775E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.557 | TFLOPs: 18.91 |
31: iteration 38150/ 173500 | consumed samples: 9766400 | consumed tokens: 20001587200 | elapsed time per iteration (s): 0.81 | learning rate: 1.808E-04 | global batch size: 256 | lm loss: 2.110278E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.680 | TFLOPs: 19.16 |
31: iteration 38160/ 173500 | consumed samples: 9768960 | consumed tokens: 20006830080 | elapsed time per iteration (s): 0.82 | learning rate: 1.808E-04 | global batch size: 256 | lm loss: 2.109589E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.258 | TFLOPs: 18.95 |
31: iteration 38170/ 173500 | consumed samples: 9771520 | consumed tokens: 20012072960 | elapsed time per iteration (s): 0.81 | learning rate: 1.807E-04 | global batch size: 256 | lm loss: 2.135937E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.715 | TFLOPs: 19.10 |
31: iteration 38180/ 173500 | consumed samples: 9774080 | consumed tokens: 20017315840 | elapsed time per iteration (s): 0.79 | learning rate: 1.807E-04 | global batch size: 256 | lm loss: 2.108797E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.006 | TFLOPs: 19.60 |
31: iteration 38190/ 173500 | consumed samples: 9776640 | consumed tokens: 20022558720 | elapsed time per iteration (s): 0.83 | learning rate: 1.807E-04 | global batch size: 256 | lm loss: 2.114519E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.041 | TFLOPs: 18.64 |
31: iteration 38200/ 173500 | consumed samples: 9779200 | consumed tokens: 20027801600 | elapsed time per iteration (s): 0.85 | learning rate: 1.807E-04 | global batch size: 256 | lm loss: 2.102143E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.071 | TFLOPs: 18.27 |
31: iteration 38210/ 173500 | consumed samples: 9781760 | consumed tokens: 20033044480 | elapsed time per iteration (s): 0.79 | learning rate: 1.807E-04 | global batch size: 256 | lm loss: 2.124899E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.379 | TFLOPs: 19.50 |
31: iteration 38220/ 173500 | consumed samples: 9784320 | consumed tokens: 20038287360 | elapsed time per iteration (s): 0.76 | learning rate: 1.807E-04 | global batch size: 256 | lm loss: 2.092057E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.706 | TFLOPs: 20.25 |
31: iteration 38230/ 173500 | consumed samples: 9786880 | consumed tokens: 20043530240 | elapsed time per iteration (s): 0.82 | learning rate: 1.807E-04 | global batch size: 256 | lm loss: 2.101826E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.369 | TFLOPs: 18.84 |
31: iteration 38240/ 173500 | consumed samples: 9789440 | consumed tokens: 20048773120 | elapsed time per iteration (s): 0.77 | learning rate: 1.807E-04 | global batch size: 256 | lm loss: 2.146362E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.418 | TFLOPs: 19.99 |
31: iteration 38250/ 173500 | consumed samples: 9792000 | consumed tokens: 20054016000 | elapsed time per iteration (s): 0.77 | learning rate: 1.807E-04 | global batch size: 256 | lm loss: 2.119091E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.326 | TFLOPs: 20.04 |
31: iteration 38260/ 173500 | consumed samples: 9794560 | consumed tokens: 20059258880 | elapsed time per iteration (s): 0.77 | learning rate: 1.807E-04 | global batch size: 256 | lm loss: 2.080740E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.594 | TFLOPs: 20.12 |
31: iteration 38270/ 173500 | consumed samples: 9797120 | consumed tokens: 20064501760 | elapsed time per iteration (s): 0.78 | learning rate: 1.806E-04 | global batch size: 256 | lm loss: 2.113282E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.978 | TFLOPs: 19.96 |
31: iteration 38280/ 173500 | consumed samples: 9799680 | consumed tokens: 20069744640 | elapsed time per iteration (s): 0.85 | learning rate: 1.806E-04 | global batch size: 256 | lm loss: 2.095952E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.533 | TFLOPs: 18.30 |
31: iteration 38290/ 173500 | consumed samples: 9802240 | consumed tokens: 20074987520 | elapsed time per iteration (s): 0.75 | learning rate: 1.806E-04 | global batch size: 256 | lm loss: 2.108039E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.383 | TFLOPs: 20.71 |
31: iteration 38300/ 173500 | consumed samples: 9804800 | consumed tokens: 20080230400 | elapsed time per iteration (s): 0.83 | learning rate: 1.806E-04 | global batch size: 256 | lm loss: 2.094864E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.446 | TFLOPs: 18.60 |
31: iteration 38310/ 173500 | consumed samples: 9807360 | consumed tokens: 20085473280 | elapsed time per iteration (s): 0.77 | learning rate: 1.806E-04 | global batch size: 256 | lm loss: 2.113697E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.076 | TFLOPs: 20.21 |
31: iteration 38320/ 173500 | consumed samples: 9809920 | consumed tokens: 20090716160 | elapsed time per iteration (s): 0.77 | learning rate: 1.806E-04 | global batch size: 256 | lm loss: 2.092035E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.112 | TFLOPs: 20.21 |
31: iteration 38330/ 173500 | consumed samples: 9812480 | consumed tokens: 20095959040 | elapsed time per iteration (s): 0.79 | learning rate: 1.806E-04 | global batch size: 256 | lm loss: 2.063465E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.608 | TFLOPs: 19.64 |
31: iteration 38340/ 173500 | consumed samples: 9815040 | consumed tokens: 20101201920 | elapsed time per iteration (s): 0.76 | learning rate: 1.806E-04 | global batch size: 256 | lm loss: 2.096558E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.176 | TFLOPs: 20.40 |
31: iteration 38350/ 173500 | consumed samples: 9817600 | consumed tokens: 20106444800 | elapsed time per iteration (s): 0.75 | learning rate: 1.806E-04 | global batch size: 256 | lm loss: 2.114883E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.133 | TFLOPs: 20.70 |
31: iteration 38360/ 173500 | consumed samples: 9820160 | consumed tokens: 20111687680 | elapsed time per iteration (s): 0.78 | learning rate: 1.806E-04 | global batch size: 256 | lm loss: 2.096591E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.895 | TFLOPs: 19.90 |
31: iteration 38370/ 173500 | consumed samples: 9822720 | consumed tokens: 20116930560 | elapsed time per iteration (s): 0.76 | learning rate: 1.805E-04 | global batch size: 256 | lm loss: 2.084262E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.686 | TFLOPs: 20.37 |
31: iteration 38380/ 173500 | consumed samples: 9825280 | consumed tokens: 20122173440 | elapsed time per iteration (s): 0.76 | learning rate: 1.805E-04 | global batch size: 256 | lm loss: 2.111110E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.131 | TFLOPs: 20.46 |
31: iteration 38390/ 173500 | consumed samples: 9827840 | consumed tokens: 20127416320 | elapsed time per iteration (s): 0.76 | learning rate: 1.805E-04 | global batch size: 256 | lm loss: 2.112000E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.934 | TFLOPs: 20.26 |
31: iteration 38400/ 173500 | consumed samples: 9830400 | consumed tokens: 20132659200 | elapsed time per iteration (s): 0.76 | learning rate: 1.805E-04 | global batch size: 256 | lm loss: 2.103706E+00 | grad norm: 
0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.961 | TFLOPs: 20.26 | 31: iteration 38410/ 173500 | consumed samples: 9832960 | consumed tokens: 20137902080 | elapsed time per iteration (s): 0.76 | learning rate: 1.805E-04 | global batch size: 256 | lm loss: 2.094451E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.945 | TFLOPs: 20.32 | 31: iteration 38420/ 173500 | consumed samples: 9835520 | consumed tokens: 20143144960 | elapsed time per iteration (s): 0.76 | learning rate: 1.805E-04 | global batch size: 256 | lm loss: 2.127775E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.770 | TFLOPs: 20.43 | 31: iteration 38430/ 173500 | consumed samples: 9838080 | consumed tokens: 20148387840 | elapsed time per iteration (s): 0.81 | learning rate: 1.805E-04 | global batch size: 256 | lm loss: 2.098454E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.458 | TFLOPs: 19.14 | 31: iteration 38440/ 173500 | consumed samples: 9840640 | consumed tokens: 20153630720 | elapsed time per iteration (s): 0.76 | learning rate: 1.805E-04 | global batch size: 256 | lm loss: 2.109875E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.631 | TFLOPs: 20.30 | 31: iteration 38450/ 173500 | consumed samples: 9843200 | consumed tokens: 20158873600 | elapsed time per iteration (s): 0.79 | learning rate: 1.805E-04 | global batch size: 256 | lm loss: 2.107426E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.024 | TFLOPs: 19.66 | 31: iteration 38460/ 173500 | consumed samples: 9845760 | consumed tokens: 20164116480 | elapsed time per 
iteration (s): 0.76 | learning rate: 1.804E-04 | global batch size: 256 | lm loss: 2.088847E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.588 | TFLOPs: 20.42 | 31: iteration 38470/ 173500 | consumed samples: 9848320 | consumed tokens: 20169359360 | elapsed time per iteration (s): 0.79 | learning rate: 1.804E-04 | global batch size: 256 | lm loss: 2.071952E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.838 | TFLOPs: 19.71 | 31: iteration 38480/ 173500 | consumed samples: 9850880 | consumed tokens: 20174602240 | elapsed time per iteration (s): 0.78 | learning rate: 1.804E-04 | global batch size: 256 | lm loss: 2.111620E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.081 | TFLOPs: 19.79 | 31: iteration 38490/ 173500 | consumed samples: 9853440 | consumed tokens: 20179845120 | elapsed time per iteration (s): 0.74 | learning rate: 1.804E-04 | global batch size: 256 | lm loss: 2.103648E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.173 | TFLOPs: 20.94 | 31: iteration 38500/ 173500 | consumed samples: 9856000 | consumed tokens: 20185088000 | elapsed time per iteration (s): 0.77 | learning rate: 1.804E-04 | global batch size: 256 | lm loss: 2.107352E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.658 | TFLOPs: 20.00 | 31: iteration 38510/ 173500 | consumed samples: 9858560 | consumed tokens: 20190330880 | elapsed time per iteration (s): 0.71 | learning rate: 1.804E-04 | global batch size: 256 | lm loss: 2.103852E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.367 | TFLOPs: 21.68 | 31: 
iteration 38520/ 173500 | consumed samples: 9861120 | consumed tokens: 20195573760 | elapsed time per iteration (s): 0.75 | learning rate: 1.804E-04 | global batch size: 256 | lm loss: 2.123798E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.539 | TFLOPs: 20.54 | 31: iteration 38530/ 173500 | consumed samples: 9863680 | consumed tokens: 20200816640 | elapsed time per iteration (s): 0.78 | learning rate: 1.804E-04 | global batch size: 256 | lm loss: 2.134698E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.606 | TFLOPs: 19.94 | 31: iteration 38540/ 173500 | consumed samples: 9866240 | consumed tokens: 20206059520 | elapsed time per iteration (s): 0.94 | learning rate: 1.804E-04 | global batch size: 256 | lm loss: 2.077398E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 273.626 | TFLOPs: 16.55 | 31: iteration 38550/ 173500 | consumed samples: 9868800 | consumed tokens: 20211302400 | elapsed time per iteration (s): 0.75 | learning rate: 1.804E-04 | global batch size: 256 | lm loss: 2.111000E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.119 | TFLOPs: 20.76 | 31: iteration 38560/ 173500 | consumed samples: 9871360 | consumed tokens: 20216545280 | elapsed time per iteration (s): 0.77 | learning rate: 1.803E-04 | global batch size: 256 | lm loss: 2.078407E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.674 | TFLOPs: 20.19 | 31: iteration 38570/ 173500 | consumed samples: 9873920 | consumed tokens: 20221788160 | elapsed time per iteration (s): 0.76 | learning rate: 1.803E-04 | global batch size: 256 | lm loss: 2.105789E+00 | grad norm: 0.134 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.529 | TFLOPs: 20.48 | 31: iteration 38580/ 173500 | consumed samples: 9876480 | consumed tokens: 20227031040 | elapsed time per iteration (s): 0.80 | learning rate: 1.803E-04 | global batch size: 256 | lm loss: 2.072452E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.683 | TFLOPs: 19.40 | 31: iteration 38590/ 173500 | consumed samples: 9879040 | consumed tokens: 20232273920 | elapsed time per iteration (s): 0.76 | learning rate: 1.803E-04 | global batch size: 256 | lm loss: 2.106172E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.621 | TFLOPs: 20.43 | 31: iteration 38600/ 173500 | consumed samples: 9881600 | consumed tokens: 20237516800 | elapsed time per iteration (s): 0.81 | learning rate: 1.803E-04 | global batch size: 256 | lm loss: 2.106340E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.232 | TFLOPs: 19.13 | 31: iteration 38610/ 173500 | consumed samples: 9884160 | consumed tokens: 20242759680 | elapsed time per iteration (s): 0.77 | learning rate: 1.803E-04 | global batch size: 256 | lm loss: 2.125924E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.452 | TFLOPs: 20.05 | 31: iteration 38620/ 173500 | consumed samples: 9886720 | consumed tokens: 20248002560 | elapsed time per iteration (s): 0.79 | learning rate: 1.803E-04 | global batch size: 256 | lm loss: 2.071035E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.022 | TFLOPs: 19.54 | 31: iteration 38630/ 173500 | consumed samples: 9889280 | consumed tokens: 20253245440 | elapsed time per iteration (s): 0.77 | learning rate: 
1.803E-04 | global batch size: 256 | lm loss: 2.098287E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.872 | TFLOPs: 20.08 | 31: iteration 38640/ 173500 | consumed samples: 9891840 | consumed tokens: 20258488320 | elapsed time per iteration (s): 0.81 | learning rate: 1.803E-04 | global batch size: 256 | lm loss: 2.101157E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.613 | TFLOPs: 19.15 | 31: iteration 38650/ 173500 | consumed samples: 9894400 | consumed tokens: 20263731200 | elapsed time per iteration (s): 0.79 | learning rate: 1.803E-04 | global batch size: 256 | lm loss: 2.096513E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.720 | TFLOPs: 19.64 | 31: iteration 38660/ 173500 | consumed samples: 9896960 | consumed tokens: 20268974080 | elapsed time per iteration (s): 0.82 | learning rate: 1.802E-04 | global batch size: 256 | lm loss: 2.094601E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.059 | TFLOPs: 18.82 | 31: iteration 38670/ 173500 | consumed samples: 9899520 | consumed tokens: 20274216960 | elapsed time per iteration (s): 0.80 | learning rate: 1.802E-04 | global batch size: 256 | lm loss: 2.099869E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.594 | TFLOPs: 19.46 | 31: iteration 38680/ 173500 | consumed samples: 9902080 | consumed tokens: 20279459840 | elapsed time per iteration (s): 0.80 | learning rate: 1.802E-04 | global batch size: 256 | lm loss: 2.114155E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.911 | TFLOPs: 19.29 | 31: iteration 38690/ 173500 | consumed 
samples: 9904640 | consumed tokens: 20284702720 | elapsed time per iteration (s): 0.79 | learning rate: 1.802E-04 | global batch size: 256 | lm loss: 2.112754E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.023 | TFLOPs: 19.48 | 31: iteration 38700/ 173500 | consumed samples: 9907200 | consumed tokens: 20289945600 | elapsed time per iteration (s): 0.82 | learning rate: 1.802E-04 | global batch size: 256 | lm loss: 2.093501E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.218 | TFLOPs: 18.95 | 31: iteration 38710/ 173500 | consumed samples: 9909760 | consumed tokens: 20295188480 | elapsed time per iteration (s): 0.82 | learning rate: 1.802E-04 | global batch size: 256 | lm loss: 2.110865E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.265 | TFLOPs: 18.83 | 31: iteration 38720/ 173500 | consumed samples: 9912320 | consumed tokens: 20300431360 | elapsed time per iteration (s): 0.82 | learning rate: 1.802E-04 | global batch size: 256 | lm loss: 2.129913E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.387 | TFLOPs: 18.96 | 31: iteration 38730/ 173500 | consumed samples: 9914880 | consumed tokens: 20305674240 | elapsed time per iteration (s): 0.78 | learning rate: 1.802E-04 | global batch size: 256 | lm loss: 2.085935E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.549 | TFLOPs: 19.88 | 31: iteration 38740/ 173500 | consumed samples: 9917440 | consumed tokens: 20310917120 | elapsed time per iteration (s): 0.86 | learning rate: 1.802E-04 | global batch size: 256 | lm loss: 2.066753E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 297.187 | TFLOPs: 17.98 | 31: iteration 38750/ 173500 | consumed samples: 9920000 | consumed tokens: 20316160000 | elapsed time per iteration (s): 0.89 | learning rate: 1.802E-04 | global batch size: 256 | lm loss: 2.092007E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.043 | TFLOPs: 17.43 | 31: iteration 38760/ 173500 | consumed samples: 9922560 | consumed tokens: 20321402880 | elapsed time per iteration (s): 0.81 | learning rate: 1.801E-04 | global batch size: 256 | lm loss: 2.083143E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.401 | TFLOPs: 19.20 | 31: iteration 38770/ 173500 | consumed samples: 9925120 | consumed tokens: 20326645760 | elapsed time per iteration (s): 0.79 | learning rate: 1.801E-04 | global batch size: 256 | lm loss: 2.077858E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.950 | TFLOPs: 19.54 | 31: iteration 38780/ 173500 | consumed samples: 9927680 | consumed tokens: 20331888640 | elapsed time per iteration (s): 0.82 | learning rate: 1.801E-04 | global batch size: 256 | lm loss: 2.090407E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.658 | TFLOPs: 18.92 | 31: iteration 38790/ 173500 | consumed samples: 9930240 | consumed tokens: 20337131520 | elapsed time per iteration (s): 0.79 | learning rate: 1.801E-04 | global batch size: 256 | lm loss: 2.102760E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.028 | TFLOPs: 19.66 | 31: iteration 38800/ 173500 | consumed samples: 9932800 | consumed tokens: 20342374400 | elapsed time per iteration (s): 0.81 | learning rate: 1.801E-04 | global batch size: 256 | lm 
loss: 2.097496E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.710 | TFLOPs: 19.04 | 31: iteration 38810/ 173500 | consumed samples: 9935360 | consumed tokens: 20347617280 | elapsed time per iteration (s): 0.84 | learning rate: 1.801E-04 | global batch size: 256 | lm loss: 2.092200E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.988 | TFLOPs: 18.39 | 31: iteration 38820/ 173500 | consumed samples: 9937920 | consumed tokens: 20352860160 | elapsed time per iteration (s): 0.79 | learning rate: 1.801E-04 | global batch size: 256 | lm loss: 2.107140E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.582 | TFLOPs: 19.70 | 31: iteration 38830/ 173500 | consumed samples: 9940480 | consumed tokens: 20358103040 | elapsed time per iteration (s): 0.83 | learning rate: 1.801E-04 | global batch size: 256 | lm loss: 2.110538E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.884 | TFLOPs: 18.63 | 31: iteration 38840/ 173500 | consumed samples: 9943040 | consumed tokens: 20363345920 | elapsed time per iteration (s): 0.82 | learning rate: 1.801E-04 | global batch size: 256 | lm loss: 2.091611E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.188 | TFLOPs: 18.89 | 31: iteration 38850/ 173500 | consumed samples: 9945600 | consumed tokens: 20368588800 | elapsed time per iteration (s): 0.81 | learning rate: 1.800E-04 | global batch size: 256 | lm loss: 2.082031E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.272 | TFLOPs: 19.01 | 31: iteration 38860/ 173500 | consumed samples: 9948160 | consumed tokens: 
20373831680 | elapsed time per iteration (s): 0.80 | learning rate: 1.800E-04 | global batch size: 256 | lm loss: 2.089337E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.319 | TFLOPs: 19.32 | 31: iteration 38870/ 173500 | consumed samples: 9950720 | consumed tokens: 20379074560 | elapsed time per iteration (s): 0.83 | learning rate: 1.800E-04 | global batch size: 256 | lm loss: 2.117375E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.090 | TFLOPs: 18.58 | 31: iteration 38880/ 173500 | consumed samples: 9953280 | consumed tokens: 20384317440 | elapsed time per iteration (s): 0.86 | learning rate: 1.800E-04 | global batch size: 256 | lm loss: 2.117643E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.896 | TFLOPs: 18.08 | 31: iteration 38890/ 173500 | consumed samples: 9955840 | consumed tokens: 20389560320 | elapsed time per iteration (s): 0.79 | learning rate: 1.800E-04 | global batch size: 256 | lm loss: 2.099641E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.421 | TFLOPs: 19.51 | 31: iteration 38900/ 173500 | consumed samples: 9958400 | consumed tokens: 20394803200 | elapsed time per iteration (s): 0.85 | learning rate: 1.800E-04 | global batch size: 256 | lm loss: 2.091876E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.272 | TFLOPs: 18.29 | 31: iteration 38910/ 173500 | consumed samples: 9960960 | consumed tokens: 20400046080 | elapsed time per iteration (s): 0.87 | learning rate: 1.800E-04 | global batch size: 256 | lm loss: 2.097436E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
292.621 | TFLOPs: 17.70 | 31: iteration 38920/ 173500 | consumed samples: 9963520 | consumed tokens: 20405288960 | elapsed time per iteration (s): 0.86 | learning rate: 1.800E-04 | global batch size: 256 | lm loss: 2.105001E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.476 | TFLOPs: 18.00 | 31: iteration 38930/ 173500 | consumed samples: 9966080 | consumed tokens: 20410531840 | elapsed time per iteration (s): 0.81 | learning rate: 1.800E-04 | global batch size: 256 | lm loss: 2.102130E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.613 | TFLOPs: 19.21 | 31: iteration 38940/ 173500 | consumed samples: 9968640 | consumed tokens: 20415774720 | elapsed time per iteration (s): 0.84 | learning rate: 1.800E-04 | global batch size: 256 | lm loss: 2.094730E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.069 | TFLOPs: 18.46 | 31: iteration 38950/ 173500 | consumed samples: 9971200 | consumed tokens: 20421017600 | elapsed time per iteration (s): 0.80 | learning rate: 1.799E-04 | global batch size: 256 | lm loss: 2.102621E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.586 | TFLOPs: 19.33 | 31: iteration 38960/ 173500 | consumed samples: 9973760 | consumed tokens: 20426260480 | elapsed time per iteration (s): 0.84 | learning rate: 1.799E-04 | global batch size: 256 | lm loss: 2.096086E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.827 | TFLOPs: 18.50 | 31: iteration 38970/ 173500 | consumed samples: 9976320 | consumed tokens: 20431503360 | elapsed time per iteration (s): 0.79 | learning rate: 1.799E-04 | global batch size: 256 | lm loss: 2.108663E+00 | grad norm: 0.131 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.894 | TFLOPs: 19.53 | 31: iteration 38980/ 173500 | consumed samples: 9978880 | consumed tokens: 20436746240 | elapsed time per iteration (s): 0.80 | learning rate: 1.799E-04 | global batch size: 256 | lm loss: 2.095295E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.871 | TFLOPs: 19.29 | 31: iteration 38990/ 173500 | consumed samples: 9981440 | consumed tokens: 20441989120 | elapsed time per iteration (s): 0.80 | learning rate: 1.799E-04 | global batch size: 256 | lm loss: 2.090195E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.251 | TFLOPs: 19.43 | 31: iteration 39000/ 173500 | consumed samples: 9984000 | consumed tokens: 20447232000 | elapsed time per iteration (s): 0.79 | learning rate: 1.799E-04 | global batch size: 256 | lm loss: 2.097686E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.352 | TFLOPs: 19.68 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 39000 | lm loss value: 2.032561E+00 | lm loss PPL: 7.633613E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 39000 to checkpoints_1b1long 0: [2022-11-26 02:51:17,160] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step39000 is begin to save! 0: [2022-11-26 02:51:17,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_01-model_00-model_states.pt... 0: [2022-11-26 02:51:17,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_01-model_00-model_states.pt. 
0: [2022-11-26 02:51:17,374] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_03-model_00-model_states.pt... 0: [2022-11-26 02:51:17,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_03-model_00-model_states.pt. 0: [2022-11-26 02:51:17,453] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_04-model_00-model_states.pt... 0: [2022-11-26 02:51:17,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_04-model_00-model_states.pt. 0: [2022-11-26 02:51:17,527] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_05-model_00-model_states.pt... 0: [2022-11-26 02:51:17,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_05-model_00-model_states.pt. 0: [2022-11-26 02:51:17,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_06-model_00-model_states.pt... 0: [2022-11-26 02:51:17,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_06-model_00-model_states.pt. 0: [2022-11-26 02:51:17,675] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_07-model_00-model_states.pt... 0: [2022-11-26 02:51:17,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_07-model_00-model_states.pt. 0: [2022-11-26 02:51:17,756] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_08-model_00-model_states.pt... 0: [2022-11-26 02:51:17,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_08-model_00-model_states.pt. 
0: [2022-11-26 02:51:17,838] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_09-model_00-model_states.pt... 0: [2022-11-26 02:51:17,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_09-model_00-model_states.pt. 0: [2022-11-26 02:51:17,908] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_10-model_00-model_states.pt... 0: [2022-11-26 02:51:17,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_10-model_00-model_states.pt. 0: [2022-11-26 02:51:17,982] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_11-model_00-model_states.pt... 0: [2022-11-26 02:51:18,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_11-model_00-model_states.pt. 0: [2022-11-26 02:51:18,052] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_12-model_00-model_states.pt... 0: [2022-11-26 02:51:18,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_12-model_00-model_states.pt. 0: [2022-11-26 02:51:18,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_13-model_00-model_states.pt... 0: [2022-11-26 02:51:18,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_13-model_00-model_states.pt. 0: [2022-11-26 02:51:18,194] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_14-model_00-model_states.pt... 0: [2022-11-26 02:51:18,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_14-model_00-model_states.pt. 
0: [2022-11-26 02:51:18,267] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_15-model_00-model_states.pt... 0: [2022-11-26 02:51:18,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_15-model_00-model_states.pt. 0: [2022-11-26 02:51:18,341] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_16-model_00-model_states.pt... 0: [2022-11-26 02:51:18,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_16-model_00-model_states.pt. 0: [2022-11-26 02:51:18,411] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_17-model_00-model_states.pt... 0: [2022-11-26 02:51:18,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_17-model_00-model_states.pt. 0: [2022-11-26 02:51:18,487] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_18-model_00-model_states.pt... 0: [2022-11-26 02:51:18,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_18-model_00-model_states.pt. 0: [2022-11-26 02:51:18,557] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_19-model_00-model_states.pt... 0: [2022-11-26 02:51:18,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_19-model_00-model_states.pt. 0: [2022-11-26 02:51:18,631] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_20-model_00-model_states.pt... 0: [2022-11-26 02:51:18,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_20-model_00-model_states.pt. 
0: [2022-11-26 02:51:18,704] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_21-model_00-model_states.pt... 0: [2022-11-26 02:51:18,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_21-model_00-model_states.pt. 0: [2022-11-26 02:51:18,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_22-model_00-model_states.pt... 0: [2022-11-26 02:51:18,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_22-model_00-model_states.pt. 0: [2022-11-26 02:51:18,852] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_23-model_00-model_states.pt... 0: [2022-11-26 02:51:18,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_23-model_00-model_states.pt. 0: [2022-11-26 02:51:18,921] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_24-model_00-model_states.pt... 0: [2022-11-26 02:51:18,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_24-model_00-model_states.pt. 0: [2022-11-26 02:51:18,996] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_25-model_00-model_states.pt... 0: [2022-11-26 02:51:19,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_25-model_00-model_states.pt. 0: [2022-11-26 02:51:19,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_26-model_00-model_states.pt... 0: [2022-11-26 02:51:19,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_26-model_00-model_states.pt. 
0: [2022-11-26 02:51:19,138] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_27-model_00-model_states.pt... 0: [2022-11-26 02:51:19,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_27-model_00-model_states.pt. 0: [2022-11-26 02:51:19,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_28-model_00-model_states.pt... 0: [2022-11-26 02:51:19,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_28-model_00-model_states.pt. 0: [2022-11-26 02:51:19,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/layer_30-model_00-model_states.pt... 0: [2022-11-26 02:51:19,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/layer_30-model_00-model_states.pt. 0: [2022-11-26 02:51:19,287] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step39000/mp_rank_00_model_states.pt 0: [2022-11-26 02:51:19,287] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/mp_rank_00_model_states.pt... 0: [2022-11-26 02:51:19,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/mp_rank_00_model_states.pt. 0: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 
6: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
4: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
13: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
15: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 
23: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 
24: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 
22: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 
18: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 
27: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
7: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
1: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
2: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
15: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
28: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 
22: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 
26: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
9: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
13: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 23: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 
28: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 
18: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
7: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 28: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 
17: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 20: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 
31: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 31: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 26: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:51:19,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
20: [2022-11-26 02:51:19,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:51:19,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 20: [2022-11-26 02:51:19,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 21: [2022-11-26 02:51:19,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 20: [2022-11-26 02:51:19,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 21: [2022-11-26 02:51:19,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 2: [2022-11-26 02:51:19,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:51:19,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 02:51:19,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 24: [2022-11-26 02:51:19,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:51:19,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 02:51:19,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
26: [2022-11-26 02:51:19,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 02:51:19,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 02:51:19,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 8: [2022-11-26 02:51:19,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:51:19,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 02:51:19,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 6: [2022-11-26 02:51:19,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:51:19,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 02:51:19,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 28: [2022-11-26 02:51:19,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 02:51:19,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 02:51:19,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
6: [2022-11-26 02:51:19,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:51:19,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 11: [2022-11-26 02:51:19,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:51:19,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 9: [2022-11-26 02:51:19,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:51:19,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 02:51:19,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 11: [2022-11-26 02:51:19,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 02:51:19,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 7: [2022-11-26 02:51:19,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:51:19,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 02:51:19,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
5: [2022-11-26 02:51:19,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:51:19,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 02:51:19,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 2: [2022-11-26 02:51:19,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:51:19,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 02:51:19,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 20: [2022-11-26 02:51:19,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 02:51:19,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 02:51:19,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 3: [2022-11-26 02:51:19,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:51:19,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 02:51:19,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
19: [2022-11-26 02:51:19,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:51:19,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 02:51:19,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 0: [2022-11-26 02:51:19,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 18: [2022-11-26 02:51:19,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 02:51:19,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 02:51:19,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 0: [2022-11-26 02:51:19,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:51:19,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 02:51:19,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 18: [2022-11-26 02:51:19,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
18: [2022-11-26 02:51:19,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 9: [2022-11-26 02:51:19,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:51:19,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:51:19,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:51:19,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:51:19,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 9: [2022-11-26 02:51:19,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 17: [2022-11-26 02:51:19,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 02:51:19,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 18: [2022-11-26 02:51:19,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 5: [2022-11-26 02:51:19,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
9: [2022-11-26 02:51:19,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 17: [2022-11-26 02:51:19,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 17: [2022-11-26 02:51:19,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 26: [2022-11-26 02:51:19,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 02:51:19,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 02:51:19,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 23: [2022-11-26 02:51:19,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:51:19,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:51:19,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 8: [2022-11-26 02:51:19,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 23: [2022-11-26 02:51:19,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 8: [2022-11-26 02:51:19,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
3: [2022-11-26 02:51:19,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:51:19,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 02:51:19,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 7: [2022-11-26 02:51:19,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:51:19,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 13: [2022-11-26 02:51:19,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:51:19,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:51:19,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 13: [2022-11-26 02:51:19,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 02:51:19,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 02:51:19,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 13: [2022-11-26 02:51:19,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
11: [2022-11-26 02:51:19,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:51:19,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 02:51:19,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 19: [2022-11-26 02:51:19,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 28: [2022-11-26 02:51:19,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:51:19,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 02:51:19,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 21: [2022-11-26 02:51:19,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:51:19,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 02:51:19,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 20: [2022-11-26 02:51:19,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
28: [2022-11-26 02:51:19,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 02:51:19,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 20: [2022-11-26 02:51:19,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 02:51:19,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 15: [2022-11-26 02:51:19,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:51:19,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:51:19,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 10: [2022-11-26 02:51:19,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 02:51:19,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 15: [2022-11-26 02:51:19,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 10: [2022-11-26 02:51:19,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 02:51:19,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 02:51:19,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 31: [2022-11-26 02:51:19,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 02:51:19,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 29: [2022-11-26 02:51:19,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 31: [2022-11-26 02:51:19,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 29: [2022-11-26 02:51:19,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 02:51:19,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 02:51:19,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 29: [2022-11-26 02:51:19,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 02:51:19,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 30: [2022-11-26 02:51:19,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
30: [2022-11-26 02:51:19,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 02:51:19,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 24: [2022-11-26 02:51:19,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:51:19,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 02:51:19,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 5: [2022-11-26 02:51:19,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:51:19,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 02:51:19,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 16: [2022-11-26 02:51:19,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:51:19,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 7: [2022-11-26 02:51:19,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:51:19,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
7: [2022-11-26 02:51:19,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 02:51:19,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 4: [2022-11-26 02:51:19,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:51:19,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:51:19,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:51:19,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:51:19,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 4: [2022-11-26 02:51:19,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 02:51:19,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 02:51:19,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 1: [2022-11-26 02:51:19,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
4: [2022-11-26 02:51:19,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 4: [2022-11-26 02:51:19,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 4: [2022-11-26 02:51:19,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 1: [2022-11-26 02:51:19,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:51:19,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 02:51:19,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 28: [2022-11-26 02:51:19,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:51:19,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:51:19,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 02:51:19,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 13: [2022-11-26 02:51:19,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 02:51:19,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 25: [2022-11-26 02:51:19,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:51:19,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 21: [2022-11-26 02:51:19,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 25: [2022-11-26 02:51:19,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 21: [2022-11-26 02:51:19,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 16: [2022-11-26 02:51:19,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:51:19,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 16: [2022-11-26 02:51:19,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 31: [2022-11-26 02:51:19,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 29: [2022-11-26 02:51:19,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
25: [2022-11-26 02:51:19,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 31: [2022-11-26 02:51:19,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 29: [2022-11-26 02:51:19,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 16: [2022-11-26 02:51:19,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 29: [2022-11-26 02:51:19,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 31: [2022-11-26 02:51:19,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 16: [2022-11-26 02:51:19,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:51:19,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 02:51:19,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 30: [2022-11-26 02:51:19,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:51:19,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 02:51:19,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
18: [2022-11-26 02:51:19,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 02:51:19,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 7: [2022-11-26 02:51:19,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 28: [2022-11-26 02:51:19,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 18: [2022-11-26 02:51:19,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 7: [2022-11-26 02:51:19,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 28: [2022-11-26 02:51:19,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 7: [2022-11-26 02:51:19,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 26: [2022-11-26 02:51:19,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 02:51:19,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 02:51:19,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 17: [2022-11-26 02:51:19,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
4: [2022-11-26 02:51:19,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:51:19,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 02:51:19,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 17: [2022-11-26 02:51:19,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 25: [2022-11-26 02:51:19,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:51:19,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 25: [2022-11-26 02:51:19,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 17: [2022-11-26 02:51:19,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 25: [2022-11-26 02:51:19,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 17: [2022-11-26 02:51:19,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 02:51:19,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 10: [2022-11-26 02:51:19,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-26 02:51:19,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 02:51:19,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 6: [2022-11-26 02:51:19,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:51:19,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 02:51:19,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 8: [2022-11-26 02:51:19,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:51:19,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 02:51:19,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 0: [2022-11-26 02:51:19,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 20: [2022-11-26 02:51:19,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 
0: [2022-11-26 02:51:19,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 20: [2022-11-26 02:51:19,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 02:51:19,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 0: [2022-11-26 02:51:19,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 2: [2022-11-26 02:51:19,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:51:19,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 31: [2022-11-26 02:51:19,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 02:51:19,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 2: [2022-11-26 02:51:19,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 31: [2022-11-26 02:51:19,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 11: [2022-11-26 02:51:19,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-26 02:51:19,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 02:51:19,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 23: [2022-11-26 02:51:19,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:51:19,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 02:51:19,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 24: [2022-11-26 02:51:19,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:51:19,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 02:51:19,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 0: [2022-11-26 02:51:19,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:51:19,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 02:51:19,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 14: [2022-11-26 02:51:19,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 02:51:19,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 8: [2022-11-26 02:51:19,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:51:19,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 8: [2022-11-26 02:51:19,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 02:51:19,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 13: [2022-11-26 02:51:19,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:51:19,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 02:51:19,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 19: [2022-11-26 02:51:19,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:51:19,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 02:51:19,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 19: [2022-11-26 02:51:19,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-26 02:51:19,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 02:51:19,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 26: [2022-11-26 02:51:19,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 02:51:19,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 02:51:19,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 12: [2022-11-26 02:51:19,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:51:19,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:51:19,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 02:51:19,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 02:51:19,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 02:51:19,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 02:51:19,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 12: [2022-11-26 02:51:19,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 12: [2022-11-26 02:51:19,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 3: [2022-11-26 02:51:19,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:51:19,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 02:51:19,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 30: [2022-11-26 02:51:19,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:51:19,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 02:51:19,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
30: [2022-11-26 02:51:19,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:51:19,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 02:51:19,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 27: [2022-11-26 02:51:19,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 02:51:19,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 02:51:19,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 02:51:19,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 02:51:19,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 27: [2022-11-26 02:51:19,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 0: [2022-11-26 02:51:19,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 02:51:19,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 5: [2022-11-26 02:51:19,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-26 02:51:19,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 02:51:19,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 31: [2022-11-26 02:51:19,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 02:51:19,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 02:51:19,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 15: [2022-11-26 02:51:19,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:51:19,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:51:19,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 11: [2022-11-26 02:51:19,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 15: [2022-11-26 02:51:19,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 11: [2022-11-26 02:51:19,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 23: [2022-11-26 02:51:19,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
21: [2022-11-26 02:51:19,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:51:19,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 02:51:19,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:51:19,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 23: [2022-11-26 02:51:19,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 23: [2022-11-26 02:51:19,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 02:51:19,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 21: [2022-11-26 02:51:19,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 15: [2022-11-26 02:51:19,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:51:19,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 02:51:19,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 18: [2022-11-26 02:51:19,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
18: [2022-11-26 02:51:19,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 02:51:19,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 6: [2022-11-26 02:51:19,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:51:19,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 02:51:19,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 1: [2022-11-26 02:51:19,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:51:19,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:51:19,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 1: [2022-11-26 02:51:19,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 3: [2022-11-26 02:51:19,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 1: [2022-11-26 02:51:19,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 9: [2022-11-26 02:51:19,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-26 02:51:19,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 02:51:19,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:51:19,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 9: [2022-11-26 02:51:19,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 02:51:19,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 12: [2022-11-26 02:51:19,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:51:19,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 02:51:19,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 1: [2022-11-26 02:51:19,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:51:19,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 02:51:19,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 28: [2022-11-26 02:51:19,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
22: [2022-11-26 02:51:19,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:51:19,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:51:19,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:51:19,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:51:19,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 02:51:19,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 02:51:19,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 22: [2022-11-26 02:51:19,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 22: [2022-11-26 02:51:19,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 02:51:19,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 02:51:19,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 22: [2022-11-26 02:51:19,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
28: [2022-11-26 02:51:19,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 02:51:19,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 24: [2022-11-26 02:51:19,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:51:19,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 02:51:19,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 27: [2022-11-26 02:51:19,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 02:51:19,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 02:51:19,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 29: [2022-11-26 02:51:19,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 02:51:19,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 02:51:19,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 2: [2022-11-26 02:51:19,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 02:51:19,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 02:51:19,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 16: [2022-11-26 02:51:19,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:51:19,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 02:51:19,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 10: [2022-11-26 02:51:19,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:51:19,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 02:51:19,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 15: [2022-11-26 02:51:19,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:51:19,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 02:51:19,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 6: [2022-11-26 02:51:19,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 02:51:19,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 02:51:19,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 1: [2022-11-26 02:51:19,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:51:19,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 02:51:19,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 5: [2022-11-26 02:51:19,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:51:19,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 02:51:19,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 25: [2022-11-26 02:51:19,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 02:51:19,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 02:51:19,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 18: [2022-11-26 02:51:19,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-26 02:51:19,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 02:51:19,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 14: [2022-11-26 02:51:19,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:51:19,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 02:51:19,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 20: [2022-11-26 02:51:19,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 02:51:19,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 02:51:19,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 4: [2022-11-26 02:51:19,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:51:19,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 02:51:19,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 26: [2022-11-26 02:51:19,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-26 02:51:19,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 02:51:19,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 22: [2022-11-26 02:51:19,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:51:19,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 02:51:19,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 30: [2022-11-26 02:51:19,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:51:19,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 02:51:19,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 17: [2022-11-26 02:51:19,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:51:19,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 02:51:19,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 7: [2022-11-26 02:51:19,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 02:51:19,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 02:51:19,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 21: [2022-11-26 02:51:19,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:51:19,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 02:51:19,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 9: [2022-11-26 02:51:19,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:51:19,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 02:51:19,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 13: [2022-11-26 02:51:19,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:51:19,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 02:51:19,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 29: [2022-11-26 02:51:19,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
29: [2022-11-26 02:51:19,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 02:51:19,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 2: [2022-11-26 02:51:19,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:51:19,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 02:51:19,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 0: [2022-11-26 02:51:19,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:51:19,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 02:51:19,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 23: [2022-11-26 02:51:19,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:51:19,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 02:51:19,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 27: [2022-11-26 02:51:19,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-26 02:51:19,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 28: [2022-11-26 02:51:19,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 27: [2022-11-26 02:51:19,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 28: [2022-11-26 02:51:19,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 8: [2022-11-26 02:51:19,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 28: [2022-11-26 02:51:19,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 19: [2022-11-26 02:51:19,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:51:19,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 8: [2022-11-26 02:51:19,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 19: [2022-11-26 02:51:19,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 8: [2022-11-26 02:51:19,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 11: [2022-11-26 02:51:19,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
31: [2022-11-26 02:51:19,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:51:19,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 02:51:19,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 31: [2022-11-26 02:51:19,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 02:51:19,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 3: [2022-11-26 02:51:19,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:51:19,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 02:51:19,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 24: [2022-11-26 02:51:19,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:51:19,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 02:51:19,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 10: [2022-11-26 02:51:19,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 02:51:19,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 02:51:19,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 12: [2022-11-26 02:51:19,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:51:19,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 02:51:19,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 16: [2022-11-26 02:51:19,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:51:19,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 02:51:19,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 15: [2022-11-26 02:51:19,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:51:19,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 02:51:19,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 18: [2022-11-26 02:51:19,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
18: [2022-11-26 02:51:19,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 02:51:19,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 6: [2022-11-26 02:51:19,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:51:19,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 02:51:19,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 14: [2022-11-26 02:51:19,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:51:19,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 02:51:19,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 25: [2022-11-26 02:51:19,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 02:51:19,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 02:51:19,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 4: [2022-11-26 02:51:19,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 02:51:19,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 02:51:19,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 17: [2022-11-26 02:51:19,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 02:51:19,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 02:51:19,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 1: [2022-11-26 02:51:19,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:51:19,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 02:51:19,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 20: [2022-11-26 02:51:19,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 02:51:19,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 02:51:19,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 26: [2022-11-26 02:51:19,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
26: [2022-11-26 02:51:19,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 02:51:19,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 22: [2022-11-26 02:51:19,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:51:19,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 02:51:19,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 7: [2022-11-26 02:51:19,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:51:19,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 5: [2022-11-26 02:51:19,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:51:19,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 5: [2022-11-26 02:51:19,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 02:51:19,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 9: [2022-11-26 02:51:19,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-26 02:51:19,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 02:51:19,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 21: [2022-11-26 02:51:19,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:51:19,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 02:51:19,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 13: [2022-11-26 02:51:19,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:51:19,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 02:51:19,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 29: [2022-11-26 02:51:19,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 02:51:19,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 02:51:19,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 8: [2022-11-26 02:51:19,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 02:51:19,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 02:51:19,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 2: [2022-11-26 02:51:19,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:51:19,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 02:51:19,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 30: [2022-11-26 02:51:19,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:51:19,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 02:51:19,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 12: [2022-11-26 02:51:19,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:51:19,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 02:51:19,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 28: [2022-11-26 02:51:19,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-26 02:51:19,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 02:51:19,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 16: [2022-11-26 02:51:19,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:51:19,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 27: [2022-11-26 02:51:19,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:51:19,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 27: [2022-11-26 02:51:19,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 02:51:19,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 31: [2022-11-26 02:51:19,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 02:51:19,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 02:51:19,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 15: [2022-11-26 02:51:19,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
18: [2022-11-26 02:51:19,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:51:19,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 18: [2022-11-26 02:51:19,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 15: [2022-11-26 02:51:19,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 18: [2022-11-26 02:51:19,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 6: [2022-11-26 02:51:19,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:51:19,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 02:51:19,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 14: [2022-11-26 02:51:19,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:51:19,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 02:51:19,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 3: [2022-11-26 02:51:19,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 02:51:19,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 02:51:19,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 17: [2022-11-26 02:51:19,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:51:19,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:51:19,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:51:19,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 02:51:19,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 23: [2022-11-26 02:51:19,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
17: [2022-11-26 02:51:19,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 19: [2022-11-26 02:51:19,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 23: [2022-11-26 02:51:19,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 17: [2022-11-26 02:51:19,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 19: [2022-11-26 02:51:19,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 25: [2022-11-26 02:51:19,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:51:19,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 25: [2022-11-26 02:51:19,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 02:51:19,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 24: [2022-11-26 02:51:19,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:51:19,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 02:51:19,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
11: [2022-11-26 02:51:19,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:51:19,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 02:51:19,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 20: [2022-11-26 02:51:19,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 02:51:19,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 02:51:19,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 26: [2022-11-26 02:51:19,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 02:51:19,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 02:51:19,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 7: [2022-11-26 02:51:19,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:51:19,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 02:51:19,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
1: [2022-11-26 02:51:19,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:51:19,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 02:51:19,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 22: [2022-11-26 02:51:19,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:51:19,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 02:51:19,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 5: [2022-11-26 02:51:19,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:51:19,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 02:51:19,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 30: [2022-11-26 02:51:19,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 02:51:19,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 02:51:19,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
13: [2022-11-26 02:51:19,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:51:19,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 02:51:19,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 29: [2022-11-26 02:51:19,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 02:51:19,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 02:51:19,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 31: [2022-11-26 02:51:19,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 02:51:19,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 02:51:19,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 21: [2022-11-26 02:51:19,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:51:19,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 02:51:19,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
0: [2022-11-26 02:51:19,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:51:19,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 02:51:19,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 9: [2022-11-26 02:51:19,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:51:19,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 02:51:19,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 19: [2022-11-26 02:51:19,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:51:19,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 0: [2022-11-26 02:51:19,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:51:19,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 0: [2022-11-26 02:51:19,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 02:51:19,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
8: [2022-11-26 02:51:19,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:51:19,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 02:51:19,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 2: [2022-11-26 02:51:19,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:51:19,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 02:51:19,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 12: [2022-11-26 02:51:19,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:51:19,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:51:19,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 4: [2022-11-26 02:51:19,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 02:51:19,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 12: [2022-11-26 02:51:19,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
28: [2022-11-26 02:51:19,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 02:51:19,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 02:51:19,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 16: [2022-11-26 02:51:19,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:51:19,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 02:51:19,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 6: [2022-11-26 02:51:19,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:51:19,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 02:51:19,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 11: [2022-11-26 02:51:19,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:51:19,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 02:51:19,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
24: [2022-11-26 02:51:19,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:51:19,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 27: [2022-11-26 02:51:19,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 24: [2022-11-26 02:51:19,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 27: [2022-11-26 02:51:19,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 02:51:19,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 2: [2022-11-26 02:51:19,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:51:19,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 02:51:19,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 18: [2022-11-26 02:51:19,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 02:51:19,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 02:51:19,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
3: [2022-11-26 02:51:19,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:51:19,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 02:51:19,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 7: [2022-11-26 02:51:19,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:51:19,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:51:19,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 0: [2022-11-26 02:51:19,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 7: [2022-11-26 02:51:19,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 0: [2022-11-26 02:51:19,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 11: [2022-11-26 02:51:19,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:51:19,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
11: [2022-11-26 02:51:19,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 02:51:19,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 17: [2022-11-26 02:51:19,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:51:19,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:51:19,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 02:51:19,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 20: [2022-11-26 02:51:19,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:51:19,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 17: [2022-11-26 02:51:19,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 4: [2022-11-26 02:51:19,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 17: [2022-11-26 02:51:19,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
20: [2022-11-26 02:51:19,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 02:51:19,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 21: [2022-11-26 02:51:19,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:51:19,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 21: [2022-11-26 02:51:19,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 12: [2022-11-26 02:51:19,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 21: [2022-11-26 02:51:19,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 12: [2022-11-26 02:51:19,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 26: [2022-11-26 02:51:19,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 28: [2022-11-26 02:51:19,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
26: [2022-11-26 02:51:19,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 28: [2022-11-26 02:51:19,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 26: [2022-11-26 02:51:19,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 28: [2022-11-26 02:51:19,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 8: [2022-11-26 02:51:19,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:51:19,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 02:51:19,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 15: [2022-11-26 02:51:19,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:51:19,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 02:51:19,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 24: [2022-11-26 02:51:19,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
24: [2022-11-26 02:51:19,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 02:51:19,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 23: [2022-11-26 02:51:19,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 02:51:19,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 02:51:19,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 25: [2022-11-26 02:51:19,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 02:51:19,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 02:51:19,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 27: [2022-11-26 02:51:19,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 02:51:19,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 13: [2022-11-26 02:51:19,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 27: [2022-11-26 02:51:19,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
13: [2022-11-26 02:51:19,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 02:51:19,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 10: [2022-11-26 02:51:19,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:51:19,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:51:19,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 14: [2022-11-26 02:51:19,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 10: [2022-11-26 02:51:19,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 14: [2022-11-26 02:51:19,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 16: [2022-11-26 02:51:19,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 22: [2022-11-26 02:51:19,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
22: [2022-11-26 02:51:19,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 30: [2022-11-26 02:51:19,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 16: [2022-11-26 02:51:19,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 22: [2022-11-26 02:51:19,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 16: [2022-11-26 02:51:19,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 30: [2022-11-26 02:51:19,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 02:51:19,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 1: [2022-11-26 02:51:19,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:51:19,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:51:19,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 1: [2022-11-26 02:51:19,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 3: [2022-11-26 02:51:19,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
1: [2022-11-26 02:51:19,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 5: [2022-11-26 02:51:19,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 19: [2022-11-26 02:51:19,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:51:19,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 19: [2022-11-26 02:51:19,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 5: [2022-11-26 02:51:19,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 19: [2022-11-26 02:51:19,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 9: [2022-11-26 02:51:19,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:51:19,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 29: [2022-11-26 02:51:19,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:51:19,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
29: [2022-11-26 02:51:19,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 02:51:19,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 15: [2022-11-26 02:51:19,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:51:19,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 02:51:19,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 14: [2022-11-26 02:51:19,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:51:19,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 02:51:19,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 31: [2022-11-26 02:51:19,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 02:51:19,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 02:51:19,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 23: [2022-11-26 02:51:19,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-26 02:51:19,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 02:51:19,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 27: [2022-11-26 02:51:19,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 02:51:19,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 02:51:19,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 14: [2022-11-26 02:51:19,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:51:19,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 02:51:19,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 25: [2022-11-26 02:51:19,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 02:51:19,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 02:51:19,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 25: [2022-11-26 02:51:19,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
25: [2022-11-26 02:51:19,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step39000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 02:51:19,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 0: successfully saved checkpoint at iteration 39000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2488.02 31: iteration 39010/ 173500 | consumed samples: 9986560 | consumed tokens: 20452474880 | elapsed time per iteration (s): 1.10 | learning rate: 1.799E-04 | global batch size: 256 | lm loss: 2.139780E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.179 | TFLOPs: 14.05 | 31: iteration 39020/ 173500 | consumed samples: 9989120 | consumed tokens: 20457717760 | elapsed time per iteration (s): 0.85 | learning rate: 1.799E-04 | global batch size: 256 | lm loss: 2.125572E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.292 | TFLOPs: 18.23 | 31: iteration 39030/ 173500 | consumed samples: 9991680 | consumed tokens: 20462960640 | elapsed time per iteration (s): 0.83 | learning rate: 1.799E-04 | global batch size: 256 | lm loss: 2.065984E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.112 | TFLOPs: 18.58 | 31: iteration 39040/ 173500 | consumed samples: 9994240 | consumed tokens: 20468203520 | elapsed time per iteration (s): 0.79 | learning rate: 1.799E-04 | global batch size: 256 | lm loss: 2.113040E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.645 | TFLOPs: 19.64 | 31: iteration 39050/ 173500 | consumed samples: 9996800 | consumed tokens: 20473446400 | elapsed time per iteration (s): 0.81 | learning rate: 1.798E-04 | global batch 
size: 256 | lm loss: 2.087284E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.190 | TFLOPs: 19.01 | 31: iteration 39060/ 173500 | consumed samples: 9999360 | consumed tokens: 20478689280 | elapsed time per iteration (s): 0.80 | learning rate: 1.798E-04 | global batch size: 256 | lm loss: 2.106491E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.877 | TFLOPs: 19.35 | 31: iteration 39070/ 173500 | consumed samples: 10001920 | consumed tokens: 20483932160 | elapsed time per iteration (s): 0.81 | learning rate: 1.798E-04 | global batch size: 256 | lm loss: 2.065424E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.650 | TFLOPs: 19.10 | 31: iteration 39080/ 173500 | consumed samples: 10004480 | consumed tokens: 20489175040 | elapsed time per iteration (s): 0.84 | learning rate: 1.798E-04 | global batch size: 256 | lm loss: 2.118993E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.322 | TFLOPs: 18.47 | 31: iteration 39090/ 173500 | consumed samples: 10007040 | consumed tokens: 20494417920 | elapsed time per iteration (s): 0.82 | learning rate: 1.798E-04 | global batch size: 256 | lm loss: 2.104506E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.097 | TFLOPs: 18.82 | 31: iteration 39100/ 173500 | consumed samples: 10009600 | consumed tokens: 20499660800 | elapsed time per iteration (s): 0.80 | learning rate: 1.798E-04 | global batch size: 256 | lm loss: 2.091296E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.090 | TFLOPs: 19.24 | 31: iteration 39110/ 173500 | consumed samples: 10012160 | 
consumed tokens: 20504903680 | elapsed time per iteration (s): 0.82 | learning rate: 1.798E-04 | global batch size: 256 | lm loss: 2.080215E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.910 | TFLOPs: 18.93 | 31: iteration 39120/ 173500 | consumed samples: 10014720 | consumed tokens: 20510146560 | elapsed time per iteration (s): 0.83 | learning rate: 1.798E-04 | global batch size: 256 | lm loss: 2.110196E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.989 | TFLOPs: 18.75 | 31: iteration 39130/ 173500 | consumed samples: 10017280 | consumed tokens: 20515389440 | elapsed time per iteration (s): 0.82 | learning rate: 1.798E-04 | global batch size: 256 | lm loss: 2.104800E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.262 | TFLOPs: 18.95 | 31: iteration 39140/ 173500 | consumed samples: 10019840 | consumed tokens: 20520632320 | elapsed time per iteration (s): 0.81 | learning rate: 1.797E-04 | global batch size: 256 | lm loss: 2.087914E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.676 | TFLOPs: 19.04 | 31: iteration 39150/ 173500 | consumed samples: 10022400 | consumed tokens: 20525875200 | elapsed time per iteration (s): 0.81 | learning rate: 1.797E-04 | global batch size: 256 | lm loss: 2.095951E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.815 | TFLOPs: 19.17 | 31: iteration 39160/ 173500 | consumed samples: 10024960 | consumed tokens: 20531118080 | elapsed time per iteration (s): 0.79 | learning rate: 1.797E-04 | global batch size: 256 | lm loss: 2.089148E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 323.763 | TFLOPs: 19.59 | 31: iteration 39170/ 173500 | consumed samples: 10027520 | consumed tokens: 20536360960 | elapsed time per iteration (s): 0.84 | learning rate: 1.797E-04 | global batch size: 256 | lm loss: 2.104057E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.973 | TFLOPs: 18.33 | 31: iteration 39180/ 173500 | consumed samples: 10030080 | consumed tokens: 20541603840 | elapsed time per iteration (s): 0.81 | learning rate: 1.797E-04 | global batch size: 256 | lm loss: 2.113652E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.612 | TFLOPs: 19.03 | 31: iteration 39190/ 173500 | consumed samples: 10032640 | consumed tokens: 20546846720 | elapsed time per iteration (s): 0.81 | learning rate: 1.797E-04 | global batch size: 256 | lm loss: 2.232938E+00 | grad norm: 5.254 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.534 | TFLOPs: 19.03 | 31: iteration 39200/ 173500 | consumed samples: 10035200 | consumed tokens: 20552089600 | elapsed time per iteration (s): 0.85 | learning rate: 1.797E-04 | global batch size: 256 | lm loss: 2.295950E+00 | grad norm: 0.380 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.646 | TFLOPs: 18.19 | 31: iteration 39210/ 173500 | consumed samples: 10037760 | consumed tokens: 20557332480 | elapsed time per iteration (s): 0.81 | learning rate: 1.797E-04 | global batch size: 256 | lm loss: 2.140766E+00 | grad norm: 0.229 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.589 | TFLOPs: 19.09 | 31: iteration 39220/ 173500 | consumed samples: 10040320 | consumed tokens: 20562575360 | elapsed time per iteration (s): 0.81 | learning rate: 1.797E-04 | global batch size: 256 | lm loss: 
2.133166E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.852 | TFLOPs: 19.05 | 31: iteration 39230/ 173500 | consumed samples: 10042880 | consumed tokens: 20567818240 | elapsed time per iteration (s): 0.84 | learning rate: 1.797E-04 | global batch size: 256 | lm loss: 2.097621E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.554 | TFLOPs: 18.49 | 31: iteration 39240/ 173500 | consumed samples: 10045440 | consumed tokens: 20573061120 | elapsed time per iteration (s): 0.85 | learning rate: 1.796E-04 | global batch size: 256 | lm loss: 2.105038E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.524 | TFLOPs: 18.12 | 31: iteration 39250/ 173500 | consumed samples: 10048000 | consumed tokens: 20578304000 | elapsed time per iteration (s): 0.82 | learning rate: 1.796E-04 | global batch size: 256 | lm loss: 2.073173E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.063 | TFLOPs: 18.82 | 31: iteration 39260/ 173500 | consumed samples: 10050560 | consumed tokens: 20583546880 | elapsed time per iteration (s): 0.81 | learning rate: 1.796E-04 | global batch size: 256 | lm loss: 2.093948E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.141 | TFLOPs: 19.07 | 31: iteration 39270/ 173500 | consumed samples: 10053120 | consumed tokens: 20588789760 | elapsed time per iteration (s): 0.80 | learning rate: 1.796E-04 | global batch size: 256 | lm loss: 2.098051E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.541 | TFLOPs: 19.33 | 31: iteration 39280/ 173500 | consumed samples: 10055680 | consumed tokens: 
20594032640 | elapsed time per iteration (s): 0.80 | learning rate: 1.796E-04 | global batch size: 256 | lm loss: 2.112617E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.801 | TFLOPs: 19.29 | 31: iteration 39290/ 173500 | consumed samples: 10058240 | consumed tokens: 20599275520 | elapsed time per iteration (s): 0.80 | learning rate: 1.796E-04 | global batch size: 256 | lm loss: 2.106451E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.016 | TFLOPs: 19.36 | 31: iteration 39300/ 173500 | consumed samples: 10060800 | consumed tokens: 20604518400 | elapsed time per iteration (s): 0.81 | learning rate: 1.796E-04 | global batch size: 256 | lm loss: 2.124104E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.935 | TFLOPs: 19.11 | 31: iteration 39310/ 173500 | consumed samples: 10063360 | consumed tokens: 20609761280 | elapsed time per iteration (s): 0.83 | learning rate: 1.796E-04 | global batch size: 256 | lm loss: 2.143272E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.698 | TFLOPs: 18.68 | 31: iteration 39320/ 173500 | consumed samples: 10065920 | consumed tokens: 20615004160 | elapsed time per iteration (s): 0.81 | learning rate: 1.796E-04 | global batch size: 256 | lm loss: 2.070488E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.595 | TFLOPs: 19.03 | 31: iteration 39330/ 173500 | consumed samples: 10068480 | consumed tokens: 20620247040 | elapsed time per iteration (s): 0.83 | learning rate: 1.795E-04 | global batch size: 256 | lm loss: 2.109099E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 309.331 | TFLOPs: 18.71 | 31: iteration 39340/ 173500 | consumed samples: 10071040 | consumed tokens: 20625489920 | elapsed time per iteration (s): 0.81 | learning rate: 1.795E-04 | global batch size: 256 | lm loss: 2.104067E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.924 | TFLOPs: 19.23 | 31: iteration 39350/ 173500 | consumed samples: 10073600 | consumed tokens: 20630732800 | elapsed time per iteration (s): 0.79 | learning rate: 1.795E-04 | global batch size: 256 | lm loss: 2.093238E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.924 | TFLOPs: 19.60 | 31: iteration 39360/ 173500 | consumed samples: 10076160 | consumed tokens: 20635975680 | elapsed time per iteration (s): 0.85 | learning rate: 1.795E-04 | global batch size: 256 | lm loss: 2.102856E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.179 | TFLOPs: 18.28 | 31: iteration 39370/ 173500 | consumed samples: 10078720 | consumed tokens: 20641218560 | elapsed time per iteration (s): 0.82 | learning rate: 1.795E-04 | global batch size: 256 | lm loss: 2.077269E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.844 | TFLOPs: 18.87 | 31: iteration 39380/ 173500 | consumed samples: 10081280 | consumed tokens: 20646461440 | elapsed time per iteration (s): 0.81 | learning rate: 1.795E-04 | global batch size: 256 | lm loss: 2.111789E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.952 | TFLOPs: 19.24 | 31: iteration 39390/ 173500 | consumed samples: 10083840 | consumed tokens: 20651704320 | elapsed time per iteration (s): 0.83 | learning rate: 1.795E-04 | global batch size: 256 | lm loss: 2.129458E+00 | grad 
norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.769 | TFLOPs: 18.56 | 31: iteration 39400/ 173500 | consumed samples: 10086400 | consumed tokens: 20656947200 | elapsed time per iteration (s): 0.74 | learning rate: 1.795E-04 | global batch size: 256 | lm loss: 2.077512E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.656 | TFLOPs: 20.79 | 31: iteration 39410/ 173500 | consumed samples: 10088960 | consumed tokens: 20662190080 | elapsed time per iteration (s): 0.76 | learning rate: 1.795E-04 | global batch size: 256 | lm loss: 2.056837E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.105 | TFLOPs: 20.45 | 31: iteration 39420/ 173500 | consumed samples: 10091520 | consumed tokens: 20667432960 | elapsed time per iteration (s): 0.79 | learning rate: 1.795E-04 | global batch size: 256 | lm loss: 2.093491E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.904 | TFLOPs: 19.72 | 31: iteration 39430/ 173500 | consumed samples: 10094080 | consumed tokens: 20672675840 | elapsed time per iteration (s): 0.78 | learning rate: 1.794E-04 | global batch size: 256 | lm loss: 2.111046E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.995 | TFLOPs: 19.96 | 31: iteration 39440/ 173500 | consumed samples: 10096640 | consumed tokens: 20677918720 | elapsed time per iteration (s): 0.76 | learning rate: 1.794E-04 | global batch size: 256 | lm loss: 2.102653E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.791 | TFLOPs: 20.25 | 31: iteration 39450/ 173500 | consumed samples: 10099200 | consumed tokens: 20683161600 | elapsed time 
per iteration (s): 0.80 | learning rate: 1.794E-04 | global batch size: 256 | lm loss: 2.111831E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.891 | TFLOPs: 19.35 | 31: iteration 39460/ 173500 | consumed samples: 10101760 | consumed tokens: 20688404480 | elapsed time per iteration (s): 0.75 | learning rate: 1.794E-04 | global batch size: 256 | lm loss: 2.125448E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.206 | TFLOPs: 20.76 | 31: iteration 39470/ 173500 | consumed samples: 10104320 | consumed tokens: 20693647360 | elapsed time per iteration (s): 0.75 | learning rate: 1.794E-04 | global batch size: 256 | lm loss: 2.101813E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.367 | TFLOPs: 20.71 | 31: iteration 39480/ 173500 | consumed samples: 10106880 | consumed tokens: 20698890240 | elapsed time per iteration (s): 0.79 | learning rate: 1.794E-04 | global batch size: 256 | lm loss: 2.110854E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.143 | TFLOPs: 19.55 | 31: iteration 39490/ 173500 | consumed samples: 10109440 | consumed tokens: 20704133120 | elapsed time per iteration (s): 0.77 | learning rate: 1.794E-04 | global batch size: 256 | lm loss: 2.120191E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.941 | TFLOPs: 20.08 | 31: iteration 39500/ 173500 | consumed samples: 10112000 | consumed tokens: 20709376000 | elapsed time per iteration (s): 0.76 | learning rate: 1.794E-04 | global batch size: 256 | lm loss: 2.110086E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.846 | TFLOPs: 
20.44 | 31: iteration 39510/ 173500 | consumed samples: 10114560 | consumed tokens: 20714618880 | elapsed time per iteration (s): 0.78 | learning rate: 1.794E-04 | global batch size: 256 | lm loss: 2.092170E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.964 | TFLOPs: 19.96 | 31: iteration 39520/ 173500 | consumed samples: 10117120 | consumed tokens: 20719861760 | elapsed time per iteration (s): 0.88 | learning rate: 1.793E-04 | global batch size: 256 | lm loss: 2.079160E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.547 | TFLOPs: 17.64 | 31: iteration 39530/ 173500 | consumed samples: 10119680 | consumed tokens: 20725104640 | elapsed time per iteration (s): 0.83 | learning rate: 1.793E-04 | global batch size: 256 | lm loss: 2.097074E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.254 | TFLOPs: 18.59 | 31: iteration 39540/ 173500 | consumed samples: 10122240 | consumed tokens: 20730347520 | elapsed time per iteration (s): 0.76 | learning rate: 1.793E-04 | global batch size: 256 | lm loss: 2.091828E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.636 | TFLOPs: 20.31 | 31: iteration 39550/ 173500 | consumed samples: 10124800 | consumed tokens: 20735590400 | elapsed time per iteration (s): 0.73 | learning rate: 1.793E-04 | global batch size: 256 | lm loss: 2.082931E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.595 | TFLOPs: 21.21 | 31: iteration 39560/ 173500 | consumed samples: 10127360 | consumed tokens: 20740833280 | elapsed time per iteration (s): 0.76 | learning rate: 1.793E-04 | global batch size: 256 | lm loss: 2.075902E+00 | grad norm: 0.126 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.919 | TFLOPs: 20.26 | 31: iteration 39570/ 173500 | consumed samples: 10129920 | consumed tokens: 20746076160 | elapsed time per iteration (s): 0.77 | learning rate: 1.793E-04 | global batch size: 256 | lm loss: 2.087551E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.187 | TFLOPs: 20.04 | 31: iteration 39580/ 173500 | consumed samples: 10132480 | consumed tokens: 20751319040 | elapsed time per iteration (s): 0.79 | learning rate: 1.793E-04 | global batch size: 256 | lm loss: 2.093565E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.678 | TFLOPs: 19.64 | 31: iteration 39590/ 173500 | consumed samples: 10135040 | consumed tokens: 20756561920 | elapsed time per iteration (s): 0.74 | learning rate: 1.793E-04 | global batch size: 256 | lm loss: 2.075957E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.087 | TFLOPs: 20.88 | 31: iteration 39600/ 173500 | consumed samples: 10137600 | consumed tokens: 20761804800 | elapsed time per iteration (s): 0.85 | learning rate: 1.793E-04 | global batch size: 256 | lm loss: 2.103798E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.127 | TFLOPs: 18.16 | 31: iteration 39610/ 173500 | consumed samples: 10140160 | consumed tokens: 20767047680 | elapsed time per iteration (s): 0.85 | learning rate: 1.793E-04 | global batch size: 256 | lm loss: 2.107101E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.894 | TFLOPs: 18.14 | 31: iteration 39620/ 173500 | consumed samples: 10142720 | consumed tokens: 20772290560 | elapsed time per iteration (s): 0.82 | 
learning rate: 1.792E-04 | global batch size: 256 | lm loss: 2.116419E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.110 | TFLOPs: 18.94 | 31: iteration 39630/ 173500 | consumed samples: 10145280 | consumed tokens: 20777533440 | elapsed time per iteration (s): 0.86 | learning rate: 1.792E-04 | global batch size: 256 | lm loss: 2.106816E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.362 | TFLOPs: 17.99 | 31: iteration 39640/ 173500 | consumed samples: 10147840 | consumed tokens: 20782776320 | elapsed time per iteration (s): 0.82 | learning rate: 1.792E-04 | global batch size: 256 | lm loss: 2.092869E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.689 | TFLOPs: 18.80 | 31: iteration 39650/ 173500 | consumed samples: 10150400 | consumed tokens: 20788019200 | elapsed time per iteration (s): 0.80 | learning rate: 1.792E-04 | global batch size: 256 | lm loss: 2.084645E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.953 | TFLOPs: 19.36 | 31: iteration 39660/ 173500 | consumed samples: 10152960 | consumed tokens: 20793262080 | elapsed time per iteration (s): 0.73 | learning rate: 1.792E-04 | global batch size: 256 | lm loss: 2.132425E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.474 | TFLOPs: 21.32 | 31: iteration 39670/ 173500 | consumed samples: 10155520 | consumed tokens: 20798504960 | elapsed time per iteration (s): 0.75 | learning rate: 1.792E-04 | global batch size: 256 | lm loss: 2.079741E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.733 | TFLOPs: 20.55 | 31: iteration 39680/ 
173500 | consumed samples: 10158080 | consumed tokens: 20803747840 | elapsed time per iteration (s): 0.75 | learning rate: 1.792E-04 | global batch size: 256 | lm loss: 2.083091E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.070 | TFLOPs: 20.69 | 31: iteration 39690/ 173500 | consumed samples: 10160640 | consumed tokens: 20808990720 | elapsed time per iteration (s): 0.74 | learning rate: 1.792E-04 | global batch size: 256 | lm loss: 2.081009E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.775 | TFLOPs: 20.86 | 31: iteration 39700/ 173500 | consumed samples: 10163200 | consumed tokens: 20814233600 | elapsed time per iteration (s): 0.76 | learning rate: 1.792E-04 | global batch size: 256 | lm loss: 2.122731E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.826 | TFLOPs: 20.50 | 31: iteration 39710/ 173500 | consumed samples: 10165760 | consumed tokens: 20819476480 | elapsed time per iteration (s): 0.72 | learning rate: 1.792E-04 | global batch size: 256 | lm loss: 2.109499E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.131 | TFLOPs: 21.55 | 31: iteration 39720/ 173500 | consumed samples: 10168320 | consumed tokens: 20824719360 | elapsed time per iteration (s): 0.78 | learning rate: 1.791E-04 | global batch size: 256 | lm loss: 2.115720E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.870 | TFLOPs: 19.96 | 31: iteration 39730/ 173500 | consumed samples: 10170880 | consumed tokens: 20829962240 | elapsed time per iteration (s): 0.75 | learning rate: 1.791E-04 | global batch size: 256 | lm loss: 2.087866E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 341.167 | TFLOPs: 20.64 | 31: iteration 39740/ 173500 | consumed samples: 10173440 | consumed tokens: 20835205120 | elapsed time per iteration (s): 0.79 | learning rate: 1.791E-04 | global batch size: 256 | lm loss: 2.098203E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.971 | TFLOPs: 19.72 | 31: iteration 39750/ 173500 | consumed samples: 10176000 | consumed tokens: 20840448000 | elapsed time per iteration (s): 0.76 | learning rate: 1.791E-04 | global batch size: 256 | lm loss: 2.072836E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.433 | TFLOPs: 20.35 | 31: iteration 39760/ 173500 | consumed samples: 10178560 | consumed tokens: 20845690880 | elapsed time per iteration (s): 0.80 | learning rate: 1.791E-04 | global batch size: 256 | lm loss: 2.090037E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.302 | TFLOPs: 19.44 | 31: iteration 39770/ 173500 | consumed samples: 10181120 | consumed tokens: 20850933760 | elapsed time per iteration (s): 0.75 | learning rate: 1.791E-04 | global batch size: 256 | lm loss: 2.081927E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.146 | TFLOPs: 20.76 | 31: iteration 39780/ 173500 | consumed samples: 10183680 | consumed tokens: 20856176640 | elapsed time per iteration (s): 0.78 | learning rate: 1.791E-04 | global batch size: 256 | lm loss: 2.100423E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.260 | TFLOPs: 19.86 | 31: iteration 39790/ 173500 | consumed samples: 10186240 | consumed tokens: 20861419520 | elapsed time per iteration (s): 0.74 | learning rate: 
1.791E-04 | global batch size: 256 | lm loss: 2.112063E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.915 | TFLOPs: 20.99 | 31: iteration 39800/ 173500 | consumed samples: 10188800 | consumed tokens: 20866662400 | elapsed time per iteration (s): 0.86 | learning rate: 1.791E-04 | global batch size: 256 | lm loss: 2.084844E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.952 | TFLOPs: 17.96 | 31: iteration 39810/ 173500 | consumed samples: 10191360 | consumed tokens: 20871905280 | elapsed time per iteration (s): 0.72 | learning rate: 1.790E-04 | global batch size: 256 | lm loss: 2.096815E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.588 | TFLOPs: 21.45 | 31: iteration 39820/ 173500 | consumed samples: 10193920 | consumed tokens: 20877148160 | elapsed time per iteration (s): 0.76 | learning rate: 1.790E-04 | global batch size: 256 | lm loss: 2.084174E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.181 | TFLOPs: 20.34 | 31: iteration 39830/ 173500 | consumed samples: 10196480 | consumed tokens: 20882391040 | elapsed time per iteration (s): 0.73 | learning rate: 1.790E-04 | global batch size: 256 | lm loss: 2.092294E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.896 | TFLOPs: 21.17 | 31: iteration 39840/ 173500 | consumed samples: 10199040 | consumed tokens: 20887633920 | elapsed time per iteration (s): 0.74 | learning rate: 1.790E-04 | global batch size: 256 | lm loss: 2.113526E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.524 | TFLOPs: 20.90 | 31: iteration 39850/ 173500 | 
consumed samples: 10201600 | consumed tokens: 20892876800 | elapsed time per iteration (s): 0.75 | learning rate: 1.790E-04 | global batch size: 256 | lm loss: 2.100596E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.856 | TFLOPs: 20.74 | 31: iteration 39860/ 173500 | consumed samples: 10204160 | consumed tokens: 20898119680 | elapsed time per iteration (s): 0.76 | learning rate: 1.790E-04 | global batch size: 256 | lm loss: 2.083341E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.839 | TFLOPs: 20.50 | 31: iteration 39870/ 173500 | consumed samples: 10206720 | consumed tokens: 20903362560 | elapsed time per iteration (s): 0.79 | learning rate: 1.790E-04 | global batch size: 256 | lm loss: 2.095676E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.114 | TFLOPs: 19.73 | 31: iteration 39880/ 173500 | consumed samples: 10209280 | consumed tokens: 20908605440 | elapsed time per iteration (s): 0.79 | learning rate: 1.790E-04 | global batch size: 256 | lm loss: 2.100293E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.507 | TFLOPs: 19.63 | 31: iteration 39890/ 173500 | consumed samples: 10211840 | consumed tokens: 20913848320 | elapsed time per iteration (s): 0.81 | learning rate: 1.790E-04 | global batch size: 256 | lm loss: 2.064041E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.010 | TFLOPs: 19.12 | 31: iteration 39900/ 173500 | consumed samples: 10214400 | consumed tokens: 20919091200 | elapsed time per iteration (s): 0.76 | learning rate: 1.789E-04 | global batch size: 256 | lm loss: 2.068187E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 335.749 | TFLOPs: 20.31 | 31: iteration 39910/ 173500 | consumed samples: 10216960 | consumed tokens: 20924334080 | elapsed time per iteration (s): 0.77 | learning rate: 1.789E-04 | global batch size: 256 | lm loss: 2.081296E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.333 | TFLOPs: 20.23 | 31: iteration 39920/ 173500 | consumed samples: 10219520 | consumed tokens: 20929576960 | elapsed time per iteration (s): 0.80 | learning rate: 1.789E-04 | global batch size: 256 | lm loss: 2.108493E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.579 | TFLOPs: 19.45 | 31: iteration 39930/ 173500 | consumed samples: 10222080 | consumed tokens: 20934819840 | elapsed time per iteration (s): 0.81 | learning rate: 1.789E-04 | global batch size: 256 | lm loss: 2.110160E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.893 | TFLOPs: 19.11 | 31: iteration 39940/ 173500 | consumed samples: 10224640 | consumed tokens: 20940062720 | elapsed time per iteration (s): 0.78 | learning rate: 1.789E-04 | global batch size: 256 | lm loss: 2.072149E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.870 | TFLOPs: 19.77 | 31: iteration 39950/ 173500 | consumed samples: 10227200 | consumed tokens: 20945305600 | elapsed time per iteration (s): 0.71 | learning rate: 1.789E-04 | global batch size: 256 | lm loss: 2.092968E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.593 | TFLOPs: 21.69 | 31: iteration 39960/ 173500 | consumed samples: 10229760 | consumed tokens: 20950548480 | elapsed time per iteration (s): 0.73 | learning rate: 1.789E-04 | global batch 
size: 256 | lm loss: 2.081797E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.380 | TFLOPs: 21.14 | 31: iteration 39970/ 173500 | consumed samples: 10232320 | consumed tokens: 20955791360 | elapsed time per iteration (s): 0.78 | learning rate: 1.789E-04 | global batch size: 256 | lm loss: 2.071504E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.136 | TFLOPs: 19.79 | 31: iteration 39980/ 173500 | consumed samples: 10234880 | consumed tokens: 20961034240 | elapsed time per iteration (s): 0.75 | learning rate: 1.789E-04 | global batch size: 256 | lm loss: 2.078446E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.725 | TFLOPs: 20.67 | 31: iteration 39990/ 173500 | consumed samples: 10237440 | consumed tokens: 20966277120 | elapsed time per iteration (s): 0.76 | learning rate: 1.789E-04 | global batch size: 256 | lm loss: 2.123245E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.352 | TFLOPs: 20.47 | 0: [2022-11-26 03:04:30,928] [INFO] [logging.py:68:log_dist] [Rank 0] step=40000, skipped=0, lr=[0.0001788435118675357, 0.0001788435118675357, 0.0001788435118675357], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 40000/ 173500 | consumed samples: 10240000 | consumed tokens: 20971520000 | elapsed time per iteration (s): 0.76 | learning rate: 1.788E-04 | global batch size: 256 | lm loss: 2.085818E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.268 | TFLOPs: 20.28 | 0: steps: 40000 loss: 2.0071 iter time (s): 0.792 samples/sec: 323.152 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 
40000 | lm loss value: 2.026362E+00 | lm loss PPL: 7.586440E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 40000 to checkpoints_1b1long 0: [2022-11-26 03:04:31,236] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step40000 is begin to save! 0: [2022-11-26 03:04:31,249] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_01-model_00-model_states.pt... 0: [2022-11-26 03:04:31,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_01-model_00-model_states.pt. 0: [2022-11-26 03:04:31,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_03-model_00-model_states.pt... 0: [2022-11-26 03:04:31,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_03-model_00-model_states.pt. 0: [2022-11-26 03:04:31,587] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_04-model_00-model_states.pt... 0: [2022-11-26 03:04:31,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_04-model_00-model_states.pt. 0: [2022-11-26 03:04:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_05-model_00-model_states.pt... 0: [2022-11-26 03:04:31,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_05-model_00-model_states.pt. 0: [2022-11-26 03:04:31,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_06-model_00-model_states.pt... 0: [2022-11-26 03:04:31,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_06-model_00-model_states.pt. 
0: [2022-11-26 03:04:31,832] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_07-model_00-model_states.pt... 0: [2022-11-26 03:04:31,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_07-model_00-model_states.pt. 0: [2022-11-26 03:04:31,906] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_08-model_00-model_states.pt... 0: [2022-11-26 03:04:31,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_08-model_00-model_states.pt. 0: [2022-11-26 03:04:31,982] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_09-model_00-model_states.pt... 0: [2022-11-26 03:04:32,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_09-model_00-model_states.pt. 0: [2022-11-26 03:04:32,059] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_10-model_00-model_states.pt... 0: [2022-11-26 03:04:32,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_10-model_00-model_states.pt. 0: [2022-11-26 03:04:32,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_11-model_00-model_states.pt... 0: [2022-11-26 03:04:32,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_11-model_00-model_states.pt. 0: [2022-11-26 03:04:32,211] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_12-model_00-model_states.pt... 0: [2022-11-26 03:04:32,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_12-model_00-model_states.pt. 
0: [2022-11-26 03:04:32,287] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_13-model_00-model_states.pt... 0: [2022-11-26 03:04:32,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_13-model_00-model_states.pt. 0: [2022-11-26 03:04:32,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_14-model_00-model_states.pt... 0: [2022-11-26 03:04:32,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_14-model_00-model_states.pt. 0: [2022-11-26 03:04:32,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_15-model_00-model_states.pt... 0: [2022-11-26 03:04:32,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_15-model_00-model_states.pt. 0: [2022-11-26 03:04:32,519] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_16-model_00-model_states.pt... 0: [2022-11-26 03:04:32,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_16-model_00-model_states.pt. 0: [2022-11-26 03:04:32,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_17-model_00-model_states.pt... 0: [2022-11-26 03:04:32,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_17-model_00-model_states.pt. 0: [2022-11-26 03:04:32,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_18-model_00-model_states.pt... 0: [2022-11-26 03:04:32,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_18-model_00-model_states.pt. 
0: [2022-11-26 03:04:32,743] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_19-model_00-model_states.pt... 0: [2022-11-26 03:04:32,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_19-model_00-model_states.pt. 0: [2022-11-26 03:04:32,822] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_20-model_00-model_states.pt... 0: [2022-11-26 03:04:32,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_20-model_00-model_states.pt. 0: [2022-11-26 03:04:32,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_21-model_00-model_states.pt... 0: [2022-11-26 03:04:32,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_21-model_00-model_states.pt. 0: [2022-11-26 03:04:32,978] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_22-model_00-model_states.pt... 0: [2022-11-26 03:04:33,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_22-model_00-model_states.pt. 0: [2022-11-26 03:04:33,055] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_23-model_00-model_states.pt... 0: [2022-11-26 03:04:33,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_23-model_00-model_states.pt. 0: [2022-11-26 03:04:33,127] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_24-model_00-model_states.pt... 0: [2022-11-26 03:04:33,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_24-model_00-model_states.pt. 
0: [2022-11-26 03:04:33,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_25-model_00-model_states.pt... 0: [2022-11-26 03:04:33,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_25-model_00-model_states.pt. 0: [2022-11-26 03:04:33,279] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_26-model_00-model_states.pt... 0: [2022-11-26 03:04:33,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_26-model_00-model_states.pt. 0: [2022-11-26 03:04:33,354] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_27-model_00-model_states.pt... 0: [2022-11-26 03:04:33,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_27-model_00-model_states.pt. 0: [2022-11-26 03:04:33,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_28-model_00-model_states.pt... 0: [2022-11-26 03:04:33,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_28-model_00-model_states.pt. 0: [2022-11-26 03:04:33,507] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/layer_30-model_00-model_states.pt... 0: [2022-11-26 03:04:33,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/layer_30-model_00-model_states.pt. 0: [2022-11-26 03:04:33,509] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step40000/mp_rank_00_model_states.pt 0: [2022-11-26 03:04:33,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/mp_rank_00_model_states.pt... 
0: [2022-11-26 03:04:33,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/mp_rank_00_model_states.pt. 0: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 
4: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 
1: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 
2: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
15: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 
28: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 
31: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 
18: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 
5: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 
16: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 
20: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
11: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 
29: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 
26: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
7: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
13: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 
11: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 
17: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
13: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 
24: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
0: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 
0: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
4: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 
8: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:04:33,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:04:33,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:04:33,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 03:04:33,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 18: [2022-11-26 03:04:33,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:04:33,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 03:04:33,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 3: [2022-11-26 03:04:33,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 03:04:33,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 03:04:33,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 11: [2022-11-26 03:04:33,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:04:33,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:04:33,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 7: [2022-11-26 03:04:33,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 11: [2022-11-26 03:04:33,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 7: [2022-11-26 03:04:33,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 20: [2022-11-26 03:04:33,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:04:33,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 03:04:33,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 20: [2022-11-26 03:04:33,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-26 03:04:33,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 03:04:33,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 27: [2022-11-26 03:04:33,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:04:33,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 03:04:33,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 28: [2022-11-26 03:04:33,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:04:33,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:04:33,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 03:04:33,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 2: [2022-11-26 03:04:33,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:04:33,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 03:04:33,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
4: [2022-11-26 03:04:33,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:04:33,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 03:04:33,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 28: [2022-11-26 03:04:33,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 03:04:33,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 28: [2022-11-26 03:04:33,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:04:33,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 03:04:33,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 14: [2022-11-26 03:04:33,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:04:33,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 03:04:33,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 24: [2022-11-26 03:04:33,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
26: [2022-11-26 03:04:33,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:04:33,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 03:04:33,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 26: [2022-11-26 03:04:33,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 03:04:33,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 6: [2022-11-26 03:04:33,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:04:33,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:04:33,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:04:33,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 10: [2022-11-26 03:04:33,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
30: [2022-11-26 03:04:33,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 21: [2022-11-26 03:04:33,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:04:33,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 10: [2022-11-26 03:04:33,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 3: [2022-11-26 03:04:33,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:04:33,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 21: [2022-11-26 03:04:33,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 6: [2022-11-26 03:04:33,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 10: [2022-11-26 03:04:33,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 3: [2022-11-26 03:04:33,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 21: [2022-11-26 03:04:33,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 6: [2022-11-26 03:04:33,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
3: [2022-11-26 03:04:33,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 0: [2022-11-26 03:04:33,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:04:33,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 03:04:33,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 4: [2022-11-26 03:04:33,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:04:33,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 03:04:33,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 30: [2022-11-26 03:04:33,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:04:33,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 18: [2022-11-26 03:04:33,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:04:33,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 03:04:33,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
16: [2022-11-26 03:04:33,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:04:33,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 16: [2022-11-26 03:04:33,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 03:04:33,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:04:33,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 16: [2022-11-26 03:04:33,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 03:04:33,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 12: [2022-11-26 03:04:33,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:04:33,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:04:33,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 10: [2022-11-26 03:04:33,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 03:04:33,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
12: [2022-11-26 03:04:33,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 11: [2022-11-26 03:04:33,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:04:33,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 03:04:33,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 2: [2022-11-26 03:04:33,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:04:33,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 03:04:33,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 0: [2022-11-26 03:04:33,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:04:33,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:04:33,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 03:04:33,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 24: [2022-11-26 03:04:33,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-26 03:04:33,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 29: [2022-11-26 03:04:33,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:04:33,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 7: [2022-11-26 03:04:33,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:04:33,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 7: [2022-11-26 03:04:33,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 24: [2022-11-26 03:04:33,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 22: [2022-11-26 03:04:33,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:04:33,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 22: [2022-11-26 03:04:33,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 03:04:33,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 22: [2022-11-26 03:04:33,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
22: [2022-11-26 03:04:33,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 03:04:33,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 2: [2022-11-26 03:04:33,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:04:33,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:04:33,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 22: [2022-11-26 03:04:33,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 2: [2022-11-26 03:04:33,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 22: [2022-11-26 03:04:33,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 8: [2022-11-26 03:04:33,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:04:33,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:04:33,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 03:04:33,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
8: [2022-11-26 03:04:33,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:04:33,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 03:04:33,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 03:04:33,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 4: [2022-11-26 03:04:33,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:04:33,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 4: [2022-11-26 03:04:33,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 13: [2022-11-26 03:04:33,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:04:33,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 13: [2022-11-26 03:04:33,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 03:04:33,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 14: [2022-11-26 03:04:33,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 03:04:33,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 03:04:33,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 31: [2022-11-26 03:04:33,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:04:33,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:04:33,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 0: [2022-11-26 03:04:33,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:04:33,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 31: [2022-11-26 03:04:33,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 0: [2022-11-26 03:04:33,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 10: [2022-11-26 03:04:33,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:04:33,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 03:04:33,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
25: [2022-11-26 03:04:33,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:04:33,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:04:33,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 20: [2022-11-26 03:04:33,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 25: [2022-11-26 03:04:33,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 1: [2022-11-26 03:04:33,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:04:33,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 26: [2022-11-26 03:04:33,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:04:33,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 1: [2022-11-26 03:04:33,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 26: [2022-11-26 03:04:33,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 1: [2022-11-26 03:04:33,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
1: [2022-11-26 03:04:33,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:04:33,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:04:33,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 03:04:33,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 03:04:33,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 1: [2022-11-26 03:04:33,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 31: [2022-11-26 03:04:33,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:04:33,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 17: [2022-11-26 03:04:33,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:04:33,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:04:33,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
3: [2022-11-26 03:04:33,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 17: [2022-11-26 03:04:33,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 3: [2022-11-26 03:04:33,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 17: [2022-11-26 03:04:33,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 17: [2022-11-26 03:04:33,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:04:33,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 8: [2022-11-26 03:04:33,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:04:33,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 8: [2022-11-26 03:04:33,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 03:04:33,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 17: [2022-11-26 03:04:33,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 
17: [2022-11-26 03:04:33,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 03:04:33,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 18: [2022-11-26 03:04:33,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:04:33,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 03:04:33,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 17: [2022-11-26 03:04:33,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:04:33,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 03:04:33,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 27: [2022-11-26 03:04:33,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:04:33,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 03:04:33,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 21: [2022-11-26 03:04:33,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
21: [2022-11-26 03:04:33,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 03:04:33,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 5: [2022-11-26 03:04:33,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:04:33,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:04:33,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 03:04:33,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 03:04:33,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 5: [2022-11-26 03:04:33,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 21: [2022-11-26 03:04:33,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:04:33,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 03:04:33,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 14: [2022-11-26 03:04:33,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-26 03:04:33,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 2: [2022-11-26 03:04:33,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:04:33,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 2: [2022-11-26 03:04:33,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 03:04:33,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 29: [2022-11-26 03:04:33,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:04:33,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 03:04:33,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 7: [2022-11-26 03:04:33,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:04:33,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 03:04:33,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
28: [2022-11-26 03:04:33,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 03:04:33,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 28: [2022-11-26 03:04:33,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:04:33,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 03:04:33,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 30: [2022-11-26 03:04:33,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:04:33,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 4: [2022-11-26 03:04:33,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:04:33,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 6: [2022-11-26 03:04:33,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
4: [2022-11-26 03:04:33,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 6: [2022-11-26 03:04:33,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:04:33,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 6: [2022-11-26 03:04:33,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 03:04:33,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 03:04:33,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 1: [2022-11-26 03:04:33,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:04:33,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 1: [2022-11-26 03:04:33,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 03:04:33,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 5: [2022-11-26 03:04:33,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:04:33,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
13: [2022-11-26 03:04:33,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:04:33,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 8: [2022-11-26 03:04:33,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 13: [2022-11-26 03:04:33,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 5: [2022-11-26 03:04:33,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 8: [2022-11-26 03:04:33,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 13: [2022-11-26 03:04:33,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 12: [2022-11-26 03:04:33,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:04:33,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 03:04:33,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 7: [2022-11-26 03:04:33,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 03:04:33,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 03:04:33,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 25: [2022-11-26 03:04:33,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:04:33,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:04:33,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 29: [2022-11-26 03:04:33,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 25: [2022-11-26 03:04:33,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 29: [2022-11-26 03:04:33,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 18: [2022-11-26 03:04:33,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:04:33,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 03:04:33,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 24: [2022-11-26 03:04:33,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
24: [2022-11-26 03:04:33,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:04:33,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 03:04:33,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 03:04:33,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 24: [2022-11-26 03:04:33,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 31: [2022-11-26 03:04:33,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:04:33,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:04:33,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 26: [2022-11-26 03:04:33,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 31: [2022-11-26 03:04:33,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 26: [2022-11-26 03:04:33,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 12: [2022-11-26 03:04:33,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 03:04:33,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 16: [2022-11-26 03:04:33,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:04:33,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 16: [2022-11-26 03:04:33,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 03:04:33,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 23: [2022-11-26 03:04:33,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:04:33,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:04:33,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:04:33,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
23: [2022-11-26 03:04:33,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 03:04:33,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 03:04:33,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 03:04:33,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 03:04:33,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 23: [2022-11-26 03:04:33,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 23: [2022-11-26 03:04:33,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 23: [2022-11-26 03:04:33,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 27: [2022-11-26 03:04:33,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:04:33,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 16: [2022-11-26 03:04:33,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:04:33,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
16: [2022-11-26 03:04:33,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 03:04:33,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 27: [2022-11-26 03:04:33,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:04:33,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 20: [2022-11-26 03:04:33,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:04:33,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 10: [2022-11-26 03:04:33,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:04:33,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 26: [2022-11-26 03:04:33,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:04:33,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
26: [2022-11-26 03:04:33,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 10: [2022-11-26 03:04:33,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 26: [2022-11-26 03:04:33,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 10: [2022-11-26 03:04:33,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 22: [2022-11-26 03:04:33,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:04:33,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 03:04:33,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 11: [2022-11-26 03:04:33,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:04:33,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:04:33,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 11: [2022-11-26 03:04:33,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 03:04:33,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
11: [2022-11-26 03:04:33,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:04:33,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 03:04:33,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 21: [2022-11-26 03:04:33,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 1: [2022-11-26 03:04:33,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:04:33,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 03:04:33,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 3: [2022-11-26 03:04:33,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:04:33,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 03:04:33,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 9: [2022-11-26 03:04:33,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:04:33,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-26 03:04:33,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:04:33,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:04:33,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 03:04:33,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 03:04:33,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 03:04:33,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 03:04:33,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 9: [2022-11-26 03:04:33,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 9: [2022-11-26 03:04:33,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 9: [2022-11-26 03:04:33,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 31: [2022-11-26 03:04:33,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-26 03:04:33,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 03:04:33,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 20: [2022-11-26 03:04:33,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:04:33,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 03:04:33,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 6: [2022-11-26 03:04:33,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:04:33,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 03:04:33,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 25: [2022-11-26 03:04:33,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:04:33,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:04:33,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 03:04:33,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
25: [2022-11-26 03:04:33,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 0: [2022-11-26 03:04:33,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:04:33,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 0: [2022-11-26 03:04:33,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 03:04:33,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 0: [2022-11-26 03:04:33,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 03:04:33,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 9: [2022-11-26 03:04:33,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:04:33,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 03:04:33,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 10: [2022-11-26 03:04:33,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 03:04:33,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 03:04:33,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 13: [2022-11-26 03:04:33,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:04:33,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 03:04:33,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 13: [2022-11-26 03:04:33,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:04:33,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 03:04:33,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 19: [2022-11-26 03:04:33,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:04:33,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:04:33,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:04:33,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
19: [2022-11-26 03:04:33,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 03:04:33,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 03:04:33,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 19: [2022-11-26 03:04:33,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 03:04:33,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 19: [2022-11-26 03:04:33,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 03:04:33,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 19: [2022-11-26 03:04:33,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 15: [2022-11-26 03:04:33,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:04:33,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:04:33,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:04:33,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 03:04:33,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:04:33,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 03:04:33,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 03:04:33,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 03:04:33,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 03:04:33,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 03:04:33,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 15: [2022-11-26 03:04:33,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 15: [2022-11-26 03:04:33,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 15: [2022-11-26 03:04:33,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 15: [2022-11-26 03:04:33,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 5: [2022-11-26 03:04:33,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-26 03:04:33,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 03:04:33,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 21: [2022-11-26 03:04:33,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:04:33,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 03:04:33,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 2: [2022-11-26 03:04:33,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:04:33,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 03:04:33,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 23: [2022-11-26 03:04:33,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:04:33,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 03:04:33,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 22: [2022-11-26 03:04:33,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
22: [2022-11-26 03:04:33,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 03:04:33,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 0: [2022-11-26 03:04:33,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:04:33,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 03:04:33,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 14: [2022-11-26 03:04:33,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:04:33,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 03:04:33,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 17: [2022-11-26 03:04:33,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:04:33,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 03:04:33,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 18: [2022-11-26 03:04:33,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
18: [2022-11-26 03:04:33,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 03:04:33,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 4: [2022-11-26 03:04:33,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:04:33,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 03:04:33,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 28: [2022-11-26 03:04:33,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:04:33,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 03:04:33,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 30: [2022-11-26 03:04:33,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:04:33,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 03:04:33,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 27: [2022-11-26 03:04:33,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-26 03:04:33,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 03:04:33,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 11: [2022-11-26 03:04:33,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:04:33,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 03:04:33,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 7: [2022-11-26 03:04:33,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:04:33,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 03:04:33,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 13: [2022-11-26 03:04:33,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:04:33,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 03:04:33,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 16: [2022-11-26 03:04:33,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
16: [2022-11-26 03:04:33,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 03:04:33,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 3: [2022-11-26 03:04:33,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:04:33,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 03:04:33,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 24: [2022-11-26 03:04:33,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:04:33,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 03:04:33,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 29: [2022-11-26 03:04:33,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:04:33,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 03:04:33,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 19: [2022-11-26 03:04:33,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-26 03:04:33,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 03:04:33,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 31: [2022-11-26 03:04:33,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:04:33,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 03:04:33,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 12: [2022-11-26 03:04:33,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:04:33,885] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 03:04:33,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 20: [2022-11-26 03:04:33,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:04:33,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 03:04:33,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 26: [2022-11-26 03:04:33,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
26: [2022-11-26 03:04:33,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 03:04:33,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 8: [2022-11-26 03:04:33,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:04:33,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 03:04:33,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 25: [2022-11-26 03:04:33,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:04:33,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 03:04:33,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 5: [2022-11-26 03:04:33,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:04:33,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:04:33,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 03:04:33,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
2: [2022-11-26 03:04:33,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 03:04:33,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 9: [2022-11-26 03:04:33,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:04:33,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 03:04:33,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 22: [2022-11-26 03:04:33,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:04:33,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 03:04:33,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 14: [2022-11-26 03:04:33,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:04:33,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 03:04:33,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 21: [2022-11-26 03:04:33,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-26 03:04:33,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 03:04:33,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 1: [2022-11-26 03:04:33,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:04:33,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 03:04:33,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 18: [2022-11-26 03:04:33,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:04:33,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 03:04:33,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 15: [2022-11-26 03:04:33,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:04:33,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 23: [2022-11-26 03:04:33,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:04:33,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
23: [2022-11-26 03:04:33,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 03:04:33,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 0: [2022-11-26 03:04:33,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:04:33,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 03:04:33,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 4: [2022-11-26 03:04:33,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:04:33,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 03:04:33,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 17: [2022-11-26 03:04:33,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:04:33,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 03:04:33,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 28: [2022-11-26 03:04:33,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
28: [2022-11-26 03:04:33,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 03:04:33,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 30: [2022-11-26 03:04:33,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:04:33,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:04:33,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:04:33,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 03:04:33,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 7: [2022-11-26 03:04:33,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 13: [2022-11-26 03:04:33,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
11: [2022-11-26 03:04:33,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 13: [2022-11-26 03:04:33,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 11: [2022-11-26 03:04:33,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 7: [2022-11-26 03:04:33,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 13: [2022-11-26 03:04:33,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 27: [2022-11-26 03:04:33,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:04:33,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 03:04:33,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 16: [2022-11-26 03:04:33,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:04:33,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 03:04:33,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 29: [2022-11-26 03:04:33,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
29: [2022-11-26 03:04:33,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 03:04:33,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 19: [2022-11-26 03:04:33,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:04:33,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 03:04:33,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 24: [2022-11-26 03:04:33,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:04:33,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 03:04:33,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 3: [2022-11-26 03:04:33,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:04:33,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 03:04:33,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 6: [2022-11-26 03:04:33,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 03:04:33,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 03:04:33,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 25: [2022-11-26 03:04:33,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:04:33,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:04:33,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 26: [2022-11-26 03:04:33,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:04:33,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 25: [2022-11-26 03:04:33,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 20: [2022-11-26 03:04:33,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 26: [2022-11-26 03:04:33,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 03:04:33,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 10: [2022-11-26 03:04:33,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-26 03:04:33,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 03:04:33,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 21: [2022-11-26 03:04:33,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:04:33,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 03:04:33,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 31: [2022-11-26 03:04:33,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:04:33,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 03:04:33,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 23: [2022-11-26 03:04:33,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:04:33,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 03:04:33,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 22: [2022-11-26 03:04:33,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
22: [2022-11-26 03:04:33,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 03:04:33,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 15: [2022-11-26 03:04:33,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:04:33,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 03:04:33,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 14: [2022-11-26 03:04:33,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:04:33,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 03:04:33,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 12: [2022-11-26 03:04:33,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:04:33,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 03:04:33,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 2: [2022-11-26 03:04:33,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 03:04:33,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 03:04:33,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 9: [2022-11-26 03:04:33,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:04:33,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:04:33,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 9: [2022-11-26 03:04:33,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 17: [2022-11-26 03:04:33,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 6: [2022-11-26 03:04:33,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:04:33,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 6: [2022-11-26 03:04:33,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 03:04:33,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 5: [2022-11-26 03:04:33,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
4: [2022-11-26 03:04:33,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:04:33,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 4: [2022-11-26 03:04:33,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 5: [2022-11-26 03:04:33,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 4: [2022-11-26 03:04:33,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 10: [2022-11-26 03:04:33,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:04:33,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 03:04:33,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 8: [2022-11-26 03:04:33,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:04:33,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 03:04:33,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 0: [2022-11-26 03:04:33,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-26 03:04:33,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 03:04:33,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 1: [2022-11-26 03:04:33,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:04:33,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 03:04:33,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 18: [2022-11-26 03:04:33,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:04:33,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 03:04:33,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 11: [2022-11-26 03:04:33,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:04:33,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 03:04:33,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 3: [2022-11-26 03:04:33,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 03:04:33,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 03:04:33,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 16: [2022-11-26 03:04:33,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:04:33,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:04:33,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 16: [2022-11-26 03:04:33,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 03:04:33,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 28: [2022-11-26 03:04:33,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:04:33,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 03:04:33,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 13: [2022-11-26 03:04:33,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 29: [2022-11-26 03:04:33,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 
29: [2022-11-26 03:04:33,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 03:04:33,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 7: [2022-11-26 03:04:33,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:04:33,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 03:04:33,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 27: [2022-11-26 03:04:33,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:04:33,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 03:04:33,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 19: [2022-11-26 03:04:33,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:04:33,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 03:04:33,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 30: [2022-11-26 03:04:33,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-26 03:04:33,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 03:04:33,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 26: [2022-11-26 03:04:33,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:04:33,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 03:04:33,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 21: [2022-11-26 03:04:33,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:04:33,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 03:04:33,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 24: [2022-11-26 03:04:33,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:04:33,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 03:04:33,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 2: [2022-11-26 03:04:33,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 03:04:33,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 03:04:33,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 22: [2022-11-26 03:04:33,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:04:33,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 03:04:33,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 31: [2022-11-26 03:04:33,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:04:33,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 03:04:33,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 20: [2022-11-26 03:04:33,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:04:33,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 03:04:33,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 12: [2022-11-26 03:04:33,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 03:04:33,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 03:04:33,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 8: [2022-11-26 03:04:33,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:04:33,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 03:04:33,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 9: [2022-11-26 03:04:33,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:04:33,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 03:04:33,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 10: [2022-11-26 03:04:33,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:04:33,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 03:04:33,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 15: [2022-11-26 03:04:33,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-26 03:04:33,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 03:04:33,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 1: [2022-11-26 03:04:33,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:04:33,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 03:04:33,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 0: [2022-11-26 03:04:33,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:04:33,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 03:04:33,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 26: [2022-11-26 03:04:33,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:04:33,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 03:04:33,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 18: [2022-11-26 03:04:33,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
18: [2022-11-26 03:04:33,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 03:04:33,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 23: [2022-11-26 03:04:33,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:04:33,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 03:04:33,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 3: [2022-11-26 03:04:33,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:04:33,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 03:04:33,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 17: [2022-11-26 03:04:33,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:04:33,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 03:04:33,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 11: [2022-11-26 03:04:33,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
31: [2022-11-26 03:04:33,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:04:33,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:04:33,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 24: [2022-11-26 03:04:33,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:04:33,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 03:04:33,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 11: [2022-11-26 03:04:33,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 28: [2022-11-26 03:04:33,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 24: [2022-11-26 03:04:33,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 28: [2022-11-26 03:04:33,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 24: [2022-11-26 03:04:33,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 30: [2022-11-26 03:04:33,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
30: [2022-11-26 03:04:33,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 03:04:33,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 7: [2022-11-26 03:04:33,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:04:33,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 03:04:33,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 13: [2022-11-26 03:04:33,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:04:33,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:04:33,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:04:33,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 4: [2022-11-26 03:04:33,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 03:04:33,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 13: [2022-11-26 03:04:33,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
25: [2022-11-26 03:04:33,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 03:04:33,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 5: [2022-11-26 03:04:33,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:04:33,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:04:33,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 03:04:33,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 5: [2022-11-26 03:04:33,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 03:04:33,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 14: [2022-11-26 03:04:33,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:04:33,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 03:04:33,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 29: [2022-11-26 03:04:33,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-26 03:04:33,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 03:04:33,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 6: [2022-11-26 03:04:33,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:04:33,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 03:04:33,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 19: [2022-11-26 03:04:33,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:04:33,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 03:04:33,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 27: [2022-11-26 03:04:33,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:04:33,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 03:04:33,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 8: [2022-11-26 03:04:33,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 03:04:33,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 16: [2022-11-26 03:04:33,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:04:33,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 16: [2022-11-26 03:04:33,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 03:04:33,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 5: [2022-11-26 03:04:33,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:04:33,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 03:04:33,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 12: [2022-11-26 03:04:33,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:04:33,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step40000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 03:04:33,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
0: successfully saved checkpoint at iteration 40000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2796.71 31: iteration 40010/ 173500 | consumed samples: 10242560 | consumed tokens: 20976762880 | elapsed time per iteration (s): 1.06 | learning rate: 1.788E-04 | global batch size: 256 | lm loss: 2.120472E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.666 | TFLOPs: 14.62 | 31: iteration 40020/ 173500 | consumed samples: 10245120 | consumed tokens: 20982005760 | elapsed time per iteration (s): 0.76 | learning rate: 1.788E-04 | global batch size: 256 | lm loss: 2.117921E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.753 | TFLOPs: 20.25 | 31: iteration 40030/ 173500 | consumed samples: 10247680 | consumed tokens: 20987248640 | elapsed time per iteration (s): 0.74 | learning rate: 1.788E-04 | global batch size: 256 | lm loss: 2.115165E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.789 | TFLOPs: 20.80 | 31: iteration 40040/ 173500 | consumed samples: 10250240 | consumed tokens: 20992491520 | elapsed time per iteration (s): 0.73 | learning rate: 1.788E-04 | global batch size: 256 | lm loss: 2.096755E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.609 | TFLOPs: 21.09 | 31: iteration 40050/ 173500 | consumed samples: 10252800 | consumed tokens: 20997734400 | elapsed time per iteration (s): 0.74 | learning rate: 1.788E-04 | global batch size: 256 | lm loss: 2.105995E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.218 | TFLOPs: 20.95 | 31: iteration 40060/ 173500 | consumed samples: 10255360 | consumed tokens: 21002977280 | elapsed time per iteration (s): 0.78 | 
learning rate: 1.788E-04 | global batch size: 256 | lm loss: 2.092771E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.810 | TFLOPs: 19.95 | 31: iteration 40070/ 173500 | consumed samples: 10257920 | consumed tokens: 21008220160 | elapsed time per iteration (s): 0.82 | learning rate: 1.788E-04 | global batch size: 256 | lm loss: 2.085060E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.519 | TFLOPs: 18.91 | 31: iteration 40080/ 173500 | consumed samples: 10260480 | consumed tokens: 21013463040 | elapsed time per iteration (s): 0.80 | learning rate: 1.788E-04 | global batch size: 256 | lm loss: 2.072301E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.839 | TFLOPs: 19.35 | 31: iteration 40090/ 173500 | consumed samples: 10263040 | consumed tokens: 21018705920 | elapsed time per iteration (s): 0.83 | learning rate: 1.787E-04 | global batch size: 256 | lm loss: 2.133591E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.834 | TFLOPs: 18.74 | 31: iteration 40100/ 173500 | consumed samples: 10265600 | consumed tokens: 21023948800 | elapsed time per iteration (s): 0.85 | learning rate: 1.787E-04 | global batch size: 256 | lm loss: 2.080384E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.591 | TFLOPs: 18.25 | 31: iteration 40110/ 173500 | consumed samples: 10268160 | consumed tokens: 21029191680 | elapsed time per iteration (s): 0.80 | learning rate: 1.787E-04 | global batch size: 256 | lm loss: 2.106290E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.758 | TFLOPs: 19.28 | 31: iteration 40120/ 
173500 | consumed samples: 10270720 | consumed tokens: 21034434560 | elapsed time per iteration (s): 0.81 | learning rate: 1.787E-04 | global batch size: 256 | lm loss: 2.085621E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.734 | TFLOPs: 19.04 | 31: iteration 40130/ 173500 | consumed samples: 10273280 | consumed tokens: 21039677440 | elapsed time per iteration (s): 0.79 | learning rate: 1.787E-04 | global batch size: 256 | lm loss: 2.082201E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.947 | TFLOPs: 19.72 | 31: iteration 40140/ 173500 | consumed samples: 10275840 | consumed tokens: 21044920320 | elapsed time per iteration (s): 0.78 | learning rate: 1.787E-04 | global batch size: 256 | lm loss: 2.074551E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.319 | TFLOPs: 19.74 | 31: iteration 40150/ 173500 | consumed samples: 10278400 | consumed tokens: 21050163200 | elapsed time per iteration (s): 0.81 | learning rate: 1.787E-04 | global batch size: 256 | lm loss: 2.083570E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.996 | TFLOPs: 19.24 | 31: iteration 40160/ 173500 | consumed samples: 10280960 | consumed tokens: 21055406080 | elapsed time per iteration (s): 0.82 | learning rate: 1.787E-04 | global batch size: 256 | lm loss: 2.080473E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.084 | TFLOPs: 18.88 | 31: iteration 40170/ 173500 | consumed samples: 10283520 | consumed tokens: 21060648960 | elapsed time per iteration (s): 0.78 | learning rate: 1.787E-04 | global batch size: 256 | lm loss: 2.078301E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 329.999 | TFLOPs: 19.96 | 31: iteration 40180/ 173500 | consumed samples: 10286080 | consumed tokens: 21065891840 | elapsed time per iteration (s): 0.81 | learning rate: 1.787E-04 | global batch size: 256 | lm loss: 2.113692E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.342 | TFLOPs: 19.20 | 31: iteration 40190/ 173500 | consumed samples: 10288640 | consumed tokens: 21071134720 | elapsed time per iteration (s): 0.81 | learning rate: 1.786E-04 | global batch size: 256 | lm loss: 2.090006E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.935 | TFLOPs: 19.05 | 31: iteration 40200/ 173500 | consumed samples: 10291200 | consumed tokens: 21076377600 | elapsed time per iteration (s): 0.92 | learning rate: 1.786E-04 | global batch size: 256 | lm loss: 2.104587E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 276.893 | TFLOPs: 16.75 | 31: iteration 40210/ 173500 | consumed samples: 10293760 | consumed tokens: 21081620480 | elapsed time per iteration (s): 0.74 | learning rate: 1.786E-04 | global batch size: 256 | lm loss: 2.122782E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.647 | TFLOPs: 20.79 | 31: iteration 40220/ 173500 | consumed samples: 10296320 | consumed tokens: 21086863360 | elapsed time per iteration (s): 0.78 | learning rate: 1.786E-04 | global batch size: 256 | lm loss: 2.077642E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.465 | TFLOPs: 19.75 | 31: iteration 40230/ 173500 | consumed samples: 10298880 | consumed tokens: 21092106240 | elapsed time per iteration (s): 0.80 | learning rate: 
1.786E-04 | global batch size: 256 | lm loss: 2.071016E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.721 | TFLOPs: 19.46 | 31: iteration 40240/ 173500 | consumed samples: 10301440 | consumed tokens: 21097349120 | elapsed time per iteration (s): 0.77 | learning rate: 1.786E-04 | global batch size: 256 | lm loss: 2.100124E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.932 | TFLOPs: 20.20 | 31: iteration 40250/ 173500 | consumed samples: 10304000 | consumed tokens: 21102592000 | elapsed time per iteration (s): 0.80 | learning rate: 1.786E-04 | global batch size: 256 | lm loss: 2.111145E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.808 | TFLOPs: 19.35 | 31: iteration 40260/ 173500 | consumed samples: 10306560 | consumed tokens: 21107834880 | elapsed time per iteration (s): 0.84 | learning rate: 1.786E-04 | global batch size: 256 | lm loss: 2.079347E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.078 | TFLOPs: 18.34 | 31: iteration 40270/ 173500 | consumed samples: 10309120 | consumed tokens: 21113077760 | elapsed time per iteration (s): 0.77 | learning rate: 1.786E-04 | global batch size: 256 | lm loss: 2.087241E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.194 | TFLOPs: 20.10 | 31: iteration 40280/ 173500 | consumed samples: 10311680 | consumed tokens: 21118320640 | elapsed time per iteration (s): 0.76 | learning rate: 1.785E-04 | global batch size: 256 | lm loss: 2.087930E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.509 | TFLOPs: 20.42 | 31: iteration 40290/ 173500 | 
consumed samples: 10314240 | consumed tokens: 21123563520 | elapsed time per iteration (s): 0.77 | learning rate: 1.785E-04 | global batch size: 256 | lm loss: 2.095266E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.464 | TFLOPs: 20.11 | 31: iteration 40300/ 173500 | consumed samples: 10316800 | consumed tokens: 21128806400 | elapsed time per iteration (s): 0.80 | learning rate: 1.785E-04 | global batch size: 256 | lm loss: 2.090898E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.334 | TFLOPs: 19.32 | 31: iteration 40310/ 173500 | consumed samples: 10319360 | consumed tokens: 21134049280 | elapsed time per iteration (s): 0.85 | learning rate: 1.785E-04 | global batch size: 256 | lm loss: 2.098726E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.889 | TFLOPs: 18.14 | 31: iteration 40320/ 173500 | consumed samples: 10321920 | consumed tokens: 21139292160 | elapsed time per iteration (s): 0.77 | learning rate: 1.785E-04 | global batch size: 256 | lm loss: 2.107467E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.094 | TFLOPs: 20.03 | 31: iteration 40330/ 173500 | consumed samples: 10324480 | consumed tokens: 21144535040 | elapsed time per iteration (s): 0.75 | learning rate: 1.785E-04 | global batch size: 256 | lm loss: 2.113687E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.572 | TFLOPs: 20.54 | 31: iteration 40340/ 173500 | consumed samples: 10327040 | consumed tokens: 21149777920 | elapsed time per iteration (s): 0.83 | learning rate: 1.785E-04 | global batch size: 256 | lm loss: 2.082670E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 308.333 | TFLOPs: 18.65 | 31: iteration 40350/ 173500 | consumed samples: 10329600 | consumed tokens: 21155020800 | elapsed time per iteration (s): 0.78 | learning rate: 1.785E-04 | global batch size: 256 | lm loss: 2.086251E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.027 | TFLOPs: 19.84 | 31: iteration 40360/ 173500 | consumed samples: 10332160 | consumed tokens: 21160263680 | elapsed time per iteration (s): 0.77 | learning rate: 1.785E-04 | global batch size: 256 | lm loss: 2.095583E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.730 | TFLOPs: 20.13 | 31: iteration 40370/ 173500 | consumed samples: 10334720 | consumed tokens: 21165506560 | elapsed time per iteration (s): 0.78 | learning rate: 1.784E-04 | global batch size: 256 | lm loss: 2.103109E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.457 | TFLOPs: 19.81 | 31: iteration 40380/ 173500 | consumed samples: 10337280 | consumed tokens: 21170749440 | elapsed time per iteration (s): 0.75 | learning rate: 1.784E-04 | global batch size: 256 | lm loss: 2.072929E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.176 | TFLOPs: 20.64 | 31: iteration 40390/ 173500 | consumed samples: 10339840 | consumed tokens: 21175992320 | elapsed time per iteration (s): 0.84 | learning rate: 1.784E-04 | global batch size: 256 | lm loss: 2.099460E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.540 | TFLOPs: 18.42 | 31: iteration 40400/ 173500 | consumed samples: 10342400 | consumed tokens: 21181235200 | elapsed time per iteration (s): 0.83 | learning rate: 1.784E-04 | global batch 
size: 256 | lm loss: 2.094989E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.298 | TFLOPs: 18.65 | 31: iteration 40410/ 173500 | consumed samples: 10344960 | consumed tokens: 21186478080 | elapsed time per iteration (s): 0.76 | learning rate: 1.784E-04 | global batch size: 256 | lm loss: 2.097380E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.998 | TFLOPs: 20.39 | 31: iteration 40420/ 173500 | consumed samples: 10347520 | consumed tokens: 21191720960 | elapsed time per iteration (s): 0.79 | learning rate: 1.784E-04 | global batch size: 256 | lm loss: 2.105777E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.671 | TFLOPs: 19.58 | 31: iteration 40430/ 173500 | consumed samples: 10350080 | consumed tokens: 21196963840 | elapsed time per iteration (s): 0.82 | learning rate: 1.784E-04 | global batch size: 256 | lm loss: 2.108704E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.400 | TFLOPs: 18.96 | 31: iteration 40440/ 173500 | consumed samples: 10352640 | consumed tokens: 21202206720 | elapsed time per iteration (s): 0.79 | learning rate: 1.784E-04 | global batch size: 256 | lm loss: 2.087350E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.647 | TFLOPs: 19.52 | 31: iteration 40450/ 173500 | consumed samples: 10355200 | consumed tokens: 21207449600 | elapsed time per iteration (s): 0.79 | learning rate: 1.784E-04 | global batch size: 256 | lm loss: 2.050299E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.806 | TFLOPs: 19.71 | 31: iteration 40460/ 173500 | consumed samples: 10357760 | 
consumed tokens: 21212692480 | elapsed time per iteration (s): 0.81 | learning rate: 1.784E-04 | global batch size: 256 | lm loss: 2.083084E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.569 | TFLOPs: 19.21 | 31: iteration 40470/ 173500 | consumed samples: 10360320 | consumed tokens: 21217935360 | elapsed time per iteration (s): 0.80 | learning rate: 1.783E-04 | global batch size: 256 | lm loss: 2.095518E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.380 | TFLOPs: 19.32 | 31: iteration 40480/ 173500 | consumed samples: 10362880 | consumed tokens: 21223178240 | elapsed time per iteration (s): 0.80 | learning rate: 1.783E-04 | global batch size: 256 | lm loss: 2.139044E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.654 | TFLOPs: 19.34 | 31: iteration 40490/ 173500 | consumed samples: 10365440 | consumed tokens: 21228421120 | elapsed time per iteration (s): 0.71 | learning rate: 1.783E-04 | global batch size: 256 | lm loss: 2.083493E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.634 | TFLOPs: 21.70 | 31: iteration 40500/ 173500 | consumed samples: 10368000 | consumed tokens: 21233664000 | elapsed time per iteration (s): 0.77 | learning rate: 1.783E-04 | global batch size: 256 | lm loss: 2.100051E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.824 | TFLOPs: 20.14 | 31: iteration 40510/ 173500 | consumed samples: 10370560 | consumed tokens: 21238906880 | elapsed time per iteration (s): 0.75 | learning rate: 1.783E-04 | global batch size: 256 | lm loss: 2.058584E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 339.964 | TFLOPs: 20.57 | 31: iteration 40520/ 173500 | consumed samples: 10373120 | consumed tokens: 21244149760 | elapsed time per iteration (s): 0.75 | learning rate: 1.783E-04 | global batch size: 256 | lm loss: 2.120143E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.744 | TFLOPs: 20.61 | 31: iteration 40530/ 173500 | consumed samples: 10375680 | consumed tokens: 21249392640 | elapsed time per iteration (s): 0.78 | learning rate: 1.783E-04 | global batch size: 256 | lm loss: 2.131306E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.164 | TFLOPs: 19.97 | 31: iteration 40540/ 173500 | consumed samples: 10378240 | consumed tokens: 21254635520 | elapsed time per iteration (s): 0.75 | learning rate: 1.783E-04 | global batch size: 256 | lm loss: 2.107568E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.439 | TFLOPs: 20.54 | 31: iteration 40550/ 173500 | consumed samples: 10380800 | consumed tokens: 21259878400 | elapsed time per iteration (s): 0.81 | learning rate: 1.783E-04 | global batch size: 256 | lm loss: 2.099331E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.770 | TFLOPs: 19.16 | 31: iteration 40560/ 173500 | consumed samples: 10383360 | consumed tokens: 21265121280 | elapsed time per iteration (s): 0.83 | learning rate: 1.782E-04 | global batch size: 256 | lm loss: 2.113110E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.156 | TFLOPs: 18.64 | 31: iteration 40570/ 173500 | consumed samples: 10385920 | consumed tokens: 21270364160 | elapsed time per iteration (s): 0.84 | learning rate: 1.782E-04 | global batch size: 256 | lm loss: 
2.110690E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.103 | TFLOPs: 18.40 | 31: iteration 40580/ 173500 | consumed samples: 10388480 | consumed tokens: 21275607040 | elapsed time per iteration (s): 0.82 | learning rate: 1.782E-04 | global batch size: 256 | lm loss: 2.092385E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.216 | TFLOPs: 18.83 | 31: iteration 40590/ 173500 | consumed samples: 10391040 | consumed tokens: 21280849920 | elapsed time per iteration (s): 0.83 | learning rate: 1.782E-04 | global batch size: 256 | lm loss: 2.078317E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.730 | TFLOPs: 18.74 | 31: iteration 40600/ 173500 | consumed samples: 10393600 | consumed tokens: 21286092800 | elapsed time per iteration (s): 0.80 | learning rate: 1.782E-04 | global batch size: 256 | lm loss: 2.098526E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.489 | TFLOPs: 19.27 | 31: iteration 40610/ 173500 | consumed samples: 10396160 | consumed tokens: 21291335680 | elapsed time per iteration (s): 0.81 | learning rate: 1.782E-04 | global batch size: 256 | lm loss: 2.099890E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.485 | TFLOPs: 19.15 | 31: iteration 40620/ 173500 | consumed samples: 10398720 | consumed tokens: 21296578560 | elapsed time per iteration (s): 0.80 | learning rate: 1.782E-04 | global batch size: 256 | lm loss: 2.086469E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.812 | TFLOPs: 19.47 | 31: iteration 40630/ 173500 | consumed samples: 10401280 | consumed tokens: 
21301821440 | elapsed time per iteration (s): 0.79 | learning rate: 1.782E-04 | global batch size: 256 | lm loss: 2.106032E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.873 | TFLOPs: 19.59 | 31: iteration 40640/ 173500 | consumed samples: 10403840 | consumed tokens: 21307064320 | elapsed time per iteration (s): 0.80 | learning rate: 1.782E-04 | global batch size: 256 | lm loss: 2.088929E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.616 | TFLOPs: 19.34 | 31: iteration 40650/ 173500 | consumed samples: 10406400 | consumed tokens: 21312307200 | elapsed time per iteration (s): 0.81 | learning rate: 1.781E-04 | global batch size: 256 | lm loss: 2.117480E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.582 | TFLOPs: 19.09 | 31: iteration 40660/ 173500 | consumed samples: 10408960 | consumed tokens: 21317550080 | elapsed time per iteration (s): 0.82 | learning rate: 1.781E-04 | global batch size: 256 | lm loss: 2.090920E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.525 | TFLOPs: 18.97 | 31: iteration 40670/ 173500 | consumed samples: 10411520 | consumed tokens: 21322792960 | elapsed time per iteration (s): 0.80 | learning rate: 1.781E-04 | global batch size: 256 | lm loss: 2.092949E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.436 | TFLOPs: 19.39 | 31: iteration 40680/ 173500 | consumed samples: 10414080 | consumed tokens: 21328035840 | elapsed time per iteration (s): 0.80 | learning rate: 1.781E-04 | global batch size: 256 | lm loss: 2.074956E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 318.736 | TFLOPs: 19.28 | 31: iteration 40690/ 173500 | consumed samples: 10416640 | consumed tokens: 21333278720 | elapsed time per iteration (s): 0.84 | learning rate: 1.781E-04 | global batch size: 256 | lm loss: 2.080468E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.488 | TFLOPs: 18.42 | 31: iteration 40700/ 173500 | consumed samples: 10419200 | consumed tokens: 21338521600 | elapsed time per iteration (s): 0.80 | learning rate: 1.781E-04 | global batch size: 256 | lm loss: 2.088257E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.966 | TFLOPs: 19.42 | 31: iteration 40710/ 173500 | consumed samples: 10421760 | consumed tokens: 21343764480 | elapsed time per iteration (s): 0.84 | learning rate: 1.781E-04 | global batch size: 256 | lm loss: 2.092969E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.538 | TFLOPs: 18.42 | 31: iteration 40720/ 173500 | consumed samples: 10424320 | consumed tokens: 21349007360 | elapsed time per iteration (s): 0.80 | learning rate: 1.781E-04 | global batch size: 256 | lm loss: 2.079154E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.264 | TFLOPs: 19.44 | 31: iteration 40730/ 173500 | consumed samples: 10426880 | consumed tokens: 21354250240 | elapsed time per iteration (s): 0.87 | learning rate: 1.781E-04 | global batch size: 256 | lm loss: 2.094831E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.166 | TFLOPs: 17.74 | 31: iteration 40740/ 173500 | consumed samples: 10429440 | consumed tokens: 21359493120 | elapsed time per iteration (s): 0.83 | learning rate: 1.781E-04 | global batch size: 256 | lm loss: 2.101564E+00 | grad 
norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.781 | TFLOPs: 18.56 | 31: iteration 40750/ 173500 | consumed samples: 10432000 | consumed tokens: 21364736000 | elapsed time per iteration (s): 0.81 | learning rate: 1.780E-04 | global batch size: 256 | lm loss: 2.083347E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.884 | TFLOPs: 19.17 | 31: iteration 40760/ 173500 | consumed samples: 10434560 | consumed tokens: 21369978880 | elapsed time per iteration (s): 0.82 | learning rate: 1.780E-04 | global batch size: 256 | lm loss: 2.070646E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.696 | TFLOPs: 18.86 | 31: iteration 40770/ 173500 | consumed samples: 10437120 | consumed tokens: 21375221760 | elapsed time per iteration (s): 0.80 | learning rate: 1.780E-04 | global batch size: 256 | lm loss: 2.056356E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.707 | TFLOPs: 19.40 | 31: iteration 40780/ 173500 | consumed samples: 10439680 | consumed tokens: 21380464640 | elapsed time per iteration (s): 0.81 | learning rate: 1.780E-04 | global batch size: 256 | lm loss: 2.089163E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.089 | TFLOPs: 19.12 | 31: iteration 40790/ 173500 | consumed samples: 10442240 | consumed tokens: 21385707520 | elapsed time per iteration (s): 1.28 | learning rate: 1.780E-04 | global batch size: 256 | lm loss: 2.125432E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 199.867 | TFLOPs: 12.09 | 31: iteration 40800/ 173500 | consumed samples: 10444800 | consumed tokens: 21390950400 | elapsed time 
per iteration (s): 0.79 | learning rate: 1.780E-04 | global batch size: 256 | lm loss: 2.087362E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.775 | TFLOPs: 19.53 | 31: iteration 40810/ 173500 | consumed samples: 10447360 | consumed tokens: 21396193280 | elapsed time per iteration (s): 0.81 | learning rate: 1.780E-04 | global batch size: 256 | lm loss: 2.087698E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.458 | TFLOPs: 19.14 | 31: iteration 40820/ 173500 | consumed samples: 10449920 | consumed tokens: 21401436160 | elapsed time per iteration (s): 0.79 | learning rate: 1.780E-04 | global batch size: 256 | lm loss: 2.085733E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.306 | TFLOPs: 19.68 | 31: iteration 40830/ 173500 | consumed samples: 10452480 | consumed tokens: 21406679040 | elapsed time per iteration (s): 0.81 | learning rate: 1.780E-04 | global batch size: 256 | lm loss: 2.122143E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.182 | TFLOPs: 19.01 | 31: iteration 40840/ 173500 | consumed samples: 10455040 | consumed tokens: 21411921920 | elapsed time per iteration (s): 0.84 | learning rate: 1.779E-04 | global batch size: 256 | lm loss: 2.081992E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.017 | TFLOPs: 18.51 | 31: iteration 40850/ 173500 | consumed samples: 10457600 | consumed tokens: 21417164800 | elapsed time per iteration (s): 0.83 | learning rate: 1.779E-04 | global batch size: 256 | lm loss: 2.099333E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.951 | TFLOPs: 
18.57 | 31: iteration 40860/ 173500 | consumed samples: 10460160 | consumed tokens: 21422407680 | elapsed time per iteration (s): 0.81 | learning rate: 1.779E-04 | global batch size: 256 | lm loss: 2.080084E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.226 | TFLOPs: 19.07 | 31: iteration 40870/ 173500 | consumed samples: 10462720 | consumed tokens: 21427650560 | elapsed time per iteration (s): 0.74 | learning rate: 1.779E-04 | global batch size: 256 | lm loss: 2.093225E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.774 | TFLOPs: 20.80 | 31: iteration 40880/ 173500 | consumed samples: 10465280 | consumed tokens: 21432893440 | elapsed time per iteration (s): 0.77 | learning rate: 1.779E-04 | global batch size: 256 | lm loss: 2.109784E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.858 | TFLOPs: 20.20 | 31: iteration 40890/ 173500 | consumed samples: 10467840 | consumed tokens: 21438136320 | elapsed time per iteration (s): 0.74 | learning rate: 1.779E-04 | global batch size: 256 | lm loss: 2.091856E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.076 | TFLOPs: 20.88 | 31: iteration 40900/ 173500 | consumed samples: 10470400 | consumed tokens: 21443379200 | elapsed time per iteration (s): 0.81 | learning rate: 1.779E-04 | global batch size: 256 | lm loss: 2.083112E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.154 | TFLOPs: 19.07 | 31: iteration 40910/ 173500 | consumed samples: 10472960 | consumed tokens: 21448622080 | elapsed time per iteration (s): 0.84 | learning rate: 1.779E-04 | global batch size: 256 | lm loss: 2.078690E+00 | grad norm: 0.128 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.228 | TFLOPs: 18.53 | 31: iteration 40920/ 173500 | consumed samples: 10475520 | consumed tokens: 21453864960 | elapsed time per iteration (s): 0.93 | learning rate: 1.779E-04 | global batch size: 256 | lm loss: 2.114610E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 276.519 | TFLOPs: 16.73 | 31: iteration 40930/ 173500 | consumed samples: 10478080 | consumed tokens: 21459107840 | elapsed time per iteration (s): 0.77 | learning rate: 1.778E-04 | global batch size: 256 | lm loss: 2.086853E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.814 | TFLOPs: 20.01 | 31: iteration 40940/ 173500 | consumed samples: 10480640 | consumed tokens: 21464350720 | elapsed time per iteration (s): 0.88 | learning rate: 1.778E-04 | global batch size: 256 | lm loss: 2.096762E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.155 | TFLOPs: 17.61 | 31: iteration 40950/ 173500 | consumed samples: 10483200 | consumed tokens: 21469593600 | elapsed time per iteration (s): 0.75 | learning rate: 1.778E-04 | global batch size: 256 | lm loss: 2.111842E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.211 | TFLOPs: 20.64 | 31: iteration 40960/ 173500 | consumed samples: 10485760 | consumed tokens: 21474836480 | elapsed time per iteration (s): 0.75 | learning rate: 1.778E-04 | global batch size: 256 | lm loss: 2.114372E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.772 | TFLOPs: 20.62 | 31: iteration 40970/ 173500 | consumed samples: 10488320 | consumed tokens: 21480079360 | elapsed time per iteration (s): 0.74 | 
learning rate: 1.778E-04 | global batch size: 256 | lm loss: 2.081669E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.214 | TFLOPs: 20.88 | 31: iteration 40980/ 173500 | consumed samples: 10490880 | consumed tokens: 21485322240 | elapsed time per iteration (s): 0.76 | learning rate: 1.778E-04 | global batch size: 256 | lm loss: 2.066093E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.039 | TFLOPs: 20.33 | 31: iteration 40990/ 173500 | consumed samples: 10493440 | consumed tokens: 21490565120 | elapsed time per iteration (s): 0.79 | learning rate: 1.778E-04 | global batch size: 256 | lm loss: 2.077982E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.497 | TFLOPs: 19.69 | 31: iteration 41000/ 173500 | consumed samples: 10496000 | consumed tokens: 21495808000 | elapsed time per iteration (s): 0.73 | learning rate: 1.778E-04 | global batch size: 256 | lm loss: 2.113731E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.538 | TFLOPs: 21.21 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 41000 | lm loss value: 2.004901E+00 | lm loss PPL: 7.425362E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 41000 to checkpoints_1b1long 0: [2022-11-26 03:17:55,875] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step41000 is begin to save! 0: [2022-11-26 03:17:55,886] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 03:17:56,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_01-model_00-model_states.pt. 0: [2022-11-26 03:17:56,119] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_03-model_00-model_states.pt... 0: [2022-11-26 03:17:56,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_03-model_00-model_states.pt. 0: [2022-11-26 03:17:56,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_04-model_00-model_states.pt... 0: [2022-11-26 03:17:56,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_04-model_00-model_states.pt. 0: [2022-11-26 03:17:56,292] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_05-model_00-model_states.pt... 0: [2022-11-26 03:17:56,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_05-model_00-model_states.pt. 0: [2022-11-26 03:17:56,369] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_06-model_00-model_states.pt... 0: [2022-11-26 03:17:56,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_06-model_00-model_states.pt. 0: [2022-11-26 03:17:56,451] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_07-model_00-model_states.pt... 0: [2022-11-26 03:17:56,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_07-model_00-model_states.pt. 0: [2022-11-26 03:17:56,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 03:17:56,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_08-model_00-model_states.pt. 0: [2022-11-26 03:17:56,606] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_09-model_00-model_states.pt... 0: [2022-11-26 03:17:56,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_09-model_00-model_states.pt. 0: [2022-11-26 03:17:56,679] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_10-model_00-model_states.pt... 0: [2022-11-26 03:17:56,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_10-model_00-model_states.pt. 0: [2022-11-26 03:17:56,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_11-model_00-model_states.pt... 0: [2022-11-26 03:17:56,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_11-model_00-model_states.pt. 0: [2022-11-26 03:17:56,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_12-model_00-model_states.pt... 0: [2022-11-26 03:17:56,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_12-model_00-model_states.pt. 0: [2022-11-26 03:17:56,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_13-model_00-model_states.pt... 0: [2022-11-26 03:17:56,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_13-model_00-model_states.pt. 0: [2022-11-26 03:17:56,984] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 03:17:57,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_14-model_00-model_states.pt. 0: [2022-11-26 03:17:57,060] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_15-model_00-model_states.pt... 0: [2022-11-26 03:17:57,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_15-model_00-model_states.pt. 0: [2022-11-26 03:17:57,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_16-model_00-model_states.pt... 0: [2022-11-26 03:17:57,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_16-model_00-model_states.pt. 0: [2022-11-26 03:17:57,211] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_17-model_00-model_states.pt... 0: [2022-11-26 03:17:57,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_17-model_00-model_states.pt. 0: [2022-11-26 03:17:57,285] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_18-model_00-model_states.pt... 0: [2022-11-26 03:17:57,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_18-model_00-model_states.pt. 0: [2022-11-26 03:17:57,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_19-model_00-model_states.pt... 0: [2022-11-26 03:17:57,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_19-model_00-model_states.pt. 0: [2022-11-26 03:17:57,437] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 03:17:57,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_20-model_00-model_states.pt. 0: [2022-11-26 03:17:57,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_21-model_00-model_states.pt... 0: [2022-11-26 03:17:57,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_21-model_00-model_states.pt. 0: [2022-11-26 03:17:57,587] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_22-model_00-model_states.pt... 0: [2022-11-26 03:17:57,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_22-model_00-model_states.pt. 0: [2022-11-26 03:17:57,664] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_23-model_00-model_states.pt... 0: [2022-11-26 03:17:57,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_23-model_00-model_states.pt. 0: [2022-11-26 03:17:57,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_24-model_00-model_states.pt... 0: [2022-11-26 03:17:57,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_24-model_00-model_states.pt. 0: [2022-11-26 03:17:57,818] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_25-model_00-model_states.pt... 0: [2022-11-26 03:17:57,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_25-model_00-model_states.pt. 0: [2022-11-26 03:17:57,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 03:17:57,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_26-model_00-model_states.pt. 0: [2022-11-26 03:17:57,968] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_27-model_00-model_states.pt... 0: [2022-11-26 03:17:58,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_27-model_00-model_states.pt. 0: [2022-11-26 03:17:58,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_28-model_00-model_states.pt... 0: [2022-11-26 03:17:58,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_28-model_00-model_states.pt. 0: [2022-11-26 03:17:58,119] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/layer_30-model_00-model_states.pt... 0: [2022-11-26 03:17:58,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/layer_30-model_00-model_states.pt. 0: [2022-11-26 03:17:58,124] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step41000/mp_rank_00_model_states.pt 0: [2022-11-26 03:17:58,124] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/mp_rank_00_model_states.pt... 0: [2022-11-26 03:17:58,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/mp_rank_00_model_states.pt. 0: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 
28: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
31: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 
17: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
5: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
9: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 
16: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 
12: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 
25: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 
28: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 
22: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 
18: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
6: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
8: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
3: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 
25: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 
31: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 
26: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 
7: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
2: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 
29: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
6: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
2: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 
27: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 
31: [2022-11-26 03:17:58,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:17:58,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:17:58,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 03:17:58,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 19: [2022-11-26 03:17:58,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:17:58,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 03:17:58,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 5: [2022-11-26 03:17:58,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:17:58,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 03:17:58,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 3: [2022-11-26 03:17:58,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 03:17:58,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 03:17:58,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 20: [2022-11-26 03:17:58,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:17:58,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 9: [2022-11-26 03:17:58,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:17:58,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 9: [2022-11-26 03:17:58,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 03:17:58,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 19: [2022-11-26 03:17:58,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:17:58,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 03:17:58,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 12: [2022-11-26 03:17:58,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 03:17:58,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 03:17:58,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 0: [2022-11-26 03:17:58,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:17:58,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:17:58,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:17:58,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:17:58,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 03:17:58,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 30: [2022-11-26 03:17:58,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 23: [2022-11-26 03:17:58,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 23: [2022-11-26 03:17:58,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 30: [2022-11-26 03:17:58,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 
25: [2022-11-26 03:17:58,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:17:58,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 03:17:58,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 5: [2022-11-26 03:17:58,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:17:58,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 25: [2022-11-26 03:17:58,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:17:58,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 25: [2022-11-26 03:17:58,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 15: [2022-11-26 03:17:58,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:17:58,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 03:17:58,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 25: [2022-11-26 03:17:58,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 
27: [2022-11-26 03:17:58,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:17:58,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 7: [2022-11-26 03:17:58,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:17:58,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 7: [2022-11-26 03:17:58,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 03:17:58,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 12: [2022-11-26 03:17:58,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:17:58,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 03:17:58,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 9: [2022-11-26 03:17:58,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:17:58,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:17:58,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
9: [2022-11-26 03:17:58,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 3: [2022-11-26 03:17:58,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 22: [2022-11-26 03:17:58,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 9: [2022-11-26 03:17:58,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 3: [2022-11-26 03:17:58,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 22: [2022-11-26 03:17:58,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 17: [2022-11-26 03:17:58,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:17:58,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:17:58,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 03:17:58,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 03:17:58,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 17: [2022-11-26 03:17:58,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 
22: [2022-11-26 03:17:58,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:17:58,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 03:17:58,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 8: [2022-11-26 03:17:58,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:17:58,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:17:58,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 03:17:58,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 03:17:58,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 8: [2022-11-26 03:17:58,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 24: [2022-11-26 03:17:58,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:17:58,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 03:17:58,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 
30: [2022-11-26 03:17:58,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:17:58,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:17:58,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 03:17:58,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 30: [2022-11-26 03:17:58,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 03:17:58,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 14: [2022-11-26 03:17:58,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:17:58,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 03:17:58,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 15: [2022-11-26 03:17:58,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:17:58,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
15: [2022-11-26 03:17:58,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 20: [2022-11-26 03:17:58,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 15: [2022-11-26 03:17:58,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 20: [2022-11-26 03:17:58,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 19: [2022-11-26 03:17:58,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:17:58,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 03:17:58,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 8: [2022-11-26 03:17:58,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:17:58,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 03:17:58,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 26: [2022-11-26 03:17:58,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
26: [2022-11-26 03:17:58,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 03:17:58,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 9: [2022-11-26 03:17:58,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:17:58,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 03:17:58,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 25: [2022-11-26 03:17:58,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:17:58,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 03:17:58,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 0: [2022-11-26 03:17:58,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:17:58,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-26 03:17:58,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 03:17:58,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 03:17:58,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 0: [2022-11-26 03:17:58,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 26: [2022-11-26 03:17:58,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:17:58,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 03:17:58,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 30: [2022-11-26 03:17:58,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:17:58,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 03:17:58,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 6: [2022-11-26 03:17:58,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:17:58,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
0: [2022-11-26 03:17:58,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:17:58,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 03:17:58,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 03:17:58,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 6: [2022-11-26 03:17:58,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 0: [2022-11-26 03:17:58,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 03:17:58,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 5: [2022-11-26 03:17:58,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:17:58,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 03:17:58,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 3: [2022-11-26 03:17:58,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 03:17:58,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 03:17:58,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 5: [2022-11-26 03:17:58,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:17:58,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 03:17:58,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 15: [2022-11-26 03:17:58,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:17:58,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 03:17:58,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 4: [2022-11-26 03:17:58,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:17:58,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 03:17:58,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 03:17:58,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 03:17:58,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 25: [2022-11-26 03:17:58,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:17:58,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 25: [2022-11-26 03:17:58,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 27: [2022-11-26 03:17:58,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:17:58,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:17:58,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 14: [2022-11-26 03:17:58,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
27: [2022-11-26 03:17:58,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 12: [2022-11-26 03:17:58,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 14: [2022-11-26 03:17:58,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 03:17:58,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:17:58,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 12: [2022-11-26 03:17:58,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 14: [2022-11-26 03:17:58,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 14: [2022-11-26 03:17:58,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 03:17:58,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 15: [2022-11-26 03:17:58,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:17:58,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 03:17:58,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 
1: [2022-11-26 03:17:58,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:17:58,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:17:58,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 03:17:58,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 30: [2022-11-26 03:17:58,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 03:17:58,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 22: [2022-11-26 03:17:58,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:17:58,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 03:17:58,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 6: [2022-11-26 03:17:58,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:17:58,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
6: [2022-11-26 03:17:58,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 14: [2022-11-26 03:17:58,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 6: [2022-11-26 03:17:58,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 14: [2022-11-26 03:17:58,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 27: [2022-11-26 03:17:58,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:17:58,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:17:58,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 24: [2022-11-26 03:17:58,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:17:58,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 27: [2022-11-26 03:17:58,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 6: [2022-11-26 03:17:58,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 20: [2022-11-26 03:17:58,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
24: [2022-11-26 03:17:58,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 20: [2022-11-26 03:17:58,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 03:17:58,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 24: [2022-11-26 03:17:58,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 20: [2022-11-26 03:17:58,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:17:58,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 03:17:58,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 10: [2022-11-26 03:17:58,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:17:58,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 03:17:58,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 22: [2022-11-26 03:17:58,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
22: [2022-11-26 03:17:58,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 03:17:58,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 7: [2022-11-26 03:17:58,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:17:58,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 03:17:58,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 21: [2022-11-26 03:17:58,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:17:58,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:17:58,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 03:17:58,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 03:17:58,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 21: [2022-11-26 03:17:58,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 7: [2022-11-26 03:17:58,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 03:17:58,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 03:17:58,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 12: [2022-11-26 03:17:58,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:17:58,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 7: [2022-11-26 03:17:58,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:17:58,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 7: [2022-11-26 03:17:58,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 03:17:58,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 3: [2022-11-26 03:17:58,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:17:58,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 03:17:58,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 29: [2022-11-26 03:17:58,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
29: [2022-11-26 03:17:58,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 03:17:58,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 11: [2022-11-26 03:17:58,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:17:58,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:17:58,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:17:58,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:17:58,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 03:17:58,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 03:17:58,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 03:17:58,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 1: [2022-11-26 03:17:58,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
11: [2022-11-26 03:17:58,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 11: [2022-11-26 03:17:58,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 11: [2022-11-26 03:17:58,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 11: [2022-11-26 03:17:58,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 1: [2022-11-26 03:17:58,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 03:17:58,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 19: [2022-11-26 03:17:58,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:17:58,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:17:58,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 19: [2022-11-26 03:17:58,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 23: [2022-11-26 03:17:58,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 19: [2022-11-26 03:17:58,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 23: [2022-11-26 03:17:58,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
23: [2022-11-26 03:17:58,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 03:17:58,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 9: [2022-11-26 03:17:58,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:17:58,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 03:17:58,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 8: [2022-11-26 03:17:58,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:17:58,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 03:17:58,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 19: [2022-11-26 03:17:58,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:17:58,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:17:58,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:17:58,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
4: [2022-11-26 03:17:58,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 19: [2022-11-26 03:17:58,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 4: [2022-11-26 03:17:58,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 19: [2022-11-26 03:17:58,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 26: [2022-11-26 03:17:58,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 03:17:58,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 03:17:58,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 17: [2022-11-26 03:17:58,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:17:58,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 17: [2022-11-26 03:17:58,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 03:17:58,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 17: [2022-11-26 03:17:58,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-26 03:17:58,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 03:17:58,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 29: [2022-11-26 03:17:58,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:17:58,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:17:58,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 03:17:58,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 24: [2022-11-26 03:17:58,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:17:58,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 29: [2022-11-26 03:17:58,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 24: [2022-11-26 03:17:58,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 1: [2022-11-26 03:17:58,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:17:58,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 
1: [2022-11-26 03:17:58,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 03:17:58,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 0: [2022-11-26 03:17:58,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 03:17:58,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 21: [2022-11-26 03:17:58,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:17:58,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 03:17:58,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 10: [2022-11-26 03:17:58,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:17:58,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:17:58,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 03:17:58,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 
10: [2022-11-26 03:17:58,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 03:17:58,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 2: [2022-11-26 03:17:58,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:17:58,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:17:58,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:17:58,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:17:58,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 03:17:58,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 03:17:58,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 03:17:58,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 03:17:58,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 
2: [2022-11-26 03:17:58,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 2: [2022-11-26 03:17:58,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 2: [2022-11-26 03:17:58,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 5: [2022-11-26 03:17:58,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:17:58,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 03:17:58,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 4: [2022-11-26 03:17:58,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:17:58,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 03:17:58,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 16: [2022-11-26 03:17:58,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:17:58,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:17:58,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 
16: [2022-11-26 03:17:58,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:17:58,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 03:17:58,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 18: [2022-11-26 03:17:58,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:17:58,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:17:58,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 03:17:58,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 03:17:58,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 16: [2022-11-26 03:17:58,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 16: [2022-11-26 03:17:58,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 16: [2022-11-26 03:17:58,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 18: [2022-11-26 03:17:58,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-26 03:17:58,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:17:58,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 03:17:58,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 03:17:58,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 03:17:58,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 03:17:58,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 18: [2022-11-26 03:17:58,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 18: [2022-11-26 03:17:58,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 18: [2022-11-26 03:17:58,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 20: [2022-11-26 03:17:58,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:17:58,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
20: [2022-11-26 03:17:58,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 13: [2022-11-26 03:17:58,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 20: [2022-11-26 03:17:58,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 13: [2022-11-26 03:17:58,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 13: [2022-11-26 03:17:58,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:17:58,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 03:17:58,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 13: [2022-11-26 03:17:58,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:17:58,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:17:58,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 15: [2022-11-26 03:17:58,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 13: [2022-11-26 03:17:58,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 
15: [2022-11-26 03:17:58,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 13: [2022-11-26 03:17:58,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:17:58,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 03:17:58,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 28: [2022-11-26 03:17:58,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:17:58,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:17:58,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:17:58,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:17:58,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 03:17:58,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 03:17:58,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 28: [2022-11-26 03:17:58,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 
28: [2022-11-26 03:17:58,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 03:17:58,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 03:17:58,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 28: [2022-11-26 03:17:58,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 31: [2022-11-26 03:17:58,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:17:58,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:17:58,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:17:58,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 
31: [2022-11-26 03:17:58,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 03:17:58,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 03:17:58,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 03:17:58,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 03:17:58,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 31: [2022-11-26 03:17:58,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 31: [2022-11-26 03:17:58,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 31: [2022-11-26 03:17:58,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 24: [2022-11-26 03:17:58,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:17:58,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 03:17:58,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 6: [2022-11-26 03:17:58,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 03:17:58,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 03:17:58,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 9: [2022-11-26 03:17:58,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:17:58,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 03:17:58,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 25: [2022-11-26 03:17:58,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:17:58,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 03:17:58,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 28: [2022-11-26 03:17:58,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:17:58,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 03:17:58,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 12: [2022-11-26 03:17:58,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 03:17:58,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 03:17:58,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 8: [2022-11-26 03:17:58,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:17:58,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 03:17:58,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 0: [2022-11-26 03:17:58,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:17:58,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 03:17:58,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 30: [2022-11-26 03:17:58,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:17:58,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 03:17:58,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 29: [2022-11-26 03:17:58,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-26 03:17:58,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 03:17:58,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 4: [2022-11-26 03:17:58,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:17:58,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 03:17:58,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 7: [2022-11-26 03:17:58,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:17:58,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 03:17:58,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 11: [2022-11-26 03:17:58,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:17:58,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 03:17:58,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 2: [2022-11-26 03:17:58,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 03:17:58,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 03:17:58,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 16: [2022-11-26 03:17:58,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:17:58,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 03:17:58,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 27: [2022-11-26 03:17:58,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:17:58,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:17:58,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 10: [2022-11-26 03:17:58,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:17:58,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 27: [2022-11-26 03:17:58,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 26: [2022-11-26 03:17:58,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 
10: [2022-11-26 03:17:58,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 03:17:58,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 22: [2022-11-26 03:17:58,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:17:58,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 03:17:58,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 3: [2022-11-26 03:17:58,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:17:58,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 03:17:58,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 23: [2022-11-26 03:17:58,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:17:58,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 03:17:58,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 13: [2022-11-26 03:17:58,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 03:17:58,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 03:17:58,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 1: [2022-11-26 03:17:58,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:17:58,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:17:58,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 1: [2022-11-26 03:17:58,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 21: [2022-11-26 03:17:58,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 1: [2022-11-26 03:17:58,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 17: [2022-11-26 03:17:58,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:17:58,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 03:17:58,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 14: [2022-11-26 03:17:58,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-26 03:17:58,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 03:17:58,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 18: [2022-11-26 03:17:58,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:17:58,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 03:17:58,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 31: [2022-11-26 03:17:58,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:17:58,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 03:17:58,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 5: [2022-11-26 03:17:58,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:17:58,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 03:17:58,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 6: [2022-11-26 03:17:58,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 03:17:58,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 03:17:58,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 15: [2022-11-26 03:17:58,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:17:58,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:17:58,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 15: [2022-11-26 03:17:58,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 20: [2022-11-26 03:17:58,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 15: [2022-11-26 03:17:58,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 19: [2022-11-26 03:17:58,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:17:58,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 03:17:58,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 12: [2022-11-26 03:17:58,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 03:17:58,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 03:17:58,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 25: [2022-11-26 03:17:58,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:17:58,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:17:58,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 24: [2022-11-26 03:17:58,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 03:17:58,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 25: [2022-11-26 03:17:58,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 0: [2022-11-26 03:17:58,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:17:58,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 03:17:58,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 9: [2022-11-26 03:17:58,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-26 03:17:58,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 03:17:58,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 8: [2022-11-26 03:17:58,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:17:58,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 03:17:58,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 4: [2022-11-26 03:17:58,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:17:58,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 03:17:58,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 30: [2022-11-26 03:17:58,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:17:58,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 03:17:58,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 23: [2022-11-26 03:17:58,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-26 03:17:58,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 03:17:58,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 28: [2022-11-26 03:17:58,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:17:58,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 29: [2022-11-26 03:17:58,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:17:58,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 03:17:58,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 7: [2022-11-26 03:17:58,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:17:58,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 03:17:58,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 28: [2022-11-26 03:17:58,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 11: [2022-11-26 03:17:58,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 03:17:58,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 03:17:58,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 2: [2022-11-26 03:17:58,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:17:58,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 03:17:58,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 16: [2022-11-26 03:17:58,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:17:58,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:17:58,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 10: [2022-11-26 03:17:58,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 16: [2022-11-26 03:17:58,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 10: [2022-11-26 03:17:58,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 22: [2022-11-26 03:17:58,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-26 03:17:58,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 03:17:58,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 3: [2022-11-26 03:17:58,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:17:58,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:17:58,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 3: [2022-11-26 03:17:58,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 03:17:58,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 26: [2022-11-26 03:17:58,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 13: [2022-11-26 03:17:58,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:17:58,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 03:17:58,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 17: [2022-11-26 03:17:58,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-26 03:17:58,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 03:17:58,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 21: [2022-11-26 03:17:58,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:17:58,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 03:17:58,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 1: [2022-11-26 03:17:58,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:17:58,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 03:17:58,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 14: [2022-11-26 03:17:58,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:17:58,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 03:17:58,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 27: [2022-11-26 03:17:58,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
31: [2022-11-26 03:17:58,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:17:58,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 03:17:58,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 31: [2022-11-26 03:17:58,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 03:17:58,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 20: [2022-11-26 03:17:58,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:17:58,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 03:17:58,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 19: [2022-11-26 03:17:58,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:17:58,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 03:17:58,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 15: [2022-11-26 03:17:58,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
6: [2022-11-26 03:17:58,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:17:58,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 03:17:58,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 6: [2022-11-26 03:17:58,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 03:17:58,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 5: [2022-11-26 03:17:58,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:17:58,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:17:58,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 18: [2022-11-26 03:17:58,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 03:17:58,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 5: [2022-11-26 03:17:58,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 24: [2022-11-26 03:17:58,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
24: [2022-11-26 03:17:58,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 03:17:58,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 9: [2022-11-26 03:17:58,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:17:58,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 03:17:58,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 25: [2022-11-26 03:17:58,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:17:58,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 03:17:58,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 28: [2022-11-26 03:17:58,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:17:58,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 03:17:58,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 0: [2022-11-26 03:17:58,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 03:17:58,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 03:17:58,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 4: [2022-11-26 03:17:58,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:17:58,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 03:17:58,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 29: [2022-11-26 03:17:58,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:17:58,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 03:17:58,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 30: [2022-11-26 03:17:58,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:17:58,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 03:17:58,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 8: [2022-11-26 03:17:58,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 03:17:58,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 03:17:58,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 7: [2022-11-26 03:17:58,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:17:58,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 03:17:58,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 23: [2022-11-26 03:17:58,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:17:58,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 03:17:58,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 12: [2022-11-26 03:17:58,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:17:58,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 03:17:58,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 2: [2022-11-26 03:17:58,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 03:17:58,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 03:17:58,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 11: [2022-11-26 03:17:58,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:17:58,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 03:17:58,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 27: [2022-11-26 03:17:58,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:17:58,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 03:17:58,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 26: [2022-11-26 03:17:58,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:17:58,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 03:17:58,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 16: [2022-11-26 03:17:58,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
16: [2022-11-26 03:17:58,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 03:17:58,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 30: [2022-11-26 03:17:58,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:17:58,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 03:17:58,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 13: [2022-11-26 03:17:58,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:17:58,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 03:17:58,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 4: [2022-11-26 03:17:58,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:17:58,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 03:17:58,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 3: [2022-11-26 03:17:58,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
15: [2022-11-26 03:17:58,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:17:58,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 15: [2022-11-26 03:17:58,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 3: [2022-11-26 03:17:58,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 15: [2022-11-26 03:17:58,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 6: [2022-11-26 03:17:58,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:17:58,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:17:58,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:17:58,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 03:17:58,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 
17: [2022-11-26 03:17:58,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 31: [2022-11-26 03:17:58,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 17: [2022-11-26 03:17:58,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 31: [2022-11-26 03:17:58,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 20: [2022-11-26 03:17:58,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:17:58,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 14: [2022-11-26 03:17:58,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:17:58,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 14: [2022-11-26 03:17:58,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 03:17:58,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 5: [2022-11-26 03:17:58,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 03:17:58,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 03:17:58,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 8: [2022-11-26 03:17:58,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:17:58,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 03:17:58,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 25: [2022-11-26 03:17:58,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:17:58,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:17:58,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:17:58,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 25: [2022-11-26 03:17:58,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 10: [2022-11-26 03:17:58,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 25: [2022-11-26 03:17:58,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 
19: [2022-11-26 03:17:58,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 24: [2022-11-26 03:17:58,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:17:58,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 24: [2022-11-26 03:17:58,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 03:17:58,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 28: [2022-11-26 03:17:58,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:17:58,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:17:58,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 03:17:58,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 1: [2022-11-26 03:17:58,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 03:17:58,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 22: [2022-11-26 03:17:58,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
22: [2022-11-26 03:17:58,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 03:17:58,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 21: [2022-11-26 03:17:58,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:17:58,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 03:17:58,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 29: [2022-11-26 03:17:58,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:17:58,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:17:58,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 29: [2022-11-26 03:17:58,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 03:17:58,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 0: [2022-11-26 03:17:58,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 12: [2022-11-26 03:17:58,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 03:17:58,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 03:17:58,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 7: [2022-11-26 03:17:58,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:17:58,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 03:17:58,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 18: [2022-11-26 03:17:58,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:17:58,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 03:17:58,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 27: [2022-11-26 03:17:58,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:17:58,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 03:17:58,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 2: [2022-11-26 03:17:58,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-26 03:17:58,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 03:17:58,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 17: [2022-11-26 03:17:58,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:17:58,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:17:58,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 9: [2022-11-26 03:17:58,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 17: [2022-11-26 03:17:58,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 9: [2022-11-26 03:17:58,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 1: [2022-11-26 03:17:58,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:17:58,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 03:17:58,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 23: [2022-11-26 03:17:58,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-26 03:17:58,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 03:17:58,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 31: [2022-11-26 03:17:58,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:17:58,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 03:17:58,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 26: [2022-11-26 03:17:58,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:17:58,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 03:17:58,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 11: [2022-11-26 03:17:58,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:17:58,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 03:17:58,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 3: [2022-11-26 03:17:58,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
14: [2022-11-26 03:17:58,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:17:58,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 14: [2022-11-26 03:17:58,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 3: [2022-11-26 03:17:58,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 14: [2022-11-26 03:17:58,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 18: [2022-11-26 03:17:58,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:17:58,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 03:17:58,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 21: [2022-11-26 03:17:58,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:17:58,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 03:17:58,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 10: [2022-11-26 03:17:58,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-26 03:17:58,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 03:17:58,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 22: [2022-11-26 03:17:58,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:17:58,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 03:17:58,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 1: [2022-11-26 03:17:58,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:17:58,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 03:17:58,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 13: [2022-11-26 03:17:58,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:17:58,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 03:17:58,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 10: [2022-11-26 03:17:58,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-26 03:17:58,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 03:17:58,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 21: [2022-11-26 03:17:58,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:17:58,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 03:17:58,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 29: [2022-11-26 03:17:58,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:17:58,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 03:17:58,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 16: [2022-11-26 03:17:58,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:17:58,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step41000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 03:17:58,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 
0: successfully saved checkpoint at iteration 41000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2620.44 31: iteration 41010/ 173500 | consumed samples: 10498560 | consumed tokens: 21501050880 | elapsed time per iteration (s): 1.11 | learning rate: 1.778E-04 | global batch size: 256 | lm loss: 2.093200E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.704 | TFLOPs: 13.90 | 31: iteration 41020/ 173500 | consumed samples: 10501120 | consumed tokens: 21506293760 | elapsed time per iteration (s): 0.80 | learning rate: 1.778E-04 | global batch size: 256 | lm loss: 2.117701E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.708 | TFLOPs: 19.46 | 31: iteration 41030/ 173500 | consumed samples: 10503680 | consumed tokens: 21511536640 | elapsed time per iteration (s): 0.84 | learning rate: 1.777E-04 | global batch size: 256 | lm loss: 2.091091E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.667 | TFLOPs: 18.49 | 31: iteration 41040/ 173500 | consumed samples: 10506240 | consumed tokens: 21516779520 | elapsed time per iteration (s): 0.84 | learning rate: 1.777E-04 | global batch size: 256 | lm loss: 2.111183E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.328 | TFLOPs: 18.47 | 31: iteration 41050/ 173500 | consumed samples: 10508800 | consumed tokens: 21522022400 | elapsed time per iteration (s): 0.85 | learning rate: 1.777E-04 | global batch size: 256 | lm loss: 2.104696E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.248 | TFLOPs: 18.16 | 31: iteration 41060/ 173500 | consumed samples: 10511360 | consumed tokens: 21527265280 | elapsed time per iteration (s): 0.82 | 
learning rate: 1.777E-04 | global batch size: 256 | lm loss: 2.057708E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.525 | TFLOPs: 18.91 | 31: iteration 41070/ 173500 | consumed samples: 10513920 | consumed tokens: 21532508160 | elapsed time per iteration (s): 0.84 | learning rate: 1.777E-04 | global batch size: 256 | lm loss: 2.116843E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.926 | TFLOPs: 18.45 | 31: iteration 41080/ 173500 | consumed samples: 10516480 | consumed tokens: 21537751040 | elapsed time per iteration (s): 0.80 | learning rate: 1.777E-04 | global batch size: 256 | lm loss: 2.094622E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.808 | TFLOPs: 19.29 | 31: iteration 41090/ 173500 | consumed samples: 10519040 | consumed tokens: 21542993920 | elapsed time per iteration (s): 0.87 | learning rate: 1.777E-04 | global batch size: 256 | lm loss: 2.066986E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.642 | TFLOPs: 17.70 | 31: iteration 41100/ 173500 | consumed samples: 10521600 | consumed tokens: 21548236800 | elapsed time per iteration (s): 0.86 | learning rate: 1.777E-04 | global batch size: 256 | lm loss: 2.073534E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.063 | TFLOPs: 17.91 | 31: iteration 41110/ 173500 | consumed samples: 10524160 | consumed tokens: 21553479680 | elapsed time per iteration (s): 0.81 | learning rate: 1.777E-04 | global batch size: 256 | lm loss: 2.083957E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.450 | TFLOPs: 19.20 | 31: iteration 41120/ 
173500 | consumed samples: 10526720 | consumed tokens: 21558722560 | elapsed time per iteration (s): 0.80 | learning rate: 1.776E-04 | global batch size: 256 | lm loss: 2.097016E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.516 | TFLOPs: 19.45 | 31: iteration 41130/ 173500 | consumed samples: 10529280 | consumed tokens: 21563965440 | elapsed time per iteration (s): 0.81 | learning rate: 1.776E-04 | global batch size: 256 | lm loss: 2.118274E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.507 | TFLOPs: 19.21 | 31: iteration 41140/ 173500 | consumed samples: 10531840 | consumed tokens: 21569208320 | elapsed time per iteration (s): 0.79 | learning rate: 1.776E-04 | global batch size: 256 | lm loss: 2.088807E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.467 | TFLOPs: 19.51 | 31: iteration 41150/ 173500 | consumed samples: 10534400 | consumed tokens: 21574451200 | elapsed time per iteration (s): 0.80 | learning rate: 1.776E-04 | global batch size: 256 | lm loss: 2.112462E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.615 | TFLOPs: 19.34 | 31: iteration 41160/ 173500 | consumed samples: 10536960 | consumed tokens: 21579694080 | elapsed time per iteration (s): 0.82 | learning rate: 1.776E-04 | global batch size: 256 | lm loss: 2.107411E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.929 | TFLOPs: 18.87 | 31: iteration 41170/ 173500 | consumed samples: 10539520 | consumed tokens: 21584936960 | elapsed time per iteration (s): 0.80 | learning rate: 1.776E-04 | global batch size: 256 | lm loss: 2.074914E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 320.075 | TFLOPs: 19.36 | 31: iteration 41180/ 173500 | consumed samples: 10542080 | consumed tokens: 21590179840 | elapsed time per iteration (s): 0.83 | learning rate: 1.776E-04 | global batch size: 256 | lm loss: 2.048108E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.838 | TFLOPs: 18.74 | 31: iteration 41190/ 173500 | consumed samples: 10544640 | consumed tokens: 21595422720 | elapsed time per iteration (s): 0.77 | learning rate: 1.776E-04 | global batch size: 256 | lm loss: 2.098934E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.618 | TFLOPs: 20.24 | 31: iteration 41200/ 173500 | consumed samples: 10547200 | consumed tokens: 21600665600 | elapsed time per iteration (s): 0.80 | learning rate: 1.776E-04 | global batch size: 256 | lm loss: 2.058799E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.578 | TFLOPs: 19.33 | 31: iteration 41210/ 173500 | consumed samples: 10549760 | consumed tokens: 21605908480 | elapsed time per iteration (s): 2.60 | learning rate: 1.775E-04 | global batch size: 256 | lm loss: 2.094720E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 98.321 | TFLOPs: 5.95 | 31: iteration 41220/ 173500 | consumed samples: 10552320 | consumed tokens: 21611151360 | elapsed time per iteration (s): 0.77 | learning rate: 1.775E-04 | global batch size: 256 | lm loss: 2.098973E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.025 | TFLOPs: 20.15 | 31: iteration 41230/ 173500 | consumed samples: 10554880 | consumed tokens: 21616394240 | elapsed time per iteration (s): 0.79 | learning rate: 1.775E-04 
| global batch size: 256 | lm loss: 2.099247E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.654 | TFLOPs: 19.52 | 31: iteration 41240/ 173500 | consumed samples: 10557440 | consumed tokens: 21621637120 | elapsed time per iteration (s): 0.80 | learning rate: 1.775E-04 | global batch size: 256 | lm loss: 2.082032E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.766 | TFLOPs: 19.47 | 31: iteration 41250/ 173500 | consumed samples: 10560000 | consumed tokens: 21626880000 | elapsed time per iteration (s): 0.75 | learning rate: 1.775E-04 | global batch size: 256 | lm loss: 2.082281E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.052 | TFLOPs: 20.63 | 31: iteration 41260/ 173500 | consumed samples: 10562560 | consumed tokens: 21632122880 | elapsed time per iteration (s): 0.82 | learning rate: 1.775E-04 | global batch size: 256 | lm loss: 2.085539E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.241 | TFLOPs: 18.83 | 31: iteration 41270/ 173500 | consumed samples: 10565120 | consumed tokens: 21637365760 | elapsed time per iteration (s): 0.79 | learning rate: 1.775E-04 | global batch size: 256 | lm loss: 2.086231E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.622 | TFLOPs: 19.58 | 31: iteration 41280/ 173500 | consumed samples: 10567680 | consumed tokens: 21642608640 | elapsed time per iteration (s): 0.75 | learning rate: 1.775E-04 | global batch size: 256 | lm loss: 2.108295E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.430 | TFLOPs: 20.66 | 31: iteration 41290/ 173500 | consumed samples: 
10570240 | consumed tokens: 21647851520 | elapsed time per iteration (s): 0.78 | learning rate: 1.775E-04 | global batch size: 256 | lm loss: 2.081036E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.193 | TFLOPs: 19.79 | 31: iteration 41300/ 173500 | consumed samples: 10572800 | consumed tokens: 21653094400 | elapsed time per iteration (s): 0.76 | learning rate: 1.774E-04 | global batch size: 256 | lm loss: 2.091643E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.938 | TFLOPs: 20.26 | 31: iteration 41310/ 173500 | consumed samples: 10575360 | consumed tokens: 21658337280 | elapsed time per iteration (s): 0.75 | learning rate: 1.774E-04 | global batch size: 256 | lm loss: 2.093416E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.442 | TFLOPs: 20.60 | 31: iteration 41320/ 173500 | consumed samples: 10577920 | consumed tokens: 21663580160 | elapsed time per iteration (s): 0.77 | learning rate: 1.774E-04 | global batch size: 256 | lm loss: 2.115576E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.352 | TFLOPs: 20.05 | 31: iteration 41330/ 173500 | consumed samples: 10580480 | consumed tokens: 21668823040 | elapsed time per iteration (s): 0.79 | learning rate: 1.774E-04 | global batch size: 256 | lm loss: 2.092259E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.853 | TFLOPs: 19.65 | 31: iteration 41340/ 173500 | consumed samples: 10583040 | consumed tokens: 21674065920 | elapsed time per iteration (s): 0.75 | learning rate: 1.774E-04 | global batch size: 256 | lm loss: 2.101824E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 340.215 | TFLOPs: 20.58 | 31: iteration 41350/ 173500 | consumed samples: 10585600 | consumed tokens: 21679308800 | elapsed time per iteration (s): 0.77 | learning rate: 1.774E-04 | global batch size: 256 | lm loss: 2.123277E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.434 | TFLOPs: 20.23 | 31: iteration 41360/ 173500 | consumed samples: 10588160 | consumed tokens: 21684551680 | elapsed time per iteration (s): 0.84 | learning rate: 1.774E-04 | global batch size: 256 | lm loss: 2.065218E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.793 | TFLOPs: 18.50 | 31: iteration 41370/ 173500 | consumed samples: 10590720 | consumed tokens: 21689794560 | elapsed time per iteration (s): 0.81 | learning rate: 1.774E-04 | global batch size: 256 | lm loss: 2.075762E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.301 | TFLOPs: 19.07 | 31: iteration 41380/ 173500 | consumed samples: 10593280 | consumed tokens: 21695037440 | elapsed time per iteration (s): 0.81 | learning rate: 1.774E-04 | global batch size: 256 | lm loss: 2.091203E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.802 | TFLOPs: 19.17 | 31: iteration 41390/ 173500 | consumed samples: 10595840 | consumed tokens: 21700280320 | elapsed time per iteration (s): 0.79 | learning rate: 1.773E-04 | global batch size: 256 | lm loss: 2.081000E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.609 | TFLOPs: 19.70 | 31: iteration 41400/ 173500 | consumed samples: 10598400 | consumed tokens: 21705523200 | elapsed time per iteration (s): 0.80 | learning rate: 1.773E-04 | global batch size: 256 | 
lm loss: 2.077687E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.531 | TFLOPs: 19.39 | 31: iteration 41410/ 173500 | consumed samples: 10600960 | consumed tokens: 21710766080 | elapsed time per iteration (s): 0.78 | learning rate: 1.773E-04 | global batch size: 256 | lm loss: 2.075015E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.084 | TFLOPs: 19.97 | 31: iteration 41420/ 173500 | consumed samples: 10603520 | consumed tokens: 21716008960 | elapsed time per iteration (s): 0.79 | learning rate: 1.773E-04 | global batch size: 256 | lm loss: 2.079360E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.956 | TFLOPs: 19.54 | 31: iteration 41430/ 173500 | consumed samples: 10606080 | consumed tokens: 21721251840 | elapsed time per iteration (s): 0.75 | learning rate: 1.773E-04 | global batch size: 256 | lm loss: 2.082999E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.815 | TFLOPs: 20.62 | 31: iteration 41440/ 173500 | consumed samples: 10608640 | consumed tokens: 21726494720 | elapsed time per iteration (s): 0.79 | learning rate: 1.773E-04 | global batch size: 256 | lm loss: 2.075962E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.935 | TFLOPs: 19.72 | 31: iteration 41450/ 173500 | consumed samples: 10611200 | consumed tokens: 21731737600 | elapsed time per iteration (s): 0.80 | learning rate: 1.773E-04 | global batch size: 256 | lm loss: 2.096824E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.826 | TFLOPs: 19.47 | 31: iteration 41460/ 173500 | consumed samples: 10613760 | consumed 
tokens: 21736980480 | elapsed time per iteration (s): 0.76 | learning rate: 1.773E-04 | global batch size: 256 | lm loss: 2.063071E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.686 | TFLOPs: 20.31 | 31: iteration 41470/ 173500 | consumed samples: 10616320 | consumed tokens: 21742223360 | elapsed time per iteration (s): 0.73 | learning rate: 1.773E-04 | global batch size: 256 | lm loss: 2.083260E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.766 | TFLOPs: 21.10 | 31: iteration 41480/ 173500 | consumed samples: 10618880 | consumed tokens: 21747466240 | elapsed time per iteration (s): 0.80 | learning rate: 1.772E-04 | global batch size: 256 | lm loss: 2.084045E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.977 | TFLOPs: 19.36 | 31: iteration 41490/ 173500 | consumed samples: 10621440 | consumed tokens: 21752709120 | elapsed time per iteration (s): 0.85 | learning rate: 1.772E-04 | global batch size: 256 | lm loss: 2.111963E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.504 | TFLOPs: 18.12 | 31: iteration 41500/ 173500 | consumed samples: 10624000 | consumed tokens: 21757952000 | elapsed time per iteration (s): 0.82 | learning rate: 1.772E-04 | global batch size: 256 | lm loss: 2.107359E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.168 | TFLOPs: 18.89 | 31: iteration 41510/ 173500 | consumed samples: 10626560 | consumed tokens: 21763194880 | elapsed time per iteration (s): 0.78 | learning rate: 1.772E-04 | global batch size: 256 | lm loss: 2.117627E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 329.460 | TFLOPs: 19.93 | 31: iteration 41520/ 173500 | consumed samples: 10629120 | consumed tokens: 21768437760 | elapsed time per iteration (s): 0.83 | learning rate: 1.772E-04 | global batch size: 256 | lm loss: 2.106423E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.727 | TFLOPs: 18.62 | 31: iteration 41530/ 173500 | consumed samples: 10631680 | consumed tokens: 21773680640 | elapsed time per iteration (s): 0.74 | learning rate: 1.772E-04 | global batch size: 256 | lm loss: 2.090029E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.868 | TFLOPs: 20.86 | 31: iteration 41540/ 173500 | consumed samples: 10634240 | consumed tokens: 21778923520 | elapsed time per iteration (s): 0.83 | learning rate: 1.772E-04 | global batch size: 256 | lm loss: 2.139829E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.112 | TFLOPs: 18.58 | 31: iteration 41550/ 173500 | consumed samples: 10636800 | consumed tokens: 21784166400 | elapsed time per iteration (s): 0.79 | learning rate: 1.772E-04 | global batch size: 256 | lm loss: 2.098213E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.286 | TFLOPs: 19.56 | 31: iteration 41560/ 173500 | consumed samples: 10639360 | consumed tokens: 21789409280 | elapsed time per iteration (s): 0.87 | learning rate: 1.772E-04 | global batch size: 256 | lm loss: 2.087508E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.549 | TFLOPs: 17.82 | 31: iteration 41570/ 173500 | consumed samples: 10641920 | consumed tokens: 21794652160 | elapsed time per iteration (s): 0.75 | learning rate: 1.772E-04 | global batch size: 256 | lm loss: 2.070644E+00 | 
grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.985 | TFLOPs: 20.63 | 31: iteration 41580/ 173500 | consumed samples: 10644480 | consumed tokens: 21799895040 | elapsed time per iteration (s): 0.79 | learning rate: 1.771E-04 | global batch size: 256 | lm loss: 2.073018E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.093 | TFLOPs: 19.49 | 31: iteration 41590/ 173500 | consumed samples: 10647040 | consumed tokens: 21805137920 | elapsed time per iteration (s): 0.76 | learning rate: 1.771E-04 | global batch size: 256 | lm loss: 2.118476E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.867 | TFLOPs: 20.44 | 31: iteration 41600/ 173500 | consumed samples: 10649600 | consumed tokens: 21810380800 | elapsed time per iteration (s): 0.76 | learning rate: 1.771E-04 | global batch size: 256 | lm loss: 2.074069E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.310 | TFLOPs: 20.41 | 31: iteration 41610/ 173500 | consumed samples: 10652160 | consumed tokens: 21815623680 | elapsed time per iteration (s): 0.74 | learning rate: 1.771E-04 | global batch size: 256 | lm loss: 2.118183E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.370 | TFLOPs: 20.95 | 31: iteration 41620/ 173500 | consumed samples: 10654720 | consumed tokens: 21820866560 | elapsed time per iteration (s): 0.76 | learning rate: 1.771E-04 | global batch size: 256 | lm loss: 2.054285E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.181 | TFLOPs: 20.40 | 31: iteration 41630/ 173500 | consumed samples: 10657280 | consumed tokens: 21826109440 | elapsed 
time per iteration (s): 0.77 | learning rate: 1.771E-04 | global batch size: 256 | lm loss: 2.072378E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.567 | TFLOPs: 20.00 | 31: iteration 41640/ 173500 | consumed samples: 10659840 | consumed tokens: 21831352320 | elapsed time per iteration (s): 0.79 | learning rate: 1.771E-04 | global batch size: 256 | lm loss: 2.109053E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.893 | TFLOPs: 19.66 | 31: iteration 41650/ 173500 | consumed samples: 10662400 | consumed tokens: 21836595200 | elapsed time per iteration (s): 0.81 | learning rate: 1.771E-04 | global batch size: 256 | lm loss: 2.100549E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.962 | TFLOPs: 19.24 | 31: iteration 41660/ 173500 | consumed samples: 10664960 | consumed tokens: 21841838080 | elapsed time per iteration (s): 0.79 | learning rate: 1.771E-04 | global batch size: 256 | lm loss: 2.114150E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.900 | TFLOPs: 19.66 | 31: iteration 41670/ 173500 | consumed samples: 10667520 | consumed tokens: 21847080960 | elapsed time per iteration (s): 0.83 | learning rate: 1.770E-04 | global batch size: 256 | lm loss: 2.131956E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.096 | TFLOPs: 18.76 | 31: iteration 41680/ 173500 | consumed samples: 10670080 | consumed tokens: 21852323840 | elapsed time per iteration (s): 0.82 | learning rate: 1.770E-04 | global batch size: 256 | lm loss: 2.074282E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.092 | TFLOPs: 
19.00 | 31: iteration 41690/ 173500 | consumed samples: 10672640 | consumed tokens: 21857566720 | elapsed time per iteration (s): 0.74 | learning rate: 1.770E-04 | global batch size: 256 | lm loss: 2.109509E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.801 | TFLOPs: 20.80 | 31: iteration 41700/ 173500 | consumed samples: 10675200 | consumed tokens: 21862809600 | elapsed time per iteration (s): 0.79 | learning rate: 1.770E-04 | global batch size: 256 | lm loss: 2.100837E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.691 | TFLOPs: 19.64 | 31: iteration 41710/ 173500 | consumed samples: 10677760 | consumed tokens: 21868052480 | elapsed time per iteration (s): 0.73 | learning rate: 1.770E-04 | global batch size: 256 | lm loss: 2.088332E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.542 | TFLOPs: 21.27 | 31: iteration 41720/ 173500 | consumed samples: 10680320 | consumed tokens: 21873295360 | elapsed time per iteration (s): 0.79 | learning rate: 1.770E-04 | global batch size: 256 | lm loss: 2.091360E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.960 | TFLOPs: 19.54 | 31: iteration 41730/ 173500 | consumed samples: 10682880 | consumed tokens: 21878538240 | elapsed time per iteration (s): 0.79 | learning rate: 1.770E-04 | global batch size: 256 | lm loss: 2.125863E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.203 | TFLOPs: 19.61 | 31: iteration 41740/ 173500 | consumed samples: 10685440 | consumed tokens: 21883781120 | elapsed time per iteration (s): 0.74 | learning rate: 1.770E-04 | global batch size: 256 | lm loss: 2.079000E+00 | grad norm: 0.149 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.329 | TFLOPs: 21.01 | 31: iteration 41750/ 173500 | consumed samples: 10688000 | consumed tokens: 21889024000 | elapsed time per iteration (s): 0.82 | learning rate: 1.770E-04 | global batch size: 256 | lm loss: 2.090691E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.748 | TFLOPs: 18.98 | 31: iteration 41760/ 173500 | consumed samples: 10690560 | consumed tokens: 21894266880 | elapsed time per iteration (s): 0.74 | learning rate: 1.769E-04 | global batch size: 256 | lm loss: 2.114885E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.972 | TFLOPs: 20.81 | 31: iteration 41770/ 173500 | consumed samples: 10693120 | consumed tokens: 21899509760 | elapsed time per iteration (s): 0.75 | learning rate: 1.769E-04 | global batch size: 256 | lm loss: 2.104400E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.459 | TFLOPs: 20.60 | 31: iteration 41780/ 173500 | consumed samples: 10695680 | consumed tokens: 21904752640 | elapsed time per iteration (s): 0.74 | learning rate: 1.769E-04 | global batch size: 256 | lm loss: 2.106452E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.316 | TFLOPs: 20.83 | 31: iteration 41790/ 173500 | consumed samples: 10698240 | consumed tokens: 21909995520 | elapsed time per iteration (s): 0.78 | learning rate: 1.769E-04 | global batch size: 256 | lm loss: 2.076518E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.075 | TFLOPs: 19.97 | 31: iteration 41800/ 173500 | consumed samples: 10700800 | consumed tokens: 21915238400 | elapsed time per iteration (s): 0.77 | 
learning rate: 1.769E-04 | global batch size: 256 | lm loss: 2.076292E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.995 | TFLOPs: 20.08 | 31: iteration 41810/ 173500 | consumed samples: 10703360 | consumed tokens: 21920481280 | elapsed time per iteration (s): 0.83 | learning rate: 1.769E-04 | global batch size: 256 | lm loss: 2.108220E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.745 | TFLOPs: 18.68 | 31: iteration 41820/ 173500 | consumed samples: 10705920 | consumed tokens: 21925724160 | elapsed time per iteration (s): 0.81 | learning rate: 1.769E-04 | global batch size: 256 | lm loss: 2.074380E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.822 | TFLOPs: 19.17 | 31: iteration 41830/ 173500 | consumed samples: 10708480 | consumed tokens: 21930967040 | elapsed time per iteration (s): 0.82 | learning rate: 1.769E-04 | global batch size: 256 | lm loss: 2.081905E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.606 | TFLOPs: 18.85 | 31: iteration 41840/ 173500 | consumed samples: 10711040 | consumed tokens: 21936209920 | elapsed time per iteration (s): 0.91 | learning rate: 1.769E-04 | global batch size: 256 | lm loss: 2.084205E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.770 | TFLOPs: 17.11 | 31: iteration 41850/ 173500 | consumed samples: 10713600 | consumed tokens: 21941452800 | elapsed time per iteration (s): 0.81 | learning rate: 1.768E-04 | global batch size: 256 | lm loss: 2.080162E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.312 | TFLOPs: 19.08 | 31: iteration 41860/ 
173500 | consumed samples: 10716160 | consumed tokens: 21946695680 | elapsed time per iteration (s): 0.83 | learning rate: 1.768E-04 | global batch size: 256 | lm loss: 2.104248E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.335 | TFLOPs: 18.71 | 31: iteration 41870/ 173500 | consumed samples: 10718720 | consumed tokens: 21951938560 | elapsed time per iteration (s): 0.87 | learning rate: 1.768E-04 | global batch size: 256 | lm loss: 2.094888E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.786 | TFLOPs: 17.83 | 31: iteration 41880/ 173500 | consumed samples: 10721280 | consumed tokens: 21957181440 | elapsed time per iteration (s): 0.84 | learning rate: 1.768E-04 | global batch size: 256 | lm loss: 2.091907E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.111 | TFLOPs: 18.34 | 31: iteration 41890/ 173500 | consumed samples: 10723840 | consumed tokens: 21962424320 | elapsed time per iteration (s): 0.85 | learning rate: 1.768E-04 | global batch size: 256 | lm loss: 2.124947E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.489 | TFLOPs: 18.24 | 31: iteration 41900/ 173500 | consumed samples: 10726400 | consumed tokens: 21967667200 | elapsed time per iteration (s): 0.82 | learning rate: 1.768E-04 | global batch size: 256 | lm loss: 2.089880E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.674 | TFLOPs: 18.86 | 31: iteration 41910/ 173500 | consumed samples: 10728960 | consumed tokens: 21972910080 | elapsed time per iteration (s): 0.82 | learning rate: 1.768E-04 | global batch size: 256 | lm loss: 2.097408E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 313.046 | TFLOPs: 18.94 | 31: iteration 41920/ 173500 | consumed samples: 10731520 | consumed tokens: 21978152960 | elapsed time per iteration (s): 0.81 | learning rate: 1.768E-04 | global batch size: 256 | lm loss: 2.096576E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.530 | TFLOPs: 19.09 | 31: iteration 41930/ 173500 | consumed samples: 10734080 | consumed tokens: 21983395840 | elapsed time per iteration (s): 0.82 | learning rate: 1.768E-04 | global batch size: 256 | lm loss: 2.079781E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.966 | TFLOPs: 18.87 | 31: iteration 41940/ 173500 | consumed samples: 10736640 | consumed tokens: 21988638720 | elapsed time per iteration (s): 0.77 | learning rate: 1.767E-04 | global batch size: 256 | lm loss: 2.083046E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.954 | TFLOPs: 20.02 | 31: iteration 41950/ 173500 | consumed samples: 10739200 | consumed tokens: 21993881600 | elapsed time per iteration (s): 0.77 | learning rate: 1.767E-04 | global batch size: 256 | lm loss: 2.097574E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.340 | TFLOPs: 20.11 | 31: iteration 41960/ 173500 | consumed samples: 10741760 | consumed tokens: 21999124480 | elapsed time per iteration (s): 0.78 | learning rate: 1.767E-04 | global batch size: 256 | lm loss: 2.119104E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.386 | TFLOPs: 19.81 | 31: iteration 41970/ 173500 | consumed samples: 10744320 | consumed tokens: 22004367360 | elapsed time per iteration (s): 0.78 | learning rate: 
1.767E-04 | global batch size: 256 | lm loss: 2.072671E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.317 | TFLOPs: 19.74 | 31: iteration 41980/ 173500 | consumed samples: 10746880 | consumed tokens: 22009610240 | elapsed time per iteration (s): 0.76 | learning rate: 1.767E-04 | global batch size: 256 | lm loss: 2.097895E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.502 | TFLOPs: 20.48 | 31: iteration 41990/ 173500 | consumed samples: 10749440 | consumed tokens: 22014853120 | elapsed time per iteration (s): 0.80 | learning rate: 1.767E-04 | global batch size: 256 | lm loss: 2.090005E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.792 | TFLOPs: 19.35 | 0: [2022-11-26 03:31:33,108] [INFO] [logging.py:68:log_dist] [Rank 0] step=42000, skipped=0, lr=[0.00017667737143212697, 0.00017667737143212697, 0.00017667737143212697], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 42000/ 173500 | consumed samples: 10752000 | consumed tokens: 22020096000 | elapsed time per iteration (s): 0.85 | learning rate: 1.767E-04 | global batch size: 256 | lm loss: 2.079638E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.105 | TFLOPs: 18.28 | 0: steps: 42000 loss: 2.0931 iter time (s): 0.806 samples/sec: 317.740 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 42000 | lm loss value: 2.053797E+00 | lm loss PPL: 7.797448E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 42000 to checkpoints_1b1long 0: [2022-11-26 03:31:33,471] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint 
global_step42000 is begin to save! 0: [2022-11-26 03:31:33,483] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_01-model_00-model_states.pt... 0: [2022-11-26 03:31:33,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_01-model_00-model_states.pt. 0: [2022-11-26 03:31:33,697] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_03-model_00-model_states.pt... 0: [2022-11-26 03:31:33,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_03-model_00-model_states.pt. 0: [2022-11-26 03:31:33,776] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_04-model_00-model_states.pt... 0: [2022-11-26 03:31:33,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_04-model_00-model_states.pt. 0: [2022-11-26 03:31:33,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_05-model_00-model_states.pt... 0: [2022-11-26 03:31:33,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_05-model_00-model_states.pt. 0: [2022-11-26 03:31:33,933] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_06-model_00-model_states.pt... 0: [2022-11-26 03:31:34,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_06-model_00-model_states.pt. 0: [2022-11-26 03:31:34,007] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_07-model_00-model_states.pt... 0: [2022-11-26 03:31:34,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 03:31:34,081] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_08-model_00-model_states.pt... 0: [2022-11-26 03:31:34,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_08-model_00-model_states.pt. 0: [2022-11-26 03:31:34,156] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_09-model_00-model_states.pt... 0: [2022-11-26 03:31:34,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_09-model_00-model_states.pt. 0: [2022-11-26 03:31:34,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_10-model_00-model_states.pt... 0: [2022-11-26 03:31:34,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_10-model_00-model_states.pt. 0: [2022-11-26 03:31:34,302] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_11-model_00-model_states.pt... 0: [2022-11-26 03:31:34,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_11-model_00-model_states.pt. 0: [2022-11-26 03:31:34,376] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_12-model_00-model_states.pt... 0: [2022-11-26 03:31:34,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_12-model_00-model_states.pt. 0: [2022-11-26 03:31:34,452] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_13-model_00-model_states.pt... 0: [2022-11-26 03:31:34,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 03:31:34,524] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_14-model_00-model_states.pt... 0: [2022-11-26 03:31:34,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_14-model_00-model_states.pt. 0: [2022-11-26 03:31:34,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_15-model_00-model_states.pt... 0: [2022-11-26 03:31:34,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_15-model_00-model_states.pt. 0: [2022-11-26 03:31:34,672] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_16-model_00-model_states.pt... 0: [2022-11-26 03:31:34,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_16-model_00-model_states.pt. 0: [2022-11-26 03:31:34,746] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_17-model_00-model_states.pt... 0: [2022-11-26 03:31:34,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_17-model_00-model_states.pt. 0: [2022-11-26 03:31:34,821] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_18-model_00-model_states.pt... 0: [2022-11-26 03:31:34,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_18-model_00-model_states.pt. 0: [2022-11-26 03:31:34,897] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_19-model_00-model_states.pt... 0: [2022-11-26 03:31:34,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 03:31:34,971] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_20-model_00-model_states.pt... 0: [2022-11-26 03:31:35,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_20-model_00-model_states.pt. 0: [2022-11-26 03:31:35,045] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_21-model_00-model_states.pt... 0: [2022-11-26 03:31:35,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_21-model_00-model_states.pt. 0: [2022-11-26 03:31:35,119] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_22-model_00-model_states.pt... 0: [2022-11-26 03:31:35,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_22-model_00-model_states.pt. 0: [2022-11-26 03:31:35,190] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_23-model_00-model_states.pt... 0: [2022-11-26 03:31:35,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_23-model_00-model_states.pt. 0: [2022-11-26 03:31:35,278] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_24-model_00-model_states.pt... 0: [2022-11-26 03:31:35,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_24-model_00-model_states.pt. 0: [2022-11-26 03:31:35,352] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_25-model_00-model_states.pt... 0: [2022-11-26 03:31:35,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 03:31:35,427] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_26-model_00-model_states.pt... 0: [2022-11-26 03:31:35,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_26-model_00-model_states.pt. 0: [2022-11-26 03:31:35,503] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_27-model_00-model_states.pt... 0: [2022-11-26 03:31:35,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_27-model_00-model_states.pt. 0: [2022-11-26 03:31:35,573] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_28-model_00-model_states.pt... 0: [2022-11-26 03:31:35,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_28-model_00-model_states.pt. 0: [2022-11-26 03:31:35,649] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/layer_30-model_00-model_states.pt... 0: [2022-11-26 03:31:35,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/layer_30-model_00-model_states.pt. 0: [2022-11-26 03:31:35,654] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step42000/mp_rank_00_model_states.pt 0: [2022-11-26 03:31:35,654] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/mp_rank_00_model_states.pt... 0: [2022-11-26 03:31:35,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/mp_rank_00_model_states.pt. 0: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
0: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
4: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
10: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
2: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
12: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
11: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 
22: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 
21: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 
19: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 
4: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 
16: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 
12: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 
25: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 
11: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
31: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 
30: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 
19: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
7: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
3: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 
24: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 
21: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
9: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 
29: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:31:35,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 
30: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:31:35,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:31:35,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:31:35,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 03:31:35,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 7: [2022-11-26 03:31:35,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:31:35,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 03:31:35,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 8: [2022-11-26 03:31:35,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:31:35,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 03:31:35,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 29: [2022-11-26 03:31:35,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
29: [2022-11-26 03:31:35,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 03:31:35,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 30: [2022-11-26 03:31:35,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:31:35,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 03:31:35,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 14: [2022-11-26 03:31:35,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:31:35,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 03:31:35,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 0: [2022-11-26 03:31:35,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:31:35,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-26 03:31:35,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 03:31:35,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 03:31:35,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 0: [2022-11-26 03:31:35,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 10: [2022-11-26 03:31:35,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:31:35,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 03:31:35,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 28: [2022-11-26 03:31:35,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:31:35,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:31:35,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 03:31:35,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 11: [2022-11-26 03:31:35,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-26 03:31:35,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 03:31:35,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 5: [2022-11-26 03:31:35,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:31:35,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 03:31:35,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 3: [2022-11-26 03:31:35,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:31:35,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 03:31:35,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 12: [2022-11-26 03:31:35,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:31:35,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 9: [2022-11-26 03:31:35,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:31:35,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
9: [2022-11-26 03:31:35,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 03:31:35,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 9: [2022-11-26 03:31:35,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:31:35,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 03:31:35,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 31: [2022-11-26 03:31:35,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:31:35,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 03:31:35,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 23: [2022-11-26 03:31:35,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:31:35,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 03:31:35,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 12: [2022-11-26 03:31:35,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 03:31:35,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 03:31:35,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 2: [2022-11-26 03:31:35,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:31:35,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 03:31:35,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 11: [2022-11-26 03:31:35,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:31:35,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:31:35,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 22: [2022-11-26 03:31:35,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 11: [2022-11-26 03:31:35,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 22: [2022-11-26 03:31:35,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 8: [2022-11-26 03:31:35,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-26 03:31:35,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 30: [2022-11-26 03:31:35,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:31:35,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 30: [2022-11-26 03:31:35,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 03:31:35,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 5: [2022-11-26 03:31:35,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:31:35,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 03:31:35,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 2: [2022-11-26 03:31:35,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:31:35,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 03:31:35,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 14: [2022-11-26 03:31:35,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-26 03:31:35,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 03:31:35,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 28: [2022-11-26 03:31:35,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 03:31:35,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 28: [2022-11-26 03:31:35,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:31:35,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 03:31:35,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 21: [2022-11-26 03:31:35,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:31:35,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 03:31:35,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 29: [2022-11-26 03:31:35,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-26 03:31:35,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 03:31:35,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 10: [2022-11-26 03:31:35,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:31:35,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 4: [2022-11-26 03:31:35,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:31:35,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 4: [2022-11-26 03:31:35,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 03:31:35,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 23: [2022-11-26 03:31:35,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:31:35,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 20: [2022-11-26 03:31:35,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:31:35,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
20: [2022-11-26 03:31:35,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 03:31:35,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:31:35,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 20: [2022-11-26 03:31:35,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 03:31:35,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 20: [2022-11-26 03:31:35,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:31:35,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 03:31:35,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 12: [2022-11-26 03:31:35,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:31:35,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 03:31:35,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 11: [2022-11-26 03:31:35,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-26 03:31:35,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 03:31:35,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 0: [2022-11-26 03:31:35,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:31:35,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 03:31:35,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 7: [2022-11-26 03:31:35,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:31:35,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 03:31:35,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 4: [2022-11-26 03:31:35,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:31:35,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 03:31:35,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 13: [2022-11-26 03:31:35,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 03:31:35,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 03:31:35,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 4: [2022-11-26 03:31:35,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:31:35,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 03:31:35,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 13: [2022-11-26 03:31:35,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:31:35,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 03:31:35,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 9: [2022-11-26 03:31:35,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:31:35,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 29: [2022-11-26 03:31:35,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
29: [2022-11-26 03:31:35,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 9: [2022-11-26 03:31:35,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 29: [2022-11-26 03:31:35,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 7: [2022-11-26 03:31:35,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:31:35,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:31:35,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 03:31:35,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 30: [2022-11-26 03:31:35,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 03:31:35,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 2: [2022-11-26 03:31:35,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:31:35,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 03:31:35,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 2: [2022-11-26 03:31:35,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 28: [2022-11-26 03:31:35,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:31:35,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 8: [2022-11-26 03:31:35,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 0: [2022-11-26 03:31:35,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:31:35,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 03:31:35,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 24: [2022-11-26 03:31:35,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:31:35,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 03:31:35,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 3: [2022-11-26 03:31:35,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 03:31:35,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 24: [2022-11-26 03:31:35,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:31:35,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 24: [2022-11-26 03:31:35,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 20: [2022-11-26 03:31:35,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:31:35,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 24: [2022-11-26 03:31:35,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 20: [2022-11-26 03:31:35,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 15: [2022-11-26 03:31:35,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:31:35,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 03:31:35,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 24: [2022-11-26 03:31:35,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
24: [2022-11-26 03:31:35,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 03:31:35,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 22: [2022-11-26 03:31:35,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:31:35,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 14: [2022-11-26 03:31:35,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:31:35,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 22: [2022-11-26 03:31:35,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:31:35,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 22: [2022-11-26 03:31:35,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 14: [2022-11-26 03:31:35,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 22: [2022-11-26 03:31:35,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
28: [2022-11-26 03:31:35,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 21: [2022-11-26 03:31:35,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:31:35,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 21: [2022-11-26 03:31:35,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 15: [2022-11-26 03:31:35,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:31:35,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 15: [2022-11-26 03:31:35,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 03:31:35,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 4: [2022-11-26 03:31:35,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:31:35,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 03:31:35,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 9: [2022-11-26 03:31:35,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
12: [2022-11-26 03:31:35,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:31:35,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 03:31:35,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 12: [2022-11-26 03:31:35,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 03:31:35,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 15: [2022-11-26 03:31:35,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:31:35,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 03:31:35,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 11: [2022-11-26 03:31:35,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:31:35,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 03:31:35,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 10: [2022-11-26 03:31:35,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 03:31:35,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 03:31:35,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 5: [2022-11-26 03:31:35,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:31:35,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 27: [2022-11-26 03:31:35,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:31:35,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 27: [2022-11-26 03:31:35,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 03:31:35,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 27: [2022-11-26 03:31:35,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:31:35,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
27: [2022-11-26 03:31:35,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 03:31:35,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 10: [2022-11-26 03:31:35,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:31:35,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 10: [2022-11-26 03:31:35,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 27: [2022-11-26 03:31:35,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 10: [2022-11-26 03:31:35,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 31: [2022-11-26 03:31:35,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:31:35,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:31:35,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 03:31:35,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
31: [2022-11-26 03:31:35,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 03:31:35,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 25: [2022-11-26 03:31:35,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:31:35,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 03:31:35,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 25: [2022-11-26 03:31:35,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:31:35,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:31:35,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 03:31:35,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 03:31:35,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 25: [2022-11-26 03:31:35,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 7: [2022-11-26 03:31:35,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
17: [2022-11-26 03:31:35,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:31:35,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 17: [2022-11-26 03:31:35,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 7: [2022-11-26 03:31:35,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 17: [2022-11-26 03:31:35,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 17: [2022-11-26 03:31:35,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:31:35,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 03:31:35,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 17: [2022-11-26 03:31:35,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:31:35,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 3: [2022-11-26 03:31:35,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:31:35,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
3: [2022-11-26 03:31:35,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 03:31:35,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 24: [2022-11-26 03:31:35,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:31:35,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 03:31:35,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 21: [2022-11-26 03:31:35,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:31:35,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:31:35,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
21: [2022-11-26 03:31:35,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 03:31:35,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 18: [2022-11-26 03:31:35,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 21: [2022-11-26 03:31:35,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 21: [2022-11-26 03:31:35,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 18: [2022-11-26 03:31:35,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 23: [2022-11-26 03:31:35,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:31:35,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:31:35,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 18: [2022-11-26 03:31:35,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 23: [2022-11-26 03:31:35,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 18: [2022-11-26 03:31:35,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
8: [2022-11-26 03:31:35,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:31:35,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 03:31:35,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 18: [2022-11-26 03:31:35,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:31:35,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 31: [2022-11-26 03:31:35,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:31:35,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 18: [2022-11-26 03:31:35,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:31:35,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 03:31:35,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
18: [2022-11-26 03:31:35,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 28: [2022-11-26 03:31:35,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:31:35,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 28: [2022-11-26 03:31:35,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 03:31:35,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 31: [2022-11-26 03:31:35,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:31:35,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 03:31:35,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 17: [2022-11-26 03:31:35,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:31:35,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 03:31:35,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 23: [2022-11-26 03:31:35,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
23: [2022-11-26 03:31:35,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 03:31:35,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 14: [2022-11-26 03:31:35,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:31:35,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 03:31:35,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 22: [2022-11-26 03:31:35,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:31:35,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 03:31:35,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 5: [2022-11-26 03:31:35,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:31:35,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 03:31:35,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 25: [2022-11-26 03:31:35,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
25: [2022-11-26 03:31:35,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 03:31:35,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 6: [2022-11-26 03:31:35,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:31:35,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:31:35,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:31:35,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:31:35,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 03:31:35,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
6: [2022-11-26 03:31:35,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 03:31:35,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 03:31:35,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 03:31:35,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 6: [2022-11-26 03:31:35,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 6: [2022-11-26 03:31:35,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 13: [2022-11-26 03:31:35,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:31:35,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 03:31:35,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 29: [2022-11-26 03:31:35,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:31:35,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 03:31:35,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
30: [2022-11-26 03:31:35,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:31:35,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 03:31:35,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 1: [2022-11-26 03:31:35,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:31:35,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:31:35,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:31:35,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 03:31:35,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 03:31:35,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 03:31:35,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 03:31:35,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 03:31:35,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 1: [2022-11-26 03:31:35,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 1: [2022-11-26 03:31:35,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 1: [2022-11-26 03:31:35,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 15: [2022-11-26 03:31:35,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:31:35,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 03:31:35,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 20: [2022-11-26 03:31:35,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
20: [2022-11-26 03:31:35,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 03:31:35,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 1: [2022-11-26 03:31:35,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:31:35,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 03:31:35,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 9: [2022-11-26 03:31:35,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:31:35,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 03:31:35,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 27: [2022-11-26 03:31:35,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:31:35,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 03:31:35,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 29: [2022-11-26 03:31:35,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-26 03:31:35,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 03:31:35,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 10: [2022-11-26 03:31:35,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:31:35,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 03:31:35,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 31: [2022-11-26 03:31:35,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:31:35,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 03:31:35,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 6: [2022-11-26 03:31:35,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:31:35,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 03:31:35,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 4: [2022-11-26 03:31:35,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 03:31:35,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 03:31:35,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 0: [2022-11-26 03:31:35,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:31:35,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 03:31:35,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 12: [2022-11-26 03:31:35,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:31:35,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 03:31:35,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 11: [2022-11-26 03:31:35,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:31:35,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 03:31:35,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 3: [2022-11-26 03:31:35,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 03:31:35,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 03:31:35,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 7: [2022-11-26 03:31:35,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:31:35,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 03:31:35,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 2: [2022-11-26 03:31:35,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:31:35,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 03:31:35,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 15: [2022-11-26 03:31:35,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:31:35,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 03:31:35,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 28: [2022-11-26 03:31:35,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
28: [2022-11-26 03:31:35,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 03:31:35,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 8: [2022-11-26 03:31:35,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:31:35,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 03:31:35,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 24: [2022-11-26 03:31:35,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:31:35,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 03:31:35,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 5: [2022-11-26 03:31:35,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:31:35,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
5: [2022-11-26 03:31:35,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 21: [2022-11-26 03:31:35,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 5: [2022-11-26 03:31:35,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 21: [2022-11-26 03:31:35,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 30: [2022-11-26 03:31:35,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:31:35,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 03:31:35,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 17: [2022-11-26 03:31:35,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:31:35,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 03:31:35,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 27: [2022-11-26 03:31:35,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 
27: [2022-11-26 03:31:35,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 03:31:35,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 18: [2022-11-26 03:31:35,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:31:35,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 03:31:35,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 22: [2022-11-26 03:31:35,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:31:35,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 03:31:35,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 13: [2022-11-26 03:31:35,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:31:35,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 03:31:35,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 14: [2022-11-26 03:31:35,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 03:31:35,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 03:31:35,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 23: [2022-11-26 03:31:35,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:31:35,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 03:31:35,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 9: [2022-11-26 03:31:35,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:31:35,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 03:31:35,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 25: [2022-11-26 03:31:35,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:31:35,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 6: [2022-11-26 03:31:35,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:31:35,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
6: [2022-11-26 03:31:35,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 03:31:35,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 10: [2022-11-26 03:31:35,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:31:35,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 03:31:35,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 1: [2022-11-26 03:31:35,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:31:35,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 03:31:35,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 20: [2022-11-26 03:31:35,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:31:35,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 03:31:35,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 29: [2022-11-26 03:31:35,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-26 03:31:35,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 03:31:35,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 0: [2022-11-26 03:31:35,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:31:35,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 03:31:35,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 12: [2022-11-26 03:31:35,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:31:35,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:31:35,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 03:31:35,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 4: [2022-11-26 03:31:35,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 03:31:35,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 16: [2022-11-26 03:31:35,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-26 03:31:35,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 03:31:35,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 26: [2022-11-26 03:31:35,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:31:35,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 03:31:35,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 2: [2022-11-26 03:31:35,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:31:35,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 03:31:35,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 11: [2022-11-26 03:31:35,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:31:35,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 03:31:35,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 19: [2022-11-26 03:31:35,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
19: [2022-11-26 03:31:35,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 03:31:35,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 3: [2022-11-26 03:31:35,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:31:35,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 03:31:35,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 31: [2022-11-26 03:31:35,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:31:35,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:31:35,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 7: [2022-11-26 03:31:35,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 03:31:35,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 31: [2022-11-26 03:31:35,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 19: [2022-11-26 03:31:35,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-26 03:31:35,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 03:31:35,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 21: [2022-11-26 03:31:35,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:31:35,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 03:31:35,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 26: [2022-11-26 03:31:35,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:31:35,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:31:35,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:31:35,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:31:35,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:31:35,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
9: [2022-11-26 03:31:35,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:31:35,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:31:35,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:31:35,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:31:35,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:31:35,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:31:35,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:31:35,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:31:35,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:31:35,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:31:35,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
25: [2022-11-26 03:31:35,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:31:35,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:31:35,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:31:35,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:31:35,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:31:35,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:31:35,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:31:35,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:31:35,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:31:35,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:31:35,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 
21: [2022-11-26 03:31:35,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:31:35,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:31:35,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 19: [2022-11-26 03:31:35,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:31:35,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:31:35,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 6: [2022-11-26 03:31:35,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 5: [2022-11-26 03:31:35,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 7: [2022-11-26 03:31:35,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 4: [2022-11-26 03:31:35,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 9: [2022-11-26 03:31:35,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved 
checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 8: [2022-11-26 03:31:35,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 10: [2022-11-26 03:31:35,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 1: [2022-11-26 03:31:35,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 16: [2022-11-26 03:31:35,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 2: [2022-11-26 03:31:35,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 13: [2022-11-26 03:31:35,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 3: [2022-11-26 03:31:35,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 12: [2022-11-26 03:31:35,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 15: [2022-11-26 03:31:35,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 20: [2022-11-26 03:31:35,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 25: [2022-11-26 03:31:35,947] 
[INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 23: [2022-11-26 03:31:35,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 11: [2022-11-26 03:31:35,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 28: [2022-11-26 03:31:35,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 24: [2022-11-26 03:31:35,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 14: [2022-11-26 03:31:35,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 31: [2022-11-26 03:31:35,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 22: [2022-11-26 03:31:35,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 30: [2022-11-26 03:31:35,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 17: [2022-11-26 03:31:35,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 18: [2022-11-26 03:31:35,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved 
checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 26: [2022-11-26 03:31:35,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 19: [2022-11-26 03:31:35,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 0: [2022-11-26 03:31:35,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 6: [2022-11-26 03:31:35,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 5: [2022-11-26 03:31:35,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 7: [2022-11-26 03:31:35,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 4: [2022-11-26 03:31:35,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 9: [2022-11-26 03:31:35,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 8: [2022-11-26 03:31:35,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 10: [2022-11-26 03:31:35,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 1: [2022-11-26 03:31:35,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 16: [2022-11-26 03:31:35,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 2: [2022-11-26 03:31:35,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 13: [2022-11-26 03:31:35,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
3: [2022-11-26 03:31:35,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 12: [2022-11-26 03:31:35,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 15: [2022-11-26 03:31:35,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 20: [2022-11-26 03:31:35,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 25: [2022-11-26 03:31:35,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 23: [2022-11-26 03:31:35,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 11: [2022-11-26 03:31:35,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 28: [2022-11-26 03:31:35,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 24: [2022-11-26 03:31:35,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 14: [2022-11-26 03:31:35,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 31: [2022-11-26 03:31:35,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 29: [2022-11-26 03:31:35,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 22: [2022-11-26 03:31:35,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 30: [2022-11-26 03:31:35,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 17: [2022-11-26 03:31:35,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
21: [2022-11-26 03:31:35,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 18: [2022-11-26 03:31:35,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 26: [2022-11-26 03:31:35,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:31:35,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 27: [2022-11-26 03:31:35,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 0: [2022-11-26 03:31:35,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:31:35,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:31:35,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:31:35,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:31:35,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:31:35,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
8: [2022-11-26 03:31:35,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:31:35,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:31:35,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:31:35,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:31:35,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:31:35,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:31:35,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:31:35,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:31:35,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:31:35,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:31:35,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
23: [2022-11-26 03:31:35,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:31:35,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:31:35,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:31:35,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:31:35,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:31:35,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:31:35,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 22: [2022-11-26 03:31:35,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:31:35,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:31:35,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:31:35,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
18: [2022-11-26 03:31:35,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:31:35,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 27: [2022-11-26 03:31:35,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 0: [2022-11-26 03:31:35,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 6: [2022-11-26 03:31:35,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 5: [2022-11-26 03:31:35,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 7: [2022-11-26 03:31:35,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 4: [2022-11-26 03:31:35,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 9: [2022-11-26 03:31:35,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 8: [2022-11-26 03:31:35,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 10: [2022-11-26 03:31:35,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 1: 
[2022-11-26 03:31:35,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 16: [2022-11-26 03:31:35,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 2: [2022-11-26 03:31:35,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 13: [2022-11-26 03:31:35,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 3: [2022-11-26 03:31:35,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 12: [2022-11-26 03:31:35,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 15: [2022-11-26 03:31:35,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 20: [2022-11-26 03:31:35,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 25: [2022-11-26 03:31:35,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 23: [2022-11-26 03:31:35,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 11: [2022-11-26 03:31:35,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved 
checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 28: [2022-11-26 03:31:35,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 24: [2022-11-26 03:31:35,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 14: [2022-11-26 03:31:35,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 31: [2022-11-26 03:31:35,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 29: [2022-11-26 03:31:35,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:31:35,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 03:31:35,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 30: [2022-11-26 03:31:35,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 17: [2022-11-26 03:31:35,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 21: [2022-11-26 03:31:35,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
18: [2022-11-26 03:31:35,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 26: [2022-11-26 03:31:35,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 27: [2022-11-26 03:31:35,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:31:35,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 6: [2022-11-26 03:31:35,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 5: [2022-11-26 03:31:35,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 7: [2022-11-26 03:31:35,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 4: [2022-11-26 03:31:35,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 9: [2022-11-26 03:31:35,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 8: [2022-11-26 03:31:35,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 10: [2022-11-26 03:31:35,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 1: [2022-11-26 03:31:35,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 16: [2022-11-26 03:31:35,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 2: [2022-11-26 03:31:35,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 13: [2022-11-26 03:31:35,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
3: [2022-11-26 03:31:35,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 12: [2022-11-26 03:31:35,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 15: [2022-11-26 03:31:35,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 20: [2022-11-26 03:31:35,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 25: [2022-11-26 03:31:35,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 23: [2022-11-26 03:31:35,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 11: [2022-11-26 03:31:35,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 28: [2022-11-26 03:31:35,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 24: [2022-11-26 03:31:35,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 14: [2022-11-26 03:31:35,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 31: [2022-11-26 03:31:35,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 29: [2022-11-26 03:31:35,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 22: [2022-11-26 03:31:35,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:31:35,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
17: [2022-11-26 03:31:35,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 21: [2022-11-26 03:31:35,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 18: [2022-11-26 03:31:35,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 26: [2022-11-26 03:31:35,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:31:35,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 5: [2022-11-26 03:31:35,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:31:35,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:31:35,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:31:35,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:31:35,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:31:35,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
23: [2022-11-26 03:31:35,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:31:35,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:31:35,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:31:35,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:31:35,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 22: [2022-11-26 03:31:35,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 17: [2022-11-26 03:31:35,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:31:35,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 18: [2022-11-26 03:31:35,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:31:35,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 27: [2022-11-26 03:31:35,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
5: [2022-11-26 03:31:35,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 8: [2022-11-26 03:31:35,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 16: [2022-11-26 03:31:35,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 13: [2022-11-26 03:31:35,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 15: [2022-11-26 03:31:35,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 25: [2022-11-26 03:31:35,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 23: [2022-11-26 03:31:35,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 28: [2022-11-26 03:31:35,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 24: [2022-11-26 03:31:35,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 14: [2022-11-26 03:31:35,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 22: [2022-11-26 03:31:35,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 
is ready now! 30: [2022-11-26 03:31:35,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:31:35,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 18: [2022-11-26 03:31:35,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 26: [2022-11-26 03:31:35,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 27: [2022-11-26 03:31:35,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:31:35,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 8: [2022-11-26 03:31:35,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 16: [2022-11-26 03:31:35,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 13: [2022-11-26 03:31:35,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 15: [2022-11-26 03:31:35,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 25: [2022-11-26 03:31:35,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 23: [2022-11-26 03:31:35,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 28: [2022-11-26 03:31:35,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 24: [2022-11-26 03:31:35,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
14: [2022-11-26 03:31:35,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 30: [2022-11-26 03:31:35,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 18: [2022-11-26 03:31:35,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 27: [2022-11-26 03:31:35,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 16: [2022-11-26 03:31:35,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:31:35,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 17: [2022-11-26 03:31:35,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 27: [2022-11-26 03:31:35,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 16: [2022-11-26 03:31:35,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 03:31:35,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 19: [2022-11-26 03:31:35,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:31:35,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 03:31:35,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
19: [2022-11-26 03:31:35,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:31:35,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 03:31:35,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 16: [2022-11-26 03:31:35,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:31:35,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 03:31:35,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 19: [2022-11-26 03:31:35,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:31:35,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 03:31:35,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 16: [2022-11-26 03:31:35,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:31:35,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 03:31:35,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
26: [2022-11-26 03:31:35,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:31:35,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:31:35,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 03:31:35,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 03:31:35,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 26: [2022-11-26 03:31:35,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 19: [2022-11-26 03:31:35,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:31:35,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 03:31:35,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 19: [2022-11-26 03:31:36,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:31:36,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 03:31:36,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
16: [2022-11-26 03:31:36,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:31:36,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 03:31:36,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 26: [2022-11-26 03:31:36,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:31:36,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 03:31:36,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 26: [2022-11-26 03:31:36,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:31:36,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step42000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 03:31:36,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
0: successfully saved checkpoint at iteration 42000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2553.23 31: iteration 42010/ 173500 | consumed samples: 10754560 | consumed tokens: 22025338880 | elapsed time per iteration (s): 1.13 | learning rate: 1.767E-04 | global batch size: 256 | lm loss: 2.101408E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 226.863 | TFLOPs: 13.72 | 31: iteration 42020/ 173500 | consumed samples: 10757120 | consumed tokens: 22030581760 | elapsed time per iteration (s): 0.80 | learning rate: 1.767E-04 | global batch size: 256 | lm loss: 2.095992E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.517 | TFLOPs: 19.33 | 31: iteration 42030/ 173500 | consumed samples: 10759680 | consumed tokens: 22035824640 | elapsed time per iteration (s): 1.16 | learning rate: 1.766E-04 | global batch size: 256 | lm loss: 2.095506E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 221.041 | TFLOPs: 13.37 | 31: iteration 42040/ 173500 | consumed samples: 10762240 | consumed tokens: 22041067520 | elapsed time per iteration (s): 1.14 | learning rate: 1.766E-04 | global batch size: 256 | lm loss: 2.099303E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 225.040 | TFLOPs: 13.61 | 31: iteration 42050/ 173500 | consumed samples: 10764800 | consumed tokens: 22046310400 | elapsed time per iteration (s): 0.78 | learning rate: 1.766E-04 | global batch size: 256 | lm loss: 2.105884E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.316 | TFLOPs: 19.80 | 31: iteration 42060/ 173500 | consumed samples: 10767360 | consumed tokens: 22051553280 | elapsed time per iteration (s): 0.76 | 
learning rate: 1.766E-04 | global batch size: 256 | lm loss: 2.090830E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.905 | TFLOPs: 20.26 | 31: iteration 42070/ 173500 | consumed samples: 10769920 | consumed tokens: 22056796160 | elapsed time per iteration (s): 0.77 | learning rate: 1.766E-04 | global batch size: 256 | lm loss: 2.092814E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.031 | TFLOPs: 20.15 | 31: iteration 42080/ 173500 | consumed samples: 10772480 | consumed tokens: 22062039040 | elapsed time per iteration (s): 0.77 | learning rate: 1.766E-04 | global batch size: 256 | lm loss: 2.098686E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.004 | TFLOPs: 20.15 | 31: iteration 42090/ 173500 | consumed samples: 10775040 | consumed tokens: 22067281920 | elapsed time per iteration (s): 0.83 | learning rate: 1.766E-04 | global batch size: 256 | lm loss: 2.102291E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.951 | TFLOPs: 18.69 | 31: iteration 42100/ 173500 | consumed samples: 10777600 | consumed tokens: 22072524800 | elapsed time per iteration (s): 0.78 | learning rate: 1.766E-04 | global batch size: 256 | lm loss: 2.098021E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.450 | TFLOPs: 19.75 | 31: iteration 42110/ 173500 | consumed samples: 10780160 | consumed tokens: 22077767680 | elapsed time per iteration (s): 0.75 | learning rate: 1.766E-04 | global batch size: 256 | lm loss: 2.098019E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.029 | TFLOPs: 20.57 | 31: iteration 42120/ 
173500 | consumed samples: 10782720 | consumed tokens: 22083010560 | elapsed time per iteration (s): 0.79 | learning rate: 1.765E-04 | global batch size: 256 | lm loss: 2.095635E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.263 | TFLOPs: 19.62 | 31: iteration 42130/ 173500 | consumed samples: 10785280 | consumed tokens: 22088253440 | elapsed time per iteration (s): 0.80 | learning rate: 1.765E-04 | global batch size: 256 | lm loss: 2.107305E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.653 | TFLOPs: 19.28 | 31: iteration 42140/ 173500 | consumed samples: 10787840 | consumed tokens: 22093496320 | elapsed time per iteration (s): 0.76 | learning rate: 1.765E-04 | global batch size: 256 | lm loss: 2.085971E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.997 | TFLOPs: 20.33 | 31: iteration 42150/ 173500 | consumed samples: 10790400 | consumed tokens: 22098739200 | elapsed time per iteration (s): 0.85 | learning rate: 1.765E-04 | global batch size: 256 | lm loss: 2.093301E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.418 | TFLOPs: 18.24 | 31: iteration 42160/ 173500 | consumed samples: 10792960 | consumed tokens: 22103982080 | elapsed time per iteration (s): 0.75 | learning rate: 1.765E-04 | global batch size: 256 | lm loss: 2.108190E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.824 | TFLOPs: 20.74 | 31: iteration 42170/ 173500 | consumed samples: 10795520 | consumed tokens: 22109224960 | elapsed time per iteration (s): 0.79 | learning rate: 1.765E-04 | global batch size: 256 | lm loss: 2.093078E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 326.008 | TFLOPs: 19.72 | 31: iteration 42180/ 173500 | consumed samples: 10798080 | consumed tokens: 22114467840 | elapsed time per iteration (s): 0.75 | learning rate: 1.765E-04 | global batch size: 256 | lm loss: 2.078930E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.972 | TFLOPs: 20.57 | 31: iteration 42190/ 173500 | consumed samples: 10800640 | consumed tokens: 22119710720 | elapsed time per iteration (s): 0.76 | learning rate: 1.765E-04 | global batch size: 256 | lm loss: 2.095040E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.352 | TFLOPs: 20.41 | 31: iteration 42200/ 173500 | consumed samples: 10803200 | consumed tokens: 22124953600 | elapsed time per iteration (s): 0.82 | learning rate: 1.765E-04 | global batch size: 256 | lm loss: 2.118840E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.147 | TFLOPs: 18.94 | 31: iteration 42210/ 173500 | consumed samples: 10805760 | consumed tokens: 22130196480 | elapsed time per iteration (s): 0.76 | learning rate: 1.764E-04 | global batch size: 256 | lm loss: 2.066474E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.854 | TFLOPs: 20.26 | 31: iteration 42220/ 173500 | consumed samples: 10808320 | consumed tokens: 22135439360 | elapsed time per iteration (s): 0.78 | learning rate: 1.764E-04 | global batch size: 256 | lm loss: 2.079345E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.906 | TFLOPs: 19.90 | 31: iteration 42230/ 173500 | consumed samples: 10810880 | consumed tokens: 22140682240 | elapsed time per iteration (s): 0.80 | learning rate: 
1.764E-04 | global batch size: 256 | lm loss: 2.062065E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.692 | TFLOPs: 19.34 | 31: iteration 42240/ 173500 | consumed samples: 10813440 | consumed tokens: 22145925120 | elapsed time per iteration (s): 0.78 | learning rate: 1.764E-04 | global batch size: 256 | lm loss: 2.088686E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.528 | TFLOPs: 19.88 | 31: iteration 42250/ 173500 | consumed samples: 10816000 | consumed tokens: 22151168000 | elapsed time per iteration (s): 0.77 | learning rate: 1.764E-04 | global batch size: 256 | lm loss: 2.118142E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.471 | TFLOPs: 20.17 | 31: iteration 42260/ 173500 | consumed samples: 10818560 | consumed tokens: 22156410880 | elapsed time per iteration (s): 0.76 | learning rate: 1.764E-04 | global batch size: 256 | lm loss: 2.086129E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.335 | TFLOPs: 20.29 | 31: iteration 42270/ 173500 | consumed samples: 10821120 | consumed tokens: 22161653760 | elapsed time per iteration (s): 0.78 | learning rate: 1.764E-04 | global batch size: 256 | lm loss: 2.080417E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.980 | TFLOPs: 19.90 | 31: iteration 42280/ 173500 | consumed samples: 10823680 | consumed tokens: 22166896640 | elapsed time per iteration (s): 0.75 | learning rate: 1.764E-04 | global batch size: 256 | lm loss: 2.099490E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.176 | TFLOPs: 20.58 | 31: iteration 42290/ 173500 | 
consumed samples: 10826240 | consumed tokens: 22172139520 | elapsed time per iteration (s): 0.82 | learning rate: 1.764E-04 | global batch size: 256 | lm loss: 2.083811E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.282 | TFLOPs: 18.89 | 31: iteration 42300/ 173500 | consumed samples: 10828800 | consumed tokens: 22177382400 | elapsed time per iteration (s): 0.81 | learning rate: 1.763E-04 | global batch size: 256 | lm loss: 2.083055E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.847 | TFLOPs: 19.23 | 31: iteration 42310/ 173500 | consumed samples: 10831360 | consumed tokens: 22182625280 | elapsed time per iteration (s): 0.76 | learning rate: 1.763E-04 | global batch size: 256 | lm loss: 2.079987E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.308 | TFLOPs: 20.29 | 31: iteration 42320/ 173500 | consumed samples: 10833920 | consumed tokens: 22187868160 | elapsed time per iteration (s): 0.76 | learning rate: 1.763E-04 | global batch size: 256 | lm loss: 2.052790E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.788 | TFLOPs: 20.25 | 31: iteration 42330/ 173500 | consumed samples: 10836480 | consumed tokens: 22193111040 | elapsed time per iteration (s): 0.79 | learning rate: 1.763E-04 | global batch size: 256 | lm loss: 2.100451E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.688 | TFLOPs: 19.52 | 31: iteration 42340/ 173500 | consumed samples: 10839040 | consumed tokens: 22198353920 | elapsed time per iteration (s): 0.78 | learning rate: 1.763E-04 | global batch size: 256 | lm loss: 2.083617E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 330.264 | TFLOPs: 19.98 | 31: iteration 42350/ 173500 | consumed samples: 10841600 | consumed tokens: 22203596800 | elapsed time per iteration (s): 0.81 | learning rate: 1.763E-04 | global batch size: 256 | lm loss: 2.077402E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.543 | TFLOPs: 19.03 | 31: iteration 42360/ 173500 | consumed samples: 10844160 | consumed tokens: 22208839680 | elapsed time per iteration (s): 0.78 | learning rate: 1.763E-04 | global batch size: 256 | lm loss: 2.104412E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.397 | TFLOPs: 19.87 | 31: iteration 42370/ 173500 | consumed samples: 10846720 | consumed tokens: 22214082560 | elapsed time per iteration (s): 0.81 | learning rate: 1.763E-04 | global batch size: 256 | lm loss: 2.102597E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.480 | TFLOPs: 19.03 | 31: iteration 42380/ 173500 | consumed samples: 10849280 | consumed tokens: 22219325440 | elapsed time per iteration (s): 0.80 | learning rate: 1.763E-04 | global batch size: 256 | lm loss: 2.073111E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.968 | TFLOPs: 19.36 | 31: iteration 42390/ 173500 | consumed samples: 10851840 | consumed tokens: 22224568320 | elapsed time per iteration (s): 0.84 | learning rate: 1.762E-04 | global batch size: 256 | lm loss: 2.092266E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.983 | TFLOPs: 18.45 | 31: iteration 42400/ 173500 | consumed samples: 10854400 | consumed tokens: 22229811200 | elapsed time per iteration (s): 0.82 | learning rate: 1.762E-04 | global batch 
size: 256 | lm loss: 2.129649E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.127 | TFLOPs: 18.94 | 31: iteration 42410/ 173500 | consumed samples: 10856960 | consumed tokens: 22235054080 | elapsed time per iteration (s): 0.78 | learning rate: 1.762E-04 | global batch size: 256 | lm loss: 2.051451E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.125 | TFLOPs: 19.85 | 31: iteration 42420/ 173500 | consumed samples: 10859520 | consumed tokens: 22240296960 | elapsed time per iteration (s): 0.78 | learning rate: 1.762E-04 | global batch size: 256 | lm loss: 2.066667E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.135 | TFLOPs: 19.85 | 31: iteration 42430/ 173500 | consumed samples: 10862080 | consumed tokens: 22245539840 | elapsed time per iteration (s): 0.80 | learning rate: 1.762E-04 | global batch size: 256 | lm loss: 2.063438E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.052 | TFLOPs: 19.24 | 31: iteration 42440/ 173500 | consumed samples: 10864640 | consumed tokens: 22250782720 | elapsed time per iteration (s): 0.83 | learning rate: 1.762E-04 | global batch size: 256 | lm loss: 2.097929E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.796 | TFLOPs: 18.62 | 31: iteration 42450/ 173500 | consumed samples: 10867200 | consumed tokens: 22256025600 | elapsed time per iteration (s): 0.81 | learning rate: 1.762E-04 | global batch size: 256 | lm loss: 2.087230E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.221 | TFLOPs: 19.19 | 31: iteration 42460/ 173500 | consumed samples: 10869760 | 
consumed tokens: 22261268480 | elapsed time per iteration (s): 0.80 | learning rate: 1.762E-04 | global batch size: 256 | lm loss: 2.105512E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.443 | TFLOPs: 19.39 | 31: iteration 42470/ 173500 | consumed samples: 10872320 | consumed tokens: 22266511360 | elapsed time per iteration (s): 0.83 | learning rate: 1.762E-04 | global batch size: 256 | lm loss: 2.073873E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.985 | TFLOPs: 18.57 | 31: iteration 42480/ 173500 | consumed samples: 10874880 | consumed tokens: 22271754240 | elapsed time per iteration (s): 0.86 | learning rate: 1.761E-04 | global batch size: 256 | lm loss: 2.076146E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.671 | TFLOPs: 18.01 | 31: iteration 42490/ 173500 | consumed samples: 10877440 | consumed tokens: 22276997120 | elapsed time per iteration (s): 0.83 | learning rate: 1.761E-04 | global batch size: 256 | lm loss: 2.123357E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.896 | TFLOPs: 18.75 | 31: iteration 42500/ 173500 | consumed samples: 10880000 | consumed tokens: 22282240000 | elapsed time per iteration (s): 0.78 | learning rate: 1.761E-04 | global batch size: 256 | lm loss: 2.094200E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.658 | TFLOPs: 19.88 | 31: iteration 42510/ 173500 | consumed samples: 10882560 | consumed tokens: 22287482880 | elapsed time per iteration (s): 0.80 | learning rate: 1.761E-04 | global batch size: 256 | lm loss: 2.112869E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 320.797 | TFLOPs: 19.41 | 31: iteration 42520/ 173500 | consumed samples: 10885120 | consumed tokens: 22292725760 | elapsed time per iteration (s): 0.84 | learning rate: 1.761E-04 | global batch size: 256 | lm loss: 2.057733E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.113 | TFLOPs: 18.40 | 31: iteration 42530/ 173500 | consumed samples: 10887680 | consumed tokens: 22297968640 | elapsed time per iteration (s): 0.85 | learning rate: 1.761E-04 | global batch size: 256 | lm loss: 2.086802E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.913 | TFLOPs: 18.14 | 31: iteration 42540/ 173500 | consumed samples: 10890240 | consumed tokens: 22303211520 | elapsed time per iteration (s): 0.81 | learning rate: 1.761E-04 | global batch size: 256 | lm loss: 2.067295E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.155 | TFLOPs: 19.13 | 31: iteration 42550/ 173500 | consumed samples: 10892800 | consumed tokens: 22308454400 | elapsed time per iteration (s): 0.82 | learning rate: 1.761E-04 | global batch size: 256 | lm loss: 2.097216E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.304 | TFLOPs: 18.95 | 31: iteration 42560/ 173500 | consumed samples: 10895360 | consumed tokens: 22313697280 | elapsed time per iteration (s): 1.77 | learning rate: 1.761E-04 | global batch size: 256 | lm loss: 2.079632E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 144.827 | TFLOPs: 8.76 | 31: iteration 42570/ 173500 | consumed samples: 10897920 | consumed tokens: 22318940160 | elapsed time per iteration (s): 0.89 | learning rate: 1.760E-04 | global batch size: 256 | lm loss: 
2.082053E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.340 | TFLOPs: 17.38 | 31: iteration 42580/ 173500 | consumed samples: 10900480 | consumed tokens: 22324183040 | elapsed time per iteration (s): 0.84 | learning rate: 1.760E-04 | global batch size: 256 | lm loss: 2.092653E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.398 | TFLOPs: 18.54 | 31: iteration 42590/ 173500 | consumed samples: 10903040 | consumed tokens: 22329425920 | elapsed time per iteration (s): 0.83 | learning rate: 1.760E-04 | global batch size: 256 | lm loss: 2.104936E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.040 | TFLOPs: 18.64 | 31: iteration 42600/ 173500 | consumed samples: 10905600 | consumed tokens: 22334668800 | elapsed time per iteration (s): 0.82 | learning rate: 1.760E-04 | global batch size: 256 | lm loss: 2.096643E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.991 | TFLOPs: 18.87 | 31: iteration 42610/ 173500 | consumed samples: 10908160 | consumed tokens: 22339911680 | elapsed time per iteration (s): 0.84 | learning rate: 1.760E-04 | global batch size: 256 | lm loss: 2.075241E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.085 | TFLOPs: 18.34 | 31: iteration 42620/ 173500 | consumed samples: 10910720 | consumed tokens: 22345154560 | elapsed time per iteration (s): 0.80 | learning rate: 1.760E-04 | global batch size: 256 | lm loss: 2.065878E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.211 | TFLOPs: 19.31 | 31: iteration 42630/ 173500 | consumed samples: 10913280 | consumed tokens: 
22350397440 | elapsed time per iteration (s): 0.82 | learning rate: 1.760E-04 | global batch size: 256 | lm loss: 2.061733E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.137 | TFLOPs: 18.82 | 31: iteration 42640/ 173500 | consumed samples: 10915840 | consumed tokens: 22355640320 | elapsed time per iteration (s): 0.84 | learning rate: 1.760E-04 | global batch size: 256 | lm loss: 2.089031E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.525 | TFLOPs: 18.48 | 31: iteration 42650/ 173500 | consumed samples: 10918400 | consumed tokens: 22360883200 | elapsed time per iteration (s): 0.83 | learning rate: 1.760E-04 | global batch size: 256 | lm loss: 2.074783E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.958 | TFLOPs: 18.69 | 31: iteration 42660/ 173500 | consumed samples: 10920960 | consumed tokens: 22366126080 | elapsed time per iteration (s): 0.81 | learning rate: 1.759E-04 | global batch size: 256 | lm loss: 2.068572E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.847 | TFLOPs: 19.05 | 31: iteration 42670/ 173500 | consumed samples: 10923520 | consumed tokens: 22371368960 | elapsed time per iteration (s): 2.72 | learning rate: 1.759E-04 | global batch size: 256 | lm loss: 2.087802E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 94.059 | TFLOPs: 5.69 | 31: iteration 42680/ 173500 | consumed samples: 10926080 | consumed tokens: 22376611840 | elapsed time per iteration (s): 0.78 | learning rate: 1.759E-04 | global batch size: 256 | lm loss: 2.095958E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 329.258 | TFLOPs: 19.92 | 31: iteration 42690/ 173500 | consumed samples: 10928640 | consumed tokens: 22381854720 | elapsed time per iteration (s): 4.57 | learning rate: 1.759E-04 | global batch size: 256 | lm loss: 2.060777E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 55.962 | TFLOPs: 3.39 | 31: iteration 42700/ 173500 | consumed samples: 10931200 | consumed tokens: 22387097600 | elapsed time per iteration (s): 0.75 | learning rate: 1.759E-04 | global batch size: 256 | lm loss: 2.106001E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.906 | TFLOPs: 20.62 | 31: iteration 42710/ 173500 | consumed samples: 10933760 | consumed tokens: 22392340480 | elapsed time per iteration (s): 0.78 | learning rate: 1.759E-04 | global batch size: 256 | lm loss: 2.103452E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.157 | TFLOPs: 19.91 | 31: iteration 42720/ 173500 | consumed samples: 10936320 | consumed tokens: 22397583360 | elapsed time per iteration (s): 0.74 | learning rate: 1.759E-04 | global batch size: 256 | lm loss: 2.085445E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.997 | TFLOPs: 20.93 | 31: iteration 42730/ 173500 | consumed samples: 10938880 | consumed tokens: 22402826240 | elapsed time per iteration (s): 0.78 | learning rate: 1.759E-04 | global batch size: 256 | lm loss: 2.067108E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.060 | TFLOPs: 19.91 | 31: iteration 42740/ 173500 | consumed samples: 10941440 | consumed tokens: 22408069120 | elapsed time per iteration (s): 0.76 | learning rate: 1.759E-04 | global batch size: 256 | lm loss: 2.065352E+00 | grad 
norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.909 | TFLOPs: 20.26 | 31: iteration 42750/ 173500 | consumed samples: 10944000 | consumed tokens: 22413312000 | elapsed time per iteration (s): 0.78 | learning rate: 1.758E-04 | global batch size: 256 | lm loss: 2.056771E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.298 | TFLOPs: 19.98 | 31: iteration 42760/ 173500 | consumed samples: 10946560 | consumed tokens: 22418554880 | elapsed time per iteration (s): 0.74 | learning rate: 1.758E-04 | global batch size: 256 | lm loss: 2.064652E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.073 | TFLOPs: 20.94 | 31: iteration 42770/ 173500 | consumed samples: 10949120 | consumed tokens: 22423797760 | elapsed time per iteration (s): 0.81 | learning rate: 1.758E-04 | global batch size: 256 | lm loss: 2.090167E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.661 | TFLOPs: 19.10 | 31: iteration 42780/ 173500 | consumed samples: 10951680 | consumed tokens: 22429040640 | elapsed time per iteration (s): 0.79 | learning rate: 1.758E-04 | global batch size: 256 | lm loss: 2.078407E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.047 | TFLOPs: 19.66 | 31: iteration 42790/ 173500 | consumed samples: 10954240 | consumed tokens: 22434283520 | elapsed time per iteration (s): 0.75 | learning rate: 1.758E-04 | global batch size: 256 | lm loss: 2.093664E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.537 | TFLOPs: 20.78 | 31: iteration 42800/ 173500 | consumed samples: 10956800 | consumed tokens: 22439526400 | elapsed time 
per iteration (s): 0.77 | learning rate: 1.758E-04 | global batch size: 256 | lm loss: 2.079673E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.147 | TFLOPs: 20.22 | 31: iteration 42810/ 173500 | consumed samples: 10959360 | consumed tokens: 22444769280 | elapsed time per iteration (s): 0.78 | learning rate: 1.758E-04 | global batch size: 256 | lm loss: 2.055177E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.272 | TFLOPs: 19.80 | 31: iteration 42820/ 173500 | consumed samples: 10961920 | consumed tokens: 22450012160 | elapsed time per iteration (s): 0.78 | learning rate: 1.758E-04 | global batch size: 256 | lm loss: 2.073383E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.029 | TFLOPs: 19.78 | 31: iteration 42830/ 173500 | consumed samples: 10964480 | consumed tokens: 22455255040 | elapsed time per iteration (s): 0.87 | learning rate: 1.758E-04 | global batch size: 256 | lm loss: 2.083321E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.308 | TFLOPs: 17.87 | 31: iteration 42840/ 173500 | consumed samples: 10967040 | consumed tokens: 22460497920 | elapsed time per iteration (s): 0.79 | learning rate: 1.757E-04 | global batch size: 256 | lm loss: 2.111224E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.888 | TFLOPs: 19.59 | 31: iteration 42850/ 173500 | consumed samples: 10969600 | consumed tokens: 22465740800 | elapsed time per iteration (s): 1.16 | learning rate: 1.757E-04 | global batch size: 256 | lm loss: 2.078363E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 221.008 | TFLOPs: 
13.37 | 31: iteration 42860/ 173500 | consumed samples: 10972160 | consumed tokens: 22470983680 | elapsed time per iteration (s): 0.79 | learning rate: 1.757E-04 | global batch size: 256 | lm loss: 2.080032E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.534 | TFLOPs: 19.51 | 31: iteration 42870/ 173500 | consumed samples: 10974720 | consumed tokens: 22476226560 | elapsed time per iteration (s): 0.81 | learning rate: 1.757E-04 | global batch size: 256 | lm loss: 2.078827E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.943 | TFLOPs: 19.05 | 31: iteration 42880/ 173500 | consumed samples: 10977280 | consumed tokens: 22481469440 | elapsed time per iteration (s): 0.81 | learning rate: 1.757E-04 | global batch size: 256 | lm loss: 2.066592E+00 | grad norm: 0.507 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.228 | TFLOPs: 19.01 | 31: iteration 42890/ 173500 | consumed samples: 10979840 | consumed tokens: 22486712320 | elapsed time per iteration (s): 0.77 | learning rate: 1.757E-04 | global batch size: 256 | lm loss: 2.135246E+00 | grad norm: 2.770 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.247 | TFLOPs: 20.16 | 31: iteration 42900/ 173500 | consumed samples: 10982400 | consumed tokens: 22491955200 | elapsed time per iteration (s): 0.76 | learning rate: 1.757E-04 | global batch size: 256 | lm loss: 2.160122E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.428 | TFLOPs: 20.41 | 31: iteration 42910/ 173500 | consumed samples: 10984960 | consumed tokens: 22497198080 | elapsed time per iteration (s): 0.73 | learning rate: 1.757E-04 | global batch size: 256 | lm loss: 2.118741E+00 | grad norm: 0.147 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.696 | TFLOPs: 21.22 | 31: iteration 42920/ 173500 | consumed samples: 10987520 | consumed tokens: 22502440960 | elapsed time per iteration (s): 0.74 | learning rate: 1.757E-04 | global batch size: 256 | lm loss: 2.095229E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.278 | TFLOPs: 20.89 | 31: iteration 42930/ 173500 | consumed samples: 10990080 | consumed tokens: 22507683840 | elapsed time per iteration (s): 1.05 | learning rate: 1.756E-04 | global batch size: 256 | lm loss: 2.114029E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.293 | TFLOPs: 14.72 | 31: iteration 42940/ 173500 | consumed samples: 10992640 | consumed tokens: 22512926720 | elapsed time per iteration (s): 0.82 | learning rate: 1.756E-04 | global batch size: 256 | lm loss: 2.109749E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.597 | TFLOPs: 18.85 | 31: iteration 42950/ 173500 | consumed samples: 10995200 | consumed tokens: 22518169600 | elapsed time per iteration (s): 0.75 | learning rate: 1.756E-04 | global batch size: 256 | lm loss: 2.099998E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.858 | TFLOPs: 20.62 | 31: iteration 42960/ 173500 | consumed samples: 10997760 | consumed tokens: 22523412480 | elapsed time per iteration (s): 0.78 | learning rate: 1.756E-04 | global batch size: 256 | lm loss: 2.073102E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.535 | TFLOPs: 19.82 | 31: iteration 42970/ 173500 | consumed samples: 11000320 | consumed tokens: 22528655360 | elapsed time per iteration (s): 0.76 | 
learning rate: 1.756E-04 | global batch size: 256 | lm loss: 2.062989E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.782 | TFLOPs: 20.50 | 31: iteration 42980/ 173500 | consumed samples: 11002880 | consumed tokens: 22533898240 | elapsed time per iteration (s): 0.84 | learning rate: 1.756E-04 | global batch size: 256 | lm loss: 2.097364E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.348 | TFLOPs: 18.47 | 31: iteration 42990/ 173500 | consumed samples: 11005440 | consumed tokens: 22539141120 | elapsed time per iteration (s): 0.75 | learning rate: 1.756E-04 | global batch size: 256 | lm loss: 2.087340E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.489 | TFLOPs: 20.66 | 31: iteration 43000/ 173500 | consumed samples: 11008000 | consumed tokens: 22544384000 | elapsed time per iteration (s): 0.78 | learning rate: 1.756E-04 | global batch size: 256 | lm loss: 2.093912E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.915 | TFLOPs: 19.78 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 43000 | lm loss value: 2.084689E+00 | lm loss PPL: 8.042091E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 43000 to checkpoints_1b1long 0: [2022-11-26 03:46:10,199] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step43000 is begin to save! 0: [2022-11-26 03:46:10,213] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 03:46:10,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_01-model_00-model_states.pt. 0: [2022-11-26 03:46:10,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_03-model_00-model_states.pt... 0: [2022-11-26 03:46:10,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_03-model_00-model_states.pt. 0: [2022-11-26 03:46:10,505] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_04-model_00-model_states.pt... 0: [2022-11-26 03:46:10,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_04-model_00-model_states.pt. 0: [2022-11-26 03:46:10,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_05-model_00-model_states.pt... 0: [2022-11-26 03:46:10,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_05-model_00-model_states.pt. 0: [2022-11-26 03:46:10,654] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_06-model_00-model_states.pt... 0: [2022-11-26 03:46:10,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_06-model_00-model_states.pt. 0: [2022-11-26 03:46:10,735] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_07-model_00-model_states.pt... 0: [2022-11-26 03:46:10,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_07-model_00-model_states.pt. 0: [2022-11-26 03:46:10,809] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 03:46:10,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_08-model_00-model_states.pt. 0: [2022-11-26 03:46:10,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_09-model_00-model_states.pt... 0: [2022-11-26 03:46:10,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_09-model_00-model_states.pt. 0: [2022-11-26 03:46:10,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_10-model_00-model_states.pt... 0: [2022-11-26 03:46:11,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_10-model_00-model_states.pt. 0: [2022-11-26 03:46:11,037] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_11-model_00-model_states.pt... 0: [2022-11-26 03:46:11,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_11-model_00-model_states.pt. 0: [2022-11-26 03:46:11,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_12-model_00-model_states.pt... 0: [2022-11-26 03:46:11,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_12-model_00-model_states.pt. 0: [2022-11-26 03:46:11,190] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_13-model_00-model_states.pt... 0: [2022-11-26 03:46:11,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_13-model_00-model_states.pt. 0: [2022-11-26 03:46:11,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 03:46:11,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_14-model_00-model_states.pt. 0: [2022-11-26 03:46:11,342] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_15-model_00-model_states.pt... 0: [2022-11-26 03:46:11,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_15-model_00-model_states.pt. 0: [2022-11-26 03:46:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_16-model_00-model_states.pt... 0: [2022-11-26 03:46:11,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_16-model_00-model_states.pt. 0: [2022-11-26 03:46:11,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_17-model_00-model_states.pt... 0: [2022-11-26 03:46:11,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_17-model_00-model_states.pt. 0: [2022-11-26 03:46:11,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_18-model_00-model_states.pt... 0: [2022-11-26 03:46:11,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_18-model_00-model_states.pt. 0: [2022-11-26 03:46:11,647] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_19-model_00-model_states.pt... 0: [2022-11-26 03:46:11,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_19-model_00-model_states.pt. 0: [2022-11-26 03:46:11,723] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 03:46:11,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_20-model_00-model_states.pt. 0: [2022-11-26 03:46:11,801] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_21-model_00-model_states.pt... 0: [2022-11-26 03:46:11,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_21-model_00-model_states.pt. 0: [2022-11-26 03:46:11,874] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_22-model_00-model_states.pt... 0: [2022-11-26 03:46:11,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_22-model_00-model_states.pt. 0: [2022-11-26 03:46:11,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_23-model_00-model_states.pt... 0: [2022-11-26 03:46:12,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_23-model_00-model_states.pt. 0: [2022-11-26 03:46:12,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_24-model_00-model_states.pt... 0: [2022-11-26 03:46:12,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_24-model_00-model_states.pt. 0: [2022-11-26 03:46:12,105] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_25-model_00-model_states.pt... 0: [2022-11-26 03:46:12,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_25-model_00-model_states.pt. 0: [2022-11-26 03:46:12,182] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 03:46:12,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_26-model_00-model_states.pt. 0: [2022-11-26 03:46:12,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_27-model_00-model_states.pt... 0: [2022-11-26 03:46:12,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_27-model_00-model_states.pt. 0: [2022-11-26 03:46:12,334] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_28-model_00-model_states.pt... 0: [2022-11-26 03:46:12,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_28-model_00-model_states.pt. 0: [2022-11-26 03:46:12,411] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/layer_30-model_00-model_states.pt... 0: [2022-11-26 03:46:12,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/layer_30-model_00-model_states.pt. 0: [2022-11-26 03:46:12,413] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step43000/mp_rank_00_model_states.pt 0: [2022-11-26 03:46:12,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/mp_rank_00_model_states.pt... 0: [2022-11-26 03:46:12,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/mp_rank_00_model_states.pt. 0: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 
6: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
4: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
1: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 
2: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 
25: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 
24: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 
29: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 
17: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 
19: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
6: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 
9: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
16: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
15: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 
25: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 
24: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 
17: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 
27: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
7: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 
13: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 
28: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 
17: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 
20: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 
19: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:46:12,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 
21: [2022-11-26 03:46:12,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:46:12,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:46:12,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 03:46:12,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 5: [2022-11-26 03:46:12,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:46:12,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 03:46:12,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 13: [2022-11-26 03:46:12,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:46:12,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
13: [2022-11-26 03:46:12,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 22: [2022-11-26 03:46:12,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 13: [2022-11-26 03:46:12,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 22: [2022-11-26 03:46:12,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 4: [2022-11-26 03:46:12,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:46:12,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 03:46:12,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 12: [2022-11-26 03:46:12,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:46:12,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:46:12,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 11: [2022-11-26 03:46:12,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
30: [2022-11-26 03:46:12,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 12: [2022-11-26 03:46:12,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 30: [2022-11-26 03:46:12,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 11: [2022-11-26 03:46:12,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 27: [2022-11-26 03:46:12,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:46:12,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 27: [2022-11-26 03:46:12,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 03:46:12,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 26: [2022-11-26 03:46:12,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:46:12,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:46:12,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 03:46:12,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
0: [2022-11-26 03:46:12,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:46:12,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 9: [2022-11-26 03:46:12,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:46:12,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 9: [2022-11-26 03:46:12,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 03:46:12,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 8: [2022-11-26 03:46:12,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:46:12,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 03:46:12,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 1: [2022-11-26 03:46:12,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:46:12,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:46:12,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
9: [2022-11-26 03:46:12,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 1: [2022-11-26 03:46:12,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 20: [2022-11-26 03:46:12,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:46:12,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 1: [2022-11-26 03:46:12,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 20: [2022-11-26 03:46:12,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 1: [2022-11-26 03:46:12,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 1: [2022-11-26 03:46:12,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 20: [2022-11-26 03:46:12,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 21: [2022-11-26 03:46:12,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:46:12,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 03:46:12,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
5: [2022-11-26 03:46:12,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:46:12,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:46:12,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 19: [2022-11-26 03:46:12,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 03:46:12,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 5: [2022-11-26 03:46:12,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 19: [2022-11-26 03:46:12,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:46:12,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:46:12,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
19: [2022-11-26 03:46:12,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 03:46:12,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 12: [2022-11-26 03:46:12,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 30: [2022-11-26 03:46:12,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:46:12,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 12: [2022-11-26 03:46:12,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 11: [2022-11-26 03:46:12,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:46:12,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 19: [2022-11-26 03:46:12,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 3: [2022-11-26 03:46:12,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:46:12,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 30: [2022-11-26 03:46:12,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
3: [2022-11-26 03:46:12,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 11: [2022-11-26 03:46:12,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 3: [2022-11-26 03:46:12,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 22: [2022-11-26 03:46:12,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:46:12,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 03:46:12,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 3: [2022-11-26 03:46:12,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:46:12,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 13: [2022-11-26 03:46:12,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:46:12,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 13: [2022-11-26 03:46:12,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 20: [2022-11-26 03:46:12,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
13: [2022-11-26 03:46:12,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 20: [2022-11-26 03:46:12,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 03:46:12,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 6: [2022-11-26 03:46:12,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:46:12,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 7: [2022-11-26 03:46:12,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:46:12,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 7: [2022-11-26 03:46:12,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 03:46:12,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 27: [2022-11-26 03:46:12,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:46:12,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 03:46:12,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
6: [2022-11-26 03:46:12,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:46:12,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:46:12,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 03:46:12,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 25: [2022-11-26 03:46:12,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 4: [2022-11-26 03:46:12,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:46:12,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 25: [2022-11-26 03:46:12,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 4: [2022-11-26 03:46:12,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 31: [2022-11-26 03:46:12,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:46:12,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 
31: [2022-11-26 03:46:12,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 03:46:12,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 03:46:12,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 31: [2022-11-26 03:46:12,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 7: [2022-11-26 03:46:12,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:46:12,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 03:46:12,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 26: [2022-11-26 03:46:12,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:46:12,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 03:46:12,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 12: [2022-11-26 03:46:12,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 03:46:12,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 03:46:12,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 0: [2022-11-26 03:46:12,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:46:12,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:46:12,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 1: [2022-11-26 03:46:12,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:46:12,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 0: [2022-11-26 03:46:12,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 31: [2022-11-26 03:46:12,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 13: [2022-11-26 03:46:12,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 03:46:12,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 1: [2022-11-26 03:46:12,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 13: [2022-11-26 03:46:12,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 1: [2022-11-26 03:46:12,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 6: [2022-11-26 03:46:12,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:46:12,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 19: [2022-11-26 03:46:12,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:46:12,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 19: [2022-11-26 03:46:12,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 03:46:12,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 5: [2022-11-26 03:46:12,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 03:46:12,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 03:46:12,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 16: [2022-11-26 03:46:12,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:46:12,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 03:46:12,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:46:12,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 16: [2022-11-26 03:46:12,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 03:46:12,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 6: [2022-11-26 03:46:12,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:46:12,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 03:46:12,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 11: [2022-11-26 03:46:12,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-26 03:46:12,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 03:46:12,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 9: [2022-11-26 03:46:12,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:46:12,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 03:46:12,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 26: [2022-11-26 03:46:12,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:46:12,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 03:46:12,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 3: [2022-11-26 03:46:12,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:46:12,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 03:46:12,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 30: [2022-11-26 03:46:12,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
30: [2022-11-26 03:46:12,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 9: [2022-11-26 03:46:12,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:46:12,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 30: [2022-11-26 03:46:12,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 9: [2022-11-26 03:46:12,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 12: [2022-11-26 03:46:12,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:46:12,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 03:46:12,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 13: [2022-11-26 03:46:12,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:46:12,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 03:46:12,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 20: [2022-11-26 03:46:12,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
20: [2022-11-26 03:46:12,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 03:46:12,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 25: [2022-11-26 03:46:12,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:46:12,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 03:46:12,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:46:12,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 25: [2022-11-26 03:46:12,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 0: [2022-11-26 03:46:12,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:46:12,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 0: [2022-11-26 03:46:12,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 03:46:12,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 27: [2022-11-26 03:46:12,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
26: [2022-11-26 03:46:12,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:46:12,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 27: [2022-11-26 03:46:12,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 26: [2022-11-26 03:46:12,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 16: [2022-11-26 03:46:12,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:46:12,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:46:12,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 31: [2022-11-26 03:46:12,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 27: [2022-11-26 03:46:12,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 16: [2022-11-26 03:46:12,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 31: [2022-11-26 03:46:12,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 7: [2022-11-26 03:46:12,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 03:46:12,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 03:46:12,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 30: [2022-11-26 03:46:12,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:46:12,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 03:46:12,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 1: [2022-11-26 03:46:12,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:46:12,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 03:46:12,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 11: [2022-11-26 03:46:12,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:46:12,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 03:46:12,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 22: [2022-11-26 03:46:12,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
22: [2022-11-26 03:46:12,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 03:46:12,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 8: [2022-11-26 03:46:12,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:46:12,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:46:12,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 03:46:12,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 21: [2022-11-26 03:46:12,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:46:12,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 03:46:12,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 03:46:12,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 21: [2022-11-26 03:46:12,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 20: [2022-11-26 03:46:12,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 
20: [2022-11-26 03:46:12,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 03:46:12,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 3: [2022-11-26 03:46:12,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:46:12,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 03:46:12,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 17: [2022-11-26 03:46:12,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:46:12,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 03:46:12,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 17: [2022-11-26 03:46:12,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:46:12,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 03:46:12,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 17: [2022-11-26 03:46:12,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-26 03:46:12,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 03:46:12,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 17: [2022-11-26 03:46:12,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:46:12,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 03:46:12,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 24: [2022-11-26 03:46:12,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:46:12,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:46:12,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
24: [2022-11-26 03:46:12,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 03:46:12,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 4: [2022-11-26 03:46:12,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 24: [2022-11-26 03:46:12,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 24: [2022-11-26 03:46:12,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 4: [2022-11-26 03:46:12,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 22: [2022-11-26 03:46:12,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:46:12,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 03:46:12,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 4: [2022-11-26 03:46:12,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:46:12,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 03:46:12,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
5: [2022-11-26 03:46:12,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:46:12,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 03:46:12,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 8: [2022-11-26 03:46:12,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:46:12,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 03:46:12,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 7: [2022-11-26 03:46:12,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:46:12,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 28: [2022-11-26 03:46:12,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:46:12,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:46:12,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:46:12,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
16: [2022-11-26 03:46:12,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:46:12,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 28: [2022-11-26 03:46:12,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 03:46:12,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 03:46:12,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 16: [2022-11-26 03:46:12,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 28: [2022-11-26 03:46:12,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 28: [2022-11-26 03:46:12,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 28: [2022-11-26 03:46:12,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 27: [2022-11-26 03:46:12,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:46:12,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 03:46:12,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
10: [2022-11-26 03:46:12,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:46:12,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:46:12,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:46:12,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 03:46:12,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 03:46:12,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 24: [2022-11-26 03:46:12,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:46:12,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 10: [2022-11-26 03:46:12,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 10: [2022-11-26 03:46:12,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 24: [2022-11-26 03:46:12,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 03:46:12,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
2: [2022-11-26 03:46:12,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:46:12,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 03:46:12,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:46:12,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:46:12,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 2: [2022-11-26 03:46:12,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 03:46:12,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 03:46:12,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 2: [2022-11-26 03:46:12,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 8: [2022-11-26 03:46:12,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:46:12,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 03:46:12,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
25: [2022-11-26 03:46:12,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:46:12,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 03:46:12,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 6: [2022-11-26 03:46:12,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:46:12,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 03:46:12,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 15: [2022-11-26 03:46:12,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:46:12,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 03:46:12,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 15: [2022-11-26 03:46:12,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:46:12,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 03:46:12,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
19: [2022-11-26 03:46:12,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:46:12,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 03:46:12,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 20: [2022-11-26 03:46:12,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:46:12,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:46:12,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 03:46:12,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 17: [2022-11-26 03:46:12,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 03:46:12,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 15: [2022-11-26 03:46:12,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:46:12,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 03:46:12,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
10: [2022-11-26 03:46:12,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:46:12,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 03:46:12,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 18: [2022-11-26 03:46:12,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:46:12,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:46:12,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:46:12,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
18: [2022-11-26 03:46:12,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 03:46:12,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 03:46:12,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 03:46:12,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 03:46:12,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 18: [2022-11-26 03:46:12,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 18: [2022-11-26 03:46:12,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 18: [2022-11-26 03:46:12,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 23: [2022-11-26 03:46:12,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:46:12,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:46:12,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
23: [2022-11-26 03:46:12,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 03:46:12,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 03:46:12,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 03:46:12,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 23: [2022-11-26 03:46:12,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 14: [2022-11-26 03:46:12,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:46:12,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:46:12,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:46:12,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:46:12,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
14: [2022-11-26 03:46:12,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 03:46:12,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 03:46:12,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 03:46:12,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 03:46:12,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 14: [2022-11-26 03:46:12,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 14: [2022-11-26 03:46:12,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 14: [2022-11-26 03:46:12,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 0: [2022-11-26 03:46:12,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 03:46:12,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 12: [2022-11-26 03:46:12,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 03:46:12,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 03:46:12,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 29: [2022-11-26 03:46:12,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:46:12,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:46:12,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:46:12,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:46:12,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 03:46:12,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 03:46:12,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 03:46:12,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 03:46:12,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
29: [2022-11-26 03:46:12,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 29: [2022-11-26 03:46:12,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 29: [2022-11-26 03:46:12,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 22: [2022-11-26 03:46:12,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:46:12,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 03:46:12,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 18: [2022-11-26 03:46:12,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:46:12,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 03:46:12,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 1: [2022-11-26 03:46:12,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:46:12,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
1: [2022-11-26 03:46:12,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 03:46:12,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 4: [2022-11-26 03:46:12,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 03:46:12,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 29: [2022-11-26 03:46:12,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:46:12,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 03:46:12,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 0: [2022-11-26 03:46:12,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:46:12,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 03:46:12,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 21: [2022-11-26 03:46:12,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-26 03:46:12,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 03:46:12,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 27: [2022-11-26 03:46:12,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:46:12,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 03:46:12,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 16: [2022-11-26 03:46:12,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:46:12,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 03:46:12,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 31: [2022-11-26 03:46:12,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:46:12,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 03:46:12,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 5: [2022-11-26 03:46:12,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-26 03:46:12,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 03:46:12,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 28: [2022-11-26 03:46:12,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:46:12,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 03:46:12,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 30: [2022-11-26 03:46:12,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:46:12,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 03:46:12,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 9: [2022-11-26 03:46:12,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:46:12,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 03:46:12,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 24: [2022-11-26 03:46:12,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
24: [2022-11-26 03:46:12,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 03:46:12,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 11: [2022-11-26 03:46:12,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:46:12,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 03:46:12,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 3: [2022-11-26 03:46:12,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:46:12,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 03:46:12,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 25: [2022-11-26 03:46:12,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:46:12,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 03:46:12,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 13: [2022-11-26 03:46:12,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 03:46:12,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 03:46:12,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 2: [2022-11-26 03:46:12,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:46:12,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 03:46:12,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 26: [2022-11-26 03:46:12,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:46:12,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 03:46:12,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 7: [2022-11-26 03:46:12,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:46:12,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 03:46:12,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 10: [2022-11-26 03:46:12,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 03:46:12,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 03:46:12,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 14: [2022-11-26 03:46:12,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:46:12,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 03:46:12,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 23: [2022-11-26 03:46:12,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:46:12,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 6: [2022-11-26 03:46:12,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:46:12,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 6: [2022-11-26 03:46:12,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 03:46:12,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 15: [2022-11-26 03:46:12,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-26 03:46:12,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 03:46:12,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 8: [2022-11-26 03:46:12,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:46:12,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 03:46:12,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 19: [2022-11-26 03:46:12,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:46:12,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 03:46:12,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 22: [2022-11-26 03:46:12,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:46:12,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 03:46:12,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 20: [2022-11-26 03:46:12,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-26 03:46:12,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 03:46:12,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 12: [2022-11-26 03:46:12,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:46:12,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:46:12,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 03:46:12,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 17: [2022-11-26 03:46:12,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 03:46:12,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 0: [2022-11-26 03:46:12,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:46:12,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 03:46:12,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 30: [2022-11-26 03:46:12,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-26 03:46:12,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 03:46:12,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 31: [2022-11-26 03:46:12,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:46:12,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 03:46:12,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 4: [2022-11-26 03:46:12,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:46:12,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 03:46:12,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 5: [2022-11-26 03:46:12,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:46:12,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
5: [2022-11-26 03:46:12,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 9: [2022-11-26 03:46:12,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 18: [2022-11-26 03:46:12,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:46:12,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 9: [2022-11-26 03:46:12,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 18: [2022-11-26 03:46:12,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 1: [2022-11-26 03:46:12,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:46:12,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 1: [2022-11-26 03:46:12,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 03:46:12,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 13: [2022-11-26 03:46:12,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 03:46:12,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 03:46:12,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 27: [2022-11-26 03:46:12,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:46:12,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:46:12,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 03:46:12,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 27: [2022-11-26 03:46:12,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 03:46:12,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 21: [2022-11-26 03:46:12,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:46:12,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 03:46:12,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 11: [2022-11-26 03:46:12,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 03:46:12,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 03:46:12,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 28: [2022-11-26 03:46:12,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:46:12,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 03:46:12,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 2: [2022-11-26 03:46:12,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:46:12,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 03:46:12,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 24: [2022-11-26 03:46:12,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:46:12,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 03:46:12,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 3: [2022-11-26 03:46:12,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 03:46:12,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 03:46:12,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 25: [2022-11-26 03:46:12,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:46:12,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 03:46:12,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 29: [2022-11-26 03:46:12,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:46:12,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 03:46:12,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 7: [2022-11-26 03:46:12,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:46:12,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 03:46:12,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 14: [2022-11-26 03:46:12,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 03:46:12,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 03:46:12,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 10: [2022-11-26 03:46:12,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:46:12,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 03:46:12,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 19: [2022-11-26 03:46:12,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:46:12,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 03:46:12,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 15: [2022-11-26 03:46:12,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:46:12,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 03:46:12,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 23: [2022-11-26 03:46:12,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-26 03:46:12,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 03:46:12,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 20: [2022-11-26 03:46:12,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:46:12,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 03:46:12,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 8: [2022-11-26 03:46:12,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:46:12,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 03:46:12,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 18: [2022-11-26 03:46:12,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:46:12,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 03:46:12,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 6: [2022-11-26 03:46:12,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 03:46:12,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 03:46:12,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 17: [2022-11-26 03:46:12,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:46:12,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 03:46:12,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 12: [2022-11-26 03:46:12,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:46:12,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 03:46:12,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 22: [2022-11-26 03:46:12,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:46:12,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 03:46:12,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 4: [2022-11-26 03:46:12,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 03:46:12,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 03:46:12,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 26: [2022-11-26 03:46:12,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:46:12,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 03:46:12,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 0: [2022-11-26 03:46:12,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:46:12,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 03:46:12,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 1: [2022-11-26 03:46:12,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:46:12,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 03:46:12,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 16: [2022-11-26 03:46:12,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
16: [2022-11-26 03:46:12,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 03:46:12,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 29: [2022-11-26 03:46:12,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:46:12,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 03:46:12,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 21: [2022-11-26 03:46:12,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:46:12,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 03:46:12,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 27: [2022-11-26 03:46:12,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:46:12,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 03:46:12,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 5: [2022-11-26 03:46:12,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 03:46:12,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 03:46:12,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 31: [2022-11-26 03:46:12,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:46:12,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 03:46:12,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 9: [2022-11-26 03:46:12,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:46:12,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 03:46:12,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 26: [2022-11-26 03:46:12,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:46:12,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 03:46:12,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 11: [2022-11-26 03:46:12,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-26 03:46:12,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 03:46:12,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 30: [2022-11-26 03:46:12,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:46:12,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 03:46:12,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 24: [2022-11-26 03:46:12,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:46:12,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 03:46:12,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 13: [2022-11-26 03:46:12,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:46:12,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 03:46:12,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 25: [2022-11-26 03:46:12,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
25: [2022-11-26 03:46:12,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 03:46:12,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 28: [2022-11-26 03:46:12,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:46:12,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:46:12,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 03:46:12,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 10: [2022-11-26 03:46:12,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:46:12,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 03:46:12,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 2: [2022-11-26 03:46:12,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:46:12,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 03:46:12,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
20: [2022-11-26 03:46:12,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:46:12,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:46:12,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 3: [2022-11-26 03:46:12,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 20: [2022-11-26 03:46:12,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 3: [2022-11-26 03:46:12,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 0: [2022-11-26 03:46:12,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:46:12,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 03:46:12,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 4: [2022-11-26 03:46:12,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:46:12,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 03:46:12,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
28: [2022-11-26 03:46:12,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 19: [2022-11-26 03:46:12,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:46:12,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 19: [2022-11-26 03:46:12,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 03:46:12,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 14: [2022-11-26 03:46:12,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:46:12,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 03:46:12,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 22: [2022-11-26 03:46:12,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:46:12,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 03:46:12,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 1: [2022-11-26 03:46:12,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
11: [2022-11-26 03:46:12,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:46:12,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 1: [2022-11-26 03:46:12,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 11: [2022-11-26 03:46:12,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 1: [2022-11-26 03:46:12,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 17: [2022-11-26 03:46:12,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:46:12,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:46:12,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 03:46:12,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 9: [2022-11-26 03:46:12,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:46:12,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 03:46:12,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
6: [2022-11-26 03:46:12,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:46:12,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 03:46:12,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 18: [2022-11-26 03:46:12,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:46:12,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 18: [2022-11-26 03:46:12,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 6: [2022-11-26 03:46:12,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 18: [2022-11-26 03:46:12,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 29: [2022-11-26 03:46:12,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:46:12,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 03:46:12,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 21: [2022-11-26 03:46:12,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-26 03:46:12,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 03:46:12,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 12: [2022-11-26 03:46:12,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:46:12,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 16: [2022-11-26 03:46:12,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:46:12,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 16: [2022-11-26 03:46:12,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 03:46:12,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 8: [2022-11-26 03:46:12,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:46:12,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 
27: [2022-11-26 03:46:12,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 8: [2022-11-26 03:46:12,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 03:46:12,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 27: [2022-11-26 03:46:12,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 5: [2022-11-26 03:46:12,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:46:12,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 03:46:12,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 31: [2022-11-26 03:46:12,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:46:12,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 03:46:12,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 30: [2022-11-26 03:46:12,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:46:12,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
30: [2022-11-26 03:46:12,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 26: [2022-11-26 03:46:12,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 30: [2022-11-26 03:46:12,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 26: [2022-11-26 03:46:12,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 24: [2022-11-26 03:46:12,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:46:12,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 23: [2022-11-26 03:46:12,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:46:12,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:46:12,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 23: [2022-11-26 03:46:12,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 03:46:12,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 10: [2022-11-26 03:46:12,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 03:46:12,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 03:46:12,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 13: [2022-11-26 03:46:12,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:46:12,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 03:46:12,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 25: [2022-11-26 03:46:12,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:46:12,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 03:46:12,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 2: [2022-11-26 03:46:12,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:46:12,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 03:46:12,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 7: [2022-11-26 03:46:12,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 03:46:12,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 03:46:12,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 15: [2022-11-26 03:46:12,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:46:12,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 03:46:12,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 3: [2022-11-26 03:46:12,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:46:12,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 03:46:12,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 23: [2022-11-26 03:46:12,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:46:12,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 03:46:12,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 8: [2022-11-26 03:46:12,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 03:46:12,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 03:46:12,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 14: [2022-11-26 03:46:12,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:46:12,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 03:46:12,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 23: [2022-11-26 03:46:12,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:46:12,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 03:46:12,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 28: [2022-11-26 03:46:12,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 03:46:12,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 28: [2022-11-26 03:46:12,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
28: [2022-11-26 03:46:12,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 03:46:12,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 24: [2022-11-26 03:46:12,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:46:12,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 03:46:12,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 15: [2022-11-26 03:46:12,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:46:12,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 03:46:12,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 2: [2022-11-26 03:46:12,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:46:12,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step43000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 03:46:12,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
0: successfully saved checkpoint at iteration 43000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2581.56 31: iteration 43010/ 173500 | consumed samples: 11010560 | consumed tokens: 22549626880 | elapsed time per iteration (s): 1.08 | learning rate: 1.755E-04 | global batch size: 256 | lm loss: 2.107388E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.586 | TFLOPs: 14.37 | 31: iteration 43020/ 173500 | consumed samples: 11013120 | consumed tokens: 22554869760 | elapsed time per iteration (s): 0.81 | learning rate: 1.755E-04 | global batch size: 256 | lm loss: 2.109512E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.286 | TFLOPs: 19.19 | 31: iteration 43030/ 173500 | consumed samples: 11015680 | consumed tokens: 22560112640 | elapsed time per iteration (s): 0.77 | learning rate: 1.755E-04 | global batch size: 256 | lm loss: 2.073910E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.498 | TFLOPs: 20.05 | 31: iteration 43040/ 173500 | consumed samples: 11018240 | consumed tokens: 22565355520 | elapsed time per iteration (s): 0.81 | learning rate: 1.755E-04 | global batch size: 256 | lm loss: 2.088800E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.383 | TFLOPs: 19.20 | 31: iteration 43050/ 173500 | consumed samples: 11020800 | consumed tokens: 22570598400 | elapsed time per iteration (s): 0.80 | learning rate: 1.755E-04 | global batch size: 256 | lm loss: 2.090697E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.045 | TFLOPs: 19.36 | 31: iteration 43060/ 173500 | consumed samples: 11023360 | consumed tokens: 22575841280 | elapsed time per iteration (s): 0.78 | 
learning rate: 1.755E-04 | global batch size: 256 | lm loss: 2.091692E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.293 | TFLOPs: 19.92 | 31: iteration 43070/ 173500 | consumed samples: 11025920 | consumed tokens: 22581084160 | elapsed time per iteration (s): 0.74 | learning rate: 1.755E-04 | global batch size: 256 | lm loss: 2.091301E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.001 | TFLOPs: 21.05 | 31: iteration 43080/ 173500 | consumed samples: 11028480 | consumed tokens: 22586327040 | elapsed time per iteration (s): 0.79 | learning rate: 1.755E-04 | global batch size: 256 | lm loss: 2.093553E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.933 | TFLOPs: 19.66 | 31: iteration 43090/ 173500 | consumed samples: 11031040 | consumed tokens: 22591569920 | elapsed time per iteration (s): 0.76 | learning rate: 1.755E-04 | global batch size: 256 | lm loss: 2.096828E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.539 | TFLOPs: 20.30 | 31: iteration 43100/ 173500 | consumed samples: 11033600 | consumed tokens: 22596812800 | elapsed time per iteration (s): 0.78 | learning rate: 1.754E-04 | global batch size: 256 | lm loss: 2.097996E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.561 | TFLOPs: 19.94 | 31: iteration 43110/ 173500 | consumed samples: 11036160 | consumed tokens: 22602055680 | elapsed time per iteration (s): 0.74 | learning rate: 1.754E-04 | global batch size: 256 | lm loss: 2.104822E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.844 | TFLOPs: 20.98 | 31: iteration 43120/ 
173500 | consumed samples: 11038720 | consumed tokens: 22607298560 | elapsed time per iteration (s): 0.77 | learning rate: 1.754E-04 | global batch size: 256 | lm loss: 2.075237E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.120 | TFLOPs: 20.15 | 31: iteration 43130/ 173500 | consumed samples: 11041280 | consumed tokens: 22612541440 | elapsed time per iteration (s): 0.76 | learning rate: 1.754E-04 | global batch size: 256 | lm loss: 2.068552E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.435 | TFLOPs: 20.47 | 31: iteration 43140/ 173500 | consumed samples: 11043840 | consumed tokens: 22617784320 | elapsed time per iteration (s): 0.81 | learning rate: 1.754E-04 | global batch size: 256 | lm loss: 2.073995E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.949 | TFLOPs: 19.24 | 31: iteration 43150/ 173500 | consumed samples: 11046400 | consumed tokens: 22623027200 | elapsed time per iteration (s): 0.78 | learning rate: 1.754E-04 | global batch size: 256 | lm loss: 2.077601E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.636 | TFLOPs: 19.94 | 31: iteration 43160/ 173500 | consumed samples: 11048960 | consumed tokens: 22628270080 | elapsed time per iteration (s): 0.82 | learning rate: 1.754E-04 | global batch size: 256 | lm loss: 2.102843E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.731 | TFLOPs: 18.86 | 31: iteration 43170/ 173500 | consumed samples: 11051520 | consumed tokens: 22633512960 | elapsed time per iteration (s): 0.79 | learning rate: 1.754E-04 | global batch size: 256 | lm loss: 2.059693E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 325.206 | TFLOPs: 19.67 | 31: iteration 43180/ 173500 | consumed samples: 11054080 | consumed tokens: 22638755840 | elapsed time per iteration (s): 0.77 | learning rate: 1.754E-04 | global batch size: 256 | lm loss: 2.099328E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.333 | TFLOPs: 20.04 | 31: iteration 43190/ 173500 | consumed samples: 11056640 | consumed tokens: 22643998720 | elapsed time per iteration (s): 0.77 | learning rate: 1.753E-04 | global batch size: 256 | lm loss: 2.094310E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.365 | TFLOPs: 20.23 | 31: iteration 43200/ 173500 | consumed samples: 11059200 | consumed tokens: 22649241600 | elapsed time per iteration (s): 0.78 | learning rate: 1.753E-04 | global batch size: 256 | lm loss: 2.062977E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.884 | TFLOPs: 19.90 | 31: iteration 43210/ 173500 | consumed samples: 11061760 | consumed tokens: 22654484480 | elapsed time per iteration (s): 0.74 | learning rate: 1.753E-04 | global batch size: 256 | lm loss: 2.116020E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.046 | TFLOPs: 20.93 | 31: iteration 43220/ 173500 | consumed samples: 11064320 | consumed tokens: 22659727360 | elapsed time per iteration (s): 0.78 | learning rate: 1.753E-04 | global batch size: 256 | lm loss: 2.088603E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.951 | TFLOPs: 19.90 | 31: iteration 43230/ 173500 | consumed samples: 11066880 | consumed tokens: 22664970240 | elapsed time per iteration (s): 0.76 | learning rate: 
1.753E-04 | global batch size: 256 | lm loss: 2.095484E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.914 | TFLOPs: 20.50 | 31: iteration 43240/ 173500 | consumed samples: 11069440 | consumed tokens: 22670213120 | elapsed time per iteration (s): 0.74 | learning rate: 1.753E-04 | global batch size: 256 | lm loss: 2.102370E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.572 | TFLOPs: 20.91 | 31: iteration 43250/ 173500 | consumed samples: 11072000 | consumed tokens: 22675456000 | elapsed time per iteration (s): 0.74 | learning rate: 1.753E-04 | global batch size: 256 | lm loss: 2.090366E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.414 | TFLOPs: 20.84 | 31: iteration 43260/ 173500 | consumed samples: 11074560 | consumed tokens: 22680698880 | elapsed time per iteration (s): 0.74 | learning rate: 1.753E-04 | global batch size: 256 | lm loss: 2.081326E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.290 | TFLOPs: 21.01 | 31: iteration 43270/ 173500 | consumed samples: 11077120 | consumed tokens: 22685941760 | elapsed time per iteration (s): 0.74 | learning rate: 1.753E-04 | global batch size: 256 | lm loss: 2.075091E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.890 | TFLOPs: 20.80 | 31: iteration 43280/ 173500 | consumed samples: 11079680 | consumed tokens: 22691184640 | elapsed time per iteration (s): 0.83 | learning rate: 1.752E-04 | global batch size: 256 | lm loss: 2.091278E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.989 | TFLOPs: 18.75 | 31: iteration 43290/ 173500 | 
consumed samples: 11082240 | consumed tokens: 22696427520 | elapsed time per iteration (s): 0.73 | learning rate: 1.752E-04 | global batch size: 256 | lm loss: 2.083020E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.190 | TFLOPs: 21.25 | 31: iteration 43300/ 173500 | consumed samples: 11084800 | consumed tokens: 22701670400 | elapsed time per iteration (s): 0.77 | learning rate: 1.752E-04 | global batch size: 256 | lm loss: 2.062145E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.185 | TFLOPs: 20.04 | 31: iteration 43310/ 173500 | consumed samples: 11087360 | consumed tokens: 22706913280 | elapsed time per iteration (s): 0.80 | learning rate: 1.752E-04 | global batch size: 256 | lm loss: 2.086725E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.243 | TFLOPs: 19.43 | 31: iteration 43320/ 173500 | consumed samples: 11089920 | consumed tokens: 22712156160 | elapsed time per iteration (s): 0.79 | learning rate: 1.752E-04 | global batch size: 256 | lm loss: 2.077225E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.945 | TFLOPs: 19.60 | 31: iteration 43330/ 173500 | consumed samples: 11092480 | consumed tokens: 22717399040 | elapsed time per iteration (s): 0.74 | learning rate: 1.752E-04 | global batch size: 256 | lm loss: 2.061896E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.676 | TFLOPs: 20.79 | 31: iteration 43340/ 173500 | consumed samples: 11095040 | consumed tokens: 22722641920 | elapsed time per iteration (s): 0.74 | learning rate: 1.752E-04 | global batch size: 256 | lm loss: 2.065061E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 347.904 | TFLOPs: 21.05 | 31: iteration 43350/ 173500 | consumed samples: 11097600 | consumed tokens: 22727884800 | elapsed time per iteration (s): 0.80 | learning rate: 1.752E-04 | global batch size: 256 | lm loss: 2.070228E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.368 | TFLOPs: 19.38 | 31: iteration 43360/ 173500 | consumed samples: 11100160 | consumed tokens: 22733127680 | elapsed time per iteration (s): 0.85 | learning rate: 1.752E-04 | global batch size: 256 | lm loss: 2.096462E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.368 | TFLOPs: 18.29 | 31: iteration 43370/ 173500 | consumed samples: 11102720 | consumed tokens: 22738370560 | elapsed time per iteration (s): 0.77 | learning rate: 1.751E-04 | global batch size: 256 | lm loss: 2.080174E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.764 | TFLOPs: 20.01 | 31: iteration 43380/ 173500 | consumed samples: 11105280 | consumed tokens: 22743613440 | elapsed time per iteration (s): 0.80 | learning rate: 1.751E-04 | global batch size: 256 | lm loss: 2.094404E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.675 | TFLOPs: 19.28 | 31: iteration 43390/ 173500 | consumed samples: 11107840 | consumed tokens: 22748856320 | elapsed time per iteration (s): 0.82 | learning rate: 1.751E-04 | global batch size: 256 | lm loss: 2.085824E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.219 | TFLOPs: 18.83 | 31: iteration 43400/ 173500 | consumed samples: 11110400 | consumed tokens: 22754099200 | elapsed time per iteration (s): 0.76 | learning rate: 1.751E-04 | global batch 
size: 256 | lm loss: 2.098694E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.333 | TFLOPs: 20.29 | 31: iteration 43410/ 173500 | consumed samples: 11112960 | consumed tokens: 22759342080 | elapsed time per iteration (s): 0.74 | learning rate: 1.751E-04 | global batch size: 256 | lm loss: 2.094527E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.074 | TFLOPs: 20.94 | 31: iteration 43420/ 173500 | consumed samples: 11115520 | consumed tokens: 22764584960 | elapsed time per iteration (s): 0.81 | learning rate: 1.751E-04 | global batch size: 256 | lm loss: 2.076753E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.150 | TFLOPs: 19.07 | 31: iteration 43430/ 173500 | consumed samples: 11118080 | consumed tokens: 22769827840 | elapsed time per iteration (s): 0.80 | learning rate: 1.751E-04 | global batch size: 256 | lm loss: 2.080009E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.709 | TFLOPs: 19.46 | 31: iteration 43440/ 173500 | consumed samples: 11120640 | consumed tokens: 22775070720 | elapsed time per iteration (s): 0.79 | learning rate: 1.751E-04 | global batch size: 256 | lm loss: 2.045473E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.418 | TFLOPs: 19.69 | 31: iteration 43450/ 173500 | consumed samples: 11123200 | consumed tokens: 22780313600 | elapsed time per iteration (s): 0.77 | learning rate: 1.751E-04 | global batch size: 256 | lm loss: 2.064443E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.498 | TFLOPs: 20.24 | 31: iteration 43460/ 173500 | consumed samples: 11125760 | 
consumed tokens: 22785556480 | elapsed time per iteration (s): 0.79 | learning rate: 1.750E-04 | global batch size: 256 | lm loss: 2.109892E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.913 | TFLOPs: 19.54 | 31: iteration 43470/ 173500 | consumed samples: 11128320 | consumed tokens: 22790799360 | elapsed time per iteration (s): 0.80 | learning rate: 1.750E-04 | global batch size: 256 | lm loss: 2.106710E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.252 | TFLOPs: 19.37 | 31: iteration 43480/ 173500 | consumed samples: 11130880 | consumed tokens: 22796042240 | elapsed time per iteration (s): 0.75 | learning rate: 1.750E-04 | global batch size: 256 | lm loss: 2.090961E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.060 | TFLOPs: 20.57 | 31: iteration 43490/ 173500 | consumed samples: 11133440 | consumed tokens: 22801285120 | elapsed time per iteration (s): 0.79 | learning rate: 1.750E-04 | global batch size: 256 | lm loss: 2.141809E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.017 | TFLOPs: 19.48 | 31: iteration 43500/ 173500 | consumed samples: 11136000 | consumed tokens: 22806528000 | elapsed time per iteration (s): 0.81 | learning rate: 1.750E-04 | global batch size: 256 | lm loss: 2.092842E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.986 | TFLOPs: 19.06 | 31: iteration 43510/ 173500 | consumed samples: 11138560 | consumed tokens: 22811770880 | elapsed time per iteration (s): 0.84 | learning rate: 1.750E-04 | global batch size: 256 | lm loss: 2.052984E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 305.830 | TFLOPs: 18.50 | 31: iteration 43520/ 173500 | consumed samples: 11141120 | consumed tokens: 22817013760 | elapsed time per iteration (s): 0.82 | learning rate: 1.750E-04 | global batch size: 256 | lm loss: 2.079224E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.536 | TFLOPs: 18.85 | 31: iteration 43530/ 173500 | consumed samples: 11143680 | consumed tokens: 22822256640 | elapsed time per iteration (s): 0.76 | learning rate: 1.750E-04 | global batch size: 256 | lm loss: 2.094673E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.122 | TFLOPs: 20.27 | 31: iteration 43540/ 173500 | consumed samples: 11146240 | consumed tokens: 22827499520 | elapsed time per iteration (s): 0.75 | learning rate: 1.749E-04 | global batch size: 256 | lm loss: 2.108079E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.971 | TFLOPs: 20.75 | 31: iteration 43550/ 173500 | consumed samples: 11148800 | consumed tokens: 22832742400 | elapsed time per iteration (s): 0.82 | learning rate: 1.749E-04 | global batch size: 256 | lm loss: 2.056545E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.706 | TFLOPs: 18.92 | 31: iteration 43560/ 173500 | consumed samples: 11151360 | consumed tokens: 22837985280 | elapsed time per iteration (s): 0.82 | learning rate: 1.749E-04 | global batch size: 256 | lm loss: 2.076093E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.738 | TFLOPs: 18.98 | 31: iteration 43570/ 173500 | consumed samples: 11153920 | consumed tokens: 22843228160 | elapsed time per iteration (s): 0.81 | learning rate: 1.749E-04 | global batch size: 256 | lm loss: 
2.064078E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.117 | TFLOPs: 19.06 | 31: iteration 43580/ 173500 | consumed samples: 11156480 | consumed tokens: 22848471040 | elapsed time per iteration (s): 0.80 | learning rate: 1.749E-04 | global batch size: 256 | lm loss: 2.104036E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.024 | TFLOPs: 19.30 | 31: iteration 43590/ 173500 | consumed samples: 11159040 | consumed tokens: 22853713920 | elapsed time per iteration (s): 0.82 | learning rate: 1.749E-04 | global batch size: 256 | lm loss: 2.089110E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.885 | TFLOPs: 18.87 | 31: iteration 43600/ 173500 | consumed samples: 11161600 | consumed tokens: 22858956800 | elapsed time per iteration (s): 0.78 | learning rate: 1.749E-04 | global batch size: 256 | lm loss: 2.070427E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.394 | TFLOPs: 19.81 | 31: iteration 43610/ 173500 | consumed samples: 11164160 | consumed tokens: 22864199680 | elapsed time per iteration (s): 0.86 | learning rate: 1.749E-04 | global batch size: 256 | lm loss: 2.053241E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.016 | TFLOPs: 17.91 | 31: iteration 43620/ 173500 | consumed samples: 11166720 | consumed tokens: 22869442560 | elapsed time per iteration (s): 0.81 | learning rate: 1.749E-04 | global batch size: 256 | lm loss: 2.081897E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.353 | TFLOPs: 19.20 | 31: iteration 43630/ 173500 | consumed samples: 11169280 | consumed tokens: 
22874685440 | elapsed time per iteration (s): 0.87 | learning rate: 1.748E-04 | global batch size: 256 | lm loss: 2.081090E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.644 | TFLOPs: 17.83 | 31: iteration 43640/ 173500 | consumed samples: 11171840 | consumed tokens: 22879928320 | elapsed time per iteration (s): 0.86 | learning rate: 1.748E-04 | global batch size: 256 | lm loss: 2.096293E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.580 | TFLOPs: 18.00 | 31: iteration 43650/ 173500 | consumed samples: 11174400 | consumed tokens: 22885171200 | elapsed time per iteration (s): 0.90 | learning rate: 1.748E-04 | global batch size: 256 | lm loss: 2.095029E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.922 | TFLOPs: 17.30 | 31: iteration 43660/ 173500 | consumed samples: 11176960 | consumed tokens: 22890414080 | elapsed time per iteration (s): 0.85 | learning rate: 1.748E-04 | global batch size: 256 | lm loss: 2.069212E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.935 | TFLOPs: 18.27 | 31: iteration 43670/ 173500 | consumed samples: 11179520 | consumed tokens: 22895656960 | elapsed time per iteration (s): 0.84 | learning rate: 1.748E-04 | global batch size: 256 | lm loss: 2.085293E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.594 | TFLOPs: 18.43 | 31: iteration 43680/ 173500 | consumed samples: 11182080 | consumed tokens: 22900899840 | elapsed time per iteration (s): 0.82 | learning rate: 1.748E-04 | global batch size: 256 | lm loss: 2.078876E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 311.621 | TFLOPs: 18.85 | 31: iteration 43690/ 173500 | consumed samples: 11184640 | consumed tokens: 22906142720 | elapsed time per iteration (s): 0.82 | learning rate: 1.748E-04 | global batch size: 256 | lm loss: 2.117162E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.570 | TFLOPs: 18.97 | 31: iteration 43700/ 173500 | consumed samples: 11187200 | consumed tokens: 22911385600 | elapsed time per iteration (s): 0.82 | learning rate: 1.748E-04 | global batch size: 256 | lm loss: 2.108910E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.776 | TFLOPs: 18.92 | 31: iteration 43710/ 173500 | consumed samples: 11189760 | consumed tokens: 22916628480 | elapsed time per iteration (s): 0.80 | learning rate: 1.748E-04 | global batch size: 256 | lm loss: 2.118130E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.928 | TFLOPs: 19.35 | 31: iteration 43720/ 173500 | consumed samples: 11192320 | consumed tokens: 22921871360 | elapsed time per iteration (s): 0.78 | learning rate: 1.747E-04 | global batch size: 256 | lm loss: 2.083048E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.493 | TFLOPs: 19.75 | 31: iteration 43730/ 173500 | consumed samples: 11194880 | consumed tokens: 22927114240 | elapsed time per iteration (s): 0.81 | learning rate: 1.747E-04 | global batch size: 256 | lm loss: 2.082201E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.964 | TFLOPs: 19.18 | 31: iteration 43740/ 173500 | consumed samples: 11197440 | consumed tokens: 22932357120 | elapsed time per iteration (s): 0.79 | learning rate: 1.747E-04 | global batch size: 256 | lm loss: 2.078883E+00 | grad 
norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.097 | TFLOPs: 19.61 | 31: iteration 43750/ 173500 | consumed samples: 11200000 | consumed tokens: 22937600000 | elapsed time per iteration (s): 1.21 | learning rate: 1.747E-04 | global batch size: 256 | lm loss: 2.079284E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 212.111 | TFLOPs: 12.83 | 31: iteration 43760/ 173500 | consumed samples: 11202560 | consumed tokens: 22942842880 | elapsed time per iteration (s): 0.80 | learning rate: 1.747E-04 | global batch size: 256 | lm loss: 2.098207E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.170 | TFLOPs: 19.37 | 31: iteration 43770/ 173500 | consumed samples: 11205120 | consumed tokens: 22948085760 | elapsed time per iteration (s): 0.82 | learning rate: 1.747E-04 | global batch size: 256 | lm loss: 2.081079E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.442 | TFLOPs: 18.78 | 31: iteration 43780/ 173500 | consumed samples: 11207680 | consumed tokens: 22953328640 | elapsed time per iteration (s): 0.81 | learning rate: 1.747E-04 | global batch size: 256 | lm loss: 2.087587E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.452 | TFLOPs: 19.08 | 31: iteration 43790/ 173500 | consumed samples: 11210240 | consumed tokens: 22958571520 | elapsed time per iteration (s): 0.85 | learning rate: 1.747E-04 | global batch size: 256 | lm loss: 2.077370E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.372 | TFLOPs: 18.17 | 31: iteration 43800/ 173500 | consumed samples: 11212800 | consumed tokens: 22963814400 | elapsed time 
per iteration (s): 0.82 | learning rate: 1.747E-04 | global batch size: 256 | lm loss: 2.098197E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.173 | TFLOPs: 18.89 | 31: iteration 43810/ 173500 | consumed samples: 11215360 | consumed tokens: 22969057280 | elapsed time per iteration (s): 0.86 | learning rate: 1.746E-04 | global batch size: 256 | lm loss: 2.114755E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.727 | TFLOPs: 18.01 | 31: iteration 43820/ 173500 | consumed samples: 11217920 | consumed tokens: 22974300160 | elapsed time per iteration (s): 0.77 | learning rate: 1.746E-04 | global batch size: 256 | lm loss: 2.091704E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.175 | TFLOPs: 20.16 | 31: iteration 43830/ 173500 | consumed samples: 11220480 | consumed tokens: 22979543040 | elapsed time per iteration (s): 0.76 | learning rate: 1.746E-04 | global batch size: 256 | lm loss: 2.099104E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.102 | TFLOPs: 20.27 | 31: iteration 43840/ 173500 | consumed samples: 11223040 | consumed tokens: 22984785920 | elapsed time per iteration (s): 0.82 | learning rate: 1.746E-04 | global batch size: 256 | lm loss: 2.073816E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.686 | TFLOPs: 18.92 | 31: iteration 43850/ 173500 | consumed samples: 11225600 | consumed tokens: 22990028800 | elapsed time per iteration (s): 0.75 | learning rate: 1.746E-04 | global batch size: 256 | lm loss: 2.090288E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.068 | TFLOPs: 
20.75 | 31: iteration 43860/ 173500 | consumed samples: 11228160 | consumed tokens: 22995271680 | elapsed time per iteration (s): 0.83 | learning rate: 1.746E-04 | global batch size: 256 | lm loss: 2.074853E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.980 | TFLOPs: 18.63 | 31: iteration 43870/ 173500 | consumed samples: 11230720 | consumed tokens: 23000514560 | elapsed time per iteration (s): 0.75 | learning rate: 1.746E-04 | global batch size: 256 | lm loss: 2.064694E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.910 | TFLOPs: 20.75 | 31: iteration 43880/ 173500 | consumed samples: 11233280 | consumed tokens: 23005757440 | elapsed time per iteration (s): 0.77 | learning rate: 1.746E-04 | global batch size: 256 | lm loss: 2.071727E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.063 | TFLOPs: 20.15 | 31: iteration 43890/ 173500 | consumed samples: 11235840 | consumed tokens: 23011000320 | elapsed time per iteration (s): 0.80 | learning rate: 1.745E-04 | global batch size: 256 | lm loss: 2.100365E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.675 | TFLOPs: 19.46 | 31: iteration 43900/ 173500 | consumed samples: 11238400 | consumed tokens: 23016243200 | elapsed time per iteration (s): 0.81 | learning rate: 1.745E-04 | global batch size: 256 | lm loss: 2.065306E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.410 | TFLOPs: 19.08 | 31: iteration 43910/ 173500 | consumed samples: 11240960 | consumed tokens: 23021486080 | elapsed time per iteration (s): 0.88 | learning rate: 1.745E-04 | global batch size: 256 | lm loss: 2.057525E+00 | grad norm: 0.136 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.515 | TFLOPs: 17.70 | 31: iteration 43920/ 173500 | consumed samples: 11243520 | consumed tokens: 23026728960 | elapsed time per iteration (s): 0.78 | learning rate: 1.745E-04 | global batch size: 256 | lm loss: 2.088398E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.684 | TFLOPs: 19.76 | 31: iteration 43930/ 173500 | consumed samples: 11246080 | consumed tokens: 23031971840 | elapsed time per iteration (s): 0.75 | learning rate: 1.745E-04 | global batch size: 256 | lm loss: 2.110835E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.182 | TFLOPs: 20.64 | 31: iteration 43940/ 173500 | consumed samples: 11248640 | consumed tokens: 23037214720 | elapsed time per iteration (s): 0.74 | learning rate: 1.745E-04 | global batch size: 256 | lm loss: 2.103388E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.863 | TFLOPs: 20.98 | 31: iteration 43950/ 173500 | consumed samples: 11251200 | consumed tokens: 23042457600 | elapsed time per iteration (s): 0.81 | learning rate: 1.745E-04 | global batch size: 256 | lm loss: 2.089264E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.374 | TFLOPs: 19.20 | 31: iteration 43960/ 173500 | consumed samples: 11253760 | consumed tokens: 23047700480 | elapsed time per iteration (s): 0.78 | learning rate: 1.745E-04 | global batch size: 256 | lm loss: 2.110091E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.974 | TFLOPs: 19.78 | 31: iteration 43970/ 173500 | consumed samples: 11256320 | consumed tokens: 23052943360 | elapsed time per iteration (s): 0.78 | 
learning rate: 1.745E-04 | global batch size: 256 | lm loss: 2.077294E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.717 | TFLOPs: 19.77 | 31: iteration 43980/ 173500 | consumed samples: 11258880 | consumed tokens: 23058186240 | elapsed time per iteration (s): 0.76 | learning rate: 1.744E-04 | global batch size: 256 | lm loss: 2.078400E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.740 | TFLOPs: 20.43 | 31: iteration 43990/ 173500 | consumed samples: 11261440 | consumed tokens: 23063429120 | elapsed time per iteration (s): 0.77 | learning rate: 1.744E-04 | global batch size: 256 | lm loss: 2.073273E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.515 | TFLOPs: 20.24 | 0: [2022-11-26 03:59:28,456] [INFO] [logging.py:68:log_dist] [Rank 0] step=44000, skipped=0, lr=[0.00017442202015704406, 0.00017442202015704406, 0.00017442202015704406], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 44000/ 173500 | consumed samples: 11264000 | consumed tokens: 23068672000 | elapsed time per iteration (s): 0.78 | learning rate: 1.744E-04 | global batch size: 256 | lm loss: 2.061318E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.164 | TFLOPs: 19.85 | 0: steps: 44000 loss: 1.9767 iter time (s): 0.832 samples/sec: 307.535 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 44000 | lm loss value: 2.003963E+00 | lm loss PPL: 7.418395E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 44000 to checkpoints_1b1long 0: [2022-11-26 03:59:28,802] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] 
Checkpoint global_step44000 is begin to save! 0: [2022-11-26 03:59:28,814] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_01-model_00-model_states.pt... 0: [2022-11-26 03:59:29,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_01-model_00-model_states.pt. 0: [2022-11-26 03:59:29,014] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_03-model_00-model_states.pt... 0: [2022-11-26 03:59:29,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_03-model_00-model_states.pt. 0: [2022-11-26 03:59:29,092] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_04-model_00-model_states.pt... 0: [2022-11-26 03:59:29,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_04-model_00-model_states.pt. 0: [2022-11-26 03:59:29,170] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_05-model_00-model_states.pt... 0: [2022-11-26 03:59:29,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_05-model_00-model_states.pt. 0: [2022-11-26 03:59:29,243] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_06-model_00-model_states.pt... 0: [2022-11-26 03:59:29,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_06-model_00-model_states.pt. 0: [2022-11-26 03:59:29,321] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_07-model_00-model_states.pt... 0: [2022-11-26 03:59:29,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 03:59:29,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_08-model_00-model_states.pt... 0: [2022-11-26 03:59:29,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_08-model_00-model_states.pt. 0: [2022-11-26 03:59:29,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_09-model_00-model_states.pt... 0: [2022-11-26 03:59:29,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_09-model_00-model_states.pt. 0: [2022-11-26 03:59:29,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_10-model_00-model_states.pt... 0: [2022-11-26 03:59:29,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_10-model_00-model_states.pt. 0: [2022-11-26 03:59:29,624] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_11-model_00-model_states.pt... 0: [2022-11-26 03:59:29,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_11-model_00-model_states.pt. 0: [2022-11-26 03:59:29,701] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_12-model_00-model_states.pt... 0: [2022-11-26 03:59:29,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_12-model_00-model_states.pt. 0: [2022-11-26 03:59:29,777] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_13-model_00-model_states.pt... 0: [2022-11-26 03:59:29,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 03:59:29,853] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_14-model_00-model_states.pt... 0: [2022-11-26 03:59:29,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_14-model_00-model_states.pt. 0: [2022-11-26 03:59:29,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_15-model_00-model_states.pt... 0: [2022-11-26 03:59:30,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_15-model_00-model_states.pt. 0: [2022-11-26 03:59:30,005] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_16-model_00-model_states.pt... 0: [2022-11-26 03:59:30,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_16-model_00-model_states.pt. 0: [2022-11-26 03:59:30,080] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_17-model_00-model_states.pt... 0: [2022-11-26 03:59:30,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_17-model_00-model_states.pt. 0: [2022-11-26 03:59:30,158] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_18-model_00-model_states.pt... 0: [2022-11-26 03:59:30,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_18-model_00-model_states.pt. 0: [2022-11-26 03:59:30,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_19-model_00-model_states.pt... 0: [2022-11-26 03:59:30,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 03:59:30,309] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_20-model_00-model_states.pt... 0: [2022-11-26 03:59:30,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_20-model_00-model_states.pt. 0: [2022-11-26 03:59:30,386] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_21-model_00-model_states.pt... 0: [2022-11-26 03:59:30,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_21-model_00-model_states.pt. 0: [2022-11-26 03:59:30,459] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_22-model_00-model_states.pt... 0: [2022-11-26 03:59:30,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_22-model_00-model_states.pt. 0: [2022-11-26 03:59:30,537] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_23-model_00-model_states.pt... 0: [2022-11-26 03:59:30,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_23-model_00-model_states.pt. 0: [2022-11-26 03:59:30,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_24-model_00-model_states.pt... 0: [2022-11-26 03:59:30,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_24-model_00-model_states.pt. 0: [2022-11-26 03:59:30,689] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_25-model_00-model_states.pt... 0: [2022-11-26 03:59:30,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 03:59:30,764] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_26-model_00-model_states.pt... 0: [2022-11-26 03:59:30,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_26-model_00-model_states.pt. 0: [2022-11-26 03:59:30,838] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_27-model_00-model_states.pt... 0: [2022-11-26 03:59:30,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_27-model_00-model_states.pt. 0: [2022-11-26 03:59:30,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_28-model_00-model_states.pt... 0: [2022-11-26 03:59:30,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_28-model_00-model_states.pt. 0: [2022-11-26 03:59:30,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/layer_30-model_00-model_states.pt... 0: [2022-11-26 03:59:30,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/layer_30-model_00-model_states.pt. 0: [2022-11-26 03:59:30,995] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step44000/mp_rank_00_model_states.pt 0: [2022-11-26 03:59:30,995] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/mp_rank_00_model_states.pt... 0: [2022-11-26 03:59:30,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/mp_rank_00_model_states.pt. 0: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
6: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 
8: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
12: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 
23: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 
28: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 
22: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 
26: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
0: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
9: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 
16: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
12: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 
11: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 
22: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 
19: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 
5: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 
2: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 
24: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 
21: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 
8: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
20: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 
17: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 
16: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 24: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 
30: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 21: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 
24: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 22: [2022-11-26 03:59:31,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 16: [2022-11-26 03:59:31,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:59:31,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 03:59:31,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 11: [2022-11-26 03:59:31,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:59:31,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 03:59:31,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 19: [2022-11-26 03:59:31,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:59:31,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 5: [2022-11-26 03:59:31,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:59:31,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
5: [2022-11-26 03:59:31,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 03:59:31,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 16: [2022-11-26 03:59:31,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:59:31,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 20: [2022-11-26 03:59:31,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:59:31,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 29: [2022-11-26 03:59:31,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:59:31,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 03:59:31,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 20: [2022-11-26 03:59:31,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 03:59:31,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 26: [2022-11-26 03:59:31,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
26: [2022-11-26 03:59:31,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 03:59:31,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 15: [2022-11-26 03:59:31,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:59:31,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 1: [2022-11-26 03:59:31,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:59:31,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 18: [2022-11-26 03:59:31,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:59:31,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
18: [2022-11-26 03:59:31,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 03:59:31,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 1: [2022-11-26 03:59:31,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 18: [2022-11-26 03:59:31,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 1: [2022-11-26 03:59:31,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 18: [2022-11-26 03:59:31,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 25: [2022-11-26 03:59:31,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:59:31,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 03:59:31,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:59:31,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 25: [2022-11-26 03:59:31,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 03:59:31,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
10: [2022-11-26 03:59:31,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:59:31,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 03:59:31,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 24: [2022-11-26 03:59:31,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:59:31,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:59:31,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 03:59:31,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 03:59:31,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 24: [2022-11-26 03:59:31,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 26: [2022-11-26 03:59:31,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
26: [2022-11-26 03:59:31,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 10: [2022-11-26 03:59:31,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:59:31,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 10: [2022-11-26 03:59:31,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 03:59:31,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 20: [2022-11-26 03:59:31,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:59:31,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:59:31,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 30: [2022-11-26 03:59:31,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 20: [2022-11-26 03:59:31,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 30: [2022-11-26 03:59:31,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 31: [2022-11-26 03:59:31,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
7: [2022-11-26 03:59:31,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:59:31,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 7: [2022-11-26 03:59:31,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 31: [2022-11-26 03:59:31,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 7: [2022-11-26 03:59:31,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 21: [2022-11-26 03:59:31,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:59:31,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 03:59:31,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 21: [2022-11-26 03:59:31,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:59:31,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
21: [2022-11-26 03:59:31,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 11: [2022-11-26 03:59:31,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 21: [2022-11-26 03:59:31,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 11: [2022-11-26 03:59:31,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 28: [2022-11-26 03:59:31,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:59:31,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:59:31,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 03:59:31,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 0: [2022-11-26 03:59:31,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:59:31,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:59:31,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 19: [2022-11-26 03:59:31,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
0: [2022-11-26 03:59:31,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 17: [2022-11-26 03:59:31,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 19: [2022-11-26 03:59:31,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 17: [2022-11-26 03:59:31,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 19: [2022-11-26 03:59:31,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 17: [2022-11-26 03:59:31,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:59:31,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 03:59:31,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 5: [2022-11-26 03:59:31,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:59:31,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 03:59:31,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 23: [2022-11-26 03:59:31,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-26 03:59:31,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:59:31,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 03:59:31,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 03:59:31,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 23: [2022-11-26 03:59:31,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 0: [2022-11-26 03:59:31,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:59:31,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:59:31,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 03:59:31,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 10: [2022-11-26 03:59:31,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 03:59:31,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 21: [2022-11-26 03:59:31,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:59:31,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 21: [2022-11-26 03:59:31,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 03:59:31,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 7: [2022-11-26 03:59:31,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:59:31,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:59:31,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 15: [2022-11-26 03:59:31,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:59:31,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
15: [2022-11-26 03:59:31,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 03:59:31,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 03:59:31,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 15: [2022-11-26 03:59:31,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 28: [2022-11-26 03:59:31,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 03:59:31,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 28: [2022-11-26 03:59:31,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:59:31,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 03:59:31,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 28: [2022-11-26 03:59:31,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:59:31,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:59:31,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
28: [2022-11-26 03:59:31,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 9: [2022-11-26 03:59:31,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 28: [2022-11-26 03:59:31,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 9: [2022-11-26 03:59:31,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 03:59:31,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 9: [2022-11-26 03:59:31,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 14: [2022-11-26 03:59:31,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:59:31,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 03:59:31,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 11: [2022-11-26 03:59:31,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:59:31,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
11: [2022-11-26 03:59:31,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 0: [2022-11-26 03:59:31,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 11: [2022-11-26 03:59:31,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 0: [2022-11-26 03:59:31,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 23: [2022-11-26 03:59:31,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:59:31,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:59:31,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:59:31,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 19: [2022-11-26 03:59:31,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 23: [2022-11-26 03:59:31,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 14: [2022-11-26 03:59:31,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 19: [2022-11-26 03:59:31,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
2: [2022-11-26 03:59:31,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:59:31,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 2: [2022-11-26 03:59:31,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 23: [2022-11-26 03:59:31,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:59:31,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 23: [2022-11-26 03:59:31,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 03:59:31,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 26: [2022-11-26 03:59:31,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:59:31,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:59:31,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 16: [2022-11-26 03:59:31,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
18: [2022-11-26 03:59:31,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 26: [2022-11-26 03:59:31,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 18: [2022-11-26 03:59:31,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 16: [2022-11-26 03:59:31,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 03:59:31,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 25: [2022-11-26 03:59:31,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:59:31,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 03:59:31,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 13: [2022-11-26 03:59:31,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:59:31,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 03:59:31,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 30: [2022-11-26 03:59:31,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
31: [2022-11-26 03:59:31,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:59:31,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 03:59:31,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 31: [2022-11-26 03:59:31,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 03:59:31,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 16: [2022-11-26 03:59:31,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:59:31,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 03:59:31,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 20: [2022-11-26 03:59:31,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:59:31,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 03:59:31,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 15: [2022-11-26 03:59:31,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
0: [2022-11-26 03:59:31,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:59:31,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 0: [2022-11-26 03:59:31,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 15: [2022-11-26 03:59:31,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 0: [2022-11-26 03:59:31,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 6: [2022-11-26 03:59:31,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:59:31,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 03:59:31,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 8: [2022-11-26 03:59:31,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:59:31,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:59:31,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 03:59:31,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 03:59:31,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 8: [2022-11-26 03:59:31,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 17: [2022-11-26 03:59:31,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 8: [2022-11-26 03:59:31,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 5: [2022-11-26 03:59:31,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:59:31,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 5: [2022-11-26 03:59:31,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 03:59:31,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 28: [2022-11-26 03:59:31,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:59:31,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
26: [2022-11-26 03:59:31,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 28: [2022-11-26 03:59:31,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 03:59:31,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 26: [2022-11-26 03:59:31,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 13: [2022-11-26 03:59:31,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:59:31,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:59:31,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 24: [2022-11-26 03:59:31,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 13: [2022-11-26 03:59:31,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 24: [2022-11-26 03:59:31,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 5: [2022-11-26 03:59:31,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 03:59:31,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 03:59:31,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 10: [2022-11-26 03:59:31,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:59:31,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 03:59:31,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 11: [2022-11-26 03:59:31,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:59:31,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 03:59:31,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 7: [2022-11-26 03:59:31,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:59:31,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 03:59:31,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 18: [2022-11-26 03:59:31,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
18: [2022-11-26 03:59:31,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 7: [2022-11-26 03:59:31,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:59:31,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 7: [2022-11-26 03:59:31,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 03:59:31,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 1: [2022-11-26 03:59:31,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:59:31,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:59:31,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 03:59:31,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 1: [2022-11-26 03:59:31,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 03:59:31,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 21: [2022-11-26 03:59:31,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-26 03:59:31,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 03:59:31,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 8: [2022-11-26 03:59:31,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:59:31,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 03:59:31,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:59:31,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 8: [2022-11-26 03:59:31,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 1: [2022-11-26 03:59:31,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:59:31,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:59:31,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:59:31,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 03:59:31,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:59:31,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 1: [2022-11-26 03:59:31,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 3: [2022-11-26 03:59:31,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 03:59:31,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 1: [2022-11-26 03:59:31,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 3: [2022-11-26 03:59:31,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 03:59:31,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 03:59:31,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 3: [2022-11-26 03:59:31,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 3: [2022-11-26 03:59:31,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 3: [2022-11-26 03:59:31,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
9: [2022-11-26 03:59:31,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:59:31,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:59:31,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:59:31,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 24: [2022-11-26 03:59:31,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 14: [2022-11-26 03:59:31,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 9: [2022-11-26 03:59:31,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 24: [2022-11-26 03:59:31,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 14: [2022-11-26 03:59:31,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 13: [2022-11-26 03:59:31,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:59:31,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 03:59:31,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
6: [2022-11-26 03:59:31,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:59:31,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 03:59:31,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 2: [2022-11-26 03:59:31,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:59:31,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:59:31,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 6: [2022-11-26 03:59:31,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 2: [2022-11-26 03:59:31,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 6: [2022-11-26 03:59:31,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 2: [2022-11-26 03:59:31,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
0: [2022-11-26 03:59:31,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 2: [2022-11-26 03:59:31,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 03:59:31,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 29: [2022-11-26 03:59:31,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:59:31,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:59:31,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:59:31,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 03:59:31,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 03:59:31,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 03:59:31,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 29: [2022-11-26 03:59:31,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 29: [2022-11-26 03:59:31,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
0: [2022-11-26 03:59:31,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 22: [2022-11-26 03:59:31,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:59:31,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:59:31,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:59:31,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 03:59:31,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 03:59:31,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 03:59:31,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 22: [2022-11-26 03:59:31,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 22: [2022-11-26 03:59:31,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 31: [2022-11-26 03:59:31,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:59:31,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 
31: [2022-11-26 03:59:31,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 30: [2022-11-26 03:59:31,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 31: [2022-11-26 03:59:31,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 30: [2022-11-26 03:59:31,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 20: [2022-11-26 03:59:31,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:59:31,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 03:59:31,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 17: [2022-11-26 03:59:31,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:59:31,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:59:31,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 17: [2022-11-26 03:59:31,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 20: [2022-11-26 03:59:31,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
17: [2022-11-26 03:59:31,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 30: [2022-11-26 03:59:31,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:59:31,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 03:59:31,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 2: [2022-11-26 03:59:31,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:59:31,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 03:59:31,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 4: [2022-11-26 03:59:31,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:59:31,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:59:31,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:59:31,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 26: [2022-11-26 03:59:31,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
4: [2022-11-26 03:59:31,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 03:59:31,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 03:59:31,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 4: [2022-11-26 03:59:31,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 4: [2022-11-26 03:59:31,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 26: [2022-11-26 03:59:31,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 4: [2022-11-26 03:59:31,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:59:31,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 4: [2022-11-26 03:59:31,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 03:59:31,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 25: [2022-11-26 03:59:31,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 
25: [2022-11-26 03:59:31,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 5: [2022-11-26 03:59:31,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:59:31,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 5: [2022-11-26 03:59:31,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 03:59:31,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 25: [2022-11-26 03:59:31,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:59:31,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 03:59:31,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 1: [2022-11-26 03:59:31,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:59:31,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 03:59:31,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 4: [2022-11-26 03:59:31,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 03:59:31,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 03:59:31,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 12: [2022-11-26 03:59:31,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:59:31,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:59:31,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:59:31,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:59:31,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 03:59:31,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 03:59:31,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 03:59:31,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 03:59:31,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
12: [2022-11-26 03:59:31,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 12: [2022-11-26 03:59:31,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 12: [2022-11-26 03:59:31,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 23: [2022-11-26 03:59:31,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:59:31,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 03:59:31,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 30: [2022-11-26 03:59:31,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:59:31,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 03:59:31,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 29: [2022-11-26 03:59:31,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:59:31,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 03:59:31,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
27: [2022-11-26 03:59:31,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:59:31,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:59:31,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:59:31,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 03:59:31,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 03:59:31,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 03:59:31,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 27: [2022-11-26 03:59:31,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 27: [2022-11-26 03:59:31,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 18: [2022-11-26 03:59:31,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 03:59:31,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 03:59:31,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
24: [2022-11-26 03:59:31,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:59:31,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 03:59:31,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 19: [2022-11-26 03:59:31,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:59:31,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 03:59:31,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 8: [2022-11-26 03:59:31,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:59:31,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 03:59:31,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 21: [2022-11-26 03:59:31,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:59:31,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 03:59:31,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
0: [2022-11-26 03:59:31,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:59:31,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 03:59:31,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 3: [2022-11-26 03:59:31,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:59:31,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 03:59:31,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 31: [2022-11-26 03:59:31,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:59:31,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 03:59:31,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 16: [2022-11-26 03:59:31,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:59:31,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 03:59:31,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
7: [2022-11-26 03:59:31,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:59:31,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 03:59:31,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 2: [2022-11-26 03:59:31,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:59:31,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 03:59:31,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 12: [2022-11-26 03:59:31,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:59:31,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 03:59:31,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 15: [2022-11-26 03:59:31,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:59:31,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 03:59:31,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
6: [2022-11-26 03:59:31,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:59:31,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 03:59:31,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 17: [2022-11-26 03:59:31,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:59:31,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 03:59:31,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 14: [2022-11-26 03:59:31,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:59:31,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 03:59:31,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 9: [2022-11-26 03:59:31,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:59:31,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 03:59:31,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 9: [2022-11-26 03:59:31,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 13: [2022-11-26 03:59:31,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 9: [2022-11-26 03:59:31,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 28: [2022-11-26 03:59:31,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:59:31,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:59:31,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 27: [2022-11-26 03:59:31,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 03:59:31,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 28: [2022-11-26 03:59:31,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 11: [2022-11-26 03:59:31,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-26 03:59:31,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 03:59:31,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 22: [2022-11-26 03:59:31,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:59:31,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 03:59:31,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 10: [2022-11-26 03:59:31,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:59:31,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 03:59:31,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 26: [2022-11-26 03:59:31,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:59:31,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:59:31,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 03:59:31,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
1: [2022-11-26 03:59:31,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 03:59:31,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 4: [2022-11-26 03:59:31,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:59:31,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 03:59:31,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 20: [2022-11-26 03:59:31,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:59:31,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 23: [2022-11-26 03:59:31,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:59:31,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 23: [2022-11-26 03:59:31,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 03:59:31,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 29: [2022-11-26 03:59:31,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-26 03:59:31,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 03:59:31,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 25: [2022-11-26 03:59:31,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:59:31,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 03:59:31,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 0: [2022-11-26 03:59:31,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:59:31,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 03:59:31,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 5: [2022-11-26 03:59:31,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:59:31,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 03:59:31,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 21: [2022-11-26 03:59:31,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
19: [2022-11-26 03:59:31,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:59:31,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 03:59:31,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 21: [2022-11-26 03:59:31,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 03:59:31,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 3: [2022-11-26 03:59:31,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:59:31,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 03:59:31,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 10: [2022-11-26 03:59:31,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:59:31,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 03:59:31,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 18: [2022-11-26 03:59:31,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
18: [2022-11-26 03:59:31,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 03:59:31,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 24: [2022-11-26 03:59:31,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:59:31,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 03:59:31,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 12: [2022-11-26 03:59:31,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:59:31,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 30: [2022-11-26 03:59:31,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:59:31,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 30: [2022-11-26 03:59:31,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 03:59:31,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 8: [2022-11-26 03:59:31,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 03:59:31,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 03:59:31,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 16: [2022-11-26 03:59:31,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:59:31,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 03:59:31,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 7: [2022-11-26 03:59:31,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:59:31,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:59:31,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 03:59:31,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 31: [2022-11-26 03:59:31,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 03:59:31,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 2: [2022-11-26 03:59:31,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 03:59:31,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 03:59:31,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 15: [2022-11-26 03:59:31,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:59:31,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 03:59:31,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 14: [2022-11-26 03:59:31,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:59:31,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 03:59:31,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 17: [2022-11-26 03:59:31,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 03:59:31,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 03:59:31,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 6: [2022-11-26 03:59:31,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 03:59:31,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 03:59:31,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 9: [2022-11-26 03:59:31,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:59:31,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 03:59:31,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 28: [2022-11-26 03:59:31,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:59:31,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 03:59:31,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 22: [2022-11-26 03:59:31,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:59:31,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 03:59:31,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 11: [2022-11-26 03:59:31,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 03:59:31,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 03:59:31,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 27: [2022-11-26 03:59:31,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:59:31,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 03:59:31,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 4: [2022-11-26 03:59:31,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:59:31,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 03:59:31,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 13: [2022-11-26 03:59:31,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:59:31,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 03:59:31,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 1: [2022-11-26 03:59:31,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 03:59:31,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 03:59:31,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 0: [2022-11-26 03:59:31,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 26: [2022-11-26 03:59:31,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:59:31,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 03:59:31,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 26: [2022-11-26 03:59:31,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 03:59:31,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 23: [2022-11-26 03:59:31,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:59:31,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 20: [2022-11-26 03:59:31,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:59:31,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
21: [2022-11-26 03:59:31,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:59:31,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 21: [2022-11-26 03:59:31,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 03:59:31,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 20: [2022-11-26 03:59:31,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 25: [2022-11-26 03:59:31,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 03:59:31,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 03:59:31,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 16: [2022-11-26 03:59:31,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:59:31,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 03:59:31,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 5: [2022-11-26 03:59:31,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
19: [2022-11-26 03:59:31,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:59:31,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 19: [2022-11-26 03:59:31,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 5: [2022-11-26 03:59:31,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 19: [2022-11-26 03:59:31,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 8: [2022-11-26 03:59:31,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:59:31,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 03:59:31,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 3: [2022-11-26 03:59:31,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:59:31,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 03:59:31,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 18: [2022-11-26 03:59:31,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
18: [2022-11-26 03:59:31,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 03:59:31,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 24: [2022-11-26 03:59:31,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:59:31,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 03:59:31,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 30: [2022-11-26 03:59:31,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 03:59:31,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 03:59:31,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 29: [2022-11-26 03:59:31,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:59:31,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-26 03:59:31,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 29: [2022-11-26 03:59:31,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 10: [2022-11-26 03:59:31,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 29: [2022-11-26 03:59:31,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 12: [2022-11-26 03:59:31,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:59:31,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 03:59:31,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 7: [2022-11-26 03:59:31,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:59:31,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 03:59:31,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 15: [2022-11-26 03:59:31,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 03:59:31,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 03:59:31,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 28: [2022-11-26 03:59:31,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 31: [2022-11-26 03:59:31,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:59:31,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 03:59:31,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 31: [2022-11-26 03:59:31,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 03:59:31,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 14: [2022-11-26 03:59:31,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:59:31,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 03:59:31,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 0: [2022-11-26 03:59:31,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 03:59:31,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 03:59:31,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 12: [2022-11-26 03:59:31,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:59:31,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 03:59:31,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 4: [2022-11-26 03:59:31,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:59:31,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 03:59:31,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 16: [2022-11-26 03:59:31,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 03:59:31,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 03:59:31,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 17: [2022-11-26 03:59:31,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-26 03:59:31,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 03:59:31,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 11: [2022-11-26 03:59:31,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:59:31,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:59:31,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 11: [2022-11-26 03:59:31,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 03:59:31,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 7: [2022-11-26 03:59:31,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 3: [2022-11-26 03:59:31,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:59:31,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 03:59:31,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 8: [2022-11-26 03:59:31,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 03:59:31,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 03:59:31,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 10: [2022-11-26 03:59:31,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:59:31,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 24: [2022-11-26 03:59:31,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:59:31,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 31: [2022-11-26 03:59:31,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 24: [2022-11-26 03:59:31,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 03:59:31,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 31: [2022-11-26 03:59:31,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 03:59:31,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 26: [2022-11-26 03:59:31,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-26 03:59:31,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 03:59:31,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 23: [2022-11-26 03:59:31,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 03:59:31,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 03:59:31,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 19: [2022-11-26 03:59:31,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 03:59:31,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 03:59:31,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 18: [2022-11-26 03:59:31,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 20: [2022-11-26 03:59:31,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:59:31,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
18: [2022-11-26 03:59:31,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 22: [2022-11-26 03:59:31,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 20: [2022-11-26 03:59:31,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 22: [2022-11-26 03:59:31,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 18: [2022-11-26 03:59:31,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 20: [2022-11-26 03:59:31,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 9: [2022-11-26 03:59:31,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:59:31,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 03:59:31,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 1: [2022-11-26 03:59:31,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:59:31,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
1: [2022-11-26 03:59:31,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 13: [2022-11-26 03:59:31,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 1: [2022-11-26 03:59:31,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 13: [2022-11-26 03:59:31,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 21: [2022-11-26 03:59:31,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 03:59:31,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 03:59:31,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 17: [2022-11-26 03:59:31,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:59:31,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 03:59:31,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 17: [2022-11-26 03:59:31,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 29: [2022-11-26 03:59:31,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
17: [2022-11-26 03:59:31,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 25: [2022-11-26 03:59:31,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:59:31,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:59:31,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 25: [2022-11-26 03:59:31,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 5: [2022-11-26 03:59:31,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 25: [2022-11-26 03:59:31,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 2: [2022-11-26 03:59:31,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:59:31,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:59:31,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 03:59:31,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 03:59:31,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
2: [2022-11-26 03:59:31,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 11: [2022-11-26 03:59:31,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:59:31,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 03:59:31,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 27: [2022-11-26 03:59:31,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 28: [2022-11-26 03:59:31,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:59:31,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 28: [2022-11-26 03:59:31,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 27: [2022-11-26 03:59:31,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 28: [2022-11-26 03:59:31,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 14: [2022-11-26 03:59:31,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 03:59:31,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 03:59:31,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 22: [2022-11-26 03:59:31,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:59:31,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 15: [2022-11-26 03:59:31,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:59:31,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 15: [2022-11-26 03:59:31,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 03:59:31,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 6: [2022-11-26 03:59:31,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:59:31,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 03:59:31,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 03:59:31,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 03:59:31,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 30: [2022-11-26 03:59:31,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:59:31,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 30: [2022-11-26 03:59:31,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 03:59:31,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 9: [2022-11-26 03:59:31,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:59:31,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 03:59:31,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 9: [2022-11-26 03:59:31,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 03:59:31,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 03:59:31,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 27: [2022-11-26 03:59:31,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 03:59:31,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 03:59:31,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 14: [2022-11-26 03:59:31,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:59:31,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 03:59:31,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 13: [2022-11-26 03:59:31,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:59:31,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 03:59:31,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 03:59:31,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 03:59:31,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 13: [2022-11-26 03:59:31,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 6: [2022-11-26 03:59:31,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:59:31,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 03:59:31,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 22: [2022-11-26 03:59:31,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 03:59:31,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 03:59:31,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 27: [2022-11-26 03:59:31,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
27: [2022-11-26 03:59:31,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step44000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 03:59:31,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 0: successfully saved checkpoint at iteration 44000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2513.29 31: iteration 44010/ 173500 | consumed samples: 11266560 | consumed tokens: 23073914880 | elapsed time per iteration (s): 1.09 | learning rate: 1.744E-04 | global batch size: 256 | lm loss: 2.067439E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.092 | TFLOPs: 14.22 | 31: iteration 44020/ 173500 | consumed samples: 11269120 | consumed tokens: 23079157760 | elapsed time per iteration (s): 0.85 | learning rate: 1.744E-04 | global batch size: 256 | lm loss: 2.084099E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.238 | TFLOPs: 18.28 | 31: iteration 44030/ 173500 | consumed samples: 11271680 | consumed tokens: 23084400640 | elapsed time per iteration (s): 0.82 | learning rate: 1.744E-04 | global batch size: 256 | lm loss: 2.063509E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.682 | TFLOPs: 18.92 | 31: iteration 44040/ 173500 | consumed samples: 11274240 | consumed tokens: 23089643520 | elapsed time per iteration (s): 0.80 | learning rate: 1.744E-04 | global batch size: 256 | lm loss: 2.122983E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.888 | TFLOPs: 19.41 | 31: iteration 44050/ 173500 | consumed samples: 11276800 | consumed tokens: 23094886400 | elapsed time per iteration (s): 0.80 | learning rate: 1.744E-04 | global 
batch size: 256 | lm loss: 2.080668E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.963 | TFLOPs: 19.48 | 31: iteration 44060/ 173500 | consumed samples: 11279360 | consumed tokens: 23100129280 | elapsed time per iteration (s): 0.76 | learning rate: 1.744E-04 | global batch size: 256 | lm loss: 2.074264E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.423 | TFLOPs: 20.29 | 31: iteration 44070/ 173500 | consumed samples: 11281920 | consumed tokens: 23105372160 | elapsed time per iteration (s): 0.80 | learning rate: 1.743E-04 | global batch size: 256 | lm loss: 2.098053E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.607 | TFLOPs: 19.46 | 31: iteration 44080/ 173500 | consumed samples: 11284480 | consumed tokens: 23110615040 | elapsed time per iteration (s): 0.78 | learning rate: 1.743E-04 | global batch size: 256 | lm loss: 2.073193E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.427 | TFLOPs: 19.87 | 31: iteration 44090/ 173500 | consumed samples: 11287040 | consumed tokens: 23115857920 | elapsed time per iteration (s): 0.83 | learning rate: 1.743E-04 | global batch size: 256 | lm loss: 2.077766E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.226 | TFLOPs: 18.77 | 31: iteration 44100/ 173500 | consumed samples: 11289600 | consumed tokens: 23121100800 | elapsed time per iteration (s): 0.78 | learning rate: 1.743E-04 | global batch size: 256 | lm loss: 2.098843E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.385 | TFLOPs: 19.81 | 31: iteration 44110/ 173500 | consumed samples: 11292160 
| consumed tokens: 23126343680 | elapsed time per iteration (s): 0.77 | learning rate: 1.743E-04 | global batch size: 256 | lm loss: 2.109781E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.010 | TFLOPs: 20.21 | 31: iteration 44120/ 173500 | consumed samples: 11294720 | consumed tokens: 23131586560 | elapsed time per iteration (s): 0.82 | learning rate: 1.743E-04 | global batch size: 256 | lm loss: 2.077923E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.503 | TFLOPs: 18.91 | 31: iteration 44130/ 173500 | consumed samples: 11297280 | consumed tokens: 23136829440 | elapsed time per iteration (s): 0.83 | learning rate: 1.743E-04 | global batch size: 256 | lm loss: 2.077049E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.442 | TFLOPs: 18.60 | 31: iteration 44140/ 173500 | consumed samples: 11299840 | consumed tokens: 23142072320 | elapsed time per iteration (s): 0.82 | learning rate: 1.743E-04 | global batch size: 256 | lm loss: 2.064676E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.130 | TFLOPs: 18.88 | 31: iteration 44150/ 173500 | consumed samples: 11302400 | consumed tokens: 23147315200 | elapsed time per iteration (s): 0.82 | learning rate: 1.742E-04 | global batch size: 256 | lm loss: 2.113318E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.770 | TFLOPs: 18.80 | 31: iteration 44160/ 173500 | consumed samples: 11304960 | consumed tokens: 23152558080 | elapsed time per iteration (s): 0.85 | learning rate: 1.742E-04 | global batch size: 256 | lm loss: 2.080479E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 
0 | samples per second: 300.155 | TFLOPs: 18.16 | 31: iteration 44170/ 173500 | consumed samples: 11307520 | consumed tokens: 23157800960 | elapsed time per iteration (s): 0.80 | learning rate: 1.742E-04 | global batch size: 256 | lm loss: 2.062675E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.409 | TFLOPs: 19.38 | 31: iteration 44180/ 173500 | consumed samples: 11310080 | consumed tokens: 23163043840 | elapsed time per iteration (s): 0.82 | learning rate: 1.742E-04 | global batch size: 256 | lm loss: 2.059573E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.271 | TFLOPs: 18.95 | 31: iteration 44190/ 173500 | consumed samples: 11312640 | consumed tokens: 23168286720 | elapsed time per iteration (s): 0.81 | learning rate: 1.742E-04 | global batch size: 256 | lm loss: 2.077205E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.190 | TFLOPs: 19.07 | 31: iteration 44200/ 173500 | consumed samples: 11315200 | consumed tokens: 23173529600 | elapsed time per iteration (s): 0.79 | learning rate: 1.742E-04 | global batch size: 256 | lm loss: 2.055609E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.966 | TFLOPs: 19.54 | 31: iteration 44210/ 173500 | consumed samples: 11317760 | consumed tokens: 23178772480 | elapsed time per iteration (s): 0.77 | learning rate: 1.742E-04 | global batch size: 256 | lm loss: 2.081053E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.518 | TFLOPs: 20.00 | 31: iteration 44220/ 173500 | consumed samples: 11320320 | consumed tokens: 23184015360 | elapsed time per iteration (s): 0.82 | learning rate: 1.742E-04 | global batch size: 256 | lm loss: 
2.088218E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.028 | TFLOPs: 18.82 | 31: iteration 44230/ 173500 | consumed samples: 11322880 | consumed tokens: 23189258240 | elapsed time per iteration (s): 0.80 | learning rate: 1.742E-04 | global batch size: 256 | lm loss: 2.065537E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.524 | TFLOPs: 19.45 | 31: iteration 44240/ 173500 | consumed samples: 11325440 | consumed tokens: 23194501120 | elapsed time per iteration (s): 0.79 | learning rate: 1.741E-04 | global batch size: 256 | lm loss: 2.092559E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.239 | TFLOPs: 19.49 | 31: iteration 44250/ 173500 | consumed samples: 11328000 | consumed tokens: 23199744000 | elapsed time per iteration (s): 0.78 | learning rate: 1.741E-04 | global batch size: 256 | lm loss: 2.085245E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.038 | TFLOPs: 19.78 | 31: iteration 44260/ 173500 | consumed samples: 11330560 | consumed tokens: 23204986880 | elapsed time per iteration (s): 0.83 | learning rate: 1.741E-04 | global batch size: 256 | lm loss: 2.084055E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.548 | TFLOPs: 18.73 | 31: iteration 44270/ 173500 | consumed samples: 11333120 | consumed tokens: 23210229760 | elapsed time per iteration (s): 0.87 | learning rate: 1.741E-04 | global batch size: 256 | lm loss: 2.074998E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.398 | TFLOPs: 17.75 | 31: iteration 44280/ 173500 | consumed samples: 11335680 | consumed tokens: 
23215472640 | elapsed time per iteration (s): 0.92 | learning rate: 1.741E-04 | global batch size: 256 | lm loss: 2.088442E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 279.459 | TFLOPs: 16.91 | 31: iteration 44290/ 173500 | consumed samples: 11338240 | consumed tokens: 23220715520 | elapsed time per iteration (s): 0.79 | learning rate: 1.741E-04 | global batch size: 256 | lm loss: 2.091436E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.060 | TFLOPs: 19.60 | 31: iteration 44300/ 173500 | consumed samples: 11340800 | consumed tokens: 23225958400 | elapsed time per iteration (s): 0.84 | learning rate: 1.741E-04 | global batch size: 256 | lm loss: 2.045120E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.336 | TFLOPs: 18.53 | 31: iteration 44310/ 173500 | consumed samples: 11343360 | consumed tokens: 23231201280 | elapsed time per iteration (s): 0.82 | learning rate: 1.741E-04 | global batch size: 256 | lm loss: 2.098124E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.111 | TFLOPs: 18.88 | 31: iteration 44320/ 173500 | consumed samples: 11345920 | consumed tokens: 23236444160 | elapsed time per iteration (s): 0.84 | learning rate: 1.741E-04 | global batch size: 256 | lm loss: 2.061583E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.405 | TFLOPs: 18.48 | 31: iteration 44330/ 173500 | consumed samples: 11348480 | consumed tokens: 23241687040 | elapsed time per iteration (s): 0.81 | learning rate: 1.740E-04 | global batch size: 256 | lm loss: 2.088351E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 315.963 | TFLOPs: 19.11 | 31: iteration 44340/ 173500 | consumed samples: 11351040 | consumed tokens: 23246929920 | elapsed time per iteration (s): 0.80 | learning rate: 1.740E-04 | global batch size: 256 | lm loss: 2.057925E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.502 | TFLOPs: 19.45 | 31: iteration 44350/ 173500 | consumed samples: 11353600 | consumed tokens: 23252172800 | elapsed time per iteration (s): 0.82 | learning rate: 1.740E-04 | global batch size: 256 | lm loss: 2.082239E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.061 | TFLOPs: 18.88 | 31: iteration 44360/ 173500 | consumed samples: 11356160 | consumed tokens: 23257415680 | elapsed time per iteration (s): 0.86 | learning rate: 1.740E-04 | global batch size: 256 | lm loss: 2.081538E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.069 | TFLOPs: 18.09 | 31: iteration 44370/ 173500 | consumed samples: 11358720 | consumed tokens: 23262658560 | elapsed time per iteration (s): 0.88 | learning rate: 1.740E-04 | global batch size: 256 | lm loss: 2.073463E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.487 | TFLOPs: 17.51 | 31: iteration 44380/ 173500 | consumed samples: 11361280 | consumed tokens: 23267901440 | elapsed time per iteration (s): 0.81 | learning rate: 1.740E-04 | global batch size: 256 | lm loss: 2.100674E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.669 | TFLOPs: 19.04 | 31: iteration 44390/ 173500 | consumed samples: 11363840 | consumed tokens: 23273144320 | elapsed time per iteration (s): 0.78 | learning rate: 1.740E-04 | global batch size: 256 | lm loss: 2.092539E+00 | grad 
norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.462 | TFLOPs: 19.81 | 31: iteration 44400/ 173500 | consumed samples: 11366400 | consumed tokens: 23278387200 | elapsed time per iteration (s): 0.83 | learning rate: 1.740E-04 | global batch size: 256 | lm loss: 2.073587E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.088 | TFLOPs: 18.76 | 31: iteration 44410/ 173500 | consumed samples: 11368960 | consumed tokens: 23283630080 | elapsed time per iteration (s): 0.84 | learning rate: 1.739E-04 | global batch size: 256 | lm loss: 2.090302E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.624 | TFLOPs: 18.43 | 31: iteration 44420/ 173500 | consumed samples: 11371520 | consumed tokens: 23288872960 | elapsed time per iteration (s): 0.83 | learning rate: 1.739E-04 | global batch size: 256 | lm loss: 2.088646E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.676 | TFLOPs: 18.55 | 31: iteration 44430/ 173500 | consumed samples: 11374080 | consumed tokens: 23294115840 | elapsed time per iteration (s): 0.81 | learning rate: 1.739E-04 | global batch size: 256 | lm loss: 2.050382E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.231 | TFLOPs: 19.01 | 31: iteration 44440/ 173500 | consumed samples: 11376640 | consumed tokens: 23299358720 | elapsed time per iteration (s): 0.80 | learning rate: 1.739E-04 | global batch size: 256 | lm loss: 2.075646E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.132 | TFLOPs: 19.43 | 31: iteration 44450/ 173500 | consumed samples: 11379200 | consumed tokens: 23304601600 | elapsed time 
per iteration (s): 0.83 | learning rate: 1.739E-04 | global batch size: 256 | lm loss: 2.107107E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.723 | TFLOPs: 18.56 | 31: iteration 44460/ 173500 | consumed samples: 11381760 | consumed tokens: 23309844480 | elapsed time per iteration (s): 0.84 | learning rate: 1.739E-04 | global batch size: 256 | lm loss: 2.056935E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.056 | TFLOPs: 18.46 | 31: iteration 44470/ 173500 | consumed samples: 11384320 | consumed tokens: 23315087360 | elapsed time per iteration (s): 0.79 | learning rate: 1.739E-04 | global batch size: 256 | lm loss: 2.071634E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.848 | TFLOPs: 19.71 | 31: iteration 44480/ 173500 | consumed samples: 11386880 | consumed tokens: 23320330240 | elapsed time per iteration (s): 0.83 | learning rate: 1.739E-04 | global batch size: 256 | lm loss: 2.096383E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.432 | TFLOPs: 18.72 | 31: iteration 44490/ 173500 | consumed samples: 11389440 | consumed tokens: 23325573120 | elapsed time per iteration (s): 0.83 | learning rate: 1.739E-04 | global batch size: 256 | lm loss: 2.053462E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.605 | TFLOPs: 18.55 | 31: iteration 44500/ 173500 | consumed samples: 11392000 | consumed tokens: 23330816000 | elapsed time per iteration (s): 0.79 | learning rate: 1.738E-04 | global batch size: 256 | lm loss: 2.079702E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.319 | TFLOPs: 
19.56 | 31: iteration 44510/ 173500 | consumed samples: 11394560 | consumed tokens: 23336058880 | elapsed time per iteration (s): 0.79 | learning rate: 1.738E-04 | global batch size: 256 | lm loss: 2.092802E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.249 | TFLOPs: 19.50 | 31: iteration 44520/ 173500 | consumed samples: 11397120 | consumed tokens: 23341301760 | elapsed time per iteration (s): 0.76 | learning rate: 1.738E-04 | global batch size: 256 | lm loss: 2.084599E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.664 | TFLOPs: 20.25 | 31: iteration 44530/ 173500 | consumed samples: 11399680 | consumed tokens: 23346544640 | elapsed time per iteration (s): 0.74 | learning rate: 1.738E-04 | global batch size: 256 | lm loss: 2.054026E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.971 | TFLOPs: 20.93 | 31: iteration 44540/ 173500 | consumed samples: 11402240 | consumed tokens: 23351787520 | elapsed time per iteration (s): 0.80 | learning rate: 1.738E-04 | global batch size: 256 | lm loss: 2.097309E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.168 | TFLOPs: 19.25 | 31: iteration 44550/ 173500 | consumed samples: 11404800 | consumed tokens: 23357030400 | elapsed time per iteration (s): 0.78 | learning rate: 1.738E-04 | global batch size: 256 | lm loss: 2.074383E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.032 | TFLOPs: 19.85 | 31: iteration 44560/ 173500 | consumed samples: 11407360 | consumed tokens: 23362273280 | elapsed time per iteration (s): 0.78 | learning rate: 1.738E-04 | global batch size: 256 | lm loss: 2.057723E+00 | grad norm: 0.145 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.584 | TFLOPs: 19.82 | 31: iteration 44570/ 173500 | consumed samples: 11409920 | consumed tokens: 23367516160 | elapsed time per iteration (s): 0.77 | learning rate: 1.738E-04 | global batch size: 256 | lm loss: 2.087485E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.791 | TFLOPs: 20.19 | 31: iteration 44580/ 173500 | consumed samples: 11412480 | consumed tokens: 23372759040 | elapsed time per iteration (s): 0.82 | learning rate: 1.738E-04 | global batch size: 256 | lm loss: 2.082467E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.971 | TFLOPs: 18.99 | 31: iteration 44590/ 173500 | consumed samples: 11415040 | consumed tokens: 23378001920 | elapsed time per iteration (s): 0.82 | learning rate: 1.737E-04 | global batch size: 256 | lm loss: 2.067358E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.347 | TFLOPs: 18.96 | 31: iteration 44600/ 173500 | consumed samples: 11417600 | consumed tokens: 23383244800 | elapsed time per iteration (s): 0.77 | learning rate: 1.737E-04 | global batch size: 256 | lm loss: 2.062555E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.032 | TFLOPs: 20.15 | 31: iteration 44610/ 173500 | consumed samples: 11420160 | consumed tokens: 23388487680 | elapsed time per iteration (s): 0.73 | learning rate: 1.737E-04 | global batch size: 256 | lm loss: 2.068026E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.877 | TFLOPs: 21.23 | 31: iteration 44620/ 173500 | consumed samples: 11422720 | consumed tokens: 23393730560 | elapsed time per iteration (s): 0.80 | 
learning rate: 1.737E-04 | global batch size: 256 | lm loss: 2.084386E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.230 | TFLOPs: 19.37 | 31: iteration 44630/ 173500 | consumed samples: 11425280 | consumed tokens: 23398973440 | elapsed time per iteration (s): 0.78 | learning rate: 1.737E-04 | global batch size: 256 | lm loss: 2.083797E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.413 | TFLOPs: 19.93 | 31: iteration 44640/ 173500 | consumed samples: 11427840 | consumed tokens: 23404216320 | elapsed time per iteration (s): 0.79 | learning rate: 1.737E-04 | global batch size: 256 | lm loss: 2.080538E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.453 | TFLOPs: 19.69 | 31: iteration 44650/ 173500 | consumed samples: 11430400 | consumed tokens: 23409459200 | elapsed time per iteration (s): 0.77 | learning rate: 1.737E-04 | global batch size: 256 | lm loss: 2.073742E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.131 | TFLOPs: 20.21 | 31: iteration 44660/ 173500 | consumed samples: 11432960 | consumed tokens: 23414702080 | elapsed time per iteration (s): 0.75 | learning rate: 1.737E-04 | global batch size: 256 | lm loss: 2.046601E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.557 | TFLOPs: 20.60 | 31: iteration 44670/ 173500 | consumed samples: 11435520 | consumed tokens: 23419944960 | elapsed time per iteration (s): 0.72 | learning rate: 1.736E-04 | global batch size: 256 | lm loss: 2.079209E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.899 | TFLOPs: 21.41 | 31: iteration 44680/ 
173500 | consumed samples: 11438080 | consumed tokens: 23425187840 | elapsed time per iteration (s): 0.74 | learning rate: 1.736E-04 | global batch size: 256 | lm loss: 2.056296E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.072 | TFLOPs: 21.06 | 31: iteration 44690/ 173500 | consumed samples: 11440640 | consumed tokens: 23430430720 | elapsed time per iteration (s): 0.79 | learning rate: 1.736E-04 | global batch size: 256 | lm loss: 2.065142E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.804 | TFLOPs: 19.65 | 31: iteration 44700/ 173500 | consumed samples: 11443200 | consumed tokens: 23435673600 | elapsed time per iteration (s): 0.78 | learning rate: 1.736E-04 | global batch size: 256 | lm loss: 2.106583E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.780 | TFLOPs: 19.95 | 31: iteration 44710/ 173500 | consumed samples: 11445760 | consumed tokens: 23440916480 | elapsed time per iteration (s): 0.73 | learning rate: 1.736E-04 | global batch size: 256 | lm loss: 2.098840E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.874 | TFLOPs: 21.35 | 31: iteration 44720/ 173500 | consumed samples: 11448320 | consumed tokens: 23446159360 | elapsed time per iteration (s): 0.77 | learning rate: 1.736E-04 | global batch size: 256 | lm loss: 2.075506E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.277 | TFLOPs: 20.22 | 31: iteration 44730/ 173500 | consumed samples: 11450880 | consumed tokens: 23451402240 | elapsed time per iteration (s): 0.81 | learning rate: 1.736E-04 | global batch size: 256 | lm loss: 2.074734E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 314.923 | TFLOPs: 19.05 | 31: iteration 44740/ 173500 | consumed samples: 11453440 | consumed tokens: 23456645120 | elapsed time per iteration (s): 0.84 | learning rate: 1.736E-04 | global batch size: 256 | lm loss: 2.084184E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.106 | TFLOPs: 18.46 | 31: iteration 44750/ 173500 | consumed samples: 11456000 | consumed tokens: 23461888000 | elapsed time per iteration (s): 0.79 | learning rate: 1.736E-04 | global batch size: 256 | lm loss: 2.090472E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.724 | TFLOPs: 19.65 | 31: iteration 44760/ 173500 | consumed samples: 11458560 | consumed tokens: 23467130880 | elapsed time per iteration (s): 0.84 | learning rate: 1.735E-04 | global batch size: 256 | lm loss: 2.072474E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.072 | TFLOPs: 18.46 | 31: iteration 44770/ 173500 | consumed samples: 11461120 | consumed tokens: 23472373760 | elapsed time per iteration (s): 0.82 | learning rate: 1.735E-04 | global batch size: 256 | lm loss: 2.099689E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.955 | TFLOPs: 18.93 | 31: iteration 44780/ 173500 | consumed samples: 11463680 | consumed tokens: 23477616640 | elapsed time per iteration (s): 0.86 | learning rate: 1.735E-04 | global batch size: 256 | lm loss: 2.075543E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.308 | TFLOPs: 18.05 | 31: iteration 44790/ 173500 | consumed samples: 11466240 | consumed tokens: 23482859520 | elapsed time per iteration (s): 0.83 | learning rate: 
1.735E-04 | global batch size: 256 | lm loss: 2.087652E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.041 | TFLOPs: 18.64 | 31: iteration 44800/ 173500 | consumed samples: 11468800 | consumed tokens: 23488102400 | elapsed time per iteration (s): 0.84 | learning rate: 1.735E-04 | global batch size: 256 | lm loss: 2.085175E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.535 | TFLOPs: 18.36 | 31: iteration 44810/ 173500 | consumed samples: 11471360 | consumed tokens: 23493345280 | elapsed time per iteration (s): 0.83 | learning rate: 1.735E-04 | global batch size: 256 | lm loss: 2.071225E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.887 | TFLOPs: 18.69 | 31: iteration 44820/ 173500 | consumed samples: 11473920 | consumed tokens: 23498588160 | elapsed time per iteration (s): 0.82 | learning rate: 1.735E-04 | global batch size: 256 | lm loss: 2.083569E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.852 | TFLOPs: 18.93 | 31: iteration 44830/ 173500 | consumed samples: 11476480 | consumed tokens: 23503831040 | elapsed time per iteration (s): 0.82 | learning rate: 1.735E-04 | global batch size: 256 | lm loss: 2.072190E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.673 | TFLOPs: 18.92 | 31: iteration 44840/ 173500 | consumed samples: 11479040 | consumed tokens: 23509073920 | elapsed time per iteration (s): 0.80 | learning rate: 1.734E-04 | global batch size: 256 | lm loss: 2.074256E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.781 | TFLOPs: 19.47 | 31: iteration 44850/ 173500 | 
consumed samples: 11481600 | consumed tokens: 23514316800 | elapsed time per iteration (s): 0.82 | learning rate: 1.734E-04 | global batch size: 256 | lm loss: 2.062740E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.249 | TFLOPs: 18.83 | 31: iteration 44860/ 173500 | consumed samples: 11484160 | consumed tokens: 23519559680 | elapsed time per iteration (s): 0.72 | learning rate: 1.734E-04 | global batch size: 256 | lm loss: 2.080732E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.151 | TFLOPs: 21.61 | 31: iteration 44870/ 173500 | consumed samples: 11486720 | consumed tokens: 23524802560 | elapsed time per iteration (s): 0.80 | learning rate: 1.734E-04 | global batch size: 256 | lm loss: 2.100307E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.448 | TFLOPs: 19.33 | 31: iteration 44880/ 173500 | consumed samples: 11489280 | consumed tokens: 23530045440 | elapsed time per iteration (s): 0.83 | learning rate: 1.734E-04 | global batch size: 256 | lm loss: 2.098842E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.611 | TFLOPs: 18.61 | 31: iteration 44890/ 173500 | consumed samples: 11491840 | consumed tokens: 23535288320 | elapsed time per iteration (s): 0.78 | learning rate: 1.734E-04 | global batch size: 256 | lm loss: 2.089524E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.476 | TFLOPs: 19.75 | 31: iteration 44900/ 173500 | consumed samples: 11494400 | consumed tokens: 23540531200 | elapsed time per iteration (s): 0.76 | learning rate: 1.734E-04 | global batch size: 256 | lm loss: 2.089143E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 339.067 | TFLOPs: 20.51 | 31: iteration 44910/ 173500 | consumed samples: 11496960 | consumed tokens: 23545774080 | elapsed time per iteration (s): 0.84 | learning rate: 1.734E-04 | global batch size: 256 | lm loss: 2.064203E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.664 | TFLOPs: 18.37 | 31: iteration 44920/ 173500 | consumed samples: 11499520 | consumed tokens: 23551016960 | elapsed time per iteration (s): 0.76 | learning rate: 1.734E-04 | global batch size: 256 | lm loss: 2.112758E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.567 | TFLOPs: 20.30 | 31: iteration 44930/ 173500 | consumed samples: 11502080 | consumed tokens: 23556259840 | elapsed time per iteration (s): 0.80 | learning rate: 1.733E-04 | global batch size: 256 | lm loss: 2.072528E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.455 | TFLOPs: 19.39 | 31: iteration 44940/ 173500 | consumed samples: 11504640 | consumed tokens: 23561502720 | elapsed time per iteration (s): 0.89 | learning rate: 1.733E-04 | global batch size: 256 | lm loss: 2.098671E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.413 | TFLOPs: 17.45 | 31: iteration 44950/ 173500 | consumed samples: 11507200 | consumed tokens: 23566745600 | elapsed time per iteration (s): 0.90 | learning rate: 1.733E-04 | global batch size: 256 | lm loss: 2.075989E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 283.579 | TFLOPs: 17.16 | 31: iteration 44960/ 173500 | consumed samples: 11509760 | consumed tokens: 23571988480 | elapsed time per iteration (s): 0.86 | learning rate: 1.733E-04 | global batch 
size: 256 | lm loss: 2.087813E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.049 | TFLOPs: 18.03 |
31: iteration 44970/ 173500 | consumed samples: 11512320 | consumed tokens: 23577231360 | elapsed time per iteration (s): 0.75 | learning rate: 1.733E-04 | global batch size: 256 | lm loss: 2.061381E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.437 | TFLOPs: 20.54 |
31: iteration 44980/ 173500 | consumed samples: 11514880 | consumed tokens: 23582474240 | elapsed time per iteration (s): 0.75 | learning rate: 1.733E-04 | global batch size: 256 | lm loss: 2.078348E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.740 | TFLOPs: 20.73 |
31: iteration 44990/ 173500 | consumed samples: 11517440 | consumed tokens: 23587717120 | elapsed time per iteration (s): 0.75 | learning rate: 1.733E-04 | global batch size: 256 | lm loss: 2.080303E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.460 | TFLOPs: 20.54 |
31: iteration 45000/ 173500 | consumed samples: 11520000 | consumed tokens: 23592960000 | elapsed time per iteration (s): 0.76 | learning rate: 1.733E-04 | global batch size: 256 | lm loss: 2.055045E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.639 | TFLOPs: 20.49 |
31: -------------------------------------------------------------------------------------------
31: valid loss at iteration 45000 | lm loss value: 2.037125E+00 | lm loss PPL: 7.668529E+00 |
31: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 45000 to checkpoints_1b1long
0: [2022-11-26 04:12:56,026] [INFO] [logging.py:68:log_dist] 
[Rank 0] [Torch] Checkpoint global_step45000 is begin to save! 0: [2022-11-26 04:12:56,039] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_01-model_00-model_states.pt... 0: [2022-11-26 04:12:56,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_01-model_00-model_states.pt. 0: [2022-11-26 04:12:56,289] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_03-model_00-model_states.pt... 0: [2022-11-26 04:12:56,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_03-model_00-model_states.pt. 0: [2022-11-26 04:12:56,374] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_04-model_00-model_states.pt... 0: [2022-11-26 04:12:56,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_04-model_00-model_states.pt. 0: [2022-11-26 04:12:56,452] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_05-model_00-model_states.pt... 0: [2022-11-26 04:12:56,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_05-model_00-model_states.pt. 0: [2022-11-26 04:12:56,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_06-model_00-model_states.pt... 0: [2022-11-26 04:12:56,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_06-model_00-model_states.pt. 0: [2022-11-26 04:12:56,606] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_07-model_00-model_states.pt... 0: [2022-11-26 04:12:56,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 04:12:56,682] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_08-model_00-model_states.pt... 0: [2022-11-26 04:12:56,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_08-model_00-model_states.pt. 0: [2022-11-26 04:12:56,758] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_09-model_00-model_states.pt... 0: [2022-11-26 04:12:56,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_09-model_00-model_states.pt. 0: [2022-11-26 04:12:56,835] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_10-model_00-model_states.pt... 0: [2022-11-26 04:12:56,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_10-model_00-model_states.pt. 0: [2022-11-26 04:12:56,911] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_11-model_00-model_states.pt... 0: [2022-11-26 04:12:56,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_11-model_00-model_states.pt. 0: [2022-11-26 04:12:56,983] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_12-model_00-model_states.pt... 0: [2022-11-26 04:12:57,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_12-model_00-model_states.pt. 0: [2022-11-26 04:12:57,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_13-model_00-model_states.pt... 0: [2022-11-26 04:12:57,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 04:12:57,142] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_14-model_00-model_states.pt... 0: [2022-11-26 04:12:57,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_14-model_00-model_states.pt. 0: [2022-11-26 04:12:57,219] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_15-model_00-model_states.pt... 0: [2022-11-26 04:12:57,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_15-model_00-model_states.pt. 0: [2022-11-26 04:12:57,295] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_16-model_00-model_states.pt... 0: [2022-11-26 04:12:57,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_16-model_00-model_states.pt. 0: [2022-11-26 04:12:57,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_17-model_00-model_states.pt... 0: [2022-11-26 04:12:57,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_17-model_00-model_states.pt. 0: [2022-11-26 04:12:57,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_18-model_00-model_states.pt... 0: [2022-11-26 04:12:57,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_18-model_00-model_states.pt. 0: [2022-11-26 04:12:57,517] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_19-model_00-model_states.pt... 0: [2022-11-26 04:12:57,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 04:12:57,595] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_20-model_00-model_states.pt... 0: [2022-11-26 04:12:57,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_20-model_00-model_states.pt. 0: [2022-11-26 04:12:57,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_21-model_00-model_states.pt... 0: [2022-11-26 04:12:57,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_21-model_00-model_states.pt. 0: [2022-11-26 04:12:57,745] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_22-model_00-model_states.pt... 0: [2022-11-26 04:12:57,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_22-model_00-model_states.pt. 0: [2022-11-26 04:12:57,820] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_23-model_00-model_states.pt... 0: [2022-11-26 04:12:57,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_23-model_00-model_states.pt. 0: [2022-11-26 04:12:57,891] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_24-model_00-model_states.pt... 0: [2022-11-26 04:12:57,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_24-model_00-model_states.pt. 0: [2022-11-26 04:12:57,967] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_25-model_00-model_states.pt... 0: [2022-11-26 04:12:58,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 04:12:58,041] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_26-model_00-model_states.pt... 0: [2022-11-26 04:12:58,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_26-model_00-model_states.pt. 0: [2022-11-26 04:12:58,115] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_27-model_00-model_states.pt... 0: [2022-11-26 04:12:58,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_27-model_00-model_states.pt. 0: [2022-11-26 04:12:58,190] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_28-model_00-model_states.pt... 0: [2022-11-26 04:12:58,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_28-model_00-model_states.pt. 0: [2022-11-26 04:12:58,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/layer_30-model_00-model_states.pt... 0: [2022-11-26 04:12:58,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/layer_30-model_00-model_states.pt. 0: [2022-11-26 04:12:58,269] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step45000/mp_rank_00_model_states.pt 0: [2022-11-26 04:12:58,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/mp_rank_00_model_states.pt... 0: [2022-11-26 04:12:58,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/mp_rank_00_model_states.pt. 0: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
6: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
4: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 
16: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
12: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
11: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 
31: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 
21: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 
0: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 
9: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
1: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
12: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 
23: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 
29: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 
21: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
6: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 
2: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
15: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 
31: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 
18: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
8: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 
24: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 
27: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 
24: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 
28: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:12:58,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:12:58,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 17: [2022-11-26 04:12:58,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 04:12:58,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 04:12:58,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 22: [2022-11-26 04:12:58,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:12:58,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 22: [2022-11-26 04:12:58,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 18: [2022-11-26 04:12:58,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 22: [2022-11-26 04:12:58,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 18: [2022-11-26 04:12:58,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 
25: [2022-11-26 04:12:58,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 04:12:58,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 04:12:58,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 31: [2022-11-26 04:12:58,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 04:12:58,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 04:12:58,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 1: [2022-11-26 04:12:58,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:12:58,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:12:58,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 2: [2022-11-26 04:12:58,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 1: [2022-11-26 04:12:58,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 2: [2022-11-26 04:12:58,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 
3: [2022-11-26 04:12:58,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:12:58,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 04:12:58,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 15: [2022-11-26 04:12:58,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:12:58,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 04:12:58,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 28: [2022-11-26 04:12:58,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 04:12:58,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 27: [2022-11-26 04:12:58,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 28: [2022-11-26 04:12:58,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 26: [2022-11-26 04:12:58,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:12:58,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
27: [2022-11-26 04:12:58,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 04:12:58,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 15: [2022-11-26 04:12:58,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 26: [2022-11-26 04:12:58,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 15: [2022-11-26 04:12:58,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 26: [2022-11-26 04:12:58,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 1: [2022-11-26 04:12:58,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 15: [2022-11-26 04:12:58,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 1: [2022-11-26 04:12:58,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 7: [2022-11-26 04:12:58,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:12:58,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 30: [2022-11-26 04:12:58,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
6: [2022-11-26 04:12:58,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:12:58,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 8: [2022-11-26 04:12:58,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 30: [2022-11-26 04:12:58,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 7: [2022-11-26 04:12:58,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 8: [2022-11-26 04:12:58,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 30: [2022-11-26 04:12:58,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 6: [2022-11-26 04:12:58,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 04:12:58,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 20: [2022-11-26 04:12:58,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 04:12:58,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 04:12:58,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 
7: [2022-11-26 04:12:58,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:12:58,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:12:58,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 0: [2022-11-26 04:12:58,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 7: [2022-11-26 04:12:58,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 0: [2022-11-26 04:12:58,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 28: [2022-11-26 04:12:58,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 04:12:58,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 25: [2022-11-26 04:12:58,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 28: [2022-11-26 04:12:58,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 25: [2022-11-26 04:12:58,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 04:12:58,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 
4: [2022-11-26 04:12:58,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:12:58,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 04:12:58,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 6: [2022-11-26 04:12:58,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 21: [2022-11-26 04:12:58,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:12:58,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 21: [2022-11-26 04:12:58,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 6: [2022-11-26 04:12:58,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 21: [2022-11-26 04:12:58,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 10: [2022-11-26 04:12:58,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:12:58,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 04:12:58,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 04:12:58,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 04:12:58,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 10: [2022-11-26 04:12:58,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 22: [2022-11-26 04:12:58,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 04:12:58,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 04:12:58,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 3: [2022-11-26 04:12:58,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:12:58,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 04:12:58,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 11: [2022-11-26 04:12:58,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-26 04:12:58,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 04:12:58,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 11: [2022-11-26 04:12:58,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:12:58,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 04:12:58,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 1: [2022-11-26 04:12:58,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:12:58,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:12:58,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 04:12:58,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 26: [2022-11-26 04:12:58,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
1: [2022-11-26 04:12:58,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 26: [2022-11-26 04:12:58,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 04:12:58,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 1: [2022-11-26 04:12:58,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 30: [2022-11-26 04:12:58,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 27: [2022-11-26 04:12:58,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:12:58,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:12:58,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
30: [2022-11-26 04:12:58,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 12: [2022-11-26 04:12:58,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 27: [2022-11-26 04:12:58,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 12: [2022-11-26 04:12:58,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 27: [2022-11-26 04:12:58,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 12: [2022-11-26 04:12:58,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 31: [2022-11-26 04:12:58,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 30: [2022-11-26 04:12:58,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 12: [2022-11-26 04:12:58,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 31: [2022-11-26 04:12:58,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 04:12:58,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 23: [2022-11-26 04:12:58,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-26 04:12:58,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 04:12:58,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:12:58,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 23: [2022-11-26 04:12:58,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 31: [2022-11-26 04:12:58,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:12:58,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 31: [2022-11-26 04:12:58,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 04:12:58,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 11: [2022-11-26 04:12:58,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:12:58,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 04:12:58,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 15: [2022-11-26 04:12:58,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-26 04:12:58,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 04:12:58,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 0: [2022-11-26 04:12:58,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:12:58,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 04:12:58,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 29: [2022-11-26 04:12:58,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:12:58,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 04:12:58,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 27: [2022-11-26 04:12:58,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:12:58,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:12:58,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 04:12:58,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 
27: [2022-11-26 04:12:58,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 14: [2022-11-26 04:12:58,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 27: [2022-11-26 04:12:58,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 14: [2022-11-26 04:12:58,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 04:12:58,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 17: [2022-11-26 04:12:58,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 04:12:58,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 04:12:58,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 17: [2022-11-26 04:12:58,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 04:12:58,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 04:12:58,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 20: [2022-11-26 04:12:58,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 
20: [2022-11-26 04:12:58,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 04:12:58,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 20: [2022-11-26 04:12:58,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 04:12:58,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 04:12:58,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 4: [2022-11-26 04:12:58,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:12:58,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 04:12:58,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 12: [2022-11-26 04:12:58,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 21: [2022-11-26 04:12:58,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
12: [2022-11-26 04:12:58,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 21: [2022-11-26 04:12:58,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 12: [2022-11-26 04:12:58,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 21: [2022-11-26 04:12:58,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 8: [2022-11-26 04:12:58,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:12:58,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 04:12:58,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 3: [2022-11-26 04:12:58,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:12:58,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 10: [2022-11-26 04:12:58,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:12:58,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 
10: [2022-11-26 04:12:58,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 8: [2022-11-26 04:12:58,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:12:58,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 8: [2022-11-26 04:12:58,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 2: [2022-11-26 04:12:58,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:12:58,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 2: [2022-11-26 04:12:58,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 04:12:58,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 7: [2022-11-26 04:12:58,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:12:58,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 04:12:58,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 2: [2022-11-26 04:12:58,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 04:12:58,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 21: [2022-11-26 04:12:58,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:12:58,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 21: [2022-11-26 04:12:58,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 04:12:58,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 7: [2022-11-26 04:12:58,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:12:58,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:12:58,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 7: [2022-11-26 04:12:58,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 15: [2022-11-26 04:12:58,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 7: [2022-11-26 04:12:58,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 28: [2022-11-26 04:12:58,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
28: [2022-11-26 04:12:58,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:12:58,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 28: [2022-11-26 04:12:58,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 04:12:58,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 1: [2022-11-26 04:12:58,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 28: [2022-11-26 04:12:58,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 28: [2022-11-26 04:12:58,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 1: [2022-11-26 04:12:58,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 26: [2022-11-26 04:12:58,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 04:12:58,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 04:12:58,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 23: [2022-11-26 04:12:58,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
23: [2022-11-26 04:12:58,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 04:12:58,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 18: [2022-11-26 04:12:58,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:12:58,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 04:12:58,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 25: [2022-11-26 04:12:58,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 04:12:58,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 17: [2022-11-26 04:12:58,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 25: [2022-11-26 04:12:58,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 17: [2022-11-26 04:12:58,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 04:12:58,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 30: [2022-11-26 04:12:58,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
30: [2022-11-26 04:12:58,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 04:12:58,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 25: [2022-11-26 04:12:58,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 04:12:58,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 04:12:58,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 31: [2022-11-26 04:12:58,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 04:12:58,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 04:12:58,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 6: [2022-11-26 04:12:58,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:12:58,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 04:12:58,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 6: [2022-11-26 04:12:58,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 04:12:58,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 04:12:58,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 29: [2022-11-26 04:12:58,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:12:58,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 04:12:58,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 12: [2022-11-26 04:12:58,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:12:58,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 13: [2022-11-26 04:12:58,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:12:58,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:12:58,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 04:12:58,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 04:12:58,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 04:12:58,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 04:12:58,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 12: [2022-11-26 04:12:58,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 13: [2022-11-26 04:12:58,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 13: [2022-11-26 04:12:58,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 11: [2022-11-26 04:12:58,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:12:58,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 04:12:58,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 27: [2022-11-26 04:12:58,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-26 04:12:58,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 04:12:58,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 22: [2022-11-26 04:12:58,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 04:12:58,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 04:12:58,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 22: [2022-11-26 04:12:58,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 04:12:58,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 04:12:58,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 26: [2022-11-26 04:12:58,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 04:12:58,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 04:12:58,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 23: [2022-11-26 04:12:58,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-26 04:12:58,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 30: [2022-11-26 04:12:58,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:12:58,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 30: [2022-11-26 04:12:58,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 04:12:58,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 29: [2022-11-26 04:12:58,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:12:58,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 04:12:58,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 18: [2022-11-26 04:12:58,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:12:58,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 04:12:58,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 17: [2022-11-26 04:12:58,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 
17: [2022-11-26 04:12:58,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 04:12:58,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 0: [2022-11-26 04:12:58,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 21: [2022-11-26 04:12:58,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:12:58,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 21: [2022-11-26 04:12:58,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 04:12:58,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 0: [2022-11-26 04:12:58,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:12:58,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 04:12:58,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 20: [2022-11-26 04:12:58,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
20: [2022-11-26 04:12:58,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 04:12:58,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 3: [2022-11-26 04:12:58,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:12:58,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 04:12:58,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 8: [2022-11-26 04:12:58,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:12:58,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 04:12:58,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 2: [2022-11-26 04:12:58,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:12:58,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
2: [2022-11-26 04:12:58,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 4: [2022-11-26 04:12:58,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 2: [2022-11-26 04:12:58,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 4: [2022-11-26 04:12:58,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 14: [2022-11-26 04:12:58,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:12:58,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 04:12:58,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:12:58,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 14: [2022-11-26 04:12:58,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 31: [2022-11-26 04:12:58,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:12:58,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 
31: [2022-11-26 04:12:58,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 04:12:58,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 9: [2022-11-26 04:12:58,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:12:58,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:12:58,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:12:58,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:12:58,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 04:12:58,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 04:12:58,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 04:12:58,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 04:12:58,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 
9: [2022-11-26 04:12:58,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 9: [2022-11-26 04:12:58,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 9: [2022-11-26 04:12:58,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 10: [2022-11-26 04:12:58,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:12:58,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 04:12:58,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 13: [2022-11-26 04:12:58,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:12:58,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 04:12:58,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 19: [2022-11-26 04:12:58,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:12:58,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:12:58,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
19: [2022-11-26 04:12:58,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:12:58,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 04:12:58,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 04:12:58,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 04:12:58,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 19: [2022-11-26 04:12:58,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 04:12:58,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 19: [2022-11-26 04:12:58,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 19: [2022-11-26 04:12:58,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 5: [2022-11-26 04:12:58,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:12:58,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:12:58,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 04:12:58,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 04:12:58,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 04:12:58,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 5: [2022-11-26 04:12:58,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 04:12:58,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 5: [2022-11-26 04:12:58,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 16: [2022-11-26 04:12:58,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 04:12:58,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 04:12:58,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 04:12:58,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
16: [2022-11-26 04:12:58,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 04:12:58,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 04:12:58,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 04:12:58,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 04:12:58,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 16: [2022-11-26 04:12:58,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 16: [2022-11-26 04:12:58,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 16: [2022-11-26 04:12:58,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 4: [2022-11-26 04:12:58,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:12:58,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 04:12:58,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 14: [2022-11-26 04:12:58,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 04:12:58,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 04:12:58,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 24: [2022-11-26 04:12:58,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 04:12:58,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 04:12:58,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 04:12:58,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 04:12:58,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 04:12:58,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 04:12:58,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 04:12:58,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 24: [2022-11-26 04:12:58,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 24: [2022-11-26 04:12:58,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 
24: [2022-11-26 04:12:58,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 04:12:58,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 28: [2022-11-26 04:12:58,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 04:12:58,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 04:12:58,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 25: [2022-11-26 04:12:58,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 04:12:58,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 04:12:58,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 6: [2022-11-26 04:12:58,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:12:58,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 04:12:58,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 1: [2022-11-26 04:12:58,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-26 04:12:58,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 04:12:58,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 24: [2022-11-26 04:12:58,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 04:12:58,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 04:12:58,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 9: [2022-11-26 04:12:58,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:12:58,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 04:12:58,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 15: [2022-11-26 04:12:58,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:12:58,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 04:12:58,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 19: [2022-11-26 04:12:58,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-26 04:12:58,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 04:12:58,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 7: [2022-11-26 04:12:58,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:12:58,982] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 04:12:58,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 22: [2022-11-26 04:12:58,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 04:12:58,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 04:12:58,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 27: [2022-11-26 04:12:58,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 04:12:58,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 04:12:58,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 30: [2022-11-26 04:12:58,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-26 04:12:58,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 04:12:58,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 14: [2022-11-26 04:12:58,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:12:58,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 04:12:58,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 11: [2022-11-26 04:12:58,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:12:58,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 04:12:58,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 0: [2022-11-26 04:12:59,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:12:59,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 04:12:59,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 20: [2022-11-26 04:12:59,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-26 04:12:59,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 04:12:59,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 16: [2022-11-26 04:12:59,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 04:12:59,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 04:12:59,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 3: [2022-11-26 04:12:59,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:12:59,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 04:12:59,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 10: [2022-11-26 04:12:59,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 31: [2022-11-26 04:12:59,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:12:59,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 04:12:59,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 
31: [2022-11-26 04:12:59,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 04:12:59,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 2: [2022-11-26 04:12:59,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:12:59,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 04:12:59,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 12: [2022-11-26 04:12:59,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:12:59,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 04:12:59,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 25: [2022-11-26 04:12:59,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 04:12:59,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 04:12:59,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 21: [2022-11-26 04:12:59,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-26 04:12:59,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 04:12:59,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 18: [2022-11-26 04:12:59,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:12:59,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 04:12:59,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 5: [2022-11-26 04:12:59,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:12:59,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 04:12:59,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 17: [2022-11-26 04:12:59,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 04:12:59,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 04:12:59,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 9: [2022-11-26 04:12:59,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-26 04:12:59,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 13: [2022-11-26 04:12:59,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:12:59,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 13: [2022-11-26 04:12:59,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 29: [2022-11-26 04:12:59,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:12:59,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 29: [2022-11-26 04:12:59,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 04:12:59,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 8: [2022-11-26 04:12:59,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:12:59,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 04:12:59,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 23: [2022-11-26 04:12:59,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
23: [2022-11-26 04:12:59,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 04:12:59,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 28: [2022-11-26 04:12:59,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 04:12:59,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 04:12:59,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 4: [2022-11-26 04:12:59,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:12:59,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 04:12:59,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 6: [2022-11-26 04:12:59,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:12:59,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 04:12:59,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 1: [2022-11-26 04:12:59,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 04:12:59,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 04:12:59,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 24: [2022-11-26 04:12:59,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 04:12:59,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 04:12:59,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 15: [2022-11-26 04:12:59,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:12:59,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 04:12:59,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 22: [2022-11-26 04:12:59,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 04:12:59,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 04:12:59,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 27: [2022-11-26 04:12:59,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
27: [2022-11-26 04:12:59,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 04:12:59,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 7: [2022-11-26 04:12:59,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:12:59,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 04:12:59,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 11: [2022-11-26 04:12:59,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:12:59,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 04:12:59,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 14: [2022-11-26 04:12:59,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:12:59,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 04:12:59,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 30: [2022-11-26 04:12:59,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-26 04:12:59,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 04:12:59,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 16: [2022-11-26 04:12:59,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 04:12:59,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 04:12:59,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 19: [2022-11-26 04:12:59,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:12:59,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 04:12:59,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 0: [2022-11-26 04:12:59,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:12:59,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 04:12:59,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 12: [2022-11-26 04:12:59,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-26 04:12:59,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 04:12:59,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 31: [2022-11-26 04:12:59,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 04:12:59,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 04:12:59,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 21: [2022-11-26 04:12:59,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 04:12:59,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 04:12:59,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 2: [2022-11-26 04:12:59,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 20: [2022-11-26 04:12:59,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
2: [2022-11-26 04:12:59,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 20: [2022-11-26 04:12:59,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 2: [2022-11-26 04:12:59,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 20: [2022-11-26 04:12:59,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 5: [2022-11-26 04:12:59,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:12:59,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 04:12:59,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 17: [2022-11-26 04:12:59,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 04:12:59,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 04:12:59,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 13: [2022-11-26 04:12:59,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 04:12:59,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 04:12:59,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 8: [2022-11-26 04:12:59,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:12:59,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 04:12:59,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 28: [2022-11-26 04:12:59,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:12:59,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:12:59,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 28: [2022-11-26 04:12:59,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 10: [2022-11-26 04:12:59,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 28: [2022-11-26 04:12:59,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 26: [2022-11-26 04:12:59,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-26 04:12:59,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 04:12:59,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 9: [2022-11-26 04:12:59,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:12:59,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 04:12:59,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 3: [2022-11-26 04:12:59,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:12:59,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 04:12:59,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 4: [2022-11-26 04:12:59,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:12:59,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 04:12:59,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 26: [2022-11-26 04:12:59,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
26: [2022-11-26 04:12:59,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 25: [2022-11-26 04:12:59,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 26: [2022-11-26 04:12:59,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 25: [2022-11-26 04:12:59,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 04:12:59,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 29: [2022-11-26 04:12:59,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:12:59,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:12:59,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 04:12:59,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 6: [2022-11-26 04:12:59,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 24: [2022-11-26 04:12:59,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:12:59,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 
24: [2022-11-26 04:12:59,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 04:12:59,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 18: [2022-11-26 04:12:59,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:12:59,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 04:12:59,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 1: [2022-11-26 04:12:59,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:12:59,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 04:12:59,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 23: [2022-11-26 04:12:59,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:12:59,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 04:12:59,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 19: [2022-11-26 04:12:59,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
19: [2022-11-26 04:12:59,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 04:12:59,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 7: [2022-11-26 04:12:59,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:12:59,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 04:12:59,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 11: [2022-11-26 04:12:59,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:12:59,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 04:12:59,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 15: [2022-11-26 04:12:59,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:12:59,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 04:12:59,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 27: [2022-11-26 04:12:59,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
27: [2022-11-26 04:12:59,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 04:12:59,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 22: [2022-11-26 04:12:59,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 04:12:59,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 04:12:59,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 30: [2022-11-26 04:12:59,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 04:12:59,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 04:12:59,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 12: [2022-11-26 04:12:59,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:12:59,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 04:12:59,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 14: [2022-11-26 04:12:59,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-26 04:12:59,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 04:12:59,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 18: [2022-11-26 04:12:59,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:12:59,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 04:12:59,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 16: [2022-11-26 04:12:59,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 04:12:59,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 04:12:59,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 26: [2022-11-26 04:12:59,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 04:12:59,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 04:12:59,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 20: [2022-11-26 04:12:59,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
20: [2022-11-26 04:12:59,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 04:12:59,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 31: [2022-11-26 04:12:59,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:12:59,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 31: [2022-11-26 04:12:59,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 0: [2022-11-26 04:12:59,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 31: [2022-11-26 04:12:59,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 0: [2022-11-26 04:12:59,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 23: [2022-11-26 04:12:59,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:12:59,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 04:12:59,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 21: [2022-11-26 04:12:59,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
21: [2022-11-26 04:12:59,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 04:12:59,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 8: [2022-11-26 04:12:59,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:12:59,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 2: [2022-11-26 04:12:59,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:12:59,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 2: [2022-11-26 04:12:59,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 04:12:59,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 25: [2022-11-26 04:12:59,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 04:12:59,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 04:12:59,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 13: [2022-11-26 04:12:59,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-26 04:12:59,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 04:12:59,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 9: [2022-11-26 04:12:59,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:12:59,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 04:12:59,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 30: [2022-11-26 04:12:59,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 04:12:59,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 04:12:59,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 24: [2022-11-26 04:12:59,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:12:59,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:12:59,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 04:12:59,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 10: [2022-11-26 04:12:59,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 3: [2022-11-26 04:12:59,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 24: [2022-11-26 04:12:59,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 10: [2022-11-26 04:12:59,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 24: [2022-11-26 04:12:59,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 20: [2022-11-26 04:12:59,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:12:59,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 20: [2022-11-26 04:12:59,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 04:12:59,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 11: [2022-11-26 04:12:59,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 04:12:59,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 
28: [2022-11-26 04:12:59,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 04:12:59,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 04:12:59,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 7: [2022-11-26 04:12:59,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:12:59,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 04:12:59,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 6: [2022-11-26 04:12:59,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:12:59,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 04:12:59,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 19: [2022-11-26 04:12:59,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:12:59,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 04:12:59,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 
4: [2022-11-26 04:12:59,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:12:59,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 04:12:59,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 1: [2022-11-26 04:12:59,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:12:59,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 04:12:59,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 12: [2022-11-26 04:12:59,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:12:59,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 04:12:59,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 18: [2022-11-26 04:12:59,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:12:59,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
18: [2022-11-26 04:12:59,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 04:12:59,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 23: [2022-11-26 04:12:59,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 04:12:59,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 15: [2022-11-26 04:12:59,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:12:59,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:12:59,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 29: [2022-11-26 04:12:59,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 15: [2022-11-26 04:12:59,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 14: [2022-11-26 04:12:59,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:12:59,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 
14: [2022-11-26 04:12:59,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 04:12:59,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 16: [2022-11-26 04:12:59,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 04:12:59,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 04:12:59,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 21: [2022-11-26 04:12:59,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 27: [2022-11-26 04:12:59,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 21: [2022-11-26 04:12:59,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 04:12:59,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 27: [2022-11-26 04:12:59,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 04:12:59,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 2: [2022-11-26 04:12:59,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-26 04:12:59,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 04:12:59,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 22: [2022-11-26 04:12:59,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 04:12:59,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 5: [2022-11-26 04:12:59,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 22: [2022-11-26 04:12:59,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 5: [2022-11-26 04:12:59,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 04:12:59,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 0: [2022-11-26 04:12:59,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:12:59,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 04:12:59,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 10: [2022-11-26 04:12:59,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-26 04:12:59,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 04:12:59,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 26: [2022-11-26 04:12:59,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 04:12:59,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 04:12:59,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 3: [2022-11-26 04:12:59,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:12:59,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 04:12:59,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 29: [2022-11-26 04:12:59,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:12:59,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 04:12:59,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 13: [2022-11-26 04:12:59,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 04:12:59,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 8: [2022-11-26 04:12:59,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:12:59,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 8: [2022-11-26 04:12:59,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 04:12:59,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 5: [2022-11-26 04:12:59,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:12:59,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 04:12:59,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 5: [2022-11-26 04:12:59,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:12:59,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 04:12:59,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 17: [2022-11-26 04:12:59,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-26 04:12:59,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 04:12:59,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 29: [2022-11-26 04:12:59,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:12:59,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step45000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 04:12:59,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 0: successfully saved checkpoint at iteration 45000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 3124.73 31: iteration 45010/ 173500 | consumed samples: 11522560 | consumed tokens: 23598202880 | elapsed time per iteration (s): 1.10 | learning rate: 1.733E-04 | global batch size: 256 | lm loss: 2.074181E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.595 | TFLOPs: 14.07 | 31: iteration 45020/ 173500 | consumed samples: 11525120 | consumed tokens: 23603445760 | elapsed time per iteration (s): 0.74 | learning rate: 1.732E-04 | global batch size: 256 | lm loss: 2.072434E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.096 | TFLOPs: 20.82 | 31: iteration 45030/ 173500 | consumed samples: 11527680 | consumed tokens: 23608688640 | elapsed time per iteration (s): 0.77 | learning rate: 1.732E-04 | global batch size: 256 | lm loss: 2.040191E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.723 | TFLOPs: 20.13 | 31: iteration 45040/ 
173500 | consumed samples: 11530240 | consumed tokens: 23613931520 | elapsed time per iteration (s): 0.78 | learning rate: 1.732E-04 | global batch size: 256 | lm loss: 2.093032E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.330 | TFLOPs: 19.80 | 31: iteration 45050/ 173500 | consumed samples: 11532800 | consumed tokens: 23619174400 | elapsed time per iteration (s): 0.76 | learning rate: 1.732E-04 | global batch size: 256 | lm loss: 2.092293E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.828 | TFLOPs: 20.26 | 31: iteration 45060/ 173500 | consumed samples: 11535360 | consumed tokens: 23624417280 | elapsed time per iteration (s): 0.74 | learning rate: 1.732E-04 | global batch size: 256 | lm loss: 2.079153E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.246 | TFLOPs: 20.95 | 31: iteration 45070/ 173500 | consumed samples: 11537920 | consumed tokens: 23629660160 | elapsed time per iteration (s): 0.74 | learning rate: 1.732E-04 | global batch size: 256 | lm loss: 2.058510E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.939 | TFLOPs: 20.87 | 31: iteration 45080/ 173500 | consumed samples: 11540480 | consumed tokens: 23634903040 | elapsed time per iteration (s): 0.75 | learning rate: 1.732E-04 | global batch size: 256 | lm loss: 2.079167E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.171 | TFLOPs: 20.58 | 31: iteration 45090/ 173500 | consumed samples: 11543040 | consumed tokens: 23640145920 | elapsed time per iteration (s): 1.21 | learning rate: 1.732E-04 | global batch size: 256 | lm loss: 2.078976E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 211.511 | TFLOPs: 12.80 | 31: iteration 45100/ 173500 | consumed samples: 11545600 | consumed tokens: 23645388800 | elapsed time per iteration (s): 0.76 | learning rate: 1.731E-04 | global batch size: 256 | lm loss: 2.078906E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.351 | TFLOPs: 20.41 | 31: iteration 45110/ 173500 | consumed samples: 11548160 | consumed tokens: 23650631680 | elapsed time per iteration (s): 0.79 | learning rate: 1.731E-04 | global batch size: 256 | lm loss: 2.103436E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.927 | TFLOPs: 19.60 | 31: iteration 45120/ 173500 | consumed samples: 11550720 | consumed tokens: 23655874560 | elapsed time per iteration (s): 0.82 | learning rate: 1.731E-04 | global batch size: 256 | lm loss: 2.081236E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.029 | TFLOPs: 19.00 | 31: iteration 45130/ 173500 | consumed samples: 11553280 | consumed tokens: 23661117440 | elapsed time per iteration (s): 0.83 | learning rate: 1.731E-04 | global batch size: 256 | lm loss: 2.106098E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.558 | TFLOPs: 18.73 | 31: iteration 45140/ 173500 | consumed samples: 11555840 | consumed tokens: 23666360320 | elapsed time per iteration (s): 0.80 | learning rate: 1.731E-04 | global batch size: 256 | lm loss: 2.069330E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.681 | TFLOPs: 19.40 | 31: iteration 45150/ 173500 | consumed samples: 11558400 | consumed tokens: 23671603200 | elapsed time per iteration (s): 0.76 | learning rate: 
1.731E-04 | global batch size: 256 | lm loss: 2.094118E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.695 | TFLOPs: 20.31 | 31: iteration 45160/ 173500 | consumed samples: 11560960 | consumed tokens: 23676846080 | elapsed time per iteration (s): 0.78 | learning rate: 1.731E-04 | global batch size: 256 | lm loss: 2.072607E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.508 | TFLOPs: 19.93 | 31: iteration 45170/ 173500 | consumed samples: 11563520 | consumed tokens: 23682088960 | elapsed time per iteration (s): 0.93 | learning rate: 1.731E-04 | global batch size: 256 | lm loss: 2.074454E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 275.198 | TFLOPs: 16.65 | 31: iteration 45180/ 173500 | consumed samples: 11566080 | consumed tokens: 23687331840 | elapsed time per iteration (s): 0.75 | learning rate: 1.731E-04 | global batch size: 256 | lm loss: 2.088750E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.940 | TFLOPs: 20.69 | 31: iteration 45190/ 173500 | consumed samples: 11568640 | consumed tokens: 23692574720 | elapsed time per iteration (s): 0.72 | learning rate: 1.730E-04 | global batch size: 256 | lm loss: 2.064729E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.734 | TFLOPs: 21.52 | 31: iteration 45200/ 173500 | consumed samples: 11571200 | consumed tokens: 23697817600 | elapsed time per iteration (s): 0.82 | learning rate: 1.730E-04 | global batch size: 256 | lm loss: 2.072203E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.890 | TFLOPs: 18.93 | 31: iteration 45210/ 173500 | 
consumed samples: 11573760 | consumed tokens: 23703060480 | elapsed time per iteration (s): 0.76 | learning rate: 1.730E-04 | global batch size: 256 | lm loss: 2.053645E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.399 | TFLOPs: 20.47 | 31: iteration 45220/ 173500 | consumed samples: 11576320 | consumed tokens: 23708303360 | elapsed time per iteration (s): 0.76 | learning rate: 1.730E-04 | global batch size: 256 | lm loss: 2.095808E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.706 | TFLOPs: 20.37 | 31: iteration 45230/ 173500 | consumed samples: 11578880 | consumed tokens: 23713546240 | elapsed time per iteration (s): 0.81 | learning rate: 1.730E-04 | global batch size: 256 | lm loss: 2.081982E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.709 | TFLOPs: 19.10 | 31: iteration 45240/ 173500 | consumed samples: 11581440 | consumed tokens: 23718789120 | elapsed time per iteration (s): 0.78 | learning rate: 1.730E-04 | global batch size: 256 | lm loss: 2.068185E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.851 | TFLOPs: 19.96 | 31: iteration 45250/ 173500 | consumed samples: 11584000 | consumed tokens: 23724032000 | elapsed time per iteration (s): 0.77 | learning rate: 1.730E-04 | global batch size: 256 | lm loss: 2.060901E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.249 | TFLOPs: 20.04 | 31: iteration 45260/ 173500 | consumed samples: 11586560 | consumed tokens: 23729274880 | elapsed time per iteration (s): 0.74 | learning rate: 1.730E-04 | global batch size: 256 | lm loss: 2.078267E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 344.621 | TFLOPs: 20.85 | 31: iteration 45270/ 173500 | consumed samples: 11589120 | consumed tokens: 23734517760 | elapsed time per iteration (s): 0.81 | learning rate: 1.729E-04 | global batch size: 256 | lm loss: 2.100635E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.075 | TFLOPs: 19.12 | 31: iteration 45280/ 173500 | consumed samples: 11591680 | consumed tokens: 23739760640 | elapsed time per iteration (s): 0.72 | learning rate: 1.729E-04 | global batch size: 256 | lm loss: 2.061716E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.917 | TFLOPs: 21.41 | 31: iteration 45290/ 173500 | consumed samples: 11594240 | consumed tokens: 23745003520 | elapsed time per iteration (s): 0.80 | learning rate: 1.729E-04 | global batch size: 256 | lm loss: 2.096702E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.820 | TFLOPs: 19.35 | 31: iteration 45300/ 173500 | consumed samples: 11596800 | consumed tokens: 23750246400 | elapsed time per iteration (s): 0.78 | learning rate: 1.729E-04 | global batch size: 256 | lm loss: 2.090533E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.072 | TFLOPs: 19.91 | 31: iteration 45310/ 173500 | consumed samples: 11599360 | consumed tokens: 23755489280 | elapsed time per iteration (s): 0.79 | learning rate: 1.729E-04 | global batch size: 256 | lm loss: 2.041972E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.739 | TFLOPs: 19.59 | 31: iteration 45320/ 173500 | consumed samples: 11601920 | consumed tokens: 23760732160 | elapsed time per iteration (s): 0.79 | learning rate: 1.729E-04 | global batch 
size: 256 | lm loss: 2.065179E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.162 | TFLOPs: 19.55 | 31: iteration 45330/ 173500 | consumed samples: 11604480 | consumed tokens: 23765975040 | elapsed time per iteration (s): 0.79 | learning rate: 1.729E-04 | global batch size: 256 | lm loss: 2.072705E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.198 | TFLOPs: 19.61 | 31: iteration 45340/ 173500 | consumed samples: 11607040 | consumed tokens: 23771217920 | elapsed time per iteration (s): 0.73 | learning rate: 1.729E-04 | global batch size: 256 | lm loss: 2.079544E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.926 | TFLOPs: 21.11 | 31: iteration 45350/ 173500 | consumed samples: 11609600 | consumed tokens: 23776460800 | elapsed time per iteration (s): 0.75 | learning rate: 1.729E-04 | global batch size: 256 | lm loss: 2.059237E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.158 | TFLOPs: 20.52 | 31: iteration 45360/ 173500 | consumed samples: 11612160 | consumed tokens: 23781703680 | elapsed time per iteration (s): 0.74 | learning rate: 1.728E-04 | global batch size: 256 | lm loss: 2.078404E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.491 | TFLOPs: 20.96 | 31: iteration 45370/ 173500 | consumed samples: 11614720 | consumed tokens: 23786946560 | elapsed time per iteration (s): 0.76 | learning rate: 1.728E-04 | global batch size: 256 | lm loss: 2.060370E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.079 | TFLOPs: 20.27 | 31: iteration 45380/ 173500 | consumed samples: 11617280 | 
consumed tokens: 23792189440 | elapsed time per iteration (s): 0.76 | learning rate: 1.728E-04 | global batch size: 256 | lm loss: 2.098353E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.325 | TFLOPs: 20.41 | 31: iteration 45390/ 173500 | consumed samples: 11619840 | consumed tokens: 23797432320 | elapsed time per iteration (s): 0.81 | learning rate: 1.728E-04 | global batch size: 256 | lm loss: 2.128385E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.532 | TFLOPs: 19.09 | 31: iteration 45400/ 173500 | consumed samples: 11622400 | consumed tokens: 23802675200 | elapsed time per iteration (s): 0.78 | learning rate: 1.728E-04 | global batch size: 256 | lm loss: 2.070820E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.827 | TFLOPs: 19.95 | 31: iteration 45410/ 173500 | consumed samples: 11624960 | consumed tokens: 23807918080 | elapsed time per iteration (s): 0.84 | learning rate: 1.728E-04 | global batch size: 256 | lm loss: 2.064617E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.908 | TFLOPs: 18.39 | 31: iteration 45420/ 173500 | consumed samples: 11627520 | consumed tokens: 23813160960 | elapsed time per iteration (s): 0.73 | learning rate: 1.728E-04 | global batch size: 256 | lm loss: 2.081411E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.528 | TFLOPs: 21.15 | 31: iteration 45430/ 173500 | consumed samples: 11630080 | consumed tokens: 23818403840 | elapsed time per iteration (s): 0.79 | learning rate: 1.728E-04 | global batch size: 256 | lm loss: 2.065040E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 323.699 | TFLOPs: 19.58 | 31: iteration 45440/ 173500 | consumed samples: 11632640 | consumed tokens: 23823646720 | elapsed time per iteration (s): 0.76 | learning rate: 1.727E-04 | global batch size: 256 | lm loss: 2.072767E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.217 | TFLOPs: 20.46 | 31: iteration 45450/ 173500 | consumed samples: 11635200 | consumed tokens: 23828889600 | elapsed time per iteration (s): 0.83 | learning rate: 1.727E-04 | global batch size: 256 | lm loss: 2.118660E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.172 | TFLOPs: 18.70 | 31: iteration 45460/ 173500 | consumed samples: 11637760 | consumed tokens: 23834132480 | elapsed time per iteration (s): 0.77 | learning rate: 1.727E-04 | global batch size: 256 | lm loss: 2.066132E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.377 | TFLOPs: 20.23 | 31: iteration 45470/ 173500 | consumed samples: 11640320 | consumed tokens: 23839375360 | elapsed time per iteration (s): 0.78 | learning rate: 1.727E-04 | global batch size: 256 | lm loss: 2.062620E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.155 | TFLOPs: 19.79 | 31: iteration 45480/ 173500 | consumed samples: 11642880 | consumed tokens: 23844618240 | elapsed time per iteration (s): 0.79 | learning rate: 1.727E-04 | global batch size: 256 | lm loss: 2.111658E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.057 | TFLOPs: 19.73 | 31: iteration 45490/ 173500 | consumed samples: 11645440 | consumed tokens: 23849861120 | elapsed time per iteration (s): 0.78 | learning rate: 1.727E-04 | global batch size: 256 | lm loss: 
2.075761E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.946 | TFLOPs: 19.96 | 31: iteration 45500/ 173500 | consumed samples: 11648000 | consumed tokens: 23855104000 | elapsed time per iteration (s): 0.82 | learning rate: 1.727E-04 | global batch size: 256 | lm loss: 2.079066E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.031 | TFLOPs: 18.94 | 31: iteration 45510/ 173500 | consumed samples: 11650560 | consumed tokens: 23860346880 | elapsed time per iteration (s): 0.72 | learning rate: 1.727E-04 | global batch size: 256 | lm loss: 2.087301E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.410 | TFLOPs: 21.38 | 31: iteration 45520/ 173500 | consumed samples: 11653120 | consumed tokens: 23865589760 | elapsed time per iteration (s): 0.84 | learning rate: 1.727E-04 | global batch size: 256 | lm loss: 2.082878E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.568 | TFLOPs: 18.49 | 31: iteration 45530/ 173500 | consumed samples: 11655680 | consumed tokens: 23870832640 | elapsed time per iteration (s): 0.79 | learning rate: 1.726E-04 | global batch size: 256 | lm loss: 2.069280E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.067 | TFLOPs: 19.67 | 31: iteration 45540/ 173500 | consumed samples: 11658240 | consumed tokens: 23876075520 | elapsed time per iteration (s): 0.79 | learning rate: 1.726E-04 | global batch size: 256 | lm loss: 2.053638E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.368 | TFLOPs: 19.50 | 31: iteration 45550/ 173500 | consumed samples: 11660800 | consumed tokens: 
23881318400 | elapsed time per iteration (s): 0.86 | learning rate: 1.726E-04 | global batch size: 256 | lm loss: 2.072228E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.437 | TFLOPs: 17.99 | 31: iteration 45560/ 173500 | consumed samples: 11663360 | consumed tokens: 23886561280 | elapsed time per iteration (s): 0.78 | learning rate: 1.726E-04 | global batch size: 256 | lm loss: 2.104362E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.861 | TFLOPs: 19.77 | 31: iteration 45570/ 173500 | consumed samples: 11665920 | consumed tokens: 23891804160 | elapsed time per iteration (s): 0.87 | learning rate: 1.726E-04 | global batch size: 256 | lm loss: 2.075201E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.617 | TFLOPs: 17.88 | 31: iteration 45580/ 173500 | consumed samples: 11668480 | consumed tokens: 23897047040 | elapsed time per iteration (s): 0.74 | learning rate: 1.726E-04 | global batch size: 256 | lm loss: 2.096877E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.057 | TFLOPs: 20.81 | 31: iteration 45590/ 173500 | consumed samples: 11671040 | consumed tokens: 23902289920 | elapsed time per iteration (s): 0.85 | learning rate: 1.726E-04 | global batch size: 256 | lm loss: 2.080473E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.747 | TFLOPs: 18.25 | 31: iteration 45600/ 173500 | consumed samples: 11673600 | consumed tokens: 23907532800 | elapsed time per iteration (s): 0.78 | learning rate: 1.726E-04 | global batch size: 256 | lm loss: 2.071087E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 327.230 | TFLOPs: 19.80 | 31: iteration 45610/ 173500 | consumed samples: 11676160 | consumed tokens: 23912775680 | elapsed time per iteration (s): 0.78 | learning rate: 1.725E-04 | global batch size: 256 | lm loss: 2.064760E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.438 | TFLOPs: 19.81 | 31: iteration 45620/ 173500 | consumed samples: 11678720 | consumed tokens: 23918018560 | elapsed time per iteration (s): 0.79 | learning rate: 1.725E-04 | global batch size: 256 | lm loss: 2.055020E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.223 | TFLOPs: 19.55 | 31: iteration 45630/ 173500 | consumed samples: 11681280 | consumed tokens: 23923261440 | elapsed time per iteration (s): 0.78 | learning rate: 1.725E-04 | global batch size: 256 | lm loss: 2.065842E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.231 | TFLOPs: 19.98 | 31: iteration 45640/ 173500 | consumed samples: 11683840 | consumed tokens: 23928504320 | elapsed time per iteration (s): 0.79 | learning rate: 1.725E-04 | global batch size: 256 | lm loss: 2.104220E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.658 | TFLOPs: 19.70 | 31: iteration 45650/ 173500 | consumed samples: 11686400 | consumed tokens: 23933747200 | elapsed time per iteration (s): 0.79 | learning rate: 1.725E-04 | global batch size: 256 | lm loss: 2.068216E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.917 | TFLOPs: 19.72 | 31: iteration 45660/ 173500 | consumed samples: 11688960 | consumed tokens: 23938990080 | elapsed time per iteration (s): 0.76 | learning rate: 1.725E-04 | global batch size: 256 | lm loss: 2.087897E+00 | grad 
norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.809 | TFLOPs: 20.44 | 31: iteration 45670/ 173500 | consumed samples: 11691520 | consumed tokens: 23944232960 | elapsed time per iteration (s): 0.88 | learning rate: 1.725E-04 | global batch size: 256 | lm loss: 2.086071E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.220 | TFLOPs: 17.68 | 31: iteration 45680/ 173500 | consumed samples: 11694080 | consumed tokens: 23949475840 | elapsed time per iteration (s): 0.80 | learning rate: 1.725E-04 | global batch size: 256 | lm loss: 2.042261E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.937 | TFLOPs: 19.29 | 31: iteration 45690/ 173500 | consumed samples: 11696640 | consumed tokens: 23954718720 | elapsed time per iteration (s): 0.82 | learning rate: 1.724E-04 | global batch size: 256 | lm loss: 2.096002E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.658 | TFLOPs: 18.79 | 31: iteration 45700/ 173500 | consumed samples: 11699200 | consumed tokens: 23959961600 | elapsed time per iteration (s): 0.80 | learning rate: 1.724E-04 | global batch size: 256 | lm loss: 2.084645E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.741 | TFLOPs: 19.40 | 31: iteration 45710/ 173500 | consumed samples: 11701760 | consumed tokens: 23965204480 | elapsed time per iteration (s): 0.75 | learning rate: 1.724E-04 | global batch size: 256 | lm loss: 2.042391E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.950 | TFLOPs: 20.57 | 31: iteration 45720/ 173500 | consumed samples: 11704320 | consumed tokens: 23970447360 | elapsed time 
per iteration (s): 0.77 | learning rate: 1.724E-04 | global batch size: 256 | lm loss: 2.060013E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.854 | TFLOPs: 20.02 | 31: iteration 45730/ 173500 | consumed samples: 11706880 | consumed tokens: 23975690240 | elapsed time per iteration (s): 0.82 | learning rate: 1.724E-04 | global batch size: 256 | lm loss: 2.052292E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.961 | TFLOPs: 18.81 | 31: iteration 45740/ 173500 | consumed samples: 11709440 | consumed tokens: 23980933120 | elapsed time per iteration (s): 0.76 | learning rate: 1.724E-04 | global batch size: 256 | lm loss: 2.060477E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.259 | TFLOPs: 20.28 | 31: iteration 45750/ 173500 | consumed samples: 11712000 | consumed tokens: 23986176000 | elapsed time per iteration (s): 0.79 | learning rate: 1.724E-04 | global batch size: 256 | lm loss: 2.086964E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.118 | TFLOPs: 19.55 | 31: iteration 45760/ 173500 | consumed samples: 11714560 | consumed tokens: 23991418880 | elapsed time per iteration (s): 0.80 | learning rate: 1.724E-04 | global batch size: 256 | lm loss: 2.074715E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.650 | TFLOPs: 19.40 | 31: iteration 45770/ 173500 | consumed samples: 11717120 | consumed tokens: 23996661760 | elapsed time per iteration (s): 0.83 | learning rate: 1.724E-04 | global batch size: 256 | lm loss: 2.061989E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.336 | TFLOPs: 
18.59 | 31: iteration 45780/ 173500 | consumed samples: 11719680 | consumed tokens: 24001904640 | elapsed time per iteration (s): 0.79 | learning rate: 1.723E-04 | global batch size: 256 | lm loss: 2.106978E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.370 | TFLOPs: 19.62 | 31: iteration 45790/ 173500 | consumed samples: 11722240 | consumed tokens: 24007147520 | elapsed time per iteration (s): 0.77 | learning rate: 1.723E-04 | global batch size: 256 | lm loss: 2.103156E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.409 | TFLOPs: 20.17 | 31: iteration 45800/ 173500 | consumed samples: 11724800 | consumed tokens: 24012390400 | elapsed time per iteration (s): 0.83 | learning rate: 1.723E-04 | global batch size: 256 | lm loss: 2.093960E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.242 | TFLOPs: 18.59 | 31: iteration 45810/ 173500 | consumed samples: 11727360 | consumed tokens: 24017633280 | elapsed time per iteration (s): 0.81 | learning rate: 1.723E-04 | global batch size: 256 | lm loss: 2.065746E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.167 | TFLOPs: 19.01 | 31: iteration 45820/ 173500 | consumed samples: 11729920 | consumed tokens: 24022876160 | elapsed time per iteration (s): 0.80 | learning rate: 1.723E-04 | global batch size: 256 | lm loss: 2.078920E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.789 | TFLOPs: 19.41 | 31: iteration 45830/ 173500 | consumed samples: 11732480 | consumed tokens: 24028119040 | elapsed time per iteration (s): 0.81 | learning rate: 1.723E-04 | global batch size: 256 | lm loss: 2.092336E+00 | grad norm: 0.127 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.484 | TFLOPs: 19.03 | 31: iteration 45840/ 173500 | consumed samples: 11735040 | consumed tokens: 24033361920 | elapsed time per iteration (s): 0.80 | learning rate: 1.723E-04 | global batch size: 256 | lm loss: 2.064222E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.409 | TFLOPs: 19.26 | 31: iteration 45850/ 173500 | consumed samples: 11737600 | consumed tokens: 24038604800 | elapsed time per iteration (s): 0.82 | learning rate: 1.723E-04 | global batch size: 256 | lm loss: 2.089376E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.125 | TFLOPs: 18.94 | 31: iteration 45860/ 173500 | consumed samples: 11740160 | consumed tokens: 24043847680 | elapsed time per iteration (s): 0.79 | learning rate: 1.722E-04 | global batch size: 256 | lm loss: 2.057469E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.258 | TFLOPs: 19.62 | 31: iteration 45870/ 173500 | consumed samples: 11742720 | consumed tokens: 24049090560 | elapsed time per iteration (s): 0.82 | learning rate: 1.722E-04 | global batch size: 256 | lm loss: 2.078831E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.082 | TFLOPs: 18.94 | 31: iteration 45880/ 173500 | consumed samples: 11745280 | consumed tokens: 24054333440 | elapsed time per iteration (s): 0.83 | learning rate: 1.722E-04 | global batch size: 256 | lm loss: 2.085705E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.381 | TFLOPs: 18.72 | 31: iteration 45890/ 173500 | consumed samples: 11747840 | consumed tokens: 24059576320 | elapsed time per iteration (s): 0.86 | 
learning rate: 1.722E-04 | global batch size: 256 | lm loss: 2.114270E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.307 | TFLOPs: 18.11 | 31: iteration 45900/ 173500 | consumed samples: 11750400 | consumed tokens: 24064819200 | elapsed time per iteration (s): 0.87 | learning rate: 1.722E-04 | global batch size: 256 | lm loss: 2.088006E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.784 | TFLOPs: 17.83 | 31: iteration 45910/ 173500 | consumed samples: 11752960 | consumed tokens: 24070062080 | elapsed time per iteration (s): 0.86 | learning rate: 1.722E-04 | global batch size: 256 | lm loss: 2.091780E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.979 | TFLOPs: 17.91 | 31: iteration 45920/ 173500 | consumed samples: 11755520 | consumed tokens: 24075304960 | elapsed time per iteration (s): 0.81 | learning rate: 1.722E-04 | global batch size: 256 | lm loss: 2.079279E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.806 | TFLOPs: 19.17 | 31: iteration 45930/ 173500 | consumed samples: 11758080 | consumed tokens: 24080547840 | elapsed time per iteration (s): 0.79 | learning rate: 1.722E-04 | global batch size: 256 | lm loss: 2.072770E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.568 | TFLOPs: 19.64 | 31: iteration 45940/ 173500 | consumed samples: 11760640 | consumed tokens: 24085790720 | elapsed time per iteration (s): 0.81 | learning rate: 1.722E-04 | global batch size: 256 | lm loss: 2.083431E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.315 | TFLOPs: 19.14 | 31: iteration 45950/ 
173500 | consumed samples: 11763200 | consumed tokens: 24091033600 | elapsed time per iteration (s): 0.80 | learning rate: 1.721E-04 | global batch size: 256 | lm loss: 2.089630E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.043 | TFLOPs: 19.42 | 31: iteration 45960/ 173500 | consumed samples: 11765760 | consumed tokens: 24096276480 | elapsed time per iteration (s): 0.82 | learning rate: 1.721E-04 | global batch size: 256 | lm loss: 2.088434E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.849 | TFLOPs: 18.99 | 31: iteration 45970/ 173500 | consumed samples: 11768320 | consumed tokens: 24101519360 | elapsed time per iteration (s): 0.77 | learning rate: 1.721E-04 | global batch size: 256 | lm loss: 2.063759E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.810 | TFLOPs: 20.01 | 31: iteration 45980/ 173500 | consumed samples: 11770880 | consumed tokens: 24106762240 | elapsed time per iteration (s): 0.82 | learning rate: 1.721E-04 | global batch size: 256 | lm loss: 2.077370E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.502 | TFLOPs: 18.91 | 31: iteration 45990/ 173500 | consumed samples: 11773440 | consumed tokens: 24112005120 | elapsed time per iteration (s): 0.74 | learning rate: 1.721E-04 | global batch size: 256 | lm loss: 2.072403E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.071 | TFLOPs: 21.00 | 0: [2022-11-26 04:26:13,557] [INFO] [logging.py:68:log_dist] [Rank 0] step=46000, skipped=0, lr=[0.00017208047558447097, 0.00017208047558447097, 0.00017208047558447097], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 46000/ 173500 | consumed samples: 
11776000 | consumed tokens: 24117248000 | elapsed time per iteration (s): 0.77 | learning rate: 1.721E-04 | global batch size: 256 | lm loss: 2.104727E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.127 | TFLOPs: 20.21 | 0: steps: 46000 loss: 2.2104 iter time (s): 0.797 samples/sec: 321.253 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 46000 | lm loss value: 2.078238E+00 | lm loss PPL: 7.990381E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 46000 to checkpoints_1b1long 0: [2022-11-26 04:26:13,918] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step46000 is begin to save! 0: [2022-11-26 04:26:13,928] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_01-model_00-model_states.pt... 0: [2022-11-26 04:26:14,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_01-model_00-model_states.pt. 0: [2022-11-26 04:26:14,157] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_03-model_00-model_states.pt... 0: [2022-11-26 04:26:14,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_03-model_00-model_states.pt. 0: [2022-11-26 04:26:14,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_04-model_00-model_states.pt... 0: [2022-11-26 04:26:14,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_04-model_00-model_states.pt. 0: [2022-11-26 04:26:14,317] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_05-model_00-model_states.pt... 
0: [2022-11-26 04:26:14,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_05-model_00-model_states.pt. 0: [2022-11-26 04:26:14,403] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_06-model_00-model_states.pt... 0: [2022-11-26 04:26:14,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_06-model_00-model_states.pt. 0: [2022-11-26 04:26:14,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_07-model_00-model_states.pt... 0: [2022-11-26 04:26:14,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_07-model_00-model_states.pt. 0: [2022-11-26 04:26:14,561] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_08-model_00-model_states.pt... 0: [2022-11-26 04:26:14,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_08-model_00-model_states.pt. 0: [2022-11-26 04:26:14,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_09-model_00-model_states.pt... 0: [2022-11-26 04:26:14,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_09-model_00-model_states.pt. 0: [2022-11-26 04:26:14,712] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_10-model_00-model_states.pt... 0: [2022-11-26 04:26:14,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_10-model_00-model_states.pt. 0: [2022-11-26 04:26:14,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_11-model_00-model_states.pt... 
0: [2022-11-26 04:26:14,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_11-model_00-model_states.pt. 0: [2022-11-26 04:26:14,872] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_12-model_00-model_states.pt... 0: [2022-11-26 04:26:14,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_12-model_00-model_states.pt. 0: [2022-11-26 04:26:14,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_13-model_00-model_states.pt... 0: [2022-11-26 04:26:15,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_13-model_00-model_states.pt. 0: [2022-11-26 04:26:15,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_14-model_00-model_states.pt... 0: [2022-11-26 04:26:15,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_14-model_00-model_states.pt. 0: [2022-11-26 04:26:15,098] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_15-model_00-model_states.pt... 0: [2022-11-26 04:26:15,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_15-model_00-model_states.pt. 0: [2022-11-26 04:26:15,174] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_16-model_00-model_states.pt... 0: [2022-11-26 04:26:15,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_16-model_00-model_states.pt. 0: [2022-11-26 04:26:15,249] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_17-model_00-model_states.pt... 
0: [2022-11-26 04:26:15,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_17-model_00-model_states.pt. 0: [2022-11-26 04:26:15,327] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_18-model_00-model_states.pt... 0: [2022-11-26 04:26:15,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_18-model_00-model_states.pt. 0: [2022-11-26 04:26:15,402] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_19-model_00-model_states.pt... 0: [2022-11-26 04:26:15,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_19-model_00-model_states.pt. 0: [2022-11-26 04:26:15,479] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_20-model_00-model_states.pt... 0: [2022-11-26 04:26:15,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_20-model_00-model_states.pt. 0: [2022-11-26 04:26:15,552] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_21-model_00-model_states.pt... 0: [2022-11-26 04:26:15,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_21-model_00-model_states.pt. 0: [2022-11-26 04:26:15,627] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_22-model_00-model_states.pt... 0: [2022-11-26 04:26:15,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_22-model_00-model_states.pt. 0: [2022-11-26 04:26:15,704] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_23-model_00-model_states.pt... 
0: [2022-11-26 04:26:15,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_23-model_00-model_states.pt. 0: [2022-11-26 04:26:15,777] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_24-model_00-model_states.pt... 0: [2022-11-26 04:26:15,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_24-model_00-model_states.pt. 0: [2022-11-26 04:26:15,855] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_25-model_00-model_states.pt... 0: [2022-11-26 04:26:15,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_25-model_00-model_states.pt. 0: [2022-11-26 04:26:15,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_26-model_00-model_states.pt... 0: [2022-11-26 04:26:16,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_26-model_00-model_states.pt. 0: [2022-11-26 04:26:16,006] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_27-model_00-model_states.pt... 0: [2022-11-26 04:26:16,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_27-model_00-model_states.pt. 0: [2022-11-26 04:26:16,079] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_28-model_00-model_states.pt... 0: [2022-11-26 04:26:16,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_28-model_00-model_states.pt. 0: [2022-11-26 04:26:16,156] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/layer_30-model_00-model_states.pt... 
0: [2022-11-26 04:26:16,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/layer_30-model_00-model_states.pt. 0: [2022-11-26 04:26:16,159] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step46000/mp_rank_00_model_states.pt 0: [2022-11-26 04:26:16,159] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/mp_rank_00_model_states.pt... 0: [2022-11-26 04:26:16,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/mp_rank_00_model_states.pt. 0: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
11: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
6: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
8: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 
16: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
12: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 
23: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 
14: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 
30: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 
18: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 
5: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
10: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
13: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
15: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
28: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 
22: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 
18: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
7: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 
1: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 
23: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 
30: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
0: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 23: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 
14: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
23: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 23: [2022-11-26 04:26:16,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:26:16,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 04:26:16,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 04:26:16,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 3: [2022-11-26 04:26:16,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 04:26:16,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 04:26:16,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 30: [2022-11-26 04:26:16,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 24: [2022-11-26 04:26:16,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 30: [2022-11-26 04:26:16,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 04:26:16,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 24: [2022-11-26 04:26:16,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 04:26:16,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 12: [2022-11-26 04:26:16,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:26:16,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 04:26:16,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 15: [2022-11-26 04:26:16,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 04:26:16,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 04:26:16,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 8: [2022-11-26 04:26:16,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:26:16,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 04:26:16,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 13: [2022-11-26 04:26:16,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:26:16,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 04:26:16,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 9: [2022-11-26 04:26:16,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:26:16,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:26:16,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 04:26:16,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
29: [2022-11-26 04:26:16,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 04:26:16,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 17: [2022-11-26 04:26:16,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 04:26:16,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 20: [2022-11-26 04:26:16,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 17: [2022-11-26 04:26:16,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 20: [2022-11-26 04:26:16,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 04:26:16,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 7: [2022-11-26 04:26:16,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:26:16,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 04:26:16,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 13: [2022-11-26 04:26:16,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
4: [2022-11-26 04:26:16,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:26:16,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 4: [2022-11-26 04:26:16,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 13: [2022-11-26 04:26:16,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 4: [2022-11-26 04:26:16,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 2: [2022-11-26 04:26:16,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:26:16,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:26:16,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 14: [2022-11-26 04:26:16,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 2: [2022-11-26 04:26:16,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 14: [2022-11-26 04:26:16,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 5: [2022-11-26 04:26:16,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 04:26:16,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 04:26:16,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 5: [2022-11-26 04:26:16,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:26:16,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 04:26:16,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 28: [2022-11-26 04:26:16,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 27: [2022-11-26 04:26:16,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:26:16,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:26:16,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 24: [2022-11-26 04:26:16,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
19: [2022-11-26 04:26:16,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 6: [2022-11-26 04:26:16,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 19: [2022-11-26 04:26:16,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 6: [2022-11-26 04:26:16,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 25: [2022-11-26 04:26:16,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 24: [2022-11-26 04:26:16,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 23: [2022-11-26 04:26:16,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 25: [2022-11-26 04:26:16,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 23: [2022-11-26 04:26:16,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 24: [2022-11-26 04:26:16,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 25: [2022-11-26 04:26:16,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 23: [2022-11-26 04:26:16,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
25: [2022-11-26 04:26:16,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:26:16,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 25: [2022-11-26 04:26:16,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 1: [2022-11-26 04:26:16,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 25: [2022-11-26 04:26:16,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 1: [2022-11-26 04:26:16,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:26:16,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 1: [2022-11-26 04:26:16,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 04:26:16,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 16: [2022-11-26 04:26:16,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 04:26:16,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 7: [2022-11-26 04:26:16,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
16: [2022-11-26 04:26:16,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 7: [2022-11-26 04:26:16,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 04:26:16,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 23: [2022-11-26 04:26:16,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:26:16,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 04:26:16,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 19: [2022-11-26 04:26:16,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:26:16,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 04:26:16,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 27: [2022-11-26 04:26:16,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 04:26:16,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 4: [2022-11-26 04:26:16,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 04:26:16,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 04:26:16,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 28: [2022-11-26 04:26:16,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 04:26:16,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 28: [2022-11-26 04:26:16,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:26:16,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:26:16,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 04:26:16,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 21: [2022-11-26 04:26:16,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 04:26:16,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 04:26:16,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 9: [2022-11-26 04:26:16,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
15: [2022-11-26 04:26:16,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:26:16,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 9: [2022-11-26 04:26:16,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 15: [2022-11-26 04:26:16,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 9: [2022-11-26 04:26:16,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 14: [2022-11-26 04:26:16,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:26:16,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:26:16,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 04:26:16,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 29: [2022-11-26 04:26:16,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 04:26:16,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 3: [2022-11-26 04:26:16,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 04:26:16,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 04:26:16,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 16: [2022-11-26 04:26:16,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 04:26:16,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 04:26:16,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 3: [2022-11-26 04:26:16,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:26:16,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 04:26:16,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 21: [2022-11-26 04:26:16,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 04:26:16,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 04:26:16,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 25: [2022-11-26 04:26:16,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
25: [2022-11-26 04:26:16,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 04:26:16,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 23: [2022-11-26 04:26:16,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:26:16,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 04:26:16,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 14: [2022-11-26 04:26:16,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:26:16,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 04:26:16,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 8: [2022-11-26 04:26:16,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:26:16,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 04:26:16,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 27: [2022-11-26 04:26:16,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
27: [2022-11-26 04:26:16,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 04:26:16,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 2: [2022-11-26 04:26:16,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:26:16,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 04:26:16,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 12: [2022-11-26 04:26:16,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:26:16,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 04:26:16,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 20: [2022-11-26 04:26:16,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 04:26:16,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 24: [2022-11-26 04:26:16,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 20: [2022-11-26 04:26:16,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
24: [2022-11-26 04:26:16,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 22: [2022-11-26 04:26:16,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 04:26:16,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 24: [2022-11-26 04:26:16,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 22: [2022-11-26 04:26:16,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 22: [2022-11-26 04:26:16,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:26:16,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 22: [2022-11-26 04:26:16,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 12: [2022-11-26 04:26:16,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 22: [2022-11-26 04:26:16,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 2: [2022-11-26 04:26:16,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:26:16,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
22: [2022-11-26 04:26:16,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 28: [2022-11-26 04:26:16,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 22: [2022-11-26 04:26:16,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 2: [2022-11-26 04:26:16,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 28: [2022-11-26 04:26:16,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 22: [2022-11-26 04:26:16,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 2: [2022-11-26 04:26:16,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 5: [2022-11-26 04:26:16,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:26:16,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 04:26:16,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 25: [2022-11-26 04:26:16,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 
25: [2022-11-26 04:26:16,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 04:26:16,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 1: [2022-11-26 04:26:16,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 28: [2022-11-26 04:26:16,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:26:16,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 04:26:16,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 19: [2022-11-26 04:26:16,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:26:16,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:26:16,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 6: [2022-11-26 04:26:16,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:26:16,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 19: [2022-11-26 04:26:16,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
6: [2022-11-26 04:26:16,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 04:26:16,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 6: [2022-11-26 04:26:16,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 28: [2022-11-26 04:26:16,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 04:26:16,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 6: [2022-11-26 04:26:16,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 21: [2022-11-26 04:26:16,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:26:16,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 21: [2022-11-26 04:26:16,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 6: [2022-11-26 04:26:16,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 21: [2022-11-26 04:26:16,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 29: [2022-11-26 04:26:16,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-26 04:26:16,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 04:26:16,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 17: [2022-11-26 04:26:16,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 04:26:16,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 04:26:16,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 17: [2022-11-26 04:26:16,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 04:26:16,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 04:26:16,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 16: [2022-11-26 04:26:16,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 17: [2022-11-26 04:26:16,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
16: [2022-11-26 04:26:16,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 17: [2022-11-26 04:26:16,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 16: [2022-11-26 04:26:16,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 17: [2022-11-26 04:26:16,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 16: [2022-11-26 04:26:16,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:26:16,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:26:16,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 16: [2022-11-26 04:26:16,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 12: [2022-11-26 04:26:16,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 16: [2022-11-26 04:26:16,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 5: [2022-11-26 04:26:16,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 04:26:16,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 04:26:16,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 19: [2022-11-26 04:26:16,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:26:16,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 04:26:16,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 18: [2022-11-26 04:26:16,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:26:16,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 04:26:16,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 20: [2022-11-26 04:26:16,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:26:16,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:26:16,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
18: [2022-11-26 04:26:16,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 11: [2022-11-26 04:26:16,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 18: [2022-11-26 04:26:16,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 20: [2022-11-26 04:26:16,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 30: [2022-11-26 04:26:16,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 20: [2022-11-26 04:26:16,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 11: [2022-11-26 04:26:16,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 30: [2022-11-26 04:26:16,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 2: [2022-11-26 04:26:16,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 30: [2022-11-26 04:26:16,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 2: [2022-11-26 04:26:16,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 04:26:16,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
10: [2022-11-26 04:26:16,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:26:16,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 04:26:16,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 24: [2022-11-26 04:26:16,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 04:26:16,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 04:26:16,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 30: [2022-11-26 04:26:16,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:26:16,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:26:16,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 04:26:16,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 30: [2022-11-26 04:26:16,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 04:26:16,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
9: [2022-11-26 04:26:16,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:26:16,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:26:16,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 04:26:16,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 04:26:16,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 9: [2022-11-26 04:26:16,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 0: [2022-11-26 04:26:16,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:26:16,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:26:16,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 04:26:16,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 0: [2022-11-26 04:26:16,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 04:26:16,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
14: [2022-11-26 04:26:16,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:26:16,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 04:26:16,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 27: [2022-11-26 04:26:16,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 04:26:16,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 04:26:16,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 7: [2022-11-26 04:26:16,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:26:16,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:26:16,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 13: [2022-11-26 04:26:16,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 26: [2022-11-26 04:26:16,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-26 04:26:16,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:26:16,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:26:16,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 26: [2022-11-26 04:26:16,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 04:26:16,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:26:16,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 26: [2022-11-26 04:26:16,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 04:26:16,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 7: [2022-11-26 04:26:16,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 26: [2022-11-26 04:26:16,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 26: [2022-11-26 04:26:16,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
26: [2022-11-26 04:26:16,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 04:26:16,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 7: [2022-11-26 04:26:16,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 26: [2022-11-26 04:26:16,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 26: [2022-11-26 04:26:16,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 10: [2022-11-26 04:26:16,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:26:16,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 04:26:16,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 13: [2022-11-26 04:26:16,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 30: [2022-11-26 04:26:16,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
13: [2022-11-26 04:26:16,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 30: [2022-11-26 04:26:16,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 13: [2022-11-26 04:26:16,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 30: [2022-11-26 04:26:16,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 11: [2022-11-26 04:26:16,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:26:16,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:26:16,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 3: [2022-11-26 04:26:16,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:26:16,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 3: [2022-11-26 04:26:16,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 11: [2022-11-26 04:26:16,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 11: [2022-11-26 04:26:16,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
3: [2022-11-26 04:26:16,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 21: [2022-11-26 04:26:16,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 04:26:16,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 04:26:16,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 28: [2022-11-26 04:26:16,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:26:16,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:26:16,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:26:16,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 04:26:16,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 15: [2022-11-26 04:26:16,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 04:26:16,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
28: [2022-11-26 04:26:16,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 04:26:16,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 29: [2022-11-26 04:26:16,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:26:16,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 04:26:16,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 1: [2022-11-26 04:26:16,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:26:16,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 04:26:16,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 8: [2022-11-26 04:26:16,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:26:16,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:26:16,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 04:26:16,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
8: [2022-11-26 04:26:16,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 04:26:16,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 5: [2022-11-26 04:26:16,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:26:16,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 04:26:16,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 10: [2022-11-26 04:26:16,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:26:16,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 04:26:16,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 4: [2022-11-26 04:26:16,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:26:16,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 04:26:16,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 04:26:16,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 04:26:16,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 4: [2022-11-26 04:26:16,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 6: [2022-11-26 04:26:16,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 17: [2022-11-26 04:26:16,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:26:16,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 04:26:16,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 17: [2022-11-26 04:26:16,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 31: [2022-11-26 04:26:16,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 04:26:16,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 04:26:16,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
31: [2022-11-26 04:26:16,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 17: [2022-11-26 04:26:16,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 31: [2022-11-26 04:26:16,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 04:26:16,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 04:26:16,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 04:26:16,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 31: [2022-11-26 04:26:16,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 04:26:16,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 31: [2022-11-26 04:26:16,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 31: [2022-11-26 04:26:16,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 18: [2022-11-26 04:26:16,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
18: [2022-11-26 04:26:16,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 04:26:16,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 0: [2022-11-26 04:26:16,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:26:16,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 04:26:16,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 27: [2022-11-26 04:26:16,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 04:26:16,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 04:26:16,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 9: [2022-11-26 04:26:16,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:26:16,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 04:26:16,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 26: [2022-11-26 04:26:16,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
1: [2022-11-26 04:26:16,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:26:16,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 26: [2022-11-26 04:26:16,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 1: [2022-11-26 04:26:16,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 26: [2022-11-26 04:26:16,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 1: [2022-11-26 04:26:16,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 22: [2022-11-26 04:26:16,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 04:26:16,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 04:26:16,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 18: [2022-11-26 04:26:16,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:26:16,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
18: [2022-11-26 04:26:16,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 04:26:16,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 04:26:16,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 18: [2022-11-26 04:26:16,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 13: [2022-11-26 04:26:16,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 25: [2022-11-26 04:26:16,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:26:16,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 04:26:16,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 25: [2022-11-26 04:26:16,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 04:26:16,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 16: [2022-11-26 04:26:16,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
16: [2022-11-26 04:26:16,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 04:26:16,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 20: [2022-11-26 04:26:16,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 04:26:16,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 04:26:16,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 0: [2022-11-26 04:26:16,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 04:26:16,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 27: [2022-11-26 04:26:16,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 04:26:16,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 04:26:16,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 3: [2022-11-26 04:26:16,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 04:26:16,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 04:26:16,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 12: [2022-11-26 04:26:16,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:26:16,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 04:26:16,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 24: [2022-11-26 04:26:16,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 04:26:16,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 04:26:16,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 15: [2022-11-26 04:26:16,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:26:16,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 04:26:16,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 22: [2022-11-26 04:26:16,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
22: [2022-11-26 04:26:16,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 04:26:16,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 30: [2022-11-26 04:26:16,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 04:26:16,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 04:26:16,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 31: [2022-11-26 04:26:16,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 04:26:16,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 04:26:16,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 19: [2022-11-26 04:26:16,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:26:16,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 04:26:16,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 4: [2022-11-26 04:26:16,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 04:26:16,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 04:26:16,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 28: [2022-11-26 04:26:16,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 04:26:16,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 04:26:16,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 23: [2022-11-26 04:26:16,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:26:16,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 04:26:16,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 8: [2022-11-26 04:26:16,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:26:16,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 04:26:16,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 21: [2022-11-26 04:26:16,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-26 04:26:16,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 04:26:16,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 0: [2022-11-26 04:26:16,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:26:16,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 04:26:16,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 2: [2022-11-26 04:26:16,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:26:16,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 04:26:16,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 11: [2022-11-26 04:26:16,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:26:16,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 04:26:16,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 29: [2022-11-26 04:26:16,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
10: [2022-11-26 04:26:16,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:26:16,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 04:26:16,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 29: [2022-11-26 04:26:16,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 04:26:16,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 14: [2022-11-26 04:26:16,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:26:16,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 04:26:16,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 17: [2022-11-26 04:26:16,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 04:26:16,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 04:26:16,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 7: [2022-11-26 04:26:16,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-26 04:26:16,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 04:26:16,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 26: [2022-11-26 04:26:16,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 04:26:16,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 04:26:16,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 5: [2022-11-26 04:26:16,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:26:16,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 04:26:16,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 20: [2022-11-26 04:26:16,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:26:16,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 20: [2022-11-26 04:26:16,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 04:26:16,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
1: [2022-11-26 04:26:16,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 04:26:16,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 25: [2022-11-26 04:26:16,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:26:16,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:26:16,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 25: [2022-11-26 04:26:16,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 9: [2022-11-26 04:26:16,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 25: [2022-11-26 04:26:16,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 24: [2022-11-26 04:26:16,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 16: [2022-11-26 04:26:16,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 24: [2022-11-26 04:26:16,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 04:26:16,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
16: [2022-11-26 04:26:16,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 13: [2022-11-26 04:26:16,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:26:16,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 16: [2022-11-26 04:26:16,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 13: [2022-11-26 04:26:16,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 3: [2022-11-26 04:26:16,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:26:16,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 04:26:16,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 15: [2022-11-26 04:26:16,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:26:16,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 04:26:16,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 6: [2022-11-26 04:26:16,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 04:26:16,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 27: [2022-11-26 04:26:16,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:26:16,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 27: [2022-11-26 04:26:16,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 04:26:16,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 12: [2022-11-26 04:26:16,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:26:16,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 04:26:16,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 22: [2022-11-26 04:26:16,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 04:26:16,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 04:26:16,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 31: [2022-11-26 04:26:16,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 
31: [2022-11-26 04:26:16,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 04:26:16,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 4: [2022-11-26 04:26:16,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:26:16,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 04:26:16,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 30: [2022-11-26 04:26:16,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 04:26:16,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 0: [2022-11-26 04:26:16,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 30: [2022-11-26 04:26:16,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 0: [2022-11-26 04:26:16,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 04:26:16,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 23: [2022-11-26 04:26:16,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
23: [2022-11-26 04:26:16,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 04:26:16,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 21: [2022-11-26 04:26:16,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 04:26:16,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 04:26:16,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 8: [2022-11-26 04:26:16,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:26:16,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 04:26:16,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 10: [2022-11-26 04:26:16,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:26:16,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 04:26:16,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 28: [2022-11-26 04:26:16,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
14: [2022-11-26 04:26:16,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:26:16,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:26:16,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 14: [2022-11-26 04:26:16,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 04:26:16,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 19: [2022-11-26 04:26:16,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 28: [2022-11-26 04:26:16,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 04:26:16,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 29: [2022-11-26 04:26:16,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:26:16,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 04:26:16,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 11: [2022-11-26 04:26:16,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 04:26:16,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 04:26:16,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 26: [2022-11-26 04:26:16,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 04:26:16,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 04:26:16,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 17: [2022-11-26 04:26:16,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 04:26:16,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 04:26:16,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 2: [2022-11-26 04:26:16,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:26:16,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 04:26:16,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 18: [2022-11-26 04:26:16,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
18: [2022-11-26 04:26:16,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 04:26:16,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 18: [2022-11-26 04:26:16,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:26:16,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 04:26:16,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 20: [2022-11-26 04:26:16,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 04:26:16,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 04:26:16,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 1: [2022-11-26 04:26:16,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:26:16,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 04:26:16,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 7: [2022-11-26 04:26:16,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 04:26:16,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 04:26:16,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 6: [2022-11-26 04:26:16,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:26:16,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 04:26:16,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 9: [2022-11-26 04:26:16,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:26:16,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 04:26:16,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 16: [2022-11-26 04:26:16,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 04:26:16,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 04:26:16,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 25: [2022-11-26 04:26:16,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
25: [2022-11-26 04:26:16,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 04:26:16,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 24: [2022-11-26 04:26:16,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 04:26:16,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 04:26:16,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 15: [2022-11-26 04:26:16,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:26:16,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 04:26:16,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 12: [2022-11-26 04:26:16,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:26:16,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 04:26:16,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 13: [2022-11-26 04:26:16,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-26 04:26:16,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 04:26:16,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 27: [2022-11-26 04:26:16,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 04:26:16,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 04:26:16,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 19: [2022-11-26 04:26:16,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 31: [2022-11-26 04:26:16,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:26:16,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 04:26:16,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 31: [2022-11-26 04:26:16,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 22: [2022-11-26 04:26:16,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
22: [2022-11-26 04:26:16,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 31: [2022-11-26 04:26:16,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 22: [2022-11-26 04:26:16,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 30: [2022-11-26 04:26:16,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 04:26:16,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 04:26:16,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 23: [2022-11-26 04:26:16,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:26:16,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 04:26:16,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 21: [2022-11-26 04:26:16,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 04:26:16,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 04:26:16,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
8: [2022-11-26 04:26:16,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:26:16,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 04:26:16,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 11: [2022-11-26 04:26:16,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:26:16,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 04:26:16,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 29: [2022-11-26 04:26:16,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:26:16,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 04:26:16,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 5: [2022-11-26 04:26:16,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:26:16,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 04:26:16,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 04:26:16,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 5: [2022-11-26 04:26:16,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 04:26:16,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 14: [2022-11-26 04:26:16,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:26:16,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 04:26:16,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 28: [2022-11-26 04:26:16,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:26:16,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 28: [2022-11-26 04:26:16,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 04:26:16,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
0: [2022-11-26 04:26:16,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 04:26:16,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 10: [2022-11-26 04:26:16,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:26:16,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 04:26:16,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 17: [2022-11-26 04:26:16,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 04:26:16,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 04:26:16,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 12: [2022-11-26 04:26:16,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:26:16,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 04:26:16,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 22: [2022-11-26 04:26:16,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
9: [2022-11-26 04:26:16,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 22: [2022-11-26 04:26:16,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 5: [2022-11-26 04:26:16,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:26:16,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 22: [2022-11-26 04:26:16,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 9: [2022-11-26 04:26:16,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 5: [2022-11-26 04:26:16,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 04:26:16,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 2: [2022-11-26 04:26:16,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:26:16,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 04:26:16,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 27: [2022-11-26 04:26:16,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
20: [2022-11-26 04:26:16,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 04:26:16,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 04:26:16,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 1: [2022-11-26 04:26:16,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 27: [2022-11-26 04:26:16,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 04:26:16,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 8: [2022-11-26 04:26:16,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:26:16,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 3: [2022-11-26 04:26:16,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:26:16,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 1: [2022-11-26 04:26:16,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 8: [2022-11-26 04:26:16,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
3: [2022-11-26 04:26:16,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 04:26:16,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 26: [2022-11-26 04:26:16,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 04:26:16,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 04:26:16,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 13: [2022-11-26 04:26:16,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:26:16,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 04:26:16,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 15: [2022-11-26 04:26:16,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:26:16,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 04:26:16,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 31: [2022-11-26 04:26:16,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-26 04:26:16,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 04:26:16,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 6: [2022-11-26 04:26:16,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:26:16,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 04:26:16,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 24: [2022-11-26 04:26:16,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 04:26:16,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 04:26:16,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 7: [2022-11-26 04:26:16,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:26:16,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 04:26:16,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 21: [2022-11-26 04:26:16,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
21: [2022-11-26 04:26:16,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 04:26:16,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 3: [2022-11-26 04:26:16,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:26:16,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 04:26:16,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 19: [2022-11-26 04:26:16,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:26:16,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 04:26:16,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 30: [2022-11-26 04:26:16,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 04:26:16,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 18: [2022-11-26 04:26:16,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 30: [2022-11-26 04:26:16,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
18: [2022-11-26 04:26:16,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 16: [2022-11-26 04:26:16,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:26:16,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 4: [2022-11-26 04:26:16,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 16: [2022-11-26 04:26:16,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 04:26:16,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 4: [2022-11-26 04:26:16,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 04:26:16,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 25: [2022-11-26 04:26:16,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:26:16,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
23: [2022-11-26 04:26:16,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 25: [2022-11-26 04:26:16,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 23: [2022-11-26 04:26:16,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 25: [2022-11-26 04:26:16,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 2: [2022-11-26 04:26:16,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:26:16,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 04:26:16,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 14: [2022-11-26 04:26:16,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:26:16,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 04:26:16,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 28: [2022-11-26 04:26:16,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-26 04:26:16,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 04:26:16,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 29: [2022-11-26 04:26:16,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:26:16,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 04:26:16,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 11: [2022-11-26 04:26:16,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:26:16,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:26:16,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 04:26:16,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 04:26:16,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 11: [2022-11-26 04:26:16,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 7: [2022-11-26 04:26:16,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 04:26:16,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 04:26:16,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 10: [2022-11-26 04:26:16,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:26:16,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:26:16,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 04:26:16,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step46000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 04:26:16,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 10: [2022-11-26 04:26:16,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
0: successfully saved checkpoint at iteration 46000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2554.94 31: iteration 46010/ 173500 | consumed samples: 11778560 | consumed tokens: 24122490880 | elapsed time per iteration (s): 1.03 | learning rate: 1.721E-04 | global batch size: 256 | lm loss: 2.052642E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.498 | TFLOPs: 15.09 | 31: iteration 46020/ 173500 | consumed samples: 11781120 | consumed tokens: 24127733760 | elapsed time per iteration (s): 0.78 | learning rate: 1.721E-04 | global batch size: 256 | lm loss: 2.077073E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.277 | TFLOPs: 19.74 | 31: iteration 46030/ 173500 | consumed samples: 11783680 | consumed tokens: 24132976640 | elapsed time per iteration (s): 0.76 | learning rate: 1.720E-04 | global batch size: 256 | lm loss: 2.065495E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.717 | TFLOPs: 20.31 | 31: iteration 46040/ 173500 | consumed samples: 11786240 | consumed tokens: 24138219520 | elapsed time per iteration (s): 0.81 | learning rate: 1.720E-04 | global batch size: 256 | lm loss: 2.093433E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.564 | TFLOPs: 19.15 | 31: iteration 46050/ 173500 | consumed samples: 11788800 | consumed tokens: 24143462400 | elapsed time per iteration (s): 0.75 | learning rate: 1.720E-04 | global batch size: 256 | lm loss: 2.075723E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.855 | TFLOPs: 20.74 | 31: iteration 46060/ 173500 | consumed samples: 11791360 | consumed tokens: 24148705280 | elapsed time per iteration (s): 0.80 | 
learning rate: 1.720E-04 | global batch size: 256 | lm loss: 2.092277E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.741 | TFLOPs: 19.46 | 31: iteration 46070/ 173500 | consumed samples: 11793920 | consumed tokens: 24153948160 | elapsed time per iteration (s): 0.77 | learning rate: 1.720E-04 | global batch size: 256 | lm loss: 2.047559E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.750 | TFLOPs: 20.19 | 31: iteration 46080/ 173500 | consumed samples: 11796480 | consumed tokens: 24159191040 | elapsed time per iteration (s): 0.77 | learning rate: 1.720E-04 | global batch size: 256 | lm loss: 2.072609E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.308 | TFLOPs: 20.10 | 31: iteration 46090/ 173500 | consumed samples: 11799040 | consumed tokens: 24164433920 | elapsed time per iteration (s): 0.75 | learning rate: 1.720E-04 | global batch size: 256 | lm loss: 2.073486E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.997 | TFLOPs: 20.75 | 31: iteration 46100/ 173500 | consumed samples: 11801600 | consumed tokens: 24169676800 | elapsed time per iteration (s): 0.76 | learning rate: 1.720E-04 | global batch size: 256 | lm loss: 2.077959E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.958 | TFLOPs: 20.26 | 31: iteration 46110/ 173500 | consumed samples: 11804160 | consumed tokens: 24174919680 | elapsed time per iteration (s): 0.75 | learning rate: 1.719E-04 | global batch size: 256 | lm loss: 2.073131E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.506 | TFLOPs: 20.60 | 31: iteration 46120/ 
173500 | consumed samples: 11806720 | consumed tokens: 24180162560 | elapsed time per iteration (s): 0.77 | learning rate: 1.719E-04 | global batch size: 256 | lm loss: 2.091561E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.439 | TFLOPs: 20.05 | 31: iteration 46130/ 173500 | consumed samples: 11809280 | consumed tokens: 24185405440 | elapsed time per iteration (s): 0.75 | learning rate: 1.719E-04 | global batch size: 256 | lm loss: 2.091126E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.624 | TFLOPs: 20.61 | 31: iteration 46140/ 173500 | consumed samples: 11811840 | consumed tokens: 24190648320 | elapsed time per iteration (s): 0.77 | learning rate: 1.719E-04 | global batch size: 256 | lm loss: 2.080836E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.455 | TFLOPs: 20.11 | 31: iteration 46150/ 173500 | consumed samples: 11814400 | consumed tokens: 24195891200 | elapsed time per iteration (s): 0.74 | learning rate: 1.719E-04 | global batch size: 256 | lm loss: 2.072811E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.427 | TFLOPs: 21.02 | 31: iteration 46160/ 173500 | consumed samples: 11816960 | consumed tokens: 24201134080 | elapsed time per iteration (s): 0.77 | learning rate: 1.719E-04 | global batch size: 256 | lm loss: 2.073814E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.400 | TFLOPs: 20.05 | 31: iteration 46170/ 173500 | consumed samples: 11819520 | consumed tokens: 24206376960 | elapsed time per iteration (s): 0.71 | learning rate: 1.719E-04 | global batch size: 256 | lm loss: 2.072047E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 361.830 | TFLOPs: 21.89 | 31: iteration 46180/ 173500 | consumed samples: 11822080 | consumed tokens: 24211619840 | elapsed time per iteration (s): 0.81 | learning rate: 1.719E-04 | global batch size: 256 | lm loss: 2.056610E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.789 | TFLOPs: 19.23 | 31: iteration 46190/ 173500 | consumed samples: 11824640 | consumed tokens: 24216862720 | elapsed time per iteration (s): 0.75 | learning rate: 1.719E-04 | global batch size: 256 | lm loss: 2.074564E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.989 | TFLOPs: 20.75 | 31: iteration 46200/ 173500 | consumed samples: 11827200 | consumed tokens: 24222105600 | elapsed time per iteration (s): 0.79 | learning rate: 1.718E-04 | global batch size: 256 | lm loss: 2.068099E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.022 | TFLOPs: 19.60 | 31: iteration 46210/ 173500 | consumed samples: 11829760 | consumed tokens: 24227348480 | elapsed time per iteration (s): 0.74 | learning rate: 1.718E-04 | global batch size: 256 | lm loss: 2.124014E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.876 | TFLOPs: 20.86 | 31: iteration 46220/ 173500 | consumed samples: 11832320 | consumed tokens: 24232591360 | elapsed time per iteration (s): 0.79 | learning rate: 1.718E-04 | global batch size: 256 | lm loss: 2.065410E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.517 | TFLOPs: 19.63 | 31: iteration 46230/ 173500 | consumed samples: 11834880 | consumed tokens: 24237834240 | elapsed time per iteration (s): 0.81 | learning rate: 
1.718E-04 | global batch size: 256 | lm loss: 2.059820E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.706 | TFLOPs: 19.22 | 31: iteration 46240/ 173500 | consumed samples: 11837440 | consumed tokens: 24243077120 | elapsed time per iteration (s): 0.78 | learning rate: 1.718E-04 | global batch size: 256 | lm loss: 2.096618E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.966 | TFLOPs: 19.84 | 31: iteration 46250/ 173500 | consumed samples: 11840000 | consumed tokens: 24248320000 | elapsed time per iteration (s): 0.76 | learning rate: 1.718E-04 | global batch size: 256 | lm loss: 2.059435E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.879 | TFLOPs: 20.38 | 31: iteration 46260/ 173500 | consumed samples: 11842560 | consumed tokens: 24253562880 | elapsed time per iteration (s): 0.82 | learning rate: 1.718E-04 | global batch size: 256 | lm loss: 2.073986E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.462 | TFLOPs: 18.90 | 31: iteration 46270/ 173500 | consumed samples: 11845120 | consumed tokens: 24258805760 | elapsed time per iteration (s): 0.75 | learning rate: 1.718E-04 | global batch size: 256 | lm loss: 2.053686E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.144 | TFLOPs: 20.52 | 31: iteration 46280/ 173500 | consumed samples: 11847680 | consumed tokens: 24264048640 | elapsed time per iteration (s): 0.82 | learning rate: 1.717E-04 | global batch size: 256 | lm loss: 2.067036E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.915 | TFLOPs: 18.81 | 31: iteration 46290/ 173500 | 
consumed samples: 11850240 | consumed tokens: 24269291520 | elapsed time per iteration (s): 0.78 | learning rate: 1.717E-04 | global batch size: 256 | lm loss: 2.089081E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.595 | TFLOPs: 19.82 | 31: iteration 46300/ 173500 | consumed samples: 11852800 | consumed tokens: 24274534400 | elapsed time per iteration (s): 0.85 | learning rate: 1.717E-04 | global batch size: 256 | lm loss: 2.023246E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.047 | TFLOPs: 18.27 | 31: iteration 46310/ 173500 | consumed samples: 11855360 | consumed tokens: 24279777280 | elapsed time per iteration (s): 0.75 | learning rate: 1.717E-04 | global batch size: 256 | lm loss: 2.047385E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.557 | TFLOPs: 20.54 | 31: iteration 46320/ 173500 | consumed samples: 11857920 | consumed tokens: 24285020160 | elapsed time per iteration (s): 0.79 | learning rate: 1.717E-04 | global batch size: 256 | lm loss: 2.069247E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.029 | TFLOPs: 19.48 | 31: iteration 46330/ 173500 | consumed samples: 11860480 | consumed tokens: 24290263040 | elapsed time per iteration (s): 0.82 | learning rate: 1.717E-04 | global batch size: 256 | lm loss: 2.072478E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.209 | TFLOPs: 18.83 | 31: iteration 46340/ 173500 | consumed samples: 11863040 | consumed tokens: 24295505920 | elapsed time per iteration (s): 0.74 | learning rate: 1.717E-04 | global batch size: 256 | lm loss: 2.078579E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 345.376 | TFLOPs: 20.89 | 31: iteration 46350/ 173500 | consumed samples: 11865600 | consumed tokens: 24300748800 | elapsed time per iteration (s): 0.73 | learning rate: 1.717E-04 | global batch size: 256 | lm loss: 2.080264E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.058 | TFLOPs: 21.12 | 31: iteration 46360/ 173500 | consumed samples: 11868160 | consumed tokens: 24305991680 | elapsed time per iteration (s): 0.76 | learning rate: 1.717E-04 | global batch size: 256 | lm loss: 2.091556E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.278 | TFLOPs: 20.28 | 31: iteration 46370/ 173500 | consumed samples: 11870720 | consumed tokens: 24311234560 | elapsed time per iteration (s): 0.76 | learning rate: 1.716E-04 | global batch size: 256 | lm loss: 2.075500E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.903 | TFLOPs: 20.32 | 31: iteration 46380/ 173500 | consumed samples: 11873280 | consumed tokens: 24316477440 | elapsed time per iteration (s): 0.78 | learning rate: 1.716E-04 | global batch size: 256 | lm loss: 2.061753E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.371 | TFLOPs: 19.87 | 31: iteration 46390/ 173500 | consumed samples: 11875840 | consumed tokens: 24321720320 | elapsed time per iteration (s): 0.77 | learning rate: 1.716E-04 | global batch size: 256 | lm loss: 2.070646E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.955 | TFLOPs: 20.08 | 31: iteration 46400/ 173500 | consumed samples: 11878400 | consumed tokens: 24326963200 | elapsed time per iteration (s): 0.79 | learning rate: 1.716E-04 | global batch 
size: 256 | lm loss: 2.100606E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.623 | TFLOPs: 19.70 | 31: iteration 46410/ 173500 | consumed samples: 11880960 | consumed tokens: 24332206080 | elapsed time per iteration (s): 0.76 | learning rate: 1.716E-04 | global batch size: 256 | lm loss: 2.091760E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.375 | TFLOPs: 20.41 | 31: iteration 46420/ 173500 | consumed samples: 11883520 | consumed tokens: 24337448960 | elapsed time per iteration (s): 0.80 | learning rate: 1.716E-04 | global batch size: 256 | lm loss: 2.064604E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.583 | TFLOPs: 19.33 | 31: iteration 46430/ 173500 | consumed samples: 11886080 | consumed tokens: 24342691840 | elapsed time per iteration (s): 0.83 | learning rate: 1.716E-04 | global batch size: 256 | lm loss: 2.101983E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.892 | TFLOPs: 18.75 | 31: iteration 46440/ 173500 | consumed samples: 11888640 | consumed tokens: 24347934720 | elapsed time per iteration (s): 0.76 | learning rate: 1.716E-04 | global batch size: 256 | lm loss: 2.084208E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.759 | TFLOPs: 20.49 | 31: iteration 46450/ 173500 | consumed samples: 11891200 | consumed tokens: 24353177600 | elapsed time per iteration (s): 0.81 | learning rate: 1.715E-04 | global batch size: 256 | lm loss: 2.074686E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.138 | TFLOPs: 19.00 | 31: iteration 46460/ 173500 | consumed samples: 11893760 | 
consumed tokens: 24358420480 | elapsed time per iteration (s): 0.80 | learning rate: 1.715E-04 | global batch size: 256 | lm loss: 2.096051E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.264 | TFLOPs: 19.44 | 31: iteration 46470/ 173500 | consumed samples: 11896320 | consumed tokens: 24363663360 | elapsed time per iteration (s): 0.76 | learning rate: 1.715E-04 | global batch size: 256 | lm loss: 2.112925E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.773 | TFLOPs: 20.43 | 31: iteration 46480/ 173500 | consumed samples: 11898880 | consumed tokens: 24368906240 | elapsed time per iteration (s): 0.79 | learning rate: 1.715E-04 | global batch size: 256 | lm loss: 2.090821E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.803 | TFLOPs: 19.59 | 31: iteration 46490/ 173500 | consumed samples: 11901440 | consumed tokens: 24374149120 | elapsed time per iteration (s): 0.76 | learning rate: 1.715E-04 | global batch size: 256 | lm loss: 2.094641E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.638 | TFLOPs: 20.31 | 31: iteration 46500/ 173500 | consumed samples: 11904000 | consumed tokens: 24379392000 | elapsed time per iteration (s): 0.82 | learning rate: 1.715E-04 | global batch size: 256 | lm loss: 2.080299E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.443 | TFLOPs: 18.96 | 31: iteration 46510/ 173500 | consumed samples: 11906560 | consumed tokens: 24384634880 | elapsed time per iteration (s): 0.77 | learning rate: 1.715E-04 | global batch size: 256 | lm loss: 2.073997E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 333.108 | TFLOPs: 20.15 | 31: iteration 46520/ 173500 | consumed samples: 11909120 | consumed tokens: 24389877760 | elapsed time per iteration (s): 0.77 | learning rate: 1.715E-04 | global batch size: 256 | lm loss: 2.102389E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.268 | TFLOPs: 20.16 | 31: iteration 46530/ 173500 | consumed samples: 11911680 | consumed tokens: 24395120640 | elapsed time per iteration (s): 0.80 | learning rate: 1.714E-04 | global batch size: 256 | lm loss: 2.089326E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.988 | TFLOPs: 19.36 | 31: iteration 46540/ 173500 | consumed samples: 11914240 | consumed tokens: 24400363520 | elapsed time per iteration (s): 0.77 | learning rate: 1.714E-04 | global batch size: 256 | lm loss: 2.101291E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.290 | TFLOPs: 20.22 | 31: iteration 46550/ 173500 | consumed samples: 11916800 | consumed tokens: 24405606400 | elapsed time per iteration (s): 0.80 | learning rate: 1.714E-04 | global batch size: 256 | lm loss: 2.078093E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.333 | TFLOPs: 19.26 | 31: iteration 46560/ 173500 | consumed samples: 11919360 | consumed tokens: 24410849280 | elapsed time per iteration (s): 0.84 | learning rate: 1.714E-04 | global batch size: 256 | lm loss: 2.065892E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.972 | TFLOPs: 18.39 | 31: iteration 46570/ 173500 | consumed samples: 11921920 | consumed tokens: 24416092160 | elapsed time per iteration (s): 0.81 | learning rate: 1.714E-04 | global batch size: 256 | lm loss: 
2.077363E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.187 | TFLOPs: 19.07 | 31: iteration 46580/ 173500 | consumed samples: 11924480 | consumed tokens: 24421335040 | elapsed time per iteration (s): 0.80 | learning rate: 1.714E-04 | global batch size: 256 | lm loss: 2.093334E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.324 | TFLOPs: 19.38 | 31: iteration 46590/ 173500 | consumed samples: 11927040 | consumed tokens: 24426577920 | elapsed time per iteration (s): 0.76 | learning rate: 1.714E-04 | global batch size: 256 | lm loss: 2.077365E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.900 | TFLOPs: 20.26 | 31: iteration 46600/ 173500 | consumed samples: 11929600 | consumed tokens: 24431820800 | elapsed time per iteration (s): 0.77 | learning rate: 1.714E-04 | global batch size: 256 | lm loss: 2.074328E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.910 | TFLOPs: 20.02 | 31: iteration 46610/ 173500 | consumed samples: 11932160 | consumed tokens: 24437063680 | elapsed time per iteration (s): 0.81 | learning rate: 1.713E-04 | global batch size: 256 | lm loss: 2.063116E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.755 | TFLOPs: 19.04 | 31: iteration 46620/ 173500 | consumed samples: 11934720 | consumed tokens: 24442306560 | elapsed time per iteration (s): 0.78 | learning rate: 1.713E-04 | global batch size: 256 | lm loss: 2.079162E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.525 | TFLOPs: 19.75 | 31: iteration 46630/ 173500 | consumed samples: 11937280 | consumed tokens: 
24447549440 | elapsed time per iteration (s): 0.78 | learning rate: 1.713E-04 | global batch size: 256 | lm loss: 2.083297E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.647 | TFLOPs: 19.82 | 31: iteration 46640/ 173500 | consumed samples: 11939840 | consumed tokens: 24452792320 | elapsed time per iteration (s): 0.78 | learning rate: 1.713E-04 | global batch size: 256 | lm loss: 2.087841E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.557 | TFLOPs: 19.88 | 31: iteration 46650/ 173500 | consumed samples: 11942400 | consumed tokens: 24458035200 | elapsed time per iteration (s): 0.76 | learning rate: 1.713E-04 | global batch size: 256 | lm loss: 2.052500E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.908 | TFLOPs: 20.38 | 31: iteration 46660/ 173500 | consumed samples: 11944960 | consumed tokens: 24463278080 | elapsed time per iteration (s): 0.77 | learning rate: 1.713E-04 | global batch size: 256 | lm loss: 2.068653E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.773 | TFLOPs: 20.19 | 31: iteration 46670/ 173500 | consumed samples: 11947520 | consumed tokens: 24468520960 | elapsed time per iteration (s): 0.81 | learning rate: 1.713E-04 | global batch size: 256 | lm loss: 2.044679E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.575 | TFLOPs: 19.15 | 31: iteration 46680/ 173500 | consumed samples: 11950080 | consumed tokens: 24473763840 | elapsed time per iteration (s): 0.77 | learning rate: 1.713E-04 | global batch size: 256 | lm loss: 2.071550E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 331.294 | TFLOPs: 20.04 | 31: iteration 46690/ 173500 | consumed samples: 11952640 | consumed tokens: 24479006720 | elapsed time per iteration (s): 0.75 | learning rate: 1.713E-04 | global batch size: 256 | lm loss: 2.096793E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.526 | TFLOPs: 20.78 | 31: iteration 46700/ 173500 | consumed samples: 11955200 | consumed tokens: 24484249600 | elapsed time per iteration (s): 0.79 | learning rate: 1.712E-04 | global batch size: 256 | lm loss: 2.060808E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.798 | TFLOPs: 19.71 | 31: iteration 46710/ 173500 | consumed samples: 11957760 | consumed tokens: 24489492480 | elapsed time per iteration (s): 0.79 | learning rate: 1.712E-04 | global batch size: 256 | lm loss: 2.063316E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.201 | TFLOPs: 19.67 | 31: iteration 46720/ 173500 | consumed samples: 11960320 | consumed tokens: 24494735360 | elapsed time per iteration (s): 0.83 | learning rate: 1.712E-04 | global batch size: 256 | lm loss: 2.045094E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.619 | TFLOPs: 18.73 | 31: iteration 46730/ 173500 | consumed samples: 11962880 | consumed tokens: 24499978240 | elapsed time per iteration (s): 0.77 | learning rate: 1.712E-04 | global batch size: 256 | lm loss: 2.088884E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.733 | TFLOPs: 20.13 | 31: iteration 46740/ 173500 | consumed samples: 11965440 | consumed tokens: 24505221120 | elapsed time per iteration (s): 0.82 | learning rate: 1.712E-04 | global batch size: 256 | lm loss: 2.066724E+00 | grad 
norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.244 | TFLOPs: 18.89 | 31: iteration 46750/ 173500 | consumed samples: 11968000 | consumed tokens: 24510464000 | elapsed time per iteration (s): 0.76 | learning rate: 1.712E-04 | global batch size: 256 | lm loss: 2.070024E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.635 | TFLOPs: 20.49 | 31: iteration 46760/ 173500 | consumed samples: 11970560 | consumed tokens: 24515706880 | elapsed time per iteration (s): 0.79 | learning rate: 1.712E-04 | global batch size: 256 | lm loss: 2.046653E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.324 | TFLOPs: 19.50 | 31: iteration 46770/ 173500 | consumed samples: 11973120 | consumed tokens: 24520949760 | elapsed time per iteration (s): 0.77 | learning rate: 1.712E-04 | global batch size: 256 | lm loss: 2.091436E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.576 | TFLOPs: 20.12 | 31: iteration 46780/ 173500 | consumed samples: 11975680 | consumed tokens: 24526192640 | elapsed time per iteration (s): 0.75 | learning rate: 1.711E-04 | global batch size: 256 | lm loss: 2.075412E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.412 | TFLOPs: 20.78 | 31: iteration 46790/ 173500 | consumed samples: 11978240 | consumed tokens: 24531435520 | elapsed time per iteration (s): 0.76 | learning rate: 1.711E-04 | global batch size: 256 | lm loss: 2.073965E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.171 | TFLOPs: 20.34 | 31: iteration 46800/ 173500 | consumed samples: 11980800 | consumed tokens: 24536678400 | elapsed time 
per iteration (s): 0.78 | learning rate: 1.711E-04 | global batch size: 256 | lm loss: 2.069479E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.191 | TFLOPs: 19.79 | 31: iteration 46810/ 173500 | consumed samples: 11983360 | consumed tokens: 24541921280 | elapsed time per iteration (s): 0.71 | learning rate: 1.711E-04 | global batch size: 256 | lm loss: 2.083263E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 359.635 | TFLOPs: 21.76 | 31: iteration 46820/ 173500 | consumed samples: 11985920 | consumed tokens: 24547164160 | elapsed time per iteration (s): 0.80 | learning rate: 1.711E-04 | global batch size: 256 | lm loss: 2.090640E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.463 | TFLOPs: 19.27 | 31: iteration 46830/ 173500 | consumed samples: 11988480 | consumed tokens: 24552407040 | elapsed time per iteration (s): 0.77 | learning rate: 1.711E-04 | global batch size: 256 | lm loss: 2.092844E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.892 | TFLOPs: 20.08 | 31: iteration 46840/ 173500 | consumed samples: 11991040 | consumed tokens: 24557649920 | elapsed time per iteration (s): 0.80 | learning rate: 1.711E-04 | global batch size: 256 | lm loss: 2.061739E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.583 | TFLOPs: 19.33 | 31: iteration 46850/ 173500 | consumed samples: 11993600 | consumed tokens: 24562892800 | elapsed time per iteration (s): 0.87 | learning rate: 1.711E-04 | global batch size: 256 | lm loss: 2.096990E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.793 | TFLOPs: 
17.83 | 31: iteration 46860/ 173500 | consumed samples: 11996160 | consumed tokens: 24568135680 | elapsed time per iteration (s): 0.76 | learning rate: 1.710E-04 | global batch size: 256 | lm loss: 2.058192E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.784 | TFLOPs: 20.25 | 31: iteration 46870/ 173500 | consumed samples: 11998720 | consumed tokens: 24573378560 | elapsed time per iteration (s): 0.78 | learning rate: 1.710E-04 | global batch size: 256 | lm loss: 2.075328E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.217 | TFLOPs: 19.80 | 31: iteration 46880/ 173500 | consumed samples: 12001280 | consumed tokens: 24578621440 | elapsed time per iteration (s): 0.78 | learning rate: 1.710E-04 | global batch size: 256 | lm loss: 2.095681E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.627 | TFLOPs: 19.76 | 31: iteration 46890/ 173500 | consumed samples: 12003840 | consumed tokens: 24583864320 | elapsed time per iteration (s): 0.78 | learning rate: 1.710E-04 | global batch size: 256 | lm loss: 2.059841E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.607 | TFLOPs: 19.76 | 31: iteration 46900/ 173500 | consumed samples: 12006400 | consumed tokens: 24589107200 | elapsed time per iteration (s): 0.81 | learning rate: 1.710E-04 | global batch size: 256 | lm loss: 2.081013E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.981 | TFLOPs: 19.18 | 31: iteration 46910/ 173500 | consumed samples: 12008960 | consumed tokens: 24594350080 | elapsed time per iteration (s): 0.83 | learning rate: 1.710E-04 | global batch size: 256 | lm loss: 2.091287E+00 | grad norm: 0.127 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.641 | TFLOPs: 18.73 | 31: iteration 46920/ 173500 | consumed samples: 12011520 | consumed tokens: 24599592960 | elapsed time per iteration (s): 0.80 | learning rate: 1.710E-04 | global batch size: 256 | lm loss: 2.113758E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.307 | TFLOPs: 19.26 | 31: iteration 46930/ 173500 | consumed samples: 12014080 | consumed tokens: 24604835840 | elapsed time per iteration (s): 0.78 | learning rate: 1.710E-04 | global batch size: 256 | lm loss: 2.068860E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.099 | TFLOPs: 19.85 | 31: iteration 46940/ 173500 | consumed samples: 12016640 | consumed tokens: 24610078720 | elapsed time per iteration (s): 0.83 | learning rate: 1.710E-04 | global batch size: 256 | lm loss: 2.094599E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.004 | TFLOPs: 18.63 | 31: iteration 46950/ 173500 | consumed samples: 12019200 | consumed tokens: 24615321600 | elapsed time per iteration (s): 0.75 | learning rate: 1.709E-04 | global batch size: 256 | lm loss: 2.069961E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.574 | TFLOPs: 20.79 | 31: iteration 46960/ 173500 | consumed samples: 12021760 | consumed tokens: 24620564480 | elapsed time per iteration (s): 0.76 | learning rate: 1.709E-04 | global batch size: 256 | lm loss: 2.068642E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.752 | TFLOPs: 20.37 | 31: iteration 46970/ 173500 | consumed samples: 12024320 | consumed tokens: 24625807360 | elapsed time per iteration (s): 0.75 | 
learning rate: 1.709E-04 | global batch size: 256 | lm loss: 2.091703E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.145 | TFLOPs: 20.64 | 31: iteration 46980/ 173500 | consumed samples: 12026880 | consumed tokens: 24631050240 | elapsed time per iteration (s): 0.81 | learning rate: 1.709E-04 | global batch size: 256 | lm loss: 2.106037E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.678 | TFLOPs: 19.10 | 31: iteration 46990/ 173500 | consumed samples: 12029440 | consumed tokens: 24636293120 | elapsed time per iteration (s): 0.76 | learning rate: 1.709E-04 | global batch size: 256 | lm loss: 2.087557E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.565 | TFLOPs: 20.30 | 31: iteration 47000/ 173500 | consumed samples: 12032000 | consumed tokens: 24641536000 | elapsed time per iteration (s): 0.76 | learning rate: 1.709E-04 | global batch size: 256 | lm loss: 2.092586E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.552 | TFLOPs: 20.48 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 47000 | lm loss value: 1.939248E+00 | lm loss PPL: 6.953519E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 47000 to checkpoints_1b1long 0: [2022-11-26 04:39:16,779] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step47000 is begin to save! 0: [2022-11-26 04:39:16,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 04:39:17,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_01-model_00-model_states.pt. 0: [2022-11-26 04:39:17,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_03-model_00-model_states.pt... 0: [2022-11-26 04:39:17,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_03-model_00-model_states.pt. 0: [2022-11-26 04:39:17,103] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_04-model_00-model_states.pt... 0: [2022-11-26 04:39:17,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_04-model_00-model_states.pt. 0: [2022-11-26 04:39:17,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_05-model_00-model_states.pt... 0: [2022-11-26 04:39:17,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_05-model_00-model_states.pt. 0: [2022-11-26 04:39:17,257] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_06-model_00-model_states.pt... 0: [2022-11-26 04:39:17,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_06-model_00-model_states.pt. 0: [2022-11-26 04:39:17,333] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_07-model_00-model_states.pt... 0: [2022-11-26 04:39:17,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_07-model_00-model_states.pt. 0: [2022-11-26 04:39:17,409] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 04:39:17,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_08-model_00-model_states.pt. 0: [2022-11-26 04:39:17,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_09-model_00-model_states.pt... 0: [2022-11-26 04:39:17,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_09-model_00-model_states.pt. 0: [2022-11-26 04:39:17,562] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_10-model_00-model_states.pt... 0: [2022-11-26 04:39:17,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_10-model_00-model_states.pt. 0: [2022-11-26 04:39:17,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_11-model_00-model_states.pt... 0: [2022-11-26 04:39:17,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_11-model_00-model_states.pt. 0: [2022-11-26 04:39:17,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_12-model_00-model_states.pt... 0: [2022-11-26 04:39:17,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_12-model_00-model_states.pt. 0: [2022-11-26 04:39:17,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_13-model_00-model_states.pt... 0: [2022-11-26 04:39:17,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_13-model_00-model_states.pt. 0: [2022-11-26 04:39:17,870] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 04:39:17,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_14-model_00-model_states.pt. 0: [2022-11-26 04:39:17,946] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_15-model_00-model_states.pt... 0: [2022-11-26 04:39:18,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_15-model_00-model_states.pt. 0: [2022-11-26 04:39:18,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_16-model_00-model_states.pt... 0: [2022-11-26 04:39:18,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_16-model_00-model_states.pt. 0: [2022-11-26 04:39:18,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_17-model_00-model_states.pt... 0: [2022-11-26 04:39:18,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_17-model_00-model_states.pt. 0: [2022-11-26 04:39:18,174] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_18-model_00-model_states.pt... 0: [2022-11-26 04:39:18,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_18-model_00-model_states.pt. 0: [2022-11-26 04:39:18,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_19-model_00-model_states.pt... 0: [2022-11-26 04:39:18,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_19-model_00-model_states.pt. 0: [2022-11-26 04:39:18,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 04:39:18,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_20-model_00-model_states.pt. 0: [2022-11-26 04:39:18,402] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_21-model_00-model_states.pt... 0: [2022-11-26 04:39:18,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_21-model_00-model_states.pt. 0: [2022-11-26 04:39:18,479] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_22-model_00-model_states.pt... 0: [2022-11-26 04:39:18,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_22-model_00-model_states.pt. 0: [2022-11-26 04:39:18,555] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_23-model_00-model_states.pt... 0: [2022-11-26 04:39:18,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_23-model_00-model_states.pt. 0: [2022-11-26 04:39:18,632] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_24-model_00-model_states.pt... 0: [2022-11-26 04:39:18,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_24-model_00-model_states.pt. 0: [2022-11-26 04:39:18,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_25-model_00-model_states.pt... 0: [2022-11-26 04:39:18,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_25-model_00-model_states.pt. 0: [2022-11-26 04:39:18,781] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 04:39:18,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_26-model_00-model_states.pt. 0: [2022-11-26 04:39:18,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_27-model_00-model_states.pt... 0: [2022-11-26 04:39:18,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_27-model_00-model_states.pt. 0: [2022-11-26 04:39:18,936] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_28-model_00-model_states.pt... 0: [2022-11-26 04:39:19,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_28-model_00-model_states.pt. 0: [2022-11-26 04:39:19,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/layer_30-model_00-model_states.pt... 0: [2022-11-26 04:39:19,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/layer_30-model_00-model_states.pt. 0: [2022-11-26 04:39:19,014] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step47000/mp_rank_00_model_states.pt 0: [2022-11-26 04:39:19,014] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/mp_rank_00_model_states.pt... 0: [2022-11-26 04:39:19,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/mp_rank_00_model_states.pt. 0: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
0: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
8: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
1: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 23: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 
23: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 
29: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 
18: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 
27: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
4: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
8: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 
2: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
12: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 
25: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 
28: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 
22: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 
21: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
7: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
16: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
20: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
14: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 
18: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
4: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
31: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 
0: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 
3: [2022-11-26 04:39:19,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:39:19,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:39:19,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 04:39:19,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 20: [2022-11-26 04:39:19,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 04:39:19,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 9: [2022-11-26 04:39:19,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 20: [2022-11-26 04:39:19,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 9: [2022-11-26 04:39:19,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 04:39:19,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 1: [2022-11-26 04:39:19,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 04:39:19,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 04:39:19,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 2: [2022-11-26 04:39:19,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:39:19,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 04:39:19,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 28: [2022-11-26 04:39:19,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 04:39:19,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 04:39:19,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 3: [2022-11-26 04:39:19,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:39:19,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 04:39:19,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 5: [2022-11-26 04:39:19,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
24: [2022-11-26 04:39:19,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:39:19,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 04:39:19,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 24: [2022-11-26 04:39:19,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 04:39:19,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 31: [2022-11-26 04:39:19,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 04:39:19,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 30: [2022-11-26 04:39:19,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 31: [2022-11-26 04:39:19,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 30: [2022-11-26 04:39:19,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 10: [2022-11-26 04:39:19,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 30: [2022-11-26 04:39:19,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
10: [2022-11-26 04:39:19,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 04:39:19,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 5: [2022-11-26 04:39:19,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:39:19,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 18: [2022-11-26 04:39:19,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:39:19,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 18: [2022-11-26 04:39:19,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 04:39:19,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 21: [2022-11-26 04:39:19,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 27: [2022-11-26 04:39:19,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 21: [2022-11-26 04:39:19,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 04:39:19,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
27: [2022-11-26 04:39:19,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 04:39:19,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 4: [2022-11-26 04:39:19,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:39:19,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 04:39:19,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 14: [2022-11-26 04:39:19,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:39:19,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 04:39:19,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 12: [2022-11-26 04:39:19,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:39:19,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 04:39:19,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 28: [2022-11-26 04:39:19,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
2: [2022-11-26 04:39:19,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:39:19,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 04:39:19,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 18: [2022-11-26 04:39:19,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:39:19,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 04:39:19,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 29: [2022-11-26 04:39:19,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 27: [2022-11-26 04:39:19,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:39:19,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 04:39:19,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 27: [2022-11-26 04:39:19,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 04:39:19,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
15: [2022-11-26 04:39:19,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:39:19,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:39:19,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 04:39:19,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 0: [2022-11-26 04:39:19,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 04:39:19,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 1: [2022-11-26 04:39:19,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:39:19,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 04:39:19,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 31: [2022-11-26 04:39:19,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 04:39:19,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 04:39:19,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
19: [2022-11-26 04:39:19,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:39:19,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 04:39:19,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 14: [2022-11-26 04:39:19,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 21: [2022-11-26 04:39:19,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:39:19,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 21: [2022-11-26 04:39:19,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 14: [2022-11-26 04:39:19,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 30: [2022-11-26 04:39:19,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 21: [2022-11-26 04:39:19,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 30: [2022-11-26 04:39:19,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 04:39:19,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
10: [2022-11-26 04:39:19,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:39:19,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 04:39:19,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 14: [2022-11-26 04:39:19,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:39:19,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:39:19,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 3: [2022-11-26 04:39:19,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 14: [2022-11-26 04:39:19,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 3: [2022-11-26 04:39:19,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 10: [2022-11-26 04:39:19,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:39:19,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 12: [2022-11-26 04:39:19,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
10: [2022-11-26 04:39:19,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 12: [2022-11-26 04:39:19,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 04:39:19,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 30: [2022-11-26 04:39:19,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 04:39:19,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 04:39:19,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 24: [2022-11-26 04:39:19,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 04:39:19,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 04:39:19,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 7: [2022-11-26 04:39:19,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:39:19,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 04:39:19,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
18: [2022-11-26 04:39:19,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:39:19,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 04:39:19,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 20: [2022-11-26 04:39:19,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 04:39:19,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 9: [2022-11-26 04:39:19,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 20: [2022-11-26 04:39:19,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 9: [2022-11-26 04:39:19,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 04:39:19,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:39:19,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 20: [2022-11-26 04:39:19,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 
9: [2022-11-26 04:39:19,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 20: [2022-11-26 04:39:19,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 31: [2022-11-26 04:39:19,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 20: [2022-11-26 04:39:19,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 31: [2022-11-26 04:39:19,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 9: [2022-11-26 04:39:19,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 31: [2022-11-26 04:39:19,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 0: [2022-11-26 04:39:19,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 24: [2022-11-26 04:39:19,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 04:39:19,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 04:39:19,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 22: [2022-11-26 04:39:19,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
19: [2022-11-26 04:39:19,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 22: [2022-11-26 04:39:19,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 04:39:19,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 19: [2022-11-26 04:39:19,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 22: [2022-11-26 04:39:19,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 04:39:19,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 04:39:19,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 19: [2022-11-26 04:39:19,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 3: [2022-11-26 04:39:19,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:39:19,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 04:39:19,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 27: [2022-11-26 04:39:19,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
27: [2022-11-26 04:39:19,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 04:39:19,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 27: [2022-11-26 04:39:19,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 04:39:19,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 04:39:19,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 1: [2022-11-26 04:39:19,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:39:19,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 21: [2022-11-26 04:39:19,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
1: [2022-11-26 04:39:19,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 04:39:19,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 21: [2022-11-26 04:39:19,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 1: [2022-11-26 04:39:19,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 1: [2022-11-26 04:39:19,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 29: [2022-11-26 04:39:19,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:39:19,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 21: [2022-11-26 04:39:19,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 29: [2022-11-26 04:39:19,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 15: [2022-11-26 04:39:19,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:39:19,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 04:39:19,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
29: [2022-11-26 04:39:19,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:39:19,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 04:39:19,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 0: [2022-11-26 04:39:19,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:39:19,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 04:39:19,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 28: [2022-11-26 04:39:19,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 04:39:19,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 28: [2022-11-26 04:39:19,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 04:39:19,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 04:39:19,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 9: [2022-11-26 04:39:19,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-26 04:39:19,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 04:39:19,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 19: [2022-11-26 04:39:19,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:39:19,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 04:39:19,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 19: [2022-11-26 04:39:19,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:39:19,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 04:39:19,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 0: [2022-11-26 04:39:19,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:39:19,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 04:39:19,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 29: [2022-11-26 04:39:19,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
29: [2022-11-26 04:39:19,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 04:39:19,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 5: [2022-11-26 04:39:19,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:39:19,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 04:39:19,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 2: [2022-11-26 04:39:19,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:39:19,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 04:39:19,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:39:19,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 2: [2022-11-26 04:39:19,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 04:39:19,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 30: [2022-11-26 04:39:19,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-26 04:39:19,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 04:39:19,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 4: [2022-11-26 04:39:19,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:39:19,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 04:39:19,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 20: [2022-11-26 04:39:19,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 04:39:19,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 04:39:19,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 13: [2022-11-26 04:39:19,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:39:19,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 04:39:19,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 13: [2022-11-26 04:39:19,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-26 04:39:19,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:39:19,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 04:39:19,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:39:19,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 04:39:19,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 13: [2022-11-26 04:39:19,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 13: [2022-11-26 04:39:19,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 04:39:19,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 10: [2022-11-26 04:39:19,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:39:19,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 04:39:19,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 18: [2022-11-26 04:39:19,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
18: [2022-11-26 04:39:19,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 28: [2022-11-26 04:39:19,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:39:19,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 12: [2022-11-26 04:39:19,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:39:19,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 04:39:19,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 28: [2022-11-26 04:39:19,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 04:39:19,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 31: [2022-11-26 04:39:19,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 04:39:19,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 04:39:19,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 21: [2022-11-26 04:39:19,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
21: [2022-11-26 04:39:19,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 04:39:19,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 15: [2022-11-26 04:39:19,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:39:19,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 04:39:19,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 22: [2022-11-26 04:39:19,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 04:39:19,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 04:39:19,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 5: [2022-11-26 04:39:19,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:39:19,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 04:39:19,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 5: [2022-11-26 04:39:19,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 04:39:19,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 04:39:19,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 12: [2022-11-26 04:39:19,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:39:19,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 04:39:19,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 14: [2022-11-26 04:39:19,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:39:19,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 04:39:19,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 24: [2022-11-26 04:39:19,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 04:39:19,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 04:39:19,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 7: [2022-11-26 04:39:19,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 04:39:19,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 04:39:19,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 3: [2022-11-26 04:39:19,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:39:19,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 04:39:19,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 7: [2022-11-26 04:39:19,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:39:19,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 04:39:19,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 7: [2022-11-26 04:39:19,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:39:19,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
7: [2022-11-26 04:39:19,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 19: [2022-11-26 04:39:19,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 7: [2022-11-26 04:39:19,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 19: [2022-11-26 04:39:19,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 1: [2022-11-26 04:39:19,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:39:19,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 23: [2022-11-26 04:39:19,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:39:19,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:39:19,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:39:19,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:39:19,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
23: [2022-11-26 04:39:19,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 04:39:19,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 04:39:19,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 04:39:19,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 04:39:19,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 23: [2022-11-26 04:39:19,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 23: [2022-11-26 04:39:19,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 23: [2022-11-26 04:39:19,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 2: [2022-11-26 04:39:19,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:39:19,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 04:39:19,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 17: [2022-11-26 04:39:19,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 
22: [2022-11-26 04:39:19,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 17: [2022-11-26 04:39:19,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 04:39:19,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 17: [2022-11-26 04:39:19,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 22: [2022-11-26 04:39:19,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 17: [2022-11-26 04:39:19,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 22: [2022-11-26 04:39:19,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 17: [2022-11-26 04:39:19,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 0: [2022-11-26 04:39:19,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 17: [2022-11-26 04:39:19,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 04:39:19,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 04:39:19,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
0: [2022-11-26 04:39:19,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 17: [2022-11-26 04:39:19,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 04:39:19,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 04:39:19,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 4: [2022-11-26 04:39:19,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:39:19,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:39:19,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 04:39:19,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 04:39:19,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 4: [2022-11-26 04:39:19,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 12: [2022-11-26 04:39:19,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 04:39:19,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 04:39:19,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 27: [2022-11-26 04:39:19,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 04:39:19,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 04:39:19,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 4: [2022-11-26 04:39:19,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:39:19,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 04:39:19,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 6: [2022-11-26 04:39:19,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:39:19,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:39:19,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:39:19,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 04:39:19,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:39:19,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 04:39:19,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 04:39:19,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 04:39:19,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 04:39:19,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 04:39:19,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 6: [2022-11-26 04:39:19,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 6: [2022-11-26 04:39:19,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 6: [2022-11-26 04:39:19,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 6: [2022-11-26 04:39:19,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 17: [2022-11-26 04:39:19,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-26 04:39:19,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 04:39:19,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 13: [2022-11-26 04:39:19,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:39:19,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 04:39:19,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 11: [2022-11-26 04:39:19,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:39:19,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:39:19,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:39:19,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:39:19,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-26 04:39:19,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 04:39:19,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 04:39:19,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 04:39:19,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 04:39:19,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 04:39:19,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 11: [2022-11-26 04:39:19,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 11: [2022-11-26 04:39:19,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 11: [2022-11-26 04:39:19,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 11: [2022-11-26 04:39:19,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 20: [2022-11-26 04:39:19,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-26 04:39:19,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 04:39:19,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 29: [2022-11-26 04:39:19,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:39:19,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 04:39:19,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 28: [2022-11-26 04:39:19,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 04:39:19,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 04:39:19,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 9: [2022-11-26 04:39:19,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:39:19,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 04:39:19,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 14: [2022-11-26 04:39:19,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-26 04:39:19,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 04:39:19,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 3: [2022-11-26 04:39:19,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:39:19,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 04:39:19,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 30: [2022-11-26 04:39:19,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 04:39:19,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 04:39:19,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 15: [2022-11-26 04:39:19,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:39:19,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 04:39:19,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 0: [2022-11-26 04:39:19,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-26 04:39:19,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 04:39:19,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 22: [2022-11-26 04:39:19,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 04:39:19,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 04:39:19,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 23: [2022-11-26 04:39:19,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:39:19,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 04:39:19,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 10: [2022-11-26 04:39:19,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:39:19,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 04:39:19,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 31: [2022-11-26 04:39:19,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
31: [2022-11-26 04:39:19,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 04:39:19,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 24: [2022-11-26 04:39:19,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 04:39:19,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 04:39:19,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 6: [2022-11-26 04:39:19,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:39:19,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 04:39:19,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 7: [2022-11-26 04:39:19,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:39:19,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 21: [2022-11-26 04:39:19,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:39:19,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
21: [2022-11-26 04:39:19,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 04:39:19,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 18: [2022-11-26 04:39:19,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:39:19,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 1: [2022-11-26 04:39:19,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:39:19,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 18: [2022-11-26 04:39:19,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 1: [2022-11-26 04:39:19,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 5: [2022-11-26 04:39:19,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:39:19,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 19: [2022-11-26 04:39:19,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:39:19,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
19: [2022-11-26 04:39:19,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 04:39:19,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 4: [2022-11-26 04:39:19,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:39:19,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 04:39:19,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 12: [2022-11-26 04:39:19,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:39:19,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 04:39:19,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 27: [2022-11-26 04:39:19,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 04:39:19,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 04:39:19,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 11: [2022-11-26 04:39:19,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-26 04:39:19,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 04:39:19,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 17: [2022-11-26 04:39:19,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 04:39:19,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 04:39:19,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 13: [2022-11-26 04:39:19,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:39:19,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 04:39:19,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 9: [2022-11-26 04:39:19,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:39:19,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 04:39:19,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 3: [2022-11-26 04:39:19,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 04:39:19,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 04:39:19,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 28: [2022-11-26 04:39:19,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 04:39:19,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 04:39:19,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 30: [2022-11-26 04:39:19,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 04:39:19,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 04:39:19,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 14: [2022-11-26 04:39:19,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:39:19,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 04:39:19,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 0: [2022-11-26 04:39:19,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-26 04:39:19,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 04:39:19,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 15: [2022-11-26 04:39:19,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:39:19,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 04:39:19,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 29: [2022-11-26 04:39:19,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:39:19,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 04:39:19,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 18: [2022-11-26 04:39:19,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:39:19,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 04:39:19,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 31: [2022-11-26 04:39:19,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-26 04:39:19,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 04:39:19,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 20: [2022-11-26 04:39:19,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 04:39:19,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 04:39:19,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 22: [2022-11-26 04:39:19,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 04:39:19,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 04:39:19,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 23: [2022-11-26 04:39:19,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:39:19,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 04:39:19,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 24: [2022-11-26 04:39:19,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
24: [2022-11-26 04:39:19,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 04:39:19,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 6: [2022-11-26 04:39:19,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:39:19,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 04:39:19,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 7: [2022-11-26 04:39:19,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:39:19,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 04:39:19,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 5: [2022-11-26 04:39:19,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:39:19,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 04:39:19,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 10: [2022-11-26 04:39:19,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 04:39:19,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 04:39:19,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 19: [2022-11-26 04:39:19,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:39:19,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 04:39:19,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 2: [2022-11-26 04:39:19,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:39:19,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 04:39:19,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:39:19,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 2: [2022-11-26 04:39:19,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 04:39:19,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 27: [2022-11-26 04:39:19,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-26 04:39:19,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 04:39:19,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 11: [2022-11-26 04:39:19,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:39:19,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 04:39:19,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 4: [2022-11-26 04:39:19,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:39:19,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 04:39:19,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 1: [2022-11-26 04:39:19,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:39:19,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 04:39:19,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 17: [2022-11-26 04:39:19,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-26 04:39:19,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 04:39:19,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 12: [2022-11-26 04:39:19,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:39:19,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 04:39:19,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 9: [2022-11-26 04:39:19,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:39:19,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 04:39:19,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 30: [2022-11-26 04:39:19,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 04:39:19,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 04:39:19,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 0: [2022-11-26 04:39:19,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 04:39:19,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 04:39:19,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 21: [2022-11-26 04:39:19,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 04:39:19,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 04:39:19,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 3: [2022-11-26 04:39:19,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:39:19,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:39:19,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 04:39:19,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 15: [2022-11-26 04:39:19,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 04:39:19,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 20: [2022-11-26 04:39:19,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-26 04:39:19,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 04:39:19,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 14: [2022-11-26 04:39:19,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:39:19,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 04:39:19,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 28: [2022-11-26 04:39:19,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 04:39:19,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 04:39:19,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 18: [2022-11-26 04:39:19,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:39:19,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 04:39:19,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 29: [2022-11-26 04:39:19,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
29: [2022-11-26 04:39:19,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 04:39:19,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 31: [2022-11-26 04:39:19,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 04:39:19,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 04:39:19,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 23: [2022-11-26 04:39:19,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:39:19,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 04:39:19,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 17: [2022-11-26 04:39:19,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 04:39:19,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 6: [2022-11-26 04:39:19,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 17: [2022-11-26 04:39:19,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
6: [2022-11-26 04:39:19,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 24: [2022-11-26 04:39:19,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:39:19,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 24: [2022-11-26 04:39:19,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 04:39:19,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 19: [2022-11-26 04:39:19,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:39:19,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 04:39:19,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 12: [2022-11-26 04:39:19,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:39:19,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 04:39:19,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 10: [2022-11-26 04:39:19,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
13: [2022-11-26 04:39:19,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:39:19,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 04:39:19,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 13: [2022-11-26 04:39:19,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 7: [2022-11-26 04:39:19,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:39:19,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 13: [2022-11-26 04:39:19,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 2: [2022-11-26 04:39:19,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:39:19,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 2: [2022-11-26 04:39:19,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 04:39:19,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 4: [2022-11-26 04:39:19,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-26 04:39:19,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 04:39:19,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 0: [2022-11-26 04:39:19,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:39:19,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 04:39:19,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 1: [2022-11-26 04:39:19,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:39:19,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 04:39:19,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 22: [2022-11-26 04:39:19,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 04:39:19,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 04:39:19,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 5: [2022-11-26 04:39:19,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-26 04:39:19,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 04:39:19,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 21: [2022-11-26 04:39:19,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 04:39:19,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 04:39:19,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 9: [2022-11-26 04:39:19,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:39:19,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 27: [2022-11-26 04:39:19,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:39:19,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 27: [2022-11-26 04:39:19,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 04:39:19,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 3: [2022-11-26 04:39:19,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 04:39:19,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 04:39:19,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 23: [2022-11-26 04:39:19,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:39:19,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 04:39:19,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 29: [2022-11-26 04:39:19,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:39:19,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:39:19,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 04:39:19,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 14: [2022-11-26 04:39:19,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 04:39:19,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 11: [2022-11-26 04:39:19,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 04:39:19,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 04:39:19,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 31: [2022-11-26 04:39:19,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 04:39:19,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 04:39:19,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 7: [2022-11-26 04:39:19,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:39:19,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 04:39:19,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 15: [2022-11-26 04:39:19,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:39:19,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 04:39:19,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 24: [2022-11-26 04:39:19,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-26 04:39:19,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 04:39:19,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 20: [2022-11-26 04:39:19,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 04:39:19,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 04:39:19,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 28: [2022-11-26 04:39:19,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 04:39:19,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 04:39:19,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 10: [2022-11-26 04:39:19,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:39:19,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 04:39:19,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 30: [2022-11-26 04:39:19,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 
30: [2022-11-26 04:39:19,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 04:39:19,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 18: [2022-11-26 04:39:19,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:39:19,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 04:39:19,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 22: [2022-11-26 04:39:19,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 04:39:19,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 04:39:19,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 13: [2022-11-26 04:39:19,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:39:19,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 04:39:19,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 21: [2022-11-26 04:39:19,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-26 04:39:19,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 04:39:19,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 8: [2022-11-26 04:39:19,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:39:19,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:39:19,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:39:19,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:39:19,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 04:39:19,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 04:39:19,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 04:39:19,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 04:39:19,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
8: [2022-11-26 04:39:19,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 8: [2022-11-26 04:39:19,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 8: [2022-11-26 04:39:19,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 8: [2022-11-26 04:39:19,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:39:19,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:39:19,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:39:19,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 04:39:19,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 04:39:19,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 04:39:19,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 8: [2022-11-26 04:39:19,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 8: [2022-11-26 04:39:19,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
26: [2022-11-26 04:39:19,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 04:39:19,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 04:39:19,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 8: [2022-11-26 04:39:19,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:39:19,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 26: [2022-11-26 04:39:19,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:39:19,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 26: [2022-11-26 04:39:19,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 04:39:19,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 26: [2022-11-26 04:39:19,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 04:39:19,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 04:39:19,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
26: [2022-11-26 04:39:19,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 04:39:19,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 04:39:19,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 26: [2022-11-26 04:39:19,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 04:39:19,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 04:39:19,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 04:39:19,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 04:39:19,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 04:39:19,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 04:39:19,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 26: [2022-11-26 04:39:19,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 26: [2022-11-26 04:39:19,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
26: [2022-11-26 04:39:19,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 04:39:19,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 04:39:19,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 25: [2022-11-26 04:39:19,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 04:39:19,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 04:39:19,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 04:39:19,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 04:39:19,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 04:39:19,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 04:39:19,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 25: [2022-11-26 04:39:19,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 25: [2022-11-26 04:39:19,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
25: [2022-11-26 04:39:19,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 04:39:19,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 04:39:19,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 04:39:19,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 04:39:19,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 04:39:19,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 04:39:19,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 04:39:19,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 04:39:19,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 04:39:19,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 04:39:19,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
25: [2022-11-26 04:39:19,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 25: [2022-11-26 04:39:19,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 25: [2022-11-26 04:39:19,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 25: [2022-11-26 04:39:19,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 16: [2022-11-26 04:39:19,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 04:39:19,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 04:39:19,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 04:39:19,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 04:39:19,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 16: [2022-11-26 04:39:19,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 16: [2022-11-26 04:39:19,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 04:39:19,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
16: [2022-11-26 04:39:19,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 04:39:19,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 04:39:19,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 04:39:19,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 04:39:19,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 16: [2022-11-26 04:39:19,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 16: [2022-11-26 04:39:19,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 16: [2022-11-26 04:39:19,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 04:39:19,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 04:39:19,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-26 04:39:19,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 04:39:19,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 04:39:19,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step47000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 04:39:19,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 16: [2022-11-26 04:39:19,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 16: [2022-11-26 04:39:19,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 0: successfully saved checkpoint at iteration 47000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2661.84 31: iteration 47010/ 173500 | consumed samples: 12034560 | consumed tokens: 24646778880 | elapsed time per iteration (s): 1.07 | learning rate: 1.709E-04 | global batch size: 256 | lm loss: 2.079456E+00 | grad norm: 0.390 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.035 | TFLOPs: 14.52 | 31: iteration 47020/ 173500 | consumed samples: 12037120 | consumed tokens: 24652021760 | elapsed time per iteration (s): 0.78 | learning rate: 1.709E-04 | global batch size: 256 | lm loss: 3.600356E+00 | grad norm: 5.798 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.428 | TFLOPs: 19.75 | 31: iteration 47030/ 173500 | consumed samples: 12039680 | consumed tokens: 24657264640 | elapsed time per iteration (s): 0.79 | learning rate: 1.708E-04 | global batch size: 256 | lm loss: 2.288341E+00 | grad norm: 0.460 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.917 | TFLOPs: 19.72 | 31: iteration 47040/ 173500 | consumed samples: 12042240 | consumed tokens: 24662507520 | elapsed time per iteration (s): 0.79 | learning rate: 1.708E-04 | global batch size: 256 | lm loss: 2.171004E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.055 | TFLOPs: 19.73 | 31: iteration 47050/ 173500 | consumed samples: 12044800 | consumed tokens: 24667750400 | elapsed time per iteration (s): 0.76 | learning rate: 1.708E-04 | global batch size: 256 | lm loss: 2.121368E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.693 | TFLOPs: 20.37 | 31: iteration 47060/ 173500 | consumed samples: 12047360 | consumed tokens: 24672993280 | elapsed time per iteration (s): 0.72 | learning rate: 1.708E-04 | global batch size: 256 | lm loss: 2.111640E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.039 | TFLOPs: 21.42 | 31: iteration 47070/ 173500 | consumed samples: 12049920 | consumed tokens: 24678236160 | elapsed time per iteration (s): 0.74 | learning rate: 1.708E-04 | global batch size: 256 | lm loss: 2.091481E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.147 | TFLOPs: 20.88 | 31: iteration 47080/ 173500 | consumed samples: 12052480 | consumed tokens: 24683479040 | elapsed time per iteration (s): 0.76 | learning rate: 1.708E-04 | global batch size: 256 | lm loss: 2.096778E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.443 | TFLOPs: 20.41 | 31: iteration 47090/ 173500 | consumed samples: 12055040 | consumed tokens: 24688721920 | elapsed time per iteration (s): 0.74 | 
learning rate: 1.708E-04 | global batch size: 256 | lm loss: 2.062222E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.326 | TFLOPs: 20.83 | 31: iteration 47100/ 173500 | consumed samples: 12057600 | consumed tokens: 24693964800 | elapsed time per iteration (s): 0.78 | learning rate: 1.708E-04 | global batch size: 256 | lm loss: 2.074833E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.679 | TFLOPs: 19.76 | 31: iteration 47110/ 173500 | consumed samples: 12060160 | consumed tokens: 24699207680 | elapsed time per iteration (s): 0.75 | learning rate: 1.707E-04 | global batch size: 256 | lm loss: 2.086584E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.090 | TFLOPs: 20.70 | 31: iteration 47120/ 173500 | consumed samples: 12062720 | consumed tokens: 24704450560 | elapsed time per iteration (s): 0.80 | learning rate: 1.707E-04 | global batch size: 256 | lm loss: 2.089950E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.901 | TFLOPs: 19.35 | 31: iteration 47130/ 173500 | consumed samples: 12065280 | consumed tokens: 24709693440 | elapsed time per iteration (s): 0.77 | learning rate: 1.707E-04 | global batch size: 256 | lm loss: 2.081593E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.601 | TFLOPs: 20.12 | 31: iteration 47140/ 173500 | consumed samples: 12067840 | consumed tokens: 24714936320 | elapsed time per iteration (s): 0.74 | learning rate: 1.707E-04 | global batch size: 256 | lm loss: 2.067953E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.127 | TFLOPs: 21.06 | 31: iteration 47150/ 
173500 | consumed samples: 12070400 | consumed tokens: 24720179200 | elapsed time per iteration (s): 0.77 | learning rate: 1.707E-04 | global batch size: 256 | lm loss: 2.068065E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.790 | TFLOPs: 20.01 | 31: iteration 47160/ 173500 | consumed samples: 12072960 | consumed tokens: 24725422080 | elapsed time per iteration (s): 0.72 | learning rate: 1.707E-04 | global batch size: 256 | lm loss: 2.105481E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.655 | TFLOPs: 21.40 | 31: iteration 47170/ 173500 | consumed samples: 12075520 | consumed tokens: 24730664960 | elapsed time per iteration (s): 0.74 | learning rate: 1.707E-04 | global batch size: 256 | lm loss: 2.082456E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.488 | TFLOPs: 20.96 | 31: iteration 47180/ 173500 | consumed samples: 12078080 | consumed tokens: 24735907840 | elapsed time per iteration (s): 0.80 | learning rate: 1.707E-04 | global batch size: 256 | lm loss: 2.099864E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.748 | TFLOPs: 19.46 | 31: iteration 47190/ 173500 | consumed samples: 12080640 | consumed tokens: 24741150720 | elapsed time per iteration (s): 0.79 | learning rate: 1.706E-04 | global batch size: 256 | lm loss: 2.081390E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.630 | TFLOPs: 19.52 | 31: iteration 47200/ 173500 | consumed samples: 12083200 | consumed tokens: 24746393600 | elapsed time per iteration (s): 0.82 | learning rate: 1.706E-04 | global batch size: 256 | lm loss: 2.089093E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 310.797 | TFLOPs: 18.80 | 31: iteration 47210/ 173500 | consumed samples: 12085760 | consumed tokens: 24751636480 | elapsed time per iteration (s): 0.83 | learning rate: 1.706E-04 | global batch size: 256 | lm loss: 2.111979E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.114 | TFLOPs: 18.76 | 31: iteration 47220/ 173500 | consumed samples: 12088320 | consumed tokens: 24756879360 | elapsed time per iteration (s): 0.86 | learning rate: 1.706E-04 | global batch size: 256 | lm loss: 2.065028E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.369 | TFLOPs: 18.11 | 31: iteration 47230/ 173500 | consumed samples: 12090880 | consumed tokens: 24762122240 | elapsed time per iteration (s): 0.79 | learning rate: 1.706E-04 | global batch size: 256 | lm loss: 2.062979E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.551 | TFLOPs: 19.57 | 31: iteration 47240/ 173500 | consumed samples: 12093440 | consumed tokens: 24767365120 | elapsed time per iteration (s): 0.83 | learning rate: 1.706E-04 | global batch size: 256 | lm loss: 2.044295E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.632 | TFLOPs: 18.67 | 31: iteration 47250/ 173500 | consumed samples: 12096000 | consumed tokens: 24772608000 | elapsed time per iteration (s): 0.80 | learning rate: 1.706E-04 | global batch size: 256 | lm loss: 2.078834E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.544 | TFLOPs: 19.39 | 31: iteration 47260/ 173500 | consumed samples: 12098560 | consumed tokens: 24777850880 | elapsed time per iteration (s): 0.79 | learning rate: 
1.706E-04 | global batch size: 256 | lm loss: 2.074853E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.413 | TFLOPs: 19.69 | 31: iteration 47270/ 173500 | consumed samples: 12101120 | consumed tokens: 24783093760 | elapsed time per iteration (s): 0.81 | learning rate: 1.706E-04 | global batch size: 256 | lm loss: 2.039443E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.694 | TFLOPs: 19.16 | 31: iteration 47280/ 173500 | consumed samples: 12103680 | consumed tokens: 24788336640 | elapsed time per iteration (s): 0.84 | learning rate: 1.705E-04 | global batch size: 256 | lm loss: 2.084517E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.500 | TFLOPs: 18.54 | 31: iteration 47290/ 173500 | consumed samples: 12106240 | consumed tokens: 24793579520 | elapsed time per iteration (s): 0.78 | learning rate: 1.705E-04 | global batch size: 256 | lm loss: 2.080202E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.355 | TFLOPs: 19.80 | 31: iteration 47300/ 173500 | consumed samples: 12108800 | consumed tokens: 24798822400 | elapsed time per iteration (s): 0.84 | learning rate: 1.705E-04 | global batch size: 256 | lm loss: 2.040793E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.434 | TFLOPs: 18.36 | 31: iteration 47310/ 173500 | consumed samples: 12111360 | consumed tokens: 24804065280 | elapsed time per iteration (s): 0.81 | learning rate: 1.705E-04 | global batch size: 256 | lm loss: 2.105002E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.132 | TFLOPs: 19.06 | 31: iteration 47320/ 173500 | 
consumed samples: 12113920 | consumed tokens: 24809308160 | elapsed time per iteration (s): 0.78 | learning rate: 1.705E-04 | global batch size: 256 | lm loss: 2.069276E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.458 | TFLOPs: 19.75 | 31: iteration 47330/ 173500 | consumed samples: 12116480 | consumed tokens: 24814551040 | elapsed time per iteration (s): 0.82 | learning rate: 1.705E-04 | global batch size: 256 | lm loss: 2.099670E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.932 | TFLOPs: 18.99 | 31: iteration 47340/ 173500 | consumed samples: 12119040 | consumed tokens: 24819793920 | elapsed time per iteration (s): 0.80 | learning rate: 1.705E-04 | global batch size: 256 | lm loss: 2.074089E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.762 | TFLOPs: 19.28 | 31: iteration 47350/ 173500 | consumed samples: 12121600 | consumed tokens: 24825036800 | elapsed time per iteration (s): 0.81 | learning rate: 1.705E-04 | global batch size: 256 | lm loss: 2.074714E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.066 | TFLOPs: 19.12 | 31: iteration 47360/ 173500 | consumed samples: 12124160 | consumed tokens: 24830279680 | elapsed time per iteration (s): 0.80 | learning rate: 1.704E-04 | global batch size: 256 | lm loss: 2.090935E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.586 | TFLOPs: 19.33 | 31: iteration 47370/ 173500 | consumed samples: 12126720 | consumed tokens: 24835522560 | elapsed time per iteration (s): 0.80 | learning rate: 1.704E-04 | global batch size: 256 | lm loss: 2.051475E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 318.106 | TFLOPs: 19.24 | 31: iteration 47380/ 173500 | consumed samples: 12129280 | consumed tokens: 24840765440 | elapsed time per iteration (s): 0.79 | learning rate: 1.704E-04 | global batch size: 256 | lm loss: 2.072437E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.215 | TFLOPs: 19.61 | 31: iteration 47390/ 173500 | consumed samples: 12131840 | consumed tokens: 24846008320 | elapsed time per iteration (s): 0.79 | learning rate: 1.704E-04 | global batch size: 256 | lm loss: 2.093400E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.296 | TFLOPs: 19.56 | 31: iteration 47400/ 173500 | consumed samples: 12134400 | consumed tokens: 24851251200 | elapsed time per iteration (s): 0.83 | learning rate: 1.704E-04 | global batch size: 256 | lm loss: 2.076573E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.980 | TFLOPs: 18.63 | 31: iteration 47410/ 173500 | consumed samples: 12136960 | consumed tokens: 24856494080 | elapsed time per iteration (s): 0.81 | learning rate: 1.704E-04 | global batch size: 256 | lm loss: 2.077725E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.746 | TFLOPs: 19.04 | 31: iteration 47420/ 173500 | consumed samples: 12139520 | consumed tokens: 24861736960 | elapsed time per iteration (s): 0.91 | learning rate: 1.704E-04 | global batch size: 256 | lm loss: 2.079958E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 280.695 | TFLOPs: 16.98 | 31: iteration 47430/ 173500 | consumed samples: 12142080 | consumed tokens: 24866979840 | elapsed time per iteration (s): 0.80 | learning rate: 1.704E-04 | global batch 
size: 256 | lm loss: 2.083643E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.270 | TFLOPs: 19.44 | 31: iteration 47440/ 173500 | consumed samples: 12144640 | consumed tokens: 24872222720 | elapsed time per iteration (s): 0.86 | learning rate: 1.703E-04 | global batch size: 256 | lm loss: 2.056025E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.089 | TFLOPs: 17.91 | 31: iteration 47450/ 173500 | consumed samples: 12147200 | consumed tokens: 24877465600 | elapsed time per iteration (s): 0.82 | learning rate: 1.703E-04 | global batch size: 256 | lm loss: 2.057655E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.891 | TFLOPs: 18.81 | 31: iteration 47460/ 173500 | consumed samples: 12149760 | consumed tokens: 24882708480 | elapsed time per iteration (s): 0.83 | learning rate: 1.703E-04 | global batch size: 256 | lm loss: 2.084573E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.161 | TFLOPs: 18.76 | 31: iteration 47470/ 173500 | consumed samples: 12152320 | consumed tokens: 24887951360 | elapsed time per iteration (s): 0.78 | learning rate: 1.703E-04 | global batch size: 256 | lm loss: 2.102767E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.803 | TFLOPs: 19.83 | 31: iteration 47480/ 173500 | consumed samples: 12154880 | consumed tokens: 24893194240 | elapsed time per iteration (s): 0.78 | learning rate: 1.703E-04 | global batch size: 256 | lm loss: 2.074673E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.655 | TFLOPs: 19.82 | 31: iteration 47490/ 173500 | consumed samples: 12157440 | 
consumed tokens: 24898437120 | elapsed time per iteration (s): 0.74 | learning rate: 1.703E-04 | global batch size: 256 | lm loss: 2.065950E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.596 | TFLOPs: 20.91 | 31: iteration 47500/ 173500 | consumed samples: 12160000 | consumed tokens: 24903680000 | elapsed time per iteration (s): 0.78 | learning rate: 1.703E-04 | global batch size: 256 | lm loss: 2.075704E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.454 | TFLOPs: 19.81 | 31: iteration 47510/ 173500 | consumed samples: 12162560 | consumed tokens: 24908922880 | elapsed time per iteration (s): 0.72 | learning rate: 1.703E-04 | global batch size: 256 | lm loss: 2.064603E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.238 | TFLOPs: 21.43 | 31: iteration 47520/ 173500 | consumed samples: 12165120 | consumed tokens: 24914165760 | elapsed time per iteration (s): 0.74 | learning rate: 1.702E-04 | global batch size: 256 | lm loss: 2.066222E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.007 | TFLOPs: 20.99 | 31: iteration 47530/ 173500 | consumed samples: 12167680 | consumed tokens: 24919408640 | elapsed time per iteration (s): 0.75 | learning rate: 1.702E-04 | global batch size: 256 | lm loss: 2.071137E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.433 | TFLOPs: 20.78 | 31: iteration 47540/ 173500 | consumed samples: 12170240 | consumed tokens: 24924651520 | elapsed time per iteration (s): 0.75 | learning rate: 1.702E-04 | global batch size: 256 | lm loss: 2.088824E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 339.225 | TFLOPs: 20.52 | 31: iteration 47550/ 173500 | consumed samples: 12172800 | consumed tokens: 24929894400 | elapsed time per iteration (s): 0.80 | learning rate: 1.702E-04 | global batch size: 256 | lm loss: 2.077426E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.235 | TFLOPs: 19.43 | 31: iteration 47560/ 173500 | consumed samples: 12175360 | consumed tokens: 24935137280 | elapsed time per iteration (s): 0.78 | learning rate: 1.702E-04 | global batch size: 256 | lm loss: 2.093129E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.029 | TFLOPs: 19.97 | 31: iteration 47570/ 173500 | consumed samples: 12177920 | consumed tokens: 24940380160 | elapsed time per iteration (s): 0.79 | learning rate: 1.702E-04 | global batch size: 256 | lm loss: 2.068541E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.635 | TFLOPs: 19.58 | 31: iteration 47580/ 173500 | consumed samples: 12180480 | consumed tokens: 24945623040 | elapsed time per iteration (s): 0.80 | learning rate: 1.702E-04 | global batch size: 256 | lm loss: 2.064534E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.218 | TFLOPs: 19.25 | 31: iteration 47590/ 173500 | consumed samples: 12183040 | consumed tokens: 24950865920 | elapsed time per iteration (s): 0.81 | learning rate: 1.702E-04 | global batch size: 256 | lm loss: 2.095015E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.890 | TFLOPs: 19.23 | 31: iteration 47600/ 173500 | consumed samples: 12185600 | consumed tokens: 24956108800 | elapsed time per iteration (s): 0.81 | learning rate: 1.701E-04 | global batch size: 256 | lm loss: 
2.058745E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.790 | TFLOPs: 19.10 | 31: iteration 47610/ 173500 | consumed samples: 12188160 | consumed tokens: 24961351680 | elapsed time per iteration (s): 0.76 | learning rate: 1.701E-04 | global batch size: 256 | lm loss: 2.066729E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.739 | TFLOPs: 20.43 | 31: iteration 47620/ 173500 | consumed samples: 12190720 | consumed tokens: 24966594560 | elapsed time per iteration (s): 0.83 | learning rate: 1.701E-04 | global batch size: 256 | lm loss: 2.064074E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.595 | TFLOPs: 18.73 | 31: iteration 47630/ 173500 | consumed samples: 12193280 | consumed tokens: 24971837440 | elapsed time per iteration (s): 0.76 | learning rate: 1.701E-04 | global batch size: 256 | lm loss: 2.069955E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.055 | TFLOPs: 20.45 | 31: iteration 47640/ 173500 | consumed samples: 12195840 | consumed tokens: 24977080320 | elapsed time per iteration (s): 0.75 | learning rate: 1.701E-04 | global batch size: 256 | lm loss: 2.062846E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.884 | TFLOPs: 20.56 | 31: iteration 47650/ 173500 | consumed samples: 12198400 | consumed tokens: 24982323200 | elapsed time per iteration (s): 0.77 | learning rate: 1.701E-04 | global batch size: 256 | lm loss: 2.048918E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.821 | TFLOPs: 20.07 | 31: iteration 47660/ 173500 | consumed samples: 12200960 | consumed tokens: 
24987566080 | elapsed time per iteration (s): 0.81 | learning rate: 1.701E-04 | global batch size: 256 | lm loss: 2.093727E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.057 | TFLOPs: 19.12 | 31: iteration 47670/ 173500 | consumed samples: 12203520 | consumed tokens: 24992808960 | elapsed time per iteration (s): 0.78 | learning rate: 1.701E-04 | global batch size: 256 | lm loss: 2.113050E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.177 | TFLOPs: 19.91 | 31: iteration 47680/ 173500 | consumed samples: 12206080 | consumed tokens: 24998051840 | elapsed time per iteration (s): 0.78 | learning rate: 1.700E-04 | global batch size: 256 | lm loss: 2.062399E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.173 | TFLOPs: 19.85 | 31: iteration 47690/ 173500 | consumed samples: 12208640 | consumed tokens: 25003294720 | elapsed time per iteration (s): 0.76 | learning rate: 1.700E-04 | global batch size: 256 | lm loss: 2.068399E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.981 | TFLOPs: 20.45 | 31: iteration 47700/ 173500 | consumed samples: 12211200 | consumed tokens: 25008537600 | elapsed time per iteration (s): 0.77 | learning rate: 1.700E-04 | global batch size: 256 | lm loss: 2.072653E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.126 | TFLOPs: 20.15 | 31: iteration 47710/ 173500 | consumed samples: 12213760 | consumed tokens: 25013780480 | elapsed time per iteration (s): 0.84 | learning rate: 1.700E-04 | global batch size: 256 | lm loss: 2.061022E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 304.740 | TFLOPs: 18.44 | 31: iteration 47720/ 173500 | consumed samples: 12216320 | consumed tokens: 25019023360 | elapsed time per iteration (s): 0.77 | learning rate: 1.700E-04 | global batch size: 256 | lm loss: 2.084400E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.873 | TFLOPs: 20.14 | 31: iteration 47730/ 173500 | consumed samples: 12218880 | consumed tokens: 25024266240 | elapsed time per iteration (s): 0.75 | learning rate: 1.700E-04 | global batch size: 256 | lm loss: 2.044951E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.664 | TFLOPs: 20.61 | 31: iteration 47740/ 173500 | consumed samples: 12221440 | consumed tokens: 25029509120 | elapsed time per iteration (s): 0.80 | learning rate: 1.700E-04 | global batch size: 256 | lm loss: 2.047972E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.135 | TFLOPs: 19.25 | 31: iteration 47750/ 173500 | consumed samples: 12224000 | consumed tokens: 25034752000 | elapsed time per iteration (s): 0.76 | learning rate: 1.700E-04 | global batch size: 256 | lm loss: 2.054532E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.352 | TFLOPs: 20.41 | 31: iteration 47760/ 173500 | consumed samples: 12226560 | consumed tokens: 25039994880 | elapsed time per iteration (s): 0.74 | learning rate: 1.700E-04 | global batch size: 256 | lm loss: 2.044328E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.625 | TFLOPs: 20.91 | 31: iteration 47770/ 173500 | consumed samples: 12229120 | consumed tokens: 25045237760 | elapsed time per iteration (s): 0.75 | learning rate: 1.699E-04 | global batch size: 256 | lm loss: 2.080087E+00 | grad 
norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.367 | TFLOPs: 20.53 | 31: iteration 47780/ 173500 | consumed samples: 12231680 | consumed tokens: 25050480640 | elapsed time per iteration (s): 0.81 | learning rate: 1.699E-04 | global batch size: 256 | lm loss: 2.072147E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.616 | TFLOPs: 19.21 | 31: iteration 47790/ 173500 | consumed samples: 12234240 | consumed tokens: 25055723520 | elapsed time per iteration (s): 0.75 | learning rate: 1.699E-04 | global batch size: 256 | lm loss: 2.075634E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.028 | TFLOPs: 20.75 | 31: iteration 47800/ 173500 | consumed samples: 12236800 | consumed tokens: 25060966400 | elapsed time per iteration (s): 0.74 | learning rate: 1.699E-04 | global batch size: 256 | lm loss: 2.089823E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.328 | TFLOPs: 20.95 | 31: iteration 47810/ 173500 | consumed samples: 12239360 | consumed tokens: 25066209280 | elapsed time per iteration (s): 0.74 | learning rate: 1.699E-04 | global batch size: 256 | lm loss: 2.055129E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.417 | TFLOPs: 20.96 | 31: iteration 47820/ 173500 | consumed samples: 12241920 | consumed tokens: 25071452160 | elapsed time per iteration (s): 0.76 | learning rate: 1.699E-04 | global batch size: 256 | lm loss: 2.055924E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.432 | TFLOPs: 20.41 | 31: iteration 47830/ 173500 | consumed samples: 12244480 | consumed tokens: 25076695040 | elapsed time 
per iteration (s): 0.76 | learning rate: 1.699E-04 | global batch size: 256 | lm loss: 2.083895E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.483 | TFLOPs: 20.48 | 31: iteration 47840/ 173500 | consumed samples: 12247040 | consumed tokens: 25081937920 | elapsed time per iteration (s): 0.77 | learning rate: 1.699E-04 | global batch size: 256 | lm loss: 2.076089E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.026 | TFLOPs: 20.21 | 31: iteration 47850/ 173500 | consumed samples: 12249600 | consumed tokens: 25087180800 | elapsed time per iteration (s): 0.79 | learning rate: 1.698E-04 | global batch size: 256 | lm loss: 2.077203E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.143 | TFLOPs: 19.61 | 31: iteration 47860/ 173500 | consumed samples: 12252160 | consumed tokens: 25092423680 | elapsed time per iteration (s): 0.76 | learning rate: 1.698E-04 | global batch size: 256 | lm loss: 2.054169E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.980 | TFLOPs: 20.33 | 31: iteration 47870/ 173500 | consumed samples: 12254720 | consumed tokens: 25097666560 | elapsed time per iteration (s): 0.83 | learning rate: 1.698E-04 | global batch size: 256 | lm loss: 2.082913E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.960 | TFLOPs: 18.57 | 31: iteration 47880/ 173500 | consumed samples: 12257280 | consumed tokens: 25102909440 | elapsed time per iteration (s): 0.84 | learning rate: 1.698E-04 | global batch size: 256 | lm loss: 2.079740E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.198 | TFLOPs: 
18.34 | 31: iteration 47890/ 173500 | consumed samples: 12259840 | consumed tokens: 25108152320 | elapsed time per iteration (s): 0.84 | learning rate: 1.698E-04 | global batch size: 256 | lm loss: 2.078137E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.555 | TFLOPs: 18.42 | 31: iteration 47900/ 173500 | consumed samples: 12262400 | consumed tokens: 25113395200 | elapsed time per iteration (s): 0.81 | learning rate: 1.698E-04 | global batch size: 256 | lm loss: 2.059609E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.110 | TFLOPs: 19.06 | 31: iteration 47910/ 173500 | consumed samples: 12264960 | consumed tokens: 25118638080 | elapsed time per iteration (s): 0.86 | learning rate: 1.698E-04 | global batch size: 256 | lm loss: 2.090815E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.187 | TFLOPs: 17.98 | 31: iteration 47920/ 173500 | consumed samples: 12267520 | consumed tokens: 25123880960 | elapsed time per iteration (s): 0.84 | learning rate: 1.698E-04 | global batch size: 256 | lm loss: 2.063468E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.603 | TFLOPs: 18.37 | 31: iteration 47930/ 173500 | consumed samples: 12270080 | consumed tokens: 25129123840 | elapsed time per iteration (s): 0.84 | learning rate: 1.697E-04 | global batch size: 256 | lm loss: 2.068803E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.783 | TFLOPs: 18.38 | 31: iteration 47940/ 173500 | consumed samples: 12272640 | consumed tokens: 25134366720 | elapsed time per iteration (s): 0.83 | learning rate: 1.697E-04 | global batch size: 256 | lm loss: 2.088497E+00 | grad norm: 0.126 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.391 | TFLOPs: 18.60 | 31: iteration 47950/ 173500 | consumed samples: 12275200 | consumed tokens: 25139609600 | elapsed time per iteration (s): 0.80 | learning rate: 1.697E-04 | global batch size: 256 | lm loss: 2.073300E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.555 | TFLOPs: 19.39 | 31: iteration 47960/ 173500 | consumed samples: 12277760 | consumed tokens: 25144852480 | elapsed time per iteration (s): 0.86 | learning rate: 1.697E-04 | global batch size: 256 | lm loss: 2.062751E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.698 | TFLOPs: 17.95 | 31: iteration 47970/ 173500 | consumed samples: 12280320 | consumed tokens: 25150095360 | elapsed time per iteration (s): 0.83 | learning rate: 1.697E-04 | global batch size: 256 | lm loss: 2.081687E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.421 | TFLOPs: 18.60 | 31: iteration 47980/ 173500 | consumed samples: 12282880 | consumed tokens: 25155338240 | elapsed time per iteration (s): 0.82 | learning rate: 1.697E-04 | global batch size: 256 | lm loss: 2.088519E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.439 | TFLOPs: 18.78 | 31: iteration 47990/ 173500 | consumed samples: 12285440 | consumed tokens: 25160581120 | elapsed time per iteration (s): 0.78 | learning rate: 1.697E-04 | global batch size: 256 | lm loss: 2.080344E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.298 | TFLOPs: 19.74 | 0: [2022-11-26 04:52:29,862] [INFO] [logging.py:68:log_dist] [Rank 0] step=48000, skipped=0, lr=[0.00016965587057872074, 
0.00016965587057872074, 0.00016965587057872074], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
31: iteration 48000/ 173500 | consumed samples: 12288000 | consumed tokens: 25165824000 | elapsed time per iteration (s): 0.78 | learning rate: 1.697E-04 | global batch size: 256 | lm loss: 2.063472E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.619 | TFLOPs: 19.94 |
0: steps: 48000 loss: 2.0923 iter time (s): 0.783 samples/sec: 327.029
31: -------------------------------------------------------------------------------------------
31: valid loss at iteration 48000 | lm loss value: 2.083835E+00 | lm loss PPL: 8.035228E+00 |
31: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 48000 to checkpoints_1b1long
0: [2022-11-26 04:52:30,114] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step48000 is begin to save!
0: [2022-11-26 04:52:30,249] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_01-model_00-model_states.pt...
0: [2022-11-26 04:52:30,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_01-model_00-model_states.pt.
0: [2022-11-26 04:52:30,454] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_03-model_00-model_states.pt...
0: [2022-11-26 04:52:30,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_03-model_00-model_states.pt.
0: [2022-11-26 04:52:30,536] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_04-model_00-model_states.pt...
0: [2022-11-26 04:52:30,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_04-model_00-model_states.pt.
0: [2022-11-26 04:52:30,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_05-model_00-model_states.pt... 0: [2022-11-26 04:52:30,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_05-model_00-model_states.pt. 0: [2022-11-26 04:52:30,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_06-model_00-model_states.pt... 0: [2022-11-26 04:52:30,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_06-model_00-model_states.pt. 0: [2022-11-26 04:52:30,771] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_07-model_00-model_states.pt... 0: [2022-11-26 04:52:30,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_07-model_00-model_states.pt. 0: [2022-11-26 04:52:30,847] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_08-model_00-model_states.pt... 0: [2022-11-26 04:52:30,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_08-model_00-model_states.pt. 0: [2022-11-26 04:52:30,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_09-model_00-model_states.pt... 0: [2022-11-26 04:52:31,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_09-model_00-model_states.pt. 0: [2022-11-26 04:52:31,008] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_10-model_00-model_states.pt... 0: [2022-11-26 04:52:31,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_10-model_00-model_states.pt. 
0: [2022-11-26 04:52:31,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_11-model_00-model_states.pt... 0: [2022-11-26 04:52:31,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_11-model_00-model_states.pt. 0: [2022-11-26 04:52:31,165] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_12-model_00-model_states.pt... 0: [2022-11-26 04:52:31,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_12-model_00-model_states.pt. 0: [2022-11-26 04:52:31,243] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_13-model_00-model_states.pt... 0: [2022-11-26 04:52:31,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_13-model_00-model_states.pt. 0: [2022-11-26 04:52:31,318] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_14-model_00-model_states.pt... 0: [2022-11-26 04:52:31,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_14-model_00-model_states.pt. 0: [2022-11-26 04:52:31,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_15-model_00-model_states.pt... 0: [2022-11-26 04:52:31,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_15-model_00-model_states.pt. 0: [2022-11-26 04:52:31,471] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_16-model_00-model_states.pt... 0: [2022-11-26 04:52:31,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_16-model_00-model_states.pt. 
0: [2022-11-26 04:52:31,552] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_17-model_00-model_states.pt... 0: [2022-11-26 04:52:31,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_17-model_00-model_states.pt. 0: [2022-11-26 04:52:31,625] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_18-model_00-model_states.pt... 0: [2022-11-26 04:52:31,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_18-model_00-model_states.pt. 0: [2022-11-26 04:52:31,703] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_19-model_00-model_states.pt... 0: [2022-11-26 04:52:31,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_19-model_00-model_states.pt. 0: [2022-11-26 04:52:31,781] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_20-model_00-model_states.pt... 0: [2022-11-26 04:52:31,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_20-model_00-model_states.pt. 0: [2022-11-26 04:52:31,858] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_21-model_00-model_states.pt... 0: [2022-11-26 04:52:31,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_21-model_00-model_states.pt. 0: [2022-11-26 04:52:31,935] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_22-model_00-model_states.pt... 0: [2022-11-26 04:52:32,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_22-model_00-model_states.pt. 
0: [2022-11-26 04:52:32,014] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_23-model_00-model_states.pt... 0: [2022-11-26 04:52:32,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_23-model_00-model_states.pt. 0: [2022-11-26 04:52:32,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_24-model_00-model_states.pt... 0: [2022-11-26 04:52:32,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_24-model_00-model_states.pt. 0: [2022-11-26 04:52:32,168] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_25-model_00-model_states.pt... 0: [2022-11-26 04:52:32,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_25-model_00-model_states.pt. 0: [2022-11-26 04:52:32,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_26-model_00-model_states.pt... 0: [2022-11-26 04:52:32,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_26-model_00-model_states.pt. 0: [2022-11-26 04:52:32,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_27-model_00-model_states.pt... 0: [2022-11-26 04:52:32,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_27-model_00-model_states.pt. 0: [2022-11-26 04:52:32,401] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_28-model_00-model_states.pt... 0: [2022-11-26 04:52:32,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_28-model_00-model_states.pt. 
0: [2022-11-26 04:52:32,479] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/layer_30-model_00-model_states.pt... 0: [2022-11-26 04:52:32,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/layer_30-model_00-model_states.pt. 0: [2022-11-26 04:52:32,482] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step48000/mp_rank_00_model_states.pt 0: [2022-11-26 04:52:32,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/mp_rank_00_model_states.pt... 0: [2022-11-26 04:52:32,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/mp_rank_00_model_states.pt. 0: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
7: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
2: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 
23: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 
29: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 
0: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 
4: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
10: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 
13: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
12: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 
23: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 
24: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 
31: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 
30: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 
18: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 
27: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
4: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 
16: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 
25: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 
22: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 
26: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
1: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 22: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 
21: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 19: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 
21: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 30: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 27: [2022-11-26 04:52:32,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:52:32,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:52:32,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 04:52:32,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
16: [2022-11-26 04:52:32,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 04:52:32,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 04:52:32,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 24: [2022-11-26 04:52:32,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 04:52:32,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 04:52:32,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 21: [2022-11-26 04:52:32,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 04:52:32,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 04:52:32,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 6: [2022-11-26 04:52:32,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:52:32,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 04:52:32,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
3: [2022-11-26 04:52:32,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:52:32,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 04:52:32,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 24: [2022-11-26 04:52:32,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 26: [2022-11-26 04:52:32,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 24: [2022-11-26 04:52:32,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 26: [2022-11-26 04:52:32,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 24: [2022-11-26 04:52:32,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 26: [2022-11-26 04:52:32,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 12: [2022-11-26 04:52:32,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 04:52:32,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 27: [2022-11-26 04:52:32,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:52:32,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 18: [2022-11-26 04:52:32,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:52:32,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 27: [2022-11-26 04:52:32,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 04:52:32,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 18: [2022-11-26 04:52:32,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 13: [2022-11-26 04:52:32,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:52:32,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 04:52:32,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 26: [2022-11-26 04:52:32,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 
26: [2022-11-26 04:52:32,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 04:52:32,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 20: [2022-11-26 04:52:32,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 04:52:32,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 04:52:32,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 19: [2022-11-26 04:52:32,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:52:32,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 04:52:32,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 7: [2022-11-26 04:52:32,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:52:32,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 04:52:32,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 15: [2022-11-26 04:52:32,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
31: [2022-11-26 04:52:32,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:52:32,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 31: [2022-11-26 04:52:32,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 6: [2022-11-26 04:52:32,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:52:32,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 31: [2022-11-26 04:52:32,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 6: [2022-11-26 04:52:32,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 04:52:32,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 31: [2022-11-26 04:52:32,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 04:52:32,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 04:52:32,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 9: [2022-11-26 04:52:32,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-26 04:52:32,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 04:52:32,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 30: [2022-11-26 04:52:32,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 04:52:32,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 17: [2022-11-26 04:52:32,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 30: [2022-11-26 04:52:32,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 8: [2022-11-26 04:52:32,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:52:32,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:52:32,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 04:52:32,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 21: [2022-11-26 04:52:32,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:52:32,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
8: [2022-11-26 04:52:32,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 22: [2022-11-26 04:52:32,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 17: [2022-11-26 04:52:32,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 21: [2022-11-26 04:52:32,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 22: [2022-11-26 04:52:32,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 17: [2022-11-26 04:52:32,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 21: [2022-11-26 04:52:32,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 22: [2022-11-26 04:52:32,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 22: [2022-11-26 04:52:32,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 04:52:32,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 04:52:32,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 17: [2022-11-26 04:52:32,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 
17: [2022-11-26 04:52:32,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 04:52:32,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 18: [2022-11-26 04:52:32,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:52:32,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 04:52:32,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 6: [2022-11-26 04:52:32,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:52:32,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 04:52:32,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 4: [2022-11-26 04:52:32,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:52:32,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
4: [2022-11-26 04:52:32,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 3: [2022-11-26 04:52:32,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 4: [2022-11-26 04:52:32,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 3: [2022-11-26 04:52:32,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 2: [2022-11-26 04:52:32,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:52:32,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:52:32,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 04:52:32,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 04:52:32,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 2: [2022-11-26 04:52:32,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 4: [2022-11-26 04:52:32,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 04:52:32,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 04:52:32,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 27: [2022-11-26 04:52:32,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:52:32,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 27: [2022-11-26 04:52:32,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 15: [2022-11-26 04:52:32,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 27: [2022-11-26 04:52:32,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 16: [2022-11-26 04:52:32,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:52:32,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 20: [2022-11-26 04:52:32,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
16: [2022-11-26 04:52:32,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 20: [2022-11-26 04:52:32,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 16: [2022-11-26 04:52:32,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 20: [2022-11-26 04:52:32,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 13: [2022-11-26 04:52:32,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:52:32,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 04:52:32,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 7: [2022-11-26 04:52:32,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:52:32,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 5: [2022-11-26 04:52:32,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:52:32,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
5: [2022-11-26 04:52:32,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 04:52:32,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 30: [2022-11-26 04:52:32,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 04:52:32,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 04:52:32,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 9: [2022-11-26 04:52:32,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:52:32,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 04:52:32,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 31: [2022-11-26 04:52:32,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 04:52:32,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 21: [2022-11-26 04:52:32,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
16: [2022-11-26 04:52:32,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 31: [2022-11-26 04:52:32,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 21: [2022-11-26 04:52:32,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 04:52:32,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 16: [2022-11-26 04:52:32,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 04:52:32,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 17: [2022-11-26 04:52:32,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 04:52:32,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 04:52:32,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 26: [2022-11-26 04:52:32,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 04:52:32,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 04:52:32,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
6: [2022-11-26 04:52:32,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:52:32,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:52:32,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 18: [2022-11-26 04:52:32,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 19: [2022-11-26 04:52:32,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:52:32,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 18: [2022-11-26 04:52:32,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 19: [2022-11-26 04:52:32,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 4: [2022-11-26 04:52:32,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:52:32,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 4: [2022-11-26 04:52:32,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 04:52:32,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
28: [2022-11-26 04:52:32,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 04:52:32,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 04:52:32,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 04:52:32,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 04:52:32,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 28: [2022-11-26 04:52:32,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 29: [2022-11-26 04:52:32,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:52:32,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:52:32,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-26 04:52:32,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 04:52:32,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 04:52:32,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 04:52:32,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 29: [2022-11-26 04:52:32,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 29: [2022-11-26 04:52:32,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 23: [2022-11-26 04:52:32,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:52:32,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:52:32,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 04:52:32,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 04:52:32,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 23: [2022-11-26 04:52:32,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
25: [2022-11-26 04:52:32,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:52:32,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 25: [2022-11-26 04:52:32,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 12: [2022-11-26 04:52:32,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 25: [2022-11-26 04:52:32,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 12: [2022-11-26 04:52:32,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 5: [2022-11-26 04:52:32,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:52:32,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:52:32,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:52:32,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 12: [2022-11-26 04:52:32,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 8: [2022-11-26 04:52:32,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
12: [2022-11-26 04:52:32,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 19: [2022-11-26 04:52:32,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:52:32,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 04:52:32,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 5: [2022-11-26 04:52:32,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 24: [2022-11-26 04:52:32,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:52:32,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 24: [2022-11-26 04:52:32,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 5: [2022-11-26 04:52:32,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 24: [2022-11-26 04:52:32,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 5: [2022-11-26 04:52:32,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 04:52:32,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
8: [2022-11-26 04:52:32,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:52:32,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 27: [2022-11-26 04:52:32,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:52:32,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 27: [2022-11-26 04:52:32,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 04:52:32,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 2: [2022-11-26 04:52:32,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:52:32,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 25: [2022-11-26 04:52:32,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:52:32,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 3: [2022-11-26 04:52:32,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
25: [2022-11-26 04:52:32,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 3: [2022-11-26 04:52:32,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 25: [2022-11-26 04:52:32,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 3: [2022-11-26 04:52:32,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 25: [2022-11-26 04:52:32,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 04:52:32,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 04:52:32,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 13: [2022-11-26 04:52:32,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:52:32,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 04:52:32,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 0: [2022-11-26 04:52:32,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:52:32,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 04:52:32,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 04:52:32,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 04:52:32,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 0: [2022-11-26 04:52:32,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 0: [2022-11-26 04:52:32,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:52:32,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 04:52:32,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 22: [2022-11-26 04:52:32,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 04:52:32,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 04:52:32,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 22: [2022-11-26 04:52:32,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
22: [2022-11-26 04:52:32,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 04:52:32,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 24: [2022-11-26 04:52:32,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 04:52:32,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 04:52:32,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 26: [2022-11-26 04:52:32,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 04:52:32,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 11: [2022-11-26 04:52:32,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 26: [2022-11-26 04:52:32,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 30: [2022-11-26 04:52:32,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:52:32,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 04:52:32,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
30: [2022-11-26 04:52:32,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 11: [2022-11-26 04:52:32,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 30: [2022-11-26 04:52:32,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 11: [2022-11-26 04:52:32,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 04:52:32,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 31: [2022-11-26 04:52:32,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:52:32,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:52:32,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 31: [2022-11-26 04:52:32,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 11: [2022-11-26 04:52:32,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 31: [2022-11-26 04:52:32,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 20: [2022-11-26 04:52:32,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-26 04:52:32,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 04:52:32,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 19: [2022-11-26 04:52:32,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:52:32,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 04:52:32,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 10: [2022-11-26 04:52:32,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:52:32,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:52:32,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:52:32,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 04:52:32,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 04:52:32,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 04:52:32,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 04:52:32,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 9: [2022-11-26 04:52:32,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:52:32,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 9: [2022-11-26 04:52:32,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 10: [2022-11-26 04:52:32,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 10: [2022-11-26 04:52:32,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 9: [2022-11-26 04:52:32,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 10: [2022-11-26 04:52:32,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 27: [2022-11-26 04:52:32,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
27: [2022-11-26 04:52:32,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 04:52:32,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 3: [2022-11-26 04:52:32,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:52:32,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 04:52:32,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 7: [2022-11-26 04:52:32,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:52:32,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 04:52:32,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 30: [2022-11-26 04:52:32,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:52:32,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-26 04:52:32,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 30: [2022-11-26 04:52:32,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 13: [2022-11-26 04:52:32,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 30: [2022-11-26 04:52:32,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 18: [2022-11-26 04:52:32,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:52:32,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 04:52:32,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 21: [2022-11-26 04:52:32,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 04:52:32,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 04:52:32,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 11: [2022-11-26 04:52:32,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-26 04:52:32,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 04:52:32,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 28: [2022-11-26 04:52:32,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:52:32,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:52:32,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 23: [2022-11-26 04:52:32,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:52:32,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 15: [2022-11-26 04:52:32,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 15: [2022-11-26 04:52:32,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:52:32,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 15: [2022-11-26 04:52:32,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 04:52:32,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
16: [2022-11-26 04:52:32,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 04:52:32,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 04:52:32,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 4: [2022-11-26 04:52:32,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:52:32,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 04:52:32,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 6: [2022-11-26 04:52:32,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:52:32,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 04:52:32,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 25: [2022-11-26 04:52:32,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 04:52:32,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 04:52:32,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
20: [2022-11-26 04:52:32,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 04:52:32,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 04:52:32,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 24: [2022-11-26 04:52:32,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:52:32,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:52:32,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 04:52:32,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 24: [2022-11-26 04:52:32,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 04:52:32,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 29: [2022-11-26 04:52:32,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
29: [2022-11-26 04:52:32,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 14: [2022-11-26 04:52:32,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:52:32,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:52:32,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:52:32,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 14: [2022-11-26 04:52:32,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 04:52:32,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 04:52:32,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 04:52:32,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 14: [2022-11-26 04:52:32,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 14: [2022-11-26 04:52:32,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
28: [2022-11-26 04:52:32,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 04:52:32,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 10: [2022-11-26 04:52:32,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:52:32,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 04:52:32,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 0: [2022-11-26 04:52:32,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:52:32,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 04:52:32,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 29: [2022-11-26 04:52:32,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:52:32,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 04:52:32,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 2: [2022-11-26 04:52:32,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 04:52:32,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 04:52:32,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 17: [2022-11-26 04:52:32,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 04:52:32,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 04:52:32,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 9: [2022-11-26 04:52:32,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:52:32,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 04:52:32,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 17: [2022-11-26 04:52:32,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 04:52:32,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 04:52:32,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 20: [2022-11-26 04:52:32,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
20: [2022-11-26 04:52:32,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 04:52:32,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 0: [2022-11-26 04:52:32,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:52:32,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:52:32,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 04:52:32,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 9: [2022-11-26 04:52:32,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:52:32,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 04:52:32,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 0: [2022-11-26 04:52:32,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 04:52:32,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 12: [2022-11-26 04:52:32,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-26 04:52:32,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 04:52:32,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 5: [2022-11-26 04:52:32,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:52:32,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 04:52:32,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 8: [2022-11-26 04:52:32,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:52:32,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 04:52:32,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 28: [2022-11-26 04:52:32,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 04:52:32,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 04:52:32,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 22: [2022-11-26 04:52:32,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
22: [2022-11-26 04:52:32,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 04:52:32,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 21: [2022-11-26 04:52:32,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 04:52:32,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 04:52:32,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 1: [2022-11-26 04:52:32,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:52:32,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:52:32,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:52:32,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:52:32,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 04:52:32,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 04:52:32,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 04:52:32,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 04:52:32,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 04:52:32,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 04:52:32,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 1: [2022-11-26 04:52:32,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 1: [2022-11-26 04:52:32,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 1: [2022-11-26 04:52:32,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 1: [2022-11-26 04:52:32,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 7: [2022-11-26 04:52:32,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-26 04:52:32,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 04:52:32,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 26: [2022-11-26 04:52:32,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 04:52:32,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 04:52:32,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 19: [2022-11-26 04:52:32,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:52:32,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 04:52:32,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 23: [2022-11-26 04:52:32,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:52:32,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 04:52:32,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 27: [2022-11-26 04:52:32,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
27: [2022-11-26 04:52:32,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 04:52:32,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 15: [2022-11-26 04:52:32,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:52:32,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 04:52:32,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 30: [2022-11-26 04:52:32,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 04:52:32,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 04:52:32,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 16: [2022-11-26 04:52:32,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 04:52:32,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 04:52:32,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 3: [2022-11-26 04:52:32,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 04:52:32,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 04:52:32,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 18: [2022-11-26 04:52:32,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:52:32,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 04:52:32,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 13: [2022-11-26 04:52:32,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:52:32,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 04:52:32,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 11: [2022-11-26 04:52:32,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:52:32,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 04:52:32,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 31: [2022-11-26 04:52:32,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 
31: [2022-11-26 04:52:32,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 04:52:32,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 14: [2022-11-26 04:52:32,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:52:32,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 04:52:32,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 25: [2022-11-26 04:52:32,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 04:52:32,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 04:52:32,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 6: [2022-11-26 04:52:32,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:52:32,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 04:52:32,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 24: [2022-11-26 04:52:32,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
24: [2022-11-26 04:52:32,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 04:52:32,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 29: [2022-11-26 04:52:32,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:52:32,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:52:32,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 17: [2022-11-26 04:52:32,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:52:32,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 9: [2022-11-26 04:52:32,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 04:52:32,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 17: [2022-11-26 04:52:32,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 04:52:32,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 4: [2022-11-26 04:52:32,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 04:52:32,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 04:52:32,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 21: [2022-11-26 04:52:32,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 04:52:32,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 04:52:32,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 2: [2022-11-26 04:52:32,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:52:32,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 04:52:32,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 5: [2022-11-26 04:52:32,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:52:32,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 04:52:32,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 28: [2022-11-26 04:52:32,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
28: [2022-11-26 04:52:32,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 04:52:32,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 8: [2022-11-26 04:52:32,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:52:32,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 04:52:32,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 0: [2022-11-26 04:52:32,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:52:32,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 04:52:32,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 12: [2022-11-26 04:52:32,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:52:32,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 04:52:32,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 26: [2022-11-26 04:52:32,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-26 04:52:32,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 7: [2022-11-26 04:52:32,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:52:32,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 26: [2022-11-26 04:52:32,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 7: [2022-11-26 04:52:32,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 1: [2022-11-26 04:52:32,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:52:32,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 04:52:32,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 22: [2022-11-26 04:52:32,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 04:52:32,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 04:52:32,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 15: [2022-11-26 04:52:32,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-26 04:52:32,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 04:52:32,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 20: [2022-11-26 04:52:32,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 04:52:32,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 04:52:32,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 23: [2022-11-26 04:52:32,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:52:32,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 04:52:32,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 27: [2022-11-26 04:52:32,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 04:52:32,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 04:52:32,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 19: [2022-11-26 04:52:32,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-26 04:52:32,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 04:52:32,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 16: [2022-11-26 04:52:32,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 04:52:32,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 04:52:32,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 31: [2022-11-26 04:52:32,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 04:52:32,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 04:52:32,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 30: [2022-11-26 04:52:32,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 04:52:32,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 04:52:32,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 11: [2022-11-26 04:52:32,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 04:52:32,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 04:52:32,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 10: [2022-11-26 04:52:32,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:52:32,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 04:52:32,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 3: [2022-11-26 04:52:32,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:52:32,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 04:52:32,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 13: [2022-11-26 04:52:32,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:52:32,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 04:52:32,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 25: [2022-11-26 04:52:32,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
25: [2022-11-26 04:52:32,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 18: [2022-11-26 04:52:32,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:52:32,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 25: [2022-11-26 04:52:32,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 18: [2022-11-26 04:52:32,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 14: [2022-11-26 04:52:32,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:52:32,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 04:52:32,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 17: [2022-11-26 04:52:32,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:52:32,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:52:32,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 04:52:32,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
17: [2022-11-26 04:52:32,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 04:52:32,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 5: [2022-11-26 04:52:32,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:52:32,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 04:52:32,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 9: [2022-11-26 04:52:32,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:52:32,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 04:52:32,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 6: [2022-11-26 04:52:32,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:52:32,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 04:52:32,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 24: [2022-11-26 04:52:32,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
1: [2022-11-26 04:52:32,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 24: [2022-11-26 04:52:32,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 04:52:32,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 1: [2022-11-26 04:52:32,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 04:52:32,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 29: [2022-11-26 04:52:32,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:52:32,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 04:52:32,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 4: [2022-11-26 04:52:32,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:52:32,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 04:52:32,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 21: [2022-11-26 04:52:32,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
21: [2022-11-26 04:52:32,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 04:52:32,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 28: [2022-11-26 04:52:32,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 04:52:32,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 7: [2022-11-26 04:52:32,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:52:32,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 04:52:32,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 28: [2022-11-26 04:52:32,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 8: [2022-11-26 04:52:32,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:52:32,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 12: [2022-11-26 04:52:32,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:52:32,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
12: [2022-11-26 04:52:32,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 04:52:32,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 0: [2022-11-26 04:52:32,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:52:32,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 04:52:32,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 15: [2022-11-26 04:52:32,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:52:32,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 04:52:32,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 20: [2022-11-26 04:52:32,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 04:52:32,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 04:52:32,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 22: [2022-11-26 04:52:32,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
22: [2022-11-26 04:52:32,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 04:52:32,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 19: [2022-11-26 04:52:32,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:52:32,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 04:52:32,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 27: [2022-11-26 04:52:32,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 04:52:32,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 04:52:32,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 2: [2022-11-26 04:52:32,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:52:32,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 04:52:32,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 31: [2022-11-26 04:52:32,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 
31: [2022-11-26 04:52:32,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 04:52:32,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 26: [2022-11-26 04:52:32,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 04:52:32,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 04:52:32,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 18: [2022-11-26 04:52:32,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 04:52:32,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 04:52:32,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 16: [2022-11-26 04:52:32,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 04:52:32,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 04:52:32,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 17: [2022-11-26 04:52:32,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-26 04:52:32,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 04:52:32,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 5: [2022-11-26 04:52:32,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:52:32,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 04:52:32,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 0: [2022-11-26 04:52:32,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:52:32,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 04:52:32,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 12: [2022-11-26 04:52:32,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:52:32,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 04:52:32,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 6: [2022-11-26 04:52:32,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 04:52:32,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 04:52:32,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 21: [2022-11-26 04:52:32,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:52:32,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 21: [2022-11-26 04:52:32,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 14: [2022-11-26 04:52:32,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 04:52:32,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 21: [2022-11-26 04:52:32,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 23: [2022-11-26 04:52:32,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:52:32,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 04:52:32,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 24: [2022-11-26 04:52:32,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
24: [2022-11-26 04:52:32,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 04:52:32,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 1: [2022-11-26 04:52:32,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:52:32,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 04:52:32,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 29: [2022-11-26 04:52:32,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:52:32,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 29: [2022-11-26 04:52:32,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 13: [2022-11-26 04:52:32,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 29: [2022-11-26 04:52:32,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 13: [2022-11-26 04:52:32,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 4: [2022-11-26 04:52:32,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 04:52:32,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 04:52:32,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 9: [2022-11-26 04:52:32,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:52:32,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:52:32,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 04:52:32,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 9: [2022-11-26 04:52:32,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 04:52:32,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 10: [2022-11-26 04:52:32,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:52:32,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 04:52:32,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 3: [2022-11-26 04:52:32,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 04:52:32,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 04:52:32,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 30: [2022-11-26 04:52:32,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 04:52:32,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 04:52:32,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 25: [2022-11-26 04:52:32,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 04:52:32,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 04:52:32,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 2: [2022-11-26 04:52:32,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:52:32,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 04:52:32,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 28: [2022-11-26 04:52:32,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
7: [2022-11-26 04:52:32,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:52:32,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 04:52:32,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 20: [2022-11-26 04:52:32,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 04:52:32,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 04:52:32,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 8: [2022-11-26 04:52:32,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:52:32,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 04:52:32,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 13: [2022-11-26 04:52:32,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:52:32,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 04:52:32,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
16: [2022-11-26 04:52:32,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 04:52:32,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 04:52:32,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 22: [2022-11-26 04:52:32,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 04:52:32,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 04:52:32,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 23: [2022-11-26 04:52:32,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:52:32,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 04:52:32,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 27: [2022-11-26 04:52:32,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 04:52:32,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 04:52:32,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
31: [2022-11-26 04:52:32,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 04:52:32,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 04:52:32,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 25: [2022-11-26 04:52:32,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:52:32,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 25: [2022-11-26 04:52:32,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 04:52:32,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 15: [2022-11-26 04:52:32,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 04:52:32,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 18: [2022-11-26 04:52:32,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 19: [2022-11-26 04:52:32,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
18: [2022-11-26 04:52:32,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 19: [2022-11-26 04:52:32,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 04:52:32,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 18: [2022-11-26 04:52:32,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 7: [2022-11-26 04:52:32,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:52:32,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 28: [2022-11-26 04:52:32,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 7: [2022-11-26 04:52:32,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 28: [2022-11-26 04:52:32,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 11: [2022-11-26 04:52:32,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:52:32,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 04:52:32,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
26: [2022-11-26 04:52:32,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 04:52:32,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 04:52:32,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 23: [2022-11-26 04:52:32,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 04:52:32,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 04:52:32,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 3: [2022-11-26 04:52:32,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:52:32,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 04:52:32,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 30: [2022-11-26 04:52:32,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 04:52:32,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 04:52:32,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
2: [2022-11-26 04:52:32,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:52:32,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 04:52:32,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 14: [2022-11-26 04:52:32,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:52:32,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 04:52:32,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 14: [2022-11-26 04:52:32,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:52:32,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 04:52:32,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 28: [2022-11-26 04:52:32,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 04:52:32,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step48000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 04:52:32,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
0: successfully saved checkpoint at iteration 48000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2682.58 31: iteration 48010/ 173500 | consumed samples: 12290560 | consumed tokens: 25171066880 | elapsed time per iteration (s): 1.08 | learning rate: 1.696E-04 | global batch size: 256 | lm loss: 2.071044E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.528 | TFLOPs: 14.31 | 31: iteration 48020/ 173500 | consumed samples: 12293120 | consumed tokens: 25176309760 | elapsed time per iteration (s): 0.81 | learning rate: 1.696E-04 | global batch size: 256 | lm loss: 2.116685E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.668 | TFLOPs: 19.16 | 31: iteration 48030/ 173500 | consumed samples: 12295680 | consumed tokens: 25181552640 | elapsed time per iteration (s): 0.85 | learning rate: 1.696E-04 | global batch size: 256 | lm loss: 2.071684E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.666 | TFLOPs: 18.19 | 31: iteration 48040/ 173500 | consumed samples: 12298240 | consumed tokens: 25186795520 | elapsed time per iteration (s): 0.79 | learning rate: 1.696E-04 | global batch size: 256 | lm loss: 2.074455E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.380 | TFLOPs: 19.56 | 31: iteration 48050/ 173500 | consumed samples: 12300800 | consumed tokens: 25192038400 | elapsed time per iteration (s): 0.81 | learning rate: 1.696E-04 | global batch size: 256 | lm loss: 2.058040E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.755 | TFLOPs: 19.16 | 31: iteration 48060/ 173500 | consumed samples: 12303360 | consumed tokens: 25197281280 | elapsed time per iteration (s): 0.74 | 
learning rate: 1.696E-04 | global batch size: 256 | lm loss: 2.100296E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.515 | TFLOPs: 20.90 | 31: iteration 48070/ 173500 | consumed samples: 12305920 | consumed tokens: 25202524160 | elapsed time per iteration (s): 0.81 | learning rate: 1.696E-04 | global batch size: 256 | lm loss: 2.057147E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.252 | TFLOPs: 19.07 | 31: iteration 48080/ 173500 | consumed samples: 12308480 | consumed tokens: 25207767040 | elapsed time per iteration (s): 0.80 | learning rate: 1.696E-04 | global batch size: 256 | lm loss: 2.084706E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.982 | TFLOPs: 19.42 | 31: iteration 48090/ 173500 | consumed samples: 12311040 | consumed tokens: 25213009920 | elapsed time per iteration (s): 0.81 | learning rate: 1.695E-04 | global batch size: 256 | lm loss: 2.060592E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.424 | TFLOPs: 19.20 | 31: iteration 48100/ 173500 | consumed samples: 12313600 | consumed tokens: 25218252800 | elapsed time per iteration (s): 0.80 | learning rate: 1.695E-04 | global batch size: 256 | lm loss: 2.061811E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.799 | TFLOPs: 19.29 | 31: iteration 48110/ 173500 | consumed samples: 12316160 | consumed tokens: 25223495680 | elapsed time per iteration (s): 0.79 | learning rate: 1.695E-04 | global batch size: 256 | lm loss: 2.062054E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.444 | TFLOPs: 19.69 | 31: iteration 48120/ 
173500 | consumed samples: 12318720 | consumed tokens: 25228738560 | elapsed time per iteration (s): 0.80 | learning rate: 1.695E-04 | global batch size: 256 | lm loss: 2.050064E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.246 | TFLOPs: 19.43 | 31: iteration 48130/ 173500 | consumed samples: 12321280 | consumed tokens: 25233981440 | elapsed time per iteration (s): 0.83 | learning rate: 1.695E-04 | global batch size: 256 | lm loss: 2.063740E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.539 | TFLOPs: 18.67 | 31: iteration 48140/ 173500 | consumed samples: 12323840 | consumed tokens: 25239224320 | elapsed time per iteration (s): 0.80 | learning rate: 1.695E-04 | global batch size: 256 | lm loss: 2.075054E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.106 | TFLOPs: 19.37 | 31: iteration 48150/ 173500 | consumed samples: 12326400 | consumed tokens: 25244467200 | elapsed time per iteration (s): 0.82 | learning rate: 1.695E-04 | global batch size: 256 | lm loss: 2.063986E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.937 | TFLOPs: 18.99 | 31: iteration 48160/ 173500 | consumed samples: 12328960 | consumed tokens: 25249710080 | elapsed time per iteration (s): 0.81 | learning rate: 1.695E-04 | global batch size: 256 | lm loss: 2.076817E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.268 | TFLOPs: 19.19 | 31: iteration 48170/ 173500 | consumed samples: 12331520 | consumed tokens: 25254952960 | elapsed time per iteration (s): 0.80 | learning rate: 1.694E-04 | global batch size: 256 | lm loss: 2.100826E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 319.846 | TFLOPs: 19.35 | 31: iteration 48180/ 173500 | consumed samples: 12334080 | consumed tokens: 25260195840 | elapsed time per iteration (s): 0.81 | learning rate: 1.694E-04 | global batch size: 256 | lm loss: 2.049703E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.878 | TFLOPs: 19.05 | 31: iteration 48190/ 173500 | consumed samples: 12336640 | consumed tokens: 25265438720 | elapsed time per iteration (s): 0.82 | learning rate: 1.694E-04 | global batch size: 256 | lm loss: 2.076468E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.566 | TFLOPs: 18.91 | 31: iteration 48200/ 173500 | consumed samples: 12339200 | consumed tokens: 25270681600 | elapsed time per iteration (s): 0.81 | learning rate: 1.694E-04 | global batch size: 256 | lm loss: 2.054628E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.441 | TFLOPs: 19.20 | 31: iteration 48210/ 173500 | consumed samples: 12341760 | consumed tokens: 25275924480 | elapsed time per iteration (s): 0.82 | learning rate: 1.694E-04 | global batch size: 256 | lm loss: 2.060109E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.696 | TFLOPs: 18.80 | 31: iteration 48220/ 173500 | consumed samples: 12344320 | consumed tokens: 25281167360 | elapsed time per iteration (s): 0.83 | learning rate: 1.694E-04 | global batch size: 256 | lm loss: 2.066599E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.041 | TFLOPs: 18.64 | 31: iteration 48230/ 173500 | consumed samples: 12346880 | consumed tokens: 25286410240 | elapsed time per iteration (s): 0.86 | learning rate: 
1.694E-04 | global batch size: 256 | lm loss: 2.053806E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.005 | TFLOPs: 18.03 | 31: iteration 48240/ 173500 | consumed samples: 12349440 | consumed tokens: 25291653120 | elapsed time per iteration (s): 0.79 | learning rate: 1.694E-04 | global batch size: 256 | lm loss: 2.058571E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.890 | TFLOPs: 19.72 | 31: iteration 48250/ 173500 | consumed samples: 12352000 | consumed tokens: 25296896000 | elapsed time per iteration (s): 0.83 | learning rate: 1.693E-04 | global batch size: 256 | lm loss: 2.082779E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.395 | TFLOPs: 18.60 | 31: iteration 48260/ 173500 | consumed samples: 12354560 | consumed tokens: 25302138880 | elapsed time per iteration (s): 0.82 | learning rate: 1.693E-04 | global batch size: 256 | lm loss: 2.022568E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.111 | TFLOPs: 18.82 | 31: iteration 48270/ 173500 | consumed samples: 12357120 | consumed tokens: 25307381760 | elapsed time per iteration (s): 2.71 | learning rate: 1.693E-04 | global batch size: 256 | lm loss: 2.092496E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 94.320 | TFLOPs: 5.71 | 31: iteration 48280/ 173500 | consumed samples: 12359680 | consumed tokens: 25312624640 | elapsed time per iteration (s): 0.79 | learning rate: 1.693E-04 | global batch size: 256 | lm loss: 2.082412E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.066 | TFLOPs: 19.54 | 31: iteration 48290/ 173500 | consumed 
samples: 12362240 | consumed tokens: 25317867520 | elapsed time per iteration (s): 0.79 | learning rate: 1.693E-04 | global batch size: 256 | lm loss: 2.081914E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.192 | TFLOPs: 19.55 | 31: iteration 48300/ 173500 | consumed samples: 12364800 | consumed tokens: 25323110400 | elapsed time per iteration (s): 0.83 | learning rate: 1.693E-04 | global batch size: 256 | lm loss: 2.084198E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.283 | TFLOPs: 18.77 | 31: iteration 48310/ 173500 | consumed samples: 12367360 | consumed tokens: 25328353280 | elapsed time per iteration (s): 0.80 | learning rate: 1.693E-04 | global batch size: 256 | lm loss: 2.102187E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.095 | TFLOPs: 19.36 | 31: iteration 48320/ 173500 | consumed samples: 12369920 | consumed tokens: 25333596160 | elapsed time per iteration (s): 0.84 | learning rate: 1.693E-04 | global batch size: 256 | lm loss: 2.064968E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.360 | TFLOPs: 18.35 | 31: iteration 48330/ 173500 | consumed samples: 12372480 | consumed tokens: 25338839040 | elapsed time per iteration (s): 0.79 | learning rate: 1.692E-04 | global batch size: 256 | lm loss: 2.060164E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.750 | TFLOPs: 19.65 | 31: iteration 48340/ 173500 | consumed samples: 12375040 | consumed tokens: 25344081920 | elapsed time per iteration (s): 0.81 | learning rate: 1.692E-04 | global batch size: 256 | lm loss: 2.055518E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | samples per second: 317.155 | TFLOPs: 19.19 | 31: iteration 48350/ 173500 | consumed samples: 12377600 | consumed tokens: 25349324800 | elapsed time per iteration (s): 0.83 | learning rate: 1.692E-04 | global batch size: 256 | lm loss: 2.060525E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.095 | TFLOPs: 18.76 | 31: iteration 48360/ 173500 | consumed samples: 12380160 | consumed tokens: 25354567680 | elapsed time per iteration (s): 0.79 | learning rate: 1.692E-04 | global batch size: 256 | lm loss: 2.072979E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.607 | TFLOPs: 19.58 | 31: iteration 48370/ 173500 | consumed samples: 12382720 | consumed tokens: 25359810560 | elapsed time per iteration (s): 0.83 | learning rate: 1.692E-04 | global batch size: 256 | lm loss: 2.089616E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.472 | TFLOPs: 18.72 | 31: iteration 48380/ 173500 | consumed samples: 12385280 | consumed tokens: 25365053440 | elapsed time per iteration (s): 0.81 | learning rate: 1.692E-04 | global batch size: 256 | lm loss: 2.071805E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.177 | TFLOPs: 19.19 | 31: iteration 48390/ 173500 | consumed samples: 12387840 | consumed tokens: 25370296320 | elapsed time per iteration (s): 0.82 | learning rate: 1.692E-04 | global batch size: 256 | lm loss: 2.056699E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.864 | TFLOPs: 18.99 | 31: iteration 48400/ 173500 | consumed samples: 12390400 | consumed tokens: 25375539200 | elapsed time per iteration (s): 0.79 | learning rate: 1.692E-04 | global batch size: 
256 | lm loss: 2.067035E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.621 | TFLOPs: 19.58 | 31: iteration 48410/ 173500 | consumed samples: 12392960 | consumed tokens: 25380782080 | elapsed time per iteration (s): 0.81 | learning rate: 1.691E-04 | global batch size: 256 | lm loss: 2.069280E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.508 | TFLOPs: 19.03 | 31: iteration 48420/ 173500 | consumed samples: 12395520 | consumed tokens: 25386024960 | elapsed time per iteration (s): 0.80 | learning rate: 1.691E-04 | global batch size: 256 | lm loss: 2.062556E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.358 | TFLOPs: 19.44 | 31: iteration 48430/ 173500 | consumed samples: 12398080 | consumed tokens: 25391267840 | elapsed time per iteration (s): 0.81 | learning rate: 1.691E-04 | global batch size: 256 | lm loss: 2.087590E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.955 | TFLOPs: 19.24 | 31: iteration 48440/ 173500 | consumed samples: 12400640 | consumed tokens: 25396510720 | elapsed time per iteration (s): 0.81 | learning rate: 1.691E-04 | global batch size: 256 | lm loss: 2.058243E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.069 | TFLOPs: 19.12 | 31: iteration 48450/ 173500 | consumed samples: 12403200 | consumed tokens: 25401753600 | elapsed time per iteration (s): 0.81 | learning rate: 1.691E-04 | global batch size: 256 | lm loss: 2.045073E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.668 | TFLOPs: 19.04 | 31: iteration 48460/ 173500 | consumed samples: 12405760 | consumed 
tokens: 25406996480 | elapsed time per iteration (s): 0.82 | learning rate: 1.691E-04 | global batch size: 256 | lm loss: 2.091574E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.861 | TFLOPs: 18.87 | 31: iteration 48470/ 173500 | consumed samples: 12408320 | consumed tokens: 25412239360 | elapsed time per iteration (s): 0.83 | learning rate: 1.691E-04 | global batch size: 256 | lm loss: 2.053552E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.883 | TFLOPs: 18.69 | 31: iteration 48480/ 173500 | consumed samples: 12410880 | consumed tokens: 25417482240 | elapsed time per iteration (s): 0.82 | learning rate: 1.691E-04 | global batch size: 256 | lm loss: 2.056706E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.285 | TFLOPs: 18.83 | 31: iteration 48490/ 173500 | consumed samples: 12413440 | consumed tokens: 25422725120 | elapsed time per iteration (s): 0.80 | learning rate: 1.690E-04 | global batch size: 256 | lm loss: 2.095259E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.513 | TFLOPs: 19.33 | 31: iteration 48500/ 173500 | consumed samples: 12416000 | consumed tokens: 25427968000 | elapsed time per iteration (s): 0.82 | learning rate: 1.690E-04 | global batch size: 256 | lm loss: 2.076753E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.846 | TFLOPs: 18.87 | 31: iteration 48510/ 173500 | consumed samples: 12418560 | consumed tokens: 25433210880 | elapsed time per iteration (s): 0.84 | learning rate: 1.690E-04 | global batch size: 256 | lm loss: 2.084157E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 305.008 | TFLOPs: 18.45 | 31: iteration 48520/ 173500 | consumed samples: 12421120 | consumed tokens: 25438453760 | elapsed time per iteration (s): 0.78 | learning rate: 1.690E-04 | global batch size: 256 | lm loss: 2.057551E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.228 | TFLOPs: 19.86 | 31: iteration 48530/ 173500 | consumed samples: 12423680 | consumed tokens: 25443696640 | elapsed time per iteration (s): 0.86 | learning rate: 1.690E-04 | global batch size: 256 | lm loss: 2.069168E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.994 | TFLOPs: 17.97 | 31: iteration 48540/ 173500 | consumed samples: 12426240 | consumed tokens: 25448939520 | elapsed time per iteration (s): 0.80 | learning rate: 1.690E-04 | global batch size: 256 | lm loss: 2.062001E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.042 | TFLOPs: 19.24 | 31: iteration 48550/ 173500 | consumed samples: 12428800 | consumed tokens: 25454182400 | elapsed time per iteration (s): 0.84 | learning rate: 1.690E-04 | global batch size: 256 | lm loss: 2.087623E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.154 | TFLOPs: 18.34 | 31: iteration 48560/ 173500 | consumed samples: 12431360 | consumed tokens: 25459425280 | elapsed time per iteration (s): 0.81 | learning rate: 1.690E-04 | global batch size: 256 | lm loss: 2.074175E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.336 | TFLOPs: 19.08 | 31: iteration 48570/ 173500 | consumed samples: 12433920 | consumed tokens: 25464668160 | elapsed time per iteration (s): 0.78 | learning rate: 1.690E-04 | global batch size: 256 | lm loss: 2.083590E+00 | 
grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.163 | TFLOPs: 19.79 | 31: iteration 48580/ 173500 | consumed samples: 12436480 | consumed tokens: 25469911040 | elapsed time per iteration (s): 0.81 | learning rate: 1.689E-04 | global batch size: 256 | lm loss: 2.052752E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.774 | TFLOPs: 19.16 | 31: iteration 48590/ 173500 | consumed samples: 12439040 | consumed tokens: 25475153920 | elapsed time per iteration (s): 0.82 | learning rate: 1.689E-04 | global batch size: 256 | lm loss: 2.085377E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.991 | TFLOPs: 18.87 | 31: iteration 48600/ 173500 | consumed samples: 12441600 | consumed tokens: 25480396800 | elapsed time per iteration (s): 0.80 | learning rate: 1.689E-04 | global batch size: 256 | lm loss: 2.045020E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.430 | TFLOPs: 19.32 | 31: iteration 48610/ 173500 | consumed samples: 12444160 | consumed tokens: 25485639680 | elapsed time per iteration (s): 0.80 | learning rate: 1.689E-04 | global batch size: 256 | lm loss: 2.060950E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.512 | TFLOPs: 19.39 | 31: iteration 48620/ 173500 | consumed samples: 12446720 | consumed tokens: 25490882560 | elapsed time per iteration (s): 0.82 | learning rate: 1.689E-04 | global batch size: 256 | lm loss: 2.091485E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.337 | TFLOPs: 18.90 | 31: iteration 48630/ 173500 | consumed samples: 12449280 | consumed tokens: 25496125440 | elapsed 
time per iteration (s): 0.77 | learning rate: 1.689E-04 | global batch size: 256 | lm loss: 2.080858E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.123 | TFLOPs: 20.03 | 31: iteration 48640/ 173500 | consumed samples: 12451840 | consumed tokens: 25501368320 | elapsed time per iteration (s): 0.86 | learning rate: 1.689E-04 | global batch size: 256 | lm loss: 2.067690E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.355 | TFLOPs: 18.11 | 31: iteration 48650/ 173500 | consumed samples: 12454400 | consumed tokens: 25506611200 | elapsed time per iteration (s): 0.78 | learning rate: 1.689E-04 | global batch size: 256 | lm loss: 2.064605E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.605 | TFLOPs: 19.94 | 31: iteration 48660/ 173500 | consumed samples: 12456960 | consumed tokens: 25511854080 | elapsed time per iteration (s): 0.80 | learning rate: 1.688E-04 | global batch size: 256 | lm loss: 2.098221E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.582 | TFLOPs: 19.33 | 31: iteration 48670/ 173500 | consumed samples: 12459520 | consumed tokens: 25517096960 | elapsed time per iteration (s): 0.78 | learning rate: 1.688E-04 | global batch size: 256 | lm loss: 2.076659E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.612 | TFLOPs: 19.88 | 31: iteration 48680/ 173500 | consumed samples: 12462080 | consumed tokens: 25522339840 | elapsed time per iteration (s): 0.84 | learning rate: 1.688E-04 | global batch size: 256 | lm loss: 2.056965E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.172 | TFLOPs: 
18.34 | 31: iteration 48690/ 173500 | consumed samples: 12464640 | consumed tokens: 25527582720 | elapsed time per iteration (s): 0.82 | learning rate: 1.688E-04 | global batch size: 256 | lm loss: 2.062063E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.948 | TFLOPs: 18.81 | 31: iteration 48700/ 173500 | consumed samples: 12467200 | consumed tokens: 25532825600 | elapsed time per iteration (s): 0.84 | learning rate: 1.688E-04 | global batch size: 256 | lm loss: 2.064507E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.951 | TFLOPs: 18.45 | 31: iteration 48710/ 173500 | consumed samples: 12469760 | consumed tokens: 25538068480 | elapsed time per iteration (s): 0.79 | learning rate: 1.688E-04 | global batch size: 256 | lm loss: 2.061733E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.045 | TFLOPs: 19.60 | 31: iteration 48720/ 173500 | consumed samples: 12472320 | consumed tokens: 25543311360 | elapsed time per iteration (s): 0.79 | learning rate: 1.688E-04 | global batch size: 256 | lm loss: 2.094788E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.911 | TFLOPs: 19.54 | 31: iteration 48730/ 173500 | consumed samples: 12474880 | consumed tokens: 25548554240 | elapsed time per iteration (s): 0.84 | learning rate: 1.688E-04 | global batch size: 256 | lm loss: 2.086618E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.327 | TFLOPs: 18.41 | 31: iteration 48740/ 173500 | consumed samples: 12477440 | consumed tokens: 25553797120 | elapsed time per iteration (s): 0.84 | learning rate: 1.687E-04 | global batch size: 256 | lm loss: 2.034575E+00 | grad norm: 0.130 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.614 | TFLOPs: 18.49 | 31: iteration 48750/ 173500 | consumed samples: 12480000 | consumed tokens: 25559040000 | elapsed time per iteration (s): 0.80 | learning rate: 1.687E-04 | global batch size: 256 | lm loss: 2.069294E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.580 | TFLOPs: 19.33 | 31: iteration 48760/ 173500 | consumed samples: 12482560 | consumed tokens: 25564282880 | elapsed time per iteration (s): 0.80 | learning rate: 1.687E-04 | global batch size: 256 | lm loss: 2.064034E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.831 | TFLOPs: 19.35 | 31: iteration 48770/ 173500 | consumed samples: 12485120 | consumed tokens: 25569525760 | elapsed time per iteration (s): 0.81 | learning rate: 1.687E-04 | global batch size: 256 | lm loss: 2.061470E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.740 | TFLOPs: 19.04 | 31: iteration 48780/ 173500 | consumed samples: 12487680 | consumed tokens: 25574768640 | elapsed time per iteration (s): 0.80 | learning rate: 1.687E-04 | global batch size: 256 | lm loss: 2.084891E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.694 | TFLOPs: 19.46 | 31: iteration 48790/ 173500 | consumed samples: 12490240 | consumed tokens: 25580011520 | elapsed time per iteration (s): 0.78 | learning rate: 1.687E-04 | global batch size: 256 | lm loss: 2.067029E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.930 | TFLOPs: 19.96 | 31: iteration 48800/ 173500 | consumed samples: 12492800 | consumed tokens: 25585254400 | elapsed time per iteration (s): 0.78 | 
learning rate: 1.687E-04 | global batch size: 256 | lm loss: 2.083816E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.653 | TFLOPs: 19.82 | 31: iteration 48810/ 173500 | consumed samples: 12495360 | consumed tokens: 25590497280 | elapsed time per iteration (s): 0.76 | learning rate: 1.687E-04 | global batch size: 256 | lm loss: 2.071764E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.742 | TFLOPs: 20.31 | 31: iteration 48820/ 173500 | consumed samples: 12497920 | consumed tokens: 25595740160 | elapsed time per iteration (s): 0.85 | learning rate: 1.686E-04 | global batch size: 256 | lm loss: 2.075245E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.110 | TFLOPs: 18.16 | 31: iteration 48830/ 173500 | consumed samples: 12500480 | consumed tokens: 25600983040 | elapsed time per iteration (s): 0.81 | learning rate: 1.686E-04 | global batch size: 256 | lm loss: 2.095121E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.816 | TFLOPs: 19.23 | 31: iteration 48840/ 173500 | consumed samples: 12503040 | consumed tokens: 25606225920 | elapsed time per iteration (s): 0.79 | learning rate: 1.686E-04 | global batch size: 256 | lm loss: 2.053766E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.978 | TFLOPs: 19.72 | 31: iteration 48850/ 173500 | consumed samples: 12505600 | consumed tokens: 25611468800 | elapsed time per iteration (s): 0.81 | learning rate: 1.686E-04 | global batch size: 256 | lm loss: 2.070976E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.795 | TFLOPs: 19.04 | 31: iteration 48860/ 
173500 | consumed samples: 12508160 | consumed tokens: 25616711680 | elapsed time per iteration (s): 0.81 | learning rate: 1.686E-04 | global batch size: 256 | lm loss: 2.060272E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.349 | TFLOPs: 19.08 | 31: iteration 48870/ 173500 | consumed samples: 12510720 | consumed tokens: 25621954560 | elapsed time per iteration (s): 0.81 | learning rate: 1.686E-04 | global batch size: 256 | lm loss: 2.047802E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.739 | TFLOPs: 19.16 | 31: iteration 48880/ 173500 | consumed samples: 12513280 | consumed tokens: 25627197440 | elapsed time per iteration (s): 0.85 | learning rate: 1.686E-04 | global batch size: 256 | lm loss: 2.066230E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.851 | TFLOPs: 18.14 | 31: iteration 48890/ 173500 | consumed samples: 12515840 | consumed tokens: 25632440320 | elapsed time per iteration (s): 0.82 | learning rate: 1.686E-04 | global batch size: 256 | lm loss: 2.070368E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.732 | TFLOPs: 18.80 | 31: iteration 48900/ 173500 | consumed samples: 12518400 | consumed tokens: 25637683200 | elapsed time per iteration (s): 0.81 | learning rate: 1.685E-04 | global batch size: 256 | lm loss: 2.059880E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.922 | TFLOPs: 19.05 | 31: iteration 48910/ 173500 | consumed samples: 12520960 | consumed tokens: 25642926080 | elapsed time per iteration (s): 0.87 | learning rate: 1.685E-04 | global batch size: 256 | lm loss: 2.086296E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 292.641 | TFLOPs: 17.70 | 31: iteration 48920/ 173500 | consumed samples: 12523520 | consumed tokens: 25648168960 | elapsed time per iteration (s): 0.80 | learning rate: 1.685E-04 | global batch size: 256 | lm loss: 2.045604E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.993 | TFLOPs: 19.36 | 31: iteration 48930/ 173500 | consumed samples: 12526080 | consumed tokens: 25653411840 | elapsed time per iteration (s): 0.80 | learning rate: 1.685E-04 | global batch size: 256 | lm loss: 2.082325E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.285 | TFLOPs: 19.26 | 31: iteration 48940/ 173500 | consumed samples: 12528640 | consumed tokens: 25658654720 | elapsed time per iteration (s): 0.81 | learning rate: 1.685E-04 | global batch size: 256 | lm loss: 2.048411E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.448 | TFLOPs: 19.14 | 31: iteration 48950/ 173500 | consumed samples: 12531200 | consumed tokens: 25663897600 | elapsed time per iteration (s): 0.81 | learning rate: 1.685E-04 | global batch size: 256 | lm loss: 2.062634E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.437 | TFLOPs: 19.08 | 31: iteration 48960/ 173500 | consumed samples: 12533760 | consumed tokens: 25669140480 | elapsed time per iteration (s): 0.81 | learning rate: 1.685E-04 | global batch size: 256 | lm loss: 2.083138E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.540 | TFLOPs: 19.03 | 31: iteration 48970/ 173500 | consumed samples: 12536320 | consumed tokens: 25674383360 | elapsed time per iteration (s): 0.80 | learning rate: 
1.685E-04 | global batch size: 256 | lm loss: 2.054279E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.466 | TFLOPs: 19.33 | 31: iteration 48980/ 173500 | consumed samples: 12538880 | consumed tokens: 25679626240 | elapsed time per iteration (s): 0.80 | learning rate: 1.684E-04 | global batch size: 256 | lm loss: 2.106441E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.256 | TFLOPs: 19.31 | 31: iteration 48990/ 173500 | consumed samples: 12541440 | consumed tokens: 25684869120 | elapsed time per iteration (s): 0.78 | learning rate: 1.684E-04 | global batch size: 256 | lm loss: 2.081581E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.162 | TFLOPs: 19.79 | 31: iteration 49000/ 173500 | consumed samples: 12544000 | consumed tokens: 25690112000 | elapsed time per iteration (s): 0.80 | learning rate: 1.684E-04 | global batch size: 256 | lm loss: 2.056070E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.278 | TFLOPs: 19.25 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 49000 | lm loss value: 2.045710E+00 | lm loss PPL: 7.734645E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 49000 to checkpoints_1b1long 0: [2022-11-26 05:06:22,252] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step49000 is begin to save! 0: [2022-11-26 05:06:22,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_01-model_00-model_states.pt... 
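A note on reading the counters above: the iteration-49000 record and the validation line that follows can be cross-checked with simple arithmetic. The sketch below is a minimal, hypothetical helper, not part of Megatron-DeepSpeed. It assumes that consumed samples equal iteration times global batch size (256), that consumed tokens equal consumed samples times the sequence length (2048, from `--seq-length` in the launch command), and that the reported PPL is `exp(lm loss)`. The regular expression and the names `RECORD`, `parse_record`, and `check_record` are illustrative only.

```python
import math
import re

SEQ_LENGTH = 2048    # from --seq-length in the launch command
GLOBAL_BATCH = 256   # from --global-batch-size in the launch command

# Hypothetical pattern for one "iteration" record; it only captures the
# fields needed for the consistency checks below.
RECORD = re.compile(
    r"iteration\s+(?P<it>\d+)/\s*(?P<total>\d+) \| "
    r"consumed samples:\s+(?P<samples>\d+) \| "
    r"consumed tokens:\s+(?P<tokens>\d+) \| "
    r".*?lm loss: (?P<loss>[0-9.E+-]+)"
)


def parse_record(text):
    """Return the captured fields of the first iteration record, or None."""
    m = RECORD.search(text)
    if m is None:
        return None
    return {
        "iteration": int(m.group("it")),
        "total": int(m.group("total")),
        "samples": int(m.group("samples")),
        "tokens": int(m.group("tokens")),
        "loss": float(m.group("loss")),
    }


def check_record(rec):
    """True if the sample and token counters match the assumed relations."""
    return (
        rec["samples"] == rec["iteration"] * GLOBAL_BATCH
        and rec["tokens"] == rec["samples"] * SEQ_LENGTH
    )


# Record copied from the log at iteration 49000.
sample = (
    "31: iteration 49000/ 173500 | consumed samples: 12544000 | "
    "consumed tokens: 25690112000 | elapsed time per iteration (s): 0.80 | "
    "learning rate: 1.684E-04 | global batch size: 256 | "
    "lm loss: 2.056070E+00 | grad norm: 0.130 |"
)
rec = parse_record(sample)
counters_ok = check_record(rec)

# Validation line at iteration 49000: PPL should be exp(loss) up to rounding.
valid_loss = 2.045710
valid_ppl = 7.734645
ppl_ok = abs(math.exp(valid_loss) - valid_ppl) < 1e-4
```

The reported `samples per second` is deliberately left out of the check: `elapsed time per iteration` is printed to two decimals, so dividing 256 by it only approximates the logged throughput.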
0: [2022-11-26 05:06:22,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_01-model_00-model_states.pt. 0: [2022-11-26 05:06:22,484] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_03-model_00-model_states.pt... 0: [2022-11-26 05:06:22,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_03-model_00-model_states.pt. 0: [2022-11-26 05:06:22,566] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_04-model_00-model_states.pt... 0: [2022-11-26 05:06:22,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_04-model_00-model_states.pt. 0: [2022-11-26 05:06:22,649] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_05-model_00-model_states.pt... 0: [2022-11-26 05:06:22,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_05-model_00-model_states.pt. 0: [2022-11-26 05:06:22,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_06-model_00-model_states.pt... 0: [2022-11-26 05:06:22,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_06-model_00-model_states.pt. 0: [2022-11-26 05:06:22,805] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_07-model_00-model_states.pt... 0: [2022-11-26 05:06:22,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_07-model_00-model_states.pt. 0: [2022-11-26 05:06:22,881] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 05:06:22,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_08-model_00-model_states.pt. 0: [2022-11-26 05:06:22,957] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_09-model_00-model_states.pt... 0: [2022-11-26 05:06:23,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_09-model_00-model_states.pt. 0: [2022-11-26 05:06:23,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_10-model_00-model_states.pt... 0: [2022-11-26 05:06:23,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_10-model_00-model_states.pt. 0: [2022-11-26 05:06:23,106] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_11-model_00-model_states.pt... 0: [2022-11-26 05:06:23,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_11-model_00-model_states.pt. 0: [2022-11-26 05:06:23,183] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_12-model_00-model_states.pt... 0: [2022-11-26 05:06:23,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_12-model_00-model_states.pt. 0: [2022-11-26 05:06:23,263] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_13-model_00-model_states.pt... 0: [2022-11-26 05:06:23,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_13-model_00-model_states.pt. 0: [2022-11-26 05:06:23,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 05:06:23,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_14-model_00-model_states.pt. 0: [2022-11-26 05:06:23,411] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_15-model_00-model_states.pt... 0: [2022-11-26 05:06:23,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_15-model_00-model_states.pt. 0: [2022-11-26 05:06:23,485] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_16-model_00-model_states.pt... 0: [2022-11-26 05:06:23,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_16-model_00-model_states.pt. 0: [2022-11-26 05:06:23,559] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_17-model_00-model_states.pt... 0: [2022-11-26 05:06:23,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_17-model_00-model_states.pt. 0: [2022-11-26 05:06:23,633] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_18-model_00-model_states.pt... 0: [2022-11-26 05:06:23,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_18-model_00-model_states.pt. 0: [2022-11-26 05:06:23,709] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_19-model_00-model_states.pt... 0: [2022-11-26 05:06:23,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_19-model_00-model_states.pt. 0: [2022-11-26 05:06:23,784] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 05:06:23,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_20-model_00-model_states.pt. 0: [2022-11-26 05:06:23,874] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_21-model_00-model_states.pt... 0: [2022-11-26 05:06:23,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_21-model_00-model_states.pt. 0: [2022-11-26 05:06:23,948] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_22-model_00-model_states.pt... 0: [2022-11-26 05:06:24,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_22-model_00-model_states.pt. 0: [2022-11-26 05:06:24,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_23-model_00-model_states.pt... 0: [2022-11-26 05:06:24,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_23-model_00-model_states.pt. 0: [2022-11-26 05:06:24,099] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_24-model_00-model_states.pt... 0: [2022-11-26 05:06:24,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_24-model_00-model_states.pt. 0: [2022-11-26 05:06:24,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_25-model_00-model_states.pt... 0: [2022-11-26 05:06:24,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_25-model_00-model_states.pt. 0: [2022-11-26 05:06:24,248] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 05:06:24,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_26-model_00-model_states.pt. 0: [2022-11-26 05:06:24,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_27-model_00-model_states.pt... 0: [2022-11-26 05:06:24,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_27-model_00-model_states.pt. 0: [2022-11-26 05:06:24,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_28-model_00-model_states.pt... 0: [2022-11-26 05:06:24,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_28-model_00-model_states.pt. 0: [2022-11-26 05:06:24,472] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/layer_30-model_00-model_states.pt... 0: [2022-11-26 05:06:24,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/layer_30-model_00-model_states.pt. 0: [2022-11-26 05:06:24,474] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step49000/mp_rank_00_model_states.pt 0: [2022-11-26 05:06:24,474] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/mp_rank_00_model_states.pt... 0: [2022-11-26 05:06:24,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/mp_rank_00_model_states.pt. 0: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
6: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 
4: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
10: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 
16: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
3: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 
23: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 
14: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 
30: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 
18: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
6: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 
8: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
2: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
20: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 
28: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 
22: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 
18: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 
27: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
4: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
12: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 
28: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 
21: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 27: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 
20: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 28: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 21: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 
27: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 28: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 21: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 28: [2022-11-26 05:06:24,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 
31: [2022-11-26 05:06:24,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:06:24,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:06:24,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 05:06:24,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 23: [2022-11-26 05:06:24,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 05:06:24,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 05:06:24,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 25: [2022-11-26 05:06:24,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 05:06:24,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 10: [2022-11-26 05:06:24,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 25: [2022-11-26 05:06:24,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 
10: [2022-11-26 05:06:24,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 5: [2022-11-26 05:06:24,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:06:24,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 5: [2022-11-26 05:06:24,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 05:06:24,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 16: [2022-11-26 05:06:24,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 05:06:24,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 05:06:24,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 05:06:24,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 05:06:24,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 16: [2022-11-26 05:06:24,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 26: [2022-11-26 05:06:24,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
26: [2022-11-26 05:06:24,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 05:06:24,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 26: [2022-11-26 05:06:24,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 05:06:24,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 05:06:24,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 10: [2022-11-26 05:06:24,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:06:24,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 05:06:24,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 8: [2022-11-26 05:06:24,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:06:24,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 05:06:24,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 31: [2022-11-26 05:06:24,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 
31: [2022-11-26 05:06:24,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 05:06:24,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 11: [2022-11-26 05:06:24,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:06:24,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 29: [2022-11-26 05:06:24,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:06:24,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 29: [2022-11-26 05:06:24,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 05:06:24,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 30: [2022-11-26 05:06:24,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:06:24,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 05:06:24,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 1: [2022-11-26 05:06:24,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 05:06:24,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 13: [2022-11-26 05:06:24,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:06:24,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 1: [2022-11-26 05:06:24,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 13: [2022-11-26 05:06:24,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 28: [2022-11-26 05:06:24,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 23: [2022-11-26 05:06:24,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 05:06:24,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 05:06:24,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 28: [2022-11-26 05:06:24,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 30: [2022-11-26 05:06:24,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-26 05:06:24,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 7: [2022-11-26 05:06:24,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 28: [2022-11-26 05:06:24,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 30: [2022-11-26 05:06:24,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 7: [2022-11-26 05:06:24,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 05:06:24,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 28: [2022-11-26 05:06:24,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:06:24,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:06:24,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 05:06:24,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 0: [2022-11-26 05:06:24,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-26 05:06:24,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 05:06:24,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 0: [2022-11-26 05:06:24,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 28: [2022-11-26 05:06:24,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 05:06:24,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 16: [2022-11-26 05:06:24,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 05:06:24,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 05:06:24,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 1: [2022-11-26 05:06:24,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:06:24,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 11: [2022-11-26 05:06:24,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:06:24,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 
11: [2022-11-26 05:06:24,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 05:06:24,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 5: [2022-11-26 05:06:24,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:06:24,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 05:06:24,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 27: [2022-11-26 05:06:24,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:06:24,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 05:06:24,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 13: [2022-11-26 05:06:24,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:06:24,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 05:06:24,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 21: [2022-11-26 05:06:24,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
21: [2022-11-26 05:06:24,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 05:06:24,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 14: [2022-11-26 05:06:24,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:06:24,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 05:06:24,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 14: [2022-11-26 05:06:24,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:06:24,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 05:06:24,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 26: [2022-11-26 05:06:24,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:06:24,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
26: [2022-11-26 05:06:24,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 27: [2022-11-26 05:06:24,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 7: [2022-11-26 05:06:24,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:06:24,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 26: [2022-11-26 05:06:24,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 7: [2022-11-26 05:06:24,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 05:06:24,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 26: [2022-11-26 05:06:24,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 05:06:24,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 19: [2022-11-26 05:06:24,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 26: [2022-11-26 05:06:24,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 
19: [2022-11-26 05:06:24,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 05:06:24,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 10: [2022-11-26 05:06:24,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:06:24,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 05:06:24,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 13: [2022-11-26 05:06:24,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:06:24,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 05:06:24,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 18: [2022-11-26 05:06:24,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 05:06:24,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 16: [2022-11-26 05:06:24,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
11: [2022-11-26 05:06:24,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 18: [2022-11-26 05:06:24,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 16: [2022-11-26 05:06:24,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 05:06:24,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 11: [2022-11-26 05:06:24,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 05:06:24,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 23: [2022-11-26 05:06:24,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 05:06:24,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 05:06:24,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 10: [2022-11-26 05:06:24,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 31: [2022-11-26 05:06:24,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
10: [2022-11-26 05:06:24,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 05:06:24,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 31: [2022-11-26 05:06:24,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 05:06:24,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 5: [2022-11-26 05:06:24,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:06:24,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 05:06:24,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 3: [2022-11-26 05:06:24,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:06:24,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:06:24,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 3: [2022-11-26 05:06:24,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 29: [2022-11-26 05:06:24,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 
3: [2022-11-26 05:06:24,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 3: [2022-11-26 05:06:24,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:06:24,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 05:06:24,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 1: [2022-11-26 05:06:24,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:06:24,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 30: [2022-11-26 05:06:24,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:06:24,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 30: [2022-11-26 05:06:24,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 05:06:24,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 21: [2022-11-26 05:06:24,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:06:24,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
21: [2022-11-26 05:06:24,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 05:06:24,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 05:06:24,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 21: [2022-11-26 05:06:24,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 8: [2022-11-26 05:06:24,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:06:24,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 05:06:24,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 31: [2022-11-26 05:06:24,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:06:24,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:06:24,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
29: [2022-11-26 05:06:24,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 05:06:24,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 31: [2022-11-26 05:06:24,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 29: [2022-11-26 05:06:24,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 31: [2022-11-26 05:06:24,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 29: [2022-11-26 05:06:24,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 8: [2022-11-26 05:06:24,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:06:24,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 05:06:24,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 11: [2022-11-26 05:06:24,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:06:24,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 05:06:24,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 
28: [2022-11-26 05:06:24,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 05:06:24,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 05:06:24,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 05:06:24,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 28: [2022-11-26 05:06:24,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 05:06:24,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 30: [2022-11-26 05:06:24,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:06:24,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 18: [2022-11-26 05:06:24,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:06:24,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 18: [2022-11-26 05:06:24,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 05:06:24,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 
13: [2022-11-26 05:06:24,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:06:24,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:06:24,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 0: [2022-11-26 05:06:24,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 13: [2022-11-26 05:06:24,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 0: [2022-11-26 05:06:24,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 27: [2022-11-26 05:06:24,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:06:24,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:06:24,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
21: [2022-11-26 05:06:24,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 7: [2022-11-26 05:06:24,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 27: [2022-11-26 05:06:24,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 7: [2022-11-26 05:06:24,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 21: [2022-11-26 05:06:24,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 19: [2022-11-26 05:06:24,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:06:24,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 19: [2022-11-26 05:06:24,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 05:06:24,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 15: [2022-11-26 05:06:24,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:06:24,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:06:24,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-26 05:06:24,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:06:24,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 05:06:24,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 05:06:24,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 05:06:24,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 05:06:24,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 15: [2022-11-26 05:06:24,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 15: [2022-11-26 05:06:24,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 15: [2022-11-26 05:06:24,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 3: [2022-11-26 05:06:24,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:06:24,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 23: [2022-11-26 05:06:24,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
3: [2022-11-26 05:06:24,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 23: [2022-11-26 05:06:24,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 05:06:24,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 25: [2022-11-26 05:06:24,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 05:06:24,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 05:06:24,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 22: [2022-11-26 05:06:24,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 05:06:24,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 05:06:24,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
22: [2022-11-26 05:06:24,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 05:06:24,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 05:06:24,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 05:06:24,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 22: [2022-11-26 05:06:24,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 22: [2022-11-26 05:06:24,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 8: [2022-11-26 05:06:24,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:06:24,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:06:24,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 1: [2022-11-26 05:06:24,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 8: [2022-11-26 05:06:24,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 1: [2022-11-26 05:06:24,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 
9: [2022-11-26 05:06:24,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:06:24,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:06:24,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:06:24,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 05:06:24,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 05:06:24,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 05:06:24,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 9: [2022-11-26 05:06:24,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 9: [2022-11-26 05:06:24,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 14: [2022-11-26 05:06:24,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:06:24,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 05:06:24,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 
14: [2022-11-26 05:06:24,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:06:24,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 05:06:24,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 31: [2022-11-26 05:06:24,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 05:06:24,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 05:06:24,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 12: [2022-11-26 05:06:24,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:06:24,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:06:24,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:06:24,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:06:24,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
19: [2022-11-26 05:06:24,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 12: [2022-11-26 05:06:24,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 20: [2022-11-26 05:06:24,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 05:06:24,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:06:24,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 12: [2022-11-26 05:06:24,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 05:06:24,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 20: [2022-11-26 05:06:24,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 05:06:24,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 12: [2022-11-26 05:06:24,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 20: [2022-11-26 05:06:24,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 
20: [2022-11-26 05:06:24,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 12: [2022-11-26 05:06:24,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 12: [2022-11-26 05:06:24,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 12: [2022-11-26 05:06:24,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 12: [2022-11-26 05:06:24,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 25: [2022-11-26 05:06:24,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 05:06:24,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 05:06:24,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 7: [2022-11-26 05:06:24,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:06:24,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 05:06:24,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 18: [2022-11-26 05:06:24,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
18: [2022-11-26 05:06:24,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 05:06:24,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 4: [2022-11-26 05:06:24,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:06:24,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 25: [2022-11-26 05:06:24,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:06:24,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 05:06:24,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 05:06:24,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 25: [2022-11-26 05:06:24,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 4: [2022-11-26 05:06:24,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 25: [2022-11-26 05:06:24,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 4: [2022-11-26 05:06:24,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 05:06:24,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:06:24,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 05:06:24,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 05:06:24,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 4: [2022-11-26 05:06:24,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 0: [2022-11-26 05:06:24,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:06:24,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 05:06:24,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 20: [2022-11-26 05:06:24,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 05:06:24,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 6: [2022-11-26 05:06:24,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 20: [2022-11-26 05:06:24,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 
6: [2022-11-26 05:06:24,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:06:24,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:06:24,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 05:06:24,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 6: [2022-11-26 05:06:24,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 05:06:24,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 05:06:24,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 6: [2022-11-26 05:06:24,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 28: [2022-11-26 05:06:24,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 05:06:24,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 05:06:24,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 24: [2022-11-26 05:06:24,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-26 05:06:24,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 05:06:24,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 0: [2022-11-26 05:06:24,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:06:24,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 24: [2022-11-26 05:06:24,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:06:24,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 24: [2022-11-26 05:06:24,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 05:06:24,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 5: [2022-11-26 05:06:24,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:06:24,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 05:06:24,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 17: [2022-11-26 05:06:24,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-26 05:06:24,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:06:24,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:06:24,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:06:24,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 05:06:24,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 05:06:24,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 05:06:24,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 05:06:24,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 17: [2022-11-26 05:06:24,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 17: [2022-11-26 05:06:24,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 17: [2022-11-26 05:06:24,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 17: [2022-11-26 05:06:24,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 
17: [2022-11-26 05:06:24,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 05:06:24,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 24: [2022-11-26 05:06:24,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 05:06:24,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 05:06:24,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 0: [2022-11-26 05:06:24,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 26: [2022-11-26 05:06:24,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:06:24,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 26: [2022-11-26 05:06:24,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 05:06:24,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 16: [2022-11-26 05:06:24,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
16: [2022-11-26 05:06:24,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 05:06:24,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 15: [2022-11-26 05:06:24,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:06:24,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 05:06:24,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 2: [2022-11-26 05:06:24,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:06:24,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:06:24,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:06:24,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 05:06:24,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 05:06:24,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 11: [2022-11-26 05:06:24,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:06:24,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 2: [2022-11-26 05:06:24,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 05:06:24,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 2: [2022-11-26 05:06:24,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 11: [2022-11-26 05:06:24,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 2: [2022-11-26 05:06:24,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 11: [2022-11-26 05:06:24,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 2: [2022-11-26 05:06:24,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 2: [2022-11-26 05:06:24,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 05:06:24,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 05:06:24,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 10: [2022-11-26 05:06:24,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:06:24,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 05:06:24,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 21: [2022-11-26 05:06:24,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:06:24,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 05:06:24,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 3: [2022-11-26 05:06:24,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:06:24,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 05:06:24,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 23: [2022-11-26 05:06:24,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
23: [2022-11-26 05:06:24,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 05:06:24,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 31: [2022-11-26 05:06:24,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 05:06:24,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 05:06:24,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 27: [2022-11-26 05:06:24,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:06:24,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 05:06:24,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 7: [2022-11-26 05:06:24,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:06:24,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 05:06:24,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 29: [2022-11-26 05:06:24,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-26 05:06:24,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 05:06:24,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 9: [2022-11-26 05:06:24,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:06:24,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 05:06:24,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 30: [2022-11-26 05:06:24,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:06:24,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 05:06:24,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 18: [2022-11-26 05:06:24,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 05:06:24,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 05:06:24,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 4: [2022-11-26 05:06:24,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 05:06:24,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 05:06:24,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 8: [2022-11-26 05:06:24,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:06:24,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 05:06:24,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 13: [2022-11-26 05:06:24,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:06:24,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 05:06:24,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 20: [2022-11-26 05:06:24,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 05:06:24,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 05:06:24,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 14: [2022-11-26 05:06:24,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 05:06:24,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 05:06:24,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 22: [2022-11-26 05:06:24,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 05:06:24,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 05:06:24,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 1: [2022-11-26 05:06:24,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:06:24,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 05:06:24,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 25: [2022-11-26 05:06:24,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 05:06:24,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 05:06:24,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 0: [2022-11-26 05:06:24,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 05:06:24,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 05:06:24,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 5: [2022-11-26 05:06:24,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:06:24,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 05:06:24,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 17: [2022-11-26 05:06:24,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:06:24,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 05:06:24,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 24: [2022-11-26 05:06:24,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 05:06:24,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 05:06:24,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 28: [2022-11-26 05:06:24,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
28: [2022-11-26 05:06:24,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 05:06:24,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 10: [2022-11-26 05:06:24,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:06:24,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:06:24,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 10: [2022-11-26 05:06:24,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 12: [2022-11-26 05:06:24,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 10: [2022-11-26 05:06:24,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 11: [2022-11-26 05:06:24,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:06:24,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 05:06:24,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 2: [2022-11-26 05:06:24,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-26 05:06:24,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 05:06:24,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 19: [2022-11-26 05:06:24,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:06:24,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 05:06:24,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 16: [2022-11-26 05:06:24,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 05:06:24,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 05:06:24,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 15: [2022-11-26 05:06:24,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:06:24,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 05:06:24,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 6: [2022-11-26 05:06:24,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 05:06:24,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 05:06:24,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 23: [2022-11-26 05:06:24,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 05:06:24,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 05:06:24,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 26: [2022-11-26 05:06:24,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 05:06:24,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 05:06:24,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 30: [2022-11-26 05:06:24,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:06:24,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 05:06:24,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 21: [2022-11-26 05:06:24,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-26 05:06:24,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 05:06:24,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 3: [2022-11-26 05:06:24,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:06:24,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 05:06:24,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 4: [2022-11-26 05:06:24,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:06:24,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 05:06:24,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 9: [2022-11-26 05:06:24,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:06:24,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 05:06:24,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 18: [2022-11-26 05:06:24,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
0: [2022-11-26 05:06:24,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 18: [2022-11-26 05:06:24,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 05:06:24,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 0: [2022-11-26 05:06:24,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 05:06:24,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 16: [2022-11-26 05:06:24,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 05:06:24,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 05:06:24,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 22: [2022-11-26 05:06:24,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 05:06:24,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 05:06:24,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 27: [2022-11-26 05:06:24,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
27: [2022-11-26 05:06:24,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 05:06:24,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 29: [2022-11-26 05:06:24,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:06:24,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 23: [2022-11-26 05:06:24,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:06:24,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 23: [2022-11-26 05:06:24,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 05:06:24,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 15: [2022-11-26 05:06:24,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:06:24,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 05:06:24,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 26: [2022-11-26 05:06:24,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-26 05:06:24,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 05:06:24,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 14: [2022-11-26 05:06:24,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:06:24,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:06:24,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:06:24,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 13: [2022-11-26 05:06:24,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 14: [2022-11-26 05:06:24,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 6: [2022-11-26 05:06:24,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 05:06:24,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 13: [2022-11-26 05:06:24,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 31: [2022-11-26 05:06:24,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
31: [2022-11-26 05:06:24,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 05:06:24,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 24: [2022-11-26 05:06:24,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 05:06:24,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 05:06:24,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 8: [2022-11-26 05:06:24,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:06:24,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 05:06:24,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 5: [2022-11-26 05:06:24,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:06:24,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:06:24,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 05:06:24,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 
21: [2022-11-26 05:06:24,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 05:06:24,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 1: [2022-11-26 05:06:24,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:06:24,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:06:24,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 1: [2022-11-26 05:06:24,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 20: [2022-11-26 05:06:24,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 28: [2022-11-26 05:06:24,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:06:24,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 19: [2022-11-26 05:06:24,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 
20: [2022-11-26 05:06:24,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 28: [2022-11-26 05:06:24,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 05:06:24,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 20: [2022-11-26 05:06:24,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 25: [2022-11-26 05:06:24,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 05:06:24,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 05:06:24,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 17: [2022-11-26 05:06:24,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:06:24,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 05:06:24,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 11: [2022-11-26 05:06:24,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 05:06:24,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 3: [2022-11-26 05:06:24,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:06:24,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 3: [2022-11-26 05:06:24,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 05:06:24,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 10: [2022-11-26 05:06:24,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:06:24,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 05:06:24,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 18: [2022-11-26 05:06:24,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 05:06:24,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 05:06:24,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 29: [2022-11-26 05:06:24,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 
29: [2022-11-26 05:06:24,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 05:06:24,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 7: [2022-11-26 05:06:24,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:06:24,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 05:06:24,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 30: [2022-11-26 05:06:24,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 31: [2022-11-26 05:06:24,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:06:24,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 12: [2022-11-26 05:06:24,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:06:24,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:06:24,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 
12: [2022-11-26 05:06:24,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 05:06:24,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 05:06:24,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 12: [2022-11-26 05:06:24,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 31: [2022-11-26 05:06:24,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 05:06:24,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 13: [2022-11-26 05:06:24,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:06:24,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 05:06:24,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 9: [2022-11-26 05:06:24,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:06:24,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 05:06:24,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 
27: [2022-11-26 05:06:24,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:06:24,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:06:24,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 2: [2022-11-26 05:06:24,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 27: [2022-11-26 05:06:24,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 2: [2022-11-26 05:06:24,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 7: [2022-11-26 05:06:24,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:06:24,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 05:06:24,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 8: [2022-11-26 05:06:24,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:06:24,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 05:06:24,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 
22: [2022-11-26 05:06:24,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:06:24,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 22: [2022-11-26 05:06:24,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 05:06:24,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 28: [2022-11-26 05:06:24,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:06:24,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 28: [2022-11-26 05:06:24,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 17: [2022-11-26 05:06:24,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 28: [2022-11-26 05:06:24,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 4: [2022-11-26 05:06:24,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 24: [2022-11-26 05:06:24,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:06:24,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
4: [2022-11-26 05:06:24,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 05:06:24,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 5: [2022-11-26 05:06:24,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 24: [2022-11-26 05:06:24,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 5: [2022-11-26 05:06:24,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 24: [2022-11-26 05:06:24,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 1: [2022-11-26 05:06:24,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:06:24,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 05:06:24,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 26: [2022-11-26 05:06:24,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:06:24,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
26: [2022-11-26 05:06:24,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 0: [2022-11-26 05:06:24,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 26: [2022-11-26 05:06:24,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 0: [2022-11-26 05:06:24,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 6: [2022-11-26 05:06:24,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:06:24,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 05:06:24,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 19: [2022-11-26 05:06:24,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:06:24,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 05:06:24,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 16: [2022-11-26 05:06:24,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:06:24,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
16: [2022-11-26 05:06:24,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 21: [2022-11-26 05:06:24,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 16: [2022-11-26 05:06:24,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 21: [2022-11-26 05:06:24,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 15: [2022-11-26 05:06:24,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:06:24,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 14: [2022-11-26 05:06:24,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:06:24,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 14: [2022-11-26 05:06:24,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 13: [2022-11-26 05:06:24,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:06:24,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 
13: [2022-11-26 05:06:24,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 05:06:24,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 23: [2022-11-26 05:06:24,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 05:06:24,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 05:06:24,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 3: [2022-11-26 05:06:24,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:06:24,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 2: [2022-11-26 05:06:24,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:06:24,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 2: [2022-11-26 05:06:24,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 05:06:24,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 20: [2022-11-26 05:06:24,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
20: [2022-11-26 05:06:24,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 05:06:24,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 10: [2022-11-26 05:06:24,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:06:24,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 05:06:24,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 27: [2022-11-26 05:06:24,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:06:24,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:06:24,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 05:06:24,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 30: [2022-11-26 05:06:24,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 05:06:24,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 31: [2022-11-26 05:06:24,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-26 05:06:24,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 05:06:24,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 9: [2022-11-26 05:06:24,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:06:24,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 29: [2022-11-26 05:06:24,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:06:24,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 29: [2022-11-26 05:06:24,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 05:06:24,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 1: [2022-11-26 05:06:24,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:06:24,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 05:06:24,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 25: [2022-11-26 05:06:24,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
25: [2022-11-26 05:06:24,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 05:06:24,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 12: [2022-11-26 05:06:24,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:06:24,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 05:06:24,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 7: [2022-11-26 05:06:24,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:06:24,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 05:06:24,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 22: [2022-11-26 05:06:24,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 05:06:24,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 05:06:24,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 18: [2022-11-26 05:06:24,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
18: [2022-11-26 05:06:24,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 05:06:24,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 4: [2022-11-26 05:06:24,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:06:24,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 05:06:24,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 11: [2022-11-26 05:06:24,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:06:24,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 05:06:24,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 8: [2022-11-26 05:06:24,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:06:24,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 05:06:24,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 14: [2022-11-26 05:06:24,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 05:06:24,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 05:06:24,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 25: [2022-11-26 05:06:24,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 05:06:24,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 05:06:24,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 19: [2022-11-26 05:06:24,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:06:24,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 05:06:24,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 6: [2022-11-26 05:06:24,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:06:24,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 05:06:24,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 6: [2022-11-26 05:06:24,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
18: [2022-11-26 05:06:24,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:06:24,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 05:06:24,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 18: [2022-11-26 05:06:24,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 05:06:24,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 22: [2022-11-26 05:06:24,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 05:06:24,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 05:06:24,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 24: [2022-11-26 05:06:24,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 05:06:24,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 05:06:24,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 9: [2022-11-26 05:06:24,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-26 05:06:24,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 05:06:24,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 20: [2022-11-26 05:06:24,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 05:06:24,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 05:06:24,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 19: [2022-11-26 05:06:24,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:06:24,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 05:06:24,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 27: [2022-11-26 05:06:24,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:06:24,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 05:06:24,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 24: [2022-11-26 05:06:24,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 
24: [2022-11-26 05:06:24,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 05:06:24,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 20: [2022-11-26 05:06:24,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 05:06:24,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step49000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 05:06:24,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 0: successfully saved checkpoint at iteration 49000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2546.71 31: iteration 49010/ 173500 | consumed samples: 12546560 | consumed tokens: 25695354880 | elapsed time per iteration (s): 1.08 | learning rate: 1.684E-04 | global batch size: 256 | lm loss: 2.061480E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.022 | TFLOPs: 14.40 | 31: iteration 49020/ 173500 | consumed samples: 12549120 | consumed tokens: 25700597760 | elapsed time per iteration (s): 0.79 | learning rate: 1.684E-04 | global batch size: 256 | lm loss: 2.070891E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.896 | TFLOPs: 19.53 | 31: iteration 49030/ 173500 | consumed samples: 12551680 | consumed tokens: 25705840640 | elapsed time per iteration (s): 0.81 | learning rate: 1.684E-04 | global batch size: 256 | lm loss: 2.086358E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.292 | TFLOPs: 19.20 | 31: iteration 49040/ 
173500 | consumed samples: 12554240 | consumed tokens: 25711083520 | elapsed time per iteration (s): 0.81 | learning rate: 1.684E-04 | global batch size: 256 | lm loss: 2.046134E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.565 | TFLOPs: 19.09 | 31: iteration 49050/ 173500 | consumed samples: 12556800 | consumed tokens: 25716326400 | elapsed time per iteration (s): 0.80 | learning rate: 1.684E-04 | global batch size: 256 | lm loss: 2.081684E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.421 | TFLOPs: 19.38 | 31: iteration 49060/ 173500 | consumed samples: 12559360 | consumed tokens: 25721569280 | elapsed time per iteration (s): 0.82 | learning rate: 1.683E-04 | global batch size: 256 | lm loss: 2.074692E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.381 | TFLOPs: 18.78 | 31: iteration 49070/ 173500 | consumed samples: 12561920 | consumed tokens: 25726812160 | elapsed time per iteration (s): 0.80 | learning rate: 1.683E-04 | global batch size: 256 | lm loss: 2.038719E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.352 | TFLOPs: 19.32 | 31: iteration 49080/ 173500 | consumed samples: 12564480 | consumed tokens: 25732055040 | elapsed time per iteration (s): 0.87 | learning rate: 1.683E-04 | global batch size: 256 | lm loss: 2.081491E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.367 | TFLOPs: 17.81 | 31: iteration 49090/ 173500 | consumed samples: 12567040 | consumed tokens: 25737297920 | elapsed time per iteration (s): 0.80 | learning rate: 1.683E-04 | global batch size: 256 | lm loss: 2.104376E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 318.597 | TFLOPs: 19.27 | 31: iteration 49100/ 173500 | consumed samples: 12569600 | consumed tokens: 25742540800 | elapsed time per iteration (s): 0.80 | learning rate: 1.683E-04 | global batch size: 256 | lm loss: 2.075347E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.936 | TFLOPs: 19.29 | 31: iteration 49110/ 173500 | consumed samples: 12572160 | consumed tokens: 25747783680 | elapsed time per iteration (s): 0.80 | learning rate: 1.683E-04 | global batch size: 256 | lm loss: 2.048257E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.745 | TFLOPs: 19.28 | 31: iteration 49120/ 173500 | consumed samples: 12574720 | consumed tokens: 25753026560 | elapsed time per iteration (s): 0.81 | learning rate: 1.683E-04 | global batch size: 256 | lm loss: 2.063065E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.369 | TFLOPs: 19.14 | 31: iteration 49130/ 173500 | consumed samples: 12577280 | consumed tokens: 25758269440 | elapsed time per iteration (s): 0.76 | learning rate: 1.683E-04 | global batch size: 256 | lm loss: 2.086271E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.812 | TFLOPs: 20.26 | 31: iteration 49140/ 173500 | consumed samples: 12579840 | consumed tokens: 25763512320 | elapsed time per iteration (s): 0.83 | learning rate: 1.682E-04 | global batch size: 256 | lm loss: 2.046295E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.040 | TFLOPs: 18.76 | 31: iteration 49150/ 173500 | consumed samples: 12582400 | consumed tokens: 25768755200 | elapsed time per iteration (s): 0.77 | learning rate: 
1.682E-04 | global batch size: 256 | lm loss: 2.074380E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.240 | TFLOPs: 20.10 | 31: iteration 49160/ 173500 | consumed samples: 12584960 | consumed tokens: 25773998080 | elapsed time per iteration (s): 0.82 | learning rate: 1.682E-04 | global batch size: 256 | lm loss: 2.064101E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.843 | TFLOPs: 18.81 | 31: iteration 49170/ 173500 | consumed samples: 12587520 | consumed tokens: 25779240960 | elapsed time per iteration (s): 0.77 | learning rate: 1.682E-04 | global batch size: 256 | lm loss: 2.058323E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.524 | TFLOPs: 20.06 | 31: iteration 49180/ 173500 | consumed samples: 12590080 | consumed tokens: 25784483840 | elapsed time per iteration (s): 0.82 | learning rate: 1.682E-04 | global batch size: 256 | lm loss: 2.086045E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.107 | TFLOPs: 18.94 | 31: iteration 49190/ 173500 | consumed samples: 12592640 | consumed tokens: 25789726720 | elapsed time per iteration (s): 0.79 | learning rate: 1.682E-04 | global batch size: 256 | lm loss: 2.060579E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.254 | TFLOPs: 19.62 | 31: iteration 49200/ 173500 | consumed samples: 12595200 | consumed tokens: 25794969600 | elapsed time per iteration (s): 0.81 | learning rate: 1.682E-04 | global batch size: 256 | lm loss: 2.067708E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.031 | TFLOPs: 19.06 | 31: iteration 49210/ 173500 | 
consumed samples: 12597760 | consumed tokens: 25800212480 | elapsed time per iteration (s): 0.87 | learning rate: 1.682E-04 | global batch size: 256 | lm loss: 2.092571E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.869 | TFLOPs: 17.72 | 31: iteration 49220/ 173500 | consumed samples: 12600320 | consumed tokens: 25805455360 | elapsed time per iteration (s): 0.82 | learning rate: 1.681E-04 | global batch size: 256 | lm loss: 2.041788E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.021 | TFLOPs: 18.94 | 31: iteration 49230/ 173500 | consumed samples: 12602880 | consumed tokens: 25810698240 | elapsed time per iteration (s): 0.82 | learning rate: 1.681E-04 | global batch size: 256 | lm loss: 2.087595E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.642 | TFLOPs: 18.85 | 31: iteration 49240/ 173500 | consumed samples: 12605440 | consumed tokens: 25815941120 | elapsed time per iteration (s): 0.81 | learning rate: 1.681E-04 | global batch size: 256 | lm loss: 2.070215E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.035 | TFLOPs: 19.12 | 31: iteration 49250/ 173500 | consumed samples: 12608000 | consumed tokens: 25821184000 | elapsed time per iteration (s): 0.79 | learning rate: 1.681E-04 | global batch size: 256 | lm loss: 2.050054E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.372 | TFLOPs: 19.50 | 31: iteration 49260/ 173500 | consumed samples: 12610560 | consumed tokens: 25826426880 | elapsed time per iteration (s): 0.76 | learning rate: 1.681E-04 | global batch size: 256 | lm loss: 2.063173E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 334.769 | TFLOPs: 20.25 | 31: iteration 49270/ 173500 | consumed samples: 12613120 | consumed tokens: 25831669760 | elapsed time per iteration (s): 0.79 | learning rate: 1.681E-04 | global batch size: 256 | lm loss: 2.023780E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.407 | TFLOPs: 19.63 | 31: iteration 49280/ 173500 | consumed samples: 12615680 | consumed tokens: 25836912640 | elapsed time per iteration (s): 0.78 | learning rate: 1.681E-04 | global batch size: 256 | lm loss: 2.094818E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.309 | TFLOPs: 19.74 | 31: iteration 49290/ 173500 | consumed samples: 12618240 | consumed tokens: 25842155520 | elapsed time per iteration (s): 0.76 | learning rate: 1.680E-04 | global batch size: 256 | lm loss: 2.067538E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.243 | TFLOPs: 20.34 | 31: iteration 49300/ 173500 | consumed samples: 12620800 | consumed tokens: 25847398400 | elapsed time per iteration (s): 0.82 | learning rate: 1.680E-04 | global batch size: 256 | lm loss: 2.089188E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.424 | TFLOPs: 18.96 | 31: iteration 49310/ 173500 | consumed samples: 12623360 | consumed tokens: 25852641280 | elapsed time per iteration (s): 0.80 | learning rate: 1.680E-04 | global batch size: 256 | lm loss: 2.066055E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.264 | TFLOPs: 19.31 | 31: iteration 49320/ 173500 | consumed samples: 12625920 | consumed tokens: 25857884160 | elapsed time per iteration (s): 0.80 | learning rate: 1.680E-04 | global batch 
size: 256 | lm loss: 2.089358E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.650 | TFLOPs: 19.28 | 31: iteration 49330/ 173500 | consumed samples: 12628480 | consumed tokens: 25863127040 | elapsed time per iteration (s): 0.78 | learning rate: 1.680E-04 | global batch size: 256 | lm loss: 2.075049E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.633 | TFLOPs: 19.76 | 31: iteration 49340/ 173500 | consumed samples: 12631040 | consumed tokens: 25868369920 | elapsed time per iteration (s): 0.82 | learning rate: 1.680E-04 | global batch size: 256 | lm loss: 2.062962E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.587 | TFLOPs: 18.85 | 31: iteration 49350/ 173500 | consumed samples: 12633600 | consumed tokens: 25873612800 | elapsed time per iteration (s): 0.81 | learning rate: 1.680E-04 | global batch size: 256 | lm loss: 2.067865E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.733 | TFLOPs: 19.10 | 31: iteration 49360/ 173500 | consumed samples: 12636160 | consumed tokens: 25878855680 | elapsed time per iteration (s): 0.85 | learning rate: 1.680E-04 | global batch size: 256 | lm loss: 2.084959E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.035 | TFLOPs: 18.27 | 31: iteration 49370/ 173500 | consumed samples: 12638720 | consumed tokens: 25884098560 | elapsed time per iteration (s): 0.83 | learning rate: 1.679E-04 | global batch size: 256 | lm loss: 2.075396E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.435 | TFLOPs: 18.66 | 31: iteration 49380/ 173500 | consumed samples: 12641280 | 
consumed tokens: 25889341440 | elapsed time per iteration (s): 0.80 | learning rate: 1.679E-04 | global batch size: 256 | lm loss: 2.087646E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.458 | TFLOPs: 19.45 | 31: iteration 49390/ 173500 | consumed samples: 12643840 | consumed tokens: 25894584320 | elapsed time per iteration (s): 0.82 | learning rate: 1.679E-04 | global batch size: 256 | lm loss: 2.081406E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.364 | TFLOPs: 18.78 | 31: iteration 49400/ 173500 | consumed samples: 12646400 | consumed tokens: 25899827200 | elapsed time per iteration (s): 0.80 | learning rate: 1.679E-04 | global batch size: 256 | lm loss: 2.068433E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.672 | TFLOPs: 19.46 | 31: iteration 49410/ 173500 | consumed samples: 12648960 | consumed tokens: 25905070080 | elapsed time per iteration (s): 0.81 | learning rate: 1.679E-04 | global batch size: 256 | lm loss: 2.069930E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.879 | TFLOPs: 19.23 | 31: iteration 49420/ 173500 | consumed samples: 12651520 | consumed tokens: 25910312960 | elapsed time per iteration (s): 0.78 | learning rate: 1.679E-04 | global batch size: 256 | lm loss: 2.039603E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.579 | TFLOPs: 19.88 | 31: iteration 49430/ 173500 | consumed samples: 12654080 | consumed tokens: 25915555840 | elapsed time per iteration (s): 0.84 | learning rate: 1.679E-04 | global batch size: 256 | lm loss: 2.097608E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 305.152 | TFLOPs: 18.46 | 31: iteration 49440/ 173500 | consumed samples: 12656640 | consumed tokens: 25920798720 | elapsed time per iteration (s): 0.81 | learning rate: 1.679E-04 | global batch size: 256 | lm loss: 2.094642E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.821 | TFLOPs: 19.11 | 31: iteration 49450/ 173500 | consumed samples: 12659200 | consumed tokens: 25926041600 | elapsed time per iteration (s): 0.78 | learning rate: 1.678E-04 | global batch size: 256 | lm loss: 2.061682E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.194 | TFLOPs: 19.73 | 31: iteration 49460/ 173500 | consumed samples: 12661760 | consumed tokens: 25931284480 | elapsed time per iteration (s): 0.82 | learning rate: 1.678E-04 | global batch size: 256 | lm loss: 2.066176E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.081 | TFLOPs: 18.94 | 31: iteration 49470/ 173500 | consumed samples: 12664320 | consumed tokens: 25936527360 | elapsed time per iteration (s): 0.81 | learning rate: 1.678E-04 | global batch size: 256 | lm loss: 2.077301E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.115 | TFLOPs: 19.06 | 31: iteration 49480/ 173500 | consumed samples: 12666880 | consumed tokens: 25941770240 | elapsed time per iteration (s): 0.80 | learning rate: 1.678E-04 | global batch size: 256 | lm loss: 2.061864E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.638 | TFLOPs: 19.46 | 31: iteration 49490/ 173500 | consumed samples: 12669440 | consumed tokens: 25947013120 | elapsed time per iteration (s): 0.79 | learning rate: 1.678E-04 | global batch size: 256 | lm loss: 
2.071514E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.716 | TFLOPs: 19.52 | 31: iteration 49500/ 173500 | consumed samples: 12672000 | consumed tokens: 25952256000 | elapsed time per iteration (s): 0.77 | learning rate: 1.678E-04 | global batch size: 256 | lm loss: 2.090235E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.069 | TFLOPs: 20.21 | 31: iteration 49510/ 173500 | consumed samples: 12674560 | consumed tokens: 25957498880 | elapsed time per iteration (s): 0.79 | learning rate: 1.678E-04 | global batch size: 256 | lm loss: 2.093445E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.527 | TFLOPs: 19.51 | 31: iteration 49520/ 173500 | consumed samples: 12677120 | consumed tokens: 25962741760 | elapsed time per iteration (s): 0.84 | learning rate: 1.678E-04 | global batch size: 256 | lm loss: 2.067624E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.666 | TFLOPs: 18.49 | 31: iteration 49530/ 173500 | consumed samples: 12679680 | consumed tokens: 25967984640 | elapsed time per iteration (s): 0.81 | learning rate: 1.677E-04 | global batch size: 256 | lm loss: 2.076300E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.442 | TFLOPs: 19.20 | 31: iteration 49540/ 173500 | consumed samples: 12682240 | consumed tokens: 25973227520 | elapsed time per iteration (s): 0.80 | learning rate: 1.677E-04 | global batch size: 256 | lm loss: 2.070330E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.039 | TFLOPs: 19.42 | 31: iteration 49550/ 173500 | consumed samples: 12684800 | consumed tokens: 
25978470400 | elapsed time per iteration (s): 0.76 | learning rate: 1.677E-04 | global batch size: 256 | lm loss: 2.080094E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.063 | TFLOPs: 20.33 | 31: iteration 49560/ 173500 | consumed samples: 12687360 | consumed tokens: 25983713280 | elapsed time per iteration (s): 0.79 | learning rate: 1.677E-04 | global batch size: 256 | lm loss: 2.078461E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.802 | TFLOPs: 19.53 | 31: iteration 49570/ 173500 | consumed samples: 12689920 | consumed tokens: 25988956160 | elapsed time per iteration (s): 0.76 | learning rate: 1.677E-04 | global batch size: 256 | lm loss: 2.051911E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.773 | TFLOPs: 20.49 | 31: iteration 49580/ 173500 | consumed samples: 12692480 | consumed tokens: 25994199040 | elapsed time per iteration (s): 0.75 | learning rate: 1.677E-04 | global batch size: 256 | lm loss: 2.082121E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.986 | TFLOPs: 20.57 | 31: iteration 49590/ 173500 | consumed samples: 12695040 | consumed tokens: 25999441920 | elapsed time per iteration (s): 0.78 | learning rate: 1.677E-04 | global batch size: 256 | lm loss: 2.060288E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.219 | TFLOPs: 19.80 | 31: iteration 49600/ 173500 | consumed samples: 12697600 | consumed tokens: 26004684800 | elapsed time per iteration (s): 0.76 | learning rate: 1.677E-04 | global batch size: 256 | lm loss: 2.078989E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 335.422 | TFLOPs: 20.29 | 31: iteration 49610/ 173500 | consumed samples: 12700160 | consumed tokens: 26009927680 | elapsed time per iteration (s): 0.74 | learning rate: 1.676E-04 | global batch size: 256 | lm loss: 2.066732E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.606 | TFLOPs: 21.03 | 31: iteration 49620/ 173500 | consumed samples: 12702720 | consumed tokens: 26015170560 | elapsed time per iteration (s): 0.83 | learning rate: 1.676E-04 | global batch size: 256 | lm loss: 2.050423E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.434 | TFLOPs: 18.66 | 31: iteration 49630/ 173500 | consumed samples: 12705280 | consumed tokens: 26020413440 | elapsed time per iteration (s): 0.75 | learning rate: 1.676E-04 | global batch size: 256 | lm loss: 2.098831E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.383 | TFLOPs: 20.59 | 31: iteration 49640/ 173500 | consumed samples: 12707840 | consumed tokens: 26025656320 | elapsed time per iteration (s): 0.76 | learning rate: 1.676E-04 | global batch size: 256 | lm loss: 2.060180E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.852 | TFLOPs: 20.38 | 31: iteration 49650/ 173500 | consumed samples: 12710400 | consumed tokens: 26030899200 | elapsed time per iteration (s): 0.81 | learning rate: 1.676E-04 | global batch size: 256 | lm loss: 2.106457E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.320 | TFLOPs: 19.08 | 31: iteration 49660/ 173500 | consumed samples: 12712960 | consumed tokens: 26036142080 | elapsed time per iteration (s): 0.75 | learning rate: 1.676E-04 | global batch size: 256 | lm loss: 2.074706E+00 | grad 
norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.559 | TFLOPs: 20.66 | 31: iteration 49670/ 173500 | consumed samples: 12715520 | consumed tokens: 26041384960 | elapsed time per iteration (s): 0.79 | learning rate: 1.676E-04 | global batch size: 256 | lm loss: 2.071439E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.738 | TFLOPs: 19.52 | 31: iteration 49680/ 173500 | consumed samples: 12718080 | consumed tokens: 26046627840 | elapsed time per iteration (s): 0.78 | learning rate: 1.676E-04 | global batch size: 256 | lm loss: 2.077521E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.399 | TFLOPs: 19.75 | 31: iteration 49690/ 173500 | consumed samples: 12720640 | consumed tokens: 26051870720 | elapsed time per iteration (s): 0.76 | learning rate: 1.675E-04 | global batch size: 256 | lm loss: 2.071586E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.029 | TFLOPs: 20.45 | 31: iteration 49700/ 173500 | consumed samples: 12723200 | consumed tokens: 26057113600 | elapsed time per iteration (s): 0.78 | learning rate: 1.675E-04 | global batch size: 256 | lm loss: 2.055562E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.295 | TFLOPs: 19.74 | 31: iteration 49710/ 173500 | consumed samples: 12725760 | consumed tokens: 26062356480 | elapsed time per iteration (s): 0.79 | learning rate: 1.675E-04 | global batch size: 256 | lm loss: 2.062206E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.218 | TFLOPs: 19.67 | 31: iteration 49720/ 173500 | consumed samples: 12728320 | consumed tokens: 26067599360 | elapsed time 
per iteration (s): 0.83 | learning rate: 1.675E-04 | global batch size: 256 | lm loss: 2.091102E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.060 | TFLOPs: 18.70 | 31: iteration 49730/ 173500 | consumed samples: 12730880 | consumed tokens: 26072842240 | elapsed time per iteration (s): 0.77 | learning rate: 1.675E-04 | global batch size: 256 | lm loss: 2.037978E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.047 | TFLOPs: 20.15 | 31: iteration 49740/ 173500 | consumed samples: 12733440 | consumed tokens: 26078085120 | elapsed time per iteration (s): 0.81 | learning rate: 1.675E-04 | global batch size: 256 | lm loss: 2.038734E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.458 | TFLOPs: 19.14 | 31: iteration 49750/ 173500 | consumed samples: 12736000 | consumed tokens: 26083328000 | elapsed time per iteration (s): 0.87 | learning rate: 1.675E-04 | global batch size: 256 | lm loss: 2.078801E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.281 | TFLOPs: 17.86 | 31: iteration 49760/ 173500 | consumed samples: 12738560 | consumed tokens: 26088570880 | elapsed time per iteration (s): 0.81 | learning rate: 1.675E-04 | global batch size: 256 | lm loss: 2.067605E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.748 | TFLOPs: 19.22 | 31: iteration 49770/ 173500 | consumed samples: 12741120 | consumed tokens: 26093813760 | elapsed time per iteration (s): 0.86 | learning rate: 1.674E-04 | global batch size: 256 | lm loss: 2.060597E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.782 | TFLOPs: 
17.95 | 31: iteration 49780/ 173500 | consumed samples: 12743680 | consumed tokens: 26099056640 | elapsed time per iteration (s): 0.82 | learning rate: 1.674E-04 | global batch size: 256 | lm loss: 2.073991E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.912 | TFLOPs: 18.93 | 31: iteration 49790/ 173500 | consumed samples: 12746240 | consumed tokens: 26104299520 | elapsed time per iteration (s): 0.80 | learning rate: 1.674E-04 | global batch size: 256 | lm loss: 2.111270E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.242 | TFLOPs: 19.31 | 31: iteration 49800/ 173500 | consumed samples: 12748800 | consumed tokens: 26109542400 | elapsed time per iteration (s): 0.86 | learning rate: 1.674E-04 | global batch size: 256 | lm loss: 2.039198E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.743 | TFLOPs: 18.07 | 31: iteration 49810/ 173500 | consumed samples: 12751360 | consumed tokens: 26114785280 | elapsed time per iteration (s): 1.09 | learning rate: 1.674E-04 | global batch size: 256 | lm loss: 2.072867E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.219 | TFLOPs: 14.23 | 31: iteration 49820/ 173500 | consumed samples: 12753920 | consumed tokens: 26120028160 | elapsed time per iteration (s): 0.81 | learning rate: 1.674E-04 | global batch size: 256 | lm loss: 2.078032E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.370 | TFLOPs: 19.14 | 31: iteration 49830/ 173500 | consumed samples: 12756480 | consumed tokens: 26125271040 | elapsed time per iteration (s): 0.85 | learning rate: 1.674E-04 | global batch size: 256 | lm loss: 2.067358E+00 | grad norm: 0.129 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.890 | TFLOPs: 18.14 | 31: iteration 49840/ 173500 | consumed samples: 12759040 | consumed tokens: 26130513920 | elapsed time per iteration (s): 0.78 | learning rate: 1.674E-04 | global batch size: 256 | lm loss: 2.084301E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.403 | TFLOPs: 19.87 | 31: iteration 49850/ 173500 | consumed samples: 12761600 | consumed tokens: 26135756800 | elapsed time per iteration (s): 0.82 | learning rate: 1.673E-04 | global batch size: 256 | lm loss: 2.060864E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.093 | TFLOPs: 18.88 | 31: iteration 49860/ 173500 | consumed samples: 12764160 | consumed tokens: 26140999680 | elapsed time per iteration (s): 0.81 | learning rate: 1.673E-04 | global batch size: 256 | lm loss: 2.061746E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.179 | TFLOPs: 19.07 | 31: iteration 49870/ 173500 | consumed samples: 12766720 | consumed tokens: 26146242560 | elapsed time per iteration (s): 0.87 | learning rate: 1.673E-04 | global batch size: 256 | lm loss: 2.090246E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.762 | TFLOPs: 17.83 | 31: iteration 49880/ 173500 | consumed samples: 12769280 | consumed tokens: 26151485440 | elapsed time per iteration (s): 0.81 | learning rate: 1.673E-04 | global batch size: 256 | lm loss: 2.060112E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.805 | TFLOPs: 19.11 | 31: iteration 49890/ 173500 | consumed samples: 12771840 | consumed tokens: 26156728320 | elapsed time per iteration (s): 0.76 | 
learning rate: 1.673E-04 | global batch size: 256 | lm loss: 2.045118E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.332 | TFLOPs: 20.41 | 31: iteration 49900/ 173500 | consumed samples: 12774400 | consumed tokens: 26161971200 | elapsed time per iteration (s): 0.77 | learning rate: 1.673E-04 | global batch size: 256 | lm loss: 2.071464E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.970 | TFLOPs: 20.02 | 31: iteration 49910/ 173500 | consumed samples: 12776960 | consumed tokens: 26167214080 | elapsed time per iteration (s): 0.81 | learning rate: 1.673E-04 | global batch size: 256 | lm loss: 2.091434E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.298 | TFLOPs: 19.20 | 31: iteration 49920/ 173500 | consumed samples: 12779520 | consumed tokens: 26172456960 | elapsed time per iteration (s): 0.79 | learning rate: 1.673E-04 | global batch size: 256 | lm loss: 2.060000E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.855 | TFLOPs: 19.53 | 31: iteration 49930/ 173500 | consumed samples: 12782080 | consumed tokens: 26177699840 | elapsed time per iteration (s): 0.81 | learning rate: 1.672E-04 | global batch size: 256 | lm loss: 2.037285E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.884 | TFLOPs: 19.17 | 31: iteration 49940/ 173500 | consumed samples: 12784640 | consumed tokens: 26182942720 | elapsed time per iteration (s): 0.74 | learning rate: 1.672E-04 | global batch size: 256 | lm loss: 2.065604E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.716 | TFLOPs: 20.85 | 31: iteration 49950/ 
173500 | consumed samples: 12787200 | consumed tokens: 26188185600 | elapsed time per iteration (s): 0.77 | learning rate: 1.672E-04 | global batch size: 256 | lm loss: 2.080343E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.207 | TFLOPs: 20.10 | 31: iteration 49960/ 173500 | consumed samples: 12789760 | consumed tokens: 26193428480 | elapsed time per iteration (s): 0.77 | learning rate: 1.672E-04 | global batch size: 256 | lm loss: 2.101392E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.515 | TFLOPs: 20.12 | 31: iteration 49970/ 173500 | consumed samples: 12792320 | consumed tokens: 26198671360 | elapsed time per iteration (s): 0.72 | learning rate: 1.672E-04 | global batch size: 256 | lm loss: 2.082409E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.920 | TFLOPs: 21.41 | 31: iteration 49980/ 173500 | consumed samples: 12794880 | consumed tokens: 26203914240 | elapsed time per iteration (s): 0.76 | learning rate: 1.672E-04 | global batch size: 256 | lm loss: 2.061691E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.453 | TFLOPs: 20.35 | 31: iteration 49990/ 173500 | consumed samples: 12797440 | consumed tokens: 26209157120 | elapsed time per iteration (s): 0.83 | learning rate: 1.672E-04 | global batch size: 256 | lm loss: 2.077126E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.497 | TFLOPs: 18.72 | 0: [2022-11-26 05:19:46,951] [INFO] [logging.py:68:log_dist] [Rank 0] step=50000, skipped=0, lr=[0.00016715144913462704, 0.00016715144913462704, 0.00016715144913462704], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 50000/ 173500 | consumed samples: 
12800000 | consumed tokens: 26214400000 | elapsed time per iteration (s): 0.78 | learning rate: 1.672E-04 | global batch size: 256 | lm loss: 2.060461E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.088 | TFLOPs: 19.91 | 0: steps: 50000 loss: 2.0779 iter time (s): 0.813 samples/sec: 314.811 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 50000 | lm loss value: 2.020970E+00 | lm loss PPL: 7.545638E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 50000 to checkpoints_1b1long 0: [2022-11-26 05:19:47,253] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step50000 is begin to save! 0: [2022-11-26 05:19:47,264] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_01-model_00-model_states.pt... 0: [2022-11-26 05:19:47,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_01-model_00-model_states.pt. 0: [2022-11-26 05:19:47,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_03-model_00-model_states.pt... 0: [2022-11-26 05:19:47,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_03-model_00-model_states.pt. 0: [2022-11-26 05:19:47,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_04-model_00-model_states.pt... 0: [2022-11-26 05:19:47,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_04-model_00-model_states.pt. 0: [2022-11-26 05:19:47,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_05-model_00-model_states.pt... 
0: [2022-11-26 05:19:47,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_05-model_00-model_states.pt. 0: [2022-11-26 05:19:47,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_06-model_00-model_states.pt... 0: [2022-11-26 05:19:47,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_06-model_00-model_states.pt. 0: [2022-11-26 05:19:47,817] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_07-model_00-model_states.pt... 0: [2022-11-26 05:19:47,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_07-model_00-model_states.pt. 0: [2022-11-26 05:19:47,901] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_08-model_00-model_states.pt... 0: [2022-11-26 05:19:47,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_08-model_00-model_states.pt. 0: [2022-11-26 05:19:47,976] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_09-model_00-model_states.pt... 0: [2022-11-26 05:19:48,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_09-model_00-model_states.pt. 0: [2022-11-26 05:19:48,052] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_10-model_00-model_states.pt... 0: [2022-11-26 05:19:48,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_10-model_00-model_states.pt. 0: [2022-11-26 05:19:48,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_11-model_00-model_states.pt... 
0: [2022-11-26 05:19:48,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_11-model_00-model_states.pt. 0: [2022-11-26 05:19:48,205] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_12-model_00-model_states.pt... 0: [2022-11-26 05:19:48,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_12-model_00-model_states.pt. 0: [2022-11-26 05:19:48,279] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_13-model_00-model_states.pt... 0: [2022-11-26 05:19:48,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_13-model_00-model_states.pt. 0: [2022-11-26 05:19:48,358] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_14-model_00-model_states.pt... 0: [2022-11-26 05:19:48,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_14-model_00-model_states.pt. 0: [2022-11-26 05:19:48,431] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_15-model_00-model_states.pt... 0: [2022-11-26 05:19:48,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_15-model_00-model_states.pt. 0: [2022-11-26 05:19:48,506] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_16-model_00-model_states.pt... 0: [2022-11-26 05:19:48,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_16-model_00-model_states.pt. 0: [2022-11-26 05:19:48,581] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_17-model_00-model_states.pt... 
0: [2022-11-26 05:19:48,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_17-model_00-model_states.pt. 0: [2022-11-26 05:19:48,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_18-model_00-model_states.pt... 0: [2022-11-26 05:19:48,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_18-model_00-model_states.pt. 0: [2022-11-26 05:19:48,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_19-model_00-model_states.pt... 0: [2022-11-26 05:19:48,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_19-model_00-model_states.pt. 0: [2022-11-26 05:19:48,810] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_20-model_00-model_states.pt... 0: [2022-11-26 05:19:48,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_20-model_00-model_states.pt. 0: [2022-11-26 05:19:48,883] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_21-model_00-model_states.pt... 0: [2022-11-26 05:19:48,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_21-model_00-model_states.pt. 0: [2022-11-26 05:19:48,960] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_22-model_00-model_states.pt... 0: [2022-11-26 05:19:49,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_22-model_00-model_states.pt. 0: [2022-11-26 05:19:49,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_23-model_00-model_states.pt... 
0: [2022-11-26 05:19:49,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_23-model_00-model_states.pt. 0: [2022-11-26 05:19:49,109] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_24-model_00-model_states.pt... 0: [2022-11-26 05:19:49,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_24-model_00-model_states.pt. 0: [2022-11-26 05:19:49,182] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_25-model_00-model_states.pt... 0: [2022-11-26 05:19:49,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_25-model_00-model_states.pt. 0: [2022-11-26 05:19:49,260] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_26-model_00-model_states.pt... 0: [2022-11-26 05:19:49,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_26-model_00-model_states.pt. 0: [2022-11-26 05:19:49,335] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_27-model_00-model_states.pt... 0: [2022-11-26 05:19:49,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_27-model_00-model_states.pt. 0: [2022-11-26 05:19:49,408] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_28-model_00-model_states.pt... 0: [2022-11-26 05:19:49,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_28-model_00-model_states.pt. 0: [2022-11-26 05:19:49,484] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/layer_30-model_00-model_states.pt... 
0: [2022-11-26 05:19:49,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/layer_30-model_00-model_states.pt. 0: [2022-11-26 05:19:49,487] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step50000/mp_rank_00_model_states.pt 0: [2022-11-26 05:19:49,487] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/mp_rank_00_model_states.pt... 0: [2022-11-26 05:19:49,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/mp_rank_00_model_states.pt. 0: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
7: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
10: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 
2: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
15: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 
28: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 
31: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 
21: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
6: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
4: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 
2: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 
15: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 
11: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 
30: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
26: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
9: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
13: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 
11: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 
29: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 
27: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 
13: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 
30: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 
0: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 
27: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 27: [2022-11-26 05:19:49,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:19:49,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:19:49,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 05:19:49,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 19: [2022-11-26 05:19:49,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-26 05:19:49,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 05:19:49,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 11: [2022-11-26 05:19:49,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:19:49,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 05:19:49,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 4: [2022-11-26 05:19:49,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:19:49,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 05:19:49,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 12: [2022-11-26 05:19:49,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:19:49,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 05:19:49,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 22: [2022-11-26 05:19:49,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
22: [2022-11-26 05:19:49,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 05:19:49,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 11: [2022-11-26 05:19:49,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:19:49,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 05:19:49,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 31: [2022-11-26 05:19:49,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 05:19:49,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 05:19:49,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 0: [2022-11-26 05:19:49,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:19:49,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 05:19:49,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 8: [2022-11-26 05:19:49,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 05:19:49,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 05:19:49,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 21: [2022-11-26 05:19:49,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:19:49,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 05:19:49,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 20: [2022-11-26 05:19:49,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 05:19:49,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 05:19:49,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 2: [2022-11-26 05:19:49,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:19:49,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
2: [2022-11-26 05:19:49,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 21: [2022-11-26 05:19:49,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 2: [2022-11-26 05:19:49,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 21: [2022-11-26 05:19:49,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 25: [2022-11-26 05:19:49,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 05:19:49,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 05:19:49,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 7: [2022-11-26 05:19:49,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:19:49,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 05:19:49,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 20: [2022-11-26 05:19:49,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:19:49,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
20: [2022-11-26 05:19:49,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 10: [2022-11-26 05:19:49,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:19:49,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 10: [2022-11-26 05:19:49,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 12: [2022-11-26 05:19:49,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 20: [2022-11-26 05:19:49,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 10: [2022-11-26 05:19:49,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 14: [2022-11-26 05:19:49,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:19:49,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 05:19:49,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 17: [2022-11-26 05:19:49,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:19:49,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
28: [2022-11-26 05:19:49,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:19:49,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 6: [2022-11-26 05:19:49,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 10: [2022-11-26 05:19:49,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 28: [2022-11-26 05:19:49,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 17: [2022-11-26 05:19:49,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 6: [2022-11-26 05:19:49,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 10: [2022-11-26 05:19:49,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 05:19:49,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 28: [2022-11-26 05:19:49,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 31: [2022-11-26 05:19:49,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:19:49,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 05:19:49,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 05:19:49,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 31: [2022-11-26 05:19:49,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 05:19:49,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 19: [2022-11-26 05:19:49,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 31: [2022-11-26 05:19:49,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 05:19:49,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 19: [2022-11-26 05:19:49,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 31: [2022-11-26 05:19:49,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 19: [2022-11-26 05:19:49,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 28: [2022-11-26 05:19:49,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
28: [2022-11-26 05:19:49,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 05:19:49,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 30: [2022-11-26 05:19:49,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:19:49,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 05:19:49,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 30: [2022-11-26 05:19:49,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:19:49,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 05:19:49,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 8: [2022-11-26 05:19:49,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:19:49,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 05:19:49,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 4: [2022-11-26 05:19:49,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 05:19:49,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:19:49,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 05:19:49,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 05:19:49,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 4: [2022-11-26 05:19:49,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 21: [2022-11-26 05:19:49,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:19:49,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 1: [2022-11-26 05:19:49,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:19:49,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 1: [2022-11-26 05:19:49,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-26 05:19:49,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 05:19:49,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 05:19:49,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 1: [2022-11-26 05:19:49,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 26: [2022-11-26 05:19:49,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 05:19:49,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 05:19:49,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 27: [2022-11-26 05:19:49,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:19:49,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 17: [2022-11-26 05:19:49,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:19:49,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 5: [2022-11-26 05:19:49,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 05:19:49,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 17: [2022-11-26 05:19:49,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 5: [2022-11-26 05:19:49,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 17: [2022-11-26 05:19:49,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 5: [2022-11-26 05:19:49,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:19:49,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 05:19:49,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 5: [2022-11-26 05:19:49,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:19:49,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 05:19:49,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 23: [2022-11-26 05:19:49,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 25: [2022-11-26 05:19:49,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
23: [2022-11-26 05:19:49,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 05:19:49,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 25: [2022-11-26 05:19:49,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 20: [2022-11-26 05:19:49,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 25: [2022-11-26 05:19:49,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 20: [2022-11-26 05:19:49,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 05:19:49,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 23: [2022-11-26 05:19:49,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 05:19:49,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 05:19:49,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 25: [2022-11-26 05:19:49,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
25: [2022-11-26 05:19:49,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 05:19:49,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 14: [2022-11-26 05:19:49,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:19:49,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 05:19:49,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 10: [2022-11-26 05:19:49,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:19:49,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 05:19:49,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 27: [2022-11-26 05:19:49,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:19:49,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 19: [2022-11-26 05:19:49,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:19:49,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 
19: [2022-11-26 05:19:49,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 05:19:49,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 20: [2022-11-26 05:19:49,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 05:19:49,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 05:19:49,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 23: [2022-11-26 05:19:49,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 05:19:49,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 05:19:49,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 05:19:49,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 05:19:49,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 23: [2022-11-26 05:19:49,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 22: [2022-11-26 05:19:49,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
12: [2022-11-26 05:19:49,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 22: [2022-11-26 05:19:49,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 12: [2022-11-26 05:19:49,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 22: [2022-11-26 05:19:49,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 12: [2022-11-26 05:19:49,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 30: [2022-11-26 05:19:49,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:19:49,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 05:19:49,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 6: [2022-11-26 05:19:49,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:19:49,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 05:19:49,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 31: [2022-11-26 05:19:49,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 
14: [2022-11-26 05:19:49,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 31: [2022-11-26 05:19:49,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 14: [2022-11-26 05:19:49,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 05:19:49,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 31: [2022-11-26 05:19:49,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 11: [2022-11-26 05:19:49,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:19:49,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 05:19:49,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 11: [2022-11-26 05:19:49,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:19:49,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 27: [2022-11-26 05:19:49,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:19:49,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 
22: [2022-11-26 05:19:49,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 05:19:49,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 27: [2022-11-26 05:19:49,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 22: [2022-11-26 05:19:49,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 27: [2022-11-26 05:19:49,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 4: [2022-11-26 05:19:49,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:19:49,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 05:19:49,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 6: [2022-11-26 05:19:49,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:19:49,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 05:19:49,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 12: [2022-11-26 05:19:49,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 05:19:49,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 05:19:49,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 7: [2022-11-26 05:19:49,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:19:49,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 17: [2022-11-26 05:19:49,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:19:49,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 17: [2022-11-26 05:19:49,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 05:19:49,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 17: [2022-11-26 05:19:49,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:19:49,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 05:19:49,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 0: [2022-11-26 05:19:49,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-26 05:19:49,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 05:19:49,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 8: [2022-11-26 05:19:49,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:19:49,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 05:19:49,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 0: [2022-11-26 05:19:49,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:19:49,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:19:49,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 7: [2022-11-26 05:19:49,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 0: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 7: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 21: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
26: [2022-11-26 05:19:49,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 28: [2022-11-26 05:19:49,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 05:19:49,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:19:49,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 26: [2022-11-26 05:19:49,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 30: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 30: [2022-11-26 05:19:49,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 26: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 30: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 8: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
8: [2022-11-26 05:19:49,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 5: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:19:49,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 24: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:19:49,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 29: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 
8: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 1: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 24: [2022-11-26 05:19:49,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 29: [2022-11-26 05:19:49,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 24: [2022-11-26 05:19:49,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 29: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 29: [2022-11-26 05:19:49,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 24: [2022-11-26 05:19:49,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 05:19:49,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 29: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 24: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 
24: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 24: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 0: [2022-11-26 05:19:49,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 28: [2022-11-26 05:19:49,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 05:19:49,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 28: [2022-11-26 05:19:49,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 19: [2022-11-26 05:19:49,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:19:49,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 05:19:49,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 22: [2022-11-26 05:19:49,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
22: [2022-11-26 05:19:49,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 05:19:49,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 10: [2022-11-26 05:19:49,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:19:49,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 05:19:49,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 26: [2022-11-26 05:19:49,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 05:19:49,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 05:19:49,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 05:19:49,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 26: [2022-11-26 05:19:49,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 05:19:49,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 5: [2022-11-26 05:19:49,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 05:19:49,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 29: [2022-11-26 05:19:49,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:19:49,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 29: [2022-11-26 05:19:49,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 05:19:49,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 14: [2022-11-26 05:19:49,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:19:49,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 25: [2022-11-26 05:19:49,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:19:49,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 25: [2022-11-26 05:19:49,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 05:19:49,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 2: [2022-11-26 05:19:49,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 05:19:49,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:19:49,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 05:19:49,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 05:19:49,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 2: [2022-11-26 05:19:49,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 1: [2022-11-26 05:19:49,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:19:49,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 05:19:49,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 19: [2022-11-26 05:19:49,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:19:49,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 05:19:49,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 2: [2022-11-26 05:19:49,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 05:19:49,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 05:19:49,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 20: [2022-11-26 05:19:49,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 05:19:49,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 05:19:49,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 6: [2022-11-26 05:19:49,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:19:49,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 05:19:49,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 11: [2022-11-26 05:19:49,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:19:49,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 05:19:49,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 27: [2022-11-26 05:19:49,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
27: [2022-11-26 05:19:49,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 05:19:49,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 3: [2022-11-26 05:19:49,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:19:49,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:19:49,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:19:49,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:19:49,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 05:19:49,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 05:19:49,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 15: [2022-11-26 05:19:49,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:19:49,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
3: [2022-11-26 05:19:49,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 05:19:49,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 15: [2022-11-26 05:19:49,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:19:49,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 15: [2022-11-26 05:19:49,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 05:19:49,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 3: [2022-11-26 05:19:49,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 15: [2022-11-26 05:19:49,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 15: [2022-11-26 05:19:49,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 3: [2022-11-26 05:19:49,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 15: [2022-11-26 05:19:49,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 05:19:49,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 21: [2022-11-26 05:19:49,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
21: [2022-11-26 05:19:49,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 05:19:49,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 3: [2022-11-26 05:19:49,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:19:49,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 05:19:49,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 31: [2022-11-26 05:19:49,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 05:19:49,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 05:19:49,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 0: [2022-11-26 05:19:49,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 05:19:49,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 16: [2022-11-26 05:19:49,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 05:19:49,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
16: [2022-11-26 05:19:49,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 05:19:49,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 05:19:49,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 05:19:49,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 05:19:49,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 05:19:49,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 05:19:49,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 16: [2022-11-26 05:19:49,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 16: [2022-11-26 05:19:49,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 16: [2022-11-26 05:19:49,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 18: [2022-11-26 05:19:49,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 05:19:49,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
18: [2022-11-26 05:19:49,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 05:19:49,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 05:19:49,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 05:19:49,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 05:19:49,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 05:19:49,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 05:19:49,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 18: [2022-11-26 05:19:49,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 18: [2022-11-26 05:19:49,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 18: [2022-11-26 05:19:49,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 4: [2022-11-26 05:19:49,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 05:19:49,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 05:19:49,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 2: [2022-11-26 05:19:49,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:19:49,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 05:19:49,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 13: [2022-11-26 05:19:49,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:19:49,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:19:49,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:19:49,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 05:19:49,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 05:19:49,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 05:19:49,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 05:19:49,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 05:19:49,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 13: [2022-11-26 05:19:49,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 13: [2022-11-26 05:19:49,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 13: [2022-11-26 05:19:49,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 9: [2022-11-26 05:19:49,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:19:49,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:19:49,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:19:49,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 05:19:49,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 05:19:49,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 05:19:49,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 05:19:49,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 9: [2022-11-26 05:19:49,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 9: [2022-11-26 05:19:49,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 05:19:49,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 9: [2022-11-26 05:19:49,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 28: [2022-11-26 05:19:49,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 05:19:49,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 7: [2022-11-26 05:19:49,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 05:19:49,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 05:19:49,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 28: [2022-11-26 05:19:49,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 0: [2022-11-26 05:19:49,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:19:49,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 05:19:49,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 9: [2022-11-26 05:19:49,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:19:49,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 05:19:49,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 25: [2022-11-26 05:19:49,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 05:19:49,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 05:19:49,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 
12: [2022-11-26 05:19:49,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:19:49,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 05:19:49,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 10: [2022-11-26 05:19:49,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:19:49,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 05:19:49,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 29: [2022-11-26 05:19:49,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:19:49,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 05:19:49,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 23: [2022-11-26 05:19:49,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 05:19:49,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 05:19:49,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 
17: [2022-11-26 05:19:49,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:19:49,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 05:19:49,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 1: [2022-11-26 05:19:49,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:19:49,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 05:19:49,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 30: [2022-11-26 05:19:49,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:19:49,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 05:19:49,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 8: [2022-11-26 05:19:49,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:19:49,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 05:19:49,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 
27: [2022-11-26 05:19:49,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:19:49,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 05:19:49,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 26: [2022-11-26 05:19:49,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 05:19:49,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 05:19:49,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 24: [2022-11-26 05:19:49,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 05:19:49,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 05:19:49,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 18: [2022-11-26 05:19:49,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 05:19:49,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 05:19:49,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 
14: [2022-11-26 05:19:49,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:19:49,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 05:19:49,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 13: [2022-11-26 05:19:49,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:19:49,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 05:19:49,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 16: [2022-11-26 05:19:49,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 05:19:49,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 05:19:49,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 15: [2022-11-26 05:19:49,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:19:49,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 05:19:49,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 
31: [2022-11-26 05:19:49,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 22: [2022-11-26 05:19:49,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 31: [2022-11-26 05:19:49,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 22: [2022-11-26 05:19:49,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 31: [2022-11-26 05:19:49,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 22: [2022-11-26 05:19:49,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 5: [2022-11-26 05:19:49,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:19:49,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 05:19:49,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 19: [2022-11-26 05:19:49,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:19:49,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
19: [2022-11-26 05:19:49,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 05:19:49,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 11: [2022-11-26 05:19:49,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 05:19:49,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 4: [2022-11-26 05:19:49,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:19:49,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 05:19:49,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 6: [2022-11-26 05:19:49,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:19:49,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 05:19:49,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 28: [2022-11-26 05:19:49,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:19:49,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-26 05:19:49,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 05:19:49,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 7: [2022-11-26 05:19:49,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:19:49,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 05:19:49,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 2: [2022-11-26 05:19:49,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:19:49,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 05:19:49,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 0: [2022-11-26 05:19:49,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:19:49,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 05:19:49,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 
28: [2022-11-26 05:19:49,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 05:19:49,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 10: [2022-11-26 05:19:49,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:19:49,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 05:19:49,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 9: [2022-11-26 05:19:49,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:19:49,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 05:19:49,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 3: [2022-11-26 05:19:49,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:19:49,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 05:19:49,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 25: [2022-11-26 05:19:49,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 
25: [2022-11-26 05:19:49,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 05:19:49,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 23: [2022-11-26 05:19:49,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 05:19:49,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 05:19:49,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 30: [2022-11-26 05:19:49,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:19:49,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 05:19:49,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 17: [2022-11-26 05:19:49,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:19:49,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 05:19:49,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 12: [2022-11-26 05:19:49,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 05:19:49,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 05:19:49,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 1: [2022-11-26 05:19:49,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:19:49,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 05:19:49,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 20: [2022-11-26 05:19:49,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 05:19:49,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 05:19:49,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 29: [2022-11-26 05:19:49,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:19:49,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 05:19:49,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 16: [2022-11-26 05:19:49,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
16: [2022-11-26 05:19:49,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 05:19:49,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 8: [2022-11-26 05:19:49,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:19:49,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 05:19:49,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 18: [2022-11-26 05:19:49,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 05:19:49,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 05:19:49,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 24: [2022-11-26 05:19:49,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 05:19:49,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 05:19:49,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 22: [2022-11-26 05:19:49,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
22: [2022-11-26 05:19:49,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 05:19:49,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 14: [2022-11-26 05:19:49,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:19:49,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 05:19:49,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 26: [2022-11-26 05:19:49,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 05:19:49,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 05:19:49,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 15: [2022-11-26 05:19:49,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:19:49,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 31: [2022-11-26 05:19:49,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:19:49,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 
31: [2022-11-26 05:19:49,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 05:19:49,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 27: [2022-11-26 05:19:49,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:19:49,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 05:19:49,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 13: [2022-11-26 05:19:49,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:19:49,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 05:19:49,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 5: [2022-11-26 05:19:49,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:19:49,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 05:19:49,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 20: [2022-11-26 05:19:49,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
20: [2022-11-26 05:19:49,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 05:19:49,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 4: [2022-11-26 05:19:49,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:19:49,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 05:19:49,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 3: [2022-11-26 05:19:49,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:19:49,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 05:19:49,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 7: [2022-11-26 05:19:49,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:19:49,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 2: [2022-11-26 05:19:49,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:19:49,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 
2: [2022-11-26 05:19:49,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 11: [2022-11-26 05:19:49,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:19:49,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 6: [2022-11-26 05:19:49,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:19:49,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 05:19:49,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 6: [2022-11-26 05:19:49,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 21: [2022-11-26 05:19:49,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:19:49,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 28: [2022-11-26 05:19:49,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
21: [2022-11-26 05:19:49,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 28: [2022-11-26 05:19:49,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 21: [2022-11-26 05:19:49,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 28: [2022-11-26 05:19:49,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 19: [2022-11-26 05:19:49,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:19:49,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 05:19:49,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 25: [2022-11-26 05:19:49,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 05:19:49,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 05:19:49,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 10: [2022-11-26 05:19:49,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-26 05:19:49,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 05:19:49,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 0: [2022-11-26 05:19:49,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:19:49,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 05:19:49,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 9: [2022-11-26 05:19:49,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:19:49,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 05:19:49,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 17: [2022-11-26 05:19:49,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:19:49,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 05:19:49,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 1: [2022-11-26 05:19:49,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 05:19:49,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 05:19:49,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 23: [2022-11-26 05:19:49,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 05:19:49,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 05:19:49,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 30: [2022-11-26 05:19:49,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:19:49,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 05:19:49,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 18: [2022-11-26 05:19:49,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 22: [2022-11-26 05:19:49,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 18: [2022-11-26 05:19:49,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 05:19:49,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 
22: [2022-11-26 05:19:49,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 05:19:49,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 8: [2022-11-26 05:19:49,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:19:49,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 05:19:49,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 12: [2022-11-26 05:19:49,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:19:49,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 05:19:49,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 29: [2022-11-26 05:19:49,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:19:49,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 05:19:49,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 24: [2022-11-26 05:19:49,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
24: [2022-11-26 05:19:49,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 05:19:49,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 19: [2022-11-26 05:19:49,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:19:49,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 05:19:49,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 15: [2022-11-26 05:19:49,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:19:49,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 05:19:49,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 6: [2022-11-26 05:19:49,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 31: [2022-11-26 05:19:49,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 
6: [2022-11-26 05:19:49,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 31: [2022-11-26 05:19:49,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 05:19:49,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 6: [2022-11-26 05:19:49,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 2: [2022-11-26 05:19:49,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:19:49,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 05:19:49,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 26: [2022-11-26 05:19:49,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 05:19:49,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 05:19:49,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 16: [2022-11-26 05:19:49,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
16: [2022-11-26 05:19:49,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 05:19:49,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 28: [2022-11-26 05:19:49,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 20: [2022-11-26 05:19:49,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 28: [2022-11-26 05:19:49,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 05:19:49,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 20: [2022-11-26 05:19:49,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 05:19:49,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 27: [2022-11-26 05:19:49,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:19:49,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 13: [2022-11-26 05:19:49,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 05:19:49,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 05:19:49,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 27: [2022-11-26 05:19:49,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 4: [2022-11-26 05:19:49,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:19:49,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 05:19:49,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 23: [2022-11-26 05:19:49,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 05:19:49,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 05:19:49,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 7: [2022-11-26 05:19:49,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:19:49,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 05:19:49,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 
11: [2022-11-26 05:19:49,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:19:49,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 05:19:49,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 12: [2022-11-26 05:19:49,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:19:49,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 05:19:49,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 21: [2022-11-26 05:19:49,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:19:49,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:19:49,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:19:49,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 05:19:49,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 
0: [2022-11-26 05:19:49,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 3: [2022-11-26 05:19:49,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 0: [2022-11-26 05:19:49,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 3: [2022-11-26 05:19:49,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 25: [2022-11-26 05:19:49,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 18: [2022-11-26 05:19:49,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 05:19:49,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 05:19:49,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 25: [2022-11-26 05:19:49,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 05:19:49,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 22: [2022-11-26 05:19:49,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-26 05:19:49,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 05:19:49,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 5: [2022-11-26 05:19:49,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:19:49,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 05:19:49,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 14: [2022-11-26 05:19:49,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:19:49,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 05:19:49,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 16: [2022-11-26 05:19:49,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 05:19:49,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 05:19:49,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 1: [2022-11-26 05:19:49,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 05:19:49,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 05:19:49,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 9: [2022-11-26 05:19:49,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:19:49,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 05:19:49,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 24: [2022-11-26 05:19:49,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 05:19:49,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 05:19:49,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 8: [2022-11-26 05:19:49,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:19:49,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 05:19:49,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 26: [2022-11-26 05:19:49,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 
26: [2022-11-26 05:19:49,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 05:19:49,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 10: [2022-11-26 05:19:49,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:19:49,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 05:19:49,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 14: [2022-11-26 05:19:49,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:19:49,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 05:19:49,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 30: [2022-11-26 05:19:49,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:19:49,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 05:19:49,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 13: [2022-11-26 05:19:49,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-26 05:19:49,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 05:19:49,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 27: [2022-11-26 05:19:49,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:19:49,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:19:49,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 05:19:49,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 17: [2022-11-26 05:19:49,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 05:19:49,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 29: [2022-11-26 05:19:49,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:19:49,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 
29: [2022-11-26 05:19:49,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 05:19:49,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 05:19:49,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 29: [2022-11-26 05:19:49,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 15: [2022-11-26 05:19:49,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:19:49,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 05:19:49,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 15: [2022-11-26 05:19:49,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:19:49,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step50000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 05:19:49,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 
0: successfully saved checkpoint at iteration 50000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2599.61 31: iteration 50010/ 173500 | consumed samples: 12802560 | consumed tokens: 26219642880 | elapsed time per iteration (s): 1.08 | learning rate: 1.671E-04 | global batch size: 256 | lm loss: 2.074308E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.805 | TFLOPs: 14.39 | 31: iteration 50020/ 173500 | consumed samples: 12805120 | consumed tokens: 26224885760 | elapsed time per iteration (s): 0.82 | learning rate: 1.671E-04 | global batch size: 256 | lm loss: 2.078481E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.263 | TFLOPs: 18.95 | 31: iteration 50030/ 173500 | consumed samples: 12807680 | consumed tokens: 26230128640 | elapsed time per iteration (s): 0.80 | learning rate: 1.671E-04 | global batch size: 256 | lm loss: 2.053047E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.915 | TFLOPs: 19.41 | 31: iteration 50040/ 173500 | consumed samples: 12810240 | consumed tokens: 26235371520 | elapsed time per iteration (s): 0.87 | learning rate: 1.671E-04 | global batch size: 256 | lm loss: 2.059613E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.525 | TFLOPs: 17.76 | 31: iteration 50050/ 173500 | consumed samples: 12812800 | consumed tokens: 26240614400 | elapsed time per iteration (s): 0.81 | learning rate: 1.671E-04 | global batch size: 256 | lm loss: 2.051600E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.086 | TFLOPs: 19.18 | 31: iteration 50060/ 173500 | consumed samples: 12815360 | consumed tokens: 26245857280 | elapsed time per iteration (s): 0.82 | 
learning rate: 1.671E-04 | global batch size: 256 | lm loss: 2.083798E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.595 | TFLOPs: 18.91 | 31: iteration 50070/ 173500 | consumed samples: 12817920 | consumed tokens: 26251100160 | elapsed time per iteration (s): 0.79 | learning rate: 1.671E-04 | global batch size: 256 | lm loss: 2.054733E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.595 | TFLOPs: 19.58 | 31: iteration 50080/ 173500 | consumed samples: 12820480 | consumed tokens: 26256343040 | elapsed time per iteration (s): 0.76 | learning rate: 1.670E-04 | global batch size: 256 | lm loss: 2.053480E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.715 | TFLOPs: 20.25 | 31: iteration 50090/ 173500 | consumed samples: 12823040 | consumed tokens: 26261585920 | elapsed time per iteration (s): 0.78 | learning rate: 1.670E-04 | global batch size: 256 | lm loss: 2.115560E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.579 | TFLOPs: 19.94 | 31: iteration 50100/ 173500 | consumed samples: 12825600 | consumed tokens: 26266828800 | elapsed time per iteration (s): 0.79 | learning rate: 1.670E-04 | global batch size: 256 | lm loss: 2.071266E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.803 | TFLOPs: 19.65 | 31: iteration 50110/ 173500 | consumed samples: 12828160 | consumed tokens: 26272071680 | elapsed time per iteration (s): 0.84 | learning rate: 1.670E-04 | global batch size: 256 | lm loss: 2.038870E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.405 | TFLOPs: 18.48 | 31: iteration 50120/ 
173500 | consumed samples: 12830720 | consumed tokens: 26277314560 | elapsed time per iteration (s): 0.97 | learning rate: 1.670E-04 | global batch size: 256 | lm loss: 2.051837E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 264.561 | TFLOPs: 16.01 | 31: iteration 50130/ 173500 | consumed samples: 12833280 | consumed tokens: 26282557440 | elapsed time per iteration (s): 0.87 | learning rate: 1.670E-04 | global batch size: 256 | lm loss: 2.091274E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.688 | TFLOPs: 17.83 | 31: iteration 50140/ 173500 | consumed samples: 12835840 | consumed tokens: 26287800320 | elapsed time per iteration (s): 0.77 | learning rate: 1.670E-04 | global batch size: 256 | lm loss: 2.063500E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.603 | TFLOPs: 20.00 | 31: iteration 50150/ 173500 | consumed samples: 12838400 | consumed tokens: 26293043200 | elapsed time per iteration (s): 0.79 | learning rate: 1.670E-04 | global batch size: 256 | lm loss: 2.076608E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.830 | TFLOPs: 19.59 | 31: iteration 50160/ 173500 | consumed samples: 12840960 | consumed tokens: 26298286080 | elapsed time per iteration (s): 0.76 | learning rate: 1.669E-04 | global batch size: 256 | lm loss: 2.061987E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.498 | TFLOPs: 20.30 | 31: iteration 50170/ 173500 | consumed samples: 12843520 | consumed tokens: 26303528960 | elapsed time per iteration (s): 0.75 | learning rate: 1.669E-04 | global batch size: 256 | lm loss: 2.040565E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 341.811 | TFLOPs: 20.68 | 31: iteration 50180/ 173500 | consumed samples: 12846080 | consumed tokens: 26308771840 | elapsed time per iteration (s): 0.75 | learning rate: 1.669E-04 | global batch size: 256 | lm loss: 2.058841E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.617 | TFLOPs: 20.73 | 31: iteration 50190/ 173500 | consumed samples: 12848640 | consumed tokens: 26314014720 | elapsed time per iteration (s): 0.77 | learning rate: 1.669E-04 | global batch size: 256 | lm loss: 2.044112E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.487 | TFLOPs: 20.24 | 31: iteration 50200/ 173500 | consumed samples: 12851200 | consumed tokens: 26319257600 | elapsed time per iteration (s): 0.76 | learning rate: 1.669E-04 | global batch size: 256 | lm loss: 2.069565E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.617 | TFLOPs: 20.42 | 31: iteration 50210/ 173500 | consumed samples: 12853760 | consumed tokens: 26324500480 | elapsed time per iteration (s): 0.82 | learning rate: 1.669E-04 | global batch size: 256 | lm loss: 2.062761E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.305 | TFLOPs: 18.89 | 31: iteration 50220/ 173500 | consumed samples: 12856320 | consumed tokens: 26329743360 | elapsed time per iteration (s): 0.77 | learning rate: 1.669E-04 | global batch size: 256 | lm loss: 2.055452E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.814 | TFLOPs: 20.19 | 31: iteration 50230/ 173500 | consumed samples: 12858880 | consumed tokens: 26334986240 | elapsed time per iteration (s): 0.79 | learning rate: 
1.669E-04 | global batch size: 256 | lm loss: 2.080462E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.463 | TFLOPs: 19.69 | 31: iteration 50240/ 173500 | consumed samples: 12861440 | consumed tokens: 26340229120 | elapsed time per iteration (s): 0.78 | learning rate: 1.668E-04 | global batch size: 256 | lm loss: 2.067188E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.275 | TFLOPs: 19.80 | 31: iteration 50250/ 173500 | consumed samples: 12864000 | consumed tokens: 26345472000 | elapsed time per iteration (s): 0.75 | learning rate: 1.668E-04 | global batch size: 256 | lm loss: 2.076609E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.955 | TFLOPs: 20.69 | 31: iteration 50260/ 173500 | consumed samples: 12866560 | consumed tokens: 26350714880 | elapsed time per iteration (s): 0.78 | learning rate: 1.668E-04 | global batch size: 256 | lm loss: 2.049074E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.063 | TFLOPs: 19.79 | 31: iteration 50270/ 173500 | consumed samples: 12869120 | consumed tokens: 26355957760 | elapsed time per iteration (s): 0.79 | learning rate: 1.668E-04 | global batch size: 256 | lm loss: 2.069538E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.640 | TFLOPs: 19.58 | 31: iteration 50280/ 173500 | consumed samples: 12871680 | consumed tokens: 26361200640 | elapsed time per iteration (s): 0.80 | learning rate: 1.668E-04 | global batch size: 256 | lm loss: 2.083287E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.126 | TFLOPs: 19.31 | 31: iteration 50290/ 173500 | 
consumed samples: 12874240 | consumed tokens: 26366443520 | elapsed time per iteration (s): 0.83 | learning rate: 1.668E-04 | global batch size: 256 | lm loss: 2.067527E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.586 | TFLOPs: 18.73 | 31: iteration 50300/ 173500 | consumed samples: 12876800 | consumed tokens: 26371686400 | elapsed time per iteration (s): 0.83 | learning rate: 1.668E-04 | global batch size: 256 | lm loss: 2.043981E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.369 | TFLOPs: 18.66 | 31: iteration 50310/ 173500 | consumed samples: 12879360 | consumed tokens: 26376929280 | elapsed time per iteration (s): 0.81 | learning rate: 1.668E-04 | global batch size: 256 | lm loss: 2.070464E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.661 | TFLOPs: 19.22 | 31: iteration 50320/ 173500 | consumed samples: 12881920 | consumed tokens: 26382172160 | elapsed time per iteration (s): 0.81 | learning rate: 1.667E-04 | global batch size: 256 | lm loss: 2.080096E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.313 | TFLOPs: 19.08 | 31: iteration 50330/ 173500 | consumed samples: 12884480 | consumed tokens: 26387415040 | elapsed time per iteration (s): 0.81 | learning rate: 1.667E-04 | global batch size: 256 | lm loss: 2.043864E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.648 | TFLOPs: 19.22 | 31: iteration 50340/ 173500 | consumed samples: 12887040 | consumed tokens: 26392657920 | elapsed time per iteration (s): 0.82 | learning rate: 1.667E-04 | global batch size: 256 | lm loss: 2.053514E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 310.472 | TFLOPs: 18.78 | 31: iteration 50350/ 173500 | consumed samples: 12889600 | consumed tokens: 26397900800 | elapsed time per iteration (s): 0.81 | learning rate: 1.667E-04 | global batch size: 256 | lm loss: 2.047704E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.233 | TFLOPs: 19.01 | 31: iteration 50360/ 173500 | consumed samples: 12892160 | consumed tokens: 26403143680 | elapsed time per iteration (s): 0.78 | learning rate: 1.667E-04 | global batch size: 256 | lm loss: 2.098970E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.966 | TFLOPs: 19.90 | 31: iteration 50370/ 173500 | consumed samples: 12894720 | consumed tokens: 26408386560 | elapsed time per iteration (s): 0.84 | learning rate: 1.667E-04 | global batch size: 256 | lm loss: 2.070057E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.354 | TFLOPs: 18.47 | 31: iteration 50380/ 173500 | consumed samples: 12897280 | consumed tokens: 26413629440 | elapsed time per iteration (s): 0.85 | learning rate: 1.667E-04 | global batch size: 256 | lm loss: 2.104499E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.254 | TFLOPs: 18.16 | 31: iteration 50390/ 173500 | consumed samples: 12899840 | consumed tokens: 26418872320 | elapsed time per iteration (s): 0.80 | learning rate: 1.667E-04 | global batch size: 256 | lm loss: 2.056684E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.551 | TFLOPs: 19.27 | 31: iteration 50400/ 173500 | consumed samples: 12902400 | consumed tokens: 26424115200 | elapsed time per iteration (s): 0.82 | learning rate: 1.666E-04 | global batch 
size: 256 | lm loss: 2.082304E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.137 | TFLOPs: 18.82 | 31: iteration 50410/ 173500 | consumed samples: 12904960 | consumed tokens: 26429358080 | elapsed time per iteration (s): 0.79 | learning rate: 1.666E-04 | global batch size: 256 | lm loss: 2.074650E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.843 | TFLOPs: 19.65 | 31: iteration 50420/ 173500 | consumed samples: 12907520 | consumed tokens: 26434600960 | elapsed time per iteration (s): 0.81 | learning rate: 1.666E-04 | global batch size: 256 | lm loss: 2.079010E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.999 | TFLOPs: 19.24 | 31: iteration 50430/ 173500 | consumed samples: 12910080 | consumed tokens: 26439843840 | elapsed time per iteration (s): 0.82 | learning rate: 1.666E-04 | global batch size: 256 | lm loss: 2.084490E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.870 | TFLOPs: 18.93 | 31: iteration 50440/ 173500 | consumed samples: 12912640 | consumed tokens: 26445086720 | elapsed time per iteration (s): 0.79 | learning rate: 1.666E-04 | global batch size: 256 | lm loss: 2.070736E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.224 | TFLOPs: 19.55 | 31: iteration 50450/ 173500 | consumed samples: 12915200 | consumed tokens: 26450329600 | elapsed time per iteration (s): 0.85 | learning rate: 1.666E-04 | global batch size: 256 | lm loss: 2.062753E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.364 | TFLOPs: 18.23 | 31: iteration 50460/ 173500 | consumed samples: 12917760 | 
consumed tokens: 26455572480 | elapsed time per iteration (s): 0.80 | learning rate: 1.666E-04 | global batch size: 256 | lm loss: 2.088299E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.570 | TFLOPs: 19.45 | 31: iteration 50470/ 173500 | consumed samples: 12920320 | consumed tokens: 26460815360 | elapsed time per iteration (s): 0.84 | learning rate: 1.666E-04 | global batch size: 256 | lm loss: 2.093860E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.849 | TFLOPs: 18.44 | 31: iteration 50480/ 173500 | consumed samples: 12922880 | consumed tokens: 26466058240 | elapsed time per iteration (s): 0.86 | learning rate: 1.665E-04 | global batch size: 256 | lm loss: 2.090698E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.107 | TFLOPs: 18.10 | 31: iteration 50490/ 173500 | consumed samples: 12925440 | consumed tokens: 26471301120 | elapsed time per iteration (s): 0.81 | learning rate: 1.665E-04 | global batch size: 256 | lm loss: 2.057570E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.870 | TFLOPs: 19.05 | 31: iteration 50500/ 173500 | consumed samples: 12928000 | consumed tokens: 26476544000 | elapsed time per iteration (s): 0.87 | learning rate: 1.665E-04 | global batch size: 256 | lm loss: 2.056224E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.839 | TFLOPs: 17.72 | 31: iteration 50510/ 173500 | consumed samples: 12930560 | consumed tokens: 26481786880 | elapsed time per iteration (s): 0.82 | learning rate: 1.665E-04 | global batch size: 256 | lm loss: 2.072834E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 312.243 | TFLOPs: 18.89 | 31: iteration 50520/ 173500 | consumed samples: 12933120 | consumed tokens: 26487029760 | elapsed time per iteration (s): 0.98 | learning rate: 1.665E-04 | global batch size: 256 | lm loss: 2.077918E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 260.918 | TFLOPs: 15.78 | 31: iteration 50530/ 173500 | consumed samples: 12935680 | consumed tokens: 26492272640 | elapsed time per iteration (s): 0.79 | learning rate: 1.665E-04 | global batch size: 256 | lm loss: 2.096666E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.140 | TFLOPs: 19.61 | 31: iteration 50540/ 173500 | consumed samples: 12938240 | consumed tokens: 26497515520 | elapsed time per iteration (s): 0.92 | learning rate: 1.665E-04 | global batch size: 256 | lm loss: 2.055354E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 279.168 | TFLOPs: 16.89 | 31: iteration 50550/ 173500 | consumed samples: 12940800 | consumed tokens: 26502758400 | elapsed time per iteration (s): 0.83 | learning rate: 1.664E-04 | global batch size: 256 | lm loss: 2.046845E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.073 | TFLOPs: 18.76 | 31: iteration 50560/ 173500 | consumed samples: 12943360 | consumed tokens: 26508001280 | elapsed time per iteration (s): 0.82 | learning rate: 1.664E-04 | global batch size: 256 | lm loss: 2.074164E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.937 | TFLOPs: 18.99 | 31: iteration 50570/ 173500 | consumed samples: 12945920 | consumed tokens: 26513244160 | elapsed time per iteration (s): 0.80 | learning rate: 1.664E-04 | global batch size: 256 | lm loss: 
2.072759E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.347 | TFLOPs: 19.38 | 31: iteration 50580/ 173500 | consumed samples: 12948480 | consumed tokens: 26518487040 | elapsed time per iteration (s): 0.82 | learning rate: 1.664E-04 | global batch size: 256 | lm loss: 2.063224E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.728 | TFLOPs: 18.86 | 31: iteration 50590/ 173500 | consumed samples: 12951040 | consumed tokens: 26523729920 | elapsed time per iteration (s): 0.80 | learning rate: 1.664E-04 | global batch size: 256 | lm loss: 2.082307E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.946 | TFLOPs: 19.30 | 31: iteration 50600/ 173500 | consumed samples: 12953600 | consumed tokens: 26528972800 | elapsed time per iteration (s): 0.78 | learning rate: 1.664E-04 | global batch size: 256 | lm loss: 2.090562E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.282 | TFLOPs: 19.92 | 31: iteration 50610/ 173500 | consumed samples: 12956160 | consumed tokens: 26534215680 | elapsed time per iteration (s): 0.78 | learning rate: 1.664E-04 | global batch size: 256 | lm loss: 2.059521E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.466 | TFLOPs: 19.81 | 31: iteration 50620/ 173500 | consumed samples: 12958720 | consumed tokens: 26539458560 | elapsed time per iteration (s): 0.81 | learning rate: 1.664E-04 | global batch size: 256 | lm loss: 2.071564E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.379 | TFLOPs: 19.08 | 31: iteration 50630/ 173500 | consumed samples: 12961280 | consumed tokens: 
26544701440 | elapsed time per iteration (s): 0.82 | learning rate: 1.663E-04 | global batch size: 256 | lm loss: 2.070759E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.030 | TFLOPs: 18.82 | 31: iteration 50640/ 173500 | consumed samples: 12963840 | consumed tokens: 26549944320 | elapsed time per iteration (s): 0.83 | learning rate: 1.663E-04 | global batch size: 256 | lm loss: 2.083346E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.998 | TFLOPs: 18.57 | 31: iteration 50650/ 173500 | consumed samples: 12966400 | consumed tokens: 26555187200 | elapsed time per iteration (s): 0.82 | learning rate: 1.663E-04 | global batch size: 256 | lm loss: 2.074596E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.905 | TFLOPs: 18.93 | 31: iteration 50660/ 173500 | consumed samples: 12968960 | consumed tokens: 26560430080 | elapsed time per iteration (s): 0.84 | learning rate: 1.663E-04 | global batch size: 256 | lm loss: 2.057098E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.868 | TFLOPs: 18.38 | 31: iteration 50670/ 173500 | consumed samples: 12971520 | consumed tokens: 26565672960 | elapsed time per iteration (s): 0.84 | learning rate: 1.663E-04 | global batch size: 256 | lm loss: 2.065356E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.424 | TFLOPs: 18.36 | 31: iteration 50680/ 173500 | consumed samples: 12974080 | consumed tokens: 26570915840 | elapsed time per iteration (s): 0.81 | learning rate: 1.663E-04 | global batch size: 256 | lm loss: 2.062576E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 315.836 | TFLOPs: 19.11 | 31: iteration 50690/ 173500 | consumed samples: 12976640 | consumed tokens: 26576158720 | elapsed time per iteration (s): 0.83 | learning rate: 1.663E-04 | global batch size: 256 | lm loss: 2.039168E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.613 | TFLOPs: 18.67 | 31: iteration 50700/ 173500 | consumed samples: 12979200 | consumed tokens: 26581401600 | elapsed time per iteration (s): 0.79 | learning rate: 1.663E-04 | global batch size: 256 | lm loss: 2.055192E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.419 | TFLOPs: 19.69 | 31: iteration 50710/ 173500 | consumed samples: 12981760 | consumed tokens: 26586644480 | elapsed time per iteration (s): 0.82 | learning rate: 1.662E-04 | global batch size: 256 | lm loss: 2.061217E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.976 | TFLOPs: 18.87 | 31: iteration 50720/ 173500 | consumed samples: 12984320 | consumed tokens: 26591887360 | elapsed time per iteration (s): 0.79 | learning rate: 1.662E-04 | global batch size: 256 | lm loss: 2.075177E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.537 | TFLOPs: 19.63 | 31: iteration 50730/ 173500 | consumed samples: 12986880 | consumed tokens: 26597130240 | elapsed time per iteration (s): 0.85 | learning rate: 1.662E-04 | global batch size: 256 | lm loss: 2.050445E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.605 | TFLOPs: 18.25 | 31: iteration 50740/ 173500 | consumed samples: 12989440 | consumed tokens: 26602373120 | elapsed time per iteration (s): 0.83 | learning rate: 1.662E-04 | global batch size: 256 | lm loss: 2.067668E+00 | grad 
norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.976 | TFLOPs: 18.69 | 31: iteration 50750/ 173500 | consumed samples: 12992000 | consumed tokens: 26607616000 | elapsed time per iteration (s): 0.80 | learning rate: 1.662E-04 | global batch size: 256 | lm loss: 2.049501E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.299 | TFLOPs: 19.38 | 31: iteration 50760/ 173500 | consumed samples: 12994560 | consumed tokens: 26612858880 | elapsed time per iteration (s): 0.85 | learning rate: 1.662E-04 | global batch size: 256 | lm loss: 2.085094E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.058 | TFLOPs: 18.21 | 31: iteration 50770/ 173500 | consumed samples: 12997120 | consumed tokens: 26618101760 | elapsed time per iteration (s): 0.80 | learning rate: 1.662E-04 | global batch size: 256 | lm loss: 2.089438E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.280 | TFLOPs: 19.44 | 31: iteration 50780/ 173500 | consumed samples: 12999680 | consumed tokens: 26623344640 | elapsed time per iteration (s): 0.81 | learning rate: 1.662E-04 | global batch size: 256 | lm loss: 2.063142E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.368 | TFLOPs: 19.14 | 31: iteration 50790/ 173500 | consumed samples: 13002240 | consumed tokens: 26628587520 | elapsed time per iteration (s): 0.81 | learning rate: 1.661E-04 | global batch size: 256 | lm loss: 2.069036E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.784 | TFLOPs: 19.16 | 31: iteration 50800/ 173500 | consumed samples: 13004800 | consumed tokens: 26633830400 | elapsed time 
per iteration (s): 0.84 | learning rate: 1.661E-04 | global batch size: 256 | lm loss: 2.072540E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.205 | TFLOPs: 18.40 | 31: iteration 50810/ 173500 | consumed samples: 13007360 | consumed tokens: 26639073280 | elapsed time per iteration (s): 0.77 | learning rate: 1.661E-04 | global batch size: 256 | lm loss: 2.080848E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.917 | TFLOPs: 20.08 | 31: iteration 50820/ 173500 | consumed samples: 13009920 | consumed tokens: 26644316160 | elapsed time per iteration (s): 0.81 | learning rate: 1.661E-04 | global batch size: 256 | lm loss: 2.060765E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.725 | TFLOPs: 19.16 | 31: iteration 50830/ 173500 | consumed samples: 13012480 | consumed tokens: 26649559040 | elapsed time per iteration (s): 0.84 | learning rate: 1.661E-04 | global batch size: 256 | lm loss: 2.073286E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.939 | TFLOPs: 18.45 | 31: iteration 50840/ 173500 | consumed samples: 13015040 | consumed tokens: 26654801920 | elapsed time per iteration (s): 0.82 | learning rate: 1.661E-04 | global batch size: 256 | lm loss: 2.075744E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.248 | TFLOPs: 18.89 | 31: iteration 50850/ 173500 | consumed samples: 13017600 | consumed tokens: 26660044800 | elapsed time per iteration (s): 0.81 | learning rate: 1.661E-04 | global batch size: 256 | lm loss: 2.067350E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.123 | TFLOPs: 
19.06 | 31: iteration 50860/ 173500 | consumed samples: 13020160 | consumed tokens: 26665287680 | elapsed time per iteration (s): 0.85 | learning rate: 1.661E-04 | global batch size: 256 | lm loss: 2.074405E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.019 | TFLOPs: 18.21 | 31: iteration 50870/ 173500 | consumed samples: 13022720 | consumed tokens: 26670530560 | elapsed time per iteration (s): 0.78 | learning rate: 1.660E-04 | global batch size: 256 | lm loss: 2.064257E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.209 | TFLOPs: 19.86 | 31: iteration 50880/ 173500 | consumed samples: 13025280 | consumed tokens: 26675773440 | elapsed time per iteration (s): 0.81 | learning rate: 1.660E-04 | global batch size: 256 | lm loss: 2.064132E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.135 | TFLOPs: 19.13 | 31: iteration 50890/ 173500 | consumed samples: 13027840 | consumed tokens: 26681016320 | elapsed time per iteration (s): 0.83 | learning rate: 1.660E-04 | global batch size: 256 | lm loss: 2.036485E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.619 | TFLOPs: 18.67 | 31: iteration 50900/ 173500 | consumed samples: 13030400 | consumed tokens: 26686259200 | elapsed time per iteration (s): 0.81 | learning rate: 1.660E-04 | global batch size: 256 | lm loss: 2.029930E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.772 | TFLOPs: 19.10 | 31: iteration 50910/ 173500 | consumed samples: 13032960 | consumed tokens: 26691502080 | elapsed time per iteration (s): 0.81 | learning rate: 1.660E-04 | global batch size: 256 | lm loss: 2.057368E+00 | grad norm: 0.131 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.256 | TFLOPs: 19.19 | 31: iteration 50920/ 173500 | consumed samples: 13035520 | consumed tokens: 26696744960 | elapsed time per iteration (s): 0.82 | learning rate: 1.660E-04 | global batch size: 256 | lm loss: 2.062870E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.770 | TFLOPs: 18.86 | 31: iteration 50930/ 173500 | consumed samples: 13038080 | consumed tokens: 26701987840 | elapsed time per iteration (s): 0.95 | learning rate: 1.660E-04 | global batch size: 256 | lm loss: 2.083384E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 269.283 | TFLOPs: 16.29 | 31: iteration 50940/ 173500 | consumed samples: 13040640 | consumed tokens: 26707230720 | elapsed time per iteration (s): 0.86 | learning rate: 1.659E-04 | global batch size: 256 | lm loss: 2.105569E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.639 | TFLOPs: 18.07 | 31: iteration 50950/ 173500 | consumed samples: 13043200 | consumed tokens: 26712473600 | elapsed time per iteration (s): 0.79 | learning rate: 1.659E-04 | global batch size: 256 | lm loss: 2.067585E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.639 | TFLOPs: 19.58 | 31: iteration 50960/ 173500 | consumed samples: 13045760 | consumed tokens: 26717716480 | elapsed time per iteration (s): 0.84 | learning rate: 1.659E-04 | global batch size: 256 | lm loss: 2.056639E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.665 | TFLOPs: 18.43 | 31: iteration 50970/ 173500 | consumed samples: 13048320 | consumed tokens: 26722959360 | elapsed time per iteration (s): 0.83 | 
learning rate: 1.659E-04 | global batch size: 256 | lm loss: 2.091659E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.672 | TFLOPs: 18.67 | 31: iteration 50980/ 173500 | consumed samples: 13050880 | consumed tokens: 26728202240 | elapsed time per iteration (s): 0.87 | learning rate: 1.659E-04 | global batch size: 256 | lm loss: 2.087818E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.174 | TFLOPs: 17.74 | 31: iteration 50990/ 173500 | consumed samples: 13053440 | consumed tokens: 26733445120 | elapsed time per iteration (s): 0.74 | learning rate: 1.659E-04 | global batch size: 256 | lm loss: 2.041799E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.856 | TFLOPs: 20.98 | 31: iteration 51000/ 173500 | consumed samples: 13056000 | consumed tokens: 26738688000 | elapsed time per iteration (s): 0.79 | learning rate: 1.659E-04 | global batch size: 256 | lm loss: 2.056776E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.622 | TFLOPs: 19.70 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 51000 | lm loss value: 1.966760E+00 | lm loss PPL: 7.147483E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 51000 to checkpoints_1b1long 0: [2022-11-26 05:33:25,288] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step51000 is begin to save! 0: [2022-11-26 05:33:25,299] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 05:33:25,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_01-model_00-model_states.pt. 0: [2022-11-26 05:33:25,531] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_03-model_00-model_states.pt... 0: [2022-11-26 05:33:25,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_03-model_00-model_states.pt. 0: [2022-11-26 05:33:25,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_04-model_00-model_states.pt... 0: [2022-11-26 05:33:25,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_04-model_00-model_states.pt. 0: [2022-11-26 05:33:25,681] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_05-model_00-model_states.pt... 0: [2022-11-26 05:33:25,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_05-model_00-model_states.pt. 0: [2022-11-26 05:33:25,760] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_06-model_00-model_states.pt... 0: [2022-11-26 05:33:25,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_06-model_00-model_states.pt. 0: [2022-11-26 05:33:25,836] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_07-model_00-model_states.pt... 0: [2022-11-26 05:33:25,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_07-model_00-model_states.pt. 0: [2022-11-26 05:33:25,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 05:33:25,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_08-model_00-model_states.pt. 0: [2022-11-26 05:33:25,986] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_09-model_00-model_states.pt... 0: [2022-11-26 05:33:26,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_09-model_00-model_states.pt. 0: [2022-11-26 05:33:26,061] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_10-model_00-model_states.pt... 0: [2022-11-26 05:33:26,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_10-model_00-model_states.pt. 0: [2022-11-26 05:33:26,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_11-model_00-model_states.pt... 0: [2022-11-26 05:33:26,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_11-model_00-model_states.pt. 0: [2022-11-26 05:33:26,219] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_12-model_00-model_states.pt... 0: [2022-11-26 05:33:26,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_12-model_00-model_states.pt. 0: [2022-11-26 05:33:26,299] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_13-model_00-model_states.pt... 0: [2022-11-26 05:33:26,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_13-model_00-model_states.pt. 0: [2022-11-26 05:33:26,374] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 05:33:26,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_14-model_00-model_states.pt. 0: [2022-11-26 05:33:26,447] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_15-model_00-model_states.pt... 0: [2022-11-26 05:33:26,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_15-model_00-model_states.pt. 0: [2022-11-26 05:33:26,523] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_16-model_00-model_states.pt... 0: [2022-11-26 05:33:26,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_16-model_00-model_states.pt. 0: [2022-11-26 05:33:26,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_17-model_00-model_states.pt... 0: [2022-11-26 05:33:26,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_17-model_00-model_states.pt. 0: [2022-11-26 05:33:26,677] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_18-model_00-model_states.pt... 0: [2022-11-26 05:33:26,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_18-model_00-model_states.pt. 0: [2022-11-26 05:33:26,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_19-model_00-model_states.pt... 0: [2022-11-26 05:33:26,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_19-model_00-model_states.pt. 0: [2022-11-26 05:33:26,841] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 05:33:26,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_20-model_00-model_states.pt. 0: [2022-11-26 05:33:26,919] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_21-model_00-model_states.pt... 0: [2022-11-26 05:33:26,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_21-model_00-model_states.pt. 0: [2022-11-26 05:33:26,991] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_22-model_00-model_states.pt... 0: [2022-11-26 05:33:27,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_22-model_00-model_states.pt. 0: [2022-11-26 05:33:27,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_23-model_00-model_states.pt... 0: [2022-11-26 05:33:27,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_23-model_00-model_states.pt. 0: [2022-11-26 05:33:27,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_24-model_00-model_states.pt... 0: [2022-11-26 05:33:27,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_24-model_00-model_states.pt. 0: [2022-11-26 05:33:27,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_25-model_00-model_states.pt... 0: [2022-11-26 05:33:27,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_25-model_00-model_states.pt. 0: [2022-11-26 05:33:27,306] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 05:33:27,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_26-model_00-model_states.pt. 0: [2022-11-26 05:33:27,381] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_27-model_00-model_states.pt... 0: [2022-11-26 05:33:27,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_27-model_00-model_states.pt. 0: [2022-11-26 05:33:27,459] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_28-model_00-model_states.pt... 0: [2022-11-26 05:33:27,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_28-model_00-model_states.pt. 0: [2022-11-26 05:33:27,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/layer_30-model_00-model_states.pt... 0: [2022-11-26 05:33:27,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/layer_30-model_00-model_states.pt. 0: [2022-11-26 05:33:27,537] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step51000/mp_rank_00_model_states.pt 0: [2022-11-26 05:33:27,537] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/mp_rank_00_model_states.pt... 0: [2022-11-26 05:33:27,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/mp_rank_00_model_states.pt. 0: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 
5: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
4: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
1: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
13: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 
23: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 
24: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 
17: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 
27: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 
4: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
10: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
12: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 
24: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 
29: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 
21: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 
27: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
16: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
15: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
28: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 
30: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 
27: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
12: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 28: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 
21: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 
25: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 
19: [2022-11-26 05:33:27,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:33:27,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 05:33:27,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 05:33:27,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 19: [2022-11-26 05:33:27,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:33:27,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:33:27,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 05:33:27,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 0: [2022-11-26 05:33:27,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 05:33:27,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 29: [2022-11-26 05:33:27,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-26 05:33:27,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 05:33:27,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 7: [2022-11-26 05:33:27,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:33:27,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 05:33:27,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 9: [2022-11-26 05:33:27,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:33:27,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 05:33:27,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 15: [2022-11-26 05:33:27,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:33:27,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 05:33:27,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 27: [2022-11-26 05:33:27,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
27: [2022-11-26 05:33:27,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 05:33:27,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 12: [2022-11-26 05:33:27,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:33:27,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 05:33:27,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 20: [2022-11-26 05:33:27,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 05:33:27,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 05:33:27,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 8: [2022-11-26 05:33:27,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:33:27,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 05:33:27,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 23: [2022-11-26 05:33:27,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-26 05:33:27,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 05:33:27,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 17: [2022-11-26 05:33:27,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:33:27,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 05:33:27,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 17: [2022-11-26 05:33:27,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:33:27,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 6: [2022-11-26 05:33:27,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:33:27,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:33:27,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
6: [2022-11-26 05:33:27,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 14: [2022-11-26 05:33:27,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 6: [2022-11-26 05:33:27,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 14: [2022-11-26 05:33:27,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 4: [2022-11-26 05:33:27,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:33:27,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 05:33:27,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 4: [2022-11-26 05:33:27,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:33:27,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 6: [2022-11-26 05:33:27,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:33:27,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
6: [2022-11-26 05:33:27,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 22: [2022-11-26 05:33:27,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:33:27,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 22: [2022-11-26 05:33:27,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 05:33:27,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 22: [2022-11-26 05:33:27,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 05:33:27,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 05:33:27,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 0: [2022-11-26 05:33:27,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:33:27,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 16: [2022-11-26 05:33:27,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:33:27,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
16: [2022-11-26 05:33:27,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 05:33:27,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 9: [2022-11-26 05:33:27,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:33:27,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:33:27,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 0: [2022-11-26 05:33:27,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 9: [2022-11-26 05:33:27,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 0: [2022-11-26 05:33:27,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 20: [2022-11-26 05:33:27,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:33:27,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
20: [2022-11-26 05:33:27,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 29: [2022-11-26 05:33:27,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 20: [2022-11-26 05:33:27,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 29: [2022-11-26 05:33:27,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 8: [2022-11-26 05:33:27,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:33:27,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 05:33:27,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 23: [2022-11-26 05:33:27,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 05:33:27,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 05:33:27,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 24: [2022-11-26 05:33:27,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
24: [2022-11-26 05:33:27,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 05:33:27,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 10: [2022-11-26 05:33:27,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:33:27,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:33:27,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 05:33:27,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 05:33:27,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 10: [2022-11-26 05:33:27,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 2: [2022-11-26 05:33:27,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:33:27,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-26 05:33:27,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 05:33:27,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 05:33:27,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 2: [2022-11-26 05:33:27,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 12: [2022-11-26 05:33:27,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:33:27,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 05:33:27,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 22: [2022-11-26 05:33:27,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 05:33:27,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 05:33:27,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 15: [2022-11-26 05:33:27,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-26 05:33:27,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 05:33:27,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 6: [2022-11-26 05:33:27,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:33:27,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 05:33:27,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 28: [2022-11-26 05:33:27,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:33:27,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:33:27,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 16: [2022-11-26 05:33:27,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:33:27,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 16: [2022-11-26 05:33:27,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 05:33:27,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
12: [2022-11-26 05:33:27,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:33:27,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:33:27,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 10: [2022-11-26 05:33:27,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 12: [2022-11-26 05:33:27,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 10: [2022-11-26 05:33:27,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 9: [2022-11-26 05:33:27,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:33:27,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 05:33:27,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 7: [2022-11-26 05:33:27,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:33:27,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 05:33:27,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
20: [2022-11-26 05:33:27,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 05:33:27,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 05:33:27,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 23: [2022-11-26 05:33:27,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 05:33:27,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 05:33:27,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 7: [2022-11-26 05:33:27,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:33:27,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:33:27,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 29: [2022-11-26 05:33:27,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 7: [2022-11-26 05:33:27,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 29: [2022-11-26 05:33:27,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
13: [2022-11-26 05:33:27,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:33:27,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:33:27,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:33:27,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:33:27,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 13: [2022-11-26 05:33:27,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 05:33:27,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 12: [2022-11-26 05:33:27,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 25: [2022-11-26 05:33:27,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:33:27,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:33:27,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 13: [2022-11-26 05:33:27,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
27: [2022-11-26 05:33:27,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 18: [2022-11-26 05:33:27,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 25: [2022-11-26 05:33:27,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 18: [2022-11-26 05:33:27,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 27: [2022-11-26 05:33:27,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 25: [2022-11-26 05:33:27,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 18: [2022-11-26 05:33:27,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 27: [2022-11-26 05:33:27,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 27: [2022-11-26 05:33:27,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 28: [2022-11-26 05:33:27,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 05:33:27,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 28: [2022-11-26 05:33:27,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
28: [2022-11-26 05:33:27,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 05:33:27,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 28: [2022-11-26 05:33:27,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 05:33:27,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 05:33:27,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 24: [2022-11-26 05:33:27,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 05:33:27,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 05:33:27,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 14: [2022-11-26 05:33:27,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:33:27,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:33:27,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 05:33:27,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
14: [2022-11-26 05:33:27,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 05:33:27,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 8: [2022-11-26 05:33:27,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:33:27,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 05:33:27,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 8: [2022-11-26 05:33:27,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:33:27,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 05:33:27,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 23: [2022-11-26 05:33:27,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 05:33:27,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 22: [2022-11-26 05:33:27,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 23: [2022-11-26 05:33:27,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
22: [2022-11-26 05:33:27,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 05:33:27,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 19: [2022-11-26 05:33:27,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:33:27,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 05:33:27,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 6: [2022-11-26 05:33:27,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:33:27,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 05:33:27,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 20: [2022-11-26 05:33:27,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 05:33:27,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 05:33:27,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 2: [2022-11-26 05:33:27,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
17: [2022-11-26 05:33:27,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:33:27,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 05:33:27,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 17: [2022-11-26 05:33:27,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 24: [2022-11-26 05:33:27,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:33:27,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 17: [2022-11-26 05:33:27,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 24: [2022-11-26 05:33:27,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 17: [2022-11-26 05:33:27,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 05:33:27,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 24: [2022-11-26 05:33:27,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 9: [2022-11-26 05:33:27,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-26 05:33:27,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 05:33:27,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 19: [2022-11-26 05:33:27,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:33:27,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 05:33:27,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 29: [2022-11-26 05:33:27,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:33:27,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 05:33:27,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 4: [2022-11-26 05:33:27,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:33:27,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 05:33:27,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:33:27,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
4: [2022-11-26 05:33:27,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 05:33:27,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 3: [2022-11-26 05:33:27,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:33:27,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 05:33:27,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 3: [2022-11-26 05:33:27,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:33:27,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 05:33:27,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:33:27,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 3: [2022-11-26 05:33:27,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 15: [2022-11-26 05:33:27,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:33:27,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 05:33:27,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 05:33:27,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 3: [2022-11-26 05:33:27,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 15: [2022-11-26 05:33:27,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 15: [2022-11-26 05:33:27,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 13: [2022-11-26 05:33:27,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:33:27,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 05:33:27,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 30: [2022-11-26 05:33:27,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:33:27,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:33:27,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:33:27,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-26 05:33:27,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 14: [2022-11-26 05:33:27,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:33:27,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 14: [2022-11-26 05:33:27,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 30: [2022-11-26 05:33:27,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 05:33:27,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 14: [2022-11-26 05:33:27,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 30: [2022-11-26 05:33:27,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 18: [2022-11-26 05:33:27,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:33:27,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 30: [2022-11-26 05:33:27,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 30: [2022-11-26 05:33:27,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
18: [2022-11-26 05:33:27,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 05:33:27,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 18: [2022-11-26 05:33:27,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 05:33:27,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 05:33:27,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 1: [2022-11-26 05:33:27,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:33:27,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 24: [2022-11-26 05:33:27,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 16: [2022-11-26 05:33:27,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:33:27,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:33:27,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
16: [2022-11-26 05:33:27,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 1: [2022-11-26 05:33:27,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 16: [2022-11-26 05:33:27,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 24: [2022-11-26 05:33:27,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 1: [2022-11-26 05:33:27,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 05:33:27,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 05:33:27,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 24: [2022-11-26 05:33:27,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 1: [2022-11-26 05:33:27,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 05:33:27,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 1: [2022-11-26 05:33:27,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 1: [2022-11-26 05:33:27,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
25: [2022-11-26 05:33:27,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 05:33:27,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 05:33:27,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 05:33:27,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 05:33:27,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 05:33:27,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 05:33:27,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 25: [2022-11-26 05:33:27,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 25: [2022-11-26 05:33:27,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 31: [2022-11-26 05:33:27,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 05:33:27,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 05:33:27,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
31: [2022-11-26 05:33:27,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 05:33:27,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 05:33:27,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 05:33:27,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 31: [2022-11-26 05:33:27,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 31: [2022-11-26 05:33:27,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 31: [2022-11-26 05:33:27,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 05:33:27,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 05:33:27,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 0: [2022-11-26 05:33:27,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:33:27,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 05:33:27,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
28: [2022-11-26 05:33:27,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 05:33:27,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 05:33:27,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 0: [2022-11-26 05:33:27,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:33:27,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:33:27,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 05:33:27,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 5: [2022-11-26 05:33:27,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:33:27,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 05:33:27,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 5: [2022-11-26 05:33:27,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-26 05:33:27,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 05:33:27,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 5: [2022-11-26 05:33:27,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:33:27,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 05:33:27,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 5: [2022-11-26 05:33:27,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:33:27,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 18: [2022-11-26 05:33:27,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:33:27,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 18: [2022-11-26 05:33:27,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 7: [2022-11-26 05:33:27,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 18: [2022-11-26 05:33:27,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
7: [2022-11-26 05:33:27,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 05:33:27,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 1: [2022-11-26 05:33:27,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:33:27,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 05:33:27,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 27: [2022-11-26 05:33:27,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:33:27,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 05:33:27,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 5: [2022-11-26 05:33:27,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:33:27,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 05:33:27,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 2: [2022-11-26 05:33:27,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 05:33:27,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 05:33:27,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 6: [2022-11-26 05:33:27,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:33:27,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 05:33:27,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 11: [2022-11-26 05:33:27,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:33:27,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:33:27,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 05:33:27,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 05:33:27,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 05:33:27,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 05:33:27,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 11: [2022-11-26 05:33:27,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 11: [2022-11-26 05:33:27,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 26: [2022-11-26 05:33:27,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 05:33:27,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 05:33:27,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 05:33:27,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 05:33:27,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
26: [2022-11-26 05:33:27,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 05:33:27,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 05:33:27,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 05:33:27,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 05:33:27,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 05:33:27,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 26: [2022-11-26 05:33:27,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 26: [2022-11-26 05:33:27,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 26: [2022-11-26 05:33:27,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 26: [2022-11-26 05:33:27,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 23: [2022-11-26 05:33:27,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-26 05:33:27,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 05:33:27,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 21: [2022-11-26 05:33:27,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:33:27,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:33:27,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:33:27,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:33:27,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 05:33:27,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 05:33:27,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
21: [2022-11-26 05:33:27,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 05:33:27,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 05:33:27,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 21: [2022-11-26 05:33:27,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 21: [2022-11-26 05:33:27,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 0: [2022-11-26 05:33:27,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 05:33:27,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 17: [2022-11-26 05:33:27,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:33:27,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 05:33:27,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 9: [2022-11-26 05:33:27,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 05:33:27,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 05:33:27,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 8: [2022-11-26 05:33:27,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:33:27,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 05:33:27,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 31: [2022-11-26 05:33:27,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 05:33:27,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 05:33:27,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 16: [2022-11-26 05:33:27,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 05:33:27,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 05:33:27,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 4: [2022-11-26 05:33:27,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 05:33:27,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 05:33:27,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 15: [2022-11-26 05:33:27,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:33:27,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 05:33:27,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 21: [2022-11-26 05:33:27,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:33:27,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 05:33:27,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 22: [2022-11-26 05:33:27,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 05:33:27,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 05:33:27,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 19: [2022-11-26 05:33:27,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-26 05:33:27,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 05:33:27,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 12: [2022-11-26 05:33:27,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:33:27,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 05:33:27,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 20: [2022-11-26 05:33:27,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 05:33:27,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 05:33:27,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 18: [2022-11-26 05:33:27,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 05:33:27,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 05:33:27,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 10: [2022-11-26 05:33:27,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 05:33:27,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 05:33:27,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 24: [2022-11-26 05:33:27,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 05:33:27,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 05:33:27,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 30: [2022-11-26 05:33:27,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:33:27,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 05:33:27,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 29: [2022-11-26 05:33:27,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:33:27,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 05:33:27,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 28: [2022-11-26 05:33:27,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
28: [2022-11-26 05:33:27,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 05:33:27,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 27: [2022-11-26 05:33:27,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:33:27,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 05:33:27,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 2: [2022-11-26 05:33:27,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:33:27,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 05:33:27,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 7: [2022-11-26 05:33:27,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:33:27,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 05:33:27,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 0: [2022-11-26 05:33:27,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 05:33:27,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 05:33:27,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 3: [2022-11-26 05:33:27,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:33:27,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 05:33:27,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 11: [2022-11-26 05:33:27,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:33:27,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 05:33:27,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 14: [2022-11-26 05:33:27,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:33:27,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 25: [2022-11-26 05:33:27,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:33:27,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
25: [2022-11-26 05:33:27,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 5: [2022-11-26 05:33:27,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:33:27,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 25: [2022-11-26 05:33:27,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 5: [2022-11-26 05:33:27,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 13: [2022-11-26 05:33:27,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:33:27,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 05:33:27,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 1: [2022-11-26 05:33:27,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:33:27,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 05:33:27,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 6: [2022-11-26 05:33:27,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 05:33:27,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 05:33:27,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 26: [2022-11-26 05:33:27,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 05:33:27,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 05:33:27,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 31: [2022-11-26 05:33:27,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 05:33:27,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 05:33:27,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 9: [2022-11-26 05:33:27,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:33:27,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 05:33:27,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 23: [2022-11-26 05:33:27,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-26 05:33:27,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 05:33:27,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 17: [2022-11-26 05:33:27,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:33:27,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 05:33:27,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 4: [2022-11-26 05:33:27,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:33:27,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 05:33:27,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 15: [2022-11-26 05:33:27,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:33:27,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 05:33:27,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 20: [2022-11-26 05:33:27,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
22: [2022-11-26 05:33:27,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 20: [2022-11-26 05:33:27,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 05:33:27,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 22: [2022-11-26 05:33:27,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 05:33:27,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 18: [2022-11-26 05:33:27,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 05:33:27,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 05:33:27,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 16: [2022-11-26 05:33:27,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 05:33:27,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 05:33:27,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 8: [2022-11-26 05:33:27,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 05:33:27,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 05:33:27,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 21: [2022-11-26 05:33:27,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 24: [2022-11-26 05:33:27,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:33:27,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 05:33:27,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 24: [2022-11-26 05:33:27,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 05:33:27,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 29: [2022-11-26 05:33:27,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:33:27,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 05:33:27,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 25: [2022-11-26 05:33:27,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
25: [2022-11-26 05:33:27,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 05:33:27,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 5: [2022-11-26 05:33:27,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:33:27,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 05:33:27,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 28: [2022-11-26 05:33:27,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:33:27,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:33:27,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:33:27,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 30: [2022-11-26 05:33:27,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 12: [2022-11-26 05:33:27,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 30: [2022-11-26 05:33:27,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
3: [2022-11-26 05:33:27,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:33:27,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 05:33:27,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 4: [2022-11-26 05:33:27,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:33:27,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 05:33:27,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 10: [2022-11-26 05:33:27,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:33:27,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:33:27,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 05:33:27,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 27: [2022-11-26 05:33:27,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 23: [2022-11-26 05:33:27,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
28: [2022-11-26 05:33:27,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 27: [2022-11-26 05:33:27,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 23: [2022-11-26 05:33:27,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 28: [2022-11-26 05:33:27,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 23: [2022-11-26 05:33:27,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 13: [2022-11-26 05:33:27,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:33:27,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 05:33:27,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 2: [2022-11-26 05:33:27,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:33:27,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 05:33:27,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 7: [2022-11-26 05:33:27,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 05:33:27,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 05:33:27,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 19: [2022-11-26 05:33:27,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:33:27,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 05:33:27,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 6: [2022-11-26 05:33:27,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:33:27,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 05:33:27,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 1: [2022-11-26 05:33:27,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:33:27,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 
1: [2022-11-26 05:33:27,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 17: [2022-11-26 05:33:27,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 1: [2022-11-26 05:33:27,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 17: [2022-11-26 05:33:27,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 20: [2022-11-26 05:33:27,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 05:33:27,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 05:33:27,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 8: [2022-11-26 05:33:27,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:33:27,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 05:33:27,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 16: [2022-11-26 05:33:27,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
16: [2022-11-26 05:33:27,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 05:33:27,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 0: [2022-11-26 05:33:27,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:33:27,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 05:33:27,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 11: [2022-11-26 05:33:27,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:33:27,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 05:33:27,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 31: [2022-11-26 05:33:27,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 05:33:27,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 05:33:27,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 26: [2022-11-26 05:33:27,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
26: [2022-11-26 05:33:27,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 05:33:27,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 21: [2022-11-26 05:33:27,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:33:27,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 05:33:27,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 15: [2022-11-26 05:33:27,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:33:27,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:33:27,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 15: [2022-11-26 05:33:27,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 29: [2022-11-26 05:33:27,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 9: [2022-11-26 05:33:27,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:33:27,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
14: [2022-11-26 05:33:27,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:33:27,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 14: [2022-11-26 05:33:27,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 9: [2022-11-26 05:33:27,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 14: [2022-11-26 05:33:27,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 18: [2022-11-26 05:33:27,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 05:33:27,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 05:33:27,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 10: [2022-11-26 05:33:27,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 28: [2022-11-26 05:33:27,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:33:27,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 05:33:27,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
28: [2022-11-26 05:33:27,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 05:33:27,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 25: [2022-11-26 05:33:27,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:33:27,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:33:27,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 30: [2022-11-26 05:33:27,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:33:27,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 25: [2022-11-26 05:33:27,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 05:33:27,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 24: [2022-11-26 05:33:27,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:33:27,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 05:33:27,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
24: [2022-11-26 05:33:27,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 05:33:27,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 6: [2022-11-26 05:33:27,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:33:27,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 05:33:27,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 23: [2022-11-26 05:33:27,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:33:27,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:33:27,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 23: [2022-11-26 05:33:27,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 0: [2022-11-26 05:33:27,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 23: [2022-11-26 05:33:27,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 27: [2022-11-26 05:33:27,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
12: [2022-11-26 05:33:27,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:33:27,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 12: [2022-11-26 05:33:27,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 27: [2022-11-26 05:33:27,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 1: [2022-11-26 05:33:27,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:33:27,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 22: [2022-11-26 05:33:27,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:33:27,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 22: [2022-11-26 05:33:27,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 1: [2022-11-26 05:33:27,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 22: [2022-11-26 05:33:27,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 3: [2022-11-26 05:33:27,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 05:33:27,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 05:33:27,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 8: [2022-11-26 05:33:27,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:33:27,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 05:33:27,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 13: [2022-11-26 05:33:27,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:33:27,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 05:33:27,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 4: [2022-11-26 05:33:27,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:33:27,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 05:33:27,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 16: [2022-11-26 05:33:27,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 
16: [2022-11-26 05:33:27,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 05:33:27,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 14: [2022-11-26 05:33:27,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:33:27,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:33:27,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 17: [2022-11-26 05:33:27,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 14: [2022-11-26 05:33:27,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 17: [2022-11-26 05:33:27,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 29: [2022-11-26 05:33:27,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:33:27,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 05:33:27,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 9: [2022-11-26 05:33:27,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
22: [2022-11-26 05:33:27,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:33:27,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 22: [2022-11-26 05:33:27,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 9: [2022-11-26 05:33:27,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 22: [2022-11-26 05:33:27,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 31: [2022-11-26 05:33:27,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 05:33:27,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 2: [2022-11-26 05:33:27,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 31: [2022-11-26 05:33:27,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 2: [2022-11-26 05:33:27,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 05:33:27,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 25: [2022-11-26 05:33:27,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
25: [2022-11-26 05:33:27,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 05:33:27,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 19: [2022-11-26 05:33:27,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:33:27,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 18: [2022-11-26 05:33:27,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:33:27,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:33:27,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 18: [2022-11-26 05:33:27,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 21: [2022-11-26 05:33:27,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 18: [2022-11-26 05:33:27,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 21: [2022-11-26 05:33:27,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 19: [2022-11-26 05:33:27,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-26 05:33:27,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 05:33:27,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 7: [2022-11-26 05:33:27,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 20: [2022-11-26 05:33:27,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 05:33:27,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 7: [2022-11-26 05:33:27,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 20: [2022-11-26 05:33:27,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 7: [2022-11-26 05:33:27,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 27: [2022-11-26 05:33:27,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:33:27,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 05:33:27,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 24: [2022-11-26 05:33:27,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
24: [2022-11-26 05:33:27,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 05:33:27,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 26: [2022-11-26 05:33:27,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 05:33:27,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 05:33:27,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 10: [2022-11-26 05:33:27,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:33:27,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 05:33:27,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 30: [2022-11-26 05:33:27,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:33:27,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 05:33:27,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 28: [2022-11-26 05:33:27,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 
28: [2022-11-26 05:33:27,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 05:33:27,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 7: [2022-11-26 05:33:27,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:33:27,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 05:33:27,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 12: [2022-11-26 05:33:27,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:33:27,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:33:27,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 3: [2022-11-26 05:33:27,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 12: [2022-11-26 05:33:27,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 3: [2022-11-26 05:33:27,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 15: [2022-11-26 05:33:27,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-26 05:33:27,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 05:33:27,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 14: [2022-11-26 05:33:27,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:33:27,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 05:33:27,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 13: [2022-11-26 05:33:27,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:33:27,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 05:33:27,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 11: [2022-11-26 05:33:27,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:33:27,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 05:33:27,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 2: [2022-11-26 05:33:27,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 05:33:27,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 05:33:27,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 11: [2022-11-26 05:33:27,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:33:27,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 05:33:27,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 3: [2022-11-26 05:33:27,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:33:27,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 05:33:27,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 11: [2022-11-26 05:33:27,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:33:27,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 05:33:27,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 13: [2022-11-26 05:33:27,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 05:33:27,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step51000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt
13: [2022-11-26 05:33:27,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now!
0: successfully saved checkpoint at iteration 51000 to checkpoints_1b1long
31: time (ms) | save-checkpoint: 2558.03
31: iteration 51010/ 173500 | consumed samples: 13058560 | consumed tokens: 26743930880 | elapsed time per iteration (s): 1.15 | learning rate: 1.659E-04 | global batch size: 256 | lm loss: 2.064152E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 223.415 | TFLOPs: 13.52 |
31: iteration 51020/ 173500 | consumed samples: 13061120 | consumed tokens: 26749173760 | elapsed time per iteration (s): 0.77 | learning rate: 1.658E-04 | global batch size: 256 | lm loss: 2.101027E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.358 | TFLOPs: 20.05 |
31: iteration 51030/ 173500 | consumed samples: 13063680 | consumed tokens: 26754416640 | elapsed time per iteration (s): 0.79 | learning rate: 1.658E-04 | global batch size: 256 | lm loss: 2.042995E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.950 | TFLOPs: 19.54 |
31: iteration 51040/ 173500 | consumed samples: 13066240 | consumed tokens: 26759659520 | elapsed time per iteration (s): 0.77 | learning rate: 1.658E-04 | global batch size: 256 | lm loss: 2.079593E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.164 | TFLOPs: 20.10 |
31: iteration 51050/ 173500 | consumed samples: 13068800 | consumed tokens: 26764902400 | elapsed time per iteration (s): 0.75 | learning rate: 1.658E-04 | global batch size: 256 | lm loss: 2.071892E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.343 | TFLOPs: 20.59 |
31: iteration 51060/ 173500 | consumed samples: 13071360 | consumed tokens: 26770145280 | elapsed time per iteration (s): 0.75 | learning rate: 1.658E-04 | global batch size: 256 | lm loss: 2.037744E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.379 | TFLOPs: 20.53 |
31: iteration 51070/ 173500 | consumed samples: 13073920 | consumed tokens: 26775388160 | elapsed time per iteration (s): 0.74 | learning rate: 1.658E-04 | global batch size: 256 | lm loss: 2.072018E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.470 | TFLOPs: 20.84 |
31: iteration 51080/ 173500 | consumed samples: 13076480 | consumed tokens: 26780631040 | elapsed time per iteration (s): 0.77 | learning rate: 1.658E-04 | global batch size: 256 | lm loss: 2.068140E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.272 | TFLOPs: 20.10 |
31: iteration 51090/ 173500 | consumed samples: 13079040 | consumed tokens: 26785873920 | elapsed time per iteration (s): 0.79 | learning rate: 1.658E-04 | global batch size: 256 | lm loss: 2.071322E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.260 | TFLOPs: 19.50 |
31: iteration 51100/ 173500 | consumed samples: 13081600 | consumed tokens: 26791116800 | elapsed time per iteration (s): 0.78 | learning rate: 1.657E-04 | global batch size: 256 | lm loss: 2.043951E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.314 | TFLOPs: 19.92 |
31: iteration 51110/ 173500 | consumed samples: 13084160 | consumed tokens: 26796359680 | elapsed time per iteration (s): 0.77 | learning rate: 1.657E-04 | global batch size: 256 | lm loss: 2.076102E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.174 | TFLOPs: 20.10 |
31: iteration 51120/ 173500 | consumed samples: 13086720 | consumed tokens: 26801602560 | elapsed time per iteration (s): 0.76 | learning rate: 1.657E-04 | global batch size: 256 | lm loss: 2.063531E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.449 | TFLOPs: 20.41 |
31: iteration 51130/ 173500 | consumed samples: 13089280 | consumed tokens: 26806845440 | elapsed time per iteration (s): 0.80 | learning rate: 1.657E-04 | global batch size: 256 | lm loss: 2.067721E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.903 | TFLOPs: 19.29 |
31: iteration 51140/ 173500 | consumed samples: 13091840 | consumed tokens: 26812088320 | elapsed time per iteration (s): 0.78 | learning rate: 1.657E-04 | global batch size: 256 | lm loss: 2.069608E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.971 | TFLOPs: 19.84 |
31: iteration 51150/ 173500 | consumed samples: 13094400 | consumed tokens: 26817331200 | elapsed time per iteration (s): 0.84 | learning rate: 1.657E-04 | global batch size: 256 | lm loss: 2.105889E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.155 | TFLOPs: 18.46 |
31: iteration 51160/ 173500 | consumed samples: 13096960 | consumed tokens: 26822574080 | elapsed time per iteration (s): 0.79 | learning rate: 1.657E-04 | global batch size: 256 | lm loss: 2.095783E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.607 | TFLOPs: 19.70 |
31: iteration 51170/ 173500 | consumed samples: 13099520 | consumed tokens: 26827816960 | elapsed time per iteration (s): 0.77 | learning rate: 1.657E-04 | global batch size: 256 | lm loss: 2.045741E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.479 | TFLOPs: 20.11 |
31: iteration 51180/ 173500 | consumed samples: 13102080 | consumed tokens: 26833059840 | elapsed time per iteration (s): 0.71 | learning rate: 1.656E-04 | global batch size: 256 | lm loss: 2.053875E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 359.167 | TFLOPs: 21.73 |
31: iteration 51190/ 173500 | consumed samples: 13104640 | consumed tokens: 26838302720 | elapsed time per iteration (s): 0.76 | learning rate: 1.656E-04 | global batch size: 256 | lm loss: 2.072802E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.766 | TFLOPs: 20.49 |
31: iteration 51200/ 173500 | consumed samples: 13107200 | consumed tokens: 26843545600 | elapsed time per iteration (s): 0.75 | learning rate: 1.656E-04 | global batch size: 256 | lm loss: 2.048020E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.978 | TFLOPs: 20.57 |
31: iteration 51210/ 173500 | consumed samples: 13109760 | consumed tokens: 26848788480 | elapsed time per iteration (s): 0.72 | learning rate: 1.656E-04 | global batch size: 256 | lm loss: 2.055993E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.354 | TFLOPs: 21.50 |
31: iteration 51220/ 173500 | consumed samples: 13112320 | consumed tokens: 26854031360 | elapsed time per iteration (s): 0.77 | learning rate: 1.656E-04 | global batch size: 256 | lm loss: 2.073488E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.019 | TFLOPs: 20.15 |
31: iteration 51230/ 173500 | consumed samples: 13114880 | consumed tokens: 26859274240 | elapsed time per iteration (s): 0.75 | learning rate: 1.656E-04 | global batch size: 256 | lm loss: 2.031333E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.762 | TFLOPs: 20.55 |
31: iteration 51240/ 173500 | consumed samples: 13117440 | consumed tokens: 26864517120 | elapsed time per iteration (s): 0.79 | learning rate: 1.656E-04 | global batch size: 256 | lm loss: 2.070112E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.111 | TFLOPs: 19.55 |
31: iteration 51250/ 173500 | consumed samples: 13120000 | consumed tokens: 26869760000 | elapsed time per iteration (s): 0.79 | learning rate: 1.655E-04 | global batch size: 256 | lm loss: 2.054276E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.275 | TFLOPs: 19.62 |
31: iteration 51260/ 173500 | consumed samples: 13122560 | consumed tokens: 26875002880 | elapsed time per iteration (s): 0.75 | learning rate: 1.655E-04 | global batch size: 256 | lm loss: 2.053598E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.751 | TFLOPs: 20.68 |
31: iteration 51270/ 173500 | consumed samples: 13125120 | consumed tokens: 26880245760 | elapsed time per iteration (s): 0.79 | learning rate: 1.655E-04 | global batch size: 256 | lm loss: 2.038086E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.634 | TFLOPs: 19.52 |
31: iteration 51280/ 173500 | consumed samples: 13127680 | consumed tokens: 26885488640 | elapsed time per iteration (s): 0.82 | learning rate: 1.655E-04 | global batch size: 256 | lm loss: 2.057544E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.314 | TFLOPs: 18.89 |
31: iteration 51290/ 173500 | consumed samples: 13130240 | consumed tokens: 26890731520 | elapsed time per iteration (s): 0.84 | learning rate: 1.655E-04 | global batch size: 256 | lm loss: 2.061623E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.328 | TFLOPs: 18.41 |
31: iteration 51300/ 173500 | consumed samples: 13132800 | consumed tokens: 26895974400 | elapsed time per iteration (s): 0.81 | learning rate: 1.655E-04 | global batch size: 256 | lm loss: 2.100275E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.445 | TFLOPs: 19.14 |
31: iteration 51310/ 173500 | consumed samples: 13135360 | consumed tokens: 26901217280 | elapsed time per iteration (s): 0.80 | learning rate: 1.655E-04 | global batch size: 256 | lm loss: 2.064091E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.097 | TFLOPs: 19.24 |
31: iteration 51320/ 173500 | consumed samples: 13137920 | consumed tokens: 26906460160 | elapsed time per iteration (s): 0.86 | learning rate: 1.655E-04 | global batch size: 256 | lm loss: 2.057946E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.595 | TFLOPs: 17.94 |
31: iteration 51330/ 173500 | consumed samples: 13140480 | consumed tokens: 26911703040 | elapsed time per iteration (s): 0.83 | learning rate: 1.654E-04 | global batch size: 256 | lm loss: 2.098441E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.972 | TFLOPs: 18.75 |
31: iteration 51340/ 173500 | consumed samples: 13143040 | consumed tokens: 26916945920 | elapsed time per iteration (s): 0.84 | learning rate: 1.654E-04 | global batch size: 256 | lm loss: 2.077391E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.649 | TFLOPs: 18.49 |
31: iteration 51350/ 173500 | consumed samples: 13145600 | consumed tokens: 26922188800 | elapsed time per iteration (s): 0.83 | learning rate: 1.654E-04 | global batch size: 256 | lm loss: 2.036388E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.133 | TFLOPs: 18.70 |
31: iteration 51360/ 173500 | consumed samples: 13148160 | consumed tokens: 26927431680 | elapsed time per iteration (s): 0.78 | learning rate: 1.654E-04 | global batch size: 256 | lm loss: 2.081317E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.499 | TFLOPs: 19.87 |
31: iteration 51370/ 173500 | consumed samples: 13150720 | consumed tokens: 26932674560 | elapsed time per iteration (s): 0.75 | learning rate: 1.654E-04 | global batch size: 256 | lm loss: 2.069739E+00 | grad norm: 0.266 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.799 | TFLOPs: 20.74 |
31: iteration 51380/ 173500 | consumed samples: 13153280 | consumed tokens: 26937917440 | elapsed time per iteration (s): 0.85 | learning rate: 1.654E-04 | global batch size: 256 | lm loss: 2.054873E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.384 | TFLOPs: 18.29 |
31: iteration 51390/ 173500 | consumed samples: 13155840 | consumed tokens: 26943160320 | elapsed time per iteration (s): 0.74 | learning rate: 1.654E-04 | global batch size: 256 | lm loss: 2.085714E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.024 | TFLOPs: 20.93 |
31: iteration 51400/ 173500 | consumed samples: 13158400 | consumed tokens: 26948403200 | elapsed time per iteration (s): 0.75 | learning rate: 1.654E-04 | global batch size: 256 | lm loss: 2.049638E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.451 | TFLOPs: 20.54 |
31: iteration 51410/ 173500 | consumed samples: 13160960 | consumed tokens: 26953646080 | elapsed time per iteration (s): 0.73 | learning rate: 1.653E-04 | global batch size: 256 | lm loss: 2.059161E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.216 | TFLOPs: 21.25 |
31: iteration 51420/ 173500 | consumed samples: 13163520 | consumed tokens: 26958888960 | elapsed time per iteration (s): 0.76 | learning rate: 1.653E-04 | global batch size: 256 | lm loss: 2.066535E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.879 | TFLOPs: 20.50 |
31: iteration 51430/ 173500 | consumed samples: 13166080 | consumed tokens: 26964131840 | elapsed time per iteration (s): 0.79 | learning rate: 1.653E-04 | global batch size: 256 | lm loss: 2.083130E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.374 | TFLOPs: 19.62 |
31: iteration 51440/ 173500 | consumed samples: 13168640 | consumed tokens: 26969374720 | elapsed time per iteration (s): 0.74 | learning rate: 1.653E-04 | global batch size: 256 | lm loss: 2.056275E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.069 | TFLOPs: 20.94 |
31: iteration 51450/ 173500 | consumed samples: 13171200 | consumed tokens: 26974617600 | elapsed time 
per iteration (s): 2.57 | learning rate: 1.653E-04 | global batch size: 256 | lm loss: 2.069045E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 99.803 | TFLOPs: 6.04 | 31: iteration 51460/ 173500 | consumed samples: 13173760 | consumed tokens: 26979860480 | elapsed time per iteration (s): 0.74 | learning rate: 1.653E-04 | global batch size: 256 | lm loss: 2.025414E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.842 | TFLOPs: 20.92 | 31: iteration 51470/ 173500 | consumed samples: 13176320 | consumed tokens: 26985103360 | elapsed time per iteration (s): 0.82 | learning rate: 1.653E-04 | global batch size: 256 | lm loss: 2.063961E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.779 | TFLOPs: 18.98 | 31: iteration 51480/ 173500 | consumed samples: 13178880 | consumed tokens: 26990346240 | elapsed time per iteration (s): 0.75 | learning rate: 1.652E-04 | global batch size: 256 | lm loss: 2.047735E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.363 | TFLOPs: 20.65 | 31: iteration 51490/ 173500 | consumed samples: 13181440 | consumed tokens: 26995589120 | elapsed time per iteration (s): 0.79 | learning rate: 1.652E-04 | global batch size: 256 | lm loss: 2.039847E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.126 | TFLOPs: 19.67 | 31: iteration 51500/ 173500 | consumed samples: 13184000 | consumed tokens: 27000832000 | elapsed time per iteration (s): 0.80 | learning rate: 1.652E-04 | global batch size: 256 | lm loss: 2.062604E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.278 | TFLOPs: 19.32 
| 31: iteration 51510/ 173500 | consumed samples: 13186560 | consumed tokens: 27006074880 | elapsed time per iteration (s): 0.78 | learning rate: 1.652E-04 | global batch size: 256 | lm loss: 2.047151E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.939 | TFLOPs: 19.96 | 31: iteration 51520/ 173500 | consumed samples: 13189120 | consumed tokens: 27011317760 | elapsed time per iteration (s): 0.72 | learning rate: 1.652E-04 | global batch size: 256 | lm loss: 2.053284E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.865 | TFLOPs: 21.41 | 31: iteration 51530/ 173500 | consumed samples: 13191680 | consumed tokens: 27016560640 | elapsed time per iteration (s): 0.78 | learning rate: 1.652E-04 | global batch size: 256 | lm loss: 2.052876E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.135 | TFLOPs: 19.91 | 31: iteration 51540/ 173500 | consumed samples: 13194240 | consumed tokens: 27021803520 | elapsed time per iteration (s): 0.76 | learning rate: 1.652E-04 | global batch size: 256 | lm loss: 2.069710E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.113 | TFLOPs: 20.39 | 31: iteration 51550/ 173500 | consumed samples: 13196800 | consumed tokens: 27027046400 | elapsed time per iteration (s): 0.72 | learning rate: 1.652E-04 | global batch size: 256 | lm loss: 2.062465E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.325 | TFLOPs: 21.62 | 31: iteration 51560/ 173500 | consumed samples: 13199360 | consumed tokens: 27032289280 | elapsed time per iteration (s): 0.75 | learning rate: 1.651E-04 | global batch size: 256 | lm loss: 2.082214E+00 | grad norm: 0.139 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.591 | TFLOPs: 20.79 | 31: iteration 51570/ 173500 | consumed samples: 13201920 | consumed tokens: 27037532160 | elapsed time per iteration (s): 0.78 | learning rate: 1.651E-04 | global batch size: 256 | lm loss: 2.070770E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.301 | TFLOPs: 19.80 | 31: iteration 51580/ 173500 | consumed samples: 13204480 | consumed tokens: 27042775040 | elapsed time per iteration (s): 0.73 | learning rate: 1.651E-04 | global batch size: 256 | lm loss: 2.055770E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.618 | TFLOPs: 21.09 | 31: iteration 51590/ 173500 | consumed samples: 13207040 | consumed tokens: 27048017920 | elapsed time per iteration (s): 0.79 | learning rate: 1.651E-04 | global batch size: 256 | lm loss: 2.045030E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.198 | TFLOPs: 19.67 | 31: iteration 51600/ 173500 | consumed samples: 13209600 | consumed tokens: 27053260800 | elapsed time per iteration (s): 0.80 | learning rate: 1.651E-04 | global batch size: 256 | lm loss: 2.046845E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.242 | TFLOPs: 19.43 | 31: iteration 51610/ 173500 | consumed samples: 13212160 | consumed tokens: 27058503680 | elapsed time per iteration (s): 0.79 | learning rate: 1.651E-04 | global batch size: 256 | lm loss: 2.081311E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.938 | TFLOPs: 19.72 | 31: iteration 51620/ 173500 | consumed samples: 13214720 | consumed tokens: 27063746560 | elapsed time per iteration (s): 0.74 | 
learning rate: 1.651E-04 | global batch size: 256 | lm loss: 2.069994E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.092 | TFLOPs: 20.82 | 31: iteration 51630/ 173500 | consumed samples: 13217280 | consumed tokens: 27068989440 | elapsed time per iteration (s): 0.79 | learning rate: 1.651E-04 | global batch size: 256 | lm loss: 2.055928E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.013 | TFLOPs: 19.60 | 31: iteration 51640/ 173500 | consumed samples: 13219840 | consumed tokens: 27074232320 | elapsed time per iteration (s): 0.80 | learning rate: 1.650E-04 | global batch size: 256 | lm loss: 2.057084E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.240 | TFLOPs: 19.25 | 31: iteration 51650/ 173500 | consumed samples: 13222400 | consumed tokens: 27079475200 | elapsed time per iteration (s): 0.80 | learning rate: 1.650E-04 | global batch size: 256 | lm loss: 2.079637E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.525 | TFLOPs: 19.27 | 31: iteration 51660/ 173500 | consumed samples: 13224960 | consumed tokens: 27084718080 | elapsed time per iteration (s): 0.81 | learning rate: 1.650E-04 | global batch size: 256 | lm loss: 2.048820E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.949 | TFLOPs: 19.05 | 31: iteration 51670/ 173500 | consumed samples: 13227520 | consumed tokens: 27089960960 | elapsed time per iteration (s): 0.77 | learning rate: 1.650E-04 | global batch size: 256 | lm loss: 2.052634E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.633 | TFLOPs: 20.12 | 31: iteration 51680/ 
173500 | consumed samples: 13230080 | consumed tokens: 27095203840 | elapsed time per iteration (s): 0.73 | learning rate: 1.650E-04 | global batch size: 256 | lm loss: 2.019391E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.199 | TFLOPs: 21.13 | 31: iteration 51690/ 173500 | consumed samples: 13232640 | consumed tokens: 27100446720 | elapsed time per iteration (s): 0.75 | learning rate: 1.650E-04 | global batch size: 256 | lm loss: 2.073258E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.394 | TFLOPs: 20.59 | 31: iteration 51700/ 173500 | consumed samples: 13235200 | consumed tokens: 27105689600 | elapsed time per iteration (s): 0.73 | learning rate: 1.650E-04 | global batch size: 256 | lm loss: 2.038577E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.031 | TFLOPs: 21.36 | 31: iteration 51710/ 173500 | consumed samples: 13237760 | consumed tokens: 27110932480 | elapsed time per iteration (s): 0.79 | learning rate: 1.649E-04 | global batch size: 256 | lm loss: 2.048124E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.323 | TFLOPs: 19.68 | 31: iteration 51720/ 173500 | consumed samples: 13240320 | consumed tokens: 27116175360 | elapsed time per iteration (s): 2.14 | learning rate: 1.649E-04 | global batch size: 256 | lm loss: 2.062232E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 119.817 | TFLOPs: 7.25 | 31: iteration 51730/ 173500 | consumed samples: 13242880 | consumed tokens: 27121418240 | elapsed time per iteration (s): 0.75 | learning rate: 1.649E-04 | global batch size: 256 | lm loss: 2.063505E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 340.751 | TFLOPs: 20.61 | 31: iteration 51740/ 173500 | consumed samples: 13245440 | consumed tokens: 27126661120 | elapsed time per iteration (s): 0.77 | learning rate: 1.649E-04 | global batch size: 256 | lm loss: 2.034574E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.962 | TFLOPs: 20.14 | 31: iteration 51750/ 173500 | consumed samples: 13248000 | consumed tokens: 27131904000 | elapsed time per iteration (s): 0.76 | learning rate: 1.649E-04 | global batch size: 256 | lm loss: 2.043513E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.067 | TFLOPs: 20.45 | 31: iteration 51760/ 173500 | consumed samples: 13250560 | consumed tokens: 27137146880 | elapsed time per iteration (s): 0.75 | learning rate: 1.649E-04 | global batch size: 256 | lm loss: 2.046915E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.288 | TFLOPs: 20.77 | 31: iteration 51770/ 173500 | consumed samples: 13253120 | consumed tokens: 27142389760 | elapsed time per iteration (s): 0.81 | learning rate: 1.649E-04 | global batch size: 256 | lm loss: 2.074152E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.277 | TFLOPs: 19.07 | 31: iteration 51780/ 173500 | consumed samples: 13255680 | consumed tokens: 27147632640 | elapsed time per iteration (s): 0.77 | learning rate: 1.649E-04 | global batch size: 256 | lm loss: 2.074302E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.991 | TFLOPs: 20.15 | 31: iteration 51790/ 173500 | consumed samples: 13258240 | consumed tokens: 27152875520 | elapsed time per iteration (s): 0.78 | learning rate: 
1.648E-04 | global batch size: 256 | lm loss: 2.069209E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.971 | TFLOPs: 19.84 | 31: iteration 51800/ 173500 | consumed samples: 13260800 | consumed tokens: 27158118400 | elapsed time per iteration (s): 0.75 | learning rate: 1.648E-04 | global batch size: 256 | lm loss: 2.052721E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.463 | TFLOPs: 20.78 | 31: iteration 51810/ 173500 | consumed samples: 13263360 | consumed tokens: 27163361280 | elapsed time per iteration (s): 0.75 | learning rate: 1.648E-04 | global batch size: 256 | lm loss: 2.066952E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.210 | TFLOPs: 20.58 | 31: iteration 51820/ 173500 | consumed samples: 13265920 | consumed tokens: 27168604160 | elapsed time per iteration (s): 0.74 | learning rate: 1.648E-04 | global batch size: 256 | lm loss: 2.079986E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.610 | TFLOPs: 20.85 | 31: iteration 51830/ 173500 | consumed samples: 13268480 | consumed tokens: 27173847040 | elapsed time per iteration (s): 0.77 | learning rate: 1.648E-04 | global batch size: 256 | lm loss: 2.054311E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.397 | TFLOPs: 20.11 | 31: iteration 51840/ 173500 | consumed samples: 13271040 | consumed tokens: 27179089920 | elapsed time per iteration (s): 0.72 | learning rate: 1.648E-04 | global batch size: 256 | lm loss: 2.058076E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.887 | TFLOPs: 21.47 | 31: iteration 51850/ 173500 | 
consumed samples: 13273600 | consumed tokens: 27184332800 | elapsed time per iteration (s): 0.78 | learning rate: 1.648E-04 | global batch size: 256 | lm loss: 2.049217E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.774 | TFLOPs: 19.77 | 31: iteration 51860/ 173500 | consumed samples: 13276160 | consumed tokens: 27189575680 | elapsed time per iteration (s): 0.71 | learning rate: 1.648E-04 | global batch size: 256 | lm loss: 2.047093E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.779 | TFLOPs: 21.71 | 31: iteration 51870/ 173500 | consumed samples: 13278720 | consumed tokens: 27194818560 | elapsed time per iteration (s): 0.79 | learning rate: 1.647E-04 | global batch size: 256 | lm loss: 2.067863E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.100 | TFLOPs: 19.49 | 31: iteration 51880/ 173500 | consumed samples: 13281280 | consumed tokens: 27200061440 | elapsed time per iteration (s): 0.76 | learning rate: 1.647E-04 | global batch size: 256 | lm loss: 2.047305E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.797 | TFLOPs: 20.25 | 31: iteration 51890/ 173500 | consumed samples: 13283840 | consumed tokens: 27205304320 | elapsed time per iteration (s): 0.75 | learning rate: 1.647E-04 | global batch size: 256 | lm loss: 2.053729E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.316 | TFLOPs: 20.59 | 31: iteration 51900/ 173500 | consumed samples: 13286400 | consumed tokens: 27210547200 | elapsed time per iteration (s): 0.75 | learning rate: 1.647E-04 | global batch size: 256 | lm loss: 2.047052E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 340.663 | TFLOPs: 20.61 | 31: iteration 51910/ 173500 | consumed samples: 13288960 | consumed tokens: 27215790080 | elapsed time per iteration (s): 0.76 | learning rate: 1.647E-04 | global batch size: 256 | lm loss: 2.041199E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.934 | TFLOPs: 20.26 | 31: iteration 51920/ 173500 | consumed samples: 13291520 | consumed tokens: 27221032960 | elapsed time per iteration (s): 0.76 | learning rate: 1.647E-04 | global batch size: 256 | lm loss: 2.040279E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.528 | TFLOPs: 20.36 | 31: iteration 51930/ 173500 | consumed samples: 13294080 | consumed tokens: 27226275840 | elapsed time per iteration (s): 0.75 | learning rate: 1.647E-04 | global batch size: 256 | lm loss: 2.042928E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.484 | TFLOPs: 20.54 | 31: iteration 51940/ 173500 | consumed samples: 13296640 | consumed tokens: 27231518720 | elapsed time per iteration (s): 0.73 | learning rate: 1.646E-04 | global batch size: 256 | lm loss: 2.057243E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.464 | TFLOPs: 21.14 | 31: iteration 51950/ 173500 | consumed samples: 13299200 | consumed tokens: 27236761600 | elapsed time per iteration (s): 0.78 | learning rate: 1.646E-04 | global batch size: 256 | lm loss: 2.058662E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.266 | TFLOPs: 19.86 | 31: iteration 51960/ 173500 | consumed samples: 13301760 | consumed tokens: 27242004480 | elapsed time per iteration (s): 0.77 | learning rate: 1.646E-04 | global batch 
size: 256 | lm loss: 2.061732E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.142 | TFLOPs: 20.09 | 31: iteration 51970/ 173500 | consumed samples: 13304320 | consumed tokens: 27247247360 | elapsed time per iteration (s): 0.75 | learning rate: 1.646E-04 | global batch size: 256 | lm loss: 2.063466E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.011 | TFLOPs: 20.75 | 31: iteration 51980/ 173500 | consumed samples: 13306880 | consumed tokens: 27252490240 | elapsed time per iteration (s): 0.77 | learning rate: 1.646E-04 | global batch size: 256 | lm loss: 2.043535E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.972 | TFLOPs: 20.14 | 31: iteration 51990/ 173500 | consumed samples: 13309440 | consumed tokens: 27257733120 | elapsed time per iteration (s): 0.77 | learning rate: 1.646E-04 | global batch size: 256 | lm loss: 2.058627E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.643 | TFLOPs: 20.06 | 0: [2022-11-26 05:46:51,704] [INFO] [logging.py:68:log_dist] [Rank 0] step=52000, skipped=0, lr=[0.00016457056203724818, 0.00016457056203724818, 0.00016457056203724818], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 52000/ 173500 | consumed samples: 13312000 | consumed tokens: 27262976000 | elapsed time per iteration (s): 0.76 | learning rate: 1.646E-04 | global batch size: 256 | lm loss: 2.067520E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.758 | TFLOPs: 20.43 | 0: steps: 52000 loss: 1.9853 iter time (s): 0.807 samples/sec: 317.212 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 
52000 | lm loss value: 2.028508E+00 | lm loss PPL: 7.602732E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 52000 to checkpoints_1b1long 0: [2022-11-26 05:46:51,969] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step52000 is begin to save! 0: [2022-11-26 05:46:51,982] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_01-model_00-model_states.pt... 0: [2022-11-26 05:46:52,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_01-model_00-model_states.pt. 0: [2022-11-26 05:46:52,203] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_03-model_00-model_states.pt... 0: [2022-11-26 05:46:52,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_03-model_00-model_states.pt. 0: [2022-11-26 05:46:52,281] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_04-model_00-model_states.pt... 0: [2022-11-26 05:46:52,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_04-model_00-model_states.pt. 0: [2022-11-26 05:46:52,361] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_05-model_00-model_states.pt... 0: [2022-11-26 05:46:52,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_05-model_00-model_states.pt. 0: [2022-11-26 05:46:52,441] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_06-model_00-model_states.pt... 0: [2022-11-26 05:46:52,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_06-model_00-model_states.pt. 
0: [2022-11-26 05:46:52,516] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_07-model_00-model_states.pt... 0: [2022-11-26 05:46:52,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_07-model_00-model_states.pt. 0: [2022-11-26 05:46:52,595] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_08-model_00-model_states.pt... 0: [2022-11-26 05:46:52,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_08-model_00-model_states.pt. 0: [2022-11-26 05:46:52,673] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_09-model_00-model_states.pt... 0: [2022-11-26 05:46:52,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_09-model_00-model_states.pt. 0: [2022-11-26 05:46:52,750] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_10-model_00-model_states.pt... 0: [2022-11-26 05:46:52,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_10-model_00-model_states.pt. 0: [2022-11-26 05:46:52,829] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_11-model_00-model_states.pt... 0: [2022-11-26 05:46:52,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_11-model_00-model_states.pt. 0: [2022-11-26 05:46:52,906] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_12-model_00-model_states.pt... 0: [2022-11-26 05:46:52,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_12-model_00-model_states.pt. 
0: [2022-11-26 05:46:52,983] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_13-model_00-model_states.pt... 0: [2022-11-26 05:46:53,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_13-model_00-model_states.pt. 0: [2022-11-26 05:46:53,061] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_14-model_00-model_states.pt... 0: [2022-11-26 05:46:53,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_14-model_00-model_states.pt. 0: [2022-11-26 05:46:53,135] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_15-model_00-model_states.pt... 0: [2022-11-26 05:46:53,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_15-model_00-model_states.pt. 0: [2022-11-26 05:46:53,211] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_16-model_00-model_states.pt... 0: [2022-11-26 05:46:53,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_16-model_00-model_states.pt. 0: [2022-11-26 05:46:53,287] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_17-model_00-model_states.pt... 0: [2022-11-26 05:46:53,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_17-model_00-model_states.pt. 0: [2022-11-26 05:46:53,362] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_18-model_00-model_states.pt... 0: [2022-11-26 05:46:53,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_18-model_00-model_states.pt. 
0: [2022-11-26 05:46:53,436] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_19-model_00-model_states.pt... 0: [2022-11-26 05:46:53,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_19-model_00-model_states.pt. 0: [2022-11-26 05:46:53,511] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_20-model_00-model_states.pt... 0: [2022-11-26 05:46:53,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_20-model_00-model_states.pt. 0: [2022-11-26 05:46:53,586] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_21-model_00-model_states.pt... 0: [2022-11-26 05:46:53,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_21-model_00-model_states.pt. 0: [2022-11-26 05:46:53,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_22-model_00-model_states.pt... 0: [2022-11-26 05:46:53,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_22-model_00-model_states.pt. 0: [2022-11-26 05:46:53,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_23-model_00-model_states.pt... 0: [2022-11-26 05:46:53,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_23-model_00-model_states.pt. 0: [2022-11-26 05:46:53,811] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_24-model_00-model_states.pt... 0: [2022-11-26 05:46:53,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_24-model_00-model_states.pt. 
0: [2022-11-26 05:46:53,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_25-model_00-model_states.pt... 0: [2022-11-26 05:46:53,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_25-model_00-model_states.pt. 0: [2022-11-26 05:46:53,958] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_26-model_00-model_states.pt... 0: [2022-11-26 05:46:54,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_26-model_00-model_states.pt. 0: [2022-11-26 05:46:54,034] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_27-model_00-model_states.pt... 0: [2022-11-26 05:46:54,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_27-model_00-model_states.pt. 0: [2022-11-26 05:46:54,106] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_28-model_00-model_states.pt... 0: [2022-11-26 05:46:54,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_28-model_00-model_states.pt. 0: [2022-11-26 05:46:54,182] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/layer_30-model_00-model_states.pt... 0: [2022-11-26 05:46:54,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/layer_30-model_00-model_states.pt. 0: [2022-11-26 05:46:54,184] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step52000/mp_rank_00_model_states.pt 0: [2022-11-26 05:46:54,184] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/mp_rank_00_model_states.pt... 
0: [2022-11-26 05:46:54,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/mp_rank_00_model_states.pt. 0: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
4: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
1: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 
12: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 
25: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
14: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 
22: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 
21: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 
19: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
7: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
1: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
3: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 
25: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 
28: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
31: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 
21: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 
27: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
9: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
13: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 
20: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 
29: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 
0: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
28: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 29: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
5: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 26: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:46:54,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 31: [2022-11-26 05:46:54,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 05:46:54,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 05:46:54,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 26: [2022-11-26 05:46:54,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 05:46:54,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 0: [2022-11-26 05:46:54,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-26 05:46:54,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 26: [2022-11-26 05:46:54,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 0: [2022-11-26 05:46:54,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 7: [2022-11-26 05:46:54,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:46:54,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 05:46:54,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 29: [2022-11-26 05:46:54,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:46:54,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 12: [2022-11-26 05:46:54,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:46:54,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 12: [2022-11-26 05:46:54,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 05:46:54,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
4: [2022-11-26 05:46:54,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:46:54,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 05:46:54,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 21: [2022-11-26 05:46:54,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:46:54,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 05:46:54,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 4: [2022-11-26 05:46:54,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:46:54,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:46:54,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 05:46:54,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 4: [2022-11-26 05:46:54,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 19: [2022-11-26 05:46:54,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
4: [2022-11-26 05:46:54,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 19: [2022-11-26 05:46:54,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 05:46:54,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 9: [2022-11-26 05:46:54,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 24: [2022-11-26 05:46:54,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:46:54,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 05:46:54,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 24: [2022-11-26 05:46:54,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 2: [2022-11-26 05:46:54,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 24: [2022-11-26 05:46:54,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 2: [2022-11-26 05:46:54,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 05:46:54,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
6: [2022-11-26 05:46:54,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:46:54,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 05:46:54,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 3: [2022-11-26 05:46:54,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:46:54,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 0: [2022-11-26 05:46:54,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:46:54,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 0: [2022-11-26 05:46:54,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 05:46:54,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 11: [2022-11-26 05:46:54,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:46:54,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
11: [2022-11-26 05:46:54,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 05:46:54,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 11: [2022-11-26 05:46:54,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:46:54,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 05:46:54,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 31: [2022-11-26 05:46:54,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:46:54,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 05:46:54,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 31: [2022-11-26 05:46:54,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 20: [2022-11-26 05:46:54,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 31: [2022-11-26 05:46:54,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
20: [2022-11-26 05:46:54,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 05:46:54,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 24: [2022-11-26 05:46:54,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 28: [2022-11-26 05:46:54,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 24: [2022-11-26 05:46:54,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 05:46:54,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 18: [2022-11-26 05:46:54,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 05:46:54,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 05:46:54,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 05:46:54,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 18: [2022-11-26 05:46:54,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 05:46:54,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
4: [2022-11-26 05:46:54,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:46:54,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 05:46:54,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 24: [2022-11-26 05:46:54,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 05:46:54,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 21: [2022-11-26 05:46:54,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:46:54,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 05:46:54,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 24: [2022-11-26 05:46:54,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 15: [2022-11-26 05:46:54,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:46:54,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 05:46:54,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
3: [2022-11-26 05:46:54,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:46:54,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 05:46:54,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 29: [2022-11-26 05:46:54,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:46:54,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 05:46:54,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 29: [2022-11-26 05:46:54,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:46:54,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 18: [2022-11-26 05:46:54,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 05:46:54,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 29: [2022-11-26 05:46:54,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 18: [2022-11-26 05:46:54,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
12: [2022-11-26 05:46:54,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:46:54,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 7: [2022-11-26 05:46:54,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:46:54,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 7: [2022-11-26 05:46:54,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 05:46:54,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 2: [2022-11-26 05:46:54,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:46:54,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 3: [2022-11-26 05:46:54,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:46:54,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 3: [2022-11-26 05:46:54,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 05:46:54,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
19: [2022-11-26 05:46:54,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:46:54,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 05:46:54,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 26: [2022-11-26 05:46:54,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 05:46:54,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 19: [2022-11-26 05:46:54,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:46:54,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 05:46:54,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 11: [2022-11-26 05:46:54,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 26: [2022-11-26 05:46:54,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 11: [2022-11-26 05:46:54,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 05:46:54,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
5: [2022-11-26 05:46:54,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:46:54,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:46:54,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 05:46:54,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 05:46:54,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 5: [2022-11-26 05:46:54,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 26: [2022-11-26 05:46:54,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 05:46:54,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 20: [2022-11-26 05:46:54,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 26: [2022-11-26 05:46:54,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 20: [2022-11-26 05:46:54,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 7: [2022-11-26 05:46:54,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-26 05:46:54,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 20: [2022-11-26 05:46:54,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 05:46:54,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 7: [2022-11-26 05:46:54,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 20: [2022-11-26 05:46:54,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 13: [2022-11-26 05:46:54,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 20: [2022-11-26 05:46:54,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 13: [2022-11-26 05:46:54,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 05:46:54,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 21: [2022-11-26 05:46:54,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:46:54,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 05:46:54,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
21: [2022-11-26 05:46:54,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:46:54,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:46:54,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 4: [2022-11-26 05:46:54,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:46:54,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 4: [2022-11-26 05:46:54,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 11: [2022-11-26 05:46:54,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:46:54,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 4: [2022-11-26 05:46:54,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 15: [2022-11-26 05:46:54,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
11: [2022-11-26 05:46:54,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 14: [2022-11-26 05:46:54,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 15: [2022-11-26 05:46:54,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:46:54,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:46:54,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 15: [2022-11-26 05:46:54,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 05:46:54,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 05:46:54,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 05:46:54,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 15: [2022-11-26 05:46:54,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 15: [2022-11-26 05:46:54,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 12: [2022-11-26 05:46:54,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-26 05:46:54,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 05:46:54,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 2: [2022-11-26 05:46:54,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:46:54,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 05:46:54,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 5: [2022-11-26 05:46:54,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:46:54,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 05:46:54,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 18: [2022-11-26 05:46:54,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 05:46:54,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 05:46:54,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 31: [2022-11-26 05:46:54,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-26 05:46:54,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 05:46:54,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 1: [2022-11-26 05:46:54,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:46:54,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 3: [2022-11-26 05:46:54,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:46:54,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 1: [2022-11-26 05:46:54,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 3: [2022-11-26 05:46:54,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 14: [2022-11-26 05:46:54,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:46:54,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 05:46:54,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 20: [2022-11-26 05:46:54,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-26 05:46:54,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 05:46:54,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 12: [2022-11-26 05:46:54,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:46:54,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 05:46:54,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 29: [2022-11-26 05:46:54,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:46:54,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 2: [2022-11-26 05:46:54,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:46:54,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 2: [2022-11-26 05:46:54,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 05:46:54,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 10: [2022-11-26 05:46:54,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-26 05:46:54,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 05:46:54,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 10: [2022-11-26 05:46:54,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:46:54,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 05:46:54,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 31: [2022-11-26 05:46:54,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 05:46:54,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 05:46:54,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 18: [2022-11-26 05:46:54,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 05:46:54,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 05:46:54,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 5: [2022-11-26 05:46:54,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 05:46:54,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 05:46:54,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 1: [2022-11-26 05:46:54,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:46:54,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:46:54,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:46:54,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:46:54,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:46:54,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:46:54,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 05:46:54,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
27: [2022-11-26 05:46:54,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 05:46:54,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 1: [2022-11-26 05:46:54,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 27: [2022-11-26 05:46:54,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 27: [2022-11-26 05:46:54,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 1: [2022-11-26 05:46:54,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 05:46:54,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 05:46:54,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 1: [2022-11-26 05:46:54,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 1: [2022-11-26 05:46:54,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 9: [2022-11-26 05:46:54,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:46:54,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
27: [2022-11-26 05:46:54,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:46:54,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 05:46:54,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 20: [2022-11-26 05:46:54,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 05:46:54,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 27: [2022-11-26 05:46:54,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 05:46:54,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 05:46:54,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 27: [2022-11-26 05:46:54,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 20: [2022-11-26 05:46:54,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 9: [2022-11-26 05:46:54,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 05:46:54,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 05:46:54,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 19: [2022-11-26 05:46:54,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:46:54,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 05:46:54,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 15: [2022-11-26 05:46:54,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 26: [2022-11-26 05:46:54,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:46:54,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 05:46:54,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 26: [2022-11-26 05:46:54,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 05:46:54,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 21: [2022-11-26 05:46:54,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-26 05:46:54,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 05:46:54,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 0: [2022-11-26 05:46:54,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:46:54,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 05:46:54,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 29: [2022-11-26 05:46:54,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:46:54,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 05:46:54,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 0: [2022-11-26 05:46:54,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:46:54,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:46:54,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 05:46:54,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
0: [2022-11-26 05:46:54,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 9: [2022-11-26 05:46:54,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:46:54,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 9: [2022-11-26 05:46:54,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 05:46:54,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 9: [2022-11-26 05:46:54,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:46:54,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 05:46:54,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 24: [2022-11-26 05:46:54,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 05:46:54,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 05:46:54,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 7: [2022-11-26 05:46:54,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 05:46:54,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 05:46:54,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 26: [2022-11-26 05:46:54,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 05:46:54,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 05:46:54,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 4: [2022-11-26 05:46:54,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:46:54,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 05:46:54,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 6: [2022-11-26 05:46:54,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:46:54,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 05:46:54,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 6: [2022-11-26 05:46:54,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 05:46:54,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 05:46:54,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 10: [2022-11-26 05:46:54,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:46:54,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 05:46:54,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 10: [2022-11-26 05:46:54,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:46:54,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 05:46:54,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 11: [2022-11-26 05:46:54,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:46:54,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 05:46:54,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 14: [2022-11-26 05:46:54,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-26 05:46:54,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:46:54,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 05:46:54,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 14: [2022-11-26 05:46:54,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 05:46:54,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 13: [2022-11-26 05:46:54,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:46:54,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 05:46:54,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 13: [2022-11-26 05:46:54,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:46:54,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 05:46:54,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 05:46:54,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 05:46:54,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 13: [2022-11-26 05:46:54,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 6: [2022-11-26 05:46:54,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:46:54,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:46:54,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 05:46:54,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 05:46:54,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 6: [2022-11-26 05:46:54,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 28: [2022-11-26 05:46:54,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 05:46:54,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
28: [2022-11-26 05:46:54,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 05:46:54,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 05:46:54,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 28: [2022-11-26 05:46:54,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 05:46:54,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 05:46:54,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 05:46:54,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 05:46:54,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 28: [2022-11-26 05:46:54,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 28: [2022-11-26 05:46:54,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 05:46:54,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 05:46:54,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
13: [2022-11-26 05:46:54,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:46:54,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 05:46:54,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 3: [2022-11-26 05:46:54,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:46:54,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 05:46:54,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 5: [2022-11-26 05:46:54,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:46:54,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 05:46:54,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 2: [2022-11-26 05:46:54,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:46:54,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 05:46:54,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
10: [2022-11-26 05:46:54,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:46:54,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 05:46:54,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 24: [2022-11-26 05:46:54,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 05:46:54,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 05:46:54,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 14: [2022-11-26 05:46:54,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:46:54,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 05:46:54,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 22: [2022-11-26 05:46:54,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 05:46:54,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 05:46:54,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
22: [2022-11-26 05:46:54,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 05:46:54,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 05:46:54,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 05:46:54,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 05:46:54,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 22: [2022-11-26 05:46:54,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 22: [2022-11-26 05:46:54,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 05:46:54,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 05:46:54,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 05:46:54,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 22: [2022-11-26 05:46:54,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 22: [2022-11-26 05:46:54,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
12: [2022-11-26 05:46:54,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:46:54,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 05:46:54,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 23: [2022-11-26 05:46:54,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 05:46:54,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 05:46:54,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 05:46:54,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 05:46:54,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-26 05:46:54,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 05:46:54,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 05:46:54,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 05:46:54,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 23: [2022-11-26 05:46:54,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 23: [2022-11-26 05:46:54,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 05:46:54,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 05:46:54,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 23: [2022-11-26 05:46:54,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 23: [2022-11-26 05:46:54,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 31: [2022-11-26 05:46:54,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
31: [2022-11-26 05:46:54,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 05:46:54,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 7: [2022-11-26 05:46:54,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:46:54,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 05:46:54,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 6: [2022-11-26 05:46:54,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:46:54,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 05:46:54,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 1: [2022-11-26 05:46:54,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:46:54,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 05:46:54,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 9: [2022-11-26 05:46:54,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 05:46:54,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 05:46:54,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 19: [2022-11-26 05:46:54,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:46:54,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 05:46:54,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 23: [2022-11-26 05:46:54,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 05:46:54,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 05:46:54,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 0: [2022-11-26 05:46:54,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:46:54,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 05:46:54,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 29: [2022-11-26 05:46:54,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
29: [2022-11-26 05:46:54,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 05:46:54,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 21: [2022-11-26 05:46:54,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:46:54,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 05:46:54,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 18: [2022-11-26 05:46:54,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 05:46:54,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 05:46:54,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 27: [2022-11-26 05:46:54,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:46:54,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:46:54,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 05:46:54,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
27: [2022-11-26 05:46:54,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 13: [2022-11-26 05:46:54,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:46:54,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 13: [2022-11-26 05:46:54,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 05:46:54,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 3: [2022-11-26 05:46:54,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:46:54,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 05:46:54,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 11: [2022-11-26 05:46:54,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:46:54,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 05:46:54,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 28: [2022-11-26 05:46:54,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
5: [2022-11-26 05:46:54,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:46:54,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 05:46:54,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 4: [2022-11-26 05:46:54,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:46:54,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 05:46:54,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 20: [2022-11-26 05:46:54,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 05:46:54,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 05:46:54,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 2: [2022-11-26 05:46:54,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:46:54,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 05:46:54,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
10: [2022-11-26 05:46:54,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 24: [2022-11-26 05:46:54,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:46:54,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 05:46:54,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 24: [2022-11-26 05:46:54,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 05:46:54,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 12: [2022-11-26 05:46:54,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:46:54,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 05:46:54,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 14: [2022-11-26 05:46:54,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:46:54,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 05:46:54,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
6: [2022-11-26 05:46:54,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:46:54,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:46:54,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 6: [2022-11-26 05:46:54,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 05:46:54,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 19: [2022-11-26 05:46:54,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 28: [2022-11-26 05:46:54,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 05:46:54,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 1: [2022-11-26 05:46:54,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:46:54,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 05:46:54,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 9: [2022-11-26 05:46:54,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-26 05:46:54,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 05:46:54,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 31: [2022-11-26 05:46:54,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 05:46:54,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 05:46:54,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 20: [2022-11-26 05:46:54,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 05:46:54,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 05:46:54,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 26: [2022-11-26 05:46:54,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 05:46:54,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 05:46:54,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 29: [2022-11-26 05:46:54,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
29: [2022-11-26 05:46:54,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 05:46:54,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 4: [2022-11-26 05:46:54,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:46:54,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 05:46:54,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 23: [2022-11-26 05:46:54,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 05:46:54,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 05:46:54,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 28: [2022-11-26 05:46:54,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 05:46:54,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 05:46:54,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 11: [2022-11-26 05:46:54,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
15: [2022-11-26 05:46:54,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:46:54,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 15: [2022-11-26 05:46:54,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 11: [2022-11-26 05:46:54,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 15: [2022-11-26 05:46:54,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 26: [2022-11-26 05:46:54,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 05:46:54,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 05:46:54,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 0: [2022-11-26 05:46:54,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:46:54,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 05:46:54,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 7: [2022-11-26 05:46:54,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 05:46:54,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 05:46:54,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 21: [2022-11-26 05:46:54,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:46:54,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 05:46:54,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 3: [2022-11-26 05:46:54,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:46:54,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 05:46:54,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 13: [2022-11-26 05:46:54,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:46:54,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 05:46:54,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 5: [2022-11-26 05:46:54,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-26 05:46:54,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 05:46:54,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 22: [2022-11-26 05:46:54,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 05:46:54,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 05:46:54,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 22: [2022-11-26 05:46:54,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 05:46:54,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 05:46:54,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 18: [2022-11-26 05:46:54,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 05:46:54,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 05:46:54,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 2: [2022-11-26 05:46:54,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 05:46:54,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 05:46:54,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 24: [2022-11-26 05:46:54,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 05:46:54,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 05:46:54,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 27: [2022-11-26 05:46:54,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:46:54,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 05:46:54,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 10: [2022-11-26 05:46:54,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:46:54,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 05:46:54,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 12: [2022-11-26 05:46:54,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 05:46:54,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 05:46:54,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 7: [2022-11-26 05:46:54,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:46:54,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 05:46:54,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 23: [2022-11-26 05:46:54,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 05:46:54,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 05:46:54,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 6: [2022-11-26 05:46:54,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:46:54,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 05:46:54,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 1: [2022-11-26 05:46:54,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 05:46:54,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 05:46:54,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 11: [2022-11-26 05:46:54,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:46:54,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 05:46:54,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 18: [2022-11-26 05:46:54,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 05:46:54,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 05:46:54,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 31: [2022-11-26 05:46:54,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 05:46:54,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 05:46:54,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 20: [2022-11-26 05:46:54,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
20: [2022-11-26 05:46:54,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 05:46:54,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 26: [2022-11-26 05:46:54,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 05:46:54,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 05:46:54,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 14: [2022-11-26 05:46:54,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:46:54,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 05:46:54,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 9: [2022-11-26 05:46:54,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:46:54,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 05:46:54,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 4: [2022-11-26 05:46:54,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
15: [2022-11-26 05:46:54,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:46:54,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 05:46:54,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 15: [2022-11-26 05:46:54,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 05:46:54,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 29: [2022-11-26 05:46:54,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 05:46:54,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 05:46:54,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 21: [2022-11-26 05:46:54,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 05:46:54,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 05:46:54,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 31: [2022-11-26 05:46:54,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 
31: [2022-11-26 05:46:54,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 05:46:54,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 12: [2022-11-26 05:46:54,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:46:54,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 05:46:54,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 0: [2022-11-26 05:46:54,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:46:54,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:46:54,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:46:54,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 05:46:54,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 10: [2022-11-26 05:46:54,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 05:46:54,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
7: [2022-11-26 05:46:54,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:46:54,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 05:46:54,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 22: [2022-11-26 05:46:54,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 05:46:54,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 05:46:54,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 27: [2022-11-26 05:46:54,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 05:46:54,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 05:46:54,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 19: [2022-11-26 05:46:54,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 05:46:54,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 05:46:54,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
3: [2022-11-26 05:46:54,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:46:54,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 05:46:54,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 0: [2022-11-26 05:46:54,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 28: [2022-11-26 05:46:54,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 24: [2022-11-26 05:46:54,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:46:54,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 28: [2022-11-26 05:46:54,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 24: [2022-11-26 05:46:54,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 28: [2022-11-26 05:46:54,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 24: [2022-11-26 05:46:54,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 13: [2022-11-26 05:46:54,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 05:46:54,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 05:46:54,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 2: [2022-11-26 05:46:54,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:46:54,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 05:46:54,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 14: [2022-11-26 05:46:54,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:46:54,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 05:46:54,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 30: [2022-11-26 05:46:54,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:46:54,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 05:46:54,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 30: [2022-11-26 05:46:54,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-26 05:46:54,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 05:46:54,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 30: [2022-11-26 05:46:54,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:46:54,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 05:46:54,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 30: [2022-11-26 05:46:54,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:46:54,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 05:46:54,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 30: [2022-11-26 05:46:54,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:46:54,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:46:54,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 
30: [2022-11-26 05:46:54,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 05:46:54,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 05:46:54,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 05:46:54,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 30: [2022-11-26 05:46:54,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 30: [2022-11-26 05:46:54,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 30: [2022-11-26 05:46:54,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 05:46:54,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 05:46:54,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 17: [2022-11-26 05:46:54,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:46:54,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:46:54,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-26 05:46:54,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:46:54,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 05:46:54,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 05:46:54,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 05:46:54,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 05:46:54,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 17: [2022-11-26 05:46:54,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 17: [2022-11-26 05:46:54,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 17: [2022-11-26 05:46:54,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 17: [2022-11-26 05:46:54,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:46:54,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 05:46:54,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
17: [2022-11-26 05:46:54,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:46:54,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 05:46:54,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 17: [2022-11-26 05:46:54,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:46:54,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 05:46:54,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 05:46:54,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 17: [2022-11-26 05:46:54,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 05:46:54,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 16: [2022-11-26 05:46:54,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 05:46:54,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 05:46:54,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
16: [2022-11-26 05:46:54,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 05:46:54,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 05:46:54,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 05:46:54,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 05:46:54,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 05:46:54,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 16: [2022-11-26 05:46:54,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 16: [2022-11-26 05:46:54,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 16: [2022-11-26 05:46:54,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 16: [2022-11-26 05:46:54,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 05:46:54,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 05:46:54,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
16: [2022-11-26 05:46:54,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 05:46:54,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 05:46:54,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 16: [2022-11-26 05:46:54,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 05:46:54,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 05:46:54,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 05:46:54,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 05:46:54,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 16: [2022-11-26 05:46:54,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 25: [2022-11-26 05:46:54,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 05:46:54,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 05:46:54,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
25: [2022-11-26 05:46:54,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 05:46:54,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 05:46:54,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 05:46:54,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 05:46:54,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 05:46:54,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 05:46:54,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 05:46:54,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 05:46:54,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 05:46:54,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 05:46:54,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
25: [2022-11-26 05:46:54,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 25: [2022-11-26 05:46:54,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 05:46:54,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 25: [2022-11-26 05:46:54,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 25: [2022-11-26 05:46:54,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 05:46:54,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 25: [2022-11-26 05:46:54,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 25: [2022-11-26 05:46:54,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 05:46:54,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 25: [2022-11-26 05:46:54,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 8: [2022-11-26 05:46:54,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:46:54,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:46:54,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-26 05:46:54,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:46:54,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:46:54,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:46:54,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 05:46:54,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 05:46:54,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 05:46:54,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 05:46:54,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 8: [2022-11-26 05:46:54,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:46:54,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 8: [2022-11-26 05:46:54,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 8: [2022-11-26 05:46:54,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
8: [2022-11-26 05:46:54,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 05:46:54,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 05:46:54,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 05:46:54,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 8: [2022-11-26 05:46:54,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 8: [2022-11-26 05:46:54,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 8: [2022-11-26 05:46:54,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:46:54,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step52000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 05:46:54,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
0: successfully saved checkpoint at iteration 52000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2656.30 31: iteration 52010/ 173500 | consumed samples: 13314560 | consumed tokens: 27268218880 | elapsed time per iteration (s): 1.09 | learning rate: 1.646E-04 | global batch size: 256 | lm loss: 2.039381E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.870 | TFLOPs: 14.21 | 31: iteration 52020/ 173500 | consumed samples: 13317120 | consumed tokens: 27273461760 | elapsed time per iteration (s): 0.78 | learning rate: 1.645E-04 | global batch size: 256 | lm loss: 2.058027E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.692 | TFLOPs: 19.82 | 31: iteration 52030/ 173500 | consumed samples: 13319680 | consumed tokens: 27278704640 | elapsed time per iteration (s): 0.81 | learning rate: 1.645E-04 | global batch size: 256 | lm loss: 2.035579E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.381 | TFLOPs: 19.08 | 31: iteration 52040/ 173500 | consumed samples: 13322240 | consumed tokens: 27283947520 | elapsed time per iteration (s): 0.78 | learning rate: 1.645E-04 | global batch size: 256 | lm loss: 2.033782E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.229 | TFLOPs: 19.74 | 31: iteration 52050/ 173500 | consumed samples: 13324800 | consumed tokens: 27289190400 | elapsed time per iteration (s): 0.77 | learning rate: 1.645E-04 | global batch size: 256 | lm loss: 2.036703E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.763 | TFLOPs: 20.13 | 31: iteration 52060/ 173500 | consumed samples: 13327360 | consumed tokens: 27294433280 | elapsed time per iteration (s): 0.77 | 
learning rate: 1.645E-04 | global batch size: 256 | lm loss: 2.052843E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.902 | TFLOPs: 20.14 | 31: iteration 52070/ 173500 | consumed samples: 13329920 | consumed tokens: 27299676160 | elapsed time per iteration (s): 0.75 | learning rate: 1.645E-04 | global batch size: 256 | lm loss: 2.080323E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.663 | TFLOPs: 20.61 | 31: iteration 52080/ 173500 | consumed samples: 13332480 | consumed tokens: 27304919040 | elapsed time per iteration (s): 0.78 | learning rate: 1.645E-04 | global batch size: 256 | lm loss: 2.029423E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.267 | TFLOPs: 19.74 | 31: iteration 52090/ 173500 | consumed samples: 13335040 | consumed tokens: 27310161920 | elapsed time per iteration (s): 0.82 | learning rate: 1.645E-04 | global batch size: 256 | lm loss: 2.050727E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.454 | TFLOPs: 18.78 | 31: iteration 52100/ 173500 | consumed samples: 13337600 | consumed tokens: 27315404800 | elapsed time per iteration (s): 0.78 | learning rate: 1.644E-04 | global batch size: 256 | lm loss: 2.029455E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.199 | TFLOPs: 19.73 | 31: iteration 52110/ 173500 | consumed samples: 13340160 | consumed tokens: 27320647680 | elapsed time per iteration (s): 0.77 | learning rate: 1.644E-04 | global batch size: 256 | lm loss: 2.085475E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.054 | TFLOPs: 20.21 | 31: iteration 52120/ 
173500 | consumed samples: 13342720 | consumed tokens: 27325890560 | elapsed time per iteration (s): 0.79 | learning rate: 1.644E-04 | global batch size: 256 | lm loss: 2.047326E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.196 | TFLOPs: 19.49 | 31: iteration 52130/ 173500 | consumed samples: 13345280 | consumed tokens: 27331133440 | elapsed time per iteration (s): 0.77 | learning rate: 1.644E-04 | global batch size: 256 | lm loss: 2.068861E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.700 | TFLOPs: 20.07 | 31: iteration 52140/ 173500 | consumed samples: 13347840 | consumed tokens: 27336376320 | elapsed time per iteration (s): 0.78 | learning rate: 1.644E-04 | global batch size: 256 | lm loss: 2.071556E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.640 | TFLOPs: 19.76 | 31: iteration 52150/ 173500 | consumed samples: 13350400 | consumed tokens: 27341619200 | elapsed time per iteration (s): 0.81 | learning rate: 1.644E-04 | global batch size: 256 | lm loss: 2.085285E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.638 | TFLOPs: 19.10 | 31: iteration 52160/ 173500 | consumed samples: 13352960 | consumed tokens: 27346862080 | elapsed time per iteration (s): 0.79 | learning rate: 1.644E-04 | global batch size: 256 | lm loss: 2.066902E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.340 | TFLOPs: 19.62 | 31: iteration 52170/ 173500 | consumed samples: 13355520 | consumed tokens: 27352104960 | elapsed time per iteration (s): 0.78 | learning rate: 1.643E-04 | global batch size: 256 | lm loss: 2.077802E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 326.804 | TFLOPs: 19.77 | 31: iteration 52180/ 173500 | consumed samples: 13358080 | consumed tokens: 27357347840 | elapsed time per iteration (s): 0.79 | learning rate: 1.643E-04 | global batch size: 256 | lm loss: 2.072775E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.780 | TFLOPs: 19.59 | 31: iteration 52190/ 173500 | consumed samples: 13360640 | consumed tokens: 27362590720 | elapsed time per iteration (s): 0.78 | learning rate: 1.643E-04 | global batch size: 256 | lm loss: 2.066554E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.046 | TFLOPs: 19.85 | 31: iteration 52200/ 173500 | consumed samples: 13363200 | consumed tokens: 27367833600 | elapsed time per iteration (s): 0.76 | learning rate: 1.643E-04 | global batch size: 256 | lm loss: 2.048222E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.925 | TFLOPs: 20.32 | 31: iteration 52210/ 173500 | consumed samples: 13365760 | consumed tokens: 27373076480 | elapsed time per iteration (s): 0.76 | learning rate: 1.643E-04 | global batch size: 256 | lm loss: 2.096089E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.089 | TFLOPs: 20.33 | 31: iteration 52220/ 173500 | consumed samples: 13368320 | consumed tokens: 27378319360 | elapsed time per iteration (s): 0.76 | learning rate: 1.643E-04 | global batch size: 256 | lm loss: 2.069024E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.224 | TFLOPs: 20.28 | 31: iteration 52230/ 173500 | consumed samples: 13370880 | consumed tokens: 27383562240 | elapsed time per iteration (s): 0.81 | learning rate: 
1.643E-04 | global batch size: 256 | lm loss: 2.058732E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.549 | TFLOPs: 19.03 | 31: iteration 52240/ 173500 | consumed samples: 13373440 | consumed tokens: 27388805120 | elapsed time per iteration (s): 0.76 | learning rate: 1.643E-04 | global batch size: 256 | lm loss: 2.024187E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.881 | TFLOPs: 20.26 | 31: iteration 52250/ 173500 | consumed samples: 13376000 | consumed tokens: 27394048000 | elapsed time per iteration (s): 0.74 | learning rate: 1.642E-04 | global batch size: 256 | lm loss: 2.058690E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.859 | TFLOPs: 20.86 | 31: iteration 52260/ 173500 | consumed samples: 13378560 | consumed tokens: 27399290880 | elapsed time per iteration (s): 0.78 | learning rate: 1.642E-04 | global batch size: 256 | lm loss: 2.055233E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.288 | TFLOPs: 19.92 | 31: iteration 52270/ 173500 | consumed samples: 13381120 | consumed tokens: 27404533760 | elapsed time per iteration (s): 0.83 | learning rate: 1.642E-04 | global batch size: 256 | lm loss: 2.049648E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.288 | TFLOPs: 18.59 | 31: iteration 52280/ 173500 | consumed samples: 13383680 | consumed tokens: 27409776640 | elapsed time per iteration (s): 0.75 | learning rate: 1.642E-04 | global batch size: 256 | lm loss: 2.093197E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.845 | TFLOPs: 20.68 | 31: iteration 52290/ 173500 | 
consumed samples: 13386240 | consumed tokens: 27415019520 | elapsed time per iteration (s): 0.74 | learning rate: 1.642E-04 | global batch size: 256 | lm loss: 2.074511E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.299 | TFLOPs: 20.95 | 31: iteration 52300/ 173500 | consumed samples: 13388800 | consumed tokens: 27420262400 | elapsed time per iteration (s): 0.77 | learning rate: 1.642E-04 | global batch size: 256 | lm loss: 2.072632E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.598 | TFLOPs: 20.24 | 31: iteration 52310/ 173500 | consumed samples: 13391360 | consumed tokens: 27425505280 | elapsed time per iteration (s): 0.76 | learning rate: 1.642E-04 | global batch size: 256 | lm loss: 2.059952E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.254 | TFLOPs: 20.46 | 31: iteration 52320/ 173500 | consumed samples: 13393920 | consumed tokens: 27430748160 | elapsed time per iteration (s): 0.78 | learning rate: 1.642E-04 | global batch size: 256 | lm loss: 2.054951E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.949 | TFLOPs: 19.96 | 31: iteration 52330/ 173500 | consumed samples: 13396480 | consumed tokens: 27435991040 | elapsed time per iteration (s): 0.72 | learning rate: 1.641E-04 | global batch size: 256 | lm loss: 2.048784E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.275 | TFLOPs: 21.61 | 31: iteration 52340/ 173500 | consumed samples: 13399040 | consumed tokens: 27441233920 | elapsed time per iteration (s): 0.76 | learning rate: 1.641E-04 | global batch size: 256 | lm loss: 2.062334E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 334.683 | TFLOPs: 20.25 | 31: iteration 52350/ 173500 | consumed samples: 13401600 | consumed tokens: 27446476800 | elapsed time per iteration (s): 1.39 | learning rate: 1.641E-04 | global batch size: 256 | lm loss: 2.046681E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 184.520 | TFLOPs: 11.16 | 31: iteration 52360/ 173500 | consumed samples: 13404160 | consumed tokens: 27451719680 | elapsed time per iteration (s): 0.73 | learning rate: 1.641E-04 | global batch size: 256 | lm loss: 2.036397E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.347 | TFLOPs: 21.13 | 31: iteration 52370/ 173500 | consumed samples: 13406720 | consumed tokens: 27456962560 | elapsed time per iteration (s): 0.77 | learning rate: 1.641E-04 | global batch size: 256 | lm loss: 2.077150E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.645 | TFLOPs: 20.00 | 31: iteration 52380/ 173500 | consumed samples: 13409280 | consumed tokens: 27462205440 | elapsed time per iteration (s): 0.75 | learning rate: 1.641E-04 | global batch size: 256 | lm loss: 2.055275E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.774 | TFLOPs: 20.74 | 31: iteration 52390/ 173500 | consumed samples: 13411840 | consumed tokens: 27467448320 | elapsed time per iteration (s): 0.73 | learning rate: 1.641E-04 | global batch size: 256 | lm loss: 2.046668E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.766 | TFLOPs: 21.10 | 31: iteration 52400/ 173500 | consumed samples: 13414400 | consumed tokens: 27472691200 | elapsed time per iteration (s): 0.85 | learning rate: 1.640E-04 | global batch 
size: 256 | lm loss: 2.052076E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.297 | TFLOPs: 18.23 | 31: iteration 52410/ 173500 | consumed samples: 13416960 | consumed tokens: 27477934080 | elapsed time per iteration (s): 0.79 | learning rate: 1.640E-04 | global batch size: 256 | lm loss: 2.080299E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.980 | TFLOPs: 19.66 | 31: iteration 52420/ 173500 | consumed samples: 13419520 | consumed tokens: 27483176960 | elapsed time per iteration (s): 0.82 | learning rate: 1.640E-04 | global batch size: 256 | lm loss: 2.076766E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.621 | TFLOPs: 18.97 | 31: iteration 52430/ 173500 | consumed samples: 13422080 | consumed tokens: 27488419840 | elapsed time per iteration (s): 0.75 | learning rate: 1.640E-04 | global batch size: 256 | lm loss: 2.051689E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.106 | TFLOPs: 20.70 | 31: iteration 52440/ 173500 | consumed samples: 13424640 | consumed tokens: 27493662720 | elapsed time per iteration (s): 0.82 | learning rate: 1.640E-04 | global batch size: 256 | lm loss: 2.055073E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.644 | TFLOPs: 18.97 | 31: iteration 52450/ 173500 | consumed samples: 13427200 | consumed tokens: 27498905600 | elapsed time per iteration (s): 0.80 | learning rate: 1.640E-04 | global batch size: 256 | lm loss: 2.037606E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.147 | TFLOPs: 19.25 | 31: iteration 52460/ 173500 | consumed samples: 13429760 | 
consumed tokens: 27504148480 | elapsed time per iteration (s): 0.78 | learning rate: 1.640E-04 | global batch size: 256 | lm loss: 2.059865E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.437 | TFLOPs: 19.75 | 31: iteration 52470/ 173500 | consumed samples: 13432320 | consumed tokens: 27509391360 | elapsed time per iteration (s): 0.78 | learning rate: 1.640E-04 | global batch size: 256 | lm loss: 2.083370E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.059 | TFLOPs: 19.91 | 31: iteration 52480/ 173500 | consumed samples: 13434880 | consumed tokens: 27514634240 | elapsed time per iteration (s): 0.74 | learning rate: 1.639E-04 | global batch size: 256 | lm loss: 2.043094E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.448 | TFLOPs: 20.96 | 31: iteration 52490/ 173500 | consumed samples: 13437440 | consumed tokens: 27519877120 | elapsed time per iteration (s): 0.77 | learning rate: 1.639E-04 | global batch size: 256 | lm loss: 2.057757E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.375 | TFLOPs: 20.11 | 31: iteration 52500/ 173500 | consumed samples: 13440000 | consumed tokens: 27525120000 | elapsed time per iteration (s): 0.78 | learning rate: 1.639E-04 | global batch size: 256 | lm loss: 2.035056E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.779 | TFLOPs: 19.83 | 31: iteration 52510/ 173500 | consumed samples: 13442560 | consumed tokens: 27530362880 | elapsed time per iteration (s): 0.77 | learning rate: 1.639E-04 | global batch size: 256 | lm loss: 2.035757E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 331.450 | TFLOPs: 20.05 | 31: iteration 52520/ 173500 | consumed samples: 13445120 | consumed tokens: 27535605760 | elapsed time per iteration (s): 0.79 | learning rate: 1.639E-04 | global batch size: 256 | lm loss: 2.031388E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.264 | TFLOPs: 19.68 | 31: iteration 52530/ 173500 | consumed samples: 13447680 | consumed tokens: 27540848640 | elapsed time per iteration (s): 0.73 | learning rate: 1.639E-04 | global batch size: 256 | lm loss: 2.030701E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.392 | TFLOPs: 21.08 | 31: iteration 52540/ 173500 | consumed samples: 13450240 | consumed tokens: 27546091520 | elapsed time per iteration (s): 0.75 | learning rate: 1.639E-04 | global batch size: 256 | lm loss: 2.071985E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.248 | TFLOPs: 20.64 | 31: iteration 52550/ 173500 | consumed samples: 13452800 | consumed tokens: 27551334400 | elapsed time per iteration (s): 0.75 | learning rate: 1.638E-04 | global batch size: 256 | lm loss: 2.060668E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.761 | TFLOPs: 20.55 | 31: iteration 52560/ 173500 | consumed samples: 13455360 | consumed tokens: 27556577280 | elapsed time per iteration (s): 0.80 | learning rate: 1.638E-04 | global batch size: 256 | lm loss: 2.064445E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.353 | TFLOPs: 19.26 | 31: iteration 52570/ 173500 | consumed samples: 13457920 | consumed tokens: 27561820160 | elapsed time per iteration (s): 0.78 | learning rate: 1.638E-04 | global batch size: 256 | lm loss: 
2.058299E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.855 | TFLOPs: 19.96 | 31: iteration 52580/ 173500 | consumed samples: 13460480 | consumed tokens: 27567063040 | elapsed time per iteration (s): 0.84 | learning rate: 1.638E-04 | global batch size: 256 | lm loss: 2.048938E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.195 | TFLOPs: 18.52 | 31: iteration 52590/ 173500 | consumed samples: 13463040 | consumed tokens: 27572305920 | elapsed time per iteration (s): 0.72 | learning rate: 1.638E-04 | global batch size: 256 | lm loss: 2.057590E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.876 | TFLOPs: 21.41 | 31: iteration 52600/ 173500 | consumed samples: 13465600 | consumed tokens: 27577548800 | elapsed time per iteration (s): 0.78 | learning rate: 1.638E-04 | global batch size: 256 | lm loss: 2.063260E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.136 | TFLOPs: 19.73 | 31: iteration 52610/ 173500 | consumed samples: 13468160 | consumed tokens: 27582791680 | elapsed time per iteration (s): 0.77 | learning rate: 1.638E-04 | global batch size: 256 | lm loss: 2.053605E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.234 | TFLOPs: 20.10 | 31: iteration 52620/ 173500 | consumed samples: 13470720 | consumed tokens: 27588034560 | elapsed time per iteration (s): 0.77 | learning rate: 1.638E-04 | global batch size: 256 | lm loss: 2.028519E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.744 | TFLOPs: 20.07 | 31: iteration 52630/ 173500 | consumed samples: 13473280 | consumed tokens: 
27593277440 | elapsed time per iteration (s): 0.76 | learning rate: 1.637E-04 | global batch size: 256 | lm loss: 2.057445E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.325 | TFLOPs: 20.47 | 31: iteration 52640/ 173500 | consumed samples: 13475840 | consumed tokens: 27598520320 | elapsed time per iteration (s): 0.74 | learning rate: 1.637E-04 | global batch size: 256 | lm loss: 2.074728E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.798 | TFLOPs: 21.04 | 31: iteration 52650/ 173500 | consumed samples: 13478400 | consumed tokens: 27603763200 | elapsed time per iteration (s): 0.81 | learning rate: 1.637E-04 | global batch size: 256 | lm loss: 2.066819E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.049 | TFLOPs: 19.18 | 31: iteration 52660/ 173500 | consumed samples: 13480960 | consumed tokens: 27609006080 | elapsed time per iteration (s): 0.74 | learning rate: 1.637E-04 | global batch size: 256 | lm loss: 2.040726E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.359 | TFLOPs: 20.95 | 31: iteration 52670/ 173500 | consumed samples: 13483520 | consumed tokens: 27614248960 | elapsed time per iteration (s): 0.77 | learning rate: 1.637E-04 | global batch size: 256 | lm loss: 2.053401E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.948 | TFLOPs: 20.20 | 31: iteration 52680/ 173500 | consumed samples: 13486080 | consumed tokens: 27619491840 | elapsed time per iteration (s): 0.80 | learning rate: 1.637E-04 | global batch size: 256 | lm loss: 2.065332E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 318.423 | TFLOPs: 19.26 | 31: iteration 52690/ 173500 | consumed samples: 13488640 | consumed tokens: 27624734720 | elapsed time per iteration (s): 0.74 | learning rate: 1.637E-04 | global batch size: 256 | lm loss: 2.060394E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.587 | TFLOPs: 20.91 | 31: iteration 52700/ 173500 | consumed samples: 13491200 | consumed tokens: 27629977600 | elapsed time per iteration (s): 0.77 | learning rate: 1.636E-04 | global batch size: 256 | lm loss: 2.048056E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.335 | TFLOPs: 20.04 | 31: iteration 52710/ 173500 | consumed samples: 13493760 | consumed tokens: 27635220480 | elapsed time per iteration (s): 0.74 | learning rate: 1.636E-04 | global batch size: 256 | lm loss: 2.046102E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.805 | TFLOPs: 21.04 | 31: iteration 52720/ 173500 | consumed samples: 13496320 | consumed tokens: 27640463360 | elapsed time per iteration (s): 0.76 | learning rate: 1.636E-04 | global batch size: 256 | lm loss: 2.088106E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.898 | TFLOPs: 20.44 | 31: iteration 52730/ 173500 | consumed samples: 13498880 | consumed tokens: 27645706240 | elapsed time per iteration (s): 0.77 | learning rate: 1.636E-04 | global batch size: 256 | lm loss: 2.069884E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.443 | TFLOPs: 20.17 | 31: iteration 52740/ 173500 | consumed samples: 13501440 | consumed tokens: 27650949120 | elapsed time per iteration (s): 1.18 | learning rate: 1.636E-04 | global batch size: 256 | lm loss: 2.052830E+00 | grad 
norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 217.174 | TFLOPs: 13.14 | 31: iteration 52750/ 173500 | consumed samples: 13504000 | consumed tokens: 27656192000 | elapsed time per iteration (s): 0.76 | learning rate: 1.636E-04 | global batch size: 256 | lm loss: 2.070175E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.756 | TFLOPs: 20.25 | 31: iteration 52760/ 173500 | consumed samples: 13506560 | consumed tokens: 27661434880 | elapsed time per iteration (s): 0.83 | learning rate: 1.636E-04 | global batch size: 256 | lm loss: 2.083182E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.164 | TFLOPs: 18.76 | 31: iteration 52770/ 173500 | consumed samples: 13509120 | consumed tokens: 27666677760 | elapsed time per iteration (s): 0.77 | learning rate: 1.636E-04 | global batch size: 256 | lm loss: 2.066358E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.046 | TFLOPs: 20.15 | 31: iteration 52780/ 173500 | consumed samples: 13511680 | consumed tokens: 27671920640 | elapsed time per iteration (s): 0.79 | learning rate: 1.635E-04 | global batch size: 256 | lm loss: 2.070642E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.340 | TFLOPs: 19.56 | 31: iteration 52790/ 173500 | consumed samples: 13514240 | consumed tokens: 27677163520 | elapsed time per iteration (s): 0.80 | learning rate: 1.635E-04 | global batch size: 256 | lm loss: 2.042197E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.581 | TFLOPs: 19.33 | 31: iteration 52800/ 173500 | consumed samples: 13516800 | consumed tokens: 27682406400 | elapsed time 
per iteration (s): 0.77 | learning rate: 1.635E-04 | global batch size: 256 | lm loss: 2.026904E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.865 | TFLOPs: 20.20 | 31: iteration 52810/ 173500 | consumed samples: 13519360 | consumed tokens: 27687649280 | elapsed time per iteration (s): 0.88 | learning rate: 1.635E-04 | global batch size: 256 | lm loss: 2.054123E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.186 | TFLOPs: 17.62 | 31: iteration 52820/ 173500 | consumed samples: 13521920 | consumed tokens: 27692892160 | elapsed time per iteration (s): 0.81 | learning rate: 1.635E-04 | global batch size: 256 | lm loss: 2.066931E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.981 | TFLOPs: 19.24 | 31: iteration 52830/ 173500 | consumed samples: 13524480 | consumed tokens: 27698135040 | elapsed time per iteration (s): 0.81 | learning rate: 1.635E-04 | global batch size: 256 | lm loss: 2.063367E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.579 | TFLOPs: 19.03 | 31: iteration 52840/ 173500 | consumed samples: 13527040 | consumed tokens: 27703377920 | elapsed time per iteration (s): 0.76 | learning rate: 1.635E-04 | global batch size: 256 | lm loss: 2.044883E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.837 | TFLOPs: 20.44 | 31: iteration 52850/ 173500 | consumed samples: 13529600 | consumed tokens: 27708620800 | elapsed time per iteration (s): 0.75 | learning rate: 1.635E-04 | global batch size: 256 | lm loss: 2.059888E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.431 | TFLOPs: 
20.78 | 31: iteration 52860/ 173500 | consumed samples: 13532160 | consumed tokens: 27713863680 | elapsed time per iteration (s): 0.78 | learning rate: 1.634E-04 | global batch size: 256 | lm loss: 2.044817E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.031 | TFLOPs: 19.78 | 31: iteration 52870/ 173500 | consumed samples: 13534720 | consumed tokens: 27719106560 | elapsed time per iteration (s): 0.85 | learning rate: 1.634E-04 | global batch size: 256 | lm loss: 2.095130E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.817 | TFLOPs: 18.32 | 31: iteration 52880/ 173500 | consumed samples: 13537280 | consumed tokens: 27724349440 | elapsed time per iteration (s): 0.80 | learning rate: 1.634E-04 | global batch size: 256 | lm loss: 2.053868E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.739 | TFLOPs: 19.40 | 31: iteration 52890/ 173500 | consumed samples: 13539840 | consumed tokens: 27729592320 | elapsed time per iteration (s): 0.74 | learning rate: 1.634E-04 | global batch size: 256 | lm loss: 2.070924E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.751 | TFLOPs: 20.86 | 31: iteration 52900/ 173500 | consumed samples: 13542400 | consumed tokens: 27734835200 | elapsed time per iteration (s): 0.76 | learning rate: 1.634E-04 | global batch size: 256 | lm loss: 2.049472E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.094 | TFLOPs: 20.39 | 31: iteration 52910/ 173500 | consumed samples: 13544960 | consumed tokens: 27740078080 | elapsed time per iteration (s): 0.78 | learning rate: 1.634E-04 | global batch size: 256 | lm loss: 2.044226E+00 | grad norm: 0.131 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.464 | TFLOPs: 19.93 | 31: iteration 52920/ 173500 | consumed samples: 13547520 | consumed tokens: 27745320960 | elapsed time per iteration (s): 0.80 | learning rate: 1.634E-04 | global batch size: 256 | lm loss: 2.024763E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.485 | TFLOPs: 19.39 | 31: iteration 52930/ 173500 | consumed samples: 13550080 | consumed tokens: 27750563840 | elapsed time per iteration (s): 0.78 | learning rate: 1.633E-04 | global batch size: 256 | lm loss: 2.054315E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.272 | TFLOPs: 19.86 | 31: iteration 52940/ 173500 | consumed samples: 13552640 | consumed tokens: 27755806720 | elapsed time per iteration (s): 0.76 | learning rate: 1.633E-04 | global batch size: 256 | lm loss: 2.039365E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.805 | TFLOPs: 20.25 | 31: iteration 52950/ 173500 | consumed samples: 13555200 | consumed tokens: 27761049600 | elapsed time per iteration (s): 0.73 | learning rate: 1.633E-04 | global batch size: 256 | lm loss: 1.999714E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.610 | TFLOPs: 21.09 | 31: iteration 52960/ 173500 | consumed samples: 13557760 | consumed tokens: 27766292480 | elapsed time per iteration (s): 0.75 | learning rate: 1.633E-04 | global batch size: 256 | lm loss: 2.062879E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.429 | TFLOPs: 20.78 | 31: iteration 52970/ 173500 | consumed samples: 13560320 | consumed tokens: 27771535360 | elapsed time per iteration (s): 0.74 | 
learning rate: 1.633E-04 | global batch size: 256 | lm loss: 2.063367E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.972 | TFLOPs: 21.05 | 31: iteration 52980/ 173500 | consumed samples: 13562880 | consumed tokens: 27776778240 | elapsed time per iteration (s): 0.77 | learning rate: 1.633E-04 | global batch size: 256 | lm loss: 2.066353E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.203 | TFLOPs: 20.04 | 31: iteration 52990/ 173500 | consumed samples: 13565440 | consumed tokens: 27782021120 | elapsed time per iteration (s): 0.75 | learning rate: 1.633E-04 | global batch size: 256 | lm loss: 2.047220E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.795 | TFLOPs: 20.62 | 31: iteration 53000/ 173500 | consumed samples: 13568000 | consumed tokens: 27787264000 | elapsed time per iteration (s): 0.80 | learning rate: 1.633E-04 | global batch size: 256 | lm loss: 2.059166E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.613 | TFLOPs: 19.40 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 53000 | lm loss value: 2.097363E+00 | lm loss PPL: 8.144660E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 53000 to checkpoints_1b1long 0: [2022-11-26 06:00:00,886] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step53000 is begin to save! 0: [2022-11-26 06:00:00,900] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 06:00:01,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_01-model_00-model_states.pt. 0: [2022-11-26 06:00:01,110] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_03-model_00-model_states.pt... 0: [2022-11-26 06:00:01,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_03-model_00-model_states.pt. 0: [2022-11-26 06:00:01,186] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_04-model_00-model_states.pt... 0: [2022-11-26 06:00:01,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_04-model_00-model_states.pt. 0: [2022-11-26 06:00:01,267] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_05-model_00-model_states.pt... 0: [2022-11-26 06:00:01,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_05-model_00-model_states.pt. 0: [2022-11-26 06:00:01,342] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_06-model_00-model_states.pt... 0: [2022-11-26 06:00:01,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_06-model_00-model_states.pt. 0: [2022-11-26 06:00:01,415] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_07-model_00-model_states.pt... 0: [2022-11-26 06:00:01,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_07-model_00-model_states.pt. 0: [2022-11-26 06:00:01,493] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 06:00:01,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_08-model_00-model_states.pt. 0: [2022-11-26 06:00:01,574] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_09-model_00-model_states.pt... 0: [2022-11-26 06:00:01,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_09-model_00-model_states.pt. 0: [2022-11-26 06:00:01,649] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_10-model_00-model_states.pt... 0: [2022-11-26 06:00:01,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_10-model_00-model_states.pt. 0: [2022-11-26 06:00:01,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_11-model_00-model_states.pt... 0: [2022-11-26 06:00:01,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_11-model_00-model_states.pt. 0: [2022-11-26 06:00:01,799] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_12-model_00-model_states.pt... 0: [2022-11-26 06:00:01,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_12-model_00-model_states.pt. 0: [2022-11-26 06:00:01,872] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_13-model_00-model_states.pt... 0: [2022-11-26 06:00:01,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_13-model_00-model_states.pt. 0: [2022-11-26 06:00:01,946] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 06:00:02,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_14-model_00-model_states.pt. 0: [2022-11-26 06:00:02,020] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_15-model_00-model_states.pt... 0: [2022-11-26 06:00:02,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_15-model_00-model_states.pt. 0: [2022-11-26 06:00:02,092] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_16-model_00-model_states.pt... 0: [2022-11-26 06:00:02,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_16-model_00-model_states.pt. 0: [2022-11-26 06:00:02,166] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_17-model_00-model_states.pt... 0: [2022-11-26 06:00:02,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_17-model_00-model_states.pt. 0: [2022-11-26 06:00:02,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_18-model_00-model_states.pt... 0: [2022-11-26 06:00:02,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_18-model_00-model_states.pt. 0: [2022-11-26 06:00:02,312] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_19-model_00-model_states.pt... 0: [2022-11-26 06:00:02,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_19-model_00-model_states.pt. 0: [2022-11-26 06:00:02,384] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 06:00:02,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_20-model_00-model_states.pt. 0: [2022-11-26 06:00:02,458] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_21-model_00-model_states.pt... 0: [2022-11-26 06:00:02,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_21-model_00-model_states.pt. 0: [2022-11-26 06:00:02,531] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_22-model_00-model_states.pt... 0: [2022-11-26 06:00:02,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_22-model_00-model_states.pt. 0: [2022-11-26 06:00:02,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_23-model_00-model_states.pt... 0: [2022-11-26 06:00:02,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_23-model_00-model_states.pt. 0: [2022-11-26 06:00:02,679] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_24-model_00-model_states.pt... 0: [2022-11-26 06:00:02,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_24-model_00-model_states.pt. 0: [2022-11-26 06:00:02,752] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_25-model_00-model_states.pt... 0: [2022-11-26 06:00:02,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_25-model_00-model_states.pt. 0: [2022-11-26 06:00:02,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 06:00:02,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_26-model_00-model_states.pt. 0: [2022-11-26 06:00:02,898] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_27-model_00-model_states.pt... 0: [2022-11-26 06:00:02,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_27-model_00-model_states.pt. 0: [2022-11-26 06:00:02,971] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_28-model_00-model_states.pt... 0: [2022-11-26 06:00:03,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_28-model_00-model_states.pt. 0: [2022-11-26 06:00:03,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/layer_30-model_00-model_states.pt... 0: [2022-11-26 06:00:03,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/layer_30-model_00-model_states.pt. 0: [2022-11-26 06:00:03,048] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step53000/mp_rank_00_model_states.pt 0: [2022-11-26 06:00:03,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/mp_rank_00_model_states.pt... 0: [2022-11-26 06:00:03,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/mp_rank_00_model_states.pt. 0: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
6: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
4: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
1: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
2: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
12: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 
25: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 
24: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 
22: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 
18: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 
27: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 
9: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 
16: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
12: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 
23: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
31: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 
17: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
0: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 
8: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 
23: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 
17: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 
27: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
25: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 
21: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 
31: [2022-11-26 06:00:03,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:00:03,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:00:03,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 0: [2022-11-26 06:00:03,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:00:03,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 0: [2022-11-26 06:00:03,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 06:00:03,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 28: [2022-11-26 06:00:03,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:00:03,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:00:03,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 06:00:03,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 4: [2022-11-26 06:00:03,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
16: [2022-11-26 06:00:03,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:00:03,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 06:00:03,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 16: [2022-11-26 06:00:03,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 06:00:03,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 15: [2022-11-26 06:00:03,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:00:03,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 06:00:03,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 31: [2022-11-26 06:00:03,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:00:03,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 06:00:03,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 5: [2022-11-26 06:00:03,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-26 06:00:03,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 06:00:03,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 3: [2022-11-26 06:00:03,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:00:03,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 06:00:03,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 17: [2022-11-26 06:00:03,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:00:03,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 06:00:03,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 30: [2022-11-26 06:00:03,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:00:03,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 06:00:03,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 0: [2022-11-26 06:00:03,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
26: [2022-11-26 06:00:03,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:00:03,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:00:03,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:00:03,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:00:03,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 22: [2022-11-26 06:00:03,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 26: [2022-11-26 06:00:03,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 06:00:03,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 22: [2022-11-26 06:00:03,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 25: [2022-11-26 06:00:03,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 26: [2022-11-26 06:00:03,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 25: [2022-11-26 06:00:03,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 
7: [2022-11-26 06:00:03,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:00:03,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 06:00:03,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 15: [2022-11-26 06:00:03,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:00:03,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:00:03,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 06:00:03,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 29: [2022-11-26 06:00:03,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 06:00:03,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 13: [2022-11-26 06:00:03,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:00:03,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 06:00:03,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 
13: [2022-11-26 06:00:03,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:00:03,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 06:00:03,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 29: [2022-11-26 06:00:03,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:00:03,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 06:00:03,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 20: [2022-11-26 06:00:03,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:00:03,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 06:00:03,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 9: [2022-11-26 06:00:03,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 06:00:03,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 4: [2022-11-26 06:00:03,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:00:03,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 4: [2022-11-26 06:00:03,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 06:00:03,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 12: [2022-11-26 06:00:03,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:00:03,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 06:00:03,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 18: [2022-11-26 06:00:03,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:00:03,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
18: [2022-11-26 06:00:03,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 06:00:03,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 06:00:03,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 18: [2022-11-26 06:00:03,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 17: [2022-11-26 06:00:03,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:00:03,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:00:03,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 31: [2022-11-26 06:00:03,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:00:03,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 8: [2022-11-26 06:00:03,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 17: [2022-11-26 06:00:03,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 
31: [2022-11-26 06:00:03,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 06:00:03,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 20: [2022-11-26 06:00:03,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:00:03,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 06:00:03,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 3: [2022-11-26 06:00:03,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:00:03,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 06:00:03,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 28: [2022-11-26 06:00:03,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 06:00:03,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 28: [2022-11-26 06:00:03,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
28: [2022-11-26 06:00:03,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 06:00:03,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 15: [2022-11-26 06:00:03,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:00:03,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 06:00:03,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 21: [2022-11-26 06:00:03,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:00:03,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 06:00:03,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 7: [2022-11-26 06:00:03,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:00:03,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:00:03,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
7: [2022-11-26 06:00:03,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 19: [2022-11-26 06:00:03,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:00:03,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 06:00:03,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 7: [2022-11-26 06:00:03,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 19: [2022-11-26 06:00:03,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 19: [2022-11-26 06:00:03,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 19: [2022-11-26 06:00:03,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 06:00:03,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 12: [2022-11-26 06:00:03,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:00:03,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 16: [2022-11-26 06:00:03,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
12: [2022-11-26 06:00:03,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 21: [2022-11-26 06:00:03,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:00:03,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 21: [2022-11-26 06:00:03,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 16: [2022-11-26 06:00:03,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 21: [2022-11-26 06:00:03,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 24: [2022-11-26 06:00:03,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:00:03,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 06:00:03,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 31: [2022-11-26 06:00:03,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:00:03,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 06:00:03,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 
30: [2022-11-26 06:00:03,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:00:03,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 06:00:03,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 0: [2022-11-26 06:00:03,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:00:03,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 06:00:03,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 27: [2022-11-26 06:00:03,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:00:03,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 06:00:03,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 29: [2022-11-26 06:00:03,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:00:03,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 06:00:03,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 
24: [2022-11-26 06:00:03,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:00:03,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 06:00:03,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 5: [2022-11-26 06:00:03,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:00:03,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 06:00:03,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 18: [2022-11-26 06:00:03,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:00:03,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 06:00:03,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 11: [2022-11-26 06:00:03,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:00:03,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 06:00:03,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 
24: [2022-11-26 06:00:03,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:00:03,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 06:00:03,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 20: [2022-11-26 06:00:03,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:00:03,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 06:00:03,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 9: [2022-11-26 06:00:03,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:00:03,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:00:03,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 4: [2022-11-26 06:00:03,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:00:03,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 
17: [2022-11-26 06:00:03,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 4: [2022-11-26 06:00:03,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 17: [2022-11-26 06:00:03,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 4: [2022-11-26 06:00:03,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 23: [2022-11-26 06:00:03,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:00:03,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 06:00:03,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 9: [2022-11-26 06:00:03,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:00:03,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 06:00:03,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 25: [2022-11-26 06:00:03,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
25: [2022-11-26 06:00:03,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 06:00:03,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 3: [2022-11-26 06:00:03,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:00:03,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:00:03,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 19: [2022-11-26 06:00:03,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 06:00:03,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 3: [2022-11-26 06:00:03,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 12: [2022-11-26 06:00:03,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:00:03,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-26 06:00:03,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 12: [2022-11-26 06:00:03,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 23: [2022-11-26 06:00:03,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 12: [2022-11-26 06:00:03,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 13: [2022-11-26 06:00:03,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:00:03,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 06:00:03,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 22: [2022-11-26 06:00:03,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:00:03,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 11: [2022-11-26 06:00:03,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:00:03,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 
11: [2022-11-26 06:00:03,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 06:00:03,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 28: [2022-11-26 06:00:03,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:00:03,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:00:03,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 22: [2022-11-26 06:00:03,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:00:03,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 22: [2022-11-26 06:00:03,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 06:00:03,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 0: [2022-11-26 06:00:03,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 30: [2022-11-26 06:00:03,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:00:03,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 
30: [2022-11-26 06:00:03,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 06:00:03,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 15: [2022-11-26 06:00:03,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:00:03,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 06:00:03,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 7: [2022-11-26 06:00:03,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:00:03,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 06:00:03,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 9: [2022-11-26 06:00:03,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:00:03,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 06:00:03,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 26: [2022-11-26 06:00:03,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
26: [2022-11-26 06:00:03,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 28: [2022-11-26 06:00:03,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 26: [2022-11-26 06:00:03,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 28: [2022-11-26 06:00:03,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 28: [2022-11-26 06:00:03,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 06:00:03,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 06:00:03,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 20: [2022-11-26 06:00:03,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:00:03,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 06:00:03,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 27: [2022-11-26 06:00:03,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:00:03,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
27: [2022-11-26 06:00:03,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:00:03,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 29: [2022-11-26 06:00:03,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 27: [2022-11-26 06:00:03,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 29: [2022-11-26 06:00:03,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 27: [2022-11-26 06:00:03,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 27: [2022-11-26 06:00:03,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 2: [2022-11-26 06:00:03,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:00:03,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:00:03,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 06:00:03,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 
2: [2022-11-26 06:00:03,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 06:00:03,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 8: [2022-11-26 06:00:03,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:00:03,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 06:00:03,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 5: [2022-11-26 06:00:03,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:00:03,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 06:00:03,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 25: [2022-11-26 06:00:03,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:00:03,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 06:00:03,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 10: [2022-11-26 06:00:03,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-26 06:00:03,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:00:03,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:00:03,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 06:00:03,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 06:00:03,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 06:00:03,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 10: [2022-11-26 06:00:03,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 10: [2022-11-26 06:00:03,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 14: [2022-11-26 06:00:03,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:00:03,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:00:03,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-26 06:00:03,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 06:00:03,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 06:00:03,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 06:00:03,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 14: [2022-11-26 06:00:03,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 14: [2022-11-26 06:00:03,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 13: [2022-11-26 06:00:03,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:00:03,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 06:00:03,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 1: [2022-11-26 06:00:03,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:00:03,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:00:03,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 06:00:03,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:00:03,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 06:00:03,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 06:00:03,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 06:00:03,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 06:00:03,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 1: [2022-11-26 06:00:03,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 1: [2022-11-26 06:00:03,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 1: [2022-11-26 06:00:03,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 21: [2022-11-26 06:00:03,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:00:03,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 06:00:03,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 
17: [2022-11-26 06:00:03,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:00:03,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 06:00:03,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 26: [2022-11-26 06:00:03,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:00:03,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 06:00:03,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 11: [2022-11-26 06:00:03,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:00:03,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 06:00:03,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 8: [2022-11-26 06:00:03,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:00:03,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 06:00:03,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 
21: [2022-11-26 06:00:03,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:00:03,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 06:00:03,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 6: [2022-11-26 06:00:03,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:00:03,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:00:03,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:00:03,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 06:00:03,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 06:00:03,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 06:00:03,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 06:00:03,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 06:00:03,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 6: [2022-11-26 06:00:03,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 6: [2022-11-26 06:00:03,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 6: [2022-11-26 06:00:03,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 3: [2022-11-26 06:00:03,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:00:03,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 06:00:03,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 30: [2022-11-26 06:00:03,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-26 06:00:03,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 06:00:03,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 25: [2022-11-26 06:00:03,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:00:03,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 06:00:03,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 27: [2022-11-26 06:00:03,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:00:03,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 06:00:03,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 10: [2022-11-26 06:00:03,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:00:03,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 06:00:03,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 24: [2022-11-26 06:00:03,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
24: [2022-11-26 06:00:03,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 06:00:03,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 5: [2022-11-26 06:00:03,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:00:03,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 06:00:03,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 23: [2022-11-26 06:00:03,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:00:03,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 06:00:03,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 18: [2022-11-26 06:00:03,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:00:03,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 06:00:03,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 0: [2022-11-26 06:00:03,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 06:00:03,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 06:00:03,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 22: [2022-11-26 06:00:03,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:00:03,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 06:00:03,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 31: [2022-11-26 06:00:03,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:00:03,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 06:00:03,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 4: [2022-11-26 06:00:03,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:00:03,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 06:00:03,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 12: [2022-11-26 06:00:03,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 06:00:03,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 06:00:03,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 16: [2022-11-26 06:00:03,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:00:03,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 06:00:03,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 14: [2022-11-26 06:00:03,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:00:03,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 06:00:03,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 7: [2022-11-26 06:00:03,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:00:03,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 06:00:03,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 2: [2022-11-26 06:00:03,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 06:00:03,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 06:00:03,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 6: [2022-11-26 06:00:03,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:00:03,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 06:00:03,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 1: [2022-11-26 06:00:03,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 28: [2022-11-26 06:00:03,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:00:03,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 28: [2022-11-26 06:00:03,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 06:00:03,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 1: [2022-11-26 06:00:03,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 9: [2022-11-26 06:00:03,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-26 06:00:03,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 06:00:03,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 15: [2022-11-26 06:00:03,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:00:03,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 06:00:03,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 19: [2022-11-26 06:00:03,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:00:03,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 06:00:03,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 20: [2022-11-26 06:00:03,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:00:03,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 06:00:03,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 29: [2022-11-26 06:00:03,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-26 06:00:03,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 06:00:03,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 17: [2022-11-26 06:00:03,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:00:03,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 06:00:03,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 26: [2022-11-26 06:00:03,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:00:03,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 06:00:03,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 8: [2022-11-26 06:00:03,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:00:03,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 06:00:03,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 3: [2022-11-26 06:00:03,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 06:00:03,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 25: [2022-11-26 06:00:03,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:00:03,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 25: [2022-11-26 06:00:03,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 21: [2022-11-26 06:00:03,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:00:03,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 25: [2022-11-26 06:00:03,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 21: [2022-11-26 06:00:03,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 11: [2022-11-26 06:00:03,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:00:03,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 06:00:03,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 24: [2022-11-26 06:00:03,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
24: [2022-11-26 06:00:03,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 10: [2022-11-26 06:00:03,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:00:03,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 24: [2022-11-26 06:00:03,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 10: [2022-11-26 06:00:03,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 27: [2022-11-26 06:00:03,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:00:03,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 06:00:03,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 13: [2022-11-26 06:00:03,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:00:03,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 06:00:03,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 23: [2022-11-26 06:00:03,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
23: [2022-11-26 06:00:03,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 06:00:03,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 30: [2022-11-26 06:00:03,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:00:03,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 06:00:03,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 5: [2022-11-26 06:00:03,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:00:03,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 06:00:03,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 0: [2022-11-26 06:00:03,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:00:03,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 06:00:03,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 18: [2022-11-26 06:00:03,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
18: [2022-11-26 06:00:03,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 06:00:03,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 22: [2022-11-26 06:00:03,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:00:03,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 06:00:03,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 28: [2022-11-26 06:00:03,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:00:03,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:00:03,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 28: [2022-11-26 06:00:03,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 06:00:03,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 16: [2022-11-26 06:00:03,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 12: [2022-11-26 06:00:03,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-26 06:00:03,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 06:00:03,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 31: [2022-11-26 06:00:03,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:00:03,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 06:00:03,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 15: [2022-11-26 06:00:03,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:00:03,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 06:00:03,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 1: [2022-11-26 06:00:03,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:00:03,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
1: [2022-11-26 06:00:03,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 9: [2022-11-26 06:00:03,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 06:00:03,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 1: [2022-11-26 06:00:03,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 7: [2022-11-26 06:00:03,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:00:03,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 2: [2022-11-26 06:00:03,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:00:03,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 2: [2022-11-26 06:00:03,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 06:00:03,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 14: [2022-11-26 06:00:03,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-26 06:00:03,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 06:00:03,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 4: [2022-11-26 06:00:03,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:00:03,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 06:00:03,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 6: [2022-11-26 06:00:03,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:00:03,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 06:00:03,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 29: [2022-11-26 06:00:03,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:00:03,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 06:00:03,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 19: [2022-11-26 06:00:03,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
19: [2022-11-26 06:00:03,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 06:00:03,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 26: [2022-11-26 06:00:03,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:00:03,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 06:00:03,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 8: [2022-11-26 06:00:03,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:00:03,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 06:00:03,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 13: [2022-11-26 06:00:03,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:00:03,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 06:00:03,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 21: [2022-11-26 06:00:03,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
21: [2022-11-26 06:00:03,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 06:00:03,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 20: [2022-11-26 06:00:03,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:00:03,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 06:00:03,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 11: [2022-11-26 06:00:03,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:00:03,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:00:03,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 3: [2022-11-26 06:00:03,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 11: [2022-11-26 06:00:03,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 3: [2022-11-26 06:00:03,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 17: [2022-11-26 06:00:03,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-26 06:00:03,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 06:00:03,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 25: [2022-11-26 06:00:03,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:00:03,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 06:00:03,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 10: [2022-11-26 06:00:03,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:00:03,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 06:00:03,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 23: [2022-11-26 06:00:03,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:00:03,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 06:00:03,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 24: [2022-11-26 06:00:03,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-26 06:00:03,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 06:00:03,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 0: [2022-11-26 06:00:03,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:00:03,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 06:00:03,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 30: [2022-11-26 06:00:03,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:00:03,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 06:00:03,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 5: [2022-11-26 06:00:03,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:00:03,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 06:00:03,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 27: [2022-11-26 06:00:03,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
27: [2022-11-26 06:00:03,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 06:00:03,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 12: [2022-11-26 06:00:03,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:00:03,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 06:00:03,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 18: [2022-11-26 06:00:03,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:00:03,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:00:03,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 31: [2022-11-26 06:00:03,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 18: [2022-11-26 06:00:03,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 31: [2022-11-26 06:00:03,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 16: [2022-11-26 06:00:03,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 
16: [2022-11-26 06:00:03,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 06:00:03,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 22: [2022-11-26 06:00:03,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:00:03,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 06:00:03,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 2: [2022-11-26 06:00:03,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:00:03,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 06:00:03,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 14: [2022-11-26 06:00:03,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:00:03,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 06:00:03,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 7: [2022-11-26 06:00:03,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-26 06:00:03,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 06:00:03,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 4: [2022-11-26 06:00:03,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:00:03,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 06:00:03,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 1: [2022-11-26 06:00:03,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:00:03,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 06:00:03,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 28: [2022-11-26 06:00:03,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 06:00:03,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 06:00:03,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 6: [2022-11-26 06:00:03,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 06:00:03,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 06:00:03,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 9: [2022-11-26 06:00:03,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:00:03,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 06:00:03,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 19: [2022-11-26 06:00:03,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:00:03,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 06:00:03,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 29: [2022-11-26 06:00:03,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:00:03,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 06:00:03,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 8: [2022-11-26 06:00:03,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 06:00:03,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 06:00:03,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 21: [2022-11-26 06:00:03,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:00:03,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 06:00:03,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 26: [2022-11-26 06:00:03,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:00:03,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 06:00:03,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 15: [2022-11-26 06:00:03,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:00:03,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 06:00:03,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 11: [2022-11-26 06:00:03,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 06:00:03,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 06:00:03,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 17: [2022-11-26 06:00:03,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:00:03,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 06:00:03,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 13: [2022-11-26 06:00:03,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:00:03,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 06:00:03,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 20: [2022-11-26 06:00:03,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:00:03,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 06:00:03,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 5: [2022-11-26 06:00:03,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 06:00:03,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 06:00:03,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 30: [2022-11-26 06:00:03,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:00:03,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 06:00:03,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 25: [2022-11-26 06:00:03,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:00:03,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 3: [2022-11-26 06:00:03,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:00:03,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 3: [2022-11-26 06:00:03,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 06:00:03,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 18: [2022-11-26 06:00:03,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
18: [2022-11-26 06:00:03,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 06:00:03,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 10: [2022-11-26 06:00:03,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:00:03,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 06:00:03,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 22: [2022-11-26 06:00:03,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:00:03,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 06:00:03,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 24: [2022-11-26 06:00:03,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:00:03,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 06:00:03,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 4: [2022-11-26 06:00:03,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 06:00:03,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 06:00:03,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 27: [2022-11-26 06:00:03,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:00:03,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 06:00:03,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 23: [2022-11-26 06:00:03,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:00:03,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 06:00:03,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 31: [2022-11-26 06:00:03,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:00:03,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 06:00:03,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 26: [2022-11-26 06:00:03,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-26 06:00:03,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 06:00:03,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 14: [2022-11-26 06:00:03,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:00:03,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 06:00:03,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 29: [2022-11-26 06:00:03,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:00:03,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 06:00:03,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 16: [2022-11-26 06:00:03,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:00:03,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:00:03,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 06:00:03,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 
2: [2022-11-26 06:00:03,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:00:03,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 2: [2022-11-26 06:00:03,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 5: [2022-11-26 06:00:03,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:00:03,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 2: [2022-11-26 06:00:03,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 5: [2022-11-26 06:00:03,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 06:00:03,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 15: [2022-11-26 06:00:03,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:00:03,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 06:00:03,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 19: [2022-11-26 06:00:03,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
19: [2022-11-26 06:00:03,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 06:00:03,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 7: [2022-11-26 06:00:03,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:00:03,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:00:03,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 06:00:03,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 0: [2022-11-26 06:00:03,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:00:03,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 0: [2022-11-26 06:00:03,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 1: [2022-11-26 06:00:03,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 0: [2022-11-26 06:00:03,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 13: [2022-11-26 06:00:03,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
21: [2022-11-26 06:00:03,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 28: [2022-11-26 06:00:03,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:00:03,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 21: [2022-11-26 06:00:03,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 28: [2022-11-26 06:00:03,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 21: [2022-11-26 06:00:03,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 13: [2022-11-26 06:00:03,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 28: [2022-11-26 06:00:03,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 6: [2022-11-26 06:00:03,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:00:03,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 06:00:03,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 18: [2022-11-26 06:00:03,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
18: [2022-11-26 06:00:03,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 06:00:03,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 12: [2022-11-26 06:00:03,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:00:03,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 06:00:03,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 11: [2022-11-26 06:00:03,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:00:03,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 06:00:03,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 27: [2022-11-26 06:00:03,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:00:03,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 06:00:03,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 30: [2022-11-26 06:00:03,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-26 06:00:03,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 06:00:03,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 20: [2022-11-26 06:00:03,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:00:03,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:00:03,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 0: [2022-11-26 06:00:03,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 17: [2022-11-26 06:00:03,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:00:03,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 0: [2022-11-26 06:00:03,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 17: [2022-11-26 06:00:03,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 06:00:03,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 2: [2022-11-26 06:00:03,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
9: [2022-11-26 06:00:03,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:00:03,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 9: [2022-11-26 06:00:03,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 2: [2022-11-26 06:00:03,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 25: [2022-11-26 06:00:03,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:00:03,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 25: [2022-11-26 06:00:03,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 06:00:03,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 22: [2022-11-26 06:00:03,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:00:03,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 16: [2022-11-26 06:00:03,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:00:03,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 
16: [2022-11-26 06:00:03,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 06:00:03,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 3: [2022-11-26 06:00:03,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:00:03,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 06:00:03,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 4: [2022-11-26 06:00:03,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:00:03,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 06:00:03,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 24: [2022-11-26 06:00:03,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:00:03,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 06:00:03,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 31: [2022-11-26 06:00:03,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
31: [2022-11-26 06:00:03,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 06:00:03,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 10: [2022-11-26 06:00:03,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:00:03,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 06:00:03,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 14: [2022-11-26 06:00:03,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:00:03,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 06:00:03,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 7: [2022-11-26 06:00:03,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:00:03,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 06:00:03,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 12: [2022-11-26 06:00:03,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 06:00:03,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 06:00:03,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 23: [2022-11-26 06:00:03,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:00:03,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 06:00:03,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 2: [2022-11-26 06:00:03,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:00:03,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 06:00:03,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 23: [2022-11-26 06:00:03,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:00:03,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step53000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 06:00:03,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 
0: successfully saved checkpoint at iteration 53000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2485.50 31: iteration 53010/ 173500 | consumed samples: 13570560 | consumed tokens: 27792506880 | elapsed time per iteration (s): 1.04 | learning rate: 1.632E-04 | global batch size: 256 | lm loss: 2.059587E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.292 | TFLOPs: 14.84 | 31: iteration 53020/ 173500 | consumed samples: 13573120 | consumed tokens: 27797749760 | elapsed time per iteration (s): 0.75 | learning rate: 1.632E-04 | global batch size: 256 | lm loss: 2.071588E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.798 | TFLOPs: 20.62 | 31: iteration 53030/ 173500 | consumed samples: 13575680 | consumed tokens: 27802992640 | elapsed time per iteration (s): 0.79 | learning rate: 1.632E-04 | global batch size: 256 | lm loss: 2.060419E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.497 | TFLOPs: 19.69 | 31: iteration 53040/ 173500 | consumed samples: 13578240 | consumed tokens: 27808235520 | elapsed time per iteration (s): 0.72 | learning rate: 1.632E-04 | global batch size: 256 | lm loss: 2.086976E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.663 | TFLOPs: 21.46 | 31: iteration 53050/ 173500 | consumed samples: 13580800 | consumed tokens: 27813478400 | elapsed time per iteration (s): 0.77 | learning rate: 1.632E-04 | global batch size: 256 | lm loss: 2.050679E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.068 | TFLOPs: 20.15 | 31: iteration 53060/ 173500 | consumed samples: 13583360 | consumed tokens: 27818721280 | elapsed time per iteration (s): 0.74 | 
learning rate: 1.632E-04 | global batch size: 256 | lm loss: 2.065760E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.217 | TFLOPs: 21.01 | 31: iteration 53070/ 173500 | consumed samples: 13585920 | consumed tokens: 27823964160 | elapsed time per iteration (s): 0.79 | learning rate: 1.632E-04 | global batch size: 256 | lm loss: 2.074208E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.821 | TFLOPs: 19.53 | 31: iteration 53080/ 173500 | consumed samples: 13588480 | consumed tokens: 27829207040 | elapsed time per iteration (s): 0.75 | learning rate: 1.631E-04 | global batch size: 256 | lm loss: 2.046782E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.103 | TFLOPs: 20.64 | 31: iteration 53090/ 173500 | consumed samples: 13591040 | consumed tokens: 27834449920 | elapsed time per iteration (s): 0.73 | learning rate: 1.631E-04 | global batch size: 256 | lm loss: 2.081433E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.281 | TFLOPs: 21.13 | 31: iteration 53100/ 173500 | consumed samples: 13593600 | consumed tokens: 27839692800 | elapsed time per iteration (s): 0.78 | learning rate: 1.631E-04 | global batch size: 256 | lm loss: 2.056034E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.275 | TFLOPs: 19.74 | 31: iteration 53110/ 173500 | consumed samples: 13596160 | consumed tokens: 27844935680 | elapsed time per iteration (s): 0.73 | learning rate: 1.631E-04 | global batch size: 256 | lm loss: 2.067429E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.169 | TFLOPs: 21.12 | 31: iteration 53120/ 
173500 | consumed samples: 13598720 | consumed tokens: 27850178560 | elapsed time per iteration (s): 0.82 | learning rate: 1.631E-04 | global batch size: 256 | lm loss: 2.052794E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.021 | TFLOPs: 18.94 | 31: iteration 53130/ 173500 | consumed samples: 13601280 | consumed tokens: 27855421440 | elapsed time per iteration (s): 0.75 | learning rate: 1.631E-04 | global batch size: 256 | lm loss: 2.043376E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.214 | TFLOPs: 20.64 | 31: iteration 53140/ 173500 | consumed samples: 13603840 | consumed tokens: 27860664320 | elapsed time per iteration (s): 0.79 | learning rate: 1.631E-04 | global batch size: 256 | lm loss: 2.088331E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.502 | TFLOPs: 19.63 | 31: iteration 53150/ 173500 | consumed samples: 13606400 | consumed tokens: 27865907200 | elapsed time per iteration (s): 0.74 | learning rate: 1.631E-04 | global batch size: 256 | lm loss: 2.067367E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.793 | TFLOPs: 20.80 | 31: iteration 53160/ 173500 | consumed samples: 13608960 | consumed tokens: 27871150080 | elapsed time per iteration (s): 0.74 | learning rate: 1.630E-04 | global batch size: 256 | lm loss: 2.053536E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.018 | TFLOPs: 20.87 | 31: iteration 53170/ 173500 | consumed samples: 13611520 | consumed tokens: 27876392960 | elapsed time per iteration (s): 0.74 | learning rate: 1.630E-04 | global batch size: 256 | lm loss: 2.056870E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 347.648 | TFLOPs: 21.03 | 31: iteration 53180/ 173500 | consumed samples: 13614080 | consumed tokens: 27881635840 | elapsed time per iteration (s): 0.74 | learning rate: 1.630E-04 | global batch size: 256 | lm loss: 2.069228E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.010 | TFLOPs: 21.05 | 31: iteration 53190/ 173500 | consumed samples: 13616640 | consumed tokens: 27886878720 | elapsed time per iteration (s): 0.78 | learning rate: 1.630E-04 | global batch size: 256 | lm loss: 2.063009E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.789 | TFLOPs: 19.95 | 31: iteration 53200/ 173500 | consumed samples: 13619200 | consumed tokens: 27892121600 | elapsed time per iteration (s): 0.77 | learning rate: 1.630E-04 | global batch size: 256 | lm loss: 2.093000E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.989 | TFLOPs: 20.14 | 31: iteration 53210/ 173500 | consumed samples: 13621760 | consumed tokens: 27897364480 | elapsed time per iteration (s): 0.77 | learning rate: 1.630E-04 | global batch size: 256 | lm loss: 2.055586E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.238 | TFLOPs: 20.22 | 31: iteration 53220/ 173500 | consumed samples: 13624320 | consumed tokens: 27902607360 | elapsed time per iteration (s): 0.82 | learning rate: 1.630E-04 | global batch size: 256 | lm loss: 2.032993E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.541 | TFLOPs: 18.97 | 31: iteration 53230/ 173500 | consumed samples: 13626880 | consumed tokens: 27907850240 | elapsed time per iteration (s): 0.80 | learning rate: 
1.629E-04 | global batch size: 256 | lm loss: 2.075516E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.114 | TFLOPs: 19.25 | 31: iteration 53240/ 173500 | consumed samples: 13629440 | consumed tokens: 27913093120 | elapsed time per iteration (s): 0.76 | learning rate: 1.629E-04 | global batch size: 256 | lm loss: 2.051241E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.385 | TFLOPs: 20.41 | 31: iteration 53250/ 173500 | consumed samples: 13632000 | consumed tokens: 27918336000 | elapsed time per iteration (s): 0.77 | learning rate: 1.629E-04 | global batch size: 256 | lm loss: 2.058109E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.079 | TFLOPs: 20.15 | 31: iteration 53260/ 173500 | consumed samples: 13634560 | consumed tokens: 27923578880 | elapsed time per iteration (s): 0.72 | learning rate: 1.629E-04 | global batch size: 256 | lm loss: 2.039156E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.682 | TFLOPs: 21.40 | 31: iteration 53270/ 173500 | consumed samples: 13637120 | consumed tokens: 27928821760 | elapsed time per iteration (s): 2.32 | learning rate: 1.629E-04 | global batch size: 256 | lm loss: 2.046004E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 110.379 | TFLOPs: 6.68 | 31: iteration 53280/ 173500 | consumed samples: 13639680 | consumed tokens: 27934064640 | elapsed time per iteration (s): 0.75 | learning rate: 1.629E-04 | global batch size: 256 | lm loss: 2.076882E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.442 | TFLOPs: 20.54 | 31: iteration 53290/ 173500 | consumed 
samples: 13642240 | consumed tokens: 27939307520 | elapsed time per iteration (s): 0.73 | learning rate: 1.629E-04 | global batch size: 256 | lm loss: 2.065819E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.376 | TFLOPs: 21.14 | 31: iteration 53300/ 173500 | consumed samples: 13644800 | consumed tokens: 27944550400 | elapsed time per iteration (s): 0.72 | learning rate: 1.629E-04 | global batch size: 256 | lm loss: 2.052310E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.949 | TFLOPs: 21.53 | 31: iteration 53310/ 173500 | consumed samples: 13647360 | consumed tokens: 27949793280 | elapsed time per iteration (s): 0.72 | learning rate: 1.628E-04 | global batch size: 256 | lm loss: 2.059754E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.886 | TFLOPs: 21.59 | 31: iteration 53320/ 173500 | consumed samples: 13649920 | consumed tokens: 27955036160 | elapsed time per iteration (s): 0.77 | learning rate: 1.628E-04 | global batch size: 256 | lm loss: 2.046390E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.840 | TFLOPs: 20.20 | 31: iteration 53330/ 173500 | consumed samples: 13652480 | consumed tokens: 27960279040 | elapsed time per iteration (s): 0.72 | learning rate: 1.628E-04 | global batch size: 256 | lm loss: 2.084907E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.555 | TFLOPs: 21.39 | 31: iteration 53340/ 173500 | consumed samples: 13655040 | consumed tokens: 27965521920 | elapsed time per iteration (s): 0.77 | learning rate: 1.628E-04 | global batch size: 256 | lm loss: 2.093112E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | samples per second: 330.856 | TFLOPs: 20.02 | 31: iteration 53350/ 173500 | consumed samples: 13657600 | consumed tokens: 27970764800 | elapsed time per iteration (s): 0.79 | learning rate: 1.628E-04 | global batch size: 256 | lm loss: 2.064187E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.539 | TFLOPs: 19.63 | 31: iteration 53360/ 173500 | consumed samples: 13660160 | consumed tokens: 27976007680 | elapsed time per iteration (s): 0.83 | learning rate: 1.628E-04 | global batch size: 256 | lm loss: 2.048256E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.431 | TFLOPs: 18.72 | 31: iteration 53370/ 173500 | consumed samples: 13662720 | consumed tokens: 27981250560 | elapsed time per iteration (s): 0.86 | learning rate: 1.628E-04 | global batch size: 256 | lm loss: 2.071971E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.244 | TFLOPs: 18.10 | 31: iteration 53380/ 173500 | consumed samples: 13665280 | consumed tokens: 27986493440 | elapsed time per iteration (s): 0.77 | learning rate: 1.627E-04 | global batch size: 256 | lm loss: 2.071353E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.055 | TFLOPs: 20.15 | 31: iteration 53390/ 173500 | consumed samples: 13667840 | consumed tokens: 27991736320 | elapsed time per iteration (s): 0.83 | learning rate: 1.627E-04 | global batch size: 256 | lm loss: 2.035900E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.502 | TFLOPs: 18.72 | 31: iteration 53400/ 173500 | consumed samples: 13670400 | consumed tokens: 27996979200 | elapsed time per iteration (s): 0.79 | learning rate: 1.627E-04 | global batch size: 
256 | lm loss: 2.056152E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.901 | TFLOPs: 19.60 | 31: iteration 53410/ 173500 | consumed samples: 13672960 | consumed tokens: 28002222080 | elapsed time per iteration (s): 0.81 | learning rate: 1.627E-04 | global batch size: 256 | lm loss: 2.075227E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.237 | TFLOPs: 19.19 | 31: iteration 53420/ 173500 | consumed samples: 13675520 | consumed tokens: 28007464960 | elapsed time per iteration (s): 0.84 | learning rate: 1.627E-04 | global batch size: 256 | lm loss: 2.073146E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.540 | TFLOPs: 18.48 | 31: iteration 53430/ 173500 | consumed samples: 13678080 | consumed tokens: 28012707840 | elapsed time per iteration (s): 0.81 | learning rate: 1.627E-04 | global batch size: 256 | lm loss: 2.045878E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.800 | TFLOPs: 19.04 | 31: iteration 53440/ 173500 | consumed samples: 13680640 | consumed tokens: 28017950720 | elapsed time per iteration (s): 0.78 | learning rate: 1.627E-04 | global batch size: 256 | lm loss: 2.066190E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.314 | TFLOPs: 19.98 | 31: iteration 53450/ 173500 | consumed samples: 13683200 | consumed tokens: 28023193600 | elapsed time per iteration (s): 0.79 | learning rate: 1.627E-04 | global batch size: 256 | lm loss: 2.065499E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.385 | TFLOPs: 19.56 | 31: iteration 53460/ 173500 | consumed samples: 13685760 | consumed 
tokens: 28028436480 | elapsed time per iteration (s): 0.78 | learning rate: 1.626E-04 | global batch size: 256 | lm loss: 2.039902E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.644 | TFLOPs: 19.88 | 31: iteration 53470/ 173500 | consumed samples: 13688320 | consumed tokens: 28033679360 | elapsed time per iteration (s): 0.78 | learning rate: 1.626E-04 | global batch size: 256 | lm loss: 2.066447E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.739 | TFLOPs: 19.77 | 31: iteration 53480/ 173500 | consumed samples: 13690880 | consumed tokens: 28038922240 | elapsed time per iteration (s): 0.81 | learning rate: 1.626E-04 | global batch size: 256 | lm loss: 2.059631E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.575 | TFLOPs: 19.03 | 31: iteration 53490/ 173500 | consumed samples: 13693440 | consumed tokens: 28044165120 | elapsed time per iteration (s): 0.80 | learning rate: 1.626E-04 | global batch size: 256 | lm loss: 2.072598E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.625 | TFLOPs: 19.34 | 31: iteration 53500/ 173500 | consumed samples: 13696000 | consumed tokens: 28049408000 | elapsed time per iteration (s): 0.80 | learning rate: 1.626E-04 | global batch size: 256 | lm loss: 2.052954E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.600 | TFLOPs: 19.33 | 31: iteration 53510/ 173500 | consumed samples: 13698560 | consumed tokens: 28054650880 | elapsed time per iteration (s): 0.81 | learning rate: 1.626E-04 | global batch size: 256 | lm loss: 2.058490E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 314.953 | TFLOPs: 19.05 | 31: iteration 53520/ 173500 | consumed samples: 13701120 | consumed tokens: 28059893760 | elapsed time per iteration (s): 0.79 | learning rate: 1.626E-04 | global batch size: 256 | lm loss: 2.066147E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.771 | TFLOPs: 19.59 | 31: iteration 53530/ 173500 | consumed samples: 13703680 | consumed tokens: 28065136640 | elapsed time per iteration (s): 0.80 | learning rate: 1.625E-04 | global batch size: 256 | lm loss: 2.070453E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.379 | TFLOPs: 19.26 | 31: iteration 53540/ 173500 | consumed samples: 13706240 | consumed tokens: 28070379520 | elapsed time per iteration (s): 0.82 | learning rate: 1.625E-04 | global batch size: 256 | lm loss: 2.088809E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.898 | TFLOPs: 18.93 | 31: iteration 53550/ 173500 | consumed samples: 13708800 | consumed tokens: 28075622400 | elapsed time per iteration (s): 0.85 | learning rate: 1.625E-04 | global batch size: 256 | lm loss: 2.076266E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.106 | TFLOPs: 18.16 | 31: iteration 53560/ 173500 | consumed samples: 13711360 | consumed tokens: 28080865280 | elapsed time per iteration (s): 0.79 | learning rate: 1.625E-04 | global batch size: 256 | lm loss: 2.046554E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.047 | TFLOPs: 19.48 | 31: iteration 53570/ 173500 | consumed samples: 13713920 | consumed tokens: 28086108160 | elapsed time per iteration (s): 0.84 | learning rate: 1.625E-04 | global batch size: 256 | lm loss: 2.035397E+00 | 
grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.103 | TFLOPs: 18.46 | 31: iteration 53580/ 173500 | consumed samples: 13716480 | consumed tokens: 28091351040 | elapsed time per iteration (s): 0.79 | learning rate: 1.625E-04 | global batch size: 256 | lm loss: 2.058931E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.884 | TFLOPs: 19.72 | 31: iteration 53590/ 173500 | consumed samples: 13719040 | consumed tokens: 28096593920 | elapsed time per iteration (s): 0.80 | learning rate: 1.625E-04 | global batch size: 256 | lm loss: 2.030352E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.646 | TFLOPs: 19.46 | 31: iteration 53600/ 173500 | consumed samples: 13721600 | consumed tokens: 28101836800 | elapsed time per iteration (s): 0.78 | learning rate: 1.625E-04 | global batch size: 256 | lm loss: 2.068851E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.205 | TFLOPs: 19.73 | 31: iteration 53610/ 173500 | consumed samples: 13724160 | consumed tokens: 28107079680 | elapsed time per iteration (s): 0.79 | learning rate: 1.624E-04 | global batch size: 256 | lm loss: 2.061437E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.141 | TFLOPs: 19.61 | 31: iteration 53620/ 173500 | consumed samples: 13726720 | consumed tokens: 28112322560 | elapsed time per iteration (s): 0.80 | learning rate: 1.624E-04 | global batch size: 256 | lm loss: 2.051506E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.618 | TFLOPs: 19.46 | 31: iteration 53630/ 173500 | consumed samples: 13729280 | consumed tokens: 28117565440 | elapsed 
time per iteration (s): 0.85 | learning rate: 1.624E-04 | global batch size: 256 | lm loss: 2.043073E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.424 | TFLOPs: 18.30 | 31: iteration 53640/ 173500 | consumed samples: 13731840 | consumed tokens: 28122808320 | elapsed time per iteration (s): 0.91 | learning rate: 1.624E-04 | global batch size: 256 | lm loss: 2.055410E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 280.887 | TFLOPs: 16.99 | 31: iteration 53650/ 173500 | consumed samples: 13734400 | consumed tokens: 28128051200 | elapsed time per iteration (s): 0.82 | learning rate: 1.624E-04 | global batch size: 256 | lm loss: 2.077294E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.890 | TFLOPs: 18.99 | 31: iteration 53660/ 173500 | consumed samples: 13736960 | consumed tokens: 28133294080 | elapsed time per iteration (s): 0.81 | learning rate: 1.624E-04 | global batch size: 256 | lm loss: 2.070015E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.323 | TFLOPs: 19.02 | 31: iteration 53670/ 173500 | consumed samples: 13739520 | consumed tokens: 28138536960 | elapsed time per iteration (s): 0.81 | learning rate: 1.624E-04 | global batch size: 256 | lm loss: 2.081918E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.572 | TFLOPs: 19.21 | 31: iteration 53680/ 173500 | consumed samples: 13742080 | consumed tokens: 28143779840 | elapsed time per iteration (s): 0.81 | learning rate: 1.623E-04 | global batch size: 256 | lm loss: 2.053113E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.214 | TFLOPs: 
19.13 | 31: iteration 53690/ 173500 | consumed samples: 13744640 | consumed tokens: 28149022720 | elapsed time per iteration (s): 0.82 | learning rate: 1.623E-04 | global batch size: 256 | lm loss: 2.078298E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.934 | TFLOPs: 18.99 | 31: iteration 53700/ 173500 | consumed samples: 13747200 | consumed tokens: 28154265600 | elapsed time per iteration (s): 0.80 | learning rate: 1.623E-04 | global batch size: 256 | lm loss: 2.089110E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.079 | TFLOPs: 19.30 | 31: iteration 53710/ 173500 | consumed samples: 13749760 | consumed tokens: 28159508480 | elapsed time per iteration (s): 0.82 | learning rate: 1.623E-04 | global batch size: 256 | lm loss: 2.094679E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.265 | TFLOPs: 18.89 | 31: iteration 53720/ 173500 | consumed samples: 13752320 | consumed tokens: 28164751360 | elapsed time per iteration (s): 0.83 | learning rate: 1.623E-04 | global batch size: 256 | lm loss: 2.051945E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.286 | TFLOPs: 18.65 | 31: iteration 53730/ 173500 | consumed samples: 13754880 | consumed tokens: 28169994240 | elapsed time per iteration (s): 0.86 | learning rate: 1.623E-04 | global batch size: 256 | lm loss: 2.066158E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.092 | TFLOPs: 17.97 | 31: iteration 53740/ 173500 | consumed samples: 13757440 | consumed tokens: 28175237120 | elapsed time per iteration (s): 0.85 | learning rate: 1.623E-04 | global batch size: 256 | lm loss: 2.064477E+00 | grad norm: 0.130 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.887 | TFLOPs: 18.32 | 31: iteration 53750/ 173500 | consumed samples: 13760000 | consumed tokens: 28180480000 | elapsed time per iteration (s): 0.82 | learning rate: 1.623E-04 | global batch size: 256 | lm loss: 2.059137E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.067 | TFLOPs: 18.82 | 31: iteration 53760/ 173500 | consumed samples: 13762560 | consumed tokens: 28185722880 | elapsed time per iteration (s): 0.82 | learning rate: 1.622E-04 | global batch size: 256 | lm loss: 2.045665E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.633 | TFLOPs: 18.97 | 31: iteration 53770/ 173500 | consumed samples: 13765120 | consumed tokens: 28190965760 | elapsed time per iteration (s): 0.81 | learning rate: 1.622E-04 | global batch size: 256 | lm loss: 2.028978E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.778 | TFLOPs: 19.16 | 31: iteration 53780/ 173500 | consumed samples: 13767680 | consumed tokens: 28196208640 | elapsed time per iteration (s): 0.82 | learning rate: 1.622E-04 | global batch size: 256 | lm loss: 2.056298E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.663 | TFLOPs: 18.85 | 31: iteration 53790/ 173500 | consumed samples: 13770240 | consumed tokens: 28201451520 | elapsed time per iteration (s): 0.79 | learning rate: 1.622E-04 | global batch size: 256 | lm loss: 2.035199E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.671 | TFLOPs: 19.58 | 31: iteration 53800/ 173500 | consumed samples: 13772800 | consumed tokens: 28206694400 | elapsed time per iteration (s): 0.76 | 
learning rate: 1.622E-04 | global batch size: 256 | lm loss: 2.056540E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.483 | TFLOPs: 20.30 | 31: iteration 53810/ 173500 | consumed samples: 13775360 | consumed tokens: 28211937280 | elapsed time per iteration (s): 0.81 | learning rate: 1.622E-04 | global batch size: 256 | lm loss: 2.044639E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.411 | TFLOPs: 19.20 | 31: iteration 53820/ 173500 | consumed samples: 13777920 | consumed tokens: 28217180160 | elapsed time per iteration (s): 0.78 | learning rate: 1.622E-04 | global batch size: 256 | lm loss: 2.067008E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.747 | TFLOPs: 19.83 | 31: iteration 53830/ 173500 | consumed samples: 13780480 | consumed tokens: 28222423040 | elapsed time per iteration (s): 0.81 | learning rate: 1.621E-04 | global batch size: 256 | lm loss: 2.054813E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.884 | TFLOPs: 19.11 | 31: iteration 53840/ 173500 | consumed samples: 13783040 | consumed tokens: 28227665920 | elapsed time per iteration (s): 0.80 | learning rate: 1.621E-04 | global batch size: 256 | lm loss: 2.048882E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.247 | TFLOPs: 19.37 | 31: iteration 53850/ 173500 | consumed samples: 13785600 | consumed tokens: 28232908800 | elapsed time per iteration (s): 0.79 | learning rate: 1.621E-04 | global batch size: 256 | lm loss: 2.025143E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.299 | TFLOPs: 19.56 | 31: iteration 53860/ 
173500 | consumed samples: 13788160 | consumed tokens: 28238151680 | elapsed time per iteration (s): 0.84 | learning rate: 1.621E-04 | global batch size: 256 | lm loss: 2.056044E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.546 | TFLOPs: 18.55 | 31: iteration 53870/ 173500 | consumed samples: 13790720 | consumed tokens: 28243394560 | elapsed time per iteration (s): 0.84 | learning rate: 1.621E-04 | global batch size: 256 | lm loss: 2.021243E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.857 | TFLOPs: 18.50 | 31: iteration 53880/ 173500 | consumed samples: 13793280 | consumed tokens: 28248637440 | elapsed time per iteration (s): 0.79 | learning rate: 1.621E-04 | global batch size: 256 | lm loss: 2.057300E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.221 | TFLOPs: 19.68 | 31: iteration 53890/ 173500 | consumed samples: 13795840 | consumed tokens: 28253880320 | elapsed time per iteration (s): 0.80 | learning rate: 1.621E-04 | global batch size: 256 | lm loss: 2.064640E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.616 | TFLOPs: 19.28 | 31: iteration 53900/ 173500 | consumed samples: 13798400 | consumed tokens: 28259123200 | elapsed time per iteration (s): 0.83 | learning rate: 1.621E-04 | global batch size: 256 | lm loss: 2.057808E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.098 | TFLOPs: 18.76 | 31: iteration 53910/ 173500 | consumed samples: 13800960 | consumed tokens: 28264366080 | elapsed time per iteration (s): 0.77 | learning rate: 1.620E-04 | global batch size: 256 | lm loss: 2.061037E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 330.427 | TFLOPs: 19.99 | 31: iteration 53920/ 173500 | consumed samples: 13803520 | consumed tokens: 28269608960 | elapsed time per iteration (s): 0.82 | learning rate: 1.620E-04 | global batch size: 256 | lm loss: 2.047579E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.347 | TFLOPs: 18.96 | 31: iteration 53930/ 173500 | consumed samples: 13806080 | consumed tokens: 28274851840 | elapsed time per iteration (s): 0.79 | learning rate: 1.620E-04 | global batch size: 256 | lm loss: 2.042485E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.024 | TFLOPs: 19.48 | 31: iteration 53940/ 173500 | consumed samples: 13808640 | consumed tokens: 28280094720 | elapsed time per iteration (s): 0.79 | learning rate: 1.620E-04 | global batch size: 256 | lm loss: 2.044734E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.022 | TFLOPs: 19.60 | 31: iteration 53950/ 173500 | consumed samples: 13811200 | consumed tokens: 28285337600 | elapsed time per iteration (s): 0.81 | learning rate: 1.620E-04 | global batch size: 256 | lm loss: 2.019357E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.568 | TFLOPs: 19.21 | 31: iteration 53960/ 173500 | consumed samples: 13813760 | consumed tokens: 28290580480 | elapsed time per iteration (s): 0.82 | learning rate: 1.620E-04 | global batch size: 256 | lm loss: 2.075237E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.589 | TFLOPs: 18.97 | 31: iteration 53970/ 173500 | consumed samples: 13816320 | consumed tokens: 28295823360 | elapsed time per iteration (s): 0.82 | learning rate: 
1.620E-04 | global batch size: 256 | lm loss: 2.075816E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.564 | TFLOPs: 18.97 | 31: iteration 53980/ 173500 | consumed samples: 13818880 | consumed tokens: 28301066240 | elapsed time per iteration (s): 0.79 | learning rate: 1.619E-04 | global batch size: 256 | lm loss: 2.063742E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.717 | TFLOPs: 19.71 | 31: iteration 53990/ 173500 | consumed samples: 13821440 | consumed tokens: 28306309120 | elapsed time per iteration (s): 0.79 | learning rate: 1.619E-04 | global batch size: 256 | lm loss: 2.063123E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.116 | TFLOPs: 19.49 | 0: [2022-11-26 06:13:30,554] [INFO] [logging.py:68:log_dist] [Rank 0] step=54000, skipped=0, lr=[0.00016191666237869197, 0.00016191666237869197, 0.00016191666237869197], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 54000/ 173500 | consumed samples: 13824000 | consumed tokens: 28311552000 | elapsed time per iteration (s): 0.78 | learning rate: 1.619E-04 | global batch size: 256 | lm loss: 2.057987E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.766 | TFLOPs: 19.83 | 0: steps: 54000 loss: 2.0473 iter time (s): 0.794 samples/sec: 322.359 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 54000 | lm loss value: 2.019215E+00 | lm loss PPL: 7.532412E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 54000 to checkpoints_1b1long 0: [2022-11-26 06:13:30,804] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint 
global_step54000 is begin to save! 0: [2022-11-26 06:13:30,812] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_01-model_00-model_states.pt... 0: [2022-11-26 06:13:31,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_01-model_00-model_states.pt. 0: [2022-11-26 06:13:31,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_03-model_00-model_states.pt... 0: [2022-11-26 06:13:31,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_03-model_00-model_states.pt. 0: [2022-11-26 06:13:31,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_04-model_00-model_states.pt... 0: [2022-11-26 06:13:31,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_04-model_00-model_states.pt. 0: [2022-11-26 06:13:31,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_05-model_00-model_states.pt... 0: [2022-11-26 06:13:31,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_05-model_00-model_states.pt. 0: [2022-11-26 06:13:31,273] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_06-model_00-model_states.pt... 0: [2022-11-26 06:13:31,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_06-model_00-model_states.pt. 0: [2022-11-26 06:13:31,346] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_07-model_00-model_states.pt... 0: [2022-11-26 06:13:31,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 06:13:31,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_08-model_00-model_states.pt... 0: [2022-11-26 06:13:31,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_08-model_00-model_states.pt. 0: [2022-11-26 06:13:31,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_09-model_00-model_states.pt... 0: [2022-11-26 06:13:31,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_09-model_00-model_states.pt. 0: [2022-11-26 06:13:31,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_10-model_00-model_states.pt... 0: [2022-11-26 06:13:31,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_10-model_00-model_states.pt. 0: [2022-11-26 06:13:31,644] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_11-model_00-model_states.pt... 0: [2022-11-26 06:13:31,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_11-model_00-model_states.pt. 0: [2022-11-26 06:13:31,717] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_12-model_00-model_states.pt... 0: [2022-11-26 06:13:31,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_12-model_00-model_states.pt. 0: [2022-11-26 06:13:31,790] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_13-model_00-model_states.pt... 0: [2022-11-26 06:13:31,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 06:13:31,864] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_14-model_00-model_states.pt... 0: [2022-11-26 06:13:31,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_14-model_00-model_states.pt. 0: [2022-11-26 06:13:31,937] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_15-model_00-model_states.pt... 0: [2022-11-26 06:13:32,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_15-model_00-model_states.pt. 0: [2022-11-26 06:13:32,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_16-model_00-model_states.pt... 0: [2022-11-26 06:13:32,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_16-model_00-model_states.pt. 0: [2022-11-26 06:13:32,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_17-model_00-model_states.pt... 0: [2022-11-26 06:13:32,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_17-model_00-model_states.pt. 0: [2022-11-26 06:13:32,158] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_18-model_00-model_states.pt... 0: [2022-11-26 06:13:32,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_18-model_00-model_states.pt. 0: [2022-11-26 06:13:32,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_19-model_00-model_states.pt... 0: [2022-11-26 06:13:32,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 06:13:32,307] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_20-model_00-model_states.pt... 0: [2022-11-26 06:13:32,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_20-model_00-model_states.pt. 0: [2022-11-26 06:13:32,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_21-model_00-model_states.pt... 0: [2022-11-26 06:13:32,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_21-model_00-model_states.pt. 0: [2022-11-26 06:13:32,452] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_22-model_00-model_states.pt... 0: [2022-11-26 06:13:32,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_22-model_00-model_states.pt. 0: [2022-11-26 06:13:32,523] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_23-model_00-model_states.pt... 0: [2022-11-26 06:13:32,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_23-model_00-model_states.pt. 0: [2022-11-26 06:13:32,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_24-model_00-model_states.pt... 0: [2022-11-26 06:13:32,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_24-model_00-model_states.pt. 0: [2022-11-26 06:13:32,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_25-model_00-model_states.pt... 0: [2022-11-26 06:13:32,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 06:13:32,744] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_26-model_00-model_states.pt... 0: [2022-11-26 06:13:32,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_26-model_00-model_states.pt. 0: [2022-11-26 06:13:32,818] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_27-model_00-model_states.pt... 0: [2022-11-26 06:13:32,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_27-model_00-model_states.pt. 0: [2022-11-26 06:13:32,890] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_28-model_00-model_states.pt... 0: [2022-11-26 06:13:32,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_28-model_00-model_states.pt. 0: [2022-11-26 06:13:32,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/layer_30-model_00-model_states.pt... 0: [2022-11-26 06:13:32,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/layer_30-model_00-model_states.pt. 0: [2022-11-26 06:13:32,965] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step54000/mp_rank_00_model_states.pt 0: [2022-11-26 06:13:32,965] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/mp_rank_00_model_states.pt... 0: [2022-11-26 06:13:32,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/mp_rank_00_model_states.pt. 0: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
3: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
4: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
16: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
12: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 
25: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 
11: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 
31: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 
18: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 
19: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 
7: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 
8: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
16: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
12: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 
11: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 
29: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 
30: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 
26: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 
9: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 
20: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 
29: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 
27: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 
11: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 
13: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:13:33,039] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 
20: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:13:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:13:33,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:13:33,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 06:13:33,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 23: [2022-11-26 06:13:33,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:13:33,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 06:13:33,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 13: [2022-11-26 06:13:33,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:13:33,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 06:13:33,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 5: [2022-11-26 06:13:33,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 06:13:33,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 06:13:33,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 1: [2022-11-26 06:13:33,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:13:33,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 06:13:33,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 10: [2022-11-26 06:13:33,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:13:33,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 06:13:33,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 19: [2022-11-26 06:13:33,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:13:33,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 06:13:33,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 18: [2022-11-26 06:13:33,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
18: [2022-11-26 06:13:33,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 06:13:33,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 31: [2022-11-26 06:13:33,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:13:33,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:13:33,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 06:13:33,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 29: [2022-11-26 06:13:33,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:13:33,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 06:13:33,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 27: [2022-11-26 06:13:33,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:13:33,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
27: [2022-11-26 06:13:33,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 16: [2022-11-26 06:13:33,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 06:13:33,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 27: [2022-11-26 06:13:33,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 1: [2022-11-26 06:13:33,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:13:33,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 06:13:33,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 10: [2022-11-26 06:13:33,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:13:33,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:13:33,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 06:13:33,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
7: [2022-11-26 06:13:33,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 06:13:33,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 14: [2022-11-26 06:13:33,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:13:33,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 06:13:33,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 30: [2022-11-26 06:13:33,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:13:33,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 3: [2022-11-26 06:13:33,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:13:33,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 3: [2022-11-26 06:13:33,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 06:13:33,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 11: [2022-11-26 06:13:33,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-26 06:13:33,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 06:13:33,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 29: [2022-11-26 06:13:33,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:13:33,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 06:13:33,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 25: [2022-11-26 06:13:33,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:13:33,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 19: [2022-11-26 06:13:33,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:13:33,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:13:33,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 23: [2022-11-26 06:13:33,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
19: [2022-11-26 06:13:33,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 7: [2022-11-26 06:13:33,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 23: [2022-11-26 06:13:33,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 19: [2022-11-26 06:13:33,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 25: [2022-11-26 06:13:33,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 23: [2022-11-26 06:13:33,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 27: [2022-11-26 06:13:33,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 28: [2022-11-26 06:13:33,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:13:33,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 06:13:33,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 18: [2022-11-26 06:13:33,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
18: [2022-11-26 06:13:33,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 06:13:33,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 26: [2022-11-26 06:13:33,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:13:33,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 06:13:33,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 13: [2022-11-26 06:13:33,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:13:33,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 06:13:33,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 21: [2022-11-26 06:13:33,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:13:33,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 06:13:33,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 24: [2022-11-26 06:13:33,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
24: [2022-11-26 06:13:33,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 06:13:33,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 11: [2022-11-26 06:13:33,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:13:33,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:13:33,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 6: [2022-11-26 06:13:33,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:13:33,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 14: [2022-11-26 06:13:33,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 6: [2022-11-26 06:13:33,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 11: [2022-11-26 06:13:33,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 6: [2022-11-26 06:13:33,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 31: [2022-11-26 06:13:33,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-26 06:13:33,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 6: [2022-11-26 06:13:33,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:13:33,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 6: [2022-11-26 06:13:33,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 06:13:33,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 23: [2022-11-26 06:13:33,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:13:33,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 06:13:33,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 16: [2022-11-26 06:13:33,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:13:33,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 06:13:33,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 11: [2022-11-26 06:13:33,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-26 06:13:33,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 06:13:33,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 28: [2022-11-26 06:13:33,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 06:13:33,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 28: [2022-11-26 06:13:33,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 06:13:33,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 06:13:33,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 21: [2022-11-26 06:13:33,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:13:33,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:13:33,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 06:13:33,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 06:13:33,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
21: [2022-11-26 06:13:33,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 5: [2022-11-26 06:13:33,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:13:33,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 06:13:33,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 3: [2022-11-26 06:13:33,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:13:33,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 06:13:33,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 5: [2022-11-26 06:13:33,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:13:33,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 06:13:33,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 30: [2022-11-26 06:13:33,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-26 06:13:33,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 06:13:33,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 0: [2022-11-26 06:13:33,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:13:33,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 13: [2022-11-26 06:13:33,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:13:33,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 0: [2022-11-26 06:13:33,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 13: [2022-11-26 06:13:33,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 31: [2022-11-26 06:13:33,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:13:33,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 06:13:33,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 8: [2022-11-26 06:13:33,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
10: [2022-11-26 06:13:33,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:13:33,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 06:13:33,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 10: [2022-11-26 06:13:33,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 06:13:33,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 8: [2022-11-26 06:13:33,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:13:33,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 06:13:33,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 26: [2022-11-26 06:13:33,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:13:33,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 06:13:33,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 10: [2022-11-26 06:13:33,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-26 06:13:33,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 16: [2022-11-26 06:13:33,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:13:33,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 16: [2022-11-26 06:13:33,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 06:13:33,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 19: [2022-11-26 06:13:33,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:13:33,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:13:33,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 06:13:33,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 19: [2022-11-26 06:13:33,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 06:13:33,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 7: [2022-11-26 06:13:33,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
18: [2022-11-26 06:13:33,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:13:33,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:13:33,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 18: [2022-11-26 06:13:33,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 14: [2022-11-26 06:13:33,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 18: [2022-11-26 06:13:33,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 7: [2022-11-26 06:13:33,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 25: [2022-11-26 06:13:33,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:13:33,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 25: [2022-11-26 06:13:33,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
25: [2022-11-26 06:13:33,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 06:13:33,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 06:13:33,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 24: [2022-11-26 06:13:33,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:13:33,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 24: [2022-11-26 06:13:33,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 06:13:33,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 27: [2022-11-26 06:13:33,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:13:33,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 28: [2022-11-26 06:13:33,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
27: [2022-11-26 06:13:33,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 24: [2022-11-26 06:13:33,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 27: [2022-11-26 06:13:33,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 28: [2022-11-26 06:13:33,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 24: [2022-11-26 06:13:33,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 28: [2022-11-26 06:13:33,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 0: [2022-11-26 06:13:33,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:13:33,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 06:13:33,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 4: [2022-11-26 06:13:33,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:13:33,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:13:33,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 06:13:33,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 06:13:33,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 06:13:33,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 06:13:33,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 1: [2022-11-26 06:13:33,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:13:33,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:13:33,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 4: [2022-11-26 06:13:33,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 1: [2022-11-26 06:13:33,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 06:13:33,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 22: [2022-11-26 06:13:33,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:13:33,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
1: [2022-11-26 06:13:33,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 22: [2022-11-26 06:13:33,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 06:13:33,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 30: [2022-11-26 06:13:33,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:13:33,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 13: [2022-11-26 06:13:33,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:13:33,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 06:13:33,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 29: [2022-11-26 06:13:33,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:13:33,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:13:33,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
29: [2022-11-26 06:13:33,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 22: [2022-11-26 06:13:33,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 6: [2022-11-26 06:13:33,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:13:33,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 22: [2022-11-26 06:13:33,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 6: [2022-11-26 06:13:33,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 06:13:33,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 29: [2022-11-26 06:13:33,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:13:33,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 06:13:33,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 14: [2022-11-26 06:13:33,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-26 06:13:33,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 06:13:33,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 0: [2022-11-26 06:13:33,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 06:13:33,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 26: [2022-11-26 06:13:33,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:13:33,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 06:13:33,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 9: [2022-11-26 06:13:33,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:13:33,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:13:33,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-26 06:13:33,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 06:13:33,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 06:13:33,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 06:13:33,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 9: [2022-11-26 06:13:33,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 9: [2022-11-26 06:13:33,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 5: [2022-11-26 06:13:33,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:13:33,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 06:13:33,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 17: [2022-11-26 06:13:33,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:13:33,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:13:33,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 
17: [2022-11-26 06:13:33,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:13:33,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 06:13:33,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 06:13:33,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 06:13:33,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 06:13:33,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 17: [2022-11-26 06:13:33,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 17: [2022-11-26 06:13:33,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 17: [2022-11-26 06:13:33,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 0: [2022-11-26 06:13:33,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:13:33,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 30: [2022-11-26 06:13:33,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 
0: [2022-11-26 06:13:33,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 30: [2022-11-26 06:13:33,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 06:13:33,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 3: [2022-11-26 06:13:33,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:13:33,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 8: [2022-11-26 06:13:33,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:13:33,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 8: [2022-11-26 06:13:33,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 06:13:33,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 20: [2022-11-26 06:13:33,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:13:33,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:13:33,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 
20: [2022-11-26 06:13:33,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:13:33,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 06:13:33,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 06:13:33,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 06:13:33,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 20: [2022-11-26 06:13:33,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 20: [2022-11-26 06:13:33,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 06:13:33,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 22: [2022-11-26 06:13:33,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:13:33,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 22: [2022-11-26 06:13:33,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 06:13:33,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
23: [2022-11-26 06:13:33,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:13:33,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 06:13:33,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 21: [2022-11-26 06:13:33,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:13:33,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 06:13:33,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 7: [2022-11-26 06:13:33,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:13:33,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 06:13:33,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 2: [2022-11-26 06:13:33,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:13:33,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 06:13:33,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 06:13:33,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 2: [2022-11-26 06:13:33,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:13:33,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 06:13:33,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 06:13:33,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 3: [2022-11-26 06:13:33,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:13:33,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 3: [2022-11-26 06:13:33,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 06:13:33,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 31: [2022-11-26 06:13:33,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 
31: [2022-11-26 06:13:33,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 06:13:33,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 16: [2022-11-26 06:13:33,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:13:33,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 06:13:33,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 25: [2022-11-26 06:13:33,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:13:33,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:13:33,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 4: [2022-11-26 06:13:33,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 06:13:33,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 25: [2022-11-26 06:13:33,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 11: [2022-11-26 06:13:33,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-26 06:13:33,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 06:13:33,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 2: [2022-11-26 06:13:33,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:13:33,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 06:13:33,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 12: [2022-11-26 06:13:33,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:13:33,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:13:33,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:13:33,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 06:13:33,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 06:13:33,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 06:13:33,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 06:13:33,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 06:13:33,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 12: [2022-11-26 06:13:33,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 12: [2022-11-26 06:13:33,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 12: [2022-11-26 06:13:33,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 27: [2022-11-26 06:13:33,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:13:33,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 06:13:33,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 28: [2022-11-26 06:13:33,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
8: [2022-11-26 06:13:33,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:13:33,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 06:13:33,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 15: [2022-11-26 06:13:33,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:13:33,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:13:33,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:13:33,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-26 06:13:33,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 06:13:33,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 06:13:33,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 06:13:33,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 06:13:33,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 15: [2022-11-26 06:13:33,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 15: [2022-11-26 06:13:33,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 15: [2022-11-26 06:13:33,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 28: [2022-11-26 06:13:33,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 06:13:33,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 22: [2022-11-26 06:13:33,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
22: [2022-11-26 06:13:33,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 06:13:33,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 9: [2022-11-26 06:13:33,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:13:33,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 06:13:33,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 5: [2022-11-26 06:13:33,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:13:33,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 06:13:33,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 1: [2022-11-26 06:13:33,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:13:33,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 06:13:33,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 24: [2022-11-26 06:13:33,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
24: [2022-11-26 06:13:33,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 06:13:33,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 18: [2022-11-26 06:13:33,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:13:33,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 06:13:33,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 19: [2022-11-26 06:13:33,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:13:33,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 06:13:33,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 20: [2022-11-26 06:13:33,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:13:33,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 06:13:33,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 10: [2022-11-26 06:13:33,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 06:13:33,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 06:13:33,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 29: [2022-11-26 06:13:33,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:13:33,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 06:13:33,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 23: [2022-11-26 06:13:33,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:13:33,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 06:13:33,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 17: [2022-11-26 06:13:33,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:13:33,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 06:13:33,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 15: [2022-11-26 06:13:33,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 06:13:33,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 06:13:33,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 30: [2022-11-26 06:13:33,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:13:33,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 06:13:33,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 16: [2022-11-26 06:13:33,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:13:33,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 06:13:33,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 12: [2022-11-26 06:13:33,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:13:33,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 06:13:33,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 26: [2022-11-26 06:13:33,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
26: [2022-11-26 06:13:33,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 06:13:33,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 7: [2022-11-26 06:13:33,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:13:33,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 06:13:33,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 6: [2022-11-26 06:13:33,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:13:33,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:13:33,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 8: [2022-11-26 06:13:33,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 6: [2022-11-26 06:13:33,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 8: [2022-11-26 06:13:33,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 3: [2022-11-26 06:13:33,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 06:13:33,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 06:13:33,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 0: [2022-11-26 06:13:33,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:13:33,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 06:13:33,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 27: [2022-11-26 06:13:33,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:13:33,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 06:13:33,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 14: [2022-11-26 06:13:33,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:13:33,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 06:13:33,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 25: [2022-11-26 06:13:33,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 
25: [2022-11-26 06:13:33,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 06:13:33,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 28: [2022-11-26 06:13:33,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 06:13:33,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 06:13:33,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 2: [2022-11-26 06:13:33,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:13:33,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 06:13:33,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 13: [2022-11-26 06:13:33,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:13:33,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 06:13:33,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 31: [2022-11-26 06:13:33,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-26 06:13:33,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 06:13:33,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 4: [2022-11-26 06:13:33,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:13:33,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:13:33,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 18: [2022-11-26 06:13:33,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 06:13:33,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 4: [2022-11-26 06:13:33,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 11: [2022-11-26 06:13:33,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:13:33,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 06:13:33,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 5: [2022-11-26 06:13:33,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 06:13:33,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 06:13:33,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 1: [2022-11-26 06:13:33,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:13:33,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 06:13:33,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 24: [2022-11-26 06:13:33,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:13:33,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:13:33,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 9: [2022-11-26 06:13:33,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 06:13:33,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 24: [2022-11-26 06:13:33,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 29: [2022-11-26 06:13:33,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-26 06:13:33,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 06:13:33,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 19: [2022-11-26 06:13:33,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:13:33,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 06:13:33,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 22: [2022-11-26 06:13:33,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:13:33,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 06:13:33,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 23: [2022-11-26 06:13:33,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:13:33,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 06:13:33,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 20: [2022-11-26 06:13:33,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
20: [2022-11-26 06:13:33,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 06:13:33,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 17: [2022-11-26 06:13:33,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:13:33,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 06:13:33,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 15: [2022-11-26 06:13:33,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:13:33,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 06:13:33,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 10: [2022-11-26 06:13:33,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:13:33,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 06:13:33,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 21: [2022-11-26 06:13:33,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-26 06:13:33,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 06:13:33,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 26: [2022-11-26 06:13:33,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:13:33,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 06:13:33,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 6: [2022-11-26 06:13:33,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:13:33,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 06:13:33,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 12: [2022-11-26 06:13:33,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:13:33,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 06:13:33,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 21: [2022-11-26 06:13:33,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
21: [2022-11-26 06:13:33,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 06:13:33,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 25: [2022-11-26 06:13:33,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:13:33,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 06:13:33,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 13: [2022-11-26 06:13:33,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:13:33,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 06:13:33,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 0: [2022-11-26 06:13:33,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:13:33,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 06:13:33,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 28: [2022-11-26 06:13:33,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
14: [2022-11-26 06:13:33,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:13:33,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 06:13:33,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 8: [2022-11-26 06:13:33,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:13:33,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:13:33,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 06:13:33,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 3: [2022-11-26 06:13:33,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 06:13:33,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 31: [2022-11-26 06:13:33,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:13:33,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 06:13:33,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
7: [2022-11-26 06:13:33,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:13:33,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:13:33,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 06:13:33,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 16: [2022-11-26 06:13:33,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 2: [2022-11-26 06:13:33,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:13:33,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 2: [2022-11-26 06:13:33,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 06:13:33,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 28: [2022-11-26 06:13:33,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 06:13:33,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 11: [2022-11-26 06:13:33,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 06:13:33,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 06:13:33,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 24: [2022-11-26 06:13:33,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:13:33,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 30: [2022-11-26 06:13:33,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:13:33,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 4: [2022-11-26 06:13:33,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:13:33,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 30: [2022-11-26 06:13:33,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 4: [2022-11-26 06:13:33,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 30: [2022-11-26 06:13:33,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 22: [2022-11-26 06:13:33,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-26 06:13:33,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 06:13:33,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 5: [2022-11-26 06:13:33,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:13:33,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:13:33,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 06:13:33,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 18: [2022-11-26 06:13:33,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:13:33,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 06:13:33,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 18: [2022-11-26 06:13:33,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 06:13:33,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 9: [2022-11-26 06:13:33,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 06:13:33,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 06:13:33,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 1: [2022-11-26 06:13:33,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:13:33,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 06:13:33,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 19: [2022-11-26 06:13:33,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:13:33,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 06:13:33,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 23: [2022-11-26 06:13:33,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:13:33,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 06:13:33,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 20: [2022-11-26 06:13:33,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
20: [2022-11-26 06:13:33,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 06:13:33,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 17: [2022-11-26 06:13:33,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:13:33,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 06:13:33,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 15: [2022-11-26 06:13:33,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:13:33,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 06:13:33,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 10: [2022-11-26 06:13:33,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:13:33,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 06:13:33,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 29: [2022-11-26 06:13:33,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
29: [2022-11-26 06:13:33,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 06:13:33,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 12: [2022-11-26 06:13:33,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:13:33,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 06:13:33,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 26: [2022-11-26 06:13:33,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:13:33,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 06:13:33,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 6: [2022-11-26 06:13:33,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:13:33,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 06:13:33,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 30: [2022-11-26 06:13:33,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
30: [2022-11-26 06:13:33,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 06:13:33,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 14: [2022-11-26 06:13:33,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:13:33,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:13:33,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 14: [2022-11-26 06:13:33,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 06:13:33,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 21: [2022-11-26 06:13:33,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 13: [2022-11-26 06:13:33,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:13:33,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
13: [2022-11-26 06:13:33,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 7: [2022-11-26 06:13:33,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 13: [2022-11-26 06:13:33,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 7: [2022-11-26 06:13:33,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 0: [2022-11-26 06:13:33,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:13:33,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 06:13:33,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 31: [2022-11-26 06:13:33,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:13:33,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 06:13:33,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 25: [2022-11-26 06:13:33,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
25: [2022-11-26 06:13:33,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 06:13:33,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 3: [2022-11-26 06:13:33,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:13:33,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 06:13:33,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 5: [2022-11-26 06:13:33,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:13:33,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 06:13:33,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 20: [2022-11-26 06:13:33,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:13:33,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 06:13:33,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 16: [2022-11-26 06:13:33,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
16: [2022-11-26 06:13:33,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 06:13:33,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 29: [2022-11-26 06:13:33,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:13:33,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 06:13:33,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 8: [2022-11-26 06:13:33,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:13:33,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:13:33,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 06:13:33,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 24: [2022-11-26 06:13:33,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 06:13:33,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 18: [2022-11-26 06:13:33,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
18: [2022-11-26 06:13:33,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 06:13:33,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 27: [2022-11-26 06:13:33,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:13:33,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:13:33,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:13:33,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 15: [2022-11-26 06:13:33,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 06:13:33,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 17: [2022-11-26 06:13:33,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 27: [2022-11-26 06:13:33,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 19: [2022-11-26 06:13:33,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:13:33,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
19: [2022-11-26 06:13:33,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 2: [2022-11-26 06:13:33,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:13:33,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 2: [2022-11-26 06:13:33,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 06:13:33,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 22: [2022-11-26 06:13:33,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:13:33,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 06:13:33,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 4: [2022-11-26 06:13:33,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:13:33,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
4: [2022-11-26 06:13:33,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 10: [2022-11-26 06:13:33,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 06:13:33,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 4: [2022-11-26 06:13:33,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 11: [2022-11-26 06:13:33,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:13:33,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 06:13:33,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 23: [2022-11-26 06:13:33,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:13:33,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 24: [2022-11-26 06:13:33,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:13:33,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
24: [2022-11-26 06:13:33,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 06:13:33,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 1: [2022-11-26 06:13:33,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:13:33,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:13:33,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 1: [2022-11-26 06:13:33,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 06:13:33,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 16: [2022-11-26 06:13:33,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 28: [2022-11-26 06:13:33,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:13:33,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:13:33,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 6: [2022-11-26 06:13:33,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
9: [2022-11-26 06:13:33,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 6: [2022-11-26 06:13:33,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 06:13:33,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 12: [2022-11-26 06:13:33,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:13:33,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 21: [2022-11-26 06:13:33,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:13:33,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 21: [2022-11-26 06:13:33,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 06:13:33,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 30: [2022-11-26 06:13:33,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:13:33,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 06:13:33,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
0: [2022-11-26 06:13:33,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt.
3: [2022-11-26 06:13:33,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt.
0: [2022-11-26 06:13:33,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt
0: [2022-11-26 06:13:33,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now!
3: [2022-11-26 06:13:33,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt
3: [2022-11-26 06:13:33,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now!
26: [2022-11-26 06:13:33,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt.
28: [2022-11-26 06:13:33,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt
26: [2022-11-26 06:13:33,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt
28: [2022-11-26 06:13:33,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now!
26: [2022-11-26 06:13:33,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now!
18: [2022-11-26 06:13:33,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt.
18: [2022-11-26 06:13:33,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt
18: [2022-11-26 06:13:33,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now!
22: [2022-11-26 06:13:33,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt.
22: [2022-11-26 06:13:33,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt
22: [2022-11-26 06:13:33,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now!
27: [2022-11-26 06:13:33,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt.
27: [2022-11-26 06:13:33,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt
27: [2022-11-26 06:13:33,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now!
8: [2022-11-26 06:13:33,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt.
8: [2022-11-26 06:13:33,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt
8: [2022-11-26 06:13:33,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now!
11: [2022-11-26 06:13:33,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt.
11: [2022-11-26 06:13:33,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt
11: [2022-11-26 06:13:33,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now!
7: [2022-11-26 06:13:33,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt.
7: [2022-11-26 06:13:33,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt
4: [2022-11-26 06:13:33,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt.
7: [2022-11-26 06:13:33,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now!
4: [2022-11-26 06:13:33,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt
4: [2022-11-26 06:13:33,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now!
13: [2022-11-26 06:13:33,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt.
13: [2022-11-26 06:13:33,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt
13: [2022-11-26 06:13:33,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now!
31: [2022-11-26 06:13:33,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt.
31: [2022-11-26 06:13:33,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt
31: [2022-11-26 06:13:33,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now!
25: [2022-11-26 06:13:33,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt.
25: [2022-11-26 06:13:33,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt
25: [2022-11-26 06:13:33,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now!
6: [2022-11-26 06:13:33,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt.
6: [2022-11-26 06:13:33,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt
6: [2022-11-26 06:13:33,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now!
28: [2022-11-26 06:13:33,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt.
14: [2022-11-26 06:13:33,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt.
14: [2022-11-26 06:13:33,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt
14: [2022-11-26 06:13:33,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now!
2: [2022-11-26 06:13:33,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt.
2: [2022-11-26 06:13:33,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt
2: [2022-11-26 06:13:33,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now!
28: [2022-11-26 06:13:33,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt
28: [2022-11-26 06:13:33,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now!
9: [2022-11-26 06:13:33,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt.
9: [2022-11-26 06:13:33,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step54000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt
9: [2022-11-26 06:13:33,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now!
0: successfully saved checkpoint at iteration 54000 to checkpoints_1b1long
31: time (ms) | save-checkpoint: 2479.75
31: iteration 54010/ 173500 | consumed samples: 13826560 | consumed tokens: 28316794880 | elapsed time per iteration (s): 1.04 | learning rate: 1.619E-04 | global batch size: 256 | lm loss: 2.044699E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.593 | TFLOPs: 14.92 |
31: iteration 54020/ 173500 | consumed samples: 13829120 | consumed tokens: 28322037760 | elapsed time per iteration (s): 0.80 | learning rate: 1.619E-04 | global batch size: 256 | lm loss: 2.022908E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.138 | TFLOPs: 19.37 |
31: iteration 54030/ 173500 | consumed samples: 13831680 | consumed tokens: 28327280640 | elapsed time per iteration (s): 0.80 | learning rate: 1.619E-04 | global batch size: 256 | lm loss: 2.021831E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.445 | TFLOPs: 19.33 |
31: iteration 54040/ 173500 | consumed samples: 13834240 | consumed tokens: 28332523520 | elapsed time per iteration (s): 0.80 | learning rate: 1.619E-04 | global batch size: 256 | lm loss: 2.062400E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.275 | TFLOPs: 19.25 |
31: iteration 54050/ 173500 | consumed samples: 13836800 | consumed tokens: 28337766400 | elapsed time per iteration (s): 0.81 | learning rate: 1.618E-04 | global batch size: 256 | lm loss: 2.042521E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.946 | TFLOPs: 19.11 |
31: iteration 54060/ 173500 | consumed samples: 13839360 | consumed tokens: 28343009280 | elapsed time per iteration (s): 0.79 | learning rate: 1.618E-04 | global batch size: 256 | lm loss: 2.044946E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.003 | TFLOPs: 19.60 |
31: iteration 54070/ 173500 | consumed samples: 13841920 | consumed tokens: 28348252160 | elapsed time per iteration (s): 0.80 | learning rate: 1.618E-04 | global batch size: 256 | lm loss: 2.027409E+00 | grad norm: 0.469 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.734 | TFLOPs: 19.46 |
31: iteration 54080/ 173500 | consumed samples: 13844480 | consumed tokens: 28353495040 | elapsed time per iteration (s): 0.88 | learning rate: 1.618E-04 | global batch size: 256 | lm loss: 2.073211E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.652 | TFLOPs: 17.64 |
31: iteration 54090/ 173500 | consumed samples: 13847040 | consumed tokens: 28358737920 | elapsed time per iteration (s): 0.80 | learning rate: 1.618E-04 | global batch size: 256 | lm loss: 2.091310E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.453 | TFLOPs: 19.33 |
31: iteration 54100/ 173500 | consumed samples: 13849600 | consumed tokens: 28363980800 | elapsed time per iteration (s): 0.80 | learning rate: 1.618E-04 | global batch size: 256 | lm loss: 2.066169E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.044 | TFLOPs: 19.30 |
31: iteration 54110/ 173500 | consumed samples: 13852160 | consumed tokens: 28369223680 | elapsed time per iteration (s): 0.80 | learning rate: 1.618E-04 | global batch size: 256 | lm loss: 2.024711E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.072 | TFLOPs: 19.36 |
31: iteration 54120/ 173500 | consumed samples: 13854720 | consumed tokens: 28374466560 | elapsed time per iteration (s): 0.77 | learning rate: 1.618E-04 | global batch size: 256 | lm loss: 2.061098E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.380 | TFLOPs: 20.11 |
31: iteration 54130/ 173500 | consumed samples: 13857280 | consumed tokens: 28379709440 | elapsed time per iteration (s): 2.44 | learning rate: 1.617E-04 | global batch size: 256 | lm loss: 2.086162E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 104.717 | TFLOPs: 6.34 |
31: iteration 54140/ 173500 | consumed samples: 13859840 | consumed tokens: 28384952320 | elapsed time per iteration (s): 0.78 | learning rate: 1.617E-04 | global batch size: 256 | lm loss: 2.062461E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.325 | TFLOPs: 19.74 |
31: iteration 54150/ 173500 | consumed samples: 13862400 | consumed tokens: 28390195200 | elapsed time per iteration (s): 0.77 | learning rate: 1.617E-04 | global batch size: 256 | lm loss: 2.058266E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.107 | TFLOPs: 20.09 |
31: iteration 54160/ 173500 | consumed samples: 13864960 | consumed tokens: 28395438080 | elapsed time per iteration (s): 0.79 | learning rate: 1.617E-04 | global batch size: 256 | lm loss: 2.049406E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.157 | TFLOPs: 19.61 |
31: iteration 54170/ 173500 | consumed samples: 13867520 | consumed tokens: 28400680960 | elapsed time per iteration (s): 0.79 | learning rate: 1.617E-04 | global batch size: 256 | lm loss: 2.069069E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.117 | TFLOPs: 19.49 |
31: iteration 54180/ 173500 | consumed samples: 13870080 | consumed tokens: 28405923840 | elapsed time per iteration (s): 0.81 | learning rate: 1.617E-04 | global batch size: 256 | lm loss: 2.043002E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.335 | TFLOPs: 19.02 |
31: iteration 54190/ 173500 | consumed samples: 13872640 | consumed tokens: 28411166720 | elapsed time per iteration (s): 0.78 | learning rate: 1.617E-04 | global batch size: 256 | lm loss: 2.069542E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.328 | TFLOPs: 19.92 |
31: iteration 54200/ 173500 | consumed samples: 13875200 | consumed tokens: 28416409600 | elapsed time per iteration (s): 0.82 | learning rate: 1.616E-04 | global batch size: 256 | lm loss: 2.054798E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.320 | TFLOPs: 18.77 |
31: iteration 54210/ 173500 | consumed samples: 13877760 | consumed tokens: 28421652480 | elapsed time per iteration (s): 0.78 | learning rate: 1.616E-04 | global batch size: 256 | lm loss: 2.046791E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.360 | TFLOPs: 19.74 |
31: iteration 54220/ 173500 | consumed samples: 13880320 | consumed tokens: 28426895360 | elapsed time per iteration (s): 0.75 | learning rate: 1.616E-04 | global batch size: 256 | lm loss: 2.039710E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.740 | TFLOPs: 20.61 |
31: iteration 54230/ 173500 | consumed samples: 13882880 | consumed tokens: 28432138240 | elapsed time per iteration (s): 0.75 | learning rate: 1.616E-04 | global batch size: 256 | lm loss: 2.050431E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.183 | TFLOPs: 20.70 |
31: iteration 54240/ 173500 | consumed samples: 13885440 | consumed tokens: 28437381120 | elapsed time per iteration (s): 0.74 | learning rate: 1.616E-04 | global batch size: 256 | lm loss: 2.031156E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.501 | TFLOPs: 20.90 |
31: iteration 54250/ 173500 | consumed samples: 13888000 | consumed tokens: 28442624000 | elapsed time per iteration (s): 0.75 | learning rate: 1.616E-04 | global batch size: 256 | lm loss: 2.064671E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.000 | TFLOPs: 20.75 |
31: iteration 54260/ 173500 | consumed samples: 13890560 | consumed tokens: 28447866880 | elapsed time per iteration (s): 0.76 | learning rate: 1.616E-04 | global batch size: 256 | lm loss: 2.080676E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.525 | TFLOPs: 20.36 |
31: iteration 54270/ 173500 | consumed samples: 13893120 | consumed tokens: 28453109760 | elapsed time per iteration (s): 0.78 | learning rate: 1.616E-04 | global batch size: 256 | lm loss: 2.067847E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.593 | TFLOPs: 19.88 |
31: iteration 54280/ 173500 | consumed samples: 13895680 | consumed tokens: 28458352640 | elapsed time per iteration (s): 0.77 | learning rate: 1.615E-04 | global batch size: 256 | lm loss: 2.102227E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.608 | TFLOPs: 20.06 |
31: iteration 54290/ 173500 | consumed samples: 13898240 | consumed tokens: 28463595520 | elapsed time per iteration (s): 0.74 | learning rate: 1.615E-04 | global batch size: 256 | lm loss: 2.087995E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.965 | TFLOPs: 20.93 |
31: iteration 54300/ 173500 | consumed samples: 13900800 | consumed tokens: 28468838400 | elapsed time per iteration (s): 0.77 | learning rate: 1.615E-04 | global batch size: 256 | lm loss: 2.065335E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.453 | TFLOPs: 20.23 |
31: iteration 54310/ 173500 | consumed samples: 13903360 | consumed tokens: 28474081280 | elapsed time per iteration (s): 0.81 | learning rate: 1.615E-04 | global batch size: 256 | lm loss: 2.077652E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.903 | TFLOPs: 19.17 |
31: iteration 54320/ 173500 | consumed samples: 13905920 | consumed tokens: 28479324160 | elapsed time per iteration (s): 0.74 | learning rate: 1.615E-04 | global batch size: 256 | lm loss: 2.027392E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.562 | TFLOPs: 20.85 |
31: iteration 54330/ 173500 | consumed samples: 13908480 | consumed tokens: 28484567040 | elapsed time per iteration (s): 0.79 | learning rate: 1.615E-04 | global batch size: 256 | lm loss: 2.046278E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.017 | TFLOPs: 19.72 |
31: iteration 54340/ 173500 | consumed samples: 13911040 | consumed tokens: 28489809920 | elapsed time per iteration (s): 0.80 | learning rate: 1.615E-04 | global batch size: 256 | lm loss: 2.049225E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.124 | TFLOPs: 19.31 |
31: iteration 54350/ 173500 | consumed samples: 13913600 | consumed tokens: 28495052800 | elapsed time per iteration (s): 0.73 | learning rate: 1.614E-04 | global batch size: 256 | lm loss: 2.070948E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.721 | TFLOPs: 21.16 |
31: iteration 54360/ 173500 | consumed samples: 13916160 | consumed tokens: 28500295680 | elapsed time per iteration (s): 0.75 | learning rate: 1.614E-04 | global batch size: 256 | lm loss: 2.082445E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.435 | TFLOPs: 20.66 |
31: iteration 54370/ 173500 | consumed samples: 13918720 | consumed tokens: 28505538560 | elapsed time per iteration (s): 0.76 | learning rate: 1.614E-04 | global batch size: 256 | lm loss: 2.068443E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.232 | TFLOPs: 20.28 |
31: iteration 54380/ 173500 | consumed samples: 13921280 | consumed tokens: 28510781440 | elapsed time per iteration (s): 0.79 | learning rate: 1.614E-04 | global batch size: 256 | lm loss: 2.036418E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.969 | TFLOPs: 19.60 |
31: iteration 54390/ 173500 | consumed samples: 13923840 | consumed tokens: 28516024320 | elapsed time per iteration (s): 0.74 | learning rate: 1.614E-04 | global batch size: 256 | lm loss: 2.058034E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.782 | TFLOPs: 20.80 |
31: iteration 54400/ 173500 | consumed samples: 13926400 | consumed tokens: 28521267200 | elapsed time per iteration (s): 0.78 | learning rate: 1.614E-04 | global batch size: 256 | lm loss: 2.062814E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.126 | TFLOPs: 19.91 |
31: iteration 54410/ 173500 | consumed samples: 13928960 | consumed tokens: 28526510080 | elapsed time per iteration (s): 0.76 | learning rate: 1.614E-04 | global batch size: 256 | lm loss: 2.020163E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.018 | TFLOPs: 20.27 |
31: iteration 54420/ 173500 | consumed samples: 13931520 | consumed tokens: 28531752960 | elapsed time per iteration (s): 0.75 | learning rate: 1.614E-04 | global batch size: 256 | lm loss: 2.035571E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.506 | TFLOPs: 20.54 |
31: iteration 54430/ 173500 | consumed samples: 13934080 | consumed tokens: 28536995840 | elapsed time per iteration (s): 0.77 | learning rate: 1.613E-04 | global batch size: 256 | lm loss: 2.052321E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.493 | TFLOPs: 20.24 |
31: iteration 54440/ 173500 | consumed samples: 13936640 | consumed tokens: 28542238720 | elapsed time per iteration (s): 0.75 | learning rate: 1.613E-04 | global batch size: 256 | lm loss: 2.080656E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.240 | TFLOPs: 20.52 |
31: iteration 54450/ 173500 | consumed samples: 13939200 | consumed tokens: 28547481600 | elapsed time per iteration (s): 0.80 | learning rate: 1.613E-04 | global batch size: 256 | lm loss: 2.077503E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.806 | TFLOPs: 19.29 |
31: iteration 54460/ 173500 | consumed samples: 13941760 | consumed tokens: 28552724480 | elapsed time per iteration (s): 0.72 | learning rate: 1.613E-04 | global batch size: 256 | lm loss: 2.047165E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.080 | TFLOPs: 21.54 |
31: iteration 54470/ 173500 | consumed samples: 13944320 | consumed tokens: 28557967360 | elapsed time per iteration (s): 0.77 | learning rate: 1.613E-04 | global batch size: 256 | lm loss: 2.080962E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.344 | TFLOPs: 20.23 |
31: iteration 54480/ 173500 | consumed samples: 13946880 | consumed tokens: 28563210240 | elapsed time per iteration (s): 0.81 | learning rate: 1.613E-04 | global batch size: 256 | lm loss: 2.040904E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.875 | TFLOPs: 19.11 |
31: iteration 54490/ 173500 | consumed samples: 13949440 | consumed tokens: 28568453120 | elapsed time per iteration (s): 0.77 | learning rate: 1.613E-04 | global batch size: 256 | lm loss: 2.065262E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.807 | TFLOPs: 20.07 |
31: iteration 54500/ 173500 | consumed samples: 13952000 | consumed tokens: 28573696000 | elapsed time per iteration (s): 0.76 | learning rate: 1.612E-04 | global batch size: 256 | lm loss: 2.040600E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.226 | TFLOPs: 20.40 |
31: iteration 54510/ 173500 | consumed samples: 13954560 | consumed tokens: 28578938880 | elapsed time per iteration (s): 0.76 | learning rate: 1.612E-04 | global batch size: 256 | lm loss: 2.062689E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.560 | TFLOPs: 20.42 |
31: iteration 54520/ 173500 | consumed samples: 13957120 | consumed tokens: 28584181760 | elapsed time per iteration (s): 0.74 | learning rate: 1.612E-04 | global batch size: 256 | lm loss: 2.045129E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.204 | TFLOPs: 20.94 |
31: iteration 54530/ 173500 | consumed samples: 13959680 | consumed tokens: 28589424640 | elapsed time per iteration (s): 0.75 | learning rate: 1.612E-04 | global batch size: 256 | lm loss: 2.055542E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.938 | TFLOPs: 20.75 |
31: iteration 54540/ 173500 | consumed samples: 13962240 | consumed tokens: 28594667520 | elapsed time per iteration (s): 0.78 | learning rate: 1.612E-04 | global batch size: 256 | lm loss: 2.033817E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.531 | TFLOPs: 19.75 |
31: iteration 54550/ 173500 | consumed samples: 13964800 | consumed tokens: 28599910400 | elapsed time per iteration (s): 0.74 | learning rate: 1.612E-04 | global batch size: 256 | lm loss: 2.043862E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.724 | TFLOPs: 20.98 |
31: iteration 54560/ 173500 | consumed samples: 13967360 | consumed tokens: 28605153280 | elapsed time per iteration (s): 0.80 | learning rate: 1.612E-04 | global batch size: 256 | lm loss: 2.057445E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.856 | TFLOPs: 19.35 |
31: iteration 54570/ 173500 | consumed samples: 13969920 | consumed tokens: 28610396160 | elapsed time per iteration (s): 0.77 | learning rate: 1.611E-04 | global batch size: 256 | lm loss: 2.037227E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.091 | TFLOPs: 20.09 |
31: iteration 54580/ 173500 | consumed samples: 13972480 | consumed tokens: 28615639040 | elapsed time per iteration (s): 0.75 | learning rate: 1.611E-04 | global batch size: 256 | lm loss: 2.038391E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.048 | TFLOPs: 20.63 |
31: iteration 54590/ 173500 | consumed samples: 13975040 | consumed tokens: 28620881920 | elapsed time per iteration (s): 0.81 | learning rate: 1.611E-04 | global batch size: 256 | lm loss: 2.059639E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.468 | TFLOPs: 19.15 |
31: iteration 54600/ 173500 | consumed samples: 13977600 | consumed tokens: 28626124800 | elapsed time per iteration (s): 0.74 | learning rate: 1.611E-04 | global batch size: 256 | lm loss: 2.083053E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.578 | TFLOPs: 20.97 |
31: iteration 54610/ 173500 | consumed samples: 13980160 | consumed tokens: 28631367680 | elapsed time per iteration (s): 0.79 | learning rate: 1.611E-04 | global batch size: 256 | lm loss: 2.044839E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.216 | TFLOPs: 19.49 |
31: iteration 54620/ 173500 | consumed samples: 13982720 | consumed tokens: 28636610560 | elapsed time per iteration (s): 0.74 | learning rate: 1.611E-04 | global batch size: 256 | lm loss: 2.054071E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.833 | TFLOPs: 21.04 |
31: iteration 54630/ 173500 | consumed samples: 13985280 | consumed tokens: 28641853440 | elapsed time per iteration (s): 0.77 | learning rate: 1.611E-04 | global batch size: 256 | lm loss: 2.095856E+00 | grad norm: 0.318 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.672 | TFLOPs: 20.07 |
31: iteration 54640/ 173500 | consumed samples: 13987840 | consumed tokens: 28647096320 | elapsed time per iteration (s): 0.78 | learning rate: 1.611E-04 | global batch size: 256 | lm loss: 2.105858E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.118 | TFLOPs: 19.79 |
31: iteration 54650/ 173500 | consumed samples: 13990400 | consumed tokens: 28652339200 | elapsed time per iteration (s): 0.96 | learning rate: 1.610E-04 | global batch size: 256 | lm loss: 2.048453E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 266.832 | TFLOPs: 16.14 |
31: iteration 54660/ 173500 | consumed samples: 13992960 | consumed tokens: 28657582080 | elapsed time per iteration (s): 0.75 | learning rate: 1.610E-04 | global batch size: 256 | lm loss: 2.045656E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.029 | TFLOPs: 20.69 |
31: iteration 54670/ 173500 | consumed samples: 13995520 | consumed tokens: 28662824960 | elapsed time per iteration (s): 0.79 | learning rate: 1.610E-04 | global batch size: 256 | lm loss: 2.062669E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.031 | TFLOPs: 19.66 |
31: iteration 54680/ 173500 | consumed samples: 13998080 | consumed tokens: 28668067840 | elapsed time per iteration (s): 0.77 | learning rate: 1.610E-04 | global batch size: 256 | lm loss: 2.053795E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.087 | TFLOPs: 20.15 |
31: iteration 54690/ 173500 | consumed samples: 14000640 | consumed tokens: 28673310720 | elapsed time per iteration (s): 0.81 | learning rate: 1.610E-04 | global batch size: 256 | lm loss: 2.048874E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.949 | TFLOPs: 19.17 |
31: iteration 54700/ 173500 | consumed samples: 14003200 | consumed tokens: 28678553600 | elapsed time per iteration (s): 0.79 | learning rate: 1.610E-04 | global batch size: 256 | lm loss: 2.091886E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.666 | TFLOPs: 19.52 |
31: iteration 54710/ 173500 | consumed samples: 14005760 | consumed tokens: 28683796480 | elapsed time per iteration (s): 0.90 | learning rate: 1.610E-04 | global batch size: 256 | lm loss: 2.058761E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.091 | TFLOPs: 17.19 |
31: iteration 54720/ 173500 | consumed samples: 14008320 | consumed tokens: 28689039360 | elapsed time per iteration (s): 0.77 | learning rate: 1.609E-04 | global batch size: 256 | lm loss: 2.054966E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.848 | TFLOPs: 20.08 |
31: iteration 54730/ 173500 | consumed samples: 14010880 | consumed tokens: 28694282240 | elapsed time per iteration (s): 0.73 | learning rate: 1.609E-04 | global batch size: 256 | lm loss: 2.065215E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.121 | TFLOPs: 21.18 |
31: iteration 54740/ 173500 | consumed samples: 14013440 | consumed tokens: 28699525120 | elapsed time per iteration (s): 0.78 | learning rate: 1.609E-04 | global batch size: 256 | lm loss: 2.068989E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.282 | TFLOPs: 19.80 |
31: iteration 54750/ 173500 | consumed samples: 14016000 | consumed tokens: 28704768000 | elapsed time per iteration (s): 0.73 | learning rate: 1.609E-04 | global batch size: 256 | lm loss: 2.062406E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.369 | TFLOPs: 21.14 |
31: iteration 54760/ 173500 | consumed samples: 14018560 | consumed tokens: 28710010880 | elapsed time per iteration (s): 0.74 | learning rate: 1.609E-04 | global batch size: 256 | lm loss: 2.021081E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.264 | TFLOPs: 20.83 |
31: iteration 54770/ 173500 | consumed samples: 14021120 | consumed tokens: 28715253760 | elapsed time per iteration (s): 0.76 | learning rate: 1.609E-04 | global batch size: 256 | lm loss: 2.083232E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.164 | TFLOPs: 20.40 |
31: iteration 54780/ 173500 | consumed samples: 14023680 | consumed tokens: 28720496640 | elapsed time per iteration (s): 0.71 | learning rate: 1.609E-04 | global batch size: 256 | lm loss: 2.066722E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.295 | TFLOPs: 21.68 |
31: iteration 54790/ 173500 | consumed samples: 14026240 | consumed tokens: 28725739520 | elapsed time per iteration (s): 0.73 | learning rate: 1.608E-04 | global batch size: 256 | lm loss: 2.049440E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.063 | TFLOPs: 21.18 |
31: iteration 54800/ 173500 | consumed samples: 14028800 | consumed tokens: 28730982400 | elapsed time per iteration (s): 0.75 | learning rate: 1.608E-04 | global batch size: 256 | lm loss: 2.066159E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.665 | TFLOPs: 20.55 |
31: iteration 54810/ 173500 | consumed samples: 14031360 | consumed tokens: 28736225280 | elapsed time per iteration (s): 0.77 | learning rate: 1.608E-04 | global batch size: 256 | lm loss: 2.073508E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.438 | TFLOPs: 20.11 |
31: iteration 54820/ 173500 | consumed samples: 14033920 | consumed tokens: 28741468160 | elapsed time per iteration (s): 0.72 | learning rate: 1.608E-04 | global batch size: 256 | lm loss: 2.034094E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.875 | TFLOPs: 21.47 |
31: iteration 54830/ 173500 | consumed samples: 14036480 | consumed tokens: 28746711040 | elapsed time per iteration (s): 0.75 | learning rate: 1.608E-04 | global batch size: 256 | lm loss: 2.038644E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.128 | TFLOPs: 20.58 |
31: iteration 54840/ 173500 | consumed samples: 14039040 | consumed tokens: 28751953920 | elapsed time per iteration (s): 0.76 | learning rate: 1.608E-04 | global batch size: 256 | lm loss: 2.073954E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.836 | TFLOPs: 20.44 |
31: iteration 54850/ 173500 | consumed samples: 14041600 | consumed tokens: 28757196800 | elapsed time per iteration (s): 0.81 | learning rate: 1.608E-04 | global batch size: 256 | lm loss: 2.042503E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.314 | TFLOPs: 19.08 |
31: iteration 54860/ 173500 | consumed samples: 14044160 | consumed tokens: 28762439680 | elapsed time per iteration (s): 0.78 | learning rate: 1.608E-04 | global batch size: 256 | lm loss: 2.061778E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.430 | TFLOPs: 19.93 |
31: iteration 54870/ 173500 | consumed samples: 14046720 | consumed tokens: 28767682560 | elapsed time per iteration (s): 0.77 | learning rate: 1.607E-04 | global batch size: 256 | lm loss: 2.049669E+00 | grad norm: 0.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.214 | TFLOPs: 20.10 |
31: iteration 54880/ 173500 | consumed samples: 14049280 | consumed tokens: 28772925440 | elapsed time per iteration (s): 0.75 | learning rate: 1.607E-04 | global batch size: 256 | lm loss: 2.046683E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.374 | TFLOPs: 20.71 |
31: iteration 54890/ 173500 | consumed samples: 14051840 | consumed tokens: 28778168320 | elapsed time per iteration (s): 0.81 | learning rate: 1.607E-04 | global batch size: 256 | lm loss: 2.044114E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.212 | TFLOPs: 19.19 |
31: iteration 54900/ 173500 | consumed samples: 14054400 | consumed tokens: 28783411200 | elapsed time per iteration (s): 0.75 | learning rate: 1.607E-04 | global batch size: 256 | lm loss: 2.039946E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.878 | TFLOPs: 20.62 |
31: iteration 54910/ 173500 | consumed samples: 14056960 | consumed tokens: 28788654080 | elapsed time per iteration (s): 0.85 | learning rate: 1.607E-04 | global batch size: 256 | lm loss: 2.052871E+00 | grad norm: 0.137 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.402 | TFLOPs: 18.29 | 31: iteration 54920/ 173500 | consumed samples: 14059520 | consumed tokens: 28793896960 | elapsed time per iteration (s): 0.83 | learning rate: 1.607E-04 | global batch size: 256 | lm loss: 2.082628E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.386 | TFLOPs: 18.66 | 31: iteration 54930/ 173500 | consumed samples: 14062080 | consumed tokens: 28799139840 | elapsed time per iteration (s): 0.78 | learning rate: 1.607E-04 | global batch size: 256 | lm loss: 2.086608E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.042 | TFLOPs: 19.85 | 31: iteration 54940/ 173500 | consumed samples: 14064640 | consumed tokens: 28804382720 | elapsed time per iteration (s): 0.83 | learning rate: 1.606E-04 | global batch size: 256 | lm loss: 2.033569E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.261 | TFLOPs: 18.71 | 31: iteration 54950/ 173500 | consumed samples: 14067200 | consumed tokens: 28809625600 | elapsed time per iteration (s): 0.79 | learning rate: 1.606E-04 | global batch size: 256 | lm loss: 2.050400E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.964 | TFLOPs: 19.72 | 31: iteration 54960/ 173500 | consumed samples: 14069760 | consumed tokens: 28814868480 | elapsed time per iteration (s): 0.75 | learning rate: 1.606E-04 | global batch size: 256 | lm loss: 2.081665E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.914 | TFLOPs: 20.75 | 31: iteration 54970/ 173500 | consumed samples: 14072320 | consumed tokens: 28820111360 | elapsed time per iteration (s): 0.78 | 
learning rate: 1.606E-04 | global batch size: 256 | lm loss: 2.072146E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.443 | TFLOPs: 19.87 | 31: iteration 54980/ 173500 | consumed samples: 14074880 | consumed tokens: 28825354240 | elapsed time per iteration (s): 0.79 | learning rate: 1.606E-04 | global batch size: 256 | lm loss: 2.051571E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.561 | TFLOPs: 19.64 | 31: iteration 54990/ 173500 | consumed samples: 14077440 | consumed tokens: 28830597120 | elapsed time per iteration (s): 0.98 | learning rate: 1.606E-04 | global batch size: 256 | lm loss: 2.076184E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 260.976 | TFLOPs: 15.79 | 31: iteration 55000/ 173500 | consumed samples: 14080000 | consumed tokens: 28835840000 | elapsed time per iteration (s): 0.73 | learning rate: 1.606E-04 | global batch size: 256 | lm loss: 2.034198E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.326 | TFLOPs: 21.19 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 55000 | lm loss value: 1.978945E+00 | lm loss PPL: 7.235109E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 55000 to checkpoints_1b1long 0: [2022-11-26 06:26:48,989] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step55000 is begin to save! 0: [2022-11-26 06:26:49,003] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 06:26:49,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_01-model_00-model_states.pt. 0: [2022-11-26 06:26:49,257] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_03-model_00-model_states.pt... 0: [2022-11-26 06:26:49,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_03-model_00-model_states.pt. 0: [2022-11-26 06:26:49,335] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_04-model_00-model_states.pt... 0: [2022-11-26 06:26:49,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_04-model_00-model_states.pt. 0: [2022-11-26 06:26:49,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_05-model_00-model_states.pt... 0: [2022-11-26 06:26:49,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_05-model_00-model_states.pt. 0: [2022-11-26 06:26:49,498] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_06-model_00-model_states.pt... 0: [2022-11-26 06:26:49,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_06-model_00-model_states.pt. 0: [2022-11-26 06:26:49,575] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_07-model_00-model_states.pt... 0: [2022-11-26 06:26:49,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_07-model_00-model_states.pt. 0: [2022-11-26 06:26:49,650] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 06:26:49,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_08-model_00-model_states.pt. 0: [2022-11-26 06:26:49,727] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_09-model_00-model_states.pt... 0: [2022-11-26 06:26:49,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_09-model_00-model_states.pt. 0: [2022-11-26 06:26:49,799] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_10-model_00-model_states.pt... 0: [2022-11-26 06:26:49,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_10-model_00-model_states.pt. 0: [2022-11-26 06:26:49,876] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_11-model_00-model_states.pt... 0: [2022-11-26 06:26:49,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_11-model_00-model_states.pt. 0: [2022-11-26 06:26:49,948] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_12-model_00-model_states.pt... 0: [2022-11-26 06:26:50,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_12-model_00-model_states.pt. 0: [2022-11-26 06:26:50,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_13-model_00-model_states.pt... 0: [2022-11-26 06:26:50,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_13-model_00-model_states.pt. 0: [2022-11-26 06:26:50,098] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 06:26:50,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_14-model_00-model_states.pt. 0: [2022-11-26 06:26:50,173] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_15-model_00-model_states.pt... 0: [2022-11-26 06:26:50,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_15-model_00-model_states.pt. 0: [2022-11-26 06:26:50,249] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_16-model_00-model_states.pt... 0: [2022-11-26 06:26:50,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_16-model_00-model_states.pt. 0: [2022-11-26 06:26:50,326] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_17-model_00-model_states.pt... 0: [2022-11-26 06:26:50,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_17-model_00-model_states.pt. 0: [2022-11-26 06:26:50,401] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_18-model_00-model_states.pt... 0: [2022-11-26 06:26:50,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_18-model_00-model_states.pt. 0: [2022-11-26 06:26:50,475] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_19-model_00-model_states.pt... 0: [2022-11-26 06:26:50,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_19-model_00-model_states.pt. 0: [2022-11-26 06:26:50,553] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 06:26:50,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_20-model_00-model_states.pt. 0: [2022-11-26 06:26:50,628] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_21-model_00-model_states.pt... 0: [2022-11-26 06:26:50,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_21-model_00-model_states.pt. 0: [2022-11-26 06:26:50,704] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_22-model_00-model_states.pt... 0: [2022-11-26 06:26:50,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_22-model_00-model_states.pt. 0: [2022-11-26 06:26:50,780] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_23-model_00-model_states.pt... 0: [2022-11-26 06:26:50,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_23-model_00-model_states.pt. 0: [2022-11-26 06:26:50,855] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_24-model_00-model_states.pt... 0: [2022-11-26 06:26:50,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_24-model_00-model_states.pt. 0: [2022-11-26 06:26:50,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_25-model_00-model_states.pt... 0: [2022-11-26 06:26:51,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_25-model_00-model_states.pt. 0: [2022-11-26 06:26:51,003] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 06:26:51,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_26-model_00-model_states.pt. 0: [2022-11-26 06:26:51,081] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_27-model_00-model_states.pt... 0: [2022-11-26 06:26:51,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_27-model_00-model_states.pt. 0: [2022-11-26 06:26:51,156] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_28-model_00-model_states.pt... 0: [2022-11-26 06:26:51,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_28-model_00-model_states.pt. 0: [2022-11-26 06:26:51,229] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/layer_30-model_00-model_states.pt... 0: [2022-11-26 06:26:51,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/layer_30-model_00-model_states.pt. 0: [2022-11-26 06:26:51,233] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step55000/mp_rank_00_model_states.pt 0: [2022-11-26 06:26:51,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/mp_rank_00_model_states.pt... 0: [2022-11-26 06:26:51,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/mp_rank_00_model_states.pt. 0: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
0: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
4: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 
10: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 
2: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
12: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 
25: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 
31: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 
22: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 
18: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 
27: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
9: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 
1: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 
12: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 
23: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 
14: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 
30: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 
18: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 
0: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 
16: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 
11: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 
30: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
3: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 
0: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
12: [2022-11-26 06:26:51,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:26:51,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:26:51,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 06:26:51,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 4: [2022-11-26 06:26:51,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:26:51,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 06:26:51,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 2: [2022-11-26 06:26:51,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:26:51,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 06:26:51,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 24: [2022-11-26 06:26:51,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:26:51,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 06:26:51,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 06:26:51,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 24: [2022-11-26 06:26:51,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 06:26:51,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 25: [2022-11-26 06:26:51,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:26:51,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 06:26:51,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 1: [2022-11-26 06:26:51,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:26:51,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 06:26:51,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 10: [2022-11-26 06:26:51,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 06:26:51,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 11: [2022-11-26 06:26:51,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:26:51,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 11: [2022-11-26 06:26:51,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 06:26:51,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 14: [2022-11-26 06:26:51,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:26:51,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 12: [2022-11-26 06:26:51,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:26:51,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 06:26:51,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 14: [2022-11-26 06:26:51,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 28: [2022-11-26 06:26:51,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
1: [2022-11-26 06:26:51,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:26:51,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 06:26:51,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 5: [2022-11-26 06:26:51,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:26:51,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:26:51,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 31: [2022-11-26 06:26:51,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 5: [2022-11-26 06:26:51,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 20: [2022-11-26 06:26:51,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:26:51,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:26:51,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
20: [2022-11-26 06:26:51,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 06:26:51,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 06:26:51,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 20: [2022-11-26 06:26:51,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 23: [2022-11-26 06:26:51,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:26:51,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 06:26:51,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 24: [2022-11-26 06:26:51,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:26:51,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:26:51,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 2: [2022-11-26 06:26:51,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 24: [2022-11-26 06:26:51,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
2: [2022-11-26 06:26:51,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 4: [2022-11-26 06:26:51,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:26:51,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 06:26:51,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 26: [2022-11-26 06:26:51,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:26:51,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 06:26:51,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 2: [2022-11-26 06:26:51,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:26:51,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 06:26:51,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 23: [2022-11-26 06:26:51,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:26:51,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
23: [2022-11-26 06:26:51,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 1: [2022-11-26 06:26:51,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:26:51,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 23: [2022-11-26 06:26:51,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 25: [2022-11-26 06:26:51,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 1: [2022-11-26 06:26:51,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 06:26:51,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 10: [2022-11-26 06:26:51,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:26:51,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 06:26:51,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 28: [2022-11-26 06:26:51,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 06:26:51,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
31: [2022-11-26 06:26:51,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:26:51,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 06:26:51,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 28: [2022-11-26 06:26:51,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:26:51,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:26:51,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 06:26:51,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 21: [2022-11-26 06:26:51,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:26:51,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:26:51,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
21: [2022-11-26 06:26:51,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 6: [2022-11-26 06:26:51,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 3: [2022-11-26 06:26:51,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 21: [2022-11-26 06:26:51,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 6: [2022-11-26 06:26:51,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 3: [2022-11-26 06:26:51,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 22: [2022-11-26 06:26:51,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:26:51,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 06:26:51,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 22: [2022-11-26 06:26:51,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:26:51,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 06:26:51,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
3: [2022-11-26 06:26:51,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:26:51,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:26:51,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 3: [2022-11-26 06:26:51,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 31: [2022-11-26 06:26:51,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:26:51,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 3: [2022-11-26 06:26:51,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 31: [2022-11-26 06:26:51,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 06:26:51,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 4: [2022-11-26 06:26:51,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:26:51,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:26:51,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
4: [2022-11-26 06:26:51,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 3: [2022-11-26 06:26:51,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 4: [2022-11-26 06:26:51,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 3: [2022-11-26 06:26:51,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 0: [2022-11-26 06:26:51,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:26:51,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 06:26:51,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 1: [2022-11-26 06:26:51,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:26:51,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 06:26:51,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 21: [2022-11-26 06:26:51,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:26:51,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-26 06:26:51,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 06:26:51,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 06:26:51,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 21: [2022-11-26 06:26:51,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 14: [2022-11-26 06:26:51,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:26:51,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 06:26:51,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 14: [2022-11-26 06:26:51,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:26:51,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:26:51,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 06:26:51,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
0: [2022-11-26 06:26:51,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 2: [2022-11-26 06:26:51,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:26:51,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 2: [2022-11-26 06:26:51,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 15: [2022-11-26 06:26:51,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:26:51,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 15: [2022-11-26 06:26:51,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 06:26:51,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 31: [2022-11-26 06:26:51,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:26:51,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 06:26:51,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 19: [2022-11-26 06:26:51,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 
19: [2022-11-26 06:26:51,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 06:26:51,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 24: [2022-11-26 06:26:51,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:26:51,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 5: [2022-11-26 06:26:51,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:26:51,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 5: [2022-11-26 06:26:51,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 06:26:51,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 5: [2022-11-26 06:26:51,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:26:51,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 06:26:51,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 6: [2022-11-26 06:26:51,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 06:26:51,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:26:51,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 06:26:51,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 6: [2022-11-26 06:26:51,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 06:26:51,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 30: [2022-11-26 06:26:51,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:26:51,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 06:26:51,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 15: [2022-11-26 06:26:51,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:26:51,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 06:26:51,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 11: [2022-11-26 06:26:51,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
30: [2022-11-26 06:26:51,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:26:51,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 30: [2022-11-26 06:26:51,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 11: [2022-11-26 06:26:51,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 30: [2022-11-26 06:26:51,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 11: [2022-11-26 06:26:51,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:26:51,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 06:26:51,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 25: [2022-11-26 06:26:51,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:26:51,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 06:26:51,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 19: [2022-11-26 06:26:51,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-26 06:26:51,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 06:26:51,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 27: [2022-11-26 06:26:51,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:26:51,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 06:26:51,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 26: [2022-11-26 06:26:51,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:26:51,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 06:26:51,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 10: [2022-11-26 06:26:51,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:26:51,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 06:26:51,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 4: [2022-11-26 06:26:51,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
20: [2022-11-26 06:26:51,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:26:51,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 28: [2022-11-26 06:26:51,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 4: [2022-11-26 06:26:51,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 20: [2022-11-26 06:26:51,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 28: [2022-11-26 06:26:51,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 20: [2022-11-26 06:26:51,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 28: [2022-11-26 06:26:51,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:26:51,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:26:51,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
28: [2022-11-26 06:26:51,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 21: [2022-11-26 06:26:51,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 25: [2022-11-26 06:26:51,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 28: [2022-11-26 06:26:51,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 25: [2022-11-26 06:26:51,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 21: [2022-11-26 06:26:51,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 12: [2022-11-26 06:26:51,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:26:51,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 06:26:51,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 23: [2022-11-26 06:26:51,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:26:51,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 06:26:51,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
30: [2022-11-26 06:26:51,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:26:51,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 06:26:51,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 14: [2022-11-26 06:26:51,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:26:51,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 20: [2022-11-26 06:26:51,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:26:51,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 20: [2022-11-26 06:26:51,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 06:26:51,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 8: [2022-11-26 06:26:51,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:26:51,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:26:51,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 06:26:51,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:26:51,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 06:26:51,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 06:26:51,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 06:26:51,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 06:26:51,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 8: [2022-11-26 06:26:51,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 12: [2022-11-26 06:26:51,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:26:51,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 8: [2022-11-26 06:26:51,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 12: [2022-11-26 06:26:51,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 06:26:51,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
22: [2022-11-26 06:26:51,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:26:51,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 06:26:51,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:26:51,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 19: [2022-11-26 06:26:51,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:26:51,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 19: [2022-11-26 06:26:51,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 22: [2022-11-26 06:26:51,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 19: [2022-11-26 06:26:51,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 15: [2022-11-26 06:26:51,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:26:51,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 06:26:51,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
23: [2022-11-26 06:26:51,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:26:51,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 06:26:51,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 3: [2022-11-26 06:26:51,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:26:51,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:26:51,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 26: [2022-11-26 06:26:51,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 3: [2022-11-26 06:26:51,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 26: [2022-11-26 06:26:51,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 28: [2022-11-26 06:26:51,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:26:51,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:26:51,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-26 06:26:51,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:26:51,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:26:51,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 06:26:51,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 06:26:51,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 06:26:51,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 06:26:51,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 18: [2022-11-26 06:26:51,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 18: [2022-11-26 06:26:51,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 18: [2022-11-26 06:26:51,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 30: [2022-11-26 06:26:51,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-26 06:26:51,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 06:26:51,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 27: [2022-11-26 06:26:51,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:26:51,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:26:51,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 06:26:51,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 06:26:51,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 7: [2022-11-26 06:26:51,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:26:51,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 7: [2022-11-26 06:26:51,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 06:26:51,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
28: [2022-11-26 06:26:51,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 06:26:51,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 7: [2022-11-26 06:26:51,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:26:51,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:26:51,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 06:26:51,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 06:26:51,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 7: [2022-11-26 06:26:51,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 24: [2022-11-26 06:26:51,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:26:51,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 06:26:51,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 5: [2022-11-26 06:26:51,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 06:26:51,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 06:26:51,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 27: [2022-11-26 06:26:51,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:26:51,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 06:26:51,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 0: [2022-11-26 06:26:51,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:26:51,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 06:26:51,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 29: [2022-11-26 06:26:51,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:26:51,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:26:51,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-26 06:26:51,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 06:26:51,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 06:26:51,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 29: [2022-11-26 06:26:51,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 06:26:51,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 29: [2022-11-26 06:26:51,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 17: [2022-11-26 06:26:51,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:26:51,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 06:26:51,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 17: [2022-11-26 06:26:51,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:26:51,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 06:26:51,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
17: [2022-11-26 06:26:51,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:26:51,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 06:26:51,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 17: [2022-11-26 06:26:51,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:26:51,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 06:26:51,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 17: [2022-11-26 06:26:51,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:26:51,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 06:26:51,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 16: [2022-11-26 06:26:51,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:26:51,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:26:51,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
16: [2022-11-26 06:26:51,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:26:51,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 06:26:51,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 06:26:51,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 06:26:51,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 16: [2022-11-26 06:26:51,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 06:26:51,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 16: [2022-11-26 06:26:51,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 16: [2022-11-26 06:26:51,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 4: [2022-11-26 06:26:51,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:26:51,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 06:26:51,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
11: [2022-11-26 06:26:51,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:26:51,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 06:26:51,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 9: [2022-11-26 06:26:51,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:26:51,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:26:51,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:26:51,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 06:26:51,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 06:26:51,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 06:26:51,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 06:26:51,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 06:26:51,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 9: [2022-11-26 06:26:51,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 9: [2022-11-26 06:26:51,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 9: [2022-11-26 06:26:51,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 1: [2022-11-26 06:26:51,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:26:51,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 06:26:51,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 13: [2022-11-26 06:26:51,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 06:26:51,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:26:51,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:26:51,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:26:51,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 6: [2022-11-26 06:26:51,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:26:51,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 06:26:51,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 06:26:51,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 06:26:51,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 6: [2022-11-26 06:26:51,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 13: [2022-11-26 06:26:51,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
13: [2022-11-26 06:26:51,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 13: [2022-11-26 06:26:51,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 6: [2022-11-26 06:26:51,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 31: [2022-11-26 06:26:51,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:26:51,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 06:26:51,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 0: [2022-11-26 06:26:51,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 06:26:51,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 25: [2022-11-26 06:26:51,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:26:51,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 06:26:51,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 20: [2022-11-26 06:26:51,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
20: [2022-11-26 06:26:51,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 06:26:51,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 0: [2022-11-26 06:26:51,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:26:51,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 06:26:51,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 23: [2022-11-26 06:26:51,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:26:51,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 06:26:51,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 10: [2022-11-26 06:26:51,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:26:51,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 06:26:51,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 2: [2022-11-26 06:26:51,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 06:26:51,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 06:26:51,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 30: [2022-11-26 06:26:51,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:26:51,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 06:26:51,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 12: [2022-11-26 06:26:51,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:26:51,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 06:26:51,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 8: [2022-11-26 06:26:51,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:26:51,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 06:26:51,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 16: [2022-11-26 06:26:51,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
16: [2022-11-26 06:26:51,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 06:26:51,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 11: [2022-11-26 06:26:51,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:26:51,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 06:26:51,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 9: [2022-11-26 06:26:51,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:26:51,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:26:51,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 5: [2022-11-26 06:26:51,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 9: [2022-11-26 06:26:51,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 5: [2022-11-26 06:26:51,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 13: [2022-11-26 06:26:51,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 06:26:51,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 06:26:51,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 24: [2022-11-26 06:26:51,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:26:51,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 06:26:51,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 29: [2022-11-26 06:26:51,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:26:51,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 06:26:51,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 14: [2022-11-26 06:26:51,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:26:51,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 06:26:51,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 28: [2022-11-26 06:26:51,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 
28: [2022-11-26 06:26:51,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 06:26:51,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 15: [2022-11-26 06:26:51,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:26:51,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 06:26:51,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 3: [2022-11-26 06:26:51,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:26:51,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 19: [2022-11-26 06:26:51,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:26:51,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 19: [2022-11-26 06:26:51,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 18: [2022-11-26 06:26:51,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:26:51,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
18: [2022-11-26 06:26:51,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 06:26:51,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 22: [2022-11-26 06:26:51,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:26:51,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 06:26:51,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 7: [2022-11-26 06:26:51,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:26:51,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 06:26:51,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 26: [2022-11-26 06:26:51,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:26:51,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 06:26:51,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 21: [2022-11-26 06:26:51,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-26 06:26:51,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 06:26:51,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 1: [2022-11-26 06:26:51,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:26:51,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 06:26:51,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 4: [2022-11-26 06:26:51,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:26:51,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 06:26:51,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 27: [2022-11-26 06:26:51,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:26:51,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:26:51,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 06:26:51,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
6: [2022-11-26 06:26:51,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 06:26:51,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 17: [2022-11-26 06:26:51,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:26:51,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 06:26:51,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 25: [2022-11-26 06:26:51,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:26:51,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 06:26:51,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 20: [2022-11-26 06:26:51,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:26:51,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 06:26:51,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 31: [2022-11-26 06:26:51,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 
31: [2022-11-26 06:26:51,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 06:26:51,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 0: [2022-11-26 06:26:51,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:26:51,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 06:26:51,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 2: [2022-11-26 06:26:51,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:26:51,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 06:26:51,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 23: [2022-11-26 06:26:51,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:26:51,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 06:26:51,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 10: [2022-11-26 06:26:51,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 06:26:51,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 06:26:51,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 30: [2022-11-26 06:26:51,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:26:51,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 06:26:51,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 21: [2022-11-26 06:26:51,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:26:51,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 06:26:51,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 12: [2022-11-26 06:26:51,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:26:51,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 06:26:51,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 8: [2022-11-26 06:26:51,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 06:26:51,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 06:26:51,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 5: [2022-11-26 06:26:51,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:26:51,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 16: [2022-11-26 06:26:51,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:26:51,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 16: [2022-11-26 06:26:51,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 06:26:51,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 9: [2022-11-26 06:26:51,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:26:51,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 06:26:51,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 11: [2022-11-26 06:26:51,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 06:26:51,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 06:26:51,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 13: [2022-11-26 06:26:51,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:26:51,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 06:26:51,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 18: [2022-11-26 06:26:51,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:26:51,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 06:26:51,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 19: [2022-11-26 06:26:51,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:26:51,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 06:26:51,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 22: [2022-11-26 06:26:51,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
22: [2022-11-26 06:26:51,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 06:26:51,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 3: [2022-11-26 06:26:51,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:26:51,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 06:26:51,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 14: [2022-11-26 06:26:51,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:26:51,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 06:26:51,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 28: [2022-11-26 06:26:51,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 06:26:51,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 06:26:51,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 26: [2022-11-26 06:26:51,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 
26: [2022-11-26 06:26:51,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 06:26:51,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 29: [2022-11-26 06:26:51,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:26:51,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 06:26:51,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 24: [2022-11-26 06:26:51,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:26:51,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 06:26:51,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 15: [2022-11-26 06:26:51,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:26:51,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 06:26:51,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 7: [2022-11-26 06:26:51,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-26 06:26:51,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 06:26:51,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 1: [2022-11-26 06:26:51,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:26:51,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 06:26:51,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 6: [2022-11-26 06:26:51,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:26:51,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 06:26:51,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 4: [2022-11-26 06:26:51,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:26:51,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 06:26:51,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 27: [2022-11-26 06:26:51,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
27: [2022-11-26 06:26:51,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 06:26:51,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 17: [2022-11-26 06:26:51,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:26:51,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 06:26:51,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 25: [2022-11-26 06:26:51,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:26:51,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:26:51,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 06:26:51,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 2: [2022-11-26 06:26:51,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
20: [2022-11-26 06:26:51,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 2: [2022-11-26 06:26:51,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 06:26:51,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 20: [2022-11-26 06:26:51,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 0: [2022-11-26 06:26:51,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:26:51,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 06:26:51,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 10: [2022-11-26 06:26:51,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:26:51,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 31: [2022-11-26 06:26:51,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:26:51,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
31: [2022-11-26 06:26:51,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 06:26:51,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 23: [2022-11-26 06:26:51,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:26:51,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 06:26:51,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 21: [2022-11-26 06:26:51,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:26:51,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 9: [2022-11-26 06:26:51,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:26:51,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 9: [2022-11-26 06:26:51,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 06:26:51,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 30: [2022-11-26 06:26:51,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-26 06:26:51,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 06:26:51,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 8: [2022-11-26 06:26:51,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:26:51,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 06:26:51,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 3: [2022-11-26 06:26:51,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:26:51,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 06:26:51,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 11: [2022-11-26 06:26:51,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:26:51,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 06:26:51,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 14: [2022-11-26 06:26:51,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 06:26:51,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 06:26:51,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 16: [2022-11-26 06:26:51,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:26:51,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 06:26:51,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 12: [2022-11-26 06:26:51,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:26:51,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 06:26:51,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 5: [2022-11-26 06:26:51,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:26:51,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 06:26:51,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 24: [2022-11-26 06:26:51,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
24: [2022-11-26 06:26:51,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 06:26:51,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 29: [2022-11-26 06:26:51,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:26:51,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 06:26:51,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 13: [2022-11-26 06:26:51,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:26:51,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 06:26:51,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 18: [2022-11-26 06:26:51,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:26:51,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 06:26:51,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 22: [2022-11-26 06:26:51,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
22: [2022-11-26 06:26:51,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 06:26:51,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 26: [2022-11-26 06:26:51,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:26:51,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 06:26:51,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 1: [2022-11-26 06:26:51,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:26:51,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 06:26:51,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 6: [2022-11-26 06:26:51,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:26:51,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 06:26:51,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 19: [2022-11-26 06:26:51,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
19: [2022-11-26 06:26:51,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 06:26:51,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 0: [2022-11-26 06:26:51,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:26:51,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:26:51,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 06:26:51,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 23: [2022-11-26 06:26:51,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 06:26:51,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 20: [2022-11-26 06:26:51,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:26:51,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 06:26:51,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 10: [2022-11-26 06:26:51,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 06:26:51,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 06:26:51,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 4: [2022-11-26 06:26:51,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 28: [2022-11-26 06:26:51,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:26:51,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 06:26:51,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 2: [2022-11-26 06:26:51,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:26:51,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 06:26:51,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 25: [2022-11-26 06:26:51,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:26:51,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 06:26:51,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 25: [2022-11-26 06:26:51,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 24: [2022-11-26 06:26:51,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:26:51,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 30: [2022-11-26 06:26:51,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:26:51,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 24: [2022-11-26 06:26:51,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 06:26:51,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 30: [2022-11-26 06:26:51,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 26: [2022-11-26 06:26:51,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:26:51,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
26: [2022-11-26 06:26:51,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 06:26:51,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 28: [2022-11-26 06:26:51,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 06:26:51,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 31: [2022-11-26 06:26:51,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:26:51,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 06:26:51,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 27: [2022-11-26 06:26:51,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:26:51,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 06:26:51,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 9: [2022-11-26 06:26:51,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-26 06:26:51,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 06:26:51,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 8: [2022-11-26 06:26:51,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:26:51,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 21: [2022-11-26 06:26:51,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:26:51,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 21: [2022-11-26 06:26:51,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 06:26:51,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 22: [2022-11-26 06:26:51,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:26:51,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 06:26:51,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 17: [2022-11-26 06:26:51,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 
17: [2022-11-26 06:26:51,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 06:26:51,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 14: [2022-11-26 06:26:51,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:26:51,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 06:26:51,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 16: [2022-11-26 06:26:51,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:26:51,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:26:51,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 06:26:51,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 12: [2022-11-26 06:26:51,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 11: [2022-11-26 06:26:51,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
7: [2022-11-26 06:26:51,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:26:51,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 7: [2022-11-26 06:26:51,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 11: [2022-11-26 06:26:51,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 29: [2022-11-26 06:26:51,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:26:51,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 29: [2022-11-26 06:26:51,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 11: [2022-11-26 06:26:51,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 29: [2022-11-26 06:26:51,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 13: [2022-11-26 06:26:51,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:26:51,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 06:26:51,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
27: [2022-11-26 06:26:51,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:26:51,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 06:26:51,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 18: [2022-11-26 06:26:51,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:26:51,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 28: [2022-11-26 06:26:51,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:26:51,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 5: [2022-11-26 06:26:51,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:26:51,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 06:26:51,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 7: [2022-11-26 06:26:51,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:26:51,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
7: [2022-11-26 06:26:51,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 15: [2022-11-26 06:26:51,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 7: [2022-11-26 06:26:51,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 15: [2022-11-26 06:26:51,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 28: [2022-11-26 06:26:51,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 06:26:51,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 19: [2022-11-26 06:26:51,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:26:51,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 06:26:51,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 29: [2022-11-26 06:26:51,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:26:51,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 06:26:51,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
19: [2022-11-26 06:26:51,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:26:51,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 06:26:51,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 7: [2022-11-26 06:26:51,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:26:51,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 06:26:51,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 15: [2022-11-26 06:26:51,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:26:51,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:26:51,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 06:26:51,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step55000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 06:26:51,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 15: [2022-11-26 06:26:51,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
0: successfully saved checkpoint at iteration 55000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2615.33 31: iteration 55010/ 173500 | consumed samples: 14082560 | consumed tokens: 28841082880 | elapsed time per iteration (s): 1.10 | learning rate: 1.605E-04 | global batch size: 256 | lm loss: 2.071952E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.104 | TFLOPs: 14.10 | 31: iteration 55020/ 173500 | consumed samples: 14085120 | consumed tokens: 28846325760 | elapsed time per iteration (s): 0.80 | learning rate: 1.605E-04 | global batch size: 256 | lm loss: 2.013334E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.772 | TFLOPs: 19.41 | 31: iteration 55030/ 173500 | consumed samples: 14087680 | consumed tokens: 28851568640 | elapsed time per iteration (s): 0.84 | learning rate: 1.605E-04 | global batch size: 256 | lm loss: 2.061890E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.183 | TFLOPs: 18.46 | 31: iteration 55040/ 173500 | consumed samples: 14090240 | consumed tokens: 28856811520 | elapsed time per iteration (s): 0.84 | learning rate: 1.605E-04 | global batch size: 256 | lm loss: 2.045343E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.926 | TFLOPs: 18.51 | 31: iteration 55050/ 173500 | consumed samples: 14092800 | consumed tokens: 28862054400 | elapsed time per iteration (s): 0.79 | learning rate: 1.605E-04 | global batch size: 256 | lm loss: 2.053022E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.021 | TFLOPs: 19.48 | 31: iteration 55060/ 173500 | consumed samples: 14095360 | consumed tokens: 28867297280 | elapsed time per iteration (s): 0.83 | 
learning rate: 1.605E-04 | global batch size: 256 | lm loss: 2.055172E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.137 | TFLOPs: 18.76 | 31: iteration 55070/ 173500 | consumed samples: 14097920 | consumed tokens: 28872540160 | elapsed time per iteration (s): 0.82 | learning rate: 1.605E-04 | global batch size: 256 | lm loss: 2.072174E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.792 | TFLOPs: 18.80 | 31: iteration 55080/ 173500 | consumed samples: 14100480 | consumed tokens: 28877783040 | elapsed time per iteration (s): 0.81 | learning rate: 1.605E-04 | global batch size: 256 | lm loss: 2.047440E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.415 | TFLOPs: 19.08 | 31: iteration 55090/ 173500 | consumed samples: 14103040 | consumed tokens: 28883025920 | elapsed time per iteration (s): 0.87 | learning rate: 1.604E-04 | global batch size: 256 | lm loss: 2.084346E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.537 | TFLOPs: 17.76 | 31: iteration 55100/ 173500 | consumed samples: 14105600 | consumed tokens: 28888268800 | elapsed time per iteration (s): 0.83 | learning rate: 1.604E-04 | global batch size: 256 | lm loss: 2.056113E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.934 | TFLOPs: 18.69 | 31: iteration 55110/ 173500 | consumed samples: 14108160 | consumed tokens: 28893511680 | elapsed time per iteration (s): 0.79 | learning rate: 1.604E-04 | global batch size: 256 | lm loss: 2.097911E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.117 | TFLOPs: 19.49 | 31: iteration 55120/ 
173500 | consumed samples: 14110720 | consumed tokens: 28898754560 | elapsed time per iteration (s): 0.80 | learning rate: 1.604E-04 | global batch size: 256 | lm loss: 2.043576E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.504 | TFLOPs: 19.39 | 31: iteration 55130/ 173500 | consumed samples: 14113280 | consumed tokens: 28903997440 | elapsed time per iteration (s): 0.86 | learning rate: 1.604E-04 | global batch size: 256 | lm loss: 2.037351E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.272 | TFLOPs: 18.11 | 31: iteration 55140/ 173500 | consumed samples: 14115840 | consumed tokens: 28909240320 | elapsed time per iteration (s): 0.78 | learning rate: 1.604E-04 | global batch size: 256 | lm loss: 2.033823E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.584 | TFLOPs: 19.88 | 31: iteration 55150/ 173500 | consumed samples: 14118400 | consumed tokens: 28914483200 | elapsed time per iteration (s): 0.79 | learning rate: 1.604E-04 | global batch size: 256 | lm loss: 2.050751E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.675 | TFLOPs: 19.64 | 31: iteration 55160/ 173500 | consumed samples: 14120960 | consumed tokens: 28919726080 | elapsed time per iteration (s): 0.79 | learning rate: 1.603E-04 | global batch size: 256 | lm loss: 2.054204E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.311 | TFLOPs: 19.56 | 31: iteration 55170/ 173500 | consumed samples: 14123520 | consumed tokens: 28924968960 | elapsed time per iteration (s): 0.79 | learning rate: 1.603E-04 | global batch size: 256 | lm loss: 2.057962E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 322.227 | TFLOPs: 19.49 | 31: iteration 55180/ 173500 | consumed samples: 14126080 | consumed tokens: 28930211840 | elapsed time per iteration (s): 0.82 | learning rate: 1.603E-04 | global batch size: 256 | lm loss: 2.066994E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.412 | TFLOPs: 18.78 | 31: iteration 55190/ 173500 | consumed samples: 14128640 | consumed tokens: 28935454720 | elapsed time per iteration (s): 0.80 | learning rate: 1.603E-04 | global batch size: 256 | lm loss: 2.053759E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.647 | TFLOPs: 19.46 | 31: iteration 55200/ 173500 | consumed samples: 14131200 | consumed tokens: 28940697600 | elapsed time per iteration (s): 0.80 | learning rate: 1.603E-04 | global batch size: 256 | lm loss: 2.084795E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.754 | TFLOPs: 19.47 | 31: iteration 55210/ 173500 | consumed samples: 14133760 | consumed tokens: 28945940480 | elapsed time per iteration (s): 0.81 | learning rate: 1.603E-04 | global batch size: 256 | lm loss: 2.075344E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.111 | TFLOPs: 19.00 | 31: iteration 55220/ 173500 | consumed samples: 14136320 | consumed tokens: 28951183360 | elapsed time per iteration (s): 0.81 | learning rate: 1.603E-04 | global batch size: 256 | lm loss: 2.079972E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.757 | TFLOPs: 19.22 | 31: iteration 55230/ 173500 | consumed samples: 14138880 | consumed tokens: 28956426240 | elapsed time per iteration (s): 0.79 | learning rate: 
1.602E-04 | global batch size: 256 | lm loss: 2.089357E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.728 | TFLOPs: 19.52 | 31: iteration 55240/ 173500 | consumed samples: 14141440 | consumed tokens: 28961669120 | elapsed time per iteration (s): 0.84 | learning rate: 1.602E-04 | global batch size: 256 | lm loss: 2.040311E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.911 | TFLOPs: 18.45 | 31: iteration 55250/ 173500 | consumed samples: 14144000 | consumed tokens: 28966912000 | elapsed time per iteration (s): 0.81 | learning rate: 1.602E-04 | global batch size: 256 | lm loss: 2.030182E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.447 | TFLOPs: 19.20 | 31: iteration 55260/ 173500 | consumed samples: 14146560 | consumed tokens: 28972154880 | elapsed time per iteration (s): 0.82 | learning rate: 1.602E-04 | global batch size: 256 | lm loss: 2.052802E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.339 | TFLOPs: 18.96 | 31: iteration 55270/ 173500 | consumed samples: 14149120 | consumed tokens: 28977397760 | elapsed time per iteration (s): 0.74 | learning rate: 1.602E-04 | global batch size: 256 | lm loss: 2.048067E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.877 | TFLOPs: 20.86 | 31: iteration 55280/ 173500 | consumed samples: 14151680 | consumed tokens: 28982640640 | elapsed time per iteration (s): 0.74 | learning rate: 1.602E-04 | global batch size: 256 | lm loss: 2.056580E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.459 | TFLOPs: 20.90 | 31: iteration 55290/ 173500 | 
consumed samples: 14154240 | consumed tokens: 28987883520 | elapsed time per iteration (s): 0.72 | learning rate: 1.602E-04 | global batch size: 256 | lm loss: 2.067457E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.279 | TFLOPs: 21.43 | 31: iteration 55300/ 173500 | consumed samples: 14156800 | consumed tokens: 28993126400 | elapsed time per iteration (s): 0.77 | learning rate: 1.602E-04 | global batch size: 256 | lm loss: 2.073872E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.524 | TFLOPs: 20.06 | 31: iteration 55310/ 173500 | consumed samples: 14159360 | consumed tokens: 28998369280 | elapsed time per iteration (s): 0.76 | learning rate: 1.601E-04 | global batch size: 256 | lm loss: 2.026907E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.761 | TFLOPs: 20.49 | 31: iteration 55320/ 173500 | consumed samples: 14161920 | consumed tokens: 29003612160 | elapsed time per iteration (s): 0.79 | learning rate: 1.601E-04 | global batch size: 256 | lm loss: 2.085516E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.501 | TFLOPs: 19.63 | 31: iteration 55330/ 173500 | consumed samples: 14164480 | consumed tokens: 29008855040 | elapsed time per iteration (s): 0.76 | learning rate: 1.601E-04 | global batch size: 256 | lm loss: 2.025931E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.531 | TFLOPs: 20.30 | 31: iteration 55340/ 173500 | consumed samples: 14167040 | consumed tokens: 29014097920 | elapsed time per iteration (s): 0.76 | learning rate: 1.601E-04 | global batch size: 256 | lm loss: 2.075472E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 338.393 | TFLOPs: 20.47 | 31: iteration 55350/ 173500 | consumed samples: 14169600 | consumed tokens: 29019340800 | elapsed time per iteration (s): 0.75 | learning rate: 1.601E-04 | global batch size: 256 | lm loss: 2.028783E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.882 | TFLOPs: 20.74 | 31: iteration 55360/ 173500 | consumed samples: 14172160 | consumed tokens: 29024583680 | elapsed time per iteration (s): 0.80 | learning rate: 1.601E-04 | global batch size: 256 | lm loss: 2.052256E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.306 | TFLOPs: 19.32 | 31: iteration 55370/ 173500 | consumed samples: 14174720 | consumed tokens: 29029826560 | elapsed time per iteration (s): 0.81 | learning rate: 1.601E-04 | global batch size: 256 | lm loss: 2.043532E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.466 | TFLOPs: 19.08 | 31: iteration 55380/ 173500 | consumed samples: 14177280 | consumed tokens: 29035069440 | elapsed time per iteration (s): 0.91 | learning rate: 1.600E-04 | global batch size: 256 | lm loss: 2.032793E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 281.284 | TFLOPs: 17.02 | 31: iteration 55390/ 173500 | consumed samples: 14179840 | consumed tokens: 29040312320 | elapsed time per iteration (s): 0.88 | learning rate: 1.600E-04 | global batch size: 256 | lm loss: 2.063862E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.193 | TFLOPs: 17.68 | 31: iteration 55400/ 173500 | consumed samples: 14182400 | consumed tokens: 29045555200 | elapsed time per iteration (s): 0.88 | learning rate: 1.600E-04 | global batch 
size: 256 | lm loss: 2.054115E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.332 | TFLOPs: 17.50 | 31: iteration 55410/ 173500 | consumed samples: 14184960 | consumed tokens: 29050798080 | elapsed time per iteration (s): 0.86 | learning rate: 1.600E-04 | global batch size: 256 | lm loss: 2.038982E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.134 | TFLOPs: 17.92 | 31: iteration 55420/ 173500 | consumed samples: 14187520 | consumed tokens: 29056040960 | elapsed time per iteration (s): 0.83 | learning rate: 1.600E-04 | global batch size: 256 | lm loss: 2.078547E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.214 | TFLOPs: 18.65 | 31: iteration 55430/ 173500 | consumed samples: 14190080 | consumed tokens: 29061283840 | elapsed time per iteration (s): 0.88 | learning rate: 1.600E-04 | global batch size: 256 | lm loss: 2.047077E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.998 | TFLOPs: 17.60 | 31: iteration 55440/ 173500 | consumed samples: 14192640 | consumed tokens: 29066526720 | elapsed time per iteration (s): 0.84 | learning rate: 1.600E-04 | global batch size: 256 | lm loss: 2.069894E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.501 | TFLOPs: 18.48 | 31: iteration 55450/ 173500 | consumed samples: 14195200 | consumed tokens: 29071769600 | elapsed time per iteration (s): 0.81 | learning rate: 1.599E-04 | global batch size: 256 | lm loss: 2.066916E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.650 | TFLOPs: 19.04 | 31: iteration 55460/ 173500 | consumed samples: 14197760 | 
consumed tokens: 29077012480 | elapsed time per iteration (s): 0.85 | learning rate: 1.599E-04 | global batch size: 256 | lm loss: 2.074408E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.890 | TFLOPs: 18.14 | 31: iteration 55470/ 173500 | consumed samples: 14200320 | consumed tokens: 29082255360 | elapsed time per iteration (s): 0.81 | learning rate: 1.599E-04 | global batch size: 256 | lm loss: 2.051999E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.819 | TFLOPs: 19.23 | 31: iteration 55480/ 173500 | consumed samples: 14202880 | consumed tokens: 29087498240 | elapsed time per iteration (s): 0.81 | learning rate: 1.599E-04 | global batch size: 256 | lm loss: 2.064963E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.689 | TFLOPs: 19.16 | 31: iteration 55490/ 173500 | consumed samples: 14205440 | consumed tokens: 29092741120 | elapsed time per iteration (s): 0.81 | learning rate: 1.599E-04 | global batch size: 256 | lm loss: 2.047936E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.461 | TFLOPs: 19.21 | 31: iteration 55500/ 173500 | consumed samples: 14208000 | consumed tokens: 29097984000 | elapsed time per iteration (s): 0.82 | learning rate: 1.599E-04 | global batch size: 256 | lm loss: 2.009980E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.228 | TFLOPs: 18.83 | 31: iteration 55510/ 173500 | consumed samples: 14210560 | consumed tokens: 29103226880 | elapsed time per iteration (s): 0.82 | learning rate: 1.599E-04 | global batch size: 256 | lm loss: 2.050685E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 310.517 | TFLOPs: 18.79 | 31: iteration 55520/ 173500 | consumed samples: 14213120 | consumed tokens: 29108469760 | elapsed time per iteration (s): 0.82 | learning rate: 1.599E-04 | global batch size: 256 | lm loss: 2.071477E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.317 | TFLOPs: 18.77 | 31: iteration 55530/ 173500 | consumed samples: 14215680 | consumed tokens: 29113712640 | elapsed time per iteration (s): 0.83 | learning rate: 1.598E-04 | global batch size: 256 | lm loss: 2.041798E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.074 | TFLOPs: 18.76 | 31: iteration 55540/ 173500 | consumed samples: 14218240 | consumed tokens: 29118955520 | elapsed time per iteration (s): 0.83 | learning rate: 1.598E-04 | global batch size: 256 | lm loss: 2.052929E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.295 | TFLOPs: 18.65 | 31: iteration 55550/ 173500 | consumed samples: 14220800 | consumed tokens: 29124198400 | elapsed time per iteration (s): 0.79 | learning rate: 1.598E-04 | global batch size: 256 | lm loss: 2.035550E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.496 | TFLOPs: 19.63 | 31: iteration 55560/ 173500 | consumed samples: 14223360 | consumed tokens: 29129441280 | elapsed time per iteration (s): 0.80 | learning rate: 1.598E-04 | global batch size: 256 | lm loss: 2.060829E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.155 | TFLOPs: 19.25 | 31: iteration 55570/ 173500 | consumed samples: 14225920 | consumed tokens: 29134684160 | elapsed time per iteration (s): 0.81 | learning rate: 1.598E-04 | global batch size: 256 | lm loss: 
2.058718E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.382 | TFLOPs: 19.20 | 31: iteration 55580/ 173500 | consumed samples: 14228480 | consumed tokens: 29139927040 | elapsed time per iteration (s): 0.75 | learning rate: 1.598E-04 | global batch size: 256 | lm loss: 2.029378E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.100 | TFLOPs: 20.70 | 31: iteration 55590/ 173500 | consumed samples: 14231040 | consumed tokens: 29145169920 | elapsed time per iteration (s): 0.77 | learning rate: 1.598E-04 | global batch size: 256 | lm loss: 2.069427E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.230 | TFLOPs: 20.16 | 31: iteration 55600/ 173500 | consumed samples: 14233600 | consumed tokens: 29150412800 | elapsed time per iteration (s): 0.84 | learning rate: 1.597E-04 | global batch size: 256 | lm loss: 2.059267E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.418 | TFLOPs: 18.54 | 31: iteration 55610/ 173500 | consumed samples: 14236160 | consumed tokens: 29155655680 | elapsed time per iteration (s): 0.78 | learning rate: 1.597E-04 | global batch size: 256 | lm loss: 2.085813E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.236 | TFLOPs: 19.80 | 31: iteration 55620/ 173500 | consumed samples: 14238720 | consumed tokens: 29160898560 | elapsed time per iteration (s): 0.85 | learning rate: 1.597E-04 | global batch size: 256 | lm loss: 2.063763E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.230 | TFLOPs: 18.28 | 31: iteration 55630/ 173500 | consumed samples: 14241280 | consumed tokens: 
29166141440 | elapsed time per iteration (s): 0.85 | learning rate: 1.597E-04 | global batch size: 256 | lm loss: 2.051795E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.236 | TFLOPs: 18.22 | 31: iteration 55640/ 173500 | consumed samples: 14243840 | consumed tokens: 29171384320 | elapsed time per iteration (s): 0.83 | learning rate: 1.597E-04 | global batch size: 256 | lm loss: 2.028835E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.225 | TFLOPs: 18.65 | 31: iteration 55650/ 173500 | consumed samples: 14246400 | consumed tokens: 29176627200 | elapsed time per iteration (s): 0.76 | learning rate: 1.597E-04 | global batch size: 256 | lm loss: 2.087315E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.841 | TFLOPs: 20.50 | 31: iteration 55660/ 173500 | consumed samples: 14248960 | consumed tokens: 29181870080 | elapsed time per iteration (s): 0.78 | learning rate: 1.597E-04 | global batch size: 256 | lm loss: 2.051954E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.089 | TFLOPs: 19.97 | 31: iteration 55670/ 173500 | consumed samples: 14251520 | consumed tokens: 29187112960 | elapsed time per iteration (s): 0.78 | learning rate: 1.596E-04 | global batch size: 256 | lm loss: 2.057539E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.408 | TFLOPs: 19.93 | 31: iteration 55680/ 173500 | consumed samples: 14254080 | consumed tokens: 29192355840 | elapsed time per iteration (s): 0.79 | learning rate: 1.596E-04 | global batch size: 256 | lm loss: 2.042750E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 324.904 | TFLOPs: 19.66 | 31: iteration 55690/ 173500 | consumed samples: 14256640 | consumed tokens: 29197598720 | elapsed time per iteration (s): 0.77 | learning rate: 1.596E-04 | global batch size: 256 | lm loss: 2.046619E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.219 | TFLOPs: 20.16 | 31: iteration 55700/ 173500 | consumed samples: 14259200 | consumed tokens: 29202841600 | elapsed time per iteration (s): 0.75 | learning rate: 1.596E-04 | global batch size: 256 | lm loss: 2.046064E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.205 | TFLOPs: 20.70 | 31: iteration 55710/ 173500 | consumed samples: 14261760 | consumed tokens: 29208084480 | elapsed time per iteration (s): 0.78 | learning rate: 1.596E-04 | global batch size: 256 | lm loss: 2.034002E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.331 | TFLOPs: 19.74 | 31: iteration 55720/ 173500 | consumed samples: 14264320 | consumed tokens: 29213327360 | elapsed time per iteration (s): 0.71 | learning rate: 1.596E-04 | global batch size: 256 | lm loss: 2.038167E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 362.287 | TFLOPs: 21.92 | 31: iteration 55730/ 173500 | consumed samples: 14266880 | consumed tokens: 29218570240 | elapsed time per iteration (s): 0.73 | learning rate: 1.596E-04 | global batch size: 256 | lm loss: 2.024686E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.772 | TFLOPs: 21.16 | 31: iteration 55740/ 173500 | consumed samples: 14269440 | consumed tokens: 29223813120 | elapsed time per iteration (s): 0.77 | learning rate: 1.596E-04 | global batch size: 256 | lm loss: 2.071882E+00 | grad 
norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.795 | TFLOPs: 20.07 | 31: iteration 55750/ 173500 | consumed samples: 14272000 | consumed tokens: 29229056000 | elapsed time per iteration (s): 0.78 | learning rate: 1.595E-04 | global batch size: 256 | lm loss: 2.082718E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.040 | TFLOPs: 19.79 | 31: iteration 55760/ 173500 | consumed samples: 14274560 | consumed tokens: 29234298880 | elapsed time per iteration (s): 0.76 | learning rate: 1.595E-04 | global batch size: 256 | lm loss: 2.042887E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.887 | TFLOPs: 20.50 | 31: iteration 55770/ 173500 | consumed samples: 14277120 | consumed tokens: 29239541760 | elapsed time per iteration (s): 0.74 | learning rate: 1.595E-04 | global batch size: 256 | lm loss: 2.043286E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.160 | TFLOPs: 21.00 | 31: iteration 55780/ 173500 | consumed samples: 14279680 | consumed tokens: 29244784640 | elapsed time per iteration (s): 0.79 | learning rate: 1.595E-04 | global batch size: 256 | lm loss: 2.087946E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.611 | TFLOPs: 19.58 | 31: iteration 55790/ 173500 | consumed samples: 14282240 | consumed tokens: 29250027520 | elapsed time per iteration (s): 0.74 | learning rate: 1.595E-04 | global batch size: 256 | lm loss: 2.060731E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.011 | TFLOPs: 21.05 | 31: iteration 55800/ 173500 | consumed samples: 14284800 | consumed tokens: 29255270400 | elapsed time 
per iteration (s): 0.78 | learning rate: 1.595E-04 | global batch size: 256 | lm loss: 2.065095E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.734 | TFLOPs: 19.77 | 31: iteration 55810/ 173500 | consumed samples: 14287360 | consumed tokens: 29260513280 | elapsed time per iteration (s): 0.78 | learning rate: 1.595E-04 | global batch size: 256 | lm loss: 2.054079E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.212 | TFLOPs: 19.92 | 31: iteration 55820/ 173500 | consumed samples: 14289920 | consumed tokens: 29265756160 | elapsed time per iteration (s): 0.79 | learning rate: 1.594E-04 | global batch size: 256 | lm loss: 2.070030E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.584 | TFLOPs: 19.64 | 31: iteration 55830/ 173500 | consumed samples: 14292480 | consumed tokens: 29270999040 | elapsed time per iteration (s): 0.79 | learning rate: 1.594E-04 | global batch size: 256 | lm loss: 2.066480E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.049 | TFLOPs: 19.54 | 31: iteration 55840/ 173500 | consumed samples: 14295040 | consumed tokens: 29276241920 | elapsed time per iteration (s): 0.77 | learning rate: 1.594E-04 | global batch size: 256 | lm loss: 2.055642E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.282 | TFLOPs: 20.04 | 31: iteration 55850/ 173500 | consumed samples: 14297600 | consumed tokens: 29281484800 | elapsed time per iteration (s): 0.74 | learning rate: 1.594E-04 | global batch size: 256 | lm loss: 2.079272E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.068 | TFLOPs: 
20.88 | 31: iteration 55860/ 173500 | consumed samples: 14300160 | consumed tokens: 29286727680 | elapsed time per iteration (s): 0.79 | learning rate: 1.594E-04 | global batch size: 256 | lm loss: 2.042480E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.850 | TFLOPs: 19.53 | 31: iteration 55870/ 173500 | consumed samples: 14302720 | consumed tokens: 29291970560 | elapsed time per iteration (s): 0.80 | learning rate: 1.594E-04 | global batch size: 256 | lm loss: 2.041103E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.471 | TFLOPs: 19.45 | 31: iteration 55880/ 173500 | consumed samples: 14305280 | consumed tokens: 29297213440 | elapsed time per iteration (s): 0.77 | learning rate: 1.594E-04 | global batch size: 256 | lm loss: 2.057867E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.236 | TFLOPs: 20.04 | 31: iteration 55890/ 173500 | consumed samples: 14307840 | consumed tokens: 29302456320 | elapsed time per iteration (s): 0.85 | learning rate: 1.593E-04 | global batch size: 256 | lm loss: 2.055768E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.368 | TFLOPs: 18.23 | 31: iteration 55900/ 173500 | consumed samples: 14310400 | consumed tokens: 29307699200 | elapsed time per iteration (s): 0.72 | learning rate: 1.593E-04 | global batch size: 256 | lm loss: 2.086540E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.496 | TFLOPs: 21.45 | 31: iteration 55910/ 173500 | consumed samples: 14312960 | consumed tokens: 29312942080 | elapsed time per iteration (s): 0.75 | learning rate: 1.593E-04 | global batch size: 256 | lm loss: 2.027021E+00 | grad norm: 0.124 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.613 | TFLOPs: 20.73 | 31: iteration 55920/ 173500 | consumed samples: 14315520 | consumed tokens: 29318184960 | elapsed time per iteration (s): 0.80 | learning rate: 1.593E-04 | global batch size: 256 | lm loss: 2.035123E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.164 | TFLOPs: 19.31 | 31: iteration 55930/ 173500 | consumed samples: 14318080 | consumed tokens: 29323427840 | elapsed time per iteration (s): 0.73 | learning rate: 1.593E-04 | global batch size: 256 | lm loss: 2.057250E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.750 | TFLOPs: 21.10 | 31: iteration 55940/ 173500 | consumed samples: 14320640 | consumed tokens: 29328670720 | elapsed time per iteration (s): 0.78 | learning rate: 1.593E-04 | global batch size: 256 | lm loss: 2.035525E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.321 | TFLOPs: 19.98 | 31: iteration 55950/ 173500 | consumed samples: 14323200 | consumed tokens: 29333913600 | elapsed time per iteration (s): 0.77 | learning rate: 1.593E-04 | global batch size: 256 | lm loss: 2.029816E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.560 | TFLOPs: 20.06 | 31: iteration 55960/ 173500 | consumed samples: 14325760 | consumed tokens: 29339156480 | elapsed time per iteration (s): 0.78 | learning rate: 1.592E-04 | global batch size: 256 | lm loss: 2.057237E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.995 | TFLOPs: 19.84 | 31: iteration 55970/ 173500 | consumed samples: 14328320 | consumed tokens: 29344399360 | elapsed time per iteration (s): 0.75 | 
learning rate: 1.592E-04 | global batch size: 256 | lm loss: 2.056271E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.487 | TFLOPs: 20.66 | 31: iteration 55980/ 173500 | consumed samples: 14330880 | consumed tokens: 29349642240 | elapsed time per iteration (s): 0.73 | learning rate: 1.592E-04 | global batch size: 256 | lm loss: 2.073986E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.619 | TFLOPs: 21.21 | 31: iteration 55990/ 173500 | consumed samples: 14333440 | consumed tokens: 29354885120 | elapsed time per iteration (s): 0.77 | learning rate: 1.592E-04 | global batch size: 256 | lm loss: 2.049120E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.134 | TFLOPs: 20.09 | 0: [2022-11-26 06:40:06,608] [INFO] [logging.py:68:log_dist] [Rank 0] step=56000, skipped=0, lr=[0.0001591933009380588, 0.0001591933009380588, 0.0001591933009380588], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 56000/ 173500 | consumed samples: 14336000 | consumed tokens: 29360128000 | elapsed time per iteration (s): 0.75 | learning rate: 1.592E-04 | global batch size: 256 | lm loss: 2.035583E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.125 | TFLOPs: 20.64 | 0: steps: 56000 loss: 2.0126 iter time (s): 0.793 samples/sec: 322.919 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 56000 | lm loss value: 1.996800E+00 | lm loss PPL: 7.365446E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 56000 to checkpoints_1b1long 0: [2022-11-26 06:40:06,955] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint 
global_step56000 is begin to save! 0: [2022-11-26 06:40:06,967] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_01-model_00-model_states.pt... 0: [2022-11-26 06:40:07,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_01-model_00-model_states.pt. 0: [2022-11-26 06:40:07,198] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_03-model_00-model_states.pt... 0: [2022-11-26 06:40:07,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_03-model_00-model_states.pt. 0: [2022-11-26 06:40:07,280] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_04-model_00-model_states.pt... 0: [2022-11-26 06:40:07,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_04-model_00-model_states.pt. 0: [2022-11-26 06:40:07,357] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_05-model_00-model_states.pt... 0: [2022-11-26 06:40:07,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_05-model_00-model_states.pt. 0: [2022-11-26 06:40:07,437] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_06-model_00-model_states.pt... 0: [2022-11-26 06:40:07,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_06-model_00-model_states.pt. 0: [2022-11-26 06:40:07,517] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_07-model_00-model_states.pt... 0: [2022-11-26 06:40:07,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 06:40:07,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_08-model_00-model_states.pt... 0: [2022-11-26 06:40:07,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_08-model_00-model_states.pt. 0: [2022-11-26 06:40:07,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_09-model_00-model_states.pt... 0: [2022-11-26 06:40:07,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_09-model_00-model_states.pt. 0: [2022-11-26 06:40:07,746] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_10-model_00-model_states.pt... 0: [2022-11-26 06:40:07,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_10-model_00-model_states.pt. 0: [2022-11-26 06:40:07,822] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_11-model_00-model_states.pt... 0: [2022-11-26 06:40:07,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_11-model_00-model_states.pt. 0: [2022-11-26 06:40:07,898] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_12-model_00-model_states.pt... 0: [2022-11-26 06:40:07,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_12-model_00-model_states.pt. 0: [2022-11-26 06:40:07,974] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_13-model_00-model_states.pt... 0: [2022-11-26 06:40:08,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 06:40:08,049] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_14-model_00-model_states.pt... 0: [2022-11-26 06:40:08,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_14-model_00-model_states.pt. 0: [2022-11-26 06:40:08,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_15-model_00-model_states.pt... 0: [2022-11-26 06:40:08,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_15-model_00-model_states.pt. 0: [2022-11-26 06:40:08,199] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_16-model_00-model_states.pt... 0: [2022-11-26 06:40:08,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_16-model_00-model_states.pt. 0: [2022-11-26 06:40:08,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_17-model_00-model_states.pt... 0: [2022-11-26 06:40:08,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_17-model_00-model_states.pt. 0: [2022-11-26 06:40:08,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_18-model_00-model_states.pt... 0: [2022-11-26 06:40:08,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_18-model_00-model_states.pt. 0: [2022-11-26 06:40:08,437] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_19-model_00-model_states.pt... 0: [2022-11-26 06:40:08,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 06:40:08,513] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_20-model_00-model_states.pt... 0: [2022-11-26 06:40:08,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_20-model_00-model_states.pt. 0: [2022-11-26 06:40:08,587] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_21-model_00-model_states.pt... 0: [2022-11-26 06:40:08,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_21-model_00-model_states.pt. 0: [2022-11-26 06:40:08,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_22-model_00-model_states.pt... 0: [2022-11-26 06:40:08,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_22-model_00-model_states.pt. 0: [2022-11-26 06:40:08,736] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_23-model_00-model_states.pt... 0: [2022-11-26 06:40:08,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_23-model_00-model_states.pt. 0: [2022-11-26 06:40:08,813] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_24-model_00-model_states.pt... 0: [2022-11-26 06:40:08,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_24-model_00-model_states.pt. 0: [2022-11-26 06:40:08,888] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_25-model_00-model_states.pt... 0: [2022-11-26 06:40:08,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 06:40:08,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_26-model_00-model_states.pt... 0: [2022-11-26 06:40:09,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_26-model_00-model_states.pt. 0: [2022-11-26 06:40:09,036] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_27-model_00-model_states.pt... 0: [2022-11-26 06:40:09,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_27-model_00-model_states.pt. 0: [2022-11-26 06:40:09,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_28-model_00-model_states.pt... 0: [2022-11-26 06:40:09,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_28-model_00-model_states.pt. 0: [2022-11-26 06:40:09,184] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/layer_30-model_00-model_states.pt... 0: [2022-11-26 06:40:09,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/layer_30-model_00-model_states.pt. 0: [2022-11-26 06:40:09,187] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step56000/mp_rank_00_model_states.pt 0: [2022-11-26 06:40:09,187] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/mp_rank_00_model_states.pt... 0: [2022-11-26 06:40:09,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/mp_rank_00_model_states.pt. 0: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 
0: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
4: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
10: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 
16: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 
3: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 
25: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 
24: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 
22: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 
26: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 
5: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 
10: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
2: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
20: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
11: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 
29: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 
17: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 
26: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
5: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
3: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 
24: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 
19: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 
12: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
4: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 
30: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:40:09,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:40:09,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:40:09,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 06:40:09,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 17: [2022-11-26 06:40:09,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:40:09,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 06:40:09,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 0: [2022-11-26 06:40:09,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:40:09,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
0: [2022-11-26 06:40:09,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 3: [2022-11-26 06:40:09,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 0: [2022-11-26 06:40:09,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 3: [2022-11-26 06:40:09,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 4: [2022-11-26 06:40:09,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:40:09,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 06:40:09,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 24: [2022-11-26 06:40:09,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:40:09,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 06:40:09,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 1: [2022-11-26 06:40:09,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 06:40:09,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 06:40:09,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 20: [2022-11-26 06:40:09,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:40:09,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 06:40:09,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 25: [2022-11-26 06:40:09,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:40:09,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 06:40:09,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 9: [2022-11-26 06:40:09,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:40:09,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 06:40:09,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 22: [2022-11-26 06:40:09,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
22: [2022-11-26 06:40:09,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 23: [2022-11-26 06:40:09,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:40:09,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:40:09,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 23: [2022-11-26 06:40:09,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 06:40:09,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 06:40:09,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 23: [2022-11-26 06:40:09,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 2: [2022-11-26 06:40:09,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:40:09,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 06:40:09,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 7: [2022-11-26 06:40:09,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 06:40:09,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:40:09,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 06:40:09,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 06:40:09,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 7: [2022-11-26 06:40:09,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 27: [2022-11-26 06:40:09,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:40:09,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 06:40:09,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 4: [2022-11-26 06:40:09,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:40:09,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 06:40:09,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 20: [2022-11-26 06:40:09,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
20: [2022-11-26 06:40:09,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 3: [2022-11-26 06:40:09,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:40:09,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:40:09,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 3: [2022-11-26 06:40:09,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 12: [2022-11-26 06:40:09,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 3: [2022-11-26 06:40:09,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 12: [2022-11-26 06:40:09,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 15: [2022-11-26 06:40:09,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:40:09,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 06:40:09,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 15: [2022-11-26 06:40:09,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-26 06:40:09,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 06:40:09,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 8: [2022-11-26 06:40:09,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:40:09,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 06:40:09,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 29: [2022-11-26 06:40:09,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:40:09,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 06:40:09,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 6: [2022-11-26 06:40:09,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:40:09,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 06:40:09,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 14: [2022-11-26 06:40:09,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-26 06:40:09,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:40:09,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 06:40:09,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 06:40:09,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 14: [2022-11-26 06:40:09,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 12: [2022-11-26 06:40:09,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:40:09,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 06:40:09,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 24: [2022-11-26 06:40:09,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:40:09,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 16: [2022-11-26 06:40:09,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:40:09,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 
16: [2022-11-26 06:40:09,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 06:40:09,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 10: [2022-11-26 06:40:09,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:40:09,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 06:40:09,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 17: [2022-11-26 06:40:09,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:40:09,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:40:09,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 17: [2022-11-26 06:40:09,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 19: [2022-11-26 06:40:09,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 17: [2022-11-26 06:40:09,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 17: [2022-11-26 06:40:09,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 
17: [2022-11-26 06:40:09,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 06:40:09,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 22: [2022-11-26 06:40:09,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:40:09,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 06:40:09,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 10: [2022-11-26 06:40:09,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:40:09,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:40:09,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 16: [2022-11-26 06:40:09,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 10: [2022-11-26 06:40:09,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 16: [2022-11-26 06:40:09,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 9: [2022-11-26 06:40:09,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-26 06:40:09,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 06:40:09,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 27: [2022-11-26 06:40:09,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:40:09,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:40:09,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 27: [2022-11-26 06:40:09,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 4: [2022-11-26 06:40:09,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 27: [2022-11-26 06:40:09,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 1: [2022-11-26 06:40:09,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:40:09,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-26 06:40:09,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 06:40:09,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 06:40:09,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 1: [2022-11-26 06:40:09,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 23: [2022-11-26 06:40:09,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:40:09,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 20: [2022-11-26 06:40:09,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:40:09,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 20: [2022-11-26 06:40:09,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 06:40:09,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 25: [2022-11-26 06:40:09,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:40:09,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 
25: [2022-11-26 06:40:09,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 22: [2022-11-26 06:40:09,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:40:09,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 22: [2022-11-26 06:40:09,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 25: [2022-11-26 06:40:09,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 25: [2022-11-26 06:40:09,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 22: [2022-11-26 06:40:09,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 23: [2022-11-26 06:40:09,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:40:09,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 06:40:09,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 8: [2022-11-26 06:40:09,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 06:40:09,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 06:40:09,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 10: [2022-11-26 06:40:09,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:40:09,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 06:40:09,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 6: [2022-11-26 06:40:09,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:40:09,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 06:40:09,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 2: [2022-11-26 06:40:09,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:40:09,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 06:40:09,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 4: [2022-11-26 06:40:09,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 06:40:09,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 06:40:09,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 7: [2022-11-26 06:40:09,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:40:09,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 06:40:09,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 12: [2022-11-26 06:40:09,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:40:09,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 06:40:09,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 0: [2022-11-26 06:40:09,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:40:09,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
0: [2022-11-26 06:40:09,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 7: [2022-11-26 06:40:09,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 06:40:09,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 0: [2022-11-26 06:40:09,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 8: [2022-11-26 06:40:09,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:40:09,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 06:40:09,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 2: [2022-11-26 06:40:09,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:40:09,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 06:40:09,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 3: [2022-11-26 06:40:09,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 06:40:09,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 06:40:09,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:40:09,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 3: [2022-11-26 06:40:09,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 06:40:09,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 2: [2022-11-26 06:40:09,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:40:09,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 06:40:09,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 12: [2022-11-26 06:40:09,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:40:09,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 06:40:09,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 15: [2022-11-26 06:40:09,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 06:40:09,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 06:40:09,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 25: [2022-11-26 06:40:09,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:40:09,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:40:09,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 06:40:09,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 25: [2022-11-26 06:40:09,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 1: [2022-11-26 06:40:09,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:40:09,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 1: [2022-11-26 06:40:09,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 06:40:09,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 17: [2022-11-26 06:40:09,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 
17: [2022-11-26 06:40:09,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 06:40:09,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 11: [2022-11-26 06:40:09,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:40:09,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:40:09,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 15: [2022-11-26 06:40:09,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 11: [2022-11-26 06:40:09,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 19: [2022-11-26 06:40:09,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:40:09,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:40:09,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 19: [2022-11-26 06:40:09,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-26 06:40:09,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 06:40:09,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 06:40:09,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 06:40:09,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 20: [2022-11-26 06:40:09,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:40:09,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 19: [2022-11-26 06:40:09,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 20: [2022-11-26 06:40:09,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 06:40:09,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 24: [2022-11-26 06:40:09,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:40:09,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
22: [2022-11-26 06:40:09,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 06:40:09,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 13: [2022-11-26 06:40:09,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:40:09,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:40:09,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:40:09,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 13: [2022-11-26 06:40:09,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 06:40:09,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 24: [2022-11-26 06:40:09,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 13: [2022-11-26 06:40:09,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 24: [2022-11-26 06:40:09,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 13: [2022-11-26 06:40:09,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 
24: [2022-11-26 06:40:09,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 28: [2022-11-26 06:40:09,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 06:40:09,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 06:40:09,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 06:40:09,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 06:40:09,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 06:40:09,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 06:40:09,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 28: [2022-11-26 06:40:09,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 28: [2022-11-26 06:40:09,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 11: [2022-11-26 06:40:09,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-26 06:40:09,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 06:40:09,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 27: [2022-11-26 06:40:09,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:40:09,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 06:40:09,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 9: [2022-11-26 06:40:09,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:40:09,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:40:09,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 6: [2022-11-26 06:40:09,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 06:40:09,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 6: [2022-11-26 06:40:09,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 06:40:09,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 9: [2022-11-26 06:40:09,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 6: [2022-11-26 06:40:09,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 0: [2022-11-26 06:40:09,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:40:09,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:40:09,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 14: [2022-11-26 06:40:09,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 0: [2022-11-26 06:40:09,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 14: [2022-11-26 06:40:09,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 9: [2022-11-26 06:40:09,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:40:09,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
9: [2022-11-26 06:40:09,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 14: [2022-11-26 06:40:09,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 9: [2022-11-26 06:40:09,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 14: [2022-11-26 06:40:09,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 18: [2022-11-26 06:40:09,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:40:09,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 06:40:09,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 18: [2022-11-26 06:40:09,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:40:09,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 06:40:09,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 27: [2022-11-26 06:40:09,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 
27: [2022-11-26 06:40:09,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 06:40:09,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 28: [2022-11-26 06:40:09,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 06:40:09,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 06:40:09,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 11: [2022-11-26 06:40:09,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:40:09,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 06:40:09,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 16: [2022-11-26 06:40:09,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:40:09,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
16: [2022-11-26 06:40:09,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 06:40:09,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 06:40:09,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 16: [2022-11-26 06:40:09,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 18: [2022-11-26 06:40:09,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:40:09,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 06:40:09,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 18: [2022-11-26 06:40:09,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:40:09,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 06:40:09,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 29: [2022-11-26 06:40:09,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-26 06:40:09,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 06:40:09,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 29: [2022-11-26 06:40:09,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:40:09,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:40:09,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 06:40:09,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 29: [2022-11-26 06:40:09,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 06:40:09,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 21: [2022-11-26 06:40:09,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:40:09,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:40:09,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-26 06:40:09,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 06:40:09,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 06:40:09,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 06:40:09,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 21: [2022-11-26 06:40:09,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 21: [2022-11-26 06:40:09,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 8: [2022-11-26 06:40:09,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:40:09,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 06:40:09,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 5: [2022-11-26 06:40:09,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:40:09,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 06:40:09,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 
31: [2022-11-26 06:40:09,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:40:09,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:40:09,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:40:09,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:40:09,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 06:40:09,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 06:40:09,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 06:40:09,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 06:40:09,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 31: [2022-11-26 06:40:09,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 31: [2022-11-26 06:40:09,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 31: [2022-11-26 06:40:09,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 
5: [2022-11-26 06:40:09,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:40:09,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:40:09,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 06:40:09,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 06:40:09,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 5: [2022-11-26 06:40:09,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 0: [2022-11-26 06:40:09,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:40:09,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 06:40:09,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 13: [2022-11-26 06:40:09,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:40:09,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 06:40:09,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 
30: [2022-11-26 06:40:09,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:40:09,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:40:09,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:40:09,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:40:09,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 06:40:09,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 06:40:09,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 06:40:09,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 06:40:09,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 30: [2022-11-26 06:40:09,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 30: [2022-11-26 06:40:09,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 30: [2022-11-26 06:40:09,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 
6: [2022-11-26 06:40:09,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:40:09,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 06:40:09,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 25: [2022-11-26 06:40:09,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:40:09,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 06:40:09,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 26: [2022-11-26 06:40:09,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:40:09,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:40:09,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:40:09,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 06:40:09,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
26: [2022-11-26 06:40:09,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:40:09,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 06:40:09,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 26: [2022-11-26 06:40:09,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 06:40:09,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 06:40:09,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 06:40:09,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 26: [2022-11-26 06:40:09,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 26: [2022-11-26 06:40:09,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 26: [2022-11-26 06:40:09,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 17: [2022-11-26 06:40:09,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 
17: [2022-11-26 06:40:09,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 06:40:09,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 19: [2022-11-26 06:40:09,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:40:09,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 06:40:09,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 16: [2022-11-26 06:40:09,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:40:09,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 06:40:09,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 3: [2022-11-26 06:40:09,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:40:09,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 06:40:09,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 22: [2022-11-26 06:40:09,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-26 06:40:09,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 06:40:09,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 4: [2022-11-26 06:40:09,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:40:09,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 06:40:09,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 9: [2022-11-26 06:40:09,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:40:09,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 06:40:09,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 7: [2022-11-26 06:40:09,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:40:09,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 06:40:09,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 23: [2022-11-26 06:40:09,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
23: [2022-11-26 06:40:09,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 06:40:09,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 2: [2022-11-26 06:40:09,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:40:09,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 06:40:09,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 15: [2022-11-26 06:40:09,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:40:09,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 06:40:09,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 1: [2022-11-26 06:40:09,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:40:09,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 06:40:09,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 18: [2022-11-26 06:40:09,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
18: [2022-11-26 06:40:09,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 06:40:09,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 30: [2022-11-26 06:40:09,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:40:09,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 06:40:09,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 8: [2022-11-26 06:40:09,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:40:09,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 06:40:09,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 0: [2022-11-26 06:40:09,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:40:09,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 06:40:09,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 14: [2022-11-26 06:40:09,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 06:40:09,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 31: [2022-11-26 06:40:09,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:40:09,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 31: [2022-11-26 06:40:09,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 06:40:09,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 29: [2022-11-26 06:40:09,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:40:09,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 06:40:09,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 12: [2022-11-26 06:40:09,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:40:09,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 06:40:09,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 13: [2022-11-26 06:40:09,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 06:40:09,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 06:40:09,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 20: [2022-11-26 06:40:09,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:40:09,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 06:40:09,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 24: [2022-11-26 06:40:09,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:40:09,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 06:40:09,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 10: [2022-11-26 06:40:09,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:40:09,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 11: [2022-11-26 06:40:09,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:40:09,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 
11: [2022-11-26 06:40:09,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 06:40:09,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 5: [2022-11-26 06:40:09,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:40:09,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 06:40:09,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 21: [2022-11-26 06:40:09,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:40:09,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 06:40:09,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 6: [2022-11-26 06:40:09,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:40:09,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 06:40:09,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 27: [2022-11-26 06:40:09,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
27: [2022-11-26 06:40:09,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 06:40:09,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 25: [2022-11-26 06:40:09,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:40:09,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 06:40:09,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 17: [2022-11-26 06:40:09,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:40:09,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 06:40:09,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 16: [2022-11-26 06:40:09,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:40:09,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 06:40:09,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 19: [2022-11-26 06:40:09,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-26 06:40:09,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 06:40:09,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 23: [2022-11-26 06:40:09,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:40:09,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 06:40:09,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 4: [2022-11-26 06:40:09,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:40:09,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 06:40:09,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 3: [2022-11-26 06:40:09,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:40:09,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 22: [2022-11-26 06:40:09,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:40:09,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 
22: [2022-11-26 06:40:09,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 06:40:09,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 26: [2022-11-26 06:40:09,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:40:09,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 06:40:09,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 15: [2022-11-26 06:40:09,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:40:09,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 06:40:09,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 7: [2022-11-26 06:40:09,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:40:09,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 06:40:09,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 20: [2022-11-26 06:40:09,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
20: [2022-11-26 06:40:09,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 06:40:09,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 2: [2022-11-26 06:40:09,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:40:09,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 06:40:09,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 18: [2022-11-26 06:40:09,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:40:09,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 06:40:09,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 0: [2022-11-26 06:40:09,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:40:09,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:40:09,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 06:40:09,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 
28: [2022-11-26 06:40:09,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 06:40:09,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 30: [2022-11-26 06:40:09,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 28: [2022-11-26 06:40:09,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 30: [2022-11-26 06:40:09,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 06:40:09,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 28: [2022-11-26 06:40:09,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 06:40:09,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 06:40:09,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 29: [2022-11-26 06:40:09,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:40:09,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 06:40:09,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 
14: [2022-11-26 06:40:09,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:40:09,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 06:40:09,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 31: [2022-11-26 06:40:09,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:40:09,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 06:40:09,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 21: [2022-11-26 06:40:09,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:40:09,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 06:40:09,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 10: [2022-11-26 06:40:09,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:40:09,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 06:40:09,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 
8: [2022-11-26 06:40:09,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:40:09,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 06:40:09,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 11: [2022-11-26 06:40:09,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:40:09,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:40:09,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 24: [2022-11-26 06:40:09,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:40:09,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 11: [2022-11-26 06:40:09,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 5: [2022-11-26 06:40:09,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 12: [2022-11-26 06:40:09,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 06:40:09,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 06:40:09,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 24: [2022-11-26 06:40:09,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 06:40:09,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 13: [2022-11-26 06:40:09,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:40:09,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 06:40:09,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 26: [2022-11-26 06:40:09,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:40:09,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 06:40:09,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 6: [2022-11-26 06:40:09,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 06:40:09,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 06:40:09,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 27: [2022-11-26 06:40:09,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:40:09,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 06:40:09,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 25: [2022-11-26 06:40:09,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:40:09,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 06:40:09,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 17: [2022-11-26 06:40:09,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:40:09,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 06:40:09,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 23: [2022-11-26 06:40:09,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-26 06:40:09,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 06:40:09,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 0: [2022-11-26 06:40:09,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 06:40:09,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 19: [2022-11-26 06:40:09,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:40:09,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 06:40:09,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 15: [2022-11-26 06:40:09,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:40:09,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 06:40:09,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 9: [2022-11-26 06:40:09,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 06:40:09,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 06:40:09,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 22: [2022-11-26 06:40:09,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:40:09,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 06:40:09,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 4: [2022-11-26 06:40:09,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:40:09,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 06:40:09,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 3: [2022-11-26 06:40:09,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:40:09,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 06:40:09,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 16: [2022-11-26 06:40:09,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
16: [2022-11-26 06:40:09,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 06:40:09,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 7: [2022-11-26 06:40:09,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:40:09,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 06:40:09,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 30: [2022-11-26 06:40:09,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:40:09,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 06:40:09,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 2: [2022-11-26 06:40:09,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:40:09,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 06:40:09,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 20: [2022-11-26 06:40:09,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-26 06:40:09,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 06:40:09,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 18: [2022-11-26 06:40:09,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:40:09,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 06:40:09,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 1: [2022-11-26 06:40:09,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:40:09,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 06:40:09,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 31: [2022-11-26 06:40:09,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:40:09,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 06:40:09,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 29: [2022-11-26 06:40:09,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
29: [2022-11-26 06:40:09,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 06:40:09,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 14: [2022-11-26 06:40:09,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:40:09,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 06:40:09,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 28: [2022-11-26 06:40:09,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 06:40:09,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 06:40:09,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 12: [2022-11-26 06:40:09,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:40:09,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 06:40:09,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 0: [2022-11-26 06:40:09,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-26 06:40:09,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 06:40:09,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 24: [2022-11-26 06:40:09,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:40:09,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 06:40:09,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 10: [2022-11-26 06:40:09,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:40:09,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:40:09,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 06:40:09,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 06:40:09,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 10: [2022-11-26 06:40:09,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 6: [2022-11-26 06:40:09,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
11: [2022-11-26 06:40:09,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:40:09,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 11: [2022-11-26 06:40:09,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 6: [2022-11-26 06:40:09,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 11: [2022-11-26 06:40:09,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 19: [2022-11-26 06:40:09,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:40:09,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 06:40:09,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 25: [2022-11-26 06:40:09,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:40:09,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 26: [2022-11-26 06:40:09,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:40:09,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 
26: [2022-11-26 06:40:09,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 06:40:09,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 14: [2022-11-26 06:40:09,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:40:09,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:40:09,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 18: [2022-11-26 06:40:09,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 14: [2022-11-26 06:40:09,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 18: [2022-11-26 06:40:09,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 9: [2022-11-26 06:40:09,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:40:09,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 06:40:09,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 3: [2022-11-26 06:40:09,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 06:40:09,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 06:40:09,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 12: [2022-11-26 06:40:09,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:40:09,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 06:40:09,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 7: [2022-11-26 06:40:09,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:40:09,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 23: [2022-11-26 06:40:09,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:40:09,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 20: [2022-11-26 06:40:09,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 
23: [2022-11-26 06:40:09,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 20: [2022-11-26 06:40:09,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 23: [2022-11-26 06:40:09,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 20: [2022-11-26 06:40:09,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 16: [2022-11-26 06:40:09,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:40:09,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 06:40:09,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 15: [2022-11-26 06:40:09,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:40:09,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:40:09,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 31: [2022-11-26 06:40:09,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 06:40:09,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 
8: [2022-11-26 06:40:09,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:40:09,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 22: [2022-11-26 06:40:09,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:40:09,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 22: [2022-11-26 06:40:09,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 06:40:09,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 8: [2022-11-26 06:40:09,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 1: [2022-11-26 06:40:09,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:40:09,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 17: [2022-11-26 06:40:09,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:40:09,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 
17: [2022-11-26 06:40:09,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 06:40:09,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 29: [2022-11-26 06:40:09,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:40:09,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 06:40:09,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 5: [2022-11-26 06:40:09,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:40:09,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 4: [2022-11-26 06:40:09,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:40:09,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 4: [2022-11-26 06:40:09,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 06:40:09,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 9: [2022-11-26 06:40:09,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-26 06:40:09,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 06:40:09,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 24: [2022-11-26 06:40:09,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:40:09,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 06:40:09,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 2: [2022-11-26 06:40:09,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:40:09,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 06:40:09,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 8: [2022-11-26 06:40:09,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:40:09,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 06:40:09,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 27: [2022-11-26 06:40:09,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
27: [2022-11-26 06:40:09,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 06:40:09,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 30: [2022-11-26 06:40:09,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:40:09,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 06:40:09,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 13: [2022-11-26 06:40:09,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:40:09,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 06:40:09,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 5: [2022-11-26 06:40:09,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:40:09,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 06:40:09,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 21: [2022-11-26 06:40:09,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-26 06:40:09,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 06:40:09,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 28: [2022-11-26 06:40:09,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 06:40:09,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 06:40:09,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 13: [2022-11-26 06:40:09,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:40:09,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 06:40:09,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:40:09,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 13: [2022-11-26 06:40:09,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 06:40:09,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 11: [2022-11-26 06:40:09,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 06:40:09,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 06:40:09,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 27: [2022-11-26 06:40:09,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:40:09,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 06:40:09,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 21: [2022-11-26 06:40:09,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:40:09,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 06:40:09,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 11: [2022-11-26 06:40:09,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:40:09,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 06:40:09,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 5: [2022-11-26 06:40:09,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 06:40:09,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 06:40:09,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 21: [2022-11-26 06:40:09,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:40:09,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step56000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 06:40:09,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 0: successfully saved checkpoint at iteration 56000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2563.19 31: iteration 56010/ 173500 | consumed samples: 14338560 | consumed tokens: 29365370880 | elapsed time per iteration (s): 1.08 | learning rate: 1.592E-04 | global batch size: 256 | lm loss: 2.081280E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.137 | TFLOPs: 14.29 | 31: iteration 56020/ 173500 | consumed samples: 14341120 | consumed tokens: 29370613760 | elapsed time per iteration (s): 0.86 | learning rate: 1.592E-04 | global batch size: 256 | lm loss: 2.032232E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.844 | TFLOPs: 17.96 | 31: iteration 56030/ 173500 | consumed samples: 14343680 | consumed tokens: 29375856640 | elapsed time per iteration (s): 0.82 | learning rate: 1.592E-04 | global batch size: 256 | lm loss: 2.095408E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.343 | TFLOPs: 18.77 | 31: iteration 56040/ 173500 
| consumed samples: 14346240 | consumed tokens: 29381099520 | elapsed time per iteration (s): 0.87 | learning rate: 1.591E-04 | global batch size: 256 | lm loss: 2.084950E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.775 | TFLOPs: 17.89 | 31: iteration 56050/ 173500 | consumed samples: 14348800 | consumed tokens: 29386342400 | elapsed time per iteration (s): 0.76 | learning rate: 1.591E-04 | global batch size: 256 | lm loss: 2.010634E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.998 | TFLOPs: 20.45 | 31: iteration 56060/ 173500 | consumed samples: 14351360 | consumed tokens: 29391585280 | elapsed time per iteration (s): 0.76 | learning rate: 1.591E-04 | global batch size: 256 | lm loss: 2.017463E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.758 | TFLOPs: 20.49 | 31: iteration 56070/ 173500 | consumed samples: 14353920 | consumed tokens: 29396828160 | elapsed time per iteration (s): 0.79 | learning rate: 1.591E-04 | global batch size: 256 | lm loss: 2.053193E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.451 | TFLOPs: 19.63 | 31: iteration 56080/ 173500 | consumed samples: 14356480 | consumed tokens: 29402071040 | elapsed time per iteration (s): 0.81 | learning rate: 1.591E-04 | global batch size: 256 | lm loss: 2.069782E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.671 | TFLOPs: 19.16 | 31: iteration 56090/ 173500 | consumed samples: 14359040 | consumed tokens: 29407313920 | elapsed time per iteration (s): 0.75 | learning rate: 1.591E-04 | global batch size: 256 | lm loss: 2.041488E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 
| number of nan iterations: 0 | samples per second: 342.416 | TFLOPs: 20.72 | 31: iteration 56100/ 173500 | consumed samples: 14361600 | consumed tokens: 29412556800 | elapsed time per iteration (s): 0.76 | learning rate: 1.591E-04 | global batch size: 256 | lm loss: 2.072645E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.522 | TFLOPs: 20.36 | 31: iteration 56110/ 173500 | consumed samples: 14364160 | consumed tokens: 29417799680 | elapsed time per iteration (s): 0.79 | learning rate: 1.590E-04 | global batch size: 256 | lm loss: 2.058435E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.509 | TFLOPs: 19.51 | 31: iteration 56120/ 173500 | consumed samples: 14366720 | consumed tokens: 29423042560 | elapsed time per iteration (s): 0.76 | learning rate: 1.590E-04 | global batch size: 256 | lm loss: 2.055110E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.678 | TFLOPs: 20.49 | 31: iteration 56130/ 173500 | consumed samples: 14369280 | consumed tokens: 29428285440 | elapsed time per iteration (s): 0.76 | learning rate: 1.590E-04 | global batch size: 256 | lm loss: 2.054662E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.609 | TFLOPs: 20.36 | 31: iteration 56140/ 173500 | consumed samples: 14371840 | consumed tokens: 29433528320 | elapsed time per iteration (s): 0.79 | learning rate: 1.590E-04 | global batch size: 256 | lm loss: 2.064997E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.545 | TFLOPs: 19.57 | 31: iteration 56150/ 173500 | consumed samples: 14374400 | consumed tokens: 29438771200 | elapsed time per iteration (s): 0.78 | learning rate: 1.590E-04 | global 
batch size: 256 | lm loss: 2.064450E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.342 | TFLOPs: 19.86 | 31: iteration 56160/ 173500 | consumed samples: 14376960 | consumed tokens: 29444014080 | elapsed time per iteration (s): 0.78 | learning rate: 1.590E-04 | global batch size: 256 | lm loss: 2.065190E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.619 | TFLOPs: 19.94 | 31: iteration 56170/ 173500 | consumed samples: 14379520 | consumed tokens: 29449256960 | elapsed time per iteration (s): 0.77 | learning rate: 1.590E-04 | global batch size: 256 | lm loss: 2.067945E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.134 | TFLOPs: 20.21 | 31: iteration 56180/ 173500 | consumed samples: 14382080 | consumed tokens: 29454499840 | elapsed time per iteration (s): 0.78 | learning rate: 1.589E-04 | global batch size: 256 | lm loss: 2.039090E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.301 | TFLOPs: 19.86 | 31: iteration 56190/ 173500 | consumed samples: 14384640 | consumed tokens: 29459742720 | elapsed time per iteration (s): 0.77 | learning rate: 1.589E-04 | global batch size: 256 | lm loss: 2.017683E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.349 | TFLOPs: 20.23 | 31: iteration 56200/ 173500 | consumed samples: 14387200 | consumed tokens: 29464985600 | elapsed time per iteration (s): 0.78 | learning rate: 1.589E-04 | global batch size: 256 | lm loss: 2.043520E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.057 | TFLOPs: 19.79 | 31: iteration 56210/ 173500 | consumed samples: 14389760 
| consumed tokens: 29470228480 | elapsed time per iteration (s): 0.82 | learning rate: 1.589E-04 | global batch size: 256 | lm loss: 2.037864E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.850 | TFLOPs: 18.87 | 31: iteration 56220/ 173500 | consumed samples: 14392320 | consumed tokens: 29475471360 | elapsed time per iteration (s): 0.77 | learning rate: 1.589E-04 | global batch size: 256 | lm loss: 2.041510E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.045 | TFLOPs: 20.21 | 31: iteration 56230/ 173500 | consumed samples: 14394880 | consumed tokens: 29480714240 | elapsed time per iteration (s): 0.77 | learning rate: 1.589E-04 | global batch size: 256 | lm loss: 2.077324E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.150 | TFLOPs: 20.03 | 31: iteration 56240/ 173500 | consumed samples: 14397440 | consumed tokens: 29485957120 | elapsed time per iteration (s): 0.85 | learning rate: 1.589E-04 | global batch size: 256 | lm loss: 2.039589E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.949 | TFLOPs: 18.15 | 31: iteration 56250/ 173500 | consumed samples: 14400000 | consumed tokens: 29491200000 | elapsed time per iteration (s): 0.79 | learning rate: 1.588E-04 | global batch size: 256 | lm loss: 2.050892E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.098 | TFLOPs: 19.55 | 31: iteration 56260/ 173500 | consumed samples: 14402560 | consumed tokens: 29496442880 | elapsed time per iteration (s): 0.81 | learning rate: 1.588E-04 | global batch size: 256 | lm loss: 2.049075E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 
0 | samples per second: 315.167 | TFLOPs: 19.07 | 31: iteration 56270/ 173500 | consumed samples: 14405120 | consumed tokens: 29501685760 | elapsed time per iteration (s): 0.84 | learning rate: 1.588E-04 | global batch size: 256 | lm loss: 2.036328E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.805 | TFLOPs: 18.38 | 31: iteration 56280/ 173500 | consumed samples: 14407680 | consumed tokens: 29506928640 | elapsed time per iteration (s): 0.78 | learning rate: 1.588E-04 | global batch size: 256 | lm loss: 2.066607E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.809 | TFLOPs: 19.83 | 31: iteration 56290/ 173500 | consumed samples: 14410240 | consumed tokens: 29512171520 | elapsed time per iteration (s): 0.82 | learning rate: 1.588E-04 | global batch size: 256 | lm loss: 2.072826E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.550 | TFLOPs: 18.79 | 31: iteration 56300/ 173500 | consumed samples: 14412800 | consumed tokens: 29517414400 | elapsed time per iteration (s): 0.80 | learning rate: 1.588E-04 | global batch size: 256 | lm loss: 2.046643E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.235 | TFLOPs: 19.25 | 31: iteration 56310/ 173500 | consumed samples: 14415360 | consumed tokens: 29522657280 | elapsed time per iteration (s): 0.84 | learning rate: 1.588E-04 | global batch size: 256 | lm loss: 2.060938E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.942 | TFLOPs: 18.45 | 31: iteration 56320/ 173500 | consumed samples: 14417920 | consumed tokens: 29527900160 | elapsed time per iteration (s): 0.81 | learning rate: 1.588E-04 | global batch size: 256 | lm loss: 
2.051507E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.209 | TFLOPs: 19.07 | 31: iteration 56330/ 173500 | consumed samples: 14420480 | consumed tokens: 29533143040 | elapsed time per iteration (s): 0.82 | learning rate: 1.587E-04 | global batch size: 256 | lm loss: 2.019965E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.188 | TFLOPs: 18.89 | 31: iteration 56340/ 173500 | consumed samples: 14423040 | consumed tokens: 29538385920 | elapsed time per iteration (s): 0.86 | learning rate: 1.587E-04 | global batch size: 256 | lm loss: 2.066797E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.342 | TFLOPs: 17.99 | 31: iteration 56350/ 173500 | consumed samples: 14425600 | consumed tokens: 29543628800 | elapsed time per iteration (s): 0.86 | learning rate: 1.587E-04 | global batch size: 256 | lm loss: 2.063447E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.970 | TFLOPs: 18.09 | 31: iteration 56360/ 173500 | consumed samples: 14428160 | consumed tokens: 29548871680 | elapsed time per iteration (s): 0.84 | learning rate: 1.587E-04 | global batch size: 256 | lm loss: 2.065303E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.990 | TFLOPs: 18.45 | 31: iteration 56370/ 173500 | consumed samples: 14430720 | consumed tokens: 29554114560 | elapsed time per iteration (s): 0.82 | learning rate: 1.587E-04 | global batch size: 256 | lm loss: 2.069938E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.115 | TFLOPs: 18.88 | 31: iteration 56380/ 173500 | consumed samples: 14433280 | consumed tokens: 
29559357440 | elapsed time per iteration (s): 0.81 | learning rate: 1.587E-04 | global batch size: 256 | lm loss: 2.030682E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.181 | TFLOPs: 19.01 | 31: iteration 56390/ 173500 | consumed samples: 14435840 | consumed tokens: 29564600320 | elapsed time per iteration (s): 0.81 | learning rate: 1.587E-04 | global batch size: 256 | lm loss: 2.058053E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.751 | TFLOPs: 19.10 | 31: iteration 56400/ 173500 | consumed samples: 14438400 | consumed tokens: 29569843200 | elapsed time per iteration (s): 0.83 | learning rate: 1.586E-04 | global batch size: 256 | lm loss: 2.041451E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.150 | TFLOPs: 18.70 | 31: iteration 56410/ 173500 | consumed samples: 14440960 | consumed tokens: 29575086080 | elapsed time per iteration (s): 0.87 | learning rate: 1.586E-04 | global batch size: 256 | lm loss: 2.060614E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.254 | TFLOPs: 17.80 | 31: iteration 56420/ 173500 | consumed samples: 14443520 | consumed tokens: 29580328960 | elapsed time per iteration (s): 0.83 | learning rate: 1.586E-04 | global batch size: 256 | lm loss: 2.061443E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.714 | TFLOPs: 18.56 | 31: iteration 56430/ 173500 | consumed samples: 14446080 | consumed tokens: 29585571840 | elapsed time per iteration (s): 0.82 | learning rate: 1.586E-04 | global batch size: 256 | lm loss: 2.051528E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 312.169 | TFLOPs: 18.89 | 31: iteration 56440/ 173500 | consumed samples: 14448640 | consumed tokens: 29590814720 | elapsed time per iteration (s): 0.81 | learning rate: 1.586E-04 | global batch size: 256 | lm loss: 2.038921E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.489 | TFLOPs: 19.09 | 31: iteration 56450/ 173500 | consumed samples: 14451200 | consumed tokens: 29596057600 | elapsed time per iteration (s): 0.81 | learning rate: 1.586E-04 | global batch size: 256 | lm loss: 2.027553E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.657 | TFLOPs: 19.04 | 31: iteration 56460/ 173500 | consumed samples: 14453760 | consumed tokens: 29601300480 | elapsed time per iteration (s): 0.81 | learning rate: 1.586E-04 | global batch size: 256 | lm loss: 2.034917E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.858 | TFLOPs: 19.05 | 31: iteration 56470/ 173500 | consumed samples: 14456320 | consumed tokens: 29606543360 | elapsed time per iteration (s): 0.85 | learning rate: 1.585E-04 | global batch size: 256 | lm loss: 2.053768E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.552 | TFLOPs: 18.24 | 31: iteration 56480/ 173500 | consumed samples: 14458880 | consumed tokens: 29611786240 | elapsed time per iteration (s): 0.80 | learning rate: 1.585E-04 | global batch size: 256 | lm loss: 2.037543E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.065 | TFLOPs: 19.30 | 31: iteration 56490/ 173500 | consumed samples: 14461440 | consumed tokens: 29617029120 | elapsed time per iteration (s): 0.87 | learning rate: 1.585E-04 | global batch size: 256 | lm loss: 2.043278E+00 | grad 
norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.119 | TFLOPs: 17.85 | 31: iteration 56500/ 173500 | consumed samples: 14464000 | consumed tokens: 29622272000 | elapsed time per iteration (s): 0.79 | learning rate: 1.585E-04 | global batch size: 256 | lm loss: 2.053994E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.081 | TFLOPs: 19.49 | 31: iteration 56510/ 173500 | consumed samples: 14466560 | consumed tokens: 29627514880 | elapsed time per iteration (s): 0.83 | learning rate: 1.585E-04 | global batch size: 256 | lm loss: 2.054682E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.330 | TFLOPs: 18.59 | 31: iteration 56520/ 173500 | consumed samples: 14469120 | consumed tokens: 29632757760 | elapsed time per iteration (s): 0.85 | learning rate: 1.585E-04 | global batch size: 256 | lm loss: 2.056701E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.049 | TFLOPs: 18.21 | 31: iteration 56530/ 173500 | consumed samples: 14471680 | consumed tokens: 29638000640 | elapsed time per iteration (s): 0.83 | learning rate: 1.585E-04 | global batch size: 256 | lm loss: 2.039124E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.539 | TFLOPs: 18.67 | 31: iteration 56540/ 173500 | consumed samples: 14474240 | consumed tokens: 29643243520 | elapsed time per iteration (s): 0.83 | learning rate: 1.584E-04 | global batch size: 256 | lm loss: 2.064934E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.773 | TFLOPs: 18.62 | 31: iteration 56550/ 173500 | consumed samples: 14476800 | consumed tokens: 29648486400 | elapsed time 
per iteration (s): 0.84 | learning rate: 1.584E-04 | global batch size: 256 | lm loss: 2.051817E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.981 | TFLOPs: 18.51 | 31: iteration 56560/ 173500 | consumed samples: 14479360 | consumed tokens: 29653729280 | elapsed time per iteration (s): 0.80 | learning rate: 1.584E-04 | global batch size: 256 | lm loss: 2.052420E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.788 | TFLOPs: 19.35 | 31: iteration 56570/ 173500 | consumed samples: 14481920 | consumed tokens: 29658972160 | elapsed time per iteration (s): 0.84 | learning rate: 1.584E-04 | global batch size: 256 | lm loss: 2.062683E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.520 | TFLOPs: 18.48 | 31: iteration 56580/ 173500 | consumed samples: 14484480 | consumed tokens: 29664215040 | elapsed time per iteration (s): 0.80 | learning rate: 1.584E-04 | global batch size: 256 | lm loss: 2.058041E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.260 | TFLOPs: 19.25 | 31: iteration 56590/ 173500 | consumed samples: 14487040 | consumed tokens: 29669457920 | elapsed time per iteration (s): 0.87 | learning rate: 1.584E-04 | global batch size: 256 | lm loss: 2.077574E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.885 | TFLOPs: 17.84 | 31: iteration 56600/ 173500 | consumed samples: 14489600 | consumed tokens: 29674700800 | elapsed time per iteration (s): 0.78 | learning rate: 1.584E-04 | global batch size: 256 | lm loss: 2.041576E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.858 | TFLOPs: 
19.90 | 31: iteration 56610/ 173500 | consumed samples: 14492160 | consumed tokens: 29679943680 | elapsed time per iteration (s): 0.82 | learning rate: 1.583E-04 | global batch size: 256 | lm loss: 2.028316E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.035 | TFLOPs: 18.88 | 31: iteration 56620/ 173500 | consumed samples: 14494720 | consumed tokens: 29685186560 | elapsed time per iteration (s): 0.84 | learning rate: 1.583E-04 | global batch size: 256 | lm loss: 2.063816E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.040 | TFLOPs: 18.45 | 31: iteration 56630/ 173500 | consumed samples: 14497280 | consumed tokens: 29690429440 | elapsed time per iteration (s): 0.84 | learning rate: 1.583E-04 | global batch size: 256 | lm loss: 2.050702E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.896 | TFLOPs: 18.45 | 31: iteration 56640/ 173500 | consumed samples: 14499840 | consumed tokens: 29695672320 | elapsed time per iteration (s): 0.82 | learning rate: 1.583E-04 | global batch size: 256 | lm loss: 2.045175E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.106 | TFLOPs: 19.00 | 31: iteration 56650/ 173500 | consumed samples: 14502400 | consumed tokens: 29700915200 | elapsed time per iteration (s): 0.80 | learning rate: 1.583E-04 | global batch size: 256 | lm loss: 2.049586E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.868 | TFLOPs: 19.41 | 31: iteration 56660/ 173500 | consumed samples: 14504960 | consumed tokens: 29706158080 | elapsed time per iteration (s): 0.79 | learning rate: 1.583E-04 | global batch size: 256 | lm loss: 2.078602E+00 | grad norm: 0.146 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.966 | TFLOPs: 19.66 | 31: iteration 56670/ 173500 | consumed samples: 14507520 | consumed tokens: 29711400960 | elapsed time per iteration (s): 0.86 | learning rate: 1.583E-04 | global batch size: 256 | lm loss: 2.057130E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.610 | TFLOPs: 18.07 | 31: iteration 56680/ 173500 | consumed samples: 14510080 | consumed tokens: 29716643840 | elapsed time per iteration (s): 0.83 | learning rate: 1.583E-04 | global batch size: 256 | lm loss: 2.041068E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.730 | TFLOPs: 18.68 | 31: iteration 56690/ 173500 | consumed samples: 14512640 | consumed tokens: 29721886720 | elapsed time per iteration (s): 0.78 | learning rate: 1.582E-04 | global batch size: 256 | lm loss: 2.054653E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.968 | TFLOPs: 19.96 | 31: iteration 56700/ 173500 | consumed samples: 14515200 | consumed tokens: 29727129600 | elapsed time per iteration (s): 0.81 | learning rate: 1.582E-04 | global batch size: 256 | lm loss: 2.056590E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.414 | TFLOPs: 19.08 | 31: iteration 56710/ 173500 | consumed samples: 14517760 | consumed tokens: 29732372480 | elapsed time per iteration (s): 0.80 | learning rate: 1.582E-04 | global batch size: 256 | lm loss: 2.034788E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.014 | TFLOPs: 19.42 | 31: iteration 56720/ 173500 | consumed samples: 14520320 | consumed tokens: 29737615360 | elapsed time per iteration (s): 0.80 | 
learning rate: 1.582E-04 | global batch size: 256 | lm loss: 2.060439E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.528 | TFLOPs: 19.27 | 31: iteration 56730/ 173500 | consumed samples: 14522880 | consumed tokens: 29742858240 | elapsed time per iteration (s): 0.86 | learning rate: 1.582E-04 | global batch size: 256 | lm loss: 2.046310E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.430 | TFLOPs: 17.93 | 31: iteration 56740/ 173500 | consumed samples: 14525440 | consumed tokens: 29748101120 | elapsed time per iteration (s): 0.78 | learning rate: 1.582E-04 | global batch size: 256 | lm loss: 2.053150E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.973 | TFLOPs: 19.96 | 31: iteration 56750/ 173500 | consumed samples: 14528000 | consumed tokens: 29753344000 | elapsed time per iteration (s): 0.82 | learning rate: 1.582E-04 | global batch size: 256 | lm loss: 2.015167E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.150 | TFLOPs: 18.94 | 31: iteration 56760/ 173500 | consumed samples: 14530560 | consumed tokens: 29758586880 | elapsed time per iteration (s): 0.78 | learning rate: 1.581E-04 | global batch size: 256 | lm loss: 2.040114E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.378 | TFLOPs: 19.93 | 31: iteration 56770/ 173500 | consumed samples: 14533120 | consumed tokens: 29763829760 | elapsed time per iteration (s): 0.82 | learning rate: 1.581E-04 | global batch size: 256 | lm loss: 2.053632E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.799 | TFLOPs: 18.98 | 31: iteration 56780/ 
173500 | consumed samples: 14535680 | consumed tokens: 29769072640 | elapsed time per iteration (s): 0.74 | learning rate: 1.581E-04 | global batch size: 256 | lm loss: 2.050109E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.293 | TFLOPs: 20.95 | 31: iteration 56790/ 173500 | consumed samples: 14538240 | consumed tokens: 29774315520 | elapsed time per iteration (s): 0.78 | learning rate: 1.581E-04 | global batch size: 256 | lm loss: 2.040753E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.634 | TFLOPs: 19.88 | 31: iteration 56800/ 173500 | consumed samples: 14540800 | consumed tokens: 29779558400 | elapsed time per iteration (s): 0.76 | learning rate: 1.581E-04 | global batch size: 256 | lm loss: 2.076632E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.971 | TFLOPs: 20.51 | 31: iteration 56810/ 173500 | consumed samples: 14543360 | consumed tokens: 29784801280 | elapsed time per iteration (s): 0.77 | learning rate: 1.581E-04 | global batch size: 256 | lm loss: 2.032575E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.781 | TFLOPs: 20.01 | 31: iteration 56820/ 173500 | consumed samples: 14545920 | consumed tokens: 29790044160 | elapsed time per iteration (s): 0.73 | learning rate: 1.581E-04 | global batch size: 256 | lm loss: 2.011998E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.146 | TFLOPs: 21.30 | 31: iteration 56830/ 173500 | consumed samples: 14548480 | consumed tokens: 29795287040 | elapsed time per iteration (s): 0.80 | learning rate: 1.580E-04 | global batch size: 256 | lm loss: 2.042978E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 321.338 | TFLOPs: 19.44 | 31: iteration 56840/ 173500 | consumed samples: 14551040 | consumed tokens: 29800529920 | elapsed time per iteration (s): 0.77 | learning rate: 1.580E-04 | global batch size: 256 | lm loss: 2.045834E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.824 | TFLOPs: 20.20 | 31: iteration 56850/ 173500 | consumed samples: 14553600 | consumed tokens: 29805772800 | elapsed time per iteration (s): 0.76 | learning rate: 1.580E-04 | global batch size: 256 | lm loss: 2.018223E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.784 | TFLOPs: 20.31 | 31: iteration 56860/ 173500 | consumed samples: 14556160 | consumed tokens: 29811015680 | elapsed time per iteration (s): 0.79 | learning rate: 1.580E-04 | global batch size: 256 | lm loss: 2.083330E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.940 | TFLOPs: 19.60 | 31: iteration 56870/ 173500 | consumed samples: 14558720 | consumed tokens: 29816258560 | elapsed time per iteration (s): 0.75 | learning rate: 1.580E-04 | global batch size: 256 | lm loss: 2.044567E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.034 | TFLOPs: 20.57 | 31: iteration 56880/ 173500 | consumed samples: 14561280 | consumed tokens: 29821501440 | elapsed time per iteration (s): 0.73 | learning rate: 1.580E-04 | global batch size: 256 | lm loss: 2.028231E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.575 | TFLOPs: 21.15 | 31: iteration 56890/ 173500 | consumed samples: 14563840 | consumed tokens: 29826744320 | elapsed time per iteration (s): 0.75 | learning rate: 
1.580E-04 | global batch size: 256 | lm loss: 2.040476E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.393 | TFLOPs: 20.77 | 31: iteration 56900/ 173500 | consumed samples: 14566400 | consumed tokens: 29831987200 | elapsed time per iteration (s): 0.76 | learning rate: 1.579E-04 | global batch size: 256 | lm loss: 2.026969E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.832 | TFLOPs: 20.26 | 31: iteration 56910/ 173500 | consumed samples: 14568960 | consumed tokens: 29837230080 | elapsed time per iteration (s): 0.79 | learning rate: 1.579E-04 | global batch size: 256 | lm loss: 2.073337E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.618 | TFLOPs: 19.70 | 31: iteration 56920/ 173500 | consumed samples: 14571520 | consumed tokens: 29842472960 | elapsed time per iteration (s): 0.85 | learning rate: 1.579E-04 | global batch size: 256 | lm loss: 2.059716E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.914 | TFLOPs: 18.27 | 31: iteration 56930/ 173500 | consumed samples: 14574080 | consumed tokens: 29847715840 | elapsed time per iteration (s): 0.77 | learning rate: 1.579E-04 | global batch size: 256 | lm loss: 2.043938E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.460 | TFLOPs: 20.05 | 31: iteration 56940/ 173500 | consumed samples: 14576640 | consumed tokens: 29852958720 | elapsed time per iteration (s): 0.79 | learning rate: 1.579E-04 | global batch size: 256 | lm loss: 2.061698E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.047 | TFLOPs: 19.66 | 31: iteration 56950/ 173500 | 
consumed samples: 14579200 | consumed tokens: 29858201600 | elapsed time per iteration (s): 0.81 | learning rate: 1.579E-04 | global batch size: 256 | lm loss: 2.043125E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.431 | TFLOPs: 19.14 | 31: iteration 56960/ 173500 | consumed samples: 14581760 | consumed tokens: 29863444480 | elapsed time per iteration (s): 0.80 | learning rate: 1.579E-04 | global batch size: 256 | lm loss: 2.023404E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.363 | TFLOPs: 19.26 | 31: iteration 56970/ 173500 | consumed samples: 14584320 | consumed tokens: 29868687360 | elapsed time per iteration (s): 0.78 | learning rate: 1.578E-04 | global batch size: 256 | lm loss: 2.051734E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.160 | TFLOPs: 19.97 | 31: iteration 56980/ 173500 | consumed samples: 14586880 | consumed tokens: 29873930240 | elapsed time per iteration (s): 0.79 | learning rate: 1.578E-04 | global batch size: 256 | lm loss: 2.048557E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.591 | TFLOPs: 19.52 | 31: iteration 56990/ 173500 | consumed samples: 14589440 | consumed tokens: 29879173120 | elapsed time per iteration (s): 0.76 | learning rate: 1.578E-04 | global batch size: 256 | lm loss: 2.036429E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.668 | TFLOPs: 20.37 | 31: iteration 57000/ 173500 | consumed samples: 14592000 | consumed tokens: 29884416000 | elapsed time per iteration (s): 0.76 | learning rate: 1.578E-04 | global batch size: 256 | lm loss: 2.052646E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 338.209 | TFLOPs: 20.46 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 57000 | lm loss value: 2.047369E+00 | lm loss PPL: 7.747487E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 57000 to checkpoints_1b1long 0: [2022-11-26 06:53:32,272] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step57000 is begin to save! 0: [2022-11-26 06:53:32,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_01-model_00-model_states.pt... 0: [2022-11-26 06:53:32,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_01-model_00-model_states.pt. 0: [2022-11-26 06:53:32,498] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_03-model_00-model_states.pt... 0: [2022-11-26 06:53:32,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_03-model_00-model_states.pt. 0: [2022-11-26 06:53:32,574] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_04-model_00-model_states.pt... 0: [2022-11-26 06:53:32,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_04-model_00-model_states.pt. 0: [2022-11-26 06:53:32,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_05-model_00-model_states.pt... 0: [2022-11-26 06:53:32,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_05-model_00-model_states.pt. 0: [2022-11-26 06:53:32,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_06-model_00-model_states.pt... 
0: [2022-11-26 06:53:32,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_06-model_00-model_states.pt. 0: [2022-11-26 06:53:32,807] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_07-model_00-model_states.pt... 0: [2022-11-26 06:53:32,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_07-model_00-model_states.pt. 0: [2022-11-26 06:53:32,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_08-model_00-model_states.pt... 0: [2022-11-26 06:53:32,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_08-model_00-model_states.pt. 0: [2022-11-26 06:53:32,958] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_09-model_00-model_states.pt... 0: [2022-11-26 06:53:33,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_09-model_00-model_states.pt. 0: [2022-11-26 06:53:33,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_10-model_00-model_states.pt... 0: [2022-11-26 06:53:33,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_10-model_00-model_states.pt. 0: [2022-11-26 06:53:33,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_11-model_00-model_states.pt... 0: [2022-11-26 06:53:33,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_11-model_00-model_states.pt. 0: [2022-11-26 06:53:33,189] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_12-model_00-model_states.pt... 
0: [2022-11-26 06:53:33,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_12-model_00-model_states.pt. 0: [2022-11-26 06:53:33,264] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_13-model_00-model_states.pt... 0: [2022-11-26 06:53:33,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_13-model_00-model_states.pt. 0: [2022-11-26 06:53:33,341] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_14-model_00-model_states.pt... 0: [2022-11-26 06:53:33,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_14-model_00-model_states.pt. 0: [2022-11-26 06:53:33,415] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_15-model_00-model_states.pt... 0: [2022-11-26 06:53:33,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_15-model_00-model_states.pt. 0: [2022-11-26 06:53:33,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_16-model_00-model_states.pt... 0: [2022-11-26 06:53:33,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_16-model_00-model_states.pt. 0: [2022-11-26 06:53:33,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_17-model_00-model_states.pt... 0: [2022-11-26 06:53:33,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_17-model_00-model_states.pt. 0: [2022-11-26 06:53:33,646] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_18-model_00-model_states.pt... 
0: [2022-11-26 06:53:33,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_18-model_00-model_states.pt. 0: [2022-11-26 06:53:33,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_19-model_00-model_states.pt... 0: [2022-11-26 06:53:33,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_19-model_00-model_states.pt. 0: [2022-11-26 06:53:33,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_20-model_00-model_states.pt... 0: [2022-11-26 06:53:33,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_20-model_00-model_states.pt. 0: [2022-11-26 06:53:33,874] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_21-model_00-model_states.pt... 0: [2022-11-26 06:53:33,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_21-model_00-model_states.pt. 0: [2022-11-26 06:53:33,949] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_22-model_00-model_states.pt... 0: [2022-11-26 06:53:34,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_22-model_00-model_states.pt. 0: [2022-11-26 06:53:34,028] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_23-model_00-model_states.pt... 0: [2022-11-26 06:53:34,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_23-model_00-model_states.pt. 0: [2022-11-26 06:53:34,103] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_24-model_00-model_states.pt... 
0: [2022-11-26 06:53:34,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_24-model_00-model_states.pt. 0: [2022-11-26 06:53:34,178] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_25-model_00-model_states.pt... 0: [2022-11-26 06:53:34,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_25-model_00-model_states.pt. 0: [2022-11-26 06:53:34,259] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_26-model_00-model_states.pt... 0: [2022-11-26 06:53:34,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_26-model_00-model_states.pt. 0: [2022-11-26 06:53:34,332] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_27-model_00-model_states.pt... 0: [2022-11-26 06:53:34,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_27-model_00-model_states.pt. 0: [2022-11-26 06:53:34,409] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_28-model_00-model_states.pt... 0: [2022-11-26 06:53:34,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_28-model_00-model_states.pt. 0: [2022-11-26 06:53:34,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/layer_30-model_00-model_states.pt... 0: [2022-11-26 06:53:34,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/layer_30-model_00-model_states.pt. 
0: [2022-11-26 06:53:34,486] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step57000/mp_rank_00_model_states.pt 0: [2022-11-26 06:53:34,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/mp_rank_00_model_states.pt... 0: [2022-11-26 06:53:34,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/mp_rank_00_model_states.pt. 31: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
7: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
8: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
13: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
20: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 
11: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 
22: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 
18: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
5: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
10: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
13: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
15: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 
28: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 
29: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 
18: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 
27: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
10: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 
20: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 
29: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 
19: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
16: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 
31: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 
9: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 24: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 30: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 
9: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 16: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 23: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 06:53:34,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:53:34,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:53:34,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 06:53:34,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 6: [2022-11-26 06:53:34,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 06:53:34,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 06:53:34,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 9: [2022-11-26 06:53:34,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:53:34,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 06:53:34,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 2: [2022-11-26 06:53:34,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:53:34,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 06:53:34,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 6: [2022-11-26 06:53:34,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:53:34,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 06:53:34,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 14: [2022-11-26 06:53:34,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-26 06:53:34,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 06:53:34,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 7: [2022-11-26 06:53:34,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:53:34,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 21: [2022-11-26 06:53:34,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:53:34,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 21: [2022-11-26 06:53:34,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 06:53:34,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 8: [2022-11-26 06:53:34,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:53:34,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 06:53:34,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 30: [2022-11-26 06:53:34,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-26 06:53:34,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 06:53:34,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 15: [2022-11-26 06:53:34,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:53:34,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 06:53:34,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 29: [2022-11-26 06:53:34,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:53:34,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 06:53:34,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 3: [2022-11-26 06:53:34,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:53:34,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 06:53:34,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 11: [2022-11-26 06:53:34,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-26 06:53:34,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 06:53:34,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 26: [2022-11-26 06:53:34,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:53:34,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 06:53:34,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 8: [2022-11-26 06:53:34,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:53:34,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 06:53:34,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 19: [2022-11-26 06:53:34,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:53:34,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 06:53:34,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 19: [2022-11-26 06:53:34,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
3: [2022-11-26 06:53:34,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:53:34,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 3: [2022-11-26 06:53:34,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 19: [2022-11-26 06:53:34,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 3: [2022-11-26 06:53:34,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 12: [2022-11-26 06:53:34,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:53:34,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 06:53:34,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 10: [2022-11-26 06:53:34,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:53:34,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-26 06:53:34,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 06:53:34,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 06:53:34,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 10: [2022-11-26 06:53:34,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 13: [2022-11-26 06:53:34,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:53:34,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:53:34,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:53:34,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 20: [2022-11-26 06:53:34,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 9: [2022-11-26 06:53:34,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 13: [2022-11-26 06:53:34,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 20: [2022-11-26 06:53:34,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
9: [2022-11-26 06:53:34,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 12: [2022-11-26 06:53:34,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:53:34,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 06:53:34,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 28: [2022-11-26 06:53:34,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:53:34,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:53:34,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 18: [2022-11-26 06:53:34,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:53:34,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 18: [2022-11-26 06:53:34,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 29: [2022-11-26 06:53:34,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:53:34,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
29: [2022-11-26 06:53:34,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 06:53:34,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 26: [2022-11-26 06:53:34,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:53:34,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:53:34,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 06:53:34,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 24: [2022-11-26 06:53:34,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 06:53:34,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 6: [2022-11-26 06:53:34,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:53:34,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 31: [2022-11-26 06:53:34,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:53:34,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
31: [2022-11-26 06:53:34,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 18: [2022-11-26 06:53:34,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:53:34,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 18: [2022-11-26 06:53:34,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 31: [2022-11-26 06:53:34,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:53:34,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 31: [2022-11-26 06:53:34,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 0: [2022-11-26 06:53:34,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:53:34,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 24: [2022-11-26 06:53:34,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:53:34,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 06:53:34,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
24: [2022-11-26 06:53:34,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 12: [2022-11-26 06:53:34,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:53:34,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 0: [2022-11-26 06:53:34,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:53:34,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 30: [2022-11-26 06:53:34,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:53:34,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 0: [2022-11-26 06:53:34,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 30: [2022-11-26 06:53:34,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 06:53:34,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 0: [2022-11-26 06:53:34,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 3: [2022-11-26 06:53:34,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 06:53:34,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 06:53:34,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 10: [2022-11-26 06:53:34,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:53:34,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 06:53:34,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 28: [2022-11-26 06:53:34,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 06:53:34,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 28: [2022-11-26 06:53:34,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 06:53:34,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 06:53:34,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 29: [2022-11-26 06:53:34,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:53:34,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
29: [2022-11-26 06:53:34,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 06:53:34,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 1: [2022-11-26 06:53:34,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:53:34,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:53:34,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 4: [2022-11-26 06:53:34,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:53:34,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 06:53:34,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 1: [2022-11-26 06:53:34,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 27: [2022-11-26 06:53:34,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 1: [2022-11-26 06:53:34,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 06:53:34,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
1: [2022-11-26 06:53:34,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 19: [2022-11-26 06:53:34,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:53:34,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 14: [2022-11-26 06:53:34,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:53:34,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:53:34,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 20: [2022-11-26 06:53:34,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:53:34,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 20: [2022-11-26 06:53:34,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 14: [2022-11-26 06:53:34,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 20: [2022-11-26 06:53:34,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 06:53:34,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
20: [2022-11-26 06:53:34,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 5: [2022-11-26 06:53:34,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:53:34,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 06:53:34,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 1: [2022-11-26 06:53:34,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:53:34,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:53:34,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 27: [2022-11-26 06:53:34,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 1: [2022-11-26 06:53:34,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 27: [2022-11-26 06:53:34,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 21: [2022-11-26 06:53:34,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
21: [2022-11-26 06:53:34,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 24: [2022-11-26 06:53:34,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:53:34,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 24: [2022-11-26 06:53:34,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 06:53:34,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 21: [2022-11-26 06:53:34,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:53:34,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 06:53:34,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 18: [2022-11-26 06:53:34,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:53:34,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 06:53:34,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 26: [2022-11-26 06:53:34,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
26: [2022-11-26 06:53:34,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 06:53:34,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 9: [2022-11-26 06:53:34,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:53:34,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 19: [2022-11-26 06:53:34,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:53:34,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 19: [2022-11-26 06:53:34,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 06:53:34,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 11: [2022-11-26 06:53:34,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:53:34,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 06:53:34,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 31: [2022-11-26 06:53:34,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 
31: [2022-11-26 06:53:34,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 06:53:34,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 28: [2022-11-26 06:53:34,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:53:34,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:53:34,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 06:53:34,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 5: [2022-11-26 06:53:34,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:53:34,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 06:53:34,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 4: [2022-11-26 06:53:34,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:53:34,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 06:53:34,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
22: [2022-11-26 06:53:34,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:53:34,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:53:34,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:53:34,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:53:34,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:53:34,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 22: [2022-11-26 06:53:34,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 06:53:34,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 06:53:34,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 11: [2022-11-26 06:53:34,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 14: [2022-11-26 06:53:34,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
22: [2022-11-26 06:53:34,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 22: [2022-11-26 06:53:34,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 11: [2022-11-26 06:53:34,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 22: [2022-11-26 06:53:34,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 30: [2022-11-26 06:53:34,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:53:34,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 06:53:34,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 26: [2022-11-26 06:53:34,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:53:34,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 11: [2022-11-26 06:53:34,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:53:34,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 27: [2022-11-26 06:53:34,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
11: [2022-11-26 06:53:34,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 21: [2022-11-26 06:53:34,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:53:34,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 29: [2022-11-26 06:53:34,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:53:34,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 29: [2022-11-26 06:53:34,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 21: [2022-11-26 06:53:34,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 27: [2022-11-26 06:53:34,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 29: [2022-11-26 06:53:34,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 27: [2022-11-26 06:53:34,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 22: [2022-11-26 06:53:34,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
22: [2022-11-26 06:53:34,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 06:53:34,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 15: [2022-11-26 06:53:34,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:53:34,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 06:53:34,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 3: [2022-11-26 06:53:34,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:53:34,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 06:53:34,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 4: [2022-11-26 06:53:34,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:53:34,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 06:53:34,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 8: [2022-11-26 06:53:34,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 06:53:34,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 06:53:34,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 10: [2022-11-26 06:53:34,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:53:34,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 15: [2022-11-26 06:53:34,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:53:34,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 15: [2022-11-26 06:53:34,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:53:34,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 06:53:34,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 2: [2022-11-26 06:53:34,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:53:34,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
15: [2022-11-26 06:53:34,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 12: [2022-11-26 06:53:34,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 15: [2022-11-26 06:53:34,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 12: [2022-11-26 06:53:34,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 16: [2022-11-26 06:53:34,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:53:34,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:53:34,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:53:34,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-26 06:53:34,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 16: [2022-11-26 06:53:34,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 06:53:34,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 2: [2022-11-26 06:53:34,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 06:53:34,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 16: [2022-11-26 06:53:34,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 2: [2022-11-26 06:53:34,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 16: [2022-11-26 06:53:34,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 16: [2022-11-26 06:53:34,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 16: [2022-11-26 06:53:34,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 20: [2022-11-26 06:53:34,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-26 06:53:34,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 06:53:34,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 30: [2022-11-26 06:53:34,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:53:34,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 06:53:34,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 7: [2022-11-26 06:53:34,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:53:34,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 06:53:34,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 14: [2022-11-26 06:53:34,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:53:34,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 28: [2022-11-26 06:53:34,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 06:53:34,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
14: [2022-11-26 06:53:34,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 28: [2022-11-26 06:53:34,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 06:53:34,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 06:53:34,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 25: [2022-11-26 06:53:34,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:53:34,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:53:34,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:53:34,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:53:34,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 6: [2022-11-26 06:53:34,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:53:34,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
6: [2022-11-26 06:53:34,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 06:53:34,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 25: [2022-11-26 06:53:34,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 06:53:34,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 06:53:34,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 06:53:34,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 25: [2022-11-26 06:53:34,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 25: [2022-11-26 06:53:34,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 0: [2022-11-26 06:53:34,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:53:34,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:53:34,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 06:53:34,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
27: [2022-11-26 06:53:34,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:53:34,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:53:34,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 06:53:34,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 31: [2022-11-26 06:53:34,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 06:53:34,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 0: [2022-11-26 06:53:34,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:53:34,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:53:34,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 13: [2022-11-26 06:53:34,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 0: [2022-11-26 06:53:34,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 13: [2022-11-26 06:53:34,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
13: [2022-11-26 06:53:34,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:53:34,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 06:53:34,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 13: [2022-11-26 06:53:34,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:53:34,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 19: [2022-11-26 06:53:34,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:53:34,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 19: [2022-11-26 06:53:34,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 06:53:34,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 8: [2022-11-26 06:53:34,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:53:34,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 06:53:34,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
1: [2022-11-26 06:53:34,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:53:34,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 7: [2022-11-26 06:53:34,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:53:34,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:53:34,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 06:53:34,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 06:53:34,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 1: [2022-11-26 06:53:34,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 7: [2022-11-26 06:53:34,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 1: [2022-11-26 06:53:34,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:53:34,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 06:53:34,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
12: [2022-11-26 06:53:34,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:53:34,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 06:53:34,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 17: [2022-11-26 06:53:34,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:53:34,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:53:34,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:53:34,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 06:53:34,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 06:53:34,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 06:53:34,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 17: [2022-11-26 06:53:34,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 17: [2022-11-26 06:53:34,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
0: [2022-11-26 06:53:34,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 06:53:34,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 5: [2022-11-26 06:53:34,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:53:34,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 06:53:34,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 6: [2022-11-26 06:53:34,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:53:34,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 06:53:34,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 11: [2022-11-26 06:53:34,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:53:34,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 06:53:34,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 18: [2022-11-26 06:53:34,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
18: [2022-11-26 06:53:34,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 06:53:34,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 23: [2022-11-26 06:53:34,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:53:34,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:53:34,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:53:34,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:53:34,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 06:53:34,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 06:53:34,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 06:53:34,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 06:53:34,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
23: [2022-11-26 06:53:34,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 23: [2022-11-26 06:53:34,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 23: [2022-11-26 06:53:34,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 26: [2022-11-26 06:53:34,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:53:34,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 06:53:34,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 20: [2022-11-26 06:53:34,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:53:34,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 06:53:34,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 21: [2022-11-26 06:53:34,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:53:34,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 06:53:34,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
7: [2022-11-26 06:53:34,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:53:34,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 06:53:34,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 0: [2022-11-26 06:53:34,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:53:34,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 06:53:34,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 3: [2022-11-26 06:53:34,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:53:34,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 06:53:34,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 14: [2022-11-26 06:53:34,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:53:34,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 06:53:34,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
10: [2022-11-26 06:53:34,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:53:34,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 06:53:34,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 8: [2022-11-26 06:53:34,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:53:34,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 06:53:34,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 15: [2022-11-26 06:53:34,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:53:34,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 06:53:34,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 30: [2022-11-26 06:53:34,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:53:34,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 06:53:34,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
5: [2022-11-26 06:53:34,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:53:34,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 06:53:34,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 13: [2022-11-26 06:53:34,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:53:34,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 06:53:34,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 31: [2022-11-26 06:53:34,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:53:34,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 06:53:34,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 27: [2022-11-26 06:53:34,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:53:34,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
27: [2022-11-26 06:53:34,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 23: [2022-11-26 06:53:34,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 27: [2022-11-26 06:53:34,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 23: [2022-11-26 06:53:34,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 22: [2022-11-26 06:53:34,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:53:34,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 06:53:34,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 28: [2022-11-26 06:53:34,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 06:53:34,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 29: [2022-11-26 06:53:34,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 28: [2022-11-26 06:53:34,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
29: [2022-11-26 06:53:34,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 06:53:34,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 16: [2022-11-26 06:53:34,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:53:34,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 06:53:34,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 9: [2022-11-26 06:53:34,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:53:34,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 06:53:34,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 4: [2022-11-26 06:53:34,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:53:34,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:53:34,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 06:53:34,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
17: [2022-11-26 06:53:34,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 06:53:34,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 24: [2022-11-26 06:53:34,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:53:34,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 06:53:34,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 25: [2022-11-26 06:53:34,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:53:34,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 06:53:34,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 2: [2022-11-26 06:53:34,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:53:34,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 06:53:34,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 1: [2022-11-26 06:53:34,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 06:53:34,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 06:53:34,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 6: [2022-11-26 06:53:34,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:53:34,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 06:53:34,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 26: [2022-11-26 06:53:34,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:53:34,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 06:53:34,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 18: [2022-11-26 06:53:34,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:53:34,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 06:53:34,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 19: [2022-11-26 06:53:34,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
19: [2022-11-26 06:53:34,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 06:53:34,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 20: [2022-11-26 06:53:34,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:53:34,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 06:53:34,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 0: [2022-11-26 06:53:34,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:53:34,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 06:53:34,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 7: [2022-11-26 06:53:34,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:53:34,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 06:53:34,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 10: [2022-11-26 06:53:34,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
3: [2022-11-26 06:53:34,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:53:34,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 3: [2022-11-26 06:53:34,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 10: [2022-11-26 06:53:34,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 3: [2022-11-26 06:53:34,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 11: [2022-11-26 06:53:34,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:53:34,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 06:53:34,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 12: [2022-11-26 06:53:34,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:53:34,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 21: [2022-11-26 06:53:34,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:53:34,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
21: [2022-11-26 06:53:34,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 06:53:34,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 24: [2022-11-26 06:53:34,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:53:34,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 06:53:34,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 30: [2022-11-26 06:53:34,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:53:34,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 06:53:34,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 8: [2022-11-26 06:53:34,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:53:34,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 06:53:34,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 14: [2022-11-26 06:53:34,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-26 06:53:34,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 06:53:34,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 4: [2022-11-26 06:53:34,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:53:34,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 06:53:34,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 31: [2022-11-26 06:53:34,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:53:34,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 06:53:34,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 13: [2022-11-26 06:53:34,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:53:34,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 22: [2022-11-26 06:53:34,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:53:34,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
22: [2022-11-26 06:53:34,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 06:53:34,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 17: [2022-11-26 06:53:34,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:53:34,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 06:53:34,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 27: [2022-11-26 06:53:34,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:53:34,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:53:34,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 06:53:34,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 1: [2022-11-26 06:53:34,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:53:34,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 06:53:34,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
1: [2022-11-26 06:53:34,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 06:53:34,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 5: [2022-11-26 06:53:34,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:53:34,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 06:53:34,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 28: [2022-11-26 06:53:34,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:53:34,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:53:34,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 06:53:34,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 28: [2022-11-26 06:53:34,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 2: [2022-11-26 06:53:34,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 06:53:34,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 28: [2022-11-26 06:53:34,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 2: [2022-11-26 06:53:34,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 18: [2022-11-26 06:53:34,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:53:34,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 06:53:34,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 25: [2022-11-26 06:53:34,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:53:34,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 06:53:34,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 19: [2022-11-26 06:53:34,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:53:34,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 06:53:34,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
16: [2022-11-26 06:53:34,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:53:34,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 06:53:34,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 3: [2022-11-26 06:53:34,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:53:34,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 06:53:34,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 6: [2022-11-26 06:53:34,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:53:34,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 06:53:34,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 11: [2022-11-26 06:53:34,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:53:34,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
11: [2022-11-26 06:53:34,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 26: [2022-11-26 06:53:34,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 11: [2022-11-26 06:53:34,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 26: [2022-11-26 06:53:34,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 23: [2022-11-26 06:53:34,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:53:34,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 06:53:34,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 29: [2022-11-26 06:53:34,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:53:34,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 29: [2022-11-26 06:53:34,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 9: [2022-11-26 06:53:34,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 29: [2022-11-26 06:53:34,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
9: [2022-11-26 06:53:34,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 12: [2022-11-26 06:53:34,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:53:34,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 06:53:34,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 10: [2022-11-26 06:53:34,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:53:34,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 06:53:34,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 21: [2022-11-26 06:53:34,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:53:34,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 06:53:34,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 29: [2022-11-26 06:53:34,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:53:34,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
29: [2022-11-26 06:53:34,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 06:53:34,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 14: [2022-11-26 06:53:34,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 06:53:34,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 15: [2022-11-26 06:53:34,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:53:34,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:53:34,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 06:53:34,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 06:53:34,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 15: [2022-11-26 06:53:34,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 31: [2022-11-26 06:53:34,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
31: [2022-11-26 06:53:34,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 06:53:34,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 0: [2022-11-26 06:53:34,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:53:34,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 06:53:34,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 23: [2022-11-26 06:53:34,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 06:53:34,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 06:53:34,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 2: [2022-11-26 06:53:34,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:53:34,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 06:53:34,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 22: [2022-11-26 06:53:34,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
22: [2022-11-26 06:53:34,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 06:53:34,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 27: [2022-11-26 06:53:34,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:53:34,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:53:34,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 06:53:34,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 30: [2022-11-26 06:53:34,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 06:53:34,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 13: [2022-11-26 06:53:34,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:53:34,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 06:53:34,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 24: [2022-11-26 06:53:34,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-26 06:53:34,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 06:53:34,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 9: [2022-11-26 06:53:34,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:53:34,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 06:53:34,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 17: [2022-11-26 06:53:34,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:53:34,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 06:53:34,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 3: [2022-11-26 06:53:34,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:53:34,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 28: [2022-11-26 06:53:34,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:53:34,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
28: [2022-11-26 06:53:34,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 06:53:34,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 26: [2022-11-26 06:53:34,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 06:53:34,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 06:53:34,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 16: [2022-11-26 06:53:34,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 06:53:34,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 06:53:34,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 8: [2022-11-26 06:53:34,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:53:34,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 06:53:34,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 29: [2022-11-26 06:53:34,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-26 06:53:34,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 06:53:34,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 0: [2022-11-26 06:53:34,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:53:34,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 06:53:34,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 20: [2022-11-26 06:53:34,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 06:53:34,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 06:53:34,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 12: [2022-11-26 06:53:34,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:53:34,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 06:53:34,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 21: [2022-11-26 06:53:34,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
21: [2022-11-26 06:53:34,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 5: [2022-11-26 06:53:34,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 21: [2022-11-26 06:53:34,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 5: [2022-11-26 06:53:34,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 06:53:34,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 7: [2022-11-26 06:53:34,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:53:34,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 06:53:34,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 15: [2022-11-26 06:53:34,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 30: [2022-11-26 06:53:34,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:53:34,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 06:53:34,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
30: [2022-11-26 06:53:34,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 06:53:34,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 11: [2022-11-26 06:53:34,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 27: [2022-11-26 06:53:34,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:53:34,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 06:53:34,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 27: [2022-11-26 06:53:34,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 06:53:34,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 13: [2022-11-26 06:53:34,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:53:34,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:53:34,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 06:53:34,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
10: [2022-11-26 06:53:34,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 06:53:34,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 6: [2022-11-26 06:53:34,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:53:34,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 06:53:34,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 19: [2022-11-26 06:53:34,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:53:34,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 23: [2022-11-26 06:53:34,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 19: [2022-11-26 06:53:34,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 23: [2022-11-26 06:53:34,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 06:53:34,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 16: [2022-11-26 06:53:34,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
16: [2022-11-26 06:53:34,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 06:53:34,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 14: [2022-11-26 06:53:34,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:53:34,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 06:53:34,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 24: [2022-11-26 06:53:34,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 31: [2022-11-26 06:53:34,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 24: [2022-11-26 06:53:34,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 06:53:34,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 31: [2022-11-26 06:53:34,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 06:53:34,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 4: [2022-11-26 06:53:34,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 06:53:34,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 06:53:34,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 17: [2022-11-26 06:53:34,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 18: [2022-11-26 06:53:34,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 17: [2022-11-26 06:53:34,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 18: [2022-11-26 06:53:34,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 17: [2022-11-26 06:53:34,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 18: [2022-11-26 06:53:34,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 1: [2022-11-26 06:53:34,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:53:34,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 25: [2022-11-26 06:53:34,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:53:34,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
25: [2022-11-26 06:53:34,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 06:53:34,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 9: [2022-11-26 06:53:34,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:53:34,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 06:53:34,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 22: [2022-11-26 06:53:34,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 06:53:34,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 06:53:34,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 4: [2022-11-26 06:53:34,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:53:34,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 06:53:34,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 28: [2022-11-26 06:53:34,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
17: [2022-11-26 06:53:34,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 28: [2022-11-26 06:53:34,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 06:53:34,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 17: [2022-11-26 06:53:34,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 06:53:34,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 5: [2022-11-26 06:53:34,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:53:34,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:53:34,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 06:53:34,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 2: [2022-11-26 06:53:34,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 06:53:34,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 16: [2022-11-26 06:53:34,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 
16: [2022-11-26 06:53:34,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 06:53:34,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 25: [2022-11-26 06:53:34,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:53:34,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 06:53:34,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 4: [2022-11-26 06:53:34,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:53:34,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 06:53:34,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 25: [2022-11-26 06:53:34,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 06:53:34,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 06:53:34,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 8: [2022-11-26 06:53:34,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-26 06:53:34,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step57000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 06:53:34,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 0: successfully saved checkpoint at iteration 57000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2540.80 31: iteration 57010/ 173500 | consumed samples: 14594560 | consumed tokens: 29889658880 | elapsed time per iteration (s): 1.01 | learning rate: 1.578E-04 | global batch size: 256 | lm loss: 2.038788E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 253.623 | TFLOPs: 15.34 | 31: iteration 57020/ 173500 | consumed samples: 14597120 | consumed tokens: 29894901760 | elapsed time per iteration (s): 0.73 | learning rate: 1.578E-04 | global batch size: 256 | lm loss: 2.050594E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.781 | TFLOPs: 21.16 | 31: iteration 57030/ 173500 | consumed samples: 14599680 | consumed tokens: 29900144640 | elapsed time per iteration (s): 0.74 | learning rate: 1.578E-04 | global batch size: 256 | lm loss: 2.066332E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.683 | TFLOPs: 20.85 | 31: iteration 57040/ 173500 | consumed samples: 14602240 | consumed tokens: 29905387520 | elapsed time per iteration (s): 0.76 | learning rate: 1.578E-04 | global batch size: 256 | lm loss: 2.037150E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.767 | TFLOPs: 20.25 | 31: iteration 57050/ 173500 | consumed samples: 14604800 | consumed tokens: 29910630400 | elapsed time per iteration (s): 0.75 | learning rate: 1.577E-04 | global 
batch size: 256 | lm loss: 2.028264E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.416 | TFLOPs: 20.72 | 31: iteration 57060/ 173500 | consumed samples: 14607360 | consumed tokens: 29915873280 | elapsed time per iteration (s): 0.79 | learning rate: 1.577E-04 | global batch size: 256 | lm loss: 2.017406E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.817 | TFLOPs: 19.53 | 31: iteration 57070/ 173500 | consumed samples: 14609920 | consumed tokens: 29921116160 | elapsed time per iteration (s): 0.80 | learning rate: 1.577E-04 | global batch size: 256 | lm loss: 2.067072E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.488 | TFLOPs: 19.39 | 31: iteration 57080/ 173500 | consumed samples: 14612480 | consumed tokens: 29926359040 | elapsed time per iteration (s): 0.81 | learning rate: 1.577E-04 | global batch size: 256 | lm loss: 2.030514E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.433 | TFLOPs: 19.02 | 31: iteration 57090/ 173500 | consumed samples: 14615040 | consumed tokens: 29931601920 | elapsed time per iteration (s): 0.80 | learning rate: 1.577E-04 | global batch size: 256 | lm loss: 2.065851E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.627 | TFLOPs: 19.34 | 31: iteration 57100/ 173500 | consumed samples: 14617600 | consumed tokens: 29936844800 | elapsed time per iteration (s): 0.79 | learning rate: 1.577E-04 | global batch size: 256 | lm loss: 2.043554E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.182 | TFLOPs: 19.49 | 31: iteration 57110/ 173500 | consumed samples: 14620160 
| consumed tokens: 29942087680 | elapsed time per iteration (s): 0.84 | learning rate: 1.577E-04 | global batch size: 256 | lm loss: 2.035483E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.659 | TFLOPs: 18.49 | 31: iteration 57120/ 173500 | consumed samples: 14622720 | consumed tokens: 29947330560 | elapsed time per iteration (s): 0.88 | learning rate: 1.576E-04 | global batch size: 256 | lm loss: 2.058685E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.336 | TFLOPs: 17.56 | 31: iteration 57130/ 173500 | consumed samples: 14625280 | consumed tokens: 29952573440 | elapsed time per iteration (s): 0.82 | learning rate: 1.576E-04 | global batch size: 256 | lm loss: 2.036935E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.174 | TFLOPs: 18.95 | 31: iteration 57140/ 173500 | consumed samples: 14627840 | consumed tokens: 29957816320 | elapsed time per iteration (s): 0.81 | learning rate: 1.576E-04 | global batch size: 256 | lm loss: 2.041183E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.984 | TFLOPs: 19.12 | 31: iteration 57150/ 173500 | consumed samples: 14630400 | consumed tokens: 29963059200 | elapsed time per iteration (s): 0.79 | learning rate: 1.576E-04 | global batch size: 256 | lm loss: 2.025605E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.935 | TFLOPs: 19.72 | 31: iteration 57160/ 173500 | consumed samples: 14632960 | consumed tokens: 29968302080 | elapsed time per iteration (s): 0.82 | learning rate: 1.576E-04 | global batch size: 256 | lm loss: 2.066224E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 
0 | samples per second: 312.106 | TFLOPs: 18.88 | 31: iteration 57170/ 173500 | consumed samples: 14635520 | consumed tokens: 29973544960 | elapsed time per iteration (s): 0.79 | learning rate: 1.576E-04 | global batch size: 256 | lm loss: 2.052872E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.539 | TFLOPs: 19.57 | 31: iteration 57180/ 173500 | consumed samples: 14638080 | consumed tokens: 29978787840 | elapsed time per iteration (s): 0.80 | learning rate: 1.576E-04 | global batch size: 256 | lm loss: 2.063459E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.090 | TFLOPs: 19.24 | 31: iteration 57190/ 173500 | consumed samples: 14640640 | consumed tokens: 29984030720 | elapsed time per iteration (s): 0.84 | learning rate: 1.575E-04 | global batch size: 256 | lm loss: 2.036168E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.022 | TFLOPs: 18.39 | 31: iteration 57200/ 173500 | consumed samples: 14643200 | consumed tokens: 29989273600 | elapsed time per iteration (s): 0.77 | learning rate: 1.575E-04 | global batch size: 256 | lm loss: 2.050540E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.429 | TFLOPs: 20.05 | 31: iteration 57210/ 173500 | consumed samples: 14645760 | consumed tokens: 29994516480 | elapsed time per iteration (s): 0.80 | learning rate: 1.575E-04 | global batch size: 256 | lm loss: 2.067413E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.395 | TFLOPs: 19.38 | 31: iteration 57220/ 173500 | consumed samples: 14648320 | consumed tokens: 29999759360 | elapsed time per iteration (s): 0.83 | learning rate: 1.575E-04 | global batch size: 256 | lm loss: 
2.016270E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.320 | TFLOPs: 18.65 | 31: iteration 57230/ 173500 | consumed samples: 14650880 | consumed tokens: 30005002240 | elapsed time per iteration (s): 0.81 | learning rate: 1.575E-04 | global batch size: 256 | lm loss: 2.054346E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.458 | TFLOPs: 19.08 | 31: iteration 57240/ 173500 | consumed samples: 14653440 | consumed tokens: 30010245120 | elapsed time per iteration (s): 0.81 | learning rate: 1.575E-04 | global batch size: 256 | lm loss: 2.092865E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.888 | TFLOPs: 19.23 | 31: iteration 57250/ 173500 | consumed samples: 14656000 | consumed tokens: 30015488000 | elapsed time per iteration (s): 0.79 | learning rate: 1.575E-04 | global batch size: 256 | lm loss: 2.069279E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.118 | TFLOPs: 19.49 | 31: iteration 57260/ 173500 | consumed samples: 14658560 | consumed tokens: 30020730880 | elapsed time per iteration (s): 0.83 | learning rate: 1.574E-04 | global batch size: 256 | lm loss: 2.076321E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.530 | TFLOPs: 18.67 | 31: iteration 57270/ 173500 | consumed samples: 14661120 | consumed tokens: 30025973760 | elapsed time per iteration (s): 0.80 | learning rate: 1.574E-04 | global batch size: 256 | lm loss: 2.014395E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.475 | TFLOPs: 19.45 | 31: iteration 57280/ 173500 | consumed samples: 14663680 | consumed tokens: 
30031216640 | elapsed time per iteration (s): 0.80 | learning rate: 1.574E-04 | global batch size: 256 | lm loss: 2.002626E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.667 | TFLOPs: 19.34 | 31: iteration 57290/ 173500 | consumed samples: 14666240 | consumed tokens: 30036459520 | elapsed time per iteration (s): 0.83 | learning rate: 1.574E-04 | global batch size: 256 | lm loss: 2.052801E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.339 | TFLOPs: 18.65 | 31: iteration 57300/ 173500 | consumed samples: 14668800 | consumed tokens: 30041702400 | elapsed time per iteration (s): 0.85 | learning rate: 1.574E-04 | global batch size: 256 | lm loss: 2.040595E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.065 | TFLOPs: 18.27 | 31: iteration 57310/ 173500 | consumed samples: 14671360 | consumed tokens: 30046945280 | elapsed time per iteration (s): 0.78 | learning rate: 1.574E-04 | global batch size: 256 | lm loss: 2.050906E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.917 | TFLOPs: 19.78 | 31: iteration 57320/ 173500 | consumed samples: 14673920 | consumed tokens: 30052188160 | elapsed time per iteration (s): 0.95 | learning rate: 1.574E-04 | global batch size: 256 | lm loss: 2.065566E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 270.195 | TFLOPs: 16.35 | 31: iteration 57330/ 173500 | consumed samples: 14676480 | consumed tokens: 30057431040 | elapsed time per iteration (s): 0.78 | learning rate: 1.573E-04 | global batch size: 256 | lm loss: 2.015094E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 329.304 | TFLOPs: 19.92 | 31: iteration 57340/ 173500 | consumed samples: 14679040 | consumed tokens: 30062673920 | elapsed time per iteration (s): 0.80 | learning rate: 1.573E-04 | global batch size: 256 | lm loss: 2.046583E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.745 | TFLOPs: 19.40 | 31: iteration 57350/ 173500 | consumed samples: 14681600 | consumed tokens: 30067916800 | elapsed time per iteration (s): 0.76 | learning rate: 1.573E-04 | global batch size: 256 | lm loss: 2.029273E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.632 | TFLOPs: 20.30 | 31: iteration 57360/ 173500 | consumed samples: 14684160 | consumed tokens: 30073159680 | elapsed time per iteration (s): 0.78 | learning rate: 1.573E-04 | global batch size: 256 | lm loss: 2.035141E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.146 | TFLOPs: 19.97 | 31: iteration 57370/ 173500 | consumed samples: 14686720 | consumed tokens: 30078402560 | elapsed time per iteration (s): 0.77 | learning rate: 1.573E-04 | global batch size: 256 | lm loss: 2.037919E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.925 | TFLOPs: 20.14 | 31: iteration 57380/ 173500 | consumed samples: 14689280 | consumed tokens: 30083645440 | elapsed time per iteration (s): 0.74 | learning rate: 1.573E-04 | global batch size: 256 | lm loss: 2.084121E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.734 | TFLOPs: 20.86 | 31: iteration 57390/ 173500 | consumed samples: 14691840 | consumed tokens: 30088888320 | elapsed time per iteration (s): 0.74 | learning rate: 1.573E-04 | global batch size: 256 | lm loss: 2.037263E+00 | grad 
norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.857 | TFLOPs: 20.86 | 31: iteration 57400/ 173500 | consumed samples: 14694400 | consumed tokens: 30094131200 | elapsed time per iteration (s): 0.79 | learning rate: 1.572E-04 | global batch size: 256 | lm loss: 2.063423E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.117 | TFLOPs: 19.49 | 31: iteration 57410/ 173500 | consumed samples: 14696960 | consumed tokens: 30099374080 | elapsed time per iteration (s): 0.79 | learning rate: 1.572E-04 | global batch size: 256 | lm loss: 2.075749E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.751 | TFLOPs: 19.53 | 31: iteration 57420/ 173500 | consumed samples: 14699520 | consumed tokens: 30104616960 | elapsed time per iteration (s): 0.82 | learning rate: 1.572E-04 | global batch size: 256 | lm loss: 2.056623E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.134 | TFLOPs: 18.88 | 31: iteration 57430/ 173500 | consumed samples: 14702080 | consumed tokens: 30109859840 | elapsed time per iteration (s): 0.76 | learning rate: 1.572E-04 | global batch size: 256 | lm loss: 2.049096E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.237 | TFLOPs: 20.34 | 31: iteration 57440/ 173500 | consumed samples: 14704640 | consumed tokens: 30115102720 | elapsed time per iteration (s): 0.81 | learning rate: 1.572E-04 | global batch size: 256 | lm loss: 2.015289E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.496 | TFLOPs: 19.21 | 31: iteration 57450/ 173500 | consumed samples: 14707200 | consumed tokens: 30120345600 | elapsed time 
per iteration (s): 0.76 | learning rate: 1.572E-04 | global batch size: 256 | lm loss: 2.045566E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.459 | TFLOPs: 20.35 | 31: iteration 57460/ 173500 | consumed samples: 14709760 | consumed tokens: 30125588480 | elapsed time per iteration (s): 0.80 | learning rate: 1.572E-04 | global batch size: 256 | lm loss: 2.053122E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.694 | TFLOPs: 19.34 | 31: iteration 57470/ 173500 | consumed samples: 14712320 | consumed tokens: 30130831360 | elapsed time per iteration (s): 0.78 | learning rate: 1.571E-04 | global batch size: 256 | lm loss: 2.035785E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.854 | TFLOPs: 19.89 | 31: iteration 57480/ 173500 | consumed samples: 14714880 | consumed tokens: 30136074240 | elapsed time per iteration (s): 0.79 | learning rate: 1.571E-04 | global batch size: 256 | lm loss: 2.051602E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.240 | TFLOPs: 19.62 | 31: iteration 57490/ 173500 | consumed samples: 14717440 | consumed tokens: 30141317120 | elapsed time per iteration (s): 0.78 | learning rate: 1.571E-04 | global batch size: 256 | lm loss: 2.051029E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.095 | TFLOPs: 19.91 | 31: iteration 57500/ 173500 | consumed samples: 14720000 | consumed tokens: 30146560000 | elapsed time per iteration (s): 0.78 | learning rate: 1.571E-04 | global batch size: 256 | lm loss: 2.080024E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.902 | TFLOPs: 
19.90 | 31: iteration 57510/ 173500 | consumed samples: 14722560 | consumed tokens: 30151802880 | elapsed time per iteration (s): 0.82 | learning rate: 1.571E-04 | global batch size: 256 | lm loss: 2.048176E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.003 | TFLOPs: 19.00 | 31: iteration 57520/ 173500 | consumed samples: 14725120 | consumed tokens: 30157045760 | elapsed time per iteration (s): 0.79 | learning rate: 1.571E-04 | global batch size: 256 | lm loss: 2.054811E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.400 | TFLOPs: 19.50 | 31: iteration 57530/ 173500 | consumed samples: 14727680 | consumed tokens: 30162288640 | elapsed time per iteration (s): 0.81 | learning rate: 1.571E-04 | global batch size: 256 | lm loss: 2.073612E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.175 | TFLOPs: 19.19 | 31: iteration 57540/ 173500 | consumed samples: 14730240 | consumed tokens: 30167531520 | elapsed time per iteration (s): 0.84 | learning rate: 1.571E-04 | global batch size: 256 | lm loss: 2.051387E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.993 | TFLOPs: 18.33 | 31: iteration 57550/ 173500 | consumed samples: 14732800 | consumed tokens: 30172774400 | elapsed time per iteration (s): 0.75 | learning rate: 1.570E-04 | global batch size: 256 | lm loss: 2.064672E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.547 | TFLOPs: 20.72 | 31: iteration 57560/ 173500 | consumed samples: 14735360 | consumed tokens: 30178017280 | elapsed time per iteration (s): 0.75 | learning rate: 1.570E-04 | global batch size: 256 | lm loss: 2.016952E+00 | grad norm: 0.135 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.679 | TFLOPs: 20.61 | 31: iteration 57570/ 173500 | consumed samples: 14737920 | consumed tokens: 30183260160 | elapsed time per iteration (s): 0.77 | learning rate: 1.570E-04 | global batch size: 256 | lm loss: 2.056795E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.051 | TFLOPs: 20.15 | 31: iteration 57580/ 173500 | consumed samples: 14740480 | consumed tokens: 30188503040 | elapsed time per iteration (s): 0.82 | learning rate: 1.570E-04 | global batch size: 256 | lm loss: 2.053354E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.483 | TFLOPs: 18.84 | 31: iteration 57590/ 173500 | consumed samples: 14743040 | consumed tokens: 30193745920 | elapsed time per iteration (s): 0.70 | learning rate: 1.570E-04 | global batch size: 256 | lm loss: 2.047492E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 363.167 | TFLOPs: 21.97 | 31: iteration 57600/ 173500 | consumed samples: 14745600 | consumed tokens: 30198988800 | elapsed time per iteration (s): 0.85 | learning rate: 1.570E-04 | global batch size: 256 | lm loss: 2.043760E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.942 | TFLOPs: 18.27 | 31: iteration 57610/ 173500 | consumed samples: 14748160 | consumed tokens: 30204231680 | elapsed time per iteration (s): 0.77 | learning rate: 1.570E-04 | global batch size: 256 | lm loss: 2.029047E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.347 | TFLOPs: 20.11 | 31: iteration 57620/ 173500 | consumed samples: 14750720 | consumed tokens: 30209474560 | elapsed time per iteration (s): 0.82 | 
learning rate: 1.569E-04 | global batch size: 256 | lm loss: 2.051399E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.907 | TFLOPs: 18.99 | 31: iteration 57630/ 173500 | consumed samples: 14753280 | consumed tokens: 30214717440 | elapsed time per iteration (s): 0.77 | learning rate: 1.569E-04 | global batch size: 256 | lm loss: 2.026925E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.508 | TFLOPs: 20.12 | 31: iteration 57640/ 173500 | consumed samples: 14755840 | consumed tokens: 30219960320 | elapsed time per iteration (s): 0.77 | learning rate: 1.569E-04 | global batch size: 256 | lm loss: 2.057204E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.854 | TFLOPs: 20.20 | 31: iteration 57650/ 173500 | consumed samples: 14758400 | consumed tokens: 30225203200 | elapsed time per iteration (s): 0.77 | learning rate: 1.569E-04 | global batch size: 256 | lm loss: 2.025721E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.071 | TFLOPs: 20.09 | 31: iteration 57660/ 173500 | consumed samples: 14760960 | consumed tokens: 30230446080 | elapsed time per iteration (s): 0.77 | learning rate: 1.569E-04 | global batch size: 256 | lm loss: 2.057196E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.546 | TFLOPs: 20.12 | 31: iteration 57670/ 173500 | consumed samples: 14763520 | consumed tokens: 30235688960 | elapsed time per iteration (s): 0.76 | learning rate: 1.569E-04 | global batch size: 256 | lm loss: 2.038023E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.061 | TFLOPs: 20.39 | 31: iteration 57680/ 
173500 | consumed samples: 14766080 | consumed tokens: 30240931840 | elapsed time per iteration (s): 0.76 | learning rate: 1.569E-04 | global batch size: 256 | lm loss: 2.041142E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.623 | TFLOPs: 20.36 | 31: iteration 57690/ 173500 | consumed samples: 14768640 | consumed tokens: 30246174720 | elapsed time per iteration (s): 0.76 | learning rate: 1.568E-04 | global batch size: 256 | lm loss: 2.043733E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.323 | TFLOPs: 20.41 | 31: iteration 57700/ 173500 | consumed samples: 14771200 | consumed tokens: 30251417600 | elapsed time per iteration (s): 0.82 | learning rate: 1.568E-04 | global batch size: 256 | lm loss: 2.098417E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.907 | TFLOPs: 18.99 | 31: iteration 57710/ 173500 | consumed samples: 14773760 | consumed tokens: 30256660480 | elapsed time per iteration (s): 0.76 | learning rate: 1.568E-04 | global batch size: 256 | lm loss: 2.050602E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.555 | TFLOPs: 20.30 | 31: iteration 57720/ 173500 | consumed samples: 14776320 | consumed tokens: 30261903360 | elapsed time per iteration (s): 0.77 | learning rate: 1.568E-04 | global batch size: 256 | lm loss: 2.043713E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.286 | TFLOPs: 20.04 | 31: iteration 57730/ 173500 | consumed samples: 14778880 | consumed tokens: 30267146240 | elapsed time per iteration (s): 0.80 | learning rate: 1.568E-04 | global batch size: 256 | lm loss: 2.021762E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 319.339 | TFLOPs: 19.32 | 31: iteration 57740/ 173500 | consumed samples: 14781440 | consumed tokens: 30272389120 | elapsed time per iteration (s): 0.74 | learning rate: 1.568E-04 | global batch size: 256 | lm loss: 2.052954E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.743 | TFLOPs: 20.86 | 31: iteration 57750/ 173500 | consumed samples: 14784000 | consumed tokens: 30277632000 | elapsed time per iteration (s): 0.78 | learning rate: 1.568E-04 | global batch size: 256 | lm loss: 2.057349E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.479 | TFLOPs: 19.87 | 31: iteration 57760/ 173500 | consumed samples: 14786560 | consumed tokens: 30282874880 | elapsed time per iteration (s): 0.79 | learning rate: 1.567E-04 | global batch size: 256 | lm loss: 2.058832E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.675 | TFLOPs: 19.64 | 31: iteration 57770/ 173500 | consumed samples: 14789120 | consumed tokens: 30288117760 | elapsed time per iteration (s): 0.75 | learning rate: 1.567E-04 | global batch size: 256 | lm loss: 2.036836E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.425 | TFLOPs: 20.59 | 31: iteration 57780/ 173500 | consumed samples: 14791680 | consumed tokens: 30293360640 | elapsed time per iteration (s): 0.76 | learning rate: 1.567E-04 | global batch size: 256 | lm loss: 2.033626E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.946 | TFLOPs: 20.32 | 31: iteration 57790/ 173500 | consumed samples: 14794240 | consumed tokens: 30298603520 | elapsed time per iteration (s): 0.78 | learning rate: 
1.567E-04 | global batch size: 256 | lm loss: 2.048251E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.975 | TFLOPs: 19.90 | 31: iteration 57800/ 173500 | consumed samples: 14796800 | consumed tokens: 30303846400 | elapsed time per iteration (s): 0.79 | learning rate: 1.567E-04 | global batch size: 256 | lm loss: 2.060885E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.288 | TFLOPs: 19.50 | 31: iteration 57810/ 173500 | consumed samples: 14799360 | consumed tokens: 30309089280 | elapsed time per iteration (s): 0.80 | learning rate: 1.567E-04 | global batch size: 256 | lm loss: 2.034681E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.889 | TFLOPs: 19.35 | 31: iteration 57820/ 173500 | consumed samples: 14801920 | consumed tokens: 30314332160 | elapsed time per iteration (s): 0.75 | learning rate: 1.567E-04 | global batch size: 256 | lm loss: 2.053596E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.585 | TFLOPs: 20.67 | 31: iteration 57830/ 173500 | consumed samples: 14804480 | consumed tokens: 30319575040 | elapsed time per iteration (s): 0.80 | learning rate: 1.566E-04 | global batch size: 256 | lm loss: 2.043918E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.025 | TFLOPs: 19.24 | 31: iteration 57840/ 173500 | consumed samples: 14807040 | consumed tokens: 30324817920 | elapsed time per iteration (s): 0.77 | learning rate: 1.566E-04 | global batch size: 256 | lm loss: 2.008629E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.438 | TFLOPs: 19.99 | 31: iteration 57850/ 173500 | 
consumed samples: 14809600 | consumed tokens: 30330060800 | elapsed time per iteration (s): 0.75 | learning rate: 1.566E-04 | global batch size: 256 | lm loss: 2.035856E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.150 | TFLOPs: 20.70 | 31: iteration 57860/ 173500 | consumed samples: 14812160 | consumed tokens: 30335303680 | elapsed time per iteration (s): 0.81 | learning rate: 1.566E-04 | global batch size: 256 | lm loss: 2.058482E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.076 | TFLOPs: 19.18 | 31: iteration 57870/ 173500 | consumed samples: 14814720 | consumed tokens: 30340546560 | elapsed time per iteration (s): 0.75 | learning rate: 1.566E-04 | global batch size: 256 | lm loss: 2.055484E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.421 | TFLOPs: 20.59 | 31: iteration 57880/ 173500 | consumed samples: 14817280 | consumed tokens: 30345789440 | elapsed time per iteration (s): 0.77 | learning rate: 1.566E-04 | global batch size: 256 | lm loss: 2.035098E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.326 | TFLOPs: 19.98 | 31: iteration 57890/ 173500 | consumed samples: 14819840 | consumed tokens: 30351032320 | elapsed time per iteration (s): 0.78 | learning rate: 1.566E-04 | global batch size: 256 | lm loss: 2.034736E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.909 | TFLOPs: 19.90 | 31: iteration 57900/ 173500 | consumed samples: 14822400 | consumed tokens: 30356275200 | elapsed time per iteration (s): 0.74 | learning rate: 1.565E-04 | global batch size: 256 | lm loss: 2.038424E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 345.247 | TFLOPs: 20.89 | 31: iteration 57910/ 173500 | consumed samples: 14824960 | consumed tokens: 30361518080 | elapsed time per iteration (s): 0.79 | learning rate: 1.565E-04 | global batch size: 256 | lm loss: 2.048913E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.412 | TFLOPs: 19.57 | 31: iteration 57920/ 173500 | consumed samples: 14827520 | consumed tokens: 30366760960 | elapsed time per iteration (s): 0.87 | learning rate: 1.565E-04 | global batch size: 256 | lm loss: 2.045380E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.817 | TFLOPs: 17.84 | 31: iteration 57930/ 173500 | consumed samples: 14830080 | consumed tokens: 30372003840 | elapsed time per iteration (s): 0.81 | learning rate: 1.565E-04 | global batch size: 256 | lm loss: 2.059365E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.122 | TFLOPs: 19.00 | 31: iteration 57940/ 173500 | consumed samples: 14832640 | consumed tokens: 30377246720 | elapsed time per iteration (s): 0.81 | learning rate: 1.565E-04 | global batch size: 256 | lm loss: 2.047082E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.698 | TFLOPs: 19.10 | 31: iteration 57950/ 173500 | consumed samples: 14835200 | consumed tokens: 30382489600 | elapsed time per iteration (s): 0.79 | learning rate: 1.565E-04 | global batch size: 256 | lm loss: 2.037904E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.700 | TFLOPs: 19.70 | 31: iteration 57960/ 173500 | consumed samples: 14837760 | consumed tokens: 30387732480 | elapsed time per iteration (s): 0.80 | learning rate: 1.565E-04 | global batch 
size: 256 | lm loss: 2.046056E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.332 | TFLOPs: 19.38 | 31: iteration 57970/ 173500 | consumed samples: 14840320 | consumed tokens: 30392975360 | elapsed time per iteration (s): 0.82 | learning rate: 1.564E-04 | global batch size: 256 | lm loss: 2.048740E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.915 | TFLOPs: 18.87 | 31: iteration 57980/ 173500 | consumed samples: 14842880 | consumed tokens: 30398218240 | elapsed time per iteration (s): 0.82 | learning rate: 1.564E-04 | global batch size: 256 | lm loss: 2.017749E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.051 | TFLOPs: 18.82 | 31: iteration 57990/ 173500 | consumed samples: 14845440 | consumed tokens: 30403461120 | elapsed time per iteration (s): 0.82 | learning rate: 1.564E-04 | global batch size: 256 | lm loss: 2.062382E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.575 | TFLOPs: 18.85 | 0: [2022-11-26 07:06:45,173] [INFO] [logging.py:68:log_dist] [Rank 0] step=58000, skipped=0, lr=[0.00015640412143068475, 0.00015640412143068475, 0.00015640412143068475], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 58000/ 173500 | consumed samples: 14848000 | consumed tokens: 30408704000 | elapsed time per iteration (s): 0.79 | learning rate: 1.564E-04 | global batch size: 256 | lm loss: 2.072591E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.604 | TFLOPs: 19.58 | 0: steps: 58000 loss: 2.0371 iter time (s): 0.794 samples/sec: 322.403 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 
58000 | lm loss value: 1.993652E+00 | lm loss PPL: 7.342300E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 58000 to checkpoints_1b1long 0: [2022-11-26 07:06:45,513] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step58000 is begin to save! 0: [2022-11-26 07:06:45,523] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_01-model_00-model_states.pt... 0: [2022-11-26 07:06:45,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_01-model_00-model_states.pt. 0: [2022-11-26 07:06:45,730] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_03-model_00-model_states.pt... 0: [2022-11-26 07:06:45,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_03-model_00-model_states.pt. 0: [2022-11-26 07:06:45,812] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_04-model_00-model_states.pt... 0: [2022-11-26 07:06:45,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_04-model_00-model_states.pt. 0: [2022-11-26 07:06:45,888] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_05-model_00-model_states.pt... 0: [2022-11-26 07:06:45,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_05-model_00-model_states.pt. 0: [2022-11-26 07:06:45,967] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_06-model_00-model_states.pt... 0: [2022-11-26 07:06:46,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_06-model_00-model_states.pt. 
0: [2022-11-26 07:06:46,043] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_07-model_00-model_states.pt... 0: [2022-11-26 07:06:46,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_07-model_00-model_states.pt. 0: [2022-11-26 07:06:46,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_08-model_00-model_states.pt... 0: [2022-11-26 07:06:46,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_08-model_00-model_states.pt. 0: [2022-11-26 07:06:46,199] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_09-model_00-model_states.pt... 0: [2022-11-26 07:06:46,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_09-model_00-model_states.pt. 0: [2022-11-26 07:06:46,271] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_10-model_00-model_states.pt... 0: [2022-11-26 07:06:46,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_10-model_00-model_states.pt. 0: [2022-11-26 07:06:46,343] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_11-model_00-model_states.pt... 0: [2022-11-26 07:06:46,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_11-model_00-model_states.pt. 0: [2022-11-26 07:06:46,419] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_12-model_00-model_states.pt... 0: [2022-11-26 07:06:46,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_12-model_00-model_states.pt. 
0: [2022-11-26 07:06:46,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_13-model_00-model_states.pt... 0: [2022-11-26 07:06:46,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_13-model_00-model_states.pt. 0: [2022-11-26 07:06:46,568] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_14-model_00-model_states.pt... 0: [2022-11-26 07:06:46,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_14-model_00-model_states.pt. 0: [2022-11-26 07:06:46,639] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_15-model_00-model_states.pt... 0: [2022-11-26 07:06:46,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_15-model_00-model_states.pt. 0: [2022-11-26 07:06:46,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_16-model_00-model_states.pt... 0: [2022-11-26 07:06:46,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_16-model_00-model_states.pt. 0: [2022-11-26 07:06:46,791] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_17-model_00-model_states.pt... 0: [2022-11-26 07:06:46,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_17-model_00-model_states.pt. 0: [2022-11-26 07:06:46,863] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_18-model_00-model_states.pt... 0: [2022-11-26 07:06:46,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_18-model_00-model_states.pt. 
0: [2022-11-26 07:06:46,939] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_19-model_00-model_states.pt... 0: [2022-11-26 07:06:47,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_19-model_00-model_states.pt. 0: [2022-11-26 07:06:47,011] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_20-model_00-model_states.pt... 0: [2022-11-26 07:06:47,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_20-model_00-model_states.pt. 0: [2022-11-26 07:06:47,085] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_21-model_00-model_states.pt... 0: [2022-11-26 07:06:47,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_21-model_00-model_states.pt. 0: [2022-11-26 07:06:47,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_22-model_00-model_states.pt... 0: [2022-11-26 07:06:47,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_22-model_00-model_states.pt. 0: [2022-11-26 07:06:47,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_23-model_00-model_states.pt... 0: [2022-11-26 07:06:47,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_23-model_00-model_states.pt. 0: [2022-11-26 07:06:47,307] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_24-model_00-model_states.pt... 0: [2022-11-26 07:06:47,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_24-model_00-model_states.pt. 
0: [2022-11-26 07:06:47,384] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_25-model_00-model_states.pt... 0: [2022-11-26 07:06:47,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_25-model_00-model_states.pt. 0: [2022-11-26 07:06:47,455] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_26-model_00-model_states.pt... 0: [2022-11-26 07:06:47,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_26-model_00-model_states.pt. 0: [2022-11-26 07:06:47,531] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_27-model_00-model_states.pt... 0: [2022-11-26 07:06:47,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_27-model_00-model_states.pt. 0: [2022-11-26 07:06:47,603] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_28-model_00-model_states.pt... 0: [2022-11-26 07:06:47,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_28-model_00-model_states.pt. 0: [2022-11-26 07:06:47,677] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/layer_30-model_00-model_states.pt... 0: [2022-11-26 07:06:47,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/layer_30-model_00-model_states.pt. 0: [2022-11-26 07:06:47,682] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step58000/mp_rank_00_model_states.pt 0: [2022-11-26 07:06:47,682] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/mp_rank_00_model_states.pt... 
0: [2022-11-26 07:06:47,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/mp_rank_00_model_states.pt. 0: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 18: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 
18: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
5: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
10: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 
2: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
15: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 
23: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
14: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 
30: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 
18: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
0: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 
9: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 
1: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
12: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 
23: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
14: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 
17: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 
19: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
7: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
2: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 
11: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 
27: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 
14: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
3: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:06:47,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:06:47,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:06:47,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 07:06:47,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 31: [2022-11-26 07:06:47,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 07:06:47,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 07:06:47,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 18: [2022-11-26 07:06:47,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
18: [2022-11-26 07:06:47,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 19: [2022-11-26 07:06:47,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:06:47,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 19: [2022-11-26 07:06:47,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 07:06:47,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 7: [2022-11-26 07:06:47,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:06:47,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 07:06:47,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 17: [2022-11-26 07:06:47,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 26: [2022-11-26 07:06:47,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
17: [2022-11-26 07:06:47,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 26: [2022-11-26 07:06:47,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 07:06:47,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 17: [2022-11-26 07:06:47,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 9: [2022-11-26 07:06:47,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:06:47,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 07:06:47,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 29: [2022-11-26 07:06:47,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 07:06:47,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 07:06:47,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 30: [2022-11-26 07:06:47,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 
30: [2022-11-26 07:06:47,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 07:06:47,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 5: [2022-11-26 07:06:47,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:06:47,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 07:06:47,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 24: [2022-11-26 07:06:47,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:06:47,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 24: [2022-11-26 07:06:47,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 5: [2022-11-26 07:06:47,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 24: [2022-11-26 07:06:47,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 5: [2022-11-26 07:06:47,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 13: [2022-11-26 07:06:47,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 07:06:47,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 3: [2022-11-26 07:06:47,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:06:47,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 3: [2022-11-26 07:06:47,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 07:06:47,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 14: [2022-11-26 07:06:47,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:06:47,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:06:47,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 21: [2022-11-26 07:06:47,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 14: [2022-11-26 07:06:47,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 21: [2022-11-26 07:06:47,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 26: [2022-11-26 07:06:47,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
26: [2022-11-26 07:06:47,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 07:06:47,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 15: [2022-11-26 07:06:47,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:06:47,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 07:06:47,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 23: [2022-11-26 07:06:47,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 07:06:47,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 07:06:47,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 23: [2022-11-26 07:06:47,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 07:06:47,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 07:06:47,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 27: [2022-11-26 07:06:47,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
27: [2022-11-26 07:06:47,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 07:06:47,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 24: [2022-11-26 07:06:47,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 07:06:47,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 18: [2022-11-26 07:06:47,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 24: [2022-11-26 07:06:47,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 29: [2022-11-26 07:06:47,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:06:47,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 29: [2022-11-26 07:06:47,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 18: [2022-11-26 07:06:47,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 29: [2022-11-26 07:06:47,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 7: [2022-11-26 07:06:47,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 07:06:47,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 15: [2022-11-26 07:06:47,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:06:47,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:06:47,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 15: [2022-11-26 07:06:47,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 07:06:47,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 19: [2022-11-26 07:06:47,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 07:06:47,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 14: [2022-11-26 07:06:47,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:06:47,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 07:06:47,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 30: [2022-11-26 07:06:47,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
28: [2022-11-26 07:06:47,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:06:47,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 12: [2022-11-26 07:06:47,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:06:47,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 12: [2022-11-26 07:06:47,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 30: [2022-11-26 07:06:47,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:06:47,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 30: [2022-11-26 07:06:47,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 07:06:47,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 4: [2022-11-26 07:06:47,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 07:06:47,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 07:06:47,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:06:47,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 0: [2022-11-26 07:06:47,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:06:47,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 12: [2022-11-26 07:06:47,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:06:47,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 12: [2022-11-26 07:06:47,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 0: [2022-11-26 07:06:47,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 12: [2022-11-26 07:06:47,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 0: [2022-11-26 07:06:47,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 8: [2022-11-26 07:06:47,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 07:06:47,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 07:06:47,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 8: [2022-11-26 07:06:47,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:06:47,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 07:06:47,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 9: [2022-11-26 07:06:47,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:06:47,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 07:06:47,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 9: [2022-11-26 07:06:47,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:06:47,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 07:06:47,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 16: [2022-11-26 07:06:47,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 
22: [2022-11-26 07:06:47,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 16: [2022-11-26 07:06:47,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 22: [2022-11-26 07:06:47,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 16: [2022-11-26 07:06:47,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 22: [2022-11-26 07:06:47,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 26: [2022-11-26 07:06:47,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 07:06:47,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 07:06:47,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 11: [2022-11-26 07:06:47,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:06:47,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 07:06:47,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 15: [2022-11-26 07:06:47,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-26 07:06:47,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 18: [2022-11-26 07:06:47,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:06:47,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 22: [2022-11-26 07:06:47,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:06:47,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 22: [2022-11-26 07:06:47,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 18: [2022-11-26 07:06:47,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 22: [2022-11-26 07:06:47,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 11: [2022-11-26 07:06:47,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:06:47,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 07:06:47,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 14: [2022-11-26 07:06:47,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
24: [2022-11-26 07:06:47,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:06:47,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 07:06:47,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 24: [2022-11-26 07:06:47,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 07:06:47,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 3: [2022-11-26 07:06:47,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:06:47,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 07:06:47,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 19: [2022-11-26 07:06:47,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:06:47,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 07:06:47,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 18: [2022-11-26 07:06:47,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
26: [2022-11-26 07:06:47,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:06:47,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 26: [2022-11-26 07:06:47,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 18: [2022-11-26 07:06:47,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 26: [2022-11-26 07:06:47,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 13: [2022-11-26 07:06:47,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:06:47,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 07:06:47,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 7: [2022-11-26 07:06:47,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:06:47,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 07:06:47,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 19: [2022-11-26 07:06:47,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-26 07:06:47,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 07:06:47,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 11: [2022-11-26 07:06:47,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:06:47,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 07:06:47,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 27: [2022-11-26 07:06:47,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:06:47,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:06:47,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 07:06:47,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 30: [2022-11-26 07:06:47,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 07:06:47,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 27: [2022-11-26 07:06:47,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
16: [2022-11-26 07:06:47,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:06:47,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:06:47,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 3: [2022-11-26 07:06:47,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 16: [2022-11-26 07:06:47,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 3: [2022-11-26 07:06:47,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 27: [2022-11-26 07:06:47,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 16: [2022-11-26 07:06:47,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 29: [2022-11-26 07:06:47,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 07:06:47,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 07:06:47,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 07:06:47,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
29: [2022-11-26 07:06:47,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 07:06:47,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 24: [2022-11-26 07:06:47,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 07:06:47,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 07:06:47,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 12: [2022-11-26 07:06:47,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:06:47,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 07:06:47,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 20: [2022-11-26 07:06:47,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 07:06:47,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
20: [2022-11-26 07:06:47,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 07:06:47,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 07:06:47,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 20: [2022-11-26 07:06:47,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 14: [2022-11-26 07:06:47,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:06:47,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 07:06:47,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 28: [2022-11-26 07:06:47,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 07:06:47,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 28: [2022-11-26 07:06:47,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 07:06:47,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 07:06:47,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
28: [2022-11-26 07:06:47,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 23: [2022-11-26 07:06:47,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 28: [2022-11-26 07:06:47,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 23: [2022-11-26 07:06:47,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 28: [2022-11-26 07:06:47,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 23: [2022-11-26 07:06:47,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 28: [2022-11-26 07:06:47,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 07:06:47,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 07:06:47,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 4: [2022-11-26 07:06:47,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:06:47,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 07:06:47,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
23: [2022-11-26 07:06:47,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 07:06:47,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 07:06:47,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 0: [2022-11-26 07:06:47,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 31: [2022-11-26 07:06:47,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 07:06:47,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 0: [2022-11-26 07:06:47,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 31: [2022-11-26 07:06:47,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 0: [2022-11-26 07:06:47,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 31: [2022-11-26 07:06:47,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 07:06:47,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 07:06:47,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
8: [2022-11-26 07:06:47,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:06:47,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 07:06:47,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 8: [2022-11-26 07:06:47,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:06:47,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:06:47,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 3: [2022-11-26 07:06:47,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 8: [2022-11-26 07:06:47,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 3: [2022-11-26 07:06:47,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 11: [2022-11-26 07:06:47,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:06:47,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 07:06:47,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
13: [2022-11-26 07:06:47,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:06:47,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 21: [2022-11-26 07:06:47,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:06:47,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:06:47,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:06:47,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 07:06:47,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 07:06:47,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 13: [2022-11-26 07:06:47,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 21: [2022-11-26 07:06:47,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 21: [2022-11-26 07:06:47,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 21: [2022-11-26 07:06:47,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
22: [2022-11-26 07:06:47,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 07:06:47,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 07:06:47,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 2: [2022-11-26 07:06:47,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:06:47,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:06:47,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:06:47,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 15: [2022-11-26 07:06:47,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 2: [2022-11-26 07:06:47,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 07:06:47,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 15: [2022-11-26 07:06:47,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 2: [2022-11-26 07:06:47,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-26 07:06:47,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 2: [2022-11-26 07:06:47,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 07:06:47,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 7: [2022-11-26 07:06:47,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:06:47,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 5: [2022-11-26 07:06:47,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:06:47,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:06:47,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 5: [2022-11-26 07:06:47,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 07:06:47,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 07:06:47,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 5: [2022-11-26 07:06:47,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
9: [2022-11-26 07:06:47,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:06:47,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 07:06:47,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 4: [2022-11-26 07:06:47,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:06:47,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 07:06:47,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 22: [2022-11-26 07:06:47,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 07:06:47,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 07:06:47,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 31: [2022-11-26 07:06:47,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 07:06:47,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 07:06:47,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
20: [2022-11-26 07:06:47,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 07:06:47,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 07:06:47,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 27: [2022-11-26 07:06:47,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:06:47,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 12: [2022-11-26 07:06:47,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:06:47,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 27: [2022-11-26 07:06:47,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 12: [2022-11-26 07:06:47,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 17: [2022-11-26 07:06:47,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 07:06:47,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 07:06:47,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
17: [2022-11-26 07:06:47,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 07:06:47,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 07:06:47,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 17: [2022-11-26 07:06:47,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 07:06:47,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 07:06:47,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 10: [2022-11-26 07:06:47,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:06:47,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:06:47,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:06:47,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:06:47,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 07:06:47,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
10: [2022-11-26 07:06:47,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 07:06:47,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 07:06:47,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 07:06:47,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 10: [2022-11-26 07:06:47,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 10: [2022-11-26 07:06:47,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 1: [2022-11-26 07:06:47,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:06:47,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:06:47,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 07:06:47,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 07:06:47,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 07:06:47,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 07:06:47,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 1: [2022-11-26 07:06:47,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 16: [2022-11-26 07:06:47,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:06:47,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 16: [2022-11-26 07:06:47,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 07:06:47,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 16: [2022-11-26 07:06:47,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 07:06:47,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 07:06:47,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
19: [2022-11-26 07:06:47,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:06:47,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 07:06:47,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 5: [2022-11-26 07:06:47,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:06:47,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 07:06:47,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 0: [2022-11-26 07:06:47,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:06:47,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:06:47,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:06:47,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 07:06:47,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 07:06:47,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
0: [2022-11-26 07:06:47,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 17: [2022-11-26 07:06:47,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 07:06:47,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 07:06:47,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 0: [2022-11-26 07:06:47,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 07:06:47,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 18: [2022-11-26 07:06:47,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:06:47,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 07:06:47,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 12: [2022-11-26 07:06:47,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:06:47,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 07:06:47,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
6: [2022-11-26 07:06:47,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:06:47,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:06:47,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:06:47,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:06:47,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:06:47,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 07:06:47,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 07:06:47,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 07:06:47,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 07:06:47,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 07:06:47,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
6: [2022-11-26 07:06:47,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 6: [2022-11-26 07:06:47,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 6: [2022-11-26 07:06:47,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 6: [2022-11-26 07:06:47,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 25: [2022-11-26 07:06:47,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 07:06:47,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 07:06:47,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 07:06:47,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
25: [2022-11-26 07:06:47,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 07:06:47,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 07:06:47,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 07:06:47,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 07:06:47,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 25: [2022-11-26 07:06:47,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 25: [2022-11-26 07:06:47,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 25: [2022-11-26 07:06:47,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 25: [2022-11-26 07:06:47,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 07:06:47,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 07:06:47,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 23: [2022-11-26 07:06:47,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
23: [2022-11-26 07:06:47,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 07:06:47,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 14: [2022-11-26 07:06:47,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:06:47,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 07:06:47,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 9: [2022-11-26 07:06:47,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:06:47,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 07:06:47,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 28: [2022-11-26 07:06:47,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 07:06:47,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 07:06:47,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 13: [2022-11-26 07:06:47,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 07:06:47,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 07:06:47,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 7: [2022-11-26 07:06:47,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:06:47,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 07:06:47,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 21: [2022-11-26 07:06:47,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:06:47,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 07:06:47,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 16: [2022-11-26 07:06:47,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 07:06:47,885] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 07:06:47,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 30: [2022-11-26 07:06:47,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-26 07:06:47,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 07:06:47,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 24: [2022-11-26 07:06:47,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 07:06:47,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 07:06:47,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 29: [2022-11-26 07:06:47,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 07:06:47,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 07:06:47,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 3: [2022-11-26 07:06:47,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:06:47,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 07:06:47,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 15: [2022-11-26 07:06:47,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 07:06:47,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 07:06:47,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 31: [2022-11-26 07:06:47,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 07:06:47,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 07:06:47,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 10: [2022-11-26 07:06:47,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:06:47,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 07:06:47,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 2: [2022-11-26 07:06:47,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:06:47,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 07:06:47,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 4: [2022-11-26 07:06:47,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-26 07:06:47,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 07:06:47,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 8: [2022-11-26 07:06:47,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:06:47,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 07:06:47,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 11: [2022-11-26 07:06:47,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:06:47,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 07:06:47,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 27: [2022-11-26 07:06:47,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:06:47,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 07:06:47,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 1: [2022-11-26 07:06:47,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 07:06:47,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 07:06:47,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 22: [2022-11-26 07:06:47,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 07:06:47,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 07:06:47,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 20: [2022-11-26 07:06:47,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 07:06:47,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 07:06:47,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 0: [2022-11-26 07:06:47,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:06:47,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 07:06:47,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 5: [2022-11-26 07:06:47,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-26 07:06:47,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 07:06:47,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 19: [2022-11-26 07:06:47,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:06:47,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 07:06:47,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 18: [2022-11-26 07:06:47,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:06:47,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 07:06:47,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 25: [2022-11-26 07:06:47,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 07:06:47,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 07:06:47,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 12: [2022-11-26 07:06:47,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 07:06:47,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 07:06:47,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 16: [2022-11-26 07:06:47,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 07:06:47,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 07:06:47,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 9: [2022-11-26 07:06:47,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:06:47,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 07:06:47,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 17: [2022-11-26 07:06:47,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 07:06:47,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 07:06:47,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 23: [2022-11-26 07:06:47,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
29: [2022-11-26 07:06:47,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 07:06:47,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 23: [2022-11-26 07:06:47,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 07:06:47,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 29: [2022-11-26 07:06:47,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 14: [2022-11-26 07:06:47,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:06:47,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 07:06:47,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 31: [2022-11-26 07:06:47,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:06:47,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 31: [2022-11-26 07:06:47,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 07:06:47,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
7: [2022-11-26 07:06:47,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 3: [2022-11-26 07:06:47,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:06:47,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 3: [2022-11-26 07:06:47,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 07:06:47,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 6: [2022-11-26 07:06:47,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:06:47,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 07:06:47,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 30: [2022-11-26 07:06:47,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:06:47,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 07:06:47,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 28: [2022-11-26 07:06:47,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
26: [2022-11-26 07:06:47,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 28: [2022-11-26 07:06:47,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 26: [2022-11-26 07:06:47,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 07:06:47,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 28: [2022-11-26 07:06:47,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 26: [2022-11-26 07:06:47,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 07:06:47,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 07:06:47,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 15: [2022-11-26 07:06:47,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:06:47,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 07:06:47,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 0: [2022-11-26 07:06:47,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 07:06:47,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 07:06:47,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 10: [2022-11-26 07:06:47,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:06:47,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 07:06:47,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 21: [2022-11-26 07:06:47,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:06:47,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 07:06:47,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 4: [2022-11-26 07:06:47,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:06:47,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 07:06:47,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 2: [2022-11-26 07:06:47,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 07:06:47,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 07:06:47,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 22: [2022-11-26 07:06:47,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 07:06:47,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 07:06:47,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 8: [2022-11-26 07:06:47,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:06:47,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 07:06:47,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 13: [2022-11-26 07:06:47,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:06:47,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 07:06:47,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 24: [2022-11-26 07:06:47,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-26 07:06:47,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 1: [2022-11-26 07:06:47,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 24: [2022-11-26 07:06:47,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 1: [2022-11-26 07:06:47,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 07:06:47,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 19: [2022-11-26 07:06:47,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:06:47,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 07:06:47,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 17: [2022-11-26 07:06:47,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 07:06:47,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 07:06:47,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 20: [2022-11-26 07:06:47,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-26 07:06:47,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 07:06:47,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 11: [2022-11-26 07:06:47,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 25: [2022-11-26 07:06:47,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:06:47,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 07:06:47,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 25: [2022-11-26 07:06:47,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 07:06:47,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 27: [2022-11-26 07:06:47,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 28: [2022-11-26 07:06:47,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:06:47,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 07:06:47,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
28: [2022-11-26 07:06:47,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 07:06:47,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 12: [2022-11-26 07:06:47,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:06:47,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 07:06:47,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 6: [2022-11-26 07:06:47,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:06:47,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 07:06:47,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 14: [2022-11-26 07:06:47,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:06:47,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 07:06:47,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 18: [2022-11-26 07:06:47,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
18: [2022-11-26 07:06:47,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 07:06:47,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 13: [2022-11-26 07:06:47,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:06:47,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 07:06:47,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 23: [2022-11-26 07:06:47,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 07:06:47,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 07:06:47,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 7: [2022-11-26 07:06:47,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:06:47,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 07:06:47,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 30: [2022-11-26 07:06:47,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 
9: [2022-11-26 07:06:47,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:06:47,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 07:06:47,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 9: [2022-11-26 07:06:47,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 07:06:47,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 16: [2022-11-26 07:06:47,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 07:06:47,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 07:06:47,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 5: [2022-11-26 07:06:47,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:06:47,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 07:06:47,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 24: [2022-11-26 07:06:47,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-26 07:06:47,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 07:06:47,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 21: [2022-11-26 07:06:47,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:06:47,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 07:06:47,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 4: [2022-11-26 07:06:47,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:06:47,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 07:06:47,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 26: [2022-11-26 07:06:47,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 07:06:47,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 07:06:47,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 29: [2022-11-26 07:06:47,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
29: [2022-11-26 07:06:47,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 07:06:47,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 11: [2022-11-26 07:06:47,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:06:47,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 07:06:47,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 8: [2022-11-26 07:06:47,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:06:47,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 07:06:47,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 31: [2022-11-26 07:06:47,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 07:06:47,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 07:06:47,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 15: [2022-11-26 07:06:47,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 07:06:47,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 07:06:47,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 10: [2022-11-26 07:06:47,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:06:47,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 07:06:47,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 3: [2022-11-26 07:06:47,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:06:47,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 07:06:47,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 17: [2022-11-26 07:06:47,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 07:06:47,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 07:06:47,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 2: [2022-11-26 07:06:47,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
19: [2022-11-26 07:06:47,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:06:47,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 19: [2022-11-26 07:06:47,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 07:06:47,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 2: [2022-11-26 07:06:47,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 25: [2022-11-26 07:06:47,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 07:06:47,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 07:06:47,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 29: [2022-11-26 07:06:47,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 07:06:47,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 07:06:47,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 27: [2022-11-26 07:06:47,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
31: [2022-11-26 07:06:47,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:06:47,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 07:06:47,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 31: [2022-11-26 07:06:47,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 07:06:47,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 18: [2022-11-26 07:06:47,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:06:47,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:06:47,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 0: [2022-11-26 07:06:47,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 18: [2022-11-26 07:06:47,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 0: [2022-11-26 07:06:47,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 24: [2022-11-26 07:06:47,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 
9: [2022-11-26 07:06:47,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:06:47,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 07:06:47,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 24: [2022-11-26 07:06:47,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 07:06:47,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 10: [2022-11-26 07:06:47,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:06:47,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 15: [2022-11-26 07:06:47,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:06:47,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 15: [2022-11-26 07:06:47,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 07:06:47,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 16: [2022-11-26 07:06:47,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
5: [2022-11-26 07:06:47,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 16: [2022-11-26 07:06:47,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 5: [2022-11-26 07:06:47,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 16: [2022-11-26 07:06:47,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 5: [2022-11-26 07:06:47,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 28: [2022-11-26 07:06:47,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 07:06:47,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 07:06:47,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 12: [2022-11-26 07:06:47,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:06:47,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
12: [2022-11-26 07:06:47,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 30: [2022-11-26 07:06:47,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:06:47,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:06:47,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 7: [2022-11-26 07:06:47,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:06:47,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 12: [2022-11-26 07:06:47,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 30: [2022-11-26 07:06:47,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 07:06:47,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 7: [2022-11-26 07:06:47,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 1: [2022-11-26 07:06:47,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 07:06:47,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
7: [2022-11-26 07:06:47,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 14: [2022-11-26 07:06:47,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:06:47,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:06:47,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 6: [2022-11-26 07:06:47,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 14: [2022-11-26 07:06:47,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 26: [2022-11-26 07:06:47,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:06:47,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 26: [2022-11-26 07:06:47,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 07:06:47,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 8: [2022-11-26 07:06:47,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-26 07:06:47,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 07:06:47,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 4: [2022-11-26 07:06:47,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:06:47,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 07:06:47,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 20: [2022-11-26 07:06:47,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 07:06:47,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 07:06:47,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 22: [2022-11-26 07:06:47,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 07:06:47,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 07:06:47,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 23: [2022-11-26 07:06:47,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
3: [2022-11-26 07:06:47,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 23: [2022-11-26 07:06:47,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 07:06:47,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 3: [2022-11-26 07:06:47,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 07:06:47,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 27: [2022-11-26 07:06:47,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:06:47,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:06:47,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 27: [2022-11-26 07:06:47,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 21: [2022-11-26 07:06:47,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 27: [2022-11-26 07:06:47,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 20: [2022-11-26 07:06:47,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
20: [2022-11-26 07:06:47,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 07:06:47,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 11: [2022-11-26 07:06:47,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:06:47,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 07:06:47,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 1: [2022-11-26 07:06:47,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:06:47,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 2: [2022-11-26 07:06:47,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 22: [2022-11-26 07:06:47,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:06:47,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
2: [2022-11-26 07:06:47,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 22: [2022-11-26 07:06:47,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 07:06:47,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 2: [2022-11-26 07:06:47,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 1: [2022-11-26 07:06:47,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:06:47,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 07:06:47,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 20: [2022-11-26 07:06:47,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 07:06:47,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 07:06:47,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 2: [2022-11-26 07:06:47,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 07:06:47,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step58000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 07:06:47,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 0: successfully saved checkpoint at iteration 58000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2505.04 31: iteration 58010/ 173500 | consumed samples: 14850560 | consumed tokens: 30413946880 | elapsed time per iteration (s): 1.08 | learning rate: 1.564E-04 | global batch size: 256 | lm loss: 2.021144E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.683 | TFLOPs: 14.38 | 31: iteration 58020/ 173500 | consumed samples: 14853120 | consumed tokens: 30419189760 | elapsed time per iteration (s): 0.79 | learning rate: 1.564E-04 | global batch size: 256 | lm loss: 2.003494E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.730 | TFLOPs: 19.58 | 31: iteration 58030/ 173500 | consumed samples: 14855680 | consumed tokens: 30424432640 | elapsed time per iteration (s): 0.82 | learning rate: 1.564E-04 | global batch size: 256 | lm loss: 2.028656E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.891 | TFLOPs: 18.99 | 31: iteration 58040/ 173500 | consumed samples: 14858240 | consumed tokens: 30429675520 | elapsed time per iteration (s): 0.79 | learning rate: 1.563E-04 | global batch size: 256 | lm loss: 2.052077E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.351 | TFLOPs: 19.68 | 31: iteration 58050/ 173500 | consumed samples: 14860800 | consumed tokens: 30434918400 | elapsed time per iteration (s): 0.80 | learning rate: 1.563E-04 | global 
batch size: 256 | lm loss: 2.069073E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.768 | TFLOPs: 19.47 | 31: iteration 58060/ 173500 | consumed samples: 14863360 | consumed tokens: 30440161280 | elapsed time per iteration (s): 0.75 | learning rate: 1.563E-04 | global batch size: 256 | lm loss: 2.050636E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.332 | TFLOPs: 20.77 | 31: iteration 58070/ 173500 | consumed samples: 14865920 | consumed tokens: 30445404160 | elapsed time per iteration (s): 0.83 | learning rate: 1.563E-04 | global batch size: 256 | lm loss: 2.021355E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.976 | TFLOPs: 18.69 | 31: iteration 58080/ 173500 | consumed samples: 14868480 | consumed tokens: 30450647040 | elapsed time per iteration (s): 0.80 | learning rate: 1.563E-04 | global batch size: 256 | lm loss: 2.056298E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.417 | TFLOPs: 19.32 | 31: iteration 58090/ 173500 | consumed samples: 14871040 | consumed tokens: 30455889920 | elapsed time per iteration (s): 0.78 | learning rate: 1.563E-04 | global batch size: 256 | lm loss: 2.069170E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.894 | TFLOPs: 19.78 | 31: iteration 58100/ 173500 | consumed samples: 14873600 | consumed tokens: 30461132800 | elapsed time per iteration (s): 0.78 | learning rate: 1.563E-04 | global batch size: 256 | lm loss: 2.063454E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.469 | TFLOPs: 19.81 | 31: iteration 58110/ 173500 | consumed samples: 14876160 
| consumed tokens: 30466375680 | elapsed time per iteration (s): 0.82 | learning rate: 1.562E-04 | global batch size: 256 | lm loss: 2.022450E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.608 | TFLOPs: 18.91 | 31: iteration 58120/ 173500 | consumed samples: 14878720 | consumed tokens: 30471618560 | elapsed time per iteration (s): 0.76 | learning rate: 1.562E-04 | global batch size: 256 | lm loss: 2.042872E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.954 | TFLOPs: 20.32 | 31: iteration 58130/ 173500 | consumed samples: 14881280 | consumed tokens: 30476861440 | elapsed time per iteration (s): 0.73 | learning rate: 1.562E-04 | global batch size: 256 | lm loss: 2.024968E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.492 | TFLOPs: 21.08 | 31: iteration 58140/ 173500 | consumed samples: 14883840 | consumed tokens: 30482104320 | elapsed time per iteration (s): 0.75 | learning rate: 1.562E-04 | global batch size: 256 | lm loss: 2.066712E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.150 | TFLOPs: 20.64 | 31: iteration 58150/ 173500 | consumed samples: 14886400 | consumed tokens: 30487347200 | elapsed time per iteration (s): 0.79 | learning rate: 1.562E-04 | global batch size: 256 | lm loss: 2.015178E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.501 | TFLOPs: 19.69 | 31: iteration 58160/ 173500 | consumed samples: 14888960 | consumed tokens: 30492590080 | elapsed time per iteration (s): 0.73 | learning rate: 1.562E-04 | global batch size: 256 | lm loss: 2.058895E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 
0 | samples per second: 350.659 | TFLOPs: 21.21 | 31: iteration 58170/ 173500 | consumed samples: 14891520 | consumed tokens: 30497832960 | elapsed time per iteration (s): 0.74 | learning rate: 1.562E-04 | global batch size: 256 | lm loss: 2.029935E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.456 | TFLOPs: 21.02 | 31: iteration 58180/ 173500 | consumed samples: 14894080 | consumed tokens: 30503075840 | elapsed time per iteration (s): 0.77 | learning rate: 1.561E-04 | global batch size: 256 | lm loss: 2.012416E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.609 | TFLOPs: 20.24 | 31: iteration 58190/ 173500 | consumed samples: 14896640 | consumed tokens: 30508318720 | elapsed time per iteration (s): 0.78 | learning rate: 1.561E-04 | global batch size: 256 | lm loss: 2.085019E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.144 | TFLOPs: 19.91 | 31: iteration 58200/ 173500 | consumed samples: 14899200 | consumed tokens: 30513561600 | elapsed time per iteration (s): 0.78 | learning rate: 1.561E-04 | global batch size: 256 | lm loss: 2.056709E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.198 | TFLOPs: 19.92 | 31: iteration 58210/ 173500 | consumed samples: 14901760 | consumed tokens: 30518804480 | elapsed time per iteration (s): 0.76 | learning rate: 1.561E-04 | global batch size: 256 | lm loss: 2.034422E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.164 | TFLOPs: 20.28 | 31: iteration 58220/ 173500 | consumed samples: 14904320 | consumed tokens: 30524047360 | elapsed time per iteration (s): 0.76 | learning rate: 1.561E-04 | global batch size: 256 | lm loss: 
2.037306E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.983 | TFLOPs: 20.39 | 31: iteration 58230/ 173500 | consumed samples: 14906880 | consumed tokens: 30529290240 | elapsed time per iteration (s): 0.80 | learning rate: 1.561E-04 | global batch size: 256 | lm loss: 2.067150E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.482 | TFLOPs: 19.39 | 31: iteration 58240/ 173500 | consumed samples: 14909440 | consumed tokens: 30534533120 | elapsed time per iteration (s): 0.79 | learning rate: 1.561E-04 | global batch size: 256 | lm loss: 2.046371E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.951 | TFLOPs: 19.60 | 31: iteration 58250/ 173500 | consumed samples: 14912000 | consumed tokens: 30539776000 | elapsed time per iteration (s): 0.79 | learning rate: 1.561E-04 | global batch size: 256 | lm loss: 2.064024E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.111 | TFLOPs: 19.55 | 31: iteration 58260/ 173500 | consumed samples: 14914560 | consumed tokens: 30545018880 | elapsed time per iteration (s): 0.78 | learning rate: 1.560E-04 | global batch size: 256 | lm loss: 2.048817E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.411 | TFLOPs: 19.81 | 31: iteration 58270/ 173500 | consumed samples: 14917120 | consumed tokens: 30550261760 | elapsed time per iteration (s): 0.76 | learning rate: 1.560E-04 | global batch size: 256 | lm loss: 1.996727E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.273 | TFLOPs: 20.40 | 31: iteration 58280/ 173500 | consumed samples: 14919680 | consumed tokens: 
30555504640 | elapsed time per iteration (s): 0.74 | learning rate: 1.560E-04 | global batch size: 256 | lm loss: 2.002034E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.008 | TFLOPs: 20.81 | 31: iteration 58290/ 173500 | consumed samples: 14922240 | consumed tokens: 30560747520 | elapsed time per iteration (s): 0.77 | learning rate: 1.560E-04 | global batch size: 256 | lm loss: 2.072945E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.286 | TFLOPs: 20.22 | 31: iteration 58300/ 173500 | consumed samples: 14924800 | consumed tokens: 30565990400 | elapsed time per iteration (s): 0.77 | learning rate: 1.560E-04 | global batch size: 256 | lm loss: 2.021577E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.343 | TFLOPs: 20.23 | 31: iteration 58310/ 173500 | consumed samples: 14927360 | consumed tokens: 30571233280 | elapsed time per iteration (s): 0.77 | learning rate: 1.560E-04 | global batch size: 256 | lm loss: 2.044655E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.661 | TFLOPs: 20.06 | 31: iteration 58320/ 173500 | consumed samples: 14929920 | consumed tokens: 30576476160 | elapsed time per iteration (s): 0.75 | learning rate: 1.560E-04 | global batch size: 256 | lm loss: 2.054719E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.920 | TFLOPs: 20.56 | 31: iteration 58330/ 173500 | consumed samples: 14932480 | consumed tokens: 30581719040 | elapsed time per iteration (s): 0.77 | learning rate: 1.559E-04 | global batch size: 256 | lm loss: 2.024441E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 334.311 | TFLOPs: 20.22 | 31: iteration 58340/ 173500 | consumed samples: 14935040 | consumed tokens: 30586961920 | elapsed time per iteration (s): 0.74 | learning rate: 1.559E-04 | global batch size: 256 | lm loss: 2.013656E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.064 | TFLOPs: 20.81 | 31: iteration 58350/ 173500 | consumed samples: 14937600 | consumed tokens: 30592204800 | elapsed time per iteration (s): 0.75 | learning rate: 1.559E-04 | global batch size: 256 | lm loss: 2.055219E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.391 | TFLOPs: 20.71 | 31: iteration 58360/ 173500 | consumed samples: 14940160 | consumed tokens: 30597447680 | elapsed time per iteration (s): 0.85 | learning rate: 1.559E-04 | global batch size: 256 | lm loss: 2.062100E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.872 | TFLOPs: 18.32 | 31: iteration 58370/ 173500 | consumed samples: 14942720 | consumed tokens: 30602690560 | elapsed time per iteration (s): 0.81 | learning rate: 1.559E-04 | global batch size: 256 | lm loss: 2.072154E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.380 | TFLOPs: 19.08 | 31: iteration 58380/ 173500 | consumed samples: 14945280 | consumed tokens: 30607933440 | elapsed time per iteration (s): 0.78 | learning rate: 1.559E-04 | global batch size: 256 | lm loss: 2.018669E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.236 | TFLOPs: 19.86 | 31: iteration 58390/ 173500 | consumed samples: 14947840 | consumed tokens: 30613176320 | elapsed time per iteration (s): 0.78 | learning rate: 1.559E-04 | global batch size: 256 | lm loss: 2.054507E+00 | grad 
norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.521 | TFLOPs: 19.75 | 31: iteration 58400/ 173500 | consumed samples: 14950400 | consumed tokens: 30618419200 | elapsed time per iteration (s): 0.73 | learning rate: 1.558E-04 | global batch size: 256 | lm loss: 2.134084E+00 | grad norm: 1.969 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.840 | TFLOPs: 21.35 | 31: iteration 58410/ 173500 | consumed samples: 14952960 | consumed tokens: 30623662080 | elapsed time per iteration (s): 0.75 | learning rate: 1.558E-04 | global batch size: 256 | lm loss: 2.199527E+00 | grad norm: 0.315 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.862 | TFLOPs: 20.68 | 31: iteration 58420/ 173500 | consumed samples: 14955520 | consumed tokens: 30628904960 | elapsed time per iteration (s): 0.80 | learning rate: 1.558E-04 | global batch size: 256 | lm loss: 2.125363E+00 | grad norm: 0.232 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.992 | TFLOPs: 19.36 | 31: iteration 58430/ 173500 | consumed samples: 14958080 | consumed tokens: 30634147840 | elapsed time per iteration (s): 0.76 | learning rate: 1.558E-04 | global batch size: 256 | lm loss: 2.065452E+00 | grad norm: 0.239 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.021 | TFLOPs: 20.39 | 31: iteration 58440/ 173500 | consumed samples: 14960640 | consumed tokens: 30639390720 | elapsed time per iteration (s): 0.78 | learning rate: 1.558E-04 | global batch size: 256 | lm loss: 2.056506E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.679 | TFLOPs: 19.88 | 31: iteration 58450/ 173500 | consumed samples: 14963200 | consumed tokens: 30644633600 | elapsed time 
per iteration (s): 0.81 | learning rate: 1.558E-04 | global batch size: 256 | lm loss: 2.064972E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.831 | TFLOPs: 19.17 | 31: iteration 58460/ 173500 | consumed samples: 14965760 | consumed tokens: 30649876480 | elapsed time per iteration (s): 0.84 | learning rate: 1.558E-04 | global batch size: 256 | lm loss: 2.060110E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.964 | TFLOPs: 18.51 | 31: iteration 58470/ 173500 | consumed samples: 14968320 | consumed tokens: 30655119360 | elapsed time per iteration (s): 0.75 | learning rate: 1.557E-04 | global batch size: 256 | lm loss: 2.061942E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.028 | TFLOPs: 20.69 | 31: iteration 58480/ 173500 | consumed samples: 14970880 | consumed tokens: 30660362240 | elapsed time per iteration (s): 0.84 | learning rate: 1.557E-04 | global batch size: 256 | lm loss: 2.036729E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.045 | TFLOPs: 18.33 | 31: iteration 58490/ 173500 | consumed samples: 14973440 | consumed tokens: 30665605120 | elapsed time per iteration (s): 0.75 | learning rate: 1.557E-04 | global batch size: 256 | lm loss: 2.049361E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.156 | TFLOPs: 20.76 | 31: iteration 58500/ 173500 | consumed samples: 14976000 | consumed tokens: 30670848000 | elapsed time per iteration (s): 0.79 | learning rate: 1.557E-04 | global batch size: 256 | lm loss: 2.048099E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.851 | TFLOPs: 
19.59 | 31: iteration 58510/ 173500 | consumed samples: 14978560 | consumed tokens: 30676090880 | elapsed time per iteration (s): 0.83 | learning rate: 1.557E-04 | global batch size: 256 | lm loss: 2.046039E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.273 | TFLOPs: 18.77 | 31: iteration 58520/ 173500 | consumed samples: 14981120 | consumed tokens: 30681333760 | elapsed time per iteration (s): 0.80 | learning rate: 1.557E-04 | global batch size: 256 | lm loss: 2.065175E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.519 | TFLOPs: 19.27 | 31: iteration 58530/ 173500 | consumed samples: 14983680 | consumed tokens: 30686576640 | elapsed time per iteration (s): 0.75 | learning rate: 1.557E-04 | global batch size: 256 | lm loss: 2.061205E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.367 | TFLOPs: 20.53 | 31: iteration 58540/ 173500 | consumed samples: 14986240 | consumed tokens: 30691819520 | elapsed time per iteration (s): 0.73 | learning rate: 1.556E-04 | global batch size: 256 | lm loss: 2.052494E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.475 | TFLOPs: 21.08 | 31: iteration 58550/ 173500 | consumed samples: 14988800 | consumed tokens: 30697062400 | elapsed time per iteration (s): 0.76 | learning rate: 1.556E-04 | global batch size: 256 | lm loss: 2.056502E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.461 | TFLOPs: 20.29 | 31: iteration 58560/ 173500 | consumed samples: 14991360 | consumed tokens: 30702305280 | elapsed time per iteration (s): 0.76 | learning rate: 1.556E-04 | global batch size: 256 | lm loss: 2.053571E+00 | grad norm: 0.131 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.420 | TFLOPs: 20.41 | 31: iteration 58570/ 173500 | consumed samples: 14993920 | consumed tokens: 30707548160 | elapsed time per iteration (s): 0.77 | learning rate: 1.556E-04 | global batch size: 256 | lm loss: 2.030004E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.162 | TFLOPs: 20.09 | 31: iteration 58580/ 173500 | consumed samples: 14996480 | consumed tokens: 30712791040 | elapsed time per iteration (s): 0.78 | learning rate: 1.556E-04 | global batch size: 256 | lm loss: 2.040655E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.375 | TFLOPs: 19.93 | 31: iteration 58590/ 173500 | consumed samples: 14999040 | consumed tokens: 30718033920 | elapsed time per iteration (s): 0.74 | learning rate: 1.556E-04 | global batch size: 256 | lm loss: 2.037149E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.326 | TFLOPs: 20.89 | 31: iteration 58600/ 173500 | consumed samples: 15001600 | consumed tokens: 30723276800 | elapsed time per iteration (s): 0.76 | learning rate: 1.556E-04 | global batch size: 256 | lm loss: 2.052898E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.386 | TFLOPs: 20.29 | 31: iteration 58610/ 173500 | consumed samples: 15004160 | consumed tokens: 30728519680 | elapsed time per iteration (s): 0.74 | learning rate: 1.555E-04 | global batch size: 256 | lm loss: 2.042735E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.990 | TFLOPs: 20.81 | 31: iteration 58620/ 173500 | consumed samples: 15006720 | consumed tokens: 30733762560 | elapsed time per iteration (s): 1.37 | 
learning rate: 1.555E-04 | global batch size: 256 | lm loss: 2.059665E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 186.420 | TFLOPs: 11.28 | 31: iteration 58630/ 173500 | consumed samples: 15009280 | consumed tokens: 30739005440 | elapsed time per iteration (s): 0.82 | learning rate: 1.555E-04 | global batch size: 256 | lm loss: 2.054728E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.573 | TFLOPs: 18.85 | 31: iteration 58640/ 173500 | consumed samples: 15011840 | consumed tokens: 30744248320 | elapsed time per iteration (s): 0.85 | learning rate: 1.555E-04 | global batch size: 256 | lm loss: 2.048958E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.997 | TFLOPs: 18.15 | 31: iteration 58650/ 173500 | consumed samples: 15014400 | consumed tokens: 30749491200 | elapsed time per iteration (s): 0.86 | learning rate: 1.555E-04 | global batch size: 256 | lm loss: 2.045452E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.474 | TFLOPs: 18.00 | 31: iteration 58660/ 173500 | consumed samples: 15016960 | consumed tokens: 30754734080 | elapsed time per iteration (s): 0.79 | learning rate: 1.555E-04 | global batch size: 256 | lm loss: 2.053012E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.778 | TFLOPs: 19.59 | 31: iteration 58670/ 173500 | consumed samples: 15019520 | consumed tokens: 30759976960 | elapsed time per iteration (s): 0.86 | learning rate: 1.555E-04 | global batch size: 256 | lm loss: 2.037374E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.099 | TFLOPs: 18.03 | 31: iteration 58680/ 
173500 | consumed samples: 15022080 | consumed tokens: 30765219840 | elapsed time per iteration (s): 0.82 | learning rate: 1.554E-04 | global batch size: 256 | lm loss: 2.055564E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.909 | TFLOPs: 18.87 | 31: iteration 58690/ 173500 | consumed samples: 15024640 | consumed tokens: 30770462720 | elapsed time per iteration (s): 0.84 | learning rate: 1.554E-04 | global batch size: 256 | lm loss: 2.027313E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.452 | TFLOPs: 18.54 | 31: iteration 58700/ 173500 | consumed samples: 15027200 | consumed tokens: 30775705600 | elapsed time per iteration (s): 0.78 | learning rate: 1.554E-04 | global batch size: 256 | lm loss: 2.056676E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.167 | TFLOPs: 19.85 | 31: iteration 58710/ 173500 | consumed samples: 15029760 | consumed tokens: 30780948480 | elapsed time per iteration (s): 0.79 | learning rate: 1.554E-04 | global batch size: 256 | lm loss: 2.048921E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.819 | TFLOPs: 19.59 | 31: iteration 58720/ 173500 | consumed samples: 15032320 | consumed tokens: 30786191360 | elapsed time per iteration (s): 0.82 | learning rate: 1.554E-04 | global batch size: 256 | lm loss: 2.046691E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.701 | TFLOPs: 18.98 | 31: iteration 58730/ 173500 | consumed samples: 15034880 | consumed tokens: 30791434240 | elapsed time per iteration (s): 0.94 | learning rate: 1.554E-04 | global batch size: 256 | lm loss: 2.051286E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 273.326 | TFLOPs: 16.54 | 31: iteration 58740/ 173500 | consumed samples: 15037440 | consumed tokens: 30796677120 | elapsed time per iteration (s): 0.81 | learning rate: 1.554E-04 | global batch size: 256 | lm loss: 2.082376E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.027 | TFLOPs: 19.06 | 31: iteration 58750/ 173500 | consumed samples: 15040000 | consumed tokens: 30801920000 | elapsed time per iteration (s): 0.78 | learning rate: 1.553E-04 | global batch size: 256 | lm loss: 2.051341E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.380 | TFLOPs: 19.81 | 31: iteration 58760/ 173500 | consumed samples: 15042560 | consumed tokens: 30807162880 | elapsed time per iteration (s): 0.80 | learning rate: 1.553E-04 | global batch size: 256 | lm loss: 2.045328E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.297 | TFLOPs: 19.44 | 31: iteration 58770/ 173500 | consumed samples: 15045120 | consumed tokens: 30812405760 | elapsed time per iteration (s): 0.78 | learning rate: 1.553E-04 | global batch size: 256 | lm loss: 2.048289E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.661 | TFLOPs: 19.76 | 31: iteration 58780/ 173500 | consumed samples: 15047680 | consumed tokens: 30817648640 | elapsed time per iteration (s): 0.80 | learning rate: 1.553E-04 | global batch size: 256 | lm loss: 2.074689E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.832 | TFLOPs: 19.41 | 31: iteration 58790/ 173500 | consumed samples: 15050240 | consumed tokens: 30822891520 | elapsed time per iteration (s): 0.79 | learning rate: 
1.553E-04 | global batch size: 256 | lm loss: 2.081647E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.738 | TFLOPs: 19.52 | 31: iteration 58800/ 173500 | consumed samples: 15052800 | consumed tokens: 30828134400 | elapsed time per iteration (s): 0.78 | learning rate: 1.553E-04 | global batch size: 256 | lm loss: 2.049495E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.795 | TFLOPs: 19.83 | 31: iteration 58810/ 173500 | consumed samples: 15055360 | consumed tokens: 30833377280 | elapsed time per iteration (s): 0.80 | learning rate: 1.553E-04 | global batch size: 256 | lm loss: 2.050947E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.771 | TFLOPs: 19.28 | 31: iteration 58820/ 173500 | consumed samples: 15057920 | consumed tokens: 30838620160 | elapsed time per iteration (s): 0.80 | learning rate: 1.552E-04 | global batch size: 256 | lm loss: 2.062034E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.816 | TFLOPs: 19.29 | 31: iteration 58830/ 173500 | consumed samples: 15060480 | consumed tokens: 30843863040 | elapsed time per iteration (s): 0.78 | learning rate: 1.552E-04 | global batch size: 256 | lm loss: 2.036013E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.011 | TFLOPs: 19.96 | 31: iteration 58840/ 173500 | consumed samples: 15063040 | consumed tokens: 30849105920 | elapsed time per iteration (s): 0.75 | learning rate: 1.552E-04 | global batch size: 256 | lm loss: 2.053401E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.164 | TFLOPs: 20.64 | 31: iteration 58850/ 173500 | 
consumed samples: 15065600 | consumed tokens: 30854348800 | elapsed time per iteration (s): 0.76 | learning rate: 1.552E-04 | global batch size: 256 | lm loss: 2.043448E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.643 | TFLOPs: 20.25 | 31: iteration 58860/ 173500 | consumed samples: 15068160 | consumed tokens: 30859591680 | elapsed time per iteration (s): 0.78 | learning rate: 1.552E-04 | global batch size: 256 | lm loss: 2.044939E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.526 | TFLOPs: 19.81 | 31: iteration 58870/ 173500 | consumed samples: 15070720 | consumed tokens: 30864834560 | elapsed time per iteration (s): 0.76 | learning rate: 1.552E-04 | global batch size: 256 | lm loss: 2.059410E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.541 | TFLOPs: 20.48 | 31: iteration 58880/ 173500 | consumed samples: 15073280 | consumed tokens: 30870077440 | elapsed time per iteration (s): 0.76 | learning rate: 1.552E-04 | global batch size: 256 | lm loss: 2.001267E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.128 | TFLOPs: 20.40 | 31: iteration 58890/ 173500 | consumed samples: 15075840 | consumed tokens: 30875320320 | elapsed time per iteration (s): 0.78 | learning rate: 1.551E-04 | global batch size: 256 | lm loss: 2.021454E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.870 | TFLOPs: 19.96 | 31: iteration 58900/ 173500 | consumed samples: 15078400 | consumed tokens: 30880563200 | elapsed time per iteration (s): 0.80 | learning rate: 1.551E-04 | global batch size: 256 | lm loss: 2.082587E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 320.612 | TFLOPs: 19.40 | 31: iteration 58910/ 173500 | consumed samples: 15080960 | consumed tokens: 30885806080 | elapsed time per iteration (s): 0.78 | learning rate: 1.551E-04 | global batch size: 256 | lm loss: 2.078685E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.185 | TFLOPs: 19.73 | 31: iteration 58920/ 173500 | consumed samples: 15083520 | consumed tokens: 30891048960 | elapsed time per iteration (s): 0.85 | learning rate: 1.551E-04 | global batch size: 256 | lm loss: 2.051501E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.641 | TFLOPs: 18.25 | 31: iteration 58930/ 173500 | consumed samples: 15086080 | consumed tokens: 30896291840 | elapsed time per iteration (s): 0.73 | learning rate: 1.551E-04 | global batch size: 256 | lm loss: 2.036251E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.677 | TFLOPs: 21.15 | 31: iteration 58940/ 173500 | consumed samples: 15088640 | consumed tokens: 30901534720 | elapsed time per iteration (s): 0.79 | learning rate: 1.551E-04 | global batch size: 256 | lm loss: 2.055776E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.552 | TFLOPs: 19.70 | 31: iteration 58950/ 173500 | consumed samples: 15091200 | consumed tokens: 30906777600 | elapsed time per iteration (s): 0.77 | learning rate: 1.551E-04 | global batch size: 256 | lm loss: 2.044043E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.329 | TFLOPs: 20.23 | 31: iteration 58960/ 173500 | consumed samples: 15093760 | consumed tokens: 30912020480 | elapsed time per iteration (s): 0.79 | learning rate: 1.550E-04 | global batch 
size: 256 | lm loss: 2.047764E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.957 | TFLOPs: 19.72 | 31: iteration 58970/ 173500 | consumed samples: 15096320 | consumed tokens: 30917263360 | elapsed time per iteration (s): 0.76 | learning rate: 1.550E-04 | global batch size: 256 | lm loss: 2.077498E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.497 | TFLOPs: 20.36 | 31: iteration 58980/ 173500 | consumed samples: 15098880 | consumed tokens: 30922506240 | elapsed time per iteration (s): 0.73 | learning rate: 1.550E-04 | global batch size: 256 | lm loss: 2.045522E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.813 | TFLOPs: 21.16 | 31: iteration 58990/ 173500 | consumed samples: 15101440 | consumed tokens: 30927749120 | elapsed time per iteration (s): 0.77 | learning rate: 1.550E-04 | global batch size: 256 | lm loss: 2.036786E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.690 | TFLOPs: 20.19 | 31: iteration 59000/ 173500 | consumed samples: 15104000 | consumed tokens: 30932992000 | elapsed time per iteration (s): 0.74 | learning rate: 1.550E-04 | global batch size: 256 | lm loss: 2.042983E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.178 | TFLOPs: 20.82 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 59000 | lm loss value: 1.996160E+00 | lm loss PPL: 7.360735E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 59000 to checkpoints_1b1long 0: [2022-11-26 07:19:57,072] [INFO] [logging.py:68:log_dist] 
[Rank 0] [Torch] Checkpoint global_step59000 is begin to save! 0: [2022-11-26 07:19:57,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_01-model_00-model_states.pt... 0: [2022-11-26 07:19:57,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_01-model_00-model_states.pt. 0: [2022-11-26 07:19:57,303] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_03-model_00-model_states.pt... 0: [2022-11-26 07:19:57,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_03-model_00-model_states.pt. 0: [2022-11-26 07:19:57,384] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_04-model_00-model_states.pt... 0: [2022-11-26 07:19:57,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_04-model_00-model_states.pt. 0: [2022-11-26 07:19:57,465] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_05-model_00-model_states.pt... 0: [2022-11-26 07:19:57,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_05-model_00-model_states.pt. 0: [2022-11-26 07:19:57,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_06-model_00-model_states.pt... 0: [2022-11-26 07:19:57,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_06-model_00-model_states.pt. 0: [2022-11-26 07:19:57,619] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_07-model_00-model_states.pt... 0: [2022-11-26 07:19:57,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 07:19:57,697] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_08-model_00-model_states.pt... 0: [2022-11-26 07:19:57,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_08-model_00-model_states.pt. 0: [2022-11-26 07:19:57,779] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_09-model_00-model_states.pt... 0: [2022-11-26 07:19:57,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_09-model_00-model_states.pt. 0: [2022-11-26 07:19:57,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_10-model_00-model_states.pt... 0: [2022-11-26 07:19:57,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_10-model_00-model_states.pt. 0: [2022-11-26 07:19:57,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_11-model_00-model_states.pt... 0: [2022-11-26 07:19:58,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_11-model_00-model_states.pt. 0: [2022-11-26 07:19:58,004] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_12-model_00-model_states.pt... 0: [2022-11-26 07:19:58,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_12-model_00-model_states.pt. 0: [2022-11-26 07:19:58,080] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_13-model_00-model_states.pt... 0: [2022-11-26 07:19:58,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 07:19:58,153] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_14-model_00-model_states.pt... 0: [2022-11-26 07:19:58,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_14-model_00-model_states.pt. 0: [2022-11-26 07:19:58,226] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_15-model_00-model_states.pt... 0: [2022-11-26 07:19:58,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_15-model_00-model_states.pt. 0: [2022-11-26 07:19:58,301] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_16-model_00-model_states.pt... 0: [2022-11-26 07:19:58,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_16-model_00-model_states.pt. 0: [2022-11-26 07:19:58,373] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_17-model_00-model_states.pt... 0: [2022-11-26 07:19:58,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_17-model_00-model_states.pt. 0: [2022-11-26 07:19:58,446] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_18-model_00-model_states.pt... 0: [2022-11-26 07:19:58,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_18-model_00-model_states.pt. 0: [2022-11-26 07:19:58,524] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_19-model_00-model_states.pt... 0: [2022-11-26 07:19:58,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 07:19:58,597] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_20-model_00-model_states.pt... 0: [2022-11-26 07:19:58,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_20-model_00-model_states.pt. 0: [2022-11-26 07:19:58,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_21-model_00-model_states.pt... 0: [2022-11-26 07:19:58,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_21-model_00-model_states.pt. 0: [2022-11-26 07:19:58,746] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_22-model_00-model_states.pt... 0: [2022-11-26 07:19:58,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_22-model_00-model_states.pt. 0: [2022-11-26 07:19:58,818] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_23-model_00-model_states.pt... 0: [2022-11-26 07:19:58,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_23-model_00-model_states.pt. 0: [2022-11-26 07:19:58,892] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_24-model_00-model_states.pt... 0: [2022-11-26 07:19:58,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_24-model_00-model_states.pt. 0: [2022-11-26 07:19:58,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_25-model_00-model_states.pt... 0: [2022-11-26 07:19:59,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 07:19:59,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_26-model_00-model_states.pt... 0: [2022-11-26 07:19:59,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_26-model_00-model_states.pt. 0: [2022-11-26 07:19:59,110] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_27-model_00-model_states.pt... 0: [2022-11-26 07:19:59,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_27-model_00-model_states.pt. 0: [2022-11-26 07:19:59,187] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_28-model_00-model_states.pt... 0: [2022-11-26 07:19:59,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_28-model_00-model_states.pt. 0: [2022-11-26 07:19:59,259] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/layer_30-model_00-model_states.pt... 0: [2022-11-26 07:19:59,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/layer_30-model_00-model_states.pt. 0: [2022-11-26 07:19:59,261] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step59000/mp_rank_00_model_states.pt 0: [2022-11-26 07:19:59,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/mp_rank_00_model_states.pt... 0: [2022-11-26 07:19:59,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/mp_rank_00_model_states.pt. 0: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
0: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
10: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 
16: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
3: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 
20: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 
28: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 
29: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 
18: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
5: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
9: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 
2: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 
25: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 
24: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 
21: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 
19: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
7: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 
3: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
28: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 
17: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
19: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
9: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 
14: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 
7: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
31: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:19:59,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:19:59,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:19:59,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 07:19:59,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 5: [2022-11-26 07:19:59,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:19:59,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 07:19:59,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 28: [2022-11-26 07:19:59,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:19:59,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:19:59,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 07:19:59,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
3: [2022-11-26 07:19:59,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:19:59,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 07:19:59,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 0: [2022-11-26 07:19:59,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:19:59,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:19:59,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 07:19:59,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 16: [2022-11-26 07:19:59,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:19:59,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 16: [2022-11-26 07:19:59,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 12: [2022-11-26 07:19:59,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 16: [2022-11-26 07:19:59,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
12: [2022-11-26 07:19:59,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 21: [2022-11-26 07:19:59,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:19:59,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 18: [2022-11-26 07:19:59,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:19:59,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 18: [2022-11-26 07:19:59,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 07:19:59,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 26: [2022-11-26 07:19:59,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 07:19:59,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 07:19:59,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 11: [2022-11-26 07:19:59,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-26 07:19:59,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 07:19:59,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 1: [2022-11-26 07:19:59,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 24: [2022-11-26 07:19:59,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:19:59,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 24: [2022-11-26 07:19:59,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 07:19:59,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 7: [2022-11-26 07:19:59,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:19:59,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 7: [2022-11-26 07:19:59,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 07:19:59,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 23: [2022-11-26 07:19:59,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
26: [2022-11-26 07:19:59,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 23: [2022-11-26 07:19:59,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 26: [2022-11-26 07:19:59,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 23: [2022-11-26 07:19:59,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 26: [2022-11-26 07:19:59,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 4: [2022-11-26 07:19:59,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:19:59,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 07:19:59,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 27: [2022-11-26 07:19:59,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 22: [2022-11-26 07:19:59,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:19:59,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
27: [2022-11-26 07:19:59,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 22: [2022-11-26 07:19:59,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 19: [2022-11-26 07:19:59,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 27: [2022-11-26 07:19:59,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 22: [2022-11-26 07:19:59,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 19: [2022-11-26 07:19:59,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 12: [2022-11-26 07:19:59,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:19:59,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 07:19:59,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 30: [2022-11-26 07:19:59,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:19:59,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 07:19:59,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
1: [2022-11-26 07:19:59,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 25: [2022-11-26 07:19:59,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:19:59,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 25: [2022-11-26 07:19:59,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 1: [2022-11-26 07:19:59,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 25: [2022-11-26 07:19:59,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 25: [2022-11-26 07:19:59,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 07:19:59,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 07:19:59,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 11: [2022-11-26 07:19:59,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:19:59,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 07:19:59,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
10: [2022-11-26 07:19:59,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:19:59,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 07:19:59,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 26: [2022-11-26 07:19:59,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 07:19:59,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 07:19:59,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 13: [2022-11-26 07:19:59,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:19:59,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 07:19:59,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 21: [2022-11-26 07:19:59,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:19:59,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 07:19:59,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
14: [2022-11-26 07:19:59,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:19:59,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 07:19:59,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 16: [2022-11-26 07:19:59,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 07:19:59,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 07:19:59,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 15: [2022-11-26 07:19:59,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 23: [2022-11-26 07:19:59,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:19:59,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 23: [2022-11-26 07:19:59,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 15: [2022-11-26 07:19:59,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 23: [2022-11-26 07:19:59,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
29: [2022-11-26 07:19:59,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 07:19:59,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 19: [2022-11-26 07:19:59,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 29: [2022-11-26 07:19:59,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 19: [2022-11-26 07:19:59,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 18: [2022-11-26 07:19:59,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:19:59,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 19: [2022-11-26 07:19:59,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 0: [2022-11-26 07:19:59,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 18: [2022-11-26 07:19:59,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 0: [2022-11-26 07:19:59,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 9: [2022-11-26 07:19:59,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 07:19:59,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 07:19:59,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 7: [2022-11-26 07:19:59,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:19:59,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:19:59,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 7: [2022-11-26 07:19:59,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 15: [2022-11-26 07:19:59,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 7: [2022-11-26 07:19:59,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 6: [2022-11-26 07:19:59,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:19:59,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:19:59,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 24: [2022-11-26 07:19:59,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
5: [2022-11-26 07:19:59,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 6: [2022-11-26 07:19:59,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 5: [2022-11-26 07:19:59,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:19:59,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:19:59,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 5: [2022-11-26 07:19:59,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 13: [2022-11-26 07:19:59,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 24: [2022-11-26 07:19:59,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:19:59,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 13: [2022-11-26 07:19:59,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
24: [2022-11-26 07:19:59,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 07:19:59,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 22: [2022-11-26 07:19:59,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 24: [2022-11-26 07:19:59,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 22: [2022-11-26 07:19:59,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 07:19:59,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 24: [2022-11-26 07:19:59,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 12: [2022-11-26 07:19:59,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:19:59,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 07:19:59,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 29: [2022-11-26 07:19:59,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-26 07:19:59,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 07:19:59,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 0: [2022-11-26 07:19:59,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:19:59,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 14: [2022-11-26 07:19:59,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:19:59,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:19:59,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 10: [2022-11-26 07:19:59,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 16: [2022-11-26 07:19:59,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
14: [2022-11-26 07:19:59,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 10: [2022-11-26 07:19:59,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 16: [2022-11-26 07:19:59,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 14: [2022-11-26 07:19:59,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 10: [2022-11-26 07:19:59,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 16: [2022-11-26 07:19:59,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 0: [2022-11-26 07:19:59,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 07:19:59,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 3: [2022-11-26 07:19:59,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:19:59,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 07:19:59,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 07:19:59,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 07:19:59,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 3: [2022-11-26 07:19:59,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 25: [2022-11-26 07:19:59,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 07:19:59,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 4: [2022-11-26 07:19:59,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:19:59,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 25: [2022-11-26 07:19:59,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 4: [2022-11-26 07:19:59,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 29: [2022-11-26 07:19:59,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 
29: [2022-11-26 07:19:59,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 07:19:59,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 28: [2022-11-26 07:19:59,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 07:19:59,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 28: [2022-11-26 07:19:59,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:19:59,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 28: [2022-11-26 07:19:59,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 07:19:59,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 30: [2022-11-26 07:19:59,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 07:19:59,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 27: [2022-11-26 07:19:59,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 28: [2022-11-26 07:19:59,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
27: [2022-11-26 07:19:59,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 28: [2022-11-26 07:19:59,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 27: [2022-11-26 07:19:59,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 28: [2022-11-26 07:19:59,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 11: [2022-11-26 07:19:59,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:19:59,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 07:19:59,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 27: [2022-11-26 07:19:59,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:19:59,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:19:59,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 5: [2022-11-26 07:19:59,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 27: [2022-11-26 07:19:59,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
5: [2022-11-26 07:19:59,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 18: [2022-11-26 07:19:59,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 23: [2022-11-26 07:19:59,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:19:59,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 07:19:59,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 26: [2022-11-26 07:19:59,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 07:19:59,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 07:19:59,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 23: [2022-11-26 07:19:59,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 07:19:59,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 31: [2022-11-26 07:19:59,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 07:19:59,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-26 07:19:59,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 07:19:59,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 07:19:59,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 31: [2022-11-26 07:19:59,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 1: [2022-11-26 07:19:59,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:19:59,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 07:19:59,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 12: [2022-11-26 07:19:59,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:19:59,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 07:19:59,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 13: [2022-11-26 07:19:59,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-26 07:19:59,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 07:19:59,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 14: [2022-11-26 07:19:59,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:19:59,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 21: [2022-11-26 07:19:59,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:19:59,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 14: [2022-11-26 07:19:59,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 21: [2022-11-26 07:19:59,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 7: [2022-11-26 07:19:59,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:19:59,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 31: [2022-11-26 07:19:59,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:19:59,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
31: [2022-11-26 07:19:59,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 07:19:59,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 9: [2022-11-26 07:19:59,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:19:59,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 07:19:59,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 1: [2022-11-26 07:19:59,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:19:59,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 07:19:59,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 19: [2022-11-26 07:19:59,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:19:59,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 07:19:59,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 22: [2022-11-26 07:19:59,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
22: [2022-11-26 07:19:59,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 07:19:59,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 19: [2022-11-26 07:19:59,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:19:59,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 07:19:59,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 4: [2022-11-26 07:19:59,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:19:59,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 07:19:59,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 25: [2022-11-26 07:19:59,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 07:19:59,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 07:19:59,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 6: [2022-11-26 07:19:59,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 07:19:59,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 07:19:59,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 17: [2022-11-26 07:19:59,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 07:19:59,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 07:19:59,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 07:19:59,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:19:59,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
17: [2022-11-26 07:19:59,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 07:19:59,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 07:19:59,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 07:19:59,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 15: [2022-11-26 07:19:59,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 17: [2022-11-26 07:19:59,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 17: [2022-11-26 07:19:59,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 17: [2022-11-26 07:19:59,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 17: [2022-11-26 07:19:59,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 15: [2022-11-26 07:19:59,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 28: [2022-11-26 07:19:59,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:19:59,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-26 07:19:59,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 07:19:59,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 13: [2022-11-26 07:19:59,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:19:59,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 07:19:59,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 29: [2022-11-26 07:19:59,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 20: [2022-11-26 07:19:59,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 07:19:59,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 07:19:59,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 07:19:59,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 
20: [2022-11-26 07:19:59,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 29: [2022-11-26 07:19:59,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 20: [2022-11-26 07:19:59,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 29: [2022-11-26 07:19:59,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 20: [2022-11-26 07:19:59,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 20: [2022-11-26 07:19:59,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 20: [2022-11-26 07:19:59,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 07:19:59,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 07:19:59,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 20: [2022-11-26 07:19:59,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 8: [2022-11-26 07:19:59,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:19:59,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 07:19:59,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:19:59,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 07:19:59,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 07:19:59,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 07:19:59,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 8: [2022-11-26 07:19:59,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 8: [2022-11-26 07:19:59,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 28: [2022-11-26 07:19:59,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 07:19:59,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 2: [2022-11-26 07:19:59,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:19:59,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:19:59,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 07:19:59,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:19:59,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 07:19:59,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 07:19:59,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 07:19:59,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 07:19:59,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 2: [2022-11-26 07:19:59,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 2: [2022-11-26 07:19:59,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 2: [2022-11-26 07:19:59,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 30: [2022-11-26 07:19:59,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:19:59,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 07:19:59,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
9: [2022-11-26 07:19:59,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:19:59,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 07:19:59,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 30: [2022-11-26 07:19:59,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:19:59,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 07:19:59,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 27: [2022-11-26 07:19:59,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:19:59,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 07:19:59,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 3: [2022-11-26 07:19:59,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:19:59,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 07:19:59,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
18: [2022-11-26 07:19:59,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:19:59,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 07:19:59,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 11: [2022-11-26 07:19:59,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:19:59,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 07:19:59,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 4: [2022-11-26 07:19:59,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:19:59,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 07:19:59,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 6: [2022-11-26 07:19:59,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:19:59,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 07:19:59,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 6: [2022-11-26 07:19:59,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 10: [2022-11-26 07:19:59,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 6: [2022-11-26 07:19:59,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 22: [2022-11-26 07:19:59,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 07:19:59,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 07:19:59,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 23: [2022-11-26 07:19:59,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 07:19:59,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 07:19:59,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 21: [2022-11-26 07:19:59,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-26 07:19:59,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 07:19:59,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 24: [2022-11-26 07:19:59,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 07:19:59,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 07:19:59,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 16: [2022-11-26 07:19:59,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 07:19:59,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 07:19:59,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 0: [2022-11-26 07:19:59,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:19:59,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 07:19:59,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 14: [2022-11-26 07:19:59,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-26 07:19:59,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 07:19:59,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 7: [2022-11-26 07:19:59,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:19:59,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:19:59,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 5: [2022-11-26 07:19:59,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 7: [2022-11-26 07:19:59,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 5: [2022-11-26 07:19:59,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 19: [2022-11-26 07:19:59,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:19:59,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 07:19:59,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 20: [2022-11-26 07:19:59,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
20: [2022-11-26 07:19:59,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 07:19:59,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 26: [2022-11-26 07:19:59,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 07:19:59,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 07:19:59,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 17: [2022-11-26 07:19:59,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 07:19:59,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 31: [2022-11-26 07:19:59,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 17: [2022-11-26 07:19:59,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 31: [2022-11-26 07:19:59,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 07:19:59,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 28: [2022-11-26 07:19:59,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
12: [2022-11-26 07:19:59,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:19:59,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 07:19:59,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 28: [2022-11-26 07:19:59,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 07:19:59,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 25: [2022-11-26 07:19:59,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 07:19:59,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 07:19:59,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 1: [2022-11-26 07:19:59,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:19:59,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 07:19:59,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 9: [2022-11-26 07:19:59,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 07:19:59,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 07:19:59,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 3: [2022-11-26 07:19:59,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:19:59,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 07:19:59,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 13: [2022-11-26 07:19:59,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:19:59,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 07:19:59,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 29: [2022-11-26 07:19:59,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 07:19:59,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 07:19:59,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 2: [2022-11-26 07:19:59,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 07:19:59,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 07:19:59,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 8: [2022-11-26 07:19:59,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:19:59,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 15: [2022-11-26 07:19:59,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:19:59,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 15: [2022-11-26 07:19:59,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 07:19:59,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 18: [2022-11-26 07:19:59,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:19:59,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 07:19:59,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 30: [2022-11-26 07:19:59,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-26 07:19:59,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 07:19:59,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 27: [2022-11-26 07:19:59,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:19:59,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 07:19:59,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 6: [2022-11-26 07:19:59,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:19:59,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 07:19:59,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 11: [2022-11-26 07:19:59,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:19:59,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 07:19:59,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 10: [2022-11-26 07:19:59,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-26 07:19:59,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 07:19:59,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 4: [2022-11-26 07:19:59,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:19:59,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 07:19:59,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 16: [2022-11-26 07:19:59,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 07:19:59,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 07:19:59,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 21: [2022-11-26 07:19:59,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:19:59,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 07:19:59,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 22: [2022-11-26 07:19:59,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-26 07:19:59,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 07:19:59,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 0: [2022-11-26 07:19:59,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:19:59,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 07:19:59,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 23: [2022-11-26 07:19:59,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 07:19:59,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 07:19:59,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 17: [2022-11-26 07:19:59,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 24: [2022-11-26 07:19:59,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 17: [2022-11-26 07:19:59,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 07:19:59,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
24: [2022-11-26 07:19:59,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 07:19:59,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 20: [2022-11-26 07:19:59,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 07:19:59,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 07:19:59,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 7: [2022-11-26 07:19:59,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:19:59,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 26: [2022-11-26 07:19:59,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:19:59,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 26: [2022-11-26 07:19:59,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 07:19:59,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 31: [2022-11-26 07:19:59,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 
31: [2022-11-26 07:19:59,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 07:19:59,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 5: [2022-11-26 07:19:59,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:19:59,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 07:19:59,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 28: [2022-11-26 07:19:59,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 07:19:59,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 07:19:59,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 14: [2022-11-26 07:19:59,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:19:59,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 07:19:59,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 9: [2022-11-26 07:19:59,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-26 07:19:59,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 19: [2022-11-26 07:19:59,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:19:59,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 19: [2022-11-26 07:19:59,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 07:19:59,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 12: [2022-11-26 07:19:59,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:19:59,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 07:19:59,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 1: [2022-11-26 07:19:59,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:19:59,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 25: [2022-11-26 07:19:59,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:19:59,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
25: [2022-11-26 07:19:59,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 07:19:59,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 15: [2022-11-26 07:19:59,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:19:59,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 07:19:59,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 3: [2022-11-26 07:19:59,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:19:59,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 07:19:59,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 8: [2022-11-26 07:19:59,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:19:59,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 07:19:59,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 29: [2022-11-26 07:19:59,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
29: [2022-11-26 07:19:59,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 07:19:59,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 2: [2022-11-26 07:19:59,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:19:59,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 07:19:59,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 13: [2022-11-26 07:19:59,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:19:59,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 18: [2022-11-26 07:19:59,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:19:59,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 18: [2022-11-26 07:19:59,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 07:19:59,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 27: [2022-11-26 07:19:59,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
27: [2022-11-26 07:19:59,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 07:19:59,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 30: [2022-11-26 07:19:59,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:19:59,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 07:19:59,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 6: [2022-11-26 07:19:59,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:19:59,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 07:19:59,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 10: [2022-11-26 07:19:59,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:19:59,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 07:19:59,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 23: [2022-11-26 07:19:59,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-26 07:19:59,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 07:19:59,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 16: [2022-11-26 07:19:59,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 07:19:59,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 07:19:59,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 11: [2022-11-26 07:19:59,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:19:59,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 22: [2022-11-26 07:19:59,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:19:59,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 22: [2022-11-26 07:19:59,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 07:19:59,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 4: [2022-11-26 07:19:59,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 07:19:59,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 07:19:59,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 21: [2022-11-26 07:19:59,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:19:59,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 07:19:59,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 24: [2022-11-26 07:19:59,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 07:19:59,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 07:19:59,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 5: [2022-11-26 07:19:59,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:19:59,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 07:19:59,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 14: [2022-11-26 07:19:59,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-26 07:19:59,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 07:19:59,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 7: [2022-11-26 07:19:59,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:19:59,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 07:19:59,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 20: [2022-11-26 07:19:59,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 07:19:59,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 07:19:59,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 26: [2022-11-26 07:19:59,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 07:19:59,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 07:19:59,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 31: [2022-11-26 07:19:59,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 
31: [2022-11-26 07:19:59,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 07:19:59,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 0: [2022-11-26 07:19:59,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:19:59,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 07:19:59,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 19: [2022-11-26 07:19:59,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:19:59,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 07:19:59,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 17: [2022-11-26 07:19:59,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 07:19:59,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 07:19:59,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 12: [2022-11-26 07:19:59,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 07:19:59,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 07:19:59,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 25: [2022-11-26 07:19:59,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 07:19:59,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 07:19:59,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 1: [2022-11-26 07:19:59,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:19:59,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 07:19:59,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 9: [2022-11-26 07:19:59,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:19:59,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 07:19:59,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 28: [2022-11-26 07:19:59,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
29: [2022-11-26 07:19:59,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 28: [2022-11-26 07:19:59,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 07:19:59,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 29: [2022-11-26 07:19:59,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 07:19:59,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 8: [2022-11-26 07:19:59,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:19:59,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 07:19:59,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 15: [2022-11-26 07:19:59,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:19:59,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 07:19:59,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 13: [2022-11-26 07:19:59,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-26 07:19:59,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 07:19:59,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 30: [2022-11-26 07:19:59,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:19:59,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 07:19:59,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 11: [2022-11-26 07:19:59,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:19:59,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 07:19:59,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 5: [2022-11-26 07:19:59,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:19:59,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 16: [2022-11-26 07:19:59,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:19:59,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
16: [2022-11-26 07:19:59,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 07:19:59,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 4: [2022-11-26 07:19:59,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:19:59,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 07:19:59,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 27: [2022-11-26 07:19:59,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:19:59,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:19:59,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 18: [2022-11-26 07:19:59,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:19:59,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 2: [2022-11-26 07:19:59,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 27: [2022-11-26 07:19:59,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
18: [2022-11-26 07:19:59,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 07:19:59,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 10: [2022-11-26 07:19:59,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:19:59,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 07:19:59,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 3: [2022-11-26 07:19:59,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:19:59,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:19:59,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 07:19:59,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 6: [2022-11-26 07:19:59,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 07:19:59,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 20: [2022-11-26 07:19:59,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-26 07:19:59,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 07:19:59,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 24: [2022-11-26 07:19:59,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 22: [2022-11-26 07:19:59,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 24: [2022-11-26 07:19:59,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 22: [2022-11-26 07:19:59,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 24: [2022-11-26 07:19:59,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 22: [2022-11-26 07:19:59,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 26: [2022-11-26 07:19:59,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 07:19:59,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 07:19:59,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 17: [2022-11-26 07:19:59,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-26 07:19:59,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 07:19:59,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 31: [2022-11-26 07:19:59,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 07:19:59,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 07:19:59,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 23: [2022-11-26 07:19:59,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 07:19:59,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 07:19:59,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 14: [2022-11-26 07:19:59,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:19:59,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 07:19:59,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 21: [2022-11-26 07:19:59,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-26 07:19:59,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 07:19:59,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 9: [2022-11-26 07:19:59,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:19:59,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 07:19:59,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 19: [2022-11-26 07:19:59,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:19:59,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 07:19:59,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 31: [2022-11-26 07:19:59,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 07:19:59,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 07:19:59,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 12: [2022-11-26 07:19:59,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 07:19:59,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 07:19:59,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 25: [2022-11-26 07:19:59,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 07:19:59,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 07:19:59,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 18: [2022-11-26 07:19:59,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:19:59,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 07:19:59,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 29: [2022-11-26 07:19:59,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 07:19:59,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 07:19:59,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 16: [2022-11-26 07:19:59,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
16: [2022-11-26 07:19:59,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 07:19:59,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 21: [2022-11-26 07:19:59,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:19:59,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 1: [2022-11-26 07:19:59,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:19:59,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 13: [2022-11-26 07:19:59,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:19:59,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 1: [2022-11-26 07:19:59,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 07:19:59,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 13: [2022-11-26 07:19:59,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 3: [2022-11-26 07:19:59,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 07:19:59,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 07:19:59,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 27: [2022-11-26 07:19:59,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 24: [2022-11-26 07:19:59,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:19:59,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 07:19:59,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 24: [2022-11-26 07:19:59,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 07:19:59,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 28: [2022-11-26 07:19:59,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:19:59,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:19:59,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 07:19:59,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
8: [2022-11-26 07:19:59,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:19:59,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 10: [2022-11-26 07:19:59,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:19:59,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 10: [2022-11-26 07:19:59,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 07:19:59,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 23: [2022-11-26 07:19:59,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 07:19:59,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 07:19:59,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 28: [2022-11-26 07:19:59,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 07:19:59,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 22: [2022-11-26 07:19:59,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
22: [2022-11-26 07:19:59,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 07:19:59,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 4: [2022-11-26 07:19:59,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:19:59,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 07:19:59,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 14: [2022-11-26 07:19:59,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:19:59,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 2: [2022-11-26 07:19:59,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:19:59,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 2: [2022-11-26 07:19:59,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 07:19:59,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 7: [2022-11-26 07:19:59,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-26 07:19:59,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:19:59,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 07:19:59,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 07:19:59,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 7: [2022-11-26 07:19:59,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 15: [2022-11-26 07:19:59,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:19:59,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 07:19:59,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 8: [2022-11-26 07:19:59,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:19:59,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 07:19:59,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 6: [2022-11-26 07:19:59,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 07:19:59,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 07:19:59,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 30: [2022-11-26 07:19:59,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:19:59,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 07:19:59,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 0: [2022-11-26 07:19:59,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:19:59,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:19:59,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 07:19:59,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step59000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 07:19:59,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 0: [2022-11-26 07:19:59,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
0: successfully saved checkpoint at iteration 59000 to checkpoints_1b1long
31: time (ms) | save-checkpoint: 2691.19
31: iteration 59010/ 173500 | consumed samples: 15106560 | consumed tokens: 30938234880 | elapsed time per iteration (s): 1.05 | learning rate: 1.550E-04 | global batch size: 256 | lm loss: 2.032372E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.777 | TFLOPs: 14.75 |
31: iteration 59020/ 173500 | consumed samples: 15109120 | consumed tokens: 30943477760 | elapsed time per iteration (s): 0.81 | learning rate: 1.550E-04 | global batch size: 256 | lm loss: 2.033480E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.222 | TFLOPs: 19.19 |
31: iteration 59030/ 173500 | consumed samples: 15111680 | consumed tokens: 30948720640 | elapsed time per iteration (s): 0.76 | learning rate: 1.549E-04 | global batch size: 256 | lm loss: 2.040930E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.035 | TFLOPs: 20.51 |
31: iteration 59040/ 173500 | consumed samples: 15114240 | consumed tokens: 30953963520 | elapsed time per iteration (s): 0.74 | learning rate: 1.549E-04 | global batch size: 256 | lm loss: 2.053190E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.293 | TFLOPs: 20.83 |
31: iteration 59050/ 173500 | consumed samples: 15116800 | consumed tokens: 30959206400 | elapsed time per iteration (s): 0.74 | learning rate: 1.549E-04 | global batch size: 256 | lm loss: 2.030133E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.681 | TFLOPs: 21.03 |
31: iteration 59060/ 173500 | consumed samples: 15119360 | consumed tokens: 30964449280 | elapsed time per iteration (s): 0.77 | learning rate: 1.549E-04 | global batch size: 256 | lm loss: 2.043327E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.516 | TFLOPs: 20.18 |
31: iteration 59070/ 173500 | consumed samples: 15121920 | consumed tokens: 30969692160 | elapsed time per iteration (s): 0.77 | learning rate: 1.549E-04 | global batch size: 256 | lm loss: 2.039252E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.535 | TFLOPs: 20.12 |
31: iteration 59080/ 173500 | consumed samples: 15124480 | consumed tokens: 30974935040 | elapsed time per iteration (s): 0.73 | learning rate: 1.549E-04 | global batch size: 256 | lm loss: 2.076890E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.423 | TFLOPs: 21.08 |
31: iteration 59090/ 173500 | consumed samples: 15127040 | consumed tokens: 30980177920 | elapsed time per iteration (s): 0.76 | learning rate: 1.549E-04 | global batch size: 256 | lm loss: 2.024767E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.203 | TFLOPs: 20.40 |
31: iteration 59100/ 173500 | consumed samples: 15129600 | consumed tokens: 30985420800 | elapsed time per iteration (s): 0.76 | learning rate: 1.548E-04 | global batch size: 256 | lm loss: 2.054918E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.907 | TFLOPs: 20.26 |
31: iteration 59110/ 173500 | consumed samples: 15132160 | consumed tokens: 30990663680 | elapsed time per iteration (s): 0.82 | learning rate: 1.548E-04 | global batch size: 256 | lm loss: 2.060222E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.012 | TFLOPs: 19.00 |
31: iteration 59120/ 173500 | consumed samples: 15134720 | consumed tokens: 30995906560 | elapsed time per iteration (s): 0.77 | learning rate: 1.548E-04 | global batch size: 256 | lm loss: 2.048615E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.262 | TFLOPs: 20.10 |
31: iteration 59130/ 173500 | consumed samples: 15137280 | consumed tokens: 31001149440 | elapsed time per iteration (s): 0.76 | learning rate: 1.548E-04 | global batch size: 256 | lm loss: 2.043692E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.685 | TFLOPs: 20.43 |
31: iteration 59140/ 173500 | consumed samples: 15139840 | consumed tokens: 31006392320 | elapsed time per iteration (s): 0.79 | learning rate: 1.548E-04 | global batch size: 256 | lm loss: 2.024923E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.108 | TFLOPs: 19.61 |
31: iteration 59150/ 173500 | consumed samples: 15142400 | consumed tokens: 31011635200 | elapsed time per iteration (s): 0.83 | learning rate: 1.548E-04 | global batch size: 256 | lm loss: 2.043920E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.524 | TFLOPs: 18.73 |
31: iteration 59160/ 173500 | consumed samples: 15144960 | consumed tokens: 31016878080 | elapsed time per iteration (s): 0.70 | learning rate: 1.548E-04 | global batch size: 256 | lm loss: 2.040556E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 363.966 | TFLOPs: 22.02 |
31: iteration 59170/ 173500 | consumed samples: 15147520 | consumed tokens: 31022120960 | elapsed time per iteration (s): 0.73 | learning rate: 1.547E-04 | global batch size: 256 | lm loss: 2.059596E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.343 | TFLOPs: 21.32 |
31: iteration 59180/ 173500 | consumed samples: 15150080 | consumed tokens: 31027363840 | elapsed time per iteration (s): 0.84 | learning rate: 1.547E-04 | global batch size: 256 | lm loss: 2.048568E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.814 | TFLOPs: 18.44 |
31: iteration 59190/ 173500 | consumed samples: 15152640 | consumed tokens: 31032606720 | elapsed time per iteration (s): 0.79 | learning rate: 1.547E-04 | global batch size: 256 | lm loss: 2.039812E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.949 | TFLOPs: 19.72 |
31: iteration 59200/ 173500 | consumed samples: 15155200 | consumed tokens: 31037849600 | elapsed time per iteration (s): 0.76 | learning rate: 1.547E-04 | global batch size: 256 | lm loss: 2.051944E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.894 | TFLOPs: 20.50 |
31: iteration 59210/ 173500 | consumed samples: 15157760 | consumed tokens: 31043092480 | elapsed time per iteration (s): 0.83 | learning rate: 1.547E-04 | global batch size: 256 | lm loss: 2.062744E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.836 | TFLOPs: 18.68 |
31: iteration 59220/ 173500 | consumed samples: 15160320 | consumed tokens: 31048335360 | elapsed time per iteration (s): 0.75 | learning rate: 1.547E-04 | global batch size: 256 | lm loss: 2.012106E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.490 | TFLOPs: 20.60 |
31: iteration 59230/ 173500 | consumed samples: 15162880 | consumed tokens: 31053578240 | elapsed time per iteration (s): 0.80 | learning rate: 1.547E-04 | global batch size: 256 | lm loss: 2.058216E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.725 | TFLOPs: 19.28 |
31: iteration 59240/ 173500 | consumed samples: 15165440 | consumed tokens: 31058821120 | elapsed time per iteration (s): 0.80 | learning rate: 1.546E-04 | global batch size: 256 | lm loss: 2.020456E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.706 | TFLOPs: 19.46 |
31: iteration 59250/ 173500 | consumed samples: 15168000 | consumed tokens: 31064064000 | elapsed time per iteration (s): 0.79 | learning rate: 1.546E-04 | global batch size: 256 | lm loss: 2.051047E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.978 | TFLOPs: 19.66 |
31: iteration 59260/ 173500 | consumed samples: 15170560 | consumed tokens: 31069306880 | elapsed time per iteration (s): 0.75 | learning rate: 1.546E-04 | global batch size: 256 | lm loss: 2.036679E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.542 | TFLOPs: 20.54 |
31: iteration 59270/ 173500 | consumed samples: 15173120 | consumed tokens: 31074549760 | elapsed time per iteration (s): 0.78 | learning rate: 1.546E-04 | global batch size: 256 | lm loss: 2.059118E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.620 | TFLOPs: 19.88 |
31: iteration 59280/ 173500 | consumed samples: 15175680 | consumed tokens: 31079792640 | elapsed time per iteration (s): 0.78 | learning rate: 1.546E-04 | global batch size: 256 | lm loss: 2.054754E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.922 | TFLOPs: 19.84 |
31: iteration 59290/ 173500 | consumed samples: 15178240 | consumed tokens: 31085035520 | elapsed time per iteration (s): 0.71 | learning rate: 1.546E-04 | global batch size: 256 | lm loss: 2.065539E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 360.270 | TFLOPs: 21.80 |
31: iteration 59300/ 173500 | consumed samples: 15180800 | consumed tokens: 31090278400 | elapsed time per iteration (s): 0.77 | learning rate: 1.546E-04 | global batch size: 256 | lm loss: 2.071848E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.444 | TFLOPs: 20.23 |
31: iteration 59310/ 173500 | consumed samples: 15183360 | consumed tokens: 31095521280 | elapsed time per iteration (s): 0.78 | learning rate: 1.545E-04 | global batch size: 256 | lm loss: 2.038122E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.074 | TFLOPs: 19.91 |
31: iteration 59320/ 173500 | consumed samples: 15185920 | consumed tokens: 31100764160 | elapsed time per iteration (s): 0.76 | learning rate: 1.545E-04 | global batch size: 256 | lm loss: 2.023035E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.065 | TFLOPs: 20.45 |
31: iteration 59330/ 173500 | consumed samples: 15188480 | consumed tokens: 31106007040 | elapsed time per iteration (s): 0.76 | learning rate: 1.545E-04 | global batch size: 256 | lm loss: 2.039995E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.170 | TFLOPs: 20.34 |
31: iteration 59340/ 173500 | consumed samples: 15191040 | consumed tokens: 31111249920 | elapsed time per iteration (s): 0.81 | learning rate: 1.545E-04 | global batch size: 256 | lm loss: 2.058486E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 316.787 | TFLOPs: 19.16 | 31: iteration 59350/ 173500 | consumed samples: 15193600 | consumed tokens: 31116492800 | elapsed time per iteration (s): 0.73 | learning rate: 1.545E-04 | global batch size: 256 | lm loss: 2.038614E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.461 | TFLOPs: 21.32 | 31: iteration 59360/ 173500 | consumed samples: 15196160 | consumed tokens: 31121735680 | elapsed time per iteration (s): 0.88 | learning rate: 1.545E-04 | global batch size: 256 | lm loss: 2.057578E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.285 | TFLOPs: 17.68 | 31: iteration 59370/ 173500 | consumed samples: 15198720 | consumed tokens: 31126978560 | elapsed time per iteration (s): 0.77 | learning rate: 1.545E-04 | global batch size: 256 | lm loss: 2.044760E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.114 | TFLOPs: 20.15 | 31: iteration 59380/ 173500 | consumed samples: 15201280 | consumed tokens: 31132221440 | elapsed time per iteration (s): 0.77 | learning rate: 1.544E-04 | global batch size: 256 | lm loss: 2.043274E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.367 | TFLOPs: 20.17 | 31: iteration 59390/ 173500 | consumed samples: 15203840 | consumed tokens: 31137464320 | elapsed time per iteration (s): 0.78 | learning rate: 1.544E-04 | global batch size: 256 | lm loss: 2.051328E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.176 | TFLOPs: 19.91 | 31: iteration 59400/ 173500 | consumed samples: 15206400 | consumed tokens: 31142707200 | elapsed time per iteration (s): 0.81 | learning rate: 1.544E-04 | global batch 
size: 256 | lm loss: 2.048911E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.011 | TFLOPs: 19.18 | 31: iteration 59410/ 173500 | consumed samples: 15208960 | consumed tokens: 31147950080 | elapsed time per iteration (s): 0.77 | learning rate: 1.544E-04 | global batch size: 256 | lm loss: 2.046705E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.098 | TFLOPs: 20.15 | 31: iteration 59420/ 173500 | consumed samples: 15211520 | consumed tokens: 31153192960 | elapsed time per iteration (s): 0.77 | learning rate: 1.544E-04 | global batch size: 256 | lm loss: 2.067773E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.048 | TFLOPs: 20.21 | 31: iteration 59430/ 173500 | consumed samples: 15214080 | consumed tokens: 31158435840 | elapsed time per iteration (s): 0.76 | learning rate: 1.544E-04 | global batch size: 256 | lm loss: 2.059741E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.925 | TFLOPs: 20.26 | 31: iteration 59440/ 173500 | consumed samples: 15216640 | consumed tokens: 31163678720 | elapsed time per iteration (s): 0.73 | learning rate: 1.544E-04 | global batch size: 256 | lm loss: 2.024034E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.570 | TFLOPs: 21.09 | 31: iteration 59450/ 173500 | consumed samples: 15219200 | consumed tokens: 31168921600 | elapsed time per iteration (s): 0.79 | learning rate: 1.543E-04 | global batch size: 256 | lm loss: 2.047345E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.569 | TFLOPs: 19.58 | 31: iteration 59460/ 173500 | consumed samples: 15221760 | 
consumed tokens: 31174164480 | elapsed time per iteration (s): 0.74 | learning rate: 1.543E-04 | global batch size: 256 | lm loss: 2.004997E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.485 | TFLOPs: 20.90 | 31: iteration 59470/ 173500 | consumed samples: 15224320 | consumed tokens: 31179407360 | elapsed time per iteration (s): 0.76 | learning rate: 1.543E-04 | global batch size: 256 | lm loss: 2.052672E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.366 | TFLOPs: 20.29 | 31: iteration 59480/ 173500 | consumed samples: 15226880 | consumed tokens: 31184650240 | elapsed time per iteration (s): 0.78 | learning rate: 1.543E-04 | global batch size: 256 | lm loss: 2.054786E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.309 | TFLOPs: 19.92 | 31: iteration 59490/ 173500 | consumed samples: 15229440 | consumed tokens: 31189893120 | elapsed time per iteration (s): 0.78 | learning rate: 1.543E-04 | global batch size: 256 | lm loss: 2.009621E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.198 | TFLOPs: 19.86 | 31: iteration 59500/ 173500 | consumed samples: 15232000 | consumed tokens: 31195136000 | elapsed time per iteration (s): 0.79 | learning rate: 1.543E-04 | global batch size: 256 | lm loss: 2.063202E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.691 | TFLOPs: 19.52 | 31: iteration 59510/ 173500 | consumed samples: 15234560 | consumed tokens: 31200378880 | elapsed time per iteration (s): 0.72 | learning rate: 1.543E-04 | global batch size: 256 | lm loss: 2.023612E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 353.168 | TFLOPs: 21.37 | 31: iteration 59520/ 173500 | consumed samples: 15237120 | consumed tokens: 31205621760 | elapsed time per iteration (s): 0.83 | learning rate: 1.542E-04 | global batch size: 256 | lm loss: 2.043898E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.315 | TFLOPs: 18.59 | 31: iteration 59530/ 173500 | consumed samples: 15239680 | consumed tokens: 31210864640 | elapsed time per iteration (s): 0.78 | learning rate: 1.542E-04 | global batch size: 256 | lm loss: 2.046245E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.621 | TFLOPs: 19.82 | 31: iteration 59540/ 173500 | consumed samples: 15242240 | consumed tokens: 31216107520 | elapsed time per iteration (s): 0.73 | learning rate: 1.542E-04 | global batch size: 256 | lm loss: 2.045184E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.351 | TFLOPs: 21.13 | 31: iteration 59550/ 173500 | consumed samples: 15244800 | consumed tokens: 31221350400 | elapsed time per iteration (s): 0.80 | learning rate: 1.542E-04 | global batch size: 256 | lm loss: 2.037220E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.279 | TFLOPs: 19.32 | 31: iteration 59560/ 173500 | consumed samples: 15247360 | consumed tokens: 31226593280 | elapsed time per iteration (s): 0.95 | learning rate: 1.542E-04 | global batch size: 256 | lm loss: 2.053217E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 270.210 | TFLOPs: 16.35 | 31: iteration 59570/ 173500 | consumed samples: 15249920 | consumed tokens: 31231836160 | elapsed time per iteration (s): 0.84 | learning rate: 1.542E-04 | global batch size: 256 | lm loss: 
2.021188E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.014 | TFLOPs: 18.51 | 31: iteration 59580/ 173500 | consumed samples: 15252480 | consumed tokens: 31237079040 | elapsed time per iteration (s): 0.84 | learning rate: 1.542E-04 | global batch size: 256 | lm loss: 2.002747E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.291 | TFLOPs: 18.53 | 31: iteration 59590/ 173500 | consumed samples: 15255040 | consumed tokens: 31242321920 | elapsed time per iteration (s): 0.83 | learning rate: 1.541E-04 | global batch size: 256 | lm loss: 2.045621E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.024 | TFLOPs: 18.76 | 31: iteration 59600/ 173500 | consumed samples: 15257600 | consumed tokens: 31247564800 | elapsed time per iteration (s): 0.79 | learning rate: 1.541E-04 | global batch size: 256 | lm loss: 2.064039E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.307 | TFLOPs: 19.62 | 31: iteration 59610/ 173500 | consumed samples: 15260160 | consumed tokens: 31252807680 | elapsed time per iteration (s): 0.79 | learning rate: 1.541E-04 | global batch size: 256 | lm loss: 2.046047E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.756 | TFLOPs: 19.71 | 31: iteration 59620/ 173500 | consumed samples: 15262720 | consumed tokens: 31258050560 | elapsed time per iteration (s): 0.88 | learning rate: 1.541E-04 | global batch size: 256 | lm loss: 2.026926E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.332 | TFLOPs: 17.56 | 31: iteration 59630/ 173500 | consumed samples: 15265280 | consumed tokens: 
31263293440 | elapsed time per iteration (s): 0.81 | learning rate: 1.541E-04 | global batch size: 256 | lm loss: 2.040041E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.118 | TFLOPs: 19.12 | 31: iteration 59640/ 173500 | consumed samples: 15267840 | consumed tokens: 31268536320 | elapsed time per iteration (s): 0.75 | learning rate: 1.541E-04 | global batch size: 256 | lm loss: 2.062878E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.601 | TFLOPs: 20.73 | 31: iteration 59650/ 173500 | consumed samples: 15270400 | consumed tokens: 31273779200 | elapsed time per iteration (s): 0.74 | learning rate: 1.541E-04 | global batch size: 256 | lm loss: 2.058627E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.499 | TFLOPs: 20.96 | 31: iteration 59660/ 173500 | consumed samples: 15272960 | consumed tokens: 31279022080 | elapsed time per iteration (s): 0.74 | learning rate: 1.540E-04 | global batch size: 256 | lm loss: 2.039796E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.849 | TFLOPs: 20.92 | 31: iteration 59670/ 173500 | consumed samples: 15275520 | consumed tokens: 31284264960 | elapsed time per iteration (s): 0.75 | learning rate: 1.540E-04 | global batch size: 256 | lm loss: 2.030584E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.563 | TFLOPs: 20.78 | 31: iteration 59680/ 173500 | consumed samples: 15278080 | consumed tokens: 31289507840 | elapsed time per iteration (s): 0.76 | learning rate: 1.540E-04 | global batch size: 256 | lm loss: 2.052930E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 335.227 | TFLOPs: 20.28 | 31: iteration 59690/ 173500 | consumed samples: 15280640 | consumed tokens: 31294750720 | elapsed time per iteration (s): 0.79 | learning rate: 1.540E-04 | global batch size: 256 | lm loss: 2.071145E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.024 | TFLOPs: 19.72 | 31: iteration 59700/ 173500 | consumed samples: 15283200 | consumed tokens: 31299993600 | elapsed time per iteration (s): 0.76 | learning rate: 1.540E-04 | global batch size: 256 | lm loss: 2.044349E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.585 | TFLOPs: 20.30 | 31: iteration 59710/ 173500 | consumed samples: 15285760 | consumed tokens: 31305236480 | elapsed time per iteration (s): 0.79 | learning rate: 1.540E-04 | global batch size: 256 | lm loss: 2.012997E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.201 | TFLOPs: 19.61 | 31: iteration 59720/ 173500 | consumed samples: 15288320 | consumed tokens: 31310479360 | elapsed time per iteration (s): 0.74 | learning rate: 1.540E-04 | global batch size: 256 | lm loss: 2.045152E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.899 | TFLOPs: 20.93 | 31: iteration 59730/ 173500 | consumed samples: 15290880 | consumed tokens: 31315722240 | elapsed time per iteration (s): 0.79 | learning rate: 1.539E-04 | global batch size: 256 | lm loss: 2.037752E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.681 | TFLOPs: 19.70 | 31: iteration 59740/ 173500 | consumed samples: 15293440 | consumed tokens: 31320965120 | elapsed time per iteration (s): 0.75 | learning rate: 1.539E-04 | global batch size: 256 | lm loss: 2.046185E+00 | grad 
norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.367 | TFLOPs: 20.53 | 31: iteration 59750/ 173500 | consumed samples: 15296000 | consumed tokens: 31326208000 | elapsed time per iteration (s): 0.82 | learning rate: 1.539E-04 | global batch size: 256 | lm loss: 2.007551E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.687 | TFLOPs: 18.92 | 31: iteration 59760/ 173500 | consumed samples: 15298560 | consumed tokens: 31331450880 | elapsed time per iteration (s): 0.82 | learning rate: 1.539E-04 | global batch size: 256 | lm loss: 2.046060E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.852 | TFLOPs: 18.87 | 31: iteration 59770/ 173500 | consumed samples: 15301120 | consumed tokens: 31336693760 | elapsed time per iteration (s): 0.77 | learning rate: 1.539E-04 | global batch size: 256 | lm loss: 2.046164E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.529 | TFLOPs: 20.18 | 31: iteration 59780/ 173500 | consumed samples: 15303680 | consumed tokens: 31341936640 | elapsed time per iteration (s): 0.79 | learning rate: 1.539E-04 | global batch size: 256 | lm loss: 2.052958E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.258 | TFLOPs: 19.68 | 31: iteration 59790/ 173500 | consumed samples: 15306240 | consumed tokens: 31347179520 | elapsed time per iteration (s): 0.77 | learning rate: 1.539E-04 | global batch size: 256 | lm loss: 2.034972E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.898 | TFLOPs: 20.08 | 31: iteration 59800/ 173500 | consumed samples: 15308800 | consumed tokens: 31352422400 | elapsed time 
per iteration (s): 0.80 | learning rate: 1.538E-04 | global batch size: 256 | lm loss: 2.041942E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.860 | TFLOPs: 19.41 | 31: iteration 59810/ 173500 | consumed samples: 15311360 | consumed tokens: 31357665280 | elapsed time per iteration (s): 0.76 | learning rate: 1.538E-04 | global batch size: 256 | lm loss: 2.055202E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.072 | TFLOPs: 20.51 | 31: iteration 59820/ 173500 | consumed samples: 15313920 | consumed tokens: 31362908160 | elapsed time per iteration (s): 0.77 | learning rate: 1.538E-04 | global batch size: 256 | lm loss: 2.042141E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.219 | TFLOPs: 20.22 | 31: iteration 59830/ 173500 | consumed samples: 15316480 | consumed tokens: 31368151040 | elapsed time per iteration (s): 0.77 | learning rate: 1.538E-04 | global batch size: 256 | lm loss: 2.054104E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.092 | TFLOPs: 20.15 | 31: iteration 59840/ 173500 | consumed samples: 15319040 | consumed tokens: 31373393920 | elapsed time per iteration (s): 0.76 | learning rate: 1.538E-04 | global batch size: 256 | lm loss: 2.038867E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.633 | TFLOPs: 20.30 | 31: iteration 59850/ 173500 | consumed samples: 15321600 | consumed tokens: 31378636800 | elapsed time per iteration (s): 0.79 | learning rate: 1.538E-04 | global batch size: 256 | lm loss: 2.049865E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.192 | TFLOPs: 
19.55 | 31: iteration 59860/ 173500 | consumed samples: 15324160 | consumed tokens: 31383879680 | elapsed time per iteration (s): 0.79 | learning rate: 1.538E-04 | global batch size: 256 | lm loss: 2.035550E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.108 | TFLOPs: 19.61 | 31: iteration 59870/ 173500 | consumed samples: 15326720 | consumed tokens: 31389122560 | elapsed time per iteration (s): 0.80 | learning rate: 1.537E-04 | global batch size: 256 | lm loss: 2.050063E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.350 | TFLOPs: 19.44 | 31: iteration 59880/ 173500 | consumed samples: 15329280 | consumed tokens: 31394365440 | elapsed time per iteration (s): 0.82 | learning rate: 1.537E-04 | global batch size: 256 | lm loss: 2.034606E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.995 | TFLOPs: 18.94 | 31: iteration 59890/ 173500 | consumed samples: 15331840 | consumed tokens: 31399608320 | elapsed time per iteration (s): 0.80 | learning rate: 1.537E-04 | global batch size: 256 | lm loss: 2.022103E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.398 | TFLOPs: 19.26 | 31: iteration 59900/ 173500 | consumed samples: 15334400 | consumed tokens: 31404851200 | elapsed time per iteration (s): 0.80 | learning rate: 1.537E-04 | global batch size: 256 | lm loss: 2.056268E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.120 | TFLOPs: 19.43 | 31: iteration 59910/ 173500 | consumed samples: 15336960 | consumed tokens: 31410094080 | elapsed time per iteration (s): 0.81 | learning rate: 1.537E-04 | global batch size: 256 | lm loss: 2.061275E+00 | grad norm: 0.139 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.065 | TFLOPs: 19.18 | 31: iteration 59920/ 173500 | consumed samples: 15339520 | consumed tokens: 31415336960 | elapsed time per iteration (s): 0.80 | learning rate: 1.537E-04 | global batch size: 256 | lm loss: 2.050912E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.966 | TFLOPs: 19.42 | 31: iteration 59930/ 173500 | consumed samples: 15342080 | consumed tokens: 31420579840 | elapsed time per iteration (s): 0.81 | learning rate: 1.537E-04 | global batch size: 256 | lm loss: 2.020306E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.851 | TFLOPs: 19.05 | 31: iteration 59940/ 173500 | consumed samples: 15344640 | consumed tokens: 31425822720 | elapsed time per iteration (s): 0.84 | learning rate: 1.536E-04 | global batch size: 256 | lm loss: 2.032364E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.615 | TFLOPs: 18.37 | 31: iteration 59950/ 173500 | consumed samples: 15347200 | consumed tokens: 31431065600 | elapsed time per iteration (s): 0.83 | learning rate: 1.536E-04 | global batch size: 256 | lm loss: 2.054654E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.792 | TFLOPs: 18.74 | 31: iteration 59960/ 173500 | consumed samples: 15349760 | consumed tokens: 31436308480 | elapsed time per iteration (s): 0.87 | learning rate: 1.536E-04 | global batch size: 256 | lm loss: 2.051786E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.832 | TFLOPs: 17.84 | 31: iteration 59970/ 173500 | consumed samples: 15352320 | consumed tokens: 31441551360 | elapsed time per iteration (s): 0.79 | 
learning rate: 1.536E-04 | global batch size: 256 | lm loss: 2.044407E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.287 | TFLOPs: 19.56 | 31: iteration 59980/ 173500 | consumed samples: 15354880 | consumed tokens: 31446794240 | elapsed time per iteration (s): 1.23 | learning rate: 1.536E-04 | global batch size: 256 | lm loss: 2.039220E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 208.565 | TFLOPs: 12.62 | 31: iteration 59990/ 173500 | consumed samples: 15357440 | consumed tokens: 31452037120 | elapsed time per iteration (s): 0.76 | learning rate: 1.536E-04 | global batch size: 256 | lm loss: 2.053955E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.489 | TFLOPs: 20.48 | 0: [2022-11-26 07:33:06,760] [INFO] [logging.py:68:log_dist] [Rank 0] step=60000, skipped=0, lr=[0.00015355285563304073, 0.00015355285563304073, 0.00015355285563304073], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 60000/ 173500 | consumed samples: 15360000 | consumed tokens: 31457280000 | elapsed time per iteration (s): 0.80 | learning rate: 1.536E-04 | global batch size: 256 | lm loss: 2.017577E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.050 | TFLOPs: 19.42 | 0: steps: 60000 loss: 2.0991 iter time (s): 0.786 samples/sec: 325.884 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 60000 | lm loss value: 1.987818E+00 | lm loss PPL: 7.299589E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 60000 to checkpoints_1b1long 0: [2022-11-26 07:33:07,111] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] 
Checkpoint global_step60000 is begin to save! 0: [2022-11-26 07:33:07,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_01-model_00-model_states.pt... 0: [2022-11-26 07:33:07,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_01-model_00-model_states.pt. 0: [2022-11-26 07:33:07,369] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_03-model_00-model_states.pt... 0: [2022-11-26 07:33:07,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_03-model_00-model_states.pt. 0: [2022-11-26 07:33:07,452] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_04-model_00-model_states.pt... 0: [2022-11-26 07:33:07,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_04-model_00-model_states.pt. 0: [2022-11-26 07:33:07,531] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_05-model_00-model_states.pt... 0: [2022-11-26 07:33:07,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_05-model_00-model_states.pt. 0: [2022-11-26 07:33:07,606] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_06-model_00-model_states.pt... 0: [2022-11-26 07:33:07,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_06-model_00-model_states.pt. 0: [2022-11-26 07:33:07,691] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_07-model_00-model_states.pt... 0: [2022-11-26 07:33:07,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 07:33:07,769] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_08-model_00-model_states.pt... 0: [2022-11-26 07:33:07,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_08-model_00-model_states.pt. 0: [2022-11-26 07:33:07,843] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_09-model_00-model_states.pt... 0: [2022-11-26 07:33:07,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_09-model_00-model_states.pt. 0: [2022-11-26 07:33:07,919] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_10-model_00-model_states.pt... 0: [2022-11-26 07:33:07,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_10-model_00-model_states.pt. 0: [2022-11-26 07:33:07,995] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_11-model_00-model_states.pt... 0: [2022-11-26 07:33:08,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_11-model_00-model_states.pt. 0: [2022-11-26 07:33:08,070] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_12-model_00-model_states.pt... 0: [2022-11-26 07:33:08,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_12-model_00-model_states.pt. 0: [2022-11-26 07:33:08,146] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_13-model_00-model_states.pt... 0: [2022-11-26 07:33:08,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 07:33:08,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_14-model_00-model_states.pt... 0: [2022-11-26 07:33:08,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_14-model_00-model_states.pt. 0: [2022-11-26 07:33:08,298] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_15-model_00-model_states.pt... 0: [2022-11-26 07:33:08,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_15-model_00-model_states.pt. 0: [2022-11-26 07:33:08,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_16-model_00-model_states.pt... 0: [2022-11-26 07:33:08,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_16-model_00-model_states.pt. 0: [2022-11-26 07:33:08,451] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_17-model_00-model_states.pt... 0: [2022-11-26 07:33:08,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_17-model_00-model_states.pt. 0: [2022-11-26 07:33:08,529] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_18-model_00-model_states.pt... 0: [2022-11-26 07:33:08,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_18-model_00-model_states.pt. 0: [2022-11-26 07:33:08,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_19-model_00-model_states.pt... 0: [2022-11-26 07:33:08,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 07:33:08,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_20-model_00-model_states.pt... 0: [2022-11-26 07:33:08,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_20-model_00-model_states.pt. 0: [2022-11-26 07:33:08,756] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_21-model_00-model_states.pt... 0: [2022-11-26 07:33:08,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_21-model_00-model_states.pt. 0: [2022-11-26 07:33:08,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_22-model_00-model_states.pt... 0: [2022-11-26 07:33:08,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_22-model_00-model_states.pt. 0: [2022-11-26 07:33:08,906] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_23-model_00-model_states.pt... 0: [2022-11-26 07:33:08,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_23-model_00-model_states.pt. 0: [2022-11-26 07:33:08,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_24-model_00-model_states.pt... 0: [2022-11-26 07:33:09,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_24-model_00-model_states.pt. 0: [2022-11-26 07:33:09,054] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_25-model_00-model_states.pt... 0: [2022-11-26 07:33:09,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 07:33:09,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_26-model_00-model_states.pt... 0: [2022-11-26 07:33:09,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_26-model_00-model_states.pt. 0: [2022-11-26 07:33:09,205] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_27-model_00-model_states.pt... 0: [2022-11-26 07:33:09,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_27-model_00-model_states.pt. 0: [2022-11-26 07:33:09,278] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_28-model_00-model_states.pt... 0: [2022-11-26 07:33:09,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_28-model_00-model_states.pt. 0: [2022-11-26 07:33:09,355] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/layer_30-model_00-model_states.pt... 0: [2022-11-26 07:33:09,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/layer_30-model_00-model_states.pt. 0: [2022-11-26 07:33:09,357] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step60000/mp_rank_00_model_states.pt 0: [2022-11-26 07:33:09,357] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/mp_rank_00_model_states.pt... 0: [2022-11-26 07:33:09,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/mp_rank_00_model_states.pt. 0: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
6: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 
9: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 
16: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
3: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 
20: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 
23: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
14: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 
17: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 
19: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
7: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
1: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 
3: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 
23: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
14: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 
30: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 
26: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
6: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
16: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 
24: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 
18: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 
1: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 
24: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 18: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 
0: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 18: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 
0: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:33:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:33:09,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:33:09,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 07:33:09,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 7: [2022-11-26 07:33:09,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:33:09,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 07:33:09,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 11: [2022-11-26 07:33:09,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:33:09,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 07:33:09,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 4: [2022-11-26 07:33:09,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-26 07:33:09,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 07:33:09,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 17: [2022-11-26 07:33:09,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 07:33:09,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 07:33:09,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 8: [2022-11-26 07:33:09,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:33:09,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 1: [2022-11-26 07:33:09,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:33:09,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 29: [2022-11-26 07:33:09,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-26 07:33:09,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 1: [2022-11-26 07:33:09,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 29: [2022-11-26 07:33:09,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 1: [2022-11-26 07:33:09,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 28: [2022-11-26 07:33:09,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:33:09,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:33:09,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 07:33:09,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 28: [2022-11-26 07:33:09,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 07:33:09,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 21: [2022-11-26 07:33:09,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:33:09,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
21: [2022-11-26 07:33:09,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 07:33:09,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 1: [2022-11-26 07:33:09,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:33:09,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 1: [2022-11-26 07:33:09,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 27: [2022-11-26 07:33:09,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 1: [2022-11-26 07:33:09,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 12: [2022-11-26 07:33:09,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:33:09,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 07:33:09,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 10: [2022-11-26 07:33:09,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 20: [2022-11-26 07:33:09,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
4: [2022-11-26 07:33:09,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:33:09,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 20: [2022-11-26 07:33:09,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 31: [2022-11-26 07:33:09,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:33:09,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 10: [2022-11-26 07:33:09,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 20: [2022-11-26 07:33:09,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 31: [2022-11-26 07:33:09,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 4: [2022-11-26 07:33:09,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 20: [2022-11-26 07:33:09,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 31: [2022-11-26 07:33:09,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 
20: [2022-11-26 07:33:09,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 07:33:09,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 5: [2022-11-26 07:33:09,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:33:09,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 07:33:09,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 25: [2022-11-26 07:33:09,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 07:33:09,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 07:33:09,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:33:09,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:33:09,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 07:33:09,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 10: [2022-11-26 07:33:09,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 14: [2022-11-26 07:33:09,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 5: [2022-11-26 07:33:09,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:33:09,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 5: [2022-11-26 07:33:09,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 25: [2022-11-26 07:33:09,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 5: [2022-11-26 07:33:09,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 25: [2022-11-26 07:33:09,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 19: [2022-11-26 07:33:09,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 25: [2022-11-26 07:33:09,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 
19: [2022-11-26 07:33:09,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 07:33:09,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 21: [2022-11-26 07:33:09,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:33:09,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 07:33:09,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 7: [2022-11-26 07:33:09,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 22: [2022-11-26 07:33:09,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:33:09,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 22: [2022-11-26 07:33:09,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 7: [2022-11-26 07:33:09,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 22: [2022-11-26 07:33:09,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 17: [2022-11-26 07:33:09,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 
17: [2022-11-26 07:33:09,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 07:33:09,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 8: [2022-11-26 07:33:09,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:33:09,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:33:09,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 07:33:09,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 07:33:09,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 8: [2022-11-26 07:33:09,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 19: [2022-11-26 07:33:09,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:33:09,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 07:33:09,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:33:09,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 
20: [2022-11-26 07:33:09,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:33:09,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 20: [2022-11-26 07:33:09,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 19: [2022-11-26 07:33:09,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 20: [2022-11-26 07:33:09,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 31: [2022-11-26 07:33:09,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 07:33:09,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 07:33:09,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 28: [2022-11-26 07:33:09,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:33:09,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:33:09,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 07:33:09,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 28: [2022-11-26 07:33:09,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 2: [2022-11-26 07:33:09,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 07:33:09,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 28: [2022-11-26 07:33:09,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 2: [2022-11-26 07:33:09,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 29: [2022-11-26 07:33:09,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 07:33:09,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 07:33:09,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 25: [2022-11-26 07:33:09,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 07:33:09,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 07:33:09,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 
27: [2022-11-26 07:33:09,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:33:09,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 12: [2022-11-26 07:33:09,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:33:09,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 12: [2022-11-26 07:33:09,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 07:33:09,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 28: [2022-11-26 07:33:09,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:33:09,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:33:09,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 07:33:09,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 1: [2022-11-26 07:33:09,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 07:33:09,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 4: [2022-11-26 07:33:09,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:33:09,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 4: [2022-11-26 07:33:09,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 07:33:09,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 7: [2022-11-26 07:33:09,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:33:09,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 07:33:09,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 3: [2022-11-26 07:33:09,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:33:09,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 07:33:09,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 26: [2022-11-26 07:33:09,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
26: [2022-11-26 07:33:09,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 29: [2022-11-26 07:33:09,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 26: [2022-11-26 07:33:09,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 29: [2022-11-26 07:33:09,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 07:33:09,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 12: [2022-11-26 07:33:09,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:33:09,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:33:09,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 12: [2022-11-26 07:33:09,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 10: [2022-11-26 07:33:09,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 12: [2022-11-26 07:33:09,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 30: [2022-11-26 07:33:09,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-26 07:33:09,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 07:33:09,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 14: [2022-11-26 07:33:09,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:33:09,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:33:09,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 21: [2022-11-26 07:33:09,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 8: [2022-11-26 07:33:09,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:33:09,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 21: [2022-11-26 07:33:09,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 8: [2022-11-26 07:33:09,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 07:33:09,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 14: [2022-11-26 07:33:09,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-26 07:33:09,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 07:33:09,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 2: [2022-11-26 07:33:09,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:33:09,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:33:09,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 07:33:09,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 11: [2022-11-26 07:33:09,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 07:33:09,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 17: [2022-11-26 07:33:09,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 07:33:09,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 07:33:09,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 22: [2022-11-26 07:33:09,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
22: [2022-11-26 07:33:09,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 07:33:09,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 5: [2022-11-26 07:33:09,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:33:09,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 07:33:09,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 26: [2022-11-26 07:33:09,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:33:09,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 26: [2022-11-26 07:33:09,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 07:33:09,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 0: [2022-11-26 07:33:09,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 07:33:09,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 0: [2022-11-26 07:33:09,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 07:33:09,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 07:33:09,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 5: [2022-11-26 07:33:09,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:33:09,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 07:33:09,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 24: [2022-11-26 07:33:09,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 07:33:09,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 07:33:09,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 07:33:09,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:33:09,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 07:33:09,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 24: [2022-11-26 07:33:09,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 07:33:09,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 07:33:09,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 07:33:09,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 3: [2022-11-26 07:33:09,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 24: [2022-11-26 07:33:09,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 24: [2022-11-26 07:33:09,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 24: [2022-11-26 07:33:09,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 24: [2022-11-26 07:33:09,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 12: [2022-11-26 07:33:09,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 07:33:09,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 07:33:09,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 23: [2022-11-26 07:33:09,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 07:33:09,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 07:33:09,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 23: [2022-11-26 07:33:09,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 07:33:09,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 1: [2022-11-26 07:33:09,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 23: [2022-11-26 07:33:09,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 1: [2022-11-26 07:33:09,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 07:33:09,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 14: [2022-11-26 07:33:09,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-26 07:33:09,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 07:33:09,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 4: [2022-11-26 07:33:09,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:33:09,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 07:33:09,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 28: [2022-11-26 07:33:09,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 07:33:09,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 20: [2022-11-26 07:33:09,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 07:33:09,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 07:33:09,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 27: [2022-11-26 07:33:09,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
27: [2022-11-26 07:33:09,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 07:33:09,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 25: [2022-11-26 07:33:09,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:33:09,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 25: [2022-11-26 07:33:09,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 11: [2022-11-26 07:33:09,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 25: [2022-11-26 07:33:09,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 11: [2022-11-26 07:33:09,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 31: [2022-11-26 07:33:09,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 07:33:09,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 27: [2022-11-26 07:33:09,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 31: [2022-11-26 07:33:09,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 
31: [2022-11-26 07:33:09,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:33:09,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 31: [2022-11-26 07:33:09,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 07:33:09,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 27: [2022-11-26 07:33:09,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 7: [2022-11-26 07:33:09,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:33:09,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 07:33:09,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 29: [2022-11-26 07:33:09,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 07:33:09,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 07:33:09,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 22: [2022-11-26 07:33:09,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
22: [2022-11-26 07:33:09,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 07:33:09,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 07:33:09,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 22: [2022-11-26 07:33:09,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 07:33:09,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 30: [2022-11-26 07:33:09,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:33:09,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:33:09,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 07:33:09,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 30: [2022-11-26 07:33:09,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 07:33:09,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 17: [2022-11-26 07:33:09,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-26 07:33:09,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 07:33:09,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 2: [2022-11-26 07:33:09,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:33:09,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 07:33:09,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 10: [2022-11-26 07:33:09,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:33:09,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 07:33:09,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 26: [2022-11-26 07:33:09,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 23: [2022-11-26 07:33:09,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 07:33:09,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 07:33:09,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 
26: [2022-11-26 07:33:09,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 07:33:09,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 19: [2022-11-26 07:33:09,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:33:09,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 07:33:09,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 11: [2022-11-26 07:33:09,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:33:09,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:33:09,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 5: [2022-11-26 07:33:09,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 11: [2022-11-26 07:33:09,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 5: [2022-11-26 07:33:09,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 26: [2022-11-26 07:33:09,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
26: [2022-11-26 07:33:09,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 07:33:09,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 28: [2022-11-26 07:33:09,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 07:33:09,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 07:33:09,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 9: [2022-11-26 07:33:09,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:33:09,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:33:09,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:33:09,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 07:33:09,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 07:33:09,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 
9: [2022-11-26 07:33:09,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 07:33:09,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 18: [2022-11-26 07:33:09,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:33:09,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:33:09,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:33:09,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:33:09,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 
18: [2022-11-26 07:33:09,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 07:33:09,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 07:33:09,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 07:33:09,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 07:33:09,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 18: [2022-11-26 07:33:09,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 18: [2022-11-26 07:33:09,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 18: [2022-11-26 07:33:09,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 0: [2022-11-26 07:33:09,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:33:09,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 07:33:09,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 26: [2022-11-26 07:33:09,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
26: [2022-11-26 07:33:09,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 07:33:09,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 0: [2022-11-26 07:33:09,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:33:09,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 07:33:09,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 16: [2022-11-26 07:33:09,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 07:33:09,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 07:33:09,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 07:33:09,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-26 07:33:09,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 07:33:09,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 07:33:09,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 07:33:09,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 07:33:09,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 16: [2022-11-26 07:33:09,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 16: [2022-11-26 07:33:09,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 16: [2022-11-26 07:33:09,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 0: [2022-11-26 07:33:09,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:33:09,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 07:33:09,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 15: [2022-11-26 07:33:09,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 07:33:09,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 07:33:09,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 19: [2022-11-26 07:33:09,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:33:09,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 07:33:09,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 13: [2022-11-26 07:33:09,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:33:09,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:33:09,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:33:09,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 07:33:09,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 07:33:09,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 07:33:09,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 07:33:09,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 07:33:09,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 13: [2022-11-26 07:33:09,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 13: [2022-11-26 07:33:09,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 13: [2022-11-26 07:33:09,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 15: [2022-11-26 07:33:09,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:33:09,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-26 07:33:09,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 07:33:09,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 07:33:09,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 15: [2022-11-26 07:33:09,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 25: [2022-11-26 07:33:09,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 07:33:09,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 07:33:09,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 17: [2022-11-26 07:33:09,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 07:33:09,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 07:33:09,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 1: [2022-11-26 07:33:09,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 07:33:09,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 07:33:09,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 18: [2022-11-26 07:33:09,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:33:09,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 07:33:09,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 6: [2022-11-26 07:33:09,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:33:09,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 07:33:09,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 6: [2022-11-26 07:33:09,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:33:09,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 07:33:09,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 6: [2022-11-26 07:33:09,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 07:33:09,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 07:33:09,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 8: [2022-11-26 07:33:09,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:33:09,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 07:33:09,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 29: [2022-11-26 07:33:09,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 07:33:09,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 07:33:09,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 21: [2022-11-26 07:33:09,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:33:09,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 07:33:09,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 28: [2022-11-26 07:33:09,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 
28: [2022-11-26 07:33:09,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 07:33:09,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 3: [2022-11-26 07:33:09,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:33:09,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 07:33:09,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 24: [2022-11-26 07:33:09,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 07:33:09,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 07:33:09,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 12: [2022-11-26 07:33:09,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:33:09,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 07:33:09,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 13: [2022-11-26 07:33:09,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 07:33:09,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 07:33:09,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 20: [2022-11-26 07:33:09,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 07:33:09,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 07:33:09,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 10: [2022-11-26 07:33:09,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:33:09,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 07:33:09,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 27: [2022-11-26 07:33:09,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:33:09,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 07:33:09,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 11: [2022-11-26 07:33:09,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-26 07:33:09,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 07:33:09,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 4: [2022-11-26 07:33:09,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:33:09,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 07:33:09,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 23: [2022-11-26 07:33:09,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 07:33:09,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 07:33:09,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 14: [2022-11-26 07:33:09,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:33:09,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 07:33:09,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 9: [2022-11-26 07:33:09,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-26 07:33:09,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 07:33:09,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 22: [2022-11-26 07:33:09,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 07:33:09,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 07:33:09,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 31: [2022-11-26 07:33:09,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 07:33:09,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 07:33:09,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 2: [2022-11-26 07:33:09,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:33:09,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 07:33:09,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 15: [2022-11-26 07:33:09,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 07:33:09,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 07:33:09,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 6: [2022-11-26 07:33:09,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:33:09,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 07:33:09,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 7: [2022-11-26 07:33:09,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:33:09,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 07:33:09,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 0: [2022-11-26 07:33:09,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:33:09,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:33:09,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 07:33:09,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 
16: [2022-11-26 07:33:09,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 07:33:09,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 07:33:09,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 0: [2022-11-26 07:33:09,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 07:33:09,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 5: [2022-11-26 07:33:09,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:33:09,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 07:33:09,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 8: [2022-11-26 07:33:09,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:33:09,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 07:33:09,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 21: [2022-11-26 07:33:09,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
21: [2022-11-26 07:33:09,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 07:33:09,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 18: [2022-11-26 07:33:09,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:33:09,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 07:33:09,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 17: [2022-11-26 07:33:09,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 07:33:09,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 07:33:09,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 19: [2022-11-26 07:33:09,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:33:09,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 07:33:09,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 25: [2022-11-26 07:33:09,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 
25: [2022-11-26 07:33:09,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 07:33:09,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 28: [2022-11-26 07:33:09,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 07:33:09,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 07:33:09,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 1: [2022-11-26 07:33:09,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:33:09,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 07:33:09,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 31: [2022-11-26 07:33:09,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 07:33:09,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 07:33:09,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 29: [2022-11-26 07:33:09,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
29: [2022-11-26 07:33:09,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 07:33:09,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 24: [2022-11-26 07:33:09,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 07:33:09,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 07:33:09,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 4: [2022-11-26 07:33:09,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:33:09,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 07:33:09,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 12: [2022-11-26 07:33:09,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:33:09,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 07:33:09,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 20: [2022-11-26 07:33:09,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-26 07:33:09,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 07:33:09,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 3: [2022-11-26 07:33:09,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:33:09,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 07:33:09,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 7: [2022-11-26 07:33:09,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:33:09,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 07:33:09,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 27: [2022-11-26 07:33:09,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:33:09,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 07:33:09,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 26: [2022-11-26 07:33:09,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-26 07:33:09,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 07:33:09,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 16: [2022-11-26 07:33:09,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 07:33:09,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 07:33:09,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 22: [2022-11-26 07:33:09,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 07:33:09,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 07:33:09,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 14: [2022-11-26 07:33:09,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:33:09,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 07:33:09,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 23: [2022-11-26 07:33:09,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
23: [2022-11-26 07:33:09,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 07:33:09,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 2: [2022-11-26 07:33:09,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:33:09,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 07:33:09,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 11: [2022-11-26 07:33:09,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:33:09,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 07:33:09,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 5: [2022-11-26 07:33:09,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:33:09,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 07:33:09,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 15: [2022-11-26 07:33:09,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-26 07:33:09,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 07:33:09,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 0: [2022-11-26 07:33:09,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:33:09,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 07:33:09,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 6: [2022-11-26 07:33:09,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:33:09,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 07:33:09,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 30: [2022-11-26 07:33:09,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:33:09,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 07:33:09,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 9: [2022-11-26 07:33:09,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-26 07:33:09,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 07:33:09,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 19: [2022-11-26 07:33:09,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:33:09,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 07:33:09,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 18: [2022-11-26 07:33:09,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:33:09,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 07:33:09,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 21: [2022-11-26 07:33:09,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:33:09,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 07:33:09,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 1: [2022-11-26 07:33:09,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 07:33:09,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 07:33:09,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 13: [2022-11-26 07:33:09,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:33:09,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 07:33:09,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 17: [2022-11-26 07:33:09,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 07:33:09,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 07:33:09,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 28: [2022-11-26 07:33:09,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 07:33:09,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 07:33:09,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 12: [2022-11-26 07:33:09,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 07:33:09,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 07:33:09,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 8: [2022-11-26 07:33:09,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:33:09,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 07:33:09,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 25: [2022-11-26 07:33:09,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 07:33:09,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 07:33:09,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 29: [2022-11-26 07:33:09,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 07:33:09,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 07:33:09,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 24: [2022-11-26 07:33:09,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
24: [2022-11-26 07:33:09,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 31: [2022-11-26 07:33:09,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 24: [2022-11-26 07:33:09,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 31: [2022-11-26 07:33:09,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 07:33:09,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 13: [2022-11-26 07:33:09,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:33:09,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 07:33:09,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 20: [2022-11-26 07:33:09,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 07:33:09,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 07:33:09,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 10: [2022-11-26 07:33:09,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-26 07:33:09,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 07:33:09,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 3: [2022-11-26 07:33:09,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:33:09,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 07:33:09,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 7: [2022-11-26 07:33:09,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:33:09,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 07:33:09,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 4: [2022-11-26 07:33:09,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:33:09,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 07:33:09,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 26: [2022-11-26 07:33:09,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-26 07:33:09,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 07:33:09,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 27: [2022-11-26 07:33:09,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:33:09,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 07:33:09,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 22: [2022-11-26 07:33:09,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 07:33:09,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 07:33:09,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 16: [2022-11-26 07:33:09,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 07:33:09,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 07:33:09,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 10: [2022-11-26 07:33:09,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-26 07:33:09,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 07:33:09,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 11: [2022-11-26 07:33:09,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:33:09,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 07:33:09,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 24: [2022-11-26 07:33:09,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 07:33:09,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 07:33:09,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 28: [2022-11-26 07:33:09,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 07:33:09,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 07:33:09,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 31: [2022-11-26 07:33:09,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 
31: [2022-11-26 07:33:09,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 07:33:09,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 30: [2022-11-26 07:33:09,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:33:09,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 07:33:09,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 23: [2022-11-26 07:33:09,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 07:33:09,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 07:33:09,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 7: [2022-11-26 07:33:09,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:33:09,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 07:33:09,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 0: [2022-11-26 07:33:09,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
22: [2022-11-26 07:33:09,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:33:09,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 22: [2022-11-26 07:33:09,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 0: [2022-11-26 07:33:09,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 22: [2022-11-26 07:33:09,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 21: [2022-11-26 07:33:09,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:33:09,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 20: [2022-11-26 07:33:09,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 07:33:09,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 21: [2022-11-26 07:33:09,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 20: [2022-11-26 07:33:09,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 11: [2022-11-26 07:33:09,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 07:33:09,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 2: [2022-11-26 07:33:09,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:33:09,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 2: [2022-11-26 07:33:09,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 07:33:09,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 4: [2022-11-26 07:33:09,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:33:09,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 07:33:09,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 13: [2022-11-26 07:33:09,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:33:09,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 07:33:09,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 19: [2022-11-26 07:33:09,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
3: [2022-11-26 07:33:09,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:33:09,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 3: [2022-11-26 07:33:09,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 19: [2022-11-26 07:33:09,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 3: [2022-11-26 07:33:09,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 12: [2022-11-26 07:33:09,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:33:09,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 07:33:09,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 14: [2022-11-26 07:33:09,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:33:09,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 07:33:09,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 10: [2022-11-26 07:33:09,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 07:33:09,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 07:33:09,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 23: [2022-11-26 07:33:09,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 07:33:09,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 07:33:09,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 18: [2022-11-26 07:33:09,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:33:09,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 07:33:09,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 9: [2022-11-26 07:33:09,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:33:09,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 17: [2022-11-26 07:33:09,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:33:09,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 
17: [2022-11-26 07:33:09,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 07:33:09,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 27: [2022-11-26 07:33:09,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:33:09,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 07:33:09,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 1: [2022-11-26 07:33:09,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:33:09,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 07:33:09,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 16: [2022-11-26 07:33:09,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 07:33:09,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 07:33:09,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 30: [2022-11-26 07:33:09,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-26 07:33:09,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 5: [2022-11-26 07:33:09,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:33:09,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 5: [2022-11-26 07:33:09,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 07:33:09,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 2: [2022-11-26 07:33:09,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:33:09,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 07:33:09,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 26: [2022-11-26 07:33:09,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 07:33:09,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 07:33:09,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 6: [2022-11-26 07:33:09,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 07:33:09,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:33:09,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 07:33:09,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 07:33:09,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 6: [2022-11-26 07:33:09,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 9: [2022-11-26 07:33:09,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:33:09,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 25: [2022-11-26 07:33:09,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:33:09,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 25: [2022-11-26 07:33:09,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 07:33:09,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 14: [2022-11-26 07:33:09,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-26 07:33:09,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 07:33:09,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 8: [2022-11-26 07:33:09,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:33:09,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 07:33:09,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 29: [2022-11-26 07:33:09,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 07:33:09,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 07:33:09,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 9: [2022-11-26 07:33:09,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:33:09,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 07:33:09,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 15: [2022-11-26 07:33:09,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 07:33:09,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:33:09,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:33:09,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 07:33:09,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 07:33:09,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 15: [2022-11-26 07:33:09,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 07:33:09,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 15: [2022-11-26 07:33:09,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 6: [2022-11-26 07:33:09,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:33:09,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 07:33:09,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 23: [2022-11-26 07:33:09,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
23: [2022-11-26 07:33:09,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 07:33:09,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 30: [2022-11-26 07:33:09,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:33:09,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step60000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 07:33:09,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 0: successfully saved checkpoint at iteration 60000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2679.75 31: iteration 60010/ 173500 | consumed samples: 15362560 | consumed tokens: 31462522880 | elapsed time per iteration (s): 1.10 | learning rate: 1.535E-04 | global batch size: 256 | lm loss: 2.019903E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.969 | TFLOPs: 14.09 | 31: iteration 60020/ 173500 | consumed samples: 15365120 | consumed tokens: 31467765760 | elapsed time per iteration (s): 0.80 | learning rate: 1.535E-04 | global batch size: 256 | lm loss: 2.015663E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.373 | TFLOPs: 19.32 | 31: iteration 60030/ 173500 | consumed samples: 15367680 | consumed tokens: 31473008640 | elapsed time per iteration (s): 0.79 | learning rate: 1.535E-04 | global batch size: 256 | lm loss: 2.054572E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.544 | TFLOPs: 19.51 | 31: iteration 60040/ 
173500 | consumed samples: 15370240 | consumed tokens: 31478251520 | elapsed time per iteration (s): 2.02 | learning rate: 1.535E-04 | global batch size: 256 | lm loss: 2.036282E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 126.426 | TFLOPs: 7.65 | 31: iteration 60050/ 173500 | consumed samples: 15372800 | consumed tokens: 31483494400 | elapsed time per iteration (s): 0.81 | learning rate: 1.535E-04 | global batch size: 256 | lm loss: 2.000564E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.781 | TFLOPs: 19.10 | 31: iteration 60060/ 173500 | consumed samples: 15375360 | consumed tokens: 31488737280 | elapsed time per iteration (s): 0.85 | learning rate: 1.535E-04 | global batch size: 256 | lm loss: 2.046975E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.734 | TFLOPs: 18.31 | 31: iteration 60070/ 173500 | consumed samples: 15377920 | consumed tokens: 31493980160 | elapsed time per iteration (s): 0.81 | learning rate: 1.535E-04 | global batch size: 256 | lm loss: 2.053474E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.004 | TFLOPs: 19.12 | 31: iteration 60080/ 173500 | consumed samples: 15380480 | consumed tokens: 31499223040 | elapsed time per iteration (s): 0.81 | learning rate: 1.534E-04 | global batch size: 256 | lm loss: 2.016375E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.979 | TFLOPs: 19.06 | 31: iteration 60090/ 173500 | consumed samples: 15383040 | consumed tokens: 31504465920 | elapsed time per iteration (s): 0.87 | learning rate: 1.534E-04 | global batch size: 256 | lm loss: 2.055817E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 294.758 | TFLOPs: 17.83 | 31: iteration 60100/ 173500 | consumed samples: 15385600 | consumed tokens: 31509708800 | elapsed time per iteration (s): 0.87 | learning rate: 1.534E-04 | global batch size: 256 | lm loss: 2.041980E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.391 | TFLOPs: 17.81 | 31: iteration 60110/ 173500 | consumed samples: 15388160 | consumed tokens: 31514951680 | elapsed time per iteration (s): 0.82 | learning rate: 1.534E-04 | global batch size: 256 | lm loss: 2.055351E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.704 | TFLOPs: 18.98 | 31: iteration 60120/ 173500 | consumed samples: 15390720 | consumed tokens: 31520194560 | elapsed time per iteration (s): 0.84 | learning rate: 1.534E-04 | global batch size: 256 | lm loss: 2.012062E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.023 | TFLOPs: 18.51 | 31: iteration 60130/ 173500 | consumed samples: 15393280 | consumed tokens: 31525437440 | elapsed time per iteration (s): 0.80 | learning rate: 1.534E-04 | global batch size: 256 | lm loss: 2.045454E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.543 | TFLOPs: 19.33 | 31: iteration 60140/ 173500 | consumed samples: 15395840 | consumed tokens: 31530680320 | elapsed time per iteration (s): 0.85 | learning rate: 1.534E-04 | global batch size: 256 | lm loss: 2.051247E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.267 | TFLOPs: 18.17 | 31: iteration 60150/ 173500 | consumed samples: 15398400 | consumed tokens: 31535923200 | elapsed time per iteration (s): 0.79 | learning rate: 
1.533E-04 | global batch size: 256 | lm loss: 2.046237E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.682 | TFLOPs: 19.64 | 31: iteration 60160/ 173500 | consumed samples: 15400960 | consumed tokens: 31541166080 | elapsed time per iteration (s): 0.82 | learning rate: 1.533E-04 | global batch size: 256 | lm loss: 2.049155E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.786 | TFLOPs: 18.98 | 31: iteration 60170/ 173500 | consumed samples: 15403520 | consumed tokens: 31546408960 | elapsed time per iteration (s): 0.84 | learning rate: 1.533E-04 | global batch size: 256 | lm loss: 2.047625E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.684 | TFLOPs: 18.49 | 31: iteration 60180/ 173500 | consumed samples: 15406080 | consumed tokens: 31551651840 | elapsed time per iteration (s): 0.90 | learning rate: 1.533E-04 | global batch size: 256 | lm loss: 2.040195E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 283.936 | TFLOPs: 17.18 | 31: iteration 60190/ 173500 | consumed samples: 15408640 | consumed tokens: 31556894720 | elapsed time per iteration (s): 0.85 | learning rate: 1.533E-04 | global batch size: 256 | lm loss: 2.055721E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.468 | TFLOPs: 18.24 | 31: iteration 60200/ 173500 | consumed samples: 15411200 | consumed tokens: 31562137600 | elapsed time per iteration (s): 0.87 | learning rate: 1.533E-04 | global batch size: 256 | lm loss: 2.074742E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.331 | TFLOPs: 17.81 | 31: iteration 60210/ 173500 | 
consumed samples: 15413760 | consumed tokens: 31567380480 | elapsed time per iteration (s): 0.81 | learning rate: 1.533E-04 | global batch size: 256 | lm loss: 2.021383E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.657 | TFLOPs: 19.10 | 31: iteration 60220/ 173500 | consumed samples: 15416320 | consumed tokens: 31572623360 | elapsed time per iteration (s): 0.83 | learning rate: 1.532E-04 | global batch size: 256 | lm loss: 2.070026E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.520 | TFLOPs: 18.60 | 31: iteration 60230/ 173500 | consumed samples: 15418880 | consumed tokens: 31577866240 | elapsed time per iteration (s): 0.91 | learning rate: 1.532E-04 | global batch size: 256 | lm loss: 2.032764E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 281.235 | TFLOPs: 17.01 | 31: iteration 60240/ 173500 | consumed samples: 15421440 | consumed tokens: 31583109120 | elapsed time per iteration (s): 0.87 | learning rate: 1.532E-04 | global batch size: 256 | lm loss: 2.026185E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.886 | TFLOPs: 17.90 | 31: iteration 60250/ 173500 | consumed samples: 15424000 | consumed tokens: 31588352000 | elapsed time per iteration (s): 0.84 | learning rate: 1.532E-04 | global batch size: 256 | lm loss: 2.029339E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.264 | TFLOPs: 18.35 | 31: iteration 60260/ 173500 | consumed samples: 15426560 | consumed tokens: 31593594880 | elapsed time per iteration (s): 0.89 | learning rate: 1.532E-04 | global batch size: 256 | lm loss: 2.032676E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 287.605 | TFLOPs: 17.40 | 31: iteration 60270/ 173500 | consumed samples: 15429120 | consumed tokens: 31598837760 | elapsed time per iteration (s): 0.89 | learning rate: 1.532E-04 | global batch size: 256 | lm loss: 2.071153E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.845 | TFLOPs: 17.47 | 31: iteration 60280/ 173500 | consumed samples: 15431680 | consumed tokens: 31604080640 | elapsed time per iteration (s): 0.90 | learning rate: 1.531E-04 | global batch size: 256 | lm loss: 2.018483E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.822 | TFLOPs: 17.23 | 31: iteration 60290/ 173500 | consumed samples: 15434240 | consumed tokens: 31609323520 | elapsed time per iteration (s): 0.78 | learning rate: 1.531E-04 | global batch size: 256 | lm loss: 2.023073E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.716 | TFLOPs: 19.77 | 31: iteration 60300/ 173500 | consumed samples: 15436800 | consumed tokens: 31614566400 | elapsed time per iteration (s): 0.81 | learning rate: 1.531E-04 | global batch size: 256 | lm loss: 2.053092E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.138 | TFLOPs: 19.19 | 31: iteration 60310/ 173500 | consumed samples: 15439360 | consumed tokens: 31619809280 | elapsed time per iteration (s): 0.85 | learning rate: 1.531E-04 | global batch size: 256 | lm loss: 2.017640E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.513 | TFLOPs: 18.18 | 31: iteration 60320/ 173500 | consumed samples: 15441920 | consumed tokens: 31625052160 | elapsed time per iteration (s): 0.80 | learning rate: 1.531E-04 | global batch 
size: 256 | lm loss: 2.034999E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.086 | TFLOPs: 19.30 | 31: iteration 60330/ 173500 | consumed samples: 15444480 | consumed tokens: 31630295040 | elapsed time per iteration (s): 0.80 | learning rate: 1.531E-04 | global batch size: 256 | lm loss: 2.063486E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.927 | TFLOPs: 19.29 | 31: iteration 60340/ 173500 | consumed samples: 15447040 | consumed tokens: 31635537920 | elapsed time per iteration (s): 0.81 | learning rate: 1.531E-04 | global batch size: 256 | lm loss: 2.037690E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.253 | TFLOPs: 19.19 | 31: iteration 60350/ 173500 | consumed samples: 15449600 | consumed tokens: 31640780800 | elapsed time per iteration (s): 0.80 | learning rate: 1.530E-04 | global batch size: 256 | lm loss: 2.041319E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.561 | TFLOPs: 19.33 | 31: iteration 60360/ 173500 | consumed samples: 15452160 | consumed tokens: 31646023680 | elapsed time per iteration (s): 0.86 | learning rate: 1.530E-04 | global batch size: 256 | lm loss: 2.036669E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.029 | TFLOPs: 18.09 | 31: iteration 60370/ 173500 | consumed samples: 15454720 | consumed tokens: 31651266560 | elapsed time per iteration (s): 0.83 | learning rate: 1.530E-04 | global batch size: 256 | lm loss: 2.046614E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.789 | TFLOPs: 18.68 | 31: iteration 60380/ 173500 | consumed samples: 15457280 | 
consumed tokens: 31656509440 | elapsed time per iteration (s): 0.81 | learning rate: 1.530E-04 | global batch size: 256 | lm loss: 2.021500E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.717 | TFLOPs: 19.04 | 31: iteration 60390/ 173500 | consumed samples: 15459840 | consumed tokens: 31661752320 | elapsed time per iteration (s): 0.83 | learning rate: 1.530E-04 | global batch size: 256 | lm loss: 2.078133E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.057 | TFLOPs: 18.64 | 31: iteration 60400/ 173500 | consumed samples: 15462400 | consumed tokens: 31666995200 | elapsed time per iteration (s): 0.90 | learning rate: 1.530E-04 | global batch size: 256 | lm loss: 2.041212E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 283.785 | TFLOPs: 17.17 | 31: iteration 60410/ 173500 | consumed samples: 15464960 | consumed tokens: 31672238080 | elapsed time per iteration (s): 0.80 | learning rate: 1.530E-04 | global batch size: 256 | lm loss: 2.052413E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.877 | TFLOPs: 19.29 | 31: iteration 60420/ 173500 | consumed samples: 15467520 | consumed tokens: 31677480960 | elapsed time per iteration (s): 0.86 | learning rate: 1.529E-04 | global batch size: 256 | lm loss: 2.032609E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.446 | TFLOPs: 17.93 | 31: iteration 60430/ 173500 | consumed samples: 15470080 | consumed tokens: 31682723840 | elapsed time per iteration (s): 0.82 | learning rate: 1.529E-04 | global batch size: 256 | lm loss: 2.020554E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 313.556 | TFLOPs: 18.97 | 31: iteration 60440/ 173500 | consumed samples: 15472640 | consumed tokens: 31687966720 | elapsed time per iteration (s): 0.81 | learning rate: 1.529E-04 | global batch size: 256 | lm loss: 2.041550E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.953 | TFLOPs: 19.05 | 31: iteration 60450/ 173500 | consumed samples: 15475200 | consumed tokens: 31693209600 | elapsed time per iteration (s): 0.78 | learning rate: 1.529E-04 | global batch size: 256 | lm loss: 2.039058E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.321 | TFLOPs: 19.92 | 31: iteration 60460/ 173500 | consumed samples: 15477760 | consumed tokens: 31698452480 | elapsed time per iteration (s): 0.79 | learning rate: 1.529E-04 | global batch size: 256 | lm loss: 2.061332E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.063 | TFLOPs: 19.67 | 31: iteration 60470/ 173500 | consumed samples: 15480320 | consumed tokens: 31703695360 | elapsed time per iteration (s): 0.84 | learning rate: 1.529E-04 | global batch size: 256 | lm loss: 2.056239E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.613 | TFLOPs: 18.43 | 31: iteration 60480/ 173500 | consumed samples: 15482880 | consumed tokens: 31708938240 | elapsed time per iteration (s): 0.79 | learning rate: 1.529E-04 | global batch size: 256 | lm loss: 2.048403E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.942 | TFLOPs: 19.54 | 31: iteration 60490/ 173500 | consumed samples: 15485440 | consumed tokens: 31714181120 | elapsed time per iteration (s): 0.82 | learning rate: 1.528E-04 | global batch size: 256 | lm loss: 
2.039277E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.330 | TFLOPs: 18.83 | 31: iteration 60500/ 173500 | consumed samples: 15488000 | consumed tokens: 31719424000 | elapsed time per iteration (s): 0.82 | learning rate: 1.528E-04 | global batch size: 256 | lm loss: 2.039823E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.896 | TFLOPs: 18.81 | 31: iteration 60510/ 173500 | consumed samples: 15490560 | consumed tokens: 31724666880 | elapsed time per iteration (s): 0.79 | learning rate: 1.528E-04 | global batch size: 256 | lm loss: 2.027113E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.159 | TFLOPs: 19.61 | 31: iteration 60520/ 173500 | consumed samples: 15493120 | consumed tokens: 31729909760 | elapsed time per iteration (s): 0.83 | learning rate: 1.528E-04 | global batch size: 256 | lm loss: 2.001434E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.045 | TFLOPs: 18.64 | 31: iteration 60530/ 173500 | consumed samples: 15495680 | consumed tokens: 31735152640 | elapsed time per iteration (s): 0.82 | learning rate: 1.528E-04 | global batch size: 256 | lm loss: 2.021636E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.330 | TFLOPs: 18.83 | 31: iteration 60540/ 173500 | consumed samples: 15498240 | consumed tokens: 31740395520 | elapsed time per iteration (s): 0.86 | learning rate: 1.528E-04 | global batch size: 256 | lm loss: 2.034312E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.708 | TFLOPs: 18.07 | 31: iteration 60550/ 173500 | consumed samples: 15500800 | consumed tokens: 
31745638400 | elapsed time per iteration (s): 0.81 | learning rate: 1.528E-04 | global batch size: 256 | lm loss: 2.077287E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.055 | TFLOPs: 19.06 | 31: iteration 60560/ 173500 | consumed samples: 15503360 | consumed tokens: 31750881280 | elapsed time per iteration (s): 0.83 | learning rate: 1.527E-04 | global batch size: 256 | lm loss: 2.026649E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.881 | TFLOPs: 18.57 | 31: iteration 60570/ 173500 | consumed samples: 15505920 | consumed tokens: 31756124160 | elapsed time per iteration (s): 0.79 | learning rate: 1.527E-04 | global batch size: 256 | lm loss: 2.052995E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.716 | TFLOPs: 19.58 | 31: iteration 60580/ 173500 | consumed samples: 15508480 | consumed tokens: 31761367040 | elapsed time per iteration (s): 0.82 | learning rate: 1.527E-04 | global batch size: 256 | lm loss: 2.030326E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.656 | TFLOPs: 18.85 | 31: iteration 60590/ 173500 | consumed samples: 15511040 | consumed tokens: 31766609920 | elapsed time per iteration (s): 0.78 | learning rate: 1.527E-04 | global batch size: 256 | lm loss: 2.034712E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.174 | TFLOPs: 19.79 | 31: iteration 60600/ 173500 | consumed samples: 15513600 | consumed tokens: 31771852800 | elapsed time per iteration (s): 0.79 | learning rate: 1.527E-04 | global batch size: 256 | lm loss: 2.032268E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 323.974 | TFLOPs: 19.60 | 31: iteration 60610/ 173500 | consumed samples: 15516160 | consumed tokens: 31777095680 | elapsed time per iteration (s): 0.85 | learning rate: 1.527E-04 | global batch size: 256 | lm loss: 2.070313E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.082 | TFLOPs: 18.28 | 31: iteration 60620/ 173500 | consumed samples: 15518720 | consumed tokens: 31782338560 | elapsed time per iteration (s): 0.81 | learning rate: 1.527E-04 | global batch size: 256 | lm loss: 2.032976E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.543 | TFLOPs: 19.21 | 31: iteration 60630/ 173500 | consumed samples: 15521280 | consumed tokens: 31787581440 | elapsed time per iteration (s): 0.81 | learning rate: 1.526E-04 | global batch size: 256 | lm loss: 2.051747E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.720 | TFLOPs: 19.10 | 31: iteration 60640/ 173500 | consumed samples: 15523840 | consumed tokens: 31792824320 | elapsed time per iteration (s): 0.79 | learning rate: 1.526E-04 | global batch size: 256 | lm loss: 2.034190E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.313 | TFLOPs: 19.56 | 31: iteration 60650/ 173500 | consumed samples: 15526400 | consumed tokens: 31798067200 | elapsed time per iteration (s): 0.81 | learning rate: 1.526E-04 | global batch size: 256 | lm loss: 2.039239E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.885 | TFLOPs: 19.23 | 31: iteration 60660/ 173500 | consumed samples: 15528960 | consumed tokens: 31803310080 | elapsed time per iteration (s): 0.79 | learning rate: 1.526E-04 | global batch size: 256 | lm loss: 2.029515E+00 | grad 
norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.232 | TFLOPs: 19.49 | 31: iteration 60670/ 173500 | consumed samples: 15531520 | consumed tokens: 31808552960 | elapsed time per iteration (s): 0.84 | learning rate: 1.526E-04 | global batch size: 256 | lm loss: 2.024479E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.744 | TFLOPs: 18.38 | 31: iteration 60680/ 173500 | consumed samples: 15534080 | consumed tokens: 31813795840 | elapsed time per iteration (s): 0.87 | learning rate: 1.526E-04 | global batch size: 256 | lm loss: 2.044381E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.243 | TFLOPs: 17.74 | 31: iteration 60690/ 173500 | consumed samples: 15536640 | consumed tokens: 31819038720 | elapsed time per iteration (s): 0.82 | learning rate: 1.526E-04 | global batch size: 256 | lm loss: 2.051076E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.078 | TFLOPs: 18.94 | 31: iteration 60700/ 173500 | consumed samples: 15539200 | consumed tokens: 31824281600 | elapsed time per iteration (s): 0.79 | learning rate: 1.525E-04 | global batch size: 256 | lm loss: 2.052517E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.866 | TFLOPs: 19.59 | 31: iteration 60710/ 173500 | consumed samples: 15541760 | consumed tokens: 31829524480 | elapsed time per iteration (s): 0.82 | learning rate: 1.525E-04 | global batch size: 256 | lm loss: 2.042863E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.261 | TFLOPs: 18.89 | 31: iteration 60720/ 173500 | consumed samples: 15544320 | consumed tokens: 31834767360 | elapsed time 
per iteration (s): 0.82 | learning rate: 1.525E-04 | global batch size: 256 | lm loss: 2.058145E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.708 | TFLOPs: 18.86 | 31: iteration 60730/ 173500 | consumed samples: 15546880 | consumed tokens: 31840010240 | elapsed time per iteration (s): 0.79 | learning rate: 1.525E-04 | global batch size: 256 | lm loss: 2.045752E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.326 | TFLOPs: 19.62 | 31: iteration 60740/ 173500 | consumed samples: 15549440 | consumed tokens: 31845253120 | elapsed time per iteration (s): 0.80 | learning rate: 1.525E-04 | global batch size: 256 | lm loss: 2.043298E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.774 | TFLOPs: 19.29 | 31: iteration 60750/ 173500 | consumed samples: 15552000 | consumed tokens: 31850496000 | elapsed time per iteration (s): 0.84 | learning rate: 1.525E-04 | global batch size: 256 | lm loss: 2.029957E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.484 | TFLOPs: 18.48 | 31: iteration 60760/ 173500 | consumed samples: 15554560 | consumed tokens: 31855738880 | elapsed time per iteration (s): 0.77 | learning rate: 1.525E-04 | global batch size: 256 | lm loss: 2.067514E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.995 | TFLOPs: 20.02 | 31: iteration 60770/ 173500 | consumed samples: 15557120 | consumed tokens: 31860981760 | elapsed time per iteration (s): 0.80 | learning rate: 1.524E-04 | global batch size: 256 | lm loss: 2.051228E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.987 | TFLOPs: 
19.48 | 31: iteration 60780/ 173500 | consumed samples: 15559680 | consumed tokens: 31866224640 | elapsed time per iteration (s): 0.79 | learning rate: 1.524E-04 | global batch size: 256 | lm loss: 2.028669E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.402 | TFLOPs: 19.69 | 31: iteration 60790/ 173500 | consumed samples: 15562240 | consumed tokens: 31871467520 | elapsed time per iteration (s): 0.81 | learning rate: 1.524E-04 | global batch size: 256 | lm loss: 2.029831E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.003 | TFLOPs: 19.06 | 31: iteration 60800/ 173500 | consumed samples: 15564800 | consumed tokens: 31876710400 | elapsed time per iteration (s): 0.78 | learning rate: 1.524E-04 | global batch size: 256 | lm loss: 2.024118E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.317 | TFLOPs: 19.80 | 31: iteration 60810/ 173500 | consumed samples: 15567360 | consumed tokens: 31881953280 | elapsed time per iteration (s): 0.81 | learning rate: 1.524E-04 | global batch size: 256 | lm loss: 2.040011E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.974 | TFLOPs: 19.06 | 31: iteration 60820/ 173500 | consumed samples: 15569920 | consumed tokens: 31887196160 | elapsed time per iteration (s): 0.80 | learning rate: 1.524E-04 | global batch size: 256 | lm loss: 2.042765E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.460 | TFLOPs: 19.39 | 31: iteration 60830/ 173500 | consumed samples: 15572480 | consumed tokens: 31892439040 | elapsed time per iteration (s): 0.83 | learning rate: 1.524E-04 | global batch size: 256 | lm loss: 2.033146E+00 | grad norm: 0.131 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.472 | TFLOPs: 18.72 | 31: iteration 60840/ 173500 | consumed samples: 15575040 | consumed tokens: 31897681920 | elapsed time per iteration (s): 0.84 | learning rate: 1.523E-04 | global batch size: 256 | lm loss: 2.027294E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.684 | TFLOPs: 18.43 | 31: iteration 60850/ 173500 | consumed samples: 15577600 | consumed tokens: 31902924800 | elapsed time per iteration (s): 0.83 | learning rate: 1.523E-04 | global batch size: 256 | lm loss: 2.044277E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.754 | TFLOPs: 18.56 | 31: iteration 60860/ 173500 | consumed samples: 15580160 | consumed tokens: 31908167680 | elapsed time per iteration (s): 0.84 | learning rate: 1.523E-04 | global batch size: 256 | lm loss: 2.010797E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.338 | TFLOPs: 18.47 | 31: iteration 60870/ 173500 | consumed samples: 15582720 | consumed tokens: 31913410560 | elapsed time per iteration (s): 0.85 | learning rate: 1.523E-04 | global batch size: 256 | lm loss: 2.063317E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.623 | TFLOPs: 18.25 | 31: iteration 60880/ 173500 | consumed samples: 15585280 | consumed tokens: 31918653440 | elapsed time per iteration (s): 0.81 | learning rate: 1.523E-04 | global batch size: 256 | lm loss: 2.026419E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.503 | TFLOPs: 19.21 | 31: iteration 60890/ 173500 | consumed samples: 15587840 | consumed tokens: 31923896320 | elapsed time per iteration (s): 0.80 | 
learning rate: 1.523E-04 | global batch size: 256 | lm loss: 2.062549E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.221 | TFLOPs: 19.37 | 31: iteration 60900/ 173500 | consumed samples: 15590400 | consumed tokens: 31929139200 | elapsed time per iteration (s): 0.83 | learning rate: 1.523E-04 | global batch size: 256 | lm loss: 2.057108E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.435 | TFLOPs: 18.60 | 31: iteration 60910/ 173500 | consumed samples: 15592960 | consumed tokens: 31934382080 | elapsed time per iteration (s): 0.84 | learning rate: 1.522E-04 | global batch size: 256 | lm loss: 2.053344E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.418 | TFLOPs: 18.36 | 31: iteration 60920/ 173500 | consumed samples: 15595520 | consumed tokens: 31939624960 | elapsed time per iteration (s): 0.80 | learning rate: 1.522E-04 | global batch size: 256 | lm loss: 2.048111E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.411 | TFLOPs: 19.26 | 31: iteration 60930/ 173500 | consumed samples: 15598080 | consumed tokens: 31944867840 | elapsed time per iteration (s): 0.82 | learning rate: 1.522E-04 | global batch size: 256 | lm loss: 2.029161E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.019 | TFLOPs: 18.88 | 31: iteration 60940/ 173500 | consumed samples: 15600640 | consumed tokens: 31950110720 | elapsed time per iteration (s): 0.83 | learning rate: 1.522E-04 | global batch size: 256 | lm loss: 2.041960E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.920 | TFLOPs: 18.57 | 31: iteration 60950/ 
173500 | consumed samples: 15603200 | consumed tokens: 31955353600 | elapsed time per iteration (s): 0.80 | learning rate: 1.522E-04 | global batch size: 256 | lm loss: 2.026212E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.062 | TFLOPs: 19.42 | 31: iteration 60960/ 173500 | consumed samples: 15605760 | consumed tokens: 31960596480 | elapsed time per iteration (s): 0.82 | learning rate: 1.522E-04 | global batch size: 256 | lm loss: 2.017417E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.926 | TFLOPs: 18.93 | 31: iteration 60970/ 173500 | consumed samples: 15608320 | consumed tokens: 31965839360 | elapsed time per iteration (s): 0.80 | learning rate: 1.521E-04 | global batch size: 256 | lm loss: 2.042590E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.902 | TFLOPs: 19.47 | 31: iteration 60980/ 173500 | consumed samples: 15610880 | consumed tokens: 31971082240 | elapsed time per iteration (s): 0.82 | learning rate: 1.521E-04 | global batch size: 256 | lm loss: 2.056065E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.965 | TFLOPs: 18.81 | 31: iteration 60990/ 173500 | consumed samples: 15613440 | consumed tokens: 31976325120 | elapsed time per iteration (s): 0.79 | learning rate: 1.521E-04 | global batch size: 256 | lm loss: 2.050613E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.864 | TFLOPs: 19.59 | 31: iteration 61000/ 173500 | consumed samples: 15616000 | consumed tokens: 31981568000 | elapsed time per iteration (s): 0.81 | learning rate: 1.521E-04 | global batch size: 256 | lm loss: 2.017365E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 317.473 | TFLOPs: 19.21 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 61000 | lm loss value: 2.012673E+00 | lm loss PPL: 7.483291E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 61000 to checkpoints_1b1long 0: [2022-11-26 07:47:04,883] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step61000 is begin to save! 0: [2022-11-26 07:47:04,896] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_01-model_00-model_states.pt... 0: [2022-11-26 07:47:05,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_01-model_00-model_states.pt. 0: [2022-11-26 07:47:05,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_03-model_00-model_states.pt... 0: [2022-11-26 07:47:05,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_03-model_00-model_states.pt. 0: [2022-11-26 07:47:05,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_04-model_00-model_states.pt... 0: [2022-11-26 07:47:05,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_04-model_00-model_states.pt. 0: [2022-11-26 07:47:05,287] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_05-model_00-model_states.pt... 0: [2022-11-26 07:47:05,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_05-model_00-model_states.pt. 
0: [2022-11-26 07:47:05,362] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_06-model_00-model_states.pt... 0: [2022-11-26 07:47:05,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_06-model_00-model_states.pt. 0: [2022-11-26 07:47:05,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_07-model_00-model_states.pt... 0: [2022-11-26 07:47:05,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_07-model_00-model_states.pt. 0: [2022-11-26 07:47:05,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_08-model_00-model_states.pt... 0: [2022-11-26 07:47:05,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_08-model_00-model_states.pt. 0: [2022-11-26 07:47:05,588] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_09-model_00-model_states.pt... 0: [2022-11-26 07:47:05,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_09-model_00-model_states.pt. 0: [2022-11-26 07:47:05,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_10-model_00-model_states.pt... 0: [2022-11-26 07:47:05,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_10-model_00-model_states.pt. 0: [2022-11-26 07:47:05,748] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_11-model_00-model_states.pt... 0: [2022-11-26 07:47:05,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_11-model_00-model_states.pt. 
0: [2022-11-26 07:47:05,822] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_12-model_00-model_states.pt... 0: [2022-11-26 07:47:05,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_12-model_00-model_states.pt. 0: [2022-11-26 07:47:05,898] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_13-model_00-model_states.pt... 0: [2022-11-26 07:47:05,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_13-model_00-model_states.pt. 0: [2022-11-26 07:47:05,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_14-model_00-model_states.pt... 0: [2022-11-26 07:47:06,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_14-model_00-model_states.pt. 0: [2022-11-26 07:47:06,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_15-model_00-model_states.pt... 0: [2022-11-26 07:47:06,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_15-model_00-model_states.pt. 0: [2022-11-26 07:47:06,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_16-model_00-model_states.pt... 0: [2022-11-26 07:47:06,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_16-model_00-model_states.pt. 0: [2022-11-26 07:47:06,214] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_17-model_00-model_states.pt... 0: [2022-11-26 07:47:06,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_17-model_00-model_states.pt. 
0: [2022-11-26 07:47:06,289] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_18-model_00-model_states.pt... 0: [2022-11-26 07:47:06,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_18-model_00-model_states.pt. 0: [2022-11-26 07:47:06,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_19-model_00-model_states.pt... 0: [2022-11-26 07:47:06,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_19-model_00-model_states.pt. 0: [2022-11-26 07:47:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_20-model_00-model_states.pt... 0: [2022-11-26 07:47:06,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_20-model_00-model_states.pt. 0: [2022-11-26 07:47:06,519] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_21-model_00-model_states.pt... 0: [2022-11-26 07:47:06,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_21-model_00-model_states.pt. 0: [2022-11-26 07:47:06,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_22-model_00-model_states.pt... 0: [2022-11-26 07:47:06,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_22-model_00-model_states.pt. 0: [2022-11-26 07:47:06,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_23-model_00-model_states.pt... 0: [2022-11-26 07:47:06,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_23-model_00-model_states.pt. 
0: [2022-11-26 07:47:06,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_24-model_00-model_states.pt... 0: [2022-11-26 07:47:06,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_24-model_00-model_states.pt. 0: [2022-11-26 07:47:06,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_25-model_00-model_states.pt... 0: [2022-11-26 07:47:07,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_25-model_00-model_states.pt. 0: [2022-11-26 07:47:07,057] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_26-model_00-model_states.pt... 0: [2022-11-26 07:47:07,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_26-model_00-model_states.pt. 0: [2022-11-26 07:47:07,199] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_27-model_00-model_states.pt... 0: [2022-11-26 07:47:07,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_27-model_00-model_states.pt. 0: [2022-11-26 07:47:07,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_28-model_00-model_states.pt... 0: [2022-11-26 07:47:07,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_28-model_00-model_states.pt. 0: [2022-11-26 07:47:07,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/layer_30-model_00-model_states.pt... 0: [2022-11-26 07:47:07,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/layer_30-model_00-model_states.pt. 
0: [2022-11-26 07:47:07,503] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step61000/mp_rank_00_model_states.pt 0: [2022-11-26 07:47:07,503] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/mp_rank_00_model_states.pt... 0: [2022-11-26 07:47:07,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/mp_rank_00_model_states.pt. 0: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
4: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 
10: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 
2: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
3: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 
20: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
11: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 
22: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 
18: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 
27: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 
5: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 
9: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
2: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
15: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
28: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 
22: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 
18: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 
0: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 
16: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 
11: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 
30: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 
8: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 28: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 
29: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 30: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 19: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 25: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 
29: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 17: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 18: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 18: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:47:07,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:47:07,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:47:07,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:47:07,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 07:47:07,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 23: [2022-11-26 07:47:07,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-26 07:47:07,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 07:47:07,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 6: [2022-11-26 07:47:07,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:47:07,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 07:47:07,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 28: [2022-11-26 07:47:07,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 07:47:07,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 07:47:07,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 8: [2022-11-26 07:47:07,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:47:07,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 07:47:07,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 3: [2022-11-26 07:47:07,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 07:47:07,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 4: [2022-11-26 07:47:07,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:47:07,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 4: [2022-11-26 07:47:07,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 07:47:07,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 16: [2022-11-26 07:47:07,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 07:47:07,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 07:47:07,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 19: [2022-11-26 07:47:07,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:47:07,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 07:47:07,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 25: [2022-11-26 07:47:07,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
25: [2022-11-26 07:47:07,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 07:47:07,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 20: [2022-11-26 07:47:07,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 07:47:07,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 07:47:07,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 6: [2022-11-26 07:47:07,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 24: [2022-11-26 07:47:07,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:47:07,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 07:47:07,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 24: [2022-11-26 07:47:07,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 07:47:07,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 9: [2022-11-26 07:47:07,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-26 07:47:07,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 24: [2022-11-26 07:47:07,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:47:07,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 24: [2022-11-26 07:47:07,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 1: [2022-11-26 07:47:07,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 24: [2022-11-26 07:47:07,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 1: [2022-11-26 07:47:07,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 07:47:07,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 15: [2022-11-26 07:47:07,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:47:07,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 07:47:07,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 27: [2022-11-26 07:47:07,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
27: [2022-11-26 07:47:07,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 07:47:07,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 2: [2022-11-26 07:47:07,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:47:07,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 07:47:07,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:47:07,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 2: [2022-11-26 07:47:07,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 07:47:07,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 17: [2022-11-26 07:47:07,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 07:47:07,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 07:47:07,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 5: [2022-11-26 07:47:07,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-26 07:47:07,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:47:07,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:47:07,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:47:07,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 07:47:07,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 3: [2022-11-26 07:47:07,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 11: [2022-11-26 07:47:07,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 5: [2022-11-26 07:47:07,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 3: [2022-11-26 07:47:07,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 11: [2022-11-26 07:47:07,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 5: [2022-11-26 07:47:07,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 10: [2022-11-26 07:47:07,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-26 07:47:07,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 07:47:07,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 4: [2022-11-26 07:47:07,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:47:07,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 07:47:07,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 20: [2022-11-26 07:47:07,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 07:47:07,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 07:47:07,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 16: [2022-11-26 07:47:07,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 07:47:07,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 10: [2022-11-26 07:47:07,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 16: [2022-11-26 07:47:07,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 
10: [2022-11-26 07:47:07,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 07:47:07,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 25: [2022-11-26 07:47:07,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 07:47:07,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 07:47:07,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 1: [2022-11-26 07:47:07,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:47:07,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 07:47:07,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 23: [2022-11-26 07:47:07,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 07:47:07,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 07:47:07,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 5: [2022-11-26 07:47:07,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
17: [2022-11-26 07:47:07,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:47:07,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:47:07,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 17: [2022-11-26 07:47:07,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 19: [2022-11-26 07:47:07,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 5: [2022-11-26 07:47:07,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 17: [2022-11-26 07:47:07,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 19: [2022-11-26 07:47:07,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 17: [2022-11-26 07:47:07,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 07:47:07,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 07:47:07,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 9: [2022-11-26 07:47:07,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 07:47:07,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 07:47:07,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 9: [2022-11-26 07:47:07,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:47:07,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 07:47:07,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 19: [2022-11-26 07:47:07,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:47:07,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 07:47:07,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 11: [2022-11-26 07:47:07,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:47:07,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 07:47:07,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 23: [2022-11-26 07:47:07,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-26 07:47:07,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 07:47:07,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 1: [2022-11-26 07:47:07,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:47:07,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 07:47:07,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 2: [2022-11-26 07:47:07,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 28: [2022-11-26 07:47:07,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:47:07,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
2: [2022-11-26 07:47:07,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 28: [2022-11-26 07:47:07,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 27: [2022-11-26 07:47:07,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 2: [2022-11-26 07:47:07,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 28: [2022-11-26 07:47:07,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 27: [2022-11-26 07:47:07,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 20: [2022-11-26 07:47:07,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 07:47:07,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 07:47:07,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 8: [2022-11-26 07:47:07,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:47:07,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 07:47:07,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 07:47:07,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 8: [2022-11-26 07:47:07,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 07:47:07,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 5: [2022-11-26 07:47:07,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:47:07,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 07:47:07,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 15: [2022-11-26 07:47:07,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:47:07,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 07:47:07,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 14: [2022-11-26 07:47:07,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:47:07,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 0: [2022-11-26 07:47:07,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-26 07:47:07,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:47:07,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 0: [2022-11-26 07:47:07,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 07:47:07,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 07:47:07,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 0: [2022-11-26 07:47:07,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 21: [2022-11-26 07:47:07,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:47:07,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:47:07,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 07:47:07,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 07:47:07,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 3: [2022-11-26 07:47:07,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
21: [2022-11-26 07:47:07,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 3: [2022-11-26 07:47:07,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 07:47:07,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 2: [2022-11-26 07:47:07,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:47:07,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 07:47:07,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 25: [2022-11-26 07:47:07,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:47:07,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 25: [2022-11-26 07:47:07,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 07:47:07,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 3: [2022-11-26 07:47:07,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 07:47:07,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 
6: [2022-11-26 07:47:07,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:47:07,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 07:47:07,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 4: [2022-11-26 07:47:07,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:47:07,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 13: [2022-11-26 07:47:07,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:47:07,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 13: [2022-11-26 07:47:07,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 07:47:07,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 16: [2022-11-26 07:47:07,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 07:47:07,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 07:47:07,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 
28: [2022-11-26 07:47:07,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:47:07,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:47:07,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 28: [2022-11-26 07:47:07,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 29: [2022-11-26 07:47:07,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 07:47:07,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 07:47:07,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 07:47:07,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:47:07,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 11: [2022-11-26 07:47:07,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 28: [2022-11-26 07:47:07,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 
29: [2022-11-26 07:47:07,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 07:47:07,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 15: [2022-11-26 07:47:07,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:47:07,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 29: [2022-11-26 07:47:07,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 07:47:07,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 15: [2022-11-26 07:47:07,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 11: [2022-11-26 07:47:07,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 29: [2022-11-26 07:47:07,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 29: [2022-11-26 07:47:07,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 29: [2022-11-26 07:47:07,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 29: [2022-11-26 07:47:07,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 
15: [2022-11-26 07:47:07,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 24: [2022-11-26 07:47:07,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 07:47:07,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 07:47:07,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 1: [2022-11-26 07:47:07,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:47:07,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:47:07,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 07:47:07,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 7: [2022-11-26 07:47:07,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 07:47:07,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 0: [2022-11-26 07:47:07,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 07:47:07,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 07:47:07,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 27: [2022-11-26 07:47:07,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:47:07,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 07:47:07,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 22: [2022-11-26 07:47:07,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 07:47:07,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 07:47:07,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 22: [2022-11-26 07:47:07,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 07:47:07,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 07:47:07,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 6: [2022-11-26 07:47:07,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 07:47:07,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 07:47:07,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 14: [2022-11-26 07:47:07,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:47:07,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 07:47:07,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 27: [2022-11-26 07:47:07,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:47:07,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 07:47:07,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 10: [2022-11-26 07:47:07,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:47:07,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 07:47:07,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 26: [2022-11-26 07:47:07,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
26: [2022-11-26 07:47:07,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 07:47:07,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 07:47:07,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 07:47:07,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 20: [2022-11-26 07:47:07,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 26: [2022-11-26 07:47:07,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 26: [2022-11-26 07:47:07,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 7: [2022-11-26 07:47:07,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 20: [2022-11-26 07:47:07,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 26: [2022-11-26 07:47:07,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 07:47:07,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 
7: [2022-11-26 07:47:07,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 20: [2022-11-26 07:47:07,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 7: [2022-11-26 07:47:07,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 24: [2022-11-26 07:47:07,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 07:47:07,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 07:47:07,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 9: [2022-11-26 07:47:07,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:47:07,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 07:47:07,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 23: [2022-11-26 07:47:07,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:47:07,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
23: [2022-11-26 07:47:07,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 14: [2022-11-26 07:47:07,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 23: [2022-11-26 07:47:07,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 14: [2022-11-26 07:47:07,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 17: [2022-11-26 07:47:07,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 07:47:07,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 07:47:07,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 4: [2022-11-26 07:47:07,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:47:07,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 07:47:07,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 10: [2022-11-26 07:47:07,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-26 07:47:07,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 07:47:07,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 7: [2022-11-26 07:47:07,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:47:07,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 07:47:07,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 30: [2022-11-26 07:47:07,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:47:07,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:47:07,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:47:07,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-26 07:47:07,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 07:47:07,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 07:47:07,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 07:47:07,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 07:47:07,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 30: [2022-11-26 07:47:07,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 30: [2022-11-26 07:47:07,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 30: [2022-11-26 07:47:07,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 8: [2022-11-26 07:47:07,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:47:07,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 07:47:07,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 11: [2022-11-26 07:47:07,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-26 07:47:07,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 07:47:07,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 13: [2022-11-26 07:47:07,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:47:07,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 07:47:07,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 28: [2022-11-26 07:47:07,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 07:47:07,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 07:47:07,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 16: [2022-11-26 07:47:07,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 07:47:07,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 07:47:07,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 15: [2022-11-26 07:47:07,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-26 07:47:07,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 25: [2022-11-26 07:47:07,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:47:07,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 25: [2022-11-26 07:47:07,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 07:47:07,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 19: [2022-11-26 07:47:07,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:47:07,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 07:47:07,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 0: [2022-11-26 07:47:07,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 07:47:07,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 18: [2022-11-26 07:47:07,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:47:07,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
18: [2022-11-26 07:47:07,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 12: [2022-11-26 07:47:07,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:47:07,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:47:07,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:47:07,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 07:47:07,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 12: [2022-11-26 07:47:07,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:47:07,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 
12: [2022-11-26 07:47:07,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 07:47:07,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 07:47:07,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 07:47:07,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 12: [2022-11-26 07:47:07,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 12: [2022-11-26 07:47:07,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 12: [2022-11-26 07:47:07,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 07:47:07,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 18: [2022-11-26 07:47:07,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:47:07,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 22: [2022-11-26 07:47:07,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:47:07,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 
22: [2022-11-26 07:47:07,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 07:47:07,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 18: [2022-11-26 07:47:07,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:47:07,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 07:47:07,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 5: [2022-11-26 07:47:07,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:47:07,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 07:47:07,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 31: [2022-11-26 07:47:07,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 07:47:07,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 07:47:07,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 07:47:07,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 
21: [2022-11-26 07:47:07,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 31: [2022-11-26 07:47:07,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 21: [2022-11-26 07:47:07,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 31: [2022-11-26 07:47:07,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 07:47:07,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 21: [2022-11-26 07:47:07,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 31: [2022-11-26 07:47:07,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 07:47:07,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 07:47:07,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 31: [2022-11-26 07:47:07,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 31: [2022-11-26 07:47:07,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 6: [2022-11-26 07:47:07,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-26 07:47:07,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 07:47:07,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 20: [2022-11-26 07:47:07,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 07:47:07,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 07:47:07,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 1: [2022-11-26 07:47:07,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:47:07,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 07:47:07,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 26: [2022-11-26 07:47:07,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 07:47:07,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 07:47:07,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 4: [2022-11-26 07:47:07,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 07:47:07,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 07:47:07,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 12: [2022-11-26 07:47:07,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:47:07,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 07:47:07,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 29: [2022-11-26 07:47:07,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 17: [2022-11-26 07:47:07,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 29: [2022-11-26 07:47:07,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 07:47:07,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 17: [2022-11-26 07:47:07,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 07:47:07,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 23: [2022-11-26 07:47:07,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-26 07:47:07,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 07:47:07,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 24: [2022-11-26 07:47:07,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 07:47:07,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 07:47:07,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 27: [2022-11-26 07:47:07,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:47:07,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 0: [2022-11-26 07:47:07,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:47:07,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 0: [2022-11-26 07:47:07,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 07:47:07,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 10: [2022-11-26 07:47:07,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-26 07:47:07,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 07:47:07,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 30: [2022-11-26 07:47:07,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:47:07,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 07:47:07,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 9: [2022-11-26 07:47:07,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:47:07,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:47:07,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 2: [2022-11-26 07:47:07,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 07:47:07,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 9: [2022-11-26 07:47:07,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 31: [2022-11-26 07:47:07,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-26 07:47:07,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 07:47:07,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 3: [2022-11-26 07:47:07,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:47:07,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 07:47:07,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 22: [2022-11-26 07:47:07,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 07:47:07,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 07:47:07,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 16: [2022-11-26 07:47:07,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 07:47:07,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 07:47:07,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 14: [2022-11-26 07:47:07,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 07:47:07,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 07:47:07,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 5: [2022-11-26 07:47:07,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:47:07,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 07:47:07,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 8: [2022-11-26 07:47:07,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:47:07,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 07:47:07,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 7: [2022-11-26 07:47:07,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:47:07,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 07:47:07,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 15: [2022-11-26 07:47:07,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 07:47:07,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 07:47:07,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 21: [2022-11-26 07:47:07,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:47:07,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 07:47:07,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 25: [2022-11-26 07:47:07,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:47:07,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 25: [2022-11-26 07:47:07,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 13: [2022-11-26 07:47:07,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 07:47:07,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 25: [2022-11-26 07:47:07,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 26: [2022-11-26 07:47:07,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
26: [2022-11-26 07:47:07,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 07:47:07,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 18: [2022-11-26 07:47:07,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:47:07,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 07:47:07,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 11: [2022-11-26 07:47:07,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:47:07,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 07:47:07,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 6: [2022-11-26 07:47:07,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:47:07,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 07:47:07,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 19: [2022-11-26 07:47:07,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-26 07:47:07,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 07:47:07,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 29: [2022-11-26 07:47:07,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 07:47:07,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 4: [2022-11-26 07:47:07,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 29: [2022-11-26 07:47:07,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 4: [2022-11-26 07:47:07,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 07:47:07,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 12: [2022-11-26 07:47:07,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:47:07,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 07:47:07,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 23: [2022-11-26 07:47:07,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-26 07:47:07,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 07:47:07,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 28: [2022-11-26 07:47:07,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 07:47:07,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 1: [2022-11-26 07:47:07,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:47:07,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 07:47:07,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 17: [2022-11-26 07:47:07,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 07:47:07,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 07:47:07,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 24: [2022-11-26 07:47:07,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-26 07:47:07,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 07:47:07,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 28: [2022-11-26 07:47:07,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 9: [2022-11-26 07:47:07,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:47:07,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 07:47:07,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 27: [2022-11-26 07:47:07,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:47:07,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 07:47:07,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 14: [2022-11-26 07:47:07,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:47:07,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 07:47:07,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 
20: [2022-11-26 07:47:07,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 07:47:07,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 07:47:07,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 2: [2022-11-26 07:47:07,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:47:07,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 07:47:07,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 30: [2022-11-26 07:47:07,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:47:07,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 07:47:07,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 0: [2022-11-26 07:47:07,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:47:07,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
0: [2022-11-26 07:47:07,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 07:47:07,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 10: [2022-11-26 07:47:07,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 07:47:07,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 31: [2022-11-26 07:47:07,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 07:47:07,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 07:47:07,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 3: [2022-11-26 07:47:07,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:47:07,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 07:47:07,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 8: [2022-11-26 07:47:07,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:47:07,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
8: [2022-11-26 07:47:07,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 7: [2022-11-26 07:47:07,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 8: [2022-11-26 07:47:07,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 7: [2022-11-26 07:47:07,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 21: [2022-11-26 07:47:07,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:47:07,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 07:47:07,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 11: [2022-11-26 07:47:07,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:47:07,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 07:47:07,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 15: [2022-11-26 07:47:07,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-26 07:47:07,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 07:47:07,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 22: [2022-11-26 07:47:07,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 07:47:07,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 07:47:07,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 25: [2022-11-26 07:47:07,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 07:47:07,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 07:47:07,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 16: [2022-11-26 07:47:07,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:47:07,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-26 07:47:07,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 16: [2022-11-26 07:47:07,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 07:47:07,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 6: [2022-11-26 07:47:07,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:47:07,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 6: [2022-11-26 07:47:07,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 07:47:07,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 5: [2022-11-26 07:47:07,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:47:07,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 07:47:07,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 13: [2022-11-26 07:47:07,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-26 07:47:07,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 07:47:07,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 1: [2022-11-26 07:47:07,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:47:07,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 28: [2022-11-26 07:47:07,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:47:07,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 28: [2022-11-26 07:47:07,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 07:47:07,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 18: [2022-11-26 07:47:07,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:47:07,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 07:47:07,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 26: [2022-11-26 07:47:07,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 
26: [2022-11-26 07:47:07,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 07:47:07,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 12: [2022-11-26 07:47:07,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:47:07,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 07:47:07,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 4: [2022-11-26 07:47:07,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:47:07,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 07:47:07,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 23: [2022-11-26 07:47:07,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 07:47:07,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 07:47:07,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 29: [2022-11-26 07:47:07,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
29: [2022-11-26 07:47:07,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 07:47:07,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 17: [2022-11-26 07:47:07,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 20: [2022-11-26 07:47:07,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 17: [2022-11-26 07:47:07,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 07:47:07,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 20: [2022-11-26 07:47:07,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 07:47:07,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 3: [2022-11-26 07:47:07,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:47:07,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 07:47:07,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 24: [2022-11-26 07:47:07,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
24: [2022-11-26 07:47:07,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 07:47:07,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 0: [2022-11-26 07:47:07,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:47:07,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 07:47:07,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 9: [2022-11-26 07:47:07,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:47:07,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 07:47:07,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 14: [2022-11-26 07:47:07,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:47:07,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 07:47:07,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 2: [2022-11-26 07:47:07,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 07:47:07,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 07:47:07,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 30: [2022-11-26 07:47:07,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:47:07,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 07:47:07,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 31: [2022-11-26 07:47:07,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 07:47:07,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 07:47:07,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 8: [2022-11-26 07:47:07,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:47:07,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:47:07,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 07:47:07,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 
10: [2022-11-26 07:47:07,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 07:47:07,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 5: [2022-11-26 07:47:07,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:47:07,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 07:47:07,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 28: [2022-11-26 07:47:07,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 07:47:07,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 07:47:07,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 25: [2022-11-26 07:47:07,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:47:07,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 25: [2022-11-26 07:47:07,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 07:47:07,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 
11: [2022-11-26 07:47:07,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 07:47:07,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 7: [2022-11-26 07:47:07,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:47:07,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 07:47:07,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 15: [2022-11-26 07:47:07,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:47:07,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 07:47:07,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 6: [2022-11-26 07:47:07,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:47:07,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 07:47:07,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 16: [2022-11-26 07:47:07,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-26 07:47:07,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 07:47:07,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 16: [2022-11-26 07:47:07,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:47:07,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 16: [2022-11-26 07:47:07,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 07:47:07,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 19: [2022-11-26 07:47:07,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 07:47:07,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 23: [2022-11-26 07:47:07,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 07:47:07,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 07:47:07,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 27: [2022-11-26 07:47:07,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
22: [2022-11-26 07:47:07,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:47:07,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 07:47:07,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 22: [2022-11-26 07:47:07,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 07:47:07,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 20: [2022-11-26 07:47:07,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:47:07,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 20: [2022-11-26 07:47:07,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 18: [2022-11-26 07:47:07,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 20: [2022-11-26 07:47:07,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 18: [2022-11-26 07:47:07,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 14: [2022-11-26 07:47:07,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
30: [2022-11-26 07:47:07,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 07:47:07,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 14: [2022-11-26 07:47:07,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 30: [2022-11-26 07:47:07,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 14: [2022-11-26 07:47:07,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 21: [2022-11-26 07:47:07,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:47:07,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 07:47:07,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 1: [2022-11-26 07:47:07,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:47:07,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 25: [2022-11-26 07:47:07,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:47:07,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 
4: [2022-11-26 07:47:07,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 25: [2022-11-26 07:47:07,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 4: [2022-11-26 07:47:07,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 25: [2022-11-26 07:47:07,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 4: [2022-11-26 07:47:07,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 29: [2022-11-26 07:47:07,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 07:47:07,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 07:47:07,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 0: [2022-11-26 07:47:07,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:47:07,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 07:47:07,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 26: [2022-11-26 07:47:07,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
26: [2022-11-26 07:47:07,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 07:47:07,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 17: [2022-11-26 07:47:07,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 07:47:07,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 07:47:07,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 13: [2022-11-26 07:47:07,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:47:07,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 07:47:07,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 3: [2022-11-26 07:47:07,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 28: [2022-11-26 07:47:07,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:47:07,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 07:47:07,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 
28: [2022-11-26 07:47:07,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 07:47:07,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 11: [2022-11-26 07:47:07,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:47:07,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:47:07,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 07:47:07,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 12: [2022-11-26 07:47:07,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 07:47:07,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 2: [2022-11-26 07:47:07,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:47:07,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 24: [2022-11-26 07:47:07,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:47:07,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
2: [2022-11-26 07:47:07,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 24: [2022-11-26 07:47:07,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 10: [2022-11-26 07:47:07,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 24: [2022-11-26 07:47:07,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 10: [2022-11-26 07:47:07,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 19: [2022-11-26 07:47:07,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 07:47:07,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 07:47:07,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 18: [2022-11-26 07:47:07,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:47:07,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 18: [2022-11-26 07:47:07,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 07:47:07,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 
8: [2022-11-26 07:47:07,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 07:47:07,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 9: [2022-11-26 07:47:07,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:47:07,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 07:47:07,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 27: [2022-11-26 07:47:07,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 07:47:07,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 07:47:07,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 15: [2022-11-26 07:47:07,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:47:07,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 07:47:07,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 26: [2022-11-26 07:47:07,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
26: [2022-11-26 07:47:07,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 07:47:07,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 21: [2022-11-26 07:47:07,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 07:47:07,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 07:47:07,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 31: [2022-11-26 07:47:07,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 07:47:07,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 07:47:07,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 7: [2022-11-26 07:47:07,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:47:07,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 07:47:07,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 21: [2022-11-26 07:47:07,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-26 07:47:07,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 07:47:07,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 7: [2022-11-26 07:47:07,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:47:07,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 07:47:07,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 22: [2022-11-26 07:47:07,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 07:47:07,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 07:47:07,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 22: [2022-11-26 07:47:07,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 07:47:07,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 07:47:07,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 13: [2022-11-26 07:47:07,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 07:47:07,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 07:47:07,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 13: [2022-11-26 07:47:07,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:47:07,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step61000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 07:47:07,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 0: successfully saved checkpoint at iteration 61000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2997.29 31: iteration 61010/ 173500 | consumed samples: 15618560 | consumed tokens: 31986810880 | elapsed time per iteration (s): 1.09 | learning rate: 1.521E-04 | global batch size: 256 | lm loss: 2.017247E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.618 | TFLOPs: 14.19 | 31: iteration 61020/ 173500 | consumed samples: 15621120 | consumed tokens: 31992053760 | elapsed time per iteration (s): 0.80 | learning rate: 1.521E-04 | global batch size: 256 | lm loss: 2.038409E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.736 | TFLOPs: 19.46 | 31: iteration 61030/ 173500 | consumed samples: 15623680 | consumed tokens: 31997296640 | elapsed time per iteration (s): 0.86 | learning rate: 1.521E-04 | global batch size: 256 | lm loss: 2.060498E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.074 | TFLOPs: 18.03 | 31: iteration 61040/ 
173500 | consumed samples: 15626240 | consumed tokens: 32002539520 | elapsed time per iteration (s): 0.72 | learning rate: 1.520E-04 | global batch size: 256 | lm loss: 2.029474E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.773 | TFLOPs: 21.64 |
31: iteration 61050/ 173500 | consumed samples: 15628800 | consumed tokens: 32007782400 | elapsed time per iteration (s): 0.82 | learning rate: 1.520E-04 | global batch size: 256 | lm loss: 2.040029E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.153 | TFLOPs: 18.94 |
31: iteration 61060/ 173500 | consumed samples: 15631360 | consumed tokens: 32013025280 | elapsed time per iteration (s): 0.74 | learning rate: 1.520E-04 | global batch size: 256 | lm loss: 2.036761E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.189 | TFLOPs: 21.06 |
31: iteration 61070/ 173500 | consumed samples: 15633920 | consumed tokens: 32018268160 | elapsed time per iteration (s): 0.77 | learning rate: 1.520E-04 | global batch size: 256 | lm loss: 2.024949E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.682 | TFLOPs: 20.19 |
31: iteration 61080/ 173500 | consumed samples: 15636480 | consumed tokens: 32023511040 | elapsed time per iteration (s): 0.76 | learning rate: 1.520E-04 | global batch size: 256 | lm loss: 2.060720E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.215 | TFLOPs: 20.40 |
31: iteration 61090/ 173500 | consumed samples: 15639040 | consumed tokens: 32028753920 | elapsed time per iteration (s): 0.77 | learning rate: 1.520E-04 | global batch size: 256 | lm loss: 2.045576E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.633 | TFLOPs: 20.24 |
31: iteration 61100/ 173500 | consumed samples: 15641600 | consumed tokens: 32033996800 | elapsed time per iteration (s): 0.77 | learning rate: 1.520E-04 | global batch size: 256 | lm loss: 2.059499E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.909 | TFLOPs: 20.08 |
31: iteration 61110/ 173500 | consumed samples: 15644160 | consumed tokens: 32039239680 | elapsed time per iteration (s): 0.78 | learning rate: 1.519E-04 | global batch size: 256 | lm loss: 2.035153E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.776 | TFLOPs: 19.95 |
31: iteration 61120/ 173500 | consumed samples: 15646720 | consumed tokens: 32044482560 | elapsed time per iteration (s): 0.75 | learning rate: 1.519E-04 | global batch size: 256 | lm loss: 2.016329E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.833 | TFLOPs: 20.62 |
31: iteration 61130/ 173500 | consumed samples: 15649280 | consumed tokens: 32049725440 | elapsed time per iteration (s): 0.76 | learning rate: 1.519E-04 | global batch size: 256 | lm loss: 2.025009E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.187 | TFLOPs: 20.34 |
31: iteration 61140/ 173500 | consumed samples: 15651840 | consumed tokens: 32054968320 | elapsed time per iteration (s): 0.76 | learning rate: 1.519E-04 | global batch size: 256 | lm loss: 2.015707E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.936 | TFLOPs: 20.44 |
31: iteration 61150/ 173500 | consumed samples: 15654400 | consumed tokens: 32060211200 | elapsed time per iteration (s): 0.75 | learning rate: 1.519E-04 | global batch size: 256 | lm loss: 2.045631E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.209 | TFLOPs: 20.58 |
31: iteration 61160/ 173500 | consumed samples: 15656960 | consumed tokens: 32065454080 | elapsed time per iteration (s): 0.76 | learning rate: 1.519E-04 | global batch size: 256 | lm loss: 2.034621E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.615 | TFLOPs: 20.49 |
31: iteration 61170/ 173500 | consumed samples: 15659520 | consumed tokens: 32070696960 | elapsed time per iteration (s): 0.75 | learning rate: 1.519E-04 | global batch size: 256 | lm loss: 2.051518E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.605 | TFLOPs: 20.79 |
31: iteration 61180/ 173500 | consumed samples: 15662080 | consumed tokens: 32075939840 | elapsed time per iteration (s): 0.77 | learning rate: 1.518E-04 | global batch size: 256 | lm loss: 2.054292E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.702 | TFLOPs: 20.13 |
31: iteration 61190/ 173500 | consumed samples: 15664640 | consumed tokens: 32081182720 | elapsed time per iteration (s): 0.80 | learning rate: 1.518E-04 | global batch size: 256 | lm loss: 2.048952E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.016 | TFLOPs: 19.24 |
31: iteration 61200/ 173500 | consumed samples: 15667200 | consumed tokens: 32086425600 | elapsed time per iteration (s): 0.77 | learning rate: 1.518E-04 | global batch size: 256 | lm loss: 2.048180E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.480 | TFLOPs: 20.17 |
31: iteration 61210/ 173500 | consumed samples: 15669760 | consumed tokens: 32091668480 | elapsed time per iteration (s): 0.81 | learning rate: 1.518E-04 | global batch size: 256 | lm loss: 2.056238E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.168 | TFLOPs: 19.01 |
31: iteration 61220/ 173500 | consumed samples: 15672320 | consumed tokens: 32096911360 | elapsed time per iteration (s): 0.78 | learning rate: 1.518E-04 | global batch size: 256 | lm loss: 2.045698E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.761 | TFLOPs: 19.77 |
31: iteration 61230/ 173500 | consumed samples: 15674880 | consumed tokens: 32102154240 | elapsed time per iteration (s): 0.84 | learning rate: 1.518E-04 | global batch size: 256 | lm loss: 2.037567E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.611 | TFLOPs: 18.43 |
31: iteration 61240/ 173500 | consumed samples: 15677440 | consumed tokens: 32107397120 | elapsed time per iteration (s): 0.76 | learning rate: 1.518E-04 | global batch size: 256 | lm loss: 2.039232E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.544 | TFLOPs: 20.30 |
31: iteration 61250/ 173500 | consumed samples: 15680000 | consumed tokens: 32112640000 | elapsed time per iteration (s): 0.82 | learning rate: 1.517E-04 | global batch size: 256 | lm loss: 2.035250E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.509 | TFLOPs: 18.78 |
31: iteration 61260/ 173500 | consumed samples: 15682560 | consumed tokens: 32117882880 | elapsed time per iteration (s): 0.83 | learning rate: 1.517E-04 | global batch size: 256 | lm loss: 2.039710E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.829 | TFLOPs: 18.56 |
31: iteration 61270/ 173500 | consumed samples: 15685120 | consumed tokens: 32123125760 | elapsed time per iteration (s): 0.76 | learning rate: 1.517E-04 | global batch size: 256 | lm loss: 2.008092E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.199 | TFLOPs: 20.46 |
31: iteration 61280/ 173500 | consumed samples: 15687680 | consumed tokens: 32128368640 | elapsed time per iteration (s): 0.79 | learning rate: 1.517E-04 | global batch size: 256 | lm loss: 2.045139E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.659 | TFLOPs: 19.64 |
31: iteration 61290/ 173500 | consumed samples: 15690240 | consumed tokens: 32133611520 | elapsed time per iteration (s): 0.86 | learning rate: 1.517E-04 | global batch size: 256 | lm loss: 2.033599E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.830 | TFLOPs: 17.96 |
31: iteration 61300/ 173500 | consumed samples: 15692800 | consumed tokens: 32138854400 | elapsed time per iteration (s): 0.78 | learning rate: 1.517E-04 | global batch size: 256 | lm loss: 2.025835E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.151 | TFLOPs: 19.73 |
31: iteration 61310/ 173500 | consumed samples: 15695360 | consumed tokens: 32144097280 | elapsed time per iteration (s): 0.80 | learning rate: 1.517E-04 | global batch size: 256 | lm loss: 2.044095E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.434 | TFLOPs: 19.32 |
31: iteration 61320/ 173500 | consumed samples: 15697920 | consumed tokens: 32149340160 | elapsed time per iteration (s): 0.77 | learning rate: 1.516E-04 | global batch size: 256 | lm loss: 2.046617E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.228 | TFLOPs: 20.10 |
31: iteration 61330/ 173500 | consumed samples: 15700480 | consumed tokens: 32154583040 | elapsed time per iteration (s): 0.77 | learning rate: 1.516E-04 | global batch size: 256 | lm loss: 2.060884E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.443 | TFLOPs: 20.17 |
31: iteration 61340/ 173500 | consumed samples: 15703040 | consumed tokens: 32159825920 | elapsed time per iteration (s): 0.86 | learning rate: 1.516E-04 | global batch size: 256 | lm loss: 2.022966E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.516 | TFLOPs: 18.06 |
31: iteration 61350/ 173500 | consumed samples: 15705600 | consumed tokens: 32165068800 | elapsed time per iteration (s): 0.82 | learning rate: 1.516E-04 | global batch size: 256 | lm loss: 2.038760E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.296 | TFLOPs: 18.95 |
31: iteration 61360/ 173500 | consumed samples: 15708160 | consumed tokens: 32170311680 | elapsed time per iteration (s): 1.38 | learning rate: 1.516E-04 | global batch size: 256 | lm loss: 2.032911E+00 | grad norm: 0.318 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 185.206 | TFLOPs: 11.20 |
31: iteration 61370/ 173500 | consumed samples: 15710720 | consumed tokens: 32175554560 | elapsed time per iteration (s): 0.75 | learning rate: 1.516E-04 | global batch size: 256 | lm loss: 2.066388E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.560 | TFLOPs: 20.54 |
31: iteration 61380/ 173500 | consumed samples: 15713280 | consumed tokens: 32180797440 | elapsed time per iteration (s): 0.76 | learning rate: 1.516E-04 | global batch size: 256 | lm loss: 2.038749E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.109 | TFLOPs: 20.39 |
31: iteration 61390/ 173500 | consumed samples: 15715840 | consumed tokens: 32186040320 | elapsed time per iteration (s): 0.75 | learning rate: 1.515E-04 | global batch size: 256 | lm loss: 2.065116E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.257 | TFLOPs: 20.52 |
31: iteration 61400/ 173500 | consumed samples: 15718400 | consumed tokens: 32191283200 | elapsed time per iteration (s): 0.80 | learning rate: 1.515E-04 | global batch size: 256 | lm loss: 2.037247E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.107 | TFLOPs: 19.37 |
31: iteration 61410/ 173500 | consumed samples: 15720960 | consumed tokens: 32196526080 | elapsed time per iteration (s): 0.74 | learning rate: 1.515E-04 | global batch size: 256 | lm loss: 2.038834E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.220 | TFLOPs: 21.01 |
31: iteration 61420/ 173500 | consumed samples: 15723520 | consumed tokens: 32201768960 | elapsed time per iteration (s): 0.73 | learning rate: 1.515E-04 | global batch size: 256 | lm loss: 2.048666E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.418 | TFLOPs: 21.26 |
31: iteration 61430/ 173500 | consumed samples: 15726080 | consumed tokens: 32207011840 | elapsed time per iteration (s): 0.77 | learning rate: 1.515E-04 | global batch size: 256 | lm loss: 2.048519E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.354 | TFLOPs: 20.17 |
31: iteration 61440/ 173500 | consumed samples: 15728640 | consumed tokens: 32212254720 | elapsed time per iteration (s): 0.75 | learning rate: 1.515E-04 | global batch size: 256 | lm loss: 2.028052E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.222 | TFLOPs: 20.58 |
31: iteration 61450/ 173500 | consumed samples: 15731200 | consumed tokens: 32217497600 | elapsed time per iteration (s): 0.80 | learning rate: 1.514E-04 | global batch size: 256 | lm loss: 1.999864E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.491 | TFLOPs: 19.45 |
31: iteration 61460/ 173500 | consumed samples: 15733760 | consumed tokens: 32222740480 | elapsed time per iteration (s): 0.80 | learning rate: 1.514E-04 | global batch size: 256 | lm loss: 2.066416E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.891 | TFLOPs: 19.29 |
31: iteration 61470/ 173500 | consumed samples: 15736320 | consumed tokens: 32227983360 | elapsed time per iteration (s): 0.76 | learning rate: 1.514E-04 | global batch size: 256 | lm loss: 2.043705E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.011 | TFLOPs: 20.45 |
31: iteration 61480/ 173500 | consumed samples: 15738880 | consumed tokens: 32233226240 | elapsed time per iteration (s): 0.75 | learning rate: 1.514E-04 | global batch size: 256 | lm loss: 2.029283E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.502 | TFLOPs: 20.78 |
31: iteration 61490/ 173500 | consumed samples: 15741440 | consumed tokens: 32238469120 | elapsed time per iteration (s): 0.81 | learning rate: 1.514E-04 | global batch size: 256 | lm loss: 2.014694E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.704 | TFLOPs: 19.22 |
31: iteration 61500/ 173500 | consumed samples: 15744000 | consumed tokens: 32243712000 | elapsed time per iteration (s): 0.78 | learning rate: 1.514E-04 | global batch size: 256 | lm loss: 2.020025E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.032 | TFLOPs: 19.91 |
31: iteration 61510/ 173500 | consumed samples: 15746560 | consumed tokens: 32248954880 | elapsed time per iteration (s): 0.81 | learning rate: 1.514E-04 | global batch size: 256 | lm loss: 2.037592E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.915 | TFLOPs: 19.17 |
31: iteration 61520/ 173500 | consumed samples: 15749120 | consumed tokens: 32254197760 | elapsed time per iteration (s): 0.83 | learning rate: 1.513E-04 | global batch size: 256 | lm loss: 2.022201E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.456 | TFLOPs: 18.60 |
31: iteration 61530/ 173500 | consumed samples: 15751680 | consumed tokens: 32259440640 | elapsed time per iteration (s): 0.79 | learning rate: 1.513E-04 | global batch size: 256 | lm loss: 2.057975E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.015 | TFLOPs: 19.66 |
31: iteration 61540/ 173500 | consumed samples: 15754240 | consumed tokens: 32264683520 | elapsed time per iteration (s): 0.73 | learning rate: 1.513E-04 | global batch size: 256 | lm loss: 2.050766E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.578 | TFLOPs: 21.15 |
31: iteration 61550/ 173500 | consumed samples: 15756800 | consumed tokens: 32269926400 | elapsed time per iteration (s): 0.81 | learning rate: 1.513E-04 | global batch size: 256 | lm loss: 2.014270E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.640 | TFLOPs: 19.10 |
31: iteration 61560/ 173500 | consumed samples: 15759360 | consumed tokens: 32275169280 | elapsed time per iteration (s): 0.78 | learning rate: 1.513E-04 | global batch size: 256 | lm loss: 2.033157E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.957 | TFLOPs: 19.78 |
31: iteration 61570/ 173500 | consumed samples: 15761920 | consumed tokens: 32280412160 | elapsed time per iteration (s): 0.76 | learning rate: 1.513E-04 | global batch size: 256 | lm loss: 2.049834E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.528 | TFLOPs: 20.36 |
31: iteration 61580/ 173500 | consumed samples: 15764480 | consumed tokens: 32285655040 | elapsed time per iteration (s): 0.78 | learning rate: 1.513E-04 | global batch size: 256 | lm loss: 2.030987E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.729 | TFLOPs: 19.77 |
31: iteration 61590/ 173500 | consumed samples: 15767040 | consumed tokens: 32290897920 | elapsed time per iteration (s): 0.74 | learning rate: 1.512E-04 | global batch size: 256 | lm loss: 2.064099E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.849 | TFLOPs: 21.04 |
31: iteration 61600/ 173500 | consumed samples: 15769600 | consumed tokens: 32296140800 | elapsed time per iteration (s): 0.79 | learning rate: 1.512E-04 | global batch size: 256 | lm loss: 2.055823E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.766 | TFLOPs: 19.53 |
31: iteration 61610/ 173500 | consumed samples: 15772160 | consumed tokens: 32301383680 | elapsed time per iteration (s): 0.77 | learning rate: 1.512E-04 | global batch size: 256 | lm loss: 2.014245E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.581 | TFLOPs: 20.12 |
31: iteration 61620/ 173500 | consumed samples: 15774720 | consumed tokens: 32306626560 | elapsed time per iteration (s): 0.86 | learning rate: 1.512E-04 | global batch size: 256 | lm loss: 2.023172E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.090 | TFLOPs: 18.09 |
31: iteration 61630/ 173500 | consumed samples: 15777280 | consumed tokens: 32311869440 | elapsed time per iteration (s): 0.79 | learning rate: 1.512E-04 | global batch size: 256 | lm loss: 2.058872E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.112 | TFLOPs: 19.55 |
31: iteration 61640/ 173500 | consumed samples: 15779840 | consumed tokens: 32317112320 | elapsed time per iteration (s): 0.80 | learning rate: 1.512E-04 | global batch size: 256 | lm loss: 2.033064E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.303 | TFLOPs: 19.44 |
31: iteration 61650/ 173500 | consumed samples: 15782400 | consumed tokens: 32322355200 | elapsed time per iteration (s): 0.77 | learning rate: 1.512E-04 | global batch size: 256 | lm loss: 2.075762E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.294 | TFLOPs: 20.22 |
31: iteration 61660/ 173500 | consumed samples: 15784960 | consumed tokens: 32327598080 | elapsed time per iteration (s): 0.78 | learning rate: 1.511E-04 | global batch size: 256 | lm loss: 2.010049E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.337 | TFLOPs: 19.80 |
31: iteration 61670/ 173500 | consumed samples: 15787520 | consumed tokens: 32332840960 | elapsed time per iteration (s): 0.77 | learning rate: 1.511E-04 | global batch size: 256 | lm loss: 2.019372E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.941 | TFLOPs: 20.08 |
31: iteration 61680/ 173500 | consumed samples: 15790080 | consumed tokens: 32338083840 | elapsed time per iteration (s): 0.79 | learning rate: 1.511E-04 | global batch size: 256 | lm loss: 2.044293E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.668 | TFLOPs: 19.58 |
31: iteration 61690/ 173500 | consumed samples: 15792640 | consumed tokens: 32343326720 | elapsed time per iteration (s): 0.84 | learning rate: 1.511E-04 | global batch size: 256 | lm loss: 2.019664E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.481 | TFLOPs: 18.42 |
31: iteration 61700/ 173500 | consumed samples: 15795200 | consumed tokens: 32348569600 | elapsed time per iteration (s): 0.82 | learning rate: 1.511E-04 | global batch size: 256 | lm loss: 2.057565E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.484 | TFLOPs: 18.78 |
31: iteration 61710/ 173500 | consumed samples: 15797760 | consumed tokens: 32353812480 | elapsed time per iteration (s): 0.81 | learning rate: 1.511E-04 | global batch size: 256 | lm loss: 2.042377E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.179 | TFLOPs: 19.19 |
31: iteration 61720/ 173500 | consumed samples: 15800320 | consumed tokens: 32359055360 | elapsed time per iteration (s): 0.83 | learning rate: 1.511E-04 | global batch size: 256 | lm loss: 2.021873E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.925 | TFLOPs: 18.57 |
31: iteration 61730/ 173500 | consumed samples: 15802880 | consumed tokens: 32364298240 | elapsed time per iteration (s): 0.78 | learning rate: 1.510E-04 | global batch size: 256 | lm loss: 2.042778E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.319 | TFLOPs: 19.98 |
31: iteration 61740/ 173500 | consumed samples: 15805440 | consumed tokens: 32369541120 | elapsed time per iteration (s): 0.74 | learning rate: 1.510E-04 | global batch size: 256 | lm loss: 2.038752E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.548 | TFLOPs: 21.03 |
31: iteration 61750/ 173500 | consumed samples: 15808000 | consumed tokens: 32374784000 | elapsed time per iteration (s): 0.74 | learning rate: 1.510E-04 | global batch size: 256 | lm loss: 2.046788E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.633 | TFLOPs: 20.97 |
31: iteration 61760/ 173500 | consumed samples: 15810560 | consumed tokens: 32380026880 | elapsed time per iteration (s): 0.76 | learning rate: 1.510E-04 | global batch size: 256 | lm loss: 2.040714E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.584 | TFLOPs: 20.36 |
31: iteration 61770/ 173500 | consumed samples: 15813120 | consumed tokens: 32385269760 | elapsed time per iteration (s): 0.71 | learning rate: 1.510E-04 | global batch size: 256 | lm loss: 2.042252E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.903 | TFLOPs: 21.71 |
31: iteration 61780/ 173500 | consumed samples: 15815680 | consumed tokens: 32390512640 | elapsed time per iteration (s): 0.75 | learning rate: 1.510E-04 | global batch size: 256 | lm loss: 2.042115E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.491 | TFLOPs: 20.54 |
31: iteration 61790/ 173500 | consumed samples: 15818240 | consumed tokens: 32395755520 | elapsed time per iteration (s): 0.76 | learning rate: 1.510E-04 | global batch size: 256 | lm loss: 2.036127E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.806 | TFLOPs: 20.25 |
31: iteration 61800/ 173500 | consumed samples: 15820800 | consumed tokens: 32400998400 | elapsed time per iteration (s): 2.38 | learning rate: 1.509E-04 | global batch size: 256 | lm loss: 2.038898E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 107.604 | TFLOPs: 6.51 |
31: iteration 61810/ 173500 | consumed samples: 15823360 | consumed tokens: 32406241280 | elapsed time per iteration (s): 0.75 | learning rate: 1.509E-04 | global batch size: 256 | lm loss: 2.041708E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.530 | TFLOPs: 20.66 |
31: iteration 61820/ 173500 | consumed samples: 15825920 | consumed tokens: 32411484160 | elapsed time per iteration (s): 0.80 | learning rate: 1.509E-04 | global batch size: 256 | lm loss: 2.034381E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.491 | TFLOPs: 19.39 |
31: iteration 61830/ 173500 | consumed samples: 15828480 | consumed tokens: 32416727040 | elapsed time per iteration (s): 0.77 | learning rate: 1.509E-04 | global batch size: 256 | lm loss: 2.049714E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.375 | TFLOPs: 20.11 |
31: iteration 61840/ 173500 | consumed samples: 15831040 | consumed tokens: 32421969920 | elapsed time per iteration (s): 0.74 | learning rate: 1.509E-04 | global batch size: 256 | lm loss: 2.035708E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.072 | TFLOPs: 20.82 |
31: iteration 61850/ 173500 | consumed samples: 15833600 | consumed tokens: 32427212800 | elapsed time per iteration (s): 0.75 | learning rate: 1.509E-04 | global batch size: 256 | lm loss: 2.042897E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.993 | TFLOPs: 20.69 |
31: iteration 61860/ 173500 | consumed samples: 15836160 | consumed tokens: 32432455680 | elapsed time per iteration (s): 0.76 | learning rate: 1.508E-04 | global batch size: 256 | lm loss: 2.037286E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.840 | TFLOPs: 20.50 |
31: iteration 61870/ 173500 | consumed samples: 15838720 | consumed tokens: 32437698560 | elapsed time per iteration (s): 0.73 | learning rate: 1.508E-04 | global batch size: 256 | lm loss: 2.037422E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.120 | TFLOPs: 21.12 |
31: iteration 61880/ 173500 | consumed samples: 15841280 | consumed tokens: 32442941440 | elapsed time per iteration (s): 0.76 | learning rate: 1.508E-04 | global batch size: 256 | lm loss: 2.067176E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.266 | TFLOPs: 20.40 |
31: iteration 61890/ 173500 | consumed samples: 15843840 | consumed tokens: 32448184320 | elapsed time per iteration (s): 0.75 | learning rate: 1.508E-04 | global batch size: 256 | lm loss: 2.052839E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.990 | TFLOPs: 20.57 |
31: iteration 61900/ 173500 | consumed samples: 15846400 | consumed tokens: 32453427200 | elapsed time per iteration (s): 0.82 | learning rate: 1.508E-04 | global batch size: 256 | lm loss: 2.058843E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.804 | TFLOPs: 18.86 |
31: iteration 61910/ 173500 | consumed samples: 15848960 | consumed tokens: 32458670080 | elapsed time per iteration (s): 0.79 | learning rate: 1.508E-04 | global batch size: 256 | lm loss: 2.038321E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.715 | TFLOPs: 19.70 |
31: iteration 61920/ 173500 | consumed samples: 15851520 | consumed tokens: 32463912960 | elapsed time per iteration (s): 0.81 | learning rate: 1.508E-04 | global batch size: 256 | lm loss: 2.015913E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.363 | TFLOPs: 19.02 |
31: iteration 61930/ 173500 | consumed samples: 15854080 | consumed tokens: 32469155840 | elapsed time per iteration (s): 0.75 | learning rate: 1.507E-04 | global batch size: 256 | lm loss: 2.040606E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.368 | TFLOPs: 20.53 |
31: iteration 61940/ 173500 | consumed samples: 15856640 | consumed tokens: 32474398720 | elapsed time per iteration (s): 0.75 | learning rate: 1.507E-04 | global batch size: 256 | lm loss: 2.041655E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.162 | TFLOPs: 20.52 |
31: iteration 61950/ 173500 | consumed samples: 15859200 | consumed tokens: 32479641600 | elapsed time per iteration (s): 0.75 | learning rate: 1.507E-04 | global batch size: 256 | lm loss: 2.052086E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.575 | TFLOPs: 20.66 |
31: iteration 61960/ 173500 | consumed samples: 15861760 | consumed tokens: 32484884480 | elapsed time per iteration (s): 0.80 | learning rate: 1.507E-04 | global batch size: 256 | lm loss: 2.035577E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.226 | TFLOPs: 19.25 |
31: iteration 61970/ 173500 | consumed samples: 15864320 | consumed tokens: 32490127360 | elapsed time per iteration (s): 0.79 | learning rate: 1.507E-04 | global batch size: 256 | lm loss: 2.052708E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.843 | TFLOPs: 19.65 |
31: iteration 61980/ 173500 | consumed samples: 15866880 | consumed tokens: 32495370240 | elapsed time per iteration (s): 0.75 | learning rate: 1.507E-04 | global batch size: 256 | lm loss: 2.043285E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.842 | TFLOPs: 20.62 |
31: iteration 61990/ 173500 | consumed samples: 15869440 | consumed tokens: 32500613120 | elapsed time per iteration (s): 0.81 | learning rate: 1.507E-04 | global batch size: 256 | lm loss: 2.032261E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.450 | TFLOPs: 19.08 |
0: [2022-11-26 08:00:28,812] [INFO] [logging.py:68:log_dist] [Rank 0] step=62000, skipped=0, lr=[0.00015064331838981058, 0.00015064331838981058, 0.00015064331838981058], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
31: iteration 62000/ 173500 | consumed samples: 15872000 | consumed tokens: 32505856000 | elapsed time per iteration (s): 0.78 | learning rate: 1.506E-04 | global batch size: 256 | lm loss: 2.030536E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.605 | TFLOPs: 19.76 |
0: steps: 62000 loss: 2.0723 iter time (s): 0.815 samples/sec: 313.953
31: -------------------------------------------------------------------------------------------
31: valid loss at iteration 62000 | lm loss value: 2.038171E+00 | lm loss PPL: 7.676558E+00 |
31: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 62000 to checkpoints_1b1long
0: [2022-11-26 08:00:29,091] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step62000 is begin to save!
0: [2022-11-26 08:00:29,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_01-model_00-model_states.pt...
0: [2022-11-26 08:00:29,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_01-model_00-model_states.pt.
0: [2022-11-26 08:00:29,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_03-model_00-model_states.pt...
0: [2022-11-26 08:00:29,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_03-model_00-model_states.pt.
0: [2022-11-26 08:00:29,393] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_04-model_00-model_states.pt...
0: [2022-11-26 08:00:29,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_04-model_00-model_states.pt.
0: [2022-11-26 08:00:29,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_05-model_00-model_states.pt...
0: [2022-11-26 08:00:29,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_05-model_00-model_states.pt.
0: [2022-11-26 08:00:29,541] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_06-model_00-model_states.pt...
0: [2022-11-26 08:00:29,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_06-model_00-model_states.pt.
0: [2022-11-26 08:00:29,619] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_07-model_00-model_states.pt...
0: [2022-11-26 08:00:29,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_07-model_00-model_states.pt.
0: [2022-11-26 08:00:29,695] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_08-model_00-model_states.pt...
0: [2022-11-26 08:00:29,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_08-model_00-model_states.pt.
0: [2022-11-26 08:00:29,768] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_09-model_00-model_states.pt...
0: [2022-11-26 08:00:29,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_09-model_00-model_states.pt.
0: [2022-11-26 08:00:29,843] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_10-model_00-model_states.pt...
0: [2022-11-26 08:00:29,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_10-model_00-model_states.pt.
0: [2022-11-26 08:00:29,921] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_11-model_00-model_states.pt...
0: [2022-11-26 08:00:29,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_11-model_00-model_states.pt. 0: [2022-11-26 08:00:29,994] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_12-model_00-model_states.pt... 0: [2022-11-26 08:00:30,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_12-model_00-model_states.pt. 0: [2022-11-26 08:00:30,071] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_13-model_00-model_states.pt... 0: [2022-11-26 08:00:30,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_13-model_00-model_states.pt. 0: [2022-11-26 08:00:30,146] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_14-model_00-model_states.pt... 0: [2022-11-26 08:00:30,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_14-model_00-model_states.pt. 0: [2022-11-26 08:00:30,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_15-model_00-model_states.pt... 0: [2022-11-26 08:00:30,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_15-model_00-model_states.pt. 0: [2022-11-26 08:00:30,294] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_16-model_00-model_states.pt... 0: [2022-11-26 08:00:30,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_16-model_00-model_states.pt. 0: [2022-11-26 08:00:30,370] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_17-model_00-model_states.pt... 
0: [2022-11-26 08:00:30,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_17-model_00-model_states.pt. 0: [2022-11-26 08:00:30,446] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_18-model_00-model_states.pt... 0: [2022-11-26 08:00:30,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_18-model_00-model_states.pt. 0: [2022-11-26 08:00:30,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_19-model_00-model_states.pt... 0: [2022-11-26 08:00:30,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_19-model_00-model_states.pt. 0: [2022-11-26 08:00:30,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_20-model_00-model_states.pt... 0: [2022-11-26 08:00:30,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_20-model_00-model_states.pt. 0: [2022-11-26 08:00:30,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_21-model_00-model_states.pt... 0: [2022-11-26 08:00:30,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_21-model_00-model_states.pt. 0: [2022-11-26 08:00:30,743] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_22-model_00-model_states.pt... 0: [2022-11-26 08:00:30,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_22-model_00-model_states.pt. 0: [2022-11-26 08:00:30,817] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_23-model_00-model_states.pt... 
0: [2022-11-26 08:00:30,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_23-model_00-model_states.pt. 0: [2022-11-26 08:00:30,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_24-model_00-model_states.pt... 0: [2022-11-26 08:00:30,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_24-model_00-model_states.pt. 0: [2022-11-26 08:00:30,968] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_25-model_00-model_states.pt... 0: [2022-11-26 08:00:31,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_25-model_00-model_states.pt. 0: [2022-11-26 08:00:31,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_26-model_00-model_states.pt... 0: [2022-11-26 08:00:31,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_26-model_00-model_states.pt. 0: [2022-11-26 08:00:31,121] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_27-model_00-model_states.pt... 0: [2022-11-26 08:00:31,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_27-model_00-model_states.pt. 0: [2022-11-26 08:00:31,194] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_28-model_00-model_states.pt... 0: [2022-11-26 08:00:31,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_28-model_00-model_states.pt. 0: [2022-11-26 08:00:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/layer_30-model_00-model_states.pt... 
0: [2022-11-26 08:00:31,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/layer_30-model_00-model_states.pt. 0: [2022-11-26 08:00:31,274] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step62000/mp_rank_00_model_states.pt 0: [2022-11-26 08:00:31,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/mp_rank_00_model_states.pt... 0: [2022-11-26 08:00:31,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/mp_rank_00_model_states.pt. 0: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
4: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 
16: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
12: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 
20: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
14: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 
21: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 
19: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
7: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
16: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
12: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
11: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 
22: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 
26: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 
7: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 
16: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 
23: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 
22: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 
26: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
9: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 
23: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 
17: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
13: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
1: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:00:31,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:00:31,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:00:31,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 08:00:31,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 21: [2022-11-26 08:00:31,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-26 08:00:31,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 08:00:31,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 10: [2022-11-26 08:00:31,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:00:31,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 08:00:31,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 30: [2022-11-26 08:00:31,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:00:31,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 08:00:31,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 27: [2022-11-26 08:00:31,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:00:31,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 08:00:31,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 10: [2022-11-26 08:00:31,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 08:00:31,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 08:00:31,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 24: [2022-11-26 08:00:31,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:00:31,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 08:00:31,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 2: [2022-11-26 08:00:31,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:00:31,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 08:00:31,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 6: [2022-11-26 08:00:31,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:00:31,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 08:00:31,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 6: [2022-11-26 08:00:31,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 08:00:31,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 08:00:31,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 25: [2022-11-26 08:00:31,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:00:31,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:00:31,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:00:31,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 2: [2022-11-26 08:00:31,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 25: [2022-11-26 08:00:31,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 2: [2022-11-26 08:00:31,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 25: [2022-11-26 08:00:31,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 25: [2022-11-26 08:00:31,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 14: [2022-11-26 08:00:31,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-26 08:00:31,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 08:00:31,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 20: [2022-11-26 08:00:31,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:00:31,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 08:00:31,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 8: [2022-11-26 08:00:31,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:00:31,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:00:31,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
8: [2022-11-26 08:00:31,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 19: [2022-11-26 08:00:31,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 08:00:31,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 8: [2022-11-26 08:00:31,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 19: [2022-11-26 08:00:31,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 19: [2022-11-26 08:00:31,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 12: [2022-11-26 08:00:31,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:00:31,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 24: [2022-11-26 08:00:31,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:00:31,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 24: [2022-11-26 08:00:31,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 08:00:31,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
7: [2022-11-26 08:00:31,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:00:31,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 08:00:31,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 1: [2022-11-26 08:00:31,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:00:31,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 08:00:31,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 7: [2022-11-26 08:00:31,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:00:31,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 08:00:31,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 11: [2022-11-26 08:00:31,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:00:31,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 08:00:31,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
13: [2022-11-26 08:00:31,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:00:31,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 08:00:31,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 29: [2022-11-26 08:00:31,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:00:31,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 08:00:31,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 29: [2022-11-26 08:00:31,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:00:31,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 08:00:31,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 6: [2022-11-26 08:00:31,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:00:31,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 08:00:31,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
14: [2022-11-26 08:00:31,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:00:31,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 08:00:31,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 3: [2022-11-26 08:00:31,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:00:31,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 08:00:31,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 5: [2022-11-26 08:00:31,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:00:31,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 08:00:31,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 21: [2022-11-26 08:00:31,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:00:31,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 08:00:31,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
27: [2022-11-26 08:00:31,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:00:31,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 08:00:31,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 30: [2022-11-26 08:00:31,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:00:31,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 08:00:31,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 1: [2022-11-26 08:00:31,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:00:31,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:00:31,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 08:00:31,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 31: [2022-11-26 08:00:31,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 08:00:31,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
17: [2022-11-26 08:00:31,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:00:31,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 08:00:31,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 15: [2022-11-26 08:00:31,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:00:31,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 08:00:31,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 13: [2022-11-26 08:00:31,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:00:31,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 20: [2022-11-26 08:00:31,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:00:31,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
20: [2022-11-26 08:00:31,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 15: [2022-11-26 08:00:31,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:00:31,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 15: [2022-11-26 08:00:31,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 10: [2022-11-26 08:00:31,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:00:31,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 15: [2022-11-26 08:00:31,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 10: [2022-11-26 08:00:31,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 8: [2022-11-26 08:00:31,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:00:31,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 08:00:31,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 17: [2022-11-26 08:00:31,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 
17: [2022-11-26 08:00:31,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 08:00:31,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 9: [2022-11-26 08:00:31,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:00:31,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:00:31,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 08:00:31,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 08:00:31,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 9: [2022-11-26 08:00:31,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 9: [2022-11-26 08:00:31,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:00:31,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 08:00:31,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 7: [2022-11-26 08:00:31,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-26 08:00:31,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 08:00:31,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 2: [2022-11-26 08:00:31,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:00:31,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 08:00:31,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 19: [2022-11-26 08:00:31,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:00:31,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 08:00:31,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 21: [2022-11-26 08:00:31,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:00:31,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 08:00:31,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 12: [2022-11-26 08:00:31,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 08:00:31,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 08:00:31,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 24: [2022-11-26 08:00:31,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:00:31,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 17: [2022-11-26 08:00:31,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:00:31,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 24: [2022-11-26 08:00:31,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 12: [2022-11-26 08:00:31,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:00:31,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 12: [2022-11-26 08:00:31,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 31: [2022-11-26 08:00:31,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:00:31,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
31: [2022-11-26 08:00:31,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:00:31,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 08:00:31,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 31: [2022-11-26 08:00:31,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 08:00:31,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 5: [2022-11-26 08:00:31,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:00:31,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:00:31,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:00:31,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 3: [2022-11-26 08:00:31,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
27: [2022-11-26 08:00:31,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 5: [2022-11-26 08:00:31,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 27: [2022-11-26 08:00:31,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 5: [2022-11-26 08:00:31,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 3: [2022-11-26 08:00:31,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 5: [2022-11-26 08:00:31,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 3: [2022-11-26 08:00:31,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 1: [2022-11-26 08:00:31,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:00:31,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 08:00:31,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 6: [2022-11-26 08:00:31,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 08:00:31,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 08:00:31,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 25: [2022-11-26 08:00:31,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:00:31,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:00:31,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 08:00:31,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 0: [2022-11-26 08:00:31,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:00:31,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 08:00:31,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 0: [2022-11-26 08:00:31,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:00:31,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 08:00:31,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 08:00:31,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 0: [2022-11-26 08:00:31,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 08:00:31,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 17: [2022-11-26 08:00:31,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:00:31,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 08:00:31,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 31: [2022-11-26 08:00:31,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:00:31,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 08:00:31,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 13: [2022-11-26 08:00:31,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 08:00:31,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 08:00:31,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 11: [2022-11-26 08:00:31,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:00:31,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 08:00:31,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 20: [2022-11-26 08:00:31,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:00:31,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 08:00:31,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 9: [2022-11-26 08:00:31,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:00:31,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
9: [2022-11-26 08:00:31,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 8: [2022-11-26 08:00:31,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 9: [2022-11-26 08:00:31,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 8: [2022-11-26 08:00:31,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 7: [2022-11-26 08:00:31,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:00:31,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 08:00:31,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 20: [2022-11-26 08:00:31,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:00:31,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 08:00:31,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 2: [2022-11-26 08:00:31,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 08:00:31,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 08:00:31,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 13: [2022-11-26 08:00:31,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:00:31,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 08:00:31,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 12: [2022-11-26 08:00:31,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:00:31,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 08:00:31,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 25: [2022-11-26 08:00:31,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:00:31,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 08:00:31,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 21: [2022-11-26 08:00:31,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-26 08:00:31,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 15: [2022-11-26 08:00:31,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:00:31,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 15: [2022-11-26 08:00:31,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 08:00:31,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 1: [2022-11-26 08:00:31,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:00:31,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 08:00:31,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 5: [2022-11-26 08:00:31,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:00:31,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 08:00:31,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 18: [2022-11-26 08:00:31,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
18: [2022-11-26 08:00:31,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 08:00:31,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 24: [2022-11-26 08:00:31,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:00:31,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 08:00:31,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 16: [2022-11-26 08:00:31,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:00:31,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:00:31,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:00:31,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 30: [2022-11-26 08:00:31,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 
16: [2022-11-26 08:00:31,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 08:00:31,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 16: [2022-11-26 08:00:31,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 08:00:31,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 30: [2022-11-26 08:00:31,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 16: [2022-11-26 08:00:31,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 30: [2022-11-26 08:00:31,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 27: [2022-11-26 08:00:31,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:00:31,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 11: [2022-11-26 08:00:31,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:00:31,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
11: [2022-11-26 08:00:31,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 08:00:31,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 15: [2022-11-26 08:00:31,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:00:31,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 08:00:31,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 18: [2022-11-26 08:00:31,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:00:31,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 08:00:31,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 14: [2022-11-26 08:00:31,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:00:31,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-26 08:00:31,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 08:00:31,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 30: [2022-11-26 08:00:31,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:00:31,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 14: [2022-11-26 08:00:31,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 30: [2022-11-26 08:00:31,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 08:00:31,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 19: [2022-11-26 08:00:31,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:00:31,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 08:00:31,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 26: [2022-11-26 08:00:31,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:00:31,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
26: [2022-11-26 08:00:31,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:00:31,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:00:31,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 08:00:31,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 26: [2022-11-26 08:00:31,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 08:00:31,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 08:00:31,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 08:00:31,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 26: [2022-11-26 08:00:31,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 26: [2022-11-26 08:00:31,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 8: [2022-11-26 08:00:31,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 08:00:31,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 08:00:31,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 3: [2022-11-26 08:00:31,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:00:31,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 08:00:31,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 29: [2022-11-26 08:00:31,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:00:31,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 11: [2022-11-26 08:00:31,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:00:31,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 11: [2022-11-26 08:00:31,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 08:00:31,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 28: [2022-11-26 08:00:31,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
28: [2022-11-26 08:00:31,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 08:00:31,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 08:00:31,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:00:31,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 28: [2022-11-26 08:00:31,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 08:00:31,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 0: [2022-11-26 08:00:31,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 28: [2022-11-26 08:00:31,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 08:00:31,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 08:00:31,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 28: [2022-11-26 08:00:31,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
28: [2022-11-26 08:00:31,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 28: [2022-11-26 08:00:31,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 0: [2022-11-26 08:00:31,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 0: [2022-11-26 08:00:31,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 08:00:31,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 18: [2022-11-26 08:00:31,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:00:31,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 08:00:31,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 18: [2022-11-26 08:00:31,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:00:31,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 08:00:31,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 23: [2022-11-26 08:00:31,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-26 08:00:31,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:00:31,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:00:31,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 08:00:31,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 08:00:31,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 08:00:31,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:00:31,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 23: [2022-11-26 08:00:31,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 23: [2022-11-26 08:00:31,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 23: [2022-11-26 08:00:31,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 08:00:31,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 22: [2022-11-26 08:00:31,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
22: [2022-11-26 08:00:31,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:00:31,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:00:31,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 08:00:31,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 08:00:31,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 08:00:31,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 22: [2022-11-26 08:00:31,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:00:31,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 22: [2022-11-26 08:00:31,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 29: [2022-11-26 08:00:31,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:00:31,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 08:00:31,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
29: [2022-11-26 08:00:31,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 08:00:31,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 15: [2022-11-26 08:00:31,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:00:31,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 08:00:31,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 20: [2022-11-26 08:00:31,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:00:31,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 08:00:31,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 6: [2022-11-26 08:00:31,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:00:31,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 08:00:31,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 1: [2022-11-26 08:00:31,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 08:00:31,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 08:00:31,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 31: [2022-11-26 08:00:31,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:00:31,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 08:00:31,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 4: [2022-11-26 08:00:31,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:00:31,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:00:31,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 08:00:31,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 08:00:31,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 08:00:31,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 08:00:31,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 4: [2022-11-26 08:00:31,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 4: [2022-11-26 08:00:31,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 19: [2022-11-26 08:00:31,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:00:31,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 08:00:31,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 12: [2022-11-26 08:00:31,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:00:31,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 08:00:31,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
10: [2022-11-26 08:00:31,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:00:31,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 08:00:31,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 9: [2022-11-26 08:00:31,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:00:31,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 08:00:31,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 2: [2022-11-26 08:00:31,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:00:31,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 08:00:31,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 25: [2022-11-26 08:00:31,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:00:31,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 08:00:31,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
17: [2022-11-26 08:00:31,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:00:31,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 08:00:31,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 7: [2022-11-26 08:00:31,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:00:31,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 08:00:31,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 0: [2022-11-26 08:00:31,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:00:31,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 08:00:31,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 30: [2022-11-26 08:00:31,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:00:31,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 08:00:31,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
3: [2022-11-26 08:00:31,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:00:31,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:00:31,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 3: [2022-11-26 08:00:31,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 22: [2022-11-26 08:00:31,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 3: [2022-11-26 08:00:31,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 4: [2022-11-26 08:00:31,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:00:31,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 08:00:31,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 16: [2022-11-26 08:00:31,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:00:31,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 08:00:31,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
23: [2022-11-26 08:00:31,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:00:31,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 08:00:31,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 26: [2022-11-26 08:00:31,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:00:31,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 08:00:31,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 5: [2022-11-26 08:00:31,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:00:31,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 08:00:31,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 8: [2022-11-26 08:00:31,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:00:31,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 08:00:31,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
14: [2022-11-26 08:00:31,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:00:31,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 08:00:31,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 18: [2022-11-26 08:00:31,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:00:31,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 08:00:31,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 24: [2022-11-26 08:00:31,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:00:31,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 08:00:31,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 6: [2022-11-26 08:00:31,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:00:31,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 08:00:31,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
21: [2022-11-26 08:00:31,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:00:31,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 08:00:31,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 13: [2022-11-26 08:00:31,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 28: [2022-11-26 08:00:31,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:00:31,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 08:00:31,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 28: [2022-11-26 08:00:31,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 08:00:31,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 11: [2022-11-26 08:00:31,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:00:31,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 08:00:31,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
27: [2022-11-26 08:00:31,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:00:31,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:00:31,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 27: [2022-11-26 08:00:31,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 19: [2022-11-26 08:00:31,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 27: [2022-11-26 08:00:31,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 15: [2022-11-26 08:00:31,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:00:31,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 08:00:31,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 29: [2022-11-26 08:00:31,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:00:31,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 08:00:31,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
10: [2022-11-26 08:00:31,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:00:31,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 08:00:31,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 1: [2022-11-26 08:00:31,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:00:31,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 08:00:31,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 7: [2022-11-26 08:00:31,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:00:31,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 08:00:31,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 25: [2022-11-26 08:00:31,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:00:31,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 08:00:31,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
9: [2022-11-26 08:00:31,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:00:31,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 08:00:31,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 20: [2022-11-26 08:00:31,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:00:31,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 08:00:31,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 0: [2022-11-26 08:00:31,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:00:31,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 08:00:31,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 2: [2022-11-26 08:00:31,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:00:31,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 31: [2022-11-26 08:00:31,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
2: [2022-11-26 08:00:31,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 31: [2022-11-26 08:00:31,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 08:00:31,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 12: [2022-11-26 08:00:31,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:00:31,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 08:00:31,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 3: [2022-11-26 08:00:31,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:00:31,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:00:31,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 08:00:31,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 17: [2022-11-26 08:00:31,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 08:00:31,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
26: [2022-11-26 08:00:31,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:00:31,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 08:00:31,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 4: [2022-11-26 08:00:31,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:00:31,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 08:00:31,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 22: [2022-11-26 08:00:31,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:00:31,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 08:00:31,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 16: [2022-11-26 08:00:31,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:00:31,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 08:00:31,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
14: [2022-11-26 08:00:31,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:00:31,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 08:00:31,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 30: [2022-11-26 08:00:31,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:00:31,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 08:00:31,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 24: [2022-11-26 08:00:31,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:00:31,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 08:00:31,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 27: [2022-11-26 08:00:31,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:00:31,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-26 08:00:31,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 27: [2022-11-26 08:00:31,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 08:00:31,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 5: [2022-11-26 08:00:31,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 8: [2022-11-26 08:00:31,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:00:31,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 08:00:31,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 18: [2022-11-26 08:00:31,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:00:31,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 08:00:31,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 23: [2022-11-26 08:00:31,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-26 08:00:31,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 08:00:31,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 13: [2022-11-26 08:00:31,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:00:31,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 08:00:31,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 10: [2022-11-26 08:00:31,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:00:31,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 08:00:31,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 6: [2022-11-26 08:00:31,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:00:31,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 08:00:31,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 21: [2022-11-26 08:00:31,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-26 08:00:31,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 08:00:31,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 1: [2022-11-26 08:00:31,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:00:31,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 08:00:31,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 20: [2022-11-26 08:00:31,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:00:31,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 08:00:31,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 19: [2022-11-26 08:00:31,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:00:31,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 08:00:31,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 15: [2022-11-26 08:00:31,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 08:00:31,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 08:00:31,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 31: [2022-11-26 08:00:31,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:00:31,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 08:00:31,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 11: [2022-11-26 08:00:31,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:00:31,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 08:00:31,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 29: [2022-11-26 08:00:31,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:00:31,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 08:00:31,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 7: [2022-11-26 08:00:31,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 08:00:31,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 08:00:31,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 25: [2022-11-26 08:00:31,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:00:31,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 08:00:31,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 12: [2022-11-26 08:00:31,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:00:31,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 08:00:31,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 9: [2022-11-26 08:00:31,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:00:31,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 08:00:31,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 26: [2022-11-26 08:00:31,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
26: [2022-11-26 08:00:31,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 08:00:31,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 17: [2022-11-26 08:00:31,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:00:31,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 08:00:31,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 2: [2022-11-26 08:00:31,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:00:31,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 08:00:31,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 30: [2022-11-26 08:00:31,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:00:31,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 08:00:31,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 3: [2022-11-26 08:00:31,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 08:00:31,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 08:00:31,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 22: [2022-11-26 08:00:31,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:00:31,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 08:00:31,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 4: [2022-11-26 08:00:31,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:00:31,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 08:00:31,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 0: [2022-11-26 08:00:31,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:00:31,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 08:00:31,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 16: [2022-11-26 08:00:31,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-26 08:00:31,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 08:00:31,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 28: [2022-11-26 08:00:31,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:00:31,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:00:31,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 08:00:31,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 23: [2022-11-26 08:00:31,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:00:31,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 08:00:31,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 28: [2022-11-26 08:00:31,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 08:00:31,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 24: [2022-11-26 08:00:31,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
24: [2022-11-26 08:00:31,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 08:00:31,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 13: [2022-11-26 08:00:31,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:00:31,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 08:00:31,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 7: [2022-11-26 08:00:31,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:00:31,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 08:00:31,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 25: [2022-11-26 08:00:31,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:00:31,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 08:00:31,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 10: [2022-11-26 08:00:31,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
6: [2022-11-26 08:00:31,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:00:31,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 08:00:31,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 6: [2022-11-26 08:00:31,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 08:00:31,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 21: [2022-11-26 08:00:31,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:00:31,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 08:00:31,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 29: [2022-11-26 08:00:31,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:00:31,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
29: [2022-11-26 08:00:31,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 27: [2022-11-26 08:00:31,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:00:31,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 29: [2022-11-26 08:00:31,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 12: [2022-11-26 08:00:31,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 27: [2022-11-26 08:00:31,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 08:00:31,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 30: [2022-11-26 08:00:31,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:00:31,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 08:00:31,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 18: [2022-11-26 08:00:31,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-26 08:00:31,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 08:00:31,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 9: [2022-11-26 08:00:31,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:00:31,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 08:00:31,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 5: [2022-11-26 08:00:31,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:00:31,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:00:31,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 08:00:31,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 15: [2022-11-26 08:00:31,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 08:00:31,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 1: [2022-11-26 08:00:31,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 08:00:31,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 08:00:31,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 11: [2022-11-26 08:00:31,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:00:31,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 08:00:31,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 28: [2022-11-26 08:00:31,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:00:31,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:00:31,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 08:00:31,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 0: [2022-11-26 08:00:31,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:00:31,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-26 08:00:31,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 0: [2022-11-26 08:00:31,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 08:00:31,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 8: [2022-11-26 08:00:31,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:00:31,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 8: [2022-11-26 08:00:31,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 08:00:31,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 22: [2022-11-26 08:00:31,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:00:31,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 08:00:31,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 26: [2022-11-26 08:00:31,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-26 08:00:31,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 08:00:31,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 28: [2022-11-26 08:00:31,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 08:00:31,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 4: [2022-11-26 08:00:31,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:00:31,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 08:00:31,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 31: [2022-11-26 08:00:31,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:00:31,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 08:00:31,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 17: [2022-11-26 08:00:31,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 
17: [2022-11-26 08:00:31,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 3: [2022-11-26 08:00:31,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:00:31,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 08:00:31,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 17: [2022-11-26 08:00:31,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 2: [2022-11-26 08:00:31,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:00:31,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 08:00:31,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 16: [2022-11-26 08:00:31,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:00:31,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 08:00:31,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 11: [2022-11-26 08:00:31,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 08:00:31,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 08:00:31,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 14: [2022-11-26 08:00:31,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:00:31,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 08:00:31,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 23: [2022-11-26 08:00:31,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:00:31,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 08:00:31,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 4: [2022-11-26 08:00:31,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:00:31,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 08:00:31,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 5: [2022-11-26 08:00:31,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 08:00:31,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 08:00:31,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 8: [2022-11-26 08:00:31,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:00:31,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:00:31,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 08:00:31,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 13: [2022-11-26 08:00:31,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 08:00:31,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 18: [2022-11-26 08:00:31,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:00:31,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 08:00:31,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 16: [2022-11-26 08:00:31,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
16: [2022-11-26 08:00:31,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 08:00:31,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 26: [2022-11-26 08:00:31,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:00:31,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 08:00:31,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 27: [2022-11-26 08:00:31,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:00:31,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 28: [2022-11-26 08:00:31,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:00:31,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 28: [2022-11-26 08:00:31,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 08:00:31,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 24: [2022-11-26 08:00:31,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
24: [2022-11-26 08:00:31,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 08:00:31,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 21: [2022-11-26 08:00:31,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:00:31,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step62000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 08:00:31,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 0: successfully saved checkpoint at iteration 62000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2558.10 31: iteration 62010/ 173500 | consumed samples: 15874560 | consumed tokens: 32511098880 | elapsed time per iteration (s): 1.01 | learning rate: 1.506E-04 | global batch size: 256 | lm loss: 2.036887E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 253.776 | TFLOPs: 15.35 | 31: iteration 62020/ 173500 | consumed samples: 15877120 | consumed tokens: 32516341760 | elapsed time per iteration (s): 0.78 | learning rate: 1.506E-04 | global batch size: 256 | lm loss: 2.017295E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.507 | TFLOPs: 19.75 | 31: iteration 62030/ 173500 | consumed samples: 15879680 | consumed tokens: 32521584640 | elapsed time per iteration (s): 0.75 | learning rate: 1.506E-04 | global batch size: 256 | lm loss: 2.025351E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.181 | TFLOPs: 20.52 | 31: iteration 62040/ 
173500 | consumed samples: 15882240 | consumed tokens: 32526827520 | elapsed time per iteration (s): 0.77 | learning rate: 1.506E-04 | global batch size: 256 | lm loss: 2.032999E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.602 | TFLOPs: 20.18 | 31: iteration 62050/ 173500 | consumed samples: 15884800 | consumed tokens: 32532070400 | elapsed time per iteration (s): 0.79 | learning rate: 1.506E-04 | global batch size: 256 | lm loss: 2.023677E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.069 | TFLOPs: 19.73 | 31: iteration 62060/ 173500 | consumed samples: 15887360 | consumed tokens: 32537313280 | elapsed time per iteration (s): 0.76 | learning rate: 1.506E-04 | global batch size: 256 | lm loss: 2.042485E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.338 | TFLOPs: 20.47 | 31: iteration 62070/ 173500 | consumed samples: 15889920 | consumed tokens: 32542556160 | elapsed time per iteration (s): 0.73 | learning rate: 1.505E-04 | global batch size: 256 | lm loss: 2.054231E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.361 | TFLOPs: 21.20 | 31: iteration 62080/ 173500 | consumed samples: 15892480 | consumed tokens: 32547799040 | elapsed time per iteration (s): 0.79 | learning rate: 1.505E-04 | global batch size: 256 | lm loss: 2.062286E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.175 | TFLOPs: 19.67 | 31: iteration 62090/ 173500 | consumed samples: 15895040 | consumed tokens: 32553041920 | elapsed time per iteration (s): 0.75 | learning rate: 1.505E-04 | global batch size: 256 | lm loss: 2.043506E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 339.910 | TFLOPs: 20.56 | 31: iteration 62100/ 173500 | consumed samples: 15897600 | consumed tokens: 32558284800 | elapsed time per iteration (s): 0.76 | learning rate: 1.505E-04 | global batch size: 256 | lm loss: 2.006524E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.017 | TFLOPs: 20.33 | 31: iteration 62110/ 173500 | consumed samples: 15900160 | consumed tokens: 32563527680 | elapsed time per iteration (s): 0.76 | learning rate: 1.505E-04 | global batch size: 256 | lm loss: 2.026034E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.108 | TFLOPs: 20.39 | 31: iteration 62120/ 173500 | consumed samples: 15902720 | consumed tokens: 32568770560 | elapsed time per iteration (s): 0.73 | learning rate: 1.505E-04 | global batch size: 256 | lm loss: 2.044642E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.336 | TFLOPs: 21.25 | 31: iteration 62130/ 173500 | consumed samples: 15905280 | consumed tokens: 32574013440 | elapsed time per iteration (s): 0.77 | learning rate: 1.505E-04 | global batch size: 256 | lm loss: 2.052713E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.835 | TFLOPs: 20.08 | 31: iteration 62140/ 173500 | consumed samples: 15907840 | consumed tokens: 32579256320 | elapsed time per iteration (s): 0.78 | learning rate: 1.504E-04 | global batch size: 256 | lm loss: 2.034678E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.115 | TFLOPs: 19.79 | 31: iteration 62150/ 173500 | consumed samples: 15910400 | consumed tokens: 32584499200 | elapsed time per iteration (s): 0.77 | learning rate: 
1.504E-04 | global batch size: 256 | lm loss: 2.037020E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.509 | TFLOPs: 20.18 | 31: iteration 62160/ 173500 | consumed samples: 15912960 | consumed tokens: 32589742080 | elapsed time per iteration (s): 0.79 | learning rate: 1.504E-04 | global batch size: 256 | lm loss: 2.022536E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.015 | TFLOPs: 19.60 | 31: iteration 62170/ 173500 | consumed samples: 15915520 | consumed tokens: 32594984960 | elapsed time per iteration (s): 0.78 | learning rate: 1.504E-04 | global batch size: 256 | lm loss: 2.055969E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.565 | TFLOPs: 19.88 | 31: iteration 62180/ 173500 | consumed samples: 15918080 | consumed tokens: 32600227840 | elapsed time per iteration (s): 0.84 | learning rate: 1.504E-04 | global batch size: 256 | lm loss: 2.057407E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.176 | TFLOPs: 18.46 | 31: iteration 62190/ 173500 | consumed samples: 15920640 | consumed tokens: 32605470720 | elapsed time per iteration (s): 0.98 | learning rate: 1.504E-04 | global batch size: 256 | lm loss: 2.058560E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 262.550 | TFLOPs: 15.88 | 31: iteration 62200/ 173500 | consumed samples: 15923200 | consumed tokens: 32610713600 | elapsed time per iteration (s): 0.78 | learning rate: 1.503E-04 | global batch size: 256 | lm loss: 2.024186E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.210 | TFLOPs: 19.98 | 31: iteration 62210/ 173500 | 
consumed samples: 15925760 | consumed tokens: 32615956480 | elapsed time per iteration (s): 0.76 | learning rate: 1.503E-04 | global batch size: 256 | lm loss: 2.039168E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.020 | TFLOPs: 20.27 | 31: iteration 62220/ 173500 | consumed samples: 15928320 | consumed tokens: 32621199360 | elapsed time per iteration (s): 0.74 | learning rate: 1.503E-04 | global batch size: 256 | lm loss: 2.066650E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.233 | TFLOPs: 20.95 | 31: iteration 62230/ 173500 | consumed samples: 15930880 | consumed tokens: 32626442240 | elapsed time per iteration (s): 0.74 | learning rate: 1.503E-04 | global batch size: 256 | lm loss: 2.042060E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.104 | TFLOPs: 20.94 | 31: iteration 62240/ 173500 | consumed samples: 15933440 | consumed tokens: 32631685120 | elapsed time per iteration (s): 0.79 | learning rate: 1.503E-04 | global batch size: 256 | lm loss: 2.048928E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.129 | TFLOPs: 19.61 | 31: iteration 62250/ 173500 | consumed samples: 15936000 | consumed tokens: 32636928000 | elapsed time per iteration (s): 0.82 | learning rate: 1.503E-04 | global batch size: 256 | lm loss: 2.023145E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.590 | TFLOPs: 18.85 | 31: iteration 62260/ 173500 | consumed samples: 15938560 | consumed tokens: 32642170880 | elapsed time per iteration (s): 0.79 | learning rate: 1.503E-04 | global batch size: 256 | lm loss: 2.043424E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 325.355 | TFLOPs: 19.68 | 31: iteration 62270/ 173500 | consumed samples: 15941120 | consumed tokens: 32647413760 | elapsed time per iteration (s): 0.80 | learning rate: 1.502E-04 | global batch size: 256 | lm loss: 2.018488E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.251 | TFLOPs: 19.37 | 31: iteration 62280/ 173500 | consumed samples: 15943680 | consumed tokens: 32652656640 | elapsed time per iteration (s): 0.81 | learning rate: 1.502E-04 | global batch size: 256 | lm loss: 2.053057E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.722 | TFLOPs: 19.22 | 31: iteration 62290/ 173500 | consumed samples: 15946240 | consumed tokens: 32657899520 | elapsed time per iteration (s): 0.81 | learning rate: 1.502E-04 | global batch size: 256 | lm loss: 2.040420E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.857 | TFLOPs: 19.05 | 31: iteration 62300/ 173500 | consumed samples: 15948800 | consumed tokens: 32663142400 | elapsed time per iteration (s): 0.78 | learning rate: 1.502E-04 | global batch size: 256 | lm loss: 2.041760E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.261 | TFLOPs: 19.86 | 31: iteration 62310/ 173500 | consumed samples: 15951360 | consumed tokens: 32668385280 | elapsed time per iteration (s): 0.82 | learning rate: 1.502E-04 | global batch size: 256 | lm loss: 2.070061E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.882 | TFLOPs: 18.81 | 31: iteration 62320/ 173500 | consumed samples: 15953920 | consumed tokens: 32673628160 | elapsed time per iteration (s): 0.83 | learning rate: 1.502E-04 | global batch 
size: 256 | lm loss: 2.029269E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.255 | TFLOPs: 18.71 | 31: iteration 62330/ 173500 | consumed samples: 15956480 | consumed tokens: 32678871040 | elapsed time per iteration (s): 0.82 | learning rate: 1.502E-04 | global batch size: 256 | lm loss: 2.046925E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.760 | TFLOPs: 18.80 | 31: iteration 62340/ 173500 | consumed samples: 15959040 | consumed tokens: 32684113920 | elapsed time per iteration (s): 0.79 | learning rate: 1.501E-04 | global batch size: 256 | lm loss: 2.024456E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.066 | TFLOPs: 19.61 | 31: iteration 62350/ 173500 | consumed samples: 15961600 | consumed tokens: 32689356800 | elapsed time per iteration (s): 0.81 | learning rate: 1.501E-04 | global batch size: 256 | lm loss: 2.043882E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.366 | TFLOPs: 19.02 | 31: iteration 62360/ 173500 | consumed samples: 15964160 | consumed tokens: 32694599680 | elapsed time per iteration (s): 0.78 | learning rate: 1.501E-04 | global batch size: 256 | lm loss: 2.058141E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.222 | TFLOPs: 19.74 | 31: iteration 62370/ 173500 | consumed samples: 15966720 | consumed tokens: 32699842560 | elapsed time per iteration (s): 0.85 | learning rate: 1.501E-04 | global batch size: 256 | lm loss: 2.051590E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.298 | TFLOPs: 18.17 | 31: iteration 62380/ 173500 | consumed samples: 15969280 | 
consumed tokens: 32705085440 | elapsed time per iteration (s): 0.79 | learning rate: 1.501E-04 | global batch size: 256 | lm loss: 2.043028E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.152 | TFLOPs: 19.61 | 31: iteration 62390/ 173500 | consumed samples: 15971840 | consumed tokens: 32710328320 | elapsed time per iteration (s): 0.80 | learning rate: 1.501E-04 | global batch size: 256 | lm loss: 2.030370E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.835 | TFLOPs: 19.35 | 31: iteration 62400/ 173500 | consumed samples: 15974400 | consumed tokens: 32715571200 | elapsed time per iteration (s): 0.83 | learning rate: 1.501E-04 | global batch size: 256 | lm loss: 2.033764E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.254 | TFLOPs: 18.77 | 31: iteration 62410/ 173500 | consumed samples: 15976960 | consumed tokens: 32720814080 | elapsed time per iteration (s): 0.90 | learning rate: 1.500E-04 | global batch size: 256 | lm loss: 2.033238E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.805 | TFLOPs: 17.29 | 31: iteration 62420/ 173500 | consumed samples: 15979520 | consumed tokens: 32726056960 | elapsed time per iteration (s): 0.80 | learning rate: 1.500E-04 | global batch size: 256 | lm loss: 2.053452E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.535 | TFLOPs: 19.45 | 31: iteration 62430/ 173500 | consumed samples: 15982080 | consumed tokens: 32731299840 | elapsed time per iteration (s): 0.77 | learning rate: 1.500E-04 | global batch size: 256 | lm loss: 2.042261E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 334.122 | TFLOPs: 20.21 | 31: iteration 62440/ 173500 | consumed samples: 15984640 | consumed tokens: 32736542720 | elapsed time per iteration (s): 0.78 | learning rate: 1.500E-04 | global batch size: 256 | lm loss: 2.039416E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.012 | TFLOPs: 19.78 | 31: iteration 62450/ 173500 | consumed samples: 15987200 | consumed tokens: 32741785600 | elapsed time per iteration (s): 0.79 | learning rate: 1.500E-04 | global batch size: 256 | lm loss: 2.034053E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.795 | TFLOPs: 19.65 | 31: iteration 62460/ 173500 | consumed samples: 15989760 | consumed tokens: 32747028480 | elapsed time per iteration (s): 0.79 | learning rate: 1.500E-04 | global batch size: 256 | lm loss: 2.007905E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.200 | TFLOPs: 19.67 | 31: iteration 62470/ 173500 | consumed samples: 15992320 | consumed tokens: 32752271360 | elapsed time per iteration (s): 0.75 | learning rate: 1.500E-04 | global batch size: 256 | lm loss: 2.025941E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.862 | TFLOPs: 20.62 | 31: iteration 62480/ 173500 | consumed samples: 15994880 | consumed tokens: 32757514240 | elapsed time per iteration (s): 0.79 | learning rate: 1.499E-04 | global batch size: 256 | lm loss: 2.022852E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.538 | TFLOPs: 19.63 | 31: iteration 62490/ 173500 | consumed samples: 15997440 | consumed tokens: 32762757120 | elapsed time per iteration (s): 0.80 | learning rate: 1.499E-04 | global batch size: 256 | lm loss: 
2.058862E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.689 | TFLOPs: 19.40 | 31: iteration 62500/ 173500 | consumed samples: 16000000 | consumed tokens: 32768000000 | elapsed time per iteration (s): 0.85 | learning rate: 1.499E-04 | global batch size: 256 | lm loss: 2.044593E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.357 | TFLOPs: 18.23 | 31: iteration 62510/ 173500 | consumed samples: 16002560 | consumed tokens: 32773242880 | elapsed time per iteration (s): 0.75 | learning rate: 1.499E-04 | global batch size: 256 | lm loss: 2.036537E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.747 | TFLOPs: 20.74 | 31: iteration 62520/ 173500 | consumed samples: 16005120 | consumed tokens: 32778485760 | elapsed time per iteration (s): 0.79 | learning rate: 1.499E-04 | global batch size: 256 | lm loss: 2.020056E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.194 | TFLOPs: 19.49 | 31: iteration 62530/ 173500 | consumed samples: 16007680 | consumed tokens: 32783728640 | elapsed time per iteration (s): 0.75 | learning rate: 1.499E-04 | global batch size: 256 | lm loss: 2.052933E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.883 | TFLOPs: 20.56 | 31: iteration 62540/ 173500 | consumed samples: 16010240 | consumed tokens: 32788971520 | elapsed time per iteration (s): 0.76 | learning rate: 1.498E-04 | global batch size: 256 | lm loss: 2.068853E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.462 | TFLOPs: 20.48 | 31: iteration 62550/ 173500 | consumed samples: 16012800 | consumed tokens: 
32794214400 | elapsed time per iteration (s): 0.75 | learning rate: 1.498E-04 | global batch size: 256 | lm loss: 2.016839E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.300 | TFLOPs: 20.77 | 31: iteration 62560/ 173500 | consumed samples: 16015360 | consumed tokens: 32799457280 | elapsed time per iteration (s): 0.79 | learning rate: 1.498E-04 | global batch size: 256 | lm loss: 2.080383E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.308 | TFLOPs: 19.68 | 31: iteration 62570/ 173500 | consumed samples: 16017920 | consumed tokens: 32804700160 | elapsed time per iteration (s): 0.72 | learning rate: 1.498E-04 | global batch size: 256 | lm loss: 2.055929E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.552 | TFLOPs: 21.63 | 31: iteration 62580/ 173500 | consumed samples: 16020480 | consumed tokens: 32809943040 | elapsed time per iteration (s): 0.78 | learning rate: 1.498E-04 | global batch size: 256 | lm loss: 2.017400E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.521 | TFLOPs: 19.94 | 31: iteration 62590/ 173500 | consumed samples: 16023040 | consumed tokens: 32815185920 | elapsed time per iteration (s): 0.80 | learning rate: 1.498E-04 | global batch size: 256 | lm loss: 2.062429E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.436 | TFLOPs: 19.45 | 31: iteration 62600/ 173500 | consumed samples: 16025600 | consumed tokens: 32820428800 | elapsed time per iteration (s): 0.76 | learning rate: 1.498E-04 | global batch size: 256 | lm loss: 2.065914E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 337.532 | TFLOPs: 20.42 | 31: iteration 62610/ 173500 | consumed samples: 16028160 | consumed tokens: 32825671680 | elapsed time per iteration (s): 0.82 | learning rate: 1.497E-04 | global batch size: 256 | lm loss: 2.042087E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.481 | TFLOPs: 18.90 | 31: iteration 62620/ 173500 | consumed samples: 16030720 | consumed tokens: 32830914560 | elapsed time per iteration (s): 0.81 | learning rate: 1.497E-04 | global batch size: 256 | lm loss: 2.051033E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.789 | TFLOPs: 19.04 | 31: iteration 62630/ 173500 | consumed samples: 16033280 | consumed tokens: 32836157440 | elapsed time per iteration (s): 0.86 | learning rate: 1.497E-04 | global batch size: 256 | lm loss: 2.050539E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.354 | TFLOPs: 18.11 | 31: iteration 62640/ 173500 | consumed samples: 16035840 | consumed tokens: 32841400320 | elapsed time per iteration (s): 0.81 | learning rate: 1.497E-04 | global batch size: 256 | lm loss: 2.040697E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.871 | TFLOPs: 19.11 | 31: iteration 62650/ 173500 | consumed samples: 16038400 | consumed tokens: 32846643200 | elapsed time per iteration (s): 0.84 | learning rate: 1.497E-04 | global batch size: 256 | lm loss: 2.023176E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.050 | TFLOPs: 18.52 | 31: iteration 62660/ 173500 | consumed samples: 16040960 | consumed tokens: 32851886080 | elapsed time per iteration (s): 0.86 | learning rate: 1.497E-04 | global batch size: 256 | lm loss: 2.016678E+00 | grad 
norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.111 | TFLOPs: 17.97 | 31: iteration 62670/ 173500 | consumed samples: 16043520 | consumed tokens: 32857128960 | elapsed time per iteration (s): 0.85 | learning rate: 1.497E-04 | global batch size: 256 | lm loss: 2.001822E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.267 | TFLOPs: 18.29 | 31: iteration 62680/ 173500 | consumed samples: 16046080 | consumed tokens: 32862371840 | elapsed time per iteration (s): 0.85 | learning rate: 1.496E-04 | global batch size: 256 | lm loss: 2.067164E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.144 | TFLOPs: 18.16 | 31: iteration 62690/ 173500 | consumed samples: 16048640 | consumed tokens: 32867614720 | elapsed time per iteration (s): 0.82 | learning rate: 1.496E-04 | global batch size: 256 | lm loss: 2.056749E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.587 | TFLOPs: 18.97 | 31: iteration 62700/ 173500 | consumed samples: 16051200 | consumed tokens: 32872857600 | elapsed time per iteration (s): 0.80 | learning rate: 1.496E-04 | global batch size: 256 | lm loss: 2.042993E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.202 | TFLOPs: 19.31 | 31: iteration 62710/ 173500 | consumed samples: 16053760 | consumed tokens: 32878100480 | elapsed time per iteration (s): 0.81 | learning rate: 1.496E-04 | global batch size: 256 | lm loss: 2.030991E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.917 | TFLOPs: 19.05 | 31: iteration 62720/ 173500 | consumed samples: 16056320 | consumed tokens: 32883343360 | elapsed time 
per iteration (s): 0.84 | learning rate: 1.496E-04 | global batch size: 256 | lm loss: 2.037461E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.443 | TFLOPs: 18.54 | 31: iteration 62730/ 173500 | consumed samples: 16058880 | consumed tokens: 32888586240 | elapsed time per iteration (s): 0.80 | learning rate: 1.496E-04 | global batch size: 256 | lm loss: 2.044291E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.654 | TFLOPs: 19.34 | 31: iteration 62740/ 173500 | consumed samples: 16061440 | consumed tokens: 32893829120 | elapsed time per iteration (s): 0.82 | learning rate: 1.496E-04 | global batch size: 256 | lm loss: 2.008645E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.503 | TFLOPs: 18.85 | 31: iteration 62750/ 173500 | consumed samples: 16064000 | consumed tokens: 32899072000 | elapsed time per iteration (s): 0.85 | learning rate: 1.495E-04 | global batch size: 256 | lm loss: 2.047783E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.001 | TFLOPs: 18.15 | 31: iteration 62760/ 173500 | consumed samples: 16066560 | consumed tokens: 32904314880 | elapsed time per iteration (s): 0.79 | learning rate: 1.495E-04 | global batch size: 256 | lm loss: 2.022261E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.657 | TFLOPs: 19.52 | 31: iteration 62770/ 173500 | consumed samples: 16069120 | consumed tokens: 32909557760 | elapsed time per iteration (s): 0.81 | learning rate: 1.495E-04 | global batch size: 256 | lm loss: 2.030814E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.845 | TFLOPs: 
19.23 | 31: iteration 62780/ 173500 | consumed samples: 16071680 | consumed tokens: 32914800640 | elapsed time per iteration (s): 0.82 | learning rate: 1.495E-04 | global batch size: 256 | lm loss: 2.024080E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.435 | TFLOPs: 18.96 | 31: iteration 62790/ 173500 | consumed samples: 16074240 | consumed tokens: 32920043520 | elapsed time per iteration (s): 0.82 | learning rate: 1.495E-04 | global batch size: 256 | lm loss: 2.025413E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.169 | TFLOPs: 18.95 | 31: iteration 62800/ 173500 | consumed samples: 16076800 | consumed tokens: 32925286400 | elapsed time per iteration (s): 0.80 | learning rate: 1.495E-04 | global batch size: 256 | lm loss: 2.047388E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.664 | TFLOPs: 19.40 | 31: iteration 62810/ 173500 | consumed samples: 16079360 | consumed tokens: 32930529280 | elapsed time per iteration (s): 0.81 | learning rate: 1.494E-04 | global batch size: 256 | lm loss: 2.061273E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.235 | TFLOPs: 19.07 | 31: iteration 62820/ 173500 | consumed samples: 16081920 | consumed tokens: 32935772160 | elapsed time per iteration (s): 0.80 | learning rate: 1.494E-04 | global batch size: 256 | lm loss: 2.076122E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.410 | TFLOPs: 19.38 | 31: iteration 62830/ 173500 | consumed samples: 16084480 | consumed tokens: 32941015040 | elapsed time per iteration (s): 0.83 | learning rate: 1.494E-04 | global batch size: 256 | lm loss: 2.057348E+00 | grad norm: 0.131 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.516 | TFLOPs: 18.66 | 31: iteration 62840/ 173500 | consumed samples: 16087040 | consumed tokens: 32946257920 | elapsed time per iteration (s): 0.79 | learning rate: 1.494E-04 | global batch size: 256 | lm loss: 2.055985E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.970 | TFLOPs: 19.66 | 31: iteration 62850/ 173500 | consumed samples: 16089600 | consumed tokens: 32951500800 | elapsed time per iteration (s): 0.78 | learning rate: 1.494E-04 | global batch size: 256 | lm loss: 2.020139E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.886 | TFLOPs: 19.90 | 31: iteration 62860/ 173500 | consumed samples: 16092160 | consumed tokens: 32956743680 | elapsed time per iteration (s): 0.79 | learning rate: 1.494E-04 | global batch size: 256 | lm loss: 2.017980E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.278 | TFLOPs: 19.62 | 31: iteration 62870/ 173500 | consumed samples: 16094720 | consumed tokens: 32961986560 | elapsed time per iteration (s): 0.79 | learning rate: 1.494E-04 | global batch size: 256 | lm loss: 2.020370E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.567 | TFLOPs: 19.70 | 31: iteration 62880/ 173500 | consumed samples: 16097280 | consumed tokens: 32967229440 | elapsed time per iteration (s): 0.80 | learning rate: 1.493E-04 | global batch size: 256 | lm loss: 2.054552E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.867 | TFLOPs: 19.29 | 31: iteration 62890/ 173500 | consumed samples: 16099840 | consumed tokens: 32972472320 | elapsed time per iteration (s): 0.80 | 
learning rate: 1.493E-04 | global batch size: 256 | lm loss: 2.026050E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.097 | TFLOPs: 19.43 | 31: iteration 62900/ 173500 | consumed samples: 16102400 | consumed tokens: 32977715200 | elapsed time per iteration (s): 0.80 | learning rate: 1.493E-04 | global batch size: 256 | lm loss: 2.063154E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.710 | TFLOPs: 19.28 | 31: iteration 62910/ 173500 | consumed samples: 16104960 | consumed tokens: 32982958080 | elapsed time per iteration (s): 0.82 | learning rate: 1.493E-04 | global batch size: 256 | lm loss: 2.030964E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.525 | TFLOPs: 18.97 | 31: iteration 62920/ 173500 | consumed samples: 16107520 | consumed tokens: 32988200960 | elapsed time per iteration (s): 0.82 | learning rate: 1.493E-04 | global batch size: 256 | lm loss: 2.054464E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.226 | TFLOPs: 18.83 | 31: iteration 62930/ 173500 | consumed samples: 16110080 | consumed tokens: 32993443840 | elapsed time per iteration (s): 0.81 | learning rate: 1.493E-04 | global batch size: 256 | lm loss: 2.009900E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.348 | TFLOPs: 19.20 | 31: iteration 62940/ 173500 | consumed samples: 16112640 | consumed tokens: 32998686720 | elapsed time per iteration (s): 0.82 | learning rate: 1.493E-04 | global batch size: 256 | lm loss: 2.015327E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.819 | TFLOPs: 18.86 | 31: iteration 62950/ 
173500 | consumed samples: 16115200 | consumed tokens: 33003929600 | elapsed time per iteration (s): 0.81 | learning rate: 1.492E-04 | global batch size: 256 | lm loss: 2.026488E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.253 | TFLOPs: 19.01 | 31: iteration 62960/ 173500 | consumed samples: 16117760 | consumed tokens: 33009172480 | elapsed time per iteration (s): 0.83 | learning rate: 1.492E-04 | global batch size: 256 | lm loss: 2.056462E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.312 | TFLOPs: 18.59 | 31: iteration 62970/ 173500 | consumed samples: 16120320 | consumed tokens: 33014415360 | elapsed time per iteration (s): 0.83 | learning rate: 1.492E-04 | global batch size: 256 | lm loss: 2.031477E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.944 | TFLOPs: 18.75 | 31: iteration 62980/ 173500 | consumed samples: 16122880 | consumed tokens: 33019658240 | elapsed time per iteration (s): 0.83 | learning rate: 1.492E-04 | global batch size: 256 | lm loss: 2.043313E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.496 | TFLOPs: 18.66 | 31: iteration 62990/ 173500 | consumed samples: 16125440 | consumed tokens: 33024901120 | elapsed time per iteration (s): 0.81 | learning rate: 1.492E-04 | global batch size: 256 | lm loss: 2.058891E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.844 | TFLOPs: 19.11 | 31: iteration 63000/ 173500 | consumed samples: 16128000 | consumed tokens: 33030144000 | elapsed time per iteration (s): 0.81 | learning rate: 1.492E-04 | global batch size: 256 | lm loss: 2.015446E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 316.065 | TFLOPs: 19.12 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 63000 | lm loss value: 1.895957E+00 | lm loss PPL: 6.658919E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 63000 to checkpoints_1b1long 0: [2022-11-26 08:13:50,199] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step63000 is begin to save! 0: [2022-11-26 08:13:50,213] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_01-model_00-model_states.pt... 0: [2022-11-26 08:13:50,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_01-model_00-model_states.pt. 0: [2022-11-26 08:13:50,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_03-model_00-model_states.pt... 0: [2022-11-26 08:13:50,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_03-model_00-model_states.pt. 0: [2022-11-26 08:13:50,496] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_04-model_00-model_states.pt... 0: [2022-11-26 08:13:50,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_04-model_00-model_states.pt. 0: [2022-11-26 08:13:50,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_05-model_00-model_states.pt... 0: [2022-11-26 08:13:50,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_05-model_00-model_states.pt. 
0: [2022-11-26 08:13:50,645] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_06-model_00-model_states.pt... 0: [2022-11-26 08:13:50,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_06-model_00-model_states.pt. 0: [2022-11-26 08:13:50,717] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_07-model_00-model_states.pt... 0: [2022-11-26 08:13:50,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_07-model_00-model_states.pt. 0: [2022-11-26 08:13:50,797] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_08-model_00-model_states.pt... 0: [2022-11-26 08:13:50,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_08-model_00-model_states.pt. 0: [2022-11-26 08:13:50,875] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_09-model_00-model_states.pt... 0: [2022-11-26 08:13:50,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_09-model_00-model_states.pt. 0: [2022-11-26 08:13:50,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_10-model_00-model_states.pt... 0: [2022-11-26 08:13:51,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_10-model_00-model_states.pt. 0: [2022-11-26 08:13:51,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_11-model_00-model_states.pt... 0: [2022-11-26 08:13:51,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_11-model_00-model_states.pt. 
0: [2022-11-26 08:13:51,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_12-model_00-model_states.pt... 0: [2022-11-26 08:13:51,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_12-model_00-model_states.pt. 0: [2022-11-26 08:13:51,177] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_13-model_00-model_states.pt... 0: [2022-11-26 08:13:51,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_13-model_00-model_states.pt. 0: [2022-11-26 08:13:51,253] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_14-model_00-model_states.pt... 0: [2022-11-26 08:13:51,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_14-model_00-model_states.pt. 0: [2022-11-26 08:13:51,342] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_15-model_00-model_states.pt... 0: [2022-11-26 08:13:51,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_15-model_00-model_states.pt. 0: [2022-11-26 08:13:51,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_16-model_00-model_states.pt... 0: [2022-11-26 08:13:51,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_16-model_00-model_states.pt. 0: [2022-11-26 08:13:51,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_17-model_00-model_states.pt... 0: [2022-11-26 08:13:51,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_17-model_00-model_states.pt. 
0: [2022-11-26 08:13:51,566] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_18-model_00-model_states.pt... 0: [2022-11-26 08:13:51,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_18-model_00-model_states.pt. 0: [2022-11-26 08:13:51,642] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_19-model_00-model_states.pt... 0: [2022-11-26 08:13:51,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_19-model_00-model_states.pt. 0: [2022-11-26 08:13:51,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_20-model_00-model_states.pt... 0: [2022-11-26 08:13:51,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_20-model_00-model_states.pt. 0: [2022-11-26 08:13:51,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_21-model_00-model_states.pt... 0: [2022-11-26 08:13:51,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_21-model_00-model_states.pt. 0: [2022-11-26 08:13:51,869] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_22-model_00-model_states.pt... 0: [2022-11-26 08:13:51,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_22-model_00-model_states.pt. 0: [2022-11-26 08:13:51,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_23-model_00-model_states.pt... 0: [2022-11-26 08:13:52,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_23-model_00-model_states.pt. 
0: [2022-11-26 08:13:52,017] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_24-model_00-model_states.pt... 0: [2022-11-26 08:13:52,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_24-model_00-model_states.pt. 0: [2022-11-26 08:13:52,093] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_25-model_00-model_states.pt... 0: [2022-11-26 08:13:52,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_25-model_00-model_states.pt. 0: [2022-11-26 08:13:52,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_26-model_00-model_states.pt... 0: [2022-11-26 08:13:52,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_26-model_00-model_states.pt. 0: [2022-11-26 08:13:52,243] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_27-model_00-model_states.pt... 0: [2022-11-26 08:13:52,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_27-model_00-model_states.pt. 0: [2022-11-26 08:13:52,319] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_28-model_00-model_states.pt... 0: [2022-11-26 08:13:52,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_28-model_00-model_states.pt. 0: [2022-11-26 08:13:52,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/layer_30-model_00-model_states.pt... 0: [2022-11-26 08:13:52,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/layer_30-model_00-model_states.pt. 
0: [2022-11-26 08:13:52,396] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step63000/mp_rank_00_model_states.pt 0: [2022-11-26 08:13:52,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/mp_rank_00_model_states.pt... 0: [2022-11-26 08:13:52,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/mp_rank_00_model_states.pt. 31: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 
4: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
10: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
13: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 
11: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 
29: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 
21: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
6: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 
16: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
15: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
11: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 
22: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
26: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
6: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 
16: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 
23: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 
29: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 
19: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
16: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 
25: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 
17: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
9: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
14: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 
23: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:13:52,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:13:52,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:13:52,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 08:13:52,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 5: [2022-11-26 08:13:52,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-26 08:13:52,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 08:13:52,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 25: [2022-11-26 08:13:52,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:13:52,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 08:13:52,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 3: [2022-11-26 08:13:52,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:13:52,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 08:13:52,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 15: [2022-11-26 08:13:52,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:13:52,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 08:13:52,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 9: [2022-11-26 08:13:52,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
26: [2022-11-26 08:13:52,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:13:52,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 9: [2022-11-26 08:13:52,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 26: [2022-11-26 08:13:52,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 9: [2022-11-26 08:13:52,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 2: [2022-11-26 08:13:52,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:13:52,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 15: [2022-11-26 08:13:52,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:13:52,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 15: [2022-11-26 08:13:52,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 08:13:52,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 21: [2022-11-26 08:13:52,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-26 08:13:52,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 08:13:52,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 16: [2022-11-26 08:13:52,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:13:52,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 08:13:52,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 13: [2022-11-26 08:13:52,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:13:52,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 08:13:52,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 31: [2022-11-26 08:13:52,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:13:52,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 18: [2022-11-26 08:13:52,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:13:52,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
18: [2022-11-26 08:13:52,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 08:13:52,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 18: [2022-11-26 08:13:52,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:13:52,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 08:13:52,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 26: [2022-11-26 08:13:52,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:13:52,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 08:13:52,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 6: [2022-11-26 08:13:52,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:13:52,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 08:13:52,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 08:13:52,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 08:13:52,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 23: [2022-11-26 08:13:52,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:13:52,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 23: [2022-11-26 08:13:52,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 08:13:52,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 14: [2022-11-26 08:13:52,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:13:52,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 08:13:52,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 14: [2022-11-26 08:13:52,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-26 08:13:52,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 08:13:52,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 4: [2022-11-26 08:13:52,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:13:52,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 08:13:52,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 31: [2022-11-26 08:13:52,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:13:52,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 7: [2022-11-26 08:13:52,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:13:52,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 7: [2022-11-26 08:13:52,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 08:13:52,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 12: [2022-11-26 08:13:52,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 08:13:52,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:13:52,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 08:13:52,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 08:13:52,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 12: [2022-11-26 08:13:52,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 19: [2022-11-26 08:13:52,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:13:52,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 08:13:52,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 30: [2022-11-26 08:13:52,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:13:52,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 13: [2022-11-26 08:13:52,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-26 08:13:52,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 30: [2022-11-26 08:13:52,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 13: [2022-11-26 08:13:52,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 27: [2022-11-26 08:13:52,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:13:52,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:13:52,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:13:52,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 9: [2022-11-26 08:13:52,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 27: [2022-11-26 08:13:52,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 5: [2022-11-26 08:13:52,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:13:52,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 27: [2022-11-26 08:13:52,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
5: [2022-11-26 08:13:52,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 27: [2022-11-26 08:13:52,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 5: [2022-11-26 08:13:52,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 3: [2022-11-26 08:13:52,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:13:52,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:13:52,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:13:52,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 08:13:52,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 0: [2022-11-26 08:13:52,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 08:13:52,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 23: [2022-11-26 08:13:52,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 5: [2022-11-26 08:13:52,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
8: [2022-11-26 08:13:52,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:13:52,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:13:52,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 19: [2022-11-26 08:13:52,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:13:52,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:13:52,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 8: [2022-11-26 08:13:52,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:13:52,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 2: [2022-11-26 08:13:52,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 19: [2022-11-26 08:13:52,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 5: [2022-11-26 08:13:52,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 8: [2022-11-26 08:13:52,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
2: [2022-11-26 08:13:52,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 19: [2022-11-26 08:13:52,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 8: [2022-11-26 08:13:52,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 08:13:52,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 30: [2022-11-26 08:13:52,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:13:52,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:13:52,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 21: [2022-11-26 08:13:52,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:13:52,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 30: [2022-11-26 08:13:52,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 21: [2022-11-26 08:13:52,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:13:52,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
21: [2022-11-26 08:13:52,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 08:13:52,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 14: [2022-11-26 08:13:52,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:13:52,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 14: [2022-11-26 08:13:52,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 21: [2022-11-26 08:13:52,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 14: [2022-11-26 08:13:52,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 28: [2022-11-26 08:13:52,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:13:52,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:13:52,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:13:52,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 
26: [2022-11-26 08:13:52,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 19: [2022-11-26 08:13:52,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 16: [2022-11-26 08:13:52,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 26: [2022-11-26 08:13:52,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 19: [2022-11-26 08:13:52,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 16: [2022-11-26 08:13:52,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 18: [2022-11-26 08:13:52,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:13:52,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 6: [2022-11-26 08:13:52,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:13:52,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 6: [2022-11-26 08:13:52,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 08:13:52,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
16: [2022-11-26 08:13:52,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:13:52,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 08:13:52,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 31: [2022-11-26 08:13:52,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:13:52,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 0: [2022-11-26 08:13:52,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:13:52,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 0: [2022-11-26 08:13:52,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 08:13:52,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 9: [2022-11-26 08:13:52,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:13:52,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 08:13:52,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
8: [2022-11-26 08:13:52,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:13:52,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 08:13:52,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 7: [2022-11-26 08:13:52,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:13:52,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 08:13:52,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 9: [2022-11-26 08:13:52,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:13:52,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 08:13:52,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 25: [2022-11-26 08:13:52,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:13:52,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 08:13:52,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
18: [2022-11-26 08:13:52,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:13:52,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 17: [2022-11-26 08:13:52,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:13:52,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 17: [2022-11-26 08:13:52,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 08:13:52,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 17: [2022-11-26 08:13:52,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:13:52,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 08:13:52,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 17: [2022-11-26 08:13:52,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:13:52,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 08:13:52,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
17: [2022-11-26 08:13:52,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:13:52,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 08:13:52,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 3: [2022-11-26 08:13:52,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:13:52,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 29: [2022-11-26 08:13:52,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:13:52,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 3: [2022-11-26 08:13:52,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 29: [2022-11-26 08:13:52,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 19: [2022-11-26 08:13:52,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
19: [2022-11-26 08:13:52,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 29: [2022-11-26 08:13:52,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:13:52,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 29: [2022-11-26 08:13:52,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 5: [2022-11-26 08:13:52,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:13:52,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 29: [2022-11-26 08:13:52,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 5: [2022-11-26 08:13:52,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 29: [2022-11-26 08:13:52,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:13:52,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 08:13:52,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 3: [2022-11-26 08:13:52,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 08:13:52,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 08:13:52,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 7: [2022-11-26 08:13:52,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:13:52,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 08:13:52,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 7: [2022-11-26 08:13:52,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:13:52,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 08:13:52,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 15: [2022-11-26 08:13:52,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:13:52,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 27: [2022-11-26 08:13:52,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:13:52,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
2: [2022-11-26 08:13:52,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:13:52,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 2: [2022-11-26 08:13:52,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 08:13:52,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 27: [2022-11-26 08:13:52,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 2: [2022-11-26 08:13:52,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:13:52,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 08:13:52,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 22: [2022-11-26 08:13:52,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:13:52,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 08:13:52,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 23: [2022-11-26 08:13:52,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
22: [2022-11-26 08:13:52,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:13:52,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 08:13:52,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 23: [2022-11-26 08:13:52,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 22: [2022-11-26 08:13:52,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:13:52,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 22: [2022-11-26 08:13:52,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 08:13:52,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 16: [2022-11-26 08:13:52,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:13:52,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 08:13:52,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 22: [2022-11-26 08:13:52,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
13: [2022-11-26 08:13:52,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:13:52,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 22: [2022-11-26 08:13:52,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 13: [2022-11-26 08:13:52,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 13: [2022-11-26 08:13:52,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:13:52,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 13: [2022-11-26 08:13:52,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 08:13:52,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 12: [2022-11-26 08:13:52,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:13:52,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 08:13:52,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 08:13:52,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 08:13:52,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 12: [2022-11-26 08:13:52,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 4: [2022-11-26 08:13:52,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:13:52,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 08:13:52,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 4: [2022-11-26 08:13:52,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:13:52,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 08:13:52,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 28: [2022-11-26 08:13:52,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 08:13:52,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
28: [2022-11-26 08:13:52,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 08:13:52,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 08:13:52,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 25: [2022-11-26 08:13:52,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:13:52,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:13:52,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:13:52,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 25: [2022-11-26 08:13:52,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 08:13:52,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 14: [2022-11-26 08:13:52,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 25: [2022-11-26 08:13:52,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 25: [2022-11-26 08:13:52,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
28: [2022-11-26 08:13:52,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 08:13:52,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 26: [2022-11-26 08:13:52,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 28: [2022-11-26 08:13:52,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 26: [2022-11-26 08:13:52,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 08:13:52,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 21: [2022-11-26 08:13:52,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:13:52,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 08:13:52,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 15: [2022-11-26 08:13:52,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:13:52,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 08:13:52,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
0: [2022-11-26 08:13:52,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:13:52,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 08:13:52,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 30: [2022-11-26 08:13:52,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:13:52,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:13:52,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 08:13:52,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 6: [2022-11-26 08:13:52,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 08:13:52,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 30: [2022-11-26 08:13:52,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:13:52,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 08:13:52,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
31: [2022-11-26 08:13:52,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:13:52,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 08:13:52,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 18: [2022-11-26 08:13:52,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:13:52,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 08:13:52,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 28: [2022-11-26 08:13:52,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 08:13:52,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 08:13:52,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 0: [2022-11-26 08:13:52,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 08:13:52,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 8: [2022-11-26 08:13:52,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 08:13:52,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 08:13:52,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 4: [2022-11-26 08:13:52,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:13:52,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:13:52,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 08:13:52,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 08:13:52,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 4: [2022-11-26 08:13:52,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 23: [2022-11-26 08:13:52,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:13:52,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 08:13:52,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 10: [2022-11-26 08:13:52,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-26 08:13:52,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:13:52,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:13:52,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:13:52,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 08:13:52,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 08:13:52,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 08:13:52,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 08:13:52,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 10: [2022-11-26 08:13:52,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 10: [2022-11-26 08:13:52,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 10: [2022-11-26 08:13:52,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 6: [2022-11-26 08:13:52,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 08:13:52,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 08:13:52,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 9: [2022-11-26 08:13:52,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:13:52,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 08:13:52,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 17: [2022-11-26 08:13:52,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:13:52,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 08:13:52,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 1: [2022-11-26 08:13:52,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:13:52,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:13:52,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:13:52,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 08:13:52,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:13:52,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 08:13:52,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 08:13:52,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 08:13:52,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 08:13:52,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 08:13:52,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 1: [2022-11-26 08:13:52,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 1: [2022-11-26 08:13:52,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 1: [2022-11-26 08:13:52,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 1: [2022-11-26 08:13:52,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 27: [2022-11-26 08:13:52,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
5: [2022-11-26 08:13:52,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:13:52,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 08:13:52,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 27: [2022-11-26 08:13:52,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 08:13:52,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 2: [2022-11-26 08:13:52,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:13:52,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 08:13:52,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 10: [2022-11-26 08:13:52,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:13:52,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 08:13:52,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 29: [2022-11-26 08:13:52,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
29: [2022-11-26 08:13:52,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 08:13:52,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 25: [2022-11-26 08:13:52,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:13:52,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 08:13:52,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 0: [2022-11-26 08:13:52,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:13:52,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 08:13:52,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 13: [2022-11-26 08:13:52,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:13:52,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 08:13:52,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 19: [2022-11-26 08:13:52,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-26 08:13:52,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 08:13:52,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 14: [2022-11-26 08:13:52,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:13:52,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 08:13:52,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 12: [2022-11-26 08:13:52,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:13:52,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 08:13:52,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 30: [2022-11-26 08:13:52,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:13:52,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 08:13:52,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 16: [2022-11-26 08:13:52,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-26 08:13:52,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 08:13:52,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 31: [2022-11-26 08:13:52,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:13:52,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 08:13:52,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 23: [2022-11-26 08:13:52,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:13:52,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 08:13:52,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 28: [2022-11-26 08:13:52,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:13:52,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:13:52,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 08:13:52,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
7: [2022-11-26 08:13:52,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:13:52,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 08:13:52,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 8: [2022-11-26 08:13:52,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:13:52,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 3: [2022-11-26 08:13:52,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:13:52,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 3: [2022-11-26 08:13:52,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 08:13:52,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 18: [2022-11-26 08:13:52,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:13:52,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 08:13:52,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
17: [2022-11-26 08:13:52,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:13:52,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 08:13:52,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 21: [2022-11-26 08:13:52,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:13:52,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 22: [2022-11-26 08:13:52,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:13:52,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 22: [2022-11-26 08:13:52,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 08:13:52,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 15: [2022-11-26 08:13:52,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:13:52,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 08:13:52,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
4: [2022-11-26 08:13:52,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 28: [2022-11-26 08:13:52,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 4: [2022-11-26 08:13:52,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 28: [2022-11-26 08:13:52,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 4: [2022-11-26 08:13:52,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 27: [2022-11-26 08:13:52,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:13:52,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 08:13:52,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 25: [2022-11-26 08:13:52,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:13:52,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 1: [2022-11-26 08:13:52,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:13:52,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
1: [2022-11-26 08:13:52,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 6: [2022-11-26 08:13:52,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:13:52,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 1: [2022-11-26 08:13:52,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 6: [2022-11-26 08:13:52,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 10: [2022-11-26 08:13:52,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:13:52,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 08:13:52,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 9: [2022-11-26 08:13:52,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:13:52,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 08:13:52,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 5: [2022-11-26 08:13:52,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-26 08:13:52,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 08:13:52,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 2: [2022-11-26 08:13:52,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:13:52,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 08:13:52,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 19: [2022-11-26 08:13:52,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:13:52,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 08:13:52,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 14: [2022-11-26 08:13:52,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:13:52,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 08:13:52,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 16: [2022-11-26 08:13:52,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
16: [2022-11-26 08:13:52,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 08:13:52,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 13: [2022-11-26 08:13:52,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:13:52,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 08:13:52,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 28: [2022-11-26 08:13:52,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:13:52,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:13:52,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 08:13:52,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 7: [2022-11-26 08:13:52,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:13:52,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 08:13:52,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
30: [2022-11-26 08:13:52,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:13:52,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:13:52,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 29: [2022-11-26 08:13:52,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 08:13:52,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 30: [2022-11-26 08:13:52,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 28: [2022-11-26 08:13:52,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 08:13:52,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 31: [2022-11-26 08:13:52,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:13:52,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 08:13:52,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 23: [2022-11-26 08:13:52,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
23: [2022-11-26 08:13:52,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 08:13:52,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 21: [2022-11-26 08:13:52,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:13:52,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 08:13:52,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 26: [2022-11-26 08:13:52,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:13:52,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 08:13:52,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 15: [2022-11-26 08:13:52,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:13:52,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 08:13:52,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 3: [2022-11-26 08:13:52,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 08:13:52,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 08:13:52,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 18: [2022-11-26 08:13:52,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:13:52,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 22: [2022-11-26 08:13:52,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:13:52,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 18: [2022-11-26 08:13:52,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 22: [2022-11-26 08:13:52,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 8: [2022-11-26 08:13:52,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:13:52,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 08:13:52,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 6: [2022-11-26 08:13:52,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 08:13:52,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 08:13:52,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 27: [2022-11-26 08:13:52,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:13:52,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 08:13:52,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 1: [2022-11-26 08:13:52,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:13:52,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 4: [2022-11-26 08:13:52,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:13:52,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 4: [2022-11-26 08:13:52,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 08:13:52,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 10: [2022-11-26 08:13:52,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-26 08:13:52,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 08:13:52,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 25: [2022-11-26 08:13:52,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:13:52,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 08:13:52,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 0: [2022-11-26 08:13:52,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:13:52,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 08:13:52,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 9: [2022-11-26 08:13:52,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:13:52,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 08:13:52,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 17: [2022-11-26 08:13:52,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 
17: [2022-11-26 08:13:52,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 08:13:52,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 2: [2022-11-26 08:13:52,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:13:52,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 08:13:52,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 14: [2022-11-26 08:13:52,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:13:52,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 08:13:52,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 13: [2022-11-26 08:13:52,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:13:52,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 08:13:52,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 19: [2022-11-26 08:13:52,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
19: [2022-11-26 08:13:52,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 08:13:52,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 29: [2022-11-26 08:13:52,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:13:52,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 5: [2022-11-26 08:13:52,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:13:52,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 5: [2022-11-26 08:13:52,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 08:13:52,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 12: [2022-11-26 08:13:52,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:13:52,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 08:13:52,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 23: [2022-11-26 08:13:52,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-26 08:13:52,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 08:13:52,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 15: [2022-11-26 08:13:52,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:13:52,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:13:52,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 08:13:52,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 0: [2022-11-26 08:13:52,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 08:13:52,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 31: [2022-11-26 08:13:52,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:13:52,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 08:13:52,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 7: [2022-11-26 08:13:52,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
21: [2022-11-26 08:13:52,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:13:52,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 21: [2022-11-26 08:13:52,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 30: [2022-11-26 08:13:52,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:13:52,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 7: [2022-11-26 08:13:52,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 30: [2022-11-26 08:13:52,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 08:13:52,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 16: [2022-11-26 08:13:52,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:13:52,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 08:13:52,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 26: [2022-11-26 08:13:52,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
26: [2022-11-26 08:13:52,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 08:13:52,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 28: [2022-11-26 08:13:52,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 08:13:52,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 08:13:52,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 3: [2022-11-26 08:13:52,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:13:52,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 08:13:52,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 18: [2022-11-26 08:13:52,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:13:52,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 08:13:52,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 4: [2022-11-26 08:13:52,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 08:13:52,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 08:13:52,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 8: [2022-11-26 08:13:52,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:13:52,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 08:13:52,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 25: [2022-11-26 08:13:52,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:13:52,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 22: [2022-11-26 08:13:52,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:13:52,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 22: [2022-11-26 08:13:52,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 08:13:52,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 29: [2022-11-26 08:13:52,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
29: [2022-11-26 08:13:52,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 08:13:52,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 0: [2022-11-26 08:13:52,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:13:52,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 08:13:52,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 19: [2022-11-26 08:13:52,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:13:52,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 08:13:52,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 21: [2022-11-26 08:13:52,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:13:52,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
21: [2022-11-26 08:13:52,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 26: [2022-11-26 08:13:52,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 21: [2022-11-26 08:13:52,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 26: [2022-11-26 08:13:52,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 3: [2022-11-26 08:13:52,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:13:52,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 08:13:52,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 6: [2022-11-26 08:13:52,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:13:52,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 08:13:52,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 10: [2022-11-26 08:13:52,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 08:13:52,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 27: [2022-11-26 08:13:52,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:13:52,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 27: [2022-11-26 08:13:52,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 08:13:52,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 13: [2022-11-26 08:13:52,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:13:52,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 08:13:52,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 17: [2022-11-26 08:13:52,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:13:52,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 08:13:52,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 2: [2022-11-26 08:13:52,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 08:13:52,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 08:13:52,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 14: [2022-11-26 08:13:52,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:13:52,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 5: [2022-11-26 08:13:52,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:13:52,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 5: [2022-11-26 08:13:52,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 08:13:52,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 22: [2022-11-26 08:13:52,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:13:52,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 08:13:52,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 7: [2022-11-26 08:13:52,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-26 08:13:52,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 08:13:52,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 8: [2022-11-26 08:13:52,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:13:52,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 08:13:52,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 31: [2022-11-26 08:13:52,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:13:52,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 08:13:52,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 15: [2022-11-26 08:13:52,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:13:52,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 08:13:52,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 16: [2022-11-26 08:13:52,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
16: [2022-11-26 08:13:52,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 08:13:52,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 28: [2022-11-26 08:13:52,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 08:13:52,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 23: [2022-11-26 08:13:52,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:13:52,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 08:13:52,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 1: [2022-11-26 08:13:52,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:13:52,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:13:52,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 08:13:52,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
20: [2022-11-26 08:13:52,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 08:13:52,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 20: [2022-11-26 08:13:52,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:13:52,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 08:13:52,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 28: [2022-11-26 08:13:52,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 20: [2022-11-26 08:13:52,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:13:52,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 08:13:52,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 12: [2022-11-26 08:13:52,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:13:52,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-26 08:13:52,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 12: [2022-11-26 08:13:52,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 30: [2022-11-26 08:13:52,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 12: [2022-11-26 08:13:52,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 20: [2022-11-26 08:13:52,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:13:52,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 08:13:52,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 20: [2022-11-26 08:13:52,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:13:52,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 08:13:52,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 9: [2022-11-26 08:13:52,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 08:13:52,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 08:13:52,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 20: [2022-11-26 08:13:52,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:13:52,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 08:13:52,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 20: [2022-11-26 08:13:52,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:13:52,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 08:13:52,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 20: [2022-11-26 08:13:52,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:13:52,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 08:13:52,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 24: [2022-11-26 08:13:52,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
24: [2022-11-26 08:13:52,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:13:52,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:13:52,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 08:13:52,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 08:13:52,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 08:13:52,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 24: [2022-11-26 08:13:52,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 24: [2022-11-26 08:13:52,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 24: [2022-11-26 08:13:52,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:13:52,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 08:13:52,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 24: [2022-11-26 08:13:52,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-26 08:13:52,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:13:52,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:13:52,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 08:13:52,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:13:52,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 08:13:52,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 08:13:52,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 24: [2022-11-26 08:13:52,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 24: [2022-11-26 08:13:52,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 24: [2022-11-26 08:13:52,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 08:13:52,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 11: [2022-11-26 08:13:52,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 08:13:52,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:13:52,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:13:52,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:13:52,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:13:52,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 08:13:52,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:13:52,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:13:52,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 08:13:52,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
11: [2022-11-26 08:13:52,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 08:13:52,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 08:13:52,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:13:52,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 08:13:52,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 11: [2022-11-26 08:13:52,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 08:13:52,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 08:13:52,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 11: [2022-11-26 08:13:52,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 11: [2022-11-26 08:13:52,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 11: [2022-11-26 08:13:52,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step63000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 08:13:52,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
11: [2022-11-26 08:13:52,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 11: [2022-11-26 08:13:52,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 0: successfully saved checkpoint at iteration 63000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2666.87 31: iteration 63010/ 173500 | consumed samples: 16130560 | consumed tokens: 33035386880 | elapsed time per iteration (s): 1.08 | learning rate: 1.492E-04 | global batch size: 256 | lm loss: 2.027345E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.351 | TFLOPs: 14.30 | 31: iteration 63020/ 173500 | consumed samples: 16133120 | consumed tokens: 33040629760 | elapsed time per iteration (s): 0.77 | learning rate: 1.491E-04 | global batch size: 256 | lm loss: 1.997731E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.175 | TFLOPs: 20.04 | 31: iteration 63030/ 173500 | consumed samples: 16135680 | consumed tokens: 33045872640 | elapsed time per iteration (s): 0.77 | learning rate: 1.491E-04 | global batch size: 256 | lm loss: 2.085497E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.561 | TFLOPs: 20.18 | 31: iteration 63040/ 173500 | consumed samples: 16138240 | consumed tokens: 33051115520 | elapsed time per iteration (s): 0.83 | learning rate: 1.491E-04 | global batch size: 256 | lm loss: 2.038981E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.166 | TFLOPs: 18.64 | 31: iteration 63050/ 173500 | consumed samples: 16140800 | consumed tokens: 33056358400 | elapsed time per iteration (s): 0.81 | learning rate: 1.491E-04 | global batch size: 256 | lm loss: 2.036568E+00 | grad norm: 0.136 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.620 | TFLOPs: 19.09 | 31: iteration 63060/ 173500 | consumed samples: 16143360 | consumed tokens: 33061601280 | elapsed time per iteration (s): 2.24 | learning rate: 1.491E-04 | global batch size: 256 | lm loss: 2.076325E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 114.515 | TFLOPs: 6.93 | 31: iteration 63070/ 173500 | consumed samples: 16145920 | consumed tokens: 33066844160 | elapsed time per iteration (s): 0.72 | learning rate: 1.491E-04 | global batch size: 256 | lm loss: 2.057965E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.296 | TFLOPs: 21.43 | 31: iteration 63080/ 173500 | consumed samples: 16148480 | consumed tokens: 33072087040 | elapsed time per iteration (s): 0.77 | learning rate: 1.490E-04 | global batch size: 256 | lm loss: 2.021835E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.442 | TFLOPs: 20.23 | 31: iteration 63090/ 173500 | consumed samples: 16151040 | consumed tokens: 33077329920 | elapsed time per iteration (s): 0.78 | learning rate: 1.490E-04 | global batch size: 256 | lm loss: 2.027502E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.686 | TFLOPs: 19.76 | 31: iteration 63100/ 173500 | consumed samples: 16153600 | consumed tokens: 33082572800 | elapsed time per iteration (s): 0.77 | learning rate: 1.490E-04 | global batch size: 256 | lm loss: 2.056607E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.378 | TFLOPs: 20.05 | 31: iteration 63110/ 173500 | consumed samples: 16156160 | consumed tokens: 33087815680 | elapsed time per iteration (s): 
0.78 | learning rate: 1.490E-04 | global batch size: 256 | lm loss: 2.015306E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.406 | TFLOPs: 19.93 | 31: iteration 63120/ 173500 | consumed samples: 16158720 | consumed tokens: 33093058560 | elapsed time per iteration (s): 0.75 | learning rate: 1.490E-04 | global batch size: 256 | lm loss: 2.017987E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.675 | TFLOPs: 20.67 | 31: iteration 63130/ 173500 | consumed samples: 16161280 | consumed tokens: 33098301440 | elapsed time per iteration (s): 2.67 | learning rate: 1.490E-04 | global batch size: 256 | lm loss: 2.031688E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 95.966 | TFLOPs: 5.81 | 31: iteration 63140/ 173500 | consumed samples: 16163840 | consumed tokens: 33103544320 | elapsed time per iteration (s): 0.74 | learning rate: 1.490E-04 | global batch size: 256 | lm loss: 2.045937E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.777 | TFLOPs: 21.04 | 31: iteration 63150/ 173500 | consumed samples: 16166400 | consumed tokens: 33108787200 | elapsed time per iteration (s): 0.80 | learning rate: 1.489E-04 | global batch size: 256 | lm loss: 2.033034E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.160 | TFLOPs: 19.37 | 31: iteration 63160/ 173500 | consumed samples: 16168960 | consumed tokens: 33114030080 | elapsed time per iteration (s): 0.85 | learning rate: 1.489E-04 | global batch size: 256 | lm loss: 2.032488E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.383 | TFLOPs: 18.29 | 31: iteration 
63170/ 173500 | consumed samples: 16171520 | consumed tokens: 33119272960 | elapsed time per iteration (s): 0.83 | learning rate: 1.489E-04 | global batch size: 256 | lm loss: 2.001579E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.867 | TFLOPs: 18.56 | 31: iteration 63180/ 173500 | consumed samples: 16174080 | consumed tokens: 33124515840 | elapsed time per iteration (s): 0.84 | learning rate: 1.489E-04 | global batch size: 256 | lm loss: 2.053809E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.981 | TFLOPs: 18.51 | 31: iteration 63190/ 173500 | consumed samples: 16176640 | consumed tokens: 33129758720 | elapsed time per iteration (s): 0.81 | learning rate: 1.489E-04 | global batch size: 256 | lm loss: 2.041532E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.403 | TFLOPs: 19.02 | 31: iteration 63200/ 173500 | consumed samples: 16179200 | consumed tokens: 33135001600 | elapsed time per iteration (s): 0.80 | learning rate: 1.489E-04 | global batch size: 256 | lm loss: 2.039651E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.862 | TFLOPs: 19.47 | 31: iteration 63210/ 173500 | consumed samples: 16181760 | consumed tokens: 33140244480 | elapsed time per iteration (s): 0.80 | learning rate: 1.489E-04 | global batch size: 256 | lm loss: 2.067444E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.748 | TFLOPs: 19.34 | 31: iteration 63220/ 173500 | consumed samples: 16184320 | consumed tokens: 33145487360 | elapsed time per iteration (s): 0.79 | learning rate: 1.488E-04 | global batch size: 256 | lm loss: 2.006130E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 325.424 | TFLOPs: 19.69 | 31: iteration 63230/ 173500 | consumed samples: 16186880 | consumed tokens: 33150730240 | elapsed time per iteration (s): 0.80 | learning rate: 1.488E-04 | global batch size: 256 | lm loss: 2.044411E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.086 | TFLOPs: 19.24 | 31: iteration 63240/ 173500 | consumed samples: 16189440 | consumed tokens: 33155973120 | elapsed time per iteration (s): 0.79 | learning rate: 1.488E-04 | global batch size: 256 | lm loss: 2.032137E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.901 | TFLOPs: 19.66 | 31: iteration 63250/ 173500 | consumed samples: 16192000 | consumed tokens: 33161216000 | elapsed time per iteration (s): 0.76 | learning rate: 1.488E-04 | global batch size: 256 | lm loss: 2.000324E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.600 | TFLOPs: 20.42 | 31: iteration 63260/ 173500 | consumed samples: 16194560 | consumed tokens: 33166458880 | elapsed time per iteration (s): 0.75 | learning rate: 1.488E-04 | global batch size: 256 | lm loss: 2.021445E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.114 | TFLOPs: 20.76 | 31: iteration 63270/ 173500 | consumed samples: 16197120 | consumed tokens: 33171701760 | elapsed time per iteration (s): 0.77 | learning rate: 1.488E-04 | global batch size: 256 | lm loss: 2.015970E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.055 | TFLOPs: 20.09 | 31: iteration 63280/ 173500 | consumed samples: 16199680 | consumed tokens: 33176944640 | elapsed time per iteration (s): 0.79 | learning rate: 
1.488E-04 | global batch size: 256 | lm loss: 2.034417E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.758 | TFLOPs: 19.65 | 31: iteration 63290/ 173500 | consumed samples: 16202240 | consumed tokens: 33182187520 | elapsed time per iteration (s): 0.81 | learning rate: 1.487E-04 | global batch size: 256 | lm loss: 2.024251E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.761 | TFLOPs: 19.16 | 31: iteration 63300/ 173500 | consumed samples: 16204800 | consumed tokens: 33187430400 | elapsed time per iteration (s): 0.77 | learning rate: 1.487E-04 | global batch size: 256 | lm loss: 2.028847E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.279 | TFLOPs: 20.10 | 31: iteration 63310/ 173500 | consumed samples: 16207360 | consumed tokens: 33192673280 | elapsed time per iteration (s): 0.85 | learning rate: 1.487E-04 | global batch size: 256 | lm loss: 2.046046E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.717 | TFLOPs: 18.19 | 31: iteration 63320/ 173500 | consumed samples: 16209920 | consumed tokens: 33197916160 | elapsed time per iteration (s): 0.76 | learning rate: 1.487E-04 | global batch size: 256 | lm loss: 2.034855E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.094 | TFLOPs: 20.39 | 31: iteration 63330/ 173500 | consumed samples: 16212480 | consumed tokens: 33203159040 | elapsed time per iteration (s): 0.80 | learning rate: 1.487E-04 | global batch size: 256 | lm loss: 2.038720E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.115 | TFLOPs: 19.43 | 31: iteration 63340/ 173500 | 
consumed samples: 16215040 | consumed tokens: 33208401920 | elapsed time per iteration (s): 0.75 | learning rate: 1.487E-04 | global batch size: 256 | lm loss: 2.017524E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.212 | TFLOPs: 20.64 | 31: iteration 63350/ 173500 | consumed samples: 16217600 | consumed tokens: 33213644800 | elapsed time per iteration (s): 0.72 | learning rate: 1.486E-04 | global batch size: 256 | lm loss: 2.045383E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.935 | TFLOPs: 21.41 | 31: iteration 63360/ 173500 | consumed samples: 16220160 | consumed tokens: 33218887680 | elapsed time per iteration (s): 0.76 | learning rate: 1.486E-04 | global batch size: 256 | lm loss: 2.043579E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.421 | TFLOPs: 20.47 | 31: iteration 63370/ 173500 | consumed samples: 16222720 | consumed tokens: 33224130560 | elapsed time per iteration (s): 0.75 | learning rate: 1.486E-04 | global batch size: 256 | lm loss: 2.047680E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.436 | TFLOPs: 20.60 | 31: iteration 63380/ 173500 | consumed samples: 16225280 | consumed tokens: 33229373440 | elapsed time per iteration (s): 0.75 | learning rate: 1.486E-04 | global batch size: 256 | lm loss: 2.016422E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.953 | TFLOPs: 20.63 | 31: iteration 63390/ 173500 | consumed samples: 16227840 | consumed tokens: 33234616320 | elapsed time per iteration (s): 0.77 | learning rate: 1.486E-04 | global batch size: 256 | lm loss: 2.037239E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 330.972 | TFLOPs: 20.02 | 31: iteration 63400/ 173500 | consumed samples: 16230400 | consumed tokens: 33239859200 | elapsed time per iteration (s): 0.83 | learning rate: 1.486E-04 | global batch size: 256 | lm loss: 2.045483E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.098 | TFLOPs: 18.64 | 31: iteration 63410/ 173500 | consumed samples: 16232960 | consumed tokens: 33245102080 | elapsed time per iteration (s): 0.73 | learning rate: 1.486E-04 | global batch size: 256 | lm loss: 2.032152E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.246 | TFLOPs: 21.13 | 31: iteration 63420/ 173500 | consumed samples: 16235520 | consumed tokens: 33250344960 | elapsed time per iteration (s): 0.82 | learning rate: 1.485E-04 | global batch size: 256 | lm loss: 2.044346E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.098 | TFLOPs: 19.00 | 31: iteration 63430/ 173500 | consumed samples: 16238080 | consumed tokens: 33255587840 | elapsed time per iteration (s): 0.75 | learning rate: 1.485E-04 | global batch size: 256 | lm loss: 2.045760E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.427 | TFLOPs: 20.72 | 31: iteration 63440/ 173500 | consumed samples: 16240640 | consumed tokens: 33260830720 | elapsed time per iteration (s): 0.79 | learning rate: 1.485E-04 | global batch size: 256 | lm loss: 2.048582E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.136 | TFLOPs: 19.49 | 31: iteration 63450/ 173500 | consumed samples: 16243200 | consumed tokens: 33266073600 | elapsed time per iteration (s): 0.78 | learning rate: 1.485E-04 | global batch 
size: 256 | lm loss: 2.017931E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.385 | TFLOPs: 19.81 | 31: iteration 63460/ 173500 | consumed samples: 16245760 | consumed tokens: 33271316480 | elapsed time per iteration (s): 0.78 | learning rate: 1.485E-04 | global batch size: 256 | lm loss: 2.027859E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.748 | TFLOPs: 19.77 | 31: iteration 63470/ 173500 | consumed samples: 16248320 | consumed tokens: 33276559360 | elapsed time per iteration (s): 0.82 | learning rate: 1.485E-04 | global batch size: 256 | lm loss: 2.048792E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.109 | TFLOPs: 18.82 | 31: iteration 63480/ 173500 | consumed samples: 16250880 | consumed tokens: 33281802240 | elapsed time per iteration (s): 0.80 | learning rate: 1.485E-04 | global batch size: 256 | lm loss: 2.028563E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.370 | TFLOPs: 19.32 | 31: iteration 63490/ 173500 | consumed samples: 16253440 | consumed tokens: 33287045120 | elapsed time per iteration (s): 0.85 | learning rate: 1.484E-04 | global batch size: 256 | lm loss: 2.049352E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.046 | TFLOPs: 18.21 | 31: iteration 63500/ 173500 | consumed samples: 16256000 | consumed tokens: 33292288000 | elapsed time per iteration (s): 0.95 | learning rate: 1.484E-04 | global batch size: 256 | lm loss: 2.051458E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 268.214 | TFLOPs: 16.23 | 31: iteration 63510/ 173500 | consumed samples: 16258560 | 
consumed tokens: 33297530880 | elapsed time per iteration (s): 0.78 | learning rate: 1.484E-04 | global batch size: 256 | lm loss: 2.045657E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.057 | TFLOPs: 19.79 | 31: iteration 63520/ 173500 | consumed samples: 16261120 | consumed tokens: 33302773760 | elapsed time per iteration (s): 0.74 | learning rate: 1.484E-04 | global batch size: 256 | lm loss: 2.041658E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.771 | TFLOPs: 20.80 | 31: iteration 63530/ 173500 | consumed samples: 16263680 | consumed tokens: 33308016640 | elapsed time per iteration (s): 0.79 | learning rate: 1.484E-04 | global batch size: 256 | lm loss: 2.036532E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.758 | TFLOPs: 19.65 | 31: iteration 63540/ 173500 | consumed samples: 16266240 | consumed tokens: 33313259520 | elapsed time per iteration (s): 0.91 | learning rate: 1.484E-04 | global batch size: 256 | lm loss: 2.021217E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 280.687 | TFLOPs: 16.98 | 31: iteration 63550/ 173500 | consumed samples: 16268800 | consumed tokens: 33318502400 | elapsed time per iteration (s): 0.74 | learning rate: 1.484E-04 | global batch size: 256 | lm loss: 2.054379E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.533 | TFLOPs: 20.84 | 31: iteration 63560/ 173500 | consumed samples: 16271360 | consumed tokens: 33323745280 | elapsed time per iteration (s): 0.76 | learning rate: 1.483E-04 | global batch size: 256 | lm loss: 2.062789E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 336.508 | TFLOPs: 20.36 | 31: iteration 63570/ 173500 | consumed samples: 16273920 | consumed tokens: 33328988160 | elapsed time per iteration (s): 0.75 | learning rate: 1.483E-04 | global batch size: 256 | lm loss: 2.025730E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.488 | TFLOPs: 20.66 | 31: iteration 63580/ 173500 | consumed samples: 16276480 | consumed tokens: 33334231040 | elapsed time per iteration (s): 0.84 | learning rate: 1.483E-04 | global batch size: 256 | lm loss: 2.055940E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.859 | TFLOPs: 18.38 | 31: iteration 63590/ 173500 | consumed samples: 16279040 | consumed tokens: 33339473920 | elapsed time per iteration (s): 0.82 | learning rate: 1.483E-04 | global batch size: 256 | lm loss: 2.043814E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.041 | TFLOPs: 18.88 | 31: iteration 63600/ 173500 | consumed samples: 16281600 | consumed tokens: 33344716800 | elapsed time per iteration (s): 0.77 | learning rate: 1.483E-04 | global batch size: 256 | lm loss: 2.025876E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.380 | TFLOPs: 20.05 | 31: iteration 63610/ 173500 | consumed samples: 16284160 | consumed tokens: 33349959680 | elapsed time per iteration (s): 0.75 | learning rate: 1.483E-04 | global batch size: 256 | lm loss: 2.046527E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.649 | TFLOPs: 20.61 | 31: iteration 63620/ 173500 | consumed samples: 16286720 | consumed tokens: 33355202560 | elapsed time per iteration (s): 0.75 | learning rate: 1.482E-04 | global batch size: 256 | lm loss: 
2.079985E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.638 | TFLOPs: 20.73 | 31: iteration 63630/ 173500 | consumed samples: 16289280 | consumed tokens: 33360445440 | elapsed time per iteration (s): 0.76 | learning rate: 1.482E-04 | global batch size: 256 | lm loss: 2.052076E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.442 | TFLOPs: 20.41 | 31: iteration 63640/ 173500 | consumed samples: 16291840 | consumed tokens: 33365688320 | elapsed time per iteration (s): 0.79 | learning rate: 1.482E-04 | global batch size: 256 | lm loss: 2.029576E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.286 | TFLOPs: 19.62 | 31: iteration 63650/ 173500 | consumed samples: 16294400 | consumed tokens: 33370931200 | elapsed time per iteration (s): 0.83 | learning rate: 1.482E-04 | global batch size: 256 | lm loss: 2.034523E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.230 | TFLOPs: 18.65 | 31: iteration 63660/ 173500 | consumed samples: 16296960 | consumed tokens: 33376174080 | elapsed time per iteration (s): 0.81 | learning rate: 1.482E-04 | global batch size: 256 | lm loss: 2.024386E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.993 | TFLOPs: 19.18 | 31: iteration 63670/ 173500 | consumed samples: 16299520 | consumed tokens: 33381416960 | elapsed time per iteration (s): 0.76 | learning rate: 1.482E-04 | global batch size: 256 | lm loss: 2.057567E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.926 | TFLOPs: 20.38 | 31: iteration 63680/ 173500 | consumed samples: 16302080 | consumed tokens: 
33386659840 | elapsed time per iteration (s): 0.77 | learning rate: 1.482E-04 | global batch size: 256 | lm loss: 1.995200E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.164 | TFLOPs: 20.10 | 31: iteration 63690/ 173500 | consumed samples: 16304640 | consumed tokens: 33391902720 | elapsed time per iteration (s): 0.78 | learning rate: 1.481E-04 | global batch size: 256 | lm loss: 2.028096E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.137 | TFLOPs: 19.73 | 31: iteration 63700/ 173500 | consumed samples: 16307200 | consumed tokens: 33397145600 | elapsed time per iteration (s): 0.78 | learning rate: 1.481E-04 | global batch size: 256 | lm loss: 2.021747E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.250 | TFLOPs: 19.86 | 31: iteration 63710/ 173500 | consumed samples: 16309760 | consumed tokens: 33402388480 | elapsed time per iteration (s): 0.78 | learning rate: 1.481E-04 | global batch size: 256 | lm loss: 2.053378E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.531 | TFLOPs: 19.94 | 31: iteration 63720/ 173500 | consumed samples: 16312320 | consumed tokens: 33407631360 | elapsed time per iteration (s): 0.77 | learning rate: 1.481E-04 | global batch size: 256 | lm loss: 2.029711E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.767 | TFLOPs: 20.19 | 31: iteration 63730/ 173500 | consumed samples: 16314880 | consumed tokens: 33412874240 | elapsed time per iteration (s): 0.75 | learning rate: 1.481E-04 | global batch size: 256 | lm loss: 2.030642E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 339.640 | TFLOPs: 20.55 | 31: iteration 63740/ 173500 | consumed samples: 16317440 | consumed tokens: 33418117120 | elapsed time per iteration (s): 0.75 | learning rate: 1.481E-04 | global batch size: 256 | lm loss: 2.031561E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.166 | TFLOPs: 20.58 | 31: iteration 63750/ 173500 | consumed samples: 16320000 | consumed tokens: 33423360000 | elapsed time per iteration (s): 0.77 | learning rate: 1.481E-04 | global batch size: 256 | lm loss: 2.065619E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.873 | TFLOPs: 20.02 | 31: iteration 63760/ 173500 | consumed samples: 16322560 | consumed tokens: 33428602880 | elapsed time per iteration (s): 0.77 | learning rate: 1.480E-04 | global batch size: 256 | lm loss: 2.027292E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.424 | TFLOPs: 20.11 | 31: iteration 63770/ 173500 | consumed samples: 16325120 | consumed tokens: 33433845760 | elapsed time per iteration (s): 0.76 | learning rate: 1.480E-04 | global batch size: 256 | lm loss: 2.022945E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.366 | TFLOPs: 20.41 | 31: iteration 63780/ 173500 | consumed samples: 16327680 | consumed tokens: 33439088640 | elapsed time per iteration (s): 0.74 | learning rate: 1.480E-04 | global batch size: 256 | lm loss: 2.048559E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.702 | TFLOPs: 20.79 | 31: iteration 63790/ 173500 | consumed samples: 16330240 | consumed tokens: 33444331520 | elapsed time per iteration (s): 0.81 | learning rate: 1.480E-04 | global batch size: 256 | lm loss: 2.051464E+00 | grad 
norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.850 | TFLOPs: 19.23 | 31: iteration 63800/ 173500 | consumed samples: 16332800 | consumed tokens: 33449574400 | elapsed time per iteration (s): 0.76 | learning rate: 1.480E-04 | global batch size: 256 | lm loss: 2.054402E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.526 | TFLOPs: 20.42 | 31: iteration 63810/ 173500 | consumed samples: 16335360 | consumed tokens: 33454817280 | elapsed time per iteration (s): 0.76 | learning rate: 1.480E-04 | global batch size: 256 | lm loss: 2.037232E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.953 | TFLOPs: 20.38 | 31: iteration 63820/ 173500 | consumed samples: 16337920 | consumed tokens: 33460060160 | elapsed time per iteration (s): 0.74 | learning rate: 1.479E-04 | global batch size: 256 | lm loss: 2.017155E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.921 | TFLOPs: 20.99 | 31: iteration 63830/ 173500 | consumed samples: 16340480 | consumed tokens: 33465303040 | elapsed time per iteration (s): 0.76 | learning rate: 1.479E-04 | global batch size: 256 | lm loss: 2.042365E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.152 | TFLOPs: 20.46 | 31: iteration 63840/ 173500 | consumed samples: 16343040 | consumed tokens: 33470545920 | elapsed time per iteration (s): 0.77 | learning rate: 1.479E-04 | global batch size: 256 | lm loss: 2.056940E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.439 | TFLOPs: 20.17 | 31: iteration 63850/ 173500 | consumed samples: 16345600 | consumed tokens: 33475788800 | elapsed time 
per iteration (s): 0.77 | learning rate: 1.479E-04 | global batch size: 256 | lm loss: 2.044522E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.502 | TFLOPs: 19.99 | 31: iteration 63860/ 173500 | consumed samples: 16348160 | consumed tokens: 33481031680 | elapsed time per iteration (s): 0.78 | learning rate: 1.479E-04 | global batch size: 256 | lm loss: 2.027947E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.807 | TFLOPs: 19.95 | 31: iteration 63870/ 173500 | consumed samples: 16350720 | consumed tokens: 33486274560 | elapsed time per iteration (s): 0.75 | learning rate: 1.479E-04 | global batch size: 256 | lm loss: 2.043854E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.510 | TFLOPs: 20.78 | 31: iteration 63880/ 173500 | consumed samples: 16353280 | consumed tokens: 33491517440 | elapsed time per iteration (s): 0.76 | learning rate: 1.479E-04 | global batch size: 256 | lm loss: 2.039228E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.200 | TFLOPs: 20.28 | 31: iteration 63890/ 173500 | consumed samples: 16355840 | consumed tokens: 33496760320 | elapsed time per iteration (s): 0.77 | learning rate: 1.478E-04 | global batch size: 256 | lm loss: 2.028636E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.311 | TFLOPs: 20.22 | 31: iteration 63900/ 173500 | consumed samples: 16358400 | consumed tokens: 33502003200 | elapsed time per iteration (s): 0.83 | learning rate: 1.478E-04 | global batch size: 256 | lm loss: 2.039594E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.360 | TFLOPs: 
18.72 | 31: iteration 63910/ 173500 | consumed samples: 16360960 | consumed tokens: 33507246080 | elapsed time per iteration (s): 0.74 | learning rate: 1.478E-04 | global batch size: 256 | lm loss: 2.042113E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.742 | TFLOPs: 20.92 | 31: iteration 63920/ 173500 | consumed samples: 16363520 | consumed tokens: 33512488960 | elapsed time per iteration (s): 0.79 | learning rate: 1.478E-04 | global batch size: 256 | lm loss: 2.037706E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.830 | TFLOPs: 19.71 | 31: iteration 63930/ 173500 | consumed samples: 16366080 | consumed tokens: 33517731840 | elapsed time per iteration (s): 0.78 | learning rate: 1.478E-04 | global batch size: 256 | lm loss: 2.027359E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.162 | TFLOPs: 19.85 | 31: iteration 63940/ 173500 | consumed samples: 16368640 | consumed tokens: 33522974720 | elapsed time per iteration (s): 0.76 | learning rate: 1.478E-04 | global batch size: 256 | lm loss: 2.020674E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.600 | TFLOPs: 20.42 | 31: iteration 63950/ 173500 | consumed samples: 16371200 | consumed tokens: 33528217600 | elapsed time per iteration (s): 0.78 | learning rate: 1.478E-04 | global batch size: 256 | lm loss: 2.057470E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.092 | TFLOPs: 19.97 | 31: iteration 63960/ 173500 | consumed samples: 16373760 | consumed tokens: 33533460480 | elapsed time per iteration (s): 0.82 | learning rate: 1.477E-04 | global batch size: 256 | lm loss: 2.018530E+00 | grad norm: 0.126 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.295 | TFLOPs: 18.83 |
31: iteration 63970/ 173500 | consumed samples: 16376320 | consumed tokens: 33538703360 | elapsed time per iteration (s): 0.86 | learning rate: 1.477E-04 | global batch size: 256 | lm loss: 2.020483E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.346 | TFLOPs: 18.05 |
31: iteration 63980/ 173500 | consumed samples: 16378880 | consumed tokens: 33543946240 | elapsed time per iteration (s): 0.75 | learning rate: 1.477E-04 | global batch size: 256 | lm loss: 2.019568E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.795 | TFLOPs: 20.62 |
31: iteration 63990/ 173500 | consumed samples: 16381440 | consumed tokens: 33549189120 | elapsed time per iteration (s): 0.78 | learning rate: 1.477E-04 | global batch size: 256 | lm loss: 2.052176E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.651 | TFLOPs: 19.76 |
0: [2022-11-26 08:27:28,580] [INFO] [logging.py:68:log_dist] [Rank 0] step=64000, skipped=0, lr=[0.0001476794025098283, 0.0001476794025098283, 0.0001476794025098283], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
31: iteration 64000/ 173500 | consumed samples: 16384000 | consumed tokens: 33554432000 | elapsed time per iteration (s): 0.72 | learning rate: 1.477E-04 | global batch size: 256 | lm loss: 2.039564E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.942 | TFLOPs: 21.47 |
0: steps: 64000 loss: 2.0808 iter time (s): 0.805 samples/sec: 318.169
31: -------------------------------------------------------------------------------------------
31: valid loss at iteration 64000 | lm loss value: 2.108692E+00 | lm loss PPL: 8.237457E+00 |
31: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 64000 to checkpoints_1b1long
0: [2022-11-26 08:27:28,831] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step64000 is begin to save!
0: [2022-11-26 08:27:28,841] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_01-model_00-model_states.pt...
0: [2022-11-26 08:27:29,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_01-model_00-model_states.pt.
0: [2022-11-26 08:27:29,069] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_03-model_00-model_states.pt...
0: [2022-11-26 08:27:29,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_03-model_00-model_states.pt.
0: [2022-11-26 08:27:29,154] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_04-model_00-model_states.pt...
0: [2022-11-26 08:27:29,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_04-model_00-model_states.pt.
0: [2022-11-26 08:27:29,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_05-model_00-model_states.pt...
0: [2022-11-26 08:27:29,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_05-model_00-model_states.pt.
0: [2022-11-26 08:27:29,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_06-model_00-model_states.pt...
0: [2022-11-26 08:27:29,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_06-model_00-model_states.pt.
0: [2022-11-26 08:27:29,390] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_07-model_00-model_states.pt... 0: [2022-11-26 08:27:29,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_07-model_00-model_states.pt. 0: [2022-11-26 08:27:29,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_08-model_00-model_states.pt... 0: [2022-11-26 08:27:29,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_08-model_00-model_states.pt. 0: [2022-11-26 08:27:29,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_09-model_00-model_states.pt... 0: [2022-11-26 08:27:29,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_09-model_00-model_states.pt. 0: [2022-11-26 08:27:29,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_10-model_00-model_states.pt... 0: [2022-11-26 08:27:29,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_10-model_00-model_states.pt. 0: [2022-11-26 08:27:29,710] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_11-model_00-model_states.pt... 0: [2022-11-26 08:27:29,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_11-model_00-model_states.pt. 0: [2022-11-26 08:27:29,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_12-model_00-model_states.pt... 0: [2022-11-26 08:27:29,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_12-model_00-model_states.pt. 
0: [2022-11-26 08:27:29,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_13-model_00-model_states.pt... 0: [2022-11-26 08:27:29,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_13-model_00-model_states.pt. 0: [2022-11-26 08:27:29,937] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_14-model_00-model_states.pt... 0: [2022-11-26 08:27:30,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_14-model_00-model_states.pt. 0: [2022-11-26 08:27:30,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_15-model_00-model_states.pt... 0: [2022-11-26 08:27:30,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_15-model_00-model_states.pt. 0: [2022-11-26 08:27:30,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_16-model_00-model_states.pt... 0: [2022-11-26 08:27:30,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_16-model_00-model_states.pt. 0: [2022-11-26 08:27:30,173] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_17-model_00-model_states.pt... 0: [2022-11-26 08:27:30,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_17-model_00-model_states.pt. 0: [2022-11-26 08:27:30,248] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_18-model_00-model_states.pt... 0: [2022-11-26 08:27:30,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_18-model_00-model_states.pt. 
0: [2022-11-26 08:27:30,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_19-model_00-model_states.pt... 0: [2022-11-26 08:27:30,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_19-model_00-model_states.pt. 0: [2022-11-26 08:27:30,403] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_20-model_00-model_states.pt... 0: [2022-11-26 08:27:30,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_20-model_00-model_states.pt. 0: [2022-11-26 08:27:30,481] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_21-model_00-model_states.pt... 0: [2022-11-26 08:27:30,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_21-model_00-model_states.pt. 0: [2022-11-26 08:27:30,559] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_22-model_00-model_states.pt... 0: [2022-11-26 08:27:30,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_22-model_00-model_states.pt. 0: [2022-11-26 08:27:30,637] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_23-model_00-model_states.pt... 0: [2022-11-26 08:27:30,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_23-model_00-model_states.pt. 0: [2022-11-26 08:27:30,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_24-model_00-model_states.pt... 0: [2022-11-26 08:27:30,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_24-model_00-model_states.pt. 
0: [2022-11-26 08:27:30,790] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_25-model_00-model_states.pt... 0: [2022-11-26 08:27:30,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_25-model_00-model_states.pt. 0: [2022-11-26 08:27:30,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_26-model_00-model_states.pt... 0: [2022-11-26 08:27:30,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_26-model_00-model_states.pt. 0: [2022-11-26 08:27:30,946] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_27-model_00-model_states.pt... 0: [2022-11-26 08:27:31,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_27-model_00-model_states.pt. 0: [2022-11-26 08:27:31,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_28-model_00-model_states.pt... 0: [2022-11-26 08:27:31,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_28-model_00-model_states.pt. 0: [2022-11-26 08:27:31,101] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/layer_30-model_00-model_states.pt... 0: [2022-11-26 08:27:31,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/layer_30-model_00-model_states.pt. 0: [2022-11-26 08:27:31,103] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step64000/mp_rank_00_model_states.pt 0: [2022-11-26 08:27:31,103] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/mp_rank_00_model_states.pt... 
0: [2022-11-26 08:27:31,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/mp_rank_00_model_states.pt. 0: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:27:31,175] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 
10: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:27:31,175] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
12: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:27:31,175] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:27:31,175] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 
24: [2022-11-26 08:27:31,175] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:27:31,175] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 
29: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 
18: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:27:31,175] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:27:31,175] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
5: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
10: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
3: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:27:31,175] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:27:31,175] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 
23: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 
24: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 
22: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 
26: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 
7: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 
8: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 
3: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 
24: [2022-11-26 08:27:31,175] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 
18: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
5: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 
13: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 
14: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 
27: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:27:31,175] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
3: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
25: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:27:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:27:31,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:27:31,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 08:27:31,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 19: [2022-11-26 08:27:31,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:27:31,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 13: [2022-11-26 08:27:31,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:27:31,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
13: [2022-11-26 08:27:31,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 08:27:31,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 26: [2022-11-26 08:27:31,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:27:31,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 08:27:31,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 24: [2022-11-26 08:27:31,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:27:31,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:27:31,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 08:27:31,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 0: [2022-11-26 08:27:31,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 08:27:31,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 16: [2022-11-26 08:27:31,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
16: [2022-11-26 08:27:31,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 08:27:31,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 2: [2022-11-26 08:27:31,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:27:31,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 08:27:31,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 25: [2022-11-26 08:27:31,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:27:31,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 08:27:31,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 3: [2022-11-26 08:27:31,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:27:31,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 08:27:31,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 13: [2022-11-26 08:27:31,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
28: [2022-11-26 08:27:31,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:27:31,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 30: [2022-11-26 08:27:31,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:27:31,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 30: [2022-11-26 08:27:31,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 08:27:31,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 30: [2022-11-26 08:27:31,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:27:31,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 08:27:31,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 2: [2022-11-26 08:27:31,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-26 08:27:31,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 18: [2022-11-26 08:27:31,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:27:31,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 1: [2022-11-26 08:27:31,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:27:31,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 27: [2022-11-26 08:27:31,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:27:31,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 27: [2022-11-26 08:27:31,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 1: [2022-11-26 08:27:31,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 08:27:31,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 27: [2022-11-26 08:27:31,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 22: [2022-11-26 08:27:31,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
22: [2022-11-26 08:27:31,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 08:27:31,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 20: [2022-11-26 08:27:31,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:27:31,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 08:27:31,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 24: [2022-11-26 08:27:31,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:27:31,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:27:31,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 08:27:31,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 24: [2022-11-26 08:27:31,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 29: [2022-11-26 08:27:31,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-26 08:27:31,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 08:27:31,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 24: [2022-11-26 08:27:31,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 11: [2022-11-26 08:27:31,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:27:31,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 08:27:31,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 31: [2022-11-26 08:27:31,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:27:31,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 08:27:31,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 4: [2022-11-26 08:27:31,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:27:31,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
4: [2022-11-26 08:27:31,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 08:27:31,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 11: [2022-11-26 08:27:31,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 08:27:31,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 27: [2022-11-26 08:27:31,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:27:31,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:27:31,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 12: [2022-11-26 08:27:31,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:27:31,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 08:27:31,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 08:27:31,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 12: [2022-11-26 08:27:31,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
27: [2022-11-26 08:27:31,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 6: [2022-11-26 08:27:31,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:27:31,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 08:27:31,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 9: [2022-11-26 08:27:31,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:27:31,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 08:27:31,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 4: [2022-11-26 08:27:31,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:27:31,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 08:27:31,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 16: [2022-11-26 08:27:31,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
16: [2022-11-26 08:27:31,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 08:27:31,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 22: [2022-11-26 08:27:31,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:27:31,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 08:27:31,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 6: [2022-11-26 08:27:31,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:27:31,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 08:27:31,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 10: [2022-11-26 08:27:31,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:27:31,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 08:27:31,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 10: [2022-11-26 08:27:31,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-26 08:27:31,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 08:27:31,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 30: [2022-11-26 08:27:31,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:27:31,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 08:27:31,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 26: [2022-11-26 08:27:31,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:27:31,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 16: [2022-11-26 08:27:31,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:27:31,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 16: [2022-11-26 08:27:31,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 08:27:31,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 1: [2022-11-26 08:27:31,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-26 08:27:31,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 08:27:31,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 18: [2022-11-26 08:27:31,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:27:31,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 08:27:31,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 1: [2022-11-26 08:27:31,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:27:31,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 3: [2022-11-26 08:27:31,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:27:31,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:27:31,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
3: [2022-11-26 08:27:31,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 19: [2022-11-26 08:27:31,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 3: [2022-11-26 08:27:31,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 19: [2022-11-26 08:27:31,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 23: [2022-11-26 08:27:31,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:27:31,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 08:27:31,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 3: [2022-11-26 08:27:31,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:27:31,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 08:27:31,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 29: [2022-11-26 08:27:31,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-26 08:27:31,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 08:27:31,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 28: [2022-11-26 08:27:31,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 08:27:31,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 28: [2022-11-26 08:27:31,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 08:27:31,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 0: [2022-11-26 08:27:31,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 28: [2022-11-26 08:27:31,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 28: [2022-11-26 08:27:31,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:27:31,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 13: [2022-11-26 08:27:31,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
28: [2022-11-26 08:27:31,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 08:27:31,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 0: [2022-11-26 08:27:31,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 13: [2022-11-26 08:27:31,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 08:27:31,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 20: [2022-11-26 08:27:31,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:27:31,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 08:27:31,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 22: [2022-11-26 08:27:31,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:27:31,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 08:27:31,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 12: [2022-11-26 08:27:31,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 08:27:31,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 08:27:31,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 15: [2022-11-26 08:27:31,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:27:31,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 08:27:31,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 31: [2022-11-26 08:27:31,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:27:31,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 08:27:31,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 9: [2022-11-26 08:27:31,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:27:31,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:27:31,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 08:27:31,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
24: [2022-11-26 08:27:31,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 08:27:31,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 30: [2022-11-26 08:27:31,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:27:31,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 08:27:31,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 20: [2022-11-26 08:27:31,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:27:31,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:27:31,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 08:27:31,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 25: [2022-11-26 08:27:31,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 
27: [2022-11-26 08:27:31,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 28: [2022-11-26 08:27:31,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:27:31,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:27:31,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 26: [2022-11-26 08:27:31,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 08:27:31,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 25: [2022-11-26 08:27:31,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 08:27:31,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:27:31,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 25: [2022-11-26 08:27:31,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 08:27:31,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 7: [2022-11-26 08:27:31,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
18: [2022-11-26 08:27:31,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:27:31,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 08:27:31,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 18: [2022-11-26 08:27:31,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 08:27:31,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 25: [2022-11-26 08:27:31,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:27:31,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:27:31,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 4: [2022-11-26 08:27:31,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 08:27:31,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 25: [2022-11-26 08:27:31,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 6: [2022-11-26 08:27:31,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
22: [2022-11-26 08:27:31,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:27:31,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 22: [2022-11-26 08:27:31,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 6: [2022-11-26 08:27:31,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 22: [2022-11-26 08:27:31,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 19: [2022-11-26 08:27:31,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:27:31,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 08:27:31,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 16: [2022-11-26 08:27:31,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:27:31,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 08:27:31,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 12: [2022-11-26 08:27:31,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-26 08:27:31,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 08:27:31,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 13: [2022-11-26 08:27:31,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:27:31,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 20: [2022-11-26 08:27:31,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:27:31,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:27:31,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 20: [2022-11-26 08:27:31,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 5: [2022-11-26 08:27:31,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 20: [2022-11-26 08:27:31,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 5: [2022-11-26 08:27:31,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 5: [2022-11-26 08:27:31,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 08:27:31,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 08:27:31,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 7: [2022-11-26 08:27:31,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:27:31,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 23: [2022-11-26 08:27:31,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:27:31,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 08:27:31,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 7: [2022-11-26 08:27:31,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 11: [2022-11-26 08:27:31,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:27:31,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 08:27:31,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 2: [2022-11-26 08:27:31,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
27: [2022-11-26 08:27:31,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:27:31,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:27:31,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 08:27:31,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 23: [2022-11-26 08:27:31,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 08:27:31,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 27: [2022-11-26 08:27:31,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 08:27:31,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 10: [2022-11-26 08:27:31,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:27:31,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 08:27:31,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 11: [2022-11-26 08:27:31,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
24: [2022-11-26 08:27:31,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:27:31,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 08:27:31,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 24: [2022-11-26 08:27:31,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 1: [2022-11-26 08:27:31,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:27:31,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 1: [2022-11-26 08:27:31,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 08:27:31,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 10: [2022-11-26 08:27:31,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:27:31,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 08:27:31,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 9: [2022-11-26 08:27:31,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 08:27:31,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 08:27:31,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 31: [2022-11-26 08:27:31,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:27:31,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 08:27:31,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 3: [2022-11-26 08:27:31,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:27:31,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 08:27:31,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 29: [2022-11-26 08:27:31,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:27:31,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 08:27:31,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 29: [2022-11-26 08:27:31,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
2: [2022-11-26 08:27:31,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:27:31,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 08:27:31,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 2: [2022-11-26 08:27:31,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 08:27:31,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 6: [2022-11-26 08:27:31,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:27:31,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 08:27:31,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 7: [2022-11-26 08:27:31,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:27:31,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 08:27:31,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 18: [2022-11-26 08:27:31,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-26 08:27:31,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 08:27:31,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 0: [2022-11-26 08:27:31,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:27:31,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:27:31,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 08:27:31,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 08:27:31,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 0: [2022-11-26 08:27:31,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 28: [2022-11-26 08:27:31,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 08:27:31,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 9: [2022-11-26 08:27:31,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-26 08:27:31,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 08:27:31,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 26: [2022-11-26 08:27:31,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:27:31,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 08:27:31,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 19: [2022-11-26 08:27:31,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:27:31,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 08:27:31,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 15: [2022-11-26 08:27:31,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:27:31,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
15: [2022-11-26 08:27:31,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 21: [2022-11-26 08:27:31,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:27:31,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:27:31,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:27:31,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 15: [2022-11-26 08:27:31,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 21: [2022-11-26 08:27:31,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:27:31,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 4: [2022-11-26 08:27:31,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
21: [2022-11-26 08:27:31,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 08:27:31,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 08:27:31,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 24: [2022-11-26 08:27:31,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:27:31,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 21: [2022-11-26 08:27:31,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 21: [2022-11-26 08:27:31,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 21: [2022-11-26 08:27:31,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 24: [2022-11-26 08:27:31,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 08:27:31,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 14: [2022-11-26 08:27:31,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:27:31,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 08:27:31,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:27:31,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:27:31,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 08:27:31,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 08:27:31,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 14: [2022-11-26 08:27:31,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 08:27:31,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 08:27:31,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 14: [2022-11-26 08:27:31,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 14: [2022-11-26 08:27:31,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 7: [2022-11-26 08:27:31,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 08:27:31,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 08:27:31,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 5: [2022-11-26 08:27:31,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:27:31,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 08:27:31,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 5: [2022-11-26 08:27:31,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:27:31,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 08:27:31,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 23: [2022-11-26 08:27:31,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:27:31,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 08:27:31,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 28: [2022-11-26 08:27:31,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
15: [2022-11-26 08:27:31,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:27:31,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 08:27:31,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 28: [2022-11-26 08:27:31,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 08:27:31,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 25: [2022-11-26 08:27:31,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:27:31,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:27:31,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 20: [2022-11-26 08:27:31,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:27:31,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
1: [2022-11-26 08:27:31,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 20: [2022-11-26 08:27:31,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 1: [2022-11-26 08:27:31,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 20: [2022-11-26 08:27:31,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 8: [2022-11-26 08:27:31,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:27:31,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 08:27:31,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 8: [2022-11-26 08:27:31,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:27:31,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:27:31,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 08:27:31,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 08:27:31,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
8: [2022-11-26 08:27:31,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 29: [2022-11-26 08:27:31,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:27:31,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 08:27:31,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 17: [2022-11-26 08:27:31,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:27:31,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 08:27:31,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 17: [2022-11-26 08:27:31,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:27:31,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 08:27:31,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 2: [2022-11-26 08:27:31,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 08:27:31,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 08:27:31,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 19: [2022-11-26 08:27:31,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:27:31,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 08:27:31,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 17: [2022-11-26 08:27:31,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:27:31,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 08:27:31,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 9: [2022-11-26 08:27:31,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:27:31,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 08:27:31,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 3: [2022-11-26 08:27:31,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 08:27:31,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 08:27:31,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 26: [2022-11-26 08:27:31,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:27:31,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 08:27:31,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 22: [2022-11-26 08:27:31,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:27:31,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 08:27:31,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 30: [2022-11-26 08:27:31,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:27:31,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 08:27:31,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 12: [2022-11-26 08:27:31,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 08:27:31,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 08:27:31,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 10: [2022-11-26 08:27:31,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:27:31,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 08:27:31,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 13: [2022-11-26 08:27:31,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:27:31,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 08:27:31,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 0: [2022-11-26 08:27:31,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:27:31,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:27:31,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 08:27:31,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
18: [2022-11-26 08:27:31,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:27:31,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 08:27:31,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 6: [2022-11-26 08:27:31,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:27:31,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 08:27:31,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 16: [2022-11-26 08:27:31,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:27:31,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 08:27:31,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 4: [2022-11-26 08:27:31,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:27:31,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 08:27:31,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
27: [2022-11-26 08:27:31,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:27:31,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 08:27:31,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 14: [2022-11-26 08:27:31,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:27:31,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 08:27:31,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 7: [2022-11-26 08:27:31,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:27:31,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 08:27:31,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 31: [2022-11-26 08:27:31,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:27:31,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 08:27:31,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
0: [2022-11-26 08:27:31,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 08:27:31,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 23: [2022-11-26 08:27:31,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:27:31,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 08:27:31,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 17: [2022-11-26 08:27:31,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:27:31,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 08:27:31,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 5: [2022-11-26 08:27:31,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:27:31,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 08:27:31,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 8: [2022-11-26 08:27:31,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 08:27:31,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 08:27:31,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 21: [2022-11-26 08:27:31,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:27:31,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 08:27:31,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 15: [2022-11-26 08:27:31,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:27:31,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 08:27:31,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 20: [2022-11-26 08:27:31,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:27:31,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 08:27:31,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 1: [2022-11-26 08:27:31,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 08:27:31,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 08:27:31,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 30: [2022-11-26 08:27:31,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:27:31,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 08:27:31,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 2: [2022-11-26 08:27:31,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:27:31,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 08:27:31,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 3: [2022-11-26 08:27:31,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:27:31,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 08:27:31,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 29: [2022-11-26 08:27:31,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-26 08:27:31,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 08:27:31,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 9: [2022-11-26 08:27:31,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:27:31,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 08:27:31,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 10: [2022-11-26 08:27:31,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:27:31,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:27:31,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 08:27:31,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 24: [2022-11-26 08:27:31,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 08:27:31,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 28: [2022-11-26 08:27:31,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 
28: [2022-11-26 08:27:31,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 08:27:31,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 22: [2022-11-26 08:27:31,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:27:31,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 08:27:31,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 27: [2022-11-26 08:27:31,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:27:31,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:27:31,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 21: [2022-11-26 08:27:31,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 27: [2022-11-26 08:27:31,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 21: [2022-11-26 08:27:31,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 14: [2022-11-26 08:27:31,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-26 08:27:31,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 08:27:31,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 18: [2022-11-26 08:27:31,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:27:31,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 08:27:31,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 0: [2022-11-26 08:27:31,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:27:31,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 08:27:31,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 31: [2022-11-26 08:27:31,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:27:31,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
31: [2022-11-26 08:27:31,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 7: [2022-11-26 08:27:31,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 31: [2022-11-26 08:27:31,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 19: [2022-11-26 08:27:31,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:27:31,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 19: [2022-11-26 08:27:31,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 08:27:31,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 6: [2022-11-26 08:27:31,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:27:31,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:27:31,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 08:27:31,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
12: [2022-11-26 08:27:31,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 08:27:31,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 24: [2022-11-26 08:27:31,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:27:31,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 08:27:31,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 13: [2022-11-26 08:27:31,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:27:31,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:27:31,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 13: [2022-11-26 08:27:31,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 3: [2022-11-26 08:27:31,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:27:31,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 13: [2022-11-26 08:27:31,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
3: [2022-11-26 08:27:31,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 08:27:31,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 4: [2022-11-26 08:27:31,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:27:31,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 08:27:31,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 22: [2022-11-26 08:27:31,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:27:31,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 08:27:31,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 15: [2022-11-26 08:27:31,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:27:31,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 08:27:31,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 17: [2022-11-26 08:27:31,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 
17: [2022-11-26 08:27:31,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 08:27:31,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 2: [2022-11-26 08:27:31,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:27:31,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 08:27:31,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 26: [2022-11-26 08:27:31,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:27:31,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:27:31,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 08:27:31,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 25: [2022-11-26 08:27:31,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 
26: [2022-11-26 08:27:31,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 16: [2022-11-26 08:27:31,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:27:31,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 26: [2022-11-26 08:27:31,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 16: [2022-11-26 08:27:31,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 25: [2022-11-26 08:27:31,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 08:27:31,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 10: [2022-11-26 08:27:31,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:27:31,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 08:27:31,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 20: [2022-11-26 08:27:31,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
20: [2022-11-26 08:27:31,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 08:27:31,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 30: [2022-11-26 08:27:31,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:27:31,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 08:27:31,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 1: [2022-11-26 08:27:31,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:27:31,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:27:31,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 29: [2022-11-26 08:27:31,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 1: [2022-11-26 08:27:31,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 29: [2022-11-26 08:27:31,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 8: [2022-11-26 08:27:31,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 08:27:31,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 08:27:31,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 19: [2022-11-26 08:27:31,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:27:31,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 08:27:31,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 12: [2022-11-26 08:27:31,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:27:31,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 08:27:31,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 7: [2022-11-26 08:27:31,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:27:31,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 08:27:31,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 31: [2022-11-26 08:27:31,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-26 08:27:31,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 08:27:31,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 9: [2022-11-26 08:27:31,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:27:31,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 08:27:31,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 21: [2022-11-26 08:27:31,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:27:31,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 08:27:31,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 28: [2022-11-26 08:27:31,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 08:27:31,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 08:27:31,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 14: [2022-11-26 08:27:31,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 08:27:31,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 08:27:31,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 27: [2022-11-26 08:27:31,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:27:31,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 08:27:31,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 13: [2022-11-26 08:27:31,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:27:31,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 08:27:31,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 4: [2022-11-26 08:27:31,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:27:31,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:27:31,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
4: [2022-11-26 08:27:31,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 16: [2022-11-26 08:27:31,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 4: [2022-11-26 08:27:31,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 18: [2022-11-26 08:27:31,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:27:31,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 16: [2022-11-26 08:27:31,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 1: [2022-11-26 08:27:31,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 18: [2022-11-26 08:27:31,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 08:27:31,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 11: [2022-11-26 08:27:31,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:27:31,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 08:27:31,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
11: [2022-11-26 08:27:31,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:27:31,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:27:31,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 23: [2022-11-26 08:27:31,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 11: [2022-11-26 08:27:31,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 23: [2022-11-26 08:27:31,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 0: [2022-11-26 08:27:31,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:27:31,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 08:27:31,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 28: [2022-11-26 08:27:31,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 08:27:31,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 08:27:31,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
20: [2022-11-26 08:27:31,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:27:31,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:27:31,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 27: [2022-11-26 08:27:31,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 20: [2022-11-26 08:27:31,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 24: [2022-11-26 08:27:31,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:27:31,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 22: [2022-11-26 08:27:31,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:27:31,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 08:27:31,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 22: [2022-11-26 08:27:31,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 08:27:31,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
10: [2022-11-26 08:27:31,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:27:31,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:27:31,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 08:27:31,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 14: [2022-11-26 08:27:31,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 08:27:31,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 6: [2022-11-26 08:27:31,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:27:31,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:27:31,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 9: [2022-11-26 08:27:31,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 6: [2022-11-26 08:27:31,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 9: [2022-11-26 08:27:31,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
6: [2022-11-26 08:27:31,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:27:31,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:27:31,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 16: [2022-11-26 08:27:31,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 6: [2022-11-26 08:27:31,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 16: [2022-11-26 08:27:31,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 30: [2022-11-26 08:27:31,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:27:31,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 08:27:31,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 18: [2022-11-26 08:27:31,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:27:31,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 08:27:31,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
2: [2022-11-26 08:27:31,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:27:31,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:27:31,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 08:27:31,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 3: [2022-11-26 08:27:31,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 08:27:31,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 31: [2022-11-26 08:27:31,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:27:31,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 08:27:31,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 26: [2022-11-26 08:27:31,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:27:31,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 08:27:31,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
29: [2022-11-26 08:27:31,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:27:31,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 08:27:31,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 11: [2022-11-26 08:27:31,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:27:31,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 08:27:31,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 7: [2022-11-26 08:27:31,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:27:31,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:27:31,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 19: [2022-11-26 08:27:31,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 7: [2022-11-26 08:27:31,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 19: [2022-11-26 08:27:31,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
23: [2022-11-26 08:27:31,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:27:31,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:27:31,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 08:27:31,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 08:27:31,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 23: [2022-11-26 08:27:31,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 0: [2022-11-26 08:27:31,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:27:31,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 08:27:31,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 5: [2022-11-26 08:27:31,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:27:31,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
5: [2022-11-26 08:27:31,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 08:27:31,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 4: [2022-11-26 08:27:31,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 08:27:31,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 21: [2022-11-26 08:27:31,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:27:31,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 08:27:31,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 8: [2022-11-26 08:27:31,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:27:31,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 08:27:31,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 13: [2022-11-26 08:27:31,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 08:27:31,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 08:27:31,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 12: [2022-11-26 08:27:31,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:27:31,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:27:31,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 08:27:31,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 25: [2022-11-26 08:27:31,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 08:27:31,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 25: [2022-11-26 08:27:31,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:27:31,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 08:27:31,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 15: [2022-11-26 08:27:31,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-26 08:27:31,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:27:31,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 08:27:31,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 08:27:31,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 15: [2022-11-26 08:27:31,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 17: [2022-11-26 08:27:31,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:27:31,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 08:27:31,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 5: [2022-11-26 08:27:31,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:27:31,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 08:27:31,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 8: [2022-11-26 08:27:31,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 08:27:31,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 08:27:31,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 8: [2022-11-26 08:27:31,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:27:31,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 08:27:31,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 17: [2022-11-26 08:27:31,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:27:31,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:27:31,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 08:27:31,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step64000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 08:27:31,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 17: [2022-11-26 08:27:31,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
0: successfully saved checkpoint at iteration 64000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2572.89 31: iteration 64010/ 173500 | consumed samples: 16386560 | consumed tokens: 33559674880 | elapsed time per iteration (s): 1.07 | learning rate: 1.477E-04 | global batch size: 256 | lm loss: 2.029506E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.262 | TFLOPs: 14.54 | 31: iteration 64020/ 173500 | consumed samples: 16389120 | consumed tokens: 33564917760 | elapsed time per iteration (s): 0.76 | learning rate: 1.476E-04 | global batch size: 256 | lm loss: 2.019900E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.729 | TFLOPs: 20.43 | 31: iteration 64030/ 173500 | consumed samples: 16391680 | consumed tokens: 33570160640 | elapsed time per iteration (s): 0.78 | learning rate: 1.476E-04 | global batch size: 256 | lm loss: 2.041031E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.861 | TFLOPs: 19.96 | 31: iteration 64040/ 173500 | consumed samples: 16394240 | consumed tokens: 33575403520 | elapsed time per iteration (s): 0.82 | learning rate: 1.476E-04 | global batch size: 256 | lm loss: 2.016902E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.726 | TFLOPs: 18.86 | 31: iteration 64050/ 173500 | consumed samples: 16396800 | consumed tokens: 33580646400 | elapsed time per iteration (s): 0.73 | learning rate: 1.476E-04 | global batch size: 256 | lm loss: 2.064372E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.184 | TFLOPs: 21.12 | 31: iteration 64060/ 173500 | consumed samples: 16399360 | consumed tokens: 33585889280 | elapsed time per iteration (s): 0.75 | 
learning rate: 1.476E-04 | global batch size: 256 | lm loss: 2.042733E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.615 | TFLOPs: 20.61 | 31: iteration 64070/ 173500 | consumed samples: 16401920 | consumed tokens: 33591132160 | elapsed time per iteration (s): 0.78 | learning rate: 1.476E-04 | global batch size: 256 | lm loss: 2.090346E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.203 | TFLOPs: 19.86 | 31: iteration 64080/ 173500 | consumed samples: 16404480 | consumed tokens: 33596375040 | elapsed time per iteration (s): 0.77 | learning rate: 1.476E-04 | global batch size: 256 | lm loss: 2.009341E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.520 | TFLOPs: 20.06 | 31: iteration 64090/ 173500 | consumed samples: 16407040 | consumed tokens: 33601617920 | elapsed time per iteration (s): 0.81 | learning rate: 1.475E-04 | global batch size: 256 | lm loss: 2.044959E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.967 | TFLOPs: 19.12 | 31: iteration 64100/ 173500 | consumed samples: 16409600 | consumed tokens: 33606860800 | elapsed time per iteration (s): 0.79 | learning rate: 1.475E-04 | global batch size: 256 | lm loss: 1.989510E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.064 | TFLOPs: 19.54 | 31: iteration 64110/ 173500 | consumed samples: 16412160 | consumed tokens: 33612103680 | elapsed time per iteration (s): 0.78 | learning rate: 1.475E-04 | global batch size: 256 | lm loss: 2.037190E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.125 | TFLOPs: 19.97 | 31: iteration 64120/ 
173500 | consumed samples: 16414720 | consumed tokens: 33617346560 | elapsed time per iteration (s): 0.75 | learning rate: 1.475E-04 | global batch size: 256 | lm loss: 2.028325E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.927 | TFLOPs: 20.63 | 31: iteration 64130/ 173500 | consumed samples: 16417280 | consumed tokens: 33622589440 | elapsed time per iteration (s): 0.84 | learning rate: 1.475E-04 | global batch size: 256 | lm loss: 2.026278E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.627 | TFLOPs: 18.43 | 31: iteration 64140/ 173500 | consumed samples: 16419840 | consumed tokens: 33627832320 | elapsed time per iteration (s): 0.85 | learning rate: 1.475E-04 | global batch size: 256 | lm loss: 2.026899E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.711 | TFLOPs: 18.31 | 31: iteration 64150/ 173500 | consumed samples: 16422400 | consumed tokens: 33633075200 | elapsed time per iteration (s): 0.79 | learning rate: 1.475E-04 | global batch size: 256 | lm loss: 2.041370E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.282 | TFLOPs: 19.62 | 31: iteration 64160/ 173500 | consumed samples: 16424960 | consumed tokens: 33638318080 | elapsed time per iteration (s): 0.81 | learning rate: 1.474E-04 | global batch size: 256 | lm loss: 2.027833E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.249 | TFLOPs: 19.19 | 31: iteration 64170/ 173500 | consumed samples: 16427520 | consumed tokens: 33643560960 | elapsed time per iteration (s): 0.92 | learning rate: 1.474E-04 | global batch size: 256 | lm loss: 2.034871E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 279.608 | TFLOPs: 16.92 | 31: iteration 64180/ 173500 | consumed samples: 16430080 | consumed tokens: 33648803840 | elapsed time per iteration (s): 0.85 | learning rate: 1.474E-04 | global batch size: 256 | lm loss: 2.036027E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.068 | TFLOPs: 18.15 | 31: iteration 64190/ 173500 | consumed samples: 16432640 | consumed tokens: 33654046720 | elapsed time per iteration (s): 0.85 | learning rate: 1.474E-04 | global batch size: 256 | lm loss: 2.025847E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.826 | TFLOPs: 18.14 | 31: iteration 64200/ 173500 | consumed samples: 16435200 | consumed tokens: 33659289600 | elapsed time per iteration (s): 0.84 | learning rate: 1.474E-04 | global batch size: 256 | lm loss: 2.042015E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.282 | TFLOPs: 18.35 | 31: iteration 64210/ 173500 | consumed samples: 16437760 | consumed tokens: 33664532480 | elapsed time per iteration (s): 0.80 | learning rate: 1.474E-04 | global batch size: 256 | lm loss: 2.054374E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.043 | TFLOPs: 19.36 | 31: iteration 64220/ 173500 | consumed samples: 16440320 | consumed tokens: 33669775360 | elapsed time per iteration (s): 0.78 | learning rate: 1.474E-04 | global batch size: 256 | lm loss: 2.060496E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.181 | TFLOPs: 19.79 | 31: iteration 64230/ 173500 | consumed samples: 16442880 | consumed tokens: 33675018240 | elapsed time per iteration (s): 0.78 | learning rate: 
1.473E-04 | global batch size: 256 | lm loss: 2.025128E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.009 | TFLOPs: 19.90 | 31: iteration 64240/ 173500 | consumed samples: 16445440 | consumed tokens: 33680261120 | elapsed time per iteration (s): 0.73 | learning rate: 1.473E-04 | global batch size: 256 | lm loss: 2.036592E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.384 | TFLOPs: 21.08 | 31: iteration 64250/ 173500 | consumed samples: 16448000 | consumed tokens: 33685504000 | elapsed time per iteration (s): 0.80 | learning rate: 1.473E-04 | global batch size: 256 | lm loss: 2.020625E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.805 | TFLOPs: 19.41 | 31: iteration 64260/ 173500 | consumed samples: 16450560 | consumed tokens: 33690746880 | elapsed time per iteration (s): 0.77 | learning rate: 1.473E-04 | global batch size: 256 | lm loss: 2.018164E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.128 | TFLOPs: 20.21 | 31: iteration 64270/ 173500 | consumed samples: 16453120 | consumed tokens: 33695989760 | elapsed time per iteration (s): 0.74 | learning rate: 1.473E-04 | global batch size: 256 | lm loss: 2.051155E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.610 | TFLOPs: 20.85 | 31: iteration 64280/ 173500 | consumed samples: 16455680 | consumed tokens: 33701232640 | elapsed time per iteration (s): 0.73 | learning rate: 1.473E-04 | global batch size: 256 | lm loss: 2.042810E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.579 | TFLOPs: 21.33 | 31: iteration 64290/ 173500 | 
consumed samples: 16458240 | consumed tokens: 33706475520 | elapsed time per iteration (s): 0.78 | learning rate: 1.472E-04 | global batch size: 256 | lm loss: 2.033094E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.568 | TFLOPs: 19.88 | 31: iteration 64300/ 173500 | consumed samples: 16460800 | consumed tokens: 33711718400 | elapsed time per iteration (s): 0.82 | learning rate: 1.472E-04 | global batch size: 256 | lm loss: 2.050219E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.050 | TFLOPs: 18.82 | 31: iteration 64310/ 173500 | consumed samples: 16463360 | consumed tokens: 33716961280 | elapsed time per iteration (s): 0.75 | learning rate: 1.472E-04 | global batch size: 256 | lm loss: 2.054347E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.239 | TFLOPs: 20.70 | 31: iteration 64320/ 173500 | consumed samples: 16465920 | consumed tokens: 33722204160 | elapsed time per iteration (s): 0.82 | learning rate: 1.472E-04 | global batch size: 256 | lm loss: 2.040005E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.658 | TFLOPs: 18.92 | 31: iteration 64330/ 173500 | consumed samples: 16468480 | consumed tokens: 33727447040 | elapsed time per iteration (s): 0.71 | learning rate: 1.472E-04 | global batch size: 256 | lm loss: 2.042275E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.902 | TFLOPs: 21.71 | 31: iteration 64340/ 173500 | consumed samples: 16471040 | consumed tokens: 33732689920 | elapsed time per iteration (s): 0.73 | learning rate: 1.472E-04 | global batch size: 256 | lm loss: 2.042874E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 348.485 | TFLOPs: 21.08 | 31: iteration 64350/ 173500 | consumed samples: 16473600 | consumed tokens: 33737932800 | elapsed time per iteration (s): 0.78 | learning rate: 1.472E-04 | global batch size: 256 | lm loss: 2.021724E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.761 | TFLOPs: 19.95 | 31: iteration 64360/ 173500 | consumed samples: 16476160 | consumed tokens: 33743175680 | elapsed time per iteration (s): 0.89 | learning rate: 1.471E-04 | global batch size: 256 | lm loss: 2.063011E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.083 | TFLOPs: 17.49 | 31: iteration 64370/ 173500 | consumed samples: 16478720 | consumed tokens: 33748418560 | elapsed time per iteration (s): 0.76 | learning rate: 1.471E-04 | global batch size: 256 | lm loss: 2.043577E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.725 | TFLOPs: 20.49 | 31: iteration 64380/ 173500 | consumed samples: 16481280 | consumed tokens: 33753661440 | elapsed time per iteration (s): 0.76 | learning rate: 1.471E-04 | global batch size: 256 | lm loss: 2.030514E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.818 | TFLOPs: 20.38 | 31: iteration 64390/ 173500 | consumed samples: 16483840 | consumed tokens: 33758904320 | elapsed time per iteration (s): 0.74 | learning rate: 1.471E-04 | global batch size: 256 | lm loss: 2.070053E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.965 | TFLOPs: 20.93 | 31: iteration 64400/ 173500 | consumed samples: 16486400 | consumed tokens: 33764147200 | elapsed time per iteration (s): 0.79 | learning rate: 1.471E-04 | global batch 
size: 256 | lm loss: 2.030467E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.742 | TFLOPs: 19.71 | 31: iteration 64410/ 173500 | consumed samples: 16488960 | consumed tokens: 33769390080 | elapsed time per iteration (s): 0.79 | learning rate: 1.471E-04 | global batch size: 256 | lm loss: 2.010448E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.960 | TFLOPs: 19.54 | 31: iteration 64420/ 173500 | consumed samples: 16491520 | consumed tokens: 33774632960 | elapsed time per iteration (s): 0.84 | learning rate: 1.471E-04 | global batch size: 256 | lm loss: 2.064409E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.528 | TFLOPs: 18.48 | 31: iteration 64430/ 173500 | consumed samples: 16494080 | consumed tokens: 33779875840 | elapsed time per iteration (s): 0.74 | learning rate: 1.470E-04 | global batch size: 256 | lm loss: 2.027519E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.828 | TFLOPs: 20.92 | 31: iteration 64440/ 173500 | consumed samples: 16496640 | consumed tokens: 33785118720 | elapsed time per iteration (s): 0.77 | learning rate: 1.470E-04 | global batch size: 256 | lm loss: 2.023495E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.341 | TFLOPs: 19.98 | 31: iteration 64450/ 173500 | consumed samples: 16499200 | consumed tokens: 33790361600 | elapsed time per iteration (s): 0.76 | learning rate: 1.470E-04 | global batch size: 256 | lm loss: 2.021328E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.675 | TFLOPs: 20.31 | 31: iteration 64460/ 173500 | consumed samples: 16501760 | 
consumed tokens: 33795604480 | elapsed time per iteration (s): 0.75 | learning rate: 1.470E-04 | global batch size: 256 | lm loss: 2.025894E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.758 | TFLOPs: 20.61 | 31: iteration 64470/ 173500 | consumed samples: 16504320 | consumed tokens: 33800847360 | elapsed time per iteration (s): 0.76 | learning rate: 1.470E-04 | global batch size: 256 | lm loss: 2.047260E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.199 | TFLOPs: 20.46 | 31: iteration 64480/ 173500 | consumed samples: 16506880 | consumed tokens: 33806090240 | elapsed time per iteration (s): 0.75 | learning rate: 1.470E-04 | global batch size: 256 | lm loss: 2.030241E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.782 | TFLOPs: 20.62 | 31: iteration 64490/ 173500 | consumed samples: 16509440 | consumed tokens: 33811333120 | elapsed time per iteration (s): 0.76 | learning rate: 1.469E-04 | global batch size: 256 | lm loss: 2.050608E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.697 | TFLOPs: 20.49 | 31: iteration 64500/ 173500 | consumed samples: 16512000 | consumed tokens: 33816576000 | elapsed time per iteration (s): 0.79 | learning rate: 1.469E-04 | global batch size: 256 | lm loss: 2.029993E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.562 | TFLOPs: 19.57 | 31: iteration 64510/ 173500 | consumed samples: 16514560 | consumed tokens: 33821818880 | elapsed time per iteration (s): 0.76 | learning rate: 1.469E-04 | global batch size: 256 | lm loss: 2.036385E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 339.070 | TFLOPs: 20.51 | 31: iteration 64520/ 173500 | consumed samples: 16517120 | consumed tokens: 33827061760 | elapsed time per iteration (s): 0.76 | learning rate: 1.469E-04 | global batch size: 256 | lm loss: 2.051962E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.634 | TFLOPs: 20.49 | 31: iteration 64530/ 173500 | consumed samples: 16519680 | consumed tokens: 33832304640 | elapsed time per iteration (s): 0.77 | learning rate: 1.469E-04 | global batch size: 256 | lm loss: 2.024836E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.360 | TFLOPs: 20.17 | 31: iteration 64540/ 173500 | consumed samples: 16522240 | consumed tokens: 33837547520 | elapsed time per iteration (s): 0.76 | learning rate: 1.469E-04 | global batch size: 256 | lm loss: 2.064203E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.863 | TFLOPs: 20.50 | 31: iteration 64550/ 173500 | consumed samples: 16524800 | consumed tokens: 33842790400 | elapsed time per iteration (s): 0.85 | learning rate: 1.469E-04 | global batch size: 256 | lm loss: 2.035402E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.989 | TFLOPs: 18.21 | 31: iteration 64560/ 173500 | consumed samples: 16527360 | consumed tokens: 33848033280 | elapsed time per iteration (s): 0.82 | learning rate: 1.468E-04 | global batch size: 256 | lm loss: 2.014692E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.982 | TFLOPs: 19.00 | 31: iteration 64570/ 173500 | consumed samples: 16529920 | consumed tokens: 33853276160 | elapsed time per iteration (s): 0.93 | learning rate: 1.468E-04 | global batch size: 256 | lm loss: 
2.056544E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 274.503 | TFLOPs: 16.61 | 31: iteration 64580/ 173500 | consumed samples: 16532480 | consumed tokens: 33858519040 | elapsed time per iteration (s): 0.80 | learning rate: 1.468E-04 | global batch size: 256 | lm loss: 2.038297E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.353 | TFLOPs: 19.32 | 31: iteration 64590/ 173500 | consumed samples: 16535040 | consumed tokens: 33863761920 | elapsed time per iteration (s): 0.79 | learning rate: 1.468E-04 | global batch size: 256 | lm loss: 1.993442E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.831 | TFLOPs: 19.53 | 31: iteration 64600/ 173500 | consumed samples: 16537600 | consumed tokens: 33869004800 | elapsed time per iteration (s): 0.80 | learning rate: 1.468E-04 | global batch size: 256 | lm loss: 2.025281E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.334 | TFLOPs: 19.26 | 31: iteration 64610/ 173500 | consumed samples: 16540160 | consumed tokens: 33874247680 | elapsed time per iteration (s): 0.83 | learning rate: 1.468E-04 | global batch size: 256 | lm loss: 2.020928E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.799 | TFLOPs: 18.68 | 31: iteration 64620/ 173500 | consumed samples: 16542720 | consumed tokens: 33879490560 | elapsed time per iteration (s): 0.82 | learning rate: 1.468E-04 | global batch size: 256 | lm loss: 2.012212E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.761 | TFLOPs: 18.80 | 31: iteration 64630/ 173500 | consumed samples: 16545280 | consumed tokens: 
33884733440 | elapsed time per iteration (s): 0.78 | learning rate: 1.467E-04 | global batch size: 256 | lm loss: 2.031376E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.784 | TFLOPs: 19.89 | 31: iteration 64640/ 173500 | consumed samples: 16547840 | consumed tokens: 33889976320 | elapsed time per iteration (s): 0.86 | learning rate: 1.467E-04 | global batch size: 256 | lm loss: 2.034080E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.409 | TFLOPs: 17.99 | 31: iteration 64650/ 173500 | consumed samples: 16550400 | consumed tokens: 33895219200 | elapsed time per iteration (s): 0.80 | learning rate: 1.467E-04 | global batch size: 256 | lm loss: 2.053801E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.558 | TFLOPs: 19.27 | 31: iteration 64660/ 173500 | consumed samples: 16552960 | consumed tokens: 33900462080 | elapsed time per iteration (s): 0.82 | learning rate: 1.467E-04 | global batch size: 256 | lm loss: 2.044975E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.289 | TFLOPs: 18.89 | 31: iteration 64670/ 173500 | consumed samples: 16555520 | consumed tokens: 33905704960 | elapsed time per iteration (s): 0.80 | learning rate: 1.467E-04 | global batch size: 256 | lm loss: 1.999512E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.970 | TFLOPs: 19.36 | 31: iteration 64680/ 173500 | consumed samples: 16558080 | consumed tokens: 33910947840 | elapsed time per iteration (s): 0.74 | learning rate: 1.467E-04 | global batch size: 256 | lm loss: 2.027461E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 345.807 | TFLOPs: 20.92 | 31: iteration 64690/ 173500 | consumed samples: 16560640 | consumed tokens: 33916190720 | elapsed time per iteration (s): 0.78 | learning rate: 1.466E-04 | global batch size: 256 | lm loss: 2.042521E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.386 | TFLOPs: 19.87 | 31: iteration 64700/ 173500 | consumed samples: 16563200 | consumed tokens: 33921433600 | elapsed time per iteration (s): 0.77 | learning rate: 1.466E-04 | global batch size: 256 | lm loss: 2.061352E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.811 | TFLOPs: 20.07 | 31: iteration 64710/ 173500 | consumed samples: 16565760 | consumed tokens: 33926676480 | elapsed time per iteration (s): 0.78 | learning rate: 1.466E-04 | global batch size: 256 | lm loss: 2.049313E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.932 | TFLOPs: 19.96 | 31: iteration 64720/ 173500 | consumed samples: 16568320 | consumed tokens: 33931919360 | elapsed time per iteration (s): 0.80 | learning rate: 1.466E-04 | global batch size: 256 | lm loss: 2.000562E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.820 | TFLOPs: 19.29 | 31: iteration 64730/ 173500 | consumed samples: 16570880 | consumed tokens: 33937162240 | elapsed time per iteration (s): 0.78 | learning rate: 1.466E-04 | global batch size: 256 | lm loss: 2.026175E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.120 | TFLOPs: 19.91 | 31: iteration 64740/ 173500 | consumed samples: 16573440 | consumed tokens: 33942405120 | elapsed time per iteration (s): 0.75 | learning rate: 1.466E-04 | global batch size: 256 | lm loss: 2.021229E+00 | grad 
norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.672 | TFLOPs: 20.61 | 31: iteration 64750/ 173500 | consumed samples: 16576000 | consumed tokens: 33947648000 | elapsed time per iteration (s): 0.84 | learning rate: 1.466E-04 | global batch size: 256 | lm loss: 2.015261E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.730 | TFLOPs: 18.50 | 31: iteration 64760/ 173500 | consumed samples: 16578560 | consumed tokens: 33952890880 | elapsed time per iteration (s): 0.81 | learning rate: 1.465E-04 | global batch size: 256 | lm loss: 2.022036E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.330 | TFLOPs: 19.20 | 31: iteration 64770/ 173500 | consumed samples: 16581120 | consumed tokens: 33958133760 | elapsed time per iteration (s): 0.83 | learning rate: 1.465E-04 | global batch size: 256 | lm loss: 2.035889E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.775 | TFLOPs: 18.74 | 31: iteration 64780/ 173500 | consumed samples: 16583680 | consumed tokens: 33963376640 | elapsed time per iteration (s): 0.79 | learning rate: 1.465E-04 | global batch size: 256 | lm loss: 2.032709E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.775 | TFLOPs: 19.65 | 31: iteration 64790/ 173500 | consumed samples: 16586240 | consumed tokens: 33968619520 | elapsed time per iteration (s): 0.76 | learning rate: 1.465E-04 | global batch size: 256 | lm loss: 2.027603E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.145 | TFLOPs: 20.46 | 31: iteration 64800/ 173500 | consumed samples: 16588800 | consumed tokens: 33973862400 | elapsed time 
per iteration (s): 0.77 | learning rate: 1.465E-04 | global batch size: 256 | lm loss: 2.015813E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.620 | TFLOPs: 20.24 | 31: iteration 64810/ 173500 | consumed samples: 16591360 | consumed tokens: 33979105280 | elapsed time per iteration (s): 0.80 | learning rate: 1.465E-04 | global batch size: 256 | lm loss: 2.034744E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.360 | TFLOPs: 19.26 | 31: iteration 64820/ 173500 | consumed samples: 16593920 | consumed tokens: 33984348160 | elapsed time per iteration (s): 0.76 | learning rate: 1.464E-04 | global batch size: 256 | lm loss: 2.039917E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.906 | TFLOPs: 20.26 | 31: iteration 64830/ 173500 | consumed samples: 16596480 | consumed tokens: 33989591040 | elapsed time per iteration (s): 0.84 | learning rate: 1.464E-04 | global batch size: 256 | lm loss: 2.029356E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.111 | TFLOPs: 18.40 | 31: iteration 64840/ 173500 | consumed samples: 16599040 | consumed tokens: 33994833920 | elapsed time per iteration (s): 0.82 | learning rate: 1.464E-04 | global batch size: 256 | lm loss: 1.998066E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.482 | TFLOPs: 18.84 | 31: iteration 64850/ 173500 | consumed samples: 16601600 | consumed tokens: 34000076800 | elapsed time per iteration (s): 0.93 | learning rate: 1.464E-04 | global batch size: 256 | lm loss: 2.034114E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 276.690 | TFLOPs: 
16.74 | 31: iteration 64860/ 173500 | consumed samples: 16604160 | consumed tokens: 34005319680 | elapsed time per iteration (s): 0.76 | learning rate: 1.464E-04 | global batch size: 256 | lm loss: 2.025770E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.555 | TFLOPs: 20.30 | 31: iteration 64870/ 173500 | consumed samples: 16606720 | consumed tokens: 34010562560 | elapsed time per iteration (s): 0.82 | learning rate: 1.464E-04 | global batch size: 256 | lm loss: 2.050665E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.712 | TFLOPs: 18.86 | 31: iteration 64880/ 173500 | consumed samples: 16609280 | consumed tokens: 34015805440 | elapsed time per iteration (s): 0.83 | learning rate: 1.464E-04 | global batch size: 256 | lm loss: 2.009477E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.809 | TFLOPs: 18.62 | 31: iteration 64890/ 173500 | consumed samples: 16611840 | consumed tokens: 34021048320 | elapsed time per iteration (s): 0.86 | learning rate: 1.463E-04 | global batch size: 256 | lm loss: 2.048899E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.902 | TFLOPs: 18.08 | 31: iteration 64900/ 173500 | consumed samples: 16614400 | consumed tokens: 34026291200 | elapsed time per iteration (s): 0.86 | learning rate: 1.463E-04 | global batch size: 256 | lm loss: 2.006618E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.303 | TFLOPs: 18.11 | 31: iteration 64910/ 173500 | consumed samples: 16616960 | consumed tokens: 34031534080 | elapsed time per iteration (s): 0.87 | learning rate: 1.463E-04 | global batch size: 256 | lm loss: 2.047731E+00 | grad norm: 0.133 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.593 | TFLOPs: 17.82 | 31: iteration 64920/ 173500 | consumed samples: 16619520 | consumed tokens: 34036776960 | elapsed time per iteration (s): 0.86 | learning rate: 1.463E-04 | global batch size: 256 | lm loss: 2.041036E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.718 | TFLOPs: 17.95 | 31: iteration 64930/ 173500 | consumed samples: 16622080 | consumed tokens: 34042019840 | elapsed time per iteration (s): 0.86 | learning rate: 1.463E-04 | global batch size: 256 | lm loss: 2.019175E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.740 | TFLOPs: 18.01 | 31: iteration 64940/ 173500 | consumed samples: 16624640 | consumed tokens: 34047262720 | elapsed time per iteration (s): 0.85 | learning rate: 1.463E-04 | global batch size: 256 | lm loss: 2.049560E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.496 | TFLOPs: 18.30 | 31: iteration 64950/ 173500 | consumed samples: 16627200 | consumed tokens: 34052505600 | elapsed time per iteration (s): 0.83 | learning rate: 1.463E-04 | global batch size: 256 | lm loss: 2.040346E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.870 | TFLOPs: 18.56 | 31: iteration 64960/ 173500 | consumed samples: 16629760 | consumed tokens: 34057748480 | elapsed time per iteration (s): 0.80 | learning rate: 1.462E-04 | global batch size: 256 | lm loss: 2.012606E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.842 | TFLOPs: 19.47 | 31: iteration 64970/ 173500 | consumed samples: 16632320 | consumed tokens: 34062991360 | elapsed time per iteration (s): 0.81 | 
learning rate: 1.462E-04 | global batch size: 256 | lm loss: 2.061274E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.619 | TFLOPs: 19.03 |
31: iteration 64980/ 173500 | consumed samples: 16634880 | consumed tokens: 34068234240 | elapsed time per iteration (s): 0.82 | learning rate: 1.462E-04 | global batch size: 256 | lm loss: 2.003758E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.488 | TFLOPs: 18.78 |
31: iteration 64990/ 173500 | consumed samples: 16637440 | consumed tokens: 34073477120 | elapsed time per iteration (s): 0.80 | learning rate: 1.462E-04 | global batch size: 256 | lm loss: 2.053178E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.471 | TFLOPs: 19.45 |
31: iteration 65000/ 173500 | consumed samples: 16640000 | consumed tokens: 34078720000 | elapsed time per iteration (s): 0.82 | learning rate: 1.462E-04 | global batch size: 256 | lm loss: 2.013623E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.475 | TFLOPs: 18.84 |
31: -------------------------------------------------------------------------------------------
31: valid loss at iteration 65000 | lm loss value: 1.940496E+00 | lm loss PPL: 6.962206E+00 |
31: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 65000 to checkpoints_1b1long
0: [2022-11-26 08:40:49,191] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step65000 is begin to save!
0: [2022-11-26 08:40:49,200] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_01-model_00-model_states.pt...
0: [2022-11-26 08:40:49,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_01-model_00-model_states.pt. 0: [2022-11-26 08:40:49,436] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_03-model_00-model_states.pt... 0: [2022-11-26 08:40:49,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_03-model_00-model_states.pt. 0: [2022-11-26 08:40:49,513] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_04-model_00-model_states.pt... 0: [2022-11-26 08:40:49,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_04-model_00-model_states.pt. 0: [2022-11-26 08:40:49,591] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_05-model_00-model_states.pt... 0: [2022-11-26 08:40:49,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_05-model_00-model_states.pt. 0: [2022-11-26 08:40:49,672] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_06-model_00-model_states.pt... 0: [2022-11-26 08:40:49,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_06-model_00-model_states.pt. 0: [2022-11-26 08:40:49,750] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_07-model_00-model_states.pt... 0: [2022-11-26 08:40:49,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_07-model_00-model_states.pt. 0: [2022-11-26 08:40:49,829] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 08:40:49,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_08-model_00-model_states.pt. 0: [2022-11-26 08:40:49,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_09-model_00-model_states.pt... 0: [2022-11-26 08:40:49,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_09-model_00-model_states.pt. 0: [2022-11-26 08:40:49,980] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_10-model_00-model_states.pt... 0: [2022-11-26 08:40:50,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_10-model_00-model_states.pt. 0: [2022-11-26 08:40:50,055] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_11-model_00-model_states.pt... 0: [2022-11-26 08:40:50,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_11-model_00-model_states.pt. 0: [2022-11-26 08:40:50,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_12-model_00-model_states.pt... 0: [2022-11-26 08:40:50,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_12-model_00-model_states.pt. 0: [2022-11-26 08:40:50,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_13-model_00-model_states.pt... 0: [2022-11-26 08:40:50,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_13-model_00-model_states.pt. 0: [2022-11-26 08:40:50,285] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 08:40:50,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_14-model_00-model_states.pt. 0: [2022-11-26 08:40:50,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_15-model_00-model_states.pt... 0: [2022-11-26 08:40:50,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_15-model_00-model_states.pt. 0: [2022-11-26 08:40:50,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_16-model_00-model_states.pt... 0: [2022-11-26 08:40:50,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_16-model_00-model_states.pt. 0: [2022-11-26 08:40:50,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_17-model_00-model_states.pt... 0: [2022-11-26 08:40:50,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_17-model_00-model_states.pt. 0: [2022-11-26 08:40:50,586] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_18-model_00-model_states.pt... 0: [2022-11-26 08:40:50,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_18-model_00-model_states.pt. 0: [2022-11-26 08:40:50,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_19-model_00-model_states.pt... 0: [2022-11-26 08:40:50,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_19-model_00-model_states.pt. 0: [2022-11-26 08:40:50,738] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 08:40:50,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_20-model_00-model_states.pt. 0: [2022-11-26 08:40:50,816] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_21-model_00-model_states.pt... 0: [2022-11-26 08:40:50,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_21-model_00-model_states.pt. 0: [2022-11-26 08:40:50,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_22-model_00-model_states.pt... 0: [2022-11-26 08:40:50,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_22-model_00-model_states.pt. 0: [2022-11-26 08:40:50,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_23-model_00-model_states.pt... 0: [2022-11-26 08:40:51,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_23-model_00-model_states.pt. 0: [2022-11-26 08:40:51,042] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_24-model_00-model_states.pt... 0: [2022-11-26 08:40:51,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_24-model_00-model_states.pt. 0: [2022-11-26 08:40:51,117] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_25-model_00-model_states.pt... 0: [2022-11-26 08:40:51,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_25-model_00-model_states.pt. 0: [2022-11-26 08:40:51,192] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 08:40:51,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_26-model_00-model_states.pt. 0: [2022-11-26 08:40:51,267] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_27-model_00-model_states.pt... 0: [2022-11-26 08:40:51,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_27-model_00-model_states.pt. 0: [2022-11-26 08:40:51,343] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_28-model_00-model_states.pt... 0: [2022-11-26 08:40:51,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_28-model_00-model_states.pt. 0: [2022-11-26 08:40:51,419] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/layer_30-model_00-model_states.pt... 0: [2022-11-26 08:40:51,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/layer_30-model_00-model_states.pt. 0: [2022-11-26 08:40:51,421] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step65000/mp_rank_00_model_states.pt 0: [2022-11-26 08:40:51,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/mp_rank_00_model_states.pt... 0: [2022-11-26 08:40:51,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/mp_rank_00_model_states.pt. 0: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
6: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 
8: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
16: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
3: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 
23: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 
14: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 
21: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 
27: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
4: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 
2: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 
20: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 
28: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 
22: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 
18: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 
5: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
16: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 
11: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 
22: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
0: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
12: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 
29: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 
1: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 
31: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 
18: [2022-11-26 08:40:51,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:40:51,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:40:51,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:40:51,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 08:40:51,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 3: [2022-11-26 08:40:51,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:40:51,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 08:40:51,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 30: [2022-11-26 08:40:51,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:40:51,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 08:40:51,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 22: [2022-11-26 08:40:51,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
22: [2022-11-26 08:40:51,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 08:40:51,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 20: [2022-11-26 08:40:51,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:40:51,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 08:40:51,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 27: [2022-11-26 08:40:51,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:40:51,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 08:40:51,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 7: [2022-11-26 08:40:51,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:40:51,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 08:40:51,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 21: [2022-11-26 08:40:51,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-26 08:40:51,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 08:40:51,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 28: [2022-11-26 08:40:51,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:40:51,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:40:51,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 08:40:51,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 21: [2022-11-26 08:40:51,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:40:51,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 28: [2022-11-26 08:40:51,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 21: [2022-11-26 08:40:51,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 28: [2022-11-26 08:40:51,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 1: [2022-11-26 08:40:51,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 08:40:51,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 08:40:51,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 27: [2022-11-26 08:40:51,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:40:51,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 08:40:51,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 2: [2022-11-26 08:40:51,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:40:51,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 08:40:51,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 24: [2022-11-26 08:40:51,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:40:51,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 08:40:51,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 29: [2022-11-26 08:40:51,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-26 08:40:51,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 08:40:51,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 3: [2022-11-26 08:40:51,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:40:51,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 08:40:51,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 19: [2022-11-26 08:40:51,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:40:51,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 08:40:51,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 26: [2022-11-26 08:40:51,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:40:51,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 1: [2022-11-26 08:40:51,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:40:51,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
26: [2022-11-26 08:40:51,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:40:51,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 08:40:51,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 19: [2022-11-26 08:40:51,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:40:51,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 19: [2022-11-26 08:40:51,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 08:40:51,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 24: [2022-11-26 08:40:51,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:40:51,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 24: [2022-11-26 08:40:51,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 08:40:51,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 4: [2022-11-26 08:40:51,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 08:40:51,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 08:40:51,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 10: [2022-11-26 08:40:51,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:40:51,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:40:51,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 22: [2022-11-26 08:40:51,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 08:40:51,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 10: [2022-11-26 08:40:51,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 11: [2022-11-26 08:40:51,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:40:51,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 08:40:51,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:40:51,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
11: [2022-11-26 08:40:51,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 08:40:51,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 30: [2022-11-26 08:40:51,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:40:51,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 08:40:51,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 25: [2022-11-26 08:40:51,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:40:51,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 08:40:51,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 2: [2022-11-26 08:40:51,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:40:51,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 08:40:51,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 16: [2022-11-26 08:40:51,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-26 08:40:51,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 08:40:51,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 21: [2022-11-26 08:40:51,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:40:51,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 08:40:51,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 19: [2022-11-26 08:40:51,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:40:51,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 08:40:51,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 31: [2022-11-26 08:40:51,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 28: [2022-11-26 08:40:51,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:40:51,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 08:40:51,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
13: [2022-11-26 08:40:51,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:40:51,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 08:40:51,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 6: [2022-11-26 08:40:51,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:40:51,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:40:51,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 13: [2022-11-26 08:40:51,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 6: [2022-11-26 08:40:51,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 13: [2022-11-26 08:40:51,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 5: [2022-11-26 08:40:51,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:40:51,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 08:40:51,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
1: [2022-11-26 08:40:51,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:40:51,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:40:51,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 25: [2022-11-26 08:40:51,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 1: [2022-11-26 08:40:51,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 25: [2022-11-26 08:40:51,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 3: [2022-11-26 08:40:51,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:40:51,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:40:51,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 20: [2022-11-26 08:40:51,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 3: [2022-11-26 08:40:51,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 20: [2022-11-26 08:40:51,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
8: [2022-11-26 08:40:51,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:40:51,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:40:51,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 08:40:51,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 08:40:51,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 8: [2022-11-26 08:40:51,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 11: [2022-11-26 08:40:51,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:40:51,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 08:40:51,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 29: [2022-11-26 08:40:51,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:40:51,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 08:40:51,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
4: [2022-11-26 08:40:51,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:40:51,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 08:40:51,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 0: [2022-11-26 08:40:51,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:40:51,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 08:40:51,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 14: [2022-11-26 08:40:51,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:40:51,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 08:40:51,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 5: [2022-11-26 08:40:51,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:40:51,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 08:40:51,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
7: [2022-11-26 08:40:51,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:40:51,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 08:40:51,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 12: [2022-11-26 08:40:51,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:40:51,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:40:51,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 24: [2022-11-26 08:40:51,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 12: [2022-11-26 08:40:51,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 24: [2022-11-26 08:40:51,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 28: [2022-11-26 08:40:51,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 08:40:51,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 28: [2022-11-26 08:40:51,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
28: [2022-11-26 08:40:51,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 08:40:51,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 27: [2022-11-26 08:40:51,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:40:51,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 08:40:51,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 2: [2022-11-26 08:40:51,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:40:51,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:40:51,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 15: [2022-11-26 08:40:51,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 26: [2022-11-26 08:40:51,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:40:51,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 15: [2022-11-26 08:40:51,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
26: [2022-11-26 08:40:51,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 08:40:51,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 16: [2022-11-26 08:40:51,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:40:51,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 22: [2022-11-26 08:40:51,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:40:51,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 22: [2022-11-26 08:40:51,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 08:40:51,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 16: [2022-11-26 08:40:51,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:40:51,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 08:40:51,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 30: [2022-11-26 08:40:51,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-26 08:40:51,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 08:40:51,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 6: [2022-11-26 08:40:51,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:40:51,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 08:40:51,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 29: [2022-11-26 08:40:51,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:40:51,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 08:40:51,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 26: [2022-11-26 08:40:51,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:40:51,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 08:40:51,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 12: [2022-11-26 08:40:51,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 08:40:51,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 08:40:51,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 1: [2022-11-26 08:40:51,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:40:51,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 08:40:51,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 21: [2022-11-26 08:40:51,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:40:51,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:40:51,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 08:40:51,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 20: [2022-11-26 08:40:51,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 08:40:51,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 3: [2022-11-26 08:40:51,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 08:40:51,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 2: [2022-11-26 08:40:51,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:40:51,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 2: [2022-11-26 08:40:51,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 15: [2022-11-26 08:40:51,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:40:51,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 15: [2022-11-26 08:40:51,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 08:40:51,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 24: [2022-11-26 08:40:51,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:40:51,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 22: [2022-11-26 08:40:51,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-26 08:40:51,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 14: [2022-11-26 08:40:51,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:40:51,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 24: [2022-11-26 08:40:51,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 14: [2022-11-26 08:40:51,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 08:40:51,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 23: [2022-11-26 08:40:51,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:40:51,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 14: [2022-11-26 08:40:51,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:40:51,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 14: [2022-11-26 08:40:51,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 08:40:51,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
10: [2022-11-26 08:40:51,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:40:51,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 08:40:51,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 25: [2022-11-26 08:40:51,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:40:51,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 15: [2022-11-26 08:40:51,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:40:51,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 25: [2022-11-26 08:40:51,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 8: [2022-11-26 08:40:51,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:40:51,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 16: [2022-11-26 08:40:51,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
15: [2022-11-26 08:40:51,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 16: [2022-11-26 08:40:51,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 10: [2022-11-26 08:40:51,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:40:51,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 16: [2022-11-26 08:40:51,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 10: [2022-11-26 08:40:51,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 08:40:51,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 31: [2022-11-26 08:40:51,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:40:51,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 08:40:51,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 31: [2022-11-26 08:40:51,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-26 08:40:51,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 08:40:51,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 27: [2022-11-26 08:40:51,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:40:51,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 08:40:51,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 11: [2022-11-26 08:40:51,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:40:51,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 08:40:51,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 13: [2022-11-26 08:40:51,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:40:51,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-26 08:40:51,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 13: [2022-11-26 08:40:51,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 10: [2022-11-26 08:40:51,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 13: [2022-11-26 08:40:51,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 23: [2022-11-26 08:40:51,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:40:51,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 08:40:51,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 29: [2022-11-26 08:40:51,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:40:51,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 08:40:51,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 0: [2022-11-26 08:40:51,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:40:51,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 08:40:51,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 0: [2022-11-26 08:40:51,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 08:40:51,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 8: [2022-11-26 08:40:51,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 7: [2022-11-26 08:40:51,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:40:51,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 08:40:51,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 19: [2022-11-26 08:40:51,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:40:51,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 08:40:51,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 30: [2022-11-26 08:40:51,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-26 08:40:51,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 08:40:51,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 17: [2022-11-26 08:40:51,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:40:51,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 08:40:51,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 17: [2022-11-26 08:40:51,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:40:51,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 08:40:51,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 17: [2022-11-26 08:40:51,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:40:51,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 5: [2022-11-26 08:40:51,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:40:51,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
17: [2022-11-26 08:40:51,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:40:51,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 17: [2022-11-26 08:40:51,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 5: [2022-11-26 08:40:51,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 17: [2022-11-26 08:40:51,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 12: [2022-11-26 08:40:51,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:40:51,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:40:51,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 5: [2022-11-26 08:40:51,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
12: [2022-11-26 08:40:51,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 5: [2022-11-26 08:40:51,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 12: [2022-11-26 08:40:51,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 23: [2022-11-26 08:40:51,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 5: [2022-11-26 08:40:51,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 28: [2022-11-26 08:40:51,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:40:51,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:40:51,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 08:40:51,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 4: [2022-11-26 08:40:51,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:40:51,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 08:40:51,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
7: [2022-11-26 08:40:51,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:40:51,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 08:40:51,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 28: [2022-11-26 08:40:51,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 08:40:51,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 0: [2022-11-26 08:40:51,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:40:51,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 08:40:51,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 4: [2022-11-26 08:40:51,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:40:51,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 08:40:51,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 0: [2022-11-26 08:40:51,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
20: [2022-11-26 08:40:51,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:40:51,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 20: [2022-11-26 08:40:51,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 0: [2022-11-26 08:40:51,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 20: [2022-11-26 08:40:51,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 13: [2022-11-26 08:40:51,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:40:51,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 08:40:51,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 14: [2022-11-26 08:40:51,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:40:51,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 08:40:51,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 9: [2022-11-26 08:40:51,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 08:40:51,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 08:40:51,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 6: [2022-11-26 08:40:51,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:40:51,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 08:40:51,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 18: [2022-11-26 08:40:51,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:40:51,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:40:51,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:40:51,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
18: [2022-11-26 08:40:51,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 08:40:51,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 08:40:51,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 08:40:51,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 08:40:51,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 18: [2022-11-26 08:40:51,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 18: [2022-11-26 08:40:51,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 18: [2022-11-26 08:40:51,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 1: [2022-11-26 08:40:51,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:40:51,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 08:40:51,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 9: [2022-11-26 08:40:51,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-26 08:40:51,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 08:40:51,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:40:51,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 9: [2022-11-26 08:40:51,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 08:40:51,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 19: [2022-11-26 08:40:51,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:40:51,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 08:40:51,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 26: [2022-11-26 08:40:51,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:40:51,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 08:40:51,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 17: [2022-11-26 08:40:51,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 
17: [2022-11-26 08:40:51,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 08:40:51,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 0: [2022-11-26 08:40:51,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 08:40:51,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 6: [2022-11-26 08:40:51,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:40:51,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:40:51,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 08:40:51,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 08:40:51,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 6: [2022-11-26 08:40:51,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 21: [2022-11-26 08:40:51,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
21: [2022-11-26 08:40:51,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 08:40:51,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 29: [2022-11-26 08:40:51,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:40:51,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 08:40:51,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 25: [2022-11-26 08:40:51,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:40:51,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 08:40:51,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 10: [2022-11-26 08:40:51,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:40:51,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 08:40:51,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 11: [2022-11-26 08:40:51,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 08:40:51,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 08:40:51,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 2: [2022-11-26 08:40:51,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:40:51,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 08:40:51,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 12: [2022-11-26 08:40:51,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:40:51,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 08:40:51,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 22: [2022-11-26 08:40:51,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:40:51,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 08:40:51,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 24: [2022-11-26 08:40:51,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
24: [2022-11-26 08:40:51,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 08:40:51,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 15: [2022-11-26 08:40:51,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:40:51,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 08:40:51,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 30: [2022-11-26 08:40:51,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:40:51,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 08:40:51,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 31: [2022-11-26 08:40:51,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:40:51,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 08:40:51,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 20: [2022-11-26 08:40:51,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
20: [2022-11-26 08:40:51,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 08:40:51,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 8: [2022-11-26 08:40:51,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:40:51,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 08:40:51,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 3: [2022-11-26 08:40:51,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:40:51,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 08:40:51,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 18: [2022-11-26 08:40:51,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:40:51,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 08:40:51,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 9: [2022-11-26 08:40:51,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-26 08:40:51,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 08:40:51,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 0: [2022-11-26 08:40:51,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:40:51,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 08:40:51,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 27: [2022-11-26 08:40:51,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:40:51,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 23: [2022-11-26 08:40:51,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:40:51,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 23: [2022-11-26 08:40:51,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 08:40:51,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 7: [2022-11-26 08:40:51,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 08:40:51,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 08:40:51,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 1: [2022-11-26 08:40:51,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:40:51,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:40:51,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 4: [2022-11-26 08:40:51,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 08:40:51,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 1: [2022-11-26 08:40:51,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 13: [2022-11-26 08:40:51,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:40:51,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 
13: [2022-11-26 08:40:51,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 16: [2022-11-26 08:40:51,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 13: [2022-11-26 08:40:51,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 16: [2022-11-26 08:40:51,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 5: [2022-11-26 08:40:51,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:40:51,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 08:40:51,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 28: [2022-11-26 08:40:51,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:40:51,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:40:51,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 08:40:51,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 14: [2022-11-26 08:40:51,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
6: [2022-11-26 08:40:51,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:40:51,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 6: [2022-11-26 08:40:51,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 14: [2022-11-26 08:40:51,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 6: [2022-11-26 08:40:51,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 17: [2022-11-26 08:40:51,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:40:51,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 08:40:51,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 2: [2022-11-26 08:40:51,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:40:51,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 08:40:51,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
28: [2022-11-26 08:40:51,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 08:40:51,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 29: [2022-11-26 08:40:51,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:40:51,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 08:40:51,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 19: [2022-11-26 08:40:51,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:40:51,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 08:40:51,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 25: [2022-11-26 08:40:51,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:40:51,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 08:40:51,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 26: [2022-11-26 08:40:51,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
26: [2022-11-26 08:40:51,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 08:40:51,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 8: [2022-11-26 08:40:51,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:40:51,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 08:40:51,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 10: [2022-11-26 08:40:51,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:40:51,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 08:40:51,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 11: [2022-11-26 08:40:51,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:40:51,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 08:40:51,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 22: [2022-11-26 08:40:51,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
22: [2022-11-26 08:40:51,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 08:40:51,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 24: [2022-11-26 08:40:51,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:40:51,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 08:40:51,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 30: [2022-11-26 08:40:51,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:40:51,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 08:40:51,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 15: [2022-11-26 08:40:51,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:40:51,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 08:40:51,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 20: [2022-11-26 08:40:51,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-26 08:40:51,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 08:40:51,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 3: [2022-11-26 08:40:51,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:40:51,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 08:40:51,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 31: [2022-11-26 08:40:51,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:40:51,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 08:40:51,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 18: [2022-11-26 08:40:51,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:40:51,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 08:40:51,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 9: [2022-11-26 08:40:51,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-26 08:40:51,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 08:40:51,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 12: [2022-11-26 08:40:51,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:40:51,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 08:40:51,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 7: [2022-11-26 08:40:51,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:40:51,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 08:40:51,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 0: [2022-11-26 08:40:51,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:40:51,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 08:40:51,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 1: [2022-11-26 08:40:51,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-26 08:40:51,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 08:40:51,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 28: [2022-11-26 08:40:51,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 08:40:51,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 08:40:51,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 16: [2022-11-26 08:40:51,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:40:51,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 08:40:51,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 13: [2022-11-26 08:40:51,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:40:51,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 08:40:51,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 27: [2022-11-26 08:40:51,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
5: [2022-11-26 08:40:51,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:40:51,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 08:40:51,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 5: [2022-11-26 08:40:51,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 08:40:51,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 6: [2022-11-26 08:40:51,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:40:51,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:40:51,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 26: [2022-11-26 08:40:51,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 6: [2022-11-26 08:40:51,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 26: [2022-11-26 08:40:51,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 21: [2022-11-26 08:40:51,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-26 08:40:51,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 08:40:51,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 23: [2022-11-26 08:40:51,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:40:51,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 08:40:51,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 25: [2022-11-26 08:40:51,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:40:51,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 08:40:51,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 14: [2022-11-26 08:40:51,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:40:51,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 08:40:51,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 4: [2022-11-26 08:40:51,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 08:40:51,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 08:40:51,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 19: [2022-11-26 08:40:51,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:40:51,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 08:40:51,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 2: [2022-11-26 08:40:51,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:40:51,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 29: [2022-11-26 08:40:51,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:40:51,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:40:51,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
29: [2022-11-26 08:40:51,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 10: [2022-11-26 08:40:51,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 08:40:51,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 29: [2022-11-26 08:40:51,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 22: [2022-11-26 08:40:51,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:40:51,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 08:40:51,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 8: [2022-11-26 08:40:51,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:40:51,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 08:40:51,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 15: [2022-11-26 08:40:51,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-26 08:40:51,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 08:40:51,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 11: [2022-11-26 08:40:51,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:40:51,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 08:40:51,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 24: [2022-11-26 08:40:51,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:40:51,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 08:40:51,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 20: [2022-11-26 08:40:51,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:40:51,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 08:40:51,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 30: [2022-11-26 08:40:51,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 
30: [2022-11-26 08:40:51,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 08:40:51,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 18: [2022-11-26 08:40:51,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:40:51,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 08:40:51,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 3: [2022-11-26 08:40:51,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:40:51,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 08:40:51,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 17: [2022-11-26 08:40:51,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:40:51,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 08:40:51,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 5: [2022-11-26 08:40:51,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 08:40:51,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 08:40:51,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 12: [2022-11-26 08:40:51,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:40:51,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 08:40:51,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 21: [2022-11-26 08:40:51,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:40:51,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 08:40:51,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 28: [2022-11-26 08:40:51,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:40:51,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:40:51,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 08:40:51,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
9: [2022-11-26 08:40:51,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:40:51,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:40:51,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 13: [2022-11-26 08:40:51,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 08:40:51,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 9: [2022-11-26 08:40:51,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 22: [2022-11-26 08:40:51,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:40:51,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:40:51,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:40:51,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 08:40:51,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
7: [2022-11-26 08:40:51,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 11: [2022-11-26 08:40:51,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 7: [2022-11-26 08:40:51,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 11: [2022-11-26 08:40:51,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 2: [2022-11-26 08:40:51,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:40:51,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 08:40:51,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 0: [2022-11-26 08:40:51,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:40:51,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:40:51,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 0: [2022-11-26 08:40:51,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 16: [2022-11-26 08:40:51,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
0: [2022-11-26 08:40:51,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 28: [2022-11-26 08:40:51,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 08:40:51,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 31: [2022-11-26 08:40:51,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:40:51,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:40:51,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 08:40:51,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 24: [2022-11-26 08:40:51,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 08:40:51,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 27: [2022-11-26 08:40:51,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:40:51,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 08:40:51,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
25: [2022-11-26 08:40:51,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:40:51,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 08:40:51,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 8: [2022-11-26 08:40:51,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:40:51,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 08:40:51,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 6: [2022-11-26 08:40:51,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:40:51,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:40:51,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 08:40:51,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 26: [2022-11-26 08:40:51,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 08:40:51,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
31: [2022-11-26 08:40:51,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:40:51,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 08:40:51,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 12: [2022-11-26 08:40:51,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:40:51,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:40:51,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 08:40:51,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 15: [2022-11-26 08:40:51,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 08:40:51,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 29: [2022-11-26 08:40:51,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:40:51,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 08:40:51,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
23: [2022-11-26 08:40:51,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:40:51,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 17: [2022-11-26 08:40:51,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:40:51,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:40:51,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 18: [2022-11-26 08:40:51,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 08:40:51,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 17: [2022-11-26 08:40:51,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 08:40:51,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 3: [2022-11-26 08:40:51,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:40:51,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 08:40:51,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
10: [2022-11-26 08:40:51,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:40:51,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:40:51,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 08:40:51,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 19: [2022-11-26 08:40:51,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 08:40:51,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 14: [2022-11-26 08:40:51,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:40:51,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 08:40:51,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 5: [2022-11-26 08:40:51,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:40:51,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 08:40:51,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
28: [2022-11-26 08:40:51,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 08:40:51,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 08:40:51,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 30: [2022-11-26 08:40:51,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:40:51,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 08:40:51,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 4: [2022-11-26 08:40:51,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:40:51,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 08:40:51,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 9: [2022-11-26 08:40:51,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:40:51,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 08:40:51,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
13: [2022-11-26 08:40:51,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:40:51,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 08:40:51,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 27: [2022-11-26 08:40:51,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:40:51,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 08:40:51,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 20: [2022-11-26 08:40:51,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:40:51,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 08:40:51,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 7: [2022-11-26 08:40:51,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:40:51,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 08:40:51,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
9: [2022-11-26 08:40:51,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:40:51,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:40:51,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 14: [2022-11-26 08:40:51,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 9: [2022-11-26 08:40:51,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 14: [2022-11-26 08:40:51,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 16: [2022-11-26 08:40:51,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:40:51,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 08:40:51,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 15: [2022-11-26 08:40:51,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:40:51,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 08:40:51,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
23: [2022-11-26 08:40:51,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:40:51,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 08:40:51,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 4: [2022-11-26 08:40:51,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:40:51,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 08:40:51,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 23: [2022-11-26 08:40:51,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:40:51,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step65000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 08:40:51,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
0: successfully saved checkpoint at iteration 65000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2574.33 31: iteration 65010/ 173500 | consumed samples: 16642560 | consumed tokens: 34083962880 | elapsed time per iteration (s): 1.07 | learning rate: 1.462E-04 | global batch size: 256 | lm loss: 2.017793E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.965 | TFLOPs: 14.46 | 31: iteration 65020/ 173500 | consumed samples: 16645120 | consumed tokens: 34089205760 | elapsed time per iteration (s): 0.84 | learning rate: 1.461E-04 | global batch size: 256 | lm loss: 2.040537E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.806 | TFLOPs: 18.38 | 31: iteration 65030/ 173500 | consumed samples: 16647680 | consumed tokens: 34094448640 | elapsed time per iteration (s): 0.84 | learning rate: 1.461E-04 | global batch size: 256 | lm loss: 2.013402E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.876 | TFLOPs: 18.50 | 31: iteration 65040/ 173500 | consumed samples: 16650240 | consumed tokens: 34099691520 | elapsed time per iteration (s): 0.84 | learning rate: 1.461E-04 | global batch size: 256 | lm loss: 2.044311E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.712 | TFLOPs: 18.43 | 31: iteration 65050/ 173500 | consumed samples: 16652800 | consumed tokens: 34104934400 | elapsed time per iteration (s): 0.80 | learning rate: 1.461E-04 | global batch size: 256 | lm loss: 2.051743E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.563 | TFLOPs: 19.27 | 31: iteration 65060/ 173500 | consumed samples: 16655360 | consumed tokens: 34110177280 | elapsed time per iteration (s): 0.90 | 
learning rate: 1.461E-04 | global batch size: 256 | lm loss: 2.043859E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 283.529 | TFLOPs: 17.15 | 31: iteration 65070/ 173500 | consumed samples: 16657920 | consumed tokens: 34115420160 | elapsed time per iteration (s): 0.91 | learning rate: 1.461E-04 | global batch size: 256 | lm loss: 2.039287E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.144 | TFLOPs: 17.07 | 31: iteration 65080/ 173500 | consumed samples: 16660480 | consumed tokens: 34120663040 | elapsed time per iteration (s): 0.80 | learning rate: 1.461E-04 | global batch size: 256 | lm loss: 2.051385E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.338 | TFLOPs: 19.44 | 31: iteration 65090/ 173500 | consumed samples: 16663040 | consumed tokens: 34125905920 | elapsed time per iteration (s): 0.82 | learning rate: 1.460E-04 | global batch size: 256 | lm loss: 2.042033E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.321 | TFLOPs: 18.83 | 31: iteration 65100/ 173500 | consumed samples: 16665600 | consumed tokens: 34131148800 | elapsed time per iteration (s): 0.88 | learning rate: 1.460E-04 | global batch size: 256 | lm loss: 2.030684E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.682 | TFLOPs: 17.59 | 31: iteration 65110/ 173500 | consumed samples: 16668160 | consumed tokens: 34136391680 | elapsed time per iteration (s): 0.82 | learning rate: 1.460E-04 | global batch size: 256 | lm loss: 2.014529E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.595 | TFLOPs: 18.91 | 31: iteration 65120/ 
173500 | consumed samples: 16670720 | consumed tokens: 34141634560 | elapsed time per iteration (s): 0.85 | learning rate: 1.460E-04 | global batch size: 256 | lm loss: 1.985461E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.470 | TFLOPs: 18.30 | 31: iteration 65130/ 173500 | consumed samples: 16673280 | consumed tokens: 34146877440 | elapsed time per iteration (s): 0.81 | learning rate: 1.460E-04 | global batch size: 256 | lm loss: 2.047751E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.349 | TFLOPs: 19.14 | 31: iteration 65140/ 173500 | consumed samples: 16675840 | consumed tokens: 34152120320 | elapsed time per iteration (s): 0.80 | learning rate: 1.460E-04 | global batch size: 256 | lm loss: 2.064799E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.113 | TFLOPs: 19.25 | 31: iteration 65150/ 173500 | consumed samples: 16678400 | consumed tokens: 34157363200 | elapsed time per iteration (s): 0.87 | learning rate: 1.460E-04 | global batch size: 256 | lm loss: 2.045596E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.512 | TFLOPs: 17.88 | 31: iteration 65160/ 173500 | consumed samples: 16680960 | consumed tokens: 34162606080 | elapsed time per iteration (s): 0.80 | learning rate: 1.459E-04 | global batch size: 256 | lm loss: 2.037131E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.611 | TFLOPs: 19.40 | 31: iteration 65170/ 173500 | consumed samples: 16683520 | consumed tokens: 34167848960 | elapsed time per iteration (s): 0.76 | learning rate: 1.459E-04 | global batch size: 256 | lm loss: 2.048558E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 335.086 | TFLOPs: 20.27 | 31: iteration 65180/ 173500 | consumed samples: 16686080 | consumed tokens: 34173091840 | elapsed time per iteration (s): 0.76 | learning rate: 1.459E-04 | global batch size: 256 | lm loss: 2.034362E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.623 | TFLOPs: 20.49 | 31: iteration 65190/ 173500 | consumed samples: 16688640 | consumed tokens: 34178334720 | elapsed time per iteration (s): 0.74 | learning rate: 1.459E-04 | global batch size: 256 | lm loss: 2.049916E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.092 | TFLOPs: 20.88 | 31: iteration 65200/ 173500 | consumed samples: 16691200 | consumed tokens: 34183577600 | elapsed time per iteration (s): 0.84 | learning rate: 1.459E-04 | global batch size: 256 | lm loss: 2.057313E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.377 | TFLOPs: 18.41 | 31: iteration 65210/ 173500 | consumed samples: 16693760 | consumed tokens: 34188820480 | elapsed time per iteration (s): 0.81 | learning rate: 1.459E-04 | global batch size: 256 | lm loss: 2.004773E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.272 | TFLOPs: 19.07 | 31: iteration 65220/ 173500 | consumed samples: 16696320 | consumed tokens: 34194063360 | elapsed time per iteration (s): 0.80 | learning rate: 1.458E-04 | global batch size: 256 | lm loss: 2.045708E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.690 | TFLOPs: 19.46 | 31: iteration 65230/ 173500 | consumed samples: 16698880 | consumed tokens: 34199306240 | elapsed time per iteration (s): 0.75 | learning rate: 
1.458E-04 | global batch size: 256 | lm loss: 2.040077E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.602 | TFLOPs: 20.55 | 31: iteration 65240/ 173500 | consumed samples: 16701440 | consumed tokens: 34204549120 | elapsed time per iteration (s): 0.73 | learning rate: 1.458E-04 | global batch size: 256 | lm loss: 2.027668E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.984 | TFLOPs: 21.17 | 31: iteration 65250/ 173500 | consumed samples: 16704000 | consumed tokens: 34209792000 | elapsed time per iteration (s): 0.81 | learning rate: 1.458E-04 | global batch size: 256 | lm loss: 2.053517E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.582 | TFLOPs: 19.21 | 31: iteration 65260/ 173500 | consumed samples: 16706560 | consumed tokens: 34215034880 | elapsed time per iteration (s): 0.72 | learning rate: 1.458E-04 | global batch size: 256 | lm loss: 2.050146E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.532 | TFLOPs: 21.39 | 31: iteration 65270/ 173500 | consumed samples: 16709120 | consumed tokens: 34220277760 | elapsed time per iteration (s): 0.83 | learning rate: 1.458E-04 | global batch size: 256 | lm loss: 2.040924E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.276 | TFLOPs: 18.59 | 31: iteration 65280/ 173500 | consumed samples: 16711680 | consumed tokens: 34225520640 | elapsed time per iteration (s): 0.79 | learning rate: 1.458E-04 | global batch size: 256 | lm loss: 2.045272E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.843 | TFLOPs: 19.71 | 31: iteration 65290/ 173500 | 
consumed samples: 16714240 | consumed tokens: 34230763520 | elapsed time per iteration (s): 0.74 | learning rate: 1.457E-04 | global batch size: 256 | lm loss: 2.040495E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.491 | TFLOPs: 20.84 | 31: iteration 65300/ 173500 | consumed samples: 16716800 | consumed tokens: 34236006400 | elapsed time per iteration (s): 0.74 | learning rate: 1.457E-04 | global batch size: 256 | lm loss: 2.053842E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.054 | TFLOPs: 21.06 | 31: iteration 65310/ 173500 | consumed samples: 16719360 | consumed tokens: 34241249280 | elapsed time per iteration (s): 0.81 | learning rate: 1.457E-04 | global batch size: 256 | lm loss: 2.040816E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.763 | TFLOPs: 19.16 | 31: iteration 65320/ 173500 | consumed samples: 16721920 | consumed tokens: 34246492160 | elapsed time per iteration (s): 0.73 | learning rate: 1.457E-04 | global batch size: 256 | lm loss: 2.032451E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.775 | TFLOPs: 21.22 | 31: iteration 65330/ 173500 | consumed samples: 16724480 | consumed tokens: 34251735040 | elapsed time per iteration (s): 0.77 | learning rate: 1.457E-04 | global batch size: 256 | lm loss: 2.026150E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.751 | TFLOPs: 20.13 | 31: iteration 65340/ 173500 | consumed samples: 16727040 | consumed tokens: 34256977920 | elapsed time per iteration (s): 0.78 | learning rate: 1.457E-04 | global batch size: 256 | lm loss: 2.025169E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 329.677 | TFLOPs: 19.94 | 31: iteration 65350/ 173500 | consumed samples: 16729600 | consumed tokens: 34262220800 | elapsed time per iteration (s): 0.76 | learning rate: 1.457E-04 | global batch size: 256 | lm loss: 2.021060E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.046 | TFLOPs: 20.39 | 31: iteration 65360/ 173500 | consumed samples: 16732160 | consumed tokens: 34267463680 | elapsed time per iteration (s): 0.84 | learning rate: 1.456E-04 | global batch size: 256 | lm loss: 2.022876E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.683 | TFLOPs: 18.49 | 31: iteration 65370/ 173500 | consumed samples: 16734720 | consumed tokens: 34272706560 | elapsed time per iteration (s): 0.78 | learning rate: 1.456E-04 | global batch size: 256 | lm loss: 2.049385E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.171 | TFLOPs: 19.73 | 31: iteration 65380/ 173500 | consumed samples: 16737280 | consumed tokens: 34277949440 | elapsed time per iteration (s): 0.79 | learning rate: 1.456E-04 | global batch size: 256 | lm loss: 2.050173E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.857 | TFLOPs: 19.53 | 31: iteration 65390/ 173500 | consumed samples: 16739840 | consumed tokens: 34283192320 | elapsed time per iteration (s): 0.80 | learning rate: 1.456E-04 | global batch size: 256 | lm loss: 2.012724E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.926 | TFLOPs: 19.48 | 31: iteration 65400/ 173500 | consumed samples: 16742400 | consumed tokens: 34288435200 | elapsed time per iteration (s): 0.77 | learning rate: 1.456E-04 | global batch 
size: 256 | lm loss: 2.025312E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.510 | TFLOPs: 20.06 | 31: iteration 65410/ 173500 | consumed samples: 16744960 | consumed tokens: 34293678080 | elapsed time per iteration (s): 0.74 | learning rate: 1.456E-04 | global batch size: 256 | lm loss: 2.027111E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.839 | TFLOPs: 21.04 | 31: iteration 65420/ 173500 | consumed samples: 16747520 | consumed tokens: 34298920960 | elapsed time per iteration (s): 0.77 | learning rate: 1.455E-04 | global batch size: 256 | lm loss: 2.052371E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.487 | TFLOPs: 20.24 | 31: iteration 65430/ 173500 | consumed samples: 16750080 | consumed tokens: 34304163840 | elapsed time per iteration (s): 0.79 | learning rate: 1.455E-04 | global batch size: 256 | lm loss: 2.037022E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.928 | TFLOPs: 19.66 | 31: iteration 65440/ 173500 | consumed samples: 16752640 | consumed tokens: 34309406720 | elapsed time per iteration (s): 0.80 | learning rate: 1.455E-04 | global batch size: 256 | lm loss: 2.051030E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.576 | TFLOPs: 19.45 | 31: iteration 65450/ 173500 | consumed samples: 16755200 | consumed tokens: 34314649600 | elapsed time per iteration (s): 0.78 | learning rate: 1.455E-04 | global batch size: 256 | lm loss: 2.015874E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.166 | TFLOPs: 19.79 | 31: iteration 65460/ 173500 | consumed samples: 16757760 | 
consumed tokens: 34319892480 | elapsed time per iteration (s): 0.75 | learning rate: 1.455E-04 | global batch size: 256 | lm loss: 2.043286E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.556 | TFLOPs: 20.66 | 31: iteration 65470/ 173500 | consumed samples: 16760320 | consumed tokens: 34325135360 | elapsed time per iteration (s): 0.74 | learning rate: 1.455E-04 | global batch size: 256 | lm loss: 2.011698E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.141 | TFLOPs: 21.06 | 31: iteration 65480/ 173500 | consumed samples: 16762880 | consumed tokens: 34330378240 | elapsed time per iteration (s): 0.78 | learning rate: 1.455E-04 | global batch size: 256 | lm loss: 2.058558E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.859 | TFLOPs: 19.77 | 31: iteration 65490/ 173500 | consumed samples: 16765440 | consumed tokens: 34335621120 | elapsed time per iteration (s): 0.78 | learning rate: 1.454E-04 | global batch size: 256 | lm loss: 2.036037E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.297 | TFLOPs: 19.98 | 31: iteration 65500/ 173500 | consumed samples: 16768000 | consumed tokens: 34340864000 | elapsed time per iteration (s): 0.77 | learning rate: 1.454E-04 | global batch size: 256 | lm loss: 2.038711E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.654 | TFLOPs: 20.06 | 31: iteration 65510/ 173500 | consumed samples: 16770560 | consumed tokens: 34346106880 | elapsed time per iteration (s): 0.75 | learning rate: 1.454E-04 | global batch size: 256 | lm loss: 2.023335E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 341.613 | TFLOPs: 20.67 | 31: iteration 65520/ 173500 | consumed samples: 16773120 | consumed tokens: 34351349760 | elapsed time per iteration (s): 0.81 | learning rate: 1.454E-04 | global batch size: 256 | lm loss: 2.020901E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.395 | TFLOPs: 19.08 | 31: iteration 65530/ 173500 | consumed samples: 16775680 | consumed tokens: 34356592640 | elapsed time per iteration (s): 0.80 | learning rate: 1.454E-04 | global batch size: 256 | lm loss: 2.026483E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.625 | TFLOPs: 19.46 | 31: iteration 65540/ 173500 | consumed samples: 16778240 | consumed tokens: 34361835520 | elapsed time per iteration (s): 0.86 | learning rate: 1.454E-04 | global batch size: 256 | lm loss: 2.016886E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.020 | TFLOPs: 17.91 | 31: iteration 65550/ 173500 | consumed samples: 16780800 | consumed tokens: 34367078400 | elapsed time per iteration (s): 0.83 | learning rate: 1.453E-04 | global batch size: 256 | lm loss: 2.028830E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.075 | TFLOPs: 18.64 | 31: iteration 65560/ 173500 | consumed samples: 16783360 | consumed tokens: 34372321280 | elapsed time per iteration (s): 0.84 | learning rate: 1.453E-04 | global batch size: 256 | lm loss: 2.017462E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.541 | TFLOPs: 18.42 | 31: iteration 65570/ 173500 | consumed samples: 16785920 | consumed tokens: 34377564160 | elapsed time per iteration (s): 0.83 | learning rate: 1.453E-04 | global batch size: 256 | lm loss: 
2.057878E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.251 | TFLOPs: 18.59 | 31: iteration 65580/ 173500 | consumed samples: 16788480 | consumed tokens: 34382807040 | elapsed time per iteration (s): 0.84 | learning rate: 1.453E-04 | global batch size: 256 | lm loss: 2.022469E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.584 | TFLOPs: 18.55 | 31: iteration 65590/ 173500 | consumed samples: 16791040 | consumed tokens: 34388049920 | elapsed time per iteration (s): 0.80 | learning rate: 1.453E-04 | global batch size: 256 | lm loss: 2.041060E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.787 | TFLOPs: 19.41 | 31: iteration 65600/ 173500 | consumed samples: 16793600 | consumed tokens: 34393292800 | elapsed time per iteration (s): 0.80 | learning rate: 1.453E-04 | global batch size: 256 | lm loss: 2.034437E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.404 | TFLOPs: 19.38 | 31: iteration 65610/ 173500 | consumed samples: 16796160 | consumed tokens: 34398535680 | elapsed time per iteration (s): 0.83 | learning rate: 1.453E-04 | global batch size: 256 | lm loss: 2.009744E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.231 | TFLOPs: 18.65 | 31: iteration 65620/ 173500 | consumed samples: 16798720 | consumed tokens: 34403778560 | elapsed time per iteration (s): 0.85 | learning rate: 1.452E-04 | global batch size: 256 | lm loss: 2.050452E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.392 | TFLOPs: 18.29 | 31: iteration 65630/ 173500 | consumed samples: 16801280 | consumed tokens: 
34409021440 | elapsed time per iteration (s): 0.85 | learning rate: 1.452E-04 | global batch size: 256 | lm loss: 2.005435E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.141 | TFLOPs: 18.28 | 31: iteration 65640/ 173500 | consumed samples: 16803840 | consumed tokens: 34414264320 | elapsed time per iteration (s): 0.88 | learning rate: 1.452E-04 | global batch size: 256 | lm loss: 2.018499E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.457 | TFLOPs: 17.57 | 31: iteration 65650/ 173500 | consumed samples: 16806400 | consumed tokens: 34419507200 | elapsed time per iteration (s): 0.78 | learning rate: 1.452E-04 | global batch size: 256 | lm loss: 2.039629E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.388 | TFLOPs: 19.87 | 31: iteration 65660/ 173500 | consumed samples: 16808960 | consumed tokens: 34424750080 | elapsed time per iteration (s): 0.79 | learning rate: 1.452E-04 | global batch size: 256 | lm loss: 2.023283E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.155 | TFLOPs: 19.49 | 31: iteration 65670/ 173500 | consumed samples: 16811520 | consumed tokens: 34429992960 | elapsed time per iteration (s): 0.80 | learning rate: 1.452E-04 | global batch size: 256 | lm loss: 2.021707E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.499 | TFLOPs: 19.27 | 31: iteration 65680/ 173500 | consumed samples: 16814080 | consumed tokens: 34435235840 | elapsed time per iteration (s): 0.80 | learning rate: 1.452E-04 | global batch size: 256 | lm loss: 2.035868E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 319.668 | TFLOPs: 19.34 | 31: iteration 65690/ 173500 | consumed samples: 16816640 | consumed tokens: 34440478720 | elapsed time per iteration (s): 0.77 | learning rate: 1.451E-04 | global batch size: 256 | lm loss: 2.029568E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.447 | TFLOPs: 20.11 | 31: iteration 65700/ 173500 | consumed samples: 16819200 | consumed tokens: 34445721600 | elapsed time per iteration (s): 0.82 | learning rate: 1.451E-04 | global batch size: 256 | lm loss: 2.026683E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.832 | TFLOPs: 18.99 | 31: iteration 65710/ 173500 | consumed samples: 16821760 | consumed tokens: 34450964480 | elapsed time per iteration (s): 0.76 | learning rate: 1.451E-04 | global batch size: 256 | lm loss: 2.005770E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.088 | TFLOPs: 20.39 | 31: iteration 65720/ 173500 | consumed samples: 16824320 | consumed tokens: 34456207360 | elapsed time per iteration (s): 0.79 | learning rate: 1.451E-04 | global batch size: 256 | lm loss: 2.020705E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.725 | TFLOPs: 19.58 | 31: iteration 65730/ 173500 | consumed samples: 16826880 | consumed tokens: 34461450240 | elapsed time per iteration (s): 0.73 | learning rate: 1.451E-04 | global batch size: 256 | lm loss: 2.018486E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.856 | TFLOPs: 21.17 | 31: iteration 65740/ 173500 | consumed samples: 16829440 | consumed tokens: 34466693120 | elapsed time per iteration (s): 0.77 | learning rate: 1.451E-04 | global batch size: 256 | lm loss: 2.058995E+00 | grad 
norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.812 | TFLOPs: 20.13 | 31: iteration 65750/ 173500 | consumed samples: 16832000 | consumed tokens: 34471936000 | elapsed time per iteration (s): 0.78 | learning rate: 1.450E-04 | global batch size: 256 | lm loss: 2.053613E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.279 | TFLOPs: 19.98 | 31: iteration 65760/ 173500 | consumed samples: 16834560 | consumed tokens: 34477178880 | elapsed time per iteration (s): 0.79 | learning rate: 1.450E-04 | global batch size: 256 | lm loss: 2.044083E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.094 | TFLOPs: 19.67 | 31: iteration 65770/ 173500 | consumed samples: 16837120 | consumed tokens: 34482421760 | elapsed time per iteration (s): 0.75 | learning rate: 1.450E-04 | global batch size: 256 | lm loss: 2.022223E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.836 | TFLOPs: 20.56 | 31: iteration 65780/ 173500 | consumed samples: 16839680 | consumed tokens: 34487664640 | elapsed time per iteration (s): 0.79 | learning rate: 1.450E-04 | global batch size: 256 | lm loss: 2.020200E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.229 | TFLOPs: 19.68 | 31: iteration 65790/ 173500 | consumed samples: 16842240 | consumed tokens: 34492907520 | elapsed time per iteration (s): 0.79 | learning rate: 1.450E-04 | global batch size: 256 | lm loss: 2.042102E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.982 | TFLOPs: 19.60 | 31: iteration 65800/ 173500 | consumed samples: 16844800 | consumed tokens: 34498150400 | elapsed time 
per iteration (s): 0.76 | learning rate: 1.450E-04 | global batch size: 256 | lm loss: 2.010806E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.669 | TFLOPs: 20.49 | 31: iteration 65810/ 173500 | consumed samples: 16847360 | consumed tokens: 34503393280 | elapsed time per iteration (s): 0.78 | learning rate: 1.450E-04 | global batch size: 256 | lm loss: 2.038213E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.179 | TFLOPs: 19.85 | 31: iteration 65820/ 173500 | consumed samples: 16849920 | consumed tokens: 34508636160 | elapsed time per iteration (s): 0.74 | learning rate: 1.449E-04 | global batch size: 256 | lm loss: 2.043742E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.106 | TFLOPs: 20.94 | 31: iteration 65830/ 173500 | consumed samples: 16852480 | consumed tokens: 34513879040 | elapsed time per iteration (s): 0.80 | learning rate: 1.449E-04 | global batch size: 256 | lm loss: 2.017998E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.893 | TFLOPs: 19.47 | 31: iteration 65840/ 173500 | consumed samples: 16855040 | consumed tokens: 34519121920 | elapsed time per iteration (s): 0.81 | learning rate: 1.449E-04 | global batch size: 256 | lm loss: 2.026546E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.519 | TFLOPs: 19.09 | 31: iteration 65850/ 173500 | consumed samples: 16857600 | consumed tokens: 34524364800 | elapsed time per iteration (s): 0.80 | learning rate: 1.449E-04 | global batch size: 256 | lm loss: 2.018760E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.812 | TFLOPs: 
19.35 | 31: iteration 65860/ 173500 | consumed samples: 16860160 | consumed tokens: 34529607680 | elapsed time per iteration (s): 0.84 | learning rate: 1.449E-04 | global batch size: 256 | lm loss: 2.031478E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.234 | TFLOPs: 18.41 | 31: iteration 65870/ 173500 | consumed samples: 16862720 | consumed tokens: 34534850560 | elapsed time per iteration (s): 0.79 | learning rate: 1.449E-04 | global batch size: 256 | lm loss: 2.041665E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.591 | TFLOPs: 19.70 | 31: iteration 65880/ 173500 | consumed samples: 16865280 | consumed tokens: 34540093440 | elapsed time per iteration (s): 0.81 | learning rate: 1.448E-04 | global batch size: 256 | lm loss: 2.048611E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.796 | TFLOPs: 19.23 | 31: iteration 65890/ 173500 | consumed samples: 16867840 | consumed tokens: 34545336320 | elapsed time per iteration (s): 0.84 | learning rate: 1.448E-04 | global batch size: 256 | lm loss: 2.021711E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.104 | TFLOPs: 18.40 | 31: iteration 65900/ 173500 | consumed samples: 16870400 | consumed tokens: 34550579200 | elapsed time per iteration (s): 0.88 | learning rate: 1.448E-04 | global batch size: 256 | lm loss: 2.032811E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.245 | TFLOPs: 17.56 | 31: iteration 65910/ 173500 | consumed samples: 16872960 | consumed tokens: 34555822080 | elapsed time per iteration (s): 0.88 | learning rate: 1.448E-04 | global batch size: 256 | lm loss: 2.006070E+00 | grad norm: 0.130 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.692 | TFLOPs: 17.65 | 31: iteration 65920/ 173500 | consumed samples: 16875520 | consumed tokens: 34561064960 | elapsed time per iteration (s): 0.82 | learning rate: 1.448E-04 | global batch size: 256 | lm loss: 2.055538E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.864 | TFLOPs: 18.81 | 31: iteration 65930/ 173500 | consumed samples: 16878080 | consumed tokens: 34566307840 | elapsed time per iteration (s): 0.79 | learning rate: 1.448E-04 | global batch size: 256 | lm loss: 2.031322E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.847 | TFLOPs: 19.53 | 31: iteration 65940/ 173500 | consumed samples: 16880640 | consumed tokens: 34571550720 | elapsed time per iteration (s): 0.84 | learning rate: 1.448E-04 | global batch size: 256 | lm loss: 2.056441E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.559 | TFLOPs: 18.55 | 31: iteration 65950/ 173500 | consumed samples: 16883200 | consumed tokens: 34576793600 | elapsed time per iteration (s): 0.80 | learning rate: 1.447E-04 | global batch size: 256 | lm loss: 2.043707E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.745 | TFLOPs: 19.34 | 31: iteration 65960/ 173500 | consumed samples: 16885760 | consumed tokens: 34582036480 | elapsed time per iteration (s): 0.81 | learning rate: 1.447E-04 | global batch size: 256 | lm loss: 2.038104E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.169 | TFLOPs: 19.19 | 31: iteration 65970/ 173500 | consumed samples: 16888320 | consumed tokens: 34587279360 | elapsed time per iteration (s): 0.78 | 
learning rate: 1.447E-04 | global batch size: 256 | lm loss: 2.039061E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.556 | TFLOPs: 19.82 | 31: iteration 65980/ 173500 | consumed samples: 16890880 | consumed tokens: 34592522240 | elapsed time per iteration (s): 0.74 | learning rate: 1.447E-04 | global batch size: 256 | lm loss: 2.028988E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.852 | TFLOPs: 20.92 | 31: iteration 65990/ 173500 | consumed samples: 16893440 | consumed tokens: 34597765120 | elapsed time per iteration (s): 0.78 | learning rate: 1.447E-04 | global batch size: 256 | lm loss: 2.031551E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.567 | TFLOPs: 19.94 | 0: [2022-11-26 08:54:10,315] [INFO] [logging.py:68:log_dist] [Rank 0] step=66000, skipped=0, lr=[0.00014466507355770288, 0.00014466507355770288, 0.00014466507355770288], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 66000/ 173500 | consumed samples: 16896000 | consumed tokens: 34603008000 | elapsed time per iteration (s): 0.82 | learning rate: 1.447E-04 | global batch size: 256 | lm loss: 2.034325E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.242 | TFLOPs: 18.95 | 0: steps: 66000 loss: 2.0552 iter time (s): 0.796 samples/sec: 321.772 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 66000 | lm loss value: 2.004344E+00 | lm loss PPL: 7.421222E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 66000 to checkpoints_1b1long 0: [2022-11-26 08:54:10,664] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] 
Checkpoint global_step66000 is begin to save! 0: [2022-11-26 08:54:10,672] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_01-model_00-model_states.pt... 0: [2022-11-26 08:54:10,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_01-model_00-model_states.pt. 0: [2022-11-26 08:54:10,922] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_03-model_00-model_states.pt... 0: [2022-11-26 08:54:11,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_03-model_00-model_states.pt. 0: [2022-11-26 08:54:11,001] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_04-model_00-model_states.pt... 0: [2022-11-26 08:54:11,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_04-model_00-model_states.pt. 0: [2022-11-26 08:54:11,078] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_05-model_00-model_states.pt... 0: [2022-11-26 08:54:11,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_05-model_00-model_states.pt. 0: [2022-11-26 08:54:11,155] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_06-model_00-model_states.pt... 0: [2022-11-26 08:54:11,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_06-model_00-model_states.pt. 0: [2022-11-26 08:54:11,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_07-model_00-model_states.pt... 0: [2022-11-26 08:54:11,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 08:54:11,302] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_08-model_00-model_states.pt... 0: [2022-11-26 08:54:11,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_08-model_00-model_states.pt. 0: [2022-11-26 08:54:11,375] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_09-model_00-model_states.pt... 0: [2022-11-26 08:54:11,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_09-model_00-model_states.pt. 0: [2022-11-26 08:54:11,450] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_10-model_00-model_states.pt... 0: [2022-11-26 08:54:11,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_10-model_00-model_states.pt. 0: [2022-11-26 08:54:11,523] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_11-model_00-model_states.pt... 0: [2022-11-26 08:54:11,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_11-model_00-model_states.pt. 0: [2022-11-26 08:54:11,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_12-model_00-model_states.pt... 0: [2022-11-26 08:54:11,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_12-model_00-model_states.pt. 0: [2022-11-26 08:54:11,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_13-model_00-model_states.pt... 0: [2022-11-26 08:54:11,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 08:54:11,744] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_14-model_00-model_states.pt... 0: [2022-11-26 08:54:11,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_14-model_00-model_states.pt. 0: [2022-11-26 08:54:11,815] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_15-model_00-model_states.pt... 0: [2022-11-26 08:54:11,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_15-model_00-model_states.pt. 0: [2022-11-26 08:54:11,890] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_16-model_00-model_states.pt... 0: [2022-11-26 08:54:11,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_16-model_00-model_states.pt. 0: [2022-11-26 08:54:11,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_17-model_00-model_states.pt... 0: [2022-11-26 08:54:12,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_17-model_00-model_states.pt. 0: [2022-11-26 08:54:12,034] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_18-model_00-model_states.pt... 0: [2022-11-26 08:54:12,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_18-model_00-model_states.pt. 0: [2022-11-26 08:54:12,109] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_19-model_00-model_states.pt... 0: [2022-11-26 08:54:12,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 08:54:12,182] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_20-model_00-model_states.pt... 0: [2022-11-26 08:54:12,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_20-model_00-model_states.pt. 0: [2022-11-26 08:54:12,256] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_21-model_00-model_states.pt... 0: [2022-11-26 08:54:12,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_21-model_00-model_states.pt. 0: [2022-11-26 08:54:12,329] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_22-model_00-model_states.pt... 0: [2022-11-26 08:54:12,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_22-model_00-model_states.pt. 0: [2022-11-26 08:54:12,401] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_23-model_00-model_states.pt... 0: [2022-11-26 08:54:12,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_23-model_00-model_states.pt. 0: [2022-11-26 08:54:12,475] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_24-model_00-model_states.pt... 0: [2022-11-26 08:54:12,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_24-model_00-model_states.pt. 0: [2022-11-26 08:54:12,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_25-model_00-model_states.pt... 0: [2022-11-26 08:54:12,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 08:54:12,622] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_26-model_00-model_states.pt... 0: [2022-11-26 08:54:12,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_26-model_00-model_states.pt. 0: [2022-11-26 08:54:12,695] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_27-model_00-model_states.pt... 0: [2022-11-26 08:54:12,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_27-model_00-model_states.pt. 0: [2022-11-26 08:54:12,769] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_28-model_00-model_states.pt... 0: [2022-11-26 08:54:12,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_28-model_00-model_states.pt. 0: [2022-11-26 08:54:12,847] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/layer_30-model_00-model_states.pt... 0: [2022-11-26 08:54:12,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/layer_30-model_00-model_states.pt. 0: [2022-11-26 08:54:12,853] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step66000/mp_rank_00_model_states.pt 0: [2022-11-26 08:54:12,853] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/mp_rank_00_model_states.pt... 0: [2022-11-26 08:54:12,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/mp_rank_00_model_states.pt. 0: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
6: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
7: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 
1: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
2: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
3: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 
23: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 
24: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 
30: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 
18: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
0: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
4: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
2: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
15: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 
28: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 
22: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 
26: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 
6: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 
16: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 
23: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 
22: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 
27: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 
11: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 18: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
4: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 31: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 17: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 20: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 
17: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:54:12,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 19: [2022-11-26 08:54:12,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:54:12,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 08:54:12,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 3: [2022-11-26 08:54:12,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:54:12,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 08:54:12,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 17: [2022-11-26 08:54:12,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:54:12,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 08:54:12,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 10: [2022-11-26 08:54:12,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-26 08:54:12,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 6: [2022-11-26 08:54:12,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:54:12,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 6: [2022-11-26 08:54:12,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 08:54:12,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 10: [2022-11-26 08:54:12,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:54:12,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 08:54:12,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 19: [2022-11-26 08:54:12,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:54:12,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 08:54:12,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 5: [2022-11-26 08:54:12,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 08:54:12,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 08:54:12,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 11: [2022-11-26 08:54:12,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:54:12,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:54:12,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 11: [2022-11-26 08:54:12,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 8: [2022-11-26 08:54:12,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 11: [2022-11-26 08:54:12,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 3: [2022-11-26 08:54:12,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:54:12,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 08:54:12,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 20: [2022-11-26 08:54:12,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 
20: [2022-11-26 08:54:12,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 08:54:12,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 6: [2022-11-26 08:54:12,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:54:12,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 08:54:12,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 25: [2022-11-26 08:54:12,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:54:12,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 11: [2022-11-26 08:54:12,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:54:12,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
11: [2022-11-26 08:54:12,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 22: [2022-11-26 08:54:12,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 25: [2022-11-26 08:54:12,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 11: [2022-11-26 08:54:12,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 24: [2022-11-26 08:54:12,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:54:12,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 23: [2022-11-26 08:54:12,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:54:12,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 23: [2022-11-26 08:54:12,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 24: [2022-11-26 08:54:12,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 23: [2022-11-26 08:54:12,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 23: [2022-11-26 08:54:12,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-26 08:54:12,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 08:54:12,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 24: [2022-11-26 08:54:12,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:54:12,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:54:12,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 14: [2022-11-26 08:54:12,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:54:12,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 24: [2022-11-26 08:54:12,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 14: [2022-11-26 08:54:12,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:54:12,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 30: [2022-11-26 08:54:12,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 14: [2022-11-26 08:54:12,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
14: [2022-11-26 08:54:12,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 08:54:12,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 16: [2022-11-26 08:54:12,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:54:12,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 08:54:12,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 22: [2022-11-26 08:54:12,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:54:12,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:54:12,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 30: [2022-11-26 08:54:12,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 22: [2022-11-26 08:54:12,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 30: [2022-11-26 08:54:12,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 15: [2022-11-26 08:54:12,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-26 08:54:12,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:54:12,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 08:54:12,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 08:54:12,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 15: [2022-11-26 08:54:12,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 9: [2022-11-26 08:54:12,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:54:12,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 08:54:12,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 27: [2022-11-26 08:54:12,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:54:12,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 08:54:12,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 27: [2022-11-26 08:54:12,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
27: [2022-11-26 08:54:12,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 08:54:12,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 5: [2022-11-26 08:54:12,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:54:12,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 08:54:12,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 11: [2022-11-26 08:54:12,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:54:12,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 08:54:12,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 19: [2022-11-26 08:54:12,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:54:12,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 25: [2022-11-26 08:54:12,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:54:12,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
25: [2022-11-26 08:54:12,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 08:54:12,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 8: [2022-11-26 08:54:12,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:54:12,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 0: [2022-11-26 08:54:12,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:54:12,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 8: [2022-11-26 08:54:12,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 0: [2022-11-26 08:54:12,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 1: [2022-11-26 08:54:12,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:54:12,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:54:12,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 08:54:12,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
14: [2022-11-26 08:54:12,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:54:12,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 08:54:12,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 30: [2022-11-26 08:54:12,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 08:54:12,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 22: [2022-11-26 08:54:13,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:54:13,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 5: [2022-11-26 08:54:13,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:54:13,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 22: [2022-11-26 08:54:13,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 5: [2022-11-26 08:54:13,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 20: [2022-11-26 08:54:13,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-26 08:54:13,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 08:54:13,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 17: [2022-11-26 08:54:13,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:54:13,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 08:54:13,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 10: [2022-11-26 08:54:13,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:54:13,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 08:54:13,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 9: [2022-11-26 08:54:13,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:54:13,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 08:54:13,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 08:54:13,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 24: [2022-11-26 08:54:13,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:54:13,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 9: [2022-11-26 08:54:13,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 2: [2022-11-26 08:54:13,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:54:13,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 2: [2022-11-26 08:54:13,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 24: [2022-11-26 08:54:13,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 2: [2022-11-26 08:54:13,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 28: [2022-11-26 08:54:13,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:54:13,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 08:54:13,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 2: [2022-11-26 08:54:13,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:54:13,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 2: [2022-11-26 08:54:13,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 08:54:13,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 12: [2022-11-26 08:54:13,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:54:13,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:54:13,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 08:54:13,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 08:54:13,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 12: [2022-11-26 08:54:13,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 31: [2022-11-26 08:54:13,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 
31: [2022-11-26 08:54:13,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:54:13,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:54:13,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 08:54:13,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 08:54:13,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 08:54:13,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 31: [2022-11-26 08:54:13,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 31: [2022-11-26 08:54:13,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 17: [2022-11-26 08:54:13,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:54:13,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 28: [2022-11-26 08:54:13,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 17: [2022-11-26 08:54:13,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
28: [2022-11-26 08:54:13,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 28: [2022-11-26 08:54:13,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 08:54:13,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 08:54:13,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 27: [2022-11-26 08:54:13,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:54:13,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 08:54:13,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 10: [2022-11-26 08:54:13,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:54:13,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 08:54:13,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 3: [2022-11-26 08:54:13,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 08:54:13,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 08:54:13,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:54:13,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 3: [2022-11-26 08:54:13,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 08:54:13,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 1: [2022-11-26 08:54:13,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:54:13,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:54:13,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 11: [2022-11-26 08:54:13,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:54:13,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 11: [2022-11-26 08:54:13,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 0: [2022-11-26 08:54:13,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
11: [2022-11-26 08:54:13,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 1: [2022-11-26 08:54:13,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 0: [2022-11-26 08:54:13,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 1: [2022-11-26 08:54:13,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 0: [2022-11-26 08:54:13,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 6: [2022-11-26 08:54:13,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:54:13,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 08:54:13,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 6: [2022-11-26 08:54:13,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:54:13,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 08:54:13,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 9: [2022-11-26 08:54:13,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-26 08:54:13,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 08:54:13,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 1: [2022-11-26 08:54:13,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:54:13,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 08:54:13,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 25: [2022-11-26 08:54:13,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:54:13,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:54:13,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 30: [2022-11-26 08:54:13,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:54:13,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 15: [2022-11-26 08:54:13,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 08:54:13,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
30: [2022-11-26 08:54:13,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 08:54:13,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 17: [2022-11-26 08:54:13,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:54:13,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:54:13,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:54:13,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 21: [2022-11-26 08:54:13,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 17: [2022-11-26 08:54:13,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 21: [2022-11-26 08:54:13,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 08:54:13,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 21: [2022-11-26 08:54:13,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 22: [2022-11-26 08:54:13,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
22: [2022-11-26 08:54:13,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 5: [2022-11-26 08:54:13,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:54:13,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 5: [2022-11-26 08:54:13,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 08:54:13,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 28: [2022-11-26 08:54:13,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 08:54:13,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 08:54:13,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 2: [2022-11-26 08:54:13,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:54:13,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
2: [2022-11-26 08:54:13,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 12: [2022-11-26 08:54:13,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 24: [2022-11-26 08:54:13,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:54:13,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 12: [2022-11-26 08:54:13,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 24: [2022-11-26 08:54:13,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 08:54:13,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 1: [2022-11-26 08:54:13,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:54:13,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:54:13,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 1: [2022-11-26 08:54:13,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 8: [2022-11-26 08:54:13,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
1: [2022-11-26 08:54:13,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 20: [2022-11-26 08:54:13,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:54:13,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:54:13,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 08:54:13,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 08:54:13,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 20: [2022-11-26 08:54:13,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 4: [2022-11-26 08:54:13,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:54:13,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:54:13,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 08:54:13,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 08:54:13,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
4: [2022-11-26 08:54:13,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 29: [2022-11-26 08:54:13,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:54:13,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 08:54:13,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 27: [2022-11-26 08:54:13,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:54:13,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 08:54:13,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 19: [2022-11-26 08:54:13,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:54:13,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 08:54:13,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 21: [2022-11-26 08:54:13,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
21: [2022-11-26 08:54:13,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 08:54:13,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 23: [2022-11-26 08:54:13,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:54:13,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 08:54:13,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 2: [2022-11-26 08:54:13,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:54:13,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 08:54:13,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 23: [2022-11-26 08:54:13,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:54:13,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 08:54:13,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 25: [2022-11-26 08:54:13,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
25: [2022-11-26 08:54:13,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 08:54:13,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 14: [2022-11-26 08:54:13,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:54:13,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 08:54:13,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 29: [2022-11-26 08:54:13,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:54:13,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 08:54:13,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 16: [2022-11-26 08:54:13,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:54:13,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 08:54:13,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 1: [2022-11-26 08:54:13,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-26 08:54:13,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 08:54:13,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 29: [2022-11-26 08:54:13,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:54:13,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:54:13,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 08:54:13,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 08:54:13,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 29: [2022-11-26 08:54:13,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 7: [2022-11-26 08:54:13,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:54:13,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:54:13,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 08:54:13,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 08:54:13,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 08:54:13,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 08:54:13,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 7: [2022-11-26 08:54:13,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 7: [2022-11-26 08:54:13,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 16: [2022-11-26 08:54:13,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:54:13,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:54:13,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 08:54:13,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 08:54:13,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 16: [2022-11-26 08:54:13,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
12: [2022-11-26 08:54:13,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:54:13,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 08:54:13,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 0: [2022-11-26 08:54:13,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:54:13,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:54:13,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 08:54:13,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 31: [2022-11-26 08:54:13,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:54:13,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 19: [2022-11-26 08:54:13,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:54:13,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
19: [2022-11-26 08:54:13,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 08:54:13,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 6: [2022-11-26 08:54:13,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:54:13,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 08:54:13,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 4: [2022-11-26 08:54:13,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:54:13,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:54:13,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 08:54:13,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 08:54:13,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 4: [2022-11-26 08:54:13,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 13: [2022-11-26 08:54:13,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 08:54:13,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:54:13,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:54:13,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 08:54:13,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 13: [2022-11-26 08:54:13,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 08:54:13,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 08:54:13,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 13: [2022-11-26 08:54:13,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 5: [2022-11-26 08:54:13,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:54:13,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 08:54:13,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 29: [2022-11-26 08:54:13,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
29: [2022-11-26 08:54:13,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 08:54:13,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 23: [2022-11-26 08:54:13,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:54:13,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 08:54:13,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 26: [2022-11-26 08:54:13,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:54:13,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:54:13,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:54:13,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
26: [2022-11-26 08:54:13,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 08:54:13,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 08:54:13,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 08:54:13,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 08:54:13,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 26: [2022-11-26 08:54:13,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 26: [2022-11-26 08:54:13,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 26: [2022-11-26 08:54:13,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 9: [2022-11-26 08:54:13,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:54:13,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 08:54:13,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 18: [2022-11-26 08:54:13,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-26 08:54:13,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 08:54:13,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 0: [2022-11-26 08:54:13,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 08:54:13,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 26: [2022-11-26 08:54:13,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:54:13,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 08:54:13,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 17: [2022-11-26 08:54:13,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:54:13,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 08:54:13,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 10: [2022-11-26 08:54:13,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 08:54:13,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 08:54:13,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 18: [2022-11-26 08:54:13,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:54:13,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 08:54:13,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 18: [2022-11-26 08:54:13,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:54:13,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:54:13,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 31: [2022-11-26 08:54:13,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 18: [2022-11-26 08:54:13,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 31: [2022-11-26 08:54:13,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 30: [2022-11-26 08:54:13,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-26 08:54:13,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 08:54:13,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 3: [2022-11-26 08:54:13,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:54:13,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 08:54:13,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 11: [2022-11-26 08:54:13,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:54:13,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 08:54:13,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 14: [2022-11-26 08:54:13,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:54:13,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 08:54:13,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 22: [2022-11-26 08:54:13,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
22: [2022-11-26 08:54:13,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 08:54:13,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 8: [2022-11-26 08:54:13,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:54:13,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 08:54:13,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 24: [2022-11-26 08:54:13,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:54:13,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 08:54:13,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 28: [2022-11-26 08:54:13,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 08:54:13,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 08:54:13,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 0: [2022-11-26 08:54:13,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 08:54:13,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 08:54:13,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 15: [2022-11-26 08:54:13,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:54:13,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 08:54:13,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 16: [2022-11-26 08:54:13,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:54:13,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 08:54:13,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 21: [2022-11-26 08:54:13,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:54:13,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 08:54:13,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 4: [2022-11-26 08:54:13,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 08:54:13,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 08:54:13,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 25: [2022-11-26 08:54:13,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:54:13,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 08:54:13,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 12: [2022-11-26 08:54:13,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:54:13,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 08:54:13,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 1: [2022-11-26 08:54:13,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:54:13,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 20: [2022-11-26 08:54:13,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:54:13,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
20: [2022-11-26 08:54:13,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 08:54:13,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 13: [2022-11-26 08:54:13,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:54:13,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 08:54:13,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 2: [2022-11-26 08:54:13,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:54:13,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 08:54:13,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 7: [2022-11-26 08:54:13,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:54:13,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 08:54:13,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 27: [2022-11-26 08:54:13,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
27: [2022-11-26 08:54:13,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 08:54:13,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 23: [2022-11-26 08:54:13,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:54:13,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 08:54:13,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 6: [2022-11-26 08:54:13,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:54:13,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 08:54:13,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 26: [2022-11-26 08:54:13,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:54:13,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 11: [2022-11-26 08:54:13,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
18: [2022-11-26 08:54:13,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:54:13,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 18: [2022-11-26 08:54:13,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 11: [2022-11-26 08:54:13,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 18: [2022-11-26 08:54:13,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 11: [2022-11-26 08:54:13,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 5: [2022-11-26 08:54:13,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:54:13,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 08:54:13,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 19: [2022-11-26 08:54:13,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:54:13,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 08:54:13,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
17: [2022-11-26 08:54:13,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:54:13,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 08:54:13,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 9: [2022-11-26 08:54:13,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:54:13,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 08:54:13,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 29: [2022-11-26 08:54:13,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:54:13,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 08:54:13,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 3: [2022-11-26 08:54:13,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:54:13,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 08:54:13,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
31: [2022-11-26 08:54:13,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:54:13,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 08:54:13,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 30: [2022-11-26 08:54:13,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:54:13,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 08:54:13,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 8: [2022-11-26 08:54:13,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:54:13,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 08:54:13,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 14: [2022-11-26 08:54:13,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:54:13,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 08:54:13,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
24: [2022-11-26 08:54:13,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:54:13,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 08:54:13,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 10: [2022-11-26 08:54:13,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 28: [2022-11-26 08:54:13,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:54:13,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 08:54:13,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 0: [2022-11-26 08:54:13,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 28: [2022-11-26 08:54:13,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 08:54:13,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 0: [2022-11-26 08:54:13,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 08:54:13,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
15: [2022-11-26 08:54:13,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:54:13,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 08:54:13,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 22: [2022-11-26 08:54:13,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:54:13,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 08:54:13,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 4: [2022-11-26 08:54:13,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:54:13,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:54:13,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 08:54:13,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 25: [2022-11-26 08:54:13,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 08:54:13,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
21: [2022-11-26 08:54:13,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:54:13,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 08:54:13,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 12: [2022-11-26 08:54:13,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:54:13,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:54:13,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 18: [2022-11-26 08:54:13,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 12: [2022-11-26 08:54:13,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 18: [2022-11-26 08:54:13,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 7: [2022-11-26 08:54:13,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:54:13,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 08:54:13,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
1: [2022-11-26 08:54:13,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:54:13,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 08:54:13,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 27: [2022-11-26 08:54:13,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:54:13,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 08:54:13,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 20: [2022-11-26 08:54:13,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:54:13,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:54:13,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 08:54:13,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 2: [2022-11-26 08:54:13,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 08:54:13,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
13: [2022-11-26 08:54:13,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:54:13,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 08:54:13,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 16: [2022-11-26 08:54:13,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:54:13,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 08:54:13,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 23: [2022-11-26 08:54:13,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:54:13,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 08:54:13,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 29: [2022-11-26 08:54:13,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:54:13,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 08:54:13,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
19: [2022-11-26 08:54:13,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:54:13,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 08:54:13,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 17: [2022-11-26 08:54:13,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:54:13,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 5: [2022-11-26 08:54:13,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:54:13,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 5: [2022-11-26 08:54:13,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 08:54:13,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 26: [2022-11-26 08:54:13,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:54:13,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 08:54:13,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
9: [2022-11-26 08:54:13,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:54:13,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 08:54:13,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 11: [2022-11-26 08:54:13,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:54:13,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 08:54:13,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 6: [2022-11-26 08:54:13,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:54:13,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 08:54:13,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 3: [2022-11-26 08:54:13,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:54:13,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 08:54:13,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
31: [2022-11-26 08:54:13,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:54:13,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 08:54:13,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 10: [2022-11-26 08:54:13,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:54:13,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 08:54:13,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 8: [2022-11-26 08:54:13,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:54:13,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 08:54:13,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 1: [2022-11-26 08:54:13,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:54:13,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 08:54:13,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
14: [2022-11-26 08:54:13,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:54:13,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 08:54:13,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 24: [2022-11-26 08:54:13,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:54:13,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:54:13,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 08:54:13,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 30: [2022-11-26 08:54:13,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 08:54:13,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 25: [2022-11-26 08:54:13,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:54:13,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 08:54:13,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
7: [2022-11-26 08:54:13,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:54:13,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 08:54:13,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 0: [2022-11-26 08:54:13,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:54:13,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 08:54:13,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 15: [2022-11-26 08:54:13,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:54:13,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:54:13,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 15: [2022-11-26 08:54:13,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 08:54:13,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 18: [2022-11-26 08:54:13,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
22: [2022-11-26 08:54:13,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 08:54:13,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 08:54:13,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 27: [2022-11-26 08:54:13,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 29: [2022-11-26 08:54:13,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:54:13,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 08:54:13,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 29: [2022-11-26 08:54:13,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 08:54:13,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 16: [2022-11-26 08:54:13,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:54:13,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 08:54:13,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
20: [2022-11-26 08:54:13,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:54:13,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 08:54:13,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 8: [2022-11-26 08:54:13,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:54:13,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 08:54:13,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 2: [2022-11-26 08:54:13,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:54:13,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 08:54:13,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 6: [2022-11-26 08:54:13,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:54:13,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 08:54:13,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
11: [2022-11-26 08:54:13,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:54:13,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 08:54:13,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 17: [2022-11-26 08:54:13,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 08:54:13,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 08:54:13,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 24: [2022-11-26 08:54:13,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 08:54:13,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 08:54:13,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 13: [2022-11-26 08:54:13,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 23: [2022-11-26 08:54:13,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
13: [2022-11-26 08:54:13,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 08:54:13,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 23: [2022-11-26 08:54:13,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 08:54:13,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 9: [2022-11-26 08:54:13,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:54:13,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 08:54:13,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 14: [2022-11-26 08:54:13,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:54:13,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 08:54:13,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 4: [2022-11-26 08:54:13,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 08:54:13,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 08:54:13,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 21: [2022-11-26 08:54:13,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:54:13,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 12: [2022-11-26 08:54:13,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:54:13,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 12: [2022-11-26 08:54:13,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 08:54:13,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 26: [2022-11-26 08:54:13,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 08:54:13,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 08:54:13,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 5: [2022-11-26 08:54:13,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 08:54:13,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 08:54:13,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 3: [2022-11-26 08:54:13,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:54:13,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 08:54:13,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 19: [2022-11-26 08:54:13,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 08:54:13,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 08:54:13,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 31: [2022-11-26 08:54:13,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 08:54:13,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 08:54:13,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 22: [2022-11-26 08:54:13,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-26 08:54:13,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 08:54:13,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 28: [2022-11-26 08:54:13,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:54:13,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:54:13,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:54:13,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 08:54:13,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 15: [2022-11-26 08:54:13,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 08:54:13,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 10: [2022-11-26 08:54:13,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:54:13,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 08:54:13,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
27: [2022-11-26 08:54:13,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:54:13,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 0: [2022-11-26 08:54:13,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 27: [2022-11-26 08:54:13,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 0: [2022-11-26 08:54:13,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 08:54:13,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 7: [2022-11-26 08:54:13,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:54:13,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 08:54:13,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 25: [2022-11-26 08:54:13,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 08:54:13,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 08:54:13,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
30: [2022-11-26 08:54:13,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 08:54:13,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 08:54:13,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 18: [2022-11-26 08:54:13,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:54:13,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 08:54:13,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 4: [2022-11-26 08:54:13,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:54:13,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 08:54:13,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 28: [2022-11-26 08:54:13,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 08:54:13,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 28: [2022-11-26 08:54:13,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-26 08:54:13,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 08:54:13,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 13: [2022-11-26 08:54:13,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:54:13,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 08:54:13,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 16: [2022-11-26 08:54:13,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 20: [2022-11-26 08:54:13,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 16: [2022-11-26 08:54:13,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 20: [2022-11-26 08:54:13,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 16: [2022-11-26 08:54:13,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 20: [2022-11-26 08:54:13,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 13: [2022-11-26 08:54:13,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 08:54:13,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 08:54:13,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 28: [2022-11-26 08:54:13,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:54:13,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 08:54:13,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 08:54:13,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 12: [2022-11-26 08:54:13,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:54:13,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 08:54:13,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 7: [2022-11-26 08:54:13,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:54:13,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 08:54:13,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
28: [2022-11-26 08:54:13,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 08:54:13,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 21: [2022-11-26 08:54:13,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:54:13,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 08:54:13,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 21: [2022-11-26 08:54:13,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 08:54:13,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step66000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 08:54:13,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
0: successfully saved checkpoint at iteration 66000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2541.18 31: iteration 66010/ 173500 | consumed samples: 16898560 | consumed tokens: 34608250880 | elapsed time per iteration (s): 1.03 | learning rate: 1.446E-04 | global batch size: 256 | lm loss: 2.043762E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.754 | TFLOPs: 15.05 | 31: iteration 66020/ 173500 | consumed samples: 16901120 | consumed tokens: 34613493760 | elapsed time per iteration (s): 0.75 | learning rate: 1.446E-04 | global batch size: 256 | lm loss: 2.016299E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.546 | TFLOPs: 20.60 | 31: iteration 66030/ 173500 | consumed samples: 16903680 | consumed tokens: 34618736640 | elapsed time per iteration (s): 0.79 | learning rate: 1.446E-04 | global batch size: 256 | lm loss: 2.036853E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.063 | TFLOPs: 19.54 | 31: iteration 66040/ 173500 | consumed samples: 16906240 | consumed tokens: 34623979520 | elapsed time per iteration (s): 0.77 | learning rate: 1.446E-04 | global batch size: 256 | lm loss: 2.055075E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.575 | TFLOPs: 20.18 | 31: iteration 66050/ 173500 | consumed samples: 16908800 | consumed tokens: 34629222400 | elapsed time per iteration (s): 0.81 | learning rate: 1.446E-04 | global batch size: 256 | lm loss: 2.062829E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.340 | TFLOPs: 19.20 | 31: iteration 66060/ 173500 | consumed samples: 16911360 | consumed tokens: 34634465280 | elapsed time per iteration (s): 0.77 | 
learning rate: 1.446E-04 | global batch size: 256 | lm loss: 2.025485E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.756 | TFLOPs: 20.01 | 31: iteration 66070/ 173500 | consumed samples: 16913920 | consumed tokens: 34639708160 | elapsed time per iteration (s): 0.75 | learning rate: 1.446E-04 | global batch size: 256 | lm loss: 2.012903E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.904 | TFLOPs: 20.56 | 31: iteration 66080/ 173500 | consumed samples: 16916480 | consumed tokens: 34644951040 | elapsed time per iteration (s): 0.73 | learning rate: 1.445E-04 | global batch size: 256 | lm loss: 2.062575E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.515 | TFLOPs: 21.08 | 31: iteration 66090/ 173500 | consumed samples: 16919040 | consumed tokens: 34650193920 | elapsed time per iteration (s): 0.77 | learning rate: 1.445E-04 | global batch size: 256 | lm loss: 2.027289E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.640 | TFLOPs: 20.12 | 31: iteration 66100/ 173500 | consumed samples: 16921600 | consumed tokens: 34655436800 | elapsed time per iteration (s): 0.76 | learning rate: 1.445E-04 | global batch size: 256 | lm loss: 2.042949E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.420 | TFLOPs: 20.41 | 31: iteration 66110/ 173500 | consumed samples: 16924160 | consumed tokens: 34660679680 | elapsed time per iteration (s): 0.80 | learning rate: 1.445E-04 | global batch size: 256 | lm loss: 2.044631E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.692 | TFLOPs: 19.46 | 31: iteration 66120/ 
173500 | consumed samples: 16926720 | consumed tokens: 34665922560 | elapsed time per iteration (s): 0.82 | learning rate: 1.445E-04 | global batch size: 256 | lm loss: 2.012161E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.380 | TFLOPs: 18.84 | 31: iteration 66130/ 173500 | consumed samples: 16929280 | consumed tokens: 34671165440 | elapsed time per iteration (s): 0.79 | learning rate: 1.445E-04 | global batch size: 256 | lm loss: 2.024639E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.826 | TFLOPs: 19.65 | 31: iteration 66140/ 173500 | consumed samples: 16931840 | consumed tokens: 34676408320 | elapsed time per iteration (s): 0.82 | learning rate: 1.445E-04 | global batch size: 256 | lm loss: 2.030277E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.921 | TFLOPs: 18.87 | 31: iteration 66150/ 173500 | consumed samples: 16934400 | consumed tokens: 34681651200 | elapsed time per iteration (s): 0.82 | learning rate: 1.444E-04 | global batch size: 256 | lm loss: 2.016955E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.818 | TFLOPs: 18.80 | 31: iteration 66160/ 173500 | consumed samples: 16936960 | consumed tokens: 34686894080 | elapsed time per iteration (s): 0.78 | learning rate: 1.444E-04 | global batch size: 256 | lm loss: 2.042683E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.254 | TFLOPs: 19.74 | 31: iteration 66170/ 173500 | consumed samples: 16939520 | consumed tokens: 34692136960 | elapsed time per iteration (s): 0.77 | learning rate: 1.444E-04 | global batch size: 256 | lm loss: 2.012061E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 332.543 | TFLOPs: 20.12 | 31: iteration 66180/ 173500 | consumed samples: 16942080 | consumed tokens: 34697379840 | elapsed time per iteration (s): 0.78 | learning rate: 1.444E-04 | global batch size: 256 | lm loss: 2.023156E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.709 | TFLOPs: 19.89 | 31: iteration 66190/ 173500 | consumed samples: 16944640 | consumed tokens: 34702622720 | elapsed time per iteration (s): 0.76 | learning rate: 1.444E-04 | global batch size: 256 | lm loss: 2.013677E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.108 | TFLOPs: 20.33 | 31: iteration 66200/ 173500 | consumed samples: 16947200 | consumed tokens: 34707865600 | elapsed time per iteration (s): 0.77 | learning rate: 1.444E-04 | global batch size: 256 | lm loss: 2.045938E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.682 | TFLOPs: 20.07 | 31: iteration 66210/ 173500 | consumed samples: 16949760 | consumed tokens: 34713108480 | elapsed time per iteration (s): 0.76 | learning rate: 1.443E-04 | global batch size: 256 | lm loss: 2.023335E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.990 | TFLOPs: 20.39 | 31: iteration 66220/ 173500 | consumed samples: 16952320 | consumed tokens: 34718351360 | elapsed time per iteration (s): 0.74 | learning rate: 1.443E-04 | global batch size: 256 | lm loss: 2.044897E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.167 | TFLOPs: 20.82 | 31: iteration 66230/ 173500 | consumed samples: 16954880 | consumed tokens: 34723594240 | elapsed time per iteration (s): 0.77 | learning rate: 
1.443E-04 | global batch size: 256 | lm loss: 2.038882E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.529 | TFLOPs: 20.18 | 31: iteration 66240/ 173500 | consumed samples: 16957440 | consumed tokens: 34728837120 | elapsed time per iteration (s): 0.78 | learning rate: 1.443E-04 | global batch size: 256 | lm loss: 2.066478E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.346 | TFLOPs: 19.86 | 31: iteration 66250/ 173500 | consumed samples: 16960000 | consumed tokens: 34734080000 | elapsed time per iteration (s): 0.77 | learning rate: 1.443E-04 | global batch size: 256 | lm loss: 2.023495E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.308 | TFLOPs: 20.10 | 31: iteration 66260/ 173500 | consumed samples: 16962560 | consumed tokens: 34739322880 | elapsed time per iteration (s): 0.75 | learning rate: 1.443E-04 | global batch size: 256 | lm loss: 2.040310E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.909 | TFLOPs: 20.68 | 31: iteration 66270/ 173500 | consumed samples: 16965120 | consumed tokens: 34744565760 | elapsed time per iteration (s): 0.72 | learning rate: 1.443E-04 | global batch size: 256 | lm loss: 2.022386E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.153 | TFLOPs: 21.36 | 31: iteration 66280/ 173500 | consumed samples: 16967680 | consumed tokens: 34749808640 | elapsed time per iteration (s): 0.76 | learning rate: 1.442E-04 | global batch size: 256 | lm loss: 2.004275E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.949 | TFLOPs: 20.51 | 31: iteration 66290/ 173500 | 
consumed samples: 16970240 | consumed tokens: 34755051520 | elapsed time per iteration (s): 0.78 | learning rate: 1.442E-04 | global batch size: 256 | lm loss: 2.039732E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.975 | TFLOPs: 19.84 | 31: iteration 66300/ 173500 | consumed samples: 16972800 | consumed tokens: 34760294400 | elapsed time per iteration (s): 0.80 | learning rate: 1.442E-04 | global batch size: 256 | lm loss: 2.024858E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.786 | TFLOPs: 19.41 | 31: iteration 66310/ 173500 | consumed samples: 16975360 | consumed tokens: 34765537280 | elapsed time per iteration (s): 0.77 | learning rate: 1.442E-04 | global batch size: 256 | lm loss: 2.011405E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.355 | TFLOPs: 20.11 | 31: iteration 66320/ 173500 | consumed samples: 16977920 | consumed tokens: 34770780160 | elapsed time per iteration (s): 0.72 | learning rate: 1.442E-04 | global batch size: 256 | lm loss: 2.023322E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.611 | TFLOPs: 21.45 | 31: iteration 66330/ 173500 | consumed samples: 16980480 | consumed tokens: 34776023040 | elapsed time per iteration (s): 0.78 | learning rate: 1.442E-04 | global batch size: 256 | lm loss: 2.048323E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.126 | TFLOPs: 19.73 | 31: iteration 66340/ 173500 | consumed samples: 16983040 | consumed tokens: 34781265920 | elapsed time per iteration (s): 0.80 | learning rate: 1.441E-04 | global batch size: 256 | lm loss: 2.020021E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 321.680 | TFLOPs: 19.46 | 31: iteration 66350/ 173500 | consumed samples: 16985600 | consumed tokens: 34786508800 | elapsed time per iteration (s): 0.75 | learning rate: 1.441E-04 | global batch size: 256 | lm loss: 2.051447E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.597 | TFLOPs: 20.73 | 31: iteration 66360/ 173500 | consumed samples: 16988160 | consumed tokens: 34791751680 | elapsed time per iteration (s): 0.78 | learning rate: 1.441E-04 | global batch size: 256 | lm loss: 2.044337E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.708 | TFLOPs: 19.89 | 31: iteration 66370/ 173500 | consumed samples: 16990720 | consumed tokens: 34796994560 | elapsed time per iteration (s): 0.80 | learning rate: 1.441E-04 | global batch size: 256 | lm loss: 2.036014E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.348 | TFLOPs: 19.26 | 31: iteration 66380/ 173500 | consumed samples: 16993280 | consumed tokens: 34802237440 | elapsed time per iteration (s): 0.79 | learning rate: 1.441E-04 | global batch size: 256 | lm loss: 2.062405E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.187 | TFLOPs: 19.67 | 31: iteration 66390/ 173500 | consumed samples: 16995840 | consumed tokens: 34807480320 | elapsed time per iteration (s): 0.76 | learning rate: 1.441E-04 | global batch size: 256 | lm loss: 2.037170E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.680 | TFLOPs: 20.25 | 31: iteration 66400/ 173500 | consumed samples: 16998400 | consumed tokens: 34812723200 | elapsed time per iteration (s): 0.76 | learning rate: 1.441E-04 | global batch 
size: 256 | lm loss: 2.023154E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.701 | TFLOPs: 20.31 | 31: iteration 66410/ 173500 | consumed samples: 17000960 | consumed tokens: 34817966080 | elapsed time per iteration (s): 0.76 | learning rate: 1.440E-04 | global batch size: 256 | lm loss: 2.024660E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.817 | TFLOPs: 20.50 | 31: iteration 66420/ 173500 | consumed samples: 17003520 | consumed tokens: 34823208960 | elapsed time per iteration (s): 0.76 | learning rate: 1.440E-04 | global batch size: 256 | lm loss: 2.014285E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.570 | TFLOPs: 20.42 | 31: iteration 66430/ 173500 | consumed samples: 17006080 | consumed tokens: 34828451840 | elapsed time per iteration (s): 0.77 | learning rate: 1.440E-04 | global batch size: 256 | lm loss: 2.052614E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.624 | TFLOPs: 20.12 | 31: iteration 66440/ 173500 | consumed samples: 17008640 | consumed tokens: 34833694720 | elapsed time per iteration (s): 0.77 | learning rate: 1.440E-04 | global batch size: 256 | lm loss: 2.054603E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.398 | TFLOPs: 20.23 | 31: iteration 66450/ 173500 | consumed samples: 17011200 | consumed tokens: 34838937600 | elapsed time per iteration (s): 0.76 | learning rate: 1.440E-04 | global batch size: 256 | lm loss: 2.014054E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.979 | TFLOPs: 20.51 | 31: iteration 66460/ 173500 | consumed samples: 17013760 | 
consumed tokens: 34844180480 | elapsed time per iteration (s): 0.78 | learning rate: 1.440E-04 | global batch size: 256 | lm loss: 2.043875E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.753 | TFLOPs: 19.83 | 31: iteration 66470/ 173500 | consumed samples: 17016320 | consumed tokens: 34849423360 | elapsed time per iteration (s): 0.73 | learning rate: 1.439E-04 | global batch size: 256 | lm loss: 2.021494E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.717 | TFLOPs: 21.22 | 31: iteration 66480/ 173500 | consumed samples: 17018880 | consumed tokens: 34854666240 | elapsed time per iteration (s): 0.75 | learning rate: 1.439E-04 | global batch size: 256 | lm loss: 2.026944E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.444 | TFLOPs: 20.66 | 31: iteration 66490/ 173500 | consumed samples: 17021440 | consumed tokens: 34859909120 | elapsed time per iteration (s): 0.80 | learning rate: 1.439E-04 | global batch size: 256 | lm loss: 2.015136E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.649 | TFLOPs: 19.40 | 31: iteration 66500/ 173500 | consumed samples: 17024000 | consumed tokens: 34865152000 | elapsed time per iteration (s): 0.75 | learning rate: 1.439E-04 | global batch size: 256 | lm loss: 2.064294E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.651 | TFLOPs: 20.73 | 31: iteration 66510/ 173500 | consumed samples: 17026560 | consumed tokens: 34870394880 | elapsed time per iteration (s): 0.76 | learning rate: 1.439E-04 | global batch size: 256 | lm loss: 2.038662E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 337.939 | TFLOPs: 20.44 | 31: iteration 66520/ 173500 | consumed samples: 17029120 | consumed tokens: 34875637760 | elapsed time per iteration (s): 0.80 | learning rate: 1.439E-04 | global batch size: 256 | lm loss: 2.057428E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.881 | TFLOPs: 19.47 | 31: iteration 66530/ 173500 | consumed samples: 17031680 | consumed tokens: 34880880640 | elapsed time per iteration (s): 0.75 | learning rate: 1.439E-04 | global batch size: 256 | lm loss: 2.025442E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.276 | TFLOPs: 20.59 | 31: iteration 66540/ 173500 | consumed samples: 17034240 | consumed tokens: 34886123520 | elapsed time per iteration (s): 0.77 | learning rate: 1.438E-04 | global batch size: 256 | lm loss: 2.051474E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.530 | TFLOPs: 20.12 | 31: iteration 66550/ 173500 | consumed samples: 17036800 | consumed tokens: 34891366400 | elapsed time per iteration (s): 0.77 | learning rate: 1.438E-04 | global batch size: 256 | lm loss: 2.045370E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.541 | TFLOPs: 20.06 | 31: iteration 66560/ 173500 | consumed samples: 17039360 | consumed tokens: 34896609280 | elapsed time per iteration (s): 0.75 | learning rate: 1.438E-04 | global batch size: 256 | lm loss: 2.063584E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.917 | TFLOPs: 20.75 | 31: iteration 66570/ 173500 | consumed samples: 17041920 | consumed tokens: 34901852160 | elapsed time per iteration (s): 0.74 | learning rate: 1.438E-04 | global batch size: 256 | lm loss: 
2.015111E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.927 | TFLOPs: 20.99 | 31: iteration 66580/ 173500 | consumed samples: 17044480 | consumed tokens: 34907095040 | elapsed time per iteration (s): 0.76 | learning rate: 1.438E-04 | global batch size: 256 | lm loss: 2.025149E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.216 | TFLOPs: 20.34 | 31: iteration 66590/ 173500 | consumed samples: 17047040 | consumed tokens: 34912337920 | elapsed time per iteration (s): 0.75 | learning rate: 1.438E-04 | global batch size: 256 | lm loss: 2.038715E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.574 | TFLOPs: 20.79 | 31: iteration 66600/ 173500 | consumed samples: 17049600 | consumed tokens: 34917580800 | elapsed time per iteration (s): 0.79 | learning rate: 1.438E-04 | global batch size: 256 | lm loss: 2.053044E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.359 | TFLOPs: 19.62 | 31: iteration 66610/ 173500 | consumed samples: 17052160 | consumed tokens: 34922823680 | elapsed time per iteration (s): 0.76 | learning rate: 1.437E-04 | global batch size: 256 | lm loss: 2.021143E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.992 | TFLOPs: 20.39 | 31: iteration 66620/ 173500 | consumed samples: 17054720 | consumed tokens: 34928066560 | elapsed time per iteration (s): 0.77 | learning rate: 1.437E-04 | global batch size: 256 | lm loss: 2.054395E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.346 | TFLOPs: 20.17 | 31: iteration 66630/ 173500 | consumed samples: 17057280 | consumed tokens: 
34933309440 | elapsed time per iteration (s): 0.76 | learning rate: 1.437E-04 | global batch size: 256 | lm loss: 2.040725E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.192 | TFLOPs: 20.28 | 31: iteration 66640/ 173500 | consumed samples: 17059840 | consumed tokens: 34938552320 | elapsed time per iteration (s): 0.74 | learning rate: 1.437E-04 | global batch size: 256 | lm loss: 2.019731E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.572 | TFLOPs: 20.91 | 31: iteration 66650/ 173500 | consumed samples: 17062400 | consumed tokens: 34943795200 | elapsed time per iteration (s): 0.72 | learning rate: 1.437E-04 | global batch size: 256 | lm loss: 2.027765E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.776 | TFLOPs: 21.52 | 31: iteration 66660/ 173500 | consumed samples: 17064960 | consumed tokens: 34949038080 | elapsed time per iteration (s): 0.76 | learning rate: 1.437E-04 | global batch size: 256 | lm loss: 2.009339E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.817 | TFLOPs: 20.44 | 31: iteration 66670/ 173500 | consumed samples: 17067520 | consumed tokens: 34954280960 | elapsed time per iteration (s): 0.77 | learning rate: 1.436E-04 | global batch size: 256 | lm loss: 2.015682E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.152 | TFLOPs: 20.03 | 31: iteration 66680/ 173500 | consumed samples: 17070080 | consumed tokens: 34959523840 | elapsed time per iteration (s): 0.78 | learning rate: 1.436E-04 | global batch size: 256 | lm loss: 2.072390E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 328.964 | TFLOPs: 19.90 | 31: iteration 66690/ 173500 | consumed samples: 17072640 | consumed tokens: 34964766720 | elapsed time per iteration (s): 0.83 | learning rate: 1.436E-04 | global batch size: 256 | lm loss: 2.033144E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.749 | TFLOPs: 18.62 | 31: iteration 66700/ 173500 | consumed samples: 17075200 | consumed tokens: 34970009600 | elapsed time per iteration (s): 0.79 | learning rate: 1.436E-04 | global batch size: 256 | lm loss: 2.023216E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.102 | TFLOPs: 19.55 | 31: iteration 66710/ 173500 | consumed samples: 17077760 | consumed tokens: 34975252480 | elapsed time per iteration (s): 0.80 | learning rate: 1.436E-04 | global batch size: 256 | lm loss: 2.030077E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.400 | TFLOPs: 19.38 | 31: iteration 66720/ 173500 | consumed samples: 17080320 | consumed tokens: 34980495360 | elapsed time per iteration (s): 0.80 | learning rate: 1.436E-04 | global batch size: 256 | lm loss: 2.026954E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.181 | TFLOPs: 19.31 | 31: iteration 66730/ 173500 | consumed samples: 17082880 | consumed tokens: 34985738240 | elapsed time per iteration (s): 0.81 | learning rate: 1.436E-04 | global batch size: 256 | lm loss: 2.043344E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.290 | TFLOPs: 19.13 | 31: iteration 66740/ 173500 | consumed samples: 17085440 | consumed tokens: 34990981120 | elapsed time per iteration (s): 0.81 | learning rate: 1.435E-04 | global batch size: 256 | lm loss: 2.039141E+00 | grad 
norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.737 | TFLOPs: 19.16 | 31: iteration 66750/ 173500 | consumed samples: 17088000 | consumed tokens: 34996224000 | elapsed time per iteration (s): 0.80 | learning rate: 1.435E-04 | global batch size: 256 | lm loss: 2.040525E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.470 | TFLOPs: 19.45 | 31: iteration 66760/ 173500 | consumed samples: 17090560 | consumed tokens: 35001466880 | elapsed time per iteration (s): 0.83 | learning rate: 1.435E-04 | global batch size: 256 | lm loss: 1.993352E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.462 | TFLOPs: 18.66 | 31: iteration 66770/ 173500 | consumed samples: 17093120 | consumed tokens: 35006709760 | elapsed time per iteration (s): 0.85 | learning rate: 1.435E-04 | global batch size: 256 | lm loss: 2.029539E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.641 | TFLOPs: 18.25 | 31: iteration 66780/ 173500 | consumed samples: 17095680 | consumed tokens: 35011952640 | elapsed time per iteration (s): 0.90 | learning rate: 1.435E-04 | global batch size: 256 | lm loss: 2.032933E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.856 | TFLOPs: 17.29 | 31: iteration 66790/ 173500 | consumed samples: 17098240 | consumed tokens: 35017195520 | elapsed time per iteration (s): 0.83 | learning rate: 1.435E-04 | global batch size: 256 | lm loss: 2.017524E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.757 | TFLOPs: 18.74 | 31: iteration 66800/ 173500 | consumed samples: 17100800 | consumed tokens: 35022438400 | elapsed time 
per iteration (s): 0.81 | learning rate: 1.434E-04 | global batch size: 256 | lm loss: 2.002760E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.181 | TFLOPs: 19.13 | 31: iteration 66810/ 173500 | consumed samples: 17103360 | consumed tokens: 35027681280 | elapsed time per iteration (s): 0.82 | learning rate: 1.434E-04 | global batch size: 256 | lm loss: 2.028682E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.976 | TFLOPs: 18.87 | 31: iteration 66820/ 173500 | consumed samples: 17105920 | consumed tokens: 35032924160 | elapsed time per iteration (s): 0.79 | learning rate: 1.434E-04 | global batch size: 256 | lm loss: 2.033451E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.268 | TFLOPs: 19.50 | 31: iteration 66830/ 173500 | consumed samples: 17108480 | consumed tokens: 35038167040 | elapsed time per iteration (s): 0.85 | learning rate: 1.434E-04 | global batch size: 256 | lm loss: 2.046275E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.510 | TFLOPs: 18.12 | 31: iteration 66840/ 173500 | consumed samples: 17111040 | consumed tokens: 35043409920 | elapsed time per iteration (s): 0.86 | learning rate: 1.434E-04 | global batch size: 256 | lm loss: 2.032885E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.242 | TFLOPs: 18.10 | 31: iteration 66850/ 173500 | consumed samples: 17113600 | consumed tokens: 35048652800 | elapsed time per iteration (s): 0.82 | learning rate: 1.434E-04 | global batch size: 256 | lm loss: 2.024720E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.087 | TFLOPs: 
18.88 | 31: iteration 66860/ 173500 | consumed samples: 17116160 | consumed tokens: 35053895680 | elapsed time per iteration (s): 0.83 | learning rate: 1.434E-04 | global batch size: 256 | lm loss: 2.019246E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.647 | TFLOPs: 18.61 | 31: iteration 66870/ 173500 | consumed samples: 17118720 | consumed tokens: 35059138560 | elapsed time per iteration (s): 0.84 | learning rate: 1.433E-04 | global batch size: 256 | lm loss: 2.036067E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.797 | TFLOPs: 18.44 | 31: iteration 66880/ 173500 | consumed samples: 17121280 | consumed tokens: 35064381440 | elapsed time per iteration (s): 0.83 | learning rate: 1.433E-04 | global batch size: 256 | lm loss: 2.017331E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.853 | TFLOPs: 18.62 | 31: iteration 66890/ 173500 | consumed samples: 17123840 | consumed tokens: 35069624320 | elapsed time per iteration (s): 0.79 | learning rate: 1.433E-04 | global batch size: 256 | lm loss: 2.019636E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.491 | TFLOPs: 19.51 | 31: iteration 66900/ 173500 | consumed samples: 17126400 | consumed tokens: 35074867200 | elapsed time per iteration (s): 0.83 | learning rate: 1.433E-04 | global batch size: 256 | lm loss: 2.079641E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.062 | TFLOPs: 18.64 | 31: iteration 66910/ 173500 | consumed samples: 17128960 | consumed tokens: 35080110080 | elapsed time per iteration (s): 0.84 | learning rate: 1.433E-04 | global batch size: 256 | lm loss: 2.044829E+00 | grad norm: 0.130 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.331 | TFLOPs: 18.53 | 31: iteration 66920/ 173500 | consumed samples: 17131520 | consumed tokens: 35085352960 | elapsed time per iteration (s): 0.80 | learning rate: 1.433E-04 | global batch size: 256 | lm loss: 2.038960E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.988 | TFLOPs: 19.36 | 31: iteration 66930/ 173500 | consumed samples: 17134080 | consumed tokens: 35090595840 | elapsed time per iteration (s): 0.78 | learning rate: 1.432E-04 | global batch size: 256 | lm loss: 2.029109E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.769 | TFLOPs: 19.77 | 31: iteration 66940/ 173500 | consumed samples: 17136640 | consumed tokens: 35095838720 | elapsed time per iteration (s): 0.83 | learning rate: 1.432E-04 | global batch size: 256 | lm loss: 2.039351E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.058 | TFLOPs: 18.76 | 31: iteration 66950/ 173500 | consumed samples: 17139200 | consumed tokens: 35101081600 | elapsed time per iteration (s): 0.81 | learning rate: 1.432E-04 | global batch size: 256 | lm loss: 2.079086E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.337 | TFLOPs: 19.08 | 31: iteration 66960/ 173500 | consumed samples: 17141760 | consumed tokens: 35106324480 | elapsed time per iteration (s): 0.83 | learning rate: 1.432E-04 | global batch size: 256 | lm loss: 2.046957E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.984 | TFLOPs: 18.57 | 31: iteration 66970/ 173500 | consumed samples: 17144320 | consumed tokens: 35111567360 | elapsed time per iteration (s): 0.81 | 
learning rate: 1.432E-04 | global batch size: 256 | lm loss: 2.015237E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.174 | TFLOPs: 19.01 |
31: iteration 66980/ 173500 | consumed samples: 17146880 | consumed tokens: 35116810240 | elapsed time per iteration (s): 0.79 | learning rate: 1.432E-04 | global batch size: 256 | lm loss: 2.046909E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.679 | TFLOPs: 19.58 |
31: iteration 66990/ 173500 | consumed samples: 17149440 | consumed tokens: 35122053120 | elapsed time per iteration (s): 0.80 | learning rate: 1.432E-04 | global batch size: 256 | lm loss: 2.074482E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.582 | TFLOPs: 19.39 |
31: iteration 67000/ 173500 | consumed samples: 17152000 | consumed tokens: 35127296000 | elapsed time per iteration (s): 0.83 | learning rate: 1.431E-04 | global batch size: 256 | lm loss: 2.045128E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.264 | TFLOPs: 18.65 |
31: -------------------------------------------------------------------------------------------
31: valid loss at iteration 67000 | lm loss value: 1.994993E+00 | lm loss PPL: 7.352153E+00 |
31: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 67000 to checkpoints_1b1long
0: [2022-11-26 09:07:18,169] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step67000 is begin to save!
0: [2022-11-26 09:07:18,182] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_01-model_00-model_states.pt...
0: [2022-11-26 09:07:18,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_01-model_00-model_states.pt. 0: [2022-11-26 09:07:18,389] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_03-model_00-model_states.pt... 0: [2022-11-26 09:07:18,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_03-model_00-model_states.pt. 0: [2022-11-26 09:07:18,471] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_04-model_00-model_states.pt... 0: [2022-11-26 09:07:18,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_04-model_00-model_states.pt. 0: [2022-11-26 09:07:18,553] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_05-model_00-model_states.pt... 0: [2022-11-26 09:07:18,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_05-model_00-model_states.pt. 0: [2022-11-26 09:07:18,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_06-model_00-model_states.pt... 0: [2022-11-26 09:07:18,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_06-model_00-model_states.pt. 0: [2022-11-26 09:07:18,711] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_07-model_00-model_states.pt... 0: [2022-11-26 09:07:18,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_07-model_00-model_states.pt. 0: [2022-11-26 09:07:18,786] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 09:07:18,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_08-model_00-model_states.pt. 0: [2022-11-26 09:07:18,864] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_09-model_00-model_states.pt... 0: [2022-11-26 09:07:18,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_09-model_00-model_states.pt. 0: [2022-11-26 09:07:18,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_10-model_00-model_states.pt... 0: [2022-11-26 09:07:19,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_10-model_00-model_states.pt. 0: [2022-11-26 09:07:19,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_11-model_00-model_states.pt... 0: [2022-11-26 09:07:19,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_11-model_00-model_states.pt. 0: [2022-11-26 09:07:19,093] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_12-model_00-model_states.pt... 0: [2022-11-26 09:07:19,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_12-model_00-model_states.pt. 0: [2022-11-26 09:07:19,173] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_13-model_00-model_states.pt... 0: [2022-11-26 09:07:19,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_13-model_00-model_states.pt. 0: [2022-11-26 09:07:19,252] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 09:07:19,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_14-model_00-model_states.pt. 0: [2022-11-26 09:07:19,329] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_15-model_00-model_states.pt... 0: [2022-11-26 09:07:19,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_15-model_00-model_states.pt. 0: [2022-11-26 09:07:19,407] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_16-model_00-model_states.pt... 0: [2022-11-26 09:07:19,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_16-model_00-model_states.pt. 0: [2022-11-26 09:07:19,485] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_17-model_00-model_states.pt... 0: [2022-11-26 09:07:19,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_17-model_00-model_states.pt. 0: [2022-11-26 09:07:19,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_18-model_00-model_states.pt... 0: [2022-11-26 09:07:19,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_18-model_00-model_states.pt. 0: [2022-11-26 09:07:19,641] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_19-model_00-model_states.pt... 0: [2022-11-26 09:07:19,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_19-model_00-model_states.pt. 0: [2022-11-26 09:07:19,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 09:07:19,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_20-model_00-model_states.pt. 0: [2022-11-26 09:07:19,797] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_21-model_00-model_states.pt... 0: [2022-11-26 09:07:19,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_21-model_00-model_states.pt. 0: [2022-11-26 09:07:19,875] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_22-model_00-model_states.pt... 0: [2022-11-26 09:07:19,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_22-model_00-model_states.pt. 0: [2022-11-26 09:07:19,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_23-model_00-model_states.pt... 0: [2022-11-26 09:07:20,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_23-model_00-model_states.pt. 0: [2022-11-26 09:07:20,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_24-model_00-model_states.pt... 0: [2022-11-26 09:07:20,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_24-model_00-model_states.pt. 0: [2022-11-26 09:07:20,109] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_25-model_00-model_states.pt... 0: [2022-11-26 09:07:20,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_25-model_00-model_states.pt. 0: [2022-11-26 09:07:20,184] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 09:07:20,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_26-model_00-model_states.pt. 0: [2022-11-26 09:07:20,264] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_27-model_00-model_states.pt... 0: [2022-11-26 09:07:20,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_27-model_00-model_states.pt. 0: [2022-11-26 09:07:20,341] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_28-model_00-model_states.pt... 0: [2022-11-26 09:07:20,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_28-model_00-model_states.pt. 0: [2022-11-26 09:07:20,419] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/layer_30-model_00-model_states.pt... 0: [2022-11-26 09:07:20,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/layer_30-model_00-model_states.pt. 0: [2022-11-26 09:07:20,422] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step67000/mp_rank_00_model_states.pt 0: [2022-11-26 09:07:20,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/mp_rank_00_model_states.pt... 0: [2022-11-26 09:07:20,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/mp_rank_00_model_states.pt. 0: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
6: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
4: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
1: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
12: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 
23: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 
24: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 
30: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 
26: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 
0: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
4: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
1: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
3: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 
25: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
31: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 
21: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
0: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 
16: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
20: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
31: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 
19: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 
16: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 
14: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 
19: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 
16: [2022-11-26 09:07:20,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:07:20,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:07:20,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 28: [2022-11-26 09:07:20,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 8: [2022-11-26 09:07:20,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 09:07:20,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 28: [2022-11-26 09:07:20,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 0: [2022-11-26 09:07:20,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:07:20,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 09:07:20,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 2: [2022-11-26 09:07:20,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 09:07:20,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 09:07:20,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 27: [2022-11-26 09:07:20,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 09:07:20,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 09:07:20,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 19: [2022-11-26 09:07:20,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:07:20,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 09:07:20,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 5: [2022-11-26 09:07:20,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:07:20,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:07:20,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
5: [2022-11-26 09:07:20,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 12: [2022-11-26 09:07:20,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 6: [2022-11-26 09:07:20,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:07:20,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 1: [2022-11-26 09:07:20,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 12: [2022-11-26 09:07:20,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 6: [2022-11-26 09:07:20,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 09:07:20,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 1: [2022-11-26 09:07:20,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 10: [2022-11-26 09:07:20,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:07:20,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 09:07:20,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
20: [2022-11-26 09:07:20,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:07:20,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 09:07:20,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 30: [2022-11-26 09:07:20,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:07:20,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 7: [2022-11-26 09:07:20,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:07:20,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 7: [2022-11-26 09:07:20,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 09:07:20,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 21: [2022-11-26 09:07:20,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:07:20,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 09:07:20,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
31: [2022-11-26 09:07:20,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:07:20,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 09:07:20,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 2: [2022-11-26 09:07:20,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:07:20,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 09:07:20,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 25: [2022-11-26 09:07:20,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:07:20,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:07:20,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 09:07:20,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 25: [2022-11-26 09:07:20,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 09:07:20,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
15: [2022-11-26 09:07:20,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:07:20,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 09:07:20,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 3: [2022-11-26 09:07:20,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 28: [2022-11-26 09:07:20,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:07:20,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 09:07:20,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 29: [2022-11-26 09:07:20,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 28: [2022-11-26 09:07:20,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 29: [2022-11-26 09:07:20,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 28: [2022-11-26 09:07:20,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 29: [2022-11-26 09:07:20,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
0: [2022-11-26 09:07:20,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:07:20,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:07:20,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 14: [2022-11-26 09:07:20,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:07:20,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 20: [2022-11-26 09:07:20,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:07:20,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 20: [2022-11-26 09:07:20,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 14: [2022-11-26 09:07:20,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 20: [2022-11-26 09:07:20,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 8: [2022-11-26 09:07:20,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 09:07:20,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 09:07:20,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 18: [2022-11-26 09:07:20,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:07:20,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 09:07:20,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 24: [2022-11-26 09:07:20,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 09:07:20,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 09:07:20,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 12: [2022-11-26 09:07:20,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:07:20,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 09:07:20,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 6: [2022-11-26 09:07:20,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 09:07:20,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 09:07:20,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 10: [2022-11-26 09:07:20,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:07:20,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 09:07:20,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 31: [2022-11-26 09:07:20,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:07:20,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 09:07:20,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 5: [2022-11-26 09:07:20,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:07:20,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 09:07:20,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 13: [2022-11-26 09:07:20,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 09:07:20,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 09:07:20,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 2: [2022-11-26 09:07:20,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:07:20,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 09:07:20,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 22: [2022-11-26 09:07:20,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:07:20,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 22: [2022-11-26 09:07:20,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 11: [2022-11-26 09:07:20,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 22: [2022-11-26 09:07:20,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
0: [2022-11-26 09:07:20,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 11: [2022-11-26 09:07:20,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 09:07:20,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 0: [2022-11-26 09:07:20,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 21: [2022-11-26 09:07:20,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:07:20,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 09:07:20,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 30: [2022-11-26 09:07:20,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:07:20,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 09:07:20,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 15: [2022-11-26 09:07:20,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-26 09:07:20,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 09:07:20,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 27: [2022-11-26 09:07:20,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 09:07:20,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 09:07:20,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 1: [2022-11-26 09:07:20,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 28: [2022-11-26 09:07:20,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:07:20,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 09:07:20,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 1: [2022-11-26 09:07:20,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:07:20,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 09:07:20,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 09:07:20,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 1: [2022-11-26 09:07:20,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 09:07:20,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 26: [2022-11-26 09:07:20,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:07:20,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:07:20,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 09:07:20,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 26: [2022-11-26 09:07:20,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 09:07:20,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 25: [2022-11-26 09:07:20,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
25: [2022-11-26 09:07:20,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 09:07:20,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 17: [2022-11-26 09:07:20,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:07:20,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 17: [2022-11-26 09:07:20,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 19: [2022-11-26 09:07:20,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:07:20,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 17: [2022-11-26 09:07:20,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 19: [2022-11-26 09:07:20,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 6: [2022-11-26 09:07:20,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 17: [2022-11-26 09:07:20,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-26 09:07:20,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 19: [2022-11-26 09:07:20,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 17: [2022-11-26 09:07:20,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 7: [2022-11-26 09:07:20,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 17: [2022-11-26 09:07:20,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:07:20,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 17: [2022-11-26 09:07:20,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 7: [2022-11-26 09:07:20,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 17: [2022-11-26 09:07:20,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 26: [2022-11-26 09:07:20,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:07:20,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 09:07:20,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
12: [2022-11-26 09:07:20,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:07:20,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 09:07:20,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 18: [2022-11-26 09:07:20,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:07:20,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 09:07:20,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 15: [2022-11-26 09:07:20,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:07:20,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 16: [2022-11-26 09:07:20,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 24: [2022-11-26 09:07:20,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 
16: [2022-11-26 09:07:20,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 15: [2022-11-26 09:07:20,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 19: [2022-11-26 09:07:20,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:07:20,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 24: [2022-11-26 09:07:20,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 19: [2022-11-26 09:07:20,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 24: [2022-11-26 09:07:20,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 19: [2022-11-26 09:07:20,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 24: [2022-11-26 09:07:20,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 09:07:20,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 09:07:20,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 10: [2022-11-26 09:07:20,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 09:07:20,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 09:07:20,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 13: [2022-11-26 09:07:20,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:07:20,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 09:07:20,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 7: [2022-11-26 09:07:20,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:07:20,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 09:07:20,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 14: [2022-11-26 09:07:20,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:07:20,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 09:07:20,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 20: [2022-11-26 09:07:20,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 
20: [2022-11-26 09:07:20,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 3: [2022-11-26 09:07:20,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:07:20,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 3: [2022-11-26 09:07:20,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 09:07:20,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 3: [2022-11-26 09:07:20,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:07:20,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 09:07:20,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 29: [2022-11-26 09:07:20,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 09:07:20,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-26 09:07:20,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 09:07:20,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 22: [2022-11-26 09:07:20,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 29: [2022-11-26 09:07:20,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 29: [2022-11-26 09:07:20,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 22: [2022-11-26 09:07:20,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 09:07:20,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 16: [2022-11-26 09:07:20,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:07:20,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 09:07:20,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 18: [2022-11-26 09:07:20,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
18: [2022-11-26 09:07:20,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 09:07:20,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 28: [2022-11-26 09:07:20,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 09:07:20,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 28: [2022-11-26 09:07:20,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 09:07:20,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 09:07:20,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 31: [2022-11-26 09:07:20,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:07:20,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 09:07:20,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 17: [2022-11-26 09:07:20,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-26 09:07:20,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 14: [2022-11-26 09:07:20,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 17: [2022-11-26 09:07:20,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 14: [2022-11-26 09:07:20,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 09:07:20,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 26: [2022-11-26 09:07:20,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:07:20,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 09:07:20,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 27: [2022-11-26 09:07:20,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 09:07:20,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 09:07:20,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 27: [2022-11-26 09:07:20,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 
14: [2022-11-26 09:07:20,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:07:20,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 27: [2022-11-26 09:07:20,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 09:07:20,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 14: [2022-11-26 09:07:20,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 11: [2022-11-26 09:07:20,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:07:20,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 09:07:20,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 30: [2022-11-26 09:07:20,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:07:20,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:07:20,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
2: [2022-11-26 09:07:20,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 20: [2022-11-26 09:07:20,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 2: [2022-11-26 09:07:20,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 20: [2022-11-26 09:07:20,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 1: [2022-11-26 09:07:20,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:07:20,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 09:07:20,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 22: [2022-11-26 09:07:20,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 09:07:20,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 09:07:20,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 30: [2022-11-26 09:07:20,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 09:07:20,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
5: [2022-11-26 09:07:20,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:07:20,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 09:07:20,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 0: [2022-11-26 09:07:20,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:07:20,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 09:07:20,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 16: [2022-11-26 09:07:20,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:07:20,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 09:07:20,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 10: [2022-11-26 09:07:20,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:07:20,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 09:07:20,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
8: [2022-11-26 09:07:20,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:07:20,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 09:07:20,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 0: [2022-11-26 09:07:20,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 09:07:20,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 25: [2022-11-26 09:07:20,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:07:20,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:07:20,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 25: [2022-11-26 09:07:20,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 5: [2022-11-26 09:07:20,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 25: [2022-11-26 09:07:20,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 31: [2022-11-26 09:07:20,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 
31: [2022-11-26 09:07:20,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 09:07:20,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 12: [2022-11-26 09:07:20,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:07:20,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 09:07:20,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 24: [2022-11-26 09:07:20,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 09:07:20,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 09:07:20,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 23: [2022-11-26 09:07:20,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:07:20,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:07:20,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:07:20,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
23: [2022-11-26 09:07:20,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 09:07:20,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 09:07:20,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 09:07:20,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 09:07:20,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 23: [2022-11-26 09:07:20,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 23: [2022-11-26 09:07:20,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 23: [2022-11-26 09:07:20,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 29: [2022-11-26 09:07:20,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 09:07:20,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 09:07:20,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 19: [2022-11-26 09:07:20,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-26 09:07:20,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 09:07:20,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 7: [2022-11-26 09:07:20,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:07:20,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 09:07:20,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 15: [2022-11-26 09:07:20,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:07:20,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 09:07:20,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 21: [2022-11-26 09:07:20,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:07:20,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 
21: [2022-11-26 09:07:20,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 13: [2022-11-26 09:07:20,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:07:20,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 21: [2022-11-26 09:07:20,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 13: [2022-11-26 09:07:20,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 30: [2022-11-26 09:07:20,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 13: [2022-11-26 09:07:20,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 9: [2022-11-26 09:07:20,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:07:20,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:07:20,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:07:20,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 09:07:20,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-26 09:07:20,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 09:07:20,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 09:07:20,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 9: [2022-11-26 09:07:20,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 9: [2022-11-26 09:07:20,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 9: [2022-11-26 09:07:20,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 09:07:20,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 18: [2022-11-26 09:07:20,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:07:20,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 09:07:20,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 22: [2022-11-26 09:07:20,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
22: [2022-11-26 09:07:20,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 09:07:20,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 25: [2022-11-26 09:07:20,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 09:07:20,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 09:07:20,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 4: [2022-11-26 09:07:20,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:07:20,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:07:20,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:07:20,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 09:07:20,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 09:07:20,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 09:07:20,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 09:07:20,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 09:07:20,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 4: [2022-11-26 09:07:20,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 4: [2022-11-26 09:07:20,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 4: [2022-11-26 09:07:20,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 3: [2022-11-26 09:07:20,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:07:20,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 09:07:20,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 11: [2022-11-26 09:07:20,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-26 09:07:20,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 09:07:20,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 20: [2022-11-26 09:07:20,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:07:20,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 09:07:20,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 1: [2022-11-26 09:07:20,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:07:20,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 09:07:20,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 28: [2022-11-26 09:07:20,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 09:07:20,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 24: [2022-11-26 09:07:20,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
24: [2022-11-26 09:07:20,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 09:07:20,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 28: [2022-11-26 09:07:20,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 6: [2022-11-26 09:07:20,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:07:20,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 09:07:20,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 23: [2022-11-26 09:07:20,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:07:20,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 09:07:20,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 5: [2022-11-26 09:07:20,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:07:20,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 09:07:20,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
2: [2022-11-26 09:07:20,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:07:20,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 09:07:20,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 16: [2022-11-26 09:07:20,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:07:20,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 09:07:20,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 17: [2022-11-26 09:07:20,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 09:07:20,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 09:07:20,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 0: [2022-11-26 09:07:20,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:07:20,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 09:07:20,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 0: [2022-11-26 09:07:20,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 4: [2022-11-26 09:07:20,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 0: [2022-11-26 09:07:20,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 26: [2022-11-26 09:07:20,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:07:20,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:07:20,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 09:07:20,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 8: [2022-11-26 09:07:20,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 14: [2022-11-26 09:07:20,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:07:20,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 8: [2022-11-26 09:07:20,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
14: [2022-11-26 09:07:20,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 31: [2022-11-26 09:07:20,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:07:20,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 09:07:20,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 10: [2022-11-26 09:07:20,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:07:20,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 09:07:20,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 19: [2022-11-26 09:07:20,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:07:20,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 09:07:20,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 7: [2022-11-26 09:07:20,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 09:07:20,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 09:07:20,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 29: [2022-11-26 09:07:20,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:07:20,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:07:20,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 09:07:20,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 9: [2022-11-26 09:07:20,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:07:20,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 09:07:20,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 13: [2022-11-26 09:07:20,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 09:07:20,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 29: [2022-11-26 09:07:20,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 13: [2022-11-26 09:07:20,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 29: [2022-11-26 09:07:20,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 12: [2022-11-26 09:07:20,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:07:20,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 09:07:20,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 30: [2022-11-26 09:07:20,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:07:20,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 09:07:20,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 27: [2022-11-26 09:07:20,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
27: [2022-11-26 09:07:20,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 09:07:20,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 18: [2022-11-26 09:07:20,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:07:20,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:07:20,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 21: [2022-11-26 09:07:20,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 18: [2022-11-26 09:07:20,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 21: [2022-11-26 09:07:20,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 25: [2022-11-26 09:07:20,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 09:07:20,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 09:07:20,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 3: [2022-11-26 09:07:20,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 09:07:20,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 09:07:20,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 11: [2022-11-26 09:07:20,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:07:20,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 09:07:20,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 22: [2022-11-26 09:07:20,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 09:07:20,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 09:07:20,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 28: [2022-11-26 09:07:20,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 09:07:20,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 09:07:20,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 24: [2022-11-26 09:07:20,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
24: [2022-11-26 09:07:20,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 09:07:20,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 6: [2022-11-26 09:07:20,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:07:20,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 09:07:20,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 20: [2022-11-26 09:07:20,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:07:20,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 09:07:20,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 5: [2022-11-26 09:07:20,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:07:20,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 09:07:20,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 1: [2022-11-26 09:07:20,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-26 09:07:20,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 09:07:20,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 23: [2022-11-26 09:07:20,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:07:20,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 09:07:20,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 2: [2022-11-26 09:07:20,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:07:20,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 09:07:20,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 14: [2022-11-26 09:07:20,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:07:20,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 09:07:20,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 27: [2022-11-26 09:07:20,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-26 09:07:20,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 09:07:20,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 0: [2022-11-26 09:07:20,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:07:20,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 09:07:20,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 16: [2022-11-26 09:07:20,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:07:20,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 09:07:20,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 8: [2022-11-26 09:07:20,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:07:20,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 09:07:20,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 31: [2022-11-26 09:07:20,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 
10: [2022-11-26 09:07:20,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:07:20,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 09:07:20,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 10: [2022-11-26 09:07:20,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 09:07:20,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 19: [2022-11-26 09:07:20,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:07:20,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 4: [2022-11-26 09:07:20,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:07:20,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 4: [2022-11-26 09:07:20,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 09:07:20,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 26: [2022-11-26 09:07:20,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
26: [2022-11-26 09:07:20,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 09:07:20,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 7: [2022-11-26 09:07:20,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:07:20,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 09:07:20,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 9: [2022-11-26 09:07:20,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:07:20,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 09:07:20,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 12: [2022-11-26 09:07:20,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:07:20,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:07:20,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 09:07:20,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
21: [2022-11-26 09:07:20,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:07:20,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 29: [2022-11-26 09:07:20,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 09:07:20,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 21: [2022-11-26 09:07:20,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 09:07:20,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 13: [2022-11-26 09:07:20,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 29: [2022-11-26 09:07:20,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 30: [2022-11-26 09:07:20,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:07:20,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 09:07:20,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 15: [2022-11-26 09:07:20,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 09:07:20,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 09:07:20,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 3: [2022-11-26 09:07:20,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:07:20,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 09:07:20,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 17: [2022-11-26 09:07:20,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 09:07:20,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 09:07:20,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 11: [2022-11-26 09:07:20,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:07:20,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 09:07:20,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 22: [2022-11-26 09:07:20,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-26 09:07:20,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 09:07:20,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 18: [2022-11-26 09:07:20,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:07:20,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 09:07:20,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 25: [2022-11-26 09:07:20,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 09:07:20,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 09:07:20,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 28: [2022-11-26 09:07:20,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 09:07:20,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 09:07:20,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 20: [2022-11-26 09:07:20,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
20: [2022-11-26 09:07:20,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 09:07:20,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 1: [2022-11-26 09:07:20,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:07:20,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 09:07:20,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 24: [2022-11-26 09:07:20,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 09:07:20,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 09:07:20,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 6: [2022-11-26 09:07:20,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:07:20,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 09:07:20,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 5: [2022-11-26 09:07:20,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 09:07:20,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 09:07:20,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 23: [2022-11-26 09:07:20,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:07:20,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 09:07:20,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 27: [2022-11-26 09:07:20,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:07:20,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 27: [2022-11-26 09:07:20,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 09:07:20,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 14: [2022-11-26 09:07:20,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:07:20,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 09:07:20,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
14: [2022-11-26 09:07:20,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 09:07:20,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 2: [2022-11-26 09:07:20,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:07:20,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 09:07:20,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 0: [2022-11-26 09:07:20,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:07:20,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 09:07:20,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 17: [2022-11-26 09:07:20,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 09:07:20,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 09:07:20,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 4: [2022-11-26 09:07:20,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 09:07:20,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 09:07:20,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 16: [2022-11-26 09:07:20,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:07:20,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 09:07:20,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 8: [2022-11-26 09:07:20,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:07:20,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:07:20,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 13: [2022-11-26 09:07:20,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 09:07:20,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 8: [2022-11-26 09:07:20,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 26: [2022-11-26 09:07:20,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
26: [2022-11-26 09:07:20,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 09:07:20,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 9: [2022-11-26 09:07:20,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:07:20,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 09:07:20,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 12: [2022-11-26 09:07:20,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:07:20,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 09:07:20,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 29: [2022-11-26 09:07:20,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 09:07:20,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 09:07:20,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 30: [2022-11-26 09:07:20,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-26 09:07:20,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 09:07:20,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 15: [2022-11-26 09:07:20,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:07:20,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 09:07:20,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 7: [2022-11-26 09:07:20,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:07:20,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 09:07:20,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 28: [2022-11-26 09:07:20,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 09:07:20,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 09:07:20,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 21: [2022-11-26 09:07:20,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
21: [2022-11-26 09:07:20,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 09:07:20,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 6: [2022-11-26 09:07:20,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:07:20,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 09:07:20,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 3: [2022-11-26 09:07:20,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:07:20,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 09:07:20,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 5: [2022-11-26 09:07:20,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:07:20,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 09:07:20,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 25: [2022-11-26 09:07:20,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
19: [2022-11-26 09:07:20,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 25: [2022-11-26 09:07:20,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 19: [2022-11-26 09:07:20,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 25: [2022-11-26 09:07:20,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 20: [2022-11-26 09:07:20,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:07:20,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 20: [2022-11-26 09:07:20,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 09:07:20,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 1: [2022-11-26 09:07:20,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:07:20,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 09:07:20,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 27: [2022-11-26 09:07:20,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
27: [2022-11-26 09:07:20,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 09:07:20,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 31: [2022-11-26 09:07:20,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:07:20,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 09:07:20,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 24: [2022-11-26 09:07:20,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 09:07:20,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 0: [2022-11-26 09:07:20,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:07:20,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 09:07:20,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 24: [2022-11-26 09:07:20,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 30: [2022-11-26 09:07:20,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
30: [2022-11-26 09:07:20,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 09:07:20,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 22: [2022-11-26 09:07:20,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 09:07:20,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 09:07:20,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 7: [2022-11-26 09:07:20,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 17: [2022-11-26 09:07:20,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 09:07:20,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 09:07:20,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 6: [2022-11-26 09:07:20,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
7: [2022-11-26 09:07:20,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 8: [2022-11-26 09:07:20,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:07:20,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 14: [2022-11-26 09:07:20,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:07:20,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 8: [2022-11-26 09:07:20,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 14: [2022-11-26 09:07:20,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 6: [2022-11-26 09:07:20,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 8: [2022-11-26 09:07:20,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 14: [2022-11-26 09:07:20,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 13: [2022-11-26 09:07:20,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 29: [2022-11-26 09:07:20,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
13: [2022-11-26 09:07:20,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 09:07:20,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 29: [2022-11-26 09:07:20,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 09:07:20,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 21: [2022-11-26 09:07:20,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:07:20,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 09:07:20,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 4: [2022-11-26 09:07:20,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:07:20,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 09:07:20,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 18: [2022-11-26 09:07:20,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
18: [2022-11-26 09:07:20,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 3: [2022-11-26 09:07:20,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:07:20,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 25: [2022-11-26 09:07:20,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:07:20,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 09:07:20,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 25: [2022-11-26 09:07:20,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 09:07:20,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 11: [2022-11-26 09:07:20,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:07:20,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:07:20,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 09:07:20,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
2: [2022-11-26 09:07:20,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 09:07:20,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 10: [2022-11-26 09:07:20,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:07:20,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:07:20,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 12: [2022-11-26 09:07:20,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 10: [2022-11-26 09:07:20,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 12: [2022-11-26 09:07:20,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 9: [2022-11-26 09:07:20,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:07:20,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 09:07:20,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 15: [2022-11-26 09:07:20,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
23: [2022-11-26 09:07:20,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:07:20,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 15: [2022-11-26 09:07:20,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 23: [2022-11-26 09:07:20,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 15: [2022-11-26 09:07:20,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 16: [2022-11-26 09:07:20,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:07:20,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 09:07:20,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 18: [2022-11-26 09:07:20,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:07:20,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
18: [2022-11-26 09:07:20,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 26: [2022-11-26 09:07:20,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 18: [2022-11-26 09:07:20,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 26: [2022-11-26 09:07:20,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 22: [2022-11-26 09:07:20,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 09:07:20,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 09:07:20,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 19: [2022-11-26 09:07:20,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:07:20,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 09:07:20,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 16: [2022-11-26 09:07:20,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 
16: [2022-11-26 09:07:20,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 09:07:20,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 10: [2022-11-26 09:07:20,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:07:20,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 09:07:20,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 26: [2022-11-26 09:07:20,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:07:20,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 09:07:20,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 11: [2022-11-26 09:07:20,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:07:20,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step67000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 09:07:20,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
0: successfully saved checkpoint at iteration 67000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2602.95 31: iteration 67010/ 173500 | consumed samples: 17154560 | consumed tokens: 35132538880 | elapsed time per iteration (s): 1.06 | learning rate: 1.431E-04 | global batch size: 256 | lm loss: 2.034080E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.920 | TFLOPs: 14.64 | 31: iteration 67020/ 173500 | consumed samples: 17157120 | consumed tokens: 35137781760 | elapsed time per iteration (s): 0.81 | learning rate: 1.431E-04 | global batch size: 256 | lm loss: 2.023506E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.322 | TFLOPs: 19.08 | 31: iteration 67030/ 173500 | consumed samples: 17159680 | consumed tokens: 35143024640 | elapsed time per iteration (s): 0.81 | learning rate: 1.431E-04 | global batch size: 256 | lm loss: 2.040687E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.049 | TFLOPs: 19.12 | 31: iteration 67040/ 173500 | consumed samples: 17162240 | consumed tokens: 35148267520 | elapsed time per iteration (s): 0.78 | learning rate: 1.431E-04 | global batch size: 256 | lm loss: 2.041777E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.443 | TFLOPs: 19.75 | 31: iteration 67050/ 173500 | consumed samples: 17164800 | consumed tokens: 35153510400 | elapsed time per iteration (s): 0.73 | learning rate: 1.431E-04 | global batch size: 256 | lm loss: 2.009830E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.302 | TFLOPs: 21.31 | 31: iteration 67060/ 173500 | consumed samples: 17167360 | consumed tokens: 35158753280 | elapsed time per iteration (s): 0.75 | 
learning rate: 1.430E-04 | global batch size: 256 | lm loss: 2.030134E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.034 | TFLOPs: 20.69 | 31: iteration 67070/ 173500 | consumed samples: 17169920 | consumed tokens: 35163996160 | elapsed time per iteration (s): 0.77 | learning rate: 1.430E-04 | global batch size: 256 | lm loss: 2.029974E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.273 | TFLOPs: 20.10 | 31: iteration 67080/ 173500 | consumed samples: 17172480 | consumed tokens: 35169239040 | elapsed time per iteration (s): 0.75 | learning rate: 1.430E-04 | global batch size: 256 | lm loss: 2.025835E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.476 | TFLOPs: 20.72 | 31: iteration 67090/ 173500 | consumed samples: 17175040 | consumed tokens: 35174481920 | elapsed time per iteration (s): 0.76 | learning rate: 1.430E-04 | global batch size: 256 | lm loss: 2.036246E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.636 | TFLOPs: 20.37 | 31: iteration 67100/ 173500 | consumed samples: 17177600 | consumed tokens: 35179724800 | elapsed time per iteration (s): 0.80 | learning rate: 1.430E-04 | global batch size: 256 | lm loss: 2.062423E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.936 | TFLOPs: 19.36 | 31: iteration 67110/ 173500 | consumed samples: 17180160 | consumed tokens: 35184967680 | elapsed time per iteration (s): 0.74 | learning rate: 1.430E-04 | global batch size: 256 | lm loss: 1.997216E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.340 | TFLOPs: 20.83 | 31: iteration 67120/ 
173500 | consumed samples: 17182720 | consumed tokens: 35190210560 | elapsed time per iteration (s): 0.78 | learning rate: 1.430E-04 | global batch size: 256 | lm loss: 2.043957E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.962 | TFLOPs: 19.90 | 31: iteration 67130/ 173500 | consumed samples: 17185280 | consumed tokens: 35195453440 | elapsed time per iteration (s): 0.74 | learning rate: 1.429E-04 | global batch size: 256 | lm loss: 2.056787E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.523 | TFLOPs: 21.02 | 31: iteration 67140/ 173500 | consumed samples: 17187840 | consumed tokens: 35200696320 | elapsed time per iteration (s): 0.72 | learning rate: 1.429E-04 | global batch size: 256 | lm loss: 2.014293E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.249 | TFLOPs: 21.55 | 31: iteration 67150/ 173500 | consumed samples: 17190400 | consumed tokens: 35205939200 | elapsed time per iteration (s): 0.79 | learning rate: 1.429E-04 | global batch size: 256 | lm loss: 2.012340E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.560 | TFLOPs: 19.70 | 31: iteration 67160/ 173500 | consumed samples: 17192960 | consumed tokens: 35211182080 | elapsed time per iteration (s): 0.79 | learning rate: 1.429E-04 | global batch size: 256 | lm loss: 2.021133E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.150 | TFLOPs: 19.67 | 31: iteration 67170/ 173500 | consumed samples: 17195520 | consumed tokens: 35216424960 | elapsed time per iteration (s): 0.80 | learning rate: 1.429E-04 | global batch size: 256 | lm loss: 2.043034E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 319.000 | TFLOPs: 19.30 | 31: iteration 67180/ 173500 | consumed samples: 17198080 | consumed tokens: 35221667840 | elapsed time per iteration (s): 0.78 | learning rate: 1.429E-04 | global batch size: 256 | lm loss: 2.037726E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.826 | TFLOPs: 19.83 | 31: iteration 67190/ 173500 | consumed samples: 17200640 | consumed tokens: 35226910720 | elapsed time per iteration (s): 0.82 | learning rate: 1.428E-04 | global batch size: 256 | lm loss: 2.003647E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.562 | TFLOPs: 18.97 | 31: iteration 67200/ 173500 | consumed samples: 17203200 | consumed tokens: 35232153600 | elapsed time per iteration (s): 0.83 | learning rate: 1.428E-04 | global batch size: 256 | lm loss: 2.053989E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.235 | TFLOPs: 18.77 | 31: iteration 67210/ 173500 | consumed samples: 17205760 | consumed tokens: 35237396480 | elapsed time per iteration (s): 0.73 | learning rate: 1.428E-04 | global batch size: 256 | lm loss: 2.037695E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.400 | TFLOPs: 21.08 | 31: iteration 67220/ 173500 | consumed samples: 17208320 | consumed tokens: 35242639360 | elapsed time per iteration (s): 0.82 | learning rate: 1.428E-04 | global batch size: 256 | lm loss: 2.033537E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.152 | TFLOPs: 18.82 | 31: iteration 67230/ 173500 | consumed samples: 17210880 | consumed tokens: 35247882240 | elapsed time per iteration (s): 0.79 | learning rate: 
1.428E-04 | global batch size: 256 | lm loss: 2.063861E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.852 | TFLOPs: 19.71 | 31: iteration 67240/ 173500 | consumed samples: 17213440 | consumed tokens: 35253125120 | elapsed time per iteration (s): 0.81 | learning rate: 1.428E-04 | global batch size: 256 | lm loss: 2.017554E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.388 | TFLOPs: 19.14 | 31: iteration 67250/ 173500 | consumed samples: 17216000 | consumed tokens: 35258368000 | elapsed time per iteration (s): 0.75 | learning rate: 1.428E-04 | global batch size: 256 | lm loss: 2.038541E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.225 | TFLOPs: 20.70 | 31: iteration 67260/ 173500 | consumed samples: 17218560 | consumed tokens: 35263610880 | elapsed time per iteration (s): 0.78 | learning rate: 1.427E-04 | global batch size: 256 | lm loss: 2.042244E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.762 | TFLOPs: 19.83 | 31: iteration 67270/ 173500 | consumed samples: 17221120 | consumed tokens: 35268853760 | elapsed time per iteration (s): 0.78 | learning rate: 1.427E-04 | global batch size: 256 | lm loss: 2.028670E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.921 | TFLOPs: 19.90 | 31: iteration 67280/ 173500 | consumed samples: 17223680 | consumed tokens: 35274096640 | elapsed time per iteration (s): 0.79 | learning rate: 1.427E-04 | global batch size: 256 | lm loss: 2.010112E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.180 | TFLOPs: 19.61 | 31: iteration 67290/ 173500 | 
consumed samples: 17226240 | consumed tokens: 35279339520 | elapsed time per iteration (s): 0.76 | learning rate: 1.427E-04 | global batch size: 256 | lm loss: 2.034346E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.940 | TFLOPs: 20.50 | 31: iteration 67300/ 173500 | consumed samples: 17228800 | consumed tokens: 35284582400 | elapsed time per iteration (s): 0.82 | learning rate: 1.427E-04 | global batch size: 256 | lm loss: 2.022414E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.950 | TFLOPs: 18.81 | 31: iteration 67310/ 173500 | consumed samples: 17231360 | consumed tokens: 35289825280 | elapsed time per iteration (s): 0.80 | learning rate: 1.427E-04 | global batch size: 256 | lm loss: 2.008057E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.472 | TFLOPs: 19.33 | 31: iteration 67320/ 173500 | consumed samples: 17233920 | consumed tokens: 35295068160 | elapsed time per iteration (s): 0.80 | learning rate: 1.426E-04 | global batch size: 256 | lm loss: 2.061090E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.853 | TFLOPs: 19.47 | 31: iteration 67330/ 173500 | consumed samples: 17236480 | consumed tokens: 35300311040 | elapsed time per iteration (s): 0.77 | learning rate: 1.426E-04 | global batch size: 256 | lm loss: 2.027788E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.318 | TFLOPs: 20.10 | 31: iteration 67340/ 173500 | consumed samples: 17239040 | consumed tokens: 35305553920 | elapsed time per iteration (s): 0.78 | learning rate: 1.426E-04 | global batch size: 256 | lm loss: 2.038436E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 326.541 | TFLOPs: 19.75 | 31: iteration 67350/ 173500 | consumed samples: 17241600 | consumed tokens: 35310796800 | elapsed time per iteration (s): 0.74 | learning rate: 1.426E-04 | global batch size: 256 | lm loss: 2.031811E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.077 | TFLOPs: 21.00 | 31: iteration 67360/ 173500 | consumed samples: 17244160 | consumed tokens: 35316039680 | elapsed time per iteration (s): 0.79 | learning rate: 1.426E-04 | global batch size: 256 | lm loss: 2.039060E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.114 | TFLOPs: 19.67 | 31: iteration 67370/ 173500 | consumed samples: 17246720 | consumed tokens: 35321282560 | elapsed time per iteration (s): 0.74 | learning rate: 1.426E-04 | global batch size: 256 | lm loss: 2.022607E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.284 | TFLOPs: 20.83 | 31: iteration 67380/ 173500 | consumed samples: 17249280 | consumed tokens: 35326525440 | elapsed time per iteration (s): 0.80 | learning rate: 1.426E-04 | global batch size: 256 | lm loss: 2.032241E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.059 | TFLOPs: 19.42 | 31: iteration 67390/ 173500 | consumed samples: 17251840 | consumed tokens: 35331768320 | elapsed time per iteration (s): 0.76 | learning rate: 1.425E-04 | global batch size: 256 | lm loss: 2.027821E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.065 | TFLOPs: 20.33 | 31: iteration 67400/ 173500 | consumed samples: 17254400 | consumed tokens: 35337011200 | elapsed time per iteration (s): 0.86 | learning rate: 1.425E-04 | global batch 
size: 256 | lm loss: 1.997310E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.690 | TFLOPs: 18.01 | 31: iteration 67410/ 173500 | consumed samples: 17256960 | consumed tokens: 35342254080 | elapsed time per iteration (s): 0.82 | learning rate: 1.425E-04 | global batch size: 256 | lm loss: 2.033840E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.026 | TFLOPs: 19.00 | 31: iteration 67420/ 173500 | consumed samples: 17259520 | consumed tokens: 35347496960 | elapsed time per iteration (s): 0.81 | learning rate: 1.425E-04 | global batch size: 256 | lm loss: 2.056764E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.569 | TFLOPs: 19.03 | 31: iteration 67430/ 173500 | consumed samples: 17262080 | consumed tokens: 35352739840 | elapsed time per iteration (s): 0.83 | learning rate: 1.425E-04 | global batch size: 256 | lm loss: 1.990370E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.279 | TFLOPs: 18.71 | 31: iteration 67440/ 173500 | consumed samples: 17264640 | consumed tokens: 35357982720 | elapsed time per iteration (s): 0.77 | learning rate: 1.425E-04 | global batch size: 256 | lm loss: 2.035727E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.663 | TFLOPs: 20.00 | 31: iteration 67450/ 173500 | consumed samples: 17267200 | consumed tokens: 35363225600 | elapsed time per iteration (s): 0.77 | learning rate: 1.425E-04 | global batch size: 256 | lm loss: 2.031613E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.642 | TFLOPs: 20.00 | 31: iteration 67460/ 173500 | consumed samples: 17269760 | 
consumed tokens: 35368468480 | elapsed time per iteration (s): 0.81 | learning rate: 1.424E-04 | global batch size: 256 | lm loss: 2.007018E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.561 | TFLOPs: 19.21 | 31: iteration 67470/ 173500 | consumed samples: 17272320 | consumed tokens: 35373711360 | elapsed time per iteration (s): 0.76 | learning rate: 1.424E-04 | global batch size: 256 | lm loss: 2.033784E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.108 | TFLOPs: 20.27 | 31: iteration 67480/ 173500 | consumed samples: 17274880 | consumed tokens: 35378954240 | elapsed time per iteration (s): 0.78 | learning rate: 1.424E-04 | global batch size: 256 | lm loss: 2.017874E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.834 | TFLOPs: 19.95 | 31: iteration 67490/ 173500 | consumed samples: 17277440 | consumed tokens: 35384197120 | elapsed time per iteration (s): 0.77 | learning rate: 1.424E-04 | global batch size: 256 | lm loss: 2.052210E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.963 | TFLOPs: 20.08 | 31: iteration 67500/ 173500 | consumed samples: 17280000 | consumed tokens: 35389440000 | elapsed time per iteration (s): 0.77 | learning rate: 1.424E-04 | global batch size: 256 | lm loss: 2.036505E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.479 | TFLOPs: 20.11 | 31: iteration 67510/ 173500 | consumed samples: 17282560 | consumed tokens: 35394682880 | elapsed time per iteration (s): 0.80 | learning rate: 1.424E-04 | global batch size: 256 | lm loss: 2.035322E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 320.191 | TFLOPs: 19.37 | 31: iteration 67520/ 173500 | consumed samples: 17285120 | consumed tokens: 35399925760 | elapsed time per iteration (s): 0.76 | learning rate: 1.423E-04 | global batch size: 256 | lm loss: 2.021638E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.989 | TFLOPs: 20.51 | 31: iteration 67530/ 173500 | consumed samples: 17287680 | consumed tokens: 35405168640 | elapsed time per iteration (s): 0.74 | learning rate: 1.423E-04 | global batch size: 256 | lm loss: 2.004462E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.374 | TFLOPs: 20.95 | 31: iteration 67540/ 173500 | consumed samples: 17290240 | consumed tokens: 35410411520 | elapsed time per iteration (s): 0.74 | learning rate: 1.423E-04 | global batch size: 256 | lm loss: 2.034734E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.271 | TFLOPs: 20.95 | 31: iteration 67550/ 173500 | consumed samples: 17292800 | consumed tokens: 35415654400 | elapsed time per iteration (s): 0.78 | learning rate: 1.423E-04 | global batch size: 256 | lm loss: 2.012496E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.314 | TFLOPs: 19.98 | 31: iteration 67560/ 173500 | consumed samples: 17295360 | consumed tokens: 35420897280 | elapsed time per iteration (s): 0.72 | learning rate: 1.423E-04 | global batch size: 256 | lm loss: 2.044267E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.552 | TFLOPs: 21.39 | 31: iteration 67570/ 173500 | consumed samples: 17297920 | consumed tokens: 35426140160 | elapsed time per iteration (s): 0.78 | learning rate: 1.423E-04 | global batch size: 256 | lm loss: 
2.005193E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.704 | TFLOPs: 19.76 | 31: iteration 67580/ 173500 | consumed samples: 17300480 | consumed tokens: 35431383040 | elapsed time per iteration (s): 0.82 | learning rate: 1.423E-04 | global batch size: 256 | lm loss: 2.059095E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.691 | TFLOPs: 18.98 | 31: iteration 67590/ 173500 | consumed samples: 17303040 | consumed tokens: 35436625920 | elapsed time per iteration (s): 0.83 | learning rate: 1.422E-04 | global batch size: 256 | lm loss: 2.033615E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.930 | TFLOPs: 18.63 | 31: iteration 67600/ 173500 | consumed samples: 17305600 | consumed tokens: 35441868800 | elapsed time per iteration (s): 0.78 | learning rate: 1.422E-04 | global batch size: 256 | lm loss: 2.011546E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.099 | TFLOPs: 19.79 | 31: iteration 67610/ 173500 | consumed samples: 17308160 | consumed tokens: 35447111680 | elapsed time per iteration (s): 0.86 | learning rate: 1.422E-04 | global batch size: 256 | lm loss: 2.015131E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.167 | TFLOPs: 18.04 | 31: iteration 67620/ 173500 | consumed samples: 17310720 | consumed tokens: 35452354560 | elapsed time per iteration (s): 0.81 | learning rate: 1.422E-04 | global batch size: 256 | lm loss: 2.045980E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.805 | TFLOPs: 19.11 | 31: iteration 67630/ 173500 | consumed samples: 17313280 | consumed tokens: 
35457597440 | elapsed time per iteration (s): 0.79 | learning rate: 1.422E-04 | global batch size: 256 | lm loss: 2.020986E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.524 | TFLOPs: 19.57 | 31: iteration 67640/ 173500 | consumed samples: 17315840 | consumed tokens: 35462840320 | elapsed time per iteration (s): 0.82 | learning rate: 1.422E-04 | global batch size: 256 | lm loss: 2.002081E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.851 | TFLOPs: 18.93 | 31: iteration 67650/ 173500 | consumed samples: 17318400 | consumed tokens: 35468083200 | elapsed time per iteration (s): 0.79 | learning rate: 1.421E-04 | global batch size: 256 | lm loss: 2.045633E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.960 | TFLOPs: 19.54 | 31: iteration 67660/ 173500 | consumed samples: 17320960 | consumed tokens: 35473326080 | elapsed time per iteration (s): 0.77 | learning rate: 1.421E-04 | global batch size: 256 | lm loss: 2.044597E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.524 | TFLOPs: 20.00 | 31: iteration 67670/ 173500 | consumed samples: 17323520 | consumed tokens: 35478568960 | elapsed time per iteration (s): 0.76 | learning rate: 1.421E-04 | global batch size: 256 | lm loss: 2.025292E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.930 | TFLOPs: 20.50 | 31: iteration 67680/ 173500 | consumed samples: 17326080 | consumed tokens: 35483811840 | elapsed time per iteration (s): 0.77 | learning rate: 1.421E-04 | global batch size: 256 | lm loss: 2.015979E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 331.938 | TFLOPs: 20.08 | 31: iteration 67690/ 173500 | consumed samples: 17328640 | consumed tokens: 35489054720 | elapsed time per iteration (s): 0.80 | learning rate: 1.421E-04 | global batch size: 256 | lm loss: 2.053188E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.057 | TFLOPs: 19.42 | 31: iteration 67700/ 173500 | consumed samples: 17331200 | consumed tokens: 35494297600 | elapsed time per iteration (s): 0.75 | learning rate: 1.421E-04 | global batch size: 256 | lm loss: 2.014837E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.722 | TFLOPs: 20.73 | 31: iteration 67710/ 173500 | consumed samples: 17333760 | consumed tokens: 35499540480 | elapsed time per iteration (s): 0.75 | learning rate: 1.421E-04 | global batch size: 256 | lm loss: 2.025575E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.123 | TFLOPs: 20.52 | 31: iteration 67720/ 173500 | consumed samples: 17336320 | consumed tokens: 35504783360 | elapsed time per iteration (s): 0.78 | learning rate: 1.420E-04 | global batch size: 256 | lm loss: 2.047784E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.276 | TFLOPs: 19.86 | 31: iteration 67730/ 173500 | consumed samples: 17338880 | consumed tokens: 35510026240 | elapsed time per iteration (s): 0.78 | learning rate: 1.420E-04 | global batch size: 256 | lm loss: 2.023598E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.017 | TFLOPs: 19.78 | 31: iteration 67740/ 173500 | consumed samples: 17341440 | consumed tokens: 35515269120 | elapsed time per iteration (s): 0.71 | learning rate: 1.420E-04 | global batch size: 256 | lm loss: 2.059185E+00 | grad 
norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.538 | TFLOPs: 21.69 | 31: iteration 67750/ 173500 | consumed samples: 17344000 | consumed tokens: 35520512000 | elapsed time per iteration (s): 0.83 | learning rate: 1.420E-04 | global batch size: 256 | lm loss: 2.050848E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.414 | TFLOPs: 18.72 | 31: iteration 67760/ 173500 | consumed samples: 17346560 | consumed tokens: 35525754880 | elapsed time per iteration (s): 0.81 | learning rate: 1.420E-04 | global batch size: 256 | lm loss: 2.025205E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.285 | TFLOPs: 19.01 | 31: iteration 67770/ 173500 | consumed samples: 17349120 | consumed tokens: 35530997760 | elapsed time per iteration (s): 0.80 | learning rate: 1.420E-04 | global batch size: 256 | lm loss: 2.028743E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.197 | TFLOPs: 19.43 | 31: iteration 67780/ 173500 | consumed samples: 17351680 | consumed tokens: 35536240640 | elapsed time per iteration (s): 0.84 | learning rate: 1.419E-04 | global batch size: 256 | lm loss: 2.008932E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.799 | TFLOPs: 18.44 | 31: iteration 67790/ 173500 | consumed samples: 17354240 | consumed tokens: 35541483520 | elapsed time per iteration (s): 0.85 | learning rate: 1.419E-04 | global batch size: 256 | lm loss: 2.029870E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.120 | TFLOPs: 18.16 | 31: iteration 67800/ 173500 | consumed samples: 17356800 | consumed tokens: 35546726400 | elapsed time 
per iteration (s): 0.89 | learning rate: 1.419E-04 | global batch size: 256 | lm loss: 2.013724E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.107 | TFLOPs: 17.49 | 31: iteration 67810/ 173500 | consumed samples: 17359360 | consumed tokens: 35551969280 | elapsed time per iteration (s): 0.78 | learning rate: 1.419E-04 | global batch size: 256 | lm loss: 2.051681E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.117 | TFLOPs: 19.73 | 31: iteration 67820/ 173500 | consumed samples: 17361920 | consumed tokens: 35557212160 | elapsed time per iteration (s): 0.79 | learning rate: 1.419E-04 | global batch size: 256 | lm loss: 2.027280E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.663 | TFLOPs: 19.64 | 31: iteration 67830/ 173500 | consumed samples: 17364480 | consumed tokens: 35562455040 | elapsed time per iteration (s): 0.79 | learning rate: 1.419E-04 | global batch size: 256 | lm loss: 2.015710E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.627 | TFLOPs: 19.58 | 31: iteration 67840/ 173500 | consumed samples: 17367040 | consumed tokens: 35567697920 | elapsed time per iteration (s): 0.80 | learning rate: 1.419E-04 | global batch size: 256 | lm loss: 2.067744E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.466 | TFLOPs: 19.27 | 31: iteration 67850/ 173500 | consumed samples: 17369600 | consumed tokens: 35572940800 | elapsed time per iteration (s): 0.76 | learning rate: 1.418E-04 | global batch size: 256 | lm loss: 2.021345E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.805 | TFLOPs: 
20.44 | 31: iteration 67860/ 173500 | consumed samples: 17372160 | consumed tokens: 35578183680 | elapsed time per iteration (s): 0.72 | learning rate: 1.418E-04 | global batch size: 256 | lm loss: 2.022770E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.978 | TFLOPs: 21.41 | 31: iteration 67870/ 173500 | consumed samples: 17374720 | consumed tokens: 35583426560 | elapsed time per iteration (s): 0.80 | learning rate: 1.418E-04 | global batch size: 256 | lm loss: 2.038547E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.797 | TFLOPs: 19.41 | 31: iteration 67880/ 173500 | consumed samples: 17377280 | consumed tokens: 35588669440 | elapsed time per iteration (s): 0.73 | learning rate: 1.418E-04 | global batch size: 256 | lm loss: 2.048701E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.945 | TFLOPs: 21.11 | 31: iteration 67890/ 173500 | consumed samples: 17379840 | consumed tokens: 35593912320 | elapsed time per iteration (s): 0.75 | learning rate: 1.418E-04 | global batch size: 256 | lm loss: 2.043475E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.764 | TFLOPs: 20.68 | 31: iteration 67900/ 173500 | consumed samples: 17382400 | consumed tokens: 35599155200 | elapsed time per iteration (s): 0.76 | learning rate: 1.418E-04 | global batch size: 256 | lm loss: 2.024194E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.177 | TFLOPs: 20.34 | 31: iteration 67910/ 173500 | consumed samples: 17384960 | consumed tokens: 35604398080 | elapsed time per iteration (s): 0.76 | learning rate: 1.417E-04 | global batch size: 256 | lm loss: 2.012474E+00 | grad norm: 0.130 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.769 | TFLOPs: 20.37 | 31: iteration 67920/ 173500 | consumed samples: 17387520 | consumed tokens: 35609640960 | elapsed time per iteration (s): 0.76 | learning rate: 1.417E-04 | global batch size: 256 | lm loss: 2.025398E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.390 | TFLOPs: 20.41 | 31: iteration 67930/ 173500 | consumed samples: 17390080 | consumed tokens: 35614883840 | elapsed time per iteration (s): 0.81 | learning rate: 1.417E-04 | global batch size: 256 | lm loss: 2.039641E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.378 | TFLOPs: 19.08 | 31: iteration 67940/ 173500 | consumed samples: 17392640 | consumed tokens: 35620126720 | elapsed time per iteration (s): 0.75 | learning rate: 1.417E-04 | global batch size: 256 | lm loss: 2.024016E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.190 | TFLOPs: 20.52 | 31: iteration 67950/ 173500 | consumed samples: 17395200 | consumed tokens: 35625369600 | elapsed time per iteration (s): 0.79 | learning rate: 1.417E-04 | global batch size: 256 | lm loss: 2.019478E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.090 | TFLOPs: 19.55 | 31: iteration 67960/ 173500 | consumed samples: 17397760 | consumed tokens: 35630612480 | elapsed time per iteration (s): 0.75 | learning rate: 1.417E-04 | global batch size: 256 | lm loss: 1.998470E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.974 | TFLOPs: 20.69 | 31: iteration 67970/ 173500 | consumed samples: 17400320 | consumed tokens: 35635855360 | elapsed time per iteration (s): 0.81 | 
learning rate: 1.417E-04 | global batch size: 256 | lm loss: 2.001211E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.718 | TFLOPs: 19.04 |
31: iteration 67980/ 173500 | consumed samples: 17402880 | consumed tokens: 35641098240 | elapsed time per iteration (s): 0.80 | learning rate: 1.416E-04 | global batch size: 256 | lm loss: 2.041522E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.644 | TFLOPs: 19.46 |
31: iteration 67990/ 173500 | consumed samples: 17405440 | consumed tokens: 35646341120 | elapsed time per iteration (s): 0.80 | learning rate: 1.416E-04 | global batch size: 256 | lm loss: 2.019069E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.560 | TFLOPs: 19.33 |
0: [2022-11-26 09:20:24,042] [INFO] [logging.py:68:log_dist] [Rank 0] step=68000, skipped=0, lr=[0.00014160436454810027, 0.00014160436454810027, 0.00014160436454810027], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
31: iteration 68000/ 173500 | consumed samples: 17408000 | consumed tokens: 35651584000 | elapsed time per iteration (s): 0.83 | learning rate: 1.416E-04 | global batch size: 256 | lm loss: 2.037169E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.642 | TFLOPs: 18.73 |
0: steps: 68000 loss: 2.0491 iter time (s): 0.782 samples/sec: 327.531
31: -------------------------------------------------------------------------------------------
31: valid loss at iteration 68000 | lm loss value: 1.878408E+00 | lm loss PPL: 6.543080E+00 |
31: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 68000 to checkpoints_1b1long
0: [2022-11-26 09:20:24,373] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step68000 is begin to save!
0: [2022-11-26 09:20:24,386] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_01-model_00-model_states.pt...
0: [2022-11-26 09:20:24,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_01-model_00-model_states.pt.
0: [2022-11-26 09:20:24,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_03-model_00-model_states.pt...
0: [2022-11-26 09:20:24,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_03-model_00-model_states.pt.
0: [2022-11-26 09:20:24,701] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_04-model_00-model_states.pt...
0: [2022-11-26 09:20:24,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_04-model_00-model_states.pt.
0: [2022-11-26 09:20:24,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_05-model_00-model_states.pt...
0: [2022-11-26 09:20:24,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_05-model_00-model_states.pt.
0: [2022-11-26 09:20:24,857] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_06-model_00-model_states.pt...
0: [2022-11-26 09:20:24,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_06-model_00-model_states.pt.
0: [2022-11-26 09:20:24,938] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_07-model_00-model_states.pt...
0: [2022-11-26 09:20:25,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_07-model_00-model_states.pt.
0: [2022-11-26 09:20:25,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_08-model_00-model_states.pt... 0: [2022-11-26 09:20:25,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_08-model_00-model_states.pt. 0: [2022-11-26 09:20:25,104] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_09-model_00-model_states.pt... 0: [2022-11-26 09:20:25,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_09-model_00-model_states.pt. 0: [2022-11-26 09:20:25,182] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_10-model_00-model_states.pt... 0: [2022-11-26 09:20:25,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_10-model_00-model_states.pt. 0: [2022-11-26 09:20:25,258] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_11-model_00-model_states.pt... 0: [2022-11-26 09:20:25,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_11-model_00-model_states.pt. 0: [2022-11-26 09:20:25,332] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_12-model_00-model_states.pt... 0: [2022-11-26 09:20:25,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_12-model_00-model_states.pt. 0: [2022-11-26 09:20:25,410] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_13-model_00-model_states.pt... 0: [2022-11-26 09:20:25,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 09:20:25,488] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_14-model_00-model_states.pt... 0: [2022-11-26 09:20:25,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_14-model_00-model_states.pt. 0: [2022-11-26 09:20:25,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_15-model_00-model_states.pt... 0: [2022-11-26 09:20:25,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_15-model_00-model_states.pt. 0: [2022-11-26 09:20:25,642] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_16-model_00-model_states.pt... 0: [2022-11-26 09:20:25,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_16-model_00-model_states.pt. 0: [2022-11-26 09:20:25,718] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_17-model_00-model_states.pt... 0: [2022-11-26 09:20:25,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_17-model_00-model_states.pt. 0: [2022-11-26 09:20:25,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_18-model_00-model_states.pt... 0: [2022-11-26 09:20:25,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_18-model_00-model_states.pt. 0: [2022-11-26 09:20:25,870] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_19-model_00-model_states.pt... 0: [2022-11-26 09:20:25,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 09:20:25,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_20-model_00-model_states.pt... 0: [2022-11-26 09:20:26,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_20-model_00-model_states.pt. 0: [2022-11-26 09:20:26,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_21-model_00-model_states.pt... 0: [2022-11-26 09:20:26,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_21-model_00-model_states.pt. 0: [2022-11-26 09:20:26,100] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_22-model_00-model_states.pt... 0: [2022-11-26 09:20:26,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_22-model_00-model_states.pt. 0: [2022-11-26 09:20:26,175] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_23-model_00-model_states.pt... 0: [2022-11-26 09:20:26,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_23-model_00-model_states.pt. 0: [2022-11-26 09:20:26,253] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_24-model_00-model_states.pt... 0: [2022-11-26 09:20:26,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_24-model_00-model_states.pt. 0: [2022-11-26 09:20:26,328] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_25-model_00-model_states.pt... 0: [2022-11-26 09:20:26,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 09:20:26,406] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_26-model_00-model_states.pt... 0: [2022-11-26 09:20:26,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_26-model_00-model_states.pt. 0: [2022-11-26 09:20:26,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_27-model_00-model_states.pt... 0: [2022-11-26 09:20:26,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_27-model_00-model_states.pt. 0: [2022-11-26 09:20:26,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_28-model_00-model_states.pt... 0: [2022-11-26 09:20:26,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_28-model_00-model_states.pt. 0: [2022-11-26 09:20:26,634] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/layer_30-model_00-model_states.pt... 0: [2022-11-26 09:20:26,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/layer_30-model_00-model_states.pt. 0: [2022-11-26 09:20:26,638] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step68000/mp_rank_00_model_states.pt 0: [2022-11-26 09:20:26,639] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/mp_rank_00_model_states.pt... 0: [2022-11-26 09:20:26,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/mp_rank_00_model_states.pt. 0: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
6: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
4: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 
2: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 
25: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
11: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
31: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 
30: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 
18: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 
0: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
7: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 
10: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
3: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 
23: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
31: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 
30: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 
26: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 
5: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 
16: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
15: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 
30: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 
7: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
12: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
1: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:20:26,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 
28: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:20:26,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:20:26,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:20:26,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 09:20:26,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 18: [2022-11-26 09:20:26,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:20:26,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 09:20:26,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 29: [2022-11-26 09:20:26,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 09:20:26,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 09:20:26,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
20: [2022-11-26 09:20:26,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:20:26,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 09:20:26,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 4: [2022-11-26 09:20:26,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:20:26,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 09:20:26,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 10: [2022-11-26 09:20:26,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 27: [2022-11-26 09:20:26,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:20:26,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 27: [2022-11-26 09:20:26,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 10: [2022-11-26 09:20:26,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 27: [2022-11-26 09:20:26,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
22: [2022-11-26 09:20:26,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 09:20:26,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 09:20:26,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 9: [2022-11-26 09:20:26,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:20:26,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 09:20:26,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 14: [2022-11-26 09:20:26,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:20:26,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 09:20:26,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 0: [2022-11-26 09:20:26,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:20:26,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 09:20:26,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
28: [2022-11-26 09:20:26,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:20:26,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 28: [2022-11-26 09:20:26,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 2: [2022-11-26 09:20:26,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 28: [2022-11-26 09:20:26,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 2: [2022-11-26 09:20:26,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 09:20:26,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 09:20:26,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 2: [2022-11-26 09:20:26,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 12: [2022-11-26 09:20:26,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:20:26,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 09:20:26,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
3: [2022-11-26 09:20:26,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:20:26,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:20:26,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:20:26,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 09:20:26,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 3: [2022-11-26 09:20:26,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 26: [2022-11-26 09:20:26,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 26: [2022-11-26 09:20:26,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 3: [2022-11-26 09:20:26,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 27: [2022-11-26 09:20:26,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:20:26,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
27: [2022-11-26 09:20:26,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 3: [2022-11-26 09:20:26,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 09:20:26,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 27: [2022-11-26 09:20:26,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 7: [2022-11-26 09:20:26,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:20:26,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 09:20:26,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 10: [2022-11-26 09:20:26,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:20:26,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 09:20:26,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 17: [2022-11-26 09:20:26,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 
17: [2022-11-26 09:20:26,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 09:20:26,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 8: [2022-11-26 09:20:26,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:20:26,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:20:26,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:20:26,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:20:26,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
8: [2022-11-26 09:20:26,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 09:20:26,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 21: [2022-11-26 09:20:26,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 19: [2022-11-26 09:20:26,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 7: [2022-11-26 09:20:26,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 8: [2022-11-26 09:20:26,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 23: [2022-11-26 09:20:26,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:20:26,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 19: [2022-11-26 09:20:26,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 7: [2022-11-26 09:20:26,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 8: [2022-11-26 09:20:26,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
23: [2022-11-26 09:20:26,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 09:20:26,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 20: [2022-11-26 09:20:26,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:20:26,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 09:20:26,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 4: [2022-11-26 09:20:26,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:20:26,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 09:20:26,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 31: [2022-11-26 09:20:26,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:20:26,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
31: [2022-11-26 09:20:26,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 18: [2022-11-26 09:20:26,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 31: [2022-11-26 09:20:26,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 18: [2022-11-26 09:20:26,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 12: [2022-11-26 09:20:26,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:20:26,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:20:26,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 13: [2022-11-26 09:20:26,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 12: [2022-11-26 09:20:26,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 13: [2022-11-26 09:20:26,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 19: [2022-11-26 09:20:26,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
19: [2022-11-26 09:20:26,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 09:20:26,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 17: [2022-11-26 09:20:26,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 09:20:26,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 09:20:26,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 7: [2022-11-26 09:20:26,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:20:26,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 09:20:26,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 9: [2022-11-26 09:20:26,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:20:26,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-26 09:20:26,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 9: [2022-11-26 09:20:26,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 22: [2022-11-26 09:20:26,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:20:26,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 9: [2022-11-26 09:20:26,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 22: [2022-11-26 09:20:26,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 09:20:26,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 23: [2022-11-26 09:20:26,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:20:26,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 09:20:26,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 6: [2022-11-26 09:20:26,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:20:26,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
6: [2022-11-26 09:20:26,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 9: [2022-11-26 09:20:26,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 14: [2022-11-26 09:20:26,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:20:26,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 9: [2022-11-26 09:20:26,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 14: [2022-11-26 09:20:26,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 09:20:26,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 14: [2022-11-26 09:20:26,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:20:26,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 09:20:26,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 29: [2022-11-26 09:20:26,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-26 09:20:26,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 09:20:26,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 20: [2022-11-26 09:20:26,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:20:26,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:20:26,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 2: [2022-11-26 09:20:26,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 20: [2022-11-26 09:20:26,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 2: [2022-11-26 09:20:26,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 28: [2022-11-26 09:20:26,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 29: [2022-11-26 09:20:26,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 09:20:26,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 09:20:26,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
26: [2022-11-26 09:20:26,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:20:26,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:20:26,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 09:20:26,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 0: [2022-11-26 09:20:26,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 18: [2022-11-26 09:20:26,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:20:26,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 18: [2022-11-26 09:20:26,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 09:20:26,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 21: [2022-11-26 09:20:26,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:20:26,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-26 09:20:26,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 09:20:26,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 09:20:26,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 21: [2022-11-26 09:20:26,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 8: [2022-11-26 09:20:26,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:20:26,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 0: [2022-11-26 09:20:26,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:20:26,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 10: [2022-11-26 09:20:26,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:20:26,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 09:20:26,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 10: [2022-11-26 09:20:26,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 09:20:26,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 09:20:26,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 19: [2022-11-26 09:20:26,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:20:26,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 09:20:26,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 0: [2022-11-26 09:20:26,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:20:26,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:20:26,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 
31: [2022-11-26 09:20:26,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 0: [2022-11-26 09:20:26,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 6: [2022-11-26 09:20:26,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 31: [2022-11-26 09:20:26,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 6: [2022-11-26 09:20:26,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 0: [2022-11-26 09:20:26,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 31: [2022-11-26 09:20:26,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:20:26,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 09:20:26,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 27: [2022-11-26 09:20:26,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:20:26,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-26 09:20:26,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 27: [2022-11-26 09:20:26,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 21: [2022-11-26 09:20:26,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 17: [2022-11-26 09:20:26,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 27: [2022-11-26 09:20:26,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 17: [2022-11-26 09:20:26,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 09:20:26,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 27: [2022-11-26 09:20:26,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 09:20:26,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 4: [2022-11-26 09:20:26,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:20:26,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
16: [2022-11-26 09:20:26,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:20:26,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 16: [2022-11-26 09:20:26,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 09:20:26,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 27: [2022-11-26 09:20:26,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 16: [2022-11-26 09:20:26,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 16: [2022-11-26 09:20:26,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 4: [2022-11-26 09:20:26,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 11: [2022-11-26 09:20:26,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:20:26,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
11: [2022-11-26 09:20:26,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 28: [2022-11-26 09:20:26,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 12: [2022-11-26 09:20:26,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 28: [2022-11-26 09:20:26,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 12: [2022-11-26 09:20:26,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 11: [2022-11-26 09:20:26,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 28: [2022-11-26 09:20:26,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:20:26,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 28: [2022-11-26 09:20:26,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 09:20:26,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 09:20:26,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
28: [2022-11-26 09:20:26,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 12: [2022-11-26 09:20:26,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 28: [2022-11-26 09:20:26,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 12: [2022-11-26 09:20:26,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 23: [2022-11-26 09:20:26,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:20:26,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 09:20:26,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 6: [2022-11-26 09:20:26,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:20:26,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 09:20:26,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 11: [2022-11-26 09:20:26,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:20:26,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-26 09:20:26,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 09:20:26,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 09:20:26,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 11: [2022-11-26 09:20:26,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 8: [2022-11-26 09:20:26,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 24: [2022-11-26 09:20:26,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:20:26,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 09:20:26,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 24: [2022-11-26 09:20:26,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 7: [2022-11-26 09:20:26,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:20:26,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 24: [2022-11-26 09:20:26,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
7: [2022-11-26 09:20:26,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 24: [2022-11-26 09:20:26,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 09:20:26,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 17: [2022-11-26 09:20:26,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 24: [2022-11-26 09:20:26,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 17: [2022-11-26 09:20:26,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 09:20:26,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 31: [2022-11-26 09:20:26,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:20:26,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:20:26,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 09:20:26,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
20: [2022-11-26 09:20:26,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 09:20:26,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 18: [2022-11-26 09:20:26,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:20:26,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 09:20:26,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 4: [2022-11-26 09:20:26,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:20:26,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 09:20:26,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 22: [2022-11-26 09:20:26,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 09:20:26,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 09:20:26,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 1: [2022-11-26 09:20:26,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 09:20:26,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 09:20:26,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 1: [2022-11-26 09:20:26,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:20:26,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 09:20:26,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 30: [2022-11-26 09:20:26,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:20:26,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:20:26,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:20:26,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 09:20:26,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 09:20:26,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 30: [2022-11-26 09:20:26,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
30: [2022-11-26 09:20:26,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 09:20:26,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 30: [2022-11-26 09:20:26,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:20:26,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 09:20:26,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 9: [2022-11-26 09:20:26,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:20:26,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 09:20:26,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 9: [2022-11-26 09:20:26,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:20:26,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 09:20:26,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 2: [2022-11-26 09:20:26,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 09:20:26,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 09:20:26,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 24: [2022-11-26 09:20:26,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 09:20:26,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 09:20:26,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 13: [2022-11-26 09:20:26,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:20:26,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 29: [2022-11-26 09:20:26,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:20:26,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 13: [2022-11-26 09:20:26,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
29: [2022-11-26 09:20:26,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 13: [2022-11-26 09:20:26,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 29: [2022-11-26 09:20:26,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 13: [2022-11-26 09:20:26,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 25: [2022-11-26 09:20:26,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 09:20:26,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 09:20:26,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:20:26,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 25: [2022-11-26 09:20:26,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
13: [2022-11-26 09:20:26,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 25: [2022-11-26 09:20:26,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 09:20:26,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 09:20:26,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 13: [2022-11-26 09:20:26,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 25: [2022-11-26 09:20:26,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 25: [2022-11-26 09:20:26,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 09:20:26,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 25: [2022-11-26 09:20:26,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 25: [2022-11-26 09:20:26,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 0: [2022-11-26 09:20:26,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 1: [2022-11-26 09:20:26,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
0: [2022-11-26 09:20:26,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 1: [2022-11-26 09:20:26,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 09:20:26,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 22: [2022-11-26 09:20:26,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 09:20:26,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 09:20:26,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 26: [2022-11-26 09:20:26,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:20:26,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 09:20:26,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 16: [2022-11-26 09:20:26,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:20:26,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
16: [2022-11-26 09:20:26,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 09:20:26,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 5: [2022-11-26 09:20:26,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:20:26,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:20:26,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:20:26,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:20:26,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:20:26,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
5: [2022-11-26 09:20:26,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 09:20:26,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 09:20:26,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 16: [2022-11-26 09:20:26,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 5: [2022-11-26 09:20:26,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 09:20:26,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 09:20:26,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 5: [2022-11-26 09:20:26,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 5: [2022-11-26 09:20:26,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 5: [2022-11-26 09:20:26,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 5: [2022-11-26 09:20:26,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 3: [2022-11-26 09:20:26,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 09:20:26,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 09:20:26,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 15: [2022-11-26 09:20:26,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:20:26,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:20:26,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:20:26,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:20:26,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 09:20:26,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 09:20:26,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 09:20:26,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 09:20:26,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
15: [2022-11-26 09:20:26,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 15: [2022-11-26 09:20:26,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 15: [2022-11-26 09:20:26,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 14: [2022-11-26 09:20:26,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:20:26,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 09:20:26,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 3: [2022-11-26 09:20:26,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:20:26,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 09:20:26,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 29: [2022-11-26 09:20:26,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 09:20:26,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 09:20:26,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
23: [2022-11-26 09:20:26,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:20:26,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 09:20:26,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 19: [2022-11-26 09:20:26,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:20:26,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 09:20:26,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 0: [2022-11-26 09:20:26,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:20:26,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 09:20:26,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 10: [2022-11-26 09:20:26,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:20:26,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 09:20:26,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
18: [2022-11-26 09:20:26,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:20:26,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 09:20:26,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 27: [2022-11-26 09:20:26,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 09:20:26,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 09:20:26,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 4: [2022-11-26 09:20:26,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:20:26,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 09:20:26,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 17: [2022-11-26 09:20:26,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 09:20:26,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 09:20:26,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
28: [2022-11-26 09:20:26,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 09:20:26,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 09:20:26,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 25: [2022-11-26 09:20:26,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:20:26,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 25: [2022-11-26 09:20:26,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 20: [2022-11-26 09:20:26,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 09:20:26,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 25: [2022-11-26 09:20:26,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 7: [2022-11-26 09:20:26,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:20:26,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 09:20:26,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
14: [2022-11-26 09:20:26,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:20:26,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 09:20:26,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 15: [2022-11-26 09:20:26,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:20:26,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 09:20:26,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 8: [2022-11-26 09:20:26,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:20:26,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 09:20:26,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 12: [2022-11-26 09:20:26,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:20:26,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 09:20:26,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
24: [2022-11-26 09:20:26,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 09:20:26,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 09:20:26,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 30: [2022-11-26 09:20:26,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:20:26,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 09:20:26,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 3: [2022-11-26 09:20:26,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:20:26,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 09:20:26,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 21: [2022-11-26 09:20:26,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:20:26,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 09:20:26,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
13: [2022-11-26 09:20:26,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:20:26,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 09:20:26,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 16: [2022-11-26 09:20:26,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:20:26,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 09:20:26,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 5: [2022-11-26 09:20:26,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:20:26,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 09:20:26,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 1: [2022-11-26 09:20:26,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:20:26,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 22: [2022-11-26 09:20:26,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
22: [2022-11-26 09:20:26,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 09:20:26,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 1: [2022-11-26 09:20:26,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 26: [2022-11-26 09:20:26,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 1: [2022-11-26 09:20:26,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 26: [2022-11-26 09:20:26,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 2: [2022-11-26 09:20:26,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:20:26,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:20:26,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 09:20:26,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 31: [2022-11-26 09:20:26,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 09:20:26,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
6: [2022-11-26 09:20:26,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:20:26,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 09:20:26,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 11: [2022-11-26 09:20:26,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:20:26,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 09:20:26,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 23: [2022-11-26 09:20:26,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:20:26,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:20:26,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 23: [2022-11-26 09:20:26,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 0: [2022-11-26 09:20:26,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 23: [2022-11-26 09:20:26,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
9: [2022-11-26 09:20:26,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:20:26,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 09:20:26,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 10: [2022-11-26 09:20:26,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:20:26,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 09:20:26,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 18: [2022-11-26 09:20:26,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:20:26,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 29: [2022-11-26 09:20:26,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:20:26,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 29: [2022-11-26 09:20:26,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 09:20:26,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
20: [2022-11-26 09:20:26,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:20:26,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 09:20:26,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 17: [2022-11-26 09:20:26,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 09:20:26,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 09:20:26,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 19: [2022-11-26 09:20:26,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:20:26,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 09:20:26,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 27: [2022-11-26 09:20:26,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 09:20:26,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 09:20:26,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
4: [2022-11-26 09:20:26,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:20:26,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 09:20:26,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 7: [2022-11-26 09:20:26,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:20:26,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 09:20:26,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 28: [2022-11-26 09:20:26,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 09:20:26,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 09:20:26,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 15: [2022-11-26 09:20:26,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:20:26,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 25: [2022-11-26 09:20:26,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 
15: [2022-11-26 09:20:26,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 25: [2022-11-26 09:20:26,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 09:20:26,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 21: [2022-11-26 09:20:26,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:20:26,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 09:20:26,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 14: [2022-11-26 09:20:26,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:20:26,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 09:20:26,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 12: [2022-11-26 09:20:26,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:20:26,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 09:20:26,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
30: [2022-11-26 09:20:26,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:20:26,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 09:20:26,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 23: [2022-11-26 09:20:26,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:20:26,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 09:20:26,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 26: [2022-11-26 09:20:26,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:20:26,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 09:20:26,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 1: [2022-11-26 09:20:26,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:20:26,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 09:20:26,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
8: [2022-11-26 09:20:26,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:20:26,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 09:20:26,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 24: [2022-11-26 09:20:26,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 09:20:26,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 09:20:26,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 5: [2022-11-26 09:20:26,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:20:26,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 09:20:26,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 11: [2022-11-26 09:20:26,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:20:26,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 22: [2022-11-26 09:20:26,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
11: [2022-11-26 09:20:26,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 22: [2022-11-26 09:20:26,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 09:20:26,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 6: [2022-11-26 09:20:26,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:20:26,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 09:20:26,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 31: [2022-11-26 09:20:26,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:20:26,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 09:20:26,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 0: [2022-11-26 09:20:26,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:20:26,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 09:20:26,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
13: [2022-11-26 09:20:26,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:20:26,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:20:26,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 09:20:26,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 13: [2022-11-26 09:20:26,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 09:20:26,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 20: [2022-11-26 09:20:26,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:20:26,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 09:20:26,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 29: [2022-11-26 09:20:26,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 09:20:26,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 09:20:26,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
18: [2022-11-26 09:20:26,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:20:26,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:20:26,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 19: [2022-11-26 09:20:26,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 18: [2022-11-26 09:20:26,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 19: [2022-11-26 09:20:26,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 2: [2022-11-26 09:20:26,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:20:26,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 09:20:26,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 10: [2022-11-26 09:20:26,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:20:26,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 09:20:26,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
16: [2022-11-26 09:20:26,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:20:26,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 09:20:26,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 17: [2022-11-26 09:20:26,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 09:20:26,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 09:20:26,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 4: [2022-11-26 09:20:26,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:20:26,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 09:20:26,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 27: [2022-11-26 09:20:26,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 09:20:26,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 09:20:26,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
7: [2022-11-26 09:20:26,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:20:26,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 09:20:26,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 28: [2022-11-26 09:20:26,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 09:20:26,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 09:20:26,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 25: [2022-11-26 09:20:26,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 09:20:26,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 09:20:26,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 12: [2022-11-26 09:20:26,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:20:26,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 09:20:26,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
21: [2022-11-26 09:20:26,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:20:26,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 09:20:26,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 30: [2022-11-26 09:20:26,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:20:26,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 09:20:26,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 8: [2022-11-26 09:20:26,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:20:26,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:20:26,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 8: [2022-11-26 09:20:26,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 15: [2022-11-26 09:20:26,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 8: [2022-11-26 09:20:26,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
3: [2022-11-26 09:20:26,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:20:26,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 09:20:26,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 13: [2022-11-26 09:20:26,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:20:26,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 09:20:26,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 16: [2022-11-26 09:20:26,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:20:26,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 22: [2022-11-26 09:20:26,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:20:26,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 22: [2022-11-26 09:20:26,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 09:20:26,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
14: [2022-11-26 09:20:26,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:20:26,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 09:20:26,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 25: [2022-11-26 09:20:26,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 09:20:26,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 0: [2022-11-26 09:20:26,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 25: [2022-11-26 09:20:26,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 0: [2022-11-26 09:20:26,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 09:20:26,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 26: [2022-11-26 09:20:26,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:20:26,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 09:20:26,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
28: [2022-11-26 09:20:26,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 09:20:26,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 23: [2022-11-26 09:20:26,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 28: [2022-11-26 09:20:26,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 7: [2022-11-26 09:20:26,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:20:26,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 7: [2022-11-26 09:20:26,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 23: [2022-11-26 09:20:26,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 7: [2022-11-26 09:20:26,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 10: [2022-11-26 09:20:26,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:20:26,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-26 09:20:26,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 10: [2022-11-26 09:20:26,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 12: [2022-11-26 09:20:26,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 10: [2022-11-26 09:20:26,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 9: [2022-11-26 09:20:26,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:20:26,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 09:20:26,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 3: [2022-11-26 09:20:26,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:20:26,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 09:20:26,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 31: [2022-11-26 09:20:26,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:20:26,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
20: [2022-11-26 09:20:26,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:20:26,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 09:20:26,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 15: [2022-11-26 09:20:26,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 09:20:26,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 20: [2022-11-26 09:20:26,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 09:20:26,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 8: [2022-11-26 09:20:26,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:20:26,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 09:20:26,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 29: [2022-11-26 09:20:26,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 22: [2022-11-26 09:20:26,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
29: [2022-11-26 09:20:26,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 09:20:26,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 22: [2022-11-26 09:20:26,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 09:20:26,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 17: [2022-11-26 09:20:26,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 09:20:26,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 09:20:26,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 21: [2022-11-26 09:20:26,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:20:26,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 09:20:26,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 18: [2022-11-26 09:20:26,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
18: [2022-11-26 09:20:26,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 5: [2022-11-26 09:20:26,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:20:26,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 18: [2022-11-26 09:20:26,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 5: [2022-11-26 09:20:26,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 26: [2022-11-26 09:20:26,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:20:26,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 09:20:26,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 11: [2022-11-26 09:20:26,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:20:26,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
11: [2022-11-26 09:20:26,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 30: [2022-11-26 09:20:26,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 11: [2022-11-26 09:20:26,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 30: [2022-11-26 09:20:26,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 19: [2022-11-26 09:20:26,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:20:26,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 09:20:26,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 4: [2022-11-26 09:20:26,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:20:26,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 09:20:26,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 31: [2022-11-26 09:20:26,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 
31: [2022-11-26 09:20:26,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 6: [2022-11-26 09:20:26,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:20:26,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 6: [2022-11-26 09:20:26,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 09:20:26,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 1: [2022-11-26 09:20:26,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:20:26,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 09:20:26,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 14: [2022-11-26 09:20:26,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:20:26,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 09:20:26,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 24: [2022-11-26 09:20:26,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
24: [2022-11-26 09:20:26,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 13: [2022-11-26 09:20:26,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:20:26,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 24: [2022-11-26 09:20:26,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 13: [2022-11-26 09:20:26,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 3: [2022-11-26 09:20:26,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 13: [2022-11-26 09:20:26,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 3: [2022-11-26 09:20:26,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 16: [2022-11-26 09:20:26,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:20:26,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 09:20:26,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 1: [2022-11-26 09:20:26,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 09:20:26,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 09:20:26,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 24: [2022-11-26 09:20:26,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 09:20:26,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 09:20:26,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 6: [2022-11-26 09:20:26,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:20:26,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 09:20:26,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 27: [2022-11-26 09:20:26,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 09:20:26,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 09:20:26,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 2: [2022-11-26 09:20:26,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-26 09:20:26,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 09:20:26,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 2: [2022-11-26 09:20:26,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:20:26,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 09:20:26,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 24: [2022-11-26 09:20:26,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 09:20:26,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 09:20:26,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 1: [2022-11-26 09:20:26,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:20:26,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 09:20:26,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 11: [2022-11-26 09:20:26,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-26 09:20:26,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 09:20:26,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:20:26,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 11: [2022-11-26 09:20:26,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 09:20:26,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 6: [2022-11-26 09:20:26,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:20:26,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step68000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 09:20:26,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
0: successfully saved checkpoint at iteration 68000 to checkpoints_1b1long
31: time (ms) | save-checkpoint: 2606.52
31: iteration 68010/ 173500 | consumed samples: 17410560 | consumed tokens: 35656826880 | elapsed time per iteration (s): 1.14 | learning rate: 1.416E-04 | global batch size: 256 | lm loss: 2.038151E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 224.579 | TFLOPs: 13.59 |
31: iteration 68020/ 173500 | consumed samples: 17413120 | consumed tokens: 35662069760 | elapsed time per iteration (s): 0.82 | learning rate: 1.416E-04 | global batch size: 256 | lm loss: 2.015602E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.633 | TFLOPs: 18.79 |
31: iteration 68030/ 173500 | consumed samples: 17415680 | consumed tokens: 35667312640 | elapsed time per iteration (s): 0.85 | learning rate: 1.416E-04 | global batch size: 256 | lm loss: 2.017852E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.795 | TFLOPs: 18.26 |
31: iteration 68040/ 173500 | consumed samples: 17418240 | consumed tokens: 35672555520 | elapsed time per iteration (s): 0.79 | learning rate: 1.415E-04 | global batch size: 256 | lm loss: 2.028197E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.524 | TFLOPs: 19.51 |
31: iteration 68050/ 173500 | consumed samples: 17420800 | consumed tokens: 35677798400 | elapsed time per iteration (s): 0.82 | learning rate: 1.415E-04 | global batch size: 256 | lm loss: 2.020366E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.339 | TFLOPs: 18.84 |
31: iteration 68060/ 173500 | consumed samples: 17423360 | consumed tokens: 35683041280 | elapsed time per iteration (s): 0.91 | learning rate: 1.415E-04 | global batch size: 256 | lm loss: 2.041411E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 281.140 | TFLOPs: 17.01 |
31: iteration 68070/ 173500 | consumed samples: 17425920 | consumed tokens: 35688284160 | elapsed time per iteration (s): 0.85 | learning rate: 1.415E-04 | global batch size: 256 | lm loss: 2.018263E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.865 | TFLOPs: 18.20 |
31: iteration 68080/ 173500 | consumed samples: 17428480 | consumed tokens: 35693527040 | elapsed time per iteration (s): 0.88 | learning rate: 1.415E-04 | global batch size: 256 | lm loss: 2.027149E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.067 | TFLOPs: 17.67 |
31: iteration 68090/ 173500 | consumed samples: 17431040 | consumed tokens: 35698769920 | elapsed time per iteration (s): 0.84 | learning rate: 1.415E-04 | global batch size: 256 | lm loss: 2.052881E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.488 | TFLOPs: 18.36 |
31: iteration 68100/ 173500 | consumed samples: 17433600 | consumed tokens: 35704012800 | elapsed time per iteration (s): 0.86 | learning rate: 1.415E-04 | global batch size: 256 | lm loss: 2.056355E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.175 | TFLOPs: 17.98 |
31: iteration 68110/ 173500 | consumed samples: 17436160 | consumed tokens: 35709255680 | elapsed time per iteration (s): 0.79 | learning rate: 1.414E-04 | global batch size: 256 | lm loss: 2.041584E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.628 | TFLOPs: 19.58 |
31: iteration 68120/ 173500 | consumed samples: 17438720 | consumed tokens: 35714498560 | elapsed time per iteration (s): 0.78 | learning rate: 1.414E-04 | global batch size: 256 | lm loss: 2.011281E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.873 | TFLOPs: 19.96 |
31: iteration 68130/ 173500 | consumed samples: 17441280 | consumed tokens: 35719741440 | elapsed time per iteration (s): 0.82 | learning rate: 1.414E-04 | global batch size: 256 | lm loss: 2.043271E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.288 | TFLOPs: 18.95 |
31: iteration 68140/ 173500 | consumed samples: 17443840 | consumed tokens: 35724984320 | elapsed time per iteration (s): 0.81 | learning rate: 1.414E-04 | global batch size: 256 | lm loss: 2.029742E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.982 | TFLOPs: 19.06 |
31: iteration 68150/ 173500 | consumed samples: 17446400 | consumed tokens: 35730227200 | elapsed time per iteration (s): 0.82 | learning rate: 1.414E-04 | global batch size: 256 | lm loss: 2.029658E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.766 | TFLOPs: 18.80 |
31: iteration 68160/ 173500 | consumed samples: 17448960 | consumed tokens: 35735470080 | elapsed time per iteration (s): 0.85 | learning rate: 1.414E-04 | global batch size: 256 | lm loss: 2.035217E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.184 | TFLOPs: 18.28 |
31: iteration 68170/ 173500 | consumed samples: 17451520 | consumed tokens: 35740712960 | elapsed time per iteration (s): 0.83 | learning rate: 1.413E-04 | global batch size: 256 | lm loss: 2.050608E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.415 | TFLOPs: 18.72 |
31: iteration 68180/ 173500 | consumed samples: 17454080 | consumed tokens: 35745955840 | elapsed time per iteration (s): 0.87 | learning rate: 1.413E-04 | global batch size: 256 | lm loss: 2.032758E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.260 | TFLOPs: 17.74 |
31: iteration 68190/ 173500 | consumed samples: 17456640 | consumed tokens: 35751198720 | elapsed time per iteration (s): 0.81 | learning rate: 1.413E-04 | global batch size: 256 | lm loss: 2.047963E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.446 | TFLOPs: 19.20 |
31: iteration 68200/ 173500 | consumed samples: 17459200 | consumed tokens: 35756441600 | elapsed time per iteration (s): 0.79 | learning rate: 1.413E-04 | global batch size: 256 | lm loss: 2.043883E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.059 | TFLOPs: 19.73 |
31: iteration 68210/ 173500 | consumed samples: 17461760 | consumed tokens: 35761684480 | elapsed time per iteration (s): 0.85 | learning rate: 1.413E-04 | global batch size: 256 | lm loss: 2.014456E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.742 | TFLOPs: 18.25 |
31: iteration 68220/ 173500 | consumed samples: 17464320 | consumed tokens: 35766927360 | elapsed time per iteration (s): 0.87 | learning rate: 1.413E-04 | global batch size: 256 | lm loss: 2.030469E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.477 | TFLOPs: 17.75 |
31: iteration 68230/ 173500 | consumed samples: 17466880 | consumed tokens: 35772170240 | elapsed time per iteration (s): 0.83 | learning rate: 1.412E-04 | global batch size: 256 | lm loss: 2.059765E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.311 | TFLOPs: 18.71 |
31: iteration 68240/ 173500 | consumed samples: 17469440 | consumed tokens: 35777413120 | elapsed time per iteration (s): 0.80 | learning rate: 1.412E-04 | global batch size: 256 | lm loss: 2.047169E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.522 | TFLOPs: 19.27 |
31: iteration 68250/ 173500 | consumed samples: 17472000 | consumed tokens: 35782656000 | elapsed time per iteration (s): 0.83 | learning rate: 1.412E-04 | global batch size: 256 | lm loss: 2.042040E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.430 | TFLOPs: 18.60 |
31: iteration 68260/ 173500 | consumed samples: 17474560 | consumed tokens: 35787898880 | elapsed time per iteration (s): 0.84 | learning rate: 1.412E-04 | global batch size: 256 | lm loss: 2.027819E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.508 | TFLOPs: 18.48 |
31: iteration 68270/ 173500 | consumed samples: 17477120 | consumed tokens: 35793141760 | elapsed time per iteration (s): 0.81 | learning rate: 1.412E-04 | global batch size: 256 | lm loss: 2.019741E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.829 | TFLOPs: 19.11 |
31: iteration 68280/ 173500 | consumed samples: 17479680 | consumed tokens: 35798384640 | elapsed time per iteration (s): 0.81 | learning rate: 1.412E-04 | global batch size: 256 | lm loss: 2.018160E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.537 | TFLOPs: 19.15 |
31: iteration 68290/ 173500 | consumed samples: 17482240 | consumed tokens: 35803627520 | elapsed time per iteration (s): 0.85 | learning rate: 1.412E-04 | global batch size: 256 | lm loss: 2.049352E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.882 | TFLOPs: 18.20 |
31: iteration 68300/ 173500 | consumed samples: 17484800 | consumed tokens: 35808870400 | elapsed time per iteration (s): 0.80 | learning rate: 1.411E-04 | global batch size: 256 | lm loss: 2.018925E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.487 | TFLOPs: 19.27 |
31: iteration 68310/ 173500 | consumed samples: 17487360 | consumed tokens: 35814113280 | elapsed time per iteration (s): 0.84 | learning rate: 1.411E-04 | global batch size: 256 | lm loss: 2.014853E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.302 | TFLOPs: 18.53 |
31: iteration 68320/ 173500 | consumed samples: 17489920 | consumed tokens: 35819356160 | elapsed time per iteration (s): 0.87 | learning rate: 1.411E-04 | global batch size: 256 | lm loss: 2.052511E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.256 | TFLOPs: 17.74 |
31: iteration 68330/ 173500 | consumed samples: 17492480 | consumed tokens: 35824599040 | elapsed time per iteration (s): 0.84 | learning rate: 1.411E-04 | global batch size: 256 | lm loss: 2.019293E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.183 | TFLOPs: 18.34 |
31: iteration 68340/ 173500 | consumed samples: 17495040 | consumed tokens: 35829841920 | elapsed time per iteration (s): 0.87 | learning rate: 1.411E-04 | global batch size: 256 | lm loss: 2.022724E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.569 | TFLOPs: 17.88 |
31: iteration 68350/ 173500 | consumed samples: 17497600 | consumed tokens: 35835084800 | elapsed time per iteration (s): 0.82 | learning rate: 1.411E-04 | global batch size: 256 | lm loss: 2.036705E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.078 | TFLOPs: 19.00 |
31: iteration 68360/ 173500 | consumed samples: 17500160 | consumed tokens: 35840327680 | elapsed time per iteration (s): 0.85 | learning rate: 1.410E-04 | global batch size: 256 | lm loss: 2.014294E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.876 | TFLOPs: 18.32 |
31: iteration 68370/ 173500 | consumed samples: 17502720 | consumed tokens: 35845570560 | elapsed time per iteration (s): 0.83 | learning rate: 1.410E-04 | global batch size: 256 | lm loss: 2.010216E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.681 | TFLOPs: 18.67 |
31: iteration 68380/ 173500 | consumed samples: 17505280 | consumed tokens: 35850813440 | elapsed time per iteration (s): 0.87 | learning rate: 1.410E-04 | global batch size: 256 | lm loss: 2.059482E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.626 | TFLOPs: 17.82 |
31: iteration 68390/ 173500 | consumed samples: 17507840 | consumed tokens: 35856056320 | elapsed time per iteration (s): 0.77 | learning rate: 1.410E-04 | global batch size: 256 | lm loss: 2.022584E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.848 | TFLOPs: 20.14 |
31: iteration 68400/ 173500 | consumed samples: 17510400 | consumed tokens: 35861299200 | elapsed time per iteration (s): 0.79 | learning rate: 1.410E-04 | global batch size: 256 | lm loss: 2.096699E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.654 | TFLOPs: 19.70 |
31: iteration 68410/ 173500 | consumed samples: 17512960 | consumed tokens: 35866542080 | elapsed time per iteration (s): 0.76 | learning rate: 1.410E-04 | global batch size: 256 | lm loss: 2.025058E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.662 | TFLOPs: 20.49 |
31: iteration 68420/ 173500 | consumed samples: 17515520 | consumed tokens: 35871784960 | elapsed time per iteration (s): 0.79 | learning rate: 1.410E-04 | global batch size: 256 | lm loss: 2.014779E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.557 | TFLOPs: 19.57 |
31: iteration 68430/ 173500 | consumed samples: 17518080 | consumed tokens: 35877027840 | elapsed time per iteration (s): 0.77 | learning rate: 1.409E-04 | global batch size: 256 | lm loss: 2.064675E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.828 | TFLOPs: 20.01 |
31: iteration 68440/ 173500 | consumed samples: 17520640 | consumed tokens: 35882270720 | elapsed time per iteration (s): 0.88 | learning rate: 1.409E-04 | global batch size: 256 | lm loss: 2.038271E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.695 | TFLOPs: 17.53 |
31: iteration 68450/ 173500 | consumed samples: 17523200 | consumed tokens: 35887513600 | elapsed time per iteration (s): 0.81 | learning rate: 1.409E-04 | global batch size: 256 | lm loss: 2.015380E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.971 | TFLOPs: 19.05 |
31: iteration 68460/ 173500 | consumed samples: 17525760 | consumed tokens: 35892756480 | elapsed time per iteration (s): 0.79 | learning rate: 1.409E-04 | global batch size: 256 | lm loss: 1.980322E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.131 | TFLOPs: 19.55 |
31: iteration 68470/ 173500 | consumed samples: 17528320 | consumed tokens: 35897999360 | elapsed time per iteration (s): 0.76 | learning rate: 1.409E-04 | global batch size: 256 | lm loss: 2.021939E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.861 | TFLOPs: 20.38 |
31: iteration 68480/ 173500 | consumed samples: 17530880 | consumed tokens: 35903242240 | elapsed time per iteration (s): 0.79 | learning rate: 1.409E-04 | global batch size: 256 | lm loss: 2.041549E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.003 | TFLOPs: 19.72 |
31: iteration 68490/ 173500 | consumed samples: 17533440 | consumed tokens: 35908485120 | elapsed time per iteration (s): 0.94 | learning rate: 1.408E-04 | global batch size: 256 | lm loss: 2.040061E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 271.095 | TFLOPs: 16.40 |
31: iteration 68500/ 173500 | consumed samples: 17536000 | consumed tokens: 35913728000 | elapsed time per iteration (s): 0.77 | learning rate: 1.408E-04 | global batch size: 256 | lm loss: 2.031289E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.270 | TFLOPs: 20.04 |
31: iteration 68510/ 173500 | consumed samples: 17538560 | consumed tokens: 35918970880 | elapsed time per iteration (s): 0.76 | learning rate: 1.408E-04 | global batch size: 256 | lm loss: 2.017359E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0
| samples per second: 335.313 | TFLOPs: 20.29 | 31: iteration 68520/ 173500 | consumed samples: 17541120 | consumed tokens: 35924213760 | elapsed time per iteration (s): 0.76 | learning rate: 1.408E-04 | global batch size: 256 | lm loss: 2.041584E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.733 | TFLOPs: 20.49 | 31: iteration 68530/ 173500 | consumed samples: 17543680 | consumed tokens: 35929456640 | elapsed time per iteration (s): 0.76 | learning rate: 1.408E-04 | global batch size: 256 | lm loss: 2.049204E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.673 | TFLOPs: 20.31 | 31: iteration 68540/ 173500 | consumed samples: 17546240 | consumed tokens: 35934699520 | elapsed time per iteration (s): 0.88 | learning rate: 1.408E-04 | global batch size: 256 | lm loss: 2.015272E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.828 | TFLOPs: 17.65 | 31: iteration 68550/ 173500 | consumed samples: 17548800 | consumed tokens: 35939942400 | elapsed time per iteration (s): 0.76 | learning rate: 1.408E-04 | global batch size: 256 | lm loss: 2.044379E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.503 | TFLOPs: 20.48 | 31: iteration 68560/ 173500 | consumed samples: 17551360 | consumed tokens: 35945185280 | elapsed time per iteration (s): 0.78 | learning rate: 1.407E-04 | global batch size: 256 | lm loss: 2.019036E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.120 | TFLOPs: 19.91 | 31: iteration 68570/ 173500 | consumed samples: 17553920 | consumed tokens: 35950428160 | elapsed time per iteration (s): 0.92 | learning rate: 1.407E-04 | global batch size: 256 | lm loss: 
2.029870E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 278.615 | TFLOPs: 16.86 | 31: iteration 68580/ 173500 | consumed samples: 17556480 | consumed tokens: 35955671040 | elapsed time per iteration (s): 0.80 | learning rate: 1.407E-04 | global batch size: 256 | lm loss: 2.052930E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.196 | TFLOPs: 19.31 | 31: iteration 68590/ 173500 | consumed samples: 17559040 | consumed tokens: 35960913920 | elapsed time per iteration (s): 0.81 | learning rate: 1.407E-04 | global batch size: 256 | lm loss: 2.048098E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.476 | TFLOPs: 19.09 | 31: iteration 68600/ 173500 | consumed samples: 17561600 | consumed tokens: 35966156800 | elapsed time per iteration (s): 0.85 | learning rate: 1.407E-04 | global batch size: 256 | lm loss: 2.021881E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.712 | TFLOPs: 18.19 | 31: iteration 68610/ 173500 | consumed samples: 17564160 | consumed tokens: 35971399680 | elapsed time per iteration (s): 0.82 | learning rate: 1.407E-04 | global batch size: 256 | lm loss: 2.072358E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.013 | TFLOPs: 19.00 | 31: iteration 68620/ 173500 | consumed samples: 17566720 | consumed tokens: 35976642560 | elapsed time per iteration (s): 0.80 | learning rate: 1.406E-04 | global batch size: 256 | lm loss: 2.038555E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.672 | TFLOPs: 19.46 | 31: iteration 68630/ 173500 | consumed samples: 17569280 | consumed tokens: 
35981885440 | elapsed time per iteration (s): 0.80 | learning rate: 1.406E-04 | global batch size: 256 | lm loss: 2.046556E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.991 | TFLOPs: 19.30 | 31: iteration 68640/ 173500 | consumed samples: 17571840 | consumed tokens: 35987128320 | elapsed time per iteration (s): 0.79 | learning rate: 1.406E-04 | global batch size: 256 | lm loss: 2.064316E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.395 | TFLOPs: 19.69 | 31: iteration 68650/ 173500 | consumed samples: 17574400 | consumed tokens: 35992371200 | elapsed time per iteration (s): 0.80 | learning rate: 1.406E-04 | global batch size: 256 | lm loss: 2.017031E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.092 | TFLOPs: 19.30 | 31: iteration 68660/ 173500 | consumed samples: 17576960 | consumed tokens: 35997614080 | elapsed time per iteration (s): 0.81 | learning rate: 1.406E-04 | global batch size: 256 | lm loss: 2.033590E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.690 | TFLOPs: 19.04 | 31: iteration 68670/ 173500 | consumed samples: 17579520 | consumed tokens: 36002856960 | elapsed time per iteration (s): 0.83 | learning rate: 1.406E-04 | global batch size: 256 | lm loss: 2.051607E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.373 | TFLOPs: 18.66 | 31: iteration 68680/ 173500 | consumed samples: 17582080 | consumed tokens: 36008099840 | elapsed time per iteration (s): 0.85 | learning rate: 1.406E-04 | global batch size: 256 | lm loss: 2.031987E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 301.236 | TFLOPs: 18.22 | 31: iteration 68690/ 173500 | consumed samples: 17584640 | consumed tokens: 36013342720 | elapsed time per iteration (s): 0.83 | learning rate: 1.405E-04 | global batch size: 256 | lm loss: 2.039471E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.914 | TFLOPs: 18.69 | 31: iteration 68700/ 173500 | consumed samples: 17587200 | consumed tokens: 36018585600 | elapsed time per iteration (s): 0.91 | learning rate: 1.405E-04 | global batch size: 256 | lm loss: 2.014468E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.527 | TFLOPs: 17.09 | 31: iteration 68710/ 173500 | consumed samples: 17589760 | consumed tokens: 36023828480 | elapsed time per iteration (s): 0.83 | learning rate: 1.405E-04 | global batch size: 256 | lm loss: 2.034221E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.615 | TFLOPs: 18.67 | 31: iteration 68720/ 173500 | consumed samples: 17592320 | consumed tokens: 36029071360 | elapsed time per iteration (s): 0.87 | learning rate: 1.405E-04 | global batch size: 256 | lm loss: 2.015894E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.606 | TFLOPs: 17.76 | 31: iteration 68730/ 173500 | consumed samples: 17594880 | consumed tokens: 36034314240 | elapsed time per iteration (s): 0.80 | learning rate: 1.405E-04 | global batch size: 256 | lm loss: 2.015647E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.743 | TFLOPs: 19.28 | 31: iteration 68740/ 173500 | consumed samples: 17597440 | consumed tokens: 36039557120 | elapsed time per iteration (s): 0.82 | learning rate: 1.405E-04 | global batch size: 256 | lm loss: 2.052398E+00 | grad 
norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.841 | TFLOPs: 18.93 | 31: iteration 68750/ 173500 | consumed samples: 17600000 | consumed tokens: 36044800000 | elapsed time per iteration (s): 0.85 | learning rate: 1.404E-04 | global batch size: 256 | lm loss: 2.051409E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.178 | TFLOPs: 18.16 | 31: iteration 68760/ 173500 | consumed samples: 17602560 | consumed tokens: 36050042880 | elapsed time per iteration (s): 0.78 | learning rate: 1.404E-04 | global batch size: 256 | lm loss: 2.011308E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.423 | TFLOPs: 19.75 | 31: iteration 68770/ 173500 | consumed samples: 17605120 | consumed tokens: 36055285760 | elapsed time per iteration (s): 0.81 | learning rate: 1.404E-04 | global batch size: 256 | lm loss: 1.991907E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.821 | TFLOPs: 19.11 | 31: iteration 68780/ 173500 | consumed samples: 17607680 | consumed tokens: 36060528640 | elapsed time per iteration (s): 0.80 | learning rate: 1.404E-04 | global batch size: 256 | lm loss: 2.043609E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.437 | TFLOPs: 19.39 | 31: iteration 68790/ 173500 | consumed samples: 17610240 | consumed tokens: 36065771520 | elapsed time per iteration (s): 0.86 | learning rate: 1.404E-04 | global batch size: 256 | lm loss: 2.036872E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.560 | TFLOPs: 18.00 | 31: iteration 68800/ 173500 | consumed samples: 17612800 | consumed tokens: 36071014400 | elapsed time 
per iteration (s): 0.81 | learning rate: 1.404E-04 | global batch size: 256 | lm loss: 2.009567E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.374 | TFLOPs: 19.08 | 31: iteration 68810/ 173500 | consumed samples: 17615360 | consumed tokens: 36076257280 | elapsed time per iteration (s): 0.80 | learning rate: 1.404E-04 | global batch size: 256 | lm loss: 2.038178E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.991 | TFLOPs: 19.42 | 31: iteration 68820/ 173500 | consumed samples: 17617920 | consumed tokens: 36081500160 | elapsed time per iteration (s): 0.84 | learning rate: 1.403E-04 | global batch size: 256 | lm loss: 2.013550E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.993 | TFLOPs: 18.45 | 31: iteration 68830/ 173500 | consumed samples: 17620480 | consumed tokens: 36086743040 | elapsed time per iteration (s): 0.76 | learning rate: 1.403E-04 | global batch size: 256 | lm loss: 2.040595E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.198 | TFLOPs: 20.28 | 31: iteration 68840/ 173500 | consumed samples: 17623040 | consumed tokens: 36091985920 | elapsed time per iteration (s): 0.75 | learning rate: 1.403E-04 | global batch size: 256 | lm loss: 2.001934E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.746 | TFLOPs: 20.67 | 31: iteration 68850/ 173500 | consumed samples: 17625600 | consumed tokens: 36097228800 | elapsed time per iteration (s): 0.91 | learning rate: 1.403E-04 | global batch size: 256 | lm loss: 2.049525E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 280.403 | TFLOPs: 
16.96 | 31: iteration 68860/ 173500 | consumed samples: 17628160 | consumed tokens: 36102471680 | elapsed time per iteration (s): 0.79 | learning rate: 1.403E-04 | global batch size: 256 | lm loss: 2.037773E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.910 | TFLOPs: 19.54 | 31: iteration 68870/ 173500 | consumed samples: 17630720 | consumed tokens: 36107714560 | elapsed time per iteration (s): 0.76 | learning rate: 1.403E-04 | global batch size: 256 | lm loss: 2.017628E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.613 | TFLOPs: 20.49 | 31: iteration 68880/ 173500 | consumed samples: 17633280 | consumed tokens: 36112957440 | elapsed time per iteration (s): 0.80 | learning rate: 1.402E-04 | global batch size: 256 | lm loss: 2.054301E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.394 | TFLOPs: 19.38 | 31: iteration 68890/ 173500 | consumed samples: 17635840 | consumed tokens: 36118200320 | elapsed time per iteration (s): 0.79 | learning rate: 1.402E-04 | global batch size: 256 | lm loss: 2.013803E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.684 | TFLOPs: 19.58 | 31: iteration 68900/ 173500 | consumed samples: 17638400 | consumed tokens: 36123443200 | elapsed time per iteration (s): 0.82 | learning rate: 1.402E-04 | global batch size: 256 | lm loss: 2.026145E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.451 | TFLOPs: 18.90 | 31: iteration 68910/ 173500 | consumed samples: 17640960 | consumed tokens: 36128686080 | elapsed time per iteration (s): 0.82 | learning rate: 1.402E-04 | global batch size: 256 | lm loss: 2.023827E+00 | grad norm: 0.133 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.497 | TFLOPs: 18.84 | 31: iteration 68920/ 173500 | consumed samples: 17643520 | consumed tokens: 36133928960 | elapsed time per iteration (s): 0.84 | learning rate: 1.402E-04 | global batch size: 256 | lm loss: 2.018506E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.573 | TFLOPs: 18.49 | 31: iteration 68930/ 173500 | consumed samples: 17646080 | consumed tokens: 36139171840 | elapsed time per iteration (s): 0.91 | learning rate: 1.402E-04 | global batch size: 256 | lm loss: 2.051689E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 281.845 | TFLOPs: 17.05 | 31: iteration 68940/ 173500 | consumed samples: 17648640 | consumed tokens: 36144414720 | elapsed time per iteration (s): 0.79 | learning rate: 1.402E-04 | global batch size: 256 | lm loss: 2.015796E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.053 | TFLOPs: 19.73 | 31: iteration 68950/ 173500 | consumed samples: 17651200 | consumed tokens: 36149657600 | elapsed time per iteration (s): 0.75 | learning rate: 1.401E-04 | global batch size: 256 | lm loss: 2.027092E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.592 | TFLOPs: 20.60 | 31: iteration 68960/ 173500 | consumed samples: 17653760 | consumed tokens: 36154900480 | elapsed time per iteration (s): 0.78 | learning rate: 1.401E-04 | global batch size: 256 | lm loss: 2.040032E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.706 | TFLOPs: 19.95 | 31: iteration 68970/ 173500 | consumed samples: 17656320 | consumed tokens: 36160143360 | elapsed time per iteration (s): 0.83 | 
learning rate: 1.401E-04 | global batch size: 256 | lm loss: 2.017586E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.583 | TFLOPs: 18.73 | 31: iteration 68980/ 173500 | consumed samples: 17658880 | consumed tokens: 36165386240 | elapsed time per iteration (s): 0.74 | learning rate: 1.401E-04 | global batch size: 256 | lm loss: 2.030391E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.807 | TFLOPs: 20.80 | 31: iteration 68990/ 173500 | consumed samples: 17661440 | consumed tokens: 36170629120 | elapsed time per iteration (s): 0.78 | learning rate: 1.401E-04 | global batch size: 256 | lm loss: 2.021126E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.167 | TFLOPs: 19.73 | 31: iteration 69000/ 173500 | consumed samples: 17664000 | consumed tokens: 36175872000 | elapsed time per iteration (s): 1.62 | learning rate: 1.401E-04 | global batch size: 256 | lm loss: 1.999947E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 158.354 | TFLOPs: 9.58 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 69000 | lm loss value: 1.968348E+00 | lm loss PPL: 7.158840E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 69000 to checkpoints_1b1long 0: [2022-11-26 09:34:14,914] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step69000 is begin to save! 0: [2022-11-26 09:34:14,927] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 09:34:15,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_01-model_00-model_states.pt. 0: [2022-11-26 09:34:15,154] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_03-model_00-model_states.pt... 0: [2022-11-26 09:34:15,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_03-model_00-model_states.pt. 0: [2022-11-26 09:34:15,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_04-model_00-model_states.pt... 0: [2022-11-26 09:34:15,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_04-model_00-model_states.pt. 0: [2022-11-26 09:34:15,311] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_05-model_00-model_states.pt... 0: [2022-11-26 09:34:15,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_05-model_00-model_states.pt. 0: [2022-11-26 09:34:15,391] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_06-model_00-model_states.pt... 0: [2022-11-26 09:34:15,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_06-model_00-model_states.pt. 0: [2022-11-26 09:34:15,469] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_07-model_00-model_states.pt... 0: [2022-11-26 09:34:15,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_07-model_00-model_states.pt. 0: [2022-11-26 09:34:15,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 09:34:15,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_08-model_00-model_states.pt. 0: [2022-11-26 09:34:15,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_09-model_00-model_states.pt... 0: [2022-11-26 09:34:15,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_09-model_00-model_states.pt. 0: [2022-11-26 09:34:15,698] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_10-model_00-model_states.pt... 0: [2022-11-26 09:34:15,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_10-model_00-model_states.pt. 0: [2022-11-26 09:34:15,779] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_11-model_00-model_states.pt... 0: [2022-11-26 09:34:15,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_11-model_00-model_states.pt. 0: [2022-11-26 09:34:15,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_12-model_00-model_states.pt... 0: [2022-11-26 09:34:15,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_12-model_00-model_states.pt. 0: [2022-11-26 09:34:15,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_13-model_00-model_states.pt... 0: [2022-11-26 09:34:16,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_13-model_00-model_states.pt. 0: [2022-11-26 09:34:16,009] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 09:34:16,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_14-model_00-model_states.pt. 0: [2022-11-26 09:34:16,085] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_15-model_00-model_states.pt... 0: [2022-11-26 09:34:16,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_15-model_00-model_states.pt. 0: [2022-11-26 09:34:16,163] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_16-model_00-model_states.pt... 0: [2022-11-26 09:34:16,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_16-model_00-model_states.pt. 0: [2022-11-26 09:34:16,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_17-model_00-model_states.pt... 0: [2022-11-26 09:34:16,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_17-model_00-model_states.pt. 0: [2022-11-26 09:34:16,315] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_18-model_00-model_states.pt... 0: [2022-11-26 09:34:16,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_18-model_00-model_states.pt. 0: [2022-11-26 09:34:16,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_19-model_00-model_states.pt... 0: [2022-11-26 09:34:16,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_19-model_00-model_states.pt. 0: [2022-11-26 09:34:16,469] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 09:34:16,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_20-model_00-model_states.pt. 0: [2022-11-26 09:34:16,545] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_21-model_00-model_states.pt... 0: [2022-11-26 09:34:16,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_21-model_00-model_states.pt. 0: [2022-11-26 09:34:16,622] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_22-model_00-model_states.pt... 0: [2022-11-26 09:34:16,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_22-model_00-model_states.pt. 0: [2022-11-26 09:34:16,699] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_23-model_00-model_states.pt... 0: [2022-11-26 09:34:16,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_23-model_00-model_states.pt. 0: [2022-11-26 09:34:16,780] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_24-model_00-model_states.pt... 0: [2022-11-26 09:34:16,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_24-model_00-model_states.pt. 0: [2022-11-26 09:34:16,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_25-model_00-model_states.pt... 0: [2022-11-26 09:34:16,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_25-model_00-model_states.pt. 0: [2022-11-26 09:34:16,933] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 09:34:17,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_26-model_00-model_states.pt. 0: [2022-11-26 09:34:17,009] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_27-model_00-model_states.pt... 0: [2022-11-26 09:34:17,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_27-model_00-model_states.pt. 0: [2022-11-26 09:34:17,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_28-model_00-model_states.pt... 0: [2022-11-26 09:34:17,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_28-model_00-model_states.pt. 0: [2022-11-26 09:34:17,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/layer_30-model_00-model_states.pt... 0: [2022-11-26 09:34:17,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/layer_30-model_00-model_states.pt. 0: [2022-11-26 09:34:17,165] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step69000/mp_rank_00_model_states.pt 0: [2022-11-26 09:34:17,165] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/mp_rank_00_model_states.pt... 0: [2022-11-26 09:34:17,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/mp_rank_00_model_states.pt. 31: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
12: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
4: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
1: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
13: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
15: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
11: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
14: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 
21: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
26: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
7: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 
8: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 
13: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
15: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 
24: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 
30: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 
26: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 
0: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
10: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 
23: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 
22: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 
27: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
15: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 
0: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 
0: [2022-11-26 09:34:17,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:34:17,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:34:17,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 09:34:17,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 31: [2022-11-26 09:34:17,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:34:17,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 09:34:17,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 27: [2022-11-26 09:34:17,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:34:17,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
27: [2022-11-26 09:34:17,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 19: [2022-11-26 09:34:17,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 27: [2022-11-26 09:34:17,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 19: [2022-11-26 09:34:17,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 4: [2022-11-26 09:34:17,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:34:17,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 09:34:17,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 10: [2022-11-26 09:34:17,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:34:17,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 09:34:17,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 25: [2022-11-26 09:34:17,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
25: [2022-11-26 09:34:17,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 09:34:17,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 0: [2022-11-26 09:34:17,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:34:17,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 09:34:17,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 24: [2022-11-26 09:34:17,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:34:17,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:34:17,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:34:17,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 9: [2022-11-26 09:34:17,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 09:34:17,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
24: [2022-11-26 09:34:17,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 5: [2022-11-26 09:34:17,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 24: [2022-11-26 09:34:17,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 1: [2022-11-26 09:34:17,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:34:17,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 09:34:17,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 26: [2022-11-26 09:34:17,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:34:17,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 09:34:17,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 28: [2022-11-26 09:34:17,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:34:17,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-26 09:34:17,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 09:34:17,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 24: [2022-11-26 09:34:17,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 28: [2022-11-26 09:34:17,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 09:34:17,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 0: [2022-11-26 09:34:17,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 24: [2022-11-26 09:34:17,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 09:34:17,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 20: [2022-11-26 09:34:17,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:34:17,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 09:34:17,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 23: [2022-11-26 09:34:17,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
30: [2022-11-26 09:34:17,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:34:17,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 12: [2022-11-26 09:34:17,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:34:17,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 30: [2022-11-26 09:34:17,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 12: [2022-11-26 09:34:17,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 23: [2022-11-26 09:34:17,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 12: [2022-11-26 09:34:17,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 28: [2022-11-26 09:34:17,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:34:17,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
28: [2022-11-26 09:34:17,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 5: [2022-11-26 09:34:17,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 28: [2022-11-26 09:34:17,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 5: [2022-11-26 09:34:17,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 23: [2022-11-26 09:34:17,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:34:17,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:34:17,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 23: [2022-11-26 09:34:17,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 09:34:17,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 9: [2022-11-26 09:34:17,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 31: [2022-11-26 09:34:17,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 
31: [2022-11-26 09:34:17,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 09:34:17,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 25: [2022-11-26 09:34:17,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 09:34:17,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 09:34:17,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 14: [2022-11-26 09:34:17,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:34:17,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 09:34:17,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 6: [2022-11-26 09:34:17,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:34:17,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 13: [2022-11-26 09:34:17,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 09:34:17,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:34:17,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 13: [2022-11-26 09:34:17,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 09:34:17,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 09:34:17,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 6: [2022-11-26 09:34:17,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:34:17,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 20: [2022-11-26 09:34:17,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:34:17,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 6: [2022-11-26 09:34:17,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 20: [2022-11-26 09:34:17,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 6: [2022-11-26 09:34:17,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
30: [2022-11-26 09:34:17,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:34:17,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 3: [2022-11-26 09:34:17,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:34:17,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 3: [2022-11-26 09:34:17,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 09:34:17,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 26: [2022-11-26 09:34:17,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:34:17,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 09:34:17,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 27: [2022-11-26 09:34:17,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 09:34:17,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 09:34:17,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
4: [2022-11-26 09:34:17,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:34:17,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 11: [2022-11-26 09:34:17,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:34:17,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:34:17,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 11: [2022-11-26 09:34:17,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 09:34:17,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 09:34:17,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 11: [2022-11-26 09:34:17,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 7: [2022-11-26 09:34:17,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:34:17,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 09:34:17,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
19: [2022-11-26 09:34:17,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:34:17,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 09:34:17,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 28: [2022-11-26 09:34:17,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:34:17,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:34:17,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 09:34:17,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 5: [2022-11-26 09:34:17,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:34:17,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 09:34:17,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 1: [2022-11-26 09:34:17,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:34:17,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
1: [2022-11-26 09:34:17,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 0: [2022-11-26 09:34:17,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 09:34:17,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 1: [2022-11-26 09:34:17,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 18: [2022-11-26 09:34:17,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:34:17,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 09:34:17,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 22: [2022-11-26 09:34:17,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 09:34:17,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 09:34:17,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 22: [2022-11-26 09:34:17,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
22: [2022-11-26 09:34:17,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 09:34:17,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 9: [2022-11-26 09:34:17,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:34:17,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 09:34:17,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 10: [2022-11-26 09:34:17,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:34:17,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 09:34:17,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 25: [2022-11-26 09:34:17,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 09:34:17,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 09:34:17,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 11: [2022-11-26 09:34:17,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-26 09:34:17,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 09:34:17,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 23: [2022-11-26 09:34:17,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:34:17,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 09:34:17,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 30: [2022-11-26 09:34:17,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:34:17,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 09:34:17,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 12: [2022-11-26 09:34:17,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:34:17,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 09:34:17,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 18: [2022-11-26 09:34:17,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
18: [2022-11-26 09:34:17,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 14: [2022-11-26 09:34:17,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:34:17,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 18: [2022-11-26 09:34:17,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 14: [2022-11-26 09:34:17,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 10: [2022-11-26 09:34:17,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:34:17,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 09:34:17,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 28: [2022-11-26 09:34:17,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 09:34:17,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 28: [2022-11-26 09:34:17,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
28: [2022-11-26 09:34:17,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 09:34:17,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 5: [2022-11-26 09:34:17,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:34:17,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 09:34:17,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 16: [2022-11-26 09:34:17,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:34:17,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:34:17,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 09:34:17,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 09:34:17,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 16: [2022-11-26 09:34:17,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 0: [2022-11-26 09:34:17,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-26 09:34:17,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 09:34:17,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 15: [2022-11-26 09:34:17,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:34:17,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:34:17,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 15: [2022-11-26 09:34:17,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 3: [2022-11-26 09:34:17,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 15: [2022-11-26 09:34:17,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 31: [2022-11-26 09:34:17,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:34:17,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:34:17,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 09:34:17,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
19: [2022-11-26 09:34:17,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 09:34:17,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 27: [2022-11-26 09:34:17,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:34:17,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 27: [2022-11-26 09:34:17,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 18: [2022-11-26 09:34:17,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 27: [2022-11-26 09:34:17,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 19: [2022-11-26 09:34:17,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:34:17,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 19: [2022-11-26 09:34:17,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 09:34:17,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 14: [2022-11-26 09:34:17,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
9: [2022-11-26 09:34:17,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:34:17,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 9: [2022-11-26 09:34:17,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 09:34:17,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 14: [2022-11-26 09:34:17,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 25: [2022-11-26 09:34:17,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 09:34:17,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 23: [2022-11-26 09:34:17,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 25: [2022-11-26 09:34:17,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 23: [2022-11-26 09:34:17,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 09:34:17,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 6: [2022-11-26 09:34:17,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-26 09:34:17,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 09:34:17,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 24: [2022-11-26 09:34:17,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 09:34:17,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 09:34:17,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 09:34:17,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 09:34:17,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 24: [2022-11-26 09:34:17,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 4: [2022-11-26 09:34:17,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:34:17,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 09:34:17,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 09:34:17,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 09:34:17,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 4: [2022-11-26 09:34:17,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 29: [2022-11-26 09:34:17,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 09:34:17,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 09:34:17,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 09:34:17,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 09:34:17,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 29: [2022-11-26 09:34:17,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 3: [2022-11-26 09:34:17,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:34:17,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
3: [2022-11-26 09:34:17,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 09:34:17,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 11: [2022-11-26 09:34:17,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 09:34:17,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 10: [2022-11-26 09:34:17,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:34:17,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 09:34:17,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 15: [2022-11-26 09:34:17,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:34:17,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 09:34:17,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 13: [2022-11-26 09:34:17,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 09:34:17,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 09:34:17,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 7: [2022-11-26 09:34:17,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:34:17,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:34:17,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 09:34:17,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 09:34:17,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 7: [2022-11-26 09:34:17,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 12: [2022-11-26 09:34:17,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:34:17,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 09:34:17,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 26: [2022-11-26 09:34:17,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
26: [2022-11-26 09:34:17,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 20: [2022-11-26 09:34:17,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:34:17,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 20: [2022-11-26 09:34:17,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:34:17,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 09:34:17,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 20: [2022-11-26 09:34:17,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 09:34:17,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 22: [2022-11-26 09:34:17,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 09:34:17,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 09:34:17,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 7: [2022-11-26 09:34:17,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 09:34:17,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 09:34:17,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 6: [2022-11-26 09:34:17,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:34:17,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 16: [2022-11-26 09:34:17,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:34:17,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:34:17,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 22: [2022-11-26 09:34:17,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:34:17,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 22: [2022-11-26 09:34:17,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 09:34:17,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
26: [2022-11-26 09:34:17,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 16: [2022-11-26 09:34:17,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 6: [2022-11-26 09:34:17,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 16: [2022-11-26 09:34:17,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:34:17,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 09:34:17,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 13: [2022-11-26 09:34:17,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:34:17,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 09:34:17,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 29: [2022-11-26 09:34:17,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 27: [2022-11-26 09:34:17,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
29: [2022-11-26 09:34:17,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 09:34:17,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 27: [2022-11-26 09:34:17,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 09:34:17,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 30: [2022-11-26 09:34:17,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:34:17,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 09:34:17,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 8: [2022-11-26 09:34:17,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:34:17,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:34:17,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 09:34:17,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 09:34:17,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 09:34:17,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 09:34:17,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 8: [2022-11-26 09:34:17,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 8: [2022-11-26 09:34:17,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 15: [2022-11-26 09:34:17,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:34:17,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:34:17,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 09:34:17,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 09:34:17,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 15: [2022-11-26 09:34:17,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
14: [2022-11-26 09:34:17,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:34:17,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 09:34:17,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 0: [2022-11-26 09:34:17,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 09:34:17,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 6: [2022-11-26 09:34:17,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:34:17,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 09:34:17,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 19: [2022-11-26 09:34:17,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:34:17,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 09:34:17,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 21: [2022-11-26 09:34:17,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-26 09:34:17,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:34:17,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:34:17,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:34:17,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 09:34:17,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 09:34:17,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 09:34:17,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 09:34:17,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 21: [2022-11-26 09:34:17,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 21: [2022-11-26 09:34:17,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 21: [2022-11-26 09:34:17,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 5: [2022-11-26 09:34:17,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 09:34:17,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 09:34:17,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 1: [2022-11-26 09:34:17,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:34:17,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 09:34:17,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 17: [2022-11-26 09:34:17,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 09:34:17,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 09:34:17,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 17: [2022-11-26 09:34:17,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 09:34:17,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 09:34:17,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 17: [2022-11-26 09:34:17,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-26 09:34:17,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 09:34:17,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 31: [2022-11-26 09:34:17,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:34:17,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 09:34:17,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 2: [2022-11-26 09:34:17,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:34:17,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:34:17,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:34:17,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:34:17,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 09:34:17,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 09:34:17,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
2: [2022-11-26 09:34:17,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 2: [2022-11-26 09:34:17,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 09:34:17,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 09:34:17,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 2: [2022-11-26 09:34:17,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 1: [2022-11-26 09:34:17,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:34:17,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 09:34:17,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 1: [2022-11-26 09:34:17,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:34:17,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 09:34:17,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 25: [2022-11-26 09:34:17,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
25: [2022-11-26 09:34:17,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 09:34:17,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 11: [2022-11-26 09:34:17,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:34:17,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 09:34:17,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 4: [2022-11-26 09:34:17,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:34:17,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 09:34:17,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 12: [2022-11-26 09:34:17,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:34:17,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 09:34:17,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 9: [2022-11-26 09:34:17,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 09:34:17,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 09:34:17,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 28: [2022-11-26 09:34:17,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 09:34:17,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 09:34:17,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 10: [2022-11-26 09:34:17,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:34:17,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 09:34:17,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 8: [2022-11-26 09:34:17,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:34:17,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 09:34:17,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 13: [2022-11-26 09:34:17,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 09:34:17,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 09:34:17,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 23: [2022-11-26 09:34:17,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:34:17,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 09:34:17,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 0: [2022-11-26 09:34:17,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:34:17,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 09:34:17,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 30: [2022-11-26 09:34:17,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:34:17,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 09:34:17,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 18: [2022-11-26 09:34:17,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
18: [2022-11-26 09:34:17,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 09:34:17,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 31: [2022-11-26 09:34:17,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:34:17,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 09:34:17,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 29: [2022-11-26 09:34:17,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 09:34:17,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 09:34:17,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 17: [2022-11-26 09:34:17,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 09:34:17,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 09:34:17,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 20: [2022-11-26 09:34:17,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-26 09:34:17,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 09:34:17,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 2: [2022-11-26 09:34:17,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:34:17,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 09:34:17,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 14: [2022-11-26 09:34:17,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:34:17,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 09:34:17,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 26: [2022-11-26 09:34:17,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:34:17,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 09:34:17,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 16: [2022-11-26 09:34:17,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
16: [2022-11-26 09:34:17,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 09:34:17,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 21: [2022-11-26 09:34:17,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:34:17,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 09:34:17,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 15: [2022-11-26 09:34:17,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:34:17,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 09:34:17,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 27: [2022-11-26 09:34:17,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 09:34:17,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 09:34:17,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 7: [2022-11-26 09:34:17,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 09:34:17,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 09:34:17,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 3: [2022-11-26 09:34:17,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:34:17,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 09:34:17,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 22: [2022-11-26 09:34:17,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 09:34:17,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 09:34:17,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 6: [2022-11-26 09:34:17,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:34:17,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 09:34:17,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 25: [2022-11-26 09:34:17,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
25: [2022-11-26 09:34:17,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 09:34:17,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 5: [2022-11-26 09:34:17,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:34:17,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 09:34:17,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 9: [2022-11-26 09:34:17,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:34:17,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 09:34:17,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 28: [2022-11-26 09:34:17,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:34:17,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:34:17,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 09:34:17,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
19: [2022-11-26 09:34:17,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:34:17,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 09:34:17,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 23: [2022-11-26 09:34:17,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:34:17,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 09:34:17,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 24: [2022-11-26 09:34:17,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 09:34:17,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 09:34:17,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 28: [2022-11-26 09:34:17,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 09:34:17,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 14: [2022-11-26 09:34:17,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-26 09:34:17,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 09:34:17,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 11: [2022-11-26 09:34:17,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:34:17,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 09:34:17,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 0: [2022-11-26 09:34:17,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:34:17,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 09:34:17,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 21: [2022-11-26 09:34:17,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:34:17,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 09:34:17,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 1: [2022-11-26 09:34:17,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
13: [2022-11-26 09:34:17,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:34:17,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 09:34:17,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 1: [2022-11-26 09:34:17,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 09:34:17,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 18: [2022-11-26 09:34:17,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:34:17,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 09:34:17,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 30: [2022-11-26 09:34:17,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:34:17,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 09:34:17,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 8: [2022-11-26 09:34:17,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 09:34:17,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 09:34:17,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 31: [2022-11-26 09:34:17,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:34:17,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 09:34:17,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 2: [2022-11-26 09:34:17,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:34:17,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 09:34:17,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 29: [2022-11-26 09:34:17,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 09:34:17,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 09:34:17,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 17: [2022-11-26 09:34:17,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 
17: [2022-11-26 09:34:17,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 09:34:17,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 20: [2022-11-26 09:34:17,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:34:17,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 09:34:17,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 24: [2022-11-26 09:34:17,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:34:17,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 24: [2022-11-26 09:34:17,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 4: [2022-11-26 09:34:17,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 24: [2022-11-26 09:34:17,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 4: [2022-11-26 09:34:17,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 27: [2022-11-26 09:34:17,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 
27: [2022-11-26 09:34:17,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 09:34:17,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 15: [2022-11-26 09:34:17,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:34:17,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 09:34:17,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 10: [2022-11-26 09:34:17,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:34:17,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 09:34:17,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 26: [2022-11-26 09:34:17,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:34:17,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 09:34:17,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 3: [2022-11-26 09:34:17,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 09:34:17,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 09:34:17,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 16: [2022-11-26 09:34:17,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:34:17,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 7: [2022-11-26 09:34:17,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:34:17,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 7: [2022-11-26 09:34:17,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 09:34:17,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 6: [2022-11-26 09:34:17,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:34:17,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 09:34:17,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 28: [2022-11-26 09:34:17,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
22: [2022-11-26 09:34:17,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 09:34:17,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 09:34:17,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 28: [2022-11-26 09:34:17,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 09:34:17,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 23: [2022-11-26 09:34:17,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:34:17,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 09:34:17,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 5: [2022-11-26 09:34:17,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:34:17,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 09:34:17,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 0: [2022-11-26 09:34:17,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
12: [2022-11-26 09:34:17,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:34:17,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 09:34:17,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 12: [2022-11-26 09:34:17,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 09:34:17,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 25: [2022-11-26 09:34:17,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 09:34:17,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 09:34:17,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 19: [2022-11-26 09:34:17,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:34:17,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 09:34:17,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 13: [2022-11-26 09:34:17,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-26 09:34:17,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 09:34:17,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 9: [2022-11-26 09:34:17,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:34:17,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 09:34:17,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 1: [2022-11-26 09:34:17,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:34:17,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 09:34:17,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 4: [2022-11-26 09:34:17,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:34:17,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 09:34:17,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 2: [2022-11-26 09:34:17,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
11: [2022-11-26 09:34:17,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:34:17,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 09:34:17,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 11: [2022-11-26 09:34:17,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 09:34:17,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 21: [2022-11-26 09:34:17,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:34:17,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 09:34:17,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 10: [2022-11-26 09:34:17,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:34:17,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 09:34:17,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 30: [2022-11-26 09:34:17,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-26 09:34:17,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 09:34:17,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 31: [2022-11-26 09:34:17,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:34:17,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 09:34:17,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 27: [2022-11-26 09:34:17,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:34:17,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 27: [2022-11-26 09:34:17,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 09:34:17,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 20: [2022-11-26 09:34:17,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 09:34:17,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 24: [2022-11-26 09:34:17,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
24: [2022-11-26 09:34:17,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 09:34:17,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 8: [2022-11-26 09:34:17,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 29: [2022-11-26 09:34:17,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:34:17,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 29: [2022-11-26 09:34:17,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 8: [2022-11-26 09:34:17,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 29: [2022-11-26 09:34:17,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 15: [2022-11-26 09:34:17,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:34:17,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 09:34:17,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 26: [2022-11-26 09:34:17,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-26 09:34:17,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 09:34:17,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 18: [2022-11-26 09:34:17,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:34:17,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 09:34:17,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 5: [2022-11-26 09:34:17,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:34:17,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 09:34:17,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 16: [2022-11-26 09:34:17,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:34:17,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 09:34:17,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 14: [2022-11-26 09:34:17,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 09:34:17,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 09:34:17,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 23: [2022-11-26 09:34:17,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:34:17,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:34:17,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 09:34:17,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 3: [2022-11-26 09:34:17,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 09:34:17,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 28: [2022-11-26 09:34:17,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 22: [2022-11-26 09:34:17,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 09:34:17,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 09:34:17,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
19: [2022-11-26 09:34:17,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:34:17,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 09:34:17,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 13: [2022-11-26 09:34:17,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:34:17,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:34:17,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 09:34:17,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 7: [2022-11-26 09:34:17,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 09:34:17,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 25: [2022-11-26 09:34:17,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:34:17,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
25: [2022-11-26 09:34:17,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 11: [2022-11-26 09:34:17,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 25: [2022-11-26 09:34:17,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 9: [2022-11-26 09:34:17,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 11: [2022-11-26 09:34:17,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 6: [2022-11-26 09:34:17,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:34:17,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 11: [2022-11-26 09:34:17,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 4: [2022-11-26 09:34:17,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:34:17,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 4: [2022-11-26 09:34:17,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 6: [2022-11-26 09:34:17,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
4: [2022-11-26 09:34:17,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 10: [2022-11-26 09:34:17,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:34:17,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 0: [2022-11-26 09:34:17,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:34:17,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 0: [2022-11-26 09:34:17,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 28: [2022-11-26 09:34:17,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 09:34:17,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 0: [2022-11-26 09:34:17,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 18: [2022-11-26 09:34:17,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:34:17,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 09:34:17,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
8: [2022-11-26 09:34:17,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:34:17,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 09:34:17,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 30: [2022-11-26 09:34:17,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:34:17,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:34:17,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 2: [2022-11-26 09:34:17,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:34:17,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 2: [2022-11-26 09:34:17,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 12: [2022-11-26 09:34:17,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 30: [2022-11-26 09:34:17,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 2: [2022-11-26 09:34:17,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
1: [2022-11-26 09:34:17,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:34:17,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:34:17,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 09:34:17,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 1: [2022-11-26 09:34:17,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 09:34:17,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 17: [2022-11-26 09:34:17,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 29: [2022-11-26 09:34:17,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 17: [2022-11-26 09:34:17,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 29: [2022-11-26 09:34:17,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 17: [2022-11-26 09:34:17,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 29: [2022-11-26 09:34:17,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
31: [2022-11-26 09:34:17,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:34:17,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 09:34:17,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 20: [2022-11-26 09:34:17,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:34:17,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 15: [2022-11-26 09:34:17,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:34:17,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 15: [2022-11-26 09:34:17,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 09:34:17,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 24: [2022-11-26 09:34:17,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:34:17,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 09:34:17,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 09:34:17,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 24: [2022-11-26 09:34:17,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 09:34:17,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 26: [2022-11-26 09:34:17,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:34:17,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 09:34:17,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 14: [2022-11-26 09:34:17,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:34:17,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 09:34:17,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 3: [2022-11-26 09:34:17,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 09:34:17,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 09:34:17,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 16: [2022-11-26 09:34:17,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:34:17,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 09:34:17,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 27: [2022-11-26 09:34:17,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 09:34:17,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 09:34:17,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 17: [2022-11-26 09:34:17,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 09:34:17,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 09:34:17,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 09:34:17,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
17: [2022-11-26 09:34:17,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 09:34:17,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 7: [2022-11-26 09:34:17,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:34:17,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 09:34:17,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 22: [2022-11-26 09:34:17,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 09:34:17,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 09:34:17,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 29: [2022-11-26 09:34:17,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 09:34:17,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step69000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 09:34:17,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
0: successfully saved checkpoint at iteration 69000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2569.55 31: iteration 69010/ 173500 | consumed samples: 17666560 | consumed tokens: 36181114880 | elapsed time per iteration (s): 1.02 | learning rate: 1.400E-04 | global batch size: 256 | lm loss: 2.040878E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.931 | TFLOPs: 15.24 | 31: iteration 69020/ 173500 | consumed samples: 17669120 | consumed tokens: 36186357760 | elapsed time per iteration (s): 0.80 | learning rate: 1.400E-04 | global batch size: 256 | lm loss: 2.029992E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.856 | TFLOPs: 19.47 | 31: iteration 69030/ 173500 | consumed samples: 17671680 | consumed tokens: 36191600640 | elapsed time per iteration (s): 0.81 | learning rate: 1.400E-04 | global batch size: 256 | lm loss: 2.030958E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.456 | TFLOPs: 19.14 | 31: iteration 69040/ 173500 | consumed samples: 17674240 | consumed tokens: 36196843520 | elapsed time per iteration (s): 0.74 | learning rate: 1.400E-04 | global batch size: 256 | lm loss: 2.037227E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.510 | TFLOPs: 20.96 | 31: iteration 69050/ 173500 | consumed samples: 17676800 | consumed tokens: 36202086400 | elapsed time per iteration (s): 0.74 | learning rate: 1.400E-04 | global batch size: 256 | lm loss: 2.004888E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.184 | TFLOPs: 21.06 | 31: iteration 69060/ 173500 | consumed samples: 17679360 | consumed tokens: 36207329280 | elapsed time per iteration (s): 0.77 | 
learning rate: 1.400E-04 | global batch size: 256 | lm loss: 2.019192E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.007 | TFLOPs: 20.09 | 31: iteration 69070/ 173500 | consumed samples: 17681920 | consumed tokens: 36212572160 | elapsed time per iteration (s): 0.74 | learning rate: 1.399E-04 | global batch size: 256 | lm loss: 2.046110E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.973 | TFLOPs: 20.87 | 31: iteration 69080/ 173500 | consumed samples: 17684480 | consumed tokens: 36217815040 | elapsed time per iteration (s): 0.78 | learning rate: 1.399E-04 | global batch size: 256 | lm loss: 2.004712E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.811 | TFLOPs: 19.89 | 31: iteration 69090/ 173500 | consumed samples: 17687040 | consumed tokens: 36223057920 | elapsed time per iteration (s): 0.73 | learning rate: 1.399E-04 | global batch size: 256 | lm loss: 2.016877E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.824 | TFLOPs: 21.16 | 31: iteration 69100/ 173500 | consumed samples: 17689600 | consumed tokens: 36228300800 | elapsed time per iteration (s): 0.85 | learning rate: 1.399E-04 | global batch size: 256 | lm loss: 2.011372E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.402 | TFLOPs: 18.23 | 31: iteration 69110/ 173500 | consumed samples: 17692160 | consumed tokens: 36233543680 | elapsed time per iteration (s): 0.81 | learning rate: 1.399E-04 | global batch size: 256 | lm loss: 2.019990E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.570 | TFLOPs: 19.09 | 31: iteration 69120/ 
173500 | consumed samples: 17694720 | consumed tokens: 36238786560 | elapsed time per iteration (s): 0.80 | learning rate: 1.399E-04 | global batch size: 256 | lm loss: 2.058313E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.375 | TFLOPs: 19.26 | 31: iteration 69130/ 173500 | consumed samples: 17697280 | consumed tokens: 36244029440 | elapsed time per iteration (s): 0.82 | learning rate: 1.399E-04 | global batch size: 256 | lm loss: 2.013569E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.442 | TFLOPs: 18.90 | 31: iteration 69140/ 173500 | consumed samples: 17699840 | consumed tokens: 36249272320 | elapsed time per iteration (s): 0.77 | learning rate: 1.398E-04 | global batch size: 256 | lm loss: 2.017961E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.956 | TFLOPs: 20.08 | 31: iteration 69150/ 173500 | consumed samples: 17702400 | consumed tokens: 36254515200 | elapsed time per iteration (s): 0.79 | learning rate: 1.398E-04 | global batch size: 256 | lm loss: 2.043458E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.392 | TFLOPs: 19.69 | 31: iteration 69160/ 173500 | consumed samples: 17704960 | consumed tokens: 36259758080 | elapsed time per iteration (s): 0.80 | learning rate: 1.398E-04 | global batch size: 256 | lm loss: 2.015110E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.277 | TFLOPs: 19.32 | 31: iteration 69170/ 173500 | consumed samples: 17707520 | consumed tokens: 36265000960 | elapsed time per iteration (s): 0.85 | learning rate: 1.398E-04 | global batch size: 256 | lm loss: 2.037569E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 301.290 | TFLOPs: 18.23 | 31: iteration 69180/ 173500 | consumed samples: 17710080 | consumed tokens: 36270243840 | elapsed time per iteration (s): 0.78 | learning rate: 1.398E-04 | global batch size: 256 | lm loss: 2.029384E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.225 | TFLOPs: 19.74 | 31: iteration 69190/ 173500 | consumed samples: 17712640 | consumed tokens: 36275486720 | elapsed time per iteration (s): 0.80 | learning rate: 1.398E-04 | global batch size: 256 | lm loss: 2.034114E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.109 | TFLOPs: 19.43 | 31: iteration 69200/ 173500 | consumed samples: 17715200 | consumed tokens: 36280729600 | elapsed time per iteration (s): 0.77 | learning rate: 1.397E-04 | global batch size: 256 | lm loss: 2.020271E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.177 | TFLOPs: 20.22 | 31: iteration 69210/ 173500 | consumed samples: 17717760 | consumed tokens: 36285972480 | elapsed time per iteration (s): 0.75 | learning rate: 1.397E-04 | global batch size: 256 | lm loss: 2.048877E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.032 | TFLOPs: 20.63 | 31: iteration 69220/ 173500 | consumed samples: 17720320 | consumed tokens: 36291215360 | elapsed time per iteration (s): 0.75 | learning rate: 1.397E-04 | global batch size: 256 | lm loss: 2.013307E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.416 | TFLOPs: 20.72 | 31: iteration 69230/ 173500 | consumed samples: 17722880 | consumed tokens: 36296458240 | elapsed time per iteration (s): 0.94 | learning rate: 
1.397E-04 | global batch size: 256 | lm loss: 2.033138E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 272.043 | TFLOPs: 16.46 | 31: iteration 69240/ 173500 | consumed samples: 17725440 | consumed tokens: 36301701120 | elapsed time per iteration (s): 0.78 | learning rate: 1.397E-04 | global batch size: 256 | lm loss: 1.990739E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.635 | TFLOPs: 19.76 | 31: iteration 69250/ 173500 | consumed samples: 17728000 | consumed tokens: 36306944000 | elapsed time per iteration (s): 0.78 | learning rate: 1.397E-04 | global batch size: 256 | lm loss: 2.020332E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.173 | TFLOPs: 19.73 | 31: iteration 69260/ 173500 | consumed samples: 17730560 | consumed tokens: 36312186880 | elapsed time per iteration (s): 0.74 | learning rate: 1.397E-04 | global batch size: 256 | lm loss: 2.039100E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.932 | TFLOPs: 20.81 | 31: iteration 69270/ 173500 | consumed samples: 17733120 | consumed tokens: 36317429760 | elapsed time per iteration (s): 0.78 | learning rate: 1.396E-04 | global batch size: 256 | lm loss: 2.013299E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.386 | TFLOPs: 19.75 | 31: iteration 69280/ 173500 | consumed samples: 17735680 | consumed tokens: 36322672640 | elapsed time per iteration (s): 0.75 | learning rate: 1.396E-04 | global batch size: 256 | lm loss: 1.987950E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.892 | TFLOPs: 20.56 | 31: iteration 69290/ 173500 | 
consumed samples: 17738240 | consumed tokens: 36327915520 | elapsed time per iteration (s): 0.87 | learning rate: 1.396E-04 | global batch size: 256 | lm loss: 2.048109E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.207 | TFLOPs: 17.74 | 31: iteration 69300/ 173500 | consumed samples: 17740800 | consumed tokens: 36333158400 | elapsed time per iteration (s): 0.77 | learning rate: 1.396E-04 | global batch size: 256 | lm loss: 2.028849E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.400 | TFLOPs: 20.17 | 31: iteration 69310/ 173500 | consumed samples: 17743360 | consumed tokens: 36338401280 | elapsed time per iteration (s): 0.78 | learning rate: 1.396E-04 | global batch size: 256 | lm loss: 1.998837E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.295 | TFLOPs: 19.98 | 31: iteration 69320/ 173500 | consumed samples: 17745920 | consumed tokens: 36343644160 | elapsed time per iteration (s): 0.75 | learning rate: 1.396E-04 | global batch size: 256 | lm loss: 2.024464E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.912 | TFLOPs: 20.62 | 31: iteration 69330/ 173500 | consumed samples: 17748480 | consumed tokens: 36348887040 | elapsed time per iteration (s): 0.78 | learning rate: 1.395E-04 | global batch size: 256 | lm loss: 2.037295E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.047 | TFLOPs: 19.85 | 31: iteration 69340/ 173500 | consumed samples: 17751040 | consumed tokens: 36354129920 | elapsed time per iteration (s): 0.77 | learning rate: 1.395E-04 | global batch size: 256 | lm loss: 2.035133E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 333.690 | TFLOPs: 20.19 | 31: iteration 69350/ 173500 | consumed samples: 17753600 | consumed tokens: 36359372800 | elapsed time per iteration (s): 0.78 | learning rate: 1.395E-04 | global batch size: 256 | lm loss: 2.027524E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.843 | TFLOPs: 19.77 | 31: iteration 69360/ 173500 | consumed samples: 17756160 | consumed tokens: 36364615680 | elapsed time per iteration (s): 0.75 | learning rate: 1.395E-04 | global batch size: 256 | lm loss: 2.034321E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.620 | TFLOPs: 20.55 | 31: iteration 69370/ 173500 | consumed samples: 17758720 | consumed tokens: 36369858560 | elapsed time per iteration (s): 0.82 | learning rate: 1.395E-04 | global batch size: 256 | lm loss: 2.016868E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.206 | TFLOPs: 18.95 | 31: iteration 69380/ 173500 | consumed samples: 17761280 | consumed tokens: 36375101440 | elapsed time per iteration (s): 0.75 | learning rate: 1.395E-04 | global batch size: 256 | lm loss: 1.992237E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.512 | TFLOPs: 20.72 | 31: iteration 69390/ 173500 | consumed samples: 17763840 | consumed tokens: 36380344320 | elapsed time per iteration (s): 0.84 | learning rate: 1.395E-04 | global batch size: 256 | lm loss: 2.025264E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.725 | TFLOPs: 18.50 | 31: iteration 69400/ 173500 | consumed samples: 17766400 | consumed tokens: 36385587200 | elapsed time per iteration (s): 0.79 | learning rate: 1.394E-04 | global batch 
size: 256 | lm loss: 2.040603E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.463 | TFLOPs: 19.51 | 31: iteration 69410/ 173500 | consumed samples: 17768960 | consumed tokens: 36390830080 | elapsed time per iteration (s): 0.79 | learning rate: 1.394E-04 | global batch size: 256 | lm loss: 2.014225E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.168 | TFLOPs: 19.67 | 31: iteration 69420/ 173500 | consumed samples: 17771520 | consumed tokens: 36396072960 | elapsed time per iteration (s): 0.79 | learning rate: 1.394E-04 | global batch size: 256 | lm loss: 2.013632E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.647 | TFLOPs: 19.64 | 31: iteration 69430/ 173500 | consumed samples: 17774080 | consumed tokens: 36401315840 | elapsed time per iteration (s): 0.85 | learning rate: 1.394E-04 | global batch size: 256 | lm loss: 2.038265E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.153 | TFLOPs: 18.22 | 31: iteration 69440/ 173500 | consumed samples: 17776640 | consumed tokens: 36406558720 | elapsed time per iteration (s): 0.80 | learning rate: 1.394E-04 | global batch size: 256 | lm loss: 2.018838E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.951 | TFLOPs: 19.42 | 31: iteration 69450/ 173500 | consumed samples: 17779200 | consumed tokens: 36411801600 | elapsed time per iteration (s): 0.81 | learning rate: 1.394E-04 | global batch size: 256 | lm loss: 2.013031E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.938 | TFLOPs: 19.23 | 31: iteration 69460/ 173500 | consumed samples: 17781760 | 
consumed tokens: 36417044480 | elapsed time per iteration (s): 0.91 | learning rate: 1.393E-04 | global batch size: 256 | lm loss: 2.054215E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.687 | TFLOPs: 17.10 | 31: iteration 69470/ 173500 | consumed samples: 17784320 | consumed tokens: 36422287360 | elapsed time per iteration (s): 0.82 | learning rate: 1.393E-04 | global batch size: 256 | lm loss: 2.008883E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.140 | TFLOPs: 18.82 | 31: iteration 69480/ 173500 | consumed samples: 17786880 | consumed tokens: 36427530240 | elapsed time per iteration (s): 0.85 | learning rate: 1.393E-04 | global batch size: 256 | lm loss: 2.038937E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.693 | TFLOPs: 18.13 | 31: iteration 69490/ 173500 | consumed samples: 17789440 | consumed tokens: 36432773120 | elapsed time per iteration (s): 0.79 | learning rate: 1.393E-04 | global batch size: 256 | lm loss: 2.022879E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.844 | TFLOPs: 19.53 | 31: iteration 69500/ 173500 | consumed samples: 17792000 | consumed tokens: 36438016000 | elapsed time per iteration (s): 0.83 | learning rate: 1.393E-04 | global batch size: 256 | lm loss: 2.034189E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.458 | TFLOPs: 18.60 | 31: iteration 69510/ 173500 | consumed samples: 17794560 | consumed tokens: 36443258880 | elapsed time per iteration (s): 0.82 | learning rate: 1.393E-04 | global batch size: 256 | lm loss: 2.020505E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 312.607 | TFLOPs: 18.91 | 31: iteration 69520/ 173500 | consumed samples: 17797120 | consumed tokens: 36448501760 | elapsed time per iteration (s): 0.80 | learning rate: 1.392E-04 | global batch size: 256 | lm loss: 2.056758E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.139 | TFLOPs: 19.43 | 31: iteration 69530/ 173500 | consumed samples: 17799680 | consumed tokens: 36453744640 | elapsed time per iteration (s): 0.80 | learning rate: 1.392E-04 | global batch size: 256 | lm loss: 2.023368E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.083 | TFLOPs: 19.30 | 31: iteration 69540/ 173500 | consumed samples: 17802240 | consumed tokens: 36458987520 | elapsed time per iteration (s): 0.85 | learning rate: 1.392E-04 | global batch size: 256 | lm loss: 2.004931E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.575 | TFLOPs: 18.31 | 31: iteration 69550/ 173500 | consumed samples: 17804800 | consumed tokens: 36464230400 | elapsed time per iteration (s): 0.87 | learning rate: 1.392E-04 | global batch size: 256 | lm loss: 2.021604E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.502 | TFLOPs: 17.82 | 31: iteration 69560/ 173500 | consumed samples: 17807360 | consumed tokens: 36469473280 | elapsed time per iteration (s): 0.77 | learning rate: 1.392E-04 | global batch size: 256 | lm loss: 2.016143E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.786 | TFLOPs: 20.01 | 31: iteration 69570/ 173500 | consumed samples: 17809920 | consumed tokens: 36474716160 | elapsed time per iteration (s): 0.85 | learning rate: 1.392E-04 | global batch size: 256 | lm loss: 
2.043434E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.433 | TFLOPs: 18.18 | 31: iteration 69580/ 173500 | consumed samples: 17812480 | consumed tokens: 36479959040 | elapsed time per iteration (s): 0.88 | learning rate: 1.392E-04 | global batch size: 256 | lm loss: 2.029943E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.126 | TFLOPs: 17.67 | 31: iteration 69590/ 173500 | consumed samples: 17815040 | consumed tokens: 36485201920 | elapsed time per iteration (s): 0.80 | learning rate: 1.391E-04 | global batch size: 256 | lm loss: 2.037106E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.275 | TFLOPs: 19.25 | 31: iteration 69600/ 173500 | consumed samples: 17817600 | consumed tokens: 36490444800 | elapsed time per iteration (s): 0.79 | learning rate: 1.391E-04 | global batch size: 256 | lm loss: 2.065587E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.261 | TFLOPs: 19.62 | 31: iteration 69610/ 173500 | consumed samples: 17820160 | consumed tokens: 36495687680 | elapsed time per iteration (s): 0.81 | learning rate: 1.391E-04 | global batch size: 256 | lm loss: 2.016864E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.089 | TFLOPs: 19.18 | 31: iteration 69620/ 173500 | consumed samples: 17822720 | consumed tokens: 36500930560 | elapsed time per iteration (s): 0.82 | learning rate: 1.391E-04 | global batch size: 256 | lm loss: 2.023663E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.153 | TFLOPs: 18.82 | 31: iteration 69630/ 173500 | consumed samples: 17825280 | consumed tokens: 
36506173440 | elapsed time per iteration (s): 0.78 | learning rate: 1.391E-04 | global batch size: 256 | lm loss: 1.995173E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.856 | TFLOPs: 19.89 | 31: iteration 69640/ 173500 | consumed samples: 17827840 | consumed tokens: 36511416320 | elapsed time per iteration (s): 0.81 | learning rate: 1.391E-04 | global batch size: 256 | lm loss: 2.045093E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.362 | TFLOPs: 19.14 | 31: iteration 69650/ 173500 | consumed samples: 17830400 | consumed tokens: 36516659200 | elapsed time per iteration (s): 0.81 | learning rate: 1.390E-04 | global batch size: 256 | lm loss: 2.004732E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.950 | TFLOPs: 19.24 | 31: iteration 69660/ 173500 | consumed samples: 17832960 | consumed tokens: 36521902080 | elapsed time per iteration (s): 0.82 | learning rate: 1.390E-04 | global batch size: 256 | lm loss: 2.028253E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.594 | TFLOPs: 18.97 | 31: iteration 69670/ 173500 | consumed samples: 17835520 | consumed tokens: 36527144960 | elapsed time per iteration (s): 0.87 | learning rate: 1.390E-04 | global batch size: 256 | lm loss: 2.021046E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.225 | TFLOPs: 17.74 | 31: iteration 69680/ 173500 | consumed samples: 17838080 | consumed tokens: 36532387840 | elapsed time per iteration (s): 0.74 | learning rate: 1.390E-04 | global batch size: 256 | lm loss: 2.025259E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 345.932 | TFLOPs: 20.93 | 31: iteration 69690/ 173500 | consumed samples: 17840640 | consumed tokens: 36537630720 | elapsed time per iteration (s): 0.79 | learning rate: 1.390E-04 | global batch size: 256 | lm loss: 1.988920E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.128 | TFLOPs: 19.49 | 31: iteration 69700/ 173500 | consumed samples: 17843200 | consumed tokens: 36542873600 | elapsed time per iteration (s): 0.75 | learning rate: 1.390E-04 | global batch size: 256 | lm loss: 2.012255E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.386 | TFLOPs: 20.65 | 31: iteration 69710/ 173500 | consumed samples: 17845760 | consumed tokens: 36548116480 | elapsed time per iteration (s): 0.89 | learning rate: 1.390E-04 | global batch size: 256 | lm loss: 2.069115E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.108 | TFLOPs: 17.31 | 31: iteration 69720/ 173500 | consumed samples: 17848320 | consumed tokens: 36553359360 | elapsed time per iteration (s): 0.77 | learning rate: 1.389E-04 | global batch size: 256 | lm loss: 2.032052E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.211 | TFLOPs: 20.22 | 31: iteration 69730/ 173500 | consumed samples: 17850880 | consumed tokens: 36558602240 | elapsed time per iteration (s): 0.79 | learning rate: 1.389E-04 | global batch size: 256 | lm loss: 2.014585E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.959 | TFLOPs: 19.54 | 31: iteration 69740/ 173500 | consumed samples: 17853440 | consumed tokens: 36563845120 | elapsed time per iteration (s): 0.79 | learning rate: 1.389E-04 | global batch size: 256 | lm loss: 2.016078E+00 | grad 
norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.857 | TFLOPs: 19.71 | 31: iteration 69750/ 173500 | consumed samples: 17856000 | consumed tokens: 36569088000 | elapsed time per iteration (s): 0.77 | learning rate: 1.389E-04 | global batch size: 256 | lm loss: 2.022745E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.183 | TFLOPs: 20.04 | 31: iteration 69760/ 173500 | consumed samples: 17858560 | consumed tokens: 36574330880 | elapsed time per iteration (s): 0.74 | learning rate: 1.389E-04 | global batch size: 256 | lm loss: 1.994857E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.252 | TFLOPs: 20.95 | 31: iteration 69770/ 173500 | consumed samples: 17861120 | consumed tokens: 36579573760 | elapsed time per iteration (s): 0.77 | learning rate: 1.389E-04 | global batch size: 256 | lm loss: 2.003144E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.505 | TFLOPs: 19.99 | 31: iteration 69780/ 173500 | consumed samples: 17863680 | consumed tokens: 36584816640 | elapsed time per iteration (s): 0.76 | learning rate: 1.388E-04 | global batch size: 256 | lm loss: 2.052515E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.689 | TFLOPs: 20.49 | 31: iteration 69790/ 173500 | consumed samples: 17866240 | consumed tokens: 36590059520 | elapsed time per iteration (s): 0.76 | learning rate: 1.388E-04 | global batch size: 256 | lm loss: 2.032556E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.070 | TFLOPs: 20.27 | 31: iteration 69800/ 173500 | consumed samples: 17868800 | consumed tokens: 36595302400 | elapsed time 
per iteration (s): 0.77 | learning rate: 1.388E-04 | global batch size: 256 | lm loss: 2.019147E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.820 | TFLOPs: 20.13 | 31: iteration 69810/ 173500 | consumed samples: 17871360 | consumed tokens: 36600545280 | elapsed time per iteration (s): 0.76 | learning rate: 1.388E-04 | global batch size: 256 | lm loss: 2.006979E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.329 | TFLOPs: 20.35 | 31: iteration 69820/ 173500 | consumed samples: 17873920 | consumed tokens: 36605788160 | elapsed time per iteration (s): 0.72 | learning rate: 1.388E-04 | global batch size: 256 | lm loss: 1.986496E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.613 | TFLOPs: 21.51 | 31: iteration 69830/ 173500 | consumed samples: 17876480 | consumed tokens: 36611031040 | elapsed time per iteration (s): 0.93 | learning rate: 1.388E-04 | global batch size: 256 | lm loss: 2.015783E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 274.182 | TFLOPs: 16.59 | 31: iteration 69840/ 173500 | consumed samples: 17879040 | consumed tokens: 36616273920 | elapsed time per iteration (s): 0.81 | learning rate: 1.388E-04 | global batch size: 256 | lm loss: 2.009175E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.290 | TFLOPs: 19.01 | 31: iteration 69850/ 173500 | consumed samples: 17881600 | consumed tokens: 36621516800 | elapsed time per iteration (s): 0.81 | learning rate: 1.387E-04 | global batch size: 256 | lm loss: 2.018929E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.332 | TFLOPs: 
19.14 | 31: iteration 69860/ 173500 | consumed samples: 17884160 | consumed tokens: 36626759680 | elapsed time per iteration (s): 0.82 | learning rate: 1.387E-04 | global batch size: 256 | lm loss: 2.031837E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.687 | TFLOPs: 18.86 | 31: iteration 69870/ 173500 | consumed samples: 17886720 | consumed tokens: 36632002560 | elapsed time per iteration (s): 0.83 | learning rate: 1.387E-04 | global batch size: 256 | lm loss: 2.024761E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.588 | TFLOPs: 18.55 | 31: iteration 69880/ 173500 | consumed samples: 17889280 | consumed tokens: 36637245440 | elapsed time per iteration (s): 0.82 | learning rate: 1.387E-04 | global batch size: 256 | lm loss: 2.008836E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.624 | TFLOPs: 18.91 | 31: iteration 69890/ 173500 | consumed samples: 17891840 | consumed tokens: 36642488320 | elapsed time per iteration (s): 0.83 | learning rate: 1.387E-04 | global batch size: 256 | lm loss: 2.013800E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.415 | TFLOPs: 18.72 | 31: iteration 69900/ 173500 | consumed samples: 17894400 | consumed tokens: 36647731200 | elapsed time per iteration (s): 0.82 | learning rate: 1.387E-04 | global batch size: 256 | lm loss: 2.034183E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.394 | TFLOPs: 18.90 | 31: iteration 69910/ 173500 | consumed samples: 17896960 | consumed tokens: 36652974080 | elapsed time per iteration (s): 0.78 | learning rate: 1.386E-04 | global batch size: 256 | lm loss: 2.012593E+00 | grad norm: 0.135 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.080 | TFLOPs: 19.97 | 31: iteration 69920/ 173500 | consumed samples: 17899520 | consumed tokens: 36658216960 | elapsed time per iteration (s): 0.76 | learning rate: 1.386E-04 | global batch size: 256 | lm loss: 2.065405E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.367 | TFLOPs: 20.47 | 31: iteration 69930/ 173500 | consumed samples: 17902080 | consumed tokens: 36663459840 | elapsed time per iteration (s): 0.87 | learning rate: 1.386E-04 | global batch size: 256 | lm loss: 2.034737E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.602 | TFLOPs: 17.82 | 31: iteration 69940/ 173500 | consumed samples: 17904640 | consumed tokens: 36668702720 | elapsed time per iteration (s): 0.79 | learning rate: 1.386E-04 | global batch size: 256 | lm loss: 2.034401E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.616 | TFLOPs: 19.70 | 31: iteration 69950/ 173500 | consumed samples: 17907200 | consumed tokens: 36673945600 | elapsed time per iteration (s): 0.90 | learning rate: 1.386E-04 | global batch size: 256 | lm loss: 2.010707E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 283.351 | TFLOPs: 17.14 | 31: iteration 69960/ 173500 | consumed samples: 17909760 | consumed tokens: 36679188480 | elapsed time per iteration (s): 0.79 | learning rate: 1.386E-04 | global batch size: 256 | lm loss: 2.019228E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.386 | TFLOPs: 19.56 | 31: iteration 69970/ 173500 | consumed samples: 17912320 | consumed tokens: 36684431360 | elapsed time per iteration (s): 0.86 | 
learning rate: 1.385E-04 | global batch size: 256 | lm loss: 2.036553E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.219 | TFLOPs: 18.10 | 31: iteration 69980/ 173500 | consumed samples: 17914880 | consumed tokens: 36689674240 | elapsed time per iteration (s): 0.78 | learning rate: 1.385E-04 | global batch size: 256 | lm loss: 2.029401E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.059 | TFLOPs: 19.97 | 31: iteration 69990/ 173500 | consumed samples: 17917440 | consumed tokens: 36694917120 | elapsed time per iteration (s): 0.80 | learning rate: 1.385E-04 | global batch size: 256 | lm loss: 2.016198E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.775 | TFLOPs: 19.35 | 0: [2022-11-26 09:47:37,414] [INFO] [logging.py:68:log_dist] [Rank 0] step=70000, skipped=0, lr=[0.0001385013705497804, 0.0001385013705497804, 0.0001385013705497804], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 70000/ 173500 | consumed samples: 17920000 | consumed tokens: 36700160000 | elapsed time per iteration (s): 0.82 | learning rate: 1.385E-04 | global batch size: 256 | lm loss: 2.034461E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.716 | TFLOPs: 18.80 | 0: steps: 70000 loss: 2.0335 iter time (s): 0.811 samples/sec: 315.507 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 70000 | lm loss value: 2.020608E+00 | lm loss PPL: 7.542913E+00 | 0: saving checkpoint at iteration 70000 to checkpoints_1b1long 31: ------------------------------------------------------------------------------------------- 0: [2022-11-26 09:47:37,723] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint 
global_step70000 is begin to save! 0: [2022-11-26 09:47:37,735] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_01-model_00-model_states.pt... 0: [2022-11-26 09:47:37,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_01-model_00-model_states.pt. 0: [2022-11-26 09:47:37,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_03-model_00-model_states.pt... 0: [2022-11-26 09:47:38,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_03-model_00-model_states.pt. 0: [2022-11-26 09:47:38,073] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_04-model_00-model_states.pt... 0: [2022-11-26 09:47:38,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_04-model_00-model_states.pt. 0: [2022-11-26 09:47:38,155] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_05-model_00-model_states.pt... 0: [2022-11-26 09:47:38,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_05-model_00-model_states.pt. 0: [2022-11-26 09:47:38,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_06-model_00-model_states.pt... 0: [2022-11-26 09:47:38,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_06-model_00-model_states.pt. 0: [2022-11-26 09:47:38,311] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_07-model_00-model_states.pt... 0: [2022-11-26 09:47:38,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 09:47:38,388] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_08-model_00-model_states.pt... 0: [2022-11-26 09:47:38,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_08-model_00-model_states.pt. 0: [2022-11-26 09:47:38,464] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_09-model_00-model_states.pt... 0: [2022-11-26 09:47:38,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_09-model_00-model_states.pt. 0: [2022-11-26 09:47:38,541] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_10-model_00-model_states.pt... 0: [2022-11-26 09:47:38,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_10-model_00-model_states.pt. 0: [2022-11-26 09:47:38,620] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_11-model_00-model_states.pt... 0: [2022-11-26 09:47:38,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_11-model_00-model_states.pt. 0: [2022-11-26 09:47:38,703] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_12-model_00-model_states.pt... 0: [2022-11-26 09:47:38,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_12-model_00-model_states.pt. 0: [2022-11-26 09:47:38,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_13-model_00-model_states.pt... 0: [2022-11-26 09:47:38,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 09:47:38,858] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_14-model_00-model_states.pt... 0: [2022-11-26 09:47:38,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_14-model_00-model_states.pt. 0: [2022-11-26 09:47:38,935] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_15-model_00-model_states.pt... 0: [2022-11-26 09:47:39,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_15-model_00-model_states.pt. 0: [2022-11-26 09:47:39,007] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_16-model_00-model_states.pt... 0: [2022-11-26 09:47:39,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_16-model_00-model_states.pt. 0: [2022-11-26 09:47:39,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_17-model_00-model_states.pt... 0: [2022-11-26 09:47:39,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_17-model_00-model_states.pt. 0: [2022-11-26 09:47:39,158] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_18-model_00-model_states.pt... 0: [2022-11-26 09:47:39,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_18-model_00-model_states.pt. 0: [2022-11-26 09:47:39,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_19-model_00-model_states.pt... 0: [2022-11-26 09:47:39,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 09:47:39,303] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_20-model_00-model_states.pt... 0: [2022-11-26 09:47:39,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_20-model_00-model_states.pt. 0: [2022-11-26 09:47:39,380] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_21-model_00-model_states.pt... 0: [2022-11-26 09:47:39,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_21-model_00-model_states.pt. 0: [2022-11-26 09:47:39,455] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_22-model_00-model_states.pt... 0: [2022-11-26 09:47:39,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_22-model_00-model_states.pt. 0: [2022-11-26 09:47:39,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_23-model_00-model_states.pt... 0: [2022-11-26 09:47:39,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_23-model_00-model_states.pt. 0: [2022-11-26 09:47:39,604] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_24-model_00-model_states.pt... 0: [2022-11-26 09:47:39,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_24-model_00-model_states.pt. 0: [2022-11-26 09:47:39,676] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_25-model_00-model_states.pt... 0: [2022-11-26 09:47:39,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 09:47:39,752] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_26-model_00-model_states.pt... 0: [2022-11-26 09:47:39,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_26-model_00-model_states.pt. 0: [2022-11-26 09:47:39,827] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_27-model_00-model_states.pt... 0: [2022-11-26 09:47:39,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_27-model_00-model_states.pt. 0: [2022-11-26 09:47:39,901] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_28-model_00-model_states.pt... 0: [2022-11-26 09:47:39,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_28-model_00-model_states.pt. 0: [2022-11-26 09:47:39,975] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/layer_30-model_00-model_states.pt... 0: [2022-11-26 09:47:39,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/layer_30-model_00-model_states.pt. 0: [2022-11-26 09:47:39,978] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step70000/mp_rank_00_model_states.pt 0: [2022-11-26 09:47:39,978] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/mp_rank_00_model_states.pt... 0: [2022-11-26 09:47:39,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/mp_rank_00_model_states.pt. 0: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
0: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
5: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 
10: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 
13: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 
20: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 
11: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 
22: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 
26: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 
5: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
10: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
13: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 
25: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 
24: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 
22: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 
18: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
5: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
10: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
15: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 
14: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 
21: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 
27: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
15: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 29: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 18: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 27: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 
4: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 26: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 28: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 17: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 
17: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:47:40,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 24: [2022-11-26 09:47:40,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 09:47:40,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 09:47:40,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 4: [2022-11-26 09:47:40,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:47:40,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 09:47:40,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 4: [2022-11-26 09:47:40,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:47:40,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 09:47:40,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 6: [2022-11-26 09:47:40,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 09:47:40,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 09:47:40,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 1: [2022-11-26 09:47:40,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:47:40,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:47:40,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 09:47:40,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 1: [2022-11-26 09:47:40,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 09:47:40,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 11: [2022-11-26 09:47:40,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:47:40,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 09:47:40,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 28: [2022-11-26 09:47:40,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
18: [2022-11-26 09:47:40,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:47:40,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 14: [2022-11-26 09:47:40,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:47:40,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 14: [2022-11-26 09:47:40,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 09:47:40,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 11: [2022-11-26 09:47:40,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:47:40,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 09:47:40,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 25: [2022-11-26 09:47:40,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 29: [2022-11-26 09:47:40,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
29: [2022-11-26 09:47:40,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 8: [2022-11-26 09:47:40,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 25: [2022-11-26 09:47:40,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 29: [2022-11-26 09:47:40,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 25: [2022-11-26 09:47:40,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 22: [2022-11-26 09:47:40,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:47:40,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 09:47:40,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 22: [2022-11-26 09:47:40,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 09:47:40,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 8: [2022-11-26 09:47:40,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-26 09:47:40,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 09:47:40,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 24: [2022-11-26 09:47:40,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 09:47:40,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 09:47:40,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 20: [2022-11-26 09:47:40,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:47:40,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:47:40,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:47:40,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 20: [2022-11-26 09:47:40,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 2: [2022-11-26 09:47:40,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
20: [2022-11-26 09:47:40,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 09:47:40,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 12: [2022-11-26 09:47:40,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:47:40,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 12: [2022-11-26 09:47:40,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 09:47:40,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 12: [2022-11-26 09:47:40,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:47:40,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 09:47:40,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 1: [2022-11-26 09:47:40,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 29: [2022-11-26 09:47:40,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-26 09:47:40,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 1: [2022-11-26 09:47:40,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 29: [2022-11-26 09:47:40,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 26: [2022-11-26 09:47:40,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:47:40,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:47:40,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 10: [2022-11-26 09:47:40,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 26: [2022-11-26 09:47:40,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 10: [2022-11-26 09:47:40,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 26: [2022-11-26 09:47:40,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 26: [2022-11-26 09:47:40,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
26: [2022-11-26 09:47:40,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 09:47:40,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 3: [2022-11-26 09:47:40,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:47:40,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 21: [2022-11-26 09:47:40,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:47:40,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 21: [2022-11-26 09:47:40,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 09:47:40,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 18: [2022-11-26 09:47:40,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:47:40,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 09:47:40,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
28: [2022-11-26 09:47:40,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 0: [2022-11-26 09:47:40,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:47:40,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 28: [2022-11-26 09:47:40,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 28: [2022-11-26 09:47:40,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:47:40,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 09:47:40,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 28: [2022-11-26 09:47:40,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 0: [2022-11-26 09:47:40,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 0: [2022-11-26 09:47:40,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 28: [2022-11-26 09:47:40,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 16: [2022-11-26 09:47:40,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-26 09:47:40,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 09:47:40,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 3: [2022-11-26 09:47:40,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:47:40,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:47:40,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 09:47:40,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 3: [2022-11-26 09:47:40,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 09:47:40,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 16: [2022-11-26 09:47:40,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 25: [2022-11-26 09:47:40,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:47:40,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 09:47:40,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
22: [2022-11-26 09:47:40,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 25: [2022-11-26 09:47:40,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 22: [2022-11-26 09:47:40,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 09:47:40,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 25: [2022-11-26 09:47:40,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 27: [2022-11-26 09:47:40,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:47:40,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 27: [2022-11-26 09:47:40,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 09:47:40,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 6: [2022-11-26 09:47:40,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 09:47:40,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 5: [2022-11-26 09:47:40,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-26 09:47:40,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 09:47:40,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 5: [2022-11-26 09:47:40,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:47:40,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 09:47:40,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 31: [2022-11-26 09:47:40,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:47:40,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 09:47:40,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 4: [2022-11-26 09:47:40,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:47:40,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 09:47:40,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 2: [2022-11-26 09:47:40,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 09:47:40,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 09:47:40,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 4: [2022-11-26 09:47:40,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:47:40,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 09:47:40,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 28: [2022-11-26 09:47:40,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:47:40,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:47:40,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:47:40,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 28: [2022-11-26 09:47:40,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 09:47:40,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
30: [2022-11-26 09:47:40,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 09:47:40,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 09:47:40,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 09:47:40,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 30: [2022-11-26 09:47:40,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 30: [2022-11-26 09:47:40,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 7: [2022-11-26 09:47:40,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:47:40,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 09:47:40,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 21: [2022-11-26 09:47:40,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:47:40,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
21: [2022-11-26 09:47:40,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 27: [2022-11-26 09:47:40,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:47:40,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 09:47:40,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 21: [2022-11-26 09:47:40,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 8: [2022-11-26 09:47:40,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:47:40,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 27: [2022-11-26 09:47:40,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 09:47:40,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
14: [2022-11-26 09:47:40,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 8: [2022-11-26 09:47:40,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 14: [2022-11-26 09:47:40,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 11: [2022-11-26 09:47:40,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 27: [2022-11-26 09:47:40,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 11: [2022-11-26 09:47:40,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 27: [2022-11-26 09:47:40,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 11: [2022-11-26 09:47:40,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 8: [2022-11-26 09:47:40,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 27: [2022-11-26 09:47:40,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 7: [2022-11-26 09:47:40,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 09:47:40,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 09:47:40,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 13: [2022-11-26 09:47:40,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:47:40,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 09:47:40,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 15: [2022-11-26 09:47:40,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:47:40,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 09:47:40,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 13: [2022-11-26 09:47:40,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:47:40,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:47:40,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:47:40,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-26 09:47:40,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 13: [2022-11-26 09:47:40,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 09:47:40,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 15: [2022-11-26 09:47:40,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 17: [2022-11-26 09:47:40,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:47:40,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 13: [2022-11-26 09:47:40,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 15: [2022-11-26 09:47:40,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 15: [2022-11-26 09:47:40,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 17: [2022-11-26 09:47:40,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 09:47:40,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 09:47:40,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
17: [2022-11-26 09:47:40,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 09:47:40,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 3: [2022-11-26 09:47:40,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:47:40,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 09:47:40,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 20: [2022-11-26 09:47:40,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:47:40,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 09:47:40,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 12: [2022-11-26 09:47:40,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:47:40,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 09:47:40,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 26: [2022-11-26 09:47:40,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-26 09:47:40,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:47:40,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 09:47:40,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 09:47:40,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 26: [2022-11-26 09:47:40,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 16: [2022-11-26 09:47:40,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:47:40,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 09:47:40,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 3: [2022-11-26 09:47:40,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:47:40,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 09:47:40,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 3: [2022-11-26 09:47:40,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 6: [2022-11-26 09:47:40,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 3: [2022-11-26 09:47:40,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 5: [2022-11-26 09:47:40,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:47:40,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 09:47:40,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 18: [2022-11-26 09:47:40,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:47:40,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 09:47:40,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 18: [2022-11-26 09:47:40,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 28: [2022-11-26 09:47:40,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
18: [2022-11-26 09:47:40,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 09:47:40,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 10: [2022-11-26 09:47:40,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 28: [2022-11-26 09:47:40,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 10: [2022-11-26 09:47:40,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 17: [2022-11-26 09:47:40,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:47:40,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 28: [2022-11-26 09:47:40,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 17: [2022-11-26 09:47:40,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 9: [2022-11-26 09:47:40,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:47:40,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 09:47:40,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
17: [2022-11-26 09:47:40,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 17: [2022-11-26 09:47:40,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 09:47:40,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 09:47:40,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 10: [2022-11-26 09:47:40,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:47:40,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 09:47:40,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 13: [2022-11-26 09:47:40,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:47:40,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 09:47:40,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 6: [2022-11-26 09:47:40,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:47:40,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
6: [2022-11-26 09:47:40,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 11: [2022-11-26 09:47:40,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 6: [2022-11-26 09:47:40,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 11: [2022-11-26 09:47:40,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 24: [2022-11-26 09:47:40,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 09:47:40,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 09:47:40,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 2: [2022-11-26 09:47:40,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 27: [2022-11-26 09:47:40,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:47:40,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 20: [2022-11-26 09:47:40,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:47:40,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
27: [2022-11-26 09:47:40,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 14: [2022-11-26 09:47:40,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:47:40,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 14: [2022-11-26 09:47:40,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 27: [2022-11-26 09:47:40,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 14: [2022-11-26 09:47:40,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 30: [2022-11-26 09:47:40,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:47:40,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 30: [2022-11-26 09:47:40,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 09:47:40,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 12: [2022-11-26 09:47:40,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 09:47:40,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 09:47:40,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 8: [2022-11-26 09:47:40,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:47:40,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 09:47:40,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 16: [2022-11-26 09:47:40,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:47:40,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:47:40,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 31: [2022-11-26 09:47:40,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:47:40,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
31: [2022-11-26 09:47:40,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 09:47:40,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 09:47:40,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 31: [2022-11-26 09:47:40,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 1: [2022-11-26 09:47:40,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:47:40,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 09:47:40,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 9: [2022-11-26 09:47:40,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:47:40,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 09:47:40,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 09:47:40,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 21: [2022-11-26 09:47:40,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:47:40,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 9: [2022-11-26 09:47:40,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 15: [2022-11-26 09:47:40,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:47:40,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 15: [2022-11-26 09:47:40,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 21: [2022-11-26 09:47:40,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 15: [2022-11-26 09:47:40,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 31: [2022-11-26 09:47:40,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 
31: [2022-11-26 09:47:40,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 09:47:40,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 0: [2022-11-26 09:47:40,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:47:40,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:47:40,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 0: [2022-11-26 09:47:40,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 4: [2022-11-26 09:47:40,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 0: [2022-11-26 09:47:40,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 23: [2022-11-26 09:47:40,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:47:40,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 2: [2022-11-26 09:47:40,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 09:47:40,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 23: [2022-11-26 09:47:40,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 2: [2022-11-26 09:47:40,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 0: [2022-11-26 09:47:40,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:47:40,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 09:47:40,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 5: [2022-11-26 09:47:40,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:47:40,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 09:47:40,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 1: [2022-11-26 09:47:40,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:47:40,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 09:47:40,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
22: [2022-11-26 09:47:40,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:47:40,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 22: [2022-11-26 09:47:40,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 7: [2022-11-26 09:47:40,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 22: [2022-11-26 09:47:40,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 22: [2022-11-26 09:47:40,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:47:40,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 24: [2022-11-26 09:47:40,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 22: [2022-11-26 09:47:40,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 09:47:40,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 24: [2022-11-26 09:47:40,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 09:47:40,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
29: [2022-11-26 09:47:40,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 09:47:40,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 09:47:40,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 09:47:40,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 29: [2022-11-26 09:47:40,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 09:47:40,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 21: [2022-11-26 09:47:40,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:47:40,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 09:47:40,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 23: [2022-11-26 09:47:40,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:47:40,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 09:47:40,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
23: [2022-11-26 09:47:40,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:47:40,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 09:47:40,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 20: [2022-11-26 09:47:40,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:47:40,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 09:47:40,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 25: [2022-11-26 09:47:40,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 09:47:40,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 09:47:40,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 26: [2022-11-26 09:47:40,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:47:40,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 09:47:40,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
0: [2022-11-26 09:47:40,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:47:40,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:47:40,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 09:47:40,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 19: [2022-11-26 09:47:40,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:47:40,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:47:40,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:47:40,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-26 09:47:40,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 09:47:40,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 09:47:40,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 09:47:40,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 09:47:40,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 19: [2022-11-26 09:47:40,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 19: [2022-11-26 09:47:40,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 19: [2022-11-26 09:47:40,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 24: [2022-11-26 09:47:40,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 09:47:40,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 09:47:40,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
0: [2022-11-26 09:47:40,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 09:47:40,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 10: [2022-11-26 09:47:40,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:47:40,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 09:47:40,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 21: [2022-11-26 09:47:40,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:47:40,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 09:47:40,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 31: [2022-11-26 09:47:40,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:47:40,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 09:47:40,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 25: [2022-11-26 09:47:40,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
25: [2022-11-26 09:47:40,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 09:47:40,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 09:47:40,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 09:47:40,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 25: [2022-11-26 09:47:40,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 5: [2022-11-26 09:47:40,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:47:40,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 09:47:40,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 8: [2022-11-26 09:47:40,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:47:40,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 09:47:40,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 3: [2022-11-26 09:47:40,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 09:47:40,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 09:47:40,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 23: [2022-11-26 09:47:40,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:47:40,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 09:47:40,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 15: [2022-11-26 09:47:40,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:47:40,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 09:47:40,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 29: [2022-11-26 09:47:40,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 09:47:40,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 09:47:40,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 9: [2022-11-26 09:47:40,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-26 09:47:40,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 09:47:40,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 30: [2022-11-26 09:47:40,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:47:40,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 09:47:40,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 27: [2022-11-26 09:47:40,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 09:47:40,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 09:47:40,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 11: [2022-11-26 09:47:40,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:47:40,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 09:47:40,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 1: [2022-11-26 09:47:40,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 09:47:40,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 09:47:40,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 2: [2022-11-26 09:47:40,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:47:40,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 09:47:40,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 17: [2022-11-26 09:47:40,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 09:47:40,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 09:47:40,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 14: [2022-11-26 09:47:40,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:47:40,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 09:47:40,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 18: [2022-11-26 09:47:40,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
18: [2022-11-26 09:47:40,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 22: [2022-11-26 09:47:40,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:47:40,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 22: [2022-11-26 09:47:40,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 09:47:40,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 7: [2022-11-26 09:47:40,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:47:40,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 09:47:40,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 16: [2022-11-26 09:47:40,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:47:40,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 09:47:40,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 13: [2022-11-26 09:47:40,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-26 09:47:40,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 09:47:40,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 28: [2022-11-26 09:47:40,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 09:47:40,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 0: [2022-11-26 09:47:40,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:47:40,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:47:40,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 09:47:40,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 6: [2022-11-26 09:47:40,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 09:47:40,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 12: [2022-11-26 09:47:40,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 09:47:40,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 09:47:40,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 28: [2022-11-26 09:47:40,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 26: [2022-11-26 09:47:40,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:47:40,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 09:47:40,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 4: [2022-11-26 09:47:40,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:47:40,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 09:47:40,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 19: [2022-11-26 09:47:40,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:47:40,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 09:47:40,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
31: [2022-11-26 09:47:40,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:47:40,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 09:47:40,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 5: [2022-11-26 09:47:40,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:47:40,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 09:47:40,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 23: [2022-11-26 09:47:40,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:47:40,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 09:47:40,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 20: [2022-11-26 09:47:40,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
20: [2022-11-26 09:47:40,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 8: [2022-11-26 09:47:40,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:47:40,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 8: [2022-11-26 09:47:40,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 24: [2022-11-26 09:47:40,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:47:40,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 24: [2022-11-26 09:47:40,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 09:47:40,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 9: [2022-11-26 09:47:40,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:47:40,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 09:47:40,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 3: [2022-11-26 09:47:40,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 09:47:40,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 09:47:40,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 21: [2022-11-26 09:47:40,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:47:40,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 09:47:40,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 27: [2022-11-26 09:47:40,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 09:47:40,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 09:47:40,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 10: [2022-11-26 09:47:40,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:47:40,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 09:47:40,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 19: [2022-11-26 09:47:40,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-26 09:47:40,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 15: [2022-11-26 09:47:40,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:47:40,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 29: [2022-11-26 09:47:40,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:47:40,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 09:47:40,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 29: [2022-11-26 09:47:40,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 09:47:40,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 30: [2022-11-26 09:47:40,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:47:40,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 09:47:40,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 11: [2022-11-26 09:47:40,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 09:47:40,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 09:47:40,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 28: [2022-11-26 09:47:40,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:47:40,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:47:40,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 09:47:40,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 16: [2022-11-26 09:47:40,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 09:47:40,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 09:47:40,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 12: [2022-11-26 09:47:40,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:47:40,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 09:47:40,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
25: [2022-11-26 09:47:40,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 09:47:40,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 09:47:40,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 7: [2022-11-26 09:47:40,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:47:40,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 09:47:40,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 18: [2022-11-26 09:47:40,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:47:40,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 09:47:40,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 22: [2022-11-26 09:47:40,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 09:47:40,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 09:47:40,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
28: [2022-11-26 09:47:40,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 09:47:40,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 4: [2022-11-26 09:47:40,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:47:40,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 6: [2022-11-26 09:47:40,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:47:40,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 0: [2022-11-26 09:47:40,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:47:40,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 0: [2022-11-26 09:47:40,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 09:47:40,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 6: [2022-11-26 09:47:40,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 13: [2022-11-26 09:47:40,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 09:47:40,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 09:47:40,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 26: [2022-11-26 09:47:40,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:47:40,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 17: [2022-11-26 09:47:40,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:47:40,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 17: [2022-11-26 09:47:40,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 09:47:40,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 2: [2022-11-26 09:47:40,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:47:40,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 09:47:40,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 14: [2022-11-26 09:47:40,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 09:47:40,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 09:47:40,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 20: [2022-11-26 09:47:40,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 09:47:40,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 09:47:40,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 5: [2022-11-26 09:47:40,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:47:40,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 09:47:40,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 21: [2022-11-26 09:47:40,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 24: [2022-11-26 09:47:40,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:47:40,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
24: [2022-11-26 09:47:40,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 21: [2022-11-26 09:47:40,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 24: [2022-11-26 09:47:40,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 21: [2022-11-26 09:47:40,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 8: [2022-11-26 09:47:40,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 09:47:40,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 3: [2022-11-26 09:47:40,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:47:40,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 09:47:40,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 23: [2022-11-26 09:47:40,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:47:40,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 09:47:40,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
25: [2022-11-26 09:47:40,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 09:47:40,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 09:47:40,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 31: [2022-11-26 09:47:40,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:47:40,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 09:47:40,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 29: [2022-11-26 09:47:40,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:47:40,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 29: [2022-11-26 09:47:40,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 15: [2022-11-26 09:47:40,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 29: [2022-11-26 09:47:40,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 15: [2022-11-26 09:47:40,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
9: [2022-11-26 09:47:40,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:47:40,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 09:47:40,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 27: [2022-11-26 09:47:40,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 09:47:40,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 09:47:40,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 19: [2022-11-26 09:47:40,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:47:40,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 09:47:40,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 30: [2022-11-26 09:47:40,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-26 09:47:40,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 11: [2022-11-26 09:47:40,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 30: [2022-11-26 09:47:40,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 11: [2022-11-26 09:47:40,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 09:47:40,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 1: [2022-11-26 09:47:40,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:47:40,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 09:47:40,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 17: [2022-11-26 09:47:40,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 09:47:40,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 09:47:40,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 6: [2022-11-26 09:47:40,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 09:47:40,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 09:47:40,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 7: [2022-11-26 09:47:40,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:47:40,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 09:47:40,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 28: [2022-11-26 09:47:40,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:47:40,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:47:40,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:47:40,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 18: [2022-11-26 09:47:40,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 4: [2022-11-26 09:47:40,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 18: [2022-11-26 09:47:40,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
13: [2022-11-26 09:47:40,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:47:40,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 09:47:40,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 0: [2022-11-26 09:47:40,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:47:40,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 16: [2022-11-26 09:47:40,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:47:40,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 16: [2022-11-26 09:47:40,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 09:47:40,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 22: [2022-11-26 09:47:40,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
22: [2022-11-26 09:47:40,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 12: [2022-11-26 09:47:40,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 22: [2022-11-26 09:47:40,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 12: [2022-11-26 09:47:40,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 09:47:40,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 8: [2022-11-26 09:47:40,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:47:40,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 09:47:40,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 26: [2022-11-26 09:47:40,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 09:47:40,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 09:47:40,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 20: [2022-11-26 09:47:40,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
20: [2022-11-26 09:47:40,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 09:47:40,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 10: [2022-11-26 09:47:40,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:47:40,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:47:40,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 09:47:40,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 09:47:40,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 10: [2022-11-26 09:47:40,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 28: [2022-11-26 09:47:40,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 09:47:40,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 25: [2022-11-26 09:47:40,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:47:40,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
25: [2022-11-26 09:47:40,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 09:47:40,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 3: [2022-11-26 09:47:40,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 09:47:40,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 31: [2022-11-26 09:47:40,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 09:47:40,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 09:47:40,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 24: [2022-11-26 09:47:40,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:47:40,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 24: [2022-11-26 09:47:40,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 09:47:40,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
14: [2022-11-26 09:47:40,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 09:47:40,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 21: [2022-11-26 09:47:40,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:47:40,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 21: [2022-11-26 09:47:40,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 5: [2022-11-26 09:47:40,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 21: [2022-11-26 09:47:40,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 5: [2022-11-26 09:47:40,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 2: [2022-11-26 09:47:40,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:47:40,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 09:47:40,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 29: [2022-11-26 09:47:40,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
29: [2022-11-26 09:47:40,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 09:47:40,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 15: [2022-11-26 09:47:40,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:47:40,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 09:47:40,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 23: [2022-11-26 09:47:40,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:47:40,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 09:47:40,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 28: [2022-11-26 09:47:40,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 09:47:40,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 19: [2022-11-26 09:47:40,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
19: [2022-11-26 09:47:40,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 27: [2022-11-26 09:47:40,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 19: [2022-11-26 09:47:40,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 27: [2022-11-26 09:47:40,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 09:47:40,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 17: [2022-11-26 09:47:40,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 09:47:40,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 09:47:40,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 7: [2022-11-26 09:47:40,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:47:40,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 09:47:40,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 30: [2022-11-26 09:47:40,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-26 09:47:40,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 09:47:40,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 18: [2022-11-26 09:47:40,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 09:47:40,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 09:47:40,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 13: [2022-11-26 09:47:40,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:47:40,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 09:47:40,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 22: [2022-11-26 09:47:40,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 09:47:40,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 09:47:40,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 16: [2022-11-26 09:47:40,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
16: [2022-11-26 09:47:40,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 09:47:40,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 23: [2022-11-26 09:47:40,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:47:40,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 23: [2022-11-26 09:47:40,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 11: [2022-11-26 09:47:40,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 09:47:40,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 23: [2022-11-26 09:47:40,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 28: [2022-11-26 09:47:40,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 14: [2022-11-26 09:47:40,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:47:40,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 09:47:40,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
12: [2022-11-26 09:47:40,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:47:40,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 09:47:40,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 9: [2022-11-26 09:47:40,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:47:40,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 09:47:40,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 9: [2022-11-26 09:47:40,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:47:40,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:47:40,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 2: [2022-11-26 09:47:40,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 9: [2022-11-26 09:47:40,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 2: [2022-11-26 09:47:40,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
1: [2022-11-26 09:47:40,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:47:40,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step70000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 09:47:40,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 0: successfully saved checkpoint at iteration 70000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2656.03 31: iteration 70010/ 173500 | consumed samples: 17922560 | consumed tokens: 36705402880 | elapsed time per iteration (s): 1.05 | learning rate: 1.385E-04 | global batch size: 256 | lm loss: 2.026032E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.278 | TFLOPs: 14.72 | 31: iteration 70020/ 173500 | consumed samples: 17925120 | consumed tokens: 36710645760 | elapsed time per iteration (s): 0.79 | learning rate: 1.385E-04 | global batch size: 256 | lm loss: 2.002803E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.380 | TFLOPs: 19.62 | 31: iteration 70030/ 173500 | consumed samples: 17927680 | consumed tokens: 36715888640 | elapsed time per iteration (s): 0.77 | learning rate: 1.385E-04 | global batch size: 256 | lm loss: 2.049086E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.509 | TFLOPs: 20.12 | 31: iteration 70040/ 173500 | consumed samples: 17930240 | consumed tokens: 36721131520 | elapsed time per iteration (s): 0.81 | learning rate: 1.384E-04 | global batch size: 256 | lm loss: 2.008282E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.177 | TFLOPs: 
19.13 | 31: iteration 70050/ 173500 | consumed samples: 17932800 | consumed tokens: 36726374400 | elapsed time per iteration (s): 0.79 | learning rate: 1.384E-04 | global batch size: 256 | lm loss: 1.999646E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.897 | TFLOPs: 19.72 | 31: iteration 70060/ 173500 | consumed samples: 17935360 | consumed tokens: 36731617280 | elapsed time per iteration (s): 0.81 | learning rate: 1.384E-04 | global batch size: 256 | lm loss: 1.981574E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.115 | TFLOPs: 19.00 | 31: iteration 70070/ 173500 | consumed samples: 17937920 | consumed tokens: 36736860160 | elapsed time per iteration (s): 0.81 | learning rate: 1.384E-04 | global batch size: 256 | lm loss: 2.041516E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.403 | TFLOPs: 19.14 | 31: iteration 70080/ 173500 | consumed samples: 17940480 | consumed tokens: 36742103040 | elapsed time per iteration (s): 0.78 | learning rate: 1.384E-04 | global batch size: 256 | lm loss: 2.040515E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.439 | TFLOPs: 19.93 | 31: iteration 70090/ 173500 | consumed samples: 17943040 | consumed tokens: 36747345920 | elapsed time per iteration (s): 0.86 | learning rate: 1.384E-04 | global batch size: 256 | lm loss: 2.018431E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.789 | TFLOPs: 17.95 | 31: iteration 70100/ 173500 | consumed samples: 17945600 | consumed tokens: 36752588800 | elapsed time per iteration (s): 0.85 | learning rate: 1.383E-04 | global batch size: 256 | lm loss: 2.023465E+00 | grad norm: 0.136 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.992 | TFLOPs: 18.27 | 31: iteration 70110/ 173500 | consumed samples: 17948160 | consumed tokens: 36757831680 | elapsed time per iteration (s): 0.81 | learning rate: 1.383E-04 | global batch size: 256 | lm loss: 2.006726E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.474 | TFLOPs: 19.02 | 31: iteration 70120/ 173500 | consumed samples: 17950720 | consumed tokens: 36763074560 | elapsed time per iteration (s): 0.80 | learning rate: 1.383E-04 | global batch size: 256 | lm loss: 2.011489E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.273 | TFLOPs: 19.44 | 31: iteration 70130/ 173500 | consumed samples: 17953280 | consumed tokens: 36768317440 | elapsed time per iteration (s): 0.84 | learning rate: 1.383E-04 | global batch size: 256 | lm loss: 2.051906E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.379 | TFLOPs: 18.47 | 31: iteration 70140/ 173500 | consumed samples: 17955840 | consumed tokens: 36773560320 | elapsed time per iteration (s): 0.84 | learning rate: 1.383E-04 | global batch size: 256 | lm loss: 2.022068E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.007 | TFLOPs: 18.33 | 31: iteration 70150/ 173500 | consumed samples: 17958400 | consumed tokens: 36778803200 | elapsed time per iteration (s): 0.83 | learning rate: 1.383E-04 | global batch size: 256 | lm loss: 2.037626E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.327 | TFLOPs: 18.71 | 31: iteration 70160/ 173500 | consumed samples: 17960960 | consumed tokens: 36784046080 | elapsed time per iteration (s): 0.79 | 
learning rate: 1.383E-04 | global batch size: 256 | lm loss: 2.032536E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.181 | TFLOPs: 19.67 | 31: iteration 70170/ 173500 | consumed samples: 17963520 | consumed tokens: 36789288960 | elapsed time per iteration (s): 0.81 | learning rate: 1.382E-04 | global batch size: 256 | lm loss: 2.034591E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.580 | TFLOPs: 19.09 | 31: iteration 70180/ 173500 | consumed samples: 17966080 | consumed tokens: 36794531840 | elapsed time per iteration (s): 0.84 | learning rate: 1.382E-04 | global batch size: 256 | lm loss: 2.036700E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.337 | TFLOPs: 18.41 | 31: iteration 70190/ 173500 | consumed samples: 17968640 | consumed tokens: 36799774720 | elapsed time per iteration (s): 0.80 | learning rate: 1.382E-04 | global batch size: 256 | lm loss: 2.031237E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.728 | TFLOPs: 19.28 | 31: iteration 70200/ 173500 | consumed samples: 17971200 | consumed tokens: 36805017600 | elapsed time per iteration (s): 0.85 | learning rate: 1.382E-04 | global batch size: 256 | lm loss: 2.024602E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.427 | TFLOPs: 18.18 | 31: iteration 70210/ 173500 | consumed samples: 17973760 | consumed tokens: 36810260480 | elapsed time per iteration (s): 0.85 | learning rate: 1.382E-04 | global batch size: 256 | lm loss: 1.980031E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.304 | TFLOPs: 18.17 | 31: iteration 70220/ 
173500 | consumed samples: 17976320 | consumed tokens: 36815503360 | elapsed time per iteration (s): 3.73 | learning rate: 1.382E-04 | global batch size: 256 | lm loss: 2.013360E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 68.700 | TFLOPs: 4.16 | 31: iteration 70230/ 173500 | consumed samples: 17978880 | consumed tokens: 36820746240 | elapsed time per iteration (s): 0.87 | learning rate: 1.381E-04 | global batch size: 256 | lm loss: 2.018768E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.838 | TFLOPs: 17.72 | 31: iteration 70240/ 173500 | consumed samples: 17981440 | consumed tokens: 36825989120 | elapsed time per iteration (s): 0.80 | learning rate: 1.381E-04 | global batch size: 256 | lm loss: 2.008293E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.569 | TFLOPs: 19.33 | 31: iteration 70250/ 173500 | consumed samples: 17984000 | consumed tokens: 36831232000 | elapsed time per iteration (s): 0.82 | learning rate: 1.381E-04 | global batch size: 256 | lm loss: 2.025692E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.307 | TFLOPs: 18.83 | 31: iteration 70260/ 173500 | consumed samples: 17986560 | consumed tokens: 36836474880 | elapsed time per iteration (s): 0.79 | learning rate: 1.381E-04 | global batch size: 256 | lm loss: 2.028038E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.053 | TFLOPs: 19.48 | 31: iteration 70270/ 173500 | consumed samples: 17989120 | consumed tokens: 36841717760 | elapsed time per iteration (s): 0.84 | learning rate: 1.381E-04 | global batch size: 256 | lm loss: 2.024588E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 304.762 | TFLOPs: 18.44 | 31: iteration 70280/ 173500 | consumed samples: 17991680 | consumed tokens: 36846960640 | elapsed time per iteration (s): 0.83 | learning rate: 1.381E-04 | global batch size: 256 | lm loss: 2.042200E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.026 | TFLOPs: 18.70 | 31: iteration 70290/ 173500 | consumed samples: 17994240 | consumed tokens: 36852203520 | elapsed time per iteration (s): 0.84 | learning rate: 1.380E-04 | global batch size: 256 | lm loss: 2.025390E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.431 | TFLOPs: 18.54 | 31: iteration 70300/ 173500 | consumed samples: 17996800 | consumed tokens: 36857446400 | elapsed time per iteration (s): 0.81 | learning rate: 1.380E-04 | global batch size: 256 | lm loss: 2.027797E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.224 | TFLOPs: 19.01 | 31: iteration 70310/ 173500 | consumed samples: 17999360 | consumed tokens: 36862689280 | elapsed time per iteration (s): 0.80 | learning rate: 1.380E-04 | global batch size: 256 | lm loss: 2.054191E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.037 | TFLOPs: 19.30 | 31: iteration 70320/ 173500 | consumed samples: 18001920 | consumed tokens: 36867932160 | elapsed time per iteration (s): 0.82 | learning rate: 1.380E-04 | global batch size: 256 | lm loss: 2.012270E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.628 | TFLOPs: 18.91 | 31: iteration 70330/ 173500 | consumed samples: 18004480 | consumed tokens: 36873175040 | elapsed time per iteration (s): 0.80 | learning rate: 
1.380E-04 | global batch size: 256 | lm loss: 2.019839E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.577 | TFLOPs: 19.33 | 31: iteration 70340/ 173500 | consumed samples: 18007040 | consumed tokens: 36878417920 | elapsed time per iteration (s): 0.84 | learning rate: 1.380E-04 | global batch size: 256 | lm loss: 2.026813E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.006 | TFLOPs: 18.39 | 31: iteration 70350/ 173500 | consumed samples: 18009600 | consumed tokens: 36883660800 | elapsed time per iteration (s): 0.81 | learning rate: 1.380E-04 | global batch size: 256 | lm loss: 2.041607E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.101 | TFLOPs: 19.18 | 31: iteration 70360/ 173500 | consumed samples: 18012160 | consumed tokens: 36888903680 | elapsed time per iteration (s): 0.86 | learning rate: 1.379E-04 | global batch size: 256 | lm loss: 2.032557E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.021 | TFLOPs: 17.91 | 31: iteration 70370/ 173500 | consumed samples: 18014720 | consumed tokens: 36894146560 | elapsed time per iteration (s): 0.82 | learning rate: 1.379E-04 | global batch size: 256 | lm loss: 2.006147E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.272 | TFLOPs: 18.83 | 31: iteration 70380/ 173500 | consumed samples: 18017280 | consumed tokens: 36899389440 | elapsed time per iteration (s): 0.83 | learning rate: 1.379E-04 | global batch size: 256 | lm loss: 2.022371E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.406 | TFLOPs: 18.66 | 31: iteration 70390/ 173500 | 
consumed samples: 18019840 | consumed tokens: 36904632320 | elapsed time per iteration (s): 0.79 | learning rate: 1.379E-04 | global batch size: 256 | lm loss: 2.037575E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.133 | TFLOPs: 19.61 | 31: iteration 70400/ 173500 | consumed samples: 18022400 | consumed tokens: 36909875200 | elapsed time per iteration (s): 0.85 | learning rate: 1.379E-04 | global batch size: 256 | lm loss: 2.037667E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.220 | TFLOPs: 18.22 | 31: iteration 70410/ 173500 | consumed samples: 18024960 | consumed tokens: 36915118080 | elapsed time per iteration (s): 1.81 | learning rate: 1.379E-04 | global batch size: 256 | lm loss: 2.032565E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 141.255 | TFLOPs: 8.55 | 31: iteration 70420/ 173500 | consumed samples: 18027520 | consumed tokens: 36920360960 | elapsed time per iteration (s): 0.80 | learning rate: 1.378E-04 | global batch size: 256 | lm loss: 2.036436E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.077 | TFLOPs: 19.24 | 31: iteration 70430/ 173500 | consumed samples: 18030080 | consumed tokens: 36925603840 | elapsed time per iteration (s): 0.75 | learning rate: 1.378E-04 | global batch size: 256 | lm loss: 2.051894E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.183 | TFLOPs: 20.70 | 31: iteration 70440/ 173500 | consumed samples: 18032640 | consumed tokens: 36930846720 | elapsed time per iteration (s): 1.00 | learning rate: 1.378E-04 | global batch size: 256 | lm loss: 2.020735E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 257.013 | TFLOPs: 15.55 | 31: iteration 70450/ 173500 | consumed samples: 18035200 | consumed tokens: 36936089600 | elapsed time per iteration (s): 0.75 | learning rate: 1.378E-04 | global batch size: 256 | lm loss: 2.016646E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.632 | TFLOPs: 20.61 | 31: iteration 70460/ 173500 | consumed samples: 18037760 | consumed tokens: 36941332480 | elapsed time per iteration (s): 0.80 | learning rate: 1.378E-04 | global batch size: 256 | lm loss: 1.988535E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.839 | TFLOPs: 19.29 | 31: iteration 70470/ 173500 | consumed samples: 18040320 | consumed tokens: 36946575360 | elapsed time per iteration (s): 0.82 | learning rate: 1.378E-04 | global batch size: 256 | lm loss: 2.020521E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.132 | TFLOPs: 18.88 | 31: iteration 70480/ 173500 | consumed samples: 18042880 | consumed tokens: 36951818240 | elapsed time per iteration (s): 0.81 | learning rate: 1.378E-04 | global batch size: 256 | lm loss: 2.010351E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.156 | TFLOPs: 19.19 | 31: iteration 70490/ 173500 | consumed samples: 18045440 | consumed tokens: 36957061120 | elapsed time per iteration (s): 0.71 | learning rate: 1.377E-04 | global batch size: 256 | lm loss: 2.016340E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 360.953 | TFLOPs: 21.84 | 31: iteration 70500/ 173500 | consumed samples: 18048000 | consumed tokens: 36962304000 | elapsed time per iteration (s): 0.80 | learning rate: 1.377E-04 | global batch 
size: 256 | lm loss: 2.039175E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.198 | TFLOPs: 19.43 | 31: iteration 70510/ 173500 | consumed samples: 18050560 | consumed tokens: 36967546880 | elapsed time per iteration (s): 0.73 | learning rate: 1.377E-04 | global batch size: 256 | lm loss: 2.015190E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.431 | TFLOPs: 21.20 | 31: iteration 70520/ 173500 | consumed samples: 18053120 | consumed tokens: 36972789760 | elapsed time per iteration (s): 0.77 | learning rate: 1.377E-04 | global batch size: 256 | lm loss: 2.012514E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.739 | TFLOPs: 20.01 | 31: iteration 70530/ 173500 | consumed samples: 18055680 | consumed tokens: 36978032640 | elapsed time per iteration (s): 0.82 | learning rate: 1.377E-04 | global batch size: 256 | lm loss: 2.032890E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.574 | TFLOPs: 18.97 | 31: iteration 70540/ 173500 | consumed samples: 18058240 | consumed tokens: 36983275520 | elapsed time per iteration (s): 0.81 | learning rate: 1.377E-04 | global batch size: 256 | lm loss: 2.014493E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.018 | TFLOPs: 19.18 | 31: iteration 70550/ 173500 | consumed samples: 18060800 | consumed tokens: 36988518400 | elapsed time per iteration (s): 0.78 | learning rate: 1.376E-04 | global batch size: 256 | lm loss: 2.012593E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.087 | TFLOPs: 19.97 | 31: iteration 70560/ 173500 | consumed samples: 18063360 | 
consumed tokens: 36993761280 | elapsed time per iteration (s): 0.78 | learning rate: 1.376E-04 | global batch size: 256 | lm loss: 2.009529E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.854 | TFLOPs: 19.83 | 31: iteration 70570/ 173500 | consumed samples: 18065920 | consumed tokens: 36999004160 | elapsed time per iteration (s): 0.83 | learning rate: 1.376E-04 | global batch size: 256 | lm loss: 2.014476E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.122 | TFLOPs: 18.64 | 31: iteration 70580/ 173500 | consumed samples: 18068480 | consumed tokens: 37004247040 | elapsed time per iteration (s): 0.82 | learning rate: 1.376E-04 | global batch size: 256 | lm loss: 2.039131E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.007 | TFLOPs: 18.88 | 31: iteration 70590/ 173500 | consumed samples: 18071040 | consumed tokens: 37009489920 | elapsed time per iteration (s): 0.81 | learning rate: 1.376E-04 | global batch size: 256 | lm loss: 2.003905E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.193 | TFLOPs: 19.07 | 31: iteration 70600/ 173500 | consumed samples: 18073600 | consumed tokens: 37014732800 | elapsed time per iteration (s): 0.80 | learning rate: 1.376E-04 | global batch size: 256 | lm loss: 2.038865E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.120 | TFLOPs: 19.37 | 31: iteration 70610/ 173500 | consumed samples: 18076160 | consumed tokens: 37019975680 | elapsed time per iteration (s): 0.77 | learning rate: 1.375E-04 | global batch size: 256 | lm loss: 2.030384E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 332.119 | TFLOPs: 20.09 | 31: iteration 70620/ 173500 | consumed samples: 18078720 | consumed tokens: 37025218560 | elapsed time per iteration (s): 0.81 | learning rate: 1.375E-04 | global batch size: 256 | lm loss: 2.020350E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.107 | TFLOPs: 19.18 | 31: iteration 70630/ 173500 | consumed samples: 18081280 | consumed tokens: 37030461440 | elapsed time per iteration (s): 0.82 | learning rate: 1.375E-04 | global batch size: 256 | lm loss: 2.032072E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.902 | TFLOPs: 18.81 | 31: iteration 70640/ 173500 | consumed samples: 18083840 | consumed tokens: 37035704320 | elapsed time per iteration (s): 0.80 | learning rate: 1.375E-04 | global batch size: 256 | lm loss: 2.039381E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.317 | TFLOPs: 19.26 | 31: iteration 70650/ 173500 | consumed samples: 18086400 | consumed tokens: 37040947200 | elapsed time per iteration (s): 0.91 | learning rate: 1.375E-04 | global batch size: 256 | lm loss: 2.031668E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 279.813 | TFLOPs: 16.93 | 31: iteration 70660/ 173500 | consumed samples: 18088960 | consumed tokens: 37046190080 | elapsed time per iteration (s): 0.92 | learning rate: 1.375E-04 | global batch size: 256 | lm loss: 2.047924E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 279.279 | TFLOPs: 16.90 | 31: iteration 70670/ 173500 | consumed samples: 18091520 | consumed tokens: 37051432960 | elapsed time per iteration (s): 0.90 | learning rate: 1.375E-04 | global batch size: 256 | lm loss: 
2.025280E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.669 | TFLOPs: 17.22 | 31: iteration 70680/ 173500 | consumed samples: 18094080 | consumed tokens: 37056675840 | elapsed time per iteration (s): 0.79 | learning rate: 1.374E-04 | global batch size: 256 | lm loss: 2.043084E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.502 | TFLOPs: 19.57 | 31: iteration 70690/ 173500 | consumed samples: 18096640 | consumed tokens: 37061918720 | elapsed time per iteration (s): 1.03 | learning rate: 1.374E-04 | global batch size: 256 | lm loss: 2.011968E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.070 | TFLOPs: 15.07 | 31: iteration 70700/ 173500 | consumed samples: 18099200 | consumed tokens: 37067161600 | elapsed time per iteration (s): 0.88 | learning rate: 1.374E-04 | global batch size: 256 | lm loss: 2.051272E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.984 | TFLOPs: 17.54 | 31: iteration 70710/ 173500 | consumed samples: 18101760 | consumed tokens: 37072404480 | elapsed time per iteration (s): 0.78 | learning rate: 1.374E-04 | global batch size: 256 | lm loss: 2.023008E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.470 | TFLOPs: 19.87 | 31: iteration 70720/ 173500 | consumed samples: 18104320 | consumed tokens: 37077647360 | elapsed time per iteration (s): 0.81 | learning rate: 1.374E-04 | global batch size: 256 | lm loss: 2.015082E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.785 | TFLOPs: 19.16 | 31: iteration 70730/ 173500 | consumed samples: 18106880 | consumed tokens: 
37082890240 | elapsed time per iteration (s): 0.83 | learning rate: 1.374E-04 | global batch size: 256 | lm loss: 2.021283E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.051 | TFLOPs: 18.76 | 31: iteration 70740/ 173500 | consumed samples: 18109440 | consumed tokens: 37088133120 | elapsed time per iteration (s): 0.86 | learning rate: 1.373E-04 | global batch size: 256 | lm loss: 2.012651E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.450 | TFLOPs: 18.06 | 31: iteration 70750/ 173500 | consumed samples: 18112000 | consumed tokens: 37093376000 | elapsed time per iteration (s): 0.82 | learning rate: 1.373E-04 | global batch size: 256 | lm loss: 1.992143E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.751 | TFLOPs: 18.98 | 31: iteration 70760/ 173500 | consumed samples: 18114560 | consumed tokens: 37098618880 | elapsed time per iteration (s): 0.83 | learning rate: 1.373E-04 | global batch size: 256 | lm loss: 2.005238E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.984 | TFLOPs: 18.75 | 31: iteration 70770/ 173500 | consumed samples: 18117120 | consumed tokens: 37103861760 | elapsed time per iteration (s): 0.79 | learning rate: 1.373E-04 | global batch size: 256 | lm loss: 2.012957E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.059 | TFLOPs: 19.54 | 31: iteration 70780/ 173500 | consumed samples: 18119680 | consumed tokens: 37109104640 | elapsed time per iteration (s): 0.78 | learning rate: 1.373E-04 | global batch size: 256 | lm loss: 2.031205E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 327.258 | TFLOPs: 19.80 | 31: iteration 70790/ 173500 | consumed samples: 18122240 | consumed tokens: 37114347520 | elapsed time per iteration (s): 0.77 | learning rate: 1.373E-04 | global batch size: 256 | lm loss: 2.020889E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.133 | TFLOPs: 20.21 | 31: iteration 70800/ 173500 | consumed samples: 18124800 | consumed tokens: 37119590400 | elapsed time per iteration (s): 0.81 | learning rate: 1.372E-04 | global batch size: 256 | lm loss: 2.015341E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.856 | TFLOPs: 19.23 | 31: iteration 70810/ 173500 | consumed samples: 18127360 | consumed tokens: 37124833280 | elapsed time per iteration (s): 0.80 | learning rate: 1.372E-04 | global batch size: 256 | lm loss: 2.037399E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.575 | TFLOPs: 19.27 | 31: iteration 70820/ 173500 | consumed samples: 18129920 | consumed tokens: 37130076160 | elapsed time per iteration (s): 0.83 | learning rate: 1.372E-04 | global batch size: 256 | lm loss: 2.057788E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.115 | TFLOPs: 18.58 | 31: iteration 70830/ 173500 | consumed samples: 18132480 | consumed tokens: 37135319040 | elapsed time per iteration (s): 0.78 | learning rate: 1.372E-04 | global batch size: 256 | lm loss: 2.043290E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.570 | TFLOPs: 19.82 | 31: iteration 70840/ 173500 | consumed samples: 18135040 | consumed tokens: 37140561920 | elapsed time per iteration (s): 0.85 | learning rate: 1.372E-04 | global batch size: 256 | lm loss: 2.015206E+00 | grad 
norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.174 | TFLOPs: 18.28 | 31: iteration 70850/ 173500 | consumed samples: 18137600 | consumed tokens: 37145804800 | elapsed time per iteration (s): 0.86 | learning rate: 1.372E-04 | global batch size: 256 | lm loss: 2.027100E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.348 | TFLOPs: 18.11 | 31: iteration 70860/ 173500 | consumed samples: 18140160 | consumed tokens: 37151047680 | elapsed time per iteration (s): 0.80 | learning rate: 1.372E-04 | global batch size: 256 | lm loss: 2.029350E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.900 | TFLOPs: 19.29 | 31: iteration 70870/ 173500 | consumed samples: 18142720 | consumed tokens: 37156290560 | elapsed time per iteration (s): 0.79 | learning rate: 1.371E-04 | global batch size: 256 | lm loss: 2.044640E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.522 | TFLOPs: 19.51 | 31: iteration 70880/ 173500 | consumed samples: 18145280 | consumed tokens: 37161533440 | elapsed time per iteration (s): 0.79 | learning rate: 1.371E-04 | global batch size: 256 | lm loss: 1.997384E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.574 | TFLOPs: 19.70 | 31: iteration 70890/ 173500 | consumed samples: 18147840 | consumed tokens: 37166776320 | elapsed time per iteration (s): 0.78 | learning rate: 1.371E-04 | global batch size: 256 | lm loss: 2.058765E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.254 | TFLOPs: 19.86 | 31: iteration 70900/ 173500 | consumed samples: 18150400 | consumed tokens: 37172019200 | elapsed time 
per iteration (s): 0.78 | learning rate: 1.371E-04 | global batch size: 256 | lm loss: 2.001095E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.541 | TFLOPs: 19.82 | 31: iteration 70910/ 173500 | consumed samples: 18152960 | consumed tokens: 37177262080 | elapsed time per iteration (s): 0.79 | learning rate: 1.371E-04 | global batch size: 256 | lm loss: 2.009833E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.147 | TFLOPs: 19.55 | 31: iteration 70920/ 173500 | consumed samples: 18155520 | consumed tokens: 37182504960 | elapsed time per iteration (s): 0.79 | learning rate: 1.371E-04 | global batch size: 256 | lm loss: 2.036354E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.851 | TFLOPs: 19.71 | 31: iteration 70930/ 173500 | consumed samples: 18158080 | consumed tokens: 37187747840 | elapsed time per iteration (s): 0.83 | learning rate: 1.370E-04 | global batch size: 256 | lm loss: 2.008201E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.097 | TFLOPs: 18.58 | 31: iteration 70940/ 173500 | consumed samples: 18160640 | consumed tokens: 37192990720 | elapsed time per iteration (s): 0.75 | learning rate: 1.370E-04 | global batch size: 256 | lm loss: 2.027537E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.167 | TFLOPs: 20.64 | 31: iteration 70950/ 173500 | consumed samples: 18163200 | consumed tokens: 37198233600 | elapsed time per iteration (s): 0.77 | learning rate: 1.370E-04 | global batch size: 256 | lm loss: 2.039149E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.820 | TFLOPs: 
20.13 |
31: iteration 70960/ 173500 | consumed samples: 18165760 | consumed tokens: 37203476480 | elapsed time per iteration (s): 0.75 | learning rate: 1.370E-04 | global batch size: 256 | lm loss: 2.031195E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.338 | TFLOPs: 20.77 |
31: iteration 70970/ 173500 | consumed samples: 18168320 | consumed tokens: 37208719360 | elapsed time per iteration (s): 0.77 | learning rate: 1.370E-04 | global batch size: 256 | lm loss: 2.044395E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.500 | TFLOPs: 20.24 |
31: iteration 70980/ 173500 | consumed samples: 18170880 | consumed tokens: 37213962240 | elapsed time per iteration (s): 0.81 | learning rate: 1.370E-04 | global batch size: 256 | lm loss: 2.024895E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.494 | TFLOPs: 19.15 |
31: iteration 70990/ 173500 | consumed samples: 18173440 | consumed tokens: 37219205120 | elapsed time per iteration (s): 0.84 | learning rate: 1.370E-04 | global batch size: 256 | lm loss: 1.996209E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.284 | TFLOPs: 18.53 |
31: iteration 71000/ 173500 | consumed samples: 18176000 | consumed tokens: 37224448000 | elapsed time per iteration (s): 0.89 | learning rate: 1.369E-04 | global batch size: 256 | lm loss: 2.036256E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.446 | TFLOPs: 17.33 |
31: -------------------------------------------------------------------------------------------
31: valid loss at iteration 71000 | lm loss value: 1.977212E+00 | lm loss PPL: 7.222578E+00 |
31: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 71000 to checkpoints_1b1long
0: [2022-11-26 10:01:54,601] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step71000 is begin to save!
0: [2022-11-26 10:01:54,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_01-model_00-model_states.pt...
0: [2022-11-26 10:01:54,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_01-model_00-model_states.pt.
0: [2022-11-26 10:01:54,935] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_03-model_00-model_states.pt...
0: [2022-11-26 10:01:55,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_03-model_00-model_states.pt.
0: [2022-11-26 10:01:55,051] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_04-model_00-model_states.pt...
0: [2022-11-26 10:01:55,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_04-model_00-model_states.pt.
0: [2022-11-26 10:01:55,127] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_05-model_00-model_states.pt...
0: [2022-11-26 10:01:55,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_05-model_00-model_states.pt.
0: [2022-11-26 10:01:55,205] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_06-model_00-model_states.pt...
0: [2022-11-26 10:01:55,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_06-model_00-model_states.pt.
0: [2022-11-26 10:01:55,287] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_07-model_00-model_states.pt... 0: [2022-11-26 10:01:55,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_07-model_00-model_states.pt. 0: [2022-11-26 10:01:55,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_08-model_00-model_states.pt... 0: [2022-11-26 10:01:55,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_08-model_00-model_states.pt. 0: [2022-11-26 10:01:55,441] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_09-model_00-model_states.pt... 0: [2022-11-26 10:01:55,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_09-model_00-model_states.pt. 0: [2022-11-26 10:01:55,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_10-model_00-model_states.pt... 0: [2022-11-26 10:01:55,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_10-model_00-model_states.pt. 0: [2022-11-26 10:01:55,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_11-model_00-model_states.pt... 0: [2022-11-26 10:01:55,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_11-model_00-model_states.pt. 0: [2022-11-26 10:01:55,678] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_12-model_00-model_states.pt... 0: [2022-11-26 10:01:55,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_12-model_00-model_states.pt. 
0: [2022-11-26 10:01:55,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_13-model_00-model_states.pt... 0: [2022-11-26 10:01:55,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_13-model_00-model_states.pt. 0: [2022-11-26 10:01:55,838] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_14-model_00-model_states.pt... 0: [2022-11-26 10:01:55,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_14-model_00-model_states.pt. 0: [2022-11-26 10:01:55,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_15-model_00-model_states.pt... 0: [2022-11-26 10:01:55,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_15-model_00-model_states.pt. 0: [2022-11-26 10:01:55,995] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_16-model_00-model_states.pt... 0: [2022-11-26 10:01:56,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_16-model_00-model_states.pt. 0: [2022-11-26 10:01:56,070] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_17-model_00-model_states.pt... 0: [2022-11-26 10:01:56,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_17-model_00-model_states.pt. 0: [2022-11-26 10:01:56,148] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_18-model_00-model_states.pt... 0: [2022-11-26 10:01:56,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_18-model_00-model_states.pt. 
0: [2022-11-26 10:01:56,228] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_19-model_00-model_states.pt... 0: [2022-11-26 10:01:56,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_19-model_00-model_states.pt. 0: [2022-11-26 10:01:56,304] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_20-model_00-model_states.pt... 0: [2022-11-26 10:01:56,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_20-model_00-model_states.pt. 0: [2022-11-26 10:01:56,382] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_21-model_00-model_states.pt... 0: [2022-11-26 10:01:56,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_21-model_00-model_states.pt. 0: [2022-11-26 10:01:56,460] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_22-model_00-model_states.pt... 0: [2022-11-26 10:01:56,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_22-model_00-model_states.pt. 0: [2022-11-26 10:01:56,536] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_23-model_00-model_states.pt... 0: [2022-11-26 10:01:56,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_23-model_00-model_states.pt. 0: [2022-11-26 10:01:56,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_24-model_00-model_states.pt... 0: [2022-11-26 10:01:56,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_24-model_00-model_states.pt. 
0: [2022-11-26 10:01:56,750] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_25-model_00-model_states.pt... 0: [2022-11-26 10:01:56,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_25-model_00-model_states.pt. 0: [2022-11-26 10:01:56,823] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_26-model_00-model_states.pt... 0: [2022-11-26 10:01:56,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_26-model_00-model_states.pt. 0: [2022-11-26 10:01:56,899] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_27-model_00-model_states.pt... 0: [2022-11-26 10:01:56,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_27-model_00-model_states.pt. 0: [2022-11-26 10:01:56,976] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_28-model_00-model_states.pt... 0: [2022-11-26 10:01:57,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_28-model_00-model_states.pt. 0: [2022-11-26 10:01:57,051] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/layer_30-model_00-model_states.pt... 0: [2022-11-26 10:01:57,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/layer_30-model_00-model_states.pt. 0: [2022-11-26 10:01:57,055] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step71000/mp_rank_00_model_states.pt 0: [2022-11-26 10:01:57,055] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/mp_rank_00_model_states.pt... 
0: [2022-11-26 10:01:57,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/mp_rank_00_model_states.pt. 31: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 
9: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 
16: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 
20: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
28: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 
22: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 
18: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 
27: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
4: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
10: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
13: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 
25: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 
31: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 
30: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 
26: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
6: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 
1: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
3: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 
28: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 
30: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 
6: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 
25: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 
26: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 
24: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:01:57,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:01:57,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:01:57,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 10:01:57,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 19: [2022-11-26 10:01:57,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:01:57,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 10:01:57,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 8: [2022-11-26 10:01:57,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 10:01:57,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 10:01:57,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 11: [2022-11-26 10:01:57,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:01:57,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 10:01:57,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 25: [2022-11-26 10:01:57,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:01:57,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 10:01:57,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 6: [2022-11-26 10:01:57,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:01:57,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 10:01:57,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 19: [2022-11-26 10:01:57,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 
19: [2022-11-26 10:01:57,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 10:01:57,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 4: [2022-11-26 10:01:57,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:01:57,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 10:01:57,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 5: [2022-11-26 10:01:57,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:01:57,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 10:01:57,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 20: [2022-11-26 10:01:57,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:01:57,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
20: [2022-11-26 10:01:57,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 30: [2022-11-26 10:01:57,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 20: [2022-11-26 10:01:57,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 30: [2022-11-26 10:01:57,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 1: [2022-11-26 10:01:57,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:01:57,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:01:57,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 10:01:57,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 1: [2022-11-26 10:01:57,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 5: [2022-11-26 10:01:57,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:01:57,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 
5: [2022-11-26 10:01:57,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 10:01:57,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 0: [2022-11-26 10:01:57,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:01:57,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:01:57,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 0: [2022-11-26 10:01:57,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 9: [2022-11-26 10:01:57,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:01:57,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 0: [2022-11-26 10:01:57,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 9: [2022-11-26 10:01:57,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 26: [2022-11-26 10:01:57,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:01:57,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 
26: [2022-11-26 10:01:57,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 10:01:57,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 29: [2022-11-26 10:01:57,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:01:57,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 10:01:57,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 22: [2022-11-26 10:01:57,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:01:57,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 9: [2022-11-26 10:01:57,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:01:57,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 9: [2022-11-26 10:01:57,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 10:01:57,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 6: [2022-11-26 10:01:57,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 10:01:57,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 30: [2022-11-26 10:01:57,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:01:57,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 17: [2022-11-26 10:01:57,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:01:57,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:01:57,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 10:01:57,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 17: [2022-11-26 10:01:57,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 10:01:57,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 23: [2022-11-26 10:01:57,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:01:57,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 17: [2022-11-26 10:01:57,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 
23: [2022-11-26 10:01:57,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 10:01:57,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 24: [2022-11-26 10:01:57,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:01:57,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 10:01:57,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 24: [2022-11-26 10:01:57,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:01:57,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 10:01:57,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 27: [2022-11-26 10:01:57,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:01:57,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
27: [2022-11-26 10:01:57,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 16: [2022-11-26 10:01:57,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:01:57,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 27: [2022-11-26 10:01:57,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 16: [2022-11-26 10:01:57,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 16: [2022-11-26 10:01:57,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 10:01:57,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 4: [2022-11-26 10:01:57,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:01:57,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 10:01:57,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 12: [2022-11-26 10:01:57,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 10:01:57,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 10:01:57,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 14: [2022-11-26 10:01:57,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:01:57,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 10:01:57,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 11: [2022-11-26 10:01:57,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:01:57,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 10:01:57,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 27: [2022-11-26 10:01:57,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:01:57,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 10:01:57,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 1: [2022-11-26 10:01:57,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 10:01:57,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 10:01:57,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 10: [2022-11-26 10:01:57,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:01:57,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:01:57,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 10:01:57,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 23: [2022-11-26 10:01:57,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 10:01:57,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 12: [2022-11-26 10:01:57,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:01:57,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
12: [2022-11-26 10:01:57,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 6: [2022-11-26 10:01:57,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 12: [2022-11-26 10:01:57,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 6: [2022-11-26 10:01:57,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 20: [2022-11-26 10:01:57,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:01:57,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 10:01:57,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 2: [2022-11-26 10:01:57,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:01:57,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 10:01:57,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 10: [2022-11-26 10:01:57,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-26 10:01:57,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 10:01:57,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 24: [2022-11-26 10:01:57,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:01:57,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 22: [2022-11-26 10:01:57,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:01:57,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 22: [2022-11-26 10:01:57,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 10:01:57,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 19: [2022-11-26 10:01:57,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:01:57,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:01:57,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
19: [2022-11-26 10:01:57,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 10:01:57,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 10:01:57,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 19: [2022-11-26 10:01:57,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 8: [2022-11-26 10:01:57,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:01:57,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 29: [2022-11-26 10:01:57,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:01:57,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 29: [2022-11-26 10:01:57,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 10:01:57,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 12: [2022-11-26 10:01:57,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 10:01:57,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 10: [2022-11-26 10:01:57,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:01:57,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 10: [2022-11-26 10:01:57,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 10:01:57,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 5: [2022-11-26 10:01:57,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:01:57,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:01:57,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 10:01:57,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 5: [2022-11-26 10:01:57,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 0: [2022-11-26 10:01:57,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:01:57,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
0: [2022-11-26 10:01:57,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 5: [2022-11-26 10:01:57,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 16: [2022-11-26 10:01:57,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 11: [2022-11-26 10:01:57,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:01:57,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 11: [2022-11-26 10:01:57,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 0: [2022-11-26 10:01:57,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 20: [2022-11-26 10:01:57,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:01:57,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 20: [2022-11-26 10:01:57,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 0: [2022-11-26 10:01:57,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:01:57,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 
0: [2022-11-26 10:01:57,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 10:01:57,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 17: [2022-11-26 10:01:57,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:01:57,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:01:57,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:01:57,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 14: [2022-11-26 10:01:57,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 10:01:57,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:01:57,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 14: [2022-11-26 10:01:57,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 
30: [2022-11-26 10:01:57,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 5: [2022-11-26 10:01:57,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:01:57,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 5: [2022-11-26 10:01:57,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 14: [2022-11-26 10:01:57,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 30: [2022-11-26 10:01:57,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 23: [2022-11-26 10:01:57,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:01:57,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 23: [2022-11-26 10:01:57,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 10:01:57,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 20: [2022-11-26 10:01:57,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-26 10:01:57,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 10:01:57,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 28: [2022-11-26 10:01:57,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:01:57,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:01:57,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:01:57,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:01:57,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 9: [2022-11-26 10:01:57,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 27: [2022-11-26 10:01:57,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 9: [2022-11-26 10:01:57,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 11: [2022-11-26 10:01:57,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
27: [2022-11-26 10:01:57,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 11: [2022-11-26 10:01:57,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 27: [2022-11-26 10:01:57,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 11: [2022-11-26 10:01:57,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 30: [2022-11-26 10:01:57,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:01:57,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 10:01:57,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 24: [2022-11-26 10:01:57,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:01:57,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 10:01:57,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 28: [2022-11-26 10:01:57,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 10:01:57,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 
22: [2022-11-26 10:01:57,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 28: [2022-11-26 10:01:57,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:01:57,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 28: [2022-11-26 10:01:57,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 10:01:57,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 26: [2022-11-26 10:01:57,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:01:57,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 10:01:57,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 23: [2022-11-26 10:01:57,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:01:57,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 10:01:57,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 0: [2022-11-26 10:01:57,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
4: [2022-11-26 10:01:57,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:01:57,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 7: [2022-11-26 10:01:57,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:01:57,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 26: [2022-11-26 10:01:57,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:01:57,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 10:01:57,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 26: [2022-11-26 10:01:57,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 10:01:57,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 26: [2022-11-26 10:01:57,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:01:57,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 10:01:57,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 
0: [2022-11-26 10:01:57,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:01:57,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 10:01:57,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 15: [2022-11-26 10:01:57,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:01:57,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 10:01:57,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 25: [2022-11-26 10:01:57,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:01:57,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:01:57,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 10:01:57,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 5: [2022-11-26 10:01:57,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:01:57,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 
5: [2022-11-26 10:01:57,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 25: [2022-11-26 10:01:57,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 5: [2022-11-26 10:01:57,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 1: [2022-11-26 10:01:57,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:01:57,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 10:01:57,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 6: [2022-11-26 10:01:57,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:01:57,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:01:57,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:01:57,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:01:57,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
6: [2022-11-26 10:01:57,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 7: [2022-11-26 10:01:57,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 21: [2022-11-26 10:01:57,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 10:01:57,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 10:01:57,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 7: [2022-11-26 10:01:57,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 21: [2022-11-26 10:01:57,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 21: [2022-11-26 10:01:57,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 21: [2022-11-26 10:01:57,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 6: [2022-11-26 10:01:57,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 2: [2022-11-26 10:01:57,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 10:01:57,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 10:01:57,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 4: [2022-11-26 10:01:57,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:01:57,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:01:57,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 16: [2022-11-26 10:01:57,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 4: [2022-11-26 10:01:57,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 16: [2022-11-26 10:01:57,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 14: [2022-11-26 10:01:57,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:01:57,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 10:01:57,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 14: [2022-11-26 10:01:57,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 12: [2022-11-26 10:01:57,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 14: [2022-11-26 10:01:57,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 17: [2022-11-26 10:01:57,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:01:57,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 10:01:57,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 1: [2022-11-26 10:01:57,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:01:57,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 10:01:57,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 28: [2022-11-26 10:01:57,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
28: [2022-11-26 10:01:57,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 10:01:57,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 9: [2022-11-26 10:01:57,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:01:57,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 10:01:57,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 29: [2022-11-26 10:01:57,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:01:57,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:01:57,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 10:01:57,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 29: [2022-11-26 10:01:57,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 10:01:57,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 3: [2022-11-26 10:01:57,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 10:01:57,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:01:57,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:01:57,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 10:01:57,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 19: [2022-11-26 10:01:57,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:01:57,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 10:01:57,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 19: [2022-11-26 10:01:57,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 3: [2022-11-26 10:01:57,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 3: [2022-11-26 10:01:57,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 19: [2022-11-26 10:01:57,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 7: [2022-11-26 10:01:57,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 10:01:57,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 25: [2022-11-26 10:01:57,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:01:57,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 25: [2022-11-26 10:01:57,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 10:01:57,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 11: [2022-11-26 10:01:57,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:01:57,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 10:01:57,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 15: [2022-11-26 10:01:57,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:01:57,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 10:01:57,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 10:01:57,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 10:01:57,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 15: [2022-11-26 10:01:57,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 22: [2022-11-26 10:01:57,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:01:57,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 10:01:57,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 10: [2022-11-26 10:01:57,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:01:57,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 10:01:57,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 13: [2022-11-26 10:01:57,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:01:57,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-26 10:01:57,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:01:57,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:01:57,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 10:01:57,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 10:01:57,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 13: [2022-11-26 10:01:57,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 10:01:57,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 13: [2022-11-26 10:01:57,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 10:01:57,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 13: [2022-11-26 10:01:57,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 31: [2022-11-26 10:01:57,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:01:57,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-26 10:01:57,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:01:57,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:01:57,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 10:01:57,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 10:01:57,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 10:01:57,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 10:01:57,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 31: [2022-11-26 10:01:57,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 31: [2022-11-26 10:01:57,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 31: [2022-11-26 10:01:57,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 18: [2022-11-26 10:01:57,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-26 10:01:57,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 10:01:57,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 10: [2022-11-26 10:01:57,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:01:57,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 10:01:57,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 25: [2022-11-26 10:01:57,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:01:57,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 10:01:57,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 26: [2022-11-26 10:01:57,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:01:57,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 10:01:57,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 20: [2022-11-26 10:01:57,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 
20: [2022-11-26 10:01:57,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 10:01:57,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 24: [2022-11-26 10:01:57,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:01:57,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 10:01:57,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 0: [2022-11-26 10:01:57,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 18: [2022-11-26 10:01:57,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:01:57,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 18: [2022-11-26 10:01:57,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 10:01:57,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 8: [2022-11-26 10:01:57,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 10:01:57,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 10:01:57,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 18: [2022-11-26 10:01:57,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:01:57,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 10:01:57,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 17: [2022-11-26 10:01:57,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:01:57,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 10:01:57,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 2: [2022-11-26 10:01:57,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:01:57,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 10:01:57,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 9: [2022-11-26 10:01:57,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 10:01:57,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 10:01:57,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 18: [2022-11-26 10:01:57,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:01:57,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 10:01:57,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 27: [2022-11-26 10:01:57,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:01:57,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 10:01:57,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 30: [2022-11-26 10:01:57,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:01:57,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
30: [2022-11-26 10:01:57,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 6: [2022-11-26 10:01:57,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 30: [2022-11-26 10:01:57,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 6: [2022-11-26 10:01:57,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 1: [2022-11-26 10:01:57,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:01:57,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 10:01:57,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 13: [2022-11-26 10:01:57,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:01:57,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 10:01:57,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 29: [2022-11-26 10:01:57,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-26 10:01:57,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 10:01:57,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 31: [2022-11-26 10:01:57,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:01:57,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 10:01:57,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 12: [2022-11-26 10:01:57,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:01:57,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 10:01:57,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 4: [2022-11-26 10:01:57,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:01:57,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 10:01:57,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 21: [2022-11-26 10:01:57,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
21: [2022-11-26 10:01:57,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 10:01:57,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 28: [2022-11-26 10:01:57,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 10:01:57,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 10:01:57,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 0: [2022-11-26 10:01:57,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:01:57,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 10:01:57,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 20: [2022-11-26 10:01:57,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:01:57,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
20: [2022-11-26 10:01:57,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 22: [2022-11-26 10:01:57,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 20: [2022-11-26 10:01:57,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 22: [2022-11-26 10:01:57,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 7: [2022-11-26 10:01:57,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:01:57,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 5: [2022-11-26 10:01:57,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:01:57,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 5: [2022-11-26 10:01:57,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 10:01:57,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 16: [2022-11-26 10:01:57,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 
16: [2022-11-26 10:01:57,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 10:01:57,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 14: [2022-11-26 10:01:57,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:01:57,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 10:01:57,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 19: [2022-11-26 10:01:57,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:01:57,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:01:57,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 19: [2022-11-26 10:01:57,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 24: [2022-11-26 10:01:57,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 19: [2022-11-26 10:01:57,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 15: [2022-11-26 10:01:57,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 10:01:57,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 10:01:57,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 26: [2022-11-26 10:01:57,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:01:57,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 10:01:57,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 23: [2022-11-26 10:01:57,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:01:57,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 10:01:57,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 2: [2022-11-26 10:01:57,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:01:57,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 10:01:57,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 11: [2022-11-26 10:01:57,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 10:01:57,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 10:01:57,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 18: [2022-11-26 10:01:57,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:01:57,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 10:01:57,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 8: [2022-11-26 10:01:57,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:01:57,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:01:57,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 3: [2022-11-26 10:01:57,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 8: [2022-11-26 10:01:57,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 3: [2022-11-26 10:01:57,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 25: [2022-11-26 10:01:57,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
25: [2022-11-26 10:01:57,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 10:01:57,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 30: [2022-11-26 10:01:57,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:01:57,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 10:01:57,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 10: [2022-11-26 10:01:57,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:01:57,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 10:01:57,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 6: [2022-11-26 10:01:57,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:01:57,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 10:01:57,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 9: [2022-11-26 10:01:57,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 10:01:57,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 10:01:57,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 17: [2022-11-26 10:01:57,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:01:57,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 10:01:57,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 27: [2022-11-26 10:01:57,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:01:57,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:01:57,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 19: [2022-11-26 10:01:57,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 27: [2022-11-26 10:01:57,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 19: [2022-11-26 10:01:57,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 21: [2022-11-26 10:01:57,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-26 10:01:57,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 10:01:57,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 28: [2022-11-26 10:01:57,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:01:57,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:01:57,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:01:57,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 10:01:57,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 12: [2022-11-26 10:01:57,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 10:01:57,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 23: [2022-11-26 10:01:57,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:01:57,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
23: [2022-11-26 10:01:57,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 22: [2022-11-26 10:01:57,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 23: [2022-11-26 10:01:57,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 22: [2022-11-26 10:01:57,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 1: [2022-11-26 10:01:57,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:01:57,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 10:01:57,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 31: [2022-11-26 10:01:57,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:01:57,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 10:01:57,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 8: [2022-11-26 10:01:57,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-26 10:01:57,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 10:01:57,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 20: [2022-11-26 10:01:57,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:01:57,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:01:57,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 20: [2022-11-26 10:01:57,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 2: [2022-11-26 10:01:57,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 20: [2022-11-26 10:01:57,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 5: [2022-11-26 10:01:57,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:01:57,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:01:57,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 10:01:57,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 
5: [2022-11-26 10:01:57,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 10:01:57,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 26: [2022-11-26 10:01:57,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:01:57,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 10:01:57,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 25: [2022-11-26 10:01:57,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:01:57,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 7: [2022-11-26 10:01:57,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:01:57,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 7: [2022-11-26 10:01:57,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 10:01:57,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 
28: [2022-11-26 10:01:57,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 10:01:57,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 14: [2022-11-26 10:01:57,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:01:57,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 10:01:57,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 0: [2022-11-26 10:01:57,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:01:57,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 10:01:57,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 15: [2022-11-26 10:01:57,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:01:57,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 10:01:57,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 11: [2022-11-26 10:01:57,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 10:01:57,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 18: [2022-11-26 10:01:57,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:01:57,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 18: [2022-11-26 10:01:57,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 10:01:57,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 17: [2022-11-26 10:01:57,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:01:57,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 10:01:57,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 10: [2022-11-26 10:01:57,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:01:57,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
10: [2022-11-26 10:01:57,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 29: [2022-11-26 10:01:57,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 10: [2022-11-26 10:01:57,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 29: [2022-11-26 10:01:57,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 24: [2022-11-26 10:01:57,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:01:57,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:01:57,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 16: [2022-11-26 10:01:57,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:01:57,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 9: [2022-11-26 10:01:57,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 16: [2022-11-26 10:01:57,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 24: [2022-11-26 10:01:57,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 
16: [2022-11-26 10:01:57,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 4: [2022-11-26 10:01:57,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:01:57,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 10:01:57,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 30: [2022-11-26 10:01:57,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:01:57,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 10:01:57,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 27: [2022-11-26 10:01:57,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:01:57,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 10:01:57,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 13: [2022-11-26 10:01:57,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 10:01:57,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 10:01:57,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 1: [2022-11-26 10:01:57,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:01:57,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:01:57,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 10:01:57,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 12: [2022-11-26 10:01:57,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 10:01:57,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 31: [2022-11-26 10:01:57,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:01:57,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 10:01:57,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 6: [2022-11-26 10:01:57,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-26 10:01:57,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 10:01:57,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 4: [2022-11-26 10:01:57,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:01:57,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 10:01:57,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 29: [2022-11-26 10:01:57,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:01:57,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 10:01:57,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 16: [2022-11-26 10:01:57,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:01:57,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 10:01:57,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 3: [2022-11-26 10:01:57,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 10:01:57,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 10:01:57,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 19: [2022-11-26 10:01:57,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:01:57,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 10:01:57,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 23: [2022-11-26 10:01:57,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:01:57,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 10:01:57,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 5: [2022-11-26 10:01:57,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:01:57,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 10:01:57,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 2: [2022-11-26 10:01:57,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 10:01:57,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 10:01:57,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 8: [2022-11-26 10:01:57,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:01:57,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 10:01:57,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 0: [2022-11-26 10:01:57,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:01:57,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 10:01:57,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 15: [2022-11-26 10:01:57,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:01:57,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 10:01:57,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 20: [2022-11-26 10:01:57,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
20: [2022-11-26 10:01:57,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 10:01:57,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 7: [2022-11-26 10:01:57,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:01:57,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:01:57,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 10:01:57,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 7: [2022-11-26 10:01:57,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 10:01:57,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 24: [2022-11-26 10:01:57,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:01:57,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 10:01:57,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 25: [2022-11-26 10:01:57,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
25: [2022-11-26 10:01:57,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 11: [2022-11-26 10:01:57,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:01:57,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 11: [2022-11-26 10:01:57,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 10:01:57,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 26: [2022-11-26 10:01:57,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:01:57,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 10:01:57,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 9: [2022-11-26 10:01:57,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:01:57,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 10:01:57,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 10: [2022-11-26 10:01:57,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-26 10:01:57,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 10:01:57,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 14: [2022-11-26 10:01:57,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:01:57,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 10:01:57,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 22: [2022-11-26 10:01:57,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:01:57,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 10:01:57,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:01:57,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 22: [2022-11-26 10:01:57,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 10:01:57,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 28: [2022-11-26 10:01:57,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
3: [2022-11-26 10:01:57,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:01:57,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 28: [2022-11-26 10:01:57,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 3: [2022-11-26 10:01:57,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 28: [2022-11-26 10:01:57,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 16: [2022-11-26 10:01:57,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:01:57,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 10:01:57,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 4: [2022-11-26 10:01:57,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:01:57,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 10:01:57,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 18: [2022-11-26 10:01:57,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
18: [2022-11-26 10:01:57,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 10:01:57,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 30: [2022-11-26 10:01:57,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:01:57,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 10:01:57,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 17: [2022-11-26 10:01:57,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:01:57,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 10:01:57,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 27: [2022-11-26 10:01:57,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:01:57,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 10:01:57,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 21: [2022-11-26 10:01:57,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-26 10:01:57,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 10:01:57,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 13: [2022-11-26 10:01:57,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:01:57,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 10:01:57,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 29: [2022-11-26 10:01:57,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:01:57,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 10:01:57,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 6: [2022-11-26 10:01:57,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:01:57,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 10:01:57,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 1: [2022-11-26 10:01:57,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
12: [2022-11-26 10:01:57,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:01:57,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 10:01:57,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 1: [2022-11-26 10:01:57,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 10:01:57,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 31: [2022-11-26 10:01:57,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:01:57,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 10:01:57,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 23: [2022-11-26 10:01:57,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:01:57,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 10:01:57,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 28: [2022-11-26 10:01:57,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
28: [2022-11-26 10:01:57,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 10:01:57,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 14: [2022-11-26 10:01:57,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:01:57,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 10:01:57,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 21: [2022-11-26 10:01:57,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:01:57,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 10:01:57,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 15: [2022-11-26 10:01:57,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:01:57,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 10:01:57,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 21: [2022-11-26 10:01:57,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-26 10:01:57,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 10:01:57,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 28: [2022-11-26 10:01:57,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 10:01:57,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 10:01:57,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 7: [2022-11-26 10:01:57,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:01:57,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 10:01:57,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 15: [2022-11-26 10:01:57,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:01:57,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 10:01:57,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 3: [2022-11-26 10:01:57,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 10:01:57,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 10:01:57,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 18: [2022-11-26 10:01:57,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:01:57,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step71000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 10:01:57,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 0: successfully saved checkpoint at iteration 71000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2781.30 31: iteration 71010/ 173500 | consumed samples: 18178560 | consumed tokens: 37229690880 | elapsed time per iteration (s): 1.11 | learning rate: 1.369E-04 | global batch size: 256 | lm loss: 2.004129E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.501 | TFLOPs: 13.94 | 31: iteration 71020/ 173500 | consumed samples: 18181120 | consumed tokens: 37234933760 | elapsed time per iteration (s): 0.80 | learning rate: 1.369E-04 | global batch size: 256 | lm loss: 2.021271E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.082 | TFLOPs: 19.42 | 31: iteration 71030/ 173500 | consumed samples: 18183680 | consumed tokens: 37240176640 | elapsed time per iteration (s): 0.78 | learning rate: 1.369E-04 | global batch size: 256 | lm loss: 2.028986E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.977 | TFLOPs: 19.96 | 31: iteration 71040/ 173500 
| consumed samples: 18186240 | consumed tokens: 37245419520 | elapsed time per iteration (s): 0.78 | learning rate: 1.369E-04 | global batch size: 256 | lm loss: 2.042026E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.676 | TFLOPs: 19.82 | 31: iteration 71050/ 173500 | consumed samples: 18188800 | consumed tokens: 37250662400 | elapsed time per iteration (s): 0.80 | learning rate: 1.369E-04 | global batch size: 256 | lm loss: 2.012873E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.088 | TFLOPs: 19.24 | 31: iteration 71060/ 173500 | consumed samples: 18191360 | consumed tokens: 37255905280 | elapsed time per iteration (s): 0.84 | learning rate: 1.368E-04 | global batch size: 256 | lm loss: 1.992380E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.078 | TFLOPs: 18.34 | 31: iteration 71070/ 173500 | consumed samples: 18193920 | consumed tokens: 37261148160 | elapsed time per iteration (s): 0.75 | learning rate: 1.368E-04 | global batch size: 256 | lm loss: 2.032833E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.808 | TFLOPs: 20.62 | 31: iteration 71080/ 173500 | consumed samples: 18196480 | consumed tokens: 37266391040 | elapsed time per iteration (s): 0.74 | learning rate: 1.368E-04 | global batch size: 256 | lm loss: 2.031961E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.262 | TFLOPs: 20.95 | 31: iteration 71090/ 173500 | consumed samples: 18199040 | consumed tokens: 37271633920 | elapsed time per iteration (s): 0.73 | learning rate: 1.368E-04 | global batch size: 256 | lm loss: 2.017011E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 
| number of nan iterations: 0 | samples per second: 349.944 | TFLOPs: 21.17 | 31: iteration 71100/ 173500 | consumed samples: 18201600 | consumed tokens: 37276876800 | elapsed time per iteration (s): 0.74 | learning rate: 1.368E-04 | global batch size: 256 | lm loss: 2.023720E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.672 | TFLOPs: 20.79 | 31: iteration 71110/ 173500 | consumed samples: 18204160 | consumed tokens: 37282119680 | elapsed time per iteration (s): 0.90 | learning rate: 1.368E-04 | global batch size: 256 | lm loss: 2.030386E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.779 | TFLOPs: 17.23 | 31: iteration 71120/ 173500 | consumed samples: 18206720 | consumed tokens: 37287362560 | elapsed time per iteration (s): 0.75 | learning rate: 1.367E-04 | global batch size: 256 | lm loss: 2.019270E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.197 | TFLOPs: 20.70 | 31: iteration 71130/ 173500 | consumed samples: 18209280 | consumed tokens: 37292605440 | elapsed time per iteration (s): 0.78 | learning rate: 1.367E-04 | global batch size: 256 | lm loss: 2.009619E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.746 | TFLOPs: 19.89 | 31: iteration 71140/ 173500 | consumed samples: 18211840 | consumed tokens: 37297848320 | elapsed time per iteration (s): 0.77 | learning rate: 1.367E-04 | global batch size: 256 | lm loss: 2.010205E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.098 | TFLOPs: 20.21 | 31: iteration 71150/ 173500 | consumed samples: 18214400 | consumed tokens: 37303091200 | elapsed time per iteration (s): 0.78 | learning rate: 1.367E-04 | global 
batch size: 256 | lm loss: 2.023379E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.514 | TFLOPs: 19.75 | 31: iteration 71160/ 173500 | consumed samples: 18216960 | consumed tokens: 37308334080 | elapsed time per iteration (s): 0.78 | learning rate: 1.367E-04 | global batch size: 256 | lm loss: 2.033938E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.139 | TFLOPs: 19.85 | 31: iteration 71170/ 173500 | consumed samples: 18219520 | consumed tokens: 37313576960 | elapsed time per iteration (s): 0.80 | learning rate: 1.367E-04 | global batch size: 256 | lm loss: 2.006861E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.508 | TFLOPs: 19.39 | 31: iteration 71180/ 173500 | consumed samples: 18222080 | consumed tokens: 37318819840 | elapsed time per iteration (s): 0.73 | learning rate: 1.367E-04 | global batch size: 256 | lm loss: 2.060906E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.768 | TFLOPs: 21.28 | 31: iteration 71190/ 173500 | consumed samples: 18224640 | consumed tokens: 37324062720 | elapsed time per iteration (s): 0.80 | learning rate: 1.366E-04 | global batch size: 256 | lm loss: 2.026125E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.226 | TFLOPs: 19.25 | 31: iteration 71200/ 173500 | consumed samples: 18227200 | consumed tokens: 37329305600 | elapsed time per iteration (s): 0.76 | learning rate: 1.366E-04 | global batch size: 256 | lm loss: 2.022988E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.768 | TFLOPs: 20.37 | 31: iteration 71210/ 173500 | consumed samples: 18229760 
| consumed tokens: 37334548480 | elapsed time per iteration (s): 0.77 | learning rate: 1.366E-04 | global batch size: 256 | lm loss: 2.028742E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.923 | TFLOPs: 20.14 | 31: iteration 71220/ 173500 | consumed samples: 18232320 | consumed tokens: 37339791360 | elapsed time per iteration (s): 0.80 | learning rate: 1.366E-04 | global batch size: 256 | lm loss: 2.014910E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.951 | TFLOPs: 19.42 | 31: iteration 71230/ 173500 | consumed samples: 18234880 | consumed tokens: 37345034240 | elapsed time per iteration (s): 0.79 | learning rate: 1.366E-04 | global batch size: 256 | lm loss: 2.043240E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.917 | TFLOPs: 19.66 | 31: iteration 71240/ 173500 | consumed samples: 18237440 | consumed tokens: 37350277120 | elapsed time per iteration (s): 0.79 | learning rate: 1.366E-04 | global batch size: 256 | lm loss: 2.011432E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.891 | TFLOPs: 19.59 | 31: iteration 71250/ 173500 | consumed samples: 18240000 | consumed tokens: 37355520000 | elapsed time per iteration (s): 0.79 | learning rate: 1.365E-04 | global batch size: 256 | lm loss: 2.028262E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.483 | TFLOPs: 19.57 | 31: iteration 71260/ 173500 | consumed samples: 18242560 | consumed tokens: 37360762880 | elapsed time per iteration (s): 0.81 | learning rate: 1.365E-04 | global batch size: 256 | lm loss: 2.013178E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 
0 | samples per second: 315.022 | TFLOPs: 19.06 | 31: iteration 71270/ 173500 | consumed samples: 18245120 | consumed tokens: 37366005760 | elapsed time per iteration (s): 0.79 | learning rate: 1.365E-04 | global batch size: 256 | lm loss: 2.013630E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.012 | TFLOPs: 19.66 | 31: iteration 71280/ 173500 | consumed samples: 18247680 | consumed tokens: 37371248640 | elapsed time per iteration (s): 0.81 | learning rate: 1.365E-04 | global batch size: 256 | lm loss: 1.988313E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.552 | TFLOPs: 19.21 | 31: iteration 71290/ 173500 | consumed samples: 18250240 | consumed tokens: 37376491520 | elapsed time per iteration (s): 0.78 | learning rate: 1.365E-04 | global batch size: 256 | lm loss: 2.019163E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.699 | TFLOPs: 19.89 | 31: iteration 71300/ 173500 | consumed samples: 18252800 | consumed tokens: 37381734400 | elapsed time per iteration (s): 0.79 | learning rate: 1.365E-04 | global batch size: 256 | lm loss: 2.033945E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.218 | TFLOPs: 19.55 | 31: iteration 71310/ 173500 | consumed samples: 18255360 | consumed tokens: 37386977280 | elapsed time per iteration (s): 0.78 | learning rate: 1.364E-04 | global batch size: 256 | lm loss: 2.023971E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.479 | TFLOPs: 19.81 | 31: iteration 71320/ 173500 | consumed samples: 18257920 | consumed tokens: 37392220160 | elapsed time per iteration (s): 0.84 | learning rate: 1.364E-04 | global batch size: 256 | lm loss: 
2.025164E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.531 | TFLOPs: 18.48 | 31: iteration 71330/ 173500 | consumed samples: 18260480 | consumed tokens: 37397463040 | elapsed time per iteration (s): 0.83 | learning rate: 1.364E-04 | global batch size: 256 | lm loss: 2.040095E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.718 | TFLOPs: 18.74 | 31: iteration 71340/ 173500 | consumed samples: 18263040 | consumed tokens: 37402705920 | elapsed time per iteration (s): 0.84 | learning rate: 1.364E-04 | global batch size: 256 | lm loss: 2.026818E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.147 | TFLOPs: 18.40 | 31: iteration 71350/ 173500 | consumed samples: 18265600 | consumed tokens: 37407948800 | elapsed time per iteration (s): 0.85 | learning rate: 1.364E-04 | global batch size: 256 | lm loss: 2.024703E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.609 | TFLOPs: 18.19 | 31: iteration 71360/ 173500 | consumed samples: 18268160 | consumed tokens: 37413191680 | elapsed time per iteration (s): 0.82 | learning rate: 1.364E-04 | global batch size: 256 | lm loss: 2.013378E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.754 | TFLOPs: 18.92 | 31: iteration 71370/ 173500 | consumed samples: 18270720 | consumed tokens: 37418434560 | elapsed time per iteration (s): 0.81 | learning rate: 1.364E-04 | global batch size: 256 | lm loss: 2.016891E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.333 | TFLOPs: 19.02 | 31: iteration 71380/ 173500 | consumed samples: 18273280 | consumed tokens: 
37423677440 | elapsed time per iteration (s): 0.83 | learning rate: 1.363E-04 | global batch size: 256 | lm loss: 2.010539E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.978 | TFLOPs: 18.63 | 31: iteration 71390/ 173500 | consumed samples: 18275840 | consumed tokens: 37428920320 | elapsed time per iteration (s): 0.80 | learning rate: 1.363E-04 | global batch size: 256 | lm loss: 2.011359E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.324 | TFLOPs: 19.38 | 31: iteration 71400/ 173500 | consumed samples: 18278400 | consumed tokens: 37434163200 | elapsed time per iteration (s): 0.81 | learning rate: 1.363E-04 | global batch size: 256 | lm loss: 2.035476E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.358 | TFLOPs: 19.14 | 31: iteration 71410/ 173500 | consumed samples: 18280960 | consumed tokens: 37439406080 | elapsed time per iteration (s): 0.79 | learning rate: 1.363E-04 | global batch size: 256 | lm loss: 1.986929E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.968 | TFLOPs: 19.72 | 31: iteration 71420/ 173500 | consumed samples: 18283520 | consumed tokens: 37444648960 | elapsed time per iteration (s): 0.83 | learning rate: 1.363E-04 | global batch size: 256 | lm loss: 2.014814E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.300 | TFLOPs: 18.71 | 31: iteration 71430/ 173500 | consumed samples: 18286080 | consumed tokens: 37449891840 | elapsed time per iteration (s): 0.82 | learning rate: 1.363E-04 | global batch size: 256 | lm loss: 2.025706E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 311.174 | TFLOPs: 18.83 | 31: iteration 71440/ 173500 | consumed samples: 18288640 | consumed tokens: 37455134720 | elapsed time per iteration (s): 0.80 | learning rate: 1.362E-04 | global batch size: 256 | lm loss: 1.998313E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.084 | TFLOPs: 19.24 | 31: iteration 71450/ 173500 | consumed samples: 18291200 | consumed tokens: 37460377600 | elapsed time per iteration (s): 0.81 | learning rate: 1.362E-04 | global batch size: 256 | lm loss: 2.010129E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.990 | TFLOPs: 19.18 | 31: iteration 71460/ 173500 | consumed samples: 18293760 | consumed tokens: 37465620480 | elapsed time per iteration (s): 0.87 | learning rate: 1.362E-04 | global batch size: 256 | lm loss: 2.015219E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.400 | TFLOPs: 17.87 | 31: iteration 71470/ 173500 | consumed samples: 18296320 | consumed tokens: 37470863360 | elapsed time per iteration (s): 0.88 | learning rate: 1.362E-04 | global batch size: 256 | lm loss: 2.030195E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.356 | TFLOPs: 17.57 | 31: iteration 71480/ 173500 | consumed samples: 18298880 | consumed tokens: 37476106240 | elapsed time per iteration (s): 0.81 | learning rate: 1.362E-04 | global batch size: 256 | lm loss: 2.021566E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.622 | TFLOPs: 19.03 | 31: iteration 71490/ 173500 | consumed samples: 18301440 | consumed tokens: 37481349120 | elapsed time per iteration (s): 0.85 | learning rate: 1.362E-04 | global batch size: 256 | lm loss: 2.026366E+00 | grad 
norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.602 | TFLOPs: 18.31 | 31: iteration 71500/ 173500 | consumed samples: 18304000 | consumed tokens: 37486592000 | elapsed time per iteration (s): 0.81 | learning rate: 1.361E-04 | global batch size: 256 | lm loss: 2.026028E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.561 | TFLOPs: 19.03 | 31: iteration 71510/ 173500 | consumed samples: 18306560 | consumed tokens: 37491834880 | elapsed time per iteration (s): 0.82 | learning rate: 1.361E-04 | global batch size: 256 | lm loss: 2.006341E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.998 | TFLOPs: 18.81 | 31: iteration 71520/ 173500 | consumed samples: 18309120 | consumed tokens: 37497077760 | elapsed time per iteration (s): 0.86 | learning rate: 1.361E-04 | global batch size: 256 | lm loss: 1.993987E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.783 | TFLOPs: 18.08 | 31: iteration 71530/ 173500 | consumed samples: 18311680 | consumed tokens: 37502320640 | elapsed time per iteration (s): 0.81 | learning rate: 1.361E-04 | global batch size: 256 | lm loss: 2.002729E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.640 | TFLOPs: 19.10 | 31: iteration 71540/ 173500 | consumed samples: 18314240 | consumed tokens: 37507563520 | elapsed time per iteration (s): 0.83 | learning rate: 1.361E-04 | global batch size: 256 | lm loss: 2.015555E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.040 | TFLOPs: 18.58 | 31: iteration 71550/ 173500 | consumed samples: 18316800 | consumed tokens: 37512806400 | elapsed time 
per iteration (s): 1.12 | learning rate: 1.361E-04 | global batch size: 256 | lm loss: 2.002330E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.397 | TFLOPs: 13.88 | 31: iteration 71560/ 173500 | consumed samples: 18319360 | consumed tokens: 37518049280 | elapsed time per iteration (s): 0.82 | learning rate: 1.361E-04 | global batch size: 256 | lm loss: 2.022878E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.538 | TFLOPs: 18.91 | 31: iteration 71570/ 173500 | consumed samples: 18321920 | consumed tokens: 37523292160 | elapsed time per iteration (s): 0.79 | learning rate: 1.360E-04 | global batch size: 256 | lm loss: 2.023395E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.069 | TFLOPs: 19.48 | 31: iteration 71580/ 173500 | consumed samples: 18324480 | consumed tokens: 37528535040 | elapsed time per iteration (s): 0.82 | learning rate: 1.360E-04 | global batch size: 256 | lm loss: 2.026250E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.423 | TFLOPs: 18.90 | 31: iteration 71590/ 173500 | consumed samples: 18327040 | consumed tokens: 37533777920 | elapsed time per iteration (s): 0.80 | learning rate: 1.360E-04 | global batch size: 256 | lm loss: 2.023457E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.355 | TFLOPs: 19.44 | 31: iteration 71600/ 173500 | consumed samples: 18329600 | consumed tokens: 37539020800 | elapsed time per iteration (s): 0.82 | learning rate: 1.360E-04 | global batch size: 256 | lm loss: 2.032794E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.491 | TFLOPs: 
18.84 | 31: iteration 71610/ 173500 | consumed samples: 18332160 | consumed tokens: 37544263680 | elapsed time per iteration (s): 0.84 | learning rate: 1.360E-04 | global batch size: 256 | lm loss: 2.006197E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.399 | TFLOPs: 18.42 | 31: iteration 71620/ 173500 | consumed samples: 18334720 | consumed tokens: 37549506560 | elapsed time per iteration (s): 0.80 | learning rate: 1.360E-04 | global batch size: 256 | lm loss: 2.015669E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.292 | TFLOPs: 19.38 | 31: iteration 71630/ 173500 | consumed samples: 18337280 | consumed tokens: 37554749440 | elapsed time per iteration (s): 0.81 | learning rate: 1.359E-04 | global batch size: 256 | lm loss: 2.059152E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.725 | TFLOPs: 19.16 | 31: iteration 71640/ 173500 | consumed samples: 18339840 | consumed tokens: 37559992320 | elapsed time per iteration (s): 0.89 | learning rate: 1.359E-04 | global batch size: 256 | lm loss: 2.018623E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.764 | TFLOPs: 17.35 | 31: iteration 71650/ 173500 | consumed samples: 18342400 | consumed tokens: 37565235200 | elapsed time per iteration (s): 0.83 | learning rate: 1.359E-04 | global batch size: 256 | lm loss: 2.040575E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.006 | TFLOPs: 18.57 | 31: iteration 71660/ 173500 | consumed samples: 18344960 | consumed tokens: 37570478080 | elapsed time per iteration (s): 0.86 | learning rate: 1.359E-04 | global batch size: 256 | lm loss: 2.031599E+00 | grad norm: 0.144 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.808 | TFLOPs: 18.08 | 31: iteration 71670/ 173500 | consumed samples: 18347520 | consumed tokens: 37575720960 | elapsed time per iteration (s): 0.82 | learning rate: 1.359E-04 | global batch size: 256 | lm loss: 2.024372E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.984 | TFLOPs: 18.93 | 31: iteration 71680/ 173500 | consumed samples: 18350080 | consumed tokens: 37580963840 | elapsed time per iteration (s): 0.78 | learning rate: 1.359E-04 | global batch size: 256 | lm loss: 2.029423E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.468 | TFLOPs: 19.93 | 31: iteration 71690/ 173500 | consumed samples: 18352640 | consumed tokens: 37586206720 | elapsed time per iteration (s): 0.81 | learning rate: 1.358E-04 | global batch size: 256 | lm loss: 2.023249E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.252 | TFLOPs: 19.01 | 31: iteration 71700/ 173500 | consumed samples: 18355200 | consumed tokens: 37591449600 | elapsed time per iteration (s): 0.84 | learning rate: 1.358E-04 | global batch size: 256 | lm loss: 2.010295E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.502 | TFLOPs: 18.36 | 31: iteration 71710/ 173500 | consumed samples: 18357760 | consumed tokens: 37596692480 | elapsed time per iteration (s): 0.81 | learning rate: 1.358E-04 | global batch size: 256 | lm loss: 2.037186E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.960 | TFLOPs: 19.18 | 31: iteration 71720/ 173500 | consumed samples: 18360320 | consumed tokens: 37601935360 | elapsed time per iteration (s): 0.84 | 
learning rate: 1.358E-04 | global batch size: 256 | lm loss: 2.018975E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.141 | TFLOPs: 18.46 | 31: iteration 71730/ 173500 | consumed samples: 18362880 | consumed tokens: 37607178240 | elapsed time per iteration (s): 0.81 | learning rate: 1.358E-04 | global batch size: 256 | lm loss: 2.048059E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.214 | TFLOPs: 19.07 | 31: iteration 71740/ 173500 | consumed samples: 18365440 | consumed tokens: 37612421120 | elapsed time per iteration (s): 0.84 | learning rate: 1.358E-04 | global batch size: 256 | lm loss: 2.009886E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.950 | TFLOPs: 18.45 | 31: iteration 71750/ 173500 | consumed samples: 18368000 | consumed tokens: 37617664000 | elapsed time per iteration (s): 0.82 | learning rate: 1.358E-04 | global batch size: 256 | lm loss: 2.000497E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.858 | TFLOPs: 18.93 | 31: iteration 71760/ 173500 | consumed samples: 18370560 | consumed tokens: 37622906880 | elapsed time per iteration (s): 0.81 | learning rate: 1.357E-04 | global batch size: 256 | lm loss: 2.001271E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.701 | TFLOPs: 19.04 | 31: iteration 71770/ 173500 | consumed samples: 18373120 | consumed tokens: 37628149760 | elapsed time per iteration (s): 0.81 | learning rate: 1.357E-04 | global batch size: 256 | lm loss: 2.028488E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.490 | TFLOPs: 19.03 | 31: iteration 71780/ 
173500 | consumed samples: 18375680 | consumed tokens: 37633392640 | elapsed time per iteration (s): 0.79 | learning rate: 1.357E-04 | global batch size: 256 | lm loss: 2.040356E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.537 | TFLOPs: 19.57 | 31: iteration 71790/ 173500 | consumed samples: 18378240 | consumed tokens: 37638635520 | elapsed time per iteration (s): 0.83 | learning rate: 1.357E-04 | global batch size: 256 | lm loss: 2.001091E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.522 | TFLOPs: 18.60 | 31: iteration 71800/ 173500 | consumed samples: 18380800 | consumed tokens: 37643878400 | elapsed time per iteration (s): 0.83 | learning rate: 1.357E-04 | global batch size: 256 | lm loss: 2.016434E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.232 | TFLOPs: 18.77 | 31: iteration 71810/ 173500 | consumed samples: 18383360 | consumed tokens: 37649121280 | elapsed time per iteration (s): 0.81 | learning rate: 1.357E-04 | global batch size: 256 | lm loss: 1.996986E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.963 | TFLOPs: 19.18 | 31: iteration 71820/ 173500 | consumed samples: 18385920 | consumed tokens: 37654364160 | elapsed time per iteration (s): 0.86 | learning rate: 1.356E-04 | global batch size: 256 | lm loss: 1.997554E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.021 | TFLOPs: 17.97 | 31: iteration 71830/ 173500 | consumed samples: 18388480 | consumed tokens: 37659607040 | elapsed time per iteration (s): 0.88 | learning rate: 1.356E-04 | global batch size: 256 | lm loss: 2.015025E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 292.041 | TFLOPs: 17.67 | 31: iteration 71840/ 173500 | consumed samples: 18391040 | consumed tokens: 37664849920 | elapsed time per iteration (s): 0.83 | learning rate: 1.356E-04 | global batch size: 256 | lm loss: 2.035252E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.866 | TFLOPs: 18.75 | 31: iteration 71850/ 173500 | consumed samples: 18393600 | consumed tokens: 37670092800 | elapsed time per iteration (s): 0.82 | learning rate: 1.356E-04 | global batch size: 256 | lm loss: 2.036926E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.705 | TFLOPs: 18.98 | 31: iteration 71860/ 173500 | consumed samples: 18396160 | consumed tokens: 37675335680 | elapsed time per iteration (s): 0.78 | learning rate: 1.356E-04 | global batch size: 256 | lm loss: 2.018727E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.259 | TFLOPs: 19.86 | 31: iteration 71870/ 173500 | consumed samples: 18398720 | consumed tokens: 37680578560 | elapsed time per iteration (s): 0.81 | learning rate: 1.356E-04 | global batch size: 256 | lm loss: 2.039814E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.864 | TFLOPs: 19.23 | 31: iteration 71880/ 173500 | consumed samples: 18401280 | consumed tokens: 37685821440 | elapsed time per iteration (s): 0.81 | learning rate: 1.355E-04 | global batch size: 256 | lm loss: 2.002605E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.939 | TFLOPs: 19.17 | 31: iteration 71890/ 173500 | consumed samples: 18403840 | consumed tokens: 37691064320 | elapsed time per iteration (s): 0.80 | learning rate: 
1.355E-04 | global batch size: 256 | lm loss: 2.025753E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.022 | TFLOPs: 19.36 | 31: iteration 71900/ 173500 | consumed samples: 18406400 | consumed tokens: 37696307200 | elapsed time per iteration (s): 0.82 | learning rate: 1.355E-04 | global batch size: 256 | lm loss: 2.012419E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.354 | TFLOPs: 18.90 | 31: iteration 71910/ 173500 | consumed samples: 18408960 | consumed tokens: 37701550080 | elapsed time per iteration (s): 0.79 | learning rate: 1.355E-04 | global batch size: 256 | lm loss: 2.049921E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.157 | TFLOPs: 19.61 | 31: iteration 71920/ 173500 | consumed samples: 18411520 | consumed tokens: 37706792960 | elapsed time per iteration (s): 0.77 | learning rate: 1.355E-04 | global batch size: 256 | lm loss: 2.034632E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.136 | TFLOPs: 20.09 | 31: iteration 71930/ 173500 | consumed samples: 18414080 | consumed tokens: 37712035840 | elapsed time per iteration (s): 0.83 | learning rate: 1.355E-04 | global batch size: 256 | lm loss: 2.002715E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.879 | TFLOPs: 18.69 | 31: iteration 71940/ 173500 | consumed samples: 18416640 | consumed tokens: 37717278720 | elapsed time per iteration (s): 0.82 | learning rate: 1.355E-04 | global batch size: 256 | lm loss: 2.031750E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.904 | TFLOPs: 18.81 | 31: iteration 71950/ 173500 | 
consumed samples: 18419200 | consumed tokens: 37722521600 | elapsed time per iteration (s): 0.79 | learning rate: 1.354E-04 | global batch size: 256 | lm loss: 2.022324E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.643 | TFLOPs: 19.52 | 31: iteration 71960/ 173500 | consumed samples: 18421760 | consumed tokens: 37727764480 | elapsed time per iteration (s): 0.83 | learning rate: 1.354E-04 | global batch size: 256 | lm loss: 2.029140E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.755 | TFLOPs: 18.74 | 31: iteration 71970/ 173500 | consumed samples: 18424320 | consumed tokens: 37733007360 | elapsed time per iteration (s): 0.81 | learning rate: 1.354E-04 | global batch size: 256 | lm loss: 2.062295E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.327 | TFLOPs: 19.02 | 31: iteration 71980/ 173500 | consumed samples: 18426880 | consumed tokens: 37738250240 | elapsed time per iteration (s): 0.80 | learning rate: 1.354E-04 | global batch size: 256 | lm loss: 2.029272E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.766 | TFLOPs: 19.47 | 31: iteration 71990/ 173500 | consumed samples: 18429440 | consumed tokens: 37743493120 | elapsed time per iteration (s): 0.79 | learning rate: 1.354E-04 | global batch size: 256 | lm loss: 2.013727E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.790 | TFLOPs: 19.59 | 0: [2022-11-26 10:15:29,411] [INFO] [logging.py:68:log_dist] [Rank 0] step=72000, skipped=0, lr=[0.0001353602432066091, 0.0001353602432066091, 0.0001353602432066091], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 72000/ 173500 | consumed samples: 18432000 | 
consumed tokens: 37748736000 | elapsed time per iteration (s): 0.78 | learning rate: 1.354E-04 | global batch size: 256 | lm loss: 2.013914E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.527 | TFLOPs: 19.75 | 0: steps: 72000 loss: 1.9613 iter time (s): 0.830 samples/sec: 308.256 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 72000 | lm loss value: 1.970750E+00 | lm loss PPL: 7.176058E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 72000 to checkpoints_1b1long 0: [2022-11-26 10:15:29,754] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step72000 is begin to save! 0: [2022-11-26 10:15:29,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_01-model_00-model_states.pt... 0: [2022-11-26 10:15:29,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_01-model_00-model_states.pt. 0: [2022-11-26 10:15:29,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_03-model_00-model_states.pt... 0: [2022-11-26 10:15:30,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_03-model_00-model_states.pt. 0: [2022-11-26 10:15:30,061] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_04-model_00-model_states.pt... 0: [2022-11-26 10:15:30,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_04-model_00-model_states.pt. 0: [2022-11-26 10:15:30,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_05-model_00-model_states.pt... 
0: [2022-11-26 10:15:30,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_05-model_00-model_states.pt. 0: [2022-11-26 10:15:30,220] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_06-model_00-model_states.pt... 0: [2022-11-26 10:15:30,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_06-model_00-model_states.pt. 0: [2022-11-26 10:15:30,299] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_07-model_00-model_states.pt... 0: [2022-11-26 10:15:30,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_07-model_00-model_states.pt. 0: [2022-11-26 10:15:30,376] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_08-model_00-model_states.pt... 0: [2022-11-26 10:15:30,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_08-model_00-model_states.pt. 0: [2022-11-26 10:15:30,456] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_09-model_00-model_states.pt... 0: [2022-11-26 10:15:30,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_09-model_00-model_states.pt. 0: [2022-11-26 10:15:30,537] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_10-model_00-model_states.pt... 0: [2022-11-26 10:15:30,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_10-model_00-model_states.pt. 0: [2022-11-26 10:15:30,616] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_11-model_00-model_states.pt... 
0: [2022-11-26 10:15:30,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_11-model_00-model_states.pt. 0: [2022-11-26 10:15:30,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_12-model_00-model_states.pt... 0: [2022-11-26 10:15:30,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_12-model_00-model_states.pt. 0: [2022-11-26 10:15:30,773] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_13-model_00-model_states.pt... 0: [2022-11-26 10:15:30,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_13-model_00-model_states.pt. 0: [2022-11-26 10:15:30,850] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_14-model_00-model_states.pt... 0: [2022-11-26 10:15:30,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_14-model_00-model_states.pt. 0: [2022-11-26 10:15:30,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_15-model_00-model_states.pt... 0: [2022-11-26 10:15:31,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_15-model_00-model_states.pt. 0: [2022-11-26 10:15:31,007] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_16-model_00-model_states.pt... 0: [2022-11-26 10:15:31,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_16-model_00-model_states.pt. 0: [2022-11-26 10:15:31,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_17-model_00-model_states.pt... 
0: [2022-11-26 10:15:31,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_17-model_00-model_states.pt. 0: [2022-11-26 10:15:31,166] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_18-model_00-model_states.pt... 0: [2022-11-26 10:15:31,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_18-model_00-model_states.pt. 0: [2022-11-26 10:15:31,243] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_19-model_00-model_states.pt... 0: [2022-11-26 10:15:31,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_19-model_00-model_states.pt. 0: [2022-11-26 10:15:31,321] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_20-model_00-model_states.pt... 0: [2022-11-26 10:15:31,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_20-model_00-model_states.pt. 0: [2022-11-26 10:15:31,399] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_21-model_00-model_states.pt... 0: [2022-11-26 10:15:31,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_21-model_00-model_states.pt. 0: [2022-11-26 10:15:31,479] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_22-model_00-model_states.pt... 0: [2022-11-26 10:15:31,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_22-model_00-model_states.pt. 0: [2022-11-26 10:15:31,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_23-model_00-model_states.pt... 
0: [2022-11-26 10:15:31,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_23-model_00-model_states.pt. 0: [2022-11-26 10:15:31,634] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_24-model_00-model_states.pt... 0: [2022-11-26 10:15:31,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_24-model_00-model_states.pt. 0: [2022-11-26 10:15:31,711] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_25-model_00-model_states.pt... 0: [2022-11-26 10:15:31,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_25-model_00-model_states.pt. 0: [2022-11-26 10:15:31,791] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_26-model_00-model_states.pt... 0: [2022-11-26 10:15:31,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_26-model_00-model_states.pt. 0: [2022-11-26 10:15:31,870] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_27-model_00-model_states.pt... 0: [2022-11-26 10:15:31,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_27-model_00-model_states.pt. 0: [2022-11-26 10:15:31,946] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_28-model_00-model_states.pt... 0: [2022-11-26 10:15:32,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_28-model_00-model_states.pt. 0: [2022-11-26 10:15:32,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/layer_30-model_00-model_states.pt... 
0: [2022-11-26 10:15:32,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/layer_30-model_00-model_states.pt. 0: [2022-11-26 10:15:32,028] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step72000/mp_rank_00_model_states.pt 0: [2022-11-26 10:15:32,028] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/mp_rank_00_model_states.pt... 0: [2022-11-26 10:15:32,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/mp_rank_00_model_states.pt. 0: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
5: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
9: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 
1: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
12: [2022-11-26 10:15:32,226] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:15:32,226] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 
23: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 
24: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:15:32,226] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 
17: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 
0: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
4: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 
16: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 
12: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:15:32,226] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 
23: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 
22: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:15:32,226] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:15:32,226] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 
18: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
5: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
2: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
15: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 
22: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:15:32,226] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 
27: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 
11: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:15:32,226] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:15:32,226] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 
19: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
31: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:15:32,226] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:15:32,226] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 
0: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:15:32,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:15:32,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:15:32,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 10:15:32,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 1: [2022-11-26 10:15:32,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 10:15:32,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 10:15:32,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 20: [2022-11-26 10:15:32,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:15:32,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 10:15:32,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 2: [2022-11-26 10:15:32,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:15:32,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 10:15:32,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 1: [2022-11-26 10:15:32,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:15:32,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
1: [2022-11-26 10:15:32,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 0: [2022-11-26 10:15:32,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 10:15:32,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 1: [2022-11-26 10:15:32,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 23: [2022-11-26 10:15:32,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:15:32,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 10:15:32,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 30: [2022-11-26 10:15:32,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:15:32,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 10:15:32,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 18: [2022-11-26 10:15:32,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
18: [2022-11-26 10:15:32,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 10:15:32,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 25: [2022-11-26 10:15:32,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:15:32,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:15:32,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 3: [2022-11-26 10:15:32,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 25: [2022-11-26 10:15:32,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 3: [2022-11-26 10:15:32,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 2: [2022-11-26 10:15:32,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:15:32,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 10:15:32,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 25: [2022-11-26 10:15:32,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 
25: [2022-11-26 10:15:32,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 10:15:32,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 21: [2022-11-26 10:15:32,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:15:32,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 10:15:32,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 21: [2022-11-26 10:15:32,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:15:32,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 15: [2022-11-26 10:15:32,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:15:32,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:15:32,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 
15: [2022-11-26 10:15:32,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 30: [2022-11-26 10:15:32,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 15: [2022-11-26 10:15:32,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 30: [2022-11-26 10:15:32,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 8: [2022-11-26 10:15:32,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:15:32,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 10:15:32,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 10: [2022-11-26 10:15:32,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:15:32,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:15:32,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
10: [2022-11-26 10:15:32,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 19: [2022-11-26 10:15:32,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 10: [2022-11-26 10:15:32,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 19: [2022-11-26 10:15:32,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 10:15:32,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 19: [2022-11-26 10:15:32,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 2: [2022-11-26 10:15:32,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:15:32,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 10:15:32,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 28: [2022-11-26 10:15:32,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 10:15:32,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
28: [2022-11-26 10:15:32,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 8: [2022-11-26 10:15:32,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:15:32,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 28: [2022-11-26 10:15:32,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 10:15:32,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 8: [2022-11-26 10:15:32,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 28: [2022-11-26 10:15:32,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 17: [2022-11-26 10:15:32,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:15:32,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:15:32,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 10:15:32,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 
1: [2022-11-26 10:15:32,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 24: [2022-11-26 10:15:32,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:15:32,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 24: [2022-11-26 10:15:32,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 10:15:32,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 15: [2022-11-26 10:15:32,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:15:32,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:15:32,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 17: [2022-11-26 10:15:32,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 10:15:32,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 10: [2022-11-26 10:15:32,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:15:32,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 
10: [2022-11-26 10:15:32,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 10:15:32,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:15:32,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:15:32,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 26: [2022-11-26 10:15:32,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:15:32,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 10: [2022-11-26 10:15:32,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 3: [2022-11-26 10:15:32,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:15:32,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:15:32,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 10: [2022-11-26 10:15:32,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 13: [2022-11-26 10:15:32,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
3: [2022-11-26 10:15:32,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 15: [2022-11-26 10:15:32,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 26: [2022-11-26 10:15:32,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 13: [2022-11-26 10:15:32,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 3: [2022-11-26 10:15:32,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 15: [2022-11-26 10:15:32,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 26: [2022-11-26 10:15:32,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 13: [2022-11-26 10:15:32,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 13: [2022-11-26 10:15:32,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:15:32,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 10:15:32,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 19: [2022-11-26 10:15:32,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-26 10:15:32,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 10:15:32,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 6: [2022-11-26 10:15:32,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:15:32,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 10:15:32,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 23: [2022-11-26 10:15:32,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:15:32,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 10:15:32,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 26: [2022-11-26 10:15:32,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:15:32,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 10:15:32,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 30: [2022-11-26 10:15:32,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
18: [2022-11-26 10:15:32,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:15:32,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 12: [2022-11-26 10:15:32,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:15:32,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 18: [2022-11-26 10:15:32,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 12: [2022-11-26 10:15:32,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 20: [2022-11-26 10:15:32,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:15:32,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 26: [2022-11-26 10:15:32,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:15:32,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 12: [2022-11-26 10:15:32,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 20: [2022-11-26 10:15:32,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 
26: [2022-11-26 10:15:32,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 10:15:32,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 31: [2022-11-26 10:15:32,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:15:32,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 10:15:32,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 27: [2022-11-26 10:15:32,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:15:32,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 10:15:32,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 20: [2022-11-26 10:15:32,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:15:32,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 10:15:32,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 24: [2022-11-26 10:15:32,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-26 10:15:32,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 10:15:32,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 9: [2022-11-26 10:15:32,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:15:32,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:15:32,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 10:15:32,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 10:15:32,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 9: [2022-11-26 10:15:32,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 17: [2022-11-26 10:15:32,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:15:32,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 31: [2022-11-26 10:15:32,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:15:32,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 
25: [2022-11-26 10:15:32,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:15:32,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 25: [2022-11-26 10:15:32,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 31: [2022-11-26 10:15:32,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 7: [2022-11-26 10:15:32,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:15:32,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 25: [2022-11-26 10:15:32,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 7: [2022-11-26 10:15:32,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 22: [2022-11-26 10:15:32,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:15:32,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 10:15:32,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 22: [2022-11-26 10:15:32,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
21: [2022-11-26 10:15:32,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:15:32,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 10:15:32,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 21: [2022-11-26 10:15:32,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 10:15:32,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 12: [2022-11-26 10:15:32,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:15:32,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 10:15:32,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 18: [2022-11-26 10:15:32,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:15:32,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
18: [2022-11-26 10:15:32,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 13: [2022-11-26 10:15:32,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 18: [2022-11-26 10:15:32,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 13: [2022-11-26 10:15:32,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 18: [2022-11-26 10:15:32,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:15:32,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 10:15:32,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 1: [2022-11-26 10:15:32,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:15:32,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:15:32,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
1: [2022-11-26 10:15:32,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 21: [2022-11-26 10:15:32,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 8: [2022-11-26 10:15:32,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 1: [2022-11-26 10:15:32,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 21: [2022-11-26 10:15:32,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 8: [2022-11-26 10:15:32,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 30: [2022-11-26 10:15:32,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:15:32,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 10:15:32,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 2: [2022-11-26 10:15:32,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:15:32,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 11: [2022-11-26 10:15:32,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-26 10:15:32,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 10:15:32,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 11: [2022-11-26 10:15:32,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:15:32,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 10:15:32,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 11: [2022-11-26 10:15:32,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:15:32,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 2: [2022-11-26 10:15:32,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 11: [2022-11-26 10:15:32,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 20: [2022-11-26 10:15:32,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:15:32,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 10:15:32,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 
28: [2022-11-26 10:15:32,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 10:15:32,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 10:15:32,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 7: [2022-11-26 10:15:32,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:15:32,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:15:32,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 10:15:32,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 10:15:32,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 7: [2022-11-26 10:15:32,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 9: [2022-11-26 10:15:32,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:15:32,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
9: [2022-11-26 10:15:32,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 11: [2022-11-26 10:15:32,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:15:32,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 9: [2022-11-26 10:15:32,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 14: [2022-11-26 10:15:32,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 11: [2022-11-26 10:15:32,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 10:15:32,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 23: [2022-11-26 10:15:32,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:15:32,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 10:15:32,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 16: [2022-11-26 10:15:32,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 
16: [2022-11-26 10:15:32,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 10:15:32,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 5: [2022-11-26 10:15:32,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:15:32,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:15:32,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 10:15:32,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 10:15:32,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 5: [2022-11-26 10:15:32,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 31: [2022-11-26 10:15:32,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:15:32,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:15:32,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 13: [2022-11-26 10:15:32,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
31: [2022-11-26 10:15:32,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 22: [2022-11-26 10:15:32,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:15:32,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 31: [2022-11-26 10:15:32,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 22: [2022-11-26 10:15:32,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 13: [2022-11-26 10:15:32,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 31: [2022-11-26 10:15:32,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 22: [2022-11-26 10:15:32,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 19: [2022-11-26 10:15:32,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:15:32,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 10:15:32,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 25: [2022-11-26 10:15:32,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
24: [2022-11-26 10:15:32,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:15:32,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 27: [2022-11-26 10:15:32,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:15:32,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 24: [2022-11-26 10:15:32,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 27: [2022-11-26 10:15:32,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 25: [2022-11-26 10:15:32,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 27: [2022-11-26 10:15:32,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 16: [2022-11-26 10:15:32,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:15:32,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 10:15:32,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 6: [2022-11-26 10:15:32,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 10:15:32,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 10:15:32,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 6: [2022-11-26 10:15:32,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:15:32,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 10:15:32,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 14: [2022-11-26 10:15:32,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:15:32,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:15:32,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 22: [2022-11-26 10:15:32,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 14: [2022-11-26 10:15:32,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 22: [2022-11-26 10:15:32,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 6: [2022-11-26 10:15:32,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 10:15:32,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 10:15:32,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 26: [2022-11-26 10:15:32,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:15:32,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:15:32,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 0: [2022-11-26 10:15:32,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 26: [2022-11-26 10:15:32,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 17: [2022-11-26 10:15:32,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:15:32,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 17: [2022-11-26 10:15:32,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 10:15:32,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 0: [2022-11-26 10:15:32,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-26 10:15:32,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:15:32,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 10:15:32,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 10:15:32,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 0: [2022-11-26 10:15:32,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 12: [2022-11-26 10:15:32,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:15:32,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 10:15:32,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 8: [2022-11-26 10:15:32,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:15:32,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 10:15:32,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 15: [2022-11-26 10:15:32,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 10:15:32,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 10:15:32,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 27: [2022-11-26 10:15:32,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:15:32,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 10:15:32,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 1: [2022-11-26 10:15:32,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:15:32,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 10:15:32,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 3: [2022-11-26 10:15:32,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:15:32,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 10:15:32,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:15:32,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 
3: [2022-11-26 10:15:32,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 23: [2022-11-26 10:15:32,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:15:32,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 23: [2022-11-26 10:15:32,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 16: [2022-11-26 10:15:32,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:15:32,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 16: [2022-11-26 10:15:32,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 10:15:32,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 18: [2022-11-26 10:15:32,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:15:32,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 10:15:32,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 9: [2022-11-26 10:15:32,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 10:15:32,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 10:15:32,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 14: [2022-11-26 10:15:32,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:15:32,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 10:15:32,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 4: [2022-11-26 10:15:32,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:15:32,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:15:32,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:15:32,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 10:15:32,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 10:15:32,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 10:15:32,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 10:15:32,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 10:15:32,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 4: [2022-11-26 10:15:32,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 4: [2022-11-26 10:15:32,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 4: [2022-11-26 10:15:32,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 29: [2022-11-26 10:15:32,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:15:32,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:15:32,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:15:32,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
29: [2022-11-26 10:15:32,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 10:15:32,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 10:15:32,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 10:15:32,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 10:15:32,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 29: [2022-11-26 10:15:32,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 29: [2022-11-26 10:15:32,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 29: [2022-11-26 10:15:32,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 5: [2022-11-26 10:15:32,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:15:32,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 10:15:32,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 28: [2022-11-26 10:15:32,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
28: [2022-11-26 10:15:32,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 10:15:32,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 10: [2022-11-26 10:15:32,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:15:32,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 10:15:32,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 26: [2022-11-26 10:15:32,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:15:32,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 10:15:32,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 25: [2022-11-26 10:15:32,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:15:32,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 10:15:32,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 23: [2022-11-26 10:15:32,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-26 10:15:32,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 10:15:32,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 11: [2022-11-26 10:15:32,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:15:32,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 10:15:32,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 0: [2022-11-26 10:15:32,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:15:32,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 10:15:32,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 20: [2022-11-26 10:15:32,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:15:32,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 10:15:32,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 7: [2022-11-26 10:15:32,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 10:15:32,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 10:15:32,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 13: [2022-11-26 10:15:32,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:15:32,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 10:15:32,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 2: [2022-11-26 10:15:32,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:15:32,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 10:15:32,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 10: [2022-11-26 10:15:32,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:15:32,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 10:15:32,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 9: [2022-11-26 10:15:32,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 10:15:32,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 10:15:32,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 21: [2022-11-26 10:15:32,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:15:32,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 10:15:32,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 28: [2022-11-26 10:15:32,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 10:15:32,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 10:15:32,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 19: [2022-11-26 10:15:32,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:15:32,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 10:15:32,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 24: [2022-11-26 10:15:32,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
24: [2022-11-26 10:15:32,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 8: [2022-11-26 10:15:32,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:15:32,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 8: [2022-11-26 10:15:32,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 10:15:32,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 6: [2022-11-26 10:15:32,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:15:32,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:15:32,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 6: [2022-11-26 10:15:32,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 15: [2022-11-26 10:15:32,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 6: [2022-11-26 10:15:32,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 31: [2022-11-26 10:15:32,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
31: [2022-11-26 10:15:32,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 17: [2022-11-26 10:15:32,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:15:32,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 17: [2022-11-26 10:15:32,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 10:15:32,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 27: [2022-11-26 10:15:32,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:15:32,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:15:32,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 10:15:32,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 1: [2022-11-26 10:15:32,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 10:15:32,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 4: [2022-11-26 10:15:32,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
30: [2022-11-26 10:15:32,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:15:32,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 30: [2022-11-26 10:15:32,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 4: [2022-11-26 10:15:32,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 30: [2022-11-26 10:15:32,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 22: [2022-11-26 10:15:32,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:15:32,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 10:15:32,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 16: [2022-11-26 10:15:32,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:15:32,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 10:15:32,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 12: [2022-11-26 10:15:32,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 10:15:32,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 10:15:32,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 5: [2022-11-26 10:15:32,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:15:32,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 10:15:32,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 29: [2022-11-26 10:15:32,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:15:32,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:15:32,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 3: [2022-11-26 10:15:32,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 10:15:32,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 29: [2022-11-26 10:15:32,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 0: [2022-11-26 10:15:32,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-26 10:15:32,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 10:15:32,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 18: [2022-11-26 10:15:32,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:15:32,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 10:15:32,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 20: [2022-11-26 10:15:32,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:15:32,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 10:15:32,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 14: [2022-11-26 10:15:32,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:15:32,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 10:15:32,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 25: [2022-11-26 10:15:32,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
25: [2022-11-26 10:15:32,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 10:15:32,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 11: [2022-11-26 10:15:32,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:15:32,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 10:15:32,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 7: [2022-11-26 10:15:32,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:15:32,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:15:32,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 7: [2022-11-26 10:15:32,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 19: [2022-11-26 10:15:32,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 7: [2022-11-26 10:15:32,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 23: [2022-11-26 10:15:32,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-26 10:15:32,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 10:15:32,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 2: [2022-11-26 10:15:32,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:15:32,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 10:15:32,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 26: [2022-11-26 10:15:32,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:15:32,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 10:15:32,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 10: [2022-11-26 10:15:32,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:15:32,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 10:15:32,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 30: [2022-11-26 10:15:32,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 
30: [2022-11-26 10:15:32,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 10:15:32,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 13: [2022-11-26 10:15:32,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:15:32,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 10:15:32,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 9: [2022-11-26 10:15:32,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:15:32,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 10:15:32,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 28: [2022-11-26 10:15:32,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 10:15:32,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 10:15:32,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 6: [2022-11-26 10:15:32,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
15: [2022-11-26 10:15:32,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:15:32,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:15:32,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 15: [2022-11-26 10:15:32,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 21: [2022-11-26 10:15:32,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 15: [2022-11-26 10:15:32,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 6: [2022-11-26 10:15:32,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 21: [2022-11-26 10:15:32,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 24: [2022-11-26 10:15:32,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:15:32,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 10:15:32,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 8: [2022-11-26 10:15:32,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-26 10:15:32,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 10:15:32,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 31: [2022-11-26 10:15:32,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:15:32,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 1: [2022-11-26 10:15:32,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:15:32,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 1: [2022-11-26 10:15:32,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 17: [2022-11-26 10:15:32,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:15:32,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 17: [2022-11-26 10:15:32,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 10:15:32,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 11: [2022-11-26 10:15:32,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-26 10:15:32,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 10:15:32,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 3: [2022-11-26 10:15:32,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:15:32,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 10:15:32,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 13: [2022-11-26 10:15:32,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:15:32,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 10:15:32,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 20: [2022-11-26 10:15:32,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:15:32,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 10:15:32,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 12: [2022-11-26 10:15:32,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
16: [2022-11-26 10:15:32,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:15:32,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 16: [2022-11-26 10:15:32,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 10:15:32,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 12: [2022-11-26 10:15:32,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 0: [2022-11-26 10:15:32,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:15:32,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 10:15:32,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 18: [2022-11-26 10:15:32,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:15:32,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 10:15:32,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 5: [2022-11-26 10:15:32,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 10:15:32,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 10:15:32,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 14: [2022-11-26 10:15:32,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:15:32,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:15:32,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 10:15:32,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 27: [2022-11-26 10:15:32,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 10:15:32,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 23: [2022-11-26 10:15:32,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:15:32,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 10:15:32,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 23: [2022-11-26 10:15:32,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 4: [2022-11-26 10:15:32,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 23: [2022-11-26 10:15:32,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 10: [2022-11-26 10:15:32,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:15:32,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:15:32,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 10:15:32,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 26: [2022-11-26 10:15:32,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 10:15:32,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 22: [2022-11-26 10:15:32,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
22: [2022-11-26 10:15:32,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 10:15:32,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 6: [2022-11-26 10:15:32,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:15:32,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 10:15:32,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 9: [2022-11-26 10:15:32,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:15:32,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:15:32,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 10:15:32,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 30: [2022-11-26 10:15:32,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 2: [2022-11-26 10:15:32,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:15:32,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 
2: [2022-11-26 10:15:32,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 10:15:32,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 21: [2022-11-26 10:15:32,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:15:32,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:15:32,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 8: [2022-11-26 10:15:32,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 29: [2022-11-26 10:15:32,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:15:32,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 8: [2022-11-26 10:15:32,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 29: [2022-11-26 10:15:32,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 10:15:32,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 7: [2022-11-26 10:15:32,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 10:15:32,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 10:15:32,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 15: [2022-11-26 10:15:32,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:15:32,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:15:32,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 1: [2022-11-26 10:15:32,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:15:32,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 25: [2022-11-26 10:15:32,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 10:15:32,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 1: [2022-11-26 10:15:32,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 10:15:32,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 19: [2022-11-26 10:15:32,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 
11: [2022-11-26 10:15:32,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:15:32,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 19: [2022-11-26 10:15:32,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 10:15:32,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 11: [2022-11-26 10:15:32,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 14: [2022-11-26 10:15:32,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:15:32,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 10:15:32,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 20: [2022-11-26 10:15:32,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:15:32,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
20: [2022-11-26 10:15:32,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 4: [2022-11-26 10:15:32,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:15:32,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 4: [2022-11-26 10:15:32,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 24: [2022-11-26 10:15:32,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 4: [2022-11-26 10:15:32,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 24: [2022-11-26 10:15:32,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 31: [2022-11-26 10:15:32,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:15:32,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 10:15:32,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 0: [2022-11-26 10:15:32,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:15:32,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
22: [2022-11-26 10:15:32,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:15:32,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 13: [2022-11-26 10:15:32,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 10:15:32,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 3: [2022-11-26 10:15:32,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:15:32,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 0: [2022-11-26 10:15:32,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 22: [2022-11-26 10:15:32,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 3: [2022-11-26 10:15:32,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 10:15:32,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 23: [2022-11-26 10:15:32,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 28: [2022-11-26 10:15:32,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
23: [2022-11-26 10:15:32,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 28: [2022-11-26 10:15:32,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 23: [2022-11-26 10:15:32,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 28: [2022-11-26 10:15:32,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 26: [2022-11-26 10:15:32,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:15:32,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 10:15:32,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 5: [2022-11-26 10:15:32,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:15:32,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 10:15:32,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 4: [2022-11-26 10:15:32,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-26 10:15:32,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 10:15:32,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 18: [2022-11-26 10:15:32,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:15:32,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 10:15:32,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 17: [2022-11-26 10:15:32,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:15:32,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 27: [2022-11-26 10:15:32,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:15:32,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 27: [2022-11-26 10:15:32,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 10:15:32,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 15: [2022-11-26 10:15:32,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-26 10:15:32,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 10:15:32,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 30: [2022-11-26 10:15:32,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:15:32,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 10:15:32,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 17: [2022-11-26 10:15:32,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:15:32,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 2: [2022-11-26 10:15:32,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:15:32,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 2: [2022-11-26 10:15:32,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 10:15:32,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 6: [2022-11-26 10:15:32,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 10:15:32,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 10:15:32,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 21: [2022-11-26 10:15:32,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:15:32,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:15:32,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 16: [2022-11-26 10:15:32,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 10:15:32,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 21: [2022-11-26 10:15:32,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 31: [2022-11-26 10:15:32,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:15:32,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 10:15:32,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 8: [2022-11-26 10:15:32,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 10:15:32,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 10:15:32,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 29: [2022-11-26 10:15:32,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:15:32,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 10:15:32,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 3: [2022-11-26 10:15:32,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:15:32,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:15:32,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 22: [2022-11-26 10:15:32,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:15:32,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 
22: [2022-11-26 10:15:32,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 19: [2022-11-26 10:15:32,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 10: [2022-11-26 10:15:32,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:15:32,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 22: [2022-11-26 10:15:32,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 10: [2022-11-26 10:15:32,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 10:15:32,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 7: [2022-11-26 10:15:32,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:15:32,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 10:15:32,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 29: [2022-11-26 10:15:32,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-26 10:15:32,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 10:15:32,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 9: [2022-11-26 10:15:32,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:15:32,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 10:15:32,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 28: [2022-11-26 10:15:32,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 10:15:32,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 10:15:32,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 14: [2022-11-26 10:15:32,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:15:32,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 10:15:32,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 16: [2022-11-26 10:15:32,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
16: [2022-11-26 10:15:32,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 10:15:32,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 24: [2022-11-26 10:15:32,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:15:32,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 10:15:32,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 25: [2022-11-26 10:15:32,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:15:32,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 10:15:32,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 5: [2022-11-26 10:15:32,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:15:32,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 10:15:32,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 12: [2022-11-26 10:15:32,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 10:15:32,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 10:15:32,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 12: [2022-11-26 10:15:32,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:15:32,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 10:15:32,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 27: [2022-11-26 10:15:32,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:15:32,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:15:32,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 10:15:32,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 10:15:32,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 27: [2022-11-26 10:15:32,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 12: [2022-11-26 10:15:32,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 10:15:32,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 10:15:32,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 16: [2022-11-26 10:15:32,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:15:32,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 10:15:32,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 5: [2022-11-26 10:15:32,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:15:32,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 10:15:32,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 14: [2022-11-26 10:15:32,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:15:32,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step72000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 10:15:32,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 
0: successfully saved checkpoint at iteration 72000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2721.43 31: iteration 72010/ 173500 | consumed samples: 18434560 | consumed tokens: 37753978880 | elapsed time per iteration (s): 1.11 | learning rate: 1.353E-04 | global batch size: 256 | lm loss: 2.006576E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.424 | TFLOPs: 13.94 | 31: iteration 72020/ 173500 | consumed samples: 18437120 | consumed tokens: 37759221760 | elapsed time per iteration (s): 0.82 | learning rate: 1.353E-04 | global batch size: 256 | lm loss: 2.042072E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.474 | TFLOPs: 18.96 | 31: iteration 72030/ 173500 | consumed samples: 18439680 | consumed tokens: 37764464640 | elapsed time per iteration (s): 0.80 | learning rate: 1.353E-04 | global batch size: 256 | lm loss: 2.010616E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.144 | TFLOPs: 19.37 | 31: iteration 72040/ 173500 | consumed samples: 18442240 | consumed tokens: 37769707520 | elapsed time per iteration (s): 1.80 | learning rate: 1.353E-04 | global batch size: 256 | lm loss: 2.041574E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 142.159 | TFLOPs: 8.60 | 31: iteration 72050/ 173500 | consumed samples: 18444800 | consumed tokens: 37774950400 | elapsed time per iteration (s): 0.79 | learning rate: 1.353E-04 | global batch size: 256 | lm loss: 1.993771E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.009 | TFLOPs: 19.72 | 31: iteration 72060/ 173500 | consumed samples: 18447360 | consumed tokens: 37780193280 | elapsed time per iteration (s): 0.81 | 
learning rate: 1.353E-04 | global batch size: 256 | lm loss: 2.029746E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.673 | TFLOPs: 19.22 | 31: iteration 72070/ 173500 | consumed samples: 18449920 | consumed tokens: 37785436160 | elapsed time per iteration (s): 0.79 | learning rate: 1.352E-04 | global batch size: 256 | lm loss: 2.012371E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.552 | TFLOPs: 19.70 | 31: iteration 72080/ 173500 | consumed samples: 18452480 | consumed tokens: 37790679040 | elapsed time per iteration (s): 0.77 | learning rate: 1.352E-04 | global batch size: 256 | lm loss: 2.032379E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.256 | TFLOPs: 20.16 | 31: iteration 72090/ 173500 | consumed samples: 18455040 | consumed tokens: 37795921920 | elapsed time per iteration (s): 0.79 | learning rate: 1.352E-04 | global batch size: 256 | lm loss: 2.033158E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.126 | TFLOPs: 19.49 | 31: iteration 72100/ 173500 | consumed samples: 18457600 | consumed tokens: 37801164800 | elapsed time per iteration (s): 0.80 | learning rate: 1.352E-04 | global batch size: 256 | lm loss: 2.032891E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.395 | TFLOPs: 19.26 | 31: iteration 72110/ 173500 | consumed samples: 18460160 | consumed tokens: 37806407680 | elapsed time per iteration (s): 0.81 | learning rate: 1.352E-04 | global batch size: 256 | lm loss: 2.008782E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.907 | TFLOPs: 19.17 | 31: iteration 72120/ 
173500 | consumed samples: 18462720 | consumed tokens: 37811650560 | elapsed time per iteration (s): 0.79 | learning rate: 1.352E-04 | global batch size: 256 | lm loss: 1.998249E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.599 | TFLOPs: 19.52 | 31: iteration 72130/ 173500 | consumed samples: 18465280 | consumed tokens: 37816893440 | elapsed time per iteration (s): 0.73 | learning rate: 1.352E-04 | global batch size: 256 | lm loss: 2.010448E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.755 | TFLOPs: 21.28 | 31: iteration 72140/ 173500 | consumed samples: 18467840 | consumed tokens: 37822136320 | elapsed time per iteration (s): 0.80 | learning rate: 1.351E-04 | global batch size: 256 | lm loss: 2.017160E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.569 | TFLOPs: 19.45 | 31: iteration 72150/ 173500 | consumed samples: 18470400 | consumed tokens: 37827379200 | elapsed time per iteration (s): 0.73 | learning rate: 1.351E-04 | global batch size: 256 | lm loss: 1.988189E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.220 | TFLOPs: 21.19 | 31: iteration 72160/ 173500 | consumed samples: 18472960 | consumed tokens: 37832622080 | elapsed time per iteration (s): 0.76 | learning rate: 1.351E-04 | global batch size: 256 | lm loss: 2.035461E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.036 | TFLOPs: 20.39 | 31: iteration 72170/ 173500 | consumed samples: 18475520 | consumed tokens: 37837864960 | elapsed time per iteration (s): 0.78 | learning rate: 1.351E-04 | global batch size: 256 | lm loss: 2.022930E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 327.542 | TFLOPs: 19.82 | 31: iteration 72180/ 173500 | consumed samples: 18478080 | consumed tokens: 37843107840 | elapsed time per iteration (s): 0.80 | learning rate: 1.351E-04 | global batch size: 256 | lm loss: 2.022475E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.132 | TFLOPs: 19.25 | 31: iteration 72190/ 173500 | consumed samples: 18480640 | consumed tokens: 37848350720 | elapsed time per iteration (s): 0.79 | learning rate: 1.351E-04 | global batch size: 256 | lm loss: 2.013774E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.202 | TFLOPs: 19.61 | 31: iteration 72200/ 173500 | consumed samples: 18483200 | consumed tokens: 37853593600 | elapsed time per iteration (s): 0.80 | learning rate: 1.350E-04 | global batch size: 256 | lm loss: 2.002951E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.267 | TFLOPs: 19.44 | 31: iteration 72210/ 173500 | consumed samples: 18485760 | consumed tokens: 37858836480 | elapsed time per iteration (s): 0.77 | learning rate: 1.350E-04 | global batch size: 256 | lm loss: 1.986174E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.344 | TFLOPs: 19.98 | 31: iteration 72220/ 173500 | consumed samples: 18488320 | consumed tokens: 37864079360 | elapsed time per iteration (s): 0.80 | learning rate: 1.350E-04 | global batch size: 256 | lm loss: 2.014908E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.887 | TFLOPs: 19.41 | 31: iteration 72230/ 173500 | consumed samples: 18490880 | consumed tokens: 37869322240 | elapsed time per iteration (s): 0.82 | learning rate: 
1.350E-04 | global batch size: 256 | lm loss: 2.024359E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.174 | TFLOPs: 18.95 | 31: iteration 72240/ 173500 | consumed samples: 18493440 | consumed tokens: 37874565120 | elapsed time per iteration (s): 0.75 | learning rate: 1.350E-04 | global batch size: 256 | lm loss: 2.056190E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.199 | TFLOPs: 20.58 | 31: iteration 72250/ 173500 | consumed samples: 18496000 | consumed tokens: 37879808000 | elapsed time per iteration (s): 0.76 | learning rate: 1.350E-04 | global batch size: 256 | lm loss: 2.004177E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.343 | TFLOPs: 20.47 | 31: iteration 72260/ 173500 | consumed samples: 18498560 | consumed tokens: 37885050880 | elapsed time per iteration (s): 0.72 | learning rate: 1.349E-04 | global batch size: 256 | lm loss: 2.004460E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.273 | TFLOPs: 21.37 | 31: iteration 72270/ 173500 | consumed samples: 18501120 | consumed tokens: 37890293760 | elapsed time per iteration (s): 0.76 | learning rate: 1.349E-04 | global batch size: 256 | lm loss: 2.033804E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.713 | TFLOPs: 20.49 | 31: iteration 72280/ 173500 | consumed samples: 18503680 | consumed tokens: 37895536640 | elapsed time per iteration (s): 0.75 | learning rate: 1.349E-04 | global batch size: 256 | lm loss: 2.028021E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.111 | TFLOPs: 20.70 | 31: iteration 72290/ 173500 | 
consumed samples: 18506240 | consumed tokens: 37900779520 | elapsed time per iteration (s): 0.76 | learning rate: 1.349E-04 | global batch size: 256 | lm loss: 1.986812E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.326 | TFLOPs: 20.41 | 31: iteration 72300/ 173500 | consumed samples: 18508800 | consumed tokens: 37906022400 | elapsed time per iteration (s): 0.80 | learning rate: 1.349E-04 | global batch size: 256 | lm loss: 2.043354E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.314 | TFLOPs: 19.44 | 31: iteration 72310/ 173500 | consumed samples: 18511360 | consumed tokens: 37911265280 | elapsed time per iteration (s): 0.74 | learning rate: 1.349E-04 | global batch size: 256 | lm loss: 2.021211E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.386 | TFLOPs: 20.96 | 31: iteration 72320/ 173500 | consumed samples: 18513920 | consumed tokens: 37916508160 | elapsed time per iteration (s): 0.75 | learning rate: 1.349E-04 | global batch size: 256 | lm loss: 1.994311E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.592 | TFLOPs: 20.79 | 31: iteration 72330/ 173500 | consumed samples: 18516480 | consumed tokens: 37921751040 | elapsed time per iteration (s): 0.75 | learning rate: 1.348E-04 | global batch size: 256 | lm loss: 1.994952E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.466 | TFLOPs: 20.54 | 31: iteration 72340/ 173500 | consumed samples: 18519040 | consumed tokens: 37926993920 | elapsed time per iteration (s): 0.82 | learning rate: 1.348E-04 | global batch size: 256 | lm loss: 2.013636E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 310.487 | TFLOPs: 18.78 | 31: iteration 72350/ 173500 | consumed samples: 18521600 | consumed tokens: 37932236800 | elapsed time per iteration (s): 0.72 | learning rate: 1.348E-04 | global batch size: 256 | lm loss: 2.032283E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.795 | TFLOPs: 21.59 | 31: iteration 72360/ 173500 | consumed samples: 18524160 | consumed tokens: 37937479680 | elapsed time per iteration (s): 0.77 | learning rate: 1.348E-04 | global batch size: 256 | lm loss: 2.039610E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.838 | TFLOPs: 20.14 | 31: iteration 72370/ 173500 | consumed samples: 18526720 | consumed tokens: 37942722560 | elapsed time per iteration (s): 0.73 | learning rate: 1.348E-04 | global batch size: 256 | lm loss: 2.030383E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.894 | TFLOPs: 21.23 | 31: iteration 72380/ 173500 | consumed samples: 18529280 | consumed tokens: 37947965440 | elapsed time per iteration (s): 0.81 | learning rate: 1.348E-04 | global batch size: 256 | lm loss: 2.010431E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.415 | TFLOPs: 19.14 | 31: iteration 72390/ 173500 | consumed samples: 18531840 | consumed tokens: 37953208320 | elapsed time per iteration (s): 0.79 | learning rate: 1.347E-04 | global batch size: 256 | lm loss: 2.022053E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.982 | TFLOPs: 19.66 | 31: iteration 72400/ 173500 | consumed samples: 18534400 | consumed tokens: 37958451200 | elapsed time per iteration (s): 0.76 | learning rate: 1.347E-04 | global batch 
size: 256 | lm loss: 2.041038E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.234 | TFLOPs: 20.34 | 31: iteration 72410/ 173500 | consumed samples: 18536960 | consumed tokens: 37963694080 | elapsed time per iteration (s): 0.75 | learning rate: 1.347E-04 | global batch size: 256 | lm loss: 2.033875E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.466 | TFLOPs: 20.78 | 31: iteration 72420/ 173500 | consumed samples: 18539520 | consumed tokens: 37968936960 | elapsed time per iteration (s): 0.78 | learning rate: 1.347E-04 | global batch size: 256 | lm loss: 2.017741E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.858 | TFLOPs: 19.77 | 31: iteration 72430/ 173500 | consumed samples: 18542080 | consumed tokens: 37974179840 | elapsed time per iteration (s): 1.24 | learning rate: 1.347E-04 | global batch size: 256 | lm loss: 2.023126E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 205.987 | TFLOPs: 12.46 | 31: iteration 72440/ 173500 | consumed samples: 18544640 | consumed tokens: 37979422720 | elapsed time per iteration (s): 0.76 | learning rate: 1.347E-04 | global batch size: 256 | lm loss: 2.010914E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.372 | TFLOPs: 20.35 | 31: iteration 72450/ 173500 | consumed samples: 18547200 | consumed tokens: 37984665600 | elapsed time per iteration (s): 0.72 | learning rate: 1.346E-04 | global batch size: 256 | lm loss: 1.995908E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.774 | TFLOPs: 21.64 | 31: iteration 72460/ 173500 | consumed samples: 18549760 | 
consumed tokens: 37989908480 | elapsed time per iteration (s): 0.76 | learning rate: 1.346E-04 | global batch size: 256 | lm loss: 2.045207E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.637 | TFLOPs: 20.31 | 31: iteration 72470/ 173500 | consumed samples: 18552320 | consumed tokens: 37995151360 | elapsed time per iteration (s): 0.75 | learning rate: 1.346E-04 | global batch size: 256 | lm loss: 2.018623E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.462 | TFLOPs: 20.78 | 31: iteration 72480/ 173500 | consumed samples: 18554880 | consumed tokens: 38000394240 | elapsed time per iteration (s): 0.76 | learning rate: 1.346E-04 | global batch size: 256 | lm loss: 2.036341E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.573 | TFLOPs: 20.36 | 31: iteration 72490/ 173500 | consumed samples: 18557440 | consumed tokens: 38005637120 | elapsed time per iteration (s): 0.81 | learning rate: 1.346E-04 | global batch size: 256 | lm loss: 1.999539E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.211 | TFLOPs: 19.07 | 31: iteration 72500/ 173500 | consumed samples: 18560000 | consumed tokens: 38010880000 | elapsed time per iteration (s): 0.79 | learning rate: 1.346E-04 | global batch size: 256 | lm loss: 1.994681E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.290 | TFLOPs: 19.62 | 31: iteration 72510/ 173500 | consumed samples: 18562560 | consumed tokens: 38016122880 | elapsed time per iteration (s): 0.74 | learning rate: 1.346E-04 | global batch size: 256 | lm loss: 2.015254E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 344.117 | TFLOPs: 20.82 | 31: iteration 72520/ 173500 | consumed samples: 18565120 | consumed tokens: 38021365760 | elapsed time per iteration (s): 0.75 | learning rate: 1.345E-04 | global batch size: 256 | lm loss: 1.998837E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.136 | TFLOPs: 20.76 | 31: iteration 72530/ 173500 | consumed samples: 18567680 | consumed tokens: 38026608640 | elapsed time per iteration (s): 0.74 | learning rate: 1.345E-04 | global batch size: 256 | lm loss: 2.030739E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.101 | TFLOPs: 20.94 | 31: iteration 72540/ 173500 | consumed samples: 18570240 | consumed tokens: 38031851520 | elapsed time per iteration (s): 0.73 | learning rate: 1.345E-04 | global batch size: 256 | lm loss: 2.015137E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.143 | TFLOPs: 21.12 | 31: iteration 72550/ 173500 | consumed samples: 18572800 | consumed tokens: 38037094400 | elapsed time per iteration (s): 0.75 | learning rate: 1.345E-04 | global batch size: 256 | lm loss: 1.995790E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.672 | TFLOPs: 20.55 | 31: iteration 72560/ 173500 | consumed samples: 18575360 | consumed tokens: 38042337280 | elapsed time per iteration (s): 0.77 | learning rate: 1.345E-04 | global batch size: 256 | lm loss: 2.028347E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.611 | TFLOPs: 20.24 | 31: iteration 72570/ 173500 | consumed samples: 18577920 | consumed tokens: 38047580160 | elapsed time per iteration (s): 0.78 | learning rate: 1.345E-04 | global batch size: 256 | lm loss: 
2.031557E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.345 | TFLOPs: 19.80 | 31: iteration 72580/ 173500 | consumed samples: 18580480 | consumed tokens: 38052823040 | elapsed time per iteration (s): 0.83 | learning rate: 1.344E-04 | global batch size: 256 | lm loss: 2.022863E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.653 | TFLOPs: 18.55 | 31: iteration 72590/ 173500 | consumed samples: 18583040 | consumed tokens: 38058065920 | elapsed time per iteration (s): 0.82 | learning rate: 1.344E-04 | global batch size: 256 | lm loss: 2.016295E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.951 | TFLOPs: 18.93 | 31: iteration 72600/ 173500 | consumed samples: 18585600 | consumed tokens: 38063308800 | elapsed time per iteration (s): 0.80 | learning rate: 1.344E-04 | global batch size: 256 | lm loss: 2.016143E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.239 | TFLOPs: 19.37 | 31: iteration 72610/ 173500 | consumed samples: 18588160 | consumed tokens: 38068551680 | elapsed time per iteration (s): 0.76 | learning rate: 1.344E-04 | global batch size: 256 | lm loss: 1.993742E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.317 | TFLOPs: 20.29 | 31: iteration 72620/ 173500 | consumed samples: 18590720 | consumed tokens: 38073794560 | elapsed time per iteration (s): 0.80 | learning rate: 1.344E-04 | global batch size: 256 | lm loss: 2.025300E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.639 | TFLOPs: 19.40 | 31: iteration 72630/ 173500 | consumed samples: 18593280 | consumed tokens: 
38079037440 | elapsed time per iteration (s): 0.77 | learning rate: 1.344E-04 | global batch size: 256 | lm loss: 2.005945E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.384 | TFLOPs: 20.23 | 31: iteration 72640/ 173500 | consumed samples: 18595840 | consumed tokens: 38084280320 | elapsed time per iteration (s): 0.77 | learning rate: 1.343E-04 | global batch size: 256 | lm loss: 2.024480E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.942 | TFLOPs: 20.20 | 31: iteration 72650/ 173500 | consumed samples: 18598400 | consumed tokens: 38089523200 | elapsed time per iteration (s): 0.74 | learning rate: 1.343E-04 | global batch size: 256 | lm loss: 2.023340E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.901 | TFLOPs: 20.99 | 31: iteration 72660/ 173500 | consumed samples: 18600960 | consumed tokens: 38094766080 | elapsed time per iteration (s): 0.78 | learning rate: 1.343E-04 | global batch size: 256 | lm loss: 2.006935E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.097 | TFLOPs: 19.79 | 31: iteration 72670/ 173500 | consumed samples: 18603520 | consumed tokens: 38100008960 | elapsed time per iteration (s): 0.81 | learning rate: 1.343E-04 | global batch size: 256 | lm loss: 2.008200E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.044 | TFLOPs: 19.12 | 31: iteration 72680/ 173500 | consumed samples: 18606080 | consumed tokens: 38105251840 | elapsed time per iteration (s): 0.81 | learning rate: 1.343E-04 | global batch size: 256 | lm loss: 2.028515E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 317.886 | TFLOPs: 19.23 | 31: iteration 72690/ 173500 | consumed samples: 18608640 | consumed tokens: 38110494720 | elapsed time per iteration (s): 0.82 | learning rate: 1.343E-04 | global batch size: 256 | lm loss: 2.025558E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.985 | TFLOPs: 18.87 | 31: iteration 72700/ 173500 | consumed samples: 18611200 | consumed tokens: 38115737600 | elapsed time per iteration (s): 0.79 | learning rate: 1.343E-04 | global batch size: 256 | lm loss: 2.005884E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.261 | TFLOPs: 19.56 | 31: iteration 72710/ 173500 | consumed samples: 18613760 | consumed tokens: 38120980480 | elapsed time per iteration (s): 0.79 | learning rate: 1.342E-04 | global batch size: 256 | lm loss: 1.984301E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.657 | TFLOPs: 19.64 | 31: iteration 72720/ 173500 | consumed samples: 18616320 | consumed tokens: 38126223360 | elapsed time per iteration (s): 0.81 | learning rate: 1.342E-04 | global batch size: 256 | lm loss: 2.026539E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.993 | TFLOPs: 19.24 | 31: iteration 72730/ 173500 | consumed samples: 18618880 | consumed tokens: 38131466240 | elapsed time per iteration (s): 0.73 | learning rate: 1.342E-04 | global batch size: 256 | lm loss: 2.054666E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.159 | TFLOPs: 21.12 | 31: iteration 72740/ 173500 | consumed samples: 18621440 | consumed tokens: 38136709120 | elapsed time per iteration (s): 0.75 | learning rate: 1.342E-04 | global batch size: 256 | lm loss: 2.042016E+00 | grad 
norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.453 | TFLOPs: 20.54 | 31: iteration 72750/ 173500 | consumed samples: 18624000 | consumed tokens: 38141952000 | elapsed time per iteration (s): 0.75 | learning rate: 1.342E-04 | global batch size: 256 | lm loss: 2.001365E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.452 | TFLOPs: 20.60 | 31: iteration 72760/ 173500 | consumed samples: 18626560 | consumed tokens: 38147194880 | elapsed time per iteration (s): 0.76 | learning rate: 1.342E-04 | global batch size: 256 | lm loss: 2.039411E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.247 | TFLOPs: 20.46 | 31: iteration 72770/ 173500 | consumed samples: 18629120 | consumed tokens: 38152437760 | elapsed time per iteration (s): 0.77 | learning rate: 1.341E-04 | global batch size: 256 | lm loss: 2.004339E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.984 | TFLOPs: 20.02 | 31: iteration 72780/ 173500 | consumed samples: 18631680 | consumed tokens: 38157680640 | elapsed time per iteration (s): 0.75 | learning rate: 1.341E-04 | global batch size: 256 | lm loss: 2.047704E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.472 | TFLOPs: 20.78 | 31: iteration 72790/ 173500 | consumed samples: 18634240 | consumed tokens: 38162923520 | elapsed time per iteration (s): 0.80 | learning rate: 1.341E-04 | global batch size: 256 | lm loss: 2.031581E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.996 | TFLOPs: 19.48 | 31: iteration 72800/ 173500 | consumed samples: 18636800 | consumed tokens: 38168166400 | elapsed time 
per iteration (s): 0.77 | learning rate: 1.341E-04 | global batch size: 256 | lm loss: 2.009533E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.304 | TFLOPs: 20.22 | 31: iteration 72810/ 173500 | consumed samples: 18639360 | consumed tokens: 38173409280 | elapsed time per iteration (s): 0.79 | learning rate: 1.341E-04 | global batch size: 256 | lm loss: 2.047712E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.440 | TFLOPs: 19.63 | 31: iteration 72820/ 173500 | consumed samples: 18641920 | consumed tokens: 38178652160 | elapsed time per iteration (s): 0.81 | learning rate: 1.341E-04 | global batch size: 256 | lm loss: 2.016072E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.970 | TFLOPs: 19.24 | 31: iteration 72830/ 173500 | consumed samples: 18644480 | consumed tokens: 38183895040 | elapsed time per iteration (s): 0.88 | learning rate: 1.340E-04 | global batch size: 256 | lm loss: 2.010051E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.878 | TFLOPs: 17.66 | 31: iteration 72840/ 173500 | consumed samples: 18647040 | consumed tokens: 38189137920 | elapsed time per iteration (s): 0.80 | learning rate: 1.340E-04 | global batch size: 256 | lm loss: 2.024940E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.277 | TFLOPs: 19.32 | 31: iteration 72850/ 173500 | consumed samples: 18649600 | consumed tokens: 38194380800 | elapsed time per iteration (s): 0.78 | learning rate: 1.340E-04 | global batch size: 256 | lm loss: 2.036466E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.139 | TFLOPs: 
19.73 | 31: iteration 72860/ 173500 | consumed samples: 18652160 | consumed tokens: 38199623680 | elapsed time per iteration (s): 0.76 | learning rate: 1.340E-04 | global batch size: 256 | lm loss: 2.018186E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.669 | TFLOPs: 20.25 | 31: iteration 72870/ 173500 | consumed samples: 18654720 | consumed tokens: 38204866560 | elapsed time per iteration (s): 0.79 | learning rate: 1.340E-04 | global batch size: 256 | lm loss: 2.016573E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.760 | TFLOPs: 19.59 | 31: iteration 72880/ 173500 | consumed samples: 18657280 | consumed tokens: 38210109440 | elapsed time per iteration (s): 0.79 | learning rate: 1.340E-04 | global batch size: 256 | lm loss: 2.026060E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.521 | TFLOPs: 19.57 | 31: iteration 72890/ 173500 | consumed samples: 18659840 | consumed tokens: 38215352320 | elapsed time per iteration (s): 0.81 | learning rate: 1.340E-04 | global batch size: 256 | lm loss: 2.025734E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.083 | TFLOPs: 19.06 | 31: iteration 72900/ 173500 | consumed samples: 18662400 | consumed tokens: 38220595200 | elapsed time per iteration (s): 0.81 | learning rate: 1.339E-04 | global batch size: 256 | lm loss: 1.990678E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.846 | TFLOPs: 19.23 | 31: iteration 72910/ 173500 | consumed samples: 18664960 | consumed tokens: 38225838080 | elapsed time per iteration (s): 0.75 | learning rate: 1.339E-04 | global batch size: 256 | lm loss: 2.001342E+00 | grad norm: 0.141 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.967 | TFLOPs: 20.69 | 31: iteration 72920/ 173500 | consumed samples: 18667520 | consumed tokens: 38231080960 | elapsed time per iteration (s): 0.73 | learning rate: 1.339E-04 | global batch size: 256 | lm loss: 2.010439E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.937 | TFLOPs: 21.29 | 31: iteration 72930/ 173500 | consumed samples: 18670080 | consumed tokens: 38236323840 | elapsed time per iteration (s): 0.72 | learning rate: 1.339E-04 | global batch size: 256 | lm loss: 2.010079E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.108 | TFLOPs: 21.48 | 31: iteration 72940/ 173500 | consumed samples: 18672640 | consumed tokens: 38241566720 | elapsed time per iteration (s): 0.74 | learning rate: 1.339E-04 | global batch size: 256 | lm loss: 1.984961E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.024 | TFLOPs: 20.93 | 31: iteration 72950/ 173500 | consumed samples: 18675200 | consumed tokens: 38246809600 | elapsed time per iteration (s): 0.77 | learning rate: 1.339E-04 | global batch size: 256 | lm loss: 2.048064E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.042 | TFLOPs: 20.03 | 31: iteration 72960/ 173500 | consumed samples: 18677760 | consumed tokens: 38252052480 | elapsed time per iteration (s): 0.75 | learning rate: 1.338E-04 | global batch size: 256 | lm loss: 2.024693E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.692 | TFLOPs: 20.55 | 31: iteration 72970/ 173500 | consumed samples: 18680320 | consumed tokens: 38257295360 | elapsed time per iteration (s): 0.74 | 
learning rate: 1.338E-04 | global batch size: 256 | lm loss: 2.005404E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.230 | TFLOPs: 21.07 | 31: iteration 72980/ 173500 | consumed samples: 18682880 | consumed tokens: 38262538240 | elapsed time per iteration (s): 0.82 | learning rate: 1.338E-04 | global batch size: 256 | lm loss: 2.032627E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.795 | TFLOPs: 18.92 | 31: iteration 72990/ 173500 | consumed samples: 18685440 | consumed tokens: 38267781120 | elapsed time per iteration (s): 0.74 | learning rate: 1.338E-04 | global batch size: 256 | lm loss: 2.047960E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.637 | TFLOPs: 20.97 | 31: iteration 73000/ 173500 | consumed samples: 18688000 | consumed tokens: 38273024000 | elapsed time per iteration (s): 0.73 | learning rate: 1.338E-04 | global batch size: 256 | lm loss: 2.017893E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.157 | TFLOPs: 21.30 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 73000 | lm loss value: 1.923388E+00 | lm loss PPL: 6.844104E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 73000 to checkpoints_1b1long 0: [2022-11-26 10:28:42,065] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step73000 is begin to save! 0: [2022-11-26 10:28:42,076] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 10:28:42,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_01-model_00-model_states.pt. 0: [2022-11-26 10:28:42,291] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_03-model_00-model_states.pt... 0: [2022-11-26 10:28:42,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_03-model_00-model_states.pt. 0: [2022-11-26 10:28:42,374] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_04-model_00-model_states.pt... 0: [2022-11-26 10:28:42,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_04-model_00-model_states.pt. 0: [2022-11-26 10:28:42,453] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_05-model_00-model_states.pt... 0: [2022-11-26 10:28:42,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_05-model_00-model_states.pt. 0: [2022-11-26 10:28:42,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_06-model_00-model_states.pt... 0: [2022-11-26 10:28:42,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_06-model_00-model_states.pt. 0: [2022-11-26 10:28:42,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_07-model_00-model_states.pt... 0: [2022-11-26 10:28:42,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_07-model_00-model_states.pt. 0: [2022-11-26 10:28:42,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 10:28:42,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_08-model_00-model_states.pt. 0: [2022-11-26 10:28:42,758] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_09-model_00-model_states.pt... 0: [2022-11-26 10:28:42,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_09-model_00-model_states.pt. 0: [2022-11-26 10:28:42,833] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_10-model_00-model_states.pt... 0: [2022-11-26 10:28:42,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_10-model_00-model_states.pt. 0: [2022-11-26 10:28:42,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_11-model_00-model_states.pt... 0: [2022-11-26 10:28:42,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_11-model_00-model_states.pt. 0: [2022-11-26 10:28:42,982] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_12-model_00-model_states.pt... 0: [2022-11-26 10:28:43,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_12-model_00-model_states.pt. 0: [2022-11-26 10:28:43,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_13-model_00-model_states.pt... 0: [2022-11-26 10:28:43,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_13-model_00-model_states.pt. 0: [2022-11-26 10:28:43,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 10:28:43,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_14-model_00-model_states.pt. 0: [2022-11-26 10:28:43,206] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_15-model_00-model_states.pt... 0: [2022-11-26 10:28:43,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_15-model_00-model_states.pt. 0: [2022-11-26 10:28:43,281] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_16-model_00-model_states.pt... 0: [2022-11-26 10:28:43,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_16-model_00-model_states.pt. 0: [2022-11-26 10:28:43,352] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_17-model_00-model_states.pt... 0: [2022-11-26 10:28:43,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_17-model_00-model_states.pt. 0: [2022-11-26 10:28:43,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_18-model_00-model_states.pt... 0: [2022-11-26 10:28:43,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_18-model_00-model_states.pt. 0: [2022-11-26 10:28:43,502] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_19-model_00-model_states.pt... 0: [2022-11-26 10:28:43,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_19-model_00-model_states.pt. 0: [2022-11-26 10:28:43,574] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 10:28:43,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_20-model_00-model_states.pt. 0: [2022-11-26 10:28:43,651] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_21-model_00-model_states.pt... 0: [2022-11-26 10:28:43,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_21-model_00-model_states.pt. 0: [2022-11-26 10:28:43,727] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_22-model_00-model_states.pt... 0: [2022-11-26 10:28:43,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_22-model_00-model_states.pt. 0: [2022-11-26 10:28:43,801] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_23-model_00-model_states.pt... 0: [2022-11-26 10:28:43,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_23-model_00-model_states.pt. 0: [2022-11-26 10:28:43,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_24-model_00-model_states.pt... 0: [2022-11-26 10:28:43,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_24-model_00-model_states.pt. 0: [2022-11-26 10:28:43,949] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_25-model_00-model_states.pt... 0: [2022-11-26 10:28:44,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_25-model_00-model_states.pt. 0: [2022-11-26 10:28:44,024] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 10:28:44,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_26-model_00-model_states.pt. 0: [2022-11-26 10:28:44,109] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_27-model_00-model_states.pt... 0: [2022-11-26 10:28:44,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_27-model_00-model_states.pt. 0: [2022-11-26 10:28:44,187] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_28-model_00-model_states.pt... 0: [2022-11-26 10:28:44,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_28-model_00-model_states.pt. 0: [2022-11-26 10:28:44,259] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/layer_30-model_00-model_states.pt... 0: [2022-11-26 10:28:44,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/layer_30-model_00-model_states.pt. 0: [2022-11-26 10:28:44,263] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step73000/mp_rank_00_model_states.pt 0: [2022-11-26 10:28:44,263] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/mp_rank_00_model_states.pt... 0: [2022-11-26 10:28:44,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/mp_rank_00_model_states.pt. 0: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
6: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
4: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 
10: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
13: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 
25: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 
14: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 
17: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 
19: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 
5: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
10: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
3: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 
25: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 
24: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 
22: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 
19: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 
8: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 
3: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 
11: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 
29: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
26: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
10: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 
30: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
20: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 
30: [2022-11-26 10:28:44,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:28:44,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:28:44,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:28:44,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 10:28:44,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 21: [2022-11-26 10:28:44,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:28:44,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 10:28:44,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 12: [2022-11-26 10:28:44,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:28:44,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 10:28:44,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 20: [2022-11-26 10:28:44,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-26 10:28:44,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 10:28:44,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 14: [2022-11-26 10:28:44,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:28:44,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 10:28:44,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 11: [2022-11-26 10:28:44,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:28:44,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 18: [2022-11-26 10:28:44,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:28:44,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 18: [2022-11-26 10:28:44,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 10:28:44,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 10: [2022-11-26 10:28:44,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
19: [2022-11-26 10:28:44,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:28:44,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:28:44,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 19: [2022-11-26 10:28:44,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 10: [2022-11-26 10:28:44,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 10: [2022-11-26 10:28:44,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 19: [2022-11-26 10:28:44,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 10: [2022-11-26 10:28:44,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 18: [2022-11-26 10:28:44,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:28:44,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 10:28:44,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 2: [2022-11-26 10:28:44,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 10:28:44,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 10:28:44,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 16: [2022-11-26 10:28:44,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:28:44,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:28:44,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:28:44,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:28:44,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 11: [2022-11-26 10:28:44,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 8: [2022-11-26 10:28:44,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 10:28:44,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
16: [2022-11-26 10:28:44,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 3: [2022-11-26 10:28:44,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 11: [2022-11-26 10:28:44,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 8: [2022-11-26 10:28:44,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 16: [2022-11-26 10:28:44,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 8: [2022-11-26 10:28:44,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 24: [2022-11-26 10:28:44,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:28:44,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 9: [2022-11-26 10:28:44,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:28:44,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 9: [2022-11-26 10:28:44,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 24: [2022-11-26 10:28:44,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 9: [2022-11-26 10:28:44,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
16: [2022-11-26 10:28:44,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:28:44,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 10:28:44,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 29: [2022-11-26 10:28:44,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:28:44,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 10:28:44,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 27: [2022-11-26 10:28:44,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:28:44,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 10:28:44,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 16: [2022-11-26 10:28:44,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:28:44,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 10:28:44,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
21: [2022-11-26 10:28:44,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:28:44,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 10:28:44,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 15: [2022-11-26 10:28:44,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:28:44,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 10:28:44,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 12: [2022-11-26 10:28:44,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:28:44,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 10:28:44,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 19: [2022-11-26 10:28:44,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 
19: [2022-11-26 10:28:44,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 23: [2022-11-26 10:28:44,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:28:44,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 23: [2022-11-26 10:28:44,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 10:28:44,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 20: [2022-11-26 10:28:44,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:28:44,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 10:28:44,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 29: [2022-11-26 10:28:44,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:28:44,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 10:28:44,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 6: [2022-11-26 10:28:44,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 10:28:44,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:28:44,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 10:28:44,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 10:28:44,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 6: [2022-11-26 10:28:44,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 7: [2022-11-26 10:28:44,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:28:44,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:28:44,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 10:28:44,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 14: [2022-11-26 10:28:44,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 17: [2022-11-26 10:28:44,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:28:44,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
17: [2022-11-26 10:28:44,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 10:28:44,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 2: [2022-11-26 10:28:44,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:28:44,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:28:44,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 10: [2022-11-26 10:28:44,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 2: [2022-11-26 10:28:44,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 10: [2022-11-26 10:28:44,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 7: [2022-11-26 10:28:44,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:28:44,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:28:44,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 
19: [2022-11-26 10:28:44,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 7: [2022-11-26 10:28:44,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 31: [2022-11-26 10:28:44,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 19: [2022-11-26 10:28:44,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 31: [2022-11-26 10:28:44,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 7: [2022-11-26 10:28:44,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 3: [2022-11-26 10:28:44,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:28:44,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:28:44,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
18: [2022-11-26 10:28:44,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 3: [2022-11-26 10:28:44,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 24: [2022-11-26 10:28:44,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 18: [2022-11-26 10:28:44,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 3: [2022-11-26 10:28:44,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 24: [2022-11-26 10:28:44,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 21: [2022-11-26 10:28:44,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:28:44,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 10:28:44,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 15: [2022-11-26 10:28:44,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:28:44,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 10:28:44,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
28: [2022-11-26 10:28:44,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:28:44,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:28:44,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:28:44,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 17: [2022-11-26 10:28:44,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 8: [2022-11-26 10:28:44,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 17: [2022-11-26 10:28:44,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 28: [2022-11-26 10:28:44,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 10:28:44,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 11: [2022-11-26 10:28:44,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:28:44,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 10:28:44,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
15: [2022-11-26 10:28:44,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:28:44,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:28:44,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 9: [2022-11-26 10:28:44,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 15: [2022-11-26 10:28:44,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 28: [2022-11-26 10:28:44,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:28:44,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 12: [2022-11-26 10:28:44,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:28:44,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 10:28:44,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 23: [2022-11-26 10:28:44,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-26 10:28:44,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 10:28:44,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 6: [2022-11-26 10:28:44,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:28:44,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 10:28:44,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 27: [2022-11-26 10:28:44,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:28:44,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:28:44,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 24: [2022-11-26 10:28:44,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 27: [2022-11-26 10:28:44,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 24: [2022-11-26 10:28:44,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 15: [2022-11-26 10:28:44,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-26 10:28:44,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 10:28:44,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 19: [2022-11-26 10:28:44,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:28:44,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:28:44,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 10:28:44,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 6: [2022-11-26 10:28:44,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 10:28:44,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 18: [2022-11-26 10:28:44,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:28:44,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 27: [2022-11-26 10:28:44,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:28:44,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
27: [2022-11-26 10:28:44,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 10:28:44,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 10: [2022-11-26 10:28:44,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:28:44,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 10:28:44,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 16: [2022-11-26 10:28:44,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:28:44,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 10:28:44,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 23: [2022-11-26 10:28:44,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:28:44,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 10:28:44,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:28:44,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
23: [2022-11-26 10:28:44,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 10:28:44,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 3: [2022-11-26 10:28:44,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:28:44,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:28:44,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 2: [2022-11-26 10:28:44,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 3: [2022-11-26 10:28:44,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 2: [2022-11-26 10:28:44,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 29: [2022-11-26 10:28:44,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:28:44,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 10:28:44,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 31: [2022-11-26 10:28:44,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-26 10:28:44,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 10:28:44,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 13: [2022-11-26 10:28:44,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:28:44,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 10:28:44,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 13: [2022-11-26 10:28:44,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:28:44,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 10:28:44,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 31: [2022-11-26 10:28:44,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:28:44,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 10:28:44,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 0: [2022-11-26 10:28:44,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-26 10:28:44,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 13: [2022-11-26 10:28:44,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:28:44,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:28:44,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 13: [2022-11-26 10:28:44,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 10:28:44,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 20: [2022-11-26 10:28:44,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:28:44,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 13: [2022-11-26 10:28:44,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 20: [2022-11-26 10:28:44,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 10:28:44,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 8: [2022-11-26 10:28:44,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 10:28:44,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 10:28:44,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 0: [2022-11-26 10:28:44,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:28:44,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:28:44,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 10:28:44,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 10:28:44,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 0: [2022-11-26 10:28:44,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 7: [2022-11-26 10:28:44,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:28:44,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 12: [2022-11-26 10:28:44,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:28:44,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
12: [2022-11-26 10:28:44,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 11: [2022-11-26 10:28:44,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:28:44,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 11: [2022-11-26 10:28:44,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 10:28:44,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 1: [2022-11-26 10:28:44,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:28:44,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 10:28:44,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 24: [2022-11-26 10:28:44,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:28:44,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 10:28:44,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 17: [2022-11-26 10:28:44,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 
17: [2022-11-26 10:28:44,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:28:44,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 10:28:44,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 10:28:44,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 17: [2022-11-26 10:28:44,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 1: [2022-11-26 10:28:44,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:28:44,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:28:44,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 1: [2022-11-26 10:28:44,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 9: [2022-11-26 10:28:44,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:28:44,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 1: [2022-11-26 10:28:44,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
9: [2022-11-26 10:28:44,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 10:28:44,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 14: [2022-11-26 10:28:44,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:28:44,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 10:28:44,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 7: [2022-11-26 10:28:44,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:28:44,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 10:28:44,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 28: [2022-11-26 10:28:44,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 10:28:44,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 28: [2022-11-26 10:28:44,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
28: [2022-11-26 10:28:44,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 10:28:44,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 29: [2022-11-26 10:28:44,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:28:44,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 10:28:44,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 28: [2022-11-26 10:28:44,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:28:44,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:28:44,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:28:44,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 10:28:44,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 4: [2022-11-26 10:28:44,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 10:28:44,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
5: [2022-11-26 10:28:44,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:28:44,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:28:44,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:28:44,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:28:44,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 10:28:44,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 10:28:44,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 10:28:44,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 10:28:44,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 5: [2022-11-26 10:28:44,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 5: [2022-11-26 10:28:44,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 5: [2022-11-26 10:28:44,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
0: [2022-11-26 10:28:44,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 10:28:44,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 25: [2022-11-26 10:28:44,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:28:44,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:28:44,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:28:44,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 31: [2022-11-26 10:28:44,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:28:44,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 10:28:44,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 10:28:44,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 25: [2022-11-26 10:28:44,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 25: [2022-11-26 10:28:44,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
31: [2022-11-26 10:28:44,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 10:28:44,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 30: [2022-11-26 10:28:44,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:28:44,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:28:44,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:28:44,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:28:44,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 10:28:44,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 10:28:44,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 10:28:44,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 10:28:44,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
30: [2022-11-26 10:28:44,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 30: [2022-11-26 10:28:44,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 30: [2022-11-26 10:28:44,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 26: [2022-11-26 10:28:44,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:28:44,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:28:44,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:28:44,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:28:44,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 10:28:44,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 10:28:44,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
26: [2022-11-26 10:28:44,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 10:28:44,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 10:28:44,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 26: [2022-11-26 10:28:44,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 26: [2022-11-26 10:28:44,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 28: [2022-11-26 10:28:44,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 10:28:44,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 22: [2022-11-26 10:28:44,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:28:44,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:28:44,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:28:44,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
22: [2022-11-26 10:28:44,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 10:28:44,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 10:28:44,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 10:28:44,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 10:28:44,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 22: [2022-11-26 10:28:44,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 22: [2022-11-26 10:28:44,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 22: [2022-11-26 10:28:44,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 14: [2022-11-26 10:28:44,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:28:44,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 10:28:44,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 3: [2022-11-26 10:28:44,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 10:28:44,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 10:28:44,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 1: [2022-11-26 10:28:44,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:28:44,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 10:28:44,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 27: [2022-11-26 10:28:44,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:28:44,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 10:28:44,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 21: [2022-11-26 10:28:44,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:28:44,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 10:28:44,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 9: [2022-11-26 10:28:44,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-26 10:28:44,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 10:28:44,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 2: [2022-11-26 10:28:44,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:28:44,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 10:28:44,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 6: [2022-11-26 10:28:44,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:28:44,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 10:28:44,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 20: [2022-11-26 10:28:44,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:28:44,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 16: [2022-11-26 10:28:44,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:28:44,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
16: [2022-11-26 10:28:44,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 10:28:44,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 12: [2022-11-26 10:28:44,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:28:44,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 10:28:44,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 15: [2022-11-26 10:28:44,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:28:44,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 10:28:44,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 18: [2022-11-26 10:28:44,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:28:44,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 10:28:44,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 26: [2022-11-26 10:28:44,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
26: [2022-11-26 10:28:44,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 10:28:44,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 23: [2022-11-26 10:28:44,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:28:44,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 10:28:44,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 5: [2022-11-26 10:28:44,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:28:44,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:28:44,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 10: [2022-11-26 10:28:44,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 5: [2022-11-26 10:28:44,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 10: [2022-11-26 10:28:44,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 13: [2022-11-26 10:28:44,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 10:28:44,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 10:28:44,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 19: [2022-11-26 10:28:44,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:28:44,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 10:28:44,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 8: [2022-11-26 10:28:44,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:28:44,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 10:28:44,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 17: [2022-11-26 10:28:44,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:28:44,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 10:28:44,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 22: [2022-11-26 10:28:44,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
22: [2022-11-26 10:28:44,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 10:28:44,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 25: [2022-11-26 10:28:44,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:28:44,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 10:28:44,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 7: [2022-11-26 10:28:44,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:28:44,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 10:28:44,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 4: [2022-11-26 10:28:44,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:28:44,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 10:28:44,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 11: [2022-11-26 10:28:44,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 10:28:44,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 10:28:44,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 3: [2022-11-26 10:28:44,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:28:44,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 10:28:44,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 1: [2022-11-26 10:28:44,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:28:44,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 10:28:44,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 14: [2022-11-26 10:28:44,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:28:44,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 10:28:44,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 28: [2022-11-26 10:28:44,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-26 10:28:44,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 30: [2022-11-26 10:28:44,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:28:44,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 10:28:44,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 21: [2022-11-26 10:28:44,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:28:44,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 10:28:44,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 0: [2022-11-26 10:28:44,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:28:44,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 10:28:44,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 9: [2022-11-26 10:28:44,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 10:28:44,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 24: [2022-11-26 10:28:44,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:28:44,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 29: [2022-11-26 10:28:44,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:28:44,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 29: [2022-11-26 10:28:44,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 10:28:44,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 24: [2022-11-26 10:28:44,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 6: [2022-11-26 10:28:44,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:28:44,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 10:28:44,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 2: [2022-11-26 10:28:44,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 10:28:44,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 10:28:44,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 16: [2022-11-26 10:28:44,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:28:44,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 28: [2022-11-26 10:28:44,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 16: [2022-11-26 10:28:44,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 27: [2022-11-26 10:28:44,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:28:44,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:28:44,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 10:28:44,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 31: [2022-11-26 10:28:44,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 
31: [2022-11-26 10:28:44,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 10:28:44,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 20: [2022-11-26 10:28:44,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 10:28:44,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 15: [2022-11-26 10:28:44,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:28:44,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 10:28:44,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 12: [2022-11-26 10:28:44,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:28:44,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:28:44,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 18: [2022-11-26 10:28:44,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 12: [2022-11-26 10:28:44,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
18: [2022-11-26 10:28:44,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 17: [2022-11-26 10:28:44,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:28:44,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 10:28:44,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 5: [2022-11-26 10:28:44,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:28:44,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 10:28:44,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 26: [2022-11-26 10:28:44,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:28:44,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 10:28:44,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 23: [2022-11-26 10:28:44,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
23: [2022-11-26 10:28:44,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 10:28:44,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 8: [2022-11-26 10:28:44,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:28:44,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 10:28:44,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 19: [2022-11-26 10:28:44,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:28:44,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 10:28:44,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 13: [2022-11-26 10:28:44,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:28:44,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 10:28:44,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 25: [2022-11-26 10:28:44,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
25: [2022-11-26 10:28:44,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 10:28:44,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 0: [2022-11-26 10:28:44,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:28:44,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 10:28:44,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 4: [2022-11-26 10:28:44,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:28:44,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 10:28:44,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 7: [2022-11-26 10:28:44,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:28:44,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 10:28:44,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 22: [2022-11-26 10:28:44,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
24: [2022-11-26 10:28:44,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:28:44,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 10:28:44,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 24: [2022-11-26 10:28:44,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 10:28:44,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 30: [2022-11-26 10:28:44,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:28:44,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 10:28:44,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 11: [2022-11-26 10:28:44,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:28:44,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 10:28:44,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 9: [2022-11-26 10:28:44,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-26 10:28:44,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 10:28:44,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 3: [2022-11-26 10:28:44,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:28:44,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 10:28:44,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 1: [2022-11-26 10:28:44,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 28: [2022-11-26 10:28:44,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:28:44,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 28: [2022-11-26 10:28:44,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 1: [2022-11-26 10:28:44,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 28: [2022-11-26 10:28:44,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 27: [2022-11-26 10:28:44,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
27: [2022-11-26 10:28:44,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 14: [2022-11-26 10:28:44,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:28:44,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 14: [2022-11-26 10:28:44,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 10:28:44,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 6: [2022-11-26 10:28:44,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:28:44,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 10:28:44,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 15: [2022-11-26 10:28:44,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:28:44,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 10:28:44,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 31: [2022-11-26 10:28:44,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
16: [2022-11-26 10:28:44,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:28:44,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 16: [2022-11-26 10:28:44,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 31: [2022-11-26 10:28:44,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 16: [2022-11-26 10:28:44,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 18: [2022-11-26 10:28:44,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:28:44,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:28:44,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 10:28:44,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 2: [2022-11-26 10:28:44,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 10:28:44,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 29: [2022-11-26 10:28:44,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-26 10:28:44,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 10:28:44,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 10: [2022-11-26 10:28:44,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:28:44,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 10:28:44,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 12: [2022-11-26 10:28:44,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:28:44,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 10:28:44,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 10: [2022-11-26 10:28:44,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:28:44,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 10:28:44,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 20: [2022-11-26 10:28:44,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 
20: [2022-11-26 10:28:44,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 10:28:44,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 26: [2022-11-26 10:28:44,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:28:44,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 10:28:44,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 8: [2022-11-26 10:28:44,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:28:44,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 10:28:44,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 19: [2022-11-26 10:28:44,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:28:44,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 10:28:44,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 17: [2022-11-26 10:28:44,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-26 10:28:44,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 10:28:44,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 5: [2022-11-26 10:28:44,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:28:44,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 10:28:44,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 13: [2022-11-26 10:28:44,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:28:44,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 21: [2022-11-26 10:28:44,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:28:44,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 21: [2022-11-26 10:28:44,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 10:28:44,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 23: [2022-11-26 10:28:44,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-26 10:28:44,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 10:28:44,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 4: [2022-11-26 10:28:44,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:28:44,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 10:28:44,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 22: [2022-11-26 10:28:44,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:28:44,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 10:28:44,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 0: [2022-11-26 10:28:44,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:28:44,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 10:28:44,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 30: [2022-11-26 10:28:44,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 
30: [2022-11-26 10:28:44,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 10:28:44,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 7: [2022-11-26 10:28:44,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:28:44,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 10:28:44,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 25: [2022-11-26 10:28:44,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:28:44,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 10:28:44,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 11: [2022-11-26 10:28:44,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:28:44,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:28:44,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 10:28:44,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
24: [2022-11-26 10:28:44,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 10:28:44,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 28: [2022-11-26 10:28:44,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:28:44,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:28:44,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 10:28:44,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 31: [2022-11-26 10:28:44,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:28:44,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 10:28:44,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 20: [2022-11-26 10:28:44,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:28:44,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 10:28:44,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
8: [2022-11-26 10:28:44,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:28:44,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 10:28:44,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 2: [2022-11-26 10:28:44,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:28:44,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 10:28:44,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 12: [2022-11-26 10:28:44,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:28:44,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:28:44,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 10:28:44,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
28: [2022-11-26 10:28:44,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 1: [2022-11-26 10:28:44,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 28: [2022-11-26 10:28:44,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 1: [2022-11-26 10:28:44,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 14: [2022-11-26 10:28:44,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:28:44,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 10:28:44,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 5: [2022-11-26 10:28:44,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:28:44,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 10:28:44,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 3: [2022-11-26 10:28:44,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 10:28:44,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 10:28:44,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 18: [2022-11-26 10:28:44,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:28:44,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 10:28:44,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 15: [2022-11-26 10:28:44,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:28:44,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 6: [2022-11-26 10:28:44,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:28:44,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 6: [2022-11-26 10:28:44,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 10:28:44,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 17: [2022-11-26 10:28:44,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-26 10:28:44,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 10:28:44,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 9: [2022-11-26 10:28:44,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:28:44,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 10:28:44,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 27: [2022-11-26 10:28:44,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:28:44,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 10:28:44,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 22: [2022-11-26 10:28:44,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:28:44,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 10:28:44,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 30: [2022-11-26 10:28:44,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-26 10:28:44,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 7: [2022-11-26 10:28:44,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:28:44,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 7: [2022-11-26 10:28:44,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 19: [2022-11-26 10:28:44,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:28:44,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 19: [2022-11-26 10:28:44,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 21: [2022-11-26 10:28:44,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:28:44,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 19: [2022-11-26 10:28:44,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 21: [2022-11-26 10:28:44,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 13: [2022-11-26 10:28:44,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 10:28:44,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 26: [2022-11-26 10:28:44,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:28:44,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 26: [2022-11-26 10:28:44,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 10:28:44,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 10: [2022-11-26 10:28:44,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:28:44,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:28:44,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 29: [2022-11-26 10:28:44,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 10: [2022-11-26 10:28:44,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 29: [2022-11-26 10:28:44,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 11: [2022-11-26 10:28:44,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-26 10:28:44,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 10:28:44,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 28: [2022-11-26 10:28:44,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 10:28:44,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 10:28:44,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 3: [2022-11-26 10:28:44,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:28:44,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 10:28:44,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 31: [2022-11-26 10:28:44,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:28:44,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:28:44,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 10:28:44,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
23: [2022-11-26 10:28:44,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 10:28:44,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 25: [2022-11-26 10:28:44,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:28:44,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 10:28:44,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 0: [2022-11-26 10:28:44,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:28:44,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 1: [2022-11-26 10:28:44,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:28:44,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 14: [2022-11-26 10:28:44,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:28:44,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 10:28:44,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
14: [2022-11-26 10:28:44,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 10:28:44,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 24: [2022-11-26 10:28:44,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:28:44,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 10:28:44,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 9: [2022-11-26 10:28:44,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:28:44,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 10:28:44,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 4: [2022-11-26 10:28:44,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:28:44,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:28:44,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 10:28:44,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
2: [2022-11-26 10:28:44,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 10:28:44,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 21: [2022-11-26 10:28:44,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:28:44,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 10:28:44,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 27: [2022-11-26 10:28:44,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:28:44,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 10:28:44,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 25: [2022-11-26 10:28:44,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:28:44,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 10:28:44,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 25: [2022-11-26 10:28:44,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 10:28:44,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 1: [2022-11-26 10:28:44,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 4: [2022-11-26 10:28:44,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:28:44,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 10:28:44,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 29: [2022-11-26 10:28:44,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:28:44,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step73000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 10:28:44,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
0: successfully saved checkpoint at iteration 73000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2513.70 31: iteration 73010/ 173500 | consumed samples: 18690560 | consumed tokens: 38278266880 | elapsed time per iteration (s): 1.08 | learning rate: 1.338E-04 | global batch size: 256 | lm loss: 2.029073E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.396 | TFLOPs: 14.36 | 31: iteration 73020/ 173500 | consumed samples: 18693120 | consumed tokens: 38283509760 | elapsed time per iteration (s): 0.85 | learning rate: 1.337E-04 | global batch size: 256 | lm loss: 2.032501E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.579 | TFLOPs: 18.31 | 31: iteration 73030/ 173500 | consumed samples: 18695680 | consumed tokens: 38288752640 | elapsed time per iteration (s): 0.79 | learning rate: 1.337E-04 | global batch size: 256 | lm loss: 2.009093E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.790 | TFLOPs: 19.65 | 31: iteration 73040/ 173500 | consumed samples: 18698240 | consumed tokens: 38293995520 | elapsed time per iteration (s): 0.77 | learning rate: 1.337E-04 | global batch size: 256 | lm loss: 2.028990E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.628 | TFLOPs: 20.18 | 31: iteration 73050/ 173500 | consumed samples: 18700800 | consumed tokens: 38299238400 | elapsed time per iteration (s): 0.72 | learning rate: 1.337E-04 | global batch size: 256 | lm loss: 2.029488E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.701 | TFLOPs: 21.58 | 31: iteration 73060/ 173500 | consumed samples: 18703360 | consumed tokens: 38304481280 | elapsed time per iteration (s): 0.73 | 
learning rate: 1.337E-04 | global batch size: 256 | lm loss: 2.023115E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.990 | TFLOPs: 21.29 | 31: iteration 73070/ 173500 | consumed samples: 18705920 | consumed tokens: 38309724160 | elapsed time per iteration (s): 0.72 | learning rate: 1.337E-04 | global batch size: 256 | lm loss: 2.015884E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.612 | TFLOPs: 21.39 | 31: iteration 73080/ 173500 | consumed samples: 18708480 | consumed tokens: 38314967040 | elapsed time per iteration (s): 0.77 | learning rate: 1.336E-04 | global batch size: 256 | lm loss: 2.005197E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.226 | TFLOPs: 20.16 | 31: iteration 73090/ 173500 | consumed samples: 18711040 | consumed tokens: 38320209920 | elapsed time per iteration (s): 0.75 | learning rate: 1.336E-04 | global batch size: 256 | lm loss: 2.020407E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.080 | TFLOPs: 20.63 | 31: iteration 73100/ 173500 | consumed samples: 18713600 | consumed tokens: 38325452800 | elapsed time per iteration (s): 0.76 | learning rate: 1.336E-04 | global batch size: 256 | lm loss: 2.038095E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.686 | TFLOPs: 20.37 | 31: iteration 73110/ 173500 | consumed samples: 18716160 | consumed tokens: 38330695680 | elapsed time per iteration (s): 0.73 | learning rate: 1.336E-04 | global batch size: 256 | lm loss: 2.004653E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.421 | TFLOPs: 21.08 | 31: iteration 73120/ 
173500 | consumed samples: 18718720 | consumed tokens: 38335938560 | elapsed time per iteration (s): 0.82 | learning rate: 1.336E-04 | global batch size: 256 | lm loss: 2.041282E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.055 | TFLOPs: 18.94 | 31: iteration 73130/ 173500 | consumed samples: 18721280 | consumed tokens: 38341181440 | elapsed time per iteration (s): 0.74 | learning rate: 1.336E-04 | global batch size: 256 | lm loss: 2.031750E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.870 | TFLOPs: 20.92 | 31: iteration 73140/ 173500 | consumed samples: 18723840 | consumed tokens: 38346424320 | elapsed time per iteration (s): 0.78 | learning rate: 1.336E-04 | global batch size: 256 | lm loss: 2.011142E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.620 | TFLOPs: 19.82 | 31: iteration 73150/ 173500 | consumed samples: 18726400 | consumed tokens: 38351667200 | elapsed time per iteration (s): 0.81 | learning rate: 1.335E-04 | global batch size: 256 | lm loss: 2.011765E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.103 | TFLOPs: 19.06 | 31: iteration 73160/ 173500 | consumed samples: 18728960 | consumed tokens: 38356910080 | elapsed time per iteration (s): 0.80 | learning rate: 1.335E-04 | global batch size: 256 | lm loss: 1.997856E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.374 | TFLOPs: 19.44 | 31: iteration 73170/ 173500 | consumed samples: 18731520 | consumed tokens: 38362152960 | elapsed time per iteration (s): 0.81 | learning rate: 1.335E-04 | global batch size: 256 | lm loss: 2.003873E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 315.122 | TFLOPs: 19.06 | 31: iteration 73180/ 173500 | consumed samples: 18734080 | consumed tokens: 38367395840 | elapsed time per iteration (s): 0.78 | learning rate: 1.335E-04 | global batch size: 256 | lm loss: 2.024613E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.956 | TFLOPs: 19.96 | 31: iteration 73190/ 173500 | consumed samples: 18736640 | consumed tokens: 38372638720 | elapsed time per iteration (s): 0.78 | learning rate: 1.335E-04 | global batch size: 256 | lm loss: 1.999549E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.582 | TFLOPs: 19.82 | 31: iteration 73200/ 173500 | consumed samples: 18739200 | consumed tokens: 38377881600 | elapsed time per iteration (s): 0.83 | learning rate: 1.335E-04 | global batch size: 256 | lm loss: 2.036256E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.821 | TFLOPs: 18.68 | 31: iteration 73210/ 173500 | consumed samples: 18741760 | consumed tokens: 38383124480 | elapsed time per iteration (s): 0.75 | learning rate: 1.334E-04 | global batch size: 256 | lm loss: 2.045913E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.612 | TFLOPs: 20.55 | 31: iteration 73220/ 173500 | consumed samples: 18744320 | consumed tokens: 38388367360 | elapsed time per iteration (s): 0.78 | learning rate: 1.334E-04 | global batch size: 256 | lm loss: 1.987888E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.261 | TFLOPs: 19.92 | 31: iteration 73230/ 173500 | consumed samples: 18746880 | consumed tokens: 38393610240 | elapsed time per iteration (s): 0.75 | learning rate: 
1.334E-04 | global batch size: 256 | lm loss: 2.016204E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.790 | TFLOPs: 20.74 | 31: iteration 73240/ 173500 | consumed samples: 18749440 | consumed tokens: 38398853120 | elapsed time per iteration (s): 0.74 | learning rate: 1.334E-04 | global batch size: 256 | lm loss: 2.017759E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.369 | TFLOPs: 21.01 | 31: iteration 73250/ 173500 | consumed samples: 18752000 | consumed tokens: 38404096000 | elapsed time per iteration (s): 0.75 | learning rate: 1.334E-04 | global batch size: 256 | lm loss: 2.049785E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.927 | TFLOPs: 20.75 | 31: iteration 73260/ 173500 | consumed samples: 18754560 | consumed tokens: 38409338880 | elapsed time per iteration (s): 0.74 | learning rate: 1.334E-04 | global batch size: 256 | lm loss: 2.027975E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.902 | TFLOPs: 20.87 | 31: iteration 73270/ 173500 | consumed samples: 18757120 | consumed tokens: 38414581760 | elapsed time per iteration (s): 0.76 | learning rate: 1.333E-04 | global batch size: 256 | lm loss: 2.005126E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.576 | TFLOPs: 20.36 | 31: iteration 73280/ 173500 | consumed samples: 18759680 | consumed tokens: 38419824640 | elapsed time per iteration (s): 0.79 | learning rate: 1.333E-04 | global batch size: 256 | lm loss: 2.018887E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.722 | TFLOPs: 19.71 | 31: iteration 73290/ 173500 | 
consumed samples: 18762240 | consumed tokens: 38425067520 | elapsed time per iteration (s): 0.78 | learning rate: 1.333E-04 | global batch size: 256 | lm loss: 2.025336E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.268 | TFLOPs: 19.98 | 31: iteration 73300/ 173500 | consumed samples: 18764800 | consumed tokens: 38430310400 | elapsed time per iteration (s): 0.71 | learning rate: 1.333E-04 | global batch size: 256 | lm loss: 2.017253E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 363.052 | TFLOPs: 21.96 | 31: iteration 73310/ 173500 | consumed samples: 18767360 | consumed tokens: 38435553280 | elapsed time per iteration (s): 0.82 | learning rate: 1.333E-04 | global batch size: 256 | lm loss: 1.991726E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.885 | TFLOPs: 18.93 | 31: iteration 73320/ 173500 | consumed samples: 18769920 | consumed tokens: 38440796160 | elapsed time per iteration (s): 0.83 | learning rate: 1.333E-04 | global batch size: 256 | lm loss: 2.019513E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.753 | TFLOPs: 18.68 | 31: iteration 73330/ 173500 | consumed samples: 18772480 | consumed tokens: 38446039040 | elapsed time per iteration (s): 0.76 | learning rate: 1.333E-04 | global batch size: 256 | lm loss: 2.017326E+00 | grad norm: 0.230 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.997 | TFLOPs: 20.27 | 31: iteration 73340/ 173500 | consumed samples: 18775040 | consumed tokens: 38451281920 | elapsed time per iteration (s): 0.82 | learning rate: 1.332E-04 | global batch size: 256 | lm loss: 2.025359E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 312.957 | TFLOPs: 18.93 | 31: iteration 73350/ 173500 | consumed samples: 18777600 | consumed tokens: 38456524800 | elapsed time per iteration (s): 0.75 | learning rate: 1.332E-04 | global batch size: 256 | lm loss: 2.027782E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.879 | TFLOPs: 20.62 | 31: iteration 73360/ 173500 | consumed samples: 18780160 | consumed tokens: 38461767680 | elapsed time per iteration (s): 0.76 | learning rate: 1.332E-04 | global batch size: 256 | lm loss: 2.004838E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.687 | TFLOPs: 20.25 | 31: iteration 73370/ 173500 | consumed samples: 18782720 | consumed tokens: 38467010560 | elapsed time per iteration (s): 0.84 | learning rate: 1.332E-04 | global batch size: 256 | lm loss: 2.006817E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.535 | TFLOPs: 18.54 | 31: iteration 73380/ 173500 | consumed samples: 18785280 | consumed tokens: 38472253440 | elapsed time per iteration (s): 0.79 | learning rate: 1.332E-04 | global batch size: 256 | lm loss: 2.014106E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.928 | TFLOPs: 19.54 | 31: iteration 73390/ 173500 | consumed samples: 18787840 | consumed tokens: 38477496320 | elapsed time per iteration (s): 0.79 | learning rate: 1.332E-04 | global batch size: 256 | lm loss: 2.004507E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.934 | TFLOPs: 19.60 | 31: iteration 73400/ 173500 | consumed samples: 18790400 | consumed tokens: 38482739200 | elapsed time per iteration (s): 0.80 | learning rate: 1.331E-04 | global batch 
size: 256 | lm loss: 2.017926E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.299 | TFLOPs: 19.44 | 31: iteration 73410/ 173500 | consumed samples: 18792960 | consumed tokens: 38487982080 | elapsed time per iteration (s): 0.79 | learning rate: 1.331E-04 | global batch size: 256 | lm loss: 2.014118E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.709 | TFLOPs: 19.58 | 31: iteration 73420/ 173500 | consumed samples: 18795520 | consumed tokens: 38493224960 | elapsed time per iteration (s): 2.65 | learning rate: 1.331E-04 | global batch size: 256 | lm loss: 2.020132E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 96.771 | TFLOPs: 5.85 | 31: iteration 73430/ 173500 | consumed samples: 18798080 | consumed tokens: 38498467840 | elapsed time per iteration (s): 0.87 | learning rate: 1.331E-04 | global batch size: 256 | lm loss: 2.034777E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.477 | TFLOPs: 17.88 | 31: iteration 73440/ 173500 | consumed samples: 18800640 | consumed tokens: 38503710720 | elapsed time per iteration (s): 0.86 | learning rate: 1.331E-04 | global batch size: 256 | lm loss: 2.012851E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.980 | TFLOPs: 18.09 | 31: iteration 73450/ 173500 | consumed samples: 18803200 | consumed tokens: 38508953600 | elapsed time per iteration (s): 0.83 | learning rate: 1.331E-04 | global batch size: 256 | lm loss: 1.995037E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.902 | TFLOPs: 18.69 | 31: iteration 73460/ 173500 | consumed samples: 18805760 | 
consumed tokens: 38514196480 | elapsed time per iteration (s): 0.89 | learning rate: 1.330E-04 | global batch size: 256 | lm loss: 1.989085E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.240 | TFLOPs: 17.32 | 31: iteration 73470/ 173500 | consumed samples: 18808320 | consumed tokens: 38519439360 | elapsed time per iteration (s): 0.84 | learning rate: 1.330E-04 | global batch size: 256 | lm loss: 2.010777E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.974 | TFLOPs: 18.51 | 31: iteration 73480/ 173500 | consumed samples: 18810880 | consumed tokens: 38524682240 | elapsed time per iteration (s): 0.84 | learning rate: 1.330E-04 | global batch size: 256 | lm loss: 2.003167E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.995 | TFLOPs: 18.45 | 31: iteration 73490/ 173500 | consumed samples: 18813440 | consumed tokens: 38529925120 | elapsed time per iteration (s): 0.82 | learning rate: 1.330E-04 | global batch size: 256 | lm loss: 2.011360E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.096 | TFLOPs: 19.00 | 31: iteration 73500/ 173500 | consumed samples: 18816000 | consumed tokens: 38535168000 | elapsed time per iteration (s): 0.84 | learning rate: 1.330E-04 | global batch size: 256 | lm loss: 2.028190E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.515 | TFLOPs: 18.54 | 31: iteration 73510/ 173500 | consumed samples: 18818560 | consumed tokens: 38540410880 | elapsed time per iteration (s): 0.81 | learning rate: 1.330E-04 | global batch size: 256 | lm loss: 2.024448E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 317.044 | TFLOPs: 19.18 | 31: iteration 73520/ 173500 | consumed samples: 18821120 | consumed tokens: 38545653760 | elapsed time per iteration (s): 0.85 | learning rate: 1.330E-04 | global batch size: 256 | lm loss: 2.040843E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.510 | TFLOPs: 18.12 | 31: iteration 73530/ 173500 | consumed samples: 18823680 | consumed tokens: 38550896640 | elapsed time per iteration (s): 0.80 | learning rate: 1.329E-04 | global batch size: 256 | lm loss: 2.020600E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.258 | TFLOPs: 19.31 | 31: iteration 73540/ 173500 | consumed samples: 18826240 | consumed tokens: 38556139520 | elapsed time per iteration (s): 0.80 | learning rate: 1.329E-04 | global batch size: 256 | lm loss: 2.026899E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.768 | TFLOPs: 19.35 | 31: iteration 73550/ 173500 | consumed samples: 18828800 | consumed tokens: 38561382400 | elapsed time per iteration (s): 0.83 | learning rate: 1.329E-04 | global batch size: 256 | lm loss: 2.063381E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.268 | TFLOPs: 18.59 | 31: iteration 73560/ 173500 | consumed samples: 18831360 | consumed tokens: 38566625280 | elapsed time per iteration (s): 0.80 | learning rate: 1.329E-04 | global batch size: 256 | lm loss: 2.027679E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.493 | TFLOPs: 19.27 | 31: iteration 73570/ 173500 | consumed samples: 18833920 | consumed tokens: 38571868160 | elapsed time per iteration (s): 0.80 | learning rate: 1.329E-04 | global batch size: 256 | lm loss: 
1.982152E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.957 | TFLOPs: 19.42 | 31: iteration 73580/ 173500 | consumed samples: 18836480 | consumed tokens: 38577111040 | elapsed time per iteration (s): 0.79 | learning rate: 1.329E-04 | global batch size: 256 | lm loss: 2.001788E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.017 | TFLOPs: 19.54 | 31: iteration 73590/ 173500 | consumed samples: 18839040 | consumed tokens: 38582353920 | elapsed time per iteration (s): 0.79 | learning rate: 1.328E-04 | global batch size: 256 | lm loss: 2.054537E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.595 | TFLOPs: 19.58 | 31: iteration 73600/ 173500 | consumed samples: 18841600 | consumed tokens: 38587596800 | elapsed time per iteration (s): 0.74 | learning rate: 1.328E-04 | global batch size: 256 | lm loss: 2.008965E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.404 | TFLOPs: 20.96 | 31: iteration 73610/ 173500 | consumed samples: 18844160 | consumed tokens: 38592839680 | elapsed time per iteration (s): 0.81 | learning rate: 1.328E-04 | global batch size: 256 | lm loss: 2.025954E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.549 | TFLOPs: 19.09 | 31: iteration 73620/ 173500 | consumed samples: 18846720 | consumed tokens: 38598082560 | elapsed time per iteration (s): 0.75 | learning rate: 1.328E-04 | global batch size: 256 | lm loss: 1.993271E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.555 | TFLOPs: 20.54 | 31: iteration 73630/ 173500 | consumed samples: 18849280 | consumed tokens: 
38603325440 | elapsed time per iteration (s): 0.79 | learning rate: 1.328E-04 | global batch size: 256 | lm loss: 2.025270E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.211 | TFLOPs: 19.67 | 31: iteration 73640/ 173500 | consumed samples: 18851840 | consumed tokens: 38608568320 | elapsed time per iteration (s): 0.78 | learning rate: 1.328E-04 | global batch size: 256 | lm loss: 2.024623E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.453 | TFLOPs: 19.75 | 31: iteration 73650/ 173500 | consumed samples: 18854400 | consumed tokens: 38613811200 | elapsed time per iteration (s): 0.81 | learning rate: 1.327E-04 | global batch size: 256 | lm loss: 2.002378E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.036 | TFLOPs: 19.06 | 31: iteration 73660/ 173500 | consumed samples: 18856960 | consumed tokens: 38619054080 | elapsed time per iteration (s): 0.78 | learning rate: 1.327E-04 | global batch size: 256 | lm loss: 2.058730E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.831 | TFLOPs: 19.89 | 31: iteration 73670/ 173500 | consumed samples: 18859520 | consumed tokens: 38624296960 | elapsed time per iteration (s): 0.78 | learning rate: 1.327E-04 | global batch size: 256 | lm loss: 2.006585E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.413 | TFLOPs: 19.81 | 31: iteration 73680/ 173500 | consumed samples: 18862080 | consumed tokens: 38629539840 | elapsed time per iteration (s): 0.76 | learning rate: 1.327E-04 | global batch size: 256 | lm loss: 2.027058E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 337.833 | TFLOPs: 20.44 | 31: iteration 73690/ 173500 | consumed samples: 18864640 | consumed tokens: 38634782720 | elapsed time per iteration (s): 0.76 | learning rate: 1.327E-04 | global batch size: 256 | lm loss: 2.019403E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.435 | TFLOPs: 20.41 | 31: iteration 73700/ 173500 | consumed samples: 18867200 | consumed tokens: 38640025600 | elapsed time per iteration (s): 0.79 | learning rate: 1.327E-04 | global batch size: 256 | lm loss: 2.014188E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.074 | TFLOPs: 19.73 | 31: iteration 73710/ 173500 | consumed samples: 18869760 | consumed tokens: 38645268480 | elapsed time per iteration (s): 0.76 | learning rate: 1.326E-04 | global batch size: 256 | lm loss: 2.021324E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.453 | TFLOPs: 20.42 | 31: iteration 73720/ 173500 | consumed samples: 18872320 | consumed tokens: 38650511360 | elapsed time per iteration (s): 0.80 | learning rate: 1.326E-04 | global batch size: 256 | lm loss: 2.032126E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.734 | TFLOPs: 19.28 | 31: iteration 73730/ 173500 | consumed samples: 18874880 | consumed tokens: 38655754240 | elapsed time per iteration (s): 0.74 | learning rate: 1.326E-04 | global batch size: 256 | lm loss: 2.018204E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.246 | TFLOPs: 20.83 | 31: iteration 73740/ 173500 | consumed samples: 18877440 | consumed tokens: 38660997120 | elapsed time per iteration (s): 0.82 | learning rate: 1.326E-04 | global batch size: 256 | lm loss: 2.043649E+00 | grad 
norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.428 | TFLOPs: 18.90 | 31: iteration 73750/ 173500 | consumed samples: 18880000 | consumed tokens: 38666240000 | elapsed time per iteration (s): 0.73 | learning rate: 1.326E-04 | global batch size: 256 | lm loss: 2.023531E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.528 | TFLOPs: 21.27 | 31: iteration 73760/ 173500 | consumed samples: 18882560 | consumed tokens: 38671482880 | elapsed time per iteration (s): 0.76 | learning rate: 1.326E-04 | global batch size: 256 | lm loss: 2.003035E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.244 | TFLOPs: 20.34 | 31: iteration 73770/ 173500 | consumed samples: 18885120 | consumed tokens: 38676725760 | elapsed time per iteration (s): 2.49 | learning rate: 1.326E-04 | global batch size: 256 | lm loss: 2.014725E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 102.993 | TFLOPs: 6.23 | 31: iteration 73780/ 173500 | consumed samples: 18887680 | consumed tokens: 38681968640 | elapsed time per iteration (s): 0.75 | learning rate: 1.325E-04 | global batch size: 256 | lm loss: 2.005217E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.343 | TFLOPs: 20.71 | 31: iteration 73790/ 173500 | consumed samples: 18890240 | consumed tokens: 38687211520 | elapsed time per iteration (s): 0.76 | learning rate: 1.325E-04 | global batch size: 256 | lm loss: 2.039278E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.315 | TFLOPs: 20.41 | 31: iteration 73800/ 173500 | consumed samples: 18892800 | consumed tokens: 38692454400 | elapsed time 
per iteration (s): 0.78 | learning rate: 1.325E-04 | global batch size: 256 | lm loss: 1.986308E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.055 | TFLOPs: 19.91 | 31: iteration 73810/ 173500 | consumed samples: 18895360 | consumed tokens: 38697697280 | elapsed time per iteration (s): 0.92 | learning rate: 1.325E-04 | global batch size: 256 | lm loss: 2.009456E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 279.566 | TFLOPs: 16.91 | 31: iteration 73820/ 173500 | consumed samples: 18897920 | consumed tokens: 38702940160 | elapsed time per iteration (s): 0.77 | learning rate: 1.325E-04 | global batch size: 256 | lm loss: 2.008222E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.086 | TFLOPs: 20.03 | 31: iteration 73830/ 173500 | consumed samples: 18900480 | consumed tokens: 38708183040 | elapsed time per iteration (s): 0.73 | learning rate: 1.325E-04 | global batch size: 256 | lm loss: 2.034494E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.180 | TFLOPs: 21.31 | 31: iteration 73840/ 173500 | consumed samples: 18903040 | consumed tokens: 38713425920 | elapsed time per iteration (s): 0.77 | learning rate: 1.324E-04 | global batch size: 256 | lm loss: 2.036698E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.046 | TFLOPs: 20.03 | 31: iteration 73850/ 173500 | consumed samples: 18905600 | consumed tokens: 38718668800 | elapsed time per iteration (s): 0.72 | learning rate: 1.324E-04 | global batch size: 256 | lm loss: 2.014614E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.510 | TFLOPs: 
21.45 | 31: iteration 73860/ 173500 | consumed samples: 18908160 | consumed tokens: 38723911680 | elapsed time per iteration (s): 0.79 | learning rate: 1.324E-04 | global batch size: 256 | lm loss: 2.030321E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.443 | TFLOPs: 19.57 | 31: iteration 73870/ 173500 | consumed samples: 18910720 | consumed tokens: 38729154560 | elapsed time per iteration (s): 0.79 | learning rate: 1.324E-04 | global batch size: 256 | lm loss: 2.026138E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.143 | TFLOPs: 19.67 | 31: iteration 73880/ 173500 | consumed samples: 18913280 | consumed tokens: 38734397440 | elapsed time per iteration (s): 0.88 | learning rate: 1.324E-04 | global batch size: 256 | lm loss: 2.023838E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.313 | TFLOPs: 17.62 | 31: iteration 73890/ 173500 | consumed samples: 18915840 | consumed tokens: 38739640320 | elapsed time per iteration (s): 0.83 | learning rate: 1.324E-04 | global batch size: 256 | lm loss: 2.021084E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.204 | TFLOPs: 18.71 | 31: iteration 73900/ 173500 | consumed samples: 18918400 | consumed tokens: 38744883200 | elapsed time per iteration (s): 0.86 | learning rate: 1.323E-04 | global batch size: 256 | lm loss: 2.012553E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.792 | TFLOPs: 18.08 | 31: iteration 73910/ 173500 | consumed samples: 18920960 | consumed tokens: 38750126080 | elapsed time per iteration (s): 0.85 | learning rate: 1.323E-04 | global batch size: 256 | lm loss: 1.998912E+00 | grad norm: 0.144 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.016 | TFLOPs: 18.27 | 31: iteration 73920/ 173500 | consumed samples: 18923520 | consumed tokens: 38755368960 | elapsed time per iteration (s): 0.78 | learning rate: 1.323E-04 | global batch size: 256 | lm loss: 2.049411E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.288 | TFLOPs: 19.74 | 31: iteration 73930/ 173500 | consumed samples: 18926080 | consumed tokens: 38760611840 | elapsed time per iteration (s): 0.81 | learning rate: 1.323E-04 | global batch size: 256 | lm loss: 1.998088E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.349 | TFLOPs: 19.02 | 31: iteration 73940/ 173500 | consumed samples: 18928640 | consumed tokens: 38765854720 | elapsed time per iteration (s): 0.79 | learning rate: 1.323E-04 | global batch size: 256 | lm loss: 2.004205E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.311 | TFLOPs: 19.68 | 31: iteration 73950/ 173500 | consumed samples: 18931200 | consumed tokens: 38771097600 | elapsed time per iteration (s): 0.77 | learning rate: 1.323E-04 | global batch size: 256 | lm loss: 2.043276E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.613 | TFLOPs: 20.12 | 31: iteration 73960/ 173500 | consumed samples: 18933760 | consumed tokens: 38776340480 | elapsed time per iteration (s): 0.75 | learning rate: 1.322E-04 | global batch size: 256 | lm loss: 1.977461E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.583 | TFLOPs: 20.54 | 31: iteration 73970/ 173500 | consumed samples: 18936320 | consumed tokens: 38781583360 | elapsed time per iteration (s): 0.81 | 
learning rate: 1.322E-04 | global batch size: 256 | lm loss: 2.034417E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.458 | TFLOPs: 19.08 | 31: iteration 73980/ 173500 | consumed samples: 18938880 | consumed tokens: 38786826240 | elapsed time per iteration (s): 0.75 | learning rate: 1.322E-04 | global batch size: 256 | lm loss: 2.036605E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.245 | TFLOPs: 20.52 | 31: iteration 73990/ 173500 | consumed samples: 18941440 | consumed tokens: 38792069120 | elapsed time per iteration (s): 0.76 | learning rate: 1.322E-04 | global batch size: 256 | lm loss: 2.013018E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.146 | TFLOPs: 20.40 | 0: [2022-11-26 10:42:28,019] [INFO] [logging.py:68:log_dist] [Rank 0] step=74000, skipped=0, lr=[0.0001321851851828754, 0.0001321851851828754, 0.0001321851851828754], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 74000/ 173500 | consumed samples: 18944000 | consumed tokens: 38797312000 | elapsed time per iteration (s): 0.76 | learning rate: 1.322E-04 | global batch size: 256 | lm loss: 2.017291E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.189 | TFLOPs: 20.46 | 0: steps: 74000 loss: 1.9755 iter time (s): 0.804 samples/sec: 318.448 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 74000 | lm loss value: 1.907903E+00 | lm loss PPL: 6.738944E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 74000 to checkpoints_1b1long 0: [2022-11-26 10:42:28,306] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint 
global_step74000 is begin to save! 0: [2022-11-26 10:42:28,318] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_01-model_00-model_states.pt... 0: [2022-11-26 10:42:28,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_01-model_00-model_states.pt. 0: [2022-11-26 10:42:28,548] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_03-model_00-model_states.pt... 0: [2022-11-26 10:42:28,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_03-model_00-model_states.pt. 0: [2022-11-26 10:42:28,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_04-model_00-model_states.pt... 0: [2022-11-26 10:42:28,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_04-model_00-model_states.pt. 0: [2022-11-26 10:42:28,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_05-model_00-model_states.pt... 0: [2022-11-26 10:42:28,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_05-model_00-model_states.pt. 0: [2022-11-26 10:42:28,784] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_06-model_00-model_states.pt... 0: [2022-11-26 10:42:28,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_06-model_00-model_states.pt. 0: [2022-11-26 10:42:28,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_07-model_00-model_states.pt... 0: [2022-11-26 10:42:28,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 10:42:28,935] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_08-model_00-model_states.pt... 0: [2022-11-26 10:42:29,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_08-model_00-model_states.pt. 0: [2022-11-26 10:42:29,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_09-model_00-model_states.pt... 0: [2022-11-26 10:42:29,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_09-model_00-model_states.pt. 0: [2022-11-26 10:42:29,090] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_10-model_00-model_states.pt... 0: [2022-11-26 10:42:29,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_10-model_00-model_states.pt. 0: [2022-11-26 10:42:29,164] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_11-model_00-model_states.pt... 0: [2022-11-26 10:42:29,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_11-model_00-model_states.pt. 0: [2022-11-26 10:42:29,242] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_12-model_00-model_states.pt... 0: [2022-11-26 10:42:29,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_12-model_00-model_states.pt. 0: [2022-11-26 10:42:29,315] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_13-model_00-model_states.pt... 0: [2022-11-26 10:42:29,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 10:42:29,395] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_14-model_00-model_states.pt... 0: [2022-11-26 10:42:29,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_14-model_00-model_states.pt. 0: [2022-11-26 10:42:29,471] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_15-model_00-model_states.pt... 0: [2022-11-26 10:42:29,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_15-model_00-model_states.pt. 0: [2022-11-26 10:42:29,545] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_16-model_00-model_states.pt... 0: [2022-11-26 10:42:29,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_16-model_00-model_states.pt. 0: [2022-11-26 10:42:29,622] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_17-model_00-model_states.pt... 0: [2022-11-26 10:42:29,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_17-model_00-model_states.pt. 0: [2022-11-26 10:42:29,697] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_18-model_00-model_states.pt... 0: [2022-11-26 10:42:29,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_18-model_00-model_states.pt. 0: [2022-11-26 10:42:29,774] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_19-model_00-model_states.pt... 0: [2022-11-26 10:42:29,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 10:42:29,849] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_20-model_00-model_states.pt... 0: [2022-11-26 10:42:29,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_20-model_00-model_states.pt. 0: [2022-11-26 10:42:29,923] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_21-model_00-model_states.pt... 0: [2022-11-26 10:42:30,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_21-model_00-model_states.pt. 0: [2022-11-26 10:42:30,000] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_22-model_00-model_states.pt... 0: [2022-11-26 10:42:30,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_22-model_00-model_states.pt. 0: [2022-11-26 10:42:30,076] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_23-model_00-model_states.pt... 0: [2022-11-26 10:42:30,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_23-model_00-model_states.pt. 0: [2022-11-26 10:42:30,149] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_24-model_00-model_states.pt... 0: [2022-11-26 10:42:30,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_24-model_00-model_states.pt. 0: [2022-11-26 10:42:30,226] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_25-model_00-model_states.pt... 0: [2022-11-26 10:42:30,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 10:42:30,302] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_26-model_00-model_states.pt... 0: [2022-11-26 10:42:30,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_26-model_00-model_states.pt. 0: [2022-11-26 10:42:30,376] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_27-model_00-model_states.pt... 0: [2022-11-26 10:42:30,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_27-model_00-model_states.pt. 0: [2022-11-26 10:42:30,455] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_28-model_00-model_states.pt... 0: [2022-11-26 10:42:30,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_28-model_00-model_states.pt. 0: [2022-11-26 10:42:30,531] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/layer_30-model_00-model_states.pt... 0: [2022-11-26 10:42:30,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/layer_30-model_00-model_states.pt. 0: [2022-11-26 10:42:30,533] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step74000/mp_rank_00_model_states.pt 0: [2022-11-26 10:42:30,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/mp_rank_00_model_states.pt... 0: [2022-11-26 10:42:30,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/mp_rank_00_model_states.pt. 0: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
6: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
4: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
1: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
13: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
15: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 
23: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 
28: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 
22: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 
18: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
6: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
4: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
2: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
15: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 
24: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 
30: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
0: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 
10: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
15: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 
31: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 
27: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 
23: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 
17: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 
0: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 
21: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:42:30,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:42:30,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:42:30,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 10:42:30,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 20: [2022-11-26 10:42:30,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:42:30,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 10:42:30,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 14: [2022-11-26 10:42:30,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:42:30,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
14: [2022-11-26 10:42:30,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 3: [2022-11-26 10:42:30,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 14: [2022-11-26 10:42:30,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 3: [2022-11-26 10:42:30,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 7: [2022-11-26 10:42:30,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:42:30,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 10:42:30,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 21: [2022-11-26 10:42:30,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:42:30,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 10:42:30,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 10: [2022-11-26 10:42:30,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:42:30,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
10: [2022-11-26 10:42:30,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 19: [2022-11-26 10:42:30,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 10: [2022-11-26 10:42:30,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 19: [2022-11-26 10:42:30,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 22: [2022-11-26 10:42:30,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:42:30,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 0: [2022-11-26 10:42:30,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:42:30,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:42:30,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:42:30,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
0: [2022-11-26 10:42:30,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 13: [2022-11-26 10:42:30,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 10:42:30,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 0: [2022-11-26 10:42:30,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 13: [2022-11-26 10:42:30,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 13: [2022-11-26 10:42:30,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 6: [2022-11-26 10:42:30,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:42:30,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 10:42:30,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 9: [2022-11-26 10:42:30,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:42:30,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 10:42:30,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
24: [2022-11-26 10:42:30,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:42:30,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:42:30,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 24: [2022-11-26 10:42:30,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 16: [2022-11-26 10:42:30,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:42:30,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 10: [2022-11-26 10:42:30,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 16: [2022-11-26 10:42:30,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 10:42:30,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 27: [2022-11-26 10:42:30,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:42:30,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 10:42:30,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
11: [2022-11-26 10:42:30,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:42:30,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:42:30,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 18: [2022-11-26 10:42:30,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 11: [2022-11-26 10:42:30,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 18: [2022-11-26 10:42:30,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 20: [2022-11-26 10:42:30,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:42:30,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:42:30,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 20: [2022-11-26 10:42:30,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 12: [2022-11-26 10:42:30,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 20: [2022-11-26 10:42:30,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
7: [2022-11-26 10:42:30,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:42:30,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 10:42:30,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 23: [2022-11-26 10:42:30,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:42:30,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 11: [2022-11-26 10:42:30,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:42:30,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:42:30,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 11: [2022-11-26 10:42:30,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 14: [2022-11-26 10:42:30,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 11: [2022-11-26 10:42:30,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 14: [2022-11-26 10:42:30,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
4: [2022-11-26 10:42:30,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:42:30,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 10:42:30,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 26: [2022-11-26 10:42:30,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:42:30,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 10:42:30,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 16: [2022-11-26 10:42:30,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:42:30,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 10:42:30,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 30: [2022-11-26 10:42:30,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:42:30,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 10:42:30,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
6: [2022-11-26 10:42:30,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:42:30,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 10:42:30,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 23: [2022-11-26 10:42:30,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:42:30,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 9: [2022-11-26 10:42:30,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:42:30,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 9: [2022-11-26 10:42:30,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 10:42:30,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 22: [2022-11-26 10:42:30,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:42:30,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 10:42:30,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
26: [2022-11-26 10:42:30,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:42:30,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 10:42:30,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 4: [2022-11-26 10:42:30,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:42:30,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 10:42:30,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 19: [2022-11-26 10:42:30,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:42:30,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:42:30,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 18: [2022-11-26 10:42:30,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 19: [2022-11-26 10:42:30,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 18: [2022-11-26 10:42:30,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
12: [2022-11-26 10:42:30,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:42:30,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 5: [2022-11-26 10:42:30,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:42:30,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:42:30,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 5: [2022-11-26 10:42:30,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 3: [2022-11-26 10:42:30,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 5: [2022-11-26 10:42:30,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 3: [2022-11-26 10:42:30,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 5: [2022-11-26 10:42:30,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:42:30,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 10:42:30,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
5: [2022-11-26 10:42:30,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:42:30,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:42:30,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 30: [2022-11-26 10:42:30,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 5: [2022-11-26 10:42:30,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 30: [2022-11-26 10:42:30,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 13: [2022-11-26 10:42:30,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:42:30,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 10:42:30,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 0: [2022-11-26 10:42:30,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:42:30,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 10:42:30,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
21: [2022-11-26 10:42:30,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:42:30,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 10:42:30,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 10: [2022-11-26 10:42:30,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:42:30,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 10:42:30,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 19: [2022-11-26 10:42:30,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:42:30,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 10:42:30,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 9: [2022-11-26 10:42:30,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:42:30,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 10:42:30,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
20: [2022-11-26 10:42:30,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:42:30,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 10:42:30,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 25: [2022-11-26 10:42:30,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:42:30,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 27: [2022-11-26 10:42:30,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:42:30,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 14: [2022-11-26 10:42:30,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:42:30,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 10:42:30,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 27: [2022-11-26 10:42:30,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 10:42:30,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
24: [2022-11-26 10:42:30,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:42:30,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 10:42:30,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 25: [2022-11-26 10:42:30,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:42:30,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 21: [2022-11-26 10:42:30,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:42:30,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 21: [2022-11-26 10:42:30,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 10:42:30,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 25: [2022-11-26 10:42:30,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:42:30,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 10:42:30,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
24: [2022-11-26 10:42:30,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:42:30,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 30: [2022-11-26 10:42:30,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:42:30,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 30: [2022-11-26 10:42:30,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 10:42:30,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 16: [2022-11-26 10:42:30,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:42:30,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 10:42:30,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 18: [2022-11-26 10:42:30,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:42:30,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 10:42:30,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
12: [2022-11-26 10:42:30,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:42:30,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 10:42:30,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 18: [2022-11-26 10:42:30,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:42:30,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 10:42:30,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 22: [2022-11-26 10:42:30,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:42:30,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 10:42:30,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 27: [2022-11-26 10:42:30,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:42:30,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:42:30,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 10:42:30,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:42:30,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 10:42:30,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 11: [2022-11-26 10:42:30,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 8: [2022-11-26 10:42:30,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 8: [2022-11-26 10:42:30,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 11: [2022-11-26 10:42:30,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 27: [2022-11-26 10:42:30,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 10:42:30,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 11: [2022-11-26 10:42:30,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:42:30,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 23: [2022-11-26 10:42:30,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
11: [2022-11-26 10:42:30,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 23: [2022-11-26 10:42:30,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 10:42:30,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 13: [2022-11-26 10:42:30,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:42:30,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 10:42:30,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 9: [2022-11-26 10:42:30,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:42:30,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 10:42:30,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 20: [2022-11-26 10:42:30,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:42:30,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 10:42:30,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
6: [2022-11-26 10:42:30,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:42:30,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:42:30,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 7: [2022-11-26 10:42:30,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 6: [2022-11-26 10:42:30,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 7: [2022-11-26 10:42:30,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 8: [2022-11-26 10:42:30,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:42:30,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 10:42:30,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 26: [2022-11-26 10:42:30,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:42:30,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 10:42:30,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
0: [2022-11-26 10:42:30,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:42:30,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 10:42:30,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 14: [2022-11-26 10:42:30,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:42:30,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 10:42:30,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 17: [2022-11-26 10:42:30,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:42:30,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:42:30,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 10:42:30,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 10:42:30,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 3: [2022-11-26 10:42:30,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
17: [2022-11-26 10:42:30,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 3: [2022-11-26 10:42:30,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 10:42:30,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 21: [2022-11-26 10:42:30,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:42:30,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 10:42:30,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 25: [2022-11-26 10:42:30,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:42:30,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:42:30,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 26: [2022-11-26 10:42:30,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 10:42:30,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 25: [2022-11-26 10:42:30,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
19: [2022-11-26 10:42:30,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:42:30,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:42:30,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:42:30,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:42:30,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 12: [2022-11-26 10:42:30,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 31: [2022-11-26 10:42:30,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:42:30,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 19: [2022-11-26 10:42:30,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 12: [2022-11-26 10:42:30,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
31: [2022-11-26 10:42:30,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 30: [2022-11-26 10:42:30,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:42:30,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 30: [2022-11-26 10:42:30,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 31: [2022-11-26 10:42:30,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 31: [2022-11-26 10:42:30,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 30: [2022-11-26 10:42:30,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 6: [2022-11-26 10:42:30,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:42:30,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 6: [2022-11-26 10:42:30,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 29: [2022-11-26 10:42:30,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:42:30,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
29: [2022-11-26 10:42:30,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 10:42:30,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 16: [2022-11-26 10:42:30,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:42:30,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:42:30,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:42:30,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:42:30,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 10:42:30,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 7: [2022-11-26 10:42:30,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
16: [2022-11-26 10:42:30,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 15: [2022-11-26 10:42:30,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 7: [2022-11-26 10:42:30,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 16: [2022-11-26 10:42:30,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 15: [2022-11-26 10:42:30,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 7: [2022-11-26 10:42:30,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 15: [2022-11-26 10:42:30,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 15: [2022-11-26 10:42:30,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 3: [2022-11-26 10:42:30,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:42:30,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 10: [2022-11-26 10:42:30,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:42:30,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
10: [2022-11-26 10:42:30,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 28: [2022-11-26 10:42:30,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 10:42:30,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 10:42:30,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:42:30,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 28: [2022-11-26 10:42:30,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 10:42:30,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 10:42:30,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 10:42:30,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 10:42:30,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 28: [2022-11-26 10:42:30,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
28: [2022-11-26 10:42:30,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 10:42:30,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 28: [2022-11-26 10:42:30,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 17: [2022-11-26 10:42:30,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:42:30,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:42:30,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 17: [2022-11-26 10:42:30,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 29: [2022-11-26 10:42:30,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 17: [2022-11-26 10:42:30,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 0: [2022-11-26 10:42:30,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:42:30,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
24: [2022-11-26 10:42:30,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 10:42:30,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 22: [2022-11-26 10:42:30,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:42:30,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:42:30,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:42:30,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:42:30,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 10:42:30,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 22: [2022-11-26 10:42:30,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 1: [2022-11-26 10:42:30,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 10:42:30,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 22: [2022-11-26 10:42:30,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 1: [2022-11-26 10:42:30,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 10:42:30,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 1: [2022-11-26 10:42:30,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 1: [2022-11-26 10:42:30,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 1: [2022-11-26 10:42:30,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 29: [2022-11-26 10:42:30,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-26 10:42:30,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 10:42:30,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 23: [2022-11-26 10:42:30,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:42:30,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 10:42:30,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 0: [2022-11-26 10:42:30,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 10:42:30,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 27: [2022-11-26 10:42:30,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:42:30,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 4: [2022-11-26 10:42:30,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:42:30,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
4: [2022-11-26 10:42:30,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 10:42:30,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 15: [2022-11-26 10:42:30,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:42:30,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 10:42:30,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 2: [2022-11-26 10:42:30,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:42:30,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:42:30,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:42:30,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:42:30,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 10:42:30,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 10:42:30,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
2: [2022-11-26 10:42:30,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 2: [2022-11-26 10:42:30,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 10:42:30,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 10:42:30,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 2: [2022-11-26 10:42:30,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 6: [2022-11-26 10:42:30,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:42:30,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 10:42:30,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 20: [2022-11-26 10:42:30,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:42:30,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 10:42:30,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 1: [2022-11-26 10:42:30,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 10:42:30,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 10:42:30,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 14: [2022-11-26 10:42:30,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:42:30,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 10:42:30,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 5: [2022-11-26 10:42:30,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:42:30,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 10:42:30,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 5: [2022-11-26 10:42:30,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:42:30,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 10:42:30,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 13: [2022-11-26 10:42:30,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 10:42:30,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 10:42:30,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 4: [2022-11-26 10:42:30,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:42:30,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 10:42:30,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 11: [2022-11-26 10:42:30,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:42:30,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 10:42:30,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 25: [2022-11-26 10:42:30,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:42:30,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 10:42:30,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 31: [2022-11-26 10:42:30,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 
31: [2022-11-26 10:42:30,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 10:42:30,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 7: [2022-11-26 10:42:30,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:42:30,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 10:42:30,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 3: [2022-11-26 10:42:30,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:42:30,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 10:42:30,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 2: [2022-11-26 10:42:30,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:42:30,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 10:42:30,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 9: [2022-11-26 10:42:30,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-26 10:42:30,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 10:42:30,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 23: [2022-11-26 10:42:30,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:42:30,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 10:42:30,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 18: [2022-11-26 10:42:30,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:42:30,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 10:42:30,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 19: [2022-11-26 10:42:30,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:42:30,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 10:42:30,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 16: [2022-11-26 10:42:30,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
16: [2022-11-26 10:42:30,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 10:42:30,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 21: [2022-11-26 10:42:30,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:42:30,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 10:42:30,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 26: [2022-11-26 10:42:30,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:42:30,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 10:42:30,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 0: [2022-11-26 10:42:30,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:42:30,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 8: [2022-11-26 10:42:30,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:42:30,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
8: [2022-11-26 10:42:30,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 22: [2022-11-26 10:42:30,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:42:30,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 22: [2022-11-26 10:42:30,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 10:42:30,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 30: [2022-11-26 10:42:30,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:42:30,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 10:42:30,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 20: [2022-11-26 10:42:30,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:42:30,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 10:42:30,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 10: [2022-11-26 10:42:30,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-26 10:42:30,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 10:42:30,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 28: [2022-11-26 10:42:30,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:42:30,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:42:30,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 10:42:30,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 24: [2022-11-26 10:42:30,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:42:30,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 10:42:30,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 17: [2022-11-26 10:42:30,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-26 10:42:30,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 15: [2022-11-26 10:42:30,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:42:30,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 17: [2022-11-26 10:42:30,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 15: [2022-11-26 10:42:30,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 28: [2022-11-26 10:42:30,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 10:42:30,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 27: [2022-11-26 10:42:30,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:42:30,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 10:42:30,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 5: [2022-11-26 10:42:30,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 10:42:30,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 10:42:30,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 13: [2022-11-26 10:42:30,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:42:30,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 10:42:30,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 31: [2022-11-26 10:42:30,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:42:30,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 10:42:30,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 14: [2022-11-26 10:42:30,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:42:30,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 10:42:30,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 4: [2022-11-26 10:42:30,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 10:42:30,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 10:42:30,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 9: [2022-11-26 10:42:30,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:42:30,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 10:42:30,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 7: [2022-11-26 10:42:30,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:42:30,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 10:42:30,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 11: [2022-11-26 10:42:30,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:42:30,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 10:42:30,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 3: [2022-11-26 10:42:30,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 10:42:30,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 10:42:30,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 12: [2022-11-26 10:42:30,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:42:30,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 10:42:30,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 6: [2022-11-26 10:42:30,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:42:30,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 10:42:30,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 25: [2022-11-26 10:42:30,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:42:30,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 10:42:30,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 18: [2022-11-26 10:42:30,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
18: [2022-11-26 10:42:30,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 10:42:30,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 28: [2022-11-26 10:42:30,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:42:30,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:42:30,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 28: [2022-11-26 10:42:30,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 0: [2022-11-26 10:42:30,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 23: [2022-11-26 10:42:30,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 10:42:30,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 0: [2022-11-26 10:42:30,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 28: [2022-11-26 10:42:30,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 1: [2022-11-26 10:42:30,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 10:42:30,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 10:42:30,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 24: [2022-11-26 10:42:30,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:42:30,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 19: [2022-11-26 10:42:30,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:42:30,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 19: [2022-11-26 10:42:30,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 10:42:30,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 12: [2022-11-26 10:42:30,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:42:30,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 10:42:30,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 26: [2022-11-26 10:42:30,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
21: [2022-11-26 10:42:30,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:42:30,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 10:42:30,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 21: [2022-11-26 10:42:30,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 10:42:30,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 8: [2022-11-26 10:42:30,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:42:30,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 10:42:30,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 20: [2022-11-26 10:42:30,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:42:30,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 10:42:30,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 16: [2022-11-26 10:42:30,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
16: [2022-11-26 10:42:30,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 10:42:30,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 10: [2022-11-26 10:42:30,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:42:30,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 10:42:30,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 30: [2022-11-26 10:42:30,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:42:30,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 10:42:30,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 29: [2022-11-26 10:42:30,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:42:30,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 10:42:30,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 15: [2022-11-26 10:42:30,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 10:42:30,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 10:42:30,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 22: [2022-11-26 10:42:30,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:42:30,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 10:42:30,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 17: [2022-11-26 10:42:30,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:42:30,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 10:42:30,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 27: [2022-11-26 10:42:30,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:42:30,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 10:42:30,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 5: [2022-11-26 10:42:30,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-26 10:42:30,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 10:42:30,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 6: [2022-11-26 10:42:30,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:42:30,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 10:42:30,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 13: [2022-11-26 10:42:30,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:42:30,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 10:42:30,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 1: [2022-11-26 10:42:30,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:42:30,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 10:42:30,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 14: [2022-11-26 10:42:30,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-26 10:42:30,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 10:42:30,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 11: [2022-11-26 10:42:30,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:42:30,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 10:42:30,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 4: [2022-11-26 10:42:30,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:42:30,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 10:42:30,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 31: [2022-11-26 10:42:30,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:42:30,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 10:42:30,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 9: [2022-11-26 10:42:30,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
25: [2022-11-26 10:42:30,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:42:30,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 10:42:30,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 25: [2022-11-26 10:42:30,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 10:42:30,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 7: [2022-11-26 10:42:30,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:42:30,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 10:42:30,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 23: [2022-11-26 10:42:30,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:42:30,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 16: [2022-11-26 10:42:30,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:42:30,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
3: [2022-11-26 10:42:30,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:42:30,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 10:42:30,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 3: [2022-11-26 10:42:30,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 10:42:30,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 30: [2022-11-26 10:42:30,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:42:30,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 10:42:30,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 18: [2022-11-26 10:42:30,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:42:30,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 10:42:30,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 12: [2022-11-26 10:42:30,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 10:42:30,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 10:42:30,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 2: [2022-11-26 10:42:30,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:42:30,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:42:30,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 10:42:30,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 10:42:30,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 2: [2022-11-26 10:42:30,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 19: [2022-11-26 10:42:30,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:42:30,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 10:42:30,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 8: [2022-11-26 10:42:30,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 10:42:30,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 10:42:30,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 22: [2022-11-26 10:42:30,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:42:30,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 10:42:30,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 28: [2022-11-26 10:42:30,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 10:42:30,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 10:42:30,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 0: [2022-11-26 10:42:30,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:42:30,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 10:42:30,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 26: [2022-11-26 10:42:30,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-26 10:42:30,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 10:42:30,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 24: [2022-11-26 10:42:30,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:42:30,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 10:42:30,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 11: [2022-11-26 10:42:30,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:42:30,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 10:42:30,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 10: [2022-11-26 10:42:30,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:42:30,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 10:42:30,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 20: [2022-11-26 10:42:30,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 
20: [2022-11-26 10:42:30,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 10:42:30,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 21: [2022-11-26 10:42:30,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:42:30,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 10:42:30,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 15: [2022-11-26 10:42:30,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:42:30,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 10:42:30,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 14: [2022-11-26 10:42:30,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:42:30,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 10:42:30,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 13: [2022-11-26 10:42:30,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 10:42:30,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 10:42:30,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 4: [2022-11-26 10:42:30,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:42:30,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 10:42:30,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 7: [2022-11-26 10:42:30,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:42:30,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 10:42:30,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 6: [2022-11-26 10:42:30,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:42:30,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:42:30,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 10:42:30,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
6: [2022-11-26 10:42:30,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 10:42:30,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 27: [2022-11-26 10:42:30,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:42:30,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 10:42:30,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 31: [2022-11-26 10:42:30,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:42:30,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 10:42:30,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 18: [2022-11-26 10:42:30,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:42:30,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 10:42:30,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 2: [2022-11-26 10:42:30,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 10:42:30,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 10:42:30,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 17: [2022-11-26 10:42:30,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:42:30,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 10:42:30,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 16: [2022-11-26 10:42:30,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:42:30,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:42:30,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 30: [2022-11-26 10:42:30,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 16: [2022-11-26 10:42:30,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 30: [2022-11-26 10:42:30,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 23: [2022-11-26 10:42:30,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
25: [2022-11-26 10:42:30,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:42:30,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 10:42:30,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 25: [2022-11-26 10:42:30,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 10:42:30,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 12: [2022-11-26 10:42:30,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:42:30,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 3: [2022-11-26 10:42:30,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:42:30,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 3: [2022-11-26 10:42:30,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 8: [2022-11-26 10:42:30,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:42:30,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
8: [2022-11-26 10:42:30,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 10:42:30,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 9: [2022-11-26 10:42:30,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:42:30,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 10:42:30,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 19: [2022-11-26 10:42:30,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:42:30,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 10:42:30,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 29: [2022-11-26 10:42:30,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:42:30,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 10:42:30,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 24: [2022-11-26 10:42:30,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
24: [2022-11-26 10:42:30,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 10:42:30,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 1: [2022-11-26 10:42:30,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:42:30,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 28: [2022-11-26 10:42:30,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:42:30,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:42:30,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 28: [2022-11-26 10:42:30,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 26: [2022-11-26 10:42:30,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 28: [2022-11-26 10:42:30,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 26: [2022-11-26 10:42:30,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 22: [2022-11-26 10:42:30,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
22: [2022-11-26 10:42:30,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 10:42:30,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 0: [2022-11-26 10:42:30,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:42:30,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 10:42:30,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 10: [2022-11-26 10:42:30,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:42:30,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 10:42:30,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 29: [2022-11-26 10:42:30,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:42:30,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
29: [2022-11-26 10:42:30,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 21: [2022-11-26 10:42:30,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 29: [2022-11-26 10:42:30,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 21: [2022-11-26 10:42:30,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 15: [2022-11-26 10:42:30,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:42:30,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 10:42:30,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 17: [2022-11-26 10:42:30,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:42:30,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 10:42:30,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 29: [2022-11-26 10:42:30,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
29: [2022-11-26 10:42:30,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 10:42:30,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 8: [2022-11-26 10:42:30,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:42:30,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 10:42:30,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 27: [2022-11-26 10:42:30,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:42:30,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 10:42:30,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 31: [2022-11-26 10:42:30,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:42:30,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 10:42:30,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 17: [2022-11-26 10:42:30,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 
17: [2022-11-26 10:42:30,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step74000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 10:42:30,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 0: successfully saved checkpoint at iteration 74000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2579.39 31: iteration 74010/ 173500 | consumed samples: 18946560 | consumed tokens: 38802554880 | elapsed time per iteration (s): 1.01 | learning rate: 1.322E-04 | global batch size: 256 | lm loss: 2.022641E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 254.586 | TFLOPs: 15.40 | 31: iteration 74020/ 173500 | consumed samples: 18949120 | consumed tokens: 38807797760 | elapsed time per iteration (s): 0.78 | learning rate: 1.322E-04 | global batch size: 256 | lm loss: 1.984153E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.442 | TFLOPs: 19.75 | 31: iteration 74030/ 173500 | consumed samples: 18951680 | consumed tokens: 38813040640 | elapsed time per iteration (s): 0.74 | learning rate: 1.321E-04 | global batch size: 256 | lm loss: 2.009877E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.136 | TFLOPs: 20.88 | 31: iteration 74040/ 173500 | consumed samples: 18954240 | consumed tokens: 38818283520 | elapsed time per iteration (s): 0.78 | learning rate: 1.321E-04 | global batch size: 256 | lm loss: 2.008300E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.652 | TFLOPs: 19.94 | 31: iteration 74050/ 173500 | consumed samples: 18956800 | consumed tokens: 38823526400 | elapsed time per iteration (s): 0.77 | learning rate: 1.321E-04 | global 
batch size: 256 | lm loss: 2.016139E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.939 | TFLOPs: 20.02 | 31: iteration 74060/ 173500 | consumed samples: 18959360 | consumed tokens: 38828769280 | elapsed time per iteration (s): 0.78 | learning rate: 1.321E-04 | global batch size: 256 | lm loss: 2.018631E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.716 | TFLOPs: 19.77 | 31: iteration 74070/ 173500 | consumed samples: 18961920 | consumed tokens: 38834012160 | elapsed time per iteration (s): 0.79 | learning rate: 1.321E-04 | global batch size: 256 | lm loss: 2.014673E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.776 | TFLOPs: 19.65 | 31: iteration 74080/ 173500 | consumed samples: 18964480 | consumed tokens: 38839255040 | elapsed time per iteration (s): 0.80 | learning rate: 1.321E-04 | global batch size: 256 | lm loss: 1.990710E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.609 | TFLOPs: 19.28 | 31: iteration 74090/ 173500 | consumed samples: 18967040 | consumed tokens: 38844497920 | elapsed time per iteration (s): 0.75 | learning rate: 1.320E-04 | global batch size: 256 | lm loss: 2.035513E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.750 | TFLOPs: 20.74 | 31: iteration 74100/ 173500 | consumed samples: 18969600 | consumed tokens: 38849740800 | elapsed time per iteration (s): 0.74 | learning rate: 1.320E-04 | global batch size: 256 | lm loss: 2.040438E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.128 | TFLOPs: 20.82 | 31: iteration 74110/ 173500 | consumed samples: 18972160 
| consumed tokens: 38854983680 | elapsed time per iteration (s): 0.79 | learning rate: 1.320E-04 | global batch size: 256 | lm loss: 1.996075E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.797 | TFLOPs: 19.53 | 31: iteration 74120/ 173500 | consumed samples: 18974720 | consumed tokens: 38860226560 | elapsed time per iteration (s): 0.82 | learning rate: 1.320E-04 | global batch size: 256 | lm loss: 2.009336E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.955 | TFLOPs: 18.87 | 31: iteration 74130/ 173500 | consumed samples: 18977280 | consumed tokens: 38865469440 | elapsed time per iteration (s): 0.79 | learning rate: 1.320E-04 | global batch size: 256 | lm loss: 2.012795E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.492 | TFLOPs: 19.69 | 31: iteration 74140/ 173500 | consumed samples: 18979840 | consumed tokens: 38870712320 | elapsed time per iteration (s): 0.76 | learning rate: 1.320E-04 | global batch size: 256 | lm loss: 1.997081E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.463 | TFLOPs: 20.36 | 31: iteration 74150/ 173500 | consumed samples: 18982400 | consumed tokens: 38875955200 | elapsed time per iteration (s): 0.74 | learning rate: 1.319E-04 | global batch size: 256 | lm loss: 2.022986E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.956 | TFLOPs: 20.87 | 31: iteration 74160/ 173500 | consumed samples: 18984960 | consumed tokens: 38881198080 | elapsed time per iteration (s): 0.75 | learning rate: 1.319E-04 | global batch size: 256 | lm loss: 2.018168E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 
0 | samples per second: 342.794 | TFLOPs: 20.74 | 31: iteration 74170/ 173500 | consumed samples: 18987520 | consumed tokens: 38886440960 | elapsed time per iteration (s): 0.75 | learning rate: 1.319E-04 | global batch size: 256 | lm loss: 2.025505E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.377 | TFLOPs: 20.53 | 31: iteration 74180/ 173500 | consumed samples: 18990080 | consumed tokens: 38891683840 | elapsed time per iteration (s): 0.79 | learning rate: 1.319E-04 | global batch size: 256 | lm loss: 2.022327E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.456 | TFLOPs: 19.51 | 31: iteration 74190/ 173500 | consumed samples: 18992640 | consumed tokens: 38896926720 | elapsed time per iteration (s): 0.80 | learning rate: 1.319E-04 | global batch size: 256 | lm loss: 2.028933E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.435 | TFLOPs: 19.32 | 31: iteration 74200/ 173500 | consumed samples: 18995200 | consumed tokens: 38902169600 | elapsed time per iteration (s): 0.76 | learning rate: 1.319E-04 | global batch size: 256 | lm loss: 2.003331E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.420 | TFLOPs: 20.29 | 31: iteration 74210/ 173500 | consumed samples: 18997760 | consumed tokens: 38907412480 | elapsed time per iteration (s): 0.84 | learning rate: 1.319E-04 | global batch size: 256 | lm loss: 2.024162E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.336 | TFLOPs: 18.53 | 31: iteration 74220/ 173500 | consumed samples: 19000320 | consumed tokens: 38912655360 | elapsed time per iteration (s): 0.80 | learning rate: 1.318E-04 | global batch size: 256 | lm loss: 
2.041457E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.020 | TFLOPs: 19.30 | 31: iteration 74230/ 173500 | consumed samples: 19002880 | consumed tokens: 38917898240 | elapsed time per iteration (s): 0.81 | learning rate: 1.318E-04 | global batch size: 256 | lm loss: 2.009557E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.990 | TFLOPs: 19.06 | 31: iteration 74240/ 173500 | consumed samples: 19005440 | consumed tokens: 38923141120 | elapsed time per iteration (s): 0.78 | learning rate: 1.318E-04 | global batch size: 256 | lm loss: 2.024734E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.803 | TFLOPs: 19.95 | 31: iteration 74250/ 173500 | consumed samples: 19008000 | consumed tokens: 38928384000 | elapsed time per iteration (s): 0.78 | learning rate: 1.318E-04 | global batch size: 256 | lm loss: 1.984448E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.292 | TFLOPs: 19.74 | 31: iteration 74260/ 173500 | consumed samples: 19010560 | consumed tokens: 38933626880 | elapsed time per iteration (s): 0.85 | learning rate: 1.318E-04 | global batch size: 256 | lm loss: 2.038768E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.621 | TFLOPs: 18.31 | 31: iteration 74270/ 173500 | consumed samples: 19013120 | consumed tokens: 38938869760 | elapsed time per iteration (s): 0.77 | learning rate: 1.318E-04 | global batch size: 256 | lm loss: 2.045829E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.715 | TFLOPs: 20.19 | 31: iteration 74280/ 173500 | consumed samples: 19015680 | consumed tokens: 
38944112640 | elapsed time per iteration (s): 0.78 | learning rate: 1.317E-04 | global batch size: 256 | lm loss: 2.009854E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.460 | TFLOPs: 19.81 | 31: iteration 74290/ 173500 | consumed samples: 19018240 | consumed tokens: 38949355520 | elapsed time per iteration (s): 0.79 | learning rate: 1.317E-04 | global batch size: 256 | lm loss: 2.000651E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.918 | TFLOPs: 19.60 | 31: iteration 74300/ 173500 | consumed samples: 19020800 | consumed tokens: 38954598400 | elapsed time per iteration (s): 0.81 | learning rate: 1.317E-04 | global batch size: 256 | lm loss: 2.034667E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.800 | TFLOPs: 19.23 | 31: iteration 74310/ 173500 | consumed samples: 19023360 | consumed tokens: 38959841280 | elapsed time per iteration (s): 0.79 | learning rate: 1.317E-04 | global batch size: 256 | lm loss: 2.002399E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.529 | TFLOPs: 19.69 | 31: iteration 74320/ 173500 | consumed samples: 19025920 | consumed tokens: 38965084160 | elapsed time per iteration (s): 0.77 | learning rate: 1.317E-04 | global batch size: 256 | lm loss: 2.015609E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.009 | TFLOPs: 20.21 | 31: iteration 74330/ 173500 | consumed samples: 19028480 | consumed tokens: 38970327040 | elapsed time per iteration (s): 0.76 | learning rate: 1.317E-04 | global batch size: 256 | lm loss: 2.015623E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 337.170 | TFLOPs: 20.40 | 31: iteration 74340/ 173500 | consumed samples: 19031040 | consumed tokens: 38975569920 | elapsed time per iteration (s): 0.79 | learning rate: 1.316E-04 | global batch size: 256 | lm loss: 1.990659E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.620 | TFLOPs: 19.64 | 31: iteration 74350/ 173500 | consumed samples: 19033600 | consumed tokens: 38980812800 | elapsed time per iteration (s): 0.79 | learning rate: 1.316E-04 | global batch size: 256 | lm loss: 2.039468E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.388 | TFLOPs: 19.69 | 31: iteration 74360/ 173500 | consumed samples: 19036160 | consumed tokens: 38986055680 | elapsed time per iteration (s): 0.72 | learning rate: 1.316E-04 | global batch size: 256 | lm loss: 2.013869E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.217 | TFLOPs: 21.49 | 31: iteration 74370/ 173500 | consumed samples: 19038720 | consumed tokens: 38991298560 | elapsed time per iteration (s): 0.79 | learning rate: 1.316E-04 | global batch size: 256 | lm loss: 2.028960E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.894 | TFLOPs: 19.72 | 31: iteration 74380/ 173500 | consumed samples: 19041280 | consumed tokens: 38996541440 | elapsed time per iteration (s): 0.84 | learning rate: 1.316E-04 | global batch size: 256 | lm loss: 1.977240E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.591 | TFLOPs: 18.49 | 31: iteration 74390/ 173500 | consumed samples: 19043840 | consumed tokens: 39001784320 | elapsed time per iteration (s): 0.83 | learning rate: 1.316E-04 | global batch size: 256 | lm loss: 2.015192E+00 | grad 
norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.798 | TFLOPs: 18.68 | 31: iteration 74400/ 173500 | consumed samples: 19046400 | consumed tokens: 39007027200 | elapsed time per iteration (s): 0.79 | learning rate: 1.315E-04 | global batch size: 256 | lm loss: 2.031812E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.325 | TFLOPs: 19.56 | 31: iteration 74410/ 173500 | consumed samples: 19048960 | consumed tokens: 39012270080 | elapsed time per iteration (s): 0.79 | learning rate: 1.315E-04 | global batch size: 256 | lm loss: 2.013049E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.422 | TFLOPs: 19.69 | 31: iteration 74420/ 173500 | consumed samples: 19051520 | consumed tokens: 39017512960 | elapsed time per iteration (s): 0.79 | learning rate: 1.315E-04 | global batch size: 256 | lm loss: 2.007710E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.761 | TFLOPs: 19.59 | 31: iteration 74430/ 173500 | consumed samples: 19054080 | consumed tokens: 39022755840 | elapsed time per iteration (s): 0.79 | learning rate: 1.315E-04 | global batch size: 256 | lm loss: 2.023272E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.679 | TFLOPs: 19.52 | 31: iteration 74440/ 173500 | consumed samples: 19056640 | consumed tokens: 39027998720 | elapsed time per iteration (s): 0.83 | learning rate: 1.315E-04 | global batch size: 256 | lm loss: 2.017835E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.356 | TFLOPs: 18.72 | 31: iteration 74450/ 173500 | consumed samples: 19059200 | consumed tokens: 39033241600 | elapsed time 
per iteration (s): 0.74 | learning rate: 1.315E-04 | global batch size: 256 | lm loss: 2.019486E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.091 | TFLOPs: 20.88 | 31: iteration 74460/ 173500 | consumed samples: 19061760 | consumed tokens: 39038484480 | elapsed time per iteration (s): 0.79 | learning rate: 1.315E-04 | global batch size: 256 | lm loss: 1.994531E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.541 | TFLOPs: 19.63 | 31: iteration 74470/ 173500 | consumed samples: 19064320 | consumed tokens: 39043727360 | elapsed time per iteration (s): 0.81 | learning rate: 1.314E-04 | global batch size: 256 | lm loss: 2.030858E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.611 | TFLOPs: 19.21 | 31: iteration 74480/ 173500 | consumed samples: 19066880 | consumed tokens: 39048970240 | elapsed time per iteration (s): 0.87 | learning rate: 1.314E-04 | global batch size: 256 | lm loss: 2.021850E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.647 | TFLOPs: 17.70 | 31: iteration 74490/ 173500 | consumed samples: 19069440 | consumed tokens: 39054213120 | elapsed time per iteration (s): 0.79 | learning rate: 1.314E-04 | global batch size: 256 | lm loss: 2.015445E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.623 | TFLOPs: 19.58 | 31: iteration 74500/ 173500 | consumed samples: 19072000 | consumed tokens: 39059456000 | elapsed time per iteration (s): 0.81 | learning rate: 1.314E-04 | global batch size: 256 | lm loss: 2.033329E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.512 | TFLOPs: 
19.03 | 31: iteration 74510/ 173500 | consumed samples: 19074560 | consumed tokens: 39064698880 | elapsed time per iteration (s): 0.84 | learning rate: 1.314E-04 | global batch size: 256 | lm loss: 2.019502E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.675 | TFLOPs: 18.43 | 31: iteration 74520/ 173500 | consumed samples: 19077120 | consumed tokens: 39069941760 | elapsed time per iteration (s): 0.80 | learning rate: 1.314E-04 | global batch size: 256 | lm loss: 2.042727E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.051 | TFLOPs: 19.42 | 31: iteration 74530/ 173500 | consumed samples: 19079680 | consumed tokens: 39075184640 | elapsed time per iteration (s): 0.85 | learning rate: 1.313E-04 | global batch size: 256 | lm loss: 2.019030E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.319 | TFLOPs: 18.23 | 31: iteration 74540/ 173500 | consumed samples: 19082240 | consumed tokens: 39080427520 | elapsed time per iteration (s): 0.80 | learning rate: 1.313E-04 | global batch size: 256 | lm loss: 2.007598E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.061 | TFLOPs: 19.30 | 31: iteration 74550/ 173500 | consumed samples: 19084800 | consumed tokens: 39085670400 | elapsed time per iteration (s): 0.76 | learning rate: 1.313E-04 | global batch size: 256 | lm loss: 2.007715E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.752 | TFLOPs: 20.25 | 31: iteration 74560/ 173500 | consumed samples: 19087360 | consumed tokens: 39090913280 | elapsed time per iteration (s): 0.76 | learning rate: 1.313E-04 | global batch size: 256 | lm loss: 2.020309E+00 | grad norm: 0.146 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.198 | TFLOPs: 20.46 | 31: iteration 74570/ 173500 | consumed samples: 19089920 | consumed tokens: 39096156160 | elapsed time per iteration (s): 0.84 | learning rate: 1.313E-04 | global batch size: 256 | lm loss: 2.004063E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.708 | TFLOPs: 18.37 | 31: iteration 74580/ 173500 | consumed samples: 19092480 | consumed tokens: 39101399040 | elapsed time per iteration (s): 0.76 | learning rate: 1.313E-04 | global batch size: 256 | lm loss: 2.011840E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.597 | TFLOPs: 20.42 | 31: iteration 74590/ 173500 | consumed samples: 19095040 | consumed tokens: 39106641920 | elapsed time per iteration (s): 0.73 | learning rate: 1.312E-04 | global batch size: 256 | lm loss: 2.012290E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.603 | TFLOPs: 21.27 | 31: iteration 74600/ 173500 | consumed samples: 19097600 | consumed tokens: 39111884800 | elapsed time per iteration (s): 0.80 | learning rate: 1.312E-04 | global batch size: 256 | lm loss: 2.014327E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.367 | TFLOPs: 19.32 | 31: iteration 74610/ 173500 | consumed samples: 19100160 | consumed tokens: 39117127680 | elapsed time per iteration (s): 0.75 | learning rate: 1.312E-04 | global batch size: 256 | lm loss: 2.024253E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.049 | TFLOPs: 20.75 | 31: iteration 74620/ 173500 | consumed samples: 19102720 | consumed tokens: 39122370560 | elapsed time per iteration (s): 0.82 | 
learning rate: 1.312E-04 | global batch size: 256 | lm loss: 2.008989E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.956 | TFLOPs: 18.93 | 31: iteration 74630/ 173500 | consumed samples: 19105280 | consumed tokens: 39127613440 | elapsed time per iteration (s): 0.73 | learning rate: 1.312E-04 | global batch size: 256 | lm loss: 1.994668E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.664 | TFLOPs: 21.15 | 31: iteration 74640/ 173500 | consumed samples: 19107840 | consumed tokens: 39132856320 | elapsed time per iteration (s): 0.81 | learning rate: 1.312E-04 | global batch size: 256 | lm loss: 1.992311E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.565 | TFLOPs: 19.15 | 31: iteration 74650/ 173500 | consumed samples: 19110400 | consumed tokens: 39138099200 | elapsed time per iteration (s): 0.74 | learning rate: 1.311E-04 | global batch size: 256 | lm loss: 2.029410E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.351 | TFLOPs: 20.89 | 31: iteration 74660/ 173500 | consumed samples: 19112960 | consumed tokens: 39143342080 | elapsed time per iteration (s): 0.74 | learning rate: 1.311E-04 | global batch size: 256 | lm loss: 1.989726E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.758 | TFLOPs: 20.80 | 31: iteration 74670/ 173500 | consumed samples: 19115520 | consumed tokens: 39148584960 | elapsed time per iteration (s): 0.86 | learning rate: 1.311E-04 | global batch size: 256 | lm loss: 2.044229E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.129 | TFLOPs: 17.92 | 31: iteration 74680/ 
173500 | consumed samples: 19118080 | consumed tokens: 39153827840 | elapsed time per iteration (s): 1.31 | learning rate: 1.311E-04 | global batch size: 256 | lm loss: 2.002904E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 195.122 | TFLOPs: 11.80 | 31: iteration 74690/ 173500 | consumed samples: 19120640 | consumed tokens: 39159070720 | elapsed time per iteration (s): 0.89 | learning rate: 1.311E-04 | global batch size: 256 | lm loss: 2.004799E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.164 | TFLOPs: 17.43 | 31: iteration 74700/ 173500 | consumed samples: 19123200 | consumed tokens: 39164313600 | elapsed time per iteration (s): 0.83 | learning rate: 1.311E-04 | global batch size: 256 | lm loss: 2.016344E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.443 | TFLOPs: 18.60 | 31: iteration 74710/ 173500 | consumed samples: 19125760 | consumed tokens: 39169556480 | elapsed time per iteration (s): 0.87 | learning rate: 1.311E-04 | global batch size: 256 | lm loss: 2.026566E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.530 | TFLOPs: 17.76 | 31: iteration 74720/ 173500 | consumed samples: 19128320 | consumed tokens: 39174799360 | elapsed time per iteration (s): 0.80 | learning rate: 1.310E-04 | global batch size: 256 | lm loss: 2.040186E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.931 | TFLOPs: 19.48 | 31: iteration 74730/ 173500 | consumed samples: 19130880 | consumed tokens: 39180042240 | elapsed time per iteration (s): 0.79 | learning rate: 1.310E-04 | global batch size: 256 | lm loss: 1.994985E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 322.110 | TFLOPs: 19.49 | 31: iteration 74740/ 173500 | consumed samples: 19133440 | consumed tokens: 39185285120 | elapsed time per iteration (s): 1.73 | learning rate: 1.310E-04 | global batch size: 256 | lm loss: 1.998016E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 147.758 | TFLOPs: 8.94 | 31: iteration 74750/ 173500 | consumed samples: 19136000 | consumed tokens: 39190528000 | elapsed time per iteration (s): 0.85 | learning rate: 1.310E-04 | global batch size: 256 | lm loss: 2.001072E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.169 | TFLOPs: 18.28 | 31: iteration 74760/ 173500 | consumed samples: 19138560 | consumed tokens: 39195770880 | elapsed time per iteration (s): 0.82 | learning rate: 1.310E-04 | global batch size: 256 | lm loss: 2.020580E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.071 | TFLOPs: 18.88 | 31: iteration 74770/ 173500 | consumed samples: 19141120 | consumed tokens: 39201013760 | elapsed time per iteration (s): 0.81 | learning rate: 1.310E-04 | global batch size: 256 | lm loss: 2.012438E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.604 | TFLOPs: 19.15 | 31: iteration 74780/ 173500 | consumed samples: 19143680 | consumed tokens: 39206256640 | elapsed time per iteration (s): 0.81 | learning rate: 1.309E-04 | global batch size: 256 | lm loss: 2.006344E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.800 | TFLOPs: 19.11 | 31: iteration 74790/ 173500 | consumed samples: 19146240 | consumed tokens: 39211499520 | elapsed time per iteration (s): 0.81 | learning rate: 1.309E-04 
| global batch size: 256 | lm loss: 2.017386E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.913 | TFLOPs: 19.05 | 31: iteration 74800/ 173500 | consumed samples: 19148800 | consumed tokens: 39216742400 | elapsed time per iteration (s): 0.83 | learning rate: 1.309E-04 | global batch size: 256 | lm loss: 1.999203E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.789 | TFLOPs: 18.68 | 31: iteration 74810/ 173500 | consumed samples: 19151360 | consumed tokens: 39221985280 | elapsed time per iteration (s): 0.80 | learning rate: 1.309E-04 | global batch size: 256 | lm loss: 2.014804E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.218 | TFLOPs: 19.25 | 31: iteration 74820/ 173500 | consumed samples: 19153920 | consumed tokens: 39227228160 | elapsed time per iteration (s): 0.81 | learning rate: 1.309E-04 | global batch size: 256 | lm loss: 2.011395E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.591 | TFLOPs: 19.21 | 31: iteration 74830/ 173500 | consumed samples: 19156480 | consumed tokens: 39232471040 | elapsed time per iteration (s): 0.83 | learning rate: 1.309E-04 | global batch size: 256 | lm loss: 2.027200E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.641 | TFLOPs: 18.61 | 31: iteration 74840/ 173500 | consumed samples: 19159040 | consumed tokens: 39237713920 | elapsed time per iteration (s): 0.82 | learning rate: 1.308E-04 | global batch size: 256 | lm loss: 2.029839E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.163 | TFLOPs: 18.82 | 31: iteration 74850/ 173500 | consumed samples: 
19161600 | consumed tokens: 39242956800 | elapsed time per iteration (s): 0.84 | learning rate: 1.308E-04 | global batch size: 256 | lm loss: 2.004717E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.632 | TFLOPs: 18.37 | 31: iteration 74860/ 173500 | consumed samples: 19164160 | consumed tokens: 39248199680 | elapsed time per iteration (s): 0.78 | learning rate: 1.308E-04 | global batch size: 256 | lm loss: 1.993892E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.180 | TFLOPs: 19.79 | 31: iteration 74870/ 173500 | consumed samples: 19166720 | consumed tokens: 39253442560 | elapsed time per iteration (s): 0.81 | learning rate: 1.308E-04 | global batch size: 256 | lm loss: 1.998267E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.198 | TFLOPs: 19.13 | 31: iteration 74880/ 173500 | consumed samples: 19169280 | consumed tokens: 39258685440 | elapsed time per iteration (s): 0.80 | learning rate: 1.308E-04 | global batch size: 256 | lm loss: 2.016199E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.895 | TFLOPs: 19.47 | 31: iteration 74890/ 173500 | consumed samples: 19171840 | consumed tokens: 39263928320 | elapsed time per iteration (s): 0.83 | learning rate: 1.308E-04 | global batch size: 256 | lm loss: 2.018256E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.316 | TFLOPs: 18.71 | 31: iteration 74900/ 173500 | consumed samples: 19174400 | consumed tokens: 39269171200 | elapsed time per iteration (s): 0.78 | learning rate: 1.307E-04 | global batch size: 256 | lm loss: 2.026436E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 326.274 | TFLOPs: 19.74 | 31: iteration 74910/ 173500 | consumed samples: 19176960 | consumed tokens: 39274414080 | elapsed time per iteration (s): 0.81 | learning rate: 1.307E-04 | global batch size: 256 | lm loss: 2.042435E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.919 | TFLOPs: 19.11 | 31: iteration 74920/ 173500 | consumed samples: 19179520 | consumed tokens: 39279656960 | elapsed time per iteration (s): 0.81 | learning rate: 1.307E-04 | global batch size: 256 | lm loss: 2.065528E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.469 | TFLOPs: 19.02 | 31: iteration 74930/ 173500 | consumed samples: 19182080 | consumed tokens: 39284899840 | elapsed time per iteration (s): 0.81 | learning rate: 1.307E-04 | global batch size: 256 | lm loss: 2.012115E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.564 | TFLOPs: 19.15 | 31: iteration 74940/ 173500 | consumed samples: 19184640 | consumed tokens: 39290142720 | elapsed time per iteration (s): 0.80 | learning rate: 1.307E-04 | global batch size: 256 | lm loss: 2.042043E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.560 | TFLOPs: 19.27 | 31: iteration 74950/ 173500 | consumed samples: 19187200 | consumed tokens: 39295385600 | elapsed time per iteration (s): 0.81 | learning rate: 1.307E-04 | global batch size: 256 | lm loss: 1.987837E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.093 | TFLOPs: 19.18 | 31: iteration 74960/ 173500 | consumed samples: 19189760 | consumed tokens: 39300628480 | elapsed time per iteration (s): 0.75 | learning rate: 1.307E-04 | global batch size: 256 | 
lm loss: 2.020997E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.039 | TFLOPs: 20.75 | 31: iteration 74970/ 173500 | consumed samples: 19192320 | consumed tokens: 39305871360 | elapsed time per iteration (s): 0.81 | learning rate: 1.306E-04 | global batch size: 256 | lm loss: 2.051252E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.766 | TFLOPs: 19.04 | 31: iteration 74980/ 173500 | consumed samples: 19194880 | consumed tokens: 39311114240 | elapsed time per iteration (s): 0.74 | learning rate: 1.306E-04 | global batch size: 256 | lm loss: 2.054168E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.984 | TFLOPs: 20.93 | 31: iteration 74990/ 173500 | consumed samples: 19197440 | consumed tokens: 39316357120 | elapsed time per iteration (s): 0.78 | learning rate: 1.306E-04 | global batch size: 256 | lm loss: 2.041047E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.318 | TFLOPs: 19.86 | 31: iteration 75000/ 173500 | consumed samples: 19200000 | consumed tokens: 39321600000 | elapsed time per iteration (s): 0.75 | learning rate: 1.306E-04 | global batch size: 256 | lm loss: 2.035188E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.251 | TFLOPs: 20.58 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 75000 | lm loss value: 2.100717E+00 | lm loss PPL: 8.172028E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 75000 to checkpoints_1b1long 0: [2022-11-26 10:55:59,498] [INFO] [logging.py:68:log_dist] [Rank 0] 
[Torch] Checkpoint global_step75000 is begin to save! 0: [2022-11-26 10:55:59,512] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_01-model_00-model_states.pt... 0: [2022-11-26 10:55:59,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_01-model_00-model_states.pt. 0: [2022-11-26 10:55:59,762] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_03-model_00-model_states.pt... 0: [2022-11-26 10:55:59,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_03-model_00-model_states.pt. 0: [2022-11-26 10:55:59,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_04-model_00-model_states.pt... 0: [2022-11-26 10:55:59,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_04-model_00-model_states.pt. 0: [2022-11-26 10:55:59,922] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_05-model_00-model_states.pt... 0: [2022-11-26 10:56:00,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_05-model_00-model_states.pt. 0: [2022-11-26 10:56:00,005] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_06-model_00-model_states.pt... 0: [2022-11-26 10:56:00,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_06-model_00-model_states.pt. 0: [2022-11-26 10:56:00,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_07-model_00-model_states.pt... 0: [2022-11-26 10:56:00,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 10:56:00,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_08-model_00-model_states.pt... 0: [2022-11-26 10:56:00,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_08-model_00-model_states.pt. 0: [2022-11-26 10:56:00,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_09-model_00-model_states.pt... 0: [2022-11-26 10:56:00,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_09-model_00-model_states.pt. 0: [2022-11-26 10:56:00,330] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_10-model_00-model_states.pt... 0: [2022-11-26 10:56:00,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_10-model_00-model_states.pt. 0: [2022-11-26 10:56:00,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_11-model_00-model_states.pt... 0: [2022-11-26 10:56:00,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_11-model_00-model_states.pt. 0: [2022-11-26 10:56:00,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_12-model_00-model_states.pt... 0: [2022-11-26 10:56:00,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_12-model_00-model_states.pt. 0: [2022-11-26 10:56:00,653] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_13-model_00-model_states.pt... 0: [2022-11-26 10:56:00,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 10:56:00,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_14-model_00-model_states.pt... 0: [2022-11-26 10:56:00,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_14-model_00-model_states.pt. 0: [2022-11-26 10:56:00,872] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_15-model_00-model_states.pt... 0: [2022-11-26 10:56:00,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_15-model_00-model_states.pt. 0: [2022-11-26 10:56:00,945] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_16-model_00-model_states.pt... 0: [2022-11-26 10:56:01,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_16-model_00-model_states.pt. 0: [2022-11-26 10:56:01,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_17-model_00-model_states.pt... 0: [2022-11-26 10:56:01,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_17-model_00-model_states.pt. 0: [2022-11-26 10:56:01,096] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_18-model_00-model_states.pt... 0: [2022-11-26 10:56:01,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_18-model_00-model_states.pt. 0: [2022-11-26 10:56:01,174] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_19-model_00-model_states.pt... 0: [2022-11-26 10:56:01,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 10:56:01,250] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_20-model_00-model_states.pt... 0: [2022-11-26 10:56:01,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_20-model_00-model_states.pt. 0: [2022-11-26 10:56:01,328] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_21-model_00-model_states.pt... 0: [2022-11-26 10:56:01,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_21-model_00-model_states.pt. 0: [2022-11-26 10:56:01,407] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_22-model_00-model_states.pt... 0: [2022-11-26 10:56:01,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_22-model_00-model_states.pt. 0: [2022-11-26 10:56:01,479] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_23-model_00-model_states.pt... 0: [2022-11-26 10:56:01,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_23-model_00-model_states.pt. 0: [2022-11-26 10:56:01,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_24-model_00-model_states.pt... 0: [2022-11-26 10:56:01,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_24-model_00-model_states.pt. 0: [2022-11-26 10:56:01,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_25-model_00-model_states.pt... 0: [2022-11-26 10:56:01,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 10:56:01,710] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_26-model_00-model_states.pt... 0: [2022-11-26 10:56:01,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_26-model_00-model_states.pt. 0: [2022-11-26 10:56:01,787] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_27-model_00-model_states.pt... 0: [2022-11-26 10:56:01,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_27-model_00-model_states.pt. 0: [2022-11-26 10:56:01,863] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_28-model_00-model_states.pt... 0: [2022-11-26 10:56:01,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_28-model_00-model_states.pt. 0: [2022-11-26 10:56:01,935] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/layer_30-model_00-model_states.pt... 0: [2022-11-26 10:56:01,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/layer_30-model_00-model_states.pt. 0: [2022-11-26 10:56:01,939] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step75000/mp_rank_00_model_states.pt 0: [2022-11-26 10:56:01,939] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/mp_rank_00_model_states.pt... 0: [2022-11-26 10:56:01,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/mp_rank_00_model_states.pt. 0: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
6: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 
4: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
13: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
12: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 
25: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 
24: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 
21: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 
27: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 
7: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
10: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 
2: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 
25: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
11: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 
31: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 
17: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 
27: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 
8: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 
3: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
31: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 
21: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 
10: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 25: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
14: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 10:56:02,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 
7: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 31: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 26: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
26: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:56:02,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:56:02,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:56:02,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 10:56:02,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 18: [2022-11-26 10:56:02,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:56:02,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:56:02,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 18: [2022-11-26 10:56:02,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 9: [2022-11-26 10:56:02,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 18: [2022-11-26 10:56:02,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 28: [2022-11-26 10:56:02,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
20: [2022-11-26 10:56:02,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:56:02,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 10:56:02,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 24: [2022-11-26 10:56:02,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:56:02,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 10:56:02,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 0: [2022-11-26 10:56:02,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:56:02,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:56:02,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 10:56:02,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 5: [2022-11-26 10:56:02,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:56:02,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
5: [2022-11-26 10:56:02,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 10:56:02,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 6: [2022-11-26 10:56:02,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:56:02,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 2: [2022-11-26 10:56:02,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 28: [2022-11-26 10:56:02,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 27: [2022-11-26 10:56:02,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:56:02,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 1: [2022-11-26 10:56:02,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 2: [2022-11-26 10:56:02,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 28: [2022-11-26 10:56:02,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 6: [2022-11-26 10:56:02,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
2: [2022-11-26 10:56:02,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 27: [2022-11-26 10:56:02,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:56:02,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 10:56:02,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 27: [2022-11-26 10:56:02,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 1: [2022-11-26 10:56:02,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:56:02,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:56:02,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 1: [2022-11-26 10:56:02,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 17: [2022-11-26 10:56:02,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 1: [2022-11-26 10:56:02,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 18: [2022-11-26 10:56:02,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
5: [2022-11-26 10:56:02,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:56:02,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 18: [2022-11-26 10:56:02,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 5: [2022-11-26 10:56:02,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 24: [2022-11-26 10:56:02,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:56:02,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 5: [2022-11-26 10:56:02,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 24: [2022-11-26 10:56:02,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 10:56:02,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 7: [2022-11-26 10:56:02,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:56:02,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 10:56:02,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
26: [2022-11-26 10:56:02,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:56:02,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 10:56:02,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 3: [2022-11-26 10:56:02,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:56:02,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 10:56:02,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 15: [2022-11-26 10:56:02,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:56:02,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:56:02,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 10:56:02,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 19: [2022-11-26 10:56:02,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 10:56:02,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
22: [2022-11-26 10:56:02,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:56:02,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 10:56:02,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 22: [2022-11-26 10:56:02,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:56:02,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 10:56:02,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 8: [2022-11-26 10:56:02,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:56:02,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 17: [2022-11-26 10:56:02,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:56:02,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 30: [2022-11-26 10:56:02,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
17: [2022-11-26 10:56:02,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 30: [2022-11-26 10:56:02,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 10:56:02,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 17: [2022-11-26 10:56:02,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 6: [2022-11-26 10:56:02,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:56:02,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 10:56:02,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 30: [2022-11-26 10:56:02,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:56:02,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 10:56:02,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 9: [2022-11-26 10:56:02,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-26 10:56:02,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 10:56:02,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 2: [2022-11-26 10:56:02,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:56:02,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 28: [2022-11-26 10:56:02,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:56:02,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 15: [2022-11-26 10:56:02,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:56:02,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 10:56:02,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 25: [2022-11-26 10:56:02,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:56:02,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-26 10:56:02,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 10:56:02,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 25: [2022-11-26 10:56:02,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 10:56:02,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 29: [2022-11-26 10:56:02,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:56:02,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 25: [2022-11-26 10:56:02,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:56:02,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 13: [2022-11-26 10:56:02,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:56:02,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 10:56:02,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
25: [2022-11-26 10:56:02,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 28: [2022-11-26 10:56:02,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 10: [2022-11-26 10:56:02,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:56:02,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 28: [2022-11-26 10:56:02,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 24: [2022-11-26 10:56:02,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:56:02,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:56:02,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 5: [2022-11-26 10:56:02,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 10: [2022-11-26 10:56:02,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 24: [2022-11-26 10:56:02,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 5: [2022-11-26 10:56:02,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
10: [2022-11-26 10:56:02,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 29: [2022-11-26 10:56:02,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:56:02,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 10:56:02,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 10: [2022-11-26 10:56:02,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:56:02,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 10:56:02,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 14: [2022-11-26 10:56:02,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:56:02,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 10:56:02,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 19: [2022-11-26 10:56:02,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 
19: [2022-11-26 10:56:02,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 10:56:02,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 16: [2022-11-26 10:56:02,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:56:02,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 10:56:02,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 0: [2022-11-26 10:56:02,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:56:02,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:56:02,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:56:02,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 3: [2022-11-26 10:56:02,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
23: [2022-11-26 10:56:02,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 21: [2022-11-26 10:56:02,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:56:02,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 7: [2022-11-26 10:56:02,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 3: [2022-11-26 10:56:02,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 23: [2022-11-26 10:56:02,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 21: [2022-11-26 10:56:02,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 0: [2022-11-26 10:56:02,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 3: [2022-11-26 10:56:02,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 21: [2022-11-26 10:56:02,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 11: [2022-11-26 10:56:02,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:56:02,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
30: [2022-11-26 10:56:02,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 11: [2022-11-26 10:56:02,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 30: [2022-11-26 10:56:02,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 11: [2022-11-26 10:56:02,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 11: [2022-11-26 10:56:02,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:56:02,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 10:56:02,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 3: [2022-11-26 10:56:02,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:56:02,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 10:56:02,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 15: [2022-11-26 10:56:02,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:56:02,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
15: [2022-11-26 10:56:02,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 10:56:02,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 19: [2022-11-26 10:56:02,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 10:56:02,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 11: [2022-11-26 10:56:02,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:56:02,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 10:56:02,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 26: [2022-11-26 10:56:02,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:56:02,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 10:56:02,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 8: [2022-11-26 10:56:02,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-26 10:56:02,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 7: [2022-11-26 10:56:02,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:56:02,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 7: [2022-11-26 10:56:02,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 10:56:02,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 1: [2022-11-26 10:56:02,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:56:02,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 10:56:02,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 1: [2022-11-26 10:56:02,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 28: [2022-11-26 10:56:02,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:56:02,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:56:02,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
1: [2022-11-26 10:56:02,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 14: [2022-11-26 10:56:02,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 20: [2022-11-26 10:56:02,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 14: [2022-11-26 10:56:02,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 2: [2022-11-26 10:56:02,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:56:02,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 2: [2022-11-26 10:56:02,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 10:56:02,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 16: [2022-11-26 10:56:02,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:56:02,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 10:56:02,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 20: [2022-11-26 10:56:02,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
20: [2022-11-26 10:56:02,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:56:02,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 10:56:02,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 8: [2022-11-26 10:56:02,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:56:02,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 10:56:02,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 28: [2022-11-26 10:56:02,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 10:56:02,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 21: [2022-11-26 10:56:02,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:56:02,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 10:56:02,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 14: [2022-11-26 10:56:02,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
18: [2022-11-26 10:56:02,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:56:02,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 14: [2022-11-26 10:56:02,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 25: [2022-11-26 10:56:02,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:56:02,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 18: [2022-11-26 10:56:02,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 25: [2022-11-26 10:56:02,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 10:56:02,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:56:02,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 25: [2022-11-26 10:56:02,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 10:56:02,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 17: [2022-11-26 10:56:02,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 
17: [2022-11-26 10:56:02,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 10:56:02,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 22: [2022-11-26 10:56:02,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:56:02,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 10:56:02,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 30: [2022-11-26 10:56:02,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:56:02,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 10:56:02,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 3: [2022-11-26 10:56:02,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:56:02,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 10:56:02,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 29: [2022-11-26 10:56:02,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
29: [2022-11-26 10:56:02,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 10:56:02,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 27: [2022-11-26 10:56:02,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:56:02,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:56:02,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 11: [2022-11-26 10:56:02,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 10:56:02,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 15: [2022-11-26 10:56:02,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:56:02,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 15: [2022-11-26 10:56:02,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 10:56:02,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 23: [2022-11-26 10:56:02,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
23: [2022-11-26 10:56:02,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 10:56:02,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 27: [2022-11-26 10:56:02,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:56:02,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 10:56:02,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 24: [2022-11-26 10:56:02,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:56:02,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:56:02,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 24: [2022-11-26 10:56:02,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 10:56:02,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 23: [2022-11-26 10:56:02,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 23: [2022-11-26 10:56:02,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
23: [2022-11-26 10:56:02,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 10:56:02,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 18: [2022-11-26 10:56:02,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:56:02,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 10:56:02,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 13: [2022-11-26 10:56:02,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:56:02,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 10:56:02,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 6: [2022-11-26 10:56:02,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:56:02,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 10:56:02,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 19: [2022-11-26 10:56:02,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
19: [2022-11-26 10:56:02,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 10:56:02,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 0: [2022-11-26 10:56:02,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:56:02,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:56:02,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:56:02,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 26: [2022-11-26 10:56:02,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 0: [2022-11-26 10:56:02,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 14: [2022-11-26 10:56:02,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 26: [2022-11-26 10:56:02,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 0: [2022-11-26 10:56:02,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 9: [2022-11-26 10:56:02,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 10:56:02,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 10:56:02,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 9: [2022-11-26 10:56:02,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:56:02,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 10:56:02,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 12: [2022-11-26 10:56:02,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:56:02,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:56:02,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 10:56:02,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 10:56:02,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 10:56:02,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 10:56:02,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 12: [2022-11-26 10:56:02,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 12: [2022-11-26 10:56:02,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 5: [2022-11-26 10:56:02,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:56:02,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 10:56:02,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 4: [2022-11-26 10:56:02,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:56:02,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 10:56:02,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
0: [2022-11-26 10:56:02,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:56:02,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:56:02,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 10:56:02,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 0: [2022-11-26 10:56:02,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 10:56:02,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 0: [2022-11-26 10:56:02,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 10:56:02,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 20: [2022-11-26 10:56:02,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:56:02,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 10:56:02,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 16: [2022-11-26 10:56:02,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
16: [2022-11-26 10:56:02,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 10:56:02,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 13: [2022-11-26 10:56:02,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:56:02,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:56:02,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 6: [2022-11-26 10:56:02,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 13: [2022-11-26 10:56:02,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 6: [2022-11-26 10:56:02,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 22: [2022-11-26 10:56:02,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:56:02,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:56:02,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 10:56:02,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
10: [2022-11-26 10:56:02,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 10:56:02,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 21: [2022-11-26 10:56:02,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:56:02,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 10:56:02,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 2: [2022-11-26 10:56:02,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:56:02,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 10:56:02,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 7: [2022-11-26 10:56:02,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:56:02,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 29: [2022-11-26 10:56:02,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:56:02,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
29: [2022-11-26 10:56:02,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 10:56:02,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 28: [2022-11-26 10:56:02,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:56:02,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:56:02,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 10:56:02,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 28: [2022-11-26 10:56:02,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 10:56:02,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 12: [2022-11-26 10:56:02,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:56:02,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 10:56:02,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 4: [2022-11-26 10:56:02,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 10:56:02,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 10:56:02,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 10: [2022-11-26 10:56:02,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:56:02,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 10:56:02,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 26: [2022-11-26 10:56:02,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:56:02,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 10:56:02,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 31: [2022-11-26 10:56:02,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:56:02,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:56:02,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 
31: [2022-11-26 10:56:02,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 10:56:02,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 10:56:02,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 10:56:02,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 31: [2022-11-26 10:56:02,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 31: [2022-11-26 10:56:02,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 16: [2022-11-26 10:56:02,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:56:02,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 10:56:02,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 17: [2022-11-26 10:56:02,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:56:02,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 10:56:02,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
31: [2022-11-26 10:56:02,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:56:02,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 10:56:02,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 1: [2022-11-26 10:56:02,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:56:02,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 10:56:02,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 7: [2022-11-26 10:56:02,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:56:02,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 10:56:02,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 6: [2022-11-26 10:56:02,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:56:02,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 10:56:02,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
3: [2022-11-26 10:56:02,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:56:02,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 10:56:02,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 18: [2022-11-26 10:56:02,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:56:02,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 9: [2022-11-26 10:56:02,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:56:02,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 9: [2022-11-26 10:56:02,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 10:56:02,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 0: [2022-11-26 10:56:02,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:56:02,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 10:56:02,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
28: [2022-11-26 10:56:02,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 10:56:02,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 10:56:02,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 26: [2022-11-26 10:56:02,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:56:02,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 10:56:02,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 22: [2022-11-26 10:56:02,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:56:02,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 10:56:02,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 5: [2022-11-26 10:56:02,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:56:02,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 10:56:02,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
16: [2022-11-26 10:56:02,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:56:02,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 10:56:02,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 8: [2022-11-26 10:56:02,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:56:02,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 10:56:02,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 25: [2022-11-26 10:56:02,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:56:02,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 10:56:02,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 3: [2022-11-26 10:56:02,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:56:02,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 10:56:02,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
30: [2022-11-26 10:56:02,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:56:02,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:56:02,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:56:02,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 10:56:02,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 19: [2022-11-26 10:56:02,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 30: [2022-11-26 10:56:02,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 10:56:02,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 19: [2022-11-26 10:56:02,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 23: [2022-11-26 10:56:02,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:56:02,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 10:56:02,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
30: [2022-11-26 10:56:02,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:56:02,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:56:02,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:56:02,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 10:56:02,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 30: [2022-11-26 10:56:02,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 10:56:02,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 1: [2022-11-26 10:56:02,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 20: [2022-11-26 10:56:02,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:56:02,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 7: [2022-11-26 10:56:02,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
20: [2022-11-26 10:56:02,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 7: [2022-11-26 10:56:02,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 20: [2022-11-26 10:56:02,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 7: [2022-11-26 10:56:02,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 31: [2022-11-26 10:56:02,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:56:02,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 10:56:02,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 21: [2022-11-26 10:56:02,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:56:02,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 10:56:02,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 9: [2022-11-26 10:56:02,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 10:56:02,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 10:56:02,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 2: [2022-11-26 10:56:02,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:56:02,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 10:56:02,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 4: [2022-11-26 10:56:02,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:56:02,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 10:56:02,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 24: [2022-11-26 10:56:02,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:56:02,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 10:56:02,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 12: [2022-11-26 10:56:02,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 10:56:02,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 10:56:02,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 10: [2022-11-26 10:56:02,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:56:02,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 10:56:02,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 18: [2022-11-26 10:56:02,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:56:02,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 2: [2022-11-26 10:56:02,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:56:02,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 2: [2022-11-26 10:56:02,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 10:56:02,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 14: [2022-11-26 10:56:02,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-26 10:56:02,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 10:56:02,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 8: [2022-11-26 10:56:02,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:56:02,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:56:02,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 15: [2022-11-26 10:56:02,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 8: [2022-11-26 10:56:02,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 15: [2022-11-26 10:56:02,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 19: [2022-11-26 10:56:02,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:56:02,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 10:56:02,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 13: [2022-11-26 10:56:02,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 10:56:02,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:56:02,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 10:56:02,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 10:56:02,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 13: [2022-11-26 10:56:02,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 23: [2022-11-26 10:56:02,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:56:02,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 10:56:02,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 31: [2022-11-26 10:56:02,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:56:02,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 10:56:02,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 20: [2022-11-26 10:56:02,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-26 10:56:02,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 10:56:02,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 5: [2022-11-26 10:56:02,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:56:02,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 10:56:02,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 11: [2022-11-26 10:56:02,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:56:02,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 10:56:02,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 24: [2022-11-26 10:56:02,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:56:02,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 10:56:02,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 5: [2022-11-26 10:56:02,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
9: [2022-11-26 10:56:02,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:56:02,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 10:56:02,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 9: [2022-11-26 10:56:02,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 10:56:02,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 0: [2022-11-26 10:56:02,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:56:02,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 10:56:02,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 10: [2022-11-26 10:56:02,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:56:02,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:56:02,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
10: [2022-11-26 10:56:02,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 2: [2022-11-26 10:56:02,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 15: [2022-11-26 10:56:02,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:56:02,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 2: [2022-11-26 10:56:02,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 25: [2022-11-26 10:56:02,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 28: [2022-11-26 10:56:02,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:56:02,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:56:02,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 10:56:02,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 25: [2022-11-26 10:56:02,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
17: [2022-11-26 10:56:02,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 10:56:02,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 7: [2022-11-26 10:56:02,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:56:02,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 4: [2022-11-26 10:56:02,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:56:02,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 4: [2022-11-26 10:56:02,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 10:56:02,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 8: [2022-11-26 10:56:02,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:56:02,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 10:56:02,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 22: [2022-11-26 10:56:02,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
22: [2022-11-26 10:56:02,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 10:56:02,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 26: [2022-11-26 10:56:02,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:56:02,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:56:02,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 27: [2022-11-26 10:56:02,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 26: [2022-11-26 10:56:02,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 3: [2022-11-26 10:56:02,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:56:02,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:56:02,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:56:02,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
12: [2022-11-26 10:56:02,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 3: [2022-11-26 10:56:02,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 12: [2022-11-26 10:56:02,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 3: [2022-11-26 10:56:02,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 21: [2022-11-26 10:56:02,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 10:56:02,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 30: [2022-11-26 10:56:02,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:56:02,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 10:56:02,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 6: [2022-11-26 10:56:02,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:56:02,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 16: [2022-11-26 10:56:02,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
18: [2022-11-26 10:56:02,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:56:02,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 16: [2022-11-26 10:56:02,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 17: [2022-11-26 10:56:02,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:56:02,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 16: [2022-11-26 10:56:02,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 18: [2022-11-26 10:56:02,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 17: [2022-11-26 10:56:02,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 10:56:02,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 20: [2022-11-26 10:56:02,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:56:02,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
29: [2022-11-26 10:56:02,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 25: [2022-11-26 10:56:02,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:56:02,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 25: [2022-11-26 10:56:02,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 10:56:02,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 27: [2022-11-26 10:56:02,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 10:56:02,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 20: [2022-11-26 10:56:02,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 27: [2022-11-26 10:56:02,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 20: [2022-11-26 10:56:02,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 23: [2022-11-26 10:56:02,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
23: [2022-11-26 10:56:02,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 11: [2022-11-26 10:56:02,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 23: [2022-11-26 10:56:02,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 11: [2022-11-26 10:56:02,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 10:56:02,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 24: [2022-11-26 10:56:02,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 10:56:02,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 10:56:02,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 14: [2022-11-26 10:56:02,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:56:02,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 10:56:02,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 29: [2022-11-26 10:56:02,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-26 10:56:02,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 10:56:02,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 0: [2022-11-26 10:56:02,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 28: [2022-11-26 10:56:02,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 10:56:02,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 0: [2022-11-26 10:56:02,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 28: [2022-11-26 10:56:02,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 10:56:02,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 0: [2022-11-26 10:56:02,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 28: [2022-11-26 10:56:02,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 27: [2022-11-26 10:56:02,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
27: [2022-11-26 10:56:02,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 10:56:02,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 6: [2022-11-26 10:56:02,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:56:02,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 10:56:02,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 21: [2022-11-26 10:56:02,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:56:02,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 10:56:02,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 22: [2022-11-26 10:56:02,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 19: [2022-11-26 10:56:02,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:56:02,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 10:56:02,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
19: [2022-11-26 10:56:02,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 10:56:02,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 29: [2022-11-26 10:56:02,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:56:02,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 13: [2022-11-26 10:56:02,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:56:02,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 13: [2022-11-26 10:56:02,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 10:56:02,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 0: [2022-11-26 10:56:02,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:56:02,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 10:56:02,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 9: [2022-11-26 10:56:02,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
1: [2022-11-26 10:56:02,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:56:02,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 1: [2022-11-26 10:56:02,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 10:56:02,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 9: [2022-11-26 10:56:02,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 7: [2022-11-26 10:56:02,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:56:02,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 10:56:02,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 21: [2022-11-26 10:56:02,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 10:56:02,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 10:56:02,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 6: [2022-11-26 10:56:02,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 10:56:02,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 10:56:02,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 18: [2022-11-26 10:56:02,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:56:02,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 18: [2022-11-26 10:56:02,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 11: [2022-11-26 10:56:02,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:56:02,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:56:02,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 15: [2022-11-26 10:56:02,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 29: [2022-11-26 10:56:02,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:56:02,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
18: [2022-11-26 10:56:02,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 12: [2022-11-26 10:56:02,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 11: [2022-11-26 10:56:02,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 10: [2022-11-26 10:56:02,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 15: [2022-11-26 10:56:02,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 29: [2022-11-26 10:56:02,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 10: [2022-11-26 10:56:02,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 15: [2022-11-26 10:56:02,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 29: [2022-11-26 10:56:02,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 17: [2022-11-26 10:56:02,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 10:56:02,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 2: [2022-11-26 10:56:02,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
8: [2022-11-26 10:56:02,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:56:02,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 8: [2022-11-26 10:56:02,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 2: [2022-11-26 10:56:02,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 8: [2022-11-26 10:56:02,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 5: [2022-11-26 10:56:02,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:56:02,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 10:56:02,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 3: [2022-11-26 10:56:02,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:56:02,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 12: [2022-11-26 10:56:02,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:56:02,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
12: [2022-11-26 10:56:02,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 10:56:02,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 11: [2022-11-26 10:56:02,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 28: [2022-11-26 10:56:02,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 25: [2022-11-26 10:56:02,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 28: [2022-11-26 10:56:02,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 25: [2022-11-26 10:56:02,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 28: [2022-11-26 10:56:02,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 25: [2022-11-26 10:56:02,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 10: [2022-11-26 10:56:02,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:56:02,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 17: [2022-11-26 10:56:02,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 
10: [2022-11-26 10:56:02,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 17: [2022-11-26 10:56:02,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 27: [2022-11-26 10:56:02,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 17: [2022-11-26 10:56:02,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 27: [2022-11-26 10:56:02,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 10:56:02,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 30: [2022-11-26 10:56:02,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 10:56:02,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 10:56:02,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 15: [2022-11-26 10:56:02,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:56:02,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
15: [2022-11-26 10:56:02,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 16: [2022-11-26 10:56:02,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 15: [2022-11-26 10:56:02,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 16: [2022-11-26 10:56:02,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 13: [2022-11-26 10:56:02,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:56:02,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:56:02,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 20: [2022-11-26 10:56:02,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 13: [2022-11-26 10:56:02,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 24: [2022-11-26 10:56:02,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 20: [2022-11-26 10:56:02,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
24: [2022-11-26 10:56:02,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 10:56:02,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 16: [2022-11-26 10:56:02,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:56:02,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 19: [2022-11-26 10:56:02,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 16: [2022-11-26 10:56:02,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 19: [2022-11-26 10:56:02,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 10:56:02,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 1: [2022-11-26 10:56:02,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:56:02,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 26: [2022-11-26 10:56:02,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:56:02,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
26: [2022-11-26 10:56:02,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 10:56:02,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 22: [2022-11-26 10:56:02,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 10:56:02,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 10:56:02,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 26: [2022-11-26 10:56:02,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 10:56:02,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 10:56:02,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 4: [2022-11-26 10:56:02,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:56:02,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 10:56:02,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 23: [2022-11-26 10:56:02,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-26 10:56:02,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 10:56:02,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 31: [2022-11-26 10:56:02,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:56:02,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 10:56:02,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 14: [2022-11-26 10:56:02,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:56:02,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 10:56:02,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 31: [2022-11-26 10:56:02,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 10:56:02,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 10:56:02,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 4: [2022-11-26 10:56:02,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 10:56:02,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step75000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 10:56:02,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 0: successfully saved checkpoint at iteration 75000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2837.53 31: iteration 75010/ 173500 | consumed samples: 19202560 | consumed tokens: 39326842880 | elapsed time per iteration (s): 1.04 | learning rate: 1.306E-04 | global batch size: 256 | lm loss: 2.002604E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.985 | TFLOPs: 14.82 | 31: iteration 75020/ 173500 | consumed samples: 19205120 | consumed tokens: 39332085760 | elapsed time per iteration (s): 0.80 | learning rate: 1.306E-04 | global batch size: 256 | lm loss: 1.966949E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.870 | TFLOPs: 19.47 | 31: iteration 75030/ 173500 | consumed samples: 19207680 | consumed tokens: 39337328640 | elapsed time per iteration (s): 0.79 | learning rate: 1.305E-04 | global batch size: 256 | lm loss: 2.009972E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.803 | TFLOPs: 19.65 | 31: iteration 75040/ 173500 | consumed samples: 19210240 | consumed tokens: 39342571520 | elapsed time per iteration (s): 0.73 | learning rate: 1.305E-04 | global batch size: 256 | lm loss: 2.018219E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.248 | TFLOPs: 21.13 | 31: iteration 75050/ 173500 | consumed samples: 19212800 | consumed tokens: 39347814400 | elapsed time per iteration (s): 0.78 | learning rate: 1.305E-04 | global 
batch size: 256 | lm loss: 2.004659E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.917 | TFLOPs: 19.84 | 31: iteration 75060/ 173500 | consumed samples: 19215360 | consumed tokens: 39353057280 | elapsed time per iteration (s): 0.80 | learning rate: 1.305E-04 | global batch size: 256 | lm loss: 1.998614E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.061 | TFLOPs: 19.36 | 31: iteration 75070/ 173500 | consumed samples: 19217920 | consumed tokens: 39358300160 | elapsed time per iteration (s): 0.80 | learning rate: 1.305E-04 | global batch size: 256 | lm loss: 2.020475E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.034 | TFLOPs: 19.42 | 31: iteration 75080/ 173500 | consumed samples: 19220480 | consumed tokens: 39363543040 | elapsed time per iteration (s): 0.75 | learning rate: 1.305E-04 | global batch size: 256 | lm loss: 1.994402E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.017 | TFLOPs: 20.75 | 31: iteration 75090/ 173500 | consumed samples: 19223040 | consumed tokens: 39368785920 | elapsed time per iteration (s): 0.78 | learning rate: 1.304E-04 | global batch size: 256 | lm loss: 1.992657E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.130 | TFLOPs: 19.97 | 31: iteration 75100/ 173500 | consumed samples: 19225600 | consumed tokens: 39374028800 | elapsed time per iteration (s): 0.75 | learning rate: 1.304E-04 | global batch size: 256 | lm loss: 2.000638E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.564 | TFLOPs: 20.54 | 31: iteration 75110/ 173500 | consumed samples: 19228160 
| consumed tokens: 39379271680 | elapsed time per iteration (s): 0.75 | learning rate: 1.304E-04 | global batch size: 256 | lm loss: 2.002898E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.303 | TFLOPs: 20.53 | 31: iteration 75120/ 173500 | consumed samples: 19230720 | consumed tokens: 39384514560 | elapsed time per iteration (s): 0.75 | learning rate: 1.304E-04 | global batch size: 256 | lm loss: 1.986851E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.767 | TFLOPs: 20.74 | 31: iteration 75130/ 173500 | consumed samples: 19233280 | consumed tokens: 39389757440 | elapsed time per iteration (s): 0.71 | learning rate: 1.304E-04 | global batch size: 256 | lm loss: 2.024828E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 360.291 | TFLOPs: 21.80 | 31: iteration 75140/ 173500 | consumed samples: 19235840 | consumed tokens: 39395000320 | elapsed time per iteration (s): 0.74 | learning rate: 1.304E-04 | global batch size: 256 | lm loss: 2.009049E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.408 | TFLOPs: 20.84 | 31: iteration 75150/ 173500 | consumed samples: 19238400 | consumed tokens: 39400243200 | elapsed time per iteration (s): 0.75 | learning rate: 1.303E-04 | global batch size: 256 | lm loss: 2.023548E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.255 | TFLOPs: 20.65 | 31: iteration 75160/ 173500 | consumed samples: 19240960 | consumed tokens: 39405486080 | elapsed time per iteration (s): 0.74 | learning rate: 1.303E-04 | global batch size: 256 | lm loss: 2.015718E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 
0 | samples per second: 346.913 | TFLOPs: 20.99 | 31: iteration 75170/ 173500 | consumed samples: 19243520 | consumed tokens: 39410728960 | elapsed time per iteration (s): 0.84 | learning rate: 1.303E-04 | global batch size: 256 | lm loss: 2.006246E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.727 | TFLOPs: 18.44 | 31: iteration 75180/ 173500 | consumed samples: 19246080 | consumed tokens: 39415971840 | elapsed time per iteration (s): 0.79 | learning rate: 1.303E-04 | global batch size: 256 | lm loss: 2.037126E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.383 | TFLOPs: 19.62 | 31: iteration 75190/ 173500 | consumed samples: 19248640 | consumed tokens: 39421214720 | elapsed time per iteration (s): 0.74 | learning rate: 1.303E-04 | global batch size: 256 | lm loss: 1.998339E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.264 | TFLOPs: 20.95 | 31: iteration 75200/ 173500 | consumed samples: 19251200 | consumed tokens: 39426457600 | elapsed time per iteration (s): 0.80 | learning rate: 1.303E-04 | global batch size: 256 | lm loss: 2.030938E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.669 | TFLOPs: 19.46 | 31: iteration 75210/ 173500 | consumed samples: 19253760 | consumed tokens: 39431700480 | elapsed time per iteration (s): 0.80 | learning rate: 1.302E-04 | global batch size: 256 | lm loss: 2.043652E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.562 | TFLOPs: 19.27 | 31: iteration 75220/ 173500 | consumed samples: 19256320 | consumed tokens: 39436943360 | elapsed time per iteration (s): 0.78 | learning rate: 1.302E-04 | global batch size: 256 | lm loss: 
2.043522E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.385 | TFLOPs: 19.93 | 31: iteration 75230/ 173500 | consumed samples: 19258880 | consumed tokens: 39442186240 | elapsed time per iteration (s): 0.81 | learning rate: 1.302E-04 | global batch size: 256 | lm loss: 2.031425E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.408 | TFLOPs: 19.14 | 31: iteration 75240/ 173500 | consumed samples: 19261440 | consumed tokens: 39447429120 | elapsed time per iteration (s): 0.87 | learning rate: 1.302E-04 | global batch size: 256 | lm loss: 2.034230E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.941 | TFLOPs: 17.90 | 31: iteration 75250/ 173500 | consumed samples: 19264000 | consumed tokens: 39452672000 | elapsed time per iteration (s): 0.82 | learning rate: 1.302E-04 | global batch size: 256 | lm loss: 2.014873E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.081 | TFLOPs: 18.82 | 31: iteration 75260/ 173500 | consumed samples: 19266560 | consumed tokens: 39457914880 | elapsed time per iteration (s): 0.82 | learning rate: 1.302E-04 | global batch size: 256 | lm loss: 2.000616E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.391 | TFLOPs: 18.90 | 31: iteration 75270/ 173500 | consumed samples: 19269120 | consumed tokens: 39463157760 | elapsed time per iteration (s): 0.82 | learning rate: 1.302E-04 | global batch size: 256 | lm loss: 2.014512E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.794 | TFLOPs: 18.86 | 31: iteration 75280/ 173500 | consumed samples: 19271680 | consumed tokens: 
39468400640 | elapsed time per iteration (s): 0.74 | learning rate: 1.301E-04 | global batch size: 256 | lm loss: 2.021354E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.912 | TFLOPs: 20.99 | 31: iteration 75290/ 173500 | consumed samples: 19274240 | consumed tokens: 39473643520 | elapsed time per iteration (s): 0.77 | learning rate: 1.301E-04 | global batch size: 256 | lm loss: 2.031159E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.529 | TFLOPs: 20.18 | 31: iteration 75300/ 173500 | consumed samples: 19276800 | consumed tokens: 39478886400 | elapsed time per iteration (s): 0.74 | learning rate: 1.301E-04 | global batch size: 256 | lm loss: 2.044855E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.882 | TFLOPs: 20.80 | 31: iteration 75310/ 173500 | consumed samples: 19279360 | consumed tokens: 39484129280 | elapsed time per iteration (s): 0.75 | learning rate: 1.301E-04 | global batch size: 256 | lm loss: 1.983597E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.489 | TFLOPs: 20.72 | 31: iteration 75320/ 173500 | consumed samples: 19281920 | consumed tokens: 39489372160 | elapsed time per iteration (s): 0.78 | learning rate: 1.301E-04 | global batch size: 256 | lm loss: 1.987591E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.873 | TFLOPs: 19.84 | 31: iteration 75330/ 173500 | consumed samples: 19284480 | consumed tokens: 39494615040 | elapsed time per iteration (s): 0.77 | learning rate: 1.301E-04 | global batch size: 256 | lm loss: 2.005052E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 331.254 | TFLOPs: 20.04 | 31: iteration 75340/ 173500 | consumed samples: 19287040 | consumed tokens: 39499857920 | elapsed time per iteration (s): 0.80 | learning rate: 1.300E-04 | global batch size: 256 | lm loss: 1.988515E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.065 | TFLOPs: 19.24 | 31: iteration 75350/ 173500 | consumed samples: 19289600 | consumed tokens: 39505100800 | elapsed time per iteration (s): 0.78 | learning rate: 1.300E-04 | global batch size: 256 | lm loss: 2.014738E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.641 | TFLOPs: 19.94 | 31: iteration 75360/ 173500 | consumed samples: 19292160 | consumed tokens: 39510343680 | elapsed time per iteration (s): 0.80 | learning rate: 1.300E-04 | global batch size: 256 | lm loss: 1.976641E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.601 | TFLOPs: 19.40 | 31: iteration 75370/ 173500 | consumed samples: 19294720 | consumed tokens: 39515586560 | elapsed time per iteration (s): 0.73 | learning rate: 1.300E-04 | global batch size: 256 | lm loss: 2.006997E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.897 | TFLOPs: 21.35 | 31: iteration 75380/ 173500 | consumed samples: 19297280 | consumed tokens: 39520829440 | elapsed time per iteration (s): 0.81 | learning rate: 1.300E-04 | global batch size: 256 | lm loss: 2.004318E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.585 | TFLOPs: 19.09 | 31: iteration 75390/ 173500 | consumed samples: 19299840 | consumed tokens: 39526072320 | elapsed time per iteration (s): 0.77 | learning rate: 1.300E-04 | global batch size: 256 | lm loss: 2.021548E+00 | grad 
norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.630 | TFLOPs: 20.18 | 31: iteration 75400/ 173500 | consumed samples: 19302400 | consumed tokens: 39531315200 | elapsed time per iteration (s): 0.77 | learning rate: 1.299E-04 | global batch size: 256 | lm loss: 2.023398E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.043 | TFLOPs: 20.15 | 31: iteration 75410/ 173500 | consumed samples: 19304960 | consumed tokens: 39536558080 | elapsed time per iteration (s): 0.80 | learning rate: 1.299E-04 | global batch size: 256 | lm loss: 2.021322E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.522 | TFLOPs: 19.33 | 31: iteration 75420/ 173500 | consumed samples: 19307520 | consumed tokens: 39541800960 | elapsed time per iteration (s): 0.73 | learning rate: 1.299E-04 | global batch size: 256 | lm loss: 2.017205E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.445 | TFLOPs: 21.14 | 31: iteration 75430/ 173500 | consumed samples: 19310080 | consumed tokens: 39547043840 | elapsed time per iteration (s): 0.77 | learning rate: 1.299E-04 | global batch size: 256 | lm loss: 1.998135E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.318 | TFLOPs: 20.10 | 31: iteration 75440/ 173500 | consumed samples: 19312640 | consumed tokens: 39552286720 | elapsed time per iteration (s): 0.78 | learning rate: 1.299E-04 | global batch size: 256 | lm loss: 2.019568E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.587 | TFLOPs: 19.76 | 31: iteration 75450/ 173500 | consumed samples: 19315200 | consumed tokens: 39557529600 | elapsed time 
per iteration (s): 0.78 | learning rate: 1.299E-04 | global batch size: 256 | lm loss: 2.016947E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.877 | TFLOPs: 19.84 | 31: iteration 75460/ 173500 | consumed samples: 19317760 | consumed tokens: 39562772480 | elapsed time per iteration (s): 0.76 | learning rate: 1.298E-04 | global batch size: 256 | lm loss: 2.002757E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.643 | TFLOPs: 20.37 | 31: iteration 75470/ 173500 | consumed samples: 19320320 | consumed tokens: 39568015360 | elapsed time per iteration (s): 0.74 | learning rate: 1.298E-04 | global batch size: 256 | lm loss: 2.010593E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.009 | TFLOPs: 20.99 | 31: iteration 75480/ 173500 | consumed samples: 19322880 | consumed tokens: 39573258240 | elapsed time per iteration (s): 0.78 | learning rate: 1.298E-04 | global batch size: 256 | lm loss: 1.984114E+00 | grad norm: 3.599 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.373 | TFLOPs: 19.81 | 31: iteration 75490/ 173500 | consumed samples: 19325440 | consumed tokens: 39578501120 | elapsed time per iteration (s): 0.87 | learning rate: 1.298E-04 | global batch size: 256 | lm loss: 2.037819E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.082 | TFLOPs: 17.85 | 31: iteration 75500/ 173500 | consumed samples: 19328000 | consumed tokens: 39583744000 | elapsed time per iteration (s): 0.81 | learning rate: 1.298E-04 | global batch size: 256 | lm loss: 2.031756E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.608 | TFLOPs: 
19.15 | 31: iteration 75510/ 173500 | consumed samples: 19330560 | consumed tokens: 39588986880 | elapsed time per iteration (s): 0.74 | learning rate: 1.298E-04 | global batch size: 256 | lm loss: 2.018379E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.548 | TFLOPs: 20.90 | 31: iteration 75520/ 173500 | consumed samples: 19333120 | consumed tokens: 39594229760 | elapsed time per iteration (s): 0.75 | learning rate: 1.298E-04 | global batch size: 256 | lm loss: 2.017007E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.917 | TFLOPs: 20.56 | 31: iteration 75530/ 173500 | consumed samples: 19335680 | consumed tokens: 39599472640 | elapsed time per iteration (s): 0.75 | learning rate: 1.297E-04 | global batch size: 256 | lm loss: 2.008404E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.610 | TFLOPs: 20.73 | 31: iteration 75540/ 173500 | consumed samples: 19338240 | consumed tokens: 39604715520 | elapsed time per iteration (s): 0.83 | learning rate: 1.297E-04 | global batch size: 256 | lm loss: 2.131782E+00 | grad norm: 17.860 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.966 | TFLOPs: 18.63 | 31: iteration 75550/ 173500 | consumed samples: 19340800 | consumed tokens: 39609958400 | elapsed time per iteration (s): 0.77 | learning rate: 1.297E-04 | global batch size: 256 | lm loss: 2.077259E+00 | grad norm: 0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.706 | TFLOPs: 20.19 | 31: iteration 75560/ 173500 | consumed samples: 19343360 | consumed tokens: 39615201280 | elapsed time per iteration (s): 0.82 | learning rate: 1.297E-04 | global batch size: 256 | lm loss: 2.041671E+00 | grad norm: 0.148 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.977 | TFLOPs: 18.87 | 31: iteration 75570/ 173500 | consumed samples: 19345920 | consumed tokens: 39620444160 | elapsed time per iteration (s): 0.92 | learning rate: 1.297E-04 | global batch size: 256 | lm loss: 2.060395E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 279.122 | TFLOPs: 16.89 | 31: iteration 75580/ 173500 | consumed samples: 19348480 | consumed tokens: 39625687040 | elapsed time per iteration (s): 0.77 | learning rate: 1.297E-04 | global batch size: 256 | lm loss: 2.024295E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.972 | TFLOPs: 20.02 | 31: iteration 75590/ 173500 | consumed samples: 19351040 | consumed tokens: 39630929920 | elapsed time per iteration (s): 0.84 | learning rate: 1.296E-04 | global batch size: 256 | lm loss: 2.012687E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.032 | TFLOPs: 18.39 | 31: iteration 75600/ 173500 | consumed samples: 19353600 | consumed tokens: 39636172800 | elapsed time per iteration (s): 0.76 | learning rate: 1.296E-04 | global batch size: 256 | lm loss: 2.032852E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.725 | TFLOPs: 20.43 | 31: iteration 75610/ 173500 | consumed samples: 19356160 | consumed tokens: 39641415680 | elapsed time per iteration (s): 0.74 | learning rate: 1.296E-04 | global batch size: 256 | lm loss: 2.001810E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.973 | TFLOPs: 20.81 | 31: iteration 75620/ 173500 | consumed samples: 19358720 | consumed tokens: 39646658560 | elapsed time per iteration (s): 0.75 | 
learning rate: 1.296E-04 | global batch size: 256 | lm loss: 2.013273E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.366 | TFLOPs: 20.65 | 31: iteration 75630/ 173500 | consumed samples: 19361280 | consumed tokens: 39651901440 | elapsed time per iteration (s): 0.71 | learning rate: 1.296E-04 | global batch size: 256 | lm loss: 2.046727E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.056 | TFLOPs: 21.66 | 31: iteration 75640/ 173500 | consumed samples: 19363840 | consumed tokens: 39657144320 | elapsed time per iteration (s): 0.77 | learning rate: 1.296E-04 | global batch size: 256 | lm loss: 2.022276E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.055 | TFLOPs: 20.03 | 31: iteration 75650/ 173500 | consumed samples: 19366400 | consumed tokens: 39662387200 | elapsed time per iteration (s): 0.76 | learning rate: 1.295E-04 | global batch size: 256 | lm loss: 2.019503E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.328 | TFLOPs: 20.47 | 31: iteration 75660/ 173500 | consumed samples: 19368960 | consumed tokens: 39667630080 | elapsed time per iteration (s): 0.77 | learning rate: 1.295E-04 | global batch size: 256 | lm loss: 2.015390E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.648 | TFLOPs: 20.06 | 31: iteration 75670/ 173500 | consumed samples: 19371520 | consumed tokens: 39672872960 | elapsed time per iteration (s): 0.72 | learning rate: 1.295E-04 | global batch size: 256 | lm loss: 2.009179E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.926 | TFLOPs: 21.41 | 31: iteration 75680/ 
173500 | consumed samples: 19374080 | consumed tokens: 39678115840 | elapsed time per iteration (s): 0.77 | learning rate: 1.295E-04 | global batch size: 256 | lm loss: 2.011761E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.626 | TFLOPs: 20.00 | 31: iteration 75690/ 173500 | consumed samples: 19376640 | consumed tokens: 39683358720 | elapsed time per iteration (s): 0.77 | learning rate: 1.295E-04 | global batch size: 256 | lm loss: 2.000914E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.020 | TFLOPs: 20.03 | 31: iteration 75700/ 173500 | consumed samples: 19379200 | consumed tokens: 39688601600 | elapsed time per iteration (s): 0.77 | learning rate: 1.295E-04 | global batch size: 256 | lm loss: 2.014482E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.675 | TFLOPs: 20.07 | 31: iteration 75710/ 173500 | consumed samples: 19381760 | consumed tokens: 39693844480 | elapsed time per iteration (s): 0.74 | learning rate: 1.294E-04 | global batch size: 256 | lm loss: 1.995837E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.873 | TFLOPs: 20.80 | 31: iteration 75720/ 173500 | consumed samples: 19384320 | consumed tokens: 39699087360 | elapsed time per iteration (s): 0.80 | learning rate: 1.294E-04 | global batch size: 256 | lm loss: 2.004984E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.511 | TFLOPs: 19.45 | 31: iteration 75730/ 173500 | consumed samples: 19386880 | consumed tokens: 39704330240 | elapsed time per iteration (s): 0.80 | learning rate: 1.294E-04 | global batch size: 256 | lm loss: 2.025996E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 319.994 | TFLOPs: 19.36 | 31: iteration 75740/ 173500 | consumed samples: 19389440 | consumed tokens: 39709573120 | elapsed time per iteration (s): 0.79 | learning rate: 1.294E-04 | global batch size: 256 | lm loss: 1.990695E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.292 | TFLOPs: 19.62 | 31: iteration 75750/ 173500 | consumed samples: 19392000 | consumed tokens: 39714816000 | elapsed time per iteration (s): 0.81 | learning rate: 1.294E-04 | global batch size: 256 | lm loss: 2.026226E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.507 | TFLOPs: 19.09 | 31: iteration 75760/ 173500 | consumed samples: 19394560 | consumed tokens: 39720058880 | elapsed time per iteration (s): 0.78 | learning rate: 1.294E-04 | global batch size: 256 | lm loss: 2.006921E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.978 | TFLOPs: 19.96 | 31: iteration 75770/ 173500 | consumed samples: 19397120 | consumed tokens: 39725301760 | elapsed time per iteration (s): 0.79 | learning rate: 1.294E-04 | global batch size: 256 | lm loss: 2.044121E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.340 | TFLOPs: 19.62 | 31: iteration 75780/ 173500 | consumed samples: 19399680 | consumed tokens: 39730544640 | elapsed time per iteration (s): 0.80 | learning rate: 1.293E-04 | global batch size: 256 | lm loss: 2.017979E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.925 | TFLOPs: 19.48 | 31: iteration 75790/ 173500 | consumed samples: 19402240 | consumed tokens: 39735787520 | elapsed time per iteration (s): 0.85 | learning rate: 
1.293E-04 | global batch size: 256 | lm loss: 1.991755E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.154 | TFLOPs: 18.22 | 31: iteration 75800/ 173500 | consumed samples: 19404800 | consumed tokens: 39741030400 | elapsed time per iteration (s): 0.83 | learning rate: 1.293E-04 | global batch size: 256 | lm loss: 2.002353E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.136 | TFLOPs: 18.70 | 31: iteration 75810/ 173500 | consumed samples: 19407360 | consumed tokens: 39746273280 | elapsed time per iteration (s): 0.81 | learning rate: 1.293E-04 | global batch size: 256 | lm loss: 1.998491E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.978 | TFLOPs: 19.12 | 31: iteration 75820/ 173500 | consumed samples: 19409920 | consumed tokens: 39751516160 | elapsed time per iteration (s): 0.82 | learning rate: 1.293E-04 | global batch size: 256 | lm loss: 2.023980E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.386 | TFLOPs: 18.96 | 31: iteration 75830/ 173500 | consumed samples: 19412480 | consumed tokens: 39756759040 | elapsed time per iteration (s): 0.79 | learning rate: 1.293E-04 | global batch size: 256 | lm loss: 2.027311E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.407 | TFLOPs: 19.57 | 31: iteration 75840/ 173500 | consumed samples: 19415040 | consumed tokens: 39762001920 | elapsed time per iteration (s): 0.83 | learning rate: 1.292E-04 | global batch size: 256 | lm loss: 2.024710E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.609 | TFLOPs: 18.55 | 31: iteration 75850/ 173500 | 
consumed samples: 19417600 | consumed tokens: 39767244800 | elapsed time per iteration (s): 0.83 | learning rate: 1.292E-04 | global batch size: 256 | lm loss: 2.001419E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.661 | TFLOPs: 18.67 | 31: iteration 75860/ 173500 | consumed samples: 19420160 | consumed tokens: 39772487680 | elapsed time per iteration (s): 0.82 | learning rate: 1.292E-04 | global batch size: 256 | lm loss: 2.027188E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.653 | TFLOPs: 18.98 | 31: iteration 75870/ 173500 | consumed samples: 19422720 | consumed tokens: 39777730560 | elapsed time per iteration (s): 0.85 | learning rate: 1.292E-04 | global batch size: 256 | lm loss: 2.016162E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.790 | TFLOPs: 18.14 | 31: iteration 75880/ 173500 | consumed samples: 19425280 | consumed tokens: 39782973440 | elapsed time per iteration (s): 0.79 | learning rate: 1.292E-04 | global batch size: 256 | lm loss: 2.018558E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.746 | TFLOPs: 19.65 | 31: iteration 75890/ 173500 | consumed samples: 19427840 | consumed tokens: 39788216320 | elapsed time per iteration (s): 0.82 | learning rate: 1.292E-04 | global batch size: 256 | lm loss: 2.029674E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.680 | TFLOPs: 18.80 | 31: iteration 75900/ 173500 | consumed samples: 19430400 | consumed tokens: 39793459200 | elapsed time per iteration (s): 0.78 | learning rate: 1.291E-04 | global batch size: 256 | lm loss: 1.987724E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 328.171 | TFLOPs: 19.85 | 31: iteration 75910/ 173500 | consumed samples: 19432960 | consumed tokens: 39798702080 | elapsed time per iteration (s): 0.80 | learning rate: 1.291E-04 | global batch size: 256 | lm loss: 2.021152E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.240 | TFLOPs: 19.25 | 31: iteration 75920/ 173500 | consumed samples: 19435520 | consumed tokens: 39803944960 | elapsed time per iteration (s): 0.83 | learning rate: 1.291E-04 | global batch size: 256 | lm loss: 2.004014E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.187 | TFLOPs: 18.64 | 31: iteration 75930/ 173500 | consumed samples: 19438080 | consumed tokens: 39809187840 | elapsed time per iteration (s): 0.82 | learning rate: 1.291E-04 | global batch size: 256 | lm loss: 1.996549E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.423 | TFLOPs: 18.90 | 31: iteration 75940/ 173500 | consumed samples: 19440640 | consumed tokens: 39814430720 | elapsed time per iteration (s): 0.82 | learning rate: 1.291E-04 | global batch size: 256 | lm loss: 1.979759E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.427 | TFLOPs: 18.96 | 31: iteration 75950/ 173500 | consumed samples: 19443200 | consumed tokens: 39819673600 | elapsed time per iteration (s): 0.79 | learning rate: 1.291E-04 | global batch size: 256 | lm loss: 2.002861E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.617 | TFLOPs: 19.70 | 31: iteration 75960/ 173500 | consumed samples: 19445760 | consumed tokens: 39824916480 | elapsed time per iteration (s): 0.74 | learning rate: 1.290E-04 | global batch 
size: 256 | lm loss: 2.028924E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.546 | TFLOPs: 20.97 | 31: iteration 75970/ 173500 | consumed samples: 19448320 | consumed tokens: 39830159360 | elapsed time per iteration (s): 0.84 | learning rate: 1.290E-04 | global batch size: 256 | lm loss: 2.029869E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.101 | TFLOPs: 18.40 | 31: iteration 75980/ 173500 | consumed samples: 19450880 | consumed tokens: 39835402240 | elapsed time per iteration (s): 0.76 | learning rate: 1.290E-04 | global batch size: 256 | lm loss: 2.037537E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.114 | TFLOPs: 20.39 | 31: iteration 75990/ 173500 | consumed samples: 19453440 | consumed tokens: 39840645120 | elapsed time per iteration (s): 0.72 | learning rate: 1.290E-04 | global batch size: 256 | lm loss: 2.005338E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.587 | TFLOPs: 21.51 | 0: [2022-11-26 11:09:05,636] [INFO] [logging.py:68:log_dist] [Rank 0] step=76000, skipped=0, lr=[0.0001289804445403464, 0.0001289804445403464, 0.0001289804445403464], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 76000/ 173500 | consumed samples: 19456000 | consumed tokens: 39845888000 | elapsed time per iteration (s): 0.73 | learning rate: 1.290E-04 | global batch size: 256 | lm loss: 2.014163E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.641 | TFLOPs: 21.33 | 0: steps: 76000 loss: 1.9141 iter time (s): 0.793 samples/sec: 322.680 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 
76000 | lm loss value: 1.887744E+00 | lm loss PPL: 6.604450E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 76000 to checkpoints_1b1long 0: [2022-11-26 11:09:05,885] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step76000 is begin to save! 0: [2022-11-26 11:09:05,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_01-model_00-model_states.pt... 0: [2022-11-26 11:09:06,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_01-model_00-model_states.pt. 0: [2022-11-26 11:09:06,121] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_03-model_00-model_states.pt... 0: [2022-11-26 11:09:06,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_03-model_00-model_states.pt. 0: [2022-11-26 11:09:06,206] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_04-model_00-model_states.pt... 0: [2022-11-26 11:09:06,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_04-model_00-model_states.pt. 0: [2022-11-26 11:09:06,283] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_05-model_00-model_states.pt... 0: [2022-11-26 11:09:06,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_05-model_00-model_states.pt. 0: [2022-11-26 11:09:06,358] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_06-model_00-model_states.pt... 0: [2022-11-26 11:09:06,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_06-model_00-model_states.pt. 
0: [2022-11-26 11:09:06,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_07-model_00-model_states.pt... 0: [2022-11-26 11:09:06,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_07-model_00-model_states.pt. 0: [2022-11-26 11:09:06,511] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_08-model_00-model_states.pt... 0: [2022-11-26 11:09:06,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_08-model_00-model_states.pt. 0: [2022-11-26 11:09:06,589] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_09-model_00-model_states.pt... 0: [2022-11-26 11:09:06,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_09-model_00-model_states.pt. 0: [2022-11-26 11:09:06,665] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_10-model_00-model_states.pt... 0: [2022-11-26 11:09:06,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_10-model_00-model_states.pt. 0: [2022-11-26 11:09:06,738] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_11-model_00-model_states.pt... 0: [2022-11-26 11:09:06,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_11-model_00-model_states.pt. 0: [2022-11-26 11:09:06,811] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_12-model_00-model_states.pt... 0: [2022-11-26 11:09:06,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_12-model_00-model_states.pt. 
0: [2022-11-26 11:09:06,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_13-model_00-model_states.pt... 0: [2022-11-26 11:09:06,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_13-model_00-model_states.pt. 0: [2022-11-26 11:09:06,958] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_14-model_00-model_states.pt... 0: [2022-11-26 11:09:07,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_14-model_00-model_states.pt. 0: [2022-11-26 11:09:07,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_15-model_00-model_states.pt... 0: [2022-11-26 11:09:07,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_15-model_00-model_states.pt. 0: [2022-11-26 11:09:07,105] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_16-model_00-model_states.pt... 0: [2022-11-26 11:09:07,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_16-model_00-model_states.pt. 0: [2022-11-26 11:09:07,178] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_17-model_00-model_states.pt... 0: [2022-11-26 11:09:07,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_17-model_00-model_states.pt. 0: [2022-11-26 11:09:07,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_18-model_00-model_states.pt... 0: [2022-11-26 11:09:07,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_18-model_00-model_states.pt. 
0: [2022-11-26 11:09:07,325] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_19-model_00-model_states.pt... 0: [2022-11-26 11:09:07,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_19-model_00-model_states.pt. 0: [2022-11-26 11:09:07,398] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_20-model_00-model_states.pt... 0: [2022-11-26 11:09:07,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_20-model_00-model_states.pt. 0: [2022-11-26 11:09:07,471] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_21-model_00-model_states.pt... 0: [2022-11-26 11:09:07,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_21-model_00-model_states.pt. 0: [2022-11-26 11:09:07,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_22-model_00-model_states.pt... 0: [2022-11-26 11:09:07,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_22-model_00-model_states.pt. 0: [2022-11-26 11:09:07,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_23-model_00-model_states.pt... 0: [2022-11-26 11:09:07,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_23-model_00-model_states.pt. 0: [2022-11-26 11:09:07,691] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_24-model_00-model_states.pt... 0: [2022-11-26 11:09:07,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_24-model_00-model_states.pt. 
0: [2022-11-26 11:09:07,762] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_25-model_00-model_states.pt... 0: [2022-11-26 11:09:07,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_25-model_00-model_states.pt. 0: [2022-11-26 11:09:07,838] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_26-model_00-model_states.pt... 0: [2022-11-26 11:09:07,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_26-model_00-model_states.pt. 0: [2022-11-26 11:09:07,911] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_27-model_00-model_states.pt... 0: [2022-11-26 11:09:07,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_27-model_00-model_states.pt. 0: [2022-11-26 11:09:07,986] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_28-model_00-model_states.pt... 0: [2022-11-26 11:09:08,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_28-model_00-model_states.pt. 0: [2022-11-26 11:09:08,061] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/layer_30-model_00-model_states.pt... 0: [2022-11-26 11:09:08,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/layer_30-model_00-model_states.pt. 0: [2022-11-26 11:09:08,063] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step76000/mp_rank_00_model_states.pt 0: [2022-11-26 11:09:08,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/mp_rank_00_model_states.pt... 
0: [2022-11-26 11:09:08,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/mp_rank_00_model_states.pt. 0: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 
5: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 
8: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 
13: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
12: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 
23: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 
24: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 
30: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 
18: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 
27: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
4: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
1: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
3: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 
23: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 
29: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 
21: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 
27: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
8: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 
12: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 
28: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 
30: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
9: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 16: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 20: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
31: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 30: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 26: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 16: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 25: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 28: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 
14: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:09:08,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 23: [2022-11-26 11:09:08,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 11:09:08,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 11:09:08,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 29: [2022-11-26 11:09:08,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 11:09:08,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 11:09:08,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 7: [2022-11-26 11:09:08,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
24: [2022-11-26 11:09:08,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:09:08,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 11:09:08,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 9: [2022-11-26 11:09:08,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:09:08,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 24: [2022-11-26 11:09:08,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 9: [2022-11-26 11:09:08,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 24: [2022-11-26 11:09:08,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 6: [2022-11-26 11:09:08,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:09:08,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 11:09:08,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 13: [2022-11-26 11:09:08,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 11:09:08,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 5: [2022-11-26 11:09:08,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:09:08,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:09:08,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:09:08,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 5: [2022-11-26 11:09:08,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 11:09:08,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 8: [2022-11-26 11:09:08,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 30: [2022-11-26 11:09:08,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:09:08,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 5: [2022-11-26 11:09:08,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 8: [2022-11-26 11:09:08,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
30: [2022-11-26 11:09:08,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 11:09:08,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 6: [2022-11-26 11:09:08,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:09:08,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 11:09:08,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 26: [2022-11-26 11:09:08,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 11:09:08,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 11:09:08,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 11: [2022-11-26 11:09:08,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:09:08,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 30: [2022-11-26 11:09:08,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
11: [2022-11-26 11:09:08,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 15: [2022-11-26 11:09:08,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 11: [2022-11-26 11:09:08,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 30: [2022-11-26 11:09:08,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 15: [2022-11-26 11:09:08,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 11: [2022-11-26 11:09:08,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:09:08,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 30: [2022-11-26 11:09:08,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 11: [2022-11-26 11:09:08,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 28: [2022-11-26 11:09:08,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:09:08,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-26 11:09:08,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 28: [2022-11-26 11:09:08,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 11:09:08,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 1: [2022-11-26 11:09:08,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 10: [2022-11-26 11:09:08,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:09:08,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 11:09:08,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 4: [2022-11-26 11:09:08,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:09:08,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:09:08,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 11:09:08,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 11:09:08,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
4: [2022-11-26 11:09:08,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 1: [2022-11-26 11:09:08,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:09:08,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:09:08,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 1: [2022-11-26 11:09:08,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 21: [2022-11-26 11:09:08,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:09:08,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 1: [2022-11-26 11:09:08,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 16: [2022-11-26 11:09:08,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 11:09:08,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 21: [2022-11-26 11:09:08,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 11:09:08,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
16: [2022-11-26 11:09:08,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 23: [2022-11-26 11:09:08,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 11:09:08,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 11:09:08,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 2: [2022-11-26 11:09:08,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:09:08,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:09:08,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 11:09:08,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 11:09:08,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 2: [2022-11-26 11:09:08,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 12: [2022-11-26 11:09:08,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:09:08,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 11:09:08,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 11:09:08,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 11:09:08,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 12: [2022-11-26 11:09:08,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 31: [2022-11-26 11:09:08,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 11:09:08,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:09:08,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 31: [2022-11-26 11:09:08,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 7: [2022-11-26 11:09:08,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 31: [2022-11-26 11:09:08,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 31: [2022-11-26 11:09:08,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 7: [2022-11-26 11:09:08,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
31: [2022-11-26 11:09:08,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 29: [2022-11-26 11:09:08,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 11:09:08,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 11:09:08,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 17: [2022-11-26 11:09:08,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 11:09:08,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 11:09:08,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 14: [2022-11-26 11:09:08,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:09:08,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 11:09:08,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 26: [2022-11-26 11:09:08,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-26 11:09:08,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 11:09:08,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 25: [2022-11-26 11:09:08,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 11:09:08,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 11:09:08,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 15: [2022-11-26 11:09:08,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:09:08,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 11:09:08,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 10: [2022-11-26 11:09:08,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:09:08,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 11:09:08,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 0: [2022-11-26 11:09:08,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
14: [2022-11-26 11:09:08,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:09:08,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 0: [2022-11-26 11:09:08,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 14: [2022-11-26 11:09:08,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 0: [2022-11-26 11:09:08,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 14: [2022-11-26 11:09:08,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 16: [2022-11-26 11:09:08,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:09:08,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 16: [2022-11-26 11:09:08,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 11:09:08,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 14: [2022-11-26 11:09:08,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 13: [2022-11-26 11:09:08,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
25: [2022-11-26 11:09:08,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:09:08,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 11:09:08,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:09:08,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 13: [2022-11-26 11:09:08,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 11:09:08,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 25: [2022-11-26 11:09:08,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 11:09:08,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 27: [2022-11-26 11:09:08,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 11:09:08,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
27: [2022-11-26 11:09:08,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 11:09:08,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 1: [2022-11-26 11:09:08,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 27: [2022-11-26 11:09:08,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 27: [2022-11-26 11:09:08,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 1: [2022-11-26 11:09:08,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 11:09:08,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 27: [2022-11-26 11:09:08,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 11:09:08,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 11:09:08,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 8: [2022-11-26 11:09:08,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:09:08,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
8: [2022-11-26 11:09:08,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 15: [2022-11-26 11:09:08,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 8: [2022-11-26 11:09:08,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 15: [2022-11-26 11:09:08,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 21: [2022-11-26 11:09:08,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 11:09:08,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 11:09:08,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 26: [2022-11-26 11:09:08,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:09:08,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 11:09:08,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 26: [2022-11-26 11:09:08,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 16: [2022-11-26 11:09:08,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:09:08,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 16: [2022-11-26 11:09:08,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 11:09:08,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 26: [2022-11-26 11:09:08,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 21: [2022-11-26 11:09:08,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 11:09:08,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 11:09:08,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 22: [2022-11-26 11:09:08,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
22: [2022-11-26 11:09:08,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 11:09:08,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 23: [2022-11-26 11:09:08,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 11:09:08,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 11:09:08,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 22: [2022-11-26 11:09:08,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 11:09:08,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 11:09:08,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 28: [2022-11-26 11:09:08,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 11:09:08,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 11:09:08,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 28: [2022-11-26 11:09:08,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
28: [2022-11-26 11:09:08,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 11:09:08,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 28: [2022-11-26 11:09:08,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 11:09:08,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 11:09:08,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 31: [2022-11-26 11:09:08,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 11:09:08,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 11:09:08,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 8: [2022-11-26 11:09:08,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:09:08,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 11:09:08,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 10: [2022-11-26 11:09:08,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-26 11:09:08,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 11:09:08,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 14: [2022-11-26 11:09:08,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:09:08,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 11:09:08,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 30: [2022-11-26 11:09:08,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 11:09:08,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 11:09:08,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 11: [2022-11-26 11:09:08,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:09:08,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:09:08,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 11:09:08,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
7: [2022-11-26 11:09:08,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 11:09:08,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 7: [2022-11-26 11:09:08,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:09:08,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 11:09:08,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 26: [2022-11-26 11:09:08,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 11:09:08,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 11: [2022-11-26 11:09:08,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 26: [2022-11-26 11:09:08,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 11: [2022-11-26 11:09:08,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 1: [2022-11-26 11:09:08,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:09:08,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
1: [2022-11-26 11:09:08,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 11:09:08,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 2: [2022-11-26 11:09:08,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:09:08,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 11:09:08,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 9: [2022-11-26 11:09:08,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:09:08,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 12: [2022-11-26 11:09:08,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:09:08,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 12: [2022-11-26 11:09:08,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 11:09:08,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 29: [2022-11-26 11:09:08,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
29: [2022-11-26 11:09:08,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 11:09:08,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 6: [2022-11-26 11:09:08,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:09:08,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:09:08,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 11:09:08,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 6: [2022-11-26 11:09:08,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 10: [2022-11-26 11:09:08,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:09:08,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 10: [2022-11-26 11:09:08,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 11:09:08,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 30: [2022-11-26 11:09:08,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-26 11:09:08,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 24: [2022-11-26 11:09:08,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 30: [2022-11-26 11:09:08,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 0: [2022-11-26 11:09:08,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:09:08,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 24: [2022-11-26 11:09:08,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 8: [2022-11-26 11:09:08,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:09:08,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 0: [2022-11-26 11:09:08,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:09:08,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 24: [2022-11-26 11:09:08,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:09:08,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
24: [2022-11-26 11:09:08,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 24: [2022-11-26 11:09:08,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 0: [2022-11-26 11:09:08,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 24: [2022-11-26 11:09:08,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 0: [2022-11-26 11:09:08,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 3: [2022-11-26 11:09:08,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:09:08,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 11:09:08,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 3: [2022-11-26 11:09:08,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:09:08,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 11:09:08,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 3: [2022-11-26 11:09:08,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 11:09:08,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 11:09:08,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 3: [2022-11-26 11:09:08,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:09:08,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 11:09:08,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 21: [2022-11-26 11:09:08,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 11:09:08,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 11:09:08,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 9: [2022-11-26 11:09:08,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:09:08,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 11:09:08,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 22: [2022-11-26 11:09:08,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
22: [2022-11-26 11:09:08,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 11:09:08,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 4: [2022-11-26 11:09:08,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:09:08,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 11:09:08,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 16: [2022-11-26 11:09:08,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 11:09:08,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 11:09:08,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 24: [2022-11-26 11:09:08,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 11:09:08,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 11:09:08,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 13: [2022-11-26 11:09:08,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 11:09:08,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 11:09:08,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 2: [2022-11-26 11:09:08,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:09:08,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 11:09:08,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 20: [2022-11-26 11:09:08,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 11:09:08,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 11:09:08,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 11:09:08,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 
20: [2022-11-26 11:09:08,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 11:09:08,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 11:09:08,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 11:09:08,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 11:09:08,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 20: [2022-11-26 11:09:08,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 20: [2022-11-26 11:09:08,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 20: [2022-11-26 11:09:08,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 15: [2022-11-26 11:09:08,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:09:08,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 11:09:08,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 27: [2022-11-26 11:09:08,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
5: [2022-11-26 11:09:08,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:09:08,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 11:09:08,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 5: [2022-11-26 11:09:08,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 22: [2022-11-26 11:09:08,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:09:08,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 22: [2022-11-26 11:09:08,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 27: [2022-11-26 11:09:08,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 5: [2022-11-26 11:09:08,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 22: [2022-11-26 11:09:08,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 27: [2022-11-26 11:09:08,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 25: [2022-11-26 11:09:08,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
0: [2022-11-26 11:09:08,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 25: [2022-11-26 11:09:08,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 0: [2022-11-26 11:09:08,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 25: [2022-11-26 11:09:08,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 0: [2022-11-26 11:09:08,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 23: [2022-11-26 11:09:08,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 11:09:08,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 11:09:08,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 18: [2022-11-26 11:09:08,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 11:09:08,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
18: [2022-11-26 11:09:08,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 11:09:08,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 11:09:08,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 11:09:08,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 18: [2022-11-26 11:09:08,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 11:09:08,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 11:09:08,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 11:09:08,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 18: [2022-11-26 11:09:08,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 18: [2022-11-26 11:09:08,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 17: [2022-11-26 11:09:08,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 
17: [2022-11-26 11:09:08,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 11:09:08,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 17: [2022-11-26 11:09:08,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 11:09:08,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 11:09:08,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 17: [2022-11-26 11:09:08,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 11:09:08,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 11:09:08,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 31: [2022-11-26 11:09:08,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 11:09:08,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 11:09:08,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 12: [2022-11-26 11:09:08,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 11:09:08,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 11:09:08,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 29: [2022-11-26 11:09:08,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 11:09:08,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 11:09:08,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 25: [2022-11-26 11:09:08,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 11:09:08,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 11:09:08,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 6: [2022-11-26 11:09:08,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:09:08,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 11:09:08,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 5: [2022-11-26 11:09:08,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 11:09:08,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 11:09:08,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 10: [2022-11-26 11:09:08,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:09:08,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 11:09:08,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 20: [2022-11-26 11:09:08,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 11:09:08,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 11:09:08,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 28: [2022-11-26 11:09:08,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 11:09:08,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 11:09:08,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 9: [2022-11-26 11:09:08,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-26 11:09:08,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 11:09:08,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 8: [2022-11-26 11:09:08,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:09:08,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 11:09:08,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 15: [2022-11-26 11:09:08,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:09:08,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 11:09:08,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 14: [2022-11-26 11:09:08,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:09:08,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 11:09:08,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 26: [2022-11-26 11:09:08,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 
26: [2022-11-26 11:09:08,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 11:09:08,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 11: [2022-11-26 11:09:08,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:09:08,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 11:09:08,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 0: [2022-11-26 11:09:08,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:09:08,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 11:09:08,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 16: [2022-11-26 11:09:08,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 11:09:08,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 11:09:08,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 4: [2022-11-26 11:09:08,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 11:09:08,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 11:09:08,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 24: [2022-11-26 11:09:08,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 11:09:08,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 11:09:08,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 27: [2022-11-26 11:09:08,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:09:08,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 27: [2022-11-26 11:09:08,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 11:09:08,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 1: [2022-11-26 11:09:08,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 11:09:08,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 7: [2022-11-26 11:09:08,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 11:09:08,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 11:09:08,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 17: [2022-11-26 11:09:08,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 11:09:08,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 11:09:08,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 21: [2022-11-26 11:09:08,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 11:09:08,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 11:09:08,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 12: [2022-11-26 11:09:08,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:09:08,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 11:09:08,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 29: [2022-11-26 11:09:08,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-26 11:09:08,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 11:09:08,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 30: [2022-11-26 11:09:08,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 11:09:08,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 11:09:08,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 2: [2022-11-26 11:09:08,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:09:08,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 11:09:08,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 23: [2022-11-26 11:09:08,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 11:09:08,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 11:09:08,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 18: [2022-11-26 11:09:08,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
31: [2022-11-26 11:09:08,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 18: [2022-11-26 11:09:08,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 11:09:08,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 31: [2022-11-26 11:09:08,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 11:09:08,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 3: [2022-11-26 11:09:08,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:09:08,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 13: [2022-11-26 11:09:08,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:09:08,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 13: [2022-11-26 11:09:08,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 11:09:08,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 22: [2022-11-26 11:09:08,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
22: [2022-11-26 11:09:08,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 11:09:08,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 25: [2022-11-26 11:09:08,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 11:09:08,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 11:09:08,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 5: [2022-11-26 11:09:08,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:09:08,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 11:09:08,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 6: [2022-11-26 11:09:08,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:09:08,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 11:09:08,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 9: [2022-11-26 11:09:08,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-26 11:09:08,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 11:09:08,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 10: [2022-11-26 11:09:08,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:09:08,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 11:09:08,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 8: [2022-11-26 11:09:08,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:09:08,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 11:09:08,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 28: [2022-11-26 11:09:08,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:09:08,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:09:08,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
28: [2022-11-26 11:09:08,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 15: [2022-11-26 11:09:08,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 0: [2022-11-26 11:09:08,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 15: [2022-11-26 11:09:08,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 0: [2022-11-26 11:09:08,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 28: [2022-11-26 11:09:08,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 20: [2022-11-26 11:09:08,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 11:09:08,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 11:09:08,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 3: [2022-11-26 11:09:08,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:09:08,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 11:09:08,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
26: [2022-11-26 11:09:08,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 11:09:08,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 11:09:08,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 11: [2022-11-26 11:09:08,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:09:08,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 11:09:08,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 27: [2022-11-26 11:09:08,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 11:09:08,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 11:09:08,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 1: [2022-11-26 11:09:08,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:09:08,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 11:09:08,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
14: [2022-11-26 11:09:08,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:09:08,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 11:09:08,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 7: [2022-11-26 11:09:08,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 16: [2022-11-26 11:09:08,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 11:09:08,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 7: [2022-11-26 11:09:08,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 11:09:08,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 16: [2022-11-26 11:09:08,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 4: [2022-11-26 11:09:08,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:09:08,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 11:09:08,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
17: [2022-11-26 11:09:08,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 11:09:08,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 11:09:08,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 30: [2022-11-26 11:09:08,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 11:09:08,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 23: [2022-11-26 11:09:08,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 30: [2022-11-26 11:09:08,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 23: [2022-11-26 11:09:08,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 11:09:08,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 24: [2022-11-26 11:09:08,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 11:09:08,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 11:09:08,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
2: [2022-11-26 11:09:08,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 21: [2022-11-26 11:09:08,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:09:08,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 21: [2022-11-26 11:09:08,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 2: [2022-11-26 11:09:08,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 21: [2022-11-26 11:09:08,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 19: [2022-11-26 11:09:08,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 11:09:08,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 11:09:08,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 22: [2022-11-26 11:09:08,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 11:09:08,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 11:09:08,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
12: [2022-11-26 11:09:08,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 29: [2022-11-26 11:09:08,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:09:08,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 29: [2022-11-26 11:09:08,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 12: [2022-11-26 11:09:08,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 29: [2022-11-26 11:09:08,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 31: [2022-11-26 11:09:08,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 11:09:08,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 11:09:08,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 18: [2022-11-26 11:09:08,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 11:09:08,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 11:09:08,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
25: [2022-11-26 11:09:08,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 11:09:08,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 11:09:08,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 13: [2022-11-26 11:09:08,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:09:08,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 11:09:08,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 6: [2022-11-26 11:09:08,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:09:08,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 11:09:08,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 10: [2022-11-26 11:09:08,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:09:08,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 11:09:08,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
20: [2022-11-26 11:09:08,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 11:09:08,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 11:09:08,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 28: [2022-11-26 11:09:08,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 11:09:08,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 11:09:08,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 0: [2022-11-26 11:09:08,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:09:08,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:09:08,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 11:09:08,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 11: [2022-11-26 11:09:08,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-26 11:09:08,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 11:09:08,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 8: [2022-11-26 11:09:08,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:09:08,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 11:09:08,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 9: [2022-11-26 11:09:08,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 26: [2022-11-26 11:09:08,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:09:08,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 11:09:08,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 26: [2022-11-26 11:09:08,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 11:09:08,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 4: [2022-11-26 11:09:08,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 11:09:08,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 11:09:08,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 19: [2022-11-26 11:09:08,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 11:09:08,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 11:09:08,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 1: [2022-11-26 11:09:08,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:09:08,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 11:09:08,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 7: [2022-11-26 11:09:08,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:09:08,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 11:09:08,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 27: [2022-11-26 11:09:08,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-26 11:09:08,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 11:09:08,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 14: [2022-11-26 11:09:08,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:09:08,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 11:09:08,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 0: [2022-11-26 11:09:08,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 11:09:08,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 3: [2022-11-26 11:09:08,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:09:08,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 11:09:08,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 16: [2022-11-26 11:09:08,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
16: [2022-11-26 11:09:08,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 11:09:08,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 30: [2022-11-26 11:09:08,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 11:09:08,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 11:09:08,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 24: [2022-11-26 11:09:08,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 11:09:08,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 11:09:08,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 5: [2022-11-26 11:09:08,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:09:08,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 11:09:08,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 17: [2022-11-26 11:09:08,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 
17: [2022-11-26 11:09:08,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 11:09:08,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 2: [2022-11-26 11:09:08,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:09:08,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 11:09:08,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 21: [2022-11-26 11:09:08,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 11:09:08,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 11:09:08,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 6: [2022-11-26 11:09:08,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:09:08,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 11:09:08,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 29: [2022-11-26 11:09:08,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
0: [2022-11-26 11:09:08,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 29: [2022-11-26 11:09:08,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 11:09:08,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 0: [2022-11-26 11:09:08,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 12: [2022-11-26 11:09:08,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:09:08,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 12: [2022-11-26 11:09:08,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 11:09:08,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 15: [2022-11-26 11:09:08,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:09:08,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 11:09:08,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 11: [2022-11-26 11:09:08,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 11:09:08,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 11:09:08,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 1: [2022-11-26 11:09:08,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:09:08,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 11:09:08,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 22: [2022-11-26 11:09:08,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 11:09:08,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 11:09:08,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 13: [2022-11-26 11:09:08,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:09:08,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 11:09:08,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 23: [2022-11-26 11:09:08,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
23: [2022-11-26 11:09:08,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 11:09:08,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 27: [2022-11-26 11:09:08,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 11:09:08,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 11:09:08,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 5: [2022-11-26 11:09:08,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:09:08,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:09:08,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 11:09:08,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 8: [2022-11-26 11:09:08,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 20: [2022-11-26 11:09:08,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:09:08,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
20: [2022-11-26 11:09:08,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 11:09:08,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 18: [2022-11-26 11:09:08,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 11:09:08,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 11:09:08,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 28: [2022-11-26 11:09:08,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 11:09:08,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 11:09:08,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 10: [2022-11-26 11:09:08,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:09:08,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 11:09:08,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 9: [2022-11-26 11:09:08,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-26 11:09:08,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 11:09:08,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 14: [2022-11-26 11:09:08,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:09:08,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 11:09:08,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 30: [2022-11-26 11:09:08,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 11:09:08,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 11:09:08,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 24: [2022-11-26 11:09:08,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 11:09:08,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 31: [2022-11-26 11:09:08,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 24: [2022-11-26 11:09:08,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
12: [2022-11-26 11:09:08,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 31: [2022-11-26 11:09:08,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 12: [2022-11-26 11:09:08,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 11:09:08,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 31: [2022-11-26 11:09:08,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 16: [2022-11-26 11:09:08,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 29: [2022-11-26 11:09:08,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 16: [2022-11-26 11:09:08,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 29: [2022-11-26 11:09:08,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 11:09:08,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 16: [2022-11-26 11:09:08,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 3: [2022-11-26 11:09:08,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 11:09:08,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 11:09:08,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 21: [2022-11-26 11:09:08,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 11:09:08,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 11:09:08,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 26: [2022-11-26 11:09:08,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 18: [2022-11-26 11:09:08,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 26: [2022-11-26 11:09:08,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 31: [2022-11-26 11:09:08,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 18: [2022-11-26 11:09:08,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 26: [2022-11-26 11:09:08,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
31: [2022-11-26 11:09:08,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 18: [2022-11-26 11:09:08,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 31: [2022-11-26 11:09:08,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 4: [2022-11-26 11:09:08,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:09:08,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 11:09:08,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 7: [2022-11-26 11:09:08,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:09:08,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 11:09:08,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 25: [2022-11-26 11:09:08,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 11:09:08,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 11:09:08,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
19: [2022-11-26 11:09:08,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 11:09:08,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 11:09:08,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 2: [2022-11-26 11:09:08,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:09:08,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 11:09:08,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 23: [2022-11-26 11:09:08,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 11:09:08,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 11:09:08,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 13: [2022-11-26 11:09:08,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:09:08,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 11:09:08,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
19: [2022-11-26 11:09:08,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 11:09:08,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 11:09:08,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 19: [2022-11-26 11:09:08,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 11:09:08,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 11:09:08,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 19: [2022-11-26 11:09:08,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 11:09:08,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 11:09:08,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 19: [2022-11-26 11:09:08,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 11:09:08,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 11:09:08,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
22: [2022-11-26 11:09:08,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 11:09:08,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 11:09:08,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 17: [2022-11-26 11:09:08,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 11:09:08,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 11:09:08,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 19: [2022-11-26 11:09:08,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 11:09:08,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 11:09:08,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 25: [2022-11-26 11:09:08,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 11:09:08,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step76000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 11:09:08,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
0: successfully saved checkpoint at iteration 76000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2521.22 31: iteration 76010/ 173500 | consumed samples: 19458560 | consumed tokens: 39851130880 | elapsed time per iteration (s): 1.06 | learning rate: 1.290E-04 | global batch size: 256 | lm loss: 1.993409E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.372 | TFLOPs: 14.60 | 31: iteration 76020/ 173500 | consumed samples: 19461120 | consumed tokens: 39856373760 | elapsed time per iteration (s): 0.77 | learning rate: 1.289E-04 | global batch size: 256 | lm loss: 2.007950E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.358 | TFLOPs: 20.05 | 31: iteration 76030/ 173500 | consumed samples: 19463680 | consumed tokens: 39861616640 | elapsed time per iteration (s): 1.01 | learning rate: 1.289E-04 | global batch size: 256 | lm loss: 2.017157E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 253.871 | TFLOPs: 15.36 | 31: iteration 76040/ 173500 | consumed samples: 19466240 | consumed tokens: 39866859520 | elapsed time per iteration (s): 0.80 | learning rate: 1.289E-04 | global batch size: 256 | lm loss: 2.003363E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.873 | TFLOPs: 19.29 | 31: iteration 76050/ 173500 | consumed samples: 19468800 | consumed tokens: 39872102400 | elapsed time per iteration (s): 0.81 | learning rate: 1.289E-04 | global batch size: 256 | lm loss: 2.032658E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.175 | TFLOPs: 19.01 | 31: iteration 76060/ 173500 | consumed samples: 19471360 | consumed tokens: 39877345280 | elapsed time per iteration (s): 0.81 | 
learning rate: 1.289E-04 | global batch size: 256 | lm loss: 2.021026E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.739 | TFLOPs: 19.16 | 31: iteration 76070/ 173500 | consumed samples: 19473920 | consumed tokens: 39882588160 | elapsed time per iteration (s): 0.79 | learning rate: 1.289E-04 | global batch size: 256 | lm loss: 1.998482E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.286 | TFLOPs: 19.56 | 31: iteration 76080/ 173500 | consumed samples: 19476480 | consumed tokens: 39887831040 | elapsed time per iteration (s): 0.72 | learning rate: 1.289E-04 | global batch size: 256 | lm loss: 2.027375E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.674 | TFLOPs: 21.58 | 31: iteration 76090/ 173500 | consumed samples: 19479040 | consumed tokens: 39893073920 | elapsed time per iteration (s): 0.76 | learning rate: 1.288E-04 | global batch size: 256 | lm loss: 2.003600E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.025 | TFLOPs: 20.33 | 31: iteration 76100/ 173500 | consumed samples: 19481600 | consumed tokens: 39898316800 | elapsed time per iteration (s): 0.76 | learning rate: 1.288E-04 | global batch size: 256 | lm loss: 2.034285E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.083 | TFLOPs: 20.33 | 31: iteration 76110/ 173500 | consumed samples: 19484160 | consumed tokens: 39903559680 | elapsed time per iteration (s): 0.79 | learning rate: 1.288E-04 | global batch size: 256 | lm loss: 2.018962E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.792 | TFLOPs: 19.65 | 31: iteration 76120/ 
173500 | consumed samples: 19486720 | consumed tokens: 39908802560 | elapsed time per iteration (s): 0.77 | learning rate: 1.288E-04 | global batch size: 256 | lm loss: 1.996055E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.053 | TFLOPs: 20.09 | 31: iteration 76130/ 173500 | consumed samples: 19489280 | consumed tokens: 39914045440 | elapsed time per iteration (s): 0.76 | learning rate: 1.288E-04 | global batch size: 256 | lm loss: 2.015997E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.984 | TFLOPs: 20.33 | 31: iteration 76140/ 173500 | consumed samples: 19491840 | consumed tokens: 39919288320 | elapsed time per iteration (s): 0.75 | learning rate: 1.288E-04 | global batch size: 256 | lm loss: 2.027978E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.889 | TFLOPs: 20.56 | 31: iteration 76150/ 173500 | consumed samples: 19494400 | consumed tokens: 39924531200 | elapsed time per iteration (s): 0.76 | learning rate: 1.287E-04 | global batch size: 256 | lm loss: 2.033183E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.777 | TFLOPs: 20.50 | 31: iteration 76160/ 173500 | consumed samples: 19496960 | consumed tokens: 39929774080 | elapsed time per iteration (s): 0.74 | learning rate: 1.287E-04 | global batch size: 256 | lm loss: 2.032570E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.171 | TFLOPs: 20.88 | 31: iteration 76170/ 173500 | consumed samples: 19499520 | consumed tokens: 39935016960 | elapsed time per iteration (s): 0.78 | learning rate: 1.287E-04 | global batch size: 256 | lm loss: 2.006437E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 327.577 | TFLOPs: 19.82 | 31: iteration 76180/ 173500 | consumed samples: 19502080 | consumed tokens: 39940259840 | elapsed time per iteration (s): 0.76 | learning rate: 1.287E-04 | global batch size: 256 | lm loss: 2.024143E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.856 | TFLOPs: 20.44 | 31: iteration 76190/ 173500 | consumed samples: 19504640 | consumed tokens: 39945502720 | elapsed time per iteration (s): 0.77 | learning rate: 1.287E-04 | global batch size: 256 | lm loss: 2.010225E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.044 | TFLOPs: 20.03 | 31: iteration 76200/ 173500 | consumed samples: 19507200 | consumed tokens: 39950745600 | elapsed time per iteration (s): 0.78 | learning rate: 1.287E-04 | global batch size: 256 | lm loss: 2.020098E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.853 | TFLOPs: 19.96 | 31: iteration 76210/ 173500 | consumed samples: 19509760 | consumed tokens: 39955988480 | elapsed time per iteration (s): 0.81 | learning rate: 1.286E-04 | global batch size: 256 | lm loss: 1.996616E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.779 | TFLOPs: 19.04 | 31: iteration 76220/ 173500 | consumed samples: 19512320 | consumed tokens: 39961231360 | elapsed time per iteration (s): 0.73 | learning rate: 1.286E-04 | global batch size: 256 | lm loss: 2.016575E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.391 | TFLOPs: 21.14 | 31: iteration 76230/ 173500 | consumed samples: 19514880 | consumed tokens: 39966474240 | elapsed time per iteration (s): 0.79 | learning rate: 
1.286E-04 | global batch size: 256 | lm loss: 2.043435E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.010 | TFLOPs: 19.54 | 31: iteration 76240/ 173500 | consumed samples: 19517440 | consumed tokens: 39971717120 | elapsed time per iteration (s): 0.80 | learning rate: 1.286E-04 | global batch size: 256 | lm loss: 1.996262E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.412 | TFLOPs: 19.44 | 31: iteration 76250/ 173500 | consumed samples: 19520000 | consumed tokens: 39976960000 | elapsed time per iteration (s): 0.78 | learning rate: 1.286E-04 | global batch size: 256 | lm loss: 1.985860E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.416 | TFLOPs: 19.81 | 31: iteration 76260/ 173500 | consumed samples: 19522560 | consumed tokens: 39982202880 | elapsed time per iteration (s): 0.83 | learning rate: 1.286E-04 | global batch size: 256 | lm loss: 2.018348E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.084 | TFLOPs: 18.64 | 31: iteration 76270/ 173500 | consumed samples: 19525120 | consumed tokens: 39987445760 | elapsed time per iteration (s): 0.78 | learning rate: 1.285E-04 | global batch size: 256 | lm loss: 2.016727E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.286 | TFLOPs: 19.80 | 31: iteration 76280/ 173500 | consumed samples: 19527680 | consumed tokens: 39992688640 | elapsed time per iteration (s): 0.78 | learning rate: 1.285E-04 | global batch size: 256 | lm loss: 1.989543E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.120 | TFLOPs: 19.85 | 31: iteration 76290/ 173500 | 
consumed samples: 19530240 | consumed tokens: 39997931520 | elapsed time per iteration (s): 0.82 | learning rate: 1.285E-04 | global batch size: 256 | lm loss: 2.003348E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.940 | TFLOPs: 18.81 | 31: iteration 76300/ 173500 | consumed samples: 19532800 | consumed tokens: 40003174400 | elapsed time per iteration (s): 0.79 | learning rate: 1.285E-04 | global batch size: 256 | lm loss: 2.004173E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.488 | TFLOPs: 19.51 | 31: iteration 76310/ 173500 | consumed samples: 19535360 | consumed tokens: 40008417280 | elapsed time per iteration (s): 0.79 | learning rate: 1.285E-04 | global batch size: 256 | lm loss: 1.988947E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.845 | TFLOPs: 19.59 | 31: iteration 76320/ 173500 | consumed samples: 19537920 | consumed tokens: 40013660160 | elapsed time per iteration (s): 0.81 | learning rate: 1.285E-04 | global batch size: 256 | lm loss: 2.015292E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.549 | TFLOPs: 19.09 | 31: iteration 76330/ 173500 | consumed samples: 19540480 | consumed tokens: 40018903040 | elapsed time per iteration (s): 0.82 | learning rate: 1.284E-04 | global batch size: 256 | lm loss: 2.050614E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.489 | TFLOPs: 18.84 | 31: iteration 76340/ 173500 | consumed samples: 19543040 | consumed tokens: 40024145920 | elapsed time per iteration (s): 0.83 | learning rate: 1.284E-04 | global batch size: 256 | lm loss: 2.052825E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 309.871 | TFLOPs: 18.75 | 31: iteration 76350/ 173500 | consumed samples: 19545600 | consumed tokens: 40029388800 | elapsed time per iteration (s): 0.81 | learning rate: 1.284E-04 | global batch size: 256 | lm loss: 2.039934E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.794 | TFLOPs: 19.23 | 31: iteration 76360/ 173500 | consumed samples: 19548160 | consumed tokens: 40034631680 | elapsed time per iteration (s): 0.79 | learning rate: 1.284E-04 | global batch size: 256 | lm loss: 2.001391E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.768 | TFLOPs: 19.65 | 31: iteration 76370/ 173500 | consumed samples: 19550720 | consumed tokens: 40039874560 | elapsed time per iteration (s): 0.78 | learning rate: 1.284E-04 | global batch size: 256 | lm loss: 2.017383E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.892 | TFLOPs: 19.78 | 31: iteration 76380/ 173500 | consumed samples: 19553280 | consumed tokens: 40045117440 | elapsed time per iteration (s): 0.76 | learning rate: 1.284E-04 | global batch size: 256 | lm loss: 1.996947E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.776 | TFLOPs: 20.25 | 31: iteration 76390/ 173500 | consumed samples: 19555840 | consumed tokens: 40050360320 | elapsed time per iteration (s): 0.84 | learning rate: 1.284E-04 | global batch size: 256 | lm loss: 1.999121E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.723 | TFLOPs: 18.37 | 31: iteration 76400/ 173500 | consumed samples: 19558400 | consumed tokens: 40055603200 | elapsed time per iteration (s): 0.74 | learning rate: 1.283E-04 | global batch 
size: 256 | lm loss: 2.011879E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.630 | TFLOPs: 20.85 | 31: iteration 76410/ 173500 | consumed samples: 19560960 | consumed tokens: 40060846080 | elapsed time per iteration (s): 0.74 | learning rate: 1.283E-04 | global batch size: 256 | lm loss: 2.035503E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.851 | TFLOPs: 20.92 | 31: iteration 76420/ 173500 | consumed samples: 19563520 | consumed tokens: 40066088960 | elapsed time per iteration (s): 0.73 | learning rate: 1.283E-04 | global batch size: 256 | lm loss: 2.019399E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.962 | TFLOPs: 21.17 | 31: iteration 76430/ 173500 | consumed samples: 19566080 | consumed tokens: 40071331840 | elapsed time per iteration (s): 0.77 | learning rate: 1.283E-04 | global batch size: 256 | lm loss: 2.022152E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.901 | TFLOPs: 20.20 | 31: iteration 76440/ 173500 | consumed samples: 19568640 | consumed tokens: 40076574720 | elapsed time per iteration (s): 0.77 | learning rate: 1.283E-04 | global batch size: 256 | lm loss: 1.995504E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.787 | TFLOPs: 20.19 | 31: iteration 76450/ 173500 | consumed samples: 19571200 | consumed tokens: 40081817600 | elapsed time per iteration (s): 0.80 | learning rate: 1.283E-04 | global batch size: 256 | lm loss: 2.011541E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.681 | TFLOPs: 19.34 | 31: iteration 76460/ 173500 | consumed samples: 19573760 | 
consumed tokens: 40087060480 | elapsed time per iteration (s): 0.80 | learning rate: 1.282E-04 | global batch size: 256 | lm loss: 2.007715E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.613 | TFLOPs: 19.28 | 31: iteration 76470/ 173500 | consumed samples: 19576320 | consumed tokens: 40092303360 | elapsed time per iteration (s): 0.81 | learning rate: 1.282E-04 | global batch size: 256 | lm loss: 2.029826E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.793 | TFLOPs: 19.10 | 31: iteration 76480/ 173500 | consumed samples: 19578880 | consumed tokens: 40097546240 | elapsed time per iteration (s): 0.80 | learning rate: 1.282E-04 | global batch size: 256 | lm loss: 2.007439E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.163 | TFLOPs: 19.43 | 31: iteration 76490/ 173500 | consumed samples: 19581440 | consumed tokens: 40102789120 | elapsed time per iteration (s): 0.93 | learning rate: 1.282E-04 | global batch size: 256 | lm loss: 1.996859E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 275.473 | TFLOPs: 16.67 | 31: iteration 76500/ 173500 | consumed samples: 19584000 | consumed tokens: 40108032000 | elapsed time per iteration (s): 0.85 | learning rate: 1.282E-04 | global batch size: 256 | lm loss: 1.979050E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.853 | TFLOPs: 18.14 | 31: iteration 76510/ 173500 | consumed samples: 19586560 | consumed tokens: 40113274880 | elapsed time per iteration (s): 0.83 | learning rate: 1.282E-04 | global batch size: 256 | lm loss: 2.005416E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 308.166 | TFLOPs: 18.64 | 31: iteration 76520/ 173500 | consumed samples: 19589120 | consumed tokens: 40118517760 | elapsed time per iteration (s): 0.83 | learning rate: 1.281E-04 | global batch size: 256 | lm loss: 2.020011E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.772 | TFLOPs: 18.56 | 31: iteration 76530/ 173500 | consumed samples: 19591680 | consumed tokens: 40123760640 | elapsed time per iteration (s): 0.91 | learning rate: 1.281E-04 | global batch size: 256 | lm loss: 2.024061E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 281.558 | TFLOPs: 17.03 | 31: iteration 76540/ 173500 | consumed samples: 19594240 | consumed tokens: 40129003520 | elapsed time per iteration (s): 0.80 | learning rate: 1.281E-04 | global batch size: 256 | lm loss: 1.991200E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.694 | TFLOPs: 19.40 | 31: iteration 76550/ 173500 | consumed samples: 19596800 | consumed tokens: 40134246400 | elapsed time per iteration (s): 0.78 | learning rate: 1.281E-04 | global batch size: 256 | lm loss: 2.029784E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.969 | TFLOPs: 19.78 | 31: iteration 76560/ 173500 | consumed samples: 19599360 | consumed tokens: 40139489280 | elapsed time per iteration (s): 0.81 | learning rate: 1.281E-04 | global batch size: 256 | lm loss: 2.024717E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.937 | TFLOPs: 19.05 | 31: iteration 76570/ 173500 | consumed samples: 19601920 | consumed tokens: 40144732160 | elapsed time per iteration (s): 0.82 | learning rate: 1.281E-04 | global batch size: 256 | lm loss: 
2.037693E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.958 | TFLOPs: 18.87 | 31: iteration 76580/ 173500 | consumed samples: 19604480 | consumed tokens: 40149975040 | elapsed time per iteration (s): 0.84 | learning rate: 1.280E-04 | global batch size: 256 | lm loss: 2.009273E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.266 | TFLOPs: 18.41 | 31: iteration 76590/ 173500 | consumed samples: 19607040 | consumed tokens: 40155217920 | elapsed time per iteration (s): 0.83 | learning rate: 1.280E-04 | global batch size: 256 | lm loss: 2.009103E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.792 | TFLOPs: 18.62 | 31: iteration 76600/ 173500 | consumed samples: 19609600 | consumed tokens: 40160460800 | elapsed time per iteration (s): 0.82 | learning rate: 1.280E-04 | global batch size: 256 | lm loss: 1.984226E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.722 | TFLOPs: 18.80 | 31: iteration 76610/ 173500 | consumed samples: 19612160 | consumed tokens: 40165703680 | elapsed time per iteration (s): 0.76 | learning rate: 1.280E-04 | global batch size: 256 | lm loss: 2.005120E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.311 | TFLOPs: 20.41 | 31: iteration 76620/ 173500 | consumed samples: 19614720 | consumed tokens: 40170946560 | elapsed time per iteration (s): 0.78 | learning rate: 1.280E-04 | global batch size: 256 | lm loss: 2.009049E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.620 | TFLOPs: 19.76 | 31: iteration 76630/ 173500 | consumed samples: 19617280 | consumed tokens: 
40176189440 | elapsed time per iteration (s): 0.82 | learning rate: 1.280E-04 | global batch size: 256 | lm loss: 2.026991E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.659 | TFLOPs: 18.98 | 31: iteration 76640/ 173500 | consumed samples: 19619840 | consumed tokens: 40181432320 | elapsed time per iteration (s): 0.76 | learning rate: 1.279E-04 | global batch size: 256 | lm loss: 2.036276E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.028 | TFLOPs: 20.39 | 31: iteration 76650/ 173500 | consumed samples: 19622400 | consumed tokens: 40186675200 | elapsed time per iteration (s): 0.79 | learning rate: 1.279E-04 | global batch size: 256 | lm loss: 2.007284E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.907 | TFLOPs: 19.54 | 31: iteration 76660/ 173500 | consumed samples: 19624960 | consumed tokens: 40191918080 | elapsed time per iteration (s): 0.77 | learning rate: 1.279E-04 | global batch size: 256 | lm loss: 2.006242E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.151 | TFLOPs: 20.15 | 31: iteration 76670/ 173500 | consumed samples: 19627520 | consumed tokens: 40197160960 | elapsed time per iteration (s): 0.77 | learning rate: 1.279E-04 | global batch size: 256 | lm loss: 1.998361E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.686 | TFLOPs: 20.07 | 31: iteration 76680/ 173500 | consumed samples: 19630080 | consumed tokens: 40202403840 | elapsed time per iteration (s): 0.77 | learning rate: 1.279E-04 | global batch size: 256 | lm loss: 2.000536E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 332.028 | TFLOPs: 20.09 | 31: iteration 76690/ 173500 | consumed samples: 19632640 | consumed tokens: 40207646720 | elapsed time per iteration (s): 0.77 | learning rate: 1.279E-04 | global batch size: 256 | lm loss: 2.012577E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.277 | TFLOPs: 20.04 | 31: iteration 76700/ 173500 | consumed samples: 19635200 | consumed tokens: 40212889600 | elapsed time per iteration (s): 0.78 | learning rate: 1.279E-04 | global batch size: 256 | lm loss: 1.996187E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.108 | TFLOPs: 19.85 | 31: iteration 76710/ 173500 | consumed samples: 19637760 | consumed tokens: 40218132480 | elapsed time per iteration (s): 0.79 | learning rate: 1.278E-04 | global batch size: 256 | lm loss: 1.984503E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.578 | TFLOPs: 19.70 | 31: iteration 76720/ 173500 | consumed samples: 19640320 | consumed tokens: 40223375360 | elapsed time per iteration (s): 0.79 | learning rate: 1.278E-04 | global batch size: 256 | lm loss: 1.996957E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.284 | TFLOPs: 19.56 | 31: iteration 76730/ 173500 | consumed samples: 19642880 | consumed tokens: 40228618240 | elapsed time per iteration (s): 0.77 | learning rate: 1.278E-04 | global batch size: 256 | lm loss: 2.018557E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.669 | TFLOPs: 20.19 | 31: iteration 76740/ 173500 | consumed samples: 19645440 | consumed tokens: 40233861120 | elapsed time per iteration (s): 0.78 | learning rate: 1.278E-04 | global batch size: 256 | lm loss: 2.026265E+00 | grad 
norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.013 | TFLOPs: 19.96 | 31: iteration 76750/ 173500 | consumed samples: 19648000 | consumed tokens: 40239104000 | elapsed time per iteration (s): 0.80 | learning rate: 1.278E-04 | global batch size: 256 | lm loss: 2.026319E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.788 | TFLOPs: 19.47 | 31: iteration 76760/ 173500 | consumed samples: 19650560 | consumed tokens: 40244346880 | elapsed time per iteration (s): 0.83 | learning rate: 1.278E-04 | global batch size: 256 | lm loss: 2.015080E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.022 | TFLOPs: 18.76 | 31: iteration 76770/ 173500 | consumed samples: 19653120 | consumed tokens: 40249589760 | elapsed time per iteration (s): 0.78 | learning rate: 1.277E-04 | global batch size: 256 | lm loss: 2.013892E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.702 | TFLOPs: 19.89 | 31: iteration 76780/ 173500 | consumed samples: 19655680 | consumed tokens: 40254832640 | elapsed time per iteration (s): 0.75 | learning rate: 1.277E-04 | global batch size: 256 | lm loss: 1.992379E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.219 | TFLOPs: 20.58 | 31: iteration 76790/ 173500 | consumed samples: 19658240 | consumed tokens: 40260075520 | elapsed time per iteration (s): 0.79 | learning rate: 1.277E-04 | global batch size: 256 | lm loss: 2.001809E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.934 | TFLOPs: 19.54 | 31: iteration 76800/ 173500 | consumed samples: 19660800 | consumed tokens: 40265318400 | elapsed time 
per iteration (s): 0.87 | learning rate: 1.277E-04 | global batch size: 256 | lm loss: 1.998812E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.233 | TFLOPs: 17.80 | 31: iteration 76810/ 173500 | consumed samples: 19663360 | consumed tokens: 40270561280 | elapsed time per iteration (s): 0.81 | learning rate: 1.277E-04 | global batch size: 256 | lm loss: 2.003345E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.425 | TFLOPs: 19.14 | 31: iteration 76820/ 173500 | consumed samples: 19665920 | consumed tokens: 40275804160 | elapsed time per iteration (s): 0.84 | learning rate: 1.277E-04 | global batch size: 256 | lm loss: 2.006365E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.770 | TFLOPs: 18.50 | 31: iteration 76830/ 173500 | consumed samples: 19668480 | consumed tokens: 40281047040 | elapsed time per iteration (s): 2.54 | learning rate: 1.276E-04 | global batch size: 256 | lm loss: 2.041093E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 100.816 | TFLOPs: 6.10 | 31: iteration 76840/ 173500 | consumed samples: 19671040 | consumed tokens: 40286289920 | elapsed time per iteration (s): 0.77 | learning rate: 1.276E-04 | global batch size: 256 | lm loss: 2.002671E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.876 | TFLOPs: 20.02 | 31: iteration 76850/ 173500 | consumed samples: 19673600 | consumed tokens: 40291532800 | elapsed time per iteration (s): 0.91 | learning rate: 1.276E-04 | global batch size: 256 | lm loss: 2.025630E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.505 | TFLOPs: 17.09 
| 31: iteration 76860/ 173500 | consumed samples: 19676160 | consumed tokens: 40296775680 | elapsed time per iteration (s): 0.74 | learning rate: 1.276E-04 | global batch size: 256 | lm loss: 2.007406E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.495 | TFLOPs: 21.02 | 31: iteration 76870/ 173500 | consumed samples: 19678720 | consumed tokens: 40302018560 | elapsed time per iteration (s): 0.82 | learning rate: 1.276E-04 | global batch size: 256 | lm loss: 2.034333E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.828 | TFLOPs: 18.80 | 31: iteration 76880/ 173500 | consumed samples: 19681280 | consumed tokens: 40307261440 | elapsed time per iteration (s): 0.85 | learning rate: 1.276E-04 | global batch size: 256 | lm loss: 2.015327E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.696 | TFLOPs: 18.31 | 31: iteration 76890/ 173500 | consumed samples: 19683840 | consumed tokens: 40312504320 | elapsed time per iteration (s): 0.78 | learning rate: 1.275E-04 | global batch size: 256 | lm loss: 2.019841E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.042 | TFLOPs: 19.79 | 31: iteration 76900/ 173500 | consumed samples: 19686400 | consumed tokens: 40317747200 | elapsed time per iteration (s): 0.81 | learning rate: 1.275E-04 | global batch size: 256 | lm loss: 1.980768E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.959 | TFLOPs: 19.05 | 31: iteration 76910/ 173500 | consumed samples: 19688960 | consumed tokens: 40322990080 | elapsed time per iteration (s): 0.76 | learning rate: 1.275E-04 | global batch size: 256 | lm loss: 2.026471E+00 | grad norm: 0.135 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.920 | TFLOPs: 20.50 | 31: iteration 76920/ 173500 | consumed samples: 19691520 | consumed tokens: 40328232960 | elapsed time per iteration (s): 0.80 | learning rate: 1.275E-04 | global batch size: 256 | lm loss: 2.032196E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.953 | TFLOPs: 19.36 | 31: iteration 76930/ 173500 | consumed samples: 19694080 | consumed tokens: 40333475840 | elapsed time per iteration (s): 0.77 | learning rate: 1.275E-04 | global batch size: 256 | lm loss: 2.047177E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.545 | TFLOPs: 20.06 | 31: iteration 76940/ 173500 | consumed samples: 19696640 | consumed tokens: 40338718720 | elapsed time per iteration (s): 0.81 | learning rate: 1.275E-04 | global batch size: 256 | lm loss: 2.004579E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.421 | TFLOPs: 19.20 | 31: iteration 76950/ 173500 | consumed samples: 19699200 | consumed tokens: 40343961600 | elapsed time per iteration (s): 0.75 | learning rate: 1.274E-04 | global batch size: 256 | lm loss: 2.014963E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.718 | TFLOPs: 20.61 | 31: iteration 76960/ 173500 | consumed samples: 19701760 | consumed tokens: 40349204480 | elapsed time per iteration (s): 3.79 | learning rate: 1.274E-04 | global batch size: 256 | lm loss: 2.008582E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 67.479 | TFLOPs: 4.08 | 31: iteration 76970/ 173500 | consumed samples: 19704320 | consumed tokens: 40354447360 | elapsed time per iteration (s): 0.81 | 
learning rate: 1.274E-04 | global batch size: 256 | lm loss: 2.006014E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.651 | TFLOPs: 19.16 |
31: iteration 76980/ 173500 | consumed samples: 19706880 | consumed tokens: 40359690240 | elapsed time per iteration (s): 0.85 | learning rate: 1.274E-04 | global batch size: 256 | lm loss: 2.015769E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.984 | TFLOPs: 18.21 |
31: iteration 76990/ 173500 | consumed samples: 19709440 | consumed tokens: 40364933120 | elapsed time per iteration (s): 0.82 | learning rate: 1.274E-04 | global batch size: 256 | lm loss: 1.999586E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.795 | TFLOPs: 18.98 |
31: iteration 77000/ 173500 | consumed samples: 19712000 | consumed tokens: 40370176000 | elapsed time per iteration (s): 0.78 | learning rate: 1.274E-04 | global batch size: 256 | lm loss: 2.034693E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.490 | TFLOPs: 19.81 |
31: -------------------------------------------------------------------------------------------
31: valid loss at iteration 77000 | lm loss value: 1.944865E+00 | lm loss PPL: 6.992689E+00 |
31: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 77000 to checkpoints_1b1long
0: [2022-11-26 11:23:12,587] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step77000 is begin to save!
0: [2022-11-26 11:23:12,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_01-model_00-model_states.pt...
0: [2022-11-26 11:23:12,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_01-model_00-model_states.pt.
0: [2022-11-26 11:23:12,810] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_03-model_00-model_states.pt...
0: [2022-11-26 11:23:12,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_03-model_00-model_states.pt.
0: [2022-11-26 11:23:12,892] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_04-model_00-model_states.pt...
0: [2022-11-26 11:23:12,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_04-model_00-model_states.pt.
0: [2022-11-26 11:23:12,970] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_05-model_00-model_states.pt...
0: [2022-11-26 11:23:13,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_05-model_00-model_states.pt.
0: [2022-11-26 11:23:13,052] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_06-model_00-model_states.pt...
0: [2022-11-26 11:23:13,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_06-model_00-model_states.pt.
0: [2022-11-26 11:23:13,128] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_07-model_00-model_states.pt...
0: [2022-11-26 11:23:13,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_07-model_00-model_states.pt.
0: [2022-11-26 11:23:13,214] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_08-model_00-model_states.pt...
0: [2022-11-26 11:23:13,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_08-model_00-model_states.pt.
0: [2022-11-26 11:23:13,288] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_09-model_00-model_states.pt...
0: [2022-11-26 11:23:13,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_09-model_00-model_states.pt.
0: [2022-11-26 11:23:13,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_10-model_00-model_states.pt...
0: [2022-11-26 11:23:13,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_10-model_00-model_states.pt.
0: [2022-11-26 11:23:13,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_11-model_00-model_states.pt...
0: [2022-11-26 11:23:13,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_11-model_00-model_states.pt.
0: [2022-11-26 11:23:13,519] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_12-model_00-model_states.pt...
0: [2022-11-26 11:23:13,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_12-model_00-model_states.pt.
0: [2022-11-26 11:23:13,595] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_13-model_00-model_states.pt...
0: [2022-11-26 11:23:13,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_13-model_00-model_states.pt.
0: [2022-11-26 11:23:13,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_14-model_00-model_states.pt...
0: [2022-11-26 11:23:13,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_14-model_00-model_states.pt.
0: [2022-11-26 11:23:13,744] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_15-model_00-model_states.pt...
0: [2022-11-26 11:23:13,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_15-model_00-model_states.pt.
0: [2022-11-26 11:23:13,815] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_16-model_00-model_states.pt...
0: [2022-11-26 11:23:13,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_16-model_00-model_states.pt.
0: [2022-11-26 11:23:13,891] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_17-model_00-model_states.pt...
0: [2022-11-26 11:23:13,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_17-model_00-model_states.pt.
0: [2022-11-26 11:23:13,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_18-model_00-model_states.pt...
0: [2022-11-26 11:23:14,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_18-model_00-model_states.pt.
0: [2022-11-26 11:23:14,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_19-model_00-model_states.pt...
0: [2022-11-26 11:23:14,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_19-model_00-model_states.pt.
0: [2022-11-26 11:23:14,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_20-model_00-model_states.pt...
0: [2022-11-26 11:23:14,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_20-model_00-model_states.pt.
0: [2022-11-26 11:23:14,188] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_21-model_00-model_states.pt...
0: [2022-11-26 11:23:14,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_21-model_00-model_states.pt.
0: [2022-11-26 11:23:14,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_22-model_00-model_states.pt...
0: [2022-11-26 11:23:14,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_22-model_00-model_states.pt.
0: [2022-11-26 11:23:14,336] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_23-model_00-model_states.pt...
0: [2022-11-26 11:23:14,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_23-model_00-model_states.pt.
0: [2022-11-26 11:23:14,408] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_24-model_00-model_states.pt...
0: [2022-11-26 11:23:14,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_24-model_00-model_states.pt.
0: [2022-11-26 11:23:14,484] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_25-model_00-model_states.pt...
0: [2022-11-26 11:23:14,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_25-model_00-model_states.pt.
0: [2022-11-26 11:23:14,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_26-model_00-model_states.pt...
0: [2022-11-26 11:23:14,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_26-model_00-model_states.pt. 0: [2022-11-26 11:23:14,630] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_27-model_00-model_states.pt... 0: [2022-11-26 11:23:14,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_27-model_00-model_states.pt. 0: [2022-11-26 11:23:14,710] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_28-model_00-model_states.pt... 0: [2022-11-26 11:23:14,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_28-model_00-model_states.pt. 0: [2022-11-26 11:23:14,784] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/layer_30-model_00-model_states.pt... 0: [2022-11-26 11:23:14,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/layer_30-model_00-model_states.pt. 0: [2022-11-26 11:23:14,787] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step77000/mp_rank_00_model_states.pt 0: [2022-11-26 11:23:14,787] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/mp_rank_00_model_states.pt... 0: [2022-11-26 11:23:14,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/mp_rank_00_model_states.pt. 0: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
6: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
4: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
10: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
2: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
12: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
11: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 
22: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 
21: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 
26: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
6: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
8: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 
2: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 
20: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 
28: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 
22: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 
27: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 
8: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
15: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
14: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 
26: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
10: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 20: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
14: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 30: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 19: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
28: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 29: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 17: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 17: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
12: [2022-11-26 11:23:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:23:14,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 28: [2022-11-26 11:23:14,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 30: [2022-11-26 11:23:14,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 11:23:14,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 11:23:14,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 6: [2022-11-26 11:23:14,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:23:14,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 11:23:14,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 29: [2022-11-26 11:23:14,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 11:23:14,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 11:23:14,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
24: [2022-11-26 11:23:14,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 29: [2022-11-26 11:23:14,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 11:23:14,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 11:23:14,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 24: [2022-11-26 11:23:14,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 11:23:14,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 24: [2022-11-26 11:23:14,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 16: [2022-11-26 11:23:14,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 11:23:14,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 11:23:14,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 25: [2022-11-26 11:23:14,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
24: [2022-11-26 11:23:14,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 25: [2022-11-26 11:23:14,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 24: [2022-11-26 11:23:14,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 25: [2022-11-26 11:23:14,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 12: [2022-11-26 11:23:14,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:23:14,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 11:23:14,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 20: [2022-11-26 11:23:14,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 11:23:14,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 11:23:14,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 12: [2022-11-26 11:23:14,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 11:23:14,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 11:23:14,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 21: [2022-11-26 11:23:14,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 11:23:14,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 11:23:14,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 13: [2022-11-26 11:23:14,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:23:14,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 18: [2022-11-26 11:23:14,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:23:14,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 18: [2022-11-26 11:23:14,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 11:23:14,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 2: [2022-11-26 11:23:14,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
28: [2022-11-26 11:23:14,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 2: [2022-11-26 11:23:14,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 28: [2022-11-26 11:23:14,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 6: [2022-11-26 11:23:14,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:23:14,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 18: [2022-11-26 11:23:14,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:23:14,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 22: [2022-11-26 11:23:14,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 18: [2022-11-26 11:23:14,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 6: [2022-11-26 11:23:14,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 22: [2022-11-26 11:23:14,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 18: [2022-11-26 11:23:14,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
28: [2022-11-26 11:23:14,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 22: [2022-11-26 11:23:14,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 28: [2022-11-26 11:23:14,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 0: [2022-11-26 11:23:14,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 28: [2022-11-26 11:23:14,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 0: [2022-11-26 11:23:14,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 15: [2022-11-26 11:23:14,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:23:14,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 0: [2022-11-26 11:23:14,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 15: [2022-11-26 11:23:14,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 4: [2022-11-26 11:23:14,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 11:23:14,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 11:23:14,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 3: [2022-11-26 11:23:14,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:23:14,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 11:23:14,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 16: [2022-11-26 11:23:14,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 20: [2022-11-26 11:23:14,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 16: [2022-11-26 11:23:14,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 20: [2022-11-26 11:23:14,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 16: [2022-11-26 11:23:14,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 20: [2022-11-26 11:23:14,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 30: [2022-11-26 11:23:14,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
30: [2022-11-26 11:23:14,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 28: [2022-11-26 11:23:14,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:23:14,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 30: [2022-11-26 11:23:14,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 28: [2022-11-26 11:23:14,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 7: [2022-11-26 11:23:14,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 11:23:14,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 28: [2022-11-26 11:23:14,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 10: [2022-11-26 11:23:14,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:23:14,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 11:23:14,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 29: [2022-11-26 11:23:14,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-26 11:23:14,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 11:23:14,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 21: [2022-11-26 11:23:14,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 11:23:14,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 24: [2022-11-26 11:23:14,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 21: [2022-11-26 11:23:14,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 2: [2022-11-26 11:23:14,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:23:14,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 11:23:14,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 24: [2022-11-26 11:23:14,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 11:23:14,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 23: [2022-11-26 11:23:14,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
31: [2022-11-26 11:23:14,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 23: [2022-11-26 11:23:14,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 11:23:14,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 31: [2022-11-26 11:23:14,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 11:23:14,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 11:23:14,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 11:23:14,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 31: [2022-11-26 11:23:14,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 15: [2022-11-26 11:23:14,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:23:14,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-26 11:23:14,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 11:23:14,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 11:23:14,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 15: [2022-11-26 11:23:14,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 22: [2022-11-26 11:23:14,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 11:23:14,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 11:23:14,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 2: [2022-11-26 11:23:14,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:23:14,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 11:23:14,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 1: [2022-11-26 11:23:14,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 11:23:14,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 11:23:14,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 1: [2022-11-26 11:23:14,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 17: [2022-11-26 11:23:14,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:23:14,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 17: [2022-11-26 11:23:14,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 11:23:14,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 17: [2022-11-26 11:23:14,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:23:14,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 17: [2022-11-26 11:23:14,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 11:23:14,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 4: [2022-11-26 11:23:14,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 11:23:14,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 11:23:14,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 10: [2022-11-26 11:23:14,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:23:14,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 13: [2022-11-26 11:23:14,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:23:14,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 13: [2022-11-26 11:23:14,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 11:23:14,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 21: [2022-11-26 11:23:14,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 11:23:14,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 11:23:14,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 12: [2022-11-26 11:23:14,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-26 11:23:14,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 11:23:14,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 25: [2022-11-26 11:23:14,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 11:23:14,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 11:23:14,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 3: [2022-11-26 11:23:14,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:23:14,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 11:23:14,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 6: [2022-11-26 11:23:14,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:23:14,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 11:23:14,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 13: [2022-11-26 11:23:14,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
18: [2022-11-26 11:23:14,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:23:14,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 18: [2022-11-26 11:23:14,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 13: [2022-11-26 11:23:14,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 18: [2022-11-26 11:23:14,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 20: [2022-11-26 11:23:14,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 11:23:14,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 11:23:14,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 29: [2022-11-26 11:23:14,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 11:23:14,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 11:23:14,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 1: [2022-11-26 11:23:14,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 11:23:14,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 11:23:14,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 28: [2022-11-26 11:23:14,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 11:23:14,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 3: [2022-11-26 11:23:14,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 28: [2022-11-26 11:23:14,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 3: [2022-11-26 11:23:14,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 31: [2022-11-26 11:23:14,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:23:14,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 17: [2022-11-26 11:23:14,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:23:14,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
17: [2022-11-26 11:23:14,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 11:23:14,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 15: [2022-11-26 11:23:14,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 11:23:14,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 31: [2022-11-26 11:23:14,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 11:23:14,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 6: [2022-11-26 11:23:14,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:23:14,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 11:23:14,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 19: [2022-11-26 11:23:14,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 11:23:14,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 11:23:14,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-26 11:23:14,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 21: [2022-11-26 11:23:14,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 11:23:14,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 19: [2022-11-26 11:23:14,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 11:23:14,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 21: [2022-11-26 11:23:14,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 19: [2022-11-26 11:23:14,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 11:23:14,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 11:23:14,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 19: [2022-11-26 11:23:14,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 19: [2022-11-26 11:23:14,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 19: [2022-11-26 11:23:14,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
9: [2022-11-26 11:23:14,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:23:14,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:23:14,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:23:14,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 11:23:14,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 11:23:14,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 11:23:14,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 9: [2022-11-26 11:23:14,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 9: [2022-11-26 11:23:14,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 22: [2022-11-26 11:23:14,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 11:23:14,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 11:23:14,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
5: [2022-11-26 11:23:14,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:23:14,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 11:23:14,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 30: [2022-11-26 11:23:14,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 23: [2022-11-26 11:23:14,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 30: [2022-11-26 11:23:14,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 11:23:14,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 23: [2022-11-26 11:23:14,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 11:23:14,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 5: [2022-11-26 11:23:14,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:23:14,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 23: [2022-11-26 11:23:14,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
5: [2022-11-26 11:23:14,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 23: [2022-11-26 11:23:14,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 11:23:14,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 14: [2022-11-26 11:23:14,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:23:14,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:23:14,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 11:23:14,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 11:23:14,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 14: [2022-11-26 11:23:14,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 18: [2022-11-26 11:23:14,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 11:23:14,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 11:23:14,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
7: [2022-11-26 11:23:14,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:23:14,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:23:14,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:23:14,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 11:23:14,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 11:23:14,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 11:23:14,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 7: [2022-11-26 11:23:14,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 7: [2022-11-26 11:23:14,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 16: [2022-11-26 11:23:14,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 11:23:14,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 11:23:14,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
20: [2022-11-26 11:23:14,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 11:23:14,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 11:23:14,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 10: [2022-11-26 11:23:14,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:23:14,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 11:23:14,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 3: [2022-11-26 11:23:14,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 26: [2022-11-26 11:23:14,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:23:14,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 26: [2022-11-26 11:23:14,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 3: [2022-11-26 11:23:14,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 26: [2022-11-26 11:23:14,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
0: [2022-11-26 11:23:14,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:23:14,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:23:14,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 10: [2022-11-26 11:23:14,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 11:23:14,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 0: [2022-11-26 11:23:14,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 26: [2022-11-26 11:23:14,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:23:14,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 26: [2022-11-26 11:23:14,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 0: [2022-11-26 11:23:14,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 11:23:14,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 4: [2022-11-26 11:23:14,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 11:23:14,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 26: [2022-11-26 11:23:14,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 4: [2022-11-26 11:23:14,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 16: [2022-11-26 11:23:14,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 11:23:14,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 11:23:14,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 2: [2022-11-26 11:23:14,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:23:14,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 11:23:14,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 17: [2022-11-26 11:23:14,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 11:23:14,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 11:23:14,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
4: [2022-11-26 11:23:14,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:23:14,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 14: [2022-11-26 11:23:14,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:23:14,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 11:23:14,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 4: [2022-11-26 11:23:14,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 12: [2022-11-26 11:23:14,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:23:14,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 27: [2022-11-26 11:23:14,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:23:14,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 27: [2022-11-26 11:23:14,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 11:23:14,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
1: [2022-11-26 11:23:14,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:23:14,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 11:23:14,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 23: [2022-11-26 11:23:14,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 11:23:14,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 27: [2022-11-26 11:23:14,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 23: [2022-11-26 11:23:14,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 27: [2022-11-26 11:23:14,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 11:23:14,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 22: [2022-11-26 11:23:14,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 11:23:14,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 11:23:14,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
5: [2022-11-26 11:23:14,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:23:14,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 11:23:14,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 30: [2022-11-26 11:23:14,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 11:23:14,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 11:23:14,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 24: [2022-11-26 11:23:14,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 11:23:14,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 11:23:14,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 9: [2022-11-26 11:23:14,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:23:14,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 11:23:14,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
11: [2022-11-26 11:23:14,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:23:14,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:23:14,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 13: [2022-11-26 11:23:14,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:23:14,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 13: [2022-11-26 11:23:14,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 11: [2022-11-26 11:23:14,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:23:14,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 13: [2022-11-26 11:23:14,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 11: [2022-11-26 11:23:14,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 11: [2022-11-26 11:23:14,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 11:23:14,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
26: [2022-11-26 11:23:14,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 11:23:14,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 11:23:14,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 1: [2022-11-26 11:23:14,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 25: [2022-11-26 11:23:14,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:23:14,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 25: [2022-11-26 11:23:14,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 1: [2022-11-26 11:23:14,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 25: [2022-11-26 11:23:14,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 27: [2022-11-26 11:23:14,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 11:23:14,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 11:23:14,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
25: [2022-11-26 11:23:14,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 11:23:14,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 11:23:14,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 6: [2022-11-26 11:23:14,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:23:14,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 11:23:14,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 31: [2022-11-26 11:23:14,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 11:23:14,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 11:23:14,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 8: [2022-11-26 11:23:14,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:23:14,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:23:14,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-26 11:23:14,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:23:14,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 11:23:14,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 11:23:14,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 11:23:14,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 11:23:14,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 8: [2022-11-26 11:23:14,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 8: [2022-11-26 11:23:14,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 8: [2022-11-26 11:23:14,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 24: [2022-11-26 11:23:14,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 11:23:14,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 11:23:14,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
0: [2022-11-26 11:23:14,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 11:23:14,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 28: [2022-11-26 11:23:14,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 11:23:14,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 11:23:14,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 29: [2022-11-26 11:23:14,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 11:23:14,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 11:23:14,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 19: [2022-11-26 11:23:14,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 11:23:14,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 11:23:14,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 7: [2022-11-26 11:23:14,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 11:23:14,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 11:23:14,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 0: [2022-11-26 11:23:14,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:23:14,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 11:23:14,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 21: [2022-11-26 11:23:14,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 11:23:14,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 11:23:14,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 15: [2022-11-26 11:23:15,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:23:15,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 11:23:15,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 4: [2022-11-26 11:23:15,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 11:23:15,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 11:23:15,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 3: [2022-11-26 11:23:15,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:23:15,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 11:23:15,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 9: [2022-11-26 11:23:15,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:23:15,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 11:23:15,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 30: [2022-11-26 11:23:15,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 11:23:15,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 11:23:15,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 11: [2022-11-26 11:23:15,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-26 11:23:15,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 11:23:15,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 10: [2022-11-26 11:23:15,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:23:15,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 31: [2022-11-26 11:23:15,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 16: [2022-11-26 11:23:15,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 31: [2022-11-26 11:23:15,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 16: [2022-11-26 11:23:15,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 31: [2022-11-26 11:23:15,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 10: [2022-11-26 11:23:15,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 16: [2022-11-26 11:23:15,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 20: [2022-11-26 11:23:15,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
8: [2022-11-26 11:23:15,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 20: [2022-11-26 11:23:15,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 11:23:15,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 8: [2022-11-26 11:23:15,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 17: [2022-11-26 11:23:15,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:23:15,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 26: [2022-11-26 11:23:15,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 11:23:15,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 17: [2022-11-26 11:23:15,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 11:23:15,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 26: [2022-11-26 11:23:15,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 25: [2022-11-26 11:23:15,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
25: [2022-11-26 11:23:15,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 11:23:15,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 27: [2022-11-26 11:23:15,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 18: [2022-11-26 11:23:15,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 27: [2022-11-26 11:23:15,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 18: [2022-11-26 11:23:15,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 27: [2022-11-26 11:23:15,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 18: [2022-11-26 11:23:15,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 12: [2022-11-26 11:23:15,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:23:15,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 11:23:15,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 22: [2022-11-26 11:23:15,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
22: [2022-11-26 11:23:15,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 11:23:15,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 2: [2022-11-26 11:23:15,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:23:15,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 11:23:15,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 1: [2022-11-26 11:23:15,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:23:15,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 11:23:15,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 23: [2022-11-26 11:23:15,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 11:23:15,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 11:23:15,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 19: [2022-11-26 11:23:15,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 
19: [2022-11-26 11:23:15,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 11:23:15,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 5: [2022-11-26 11:23:15,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:23:15,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 11:23:15,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 14: [2022-11-26 11:23:15,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:23:15,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 11:23:15,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 13: [2022-11-26 11:23:15,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:23:15,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 11:23:15,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 6: [2022-11-26 11:23:15,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 11:23:15,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 11:23:15,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 29: [2022-11-26 11:23:15,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 11:23:15,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 11:23:15,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 28: [2022-11-26 11:23:15,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 11:23:15,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 11:23:15,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 7: [2022-11-26 11:23:15,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:23:15,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 11:23:15,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 15: [2022-11-26 11:23:15,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-26 11:23:15,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 11:23:15,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 21: [2022-11-26 11:23:15,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 11:23:15,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 11:23:15,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 24: [2022-11-26 11:23:15,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 11:23:15,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 11:23:15,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 0: [2022-11-26 11:23:15,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:23:15,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 11:23:15,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 3: [2022-11-26 11:23:15,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 11:23:15,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 11:23:15,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 4: [2022-11-26 11:23:15,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:23:15,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 11:23:15,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 9: [2022-11-26 11:23:15,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:23:15,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 11:23:15,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 2: [2022-11-26 11:23:15,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:23:15,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 11:23:15,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 10: [2022-11-26 11:23:15,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
12: [2022-11-26 11:23:15,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:23:15,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 10: [2022-11-26 11:23:15,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 12: [2022-11-26 11:23:15,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 10: [2022-11-26 11:23:15,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 18: [2022-11-26 11:23:15,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 11:23:15,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 11:23:15,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 8: [2022-11-26 11:23:15,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:23:15,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 11:23:15,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 25: [2022-11-26 11:23:15,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
25: [2022-11-26 11:23:15,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 11:23:15,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 16: [2022-11-26 11:23:15,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:23:15,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 16: [2022-11-26 11:23:15,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 11:23:15,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 1: [2022-11-26 11:23:15,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 11:23:15,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 14: [2022-11-26 11:23:15,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 30: [2022-11-26 11:23:15,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-26 11:23:15,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 14: [2022-11-26 11:23:15,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 30: [2022-11-26 11:23:15,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 14: [2022-11-26 11:23:15,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 22: [2022-11-26 11:23:15,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 11:23:15,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 11: [2022-11-26 11:23:15,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 22: [2022-11-26 11:23:15,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 11: [2022-11-26 11:23:15,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 6: [2022-11-26 11:23:15,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:23:15,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
6: [2022-11-26 11:23:15,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 11:23:15,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 20: [2022-11-26 11:23:15,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 11:23:15,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 11:23:15,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 5: [2022-11-26 11:23:15,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:23:15,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 11:23:15,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 23: [2022-11-26 11:23:15,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 11:23:15,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 11:23:15,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 13: [2022-11-26 11:23:15,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 11:23:15,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 11:23:15,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 31: [2022-11-26 11:23:15,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 11:23:15,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 11:23:15,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 19: [2022-11-26 11:23:15,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 24: [2022-11-26 11:23:15,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 19: [2022-11-26 11:23:15,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 11:23:15,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 26: [2022-11-26 11:23:15,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 24: [2022-11-26 11:23:15,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 11:23:15,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
26: [2022-11-26 11:23:15,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 28: [2022-11-26 11:23:15,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 26: [2022-11-26 11:23:15,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 28: [2022-11-26 11:23:15,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 11:23:15,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 29: [2022-11-26 11:23:15,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 11:23:15,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 11:23:15,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 21: [2022-11-26 11:23:15,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 11:23:15,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 11:23:15,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 27: [2022-11-26 11:23:15,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-26 11:23:15,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 11:23:15,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 15: [2022-11-26 11:23:15,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:23:15,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 11:23:15,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 7: [2022-11-26 11:23:15,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:23:15,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 0: [2022-11-26 11:23:15,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:23:15,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 0: [2022-11-26 11:23:15,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 11:23:15,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 2: [2022-11-26 11:23:15,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 11:23:15,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 11:23:15,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 3: [2022-11-26 11:23:15,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:23:15,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 11:23:15,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 25: [2022-11-26 11:23:15,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 11:23:15,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 11:23:15,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 4: [2022-11-26 11:23:15,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:23:15,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 11:23:15,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 17: [2022-11-26 11:23:15,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 
17: [2022-11-26 11:23:15,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 11:23:15,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 11:23:15,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 17: [2022-11-26 11:23:15,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 11:23:15,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 10: [2022-11-26 11:23:15,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:23:15,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 11:23:15,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 8: [2022-11-26 11:23:15,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:23:15,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 11:23:15,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 30: [2022-11-26 11:23:15,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-26 11:23:15,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 11:23:15,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 20: [2022-11-26 11:23:15,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 11:23:15,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 11:23:15,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 18: [2022-11-26 11:23:15,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 11:23:15,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 11:23:15,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 9: [2022-11-26 11:23:15,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:23:15,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 11:23:15,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 31: [2022-11-26 11:23:15,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 
31: [2022-11-26 11:23:15,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 11:23:15,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 13: [2022-11-26 11:23:15,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:23:15,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 11:23:15,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 22: [2022-11-26 11:23:15,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 11:23:15,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 11:23:15,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 12: [2022-11-26 11:23:15,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:23:15,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 11:23:15,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 16: [2022-11-26 11:23:15,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-26 11:23:15,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 11:23:15,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 23: [2022-11-26 11:23:15,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 11:23:15,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 11:23:15,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 26: [2022-11-26 11:23:15,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 11:23:15,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 11:23:15,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 27: [2022-11-26 11:23:15,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 11:23:15,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 11:23:15,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 1: [2022-11-26 11:23:15,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 11:23:15,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 11:23:15,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 6: [2022-11-26 11:23:15,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:23:15,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 11:23:15,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 15: [2022-11-26 11:23:15,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:23:15,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 11:23:15,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 2: [2022-11-26 11:23:15,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 21: [2022-11-26 11:23:15,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:23:15,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 11:23:15,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
14: [2022-11-26 11:23:15,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 21: [2022-11-26 11:23:15,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 14: [2022-11-26 11:23:15,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 21: [2022-11-26 11:23:15,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 14: [2022-11-26 11:23:15,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 25: [2022-11-26 11:23:15,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:23:15,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:23:15,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 11:23:15,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 25: [2022-11-26 11:23:15,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 11:23:15,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 24: [2022-11-26 11:23:15,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-26 11:23:15,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 31: [2022-11-26 11:23:15,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 24: [2022-11-26 11:23:15,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 31: [2022-11-26 11:23:15,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 11:23:15,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 0: [2022-11-26 11:23:15,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 16: [2022-11-26 11:23:15,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:23:15,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 11:23:15,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 16: [2022-11-26 11:23:15,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 11:23:15,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 10: [2022-11-26 11:23:15,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-26 11:23:15,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 11:23:15,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 4: [2022-11-26 11:23:15,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:23:15,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 11:23:15,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 13: [2022-11-26 11:23:15,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:23:15,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 19: [2022-11-26 11:23:15,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:23:15,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 19: [2022-11-26 11:23:15,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 20: [2022-11-26 11:23:15,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 19: [2022-11-26 11:23:15,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
20: [2022-11-26 11:23:15,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 29: [2022-11-26 11:23:15,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 20: [2022-11-26 11:23:15,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 29: [2022-11-26 11:23:15,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 11:23:15,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 28: [2022-11-26 11:23:15,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 11:23:15,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 11:23:15,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 17: [2022-11-26 11:23:15,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:23:15,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:23:15,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 11:23:15,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
17: [2022-11-26 11:23:15,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 11:23:15,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 14: [2022-11-26 11:23:15,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:23:15,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 18: [2022-11-26 11:23:15,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:23:15,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 14: [2022-11-26 11:23:15,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 11: [2022-11-26 11:23:15,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 22: [2022-11-26 11:23:15,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 18: [2022-11-26 11:23:15,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 14: [2022-11-26 11:23:15,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
22: [2022-11-26 11:23:15,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 18: [2022-11-26 11:23:15,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 22: [2022-11-26 11:23:15,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 12: [2022-11-26 11:23:15,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:23:15,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 11:23:15,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 23: [2022-11-26 11:23:15,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 11:23:15,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 3: [2022-11-26 11:23:15,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 23: [2022-11-26 11:23:15,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 3: [2022-11-26 11:23:15,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 11:23:15,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
9: [2022-11-26 11:23:15,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:23:15,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 11:23:15,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 27: [2022-11-26 11:23:15,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 11:23:15,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 11:23:15,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 30: [2022-11-26 11:23:15,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 11:23:15,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 11:23:15,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 26: [2022-11-26 11:23:15,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 11:23:15,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 11:23:15,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
5: [2022-11-26 11:23:15,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:23:15,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 11:23:15,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 5: [2022-11-26 11:23:15,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:23:15,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 11:23:15,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 14: [2022-11-26 11:23:15,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:23:15,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 11:23:15,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 11: [2022-11-26 11:23:15,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:23:15,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 11:23:15,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
26: [2022-11-26 11:23:15,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 11:23:15,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 11:23:15,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 27: [2022-11-26 11:23:15,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 11:23:15,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 11:23:15,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 5: [2022-11-26 11:23:15,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:23:15,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 11:23:15,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 11: [2022-11-26 11:23:15,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:23:15,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step77000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 11:23:15,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
0: successfully saved checkpoint at iteration 77000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2541.56 31: iteration 77010/ 173500 | consumed samples: 19714560 | consumed tokens: 40375418880 | elapsed time per iteration (s): 1.03 | learning rate: 1.274E-04 | global batch size: 256 | lm loss: 1.977169E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.654 | TFLOPs: 15.04 | 31: iteration 77020/ 173500 | consumed samples: 19717120 | consumed tokens: 40380661760 | elapsed time per iteration (s): 0.84 | learning rate: 1.273E-04 | global batch size: 256 | lm loss: 2.027969E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.971 | TFLOPs: 18.51 | 31: iteration 77030/ 173500 | consumed samples: 19719680 | consumed tokens: 40385904640 | elapsed time per iteration (s): 0.79 | learning rate: 1.273E-04 | global batch size: 256 | lm loss: 2.000851E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.144 | TFLOPs: 19.61 | 31: iteration 77040/ 173500 | consumed samples: 19722240 | consumed tokens: 40391147520 | elapsed time per iteration (s): 0.81 | learning rate: 1.273E-04 | global batch size: 256 | lm loss: 2.020933E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.396 | TFLOPs: 19.20 | 31: iteration 77050/ 173500 | consumed samples: 19724800 | consumed tokens: 40396390400 | elapsed time per iteration (s): 0.83 | learning rate: 1.273E-04 | global batch size: 256 | lm loss: 2.009353E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.341 | TFLOPs: 18.65 | 31: iteration 77060/ 173500 | consumed samples: 19727360 | consumed tokens: 40401633280 | elapsed time per iteration (s): 0.79 | 
learning rate: 1.273E-04 | global batch size: 256 | lm loss: 2.013806E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.965 | TFLOPs: 19.72 | 31: iteration 77070/ 173500 | consumed samples: 19729920 | consumed tokens: 40406876160 | elapsed time per iteration (s): 0.80 | learning rate: 1.273E-04 | global batch size: 256 | lm loss: 1.996053E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.644 | TFLOPs: 19.46 | 31: iteration 77080/ 173500 | consumed samples: 19732480 | consumed tokens: 40412119040 | elapsed time per iteration (s): 0.81 | learning rate: 1.272E-04 | global batch size: 256 | lm loss: 2.022990E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.560 | TFLOPs: 19.15 | 31: iteration 77090/ 173500 | consumed samples: 19735040 | consumed tokens: 40417361920 | elapsed time per iteration (s): 0.80 | learning rate: 1.272E-04 | global batch size: 256 | lm loss: 2.006563E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.708 | TFLOPs: 19.28 | 31: iteration 77100/ 173500 | consumed samples: 19737600 | consumed tokens: 40422604800 | elapsed time per iteration (s): 0.73 | learning rate: 1.272E-04 | global batch size: 256 | lm loss: 2.016403E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.577 | TFLOPs: 21.15 | 31: iteration 77110/ 173500 | consumed samples: 19740160 | consumed tokens: 40427847680 | elapsed time per iteration (s): 0.78 | learning rate: 1.272E-04 | global batch size: 256 | lm loss: 1.989615E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.869 | TFLOPs: 19.77 | 31: iteration 77120/ 
173500 | consumed samples: 19742720 | consumed tokens: 40433090560 | elapsed time per iteration (s): 0.78 | learning rate: 1.272E-04 | global batch size: 256 | lm loss: 2.048901E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.998 | TFLOPs: 19.78 | 31: iteration 77130/ 173500 | consumed samples: 19745280 | consumed tokens: 40438333440 | elapsed time per iteration (s): 0.80 | learning rate: 1.272E-04 | global batch size: 256 | lm loss: 2.000178E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.924 | TFLOPs: 19.42 | 31: iteration 77140/ 173500 | consumed samples: 19747840 | consumed tokens: 40443576320 | elapsed time per iteration (s): 0.77 | learning rate: 1.271E-04 | global batch size: 256 | lm loss: 2.018538E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.472 | TFLOPs: 20.11 | 31: iteration 77150/ 173500 | consumed samples: 19750400 | consumed tokens: 40448819200 | elapsed time per iteration (s): 0.77 | learning rate: 1.271E-04 | global batch size: 256 | lm loss: 1.998578E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.482 | TFLOPs: 20.24 | 31: iteration 77160/ 173500 | consumed samples: 19752960 | consumed tokens: 40454062080 | elapsed time per iteration (s): 0.80 | learning rate: 1.271E-04 | global batch size: 256 | lm loss: 2.012374E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.689 | TFLOPs: 19.28 | 31: iteration 77170/ 173500 | consumed samples: 19755520 | consumed tokens: 40459304960 | elapsed time per iteration (s): 0.78 | learning rate: 1.271E-04 | global batch size: 256 | lm loss: 2.011579E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 328.125 | TFLOPs: 19.85 | 31: iteration 77180/ 173500 | consumed samples: 19758080 | consumed tokens: 40464547840 | elapsed time per iteration (s): 0.85 | learning rate: 1.271E-04 | global batch size: 256 | lm loss: 2.003112E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.168 | TFLOPs: 18.28 | 31: iteration 77190/ 173500 | consumed samples: 19760640 | consumed tokens: 40469790720 | elapsed time per iteration (s): 0.81 | learning rate: 1.271E-04 | global batch size: 256 | lm loss: 1.990712E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.926 | TFLOPs: 19.05 | 31: iteration 77200/ 173500 | consumed samples: 19763200 | consumed tokens: 40475033600 | elapsed time per iteration (s): 0.78 | learning rate: 1.270E-04 | global batch size: 256 | lm loss: 1.994241E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.713 | TFLOPs: 19.83 | 31: iteration 77210/ 173500 | consumed samples: 19765760 | consumed tokens: 40480276480 | elapsed time per iteration (s): 0.81 | learning rate: 1.270E-04 | global batch size: 256 | lm loss: 2.012465E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.602 | TFLOPs: 19.09 | 31: iteration 77220/ 173500 | consumed samples: 19768320 | consumed tokens: 40485519360 | elapsed time per iteration (s): 0.83 | learning rate: 1.270E-04 | global batch size: 256 | lm loss: 2.003000E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.488 | TFLOPs: 18.66 | 31: iteration 77230/ 173500 | consumed samples: 19770880 | consumed tokens: 40490762240 | elapsed time per iteration (s): 0.78 | learning rate: 
1.270E-04 | global batch size: 256 | lm loss: 2.021484E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.287 | TFLOPs: 19.86 | 31: iteration 77240/ 173500 | consumed samples: 19773440 | consumed tokens: 40496005120 | elapsed time per iteration (s): 0.78 | learning rate: 1.270E-04 | global batch size: 256 | lm loss: 2.017061E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.035 | TFLOPs: 19.97 | 31: iteration 77250/ 173500 | consumed samples: 19776000 | consumed tokens: 40501248000 | elapsed time per iteration (s): 0.79 | learning rate: 1.270E-04 | global batch size: 256 | lm loss: 2.032472E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.106 | TFLOPs: 19.67 | 31: iteration 77260/ 173500 | consumed samples: 19778560 | consumed tokens: 40506490880 | elapsed time per iteration (s): 0.81 | learning rate: 1.269E-04 | global batch size: 256 | lm loss: 1.989555E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.834 | TFLOPs: 19.23 | 31: iteration 77270/ 173500 | consumed samples: 19781120 | consumed tokens: 40511733760 | elapsed time per iteration (s): 0.81 | learning rate: 1.269E-04 | global batch size: 256 | lm loss: 2.024630E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.062 | TFLOPs: 19.12 | 31: iteration 77280/ 173500 | consumed samples: 19783680 | consumed tokens: 40516976640 | elapsed time per iteration (s): 0.82 | learning rate: 1.269E-04 | global batch size: 256 | lm loss: 1.997458E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.439 | TFLOPs: 18.84 | 31: iteration 77290/ 173500 | 
consumed samples: 19786240 | consumed tokens: 40522219520 | elapsed time per iteration (s): 0.79 | learning rate: 1.269E-04 | global batch size: 256 | lm loss: 2.043687E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.480 | TFLOPs: 19.51 | 31: iteration 77300/ 173500 | consumed samples: 19788800 | consumed tokens: 40527462400 | elapsed time per iteration (s): 0.84 | learning rate: 1.269E-04 | global batch size: 256 | lm loss: 2.019528E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.375 | TFLOPs: 18.53 | 31: iteration 77310/ 173500 | consumed samples: 19791360 | consumed tokens: 40532705280 | elapsed time per iteration (s): 0.84 | learning rate: 1.269E-04 | global batch size: 256 | lm loss: 1.972805E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.852 | TFLOPs: 18.38 | 31: iteration 77320/ 173500 | consumed samples: 19793920 | consumed tokens: 40537948160 | elapsed time per iteration (s): 0.95 | learning rate: 1.269E-04 | global batch size: 256 | lm loss: 2.032767E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 268.978 | TFLOPs: 16.27 | 31: iteration 77330/ 173500 | consumed samples: 19796480 | consumed tokens: 40543191040 | elapsed time per iteration (s): 0.97 | learning rate: 1.268E-04 | global batch size: 256 | lm loss: 2.028244E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 263.446 | TFLOPs: 15.94 | 31: iteration 77340/ 173500 | consumed samples: 19799040 | consumed tokens: 40548433920 | elapsed time per iteration (s): 0.83 | learning rate: 1.268E-04 | global batch size: 256 | lm loss: 2.025154E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 308.073 | TFLOPs: 18.64 | 31: iteration 77350/ 173500 | consumed samples: 19801600 | consumed tokens: 40553676800 | elapsed time per iteration (s): 0.82 | learning rate: 1.268E-04 | global batch size: 256 | lm loss: 2.002019E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.058 | TFLOPs: 18.94 | 31: iteration 77360/ 173500 | consumed samples: 19804160 | consumed tokens: 40558919680 | elapsed time per iteration (s): 0.82 | learning rate: 1.268E-04 | global batch size: 256 | lm loss: 2.015397E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.352 | TFLOPs: 18.90 | 31: iteration 77370/ 173500 | consumed samples: 19806720 | consumed tokens: 40564162560 | elapsed time per iteration (s): 0.87 | learning rate: 1.268E-04 | global batch size: 256 | lm loss: 2.033237E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.878 | TFLOPs: 17.84 | 31: iteration 77380/ 173500 | consumed samples: 19809280 | consumed tokens: 40569405440 | elapsed time per iteration (s): 0.78 | learning rate: 1.268E-04 | global batch size: 256 | lm loss: 2.009806E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.162 | TFLOPs: 19.79 | 31: iteration 77390/ 173500 | consumed samples: 19811840 | consumed tokens: 40574648320 | elapsed time per iteration (s): 0.82 | learning rate: 1.267E-04 | global batch size: 256 | lm loss: 1.996409E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.423 | TFLOPs: 18.78 | 31: iteration 77400/ 173500 | consumed samples: 19814400 | consumed tokens: 40579891200 | elapsed time per iteration (s): 0.79 | learning rate: 1.267E-04 | global batch 
size: 256 | lm loss: 2.014745E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.259 | TFLOPs: 19.50 | 31: iteration 77410/ 173500 | consumed samples: 19816960 | consumed tokens: 40585134080 | elapsed time per iteration (s): 0.81 | learning rate: 1.267E-04 | global batch size: 256 | lm loss: 1.980522E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.484 | TFLOPs: 19.15 | 31: iteration 77420/ 173500 | consumed samples: 19819520 | consumed tokens: 40590376960 | elapsed time per iteration (s): 0.84 | learning rate: 1.267E-04 | global batch size: 256 | lm loss: 2.015060E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.487 | TFLOPs: 18.54 | 31: iteration 77430/ 173500 | consumed samples: 19822080 | consumed tokens: 40595619840 | elapsed time per iteration (s): 0.80 | learning rate: 1.267E-04 | global batch size: 256 | lm loss: 2.010352E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.769 | TFLOPs: 19.47 | 31: iteration 77440/ 173500 | consumed samples: 19824640 | consumed tokens: 40600862720 | elapsed time per iteration (s): 0.79 | learning rate: 1.267E-04 | global batch size: 256 | lm loss: 2.013745E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.462 | TFLOPs: 19.51 | 31: iteration 77450/ 173500 | consumed samples: 19827200 | consumed tokens: 40606105600 | elapsed time per iteration (s): 0.78 | learning rate: 1.266E-04 | global batch size: 256 | lm loss: 1.966448E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.813 | TFLOPs: 19.77 | 31: iteration 77460/ 173500 | consumed samples: 19829760 | 
consumed tokens: 40611348480 | elapsed time per iteration (s): 0.81 | learning rate: 1.266E-04 | global batch size: 256 | lm loss: 2.008206E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.104 | TFLOPs: 19.18 | 31: iteration 77470/ 173500 | consumed samples: 19832320 | consumed tokens: 40616591360 | elapsed time per iteration (s): 0.86 | learning rate: 1.266E-04 | global batch size: 256 | lm loss: 2.012012E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.363 | TFLOPs: 18.11 | 31: iteration 77480/ 173500 | consumed samples: 19834880 | consumed tokens: 40621834240 | elapsed time per iteration (s): 0.82 | learning rate: 1.266E-04 | global batch size: 256 | lm loss: 2.014993E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.011 | TFLOPs: 18.88 | 31: iteration 77490/ 173500 | consumed samples: 19837440 | consumed tokens: 40627077120 | elapsed time per iteration (s): 0.83 | learning rate: 1.266E-04 | global batch size: 256 | lm loss: 2.029416E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.974 | TFLOPs: 18.69 | 31: iteration 77500/ 173500 | consumed samples: 19840000 | consumed tokens: 40632320000 | elapsed time per iteration (s): 0.85 | learning rate: 1.266E-04 | global batch size: 256 | lm loss: 1.985721E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.173 | TFLOPs: 18.16 | 31: iteration 77510/ 173500 | consumed samples: 19842560 | consumed tokens: 40637562880 | elapsed time per iteration (s): 0.80 | learning rate: 1.265E-04 | global batch size: 256 | lm loss: 2.017867E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 320.578 | TFLOPs: 19.39 | 31: iteration 77520/ 173500 | consumed samples: 19845120 | consumed tokens: 40642805760 | elapsed time per iteration (s): 0.88 | learning rate: 1.265E-04 | global batch size: 256 | lm loss: 2.024868E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.162 | TFLOPs: 17.61 | 31: iteration 77530/ 173500 | consumed samples: 19847680 | consumed tokens: 40648048640 | elapsed time per iteration (s): 0.80 | learning rate: 1.265E-04 | global batch size: 256 | lm loss: 2.009974E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.459 | TFLOPs: 19.33 | 31: iteration 77540/ 173500 | consumed samples: 19850240 | consumed tokens: 40653291520 | elapsed time per iteration (s): 0.82 | learning rate: 1.265E-04 | global batch size: 256 | lm loss: 1.999915E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.993 | TFLOPs: 18.94 | 31: iteration 77550/ 173500 | consumed samples: 19852800 | consumed tokens: 40658534400 | elapsed time per iteration (s): 0.80 | learning rate: 1.265E-04 | global batch size: 256 | lm loss: 2.005345E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.018 | TFLOPs: 19.30 | 31: iteration 77560/ 173500 | consumed samples: 19855360 | consumed tokens: 40663777280 | elapsed time per iteration (s): 0.79 | learning rate: 1.265E-04 | global batch size: 256 | lm loss: 2.007248E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.204 | TFLOPs: 19.61 | 31: iteration 77570/ 173500 | consumed samples: 19857920 | consumed tokens: 40669020160 | elapsed time per iteration (s): 0.81 | learning rate: 1.264E-04 | global batch size: 256 | lm loss: 
2.025984E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.048 | TFLOPs: 19.18 | 31: iteration 77580/ 173500 | consumed samples: 19860480 | consumed tokens: 40674263040 | elapsed time per iteration (s): 0.81 | learning rate: 1.264E-04 | global batch size: 256 | lm loss: 2.035381E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.599 | TFLOPs: 19.03 | 31: iteration 77590/ 173500 | consumed samples: 19863040 | consumed tokens: 40679505920 | elapsed time per iteration (s): 0.81 | learning rate: 1.264E-04 | global batch size: 256 | lm loss: 1.995322E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.219 | TFLOPs: 19.01 | 31: iteration 77600/ 173500 | consumed samples: 19865600 | consumed tokens: 40684748800 | elapsed time per iteration (s): 0.76 | learning rate: 1.264E-04 | global batch size: 256 | lm loss: 2.009917E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.539 | TFLOPs: 20.42 | 31: iteration 77610/ 173500 | consumed samples: 19868160 | consumed tokens: 40689991680 | elapsed time per iteration (s): 0.87 | learning rate: 1.264E-04 | global batch size: 256 | lm loss: 1.998054E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.464 | TFLOPs: 17.87 | 31: iteration 77620/ 173500 | consumed samples: 19870720 | consumed tokens: 40695234560 | elapsed time per iteration (s): 0.75 | learning rate: 1.264E-04 | global batch size: 256 | lm loss: 2.031764E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.138 | TFLOPs: 20.52 | 31: iteration 77630/ 173500 | consumed samples: 19873280 | consumed tokens: 
40700477440 | elapsed time per iteration (s): 0.79 | learning rate: 1.263E-04 | global batch size: 256 | lm loss: 2.031867E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.478 | TFLOPs: 19.69 | 31: iteration 77640/ 173500 | consumed samples: 19875840 | consumed tokens: 40705720320 | elapsed time per iteration (s): 0.80 | learning rate: 1.263E-04 | global batch size: 256 | lm loss: 1.994495E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.342 | TFLOPs: 19.38 | 31: iteration 77650/ 173500 | consumed samples: 19878400 | consumed tokens: 40710963200 | elapsed time per iteration (s): 0.82 | learning rate: 1.263E-04 | global batch size: 256 | lm loss: 2.026834E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.085 | TFLOPs: 19.00 | 31: iteration 77660/ 173500 | consumed samples: 19880960 | consumed tokens: 40716206080 | elapsed time per iteration (s): 0.80 | learning rate: 1.263E-04 | global batch size: 256 | lm loss: 2.012238E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.658 | TFLOPs: 19.40 | 31: iteration 77670/ 173500 | consumed samples: 19883520 | consumed tokens: 40721448960 | elapsed time per iteration (s): 0.78 | learning rate: 1.263E-04 | global batch size: 256 | lm loss: 2.009360E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.989 | TFLOPs: 19.84 | 31: iteration 77680/ 173500 | consumed samples: 19886080 | consumed tokens: 40726691840 | elapsed time per iteration (s): 0.73 | learning rate: 1.263E-04 | global batch size: 256 | lm loss: 1.995222E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 350.041 | TFLOPs: 21.18 | 31: iteration 77690/ 173500 | consumed samples: 19888640 | consumed tokens: 40731934720 | elapsed time per iteration (s): 0.79 | learning rate: 1.263E-04 | global batch size: 256 | lm loss: 1.983534E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.277 | TFLOPs: 19.68 | 31: iteration 77700/ 173500 | consumed samples: 19891200 | consumed tokens: 40737177600 | elapsed time per iteration (s): 0.78 | learning rate: 1.262E-04 | global batch size: 256 | lm loss: 2.019689E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.248 | TFLOPs: 19.86 | 31: iteration 77710/ 173500 | consumed samples: 19893760 | consumed tokens: 40742420480 | elapsed time per iteration (s): 0.78 | learning rate: 1.262E-04 | global batch size: 256 | lm loss: 2.027694E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.200 | TFLOPs: 19.73 | 31: iteration 77720/ 173500 | consumed samples: 19896320 | consumed tokens: 40747663360 | elapsed time per iteration (s): 0.71 | learning rate: 1.262E-04 | global batch size: 256 | lm loss: 2.013783E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.767 | TFLOPs: 21.70 | 31: iteration 77730/ 173500 | consumed samples: 19898880 | consumed tokens: 40752906240 | elapsed time per iteration (s): 0.77 | learning rate: 1.262E-04 | global batch size: 256 | lm loss: 1.987932E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.438 | TFLOPs: 20.23 | 31: iteration 77740/ 173500 | consumed samples: 19901440 | consumed tokens: 40758149120 | elapsed time per iteration (s): 0.81 | learning rate: 1.262E-04 | global batch size: 256 | lm loss: 1.982998E+00 | grad 
norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.615 | TFLOPs: 19.09 | 31: iteration 77750/ 173500 | consumed samples: 19904000 | consumed tokens: 40763392000 | elapsed time per iteration (s): 0.80 | learning rate: 1.262E-04 | global batch size: 256 | lm loss: 1.998277E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.666 | TFLOPs: 19.40 | 31: iteration 77760/ 173500 | consumed samples: 19906560 | consumed tokens: 40768634880 | elapsed time per iteration (s): 0.81 | learning rate: 1.261E-04 | global batch size: 256 | lm loss: 2.004931E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.742 | TFLOPs: 19.04 | 31: iteration 77770/ 173500 | consumed samples: 19909120 | consumed tokens: 40773877760 | elapsed time per iteration (s): 0.85 | learning rate: 1.261E-04 | global batch size: 256 | lm loss: 2.003403E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.855 | TFLOPs: 18.20 | 31: iteration 77780/ 173500 | consumed samples: 19911680 | consumed tokens: 40779120640 | elapsed time per iteration (s): 0.82 | learning rate: 1.261E-04 | global batch size: 256 | lm loss: 2.015850E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.991 | TFLOPs: 18.81 | 31: iteration 77790/ 173500 | consumed samples: 19914240 | consumed tokens: 40784363520 | elapsed time per iteration (s): 0.82 | learning rate: 1.261E-04 | global batch size: 256 | lm loss: 2.006575E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.884 | TFLOPs: 18.87 | 31: iteration 77800/ 173500 | consumed samples: 19916800 | consumed tokens: 40789606400 | elapsed time 
per iteration (s): 0.83 | learning rate: 1.261E-04 | global batch size: 256 | lm loss: 2.012602E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.757 | TFLOPs: 18.56 | 31: iteration 77810/ 173500 | consumed samples: 19919360 | consumed tokens: 40794849280 | elapsed time per iteration (s): 0.81 | learning rate: 1.261E-04 | global batch size: 256 | lm loss: 1.966338E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.380 | TFLOPs: 19.02 | 31: iteration 77820/ 173500 | consumed samples: 19921920 | consumed tokens: 40800092160 | elapsed time per iteration (s): 0.79 | learning rate: 1.260E-04 | global batch size: 256 | lm loss: 2.005459E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.862 | TFLOPs: 19.53 | 31: iteration 77830/ 173500 | consumed samples: 19924480 | consumed tokens: 40805335040 | elapsed time per iteration (s): 0.87 | learning rate: 1.260E-04 | global batch size: 256 | lm loss: 2.012135E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.919 | TFLOPs: 17.78 | 31: iteration 77840/ 173500 | consumed samples: 19927040 | consumed tokens: 40810577920 | elapsed time per iteration (s): 0.86 | learning rate: 1.260E-04 | global batch size: 256 | lm loss: 2.019779E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.857 | TFLOPs: 18.08 | 31: iteration 77850/ 173500 | consumed samples: 19929600 | consumed tokens: 40815820800 | elapsed time per iteration (s): 0.82 | learning rate: 1.260E-04 | global batch size: 256 | lm loss: 2.003867E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.309 | TFLOPs: 
18.95 | 31: iteration 77860/ 173500 | consumed samples: 19932160 | consumed tokens: 40821063680 | elapsed time per iteration (s): 0.80 | learning rate: 1.260E-04 | global batch size: 256 | lm loss: 2.004320E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.814 | TFLOPs: 19.41 | 31: iteration 77870/ 173500 | consumed samples: 19934720 | consumed tokens: 40826306560 | elapsed time per iteration (s): 0.82 | learning rate: 1.260E-04 | global batch size: 256 | lm loss: 2.028072E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.020 | TFLOPs: 19.00 | 31: iteration 77880/ 173500 | consumed samples: 19937280 | consumed tokens: 40831549440 | elapsed time per iteration (s): 0.81 | learning rate: 1.259E-04 | global batch size: 256 | lm loss: 2.005925E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.131 | TFLOPs: 19.00 | 31: iteration 77890/ 173500 | consumed samples: 19939840 | consumed tokens: 40836792320 | elapsed time per iteration (s): 0.83 | learning rate: 1.259E-04 | global batch size: 256 | lm loss: 2.008630E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.847 | TFLOPs: 18.62 | 31: iteration 77900/ 173500 | consumed samples: 19942400 | consumed tokens: 40842035200 | elapsed time per iteration (s): 0.81 | learning rate: 1.259E-04 | global batch size: 256 | lm loss: 2.008775E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.899 | TFLOPs: 19.05 | 31: iteration 77910/ 173500 | consumed samples: 19944960 | consumed tokens: 40847278080 | elapsed time per iteration (s): 0.79 | learning rate: 1.259E-04 | global batch size: 256 | lm loss: 2.009307E+00 | grad norm: 0.132 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.656 | TFLOPs: 19.52 | 31: iteration 77920/ 173500 | consumed samples: 19947520 | consumed tokens: 40852520960 | elapsed time per iteration (s): 0.77 | learning rate: 1.259E-04 | global batch size: 256 | lm loss: 2.000298E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.125 | TFLOPs: 20.21 | 31: iteration 77930/ 173500 | consumed samples: 19950080 | consumed tokens: 40857763840 | elapsed time per iteration (s): 0.83 | learning rate: 1.259E-04 | global batch size: 256 | lm loss: 2.001065E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.247 | TFLOPs: 18.77 | 31: iteration 77940/ 173500 | consumed samples: 19952640 | consumed tokens: 40863006720 | elapsed time per iteration (s): 0.83 | learning rate: 1.258E-04 | global batch size: 256 | lm loss: 1.986194E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.912 | TFLOPs: 18.57 | 31: iteration 77950/ 173500 | consumed samples: 19955200 | consumed tokens: 40868249600 | elapsed time per iteration (s): 0.84 | learning rate: 1.258E-04 | global batch size: 256 | lm loss: 2.021063E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.329 | TFLOPs: 18.47 | 31: iteration 77960/ 173500 | consumed samples: 19957760 | consumed tokens: 40873492480 | elapsed time per iteration (s): 0.87 | learning rate: 1.258E-04 | global batch size: 256 | lm loss: 2.020870E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.370 | TFLOPs: 17.75 | 31: iteration 77970/ 173500 | consumed samples: 19960320 | consumed tokens: 40878735360 | elapsed time per iteration (s): 0.86 | 
learning rate: 1.258E-04 | global batch size: 256 | lm loss: 1.983522E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.464 | TFLOPs: 17.94 | 31: iteration 77980/ 173500 | consumed samples: 19962880 | consumed tokens: 40883978240 | elapsed time per iteration (s): 1.00 | learning rate: 1.258E-04 | global batch size: 256 | lm loss: 2.033678E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 257.265 | TFLOPs: 15.56 | 31: iteration 77990/ 173500 | consumed samples: 19965440 | consumed tokens: 40889221120 | elapsed time per iteration (s): 0.87 | learning rate: 1.258E-04 | global batch size: 256 | lm loss: 2.019977E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.229 | TFLOPs: 17.74 | 0: [2022-11-26 11:36:47,863] [INFO] [logging.py:68:log_dist] [Rank 0] step=78000, skipped=0, lr=[0.00012575030905458257, 0.00012575030905458257, 0.00012575030905458257], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 78000/ 173500 | consumed samples: 19968000 | consumed tokens: 40894464000 | elapsed time per iteration (s): 0.80 | learning rate: 1.258E-04 | global batch size: 256 | lm loss: 2.026544E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.586 | TFLOPs: 19.46 | 0: steps: 78000 loss: 2.0987 iter time (s): 0.826 samples/sec: 309.967 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 78000 | lm loss value: 1.982936E+00 | lm loss PPL: 7.264036E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 78000 to checkpoints_1b1long 0: [2022-11-26 11:36:48,148] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] 
Checkpoint global_step78000 is begin to save! 0: [2022-11-26 11:36:48,157] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_01-model_00-model_states.pt... 0: [2022-11-26 11:36:48,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_01-model_00-model_states.pt. 0: [2022-11-26 11:36:48,383] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_03-model_00-model_states.pt... 0: [2022-11-26 11:36:48,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_03-model_00-model_states.pt. 0: [2022-11-26 11:36:48,464] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_04-model_00-model_states.pt... 0: [2022-11-26 11:36:48,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_04-model_00-model_states.pt. 0: [2022-11-26 11:36:48,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_05-model_00-model_states.pt... 0: [2022-11-26 11:36:48,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_05-model_00-model_states.pt. 0: [2022-11-26 11:36:48,628] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_06-model_00-model_states.pt... 0: [2022-11-26 11:36:48,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_06-model_00-model_states.pt. 0: [2022-11-26 11:36:48,710] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_07-model_00-model_states.pt... 0: [2022-11-26 11:36:48,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 11:36:48,784] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_08-model_00-model_states.pt... 0: [2022-11-26 11:36:48,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_08-model_00-model_states.pt. 0: [2022-11-26 11:36:48,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_09-model_00-model_states.pt... 0: [2022-11-26 11:36:48,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_09-model_00-model_states.pt. 0: [2022-11-26 11:36:48,948] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_10-model_00-model_states.pt... 0: [2022-11-26 11:36:49,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_10-model_00-model_states.pt. 0: [2022-11-26 11:36:49,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_11-model_00-model_states.pt... 0: [2022-11-26 11:36:49,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_11-model_00-model_states.pt. 0: [2022-11-26 11:36:49,109] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_12-model_00-model_states.pt... 0: [2022-11-26 11:36:49,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_12-model_00-model_states.pt. 0: [2022-11-26 11:36:49,188] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_13-model_00-model_states.pt... 0: [2022-11-26 11:36:49,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 11:36:49,266] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_14-model_00-model_states.pt... 0: [2022-11-26 11:36:49,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_14-model_00-model_states.pt. 0: [2022-11-26 11:36:49,345] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_15-model_00-model_states.pt... 0: [2022-11-26 11:36:49,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_15-model_00-model_states.pt. 0: [2022-11-26 11:36:49,424] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_16-model_00-model_states.pt... 0: [2022-11-26 11:36:49,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_16-model_00-model_states.pt. 0: [2022-11-26 11:36:49,503] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_17-model_00-model_states.pt... 0: [2022-11-26 11:36:49,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_17-model_00-model_states.pt. 0: [2022-11-26 11:36:49,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_18-model_00-model_states.pt... 0: [2022-11-26 11:36:49,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_18-model_00-model_states.pt. 0: [2022-11-26 11:36:49,662] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_19-model_00-model_states.pt... 0: [2022-11-26 11:36:49,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 11:36:49,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_20-model_00-model_states.pt... 0: [2022-11-26 11:36:49,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_20-model_00-model_states.pt. 0: [2022-11-26 11:36:49,816] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_21-model_00-model_states.pt... 0: [2022-11-26 11:36:49,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_21-model_00-model_states.pt. 0: [2022-11-26 11:36:49,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_22-model_00-model_states.pt... 0: [2022-11-26 11:36:49,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_22-model_00-model_states.pt. 0: [2022-11-26 11:36:49,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_23-model_00-model_states.pt... 0: [2022-11-26 11:36:50,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_23-model_00-model_states.pt. 0: [2022-11-26 11:36:50,052] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_24-model_00-model_states.pt... 0: [2022-11-26 11:36:50,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_24-model_00-model_states.pt. 0: [2022-11-26 11:36:50,127] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_25-model_00-model_states.pt... 0: [2022-11-26 11:36:50,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 11:36:50,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_26-model_00-model_states.pt... 0: [2022-11-26 11:36:50,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_26-model_00-model_states.pt. 0: [2022-11-26 11:36:50,287] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_27-model_00-model_states.pt... 0: [2022-11-26 11:36:50,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_27-model_00-model_states.pt. 0: [2022-11-26 11:36:50,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_28-model_00-model_states.pt... 0: [2022-11-26 11:36:50,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_28-model_00-model_states.pt. 0: [2022-11-26 11:36:50,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/layer_30-model_00-model_states.pt... 0: [2022-11-26 11:36:50,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/layer_30-model_00-model_states.pt. 0: [2022-11-26 11:36:50,445] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step78000/mp_rank_00_model_states.pt 0: [2022-11-26 11:36:50,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/mp_rank_00_model_states.pt... 0: [2022-11-26 11:36:50,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/mp_rank_00_model_states.pt. 0: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
6: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
4: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
1: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
15: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 
11: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
31: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 
17: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 
19: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
4: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 
16: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
12: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 
23: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 
31: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 
17: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 
19: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
7: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
3: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
11: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 22: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 
18: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
8: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 23: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 
28: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 22: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 
21: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 19: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
12: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 22: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 21: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
12: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 30: [2022-11-26 11:36:50,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 18: [2022-11-26 11:36:50,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 11:36:50,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 11:36:50,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 0: [2022-11-26 11:36:50,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:36:50,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 11:36:50,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 28: [2022-11-26 11:36:50,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 27: [2022-11-26 11:36:50,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 11:36:50,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 11:36:50,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
30: [2022-11-26 11:36:50,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 11:36:50,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 11:36:50,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 14: [2022-11-26 11:36:50,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:36:50,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 11:36:50,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 0: [2022-11-26 11:36:50,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:36:50,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 11:36:50,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 3: [2022-11-26 11:36:50,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:36:50,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 11:36:50,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
9: [2022-11-26 11:36:50,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:36:50,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 11:36:50,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 28: [2022-11-26 11:36:50,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 11:36:50,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 28: [2022-11-26 11:36:50,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 11:36:50,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 11:36:50,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 2: [2022-11-26 11:36:50,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:36:50,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 11: [2022-11-26 11:36:50,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:36:50,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
11: [2022-11-26 11:36:50,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 18: [2022-11-26 11:36:50,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:36:50,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 17: [2022-11-26 11:36:50,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 18: [2022-11-26 11:36:50,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 11:36:50,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 17: [2022-11-26 11:36:50,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 11:36:50,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 30: [2022-11-26 11:36:50,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 11:36:50,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 11:36:50,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 6: [2022-11-26 11:36:50,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 11:36:50,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 11:36:50,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 10: [2022-11-26 11:36:50,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:36:50,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 11:36:50,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 24: [2022-11-26 11:36:50,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 29: [2022-11-26 11:36:50,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 11:36:50,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 11:36:50,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 24: [2022-11-26 11:36:50,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 3: [2022-11-26 11:36:50,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 24: [2022-11-26 11:36:50,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
3: [2022-11-26 11:36:50,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 11:36:50,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 28: [2022-11-26 11:36:50,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:36:50,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 17: [2022-11-26 11:36:50,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:36:50,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 11:36:50,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 17: [2022-11-26 11:36:50,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 11:36:50,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 14: [2022-11-26 11:36:50,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:36:50,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 11:36:50,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
27: [2022-11-26 11:36:50,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 11:36:50,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 11:36:50,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 16: [2022-11-26 11:36:50,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 11:36:50,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 11:36:50,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 7: [2022-11-26 11:36:50,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:36:50,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 11:36:50,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 29: [2022-11-26 11:36:50,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:36:50,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
29: [2022-11-26 11:36:50,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 11: [2022-11-26 11:36:50,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 29: [2022-11-26 11:36:50,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 11: [2022-11-26 11:36:50,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 18: [2022-11-26 11:36:50,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 11:36:50,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 11:36:50,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 24: [2022-11-26 11:36:50,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:36:50,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:36:50,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 24: [2022-11-26 11:36:50,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 11:36:50,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
2: [2022-11-26 11:36:50,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 12: [2022-11-26 11:36:50,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:36:50,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 11:36:50,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 0: [2022-11-26 11:36:50,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:36:50,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:36:50,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 10: [2022-11-26 11:36:50,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:36:50,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 0: [2022-11-26 11:36:50,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 10: [2022-11-26 11:36:50,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 12: [2022-11-26 11:36:50,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
10: [2022-11-26 11:36:50,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 19: [2022-11-26 11:36:50,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 11:36:50,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 11:36:50,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 11:36:50,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 11:36:50,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 19: [2022-11-26 11:36:50,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 31: [2022-11-26 11:36:50,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 11:36:50,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 11:36:50,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 29: [2022-11-26 11:36:50,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 31: [2022-11-26 11:36:50,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 
29: [2022-11-26 11:36:50,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 31: [2022-11-26 11:36:50,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 29: [2022-11-26 11:36:50,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 31: [2022-11-26 11:36:50,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 11:36:50,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 31: [2022-11-26 11:36:50,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 11:36:50,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 11: [2022-11-26 11:36:50,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:36:50,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 11:36:50,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 24: [2022-11-26 11:36:50,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-26 11:36:50,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 11:36:50,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 9: [2022-11-26 11:36:50,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:36:50,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 11:36:50,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 8: [2022-11-26 11:36:50,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 18: [2022-11-26 11:36:50,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:36:50,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 18: [2022-11-26 11:36:50,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 8: [2022-11-26 11:36:50,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 18: [2022-11-26 11:36:50,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 21: [2022-11-26 11:36:50,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-26 11:36:50,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 11:36:50,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 27: [2022-11-26 11:36:50,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:36:50,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 27: [2022-11-26 11:36:50,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 11:36:50,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 8: [2022-11-26 11:36:50,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 11:36:50,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 14: [2022-11-26 11:36:50,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:36:50,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 11:36:50,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 30: [2022-11-26 11:36:50,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
30: [2022-11-26 11:36:50,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 11:36:50,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 19: [2022-11-26 11:36:50,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 11:36:50,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 11:36:50,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 16: [2022-11-26 11:36:50,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 21: [2022-11-26 11:36:50,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 16: [2022-11-26 11:36:50,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 21: [2022-11-26 11:36:50,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 16: [2022-11-26 11:36:50,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 21: [2022-11-26 11:36:50,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 12: [2022-11-26 11:36:50,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 11:36:50,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 11:36:50,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 17: [2022-11-26 11:36:50,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 11:36:50,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 11:36:50,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 1: [2022-11-26 11:36:50,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:36:50,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 11:36:50,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 0: [2022-11-26 11:36:50,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:36:50,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-26 11:36:50,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 0: [2022-11-26 11:36:50,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 11: [2022-11-26 11:36:50,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 29: [2022-11-26 11:36:50,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:36:50,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 29: [2022-11-26 11:36:50,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 0: [2022-11-26 11:36:50,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 2: [2022-11-26 11:36:50,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 29: [2022-11-26 11:36:50,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 2: [2022-11-26 11:36:50,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 20: [2022-11-26 11:36:50,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
20: [2022-11-26 11:36:50,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 30: [2022-11-26 11:36:50,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 20: [2022-11-26 11:36:50,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 30: [2022-11-26 11:36:50,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 11:36:50,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 24: [2022-11-26 11:36:50,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 11:36:50,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 11:36:50,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 10: [2022-11-26 11:36:50,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:36:50,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 11:36:50,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 3: [2022-11-26 11:36:50,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
14: [2022-11-26 11:36:50,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:36:50,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 20: [2022-11-26 11:36:50,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:36:50,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 3: [2022-11-26 11:36:50,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 20: [2022-11-26 11:36:50,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 14: [2022-11-26 11:36:50,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 20: [2022-11-26 11:36:50,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 9: [2022-11-26 11:36:50,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:36:50,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 11:36:50,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 3: [2022-11-26 11:36:50,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 11:36:50,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 12: [2022-11-26 11:36:50,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:36:50,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 12: [2022-11-26 11:36:50,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 11:36:50,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 28: [2022-11-26 11:36:50,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 11:36:50,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 28: [2022-11-26 11:36:50,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 11:36:50,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 11:36:50,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 19: [2022-11-26 11:36:50,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
19: [2022-11-26 11:36:50,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 11:36:50,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 7: [2022-11-26 11:36:50,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:36:50,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 11:36:50,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 6: [2022-11-26 11:36:50,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:36:50,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 11:36:50,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 1: [2022-11-26 11:36:50,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:36:50,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 11:36:50,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 6: [2022-11-26 11:36:50,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
8: [2022-11-26 11:36:50,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:36:50,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 8: [2022-11-26 11:36:50,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 6: [2022-11-26 11:36:50,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 8: [2022-11-26 11:36:50,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 10: [2022-11-26 11:36:50,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:36:50,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 11:36:50,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 7: [2022-11-26 11:36:50,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:36:50,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 11:36:50,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 11:36:50,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 11:36:50,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 7: [2022-11-26 11:36:50,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 16: [2022-11-26 11:36:50,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 11:36:50,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 11:36:50,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 17: [2022-11-26 11:36:50,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:36:50,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 17: [2022-11-26 11:36:50,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 1: [2022-11-26 11:36:50,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 17: [2022-11-26 11:36:50,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
1: [2022-11-26 11:36:50,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 21: [2022-11-26 11:36:50,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 11:36:50,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 11:36:50,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 11:36:50,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 11:36:50,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 21: [2022-11-26 11:36:50,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 23: [2022-11-26 11:36:50,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 11:36:50,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 11:36:50,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 11:36:50,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
23: [2022-11-26 11:36:50,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 8: [2022-11-26 11:36:50,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 23: [2022-11-26 11:36:50,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 23: [2022-11-26 11:36:50,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 31: [2022-11-26 11:36:50,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:36:50,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 23: [2022-11-26 11:36:50,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 8: [2022-11-26 11:36:50,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 23: [2022-11-26 11:36:50,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 31: [2022-11-26 11:36:50,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 11:36:50,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 20: [2022-11-26 11:36:50,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 
20: [2022-11-26 11:36:50,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 11:36:50,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 27: [2022-11-26 11:36:50,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 11:36:50,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 11:36:50,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 0: [2022-11-26 11:36:50,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:36:50,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 11:36:50,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 22: [2022-11-26 11:36:50,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 11:36:50,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 11:36:50,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
22: [2022-11-26 11:36:50,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 11:36:50,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 11:36:50,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 11:36:50,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 11:36:50,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 22: [2022-11-26 11:36:50,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 22: [2022-11-26 11:36:50,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 22: [2022-11-26 11:36:50,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 11:36:50,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 4: [2022-11-26 11:36:50,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:36:50,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 11:36:50,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 11:36:50,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 11:36:50,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 4: [2022-11-26 11:36:50,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 4: [2022-11-26 11:36:50,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:36:50,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 11:36:50,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 15: [2022-11-26 11:36:50,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:36:50,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 11:36:50,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 15: [2022-11-26 11:36:50,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:36:50,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 11:36:50,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 11:36:50,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 11:36:50,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 15: [2022-11-26 11:36:50,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 5: [2022-11-26 11:36:50,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:36:50,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:36:50,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:36:50,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 11:36:50,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 11:36:50,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 11:36:50,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 11:36:50,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 11:36:50,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 5: [2022-11-26 11:36:50,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 5: [2022-11-26 11:36:50,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 5: [2022-11-26 11:36:50,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 9: [2022-11-26 11:36:50,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:36:50,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 11:36:50,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 2: [2022-11-26 11:36:50,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 11:36:50,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 11:36:50,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 25: [2022-11-26 11:36:50,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 11:36:50,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 11:36:50,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 11:36:50,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 11:36:50,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 11:36:50,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 11:36:50,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 11:36:50,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 11:36:50,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
25: [2022-11-26 11:36:50,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 25: [2022-11-26 11:36:50,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 25: [2022-11-26 11:36:50,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 6: [2022-11-26 11:36:50,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:36:50,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 11:36:50,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 13: [2022-11-26 11:36:50,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:36:50,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:36:50,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:36:50,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 11:36:50,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 11:36:50,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 11:36:50,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 11:36:50,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 13: [2022-11-26 11:36:50,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 13: [2022-11-26 11:36:50,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 13: [2022-11-26 11:36:50,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 11:36:50,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 19: [2022-11-26 11:36:50,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 11:36:50,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 11:36:50,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 5: [2022-11-26 11:36:50,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 11:36:50,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 11:36:50,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 18: [2022-11-26 11:36:50,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 11:36:50,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 11:36:50,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 26: [2022-11-26 11:36:50,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 11:36:50,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 11:36:50,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 11:36:50,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 11:36:50,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
26: [2022-11-26 11:36:50,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 11:36:50,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 11:36:50,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 11:36:50,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 26: [2022-11-26 11:36:50,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 11:36:50,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 11:36:50,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 26: [2022-11-26 11:36:50,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 26: [2022-11-26 11:36:50,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 26: [2022-11-26 11:36:50,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 28: [2022-11-26 11:36:50,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-26 11:36:50,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 11:36:50,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 10: [2022-11-26 11:36:50,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:36:50,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 11:36:50,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 29: [2022-11-26 11:36:50,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 11:36:50,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 11:36:50,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 11: [2022-11-26 11:36:50,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:36:50,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 11:36:50,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 17: [2022-11-26 11:36:50,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-26 11:36:50,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 11:36:50,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 21: [2022-11-26 11:36:50,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 11:36:50,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 11:36:50,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 30: [2022-11-26 11:36:50,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 11:36:50,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 11:36:50,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 14: [2022-11-26 11:36:50,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:36:50,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 11:36:50,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 7: [2022-11-26 11:36:50,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 11:36:50,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 11:36:50,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 12: [2022-11-26 11:36:50,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:36:50,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 11:36:50,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 25: [2022-11-26 11:36:50,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 11:36:50,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 11:36:50,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 16: [2022-11-26 11:36:50,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 11:36:50,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 11:36:50,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 24: [2022-11-26 11:36:50,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
24: [2022-11-26 11:36:50,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 11:36:50,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 3: [2022-11-26 11:36:50,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:36:50,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 11:36:50,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 0: [2022-11-26 11:36:50,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:36:50,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 11:36:50,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 9: [2022-11-26 11:36:50,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:36:50,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 11:36:50,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 2: [2022-11-26 11:36:50,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-26 11:36:50,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 11:36:50,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 8: [2022-11-26 11:36:50,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:36:50,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 11:36:50,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 27: [2022-11-26 11:36:50,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 11:36:50,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 11:36:50,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 15: [2022-11-26 11:36:50,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:36:50,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 11:36:50,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 22: [2022-11-26 11:36:50,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-26 11:36:50,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 11:36:50,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 1: [2022-11-26 11:36:50,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:36:50,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 11:36:50,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 13: [2022-11-26 11:36:50,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:36:50,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 11:36:50,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 4: [2022-11-26 11:36:50,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:36:50,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 11:36:50,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 23: [2022-11-26 11:36:50,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-26 11:36:50,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 11:36:50,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 5: [2022-11-26 11:36:50,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:36:50,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 11:36:50,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 20: [2022-11-26 11:36:50,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 11:36:50,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 11:36:50,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 6: [2022-11-26 11:36:50,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:36:50,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 11:36:50,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 31: [2022-11-26 11:36:50,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-26 11:36:50,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 11:36:50,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 19: [2022-11-26 11:36:50,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 11:36:50,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 11:36:50,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 11: [2022-11-26 11:36:50,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:36:50,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 11:36:50,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 26: [2022-11-26 11:36:50,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 11:36:50,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 11:36:50,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 18: [2022-11-26 11:36:50,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
18: [2022-11-26 11:36:50,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 11:36:50,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 28: [2022-11-26 11:36:50,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:36:50,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:36:50,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 11:36:50,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 28: [2022-11-26 11:36:50,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 11:36:50,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 9: [2022-11-26 11:36:50,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:36:50,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 11:36:50,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 29: [2022-11-26 11:36:50,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-26 11:36:50,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 11:36:50,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 25: [2022-11-26 11:36:50,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 11:36:50,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 11:36:50,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 17: [2022-11-26 11:36:50,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 11:36:50,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 11:36:50,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 30: [2022-11-26 11:36:50,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 11:36:50,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 11:36:50,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 7: [2022-11-26 11:36:50,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 11:36:50,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 11:36:50,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 21: [2022-11-26 11:36:50,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 11:36:50,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 11:36:50,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 14: [2022-11-26 11:36:50,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:36:50,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 11:36:50,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 0: [2022-11-26 11:36:50,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 24: [2022-11-26 11:36:50,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 11:36:50,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 11:36:50,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
31: [2022-11-26 11:36:50,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 11:36:50,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 11:36:50,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 16: [2022-11-26 11:36:50,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 11:36:50,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 11:36:50,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 8: [2022-11-26 11:36:50,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:36:50,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 11:36:50,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 3: [2022-11-26 11:36:50,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:36:50,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 22: [2022-11-26 11:36:50,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
3: [2022-11-26 11:36:50,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 22: [2022-11-26 11:36:50,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 11:36:50,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 2: [2022-11-26 11:36:50,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:36:50,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 11:36:50,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 27: [2022-11-26 11:36:50,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 11:36:50,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 11:36:50,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 15: [2022-11-26 11:36:50,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:36:50,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 11:36:50,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
5: [2022-11-26 11:36:50,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:36:50,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 11:36:50,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 13: [2022-11-26 11:36:50,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:36:50,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 11:36:50,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 4: [2022-11-26 11:36:50,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:36:50,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 11:36:50,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 20: [2022-11-26 11:36:50,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 11:36:50,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 11:36:50,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
0: [2022-11-26 11:36:50,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 11:36:50,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 12: [2022-11-26 11:36:50,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:36:50,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 11:36:50,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 23: [2022-11-26 11:36:50,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 11:36:50,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 11:36:50,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 6: [2022-11-26 11:36:50,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:36:50,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 19: [2022-11-26 11:36:50,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:36:50,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
19: [2022-11-26 11:36:50,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 11:36:50,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 11: [2022-11-26 11:36:50,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:36:50,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 11:36:50,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 1: [2022-11-26 11:36:50,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:36:50,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 11:36:50,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 17: [2022-11-26 11:36:50,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 11:36:50,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 11:36:50,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 28: [2022-11-26 11:36:50,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
28: [2022-11-26 11:36:50,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 11:36:50,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 29: [2022-11-26 11:36:50,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 11:36:50,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 11:36:50,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 21: [2022-11-26 11:36:50,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:36:50,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 21: [2022-11-26 11:36:50,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 11:36:50,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 10: [2022-11-26 11:36:50,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 11:36:50,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 12: [2022-11-26 11:36:50,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
14: [2022-11-26 11:36:50,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:36:50,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 11:36:50,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 14: [2022-11-26 11:36:50,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 11:36:50,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 7: [2022-11-26 11:36:50,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 26: [2022-11-26 11:36:50,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 11:36:50,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 7: [2022-11-26 11:36:50,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 11:36:50,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 26: [2022-11-26 11:36:50,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 25: [2022-11-26 11:36:50,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
25: [2022-11-26 11:36:50,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 11:36:50,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 9: [2022-11-26 11:36:50,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:36:50,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 11:36:50,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 0: [2022-11-26 11:36:50,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:36:50,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 11:36:50,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 24: [2022-11-26 11:36:50,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 11:36:50,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 18: [2022-11-26 11:36:50,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 24: [2022-11-26 11:36:50,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
18: [2022-11-26 11:36:50,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 11:36:50,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 30: [2022-11-26 11:36:50,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 11:36:50,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 11:36:50,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 3: [2022-11-26 11:36:50,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:36:50,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 11:36:50,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 16: [2022-11-26 11:36:50,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 11:36:50,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 11:36:50,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 22: [2022-11-26 11:36:50,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
22: [2022-11-26 11:36:50,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 11:36:50,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 20: [2022-11-26 11:36:50,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 27: [2022-11-26 11:36:50,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:36:50,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 20: [2022-11-26 11:36:50,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 11:36:50,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 8: [2022-11-26 11:36:50,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 11:36:50,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 27: [2022-11-26 11:36:50,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 11:36:50,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 5: [2022-11-26 11:36:50,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 11:36:50,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 11:36:50,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 30: [2022-11-26 11:36:50,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 11:36:50,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 11:36:50,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 1: [2022-11-26 11:36:50,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 19: [2022-11-26 11:36:50,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 11:36:50,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 1: [2022-11-26 11:36:50,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 19: [2022-11-26 11:36:50,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 7: [2022-11-26 11:36:50,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:36:50,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
2: [2022-11-26 11:36:50,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:36:50,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 11:36:50,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 7: [2022-11-26 11:36:50,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 11:36:50,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 13: [2022-11-26 11:36:50,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:36:50,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:36:50,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 11:36:50,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 13: [2022-11-26 11:36:50,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 11:36:50,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 18: [2022-11-26 11:36:50,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
26: [2022-11-26 11:36:50,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 11:36:50,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 18: [2022-11-26 11:36:50,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 26: [2022-11-26 11:36:50,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 18: [2022-11-26 11:36:50,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 17: [2022-11-26 11:36:50,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:36:50,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:36:50,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 11:36:50,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 17: [2022-11-26 11:36:50,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 16: [2022-11-26 11:36:50,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 17: [2022-11-26 11:36:50,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
16: [2022-11-26 11:36:50,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 11:36:50,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 14: [2022-11-26 11:36:50,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:36:50,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 11:36:50,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 29: [2022-11-26 11:36:50,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 23: [2022-11-26 11:36:50,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 11:36:50,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 11:36:50,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 29: [2022-11-26 11:36:50,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 11:36:50,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 31: [2022-11-26 11:36:50,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 
31: [2022-11-26 11:36:50,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 11:36:50,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 10: [2022-11-26 11:36:50,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:36:50,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 11:36:50,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 15: [2022-11-26 11:36:50,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:36:50,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:36:50,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 11:36:50,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 22: [2022-11-26 11:36:50,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:36:50,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
22: [2022-11-26 11:36:50,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 15: [2022-11-26 11:36:50,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 22: [2022-11-26 11:36:50,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 28: [2022-11-26 11:36:50,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:36:50,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:36:50,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 11:36:50,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 21: [2022-11-26 11:36:50,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 11:36:50,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 11:36:50,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 13: [2022-11-26 11:36:50,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 11:36:50,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 6: [2022-11-26 11:36:50,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:36:50,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 6: [2022-11-26 11:36:50,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 11:36:50,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 11: [2022-11-26 11:36:50,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:36:50,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 11:36:50,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 25: [2022-11-26 11:36:50,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 11:36:50,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 11:36:50,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
28: [2022-11-26 11:36:50,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 11:36:50,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 24: [2022-11-26 11:36:50,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 11:36:50,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 11:36:50,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 2: [2022-11-26 11:36:50,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:36:50,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 11:36:50,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 27: [2022-11-26 11:36:50,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 11:36:50,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 11:36:50,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 20: [2022-11-26 11:36:50,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-26 11:36:50,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 11:36:50,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 1: [2022-11-26 11:36:50,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:36:50,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:36:50,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 4: [2022-11-26 11:36:50,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 1: [2022-11-26 11:36:50,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 4: [2022-11-26 11:36:50,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 16: [2022-11-26 11:36:50,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 11:36:50,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 8: [2022-11-26 11:36:50,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 16: [2022-11-26 11:36:50,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
8: [2022-11-26 11:36:50,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 11:36:50,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 23: [2022-11-26 11:36:50,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 11:36:50,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 11:36:50,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 11:36:50,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 11:36:50,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 23: [2022-11-26 11:36:50,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 31: [2022-11-26 11:36:50,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 11:36:50,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 11:36:50,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 20: [2022-11-26 11:36:50,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
20: [2022-11-26 11:36:50,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 11:36:50,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 1: [2022-11-26 11:36:50,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:36:50,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:36:50,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 11:36:50,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 1: [2022-11-26 11:36:50,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 11:36:50,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 4: [2022-11-26 11:36:50,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:36:50,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 11:36:50,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 12: [2022-11-26 11:36:50,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 11:36:50,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step78000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 11:36:50,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 0: successfully saved checkpoint at iteration 78000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2615.13 31: iteration 78010/ 173500 | consumed samples: 19970560 | consumed tokens: 40899706880 | elapsed time per iteration (s): 1.13 | learning rate: 1.257E-04 | global batch size: 256 | lm loss: 1.989638E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 226.964 | TFLOPs: 13.73 | 31: iteration 78020/ 173500 | consumed samples: 19973120 | consumed tokens: 40904949760 | elapsed time per iteration (s): 0.78 | learning rate: 1.257E-04 | global batch size: 256 | lm loss: 2.010253E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.938 | TFLOPs: 19.84 | 31: iteration 78030/ 173500 | consumed samples: 19975680 | consumed tokens: 40910192640 | elapsed time per iteration (s): 0.79 | learning rate: 1.257E-04 | global batch size: 256 | lm loss: 2.016126E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.669 | TFLOPs: 19.64 | 31: iteration 78040/ 173500 | consumed samples: 19978240 | consumed tokens: 40915435520 | elapsed time per iteration (s): 0.78 | learning rate: 1.257E-04 | global batch size: 256 | lm loss: 2.004523E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.689 | TFLOPs: 19.88 | 31: iteration 78050/ 173500 | consumed samples: 19980800 | consumed tokens: 40920678400 | elapsed time per iteration (s): 0.79 | learning rate: 1.257E-04 | global 
batch size: 256 | lm loss: 2.001257E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.592 | TFLOPs: 19.58 | 31: iteration 78060/ 173500 | consumed samples: 19983360 | consumed tokens: 40925921280 | elapsed time per iteration (s): 0.80 | learning rate: 1.257E-04 | global batch size: 256 | lm loss: 2.007217E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.032 | TFLOPs: 19.36 | 31: iteration 78070/ 173500 | consumed samples: 19985920 | consumed tokens: 40931164160 | elapsed time per iteration (s): 0.78 | learning rate: 1.256E-04 | global batch size: 256 | lm loss: 2.022983E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.071 | TFLOPs: 19.79 | 31: iteration 78080/ 173500 | consumed samples: 19988480 | consumed tokens: 40936407040 | elapsed time per iteration (s): 0.81 | learning rate: 1.256E-04 | global batch size: 256 | lm loss: 2.023004E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.871 | TFLOPs: 19.11 | 31: iteration 78090/ 173500 | consumed samples: 19991040 | consumed tokens: 40941649920 | elapsed time per iteration (s): 0.78 | learning rate: 1.256E-04 | global batch size: 256 | lm loss: 2.012722E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.801 | TFLOPs: 19.95 | 31: iteration 78100/ 173500 | consumed samples: 19993600 | consumed tokens: 40946892800 | elapsed time per iteration (s): 0.72 | learning rate: 1.256E-04 | global batch size: 256 | lm loss: 2.002364E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.509 | TFLOPs: 21.63 | 31: iteration 78110/ 173500 | consumed samples: 19996160 
| consumed tokens: 40952135680 | elapsed time per iteration (s): 0.80 | learning rate: 1.256E-04 | global batch size: 256 | lm loss: 2.009005E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.218 | TFLOPs: 19.25 | 31: iteration 78120/ 173500 | consumed samples: 19998720 | consumed tokens: 40957378560 | elapsed time per iteration (s): 0.80 | learning rate: 1.256E-04 | global batch size: 256 | lm loss: 2.019135E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.085 | TFLOPs: 19.30 | 31: iteration 78130/ 173500 | consumed samples: 20001280 | consumed tokens: 40962621440 | elapsed time per iteration (s): 0.80 | learning rate: 1.255E-04 | global batch size: 256 | lm loss: 2.010855E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.511 | TFLOPs: 19.45 | 31: iteration 78140/ 173500 | consumed samples: 20003840 | consumed tokens: 40967864320 | elapsed time per iteration (s): 0.76 | learning rate: 1.255E-04 | global batch size: 256 | lm loss: 2.012390E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.711 | TFLOPs: 20.43 | 31: iteration 78150/ 173500 | consumed samples: 20006400 | consumed tokens: 40973107200 | elapsed time per iteration (s): 0.84 | learning rate: 1.255E-04 | global batch size: 256 | lm loss: 1.974015E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.244 | TFLOPs: 18.35 | 31: iteration 78160/ 173500 | consumed samples: 20008960 | consumed tokens: 40978350080 | elapsed time per iteration (s): 0.80 | learning rate: 1.255E-04 | global batch size: 256 | lm loss: 2.014326E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 
0 | samples per second: 319.981 | TFLOPs: 19.36 | 31: iteration 78170/ 173500 | consumed samples: 20011520 | consumed tokens: 40983592960 | elapsed time per iteration (s): 0.83 | learning rate: 1.255E-04 | global batch size: 256 | lm loss: 1.988365E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.280 | TFLOPs: 18.71 | 31: iteration 78180/ 173500 | consumed samples: 20014080 | consumed tokens: 40988835840 | elapsed time per iteration (s): 0.81 | learning rate: 1.255E-04 | global batch size: 256 | lm loss: 1.992249E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.007 | TFLOPs: 19.18 | 31: iteration 78190/ 173500 | consumed samples: 20016640 | consumed tokens: 40994078720 | elapsed time per iteration (s): 0.75 | learning rate: 1.254E-04 | global batch size: 256 | lm loss: 2.009424E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.606 | TFLOPs: 20.61 | 31: iteration 78200/ 173500 | consumed samples: 20019200 | consumed tokens: 40999321600 | elapsed time per iteration (s): 0.76 | learning rate: 1.254E-04 | global batch size: 256 | lm loss: 1.997745E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.534 | TFLOPs: 20.36 | 31: iteration 78210/ 173500 | consumed samples: 20021760 | consumed tokens: 41004564480 | elapsed time per iteration (s): 0.81 | learning rate: 1.254E-04 | global batch size: 256 | lm loss: 2.006659E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.749 | TFLOPs: 19.22 | 31: iteration 78220/ 173500 | consumed samples: 20024320 | consumed tokens: 41009807360 | elapsed time per iteration (s): 0.76 | learning rate: 1.254E-04 | global batch size: 256 | lm loss: 
1.978086E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.584 | TFLOPs: 20.42 | 31: iteration 78230/ 173500 | consumed samples: 20026880 | consumed tokens: 41015050240 | elapsed time per iteration (s): 0.81 | learning rate: 1.254E-04 | global batch size: 256 | lm loss: 1.987912E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.939 | TFLOPs: 19.23 | 31: iteration 78240/ 173500 | consumed samples: 20029440 | consumed tokens: 41020293120 | elapsed time per iteration (s): 0.82 | learning rate: 1.254E-04 | global batch size: 256 | lm loss: 2.021792E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.441 | TFLOPs: 18.78 | 31: iteration 78250/ 173500 | consumed samples: 20032000 | consumed tokens: 41025536000 | elapsed time per iteration (s): 0.89 | learning rate: 1.253E-04 | global batch size: 256 | lm loss: 2.019066E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.635 | TFLOPs: 17.46 | 31: iteration 78260/ 173500 | consumed samples: 20034560 | consumed tokens: 41030778880 | elapsed time per iteration (s): 0.81 | learning rate: 1.253E-04 | global batch size: 256 | lm loss: 2.021884E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.209 | TFLOPs: 19.13 | 31: iteration 78270/ 173500 | consumed samples: 20037120 | consumed tokens: 41036021760 | elapsed time per iteration (s): 0.81 | learning rate: 1.253E-04 | global batch size: 256 | lm loss: 2.017278E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.610 | TFLOPs: 19.03 | 31: iteration 78280/ 173500 | consumed samples: 20039680 | consumed tokens: 
41041264640 | elapsed time per iteration (s): 58.95 | learning rate: 1.253E-04 | global batch size: 256 | lm loss: 2.000011E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 4.343 | TFLOPs: 0.26 | 31: iteration 78290/ 173500 | consumed samples: 20042240 | consumed tokens: 41046507520 | elapsed time per iteration (s): 0.79 | learning rate: 1.253E-04 | global batch size: 256 | lm loss: 1.979553E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.605 | TFLOPs: 19.52 | 31: iteration 78300/ 173500 | consumed samples: 20044800 | consumed tokens: 41051750400 | elapsed time per iteration (s): 0.73 | learning rate: 1.253E-04 | global batch size: 256 | lm loss: 2.023917E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.688 | TFLOPs: 21.22 | 31: iteration 78310/ 173500 | consumed samples: 20047360 | consumed tokens: 41056993280 | elapsed time per iteration (s): 0.77 | learning rate: 1.252E-04 | global batch size: 256 | lm loss: 2.011375E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.258 | TFLOPs: 20.04 | 31: iteration 78320/ 173500 | consumed samples: 20049920 | consumed tokens: 41062236160 | elapsed time per iteration (s): 0.76 | learning rate: 1.252E-04 | global batch size: 256 | lm loss: 2.008866E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.379 | TFLOPs: 20.29 | 31: iteration 78330/ 173500 | consumed samples: 20052480 | consumed tokens: 41067479040 | elapsed time per iteration (s): 0.77 | learning rate: 1.252E-04 | global batch size: 256 | lm loss: 2.005987E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 330.682 | TFLOPs: 20.01 | 31: iteration 78340/ 173500 | consumed samples: 20055040 | consumed tokens: 41072721920 | elapsed time per iteration (s): 0.77 | learning rate: 1.252E-04 | global batch size: 256 | lm loss: 2.011318E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.541 | TFLOPs: 20.18 | 31: iteration 78350/ 173500 | consumed samples: 20057600 | consumed tokens: 41077964800 | elapsed time per iteration (s): 2.94 | learning rate: 1.252E-04 | global batch size: 256 | lm loss: 1.996615E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 87.108 | TFLOPs: 5.27 | 31: iteration 78360/ 173500 | consumed samples: 20060160 | consumed tokens: 41083207680 | elapsed time per iteration (s): 1.03 | learning rate: 1.252E-04 | global batch size: 256 | lm loss: 1.995517E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.691 | TFLOPs: 15.05 | 31: iteration 78370/ 173500 | consumed samples: 20062720 | consumed tokens: 41088450560 | elapsed time per iteration (s): 0.77 | learning rate: 1.252E-04 | global batch size: 256 | lm loss: 2.009379E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.327 | TFLOPs: 20.04 | 31: iteration 78380/ 173500 | consumed samples: 20065280 | consumed tokens: 41093693440 | elapsed time per iteration (s): 0.75 | learning rate: 1.251E-04 | global batch size: 256 | lm loss: 2.028842E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.279 | TFLOPs: 20.59 | 31: iteration 78390/ 173500 | consumed samples: 20067840 | consumed tokens: 41098936320 | elapsed time per iteration (s): 0.74 | learning rate: 1.251E-04 | global batch size: 256 | lm loss: 1.969839E+00 | grad 
norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.660 | TFLOPs: 20.79 | 31: iteration 78400/ 173500 | consumed samples: 20070400 | consumed tokens: 41104179200 | elapsed time per iteration (s): 0.73 | learning rate: 1.251E-04 | global batch size: 256 | lm loss: 2.007214E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.602 | TFLOPs: 21.21 | 31: iteration 78410/ 173500 | consumed samples: 20072960 | consumed tokens: 41109422080 | elapsed time per iteration (s): 0.77 | learning rate: 1.251E-04 | global batch size: 256 | lm loss: 2.011921E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.881 | TFLOPs: 20.08 | 31: iteration 78420/ 173500 | consumed samples: 20075520 | consumed tokens: 41114664960 | elapsed time per iteration (s): 0.81 | learning rate: 1.251E-04 | global batch size: 256 | lm loss: 2.019090E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.173 | TFLOPs: 19.07 | 31: iteration 78430/ 173500 | consumed samples: 20078080 | consumed tokens: 41119907840 | elapsed time per iteration (s): 0.79 | learning rate: 1.251E-04 | global batch size: 256 | lm loss: 2.019555E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.766 | TFLOPs: 19.71 | 31: iteration 78440/ 173500 | consumed samples: 20080640 | consumed tokens: 41125150720 | elapsed time per iteration (s): 0.79 | learning rate: 1.250E-04 | global batch size: 256 | lm loss: 2.004835E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.487 | TFLOPs: 19.57 | 31: iteration 78450/ 173500 | consumed samples: 20083200 | consumed tokens: 41130393600 | elapsed time 
per iteration (s): 0.78 | learning rate: 1.250E-04 | global batch size: 256 | lm loss: 1.996998E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.200 | TFLOPs: 19.79 | 31: iteration 78460/ 173500 | consumed samples: 20085760 | consumed tokens: 41135636480 | elapsed time per iteration (s): 0.81 | learning rate: 1.250E-04 | global batch size: 256 | lm loss: 2.005799E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.993 | TFLOPs: 19.24 | 31: iteration 78470/ 173500 | consumed samples: 20088320 | consumed tokens: 41140879360 | elapsed time per iteration (s): 0.81 | learning rate: 1.250E-04 | global batch size: 256 | lm loss: 1.997741E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.573 | TFLOPs: 19.15 | 31: iteration 78480/ 173500 | consumed samples: 20090880 | consumed tokens: 41146122240 | elapsed time per iteration (s): 4.46 | learning rate: 1.250E-04 | global batch size: 256 | lm loss: 2.046351E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 57.446 | TFLOPs: 3.48 | 31: iteration 78490/ 173500 | consumed samples: 20093440 | consumed tokens: 41151365120 | elapsed time per iteration (s): 0.80 | learning rate: 1.250E-04 | global batch size: 256 | lm loss: 2.034252E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.867 | TFLOPs: 19.35 | 31: iteration 78500/ 173500 | consumed samples: 20096000 | consumed tokens: 41156608000 | elapsed time per iteration (s): 0.81 | learning rate: 1.249E-04 | global batch size: 256 | lm loss: 2.021374E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.153 | TFLOPs: 19.13 
| 31: iteration 78510/ 173500 | consumed samples: 20098560 | consumed tokens: 41161850880 | elapsed time per iteration (s): 0.82 | learning rate: 1.249E-04 | global batch size: 256 | lm loss: 2.011251E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.288 | TFLOPs: 18.83 | 31: iteration 78520/ 173500 | consumed samples: 20101120 | consumed tokens: 41167093760 | elapsed time per iteration (s): 0.81 | learning rate: 1.249E-04 | global batch size: 256 | lm loss: 2.021430E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.717 | TFLOPs: 19.04 | 31: iteration 78530/ 173500 | consumed samples: 20103680 | consumed tokens: 41172336640 | elapsed time per iteration (s): 0.81 | learning rate: 1.249E-04 | global batch size: 256 | lm loss: 2.037893E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.765 | TFLOPs: 19.16 | 31: iteration 78540/ 173500 | consumed samples: 20106240 | consumed tokens: 41177579520 | elapsed time per iteration (s): 0.81 | learning rate: 1.249E-04 | global batch size: 256 | lm loss: 2.011512E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.892 | TFLOPs: 19.17 | 31: iteration 78550/ 173500 | consumed samples: 20108800 | consumed tokens: 41182822400 | elapsed time per iteration (s): 0.82 | learning rate: 1.249E-04 | global batch size: 256 | lm loss: 2.028452E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.920 | TFLOPs: 18.87 | 31: iteration 78560/ 173500 | consumed samples: 20111360 | consumed tokens: 41188065280 | elapsed time per iteration (s): 0.81 | learning rate: 1.248E-04 | global batch size: 256 | lm loss: 2.000720E+00 | grad norm: 0.135 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.146 | TFLOPs: 19.13 | 31: iteration 78570/ 173500 | consumed samples: 20113920 | consumed tokens: 41193308160 | elapsed time per iteration (s): 0.79 | learning rate: 1.248E-04 | global batch size: 256 | lm loss: 1.991338E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.852 | TFLOPs: 19.71 | 31: iteration 78580/ 173500 | consumed samples: 20116480 | consumed tokens: 41198551040 | elapsed time per iteration (s): 0.81 | learning rate: 1.248E-04 | global batch size: 256 | lm loss: 1.997030E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.953 | TFLOPs: 19.05 | 31: iteration 78590/ 173500 | consumed samples: 20119040 | consumed tokens: 41203793920 | elapsed time per iteration (s): 0.79 | learning rate: 1.248E-04 | global batch size: 256 | lm loss: 1.988311E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.714 | TFLOPs: 19.70 | 31: iteration 78600/ 173500 | consumed samples: 20121600 | consumed tokens: 41209036800 | elapsed time per iteration (s): 0.81 | learning rate: 1.248E-04 | global batch size: 256 | lm loss: 1.978356E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.756 | TFLOPs: 19.04 | 31: iteration 78610/ 173500 | consumed samples: 20124160 | consumed tokens: 41214279680 | elapsed time per iteration (s): 0.77 | learning rate: 1.248E-04 | global batch size: 256 | lm loss: 2.015100E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.951 | TFLOPs: 20.02 | 31: iteration 78620/ 173500 | consumed samples: 20126720 | consumed tokens: 41219522560 | elapsed time per iteration (s): 0.75 | 
learning rate: 1.247E-04 | global batch size: 256 | lm loss: 2.010413E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.400 | TFLOPs: 20.71 | 31: iteration 78630/ 173500 | consumed samples: 20129280 | consumed tokens: 41224765440 | elapsed time per iteration (s): 0.73 | learning rate: 1.247E-04 | global batch size: 256 | lm loss: 2.006166E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.369 | TFLOPs: 21.32 | 31: iteration 78640/ 173500 | consumed samples: 20131840 | consumed tokens: 41230008320 | elapsed time per iteration (s): 0.78 | learning rate: 1.247E-04 | global batch size: 256 | lm loss: 2.008925E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.019 | TFLOPs: 19.90 | 31: iteration 78650/ 173500 | consumed samples: 20134400 | consumed tokens: 41235251200 | elapsed time per iteration (s): 0.79 | learning rate: 1.247E-04 | global batch size: 256 | lm loss: 2.014995E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.954 | TFLOPs: 19.66 | 31: iteration 78660/ 173500 | consumed samples: 20136960 | consumed tokens: 41240494080 | elapsed time per iteration (s): 0.77 | learning rate: 1.247E-04 | global batch size: 256 | lm loss: 2.017908E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.791 | TFLOPs: 20.19 | 31: iteration 78670/ 173500 | consumed samples: 20139520 | consumed tokens: 41245736960 | elapsed time per iteration (s): 0.77 | learning rate: 1.247E-04 | global batch size: 256 | lm loss: 2.010094E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.593 | TFLOPs: 20.12 | 31: iteration 78680/ 
173500 | consumed samples: 20142080 | consumed tokens: 41250979840 | elapsed time per iteration (s): 0.74 | learning rate: 1.246E-04 | global batch size: 256 | lm loss: 2.015035E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.207 | TFLOPs: 20.88 | 31: iteration 78690/ 173500 | consumed samples: 20144640 | consumed tokens: 41256222720 | elapsed time per iteration (s): 0.76 | learning rate: 1.246E-04 | global batch size: 256 | lm loss: 2.000507E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.050 | TFLOPs: 20.33 | 31: iteration 78700/ 173500 | consumed samples: 20147200 | consumed tokens: 41261465600 | elapsed time per iteration (s): 0.75 | learning rate: 1.246E-04 | global batch size: 256 | lm loss: 2.014887E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.159 | TFLOPs: 20.76 | 31: iteration 78710/ 173500 | consumed samples: 20149760 | consumed tokens: 41266708480 | elapsed time per iteration (s): 0.80 | learning rate: 1.246E-04 | global batch size: 256 | lm loss: 1.993292E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.568 | TFLOPs: 19.39 | 31: iteration 78720/ 173500 | consumed samples: 20152320 | consumed tokens: 41271951360 | elapsed time per iteration (s): 0.78 | learning rate: 1.246E-04 | global batch size: 256 | lm loss: 2.011884E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.549 | TFLOPs: 19.94 | 31: iteration 78730/ 173500 | consumed samples: 20154880 | consumed tokens: 41277194240 | elapsed time per iteration (s): 0.79 | learning rate: 1.246E-04 | global batch size: 256 | lm loss: 2.018721E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 323.658 | TFLOPs: 19.58 | 31: iteration 78740/ 173500 | consumed samples: 20157440 | consumed tokens: 41282437120 | elapsed time per iteration (s): 0.80 | learning rate: 1.245E-04 | global batch size: 256 | lm loss: 2.015392E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.639 | TFLOPs: 19.34 | 31: iteration 78750/ 173500 | consumed samples: 20160000 | consumed tokens: 41287680000 | elapsed time per iteration (s): 0.76 | learning rate: 1.245E-04 | global batch size: 256 | lm loss: 1.994568E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.575 | TFLOPs: 20.42 | 31: iteration 78760/ 173500 | consumed samples: 20162560 | consumed tokens: 41292922880 | elapsed time per iteration (s): 0.82 | learning rate: 1.245E-04 | global batch size: 256 | lm loss: 1.997459E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.798 | TFLOPs: 18.98 | 31: iteration 78770/ 173500 | consumed samples: 20165120 | consumed tokens: 41298165760 | elapsed time per iteration (s): 0.79 | learning rate: 1.245E-04 | global batch size: 256 | lm loss: 2.009882E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.987 | TFLOPs: 19.60 | 31: iteration 78780/ 173500 | consumed samples: 20167680 | consumed tokens: 41303408640 | elapsed time per iteration (s): 0.78 | learning rate: 1.245E-04 | global batch size: 256 | lm loss: 2.011599E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.321 | TFLOPs: 19.86 | 31: iteration 78790/ 173500 | consumed samples: 20170240 | consumed tokens: 41308651520 | elapsed time per iteration (s): 0.78 | learning rate: 
1.245E-04 | global batch size: 256 | lm loss: 1.977328E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.352 | TFLOPs: 19.80 | 31: iteration 78800/ 173500 | consumed samples: 20172800 | consumed tokens: 41313894400 | elapsed time per iteration (s): 0.78 | learning rate: 1.245E-04 | global batch size: 256 | lm loss: 1.971840E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.128 | TFLOPs: 19.91 | 31: iteration 78810/ 173500 | consumed samples: 20175360 | consumed tokens: 41319137280 | elapsed time per iteration (s): 0.75 | learning rate: 1.244E-04 | global batch size: 256 | lm loss: 2.007241E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.786 | TFLOPs: 20.56 | 31: iteration 78820/ 173500 | consumed samples: 20177920 | consumed tokens: 41324380160 | elapsed time per iteration (s): 0.73 | learning rate: 1.244E-04 | global batch size: 256 | lm loss: 2.013480E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.925 | TFLOPs: 21.17 | 31: iteration 78830/ 173500 | consumed samples: 20180480 | consumed tokens: 41329623040 | elapsed time per iteration (s): 0.72 | learning rate: 1.244E-04 | global batch size: 256 | lm loss: 2.019527E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.615 | TFLOPs: 21.63 | 31: iteration 78840/ 173500 | consumed samples: 20183040 | consumed tokens: 41334865920 | elapsed time per iteration (s): 0.73 | learning rate: 1.244E-04 | global batch size: 256 | lm loss: 2.016800E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.087 | TFLOPs: 21.24 | 31: iteration 78850/ 173500 | 
consumed samples: 20185600 | consumed tokens: 41340108800 | elapsed time per iteration (s): 0.77 | learning rate: 1.244E-04 | global batch size: 256 | lm loss: 2.014766E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.284 | TFLOPs: 20.22 | 31: iteration 78860/ 173500 | consumed samples: 20188160 | consumed tokens: 41345351680 | elapsed time per iteration (s): 0.74 | learning rate: 1.244E-04 | global batch size: 256 | lm loss: 2.029082E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.007 | TFLOPs: 20.93 | 31: iteration 78870/ 173500 | consumed samples: 20190720 | consumed tokens: 41350594560 | elapsed time per iteration (s): 0.75 | learning rate: 1.243E-04 | global batch size: 256 | lm loss: 2.001927E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.321 | TFLOPs: 20.53 | 31: iteration 78880/ 173500 | consumed samples: 20193280 | consumed tokens: 41355837440 | elapsed time per iteration (s): 0.79 | learning rate: 1.243E-04 | global batch size: 256 | lm loss: 1.983181E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.157 | TFLOPs: 19.49 | 31: iteration 78890/ 173500 | consumed samples: 20195840 | consumed tokens: 41361080320 | elapsed time per iteration (s): 0.76 | learning rate: 1.243E-04 | global batch size: 256 | lm loss: 2.040398E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.199 | TFLOPs: 20.34 | 31: iteration 78900/ 173500 | consumed samples: 20198400 | consumed tokens: 41366323200 | elapsed time per iteration (s): 0.82 | learning rate: 1.243E-04 | global batch size: 256 | lm loss: 1.992153E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 310.846 | TFLOPs: 18.81 | 31: iteration 78910/ 173500 | consumed samples: 20200960 | consumed tokens: 41371566080 | elapsed time per iteration (s): 0.76 | learning rate: 1.243E-04 | global batch size: 256 | lm loss: 2.025880E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.543 | TFLOPs: 20.48 | 31: iteration 78920/ 173500 | consumed samples: 20203520 | consumed tokens: 41376808960 | elapsed time per iteration (s): 0.79 | learning rate: 1.243E-04 | global batch size: 256 | lm loss: 1.992326E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.758 | TFLOPs: 19.53 | 31: iteration 78930/ 173500 | consumed samples: 20206080 | consumed tokens: 41382051840 | elapsed time per iteration (s): 0.75 | learning rate: 1.242E-04 | global batch size: 256 | lm loss: 2.001353E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.239 | TFLOPs: 20.70 | 31: iteration 78940/ 173500 | consumed samples: 20208640 | consumed tokens: 41387294720 | elapsed time per iteration (s): 0.79 | learning rate: 1.242E-04 | global batch size: 256 | lm loss: 2.033941E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.183 | TFLOPs: 19.55 | 31: iteration 78950/ 173500 | consumed samples: 20211200 | consumed tokens: 41392537600 | elapsed time per iteration (s): 0.75 | learning rate: 1.242E-04 | global batch size: 256 | lm loss: 2.012935E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.136 | TFLOPs: 20.76 | 31: iteration 78960/ 173500 | consumed samples: 20213760 | consumed tokens: 41397780480 | elapsed time per iteration (s): 0.79 | learning rate: 1.242E-04 | global batch 
size: 256 | lm loss: 1.987090E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.668 | TFLOPs: 19.58 | 31: iteration 78970/ 173500 | consumed samples: 20216320 | consumed tokens: 41403023360 | elapsed time per iteration (s): 0.74 | learning rate: 1.242E-04 | global batch size: 256 | lm loss: 1.997475E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.039 | TFLOPs: 20.99 | 31: iteration 78980/ 173500 | consumed samples: 20218880 | consumed tokens: 41408266240 | elapsed time per iteration (s): 0.74 | learning rate: 1.242E-04 | global batch size: 256 | lm loss: 2.011773E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.771 | TFLOPs: 21.04 | 31: iteration 78990/ 173500 | consumed samples: 20221440 | consumed tokens: 41413509120 | elapsed time per iteration (s): 0.82 | learning rate: 1.241E-04 | global batch size: 256 | lm loss: 1.996579E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.747 | TFLOPs: 18.80 | 31: iteration 79000/ 173500 | consumed samples: 20224000 | consumed tokens: 41418752000 | elapsed time per iteration (s): 0.72 | learning rate: 1.241E-04 | global batch size: 256 | lm loss: 1.998362E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.865 | TFLOPs: 21.53 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 79000 | lm loss value: 2.001760E+00 | lm loss PPL: 7.402074E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 79000 to checkpoints_1b1long 0: [2022-11-26 12:00:35,345] [INFO] [logging.py:68:log_dist] 
[Rank 0] [Torch] Checkpoint global_step79000 is begin to save! 0: [2022-11-26 12:00:35,357] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_01-model_00-model_states.pt... 0: [2022-11-26 12:00:35,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_01-model_00-model_states.pt. 0: [2022-11-26 12:00:35,583] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_03-model_00-model_states.pt... 0: [2022-11-26 12:00:35,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_03-model_00-model_states.pt. 0: [2022-11-26 12:00:35,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_04-model_00-model_states.pt... 0: [2022-11-26 12:00:35,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_04-model_00-model_states.pt. 0: [2022-11-26 12:00:35,736] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_05-model_00-model_states.pt... 0: [2022-11-26 12:00:35,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_05-model_00-model_states.pt. 0: [2022-11-26 12:00:35,812] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_06-model_00-model_states.pt... 0: [2022-11-26 12:00:35,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_06-model_00-model_states.pt. 0: [2022-11-26 12:00:35,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_07-model_00-model_states.pt... 0: [2022-11-26 12:00:35,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 12:00:35,959] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_08-model_00-model_states.pt... 0: [2022-11-26 12:00:36,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_08-model_00-model_states.pt. 0: [2022-11-26 12:00:36,033] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_09-model_00-model_states.pt... 0: [2022-11-26 12:00:36,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_09-model_00-model_states.pt. 0: [2022-11-26 12:00:36,105] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_10-model_00-model_states.pt... 0: [2022-11-26 12:00:36,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_10-model_00-model_states.pt. 0: [2022-11-26 12:00:36,184] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_11-model_00-model_states.pt... 0: [2022-11-26 12:00:36,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_11-model_00-model_states.pt. 0: [2022-11-26 12:00:36,258] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_12-model_00-model_states.pt... 0: [2022-11-26 12:00:36,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_12-model_00-model_states.pt. 0: [2022-11-26 12:00:36,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_13-model_00-model_states.pt... 0: [2022-11-26 12:00:36,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 12:00:36,411] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_14-model_00-model_states.pt... 0: [2022-11-26 12:00:36,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_14-model_00-model_states.pt. 0: [2022-11-26 12:00:36,487] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_15-model_00-model_states.pt... 0: [2022-11-26 12:00:36,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_15-model_00-model_states.pt. 0: [2022-11-26 12:00:36,561] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_16-model_00-model_states.pt... 0: [2022-11-26 12:00:36,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_16-model_00-model_states.pt. 0: [2022-11-26 12:00:36,633] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_17-model_00-model_states.pt... 0: [2022-11-26 12:00:36,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_17-model_00-model_states.pt. 0: [2022-11-26 12:00:36,709] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_18-model_00-model_states.pt... 0: [2022-11-26 12:00:36,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_18-model_00-model_states.pt. 0: [2022-11-26 12:00:36,779] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_19-model_00-model_states.pt... 0: [2022-11-26 12:00:36,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 12:00:36,856] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_20-model_00-model_states.pt... 0: [2022-11-26 12:00:36,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_20-model_00-model_states.pt. 0: [2022-11-26 12:00:36,927] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_21-model_00-model_states.pt... 0: [2022-11-26 12:00:37,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_21-model_00-model_states.pt. 0: [2022-11-26 12:00:37,001] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_22-model_00-model_states.pt... 0: [2022-11-26 12:00:37,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_22-model_00-model_states.pt. 0: [2022-11-26 12:00:37,077] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_23-model_00-model_states.pt... 0: [2022-11-26 12:00:37,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_23-model_00-model_states.pt. 0: [2022-11-26 12:00:37,148] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_24-model_00-model_states.pt... 0: [2022-11-26 12:00:37,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_24-model_00-model_states.pt. 0: [2022-11-26 12:00:37,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_25-model_00-model_states.pt... 0: [2022-11-26 12:00:37,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 12:00:37,297] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_26-model_00-model_states.pt... 0: [2022-11-26 12:00:37,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_26-model_00-model_states.pt. 0: [2022-11-26 12:00:37,370] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_27-model_00-model_states.pt... 0: [2022-11-26 12:00:37,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_27-model_00-model_states.pt. 0: [2022-11-26 12:00:37,442] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_28-model_00-model_states.pt... 0: [2022-11-26 12:00:37,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_28-model_00-model_states.pt. 0: [2022-11-26 12:00:37,517] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/layer_30-model_00-model_states.pt... 0: [2022-11-26 12:00:37,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/layer_30-model_00-model_states.pt. 0: [2022-11-26 12:00:37,519] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step79000/mp_rank_00_model_states.pt 0: [2022-11-26 12:00:37,519] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/mp_rank_00_model_states.pt... 0: [2022-11-26 12:00:37,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/mp_rank_00_model_states.pt. 0: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
6: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
4: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
1: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
13: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 
23: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 
24: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 
17: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
26: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
0: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
8: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
2: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
15: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 
28: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 
30: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 
27: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
10: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
15: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 
14: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 
18: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 
2: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 
31: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 
27: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 
30: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 
30: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:00:37,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:00:37,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:00:37,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:00:37,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 12:00:37,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 0: [2022-11-26 12:00:37,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 12:00:37,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 25: [2022-11-26 12:00:37,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:00:37,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 12:00:37,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 0: [2022-11-26 12:00:37,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 12:00:37,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 12:00:37,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 11: [2022-11-26 12:00:37,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:00:37,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 12:00:37,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 30: [2022-11-26 12:00:37,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:00:37,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 12:00:37,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 19: [2022-11-26 12:00:37,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:00:37,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 12:00:37,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 6: [2022-11-26 12:00:37,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
13: [2022-11-26 12:00:37,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:00:37,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 6: [2022-11-26 12:00:37,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 13: [2022-11-26 12:00:37,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 6: [2022-11-26 12:00:37,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 11: [2022-11-26 12:00:37,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:00:37,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:00:37,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 10: [2022-11-26 12:00:37,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 11: [2022-11-26 12:00:37,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 10: [2022-11-26 12:00:37,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 16: [2022-11-26 12:00:37,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
16: [2022-11-26 12:00:37,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 12:00:37,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 5: [2022-11-26 12:00:37,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:00:37,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:00:37,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 12:00:37,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 12:00:37,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 5: [2022-11-26 12:00:37,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 26: [2022-11-26 12:00:37,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:00:37,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 12:00:37,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 4: [2022-11-26 12:00:37,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 12:00:37,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 12:00:37,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 14: [2022-11-26 12:00:37,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:00:37,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:00:37,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 18: [2022-11-26 12:00:37,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:00:37,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 14: [2022-11-26 12:00:37,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 18: [2022-11-26 12:00:37,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 27: [2022-11-26 12:00:37,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 18: [2022-11-26 12:00:37,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 4: [2022-11-26 12:00:37,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 12:00:37,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 12:00:37,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 25: [2022-11-26 12:00:37,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:00:37,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 12:00:37,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 20: [2022-11-26 12:00:37,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:00:37,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 12:00:37,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 16: [2022-11-26 12:00:37,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:00:37,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 12:00:37,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 5: [2022-11-26 12:00:37,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 12:00:37,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 19: [2022-11-26 12:00:37,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:00:37,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 19: [2022-11-26 12:00:37,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 12:00:37,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 10: [2022-11-26 12:00:37,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:00:37,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 12:00:37,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 30: [2022-11-26 12:00:37,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:00:37,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 13: [2022-11-26 12:00:37,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:00:37,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
13: [2022-11-26 12:00:37,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 12:00:37,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 15: [2022-11-26 12:00:37,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:00:37,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 12:00:37,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 15: [2022-11-26 12:00:37,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:00:37,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 11: [2022-11-26 12:00:37,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:00:37,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 11: [2022-11-26 12:00:37,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 12:00:37,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 18: [2022-11-26 12:00:37,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-26 12:00:37,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 27: [2022-11-26 12:00:37,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:00:37,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 27: [2022-11-26 12:00:37,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 12:00:37,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 27: [2022-11-26 12:00:37,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:00:37,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 12:00:37,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 14: [2022-11-26 12:00:37,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:00:37,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 12:00:37,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 15: [2022-11-26 12:00:37,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-26 12:00:37,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 12:00:37,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 29: [2022-11-26 12:00:37,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:00:37,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:00:37,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 14: [2022-11-26 12:00:37,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:00:37,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 14: [2022-11-26 12:00:37,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 29: [2022-11-26 12:00:37,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 29: [2022-11-26 12:00:37,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 14: [2022-11-26 12:00:37,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 29: [2022-11-26 12:00:37,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
29: [2022-11-26 12:00:37,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 12:00:37,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 10: [2022-11-26 12:00:37,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:00:37,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 12:00:37,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 6: [2022-11-26 12:00:37,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:00:37,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 16: [2022-11-26 12:00:37,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:00:37,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 12:00:37,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 6: [2022-11-26 12:00:37,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 26: [2022-11-26 12:00:37,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
26: [2022-11-26 12:00:37,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 12:00:37,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 26: [2022-11-26 12:00:37,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:00:37,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:00:37,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 12:00:37,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 0: [2022-11-26 12:00:37,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 25: [2022-11-26 12:00:37,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:00:37,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 11: [2022-11-26 12:00:37,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:00:37,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
20: [2022-11-26 12:00:37,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:00:37,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 12:00:37,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 25: [2022-11-26 12:00:37,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 20: [2022-11-26 12:00:37,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 12:00:37,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 0: [2022-11-26 12:00:37,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:00:37,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:00:37,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 12:00:37,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 4: [2022-11-26 12:00:37,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 12:00:37,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 12:00:37,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 4: [2022-11-26 12:00:37,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:00:37,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 12:00:37,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 26: [2022-11-26 12:00:37,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:00:37,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 12:00:37,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 18: [2022-11-26 12:00:37,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:00:37,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 12:00:37,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 27: [2022-11-26 12:00:37,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
27: [2022-11-26 12:00:37,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 12:00:37,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 24: [2022-11-26 12:00:37,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:00:37,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:00:37,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:00:37,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:00:37,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 12:00:37,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 24: [2022-11-26 12:00:37,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 12:00:37,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 30: [2022-11-26 12:00:37,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
30: [2022-11-26 12:00:37,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 24: [2022-11-26 12:00:37,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 24: [2022-11-26 12:00:37,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 14: [2022-11-26 12:00:37,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:00:37,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 12:00:37,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 21: [2022-11-26 12:00:37,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:00:37,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:00:37,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:00:37,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 12:00:37,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-26 12:00:37,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 12:00:37,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 12:00:37,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 21: [2022-11-26 12:00:37,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 21: [2022-11-26 12:00:37,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 29: [2022-11-26 12:00:37,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:00:37,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 29: [2022-11-26 12:00:37,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 21: [2022-11-26 12:00:37,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 29: [2022-11-26 12:00:37,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 13: [2022-11-26 12:00:37,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:00:37,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
13: [2022-11-26 12:00:37,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 16: [2022-11-26 12:00:37,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 13: [2022-11-26 12:00:37,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 6: [2022-11-26 12:00:37,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:00:37,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:00:37,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 6: [2022-11-26 12:00:37,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 12:00:37,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 12:00:37,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 6: [2022-11-26 12:00:37,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 25: [2022-11-26 12:00:37,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
25: [2022-11-26 12:00:37,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 12:00:37,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 18: [2022-11-26 12:00:37,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:00:37,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 12:00:37,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 3: [2022-11-26 12:00:37,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:00:37,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:00:37,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:00:37,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 12:00:37,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 12:00:37,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 12:00:37,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 12:00:37,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 12:00:37,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 3: [2022-11-26 12:00:37,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 3: [2022-11-26 12:00:37,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 3: [2022-11-26 12:00:37,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 15: [2022-11-26 12:00:37,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:00:37,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 12:00:37,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 19: [2022-11-26 12:00:37,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
19: [2022-11-26 12:00:37,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 12:00:37,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 19: [2022-11-26 12:00:37,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:00:37,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 12:00:37,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 13: [2022-11-26 12:00:37,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:00:37,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 12:00:37,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 0: [2022-11-26 12:00:37,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 12:00:37,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 16: [2022-11-26 12:00:37,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:00:37,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 
16: [2022-11-26 12:00:37,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 12:00:37,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 24: [2022-11-26 12:00:37,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 12:00:37,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:00:37,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 24: [2022-11-26 12:00:37,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 12:00:37,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 2: [2022-11-26 12:00:37,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:00:37,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:00:37,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 12:00:37,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 12:00:37,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 12:00:37,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 12:00:37,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 2: [2022-11-26 12:00:37,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 2: [2022-11-26 12:00:37,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 20: [2022-11-26 12:00:37,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:00:37,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 12:00:37,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 10: [2022-11-26 12:00:37,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:00:37,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 12:00:37,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
25: [2022-11-26 12:00:37,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:00:37,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 12:00:37,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 5: [2022-11-26 12:00:37,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:00:37,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 12:00:37,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 6: [2022-11-26 12:00:37,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:00:37,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 12:00:37,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 20: [2022-11-26 12:00:37,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:00:37,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 12:00:37,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
29: [2022-11-26 12:00:37,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:00:37,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 12:00:37,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 2: [2022-11-26 12:00:37,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:00:37,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 12:00:37,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 12: [2022-11-26 12:00:37,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:00:37,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:00:37,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:00:37,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 12:00:37,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 12:00:37,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 12:00:37,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 12:00:37,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 12:00:37,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 12: [2022-11-26 12:00:37,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 12: [2022-11-26 12:00:37,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 12: [2022-11-26 12:00:37,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 9: [2022-11-26 12:00:37,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:00:37,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 12:00:37,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 9: [2022-11-26 12:00:37,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-26 12:00:37,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 12:00:37,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 4: [2022-11-26 12:00:37,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:00:37,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 12:00:37,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 22: [2022-11-26 12:00:37,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:00:37,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 12:00:37,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 23: [2022-11-26 12:00:37,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:00:37,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:00:37,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:00:37,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-26 12:00:37,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:00:37,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 12:00:37,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 12:00:37,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 23: [2022-11-26 12:00:37,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 12:00:37,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 23: [2022-11-26 12:00:37,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 12:00:37,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 12:00:37,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 23: [2022-11-26 12:00:37,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 23: [2022-11-26 12:00:37,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 11: [2022-11-26 12:00:37,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-26 12:00:37,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 12:00:37,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 1: [2022-11-26 12:00:37,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:00:37,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:00:37,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:00:37,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:00:37,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:00:37,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 12:00:37,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 12:00:37,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 1: [2022-11-26 12:00:37,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
1: [2022-11-26 12:00:37,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 12:00:37,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 12:00:37,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 12:00:37,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 1: [2022-11-26 12:00:37,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 1: [2022-11-26 12:00:37,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 9: [2022-11-26 12:00:37,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:00:37,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 12:00:37,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 26: [2022-11-26 12:00:37,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:00:37,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 12:00:37,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
17: [2022-11-26 12:00:37,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:00:37,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:00:37,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 12:00:37,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 12:00:37,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 17: [2022-11-26 12:00:37,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 17: [2022-11-26 12:00:37,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:00:37,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 12:00:37,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 13: [2022-11-26 12:00:37,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:00:37,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 12:00:37,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
7: [2022-11-26 12:00:37,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:00:37,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:00:37,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:00:37,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 12:00:37,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 12:00:37,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 12:00:37,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 7: [2022-11-26 12:00:37,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 7: [2022-11-26 12:00:37,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 8: [2022-11-26 12:00:37,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:00:37,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 12:00:37,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 12:00:37,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 8: [2022-11-26 12:00:37,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 12:00:37,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 8: [2022-11-26 12:00:37,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:00:37,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 12:00:37,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 22: [2022-11-26 12:00:37,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:00:37,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:00:37,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 12:00:37,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 12:00:37,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
22: [2022-11-26 12:00:37,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 28: [2022-11-26 12:00:37,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:00:37,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:00:37,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:00:37,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:00:37,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:00:37,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 12:00:37,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 12:00:37,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 12:00:37,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 12:00:37,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 12:00:37,715] [INFO] 
[torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 28: [2022-11-26 12:00:37,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 28: [2022-11-26 12:00:37,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 28: [2022-11-26 12:00:37,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 28: [2022-11-26 12:00:37,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 21: [2022-11-26 12:00:37,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:00:37,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 12:00:37,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 30: [2022-11-26 12:00:37,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:00:37,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 0: [2022-11-26 12:00:37,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:00:37,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
0: [2022-11-26 12:00:37,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 12:00:37,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 3: [2022-11-26 12:00:37,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:00:37,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 12:00:37,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 27: [2022-11-26 12:00:37,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:00:37,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 12:00:37,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 31: [2022-11-26 12:00:37,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:00:37,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:00:37,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-26 12:00:37,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 12:00:37,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 12:00:37,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 12:00:37,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 31: [2022-11-26 12:00:37,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 31: [2022-11-26 12:00:37,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 10: [2022-11-26 12:00:37,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:00:37,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 12:00:37,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 24: [2022-11-26 12:00:37,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:00:37,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 12:00:37,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
2: [2022-11-26 12:00:37,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:00:37,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 12:00:37,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 12: [2022-11-26 12:00:37,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:00:37,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 12:00:37,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 19: [2022-11-26 12:00:37,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:00:37,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 12:00:37,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 15: [2022-11-26 12:00:37,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:00:37,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 12:00:37,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
14: [2022-11-26 12:00:37,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:00:37,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 12:00:37,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 31: [2022-11-26 12:00:37,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:00:37,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 12:00:37,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 18: [2022-11-26 12:00:37,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:00:37,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 12:00:37,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 22: [2022-11-26 12:00:37,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:00:37,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 12:00:37,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
9: [2022-11-26 12:00:37,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:00:37,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 12:00:37,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 6: [2022-11-26 12:00:37,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:00:37,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 12:00:37,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 17: [2022-11-26 12:00:37,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:00:37,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 12:00:37,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 25: [2022-11-26 12:00:37,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:00:37,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 12:00:37,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
7: [2022-11-26 12:00:37,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:00:37,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 12:00:37,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 16: [2022-11-26 12:00:37,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:00:37,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 12:00:37,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 5: [2022-11-26 12:00:37,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:00:37,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 12:00:37,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 23: [2022-11-26 12:00:37,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:00:37,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 12:00:37,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
28: [2022-11-26 12:00:37,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:00:37,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:00:37,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 12:00:37,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 20: [2022-11-26 12:00:37,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:00:37,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 12:00:37,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 0: [2022-11-26 12:00:37,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:00:37,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 12:00:37,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 1: [2022-11-26 12:00:37,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-26 12:00:37,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 12:00:37,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 28: [2022-11-26 12:00:37,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 12:00:37,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 29: [2022-11-26 12:00:37,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:00:37,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 12:00:37,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 3: [2022-11-26 12:00:37,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:00:37,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 12:00:37,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 11: [2022-11-26 12:00:37,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-26 12:00:37,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 12:00:37,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 27: [2022-11-26 12:00:37,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:00:37,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:00:37,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:00:37,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 27: [2022-11-26 12:00:37,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 8: [2022-11-26 12:00:37,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 21: [2022-11-26 12:00:37,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 8: [2022-11-26 12:00:37,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 27: [2022-11-26 12:00:37,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 30: [2022-11-26 12:00:37,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-26 12:00:37,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 12:00:37,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 13: [2022-11-26 12:00:37,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:00:37,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 12:00:37,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 10: [2022-11-26 12:00:37,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:00:37,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 12:00:37,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 24: [2022-11-26 12:00:37,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:00:37,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
24: [2022-11-26 12:00:37,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 19: [2022-11-26 12:00:37,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 12:00:37,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 24: [2022-11-26 12:00:37,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 18: [2022-11-26 12:00:37,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:00:37,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:00:37,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 2: [2022-11-26 12:00:37,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 12:00:37,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 18: [2022-11-26 12:00:37,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 26: [2022-11-26 12:00:37,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:00:37,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 12:00:37,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 26: [2022-11-26 12:00:37,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 12: [2022-11-26 12:00:37,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 26: [2022-11-26 12:00:37,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 15: [2022-11-26 12:00:37,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:00:37,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 12:00:37,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 14: [2022-11-26 12:00:37,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:00:37,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 12:00:37,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 9: [2022-11-26 12:00:37,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 12:00:37,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 12:00:37,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 6: [2022-11-26 12:00:37,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:00:37,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 12:00:37,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 31: [2022-11-26 12:00:37,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:00:37,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 12:00:37,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 22: [2022-11-26 12:00:37,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:00:37,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 12:00:37,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 8: [2022-11-26 12:00:37,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 12:00:37,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 12:00:37,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 17: [2022-11-26 12:00:37,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:00:37,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 12:00:37,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 7: [2022-11-26 12:00:37,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:00:37,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 12:00:37,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 16: [2022-11-26 12:00:37,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:00:37,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 12:00:37,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 29: [2022-11-26 12:00:37,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-26 12:00:37,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 12:00:37,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 0: [2022-11-26 12:00:37,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:00:37,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 12:00:37,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 1: [2022-11-26 12:00:37,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:00:37,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 12:00:37,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 4: [2022-11-26 12:00:37,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:00:37,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 12:00:37,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 20: [2022-11-26 12:00:37,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 
20: [2022-11-26 12:00:37,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 12:00:37,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 5: [2022-11-26 12:00:37,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:00:37,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:00:37,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 5: [2022-11-26 12:00:37,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 21: [2022-11-26 12:00:37,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 5: [2022-11-26 12:00:37,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 25: [2022-11-26 12:00:37,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:00:37,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 12:00:37,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 13: [2022-11-26 12:00:37,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 12:00:37,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 12:00:37,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 23: [2022-11-26 12:00:37,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:00:37,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 12:00:37,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 28: [2022-11-26 12:00:37,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:00:37,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:00:37,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 12:00:37,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 11: [2022-11-26 12:00:37,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:00:37,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 12:00:37,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
3: [2022-11-26 12:00:37,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:00:37,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:00:37,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 12:00:37,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 30: [2022-11-26 12:00:37,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 12:00:37,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 28: [2022-11-26 12:00:37,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 12:00:37,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 2: [2022-11-26 12:00:37,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:00:37,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 12:00:37,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 10: [2022-11-26 12:00:37,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 12:00:37,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 12:00:37,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 26: [2022-11-26 12:00:37,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:00:37,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:00:37,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 24: [2022-11-26 12:00:37,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 26: [2022-11-26 12:00:37,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 24: [2022-11-26 12:00:37,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 18: [2022-11-26 12:00:37,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:00:37,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 12:00:37,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 12: [2022-11-26 12:00:37,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 12:00:37,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 12:00:37,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 19: [2022-11-26 12:00:37,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:00:37,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 12:00:37,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 9: [2022-11-26 12:00:37,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:00:37,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 14: [2022-11-26 12:00:37,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:00:37,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 14: [2022-11-26 12:00:37,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 12:00:37,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 15: [2022-11-26 12:00:37,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 12:00:37,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 12:00:37,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 22: [2022-11-26 12:00:37,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:00:37,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 12:00:37,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 17: [2022-11-26 12:00:37,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:00:37,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 12:00:37,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 31: [2022-11-26 12:00:37,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:00:37,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 12:00:37,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 16: [2022-11-26 12:00:37,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
2: [2022-11-26 12:00:37,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:00:37,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 2: [2022-11-26 12:00:37,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 16: [2022-11-26 12:00:37,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 2: [2022-11-26 12:00:37,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 19: [2022-11-26 12:00:37,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:00:37,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:00:37,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 12:00:37,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 8: [2022-11-26 12:00:37,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 12:00:37,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 0: [2022-11-26 12:00:37,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
5: [2022-11-26 12:00:37,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:00:37,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 5: [2022-11-26 12:00:37,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 0: [2022-11-26 12:00:37,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 5: [2022-11-26 12:00:37,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 1: [2022-11-26 12:00:37,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:00:37,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 12:00:37,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 23: [2022-11-26 12:00:37,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:00:37,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
23: [2022-11-26 12:00:37,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 7: [2022-11-26 12:00:37,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 23: [2022-11-26 12:00:37,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 7: [2022-11-26 12:00:37,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 4: [2022-11-26 12:00:37,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:00:37,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 12:00:37,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 3: [2022-11-26 12:00:37,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:00:37,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 12:00:37,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 20: [2022-11-26 12:00:37,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
20: [2022-11-26 12:00:37,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 12:00:37,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 28: [2022-11-26 12:00:37,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:00:37,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:00:37,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:00:37,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 12:00:37,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 27: [2022-11-26 12:00:37,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 12:00:37,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 6: [2022-11-26 12:00:37,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 12:00:37,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 24: [2022-11-26 12:00:37,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
24: [2022-11-26 12:00:37,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 12:00:37,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 25: [2022-11-26 12:00:37,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:00:37,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 12:00:37,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 21: [2022-11-26 12:00:37,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:00:37,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 12:00:37,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 10: [2022-11-26 12:00:37,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:00:37,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 12:00:37,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 26: [2022-11-26 12:00:37,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
26: [2022-11-26 12:00:37,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 12:00:37,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 13: [2022-11-26 12:00:37,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:00:37,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 12:00:37,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 11: [2022-11-26 12:00:37,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:00:37,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 12:00:37,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 29: [2022-11-26 12:00:37,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:00:37,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:00:37,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 12:00:37,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
14: [2022-11-26 12:00:37,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 12:00:37,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 30: [2022-11-26 12:00:37,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:00:37,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 12:00:37,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 18: [2022-11-26 12:00:37,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:00:37,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 12:00:37,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 15: [2022-11-26 12:00:37,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:00:37,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 12:00:37,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 12: [2022-11-26 12:00:37,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 12:00:37,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 12:00:37,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 22: [2022-11-26 12:00:37,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:00:37,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 12:00:37,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 17: [2022-11-26 12:00:37,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:00:37,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 12:00:37,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 8: [2022-11-26 12:00:37,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:00:37,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 12:00:37,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 8: [2022-11-26 12:00:37,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-26 12:00:37,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 12:00:37,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 7: [2022-11-26 12:00:37,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:00:37,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 12:00:37,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 22: [2022-11-26 12:00:37,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:00:37,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 12:00:37,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 17: [2022-11-26 12:00:37,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:00:37,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 12:00:37,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 31: [2022-11-26 12:00:37,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 
31: [2022-11-26 12:00:37,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 12:00:37,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 9: [2022-11-26 12:00:37,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:00:37,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:00:37,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 12:00:37,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 12:00:37,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 9: [2022-11-26 12:00:37,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 7: [2022-11-26 12:00:37,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:00:37,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 12:00:37,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 31: [2022-11-26 12:00:37,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-26 12:00:37,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step79000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 12:00:37,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 0: successfully saved checkpoint at iteration 79000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2546.66 31: iteration 79010/ 173500 | consumed samples: 20226560 | consumed tokens: 41423994880 | elapsed time per iteration (s): 0.99 | learning rate: 1.241E-04 | global batch size: 256 | lm loss: 2.007991E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 258.289 | TFLOPs: 15.63 | 31: iteration 79020/ 173500 | consumed samples: 20229120 | consumed tokens: 41429237760 | elapsed time per iteration (s): 0.75 | learning rate: 1.241E-04 | global batch size: 256 | lm loss: 2.012850E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.003 | TFLOPs: 20.63 | 31: iteration 79030/ 173500 | consumed samples: 20231680 | consumed tokens: 41434480640 | elapsed time per iteration (s): 0.76 | learning rate: 1.241E-04 | global batch size: 256 | lm loss: 2.007715E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.504 | TFLOPs: 20.36 | 31: iteration 79040/ 173500 | consumed samples: 20234240 | consumed tokens: 41439723520 | elapsed time per iteration (s): 0.75 | learning rate: 1.241E-04 | global batch size: 256 | lm loss: 2.011400E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.565 | TFLOPs: 20.66 | 31: iteration 79050/ 173500 | consumed samples: 20236800 | consumed tokens: 41444966400 | elapsed time per iteration (s): 0.74 | learning rate: 1.240E-04 | global 
batch size: 256 | lm loss: 2.035609E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.629 | TFLOPs: 20.97 | 31: iteration 79060/ 173500 | consumed samples: 20239360 | consumed tokens: 41450209280 | elapsed time per iteration (s): 0.80 | learning rate: 1.240E-04 | global batch size: 256 | lm loss: 2.000902E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.588 | TFLOPs: 19.46 | 31: iteration 79070/ 173500 | consumed samples: 20241920 | consumed tokens: 41455452160 | elapsed time per iteration (s): 0.73 | learning rate: 1.240E-04 | global batch size: 256 | lm loss: 1.976606E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.377 | TFLOPs: 21.08 | 31: iteration 79080/ 173500 | consumed samples: 20244480 | consumed tokens: 41460695040 | elapsed time per iteration (s): 0.79 | learning rate: 1.240E-04 | global batch size: 256 | lm loss: 2.004968E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.248 | TFLOPs: 19.56 | 31: iteration 79090/ 173500 | consumed samples: 20247040 | consumed tokens: 41465937920 | elapsed time per iteration (s): 0.79 | learning rate: 1.240E-04 | global batch size: 256 | lm loss: 2.008985E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.216 | TFLOPs: 19.67 | 31: iteration 79100/ 173500 | consumed samples: 20249600 | consumed tokens: 41471180800 | elapsed time per iteration (s): 0.80 | learning rate: 1.240E-04 | global batch size: 256 | lm loss: 1.996210E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.799 | TFLOPs: 19.47 | 31: iteration 79110/ 173500 | consumed samples: 20252160 
| consumed tokens: 41476423680 | elapsed time per iteration (s): 0.78 | learning rate: 1.239E-04 | global batch size: 256 | lm loss: 2.033650E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.729 | TFLOPs: 19.89 | 31: iteration 79120/ 173500 | consumed samples: 20254720 | consumed tokens: 41481666560 | elapsed time per iteration (s): 0.78 | learning rate: 1.239E-04 | global batch size: 256 | lm loss: 2.000363E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.989 | TFLOPs: 19.90 | 31: iteration 79130/ 173500 | consumed samples: 20257280 | consumed tokens: 41486909440 | elapsed time per iteration (s): 0.78 | learning rate: 1.239E-04 | global batch size: 256 | lm loss: 2.021936E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.667 | TFLOPs: 19.76 | 31: iteration 79140/ 173500 | consumed samples: 20259840 | consumed tokens: 41492152320 | elapsed time per iteration (s): 0.83 | learning rate: 1.239E-04 | global batch size: 256 | lm loss: 1.995157E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.267 | TFLOPs: 18.77 | 31: iteration 79150/ 173500 | consumed samples: 20262400 | consumed tokens: 41497395200 | elapsed time per iteration (s): 0.81 | learning rate: 1.239E-04 | global batch size: 256 | lm loss: 2.028979E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.474 | TFLOPs: 19.02 | 31: iteration 79160/ 173500 | consumed samples: 20264960 | consumed tokens: 41502638080 | elapsed time per iteration (s): 0.82 | learning rate: 1.239E-04 | global batch size: 256 | lm loss: 2.009094E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 
0 | samples per second: 312.480 | TFLOPs: 18.90 | 31: iteration 79170/ 173500 | consumed samples: 20267520 | consumed tokens: 41507880960 | elapsed time per iteration (s): 0.82 | learning rate: 1.239E-04 | global batch size: 256 | lm loss: 1.981990E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.360 | TFLOPs: 18.90 | 31: iteration 79180/ 173500 | consumed samples: 20270080 | consumed tokens: 41513123840 | elapsed time per iteration (s): 0.78 | learning rate: 1.238E-04 | global batch size: 256 | lm loss: 2.007822E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.264 | TFLOPs: 19.92 | 31: iteration 79190/ 173500 | consumed samples: 20272640 | consumed tokens: 41518366720 | elapsed time per iteration (s): 0.78 | learning rate: 1.238E-04 | global batch size: 256 | lm loss: 2.032573E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.450 | TFLOPs: 19.81 | 31: iteration 79200/ 173500 | consumed samples: 20275200 | consumed tokens: 41523609600 | elapsed time per iteration (s): 0.77 | learning rate: 1.238E-04 | global batch size: 256 | lm loss: 2.000432E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.791 | TFLOPs: 20.19 | 31: iteration 79210/ 173500 | consumed samples: 20277760 | consumed tokens: 41528852480 | elapsed time per iteration (s): 0.73 | learning rate: 1.238E-04 | global batch size: 256 | lm loss: 2.006799E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.116 | TFLOPs: 21.18 | 31: iteration 79220/ 173500 | consumed samples: 20280320 | consumed tokens: 41534095360 | elapsed time per iteration (s): 0.77 | learning rate: 1.238E-04 | global batch size: 256 | lm loss: 
2.022617E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.330 | TFLOPs: 20.17 | 31: iteration 79230/ 173500 | consumed samples: 20282880 | consumed tokens: 41539338240 | elapsed time per iteration (s): 0.81 | learning rate: 1.238E-04 | global batch size: 256 | lm loss: 2.021303E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.343 | TFLOPs: 19.02 | 31: iteration 79240/ 173500 | consumed samples: 20285440 | consumed tokens: 41544581120 | elapsed time per iteration (s): 0.79 | learning rate: 1.237E-04 | global batch size: 256 | lm loss: 1.982225E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.667 | TFLOPs: 19.52 | 31: iteration 79250/ 173500 | consumed samples: 20288000 | consumed tokens: 41549824000 | elapsed time per iteration (s): 0.79 | learning rate: 1.237E-04 | global batch size: 256 | lm loss: 2.005614E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.678 | TFLOPs: 19.70 | 31: iteration 79260/ 173500 | consumed samples: 20290560 | consumed tokens: 41555066880 | elapsed time per iteration (s): 0.77 | learning rate: 1.237E-04 | global batch size: 256 | lm loss: 2.010752E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.036 | TFLOPs: 20.15 | 31: iteration 79270/ 173500 | consumed samples: 20293120 | consumed tokens: 41560309760 | elapsed time per iteration (s): 0.80 | learning rate: 1.237E-04 | global batch size: 256 | lm loss: 2.007765E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.288 | TFLOPs: 19.44 | 31: iteration 79280/ 173500 | consumed samples: 20295680 | consumed tokens: 
41565552640 | elapsed time per iteration (s): 0.78 | learning rate: 1.237E-04 | global batch size: 256 | lm loss: 2.019157E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.717 | TFLOPs: 19.83 | 31: iteration 79290/ 173500 | consumed samples: 20298240 | consumed tokens: 41570795520 | elapsed time per iteration (s): 0.76 | learning rate: 1.237E-04 | global batch size: 256 | lm loss: 2.037412E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.317 | TFLOPs: 20.47 | 31: iteration 79300/ 173500 | consumed samples: 20300800 | consumed tokens: 41576038400 | elapsed time per iteration (s): 0.75 | learning rate: 1.236E-04 | global batch size: 256 | lm loss: 2.005858E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.414 | TFLOPs: 20.53 | 31: iteration 79310/ 173500 | consumed samples: 20303360 | consumed tokens: 41581281280 | elapsed time per iteration (s): 0.80 | learning rate: 1.236E-04 | global batch size: 256 | lm loss: 1.965117E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.435 | TFLOPs: 19.45 | 31: iteration 79320/ 173500 | consumed samples: 20305920 | consumed tokens: 41586524160 | elapsed time per iteration (s): 0.79 | learning rate: 1.236E-04 | global batch size: 256 | lm loss: 2.000751E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.076 | TFLOPs: 19.55 | 31: iteration 79330/ 173500 | consumed samples: 20308480 | consumed tokens: 41591767040 | elapsed time per iteration (s): 0.83 | learning rate: 1.236E-04 | global batch size: 256 | lm loss: 2.002205E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 308.281 | TFLOPs: 18.65 | 31: iteration 79340/ 173500 | consumed samples: 20311040 | consumed tokens: 41597009920 | elapsed time per iteration (s): 0.81 | learning rate: 1.236E-04 | global batch size: 256 | lm loss: 2.029035E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.957 | TFLOPs: 19.05 | 31: iteration 79350/ 173500 | consumed samples: 20313600 | consumed tokens: 41602252800 | elapsed time per iteration (s): 0.78 | learning rate: 1.236E-04 | global batch size: 256 | lm loss: 2.007480E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.440 | TFLOPs: 19.93 | 31: iteration 79360/ 173500 | consumed samples: 20316160 | consumed tokens: 41607495680 | elapsed time per iteration (s): 0.77 | learning rate: 1.235E-04 | global batch size: 256 | lm loss: 2.025748E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.623 | TFLOPs: 20.06 | 31: iteration 79370/ 173500 | consumed samples: 20318720 | consumed tokens: 41612738560 | elapsed time per iteration (s): 0.81 | learning rate: 1.235E-04 | global batch size: 256 | lm loss: 1.996913E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.235 | TFLOPs: 19.19 | 31: iteration 79380/ 173500 | consumed samples: 20321280 | consumed tokens: 41617981440 | elapsed time per iteration (s): 0.79 | learning rate: 1.235E-04 | global batch size: 256 | lm loss: 2.010032E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.702 | TFLOPs: 19.70 | 31: iteration 79390/ 173500 | consumed samples: 20323840 | consumed tokens: 41623224320 | elapsed time per iteration (s): 0.80 | learning rate: 1.235E-04 | global batch size: 256 | lm loss: 2.036464E+00 | grad 
norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.818 | TFLOPs: 19.47 | 31: iteration 79400/ 173500 | consumed samples: 20326400 | consumed tokens: 41628467200 | elapsed time per iteration (s): 0.77 | learning rate: 1.235E-04 | global batch size: 256 | lm loss: 2.032462E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.961 | TFLOPs: 20.20 | 31: iteration 79410/ 173500 | consumed samples: 20328960 | consumed tokens: 41633710080 | elapsed time per iteration (s): 0.74 | learning rate: 1.235E-04 | global batch size: 256 | lm loss: 2.018322E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.967 | TFLOPs: 21.05 | 31: iteration 79420/ 173500 | consumed samples: 20331520 | consumed tokens: 41638952960 | elapsed time per iteration (s): 0.78 | learning rate: 1.234E-04 | global batch size: 256 | lm loss: 2.018093E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.894 | TFLOPs: 19.96 | 31: iteration 79430/ 173500 | consumed samples: 20334080 | consumed tokens: 41644195840 | elapsed time per iteration (s): 0.83 | learning rate: 1.234E-04 | global batch size: 256 | lm loss: 2.015825E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.254 | TFLOPs: 18.59 | 31: iteration 79440/ 173500 | consumed samples: 20336640 | consumed tokens: 41649438720 | elapsed time per iteration (s): 0.83 | learning rate: 1.234E-04 | global batch size: 256 | lm loss: 2.009187E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.219 | TFLOPs: 18.65 | 31: iteration 79450/ 173500 | consumed samples: 20339200 | consumed tokens: 41654681600 | elapsed time 
per iteration (s): 0.82 | learning rate: 1.234E-04 | global batch size: 256 | lm loss: 2.004026E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.196 | TFLOPs: 18.83 | 31: iteration 79460/ 173500 | consumed samples: 20341760 | consumed tokens: 41659924480 | elapsed time per iteration (s): 0.84 | learning rate: 1.234E-04 | global batch size: 256 | lm loss: 1.997558E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.301 | TFLOPs: 18.53 | 31: iteration 79470/ 173500 | consumed samples: 20344320 | consumed tokens: 41665167360 | elapsed time per iteration (s): 0.83 | learning rate: 1.234E-04 | global batch size: 256 | lm loss: 2.033611E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.384 | TFLOPs: 18.72 | 31: iteration 79480/ 173500 | consumed samples: 20346880 | consumed tokens: 41670410240 | elapsed time per iteration (s): 0.80 | learning rate: 1.233E-04 | global batch size: 256 | lm loss: 1.979928E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.630 | TFLOPs: 19.34 | 31: iteration 79490/ 173500 | consumed samples: 20349440 | consumed tokens: 41675653120 | elapsed time per iteration (s): 0.78 | learning rate: 1.233E-04 | global batch size: 256 | lm loss: 2.000059E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.596 | TFLOPs: 19.82 | 31: iteration 79500/ 173500 | consumed samples: 20352000 | consumed tokens: 41680896000 | elapsed time per iteration (s): 0.75 | learning rate: 1.233E-04 | global batch size: 256 | lm loss: 2.002547E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.108 | TFLOPs: 
20.58 | 31: iteration 79510/ 173500 | consumed samples: 20354560 | consumed tokens: 41686138880 | elapsed time per iteration (s): 0.77 | learning rate: 1.233E-04 | global batch size: 256 | lm loss: 1.971388E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.550 | TFLOPs: 20.12 | 31: iteration 79520/ 173500 | consumed samples: 20357120 | consumed tokens: 41691381760 | elapsed time per iteration (s): 0.80 | learning rate: 1.233E-04 | global batch size: 256 | lm loss: 2.015595E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.024 | TFLOPs: 19.36 | 31: iteration 79530/ 173500 | consumed samples: 20359680 | consumed tokens: 41696624640 | elapsed time per iteration (s): 0.81 | learning rate: 1.233E-04 | global batch size: 256 | lm loss: 1.975255E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.563 | TFLOPs: 19.09 | 31: iteration 79540/ 173500 | consumed samples: 20362240 | consumed tokens: 41701867520 | elapsed time per iteration (s): 0.78 | learning rate: 1.232E-04 | global batch size: 256 | lm loss: 2.004394E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.347 | TFLOPs: 19.80 | 31: iteration 79550/ 173500 | consumed samples: 20364800 | consumed tokens: 41707110400 | elapsed time per iteration (s): 0.78 | learning rate: 1.232E-04 | global batch size: 256 | lm loss: 2.010051E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.259 | TFLOPs: 19.80 | 31: iteration 79560/ 173500 | consumed samples: 20367360 | consumed tokens: 41712353280 | elapsed time per iteration (s): 0.83 | learning rate: 1.232E-04 | global batch size: 256 | lm loss: 2.010292E+00 | grad norm: 0.138 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.539 | TFLOPs: 18.73 | 31: iteration 79570/ 173500 | consumed samples: 20369920 | consumed tokens: 41717596160 | elapsed time per iteration (s): 0.82 | learning rate: 1.232E-04 | global batch size: 256 | lm loss: 2.005886E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.112 | TFLOPs: 18.82 | 31: iteration 79580/ 173500 | consumed samples: 20372480 | consumed tokens: 41722839040 | elapsed time per iteration (s): 0.80 | learning rate: 1.232E-04 | global batch size: 256 | lm loss: 2.029492E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.243 | TFLOPs: 19.25 | 31: iteration 79590/ 173500 | consumed samples: 20375040 | consumed tokens: 41728081920 | elapsed time per iteration (s): 0.85 | learning rate: 1.232E-04 | global batch size: 256 | lm loss: 2.023956E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.913 | TFLOPs: 18.26 | 31: iteration 79600/ 173500 | consumed samples: 20377600 | consumed tokens: 41733324800 | elapsed time per iteration (s): 0.79 | learning rate: 1.232E-04 | global batch size: 256 | lm loss: 2.006258E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.129 | TFLOPs: 19.61 | 31: iteration 79610/ 173500 | consumed samples: 20380160 | consumed tokens: 41738567680 | elapsed time per iteration (s): 0.81 | learning rate: 1.231E-04 | global batch size: 256 | lm loss: 2.029621E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.535 | TFLOPs: 19.15 | 31: iteration 79620/ 173500 | consumed samples: 20382720 | consumed tokens: 41743810560 | elapsed time per iteration (s): 0.79 | 
learning rate: 1.231E-04 | global batch size: 256 | lm loss: 2.022802E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.064 | TFLOPs: 19.48 | 31: iteration 79630/ 173500 | consumed samples: 20385280 | consumed tokens: 41749053440 | elapsed time per iteration (s): 0.74 | learning rate: 1.231E-04 | global batch size: 256 | lm loss: 1.993361E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.576 | TFLOPs: 21.03 | 31: iteration 79640/ 173500 | consumed samples: 20387840 | consumed tokens: 41754296320 | elapsed time per iteration (s): 0.76 | learning rate: 1.231E-04 | global batch size: 256 | lm loss: 2.035949E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.378 | TFLOPs: 20.35 | 31: iteration 79650/ 173500 | consumed samples: 20390400 | consumed tokens: 41759539200 | elapsed time per iteration (s): 0.80 | learning rate: 1.231E-04 | global batch size: 256 | lm loss: 2.005741E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.251 | TFLOPs: 19.37 | 31: iteration 79660/ 173500 | consumed samples: 20392960 | consumed tokens: 41764782080 | elapsed time per iteration (s): 0.74 | learning rate: 1.231E-04 | global batch size: 256 | lm loss: 2.012255E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.430 | TFLOPs: 20.90 | 31: iteration 79670/ 173500 | consumed samples: 20395520 | consumed tokens: 41770024960 | elapsed time per iteration (s): 0.76 | learning rate: 1.230E-04 | global batch size: 256 | lm loss: 2.017353E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.521 | TFLOPs: 20.48 | 31: iteration 79680/ 
173500 | consumed samples: 20398080 | consumed tokens: 41775267840 | elapsed time per iteration (s): 0.80 | learning rate: 1.230E-04 | global batch size: 256 | lm loss: 1.980338E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.755 | TFLOPs: 19.34 | 31: iteration 79690/ 173500 | consumed samples: 20400640 | consumed tokens: 41780510720 | elapsed time per iteration (s): 0.78 | learning rate: 1.230E-04 | global batch size: 256 | lm loss: 2.011778E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.799 | TFLOPs: 19.83 | 31: iteration 79700/ 173500 | consumed samples: 20403200 | consumed tokens: 41785753600 | elapsed time per iteration (s): 0.79 | learning rate: 1.230E-04 | global batch size: 256 | lm loss: 2.010024E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.463 | TFLOPs: 19.57 | 31: iteration 79710/ 173500 | consumed samples: 20405760 | consumed tokens: 41790996480 | elapsed time per iteration (s): 0.84 | learning rate: 1.230E-04 | global batch size: 256 | lm loss: 2.019139E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.938 | TFLOPs: 18.39 | 31: iteration 79720/ 173500 | consumed samples: 20408320 | consumed tokens: 41796239360 | elapsed time per iteration (s): 0.82 | learning rate: 1.230E-04 | global batch size: 256 | lm loss: 1.972184E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.783 | TFLOPs: 18.86 | 31: iteration 79730/ 173500 | consumed samples: 20410880 | consumed tokens: 41801482240 | elapsed time per iteration (s): 0.78 | learning rate: 1.229E-04 | global batch size: 256 | lm loss: 1.984313E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 329.038 | TFLOPs: 19.91 | 31: iteration 79740/ 173500 | consumed samples: 20413440 | consumed tokens: 41806725120 | elapsed time per iteration (s): 0.81 | learning rate: 1.229E-04 | global batch size: 256 | lm loss: 1.974700E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.936 | TFLOPs: 19.05 | 31: iteration 79750/ 173500 | consumed samples: 20416000 | consumed tokens: 41811968000 | elapsed time per iteration (s): 0.84 | learning rate: 1.229E-04 | global batch size: 256 | lm loss: 2.022883E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.362 | TFLOPs: 18.47 | 31: iteration 79760/ 173500 | consumed samples: 20418560 | consumed tokens: 41817210880 | elapsed time per iteration (s): 0.78 | learning rate: 1.229E-04 | global batch size: 256 | lm loss: 2.027653E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.584 | TFLOPs: 19.76 | 31: iteration 79770/ 173500 | consumed samples: 20421120 | consumed tokens: 41822453760 | elapsed time per iteration (s): 0.78 | learning rate: 1.229E-04 | global batch size: 256 | lm loss: 2.012667E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.149 | TFLOPs: 19.91 | 31: iteration 79780/ 173500 | consumed samples: 20423680 | consumed tokens: 41827696640 | elapsed time per iteration (s): 0.77 | learning rate: 1.229E-04 | global batch size: 256 | lm loss: 2.004223E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.434 | TFLOPs: 20.11 | 31: iteration 79790/ 173500 | consumed samples: 20426240 | consumed tokens: 41832939520 | elapsed time per iteration (s): 0.79 | learning rate: 
1.228E-04 | global batch size: 256 | lm loss: 1.993545E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.601 | TFLOPs: 19.52 | 31: iteration 79800/ 173500 | consumed samples: 20428800 | consumed tokens: 41838182400 | elapsed time per iteration (s): 0.79 | learning rate: 1.228E-04 | global batch size: 256 | lm loss: 2.006519E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.921 | TFLOPs: 19.54 | 31: iteration 79810/ 173500 | consumed samples: 20431360 | consumed tokens: 41843425280 | elapsed time per iteration (s): 0.82 | learning rate: 1.228E-04 | global batch size: 256 | lm loss: 1.999952E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.317 | TFLOPs: 18.83 | 31: iteration 79820/ 173500 | consumed samples: 20433920 | consumed tokens: 41848668160 | elapsed time per iteration (s): 0.85 | learning rate: 1.228E-04 | global batch size: 256 | lm loss: 2.018327E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.491 | TFLOPs: 18.18 | 31: iteration 79830/ 173500 | consumed samples: 20436480 | consumed tokens: 41853911040 | elapsed time per iteration (s): 0.81 | learning rate: 1.228E-04 | global batch size: 256 | lm loss: 2.012541E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.090 | TFLOPs: 19.18 | 31: iteration 79840/ 173500 | consumed samples: 20439040 | consumed tokens: 41859153920 | elapsed time per iteration (s): 0.81 | learning rate: 1.228E-04 | global batch size: 256 | lm loss: 1.988156E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.004 | TFLOPs: 19.24 | 31: iteration 79850/ 173500 | 
consumed samples: 20441600 | consumed tokens: 41864396800 | elapsed time per iteration (s): 0.81 | learning rate: 1.227E-04 | global batch size: 256 | lm loss: 1.970527E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.819 | TFLOPs: 19.23 | 31: iteration 79860/ 173500 | consumed samples: 20444160 | consumed tokens: 41869639680 | elapsed time per iteration (s): 0.81 | learning rate: 1.227E-04 | global batch size: 256 | lm loss: 1.968820E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.857 | TFLOPs: 19.23 | 31: iteration 79870/ 173500 | consumed samples: 20446720 | consumed tokens: 41874882560 | elapsed time per iteration (s): 0.84 | learning rate: 1.227E-04 | global batch size: 256 | lm loss: 2.006922E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.977 | TFLOPs: 18.45 | 31: iteration 79880/ 173500 | consumed samples: 20449280 | consumed tokens: 41880125440 | elapsed time per iteration (s): 0.77 | learning rate: 1.227E-04 | global batch size: 256 | lm loss: 1.994477E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.900 | TFLOPs: 20.14 | 31: iteration 79890/ 173500 | consumed samples: 20451840 | consumed tokens: 41885368320 | elapsed time per iteration (s): 0.80 | learning rate: 1.227E-04 | global batch size: 256 | lm loss: 2.051021E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.619 | TFLOPs: 19.46 | 31: iteration 79900/ 173500 | consumed samples: 20454400 | consumed tokens: 41890611200 | elapsed time per iteration (s): 0.80 | learning rate: 1.227E-04 | global batch size: 256 | lm loss: 1.985781E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 320.918 | TFLOPs: 19.41 | 31: iteration 79910/ 173500 | consumed samples: 20456960 | consumed tokens: 41895854080 | elapsed time per iteration (s): 0.81 | learning rate: 1.226E-04 | global batch size: 256 | lm loss: 1.979096E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.618 | TFLOPs: 19.09 | 31: iteration 79920/ 173500 | consumed samples: 20459520 | consumed tokens: 41901096960 | elapsed time per iteration (s): 0.83 | learning rate: 1.226E-04 | global batch size: 256 | lm loss: 2.018699E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.347 | TFLOPs: 18.59 | 31: iteration 79930/ 173500 | consumed samples: 20462080 | consumed tokens: 41906339840 | elapsed time per iteration (s): 0.77 | learning rate: 1.226E-04 | global batch size: 256 | lm loss: 1.993276E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.875 | TFLOPs: 20.20 | 31: iteration 79940/ 173500 | consumed samples: 20464640 | consumed tokens: 41911582720 | elapsed time per iteration (s): 0.74 | learning rate: 1.226E-04 | global batch size: 256 | lm loss: 1.978274E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.225 | TFLOPs: 20.82 | 31: iteration 79950/ 173500 | consumed samples: 20467200 | consumed tokens: 41916825600 | elapsed time per iteration (s): 0.73 | learning rate: 1.226E-04 | global batch size: 256 | lm loss: 1.999282E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.691 | TFLOPs: 21.28 | 31: iteration 79960/ 173500 | consumed samples: 20469760 | consumed tokens: 41922068480 | elapsed time per iteration (s): 0.74 | learning rate: 1.226E-04 | global batch 
size: 256 | lm loss: 1.991572E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.722 | TFLOPs: 20.79 | 31: iteration 79970/ 173500 | consumed samples: 20472320 | consumed tokens: 41927311360 | elapsed time per iteration (s): 0.75 | learning rate: 1.225E-04 | global batch size: 256 | lm loss: 1.991636E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.303 | TFLOPs: 20.71 | 31: iteration 79980/ 173500 | consumed samples: 20474880 | consumed tokens: 41932554240 | elapsed time per iteration (s): 0.80 | learning rate: 1.225E-04 | global batch size: 256 | lm loss: 2.010476E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.745 | TFLOPs: 19.40 | 31: iteration 79990/ 173500 | consumed samples: 20477440 | consumed tokens: 41937797120 | elapsed time per iteration (s): 0.79 | learning rate: 1.225E-04 | global batch size: 256 | lm loss: 2.025831E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.030 | TFLOPs: 19.66 | 0: [2022-11-26 12:13:46,333] [INFO] [logging.py:68:log_dist] [Rank 0] step=80000, skipped=0, lr=[0.00012249910047811783, 0.00012249910047811783, 0.00012249910047811783], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 80000/ 173500 | consumed samples: 20480000 | consumed tokens: 41943040000 | elapsed time per iteration (s): 0.72 | learning rate: 1.225E-04 | global batch size: 256 | lm loss: 2.014367E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.501 | TFLOPs: 21.45 | 0: steps: 80000 loss: 1.9916 iter time (s): 1.104 samples/sec: 231.890 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 
80000 | lm loss value: 1.955206E+00 | lm loss PPL: 7.065377E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 80000 to checkpoints_1b1long 0: [2022-11-26 12:13:46,621] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step80000 is begin to save! 0: [2022-11-26 12:13:46,632] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_01-model_00-model_states.pt... 0: [2022-11-26 12:13:46,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_01-model_00-model_states.pt. 0: [2022-11-26 12:13:46,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_03-model_00-model_states.pt... 0: [2022-11-26 12:13:46,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_03-model_00-model_states.pt. 0: [2022-11-26 12:13:46,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_04-model_00-model_states.pt... 0: [2022-11-26 12:13:47,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_04-model_00-model_states.pt. 0: [2022-11-26 12:13:47,035] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_05-model_00-model_states.pt... 0: [2022-11-26 12:13:47,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_05-model_00-model_states.pt. 0: [2022-11-26 12:13:47,115] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_06-model_00-model_states.pt... 0: [2022-11-26 12:13:47,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_06-model_00-model_states.pt. 
0: [2022-11-26 12:13:47,192] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_07-model_00-model_states.pt... 0: [2022-11-26 12:13:47,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_07-model_00-model_states.pt. 0: [2022-11-26 12:13:47,266] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_08-model_00-model_states.pt... 0: [2022-11-26 12:13:47,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_08-model_00-model_states.pt. 0: [2022-11-26 12:13:47,345] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_09-model_00-model_states.pt... 0: [2022-11-26 12:13:47,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_09-model_00-model_states.pt. 0: [2022-11-26 12:13:47,419] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_10-model_00-model_states.pt... 0: [2022-11-26 12:13:47,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_10-model_00-model_states.pt. 0: [2022-11-26 12:13:47,495] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_11-model_00-model_states.pt... 0: [2022-11-26 12:13:47,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_11-model_00-model_states.pt. 0: [2022-11-26 12:13:47,567] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_12-model_00-model_states.pt... 0: [2022-11-26 12:13:47,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_12-model_00-model_states.pt. 
0: [2022-11-26 12:13:47,644] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_13-model_00-model_states.pt... 0: [2022-11-26 12:13:47,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_13-model_00-model_states.pt. 0: [2022-11-26 12:13:47,718] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_14-model_00-model_states.pt... 0: [2022-11-26 12:13:47,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_14-model_00-model_states.pt. 0: [2022-11-26 12:13:47,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_15-model_00-model_states.pt... 0: [2022-11-26 12:13:47,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_15-model_00-model_states.pt. 0: [2022-11-26 12:13:47,869] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_16-model_00-model_states.pt... 0: [2022-11-26 12:13:47,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_16-model_00-model_states.pt. 0: [2022-11-26 12:13:47,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_17-model_00-model_states.pt... 0: [2022-11-26 12:13:48,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_17-model_00-model_states.pt. 0: [2022-11-26 12:13:48,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_18-model_00-model_states.pt... 0: [2022-11-26 12:13:48,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_18-model_00-model_states.pt. 
0: [2022-11-26 12:13:48,096] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_19-model_00-model_states.pt... 0: [2022-11-26 12:13:48,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_19-model_00-model_states.pt. 0: [2022-11-26 12:13:48,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_20-model_00-model_states.pt... 0: [2022-11-26 12:13:48,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_20-model_00-model_states.pt. 0: [2022-11-26 12:13:48,243] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_21-model_00-model_states.pt... 0: [2022-11-26 12:13:48,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_21-model_00-model_states.pt. 0: [2022-11-26 12:13:48,318] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_22-model_00-model_states.pt... 0: [2022-11-26 12:13:48,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_22-model_00-model_states.pt. 0: [2022-11-26 12:13:48,411] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_23-model_00-model_states.pt... 0: [2022-11-26 12:13:48,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_23-model_00-model_states.pt. 0: [2022-11-26 12:13:48,489] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_24-model_00-model_states.pt... 0: [2022-11-26 12:13:48,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_24-model_00-model_states.pt. 
0: [2022-11-26 12:13:48,562] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_25-model_00-model_states.pt... 0: [2022-11-26 12:13:48,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_25-model_00-model_states.pt. 0: [2022-11-26 12:13:48,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_26-model_00-model_states.pt... 0: [2022-11-26 12:13:48,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_26-model_00-model_states.pt. 0: [2022-11-26 12:13:48,708] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_27-model_00-model_states.pt... 0: [2022-11-26 12:13:48,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_27-model_00-model_states.pt. 0: [2022-11-26 12:13:48,781] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_28-model_00-model_states.pt... 0: [2022-11-26 12:13:48,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_28-model_00-model_states.pt. 0: [2022-11-26 12:13:48,855] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/layer_30-model_00-model_states.pt... 0: [2022-11-26 12:13:48,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/layer_30-model_00-model_states.pt. 0: [2022-11-26 12:13:48,859] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step80000/mp_rank_00_model_states.pt 0: [2022-11-26 12:13:48,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/mp_rank_00_model_states.pt... 
0: [2022-11-26 12:13:48,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/mp_rank_00_model_states.pt. 0: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 
7: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
10: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 
2: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 
12: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 
23: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
14: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 
29: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
26: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 
0: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
7: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 
1: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
15: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
14: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 
30: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 
26: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:13:48,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
9: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
15: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 
28: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 
17: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
4: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 
23: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 
8: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 
21: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:13:48,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:13:48,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:13:48,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 12:13:48,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 24: [2022-11-26 12:13:48,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:13:48,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 12:13:48,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 4: [2022-11-26 12:13:48,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 12:13:48,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 12:13:48,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 30: [2022-11-26 12:13:48,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:13:48,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 12:13:48,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 9: [2022-11-26 12:13:48,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:13:48,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:13:48,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 8: [2022-11-26 12:13:48,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 9: [2022-11-26 12:13:48,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 8: [2022-11-26 12:13:48,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 29: [2022-11-26 12:13:48,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-26 12:13:48,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 12:13:48,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 29: [2022-11-26 12:13:48,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:13:48,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 1: [2022-11-26 12:13:48,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:13:48,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 4: [2022-11-26 12:13:48,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:13:48,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 28: [2022-11-26 12:13:48,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:13:48,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 1: [2022-11-26 12:13:48,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
28: [2022-11-26 12:13:48,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 4: [2022-11-26 12:13:48,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 28: [2022-11-26 12:13:48,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 26: [2022-11-26 12:13:48,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:13:48,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 12:13:48,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 7: [2022-11-26 12:13:48,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:13:48,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:13:48,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:13:48,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 3: [2022-11-26 12:13:48,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 7: [2022-11-26 12:13:48,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
3: [2022-11-26 12:13:48,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 12:13:48,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 3: [2022-11-26 12:13:48,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 22: [2022-11-26 12:13:48,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:13:48,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:13:48,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 5: [2022-11-26 12:13:48,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 22: [2022-11-26 12:13:48,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 5: [2022-11-26 12:13:48,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 31: [2022-11-26 12:13:48,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:13:48,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
31: [2022-11-26 12:13:48,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 6: [2022-11-26 12:13:48,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 31: [2022-11-26 12:13:48,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 6: [2022-11-26 12:13:48,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 24: [2022-11-26 12:13:48,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:13:48,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:13:48,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 12:13:48,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 24: [2022-11-26 12:13:48,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 21: [2022-11-26 12:13:48,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:13:48,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 12:13:48,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
24: [2022-11-26 12:13:48,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 14: [2022-11-26 12:13:48,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:13:48,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 12:13:48,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 14: [2022-11-26 12:13:48,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:13:48,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 12:13:48,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 7: [2022-11-26 12:13:48,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:13:48,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:13:48,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
7: [2022-11-26 12:13:48,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 9: [2022-11-26 12:13:48,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 18: [2022-11-26 12:13:48,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 7: [2022-11-26 12:13:48,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 9: [2022-11-26 12:13:48,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 18: [2022-11-26 12:13:48,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 31: [2022-11-26 12:13:48,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:13:48,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 12:13:48,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 25: [2022-11-26 12:13:48,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:13:48,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 12:13:48,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
23: [2022-11-26 12:13:48,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:13:48,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 6: [2022-11-26 12:13:48,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:13:48,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 6: [2022-11-26 12:13:48,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 12:13:48,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 18: [2022-11-26 12:13:48,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:13:48,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:13:48,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 11: [2022-11-26 12:13:48,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 18: [2022-11-26 12:13:48,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 11: [2022-11-26 12:13:48,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
23: [2022-11-26 12:13:48,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:13:48,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 12:13:48,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 27: [2022-11-26 12:13:48,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:13:48,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 12:13:48,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 22: [2022-11-26 12:13:48,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:13:48,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 12:13:48,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 25: [2022-11-26 12:13:48,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:13:48,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 12:13:48,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
28: [2022-11-26 12:13:48,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:13:48,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:13:48,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:13:48,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 21: [2022-11-26 12:13:48,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 12: [2022-11-26 12:13:48,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 21: [2022-11-26 12:13:48,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 4: [2022-11-26 12:13:48,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:13:48,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 12:13:48,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 5: [2022-11-26 12:13:48,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 12:13:48,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 12:13:48,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 5: [2022-11-26 12:13:49,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:13:49,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 12:13:49,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 31: [2022-11-26 12:13:49,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:13:49,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 12:13:49,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 17: [2022-11-26 12:13:49,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:13:49,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
17: [2022-11-26 12:13:49,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 28: [2022-11-26 12:13:48,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 17: [2022-11-26 12:13:49,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 8: [2022-11-26 12:13:49,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 28: [2022-11-26 12:13:48,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 8: [2022-11-26 12:13:49,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 0: [2022-11-26 12:13:49,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:13:49,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:13:49,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 0: [2022-11-26 12:13:49,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 15: [2022-11-26 12:13:49,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 0: [2022-11-26 12:13:49,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
20: [2022-11-26 12:13:49,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:13:49,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:13:49,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 12:13:49,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 12:13:49,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 20: [2022-11-26 12:13:49,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 27: [2022-11-26 12:13:49,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:13:49,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 12:13:49,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 17: [2022-11-26 12:13:49,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:13:49,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 12:13:49,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
20: [2022-11-26 12:13:49,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:13:49,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 30: [2022-11-26 12:13:49,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:13:49,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 30: [2022-11-26 12:13:49,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 12:13:49,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 7: [2022-11-26 12:13:49,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:13:49,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 12:13:49,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 23: [2022-11-26 12:13:49,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:13:49,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 12:13:49,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
11: [2022-11-26 12:13:49,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:13:49,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 12:13:49,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 27: [2022-11-26 12:13:49,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:13:49,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 12:13:49,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 24: [2022-11-26 12:13:49,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:13:48,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:13:49,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:13:49,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 18: [2022-11-26 12:13:49,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
11: [2022-11-26 12:13:49,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 24: [2022-11-26 12:13:49,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 18: [2022-11-26 12:13:49,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 13: [2022-11-26 12:13:48,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 11: [2022-11-26 12:13:49,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 18: [2022-11-26 12:13:49,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 13: [2022-11-26 12:13:48,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 13: [2022-11-26 12:13:48,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:13:48,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 12:13:48,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 26: [2022-11-26 12:13:49,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
26: [2022-11-26 12:13:49,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 12:13:49,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 2: [2022-11-26 12:13:49,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:13:49,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:13:49,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 11: [2022-11-26 12:13:49,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 2: [2022-11-26 12:13:49,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 11: [2022-11-26 12:13:49,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 1: [2022-11-26 12:13:49,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:13:49,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 12:13:49,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 12: [2022-11-26 12:13:49,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 12:13:49,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 12:13:49,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 1: [2022-11-26 12:13:49,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:13:49,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 13: [2022-11-26 12:13:49,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:13:49,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 13: [2022-11-26 12:13:49,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 29: [2022-11-26 12:13:49,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:13:49,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 29: [2022-11-26 12:13:49,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 12:13:49,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 
12: [2022-11-26 12:13:49,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:13:49,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 12: [2022-11-26 12:13:49,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 29: [2022-11-26 12:13:49,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 12: [2022-11-26 12:13:49,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 29: [2022-11-26 12:13:49,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 30: [2022-11-26 12:13:49,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:13:49,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 14: [2022-11-26 12:13:49,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:13:49,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 14: [2022-11-26 12:13:49,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 12:13:49,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
9: [2022-11-26 12:13:49,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:13:49,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:13:49,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 9: [2022-11-26 12:13:49,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 28: [2022-11-26 12:13:49,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 9: [2022-11-26 12:13:49,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 6: [2022-11-26 12:13:49,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:13:49,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:13:49,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 14: [2022-11-26 12:13:49,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:13:49,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 12:13:49,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
2: [2022-11-26 12:13:49,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:13:49,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 6: [2022-11-26 12:13:49,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 12:13:49,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 2: [2022-11-26 12:13:49,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 12:13:49,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 9: [2022-11-26 12:13:49,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:13:49,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 17: [2022-11-26 12:13:49,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:13:49,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 17: [2022-11-26 12:13:49,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 12:13:49,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
13: [2022-11-26 12:13:49,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:13:49,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 3: [2022-11-26 12:13:49,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:13:49,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 22: [2022-11-26 12:13:49,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:13:49,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 12:13:49,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:13:49,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 3: [2022-11-26 12:13:49,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 3: [2022-11-26 12:13:49,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 22: [2022-11-26 12:13:49,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 3: [2022-11-26 12:13:49,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
22: [2022-11-26 12:13:49,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:13:49,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 12:13:49,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 27: [2022-11-26 12:13:49,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:13:49,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 12:13:49,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 7: [2022-11-26 12:13:49,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:13:49,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 12:13:49,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 5: [2022-11-26 12:13:49,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:13:49,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 12:13:49,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
15: [2022-11-26 12:13:49,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:13:49,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 12:13:49,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 8: [2022-11-26 12:13:49,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:13:49,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 12:13:49,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 23: [2022-11-26 12:13:49,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:13:49,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 12:13:49,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 24: [2022-11-26 12:13:49,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:13:49,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 12:13:49,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
0: [2022-11-26 12:13:49,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:13:49,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:13:49,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 0: [2022-11-26 12:13:49,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 12:13:49,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 20: [2022-11-26 12:13:49,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:13:49,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 20: [2022-11-26 12:13:49,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 12:13:49,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 31: [2022-11-26 12:13:49,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:13:49,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 12:13:49,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
25: [2022-11-26 12:13:49,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:13:49,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:13:49,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 12:13:49,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 12:13:49,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 25: [2022-11-26 12:13:49,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 0: [2022-11-26 12:13:49,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:13:49,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:13:49,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 12: [2022-11-26 12:13:49,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:13:49,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
0: [2022-11-26 12:13:49,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 12: [2022-11-26 12:13:49,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 12:13:49,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 0: [2022-11-26 12:13:49,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 4: [2022-11-26 12:13:49,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:13:49,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 12:13:49,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 30: [2022-11-26 12:13:49,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:13:49,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 12:13:49,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 15: [2022-11-26 12:13:49,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 12:13:49,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 12:13:49,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 18: [2022-11-26 12:13:49,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:13:49,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:13:49,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 26: [2022-11-26 12:13:49,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 12:13:49,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 18: [2022-11-26 12:13:49,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 17: [2022-11-26 12:13:49,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:13:49,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 12:13:49,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 21: [2022-11-26 12:13:49,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-26 12:13:49,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 12:13:49,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 15: [2022-11-26 12:13:49,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:13:49,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 12:13:49,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 28: [2022-11-26 12:13:49,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:13:49,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 12:13:49,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 16: [2022-11-26 12:13:49,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:13:49,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:13:49,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:13:49,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
16: [2022-11-26 12:13:49,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 12:13:49,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 12:13:49,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 12:13:49,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 12:13:49,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 16: [2022-11-26 12:13:49,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 16: [2022-11-26 12:13:49,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 16: [2022-11-26 12:13:49,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 10: [2022-11-26 12:13:49,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:13:49,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:13:49,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:13:49,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-26 12:13:49,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 12:13:49,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 12:13:49,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 12:13:49,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 12:13:49,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 10: [2022-11-26 12:13:49,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 10: [2022-11-26 12:13:49,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 10: [2022-11-26 12:13:49,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 19: [2022-11-26 12:13:49,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:13:49,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
19: [2022-11-26 12:13:49,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 12:13:49,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 12:13:49,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 19: [2022-11-26 12:13:49,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 9: [2022-11-26 12:13:49,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:13:49,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 12:13:49,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 1: [2022-11-26 12:13:49,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:13:49,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 31: [2022-11-26 12:13:49,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:13:49,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:13:49,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
31: [2022-11-26 12:13:49,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 12:13:49,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 26: [2022-11-26 12:13:49,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 12:13:49,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 13: [2022-11-26 12:13:49,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:13:49,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 12:13:49,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 3: [2022-11-26 12:13:49,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:13:49,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 12:13:49,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 23: [2022-11-26 12:13:49,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:13:49,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
23: [2022-11-26 12:13:49,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 14: [2022-11-26 12:13:49,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 2: [2022-11-26 12:13:49,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:13:49,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 14: [2022-11-26 12:13:49,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 2: [2022-11-26 12:13:49,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 12:13:49,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 20: [2022-11-26 12:13:49,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:13:49,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 12:13:49,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 0: [2022-11-26 12:13:49,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-26 12:13:49,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 12:13:49,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 29: [2022-11-26 12:13:49,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:13:49,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 12:13:49,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 12: [2022-11-26 12:13:49,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:13:49,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 12:13:49,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 4: [2022-11-26 12:13:49,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:13:49,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 12:13:49,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 11: [2022-11-26 12:13:49,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
18: [2022-11-26 12:13:49,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:13:49,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 24: [2022-11-26 12:13:49,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:13:49,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 16: [2022-11-26 12:13:49,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:13:49,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 18: [2022-11-26 12:13:49,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 16: [2022-11-26 12:13:49,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 12:13:49,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 24: [2022-11-26 12:13:49,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 12:13:49,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 21: [2022-11-26 12:13:49,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-26 12:13:49,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 12:13:49,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 13: [2022-11-26 12:13:49,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:13:49,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:13:49,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 30: [2022-11-26 12:13:49,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 13: [2022-11-26 12:13:49,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 30: [2022-11-26 12:13:49,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 27: [2022-11-26 12:13:49,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:13:49,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
27: [2022-11-26 12:13:49,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 8: [2022-11-26 12:13:49,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 12:13:49,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 27: [2022-11-26 12:13:49,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 17: [2022-11-26 12:13:49,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:13:49,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:13:49,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 12:13:49,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 1: [2022-11-26 12:13:49,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:13:49,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 12:13:49,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 9: [2022-11-26 12:13:49,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
29: [2022-11-26 12:13:49,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:13:49,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:13:49,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 29: [2022-11-26 12:13:49,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 5: [2022-11-26 12:13:49,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 9: [2022-11-26 12:13:49,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 29: [2022-11-26 12:13:49,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 5: [2022-11-26 12:13:49,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 7: [2022-11-26 12:13:49,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:13:49,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:13:49,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 12:13:49,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
8: [2022-11-26 12:13:49,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:13:49,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 12:13:49,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 8: [2022-11-26 12:13:49,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 12:13:49,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 23: [2022-11-26 12:13:49,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:13:49,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 12:13:49,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 16: [2022-11-26 12:13:49,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:13:49,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:13:49,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
16: [2022-11-26 12:13:49,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 11: [2022-11-26 12:13:49,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 16: [2022-11-26 12:13:49,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 11: [2022-11-26 12:13:49,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 17: [2022-11-26 12:13:49,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 12:13:49,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 28: [2022-11-26 12:13:49,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 12:13:49,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 22: [2022-11-26 12:13:49,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:13:49,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 12:13:49,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 3: [2022-11-26 12:13:49,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 12:13:49,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 12:13:49,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 20: [2022-11-26 12:13:49,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:13:49,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 12:13:49,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 22: [2022-11-26 12:13:49,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:13:49,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 0: [2022-11-26 12:13:49,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:13:49,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 0: [2022-11-26 12:13:49,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 12:13:49,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 18: [2022-11-26 12:13:49,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
18: [2022-11-26 12:13:49,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 12:13:49,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 14: [2022-11-26 12:13:49,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:13:49,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 12:13:49,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 4: [2022-11-26 12:13:49,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:13:49,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 12:13:49,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 23: [2022-11-26 12:13:49,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:13:49,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 12:13:49,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 28: [2022-11-26 12:13:49,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
27: [2022-11-26 12:13:49,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:13:49,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 12:13:49,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 25: [2022-11-26 12:13:49,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:13:49,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 5: [2022-11-26 12:13:49,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:13:49,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 14: [2022-11-26 12:13:49,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:13:49,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 14: [2022-11-26 12:13:49,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 5: [2022-11-26 12:13:49,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 14: [2022-11-26 12:13:49,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
31: [2022-11-26 12:13:49,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:13:49,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:13:49,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 7: [2022-11-26 12:13:49,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 31: [2022-11-26 12:13:49,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 7: [2022-11-26 12:13:49,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 10: [2022-11-26 12:13:49,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:13:49,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 12:13:49,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 21: [2022-11-26 12:13:49,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:13:49,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 12:13:49,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
19: [2022-11-26 12:13:49,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:13:49,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 6: [2022-11-26 12:13:49,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:13:49,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 6: [2022-11-26 12:13:49,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:13:49,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 12:13:49,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 12:13:49,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 6: [2022-11-26 12:13:49,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 11: [2022-11-26 12:13:49,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:13:49,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 12:13:49,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
30: [2022-11-26 12:13:49,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:13:49,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:13:49,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 30: [2022-11-26 12:13:49,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 12:13:49,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 31: [2022-11-26 12:13:49,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 16: [2022-11-26 12:13:49,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:13:49,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 12:13:49,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 24: [2022-11-26 12:13:49,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:13:49,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:13:49,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
24: [2022-11-26 12:13:49,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 12:13:49,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 12:13:49,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 24: [2022-11-26 12:13:49,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 26: [2022-11-26 12:13:49,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 12:13:49,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 5: [2022-11-26 12:13:49,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:13:49,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 12:13:49,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 0: [2022-11-26 12:13:49,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:13:49,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 12:13:49,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 12:13:49,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 18: [2022-11-26 12:13:49,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:13:49,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 12:13:49,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 2: [2022-11-26 12:13:49,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:13:49,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:13:49,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 2: [2022-11-26 12:13:49,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 30: [2022-11-26 12:13:49,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 2: [2022-11-26 12:13:49,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 12: [2022-11-26 12:13:49,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 12:13:49,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:13:49,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 12:13:49,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 12:13:49,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 12: [2022-11-26 12:13:49,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 1: [2022-11-26 12:13:49,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:13:49,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 9: [2022-11-26 12:13:49,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:13:49,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 1: [2022-11-26 12:13:49,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 9: [2022-11-26 12:13:49,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 20: [2022-11-26 12:13:49,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
20: [2022-11-26 12:13:49,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 29: [2022-11-26 12:13:49,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:13:49,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 29: [2022-11-26 12:13:49,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 12:13:49,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 10: [2022-11-26 12:13:49,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:13:49,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 12:13:49,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 27: [2022-11-26 12:13:49,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:13:49,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:13:49,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 12:13:49,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
3: [2022-11-26 12:13:49,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 12:13:49,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 21: [2022-11-26 12:13:49,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:13:49,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 12:13:49,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 28: [2022-11-26 12:13:49,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 12:13:49,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 15: [2022-11-26 12:13:49,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:13:49,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:13:49,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 28: [2022-11-26 12:13:49,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 15: [2022-11-26 12:13:49,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
28: [2022-11-26 12:13:49,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 8: [2022-11-26 12:13:49,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:13:49,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 12:13:49,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 2: [2022-11-26 12:13:49,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:13:49,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 12:13:49,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 25: [2022-11-26 12:13:49,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:13:49,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 12:13:49,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 9: [2022-11-26 12:13:49,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 12:13:49,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 12:13:49,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 31: [2022-11-26 12:13:49,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:13:49,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:13:49,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 31: [2022-11-26 12:13:49,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 1: [2022-11-26 12:13:49,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 31: [2022-11-26 12:13:49,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 6: [2022-11-26 12:13:49,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:13:49,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 12:13:49,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 22: [2022-11-26 12:13:49,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
22: [2022-11-26 12:13:49,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 12:13:49,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 16: [2022-11-26 12:13:49,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:13:49,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 12:13:49,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 7: [2022-11-26 12:13:49,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:13:49,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 12:13:49,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 23: [2022-11-26 12:13:49,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:13:49,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 12:13:49,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 15: [2022-11-26 12:13:49,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 12:13:49,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 12:13:49,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 24: [2022-11-26 12:13:49,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:13:49,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:13:49,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 27: [2022-11-26 12:13:49,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 8: [2022-11-26 12:13:49,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:13:49,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 27: [2022-11-26 12:13:49,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 8: [2022-11-26 12:13:49,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 12:13:49,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 8: [2022-11-26 12:13:49,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 12:13:49,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 5: [2022-11-26 12:13:49,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:13:49,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 5: [2022-11-26 12:13:49,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 12:13:49,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 4: [2022-11-26 12:13:49,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:13:49,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 12:13:49,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 19: [2022-11-26 12:13:49,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:13:49,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 12:13:49,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 17: [2022-11-26 12:13:49,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 
17: [2022-11-26 12:13:49,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 3: [2022-11-26 12:13:49,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:13:49,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 3: [2022-11-26 12:13:49,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 12:13:49,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 14: [2022-11-26 12:13:49,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:13:49,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 12:13:49,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 0: [2022-11-26 12:13:49,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:13:49,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-26 12:13:49,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 0: [2022-11-26 12:13:49,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 12: [2022-11-26 12:13:49,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 0: [2022-11-26 12:13:49,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 19: [2022-11-26 12:13:49,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:13:49,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 12:13:49,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 21: [2022-11-26 12:13:49,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:13:49,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 12:13:49,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 19: [2022-11-26 12:13:49,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-26 12:13:49,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 12:13:49,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 6: [2022-11-26 12:13:49,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:13:49,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 12:13:49,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 20: [2022-11-26 12:13:49,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:13:49,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 12:13:49,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 25: [2022-11-26 12:13:49,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:13:49,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:13:49,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 12:13:49,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
25: [2022-11-26 12:13:49,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 26: [2022-11-26 12:13:49,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:13:49,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 30: [2022-11-26 12:13:49,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:13:49,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 30: [2022-11-26 12:13:49,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 12:13:49,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 26: [2022-11-26 12:13:49,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 15: [2022-11-26 12:13:49,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:13:49,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 12:13:49,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 29: [2022-11-26 12:13:49,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
29: [2022-11-26 12:13:49,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 12:13:49,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 2: [2022-11-26 12:13:49,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:13:49,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:13:49,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 2: [2022-11-26 12:13:49,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 22: [2022-11-26 12:13:49,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 2: [2022-11-26 12:13:49,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 28: [2022-11-26 12:13:49,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:13:49,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 12:13:49,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 15: [2022-11-26 12:13:49,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:13:49,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 15: [2022-11-26 12:13:49,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 12:13:49,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 10: [2022-11-26 12:13:49,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:13:49,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 12:13:49,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 2: [2022-11-26 12:13:49,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:13:49,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 12:13:49,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
0: [2022-11-26 12:13:49,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 12:13:49,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 18: [2022-11-26 12:13:49,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:13:49,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 12:13:49,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 7: [2022-11-26 12:13:49,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:13:49,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:13:49,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 7: [2022-11-26 12:13:49,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 19: [2022-11-26 12:13:49,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 7: [2022-11-26 12:13:49,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 2: [2022-11-26 12:13:49,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-26 12:13:49,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 12:13:49,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 28: [2022-11-26 12:13:49,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 12:13:49,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 19: [2022-11-26 12:13:49,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:13:49,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 12:13:49,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 17: [2022-11-26 12:13:49,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:13:49,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 12:13:49,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 13: [2022-11-26 12:13:49,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 12:13:49,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 12:13:49,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 11: [2022-11-26 12:13:49,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:13:49,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 12:13:49,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 13: [2022-11-26 12:13:49,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:13:49,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step80000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 12:13:49,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
0: successfully saved checkpoint at iteration 80000 to checkpoints_1b1long
31: time (ms) | save-checkpoint: 2705.24
31: iteration 80010/ 173500 | consumed samples: 20482560 | consumed tokens: 41948282880 | elapsed time per iteration (s): 1.03 | learning rate: 1.225E-04 | global batch size: 256 | lm loss: 2.005979E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.439 | TFLOPs: 15.03 |
31: iteration 80020/ 173500 | consumed samples: 20485120 | consumed tokens: 41953525760 | elapsed time per iteration (s): 0.72 | learning rate: 1.225E-04 | global batch size: 256 | lm loss: 1.993343E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.920 | TFLOPs: 21.65 |
31: iteration 80030/ 173500 | consumed samples: 20487680 | consumed tokens: 41958768640 | elapsed time per iteration (s): 0.87 | learning rate: 1.225E-04 | global batch size: 256 | lm loss: 2.014580E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.530 | TFLOPs: 17.82 |
31: iteration 80040/ 173500 | consumed samples: 20490240 | consumed tokens: 41964011520 | elapsed time per iteration (s): 0.81 | learning rate: 1.224E-04 | global batch size: 256 | lm loss: 2.018399E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.341 | TFLOPs: 19.02 |
31: iteration 80050/ 173500 | consumed samples: 20492800 | consumed tokens: 41969254400 | elapsed time per iteration (s): 0.82 | learning rate: 1.224E-04 | global batch size: 256 | lm loss: 2.011036E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.430 | TFLOPs: 18.90 |
31: iteration 80060/ 173500 | consumed samples: 20495360 | consumed tokens: 41974497280 | elapsed time per iteration (s): 0.86 | learning rate: 1.224E-04 | global batch size: 256 | lm loss: 1.993427E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.756 | TFLOPs: 18.01 |
31: iteration 80070/ 173500 | consumed samples: 20497920 | consumed tokens: 41979740160 | elapsed time per iteration (s): 0.82 | learning rate: 1.224E-04 | global batch size: 256 | lm loss: 2.006026E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.291 | TFLOPs: 18.95 |
31: iteration 80080/ 173500 | consumed samples: 20500480 | consumed tokens: 41984983040 | elapsed time per iteration (s): 0.79 | learning rate: 1.224E-04 | global batch size: 256 | lm loss: 2.007131E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.657 | TFLOPs: 19.52 |
31: iteration 80090/ 173500 | consumed samples: 20503040 | consumed tokens: 41990225920 | elapsed time per iteration (s): 0.81 | learning rate: 1.224E-04 | global batch size: 256 | lm loss: 2.002138E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.298 | TFLOPs: 19.07 |
31: iteration 80100/ 173500 | consumed samples: 20505600 | consumed tokens: 41995468800 | elapsed time per iteration (s): 0.79 | learning rate: 1.223E-04 | global batch size: 256 | lm loss: 2.003539E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.835 | TFLOPs: 19.59 |
31: iteration 80110/ 173500 | consumed samples: 20508160 | consumed tokens: 42000711680 | elapsed time per iteration (s): 0.81 | learning rate: 1.223E-04 | global batch size: 256 | lm loss: 2.005943E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.316 | TFLOPs: 19.02 |
31: iteration 80120/ 173500 | consumed samples: 20510720 | consumed tokens: 42005954560 | elapsed time per iteration (s): 0.80 | learning rate: 1.223E-04 | global batch size: 256 | lm loss: 2.013960E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.122 | TFLOPs: 19.31 |
31: iteration 80130/ 173500 | consumed samples: 20513280 | consumed tokens: 42011197440 | elapsed time per iteration (s): 0.83 | learning rate: 1.223E-04 | global batch size: 256 | lm loss: 2.025382E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.024 | TFLOPs: 18.63 |
31: iteration 80140/ 173500 | consumed samples: 20515840 | consumed tokens: 42016440320 | elapsed time per iteration (s): 0.79 | learning rate: 1.223E-04 | global batch size: 256 | lm loss: 1.979951E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.729 | TFLOPs: 19.58 |
31: iteration 80150/ 173500 | consumed samples: 20518400 | consumed tokens: 42021683200 | elapsed time per iteration (s): 0.79 | learning rate: 1.223E-04 | global batch size: 256 | lm loss: 1.989015E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.780 | TFLOPs: 19.71 |
31: iteration 80160/ 173500 | consumed samples: 20520960 | consumed tokens: 42026926080 | elapsed time per iteration (s): 0.77 | learning rate: 1.222E-04 | global batch size: 256 | lm loss: 2.010329E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.537 | TFLOPs: 20.12 |
31: iteration 80170/ 173500 | consumed samples: 20523520 | consumed tokens: 42032168960 | elapsed time per iteration (s): 0.79 | learning rate: 1.222E-04 | global batch size: 256 | lm loss: 2.020309E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 322.884 | TFLOPs: 19.53 | 31: iteration 80180/ 173500 | consumed samples: 20526080 | consumed tokens: 42037411840 | elapsed time per iteration (s): 0.71 | learning rate: 1.222E-04 | global batch size: 256 | lm loss: 1.989105E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.313 | TFLOPs: 21.68 | 31: iteration 80190/ 173500 | consumed samples: 20528640 | consumed tokens: 42042654720 | elapsed time per iteration (s): 0.80 | learning rate: 1.222E-04 | global batch size: 256 | lm loss: 1.984314E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.425 | TFLOPs: 19.32 | 31: iteration 80200/ 173500 | consumed samples: 20531200 | consumed tokens: 42047897600 | elapsed time per iteration (s): 0.77 | learning rate: 1.222E-04 | global batch size: 256 | lm loss: 2.011231E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.385 | TFLOPs: 20.17 | 31: iteration 80210/ 173500 | consumed samples: 20533760 | consumed tokens: 42053140480 | elapsed time per iteration (s): 0.75 | learning rate: 1.222E-04 | global batch size: 256 | lm loss: 1.991397E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.870 | TFLOPs: 20.68 | 31: iteration 80220/ 173500 | consumed samples: 20536320 | consumed tokens: 42058383360 | elapsed time per iteration (s): 0.76 | learning rate: 1.221E-04 | global batch size: 256 | lm loss: 2.007565E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.654 | TFLOPs: 20.37 | 31: iteration 80230/ 173500 | consumed samples: 20538880 | consumed tokens: 42063626240 | elapsed time per iteration (s): 0.78 | learning rate: 
1.221E-04 | global batch size: 256 | lm loss: 2.001927E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.778 | TFLOPs: 19.77 | 31: iteration 80240/ 173500 | consumed samples: 20541440 | consumed tokens: 42068869120 | elapsed time per iteration (s): 0.77 | learning rate: 1.221E-04 | global batch size: 256 | lm loss: 1.977445E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.615 | TFLOPs: 20.06 | 31: iteration 80250/ 173500 | consumed samples: 20544000 | consumed tokens: 42074112000 | elapsed time per iteration (s): 0.77 | learning rate: 1.221E-04 | global batch size: 256 | lm loss: 2.008425E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.102 | TFLOPs: 20.15 | 31: iteration 80260/ 173500 | consumed samples: 20546560 | consumed tokens: 42079354880 | elapsed time per iteration (s): 0.73 | learning rate: 1.221E-04 | global batch size: 256 | lm loss: 2.023793E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.457 | TFLOPs: 21.14 | 31: iteration 80270/ 173500 | consumed samples: 20549120 | consumed tokens: 42084597760 | elapsed time per iteration (s): 0.76 | learning rate: 1.221E-04 | global batch size: 256 | lm loss: 1.988533E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.853 | TFLOPs: 20.44 | 31: iteration 80280/ 173500 | consumed samples: 20551680 | consumed tokens: 42089840640 | elapsed time per iteration (s): 0.79 | learning rate: 1.220E-04 | global batch size: 256 | lm loss: 2.005983E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.084 | TFLOPs: 19.73 | 31: iteration 80290/ 173500 | 
consumed samples: 20554240 | consumed tokens: 42095083520 | elapsed time per iteration (s): 0.79 | learning rate: 1.220E-04 | global batch size: 256 | lm loss: 2.002134E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.378 | TFLOPs: 19.50 | 31: iteration 80300/ 173500 | consumed samples: 20556800 | consumed tokens: 42100326400 | elapsed time per iteration (s): 0.77 | learning rate: 1.220E-04 | global batch size: 256 | lm loss: 2.029710E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.962 | TFLOPs: 20.02 | 31: iteration 80310/ 173500 | consumed samples: 20559360 | consumed tokens: 42105569280 | elapsed time per iteration (s): 0.75 | learning rate: 1.220E-04 | global batch size: 256 | lm loss: 1.994291E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.240 | TFLOPs: 20.64 | 31: iteration 80320/ 173500 | consumed samples: 20561920 | consumed tokens: 42110812160 | elapsed time per iteration (s): 0.73 | learning rate: 1.220E-04 | global batch size: 256 | lm loss: 1.996288E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.528 | TFLOPs: 21.15 | 31: iteration 80330/ 173500 | consumed samples: 20564480 | consumed tokens: 42116055040 | elapsed time per iteration (s): 0.74 | learning rate: 1.220E-04 | global batch size: 256 | lm loss: 1.976692E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.844 | TFLOPs: 20.80 | 31: iteration 80340/ 173500 | consumed samples: 20567040 | consumed tokens: 42121297920 | elapsed time per iteration (s): 0.74 | learning rate: 1.219E-04 | global batch size: 256 | lm loss: 2.004918E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 344.881 | TFLOPs: 20.86 | 31: iteration 80350/ 173500 | consumed samples: 20569600 | consumed tokens: 42126540800 | elapsed time per iteration (s): 0.73 | learning rate: 1.219E-04 | global batch size: 256 | lm loss: 2.015921E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.319 | TFLOPs: 21.25 | 31: iteration 80360/ 173500 | consumed samples: 20572160 | consumed tokens: 42131783680 | elapsed time per iteration (s): 0.77 | learning rate: 1.219E-04 | global batch size: 256 | lm loss: 2.010492E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.024 | TFLOPs: 20.03 | 31: iteration 80370/ 173500 | consumed samples: 20574720 | consumed tokens: 42137026560 | elapsed time per iteration (s): 0.77 | learning rate: 1.219E-04 | global batch size: 256 | lm loss: 2.001896E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.488 | TFLOPs: 20.24 | 31: iteration 80380/ 173500 | consumed samples: 20577280 | consumed tokens: 42142269440 | elapsed time per iteration (s): 0.78 | learning rate: 1.219E-04 | global batch size: 256 | lm loss: 2.006767E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.756 | TFLOPs: 19.77 | 31: iteration 80390/ 173500 | consumed samples: 20579840 | consumed tokens: 42147512320 | elapsed time per iteration (s): 0.85 | learning rate: 1.219E-04 | global batch size: 256 | lm loss: 2.004877E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.200 | TFLOPs: 18.16 | 31: iteration 80400/ 173500 | consumed samples: 20582400 | consumed tokens: 42152755200 | elapsed time per iteration (s): 0.80 | learning rate: 1.218E-04 | global batch 
size: 256 | lm loss: 1.978974E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.645 | TFLOPs: 19.46 | 31: iteration 80410/ 173500 | consumed samples: 20584960 | consumed tokens: 42157998080 | elapsed time per iteration (s): 0.79 | learning rate: 1.218E-04 | global batch size: 256 | lm loss: 2.004339E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.827 | TFLOPs: 19.71 | 31: iteration 80420/ 173500 | consumed samples: 20587520 | consumed tokens: 42163240960 | elapsed time per iteration (s): 0.80 | learning rate: 1.218E-04 | global batch size: 256 | lm loss: 2.006861E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.633 | TFLOPs: 19.34 | 31: iteration 80430/ 173500 | consumed samples: 20590080 | consumed tokens: 42168483840 | elapsed time per iteration (s): 0.81 | learning rate: 1.218E-04 | global batch size: 256 | lm loss: 2.026390E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.285 | TFLOPs: 19.13 | 31: iteration 80440/ 173500 | consumed samples: 20592640 | consumed tokens: 42173726720 | elapsed time per iteration (s): 0.81 | learning rate: 1.218E-04 | global batch size: 256 | lm loss: 1.997888E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.052 | TFLOPs: 19.06 | 31: iteration 80450/ 173500 | consumed samples: 20595200 | consumed tokens: 42178969600 | elapsed time per iteration (s): 0.72 | learning rate: 1.218E-04 | global batch size: 256 | lm loss: 1.978545E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.816 | TFLOPs: 21.59 | 31: iteration 80460/ 173500 | consumed samples: 20597760 | 
consumed tokens: 42184212480 | elapsed time per iteration (s): 0.79 | learning rate: 1.217E-04 | global batch size: 256 | lm loss: 1.991125E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.189 | TFLOPs: 19.67 | 31: iteration 80470/ 173500 | consumed samples: 20600320 | consumed tokens: 42189455360 | elapsed time per iteration (s): 0.77 | learning rate: 1.217E-04 | global batch size: 256 | lm loss: 1.996641E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.638 | TFLOPs: 20.12 | 31: iteration 80480/ 173500 | consumed samples: 20602880 | consumed tokens: 42194698240 | elapsed time per iteration (s): 0.73 | learning rate: 1.217E-04 | global batch size: 256 | lm loss: 1.985323E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.025 | TFLOPs: 21.30 | 31: iteration 80490/ 173500 | consumed samples: 20605440 | consumed tokens: 42199941120 | elapsed time per iteration (s): 0.77 | learning rate: 1.217E-04 | global batch size: 256 | lm loss: 1.983501E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.003 | TFLOPs: 20.02 | 31: iteration 80500/ 173500 | consumed samples: 20608000 | consumed tokens: 42205184000 | elapsed time per iteration (s): 0.80 | learning rate: 1.217E-04 | global batch size: 256 | lm loss: 1.990585E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.640 | TFLOPs: 19.28 | 31: iteration 80510/ 173500 | consumed samples: 20610560 | consumed tokens: 42210426880 | elapsed time per iteration (s): 0.81 | learning rate: 1.217E-04 | global batch size: 256 | lm loss: 2.012908E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 315.371 | TFLOPs: 19.08 | 31: iteration 80520/ 173500 | consumed samples: 20613120 | consumed tokens: 42215669760 | elapsed time per iteration (s): 0.83 | learning rate: 1.217E-04 | global batch size: 256 | lm loss: 2.022594E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.977 | TFLOPs: 18.75 | 31: iteration 80530/ 173500 | consumed samples: 20615680 | consumed tokens: 42220912640 | elapsed time per iteration (s): 0.78 | learning rate: 1.216E-04 | global batch size: 256 | lm loss: 1.986620E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.623 | TFLOPs: 19.88 | 31: iteration 80540/ 173500 | consumed samples: 20618240 | consumed tokens: 42226155520 | elapsed time per iteration (s): 0.81 | learning rate: 1.216E-04 | global batch size: 256 | lm loss: 2.006007E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.238 | TFLOPs: 19.13 | 31: iteration 80550/ 173500 | consumed samples: 20620800 | consumed tokens: 42231398400 | elapsed time per iteration (s): 0.82 | learning rate: 1.216E-04 | global batch size: 256 | lm loss: 2.019568E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.147 | TFLOPs: 18.82 | 31: iteration 80560/ 173500 | consumed samples: 20623360 | consumed tokens: 42236641280 | elapsed time per iteration (s): 0.81 | learning rate: 1.216E-04 | global batch size: 256 | lm loss: 2.007999E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.804 | TFLOPs: 19.23 | 31: iteration 80570/ 173500 | consumed samples: 20625920 | consumed tokens: 42241884160 | elapsed time per iteration (s): 0.79 | learning rate: 1.216E-04 | global batch size: 256 | lm loss: 
2.025046E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.135 | TFLOPs: 19.55 | 31: iteration 80580/ 173500 | consumed samples: 20628480 | consumed tokens: 42247127040 | elapsed time per iteration (s): 0.80 | learning rate: 1.216E-04 | global batch size: 256 | lm loss: 2.014301E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.064 | TFLOPs: 19.36 | 31: iteration 80590/ 173500 | consumed samples: 20631040 | consumed tokens: 42252369920 | elapsed time per iteration (s): 0.76 | learning rate: 1.215E-04 | global batch size: 256 | lm loss: 2.020576E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.567 | TFLOPs: 20.42 | 31: iteration 80600/ 173500 | consumed samples: 20633600 | consumed tokens: 42257612800 | elapsed time per iteration (s): 0.90 | learning rate: 1.215E-04 | global batch size: 256 | lm loss: 2.017309E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.525 | TFLOPs: 17.21 | 31: iteration 80610/ 173500 | consumed samples: 20636160 | consumed tokens: 42262855680 | elapsed time per iteration (s): 0.79 | learning rate: 1.215E-04 | global batch size: 256 | lm loss: 1.992645E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.630 | TFLOPs: 19.52 | 31: iteration 80620/ 173500 | consumed samples: 20638720 | consumed tokens: 42268098560 | elapsed time per iteration (s): 0.76 | learning rate: 1.215E-04 | global batch size: 256 | lm loss: 2.006188E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.903 | TFLOPs: 20.44 | 31: iteration 80630/ 173500 | consumed samples: 20641280 | consumed tokens: 
42273341440 | elapsed time per iteration (s): 0.79 | learning rate: 1.215E-04 | global batch size: 256 | lm loss: 2.006567E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.295 | TFLOPs: 19.50 | 31: iteration 80640/ 173500 | consumed samples: 20643840 | consumed tokens: 42278584320 | elapsed time per iteration (s): 0.77 | learning rate: 1.215E-04 | global batch size: 256 | lm loss: 2.015804E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.432 | TFLOPs: 20.17 | 31: iteration 80650/ 173500 | consumed samples: 20646400 | consumed tokens: 42283827200 | elapsed time per iteration (s): 0.78 | learning rate: 1.214E-04 | global batch size: 256 | lm loss: 2.015835E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.051 | TFLOPs: 19.79 | 31: iteration 80660/ 173500 | consumed samples: 20648960 | consumed tokens: 42289070080 | elapsed time per iteration (s): 0.82 | learning rate: 1.214E-04 | global batch size: 256 | lm loss: 2.006967E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.543 | TFLOPs: 18.91 | 31: iteration 80670/ 173500 | consumed samples: 20651520 | consumed tokens: 42294312960 | elapsed time per iteration (s): 0.80 | learning rate: 1.214E-04 | global batch size: 256 | lm loss: 2.016814E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.376 | TFLOPs: 19.44 | 31: iteration 80680/ 173500 | consumed samples: 20654080 | consumed tokens: 42299555840 | elapsed time per iteration (s): 0.82 | learning rate: 1.214E-04 | global batch size: 256 | lm loss: 1.997999E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 312.969 | TFLOPs: 18.93 | 31: iteration 80690/ 173500 | consumed samples: 20656640 | consumed tokens: 42304798720 | elapsed time per iteration (s): 0.81 | learning rate: 1.214E-04 | global batch size: 256 | lm loss: 2.039317E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.503 | TFLOPs: 19.21 | 31: iteration 80700/ 173500 | consumed samples: 20659200 | consumed tokens: 42310041600 | elapsed time per iteration (s): 0.82 | learning rate: 1.214E-04 | global batch size: 256 | lm loss: 2.023944E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.626 | TFLOPs: 18.79 | 31: iteration 80710/ 173500 | consumed samples: 20661760 | consumed tokens: 42315284480 | elapsed time per iteration (s): 0.80 | learning rate: 1.213E-04 | global batch size: 256 | lm loss: 2.022077E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.926 | TFLOPs: 19.29 | 31: iteration 80720/ 173500 | consumed samples: 20664320 | consumed tokens: 42320527360 | elapsed time per iteration (s): 0.83 | learning rate: 1.213E-04 | global batch size: 256 | lm loss: 2.003205E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.891 | TFLOPs: 18.63 | 31: iteration 80730/ 173500 | consumed samples: 20666880 | consumed tokens: 42325770240 | elapsed time per iteration (s): 0.85 | learning rate: 1.213E-04 | global batch size: 256 | lm loss: 2.026114E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.848 | TFLOPs: 18.26 | 31: iteration 80740/ 173500 | consumed samples: 20669440 | consumed tokens: 42331013120 | elapsed time per iteration (s): 0.85 | learning rate: 1.213E-04 | global batch size: 256 | lm loss: 2.010007E+00 | grad 
norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.843 | TFLOPs: 18.32 | 31: iteration 80750/ 173500 | consumed samples: 20672000 | consumed tokens: 42336256000 | elapsed time per iteration (s): 0.85 | learning rate: 1.213E-04 | global batch size: 256 | lm loss: 2.011229E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.854 | TFLOPs: 18.14 | 31: iteration 80760/ 173500 | consumed samples: 20674560 | consumed tokens: 42341498880 | elapsed time per iteration (s): 0.83 | learning rate: 1.213E-04 | global batch size: 256 | lm loss: 1.998891E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.634 | TFLOPs: 18.55 | 31: iteration 80770/ 173500 | consumed samples: 20677120 | consumed tokens: 42346741760 | elapsed time per iteration (s): 0.87 | learning rate: 1.212E-04 | global batch size: 256 | lm loss: 2.038580E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.873 | TFLOPs: 17.90 | 31: iteration 80780/ 173500 | consumed samples: 20679680 | consumed tokens: 42351984640 | elapsed time per iteration (s): 0.84 | learning rate: 1.212E-04 | global batch size: 256 | lm loss: 2.008795E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.866 | TFLOPs: 18.44 | 31: iteration 80790/ 173500 | consumed samples: 20682240 | consumed tokens: 42357227520 | elapsed time per iteration (s): 0.84 | learning rate: 1.212E-04 | global batch size: 256 | lm loss: 1.996507E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.449 | TFLOPs: 18.36 | 31: iteration 80800/ 173500 | consumed samples: 20684800 | consumed tokens: 42362470400 | elapsed time 
per iteration (s): 0.81 | learning rate: 1.212E-04 | global batch size: 256 | lm loss: 1.965640E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.567 | TFLOPs: 19.21 | 31: iteration 80810/ 173500 | consumed samples: 20687360 | consumed tokens: 42367713280 | elapsed time per iteration (s): 0.80 | learning rate: 1.212E-04 | global batch size: 256 | lm loss: 1.998417E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.419 | TFLOPs: 19.32 | 31: iteration 80820/ 173500 | consumed samples: 20689920 | consumed tokens: 42372956160 | elapsed time per iteration (s): 0.80 | learning rate: 1.212E-04 | global batch size: 256 | lm loss: 1.983444E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.326 | TFLOPs: 19.32 | 31: iteration 80830/ 173500 | consumed samples: 20692480 | consumed tokens: 42378199040 | elapsed time per iteration (s): 0.81 | learning rate: 1.211E-04 | global batch size: 256 | lm loss: 1.988248E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.841 | TFLOPs: 19.17 | 31: iteration 80840/ 173500 | consumed samples: 20695040 | consumed tokens: 42383441920 | elapsed time per iteration (s): 0.84 | learning rate: 1.211E-04 | global batch size: 256 | lm loss: 1.982829E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.315 | TFLOPs: 18.35 | 31: iteration 80850/ 173500 | consumed samples: 20697600 | consumed tokens: 42388684800 | elapsed time per iteration (s): 0.87 | learning rate: 1.211E-04 | global batch size: 256 | lm loss: 2.027726E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.481 | TFLOPs: 
17.82 | 31: iteration 80860/ 173500 | consumed samples: 20700160 | consumed tokens: 42393927680 | elapsed time per iteration (s): 0.83 | learning rate: 1.211E-04 | global batch size: 256 | lm loss: 2.005070E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.329 | TFLOPs: 18.59 | 31: iteration 80870/ 173500 | consumed samples: 20702720 | consumed tokens: 42399170560 | elapsed time per iteration (s): 0.82 | learning rate: 1.211E-04 | global batch size: 256 | lm loss: 2.028049E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.686 | TFLOPs: 18.98 | 31: iteration 80880/ 173500 | consumed samples: 20705280 | consumed tokens: 42404413440 | elapsed time per iteration (s): 0.85 | learning rate: 1.211E-04 | global batch size: 256 | lm loss: 1.991220E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.068 | TFLOPs: 18.27 | 31: iteration 80890/ 173500 | consumed samples: 20707840 | consumed tokens: 42409656320 | elapsed time per iteration (s): 0.79 | learning rate: 1.210E-04 | global batch size: 256 | lm loss: 2.017230E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.318 | TFLOPs: 19.56 | 31: iteration 80900/ 173500 | consumed samples: 20710400 | consumed tokens: 42414899200 | elapsed time per iteration (s): 0.82 | learning rate: 1.210E-04 | global batch size: 256 | lm loss: 1.977572E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.704 | TFLOPs: 18.80 | 31: iteration 80910/ 173500 | consumed samples: 20712960 | consumed tokens: 42420142080 | elapsed time per iteration (s): 0.78 | learning rate: 1.210E-04 | global batch size: 256 | lm loss: 1.995720E+00 | grad norm: 0.138 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.588 | TFLOPs: 19.94 | 31: iteration 80920/ 173500 | consumed samples: 20715520 | consumed tokens: 42425384960 | elapsed time per iteration (s): 0.84 | learning rate: 1.210E-04 | global batch size: 256 | lm loss: 2.022630E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.988 | TFLOPs: 18.51 | 31: iteration 80930/ 173500 | consumed samples: 20718080 | consumed tokens: 42430627840 | elapsed time per iteration (s): 0.90 | learning rate: 1.210E-04 | global batch size: 256 | lm loss: 2.012586E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 283.974 | TFLOPs: 17.18 | 31: iteration 80940/ 173500 | consumed samples: 20720640 | consumed tokens: 42435870720 | elapsed time per iteration (s): 0.80 | learning rate: 1.210E-04 | global batch size: 256 | lm loss: 2.008519E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.224 | TFLOPs: 19.31 | 31: iteration 80950/ 173500 | consumed samples: 20723200 | consumed tokens: 42441113600 | elapsed time per iteration (s): 0.81 | learning rate: 1.209E-04 | global batch size: 256 | lm loss: 1.997342E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.173 | TFLOPs: 19.19 | 31: iteration 80960/ 173500 | consumed samples: 20725760 | consumed tokens: 42446356480 | elapsed time per iteration (s): 0.79 | learning rate: 1.209E-04 | global batch size: 256 | lm loss: 1.996721E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.284 | TFLOPs: 19.50 | 31: iteration 80970/ 173500 | consumed samples: 20728320 | consumed tokens: 42451599360 | elapsed time per iteration (s): 0.81 | 
learning rate: 1.209E-04 | global batch size: 256 | lm loss: 2.000367E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.365 | TFLOPs: 19.08 |
31: iteration 80980/ 173500 | consumed samples: 20730880 | consumed tokens: 42456842240 | elapsed time per iteration (s): 0.80 | learning rate: 1.209E-04 | global batch size: 256 | lm loss: 1.979812E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.180 | TFLOPs: 19.37 |
31: iteration 80990/ 173500 | consumed samples: 20733440 | consumed tokens: 42462085120 | elapsed time per iteration (s): 0.81 | learning rate: 1.209E-04 | global batch size: 256 | lm loss: 2.017464E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.125 | TFLOPs: 19.06 |
31: iteration 81000/ 173500 | consumed samples: 20736000 | consumed tokens: 42467328000 | elapsed time per iteration (s): 0.78 | learning rate: 1.209E-04 | global batch size: 256 | lm loss: 2.017397E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.062 | TFLOPs: 19.79 |
31: -------------------------------------------------------------------------------------------
31: valid loss at iteration 81000 | lm loss value: 1.940759E+00 | lm loss PPL: 6.964035E+00 |
31: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 81000 to checkpoints_1b1long
0: [2022-11-26 12:27:07,815] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step81000 is begin to save!
0: [2022-11-26 12:27:07,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_01-model_00-model_states.pt...
0: [2022-11-26 12:27:08,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_01-model_00-model_states.pt. 0: [2022-11-26 12:27:08,043] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_03-model_00-model_states.pt... 0: [2022-11-26 12:27:08,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_03-model_00-model_states.pt. 0: [2022-11-26 12:27:08,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_04-model_00-model_states.pt... 0: [2022-11-26 12:27:08,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_04-model_00-model_states.pt. 0: [2022-11-26 12:27:08,207] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_05-model_00-model_states.pt... 0: [2022-11-26 12:27:08,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_05-model_00-model_states.pt. 0: [2022-11-26 12:27:08,285] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_06-model_00-model_states.pt... 0: [2022-11-26 12:27:08,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_06-model_00-model_states.pt. 0: [2022-11-26 12:27:08,362] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_07-model_00-model_states.pt... 0: [2022-11-26 12:27:08,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_07-model_00-model_states.pt. 0: [2022-11-26 12:27:08,441] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 12:27:08,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_08-model_00-model_states.pt. 0: [2022-11-26 12:27:08,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_09-model_00-model_states.pt... 0: [2022-11-26 12:27:08,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_09-model_00-model_states.pt. 0: [2022-11-26 12:27:08,598] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_10-model_00-model_states.pt... 0: [2022-11-26 12:27:08,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_10-model_00-model_states.pt. 0: [2022-11-26 12:27:08,675] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_11-model_00-model_states.pt... 0: [2022-11-26 12:27:08,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_11-model_00-model_states.pt. 0: [2022-11-26 12:27:08,752] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_12-model_00-model_states.pt... 0: [2022-11-26 12:27:08,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_12-model_00-model_states.pt. 0: [2022-11-26 12:27:08,833] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_13-model_00-model_states.pt... 0: [2022-11-26 12:27:08,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_13-model_00-model_states.pt. 0: [2022-11-26 12:27:08,908] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 12:27:08,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_14-model_00-model_states.pt. 0: [2022-11-26 12:27:08,985] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_15-model_00-model_states.pt... 0: [2022-11-26 12:27:09,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_15-model_00-model_states.pt. 0: [2022-11-26 12:27:09,062] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_16-model_00-model_states.pt... 0: [2022-11-26 12:27:09,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_16-model_00-model_states.pt. 0: [2022-11-26 12:27:09,138] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_17-model_00-model_states.pt... 0: [2022-11-26 12:27:09,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_17-model_00-model_states.pt. 0: [2022-11-26 12:27:09,213] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_18-model_00-model_states.pt... 0: [2022-11-26 12:27:09,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_18-model_00-model_states.pt. 0: [2022-11-26 12:27:09,290] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_19-model_00-model_states.pt... 0: [2022-11-26 12:27:09,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_19-model_00-model_states.pt. 0: [2022-11-26 12:27:09,368] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 12:27:09,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_20-model_00-model_states.pt. 0: [2022-11-26 12:27:09,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_21-model_00-model_states.pt... 0: [2022-11-26 12:27:09,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_21-model_00-model_states.pt. 0: [2022-11-26 12:27:09,524] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_22-model_00-model_states.pt... 0: [2022-11-26 12:27:09,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_22-model_00-model_states.pt. 0: [2022-11-26 12:27:09,600] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_23-model_00-model_states.pt... 0: [2022-11-26 12:27:09,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_23-model_00-model_states.pt. 0: [2022-11-26 12:27:09,674] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_24-model_00-model_states.pt... 0: [2022-11-26 12:27:09,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_24-model_00-model_states.pt. 0: [2022-11-26 12:27:09,750] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_25-model_00-model_states.pt... 0: [2022-11-26 12:27:09,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_25-model_00-model_states.pt. 0: [2022-11-26 12:27:09,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 12:27:09,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_26-model_00-model_states.pt. 0: [2022-11-26 12:27:09,901] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_27-model_00-model_states.pt... 0: [2022-11-26 12:27:09,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_27-model_00-model_states.pt. 0: [2022-11-26 12:27:09,976] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_28-model_00-model_states.pt... 0: [2022-11-26 12:27:10,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_28-model_00-model_states.pt. 0: [2022-11-26 12:27:10,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/layer_30-model_00-model_states.pt... 0: [2022-11-26 12:27:10,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/layer_30-model_00-model_states.pt. 0: [2022-11-26 12:27:10,055] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step81000/mp_rank_00_model_states.pt 0: [2022-11-26 12:27:10,055] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/mp_rank_00_model_states.pt... 0: [2022-11-26 12:27:10,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/mp_rank_00_model_states.pt. 0: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
6: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 
8: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
2: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
12: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 
23: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 
31: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 
22: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 
26: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 
27: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
7: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
10: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
3: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 
20: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 
24: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 
29: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 
18: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 
5: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 
2: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 
24: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 
27: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
15: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 
22: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 
28: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
17: [2022-11-26 12:27:10,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:27:10,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:27:10,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:27:10,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 12:27:10,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 1: [2022-11-26 12:27:10,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 12:27:10,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 15: [2022-11-26 12:27:10,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:27:10,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 12:27:10,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 6: [2022-11-26 12:27:10,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:27:10,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
6: [2022-11-26 12:27:10,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 21: [2022-11-26 12:27:10,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 6: [2022-11-26 12:27:10,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 21: [2022-11-26 12:27:10,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 7: [2022-11-26 12:27:10,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:27:10,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 25: [2022-11-26 12:27:10,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:27:10,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 25: [2022-11-26 12:27:10,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 12:27:10,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 23: [2022-11-26 12:27:10,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:27:10,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
23: [2022-11-26 12:27:10,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 29: [2022-11-26 12:27:10,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 23: [2022-11-26 12:27:10,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 29: [2022-11-26 12:27:10,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 20: [2022-11-26 12:27:10,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:27:10,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:27:10,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 20: [2022-11-26 12:27:10,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 26: [2022-11-26 12:27:10,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:27:10,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 26: [2022-11-26 12:27:10,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 12:27:10,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
27: [2022-11-26 12:27:10,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 14: [2022-11-26 12:27:10,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:27:10,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 12:27:10,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 30: [2022-11-26 12:27:10,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:27:10,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 31: [2022-11-26 12:27:10,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:27:10,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 30: [2022-11-26 12:27:10,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 31: [2022-11-26 12:27:10,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 4: [2022-11-26 12:27:10,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 12:27:10,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 12:27:10,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 3: [2022-11-26 12:27:10,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:27:10,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:27:10,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 8: [2022-11-26 12:27:10,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 3: [2022-11-26 12:27:10,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 8: [2022-11-26 12:27:10,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 10: [2022-11-26 12:27:10,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:27:10,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 12:27:10,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 16: [2022-11-26 12:27:10,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 
16: [2022-11-26 12:27:10,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 12:27:10,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 29: [2022-11-26 12:27:10,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:27:10,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 12:27:10,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 22: [2022-11-26 12:27:10,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:27:10,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 2: [2022-11-26 12:27:10,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:27:10,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 6: [2022-11-26 12:27:10,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
2: [2022-11-26 12:27:10,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 6: [2022-11-26 12:27:10,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 2: [2022-11-26 12:27:10,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 6: [2022-11-26 12:27:10,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 0: [2022-11-26 12:27:10,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:27:10,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:27:10,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 12:27:10,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 21: [2022-11-26 12:27:10,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:27:10,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 12:27:10,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 13: [2022-11-26 12:27:10,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 12:27:10,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 12:27:10,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 8: [2022-11-26 12:27:10,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:27:10,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 12:27:10,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 12: [2022-11-26 12:27:10,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:27:10,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 12:27:10,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 5: [2022-11-26 12:27:10,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:27:10,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 12:27:10,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 12:27:10,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 12:27:10,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 5: [2022-11-26 12:27:10,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 11: [2022-11-26 12:27:10,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:27:10,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 7: [2022-11-26 12:27:10,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:27:10,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 9: [2022-11-26 12:27:10,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:27:10,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 9: [2022-11-26 12:27:10,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 7: [2022-11-26 12:27:10,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
9: [2022-11-26 12:27:10,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 1: [2022-11-26 12:27:10,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:27:10,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 12:27:10,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 20: [2022-11-26 12:27:10,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:27:10,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 12:27:10,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 2: [2022-11-26 12:27:10,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:27:10,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 12:27:10,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 25: [2022-11-26 12:27:10,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:27:10,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-26 12:27:10,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 12:27:10,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 25: [2022-11-26 12:27:10,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 12:27:10,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 31: [2022-11-26 12:27:10,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:27:10,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 12:27:10,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 10: [2022-11-26 12:27:10,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:27:10,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 12:27:10,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 23: [2022-11-26 12:27:10,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:27:10,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
23: [2022-11-26 12:27:10,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 12:27:10,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 13: [2022-11-26 12:27:10,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 12:27:10,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 15: [2022-11-26 12:27:10,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:27:10,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 12:27:10,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 30: [2022-11-26 12:27:10,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:27:10,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 12:27:10,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 21: [2022-11-26 12:27:10,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
21: [2022-11-26 12:27:10,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 12:27:10,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 4: [2022-11-26 12:27:10,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:27:10,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 12:27:10,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 3: [2022-11-26 12:27:10,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:27:10,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 12:27:10,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 22: [2022-11-26 12:27:10,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:27:10,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 12:27:10,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 12: [2022-11-26 12:27:10,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 12:27:10,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 12:27:10,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 6: [2022-11-26 12:27:10,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:27:10,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:27:10,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 16: [2022-11-26 12:27:10,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 6: [2022-11-26 12:27:10,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 16: [2022-11-26 12:27:10,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 16: [2022-11-26 12:27:10,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:27:10,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
16: [2022-11-26 12:27:10,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 9: [2022-11-26 12:27:10,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 16: [2022-11-26 12:27:10,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 9: [2022-11-26 12:27:10,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 26: [2022-11-26 12:27:10,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:27:10,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:27:10,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 20: [2022-11-26 12:27:10,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:27:10,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 26: [2022-11-26 12:27:10,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 20: [2022-11-26 12:27:10,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 25: [2022-11-26 12:27:10,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
20: [2022-11-26 12:27:10,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 14: [2022-11-26 12:27:10,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:27:10,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 12:27:10,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 26: [2022-11-26 12:27:10,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:27:10,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 12:27:10,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 0: [2022-11-26 12:27:10,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:27:10,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 8: [2022-11-26 12:27:10,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:27:10,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
8: [2022-11-26 12:27:10,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 12:27:10,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 7: [2022-11-26 12:27:10,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:27:10,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 12:27:10,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 15: [2022-11-26 12:27:10,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:27:10,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:27:10,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 11: [2022-11-26 12:27:10,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 15: [2022-11-26 12:27:10,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 11: [2022-11-26 12:27:10,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 27: [2022-11-26 12:27:10,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
0: [2022-11-26 12:27:10,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 22: [2022-11-26 12:27:10,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:27:10,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 22: [2022-11-26 12:27:10,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 12:27:10,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 27: [2022-11-26 12:27:10,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:27:10,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 12:27:10,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 12:27:10,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 27: [2022-11-26 12:27:10,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 13: [2022-11-26 12:27:10,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 12:27:10,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 12:27:10,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 30: [2022-11-26 12:27:10,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:27:10,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 12:27:10,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 2: [2022-11-26 12:27:10,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:27:10,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 12:27:10,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 12: [2022-11-26 12:27:10,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:27:10,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 12:27:10,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 5: [2022-11-26 12:27:10,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 12:27:10,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 12:27:10,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 14: [2022-11-26 12:27:10,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:27:10,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 12:27:10,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 29: [2022-11-26 12:27:10,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:27:10,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 12:27:10,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 3: [2022-11-26 12:27:10,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:27:10,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 31: [2022-11-26 12:27:10,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:27:10,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
31: [2022-11-26 12:27:10,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 12:27:10,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 4: [2022-11-26 12:27:10,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:27:10,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 12:27:10,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 28: [2022-11-26 12:27:10,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:27:10,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:27:10,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 12:27:10,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 12:27:10,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 28: [2022-11-26 12:27:10,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 23: [2022-11-26 12:27:10,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-26 12:27:10,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 12:27:10,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 20: [2022-11-26 12:27:10,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:27:10,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 12:27:10,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 5: [2022-11-26 12:27:10,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:27:10,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 12:27:10,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 22: [2022-11-26 12:27:10,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:27:10,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 12:27:10,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 18: [2022-11-26 12:27:10,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
1: [2022-11-26 12:27:10,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:27:10,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:27:10,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 12:27:10,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 1: [2022-11-26 12:27:10,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 12:27:10,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 12:27:10,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 1: [2022-11-26 12:27:10,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 10: [2022-11-26 12:27:10,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:27:10,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 12:27:10,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 18: [2022-11-26 12:27:10,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-26 12:27:10,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 12:27:10,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 18: [2022-11-26 12:27:10,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:27:10,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 12:27:10,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 15: [2022-11-26 12:27:10,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:27:10,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 12:27:10,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 29: [2022-11-26 12:27:10,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:27:10,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 12:27:10,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 19: [2022-11-26 12:27:10,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 
19: [2022-11-26 12:27:10,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:27:10,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:27:10,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:27:10,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 12:27:10,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 12:27:10,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 12:27:10,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 12:27:10,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 19: [2022-11-26 12:27:10,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 19: [2022-11-26 12:27:10,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 19: [2022-11-26 12:27:10,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 24: [2022-11-26 12:27:10,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-26 12:27:10,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:27:10,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:27:10,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:27:10,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 12:27:10,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 12:27:10,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 12:27:10,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 24: [2022-11-26 12:27:10,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 24: [2022-11-26 12:27:10,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 24: [2022-11-26 12:27:10,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 12:27:10,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 27: [2022-11-26 12:27:10,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
27: [2022-11-26 12:27:10,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 12:27:10,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 17: [2022-11-26 12:27:10,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:27:10,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:27:10,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:27:10,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:27:10,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 12:27:10,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 12:27:10,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 12:27:10,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 12:27:10,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
17: [2022-11-26 12:27:10,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 17: [2022-11-26 12:27:10,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 17: [2022-11-26 12:27:10,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 31: [2022-11-26 12:27:10,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:27:10,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 12:27:10,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 23: [2022-11-26 12:27:10,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:27:10,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 12:27:10,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 14: [2022-11-26 12:27:10,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:27:10,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 12:27:10,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
25: [2022-11-26 12:27:10,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:27:10,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 12:27:10,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 28: [2022-11-26 12:27:10,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:27:10,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 12:27:10,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 26: [2022-11-26 12:27:10,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:27:10,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 12:27:10,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 8: [2022-11-26 12:27:10,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:27:10,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 12:27:10,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
9: [2022-11-26 12:27:10,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:27:10,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 12:27:10,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 30: [2022-11-26 12:27:10,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:27:10,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:27:10,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 12:27:10,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 0: [2022-11-26 12:27:10,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 12:27:10,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 10: [2022-11-26 12:27:10,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:27:10,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 12:27:10,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
6: [2022-11-26 12:27:10,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:27:10,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 12:27:10,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 2: [2022-11-26 12:27:10,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:27:10,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 12:27:10,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 21: [2022-11-26 12:27:10,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:27:10,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 12:27:10,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 3: [2022-11-26 12:27:10,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:27:10,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 12:27:10,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
11: [2022-11-26 12:27:10,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:27:10,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 12:27:10,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 7: [2022-11-26 12:27:10,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:27:10,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 12:27:10,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 16: [2022-11-26 12:27:10,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:27:10,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 12:27:10,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 13: [2022-11-26 12:27:10,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:27:10,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 12:27:10,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
4: [2022-11-26 12:27:10,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:27:10,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 12:27:10,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 12: [2022-11-26 12:27:10,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:27:10,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 12:27:10,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 29: [2022-11-26 12:27:10,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:27:10,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 12:27:10,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 15: [2022-11-26 12:27:10,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:27:10,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 12:27:10,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
24: [2022-11-26 12:27:10,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:27:10,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 12:27:10,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 5: [2022-11-26 12:27:10,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:27:10,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 12:27:10,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 18: [2022-11-26 12:27:10,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:27:10,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 12:27:10,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 20: [2022-11-26 12:27:10,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:27:10,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 12:27:10,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
22: [2022-11-26 12:27:10,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:27:10,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 12:27:10,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 27: [2022-11-26 12:27:10,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:27:10,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 12:27:10,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 17: [2022-11-26 12:27:10,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:27:10,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 12:27:10,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 19: [2022-11-26 12:27:10,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:27:10,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 12:27:10,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
25: [2022-11-26 12:27:10,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:27:10,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:27:10,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 1: [2022-11-26 12:27:10,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 25: [2022-11-26 12:27:10,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 1: [2022-11-26 12:27:10,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 23: [2022-11-26 12:27:10,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:27:10,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 12:27:10,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 31: [2022-11-26 12:27:10,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:27:10,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 12:27:10,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
30: [2022-11-26 12:27:10,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:27:10,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 12:27:10,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 14: [2022-11-26 12:27:10,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:27:10,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 12:27:10,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 26: [2022-11-26 12:27:10,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:27:10,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 12:27:10,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 0: [2022-11-26 12:27:10,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:27:10,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 12:27:10,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
28: [2022-11-26 12:27:10,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:27:10,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 12:27:10,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 2: [2022-11-26 12:27:10,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:27:10,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 12:27:10,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 10: [2022-11-26 12:27:10,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:27:10,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 12:27:10,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 4: [2022-11-26 12:27:10,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:27:10,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 12:27:10,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
6: [2022-11-26 12:27:10,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:27:10,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:27:10,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 9: [2022-11-26 12:27:10,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 6: [2022-11-26 12:27:10,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 9: [2022-11-26 12:27:10,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 13: [2022-11-26 12:27:10,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:27:10,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 12:27:10,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 3: [2022-11-26 12:27:10,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:27:10,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 12:27:10,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
8: [2022-11-26 12:27:10,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:27:10,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 12:27:10,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 7: [2022-11-26 12:27:10,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:27:10,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 12:27:10,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 16: [2022-11-26 12:27:10,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:27:10,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 12:27:10,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 11: [2022-11-26 12:27:10,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:27:10,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 12:27:10,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
15: [2022-11-26 12:27:10,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:27:10,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 12:27:10,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 29: [2022-11-26 12:27:10,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:27:10,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 12:27:10,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 12: [2022-11-26 12:27:10,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:27:10,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 12:27:10,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 24: [2022-11-26 12:27:10,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:27:10,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 12:27:10,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
5: [2022-11-26 12:27:10,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:27:10,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 12:27:10,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 20: [2022-11-26 12:27:10,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:27:10,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 12:27:10,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 22: [2022-11-26 12:27:10,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:27:10,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 17: [2022-11-26 12:27:10,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:27:10,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 17: [2022-11-26 12:27:10,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 12:27:10,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
19: [2022-11-26 12:27:10,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:27:10,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 12:27:10,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 1: [2022-11-26 12:27:10,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:27:10,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 12:27:10,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 25: [2022-11-26 12:27:10,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:27:10,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:27:10,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 18: [2022-11-26 12:27:10,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 25: [2022-11-26 12:27:10,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 18: [2022-11-26 12:27:10,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
27: [2022-11-26 12:27:10,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:27:10,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 12:27:10,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 31: [2022-11-26 12:27:10,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:27:10,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 12:27:10,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 23: [2022-11-26 12:27:10,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:27:10,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 12:27:10,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 28: [2022-11-26 12:27:10,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:27:10,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 12:27:10,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
21: [2022-11-26 12:27:10,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:27:10,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 12:27:10,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 14: [2022-11-26 12:27:10,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:27:10,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 0: [2022-11-26 12:27:10,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:27:10,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 0: [2022-11-26 12:27:10,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 12:27:10,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 8: [2022-11-26 12:27:10,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:27:10,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 12:27:10,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 8: [2022-11-26 12:27:10,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 2: [2022-11-26 12:27:10,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 8: [2022-11-26 12:27:10,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 26: [2022-11-26 12:27:10,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:27:10,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 12:27:10,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 6: [2022-11-26 12:27:10,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:27:10,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 10: [2022-11-26 12:27:10,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:27:10,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
10: [2022-11-26 12:27:10,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 12:27:10,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 30: [2022-11-26 12:27:10,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:27:10,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 12:27:10,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 9: [2022-11-26 12:27:10,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:27:10,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 12:27:10,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 21: [2022-11-26 12:27:10,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:27:10,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 12:27:10,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 13: [2022-11-26 12:27:10,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 12:27:10,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 12:27:10,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 3: [2022-11-26 12:27:10,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:27:10,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 12:27:10,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 7: [2022-11-26 12:27:10,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:27:10,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 12:27:10,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 29: [2022-11-26 12:27:10,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:27:10,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 12:27:10,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 4: [2022-11-26 12:27:10,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 12:27:10,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 12:27:10,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 15: [2022-11-26 12:27:10,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:27:10,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 12:27:10,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 11: [2022-11-26 12:27:10,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:27:10,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 12:27:10,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 12: [2022-11-26 12:27:10,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:27:10,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 12:27:10,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 24: [2022-11-26 12:27:10,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
20: [2022-11-26 12:27:10,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:27:10,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 20: [2022-11-26 12:27:10,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 12:27:10,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 24: [2022-11-26 12:27:10,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 19: [2022-11-26 12:27:10,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:27:10,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 12:27:10,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 1: [2022-11-26 12:27:10,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:27:10,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 12:27:10,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 17: [2022-11-26 12:27:10,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-26 12:27:10,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 12:27:10,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 5: [2022-11-26 12:27:10,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:27:10,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 12:27:10,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 27: [2022-11-26 12:27:10,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:27:10,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 12:27:10,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 18: [2022-11-26 12:27:10,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:27:10,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 12:27:10,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 22: [2022-11-26 12:27:10,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
22: [2022-11-26 12:27:10,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 12:27:10,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 16: [2022-11-26 12:27:10,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:27:10,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 12:27:10,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 25: [2022-11-26 12:27:10,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:27:10,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 12:27:10,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 14: [2022-11-26 12:27:10,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:27:10,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 12:27:10,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 26: [2022-11-26 12:27:10,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
26: [2022-11-26 12:27:10,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 12:27:10,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 9: [2022-11-26 12:27:10,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:27:10,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:27:10,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 12:27:10,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 23: [2022-11-26 12:27:10,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 12:27:10,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 0: [2022-11-26 12:27:10,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:27:10,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 12:27:10,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 21: [2022-11-26 12:27:10,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
21: [2022-11-26 12:27:10,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 12:27:10,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 31: [2022-11-26 12:27:10,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:27:10,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 12:27:10,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 6: [2022-11-26 12:27:10,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:27:10,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 12:27:10,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 3: [2022-11-26 12:27:10,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:27:10,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 12:27:10,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 8: [2022-11-26 12:27:10,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 12:27:10,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 12:27:10,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 4: [2022-11-26 12:27:10,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:27:10,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:27:10,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 10: [2022-11-26 12:27:10,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 4: [2022-11-26 12:27:10,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 10: [2022-11-26 12:27:10,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 29: [2022-11-26 12:27:10,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:27:10,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 12:27:10,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 13: [2022-11-26 12:27:10,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 12:27:10,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 12:27:10,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 2: [2022-11-26 12:27:10,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:27:10,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 12:27:10,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 20: [2022-11-26 12:27:10,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:27:10,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 12:27:10,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 24: [2022-11-26 12:27:10,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:27:10,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 12:27:10,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 28: [2022-11-26 12:27:10,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
28: [2022-11-26 12:27:10,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 12:27:10,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 19: [2022-11-26 12:27:10,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:27:10,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 17: [2022-11-26 12:27:10,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:27:10,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 17: [2022-11-26 12:27:10,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 12:27:10,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 16: [2022-11-26 12:27:10,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:27:10,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 12:27:10,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 25: [2022-11-26 12:27:10,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
25: [2022-11-26 12:27:10,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 12:27:10,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 15: [2022-11-26 12:27:10,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:27:10,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:27:10,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 12:27:10,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 27: [2022-11-26 12:27:10,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 12:27:10,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 11: [2022-11-26 12:27:10,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:27:10,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 12:27:10,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 22: [2022-11-26 12:27:10,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
1: [2022-11-26 12:27:10,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:27:10,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:27:10,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 1: [2022-11-26 12:27:10,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 26: [2022-11-26 12:27:10,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 1: [2022-11-26 12:27:10,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 26: [2022-11-26 12:27:10,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 22: [2022-11-26 12:27:10,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 7: [2022-11-26 12:27:10,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:27:10,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 12:27:10,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 30: [2022-11-26 12:27:10,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-26 12:27:10,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 12:27:10,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 12: [2022-11-26 12:27:10,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:27:10,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 12:27:10,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 10: [2022-11-26 12:27:10,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:27:10,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:27:10,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 12:27:10,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 14: [2022-11-26 12:27:10,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 12:27:10,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 21: [2022-11-26 12:27:10,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-26 12:27:10,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 12:27:10,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 6: [2022-11-26 12:27:10,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:27:10,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:27:10,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 12:27:10,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 4: [2022-11-26 12:27:10,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 12:27:10,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 9: [2022-11-26 12:27:10,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:27:10,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:27:10,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
9: [2022-11-26 12:27:10,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 30: [2022-11-26 12:27:10,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 8: [2022-11-26 12:27:10,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 9: [2022-11-26 12:27:10,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 8: [2022-11-26 12:27:10,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 30: [2022-11-26 12:27:10,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 23: [2022-11-26 12:27:10,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:27:10,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:27:10,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
18: [2022-11-26 12:27:10,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 23: [2022-11-26 12:27:10,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 2: [2022-11-26 12:27:10,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:27:10,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 18: [2022-11-26 12:27:10,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 2: [2022-11-26 12:27:10,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 13: [2022-11-26 12:27:10,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 12:27:10,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 2: [2022-11-26 12:27:10,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 5: [2022-11-26 12:27:10,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:27:10,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 12:27:10,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
3: [2022-11-26 12:27:10,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:27:10,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 12:27:10,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 16: [2022-11-26 12:27:10,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:27:10,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 12:27:10,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 31: [2022-11-26 12:27:10,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:27:10,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 12:27:10,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 0: [2022-11-26 12:27:10,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:27:10,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 12:27:10,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
11: [2022-11-26 12:27:10,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:27:10,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 12:27:10,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 7: [2022-11-26 12:27:10,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:27:10,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 12:27:10,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 28: [2022-11-26 12:27:10,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:27:10,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 12:27:10,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 28: [2022-11-26 12:27:10,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:27:10,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 12:27:10,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
12: [2022-11-26 12:27:10,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:27:10,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:27:10,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 12: [2022-11-26 12:27:10,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step81000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 12:27:10,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 18: [2022-11-26 12:27:10,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 0: successfully saved checkpoint at iteration 81000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2582.81 31: iteration 81010/ 173500 | consumed samples: 20738560 | consumed tokens: 42472570880 | elapsed time per iteration (s): 1.08 | learning rate: 1.209E-04 | global batch size: 256 | lm loss: 2.012504E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.525 | TFLOPs: 14.37 | 31: iteration 81020/ 173500 | consumed samples: 20741120 | consumed tokens: 42477813760 | elapsed time per iteration (s): 0.84 | learning rate: 1.208E-04 | global batch size: 256 | lm loss: 1.999875E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.617 | TFLOPs: 18.37 | 31: iteration 81030/ 173500 | consumed samples: 20743680 | consumed tokens: 42483056640 | elapsed time per iteration (s): 0.79 | learning rate: 1.208E-04 | global batch size: 256 | lm loss: 1.997128E+00 
| grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.389 | TFLOPs: 19.50 | 31: iteration 81040/ 173500 | consumed samples: 20746240 | consumed tokens: 42488299520 | elapsed time per iteration (s): 0.85 | learning rate: 1.208E-04 | global batch size: 256 | lm loss: 1.986576E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.038 | TFLOPs: 18.27 | 31: iteration 81050/ 173500 | consumed samples: 20748800 | consumed tokens: 42493542400 | elapsed time per iteration (s): 0.73 | learning rate: 1.208E-04 | global batch size: 256 | lm loss: 2.015564E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.937 | TFLOPs: 21.17 | 31: iteration 81060/ 173500 | consumed samples: 20751360 | consumed tokens: 42498785280 | elapsed time per iteration (s): 0.79 | learning rate: 1.208E-04 | global batch size: 256 | lm loss: 1.943399E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.939 | TFLOPs: 19.60 | 31: iteration 81070/ 173500 | consumed samples: 20753920 | consumed tokens: 42504028160 | elapsed time per iteration (s): 0.78 | learning rate: 1.208E-04 | global batch size: 256 | lm loss: 2.007938E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.285 | TFLOPs: 19.80 | 31: iteration 81080/ 173500 | consumed samples: 20756480 | consumed tokens: 42509271040 | elapsed time per iteration (s): 0.82 | learning rate: 1.207E-04 | global batch size: 256 | lm loss: 2.015590E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.841 | TFLOPs: 18.87 | 31: iteration 81090/ 173500 | consumed samples: 20759040 | consumed tokens: 42514513920 | 
elapsed time per iteration (s): 0.78 | learning rate: 1.207E-04 | global batch size: 256 | lm loss: 2.017186E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.584 | TFLOPs: 19.88 | 31: iteration 81100/ 173500 | consumed samples: 20761600 | consumed tokens: 42519756800 | elapsed time per iteration (s): 0.78 | learning rate: 1.207E-04 | global batch size: 256 | lm loss: 1.987904E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.308 | TFLOPs: 19.92 | 31: iteration 81110/ 173500 | consumed samples: 20764160 | consumed tokens: 42524999680 | elapsed time per iteration (s): 0.76 | learning rate: 1.207E-04 | global batch size: 256 | lm loss: 1.971091E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.998 | TFLOPs: 20.45 | 31: iteration 81120/ 173500 | consumed samples: 20766720 | consumed tokens: 42530242560 | elapsed time per iteration (s): 0.77 | learning rate: 1.207E-04 | global batch size: 256 | lm loss: 1.997600E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.105 | TFLOPs: 20.03 | 31: iteration 81130/ 173500 | consumed samples: 20769280 | consumed tokens: 42535485440 | elapsed time per iteration (s): 0.97 | learning rate: 1.207E-04 | global batch size: 256 | lm loss: 1.989476E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 264.868 | TFLOPs: 16.02 | 31: iteration 81140/ 173500 | consumed samples: 20771840 | consumed tokens: 42540728320 | elapsed time per iteration (s): 0.77 | learning rate: 1.206E-04 | global batch size: 256 | lm loss: 2.048789E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.888 | 
TFLOPs: 20.08 | 31: iteration 81150/ 173500 | consumed samples: 20774400 | consumed tokens: 42545971200 | elapsed time per iteration (s): 0.80 | learning rate: 1.206E-04 | global batch size: 256 | lm loss: 1.997112E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.217 | TFLOPs: 19.31 | 31: iteration 81160/ 173500 | consumed samples: 20776960 | consumed tokens: 42551214080 | elapsed time per iteration (s): 0.74 | learning rate: 1.206E-04 | global batch size: 256 | lm loss: 2.025941E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.280 | TFLOPs: 20.83 | 31: iteration 81170/ 173500 | consumed samples: 20779520 | consumed tokens: 42556456960 | elapsed time per iteration (s): 0.82 | learning rate: 1.206E-04 | global batch size: 256 | lm loss: 1.979223E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.904 | TFLOPs: 18.81 | 31: iteration 81180/ 173500 | consumed samples: 20782080 | consumed tokens: 42561699840 | elapsed time per iteration (s): 0.74 | learning rate: 1.206E-04 | global batch size: 256 | lm loss: 1.998292E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.524 | TFLOPs: 20.90 | 31: iteration 81190/ 173500 | consumed samples: 20784640 | consumed tokens: 42566942720 | elapsed time per iteration (s): 0.75 | learning rate: 1.206E-04 | global batch size: 256 | lm loss: 2.010963E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.208 | TFLOPs: 20.52 | 31: iteration 81200/ 173500 | consumed samples: 20787200 | consumed tokens: 42572185600 | elapsed time per iteration (s): 0.80 | learning rate: 1.205E-04 | global batch size: 256 | lm loss: 2.000276E+00 | grad norm: 0.141 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.383 | TFLOPs: 19.32 | 31: iteration 81210/ 173500 | consumed samples: 20789760 | consumed tokens: 42577428480 | elapsed time per iteration (s): 0.80 | learning rate: 1.205E-04 | global batch size: 256 | lm loss: 2.025566E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.624 | TFLOPs: 19.40 | 31: iteration 81220/ 173500 | consumed samples: 20792320 | consumed tokens: 42582671360 | elapsed time per iteration (s): 0.75 | learning rate: 1.205E-04 | global batch size: 256 | lm loss: 2.003299E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.615 | TFLOPs: 20.61 | 31: iteration 81230/ 173500 | consumed samples: 20794880 | consumed tokens: 42587914240 | elapsed time per iteration (s): 0.73 | learning rate: 1.205E-04 | global batch size: 256 | lm loss: 2.012846E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.928 | TFLOPs: 21.17 | 31: iteration 81240/ 173500 | consumed samples: 20797440 | consumed tokens: 42593157120 | elapsed time per iteration (s): 0.79 | learning rate: 1.205E-04 | global batch size: 256 | lm loss: 1.999528E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.088 | TFLOPs: 19.61 | 31: iteration 81250/ 173500 | consumed samples: 20800000 | consumed tokens: 42598400000 | elapsed time per iteration (s): 0.81 | learning rate: 1.205E-04 | global batch size: 256 | lm loss: 1.989211E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.801 | TFLOPs: 19.11 | 31: iteration 81260/ 173500 | consumed samples: 20802560 | consumed tokens: 42603642880 | elapsed time per iteration (s): 
0.80 | learning rate: 1.204E-04 | global batch size: 256 | lm loss: 2.017494E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.691 | TFLOPs: 19.46 | 31: iteration 81270/ 173500 | consumed samples: 20805120 | consumed tokens: 42608885760 | elapsed time per iteration (s): 0.76 | learning rate: 1.204E-04 | global batch size: 256 | lm loss: 2.028608E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.588 | TFLOPs: 20.48 | 31: iteration 81280/ 173500 | consumed samples: 20807680 | consumed tokens: 42614128640 | elapsed time per iteration (s): 0.76 | learning rate: 1.204E-04 | global batch size: 256 | lm loss: 2.007928E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.691 | TFLOPs: 20.25 | 31: iteration 81290/ 173500 | consumed samples: 20810240 | consumed tokens: 42619371520 | elapsed time per iteration (s): 0.75 | learning rate: 1.204E-04 | global batch size: 256 | lm loss: 1.988574E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.436 | TFLOPs: 20.78 | 31: iteration 81300/ 173500 | consumed samples: 20812800 | consumed tokens: 42624614400 | elapsed time per iteration (s): 0.80 | learning rate: 1.204E-04 | global batch size: 256 | lm loss: 1.994223E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.524 | TFLOPs: 19.33 | 31: iteration 81310/ 173500 | consumed samples: 20815360 | consumed tokens: 42629857280 | elapsed time per iteration (s): 0.76 | learning rate: 1.204E-04 | global batch size: 256 | lm loss: 2.001346E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.269 | TFLOPs: 20.40 | 31: iteration 
81320/ 173500 | consumed samples: 20817920 | consumed tokens: 42635100160 | elapsed time per iteration (s): 0.80 | learning rate: 1.203E-04 | global batch size: 256 | lm loss: 2.013652E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.692 | TFLOPs: 19.28 | 31: iteration 81330/ 173500 | consumed samples: 20820480 | consumed tokens: 42640343040 | elapsed time per iteration (s): 0.85 | learning rate: 1.203E-04 | global batch size: 256 | lm loss: 2.014125E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.734 | TFLOPs: 18.19 | 31: iteration 81340/ 173500 | consumed samples: 20823040 | consumed tokens: 42645585920 | elapsed time per iteration (s): 1.58 | learning rate: 1.203E-04 | global batch size: 256 | lm loss: 2.004723E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 162.198 | TFLOPs: 9.81 | 31: iteration 81350/ 173500 | consumed samples: 20825600 | consumed tokens: 42650828800 | elapsed time per iteration (s): 0.78 | learning rate: 1.203E-04 | global batch size: 256 | lm loss: 1.968168E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.382 | TFLOPs: 19.81 | 31: iteration 81360/ 173500 | consumed samples: 20828160 | consumed tokens: 42656071680 | elapsed time per iteration (s): 0.79 | learning rate: 1.203E-04 | global batch size: 256 | lm loss: 1.993231E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.713 | TFLOPs: 19.58 | 31: iteration 81370/ 173500 | consumed samples: 20830720 | consumed tokens: 42661314560 | elapsed time per iteration (s): 0.79 | learning rate: 1.203E-04 | global batch size: 256 | lm loss: 1.986727E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 324.774 | TFLOPs: 19.65 | 31: iteration 81380/ 173500 | consumed samples: 20833280 | consumed tokens: 42666557440 | elapsed time per iteration (s): 0.77 | learning rate: 1.202E-04 | global batch size: 256 | lm loss: 2.012123E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.148 | TFLOPs: 20.09 | 31: iteration 81390/ 173500 | consumed samples: 20835840 | consumed tokens: 42671800320 | elapsed time per iteration (s): 0.73 | learning rate: 1.202E-04 | global batch size: 256 | lm loss: 1.961617E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.031 | TFLOPs: 21.24 | 31: iteration 81400/ 173500 | consumed samples: 20838400 | consumed tokens: 42677043200 | elapsed time per iteration (s): 0.73 | learning rate: 1.202E-04 | global batch size: 256 | lm loss: 2.034701E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.613 | TFLOPs: 21.33 | 31: iteration 81410/ 173500 | consumed samples: 20840960 | consumed tokens: 42682286080 | elapsed time per iteration (s): 0.86 | learning rate: 1.202E-04 | global batch size: 256 | lm loss: 2.003344E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.804 | TFLOPs: 17.96 | 31: iteration 81420/ 173500 | consumed samples: 20843520 | consumed tokens: 42687528960 | elapsed time per iteration (s): 0.78 | learning rate: 1.202E-04 | global batch size: 256 | lm loss: 1.981343E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.879 | TFLOPs: 19.96 | 31: iteration 81430/ 173500 | consumed samples: 20846080 | consumed tokens: 42692771840 | elapsed time per iteration (s): 0.80 | learning rate: 
1.202E-04 | global batch size: 256 | lm loss: 2.007004E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.273 | TFLOPs: 19.44 | 31: iteration 81440/ 173500 | consumed samples: 20848640 | consumed tokens: 42698014720 | elapsed time per iteration (s): 0.77 | learning rate: 1.201E-04 | global batch size: 256 | lm loss: 2.015495E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.131 | TFLOPs: 20.15 | 31: iteration 81450/ 173500 | consumed samples: 20851200 | consumed tokens: 42703257600 | elapsed time per iteration (s): 0.88 | learning rate: 1.201E-04 | global batch size: 256 | lm loss: 1.990175E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.010 | TFLOPs: 17.54 | 31: iteration 81460/ 173500 | consumed samples: 20853760 | consumed tokens: 42708500480 | elapsed time per iteration (s): 0.77 | learning rate: 1.201E-04 | global batch size: 256 | lm loss: 1.967026E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.465 | TFLOPs: 19.99 | 31: iteration 81470/ 173500 | consumed samples: 20856320 | consumed tokens: 42713743360 | elapsed time per iteration (s): 0.79 | learning rate: 1.201E-04 | global batch size: 256 | lm loss: 2.015916E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.191 | TFLOPs: 19.55 | 31: iteration 81480/ 173500 | consumed samples: 20858880 | consumed tokens: 42718986240 | elapsed time per iteration (s): 0.79 | learning rate: 1.201E-04 | global batch size: 256 | lm loss: 2.005128E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.126 | TFLOPs: 19.55 | 31: iteration 81490/ 173500 | 
consumed samples: 20861440 | consumed tokens: 42724229120 | elapsed time per iteration (s): 1.30 | learning rate: 1.201E-04 | global batch size: 256 | lm loss: 2.004639E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 196.638 | TFLOPs: 11.90 | 31: iteration 81500/ 173500 | consumed samples: 20864000 | consumed tokens: 42729472000 | elapsed time per iteration (s): 0.78 | learning rate: 1.200E-04 | global batch size: 256 | lm loss: 2.019224E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.113 | TFLOPs: 19.79 | 31: iteration 81510/ 173500 | consumed samples: 20866560 | consumed tokens: 42734714880 | elapsed time per iteration (s): 0.79 | learning rate: 1.200E-04 | global batch size: 256 | lm loss: 2.034684E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.544 | TFLOPs: 19.69 | 31: iteration 81520/ 173500 | consumed samples: 20869120 | consumed tokens: 42739957760 | elapsed time per iteration (s): 0.80 | learning rate: 1.200E-04 | global batch size: 256 | lm loss: 1.981738E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.923 | TFLOPs: 19.48 | 31: iteration 81530/ 173500 | consumed samples: 20871680 | consumed tokens: 42745200640 | elapsed time per iteration (s): 0.81 | learning rate: 1.200E-04 | global batch size: 256 | lm loss: 2.009995E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.120 | TFLOPs: 19.06 | 31: iteration 81540/ 173500 | consumed samples: 20874240 | consumed tokens: 42750443520 | elapsed time per iteration (s): 0.80 | learning rate: 1.200E-04 | global batch size: 256 | lm loss: 2.005936E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 319.572 | TFLOPs: 19.33 | 31: iteration 81550/ 173500 | consumed samples: 20876800 | consumed tokens: 42755686400 | elapsed time per iteration (s): 0.82 | learning rate: 1.200E-04 | global batch size: 256 | lm loss: 2.015809E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.134 | TFLOPs: 18.82 | 31: iteration 81560/ 173500 | consumed samples: 20879360 | consumed tokens: 42760929280 | elapsed time per iteration (s): 0.81 | learning rate: 1.200E-04 | global batch size: 256 | lm loss: 2.004428E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.374 | TFLOPs: 19.08 | 31: iteration 81570/ 173500 | consumed samples: 20881920 | consumed tokens: 42766172160 | elapsed time per iteration (s): 0.81 | learning rate: 1.199E-04 | global batch size: 256 | lm loss: 2.001155E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.894 | TFLOPs: 19.23 | 31: iteration 81580/ 173500 | consumed samples: 20884480 | consumed tokens: 42771415040 | elapsed time per iteration (s): 0.80 | learning rate: 1.199E-04 | global batch size: 256 | lm loss: 1.985047E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.285 | TFLOPs: 19.38 | 31: iteration 81590/ 173500 | consumed samples: 20887040 | consumed tokens: 42776657920 | elapsed time per iteration (s): 0.80 | learning rate: 1.199E-04 | global batch size: 256 | lm loss: 1.991971E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.595 | TFLOPs: 19.33 | 31: iteration 81600/ 173500 | consumed samples: 20889600 | consumed tokens: 42781900800 | elapsed time per iteration (s): 0.84 | learning rate: 1.199E-04 | global batch 
size: 256 | lm loss: 2.022017E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.514 | TFLOPs: 18.42 | 31: iteration 81610/ 173500 | consumed samples: 20892160 | consumed tokens: 42787143680 | elapsed time per iteration (s): 0.86 | learning rate: 1.199E-04 | global batch size: 256 | lm loss: 1.970568E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.166 | TFLOPs: 18.04 | 31: iteration 81620/ 173500 | consumed samples: 20894720 | consumed tokens: 42792386560 | elapsed time per iteration (s): 0.79 | learning rate: 1.199E-04 | global batch size: 256 | lm loss: 2.021493E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.519 | TFLOPs: 19.63 | 31: iteration 81630/ 173500 | consumed samples: 20897280 | consumed tokens: 42797629440 | elapsed time per iteration (s): 0.80 | learning rate: 1.198E-04 | global batch size: 256 | lm loss: 2.016049E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.685 | TFLOPs: 19.40 | 31: iteration 81640/ 173500 | consumed samples: 20899840 | consumed tokens: 42802872320 | elapsed time per iteration (s): 0.79 | learning rate: 1.198E-04 | global batch size: 256 | lm loss: 1.990323E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.996 | TFLOPs: 19.54 | 31: iteration 81650/ 173500 | consumed samples: 20902400 | consumed tokens: 42808115200 | elapsed time per iteration (s): 0.82 | learning rate: 1.198E-04 | global batch size: 256 | lm loss: 1.996306E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.510 | TFLOPs: 18.97 | 31: iteration 81660/ 173500 | consumed samples: 20904960 | 
consumed tokens: 42813358080 | elapsed time per iteration (s): 0.80 | learning rate: 1.198E-04 | global batch size: 256 | lm loss: 2.001585E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.321 | TFLOPs: 19.26 | 31: iteration 81670/ 173500 | consumed samples: 20907520 | consumed tokens: 42818600960 | elapsed time per iteration (s): 0.82 | learning rate: 1.198E-04 | global batch size: 256 | lm loss: 1.999959E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.945 | TFLOPs: 18.93 | 31: iteration 81680/ 173500 | consumed samples: 20910080 | consumed tokens: 42823843840 | elapsed time per iteration (s): 0.82 | learning rate: 1.198E-04 | global batch size: 256 | lm loss: 2.030680E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.991 | TFLOPs: 19.00 | 31: iteration 81690/ 173500 | consumed samples: 20912640 | consumed tokens: 42829086720 | elapsed time per iteration (s): 0.88 | learning rate: 1.197E-04 | global batch size: 256 | lm loss: 1.974148E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.929 | TFLOPs: 17.54 | 31: iteration 81700/ 173500 | consumed samples: 20915200 | consumed tokens: 42834329600 | elapsed time per iteration (s): 0.84 | learning rate: 1.197E-04 | global batch size: 256 | lm loss: 2.017139E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.505 | TFLOPs: 18.54 | 31: iteration 81710/ 173500 | consumed samples: 20917760 | consumed tokens: 42839572480 | elapsed time per iteration (s): 0.86 | learning rate: 1.197E-04 | global batch size: 256 | lm loss: 2.027070E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 298.931 | TFLOPs: 18.08 | 31: iteration 81720/ 173500 | consumed samples: 20920320 | consumed tokens: 42844815360 | elapsed time per iteration (s): 0.89 | learning rate: 1.197E-04 | global batch size: 256 | lm loss: 2.015723E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.019 | TFLOPs: 17.36 | 31: iteration 81730/ 173500 | consumed samples: 20922880 | consumed tokens: 42850058240 | elapsed time per iteration (s): 0.81 | learning rate: 1.197E-04 | global batch size: 256 | lm loss: 2.031429E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.773 | TFLOPs: 19.16 | 31: iteration 81740/ 173500 | consumed samples: 20925440 | consumed tokens: 42855301120 | elapsed time per iteration (s): 0.83 | learning rate: 1.197E-04 | global batch size: 256 | lm loss: 1.992326E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.182 | TFLOPs: 18.58 | 31: iteration 81750/ 173500 | consumed samples: 20928000 | consumed tokens: 42860544000 | elapsed time per iteration (s): 0.83 | learning rate: 1.196E-04 | global batch size: 256 | lm loss: 2.034837E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.035 | TFLOPs: 18.57 | 31: iteration 81760/ 173500 | consumed samples: 20930560 | consumed tokens: 42865786880 | elapsed time per iteration (s): 0.82 | learning rate: 1.196E-04 | global batch size: 256 | lm loss: 2.017517E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.631 | TFLOPs: 18.79 | 31: iteration 81770/ 173500 | consumed samples: 20933120 | consumed tokens: 42871029760 | elapsed time per iteration (s): 0.81 | learning rate: 1.196E-04 | global batch size: 256 | lm loss: 
2.020549E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.643 | TFLOPs: 19.16 | 31: iteration 81780/ 173500 | consumed samples: 20935680 | consumed tokens: 42876272640 | elapsed time per iteration (s): 0.80 | learning rate: 1.196E-04 | global batch size: 256 | lm loss: 2.012193E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.196 | TFLOPs: 19.43 | 31: iteration 81790/ 173500 | consumed samples: 20938240 | consumed tokens: 42881515520 | elapsed time per iteration (s): 0.81 | learning rate: 1.196E-04 | global batch size: 256 | lm loss: 2.011837E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.545 | TFLOPs: 19.03 | 31: iteration 81800/ 173500 | consumed samples: 20940800 | consumed tokens: 42886758400 | elapsed time per iteration (s): 0.84 | learning rate: 1.196E-04 | global batch size: 256 | lm loss: 1.987930E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.505 | TFLOPs: 18.54 | 31: iteration 81810/ 173500 | consumed samples: 20943360 | consumed tokens: 42892001280 | elapsed time per iteration (s): 0.82 | learning rate: 1.195E-04 | global batch size: 256 | lm loss: 1.977583E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.930 | TFLOPs: 18.81 | 31: iteration 81820/ 173500 | consumed samples: 20945920 | consumed tokens: 42897244160 | elapsed time per iteration (s): 0.79 | learning rate: 1.195E-04 | global batch size: 256 | lm loss: 2.021666E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.802 | TFLOPs: 19.53 | 31: iteration 81830/ 173500 | consumed samples: 20948480 | consumed tokens: 
42902487040 | elapsed time per iteration (s): 0.81 | learning rate: 1.195E-04 | global batch size: 256 | lm loss: 2.006024E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.447 | TFLOPs: 19.08 | 31: iteration 81840/ 173500 | consumed samples: 20951040 | consumed tokens: 42907729920 | elapsed time per iteration (s): 0.82 | learning rate: 1.195E-04 | global batch size: 256 | lm loss: 2.009666E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.896 | TFLOPs: 18.99 | 31: iteration 81850/ 173500 | consumed samples: 20953600 | consumed tokens: 42912972800 | elapsed time per iteration (s): 0.82 | learning rate: 1.195E-04 | global batch size: 256 | lm loss: 1.989988E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.098 | TFLOPs: 19.00 | 31: iteration 81860/ 173500 | consumed samples: 20956160 | consumed tokens: 42918215680 | elapsed time per iteration (s): 0.81 | learning rate: 1.195E-04 | global batch size: 256 | lm loss: 2.005128E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.398 | TFLOPs: 19.08 | 31: iteration 81870/ 173500 | consumed samples: 20958720 | consumed tokens: 42923458560 | elapsed time per iteration (s): 0.85 | learning rate: 1.194E-04 | global batch size: 256 | lm loss: 2.005280E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.355 | TFLOPs: 18.17 | 31: iteration 81880/ 173500 | consumed samples: 20961280 | consumed tokens: 42928701440 | elapsed time per iteration (s): 0.81 | learning rate: 1.194E-04 | global batch size: 256 | lm loss: 1.990373E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 315.397 | TFLOPs: 19.08 | 31: iteration 81890/ 173500 | consumed samples: 20963840 | consumed tokens: 42933944320 | elapsed time per iteration (s): 0.83 | learning rate: 1.194E-04 | global batch size: 256 | lm loss: 1.996021E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.437 | TFLOPs: 18.66 | 31: iteration 81900/ 173500 | consumed samples: 20966400 | consumed tokens: 42939187200 | elapsed time per iteration (s): 0.76 | learning rate: 1.194E-04 | global batch size: 256 | lm loss: 1.998973E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.107 | TFLOPs: 20.33 | 31: iteration 81910/ 173500 | consumed samples: 20968960 | consumed tokens: 42944430080 | elapsed time per iteration (s): 0.75 | learning rate: 1.194E-04 | global batch size: 256 | lm loss: 2.003567E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.665 | TFLOPs: 20.61 | 31: iteration 81920/ 173500 | consumed samples: 20971520 | consumed tokens: 42949672960 | elapsed time per iteration (s): 0.78 | learning rate: 1.194E-04 | global batch size: 256 | lm loss: 2.009097E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.381 | TFLOPs: 19.87 | 31: iteration 81930/ 173500 | consumed samples: 20974080 | consumed tokens: 42954915840 | elapsed time per iteration (s): 0.78 | learning rate: 1.193E-04 | global batch size: 256 | lm loss: 2.000218E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.364 | TFLOPs: 19.93 | 31: iteration 81940/ 173500 | consumed samples: 20976640 | consumed tokens: 42960158720 | elapsed time per iteration (s): 0.75 | learning rate: 1.193E-04 | global batch size: 256 | lm loss: 1.993032E+00 | grad 
norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.074 | TFLOPs: 20.57 | 31: iteration 81950/ 173500 | consumed samples: 20979200 | consumed tokens: 42965401600 | elapsed time per iteration (s): 0.78 | learning rate: 1.193E-04 | global batch size: 256 | lm loss: 1.995128E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.149 | TFLOPs: 19.79 | 31: iteration 81960/ 173500 | consumed samples: 20981760 | consumed tokens: 42970644480 | elapsed time per iteration (s): 0.85 | learning rate: 1.193E-04 | global batch size: 256 | lm loss: 2.006441E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.779 | TFLOPs: 18.32 | 31: iteration 81970/ 173500 | consumed samples: 20984320 | consumed tokens: 42975887360 | elapsed time per iteration (s): 0.78 | learning rate: 1.193E-04 | global batch size: 256 | lm loss: 2.016065E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.751 | TFLOPs: 19.89 | 31: iteration 81980/ 173500 | consumed samples: 20986880 | consumed tokens: 42981130240 | elapsed time per iteration (s): 0.76 | learning rate: 1.193E-04 | global batch size: 256 | lm loss: 1.980658E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.208 | TFLOPs: 20.34 | 31: iteration 81990/ 173500 | consumed samples: 20989440 | consumed tokens: 42986373120 | elapsed time per iteration (s): 0.79 | learning rate: 1.192E-04 | global batch size: 256 | lm loss: 1.969298E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.969 | TFLOPs: 19.72 | 0: [2022-11-26 12:40:42,398] [INFO] [logging.py:68:log_dist] [Rank 0] step=82000, skipped=0, 
lr=[0.00011923116875818059, 0.00011923116875818059, 0.00011923116875818059], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 82000/ 173500 | consumed samples: 20992000 | consumed tokens: 42991616000 | elapsed time per iteration (s): 0.72 | learning rate: 1.192E-04 | global batch size: 256 | lm loss: 2.008047E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.288 | TFLOPs: 21.43 | 0: steps: 82000 loss: 2.0126 iter time (s): 0.802 samples/sec: 319.138 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 82000 | lm loss value: 2.049759E+00 | lm loss PPL: 7.766027E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 82000 to checkpoints_1b1long 0: [2022-11-26 12:40:42,668] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step82000 is begin to save! 0: [2022-11-26 12:40:42,679] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_01-model_00-model_states.pt... 0: [2022-11-26 12:40:42,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_01-model_00-model_states.pt. 0: [2022-11-26 12:40:42,906] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_03-model_00-model_states.pt... 0: [2022-11-26 12:40:42,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_03-model_00-model_states.pt. 0: [2022-11-26 12:40:42,991] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_04-model_00-model_states.pt... 0: [2022-11-26 12:40:43,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_04-model_00-model_states.pt. 
0: [2022-11-26 12:40:43,073] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_05-model_00-model_states.pt... 0: [2022-11-26 12:40:43,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_05-model_00-model_states.pt. 0: [2022-11-26 12:40:43,153] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_06-model_00-model_states.pt... 0: [2022-11-26 12:40:43,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_06-model_00-model_states.pt. 0: [2022-11-26 12:40:43,225] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_07-model_00-model_states.pt... 0: [2022-11-26 12:40:43,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_07-model_00-model_states.pt. 0: [2022-11-26 12:40:43,302] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_08-model_00-model_states.pt... 0: [2022-11-26 12:40:43,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_08-model_00-model_states.pt. 0: [2022-11-26 12:40:43,374] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_09-model_00-model_states.pt... 0: [2022-11-26 12:40:43,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_09-model_00-model_states.pt. 0: [2022-11-26 12:40:43,451] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_10-model_00-model_states.pt... 0: [2022-11-26 12:40:43,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_10-model_00-model_states.pt. 
0: [2022-11-26 12:40:43,527] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_11-model_00-model_states.pt... 0: [2022-11-26 12:40:43,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_11-model_00-model_states.pt. 0: [2022-11-26 12:40:43,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_12-model_00-model_states.pt... 0: [2022-11-26 12:40:43,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_12-model_00-model_states.pt. 0: [2022-11-26 12:40:43,675] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_13-model_00-model_states.pt... 0: [2022-11-26 12:40:43,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_13-model_00-model_states.pt. 0: [2022-11-26 12:40:43,750] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_14-model_00-model_states.pt... 0: [2022-11-26 12:40:43,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_14-model_00-model_states.pt. 0: [2022-11-26 12:40:43,823] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_15-model_00-model_states.pt... 0: [2022-11-26 12:40:43,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_15-model_00-model_states.pt. 0: [2022-11-26 12:40:43,897] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_16-model_00-model_states.pt... 0: [2022-11-26 12:40:43,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_16-model_00-model_states.pt. 
0: [2022-11-26 12:40:43,974] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_17-model_00-model_states.pt... 0: [2022-11-26 12:40:44,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_17-model_00-model_states.pt. 0: [2022-11-26 12:40:44,049] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_18-model_00-model_states.pt... 0: [2022-11-26 12:40:44,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_18-model_00-model_states.pt. 0: [2022-11-26 12:40:44,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_19-model_00-model_states.pt... 0: [2022-11-26 12:40:44,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_19-model_00-model_states.pt. 0: [2022-11-26 12:40:44,200] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_20-model_00-model_states.pt... 0: [2022-11-26 12:40:44,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_20-model_00-model_states.pt. 0: [2022-11-26 12:40:44,275] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_21-model_00-model_states.pt... 0: [2022-11-26 12:40:44,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_21-model_00-model_states.pt. 0: [2022-11-26 12:40:44,346] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_22-model_00-model_states.pt... 0: [2022-11-26 12:40:44,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_22-model_00-model_states.pt. 
0: [2022-11-26 12:40:44,424] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_23-model_00-model_states.pt... 0: [2022-11-26 12:40:44,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_23-model_00-model_states.pt. 0: [2022-11-26 12:40:44,498] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_24-model_00-model_states.pt... 0: [2022-11-26 12:40:44,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_24-model_00-model_states.pt. 0: [2022-11-26 12:40:44,574] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_25-model_00-model_states.pt... 0: [2022-11-26 12:40:44,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_25-model_00-model_states.pt. 0: [2022-11-26 12:40:44,648] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_26-model_00-model_states.pt... 0: [2022-11-26 12:40:44,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_26-model_00-model_states.pt. 0: [2022-11-26 12:40:44,722] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_27-model_00-model_states.pt... 0: [2022-11-26 12:40:44,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_27-model_00-model_states.pt. 0: [2022-11-26 12:40:44,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_28-model_00-model_states.pt... 0: [2022-11-26 12:40:44,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_28-model_00-model_states.pt. 
0: [2022-11-26 12:40:44,873] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/layer_30-model_00-model_states.pt... 0: [2022-11-26 12:40:44,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/layer_30-model_00-model_states.pt. 0: [2022-11-26 12:40:44,875] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step82000/mp_rank_00_model_states.pt 0: [2022-11-26 12:40:44,875] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/mp_rank_00_model_states.pt... 0: [2022-11-26 12:40:44,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/mp_rank_00_model_states.pt. 31: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 
26: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
7: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
1: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
12: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 
23: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 
14: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 
21: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 
27: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
5: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
10: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 
13: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 
15: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
28: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 
29: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 
21: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
5: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
1: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
15: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 
29: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 
27: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 
10: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 
24: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
5: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:40:44,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:40:45,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:40:45,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 12:40:45,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
24: [2022-11-26 12:40:45,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:40:45,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 12:40:45,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 11: [2022-11-26 12:40:45,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:40:45,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 12:40:45,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 28: [2022-11-26 12:40:45,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:40:45,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:40:45,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 12:40:45,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 2: [2022-11-26 12:40:45,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 12:40:45,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 12:40:45,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 8: [2022-11-26 12:40:45,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:40:45,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:40:45,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 20: [2022-11-26 12:40:45,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 8: [2022-11-26 12:40:45,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 20: [2022-11-26 12:40:45,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 6: [2022-11-26 12:40:45,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:40:45,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:40:45,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 12:40:45,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
8: [2022-11-26 12:40:45,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 12:40:45,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 24: [2022-11-26 12:40:45,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:40:45,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 12:40:45,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 9: [2022-11-26 12:40:45,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:40:45,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 12:40:45,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 29: [2022-11-26 12:40:45,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:40:45,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 12:40:45,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 10: [2022-11-26 12:40:45,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
11: [2022-11-26 12:40:45,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:40:45,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:40:45,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 12:40:45,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 11: [2022-11-26 12:40:45,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 18: [2022-11-26 12:40:45,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 11: [2022-11-26 12:40:45,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 18: [2022-11-26 12:40:45,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 17: [2022-11-26 12:40:45,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:40:45,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
17: [2022-11-26 12:40:45,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 18: [2022-11-26 12:40:45,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:40:45,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 17: [2022-11-26 12:40:45,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 18: [2022-11-26 12:40:45,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 10: [2022-11-26 12:40:45,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 18: [2022-11-26 12:40:45,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 0: [2022-11-26 12:40:45,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:40:45,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 12:40:45,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 27: [2022-11-26 12:40:45,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:40:45,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
27: [2022-11-26 12:40:45,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 19: [2022-11-26 12:40:45,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 27: [2022-11-26 12:40:45,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 19: [2022-11-26 12:40:45,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 9: [2022-11-26 12:40:45,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:40:45,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:40:45,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 7: [2022-11-26 12:40:45,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 12:40:45,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 9: [2022-11-26 12:40:45,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 4: [2022-11-26 12:40:45,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 12:40:45,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 12:40:45,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 12: [2022-11-26 12:40:45,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:40:45,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 12:40:45,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 30: [2022-11-26 12:40:45,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:40:45,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:40:45,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:40:45,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:40:45,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 27: [2022-11-26 12:40:45,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 4: [2022-11-26 12:40:45,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
3: [2022-11-26 12:40:45,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:40:45,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:40:45,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 3: [2022-11-26 12:40:45,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 30: [2022-11-26 12:40:45,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 12:40:45,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 3: [2022-11-26 12:40:45,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 30: [2022-11-26 12:40:45,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 30: [2022-11-26 12:40:45,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 27: [2022-11-26 12:40:45,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 7: [2022-11-26 12:40:45,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 3: [2022-11-26 12:40:45,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 12:40:45,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 12:40:45,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 13: [2022-11-26 12:40:45,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:40:45,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:40:45,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 12:40:45,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 12:40:45,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 13: [2022-11-26 12:40:45,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 22: [2022-11-26 12:40:45,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:40:45,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
22: [2022-11-26 12:40:45,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 12: [2022-11-26 12:40:45,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 22: [2022-11-26 12:40:45,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 12: [2022-11-26 12:40:45,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 24: [2022-11-26 12:40:45,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:40:45,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 6: [2022-11-26 12:40:45,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:40:45,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 24: [2022-11-26 12:40:45,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 6: [2022-11-26 12:40:45,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 11: [2022-11-26 12:40:45,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-26 12:40:45,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 9: [2022-11-26 12:40:45,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:40:45,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 17: [2022-11-26 12:40:45,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:40:45,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 12:40:45,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 17: [2022-11-26 12:40:45,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 12:40:45,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 13: [2022-11-26 12:40:45,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:40:45,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 12:40:45,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 29: [2022-11-26 12:40:45,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
29: [2022-11-26 12:40:45,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 18: [2022-11-26 12:40:45,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:40:45,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 18: [2022-11-26 12:40:45,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 12:40:45,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 7: [2022-11-26 12:40:45,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:40:45,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 12:40:45,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 3: [2022-11-26 12:40:45,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:40:45,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 
3: [2022-11-26 12:40:45,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 19: [2022-11-26 12:40:45,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 3: [2022-11-26 12:40:45,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 19: [2022-11-26 12:40:45,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 29: [2022-11-26 12:40:45,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:40:45,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 12:40:45,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 0: [2022-11-26 12:40:45,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:40:45,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:40:45,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 12:40:45,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 12: [2022-11-26 12:40:45,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 12:40:45,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 12:40:45,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 11: [2022-11-26 12:40:45,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:40:45,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 12:40:45,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 2: [2022-11-26 12:40:45,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:40:45,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 12:40:45,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:40:45,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 12:40:45,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 2: [2022-11-26 12:40:45,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 27: [2022-11-26 12:40:45,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
27: [2022-11-26 12:40:45,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 12:40:45,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 19: [2022-11-26 12:40:45,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:40:45,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 12:40:45,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 30: [2022-11-26 12:40:45,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:40:45,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 12:40:45,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 8: [2022-11-26 12:40:45,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:40:45,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 12:40:45,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 19: [2022-11-26 12:40:45,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-26 12:40:45,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 12:40:45,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 31: [2022-11-26 12:40:45,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:40:45,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 12:40:45,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 18: [2022-11-26 12:40:45,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:40:45,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 7: [2022-11-26 12:40:45,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:40:45,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 7: [2022-11-26 12:40:45,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 12:40:45,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 4: [2022-11-26 12:40:45,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 12:40:45,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 12:40:45,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 20: [2022-11-26 12:40:45,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:40:45,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 12:40:45,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 25: [2022-11-26 12:40:45,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:40:45,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:40:45,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
25: [2022-11-26 12:40:45,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 12:40:45,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 12:40:45,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 14: [2022-11-26 12:40:45,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:40:45,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:40:45,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 25: [2022-11-26 12:40:45,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 25: [2022-11-26 12:40:45,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 14: [2022-11-26 12:40:45,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 12:40:45,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 12:40:45,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 14: [2022-11-26 12:40:45,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
29: [2022-11-26 12:40:45,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:40:45,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 12:40:45,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 20: [2022-11-26 12:40:45,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:40:45,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 12:40:45,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 8: [2022-11-26 12:40:45,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:40:45,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:40:45,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 12:40:45,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
27: [2022-11-26 12:40:45,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 12: [2022-11-26 12:40:45,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:40:45,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 12:40:45,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 27: [2022-11-26 12:40:45,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 24: [2022-11-26 12:40:45,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:40:45,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 12:40:45,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 22: [2022-11-26 12:40:45,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:40:45,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 12:40:45,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 15: [2022-11-26 12:40:45,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-26 12:40:45,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:40:45,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:40:45,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 12:40:45,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 12:40:45,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 15: [2022-11-26 12:40:45,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 17: [2022-11-26 12:40:45,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 25: [2022-11-26 12:40:45,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:40:45,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 17: [2022-11-26 12:40:45,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 25: [2022-11-26 12:40:45,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
28: [2022-11-26 12:40:45,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 6: [2022-11-26 12:40:45,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:40:45,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:40:45,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:40:45,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 28: [2022-11-26 12:40:45,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:40:45,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 12:40:45,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 12:40:45,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 28: [2022-11-26 12:40:45,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 6: [2022-11-26 12:40:45,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
6: [2022-11-26 12:40:45,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 6: [2022-11-26 12:40:45,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 28: [2022-11-26 12:40:45,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 28: [2022-11-26 12:40:45,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:40:45,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 12:40:45,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 28: [2022-11-26 12:40:45,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:40:45,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 12:40:45,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 3: [2022-11-26 12:40:45,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:40:45,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 12:40:45,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
31: [2022-11-26 12:40:45,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:40:45,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:40:45,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:40:45,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 12:40:45,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 10: [2022-11-26 12:40:45,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:40:45,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 10: [2022-11-26 12:40:45,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 31: [2022-11-26 12:40:45,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 10: [2022-11-26 12:40:45,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 31: [2022-11-26 12:40:45,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 31: [2022-11-26 12:40:45,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
1: [2022-11-26 12:40:45,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:40:45,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:40:45,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 4: [2022-11-26 12:40:45,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 1: [2022-11-26 12:40:45,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 4: [2022-11-26 12:40:45,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 23: [2022-11-26 12:40:45,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:40:45,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 2: [2022-11-26 12:40:45,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:40:45,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 2: [2022-11-26 12:40:45,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 12:40:45,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
23: [2022-11-26 12:40:45,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:40:45,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 12:40:45,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 30: [2022-11-26 12:40:45,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:40:45,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 12:40:45,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 26: [2022-11-26 12:40:45,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:40:45,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:40:45,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
26: [2022-11-26 12:40:45,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 12:40:45,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 12:40:45,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 12:40:45,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 26: [2022-11-26 12:40:45,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 26: [2022-11-26 12:40:45,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 23: [2022-11-26 12:40:45,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:40:45,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 12:40:45,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 13: [2022-11-26 12:40:45,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:40:45,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 12:40:45,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
28: [2022-11-26 12:40:45,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:40:45,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:40:45,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 14: [2022-11-26 12:40:45,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 28: [2022-11-26 12:40:45,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 14: [2022-11-26 12:40:45,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 9: [2022-11-26 12:40:45,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:40:45,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:40:45,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 12:40:45,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 12:40:45,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 9: [2022-11-26 12:40:45,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
15: [2022-11-26 12:40:45,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:40:45,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 12:40:45,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 15: [2022-11-26 12:40:45,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:40:45,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 12:40:45,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 22: [2022-11-26 12:40:45,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:40:45,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 12:40:45,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 22: [2022-11-26 12:40:45,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:40:45,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 12:40:45,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
16: [2022-11-26 12:40:45,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:40:45,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:40:45,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 12:40:45,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 12:40:45,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:40:45,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 16: [2022-11-26 12:40:45,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 16: [2022-11-26 12:40:45,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 12:40:45,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 1: [2022-11-26 12:40:45,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:40:45,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 12:40:45,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 12:40:45,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 12:40:45,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 1: [2022-11-26 12:40:45,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 0: [2022-11-26 12:40:45,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:40:45,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 12:40:45,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:40:45,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 0: [2022-11-26 12:40:45,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 12:40:45,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 21: [2022-11-26 12:40:45,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:40:45,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-26 12:40:45,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:40:45,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:40:45,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 12:40:45,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 12:40:45,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 12:40:45,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 12:40:45,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 21: [2022-11-26 12:40:45,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 21: [2022-11-26 12:40:45,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 21: [2022-11-26 12:40:45,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 27: [2022-11-26 12:40:45,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
27: [2022-11-26 12:40:45,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 12:40:45,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 13: [2022-11-26 12:40:45,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:40:45,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 12:40:45,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 8: [2022-11-26 12:40:45,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:40:45,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 12:40:45,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 0: [2022-11-26 12:40:45,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:40:45,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 12:40:45,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 20: [2022-11-26 12:40:45,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-26 12:40:45,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 12:40:45,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 15: [2022-11-26 12:40:45,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:40:45,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 12:40:45,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 24: [2022-11-26 12:40:45,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:40:45,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 12:40:45,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 5: [2022-11-26 12:40:45,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:40:45,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:40:45,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:40:45,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-26 12:40:45,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:40:45,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 12:40:45,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 12:40:45,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 12:40:45,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 12:40:45,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 12:40:45,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 5: [2022-11-26 12:40:45,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 5: [2022-11-26 12:40:45,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 5: [2022-11-26 12:40:45,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 5: [2022-11-26 12:40:45,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
0: [2022-11-26 12:40:45,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 12:40:45,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 3: [2022-11-26 12:40:45,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:40:45,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 12:40:45,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 11: [2022-11-26 12:40:45,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:40:45,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 12:40:45,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 19: [2022-11-26 12:40:45,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:40:45,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 12:40:45,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 18: [2022-11-26 12:40:45,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
18: [2022-11-26 12:40:45,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 12:40:45,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 30: [2022-11-26 12:40:45,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:40:45,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 12:40:45,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 2: [2022-11-26 12:40:45,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:40:45,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 12:40:45,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 31: [2022-11-26 12:40:45,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:40:45,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 12:40:45,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 29: [2022-11-26 12:40:45,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
29: [2022-11-26 12:40:45,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 12:40:45,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 10: [2022-11-26 12:40:45,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:40:45,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 12:40:45,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 17: [2022-11-26 12:40:45,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:40:45,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 12:40:45,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 25: [2022-11-26 12:40:45,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:40:45,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 12:40:45,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 21: [2022-11-26 12:40:45,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-26 12:40:45,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 12:40:45,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 4: [2022-11-26 12:40:45,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:40:45,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 12:40:45,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 7: [2022-11-26 12:40:45,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:40:45,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 12:40:45,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 22: [2022-11-26 12:40:45,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:40:45,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 12:40:45,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 1: [2022-11-26 12:40:45,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 12:40:45,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 12:40:45,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 9: [2022-11-26 12:40:45,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:40:45,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 12:40:45,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 12: [2022-11-26 12:40:45,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:40:45,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 12:40:45,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 26: [2022-11-26 12:40:45,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:40:45,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 12:40:45,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 16: [2022-11-26 12:40:45,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 
16: [2022-11-26 12:40:45,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 12:40:45,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 23: [2022-11-26 12:40:45,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:40:45,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 28: [2022-11-26 12:40:45,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:40:45,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 28: [2022-11-26 12:40:45,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 12:40:45,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 14: [2022-11-26 12:40:45,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:40:45,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 6: [2022-11-26 12:40:45,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 12:40:45,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 14: [2022-11-26 12:40:45,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 6: [2022-11-26 12:40:45,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 0: [2022-11-26 12:40:45,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:40:45,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 12:40:45,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 27: [2022-11-26 12:40:45,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:40:45,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 12:40:45,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 13: [2022-11-26 12:40:45,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:40:45,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 12:40:45,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
8: [2022-11-26 12:40:45,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:40:45,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 12:40:45,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 24: [2022-11-26 12:40:45,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:40:45,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 12:40:45,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 11: [2022-11-26 12:40:45,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:40:45,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 12:40:45,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 5: [2022-11-26 12:40:45,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:40:45,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 12:40:45,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
15: [2022-11-26 12:40:45,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:40:45,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 12:40:45,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 17: [2022-11-26 12:40:45,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:40:45,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 12:40:45,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 3: [2022-11-26 12:40:45,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:40:45,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 12:40:45,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 18: [2022-11-26 12:40:45,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:40:45,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 12:40:45,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
20: [2022-11-26 12:40:45,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:40:45,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 12:40:45,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 30: [2022-11-26 12:40:45,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:40:45,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 12:40:45,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 19: [2022-11-26 12:40:45,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:40:45,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 12:40:45,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 10: [2022-11-26 12:40:45,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 12:40:45,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 7: [2022-11-26 12:40:45,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:40:45,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 7: [2022-11-26 12:40:45,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 12:40:45,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 25: [2022-11-26 12:40:45,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:40:45,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 12:40:45,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 22: [2022-11-26 12:40:45,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:40:45,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:40:45,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 12:40:45,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 1: [2022-11-26 12:40:45,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 2: [2022-11-26 12:40:45,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 1: [2022-11-26 12:40:45,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 22: [2022-11-26 12:40:45,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 12:40:45,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 26: [2022-11-26 12:40:45,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:40:45,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 12:40:45,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 29: [2022-11-26 12:40:45,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:40:45,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 12:40:45,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
31: [2022-11-26 12:40:45,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:40:45,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 12:40:45,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 4: [2022-11-26 12:40:45,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:40:45,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 12:40:45,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 9: [2022-11-26 12:40:45,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:40:45,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 12:40:45,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 12: [2022-11-26 12:40:45,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:40:45,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 12:40:45,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
23: [2022-11-26 12:40:45,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:40:45,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 12:40:45,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 28: [2022-11-26 12:40:45,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:40:45,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 12:40:45,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 5: [2022-11-26 12:40:45,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:40:45,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 12:40:45,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 8: [2022-11-26 12:40:45,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:40:45,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 13: [2022-11-26 12:40:45,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
8: [2022-11-26 12:40:45,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 13: [2022-11-26 12:40:45,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 12:40:45,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 16: [2022-11-26 12:40:45,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:40:45,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 12:40:45,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 15: [2022-11-26 12:40:45,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:40:45,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 12:40:45,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 20: [2022-11-26 12:40:45,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:40:45,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 12:40:45,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
27: [2022-11-26 12:40:45,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:40:45,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 12:40:45,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 14: [2022-11-26 12:40:45,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:40:45,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 12:40:45,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 24: [2022-11-26 12:40:45,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:40:45,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:40:45,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 12:40:45,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 24: [2022-11-26 12:40:45,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 12:40:45,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
21: [2022-11-26 12:40:45,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:40:45,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 0: [2022-11-26 12:40:45,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:40:45,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 0: [2022-11-26 12:40:45,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 12:40:45,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 25: [2022-11-26 12:40:45,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:40:45,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 12:40:45,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 19: [2022-11-26 12:40:45,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:40:45,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 12:40:45,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
17: [2022-11-26 12:40:45,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:40:45,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 12:40:45,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 18: [2022-11-26 12:40:45,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:40:45,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 12:40:45,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 31: [2022-11-26 12:40:45,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:40:45,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 12:40:45,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 11: [2022-11-26 12:40:45,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:40:45,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 12:40:45,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
29: [2022-11-26 12:40:45,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:40:45,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 12:40:45,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 30: [2022-11-26 12:40:45,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:40:45,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 12:40:45,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 3: [2022-11-26 12:40:45,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:40:45,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:40:45,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 12:40:45,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 12: [2022-11-26 12:40:45,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 12:40:45,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
10: [2022-11-26 12:40:45,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:40:45,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 12:40:45,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 2: [2022-11-26 12:40:45,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:40:45,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 12:40:45,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 4: [2022-11-26 12:40:45,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:40:45,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 12:40:45,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 21: [2022-11-26 12:40:45,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:40:45,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 12:40:45,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
7: [2022-11-26 12:40:45,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:40:45,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 12:40:45,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 1: [2022-11-26 12:40:45,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:40:45,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 12:40:45,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 9: [2022-11-26 12:40:45,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:40:45,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:40:45,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 12:40:45,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 8: [2022-11-26 12:40:45,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 12:40:45,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
15: [2022-11-26 12:40:45,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:40:45,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 12:40:45,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 20: [2022-11-26 12:40:45,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:40:45,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 12:40:45,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 30: [2022-11-26 12:40:45,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:40:45,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 12:40:45,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 0: [2022-11-26 12:40:45,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:40:45,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 12:40:45,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
24: [2022-11-26 12:40:45,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:40:45,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:40:45,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 27: [2022-11-26 12:40:45,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:40:45,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 5: [2022-11-26 12:40:45,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 12:40:45,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 27: [2022-11-26 12:40:45,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 12:40:45,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 6: [2022-11-26 12:40:45,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:40:45,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 12:40:45,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
31: [2022-11-26 12:40:45,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:40:45,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 12:40:45,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 4: [2022-11-26 12:40:45,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:40:45,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:40:45,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 12:40:45,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 17: [2022-11-26 12:40:45,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:40:45,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 12:40:45,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 14: [2022-11-26 12:40:45,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 12:40:45,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 12:40:45,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 10: [2022-11-26 12:40:45,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:40:45,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 29: [2022-11-26 12:40:45,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:40:45,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 21: [2022-11-26 12:40:45,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:40:45,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 21: [2022-11-26 12:40:45,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 29: [2022-11-26 12:40:45,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 21: [2022-11-26 12:40:45,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 13: [2022-11-26 12:40:45,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-26 12:40:45,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 12:40:45,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 22: [2022-11-26 12:40:45,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:40:45,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 12:40:45,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 19: [2022-11-26 12:40:45,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:40:45,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 12:40:45,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 25: [2022-11-26 12:40:45,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:40:45,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 12:40:45,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 25: [2022-11-26 12:40:45,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 11: [2022-11-26 12:40:45,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:40:45,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 25: [2022-11-26 12:40:45,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 3: [2022-11-26 12:40:45,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:40:45,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 12:40:45,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 3: [2022-11-26 12:40:45,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 12:40:45,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 28: [2022-11-26 12:40:45,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 12:40:45,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
26: [2022-11-26 12:40:45,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:40:45,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 12:40:45,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 23: [2022-11-26 12:40:45,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:40:45,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 12:40:45,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 1: [2022-11-26 12:40:45,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:40:45,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 12:40:45,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 7: [2022-11-26 12:40:45,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:40:45,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 12:40:45,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
18: [2022-11-26 12:40:45,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:40:45,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 12:40:45,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 23: [2022-11-26 12:40:45,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:40:45,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 12:40:45,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 1: [2022-11-26 12:40:45,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:40:45,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 12:40:45,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 22: [2022-11-26 12:40:45,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:40:45,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 12:40:45,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 22: [2022-11-26 12:40:45,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 12: [2022-11-26 12:40:45,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 22: [2022-11-26 12:40:45,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 26: [2022-11-26 12:40:45,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:40:45,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 12:40:45,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 14: [2022-11-26 12:40:45,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:40:45,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 12:40:45,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 26: [2022-11-26 12:40:45,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
26: [2022-11-26 12:40:45,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 12:40:45,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 23: [2022-11-26 12:40:45,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:40:45,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 12:40:45,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 16: [2022-11-26 12:40:45,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:40:45,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 12:40:45,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 14: [2022-11-26 12:40:45,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:40:45,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 12:40:45,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 16: [2022-11-26 12:40:45,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
16: [2022-11-26 12:40:45,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:40:45,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 12:40:45,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step82000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 12:40:45,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 16: [2022-11-26 12:40:45,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 0: successfully saved checkpoint at iteration 82000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2546.56 31: iteration 82010/ 173500 | consumed samples: 20994560 | consumed tokens: 42996858880 | elapsed time per iteration (s): 1.06 | learning rate: 1.192E-04 | global batch size: 256 | lm loss: 2.029990E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.869 | TFLOPs: 14.57 | 31: iteration 82020/ 173500 | consumed samples: 20997120 | consumed tokens: 43002101760 | elapsed time per iteration (s): 0.83 | learning rate: 1.192E-04 | global batch size: 256 | lm loss: 2.016487E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.734 | TFLOPs: 18.74 | 31: iteration 82030/ 173500 | consumed samples: 20999680 | consumed tokens: 43007344640 | elapsed time per iteration (s): 0.78 | learning rate: 1.192E-04 | global batch size: 256 | lm loss: 2.045121E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.445 | TFLOPs: 19.75 | 31: iteration 82040/ 
173500 | consumed samples: 21002240 | consumed tokens: 43012587520 | elapsed time per iteration (s): 0.79 | learning rate: 1.192E-04 | global batch size: 256 | lm loss: 1.954400E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.570 | TFLOPs: 19.58 | 31: iteration 82050/ 173500 | consumed samples: 21004800 | consumed tokens: 43017830400 | elapsed time per iteration (s): 0.83 | learning rate: 1.191E-04 | global batch size: 256 | lm loss: 2.020918E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.883 | TFLOPs: 18.69 | 31: iteration 82060/ 173500 | consumed samples: 21007360 | consumed tokens: 43023073280 | elapsed time per iteration (s): 0.95 | learning rate: 1.191E-04 | global batch size: 256 | lm loss: 2.027913E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 268.096 | TFLOPs: 16.22 | 31: iteration 82070/ 173500 | consumed samples: 21009920 | consumed tokens: 43028316160 | elapsed time per iteration (s): 0.82 | learning rate: 1.191E-04 | global batch size: 256 | lm loss: 2.005437E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.887 | TFLOPs: 18.87 | 31: iteration 82080/ 173500 | consumed samples: 21012480 | consumed tokens: 43033559040 | elapsed time per iteration (s): 0.80 | learning rate: 1.191E-04 | global batch size: 256 | lm loss: 1.982337E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.125 | TFLOPs: 19.31 | 31: iteration 82090/ 173500 | consumed samples: 21015040 | consumed tokens: 43038801920 | elapsed time per iteration (s): 0.81 | learning rate: 1.191E-04 | global batch size: 256 | lm loss: 2.020140E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 316.898 | TFLOPs: 19.17 | 31: iteration 82100/ 173500 | consumed samples: 21017600 | consumed tokens: 43044044800 | elapsed time per iteration (s): 0.79 | learning rate: 1.191E-04 | global batch size: 256 | lm loss: 1.979857E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.078 | TFLOPs: 19.61 | 31: iteration 82110/ 173500 | consumed samples: 21020160 | consumed tokens: 43049287680 | elapsed time per iteration (s): 0.80 | learning rate: 1.191E-04 | global batch size: 256 | lm loss: 1.987536E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.610 | TFLOPs: 19.34 | 31: iteration 82120/ 173500 | consumed samples: 21022720 | consumed tokens: 43054530560 | elapsed time per iteration (s): 0.73 | learning rate: 1.190E-04 | global batch size: 256 | lm loss: 2.017220E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.102 | TFLOPs: 21.30 | 31: iteration 82130/ 173500 | consumed samples: 21025280 | consumed tokens: 43059773440 | elapsed time per iteration (s): 0.77 | learning rate: 1.190E-04 | global batch size: 256 | lm loss: 1.989724E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.683 | TFLOPs: 20.01 | 31: iteration 82140/ 173500 | consumed samples: 21027840 | consumed tokens: 43065016320 | elapsed time per iteration (s): 0.77 | learning rate: 1.190E-04 | global batch size: 256 | lm loss: 2.012383E+00 | grad norm: 0.241 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.152 | TFLOPs: 20.03 | 31: iteration 82150/ 173500 | consumed samples: 21030400 | consumed tokens: 43070259200 | elapsed time per iteration (s): 0.81 | learning rate: 
1.190E-04 | global batch size: 256 | lm loss: 2.012773E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.114 | TFLOPs: 19.12 | 31: iteration 82160/ 173500 | consumed samples: 21032960 | consumed tokens: 43075502080 | elapsed time per iteration (s): 0.82 | learning rate: 1.190E-04 | global batch size: 256 | lm loss: 2.037685E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.854 | TFLOPs: 18.87 | 31: iteration 82170/ 173500 | consumed samples: 21035520 | consumed tokens: 43080744960 | elapsed time per iteration (s): 0.81 | learning rate: 1.190E-04 | global batch size: 256 | lm loss: 2.029443E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.401 | TFLOPs: 19.08 | 31: iteration 82180/ 173500 | consumed samples: 21038080 | consumed tokens: 43085987840 | elapsed time per iteration (s): 0.79 | learning rate: 1.189E-04 | global batch size: 256 | lm loss: 2.001313E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.867 | TFLOPs: 19.59 | 31: iteration 82190/ 173500 | consumed samples: 21040640 | consumed tokens: 43091230720 | elapsed time per iteration (s): 0.78 | learning rate: 1.189E-04 | global batch size: 256 | lm loss: 2.008939E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.015 | TFLOPs: 19.78 | 31: iteration 82200/ 173500 | consumed samples: 21043200 | consumed tokens: 43096473600 | elapsed time per iteration (s): 0.79 | learning rate: 1.189E-04 | global batch size: 256 | lm loss: 2.024362E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.519 | TFLOPs: 19.63 | 31: iteration 82210/ 173500 | 
consumed samples: 21045760 | consumed tokens: 43101716480 | elapsed time per iteration (s): 0.75 | learning rate: 1.189E-04 | global batch size: 256 | lm loss: 1.993336E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.152 | TFLOPs: 20.52 | 31: iteration 82220/ 173500 | consumed samples: 21048320 | consumed tokens: 43106959360 | elapsed time per iteration (s): 0.78 | learning rate: 1.189E-04 | global batch size: 256 | lm loss: 2.012369E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.871 | TFLOPs: 19.96 | 31: iteration 82230/ 173500 | consumed samples: 21050880 | consumed tokens: 43112202240 | elapsed time per iteration (s): 0.81 | learning rate: 1.189E-04 | global batch size: 256 | lm loss: 1.990487E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.273 | TFLOPs: 19.13 | 31: iteration 82240/ 173500 | consumed samples: 21053440 | consumed tokens: 43117445120 | elapsed time per iteration (s): 0.83 | learning rate: 1.188E-04 | global batch size: 256 | lm loss: 2.009050E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.356 | TFLOPs: 18.72 | 31: iteration 82250/ 173500 | consumed samples: 21056000 | consumed tokens: 43122688000 | elapsed time per iteration (s): 0.75 | learning rate: 1.188E-04 | global batch size: 256 | lm loss: 2.009857E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.602 | TFLOPs: 20.55 | 31: iteration 82260/ 173500 | consumed samples: 21058560 | consumed tokens: 43127930880 | elapsed time per iteration (s): 0.75 | learning rate: 1.188E-04 | global batch size: 256 | lm loss: 2.023185E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 342.525 | TFLOPs: 20.72 | 31: iteration 82270/ 173500 | consumed samples: 21061120 | consumed tokens: 43133173760 | elapsed time per iteration (s): 0.79 | learning rate: 1.188E-04 | global batch size: 256 | lm loss: 2.027816E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.107 | TFLOPs: 19.61 | 31: iteration 82280/ 173500 | consumed samples: 21063680 | consumed tokens: 43138416640 | elapsed time per iteration (s): 0.77 | learning rate: 1.188E-04 | global batch size: 256 | lm loss: 2.016662E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.147 | TFLOPs: 20.22 | 31: iteration 82290/ 173500 | consumed samples: 21066240 | consumed tokens: 43143659520 | elapsed time per iteration (s): 0.79 | learning rate: 1.188E-04 | global batch size: 256 | lm loss: 2.007964E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.191 | TFLOPs: 19.49 | 31: iteration 82300/ 173500 | consumed samples: 21068800 | consumed tokens: 43148902400 | elapsed time per iteration (s): 0.80 | learning rate: 1.187E-04 | global batch size: 256 | lm loss: 1.990650E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.112 | TFLOPs: 19.31 | 31: iteration 82310/ 173500 | consumed samples: 21071360 | consumed tokens: 43154145280 | elapsed time per iteration (s): 0.81 | learning rate: 1.187E-04 | global batch size: 256 | lm loss: 1.978981E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.424 | TFLOPs: 19.02 | 31: iteration 82320/ 173500 | consumed samples: 21073920 | consumed tokens: 43159388160 | elapsed time per iteration (s): 0.81 | learning rate: 1.187E-04 | global batch 
size: 256 | lm loss: 1.981724E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.266 | TFLOPs: 19.07 | 31: iteration 82330/ 173500 | consumed samples: 21076480 | consumed tokens: 43164631040 | elapsed time per iteration (s): 0.86 | learning rate: 1.187E-04 | global batch size: 256 | lm loss: 1.980473E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.379 | TFLOPs: 18.11 | 31: iteration 82340/ 173500 | consumed samples: 21079040 | consumed tokens: 43169873920 | elapsed time per iteration (s): 0.81 | learning rate: 1.187E-04 | global batch size: 256 | lm loss: 2.028960E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.927 | TFLOPs: 19.23 | 31: iteration 82350/ 173500 | consumed samples: 21081600 | consumed tokens: 43175116800 | elapsed time per iteration (s): 0.80 | learning rate: 1.187E-04 | global batch size: 256 | lm loss: 1.978273E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.437 | TFLOPs: 19.26 | 31: iteration 82360/ 173500 | consumed samples: 21084160 | consumed tokens: 43180359680 | elapsed time per iteration (s): 0.82 | learning rate: 1.186E-04 | global batch size: 256 | lm loss: 1.999885E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.569 | TFLOPs: 18.97 | 31: iteration 82370/ 173500 | consumed samples: 21086720 | consumed tokens: 43185602560 | elapsed time per iteration (s): 0.83 | learning rate: 1.186E-04 | global batch size: 256 | lm loss: 2.011149E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.778 | TFLOPs: 18.68 | 31: iteration 82380/ 173500 | consumed samples: 21089280 | 
consumed tokens: 43190845440 | elapsed time per iteration (s): 0.81 | learning rate: 1.186E-04 | global batch size: 256 | lm loss: 1.980001E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.773 | TFLOPs: 19.16 | 31: iteration 82390/ 173500 | consumed samples: 21091840 | consumed tokens: 43196088320 | elapsed time per iteration (s): 0.79 | learning rate: 1.186E-04 | global batch size: 256 | lm loss: 1.993375E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.634 | TFLOPs: 19.70 | 31: iteration 82400/ 173500 | consumed samples: 21094400 | consumed tokens: 43201331200 | elapsed time per iteration (s): 0.79 | learning rate: 1.186E-04 | global batch size: 256 | lm loss: 1.982912E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.154 | TFLOPs: 19.61 | 31: iteration 82410/ 173500 | consumed samples: 21096960 | consumed tokens: 43206574080 | elapsed time per iteration (s): 0.83 | learning rate: 1.186E-04 | global batch size: 256 | lm loss: 1.986509E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.461 | TFLOPs: 18.72 | 31: iteration 82420/ 173500 | consumed samples: 21099520 | consumed tokens: 43211816960 | elapsed time per iteration (s): 0.80 | learning rate: 1.185E-04 | global batch size: 256 | lm loss: 1.993575E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.415 | TFLOPs: 19.38 | 31: iteration 82430/ 173500 | consumed samples: 21102080 | consumed tokens: 43217059840 | elapsed time per iteration (s): 0.81 | learning rate: 1.185E-04 | global batch size: 256 | lm loss: 2.019475E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 315.689 | TFLOPs: 19.10 | 31: iteration 82440/ 173500 | consumed samples: 21104640 | consumed tokens: 43222302720 | elapsed time per iteration (s): 0.82 | learning rate: 1.185E-04 | global batch size: 256 | lm loss: 1.988473E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.738 | TFLOPs: 18.86 | 31: iteration 82450/ 173500 | consumed samples: 21107200 | consumed tokens: 43227545600 | elapsed time per iteration (s): 0.78 | learning rate: 1.185E-04 | global batch size: 256 | lm loss: 2.017828E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.864 | TFLOPs: 19.77 | 31: iteration 82460/ 173500 | consumed samples: 21109760 | consumed tokens: 43232788480 | elapsed time per iteration (s): 0.82 | learning rate: 1.185E-04 | global batch size: 256 | lm loss: 2.017419E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.056 | TFLOPs: 18.94 | 31: iteration 82470/ 173500 | consumed samples: 21112320 | consumed tokens: 43238031360 | elapsed time per iteration (s): 0.83 | learning rate: 1.185E-04 | global batch size: 256 | lm loss: 1.977698E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.121 | TFLOPs: 18.58 | 31: iteration 82480/ 173500 | consumed samples: 21114880 | consumed tokens: 43243274240 | elapsed time per iteration (s): 0.85 | learning rate: 1.184E-04 | global batch size: 256 | lm loss: 1.985469E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.633 | TFLOPs: 18.31 | 31: iteration 82490/ 173500 | consumed samples: 21117440 | consumed tokens: 43248517120 | elapsed time per iteration (s): 0.74 | learning rate: 1.184E-04 | global batch size: 256 | lm loss: 
1.987045E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.125 | TFLOPs: 20.88 | 31: iteration 82500/ 173500 | consumed samples: 21120000 | consumed tokens: 43253760000 | elapsed time per iteration (s): 0.79 | learning rate: 1.184E-04 | global batch size: 256 | lm loss: 2.026900E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.765 | TFLOPs: 19.59 | 31: iteration 82510/ 173500 | consumed samples: 21122560 | consumed tokens: 43259002880 | elapsed time per iteration (s): 0.71 | learning rate: 1.184E-04 | global batch size: 256 | lm loss: 1.990494E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 361.890 | TFLOPs: 21.89 | 31: iteration 82520/ 173500 | consumed samples: 21125120 | consumed tokens: 43264245760 | elapsed time per iteration (s): 0.80 | learning rate: 1.184E-04 | global batch size: 256 | lm loss: 2.010203E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.463 | TFLOPs: 19.45 | 31: iteration 82530/ 173500 | consumed samples: 21127680 | consumed tokens: 43269488640 | elapsed time per iteration (s): 0.75 | learning rate: 1.184E-04 | global batch size: 256 | lm loss: 2.019532E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.100 | TFLOPs: 20.51 | 31: iteration 82540/ 173500 | consumed samples: 21130240 | consumed tokens: 43274731520 | elapsed time per iteration (s): 0.72 | learning rate: 1.183E-04 | global batch size: 256 | lm loss: 1.991838E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.247 | TFLOPs: 21.49 | 31: iteration 82550/ 173500 | consumed samples: 21132800 | consumed tokens: 
43279974400 | elapsed time per iteration (s): 0.73 | learning rate: 1.183E-04 | global batch size: 256 | lm loss: 2.016808E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.036 | TFLOPs: 21.24 | 31: iteration 82560/ 173500 | consumed samples: 21135360 | consumed tokens: 43285217280 | elapsed time per iteration (s): 0.76 | learning rate: 1.183E-04 | global batch size: 256 | lm loss: 2.003287E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.543 | TFLOPs: 20.48 | 31: iteration 82570/ 173500 | consumed samples: 21137920 | consumed tokens: 43290460160 | elapsed time per iteration (s): 0.72 | learning rate: 1.183E-04 | global batch size: 256 | lm loss: 2.000133E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.113 | TFLOPs: 21.42 | 31: iteration 82580/ 173500 | consumed samples: 21140480 | consumed tokens: 43295703040 | elapsed time per iteration (s): 0.80 | learning rate: 1.183E-04 | global batch size: 256 | lm loss: 2.035971E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.395 | TFLOPs: 19.38 | 31: iteration 82590/ 173500 | consumed samples: 21143040 | consumed tokens: 43300945920 | elapsed time per iteration (s): 0.85 | learning rate: 1.183E-04 | global batch size: 256 | lm loss: 1.999166E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.266 | TFLOPs: 18.23 | 31: iteration 82600/ 173500 | consumed samples: 21145600 | consumed tokens: 43306188800 | elapsed time per iteration (s): 0.91 | learning rate: 1.182E-04 | global batch size: 256 | lm loss: 2.023483E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 282.777 | TFLOPs: 17.11 | 31: iteration 82610/ 173500 | consumed samples: 21148160 | consumed tokens: 43311431680 | elapsed time per iteration (s): 0.77 | learning rate: 1.182E-04 | global batch size: 256 | lm loss: 1.995467E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.567 | TFLOPs: 20.18 | 31: iteration 82620/ 173500 | consumed samples: 21150720 | consumed tokens: 43316674560 | elapsed time per iteration (s): 0.81 | learning rate: 1.182E-04 | global batch size: 256 | lm loss: 2.006069E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.747 | TFLOPs: 19.16 | 31: iteration 82630/ 173500 | consumed samples: 21153280 | consumed tokens: 43321917440 | elapsed time per iteration (s): 0.81 | learning rate: 1.182E-04 | global batch size: 256 | lm loss: 2.015763E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.168 | TFLOPs: 19.19 | 31: iteration 82640/ 173500 | consumed samples: 21155840 | consumed tokens: 43327160320 | elapsed time per iteration (s): 0.78 | learning rate: 1.182E-04 | global batch size: 256 | lm loss: 2.010373E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.361 | TFLOPs: 19.87 | 31: iteration 82650/ 173500 | consumed samples: 21158400 | consumed tokens: 43332403200 | elapsed time per iteration (s): 0.77 | learning rate: 1.182E-04 | global batch size: 256 | lm loss: 2.001480E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.581 | TFLOPs: 20.18 | 31: iteration 82660/ 173500 | consumed samples: 21160960 | consumed tokens: 43337646080 | elapsed time per iteration (s): 0.75 | learning rate: 1.181E-04 | global batch size: 256 | lm loss: 2.017215E+00 | grad 
norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.756 | TFLOPs: 20.55 | 31: iteration 82670/ 173500 | consumed samples: 21163520 | consumed tokens: 43342888960 | elapsed time per iteration (s): 0.77 | learning rate: 1.181E-04 | global batch size: 256 | lm loss: 1.992820E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.701 | TFLOPs: 20.13 | 31: iteration 82680/ 173500 | consumed samples: 21166080 | consumed tokens: 43348131840 | elapsed time per iteration (s): 0.76 | learning rate: 1.181E-04 | global batch size: 256 | lm loss: 1.989570E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.568 | TFLOPs: 20.42 | 31: iteration 82690/ 173500 | consumed samples: 21168640 | consumed tokens: 43353374720 | elapsed time per iteration (s): 0.73 | learning rate: 1.181E-04 | global batch size: 256 | lm loss: 2.001230E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.928 | TFLOPs: 21.35 | 31: iteration 82700/ 173500 | consumed samples: 21171200 | consumed tokens: 43358617600 | elapsed time per iteration (s): 0.78 | learning rate: 1.181E-04 | global batch size: 256 | lm loss: 2.014508E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.002 | TFLOPs: 19.90 | 31: iteration 82710/ 173500 | consumed samples: 21173760 | consumed tokens: 43363860480 | elapsed time per iteration (s): 0.76 | learning rate: 1.181E-04 | global batch size: 256 | lm loss: 1.985387E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.834 | TFLOPs: 20.38 | 31: iteration 82720/ 173500 | consumed samples: 21176320 | consumed tokens: 43369103360 | elapsed time 
per iteration (s): 0.77 | learning rate: 1.181E-04 | global batch size: 256 | lm loss: 1.994342E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.961 | TFLOPs: 20.08 | 31: iteration 82730/ 173500 | consumed samples: 21178880 | consumed tokens: 43374346240 | elapsed time per iteration (s): 0.73 | learning rate: 1.180E-04 | global batch size: 256 | lm loss: 2.005643E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.708 | TFLOPs: 21.10 | 31: iteration 82740/ 173500 | consumed samples: 21181440 | consumed tokens: 43379589120 | elapsed time per iteration (s): 0.83 | learning rate: 1.180E-04 | global batch size: 256 | lm loss: 2.011250E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.791 | TFLOPs: 18.62 | 31: iteration 82750/ 173500 | consumed samples: 21184000 | consumed tokens: 43384832000 | elapsed time per iteration (s): 0.79 | learning rate: 1.180E-04 | global batch size: 256 | lm loss: 2.020922E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.449 | TFLOPs: 19.57 | 31: iteration 82760/ 173500 | consumed samples: 21186560 | consumed tokens: 43390074880 | elapsed time per iteration (s): 0.84 | learning rate: 1.180E-04 | global batch size: 256 | lm loss: 2.029354E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.411 | TFLOPs: 18.48 | 31: iteration 82770/ 173500 | consumed samples: 21189120 | consumed tokens: 43395317760 | elapsed time per iteration (s): 0.78 | learning rate: 1.180E-04 | global batch size: 256 | lm loss: 2.003670E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.247 | TFLOPs: 
19.98 | 31: iteration 82780/ 173500 | consumed samples: 21191680 | consumed tokens: 43400560640 | elapsed time per iteration (s): 0.76 | learning rate: 1.180E-04 | global batch size: 256 | lm loss: 2.012538E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.631 | TFLOPs: 20.43 | 31: iteration 82790/ 173500 | consumed samples: 21194240 | consumed tokens: 43405803520 | elapsed time per iteration (s): 0.81 | learning rate: 1.179E-04 | global batch size: 256 | lm loss: 2.042059E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.437 | TFLOPs: 19.20 | 31: iteration 82800/ 173500 | consumed samples: 21196800 | consumed tokens: 43411046400 | elapsed time per iteration (s): 0.76 | learning rate: 1.179E-04 | global batch size: 256 | lm loss: 2.000349E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.336 | TFLOPs: 20.35 | 31: iteration 82810/ 173500 | consumed samples: 21199360 | consumed tokens: 43416289280 | elapsed time per iteration (s): 0.82 | learning rate: 1.179E-04 | global batch size: 256 | lm loss: 1.993583E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.706 | TFLOPs: 18.92 | 31: iteration 82820/ 173500 | consumed samples: 21201920 | consumed tokens: 43421532160 | elapsed time per iteration (s): 0.78 | learning rate: 1.179E-04 | global batch size: 256 | lm loss: 1.983593E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.374 | TFLOPs: 19.93 | 31: iteration 82830/ 173500 | consumed samples: 21204480 | consumed tokens: 43426775040 | elapsed time per iteration (s): 0.84 | learning rate: 1.179E-04 | global batch size: 256 | lm loss: 2.010695E+00 | grad norm: 0.152 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.285 | TFLOPs: 18.41 |
31: iteration 82840/ 173500 | consumed samples: 21207040 | consumed tokens: 43432017920 | elapsed time per iteration (s): 0.79 | learning rate: 1.179E-04 | global batch size: 256 | lm loss: 1.992167E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.033 | TFLOPs: 19.60 |
31: iteration 82850/ 173500 | consumed samples: 21209600 | consumed tokens: 43437260800 | elapsed time per iteration (s): 0.74 | learning rate: 1.178E-04 | global batch size: 256 | lm loss: 1.994296E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.691 | TFLOPs: 20.79 |
31: iteration 82860/ 173500 | consumed samples: 21212160 | consumed tokens: 43442503680 | elapsed time per iteration (s): 0.75 | learning rate: 1.178E-04 | global batch size: 256 | lm loss: 2.008906E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.529 | TFLOPs: 20.66 |
31: iteration 82870/ 173500 | consumed samples: 21214720 | consumed tokens: 43447746560 | elapsed time per iteration (s): 0.73 | learning rate: 1.178E-04 | global batch size: 256 | lm loss: 2.020778E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.275 | TFLOPs: 21.25 |
31: iteration 82880/ 173500 | consumed samples: 21217280 | consumed tokens: 43452989440 | elapsed time per iteration (s): 0.75 | learning rate: 1.178E-04 | global batch size: 256 | lm loss: 2.026103E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.850 | TFLOPs: 20.68 |
31: iteration 82890/ 173500 | consumed samples: 21219840 | consumed tokens: 43458232320 | elapsed time per iteration (s): 0.75 | learning rate: 1.178E-04 | global batch size: 256 | lm loss: 2.018855E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.071 | TFLOPs: 20.69 |
31: iteration 82900/ 173500 | consumed samples: 21222400 | consumed tokens: 43463475200 | elapsed time per iteration (s): 0.73 | learning rate: 1.178E-04 | global batch size: 256 | lm loss: 1.998979E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.998 | TFLOPs: 21.11 |
31: iteration 82910/ 173500 | consumed samples: 21224960 | consumed tokens: 43468718080 | elapsed time per iteration (s): 0.79 | learning rate: 1.177E-04 | global batch size: 256 | lm loss: 2.018199E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.183 | TFLOPs: 19.49 |
31: iteration 82920/ 173500 | consumed samples: 21227520 | consumed tokens: 43473960960 | elapsed time per iteration (s): 0.77 | learning rate: 1.177E-04 | global batch size: 256 | lm loss: 2.000407E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.887 | TFLOPs: 20.08 |
31: iteration 82930/ 173500 | consumed samples: 21230080 | consumed tokens: 43479203840 | elapsed time per iteration (s): 0.80 | learning rate: 1.177E-04 | global batch size: 256 | lm loss: 1.985553E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.726 | TFLOPs: 19.40 |
31: iteration 82940/ 173500 | consumed samples: 21232640 | consumed tokens: 43484446720 | elapsed time per iteration (s): 0.75 | learning rate: 1.177E-04 | global batch size: 256 | lm loss: 2.009960E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.346 | TFLOPs: 20.59 |
31: iteration 82950/ 173500 | consumed samples: 21235200 | consumed tokens: 43489689600 | elapsed time per iteration (s): 0.82 | learning rate: 1.177E-04 | global batch size: 256 | lm loss: 1.991351E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.932 | TFLOPs: 18.99 |
31: iteration 82960/ 173500 | consumed samples: 21237760 | consumed tokens: 43494932480 | elapsed time per iteration (s): 0.75 | learning rate: 1.177E-04 | global batch size: 256 | lm loss: 1.998053E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.594 | TFLOPs: 20.79 |
31: iteration 82970/ 173500 | consumed samples: 21240320 | consumed tokens: 43500175360 | elapsed time per iteration (s): 0.77 | learning rate: 1.176E-04 | global batch size: 256 | lm loss: 2.023444E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.992 | TFLOPs: 20.02 |
31: iteration 82980/ 173500 | consumed samples: 21242880 | consumed tokens: 43505418240 | elapsed time per iteration (s): 0.77 | learning rate: 1.176E-04 | global batch size: 256 | lm loss: 2.007299E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.479 | TFLOPs: 19.99 |
31: iteration 82990/ 173500 | consumed samples: 21245440 | consumed tokens: 43510661120 | elapsed time per iteration (s): 0.73 | learning rate: 1.176E-04 | global batch size: 256 | lm loss: 2.021267E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.372 | TFLOPs: 21.26 |
31: iteration 83000/ 173500 | consumed samples: 21248000 | consumed tokens: 43515904000 | elapsed time per iteration (s): 0.72 | learning rate: 1.176E-04 | global batch size: 256 | lm loss: 2.001777E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.062 | TFLOPs: 21.54 |
31: -------------------------------------------------------------------------------------------
31: valid loss at iteration 83000 | lm loss value: 1.853103E+00 | lm loss PPL: 6.379585E+00 |
31: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 83000 to checkpoints_1b1long
0: [2022-11-26 12:53:52,665] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step83000 is begin to save!
0: [2022-11-26 12:53:52,677] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_01-model_00-model_states.pt...
0: [2022-11-26 12:53:52,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_01-model_00-model_states.pt.
0: [2022-11-26 12:53:52,892] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_03-model_00-model_states.pt...
0: [2022-11-26 12:53:52,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_03-model_00-model_states.pt.
0: [2022-11-26 12:53:52,971] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_04-model_00-model_states.pt...
0: [2022-11-26 12:53:53,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_04-model_00-model_states.pt.
0: [2022-11-26 12:53:53,051] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_05-model_00-model_states.pt...
0: [2022-11-26 12:53:53,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_05-model_00-model_states.pt.
0: [2022-11-26 12:53:53,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_06-model_00-model_states.pt... 0: [2022-11-26 12:53:53,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_06-model_00-model_states.pt. 0: [2022-11-26 12:53:53,206] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_07-model_00-model_states.pt... 0: [2022-11-26 12:53:53,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_07-model_00-model_states.pt. 0: [2022-11-26 12:53:53,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_08-model_00-model_states.pt... 0: [2022-11-26 12:53:53,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_08-model_00-model_states.pt. 0: [2022-11-26 12:53:53,370] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_09-model_00-model_states.pt... 0: [2022-11-26 12:53:53,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_09-model_00-model_states.pt. 0: [2022-11-26 12:53:53,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_10-model_00-model_states.pt... 0: [2022-11-26 12:53:53,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_10-model_00-model_states.pt. 0: [2022-11-26 12:53:53,526] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_11-model_00-model_states.pt... 0: [2022-11-26 12:53:53,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_11-model_00-model_states.pt. 
0: [2022-11-26 12:53:53,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_12-model_00-model_states.pt... 0: [2022-11-26 12:53:53,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_12-model_00-model_states.pt. 0: [2022-11-26 12:53:53,679] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_13-model_00-model_states.pt... 0: [2022-11-26 12:53:53,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_13-model_00-model_states.pt. 0: [2022-11-26 12:53:53,759] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_14-model_00-model_states.pt... 0: [2022-11-26 12:53:53,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_14-model_00-model_states.pt. 0: [2022-11-26 12:53:53,837] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_15-model_00-model_states.pt... 0: [2022-11-26 12:53:53,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_15-model_00-model_states.pt. 0: [2022-11-26 12:53:53,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_16-model_00-model_states.pt... 0: [2022-11-26 12:53:53,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_16-model_00-model_states.pt. 0: [2022-11-26 12:53:53,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_17-model_00-model_states.pt... 0: [2022-11-26 12:53:54,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_17-model_00-model_states.pt. 
0: [2022-11-26 12:53:54,070] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_18-model_00-model_states.pt... 0: [2022-11-26 12:53:54,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_18-model_00-model_states.pt. 0: [2022-11-26 12:53:54,150] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_19-model_00-model_states.pt... 0: [2022-11-26 12:53:54,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_19-model_00-model_states.pt. 0: [2022-11-26 12:53:54,229] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_20-model_00-model_states.pt... 0: [2022-11-26 12:53:54,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_20-model_00-model_states.pt. 0: [2022-11-26 12:53:54,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_21-model_00-model_states.pt... 0: [2022-11-26 12:53:54,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_21-model_00-model_states.pt. 0: [2022-11-26 12:53:54,386] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_22-model_00-model_states.pt... 0: [2022-11-26 12:53:54,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_22-model_00-model_states.pt. 0: [2022-11-26 12:53:54,461] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_23-model_00-model_states.pt... 0: [2022-11-26 12:53:54,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_23-model_00-model_states.pt. 
0: [2022-11-26 12:53:54,538] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_24-model_00-model_states.pt... 0: [2022-11-26 12:53:54,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_24-model_00-model_states.pt. 0: [2022-11-26 12:53:54,620] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_25-model_00-model_states.pt... 0: [2022-11-26 12:53:54,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_25-model_00-model_states.pt. 0: [2022-11-26 12:53:54,696] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_26-model_00-model_states.pt... 0: [2022-11-26 12:53:54,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_26-model_00-model_states.pt. 0: [2022-11-26 12:53:54,777] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_27-model_00-model_states.pt... 0: [2022-11-26 12:53:54,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_27-model_00-model_states.pt. 0: [2022-11-26 12:53:54,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_28-model_00-model_states.pt... 0: [2022-11-26 12:53:54,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_28-model_00-model_states.pt. 0: [2022-11-26 12:53:54,928] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/layer_30-model_00-model_states.pt... 0: [2022-11-26 12:53:54,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/layer_30-model_00-model_states.pt. 
0: [2022-11-26 12:53:54,933] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step83000/mp_rank_00_model_states.pt 0: [2022-11-26 12:53:54,933] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/mp_rank_00_model_states.pt... 0: [2022-11-26 12:53:54,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/mp_rank_00_model_states.pt. 0: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 
7: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
8: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 
16: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 
12: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
11: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 
29: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 
21: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
0: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
4: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
1: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 
12: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 
23: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 
31: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 
17: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 
27: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
9: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 
3: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 
24: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 
21: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
9: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 
28: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 31: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 
18: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 28: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 
24: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 29: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 17: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 20: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 26: [2022-11-26 12:53:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:53:55,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:53:55,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:53:55,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 12:53:55,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 4: [2022-11-26 12:53:55,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 12:53:55,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 12:53:55,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 20: [2022-11-26 12:53:55,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:53:55,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 12:53:55,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 19: [2022-11-26 12:53:55,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:53:55,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 12:53:55,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 26: [2022-11-26 12:53:55,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:53:55,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 12:53:55,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 25: [2022-11-26 12:53:55,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
25: [2022-11-26 12:53:55,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 12:53:55,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 16: [2022-11-26 12:53:55,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:53:55,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:53:55,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 31: [2022-11-26 12:53:55,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 16: [2022-11-26 12:53:55,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 31: [2022-11-26 12:53:55,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 11: [2022-11-26 12:53:55,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:53:55,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 12:53:55,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 27: [2022-11-26 12:53:55,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
27: [2022-11-26 12:53:55,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 24: [2022-11-26 12:53:55,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:53:55,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 24: [2022-11-26 12:53:55,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 12:53:55,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 21: [2022-11-26 12:53:55,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:53:55,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 12:53:55,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 18: [2022-11-26 12:53:55,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:53:55,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 12:53:55,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 9: [2022-11-26 12:53:55,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-26 12:53:55,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 12:53:55,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 22: [2022-11-26 12:53:55,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:53:55,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:53:55,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:53:55,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 21: [2022-11-26 12:53:55,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 23: [2022-11-26 12:53:55,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 21: [2022-11-26 12:53:55,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 23: [2022-11-26 12:53:55,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 5: [2022-11-26 12:53:55,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:53:55,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 
5: [2022-11-26 12:53:55,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 12:53:55,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 2: [2022-11-26 12:53:55,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:53:55,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 12:53:55,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 2: [2022-11-26 12:53:55,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:53:55,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 12:53:55,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 5: [2022-11-26 12:53:55,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:53:55,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 12:53:55,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 30: [2022-11-26 12:53:55,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-26 12:53:55,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 12:53:55,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 27: [2022-11-26 12:53:55,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:53:55,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:53:55,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:53:55,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 20: [2022-11-26 12:53:55,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 12:53:55,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 8: [2022-11-26 12:53:55,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 27: [2022-11-26 12:53:55,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 8: [2022-11-26 12:53:55,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 8: [2022-11-26 12:53:55,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
1: [2022-11-26 12:53:55,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:53:55,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 12:53:55,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 1: [2022-11-26 12:53:55,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 12:53:55,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 15: [2022-11-26 12:53:55,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:53:55,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 12:53:55,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 10: [2022-11-26 12:53:55,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:53:55,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:53:55,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:53:55,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
10: [2022-11-26 12:53:55,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 17: [2022-11-26 12:53:55,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 10: [2022-11-26 12:53:55,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 17: [2022-11-26 12:53:55,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 0: [2022-11-26 12:53:55,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 31: [2022-11-26 12:53:55,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 17: [2022-11-26 12:53:55,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:53:55,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 31: [2022-11-26 12:53:55,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 17: [2022-11-26 12:53:55,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 12:53:55,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 28: [2022-11-26 12:53:55,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 
10: [2022-11-26 12:53:55,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:53:55,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 12:53:55,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 16: [2022-11-26 12:53:55,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:53:55,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 11: [2022-11-26 12:53:55,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:53:55,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 11: [2022-11-26 12:53:55,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 12:53:55,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 19: [2022-11-26 12:53:55,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:53:55,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 12:53:55,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 
7: [2022-11-26 12:53:55,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:53:55,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 7: [2022-11-26 12:53:55,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 12:53:55,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 14: [2022-11-26 12:53:55,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:53:55,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 14: [2022-11-26 12:53:55,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 12:53:55,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 14: [2022-11-26 12:53:55,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:53:55,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
14: [2022-11-26 12:53:55,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 24: [2022-11-26 12:53:55,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 14: [2022-11-26 12:53:55,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 24: [2022-11-26 12:53:55,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 15: [2022-11-26 12:53:55,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:53:55,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 12:53:55,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 28: [2022-11-26 12:53:55,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 12:53:55,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 4: [2022-11-26 12:53:55,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:53:55,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 23: [2022-11-26 12:53:55,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
4: [2022-11-26 12:53:55,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 23: [2022-11-26 12:53:55,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 12:53:55,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 25: [2022-11-26 12:53:55,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:53:55,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 12:53:55,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 9: [2022-11-26 12:53:55,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:53:55,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 12:53:55,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 12: [2022-11-26 12:53:55,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:53:55,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 12:53:55,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 
21: [2022-11-26 12:53:55,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:53:55,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 12:53:55,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 8: [2022-11-26 12:53:55,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:53:55,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 12:53:55,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 18: [2022-11-26 12:53:55,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:53:55,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 12:53:55,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 13: [2022-11-26 12:53:55,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:53:55,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 12:53:55,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 
4: [2022-11-26 12:53:55,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:53:55,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 28: [2022-11-26 12:53:55,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:53:55,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:53:55,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 4: [2022-11-26 12:53:55,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 5: [2022-11-26 12:53:55,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 7: [2022-11-26 12:53:55,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:53:55,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 12:53:55,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 10: [2022-11-26 12:53:55,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:53:55,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
23: [2022-11-26 12:53:55,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:53:55,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 12:53:55,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 23: [2022-11-26 12:53:55,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 31: [2022-11-26 12:53:55,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 23: [2022-11-26 12:53:55,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 31: [2022-11-26 12:53:55,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 22: [2022-11-26 12:53:55,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:53:55,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 12:53:55,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 16: [2022-11-26 12:53:55,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
16: [2022-11-26 12:53:55,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 12:53:55,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 11: [2022-11-26 12:53:55,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:53:55,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 12:53:55,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 20: [2022-11-26 12:53:55,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:53:55,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 12:53:55,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 3: [2022-11-26 12:53:55,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:53:55,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 12:53:55,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:53:55,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 
3: [2022-11-26 12:53:55,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 12:53:55,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 13: [2022-11-26 12:53:55,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:53:55,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:53:55,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 12:53:55,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 14: [2022-11-26 12:53:55,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 12:53:55,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 1: [2022-11-26 12:53:55,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:53:55,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 12:53:55,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 2: [2022-11-26 12:53:55,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 12:53:55,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 30: [2022-11-26 12:53:55,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:53:55,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 30: [2022-11-26 12:53:55,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 12:53:55,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 29: [2022-11-26 12:53:55,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:53:55,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:53:55,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 12:53:55,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 12:53:55,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 1: [2022-11-26 12:53:55,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:53:55,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 
1: [2022-11-26 12:53:55,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 12:53:55,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 27: [2022-11-26 12:53:55,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:53:55,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 12:53:55,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 12: [2022-11-26 12:53:55,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:53:55,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 12:53:55,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 19: [2022-11-26 12:53:55,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:53:55,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 12:53:55,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 5: [2022-11-26 12:53:55,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 12:53:55,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 12:53:55,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 30: [2022-11-26 12:53:55,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:53:55,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 12:53:55,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 0: [2022-11-26 12:53:55,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:53:55,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:53:55,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 26: [2022-11-26 12:53:55,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 0: [2022-11-26 12:53:55,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 26: [2022-11-26 12:53:55,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 11: [2022-11-26 12:53:55,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 12:53:55,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 12:53:55,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 24: [2022-11-26 12:53:55,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:53:55,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 12:53:55,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 13: [2022-11-26 12:53:55,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:53:55,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 12:53:55,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 25: [2022-11-26 12:53:55,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:53:55,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 12:53:55,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 1: [2022-11-26 12:53:55,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 12:53:55,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 18: [2022-11-26 12:53:55,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:53:55,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 18: [2022-11-26 12:53:55,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 12:53:55,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 26: [2022-11-26 12:53:55,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:53:55,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 12:53:55,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 4: [2022-11-26 12:53:55,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:53:55,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 12:53:55,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 
28: [2022-11-26 12:53:55,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 12:53:55,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 28: [2022-11-26 12:53:55,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:53:55,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 12:53:55,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 22: [2022-11-26 12:53:55,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:53:55,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 9: [2022-11-26 12:53:55,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:53:55,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 9: [2022-11-26 12:53:55,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 12:53:55,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 17: [2022-11-26 12:53:55,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 
17: [2022-11-26 12:53:55,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:53:55,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 12:53:55,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 17: [2022-11-26 12:53:55,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 12:53:55,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 19: [2022-11-26 12:53:55,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:53:55,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 12:53:55,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 2: [2022-11-26 12:53:55,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:53:55,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 12:53:55,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 6: [2022-11-26 12:53:55,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 12:53:55,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:53:55,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:53:55,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:53:55,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 12:53:55,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 12:53:55,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 12:53:55,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 6: [2022-11-26 12:53:55,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 12:53:55,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 6: [2022-11-26 12:53:55,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 6: [2022-11-26 12:53:55,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 26: [2022-11-26 12:53:55,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 
26: [2022-11-26 12:53:55,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 12:53:55,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 8: [2022-11-26 12:53:55,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:53:55,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 12:53:55,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 10: [2022-11-26 12:53:55,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:53:55,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 12:53:55,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 14: [2022-11-26 12:53:55,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:53:55,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 12:53:55,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 15: [2022-11-26 12:53:55,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 12:53:55,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 12:53:55,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 21: [2022-11-26 12:53:55,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:53:55,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 30: [2022-11-26 12:53:55,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:53:55,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 30: [2022-11-26 12:53:55,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 12:53:55,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 9: [2022-11-26 12:53:55,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:53:55,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 12:53:55,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 27: [2022-11-26 12:53:55,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
27: [2022-11-26 12:53:55,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 12:53:55,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 0: [2022-11-26 12:53:55,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:53:55,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 12:53:55,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 16: [2022-11-26 12:53:55,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:53:55,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 12:53:55,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 20: [2022-11-26 12:53:55,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:53:55,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 12:53:55,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 25: [2022-11-26 12:53:55,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 
13: [2022-11-26 12:53:55,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:53:55,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 13: [2022-11-26 12:53:55,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 12:53:55,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 25: [2022-11-26 12:53:55,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 29: [2022-11-26 12:53:55,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:53:55,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 12:53:55,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 31: [2022-11-26 12:53:55,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:53:55,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 12:53:55,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 18: [2022-11-26 12:53:55,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
18: [2022-11-26 12:53:55,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 12:53:55,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 23: [2022-11-26 12:53:55,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:53:55,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 12:53:55,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 28: [2022-11-26 12:53:55,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:53:55,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 12:53:55,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 12: [2022-11-26 12:53:55,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:53:55,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 12:53:55,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 22: [2022-11-26 12:53:55,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
22: [2022-11-26 12:53:55,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 12:53:55,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 24: [2022-11-26 12:53:55,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:53:55,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 12:53:55,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 7: [2022-11-26 12:53:55,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:53:55,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 12:53:55,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 6: [2022-11-26 12:53:55,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:53:55,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 12:53:55,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 3: [2022-11-26 12:53:55,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 12:53:55,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 12:53:55,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 19: [2022-11-26 12:53:55,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:53:55,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:53:55,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 12:53:55,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 4: [2022-11-26 12:53:55,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 12:53:55,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 5: [2022-11-26 12:53:55,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:53:55,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 12:53:55,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 17: [2022-11-26 12:53:55,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 
17: [2022-11-26 12:53:55,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 12:53:55,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 1: [2022-11-26 12:53:55,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:53:55,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 12:53:55,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 2: [2022-11-26 12:53:55,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:53:55,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 26: [2022-11-26 12:53:55,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:53:55,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 26: [2022-11-26 12:53:55,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 12:53:55,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 15: [2022-11-26 12:53:55,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-26 12:53:55,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 12:53:55,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 8: [2022-11-26 12:53:55,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:53:55,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 12:53:55,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 11: [2022-11-26 12:53:55,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:53:55,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 12:53:55,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 10: [2022-11-26 12:53:55,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:53:55,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 12:53:55,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 20: [2022-11-26 12:53:55,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-26 12:53:55,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 12:53:55,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 14: [2022-11-26 12:53:55,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:53:55,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 12:53:55,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 0: [2022-11-26 12:53:55,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:53:55,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:53:55,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 21: [2022-11-26 12:53:55,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 12:53:55,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 0: [2022-11-26 12:53:55,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 16: [2022-11-26 12:53:55,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
16: [2022-11-26 12:53:55,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 12:53:55,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 29: [2022-11-26 12:53:55,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:53:55,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 12:53:55,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 31: [2022-11-26 12:53:55,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:53:55,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 12:53:55,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 23: [2022-11-26 12:53:55,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:53:55,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 12:53:55,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 12: [2022-11-26 12:53:55,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 12:53:55,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 12:53:55,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 30: [2022-11-26 12:53:55,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:53:55,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 12:53:55,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 25: [2022-11-26 12:53:55,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:53:55,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 12:53:55,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 27: [2022-11-26 12:53:55,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:53:55,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 12:53:55,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 19: [2022-11-26 12:53:55,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 
19: [2022-11-26 12:53:55,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 12:53:55,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 17: [2022-11-26 12:53:55,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:53:55,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 12:53:55,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 22: [2022-11-26 12:53:55,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:53:55,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 12:53:55,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 7: [2022-11-26 12:53:55,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:53:55,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 12:53:55,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 13: [2022-11-26 12:53:55,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 12:53:55,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 12:53:55,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 28: [2022-11-26 12:53:55,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:53:55,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 12:53:55,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 24: [2022-11-26 12:53:55,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:53:55,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 12:53:55,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 1: [2022-11-26 12:53:55,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:53:55,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 12:53:55,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 5: [2022-11-26 12:53:55,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-26 12:53:55,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 12:53:55,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 4: [2022-11-26 12:53:55,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:53:55,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 12:53:55,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 6: [2022-11-26 12:53:55,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:53:55,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 12:53:55,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 13: [2022-11-26 12:53:55,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:53:55,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 12:53:55,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 9: [2022-11-26 12:53:55,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 12:53:55,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 12:53:55,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 15: [2022-11-26 12:53:55,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:53:55,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:53:55,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:53:55,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 20: [2022-11-26 12:53:55,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 2: [2022-11-26 12:53:55,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:53:55,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 20: [2022-11-26 12:53:55,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 
26: [2022-11-26 12:53:55,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 2: [2022-11-26 12:53:55,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 26: [2022-11-26 12:53:55,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 2: [2022-11-26 12:53:55,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 11: [2022-11-26 12:53:55,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:53:55,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 12:53:55,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 27: [2022-11-26 12:53:55,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:53:55,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 12:53:55,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 12: [2022-11-26 12:53:55,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 12:53:55,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 12:53:55,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 16: [2022-11-26 12:53:55,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:53:55,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 12:53:55,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 10: [2022-11-26 12:53:55,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:53:55,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:53:55,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 12:53:55,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 10: [2022-11-26 12:53:55,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 12:53:55,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 23: [2022-11-26 12:53:55,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
23: [2022-11-26 12:53:55,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 12:53:55,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 9: [2022-11-26 12:53:55,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 19: [2022-11-26 12:53:55,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:53:55,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 17: [2022-11-26 12:53:55,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:53:55,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 19: [2022-11-26 12:53:55,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 17: [2022-11-26 12:53:55,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 19: [2022-11-26 12:53:55,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 17: [2022-11-26 12:53:55,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 0: [2022-11-26 12:53:55,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-26 12:53:55,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 12:53:55,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 6: [2022-11-26 12:53:55,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:53:55,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 12:53:55,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 1: [2022-11-26 12:53:55,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:53:55,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:53:55,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 1: [2022-11-26 12:53:55,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 29: [2022-11-26 12:53:55,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 1: [2022-11-26 12:53:55,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 8: [2022-11-26 12:53:55,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-26 12:53:55,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 12:53:55,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 14: [2022-11-26 12:53:55,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:53:55,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 12:53:55,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 4: [2022-11-26 12:53:55,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:53:55,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 12:53:55,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 3: [2022-11-26 12:53:55,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:53:55,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 12:53:55,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 3: [2022-11-26 12:53:55,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 12:53:55,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 12:53:55,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 18: [2022-11-26 12:53:55,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:53:55,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 12:53:55,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 28: [2022-11-26 12:53:55,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:53:55,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:53:55,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 25: [2022-11-26 12:53:55,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 28: [2022-11-26 12:53:55,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 25: [2022-11-26 12:53:55,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 22: [2022-11-26 12:53:55,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
22: [2022-11-26 12:53:55,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 12:53:55,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 5: [2022-11-26 12:53:55,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:53:55,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 12:53:55,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 24: [2022-11-26 12:53:55,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:53:55,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 12:53:55,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 15: [2022-11-26 12:53:55,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:53:55,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:53:55,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 12:53:55,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 
11: [2022-11-26 12:53:55,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 12:53:55,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 21: [2022-11-26 12:53:55,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:53:55,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 12:53:55,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 10: [2022-11-26 12:53:55,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:53:55,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 12:53:55,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 7: [2022-11-26 12:53:55,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:53:55,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 12:53:55,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 26: [2022-11-26 12:53:55,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-26 12:53:55,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 12:53:55,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 31: [2022-11-26 12:53:55,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:53:55,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 12:53:55,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 16: [2022-11-26 12:53:55,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:53:55,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 12:53:55,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 0: [2022-11-26 12:53:55,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:53:55,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
0: [2022-11-26 12:53:55,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 30: [2022-11-26 12:53:55,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 0: [2022-11-26 12:53:55,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 30: [2022-11-26 12:53:55,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 18: [2022-11-26 12:53:55,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:53:55,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 12:53:55,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 20: [2022-11-26 12:53:55,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:53:55,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 25: [2022-11-26 12:53:55,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:53:55,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 
25: [2022-11-26 12:53:55,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 12:53:55,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 2: [2022-11-26 12:53:55,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:53:55,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 12:53:55,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 8: [2022-11-26 12:53:55,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:53:55,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 12:53:55,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 14: [2022-11-26 12:53:55,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:53:55,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 12:53:55,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 9: [2022-11-26 12:53:55,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
17: [2022-11-26 12:53:55,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 12:53:55,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 9: [2022-11-26 12:53:55,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 17: [2022-11-26 12:53:55,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 9: [2022-11-26 12:53:55,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 27: [2022-11-26 12:53:55,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:53:55,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 12:53:55,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 23: [2022-11-26 12:53:55,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:53:55,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 12:53:55,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 19: [2022-11-26 12:53:55,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-26 12:53:55,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 12:53:55,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 13: [2022-11-26 12:53:55,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:53:55,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 12:53:55,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 22: [2022-11-26 12:53:55,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:53:55,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 12:53:55,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 18: [2022-11-26 12:53:55,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 12:53:55,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 12:53:55,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 6: [2022-11-26 12:53:55,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 12:53:55,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 12:53:55,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 15: [2022-11-26 12:53:55,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:53:55,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 8: [2022-11-26 12:53:55,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:53:55,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 8: [2022-11-26 12:53:55,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 12:53:55,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 28: [2022-11-26 12:53:55,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:53:55,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 12:53:55,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 3: [2022-11-26 12:53:55,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 12:53:55,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 12:53:55,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 12: [2022-11-26 12:53:55,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:53:55,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:53:55,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 12:53:55,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 10: [2022-11-26 12:53:55,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 12:53:55,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 29: [2022-11-26 12:53:55,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:53:55,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 12:53:55,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 5: [2022-11-26 12:53:55,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-26 12:53:55,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 12:53:55,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 1: [2022-11-26 12:53:55,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:53:55,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 12:53:55,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 25: [2022-11-26 12:53:55,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:53:55,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 24: [2022-11-26 12:53:55,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 25: [2022-11-26 12:53:55,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 24: [2022-11-26 12:53:55,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 12:53:55,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 7: [2022-11-26 12:53:55,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 12:53:55,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 12:53:55,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 7: [2022-11-26 12:53:55,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:53:55,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 9: [2022-11-26 12:53:55,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 16: [2022-11-26 12:53:55,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:53:55,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 9: [2022-11-26 12:53:55,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 12:53:55,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 16: [2022-11-26 12:53:55,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 12:53:55,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 26: [2022-11-26 12:53:55,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
26: [2022-11-26 12:53:55,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 11: [2022-11-26 12:53:55,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:53:55,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 27: [2022-11-26 12:53:55,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 26: [2022-11-26 12:53:55,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 0: [2022-11-26 12:53:55,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 11: [2022-11-26 12:53:55,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 0: [2022-11-26 12:53:55,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 11: [2022-11-26 12:53:55,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 27: [2022-11-26 12:53:55,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 12:53:55,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 4: [2022-11-26 12:53:55,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 12:53:55,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 12:53:55,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 21: [2022-11-26 12:53:55,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 12:53:55,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 12:53:55,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 30: [2022-11-26 12:53:55,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:53:55,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 12:53:55,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 31: [2022-11-26 12:53:55,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:53:55,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 12:53:55,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 14: [2022-11-26 12:53:55,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 12:53:55,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 12:53:55,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 20: [2022-11-26 12:53:55,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:53:55,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 30: [2022-11-26 12:53:55,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 12:53:55,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 12:53:55,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 2: [2022-11-26 12:53:55,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 20: [2022-11-26 12:53:55,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 2: [2022-11-26 12:53:55,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 12:53:55,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 18: [2022-11-26 12:53:55,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
18: [2022-11-26 12:53:55,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 12:53:55,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 24: [2022-11-26 12:53:55,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 12:53:55,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 12:53:55,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 28: [2022-11-26 12:53:55,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 12:53:55,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 12:53:55,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 22: [2022-11-26 12:53:55,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 12:53:55,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 12:53:55,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 29: [2022-11-26 12:53:55,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-26 12:53:55,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 12:53:55,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 31: [2022-11-26 12:53:55,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 12:53:55,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 12:53:55,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 12: [2022-11-26 12:53:55,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:53:55,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 12:53:55,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 23: [2022-11-26 12:53:55,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 12:53:55,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 12:53:55,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 13: [2022-11-26 12:53:55,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 12:53:55,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 12:53:55,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 7: [2022-11-26 12:53:55,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:53:55,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 12:53:55,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 29: [2022-11-26 12:53:55,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 12:53:55,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 12:53:55,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 12: [2022-11-26 12:53:55,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:53:55,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 12:53:55,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 3: [2022-11-26 12:53:55,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 12:53:55,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 12:53:55,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 3: [2022-11-26 12:53:55,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:53:55,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step83000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 12:53:55,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 0: successfully saved checkpoint at iteration 83000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2619.09 31: iteration 83010/ 173500 | consumed samples: 21250560 | consumed tokens: 43521146880 | elapsed time per iteration (s): 1.03 | learning rate: 1.176E-04 | global batch size: 256 | lm loss: 1.989258E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.591 | TFLOPs: 15.10 | 31: iteration 83020/ 173500 | consumed samples: 21253120 | consumed tokens: 43526389760 | elapsed time per iteration (s): 0.73 | learning rate: 1.176E-04 | global batch size: 256 | lm loss: 1.992061E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.820 | TFLOPs: 21.10 | 31: iteration 83030/ 173500 | consumed samples: 21255680 | consumed tokens: 43531632640 | elapsed time per iteration (s): 0.75 | learning rate: 1.175E-04 | global batch size: 256 | lm loss: 1.985451E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.201 | TFLOPs: 20.58 | 31: iteration 83040/ 173500 | 
consumed samples: 21258240 | consumed tokens: 43536875520 | elapsed time per iteration (s): 0.71 | learning rate: 1.175E-04 | global batch size: 256 | lm loss: 1.984933E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 359.798 | TFLOPs: 21.77 | 31: iteration 83050/ 173500 | consumed samples: 21260800 | consumed tokens: 43542118400 | elapsed time per iteration (s): 0.73 | learning rate: 1.175E-04 | global batch size: 256 | lm loss: 2.022933E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.736 | TFLOPs: 21.34 | 31: iteration 83060/ 173500 | consumed samples: 21263360 | consumed tokens: 43547361280 | elapsed time per iteration (s): 0.76 | learning rate: 1.175E-04 | global batch size: 256 | lm loss: 2.001978E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.726 | TFLOPs: 20.31 | 31: iteration 83070/ 173500 | consumed samples: 21265920 | consumed tokens: 43552604160 | elapsed time per iteration (s): 0.73 | learning rate: 1.175E-04 | global batch size: 256 | lm loss: 2.016347E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.275 | TFLOPs: 21.25 | 31: iteration 83080/ 173500 | consumed samples: 21268480 | consumed tokens: 43557847040 | elapsed time per iteration (s): 0.73 | learning rate: 1.175E-04 | global batch size: 256 | lm loss: 2.020214E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.672 | TFLOPs: 21.15 | 31: iteration 83090/ 173500 | consumed samples: 21271040 | consumed tokens: 43563089920 | elapsed time per iteration (s): 0.76 | learning rate: 1.174E-04 | global batch size: 256 | lm loss: 1.984627E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 338.278 | TFLOPs: 20.46 | 31: iteration 83100/ 173500 | consumed samples: 21273600 | consumed tokens: 43568332800 | elapsed time per iteration (s): 0.79 | learning rate: 1.174E-04 | global batch size: 256 | lm loss: 2.015612E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.322 | TFLOPs: 19.50 | 31: iteration 83110/ 173500 | consumed samples: 21276160 | consumed tokens: 43573575680 | elapsed time per iteration (s): 0.87 | learning rate: 1.174E-04 | global batch size: 256 | lm loss: 1.985911E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.353 | TFLOPs: 17.87 | 31: iteration 83120/ 173500 | consumed samples: 21278720 | consumed tokens: 43578818560 | elapsed time per iteration (s): 0.77 | learning rate: 1.174E-04 | global batch size: 256 | lm loss: 2.011719E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.742 | TFLOPs: 20.01 | 31: iteration 83130/ 173500 | consumed samples: 21281280 | consumed tokens: 43584061440 | elapsed time per iteration (s): 0.75 | learning rate: 1.174E-04 | global batch size: 256 | lm loss: 2.014514E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.721 | TFLOPs: 20.55 | 31: iteration 83140/ 173500 | consumed samples: 21283840 | consumed tokens: 43589304320 | elapsed time per iteration (s): 0.77 | learning rate: 1.174E-04 | global batch size: 256 | lm loss: 1.961537E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.994 | TFLOPs: 20.08 | 31: iteration 83150/ 173500 | consumed samples: 21286400 | consumed tokens: 43594547200 | elapsed time per iteration (s): 0.76 | learning rate: 1.173E-04 | global batch 
size: 256 | lm loss: 1.995342E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.378 | TFLOPs: 20.47 | 31: iteration 83160/ 173500 | consumed samples: 21288960 | consumed tokens: 43599790080 | elapsed time per iteration (s): 0.75 | learning rate: 1.173E-04 | global batch size: 256 | lm loss: 1.996234E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.546 | TFLOPs: 20.60 | 31: iteration 83170/ 173500 | consumed samples: 21291520 | consumed tokens: 43605032960 | elapsed time per iteration (s): 0.75 | learning rate: 1.173E-04 | global batch size: 256 | lm loss: 2.042439E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.374 | TFLOPs: 20.53 | 31: iteration 83180/ 173500 | consumed samples: 21294080 | consumed tokens: 43610275840 | elapsed time per iteration (s): 0.77 | learning rate: 1.173E-04 | global batch size: 256 | lm loss: 2.018932E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.560 | TFLOPs: 20.18 | 31: iteration 83190/ 173500 | consumed samples: 21296640 | consumed tokens: 43615518720 | elapsed time per iteration (s): 0.77 | learning rate: 1.173E-04 | global batch size: 256 | lm loss: 1.972730E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.375 | TFLOPs: 20.05 | 31: iteration 83200/ 173500 | consumed samples: 21299200 | consumed tokens: 43620761600 | elapsed time per iteration (s): 0.77 | learning rate: 1.173E-04 | global batch size: 256 | lm loss: 1.987363E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.151 | TFLOPs: 20.09 | 31: iteration 83210/ 173500 | consumed samples: 21301760 | 
consumed tokens: 43626004480 | elapsed time per iteration (s): 0.74 | learning rate: 1.172E-04 | global batch size: 256 | lm loss: 2.023341E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.452 | TFLOPs: 20.84 | 31: iteration 83220/ 173500 | consumed samples: 21304320 | consumed tokens: 43631247360 | elapsed time per iteration (s): 0.76 | learning rate: 1.172E-04 | global batch size: 256 | lm loss: 1.975757E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.729 | TFLOPs: 20.49 | 31: iteration 83230/ 173500 | consumed samples: 21306880 | consumed tokens: 43636490240 | elapsed time per iteration (s): 0.75 | learning rate: 1.172E-04 | global batch size: 256 | lm loss: 1.986849E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.095 | TFLOPs: 20.76 | 31: iteration 83240/ 173500 | consumed samples: 21309440 | consumed tokens: 43641733120 | elapsed time per iteration (s): 0.80 | learning rate: 1.172E-04 | global batch size: 256 | lm loss: 2.019104E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.967 | TFLOPs: 19.42 | 31: iteration 83250/ 173500 | consumed samples: 21312000 | consumed tokens: 43646976000 | elapsed time per iteration (s): 0.78 | learning rate: 1.172E-04 | global batch size: 256 | lm loss: 2.020435E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.557 | TFLOPs: 19.82 | 31: iteration 83260/ 173500 | consumed samples: 21314560 | consumed tokens: 43652218880 | elapsed time per iteration (s): 0.73 | learning rate: 1.172E-04 | global batch size: 256 | lm loss: 1.993166E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 348.656 | TFLOPs: 21.09 | 31: iteration 83270/ 173500 | consumed samples: 21317120 | consumed tokens: 43657461760 | elapsed time per iteration (s): 0.79 | learning rate: 1.171E-04 | global batch size: 256 | lm loss: 2.002871E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.320 | TFLOPs: 19.68 | 31: iteration 83280/ 173500 | consumed samples: 21319680 | consumed tokens: 43662704640 | elapsed time per iteration (s): 0.75 | learning rate: 1.171E-04 | global batch size: 256 | lm loss: 2.016938E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.309 | TFLOPs: 20.71 | 31: iteration 83290/ 173500 | consumed samples: 21322240 | consumed tokens: 43667947520 | elapsed time per iteration (s): 0.77 | learning rate: 1.171E-04 | global batch size: 256 | lm loss: 2.006931E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.100 | TFLOPs: 20.21 | 31: iteration 83300/ 173500 | consumed samples: 21324800 | consumed tokens: 43673190400 | elapsed time per iteration (s): 2.00 | learning rate: 1.171E-04 | global batch size: 256 | lm loss: 2.028530E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 128.295 | TFLOPs: 7.76 | 31: iteration 83310/ 173500 | consumed samples: 21327360 | consumed tokens: 43678433280 | elapsed time per iteration (s): 0.78 | learning rate: 1.171E-04 | global batch size: 256 | lm loss: 1.968490E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.441 | TFLOPs: 19.75 | 31: iteration 83320/ 173500 | consumed samples: 21329920 | consumed tokens: 43683676160 | elapsed time per iteration (s): 0.78 | learning rate: 1.171E-04 | global batch size: 256 | lm loss: 
2.025739E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.808 | TFLOPs: 19.77 | 31: iteration 83330/ 173500 | consumed samples: 21332480 | consumed tokens: 43688919040 | elapsed time per iteration (s): 0.76 | learning rate: 1.171E-04 | global batch size: 256 | lm loss: 1.996465E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.793 | TFLOPs: 20.44 | 31: iteration 83340/ 173500 | consumed samples: 21335040 | consumed tokens: 43694161920 | elapsed time per iteration (s): 0.71 | learning rate: 1.170E-04 | global batch size: 256 | lm loss: 1.999603E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.717 | TFLOPs: 21.70 | 31: iteration 83350/ 173500 | consumed samples: 21337600 | consumed tokens: 43699404800 | elapsed time per iteration (s): 0.78 | learning rate: 1.170E-04 | global batch size: 256 | lm loss: 2.014279E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.228 | TFLOPs: 19.86 | 31: iteration 83360/ 173500 | consumed samples: 21340160 | consumed tokens: 43704647680 | elapsed time per iteration (s): 0.78 | learning rate: 1.170E-04 | global batch size: 256 | lm loss: 1.987688E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.443 | TFLOPs: 19.93 | 31: iteration 83370/ 173500 | consumed samples: 21342720 | consumed tokens: 43709890560 | elapsed time per iteration (s): 0.79 | learning rate: 1.170E-04 | global batch size: 256 | lm loss: 2.008169E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.299 | TFLOPs: 19.68 | 31: iteration 83380/ 173500 | consumed samples: 21345280 | consumed tokens: 
43715133440 | elapsed time per iteration (s): 0.74 | learning rate: 1.170E-04 | global batch size: 256 | lm loss: 2.014637E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.762 | TFLOPs: 21.04 | 31: iteration 83390/ 173500 | consumed samples: 21347840 | consumed tokens: 43720376320 | elapsed time per iteration (s): 0.77 | learning rate: 1.170E-04 | global batch size: 256 | lm loss: 1.996009E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.222 | TFLOPs: 20.04 | 31: iteration 83400/ 173500 | consumed samples: 21350400 | consumed tokens: 43725619200 | elapsed time per iteration (s): 0.78 | learning rate: 1.169E-04 | global batch size: 256 | lm loss: 2.022822E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.613 | TFLOPs: 19.76 | 31: iteration 83410/ 173500 | consumed samples: 21352960 | consumed tokens: 43730862080 | elapsed time per iteration (s): 0.76 | learning rate: 1.169E-04 | global batch size: 256 | lm loss: 1.991117E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.345 | TFLOPs: 20.41 | 31: iteration 83420/ 173500 | consumed samples: 21355520 | consumed tokens: 43736104960 | elapsed time per iteration (s): 0.75 | learning rate: 1.169E-04 | global batch size: 256 | lm loss: 2.003460E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.252 | TFLOPs: 20.52 | 31: iteration 83430/ 173500 | consumed samples: 21358080 | consumed tokens: 43741347840 | elapsed time per iteration (s): 0.80 | learning rate: 1.169E-04 | global batch size: 256 | lm loss: 1.998918E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 320.097 | TFLOPs: 19.37 | 31: iteration 83440/ 173500 | consumed samples: 21360640 | consumed tokens: 43746590720 | elapsed time per iteration (s): 0.78 | learning rate: 1.169E-04 | global batch size: 256 | lm loss: 2.019721E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.661 | TFLOPs: 19.94 | 31: iteration 83450/ 173500 | consumed samples: 21363200 | consumed tokens: 43751833600 | elapsed time per iteration (s): 0.80 | learning rate: 1.169E-04 | global batch size: 256 | lm loss: 2.000264E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.537 | TFLOPs: 19.33 | 31: iteration 83460/ 173500 | consumed samples: 21365760 | consumed tokens: 43757076480 | elapsed time per iteration (s): 0.79 | learning rate: 1.168E-04 | global batch size: 256 | lm loss: 1.986543E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.685 | TFLOPs: 19.70 | 31: iteration 83470/ 173500 | consumed samples: 21368320 | consumed tokens: 43762319360 | elapsed time per iteration (s): 0.80 | learning rate: 1.168E-04 | global batch size: 256 | lm loss: 1.990888E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.369 | TFLOPs: 19.38 | 31: iteration 83480/ 173500 | consumed samples: 21370880 | consumed tokens: 43767562240 | elapsed time per iteration (s): 0.81 | learning rate: 1.168E-04 | global batch size: 256 | lm loss: 1.999188E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.705 | TFLOPs: 19.10 | 31: iteration 83490/ 173500 | consumed samples: 21373440 | consumed tokens: 43772805120 | elapsed time per iteration (s): 0.81 | learning rate: 1.168E-04 | global batch size: 256 | lm loss: 1.993972E+00 | grad 
norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.833 | TFLOPs: 19.11 | 31: iteration 83500/ 173500 | consumed samples: 21376000 | consumed tokens: 43778048000 | elapsed time per iteration (s): 0.79 | learning rate: 1.168E-04 | global batch size: 256 | lm loss: 1.983997E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.636 | TFLOPs: 19.58 | 31: iteration 83510/ 173500 | consumed samples: 21378560 | consumed tokens: 43783290880 | elapsed time per iteration (s): 0.76 | learning rate: 1.168E-04 | global batch size: 256 | lm loss: 1.991529E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.964 | TFLOPs: 20.26 | 31: iteration 83520/ 173500 | consumed samples: 21381120 | consumed tokens: 43788533760 | elapsed time per iteration (s): 0.78 | learning rate: 1.167E-04 | global batch size: 256 | lm loss: 1.969699E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.358 | TFLOPs: 19.74 | 31: iteration 83530/ 173500 | consumed samples: 21383680 | consumed tokens: 43793776640 | elapsed time per iteration (s): 0.78 | learning rate: 1.167E-04 | global batch size: 256 | lm loss: 2.019935E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.430 | TFLOPs: 19.75 | 31: iteration 83540/ 173500 | consumed samples: 21386240 | consumed tokens: 43799019520 | elapsed time per iteration (s): 0.76 | learning rate: 1.167E-04 | global batch size: 256 | lm loss: 1.984090E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.108 | TFLOPs: 20.45 | 31: iteration 83550/ 173500 | consumed samples: 21388800 | consumed tokens: 43804262400 | elapsed time 
per iteration (s): 0.73 | learning rate: 1.167E-04 | global batch size: 256 | lm loss: 2.008533E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.913 | TFLOPs: 21.23 | 31: iteration 83560/ 173500 | consumed samples: 21391360 | consumed tokens: 43809505280 | elapsed time per iteration (s): 0.77 | learning rate: 1.167E-04 | global batch size: 256 | lm loss: 1.973785E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.249 | TFLOPs: 20.22 | 31: iteration 83570/ 173500 | consumed samples: 21393920 | consumed tokens: 43814748160 | elapsed time per iteration (s): 0.81 | learning rate: 1.167E-04 | global batch size: 256 | lm loss: 1.996187E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.071 | TFLOPs: 19.12 | 31: iteration 83580/ 173500 | consumed samples: 21396480 | consumed tokens: 43819991040 | elapsed time per iteration (s): 0.86 | learning rate: 1.166E-04 | global batch size: 256 | lm loss: 2.032399E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.635 | TFLOPs: 18.07 | 31: iteration 83590/ 173500 | consumed samples: 21399040 | consumed tokens: 43825233920 | elapsed time per iteration (s): 0.88 | learning rate: 1.166E-04 | global batch size: 256 | lm loss: 1.989931E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.174 | TFLOPs: 17.62 | 31: iteration 83600/ 173500 | consumed samples: 21401600 | consumed tokens: 43830476800 | elapsed time per iteration (s): 0.86 | learning rate: 1.166E-04 | global batch size: 256 | lm loss: 2.027641E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.307 | TFLOPs: 
18.05 | 31: iteration 83610/ 173500 | consumed samples: 21404160 | consumed tokens: 43835719680 | elapsed time per iteration (s): 0.80 | learning rate: 1.166E-04 | global batch size: 256 | lm loss: 2.011238E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.249 | TFLOPs: 19.37 | 31: iteration 83620/ 173500 | consumed samples: 21406720 | consumed tokens: 43840962560 | elapsed time per iteration (s): 0.85 | learning rate: 1.166E-04 | global batch size: 256 | lm loss: 1.984380E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.969 | TFLOPs: 18.27 | 31: iteration 83630/ 173500 | consumed samples: 21409280 | consumed tokens: 43846205440 | elapsed time per iteration (s): 0.79 | learning rate: 1.166E-04 | global batch size: 256 | lm loss: 2.000178E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.177 | TFLOPs: 19.55 | 31: iteration 83640/ 173500 | consumed samples: 21411840 | consumed tokens: 43851448320 | elapsed time per iteration (s): 0.77 | learning rate: 1.165E-04 | global batch size: 256 | lm loss: 1.986062E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.866 | TFLOPs: 20.14 | 31: iteration 83650/ 173500 | consumed samples: 21414400 | consumed tokens: 43856691200 | elapsed time per iteration (s): 0.79 | learning rate: 1.165E-04 | global batch size: 256 | lm loss: 1.983648E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.540 | TFLOPs: 19.69 | 31: iteration 83660/ 173500 | consumed samples: 21416960 | consumed tokens: 43861934080 | elapsed time per iteration (s): 0.75 | learning rate: 1.165E-04 | global batch size: 256 | lm loss: 2.016514E+00 | grad norm: 0.137 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.156 | TFLOPs: 20.58 | 31: iteration 83670/ 173500 | consumed samples: 21419520 | consumed tokens: 43867176960 | elapsed time per iteration (s): 0.76 | learning rate: 1.165E-04 | global batch size: 256 | lm loss: 1.983839E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.726 | TFLOPs: 20.37 | 31: iteration 83680/ 173500 | consumed samples: 21422080 | consumed tokens: 43872419840 | elapsed time per iteration (s): 0.80 | learning rate: 1.165E-04 | global batch size: 256 | lm loss: 2.001088E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.506 | TFLOPs: 19.45 | 31: iteration 83690/ 173500 | consumed samples: 21424640 | consumed tokens: 43877662720 | elapsed time per iteration (s): 0.74 | learning rate: 1.165E-04 | global batch size: 256 | lm loss: 1.978735E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.693 | TFLOPs: 20.97 | 31: iteration 83700/ 173500 | consumed samples: 21427200 | consumed tokens: 43882905600 | elapsed time per iteration (s): 0.77 | learning rate: 1.164E-04 | global batch size: 256 | lm loss: 1.999139E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.643 | TFLOPs: 20.06 | 31: iteration 83710/ 173500 | consumed samples: 21429760 | consumed tokens: 43888148480 | elapsed time per iteration (s): 0.74 | learning rate: 1.164E-04 | global batch size: 256 | lm loss: 1.990927E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.997 | TFLOPs: 20.93 | 31: iteration 83720/ 173500 | consumed samples: 21432320 | consumed tokens: 43893391360 | elapsed time per iteration (s): 0.77 | 
learning rate: 1.164E-04 | global batch size: 256 | lm loss: 1.998686E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.305 | TFLOPs: 20.16 | 31: iteration 83730/ 173500 | consumed samples: 21434880 | consumed tokens: 43898634240 | elapsed time per iteration (s): 0.77 | learning rate: 1.164E-04 | global batch size: 256 | lm loss: 1.988091E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.886 | TFLOPs: 20.14 | 31: iteration 83740/ 173500 | consumed samples: 21437440 | consumed tokens: 43903877120 | elapsed time per iteration (s): 0.82 | learning rate: 1.164E-04 | global batch size: 256 | lm loss: 1.986802E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.351 | TFLOPs: 18.84 | 31: iteration 83750/ 173500 | consumed samples: 21440000 | consumed tokens: 43909120000 | elapsed time per iteration (s): 0.81 | learning rate: 1.164E-04 | global batch size: 256 | lm loss: 1.999331E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.005 | TFLOPs: 19.24 | 31: iteration 83760/ 173500 | consumed samples: 21442560 | consumed tokens: 43914362880 | elapsed time per iteration (s): 0.82 | learning rate: 1.163E-04 | global batch size: 256 | lm loss: 2.002834E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.059 | TFLOPs: 19.00 | 31: iteration 83770/ 173500 | consumed samples: 21445120 | consumed tokens: 43919605760 | elapsed time per iteration (s): 0.75 | learning rate: 1.163E-04 | global batch size: 256 | lm loss: 2.013427E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.394 | TFLOPs: 20.65 | 31: iteration 83780/ 
173500 | consumed samples: 21447680 | consumed tokens: 43924848640 | elapsed time per iteration (s): 0.80 | learning rate: 1.163E-04 | global batch size: 256 | lm loss: 2.007062E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.854 | TFLOPs: 19.35 | 31: iteration 83790/ 173500 | consumed samples: 21450240 | consumed tokens: 43930091520 | elapsed time per iteration (s): 0.82 | learning rate: 1.163E-04 | global batch size: 256 | lm loss: 1.954331E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.558 | TFLOPs: 18.91 | 31: iteration 83800/ 173500 | consumed samples: 21452800 | consumed tokens: 43935334400 | elapsed time per iteration (s): 0.81 | learning rate: 1.163E-04 | global batch size: 256 | lm loss: 1.985616E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.435 | TFLOPs: 19.02 | 31: iteration 83810/ 173500 | consumed samples: 21455360 | consumed tokens: 43940577280 | elapsed time per iteration (s): 0.77 | learning rate: 1.163E-04 | global batch size: 256 | lm loss: 2.011059E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.935 | TFLOPs: 20.14 | 31: iteration 83820/ 173500 | consumed samples: 21457920 | consumed tokens: 43945820160 | elapsed time per iteration (s): 0.78 | learning rate: 1.162E-04 | global batch size: 256 | lm loss: 2.008975E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.231 | TFLOPs: 19.80 | 31: iteration 83830/ 173500 | consumed samples: 21460480 | consumed tokens: 43951063040 | elapsed time per iteration (s): 0.80 | learning rate: 1.162E-04 | global batch size: 256 | lm loss: 1.987881E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 318.930 | TFLOPs: 19.29 | 31: iteration 83840/ 173500 | consumed samples: 21463040 | consumed tokens: 43956305920 | elapsed time per iteration (s): 0.82 | learning rate: 1.162E-04 | global batch size: 256 | lm loss: 2.021034E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.195 | TFLOPs: 18.95 | 31: iteration 83850/ 173500 | consumed samples: 21465600 | consumed tokens: 43961548800 | elapsed time per iteration (s): 0.81 | learning rate: 1.162E-04 | global batch size: 256 | lm loss: 1.982862E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.084 | TFLOPs: 19.18 | 31: iteration 83860/ 173500 | consumed samples: 21468160 | consumed tokens: 43966791680 | elapsed time per iteration (s): 0.85 | learning rate: 1.162E-04 | global batch size: 256 | lm loss: 1.993727E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.933 | TFLOPs: 18.15 | 31: iteration 83870/ 173500 | consumed samples: 21470720 | consumed tokens: 43972034560 | elapsed time per iteration (s): 0.84 | learning rate: 1.162E-04 | global batch size: 256 | lm loss: 2.000066E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.327 | TFLOPs: 18.53 | 31: iteration 83880/ 173500 | consumed samples: 21473280 | consumed tokens: 43977277440 | elapsed time per iteration (s): 0.82 | learning rate: 1.161E-04 | global batch size: 256 | lm loss: 1.991781E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.831 | TFLOPs: 18.80 | 31: iteration 83890/ 173500 | consumed samples: 21475840 | consumed tokens: 43982520320 | elapsed time per iteration (s): 0.87 | learning rate: 
1.161E-04 | global batch size: 256 | lm loss: 1.967039E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.188 | TFLOPs: 17.80 | 31: iteration 83900/ 173500 | consumed samples: 21478400 | consumed tokens: 43987763200 | elapsed time per iteration (s): 0.81 | learning rate: 1.161E-04 | global batch size: 256 | lm loss: 1.988013E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.518 | TFLOPs: 19.09 | 31: iteration 83910/ 173500 | consumed samples: 21480960 | consumed tokens: 43993006080 | elapsed time per iteration (s): 0.83 | learning rate: 1.161E-04 | global batch size: 256 | lm loss: 2.013609E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.167 | TFLOPs: 18.76 | 31: iteration 83920/ 173500 | consumed samples: 21483520 | consumed tokens: 43998248960 | elapsed time per iteration (s): 0.83 | learning rate: 1.161E-04 | global batch size: 256 | lm loss: 2.024091E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.760 | TFLOPs: 18.68 | 31: iteration 83930/ 173500 | consumed samples: 21486080 | consumed tokens: 44003491840 | elapsed time per iteration (s): 0.85 | learning rate: 1.161E-04 | global batch size: 256 | lm loss: 1.977280E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.194 | TFLOPs: 18.16 | 31: iteration 83940/ 173500 | consumed samples: 21488640 | consumed tokens: 44008734720 | elapsed time per iteration (s): 0.77 | learning rate: 1.160E-04 | global batch size: 256 | lm loss: 2.015187E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.613 | TFLOPs: 20.24 | 31: iteration 83950/ 173500 | 
consumed samples: 21491200 | consumed tokens: 44013977600 | elapsed time per iteration (s): 0.75 | learning rate: 1.160E-04 | global batch size: 256 | lm loss: 2.007991E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.983 | TFLOPs: 20.57 | 31: iteration 83960/ 173500 | consumed samples: 21493760 | consumed tokens: 44019220480 | elapsed time per iteration (s): 0.80 | learning rate: 1.160E-04 | global batch size: 256 | lm loss: 1.991540E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.698 | TFLOPs: 19.40 | 31: iteration 83970/ 173500 | consumed samples: 21496320 | consumed tokens: 44024463360 | elapsed time per iteration (s): 0.74 | learning rate: 1.160E-04 | global batch size: 256 | lm loss: 1.990552E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.575 | TFLOPs: 20.97 | 31: iteration 83980/ 173500 | consumed samples: 21498880 | consumed tokens: 44029706240 | elapsed time per iteration (s): 0.76 | learning rate: 1.160E-04 | global batch size: 256 | lm loss: 2.038715E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.873 | TFLOPs: 20.50 | 31: iteration 83990/ 173500 | consumed samples: 21501440 | consumed tokens: 44034949120 | elapsed time per iteration (s): 0.74 | learning rate: 1.160E-04 | global batch size: 256 | lm loss: 2.001354E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.971 | TFLOPs: 20.99 | 0: [2022-11-26 13:07:08,058] [INFO] [logging.py:68:log_dist] [Rank 0] step=84000, skipped=0, lr=[0.00011595088621669176, 0.00011595088621669176, 0.00011595088621669176], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 84000/ 173500 | consumed samples: 21504000 
| consumed tokens: 44040192000 | elapsed time per iteration (s): 0.75 | learning rate: 1.160E-04 | global batch size: 256 | lm loss: 1.986395E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.125 | TFLOPs: 20.70 | 0: steps: 84000 loss: 1.9831 iter time (s): 0.788 samples/sec: 325.053 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 84000 | lm loss value: 2.080994E+00 | lm loss PPL: 8.012428E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 84000 to checkpoints_1b1long 0: [2022-11-26 13:07:08,334] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step84000 is begin to save! 0: [2022-11-26 13:07:08,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_01-model_00-model_states.pt... 0: [2022-11-26 13:07:08,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_01-model_00-model_states.pt. 0: [2022-11-26 13:07:08,600] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_03-model_00-model_states.pt... 0: [2022-11-26 13:07:08,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_03-model_00-model_states.pt. 0: [2022-11-26 13:07:08,692] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_04-model_00-model_states.pt... 0: [2022-11-26 13:07:08,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_04-model_00-model_states.pt. 0: [2022-11-26 13:07:08,773] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_05-model_00-model_states.pt... 
0: [2022-11-26 13:07:08,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_05-model_00-model_states.pt. 0: [2022-11-26 13:07:08,851] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_06-model_00-model_states.pt... 0: [2022-11-26 13:07:08,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_06-model_00-model_states.pt. 0: [2022-11-26 13:07:08,930] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_07-model_00-model_states.pt... 0: [2022-11-26 13:07:09,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_07-model_00-model_states.pt. 0: [2022-11-26 13:07:09,008] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_08-model_00-model_states.pt... 0: [2022-11-26 13:07:09,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_08-model_00-model_states.pt. 0: [2022-11-26 13:07:09,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_09-model_00-model_states.pt... 0: [2022-11-26 13:07:09,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_09-model_00-model_states.pt. 0: [2022-11-26 13:07:09,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_10-model_00-model_states.pt... 0: [2022-11-26 13:07:09,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_10-model_00-model_states.pt. 0: [2022-11-26 13:07:09,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_11-model_00-model_states.pt... 
0: [2022-11-26 13:07:09,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_11-model_00-model_states.pt. 0: [2022-11-26 13:07:09,317] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_12-model_00-model_states.pt... 0: [2022-11-26 13:07:09,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_12-model_00-model_states.pt. 0: [2022-11-26 13:07:09,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_13-model_00-model_states.pt... 0: [2022-11-26 13:07:09,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_13-model_00-model_states.pt. 0: [2022-11-26 13:07:09,469] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_14-model_00-model_states.pt... 0: [2022-11-26 13:07:09,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_14-model_00-model_states.pt. 0: [2022-11-26 13:07:09,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_15-model_00-model_states.pt... 0: [2022-11-26 13:07:09,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_15-model_00-model_states.pt. 0: [2022-11-26 13:07:09,622] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_16-model_00-model_states.pt... 0: [2022-11-26 13:07:09,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_16-model_00-model_states.pt. 0: [2022-11-26 13:07:09,697] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_17-model_00-model_states.pt... 
0: [2022-11-26 13:07:09,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_17-model_00-model_states.pt. 0: [2022-11-26 13:07:09,774] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_18-model_00-model_states.pt... 0: [2022-11-26 13:07:09,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_18-model_00-model_states.pt. 0: [2022-11-26 13:07:09,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_19-model_00-model_states.pt... 0: [2022-11-26 13:07:09,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_19-model_00-model_states.pt. 0: [2022-11-26 13:07:09,921] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_20-model_00-model_states.pt... 0: [2022-11-26 13:07:09,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_20-model_00-model_states.pt. 0: [2022-11-26 13:07:09,994] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_21-model_00-model_states.pt... 0: [2022-11-26 13:07:10,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_21-model_00-model_states.pt. 0: [2022-11-26 13:07:10,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_22-model_00-model_states.pt... 0: [2022-11-26 13:07:10,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_22-model_00-model_states.pt. 0: [2022-11-26 13:07:10,144] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_23-model_00-model_states.pt... 
0: [2022-11-26 13:07:10,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_23-model_00-model_states.pt. 0: [2022-11-26 13:07:10,216] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_24-model_00-model_states.pt... 0: [2022-11-26 13:07:10,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_24-model_00-model_states.pt. 0: [2022-11-26 13:07:10,292] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_25-model_00-model_states.pt... 0: [2022-11-26 13:07:10,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_25-model_00-model_states.pt. 0: [2022-11-26 13:07:10,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_26-model_00-model_states.pt... 0: [2022-11-26 13:07:10,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_26-model_00-model_states.pt. 0: [2022-11-26 13:07:10,441] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_27-model_00-model_states.pt... 0: [2022-11-26 13:07:10,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_27-model_00-model_states.pt. 0: [2022-11-26 13:07:10,516] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_28-model_00-model_states.pt... 0: [2022-11-26 13:07:10,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_28-model_00-model_states.pt. 0: [2022-11-26 13:07:10,590] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/layer_30-model_00-model_states.pt... 
0: [2022-11-26 13:07:10,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/layer_30-model_00-model_states.pt. 0: [2022-11-26 13:07:10,592] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step84000/mp_rank_00_model_states.pt 0: [2022-11-26 13:07:10,592] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/mp_rank_00_model_states.pt... 0: [2022-11-26 13:07:10,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/mp_rank_00_model_states.pt. 0: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 
7: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 
13: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 23: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 
23: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 
14: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 
30: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 
26: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
6: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
9: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
1: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
13: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
20: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 
28: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 
30: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 
26: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 
5: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 
10: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 
12: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 
23: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 
30: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 
0: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
25: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 
29: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
15: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 30: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:07:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:07:10,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:07:10,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 13:07:10,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 15: [2022-11-26 13:07:10,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 13:07:10,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 13:07:10,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 29: [2022-11-26 13:07:10,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 13:07:10,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 13:07:10,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 9: [2022-11-26 13:07:10,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:07:10,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 13:07:10,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 7: [2022-11-26 13:07:10,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:07:10,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 13:07:10,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 0: [2022-11-26 13:07:10,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-26 13:07:10,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 13:07:10,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 21: [2022-11-26 13:07:10,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 13:07:10,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 13:07:10,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 28: [2022-11-26 13:07:10,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 13:07:10,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 13:07:10,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 6: [2022-11-26 13:07:10,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:07:10,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 13:07:10,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 12: [2022-11-26 13:07:10,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 13:07:10,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 13:07:10,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 11: [2022-11-26 13:07:10,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:07:10,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 13:07:10,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 13: [2022-11-26 13:07:10,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:07:10,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 13:07:10,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 19: [2022-11-26 13:07:10,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:07:10,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
19: [2022-11-26 13:07:10,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 5: [2022-11-26 13:07:10,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 19: [2022-11-26 13:07:10,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 5: [2022-11-26 13:07:10,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 28: [2022-11-26 13:07:10,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:07:10,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:07:10,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:07:10,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 25: [2022-11-26 13:07:10,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
28: [2022-11-26 13:07:10,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 27: [2022-11-26 13:07:10,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 5: [2022-11-26 13:07:10,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 28: [2022-11-26 13:07:10,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 27: [2022-11-26 13:07:10,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 25: [2022-11-26 13:07:10,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 13:07:10,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 26: [2022-11-26 13:07:10,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 13:07:10,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 13:07:10,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 26: [2022-11-26 13:07:10,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
26: [2022-11-26 13:07:10,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 13:07:10,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 21: [2022-11-26 13:07:10,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 13:07:10,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 13:07:10,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 19: [2022-11-26 13:07:10,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:07:10,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 13:07:10,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 1: [2022-11-26 13:07:10,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:07:10,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
23: [2022-11-26 13:07:10,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 1: [2022-11-26 13:07:10,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 23: [2022-11-26 13:07:10,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 1: [2022-11-26 13:07:10,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 23: [2022-11-26 13:07:10,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:07:10,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 13:07:10,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 20: [2022-11-26 13:07:10,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:07:10,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 
20: [2022-11-26 13:07:10,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 13:07:10,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 9: [2022-11-26 13:07:10,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:07:10,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 20: [2022-11-26 13:07:10,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 9: [2022-11-26 13:07:10,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 29: [2022-11-26 13:07:10,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:07:10,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 29: [2022-11-26 13:07:10,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 13:07:10,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 7: [2022-11-26 13:07:10,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 13:07:10,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 13:07:10,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 22: [2022-11-26 13:07:10,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 25: [2022-11-26 13:07:10,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 22: [2022-11-26 13:07:10,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 13:07:10,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 25: [2022-11-26 13:07:10,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 13:07:10,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 22: [2022-11-26 13:07:10,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 13:07:10,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 13:07:10,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 25: [2022-11-26 13:07:10,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 
25: [2022-11-26 13:07:10,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 13:07:10,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 17: [2022-11-26 13:07:10,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:07:10,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 13:07:10,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 17: [2022-11-26 13:07:10,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:07:10,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 13:07:10,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 14: [2022-11-26 13:07:10,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:07:10,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:07:10,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
14: [2022-11-26 13:07:10,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 13:07:10,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 13:07:10,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 14: [2022-11-26 13:07:10,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 5: [2022-11-26 13:07:10,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 13:07:10,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 6: [2022-11-26 13:07:10,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:07:10,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:07:10,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 13:07:10,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 24: [2022-11-26 13:07:10,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 13:07:10,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
26: [2022-11-26 13:07:10,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 13:07:10,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 13:07:10,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 24: [2022-11-26 13:07:10,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:07:10,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 13:07:10,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 17: [2022-11-26 13:07:10,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:07:10,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 13:07:10,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 27: [2022-11-26 13:07:10,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:07:10,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 13:07:10,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
15: [2022-11-26 13:07:10,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:07:10,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 13:07:10,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 7: [2022-11-26 13:07:10,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:07:10,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:07:10,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 9: [2022-11-26 13:07:10,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 7: [2022-11-26 13:07:10,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 9: [2022-11-26 13:07:10,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 11: [2022-11-26 13:07:10,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:07:10,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 13:07:10,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
6: [2022-11-26 13:07:10,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:07:10,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:07:10,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 8: [2022-11-26 13:07:10,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 15: [2022-11-26 13:07:10,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:07:10,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 8: [2022-11-26 13:07:10,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 15: [2022-11-26 13:07:10,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 13:07:10,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 25: [2022-11-26 13:07:10,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 13:07:10,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 13:07:10,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
8: [2022-11-26 13:07:10,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:07:10,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 13:07:10,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 12: [2022-11-26 13:07:10,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:07:10,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 13:07:10,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 5: [2022-11-26 13:07:10,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:07:10,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 13:07:10,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 11: [2022-11-26 13:07:10,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:07:10,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 13:07:10,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
28: [2022-11-26 13:07:10,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 30: [2022-11-26 13:07:10,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 13:07:10,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 13:07:10,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 13:07:10,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 13:07:10,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 22: [2022-11-26 13:07:10,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 30: [2022-11-26 13:07:10,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 26: [2022-11-26 13:07:10,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 
22: [2022-11-26 13:07:10,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 28: [2022-11-26 13:07:10,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 22: [2022-11-26 13:07:10,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 26: [2022-11-26 13:07:10,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 28: [2022-11-26 13:07:10,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 26: [2022-11-26 13:07:10,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 28: [2022-11-26 13:07:10,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 13:07:10,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 13:07:10,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 12: [2022-11-26 13:07:10,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:07:10,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:07:10,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
12: [2022-11-26 13:07:10,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 30: [2022-11-26 13:07:10,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:07:10,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 12: [2022-11-26 13:07:10,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 30: [2022-11-26 13:07:10,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 19: [2022-11-26 13:07:10,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 13:07:10,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 21: [2022-11-26 13:07:10,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:07:10,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 30: [2022-11-26 13:07:10,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 21: [2022-11-26 13:07:10,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 13:07:10,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
27: [2022-11-26 13:07:10,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:07:10,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 13:07:10,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 29: [2022-11-26 13:07:10,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 13:07:10,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 13:07:10,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 8: [2022-11-26 13:07:10,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:07:10,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 13:07:10,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 23: [2022-11-26 13:07:10,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:07:10,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 13:07:10,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
24: [2022-11-26 13:07:10,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:07:10,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 13:07:10,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 23: [2022-11-26 13:07:10,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:07:10,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 13:07:10,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 14: [2022-11-26 13:07:10,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:07:10,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 13:07:10,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 20: [2022-11-26 13:07:10,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:07:10,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 13:07:10,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
1: [2022-11-26 13:07:10,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:07:10,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 13:07:10,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:07:10,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 1: [2022-11-26 13:07:10,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 13:07:10,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 7: [2022-11-26 13:07:10,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:07:10,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 13:07:10,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 20: [2022-11-26 13:07:10,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:07:10,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 13:07:10,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
13: [2022-11-26 13:07:10,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:07:10,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 13:07:10,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 13: [2022-11-26 13:07:10,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:07:10,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 13:07:10,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 29: [2022-11-26 13:07:10,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 13:07:10,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 13:07:10,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 0: [2022-11-26 13:07:10,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:07:10,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 13:07:10,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
16: [2022-11-26 13:07:10,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:07:10,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:07:10,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:07:10,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 13:07:10,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 4: [2022-11-26 13:07:10,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:07:10,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 4: [2022-11-26 13:07:10,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 16: [2022-11-26 13:07:10,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 16: [2022-11-26 13:07:10,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 4: [2022-11-26 13:07:10,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 16: [2022-11-26 13:07:10,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
11: [2022-11-26 13:07:10,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:07:10,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 13:07:10,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 1: [2022-11-26 13:07:10,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:07:10,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:07:10,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 14: [2022-11-26 13:07:10,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 13:07:10,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 6: [2022-11-26 13:07:10,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:07:10,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 1: [2022-11-26 13:07:10,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 6: [2022-11-26 13:07:10,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
18: [2022-11-26 13:07:10,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:07:10,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:07:10,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:07:10,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:07:10,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 13:07:10,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 13:07:10,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 13:07:10,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 13:07:10,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 18: [2022-11-26 13:07:10,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 18: [2022-11-26 13:07:10,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 18: [2022-11-26 13:07:10,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
0: [2022-11-26 13:07:10,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:07:10,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:07:10,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 13:07:10,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 13:07:10,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 0: [2022-11-26 13:07:10,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 30: [2022-11-26 13:07:10,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 13:07:10,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 15: [2022-11-26 13:07:10,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 30: [2022-11-26 13:07:10,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 15: [2022-11-26 13:07:10,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 13:07:10,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
31: [2022-11-26 13:07:10,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 13:07:10,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 13:07:10,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 13:07:10,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 13:07:10,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 13:07:10,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 31: [2022-11-26 13:07:10,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 31: [2022-11-26 13:07:10,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 3: [2022-11-26 13:07:10,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:07:10,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:07:10,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 31: [2022-11-26 13:07:10,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
3: [2022-11-26 13:07:10,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 13:07:10,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 13:07:10,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 13:07:10,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 3: [2022-11-26 13:07:10,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 3: [2022-11-26 13:07:10,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 17: [2022-11-26 13:07:10,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:07:10,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 13:07:10,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 4: [2022-11-26 13:07:10,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:07:10,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 13:07:10,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
22: [2022-11-26 13:07:10,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 13:07:10,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 13:07:10,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 12: [2022-11-26 13:07:10,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:07:10,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 13:07:10,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 9: [2022-11-26 13:07:10,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:07:10,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 13:07:10,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 16: [2022-11-26 13:07:10,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:07:10,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 13:07:10,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
31: [2022-11-26 13:07:10,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 13:07:10,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 13:07:10,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 13: [2022-11-26 13:07:10,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:07:10,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 13:07:10,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 3: [2022-11-26 13:07:10,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:07:10,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 13:07:10,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 24: [2022-11-26 13:07:10,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:07:10,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 13:07:10,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
10: [2022-11-26 13:07:10,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:07:10,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:07:10,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:07:10,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:07:10,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 13:07:10,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 13:07:10,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 13:07:10,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 13:07:10,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 10: [2022-11-26 13:07:10,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 10: [2022-11-26 13:07:10,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 10: [2022-11-26 13:07:10,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
4: [2022-11-26 13:07:10,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:07:10,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 13:07:10,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 2: [2022-11-26 13:07:10,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:07:10,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:07:10,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:07:10,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 13:07:10,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 13:07:10,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 13:07:10,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 2: [2022-11-26 13:07:10,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 2: [2022-11-26 13:07:10,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
21: [2022-11-26 13:07:10,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 13:07:10,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 13:07:10,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 27: [2022-11-26 13:07:10,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:07:10,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 13:07:10,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 5: [2022-11-26 13:07:10,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:07:10,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 13:07:10,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 19: [2022-11-26 13:07:10,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:07:10,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 13:07:10,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
20: [2022-11-26 13:07:10,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:07:10,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 13:07:10,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 6: [2022-11-26 13:07:10,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:07:10,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 13:07:10,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 23: [2022-11-26 13:07:10,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:07:10,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 13:07:10,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 28: [2022-11-26 13:07:10,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 13:07:10,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 13:07:10,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
29: [2022-11-26 13:07:10,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 13:07:10,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 13:07:10,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 10: [2022-11-26 13:07:10,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:07:10,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 13:07:10,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 7: [2022-11-26 13:07:10,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:07:10,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 13:07:10,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 25: [2022-11-26 13:07:10,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 13:07:10,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 13:07:10,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
18: [2022-11-26 13:07:10,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:07:10,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 13:07:10,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 0: [2022-11-26 13:07:10,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:07:10,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:07:10,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 13:07:10,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 1: [2022-11-26 13:07:10,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 13:07:10,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 2: [2022-11-26 13:07:10,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:07:10,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
2: [2022-11-26 13:07:10,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 14: [2022-11-26 13:07:10,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 2: [2022-11-26 13:07:10,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 14: [2022-11-26 13:07:10,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 8: [2022-11-26 13:07:10,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:07:10,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 13:07:10,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 9: [2022-11-26 13:07:10,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:07:10,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 22: [2022-11-26 13:07:10,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:07:10,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
22: [2022-11-26 13:07:10,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 13:07:10,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 31: [2022-11-26 13:07:10,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 13:07:10,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 13:07:10,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 4: [2022-11-26 13:07:10,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:07:10,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 13:07:10,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 30: [2022-11-26 13:07:10,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 13:07:10,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 16: [2022-11-26 13:07:10,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 30: [2022-11-26 13:07:10,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
16: [2022-11-26 13:07:10,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 15: [2022-11-26 13:07:10,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:07:10,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 15: [2022-11-26 13:07:10,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 13:07:10,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 13: [2022-11-26 13:07:10,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:07:10,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 13:07:10,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 26: [2022-11-26 13:07:10,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:07:10,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 26: [2022-11-26 13:07:10,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 13:07:10,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
11: [2022-11-26 13:07:10,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 13:07:10,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 24: [2022-11-26 13:07:10,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:07:10,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:07:10,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 17: [2022-11-26 13:07:10,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 13:07:10,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 12: [2022-11-26 13:07:10,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:07:10,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 12: [2022-11-26 13:07:10,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 13:07:10,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 21: [2022-11-26 13:07:10,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-26 13:07:10,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 13:07:10,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 3: [2022-11-26 13:07:10,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:07:10,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 13:07:10,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 27: [2022-11-26 13:07:10,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:07:10,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 13:07:10,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 20: [2022-11-26 13:07:10,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:07:10,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 13:07:10,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 19: [2022-11-26 13:07:10,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
19: [2022-11-26 13:07:10,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 13:07:10,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 5: [2022-11-26 13:07:10,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:07:10,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 13:07:10,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 29: [2022-11-26 13:07:10,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 13:07:10,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 13:07:10,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 23: [2022-11-26 13:07:10,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:07:10,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 13:07:10,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 6: [2022-11-26 13:07:10,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 13:07:10,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 13:07:10,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 28: [2022-11-26 13:07:10,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 13:07:10,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 13:07:10,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 10: [2022-11-26 13:07:10,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:07:10,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 13:07:10,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 0: [2022-11-26 13:07:10,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:07:10,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 13:07:10,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 14: [2022-11-26 13:07:10,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-26 13:07:10,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 13:07:10,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 17: [2022-11-26 13:07:10,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:07:10,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 13:07:10,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 22: [2022-11-26 13:07:10,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 13:07:10,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 13:07:10,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 18: [2022-11-26 13:07:10,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:07:10,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 13:07:10,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 8: [2022-11-26 13:07:10,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-26 13:07:10,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 13:07:10,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 25: [2022-11-26 13:07:10,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 26: [2022-11-26 13:07:10,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 25: [2022-11-26 13:07:10,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 13:07:10,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 26: [2022-11-26 13:07:10,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 13:07:10,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 9: [2022-11-26 13:07:10,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:07:10,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 13:07:10,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 16: [2022-11-26 13:07:10,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
16: [2022-11-26 13:07:10,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 13:07:10,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 31: [2022-11-26 13:07:10,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 13:07:10,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 13:07:10,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 2: [2022-11-26 13:07:10,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:07:10,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 13:07:10,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 13: [2022-11-26 13:07:10,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:07:10,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 13:07:10,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 24: [2022-11-26 13:07:10,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
24: [2022-11-26 13:07:10,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 13:07:10,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 4: [2022-11-26 13:07:10,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:07:10,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 13:07:10,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 1: [2022-11-26 13:07:10,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:07:10,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 13:07:10,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 11: [2022-11-26 13:07:10,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:07:10,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 30: [2022-11-26 13:07:10,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:07:10,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
30: [2022-11-26 13:07:10,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 13:07:10,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 15: [2022-11-26 13:07:10,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:07:10,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 13:07:10,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 12: [2022-11-26 13:07:10,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:07:10,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 13:07:10,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 5: [2022-11-26 13:07:10,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:07:10,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 13:07:10,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 27: [2022-11-26 13:07:10,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 
27: [2022-11-26 13:07:10,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 13:07:10,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 3: [2022-11-26 13:07:10,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:07:10,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 13:07:10,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 20: [2022-11-26 13:07:10,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:07:10,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 13:07:10,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 28: [2022-11-26 13:07:10,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 13:07:10,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 13:07:10,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 7: [2022-11-26 13:07:10,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-26 13:07:10,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 13:07:10,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 19: [2022-11-26 13:07:10,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:07:10,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 13:07:10,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 29: [2022-11-26 13:07:10,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 13:07:10,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 13:07:10,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 21: [2022-11-26 13:07:10,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 13:07:10,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 13:07:10,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 23: [2022-11-26 13:07:10,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
23: [2022-11-26 13:07:10,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 13:07:10,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 6: [2022-11-26 13:07:10,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:07:10,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 13:07:10,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 18: [2022-11-26 13:07:10,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:07:10,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 13:07:10,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 17: [2022-11-26 13:07:10,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:07:10,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 13:07:10,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 22: [2022-11-26 13:07:10,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
22: [2022-11-26 13:07:10,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 13:07:10,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 8: [2022-11-26 13:07:10,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:07:10,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 13:07:10,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 10: [2022-11-26 13:07:10,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:07:10,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 13:07:10,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 25: [2022-11-26 13:07:10,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:07:10,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 25: [2022-11-26 13:07:10,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 13:07:10,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
0: [2022-11-26 13:07:10,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 2: [2022-11-26 13:07:10,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:07:10,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 2: [2022-11-26 13:07:10,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 1: [2022-11-26 13:07:10,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:07:10,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 1: [2022-11-26 13:07:10,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 13:07:10,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 26: [2022-11-26 13:07:10,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 13:07:10,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 13:07:10,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 14: [2022-11-26 13:07:10,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-26 13:07:10,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 13:07:10,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 30: [2022-11-26 13:07:10,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 13:07:10,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 13:07:10,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 5: [2022-11-26 13:07:10,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:07:10,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 13:07:10,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 31: [2022-11-26 13:07:10,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 13:07:10,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 13:07:10,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 0: [2022-11-26 13:07:10,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
15: [2022-11-26 13:07:10,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:07:10,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:07:10,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 13:07:10,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 11: [2022-11-26 13:07:10,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 13:07:10,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 29: [2022-11-26 13:07:10,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:07:10,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 29: [2022-11-26 13:07:10,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 10: [2022-11-26 13:07:10,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 29: [2022-11-26 13:07:10,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 10: [2022-11-26 13:07:10,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
12: [2022-11-26 13:07:10,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:07:10,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 13:07:10,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 9: [2022-11-26 13:07:10,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:07:10,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 13:07:10,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 24: [2022-11-26 13:07:10,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:07:10,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 24: [2022-11-26 13:07:10,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 14: [2022-11-26 13:07:10,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:07:10,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
14: [2022-11-26 13:07:10,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 13:07:10,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 18: [2022-11-26 13:07:10,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:07:10,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 21: [2022-11-26 13:07:10,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:07:10,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 21: [2022-11-26 13:07:10,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 13:07:10,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 0: [2022-11-26 13:07:10,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 8: [2022-11-26 13:07:10,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:07:10,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 13:07:10,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 8: [2022-11-26 13:07:10,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 3: [2022-11-26 13:07:10,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 8: [2022-11-26 13:07:10,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 25: [2022-11-26 13:07:10,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:07:10,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 25: [2022-11-26 13:07:10,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 13:07:10,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 16: [2022-11-26 13:07:10,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 13:07:10,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 27: [2022-11-26 13:07:10,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 28: [2022-11-26 13:07:10,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
27: [2022-11-26 13:07:10,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 13:07:10,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 28: [2022-11-26 13:07:10,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 13:07:10,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 31: [2022-11-26 13:07:10,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:07:10,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:07:10,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 31: [2022-11-26 13:07:10,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 17: [2022-11-26 13:07:10,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 31: [2022-11-26 13:07:10,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 22: [2022-11-26 13:07:10,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:07:10,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
22: [2022-11-26 13:07:10,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 13:07:10,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 4: [2022-11-26 13:07:10,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 13:07:10,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 23: [2022-11-26 13:07:10,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:07:10,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 13:07:10,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 7: [2022-11-26 13:07:10,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:07:10,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:07:10,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 13: [2022-11-26 13:07:10,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 13:07:10,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
7: [2022-11-26 13:07:10,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 6: [2022-11-26 13:07:10,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:07:10,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:07:10,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 26: [2022-11-26 13:07:10,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:07:10,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 6: [2022-11-26 13:07:10,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 19: [2022-11-26 13:07:10,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 6: [2022-11-26 13:07:10,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 20: [2022-11-26 13:07:10,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 26: [2022-11-26 13:07:10,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 20: [2022-11-26 13:07:10,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
11: [2022-11-26 13:07:10,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 26: [2022-11-26 13:07:10,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 30: [2022-11-26 13:07:10,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:07:10,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 30: [2022-11-26 13:07:10,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 11: [2022-11-26 13:07:10,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 30: [2022-11-26 13:07:10,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 7: [2022-11-26 13:07:10,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:07:10,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:07:10,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 7: [2022-11-26 13:07:10,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 13:07:10,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
9: [2022-11-26 13:07:10,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 24: [2022-11-26 13:07:10,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:07:10,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:07:10,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 13:07:10,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 4: [2022-11-26 13:07:10,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:07:10,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 13:07:10,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 4: [2022-11-26 13:07:10,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 13:07:10,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 27: [2022-11-26 13:07:10,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
27: [2022-11-26 13:07:10,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 13:07:10,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 12: [2022-11-26 13:07:10,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:07:10,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 13:07:10,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 15: [2022-11-26 13:07:10,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:07:10,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 13:07:10,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 21: [2022-11-26 13:07:10,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 13:07:10,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 13:07:10,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 3: [2022-11-26 13:07:10,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 13:07:10,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 13:07:10,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 16: [2022-11-26 13:07:10,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:07:10,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 13:07:10,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 2: [2022-11-26 13:07:10,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:07:10,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 13:07:10,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:07:10,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 2: [2022-11-26 13:07:10,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 13:07:10,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 13: [2022-11-26 13:07:10,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 13:07:10,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 13:07:10,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 4: [2022-11-26 13:07:10,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:07:10,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step84000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 13:07:10,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 0: successfully saved checkpoint at iteration 84000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2602.48 31: iteration 84010/ 173500 | consumed samples: 21506560 | consumed tokens: 44045434880 | elapsed time per iteration (s): 1.09 | learning rate: 1.159E-04 | global batch size: 256 | lm loss: 1.995473E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.624 | TFLOPs: 14.25 | 31: iteration 84020/ 173500 | consumed samples: 21509120 | consumed tokens: 44050677760 | elapsed time per iteration (s): 0.79 | learning rate: 1.159E-04 | global batch size: 256 | lm loss: 2.003917E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.611 | TFLOPs: 19.58 | 31: iteration 84030/ 173500 | consumed samples: 21511680 | consumed tokens: 44055920640 | elapsed time per iteration (s): 0.79 | learning rate: 1.159E-04 | global batch size: 256 | lm loss: 2.016931E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.056 | TFLOPs: 19.60 | 31: iteration 84040/ 173500 | 
consumed samples: 21514240 | consumed tokens: 44061163520 | elapsed time per iteration (s): 0.74 | learning rate: 1.159E-04 | global batch size: 256 | lm loss: 2.016996E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.807 | TFLOPs: 21.04 | 31: iteration 84050/ 173500 | consumed samples: 21516800 | consumed tokens: 44066406400 | elapsed time per iteration (s): 0.72 | learning rate: 1.159E-04 | global batch size: 256 | lm loss: 1.982289E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.370 | TFLOPs: 21.44 | 31: iteration 84060/ 173500 | consumed samples: 21519360 | consumed tokens: 44071649280 | elapsed time per iteration (s): 0.73 | learning rate: 1.159E-04 | global batch size: 256 | lm loss: 2.016990E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.765 | TFLOPs: 21.16 | 31: iteration 84070/ 173500 | consumed samples: 21521920 | consumed tokens: 44076892160 | elapsed time per iteration (s): 0.75 | learning rate: 1.158E-04 | global batch size: 256 | lm loss: 2.007858E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.380 | TFLOPs: 20.53 | 31: iteration 84080/ 173500 | consumed samples: 21524480 | consumed tokens: 44082135040 | elapsed time per iteration (s): 0.78 | learning rate: 1.158E-04 | global batch size: 256 | lm loss: 2.017626E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.011 | TFLOPs: 19.84 | 31: iteration 84090/ 173500 | consumed samples: 21527040 | consumed tokens: 44087377920 | elapsed time per iteration (s): 0.76 | learning rate: 1.158E-04 | global batch size: 256 | lm loss: 1.989438E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 337.573 | TFLOPs: 20.42 | 31: iteration 84100/ 173500 | consumed samples: 21529600 | consumed tokens: 44092620800 | elapsed time per iteration (s): 0.75 | learning rate: 1.158E-04 | global batch size: 256 | lm loss: 1.990199E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.489 | TFLOPs: 20.60 | 31: iteration 84110/ 173500 | consumed samples: 21532160 | consumed tokens: 44097863680 | elapsed time per iteration (s): 0.75 | learning rate: 1.158E-04 | global batch size: 256 | lm loss: 2.000330E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.910 | TFLOPs: 20.56 | 31: iteration 84120/ 173500 | consumed samples: 21534720 | consumed tokens: 44103106560 | elapsed time per iteration (s): 0.82 | learning rate: 1.158E-04 | global batch size: 256 | lm loss: 1.992945E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.743 | TFLOPs: 18.92 | 31: iteration 84130/ 173500 | consumed samples: 21537280 | consumed tokens: 44108349440 | elapsed time per iteration (s): 0.75 | learning rate: 1.157E-04 | global batch size: 256 | lm loss: 1.998368E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.701 | TFLOPs: 20.55 | 31: iteration 84140/ 173500 | consumed samples: 21539840 | consumed tokens: 44113592320 | elapsed time per iteration (s): 0.82 | learning rate: 1.157E-04 | global batch size: 256 | lm loss: 2.018492E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.416 | TFLOPs: 18.96 | 31: iteration 84150/ 173500 | consumed samples: 21542400 | consumed tokens: 44118835200 | elapsed time per iteration (s): 0.74 | learning rate: 1.157E-04 | global batch 
size: 256 | lm loss: 2.017231E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.723 | TFLOPs: 20.92 | 31: iteration 84160/ 173500 | consumed samples: 21544960 | consumed tokens: 44124078080 | elapsed time per iteration (s): 0.75 | learning rate: 1.157E-04 | global batch size: 256 | lm loss: 1.984224E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.402 | TFLOPs: 20.71 | 31: iteration 84170/ 173500 | consumed samples: 21547520 | consumed tokens: 44129320960 | elapsed time per iteration (s): 0.74 | learning rate: 1.157E-04 | global batch size: 256 | lm loss: 2.000638E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.473 | TFLOPs: 20.90 | 31: iteration 84180/ 173500 | consumed samples: 21550080 | consumed tokens: 44134563840 | elapsed time per iteration (s): 0.76 | learning rate: 1.157E-04 | global batch size: 256 | lm loss: 2.012999E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.298 | TFLOPs: 20.35 | 31: iteration 84190/ 173500 | consumed samples: 21552640 | consumed tokens: 44139806720 | elapsed time per iteration (s): 0.77 | learning rate: 1.156E-04 | global batch size: 256 | lm loss: 1.989415E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.610 | TFLOPs: 20.12 | 31: iteration 84200/ 173500 | consumed samples: 21555200 | consumed tokens: 44145049600 | elapsed time per iteration (s): 0.83 | learning rate: 1.156E-04 | global batch size: 256 | lm loss: 2.022775E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.855 | TFLOPs: 18.68 | 31: iteration 84210/ 173500 | consumed samples: 21557760 | 
consumed tokens: 44150292480 | elapsed time per iteration (s): 0.79 | learning rate: 1.156E-04 | global batch size: 256 | lm loss: 1.998459E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.874 | TFLOPs: 19.59 | 31: iteration 84220/ 173500 | consumed samples: 21560320 | consumed tokens: 44155535360 | elapsed time per iteration (s): 0.73 | learning rate: 1.156E-04 | global batch size: 256 | lm loss: 1.986177E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.127 | TFLOPs: 21.24 | 31: iteration 84230/ 173500 | consumed samples: 21562880 | consumed tokens: 44160778240 | elapsed time per iteration (s): 0.78 | learning rate: 1.156E-04 | global batch size: 256 | lm loss: 2.028829E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.509 | TFLOPs: 19.87 | 31: iteration 84240/ 173500 | consumed samples: 21565440 | consumed tokens: 44166021120 | elapsed time per iteration (s): 0.82 | learning rate: 1.156E-04 | global batch size: 256 | lm loss: 2.004663E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.125 | TFLOPs: 18.94 | 31: iteration 84250/ 173500 | consumed samples: 21568000 | consumed tokens: 44171264000 | elapsed time per iteration (s): 0.78 | learning rate: 1.155E-04 | global batch size: 256 | lm loss: 1.985485E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.066 | TFLOPs: 19.85 | 31: iteration 84260/ 173500 | consumed samples: 21570560 | consumed tokens: 44176506880 | elapsed time per iteration (s): 0.79 | learning rate: 1.155E-04 | global batch size: 256 | lm loss: 2.009792E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 323.658 | TFLOPs: 19.58 | 31: iteration 84270/ 173500 | consumed samples: 21573120 | consumed tokens: 44181749760 | elapsed time per iteration (s): 0.76 | learning rate: 1.155E-04 | global batch size: 256 | lm loss: 2.007100E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.534 | TFLOPs: 20.48 | 31: iteration 84280/ 173500 | consumed samples: 21575680 | consumed tokens: 44186992640 | elapsed time per iteration (s): 0.78 | learning rate: 1.155E-04 | global batch size: 256 | lm loss: 2.013884E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.312 | TFLOPs: 19.80 | 31: iteration 84290/ 173500 | consumed samples: 21578240 | consumed tokens: 44192235520 | elapsed time per iteration (s): 0.78 | learning rate: 1.155E-04 | global batch size: 256 | lm loss: 2.022067E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.029 | TFLOPs: 19.78 | 31: iteration 84300/ 173500 | consumed samples: 21580800 | consumed tokens: 44197478400 | elapsed time per iteration (s): 0.78 | learning rate: 1.155E-04 | global batch size: 256 | lm loss: 2.021930E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.035 | TFLOPs: 19.78 | 31: iteration 84310/ 173500 | consumed samples: 21583360 | consumed tokens: 44202721280 | elapsed time per iteration (s): 0.72 | learning rate: 1.154E-04 | global batch size: 256 | lm loss: 2.014729E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.475 | TFLOPs: 21.57 | 31: iteration 84320/ 173500 | consumed samples: 21585920 | consumed tokens: 44207964160 | elapsed time per iteration (s): 0.74 | learning rate: 1.154E-04 | global batch size: 256 | lm loss: 
2.010141E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.933 | TFLOPs: 20.87 | 31: iteration 84330/ 173500 | consumed samples: 21588480 | consumed tokens: 44213207040 | elapsed time per iteration (s): 0.80 | learning rate: 1.154E-04 | global batch size: 256 | lm loss: 1.985225E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.646 | TFLOPs: 19.34 | 31: iteration 84340/ 173500 | consumed samples: 21591040 | consumed tokens: 44218449920 | elapsed time per iteration (s): 0.82 | learning rate: 1.154E-04 | global batch size: 256 | lm loss: 2.007564E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.477 | TFLOPs: 18.96 | 31: iteration 84350/ 173500 | consumed samples: 21593600 | consumed tokens: 44223692800 | elapsed time per iteration (s): 0.80 | learning rate: 1.154E-04 | global batch size: 256 | lm loss: 1.981633E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.659 | TFLOPs: 19.34 | 31: iteration 84360/ 173500 | consumed samples: 21596160 | consumed tokens: 44228935680 | elapsed time per iteration (s): 0.76 | learning rate: 1.154E-04 | global batch size: 256 | lm loss: 1.994373E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.135 | TFLOPs: 20.27 | 31: iteration 84370/ 173500 | consumed samples: 21598720 | consumed tokens: 44234178560 | elapsed time per iteration (s): 0.78 | learning rate: 1.153E-04 | global batch size: 256 | lm loss: 2.021368E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.452 | TFLOPs: 19.81 | 31: iteration 84380/ 173500 | consumed samples: 21601280 | consumed tokens: 
44239421440 | elapsed time per iteration (s): 0.75 | learning rate: 1.153E-04 | global batch size: 256 | lm loss: 2.027605E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.063 | TFLOPs: 20.57 | 31: iteration 84390/ 173500 | consumed samples: 21603840 | consumed tokens: 44244664320 | elapsed time per iteration (s): 0.76 | learning rate: 1.153E-04 | global batch size: 256 | lm loss: 2.016121E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.172 | TFLOPs: 20.34 | 31: iteration 84400/ 173500 | consumed samples: 21606400 | consumed tokens: 44249907200 | elapsed time per iteration (s): 0.79 | learning rate: 1.153E-04 | global batch size: 256 | lm loss: 2.011447E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.107 | TFLOPs: 19.73 | 31: iteration 84410/ 173500 | consumed samples: 21608960 | consumed tokens: 44255150080 | elapsed time per iteration (s): 0.81 | learning rate: 1.153E-04 | global batch size: 256 | lm loss: 1.995941E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.667 | TFLOPs: 19.22 | 31: iteration 84420/ 173500 | consumed samples: 21611520 | consumed tokens: 44260392960 | elapsed time per iteration (s): 0.80 | learning rate: 1.153E-04 | global batch size: 256 | lm loss: 1.998832E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.116 | TFLOPs: 19.31 | 31: iteration 84430/ 173500 | consumed samples: 21614080 | consumed tokens: 44265635840 | elapsed time per iteration (s): 0.84 | learning rate: 1.152E-04 | global batch size: 256 | lm loss: 2.021518E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 305.425 | TFLOPs: 18.48 | 31: iteration 84440/ 173500 | consumed samples: 21616640 | consumed tokens: 44270878720 | elapsed time per iteration (s): 0.80 | learning rate: 1.152E-04 | global batch size: 256 | lm loss: 1.984796E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.470 | TFLOPs: 19.33 | 31: iteration 84450/ 173500 | consumed samples: 21619200 | consumed tokens: 44276121600 | elapsed time per iteration (s): 0.83 | learning rate: 1.152E-04 | global batch size: 256 | lm loss: 2.004060E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.425 | TFLOPs: 18.60 | 31: iteration 84460/ 173500 | consumed samples: 21621760 | consumed tokens: 44281364480 | elapsed time per iteration (s): 0.79 | learning rate: 1.152E-04 | global batch size: 256 | lm loss: 2.004357E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.018 | TFLOPs: 19.48 | 31: iteration 84470/ 173500 | consumed samples: 21624320 | consumed tokens: 44286607360 | elapsed time per iteration (s): 0.85 | learning rate: 1.152E-04 | global batch size: 256 | lm loss: 2.006547E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.537 | TFLOPs: 18.24 | 31: iteration 84480/ 173500 | consumed samples: 21626880 | consumed tokens: 44291850240 | elapsed time per iteration (s): 0.79 | learning rate: 1.152E-04 | global batch size: 256 | lm loss: 1.974849E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.859 | TFLOPs: 19.53 | 31: iteration 84490/ 173500 | consumed samples: 21629440 | consumed tokens: 44297093120 | elapsed time per iteration (s): 0.82 | learning rate: 1.151E-04 | global batch size: 256 | lm loss: 2.003858E+00 | grad 
norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.748 | TFLOPs: 18.98 | 31: iteration 84500/ 173500 | consumed samples: 21632000 | consumed tokens: 44302336000 | elapsed time per iteration (s): 0.81 | learning rate: 1.151E-04 | global batch size: 256 | lm loss: 1.977390E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.228 | TFLOPs: 19.13 | 31: iteration 84510/ 173500 | consumed samples: 21634560 | consumed tokens: 44307578880 | elapsed time per iteration (s): 0.80 | learning rate: 1.151E-04 | global batch size: 256 | lm loss: 1.988010E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.865 | TFLOPs: 19.35 | 31: iteration 84520/ 173500 | consumed samples: 21637120 | consumed tokens: 44312821760 | elapsed time per iteration (s): 0.81 | learning rate: 1.151E-04 | global batch size: 256 | lm loss: 2.039273E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.943 | TFLOPs: 19.17 | 31: iteration 84530/ 173500 | consumed samples: 21639680 | consumed tokens: 44318064640 | elapsed time per iteration (s): 0.81 | learning rate: 1.151E-04 | global batch size: 256 | lm loss: 2.024654E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.486 | TFLOPs: 19.15 | 31: iteration 84540/ 173500 | consumed samples: 21642240 | consumed tokens: 44323307520 | elapsed time per iteration (s): 0.81 | learning rate: 1.151E-04 | global batch size: 256 | lm loss: 1.997227E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.160 | TFLOPs: 19.01 | 31: iteration 84550/ 173500 | consumed samples: 21644800 | consumed tokens: 44328550400 | elapsed time 
per iteration (s): 0.81 | learning rate: 1.150E-04 | global batch size: 256 | lm loss: 1.983664E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.541 | TFLOPs: 19.21 | 31: iteration 84560/ 173500 | consumed samples: 21647360 | consumed tokens: 44333793280 | elapsed time per iteration (s): 0.77 | learning rate: 1.150E-04 | global batch size: 256 | lm loss: 1.973495E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.712 | TFLOPs: 20.19 | 31: iteration 84570/ 173500 | consumed samples: 21649920 | consumed tokens: 44339036160 | elapsed time per iteration (s): 0.82 | learning rate: 1.150E-04 | global batch size: 256 | lm loss: 1.982345E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.071 | TFLOPs: 18.82 | 31: iteration 84580/ 173500 | consumed samples: 21652480 | consumed tokens: 44344279040 | elapsed time per iteration (s): 0.80 | learning rate: 1.150E-04 | global batch size: 256 | lm loss: 1.990213E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.215 | TFLOPs: 19.25 | 31: iteration 84590/ 173500 | consumed samples: 21655040 | consumed tokens: 44349521920 | elapsed time per iteration (s): 0.81 | learning rate: 1.150E-04 | global batch size: 256 | lm loss: 2.000605E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.217 | TFLOPs: 19.01 | 31: iteration 84600/ 173500 | consumed samples: 21657600 | consumed tokens: 44354764800 | elapsed time per iteration (s): 0.80 | learning rate: 1.150E-04 | global batch size: 256 | lm loss: 1.989964E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.149 | TFLOPs: 
19.31 | 31: iteration 84610/ 173500 | consumed samples: 21660160 | consumed tokens: 44360007680 | elapsed time per iteration (s): 0.76 | learning rate: 1.149E-04 | global batch size: 256 | lm loss: 1.984323E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.073 | TFLOPs: 20.27 | 31: iteration 84620/ 173500 | consumed samples: 21662720 | consumed tokens: 44365250560 | elapsed time per iteration (s): 0.73 | learning rate: 1.149E-04 | global batch size: 256 | lm loss: 2.002039E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.641 | TFLOPs: 21.15 | 31: iteration 84630/ 173500 | consumed samples: 21665280 | consumed tokens: 44370493440 | elapsed time per iteration (s): 0.76 | learning rate: 1.149E-04 | global batch size: 256 | lm loss: 1.983372E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.025 | TFLOPs: 20.33 | 31: iteration 84640/ 173500 | consumed samples: 21667840 | consumed tokens: 44375736320 | elapsed time per iteration (s): 0.76 | learning rate: 1.149E-04 | global batch size: 256 | lm loss: 1.989208E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.489 | TFLOPs: 20.30 | 31: iteration 84650/ 173500 | consumed samples: 21670400 | consumed tokens: 44380979200 | elapsed time per iteration (s): 0.76 | learning rate: 1.149E-04 | global batch size: 256 | lm loss: 1.989605E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.414 | TFLOPs: 20.41 | 31: iteration 84660/ 173500 | consumed samples: 21672960 | consumed tokens: 44386222080 | elapsed time per iteration (s): 0.79 | learning rate: 1.149E-04 | global batch size: 256 | lm loss: 1.991729E+00 | grad norm: 0.135 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.044 | TFLOPs: 19.72 | 31: iteration 84670/ 173500 | consumed samples: 21675520 | consumed tokens: 44391464960 | elapsed time per iteration (s): 0.81 | learning rate: 1.148E-04 | global batch size: 256 | lm loss: 2.003897E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.603 | TFLOPs: 19.09 | 31: iteration 84680/ 173500 | consumed samples: 21678080 | consumed tokens: 44396707840 | elapsed time per iteration (s): 0.88 | learning rate: 1.148E-04 | global batch size: 256 | lm loss: 1.987632E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.301 | TFLOPs: 17.56 | 31: iteration 84690/ 173500 | consumed samples: 21680640 | consumed tokens: 44401950720 | elapsed time per iteration (s): 0.82 | learning rate: 1.148E-04 | global batch size: 256 | lm loss: 1.995665E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.213 | TFLOPs: 18.95 | 31: iteration 84700/ 173500 | consumed samples: 21683200 | consumed tokens: 44407193600 | elapsed time per iteration (s): 0.84 | learning rate: 1.148E-04 | global batch size: 256 | lm loss: 2.013430E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.113 | TFLOPs: 18.52 | 31: iteration 84710/ 173500 | consumed samples: 21685760 | consumed tokens: 44412436480 | elapsed time per iteration (s): 0.85 | learning rate: 1.148E-04 | global batch size: 256 | lm loss: 2.019397E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.136 | TFLOPs: 18.16 | 31: iteration 84720/ 173500 | consumed samples: 21688320 | consumed tokens: 44417679360 | elapsed time per iteration (s): 0.86 | 
learning rate: 1.148E-04 | global batch size: 256 | lm loss: 1.988955E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.266 | TFLOPs: 17.92 | 31: iteration 84730/ 173500 | consumed samples: 21690880 | consumed tokens: 44422922240 | elapsed time per iteration (s): 0.87 | learning rate: 1.148E-04 | global batch size: 256 | lm loss: 1.990045E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.738 | TFLOPs: 17.71 | 31: iteration 84740/ 173500 | consumed samples: 21693440 | consumed tokens: 44428165120 | elapsed time per iteration (s): 0.86 | learning rate: 1.147E-04 | global batch size: 256 | lm loss: 1.989710E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.560 | TFLOPs: 17.94 | 31: iteration 84750/ 173500 | consumed samples: 21696000 | consumed tokens: 44433408000 | elapsed time per iteration (s): 0.81 | learning rate: 1.147E-04 | global batch size: 256 | lm loss: 2.004454E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.935 | TFLOPs: 19.17 | 31: iteration 84760/ 173500 | consumed samples: 21698560 | consumed tokens: 44438650880 | elapsed time per iteration (s): 0.81 | learning rate: 1.147E-04 | global batch size: 256 | lm loss: 2.005225E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.606 | TFLOPs: 19.15 | 31: iteration 84770/ 173500 | consumed samples: 21701120 | consumed tokens: 44443893760 | elapsed time per iteration (s): 0.82 | learning rate: 1.147E-04 | global batch size: 256 | lm loss: 2.008600E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.650 | TFLOPs: 18.85 | 31: iteration 84780/ 
173500 | consumed samples: 21703680 | consumed tokens: 44449136640 | elapsed time per iteration (s): 0.88 | learning rate: 1.147E-04 | global batch size: 256 | lm loss: 2.012139E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.040 | TFLOPs: 17.61 | 31: iteration 84790/ 173500 | consumed samples: 21706240 | consumed tokens: 44454379520 | elapsed time per iteration (s): 0.84 | learning rate: 1.147E-04 | global batch size: 256 | lm loss: 2.041573E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.262 | TFLOPs: 18.35 | 31: iteration 84800/ 173500 | consumed samples: 21708800 | consumed tokens: 44459622400 | elapsed time per iteration (s): 0.83 | learning rate: 1.146E-04 | global batch size: 256 | lm loss: 1.987811E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.090 | TFLOPs: 18.64 | 31: iteration 84810/ 173500 | consumed samples: 21711360 | consumed tokens: 44464865280 | elapsed time per iteration (s): 0.81 | learning rate: 1.146E-04 | global batch size: 256 | lm loss: 2.002900E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.166 | TFLOPs: 19.07 | 31: iteration 84820/ 173500 | consumed samples: 21713920 | consumed tokens: 44470108160 | elapsed time per iteration (s): 0.81 | learning rate: 1.146E-04 | global batch size: 256 | lm loss: 1.983623E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.414 | TFLOPs: 19.14 | 31: iteration 84830/ 173500 | consumed samples: 21716480 | consumed tokens: 44475351040 | elapsed time per iteration (s): 0.81 | learning rate: 1.146E-04 | global batch size: 256 | lm loss: 2.009497E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 315.137 | TFLOPs: 19.06 | 31: iteration 84840/ 173500 | consumed samples: 21719040 | consumed tokens: 44480593920 | elapsed time per iteration (s): 0.83 | learning rate: 1.146E-04 | global batch size: 256 | lm loss: 1.991275E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.833 | TFLOPs: 18.68 | 31: iteration 84850/ 173500 | consumed samples: 21721600 | consumed tokens: 44485836800 | elapsed time per iteration (s): 0.81 | learning rate: 1.146E-04 | global batch size: 256 | lm loss: 2.006951E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.397 | TFLOPs: 19.08 | 31: iteration 84860/ 173500 | consumed samples: 21724160 | consumed tokens: 44491079680 | elapsed time per iteration (s): 0.79 | learning rate: 1.145E-04 | global batch size: 256 | lm loss: 2.009134E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.574 | TFLOPs: 19.70 | 31: iteration 84870/ 173500 | consumed samples: 21726720 | consumed tokens: 44496322560 | elapsed time per iteration (s): 0.82 | learning rate: 1.145E-04 | global batch size: 256 | lm loss: 1.998133E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.786 | TFLOPs: 18.86 | 31: iteration 84880/ 173500 | consumed samples: 21729280 | consumed tokens: 44501565440 | elapsed time per iteration (s): 0.82 | learning rate: 1.145E-04 | global batch size: 256 | lm loss: 2.006235E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.894 | TFLOPs: 18.93 | 31: iteration 84890/ 173500 | consumed samples: 21731840 | consumed tokens: 44506808320 | elapsed time per iteration (s): 0.87 | learning rate: 
1.145E-04 | global batch size: 256 | lm loss: 1.993576E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.943 | TFLOPs: 17.72 | 31: iteration 84900/ 173500 | consumed samples: 21734400 | consumed tokens: 44512051200 | elapsed time per iteration (s): 0.85 | learning rate: 1.145E-04 | global batch size: 256 | lm loss: 2.005616E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.378 | TFLOPs: 18.23 | 31: iteration 84910/ 173500 | consumed samples: 21736960 | consumed tokens: 44517294080 | elapsed time per iteration (s): 0.95 | learning rate: 1.145E-04 | global batch size: 256 | lm loss: 2.002334E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 268.140 | TFLOPs: 16.22 | 31: iteration 84920/ 173500 | consumed samples: 21739520 | consumed tokens: 44522536960 | elapsed time per iteration (s): 0.86 | learning rate: 1.144E-04 | global batch size: 256 | lm loss: 1.983485E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.217 | TFLOPs: 17.92 | 31: iteration 84930/ 173500 | consumed samples: 21742080 | consumed tokens: 44527779840 | elapsed time per iteration (s): 0.83 | learning rate: 1.144E-04 | global batch size: 256 | lm loss: 2.017876E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.843 | TFLOPs: 18.74 | 31: iteration 84940/ 173500 | consumed samples: 21744640 | consumed tokens: 44533022720 | elapsed time per iteration (s): 0.85 | learning rate: 1.144E-04 | global batch size: 256 | lm loss: 2.013641E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.030 | TFLOPs: 18.15 | 31: iteration 84950/ 173500 | 
consumed samples: 21747200 | consumed tokens: 44538265600 | elapsed time per iteration (s): 0.83 | learning rate: 1.144E-04 | global batch size: 256 | lm loss: 2.019737E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.302 | TFLOPs: 18.71 | 31: iteration 84960/ 173500 | consumed samples: 21749760 | consumed tokens: 44543508480 | elapsed time per iteration (s): 0.81 | learning rate: 1.144E-04 | global batch size: 256 | lm loss: 1.966619E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.853 | TFLOPs: 19.17 | 31: iteration 84970/ 173500 | consumed samples: 21752320 | consumed tokens: 44548751360 | elapsed time per iteration (s): 0.82 | learning rate: 1.144E-04 | global batch size: 256 | lm loss: 2.001445E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.266 | TFLOPs: 18.83 | 31: iteration 84980/ 173500 | consumed samples: 21754880 | consumed tokens: 44553994240 | elapsed time per iteration (s): 0.86 | learning rate: 1.143E-04 | global batch size: 256 | lm loss: 2.014158E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.284 | TFLOPs: 18.05 | 31: iteration 84990/ 173500 | consumed samples: 21757440 | consumed tokens: 44559237120 | elapsed time per iteration (s): 0.80 | learning rate: 1.143E-04 | global batch size: 256 | lm loss: 1.993098E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.725 | TFLOPs: 19.28 | 31: iteration 85000/ 173500 | consumed samples: 21760000 | consumed tokens: 44564480000 | elapsed time per iteration (s): 0.81 | learning rate: 1.143E-04 | global batch size: 256 | lm loss: 2.025140E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 316.960 | TFLOPs: 19.18 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 85000 | lm loss value: 1.952787E+00 | lm loss PPL: 7.048302E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 85000 to checkpoints_1b1long 0: [2022-11-26 13:20:31,829] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step85000 is begin to save! 0: [2022-11-26 13:20:31,844] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_01-model_00-model_states.pt... 0: [2022-11-26 13:20:32,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_01-model_00-model_states.pt. 0: [2022-11-26 13:20:32,071] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_03-model_00-model_states.pt... 0: [2022-11-26 13:20:32,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_03-model_00-model_states.pt. 0: [2022-11-26 13:20:32,150] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_04-model_00-model_states.pt... 0: [2022-11-26 13:20:32,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_04-model_00-model_states.pt. 0: [2022-11-26 13:20:32,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_05-model_00-model_states.pt... 0: [2022-11-26 13:20:32,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_05-model_00-model_states.pt. 0: [2022-11-26 13:20:32,312] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_06-model_00-model_states.pt... 
0: [2022-11-26 13:20:32,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_06-model_00-model_states.pt. 0: [2022-11-26 13:20:32,389] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_07-model_00-model_states.pt... 0: [2022-11-26 13:20:32,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_07-model_00-model_states.pt. 0: [2022-11-26 13:20:32,465] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_08-model_00-model_states.pt... 0: [2022-11-26 13:20:32,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_08-model_00-model_states.pt. 0: [2022-11-26 13:20:32,551] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_09-model_00-model_states.pt... 0: [2022-11-26 13:20:32,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_09-model_00-model_states.pt. 0: [2022-11-26 13:20:32,628] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_10-model_00-model_states.pt... 0: [2022-11-26 13:20:32,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_10-model_00-model_states.pt. 0: [2022-11-26 13:20:32,702] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_11-model_00-model_states.pt... 0: [2022-11-26 13:20:32,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_11-model_00-model_states.pt. 0: [2022-11-26 13:20:32,780] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_12-model_00-model_states.pt... 
0: [2022-11-26 13:20:32,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_12-model_00-model_states.pt. 0: [2022-11-26 13:20:32,855] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_13-model_00-model_states.pt... 0: [2022-11-26 13:20:32,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_13-model_00-model_states.pt. 0: [2022-11-26 13:20:32,930] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_14-model_00-model_states.pt... 0: [2022-11-26 13:20:33,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_14-model_00-model_states.pt. 0: [2022-11-26 13:20:33,006] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_15-model_00-model_states.pt... 0: [2022-11-26 13:20:33,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_15-model_00-model_states.pt. 0: [2022-11-26 13:20:33,081] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_16-model_00-model_states.pt... 0: [2022-11-26 13:20:33,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_16-model_00-model_states.pt. 0: [2022-11-26 13:20:33,156] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_17-model_00-model_states.pt... 0: [2022-11-26 13:20:33,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_17-model_00-model_states.pt. 0: [2022-11-26 13:20:33,229] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_18-model_00-model_states.pt... 
0: [2022-11-26 13:20:33,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_18-model_00-model_states.pt. 0: [2022-11-26 13:20:33,306] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_19-model_00-model_states.pt... 0: [2022-11-26 13:20:33,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_19-model_00-model_states.pt. 0: [2022-11-26 13:20:33,381] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_20-model_00-model_states.pt... 0: [2022-11-26 13:20:33,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_20-model_00-model_states.pt. 0: [2022-11-26 13:20:33,454] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_21-model_00-model_states.pt... 0: [2022-11-26 13:20:33,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_21-model_00-model_states.pt. 0: [2022-11-26 13:20:33,532] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_22-model_00-model_states.pt... 0: [2022-11-26 13:20:33,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_22-model_00-model_states.pt. 0: [2022-11-26 13:20:33,604] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_23-model_00-model_states.pt... 0: [2022-11-26 13:20:33,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_23-model_00-model_states.pt. 0: [2022-11-26 13:20:33,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_24-model_00-model_states.pt... 
0: [2022-11-26 13:20:33,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_24-model_00-model_states.pt. 0: [2022-11-26 13:20:33,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_25-model_00-model_states.pt... 0: [2022-11-26 13:20:33,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_25-model_00-model_states.pt. 0: [2022-11-26 13:20:33,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_26-model_00-model_states.pt... 0: [2022-11-26 13:20:33,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_26-model_00-model_states.pt. 0: [2022-11-26 13:20:33,905] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_27-model_00-model_states.pt... 0: [2022-11-26 13:20:33,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_27-model_00-model_states.pt. 0: [2022-11-26 13:20:33,980] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_28-model_00-model_states.pt... 0: [2022-11-26 13:20:34,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_28-model_00-model_states.pt. 0: [2022-11-26 13:20:34,054] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/layer_30-model_00-model_states.pt... 0: [2022-11-26 13:20:34,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/layer_30-model_00-model_states.pt. 
0: [2022-11-26 13:20:34,057] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step85000/mp_rank_00_model_states.pt 0: [2022-11-26 13:20:34,057] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/mp_rank_00_model_states.pt... 0: [2022-11-26 13:20:34,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/mp_rank_00_model_states.pt. 0: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
7: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
8: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 
3: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 
20: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 
24: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 
22: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 
26: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
6: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
8: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 
16: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
12: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
28: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 
29: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 
21: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 
27: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
1: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
3: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 
23: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 
22: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 
27: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
20: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 23: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 
27: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 23: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 30: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
5: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 23: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:20:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:20:34,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:20:34,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 13:20:34,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 18: [2022-11-26 13:20:34,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
18: [2022-11-26 13:20:34,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 13:20:34,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 29: [2022-11-26 13:20:34,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 13:20:34,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 13:20:34,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 28: [2022-11-26 13:20:34,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 29: [2022-11-26 13:20:34,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 13:20:34,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 13:20:34,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 16: [2022-11-26 13:20:34,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:20:34,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 13:20:34,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
26: [2022-11-26 13:20:34,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 13:20:34,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 13:20:34,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 21: [2022-11-26 13:20:34,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 13:20:34,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 13:20:34,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 1: [2022-11-26 13:20:34,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:20:34,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 26: [2022-11-26 13:20:34,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
1: [2022-11-26 13:20:34,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 13:20:34,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 26: [2022-11-26 13:20:34,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 1: [2022-11-26 13:20:34,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 1: [2022-11-26 13:20:34,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 26: [2022-11-26 13:20:34,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 3: [2022-11-26 13:20:34,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:20:34,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 13:20:34,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 18: [2022-11-26 13:20:34,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:20:34,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 13:20:34,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
3: [2022-11-26 13:20:34,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 21: [2022-11-26 13:20:34,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:20:34,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 21: [2022-11-26 13:20:34,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 3: [2022-11-26 13:20:34,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 21: [2022-11-26 13:20:34,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 26: [2022-11-26 13:20:34,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 13:20:34,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 13:20:34,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 30: [2022-11-26 13:20:34,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:20:34,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:20:34,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
30: [2022-11-26 13:20:34,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 16: [2022-11-26 13:20:34,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 20: [2022-11-26 13:20:34,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 30: [2022-11-26 13:20:34,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 20: [2022-11-26 13:20:34,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 16: [2022-11-26 13:20:34,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 24: [2022-11-26 13:20:34,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:20:34,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:20:34,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:20:34,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 29: [2022-11-26 13:20:34,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
4: [2022-11-26 13:20:34,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 20: [2022-11-26 13:20:34,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 29: [2022-11-26 13:20:34,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 4: [2022-11-26 13:20:34,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 20: [2022-11-26 13:20:34,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 29: [2022-11-26 13:20:34,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 24: [2022-11-26 13:20:34,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 30: [2022-11-26 13:20:34,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 13:20:34,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 13:20:34,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 3: [2022-11-26 13:20:34,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 13:20:34,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 13:20:34,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 10: [2022-11-26 13:20:34,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:20:34,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 13:20:34,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 5: [2022-11-26 13:20:34,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 26: [2022-11-26 13:20:34,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:20:34,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 26: [2022-11-26 13:20:34,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 5: [2022-11-26 13:20:34,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 26: [2022-11-26 13:20:34,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 29: [2022-11-26 13:20:34,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
29: [2022-11-26 13:20:34,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 13:20:34,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 28: [2022-11-26 13:20:34,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 13:20:34,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 28: [2022-11-26 13:20:34,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 13:20:34,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 13:20:34,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 1: [2022-11-26 13:20:34,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:20:34,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 13:20:34,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 30: [2022-11-26 13:20:34,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-26 13:20:34,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 13:20:34,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 12: [2022-11-26 13:20:34,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:20:34,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 13:20:34,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 20: [2022-11-26 13:20:34,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:20:34,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 13:20:34,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 10: [2022-11-26 13:20:34,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:20:34,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 0: [2022-11-26 13:20:34,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:20:34,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
0: [2022-11-26 13:20:34,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 13:20:34,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 12: [2022-11-26 13:20:34,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:20:34,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 13:20:34,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 20: [2022-11-26 13:20:34,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:20:34,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 13:20:34,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 12: [2022-11-26 13:20:34,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 22: [2022-11-26 13:20:34,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 13:20:34,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
12: [2022-11-26 13:20:34,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 22: [2022-11-26 13:20:34,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 12: [2022-11-26 13:20:34,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 22: [2022-11-26 13:20:34,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 13:20:34,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 22: [2022-11-26 13:20:34,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 30: [2022-11-26 13:20:34,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:20:34,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 30: [2022-11-26 13:20:34,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 6: [2022-11-26 13:20:34,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 30: [2022-11-26 13:20:34,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 6: [2022-11-26 13:20:34,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
6: [2022-11-26 13:20:34,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:20:34,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 13:20:34,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 28: [2022-11-26 13:20:34,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:20:34,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:20:34,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 28: [2022-11-26 13:20:34,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 6: [2022-11-26 13:20:34,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 9: [2022-11-26 13:20:34,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:20:34,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 6: [2022-11-26 13:20:34,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 9: [2022-11-26 13:20:34,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
9: [2022-11-26 13:20:34,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 13:20:34,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 28: [2022-11-26 13:20:34,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 4: [2022-11-26 13:20:34,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:20:34,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:20:34,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 10: [2022-11-26 13:20:34,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 4: [2022-11-26 13:20:34,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 10: [2022-11-26 13:20:34,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 18: [2022-11-26 13:20:34,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:20:34,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 13:20:34,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
29: [2022-11-26 13:20:34,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 13:20:34,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 13:20:34,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 10: [2022-11-26 13:20:34,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:20:34,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 18: [2022-11-26 13:20:34,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:20:34,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 18: [2022-11-26 13:20:34,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 13:20:34,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 6: [2022-11-26 13:20:34,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:20:34,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 13:20:34,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 13:20:34,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 13:20:34,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 6: [2022-11-26 13:20:34,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 27: [2022-11-26 13:20:34,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:20:34,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:20:34,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 2: [2022-11-26 13:20:34,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 13:20:34,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 2: [2022-11-26 13:20:34,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:20:34,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:20:34,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
2: [2022-11-26 13:20:34,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 27: [2022-11-26 13:20:34,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 2: [2022-11-26 13:20:34,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 27: [2022-11-26 13:20:34,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 1: [2022-11-26 13:20:34,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:20:34,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 13:20:34,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 0: [2022-11-26 13:20:34,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:20:34,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 13:20:34,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 5: [2022-11-26 13:20:34,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:20:34,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 13:20:34,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 12: [2022-11-26 13:20:34,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:20:34,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 12: [2022-11-26 13:20:34,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 5: [2022-11-26 13:20:34,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 12: [2022-11-26 13:20:34,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 5: [2022-11-26 13:20:34,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 9: [2022-11-26 13:20:34,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:20:34,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 13:20:34,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 0: [2022-11-26 13:20:34,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 13:20:34,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 7: [2022-11-26 13:20:34,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:20:34,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 7: [2022-11-26 13:20:34,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:20:34,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 13:20:34,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:20:34,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:20:34,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 7: [2022-11-26 13:20:34,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 9: [2022-11-26 13:20:34,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 16: [2022-11-26 13:20:34,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 
7: [2022-11-26 13:20:34,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 13:20:34,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 9: [2022-11-26 13:20:34,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 7: [2022-11-26 13:20:34,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 16: [2022-11-26 13:20:34,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 13:20:34,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 5: [2022-11-26 13:20:34,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:20:34,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 13:20:34,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 3: [2022-11-26 13:20:34,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:20:34,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 13:20:34,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
31: [2022-11-26 13:20:34,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 13:20:34,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 13:20:34,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 13:20:34,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 13:20:34,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 13:20:34,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 13:20:34,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 31: [2022-11-26 13:20:34,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 13:20:34,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 27: [2022-11-26 13:20:34,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 31: [2022-11-26 13:20:34,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
31: [2022-11-26 13:20:34,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 13:20:34,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 27: [2022-11-26 13:20:34,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:20:34,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 13:20:34,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 21: [2022-11-26 13:20:34,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:20:34,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 21: [2022-11-26 13:20:34,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 27: [2022-11-26 13:20:34,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 21: [2022-11-26 13:20:34,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 21: [2022-11-26 13:20:34,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:20:34,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 
21: [2022-11-26 13:20:34,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 20: [2022-11-26 13:20:34,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 21: [2022-11-26 13:20:34,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 16: [2022-11-26 13:20:34,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:20:34,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:20:34,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 13:20:34,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 20: [2022-11-26 13:20:34,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 11: [2022-11-26 13:20:34,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:20:34,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 11: [2022-11-26 13:20:34,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 16: [2022-11-26 13:20:34,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
15: [2022-11-26 13:20:34,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:20:34,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 15: [2022-11-26 13:20:34,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 11: [2022-11-26 13:20:34,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:20:34,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:20:34,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 11: [2022-11-26 13:20:34,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 13:20:34,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 11: [2022-11-26 13:20:34,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:20:34,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 13:20:34,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 24: [2022-11-26 13:20:34,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
24: [2022-11-26 13:20:34,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:20:34,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:20:34,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 11: [2022-11-26 13:20:34,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 13:20:34,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 24: [2022-11-26 13:20:34,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 24: [2022-11-26 13:20:34,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 13:20:34,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 13:20:34,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 24: [2022-11-26 13:20:34,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 8: [2022-11-26 13:20:34,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:20:34,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 13:20:34,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 13:20:34,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 0: [2022-11-26 13:20:34,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:20:34,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 8: [2022-11-26 13:20:34,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 18: [2022-11-26 13:20:34,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:20:34,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 0: [2022-11-26 13:20:34,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 8: [2022-11-26 13:20:34,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:20:34,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 8: [2022-11-26 13:20:34,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 0: [2022-11-26 13:20:34,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
8: [2022-11-26 13:20:34,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 23: [2022-11-26 13:20:34,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:20:34,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:20:34,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 13:20:34,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 13:20:34,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 23: [2022-11-26 13:20:34,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 22: [2022-11-26 13:20:34,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 13:20:34,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 13:20:34,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 22: [2022-11-26 13:20:34,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
22: [2022-11-26 13:20:34,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 13:20:34,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 14: [2022-11-26 13:20:34,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:20:34,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:20:34,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:20:34,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:20:34,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 13:20:34,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 13:20:34,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 13:20:34,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 13:20:34,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
14: [2022-11-26 13:20:34,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 4: [2022-11-26 13:20:34,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:20:34,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:20:34,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 14: [2022-11-26 13:20:34,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 4: [2022-11-26 13:20:34,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 13:20:34,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 13:20:34,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 4: [2022-11-26 13:20:34,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 1: [2022-11-26 13:20:34,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:20:34,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 13:20:34,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
2: [2022-11-26 13:20:34,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:20:34,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 13:20:34,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:20:34,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 2: [2022-11-26 13:20:34,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 13:20:34,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 15: [2022-11-26 13:20:34,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:20:34,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 13:20:34,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:20:34,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 13:20:34,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 15: [2022-11-26 13:20:34,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
28: [2022-11-26 13:20:34,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 25: [2022-11-26 13:20:34,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 28: [2022-11-26 13:20:34,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 25: [2022-11-26 13:20:34,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 28: [2022-11-26 13:20:34,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 25: [2022-11-26 13:20:34,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 25: [2022-11-26 13:20:34,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 13:20:34,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 13:20:34,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 7: [2022-11-26 13:20:34,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:20:34,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 13:20:34,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
23: [2022-11-26 13:20:34,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:20:34,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:20:34,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 13:20:34,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:20:34,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 13:20:34,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 23: [2022-11-26 13:20:34,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 23: [2022-11-26 13:20:34,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 13:20:34,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 25: [2022-11-26 13:20:34,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 13:20:34,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 13:20:34,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
28: [2022-11-26 13:20:34,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 13:20:34,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 13:20:34,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 4: [2022-11-26 13:20:34,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:20:34,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 13:20:34,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 19: [2022-11-26 13:20:34,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:20:34,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:20:34,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-26 13:20:34,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 13:20:34,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 13:20:34,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 13:20:34,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 19: [2022-11-26 13:20:34,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 19: [2022-11-26 13:20:34,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 26: [2022-11-26 13:20:34,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 13:20:34,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 13:20:34,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 13: [2022-11-26 13:20:34,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:20:34,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:20:34,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-26 13:20:34,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:20:34,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 13:20:34,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 13:20:34,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 13:20:34,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 13:20:34,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 13: [2022-11-26 13:20:34,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 13: [2022-11-26 13:20:34,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 13: [2022-11-26 13:20:34,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 27: [2022-11-26 13:20:34,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:20:34,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 13:20:34,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
11: [2022-11-26 13:20:34,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:20:34,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 13:20:34,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 3: [2022-11-26 13:20:34,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:20:34,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 13:20:34,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 14: [2022-11-26 13:20:34,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:20:34,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 13:20:34,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 31: [2022-11-26 13:20:34,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 13:20:34,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 13:20:34,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
17: [2022-11-26 13:20:34,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:20:34,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:20:34,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:20:34,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:20:34,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:20:34,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:20:34,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 17: [2022-11-26 13:20:34,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 13:20:34,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 9: [2022-11-26 13:20:34,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 17: [2022-11-26 13:20:34,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
17: [2022-11-26 13:20:34,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 13:20:34,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 13:20:34,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 17: [2022-11-26 13:20:34,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 13:20:34,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 17: [2022-11-26 13:20:34,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 17: [2022-11-26 13:20:34,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 7: [2022-11-26 13:20:34,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:20:34,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 13:20:34,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 5: [2022-11-26 13:20:34,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 13:20:34,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 13:20:34,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 30: [2022-11-26 13:20:34,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 13:20:34,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 13:20:34,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 24: [2022-11-26 13:20:34,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:20:34,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 13:20:34,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 2: [2022-11-26 13:20:34,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:20:34,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
2: [2022-11-26 13:20:34,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 13: [2022-11-26 13:20:34,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 2: [2022-11-26 13:20:34,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 13: [2022-11-26 13:20:34,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 10: [2022-11-26 13:20:34,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:20:34,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 13:20:34,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 19: [2022-11-26 13:20:34,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:20:34,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 13:20:34,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 12: [2022-11-26 13:20:34,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:20:34,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
12: [2022-11-26 13:20:34,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 15: [2022-11-26 13:20:34,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 12: [2022-11-26 13:20:34,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 15: [2022-11-26 13:20:34,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 1: [2022-11-26 13:20:34,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:20:34,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 13:20:34,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 20: [2022-11-26 13:20:34,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:20:34,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 13:20:34,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 22: [2022-11-26 13:20:34,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
22: [2022-11-26 13:20:34,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 13:20:34,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 25: [2022-11-26 13:20:34,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 13:20:34,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 13:20:34,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 21: [2022-11-26 13:20:34,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 13:20:34,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 13:20:34,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 29: [2022-11-26 13:20:34,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 13:20:34,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 13:20:34,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 6: [2022-11-26 13:20:34,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 13:20:34,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 13:20:34,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 0: [2022-11-26 13:20:34,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:20:34,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:20:34,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 13:20:34,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 0: [2022-11-26 13:20:34,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 13:20:34,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 8: [2022-11-26 13:20:34,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:20:34,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 13:20:34,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 23: [2022-11-26 13:20:34,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
23: [2022-11-26 13:20:34,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 13:20:34,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 28: [2022-11-26 13:20:34,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 13:20:34,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 13:20:34,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 18: [2022-11-26 13:20:34,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:20:34,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:20:34,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 13:20:34,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 17: [2022-11-26 13:20:34,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 13:20:34,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 26: [2022-11-26 13:20:34,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
26: [2022-11-26 13:20:34,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 13:20:34,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 7: [2022-11-26 13:20:34,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:20:34,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 13:20:34,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 4: [2022-11-26 13:20:34,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:20:34,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 13:20:34,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 5: [2022-11-26 13:20:34,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:20:34,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 13:20:34,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 11: [2022-11-26 13:20:34,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-26 13:20:34,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 13:20:34,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 14: [2022-11-26 13:20:34,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:20:34,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 13:20:34,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 9: [2022-11-26 13:20:34,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:20:34,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 13:20:34,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 27: [2022-11-26 13:20:34,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:20:34,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 13:20:34,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 3: [2022-11-26 13:20:34,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 13:20:34,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 13:20:34,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 30: [2022-11-26 13:20:34,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 13:20:34,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 13:20:34,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 10: [2022-11-26 13:20:34,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:20:34,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 13:20:34,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 24: [2022-11-26 13:20:34,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:20:34,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 13:20:34,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 31: [2022-11-26 13:20:34,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 
31: [2022-11-26 13:20:34,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 13:20:34,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 13: [2022-11-26 13:20:34,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:20:34,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 13:20:34,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 2: [2022-11-26 13:20:34,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:20:34,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 13:20:34,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 12: [2022-11-26 13:20:34,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:20:34,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 13:20:34,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 21: [2022-11-26 13:20:34,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-26 13:20:34,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 13:20:34,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 1: [2022-11-26 13:20:34,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:20:34,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 13:20:34,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 8: [2022-11-26 13:20:34,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:20:34,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 13:20:34,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 0: [2022-11-26 13:20:34,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:20:34,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:20:34,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 13:20:34,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
15: [2022-11-26 13:20:34,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:20:34,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 13:20:34,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 25: [2022-11-26 13:20:34,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 13:20:34,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 13:20:34,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 29: [2022-11-26 13:20:34,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:20:34,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:20:34,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 29: [2022-11-26 13:20:34,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 13:20:34,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 18: [2022-11-26 13:20:34,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
20: [2022-11-26 13:20:34,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:20:34,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 13:20:34,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 6: [2022-11-26 13:20:34,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:20:34,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 13:20:34,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 22: [2022-11-26 13:20:34,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 13:20:34,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 13:20:34,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 28: [2022-11-26 13:20:34,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:20:34,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
28: [2022-11-26 13:20:34,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 23: [2022-11-26 13:20:34,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 28: [2022-11-26 13:20:34,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 23: [2022-11-26 13:20:34,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 4: [2022-11-26 13:20:34,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:20:34,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 13:20:34,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 26: [2022-11-26 13:20:34,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 13:20:34,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 16: [2022-11-26 13:20:34,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 26: [2022-11-26 13:20:34,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
16: [2022-11-26 13:20:34,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 13:20:34,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 17: [2022-11-26 13:20:34,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:20:34,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 13:20:34,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 0: [2022-11-26 13:20:34,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 13:20:34,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 30: [2022-11-26 13:20:34,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 13:20:34,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 13:20:34,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 3: [2022-11-26 13:20:34,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 13:20:34,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 13:20:34,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 7: [2022-11-26 13:20:34,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:20:34,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 13:20:34,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 5: [2022-11-26 13:20:34,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:20:34,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 13:20:34,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 9: [2022-11-26 13:20:34,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:20:34,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 13:20:34,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 11: [2022-11-26 13:20:34,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-26 13:20:34,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 13:20:34,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 14: [2022-11-26 13:20:34,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:20:34,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 13:20:34,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 27: [2022-11-26 13:20:34,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:20:34,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:20:34,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 24: [2022-11-26 13:20:34,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 27: [2022-11-26 13:20:34,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 24: [2022-11-26 13:20:34,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 13: [2022-11-26 13:20:34,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 13:20:34,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 13:20:34,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 10: [2022-11-26 13:20:34,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:20:34,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 13:20:34,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 2: [2022-11-26 13:20:34,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:20:34,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 13:20:34,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 12: [2022-11-26 13:20:34,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:20:34,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 13:20:34,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 21: [2022-11-26 13:20:34,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-26 13:20:34,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 13:20:34,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 5: [2022-11-26 13:20:34,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:20:34,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 13:20:34,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 20: [2022-11-26 13:20:34,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 31: [2022-11-26 13:20:34,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 13:20:34,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 20: [2022-11-26 13:20:34,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 31: [2022-11-26 13:20:34,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 20: [2022-11-26 13:20:34,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 1: [2022-11-26 13:20:34,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 13:20:34,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 13:20:34,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 14: [2022-11-26 13:20:34,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:20:34,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 13:20:34,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 28: [2022-11-26 13:20:34,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:20:34,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:20:34,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:20:34,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 23: [2022-11-26 13:20:34,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 29: [2022-11-26 13:20:34,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
19: [2022-11-26 13:20:34,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 23: [2022-11-26 13:20:34,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 29: [2022-11-26 13:20:34,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 13:20:34,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 10: [2022-11-26 13:20:34,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:20:34,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:20:34,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 13:20:34,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 18: [2022-11-26 13:20:34,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:20:34,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 13:20:34,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
18: [2022-11-26 13:20:34,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 16: [2022-11-26 13:20:34,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:20:34,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 16: [2022-11-26 13:20:34,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 6: [2022-11-26 13:20:34,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:20:34,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 3: [2022-11-26 13:20:34,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:20:34,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:20:34,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 3: [2022-11-26 13:20:34,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 11: [2022-11-26 13:20:34,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
6: [2022-11-26 13:20:34,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 3: [2022-11-26 13:20:34,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 6: [2022-11-26 13:20:34,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 8: [2022-11-26 13:20:34,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:20:34,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 13:20:34,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 28: [2022-11-26 13:20:34,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 13:20:34,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 22: [2022-11-26 13:20:34,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 13:20:34,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 13:20:34,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 30: [2022-11-26 13:20:34,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 
30: [2022-11-26 13:20:34,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 13:20:34,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 7: [2022-11-26 13:20:34,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:20:34,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 13:20:34,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 12: [2022-11-26 13:20:34,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:20:34,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 4: [2022-11-26 13:20:34,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:20:34,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 4: [2022-11-26 13:20:34,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 13:20:34,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 26: [2022-11-26 13:20:34,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-26 13:20:34,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 13:20:34,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 0: [2022-11-26 13:20:34,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:20:34,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 13:20:34,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 25: [2022-11-26 13:20:34,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 13:20:34,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 13:20:34,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 13: [2022-11-26 13:20:34,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:20:34,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 13:20:34,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 2: [2022-11-26 13:20:34,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
31: [2022-11-26 13:20:34,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:20:34,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 31: [2022-11-26 13:20:34,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 2: [2022-11-26 13:20:34,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 31: [2022-11-26 13:20:34,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 21: [2022-11-26 13:20:34,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 13:20:34,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 13:20:34,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 9: [2022-11-26 13:20:34,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:20:34,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 13:20:34,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 22: [2022-11-26 13:20:34,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-26 13:20:34,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 13:20:34,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 27: [2022-11-26 13:20:34,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:20:34,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 13:20:34,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 8: [2022-11-26 13:20:34,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:20:34,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 13:20:34,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 17: [2022-11-26 13:20:34,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:20:34,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 13:20:34,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 19: [2022-11-26 13:20:34,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-26 13:20:34,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 13:20:34,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 15: [2022-11-26 13:20:34,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:20:34,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 13:20:34,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 8: [2022-11-26 13:20:34,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:20:34,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 13:20:34,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 15: [2022-11-26 13:20:34,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:20:34,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 13:20:34,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 25: [2022-11-26 13:20:34,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
25: [2022-11-26 13:20:34,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 13:20:34,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 25: [2022-11-26 13:20:34,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 13:20:34,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 13:20:34,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 19: [2022-11-26 13:20:34,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:20:34,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 13:20:34,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 15: [2022-11-26 13:20:34,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:20:34,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step85000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 13:20:34,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
0: successfully saved checkpoint at iteration 85000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2592.11 31: iteration 85010/ 173500 | consumed samples: 21762560 | consumed tokens: 44569722880 | elapsed time per iteration (s): 1.04 | learning rate: 1.143E-04 | global batch size: 256 | lm loss: 2.014884E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.001 | TFLOPs: 14.94 | 31: iteration 85020/ 173500 | consumed samples: 21765120 | consumed tokens: 44574965760 | elapsed time per iteration (s): 0.80 | learning rate: 1.143E-04 | global batch size: 256 | lm loss: 2.002166E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.089 | TFLOPs: 19.43 | 31: iteration 85030/ 173500 | consumed samples: 21767680 | consumed tokens: 44580208640 | elapsed time per iteration (s): 0.83 | learning rate: 1.143E-04 | global batch size: 256 | lm loss: 2.014582E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.886 | TFLOPs: 18.75 | 31: iteration 85040/ 173500 | consumed samples: 21770240 | consumed tokens: 44585451520 | elapsed time per iteration (s): 0.87 | learning rate: 1.142E-04 | global batch size: 256 | lm loss: 2.007860E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.170 | TFLOPs: 17.74 | 31: iteration 85050/ 173500 | consumed samples: 21772800 | consumed tokens: 44590694400 | elapsed time per iteration (s): 0.84 | learning rate: 1.142E-04 | global batch size: 256 | lm loss: 1.945273E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.142 | TFLOPs: 18.40 | 31: iteration 85060/ 173500 | consumed samples: 21775360 | consumed tokens: 44595937280 | elapsed time per iteration (s): 0.77 | 
learning rate: 1.142E-04 | global batch size: 256 | lm loss: 1.994296E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.565 | TFLOPs: 20.00 | 31: iteration 85070/ 173500 | consumed samples: 21777920 | consumed tokens: 44601180160 | elapsed time per iteration (s): 0.82 | learning rate: 1.142E-04 | global batch size: 256 | lm loss: 2.006567E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.746 | TFLOPs: 18.80 | 31: iteration 85080/ 173500 | consumed samples: 21780480 | consumed tokens: 44606423040 | elapsed time per iteration (s): 0.80 | learning rate: 1.142E-04 | global batch size: 256 | lm loss: 1.957490E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.101 | TFLOPs: 19.24 | 31: iteration 85090/ 173500 | consumed samples: 21783040 | consumed tokens: 44611665920 | elapsed time per iteration (s): 0.82 | learning rate: 1.142E-04 | global batch size: 256 | lm loss: 2.004367E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.717 | TFLOPs: 18.80 | 31: iteration 85100/ 173500 | consumed samples: 21785600 | consumed tokens: 44616908800 | elapsed time per iteration (s): 1.27 | learning rate: 1.141E-04 | global batch size: 256 | lm loss: 1.996629E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 201.064 | TFLOPs: 12.16 | 31: iteration 85110/ 173500 | consumed samples: 21788160 | consumed tokens: 44622151680 | elapsed time per iteration (s): 0.77 | learning rate: 1.141E-04 | global batch size: 256 | lm loss: 1.983525E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.762 | TFLOPs: 20.07 | 31: iteration 85120/ 
173500 | consumed samples: 21790720 | consumed tokens: 44627394560 | elapsed time per iteration (s): 0.79 | learning rate: 1.141E-04 | global batch size: 256 | lm loss: 1.986697E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.834 | TFLOPs: 19.65 | 31: iteration 85130/ 173500 | consumed samples: 21793280 | consumed tokens: 44632637440 | elapsed time per iteration (s): 0.79 | learning rate: 1.141E-04 | global batch size: 256 | lm loss: 2.003092E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.724 | TFLOPs: 19.71 | 31: iteration 85140/ 173500 | consumed samples: 21795840 | consumed tokens: 44637880320 | elapsed time per iteration (s): 0.73 | learning rate: 1.141E-04 | global batch size: 256 | lm loss: 1.987749E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.103 | TFLOPs: 21.30 | 31: iteration 85150/ 173500 | consumed samples: 21798400 | consumed tokens: 44643123200 | elapsed time per iteration (s): 0.78 | learning rate: 1.141E-04 | global batch size: 256 | lm loss: 2.002914E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.135 | TFLOPs: 19.85 | 31: iteration 85160/ 173500 | consumed samples: 21800960 | consumed tokens: 44648366080 | elapsed time per iteration (s): 0.75 | learning rate: 1.140E-04 | global batch size: 256 | lm loss: 2.014937E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.190 | TFLOPs: 20.76 | 31: iteration 85170/ 173500 | consumed samples: 21803520 | consumed tokens: 44653608960 | elapsed time per iteration (s): 0.82 | learning rate: 1.140E-04 | global batch size: 256 | lm loss: 1.980523E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 312.549 | TFLOPs: 18.91 | 31: iteration 85180/ 173500 | consumed samples: 21806080 | consumed tokens: 44658851840 | elapsed time per iteration (s): 0.81 | learning rate: 1.140E-04 | global batch size: 256 | lm loss: 1.987070E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.064 | TFLOPs: 19.18 | 31: iteration 85190/ 173500 | consumed samples: 21808640 | consumed tokens: 44664094720 | elapsed time per iteration (s): 0.80 | learning rate: 1.140E-04 | global batch size: 256 | lm loss: 1.984425E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.135 | TFLOPs: 19.31 | 31: iteration 85200/ 173500 | consumed samples: 21811200 | consumed tokens: 44669337600 | elapsed time per iteration (s): 0.82 | learning rate: 1.140E-04 | global batch size: 256 | lm loss: 2.020166E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.365 | TFLOPs: 18.96 | 31: iteration 85210/ 173500 | consumed samples: 21813760 | consumed tokens: 44674580480 | elapsed time per iteration (s): 0.80 | learning rate: 1.140E-04 | global batch size: 256 | lm loss: 1.998877E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.994 | TFLOPs: 19.48 | 31: iteration 85220/ 173500 | consumed samples: 21816320 | consumed tokens: 44679823360 | elapsed time per iteration (s): 0.82 | learning rate: 1.139E-04 | global batch size: 256 | lm loss: 2.025240E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.992 | TFLOPs: 18.81 | 31: iteration 85230/ 173500 | consumed samples: 21818880 | consumed tokens: 44685066240 | elapsed time per iteration (s): 0.81 | learning rate: 
1.139E-04 | global batch size: 256 | lm loss: 2.002057E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.259 | TFLOPs: 19.13 | 31: iteration 85240/ 173500 | consumed samples: 21821440 | consumed tokens: 44690309120 | elapsed time per iteration (s): 0.89 | learning rate: 1.139E-04 | global batch size: 256 | lm loss: 2.002521E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.262 | TFLOPs: 17.50 | 31: iteration 85250/ 173500 | consumed samples: 21824000 | consumed tokens: 44695552000 | elapsed time per iteration (s): 0.85 | learning rate: 1.139E-04 | global batch size: 256 | lm loss: 1.990925E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.139 | TFLOPs: 18.22 | 31: iteration 85260/ 173500 | consumed samples: 21826560 | consumed tokens: 44700794880 | elapsed time per iteration (s): 0.82 | learning rate: 1.139E-04 | global batch size: 256 | lm loss: 2.021146E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.450 | TFLOPs: 18.84 | 31: iteration 85270/ 173500 | consumed samples: 21829120 | consumed tokens: 44706037760 | elapsed time per iteration (s): 0.77 | learning rate: 1.139E-04 | global batch size: 256 | lm loss: 1.995011E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.556 | TFLOPs: 20.18 | 31: iteration 85280/ 173500 | consumed samples: 21831680 | consumed tokens: 44711280640 | elapsed time per iteration (s): 0.76 | learning rate: 1.138E-04 | global batch size: 256 | lm loss: 1.985204E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.195 | TFLOPs: 20.34 | 31: iteration 85290/ 173500 | 
consumed samples: 21834240 | consumed tokens: 44716523520 | elapsed time per iteration (s): 0.77 | learning rate: 1.138E-04 | global batch size: 256 | lm loss: 2.004951E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.553 | TFLOPs: 20.18 | 31: iteration 85300/ 173500 | consumed samples: 21836800 | consumed tokens: 44721766400 | elapsed time per iteration (s): 0.77 | learning rate: 1.138E-04 | global batch size: 256 | lm loss: 2.000349E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.896 | TFLOPs: 20.02 | 31: iteration 85310/ 173500 | consumed samples: 21839360 | consumed tokens: 44727009280 | elapsed time per iteration (s): 0.80 | learning rate: 1.138E-04 | global batch size: 256 | lm loss: 1.998229E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.848 | TFLOPs: 19.47 | 31: iteration 85320/ 173500 | consumed samples: 21841920 | consumed tokens: 44732252160 | elapsed time per iteration (s): 0.74 | learning rate: 1.138E-04 | global batch size: 256 | lm loss: 2.006464E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.785 | TFLOPs: 20.98 | 31: iteration 85330/ 173500 | consumed samples: 21844480 | consumed tokens: 44737495040 | elapsed time per iteration (s): 0.78 | learning rate: 1.138E-04 | global batch size: 256 | lm loss: 2.002804E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.685 | TFLOPs: 19.95 | 31: iteration 85340/ 173500 | consumed samples: 21847040 | consumed tokens: 44742737920 | elapsed time per iteration (s): 0.77 | learning rate: 1.137E-04 | global batch size: 256 | lm loss: 2.008203E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 332.550 | TFLOPs: 20.12 | 31: iteration 85350/ 173500 | consumed samples: 21849600 | consumed tokens: 44747980800 | elapsed time per iteration (s): 0.78 | learning rate: 1.137E-04 | global batch size: 256 | lm loss: 2.018660E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.496 | TFLOPs: 19.75 | 31: iteration 85360/ 173500 | consumed samples: 21852160 | consumed tokens: 44753223680 | elapsed time per iteration (s): 0.79 | learning rate: 1.137E-04 | global batch size: 256 | lm loss: 1.966460E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.130 | TFLOPs: 19.61 | 31: iteration 85370/ 173500 | consumed samples: 21854720 | consumed tokens: 44758466560 | elapsed time per iteration (s): 0.73 | learning rate: 1.137E-04 | global batch size: 256 | lm loss: 1.993046E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.081 | TFLOPs: 21.24 | 31: iteration 85380/ 173500 | consumed samples: 21857280 | consumed tokens: 44763709440 | elapsed time per iteration (s): 0.83 | learning rate: 1.137E-04 | global batch size: 256 | lm loss: 2.004738E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.711 | TFLOPs: 18.68 | 31: iteration 85390/ 173500 | consumed samples: 21859840 | consumed tokens: 44768952320 | elapsed time per iteration (s): 0.77 | learning rate: 1.137E-04 | global batch size: 256 | lm loss: 2.015140E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.544 | TFLOPs: 20.12 | 31: iteration 85400/ 173500 | consumed samples: 21862400 | consumed tokens: 44774195200 | elapsed time per iteration (s): 0.80 | learning rate: 1.136E-04 | global batch 
size: 256 | lm loss: 1.987076E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.589 | TFLOPs: 19.27 | 31: iteration 85410/ 173500 | consumed samples: 21864960 | consumed tokens: 44779438080 | elapsed time per iteration (s): 0.82 | learning rate: 1.136E-04 | global batch size: 256 | lm loss: 1.977085E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.604 | TFLOPs: 18.91 | 31: iteration 85420/ 173500 | consumed samples: 21867520 | consumed tokens: 44784680960 | elapsed time per iteration (s): 0.76 | learning rate: 1.136E-04 | global batch size: 256 | lm loss: 1.995250E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.115 | TFLOPs: 20.33 | 31: iteration 85430/ 173500 | consumed samples: 21870080 | consumed tokens: 44789923840 | elapsed time per iteration (s): 0.75 | learning rate: 1.136E-04 | global batch size: 256 | lm loss: 2.003199E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.219 | TFLOPs: 20.64 | 31: iteration 85440/ 173500 | consumed samples: 21872640 | consumed tokens: 44795166720 | elapsed time per iteration (s): 0.78 | learning rate: 1.136E-04 | global batch size: 256 | lm loss: 1.996683E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.494 | TFLOPs: 19.93 | 31: iteration 85450/ 173500 | consumed samples: 21875200 | consumed tokens: 44800409600 | elapsed time per iteration (s): 0.77 | learning rate: 1.136E-04 | global batch size: 256 | lm loss: 1.992738E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.694 | TFLOPs: 20.07 | 31: iteration 85460/ 173500 | consumed samples: 21877760 | 
consumed tokens: 44805652480 | elapsed time per iteration (s): 0.73 | learning rate: 1.136E-04 | global batch size: 256 | lm loss: 2.021440E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.311 | TFLOPs: 21.07 | 31: iteration 85470/ 173500 | consumed samples: 21880320 | consumed tokens: 44810895360 | elapsed time per iteration (s): 0.72 | learning rate: 1.135E-04 | global batch size: 256 | lm loss: 2.013722E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.835 | TFLOPs: 21.59 | 31: iteration 85480/ 173500 | consumed samples: 21882880 | consumed tokens: 44816138240 | elapsed time per iteration (s): 0.75 | learning rate: 1.135E-04 | global batch size: 256 | lm loss: 1.996681E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.509 | TFLOPs: 20.78 | 31: iteration 85490/ 173500 | consumed samples: 21885440 | consumed tokens: 44821381120 | elapsed time per iteration (s): 0.71 | learning rate: 1.135E-04 | global batch size: 256 | lm loss: 2.003146E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 361.069 | TFLOPs: 21.84 | 31: iteration 85500/ 173500 | consumed samples: 21888000 | consumed tokens: 44826624000 | elapsed time per iteration (s): 0.78 | learning rate: 1.135E-04 | global batch size: 256 | lm loss: 1.994946E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.667 | TFLOPs: 19.82 | 31: iteration 85510/ 173500 | consumed samples: 21890560 | consumed tokens: 44831866880 | elapsed time per iteration (s): 0.76 | learning rate: 1.135E-04 | global batch size: 256 | lm loss: 2.008361E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 336.558 | TFLOPs: 20.36 | 31: iteration 85520/ 173500 | consumed samples: 21893120 | consumed tokens: 44837109760 | elapsed time per iteration (s): 0.84 | learning rate: 1.135E-04 | global batch size: 256 | lm loss: 2.029139E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.988 | TFLOPs: 18.51 | 31: iteration 85530/ 173500 | consumed samples: 21895680 | consumed tokens: 44842352640 | elapsed time per iteration (s): 0.75 | learning rate: 1.134E-04 | global batch size: 256 | lm loss: 1.962255E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.881 | TFLOPs: 20.56 | 31: iteration 85540/ 173500 | consumed samples: 21898240 | consumed tokens: 44847595520 | elapsed time per iteration (s): 0.76 | learning rate: 1.134E-04 | global batch size: 256 | lm loss: 2.016128E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.821 | TFLOPs: 20.32 | 31: iteration 85550/ 173500 | consumed samples: 21900800 | consumed tokens: 44852838400 | elapsed time per iteration (s): 0.78 | learning rate: 1.134E-04 | global batch size: 256 | lm loss: 2.024931E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.248 | TFLOPs: 19.80 | 31: iteration 85560/ 173500 | consumed samples: 21903360 | consumed tokens: 44858081280 | elapsed time per iteration (s): 0.77 | learning rate: 1.134E-04 | global batch size: 256 | lm loss: 2.015842E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.374 | TFLOPs: 20.05 | 31: iteration 85570/ 173500 | consumed samples: 21905920 | consumed tokens: 44863324160 | elapsed time per iteration (s): 0.79 | learning rate: 1.134E-04 | global batch size: 256 | lm loss: 
2.009963E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.079 | TFLOPs: 19.55 | 31: iteration 85580/ 173500 | consumed samples: 21908480 | consumed tokens: 44868567040 | elapsed time per iteration (s): 0.78 | learning rate: 1.134E-04 | global batch size: 256 | lm loss: 2.027731E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.892 | TFLOPs: 19.78 | 31: iteration 85590/ 173500 | consumed samples: 21911040 | consumed tokens: 44873809920 | elapsed time per iteration (s): 0.78 | learning rate: 1.133E-04 | global batch size: 256 | lm loss: 2.015356E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.725 | TFLOPs: 19.89 | 31: iteration 85600/ 173500 | consumed samples: 21913600 | consumed tokens: 44879052800 | elapsed time per iteration (s): 0.75 | learning rate: 1.133E-04 | global batch size: 256 | lm loss: 2.050330E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.100 | TFLOPs: 20.64 | 31: iteration 85610/ 173500 | consumed samples: 21916160 | consumed tokens: 44884295680 | elapsed time per iteration (s): 0.77 | learning rate: 1.133E-04 | global batch size: 256 | lm loss: 2.019754E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.801 | TFLOPs: 20.13 | 31: iteration 85620/ 173500 | consumed samples: 21918720 | consumed tokens: 44889538560 | elapsed time per iteration (s): 0.78 | learning rate: 1.133E-04 | global batch size: 256 | lm loss: 1.992887E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.267 | TFLOPs: 19.80 | 31: iteration 85630/ 173500 | consumed samples: 21921280 | consumed tokens: 
44894781440 | elapsed time per iteration (s): 0.79 | learning rate: 1.133E-04 | global batch size: 256 | lm loss: 2.045482E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.763 | TFLOPs: 19.71 | 31: iteration 85640/ 173500 | consumed samples: 21923840 | consumed tokens: 44900024320 | elapsed time per iteration (s): 0.80 | learning rate: 1.133E-04 | global batch size: 256 | lm loss: 2.022421E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.898 | TFLOPs: 19.47 | 31: iteration 85650/ 173500 | consumed samples: 21926400 | consumed tokens: 44905267200 | elapsed time per iteration (s): 0.80 | learning rate: 1.132E-04 | global batch size: 256 | lm loss: 1.984394E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.159 | TFLOPs: 19.43 | 31: iteration 85660/ 173500 | consumed samples: 21928960 | consumed tokens: 44910510080 | elapsed time per iteration (s): 0.80 | learning rate: 1.132E-04 | global batch size: 256 | lm loss: 2.001203E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.006 | TFLOPs: 19.42 | 31: iteration 85670/ 173500 | consumed samples: 21931520 | consumed tokens: 44915752960 | elapsed time per iteration (s): 0.85 | learning rate: 1.132E-04 | global batch size: 256 | lm loss: 1.986700E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.441 | TFLOPs: 18.24 | 31: iteration 85680/ 173500 | consumed samples: 21934080 | consumed tokens: 44920995840 | elapsed time per iteration (s): 0.82 | learning rate: 1.132E-04 | global batch size: 256 | lm loss: 2.001025E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 313.489 | TFLOPs: 18.97 | 31: iteration 85690/ 173500 | consumed samples: 21936640 | consumed tokens: 44926238720 | elapsed time per iteration (s): 0.77 | learning rate: 1.132E-04 | global batch size: 256 | lm loss: 2.022310E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.690 | TFLOPs: 20.01 | 31: iteration 85700/ 173500 | consumed samples: 21939200 | consumed tokens: 44931481600 | elapsed time per iteration (s): 0.81 | learning rate: 1.132E-04 | global batch size: 256 | lm loss: 1.956975E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.443 | TFLOPs: 19.02 | 31: iteration 85710/ 173500 | consumed samples: 21941760 | consumed tokens: 44936724480 | elapsed time per iteration (s): 0.80 | learning rate: 1.131E-04 | global batch size: 256 | lm loss: 1.960561E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.558 | TFLOPs: 19.45 | 31: iteration 85720/ 173500 | consumed samples: 21944320 | consumed tokens: 44941967360 | elapsed time per iteration (s): 0.78 | learning rate: 1.131E-04 | global batch size: 256 | lm loss: 1.997645E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.305 | TFLOPs: 19.80 | 31: iteration 85730/ 173500 | consumed samples: 21946880 | consumed tokens: 44947210240 | elapsed time per iteration (s): 0.76 | learning rate: 1.131E-04 | global batch size: 256 | lm loss: 2.007651E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.938 | TFLOPs: 20.26 | 31: iteration 85740/ 173500 | consumed samples: 21949440 | consumed tokens: 44952453120 | elapsed time per iteration (s): 0.78 | learning rate: 1.131E-04 | global batch size: 256 | lm loss: 2.008653E+00 | grad 
norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.338 | TFLOPs: 19.86 | 31: iteration 85750/ 173500 | consumed samples: 21952000 | consumed tokens: 44957696000 | elapsed time per iteration (s): 0.75 | learning rate: 1.131E-04 | global batch size: 256 | lm loss: 2.011663E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.248 | TFLOPs: 20.52 | 31: iteration 85760/ 173500 | consumed samples: 21954560 | consumed tokens: 44962938880 | elapsed time per iteration (s): 0.77 | learning rate: 1.131E-04 | global batch size: 256 | lm loss: 2.001584E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.451 | TFLOPs: 20.05 | 31: iteration 85770/ 173500 | consumed samples: 21957120 | consumed tokens: 44968181760 | elapsed time per iteration (s): 0.76 | learning rate: 1.130E-04 | global batch size: 256 | lm loss: 1.984805E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.646 | TFLOPs: 20.43 | 31: iteration 85780/ 173500 | consumed samples: 21959680 | consumed tokens: 44973424640 | elapsed time per iteration (s): 0.76 | learning rate: 1.130E-04 | global batch size: 256 | lm loss: 2.017526E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.986 | TFLOPs: 20.27 | 31: iteration 85790/ 173500 | consumed samples: 21962240 | consumed tokens: 44978667520 | elapsed time per iteration (s): 0.76 | learning rate: 1.130E-04 | global batch size: 256 | lm loss: 1.993220E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.305 | TFLOPs: 20.41 | 31: iteration 85800/ 173500 | consumed samples: 21964800 | consumed tokens: 44983910400 | elapsed time 
per iteration (s): 0.86 | learning rate: 1.130E-04 | global batch size: 256 | lm loss: 1.970778E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.608 | TFLOPs: 18.00 | 31: iteration 85810/ 173500 | consumed samples: 21967360 | consumed tokens: 44989153280 | elapsed time per iteration (s): 0.74 | learning rate: 1.130E-04 | global batch size: 256 | lm loss: 2.031826E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.623 | TFLOPs: 20.91 | 31: iteration 85820/ 173500 | consumed samples: 21969920 | consumed tokens: 44994396160 | elapsed time per iteration (s): 0.80 | learning rate: 1.130E-04 | global batch size: 256 | lm loss: 2.006618E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.246 | TFLOPs: 19.25 | 31: iteration 85830/ 173500 | consumed samples: 21972480 | consumed tokens: 44999639040 | elapsed time per iteration (s): 0.79 | learning rate: 1.129E-04 | global batch size: 256 | lm loss: 1.997418E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.800 | TFLOPs: 19.71 | 31: iteration 85840/ 173500 | consumed samples: 21975040 | consumed tokens: 45004881920 | elapsed time per iteration (s): 0.91 | learning rate: 1.129E-04 | global batch size: 256 | lm loss: 2.018173E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.099 | TFLOPs: 17.07 | 31: iteration 85850/ 173500 | consumed samples: 21977600 | consumed tokens: 45010124800 | elapsed time per iteration (s): 0.73 | learning rate: 1.129E-04 | global batch size: 256 | lm loss: 2.014510E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.963 | TFLOPs: 
21.23 | 31: iteration 85860/ 173500 | consumed samples: 21980160 | consumed tokens: 45015367680 | elapsed time per iteration (s): 0.79 | learning rate: 1.129E-04 | global batch size: 256 | lm loss: 2.041650E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.037 | TFLOPs: 19.60 | 31: iteration 85870/ 173500 | consumed samples: 21982720 | consumed tokens: 45020610560 | elapsed time per iteration (s): 0.72 | learning rate: 1.129E-04 | global batch size: 256 | lm loss: 2.024203E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.871 | TFLOPs: 21.65 | 31: iteration 85880/ 173500 | consumed samples: 21985280 | consumed tokens: 45025853440 | elapsed time per iteration (s): 0.74 | learning rate: 1.129E-04 | global batch size: 256 | lm loss: 1.983689E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.398 | TFLOPs: 21.02 | 31: iteration 85890/ 173500 | consumed samples: 21987840 | consumed tokens: 45031096320 | elapsed time per iteration (s): 0.75 | learning rate: 1.128E-04 | global batch size: 256 | lm loss: 1.986840E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.302 | TFLOPs: 20.71 | 31: iteration 85900/ 173500 | consumed samples: 21990400 | consumed tokens: 45036339200 | elapsed time per iteration (s): 0.73 | learning rate: 1.128E-04 | global batch size: 256 | lm loss: 2.009336E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.746 | TFLOPs: 21.10 | 31: iteration 85910/ 173500 | consumed samples: 21992960 | consumed tokens: 45041582080 | elapsed time per iteration (s): 0.74 | learning rate: 1.128E-04 | global batch size: 256 | lm loss: 1.982094E+00 | grad norm: 0.144 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.292 | TFLOPs: 21.07 | 31: iteration 85920/ 173500 | consumed samples: 21995520 | consumed tokens: 45046824960 | elapsed time per iteration (s): 0.79 | learning rate: 1.128E-04 | global batch size: 256 | lm loss: 2.013130E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.391 | TFLOPs: 19.56 | 31: iteration 85930/ 173500 | consumed samples: 21998080 | consumed tokens: 45052067840 | elapsed time per iteration (s): 0.76 | learning rate: 1.128E-04 | global batch size: 256 | lm loss: 2.001261E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.766 | TFLOPs: 20.49 | 31: iteration 85940/ 173500 | consumed samples: 22000640 | consumed tokens: 45057310720 | elapsed time per iteration (s): 0.77 | learning rate: 1.128E-04 | global batch size: 256 | lm loss: 1.984851E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.602 | TFLOPs: 20.18 | 31: iteration 85950/ 173500 | consumed samples: 22003200 | consumed tokens: 45062553600 | elapsed time per iteration (s): 0.70 | learning rate: 1.127E-04 | global batch size: 256 | lm loss: 1.998462E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 364.192 | TFLOPs: 22.03 | 31: iteration 85960/ 173500 | consumed samples: 22005760 | consumed tokens: 45067796480 | elapsed time per iteration (s): 0.74 | learning rate: 1.127E-04 | global batch size: 256 | lm loss: 1.982341E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.244 | TFLOPs: 20.95 | 31: iteration 85970/ 173500 | consumed samples: 22008320 | consumed tokens: 45073039360 | elapsed time per iteration (s): 0.79 | 
learning rate: 1.127E-04 | global batch size: 256 | lm loss: 2.015454E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.294 | TFLOPs: 19.68 | 31: iteration 85980/ 173500 | consumed samples: 22010880 | consumed tokens: 45078282240 | elapsed time per iteration (s): 0.74 | learning rate: 1.127E-04 | global batch size: 256 | lm loss: 1.975452E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.670 | TFLOPs: 20.91 | 31: iteration 85990/ 173500 | consumed samples: 22013440 | consumed tokens: 45083525120 | elapsed time per iteration (s): 0.76 | learning rate: 1.127E-04 | global batch size: 256 | lm loss: 1.957537E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.967 | TFLOPs: 20.39 | 0: [2022-11-26 13:33:40,680] [INFO] [logging.py:68:log_dist] [Rank 0] step=86000, skipped=0, lr=[0.0001126626417003261, 0.0001126626417003261, 0.0001126626417003261], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 86000/ 173500 | consumed samples: 22016000 | consumed tokens: 45088768000 | elapsed time per iteration (s): 0.78 | learning rate: 1.127E-04 | global batch size: 256 | lm loss: 1.978553E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.455 | TFLOPs: 19.87 | 0: steps: 86000 loss: 2.0046 iter time (s): 0.791 samples/sec: 323.643 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 86000 | lm loss value: 1.976412E+00 | lm loss PPL: 7.216806E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 86000 to checkpoints_1b1long 0: [2022-11-26 13:33:41,024] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint 
global_step86000 is begin to save! 0: [2022-11-26 13:33:41,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_01-model_00-model_states.pt... 0: [2022-11-26 13:33:41,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_01-model_00-model_states.pt. 0: [2022-11-26 13:33:41,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_03-model_00-model_states.pt... 0: [2022-11-26 13:33:41,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_03-model_00-model_states.pt. 0: [2022-11-26 13:33:41,330] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_04-model_00-model_states.pt... 0: [2022-11-26 13:33:41,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_04-model_00-model_states.pt. 0: [2022-11-26 13:33:41,409] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_05-model_00-model_states.pt... 0: [2022-11-26 13:33:41,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_05-model_00-model_states.pt. 0: [2022-11-26 13:33:41,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_06-model_00-model_states.pt... 0: [2022-11-26 13:33:41,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_06-model_00-model_states.pt. 0: [2022-11-26 13:33:41,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_07-model_00-model_states.pt... 0: [2022-11-26 13:33:41,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 13:33:41,651] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_08-model_00-model_states.pt... 0: [2022-11-26 13:33:41,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_08-model_00-model_states.pt. 0: [2022-11-26 13:33:41,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_09-model_00-model_states.pt... 0: [2022-11-26 13:33:41,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_09-model_00-model_states.pt. 0: [2022-11-26 13:33:41,804] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_10-model_00-model_states.pt... 0: [2022-11-26 13:33:41,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_10-model_00-model_states.pt. 0: [2022-11-26 13:33:41,880] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_11-model_00-model_states.pt... 0: [2022-11-26 13:33:41,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_11-model_00-model_states.pt. 0: [2022-11-26 13:33:41,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_12-model_00-model_states.pt... 0: [2022-11-26 13:33:42,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_12-model_00-model_states.pt. 0: [2022-11-26 13:33:42,029] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_13-model_00-model_states.pt... 0: [2022-11-26 13:33:42,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 13:33:42,104] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_14-model_00-model_states.pt... 0: [2022-11-26 13:33:42,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_14-model_00-model_states.pt. 0: [2022-11-26 13:33:42,182] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_15-model_00-model_states.pt... 0: [2022-11-26 13:33:42,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_15-model_00-model_states.pt. 0: [2022-11-26 13:33:42,257] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_16-model_00-model_states.pt... 0: [2022-11-26 13:33:42,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_16-model_00-model_states.pt. 0: [2022-11-26 13:33:42,334] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_17-model_00-model_states.pt... 0: [2022-11-26 13:33:42,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_17-model_00-model_states.pt. 0: [2022-11-26 13:33:42,410] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_18-model_00-model_states.pt... 0: [2022-11-26 13:33:42,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_18-model_00-model_states.pt. 0: [2022-11-26 13:33:42,483] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_19-model_00-model_states.pt... 0: [2022-11-26 13:33:42,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 13:33:42,561] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_20-model_00-model_states.pt... 0: [2022-11-26 13:33:42,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_20-model_00-model_states.pt. 0: [2022-11-26 13:33:42,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_21-model_00-model_states.pt... 0: [2022-11-26 13:33:42,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_21-model_00-model_states.pt. 0: [2022-11-26 13:33:42,710] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_22-model_00-model_states.pt... 0: [2022-11-26 13:33:42,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_22-model_00-model_states.pt. 0: [2022-11-26 13:33:42,790] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_23-model_00-model_states.pt... 0: [2022-11-26 13:33:42,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_23-model_00-model_states.pt. 0: [2022-11-26 13:33:42,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_24-model_00-model_states.pt... 0: [2022-11-26 13:33:42,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_24-model_00-model_states.pt. 0: [2022-11-26 13:33:42,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_25-model_00-model_states.pt... 0: [2022-11-26 13:33:43,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 13:33:43,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_26-model_00-model_states.pt... 0: [2022-11-26 13:33:43,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_26-model_00-model_states.pt. 0: [2022-11-26 13:33:43,091] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_27-model_00-model_states.pt... 0: [2022-11-26 13:33:43,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_27-model_00-model_states.pt. 0: [2022-11-26 13:33:43,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_28-model_00-model_states.pt... 0: [2022-11-26 13:33:43,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_28-model_00-model_states.pt. 0: [2022-11-26 13:33:43,242] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/layer_30-model_00-model_states.pt... 0: [2022-11-26 13:33:43,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/layer_30-model_00-model_states.pt. 0: [2022-11-26 13:33:43,247] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step86000/mp_rank_00_model_states.pt 0: [2022-11-26 13:33:43,247] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/mp_rank_00_model_states.pt... 0: [2022-11-26 13:33:43,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/mp_rank_00_model_states.pt. 0: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
0: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 
7: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 
8: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
1: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
2: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
3: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 
20: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 
23: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 
24: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 
31: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 
30: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 
21: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 
26: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 
0: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
7: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 
10: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 
13: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
15: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
11: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 
31: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 
17: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 
27: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 
11: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 
27: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:33:58,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:33:58,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:33:58,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 10: [2022-11-26 13:33:58,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:33:58,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 10: [2022-11-26 13:33:58,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 13:33:58,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 29: [2022-11-26 13:33:58,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 13:33:58,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 13:33:58,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 5: [2022-11-26 13:33:58,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-26 13:33:58,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 13:33:58,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 8: [2022-11-26 13:33:58,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:33:58,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 13:33:58,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 30: [2022-11-26 13:33:58,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 13:33:58,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 13:33:58,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 13: [2022-11-26 13:33:58,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:33:58,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 13:33:58,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 25: [2022-11-26 13:33:58,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
25: [2022-11-26 13:33:58,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 13:33:58,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 21: [2022-11-26 13:33:58,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 13:33:58,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 13:33:58,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 9: [2022-11-26 13:33:58,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:33:58,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 13:33:58,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 4: [2022-11-26 13:33:58,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:33:58,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 13:33:58,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 16: [2022-11-26 13:33:58,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
16: [2022-11-26 13:33:58,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 13:33:58,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 7: [2022-11-26 13:33:58,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:33:58,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 12: [2022-11-26 13:33:58,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:33:58,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 12: [2022-11-26 13:33:58,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 13:33:58,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 26: [2022-11-26 13:33:58,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 13:33:58,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 13:33:58,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 6: [2022-11-26 13:33:58,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 13:33:58,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 13:33:58,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 8: [2022-11-26 13:33:58,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:33:58,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 13:33:58,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 31: [2022-11-26 13:33:58,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 13:33:58,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 13:33:58,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 23: [2022-11-26 13:33:58,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:33:58,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 14: [2022-11-26 13:33:58,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:33:58,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
14: [2022-11-26 13:33:58,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 13:33:58,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 3: [2022-11-26 13:33:58,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:33:58,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 13:33:58,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 1: [2022-11-26 13:33:58,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 29: [2022-11-26 13:33:58,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 13:33:58,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 13:33:58,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 1: [2022-11-26 13:33:58,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 13:33:58,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 21: [2022-11-26 13:33:58,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
21: [2022-11-26 13:33:58,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 13:33:58,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 22: [2022-11-26 13:33:58,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 13:33:58,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 13:33:58,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 17: [2022-11-26 13:33:58,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:33:58,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 13:33:58,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 10: [2022-11-26 13:33:58,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:33:58,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 13:33:58,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 15: [2022-11-26 13:33:58,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 13:33:58,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 13:33:58,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 11: [2022-11-26 13:33:58,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:33:58,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 13:33:58,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 11: [2022-11-26 13:33:58,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:33:58,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 13:33:58,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 25: [2022-11-26 13:33:58,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 13:33:58,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 31: [2022-11-26 13:33:58,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 25: [2022-11-26 13:33:58,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
31: [2022-11-26 13:33:58,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 13:33:58,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 16: [2022-11-26 13:33:58,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:33:58,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 13:33:58,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 0: [2022-11-26 13:33:58,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:33:58,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 27: [2022-11-26 13:33:58,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:33:58,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 14: [2022-11-26 13:33:58,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 30: [2022-11-26 13:33:58,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
27: [2022-11-26 13:33:58,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 14: [2022-11-26 13:33:58,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 27: [2022-11-26 13:33:58,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 14: [2022-11-26 13:33:58,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 2: [2022-11-26 13:33:58,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 30: [2022-11-26 13:33:58,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 2: [2022-11-26 13:33:58,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 30: [2022-11-26 13:33:58,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 2: [2022-11-26 13:33:58,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 26: [2022-11-26 13:33:58,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 13:33:58,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 13:33:58,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
13: [2022-11-26 13:33:58,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:33:58,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:33:58,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 20: [2022-11-26 13:33:58,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 13: [2022-11-26 13:33:58,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 6: [2022-11-26 13:33:58,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:33:58,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 6: [2022-11-26 13:33:58,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 13:33:58,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 30: [2022-11-26 13:33:58,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 13:33:58,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 13:33:58,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
4: [2022-11-26 13:33:58,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:33:58,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:33:58,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 13:33:58,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 7: [2022-11-26 13:33:58,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 13:33:58,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 23: [2022-11-26 13:33:58,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:33:58,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 17: [2022-11-26 13:33:58,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:33:58,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 17: [2022-11-26 13:33:58,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 13:33:58,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
1: [2022-11-26 13:33:58,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:33:58,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 13:33:58,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 12: [2022-11-26 13:33:58,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:33:58,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 13:33:58,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 12: [2022-11-26 13:33:58,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:33:58,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 13:33:58,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 5: [2022-11-26 13:33:58,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:33:58,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 13:33:58,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
2: [2022-11-26 13:33:58,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:33:58,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 13:33:58,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 9: [2022-11-26 13:33:58,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:33:58,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 13:33:58,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 22: [2022-11-26 13:33:58,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 13:33:58,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 13:33:58,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 15: [2022-11-26 13:33:58,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:33:58,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 13:33:58,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
24: [2022-11-26 13:33:58,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:33:58,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 13:33:58,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:33:58,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 24: [2022-11-26 13:33:58,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 13:33:58,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:33:58,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 24: [2022-11-26 13:33:58,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 20: [2022-11-26 13:33:58,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:33:58,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 20: [2022-11-26 13:33:58,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 13:33:58,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
3: [2022-11-26 13:33:58,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:33:58,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 13:33:58,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 21: [2022-11-26 13:33:58,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:33:58,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 21: [2022-11-26 13:33:58,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 13: [2022-11-26 13:33:58,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 21: [2022-11-26 13:33:58,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 13: [2022-11-26 13:33:58,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 27: [2022-11-26 13:33:58,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:33:58,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 13:33:58,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
19: [2022-11-26 13:33:58,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:33:58,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:33:58,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:33:58,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 13:33:58,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 13:33:58,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 13:33:58,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 19: [2022-11-26 13:33:58,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 19: [2022-11-26 13:33:58,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 10: [2022-11-26 13:33:58,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:33:58,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 13:33:58,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
16: [2022-11-26 13:33:58,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:33:58,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 5: [2022-11-26 13:33:58,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:33:58,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 5: [2022-11-26 13:33:58,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 13:33:58,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 14: [2022-11-26 13:33:58,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:33:58,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 13:33:58,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 28: [2022-11-26 13:33:58,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 13:33:58,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
28: [2022-11-26 13:33:58,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 13:33:58,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 13:33:58,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 28: [2022-11-26 13:33:58,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 8: [2022-11-26 13:33:58,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:33:58,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 13:33:58,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 4: [2022-11-26 13:33:58,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:33:58,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 13:33:58,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 1: [2022-11-26 13:33:58,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-26 13:33:58,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 13:33:58,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 31: [2022-11-26 13:33:58,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 13:33:58,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 13:33:58,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 18: [2022-11-26 13:33:58,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:33:58,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 13:33:58,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:33:58,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 18: [2022-11-26 13:33:58,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 13:33:58,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 29: [2022-11-26 13:33:58,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
29: [2022-11-26 13:33:58,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 13:33:58,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 6: [2022-11-26 13:33:58,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 28: [2022-11-26 13:33:58,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:33:58,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 13:33:58,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 28: [2022-11-26 13:33:58,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 13:33:58,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 25: [2022-11-26 13:33:58,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 13:33:58,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 13:33:58,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 23: [2022-11-26 13:33:58,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-26 13:33:58,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 13:33:58,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 0: [2022-11-26 13:33:58,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:33:58,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 13:33:58,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 0: [2022-11-26 13:33:58,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:33:58,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 13:33:58,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 9: [2022-11-26 13:33:58,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:33:58,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 13:33:58,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 3: [2022-11-26 13:33:58,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 13:33:58,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 13:33:58,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 18: [2022-11-26 13:33:58,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:33:58,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 13:33:58,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 26: [2022-11-26 13:33:58,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 13:33:58,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 13:33:58,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 22: [2022-11-26 13:33:58,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 13:33:58,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 13:33:58,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 20: [2022-11-26 13:33:58,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-26 13:33:58,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 13:33:58,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 7: [2022-11-26 13:33:58,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:33:58,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 13:33:58,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 11: [2022-11-26 13:33:58,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:33:58,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 13:33:58,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 15: [2022-11-26 13:33:58,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:33:58,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 13:33:58,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 2: [2022-11-26 13:33:58,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
17: [2022-11-26 13:33:58,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:33:58,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 13:33:58,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 17: [2022-11-26 13:33:58,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 13:33:58,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 27: [2022-11-26 13:33:58,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:33:58,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 13:33:58,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 5: [2022-11-26 13:33:58,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:33:58,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 13:33:58,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 12: [2022-11-26 13:33:58,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-26 13:33:58,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 13:33:58,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 30: [2022-11-26 13:33:58,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 13:33:58,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 13:33:58,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 21: [2022-11-26 13:33:58,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 13:33:58,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 13:33:58,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 10: [2022-11-26 13:33:58,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:33:58,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 13:33:58,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 19: [2022-11-26 13:33:58,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
19: [2022-11-26 13:33:58,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 13:33:58,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 13: [2022-11-26 13:33:58,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:33:58,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 13:33:58,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 4: [2022-11-26 13:33:58,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:33:58,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 13:33:58,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 1: [2022-11-26 13:33:58,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:33:58,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 13:33:58,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 16: [2022-11-26 13:33:58,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
16: [2022-11-26 13:33:58,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 13:33:58,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 28: [2022-11-26 13:33:58,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 13:33:58,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 13:33:58,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 8: [2022-11-26 13:33:58,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:33:58,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 13:33:58,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 29: [2022-11-26 13:33:58,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 13:33:58,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 13:33:58,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 6: [2022-11-26 13:33:58,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 13:33:58,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 13:33:58,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 9: [2022-11-26 13:33:58,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:33:58,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 13:33:58,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 25: [2022-11-26 13:33:58,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 13:33:58,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 13:33:58,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 14: [2022-11-26 13:33:58,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:33:58,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 13:33:58,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 13:33:58,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 13:33:58,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 14: [2022-11-26 13:33:58,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 23: [2022-11-26 13:33:58,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:33:58,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 13:33:58,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 7: [2022-11-26 13:33:58,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:33:58,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 13:33:58,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 31: [2022-11-26 13:33:58,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
31: [2022-11-26 13:33:58,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 13:33:58,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 11: [2022-11-26 13:33:58,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:33:58,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 13:33:58,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 18: [2022-11-26 13:33:58,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:33:58,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 13:33:58,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 22: [2022-11-26 13:33:58,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 26: [2022-11-26 13:33:58,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
26: [2022-11-26 13:33:58,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 22: [2022-11-26 13:33:58,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 26: [2022-11-26 13:33:58,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 22: [2022-11-26 13:33:58,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 3: [2022-11-26 13:33:58,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:33:58,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 13:33:58,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 20: [2022-11-26 13:33:58,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:33:58,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 13:33:58,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 15: [2022-11-26 13:33:58,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 13:33:58,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 13:33:58,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 17: [2022-11-26 13:33:58,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:33:58,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 13:33:58,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 2: [2022-11-26 13:33:58,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:33:58,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 13:33:58,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 27: [2022-11-26 13:33:58,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:33:58,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 13:33:58,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 3: [2022-11-26 13:33:58,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 13:33:58,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 13:33:58,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 27: [2022-11-26 13:33:58,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:33:58,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 13:33:58,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 24: [2022-11-26 13:33:58,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:33:58,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 13:33:58,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 21: [2022-11-26 13:33:58,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 13:33:58,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 5: [2022-11-26 13:33:58,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 21: [2022-11-26 13:33:58,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
5: [2022-11-26 13:33:58,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 13:33:58,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 30: [2022-11-26 13:33:58,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 13:33:58,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 19: [2022-11-26 13:33:58,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:33:58,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 30: [2022-11-26 13:33:58,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 19: [2022-11-26 13:33:58,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 12: [2022-11-26 13:33:58,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 19: [2022-11-26 13:33:58,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 12: [2022-11-26 13:33:58,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 4: [2022-11-26 13:33:58,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 13:33:58,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 13:33:58,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 10: [2022-11-26 13:33:58,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:33:58,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 13:33:58,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 16: [2022-11-26 13:33:58,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:33:58,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 13:33:58,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 0: [2022-11-26 13:33:58,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:33:58,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 13:33:58,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 28: [2022-11-26 13:33:58,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 
28: [2022-11-26 13:33:58,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 13:33:58,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 13: [2022-11-26 13:33:58,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:33:58,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 13:33:58,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 1: [2022-11-26 13:33:58,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:33:58,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 13:33:58,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 6: [2022-11-26 13:33:58,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:33:58,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 13:33:58,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 31: [2022-11-26 13:33:58,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
31: [2022-11-26 13:33:58,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 13:33:58,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 9: [2022-11-26 13:33:58,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:33:58,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 13:33:58,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 29: [2022-11-26 13:33:58,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 13:33:58,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 13:33:58,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 8: [2022-11-26 13:33:58,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:33:58,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 13:33:58,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 25: [2022-11-26 13:33:58,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 
25: [2022-11-26 13:33:58,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 13:33:58,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 26: [2022-11-26 13:33:58,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 13:33:58,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 13:33:58,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 22: [2022-11-26 13:33:58,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 13:33:58,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 13:33:58,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 7: [2022-11-26 13:33:58,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:33:58,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 11: [2022-11-26 13:33:58,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:33:58,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
11: [2022-11-26 13:33:58,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 13:33:58,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 20: [2022-11-26 13:33:58,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:33:58,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 13:33:58,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 15: [2022-11-26 13:33:58,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:33:58,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 13:33:58,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 23: [2022-11-26 13:33:58,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:33:58,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:33:58,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 13:33:58,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
23: [2022-11-26 13:33:58,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 13:33:58,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 17: [2022-11-26 13:33:58,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:33:58,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 13:33:58,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 2: [2022-11-26 13:33:58,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:33:58,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 13:33:58,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 19: [2022-11-26 13:33:58,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:33:58,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 13:33:58,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 30: [2022-11-26 13:33:58,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-26 13:33:58,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 13:33:58,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 24: [2022-11-26 13:33:58,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:33:58,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 13:33:58,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 12: [2022-11-26 13:33:58,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:33:58,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 13:33:58,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 21: [2022-11-26 13:33:58,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 13:33:58,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 13:33:58,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 13: [2022-11-26 13:33:58,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 13:33:58,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 13:33:58,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 4: [2022-11-26 13:33:58,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:33:58,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:33:58,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 5: [2022-11-26 13:33:58,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 13:33:58,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 4: [2022-11-26 13:33:58,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 28: [2022-11-26 13:33:58,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 13:33:58,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 13:33:58,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 18: [2022-11-26 13:33:58,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
18: [2022-11-26 13:33:58,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:33:58,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 13:33:58,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 13:33:58,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 18: [2022-11-26 13:33:58,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 10: [2022-11-26 13:33:58,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:33:58,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 13:33:58,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 16: [2022-11-26 13:33:58,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:33:58,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
16: [2022-11-26 13:33:58,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 13:33:58,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 13:33:58,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 16: [2022-11-26 13:33:58,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 14: [2022-11-26 13:33:58,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:33:58,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 13:33:58,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 29: [2022-11-26 13:33:58,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 13:33:58,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 13:33:58,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 1: [2022-11-26 13:33:58,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 31: [2022-11-26 13:33:58,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-26 13:33:58,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 6: [2022-11-26 13:33:58,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 31: [2022-11-26 13:33:58,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 6: [2022-11-26 13:33:58,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 13:33:58,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 1: [2022-11-26 13:33:58,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 13:33:58,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 8: [2022-11-26 13:33:58,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:33:58,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 13:33:58,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 0: [2022-11-26 13:33:58,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 13:33:58,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 13:33:58,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 9: [2022-11-26 13:33:58,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:33:58,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 13:33:58,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 3: [2022-11-26 13:33:58,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:33:58,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 13:33:58,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 0: [2022-11-26 13:33:58,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:33:58,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:33:58,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 13:33:58,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
22: [2022-11-26 13:33:58,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 13:33:58,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 13:33:58,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 7: [2022-11-26 13:33:58,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:33:58,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 13:33:58,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 25: [2022-11-26 13:33:58,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 13:33:58,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 13:33:58,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 27: [2022-11-26 13:33:58,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:33:58,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 13:33:58,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
30: [2022-11-26 13:33:58,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 13:33:58,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 0: [2022-11-26 13:33:58,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 30: [2022-11-26 13:33:58,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 0: [2022-11-26 13:33:58,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 26: [2022-11-26 13:33:58,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 13:33:58,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 13:33:58,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 7: [2022-11-26 13:33:58,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:33:58,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 13:33:58,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 4: [2022-11-26 13:33:58,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
24: [2022-11-26 13:33:58,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:33:58,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 13:33:58,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 24: [2022-11-26 13:33:58,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 13:33:58,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 11: [2022-11-26 13:33:58,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:33:58,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 13:33:58,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 25: [2022-11-26 13:33:58,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:33:58,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 25: [2022-11-26 13:33:58,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 13:33:58,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
20: [2022-11-26 13:33:58,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 13:33:58,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 22: [2022-11-26 13:33:58,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 13:33:58,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 13:33:58,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 6: [2022-11-26 13:33:58,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:33:58,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 13:33:58,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 1: [2022-11-26 13:33:58,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:33:58,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 13:33:58,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 4: [2022-11-26 13:33:58,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 13:33:58,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 10: [2022-11-26 13:33:58,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:33:58,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 10: [2022-11-26 13:33:58,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 13:33:58,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 17: [2022-11-26 13:33:58,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:33:58,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 13:33:58,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 23: [2022-11-26 13:33:58,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:33:58,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 14: [2022-11-26 13:33:58,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:33:58,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
26: [2022-11-26 13:33:58,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:33:58,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 13:33:58,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 26: [2022-11-26 13:33:58,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 13:33:58,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 22: [2022-11-26 13:33:58,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 30: [2022-11-26 13:33:58,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 22: [2022-11-26 13:33:58,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 30: [2022-11-26 13:33:58,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 22: [2022-11-26 13:33:58,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 30: [2022-11-26 13:33:58,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 23: [2022-11-26 13:33:58,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
29: [2022-11-26 13:33:58,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:33:58,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 29: [2022-11-26 13:33:58,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 23: [2022-11-26 13:33:58,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 29: [2022-11-26 13:33:58,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 2: [2022-11-26 13:33:58,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:33:58,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 13:33:58,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 15: [2022-11-26 13:33:58,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:33:58,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 13:33:58,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 12: [2022-11-26 13:33:58,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 13:33:58,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 13:33:58,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:33:58,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 12: [2022-11-26 13:33:58,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 13:33:58,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 1: [2022-11-26 13:33:58,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:33:58,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 13:33:58,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 31: [2022-11-26 13:33:58,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:33:58,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 25: [2022-11-26 13:33:58,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
31: [2022-11-26 13:33:58,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 0: [2022-11-26 13:33:58,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 6: [2022-11-26 13:33:58,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 25: [2022-11-26 13:33:58,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 0: [2022-11-26 13:33:58,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 13: [2022-11-26 13:33:58,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:33:58,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 31: [2022-11-26 13:33:58,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:33:58,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 31: [2022-11-26 13:33:58,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
13: [2022-11-26 13:33:58,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 13:33:58,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 25: [2022-11-26 13:33:58,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 31: [2022-11-26 13:33:58,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 6: [2022-11-26 13:33:58,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 13: [2022-11-26 13:33:58,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 13: [2022-11-26 13:33:58,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 31: [2022-11-26 13:33:58,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 20: [2022-11-26 13:33:58,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:33:58,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 13:33:58,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 24: [2022-11-26 13:33:58,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
24: [2022-11-26 13:33:58,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 13:33:58,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 24: [2022-11-26 13:33:58,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:33:58,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 13:33:58,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 9: [2022-11-26 13:33:58,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:33:58,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 18: [2022-11-26 13:33:58,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:33:58,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:33:58,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
18: [2022-11-26 13:33:58,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 13:33:58,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 13:33:58,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 18: [2022-11-26 13:33:58,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 16: [2022-11-26 13:33:58,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:33:58,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:33:58,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:33:58,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 13:33:58,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 13:33:58,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 2: [2022-11-26 13:33:58,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
16: [2022-11-26 13:33:58,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 13:33:58,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 19: [2022-11-26 13:33:58,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:33:58,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:33:58,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:33:58,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 7: [2022-11-26 13:33:58,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 19: [2022-11-26 13:33:58,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 19: [2022-11-26 13:33:58,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 13:33:58,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 7: [2022-11-26 13:33:58,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 10: [2022-11-26 13:33:58,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
27: [2022-11-26 13:33:58,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 28: [2022-11-26 13:33:58,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:33:58,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 28: [2022-11-26 13:33:58,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 27: [2022-11-26 13:33:58,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 13:33:58,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 10: [2022-11-26 13:33:58,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 28: [2022-11-26 13:33:58,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 28: [2022-11-26 13:33:58,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 13:33:58,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 21: [2022-11-26 13:33:58,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
27: [2022-11-26 13:33:58,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt.
28: [2022-11-26 13:33:58,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now!
21: [2022-11-26 13:33:58,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt.
27: [2022-11-26 13:33:58,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt
21: [2022-11-26 13:33:58,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt
27: [2022-11-26 13:33:58,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now!
8: [2022-11-26 13:33:58,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt.
21: [2022-11-26 13:33:58,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt
21: [2022-11-26 13:33:58,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now!
8: [2022-11-26 13:33:58,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt
21: [2022-11-26 13:33:58,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now!
8: [2022-11-26 13:33:58,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt.
8: [2022-11-26 13:33:58,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now!
8: [2022-11-26 13:33:58,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt
8: [2022-11-26 13:33:58,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now!
14: [2022-11-26 13:33:58,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt.
14: [2022-11-26 13:33:58,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt
14: [2022-11-26 13:33:58,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now!
26: [2022-11-26 13:33:58,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt.
26: [2022-11-26 13:33:58,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt
26: [2022-11-26 13:33:58,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now!
5: [2022-11-26 13:33:58,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt.
5: [2022-11-26 13:33:58,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt
5: [2022-11-26 13:33:58,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now!
11: [2022-11-26 13:33:58,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt.
11: [2022-11-26 13:33:58,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt
11: [2022-11-26 13:33:58,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now!
3: [2022-11-26 13:33:58,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt.
29: [2022-11-26 13:33:58,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt.
3: [2022-11-26 13:33:58,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt
29: [2022-11-26 13:33:58,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt
3: [2022-11-26 13:33:58,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now!
29: [2022-11-26 13:33:58,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now!
3: [2022-11-26 13:33:58,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt.
15: [2022-11-26 13:33:58,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt.
3: [2022-11-26 13:33:58,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt
9: [2022-11-26 13:33:58,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt.
3: [2022-11-26 13:33:58,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now!
15: [2022-11-26 13:33:58,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt
15: [2022-11-26 13:33:58,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now!
9: [2022-11-26 13:33:58,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt
9: [2022-11-26 13:33:58,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now!
5: [2022-11-26 13:33:58,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt.
5: [2022-11-26 13:33:58,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt
5: [2022-11-26 13:33:58,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now!
15: [2022-11-26 13:33:58,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt.
15: [2022-11-26 13:33:58,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt
15: [2022-11-26 13:33:58,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now!
17: [2022-11-26 13:33:58,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt.
17: [2022-11-26 13:33:58,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt
17: [2022-11-26 13:33:58,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now!
17: [2022-11-26 13:33:58,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt.
17: [2022-11-26 13:33:58,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt
17: [2022-11-26 13:33:58,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now!
11: [2022-11-26 13:33:58,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt.
11: [2022-11-26 13:33:58,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step86000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt
11: [2022-11-26 13:33:58,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now!
0: successfully saved checkpoint at iteration 86000 to checkpoints_1b1long
31: time (ms) | save-checkpoint: 17870.01
31: iteration 86010/ 173500 | consumed samples: 22018560 | consumed tokens: 45094010880 | elapsed time per iteration (s): 2.67 | learning rate: 1.126E-04 | global batch size: 256 | lm loss: 1.996120E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 95.706 | TFLOPs: 5.79 |
31: iteration 86020/ 173500 | consumed samples: 22021120 | consumed tokens: 45099253760 | elapsed time per iteration (s): 0.81 | learning rate: 1.126E-04 | global batch size: 256 | lm loss: 1.973911E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.199 | TFLOPs: 19.19 |
31: iteration 86030/ 173500 | consumed samples: 22023680 | consumed tokens: 45104496640 | elapsed time per iteration (s): 0.82 | learning rate: 1.126E-04 | global batch size: 256 | lm loss: 1.983616E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.577 | TFLOPs: 18.97 |
31: iteration 86040/ 173500 | consumed samples: 22026240 | consumed tokens: 45109739520 | elapsed time per iteration (s): 0.79 | learning rate: 1.126E-04 | global batch size: 256 | lm loss: 1.989736E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.112 | TFLOPs: 19.55 |
31: iteration 86050/ 173500 | consumed samples: 22028800 | consumed tokens: 45114982400 | elapsed time per iteration (s): 0.75 | learning rate: 1.126E-04 | global batch size: 256 | lm loss: 1.990019E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.826 | TFLOPs: 20.68 |
31: iteration 86060/ 173500 | consumed samples: 22031360 | consumed tokens: 45120225280 | elapsed time per iteration (s): 0.74 | learning rate: 1.126E-04 | global batch size: 256 | lm loss: 1.985491E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.751 | TFLOPs: 20.80 |
31: iteration 86070/ 173500 | consumed samples: 22033920 | consumed tokens: 45125468160 | elapsed time per iteration (s): 0.74 | learning rate: 1.125E-04 | global batch size: 256 | lm loss: 2.013001E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.529 | TFLOPs: 20.84 |
31: iteration 86080/ 173500 | consumed samples: 22036480 | consumed tokens: 45130711040 | elapsed time per iteration (s): 0.79 | learning rate: 1.125E-04 | global batch size: 256 | lm loss: 1.962934E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.594 | TFLOPs: 19.70 |
31: iteration 86090/ 173500 | consumed samples: 22039040 | consumed tokens: 45135953920 | elapsed time per iteration (s): 0.84 | learning rate: 1.125E-04 | global batch size: 256 | lm loss: 1.972941E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.095 | TFLOPs: 18.46 |
31: iteration 86100/ 173500 | consumed samples: 22041600 | consumed tokens: 45141196800 | elapsed time per iteration (s): 0.84 | learning rate: 1.125E-04 | global batch size: 256 | lm loss: 2.008498E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.312 | TFLOPs: 18.53 |
31: iteration 86110/ 173500 | consumed samples: 22044160 | consumed tokens: 45146439680 | elapsed time per iteration (s): 0.83 | learning rate: 1.125E-04 | global batch size: 256 | lm loss: 2.002323E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.818 | TFLOPs: 18.74 |
31: iteration 86120/ 173500 | consumed samples: 22046720 | consumed tokens: 45151682560 | elapsed time per iteration (s): 0.84 | learning rate: 1.125E-04 | global batch size: 256 | lm loss: 1.981606E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.028 | TFLOPs: 18.51 |
31: iteration 86130/ 173500 | consumed samples: 22049280 | consumed tokens: 45156925440 | elapsed time per iteration (s): 0.78 | learning rate: 1.124E-04 | global batch size: 256 | lm loss: 2.011798E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.234 | TFLOPs: 19.74 |
31: iteration 86140/ 173500 | consumed samples: 22051840 | consumed tokens: 45162168320 | elapsed time per iteration (s): 0.82 | learning rate: 1.124E-04 | global batch size: 256 | lm loss: 2.018271E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.072 | TFLOPs: 18.94 |
31: iteration 86150/ 173500 | consumed samples: 22054400 | consumed tokens: 45167411200 | elapsed time per iteration (s): 0.81 | learning rate: 1.124E-04 | global batch size: 256 | lm loss: 1.979595E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.986 | TFLOPs: 19.24 |
31: iteration 86160/ 173500 | consumed samples: 22056960 | consumed tokens: 45172654080 | elapsed time per iteration (s): 0.82 | learning rate: 1.124E-04 | global batch size: 256 | lm loss: 1.998933E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.537 | TFLOPs: 18.85 |
31: iteration 86170/ 173500 | consumed samples: 22059520 | consumed tokens: 45177896960 | elapsed time per iteration (s): 0.81 | learning rate: 1.124E-04 | global batch size: 256 | lm loss: 1.986492E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.169 | TFLOPs: 19.13 |
31: iteration 86180/ 173500 | consumed samples: 22062080 | consumed tokens: 45183139840 | elapsed time per iteration (s): 0.82 | learning rate: 1.124E-04 | global batch size: 256 | lm loss: 2.003592E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.079 | TFLOPs: 18.82 |
31: iteration 86190/ 173500 | consumed samples: 22064640 | consumed tokens: 45188382720 | elapsed time per iteration (s): 0.82 | learning rate: 1.124E-04 | global batch size: 256 | lm loss: 1.998945E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.277 | TFLOPs: 18.83 |
31: iteration 86200/ 173500 | consumed samples: 22067200 | consumed tokens: 45193625600 | elapsed time per iteration (s): 0.81 | learning rate: 1.123E-04 | global batch size: 256 | lm loss: 1.997808E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.349 | TFLOPs: 19.14 |
31: iteration 86210/ 173500 | consumed samples: 22069760 | consumed tokens: 45198868480 | elapsed time per iteration (s): 0.80 | learning rate: 1.123E-04 | global batch size: 256 | lm loss: 2.000936E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.597 | TFLOPs: 19.40 |
31: iteration 86220/ 173500 | consumed samples: 22072320 | consumed tokens: 45204111360 | elapsed time per iteration (s): 0.79 | learning rate: 1.123E-04 | global batch size: 256 | lm loss: 1.985793E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.198 | TFLOPs: 19.49 |
31: iteration 86230/ 173500 | consumed samples: 22074880 | consumed tokens: 45209354240 | elapsed time per iteration (s): 0.78 | learning rate: 1.123E-04 | global batch size: 256 | lm loss: 1.975560E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.017 | TFLOPs: 19.90 |
31: iteration 86240/ 173500 | consumed samples: 22077440 | consumed tokens: 45214597120 | elapsed time per iteration (s): 0.80 | learning rate: 1.123E-04 | global batch size: 256 | lm loss: 1.976404E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.540 | TFLOPs: 19.39 |
31: iteration 86250/ 173500 | consumed samples: 22080000 | consumed tokens: 45219840000 | elapsed time per iteration (s): 0.78 | learning rate: 1.123E-04 | global batch size: 256 | lm loss: 2.015873E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.471 | TFLOPs: 19.81 |
31: iteration 86260/ 173500 | consumed samples: 22082560 | consumed tokens: 45225082880 | elapsed time per iteration (s): 0.83 | learning rate: 1.122E-04 | global batch size: 256 | lm loss: 1.977183E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.835 | TFLOPs: 18.68 |
31: iteration 86270/ 173500 | consumed samples: 22085120 | consumed tokens: 45230325760 | elapsed time per iteration (s): 0.79 | learning rate: 1.122E-04 | global batch size: 256 | lm loss: 2.009973E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.324 | TFLOPs: 19.56 |
31: iteration 86280/ 173500 | consumed samples: 22087680 | consumed tokens: 45235568640 | elapsed time per iteration (s): 0.80 | learning rate: 1.122E-04 | global batch size: 256 | lm loss: 2.018952E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.563 | TFLOPs: 19.27 |
31: iteration 86290/ 173500 | consumed samples: 22090240 | consumed tokens: 45240811520 | elapsed time per iteration (s): 0.82 | learning rate: 1.122E-04 | global batch size: 256 | lm loss: 2.031513E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.960 | TFLOPs: 18.93 |
31: iteration 86300/ 173500 | consumed samples: 22092800 | consumed tokens: 45246054400 | elapsed time per iteration (s): 0.79 | learning rate: 1.122E-04 | global batch size: 256 | lm loss: 1.966561E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.588 | TFLOPs: 19.58 |
31: iteration 86310/ 173500 | consumed samples: 22095360 | consumed tokens: 45251297280 | elapsed time per iteration (s): 0.79 | learning rate: 1.122E-04 | global batch size: 256 | lm loss: 2.006869E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.529 | TFLOPs: 19.57 |
31: iteration 86320/ 173500 | consumed samples: 22097920 | consumed tokens: 45256540160 | elapsed time per iteration (s): 0.87 | learning rate: 1.121E-04 | global batch size: 256 | lm loss: 1.984033E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.914 | TFLOPs: 17.90 |
31: iteration 86330/ 173500 | consumed samples: 22100480 | consumed tokens: 45261783040 | elapsed time per iteration (s): 0.81 | learning rate: 1.121E-04 | global batch size: 256 | lm loss: 2.009079E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.064 | TFLOPs: 19.06 |
31: iteration 86340/ 173500 | consumed samples: 22103040 | consumed tokens: 45267025920 | elapsed time per iteration (s): 0.80 | learning rate: 1.121E-04 | global batch size: 256 | lm loss: 1.999405E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.204 | TFLOPs: 19.37 |
31: iteration 86350/ 173500 | consumed samples: 22105600 | consumed tokens: 45272268800 | elapsed time per iteration (s): 0.79 | learning rate: 1.121E-04 | global batch size: 256 | lm loss: 1.982957E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.687 | TFLOPs: 19.64 |
31: iteration 86360/ 173500 | consumed samples: 22108160 | consumed tokens: 45277511680 | elapsed time per iteration (s): 0.74 | learning rate: 1.121E-04 | global batch size: 256 | lm loss: 1.994152E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.870 | TFLOPs: 20.80 |
31: iteration 86370/ 173500 | consumed samples: 22110720 | consumed tokens: 45282754560 | elapsed time per iteration (s): 0.80 | learning rate: 1.121E-04 | global batch size: 256 | lm loss: 1.997297E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.480 | TFLOPs: 19.39 |
31: iteration 86380/ 173500 | consumed samples: 22113280 | consumed tokens: 45287997440 | elapsed time per iteration (s): 0.79 | learning rate: 1.120E-04 | global batch size: 256 | lm loss: 1.989865E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.943 | TFLOPs: 19.60 |
31: iteration 86390/ 173500 | consumed samples: 22115840 | consumed tokens: 45293240320 | elapsed time per iteration (s): 0.79 | learning rate: 1.120E-04 | global batch size: 256 | lm loss: 1.984019E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.281 | TFLOPs: 19.68 |
31: iteration 86400/ 173500 | consumed samples: 22118400 | consumed tokens: 45298483200 | elapsed time per iteration (s): 0.73 | learning rate: 1.120E-04 | global batch size: 256 | lm loss: 2.023516E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.364 | TFLOPs: 21.14 |
31: iteration 86410/ 173500 | consumed samples: 22120960 | consumed tokens: 45303726080 | elapsed time per iteration (s): 0.72 | learning rate: 1.120E-04 | global batch size: 256 | lm loss: 2.015259E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.008 | TFLOPs: 21.42 |
31: iteration 86420/ 173500 | consumed samples: 22123520 | consumed tokens: 45308968960 | elapsed time per iteration (s): 0.79 | learning rate: 1.120E-04 | global batch size: 256 | lm loss: 1.995607E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.702 | TFLOPs: 19.70 |
31: iteration 86430/ 173500 | consumed samples: 22126080 | consumed tokens: 45314211840 | elapsed time per iteration (s): 0.73 | learning rate: 1.120E-04 | global batch size: 256 | lm loss: 2.008700E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.407 | TFLOPs: 21.20 |
31: iteration 86440/ 173500 | consumed samples: 22128640 | consumed tokens: 45319454720 | elapsed time per iteration (s): 0.74 | learning rate: 1.119E-04 | global batch size: 256 | lm loss: 2.018317E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.877 | TFLOPs: 20.99 |
31: iteration 86450/ 173500 | consumed samples: 22131200 | consumed tokens: 45324697600 | elapsed time per iteration (s): 1.01 | learning rate: 1.119E-04 | global batch size: 256 | lm loss: 1.954413E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 252.534 | TFLOPs: 15.28 |
31: iteration 86460/ 173500 | consumed samples: 22133760 | consumed tokens: 45329940480 | elapsed time per iteration (s): 0.81 | learning rate: 1.119E-04 | global batch size: 256 | lm loss: 1.973609E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.061 | TFLOPs: 19.06 |
31: iteration 86470/ 173500 | consumed samples: 22136320 | consumed tokens: 45335183360 | elapsed time per iteration (s): 0.80 | learning rate: 1.119E-04 | global batch size: 256 | lm loss: 1.978618E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.197 | TFLOPs: 19.43 |
31: iteration 86480/ 173500 | consumed samples: 22138880 | consumed tokens: 45340426240 | elapsed time per iteration (s): 0.81 | learning rate: 1.119E-04 | global batch size: 256 | lm loss: 2.015122E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.761 | TFLOPs: 19.22 |
31: iteration 86490/ 173500 | consumed samples: 22141440 | consumed tokens: 45345669120 | elapsed time per iteration (s): 0.79 | learning rate: 1.119E-04 | global batch size: 256 | lm loss: 1.999734E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.309 | TFLOPs: 19.68 |
31: iteration 86500/ 173500 | consumed samples: 22144000 | consumed tokens: 45350912000 | elapsed time per iteration (s): 0.78 | learning rate: 1.118E-04 | global batch size: 256 | lm loss: 2.004322E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.264 | TFLOPs: 19.80 |
31: iteration 86510/ 173500 | consumed samples: 22146560 | consumed tokens: 45356154880 | elapsed time per iteration (s): 0.81 | learning rate: 1.118E-04 | global batch size: 256 | lm loss: 1.991339E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.061 | TFLOPs: 19.06 |
31: iteration 86520/ 173500 | consumed samples: 22149120 | consumed tokens: 45361397760 | elapsed time per iteration (s): 0.79 | learning rate: 1.118E-04 | global batch size: 256 | lm loss: 2.008858E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.466 | TFLOPs: 19.51 |
31: iteration 86530/ 173500 | consumed samples: 22151680 | consumed tokens: 45366640640 | elapsed time per iteration (s): 0.82 | learning rate: 1.118E-04 | global batch size: 256 | lm loss: 1.996472E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.341 | TFLOPs: 18.84 |
31: iteration 86540/ 173500 | consumed samples: 22154240 | consumed tokens: 45371883520 | elapsed time per iteration (s): 0.84 | learning rate: 1.118E-04 | global batch size: 256 | lm loss: 1.992084E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.151 | TFLOPs: 18.34 |
31: iteration 86550/ 173500 | consumed samples: 22156800 | consumed tokens: 45377126400 | elapsed time per iteration (s): 0.78 | learning rate: 1.118E-04 | global batch size: 256 | lm loss: 2.004790E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.640 | TFLOPs: 19.82 |
31: iteration 86560/ 173500 | consumed samples: 22159360 | consumed tokens: 45382369280 | elapsed time per iteration (s): 0.76 | learning rate: 1.117E-04 | global batch size: 256 | lm loss: 2.001510E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.167 | TFLOPs: 20.40 |
31: iteration 86570/ 173500 | consumed samples: 22161920 | consumed tokens: 45387612160 | elapsed time per iteration (s): 0.75 | learning rate: 1.117E-04 | global batch size: 256 | lm loss: 1.999408E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.721 | TFLOPs: 20.67 |
31: iteration 86580/ 173500 | consumed samples: 22164480 | consumed tokens: 45392855040 | elapsed time per iteration (s): 0.86 | learning rate: 1.117E-04 | global batch size: 256 | lm loss: 1.976160E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.301 | TFLOPs: 17.99 |
31: iteration 86590/ 173500 | consumed samples: 22167040 | consumed tokens: 45398097920 | elapsed time per iteration (s): 0.81 | learning rate: 1.117E-04 | global batch size: 256 | lm loss: 1.995197E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.636 | TFLOPs: 19.22 |
31: iteration 86600/ 173500 | consumed samples: 22169600 | consumed tokens: 45403340800 | elapsed time per iteration (s): 0.84 | learning rate: 1.117E-04 | global batch size: 256 | lm loss: 2.020018E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.030 | TFLOPs: 18.39 |
31: iteration 86610/ 173500 | consumed samples: 22172160 | consumed tokens: 45408583680 | elapsed time per iteration (s): 0.80 | learning rate: 1.117E-04 | global batch size: 256 | lm loss: 1.987976E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.225 | TFLOPs: 19.25 |
31: iteration 86620/ 173500 | consumed samples: 22174720 | consumed tokens: 45413826560 | elapsed time per iteration (s): 0.80 | learning rate: 1.116E-04 | global batch size: 256 | lm loss: 2.021788E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.376 | TFLOPs: 19.44 |
31: iteration 86630/ 173500 | consumed samples: 22177280 | consumed tokens: 45419069440 | elapsed time per iteration (s): 0.81 | learning rate: 1.116E-04 | global batch size: 256 | lm loss: 1.996451E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.768 | TFLOPs: 19.04 |
31: iteration 86640/ 173500 | consumed samples: 22179840 | consumed tokens: 45424312320 | elapsed time per iteration (s): 0.77 | learning rate: 1.116E-04 | global batch size: 256 | lm loss: 1.984344E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.532 | TFLOPs: 20.12 |
31: iteration 86650/ 173500 | consumed samples: 22182400 | consumed tokens: 45429555200 | elapsed time per iteration (s): 0.80 | learning rate: 1.116E-04 | global batch size: 256 | lm loss: 1.997851E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.381 | TFLOPs: 19.32 |
31: iteration 86660/ 173500 | consumed samples: 22184960 | consumed tokens: 45434798080 | elapsed time per iteration (s): 0.74 | learning rate: 1.116E-04 | global batch size: 256 | lm loss: 2.008927E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.726 | TFLOPs: 20.86 |
31: iteration 86670/ 173500 | consumed samples: 22187520 | consumed tokens: 45440040960 | elapsed time per iteration (s): 0.80 | learning rate: 1.116E-04 | global batch size: 256 | lm loss: 1.999464E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.786 | TFLOPs: 19.29 |
31: iteration 86680/ 173500 | consumed samples: 22190080 | consumed tokens: 45445283840 | elapsed time per iteration (s): 0.74 | learning rate: 1.115E-04 | global batch size: 256 | lm loss: 1.966208E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.857 | TFLOPs: 21.04 |
31: iteration 86690/ 173500 | consumed samples: 22192640 | consumed tokens: 45450526720 | elapsed time per iteration (s): 0.77 | learning rate: 1.115E-04 | global batch size: 256 | lm loss: 2.008627E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.141 | TFLOPs: 20.03 |
31: iteration 86700/ 173500 | consumed samples: 22195200 | consumed tokens: 45455769600 | elapsed time per iteration (s): 0.74 | learning rate: 1.115E-04 | global batch size: 256 | lm loss: 2.021568E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.266 | TFLOPs: 20.95 |
31: iteration 86710/ 173500 | consumed samples: 22197760 | consumed tokens: 45461012480 | elapsed time per iteration (s): 0.77 | learning rate: 1.115E-04 | global batch size: 256 | lm loss: 2.015533E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.521 | TFLOPs: 20.12 |
31: iteration 86720/ 173500 | consumed samples: 22200320 | consumed tokens: 45466255360 | elapsed time per iteration (s): 0.73 | learning rate: 1.115E-04 | global batch size: 256 | lm loss: 2.002864E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.354 | TFLOPs: 21.14 |
31: iteration 86730/ 173500 | consumed samples: 22202880 | consumed tokens: 45471498240 | elapsed time per iteration (s): 0.76 | learning rate: 1.115E-04 | global batch size: 256 | lm loss: 1.996816E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.514 | TFLOPs: 20.30 |
31: iteration 86740/ 173500 | consumed samples: 22205440 | consumed tokens: 45476741120 | elapsed time per iteration (s): 0.75 | learning rate: 1.114E-04 | global batch size: 256 | lm loss: 2.004932E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.850 | TFLOPs: 20.74 |
31: iteration 86750/ 173500 | consumed samples: 22208000 | consumed tokens: 45481984000 | elapsed time per iteration (s): 0.77 | learning rate: 1.114E-04 | global batch size: 256 | lm loss: 1.963693E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.993 | TFLOPs: 20.15 |
31: iteration 86760/ 173500 | consumed samples: 22210560 | consumed tokens: 45487226880 | elapsed time per iteration (s): 0.80 | learning rate: 1.114E-04 | global batch size: 256 | lm loss: 2.001807E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.345 | TFLOPs: 19.44 |
31: iteration 86770/ 173500 | consumed samples: 22213120 | consumed tokens: 45492469760 | elapsed time per iteration (s): 0.79 | learning rate: 1.114E-04 | global batch size: 256 | lm loss: 1.989697E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.895 | TFLOPs: 19.59 |
31: iteration 86780/ 173500 | consumed samples: 22215680 | consumed tokens: 45497712640 | elapsed time per iteration (s): 0.77 | learning rate: 1.114E-04 | global batch size: 256 | lm loss: 2.031473E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.352 | TFLOPs: 20.11 |
31: iteration 86790/ 173500 | consumed samples: 22218240 | consumed tokens: 45502955520 | elapsed time per iteration (s): 0.77 | learning rate: 1.114E-04 | global batch size: 256 | lm loss: 2.005848E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.113 | TFLOPs: 20.15 |
31: iteration 86800/ 173500 | consumed samples: 22220800 | consumed tokens: 45508198400 | elapsed time per iteration (s): 0.82 | learning rate: 1.113E-04 | global batch size: 256 | lm loss: 1.972018E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.318 | TFLOPs: 18.77 |
31: iteration 86810/ 173500 | consumed samples: 22223360 | consumed tokens: 45513441280 | elapsed time per iteration (s): 0.85 | learning rate: 1.113E-04 | global batch size: 256 | lm loss: 2.009033E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.272 | TFLOPs: 18.29 |
31: iteration 86820/ 173500 | consumed samples: 22225920 | consumed tokens: 45518684160 | elapsed time per iteration (s): 0.76 | learning rate: 1.113E-04 | global batch size: 256 | lm loss: 2.002926E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.574 | TFLOPs: 20.42 |
31: iteration 86830/ 173500 | consumed samples: 22228480 | consumed tokens: 45523927040 | elapsed time per iteration (s): 0.77 | learning rate: 1.113E-04 | global batch size: 256 | lm loss: 1.994052E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.191 | TFLOPs: 20.16 |
31: iteration 86840/ 173500 | consumed samples: 22231040 | consumed tokens: 45529169920 | elapsed time per iteration (s): 0.76 | learning rate: 1.113E-04 | global batch size: 256 | lm loss: 1.970248E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.007 | TFLOPs: 20.39 |
31: iteration 86850/ 173500 | consumed samples: 22233600 | consumed tokens: 45534412800 | elapsed time per iteration (s): 0.78 | learning rate: 1.113E-04 | global batch size: 256 | lm loss: 1.999811E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.921 | TFLOPs: 19.90 |
31: iteration 86860/ 173500 | consumed samples: 22236160 | consumed tokens: 45539655680 | elapsed time per iteration (s): 0.76 | learning rate: 1.112E-04 | global batch size: 256 | lm loss: 1.993809E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.929 | TFLOPs: 20.38 |
31: iteration 86870/ 173500 | consumed samples: 22238720 | consumed tokens: 45544898560 | elapsed time per iteration (s): 0.74 | learning rate: 1.112E-04 | global batch size: 256 | lm loss: 1.996493E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.311 | TFLOPs: 20.89 |
31: iteration 86880/ 173500 | consumed samples: 22241280 | consumed tokens: 45550141440 | elapsed time per iteration (s): 0.77 | learning rate: 1.112E-04 | global batch size: 256 | lm loss: 1.970326E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.039 | TFLOPs: 20.21 |
31: iteration 86890/ 173500 | consumed samples: 22243840 | consumed tokens: 45555384320 | elapsed time per iteration (s): 0.72 | learning rate: 1.112E-04 | global batch size: 256 | lm loss: 1.995013E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.210 | TFLOPs: 21.55 |
31: iteration 86900/ 173500 | consumed samples: 22246400 | consumed tokens: 45560627200 | elapsed time per iteration (s): 0.75 | learning rate: 1.112E-04 | global batch size: 256 | lm loss: 1.995953E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.003 | TFLOPs: 20.63 |
31: iteration 86910/ 173500 | consumed samples: 22248960 | consumed tokens: 45565870080 | elapsed time per iteration (s): 0.85 | learning rate: 1.112E-04 | global batch size: 256 | lm loss: 2.005362E+00 | grad norm: 0.150 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.074 | TFLOPs: 18.15 | 31: iteration 86920/ 173500 | consumed samples: 22251520 | consumed tokens: 45571112960 | elapsed time per iteration (s): 0.76 | learning rate: 1.111E-04 | global batch size: 256 | lm loss: 1.995783E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.905 | TFLOPs: 20.26 | 31: iteration 86930/ 173500 | consumed samples: 22254080 | consumed tokens: 45576355840 | elapsed time per iteration (s): 0.73 | learning rate: 1.111E-04 | global batch size: 256 | lm loss: 1.990022E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.478 | TFLOPs: 21.08 | 31: iteration 86940/ 173500 | consumed samples: 22256640 | consumed tokens: 45581598720 | elapsed time per iteration (s): 0.76 | learning rate: 1.111E-04 | global batch size: 256 | lm loss: 1.982711E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.431 | TFLOPs: 20.29 | 31: iteration 86950/ 173500 | consumed samples: 22259200 | consumed tokens: 45586841600 | elapsed time per iteration (s): 0.87 | learning rate: 1.111E-04 | global batch size: 256 | lm loss: 1.985602E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.083 | TFLOPs: 17.73 | 31: iteration 86960/ 173500 | consumed samples: 22261760 | consumed tokens: 45592084480 | elapsed time per iteration (s): 0.83 | learning rate: 1.111E-04 | global batch size: 256 | lm loss: 1.972692E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.184 | TFLOPs: 18.64 | 31: iteration 86970/ 173500 | consumed samples: 22264320 | consumed tokens: 45597327360 | elapsed time per iteration (s): 0.75 | 
learning rate: 1.111E-04 | global batch size: 256 | lm loss: 2.014921E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.150 | TFLOPs: 20.70 | 31: iteration 86980/ 173500 | consumed samples: 22266880 | consumed tokens: 45602570240 | elapsed time per iteration (s): 0.76 | learning rate: 1.110E-04 | global batch size: 256 | lm loss: 2.007639E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.207 | TFLOPs: 20.46 | 31: iteration 86990/ 173500 | consumed samples: 22269440 | consumed tokens: 45607813120 | elapsed time per iteration (s): 0.81 | learning rate: 1.110E-04 | global batch size: 256 | lm loss: 2.000238E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.800 | TFLOPs: 19.11 | 31: iteration 87000/ 173500 | consumed samples: 22272000 | consumed tokens: 45613056000 | elapsed time per iteration (s): 0.76 | learning rate: 1.110E-04 | global batch size: 256 | lm loss: 1.993919E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.697 | TFLOPs: 20.25 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 87000 | lm loss value: 1.929147E+00 | lm loss PPL: 6.883635E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 87000 to checkpoints_1b1long 0: [2022-11-26 13:47:10,524] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step87000 is begin to save! 0: [2022-11-26 13:47:10,537] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 13:47:10,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_01-model_00-model_states.pt. 0: [2022-11-26 13:47:10,759] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_03-model_00-model_states.pt... 0: [2022-11-26 13:47:10,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_03-model_00-model_states.pt. 0: [2022-11-26 13:47:10,841] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_04-model_00-model_states.pt... 0: [2022-11-26 13:47:10,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_04-model_00-model_states.pt. 0: [2022-11-26 13:47:10,920] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_05-model_00-model_states.pt... 0: [2022-11-26 13:47:10,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_05-model_00-model_states.pt. 0: [2022-11-26 13:47:11,000] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_06-model_00-model_states.pt... 0: [2022-11-26 13:47:11,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_06-model_00-model_states.pt. 0: [2022-11-26 13:47:11,073] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_07-model_00-model_states.pt... 0: [2022-11-26 13:47:11,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_07-model_00-model_states.pt. 0: [2022-11-26 13:47:11,155] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 13:47:11,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_08-model_00-model_states.pt. 0: [2022-11-26 13:47:11,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_09-model_00-model_states.pt... 0: [2022-11-26 13:47:11,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_09-model_00-model_states.pt. 0: [2022-11-26 13:47:11,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_10-model_00-model_states.pt... 0: [2022-11-26 13:47:11,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_10-model_00-model_states.pt. 0: [2022-11-26 13:47:11,392] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_11-model_00-model_states.pt... 0: [2022-11-26 13:47:11,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_11-model_00-model_states.pt. 0: [2022-11-26 13:47:11,475] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_12-model_00-model_states.pt... 0: [2022-11-26 13:47:11,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_12-model_00-model_states.pt. 0: [2022-11-26 13:47:11,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_13-model_00-model_states.pt... 0: [2022-11-26 13:47:11,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_13-model_00-model_states.pt. 0: [2022-11-26 13:47:11,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 13:47:11,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_14-model_00-model_states.pt. 0: [2022-11-26 13:47:11,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_15-model_00-model_states.pt... 0: [2022-11-26 13:47:11,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_15-model_00-model_states.pt. 0: [2022-11-26 13:47:11,801] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_16-model_00-model_states.pt... 0: [2022-11-26 13:47:11,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_16-model_00-model_states.pt. 0: [2022-11-26 13:47:11,882] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_17-model_00-model_states.pt... 0: [2022-11-26 13:47:11,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_17-model_00-model_states.pt. 0: [2022-11-26 13:47:11,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_18-model_00-model_states.pt... 0: [2022-11-26 13:47:12,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_18-model_00-model_states.pt. 0: [2022-11-26 13:47:12,054] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_19-model_00-model_states.pt... 0: [2022-11-26 13:47:12,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_19-model_00-model_states.pt. 0: [2022-11-26 13:47:12,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 13:47:12,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_20-model_00-model_states.pt. 0: [2022-11-26 13:47:12,214] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_21-model_00-model_states.pt... 0: [2022-11-26 13:47:12,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_21-model_00-model_states.pt. 0: [2022-11-26 13:47:12,292] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_22-model_00-model_states.pt... 0: [2022-11-26 13:47:12,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_22-model_00-model_states.pt. 0: [2022-11-26 13:47:12,370] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_23-model_00-model_states.pt... 0: [2022-11-26 13:47:12,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_23-model_00-model_states.pt. 0: [2022-11-26 13:47:12,451] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_24-model_00-model_states.pt... 0: [2022-11-26 13:47:12,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_24-model_00-model_states.pt. 0: [2022-11-26 13:47:12,527] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_25-model_00-model_states.pt... 0: [2022-11-26 13:47:12,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_25-model_00-model_states.pt. 0: [2022-11-26 13:47:12,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 13:47:12,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_26-model_00-model_states.pt. 0: [2022-11-26 13:47:12,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_27-model_00-model_states.pt... 0: [2022-11-26 13:47:12,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_27-model_00-model_states.pt. 0: [2022-11-26 13:47:12,764] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_28-model_00-model_states.pt... 0: [2022-11-26 13:47:12,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_28-model_00-model_states.pt. 0: [2022-11-26 13:47:12,843] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/layer_30-model_00-model_states.pt... 0: [2022-11-26 13:47:12,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/layer_30-model_00-model_states.pt. 0: [2022-11-26 13:47:12,845] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step87000/mp_rank_00_model_states.pt 0: [2022-11-26 13:47:12,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/mp_rank_00_model_states.pt... 0: [2022-11-26 13:47:12,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/mp_rank_00_model_states.pt. 0: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 
6: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
9: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
16: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
3: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 
23: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 
14: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 
30: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 
26: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
7: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 
10: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
2: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
15: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 
14: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 
17: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 
19: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
4: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 16: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
15: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 
28: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 
30: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 
27: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 
23: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 28: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 18: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 
27: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 
31: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 30: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 26: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 20: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 24: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
5: [2022-11-26 13:47:12,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 29: [2022-11-26 13:47:12,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 13:47:12,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 13:47:12,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 1: [2022-11-26 13:47:12,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:47:12,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 13:47:12,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 28: [2022-11-26 13:47:12,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:47:12,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 13:47:12,982] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 28: [2022-11-26 13:47:12,982] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 27: [2022-11-26 13:47:12,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:47:12,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 28: [2022-11-26 13:47:12,982] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 27: [2022-11-26 13:47:12,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 11: [2022-11-26 13:47:12,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:47:12,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 11: [2022-11-26 13:47:12,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 13:47:12,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 0: [2022-11-26 13:47:12,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:47:12,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 13:47:12,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 13:47:12,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 12: [2022-11-26 13:47:12,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:47:12,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 13:47:12,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 20: [2022-11-26 13:47:12,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:47:12,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 13:47:12,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 6: [2022-11-26 13:47:12,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:47:12,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 13:47:12,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 8: [2022-11-26 13:47:12,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 13:47:12,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 13:47:12,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 25: [2022-11-26 13:47:12,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:47:12,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:47:12,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 13:47:12,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 25: [2022-11-26 13:47:12,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 30: [2022-11-26 13:47:12,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 25: [2022-11-26 13:47:12,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 15: [2022-11-26 13:47:12,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-26 13:47:12,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 30: [2022-11-26 13:47:12,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 15: [2022-11-26 13:47:12,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 30: [2022-11-26 13:47:12,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 31: [2022-11-26 13:47:12,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 13:47:12,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 13:47:12,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 30: [2022-11-26 13:47:12,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 13:47:12,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 13:47:12,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 18: [2022-11-26 13:47:12,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
18: [2022-11-26 13:47:12,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 13:47:12,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 31: [2022-11-26 13:47:12,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:47:12,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 31: [2022-11-26 13:47:12,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 3: [2022-11-26 13:47:12,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 31: [2022-11-26 13:47:12,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 3: [2022-11-26 13:47:12,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 11: [2022-11-26 13:47:12,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:47:12,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 13:47:12,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 24: [2022-11-26 13:47:12,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
24: [2022-11-26 13:47:12,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 7: [2022-11-26 13:47:12,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:47:12,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 8: [2022-11-26 13:47:12,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:47:12,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 8: [2022-11-26 13:47:12,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 7: [2022-11-26 13:47:12,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 8: [2022-11-26 13:47:12,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 12: [2022-11-26 13:47:12,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:47:12,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 13:47:12,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 18: [2022-11-26 13:47:12,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
18: [2022-11-26 13:47:12,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 2: [2022-11-26 13:47:12,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:47:12,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 2: [2022-11-26 13:47:12,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 14: [2022-11-26 13:47:12,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:47:12,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 14: [2022-11-26 13:47:12,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 13:47:12,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 9: [2022-11-26 13:47:12,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:47:12,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 13:47:12,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 14: [2022-11-26 13:47:12,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-26 13:47:12,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 30: [2022-11-26 13:47:12,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:47:12,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 30: [2022-11-26 13:47:12,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 13:47:12,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 15: [2022-11-26 13:47:12,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:47:12,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:47:12,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 24: [2022-11-26 13:47:12,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 15: [2022-11-26 13:47:12,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 19: [2022-11-26 13:47:12,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:47:12,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 
19: [2022-11-26 13:47:12,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 29: [2022-11-26 13:47:12,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:47:12,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:47:12,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:47:12,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 29: [2022-11-26 13:47:12,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 19: [2022-11-26 13:47:12,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 29: [2022-11-26 13:47:12,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 19: [2022-11-26 13:47:12,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 13:47:12,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 19: [2022-11-26 13:47:12,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 3: [2022-11-26 13:47:12,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 13:47:12,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 18: [2022-11-26 13:47:12,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:47:12,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 18: [2022-11-26 13:47:12,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 13:47:12,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 22: [2022-11-26 13:47:12,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 13:47:12,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 13:47:12,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 0: [2022-11-26 13:47:12,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:47:12,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 13:47:12,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 0: [2022-11-26 13:47:12,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 4: [2022-11-26 13:47:12,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 0: [2022-11-26 13:47:12,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 22: [2022-11-26 13:47:12,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 13:47:12,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 13:47:12,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 25: [2022-11-26 13:47:12,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:47:12,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:47:12,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 13:47:12,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 20: [2022-11-26 13:47:12,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
25: [2022-11-26 13:47:12,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 13:47:12,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 20: [2022-11-26 13:47:12,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 7: [2022-11-26 13:47:12,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:47:12,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 7: [2022-11-26 13:47:12,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 13:47:12,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 4: [2022-11-26 13:47:12,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:47:12,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 13:47:12,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 9: [2022-11-26 13:47:12,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-26 13:47:12,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 13:47:12,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 31: [2022-11-26 13:47:12,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:47:12,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 31: [2022-11-26 13:47:12,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 13:47:12,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 17: [2022-11-26 13:47:12,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 13:47:12,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 17: [2022-11-26 13:47:12,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:47:12,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 13:47:12,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 27: [2022-11-26 13:47:12,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
20: [2022-11-26 13:47:12,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:47:12,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 12: [2022-11-26 13:47:12,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:47:12,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 12: [2022-11-26 13:47:12,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 20: [2022-11-26 13:47:12,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 12: [2022-11-26 13:47:12,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 27: [2022-11-26 13:47:12,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 28: [2022-11-26 13:47:12,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 13:47:12,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 13:47:12,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 6: [2022-11-26 13:47:12,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 13:47:12,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 13:47:12,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 21: [2022-11-26 13:47:12,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 13:47:12,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 13:47:12,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 13:47:12,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 19: [2022-11-26 13:47:12,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 21: [2022-11-26 13:47:12,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 21: [2022-11-26 13:47:12,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 19: [2022-11-26 13:47:12,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 13:47:12,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 6: [2022-11-26 13:47:12,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 13:47:12,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 13:47:12,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 2: [2022-11-26 13:47:12,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:47:12,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:47:12,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 13:47:12,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 8: [2022-11-26 13:47:12,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 13:47:12,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 24: [2022-11-26 13:47:12,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:47:12,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 13:47:12,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 7: [2022-11-26 13:47:12,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
25: [2022-11-26 13:47:12,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:47:12,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 13:47:12,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 25: [2022-11-26 13:47:12,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 13:47:12,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 11: [2022-11-26 13:47:12,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:47:12,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 13:47:12,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 3: [2022-11-26 13:47:12,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:47:12,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
3: [2022-11-26 13:47:12,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 10: [2022-11-26 13:47:12,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 3: [2022-11-26 13:47:12,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 10: [2022-11-26 13:47:12,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 29: [2022-11-26 13:47:12,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 28: [2022-11-26 13:47:12,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 29: [2022-11-26 13:47:12,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 13:47:12,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 10: [2022-11-26 13:47:12,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:47:12,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 13:47:12,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 4: [2022-11-26 13:47:12,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 13:47:12,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 13:47:12,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 18: [2022-11-26 13:47:12,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:47:12,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 13:47:12,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 2: [2022-11-26 13:47:12,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:47:12,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 13:47:12,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 9: [2022-11-26 13:47:12,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:47:12,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 8: [2022-11-26 13:47:12,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:47:12,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 
8: [2022-11-26 13:47:12,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 13:47:12,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 31: [2022-11-26 13:47:13,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:47:13,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 31: [2022-11-26 13:47:13,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 13:47:13,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 20: [2022-11-26 13:47:13,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 30: [2022-11-26 13:47:13,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:47:13,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 30: [2022-11-26 13:47:13,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 13:47:13,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 17: [2022-11-26 13:47:13,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 
16: [2022-11-26 13:47:13,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:47:13,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:47:13,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 13:47:13,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 13:47:13,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:47:13,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 16: [2022-11-26 13:47:13,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 16: [2022-11-26 13:47:13,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 13:47:13,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 7: [2022-11-26 13:47:13,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 21: [2022-11-26 13:47:13,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:47:13,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
7: [2022-11-26 13:47:13,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 13:47:13,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 21: [2022-11-26 13:47:13,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 0: [2022-11-26 13:47:13,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 21: [2022-11-26 13:47:13,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 0: [2022-11-26 13:47:13,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 27: [2022-11-26 13:47:13,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:47:13,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 1: [2022-11-26 13:47:13,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:47:13,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 1: [2022-11-26 13:47:13,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 15: [2022-11-26 13:47:13,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
1: [2022-11-26 13:47:13,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 15: [2022-11-26 13:47:13,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 13:47:13,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 11: [2022-11-26 13:47:13,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:47:13,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 13:47:13,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 14: [2022-11-26 13:47:13,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 21: [2022-11-26 13:47:13,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:47:13,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 13:47:13,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 21: [2022-11-26 13:47:13,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 13:47:13,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 
1: [2022-11-26 13:47:13,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:47:13,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 13:47:13,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 12: [2022-11-26 13:47:13,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:47:13,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 13:47:13,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 14: [2022-11-26 13:47:13,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:47:13,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:47:13,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 24: [2022-11-26 13:47:13,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 14: [2022-11-26 13:47:13,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 24: [2022-11-26 13:47:13,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 
17: [2022-11-26 13:47:13,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 13:47:13,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 17: [2022-11-26 13:47:13,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:47:13,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 29: [2022-11-26 13:47:13,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:47:13,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 29: [2022-11-26 13:47:13,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 13:47:13,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 28: [2022-11-26 13:47:12,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 13:47:12,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 28: [2022-11-26 13:47:13,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
28: [2022-11-26 13:47:13,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 13:47:13,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 23: [2022-11-26 13:47:13,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:47:13,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:47:13,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 13:47:13,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:47:13,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 13:47:13,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 23: [2022-11-26 13:47:13,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 5: [2022-11-26 13:47:13,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
23: [2022-11-26 13:47:13,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 5: [2022-11-26 13:47:13,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 23: [2022-11-26 13:47:13,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 5: [2022-11-26 13:47:13,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 5: [2022-11-26 13:47:13,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:47:13,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 13:47:13,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 5: [2022-11-26 13:47:13,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:47:13,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 13:47:13,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 25: [2022-11-26 13:47:13,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 
25: [2022-11-26 13:47:13,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 13:47:13,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 27: [2022-11-26 13:47:13,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:47:13,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 13:47:13,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 13: [2022-11-26 13:47:12,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:47:12,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 6: [2022-11-26 13:47:13,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:47:12,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 6: [2022-11-26 13:47:13,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 13: [2022-11-26 13:47:12,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 13:47:12,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 6: [2022-11-26 13:47:13,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 13: [2022-11-26 13:47:12,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 6: [2022-11-26 13:47:13,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:47:13,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 13:47:13,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 13: [2022-11-26 13:47:13,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:47:13,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:47:13,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 13:47:13,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 13:47:13,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 13: [2022-11-26 13:47:13,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 
31: [2022-11-26 13:47:13,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 13:47:13,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 13:47:13,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 11: [2022-11-26 13:47:13,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:47:13,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 13:47:13,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 10: [2022-11-26 13:47:13,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:47:13,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 13:47:13,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 26: [2022-11-26 13:47:13,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 13:47:13,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 13:47:13,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
26: [2022-11-26 13:47:13,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 13:47:13,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 13:47:13,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 26: [2022-11-26 13:47:13,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 13:47:13,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 26: [2022-11-26 13:47:13,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 22: [2022-11-26 13:47:13,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 13:47:13,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 13:47:13,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 22: [2022-11-26 13:47:13,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 13:47:13,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 13:47:13,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 
9: [2022-11-26 13:47:13,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:47:13,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 13:47:13,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 17: [2022-11-26 13:47:13,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:47:13,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 13:47:13,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 1: [2022-11-26 13:47:13,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:47:13,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:47:13,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 13:47:13,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 13:47:13,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 1: [2022-11-26 13:47:13,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 
0: [2022-11-26 13:47:13,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:47:13,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 13:47:13,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 19: [2022-11-26 13:47:13,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:47:13,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 13:47:13,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 4: [2022-11-26 13:47:13,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:47:13,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 13:47:13,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 0: [2022-11-26 13:47:13,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:47:13,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 13:47:13,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 
13: [2022-11-26 13:47:13,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:47:13,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 13:47:13,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 0: [2022-11-26 13:47:13,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 13:47:13,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 2: [2022-11-26 13:47:13,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:47:13,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 13:47:13,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 30: [2022-11-26 13:47:13,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 13:47:13,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 13:47:13,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 8: [2022-11-26 13:47:13,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 13:47:13,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 13:47:13,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 3: [2022-11-26 13:47:13,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:47:13,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 13:47:13,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 29: [2022-11-26 13:47:13,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 13:47:13,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 13:47:13,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 20: [2022-11-26 13:47:13,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:47:13,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 13:47:13,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 12: [2022-11-26 13:47:13,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 13:47:13,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 13:47:13,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 18: [2022-11-26 13:47:13,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:47:13,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 13:47:13,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 9: [2022-11-26 13:47:13,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:47:13,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 13:47:13,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 15: [2022-11-26 13:47:13,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:47:13,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 13:47:13,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 21: [2022-11-26 13:47:13,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-26 13:47:13,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 13:47:13,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 28: [2022-11-26 13:47:13,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 13:47:13,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 13:47:13,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 7: [2022-11-26 13:47:13,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:47:13,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 13:47:13,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 25: [2022-11-26 13:47:13,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 13:47:13,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 13:47:13,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 14: [2022-11-26 13:47:13,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-26 13:47:13,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 13:47:13,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 16: [2022-11-26 13:47:13,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:47:13,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 13:47:13,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 6: [2022-11-26 13:47:13,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:47:13,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 13:47:13,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 26: [2022-11-26 13:47:13,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 13:47:13,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 13:47:13,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 5: [2022-11-26 13:47:13,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
23: [2022-11-26 13:47:13,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:47:13,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 23: [2022-11-26 13:47:13,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 5: [2022-11-26 13:47:13,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 23: [2022-11-26 13:47:13,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 22: [2022-11-26 13:47:13,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 13:47:13,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 13:47:13,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 27: [2022-11-26 13:47:13,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:47:13,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 13:47:13,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 1: [2022-11-26 13:47:13,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 13:47:13,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 13:47:13,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 10: [2022-11-26 13:47:13,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:47:13,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 13:47:13,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 24: [2022-11-26 13:47:13,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:47:13,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 13:47:13,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 2: [2022-11-26 13:47:13,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:47:13,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 31: [2022-11-26 13:47:13,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:47:13,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 
31: [2022-11-26 13:47:13,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 13:47:13,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 17: [2022-11-26 13:47:13,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:47:13,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 13:47:13,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 19: [2022-11-26 13:47:13,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:47:13,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 13:47:13,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 29: [2022-11-26 13:47:13,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 13:47:13,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 13:47:13,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 4: [2022-11-26 13:47:13,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 13:47:13,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 13:47:13,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 0: [2022-11-26 13:47:13,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:47:13,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:47:13,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 8: [2022-11-26 13:47:13,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 13:47:13,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 0: [2022-11-26 13:47:13,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 12: [2022-11-26 13:47:13,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:47:13,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 13:47:13,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 3: [2022-11-26 13:47:13,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 13:47:13,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 13:47:13,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 30: [2022-11-26 13:47:13,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 13:47:13,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 13:47:13,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 13: [2022-11-26 13:47:13,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:47:13,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 13:47:13,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 20: [2022-11-26 13:47:13,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:47:13,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 21: [2022-11-26 13:47:13,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:47:13,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 
21: [2022-11-26 13:47:13,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 13:47:13,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 11: [2022-11-26 13:47:13,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:47:13,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 13:47:13,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 18: [2022-11-26 13:47:13,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:47:13,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 13:47:13,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 9: [2022-11-26 13:47:13,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:47:13,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 13:47:13,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 27: [2022-11-26 13:47:13,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 
27: [2022-11-26 13:47:13,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 13:47:13,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 28: [2022-11-26 13:47:13,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 13:47:13,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 25: [2022-11-26 13:47:13,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 28: [2022-11-26 13:47:13,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 25: [2022-11-26 13:47:13,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 13:47:13,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 15: [2022-11-26 13:47:13,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:47:13,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 13:47:13,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 15: [2022-11-26 13:47:13,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 14: [2022-11-26 13:47:13,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 15: [2022-11-26 13:47:13,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 7: [2022-11-26 13:47:13,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:47:13,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 13:47:13,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 16: [2022-11-26 13:47:13,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:47:13,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 13:47:13,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 23: [2022-11-26 13:47:13,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-26 13:47:13,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 13:47:13,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 6: [2022-11-26 13:47:13,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:47:13,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 13:47:13,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 5: [2022-11-26 13:47:13,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:47:13,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 13:47:13,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 26: [2022-11-26 13:47:13,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 13:47:13,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 13:47:13,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 19: [2022-11-26 13:47:13,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-26 13:47:13,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 13:47:13,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 22: [2022-11-26 13:47:13,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 13:47:13,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 13:47:13,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 4: [2022-11-26 13:47:13,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:47:13,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 13:47:13,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 10: [2022-11-26 13:47:13,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:47:13,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 13:47:13,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 1: [2022-11-26 13:47:13,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-26 13:47:13,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 31: [2022-11-26 13:47:13,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:47:13,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 31: [2022-11-26 13:47:13,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 13:47:13,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 2: [2022-11-26 13:47:13,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:47:13,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 13:47:13,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 24: [2022-11-26 13:47:13,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:47:13,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 13:47:13,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 29: [2022-11-26 13:47:13,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-26 13:47:13,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 13:47:13,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 3: [2022-11-26 13:47:13,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:47:13,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 13:47:13,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 12: [2022-11-26 13:47:13,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:47:13,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 13:47:13,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 9: [2022-11-26 13:47:13,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:47:13,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 13:47:13,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 17: [2022-11-26 13:47:13,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-26 13:47:13,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 13:47:13,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 0: [2022-11-26 13:47:13,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:47:13,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 13:47:13,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 8: [2022-11-26 13:47:13,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:47:13,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 13:47:13,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 20: [2022-11-26 13:47:13,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 13:47:13,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 13:47:13,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 21: [2022-11-26 13:47:13,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-26 13:47:13,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 13:47:13,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 18: [2022-11-26 13:47:13,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 13:47:13,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 13:47:13,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 13: [2022-11-26 13:47:13,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:47:13,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 13:47:13,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 15: [2022-11-26 13:47:13,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:47:13,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 13:47:13,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 30: [2022-11-26 13:47:13,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-26 13:47:13,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 13:47:13,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 11: [2022-11-26 13:47:13,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:47:13,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 13:47:13,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 28: [2022-11-26 13:47:13,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 25: [2022-11-26 13:47:13,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 13:47:13,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 13:47:13,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 7: [2022-11-26 13:47:13,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:47:13,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 13:47:13,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 
28: [2022-11-26 13:47:13,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 13:47:13,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 24: [2022-11-26 13:47:13,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:47:13,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 13:47:13,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 27: [2022-11-26 13:47:13,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:47:13,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 13:47:13,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 28: [2022-11-26 13:47:13,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 13:47:13,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 13:47:13,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 14: [2022-11-26 13:47:13,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 13:47:13,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 13:47:13,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 0: [2022-11-26 13:47:13,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:47:13,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 13:47:13,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 12: [2022-11-26 13:47:13,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:47:13,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 13:47:13,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 29: [2022-11-26 13:47:13,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 13:47:13,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 13:47:13,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 7: [2022-11-26 13:47:13,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 13:47:13,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 13:47:13,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 22: [2022-11-26 13:47:13,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 13:47:13,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 13:47:13,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 9: [2022-11-26 13:47:13,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:47:13,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:47:13,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:47:13,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 16: [2022-11-26 13:47:13,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 13:47:13,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 9: [2022-11-26 13:47:13,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 
1: [2022-11-26 13:47:13,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 24: [2022-11-26 13:47:13,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 26: [2022-11-26 13:47:13,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:47:13,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 24: [2022-11-26 13:47:13,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 1: [2022-11-26 13:47:13,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 26: [2022-11-26 13:47:13,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 5: [2022-11-26 13:47:13,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:47:13,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 26: [2022-11-26 13:47:13,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 5: [2022-11-26 13:47:13,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 25: [2022-11-26 13:47:13,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
25: [2022-11-26 13:47:13,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 13:47:13,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 21: [2022-11-26 13:47:13,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 13:47:13,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 13:47:13,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 2: [2022-11-26 13:47:13,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:47:13,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:47:13,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 6: [2022-11-26 13:47:13,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 13:47:13,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 2: [2022-11-26 13:47:13,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 23: [2022-11-26 13:47:13,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-26 13:47:13,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 13:47:13,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 15: [2022-11-26 13:47:13,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:47:13,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 13:47:13,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 8: [2022-11-26 13:47:13,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:47:13,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 13:47:13,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 14: [2022-11-26 13:47:13,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:47:13,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 13:47:13,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 13: [2022-11-26 13:47:13,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
19: [2022-11-26 13:47:13,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:47:13,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 18: [2022-11-26 13:47:13,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:47:13,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 30: [2022-11-26 13:47:13,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 19: [2022-11-26 13:47:13,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 18: [2022-11-26 13:47:13,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 19: [2022-11-26 13:47:13,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 30: [2022-11-26 13:47:13,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 13:47:13,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 18: [2022-11-26 13:47:13,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 3: [2022-11-26 13:47:13,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 13:47:13,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 13:47:13,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 4: [2022-11-26 13:47:13,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:47:13,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 20: [2022-11-26 13:47:13,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:47:13,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 20: [2022-11-26 13:47:13,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 13:47:13,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 27: [2022-11-26 13:47:13,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 13:47:13,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 13:47:13,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 31: [2022-11-26 13:47:13,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
31: [2022-11-26 13:47:13,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 13:47:13,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 11: [2022-11-26 13:47:13,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:47:13,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 13:47:13,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 17: [2022-11-26 13:47:13,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 13:47:13,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 13:47:13,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 16: [2022-11-26 13:47:13,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:47:13,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 13:47:13,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 22: [2022-11-26 13:47:13,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
22: [2022-11-26 13:47:13,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 13:47:13,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 23: [2022-11-26 13:47:13,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:47:13,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 13:47:13,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 13:47:13,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 23: [2022-11-26 13:47:13,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 13:47:13,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 16: [2022-11-26 13:47:13,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 13:47:13,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 13:47:13,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 26: [2022-11-26 13:47:13,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
26: [2022-11-26 13:47:13,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 13:47:13,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 10: [2022-11-26 13:47:13,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:47:13,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 13:47:13,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 5: [2022-11-26 13:47:13,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 26: [2022-11-26 13:47:13,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 13:47:13,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 5: [2022-11-26 13:47:13,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 13:47:13,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 26: [2022-11-26 13:47:13,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 5: [2022-11-26 13:47:13,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-26 13:47:13,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 13:47:13,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 10: [2022-11-26 13:47:13,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:47:13,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 13:47:13,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 10: [2022-11-26 13:47:13,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:47:13,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step87000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 13:47:13,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 
0: successfully saved checkpoint at iteration 87000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2675.08 31: iteration 87010/ 173500 | consumed samples: 22274560 | consumed tokens: 45618298880 | elapsed time per iteration (s): 1.04 | learning rate: 1.110E-04 | global batch size: 256 | lm loss: 1.982360E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.367 | TFLOPs: 14.90 | 31: iteration 87020/ 173500 | consumed samples: 22277120 | consumed tokens: 45623541760 | elapsed time per iteration (s): 0.78 | learning rate: 1.110E-04 | global batch size: 256 | lm loss: 1.997730E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.729 | TFLOPs: 19.95 | 31: iteration 87030/ 173500 | consumed samples: 22279680 | consumed tokens: 45628784640 | elapsed time per iteration (s): 0.80 | learning rate: 1.110E-04 | global batch size: 256 | lm loss: 2.007316E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.253 | TFLOPs: 19.31 | 31: iteration 87040/ 173500 | consumed samples: 22282240 | consumed tokens: 45634027520 | elapsed time per iteration (s): 0.81 | learning rate: 1.110E-04 | global batch size: 256 | lm loss: 2.000790E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.875 | TFLOPs: 19.11 | 31: iteration 87050/ 173500 | consumed samples: 22284800 | consumed tokens: 45639270400 | elapsed time per iteration (s): 0.80 | learning rate: 1.109E-04 | global batch size: 256 | lm loss: 2.035655E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.968 | TFLOPs: 19.48 | 31: iteration 87060/ 173500 | consumed samples: 22287360 | consumed tokens: 45644513280 | elapsed time per iteration (s): 0.80 | 
learning rate: 1.109E-04 | global batch size: 256 | lm loss: 1.982562E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.437 | TFLOPs: 19.39 | 31: iteration 87070/ 173500 | consumed samples: 22289920 | consumed tokens: 45649756160 | elapsed time per iteration (s): 0.80 | learning rate: 1.109E-04 | global batch size: 256 | lm loss: 1.980568E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.785 | TFLOPs: 19.29 | 31: iteration 87080/ 173500 | consumed samples: 22292480 | consumed tokens: 45654999040 | elapsed time per iteration (s): 0.82 | learning rate: 1.109E-04 | global batch size: 256 | lm loss: 2.017175E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.089 | TFLOPs: 18.94 | 31: iteration 87090/ 173500 | consumed samples: 22295040 | consumed tokens: 45660241920 | elapsed time per iteration (s): 0.81 | learning rate: 1.109E-04 | global batch size: 256 | lm loss: 1.972020E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.953 | TFLOPs: 19.24 | 31: iteration 87100/ 173500 | consumed samples: 22297600 | consumed tokens: 45665484800 | elapsed time per iteration (s): 0.78 | learning rate: 1.109E-04 | global batch size: 256 | lm loss: 2.008482E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.217 | TFLOPs: 19.74 | 31: iteration 87110/ 173500 | consumed samples: 22300160 | consumed tokens: 45670727680 | elapsed time per iteration (s): 0.81 | learning rate: 1.108E-04 | global batch size: 256 | lm loss: 1.982590E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.195 | TFLOPs: 19.07 | 31: iteration 87120/ 
173500 | consumed samples: 22302720 | consumed tokens: 45675970560 | elapsed time per iteration (s): 0.78 | learning rate: 1.108E-04 | global batch size: 256 | lm loss: 2.001632E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.947 | TFLOPs: 19.84 | 31: iteration 87130/ 173500 | consumed samples: 22305280 | consumed tokens: 45681213440 | elapsed time per iteration (s): 0.76 | learning rate: 1.108E-04 | global batch size: 256 | lm loss: 1.984456E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.326 | TFLOPs: 20.41 | 31: iteration 87140/ 173500 | consumed samples: 22307840 | consumed tokens: 45686456320 | elapsed time per iteration (s): 0.80 | learning rate: 1.108E-04 | global batch size: 256 | lm loss: 1.991332E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.999 | TFLOPs: 19.48 | 31: iteration 87150/ 173500 | consumed samples: 22310400 | consumed tokens: 45691699200 | elapsed time per iteration (s): 0.75 | learning rate: 1.108E-04 | global batch size: 256 | lm loss: 1.986886E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.023 | TFLOPs: 20.57 | 31: iteration 87160/ 173500 | consumed samples: 22312960 | consumed tokens: 45696942080 | elapsed time per iteration (s): 0.78 | learning rate: 1.108E-04 | global batch size: 256 | lm loss: 1.988900E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.191 | TFLOPs: 19.73 | 31: iteration 87170/ 173500 | consumed samples: 22315520 | consumed tokens: 45702184960 | elapsed time per iteration (s): 0.83 | learning rate: 1.107E-04 | global batch size: 256 | lm loss: 1.989644E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 307.318 | TFLOPs: 18.59 | 31: iteration 87180/ 173500 | consumed samples: 22318080 | consumed tokens: 45707427840 | elapsed time per iteration (s): 0.91 | learning rate: 1.107E-04 | global batch size: 256 | lm loss: 2.001270E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 279.815 | TFLOPs: 16.93 | 31: iteration 87190/ 173500 | consumed samples: 22320640 | consumed tokens: 45712670720 | elapsed time per iteration (s): 0.83 | learning rate: 1.107E-04 | global batch size: 256 | lm loss: 2.008844E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.222 | TFLOPs: 18.71 | 31: iteration 87200/ 173500 | consumed samples: 22323200 | consumed tokens: 45717913600 | elapsed time per iteration (s): 0.81 | learning rate: 1.107E-04 | global batch size: 256 | lm loss: 1.974344E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.360 | TFLOPs: 19.20 | 31: iteration 87210/ 173500 | consumed samples: 22325760 | consumed tokens: 45723156480 | elapsed time per iteration (s): 0.83 | learning rate: 1.107E-04 | global batch size: 256 | lm loss: 1.997766E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.394 | TFLOPs: 18.72 | 31: iteration 87220/ 173500 | consumed samples: 22328320 | consumed tokens: 45728399360 | elapsed time per iteration (s): 0.89 | learning rate: 1.107E-04 | global batch size: 256 | lm loss: 2.009335E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.234 | TFLOPs: 17.44 | 31: iteration 87230/ 173500 | consumed samples: 22330880 | consumed tokens: 45733642240 | elapsed time per iteration (s): 0.83 | learning rate: 
1.106E-04 | global batch size: 256 | lm loss: 2.021436E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.965 | TFLOPs: 18.63 | 31: iteration 87240/ 173500 | consumed samples: 22333440 | consumed tokens: 45738885120 | elapsed time per iteration (s): 0.83 | learning rate: 1.106E-04 | global batch size: 256 | lm loss: 1.975726E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.515 | TFLOPs: 18.72 | 31: iteration 87250/ 173500 | consumed samples: 22336000 | consumed tokens: 45744128000 | elapsed time per iteration (s): 0.83 | learning rate: 1.106E-04 | global batch size: 256 | lm loss: 1.998201E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.640 | TFLOPs: 18.73 | 31: iteration 87260/ 173500 | consumed samples: 22338560 | consumed tokens: 45749370880 | elapsed time per iteration (s): 0.82 | learning rate: 1.106E-04 | global batch size: 256 | lm loss: 2.003738E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.319 | TFLOPs: 18.89 | 31: iteration 87270/ 173500 | consumed samples: 22341120 | consumed tokens: 45754613760 | elapsed time per iteration (s): 0.82 | learning rate: 1.106E-04 | global batch size: 256 | lm loss: 1.990697E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.008 | TFLOPs: 19.00 | 31: iteration 87280/ 173500 | consumed samples: 22343680 | consumed tokens: 45759856640 | elapsed time per iteration (s): 0.84 | learning rate: 1.106E-04 | global batch size: 256 | lm loss: 2.006424E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.774 | TFLOPs: 18.38 | 31: iteration 87290/ 173500 | 
consumed samples: 22346240 | consumed tokens: 45765099520 | elapsed time per iteration (s): 0.85 | learning rate: 1.105E-04 | global batch size: 256 | lm loss: 1.969046E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.748 | TFLOPs: 18.13 | 31: iteration 87300/ 173500 | consumed samples: 22348800 | consumed tokens: 45770342400 | elapsed time per iteration (s): 0.86 | learning rate: 1.105E-04 | global batch size: 256 | lm loss: 1.988731E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.266 | TFLOPs: 18.04 | 31: iteration 87310/ 173500 | consumed samples: 22351360 | consumed tokens: 45775585280 | elapsed time per iteration (s): 0.82 | learning rate: 1.105E-04 | global batch size: 256 | lm loss: 1.993423E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.809 | TFLOPs: 18.80 | 31: iteration 87320/ 173500 | consumed samples: 22353920 | consumed tokens: 45780828160 | elapsed time per iteration (s): 0.87 | learning rate: 1.105E-04 | global batch size: 256 | lm loss: 2.011302E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.573 | TFLOPs: 17.82 | 31: iteration 87330/ 173500 | consumed samples: 22356480 | consumed tokens: 45786071040 | elapsed time per iteration (s): 0.80 | learning rate: 1.105E-04 | global batch size: 256 | lm loss: 1.976122E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.309 | TFLOPs: 19.38 | 31: iteration 87340/ 173500 | consumed samples: 22359040 | consumed tokens: 45791313920 | elapsed time per iteration (s): 0.84 | learning rate: 1.105E-04 | global batch size: 256 | lm loss: 1.970256E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 304.658 | TFLOPs: 18.43 | 31: iteration 87350/ 173500 | consumed samples: 22361600 | consumed tokens: 45796556800 | elapsed time per iteration (s): 0.84 | learning rate: 1.104E-04 | global batch size: 256 | lm loss: 2.009819E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.969 | TFLOPs: 18.45 | 31: iteration 87360/ 173500 | consumed samples: 22364160 | consumed tokens: 45801799680 | elapsed time per iteration (s): 0.78 | learning rate: 1.104E-04 | global batch size: 256 | lm loss: 1.988537E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.410 | TFLOPs: 19.81 | 31: iteration 87370/ 173500 | consumed samples: 22366720 | consumed tokens: 45807042560 | elapsed time per iteration (s): 0.80 | learning rate: 1.104E-04 | global batch size: 256 | lm loss: 1.996198E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.032 | TFLOPs: 19.42 | 31: iteration 87380/ 173500 | consumed samples: 22369280 | consumed tokens: 45812285440 | elapsed time per iteration (s): 0.77 | learning rate: 1.104E-04 | global batch size: 256 | lm loss: 1.966809E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.637 | TFLOPs: 20.24 | 31: iteration 87390/ 173500 | consumed samples: 22371840 | consumed tokens: 45817528320 | elapsed time per iteration (s): 0.78 | learning rate: 1.104E-04 | global batch size: 256 | lm loss: 1.974135E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.382 | TFLOPs: 19.93 | 31: iteration 87400/ 173500 | consumed samples: 22374400 | consumed tokens: 45822771200 | elapsed time per iteration (s): 0.79 | learning rate: 1.104E-04 | global batch 
size: 256 | lm loss: 1.966597E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.811 | TFLOPs: 19.53 | 31: iteration 87410/ 173500 | consumed samples: 22376960 | consumed tokens: 45828014080 | elapsed time per iteration (s): 0.80 | learning rate: 1.103E-04 | global batch size: 256 | lm loss: 2.017567E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.223 | TFLOPs: 19.37 | 31: iteration 87420/ 173500 | consumed samples: 22379520 | consumed tokens: 45833256960 | elapsed time per iteration (s): 0.82 | learning rate: 1.103E-04 | global batch size: 256 | lm loss: 2.012534E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.443 | TFLOPs: 18.84 | 31: iteration 87430/ 173500 | consumed samples: 22382080 | consumed tokens: 45838499840 | elapsed time per iteration (s): 0.83 | learning rate: 1.103E-04 | global batch size: 256 | lm loss: 1.998370E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.075 | TFLOPs: 18.76 | 31: iteration 87440/ 173500 | consumed samples: 22384640 | consumed tokens: 45843742720 | elapsed time per iteration (s): 0.82 | learning rate: 1.103E-04 | global batch size: 256 | lm loss: 1.977358E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.093 | TFLOPs: 18.88 | 31: iteration 87450/ 173500 | consumed samples: 22387200 | consumed tokens: 45848985600 | elapsed time per iteration (s): 0.77 | learning rate: 1.103E-04 | global batch size: 256 | lm loss: 1.988237E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.775 | TFLOPs: 20.19 | 31: iteration 87460/ 173500 | consumed samples: 22389760 | 
consumed tokens: 45854228480 | elapsed time per iteration (s): 0.81 | learning rate: 1.103E-04 | global batch size: 256 | lm loss: 1.999117E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.538 | TFLOPs: 19.15 | 31: iteration 87470/ 173500 | consumed samples: 22392320 | consumed tokens: 45859471360 | elapsed time per iteration (s): 0.81 | learning rate: 1.102E-04 | global batch size: 256 | lm loss: 2.001363E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.881 | TFLOPs: 19.11 | 31: iteration 87480/ 173500 | consumed samples: 22394880 | consumed tokens: 45864714240 | elapsed time per iteration (s): 0.91 | learning rate: 1.102E-04 | global batch size: 256 | lm loss: 1.979694E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 280.655 | TFLOPs: 16.98 | 31: iteration 87490/ 173500 | consumed samples: 22397440 | consumed tokens: 45869957120 | elapsed time per iteration (s): 0.75 | learning rate: 1.102E-04 | global batch size: 256 | lm loss: 2.029450E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.985 | TFLOPs: 20.75 | 31: iteration 87500/ 173500 | consumed samples: 22400000 | consumed tokens: 45875200000 | elapsed time per iteration (s): 0.82 | learning rate: 1.102E-04 | global batch size: 256 | lm loss: 2.029816E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.817 | TFLOPs: 18.80 | 31: iteration 87510/ 173500 | consumed samples: 22402560 | consumed tokens: 45880442880 | elapsed time per iteration (s): 0.84 | learning rate: 1.102E-04 | global batch size: 256 | lm loss: 2.011067E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 305.014 | TFLOPs: 18.45 | 31: iteration 87520/ 173500 | consumed samples: 22405120 | consumed tokens: 45885685760 | elapsed time per iteration (s): 0.87 | learning rate: 1.102E-04 | global batch size: 256 | lm loss: 1.989266E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.609 | TFLOPs: 17.76 | 31: iteration 87530/ 173500 | consumed samples: 22407680 | consumed tokens: 45890928640 | elapsed time per iteration (s): 0.83 | learning rate: 1.101E-04 | global batch size: 256 | lm loss: 2.008787E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.702 | TFLOPs: 18.62 | 31: iteration 87540/ 173500 | consumed samples: 22410240 | consumed tokens: 45896171520 | elapsed time per iteration (s): 0.85 | learning rate: 1.101E-04 | global batch size: 256 | lm loss: 2.025611E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.568 | TFLOPs: 18.30 | 31: iteration 87550/ 173500 | consumed samples: 22412800 | consumed tokens: 45901414400 | elapsed time per iteration (s): 0.79 | learning rate: 1.101E-04 | global batch size: 256 | lm loss: 1.983187E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.013 | TFLOPs: 19.54 | 31: iteration 87560/ 173500 | consumed samples: 22415360 | consumed tokens: 45906657280 | elapsed time per iteration (s): 0.84 | learning rate: 1.101E-04 | global batch size: 256 | lm loss: 1.986126E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.484 | TFLOPs: 18.48 | 31: iteration 87570/ 173500 | consumed samples: 22417920 | consumed tokens: 45911900160 | elapsed time per iteration (s): 0.82 | learning rate: 1.101E-04 | global batch size: 256 | lm loss: 
2.017370E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.881 | TFLOPs: 18.87 | 31: iteration 87580/ 173500 | consumed samples: 22420480 | consumed tokens: 45917143040 | elapsed time per iteration (s): 0.88 | learning rate: 1.101E-04 | global batch size: 256 | lm loss: 1.985078E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.394 | TFLOPs: 17.63 | 31: iteration 87590/ 173500 | consumed samples: 22423040 | consumed tokens: 45922385920 | elapsed time per iteration (s): 0.80 | learning rate: 1.100E-04 | global batch size: 256 | lm loss: 2.005398E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.336 | TFLOPs: 19.26 | 31: iteration 87600/ 173500 | consumed samples: 22425600 | consumed tokens: 45927628800 | elapsed time per iteration (s): 0.86 | learning rate: 1.100E-04 | global batch size: 256 | lm loss: 2.013570E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.397 | TFLOPs: 17.93 | 31: iteration 87610/ 173500 | consumed samples: 22428160 | consumed tokens: 45932871680 | elapsed time per iteration (s): 0.90 | learning rate: 1.100E-04 | global batch size: 256 | lm loss: 1.998721E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.654 | TFLOPs: 17.22 | 31: iteration 87620/ 173500 | consumed samples: 22430720 | consumed tokens: 45938114560 | elapsed time per iteration (s): 0.84 | learning rate: 1.100E-04 | global batch size: 256 | lm loss: 1.970771E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.521 | TFLOPs: 18.42 | 31: iteration 87630/ 173500 | consumed samples: 22433280 | consumed tokens: 
45943357440 | elapsed time per iteration (s): 0.81 | learning rate: 1.100E-04 | global batch size: 256 | lm loss: 2.007217E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.267 | TFLOPs: 19.07 | 31: iteration 87640/ 173500 | consumed samples: 22435840 | consumed tokens: 45948600320 | elapsed time per iteration (s): 0.80 | learning rate: 1.100E-04 | global batch size: 256 | lm loss: 2.013961E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.056 | TFLOPs: 19.42 | 31: iteration 87650/ 173500 | consumed samples: 22438400 | consumed tokens: 45953843200 | elapsed time per iteration (s): 0.79 | learning rate: 1.099E-04 | global batch size: 256 | lm loss: 2.013132E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.399 | TFLOPs: 19.63 | 31: iteration 87660/ 173500 | consumed samples: 22440960 | consumed tokens: 45959086080 | elapsed time per iteration (s): 0.82 | learning rate: 1.099E-04 | global batch size: 256 | lm loss: 1.992952E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.298 | TFLOPs: 18.89 | 31: iteration 87670/ 173500 | consumed samples: 22443520 | consumed tokens: 45964328960 | elapsed time per iteration (s): 0.75 | learning rate: 1.099E-04 | global batch size: 256 | lm loss: 2.042368E+00 | grad norm: 0.249 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.081 | TFLOPs: 20.51 | 31: iteration 87680/ 173500 | consumed samples: 22446080 | consumed tokens: 45969571840 | elapsed time per iteration (s): 0.83 | learning rate: 1.099E-04 | global batch size: 256 | lm loss: 2.008220E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 310.138 | TFLOPs: 18.76 | 31: iteration 87690/ 173500 | consumed samples: 22448640 | consumed tokens: 45974814720 | elapsed time per iteration (s): 0.79 | learning rate: 1.099E-04 | global batch size: 256 | lm loss: 2.019004E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.559 | TFLOPs: 19.51 | 31: iteration 87700/ 173500 | consumed samples: 22451200 | consumed tokens: 45980057600 | elapsed time per iteration (s): 0.83 | learning rate: 1.099E-04 | global batch size: 256 | lm loss: 1.994545E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.379 | TFLOPs: 18.72 | 31: iteration 87710/ 173500 | consumed samples: 22453760 | consumed tokens: 45985300480 | elapsed time per iteration (s): 0.83 | learning rate: 1.098E-04 | global batch size: 256 | lm loss: 2.004576E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.550 | TFLOPs: 18.61 | 31: iteration 87720/ 173500 | consumed samples: 22456320 | consumed tokens: 45990543360 | elapsed time per iteration (s): 0.84 | learning rate: 1.098E-04 | global batch size: 256 | lm loss: 1.981251E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.197 | TFLOPs: 18.34 | 31: iteration 87730/ 173500 | consumed samples: 22458880 | consumed tokens: 45995786240 | elapsed time per iteration (s): 0.81 | learning rate: 1.098E-04 | global batch size: 256 | lm loss: 1.995582E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.070 | TFLOPs: 19.18 | 31: iteration 87740/ 173500 | consumed samples: 22461440 | consumed tokens: 46001029120 | elapsed time per iteration (s): 0.80 | learning rate: 1.098E-04 | global batch size: 256 | lm loss: 2.009186E+00 | grad 
norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.636 | TFLOPs: 19.46 | 31: iteration 87750/ 173500 | consumed samples: 22464000 | consumed tokens: 46006272000 | elapsed time per iteration (s): 0.80 | learning rate: 1.098E-04 | global batch size: 256 | lm loss: 1.981168E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.937 | TFLOPs: 19.29 | 31: iteration 87760/ 173500 | consumed samples: 22466560 | consumed tokens: 46011514880 | elapsed time per iteration (s): 0.77 | learning rate: 1.098E-04 | global batch size: 256 | lm loss: 2.013158E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.414 | TFLOPs: 20.11 | 31: iteration 87770/ 173500 | consumed samples: 22469120 | consumed tokens: 46016757760 | elapsed time per iteration (s): 2.85 | learning rate: 1.097E-04 | global batch size: 256 | lm loss: 1.996033E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 89.677 | TFLOPs: 5.43 | 31: iteration 87780/ 173500 | consumed samples: 22471680 | consumed tokens: 46022000640 | elapsed time per iteration (s): 0.91 | learning rate: 1.097E-04 | global batch size: 256 | lm loss: 2.023979E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 281.020 | TFLOPs: 17.00 | 31: iteration 87790/ 173500 | consumed samples: 22474240 | consumed tokens: 46027243520 | elapsed time per iteration (s): 0.77 | learning rate: 1.097E-04 | global batch size: 256 | lm loss: 2.000446E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.978 | TFLOPs: 20.02 | 31: iteration 87800/ 173500 | consumed samples: 22476800 | consumed tokens: 46032486400 | elapsed time 
per iteration (s): 0.77 | learning rate: 1.097E-04 | global batch size: 256 | lm loss: 2.014600E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.099 | TFLOPs: 20.09 | 31: iteration 87810/ 173500 | consumed samples: 22479360 | consumed tokens: 46037729280 | elapsed time per iteration (s): 0.75 | learning rate: 1.097E-04 | global batch size: 256 | lm loss: 2.015447E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.784 | TFLOPs: 20.68 | 31: iteration 87820/ 173500 | consumed samples: 22481920 | consumed tokens: 46042972160 | elapsed time per iteration (s): 0.74 | learning rate: 1.097E-04 | global batch size: 256 | lm loss: 1.999223E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.158 | TFLOPs: 20.88 | 31: iteration 87830/ 173500 | consumed samples: 22484480 | consumed tokens: 46048215040 | elapsed time per iteration (s): 0.77 | learning rate: 1.097E-04 | global batch size: 256 | lm loss: 1.997112E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.419 | TFLOPs: 20.11 | 31: iteration 87840/ 173500 | consumed samples: 22487040 | consumed tokens: 46053457920 | elapsed time per iteration (s): 0.76 | learning rate: 1.096E-04 | global batch size: 256 | lm loss: 1.964368E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.556 | TFLOPs: 20.36 | 31: iteration 87850/ 173500 | consumed samples: 22489600 | consumed tokens: 46058700800 | elapsed time per iteration (s): 0.76 | learning rate: 1.096E-04 | global batch size: 256 | lm loss: 1.996199E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.954 | TFLOPs: 
20.32 | 31: iteration 87860/ 173500 | consumed samples: 22492160 | consumed tokens: 46063943680 | elapsed time per iteration (s): 0.76 | learning rate: 1.096E-04 | global batch size: 256 | lm loss: 1.990219E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.374 | TFLOPs: 20.29 | 31: iteration 87870/ 173500 | consumed samples: 22494720 | consumed tokens: 46069186560 | elapsed time per iteration (s): 0.80 | learning rate: 1.096E-04 | global batch size: 256 | lm loss: 1.994455E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.310 | TFLOPs: 19.32 | 31: iteration 87880/ 173500 | consumed samples: 22497280 | consumed tokens: 46074429440 | elapsed time per iteration (s): 0.76 | learning rate: 1.096E-04 | global batch size: 256 | lm loss: 1.981670E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.435 | TFLOPs: 20.29 | 31: iteration 87890/ 173500 | consumed samples: 22499840 | consumed tokens: 46079672320 | elapsed time per iteration (s): 0.82 | learning rate: 1.096E-04 | global batch size: 256 | lm loss: 1.976296E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.837 | TFLOPs: 18.99 | 31: iteration 87900/ 173500 | consumed samples: 22502400 | consumed tokens: 46084915200 | elapsed time per iteration (s): 0.78 | learning rate: 1.095E-04 | global batch size: 256 | lm loss: 1.973724E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.977 | TFLOPs: 19.84 | 31: iteration 87910/ 173500 | consumed samples: 22504960 | consumed tokens: 46090158080 | elapsed time per iteration (s): 0.76 | learning rate: 1.095E-04 | global batch size: 256 | lm loss: 2.007096E+00 | grad norm: 0.150 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.228 | TFLOPs: 20.34 | 31: iteration 87920/ 173500 | consumed samples: 22507520 | consumed tokens: 46095400960 | elapsed time per iteration (s): 0.83 | learning rate: 1.095E-04 | global batch size: 256 | lm loss: 2.013654E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.732 | TFLOPs: 18.74 | 31: iteration 87930/ 173500 | consumed samples: 22510080 | consumed tokens: 46100643840 | elapsed time per iteration (s): 0.83 | learning rate: 1.095E-04 | global batch size: 256 | lm loss: 1.993816E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.859 | TFLOPs: 18.69 | 31: iteration 87940/ 173500 | consumed samples: 22512640 | consumed tokens: 46105886720 | elapsed time per iteration (s): 0.95 | learning rate: 1.095E-04 | global batch size: 256 | lm loss: 1.964152E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 270.390 | TFLOPs: 16.36 | 31: iteration 87950/ 173500 | consumed samples: 22515200 | consumed tokens: 46111129600 | elapsed time per iteration (s): 0.87 | learning rate: 1.095E-04 | global batch size: 256 | lm loss: 1.969246E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.914 | TFLOPs: 17.72 | 31: iteration 87960/ 173500 | consumed samples: 22517760 | consumed tokens: 46116372480 | elapsed time per iteration (s): 0.86 | learning rate: 1.094E-04 | global batch size: 256 | lm loss: 1.977355E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.082 | TFLOPs: 18.03 | 31: iteration 87970/ 173500 | consumed samples: 22520320 | consumed tokens: 46121615360 | elapsed time per iteration (s): 0.82 | 
learning rate: 1.094E-04 | global batch size: 256 | lm loss: 2.008234E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.297 | TFLOPs: 18.89 | 31: iteration 87980/ 173500 | consumed samples: 22522880 | consumed tokens: 46126858240 | elapsed time per iteration (s): 0.79 | learning rate: 1.094E-04 | global batch size: 256 | lm loss: 2.005050E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.508 | TFLOPs: 19.51 | 31: iteration 87990/ 173500 | consumed samples: 22525440 | consumed tokens: 46132101120 | elapsed time per iteration (s): 0.83 | learning rate: 1.094E-04 | global batch size: 256 | lm loss: 1.991856E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.366 | TFLOPs: 18.72 | 0: [2022-11-26 14:01:06,569] [INFO] [logging.py:68:log_dist] [Rank 0] step=88000, skipped=0, lr=[0.00010937083470846484, 0.00010937083470846484, 0.00010937083470846484], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 88000/ 173500 | consumed samples: 22528000 | consumed tokens: 46137344000 | elapsed time per iteration (s): 0.76 | learning rate: 1.094E-04 | global batch size: 256 | lm loss: 2.004689E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.016 | TFLOPs: 20.27 | 0: steps: 88000 loss: 1.9958 iter time (s): 0.810 samples/sec: 316.059 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 88000 | lm loss value: 1.976398E+00 | lm loss PPL: 7.216701E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 88000 to checkpoints_1b1long 0: [2022-11-26 14:01:06,819] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] 
Checkpoint global_step88000 is begin to save! 0: [2022-11-26 14:01:06,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_01-model_00-model_states.pt... 0: [2022-11-26 14:01:07,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_01-model_00-model_states.pt. 0: [2022-11-26 14:01:07,047] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_03-model_00-model_states.pt... 0: [2022-11-26 14:01:07,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_03-model_00-model_states.pt. 0: [2022-11-26 14:01:07,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_04-model_00-model_states.pt... 0: [2022-11-26 14:01:07,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_04-model_00-model_states.pt. 0: [2022-11-26 14:01:07,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_05-model_00-model_states.pt... 0: [2022-11-26 14:01:07,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_05-model_00-model_states.pt. 0: [2022-11-26 14:01:07,285] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_06-model_00-model_states.pt... 0: [2022-11-26 14:01:07,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_06-model_00-model_states.pt. 0: [2022-11-26 14:01:07,362] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_07-model_00-model_states.pt... 0: [2022-11-26 14:01:07,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 14:01:07,437] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_08-model_00-model_states.pt... 0: [2022-11-26 14:01:07,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_08-model_00-model_states.pt. 0: [2022-11-26 14:01:07,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_09-model_00-model_states.pt... 0: [2022-11-26 14:01:07,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_09-model_00-model_states.pt. 0: [2022-11-26 14:01:07,588] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_10-model_00-model_states.pt... 0: [2022-11-26 14:01:07,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_10-model_00-model_states.pt. 0: [2022-11-26 14:01:07,666] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_11-model_00-model_states.pt... 0: [2022-11-26 14:01:07,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_11-model_00-model_states.pt. 0: [2022-11-26 14:01:07,744] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_12-model_00-model_states.pt... 0: [2022-11-26 14:01:07,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_12-model_00-model_states.pt. 0: [2022-11-26 14:01:07,827] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_13-model_00-model_states.pt... 0: [2022-11-26 14:01:07,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 14:01:07,904] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_14-model_00-model_states.pt... 0: [2022-11-26 14:01:07,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_14-model_00-model_states.pt. 0: [2022-11-26 14:01:07,982] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_15-model_00-model_states.pt... 0: [2022-11-26 14:01:08,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_15-model_00-model_states.pt. 0: [2022-11-26 14:01:08,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_16-model_00-model_states.pt... 0: [2022-11-26 14:01:08,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_16-model_00-model_states.pt. 0: [2022-11-26 14:01:08,134] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_17-model_00-model_states.pt... 0: [2022-11-26 14:01:08,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_17-model_00-model_states.pt. 0: [2022-11-26 14:01:08,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_18-model_00-model_states.pt... 0: [2022-11-26 14:01:08,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_18-model_00-model_states.pt. 0: [2022-11-26 14:01:08,288] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_19-model_00-model_states.pt... 0: [2022-11-26 14:01:08,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 14:01:08,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_20-model_00-model_states.pt... 0: [2022-11-26 14:01:08,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_20-model_00-model_states.pt. 0: [2022-11-26 14:01:08,442] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_21-model_00-model_states.pt... 0: [2022-11-26 14:01:08,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_21-model_00-model_states.pt. 0: [2022-11-26 14:01:08,517] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_22-model_00-model_states.pt... 0: [2022-11-26 14:01:08,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_22-model_00-model_states.pt. 0: [2022-11-26 14:01:08,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_23-model_00-model_states.pt... 0: [2022-11-26 14:01:08,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_23-model_00-model_states.pt. 0: [2022-11-26 14:01:08,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_24-model_00-model_states.pt... 0: [2022-11-26 14:01:08,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_24-model_00-model_states.pt. 0: [2022-11-26 14:01:08,744] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_25-model_00-model_states.pt... 0: [2022-11-26 14:01:08,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 14:01:08,829] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_26-model_00-model_states.pt... 0: [2022-11-26 14:01:08,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_26-model_00-model_states.pt. 0: [2022-11-26 14:01:08,905] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_27-model_00-model_states.pt... 0: [2022-11-26 14:01:08,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_27-model_00-model_states.pt. 0: [2022-11-26 14:01:08,982] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_28-model_00-model_states.pt... 0: [2022-11-26 14:01:09,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_28-model_00-model_states.pt. 0: [2022-11-26 14:01:09,059] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/layer_30-model_00-model_states.pt... 0: [2022-11-26 14:01:09,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/layer_30-model_00-model_states.pt. 0: [2022-11-26 14:01:09,061] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step88000/mp_rank_00_model_states.pt 0: [2022-11-26 14:01:09,061] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/mp_rank_00_model_states.pt... 0: [2022-11-26 14:01:09,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/mp_rank_00_model_states.pt. 0: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
0: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
4: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
16: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
3: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 
25: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 
14: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 
17: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 
19: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
4: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
1: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
3: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 
20: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 
28: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 
29: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 
18: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 
7: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
3: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 
24: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 
22: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 
27: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
3: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 
30: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
9: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 
9: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:01:09,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:01:09,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:01:09,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 14:01:09,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 26: [2022-11-26 14:01:09,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:01:09,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 14:01:09,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 1: [2022-11-26 14:01:09,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:01:09,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 14:01:09,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 4: [2022-11-26 14:01:09,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 14:01:09,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 14:01:09,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 27: [2022-11-26 14:01:09,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:01:09,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:01:09,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 21: [2022-11-26 14:01:09,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 27: [2022-11-26 14:01:09,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 2: [2022-11-26 14:01:09,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:01:09,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 2: [2022-11-26 14:01:09,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 14:01:09,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 24: [2022-11-26 14:01:09,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-26 14:01:09,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 14:01:09,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 31: [2022-11-26 14:01:09,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:01:09,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:01:09,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 14:01:09,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 23: [2022-11-26 14:01:09,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 14:01:09,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 3: [2022-11-26 14:01:09,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:01:09,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:01:09,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 14:01:09,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
16: [2022-11-26 14:01:09,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 14:01:09,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 20: [2022-11-26 14:01:09,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:01:09,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 14:01:09,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 29: [2022-11-26 14:01:09,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:01:09,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:01:09,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 14: [2022-11-26 14:01:09,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 29: [2022-11-26 14:01:09,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 14: [2022-11-26 14:01:09,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 6: [2022-11-26 14:01:09,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 14:01:09,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 14:01:09,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 28: [2022-11-26 14:01:09,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 14:01:09,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 18: [2022-11-26 14:01:09,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:01:09,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 14:01:09,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 28: [2022-11-26 14:01:09,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 0: [2022-11-26 14:01:09,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:01:09,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 14:01:09,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 21: [2022-11-26 14:01:09,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
21: [2022-11-26 14:01:09,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 14:01:09,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 9: [2022-11-26 14:01:09,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:01:09,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 14:01:09,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 10: [2022-11-26 14:01:09,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:01:09,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 14:01:09,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 26: [2022-11-26 14:01:09,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:01:09,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 14:01:09,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 2: [2022-11-26 14:01:09,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 14:01:09,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 14:01:09,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 14: [2022-11-26 14:01:09,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:01:09,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:01:09,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 19: [2022-11-26 14:01:09,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 14: [2022-11-26 14:01:09,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 19: [2022-11-26 14:01:09,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 13: [2022-11-26 14:01:09,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:01:09,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 14:01:09,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 10: [2022-11-26 14:01:09,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-26 14:01:09,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 8: [2022-11-26 14:01:09,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:01:09,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 8: [2022-11-26 14:01:09,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 14:01:09,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 13: [2022-11-26 14:01:09,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:01:09,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 14:01:09,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 5: [2022-11-26 14:01:09,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:01:09,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 14:01:09,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 6: [2022-11-26 14:01:09,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 14:01:09,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 14:01:09,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 4: [2022-11-26 14:01:09,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:01:09,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 14:01:09,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 8: [2022-11-26 14:01:09,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:01:09,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 14:01:09,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 11: [2022-11-26 14:01:09,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:01:09,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 14:01:09,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 14:01:09,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 14:01:09,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 11: [2022-11-26 14:01:09,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 27: [2022-11-26 14:01:09,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:01:09,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 14:01:09,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 9: [2022-11-26 14:01:09,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:01:09,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 14:01:09,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 14:01:09,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 1: [2022-11-26 14:01:09,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:01:09,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:01:09,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 9: [2022-11-26 14:01:09,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 1: [2022-11-26 14:01:09,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 14:01:09,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 4: [2022-11-26 14:01:09,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:01:09,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 1: [2022-11-26 14:01:09,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
4: [2022-11-26 14:01:09,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 14:01:09,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 23: [2022-11-26 14:01:09,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:01:09,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:01:09,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:01:09,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 24: [2022-11-26 14:01:09,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:01:09,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 3: [2022-11-26 14:01:09,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 23: [2022-11-26 14:01:09,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 3: [2022-11-26 14:01:09,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 31: [2022-11-26 14:01:09,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
24: [2022-11-26 14:01:09,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 11: [2022-11-26 14:01:09,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:01:09,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 11: [2022-11-26 14:01:09,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 31: [2022-11-26 14:01:09,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:01:09,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:01:09,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 20: [2022-11-26 14:01:09,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 31: [2022-11-26 14:01:09,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 20: [2022-11-26 14:01:09,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 31: [2022-11-26 14:01:09,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 26: [2022-11-26 14:01:09,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
23: [2022-11-26 14:01:09,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:01:09,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:01:09,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 23: [2022-11-26 14:01:09,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 29: [2022-11-26 14:01:09,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 23: [2022-11-26 14:01:09,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 29: [2022-11-26 14:01:09,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 26: [2022-11-26 14:01:09,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 20: [2022-11-26 14:01:09,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:01:09,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 30: [2022-11-26 14:01:09,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 
30: [2022-11-26 14:01:09,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 20: [2022-11-26 14:01:09,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 30: [2022-11-26 14:01:09,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 18: [2022-11-26 14:01:09,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:01:09,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 14:01:09,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:01:09,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 18: [2022-11-26 14:01:09,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 14:01:09,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 28: [2022-11-26 14:01:09,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:01:09,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 14:01:09,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 14:01:09,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 10: [2022-11-26 14:01:09,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:01:09,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 26: [2022-11-26 14:01:09,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:01:09,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 26: [2022-11-26 14:01:09,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 14:01:09,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 30: [2022-11-26 14:01:09,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:01:09,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 14:01:09,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
28: [2022-11-26 14:01:09,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 14:01:09,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 27: [2022-11-26 14:01:09,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:01:09,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 14:01:09,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 0: [2022-11-26 14:01:09,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:01:09,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 14:01:09,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 29: [2022-11-26 14:01:09,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:01:09,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:01:09,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 14:01:09,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
16: [2022-11-26 14:01:09,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 14:01:09,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 5: [2022-11-26 14:01:09,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:01:09,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 14:01:09,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 5: [2022-11-26 14:01:09,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:01:09,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 14:01:09,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 24: [2022-11-26 14:01:09,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:01:09,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 30: [2022-11-26 14:01:09,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 
30: [2022-11-26 14:01:09,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 24: [2022-11-26 14:01:09,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 30: [2022-11-26 14:01:09,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 21: [2022-11-26 14:01:09,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:01:09,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:01:09,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 14:01:09,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 14:01:09,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 21: [2022-11-26 14:01:09,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 8: [2022-11-26 14:01:09,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:01:09,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 14:01:09,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
19: [2022-11-26 14:01:09,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:01:09,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 14:01:09,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 28: [2022-11-26 14:01:09,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:01:09,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:01:09,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:01:09,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:01:09,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 13: [2022-11-26 14:01:09,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 12: [2022-11-26 14:01:09,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
12: [2022-11-26 14:01:09,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 13: [2022-11-26 14:01:09,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 12: [2022-11-26 14:01:09,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 4: [2022-11-26 14:01:09,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:01:09,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:01:09,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 14:01:09,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 3: [2022-11-26 14:01:09,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 14:01:09,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 6: [2022-11-26 14:01:09,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:01:09,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 14:01:09,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
1: [2022-11-26 14:01:09,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:01:09,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:01:09,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:01:09,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 14:01:09,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 1: [2022-11-26 14:01:09,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 7: [2022-11-26 14:01:09,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 7: [2022-11-26 14:01:09,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 1: [2022-11-26 14:01:09,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 6: [2022-11-26 14:01:09,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:01:09,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 14:01:09,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
14: [2022-11-26 14:01:09,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:01:09,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 14:01:09,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 11: [2022-11-26 14:01:09,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:01:09,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 14:01:09,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 16: [2022-11-26 14:01:09,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:01:09,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 14:01:09,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 20: [2022-11-26 14:01:09,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:01:09,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 14:01:09,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
0: [2022-11-26 14:01:09,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:01:09,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 14:01:09,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 0: [2022-11-26 14:01:09,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:01:09,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 14:01:09,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 28: [2022-11-26 14:01:09,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 14:01:09,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 3: [2022-11-26 14:01:09,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:01:09,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 14:01:09,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 2: [2022-11-26 14:01:09,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 14:01:09,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 14:01:09,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 9: [2022-11-26 14:01:09,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:01:09,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 14:01:09,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 10: [2022-11-26 14:01:09,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:01:09,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 14:01:09,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 23: [2022-11-26 14:01:09,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:01:09,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 31: [2022-11-26 14:01:09,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:01:09,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
31: [2022-11-26 14:01:09,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 14:01:09,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 22: [2022-11-26 14:01:09,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:01:09,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:01:09,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:01:09,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 14:01:09,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 14:01:09,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 14:01:09,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 22: [2022-11-26 14:01:09,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 22: [2022-11-26 14:01:09,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 19: [2022-11-26 14:01:09,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
19: [2022-11-26 14:01:09,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 14:01:09,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 24: [2022-11-26 14:01:09,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:01:09,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 14:01:09,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 17: [2022-11-26 14:01:09,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:01:09,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 14:01:09,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 17: [2022-11-26 14:01:09,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:01:09,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 14:01:09,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 17: [2022-11-26 14:01:09,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-26 14:01:09,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 14:01:09,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 17: [2022-11-26 14:01:09,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:01:09,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 14:01:09,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 8: [2022-11-26 14:01:09,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:01:09,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 14:01:09,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 14: [2022-11-26 14:01:09,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:01:09,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 14:01:09,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 12: [2022-11-26 14:01:09,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 14:01:09,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 14:01:09,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 30: [2022-11-26 14:01:09,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:01:09,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 14:01:09,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 5: [2022-11-26 14:01:09,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:01:09,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 14:01:09,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 13: [2022-11-26 14:01:09,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:01:09,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 14:01:09,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 18: [2022-11-26 14:01:09,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
18: [2022-11-26 14:01:09,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 14:01:09,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 28: [2022-11-26 14:01:09,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:01:09,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:01:09,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:01:09,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:01:09,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-26 14:01:09,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 14:01:09,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 14:01:09,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 14:01:09,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 14:01:09,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 15: [2022-11-26 14:01:09,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 15: [2022-11-26 14:01:09,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 15: [2022-11-26 14:01:09,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 22: [2022-11-26 14:01:09,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:01:09,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 14:01:09,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
28: [2022-11-26 14:01:09,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 14:01:09,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 7: [2022-11-26 14:01:09,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:01:09,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 14:01:09,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 25: [2022-11-26 14:01:09,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:01:09,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:01:09,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:01:09,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
25: [2022-11-26 14:01:09,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 14:01:09,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 14:01:09,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 14:01:09,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 14:01:09,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 25: [2022-11-26 14:01:09,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 25: [2022-11-26 14:01:09,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 25: [2022-11-26 14:01:09,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 27: [2022-11-26 14:01:09,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:01:09,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 14:01:09,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 16: [2022-11-26 14:01:09,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-26 14:01:09,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 14:01:09,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 29: [2022-11-26 14:01:09,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:01:09,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 14:01:09,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 1: [2022-11-26 14:01:09,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:01:09,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 14:01:09,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 19: [2022-11-26 14:01:09,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:01:09,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 14:01:09,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 0: [2022-11-26 14:01:09,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
26: [2022-11-26 14:01:09,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:01:09,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 14:01:09,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 0: [2022-11-26 14:01:09,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 14:01:09,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 4: [2022-11-26 14:01:09,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:01:09,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 14:01:09,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 25: [2022-11-26 14:01:09,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:01:09,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 14:01:09,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 21: [2022-11-26 14:01:09,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-26 14:01:09,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 14:01:09,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 9: [2022-11-26 14:01:09,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:01:09,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 14:01:09,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 23: [2022-11-26 14:01:09,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:01:09,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 14:01:09,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 10: [2022-11-26 14:01:09,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:01:09,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 14:01:09,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 2: [2022-11-26 14:01:09,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 14:01:09,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 14:01:09,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 6: [2022-11-26 14:01:09,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:01:09,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 14:01:09,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 11: [2022-11-26 14:01:09,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:01:09,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 14:01:09,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 3: [2022-11-26 14:01:09,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:01:09,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 14:01:09,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 30: [2022-11-26 14:01:09,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-26 14:01:09,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 14:01:09,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 15: [2022-11-26 14:01:09,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:01:09,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 14:01:09,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 20: [2022-11-26 14:01:09,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:01:09,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 14:01:09,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 12: [2022-11-26 14:01:09,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:01:09,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 14:01:09,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 17: [2022-11-26 14:01:09,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 
17: [2022-11-26 14:01:09,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 14:01:09,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 18: [2022-11-26 14:01:09,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:01:09,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 14:01:09,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 8: [2022-11-26 14:01:09,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:01:09,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 14:01:09,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 7: [2022-11-26 14:01:09,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:01:09,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 14:01:09,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 0: [2022-11-26 14:01:09,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-26 14:01:09,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 14:01:09,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 29: [2022-11-26 14:01:09,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:01:09,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 14:01:09,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 16: [2022-11-26 14:01:09,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:01:09,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 14:01:09,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 19: [2022-11-26 14:01:09,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:01:09,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 14:01:09,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 13: [2022-11-26 14:01:09,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 14:01:09,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 14:01:09,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 28: [2022-11-26 14:01:09,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 14:01:09,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 14:01:09,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 14: [2022-11-26 14:01:09,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:01:09,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 14:01:09,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 1: [2022-11-26 14:01:09,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:01:09,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 14:01:09,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 22: [2022-11-26 14:01:09,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
22: [2022-11-26 14:01:09,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 14:01:09,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 26: [2022-11-26 14:01:09,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:01:09,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 14:01:09,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 31: [2022-11-26 14:01:09,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:01:09,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 14:01:09,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 4: [2022-11-26 14:01:09,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:01:09,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 14:01:09,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 23: [2022-11-26 14:01:09,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-26 14:01:09,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 14:01:09,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 27: [2022-11-26 14:01:09,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:01:09,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 14:01:09,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 3: [2022-11-26 14:01:09,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:01:09,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 14:01:09,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 11: [2022-11-26 14:01:09,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:01:09,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 14:01:09,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 25: [2022-11-26 14:01:09,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
10: [2022-11-26 14:01:09,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:01:09,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 24: [2022-11-26 14:01:09,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:01:09,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 14:01:09,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 25: [2022-11-26 14:01:09,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 24: [2022-11-26 14:01:09,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 5: [2022-11-26 14:01:09,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:01:09,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 24: [2022-11-26 14:01:09,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 5: [2022-11-26 14:01:09,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 15: [2022-11-26 14:01:09,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 14:01:09,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 14:01:09,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 21: [2022-11-26 14:01:09,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:01:09,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 14:01:09,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 16: [2022-11-26 14:01:09,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:01:09,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 14:01:09,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 30: [2022-11-26 14:01:09,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:01:09,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 14:01:09,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 6: [2022-11-26 14:01:09,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-26 14:01:09,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 14:01:09,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 14: [2022-11-26 14:01:09,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:01:09,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 14:01:09,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 2: [2022-11-26 14:01:09,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:01:09,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 14:01:09,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 31: [2022-11-26 14:01:09,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:01:09,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 14:01:09,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 7: [2022-11-26 14:01:09,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 14:01:09,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 14:01:09,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 8: [2022-11-26 14:01:09,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 28: [2022-11-26 14:01:09,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:01:09,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 14:01:09,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 4: [2022-11-26 14:01:09,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:01:09,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 14:01:09,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 23: [2022-11-26 14:01:09,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:01:09,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 14:01:09,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
19: [2022-11-26 14:01:09,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:01:09,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 14:01:09,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 11: [2022-11-26 14:01:09,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:01:09,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 14:01:09,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 21: [2022-11-26 14:01:09,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:01:09,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 14:01:09,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 12: [2022-11-26 14:01:09,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:01:09,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 14:01:09,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
25: [2022-11-26 14:01:09,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:01:09,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 14:01:09,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 18: [2022-11-26 14:01:09,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:01:09,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 14:01:09,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 20: [2022-11-26 14:01:09,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:01:09,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 14:01:09,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 13: [2022-11-26 14:01:09,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 14:01:09,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 29: [2022-11-26 14:01:09,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:01:09,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 29: [2022-11-26 14:01:09,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 14:01:09,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 10: [2022-11-26 14:01:09,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:01:09,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 14:01:09,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 22: [2022-11-26 14:01:09,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:01:09,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 14:01:09,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 15: [2022-11-26 14:01:09,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 14:01:09,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 14:01:09,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 6: [2022-11-26 14:01:09,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:01:09,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 14:01:09,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 17: [2022-11-26 14:01:09,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:01:09,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 14:01:09,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 24: [2022-11-26 14:01:09,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:01:09,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-26 14:01:09,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 24: [2022-11-26 14:01:09,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 2: [2022-11-26 14:01:09,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 24: [2022-11-26 14:01:09,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 0: [2022-11-26 14:01:09,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:01:09,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 14:01:09,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 27: [2022-11-26 14:01:09,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:01:09,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 14:01:09,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 3: [2022-11-26 14:01:09,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 14:01:09,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 14:01:09,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 26: [2022-11-26 14:01:09,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:01:09,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 14:01:09,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 30: [2022-11-26 14:01:09,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:01:09,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 14:01:09,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 19: [2022-11-26 14:01:09,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:01:09,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 14:01:09,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 9: [2022-11-26 14:01:09,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 14:01:09,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 28: [2022-11-26 14:01:09,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 14:01:09,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 9: [2022-11-26 14:01:09,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 31: [2022-11-26 14:01:09,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:01:09,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 14:01:09,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 27: [2022-11-26 14:01:09,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:01:09,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 14:01:09,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 14: [2022-11-26 14:01:09,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 14:01:09,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 14:01:09,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 18: [2022-11-26 14:01:09,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:01:09,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 14:01:09,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 31: [2022-11-26 14:01:09,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:01:09,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 14:01:09,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 13: [2022-11-26 14:01:09,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:01:09,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 14:01:09,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 4: [2022-11-26 14:01:09,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
5: [2022-11-26 14:01:09,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:01:09,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:01:09,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:01:09,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 5: [2022-11-26 14:01:09,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 4: [2022-11-26 14:01:09,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 20: [2022-11-26 14:01:09,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 14:01:09,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 5: [2022-11-26 14:01:09,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 20: [2022-11-26 14:01:09,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 20: [2022-11-26 14:01:09,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 1: [2022-11-26 14:01:09,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 14:01:09,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 14:01:09,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 7: [2022-11-26 14:01:09,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:01:09,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:01:09,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 22: [2022-11-26 14:01:09,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 7: [2022-11-26 14:01:09,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 22: [2022-11-26 14:01:09,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 21: [2022-11-26 14:01:09,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:01:09,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:01:09,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 14:01:09,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
23: [2022-11-26 14:01:09,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 14:01:09,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 15: [2022-11-26 14:01:09,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:01:09,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 17: [2022-11-26 14:01:09,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:01:09,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 17: [2022-11-26 14:01:09,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 14:01:09,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 26: [2022-11-26 14:01:09,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:01:09,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 14:01:09,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 17: [2022-11-26 14:01:09,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
28: [2022-11-26 14:01:09,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:01:09,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 8: [2022-11-26 14:01:09,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:01:09,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 28: [2022-11-26 14:01:09,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 8: [2022-11-26 14:01:09,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 28: [2022-11-26 14:01:09,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 8: [2022-11-26 14:01:09,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 30: [2022-11-26 14:01:09,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:01:09,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 14:01:09,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 6: [2022-11-26 14:01:09,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 14:01:09,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 14:01:09,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 25: [2022-11-26 14:01:09,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:01:09,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 14:01:09,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 16: [2022-11-26 14:01:09,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:01:09,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 14:01:09,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 24: [2022-11-26 14:01:09,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:01:09,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 14:01:09,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 18: [2022-11-26 14:01:09,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
0: [2022-11-26 14:01:09,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:01:09,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 14:01:09,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 0: [2022-11-26 14:01:09,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 14:01:09,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 14: [2022-11-26 14:01:09,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:01:09,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 14:01:09,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 12: [2022-11-26 14:01:09,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:01:09,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 14:01:09,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 14:01:09,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 14:01:09,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 12: [2022-11-26 14:01:09,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 13: [2022-11-26 14:01:09,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:01:09,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 14:01:09,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 1: [2022-11-26 14:01:09,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:01:09,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:01:09,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:01:09,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 14:01:09,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
11: [2022-11-26 14:01:09,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 14:01:09,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 9: [2022-11-26 14:01:09,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 14:01:09,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 8: [2022-11-26 14:01:09,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:01:09,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 14:01:09,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 3: [2022-11-26 14:01:09,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:01:09,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 14:01:09,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 10: [2022-11-26 14:01:09,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 14:01:09,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 14:01:09,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 24: [2022-11-26 14:01:09,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:01:09,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 14:01:09,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 27: [2022-11-26 14:01:09,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:01:09,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 14:01:09,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 16: [2022-11-26 14:01:09,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:01:09,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 14:01:09,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 7: [2022-11-26 14:01:09,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 14:01:09,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 14:01:09,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 29: [2022-11-26 14:01:09,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:01:09,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 2: [2022-11-26 14:01:09,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:01:09,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 2: [2022-11-26 14:01:09,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 14:01:09,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 29: [2022-11-26 14:01:09,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:01:09,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 14:01:09,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 28: [2022-11-26 14:01:09,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
12: [2022-11-26 14:01:09,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:01:09,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 14:01:09,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 28: [2022-11-26 14:01:09,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 14:01:09,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 7: [2022-11-26 14:01:09,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:01:09,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 9: [2022-11-26 14:01:09,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:01:09,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 9: [2022-11-26 14:01:09,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 14:01:09,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 22: [2022-11-26 14:01:09,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
22: [2022-11-26 14:01:09,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 14:01:09,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 5: [2022-11-26 14:01:09,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:01:09,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 14:01:09,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 5: [2022-11-26 14:01:09,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:01:09,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step88000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 14:01:09,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
0: successfully saved checkpoint at iteration 88000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2555.62 31: iteration 88010/ 173500 | consumed samples: 22530560 | consumed tokens: 46142586880 | elapsed time per iteration (s): 1.05 | learning rate: 1.094E-04 | global batch size: 256 | lm loss: 1.959991E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.208 | TFLOPs: 14.77 | 31: iteration 88020/ 173500 | consumed samples: 22533120 | consumed tokens: 46147829760 | elapsed time per iteration (s): 0.77 | learning rate: 1.093E-04 | global batch size: 256 | lm loss: 2.004789E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.815 | TFLOPs: 20.19 | 31: iteration 88030/ 173500 | consumed samples: 22535680 | consumed tokens: 46153072640 | elapsed time per iteration (s): 0.80 | learning rate: 1.093E-04 | global batch size: 256 | lm loss: 1.985434E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.330 | TFLOPs: 19.38 | 31: iteration 88040/ 173500 | consumed samples: 22538240 | consumed tokens: 46158315520 | elapsed time per iteration (s): 0.83 | learning rate: 1.093E-04 | global batch size: 256 | lm loss: 1.973757E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.712 | TFLOPs: 18.68 | 31: iteration 88050/ 173500 | consumed samples: 22540800 | consumed tokens: 46163558400 | elapsed time per iteration (s): 0.75 | learning rate: 1.093E-04 | global batch size: 256 | lm loss: 2.006836E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.165 | TFLOPs: 20.52 | 31: iteration 88060/ 173500 | consumed samples: 22543360 | consumed tokens: 46168801280 | elapsed time per iteration (s): 0.81 | 
learning rate: 1.093E-04 | global batch size: 256 | lm loss: 1.968574E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.337 | TFLOPs: 19.02 | 31: iteration 88070/ 173500 | consumed samples: 22545920 | consumed tokens: 46174044160 | elapsed time per iteration (s): 0.78 | learning rate: 1.093E-04 | global batch size: 256 | lm loss: 2.001747E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.872 | TFLOPs: 19.96 | 31: iteration 88080/ 173500 | consumed samples: 22548480 | consumed tokens: 46179287040 | elapsed time per iteration (s): 0.79 | learning rate: 1.092E-04 | global batch size: 256 | lm loss: 1.982494E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.569 | TFLOPs: 19.58 | 31: iteration 88090/ 173500 | consumed samples: 22551040 | consumed tokens: 46184529920 | elapsed time per iteration (s): 0.79 | learning rate: 1.092E-04 | global batch size: 256 | lm loss: 2.004206E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.255 | TFLOPs: 19.62 | 31: iteration 88100/ 173500 | consumed samples: 22553600 | consumed tokens: 46189772800 | elapsed time per iteration (s): 0.81 | learning rate: 1.092E-04 | global batch size: 256 | lm loss: 1.970153E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.501 | TFLOPs: 19.21 | 31: iteration 88110/ 173500 | consumed samples: 22556160 | consumed tokens: 46195015680 | elapsed time per iteration (s): 0.82 | learning rate: 1.092E-04 | global batch size: 256 | lm loss: 1.982152E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.574 | TFLOPs: 18.91 | 31: iteration 88120/ 
173500 | consumed samples: 22558720 | consumed tokens: 46200258560 | elapsed time per iteration (s): 0.83 | learning rate: 1.092E-04 | global batch size: 256 | lm loss: 1.979110E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.601 | TFLOPs: 18.67 | 31: iteration 88130/ 173500 | consumed samples: 22561280 | consumed tokens: 46205501440 | elapsed time per iteration (s): 0.86 | learning rate: 1.092E-04 | global batch size: 256 | lm loss: 1.991011E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.777 | TFLOPs: 18.01 | 31: iteration 88140/ 173500 | consumed samples: 22563840 | consumed tokens: 46210744320 | elapsed time per iteration (s): 0.83 | learning rate: 1.091E-04 | global batch size: 256 | lm loss: 1.976834E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.106 | TFLOPs: 18.76 | 31: iteration 88150/ 173500 | consumed samples: 22566400 | consumed tokens: 46215987200 | elapsed time per iteration (s): 0.81 | learning rate: 1.091E-04 | global batch size: 256 | lm loss: 1.982179E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.213 | TFLOPs: 19.07 | 31: iteration 88160/ 173500 | consumed samples: 22568960 | consumed tokens: 46221230080 | elapsed time per iteration (s): 0.81 | learning rate: 1.091E-04 | global batch size: 256 | lm loss: 1.972845E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.658 | TFLOPs: 19.04 | 31: iteration 88170/ 173500 | consumed samples: 22571520 | consumed tokens: 46226472960 | elapsed time per iteration (s): 0.81 | learning rate: 1.091E-04 | global batch size: 256 | lm loss: 1.984348E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 315.827 | TFLOPs: 19.11 | 31: iteration 88180/ 173500 | consumed samples: 22574080 | consumed tokens: 46231715840 | elapsed time per iteration (s): 0.81 | learning rate: 1.091E-04 | global batch size: 256 | lm loss: 1.996633E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.816 | TFLOPs: 19.05 | 31: iteration 88190/ 173500 | consumed samples: 22576640 | consumed tokens: 46236958720 | elapsed time per iteration (s): 0.81 | learning rate: 1.091E-04 | global batch size: 256 | lm loss: 2.001264E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.814 | TFLOPs: 19.23 | 31: iteration 88200/ 173500 | consumed samples: 22579200 | consumed tokens: 46242201600 | elapsed time per iteration (s): 0.86 | learning rate: 1.090E-04 | global batch size: 256 | lm loss: 2.008100E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.699 | TFLOPs: 18.01 | 31: iteration 88210/ 173500 | consumed samples: 22581760 | consumed tokens: 46247444480 | elapsed time per iteration (s): 0.79 | learning rate: 1.090E-04 | global batch size: 256 | lm loss: 1.988808E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.187 | TFLOPs: 19.61 | 31: iteration 88220/ 173500 | consumed samples: 22584320 | consumed tokens: 46252687360 | elapsed time per iteration (s): 0.79 | learning rate: 1.090E-04 | global batch size: 256 | lm loss: 1.997033E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.593 | TFLOPs: 19.52 | 31: iteration 88230/ 173500 | consumed samples: 22586880 | consumed tokens: 46257930240 | elapsed time per iteration (s): 0.82 | learning rate: 
1.090E-04 | global batch size: 256 | lm loss: 2.028127E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.048 | TFLOPs: 18.82 | 31: iteration 88240/ 173500 | consumed samples: 22589440 | consumed tokens: 46263173120 | elapsed time per iteration (s): 0.80 | learning rate: 1.090E-04 | global batch size: 256 | lm loss: 1.972142E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.188 | TFLOPs: 19.25 | 31: iteration 88250/ 173500 | consumed samples: 22592000 | consumed tokens: 46268416000 | elapsed time per iteration (s): 0.82 | learning rate: 1.090E-04 | global batch size: 256 | lm loss: 2.031558E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.792 | TFLOPs: 18.80 | 31: iteration 88260/ 173500 | consumed samples: 22594560 | consumed tokens: 46273658880 | elapsed time per iteration (s): 0.80 | learning rate: 1.089E-04 | global batch size: 256 | lm loss: 1.991658E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.990 | TFLOPs: 19.42 | 31: iteration 88270/ 173500 | consumed samples: 22597120 | consumed tokens: 46278901760 | elapsed time per iteration (s): 0.83 | learning rate: 1.089E-04 | global batch size: 256 | lm loss: 1.982728E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.893 | TFLOPs: 18.57 | 31: iteration 88280/ 173500 | consumed samples: 22599680 | consumed tokens: 46284144640 | elapsed time per iteration (s): 0.82 | learning rate: 1.089E-04 | global batch size: 256 | lm loss: 2.005154E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.922 | TFLOPs: 18.99 | 31: iteration 88290/ 173500 | 
consumed samples: 22602240 | consumed tokens: 46289387520 | elapsed time per iteration (s): 0.84 | learning rate: 1.089E-04 | global batch size: 256 | lm loss: 2.012375E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.246 | TFLOPs: 18.41 | 31: iteration 88300/ 173500 | consumed samples: 22604800 | consumed tokens: 46294630400 | elapsed time per iteration (s): 0.85 | learning rate: 1.089E-04 | global batch size: 256 | lm loss: 2.012795E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.032 | TFLOPs: 18.27 | 31: iteration 88310/ 173500 | consumed samples: 22607360 | consumed tokens: 46299873280 | elapsed time per iteration (s): 0.83 | learning rate: 1.089E-04 | global batch size: 256 | lm loss: 1.981055E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.110 | TFLOPs: 18.64 | 31: iteration 88320/ 173500 | consumed samples: 22609920 | consumed tokens: 46305116160 | elapsed time per iteration (s): 0.78 | learning rate: 1.088E-04 | global batch size: 256 | lm loss: 1.955922E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.784 | TFLOPs: 19.89 | 31: iteration 88330/ 173500 | consumed samples: 22612480 | consumed tokens: 46310359040 | elapsed time per iteration (s): 0.85 | learning rate: 1.088E-04 | global batch size: 256 | lm loss: 1.997628E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.940 | TFLOPs: 18.33 | 31: iteration 88340/ 173500 | consumed samples: 22615040 | consumed tokens: 46315601920 | elapsed time per iteration (s): 0.75 | learning rate: 1.088E-04 | global batch size: 256 | lm loss: 1.962516E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 343.043 | TFLOPs: 20.75 | 31: iteration 88350/ 173500 | consumed samples: 22617600 | consumed tokens: 46320844800 | elapsed time per iteration (s): 0.75 | learning rate: 1.088E-04 | global batch size: 256 | lm loss: 1.962416E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.017 | TFLOPs: 20.63 | 31: iteration 88360/ 173500 | consumed samples: 22620160 | consumed tokens: 46326087680 | elapsed time per iteration (s): 0.74 | learning rate: 1.088E-04 | global batch size: 256 | lm loss: 1.995019E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.457 | TFLOPs: 21.02 | 31: iteration 88370/ 173500 | consumed samples: 22622720 | consumed tokens: 46331330560 | elapsed time per iteration (s): 0.77 | learning rate: 1.088E-04 | global batch size: 256 | lm loss: 2.007942E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.805 | TFLOPs: 20.07 | 31: iteration 88380/ 173500 | consumed samples: 22625280 | consumed tokens: 46336573440 | elapsed time per iteration (s): 0.76 | learning rate: 1.087E-04 | global batch size: 256 | lm loss: 1.983474E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.975 | TFLOPs: 20.39 | 31: iteration 88390/ 173500 | consumed samples: 22627840 | consumed tokens: 46341816320 | elapsed time per iteration (s): 0.78 | learning rate: 1.087E-04 | global batch size: 256 | lm loss: 2.014569E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.059 | TFLOPs: 19.97 | 31: iteration 88400/ 173500 | consumed samples: 22630400 | consumed tokens: 46347059200 | elapsed time per iteration (s): 0.81 | learning rate: 1.087E-04 | global batch 
size: 256 | lm loss: 1.964480E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.918 | TFLOPs: 19.23 | 31: iteration 88410/ 173500 | consumed samples: 22632960 | consumed tokens: 46352302080 | elapsed time per iteration (s): 0.76 | learning rate: 1.087E-04 | global batch size: 256 | lm loss: 1.968718E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.680 | TFLOPs: 20.31 | 31: iteration 88420/ 173500 | consumed samples: 22635520 | consumed tokens: 46357544960 | elapsed time per iteration (s): 0.82 | learning rate: 1.087E-04 | global batch size: 256 | lm loss: 1.996551E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.693 | TFLOPs: 18.86 | 31: iteration 88430/ 173500 | consumed samples: 22638080 | consumed tokens: 46362787840 | elapsed time per iteration (s): 0.87 | learning rate: 1.087E-04 | global batch size: 256 | lm loss: 1.985559E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.450 | TFLOPs: 17.87 | 31: iteration 88440/ 173500 | consumed samples: 22640640 | consumed tokens: 46368030720 | elapsed time per iteration (s): 0.78 | learning rate: 1.086E-04 | global batch size: 256 | lm loss: 1.965235E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.952 | TFLOPs: 19.84 | 31: iteration 88450/ 173500 | consumed samples: 22643200 | consumed tokens: 46373273600 | elapsed time per iteration (s): 0.81 | learning rate: 1.086E-04 | global batch size: 256 | lm loss: 2.000948E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.829 | TFLOPs: 19.05 | 31: iteration 88460/ 173500 | consumed samples: 22645760 | 
consumed tokens: 46378516480 | elapsed time per iteration (s): 0.77 | learning rate: 1.086E-04 | global batch size: 256 | lm loss: 2.007759E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.832 | TFLOPs: 20.08 | 31: iteration 88470/ 173500 | consumed samples: 22648320 | consumed tokens: 46383759360 | elapsed time per iteration (s): 0.82 | learning rate: 1.086E-04 | global batch size: 256 | lm loss: 2.029774E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.352 | TFLOPs: 18.78 | 31: iteration 88480/ 173500 | consumed samples: 22650880 | consumed tokens: 46389002240 | elapsed time per iteration (s): 0.79 | learning rate: 1.086E-04 | global batch size: 256 | lm loss: 2.000825E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.211 | TFLOPs: 19.61 | 31: iteration 88490/ 173500 | consumed samples: 22653440 | consumed tokens: 46394245120 | elapsed time per iteration (s): 0.78 | learning rate: 1.086E-04 | global batch size: 256 | lm loss: 1.990988E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.815 | TFLOPs: 19.77 | 31: iteration 88500/ 173500 | consumed samples: 22656000 | consumed tokens: 46399488000 | elapsed time per iteration (s): 0.78 | learning rate: 1.085E-04 | global batch size: 256 | lm loss: 2.033891E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.817 | TFLOPs: 19.77 | 31: iteration 88510/ 173500 | consumed samples: 22658560 | consumed tokens: 46404730880 | elapsed time per iteration (s): 0.77 | learning rate: 1.085E-04 | global batch size: 256 | lm loss: 1.993084E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 333.694 | TFLOPs: 20.19 | 31: iteration 88520/ 173500 | consumed samples: 22661120 | consumed tokens: 46409973760 | elapsed time per iteration (s): 0.77 | learning rate: 1.085E-04 | global batch size: 256 | lm loss: 1.984768E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.076 | TFLOPs: 20.21 | 31: iteration 88530/ 173500 | consumed samples: 22663680 | consumed tokens: 46415216640 | elapsed time per iteration (s): 0.76 | learning rate: 1.085E-04 | global batch size: 256 | lm loss: 1.973933E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.441 | TFLOPs: 20.29 | 31: iteration 88540/ 173500 | consumed samples: 22666240 | consumed tokens: 46420459520 | elapsed time per iteration (s): 0.75 | learning rate: 1.085E-04 | global batch size: 256 | lm loss: 2.005453E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.756 | TFLOPs: 20.74 | 31: iteration 88550/ 173500 | consumed samples: 22668800 | consumed tokens: 46425702400 | elapsed time per iteration (s): 0.83 | learning rate: 1.085E-04 | global batch size: 256 | lm loss: 2.024329E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.602 | TFLOPs: 18.73 | 31: iteration 88560/ 173500 | consumed samples: 22671360 | consumed tokens: 46430945280 | elapsed time per iteration (s): 0.73 | learning rate: 1.084E-04 | global batch size: 256 | lm loss: 1.983843E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.573 | TFLOPs: 21.15 | 31: iteration 88570/ 173500 | consumed samples: 22673920 | consumed tokens: 46436188160 | elapsed time per iteration (s): 0.78 | learning rate: 1.084E-04 | global batch size: 256 | lm loss: 
1.983430E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.943 | TFLOPs: 19.84 | 31: iteration 88580/ 173500 | consumed samples: 22676480 | consumed tokens: 46441431040 | elapsed time per iteration (s): 0.79 | learning rate: 1.084E-04 | global batch size: 256 | lm loss: 2.013671E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.645 | TFLOPs: 19.70 | 31: iteration 88590/ 173500 | consumed samples: 22679040 | consumed tokens: 46446673920 | elapsed time per iteration (s): 0.78 | learning rate: 1.084E-04 | global batch size: 256 | lm loss: 1.972455E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.055 | TFLOPs: 19.91 | 31: iteration 88600/ 173500 | consumed samples: 22681600 | consumed tokens: 46451916800 | elapsed time per iteration (s): 0.75 | learning rate: 1.084E-04 | global batch size: 256 | lm loss: 1.970229E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.540 | TFLOPs: 20.54 | 31: iteration 88610/ 173500 | consumed samples: 22684160 | consumed tokens: 46457159680 | elapsed time per iteration (s): 0.78 | learning rate: 1.084E-04 | global batch size: 256 | lm loss: 1.986817E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.941 | TFLOPs: 19.90 | 31: iteration 88620/ 173500 | consumed samples: 22686720 | consumed tokens: 46462402560 | elapsed time per iteration (s): 0.77 | learning rate: 1.084E-04 | global batch size: 256 | lm loss: 1.998767E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.286 | TFLOPs: 20.04 | 31: iteration 88630/ 173500 | consumed samples: 22689280 | consumed tokens: 
46467645440 | elapsed time per iteration (s): 0.81 | learning rate: 1.083E-04 | global batch size: 256 | lm loss: 2.010100E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.517 | TFLOPs: 19.09 | 31: iteration 88640/ 173500 | consumed samples: 22691840 | consumed tokens: 46472888320 | elapsed time per iteration (s): 0.75 | learning rate: 1.083E-04 | global batch size: 256 | lm loss: 1.995915E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.886 | TFLOPs: 20.56 | 31: iteration 88650/ 173500 | consumed samples: 22694400 | consumed tokens: 46478131200 | elapsed time per iteration (s): 0.75 | learning rate: 1.083E-04 | global batch size: 256 | lm loss: 1.977204E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.522 | TFLOPs: 20.60 | 31: iteration 88660/ 173500 | consumed samples: 22696960 | consumed tokens: 46483374080 | elapsed time per iteration (s): 0.78 | learning rate: 1.083E-04 | global batch size: 256 | lm loss: 1.997280E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.299 | TFLOPs: 19.92 | 31: iteration 88670/ 173500 | consumed samples: 22699520 | consumed tokens: 46488616960 | elapsed time per iteration (s): 0.74 | learning rate: 1.083E-04 | global batch size: 256 | lm loss: 2.000503E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.506 | TFLOPs: 21.02 | 31: iteration 88680/ 173500 | consumed samples: 22702080 | consumed tokens: 46493859840 | elapsed time per iteration (s): 0.73 | learning rate: 1.083E-04 | global batch size: 256 | lm loss: 2.002185E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 348.338 | TFLOPs: 21.07 | 31: iteration 88690/ 173500 | consumed samples: 22704640 | consumed tokens: 46499102720 | elapsed time per iteration (s): 0.79 | learning rate: 1.082E-04 | global batch size: 256 | lm loss: 1.970695E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.340 | TFLOPs: 19.62 | 31: iteration 88700/ 173500 | consumed samples: 22707200 | consumed tokens: 46504345600 | elapsed time per iteration (s): 0.76 | learning rate: 1.082E-04 | global batch size: 256 | lm loss: 1.995427E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.676 | TFLOPs: 20.37 | 31: iteration 88710/ 173500 | consumed samples: 22709760 | consumed tokens: 46509588480 | elapsed time per iteration (s): 0.76 | learning rate: 1.082E-04 | global batch size: 256 | lm loss: 1.986954E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.601 | TFLOPs: 20.42 | 31: iteration 88720/ 173500 | consumed samples: 22712320 | consumed tokens: 46514831360 | elapsed time per iteration (s): 0.73 | learning rate: 1.082E-04 | global batch size: 256 | lm loss: 1.965664E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.719 | TFLOPs: 21.28 | 31: iteration 88730/ 173500 | consumed samples: 22714880 | consumed tokens: 46520074240 | elapsed time per iteration (s): 0.76 | learning rate: 1.082E-04 | global batch size: 256 | lm loss: 2.015713E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.991 | TFLOPs: 20.27 | 31: iteration 88740/ 173500 | consumed samples: 22717440 | consumed tokens: 46525317120 | elapsed time per iteration (s): 0.81 | learning rate: 1.082E-04 | global batch size: 256 | lm loss: 1.996326E+00 | grad 
norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.023 | TFLOPs: 19.18 | 31: iteration 88750/ 173500 | consumed samples: 22720000 | consumed tokens: 46530560000 | elapsed time per iteration (s): 0.75 | learning rate: 1.081E-04 | global batch size: 256 | lm loss: 1.990557E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.740 | TFLOPs: 20.67 | 31: iteration 88760/ 173500 | consumed samples: 22722560 | consumed tokens: 46535802880 | elapsed time per iteration (s): 0.77 | learning rate: 1.081E-04 | global batch size: 256 | lm loss: 1.995425E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.111 | TFLOPs: 20.03 | 31: iteration 88770/ 173500 | consumed samples: 22725120 | consumed tokens: 46541045760 | elapsed time per iteration (s): 0.96 | learning rate: 1.081E-04 | global batch size: 256 | lm loss: 1.994996E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 267.389 | TFLOPs: 16.18 | 31: iteration 88780/ 173500 | consumed samples: 22727680 | consumed tokens: 46546288640 | elapsed time per iteration (s): 0.76 | learning rate: 1.081E-04 | global batch size: 256 | lm loss: 1.991471E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.802 | TFLOPs: 20.38 | 31: iteration 88790/ 173500 | consumed samples: 22730240 | consumed tokens: 46551531520 | elapsed time per iteration (s): 0.78 | learning rate: 1.081E-04 | global batch size: 256 | lm loss: 1.968650E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.439 | TFLOPs: 19.87 | 31: iteration 88800/ 173500 | consumed samples: 22732800 | consumed tokens: 46556774400 | elapsed time 
per iteration (s): 0.74 | learning rate: 1.081E-04 | global batch size: 256 | lm loss: 2.021750E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.026 | TFLOPs: 20.99 | 31: iteration 88810/ 173500 | consumed samples: 22735360 | consumed tokens: 46562017280 | elapsed time per iteration (s): 0.79 | learning rate: 1.080E-04 | global batch size: 256 | lm loss: 1.987844E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.024 | TFLOPs: 19.72 | 31: iteration 88820/ 173500 | consumed samples: 22737920 | consumed tokens: 46567260160 | elapsed time per iteration (s): 0.82 | learning rate: 1.080E-04 | global batch size: 256 | lm loss: 2.013901E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.666 | TFLOPs: 18.98 | 31: iteration 88830/ 173500 | consumed samples: 22740480 | consumed tokens: 46572503040 | elapsed time per iteration (s): 0.80 | learning rate: 1.080E-04 | global batch size: 256 | lm loss: 1.982253E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.725 | TFLOPs: 19.28 | 31: iteration 88840/ 173500 | consumed samples: 22743040 | consumed tokens: 46577745920 | elapsed time per iteration (s): 0.72 | learning rate: 1.080E-04 | global batch size: 256 | lm loss: 2.016419E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.042 | TFLOPs: 21.48 | 31: iteration 88850/ 173500 | consumed samples: 22745600 | consumed tokens: 46582988800 | elapsed time per iteration (s): 0.76 | learning rate: 1.080E-04 | global batch size: 256 | lm loss: 1.989842E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.579 | TFLOPs: 
20.36 | 31: iteration 88860/ 173500 | consumed samples: 22748160 | consumed tokens: 46588231680 | elapsed time per iteration (s): 0.75 | learning rate: 1.080E-04 | global batch size: 256 | lm loss: 1.972838E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.515 | TFLOPs: 20.60 | 31: iteration 88870/ 173500 | consumed samples: 22750720 | consumed tokens: 46593474560 | elapsed time per iteration (s): 0.75 | learning rate: 1.079E-04 | global batch size: 256 | lm loss: 1.988971E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.521 | TFLOPs: 20.66 | 31: iteration 88880/ 173500 | consumed samples: 22753280 | consumed tokens: 46598717440 | elapsed time per iteration (s): 0.78 | learning rate: 1.079E-04 | global batch size: 256 | lm loss: 1.974340E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.594 | TFLOPs: 19.94 | 31: iteration 88890/ 173500 | consumed samples: 22755840 | consumed tokens: 46603960320 | elapsed time per iteration (s): 0.77 | learning rate: 1.079E-04 | global batch size: 256 | lm loss: 2.012893E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.576 | TFLOPs: 20.12 | 31: iteration 88900/ 173500 | consumed samples: 22758400 | consumed tokens: 46609203200 | elapsed time per iteration (s): 0.82 | learning rate: 1.079E-04 | global batch size: 256 | lm loss: 2.011890E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.194 | TFLOPs: 18.95 | 31: iteration 88910/ 173500 | consumed samples: 22760960 | consumed tokens: 46614446080 | elapsed time per iteration (s): 0.82 | learning rate: 1.079E-04 | global batch size: 256 | lm loss: 1.981293E+00 | grad norm: 0.147 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.023 | TFLOPs: 18.88 | 31: iteration 88920/ 173500 | consumed samples: 22763520 | consumed tokens: 46619688960 | elapsed time per iteration (s): 0.85 | learning rate: 1.079E-04 | global batch size: 256 | lm loss: 1.994292E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.809 | TFLOPs: 18.26 | 31: iteration 88930/ 173500 | consumed samples: 22766080 | consumed tokens: 46624931840 | elapsed time per iteration (s): 0.80 | learning rate: 1.078E-04 | global batch size: 256 | lm loss: 1.983113E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.323 | TFLOPs: 19.26 | 31: iteration 88940/ 173500 | consumed samples: 22768640 | consumed tokens: 46630174720 | elapsed time per iteration (s): 0.76 | learning rate: 1.078E-04 | global batch size: 256 | lm loss: 1.983168E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.121 | TFLOPs: 20.27 | 31: iteration 88950/ 173500 | consumed samples: 22771200 | consumed tokens: 46635417600 | elapsed time per iteration (s): 0.79 | learning rate: 1.078E-04 | global batch size: 256 | lm loss: 2.012194E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.874 | TFLOPs: 19.53 | 31: iteration 88960/ 173500 | consumed samples: 22773760 | consumed tokens: 46640660480 | elapsed time per iteration (s): 0.76 | learning rate: 1.078E-04 | global batch size: 256 | lm loss: 1.980038E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.468 | TFLOPs: 20.48 | 31: iteration 88970/ 173500 | consumed samples: 22776320 | consumed tokens: 46645903360 | elapsed time per iteration (s): 0.73 | 
learning rate: 1.078E-04 | global batch size: 256 | lm loss: 1.986252E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.652 | TFLOPs: 21.27 | 31: iteration 88980/ 173500 | consumed samples: 22778880 | consumed tokens: 46651146240 | elapsed time per iteration (s): 0.77 | learning rate: 1.078E-04 | global batch size: 256 | lm loss: 1.967923E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.664 | TFLOPs: 20.00 | 31: iteration 88990/ 173500 | consumed samples: 22781440 | consumed tokens: 46656389120 | elapsed time per iteration (s): 0.74 | learning rate: 1.077E-04 | global batch size: 256 | lm loss: 1.989605E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.353 | TFLOPs: 20.83 | 31: iteration 89000/ 173500 | consumed samples: 22784000 | consumed tokens: 46661632000 | elapsed time per iteration (s): 0.74 | learning rate: 1.077E-04 | global batch size: 256 | lm loss: 1.956371E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.094 | TFLOPs: 21.06 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 89000 | lm loss value: 1.859740E+00 | lm loss PPL: 6.422067E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 89000 to checkpoints_1b1long 0: [2022-11-26 14:14:17,802] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step89000 is begin to save! 0: [2022-11-26 14:14:17,814] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 14:14:18,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_01-model_00-model_states.pt. 0: [2022-11-26 14:14:18,057] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_03-model_00-model_states.pt... 0: [2022-11-26 14:14:18,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_03-model_00-model_states.pt. 0: [2022-11-26 14:14:18,143] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_04-model_00-model_states.pt... 0: [2022-11-26 14:14:18,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_04-model_00-model_states.pt. 0: [2022-11-26 14:14:18,220] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_05-model_00-model_states.pt... 0: [2022-11-26 14:14:18,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_05-model_00-model_states.pt. 0: [2022-11-26 14:14:18,297] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_06-model_00-model_states.pt... 0: [2022-11-26 14:14:18,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_06-model_00-model_states.pt. 0: [2022-11-26 14:14:18,370] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_07-model_00-model_states.pt... 0: [2022-11-26 14:14:18,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_07-model_00-model_states.pt. 0: [2022-11-26 14:14:18,447] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 14:14:18,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_08-model_00-model_states.pt. 0: [2022-11-26 14:14:18,527] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_09-model_00-model_states.pt... 0: [2022-11-26 14:14:18,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_09-model_00-model_states.pt. 0: [2022-11-26 14:14:18,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_10-model_00-model_states.pt... 0: [2022-11-26 14:14:18,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_10-model_00-model_states.pt. 0: [2022-11-26 14:14:18,676] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_11-model_00-model_states.pt... 0: [2022-11-26 14:14:18,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_11-model_00-model_states.pt. 0: [2022-11-26 14:14:18,752] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_12-model_00-model_states.pt... 0: [2022-11-26 14:14:18,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_12-model_00-model_states.pt. 0: [2022-11-26 14:14:18,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_13-model_00-model_states.pt... 0: [2022-11-26 14:14:18,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_13-model_00-model_states.pt. 0: [2022-11-26 14:14:18,904] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 14:14:18,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_14-model_00-model_states.pt. 0: [2022-11-26 14:14:18,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_15-model_00-model_states.pt... 0: [2022-11-26 14:14:19,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_15-model_00-model_states.pt. 0: [2022-11-26 14:14:19,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_16-model_00-model_states.pt... 0: [2022-11-26 14:14:19,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_16-model_00-model_states.pt. 0: [2022-11-26 14:14:19,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_17-model_00-model_states.pt... 0: [2022-11-26 14:14:19,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_17-model_00-model_states.pt. 0: [2022-11-26 14:14:19,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_18-model_00-model_states.pt... 0: [2022-11-26 14:14:19,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_18-model_00-model_states.pt. 0: [2022-11-26 14:14:19,278] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_19-model_00-model_states.pt... 0: [2022-11-26 14:14:19,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_19-model_00-model_states.pt. 0: [2022-11-26 14:14:19,355] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 14:14:19,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_20-model_00-model_states.pt. 0: [2022-11-26 14:14:19,427] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_21-model_00-model_states.pt... 0: [2022-11-26 14:14:19,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_21-model_00-model_states.pt. 0: [2022-11-26 14:14:19,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_22-model_00-model_states.pt... 0: [2022-11-26 14:14:19,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_22-model_00-model_states.pt. 0: [2022-11-26 14:14:19,574] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_23-model_00-model_states.pt... 0: [2022-11-26 14:14:19,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_23-model_00-model_states.pt. 0: [2022-11-26 14:14:19,650] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_24-model_00-model_states.pt... 0: [2022-11-26 14:14:19,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_24-model_00-model_states.pt. 0: [2022-11-26 14:14:19,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_25-model_00-model_states.pt... 0: [2022-11-26 14:14:19,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_25-model_00-model_states.pt. 0: [2022-11-26 14:14:19,800] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 14:14:19,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_26-model_00-model_states.pt. 0: [2022-11-26 14:14:19,875] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_27-model_00-model_states.pt... 0: [2022-11-26 14:14:19,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_27-model_00-model_states.pt. 0: [2022-11-26 14:14:19,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_28-model_00-model_states.pt... 0: [2022-11-26 14:14:20,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_28-model_00-model_states.pt. 0: [2022-11-26 14:14:20,024] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/layer_30-model_00-model_states.pt... 0: [2022-11-26 14:14:20,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/layer_30-model_00-model_states.pt. 0: [2022-11-26 14:14:20,026] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step89000/mp_rank_00_model_states.pt 0: [2022-11-26 14:14:20,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/mp_rank_00_model_states.pt... 0: [2022-11-26 14:14:20,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/mp_rank_00_model_states.pt. 0: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
12: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 
19: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
7: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 
8: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
16: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 
13: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
15: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 
11: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
31: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 
30: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 
19: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
5: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
10: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 
3: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 
23: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 
14: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
26: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
8: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
15: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 
29: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 
26: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
3: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 
26: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 
29: [2022-11-26 14:14:20,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:14:20,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:14:20,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 14:14:20,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 20: [2022-11-26 14:14:20,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:14:20,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 14:14:20,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 19: [2022-11-26 14:14:20,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:14:20,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 14:14:20,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 9: [2022-11-26 14:14:20,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 14:14:20,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 14:14:20,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 12: [2022-11-26 14:14:20,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:14:20,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 14:14:20,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 30: [2022-11-26 14:14:20,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:14:20,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 14:14:20,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 19: [2022-11-26 14:14:20,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:14:20,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 14:14:20,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 28: [2022-11-26 14:14:20,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 
5: [2022-11-26 14:14:20,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:14:20,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 14:14:20,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 28: [2022-11-26 14:14:20,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 14:14:20,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 9: [2022-11-26 14:14:20,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:14:20,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 14:14:20,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 0: [2022-11-26 14:14:20,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:14:20,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 14:14:20,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 11: [2022-11-26 14:14:20,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 14:14:20,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 14:14:20,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 15: [2022-11-26 14:14:20,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:14:20,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 14:14:20,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 12: [2022-11-26 14:14:20,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:14:20,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 14:14:20,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 20: [2022-11-26 14:14:20,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:14:20,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 14:14:20,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 30: [2022-11-26 14:14:20,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
8: [2022-11-26 14:14:20,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:14:20,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 14:14:20,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 8: [2022-11-26 14:14:20,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 10: [2022-11-26 14:14:20,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:14:20,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 10: [2022-11-26 14:14:20,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 14:14:20,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 1: [2022-11-26 14:14:20,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:14:20,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 14:14:20,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 22: [2022-11-26 14:14:20,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
22: [2022-11-26 14:14:20,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 14:14:20,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 22: [2022-11-26 14:14:20,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:14:20,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 14:14:20,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 29: [2022-11-26 14:14:20,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:14:20,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 14:14:20,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 25: [2022-11-26 14:14:20,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:14:20,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 17: [2022-11-26 14:14:20,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 
24: [2022-11-26 14:14:20,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:14:20,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:14:20,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:14:20,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 20: [2022-11-26 14:14:20,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 25: [2022-11-26 14:14:20,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 24: [2022-11-26 14:14:20,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 14: [2022-11-26 14:14:20,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 17: [2022-11-26 14:14:20,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 24: [2022-11-26 14:14:20,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 17: [2022-11-26 14:14:20,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 20: [2022-11-26 14:14:20,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 
12: [2022-11-26 14:14:20,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:14:20,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 14:14:20,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 14: [2022-11-26 14:14:20,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:14:20,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:14:20,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 14: [2022-11-26 14:14:20,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 8: [2022-11-26 14:14:20,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 14: [2022-11-26 14:14:20,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 25: [2022-11-26 14:14:20,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:14:20,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
25: [2022-11-26 14:14:20,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 5: [2022-11-26 14:14:20,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 21: [2022-11-26 14:14:20,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:14:20,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 25: [2022-11-26 14:14:20,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 21: [2022-11-26 14:14:20,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 14:14:20,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 24: [2022-11-26 14:14:20,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:14:20,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 26: [2022-11-26 14:14:20,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:14:20,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 
26: [2022-11-26 14:14:20,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 14:14:20,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 6: [2022-11-26 14:14:20,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:14:20,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 14:14:20,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 25: [2022-11-26 14:14:20,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:14:20,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:14:20,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 
27: [2022-11-26 14:14:20,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 14:14:20,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 25: [2022-11-26 14:14:20,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 23: [2022-11-26 14:14:20,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:14:20,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 27: [2022-11-26 14:14:20,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 25: [2022-11-26 14:14:20,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 23: [2022-11-26 14:14:20,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 14:14:20,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:14:20,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 26: [2022-11-26 14:14:20,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
23: [2022-11-26 14:14:20,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 14:14:20,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 7: [2022-11-26 14:14:20,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:14:20,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:14:20,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 7: [2022-11-26 14:14:20,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 26: [2022-11-26 14:14:20,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 7: [2022-11-26 14:14:20,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 7: [2022-11-26 14:14:20,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 14:14:20,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 15: [2022-11-26 14:14:20,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 14:14:20,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 14:14:20,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 29: [2022-11-26 14:14:20,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:14:20,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 9: [2022-11-26 14:14:20,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:14:20,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:14:20,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 14:14:20,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 29: [2022-11-26 14:14:20,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 9: [2022-11-26 14:14:20,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 9: [2022-11-26 14:14:20,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 29: [2022-11-26 14:14:20,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 
29: [2022-11-26 14:14:20,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 19: [2022-11-26 14:14:20,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:14:20,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 14:14:20,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 13: [2022-11-26 14:14:20,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:14:20,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 14:14:20,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 29: [2022-11-26 14:14:20,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 6: [2022-11-26 14:14:20,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:14:20,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:14:20,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 14:14:20,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 
1: [2022-11-26 14:14:20,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 14:14:20,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 13: [2022-11-26 14:14:20,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:14:20,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 14:14:20,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 10: [2022-11-26 14:14:20,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:14:20,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 14:14:20,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 14: [2022-11-26 14:14:20,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:14:20,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 14:14:20,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 30: [2022-11-26 14:14:20,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-26 14:14:20,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 14:14:20,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 17: [2022-11-26 14:14:20,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:14:20,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 14:14:20,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:14:20,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 17: [2022-11-26 14:14:20,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 6: [2022-11-26 14:14:20,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:14:20,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 28: [2022-11-26 14:14:20,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:14:20,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 14:14:20,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 
17: [2022-11-26 14:14:20,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:14:20,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 14:14:20,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 28: [2022-11-26 14:14:20,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 14:14:20,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 13: [2022-11-26 14:14:20,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:14:20,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 14:14:20,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 14: [2022-11-26 14:14:20,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:14:20,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 14:14:20,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 24: [2022-11-26 14:14:20,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-26 14:14:20,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 12: [2022-11-26 14:14:20,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:14:20,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:14:20,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 12: [2022-11-26 14:14:20,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 14:14:20,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 28: [2022-11-26 14:14:20,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:14:20,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:14:20,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 28: [2022-11-26 14:14:20,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 15: [2022-11-26 14:14:20,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
28: [2022-11-26 14:14:20,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 21: [2022-11-26 14:14:20,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 15: [2022-11-26 14:14:20,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 21: [2022-11-26 14:14:20,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 15: [2022-11-26 14:14:20,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 20: [2022-11-26 14:14:20,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 22: [2022-11-26 14:14:20,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:14:20,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 14:14:20,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 7: [2022-11-26 14:14:20,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:14:20,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:14:20,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 14:14:20,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:14:20,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:14:20,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 14:14:20,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 7: [2022-11-26 14:14:20,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 8: [2022-11-26 14:14:20,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:14:20,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 3: [2022-11-26 14:14:20,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 14:14:20,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 3: [2022-11-26 14:14:20,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 7: [2022-11-26 14:14:20,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 8: [2022-11-26 14:14:20,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 
3: [2022-11-26 14:14:20,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 8: [2022-11-26 14:14:20,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 14:14:20,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 19: [2022-11-26 14:14:20,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:14:20,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 14:14:20,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 0: [2022-11-26 14:14:20,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:14:20,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 14:14:20,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 1: [2022-11-26 14:14:20,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:14:20,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 14:14:20,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 14:14:20,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 14:14:20,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 1: [2022-11-26 14:14:20,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 15: [2022-11-26 14:14:20,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:14:20,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 14:14:20,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 5: [2022-11-26 14:14:20,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:14:20,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:14:20,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 14:14:20,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 26: [2022-11-26 14:14:20,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
10: [2022-11-26 14:14:20,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:14:20,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 14:14:20,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 26: [2022-11-26 14:14:20,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 14:14:20,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 29: [2022-11-26 14:14:20,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:14:20,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 14:14:20,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 10: [2022-11-26 14:14:20,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:14:20,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
10: [2022-11-26 14:14:20,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 26: [2022-11-26 14:14:20,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 10: [2022-11-26 14:14:20,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 26: [2022-11-26 14:14:20,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 28: [2022-11-26 14:14:20,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:14:20,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:14:20,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 14:14:20,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 7: [2022-11-26 14:14:20,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:14:20,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 14:14:20,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 30: [2022-11-26 14:14:20,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 
30: [2022-11-26 14:14:20,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 14:14:20,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 25: [2022-11-26 14:14:20,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:14:20,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:14:20,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 22: [2022-11-26 14:14:20,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 25: [2022-11-26 14:14:20,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 22: [2022-11-26 14:14:20,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 0: [2022-11-26 14:14:20,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:14:20,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:14:20,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 14:14:20,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 
27: [2022-11-26 14:14:20,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:14:20,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 14:14:20,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 19: [2022-11-26 14:14:20,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:14:20,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 14:14:20,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 5: [2022-11-26 14:14:20,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 14:14:20,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 5: [2022-11-26 14:14:20,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:14:20,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 3: [2022-11-26 14:14:20,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:14:20,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 
3: [2022-11-26 14:14:20,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 14:14:20,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 28: [2022-11-26 14:14:20,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 14:14:20,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 23: [2022-11-26 14:14:20,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:14:20,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 14:14:20,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 23: [2022-11-26 14:14:20,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:14:20,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 14:14:20,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 21: [2022-11-26 14:14:20,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-26 14:14:20,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 14:14:20,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 21: [2022-11-26 14:14:20,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:14:20,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 14:14:20,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 0: [2022-11-26 14:14:20,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 14:14:20,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 18: [2022-11-26 14:14:20,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:14:20,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:14:20,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:14:20,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
18: [2022-11-26 14:14:20,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 13: [2022-11-26 14:14:20,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:14:20,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 14:14:20,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 13: [2022-11-26 14:14:20,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 18: [2022-11-26 14:14:20,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 13: [2022-11-26 14:14:20,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 18: [2022-11-26 14:14:20,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 18: [2022-11-26 14:14:20,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 18: [2022-11-26 14:14:20,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 18: [2022-11-26 14:14:20,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 11: [2022-11-26 14:14:20,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-26 14:14:20,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 14:14:20,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 11: [2022-11-26 14:14:20,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:14:20,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 14:14:20,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 11: [2022-11-26 14:14:20,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:14:20,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 14:14:20,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 20: [2022-11-26 14:14:20,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:14:20,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 6: [2022-11-26 14:14:20,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:14:20,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 
6: [2022-11-26 14:14:20,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 14:14:20,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 6: [2022-11-26 14:14:20,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:14:20,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 14:14:20,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 11: [2022-11-26 14:14:20,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:14:20,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 14:14:20,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 12: [2022-11-26 14:14:20,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:14:20,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 14:14:20,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 17: [2022-11-26 14:14:20,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 
17: [2022-11-26 14:14:20,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 14:14:20,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 9: [2022-11-26 14:14:20,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:14:20,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 23: [2022-11-26 14:14:20,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:14:20,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 23: [2022-11-26 14:14:20,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 14:14:20,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 16: [2022-11-26 14:14:20,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:14:20,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:14:20,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:14:20,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-26 14:14:20,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 14:14:20,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 14:14:20,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 16: [2022-11-26 14:14:20,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 14:14:20,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 14:14:20,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 16: [2022-11-26 14:14:20,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 16: [2022-11-26 14:14:20,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 31: [2022-11-26 14:14:20,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:14:20,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:14:20,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:14:20,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 
31: [2022-11-26 14:14:20,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:14:20,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 14:14:20,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 14:14:20,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 14:14:20,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 14:14:20,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 14:14:20,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 31: [2022-11-26 14:14:20,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 31: [2022-11-26 14:14:20,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 31: [2022-11-26 14:14:20,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 31: [2022-11-26 14:14:20,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 16: [2022-11-26 14:14:20,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
16: [2022-11-26 14:14:20,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 14:14:20,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 30: [2022-11-26 14:14:20,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:14:20,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 14:14:20,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 13: [2022-11-26 14:14:20,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:14:20,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 14:14:20,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 0: [2022-11-26 14:14:20,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:14:20,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 14:14:20,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 7: [2022-11-26 14:14:20,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 14:14:20,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 14:14:20,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 29: [2022-11-26 14:14:20,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:14:20,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 14:14:20,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 22: [2022-11-26 14:14:20,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:14:20,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 14:14:20,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 3: [2022-11-26 14:14:20,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:14:20,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 14:14:20,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 18: [2022-11-26 14:14:20,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
18: [2022-11-26 14:14:20,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 14:14:20,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 27: [2022-11-26 14:14:20,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:14:20,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 1: [2022-11-26 14:14:20,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:14:20,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 1: [2022-11-26 14:14:20,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 14:14:20,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 8: [2022-11-26 14:14:20,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:14:20,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 14:14:20,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 25: [2022-11-26 14:14:20,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 
25: [2022-11-26 14:14:20,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 26: [2022-11-26 14:14:20,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:14:20,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 25: [2022-11-26 14:14:20,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 26: [2022-11-26 14:14:20,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 28: [2022-11-26 14:14:20,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 14:14:20,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 14:14:20,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 24: [2022-11-26 14:14:20,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:14:20,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 14:14:20,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 21: [2022-11-26 14:14:20,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-26 14:14:20,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 14:14:20,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 14: [2022-11-26 14:14:20,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:14:20,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 14:14:20,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 5: [2022-11-26 14:14:20,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:14:20,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 14:14:20,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 6: [2022-11-26 14:14:20,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:14:20,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 14:14:20,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 15: [2022-11-26 14:14:20,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 14:14:20,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 14:14:20,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 20: [2022-11-26 14:14:20,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:14:20,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 14:14:20,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 11: [2022-11-26 14:14:20,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:14:20,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 14:14:20,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 9: [2022-11-26 14:14:20,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:14:20,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 14:14:20,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 12: [2022-11-26 14:14:20,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 14:14:20,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 14:14:20,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 0: [2022-11-26 14:14:20,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:14:20,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 14:14:20,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 31: [2022-11-26 14:14:20,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:14:20,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 14:14:20,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 16: [2022-11-26 14:14:20,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:14:20,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 14:14:20,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 23: [2022-11-26 14:14:20,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
23: [2022-11-26 14:14:20,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 14:14:20,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 29: [2022-11-26 14:14:20,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:14:20,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 14:14:20,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 22: [2022-11-26 14:14:20,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:14:20,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 14:14:20,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 19: [2022-11-26 14:14:20,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:14:20,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 14:14:20,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 10: [2022-11-26 14:14:20,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 14:14:20,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 14:14:20,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 13: [2022-11-26 14:14:20,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:14:20,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:14:20,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 17: [2022-11-26 14:14:20,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 13: [2022-11-26 14:14:20,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 17: [2022-11-26 14:14:20,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 18: [2022-11-26 14:14:20,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:14:20,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 14:14:20,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 7: [2022-11-26 14:14:20,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-26 14:14:20,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 14:14:20,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 26: [2022-11-26 14:14:20,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:14:20,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 14:14:20,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 3: [2022-11-26 14:14:20,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:14:20,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 30: [2022-11-26 14:14:20,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:14:20,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 30: [2022-11-26 14:14:20,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 14:14:20,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 21: [2022-11-26 14:14:20,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
1: [2022-11-26 14:14:20,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:14:20,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 14:14:20,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 1: [2022-11-26 14:14:20,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 14:14:20,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 27: [2022-11-26 14:14:20,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:14:20,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 14:14:20,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 15: [2022-11-26 14:14:20,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:14:20,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 14:14:20,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 28: [2022-11-26 14:14:20,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-26 14:14:20,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 14:14:20,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 8: [2022-11-26 14:14:20,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:14:20,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:14:20,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 14:14:20,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 10: [2022-11-26 14:14:20,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 14:14:20,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 24: [2022-11-26 14:14:20,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:14:20,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 14:14:20,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 25: [2022-11-26 14:14:20,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
25: [2022-11-26 14:14:20,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 14:14:20,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 5: [2022-11-26 14:14:20,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:14:20,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 14:14:20,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 14: [2022-11-26 14:14:20,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:14:20,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 14:14:20,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 11: [2022-11-26 14:14:20,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:14:20,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 14:14:20,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 9: [2022-11-26 14:14:20,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
6: [2022-11-26 14:14:20,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:14:20,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 6: [2022-11-26 14:14:20,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 9: [2022-11-26 14:14:20,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 6: [2022-11-26 14:14:20,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 20: [2022-11-26 14:14:20,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:14:20,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 14:14:20,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 12: [2022-11-26 14:14:20,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:14:20,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 14:14:20,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 16: [2022-11-26 14:14:20,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
16: [2022-11-26 14:14:20,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 14:14:20,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 31: [2022-11-26 14:14:20,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:14:20,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:14:20,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 14:14:20,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 30: [2022-11-26 14:14:20,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:14:20,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 14:14:20,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 23: [2022-11-26 14:14:20,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:14:20,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 14:14:20,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 
0: [2022-11-26 14:14:20,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:14:20,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 14:14:20,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 19: [2022-11-26 14:14:20,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:14:20,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 14:14:20,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 31: [2022-11-26 14:14:20,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 14:14:20,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 13: [2022-11-26 14:14:20,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:14:20,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 14:14:20,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 7: [2022-11-26 14:14:20,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 14:14:20,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 14:14:20,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 29: [2022-11-26 14:14:20,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:14:20,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 14:14:20,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 22: [2022-11-26 14:14:20,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:14:20,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 14:14:20,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 3: [2022-11-26 14:14:20,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:14:20,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 14:14:20,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 27: [2022-11-26 14:14:20,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
10: [2022-11-26 14:14:20,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:14:20,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 10: [2022-11-26 14:14:20,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 1: [2022-11-26 14:14:20,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:14:20,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 10: [2022-11-26 14:14:20,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 1: [2022-11-26 14:14:20,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 14:14:20,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 18: [2022-11-26 14:14:20,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:14:20,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 14:14:20,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 26: [2022-11-26 14:14:20,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
15: [2022-11-26 14:14:20,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:14:20,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 15: [2022-11-26 14:14:20,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 26: [2022-11-26 14:14:20,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 15: [2022-11-26 14:14:20,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 5: [2022-11-26 14:14:20,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:14:20,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 14:14:20,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 8: [2022-11-26 14:14:20,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:14:20,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 14:14:20,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 20: [2022-11-26 14:14:20,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-26 14:14:20,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 14:14:20,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 28: [2022-11-26 14:14:20,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:14:20,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:14:20,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 14:14:20,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 19: [2022-11-26 14:14:20,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:14:20,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 14:14:20,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 11: [2022-11-26 14:14:20,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:14:20,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 14:14:20,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 
21: [2022-11-26 14:14:20,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:14:20,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 14:14:20,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 0: [2022-11-26 14:14:20,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:14:20,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 14:14:20,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 25: [2022-11-26 14:14:20,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:14:20,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 14:14:20,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 24: [2022-11-26 14:14:20,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:14:20,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 14:14:20,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 
6: [2022-11-26 14:14:20,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:14:20,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 29: [2022-11-26 14:14:20,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:14:20,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 29: [2022-11-26 14:14:20,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 14:14:20,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 12: [2022-11-26 14:14:20,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:14:20,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 14:14:20,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 9: [2022-11-26 14:14:20,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:14:20,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 14:14:20,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 
22: [2022-11-26 14:14:20,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:14:20,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 14:14:20,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 24: [2022-11-26 14:14:20,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 28: [2022-11-26 14:14:20,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 24: [2022-11-26 14:14:20,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 28: [2022-11-26 14:14:20,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 24: [2022-11-26 14:14:20,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 17: [2022-11-26 14:14:20,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:14:20,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 14:14:20,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 28: [2022-11-26 14:14:20,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
28: [2022-11-26 14:14:20,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 14:14:20,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 18: [2022-11-26 14:14:20,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:14:20,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 14:14:20,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 10: [2022-11-26 14:14:20,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:14:20,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 14:14:20,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 7: [2022-11-26 14:14:20,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:14:20,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 14:14:20,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 21: [2022-11-26 14:14:20,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-26 14:14:20,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 14:14:20,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 8: [2022-11-26 14:14:20,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:14:20,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 14:14:20,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 23: [2022-11-26 14:14:20,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:14:20,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 14:14:20,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 14: [2022-11-26 14:14:20,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:14:20,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 14:14:20,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 31: [2022-11-26 14:14:20,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
31: [2022-11-26 14:14:20,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 14:14:20,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 26: [2022-11-26 14:14:20,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:14:20,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 14:14:20,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 1: [2022-11-26 14:14:20,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:14:20,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 14:14:20,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 27: [2022-11-26 14:14:20,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:14:20,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 14:14:20,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 16: [2022-11-26 14:14:20,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
25: [2022-11-26 14:14:20,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:14:20,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 14:14:20,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 25: [2022-11-26 14:14:20,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 14:14:20,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 3: [2022-11-26 14:14:20,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:14:20,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 14:14:20,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 15: [2022-11-26 14:14:20,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:14:20,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 14:14:20,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 13: [2022-11-26 14:14:20,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 14:14:20,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 14:14:20,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 30: [2022-11-26 14:14:20,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:14:20,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 14:14:20,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 4: [2022-11-26 14:14:20,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:14:20,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 14:14:20,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 4: [2022-11-26 14:14:20,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:14:20,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 14:14:20,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 4: [2022-11-26 14:14:20,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 14:14:20,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 14:14:20,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 4: [2022-11-26 14:14:20,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:14:20,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 14:14:20,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 4: [2022-11-26 14:14:20,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:14:20,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 14:14:20,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 4: [2022-11-26 14:14:20,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:14:20,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-26 14:14:20,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 14:14:20,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 14:14:20,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 4: [2022-11-26 14:14:20,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 4: [2022-11-26 14:14:20,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:14:20,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 14:14:20,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 2: [2022-11-26 14:14:20,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:14:20,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:14:20,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:14:20,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:14:20,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 14:14:20,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:14:20,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:14:20,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 14:14:20,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 14:14:20,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 14:14:20,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 14:14:20,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 14:14:20,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 14:14:20,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 14:14:20,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 2: [2022-11-26 14:14:20,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 
2: [2022-11-26 14:14:20,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 2: [2022-11-26 14:14:20,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 2: [2022-11-26 14:14:20,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 2: [2022-11-26 14:14:20,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 2: [2022-11-26 14:14:20,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 2: [2022-11-26 14:14:20,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:14:20,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step89000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 14:14:20,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 
0: successfully saved checkpoint at iteration 89000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2705.79 31: iteration 89010/ 173500 | consumed samples: 22786560 | consumed tokens: 46666874880 | elapsed time per iteration (s): 1.07 | learning rate: 1.077E-04 | global batch size: 256 | lm loss: 1.980217E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.010 | TFLOPs: 14.52 | 31: iteration 89020/ 173500 | consumed samples: 22789120 | consumed tokens: 46672117760 | elapsed time per iteration (s): 0.77 | learning rate: 1.077E-04 | global batch size: 256 | lm loss: 1.988240E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.476 | TFLOPs: 20.05 | 31: iteration 89030/ 173500 | consumed samples: 22791680 | consumed tokens: 46677360640 | elapsed time per iteration (s): 0.81 | learning rate: 1.077E-04 | global batch size: 256 | lm loss: 1.997943E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.290 | TFLOPs: 19.07 | 31: iteration 89040/ 173500 | consumed samples: 22794240 | consumed tokens: 46682603520 | elapsed time per iteration (s): 0.74 | learning rate: 1.077E-04 | global batch size: 256 | lm loss: 1.978144E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.466 | TFLOPs: 21.02 | 31: iteration 89050/ 173500 | consumed samples: 22796800 | consumed tokens: 46687846400 | elapsed time per iteration (s): 0.75 | learning rate: 1.076E-04 | global batch size: 256 | lm loss: 1.998325E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.487 | TFLOPs: 20.72 | 31: iteration 89060/ 173500 | consumed samples: 22799360 | consumed tokens: 46693089280 | elapsed time per iteration (s): 0.76 | 
learning rate: 1.076E-04 | global batch size: 256 | lm loss: 1.974021E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.254 | TFLOPs: 20.40 | 31: iteration 89070/ 173500 | consumed samples: 22801920 | consumed tokens: 46698332160 | elapsed time per iteration (s): 0.75 | learning rate: 1.076E-04 | global batch size: 256 | lm loss: 1.997895E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.498 | TFLOPs: 20.72 | 31: iteration 89080/ 173500 | consumed samples: 22804480 | consumed tokens: 46703575040 | elapsed time per iteration (s): 0.79 | learning rate: 1.076E-04 | global batch size: 256 | lm loss: 2.001536E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.827 | TFLOPs: 19.71 | 31: iteration 89090/ 173500 | consumed samples: 22807040 | consumed tokens: 46708817920 | elapsed time per iteration (s): 0.77 | learning rate: 1.076E-04 | global batch size: 256 | lm loss: 1.980210E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.450 | TFLOPs: 20.17 | 31: iteration 89100/ 173500 | consumed samples: 22809600 | consumed tokens: 46714060800 | elapsed time per iteration (s): 0.76 | learning rate: 1.076E-04 | global batch size: 256 | lm loss: 1.993841E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.566 | TFLOPs: 20.36 | 31: iteration 89110/ 173500 | consumed samples: 22812160 | consumed tokens: 46719303680 | elapsed time per iteration (s): 0.82 | learning rate: 1.075E-04 | global batch size: 256 | lm loss: 2.019788E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.500 | TFLOPs: 18.97 | 31: iteration 89120/ 
173500 | consumed samples: 22814720 | consumed tokens: 46724546560 | elapsed time per iteration (s): 0.81 | learning rate: 1.075E-04 | global batch size: 256 | lm loss: 1.981989E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.969 | TFLOPs: 19.12 | 31: iteration 89130/ 173500 | consumed samples: 22817280 | consumed tokens: 46729789440 | elapsed time per iteration (s): 0.80 | learning rate: 1.075E-04 | global batch size: 256 | lm loss: 1.974183E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.377 | TFLOPs: 19.26 | 31: iteration 89140/ 173500 | consumed samples: 22819840 | consumed tokens: 46735032320 | elapsed time per iteration (s): 0.84 | learning rate: 1.075E-04 | global batch size: 256 | lm loss: 1.988588E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.560 | TFLOPs: 18.55 | 31: iteration 89150/ 173500 | consumed samples: 22822400 | consumed tokens: 46740275200 | elapsed time per iteration (s): 0.81 | learning rate: 1.075E-04 | global batch size: 256 | lm loss: 2.004746E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.832 | TFLOPs: 19.11 | 31: iteration 89160/ 173500 | consumed samples: 22824960 | consumed tokens: 46745518080 | elapsed time per iteration (s): 0.83 | learning rate: 1.075E-04 | global batch size: 256 | lm loss: 2.018698E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.127 | TFLOPs: 18.76 | 31: iteration 89170/ 173500 | consumed samples: 22827520 | consumed tokens: 46750760960 | elapsed time per iteration (s): 0.83 | learning rate: 1.074E-04 | global batch size: 256 | lm loss: 1.987109E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 308.676 | TFLOPs: 18.67 | 31: iteration 89180/ 173500 | consumed samples: 22830080 | consumed tokens: 46756003840 | elapsed time per iteration (s): 0.80 | learning rate: 1.074E-04 | global batch size: 256 | lm loss: 2.022252E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.678 | TFLOPs: 19.46 | 31: iteration 89190/ 173500 | consumed samples: 22832640 | consumed tokens: 46761246720 | elapsed time per iteration (s): 0.78 | learning rate: 1.074E-04 | global batch size: 256 | lm loss: 1.973182E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.020 | TFLOPs: 19.78 | 31: iteration 89200/ 173500 | consumed samples: 22835200 | consumed tokens: 46766489600 | elapsed time per iteration (s): 0.85 | learning rate: 1.074E-04 | global batch size: 256 | lm loss: 1.998128E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.468 | TFLOPs: 18.30 | 31: iteration 89210/ 173500 | consumed samples: 22837760 | consumed tokens: 46771732480 | elapsed time per iteration (s): 0.82 | learning rate: 1.074E-04 | global batch size: 256 | lm loss: 1.950547E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.359 | TFLOPs: 18.78 | 31: iteration 89220/ 173500 | consumed samples: 22840320 | consumed tokens: 46776975360 | elapsed time per iteration (s): 0.81 | learning rate: 1.074E-04 | global batch size: 256 | lm loss: 1.986642E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.522 | TFLOPs: 19.21 | 31: iteration 89230/ 173500 | consumed samples: 22842880 | consumed tokens: 46782218240 | elapsed time per iteration (s): 0.88 | learning rate: 
1.073E-04 | global batch size: 256 | lm loss: 1.999468E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.451 | TFLOPs: 17.69 | 31: iteration 89240/ 173500 | consumed samples: 22845440 | consumed tokens: 46787461120 | elapsed time per iteration (s): 0.86 | learning rate: 1.073E-04 | global batch size: 256 | lm loss: 1.991916E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.074 | TFLOPs: 17.91 | 31: iteration 89250/ 173500 | consumed samples: 22848000 | consumed tokens: 46792704000 | elapsed time per iteration (s): 0.80 | learning rate: 1.073E-04 | global batch size: 256 | lm loss: 1.991027E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.774 | TFLOPs: 19.41 | 31: iteration 89260/ 173500 | consumed samples: 22850560 | consumed tokens: 46797946880 | elapsed time per iteration (s): 0.76 | learning rate: 1.073E-04 | global batch size: 256 | lm loss: 1.979364E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.451 | TFLOPs: 20.48 | 31: iteration 89270/ 173500 | consumed samples: 22853120 | consumed tokens: 46803189760 | elapsed time per iteration (s): 0.80 | learning rate: 1.073E-04 | global batch size: 256 | lm loss: 1.996659E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.002 | TFLOPs: 19.30 | 31: iteration 89280/ 173500 | consumed samples: 22855680 | consumed tokens: 46808432640 | elapsed time per iteration (s): 0.78 | learning rate: 1.073E-04 | global batch size: 256 | lm loss: 1.969087E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.428 | TFLOPs: 19.81 | 31: iteration 89290/ 173500 | 
consumed samples: 22858240 | consumed tokens: 46813675520 | elapsed time per iteration (s): 0.73 | learning rate: 1.072E-04 | global batch size: 256 | lm loss: 2.018580E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.916 | TFLOPs: 21.17 | 31: iteration 89300/ 173500 | consumed samples: 22860800 | consumed tokens: 46818918400 | elapsed time per iteration (s): 0.80 | learning rate: 1.072E-04 | global batch size: 256 | lm loss: 2.015704E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.349 | TFLOPs: 19.38 | 31: iteration 89310/ 173500 | consumed samples: 22863360 | consumed tokens: 46824161280 | elapsed time per iteration (s): 0.79 | learning rate: 1.072E-04 | global batch size: 256 | lm loss: 1.984261E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.223 | TFLOPs: 19.68 | 31: iteration 89320/ 173500 | consumed samples: 22865920 | consumed tokens: 46829404160 | elapsed time per iteration (s): 0.75 | learning rate: 1.072E-04 | global batch size: 256 | lm loss: 2.034394E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.520 | TFLOPs: 20.78 | 31: iteration 89330/ 173500 | consumed samples: 22868480 | consumed tokens: 46834647040 | elapsed time per iteration (s): 0.75 | learning rate: 1.072E-04 | global batch size: 256 | lm loss: 2.003506E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.139 | TFLOPs: 20.52 | 31: iteration 89340/ 173500 | consumed samples: 22871040 | consumed tokens: 46839889920 | elapsed time per iteration (s): 0.81 | learning rate: 1.072E-04 | global batch size: 256 | lm loss: 1.984766E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 317.921 | TFLOPs: 19.23 | 31: iteration 89350/ 173500 | consumed samples: 22873600 | consumed tokens: 46845132800 | elapsed time per iteration (s): 0.82 | learning rate: 1.071E-04 | global batch size: 256 | lm loss: 2.005849E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.828 | TFLOPs: 18.93 | 31: iteration 89360/ 173500 | consumed samples: 22876160 | consumed tokens: 46850375680 | elapsed time per iteration (s): 0.83 | learning rate: 1.071E-04 | global batch size: 256 | lm loss: 1.967702E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.222 | TFLOPs: 18.77 | 31: iteration 89370/ 173500 | consumed samples: 22878720 | consumed tokens: 46855618560 | elapsed time per iteration (s): 0.81 | learning rate: 1.071E-04 | global batch size: 256 | lm loss: 1.990576E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.158 | TFLOPs: 19.13 | 31: iteration 89380/ 173500 | consumed samples: 22881280 | consumed tokens: 46860861440 | elapsed time per iteration (s): 0.82 | learning rate: 1.071E-04 | global batch size: 256 | lm loss: 1.969232E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.662 | TFLOPs: 18.79 | 31: iteration 89390/ 173500 | consumed samples: 22883840 | consumed tokens: 46866104320 | elapsed time per iteration (s): 0.84 | learning rate: 1.071E-04 | global batch size: 256 | lm loss: 1.998567E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.725 | TFLOPs: 18.44 | 31: iteration 89400/ 173500 | consumed samples: 22886400 | consumed tokens: 46871347200 | elapsed time per iteration (s): 0.82 | learning rate: 1.071E-04 | global batch 
size: 256 | lm loss: 2.007843E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.238 | TFLOPs: 18.83 | 31: iteration 89410/ 173500 | consumed samples: 22888960 | consumed tokens: 46876590080 | elapsed time per iteration (s): 0.85 | learning rate: 1.071E-04 | global batch size: 256 | lm loss: 1.979942E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.720 | TFLOPs: 18.13 | 31: iteration 89420/ 173500 | consumed samples: 22891520 | consumed tokens: 46881832960 | elapsed time per iteration (s): 0.89 | learning rate: 1.070E-04 | global batch size: 256 | lm loss: 2.014504E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.771 | TFLOPs: 17.35 | 31: iteration 89430/ 173500 | consumed samples: 22894080 | consumed tokens: 46887075840 | elapsed time per iteration (s): 0.86 | learning rate: 1.070E-04 | global batch size: 256 | lm loss: 2.015047E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.650 | TFLOPs: 18.01 | 31: iteration 89440/ 173500 | consumed samples: 22896640 | consumed tokens: 46892318720 | elapsed time per iteration (s): 0.86 | learning rate: 1.070E-04 | global batch size: 256 | lm loss: 2.014978E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.415 | TFLOPs: 18.11 | 31: iteration 89450/ 173500 | consumed samples: 22899200 | consumed tokens: 46897561600 | elapsed time per iteration (s): 0.87 | learning rate: 1.070E-04 | global batch size: 256 | lm loss: 1.983712E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.046 | TFLOPs: 17.73 | 31: iteration 89460/ 173500 | consumed samples: 22901760 | 
consumed tokens: 46902804480 | elapsed time per iteration (s): 0.85 | learning rate: 1.070E-04 | global batch size: 256 | lm loss: 1.972009E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.725 | TFLOPs: 18.31 | 31: iteration 89470/ 173500 | consumed samples: 22904320 | consumed tokens: 46908047360 | elapsed time per iteration (s): 0.81 | learning rate: 1.070E-04 | global batch size: 256 | lm loss: 2.000002E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.544 | TFLOPs: 19.03 | 31: iteration 89480/ 173500 | consumed samples: 22906880 | consumed tokens: 46913290240 | elapsed time per iteration (s): 0.88 | learning rate: 1.069E-04 | global batch size: 256 | lm loss: 1.983760E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.967 | TFLOPs: 17.54 | 31: iteration 89490/ 173500 | consumed samples: 22909440 | consumed tokens: 46918533120 | elapsed time per iteration (s): 0.85 | learning rate: 1.069E-04 | global batch size: 256 | lm loss: 2.018473E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.736 | TFLOPs: 18.25 | 31: iteration 89500/ 173500 | consumed samples: 22912000 | consumed tokens: 46923776000 | elapsed time per iteration (s): 0.82 | learning rate: 1.069E-04 | global batch size: 256 | lm loss: 1.974909E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.443 | TFLOPs: 18.78 | 31: iteration 89510/ 173500 | consumed samples: 22914560 | consumed tokens: 46929018880 | elapsed time per iteration (s): 0.85 | learning rate: 1.069E-04 | global batch size: 256 | lm loss: 2.009478E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 300.822 | TFLOPs: 18.20 | 31: iteration 89520/ 173500 | consumed samples: 22917120 | consumed tokens: 46934261760 | elapsed time per iteration (s): 0.82 | learning rate: 1.069E-04 | global batch size: 256 | lm loss: 2.015396E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.276 | TFLOPs: 18.83 | 31: iteration 89530/ 173500 | consumed samples: 22919680 | consumed tokens: 46939504640 | elapsed time per iteration (s): 0.83 | learning rate: 1.069E-04 | global batch size: 256 | lm loss: 2.026728E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.949 | TFLOPs: 18.63 | 31: iteration 89540/ 173500 | consumed samples: 22922240 | consumed tokens: 46944747520 | elapsed time per iteration (s): 0.83 | learning rate: 1.068E-04 | global batch size: 256 | lm loss: 2.020216E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.137 | TFLOPs: 18.70 | 31: iteration 89550/ 173500 | consumed samples: 22924800 | consumed tokens: 46949990400 | elapsed time per iteration (s): 0.82 | learning rate: 1.068E-04 | global batch size: 256 | lm loss: 1.989375E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.430 | TFLOPs: 18.84 | 31: iteration 89560/ 173500 | consumed samples: 22927360 | consumed tokens: 46955233280 | elapsed time per iteration (s): 0.87 | learning rate: 1.068E-04 | global batch size: 256 | lm loss: 1.985651E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.292 | TFLOPs: 17.86 | 31: iteration 89570/ 173500 | consumed samples: 22929920 | consumed tokens: 46960476160 | elapsed time per iteration (s): 1.05 | learning rate: 1.068E-04 | global batch size: 256 | lm loss: 
2.004914E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.137 | TFLOPs: 14.71 | 31: iteration 89580/ 173500 | consumed samples: 22932480 | consumed tokens: 46965719040 | elapsed time per iteration (s): 0.83 | learning rate: 1.068E-04 | global batch size: 256 | lm loss: 2.018764E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.758 | TFLOPs: 18.68 | 31: iteration 89590/ 173500 | consumed samples: 22935040 | consumed tokens: 46970961920 | elapsed time per iteration (s): 0.92 | learning rate: 1.068E-04 | global batch size: 256 | lm loss: 1.997474E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 278.578 | TFLOPs: 16.85 | 31: iteration 89600/ 173500 | consumed samples: 22937600 | consumed tokens: 46976204800 | elapsed time per iteration (s): 0.80 | learning rate: 1.067E-04 | global batch size: 256 | lm loss: 2.016205E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.453 | TFLOPs: 19.33 | 31: iteration 89610/ 173500 | consumed samples: 22940160 | consumed tokens: 46981447680 | elapsed time per iteration (s): 0.82 | learning rate: 1.067E-04 | global batch size: 256 | lm loss: 2.003205E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.905 | TFLOPs: 18.87 | 31: iteration 89620/ 173500 | consumed samples: 22942720 | consumed tokens: 46986690560 | elapsed time per iteration (s): 0.83 | learning rate: 1.067E-04 | global batch size: 256 | lm loss: 2.013617E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.689 | TFLOPs: 18.74 | 31: iteration 89630/ 173500 | consumed samples: 22945280 | consumed tokens: 
46991933440 | elapsed time per iteration (s): 0.83 | learning rate: 1.067E-04 | global batch size: 256 | lm loss: 1.992684E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.038 | TFLOPs: 18.70 | 31: iteration 89640/ 173500 | consumed samples: 22947840 | consumed tokens: 46997176320 | elapsed time per iteration (s): 0.82 | learning rate: 1.067E-04 | global batch size: 256 | lm loss: 1.970057E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.162 | TFLOPs: 18.82 | 31: iteration 89650/ 173500 | consumed samples: 22950400 | consumed tokens: 47002419200 | elapsed time per iteration (s): 0.81 | learning rate: 1.067E-04 | global batch size: 256 | lm loss: 1.987138E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.251 | TFLOPs: 19.19 | 31: iteration 89660/ 173500 | consumed samples: 22952960 | consumed tokens: 47007662080 | elapsed time per iteration (s): 0.79 | learning rate: 1.066E-04 | global batch size: 256 | lm loss: 1.988193E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.515 | TFLOPs: 19.57 | 31: iteration 89670/ 173500 | consumed samples: 22955520 | consumed tokens: 47012904960 | elapsed time per iteration (s): 0.81 | learning rate: 1.066E-04 | global batch size: 256 | lm loss: 1.958521E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.209 | TFLOPs: 19.19 | 31: iteration 89680/ 173500 | consumed samples: 22958080 | consumed tokens: 47018147840 | elapsed time per iteration (s): 0.80 | learning rate: 1.066E-04 | global batch size: 256 | lm loss: 2.003224E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 318.297 | TFLOPs: 19.26 | 31: iteration 89690/ 173500 | consumed samples: 22960640 | consumed tokens: 47023390720 | elapsed time per iteration (s): 0.84 | learning rate: 1.066E-04 | global batch size: 256 | lm loss: 2.003240E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.535 | TFLOPs: 18.36 | 31: iteration 89700/ 173500 | consumed samples: 22963200 | consumed tokens: 47028633600 | elapsed time per iteration (s): 0.82 | learning rate: 1.066E-04 | global batch size: 256 | lm loss: 1.986452E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.250 | TFLOPs: 18.89 | 31: iteration 89710/ 173500 | consumed samples: 22965760 | consumed tokens: 47033876480 | elapsed time per iteration (s): 0.89 | learning rate: 1.066E-04 | global batch size: 256 | lm loss: 2.001113E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.511 | TFLOPs: 17.39 | 31: iteration 89720/ 173500 | consumed samples: 22968320 | consumed tokens: 47039119360 | elapsed time per iteration (s): 0.80 | learning rate: 1.065E-04 | global batch size: 256 | lm loss: 1.976636E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.108 | TFLOPs: 19.37 | 31: iteration 89730/ 173500 | consumed samples: 22970880 | consumed tokens: 47044362240 | elapsed time per iteration (s): 0.94 | learning rate: 1.065E-04 | global batch size: 256 | lm loss: 1.986530E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 271.575 | TFLOPs: 16.43 | 31: iteration 89740/ 173500 | consumed samples: 22973440 | consumed tokens: 47049605120 | elapsed time per iteration (s): 0.85 | learning rate: 1.065E-04 | global batch size: 256 | lm loss: 2.023987E+00 | grad 
norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.307 | TFLOPs: 18.17 | 31: iteration 89750/ 173500 | consumed samples: 22976000 | consumed tokens: 47054848000 | elapsed time per iteration (s): 0.82 | learning rate: 1.065E-04 | global batch size: 256 | lm loss: 1.977304E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.195 | TFLOPs: 18.95 | 31: iteration 89760/ 173500 | consumed samples: 22978560 | consumed tokens: 47060090880 | elapsed time per iteration (s): 0.81 | learning rate: 1.065E-04 | global batch size: 256 | lm loss: 2.008530E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.692 | TFLOPs: 19.22 | 31: iteration 89770/ 173500 | consumed samples: 22981120 | consumed tokens: 47065333760 | elapsed time per iteration (s): 0.81 | learning rate: 1.065E-04 | global batch size: 256 | lm loss: 1.994083E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.822 | TFLOPs: 19.17 | 31: iteration 89780/ 173500 | consumed samples: 22983680 | consumed tokens: 47070576640 | elapsed time per iteration (s): 0.80 | learning rate: 1.064E-04 | global batch size: 256 | lm loss: 1.963923E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.363 | TFLOPs: 19.26 | 31: iteration 89790/ 173500 | consumed samples: 22986240 | consumed tokens: 47075819520 | elapsed time per iteration (s): 0.78 | learning rate: 1.064E-04 | global batch size: 256 | lm loss: 1.973819E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.532 | TFLOPs: 19.88 | 31: iteration 89800/ 173500 | consumed samples: 22988800 | consumed tokens: 47081062400 | elapsed time 
per iteration (s): 0.77 | learning rate: 1.064E-04 | global batch size: 256 | lm loss: 1.997927E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.422 | TFLOPs: 20.05 | 31: iteration 89810/ 173500 | consumed samples: 22991360 | consumed tokens: 47086305280 | elapsed time per iteration (s): 0.81 | learning rate: 1.064E-04 | global batch size: 256 | lm loss: 2.002731E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.588 | TFLOPs: 19.21 | 31: iteration 89820/ 173500 | consumed samples: 22993920 | consumed tokens: 47091548160 | elapsed time per iteration (s): 0.82 | learning rate: 1.064E-04 | global batch size: 256 | lm loss: 1.948944E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.341 | TFLOPs: 18.96 | 31: iteration 89830/ 173500 | consumed samples: 22996480 | consumed tokens: 47096791040 | elapsed time per iteration (s): 0.84 | learning rate: 1.064E-04 | global batch size: 256 | lm loss: 2.013463E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.356 | TFLOPs: 18.53 | 31: iteration 89840/ 173500 | consumed samples: 22999040 | consumed tokens: 47102033920 | elapsed time per iteration (s): 0.83 | learning rate: 1.063E-04 | global batch size: 256 | lm loss: 1.998917E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.959 | TFLOPs: 18.75 | 31: iteration 89850/ 173500 | consumed samples: 23001600 | consumed tokens: 47107276800 | elapsed time per iteration (s): 0.83 | learning rate: 1.063E-04 | global batch size: 256 | lm loss: 1.979047E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.579 | TFLOPs: 
18.73 | 31: iteration 89860/ 173500 | consumed samples: 23004160 | consumed tokens: 47112519680 | elapsed time per iteration (s): 0.84 | learning rate: 1.063E-04 | global batch size: 256 | lm loss: 1.969411E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.556 | TFLOPs: 18.36 | 31: iteration 89870/ 173500 | consumed samples: 23006720 | consumed tokens: 47117762560 | elapsed time per iteration (s): 0.84 | learning rate: 1.063E-04 | global batch size: 256 | lm loss: 1.979970E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.234 | TFLOPs: 18.34 | 31: iteration 89880/ 173500 | consumed samples: 23009280 | consumed tokens: 47123005440 | elapsed time per iteration (s): 0.81 | learning rate: 1.063E-04 | global batch size: 256 | lm loss: 1.973478E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.670 | TFLOPs: 19.22 | 31: iteration 89890/ 173500 | consumed samples: 23011840 | consumed tokens: 47128248320 | elapsed time per iteration (s): 0.83 | learning rate: 1.063E-04 | global batch size: 256 | lm loss: 1.941923E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.776 | TFLOPs: 18.68 | 31: iteration 89900/ 173500 | consumed samples: 23014400 | consumed tokens: 47133491200 | elapsed time per iteration (s): 0.84 | learning rate: 1.062E-04 | global batch size: 256 | lm loss: 1.957364E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.793 | TFLOPs: 18.38 | 31: iteration 89910/ 173500 | consumed samples: 23016960 | consumed tokens: 47138734080 | elapsed time per iteration (s): 0.83 | learning rate: 1.062E-04 | global batch size: 256 | lm loss: 1.992628E+00 | grad norm: 0.141 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.440 | TFLOPs: 18.72 | 31: iteration 89920/ 173500 | consumed samples: 23019520 | consumed tokens: 47143976960 | elapsed time per iteration (s): 0.87 | learning rate: 1.062E-04 | global batch size: 256 | lm loss: 1.992282E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.123 | TFLOPs: 17.85 | 31: iteration 89930/ 173500 | consumed samples: 23022080 | consumed tokens: 47149219840 | elapsed time per iteration (s): 0.85 | learning rate: 1.062E-04 | global batch size: 256 | lm loss: 2.012018E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.827 | TFLOPs: 18.20 | 31: iteration 89940/ 173500 | consumed samples: 23024640 | consumed tokens: 47154462720 | elapsed time per iteration (s): 0.81 | learning rate: 1.062E-04 | global batch size: 256 | lm loss: 1.990434E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.350 | TFLOPs: 19.14 | 31: iteration 89950/ 173500 | consumed samples: 23027200 | consumed tokens: 47159705600 | elapsed time per iteration (s): 0.79 | learning rate: 1.062E-04 | global batch size: 256 | lm loss: 2.000633E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.163 | TFLOPs: 19.61 | 31: iteration 89960/ 173500 | consumed samples: 23029760 | consumed tokens: 47164948480 | elapsed time per iteration (s): 0.82 | learning rate: 1.061E-04 | global batch size: 256 | lm loss: 1.994325E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.862 | TFLOPs: 18.87 | 31: iteration 89970/ 173500 | consumed samples: 23032320 | consumed tokens: 47170191360 | elapsed time per iteration (s): 0.80 | 
learning rate: 1.061E-04 | global batch size: 256 | lm loss: 1.992878E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.784 | TFLOPs: 19.41 | 31: iteration 89980/ 173500 | consumed samples: 23034880 | consumed tokens: 47175434240 | elapsed time per iteration (s): 0.84 | learning rate: 1.061E-04 | global batch size: 256 | lm loss: 1.988950E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.373 | TFLOPs: 18.47 | 31: iteration 89990/ 173500 | consumed samples: 23037440 | consumed tokens: 47180677120 | elapsed time per iteration (s): 0.78 | learning rate: 1.061E-04 | global batch size: 256 | lm loss: 1.982529E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.506 | TFLOPs: 19.81 | 0: [2022-11-26 14:28:01,572] [INFO] [logging.py:68:log_dist] [Rank 0] step=90000, skipped=0, lr=[0.00010607986950689534, 0.00010607986950689534, 0.00010607986950689534], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 90000/ 173500 | consumed samples: 23040000 | consumed tokens: 47185920000 | elapsed time per iteration (s): 0.95 | learning rate: 1.061E-04 | global batch size: 256 | lm loss: 1.994425E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 268.785 | TFLOPs: 16.26 | 0: steps: 90000 loss: 1.9452 iter time (s): 0.802 samples/sec: 319.110 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 90000 | lm loss value: 1.900050E+00 | lm loss PPL: 6.686226E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 90000 to checkpoints_1b1long 0: [2022-11-26 14:28:01,933] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] 
Checkpoint global_step90000 is begin to save! 0: [2022-11-26 14:28:01,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_01-model_00-model_states.pt... 0: [2022-11-26 14:28:02,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_01-model_00-model_states.pt. 0: [2022-11-26 14:28:02,217] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_03-model_00-model_states.pt... 0: [2022-11-26 14:28:02,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_03-model_00-model_states.pt. 0: [2022-11-26 14:28:02,295] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_04-model_00-model_states.pt... 0: [2022-11-26 14:28:02,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_04-model_00-model_states.pt. 0: [2022-11-26 14:28:02,375] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_05-model_00-model_states.pt... 0: [2022-11-26 14:28:02,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_05-model_00-model_states.pt. 0: [2022-11-26 14:28:02,455] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_06-model_00-model_states.pt... 0: [2022-11-26 14:28:02,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_06-model_00-model_states.pt. 0: [2022-11-26 14:28:02,532] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_07-model_00-model_states.pt... 0: [2022-11-26 14:28:02,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 14:28:02,608] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_08-model_00-model_states.pt... 0: [2022-11-26 14:28:02,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_08-model_00-model_states.pt. 0: [2022-11-26 14:28:02,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_09-model_00-model_states.pt... 0: [2022-11-26 14:28:02,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_09-model_00-model_states.pt. 0: [2022-11-26 14:28:02,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_10-model_00-model_states.pt... 0: [2022-11-26 14:28:02,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_10-model_00-model_states.pt. 0: [2022-11-26 14:28:02,838] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_11-model_00-model_states.pt... 0: [2022-11-26 14:28:02,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_11-model_00-model_states.pt. 0: [2022-11-26 14:28:02,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_12-model_00-model_states.pt... 0: [2022-11-26 14:28:02,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_12-model_00-model_states.pt. 0: [2022-11-26 14:28:02,990] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_13-model_00-model_states.pt... 0: [2022-11-26 14:28:03,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 14:28:03,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_14-model_00-model_states.pt... 0: [2022-11-26 14:28:03,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_14-model_00-model_states.pt. 0: [2022-11-26 14:28:03,139] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_15-model_00-model_states.pt... 0: [2022-11-26 14:28:03,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_15-model_00-model_states.pt. 0: [2022-11-26 14:28:03,217] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_16-model_00-model_states.pt... 0: [2022-11-26 14:28:03,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_16-model_00-model_states.pt. 0: [2022-11-26 14:28:03,293] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_17-model_00-model_states.pt... 0: [2022-11-26 14:28:03,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_17-model_00-model_states.pt. 0: [2022-11-26 14:28:03,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_18-model_00-model_states.pt... 0: [2022-11-26 14:28:03,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_18-model_00-model_states.pt. 0: [2022-11-26 14:28:03,441] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_19-model_00-model_states.pt... 0: [2022-11-26 14:28:03,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 14:28:03,517] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_20-model_00-model_states.pt... 0: [2022-11-26 14:28:03,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_20-model_00-model_states.pt. 0: [2022-11-26 14:28:03,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_21-model_00-model_states.pt... 0: [2022-11-26 14:28:03,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_21-model_00-model_states.pt. 0: [2022-11-26 14:28:03,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_22-model_00-model_states.pt... 0: [2022-11-26 14:28:03,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_22-model_00-model_states.pt. 0: [2022-11-26 14:28:03,748] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_23-model_00-model_states.pt... 0: [2022-11-26 14:28:03,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_23-model_00-model_states.pt. 0: [2022-11-26 14:28:03,822] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_24-model_00-model_states.pt... 0: [2022-11-26 14:28:03,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_24-model_00-model_states.pt. 0: [2022-11-26 14:28:03,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_25-model_00-model_states.pt... 0: [2022-11-26 14:28:03,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 14:28:03,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_26-model_00-model_states.pt... 0: [2022-11-26 14:28:04,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_26-model_00-model_states.pt. 0: [2022-11-26 14:28:04,054] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_27-model_00-model_states.pt... 0: [2022-11-26 14:28:04,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_27-model_00-model_states.pt. 0: [2022-11-26 14:28:04,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_28-model_00-model_states.pt... 0: [2022-11-26 14:28:04,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_28-model_00-model_states.pt. 0: [2022-11-26 14:28:04,206] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/layer_30-model_00-model_states.pt... 0: [2022-11-26 14:28:04,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/layer_30-model_00-model_states.pt. 0: [2022-11-26 14:28:04,208] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step90000/mp_rank_00_model_states.pt 0: [2022-11-26 14:28:04,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/mp_rank_00_model_states.pt... 0: [2022-11-26 14:28:04,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/mp_rank_00_model_states.pt. 0: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
6: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
4: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
16: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
13: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 
23: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
14: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 
21: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
5: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 
9: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 
16: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
12: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
11: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
14: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 
21: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 
6: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
10: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
15: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
31: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 
18: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
8: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 
25: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
19: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 
22: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
30: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:28:04,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:28:04,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:28:04,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 14:28:04,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 18: [2022-11-26 14:28:04,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:28:04,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 14:28:04,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 0: [2022-11-26 14:28:04,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:28:04,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 14:28:04,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 0: [2022-11-26 14:28:04,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 14:28:04,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 14:28:04,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 0: [2022-11-26 14:28:04,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:28:04,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 14:28:04,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 11: [2022-11-26 14:28:04,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:28:04,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 14:28:04,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 18: [2022-11-26 14:28:04,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:28:04,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 14:28:04,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 6: [2022-11-26 14:28:04,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 14:28:04,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 14:28:04,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 0: [2022-11-26 14:28:04,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:28:04,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:28:04,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:28:04,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 0: [2022-11-26 14:28:04,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 16: [2022-11-26 14:28:04,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 1: [2022-11-26 14:28:04,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 29: [2022-11-26 14:28:04,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:28:04,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
29: [2022-11-26 14:28:04,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 1: [2022-11-26 14:28:04,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 29: [2022-11-26 14:28:04,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 9: [2022-11-26 14:28:04,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:28:04,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 14:28:04,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 3: [2022-11-26 14:28:04,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:28:04,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 14:28:04,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 7: [2022-11-26 14:28:04,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:28:04,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
7: [2022-11-26 14:28:04,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 18: [2022-11-26 14:28:04,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 7: [2022-11-26 14:28:04,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 18: [2022-11-26 14:28:04,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 1: [2022-11-26 14:28:04,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:28:04,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:28:04,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 1: [2022-11-26 14:28:04,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 16: [2022-11-26 14:28:04,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 1: [2022-11-26 14:28:04,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 28: [2022-11-26 14:28:04,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:28:04,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 14:28:04,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 14:28:04,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 23: [2022-11-26 14:28:04,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:28:04,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:28:04,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 30: [2022-11-26 14:28:04,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 23: [2022-11-26 14:28:04,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 30: [2022-11-26 14:28:04,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 20: [2022-11-26 14:28:04,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:28:04,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 14:28:04,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 20: [2022-11-26 14:28:04,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 
20: [2022-11-26 14:28:04,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 14:28:04,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 4: [2022-11-26 14:28:04,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:28:04,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:28:04,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 14:28:04,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 26: [2022-11-26 14:28:04,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:28:04,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 31: [2022-11-26 14:28:04,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:28:04,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 31: [2022-11-26 14:28:04,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 14:28:04,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
4: [2022-11-26 14:28:04,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 6: [2022-11-26 14:28:04,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:28:04,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 6: [2022-11-26 14:28:04,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 14:28:04,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 26: [2022-11-26 14:28:04,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:28:04,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 7: [2022-11-26 14:28:04,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:28:04,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:28:04,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 7: [2022-11-26 14:28:04,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 14:28:04,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
21: [2022-11-26 14:28:04,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 14:28:04,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 10: [2022-11-26 14:28:04,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:28:04,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 14:28:04,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 22: [2022-11-26 14:28:04,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:28:04,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 14:28:04,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 9: [2022-11-26 14:28:04,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:28:04,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 14:28:04,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 6: [2022-11-26 14:28:04,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 14:28:04,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 14:28:04,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 4: [2022-11-26 14:28:04,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:28:04,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 30: [2022-11-26 14:28:04,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:28:04,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 30: [2022-11-26 14:28:04,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 14:28:04,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 19: [2022-11-26 14:28:04,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:28:04,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 14:28:04,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 30: [2022-11-26 14:28:04,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
30: [2022-11-26 14:28:04,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 14:28:04,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 11: [2022-11-26 14:28:04,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:28:04,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:28:04,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 14:28:04,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 10: [2022-11-26 14:28:04,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 14:28:04,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 21: [2022-11-26 14:28:04,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:28:04,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 14:28:04,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 18: [2022-11-26 14:28:04,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
18: [2022-11-26 14:28:04,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 14:28:04,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 26: [2022-11-26 14:28:04,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:28:04,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 14:28:04,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 10: [2022-11-26 14:28:04,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:28:04,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 14:28:04,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 6: [2022-11-26 14:28:04,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:28:04,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 14:28:04,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 1: [2022-11-26 14:28:04,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
27: [2022-11-26 14:28:04,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:28:04,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:28:04,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:28:04,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:28:04,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 21: [2022-11-26 14:28:04,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 27: [2022-11-26 14:28:04,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 14:28:04,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 1: [2022-11-26 14:28:04,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 21: [2022-11-26 14:28:04,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
27: [2022-11-26 14:28:04,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 14:28:04,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 27: [2022-11-26 14:28:04,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 27: [2022-11-26 14:28:04,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 5: [2022-11-26 14:28:04,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:28:04,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:28:04,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 9: [2022-11-26 14:28:04,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:28:04,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 30: [2022-11-26 14:28:04,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:28:04,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
9: [2022-11-26 14:28:04,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 14:28:04,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 29: [2022-11-26 14:28:04,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:28:04,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 5: [2022-11-26 14:28:04,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:28:04,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:28:04,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 29: [2022-11-26 14:28:04,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 30: [2022-11-26 14:28:04,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 5: [2022-11-26 14:28:04,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 4: [2022-11-26 14:28:04,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 29: [2022-11-26 14:28:04,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
27: [2022-11-26 14:28:04,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:28:04,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 4: [2022-11-26 14:28:04,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 5: [2022-11-26 14:28:04,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:28:04,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 5: [2022-11-26 14:28:04,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 14:28:04,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 26: [2022-11-26 14:28:04,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:28:04,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:28:04,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:28:04,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
5: [2022-11-26 14:28:04,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 14:28:04,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 26: [2022-11-26 14:28:04,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 5: [2022-11-26 14:28:04,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 5: [2022-11-26 14:28:04,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 26: [2022-11-26 14:28:04,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 9: [2022-11-26 14:28:04,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:28:04,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:28:04,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 14:28:04,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 14:28:04,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 9: [2022-11-26 14:28:04,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
27: [2022-11-26 14:28:04,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:28:04,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:28:04,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:28:04,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:28:04,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 26: [2022-11-26 14:28:04,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 19: [2022-11-26 14:28:04,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 27: [2022-11-26 14:28:04,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 26: [2022-11-26 14:28:04,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 19: [2022-11-26 14:28:04,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 27: [2022-11-26 14:28:04,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 30: [2022-11-26 14:28:04,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
27: [2022-11-26 14:28:04,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:28:04,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 14:28:04,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 20: [2022-11-26 14:28:04,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:28:04,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:28:04,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 4: [2022-11-26 14:28:04,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:28:04,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 14:28:04,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 23: [2022-11-26 14:28:04,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
4: [2022-11-26 14:28:04,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 20: [2022-11-26 14:28:04,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 23: [2022-11-26 14:28:04,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 4: [2022-11-26 14:28:04,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 23: [2022-11-26 14:28:04,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 11: [2022-11-26 14:28:04,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:28:04,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 28: [2022-11-26 14:28:04,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 14:28:04,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 28: [2022-11-26 14:28:04,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 14:28:04,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 16: [2022-11-26 14:28:04,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
28: [2022-11-26 14:28:04,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 16: [2022-11-26 14:28:04,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 28: [2022-11-26 14:28:04,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:28:04,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 28: [2022-11-26 14:28:04,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 16: [2022-11-26 14:28:04,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 28: [2022-11-26 14:28:04,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 16: [2022-11-26 14:28:04,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 23: [2022-11-26 14:28:04,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 28: [2022-11-26 14:28:04,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:28:04,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
16: [2022-11-26 14:28:04,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 11: [2022-11-26 14:28:04,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 28: [2022-11-26 14:28:04,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 7: [2022-11-26 14:28:04,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 16: [2022-11-26 14:28:04,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 23: [2022-11-26 14:28:04,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 28: [2022-11-26 14:28:04,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 7: [2022-11-26 14:28:04,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 28: [2022-11-26 14:28:04,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 14:28:04,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 20: [2022-11-26 14:28:04,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 28: [2022-11-26 14:28:04,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
20: [2022-11-26 14:28:04,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 14:28:04,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 19: [2022-11-26 14:28:04,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:28:04,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 14:28:04,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 0: [2022-11-26 14:28:04,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:28:04,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:28:04,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 20: [2022-11-26 14:28:04,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 0: [2022-11-26 14:28:04,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 20: [2022-11-26 14:28:04,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 26: [2022-11-26 14:28:04,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-26 14:28:04,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 14:28:04,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 7: [2022-11-26 14:28:04,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:28:04,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 14:28:04,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 7: [2022-11-26 14:28:04,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:28:04,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 14:28:04,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 29: [2022-11-26 14:28:04,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:28:04,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 14:28:04,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 31: [2022-11-26 14:28:04,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 
31: [2022-11-26 14:28:04,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 14:28:04,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 12: [2022-11-26 14:28:04,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:28:04,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 31: [2022-11-26 14:28:04,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:28:04,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 15: [2022-11-26 14:28:04,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:28:04,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:28:04,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
31: [2022-11-26 14:28:04,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 15: [2022-11-26 14:28:04,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 14:28:04,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 14:28:04,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 31: [2022-11-26 14:28:04,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 21: [2022-11-26 14:28:04,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:28:04,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 15: [2022-11-26 14:28:04,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 15: [2022-11-26 14:28:04,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 21: [2022-11-26 14:28:04,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:28:04,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 17: [2022-11-26 14:28:04,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 
17: [2022-11-26 14:28:04,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:28:04,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 14:28:04,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 21: [2022-11-26 14:28:04,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 21: [2022-11-26 14:28:04,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:28:04,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:28:04,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 21: [2022-11-26 14:28:04,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 17: [2022-11-26 14:28:04,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 14:28:04,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 17: [2022-11-26 14:28:04,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 14:28:04,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
17: [2022-11-26 14:28:04,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 21: [2022-11-26 14:28:04,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 11: [2022-11-26 14:28:04,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:28:04,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 14:28:04,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:28:04,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:28:04,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:28:04,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 11: [2022-11-26 14:28:04,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 4: [2022-11-26 14:28:04,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 14:28:04,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 11: [2022-11-26 14:28:04,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
4: [2022-11-26 14:28:04,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 6: [2022-11-26 14:28:04,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:28:04,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 6: [2022-11-26 14:28:04,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 14:28:04,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 17: [2022-11-26 14:28:04,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:28:04,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 31: [2022-11-26 14:28:04,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:28:04,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:28:04,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 14:28:04,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 17: [2022-11-26 14:28:04,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
31: [2022-11-26 14:28:04,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 31: [2022-11-26 14:28:04,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 19: [2022-11-26 14:28:04,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:28:04,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 14:28:04,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 12: [2022-11-26 14:28:04,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:28:04,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:28:04,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 14:28:04,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 29: [2022-11-26 14:28:04,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 14:28:04,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 0: [2022-11-26 14:28:04,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
12: [2022-11-26 14:28:04,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:28:04,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 14:28:04,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 16: [2022-11-26 14:28:04,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:28:04,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 14:28:04,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 31: [2022-11-26 14:28:04,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:28:04,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 14:28:04,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 12: [2022-11-26 14:28:04,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:28:04,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 14:28:04,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 14:28:04,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 14:28:04,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 12: [2022-11-26 14:28:04,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 6: [2022-11-26 14:28:04,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:28:04,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 14:28:04,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 19: [2022-11-26 14:28:04,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:28:04,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:28:04,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 14:28:04,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
17: [2022-11-26 14:28:04,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 14:28:04,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 1: [2022-11-26 14:28:04,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:28:04,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:28:04,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:28:04,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 29: [2022-11-26 14:28:04,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 14:28:04,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 10: [2022-11-26 14:28:04,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 1: [2022-11-26 14:28:04,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 14:28:04,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:28:04,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
17: [2022-11-26 14:28:04,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:28:04,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 14:28:04,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 17: [2022-11-26 14:28:04,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 3: [2022-11-26 14:28:04,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:28:04,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 3: [2022-11-26 14:28:04,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:28:04,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 14:28:04,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:28:04,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
3: [2022-11-26 14:28:04,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 14:28:04,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 14:28:04,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 3: [2022-11-26 14:28:04,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 29: [2022-11-26 14:28:04,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:28:04,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 19: [2022-11-26 14:28:04,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:28:04,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 19: [2022-11-26 14:28:04,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 14:28:04,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 23: [2022-11-26 14:28:04,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:28:04,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
23: [2022-11-26 14:28:04,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 14:28:04,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 14:28:04,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 23: [2022-11-26 14:28:04,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 23: [2022-11-26 14:28:04,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:28:04,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 14:28:04,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 25: [2022-11-26 14:28:04,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:28:04,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:28:04,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 14:28:04,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 14:28:04,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
25: [2022-11-26 14:28:04,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 25: [2022-11-26 14:28:04,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:28:04,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 14:28:04,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 25: [2022-11-26 14:28:04,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:28:04,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 14:28:04,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 22: [2022-11-26 14:28:04,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:28:04,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:28:04,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 14:28:04,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:28:04,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
22: [2022-11-26 14:28:04,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 22: [2022-11-26 14:28:04,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 14:28:04,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 14:28:04,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 14:28:04,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 22: [2022-11-26 14:28:04,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 22: [2022-11-26 14:28:04,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 15: [2022-11-26 14:28:04,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:28:04,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 14:28:04,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 15: [2022-11-26 14:28:04,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 14:28:04,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 14:28:04,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 15: [2022-11-26 14:28:04,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:28:04,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 14:28:04,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 1: [2022-11-26 14:28:04,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:28:04,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 14:28:04,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 8: [2022-11-26 14:28:04,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:28:04,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 14:28:04,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 10: [2022-11-26 14:28:04,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-26 14:28:04,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 14:28:04,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 24: [2022-11-26 14:28:04,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:28:04,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:28:04,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:28:04,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:28:04,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
24: [2022-11-26 14:28:04,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 14:28:04,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 14:28:04,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 14:28:04,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 14:28:04,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 14:28:04,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 24: [2022-11-26 14:28:04,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 24: [2022-11-26 14:28:04,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 24: [2022-11-26 14:28:04,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 24: [2022-11-26 14:28:04,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 24: [2022-11-26 14:28:04,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-26 14:28:04,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 14:28:04,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 0: [2022-11-26 14:28:04,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 14:28:04,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 5: [2022-11-26 14:28:04,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:28:04,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 14:28:04,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 8: [2022-11-26 14:28:04,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:28:04,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:28:04,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:28:04,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-26 14:28:04,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 14:28:04,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 14:28:04,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 14:28:04,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 8: [2022-11-26 14:28:04,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 8: [2022-11-26 14:28:04,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 8: [2022-11-26 14:28:04,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 14:28:04,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 2: [2022-11-26 14:28:04,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:28:04,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:28:04,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 14:28:04,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 14:28:04,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 14:28:04,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 14:28:04,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 2: [2022-11-26 14:28:04,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 2: [2022-11-26 14:28:04,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 2: [2022-11-26 14:28:04,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:28:04,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 14:28:04,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 7: [2022-11-26 14:28:04,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:28:04,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 14:28:04,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
2: [2022-11-26 14:28:04,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:28:04,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 14:28:04,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 2: [2022-11-26 14:28:04,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:28:04,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 14:28:04,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 14: [2022-11-26 14:28:04,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:28:04,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:28:04,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:28:04,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:28:04,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
12: [2022-11-26 14:28:04,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:28:04,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 14:28:04,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 14:28:04,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 14:28:04,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 14:28:04,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 12: [2022-11-26 14:28:04,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 14: [2022-11-26 14:28:04,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 12: [2022-11-26 14:28:04,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 14: [2022-11-26 14:28:04,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 14: [2022-11-26 14:28:04,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 14: [2022-11-26 14:28:04,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
14: [2022-11-26 14:28:04,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 13: [2022-11-26 14:28:04,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:28:04,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:28:04,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:28:04,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:28:04,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:28:04,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 14:28:04,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 14:28:04,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 14:28:04,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 14:28:04,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
13: [2022-11-26 14:28:04,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 14:28:04,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 13: [2022-11-26 14:28:04,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 13: [2022-11-26 14:28:04,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 13: [2022-11-26 14:28:04,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 30: [2022-11-26 14:28:04,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:28:04,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 14:28:04,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 16: [2022-11-26 14:28:04,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:28:04,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 14:28:04,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 11: [2022-11-26 14:28:04,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 14:28:04,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 14:28:04,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 8: [2022-11-26 14:28:04,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:28:04,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 14:28:04,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 28: [2022-11-26 14:28:04,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:28:04,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:28:04,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 14:28:04,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 28: [2022-11-26 14:28:04,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 14:28:04,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 25: [2022-11-26 14:28:04,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
25: [2022-11-26 14:28:04,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 14:28:04,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 14: [2022-11-26 14:28:04,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:28:04,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 14:28:04,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 13: [2022-11-26 14:28:04,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:28:04,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 14:28:04,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 18: [2022-11-26 14:28:04,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:28:04,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 14:28:04,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 20: [2022-11-26 14:28:04,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 
20: [2022-11-26 14:28:04,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 14:28:04,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 23: [2022-11-26 14:28:04,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:28:04,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 14:28:04,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 3: [2022-11-26 14:28:04,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:28:04,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 14:28:04,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 6: [2022-11-26 14:28:04,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:28:04,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 14:28:04,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 15: [2022-11-26 14:28:04,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-26 14:28:04,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 14:28:04,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 26: [2022-11-26 14:28:04,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:28:04,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 14:28:04,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 29: [2022-11-26 14:28:04,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:28:04,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 14:28:04,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 9: [2022-11-26 14:28:04,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:28:04,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 14:28:04,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 17: [2022-11-26 14:28:04,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 
17: [2022-11-26 14:28:04,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 14:28:04,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 2: [2022-11-26 14:28:04,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:28:04,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 14:28:04,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 7: [2022-11-26 14:28:04,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:28:04,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 14:28:04,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 4: [2022-11-26 14:28:04,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:28:04,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 14:28:04,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 10: [2022-11-26 14:28:04,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-26 14:28:04,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 14:28:04,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 1: [2022-11-26 14:28:04,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:28:04,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 14:28:04,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 24: [2022-11-26 14:28:04,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:28:04,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 14:28:04,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 11: [2022-11-26 14:28:04,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:28:04,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 14:28:04,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 19: [2022-11-26 14:28:04,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 
19: [2022-11-26 14:28:04,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 14:28:04,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 5: [2022-11-26 14:28:04,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:28:04,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 14:28:04,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 27: [2022-11-26 14:28:04,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:28:04,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:28:04,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 14:28:04,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 30: [2022-11-26 14:28:04,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 14:28:04,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 22: [2022-11-26 14:28:04,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
22: [2022-11-26 14:28:04,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 14:28:04,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 31: [2022-11-26 14:28:04,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:28:04,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 14:28:04,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 1: [2022-11-26 14:28:04,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:28:04,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 14:28:04,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 7: [2022-11-26 14:28:04,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:28:04,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 14:28:04,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 2: [2022-11-26 14:28:04,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
13: [2022-11-26 14:28:04,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:28:04,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 13: [2022-11-26 14:28:04,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 2: [2022-11-26 14:28:04,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 13: [2022-11-26 14:28:04,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 21: [2022-11-26 14:28:04,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 28: [2022-11-26 14:28:04,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:28:04,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 14:28:04,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 26: [2022-11-26 14:28:04,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 28: [2022-11-26 14:28:04,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 14:28:04,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
26: [2022-11-26 14:28:04,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 14:28:04,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 20: [2022-11-26 14:28:04,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:28:04,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 14:28:04,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 15: [2022-11-26 14:28:04,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:28:04,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 14:28:04,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 12: [2022-11-26 14:28:04,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:28:04,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 14:28:04,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 29: [2022-11-26 14:28:04,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 
29: [2022-11-26 14:28:04,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 14:28:04,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 8: [2022-11-26 14:28:04,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:28:04,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:28:04,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 14:28:04,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 14:28:04,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 8: [2022-11-26 14:28:04,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 30: [2022-11-26 14:28:04,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:28:04,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 6: [2022-11-26 14:28:04,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:28:04,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
27: [2022-11-26 14:28:04,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:28:04,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:28:04,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 14:28:04,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 24: [2022-11-26 14:28:04,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 27: [2022-11-26 14:28:04,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 14:28:04,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 24: [2022-11-26 14:28:04,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 19: [2022-11-26 14:28:04,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:28:04,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:28:04,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 14:28:04,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
5: [2022-11-26 14:28:04,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 14:28:04,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 18: [2022-11-26 14:28:04,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:28:04,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 14:28:04,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 17: [2022-11-26 14:28:04,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:28:04,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:28:04,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 22: [2022-11-26 14:28:04,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 17: [2022-11-26 14:28:04,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 10: [2022-11-26 14:28:04,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:28:04,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
10: [2022-11-26 14:28:04,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 14:28:04,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 3: [2022-11-26 14:28:04,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:28:04,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 14:28:04,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 25: [2022-11-26 14:28:04,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:28:04,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 14:28:04,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 4: [2022-11-26 14:28:04,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:28:04,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 14:28:04,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 11: [2022-11-26 14:28:04,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-26 14:28:04,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 14:28:04,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 14: [2022-11-26 14:28:04,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:28:04,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:28:04,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 13: [2022-11-26 14:28:04,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 14: [2022-11-26 14:28:04,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:28:04,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 14: [2022-11-26 14:28:04,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 14:28:04,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 13: [2022-11-26 14:28:04,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 21: [2022-11-26 14:28:04,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
21: [2022-11-26 14:28:04,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 23: [2022-11-26 14:28:04,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:28:04,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 18: [2022-11-26 14:28:04,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:28:04,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 14:28:04,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 18: [2022-11-26 14:28:04,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 14:28:04,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 12: [2022-11-26 14:28:04,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:28:04,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 14:28:04,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 0: [2022-11-26 14:28:04,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 14:28:04,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:28:04,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 14:28:04,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 0: [2022-11-26 14:28:04,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 14:28:04,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 3: [2022-11-26 14:28:04,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:28:04,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 14:28:04,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 25: [2022-11-26 14:28:04,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:28:04,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:28:04,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 14:28:04,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
31: [2022-11-26 14:28:04,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 25: [2022-11-26 14:28:04,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:28:04,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 25: [2022-11-26 14:28:04,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 14:28:04,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 28: [2022-11-26 14:28:04,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 14:28:04,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 14:28:04,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 16: [2022-11-26 14:28:04,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:28:04,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 14:28:04,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:28:04,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
16: [2022-11-26 14:28:04,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 14:28:04,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 9: [2022-11-26 14:28:04,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:28:04,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step90000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 14:28:04,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 0: successfully saved checkpoint at iteration 90000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2620.34 31: iteration 90010/ 173500 | consumed samples: 23042560 | consumed tokens: 47191162880 | elapsed time per iteration (s): 1.09 | learning rate: 1.061E-04 | global batch size: 256 | lm loss: 1.945086E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.464 | TFLOPs: 14.18 | 31: iteration 90020/ 173500 | consumed samples: 23045120 | consumed tokens: 47196405760 | elapsed time per iteration (s): 0.78 | learning rate: 1.060E-04 | global batch size: 256 | lm loss: 1.978701E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.291 | TFLOPs: 19.74 | 31: iteration 90030/ 173500 | consumed samples: 23047680 | consumed tokens: 47201648640 | elapsed time per iteration (s): 0.79 | learning rate: 1.060E-04 | global batch size: 256 | lm loss: 2.005658E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.046 | TFLOPs: 19.54 | 31: iteration 90040/ 173500 | 
consumed samples: 23050240 | consumed tokens: 47206891520 | elapsed time per iteration (s): 0.75 | learning rate: 1.060E-04 | global batch size: 256 | lm loss: 1.978810E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.458 | TFLOPs: 20.54 | 31: iteration 90050/ 173500 | consumed samples: 23052800 | consumed tokens: 47212134400 | elapsed time per iteration (s): 0.79 | learning rate: 1.060E-04 | global batch size: 256 | lm loss: 1.975096E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.549 | TFLOPs: 19.63 | 31: iteration 90060/ 173500 | consumed samples: 23055360 | consumed tokens: 47217377280 | elapsed time per iteration (s): 0.72 | learning rate: 1.060E-04 | global batch size: 256 | lm loss: 1.984416E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.536 | TFLOPs: 21.39 | 31: iteration 90070/ 173500 | consumed samples: 23057920 | consumed tokens: 47222620160 | elapsed time per iteration (s): 0.81 | learning rate: 1.060E-04 | global batch size: 256 | lm loss: 1.991406E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.793 | TFLOPs: 19.04 | 31: iteration 90080/ 173500 | consumed samples: 23060480 | consumed tokens: 47227863040 | elapsed time per iteration (s): 0.74 | learning rate: 1.059E-04 | global batch size: 256 | lm loss: 1.976716E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.872 | TFLOPs: 20.92 | 31: iteration 90090/ 173500 | consumed samples: 23063040 | consumed tokens: 47233105920 | elapsed time per iteration (s): 0.77 | learning rate: 1.059E-04 | global batch size: 256 | lm loss: 1.994694E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 332.257 | TFLOPs: 20.10 | 31: iteration 90100/ 173500 | consumed samples: 23065600 | consumed tokens: 47238348800 | elapsed time per iteration (s): 0.79 | learning rate: 1.059E-04 | global batch size: 256 | lm loss: 1.964086E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.004 | TFLOPs: 19.72 | 31: iteration 90110/ 173500 | consumed samples: 23068160 | consumed tokens: 47243591680 | elapsed time per iteration (s): 0.77 | learning rate: 1.059E-04 | global batch size: 256 | lm loss: 1.993801E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.975 | TFLOPs: 20.08 | 31: iteration 90120/ 173500 | consumed samples: 23070720 | consumed tokens: 47248834560 | elapsed time per iteration (s): 0.75 | learning rate: 1.059E-04 | global batch size: 256 | lm loss: 2.019399E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.626 | TFLOPs: 20.55 | 31: iteration 90130/ 173500 | consumed samples: 23073280 | consumed tokens: 47254077440 | elapsed time per iteration (s): 0.73 | learning rate: 1.059E-04 | global batch size: 256 | lm loss: 1.975833E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.245 | TFLOPs: 21.31 | 31: iteration 90140/ 173500 | consumed samples: 23075840 | consumed tokens: 47259320320 | elapsed time per iteration (s): 0.76 | learning rate: 1.058E-04 | global batch size: 256 | lm loss: 1.998074E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.503 | TFLOPs: 20.42 | 31: iteration 90150/ 173500 | consumed samples: 23078400 | consumed tokens: 47264563200 | elapsed time per iteration (s): 0.74 | learning rate: 1.058E-04 | global batch 
size: 256 | lm loss: 1.971658E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.490 | TFLOPs: 20.96 | 31: iteration 90160/ 173500 | consumed samples: 23080960 | consumed tokens: 47269806080 | elapsed time per iteration (s): 0.75 | learning rate: 1.058E-04 | global batch size: 256 | lm loss: 1.987761E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.870 | TFLOPs: 20.56 | 31: iteration 90170/ 173500 | consumed samples: 23083520 | consumed tokens: 47275048960 | elapsed time per iteration (s): 0.76 | learning rate: 1.058E-04 | global batch size: 256 | lm loss: 1.997199E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.075 | TFLOPs: 20.33 | 31: iteration 90180/ 173500 | consumed samples: 23086080 | consumed tokens: 47280291840 | elapsed time per iteration (s): 0.75 | learning rate: 1.058E-04 | global batch size: 256 | lm loss: 1.987051E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.945 | TFLOPs: 20.57 | 31: iteration 90190/ 173500 | consumed samples: 23088640 | consumed tokens: 47285534720 | elapsed time per iteration (s): 0.78 | learning rate: 1.058E-04 | global batch size: 256 | lm loss: 1.976734E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.075 | TFLOPs: 19.85 | 31: iteration 90200/ 173500 | consumed samples: 23091200 | consumed tokens: 47290777600 | elapsed time per iteration (s): 0.81 | learning rate: 1.058E-04 | global batch size: 256 | lm loss: 1.954696E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.644 | TFLOPs: 19.16 | 31: iteration 90210/ 173500 | consumed samples: 23093760 | 
consumed tokens: 47296020480 | elapsed time per iteration (s): 0.83 | learning rate: 1.057E-04 | global batch size: 256 | lm loss: 1.983813E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.120 | TFLOPs: 18.70 | 31: iteration 90220/ 173500 | consumed samples: 23096320 | consumed tokens: 47301263360 | elapsed time per iteration (s): 0.86 | learning rate: 1.057E-04 | global batch size: 256 | lm loss: 1.993532E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.827 | TFLOPs: 17.96 | 31: iteration 90230/ 173500 | consumed samples: 23098880 | consumed tokens: 47306506240 | elapsed time per iteration (s): 0.83 | learning rate: 1.057E-04 | global batch size: 256 | lm loss: 2.004144E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.889 | TFLOPs: 18.57 | 31: iteration 90240/ 173500 | consumed samples: 23101440 | consumed tokens: 47311749120 | elapsed time per iteration (s): 0.85 | learning rate: 1.057E-04 | global batch size: 256 | lm loss: 1.980589E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.585 | TFLOPs: 18.31 | 31: iteration 90250/ 173500 | consumed samples: 23104000 | consumed tokens: 47316992000 | elapsed time per iteration (s): 0.78 | learning rate: 1.057E-04 | global batch size: 256 | lm loss: 1.977786E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.127 | TFLOPs: 19.85 | 31: iteration 90260/ 173500 | consumed samples: 23106560 | consumed tokens: 47322234880 | elapsed time per iteration (s): 0.86 | learning rate: 1.057E-04 | global batch size: 256 | lm loss: 1.987597E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 297.015 | TFLOPs: 17.97 | 31: iteration 90270/ 173500 | consumed samples: 23109120 | consumed tokens: 47327477760 | elapsed time per iteration (s): 0.77 | learning rate: 1.056E-04 | global batch size: 256 | lm loss: 1.988876E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.224 | TFLOPs: 20.10 | 31: iteration 90280/ 173500 | consumed samples: 23111680 | consumed tokens: 47332720640 | elapsed time per iteration (s): 0.80 | learning rate: 1.056E-04 | global batch size: 256 | lm loss: 1.977213E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.409 | TFLOPs: 19.32 | 31: iteration 90290/ 173500 | consumed samples: 23114240 | consumed tokens: 47337963520 | elapsed time per iteration (s): 0.80 | learning rate: 1.056E-04 | global batch size: 256 | lm loss: 1.974808E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.683 | TFLOPs: 19.46 | 31: iteration 90300/ 173500 | consumed samples: 23116800 | consumed tokens: 47343206400 | elapsed time per iteration (s): 0.82 | learning rate: 1.056E-04 | global batch size: 256 | lm loss: 1.995910E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.652 | TFLOPs: 18.79 | 31: iteration 90310/ 173500 | consumed samples: 23119360 | consumed tokens: 47348449280 | elapsed time per iteration (s): 0.80 | learning rate: 1.056E-04 | global batch size: 256 | lm loss: 1.979742E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.981 | TFLOPs: 19.36 | 31: iteration 90320/ 173500 | consumed samples: 23121920 | consumed tokens: 47353692160 | elapsed time per iteration (s): 0.81 | learning rate: 1.056E-04 | global batch size: 256 | lm loss: 
1.962639E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.396 | TFLOPs: 19.08 | 31: iteration 90330/ 173500 | consumed samples: 23124480 | consumed tokens: 47358935040 | elapsed time per iteration (s): 0.82 | learning rate: 1.055E-04 | global batch size: 256 | lm loss: 1.963976E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.293 | TFLOPs: 18.83 | 31: iteration 90340/ 173500 | consumed samples: 23127040 | consumed tokens: 47364177920 | elapsed time per iteration (s): 0.85 | learning rate: 1.055E-04 | global batch size: 256 | lm loss: 2.008346E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.174 | TFLOPs: 18.28 | 31: iteration 90350/ 173500 | consumed samples: 23129600 | consumed tokens: 47369420800 | elapsed time per iteration (s): 0.82 | learning rate: 1.055E-04 | global batch size: 256 | lm loss: 1.959052E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.671 | TFLOPs: 18.98 | 31: iteration 90360/ 173500 | consumed samples: 23132160 | consumed tokens: 47374663680 | elapsed time per iteration (s): 0.81 | learning rate: 1.055E-04 | global batch size: 256 | lm loss: 2.005915E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.799 | TFLOPs: 19.04 | 31: iteration 90370/ 173500 | consumed samples: 23134720 | consumed tokens: 47379906560 | elapsed time per iteration (s): 0.84 | learning rate: 1.055E-04 | global batch size: 256 | lm loss: 1.981070E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.191 | TFLOPs: 18.46 | 31: iteration 90380/ 173500 | consumed samples: 23137280 | consumed tokens: 
47385149440 | elapsed time per iteration (s): 0.78 | learning rate: 1.055E-04 | global batch size: 256 | lm loss: 1.974351E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.715 | TFLOPs: 19.95 | 31: iteration 90390/ 173500 | consumed samples: 23139840 | consumed tokens: 47390392320 | elapsed time per iteration (s): 0.77 | learning rate: 1.054E-04 | global batch size: 256 | lm loss: 2.008613E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.972 | TFLOPs: 20.20 | 31: iteration 90400/ 173500 | consumed samples: 23142400 | consumed tokens: 47395635200 | elapsed time per iteration (s): 0.78 | learning rate: 1.054E-04 | global batch size: 256 | lm loss: 1.992472E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.493 | TFLOPs: 19.81 | 31: iteration 90410/ 173500 | consumed samples: 23144960 | consumed tokens: 47400878080 | elapsed time per iteration (s): 0.77 | learning rate: 1.054E-04 | global batch size: 256 | lm loss: 1.961662E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.242 | TFLOPs: 20.10 | 31: iteration 90420/ 173500 | consumed samples: 23147520 | consumed tokens: 47406120960 | elapsed time per iteration (s): 0.91 | learning rate: 1.054E-04 | global batch size: 256 | lm loss: 1.999720E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 280.706 | TFLOPs: 16.98 | 31: iteration 90430/ 173500 | consumed samples: 23150080 | consumed tokens: 47411363840 | elapsed time per iteration (s): 0.75 | learning rate: 1.054E-04 | global batch size: 256 | lm loss: 1.972836E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 341.649 | TFLOPs: 20.67 | 31: iteration 90440/ 173500 | consumed samples: 23152640 | consumed tokens: 47416606720 | elapsed time per iteration (s): 0.76 | learning rate: 1.054E-04 | global batch size: 256 | lm loss: 1.988425E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.837 | TFLOPs: 20.32 | 31: iteration 90450/ 173500 | consumed samples: 23155200 | consumed tokens: 47421849600 | elapsed time per iteration (s): 0.74 | learning rate: 1.053E-04 | global batch size: 256 | lm loss: 2.002753E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.487 | TFLOPs: 20.90 | 31: iteration 90460/ 173500 | consumed samples: 23157760 | consumed tokens: 47427092480 | elapsed time per iteration (s): 0.79 | learning rate: 1.053E-04 | global batch size: 256 | lm loss: 1.991559E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.083 | TFLOPs: 19.67 | 31: iteration 90470/ 173500 | consumed samples: 23160320 | consumed tokens: 47432335360 | elapsed time per iteration (s): 0.84 | learning rate: 1.053E-04 | global batch size: 256 | lm loss: 1.984383E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.851 | TFLOPs: 18.38 | 31: iteration 90480/ 173500 | consumed samples: 23162880 | consumed tokens: 47437578240 | elapsed time per iteration (s): 0.84 | learning rate: 1.053E-04 | global batch size: 256 | lm loss: 2.008131E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.238 | TFLOPs: 18.53 | 31: iteration 90490/ 173500 | consumed samples: 23165440 | consumed tokens: 47442821120 | elapsed time per iteration (s): 0.81 | learning rate: 1.053E-04 | global batch size: 256 | lm loss: 1.982190E+00 | grad 
norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.096 | TFLOPs: 19.06 | 31: iteration 90500/ 173500 | consumed samples: 23168000 | consumed tokens: 47448064000 | elapsed time per iteration (s): 0.75 | learning rate: 1.053E-04 | global batch size: 256 | lm loss: 2.005759E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.344 | TFLOPs: 20.59 | 31: iteration 90510/ 173500 | consumed samples: 23170560 | consumed tokens: 47453306880 | elapsed time per iteration (s): 0.79 | learning rate: 1.052E-04 | global batch size: 256 | lm loss: 1.966680E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.902 | TFLOPs: 19.72 | 31: iteration 90520/ 173500 | consumed samples: 23173120 | consumed tokens: 47458549760 | elapsed time per iteration (s): 0.80 | learning rate: 1.052E-04 | global batch size: 256 | lm loss: 1.990223E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.015 | TFLOPs: 19.36 | 31: iteration 90530/ 173500 | consumed samples: 23175680 | consumed tokens: 47463792640 | elapsed time per iteration (s): 0.75 | learning rate: 1.052E-04 | global batch size: 256 | lm loss: 1.962267E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.057 | TFLOPs: 20.63 | 31: iteration 90540/ 173500 | consumed samples: 23178240 | consumed tokens: 47469035520 | elapsed time per iteration (s): 0.82 | learning rate: 1.052E-04 | global batch size: 256 | lm loss: 1.996496E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.678 | TFLOPs: 18.92 | 31: iteration 90550/ 173500 | consumed samples: 23180800 | consumed tokens: 47474278400 | elapsed time 
per iteration (s): 0.75 | learning rate: 1.052E-04 | global batch size: 256 | lm loss: 1.986866E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.616 | TFLOPs: 20.79 | 31: iteration 90560/ 173500 | consumed samples: 23183360 | consumed tokens: 47479521280 | elapsed time per iteration (s): 0.77 | learning rate: 1.052E-04 | global batch size: 256 | lm loss: 2.013242E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.221 | TFLOPs: 20.10 | 31: iteration 90570/ 173500 | consumed samples: 23185920 | consumed tokens: 47484764160 | elapsed time per iteration (s): 0.76 | learning rate: 1.051E-04 | global batch size: 256 | lm loss: 2.016931E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.915 | TFLOPs: 20.38 | 31: iteration 90580/ 173500 | consumed samples: 23188480 | consumed tokens: 47490007040 | elapsed time per iteration (s): 0.74 | learning rate: 1.051E-04 | global batch size: 256 | lm loss: 1.960550E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.265 | TFLOPs: 20.83 | 31: iteration 90590/ 173500 | consumed samples: 23191040 | consumed tokens: 47495249920 | elapsed time per iteration (s): 0.75 | learning rate: 1.051E-04 | global batch size: 256 | lm loss: 2.006560E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.400 | TFLOPs: 20.53 | 31: iteration 90600/ 173500 | consumed samples: 23193600 | consumed tokens: 47500492800 | elapsed time per iteration (s): 0.75 | learning rate: 1.051E-04 | global batch size: 256 | lm loss: 1.992742E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.266 | TFLOPs: 
20.52 | 31: iteration 90610/ 173500 | consumed samples: 23196160 | consumed tokens: 47505735680 | elapsed time per iteration (s): 0.79 | learning rate: 1.051E-04 | global batch size: 256 | lm loss: 1.981343E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.178 | TFLOPs: 19.55 | 31: iteration 90620/ 173500 | consumed samples: 23198720 | consumed tokens: 47510978560 | elapsed time per iteration (s): 0.84 | learning rate: 1.051E-04 | global batch size: 256 | lm loss: 1.962041E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.287 | TFLOPs: 18.41 | 31: iteration 90630/ 173500 | consumed samples: 23201280 | consumed tokens: 47516221440 | elapsed time per iteration (s): 0.76 | learning rate: 1.050E-04 | global batch size: 256 | lm loss: 2.007060E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.967 | TFLOPs: 20.51 | 31: iteration 90640/ 173500 | consumed samples: 23203840 | consumed tokens: 47521464320 | elapsed time per iteration (s): 0.71 | learning rate: 1.050E-04 | global batch size: 256 | lm loss: 1.981795E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.107 | TFLOPs: 21.66 | 31: iteration 90650/ 173500 | consumed samples: 23206400 | consumed tokens: 47526707200 | elapsed time per iteration (s): 0.81 | learning rate: 1.050E-04 | global batch size: 256 | lm loss: 1.964214E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.581 | TFLOPs: 19.09 | 31: iteration 90660/ 173500 | consumed samples: 23208960 | consumed tokens: 47531950080 | elapsed time per iteration (s): 0.75 | learning rate: 1.050E-04 | global batch size: 256 | lm loss: 1.985663E+00 | grad norm: 0.143 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.957 | TFLOPs: 20.57 | 31: iteration 90670/ 173500 | consumed samples: 23211520 | consumed tokens: 47537192960 | elapsed time per iteration (s): 0.85 | learning rate: 1.050E-04 | global batch size: 256 | lm loss: 2.000380E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.657 | TFLOPs: 18.31 | 31: iteration 90680/ 173500 | consumed samples: 23214080 | consumed tokens: 47542435840 | elapsed time per iteration (s): 0.78 | learning rate: 1.050E-04 | global batch size: 256 | lm loss: 1.983416E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.557 | TFLOPs: 19.88 | 31: iteration 90690/ 173500 | consumed samples: 23216640 | consumed tokens: 47547678720 | elapsed time per iteration (s): 0.75 | learning rate: 1.049E-04 | global batch size: 256 | lm loss: 1.986243E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.812 | TFLOPs: 20.74 | 31: iteration 90700/ 173500 | consumed samples: 23219200 | consumed tokens: 47552921600 | elapsed time per iteration (s): 0.81 | learning rate: 1.049E-04 | global batch size: 256 | lm loss: 1.977657E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.945 | TFLOPs: 19.11 | 31: iteration 90710/ 173500 | consumed samples: 23221760 | consumed tokens: 47558164480 | elapsed time per iteration (s): 0.81 | learning rate: 1.049E-04 | global batch size: 256 | lm loss: 1.986362E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.476 | TFLOPs: 19.15 | 31: iteration 90720/ 173500 | consumed samples: 23224320 | consumed tokens: 47563407360 | elapsed time per iteration (s): 0.75 | 
learning rate: 1.049E-04 | global batch size: 256 | lm loss: 2.023323E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.715 | TFLOPs: 20.61 | 31: iteration 90730/ 173500 | consumed samples: 23226880 | consumed tokens: 47568650240 | elapsed time per iteration (s): 0.86 | learning rate: 1.049E-04 | global batch size: 256 | lm loss: 1.987287E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.151 | TFLOPs: 18.10 | 31: iteration 90740/ 173500 | consumed samples: 23229440 | consumed tokens: 47573893120 | elapsed time per iteration (s): 0.77 | learning rate: 1.049E-04 | global batch size: 256 | lm loss: 1.973873E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.477 | TFLOPs: 20.17 | 31: iteration 90750/ 173500 | consumed samples: 23232000 | consumed tokens: 47579136000 | elapsed time per iteration (s): 0.76 | learning rate: 1.048E-04 | global batch size: 256 | lm loss: 1.979108E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.150 | TFLOPs: 20.28 | 31: iteration 90760/ 173500 | consumed samples: 23234560 | consumed tokens: 47584378880 | elapsed time per iteration (s): 0.75 | learning rate: 1.048E-04 | global batch size: 256 | lm loss: 1.985284E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.512 | TFLOPs: 20.78 | 31: iteration 90770/ 173500 | consumed samples: 23237120 | consumed tokens: 47589621760 | elapsed time per iteration (s): 0.76 | learning rate: 1.048E-04 | global batch size: 256 | lm loss: 1.997686E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.535 | TFLOPs: 20.36 | 31: iteration 90780/ 
173500 | consumed samples: 23239680 | consumed tokens: 47594864640 | elapsed time per iteration (s): 0.82 | learning rate: 1.048E-04 | global batch size: 256 | lm loss: 1.998265E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.645 | TFLOPs: 18.85 | 31: iteration 90790/ 173500 | consumed samples: 23242240 | consumed tokens: 47600107520 | elapsed time per iteration (s): 0.78 | learning rate: 1.048E-04 | global batch size: 256 | lm loss: 1.980381E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.653 | TFLOPs: 19.94 | 31: iteration 90800/ 173500 | consumed samples: 23244800 | consumed tokens: 47605350400 | elapsed time per iteration (s): 0.78 | learning rate: 1.048E-04 | global batch size: 256 | lm loss: 2.009459E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.760 | TFLOPs: 19.77 | 31: iteration 90810/ 173500 | consumed samples: 23247360 | consumed tokens: 47610593280 | elapsed time per iteration (s): 0.78 | learning rate: 1.047E-04 | global batch size: 256 | lm loss: 1.990258E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.071 | TFLOPs: 19.91 | 31: iteration 90820/ 173500 | consumed samples: 23249920 | consumed tokens: 47615836160 | elapsed time per iteration (s): 0.82 | learning rate: 1.047E-04 | global batch size: 256 | lm loss: 1.991203E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.144 | TFLOPs: 18.94 | 31: iteration 90830/ 173500 | consumed samples: 23252480 | consumed tokens: 47621079040 | elapsed time per iteration (s): 0.81 | learning rate: 1.047E-04 | global batch size: 256 | lm loss: 1.951587E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 315.144 | TFLOPs: 19.07 | 31: iteration 90840/ 173500 | consumed samples: 23255040 | consumed tokens: 47626321920 | elapsed time per iteration (s): 0.78 | learning rate: 1.047E-04 | global batch size: 256 | lm loss: 1.962646E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.552 | TFLOPs: 19.76 | 31: iteration 90850/ 173500 | consumed samples: 23257600 | consumed tokens: 47631564800 | elapsed time per iteration (s): 0.74 | learning rate: 1.047E-04 | global batch size: 256 | lm loss: 2.004413E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.093 | TFLOPs: 20.94 | 31: iteration 90860/ 173500 | consumed samples: 23260160 | consumed tokens: 47636807680 | elapsed time per iteration (s): 0.81 | learning rate: 1.047E-04 | global batch size: 256 | lm loss: 1.965429E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.741 | TFLOPs: 19.04 | 31: iteration 90870/ 173500 | consumed samples: 23262720 | consumed tokens: 47642050560 | elapsed time per iteration (s): 0.82 | learning rate: 1.046E-04 | global batch size: 256 | lm loss: 1.986607E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.955 | TFLOPs: 18.99 | 31: iteration 90880/ 173500 | consumed samples: 23265280 | consumed tokens: 47647293440 | elapsed time per iteration (s): 0.80 | learning rate: 1.046E-04 | global batch size: 256 | lm loss: 2.001915E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.574 | TFLOPs: 19.33 | 31: iteration 90890/ 173500 | consumed samples: 23267840 | consumed tokens: 47652536320 | elapsed time per iteration (s): 0.78 | learning rate: 
1.046E-04 | global batch size: 256 | lm loss: 2.023805E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.150 | TFLOPs: 19.73 | 31: iteration 90900/ 173500 | consumed samples: 23270400 | consumed tokens: 47657779200 | elapsed time per iteration (s): 0.76 | learning rate: 1.046E-04 | global batch size: 256 | lm loss: 1.975356E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.276 | TFLOPs: 20.34 | 31: iteration 90910/ 173500 | consumed samples: 23272960 | consumed tokens: 47663022080 | elapsed time per iteration (s): 0.75 | learning rate: 1.046E-04 | global batch size: 256 | lm loss: 2.001402E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.012 | TFLOPs: 20.69 | 31: iteration 90920/ 173500 | consumed samples: 23275520 | consumed tokens: 47668264960 | elapsed time per iteration (s): 0.77 | learning rate: 1.046E-04 | global batch size: 256 | lm loss: 1.991751E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.811 | TFLOPs: 20.13 | 31: iteration 90930/ 173500 | consumed samples: 23278080 | consumed tokens: 47673507840 | elapsed time per iteration (s): 0.77 | learning rate: 1.046E-04 | global batch size: 256 | lm loss: 1.990572E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.799 | TFLOPs: 20.07 | 31: iteration 90940/ 173500 | consumed samples: 23280640 | consumed tokens: 47678750720 | elapsed time per iteration (s): 0.80 | learning rate: 1.045E-04 | global batch size: 256 | lm loss: 1.972711E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.238 | TFLOPs: 19.25 | 31: iteration 90950/ 173500 | 
consumed samples: 23283200 | consumed tokens: 47683993600 | elapsed time per iteration (s): 0.80 | learning rate: 1.045E-04 | global batch size: 256 | lm loss: 1.981438E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.392 | TFLOPs: 19.32 | 31: iteration 90960/ 173500 | consumed samples: 23285760 | consumed tokens: 47689236480 | elapsed time per iteration (s): 0.77 | learning rate: 1.045E-04 | global batch size: 256 | lm loss: 2.004279E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.628 | TFLOPs: 20.12 | 31: iteration 90970/ 173500 | consumed samples: 23288320 | consumed tokens: 47694479360 | elapsed time per iteration (s): 0.74 | learning rate: 1.045E-04 | global batch size: 256 | lm loss: 2.002094E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.359 | TFLOPs: 20.83 | 31: iteration 90980/ 173500 | consumed samples: 23290880 | consumed tokens: 47699722240 | elapsed time per iteration (s): 0.81 | learning rate: 1.045E-04 | global batch size: 256 | lm loss: 1.985630E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.504 | TFLOPs: 19.03 | 31: iteration 90990/ 173500 | consumed samples: 23293440 | consumed tokens: 47704965120 | elapsed time per iteration (s): 0.82 | learning rate: 1.045E-04 | global batch size: 256 | lm loss: 1.974546E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.228 | TFLOPs: 18.83 | 31: iteration 91000/ 173500 | consumed samples: 23296000 | consumed tokens: 47710208000 | elapsed time per iteration (s): 0.81 | learning rate: 1.044E-04 | global batch size: 256 | lm loss: 1.997088E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 317.179 | TFLOPs: 19.19 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 91000 | lm loss value: 1.944751E+00 | lm loss PPL: 6.991891E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 91000 to checkpoints_1b1long 0: [2022-11-26 14:41:12,154] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step91000 is begin to save! 0: [2022-11-26 14:41:12,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_01-model_00-model_states.pt... 0: [2022-11-26 14:41:12,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_01-model_00-model_states.pt. 0: [2022-11-26 14:41:12,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_03-model_00-model_states.pt... 0: [2022-11-26 14:41:12,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_03-model_00-model_states.pt. 0: [2022-11-26 14:41:12,449] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_04-model_00-model_states.pt... 0: [2022-11-26 14:41:12,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_04-model_00-model_states.pt. 0: [2022-11-26 14:41:12,526] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_05-model_00-model_states.pt... 0: [2022-11-26 14:41:12,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_05-model_00-model_states.pt. 0: [2022-11-26 14:41:12,600] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_06-model_00-model_states.pt... 
0: [2022-11-26 14:41:12,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_06-model_00-model_states.pt. 0: [2022-11-26 14:41:12,674] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_07-model_00-model_states.pt... 0: [2022-11-26 14:41:12,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_07-model_00-model_states.pt. 0: [2022-11-26 14:41:12,749] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_08-model_00-model_states.pt... 0: [2022-11-26 14:41:12,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_08-model_00-model_states.pt. 0: [2022-11-26 14:41:12,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_09-model_00-model_states.pt... 0: [2022-11-26 14:41:12,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_09-model_00-model_states.pt. 0: [2022-11-26 14:41:12,899] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_10-model_00-model_states.pt... 0: [2022-11-26 14:41:12,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_10-model_00-model_states.pt. 0: [2022-11-26 14:41:12,974] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_11-model_00-model_states.pt... 0: [2022-11-26 14:41:13,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_11-model_00-model_states.pt. 0: [2022-11-26 14:41:13,051] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_12-model_00-model_states.pt... 
0: [2022-11-26 14:41:13,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_12-model_00-model_states.pt. 0: [2022-11-26 14:41:13,124] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_13-model_00-model_states.pt... 0: [2022-11-26 14:41:13,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_13-model_00-model_states.pt. 0: [2022-11-26 14:41:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_14-model_00-model_states.pt... 0: [2022-11-26 14:41:13,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_14-model_00-model_states.pt. 0: [2022-11-26 14:41:13,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_15-model_00-model_states.pt... 0: [2022-11-26 14:41:13,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_15-model_00-model_states.pt. 0: [2022-11-26 14:41:13,352] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_16-model_00-model_states.pt... 0: [2022-11-26 14:41:13,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_16-model_00-model_states.pt. 0: [2022-11-26 14:41:13,427] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_17-model_00-model_states.pt... 0: [2022-11-26 14:41:13,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_17-model_00-model_states.pt. 0: [2022-11-26 14:41:13,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_18-model_00-model_states.pt... 
0: [2022-11-26 14:41:13,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_18-model_00-model_states.pt. 0: [2022-11-26 14:41:13,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_19-model_00-model_states.pt... 0: [2022-11-26 14:41:13,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_19-model_00-model_states.pt. 0: [2022-11-26 14:41:13,651] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_20-model_00-model_states.pt... 0: [2022-11-26 14:41:13,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_20-model_00-model_states.pt. 0: [2022-11-26 14:41:13,726] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_21-model_00-model_states.pt... 0: [2022-11-26 14:41:13,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_21-model_00-model_states.pt. 0: [2022-11-26 14:41:13,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_22-model_00-model_states.pt... 0: [2022-11-26 14:41:13,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_22-model_00-model_states.pt. 0: [2022-11-26 14:41:13,875] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_23-model_00-model_states.pt... 0: [2022-11-26 14:41:13,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_23-model_00-model_states.pt. 0: [2022-11-26 14:41:13,948] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_24-model_00-model_states.pt... 
0: [2022-11-26 14:41:14,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_24-model_00-model_states.pt. 0: [2022-11-26 14:41:14,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_25-model_00-model_states.pt... 0: [2022-11-26 14:41:14,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_25-model_00-model_states.pt. 0: [2022-11-26 14:41:14,100] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_26-model_00-model_states.pt... 0: [2022-11-26 14:41:14,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_26-model_00-model_states.pt. 0: [2022-11-26 14:41:14,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_27-model_00-model_states.pt... 0: [2022-11-26 14:41:14,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_27-model_00-model_states.pt. 0: [2022-11-26 14:41:14,246] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_28-model_00-model_states.pt... 0: [2022-11-26 14:41:14,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_28-model_00-model_states.pt. 0: [2022-11-26 14:41:14,321] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/layer_30-model_00-model_states.pt... 0: [2022-11-26 14:41:14,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/layer_30-model_00-model_states.pt. 
0: [2022-11-26 14:41:14,325] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step91000/mp_rank_00_model_states.pt 0: [2022-11-26 14:41:14,325] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/mp_rank_00_model_states.pt... 0: [2022-11-26 14:41:14,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/mp_rank_00_model_states.pt. 0: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 
7: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
10: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 
13: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
15: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 
28: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 
30: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 
19: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
6: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
9: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 
16: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
12: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
28: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 
29: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 
18: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 
5: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
16: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 
20: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 
14: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 
17: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
9: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 
25: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 
26: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 
21: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:41:14,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:41:14,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:41:14,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 14:41:14,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 14: [2022-11-26 14:41:14,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-26 14:41:14,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 14:41:14,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 17: [2022-11-26 14:41:14,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:41:14,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:41:14,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 14:41:14,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 1: [2022-11-26 14:41:14,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 14:41:14,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 25: [2022-11-26 14:41:14,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:41:14,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 14:41:14,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 8: [2022-11-26 14:41:14,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-26 14:41:14,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 14:41:14,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 9: [2022-11-26 14:41:14,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:41:14,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 14:41:14,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 21: [2022-11-26 14:41:14,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:41:14,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 27: [2022-11-26 14:41:14,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:41:14,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 27: [2022-11-26 14:41:14,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 14:41:14,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 4: [2022-11-26 14:41:14,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 14:41:14,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 14:41:14,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 13: [2022-11-26 14:41:14,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:41:14,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 11: [2022-11-26 14:41:14,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:41:14,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 11: [2022-11-26 14:41:14,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 12: [2022-11-26 14:41:14,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:41:14,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 12: [2022-11-26 14:41:14,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 14:41:14,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 6: [2022-11-26 14:41:14,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
7: [2022-11-26 14:41:14,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:41:14,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 7: [2022-11-26 14:41:14,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 24: [2022-11-26 14:41:14,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:41:14,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 6: [2022-11-26 14:41:14,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 24: [2022-11-26 14:41:14,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 14:41:14,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 23: [2022-11-26 14:41:14,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:41:14,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 14:41:14,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 10: [2022-11-26 14:41:14,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
2: [2022-11-26 14:41:14,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:41:14,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 2: [2022-11-26 14:41:14,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 10: [2022-11-26 14:41:14,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 2: [2022-11-26 14:41:14,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 11: [2022-11-26 14:41:14,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:41:14,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 14:41:14,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 6: [2022-11-26 14:41:14,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:41:14,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 14:41:14,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 3: [2022-11-26 14:41:14,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 14:41:14,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 4: [2022-11-26 14:41:14,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:41:14,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 4: [2022-11-26 14:41:14,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 14:41:14,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 20: [2022-11-26 14:41:14,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:41:14,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 14:41:14,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 29: [2022-11-26 14:41:14,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:41:14,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:41:14,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
29: [2022-11-26 14:41:14,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 15: [2022-11-26 14:41:14,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 29: [2022-11-26 14:41:14,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 15: [2022-11-26 14:41:14,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 24: [2022-11-26 14:41:14,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 14:41:14,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 14: [2022-11-26 14:41:14,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:41:14,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:41:14,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 14:41:14,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 2: [2022-11-26 14:41:14,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 14:41:14,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
16: [2022-11-26 14:41:14,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:41:14,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:41:14,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 21: [2022-11-26 14:41:14,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 16: [2022-11-26 14:41:14,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 21: [2022-11-26 14:41:14,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 13: [2022-11-26 14:41:14,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:41:14,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 14:41:14,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 30: [2022-11-26 14:41:14,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:41:14,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 14:41:14,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
12: [2022-11-26 14:41:14,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:41:14,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:41:14,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:41:14,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 24: [2022-11-26 14:41:14,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:41:14,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 10: [2022-11-26 14:41:14,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 12: [2022-11-26 14:41:14,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 10: [2022-11-26 14:41:14,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 30: [2022-11-26 14:41:14,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 24: [2022-11-26 14:41:14,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 14:41:14,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
31: [2022-11-26 14:41:14,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:41:14,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 14:41:14,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 27: [2022-11-26 14:41:14,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:41:14,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:41:14,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 14: [2022-11-26 14:41:14,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 14:41:14,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 27: [2022-11-26 14:41:14,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 6: [2022-11-26 14:41:14,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:41:14,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 14:41:14,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
7: [2022-11-26 14:41:14,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:41:14,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 14:41:14,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 8: [2022-11-26 14:41:14,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:41:14,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 14:41:14,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 8: [2022-11-26 14:41:14,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:41:14,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 0: [2022-11-26 14:41:14,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:41:14,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 3: [2022-11-26 14:41:14,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 14:41:14,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 20: [2022-11-26 14:41:14,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:41:14,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 20: [2022-11-26 14:41:14,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 14:41:14,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 26: [2022-11-26 14:41:14,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:41:14,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 14:41:14,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 30: [2022-11-26 14:41:14,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:41:14,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 16: [2022-11-26 14:41:14,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:41:14,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
16: [2022-11-26 14:41:14,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 14:41:14,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 2: [2022-11-26 14:41:14,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:41:14,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 14:41:14,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 4: [2022-11-26 14:41:14,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:41:14,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:41:14,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 14:41:14,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 1: [2022-11-26 14:41:14,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 17: [2022-11-26 14:41:14,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:41:14,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
17: [2022-11-26 14:41:14,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 5: [2022-11-26 14:41:14,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:41:14,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:41:14,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 5: [2022-11-26 14:41:14,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 9: [2022-11-26 14:41:14,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 17: [2022-11-26 14:41:14,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:41:14,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 9: [2022-11-26 14:41:14,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 17: [2022-11-26 14:41:14,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 14:41:14,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 5: [2022-11-26 14:41:14,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
11: [2022-11-26 14:41:14,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:41:14,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 25: [2022-11-26 14:41:14,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:41:14,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:41:14,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 5: [2022-11-26 14:41:14,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 25: [2022-11-26 14:41:14,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 11: [2022-11-26 14:41:14,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 25: [2022-11-26 14:41:14,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 14:41:14,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 25: [2022-11-26 14:41:14,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 23: [2022-11-26 14:41:14,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-26 14:41:14,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:41:14,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 14:41:14,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 14:41:14,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 23: [2022-11-26 14:41:14,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 15: [2022-11-26 14:41:14,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:41:14,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 27: [2022-11-26 14:41:14,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:41:14,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 31: [2022-11-26 14:41:14,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:41:14,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 14:41:14,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
31: [2022-11-26 14:41:14,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 14:41:14,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 15: [2022-11-26 14:41:14,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:41:14,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:41:14,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 15: [2022-11-26 14:41:14,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 26: [2022-11-26 14:41:14,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 15: [2022-11-26 14:41:14,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 10: [2022-11-26 14:41:14,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:41:14,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 14:41:14,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 3: [2022-11-26 14:41:14,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 14:41:14,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 14:41:14,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 20: [2022-11-26 14:41:14,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:41:14,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 14:41:14,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 24: [2022-11-26 14:41:14,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:41:14,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:41:14,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 9: [2022-11-26 14:41:14,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 24: [2022-11-26 14:41:14,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 9: [2022-11-26 14:41:14,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 13: [2022-11-26 14:41:14,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
1: [2022-11-26 14:41:14,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:41:14,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 14:41:14,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 21: [2022-11-26 14:41:14,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:41:14,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 21: [2022-11-26 14:41:14,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 14:41:14,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 1: [2022-11-26 14:41:14,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 12: [2022-11-26 14:41:14,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:41:14,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
12: [2022-11-26 14:41:14,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 16: [2022-11-26 14:41:14,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 12: [2022-11-26 14:41:14,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 16: [2022-11-26 14:41:14,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 6: [2022-11-26 14:41:14,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:41:14,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 14:41:14,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 15: [2022-11-26 14:41:14,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:41:14,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 14:41:14,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 11: [2022-11-26 14:41:14,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-26 14:41:14,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 14:41:14,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 7: [2022-11-26 14:41:14,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:41:14,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 14:41:14,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 29: [2022-11-26 14:41:14,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:41:14,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 14:41:14,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 0: [2022-11-26 14:41:14,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:41:14,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 14:41:14,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 31: [2022-11-26 14:41:14,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
31: [2022-11-26 14:41:14,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 14:41:14,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 14: [2022-11-26 14:41:14,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:41:14,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 14:41:14,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 22: [2022-11-26 14:41:14,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:41:14,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 14:41:14,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 2: [2022-11-26 14:41:14,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:41:14,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 14:41:14,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 8: [2022-11-26 14:41:14,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
29: [2022-11-26 14:41:14,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:41:14,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 14:41:14,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 29: [2022-11-26 14:41:14,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 22: [2022-11-26 14:41:14,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:41:14,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 29: [2022-11-26 14:41:14,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 22: [2022-11-26 14:41:14,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 13: [2022-11-26 14:41:14,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:41:14,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 14:41:14,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 4: [2022-11-26 14:41:14,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 14:41:14,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 14:41:14,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 22: [2022-11-26 14:41:14,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:41:14,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 14:41:14,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 1: [2022-11-26 14:41:14,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:41:14,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 14:41:14,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 17: [2022-11-26 14:41:14,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:41:14,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 14:41:14,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 5: [2022-11-26 14:41:14,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 14:41:14,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 14:41:14,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 29: [2022-11-26 14:41:14,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:41:14,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 14:41:14,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 12: [2022-11-26 14:41:14,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:41:14,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 14:41:14,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 0: [2022-11-26 14:41:14,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:41:14,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 14:41:14,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 30: [2022-11-26 14:41:14,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 
30: [2022-11-26 14:41:14,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 14:41:14,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 0: [2022-11-26 14:41:14,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 14:41:14,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 18: [2022-11-26 14:41:14,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:41:14,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 14:41:14,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 7: [2022-11-26 14:41:14,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:41:14,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 14:41:14,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 22: [2022-11-26 14:41:14,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
22: [2022-11-26 14:41:14,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 14:41:14,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 19: [2022-11-26 14:41:14,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:41:14,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:41:14,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 14:41:14,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:41:14,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:41:14,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 14:41:14,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 19: [2022-11-26 14:41:14,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
19: [2022-11-26 14:41:14,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 14:41:14,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 14:41:14,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 19: [2022-11-26 14:41:14,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 5: [2022-11-26 14:41:14,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:41:14,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 14:41:14,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 28: [2022-11-26 14:41:14,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 14:41:14,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 14:41:14,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 14:41:14,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
28: [2022-11-26 14:41:14,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 14:41:14,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 14:41:14,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 14:41:14,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 14:41:14,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 28: [2022-11-26 14:41:14,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 28: [2022-11-26 14:41:14,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 28: [2022-11-26 14:41:14,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 10: [2022-11-26 14:41:14,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:41:14,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 14:41:14,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 18: [2022-11-26 14:41:14,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
18: [2022-11-26 14:41:14,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 14:41:14,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 25: [2022-11-26 14:41:14,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:41:14,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 18: [2022-11-26 14:41:14,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:41:14,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 25: [2022-11-26 14:41:14,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 18: [2022-11-26 14:41:14,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 20: [2022-11-26 14:41:14,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:41:14,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 14:41:14,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 26: [2022-11-26 14:41:14,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
26: [2022-11-26 14:41:14,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 14:41:14,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 18: [2022-11-26 14:41:14,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:41:14,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 14:41:14,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 23: [2022-11-26 14:41:14,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:41:14,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 14:41:14,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 16: [2022-11-26 14:41:14,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:41:14,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 14:41:14,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 3: [2022-11-26 14:41:14,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 14:41:14,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 14:41:14,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 27: [2022-11-26 14:41:14,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:41:14,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 14:41:14,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 21: [2022-11-26 14:41:14,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:41:14,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 14:41:14,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 31: [2022-11-26 14:41:14,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:41:14,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 14:41:14,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 9: [2022-11-26 14:41:14,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-26 14:41:14,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 14:41:14,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 24: [2022-11-26 14:41:14,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:41:14,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 14:41:14,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 6: [2022-11-26 14:41:14,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:41:14,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 14:41:14,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 29: [2022-11-26 14:41:14,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:41:14,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
29: [2022-11-26 14:41:14,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 2: [2022-11-26 14:41:14,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 14:41:14,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 29: [2022-11-26 14:41:14,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 17: [2022-11-26 14:41:14,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:41:14,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 14:41:14,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 14: [2022-11-26 14:41:14,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:41:14,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 14:41:14,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 15: [2022-11-26 14:41:14,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-26 14:41:14,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 14:41:14,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 4: [2022-11-26 14:41:14,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:41:14,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 14:41:14,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 11: [2022-11-26 14:41:14,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:41:14,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 14:41:14,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 13: [2022-11-26 14:41:14,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:41:14,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 14:41:14,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 8: [2022-11-26 14:41:14,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 14:41:14,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 14:41:14,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 19: [2022-11-26 14:41:14,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:41:14,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 14:41:14,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 30: [2022-11-26 14:41:14,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:41:14,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 14:41:14,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 12: [2022-11-26 14:41:14,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:41:14,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 14:41:14,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 1: [2022-11-26 14:41:14,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-26 14:41:14,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 28: [2022-11-26 14:41:14,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:41:14,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 28: [2022-11-26 14:41:14,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 14:41:14,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 0: [2022-11-26 14:41:14,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:41:14,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 14:41:14,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 7: [2022-11-26 14:41:14,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:41:14,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 14:41:14,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 16: [2022-11-26 14:41:14,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
16: [2022-11-26 14:41:14,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 14:41:14,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 18: [2022-11-26 14:41:14,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:41:14,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 14:41:14,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 26: [2022-11-26 14:41:14,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:41:14,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 14:41:14,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 31: [2022-11-26 14:41:14,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:41:14,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 14:41:14,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 9: [2022-11-26 14:41:14,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 14:41:14,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 14:41:14,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 25: [2022-11-26 14:41:14,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:41:14,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 14:41:14,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 20: [2022-11-26 14:41:14,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:41:14,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 14:41:14,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 23: [2022-11-26 14:41:14,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:41:14,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 10: [2022-11-26 14:41:14,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 14:41:14,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 3: [2022-11-26 14:41:14,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:41:14,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 3: [2022-11-26 14:41:14,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 14:41:14,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 10: [2022-11-26 14:41:14,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 21: [2022-11-26 14:41:14,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:41:14,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 14:41:14,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 5: [2022-11-26 14:41:14,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:41:14,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 14:41:14,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
22: [2022-11-26 14:41:14,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:41:14,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 14:41:14,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 14: [2022-11-26 14:41:14,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:41:14,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 14:41:14,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 27: [2022-11-26 14:41:14,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:41:14,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 14:41:14,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 15: [2022-11-26 14:41:14,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:41:14,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 14:41:14,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
6: [2022-11-26 14:41:14,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:41:14,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 14:41:14,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 24: [2022-11-26 14:41:14,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:41:14,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 14:41:14,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 2: [2022-11-26 14:41:14,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:41:14,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 14:41:14,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 29: [2022-11-26 14:41:14,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:41:14,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 14:41:14,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
4: [2022-11-26 14:41:14,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:41:14,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 14:41:14,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 17: [2022-11-26 14:41:14,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:41:14,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 14:41:14,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 13: [2022-11-26 14:41:14,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:41:14,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 14:41:14,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 11: [2022-11-26 14:41:14,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:41:14,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 14:41:14,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
12: [2022-11-26 14:41:14,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:41:14,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 14:41:14,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 0: [2022-11-26 14:41:14,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:41:14,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:41:14,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 19: [2022-11-26 14:41:14,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 0: [2022-11-26 14:41:14,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 19: [2022-11-26 14:41:14,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 1: [2022-11-26 14:41:14,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:41:14,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
1: [2022-11-26 14:41:14,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 8: [2022-11-26 14:41:14,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 1: [2022-11-26 14:41:14,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 8: [2022-11-26 14:41:14,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 30: [2022-11-26 14:41:14,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:41:14,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 14:41:14,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 28: [2022-11-26 14:41:14,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 14:41:14,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 14:41:14,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 7: [2022-11-26 14:41:14,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 14:41:14,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 22: [2022-11-26 14:41:14,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:41:14,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 22: [2022-11-26 14:41:14,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 14:41:14,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 18: [2022-11-26 14:41:14,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:41:14,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 14:41:14,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 26: [2022-11-26 14:41:14,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:41:14,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 14:41:14,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 16: [2022-11-26 14:41:14,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-26 14:41:14,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 14:41:14,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 31: [2022-11-26 14:41:14,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:41:14,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:41:14,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:41:14,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 25: [2022-11-26 14:41:14,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 23: [2022-11-26 14:41:14,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 31: [2022-11-26 14:41:14,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 23: [2022-11-26 14:41:14,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 25: [2022-11-26 14:41:14,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 9: [2022-11-26 14:41:14,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 14:41:14,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 14:41:14,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 20: [2022-11-26 14:41:14,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:41:14,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 14:41:14,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 3: [2022-11-26 14:41:14,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:41:14,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 14:41:14,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 6: [2022-11-26 14:41:14,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:41:14,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 14:41:14,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 5: [2022-11-26 14:41:14,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-26 14:41:14,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 14:41:14,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 14: [2022-11-26 14:41:14,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:41:14,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 14:41:14,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 15: [2022-11-26 14:41:14,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:41:14,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 14:41:14,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 21: [2022-11-26 14:41:14,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:41:14,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 14:41:14,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 24: [2022-11-26 14:41:14,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-26 14:41:14,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 14:41:14,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 27: [2022-11-26 14:41:14,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:41:14,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 14:41:14,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 10: [2022-11-26 14:41:14,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:41:14,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 14:41:14,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 2: [2022-11-26 14:41:14,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:41:14,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 14:41:14,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 11: [2022-11-26 14:41:14,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 14:41:14,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 14:41:14,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 17: [2022-11-26 14:41:14,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:41:14,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 14:41:14,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 29: [2022-11-26 14:41:14,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:41:14,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 14:41:14,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 28: [2022-11-26 14:41:14,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:41:14,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:41:14,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
13: [2022-11-26 14:41:14,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 28: [2022-11-26 14:41:14,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 30: [2022-11-26 14:41:14,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:41:14,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 1: [2022-11-26 14:41:14,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 28: [2022-11-26 14:41:14,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 30: [2022-11-26 14:41:14,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 1: [2022-11-26 14:41:14,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 30: [2022-11-26 14:41:14,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 7: [2022-11-26 14:41:14,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:41:14,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 14:41:14,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
19: [2022-11-26 14:41:14,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:41:14,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 14:41:14,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 0: [2022-11-26 14:41:14,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:41:14,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:41:14,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 22: [2022-11-26 14:41:14,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 14:41:14,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 0: [2022-11-26 14:41:14,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 20: [2022-11-26 14:41:14,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:41:14,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 14:41:14,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
25: [2022-11-26 14:41:14,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:41:14,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:41:14,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 4: [2022-11-26 14:41:14,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 14:41:14,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 25: [2022-11-26 14:41:14,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 8: [2022-11-26 14:41:14,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:41:14,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:41:14,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 8: [2022-11-26 14:41:14,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 14:41:14,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 5: [2022-11-26 14:41:14,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
12: [2022-11-26 14:41:14,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:41:14,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 14:41:14,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 18: [2022-11-26 14:41:14,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:41:14,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 14:41:14,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 26: [2022-11-26 14:41:14,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:41:14,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 14:41:14,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 9: [2022-11-26 14:41:14,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:41:14,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 14:41:14,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
3: [2022-11-26 14:41:14,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:41:14,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 14:41:14,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 24: [2022-11-26 14:41:14,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:41:14,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:41:14,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 14:41:14,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 16: [2022-11-26 14:41:14,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 14:41:14,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 15: [2022-11-26 14:41:14,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:41:14,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 14:41:14,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
14: [2022-11-26 14:41:14,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:41:14,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:41:14,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 14:41:14,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 27: [2022-11-26 14:41:14,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 14:41:14,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 0: [2022-11-26 14:41:14,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:41:14,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 14:41:14,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 30: [2022-11-26 14:41:14,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:41:14,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 14:41:14,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
6: [2022-11-26 14:41:14,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:41:14,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:41:14,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 21: [2022-11-26 14:41:14,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 6: [2022-11-26 14:41:14,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 21: [2022-11-26 14:41:14,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 29: [2022-11-26 14:41:14,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:41:14,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 14:41:14,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 4: [2022-11-26 14:41:14,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:41:14,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 14:41:14,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
28: [2022-11-26 14:41:14,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:41:14,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 28: [2022-11-26 14:41:14,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 17: [2022-11-26 14:41:14,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 14:41:14,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 8: [2022-11-26 14:41:14,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:41:14,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 14:41:14,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 20: [2022-11-26 14:41:14,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:41:14,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 14:41:14,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
2: [2022-11-26 14:41:14,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:41:14,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:41:14,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 14:41:14,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 23: [2022-11-26 14:41:14,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 14:41:14,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 12: [2022-11-26 14:41:14,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:41:14,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:41:14,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 1: [2022-11-26 14:41:14,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 14:41:14,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 12: [2022-11-26 14:41:14,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
25: [2022-11-26 14:41:14,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:41:14,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 14:41:14,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 31: [2022-11-26 14:41:14,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:41:14,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 14:41:14,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 16: [2022-11-26 14:41:14,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:41:14,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 14:41:14,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 28: [2022-11-26 14:41:14,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 27: [2022-11-26 14:41:14,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
27: [2022-11-26 14:41:14,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 14:41:14,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 19: [2022-11-26 14:41:14,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:41:14,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 14:41:14,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 11: [2022-11-26 14:41:14,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:41:14,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 14:41:14,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 5: [2022-11-26 14:41:14,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:41:14,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 14:41:14,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 7: [2022-11-26 14:41:14,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 14:41:14,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 14:41:14,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 22: [2022-11-26 14:41:14,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:41:14,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 14:41:14,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 3: [2022-11-26 14:41:14,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:41:14,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:41:14,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 9: [2022-11-26 14:41:14,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 14:41:14,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 3: [2022-11-26 14:41:14,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 23: [2022-11-26 14:41:14,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-26 14:41:14,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 14:41:14,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 31: [2022-11-26 14:41:14,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:41:14,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 14:41:14,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 10: [2022-11-26 14:41:14,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:41:14,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 14:41:14,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 18: [2022-11-26 14:41:14,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:41:14,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 14:41:14,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 26: [2022-11-26 14:41:14,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 
26: [2022-11-26 14:41:14,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 14:41:14,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 10: [2022-11-26 14:41:14,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:41:14,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 14:41:14,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 21: [2022-11-26 14:41:14,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:41:14,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 14:41:14,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 26: [2022-11-26 14:41:14,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:41:14,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 14:41:14,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 13: [2022-11-26 14:41:14,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-26 14:41:14,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step91000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 14:41:14,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 0: successfully saved checkpoint at iteration 91000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2502.98 31: iteration 91010/ 173500 | consumed samples: 23298560 | consumed tokens: 47715450880 | elapsed time per iteration (s): 1.06 | learning rate: 1.044E-04 | global batch size: 256 | lm loss: 1.980019E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.430 | TFLOPs: 14.61 | 31: iteration 91020/ 173500 | consumed samples: 23301120 | consumed tokens: 47720693760 | elapsed time per iteration (s): 0.81 | learning rate: 1.044E-04 | global batch size: 256 | lm loss: 2.012826E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.643 | TFLOPs: 19.04 | 31: iteration 91030/ 173500 | consumed samples: 23303680 | consumed tokens: 47725936640 | elapsed time per iteration (s): 0.81 | learning rate: 1.044E-04 | global batch size: 256 | lm loss: 1.942167E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.728 | TFLOPs: 19.10 | 31: iteration 91040/ 173500 | consumed samples: 23306240 | consumed tokens: 47731179520 | elapsed time per iteration (s): 0.75 | learning rate: 1.044E-04 | global batch size: 256 | lm loss: 1.991108E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.454 | TFLOPs: 20.54 | 31: iteration 91050/ 173500 | consumed samples: 23308800 | consumed tokens: 47736422400 | elapsed time per iteration (s): 0.81 | learning rate: 1.044E-04 | global 
batch size: 256 | lm loss: 2.026581E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.798 | TFLOPs: 19.23 | 31: iteration 91060/ 173500 | consumed samples: 23311360 | consumed tokens: 47741665280 | elapsed time per iteration (s): 0.76 | learning rate: 1.043E-04 | global batch size: 256 | lm loss: 1.974108E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.063 | TFLOPs: 20.39 | 31: iteration 91070/ 173500 | consumed samples: 23313920 | consumed tokens: 47746908160 | elapsed time per iteration (s): 0.75 | learning rate: 1.043E-04 | global batch size: 256 | lm loss: 2.017748E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.486 | TFLOPs: 20.54 | 31: iteration 91080/ 173500 | consumed samples: 23316480 | consumed tokens: 47752151040 | elapsed time per iteration (s): 0.77 | learning rate: 1.043E-04 | global batch size: 256 | lm loss: 1.994722E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.000 | TFLOPs: 20.15 | 31: iteration 91090/ 173500 | consumed samples: 23319040 | consumed tokens: 47757393920 | elapsed time per iteration (s): 0.72 | learning rate: 1.043E-04 | global batch size: 256 | lm loss: 1.964939E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.885 | TFLOPs: 21.47 | 31: iteration 91100/ 173500 | consumed samples: 23321600 | consumed tokens: 47762636800 | elapsed time per iteration (s): 0.78 | learning rate: 1.043E-04 | global batch size: 256 | lm loss: 1.975489E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.447 | TFLOPs: 19.93 | 31: iteration 91110/ 173500 | consumed samples: 23324160 
| consumed tokens: 47767879680 | elapsed time per iteration (s): 0.74 | learning rate: 1.043E-04 | global batch size: 256 | lm loss: 1.989133E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.026 | TFLOPs: 20.81 | 31: iteration 91120/ 173500 | consumed samples: 23326720 | consumed tokens: 47773122560 | elapsed time per iteration (s): 0.87 | learning rate: 1.042E-04 | global batch size: 256 | lm loss: 2.029625E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.638 | TFLOPs: 17.70 | 31: iteration 91130/ 173500 | consumed samples: 23329280 | consumed tokens: 47778365440 | elapsed time per iteration (s): 0.73 | learning rate: 1.042E-04 | global batch size: 256 | lm loss: 1.984515E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.481 | TFLOPs: 21.20 | 31: iteration 91140/ 173500 | consumed samples: 23331840 | consumed tokens: 47783608320 | elapsed time per iteration (s): 0.72 | learning rate: 1.042E-04 | global batch size: 256 | lm loss: 1.995379E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.335 | TFLOPs: 21.56 | 31: iteration 91150/ 173500 | consumed samples: 23334400 | consumed tokens: 47788851200 | elapsed time per iteration (s): 0.74 | learning rate: 1.042E-04 | global batch size: 256 | lm loss: 1.999090E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.942 | TFLOPs: 20.81 | 31: iteration 91160/ 173500 | consumed samples: 23336960 | consumed tokens: 47794094080 | elapsed time per iteration (s): 0.79 | learning rate: 1.042E-04 | global batch size: 256 | lm loss: 2.004960E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 
0 | samples per second: 324.320 | TFLOPs: 19.62 | 31: iteration 91170/ 173500 | consumed samples: 23339520 | consumed tokens: 47799336960 | elapsed time per iteration (s): 0.78 | learning rate: 1.042E-04 | global batch size: 256 | lm loss: 1.991431E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.831 | TFLOPs: 19.83 | 31: iteration 91180/ 173500 | consumed samples: 23342080 | consumed tokens: 47804579840 | elapsed time per iteration (s): 0.77 | learning rate: 1.041E-04 | global batch size: 256 | lm loss: 1.983327E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.357 | TFLOPs: 19.99 | 31: iteration 91190/ 173500 | consumed samples: 23344640 | consumed tokens: 47809822720 | elapsed time per iteration (s): 0.80 | learning rate: 1.041E-04 | global batch size: 256 | lm loss: 2.009355E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.430 | TFLOPs: 19.32 | 31: iteration 91200/ 173500 | consumed samples: 23347200 | consumed tokens: 47815065600 | elapsed time per iteration (s): 0.76 | learning rate: 1.041E-04 | global batch size: 256 | lm loss: 2.005498E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.471 | TFLOPs: 20.42 | 31: iteration 91210/ 173500 | consumed samples: 23349760 | consumed tokens: 47820308480 | elapsed time per iteration (s): 0.77 | learning rate: 1.041E-04 | global batch size: 256 | lm loss: 2.022251E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.228 | TFLOPs: 20.10 | 31: iteration 91220/ 173500 | consumed samples: 23352320 | consumed tokens: 47825551360 | elapsed time per iteration (s): 0.75 | learning rate: 1.041E-04 | global batch size: 256 | lm loss: 
1.992195E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.156 | TFLOPs: 20.58 | 31: iteration 91230/ 173500 | consumed samples: 23354880 | consumed tokens: 47830794240 | elapsed time per iteration (s): 0.75 | learning rate: 1.041E-04 | global batch size: 256 | lm loss: 1.988418E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.419 | TFLOPs: 20.53 | 31: iteration 91240/ 173500 | consumed samples: 23357440 | consumed tokens: 47836037120 | elapsed time per iteration (s): 0.72 | learning rate: 1.040E-04 | global batch size: 256 | lm loss: 1.980208E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.953 | TFLOPs: 21.41 | 31: iteration 91250/ 173500 | consumed samples: 23360000 | consumed tokens: 47841280000 | elapsed time per iteration (s): 0.74 | learning rate: 1.040E-04 | global batch size: 256 | lm loss: 1.984637E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.131 | TFLOPs: 21.00 | 31: iteration 91260/ 173500 | consumed samples: 23362560 | consumed tokens: 47846522880 | elapsed time per iteration (s): 0.76 | learning rate: 1.040E-04 | global batch size: 256 | lm loss: 1.991629E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.128 | TFLOPs: 20.33 | 31: iteration 91270/ 173500 | consumed samples: 23365120 | consumed tokens: 47851765760 | elapsed time per iteration (s): 0.78 | learning rate: 1.040E-04 | global batch size: 256 | lm loss: 2.023441E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.083 | TFLOPs: 19.85 | 31: iteration 91280/ 173500 | consumed samples: 23367680 | consumed tokens: 
47857008640 | elapsed time per iteration (s): 0.84 | learning rate: 1.040E-04 | global batch size: 256 | lm loss: 1.957788E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.395 | TFLOPs: 18.48 | 31: iteration 91290/ 173500 | consumed samples: 23370240 | consumed tokens: 47862251520 | elapsed time per iteration (s): 0.79 | learning rate: 1.040E-04 | global batch size: 256 | lm loss: 1.990863E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.388 | TFLOPs: 19.50 | 31: iteration 91300/ 173500 | consumed samples: 23372800 | consumed tokens: 47867494400 | elapsed time per iteration (s): 0.77 | learning rate: 1.039E-04 | global batch size: 256 | lm loss: 1.979770E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.481 | TFLOPs: 19.99 | 31: iteration 91310/ 173500 | consumed samples: 23375360 | consumed tokens: 47872737280 | elapsed time per iteration (s): 0.76 | learning rate: 1.039E-04 | global batch size: 256 | lm loss: 1.972483E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.646 | TFLOPs: 20.25 | 31: iteration 91320/ 173500 | consumed samples: 23377920 | consumed tokens: 47877980160 | elapsed time per iteration (s): 0.74 | learning rate: 1.039E-04 | global batch size: 256 | lm loss: 1.987375E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.782 | TFLOPs: 20.80 | 31: iteration 91330/ 173500 | consumed samples: 23380480 | consumed tokens: 47883223040 | elapsed time per iteration (s): 0.77 | learning rate: 1.039E-04 | global batch size: 256 | lm loss: 1.967067E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 331.630 | TFLOPs: 20.06 | 31: iteration 91340/ 173500 | consumed samples: 23383040 | consumed tokens: 47888465920 | elapsed time per iteration (s): 0.74 | learning rate: 1.039E-04 | global batch size: 256 | lm loss: 1.993985E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.069 | TFLOPs: 21.06 | 31: iteration 91350/ 173500 | consumed samples: 23385600 | consumed tokens: 47893708800 | elapsed time per iteration (s): 0.75 | learning rate: 1.039E-04 | global batch size: 256 | lm loss: 1.987782E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.754 | TFLOPs: 20.68 | 31: iteration 91360/ 173500 | consumed samples: 23388160 | consumed tokens: 47898951680 | elapsed time per iteration (s): 0.81 | learning rate: 1.038E-04 | global batch size: 256 | lm loss: 1.963941E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.122 | TFLOPs: 19.06 | 31: iteration 91370/ 173500 | consumed samples: 23390720 | consumed tokens: 47904194560 | elapsed time per iteration (s): 0.80 | learning rate: 1.038E-04 | global batch size: 256 | lm loss: 1.945514E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.605 | TFLOPs: 19.40 | 31: iteration 91380/ 173500 | consumed samples: 23393280 | consumed tokens: 47909437440 | elapsed time per iteration (s): 0.83 | learning rate: 1.038E-04 | global batch size: 256 | lm loss: 1.976204E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.045 | TFLOPs: 18.70 | 31: iteration 91390/ 173500 | consumed samples: 23395840 | consumed tokens: 47914680320 | elapsed time per iteration (s): 0.75 | learning rate: 1.038E-04 | global batch size: 256 | lm loss: 1.984361E+00 | grad 
norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.505 | TFLOPs: 20.66 | 31: iteration 91400/ 173500 | consumed samples: 23398400 | consumed tokens: 47919923200 | elapsed time per iteration (s): 0.76 | learning rate: 1.038E-04 | global batch size: 256 | lm loss: 1.974480E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.749 | TFLOPs: 20.43 | 31: iteration 91410/ 173500 | consumed samples: 23400960 | consumed tokens: 47925166080 | elapsed time per iteration (s): 0.74 | learning rate: 1.038E-04 | global batch size: 256 | lm loss: 1.976680E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.486 | TFLOPs: 20.84 | 31: iteration 91420/ 173500 | consumed samples: 23403520 | consumed tokens: 47930408960 | elapsed time per iteration (s): 0.76 | learning rate: 1.037E-04 | global batch size: 256 | lm loss: 2.000798E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.551 | TFLOPs: 20.48 | 31: iteration 91430/ 173500 | consumed samples: 23406080 | consumed tokens: 47935651840 | elapsed time per iteration (s): 0.77 | learning rate: 1.037E-04 | global batch size: 256 | lm loss: 1.991638E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.567 | TFLOPs: 20.24 | 31: iteration 91440/ 173500 | consumed samples: 23408640 | consumed tokens: 47940894720 | elapsed time per iteration (s): 0.77 | learning rate: 1.037E-04 | global batch size: 256 | lm loss: 2.014326E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.141 | TFLOPs: 20.09 | 31: iteration 91450/ 173500 | consumed samples: 23411200 | consumed tokens: 47946137600 | elapsed time 
per iteration (s): 0.78 | learning rate: 1.037E-04 | global batch size: 256 | lm loss: 1.994987E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.330 | TFLOPs: 19.92 | 31: iteration 91460/ 173500 | consumed samples: 23413760 | consumed tokens: 47951380480 | elapsed time per iteration (s): 0.76 | learning rate: 1.037E-04 | global batch size: 256 | lm loss: 1.992454E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.561 | TFLOPs: 20.42 | 31: iteration 91470/ 173500 | consumed samples: 23416320 | consumed tokens: 47956623360 | elapsed time per iteration (s): 0.80 | learning rate: 1.037E-04 | global batch size: 256 | lm loss: 2.001851E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.715 | TFLOPs: 19.28 | 31: iteration 91480/ 173500 | consumed samples: 23418880 | consumed tokens: 47961866240 | elapsed time per iteration (s): 0.78 | learning rate: 1.036E-04 | global batch size: 256 | lm loss: 1.982166E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.968 | TFLOPs: 19.90 | 31: iteration 91490/ 173500 | consumed samples: 23421440 | consumed tokens: 47967109120 | elapsed time per iteration (s): 0.79 | learning rate: 1.036E-04 | global batch size: 256 | lm loss: 2.000638E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.458 | TFLOPs: 19.57 | 31: iteration 91500/ 173500 | consumed samples: 23424000 | consumed tokens: 47972352000 | elapsed time per iteration (s): 0.79 | learning rate: 1.036E-04 | global batch size: 256 | lm loss: 1.999708E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.199 | TFLOPs: 
19.49 | 31: iteration 91510/ 173500 | consumed samples: 23426560 | consumed tokens: 47977594880 | elapsed time per iteration (s): 0.77 | learning rate: 1.036E-04 | global batch size: 256 | lm loss: 1.952187E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.020 | TFLOPs: 20.03 | 31: iteration 91520/ 173500 | consumed samples: 23429120 | consumed tokens: 47982837760 | elapsed time per iteration (s): 0.75 | learning rate: 1.036E-04 | global batch size: 256 | lm loss: 2.002998E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.321 | TFLOPs: 20.65 | 31: iteration 91530/ 173500 | consumed samples: 23431680 | consumed tokens: 47988080640 | elapsed time per iteration (s): 0.80 | learning rate: 1.036E-04 | global batch size: 256 | lm loss: 2.030778E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.767 | TFLOPs: 19.47 | 31: iteration 91540/ 173500 | consumed samples: 23434240 | consumed tokens: 47993323520 | elapsed time per iteration (s): 0.75 | learning rate: 1.035E-04 | global batch size: 256 | lm loss: 2.001289E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.914 | TFLOPs: 20.68 | 31: iteration 91550/ 173500 | consumed samples: 23436800 | consumed tokens: 47998566400 | elapsed time per iteration (s): 0.75 | learning rate: 1.035E-04 | global batch size: 256 | lm loss: 1.975607E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.889 | TFLOPs: 20.62 | 31: iteration 91560/ 173500 | consumed samples: 23439360 | consumed tokens: 48003809280 | elapsed time per iteration (s): 0.77 | learning rate: 1.035E-04 | global batch size: 256 | lm loss: 2.016486E+00 | grad norm: 0.146 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.326 | TFLOPs: 19.98 | 31: iteration 91570/ 173500 | consumed samples: 23441920 | consumed tokens: 48009052160 | elapsed time per iteration (s): 0.82 | learning rate: 1.035E-04 | global batch size: 256 | lm loss: 2.002743E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.575 | TFLOPs: 18.85 | 31: iteration 91580/ 173500 | consumed samples: 23444480 | consumed tokens: 48014295040 | elapsed time per iteration (s): 0.79 | learning rate: 1.035E-04 | global batch size: 256 | lm loss: 1.988841E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.519 | TFLOPs: 19.51 | 31: iteration 91590/ 173500 | consumed samples: 23447040 | consumed tokens: 48019537920 | elapsed time per iteration (s): 0.81 | learning rate: 1.035E-04 | global batch size: 256 | lm loss: 1.988722E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.291 | TFLOPs: 19.13 | 31: iteration 91600/ 173500 | consumed samples: 23449600 | consumed tokens: 48024780800 | elapsed time per iteration (s): 0.80 | learning rate: 1.035E-04 | global batch size: 256 | lm loss: 1.998652E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.973 | TFLOPs: 19.42 | 31: iteration 91610/ 173500 | consumed samples: 23452160 | consumed tokens: 48030023680 | elapsed time per iteration (s): 0.82 | learning rate: 1.034E-04 | global batch size: 256 | lm loss: 1.977174E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.653 | TFLOPs: 18.79 | 31: iteration 91620/ 173500 | consumed samples: 23454720 | consumed tokens: 48035266560 | elapsed time per iteration (s): 0.81 | 
learning rate: 1.034E-04 | global batch size: 256 | lm loss: 2.012870E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.567 | TFLOPs: 19.21 | 31: iteration 91630/ 173500 | consumed samples: 23457280 | consumed tokens: 48040509440 | elapsed time per iteration (s): 0.83 | learning rate: 1.034E-04 | global batch size: 256 | lm loss: 1.982777E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.306 | TFLOPs: 18.59 | 31: iteration 91640/ 173500 | consumed samples: 23459840 | consumed tokens: 48045752320 | elapsed time per iteration (s): 0.83 | learning rate: 1.034E-04 | global batch size: 256 | lm loss: 2.007478E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.420 | TFLOPs: 18.60 | 31: iteration 91650/ 173500 | consumed samples: 23462400 | consumed tokens: 48050995200 | elapsed time per iteration (s): 0.80 | learning rate: 1.034E-04 | global batch size: 256 | lm loss: 1.967337E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.473 | TFLOPs: 19.45 | 31: iteration 91660/ 173500 | consumed samples: 23464960 | consumed tokens: 48056238080 | elapsed time per iteration (s): 0.81 | learning rate: 1.034E-04 | global batch size: 256 | lm loss: 1.947895E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.093 | TFLOPs: 19.12 | 31: iteration 91670/ 173500 | consumed samples: 23467520 | consumed tokens: 48061480960 | elapsed time per iteration (s): 0.82 | learning rate: 1.033E-04 | global batch size: 256 | lm loss: 1.979341E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.945 | TFLOPs: 18.99 | 31: iteration 91680/ 
173500 | consumed samples: 23470080 | consumed tokens: 48066723840 | elapsed time per iteration (s): 0.82 | learning rate: 1.033E-04 | global batch size: 256 | lm loss: 2.002438E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.717 | TFLOPs: 18.92 | 31: iteration 91690/ 173500 | consumed samples: 23472640 | consumed tokens: 48071966720 | elapsed time per iteration (s): 0.87 | learning rate: 1.033E-04 | global batch size: 256 | lm loss: 1.961006E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.268 | TFLOPs: 17.80 | 31: iteration 91700/ 173500 | consumed samples: 23475200 | consumed tokens: 48077209600 | elapsed time per iteration (s): 0.80 | learning rate: 1.033E-04 | global batch size: 256 | lm loss: 1.989831E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.622 | TFLOPs: 19.46 | 31: iteration 91710/ 173500 | consumed samples: 23477760 | consumed tokens: 48082452480 | elapsed time per iteration (s): 0.80 | learning rate: 1.033E-04 | global batch size: 256 | lm loss: 1.984253E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.273 | TFLOPs: 19.44 | 31: iteration 91720/ 173500 | consumed samples: 23480320 | consumed tokens: 48087695360 | elapsed time per iteration (s): 0.78 | learning rate: 1.033E-04 | global batch size: 256 | lm loss: 1.990184E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.508 | TFLOPs: 19.87 | 31: iteration 91730/ 173500 | consumed samples: 23482880 | consumed tokens: 48092938240 | elapsed time per iteration (s): 0.79 | learning rate: 1.032E-04 | global batch size: 256 | lm loss: 1.976979E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 323.851 | TFLOPs: 19.59 | 31: iteration 91740/ 173500 | consumed samples: 23485440 | consumed tokens: 48098181120 | elapsed time per iteration (s): 0.82 | learning rate: 1.032E-04 | global batch size: 256 | lm loss: 2.024969E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.756 | TFLOPs: 18.86 | 31: iteration 91750/ 173500 | consumed samples: 23488000 | consumed tokens: 48103424000 | elapsed time per iteration (s): 0.82 | learning rate: 1.032E-04 | global batch size: 256 | lm loss: 1.963707E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.396 | TFLOPs: 18.78 | 31: iteration 91760/ 173500 | consumed samples: 23490560 | consumed tokens: 48108666880 | elapsed time per iteration (s): 0.80 | learning rate: 1.032E-04 | global batch size: 256 | lm loss: 1.984951E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.571 | TFLOPs: 19.45 | 31: iteration 91770/ 173500 | consumed samples: 23493120 | consumed tokens: 48113909760 | elapsed time per iteration (s): 0.82 | learning rate: 1.032E-04 | global batch size: 256 | lm loss: 1.987541E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.129 | TFLOPs: 18.82 | 31: iteration 91780/ 173500 | consumed samples: 23495680 | consumed tokens: 48119152640 | elapsed time per iteration (s): 0.80 | learning rate: 1.032E-04 | global batch size: 256 | lm loss: 1.968654E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.297 | TFLOPs: 19.38 | 31: iteration 91790/ 173500 | consumed samples: 23498240 | consumed tokens: 48124395520 | elapsed time per iteration (s): 0.99 | learning rate: 
1.031E-04 | global batch size: 256 | lm loss: 1.978437E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 259.759 | TFLOPs: 15.71 | 31: iteration 91800/ 173500 | consumed samples: 23500800 | consumed tokens: 48129638400 | elapsed time per iteration (s): 0.81 | learning rate: 1.031E-04 | global batch size: 256 | lm loss: 1.980739E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.408 | TFLOPs: 19.08 | 31: iteration 91810/ 173500 | consumed samples: 23503360 | consumed tokens: 48134881280 | elapsed time per iteration (s): 0.84 | learning rate: 1.031E-04 | global batch size: 256 | lm loss: 1.970939E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.150 | TFLOPs: 18.34 | 31: iteration 91820/ 173500 | consumed samples: 23505920 | consumed tokens: 48140124160 | elapsed time per iteration (s): 0.81 | learning rate: 1.031E-04 | global batch size: 256 | lm loss: 1.980926E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.025 | TFLOPs: 19.18 | 31: iteration 91830/ 173500 | consumed samples: 23508480 | consumed tokens: 48145367040 | elapsed time per iteration (s): 0.87 | learning rate: 1.031E-04 | global batch size: 256 | lm loss: 1.956100E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.170 | TFLOPs: 17.86 | 31: iteration 91840/ 173500 | consumed samples: 23511040 | consumed tokens: 48150609920 | elapsed time per iteration (s): 0.88 | learning rate: 1.031E-04 | global batch size: 256 | lm loss: 1.977137E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.058 | TFLOPs: 17.67 | 31: iteration 91850/ 173500 | 
consumed samples: 23513600 | consumed tokens: 48155852800 | elapsed time per iteration (s): 0.83 | learning rate: 1.030E-04 | global batch size: 256 | lm loss: 1.975886E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.628 | TFLOPs: 18.61 | 31: iteration 91860/ 173500 | consumed samples: 23516160 | consumed tokens: 48161095680 | elapsed time per iteration (s): 0.79 | learning rate: 1.030E-04 | global batch size: 256 | lm loss: 1.984562E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.153 | TFLOPs: 19.61 | 31: iteration 91870/ 173500 | consumed samples: 23518720 | consumed tokens: 48166338560 | elapsed time per iteration (s): 0.85 | learning rate: 1.030E-04 | global batch size: 256 | lm loss: 1.979029E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.292 | TFLOPs: 18.29 | 31: iteration 91880/ 173500 | consumed samples: 23521280 | consumed tokens: 48171581440 | elapsed time per iteration (s): 0.86 | learning rate: 1.030E-04 | global batch size: 256 | lm loss: 2.002081E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.401 | TFLOPs: 17.93 | 31: iteration 91890/ 173500 | consumed samples: 23523840 | consumed tokens: 48176824320 | elapsed time per iteration (s): 0.83 | learning rate: 1.030E-04 | global batch size: 256 | lm loss: 2.002783E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.538 | TFLOPs: 18.67 | 31: iteration 91900/ 173500 | consumed samples: 23526400 | consumed tokens: 48182067200 | elapsed time per iteration (s): 0.80 | learning rate: 1.030E-04 | global batch size: 256 | lm loss: 1.999498E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 320.407 | TFLOPs: 19.38 | 31: iteration 91910/ 173500 | consumed samples: 23528960 | consumed tokens: 48187310080 | elapsed time per iteration (s): 0.81 | learning rate: 1.029E-04 | global batch size: 256 | lm loss: 1.991418E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.334 | TFLOPs: 19.14 | 31: iteration 91920/ 173500 | consumed samples: 23531520 | consumed tokens: 48192552960 | elapsed time per iteration (s): 0.79 | learning rate: 1.029E-04 | global batch size: 256 | lm loss: 1.999382E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.081 | TFLOPs: 19.49 | 31: iteration 91930/ 173500 | consumed samples: 23534080 | consumed tokens: 48197795840 | elapsed time per iteration (s): 0.80 | learning rate: 1.029E-04 | global batch size: 256 | lm loss: 1.996750E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.960 | TFLOPs: 19.48 | 31: iteration 91940/ 173500 | consumed samples: 23536640 | consumed tokens: 48203038720 | elapsed time per iteration (s): 0.77 | learning rate: 1.029E-04 | global batch size: 256 | lm loss: 1.959042E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.638 | TFLOPs: 20.06 | 31: iteration 91950/ 173500 | consumed samples: 23539200 | consumed tokens: 48208281600 | elapsed time per iteration (s): 0.81 | learning rate: 1.029E-04 | global batch size: 256 | lm loss: 2.005457E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.647 | TFLOPs: 19.22 | 31: iteration 91960/ 173500 | consumed samples: 23541760 | consumed tokens: 48213524480 | elapsed time per iteration (s): 0.83 | learning rate: 1.029E-04 | global batch 
size: 256 | lm loss: 1.953665E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.783 | TFLOPs: 18.56 | 31: iteration 91970/ 173500 | consumed samples: 23544320 | consumed tokens: 48218767360 | elapsed time per iteration (s): 0.81 | learning rate: 1.028E-04 | global batch size: 256 | lm loss: 1.968518E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.817 | TFLOPs: 19.23 | 31: iteration 91980/ 173500 | consumed samples: 23546880 | consumed tokens: 48224010240 | elapsed time per iteration (s): 0.82 | learning rate: 1.028E-04 | global batch size: 256 | lm loss: 1.992548E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.735 | TFLOPs: 18.98 | 31: iteration 91990/ 173500 | consumed samples: 23549440 | consumed tokens: 48229253120 | elapsed time per iteration (s): 0.80 | learning rate: 1.028E-04 | global batch size: 256 | lm loss: 1.972186E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.834 | TFLOPs: 19.35 | 0: [2022-11-26 14:54:27,613] [INFO] [logging.py:68:log_dist] [Rank 0] step=92000, skipped=0, lr=[0.0001027941492351335, 0.0001027941492351335, 0.0001027941492351335], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 92000/ 173500 | consumed samples: 23552000 | consumed tokens: 48234496000 | elapsed time per iteration (s): 0.85 | learning rate: 1.028E-04 | global batch size: 256 | lm loss: 2.000673E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.624 | TFLOPs: 18.13 | 0: steps: 92000 loss: 2.0132 iter time (s): 0.788 samples/sec: 325.026 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 
92000 | lm loss value: 1.850299E+00 | lm loss PPL: 6.361723E+00 |
31: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 92000 to checkpoints_1b1long
0: [2022-11-26 14:54:27,880] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step92000 is begin to save!
0: [2022-11-26 14:54:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_01-model_00-model_states.pt...
0: [2022-11-26 14:54:28,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_01-model_00-model_states.pt.
0: [2022-11-26 14:54:28,118] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_03-model_00-model_states.pt...
0: [2022-11-26 14:54:28,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_03-model_00-model_states.pt.
0: [2022-11-26 14:54:28,200] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_04-model_00-model_states.pt...
0: [2022-11-26 14:54:28,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_04-model_00-model_states.pt.
0: [2022-11-26 14:54:28,284] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_05-model_00-model_states.pt...
0: [2022-11-26 14:54:28,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_05-model_00-model_states.pt.
0: [2022-11-26 14:54:28,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_06-model_00-model_states.pt...
0: [2022-11-26 14:54:28,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_06-model_00-model_states.pt.
0: [2022-11-26 14:54:28,437] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_07-model_00-model_states.pt...
0: [2022-11-26 14:54:28,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_07-model_00-model_states.pt.
0: [2022-11-26 14:54:28,516] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_08-model_00-model_states.pt...
0: [2022-11-26 14:54:28,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_08-model_00-model_states.pt.
0: [2022-11-26 14:54:28,590] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_09-model_00-model_states.pt...
0: [2022-11-26 14:54:28,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_09-model_00-model_states.pt.
0: [2022-11-26 14:54:28,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_10-model_00-model_states.pt...
0: [2022-11-26 14:54:28,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_10-model_00-model_states.pt.
0: [2022-11-26 14:54:28,741] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_11-model_00-model_states.pt...
0: [2022-11-26 14:54:28,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_11-model_00-model_states.pt.
0: [2022-11-26 14:54:28,816] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_12-model_00-model_states.pt...
0: [2022-11-26 14:54:28,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_12-model_00-model_states.pt.
0: [2022-11-26 14:54:28,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_13-model_00-model_states.pt...
0: [2022-11-26 14:54:28,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_13-model_00-model_states.pt.
0: [2022-11-26 14:54:28,970] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_14-model_00-model_states.pt...
0: [2022-11-26 14:54:29,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_14-model_00-model_states.pt.
0: [2022-11-26 14:54:29,047] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_15-model_00-model_states.pt...
0: [2022-11-26 14:54:29,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_15-model_00-model_states.pt.
0: [2022-11-26 14:54:29,123] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_16-model_00-model_states.pt...
0: [2022-11-26 14:54:29,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_16-model_00-model_states.pt.
0: [2022-11-26 14:54:29,199] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_17-model_00-model_states.pt...
0: [2022-11-26 14:54:29,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_17-model_00-model_states.pt.
0: [2022-11-26 14:54:29,271] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_18-model_00-model_states.pt...
0: [2022-11-26 14:54:29,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_18-model_00-model_states.pt.
0: [2022-11-26 14:54:29,351] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_19-model_00-model_states.pt...
0: [2022-11-26 14:54:29,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_19-model_00-model_states.pt.
0: [2022-11-26 14:54:29,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_20-model_00-model_states.pt...
0: [2022-11-26 14:54:29,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_20-model_00-model_states.pt.
0: [2022-11-26 14:54:29,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_21-model_00-model_states.pt...
0: [2022-11-26 14:54:29,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_21-model_00-model_states.pt.
0: [2022-11-26 14:54:29,574] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_22-model_00-model_states.pt...
0: [2022-11-26 14:54:29,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_22-model_00-model_states.pt.
0: [2022-11-26 14:54:29,646] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_23-model_00-model_states.pt...
0: [2022-11-26 14:54:29,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_23-model_00-model_states.pt.
0: [2022-11-26 14:54:29,723] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_24-model_00-model_states.pt...
0: [2022-11-26 14:54:29,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_24-model_00-model_states.pt.
0: [2022-11-26 14:54:29,795] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_25-model_00-model_states.pt...
0: [2022-11-26 14:54:29,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_25-model_00-model_states.pt.
0: [2022-11-26 14:54:29,871] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_26-model_00-model_states.pt...
0: [2022-11-26 14:54:29,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_26-model_00-model_states.pt.
0: [2022-11-26 14:54:29,942] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_27-model_00-model_states.pt...
0: [2022-11-26 14:54:30,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_27-model_00-model_states.pt.
0: [2022-11-26 14:54:30,019] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_28-model_00-model_states.pt...
0: [2022-11-26 14:54:30,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_28-model_00-model_states.pt.
0: [2022-11-26 14:54:30,093] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/layer_30-model_00-model_states.pt...
0: [2022-11-26 14:54:30,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/layer_30-model_00-model_states.pt.
0: [2022-11-26 14:54:30,095] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step92000/mp_rank_00_model_states.pt
0: [2022-11-26 14:54:30,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/mp_rank_00_model_states.pt...
0: [2022-11-26 14:54:30,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/mp_rank_00_model_states.pt. 0: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 
10: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
3: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 
25: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
28: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 
29: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 
26: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
6: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 
9: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
13: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 
25: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
14: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 
17: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
26: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
5: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
8: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 
12: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 
24: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 
21: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 
5: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 
16: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 24: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 
30: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 
9: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 19: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
9: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 31: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 14:54:30,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:54:30,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:54:30,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 14:54:30,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 26: [2022-11-26 14:54:30,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:54:30,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 14:54:30,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 10: [2022-11-26 14:54:30,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 14:54:30,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 14:54:30,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 31: [2022-11-26 14:54:30,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:54:30,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 14:54:30,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 3: [2022-11-26 14:54:30,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:54:30,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 14:54:30,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 18: [2022-11-26 14:54:30,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:54:30,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 14:54:30,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 19: [2022-11-26 14:54:30,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
27: [2022-11-26 14:54:30,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:54:30,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 14:54:30,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 26: [2022-11-26 14:54:30,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:54:30,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 27: [2022-11-26 14:54:30,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 26: [2022-11-26 14:54:30,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 27: [2022-11-26 14:54:30,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 19: [2022-11-26 14:54:30,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:54:30,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 14:54:30,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 21: [2022-11-26 14:54:30,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
30: [2022-11-26 14:54:30,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:54:30,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 21: [2022-11-26 14:54:30,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 30: [2022-11-26 14:54:30,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 21: [2022-11-26 14:54:30,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 2: [2022-11-26 14:54:30,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:54:30,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:54:30,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:54:30,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 0: [2022-11-26 14:54:30,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:54:30,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
17: [2022-11-26 14:54:30,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 14:54:30,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 0: [2022-11-26 14:54:30,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 12: [2022-11-26 14:54:30,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:54:30,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 17: [2022-11-26 14:54:30,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 0: [2022-11-26 14:54:30,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 12: [2022-11-26 14:54:30,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 14:54:30,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 4: [2022-11-26 14:54:30,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:54:30,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 14:54:30,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
8: [2022-11-26 14:54:30,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:54:30,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 14:54:30,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 2: [2022-11-26 14:54:30,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:54:30,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:54:30,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 14: [2022-11-26 14:54:30,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 2: [2022-11-26 14:54:30,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 14: [2022-11-26 14:54:30,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 28: [2022-11-26 14:54:30,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 14:54:30,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 14:54:30,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
3: [2022-11-26 14:54:30,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:54:30,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 14:54:30,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 17: [2022-11-26 14:54:30,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:54:30,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:54:30,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 17: [2022-11-26 14:54:30,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 24: [2022-11-26 14:54:30,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 17: [2022-11-26 14:54:30,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 14: [2022-11-26 14:54:30,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:54:30,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 14:54:30,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
27: [2022-11-26 14:54:30,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:54:30,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 14:54:30,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 1: [2022-11-26 14:54:30,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 28: [2022-11-26 14:54:30,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:54:30,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 28: [2022-11-26 14:54:30,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:54:30,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 28: [2022-11-26 14:54:30,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 14:54:30,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 14:54:30,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 28: [2022-11-26 14:54:30,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
18: [2022-11-26 14:54:30,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:54:30,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 0: [2022-11-26 14:54:30,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:54:30,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 24: [2022-11-26 14:54:30,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:54:30,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 14:54:30,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 24: [2022-11-26 14:54:30,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 14:54:30,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 2: [2022-11-26 14:54:30,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:54:30,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 14:54:30,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
1: [2022-11-26 14:54:30,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:54:30,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 14:54:30,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 29: [2022-11-26 14:54:30,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:54:30,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:54:30,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 14:54:30,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 8: [2022-11-26 14:54:30,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 14:54:30,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 22: [2022-11-26 14:54:30,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:54:30,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 14:54:30,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
22: [2022-11-26 14:54:30,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:54:30,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 14:54:30,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 30: [2022-11-26 14:54:30,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:54:30,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 16: [2022-11-26 14:54:30,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:54:30,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 16: [2022-11-26 14:54:30,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 14:54:30,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 15: [2022-11-26 14:54:30,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 14:54:30,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 10: [2022-11-26 14:54:30,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:54:30,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 10: [2022-11-26 14:54:30,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 14:54:30,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 15: [2022-11-26 14:54:30,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:54:30,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 14:54:30,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 14: [2022-11-26 14:54:30,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:54:30,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 14:54:30,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 19: [2022-11-26 14:54:30,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-26 14:54:30,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 31: [2022-11-26 14:54:30,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:54:30,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 31: [2022-11-26 14:54:30,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 14:54:30,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 22: [2022-11-26 14:54:30,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:54:30,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 14:54:30,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 4: [2022-11-26 14:54:30,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:54:30,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 14:54:30,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 14:54:30,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 14:54:30,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 4: [2022-11-26 14:54:30,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 28: [2022-11-26 14:54:30,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:54:30,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:54:30,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 14:54:30,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 24: [2022-11-26 14:54:30,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:54:30,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 14:54:30,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 14: [2022-11-26 14:54:30,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-26 14:54:30,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 14:54:30,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 2: [2022-11-26 14:54:30,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:54:30,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 14:54:30,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 21: [2022-11-26 14:54:30,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:54:30,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 14:54:30,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 11: [2022-11-26 14:54:30,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:54:30,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 12: [2022-11-26 14:54:30,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:54:30,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
15: [2022-11-26 14:54:30,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:54:30,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:54:30,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 15: [2022-11-26 14:54:30,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 11: [2022-11-26 14:54:30,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 12: [2022-11-26 14:54:30,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 15: [2022-11-26 14:54:30,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 11: [2022-11-26 14:54:30,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 11: [2022-11-26 14:54:30,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:54:30,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 14:54:30,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 31: [2022-11-26 14:54:30,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 
31: [2022-11-26 14:54:30,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 14:54:30,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 29: [2022-11-26 14:54:30,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:54:30,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 14:54:30,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 12: [2022-11-26 14:54:30,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:54:30,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 14:54:30,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 8: [2022-11-26 14:54:30,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:54:30,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 14:54:30,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 10: [2022-11-26 14:54:30,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-26 14:54:30,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 14:54:30,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 28: [2022-11-26 14:54:30,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 14:54:30,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 4: [2022-11-26 14:54:30,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:54:30,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:54:30,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 30: [2022-11-26 14:54:30,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 4: [2022-11-26 14:54:30,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 30: [2022-11-26 14:54:30,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 21: [2022-11-26 14:54:30,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-26 14:54:30,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 14:54:30,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 27: [2022-11-26 14:54:30,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:54:30,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:54:30,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 16: [2022-11-26 14:54:30,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 14:54:30,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 26: [2022-11-26 14:54:30,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:54:30,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 26: [2022-11-26 14:54:30,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 14:54:30,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 27: [2022-11-26 14:54:30,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
29: [2022-11-26 14:54:30,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:54:30,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 29: [2022-11-26 14:54:30,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 27: [2022-11-26 14:54:30,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 29: [2022-11-26 14:54:30,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 1: [2022-11-26 14:54:30,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:54:30,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 14:54:30,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 16: [2022-11-26 14:54:30,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:54:30,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:54:30,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:54:30,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
24: [2022-11-26 14:54:30,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 16: [2022-11-26 14:54:30,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 24: [2022-11-26 14:54:30,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 0: [2022-11-26 14:54:30,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 16: [2022-11-26 14:54:30,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 16: [2022-11-26 14:54:30,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 1: [2022-11-26 14:54:30,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:54:30,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 1: [2022-11-26 14:54:30,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 14:54:30,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 0: [2022-11-26 14:54:30,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 26: [2022-11-26 14:54:30,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
26: [2022-11-26 14:54:30,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 14:54:30,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 19: [2022-11-26 14:54:30,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:54:30,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 14:54:30,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 3: [2022-11-26 14:54:30,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:54:30,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:54:30,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 14:54:30,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 14:54:30,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 3: [2022-11-26 14:54:30,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 22: [2022-11-26 14:54:30,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
22: [2022-11-26 14:54:30,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 14:54:30,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 11: [2022-11-26 14:54:30,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:54:30,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 14:54:30,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 17: [2022-11-26 14:54:30,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:54:30,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 14:54:30,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 0: [2022-11-26 14:54:30,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:54:30,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 30: [2022-11-26 14:54:30,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:54:30,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
30: [2022-11-26 14:54:30,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 14:54:30,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 29: [2022-11-26 14:54:30,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:54:30,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 14:54:30,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 18: [2022-11-26 14:54:30,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:54:30,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 14:54:30,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 21: [2022-11-26 14:54:30,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:54:30,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 14:54:30,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 23: [2022-11-26 14:54:30,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
23: [2022-11-26 14:54:30,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:54:30,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:54:30,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:54:30,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 14:54:30,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 8: [2022-11-26 14:54:30,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 23: [2022-11-26 14:54:30,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 8: [2022-11-26 14:54:30,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 12: [2022-11-26 14:54:30,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 23: [2022-11-26 14:54:30,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 12: [2022-11-26 14:54:30,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 31: [2022-11-26 14:54:30,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-26 14:54:30,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 14:54:30,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 25: [2022-11-26 14:54:30,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:54:30,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:54:30,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:54:30,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:54:30,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:54:30,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:54:30,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:54:30,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-26 14:54:30,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 14:54:30,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 14:54:30,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 25: [2022-11-26 14:54:30,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 14:54:30,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 14:54:30,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 13: [2022-11-26 14:54:30,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 25: [2022-11-26 14:54:30,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 13: [2022-11-26 14:54:30,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 13: [2022-11-26 14:54:30,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 13: [2022-11-26 14:54:30,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
25: [2022-11-26 14:54:30,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 25: [2022-11-26 14:54:30,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 25: [2022-11-26 14:54:30,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 25: [2022-11-26 14:54:30,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 13: [2022-11-26 14:54:30,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 15: [2022-11-26 14:54:30,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:54:30,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 14:54:30,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 7: [2022-11-26 14:54:30,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:54:30,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:54:30,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:54:30,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 14:54:30,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 14:54:30,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 14:54:30,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 14:54:30,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 7: [2022-11-26 14:54:30,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 7: [2022-11-26 14:54:30,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 7: [2022-11-26 14:54:30,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 14:54:30,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 23: [2022-11-26 14:54:30,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:54:30,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:54:30,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 14:54:30,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
23: [2022-11-26 14:54:30,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 14:54:30,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 6: [2022-11-26 14:54:30,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:54:30,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:54:30,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:54:30,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:54:30,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 14:54:30,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 14:54:30,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 14:54:30,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 14:54:30,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
6: [2022-11-26 14:54:30,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 6: [2022-11-26 14:54:30,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 6: [2022-11-26 14:54:30,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 15: [2022-11-26 14:54:30,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:54:30,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 14:54:30,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 23: [2022-11-26 14:54:30,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:54:30,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 14:54:30,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 17: [2022-11-26 14:54:30,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:54:30,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 14:54:30,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
4: [2022-11-26 14:54:30,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:54:30,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 14:54:30,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 5: [2022-11-26 14:54:30,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:54:30,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:54:30,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:54:30,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 14:54:30,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 14:54:30,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 14:54:30,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 14:54:30,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 14:54:30,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 5: [2022-11-26 14:54:30,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 5: [2022-11-26 14:54:30,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 5: [2022-11-26 14:54:30,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:54:30,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 5: [2022-11-26 14:54:30,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 14:54:30,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 25: [2022-11-26 14:54:30,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
25: [2022-11-26 14:54:30,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 14:54:30,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 28: [2022-11-26 14:54:30,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 14:54:30,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 14:54:30,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 6: [2022-11-26 14:54:30,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:54:30,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:54:30,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 26: [2022-11-26 14:54:30,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 6: [2022-11-26 14:54:30,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 26: [2022-11-26 14:54:30,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 0: [2022-11-26 14:54:30,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-26 14:54:30,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 14:54:30,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 2: [2022-11-26 14:54:30,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:54:30,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 14:54:30,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 19: [2022-11-26 14:54:30,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:54:30,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 14:54:30,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 14: [2022-11-26 14:54:30,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:54:30,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 14:54:30,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 10: [2022-11-26 14:54:30,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 14:54:30,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 14:54:30,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 27: [2022-11-26 14:54:30,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:54:30,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 14:54:30,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 1: [2022-11-26 14:54:30,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:54:30,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 24: [2022-11-26 14:54:30,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:54:30,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 24: [2022-11-26 14:54:30,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 14:54:30,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 16: [2022-11-26 14:54:30,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
16: [2022-11-26 14:54:30,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 14:54:30,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 7: [2022-11-26 14:54:30,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:54:30,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 14:54:30,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 29: [2022-11-26 14:54:30,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:54:30,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 14:54:30,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 21: [2022-11-26 14:54:30,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:54:30,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 12: [2022-11-26 14:54:30,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 14:54:30,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 21: [2022-11-26 14:54:30,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 12: [2022-11-26 14:54:30,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 8: [2022-11-26 14:54:30,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:54:30,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 14:54:30,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 18: [2022-11-26 14:54:30,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:54:30,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 14:54:30,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 3: [2022-11-26 14:54:30,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:54:30,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 14:54:30,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
11: [2022-11-26 14:54:30,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:54:30,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 14:54:30,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 22: [2022-11-26 14:54:30,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:54:30,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 14:54:30,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 15: [2022-11-26 14:54:30,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:54:30,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 31: [2022-11-26 14:54:30,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:54:30,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 13: [2022-11-26 14:54:30,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 14:54:30,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 31: [2022-11-26 14:54:30,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 26: [2022-11-26 14:54:30,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:54:30,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 31: [2022-11-26 14:54:30,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 26: [2022-11-26 14:54:30,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 14:54:30,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 23: [2022-11-26 14:54:30,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 14:54:30,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 14:54:30,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 5: [2022-11-26 14:54:30,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-26 14:54:30,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 14:54:30,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 17: [2022-11-26 14:54:30,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:54:30,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 14:54:30,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 4: [2022-11-26 14:54:30,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:54:30,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 14:54:30,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 2: [2022-11-26 14:54:30,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:54:30,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 14:54:30,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 0: [2022-11-26 14:54:30,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-26 14:54:30,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 14:54:30,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 6: [2022-11-26 14:54:30,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:54:30,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 14:54:30,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 28: [2022-11-26 14:54:30,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 14:54:30,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 14:54:30,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 25: [2022-11-26 14:54:30,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:54:30,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
25: [2022-11-26 14:54:30,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 19: [2022-11-26 14:54:30,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 14:54:30,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 25: [2022-11-26 14:54:30,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 14: [2022-11-26 14:54:30,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:54:30,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 14:54:30,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 30: [2022-11-26 14:54:30,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:54:30,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:54:30,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 14:54:30,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 14:54:30,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
30: [2022-11-26 14:54:30,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 9: [2022-11-26 14:54:30,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:54:30,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 14:54:30,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 20: [2022-11-26 14:54:30,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:54:30,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 14:54:30,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 3: [2022-11-26 14:54:30,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:54:30,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 14:54:30,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 27: [2022-11-26 14:54:30,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
27: [2022-11-26 14:54:30,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 14:54:30,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 16: [2022-11-26 14:54:30,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:54:30,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 14:54:30,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 24: [2022-11-26 14:54:30,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:54:30,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 14:54:30,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 18: [2022-11-26 14:54:30,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:54:30,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 14:54:30,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 11: [2022-11-26 14:54:30,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 14:54:30,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 14:54:30,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 8: [2022-11-26 14:54:30,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:54:30,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 14:54:30,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 13: [2022-11-26 14:54:30,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:54:30,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:54:30,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 14:54:30,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 12: [2022-11-26 14:54:30,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 14:54:30,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 29: [2022-11-26 14:54:30,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
29: [2022-11-26 14:54:30,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 14:54:30,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 10: [2022-11-26 14:54:30,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:54:30,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 14:54:30,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 22: [2022-11-26 14:54:30,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:54:30,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 14:54:30,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 7: [2022-11-26 14:54:30,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:54:30,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 14:54:30,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 1: [2022-11-26 14:54:30,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 14:54:30,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 21: [2022-11-26 14:54:30,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:54:30,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 1: [2022-11-26 14:54:30,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 21: [2022-11-26 14:54:30,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 20: [2022-11-26 14:54:30,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:54:30,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 14:54:30,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 31: [2022-11-26 14:54:30,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:54:30,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 14:54:30,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 17: [2022-11-26 14:54:30,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 
23: [2022-11-26 14:54:30,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:54:30,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 14:54:30,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 23: [2022-11-26 14:54:30,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 14:54:30,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 4: [2022-11-26 14:54:30,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:54:30,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 14:54:30,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 5: [2022-11-26 14:54:30,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:54:30,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 14:54:30,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 9: [2022-11-26 14:54:30,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 14:54:30,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 14:54:30,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 26: [2022-11-26 14:54:30,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:54:30,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 14:54:30,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 0: [2022-11-26 14:54:30,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:54:30,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:54:30,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 14:54:30,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 28: [2022-11-26 14:54:30,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:54:30,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-26 14:54:30,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 14:54:30,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 28: [2022-11-26 14:54:30,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 14:54:30,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 0: [2022-11-26 14:54:30,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 14:54:30,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 25: [2022-11-26 14:54:30,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:54:30,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 14:54:30,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 6: [2022-11-26 14:54:30,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:54:30,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 14:54:30,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
14: [2022-11-26 14:54:30,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:54:30,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:54:30,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 14:54:30,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 10: [2022-11-26 14:54:30,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 14:54:30,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 13: [2022-11-26 14:54:30,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:54:30,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 14:54:30,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 27: [2022-11-26 14:54:30,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:54:30,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
27: [2022-11-26 14:54:30,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 14:54:30,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 19: [2022-11-26 14:54:30,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 14:54:30,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 31: [2022-11-26 14:54:30,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:54:30,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 14:54:30,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 18: [2022-11-26 14:54:30,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:54:30,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 14:54:30,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 30: [2022-11-26 14:54:30,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-26 14:54:30,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 14:54:30,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 3: [2022-11-26 14:54:30,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:54:30,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 14:54:30,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 1: [2022-11-26 14:54:30,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:54:30,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 14:54:30,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 11: [2022-11-26 14:54:30,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:54:30,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 14:54:30,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 12: [2022-11-26 14:54:30,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 14:54:30,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 14:54:30,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 22: [2022-11-26 14:54:30,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:54:30,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 14:54:30,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 16: [2022-11-26 14:54:30,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:54:30,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 14:54:30,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 21: [2022-11-26 14:54:30,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:54:30,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 14:54:30,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 24: [2022-11-26 14:54:30,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-26 14:54:30,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 14:54:30,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 10: [2022-11-26 14:54:30,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 26: [2022-11-26 14:54:30,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:54:30,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 7: [2022-11-26 14:54:30,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:54:30,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 23: [2022-11-26 14:54:30,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:54:30,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 14:54:30,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 29: [2022-11-26 14:54:30,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
26: [2022-11-26 14:54:30,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 14:54:30,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 23: [2022-11-26 14:54:30,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 29: [2022-11-26 14:54:30,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 23: [2022-11-26 14:54:30,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 29: [2022-11-26 14:54:30,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 2: [2022-11-26 14:54:30,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:54:30,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 14:54:30,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 28: [2022-11-26 14:54:30,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 17: [2022-11-26 14:54:30,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 21: [2022-11-26 14:54:30,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
17: [2022-11-26 14:54:30,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 21: [2022-11-26 14:54:30,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 17: [2022-11-26 14:54:30,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 21: [2022-11-26 14:54:30,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 8: [2022-11-26 14:54:30,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:54:30,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 14:54:30,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 0: [2022-11-26 14:54:30,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:54:30,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 5: [2022-11-26 14:54:30,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:54:30,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 0: [2022-11-26 14:54:30,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
5: [2022-11-26 14:54:30,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 25: [2022-11-26 14:54:30,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 14:54:30,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 14:54:30,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 6: [2022-11-26 14:54:30,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:54:30,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 14:54:30,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 14: [2022-11-26 14:54:30,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:54:30,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:54:30,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 3: [2022-11-26 14:54:30,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 14: [2022-11-26 14:54:30,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
3: [2022-11-26 14:54:30,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 28: [2022-11-26 14:54:30,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 14:54:30,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 4: [2022-11-26 14:54:30,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:54:30,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 14:54:30,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 31: [2022-11-26 14:54:30,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 14:54:30,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 14:54:30,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 12: [2022-11-26 14:54:30,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:54:30,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 14:54:30,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
22: [2022-11-26 14:54:30,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 14:54:30,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 14:54:30,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 15: [2022-11-26 14:54:30,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:54:30,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 14:54:30,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 7: [2022-11-26 14:54:30,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:54:30,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 14:54:30,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 19: [2022-11-26 14:54:30,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 14:54:30,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 14:54:30,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
8: [2022-11-26 14:54:30,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:54:30,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 14:54:30,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 13: [2022-11-26 14:54:30,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:54:30,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 14:54:30,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 16: [2022-11-26 14:54:30,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 14:54:30,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 14:54:30,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 18: [2022-11-26 14:54:30,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 14:54:30,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 14:54:30,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
24: [2022-11-26 14:54:30,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 14:54:30,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 14:54:30,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 11: [2022-11-26 14:54:30,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:54:30,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 14:54:30,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 29: [2022-11-26 14:54:30,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 30: [2022-11-26 14:54:30,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 29: [2022-11-26 14:54:30,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 14:54:30,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 30: [2022-11-26 14:54:30,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 14:54:30,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
20: [2022-11-26 14:54:30,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:54:30,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 14:54:30,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 27: [2022-11-26 14:54:30,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:54:30,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 27: [2022-11-26 14:54:30,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 20: [2022-11-26 14:54:30,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 27: [2022-11-26 14:54:30,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 20: [2022-11-26 14:54:30,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 1: [2022-11-26 14:54:30,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:54:30,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 14:54:30,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
20: [2022-11-26 14:54:30,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:54:30,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:54:30,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 14:54:30,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 14:54:30,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 20: [2022-11-26 14:54:30,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 20: [2022-11-26 14:54:30,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:54:30,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 14:54:30,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 20: [2022-11-26 14:54:30,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 14:54:30,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 14:54:30,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
9: [2022-11-26 14:54:30,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:54:30,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:54:30,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 14:54:30,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 14:54:30,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:54:30,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 9: [2022-11-26 14:54:30,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 9: [2022-11-26 14:54:30,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 14:54:30,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 9: [2022-11-26 14:54:30,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:54:30,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 14:54:30,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
9: [2022-11-26 14:54:30,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:54:30,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:54:30,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 14:54:30,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step92000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 14:54:30,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 9: [2022-11-26 14:54:30,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 0: successfully saved checkpoint at iteration 92000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2584.47 31: iteration 92010/ 173500 | consumed samples: 23554560 | consumed tokens: 48239738880 | elapsed time per iteration (s): 1.17 | learning rate: 1.028E-04 | global batch size: 256 | lm loss: 1.974659E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 219.739 | TFLOPs: 13.29 | 31: iteration 92020/ 173500 | consumed samples: 23557120 | consumed tokens: 48244981760 | elapsed time per iteration (s): 0.82 | learning rate: 1.028E-04 | global batch size: 256 | lm loss: 1.967294E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.045 | TFLOPs: 18.82 | 31: iteration 92030/ 173500 | consumed samples: 23559680 | consumed tokens: 48250224640 | elapsed time per iteration (s): 0.74 | learning rate: 1.027E-04 | global batch size: 256 | lm loss: 2.007963E+00 | grad 
norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.138 | TFLOPs: 20.88 | 31: iteration 92040/ 173500 | consumed samples: 23562240 | consumed tokens: 48255467520 | elapsed time per iteration (s): 0.74 | learning rate: 1.027E-04 | global batch size: 256 | lm loss: 1.955987E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.378 | TFLOPs: 20.96 | 31: iteration 92050/ 173500 | consumed samples: 23564800 | consumed tokens: 48260710400 | elapsed time per iteration (s): 0.81 | learning rate: 1.027E-04 | global batch size: 256 | lm loss: 1.973772E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.871 | TFLOPs: 19.11 | 31: iteration 92060/ 173500 | consumed samples: 23567360 | consumed tokens: 48265953280 | elapsed time per iteration (s): 0.74 | learning rate: 1.027E-04 | global batch size: 256 | lm loss: 1.979246E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.499 | TFLOPs: 20.90 | 31: iteration 92070/ 173500 | consumed samples: 23569920 | consumed tokens: 48271196160 | elapsed time per iteration (s): 0.77 | learning rate: 1.027E-04 | global batch size: 256 | lm loss: 1.997126E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.264 | TFLOPs: 20.16 | 31: iteration 92080/ 173500 | consumed samples: 23572480 | consumed tokens: 48276439040 | elapsed time per iteration (s): 0.77 | learning rate: 1.027E-04 | global batch size: 256 | lm loss: 1.962045E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.606 | TFLOPs: 20.24 | 31: iteration 92090/ 173500 | consumed samples: 23575040 | consumed tokens: 48281681920 | elapsed time 
per iteration (s): 0.75 | learning rate: 1.026E-04 | global batch size: 256 | lm loss: 1.986514E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.615 | TFLOPs: 20.61 | 31: iteration 92100/ 173500 | consumed samples: 23577600 | consumed tokens: 48286924800 | elapsed time per iteration (s): 0.84 | learning rate: 1.026E-04 | global batch size: 256 | lm loss: 1.988574E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.334 | TFLOPs: 18.35 | 31: iteration 92110/ 173500 | consumed samples: 23580160 | consumed tokens: 48292167680 | elapsed time per iteration (s): 0.81 | learning rate: 1.026E-04 | global batch size: 256 | lm loss: 2.013213E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.672 | TFLOPs: 19.16 | 31: iteration 92120/ 173500 | consumed samples: 23582720 | consumed tokens: 48297410560 | elapsed time per iteration (s): 0.85 | learning rate: 1.026E-04 | global batch size: 256 | lm loss: 2.016629E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.898 | TFLOPs: 18.32 | 31: iteration 92130/ 173500 | consumed samples: 23585280 | consumed tokens: 48302653440 | elapsed time per iteration (s): 0.83 | learning rate: 1.026E-04 | global batch size: 256 | lm loss: 1.950320E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.822 | TFLOPs: 18.74 | 31: iteration 92140/ 173500 | consumed samples: 23587840 | consumed tokens: 48307896320 | elapsed time per iteration (s): 0.82 | learning rate: 1.026E-04 | global batch size: 256 | lm loss: 1.992360E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.622 | TFLOPs: 
18.91 | 31: iteration 92150/ 173500 | consumed samples: 23590400 | consumed tokens: 48313139200 | elapsed time per iteration (s): 0.83 | learning rate: 1.025E-04 | global batch size: 256 | lm loss: 1.976157E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.636 | TFLOPs: 18.73 | 31: iteration 92160/ 173500 | consumed samples: 23592960 | consumed tokens: 48318382080 | elapsed time per iteration (s): 0.83 | learning rate: 1.025E-04 | global batch size: 256 | lm loss: 2.005714E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.272 | TFLOPs: 18.71 | 31: iteration 92170/ 173500 | consumed samples: 23595520 | consumed tokens: 48323624960 | elapsed time per iteration (s): 0.80 | learning rate: 1.025E-04 | global batch size: 256 | lm loss: 2.001221E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.176 | TFLOPs: 19.37 | 31: iteration 92180/ 173500 | consumed samples: 23598080 | consumed tokens: 48328867840 | elapsed time per iteration (s): 0.83 | learning rate: 1.025E-04 | global batch size: 256 | lm loss: 1.966492E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.060 | TFLOPs: 18.70 | 31: iteration 92190/ 173500 | consumed samples: 23600640 | consumed tokens: 48334110720 | elapsed time per iteration (s): 0.84 | learning rate: 1.025E-04 | global batch size: 256 | lm loss: 1.962349E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.290 | TFLOPs: 18.53 | 31: iteration 92200/ 173500 | consumed samples: 23603200 | consumed tokens: 48339353600 | elapsed time per iteration (s): 0.81 | learning rate: 1.025E-04 | global batch size: 256 | lm loss: 2.002020E+00 | grad norm: 0.152 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.408 | TFLOPs: 19.14 | 31: iteration 92210/ 173500 | consumed samples: 23605760 | consumed tokens: 48344596480 | elapsed time per iteration (s): 0.83 | learning rate: 1.024E-04 | global batch size: 256 | lm loss: 1.987481E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.238 | TFLOPs: 18.77 | 31: iteration 92220/ 173500 | consumed samples: 23608320 | consumed tokens: 48349839360 | elapsed time per iteration (s): 0.92 | learning rate: 1.024E-04 | global batch size: 256 | lm loss: 1.964988E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 278.489 | TFLOPs: 16.85 | 31: iteration 92230/ 173500 | consumed samples: 23610880 | consumed tokens: 48355082240 | elapsed time per iteration (s): 0.83 | learning rate: 1.024E-04 | global batch size: 256 | lm loss: 2.007380E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.498 | TFLOPs: 18.72 | 31: iteration 92240/ 173500 | consumed samples: 23613440 | consumed tokens: 48360325120 | elapsed time per iteration (s): 0.79 | learning rate: 1.024E-04 | global batch size: 256 | lm loss: 1.986269E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.495 | TFLOPs: 19.57 | 31: iteration 92250/ 173500 | consumed samples: 23616000 | consumed tokens: 48365568000 | elapsed time per iteration (s): 0.79 | learning rate: 1.024E-04 | global batch size: 256 | lm loss: 1.975655E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.414 | TFLOPs: 19.69 | 31: iteration 92260/ 173500 | consumed samples: 23618560 | consumed tokens: 48370810880 | elapsed time per iteration (s): 0.81 | 
learning rate: 1.024E-04 | global batch size: 256 | lm loss: 2.036482E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.870 | TFLOPs: 19.11 | 31: iteration 92270/ 173500 | consumed samples: 23621120 | consumed tokens: 48376053760 | elapsed time per iteration (s): 0.79 | learning rate: 1.024E-04 | global batch size: 256 | lm loss: 2.010186E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.701 | TFLOPs: 19.64 | 31: iteration 92280/ 173500 | consumed samples: 23623680 | consumed tokens: 48381296640 | elapsed time per iteration (s): 0.91 | learning rate: 1.023E-04 | global batch size: 256 | lm loss: 1.998714E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 281.768 | TFLOPs: 17.05 | 31: iteration 92290/ 173500 | consumed samples: 23626240 | consumed tokens: 48386539520 | elapsed time per iteration (s): 0.82 | learning rate: 1.023E-04 | global batch size: 256 | lm loss: 1.999203E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.507 | TFLOPs: 18.85 | 31: iteration 92300/ 173500 | consumed samples: 23628800 | consumed tokens: 48391782400 | elapsed time per iteration (s): 0.88 | learning rate: 1.023E-04 | global batch size: 256 | lm loss: 1.991721E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.419 | TFLOPs: 17.63 | 31: iteration 92310/ 173500 | consumed samples: 23631360 | consumed tokens: 48397025280 | elapsed time per iteration (s): 0.88 | learning rate: 1.023E-04 | global batch size: 256 | lm loss: 2.007551E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.024 | TFLOPs: 17.55 | 31: iteration 92320/ 
173500 | consumed samples: 23633920 | consumed tokens: 48402268160 | elapsed time per iteration (s): 0.80 | learning rate: 1.023E-04 | global batch size: 256 | lm loss: 1.966022E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.073 | TFLOPs: 19.30 | 31: iteration 92330/ 173500 | consumed samples: 23636480 | consumed tokens: 48407511040 | elapsed time per iteration (s): 0.80 | learning rate: 1.023E-04 | global batch size: 256 | lm loss: 1.961165E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.341 | TFLOPs: 19.32 | 31: iteration 92340/ 173500 | consumed samples: 23639040 | consumed tokens: 48412753920 | elapsed time per iteration (s): 0.83 | learning rate: 1.022E-04 | global batch size: 256 | lm loss: 1.982816E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.127 | TFLOPs: 18.64 | 31: iteration 92350/ 173500 | consumed samples: 23641600 | consumed tokens: 48417996800 | elapsed time per iteration (s): 0.80 | learning rate: 1.022E-04 | global batch size: 256 | lm loss: 1.983308E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.823 | TFLOPs: 19.29 | 31: iteration 92360/ 173500 | consumed samples: 23644160 | consumed tokens: 48423239680 | elapsed time per iteration (s): 0.79 | learning rate: 1.022E-04 | global batch size: 256 | lm loss: 1.968435E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.859 | TFLOPs: 19.65 | 31: iteration 92370/ 173500 | consumed samples: 23646720 | consumed tokens: 48428482560 | elapsed time per iteration (s): 0.88 | learning rate: 1.022E-04 | global batch size: 256 | lm loss: 1.988443E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 291.487 | TFLOPs: 17.63 | 31: iteration 92380/ 173500 | consumed samples: 23649280 | consumed tokens: 48433725440 | elapsed time per iteration (s): 0.80 | learning rate: 1.022E-04 | global batch size: 256 | lm loss: 1.982873E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.521 | TFLOPs: 19.33 | 31: iteration 92390/ 173500 | consumed samples: 23651840 | consumed tokens: 48438968320 | elapsed time per iteration (s): 0.78 | learning rate: 1.022E-04 | global batch size: 256 | lm loss: 1.980557E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.211 | TFLOPs: 19.80 | 31: iteration 92400/ 173500 | consumed samples: 23654400 | consumed tokens: 48444211200 | elapsed time per iteration (s): 0.74 | learning rate: 1.021E-04 | global batch size: 256 | lm loss: 1.972667E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.036 | TFLOPs: 20.93 | 31: iteration 92410/ 173500 | consumed samples: 23656960 | consumed tokens: 48449454080 | elapsed time per iteration (s): 0.79 | learning rate: 1.021E-04 | global batch size: 256 | lm loss: 1.977941E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.595 | TFLOPs: 19.70 | 31: iteration 92420/ 173500 | consumed samples: 23659520 | consumed tokens: 48454696960 | elapsed time per iteration (s): 0.86 | learning rate: 1.021E-04 | global batch size: 256 | lm loss: 1.965191E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.391 | TFLOPs: 18.11 | 31: iteration 92430/ 173500 | consumed samples: 23662080 | consumed tokens: 48459939840 | elapsed time per iteration (s): 0.76 | learning rate: 
1.021E-04 | global batch size: 256 | lm loss: 2.015892E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.432 | TFLOPs: 20.29 | 31: iteration 92440/ 173500 | consumed samples: 23664640 | consumed tokens: 48465182720 | elapsed time per iteration (s): 0.87 | learning rate: 1.021E-04 | global batch size: 256 | lm loss: 1.988494E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.522 | TFLOPs: 17.82 | 31: iteration 92450/ 173500 | consumed samples: 23667200 | consumed tokens: 48470425600 | elapsed time per iteration (s): 0.88 | learning rate: 1.021E-04 | global batch size: 256 | lm loss: 1.986199E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.802 | TFLOPs: 17.53 | 31: iteration 92460/ 173500 | consumed samples: 23669760 | consumed tokens: 48475668480 | elapsed time per iteration (s): 0.82 | learning rate: 1.020E-04 | global batch size: 256 | lm loss: 1.988138E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.513 | TFLOPs: 18.91 | 31: iteration 92470/ 173500 | consumed samples: 23672320 | consumed tokens: 48480911360 | elapsed time per iteration (s): 0.77 | learning rate: 1.020E-04 | global batch size: 256 | lm loss: 2.011464E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.247 | TFLOPs: 20.04 | 31: iteration 92480/ 173500 | consumed samples: 23674880 | consumed tokens: 48486154240 | elapsed time per iteration (s): 0.76 | learning rate: 1.020E-04 | global batch size: 256 | lm loss: 1.999879E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.881 | TFLOPs: 20.26 | 31: iteration 92490/ 173500 | 
consumed samples: 23677440 | consumed tokens: 48491397120 | elapsed time per iteration (s): 0.85 | learning rate: 1.020E-04 | global batch size: 256 | lm loss: 1.972395E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.847 | TFLOPs: 18.32 | 31: iteration 92500/ 173500 | consumed samples: 23680000 | consumed tokens: 48496640000 | elapsed time per iteration (s): 0.81 | learning rate: 1.020E-04 | global batch size: 256 | lm loss: 1.999322E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.682 | TFLOPs: 19.16 | 31: iteration 92510/ 173500 | consumed samples: 23682560 | consumed tokens: 48501882880 | elapsed time per iteration (s): 0.77 | learning rate: 1.020E-04 | global batch size: 256 | lm loss: 1.996620E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.417 | TFLOPs: 20.05 | 31: iteration 92520/ 173500 | consumed samples: 23685120 | consumed tokens: 48507125760 | elapsed time per iteration (s): 0.80 | learning rate: 1.019E-04 | global batch size: 256 | lm loss: 1.977835E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.533 | TFLOPs: 19.45 | 31: iteration 92530/ 173500 | consumed samples: 23687680 | consumed tokens: 48512368640 | elapsed time per iteration (s): 0.77 | learning rate: 1.019E-04 | global batch size: 256 | lm loss: 1.981571E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.913 | TFLOPs: 20.14 | 31: iteration 92540/ 173500 | consumed samples: 23690240 | consumed tokens: 48517611520 | elapsed time per iteration (s): 0.74 | learning rate: 1.019E-04 | global batch size: 256 | lm loss: 1.988061E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 344.729 | TFLOPs: 20.86 | 31: iteration 92550/ 173500 | consumed samples: 23692800 | consumed tokens: 48522854400 | elapsed time per iteration (s): 0.78 | learning rate: 1.019E-04 | global batch size: 256 | lm loss: 1.985881E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.500 | TFLOPs: 19.81 | 31: iteration 92560/ 173500 | consumed samples: 23695360 | consumed tokens: 48528097280 | elapsed time per iteration (s): 0.77 | learning rate: 1.019E-04 | global batch size: 256 | lm loss: 2.008827E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.719 | TFLOPs: 20.13 | 31: iteration 92570/ 173500 | consumed samples: 23697920 | consumed tokens: 48533340160 | elapsed time per iteration (s): 0.77 | learning rate: 1.019E-04 | global batch size: 256 | lm loss: 1.970859E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.408 | TFLOPs: 20.11 | 31: iteration 92580/ 173500 | consumed samples: 23700480 | consumed tokens: 48538583040 | elapsed time per iteration (s): 0.75 | learning rate: 1.018E-04 | global batch size: 256 | lm loss: 1.946202E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.291 | TFLOPs: 20.65 | 31: iteration 92590/ 173500 | consumed samples: 23703040 | consumed tokens: 48543825920 | elapsed time per iteration (s): 0.77 | learning rate: 1.018E-04 | global batch size: 256 | lm loss: 1.989881E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.420 | TFLOPs: 20.05 | 31: iteration 92600/ 173500 | consumed samples: 23705600 | consumed tokens: 48549068800 | elapsed time per iteration (s): 0.77 | learning rate: 1.018E-04 | global batch 
size: 256 | lm loss: 1.998658E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.239 | TFLOPs: 20.10 | 31: iteration 92610/ 173500 | consumed samples: 23708160 | consumed tokens: 48554311680 | elapsed time per iteration (s): 0.85 | learning rate: 1.018E-04 | global batch size: 256 | lm loss: 1.997502E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.878 | TFLOPs: 18.26 | 31: iteration 92620/ 173500 | consumed samples: 23710720 | consumed tokens: 48559554560 | elapsed time per iteration (s): 0.75 | learning rate: 1.018E-04 | global batch size: 256 | lm loss: 1.987993E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.172 | TFLOPs: 20.52 | 31: iteration 92630/ 173500 | consumed samples: 23713280 | consumed tokens: 48564797440 | elapsed time per iteration (s): 0.79 | learning rate: 1.018E-04 | global batch size: 256 | lm loss: 1.992842E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.475 | TFLOPs: 19.63 | 31: iteration 92640/ 173500 | consumed samples: 23715840 | consumed tokens: 48570040320 | elapsed time per iteration (s): 0.90 | learning rate: 1.017E-04 | global batch size: 256 | lm loss: 1.938551E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.115 | TFLOPs: 17.25 | 31: iteration 92650/ 173500 | consumed samples: 23718400 | consumed tokens: 48575283200 | elapsed time per iteration (s): 0.83 | learning rate: 1.017E-04 | global batch size: 256 | lm loss: 1.974779E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.187 | TFLOPs: 18.58 | 31: iteration 92660/ 173500 | consumed samples: 23720960 | 
consumed tokens: 48580526080 | elapsed time per iteration (s): 0.82 | learning rate: 1.017E-04 | global batch size: 256 | lm loss: 2.003246E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.020 | TFLOPs: 19.00 | 31: iteration 92670/ 173500 | consumed samples: 23723520 | consumed tokens: 48585768960 | elapsed time per iteration (s): 0.82 | learning rate: 1.017E-04 | global batch size: 256 | lm loss: 2.001806E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.021 | TFLOPs: 18.88 | 31: iteration 92680/ 173500 | consumed samples: 23726080 | consumed tokens: 48591011840 | elapsed time per iteration (s): 0.80 | learning rate: 1.017E-04 | global batch size: 256 | lm loss: 1.970867E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.685 | TFLOPs: 19.46 | 31: iteration 92690/ 173500 | consumed samples: 23728640 | consumed tokens: 48596254720 | elapsed time per iteration (s): 0.80 | learning rate: 1.017E-04 | global batch size: 256 | lm loss: 1.977800E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.439 | TFLOPs: 19.33 | 31: iteration 92700/ 173500 | consumed samples: 23731200 | consumed tokens: 48601497600 | elapsed time per iteration (s): 0.77 | learning rate: 1.016E-04 | global batch size: 256 | lm loss: 1.995831E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.528 | TFLOPs: 20.24 | 31: iteration 92710/ 173500 | consumed samples: 23733760 | consumed tokens: 48606740480 | elapsed time per iteration (s): 0.79 | learning rate: 1.016E-04 | global batch size: 256 | lm loss: 2.000128E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 325.082 | TFLOPs: 19.67 | 31: iteration 92720/ 173500 | consumed samples: 23736320 | consumed tokens: 48611983360 | elapsed time per iteration (s): 0.91 | learning rate: 1.016E-04 | global batch size: 256 | lm loss: 1.957032E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.814 | TFLOPs: 17.11 | 31: iteration 92730/ 173500 | consumed samples: 23738880 | consumed tokens: 48617226240 | elapsed time per iteration (s): 0.86 | learning rate: 1.016E-04 | global batch size: 256 | lm loss: 1.952723E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.552 | TFLOPs: 17.94 | 31: iteration 92740/ 173500 | consumed samples: 23741440 | consumed tokens: 48622469120 | elapsed time per iteration (s): 0.84 | learning rate: 1.016E-04 | global batch size: 256 | lm loss: 1.972788E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.734 | TFLOPs: 18.50 | 31: iteration 92750/ 173500 | consumed samples: 23744000 | consumed tokens: 48627712000 | elapsed time per iteration (s): 0.83 | learning rate: 1.016E-04 | global batch size: 256 | lm loss: 2.003095E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.679 | TFLOPs: 18.61 | 31: iteration 92760/ 173500 | consumed samples: 23746560 | consumed tokens: 48632954880 | elapsed time per iteration (s): 0.79 | learning rate: 1.015E-04 | global batch size: 256 | lm loss: 1.987269E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.215 | TFLOPs: 19.55 | 31: iteration 92770/ 173500 | consumed samples: 23749120 | consumed tokens: 48638197760 | elapsed time per iteration (s): 0.82 | learning rate: 1.015E-04 | global batch size: 256 | lm loss: 
1.981575E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.913 | TFLOPs: 18.99 | 31: iteration 92780/ 173500 | consumed samples: 23751680 | consumed tokens: 48643440640 | elapsed time per iteration (s): 0.84 | learning rate: 1.015E-04 | global batch size: 256 | lm loss: 1.989313E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.537 | TFLOPs: 18.48 | 31: iteration 92790/ 173500 | consumed samples: 23754240 | consumed tokens: 48648683520 | elapsed time per iteration (s): 0.79 | learning rate: 1.015E-04 | global batch size: 256 | lm loss: 1.985382E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.017 | TFLOPs: 19.48 | 31: iteration 92800/ 173500 | consumed samples: 23756800 | consumed tokens: 48653926400 | elapsed time per iteration (s): 0.84 | learning rate: 1.015E-04 | global batch size: 256 | lm loss: 1.984436E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.287 | TFLOPs: 18.35 | 31: iteration 92810/ 173500 | consumed samples: 23759360 | consumed tokens: 48659169280 | elapsed time per iteration (s): 0.84 | learning rate: 1.015E-04 | global batch size: 256 | lm loss: 1.994597E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.246 | TFLOPs: 18.53 | 31: iteration 92820/ 173500 | consumed samples: 23761920 | consumed tokens: 48664412160 | elapsed time per iteration (s): 0.79 | learning rate: 1.014E-04 | global batch size: 256 | lm loss: 1.990623E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.565 | TFLOPs: 19.57 | 31: iteration 92830/ 173500 | consumed samples: 23764480 | consumed tokens: 
48669655040 | elapsed time per iteration (s): 0.75 | learning rate: 1.014E-04 | global batch size: 256 | lm loss: 1.972751E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.528 | TFLOPs: 20.60 | 31: iteration 92840/ 173500 | consumed samples: 23767040 | consumed tokens: 48674897920 | elapsed time per iteration (s): 0.74 | learning rate: 1.014E-04 | global batch size: 256 | lm loss: 1.942311E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.615 | TFLOPs: 21.03 | 31: iteration 92850/ 173500 | consumed samples: 23769600 | consumed tokens: 48680140800 | elapsed time per iteration (s): 0.73 | learning rate: 1.014E-04 | global batch size: 256 | lm loss: 2.007996E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.973 | TFLOPs: 21.35 | 31: iteration 92860/ 173500 | consumed samples: 23772160 | consumed tokens: 48685383680 | elapsed time per iteration (s): 0.78 | learning rate: 1.014E-04 | global batch size: 256 | lm loss: 2.002004E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.425 | TFLOPs: 19.87 | 31: iteration 92870/ 173500 | consumed samples: 23774720 | consumed tokens: 48690626560 | elapsed time per iteration (s): 0.80 | learning rate: 1.014E-04 | global batch size: 256 | lm loss: 1.970871E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.471 | TFLOPs: 19.33 | 31: iteration 92880/ 173500 | consumed samples: 23777280 | consumed tokens: 48695869440 | elapsed time per iteration (s): 0.78 | learning rate: 1.014E-04 | global batch size: 256 | lm loss: 1.987815E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 328.312 | TFLOPs: 19.86 | 31: iteration 92890/ 173500 | consumed samples: 23779840 | consumed tokens: 48701112320 | elapsed time per iteration (s): 0.81 | learning rate: 1.013E-04 | global batch size: 256 | lm loss: 1.960987E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.354 | TFLOPs: 19.20 | 31: iteration 92900/ 173500 | consumed samples: 23782400 | consumed tokens: 48706355200 | elapsed time per iteration (s): 0.81 | learning rate: 1.013E-04 | global batch size: 256 | lm loss: 1.980655E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.876 | TFLOPs: 19.23 | 31: iteration 92910/ 173500 | consumed samples: 23784960 | consumed tokens: 48711598080 | elapsed time per iteration (s): 0.74 | learning rate: 1.013E-04 | global batch size: 256 | lm loss: 1.983867E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.797 | TFLOPs: 20.86 | 31: iteration 92920/ 173500 | consumed samples: 23787520 | consumed tokens: 48716840960 | elapsed time per iteration (s): 0.80 | learning rate: 1.013E-04 | global batch size: 256 | lm loss: 1.984454E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.426 | TFLOPs: 19.45 | 31: iteration 92930/ 173500 | consumed samples: 23790080 | consumed tokens: 48722083840 | elapsed time per iteration (s): 0.73 | learning rate: 1.013E-04 | global batch size: 256 | lm loss: 2.001827E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.997 | TFLOPs: 21.29 | 31: iteration 92940/ 173500 | consumed samples: 23792640 | consumed tokens: 48727326720 | elapsed time per iteration (s): 0.79 | learning rate: 1.013E-04 | global batch size: 256 | lm loss: 2.002630E+00 | grad 
norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.021 | TFLOPs: 19.48 | 31: iteration 92950/ 173500 | consumed samples: 23795200 | consumed tokens: 48732569600 | elapsed time per iteration (s): 0.79 | learning rate: 1.012E-04 | global batch size: 256 | lm loss: 1.992008E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.110 | TFLOPs: 19.67 | 31: iteration 92960/ 173500 | consumed samples: 23797760 | consumed tokens: 48737812480 | elapsed time per iteration (s): 0.74 | learning rate: 1.012E-04 | global batch size: 256 | lm loss: 1.984924E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.844 | TFLOPs: 20.98 | 31: iteration 92970/ 173500 | consumed samples: 23800320 | consumed tokens: 48743055360 | elapsed time per iteration (s): 0.79 | learning rate: 1.012E-04 | global batch size: 256 | lm loss: 1.968823E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.875 | TFLOPs: 19.71 | 31: iteration 92980/ 173500 | consumed samples: 23802880 | consumed tokens: 48748298240 | elapsed time per iteration (s): 0.76 | learning rate: 1.012E-04 | global batch size: 256 | lm loss: 2.012116E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.785 | TFLOPs: 20.25 | 31: iteration 92990/ 173500 | consumed samples: 23805440 | consumed tokens: 48753541120 | elapsed time per iteration (s): 0.77 | learning rate: 1.012E-04 | global batch size: 256 | lm loss: 2.005137E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.409 | TFLOPs: 20.23 | 31: iteration 93000/ 173500 | consumed samples: 23808000 | consumed tokens: 48758784000 | elapsed time 
per iteration (s): 0.72 | learning rate: 1.012E-04 | global batch size: 256 | lm loss: 1.981376E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.614 | TFLOPs: 21.51 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 93000 | lm loss value: 1.966579E+00 | lm loss PPL: 7.146188E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 93000 to checkpoints_1b1long 0: [2022-11-26 15:07:53,620] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step93000 is begin to save! 0: [2022-11-26 15:07:53,631] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_01-model_00-model_states.pt... 0: [2022-11-26 15:07:53,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_01-model_00-model_states.pt. 0: [2022-11-26 15:07:53,850] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_03-model_00-model_states.pt... 0: [2022-11-26 15:07:53,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_03-model_00-model_states.pt. 0: [2022-11-26 15:07:53,930] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_04-model_00-model_states.pt... 0: [2022-11-26 15:07:54,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_04-model_00-model_states.pt. 0: [2022-11-26 15:07:54,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_05-model_00-model_states.pt... 
0: [2022-11-26 15:07:54,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_05-model_00-model_states.pt. 0: [2022-11-26 15:07:54,090] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_06-model_00-model_states.pt... 0: [2022-11-26 15:07:54,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_06-model_00-model_states.pt. 0: [2022-11-26 15:07:54,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_07-model_00-model_states.pt... 0: [2022-11-26 15:07:54,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_07-model_00-model_states.pt. 0: [2022-11-26 15:07:54,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_08-model_00-model_states.pt... 0: [2022-11-26 15:07:54,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_08-model_00-model_states.pt. 0: [2022-11-26 15:07:54,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_09-model_00-model_states.pt... 0: [2022-11-26 15:07:54,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_09-model_00-model_states.pt. 0: [2022-11-26 15:07:54,398] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_10-model_00-model_states.pt... 0: [2022-11-26 15:07:54,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_10-model_00-model_states.pt. 0: [2022-11-26 15:07:54,477] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_11-model_00-model_states.pt... 
0: [2022-11-26 15:07:54,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_11-model_00-model_states.pt. 0: [2022-11-26 15:07:54,552] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_12-model_00-model_states.pt... 0: [2022-11-26 15:07:54,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_12-model_00-model_states.pt. 0: [2022-11-26 15:07:54,630] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_13-model_00-model_states.pt... 0: [2022-11-26 15:07:54,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_13-model_00-model_states.pt. 0: [2022-11-26 15:07:54,705] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_14-model_00-model_states.pt... 0: [2022-11-26 15:07:54,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_14-model_00-model_states.pt. 0: [2022-11-26 15:07:54,781] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_15-model_00-model_states.pt... 0: [2022-11-26 15:07:54,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_15-model_00-model_states.pt. 0: [2022-11-26 15:07:54,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_16-model_00-model_states.pt... 0: [2022-11-26 15:07:54,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_16-model_00-model_states.pt. 0: [2022-11-26 15:07:54,935] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_17-model_00-model_states.pt... 
0: [2022-11-26 15:07:55,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_17-model_00-model_states.pt. 0: [2022-11-26 15:07:55,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_18-model_00-model_states.pt... 0: [2022-11-26 15:07:55,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_18-model_00-model_states.pt. 0: [2022-11-26 15:07:55,089] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_19-model_00-model_states.pt... 0: [2022-11-26 15:07:55,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_19-model_00-model_states.pt. 0: [2022-11-26 15:07:55,163] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_20-model_00-model_states.pt... 0: [2022-11-26 15:07:55,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_20-model_00-model_states.pt. 0: [2022-11-26 15:07:55,242] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_21-model_00-model_states.pt... 0: [2022-11-26 15:07:55,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_21-model_00-model_states.pt. 0: [2022-11-26 15:07:55,319] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_22-model_00-model_states.pt... 0: [2022-11-26 15:07:55,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_22-model_00-model_states.pt. 0: [2022-11-26 15:07:55,393] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_23-model_00-model_states.pt... 
0: [2022-11-26 15:07:55,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_23-model_00-model_states.pt. 0: [2022-11-26 15:07:55,471] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_24-model_00-model_states.pt... 0: [2022-11-26 15:07:55,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_24-model_00-model_states.pt. 0: [2022-11-26 15:07:55,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_25-model_00-model_states.pt... 0: [2022-11-26 15:07:55,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_25-model_00-model_states.pt. 0: [2022-11-26 15:07:55,624] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_26-model_00-model_states.pt... 0: [2022-11-26 15:07:55,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_26-model_00-model_states.pt. 0: [2022-11-26 15:07:55,699] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_27-model_00-model_states.pt... 0: [2022-11-26 15:07:55,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_27-model_00-model_states.pt. 0: [2022-11-26 15:07:55,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_28-model_00-model_states.pt... 0: [2022-11-26 15:07:55,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_28-model_00-model_states.pt. 0: [2022-11-26 15:07:55,853] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/layer_30-model_00-model_states.pt... 
0: [2022-11-26 15:07:55,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/layer_30-model_00-model_states.pt. 0: [2022-11-26 15:07:55,855] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step93000/mp_rank_00_model_states.pt 0: [2022-11-26 15:07:55,855] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/mp_rank_00_model_states.pt... 0: [2022-11-26 15:07:55,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/mp_rank_00_model_states.pt. 0: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
5: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 
9: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 
16: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
13: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
20: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 
11: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 
29: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 
18: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 
27: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
7: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 
10: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
2: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 
20: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 
24: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 
22: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 
18: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 
7: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 
3: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 
14: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 
17: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
8: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 
14: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 
11: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:07:55,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:07:55,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 15:07:55,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 15:07:55,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 12: [2022-11-26 15:07:55,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 15:07:55,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 15:07:55,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 18: [2022-11-26 15:07:55,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:07:55,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 20: [2022-11-26 15:07:55,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:07:55,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 20: [2022-11-26 15:07:55,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 15:07:55,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 7: [2022-11-26 15:07:55,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:07:55,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 15:07:55,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 10: [2022-11-26 15:07:55,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-26 15:07:55,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 15:07:55,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 5: [2022-11-26 15:07:55,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:07:55,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 15:07:55,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 9: [2022-11-26 15:07:55,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:07:55,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 15:07:55,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 16: [2022-11-26 15:07:55,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:07:55,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 15: [2022-11-26 15:07:55,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 15:07:55,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 15:07:55,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 16: [2022-11-26 15:07:55,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 28: [2022-11-26 15:07:55,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:07:55,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:07:55,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 15:07:55,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 13: [2022-11-26 15:07:55,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:07:55,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 15:07:55,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 13: [2022-11-26 15:07:55,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 15:07:55,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 15:07:55,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 23: [2022-11-26 15:07:55,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 15:07:55,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 15:07:55,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 26: [2022-11-26 15:07:55,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:07:55,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:07:55,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 1: [2022-11-26 15:07:55,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:07:55,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 0: [2022-11-26 15:07:55,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
5: [2022-11-26 15:07:55,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 15:07:55,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 0: [2022-11-26 15:07:55,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 15:07:55,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 20: [2022-11-26 15:07:55,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:07:55,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 17: [2022-11-26 15:07:55,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:07:55,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 20: [2022-11-26 15:07:55,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 15:07:55,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 17: [2022-11-26 15:07:55,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 27: [2022-11-26 15:07:55,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
17: [2022-11-26 15:07:55,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 27: [2022-11-26 15:07:55,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 1: [2022-11-26 15:07:55,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:07:55,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 6: [2022-11-26 15:07:55,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:07:55,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 6: [2022-11-26 15:07:55,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 15:07:55,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 1: [2022-11-26 15:07:55,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 10: [2022-11-26 15:07:55,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:07:55,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 24: [2022-11-26 15:07:55,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
10: [2022-11-26 15:07:55,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 7: [2022-11-26 15:07:55,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:07:55,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 7: [2022-11-26 15:07:55,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 15:07:55,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 24: [2022-11-26 15:07:55,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 17: [2022-11-26 15:07:55,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 30: [2022-11-26 15:07:55,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:07:55,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 30: [2022-11-26 15:07:55,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 15:07:55,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 17: [2022-11-26 15:07:55,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
4: [2022-11-26 15:07:55,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:07:55,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 15:07:55,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 16: [2022-11-26 15:07:55,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:07:55,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 15:07:55,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 31: [2022-11-26 15:07:55,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:07:55,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 15:07:55,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 20: [2022-11-26 15:07:55,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:07:55,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:07:55,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 15:07:55,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 12: [2022-11-26 15:07:55,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:07:55,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 2: [2022-11-26 15:07:55,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 12: [2022-11-26 15:07:55,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 20: [2022-11-26 15:07:55,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 2: [2022-11-26 15:07:55,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 2: [2022-11-26 15:07:55,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 12: [2022-11-26 15:07:55,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 26: [2022-11-26 15:07:55,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:07:55,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 15:07:55,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
24: [2022-11-26 15:07:55,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:07:55,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:07:55,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 24: [2022-11-26 15:07:55,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 19: [2022-11-26 15:07:55,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 24: [2022-11-26 15:07:55,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 14: [2022-11-26 15:07:55,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:07:55,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:07:55,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 15:07:55,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 15:07:55,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 14: [2022-11-26 15:07:55,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
3: [2022-11-26 15:07:55,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:07:55,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 15:07:55,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 9: [2022-11-26 15:07:55,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:07:55,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 6: [2022-11-26 15:07:55,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:07:56,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 6: [2022-11-26 15:07:55,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 26: [2022-11-26 15:07:56,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:07:55,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 26: [2022-11-26 15:07:56,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 15:07:56,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
9: [2022-11-26 15:07:56,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:07:56,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 15:07:56,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 4: [2022-11-26 15:07:56,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:07:56,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 15:07:56,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 25: [2022-11-26 15:07:56,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:07:56,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 25: [2022-11-26 15:07:56,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 15:07:56,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 15:07:56,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
25: [2022-11-26 15:07:56,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 16: [2022-11-26 15:07:56,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 25: [2022-11-26 15:07:56,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 16: [2022-11-26 15:07:56,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 13: [2022-11-26 15:07:56,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:07:56,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 25: [2022-11-26 15:07:56,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:07:56,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 15: [2022-11-26 15:07:56,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:07:56,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 25: [2022-11-26 15:07:56,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 15:07:56,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
15: [2022-11-26 15:07:56,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 14: [2022-11-26 15:07:56,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:07:56,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:07:56,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 4: [2022-11-26 15:07:56,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 14: [2022-11-26 15:07:56,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 4: [2022-11-26 15:07:56,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 19: [2022-11-26 15:07:56,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:07:56,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 15:07:56,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 11: [2022-11-26 15:07:55,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-26 15:07:55,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 15:07:55,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 11: [2022-11-26 15:07:56,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:07:56,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 15:07:56,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 21: [2022-11-26 15:07:56,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:07:56,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 15:07:56,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 1: [2022-11-26 15:07:56,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:07:56,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 15:07:56,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 11: [2022-11-26 15:07:56,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 15:07:56,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 15:07:56,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 17: [2022-11-26 15:07:56,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:07:56,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 16: [2022-11-26 15:07:56,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:07:56,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:07:56,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 16: [2022-11-26 15:07:56,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 15:07:56,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 31: [2022-11-26 15:07:56,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 7: [2022-11-26 15:07:56,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:07:56,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
7: [2022-11-26 15:07:56,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 15:07:56,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 10: [2022-11-26 15:07:56,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:07:56,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:07:56,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:07:56,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 15:07:56,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 24: [2022-11-26 15:07:56,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 10: [2022-11-26 15:07:56,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 10: [2022-11-26 15:07:56,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 24: [2022-11-26 15:07:56,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 31: [2022-11-26 15:07:56,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 
31: [2022-11-26 15:07:56,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 15:07:56,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 27: [2022-11-26 15:07:56,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:07:56,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 2: [2022-11-26 15:07:56,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:07:56,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 2: [2022-11-26 15:07:56,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:07:56,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 5: [2022-11-26 15:07:56,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:07:56,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 15:07:56,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
5: [2022-11-26 15:07:56,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 2: [2022-11-26 15:07:56,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 5: [2022-11-26 15:07:56,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 1: [2022-11-26 15:07:56,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:07:56,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 15:07:56,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 21: [2022-11-26 15:07:56,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:07:56,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 23: [2022-11-26 15:07:56,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:07:56,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 6: [2022-11-26 15:07:56,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 21: [2022-11-26 15:07:56,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
6: [2022-11-26 15:07:56,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 23: [2022-11-26 15:07:56,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 26: [2022-11-26 15:07:56,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 23: [2022-11-26 15:07:56,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 26: [2022-11-26 15:07:56,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 15:07:56,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 3: [2022-11-26 15:07:56,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:07:56,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 15:07:56,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 5: [2022-11-26 15:07:56,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:07:56,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 15:07:56,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
20: [2022-11-26 15:07:56,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:07:56,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:07:56,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 4: [2022-11-26 15:07:56,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 20: [2022-11-26 15:07:56,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 4: [2022-11-26 15:07:56,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 9: [2022-11-26 15:07:56,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:07:56,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:07:56,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:07:56,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 15:07:56,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
0: [2022-11-26 15:07:56,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 15:07:56,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 15:07:56,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 0: [2022-11-26 15:07:56,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 18: [2022-11-26 15:07:56,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 30: [2022-11-26 15:07:56,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:07:56,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 30: [2022-11-26 15:07:56,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 18: [2022-11-26 15:07:56,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 30: [2022-11-26 15:07:56,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 17: [2022-11-26 15:07:56,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 
17: [2022-11-26 15:07:56,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 15:07:56,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 18: [2022-11-26 15:07:56,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:07:56,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 25: [2022-11-26 15:07:56,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:07:56,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 25: [2022-11-26 15:07:56,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 15:07:56,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 15: [2022-11-26 15:07:56,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:07:56,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 15:07:56,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:07:56,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
15: [2022-11-26 15:07:56,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 15:07:56,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 8: [2022-11-26 15:07:56,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:07:56,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 15:07:56,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 11: [2022-11-26 15:07:56,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:07:56,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 15:07:56,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 7: [2022-11-26 15:07:56,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:07:56,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 15:07:56,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 8: [2022-11-26 15:07:56,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 15:07:56,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 15:07:56,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 24: [2022-11-26 15:07:56,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:07:56,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 30: [2022-11-26 15:07:56,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:07:56,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 30: [2022-11-26 15:07:56,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 15:07:56,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 19: [2022-11-26 15:07:56,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:07:56,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 15:07:56,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 14: [2022-11-26 15:07:56,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 15:07:56,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 15:07:56,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 31: [2022-11-26 15:07:56,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:07:56,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 15:07:56,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 22: [2022-11-26 15:07:56,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:07:56,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:07:56,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:07:56,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 15:07:56,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 15:07:56,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
22: [2022-11-26 15:07:56,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 15:07:56,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 22: [2022-11-26 15:07:56,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 0: [2022-11-26 15:07:56,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:07:56,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:07:56,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 0: [2022-11-26 15:07:56,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 8: [2022-11-26 15:07:56,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 0: [2022-11-26 15:07:56,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 23: [2022-11-26 15:07:56,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 15:07:56,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 15:07:56,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-26 15:07:56,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 23: [2022-11-26 15:07:56,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 15:07:56,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 12: [2022-11-26 15:07:56,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:07:56,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 15:07:56,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 21: [2022-11-26 15:07:56,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:07:56,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 15:07:56,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 13: [2022-11-26 15:07:56,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:07:56,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 15:07:56,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
12: [2022-11-26 15:07:56,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:07:56,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 15:07:56,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 18: [2022-11-26 15:07:56,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:07:56,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 15:07:56,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 27: [2022-11-26 15:07:56,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:07:56,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 15:07:56,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 3: [2022-11-26 15:07:56,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:07:56,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 15:07:56,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
20: [2022-11-26 15:07:56,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:07:56,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 15:07:56,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 19: [2022-11-26 15:07:56,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:07:56,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 15:07:56,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 19: [2022-11-26 15:07:56,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:07:56,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 15:07:56,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 27: [2022-11-26 15:07:56,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:07:56,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 15:07:56,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
9: [2022-11-26 15:07:56,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:07:56,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 15:07:56,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 1: [2022-11-26 15:07:56,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:07:56,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 15:07:56,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 4: [2022-11-26 15:07:56,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:07:56,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 15:07:56,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 26: [2022-11-26 15:07:56,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:07:56,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 15:07:56,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
5: [2022-11-26 15:07:56,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:07:56,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 15:07:56,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 11: [2022-11-26 15:07:56,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:07:56,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 15:07:56,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 16: [2022-11-26 15:07:56,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:07:56,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 15:07:56,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 29: [2022-11-26 15:07:56,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:07:56,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:07:56,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 
29: [2022-11-26 15:07:56,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 28: [2022-11-26 15:07:55,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 29: [2022-11-26 15:07:56,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 15:07:56,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 15:07:56,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 15:07:56,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 28: [2022-11-26 15:07:55,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 29: [2022-11-26 15:07:56,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 29: [2022-11-26 15:07:56,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 29: [2022-11-26 15:07:56,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 28: [2022-11-26 15:07:55,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:07:56,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
6: [2022-11-26 15:07:56,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 28: [2022-11-26 15:07:55,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 6: [2022-11-26 15:07:56,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 28: [2022-11-26 15:07:55,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 6: [2022-11-26 15:07:56,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 28: [2022-11-26 15:07:56,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 15:07:56,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 15:07:56,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 28: [2022-11-26 15:07:56,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 15:07:56,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 15:07:56,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 28: [2022-11-26 15:07:56,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
28: [2022-11-26 15:07:56,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 15:07:56,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 30: [2022-11-26 15:07:56,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 15:07:56,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 15:07:56,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 23: [2022-11-26 15:07:56,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 15:07:56,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 15:07:56,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 17: [2022-11-26 15:07:56,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:07:56,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:07:56,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 15:07:56,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
17: [2022-11-26 15:07:56,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 15:07:56,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 2: [2022-11-26 15:07:56,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:07:56,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 15:07:56,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 21: [2022-11-26 15:07:56,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:07:56,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 15:07:56,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 6: [2022-11-26 15:07:56,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:07:56,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 15:07:56,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 25: [2022-11-26 15:07:56,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
25: [2022-11-26 15:07:56,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 15:07:56,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 29: [2022-11-26 15:07:56,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:07:56,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 15:07:56,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 0: [2022-11-26 15:07:56,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:07:56,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 15:07:56,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 24: [2022-11-26 15:07:56,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:07:56,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 15:07:56,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 7: [2022-11-26 15:07:56,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 15:07:56,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 15:07:56,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 13: [2022-11-26 15:07:56,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:07:56,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 15:07:56,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 12: [2022-11-26 15:07:56,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:07:56,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 15:07:56,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 10: [2022-11-26 15:07:56,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:07:56,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:07:56,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 15:07:56,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
31: [2022-11-26 15:07:56,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 15:07:56,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 8: [2022-11-26 15:07:56,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:07:56,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 15:07:56,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 15: [2022-11-26 15:07:56,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:07:56,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 15:07:56,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 18: [2022-11-26 15:07:56,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:07:56,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 15:07:56,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 27: [2022-11-26 15:07:56,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
27: [2022-11-26 15:07:56,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 15:07:56,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 22: [2022-11-26 15:07:56,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:07:56,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 15:07:56,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 3: [2022-11-26 15:07:56,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:07:56,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 15:07:56,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 5: [2022-11-26 15:07:56,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:07:56,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 26: [2022-11-26 15:07:56,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:07:56,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
26: [2022-11-26 15:07:56,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 15:07:56,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 20: [2022-11-26 15:07:56,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:07:56,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 15:07:56,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 4: [2022-11-26 15:07:56,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:07:56,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 28: [2022-11-26 15:07:56,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:07:56,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 28: [2022-11-26 15:07:56,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 15:07:56,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 1: [2022-11-26 15:07:56,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
9: [2022-11-26 15:07:56,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:07:56,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 15:07:56,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 9: [2022-11-26 15:07:56,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 15:07:56,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 16: [2022-11-26 15:07:56,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:07:56,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 15:07:56,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 23: [2022-11-26 15:07:56,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 15:07:56,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 15:07:56,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 30: [2022-11-26 15:07:56,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-26 15:07:56,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 15:07:56,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 11: [2022-11-26 15:07:56,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:07:56,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 14: [2022-11-26 15:07:56,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:07:56,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 14: [2022-11-26 15:07:56,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 15:07:56,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 19: [2022-11-26 15:07:56,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:07:56,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 15:07:56,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 17: [2022-11-26 15:07:56,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-26 15:07:56,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 15:07:56,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 29: [2022-11-26 15:07:56,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:07:56,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 15:07:56,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 6: [2022-11-26 15:07:56,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:07:56,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 15:07:56,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 0: [2022-11-26 15:07:56,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:07:56,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 7: [2022-11-26 15:07:56,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:07:56,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
7: [2022-11-26 15:07:56,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 15:07:56,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 25: [2022-11-26 15:07:56,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:07:56,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:07:56,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 25: [2022-11-26 15:07:56,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 21: [2022-11-26 15:07:56,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 31: [2022-11-26 15:07:56,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 25: [2022-11-26 15:07:56,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 13: [2022-11-26 15:07:56,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:07:56,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 15:07:56,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
13: [2022-11-26 15:07:56,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 15:07:56,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 12: [2022-11-26 15:07:56,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:07:56,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 15: [2022-11-26 15:07:56,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:07:56,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 15: [2022-11-26 15:07:56,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 15:07:56,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 2: [2022-11-26 15:07:56,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:07:56,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 15:07:56,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 18: [2022-11-26 15:07:56,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
18: [2022-11-26 15:07:56,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 15:07:56,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 27: [2022-11-26 15:07:56,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:07:56,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 5: [2022-11-26 15:07:56,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:07:56,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 5: [2022-11-26 15:07:56,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 15:07:56,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 20: [2022-11-26 15:07:56,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:07:56,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 15:07:56,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 3: [2022-11-26 15:07:56,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 15:07:56,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 15:07:56,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 28: [2022-11-26 15:07:56,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 15:07:56,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 15:07:56,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 22: [2022-11-26 15:07:56,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:07:56,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 15:07:56,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 16: [2022-11-26 15:07:56,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:07:56,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 15:07:56,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 9: [2022-11-26 15:07:56,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
10: [2022-11-26 15:07:56,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:07:56,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 9: [2022-11-26 15:07:56,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 10: [2022-11-26 15:07:56,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 9: [2022-11-26 15:07:56,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 1: [2022-11-26 15:07:56,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:07:56,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 15:07:56,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 26: [2022-11-26 15:07:56,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:07:56,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 15:07:56,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 24: [2022-11-26 15:07:56,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
24: [2022-11-26 15:07:56,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 15:07:56,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 23: [2022-11-26 15:07:56,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 15:07:56,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 15:07:56,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 11: [2022-11-26 15:07:56,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:07:56,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 15:07:56,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 19: [2022-11-26 15:07:56,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:07:56,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 15:07:56,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 0: [2022-11-26 15:07:56,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 15:07:56,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 4: [2022-11-26 15:07:56,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:07:56,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 4: [2022-11-26 15:07:56,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 15:07:56,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 8: [2022-11-26 15:07:56,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:07:56,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 15:07:56,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 17: [2022-11-26 15:07:56,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:07:56,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 15:07:56,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 30: [2022-11-26 15:07:56,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-26 15:07:56,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 15:07:56,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 24: [2022-11-26 15:07:56,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:07:56,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 15:07:56,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 25: [2022-11-26 15:07:56,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 15:07:56,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 15:07:56,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 14: [2022-11-26 15:07:56,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:07:56,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 15:07:56,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 2: [2022-11-26 15:07:56,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 15:07:56,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 15:07:56,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 6: [2022-11-26 15:07:56,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:07:56,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 15:07:56,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 13: [2022-11-26 15:07:56,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:07:56,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 15:07:56,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 7: [2022-11-26 15:07:56,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:07:56,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 15:07:56,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 29: [2022-11-26 15:07:56,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
29: [2022-11-26 15:07:56,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 15:07:56,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 15: [2022-11-26 15:07:56,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:07:56,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 15:07:56,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 12: [2022-11-26 15:07:56,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:07:56,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 15:07:56,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 31: [2022-11-26 15:07:56,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:07:56,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 15:07:56,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 18: [2022-11-26 15:07:56,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
18: [2022-11-26 15:07:56,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 15:07:56,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 20: [2022-11-26 15:07:56,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:07:56,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:07:56,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 15:07:56,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 21: [2022-11-26 15:07:56,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 15:07:56,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 5: [2022-11-26 15:07:56,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:07:56,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 15:07:56,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 10: [2022-11-26 15:07:56,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-26 15:07:56,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 15:07:56,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 27: [2022-11-26 15:07:56,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:07:56,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:07:56,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 15:07:56,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 17: [2022-11-26 15:07:56,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 15:07:56,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 0: [2022-11-26 15:07:56,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 30: [2022-11-26 15:07:56,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 15:07:56,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 15:07:56,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
19: [2022-11-26 15:07:56,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:07:56,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 15:07:56,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 9: [2022-11-26 15:07:56,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:07:56,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 15:07:56,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 16: [2022-11-26 15:07:56,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:07:56,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:07:56,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 4: [2022-11-26 15:07:56,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 7: [2022-11-26 15:07:56,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:07:56,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
4: [2022-11-26 15:07:56,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 7: [2022-11-26 15:07:56,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 15:07:56,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 22: [2022-11-26 15:07:56,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:07:56,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:07:56,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 15:07:56,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 15:07:56,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 22: [2022-11-26 15:07:56,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 25: [2022-11-26 15:07:56,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 15:07:56,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 28: [2022-11-26 15:07:56,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
25: [2022-11-26 15:07:56,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 28: [2022-11-26 15:07:56,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 15:07:56,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 24: [2022-11-26 15:07:56,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:07:56,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 15:07:56,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 15: [2022-11-26 15:07:56,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:07:56,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 15:07:56,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 18: [2022-11-26 15:07:56,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:07:56,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 15:07:56,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
8: [2022-11-26 15:07:56,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:07:56,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 15:07:56,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 6: [2022-11-26 15:07:56,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:07:56,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 0: [2022-11-26 15:07:56,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 6: [2022-11-26 15:07:56,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 0: [2022-11-26 15:07:56,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 14: [2022-11-26 15:07:56,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:07:56,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 15:07:56,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 26: [2022-11-26 15:07:56,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
26: [2022-11-26 15:07:56,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 15:07:56,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 11: [2022-11-26 15:07:56,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:07:56,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 15:07:56,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 31: [2022-11-26 15:07:56,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:07:56,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 15:07:56,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 2: [2022-11-26 15:07:56,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:07:56,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 15:07:56,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 1: [2022-11-26 15:07:56,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 15:07:56,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 12: [2022-11-26 15:07:56,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:07:56,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 12: [2022-11-26 15:07:56,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 3: [2022-11-26 15:07:56,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:07:56,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 23: [2022-11-26 15:07:56,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:07:56,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 15:07:56,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 23: [2022-11-26 15:07:56,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 15:07:56,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 29: [2022-11-26 15:07:56,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-26 15:07:56,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 15:07:56,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 8: [2022-11-26 15:07:56,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:07:56,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 15:07:56,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 21: [2022-11-26 15:07:56,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:07:56,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 15:07:56,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 27: [2022-11-26 15:07:56,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:07:56,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 15:07:56,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 3: [2022-11-26 15:07:56,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 15:07:56,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 15:07:56,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 10: [2022-11-26 15:07:56,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:07:56,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:07:56,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 3: [2022-11-26 15:07:56,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 10: [2022-11-26 15:07:56,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 3: [2022-11-26 15:07:56,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 22: [2022-11-26 15:07:56,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:07:56,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 15:07:56,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 8: [2022-11-26 15:07:56,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 15:07:56,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt
8: [2022-11-26 15:07:56,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now!
13: [2022-11-26 15:07:56,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt.
13: [2022-11-26 15:07:56,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step93000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt
13: [2022-11-26 15:07:56,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now!
0: successfully saved checkpoint at iteration 93000 to checkpoints_1b1long
31: time (ms) | save-checkpoint: 2607.23
31: iteration 93010/ 173500 | consumed samples: 23810560 | consumed tokens: 48764026880 | elapsed time per iteration (s): 1.04 | learning rate: 1.011E-04 | global batch size: 256 | lm loss: 1.978442E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.034 | TFLOPs: 14.88 |
31: iteration 93020/ 173500 | consumed samples: 23813120 | consumed tokens: 48769269760 | elapsed time per iteration (s): 0.73 | learning rate: 1.011E-04 | global batch size: 256 | lm loss: 1.991874E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.694 | TFLOPs: 21.10 |
31: iteration 93030/ 173500 | consumed samples: 23815680 | consumed tokens: 48774512640 | elapsed time per iteration (s): 0.82 | learning rate: 1.011E-04 | global batch size: 256 | lm loss: 2.015205E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.355 | TFLOPs: 18.96 |
31: iteration 93040/ 173500 | consumed samples: 23818240 | consumed tokens: 48779755520 | elapsed time per iteration (s): 0.73 | learning rate: 1.011E-04 | global batch size: 256 | lm loss: 2.000223E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.376 | TFLOPs: 21.14 |
31: iteration 93050/ 173500 | consumed samples: 23820800 | consumed tokens: 48784998400 | elapsed time per iteration (s): 0.73 | learning rate: 1.011E-04 | global batch size: 256 | lm loss: 1.957441E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.906 | TFLOPs: 21.29 |
31: iteration 93060/ 173500 | consumed samples: 23823360 | consumed tokens: 48790241280 | elapsed time per iteration (s): 0.78 | learning rate: 1.011E-04 | global batch size: 256 | lm loss: 1.980324E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.663 | TFLOPs: 19.82 |
31: iteration 93070/ 173500 | consumed samples: 23825920 | consumed tokens: 48795484160 | elapsed time per iteration (s): 0.75 | learning rate: 1.010E-04 | global batch size: 256 | lm loss: 2.014191E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.602 | TFLOPs: 20.55 |
31: iteration 93080/ 173500 | consumed samples: 23828480 | consumed tokens: 48800727040 | elapsed time per iteration (s): 0.78 | learning rate: 1.010E-04 | global batch size: 256 | lm loss: 2.026694E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.382 | TFLOPs: 19.87 |
31: iteration 93090/ 173500 | consumed samples: 23831040 | consumed tokens: 48805969920 | elapsed time per iteration (s): 0.76 | learning rate: 1.010E-04 | global batch size: 256 | lm loss: 2.007983E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.004 | TFLOPs: 20.45 |
31: iteration 93100/ 173500 | consumed samples: 23833600 | consumed tokens: 48811212800 | elapsed time per iteration (s): 0.84 | learning rate: 1.010E-04 | global batch size: 256 | lm loss: 1.964609E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.096 | TFLOPs: 18.46 |
31: iteration 93110/ 173500 | consumed samples: 23836160 | consumed tokens: 48816455680 | elapsed time per iteration (s): 0.78 | learning rate: 1.010E-04 | global batch size: 256 | lm loss: 1.993895E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.917 | TFLOPs: 19.84 |
31: iteration 93120/ 173500 | consumed samples: 23838720 | consumed tokens: 48821698560 | elapsed time per iteration (s): 0.75 | learning rate: 1.010E-04 | global batch size: 256 | lm loss: 1.974056E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.739 | TFLOPs: 20.73 |
31: iteration 93130/ 173500 | consumed samples: 23841280 | consumed tokens: 48826941440 | elapsed time per iteration (s): 0.83 | learning rate: 1.009E-04 | global batch size: 256 | lm loss: 1.962502E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.081 | TFLOPs: 18.76 |
31: iteration 93140/ 173500 | consumed samples: 23843840 | consumed tokens: 48832184320 | elapsed time per iteration (s): 0.80 | learning rate: 1.009E-04 | global batch size: 256 | lm loss: 1.974487E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.179 | TFLOPs: 19.43 |
31: iteration 93150/ 173500 | consumed samples: 23846400 | consumed tokens: 48837427200 | elapsed time per iteration (s): 0.80 | learning rate: 1.009E-04 | global batch size: 256 | lm loss: 1.988580E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.998 | TFLOPs: 19.48 |
31: iteration 93160/ 173500 | consumed samples: 23848960 | consumed tokens: 48842670080 | elapsed time per iteration (s): 0.76 | learning rate: 1.009E-04 | global batch size: 256 | lm loss: 1.951030E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.389 | TFLOPs: 20.35 |
31: iteration 93170/ 173500 | consumed samples: 23851520 | consumed tokens: 48847912960 | elapsed time per iteration (s): 0.75 | learning rate: 1.009E-04 | global batch size: 256 | lm loss: 1.956107E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.019 | TFLOPs: 20.57 |
31: iteration 93180/ 173500 | consumed samples: 23854080 | consumed tokens: 48853155840 | elapsed time per iteration (s): 0.76 | learning rate: 1.009E-04 | global batch size: 256 | lm loss: 1.995885E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.797 | TFLOPs: 20.25 |
31: iteration 93190/ 173500 | consumed samples: 23856640 | consumed tokens: 48858398720 | elapsed time per iteration (s): 0.77 | learning rate: 1.008E-04 | global batch size: 256 | lm loss: 1.993526E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.279 | TFLOPs: 20.10 |
31: iteration 93200/ 173500 | consumed samples: 23859200 | consumed tokens: 48863641600 | elapsed time per iteration (s): 0.72 | learning rate: 1.008E-04 | global batch size: 256 | lm loss: 1.987599E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.910 | TFLOPs: 21.59 |
31: iteration 93210/ 173500 | consumed samples: 23861760
| consumed tokens: 48868884480 | elapsed time per iteration (s): 0.77 | learning rate: 1.008E-04 | global batch size: 256 | lm loss: 2.005419E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.775 | TFLOPs: 20.07 | 31: iteration 93220/ 173500 | consumed samples: 23864320 | consumed tokens: 48874127360 | elapsed time per iteration (s): 0.78 | learning rate: 1.008E-04 | global batch size: 256 | lm loss: 1.999296E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.372 | TFLOPs: 19.74 | 31: iteration 93230/ 173500 | consumed samples: 23866880 | consumed tokens: 48879370240 | elapsed time per iteration (s): 0.74 | learning rate: 1.008E-04 | global batch size: 256 | lm loss: 1.995998E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.678 | TFLOPs: 20.91 | 31: iteration 93240/ 173500 | consumed samples: 23869440 | consumed tokens: 48884613120 | elapsed time per iteration (s): 0.72 | learning rate: 1.008E-04 | global batch size: 256 | lm loss: 1.986222E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.322 | TFLOPs: 21.38 | 31: iteration 93250/ 173500 | consumed samples: 23872000 | consumed tokens: 48889856000 | elapsed time per iteration (s): 0.77 | learning rate: 1.007E-04 | global batch size: 256 | lm loss: 1.976584E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.553 | TFLOPs: 20.18 | 31: iteration 93260/ 173500 | consumed samples: 23874560 | consumed tokens: 48895098880 | elapsed time per iteration (s): 0.75 | learning rate: 1.007E-04 | global batch size: 256 | lm loss: 1.988706E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 
0 | samples per second: 341.269 | TFLOPs: 20.65 | 31: iteration 93270/ 173500 | consumed samples: 23877120 | consumed tokens: 48900341760 | elapsed time per iteration (s): 0.77 | learning rate: 1.007E-04 | global batch size: 256 | lm loss: 2.010353E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.528 | TFLOPs: 20.24 | 31: iteration 93280/ 173500 | consumed samples: 23879680 | consumed tokens: 48905584640 | elapsed time per iteration (s): 0.79 | learning rate: 1.007E-04 | global batch size: 256 | lm loss: 1.968200E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.486 | TFLOPs: 19.57 | 31: iteration 93290/ 173500 | consumed samples: 23882240 | consumed tokens: 48910827520 | elapsed time per iteration (s): 0.76 | learning rate: 1.007E-04 | global batch size: 256 | lm loss: 1.987472E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.920 | TFLOPs: 20.38 | 31: iteration 93300/ 173500 | consumed samples: 23884800 | consumed tokens: 48916070400 | elapsed time per iteration (s): 0.77 | learning rate: 1.007E-04 | global batch size: 256 | lm loss: 1.965266E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.239 | TFLOPs: 20.16 | 31: iteration 93310/ 173500 | consumed samples: 23887360 | consumed tokens: 48921313280 | elapsed time per iteration (s): 0.76 | learning rate: 1.006E-04 | global batch size: 256 | lm loss: 2.006260E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.039 | TFLOPs: 20.39 | 31: iteration 93320/ 173500 | consumed samples: 23889920 | consumed tokens: 48926556160 | elapsed time per iteration (s): 0.82 | learning rate: 1.006E-04 | global batch size: 256 | lm loss: 
1.968012E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.167 | TFLOPs: 18.89 | 31: iteration 93330/ 173500 | consumed samples: 23892480 | consumed tokens: 48931799040 | elapsed time per iteration (s): 0.75 | learning rate: 1.006E-04 | global batch size: 256 | lm loss: 1.990743E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.764 | TFLOPs: 20.55 | 31: iteration 93340/ 173500 | consumed samples: 23895040 | consumed tokens: 48937041920 | elapsed time per iteration (s): 0.80 | learning rate: 1.006E-04 | global batch size: 256 | lm loss: 1.965198E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.919 | TFLOPs: 19.48 | 31: iteration 93350/ 173500 | consumed samples: 23897600 | consumed tokens: 48942284800 | elapsed time per iteration (s): 0.76 | learning rate: 1.006E-04 | global batch size: 256 | lm loss: 1.985733E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.023 | TFLOPs: 20.27 | 31: iteration 93360/ 173500 | consumed samples: 23900160 | consumed tokens: 48947527680 | elapsed time per iteration (s): 0.77 | learning rate: 1.006E-04 | global batch size: 256 | lm loss: 1.996344E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.973 | TFLOPs: 20.08 | 31: iteration 93370/ 173500 | consumed samples: 23902720 | consumed tokens: 48952770560 | elapsed time per iteration (s): 0.76 | learning rate: 1.005E-04 | global batch size: 256 | lm loss: 1.992286E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.768 | TFLOPs: 20.37 | 31: iteration 93380/ 173500 | consumed samples: 23905280 | consumed tokens: 
48958013440 | elapsed time per iteration (s): 0.76 | learning rate: 1.005E-04 | global batch size: 256 | lm loss: 1.993771E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.072 | TFLOPs: 20.33 | 31: iteration 93390/ 173500 | consumed samples: 23907840 | consumed tokens: 48963256320 | elapsed time per iteration (s): 0.72 | learning rate: 1.005E-04 | global batch size: 256 | lm loss: 1.977023E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.022 | TFLOPs: 21.42 | 31: iteration 93400/ 173500 | consumed samples: 23910400 | consumed tokens: 48968499200 | elapsed time per iteration (s): 0.75 | learning rate: 1.005E-04 | global batch size: 256 | lm loss: 1.976262E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.267 | TFLOPs: 20.59 | 31: iteration 93410/ 173500 | consumed samples: 23912960 | consumed tokens: 48973742080 | elapsed time per iteration (s): 0.79 | learning rate: 1.005E-04 | global batch size: 256 | lm loss: 1.996986E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.636 | TFLOPs: 19.58 | 31: iteration 93420/ 173500 | consumed samples: 23915520 | consumed tokens: 48978984960 | elapsed time per iteration (s): 0.75 | learning rate: 1.005E-04 | global batch size: 256 | lm loss: 1.976885E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.621 | TFLOPs: 20.61 | 31: iteration 93430/ 173500 | consumed samples: 23918080 | consumed tokens: 48984227840 | elapsed time per iteration (s): 0.77 | learning rate: 1.005E-04 | global batch size: 256 | lm loss: 1.985880E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 333.028 | TFLOPs: 20.15 | 31: iteration 93440/ 173500 | consumed samples: 23920640 | consumed tokens: 48989470720 | elapsed time per iteration (s): 0.76 | learning rate: 1.004E-04 | global batch size: 256 | lm loss: 1.994293E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.880 | TFLOPs: 20.44 | 31: iteration 93450/ 173500 | consumed samples: 23923200 | consumed tokens: 48994713600 | elapsed time per iteration (s): 0.81 | learning rate: 1.004E-04 | global batch size: 256 | lm loss: 1.973106E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.750 | TFLOPs: 19.10 | 31: iteration 93460/ 173500 | consumed samples: 23925760 | consumed tokens: 48999956480 | elapsed time per iteration (s): 0.75 | learning rate: 1.004E-04 | global batch size: 256 | lm loss: 1.982057E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.533 | TFLOPs: 20.78 | 31: iteration 93470/ 173500 | consumed samples: 23928320 | consumed tokens: 49005199360 | elapsed time per iteration (s): 0.78 | learning rate: 1.004E-04 | global batch size: 256 | lm loss: 1.958220E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.617 | TFLOPs: 19.88 | 31: iteration 93480/ 173500 | consumed samples: 23930880 | consumed tokens: 49010442240 | elapsed time per iteration (s): 0.71 | learning rate: 1.004E-04 | global batch size: 256 | lm loss: 1.993149E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 359.297 | TFLOPs: 21.74 | 31: iteration 93490/ 173500 | consumed samples: 23933440 | consumed tokens: 49015685120 | elapsed time per iteration (s): 0.77 | learning rate: 1.004E-04 | global batch size: 256 | lm loss: 1.993173E+00 | grad 
norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.851 | TFLOPs: 20.08 | 31: iteration 93500/ 173500 | consumed samples: 23936000 | consumed tokens: 49020928000 | elapsed time per iteration (s): 0.77 | learning rate: 1.003E-04 | global batch size: 256 | lm loss: 2.011467E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.155 | TFLOPs: 20.09 | 31: iteration 93510/ 173500 | consumed samples: 23938560 | consumed tokens: 49026170880 | elapsed time per iteration (s): 0.77 | learning rate: 1.003E-04 | global batch size: 256 | lm loss: 1.956113E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.269 | TFLOPs: 20.22 | 31: iteration 93520/ 173500 | consumed samples: 23941120 | consumed tokens: 49031413760 | elapsed time per iteration (s): 0.82 | learning rate: 1.003E-04 | global batch size: 256 | lm loss: 1.970439E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.045 | TFLOPs: 18.82 | 31: iteration 93530/ 173500 | consumed samples: 23943680 | consumed tokens: 49036656640 | elapsed time per iteration (s): 0.82 | learning rate: 1.003E-04 | global batch size: 256 | lm loss: 1.987312E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.307 | TFLOPs: 18.83 | 31: iteration 93540/ 173500 | consumed samples: 23946240 | consumed tokens: 49041899520 | elapsed time per iteration (s): 0.85 | learning rate: 1.003E-04 | global batch size: 256 | lm loss: 1.979047E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.156 | TFLOPs: 18.28 | 31: iteration 93550/ 173500 | consumed samples: 23948800 | consumed tokens: 49047142400 | elapsed time 
per iteration (s): 0.82 | learning rate: 1.003E-04 | global batch size: 256 | lm loss: 2.016941E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.173 | TFLOPs: 18.89 | 31: iteration 93560/ 173500 | consumed samples: 23951360 | consumed tokens: 49052385280 | elapsed time per iteration (s): 0.76 | learning rate: 1.002E-04 | global batch size: 256 | lm loss: 2.026892E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.853 | TFLOPs: 20.38 | 31: iteration 93570/ 173500 | consumed samples: 23953920 | consumed tokens: 49057628160 | elapsed time per iteration (s): 0.84 | learning rate: 1.002E-04 | global batch size: 256 | lm loss: 1.988844E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.768 | TFLOPs: 18.44 | 31: iteration 93580/ 173500 | consumed samples: 23956480 | consumed tokens: 49062871040 | elapsed time per iteration (s): 0.79 | learning rate: 1.002E-04 | global batch size: 256 | lm loss: 1.998293E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.625 | TFLOPs: 19.64 | 31: iteration 93590/ 173500 | consumed samples: 23959040 | consumed tokens: 49068113920 | elapsed time per iteration (s): 0.77 | learning rate: 1.002E-04 | global batch size: 256 | lm loss: 2.018696E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.714 | TFLOPs: 20.07 | 31: iteration 93600/ 173500 | consumed samples: 23961600 | consumed tokens: 49073356800 | elapsed time per iteration (s): 0.73 | learning rate: 1.002E-04 | global batch size: 256 | lm loss: 1.986110E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.587 | TFLOPs: 
21.21 | 31: iteration 93610/ 173500 | consumed samples: 23964160 | consumed tokens: 49078599680 | elapsed time per iteration (s): 0.80 | learning rate: 1.002E-04 | global batch size: 256 | lm loss: 1.961277E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.402 | TFLOPs: 19.38 | 31: iteration 93620/ 173500 | consumed samples: 23966720 | consumed tokens: 49083842560 | elapsed time per iteration (s): 0.74 | learning rate: 1.001E-04 | global batch size: 256 | lm loss: 1.964993E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.307 | TFLOPs: 20.89 | 31: iteration 93630/ 173500 | consumed samples: 23969280 | consumed tokens: 49089085440 | elapsed time per iteration (s): 0.86 | learning rate: 1.001E-04 | global batch size: 256 | lm loss: 1.978338E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.023 | TFLOPs: 17.97 | 31: iteration 93640/ 173500 | consumed samples: 23971840 | consumed tokens: 49094328320 | elapsed time per iteration (s): 0.79 | learning rate: 1.001E-04 | global batch size: 256 | lm loss: 1.974492E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.410 | TFLOPs: 19.63 | 31: iteration 93650/ 173500 | consumed samples: 23974400 | consumed tokens: 49099571200 | elapsed time per iteration (s): 0.93 | learning rate: 1.001E-04 | global batch size: 256 | lm loss: 1.976463E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 275.428 | TFLOPs: 16.66 | 31: iteration 93660/ 173500 | consumed samples: 23976960 | consumed tokens: 49104814080 | elapsed time per iteration (s): 0.79 | learning rate: 1.001E-04 | global batch size: 256 | lm loss: 1.957145E+00 | grad norm: 0.148 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.942 | TFLOPs: 19.66 | 31: iteration 93670/ 173500 | consumed samples: 23979520 | consumed tokens: 49110056960 | elapsed time per iteration (s): 0.81 | learning rate: 1.001E-04 | global batch size: 256 | lm loss: 1.953434E+00 | grad norm: 0.222 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.149 | TFLOPs: 19.19 | 31: iteration 93680/ 173500 | consumed samples: 23982080 | consumed tokens: 49115299840 | elapsed time per iteration (s): 0.82 | learning rate: 1.000E-04 | global batch size: 256 | lm loss: 1.988272E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.415 | TFLOPs: 18.96 | 31: iteration 93690/ 173500 | consumed samples: 23984640 | consumed tokens: 49120542720 | elapsed time per iteration (s): 0.81 | learning rate: 1.000E-04 | global batch size: 256 | lm loss: 1.986719E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.227 | TFLOPs: 19.07 | 31: iteration 93700/ 173500 | consumed samples: 23987200 | consumed tokens: 49125785600 | elapsed time per iteration (s): 0.85 | learning rate: 1.000E-04 | global batch size: 256 | lm loss: 1.978832E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.401 | TFLOPs: 18.29 | 31: iteration 93710/ 173500 | consumed samples: 23989760 | consumed tokens: 49131028480 | elapsed time per iteration (s): 0.76 | learning rate: 9.999E-05 | global batch size: 256 | lm loss: 1.993832E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.703 | TFLOPs: 20.43 | 31: iteration 93720/ 173500 | consumed samples: 23992320 | consumed tokens: 49136271360 | elapsed time per iteration (s): 0.76 | 
learning rate: 9.998E-05 | global batch size: 256 | lm loss: 1.994066E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.096 | TFLOPs: 20.45 | 31: iteration 93730/ 173500 | consumed samples: 23994880 | consumed tokens: 49141514240 | elapsed time per iteration (s): 0.77 | learning rate: 9.996E-05 | global batch size: 256 | lm loss: 1.988038E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.224 | TFLOPs: 20.16 | 31: iteration 93740/ 173500 | consumed samples: 23997440 | consumed tokens: 49146757120 | elapsed time per iteration (s): 0.75 | learning rate: 9.994E-05 | global batch size: 256 | lm loss: 1.992088E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.954 | TFLOPs: 20.69 | 31: iteration 93750/ 173500 | consumed samples: 24000000 | consumed tokens: 49152000000 | elapsed time per iteration (s): 0.79 | learning rate: 9.993E-05 | global batch size: 256 | lm loss: 1.976579E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.410 | TFLOPs: 19.51 | 31: iteration 93760/ 173500 | consumed samples: 24002560 | consumed tokens: 49157242880 | elapsed time per iteration (s): 0.80 | learning rate: 9.991E-05 | global batch size: 256 | lm loss: 1.976083E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.610 | TFLOPs: 19.34 | 31: iteration 93770/ 173500 | consumed samples: 24005120 | consumed tokens: 49162485760 | elapsed time per iteration (s): 0.74 | learning rate: 9.989E-05 | global batch size: 256 | lm loss: 1.952512E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.302 | TFLOPs: 20.83 | 31: iteration 93780/ 
173500 | consumed samples: 24007680 | consumed tokens: 49167728640 | elapsed time per iteration (s): 0.74 | learning rate: 9.988E-05 | global batch size: 256 | lm loss: 2.001097E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.062 | TFLOPs: 20.81 | 31: iteration 93790/ 173500 | consumed samples: 24010240 | consumed tokens: 49172971520 | elapsed time per iteration (s): 0.74 | learning rate: 9.986E-05 | global batch size: 256 | lm loss: 1.988558E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.266 | TFLOPs: 21.07 | 31: iteration 93800/ 173500 | consumed samples: 24012800 | consumed tokens: 49178214400 | elapsed time per iteration (s): 0.75 | learning rate: 9.985E-05 | global batch size: 256 | lm loss: 1.972121E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.903 | TFLOPs: 20.56 | 31: iteration 93810/ 173500 | consumed samples: 24015360 | consumed tokens: 49183457280 | elapsed time per iteration (s): 0.76 | learning rate: 9.983E-05 | global batch size: 256 | lm loss: 1.983042E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.145 | TFLOPs: 20.40 | 31: iteration 93820/ 173500 | consumed samples: 24017920 | consumed tokens: 49188700160 | elapsed time per iteration (s): 0.78 | learning rate: 9.981E-05 | global batch size: 256 | lm loss: 1.996835E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.617 | TFLOPs: 19.94 | 31: iteration 93830/ 173500 | consumed samples: 24020480 | consumed tokens: 49193943040 | elapsed time per iteration (s): 0.79 | learning rate: 9.980E-05 | global batch size: 256 | lm loss: 2.020689E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 326.045 | TFLOPs: 19.72 | 31: iteration 93840/ 173500 | consumed samples: 24023040 | consumed tokens: 49199185920 | elapsed time per iteration (s): 0.77 | learning rate: 9.978E-05 | global batch size: 256 | lm loss: 1.967258E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.617 | TFLOPs: 20.00 | 31: iteration 93850/ 173500 | consumed samples: 24025600 | consumed tokens: 49204428800 | elapsed time per iteration (s): 0.78 | learning rate: 9.976E-05 | global batch size: 256 | lm loss: 1.974030E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.739 | TFLOPs: 19.95 | 31: iteration 93860/ 173500 | consumed samples: 24028160 | consumed tokens: 49209671680 | elapsed time per iteration (s): 0.78 | learning rate: 9.975E-05 | global batch size: 256 | lm loss: 1.994312E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.764 | TFLOPs: 19.95 | 31: iteration 93870/ 173500 | consumed samples: 24030720 | consumed tokens: 49214914560 | elapsed time per iteration (s): 0.75 | learning rate: 9.973E-05 | global batch size: 256 | lm loss: 1.973396E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.543 | TFLOPs: 20.60 | 31: iteration 93880/ 173500 | consumed samples: 24033280 | consumed tokens: 49220157440 | elapsed time per iteration (s): 0.77 | learning rate: 9.971E-05 | global batch size: 256 | lm loss: 1.992494E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.618 | TFLOPs: 20.18 | 31: iteration 93890/ 173500 | consumed samples: 24035840 | consumed tokens: 49225400320 | elapsed time per iteration (s): 0.76 | learning rate: 
9.970E-05 | global batch size: 256 | lm loss: 1.960972E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.934 | TFLOPs: 20.50 | 31: iteration 93900/ 173500 | consumed samples: 24038400 | consumed tokens: 49230643200 | elapsed time per iteration (s): 0.81 | learning rate: 9.968E-05 | global batch size: 256 | lm loss: 1.975603E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.069 | TFLOPs: 19.18 | 31: iteration 93910/ 173500 | consumed samples: 24040960 | consumed tokens: 49235886080 | elapsed time per iteration (s): 0.77 | learning rate: 9.967E-05 | global batch size: 256 | lm loss: 1.983135E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.621 | TFLOPs: 20.06 | 31: iteration 93920/ 173500 | consumed samples: 24043520 | consumed tokens: 49241128960 | elapsed time per iteration (s): 0.87 | learning rate: 9.965E-05 | global batch size: 256 | lm loss: 1.986667E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.583 | TFLOPs: 17.88 | 31: iteration 93930/ 173500 | consumed samples: 24046080 | consumed tokens: 49246371840 | elapsed time per iteration (s): 0.86 | learning rate: 9.963E-05 | global batch size: 256 | lm loss: 1.980772E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.924 | TFLOPs: 17.96 | 31: iteration 93940/ 173500 | consumed samples: 24048640 | consumed tokens: 49251614720 | elapsed time per iteration (s): 0.81 | learning rate: 9.962E-05 | global batch size: 256 | lm loss: 1.979533E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.318 | TFLOPs: 19.08 | 31: iteration 93950/ 173500 | 
consumed samples: 24051200 | consumed tokens: 49256857600 | elapsed time per iteration (s): 0.80 | learning rate: 9.960E-05 | global batch size: 256 | lm loss: 1.958370E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.837 | TFLOPs: 19.29 | 31: iteration 93960/ 173500 | consumed samples: 24053760 | consumed tokens: 49262100480 | elapsed time per iteration (s): 0.83 | learning rate: 9.958E-05 | global batch size: 256 | lm loss: 1.970521E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.182 | TFLOPs: 18.70 | 31: iteration 93970/ 173500 | consumed samples: 24056320 | consumed tokens: 49267343360 | elapsed time per iteration (s): 0.86 | learning rate: 9.957E-05 | global batch size: 256 | lm loss: 2.005820E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.154 | TFLOPs: 18.10 | 31: iteration 93980/ 173500 | consumed samples: 24058880 | consumed tokens: 49272586240 | elapsed time per iteration (s): 0.87 | learning rate: 9.955E-05 | global batch size: 256 | lm loss: 1.981106E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.482 | TFLOPs: 17.88 | 31: iteration 93990/ 173500 | consumed samples: 24061440 | consumed tokens: 49277829120 | elapsed time per iteration (s): 0.88 | learning rate: 9.953E-05 | global batch size: 256 | lm loss: 1.981149E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.194 | TFLOPs: 17.62 | 0: [2022-11-26 15:20:58,813] [INFO] [logging.py:68:log_dist] [Rank 0] step=94000, skipped=0, lr=[9.951807001525316e-05, 9.951807001525316e-05, 9.951807001525316e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 94000/ 173500 | consumed samples: 24064000 | 
consumed tokens: 49283072000 | elapsed time per iteration (s): 0.89 | learning rate: 9.952E-05 | global batch size: 256 | lm loss: 1.990535E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.353 | TFLOPs: 17.32 | 0: steps: 94000 loss: 2.0090 iter time (s): 0.790 samples/sec: 323.920 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 94000 | lm loss value: 1.957393E+00 | lm loss PPL: 7.080844E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 94000 to checkpoints_1b1long 0: [2022-11-26 15:20:59,079] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step94000 is begin to save! 0: [2022-11-26 15:20:59,090] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_01-model_00-model_states.pt... 0: [2022-11-26 15:20:59,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_01-model_00-model_states.pt. 0: [2022-11-26 15:20:59,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_03-model_00-model_states.pt... 0: [2022-11-26 15:20:59,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_03-model_00-model_states.pt. 0: [2022-11-26 15:20:59,405] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_04-model_00-model_states.pt... 0: [2022-11-26 15:20:59,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_04-model_00-model_states.pt. 0: [2022-11-26 15:20:59,481] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_05-model_00-model_states.pt... 
0: [2022-11-26 15:20:59,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_05-model_00-model_states.pt. 0: [2022-11-26 15:20:59,557] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_06-model_00-model_states.pt... 0: [2022-11-26 15:20:59,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_06-model_00-model_states.pt. 0: [2022-11-26 15:20:59,633] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_07-model_00-model_states.pt... 0: [2022-11-26 15:20:59,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_07-model_00-model_states.pt. 0: [2022-11-26 15:20:59,711] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_08-model_00-model_states.pt... 0: [2022-11-26 15:20:59,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_08-model_00-model_states.pt. 0: [2022-11-26 15:20:59,787] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_09-model_00-model_states.pt... 0: [2022-11-26 15:20:59,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_09-model_00-model_states.pt. 0: [2022-11-26 15:20:59,867] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_10-model_00-model_states.pt... 0: [2022-11-26 15:20:59,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_10-model_00-model_states.pt. 0: [2022-11-26 15:20:59,942] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_11-model_00-model_states.pt... 
0: [2022-11-26 15:21:00,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_11-model_00-model_states.pt. 0: [2022-11-26 15:21:00,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_12-model_00-model_states.pt... 0: [2022-11-26 15:21:00,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_12-model_00-model_states.pt. 0: [2022-11-26 15:21:00,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_13-model_00-model_states.pt... 0: [2022-11-26 15:21:00,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_13-model_00-model_states.pt. 0: [2022-11-26 15:21:00,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_14-model_00-model_states.pt... 0: [2022-11-26 15:21:00,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_14-model_00-model_states.pt. 0: [2022-11-26 15:21:00,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_15-model_00-model_states.pt... 0: [2022-11-26 15:21:00,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_15-model_00-model_states.pt. 0: [2022-11-26 15:21:00,340] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_16-model_00-model_states.pt... 0: [2022-11-26 15:21:00,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_16-model_00-model_states.pt. 0: [2022-11-26 15:21:00,419] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_17-model_00-model_states.pt... 
0: [2022-11-26 15:21:00,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_17-model_00-model_states.pt. 0: [2022-11-26 15:21:00,495] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_18-model_00-model_states.pt... 0: [2022-11-26 15:21:00,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_18-model_00-model_states.pt. 0: [2022-11-26 15:21:00,571] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_19-model_00-model_states.pt... 0: [2022-11-26 15:21:00,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_19-model_00-model_states.pt. 0: [2022-11-26 15:21:00,651] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_20-model_00-model_states.pt... 0: [2022-11-26 15:21:00,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_20-model_00-model_states.pt. 0: [2022-11-26 15:21:00,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_21-model_00-model_states.pt... 0: [2022-11-26 15:21:00,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_21-model_00-model_states.pt. 0: [2022-11-26 15:21:00,806] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_22-model_00-model_states.pt... 0: [2022-11-26 15:21:00,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_22-model_00-model_states.pt. 0: [2022-11-26 15:21:00,883] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_23-model_00-model_states.pt... 
0: [2022-11-26 15:21:00,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_23-model_00-model_states.pt. 0: [2022-11-26 15:21:00,957] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_24-model_00-model_states.pt... 0: [2022-11-26 15:21:01,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_24-model_00-model_states.pt. 0: [2022-11-26 15:21:01,037] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_25-model_00-model_states.pt... 0: [2022-11-26 15:21:01,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_25-model_00-model_states.pt. 0: [2022-11-26 15:21:01,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_26-model_00-model_states.pt... 0: [2022-11-26 15:21:01,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_26-model_00-model_states.pt. 0: [2022-11-26 15:21:01,189] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_27-model_00-model_states.pt... 0: [2022-11-26 15:21:01,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_27-model_00-model_states.pt. 0: [2022-11-26 15:21:01,268] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_28-model_00-model_states.pt... 0: [2022-11-26 15:21:01,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_28-model_00-model_states.pt. 0: [2022-11-26 15:21:01,343] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/layer_30-model_00-model_states.pt... 
0: [2022-11-26 15:21:01,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/layer_30-model_00-model_states.pt. 0: [2022-11-26 15:21:01,347] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step94000/mp_rank_00_model_states.pt 0: [2022-11-26 15:21:01,347] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/mp_rank_00_model_states.pt... 0: [2022-11-26 15:21:01,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/mp_rank_00_model_states.pt. 0: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
6: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
4: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 
16: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
3: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 
23: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 
29: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 
21: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 
19: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
6: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 
8: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 
13: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 
25: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 
28: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 
31: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 
17: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
4: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
16: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 
25: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 
30: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 
8: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 
25: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 
30: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 
15: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:21:01,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:21:01,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:21:01,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 15:21:01,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 11: [2022-11-26 15:21:01,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
24: [2022-11-26 15:21:01,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:21:01,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 15:21:01,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 24: [2022-11-26 15:21:01,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 15:21:01,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 19: [2022-11-26 15:21:01,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:21:01,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 15:21:01,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 15: [2022-11-26 15:21:01,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:21:01,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
15: [2022-11-26 15:21:01,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 11: [2022-11-26 15:21:01,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 15: [2022-11-26 15:21:01,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 11: [2022-11-26 15:21:01,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 9: [2022-11-26 15:21:01,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:21:01,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 15:21:01,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 6: [2022-11-26 15:21:01,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:21:01,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 15:21:01,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 30: [2022-11-26 15:21:01,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
30: [2022-11-26 15:21:01,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 15:21:01,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 5: [2022-11-26 15:21:01,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:21:01,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 15:21:01,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 19: [2022-11-26 15:21:01,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:21:01,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 15:21:01,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 0: [2022-11-26 15:21:01,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:21:01,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 1: [2022-11-26 15:21:01,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:21:01,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
1: [2022-11-26 15:21:01,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 17: [2022-11-26 15:21:01,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:21:01,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 17: [2022-11-26 15:21:01,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 15:21:01,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 0: [2022-11-26 15:21:01,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 30: [2022-11-26 15:21:01,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:21:01,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:21:01,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 30: [2022-11-26 15:21:01,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 15:21:01,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
17: [2022-11-26 15:21:01,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 0: [2022-11-26 15:21:01,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 17: [2022-11-26 15:21:01,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 1: [2022-11-26 15:21:01,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:21:01,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 15:21:01,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 26: [2022-11-26 15:21:01,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:21:01,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 15:21:01,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 5: [2022-11-26 15:21:01,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 30: [2022-11-26 15:21:01,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
5: [2022-11-26 15:21:01,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 30: [2022-11-26 15:21:01,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 5: [2022-11-26 15:21:01,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 30: [2022-11-26 15:21:01,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 4: [2022-11-26 15:21:01,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:21:01,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 15:21:01,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 20: [2022-11-26 15:21:01,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:21:01,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 15:21:01,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 12: [2022-11-26 15:21:01,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 15:21:01,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 15:21:01,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 12: [2022-11-26 15:21:01,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:21:01,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 15:21:01,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 20: [2022-11-26 15:21:01,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:21:01,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 15:21:01,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 4: [2022-11-26 15:21:01,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:21:01,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 15:21:01,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 6: [2022-11-26 15:21:01,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 15:21:01,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 15:21:01,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 17: [2022-11-26 15:21:01,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:21:01,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 9: [2022-11-26 15:21:01,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:21:01,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 17: [2022-11-26 15:21:01,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 30: [2022-11-26 15:21:01,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:21:01,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 30: [2022-11-26 15:21:01,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 28: [2022-11-26 15:21:01,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 30: [2022-11-26 15:21:01,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
26: [2022-11-26 15:21:01,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:21:01,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 15:21:01,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 14: [2022-11-26 15:21:01,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:21:01,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 15:21:01,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 15: [2022-11-26 15:21:01,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:21:01,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:21:01,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 15:21:01,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 6: [2022-11-26 15:21:01,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 15:21:01,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 15:21:01,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 1: [2022-11-26 15:21:01,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 15:21:01,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 11: [2022-11-26 15:21:01,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:21:01,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 15:21:01,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 19: [2022-11-26 15:21:01,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:21:01,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 15:21:01,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 19: [2022-11-26 15:21:01,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-26 15:21:01,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 1: [2022-11-26 15:21:01,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:21:01,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:21:01,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 1: [2022-11-26 15:21:01,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 15:21:01,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 24: [2022-11-26 15:21:01,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 15:21:01,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:21:01,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 24: [2022-11-26 15:21:01,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 15:21:01,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 9: [2022-11-26 15:21:01,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 15:21:01,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 15:21:01,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 22: [2022-11-26 15:21:01,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:21:01,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:21:01,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 22: [2022-11-26 15:21:01,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 15: [2022-11-26 15:21:01,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 22: [2022-11-26 15:21:01,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 22: [2022-11-26 15:21:01,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:21:01,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 15:21:01,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 4: [2022-11-26 15:21:01,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 15:21:01,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 24: [2022-11-26 15:21:01,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:21:01,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 24: [2022-11-26 15:21:01,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 15:21:01,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 11: [2022-11-26 15:21:01,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:21:01,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 15:21:01,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 12: [2022-11-26 15:21:01,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:21:01,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 15:21:01,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 14: [2022-11-26 15:21:01,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-26 15:21:01,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 15:21:01,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 9: [2022-11-26 15:21:01,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:21:01,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 15:21:01,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 26: [2022-11-26 15:21:01,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:21:01,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 15:21:01,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 11: [2022-11-26 15:21:01,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:21:01,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 6: [2022-11-26 15:21:01,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:21:01,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
6: [2022-11-26 15:21:01,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 15:21:01,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 15: [2022-11-26 15:21:01,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:21:01,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 15:21:01,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:21:01,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 15: [2022-11-26 15:21:01,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 15:21:01,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 25: [2022-11-26 15:21:01,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:21:01,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
25: [2022-11-26 15:21:01,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 20: [2022-11-26 15:21:01,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 25: [2022-11-26 15:21:01,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 20: [2022-11-26 15:21:01,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 25: [2022-11-26 15:21:01,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 15:21:01,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 15:21:01,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 12: [2022-11-26 15:21:01,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:21:01,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 15:21:01,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 25: [2022-11-26 15:21:01,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
25: [2022-11-26 15:21:01,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 15:21:01,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 4: [2022-11-26 15:21:01,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:21:01,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 15:21:01,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 14: [2022-11-26 15:21:01,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:21:01,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 15:21:01,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 20: [2022-11-26 15:21:01,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:21:01,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 15:21:01,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 14: [2022-11-26 15:21:01,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 15:21:01,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 15:21:01,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 24: [2022-11-26 15:21:01,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:21:01,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 15:21:01,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 0: [2022-11-26 15:21:01,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:21:01,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 15:21:01,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 30: [2022-11-26 15:21:01,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 15:21:01,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 15:21:01,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 21: [2022-11-26 15:21:01,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-26 15:21:01,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:21:01,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:21:01,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:21:01,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:21:01,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 15:21:01,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 15:21:01,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 15:21:01,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 15:21:01,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 15:21:01,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 21: [2022-11-26 15:21:01,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
21: [2022-11-26 15:21:01,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 21: [2022-11-26 15:21:01,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 21: [2022-11-26 15:21:01,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 14: [2022-11-26 15:21:01,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:21:01,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 15:21:01,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 19: [2022-11-26 15:21:01,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:21:01,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 15:21:01,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 5: [2022-11-26 15:21:01,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:21:01,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 15:21:01,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
5: [2022-11-26 15:21:01,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:21:01,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 12: [2022-11-26 15:21:01,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:21:01,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 12: [2022-11-26 15:21:01,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 15:21:01,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 3: [2022-11-26 15:21:01,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:21:01,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:21:01,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:21:01,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:21:01,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 15:21:01,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 15:21:01,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 15:21:01,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 15:21:01,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 15:21:01,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 3: [2022-11-26 15:21:01,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 3: [2022-11-26 15:21:01,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 15:21:01,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 3: [2022-11-26 15:21:01,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 3: [2022-11-26 15:21:01,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 0: [2022-11-26 15:21:01,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:21:01,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-26 15:21:01,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 15:21:01,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 0: [2022-11-26 15:21:01,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 8: [2022-11-26 15:21:01,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:21:01,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:21:01,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:21:01,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:21:01,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 8: [2022-11-26 15:21:01,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 15:21:01,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 15:21:01,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 15:21:01,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 15:21:01,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 26: [2022-11-26 15:21:01,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:21:01,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 8: [2022-11-26 15:21:01,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 15:21:01,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 26: [2022-11-26 15:21:01,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 8: [2022-11-26 15:21:01,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 8: [2022-11-26 15:21:01,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 8: [2022-11-26 15:21:01,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
26: [2022-11-26 15:21:01,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 26: [2022-11-26 15:21:01,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:21:01,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 15:21:01,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 25: [2022-11-26 15:21:01,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 15:21:01,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 15:21:01,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 6: [2022-11-26 15:21:01,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:21:01,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 15:21:01,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 1: [2022-11-26 15:21:01,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 15:21:01,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 15:21:01,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 22: [2022-11-26 15:21:01,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:21:01,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 17: [2022-11-26 15:21:01,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:21:01,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 22: [2022-11-26 15:21:01,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:21:01,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 15:21:01,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 22: [2022-11-26 15:21:01,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
22: [2022-11-26 15:21:01,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 17: [2022-11-26 15:21:01,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 22: [2022-11-26 15:21:01,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 17: [2022-11-26 15:21:01,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 17: [2022-11-26 15:21:01,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:21:01,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 15:21:01,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 20: [2022-11-26 15:21:01,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:21:01,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 15:21:01,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 10: [2022-11-26 15:21:01,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:21:01,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-26 15:21:01,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:21:01,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:21:01,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:21:01,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 15:21:01,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 15:21:01,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 15:21:01,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 15:21:01,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 5: [2022-11-26 15:21:01,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:21:01,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 10: [2022-11-26 15:21:01,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
10: [2022-11-26 15:21:01,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 10: [2022-11-26 15:21:01,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 5: [2022-11-26 15:21:01,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 10: [2022-11-26 15:21:01,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 5: [2022-11-26 15:21:01,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 4: [2022-11-26 15:21:01,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:21:01,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 15:21:01,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 19: [2022-11-26 15:21:01,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:21:01,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 15:21:01,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 28: [2022-11-26 15:21:01,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 15:21:01,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
28: [2022-11-26 15:21:01,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 15:21:01,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 15:21:01,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 28: [2022-11-26 15:21:01,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 15:21:01,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 15:21:01,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 28: [2022-11-26 15:21:01,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 15:21:01,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 15:21:01,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 28: [2022-11-26 15:21:01,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
28: [2022-11-26 15:21:01,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 25: [2022-11-26 15:21:01,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 28: [2022-11-26 15:21:01,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 25: [2022-11-26 15:21:01,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 15:21:01,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 11: [2022-11-26 15:21:01,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:21:01,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 15:21:01,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 16: [2022-11-26 15:21:01,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:21:01,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:21:01,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:21:01,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-26 15:21:01,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:21:01,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 15:21:01,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 15:21:01,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 15:21:01,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 15:21:01,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 15:21:01,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 16: [2022-11-26 15:21:01,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 16: [2022-11-26 15:21:01,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 9: [2022-11-26 15:21:01,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:21:01,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 16: [2022-11-26 15:21:01,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
9: [2022-11-26 15:21:01,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 15:21:01,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 17: [2022-11-26 15:21:01,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:21:01,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 15:21:01,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 31: [2022-11-26 15:21:01,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:21:01,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:21:01,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:21:01,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:21:01,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 15:21:01,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 
31: [2022-11-26 15:21:01,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 15:21:01,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 15:21:01,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 15:21:01,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 31: [2022-11-26 15:21:01,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 15:21:01,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 31: [2022-11-26 15:21:01,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 31: [2022-11-26 15:21:01,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 31: [2022-11-26 15:21:01,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 27: [2022-11-26 15:21:01,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:21:01,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:21:01,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-26 15:21:01,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:21:01,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 15:21:01,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 15:21:01,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 15:21:01,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 27: [2022-11-26 15:21:01,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 27: [2022-11-26 15:21:01,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 27: [2022-11-26 15:21:01,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 15:21:01,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:21:01,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 27: [2022-11-26 15:21:01,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 15:21:01,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
18: [2022-11-26 15:21:01,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:21:01,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:21:01,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:21:01,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:21:01,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:21:01,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 15:21:01,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 15:21:01,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 15:21:01,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 15:21:01,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 15:21:01,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
18: [2022-11-26 15:21:01,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 18: [2022-11-26 15:21:01,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 18: [2022-11-26 15:21:01,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 18: [2022-11-26 15:21:01,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 18: [2022-11-26 15:21:01,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:21:01,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 15:21:01,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 6: [2022-11-26 15:21:01,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:21:01,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 15:21:01,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 0: [2022-11-26 15:21:01,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:21:01,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-26 15:21:01,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 15:21:01,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 26: [2022-11-26 15:21:01,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:21:01,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 15:21:01,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 0: [2022-11-26 15:21:01,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 15:21:01,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 27: [2022-11-26 15:21:01,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:21:01,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 15:21:01,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 8: [2022-11-26 15:21:01,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 15:21:01,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 15:21:01,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 21: [2022-11-26 15:21:01,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:21:01,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 15:21:01,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 10: [2022-11-26 15:21:01,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:21:01,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 15:21:01,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 15: [2022-11-26 15:21:01,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:21:01,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 15:21:01,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 9: [2022-11-26 15:21:01,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 15:21:01,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 15:21:01,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 16: [2022-11-26 15:21:01,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:21:01,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 15:21:01,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 12: [2022-11-26 15:21:01,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:21:01,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 15:21:01,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 0: [2022-11-26 15:21:01,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:21:01,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 20: [2022-11-26 15:21:01,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:21:01,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
20: [2022-11-26 15:21:01,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 15:21:01,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 1: [2022-11-26 15:21:01,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:21:01,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 15:21:01,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 4: [2022-11-26 15:21:01,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:21:01,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 15:21:01,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 24: [2022-11-26 15:21:01,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:21:01,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 30: [2022-11-26 15:21:01,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:21:01,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
30: [2022-11-26 15:21:01,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 15:21:01,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 22: [2022-11-26 15:21:01,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:21:01,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 15:21:01,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 3: [2022-11-26 15:21:01,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:21:01,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 15:21:01,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 31: [2022-11-26 15:21:01,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:21:01,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 15:21:01,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 25: [2022-11-26 15:21:01,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
25: [2022-11-26 15:21:01,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 15:21:01,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 28: [2022-11-26 15:21:01,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 15:21:01,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 15:21:01,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 17: [2022-11-26 15:21:01,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:21:01,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 15:21:01,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 19: [2022-11-26 15:21:01,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:21:01,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 15:21:01,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 18: [2022-11-26 15:21:01,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
18: [2022-11-26 15:21:01,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 15:21:01,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 5: [2022-11-26 15:21:01,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:21:01,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 15:21:01,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 26: [2022-11-26 15:21:01,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:21:01,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 15:21:01,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 27: [2022-11-26 15:21:01,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:21:01,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 15:21:01,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 6: [2022-11-26 15:21:01,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
8: [2022-11-26 15:21:01,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:21:01,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 15:21:01,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 6: [2022-11-26 15:21:01,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 15:21:01,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 15: [2022-11-26 15:21:01,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:21:01,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 15:21:01,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 11: [2022-11-26 15:21:01,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:21:01,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 15:21:01,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 21: [2022-11-26 15:21:01,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-26 15:21:01,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 15:21:01,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 30: [2022-11-26 15:21:01,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 15:21:01,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 15:21:01,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 10: [2022-11-26 15:21:01,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:21:01,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 15:21:01,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 16: [2022-11-26 15:21:01,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:21:01,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 15:21:01,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 14: [2022-11-26 15:21:01,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-26 15:21:01,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 15:21:01,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 4: [2022-11-26 15:21:01,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:21:01,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 15:21:01,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 1: [2022-11-26 15:21:01,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:21:01,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 15:21:01,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 25: [2022-11-26 15:21:01,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 28: [2022-11-26 15:21:01,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
28: [2022-11-26 15:21:01,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 25: [2022-11-26 15:21:01,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 28: [2022-11-26 15:21:01,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 25: [2022-11-26 15:21:01,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 0: [2022-11-26 15:21:01,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:21:01,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:21:01,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 5: [2022-11-26 15:21:01,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:21:01,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 0: [2022-11-26 15:21:01,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 5: [2022-11-26 15:21:01,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 20: [2022-11-26 15:21:01,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
5: [2022-11-26 15:21:01,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 12: [2022-11-26 15:21:01,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:21:01,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 15:21:01,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 18: [2022-11-26 15:21:01,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:21:01,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 15:21:01,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 9: [2022-11-26 15:21:01,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:21:01,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:21:01,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 15:21:01,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
6: [2022-11-26 15:21:01,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 15:21:01,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 31: [2022-11-26 15:21:01,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:21:01,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 15:21:01,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 3: [2022-11-26 15:21:01,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:21:01,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:21:01,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 15:21:01,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 4: [2022-11-26 15:21:01,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 22: [2022-11-26 15:21:01,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:21:01,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
22: [2022-11-26 15:21:01,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 15:21:01,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 20: [2022-11-26 15:21:01,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:21:01,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 15:21:01,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 14: [2022-11-26 15:21:01,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:21:01,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 15:21:01,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 24: [2022-11-26 15:21:01,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:21:01,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 27: [2022-11-26 15:21:01,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:21:01,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
9: [2022-11-26 15:21:01,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:21:01,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 15:21:01,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 9: [2022-11-26 15:21:01,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 15:21:01,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 24: [2022-11-26 15:21:01,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:21:01,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 15:21:01,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 11: [2022-11-26 15:21:01,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:21:01,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 15:21:01,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 21: [2022-11-26 15:21:01,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-26 15:21:01,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 15:21:01,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 19: [2022-11-26 15:21:01,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:21:01,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 15:21:01,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 16: [2022-11-26 15:21:01,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 28: [2022-11-26 15:21:01,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:21:01,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 15:21:01,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 28: [2022-11-26 15:21:01,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 15:21:01,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 17: [2022-11-26 15:21:01,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
25: [2022-11-26 15:21:01,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:21:01,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 15:21:01,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 25: [2022-11-26 15:21:01,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 15:21:01,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 12: [2022-11-26 15:21:01,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:21:01,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 15:21:01,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 26: [2022-11-26 15:21:01,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:21:01,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:21:01,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 15:21:01,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
26: [2022-11-26 15:21:01,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 15:21:01,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 31: [2022-11-26 15:21:01,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:21:01,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 15:21:01,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 15: [2022-11-26 15:21:01,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:21:01,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 15:21:01,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 1: [2022-11-26 15:21:01,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:21:01,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 15:21:01,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 22: [2022-11-26 15:21:01,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
22: [2022-11-26 15:21:01,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 15:21:01,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 8: [2022-11-26 15:21:01,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:21:01,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 15:21:01,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 30: [2022-11-26 15:21:01,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 15:21:01,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 15:21:01,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 10: [2022-11-26 15:21:01,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:21:01,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 15:21:01,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 13: [2022-11-26 15:21:01,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-26 15:21:01,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:21:01,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 15:21:01,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 15:21:01,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 13: [2022-11-26 15:21:01,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 13: [2022-11-26 15:21:01,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:21:01,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:21:01,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 15:21:01,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 13: [2022-11-26 15:21:01,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 15:21:01,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 13: [2022-11-26 15:21:01,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 15:21:01,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 15:21:01,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 13: [2022-11-26 15:21:01,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:21:01,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 15:21:01,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 13: [2022-11-26 15:21:01,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:21:01,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 15:21:01,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 13: [2022-11-26 15:21:01,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:21:01,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 15:21:01,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 23: [2022-11-26 15:21:01,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
23: [2022-11-26 15:21:01,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 15:21:01,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 15:21:01,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 15:21:01,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 15:21:01,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 15:21:01,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 23: [2022-11-26 15:21:01,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 15:21:01,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 15:21:01,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 23: [2022-11-26 15:21:01,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 23: [2022-11-26 15:21:01,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 23: [2022-11-26 15:21:01,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-26 15:21:01,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 15:21:01,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 23: [2022-11-26 15:21:01,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 15:21:01,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 15:21:01,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 23: [2022-11-26 15:21:01,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 15:21:01,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 15:21:01,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 23: [2022-11-26 15:21:01,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 15:21:01,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 15:21:01,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 7: [2022-11-26 15:21:01,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 15:21:01,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:21:01,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:21:01,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:21:01,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 15:21:01,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 15:21:01,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 7: [2022-11-26 15:21:01,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 15:21:01,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 7: [2022-11-26 15:21:01,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 15:21:01,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 7: [2022-11-26 15:21:01,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 7: [2022-11-26 15:21:01,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 15:21:01,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:21:01,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:21:01,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 15:21:01,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 15:21:01,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 15:21:01,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 7: [2022-11-26 15:21:01,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 7: [2022-11-26 15:21:01,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 7: [2022-11-26 15:21:01,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:21:01,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 15:21:01,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 2: [2022-11-26 15:21:01,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 15:21:01,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 15:21:01,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 2: [2022-11-26 15:21:01,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:21:01,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 15:21:01,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 2: [2022-11-26 15:21:01,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:21:01,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 15:21:01,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 2: [2022-11-26 15:21:01,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:21:01,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:21:01,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 15:21:01,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 15:21:01,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 15:21:01,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 15:21:01,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 2: [2022-11-26 15:21:01,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 2: [2022-11-26 15:21:01,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 2: [2022-11-26 15:21:01,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:21:01,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 15:21:01,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 29: [2022-11-26 15:21:01,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:21:01,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:21:01,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 
29: [2022-11-26 15:21:01,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:21:01,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:21:01,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:21:01,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 15:21:01,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:21:01,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 15:21:01,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 15:21:01,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 15:21:01,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 15:21:01,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
29: [2022-11-26 15:21:01,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 15:21:01,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 29: [2022-11-26 15:21:01,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 29: [2022-11-26 15:21:01,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 15:21:01,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 29: [2022-11-26 15:21:01,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 29: [2022-11-26 15:21:01,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 29: [2022-11-26 15:21:01,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 29: [2022-11-26 15:21:01,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:21:01,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 15:21:01,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 2: [2022-11-26 15:21:01,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 15:21:01,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step94000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 15:21:01,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 0: successfully saved checkpoint at iteration 94000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2692.85 31: iteration 94010/ 173500 | consumed samples: 24066560 | consumed tokens: 49288314880 | elapsed time per iteration (s): 1.12 | learning rate: 9.950E-05 | global batch size: 256 | lm loss: 1.988731E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.333 | TFLOPs: 13.87 | 31: iteration 94020/ 173500 | consumed samples: 24069120 | consumed tokens: 49293557760 | elapsed time per iteration (s): 0.83 | learning rate: 9.949E-05 | global batch size: 256 | lm loss: 1.976184E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.376 | TFLOPs: 18.60 | 31: iteration 94030/ 173500 | consumed samples: 24071680 | consumed tokens: 49298800640 | elapsed time per iteration (s): 0.85 | learning rate: 9.947E-05 | global batch size: 256 | lm loss: 2.011624E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.068 | TFLOPs: 18.27 | 31: iteration 94040/ 173500 | consumed samples: 24074240 | consumed tokens: 49304043520 | elapsed time per iteration (s): 0.80 | learning rate: 9.945E-05 | global batch size: 256 | lm loss: 2.001447E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.263 | TFLOPs: 19.31 | 31: iteration 94050/ 173500 | consumed samples: 24076800 | consumed tokens: 49309286400 | elapsed time per iteration (s): 0.93 | learning rate: 9.944E-05 | global 
batch size: 256 | lm loss: 1.960976E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 274.731 | TFLOPs: 16.62 | 31: iteration 94060/ 173500 | consumed samples: 24079360 | consumed tokens: 49314529280 | elapsed time per iteration (s): 0.85 | learning rate: 9.942E-05 | global batch size: 256 | lm loss: 1.955845E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.930 | TFLOPs: 18.27 | 31: iteration 94070/ 173500 | consumed samples: 24081920 | consumed tokens: 49319772160 | elapsed time per iteration (s): 0.81 | learning rate: 9.940E-05 | global batch size: 256 | lm loss: 1.994621E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.885 | TFLOPs: 19.17 | 31: iteration 94080/ 173500 | consumed samples: 24084480 | consumed tokens: 49325015040 | elapsed time per iteration (s): 0.83 | learning rate: 9.939E-05 | global batch size: 256 | lm loss: 1.967040E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.276 | TFLOPs: 18.65 | 31: iteration 94090/ 173500 | consumed samples: 24087040 | consumed tokens: 49330257920 | elapsed time per iteration (s): 0.77 | learning rate: 9.937E-05 | global batch size: 256 | lm loss: 1.996933E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.238 | TFLOPs: 20.10 | 31: iteration 94100/ 173500 | consumed samples: 24089600 | consumed tokens: 49335500800 | elapsed time per iteration (s): 0.79 | learning rate: 9.935E-05 | global batch size: 256 | lm loss: 1.995128E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.477 | TFLOPs: 19.57 | 31: iteration 94110/ 173500 | consumed samples: 24092160 
| consumed tokens: 49340743680 | elapsed time per iteration (s): 0.75 | learning rate: 9.934E-05 | global batch size: 256 | lm loss: 1.965587E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.954 | TFLOPs: 20.69 | 31: iteration 94120/ 173500 | consumed samples: 24094720 | consumed tokens: 49345986560 | elapsed time per iteration (s): 0.79 | learning rate: 9.932E-05 | global batch size: 256 | lm loss: 1.966606E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.989 | TFLOPs: 19.72 | 31: iteration 94130/ 173500 | consumed samples: 24097280 | consumed tokens: 49351229440 | elapsed time per iteration (s): 0.76 | learning rate: 9.931E-05 | global batch size: 256 | lm loss: 1.981282E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.025 | TFLOPs: 20.45 | 31: iteration 94140/ 173500 | consumed samples: 24099840 | consumed tokens: 49356472320 | elapsed time per iteration (s): 0.77 | learning rate: 9.929E-05 | global batch size: 256 | lm loss: 1.981750E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.483 | TFLOPs: 20.11 | 31: iteration 94150/ 173500 | consumed samples: 24102400 | consumed tokens: 49361715200 | elapsed time per iteration (s): 0.73 | learning rate: 9.927E-05 | global batch size: 256 | lm loss: 1.978878E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.476 | TFLOPs: 21.14 | 31: iteration 94160/ 173500 | consumed samples: 24104960 | consumed tokens: 49366958080 | elapsed time per iteration (s): 0.77 | learning rate: 9.926E-05 | global batch size: 256 | lm loss: 1.971653E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 
0 | samples per second: 330.506 | TFLOPs: 19.99 | 31: iteration 94170/ 173500 | consumed samples: 24107520 | consumed tokens: 49372200960 | elapsed time per iteration (s): 0.81 | learning rate: 9.924E-05 | global batch size: 256 | lm loss: 1.994490E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.160 | TFLOPs: 19.19 | 31: iteration 94180/ 173500 | consumed samples: 24110080 | consumed tokens: 49377443840 | elapsed time per iteration (s): 0.80 | learning rate: 9.922E-05 | global batch size: 256 | lm loss: 1.984459E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.546 | TFLOPs: 19.33 | 31: iteration 94190/ 173500 | consumed samples: 24112640 | consumed tokens: 49382686720 | elapsed time per iteration (s): 0.79 | learning rate: 9.921E-05 | global batch size: 256 | lm loss: 1.986568E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.881 | TFLOPs: 19.65 | 31: iteration 94200/ 173500 | consumed samples: 24115200 | consumed tokens: 49387929600 | elapsed time per iteration (s): 0.82 | learning rate: 9.919E-05 | global batch size: 256 | lm loss: 1.976674E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.963 | TFLOPs: 18.93 | 31: iteration 94210/ 173500 | consumed samples: 24117760 | consumed tokens: 49393172480 | elapsed time per iteration (s): 0.82 | learning rate: 9.917E-05 | global batch size: 256 | lm loss: 1.987046E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.599 | TFLOPs: 18.85 | 31: iteration 94220/ 173500 | consumed samples: 24120320 | consumed tokens: 49398415360 | elapsed time per iteration (s): 0.81 | learning rate: 9.916E-05 | global batch size: 256 | lm loss: 
1.979896E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.478 | TFLOPs: 19.09 | 31: iteration 94230/ 173500 | consumed samples: 24122880 | consumed tokens: 49403658240 | elapsed time per iteration (s): 0.71 | learning rate: 9.914E-05 | global batch size: 256 | lm loss: 1.965598E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 359.672 | TFLOPs: 21.76 | 31: iteration 94240/ 173500 | consumed samples: 24125440 | consumed tokens: 49408901120 | elapsed time per iteration (s): 0.84 | learning rate: 9.913E-05 | global batch size: 256 | lm loss: 1.991168E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.155 | TFLOPs: 18.46 | 31: iteration 94250/ 173500 | consumed samples: 24128000 | consumed tokens: 49414144000 | elapsed time per iteration (s): 0.77 | learning rate: 9.911E-05 | global batch size: 256 | lm loss: 1.969081E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.478 | TFLOPs: 20.24 | 31: iteration 94260/ 173500 | consumed samples: 24130560 | consumed tokens: 49419386880 | elapsed time per iteration (s): 0.81 | learning rate: 9.909E-05 | global batch size: 256 | lm loss: 2.007704E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.102 | TFLOPs: 19.12 | 31: iteration 94270/ 173500 | consumed samples: 24133120 | consumed tokens: 49424629760 | elapsed time per iteration (s): 0.78 | learning rate: 9.908E-05 | global batch size: 256 | lm loss: 1.973387E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.930 | TFLOPs: 19.90 | 31: iteration 94280/ 173500 | consumed samples: 24135680 | consumed tokens: 
49429872640 | elapsed time per iteration (s): 0.77 | learning rate: 9.906E-05 | global batch size: 256 | lm loss: 1.965126E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.599 | TFLOPs: 20.06 | 31: iteration 94290/ 173500 | consumed samples: 24138240 | consumed tokens: 49435115520 | elapsed time per iteration (s): 0.76 | learning rate: 9.904E-05 | global batch size: 256 | lm loss: 1.971137E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.108 | TFLOPs: 20.45 | 31: iteration 94300/ 173500 | consumed samples: 24140800 | consumed tokens: 49440358400 | elapsed time per iteration (s): 0.73 | learning rate: 9.903E-05 | global batch size: 256 | lm loss: 1.979294E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.049 | TFLOPs: 21.12 | 31: iteration 94310/ 173500 | consumed samples: 24143360 | consumed tokens: 49445601280 | elapsed time per iteration (s): 0.80 | learning rate: 9.901E-05 | global batch size: 256 | lm loss: 1.987886E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.387 | TFLOPs: 19.32 | 31: iteration 94320/ 173500 | consumed samples: 24145920 | consumed tokens: 49450844160 | elapsed time per iteration (s): 0.82 | learning rate: 9.900E-05 | global batch size: 256 | lm loss: 1.972907E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.202 | TFLOPs: 18.95 | 31: iteration 94330/ 173500 | consumed samples: 24148480 | consumed tokens: 49456087040 | elapsed time per iteration (s): 0.84 | learning rate: 9.898E-05 | global batch size: 256 | lm loss: 1.991507E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 303.661 | TFLOPs: 18.37 | 31: iteration 94340/ 173500 | consumed samples: 24151040 | consumed tokens: 49461329920 | elapsed time per iteration (s): 0.81 | learning rate: 9.896E-05 | global batch size: 256 | lm loss: 1.982593E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.134 | TFLOPs: 19.19 | 31: iteration 94350/ 173500 | consumed samples: 24153600 | consumed tokens: 49466572800 | elapsed time per iteration (s): 0.80 | learning rate: 9.895E-05 | global batch size: 256 | lm loss: 1.958949E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.109 | TFLOPs: 19.43 | 31: iteration 94360/ 173500 | consumed samples: 24156160 | consumed tokens: 49471815680 | elapsed time per iteration (s): 0.85 | learning rate: 9.893E-05 | global batch size: 256 | lm loss: 1.969498E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.089 | TFLOPs: 18.15 | 31: iteration 94370/ 173500 | consumed samples: 24158720 | consumed tokens: 49477058560 | elapsed time per iteration (s): 0.80 | learning rate: 9.891E-05 | global batch size: 256 | lm loss: 1.993001E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.794 | TFLOPs: 19.47 | 31: iteration 94380/ 173500 | consumed samples: 24161280 | consumed tokens: 49482301440 | elapsed time per iteration (s): 0.87 | learning rate: 9.890E-05 | global batch size: 256 | lm loss: 1.982712E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.769 | TFLOPs: 17.89 | 31: iteration 94390/ 173500 | consumed samples: 24163840 | consumed tokens: 49487544320 | elapsed time per iteration (s): 0.83 | learning rate: 9.888E-05 | global batch size: 256 | lm loss: 1.996050E+00 | grad 
norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.450 | TFLOPs: 18.60 | 31: iteration 94400/ 173500 | consumed samples: 24166400 | consumed tokens: 49492787200 | elapsed time per iteration (s): 0.85 | learning rate: 9.886E-05 | global batch size: 256 | lm loss: 1.974913E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.711 | TFLOPs: 18.13 | 31: iteration 94410/ 173500 | consumed samples: 24168960 | consumed tokens: 49498030080 | elapsed time per iteration (s): 0.84 | learning rate: 9.885E-05 | global batch size: 256 | lm loss: 1.972990E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.758 | TFLOPs: 18.38 | 31: iteration 94420/ 173500 | consumed samples: 24171520 | consumed tokens: 49503272960 | elapsed time per iteration (s): 0.84 | learning rate: 9.883E-05 | global batch size: 256 | lm loss: 2.003667E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.483 | TFLOPs: 18.48 | 31: iteration 94430/ 173500 | consumed samples: 24174080 | consumed tokens: 49508515840 | elapsed time per iteration (s): 0.80 | learning rate: 9.882E-05 | global batch size: 256 | lm loss: 1.986641E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.313 | TFLOPs: 19.38 | 31: iteration 94440/ 173500 | consumed samples: 24176640 | consumed tokens: 49513758720 | elapsed time per iteration (s): 0.79 | learning rate: 9.880E-05 | global batch size: 256 | lm loss: 1.980589E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.741 | TFLOPs: 19.53 | 31: iteration 94450/ 173500 | consumed samples: 24179200 | consumed tokens: 49519001600 | elapsed time 
per iteration (s): 0.76 | learning rate: 9.878E-05 | global batch size: 256 | lm loss: 1.989404E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.078 | TFLOPs: 20.33 | 31: iteration 94460/ 173500 | consumed samples: 24181760 | consumed tokens: 49524244480 | elapsed time per iteration (s): 0.80 | learning rate: 9.877E-05 | global batch size: 256 | lm loss: 2.033977E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.174 | TFLOPs: 19.43 | 31: iteration 94470/ 173500 | consumed samples: 24184320 | consumed tokens: 49529487360 | elapsed time per iteration (s): 0.79 | learning rate: 9.875E-05 | global batch size: 256 | lm loss: 1.993151E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.143 | TFLOPs: 19.61 | 31: iteration 94480/ 173500 | consumed samples: 24186880 | consumed tokens: 49534730240 | elapsed time per iteration (s): 0.84 | learning rate: 9.873E-05 | global batch size: 256 | lm loss: 1.968869E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.112 | TFLOPs: 18.40 | 31: iteration 94490/ 173500 | consumed samples: 24189440 | consumed tokens: 49539973120 | elapsed time per iteration (s): 0.81 | learning rate: 9.872E-05 | global batch size: 256 | lm loss: 1.992466E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.938 | TFLOPs: 19.17 | 31: iteration 94500/ 173500 | consumed samples: 24192000 | consumed tokens: 49545216000 | elapsed time per iteration (s): 0.86 | learning rate: 9.870E-05 | global batch size: 256 | lm loss: 1.970218E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.060 | TFLOPs: 
17.97 | 31: iteration 94510/ 173500 | consumed samples: 24194560 | consumed tokens: 49550458880 | elapsed time per iteration (s): 0.85 | learning rate: 9.868E-05 | global batch size: 256 | lm loss: 1.965827E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.397 | TFLOPs: 18.29 | 31: iteration 94520/ 173500 | consumed samples: 24197120 | consumed tokens: 49555701760 | elapsed time per iteration (s): 0.79 | learning rate: 9.867E-05 | global batch size: 256 | lm loss: 1.949671E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.136 | TFLOPs: 19.67 | 31: iteration 94530/ 173500 | consumed samples: 24199680 | consumed tokens: 49560944640 | elapsed time per iteration (s): 0.80 | learning rate: 9.865E-05 | global batch size: 256 | lm loss: 2.019723E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.661 | TFLOPs: 19.34 | 31: iteration 94540/ 173500 | consumed samples: 24202240 | consumed tokens: 49566187520 | elapsed time per iteration (s): 0.91 | learning rate: 9.864E-05 | global batch size: 256 | lm loss: 1.963830E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.699 | TFLOPs: 17.10 | 31: iteration 94550/ 173500 | consumed samples: 24204800 | consumed tokens: 49571430400 | elapsed time per iteration (s): 0.80 | learning rate: 9.862E-05 | global batch size: 256 | lm loss: 1.995686E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.501 | TFLOPs: 19.27 | 31: iteration 94560/ 173500 | consumed samples: 24207360 | consumed tokens: 49576673280 | elapsed time per iteration (s): 0.88 | learning rate: 9.860E-05 | global batch size: 256 | lm loss: 1.964154E+00 | grad norm: 0.147 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.226 | TFLOPs: 17.62 | 31: iteration 94570/ 173500 | consumed samples: 24209920 | consumed tokens: 49581916160 | elapsed time per iteration (s): 0.83 | learning rate: 9.859E-05 | global batch size: 256 | lm loss: 1.978548E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.234 | TFLOPs: 18.77 | 31: iteration 94580/ 173500 | consumed samples: 24212480 | consumed tokens: 49587159040 | elapsed time per iteration (s): 0.84 | learning rate: 9.857E-05 | global batch size: 256 | lm loss: 1.987342E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.449 | TFLOPs: 18.36 | 31: iteration 94590/ 173500 | consumed samples: 24215040 | consumed tokens: 49592401920 | elapsed time per iteration (s): 0.82 | learning rate: 9.855E-05 | global batch size: 256 | lm loss: 1.956265E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.300 | TFLOPs: 18.83 | 31: iteration 94600/ 173500 | consumed samples: 24217600 | consumed tokens: 49597644800 | elapsed time per iteration (s): 0.81 | learning rate: 9.854E-05 | global batch size: 256 | lm loss: 1.996509E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.257 | TFLOPs: 19.01 | 31: iteration 94610/ 173500 | consumed samples: 24220160 | consumed tokens: 49602887680 | elapsed time per iteration (s): 0.88 | learning rate: 9.852E-05 | global batch size: 256 | lm loss: 1.972516E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.760 | TFLOPs: 17.65 | 31: iteration 94620/ 173500 | consumed samples: 24222720 | consumed tokens: 49608130560 | elapsed time per iteration (s): 0.83 | 
learning rate: 9.851E-05 | global batch size: 256 | lm loss: 2.009669E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.540 | TFLOPs: 18.67 | 31: iteration 94630/ 173500 | consumed samples: 24225280 | consumed tokens: 49613373440 | elapsed time per iteration (s): 0.81 | learning rate: 9.849E-05 | global batch size: 256 | lm loss: 1.990139E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.952 | TFLOPs: 19.24 | 31: iteration 94640/ 173500 | consumed samples: 24227840 | consumed tokens: 49618616320 | elapsed time per iteration (s): 0.91 | learning rate: 9.847E-05 | global batch size: 256 | lm loss: 1.982952E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 280.278 | TFLOPs: 16.96 | 31: iteration 94650/ 173500 | consumed samples: 24230400 | consumed tokens: 49623859200 | elapsed time per iteration (s): 0.82 | learning rate: 9.846E-05 | global batch size: 256 | lm loss: 1.979104E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.160 | TFLOPs: 18.88 | 31: iteration 94660/ 173500 | consumed samples: 24232960 | consumed tokens: 49629102080 | elapsed time per iteration (s): 0.82 | learning rate: 9.844E-05 | global batch size: 256 | lm loss: 1.978206E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.418 | TFLOPs: 18.84 | 31: iteration 94670/ 173500 | consumed samples: 24235520 | consumed tokens: 49634344960 | elapsed time per iteration (s): 0.80 | learning rate: 9.842E-05 | global batch size: 256 | lm loss: 1.975624E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.973 | TFLOPs: 19.30 | 31: iteration 94680/ 
173500 | consumed samples: 24238080 | consumed tokens: 49639587840 | elapsed time per iteration (s): 0.80 | learning rate: 9.841E-05 | global batch size: 256 | lm loss: 1.936442E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.495 | TFLOPs: 19.45 | 31: iteration 94690/ 173500 | consumed samples: 24240640 | consumed tokens: 49644830720 | elapsed time per iteration (s): 0.77 | learning rate: 9.839E-05 | global batch size: 256 | lm loss: 1.970248E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.622 | TFLOPs: 20.06 | 31: iteration 94700/ 173500 | consumed samples: 24243200 | consumed tokens: 49650073600 | elapsed time per iteration (s): 0.82 | learning rate: 9.837E-05 | global batch size: 256 | lm loss: 1.960019E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.717 | TFLOPs: 18.86 | 31: iteration 94710/ 173500 | consumed samples: 24245760 | consumed tokens: 49655316480 | elapsed time per iteration (s): 0.93 | learning rate: 9.836E-05 | global batch size: 256 | lm loss: 1.959484E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 274.720 | TFLOPs: 16.62 | 31: iteration 94720/ 173500 | consumed samples: 24248320 | consumed tokens: 49660559360 | elapsed time per iteration (s): 0.85 | learning rate: 9.834E-05 | global batch size: 256 | lm loss: 2.011308E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.194 | TFLOPs: 18.22 | 31: iteration 94730/ 173500 | consumed samples: 24250880 | consumed tokens: 49665802240 | elapsed time per iteration (s): 0.86 | learning rate: 9.833E-05 | global batch size: 256 | lm loss: 1.958097E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 298.117 | TFLOPs: 18.04 | 31: iteration 94740/ 173500 | consumed samples: 24253440 | consumed tokens: 49671045120 | elapsed time per iteration (s): 0.84 | learning rate: 9.831E-05 | global batch size: 256 | lm loss: 1.978156E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.449 | TFLOPs: 18.42 | 31: iteration 94750/ 173500 | consumed samples: 24256000 | consumed tokens: 49676288000 | elapsed time per iteration (s): 0.84 | learning rate: 9.829E-05 | global batch size: 256 | lm loss: 1.978256E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.316 | TFLOPs: 18.53 | 31: iteration 94760/ 173500 | consumed samples: 24258560 | consumed tokens: 49681530880 | elapsed time per iteration (s): 0.76 | learning rate: 9.828E-05 | global batch size: 256 | lm loss: 1.999618E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.969 | TFLOPs: 20.33 | 31: iteration 94770/ 173500 | consumed samples: 24261120 | consumed tokens: 49686773760 | elapsed time per iteration (s): 0.74 | learning rate: 9.826E-05 | global batch size: 256 | lm loss: 1.986414E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.690 | TFLOPs: 20.79 | 31: iteration 94780/ 173500 | consumed samples: 24263680 | consumed tokens: 49692016640 | elapsed time per iteration (s): 0.77 | learning rate: 9.824E-05 | global batch size: 256 | lm loss: 1.984504E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.906 | TFLOPs: 20.02 | 31: iteration 94790/ 173500 | consumed samples: 24266240 | consumed tokens: 49697259520 | elapsed time per iteration (s): 0.73 | learning rate: 
9.823E-05 | global batch size: 256 | lm loss: 1.998590E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.528 | TFLOPs: 21.09 | 31: iteration 94800/ 173500 | consumed samples: 24268800 | consumed tokens: 49702502400 | elapsed time per iteration (s): 0.77 | learning rate: 9.821E-05 | global batch size: 256 | lm loss: 1.977152E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.365 | TFLOPs: 20.17 | 31: iteration 94810/ 173500 | consumed samples: 24271360 | consumed tokens: 49707745280 | elapsed time per iteration (s): 0.78 | learning rate: 9.820E-05 | global batch size: 256 | lm loss: 1.987163E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.029 | TFLOPs: 19.97 | 31: iteration 94820/ 173500 | consumed samples: 24273920 | consumed tokens: 49712988160 | elapsed time per iteration (s): 0.80 | learning rate: 9.818E-05 | global batch size: 256 | lm loss: 1.991123E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.324 | TFLOPs: 19.32 | 31: iteration 94830/ 173500 | consumed samples: 24276480 | consumed tokens: 49718231040 | elapsed time per iteration (s): 0.76 | learning rate: 9.816E-05 | global batch size: 256 | lm loss: 1.977318E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.652 | TFLOPs: 20.43 | 31: iteration 94840/ 173500 | consumed samples: 24279040 | consumed tokens: 49723473920 | elapsed time per iteration (s): 0.79 | learning rate: 9.815E-05 | global batch size: 256 | lm loss: 1.970315E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.251 | TFLOPs: 19.56 | 31: iteration 94850/ 173500 | 
consumed samples: 24281600 | consumed tokens: 49728716800 | elapsed time per iteration (s): 0.77 | learning rate: 9.813E-05 | global batch size: 256 | lm loss: 1.964790E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.753 | TFLOPs: 20.07 | 31: iteration 94860/ 173500 | consumed samples: 24284160 | consumed tokens: 49733959680 | elapsed time per iteration (s): 0.76 | learning rate: 9.811E-05 | global batch size: 256 | lm loss: 1.970844E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.933 | TFLOPs: 20.50 | 31: iteration 94870/ 173500 | consumed samples: 24286720 | consumed tokens: 49739202560 | elapsed time per iteration (s): 0.85 | learning rate: 9.810E-05 | global batch size: 256 | lm loss: 1.973055E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.412 | TFLOPs: 18.23 | 31: iteration 94880/ 173500 | consumed samples: 24289280 | consumed tokens: 49744445440 | elapsed time per iteration (s): 0.78 | learning rate: 9.808E-05 | global batch size: 256 | lm loss: 2.008850E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.681 | TFLOPs: 19.76 | 31: iteration 94890/ 173500 | consumed samples: 24291840 | consumed tokens: 49749688320 | elapsed time per iteration (s): 0.78 | learning rate: 9.806E-05 | global batch size: 256 | lm loss: 2.027540E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.976 | TFLOPs: 19.90 | 31: iteration 94900/ 173500 | consumed samples: 24294400 | consumed tokens: 49754931200 | elapsed time per iteration (s): 0.81 | learning rate: 9.805E-05 | global batch size: 256 | lm loss: 2.005787E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 314.181 | TFLOPs: 19.01 | 31: iteration 94910/ 173500 | consumed samples: 24296960 | consumed tokens: 49760174080 | elapsed time per iteration (s): 0.75 | learning rate: 9.803E-05 | global batch size: 256 | lm loss: 1.986749E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.433 | TFLOPs: 20.53 | 31: iteration 94920/ 173500 | consumed samples: 24299520 | consumed tokens: 49765416960 | elapsed time per iteration (s): 0.75 | learning rate: 9.802E-05 | global batch size: 256 | lm loss: 1.999519E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.151 | TFLOPs: 20.52 | 31: iteration 94930/ 173500 | consumed samples: 24302080 | consumed tokens: 49770659840 | elapsed time per iteration (s): 0.76 | learning rate: 9.800E-05 | global batch size: 256 | lm loss: 2.001888E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.440 | TFLOPs: 20.47 | 31: iteration 94940/ 173500 | consumed samples: 24304640 | consumed tokens: 49775902720 | elapsed time per iteration (s): 0.77 | learning rate: 9.798E-05 | global batch size: 256 | lm loss: 1.973833E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.722 | TFLOPs: 20.13 | 31: iteration 94950/ 173500 | consumed samples: 24307200 | consumed tokens: 49781145600 | elapsed time per iteration (s): 0.79 | learning rate: 9.797E-05 | global batch size: 256 | lm loss: 1.996644E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.598 | TFLOPs: 19.58 | 31: iteration 94960/ 173500 | consumed samples: 24309760 | consumed tokens: 49786388480 | elapsed time per iteration (s): 0.83 | learning rate: 9.795E-05 | global batch 
size: 256 | lm loss: 1.999656E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.785 | TFLOPs: 18.68 | 31: iteration 94970/ 173500 | consumed samples: 24312320 | consumed tokens: 49791631360 | elapsed time per iteration (s): 0.72 | learning rate: 9.793E-05 | global batch size: 256 | lm loss: 1.978890E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.938 | TFLOPs: 21.59 | 31: iteration 94980/ 173500 | consumed samples: 24314880 | consumed tokens: 49796874240 | elapsed time per iteration (s): 1.31 | learning rate: 9.792E-05 | global batch size: 256 | lm loss: 1.973049E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 195.370 | TFLOPs: 11.82 | 31: iteration 94990/ 173500 | consumed samples: 24317440 | consumed tokens: 49802117120 | elapsed time per iteration (s): 0.91 | learning rate: 9.790E-05 | global batch size: 256 | lm loss: 1.972249E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.017 | TFLOPs: 17.06 | 31: iteration 95000/ 173500 | consumed samples: 24320000 | consumed tokens: 49807360000 | elapsed time per iteration (s): 0.74 | learning rate: 9.789E-05 | global batch size: 256 | lm loss: 1.982883E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.403 | TFLOPs: 20.84 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 95000 | lm loss value: 1.888165E+00 | lm loss PPL: 6.607235E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 95000 to checkpoints_1b1long 0: [2022-11-26 15:34:33,734] [INFO] [logging.py:68:log_dist] 
[Rank 0] [Torch] Checkpoint global_step95000 is begin to save! 0: [2022-11-26 15:34:33,747] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_01-model_00-model_states.pt... 0: [2022-11-26 15:34:34,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_01-model_00-model_states.pt. 0: [2022-11-26 15:34:34,070] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_03-model_00-model_states.pt... 0: [2022-11-26 15:34:34,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_03-model_00-model_states.pt. 0: [2022-11-26 15:34:34,150] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_04-model_00-model_states.pt... 0: [2022-11-26 15:34:34,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_04-model_00-model_states.pt. 0: [2022-11-26 15:34:34,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_05-model_00-model_states.pt... 0: [2022-11-26 15:34:34,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_05-model_00-model_states.pt. 0: [2022-11-26 15:34:34,314] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_06-model_00-model_states.pt... 0: [2022-11-26 15:34:34,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_06-model_00-model_states.pt. 0: [2022-11-26 15:34:34,389] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_07-model_00-model_states.pt... 0: [2022-11-26 15:34:34,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 15:34:34,464] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_08-model_00-model_states.pt... 0: [2022-11-26 15:34:34,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_08-model_00-model_states.pt. 0: [2022-11-26 15:34:34,545] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_09-model_00-model_states.pt... 0: [2022-11-26 15:34:34,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_09-model_00-model_states.pt. 0: [2022-11-26 15:34:34,616] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_10-model_00-model_states.pt... 0: [2022-11-26 15:34:34,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_10-model_00-model_states.pt. 0: [2022-11-26 15:34:34,692] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_11-model_00-model_states.pt... 0: [2022-11-26 15:34:34,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_11-model_00-model_states.pt. 0: [2022-11-26 15:34:34,768] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_12-model_00-model_states.pt... 0: [2022-11-26 15:34:34,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_12-model_00-model_states.pt. 0: [2022-11-26 15:34:34,844] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_13-model_00-model_states.pt... 0: [2022-11-26 15:34:34,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 15:34:34,917] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_14-model_00-model_states.pt... 0: [2022-11-26 15:34:34,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_14-model_00-model_states.pt. 0: [2022-11-26 15:34:34,994] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_15-model_00-model_states.pt... 0: [2022-11-26 15:34:35,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_15-model_00-model_states.pt. 0: [2022-11-26 15:34:35,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_16-model_00-model_states.pt... 0: [2022-11-26 15:34:35,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_16-model_00-model_states.pt. 0: [2022-11-26 15:34:35,144] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_17-model_00-model_states.pt... 0: [2022-11-26 15:34:35,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_17-model_00-model_states.pt. 0: [2022-11-26 15:34:35,217] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_18-model_00-model_states.pt... 0: [2022-11-26 15:34:35,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_18-model_00-model_states.pt. 0: [2022-11-26 15:34:35,294] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_19-model_00-model_states.pt... 0: [2022-11-26 15:34:35,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 15:34:35,370] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_20-model_00-model_states.pt... 0: [2022-11-26 15:34:35,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_20-model_00-model_states.pt. 0: [2022-11-26 15:34:35,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_21-model_00-model_states.pt... 0: [2022-11-26 15:34:35,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_21-model_00-model_states.pt. 0: [2022-11-26 15:34:35,521] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_22-model_00-model_states.pt... 0: [2022-11-26 15:34:35,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_22-model_00-model_states.pt. 0: [2022-11-26 15:34:35,597] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_23-model_00-model_states.pt... 0: [2022-11-26 15:34:35,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_23-model_00-model_states.pt. 0: [2022-11-26 15:34:35,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_24-model_00-model_states.pt... 0: [2022-11-26 15:34:35,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_24-model_00-model_states.pt. 0: [2022-11-26 15:34:35,747] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_25-model_00-model_states.pt... 0: [2022-11-26 15:34:35,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 15:34:35,822] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_26-model_00-model_states.pt... 0: [2022-11-26 15:34:35,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_26-model_00-model_states.pt. 0: [2022-11-26 15:34:35,896] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_27-model_00-model_states.pt... 0: [2022-11-26 15:34:35,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_27-model_00-model_states.pt. 0: [2022-11-26 15:34:35,972] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_28-model_00-model_states.pt... 0: [2022-11-26 15:34:36,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_28-model_00-model_states.pt. 0: [2022-11-26 15:34:36,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/layer_30-model_00-model_states.pt... 0: [2022-11-26 15:34:36,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/layer_30-model_00-model_states.pt. 0: [2022-11-26 15:34:36,050] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step95000/mp_rank_00_model_states.pt 0: [2022-11-26 15:34:36,050] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/mp_rank_00_model_states.pt... 0: [2022-11-26 15:34:36,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/mp_rank_00_model_states.pt. 0: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
0: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
4: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
10: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 
13: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
15: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 
11: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
14: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 
22: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 
18: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 
27: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 
4: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
16: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 
20: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 
23: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
31: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 
17: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 
19: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
8: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
12: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 
29: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 
0: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 
12: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 
19: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 
19: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:34:36,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:34:36,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:34:36,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:34:36,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 15:34:36,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 18: [2022-11-26 15:34:36,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:34:36,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 15:34:36,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 0: [2022-11-26 15:34:36,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:34:36,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 15:34:36,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
18: [2022-11-26 15:34:36,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:34:36,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:34:36,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 15:34:36,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 15:34:36,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 18: [2022-11-26 15:34:36,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 18: [2022-11-26 15:34:36,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:34:36,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 15:34:36,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 20: [2022-11-26 15:34:36,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:34:36,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 15:34:36,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
21: [2022-11-26 15:34:36,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:34:36,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 15:34:36,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 26: [2022-11-26 15:34:36,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:34:36,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 21: [2022-11-26 15:34:36,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:34:36,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 21: [2022-11-26 15:34:36,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 15:34:36,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 9: [2022-11-26 15:34:36,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:34:36,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 15:34:36,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
16: [2022-11-26 15:34:36,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:34:36,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 15:34:36,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 24: [2022-11-26 15:34:36,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:34:36,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:34:36,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:34:36,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 30: [2022-11-26 15:34:36,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:34:36,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 30: [2022-11-26 15:34:36,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 15:34:36,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
1: [2022-11-26 15:34:36,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 24: [2022-11-26 15:34:36,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 1: [2022-11-26 15:34:36,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 15: [2022-11-26 15:34:36,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 12: [2022-11-26 15:34:36,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:34:36,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 4: [2022-11-26 15:34:36,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:34:36,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 4: [2022-11-26 15:34:36,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 15:34:36,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 4: [2022-11-26 15:34:36,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-26 15:34:36,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 15:34:36,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 26: [2022-11-26 15:34:36,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:34:36,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 15:34:36,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 26: [2022-11-26 15:34:36,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:34:36,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 15:34:36,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 21: [2022-11-26 15:34:36,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:34:36,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 0: [2022-11-26 15:34:36,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:34:36,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
22: [2022-11-26 15:34:36,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:34:36,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 22: [2022-11-26 15:34:36,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 0: [2022-11-26 15:34:36,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 22: [2022-11-26 15:34:36,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 0: [2022-11-26 15:34:36,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 15:34:36,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 0: [2022-11-26 15:34:36,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 23: [2022-11-26 15:34:36,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:34:36,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
23: [2022-11-26 15:34:36,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 11: [2022-11-26 15:34:36,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 23: [2022-11-26 15:34:36,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 11: [2022-11-26 15:34:36,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 27: [2022-11-26 15:34:36,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:34:36,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:34:36,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 15:34:36,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 27: [2022-11-26 15:34:36,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 16: [2022-11-26 15:34:36,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:34:36,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 15:34:36,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
21: [2022-11-26 15:34:36,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:34:36,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 16: [2022-11-26 15:34:36,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:34:36,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 16: [2022-11-26 15:34:36,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 21: [2022-11-26 15:34:36,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 16: [2022-11-26 15:34:36,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 10: [2022-11-26 15:34:36,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:34:36,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 15:34:36,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 10: [2022-11-26 15:34:36,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-26 15:34:36,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 15:34:36,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 26: [2022-11-26 15:34:36,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:34:36,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 15:34:36,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 21: [2022-11-26 15:34:36,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:34:36,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:34:36,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 15:34:36,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 24: [2022-11-26 15:34:36,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 15:34:36,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 27: [2022-11-26 15:34:36,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
31: [2022-11-26 15:34:36,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:34:36,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 15:34:36,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 31: [2022-11-26 15:34:36,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 15:34:36,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 6: [2022-11-26 15:34:36,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:34:36,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 15:34:36,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 19: [2022-11-26 15:34:36,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:34:36,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 15:34:36,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 3: [2022-11-26 15:34:36,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 15:34:36,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 15:34:36,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 3: [2022-11-26 15:34:36,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:34:36,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:34:36,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:34:36,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:34:36,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 15:34:36,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 15:34:36,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 3: [2022-11-26 15:34:36,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 29: [2022-11-26 15:34:36,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
29: [2022-11-26 15:34:36,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 29: [2022-11-26 15:34:36,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 3: [2022-11-26 15:34:36,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 29: [2022-11-26 15:34:36,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 28: [2022-11-26 15:34:36,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:34:36,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:34:36,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:34:36,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 15:34:36,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 15:34:36,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 28: [2022-11-26 15:34:36,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 29: [2022-11-26 15:34:36,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
28: [2022-11-26 15:34:36,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 29: [2022-11-26 15:34:36,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 8: [2022-11-26 15:34:36,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:34:36,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 8: [2022-11-26 15:34:36,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 15:34:36,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 3: [2022-11-26 15:34:36,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:34:36,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 15:34:36,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 24: [2022-11-26 15:34:36,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:34:36,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 15:34:36,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
23: [2022-11-26 15:34:36,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 28: [2022-11-26 15:34:36,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 23: [2022-11-26 15:34:36,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 15:34:36,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 28: [2022-11-26 15:34:36,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 15:34:36,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 26: [2022-11-26 15:34:36,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:34:36,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 22: [2022-11-26 15:34:36,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:34:36,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:34:36,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 16: [2022-11-26 15:34:36,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
22: [2022-11-26 15:34:36,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 15:34:36,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 16: [2022-11-26 15:34:36,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 22: [2022-11-26 15:34:36,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 22: [2022-11-26 15:34:36,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 16: [2022-11-26 15:34:36,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 8: [2022-11-26 15:34:36,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:34:36,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 15:34:36,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 26: [2022-11-26 15:34:36,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:34:36,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 15:34:36,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
7: [2022-11-26 15:34:36,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:34:36,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:34:36,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 15:34:36,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 19: [2022-11-26 15:34:36,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:34:36,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 7: [2022-11-26 15:34:36,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 7: [2022-11-26 15:34:36,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:34:36,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 15:34:36,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 7: [2022-11-26 15:34:36,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 15:34:36,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 15:34:36,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 3: [2022-11-26 15:34:36,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:34:36,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:34:36,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 3: [2022-11-26 15:34:36,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 15:34:36,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 7: [2022-11-26 15:34:36,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 3: [2022-11-26 15:34:36,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 3: [2022-11-26 15:34:36,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 23: [2022-11-26 15:34:36,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 15:34:36,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 15:34:36,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
31: [2022-11-26 15:34:36,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:34:36,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 27: [2022-11-26 15:34:36,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:34:36,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 4: [2022-11-26 15:34:36,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:34:36,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 15:34:36,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 27: [2022-11-26 15:34:36,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 1: [2022-11-26 15:34:36,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:34:36,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:34:36,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
1: [2022-11-26 15:34:36,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 16: [2022-11-26 15:34:36,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 10: [2022-11-26 15:34:36,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:34:36,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 16: [2022-11-26 15:34:36,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 10: [2022-11-26 15:34:36,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 15:34:36,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 24: [2022-11-26 15:34:36,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:34:36,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:34:36,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 16: [2022-11-26 15:34:36,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:34:36,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
16: [2022-11-26 15:34:36,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 6: [2022-11-26 15:34:36,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 16: [2022-11-26 15:34:36,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 6: [2022-11-26 15:34:36,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 27: [2022-11-26 15:34:36,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:34:36,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 21: [2022-11-26 15:34:36,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:34:36,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 10: [2022-11-26 15:34:36,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:34:36,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
10: [2022-11-26 15:34:36,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 20: [2022-11-26 15:34:36,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:34:36,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 21: [2022-11-26 15:34:36,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 10: [2022-11-26 15:34:36,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 20: [2022-11-26 15:34:36,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 11: [2022-11-26 15:34:36,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:34:36,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 11: [2022-11-26 15:34:36,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 11: [2022-11-26 15:34:36,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 21: [2022-11-26 15:34:36,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 11: [2022-11-26 15:34:36,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 15:34:36,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 11: [2022-11-26 15:34:36,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 15:34:36,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 10: [2022-11-26 15:34:36,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:34:36,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:34:36,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:34:36,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 1: [2022-11-26 15:34:36,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 15:34:36,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 10: [2022-11-26 15:34:36,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 1: [2022-11-26 15:34:36,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 1: [2022-11-26 15:34:36,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
18: [2022-11-26 15:34:36,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:34:36,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 6: [2022-11-26 15:34:36,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:34:36,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 6: [2022-11-26 15:34:36,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 15:34:36,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 27: [2022-11-26 15:34:36,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:34:36,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 15:34:36,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 6: [2022-11-26 15:34:36,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:34:36,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 15:34:36,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
11: [2022-11-26 15:34:36,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:34:36,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 12: [2022-11-26 15:34:36,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:34:36,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 12: [2022-11-26 15:34:36,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 15:34:36,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 10: [2022-11-26 15:34:36,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:34:36,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:34:36,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 20: [2022-11-26 15:34:36,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 10: [2022-11-26 15:34:36,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 20: [2022-11-26 15:34:36,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
11: [2022-11-26 15:34:36,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:34:36,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 15:34:36,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 23: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 15:34:36,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 30: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 15:34:36,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 6: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:34:36,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
30: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 30: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 15:34:36,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 15:34:36,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 30: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 9: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:34:36,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 9: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:34:36,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 15:34:36,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 15:34:36,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 15:34:36,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 15:34:36,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 12: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
9: [2022-11-26 15:34:36,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 12: [2022-11-26 15:34:36,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 9: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 9: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 9: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 12: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 12: [2022-11-26 15:34:36,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 12: [2022-11-26 15:34:36,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 15:34:36,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 28: [2022-11-26 15:34:36,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-26 15:34:36,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 19: [2022-11-26 15:34:36,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:34:36,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 28: [2022-11-26 15:34:36,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 7: [2022-11-26 15:34:36,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:34:36,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 7: [2022-11-26 15:34:36,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:34:36,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
7: [2022-11-26 15:34:36,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 15:34:36,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 15:34:36,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 15:34:36,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 7: [2022-11-26 15:34:36,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 7: [2022-11-26 15:34:36,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 5: [2022-11-26 15:34:36,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:34:36,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 8: [2022-11-26 15:34:36,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:34:36,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 8: [2022-11-26 15:34:36,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 15:34:36,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
21: [2022-11-26 15:34:36,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:34:36,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 15:34:36,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 4: [2022-11-26 15:34:36,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:34:36,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 15:34:36,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 26: [2022-11-26 15:34:36,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:34:36,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 12: [2022-11-26 15:34:36,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:34:36,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:34:36,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 12: [2022-11-26 15:34:36,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 15:34:36,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 15:34:36,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 15:34:36,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 23: [2022-11-26 15:34:36,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:34:36,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 12: [2022-11-26 15:34:36,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 23: [2022-11-26 15:34:36,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 12: [2022-11-26 15:34:36,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 23: [2022-11-26 15:34:36,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 3: [2022-11-26 15:34:36,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:34:36,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
3: [2022-11-26 15:34:36,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 29: [2022-11-26 15:34:36,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 3: [2022-11-26 15:34:36,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 29: [2022-11-26 15:34:36,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 30: [2022-11-26 15:34:36,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 15:34:36,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 24: [2022-11-26 15:34:36,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 30: [2022-11-26 15:34:36,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 24: [2022-11-26 15:34:36,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 15:34:36,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 5: [2022-11-26 15:34:36,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:34:36,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 15:34:36,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 15:34:36,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 15:34:36,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 5: [2022-11-26 15:34:36,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 5: [2022-11-26 15:34:36,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:34:36,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 15:34:36,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 3: [2022-11-26 15:34:36,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:34:36,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:34:36,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 27: [2022-11-26 15:34:36,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
3: [2022-11-26 15:34:36,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 31: [2022-11-26 15:34:36,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 3: [2022-11-26 15:34:36,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 27: [2022-11-26 15:34:36,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 31: [2022-11-26 15:34:36,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:34:36,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 31: [2022-11-26 15:34:36,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 15:34:36,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 31: [2022-11-26 15:34:36,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:34:36,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:34:36,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 23: [2022-11-26 15:34:36,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
31: [2022-11-26 15:34:36,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 15:34:36,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 31: [2022-11-26 15:34:36,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 8: [2022-11-26 15:34:36,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:34:36,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 23: [2022-11-26 15:34:36,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 8: [2022-11-26 15:34:36,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 23: [2022-11-26 15:34:36,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 8: [2022-11-26 15:34:36,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 15:34:36,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 15:34:36,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 15:34:36,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
8: [2022-11-26 15:34:36,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 1: [2022-11-26 15:34:36,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:34:36,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 23: [2022-11-26 15:34:36,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:34:36,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 1: [2022-11-26 15:34:36,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 23: [2022-11-26 15:34:36,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 1: [2022-11-26 15:34:36,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 15:34:36,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 23: [2022-11-26 15:34:36,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 1: [2022-11-26 15:34:36,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 1: [2022-11-26 15:34:36,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 15:34:36,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:34:36,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 15:34:36,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 15:34:36,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 1: [2022-11-26 15:34:36,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 31: [2022-11-26 15:34:36,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:34:36,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 15:34:36,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 31: [2022-11-26 15:34:36,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:34:36,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:34:36,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:34:36,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
31: [2022-11-26 15:34:36,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 24: [2022-11-26 15:34:36,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:34:36,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 15:34:36,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 31: [2022-11-26 15:34:36,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 14: [2022-11-26 15:34:36,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 14: [2022-11-26 15:34:36,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 24: [2022-11-26 15:34:36,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 15:34:36,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 14: [2022-11-26 15:34:36,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 15:34:36,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 19: [2022-11-26 15:34:36,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-26 15:34:36,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 28: [2022-11-26 15:34:36,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:34:36,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 16: [2022-11-26 15:34:36,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:34:36,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 15:34:36,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 22: [2022-11-26 15:34:36,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:34:36,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:34:36,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:34:36,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 14: [2022-11-26 15:34:36,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:34:36,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
22: [2022-11-26 15:34:36,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 15:34:36,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 14: [2022-11-26 15:34:36,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 22: [2022-11-26 15:34:36,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 14: [2022-11-26 15:34:36,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 22: [2022-11-26 15:34:36,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 14: [2022-11-26 15:34:36,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:34:36,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 30: [2022-11-26 15:34:36,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:34:36,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 30: [2022-11-26 15:34:36,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 15:34:36,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
19: [2022-11-26 15:34:36,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:34:36,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:34:36,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 15:34:36,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 15:34:36,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 19: [2022-11-26 15:34:36,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 14: [2022-11-26 15:34:36,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:34:36,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 15:34:36,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 20: [2022-11-26 15:34:36,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:34:36,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 15:34:36,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
20: [2022-11-26 15:34:36,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:34:36,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 15:34:36,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 28: [2022-11-26 15:34:36,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 15:34:36,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 28: [2022-11-26 15:34:36,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 15:34:36,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 15:34:36,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 15:34:36,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 15:34:36,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 28: [2022-11-26 15:34:36,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 19: [2022-11-26 15:34:36,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-26 15:34:36,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 15:34:36,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 19: [2022-11-26 15:34:36,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:34:36,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 15:34:36,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 24: [2022-11-26 15:34:36,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:34:36,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 15:34:36,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 20: [2022-11-26 15:34:36,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:34:36,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 15:34:36,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 20: [2022-11-26 15:34:36,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
5: [2022-11-26 15:34:36,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:34:36,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 5: [2022-11-26 15:34:36,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 15:34:36,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 20: [2022-11-26 15:34:36,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 13: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:34:36,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:34:36,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
13: [2022-11-26 15:34:36,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 5: [2022-11-26 15:34:36,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 15:34:36,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 13: [2022-11-26 15:34:36,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 15:34:36,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 13: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 5: [2022-11-26 15:34:36,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 5: [2022-11-26 15:34:36,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 13: [2022-11-26 15:34:36,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 7: [2022-11-26 15:34:36,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:34:36,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
22: [2022-11-26 15:34:36,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:34:36,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 23: [2022-11-26 15:34:36,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:34:36,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 7: [2022-11-26 15:34:36,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 13: [2022-11-26 15:34:36,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 22: [2022-11-26 15:34:36,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 5: [2022-11-26 15:34:36,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:34:36,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 23: [2022-11-26 15:34:36,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 15:34:36,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
5: [2022-11-26 15:34:36,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 15:34:36,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 15: [2022-11-26 15:34:36,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:34:36,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 15:34:36,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 15: [2022-11-26 15:34:36,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:34:36,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 15:34:36,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 15: [2022-11-26 15:34:36,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:34:36,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 15:34:36,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 15: [2022-11-26 15:34:36,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 13: [2022-11-26 15:34:36,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 15: [2022-11-26 15:34:36,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 13: [2022-11-26 15:34:36,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:34:36,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 15: [2022-11-26 15:34:36,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:34:36,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 28: [2022-11-26 15:34:36,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:34:36,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:34:36,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-26 15:34:36,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 15:34:36,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 15:34:36,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 13: [2022-11-26 15:34:36,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 15: [2022-11-26 15:34:36,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 15:34:36,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 8: [2022-11-26 15:34:36,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:34:36,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 15:34:36,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 6: [2022-11-26 15:34:36,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:34:36,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 15:34:36,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
11: [2022-11-26 15:34:36,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:34:36,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 15:34:36,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 28: [2022-11-26 15:34:36,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 15:34:36,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 4: [2022-11-26 15:34:36,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:34:36,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:34:36,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:34:36,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 15:34:36,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 15:34:36,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
4: [2022-11-26 15:34:36,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 15:34:36,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 4: [2022-11-26 15:34:36,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 20: [2022-11-26 15:34:36,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:34:36,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 15:34:36,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 10: [2022-11-26 15:34:36,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 25: [2022-11-26 15:34:36,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:34:36,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 15:34:36,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 25: [2022-11-26 15:34:36,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 15:34:36,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
25: [2022-11-26 15:34:36,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 15:34:36,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 15:34:36,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 25: [2022-11-26 15:34:36,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 15:34:36,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 15:34:36,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 30: [2022-11-26 15:34:36,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 15:34:36,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 15:34:36,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 0: [2022-11-26 15:34:36,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 15:34:36,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 25: [2022-11-26 15:34:36,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 15:34:36,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 15:34:36,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:34:36,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 25: [2022-11-26 15:34:36,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 15:34:36,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 15:34:36,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 15:34:36,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 25: [2022-11-26 15:34:36,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 25: [2022-11-26 15:34:36,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 15: [2022-11-26 15:34:36,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-26 15:34:36,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 15:34:36,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 15: [2022-11-26 15:34:36,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:34:36,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 15:34:36,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 15: [2022-11-26 15:34:36,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:34:36,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 15:34:36,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 0: [2022-11-26 15:34:36,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:34:36,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 15:34:36,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
0: [2022-11-26 15:34:36,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 15:34:36,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 18: [2022-11-26 15:34:36,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:34:36,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 15:34:36,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 2: [2022-11-26 15:34:36,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:34:36,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:34:36,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:34:36,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:34:36,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:34:36,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 15:34:36,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:34:36,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 15:34:36,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 15:34:36,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 15:34:36,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 15:34:36,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 15:34:36,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 15:34:36,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 15:34:36,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 2: [2022-11-26 15:34:36,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 2: [2022-11-26 15:34:36,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
2: [2022-11-26 15:34:36,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:34:36,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 2: [2022-11-26 15:34:36,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 2: [2022-11-26 15:34:36,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 2: [2022-11-26 15:34:36,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 2: [2022-11-26 15:34:36,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 15:34:36,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 24: [2022-11-26 15:34:36,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:34:36,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 15:34:36,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 21: [2022-11-26 15:34:36,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:34:36,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 15:34:36,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
17: [2022-11-26 15:34:36,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:34:36,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:34:36,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:34:36,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:34:36,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:34:36,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:34:36,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 
17: [2022-11-26 15:34:36,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 15:34:36,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 15:34:36,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 15:34:36,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 15:34:36,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 15:34:36,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 15:34:36,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 15:34:36,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 17: [2022-11-26 15:34:36,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 17: [2022-11-26 15:34:36,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 17: [2022-11-26 15:34:36,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 17: [2022-11-26 15:34:36,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
17: [2022-11-26 15:34:36,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 17: [2022-11-26 15:34:36,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 25: [2022-11-26 15:34:36,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 15:34:36,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 15:34:36,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 28: [2022-11-26 15:34:36,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:34:36,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:34:36,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 15:34:36,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 28: [2022-11-26 15:34:36,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 15:34:36,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 14: [2022-11-26 15:34:36,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-26 15:34:36,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 15:34:36,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 3: [2022-11-26 15:34:36,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:34:36,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 15:34:36,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 6: [2022-11-26 15:34:36,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:34:36,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 15:34:36,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 4: [2022-11-26 15:34:36,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:34:36,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 15:34:36,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 26: [2022-11-26 15:34:36,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
26: [2022-11-26 15:34:36,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 15:34:36,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 18: [2022-11-26 15:34:36,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:34:36,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 15:34:36,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 17: [2022-11-26 15:34:36,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:34:36,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 15:34:36,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 29: [2022-11-26 15:34:36,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:34:36,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 15:34:36,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 30: [2022-11-26 15:34:36,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-26 15:34:36,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 15:34:36,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 11: [2022-11-26 15:34:36,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:34:36,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 15:34:36,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 0: [2022-11-26 15:34:36,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:34:36,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 15:34:36,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 18: [2022-11-26 15:34:36,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:34:36,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 15:34:36,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 9: [2022-11-26 15:34:36,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 15:34:36,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 15:34:36,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 22: [2022-11-26 15:34:36,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:34:36,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 15:34:36,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 8: [2022-11-26 15:34:36,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:34:36,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 15:34:36,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 25: [2022-11-26 15:34:36,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 15:34:36,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 15:34:36,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 14: [2022-11-26 15:34:36,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-26 15:34:36,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 15:34:36,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 10: [2022-11-26 15:34:36,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:34:36,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 15:34:36,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 12: [2022-11-26 15:34:36,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:34:36,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 15:34:36,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 27: [2022-11-26 15:34:36,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:34:36,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step95000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 15:34:36,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
0: successfully saved checkpoint at iteration 95000 to checkpoints_1b1long
31: time (ms) | save-checkpoint: 2639.66
31: iteration 95010/ 173500 | consumed samples: 24322560 | consumed tokens: 49812602880 | elapsed time per iteration (s): 1.06 | learning rate: 9.787E-05 | global batch size: 256 | lm loss: 1.985808E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.549 | TFLOPs: 14.61 |
31: iteration 95020/ 173500 | consumed samples: 24325120 | consumed tokens: 49817845760 | elapsed time per iteration (s): 0.77 | learning rate: 9.785E-05 | global batch size: 256 | lm loss: 1.969585E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.325 | TFLOPs: 20.17 |
31: iteration 95030/ 173500 | consumed samples: 24327680 | consumed tokens: 49823088640 | elapsed time per iteration (s): 0.77 | learning rate: 9.784E-05 | global batch size: 256 | lm loss: 1.993892E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.911 | TFLOPs: 20.02 |
31: iteration 95040/ 173500 | consumed samples: 24330240 | consumed tokens: 49828331520 | elapsed time per iteration (s): 0.79 | learning rate: 9.782E-05 | global batch size: 256 | lm loss: 2.004996E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.195 | TFLOPs: 19.67 |
31: iteration 95050/ 173500 | consumed samples: 24332800 | consumed tokens: 49833574400 | elapsed time per iteration (s): 0.78 | learning rate: 9.780E-05 | global batch size: 256 | lm loss: 1.966982E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.617 | TFLOPs: 19.82 |
31: iteration 95060/ 173500 | consumed samples: 24335360 | consumed tokens: 49838817280 | elapsed time per iteration (s): 0.78 | learning rate: 9.779E-05 | global batch size: 256 | lm loss: 1.947336E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.792 | TFLOPs: 19.89 |
31: iteration 95070/ 173500 | consumed samples: 24337920 | consumed tokens: 49844060160 | elapsed time per iteration (s): 0.81 | learning rate: 9.777E-05 | global batch size: 256 | lm loss: 1.964802E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.498 | TFLOPs: 19.03 |
31: iteration 95080/ 173500 | consumed samples: 24340480 | consumed tokens: 49849303040 | elapsed time per iteration (s): 0.81 | learning rate: 9.775E-05 | global batch size: 256 | lm loss: 1.981956E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.576 | TFLOPs: 19.15 |
31: iteration 95090/ 173500 | consumed samples: 24343040 | consumed tokens: 49854545920 | elapsed time per iteration (s): 0.77 | learning rate: 9.774E-05 | global batch size: 256 | lm loss: 1.980805E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.471 | TFLOPs: 20.23 |
31: iteration 95100/ 173500 | consumed samples: 24345600 | consumed tokens: 49859788800 | elapsed time per iteration (s): 0.78 | learning rate: 9.772E-05 | global batch size: 256 | lm loss: 2.011094E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.116 | TFLOPs: 19.85 |
31: iteration 95110/ 173500 | consumed samples: 24348160 | consumed tokens: 49865031680 | elapsed time per iteration (s): 0.79 | learning rate: 9.771E-05 | global batch size: 256 | lm loss: 2.012542E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.403 | TFLOPs: 19.50 |
31: iteration 95120/ 173500 | consumed samples: 24350720 | consumed tokens: 49870274560 | elapsed time per iteration (s): 0.77 | learning rate: 9.769E-05 | global batch size: 256 | lm loss: 2.006977E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.224 | TFLOPs: 20.10 |
31: iteration 95130/ 173500 | consumed samples: 24353280 | consumed tokens: 49875517440 | elapsed time per iteration (s): 0.84 | learning rate: 9.767E-05 | global batch size: 256 | lm loss: 1.998656E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.063 | TFLOPs: 18.40 |
31: iteration 95140/ 173500 | consumed samples: 24355840 | consumed tokens: 49880760320 | elapsed time per iteration (s): 0.77 | learning rate: 9.766E-05 | global batch size: 256 | lm loss: 1.979045E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.848 | TFLOPs: 20.08 |
31: iteration 95150/ 173500 | consumed samples: 24358400 | consumed tokens: 49886003200 | elapsed time per iteration (s): 0.78 | learning rate: 9.764E-05 | global batch size: 256 | lm loss: 1.968565E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.660 | TFLOPs: 19.94 |
31: iteration 95160/ 173500 | consumed samples: 24360960 | consumed tokens: 49891246080 | elapsed time per iteration (s): 0.80 | learning rate: 9.762E-05 | global batch size: 256 | lm loss: 1.978091E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.207 | TFLOPs: 19.37 |
31: iteration 95170/ 173500 | consumed samples: 24363520 | consumed tokens: 49896488960 | elapsed time per iteration (s): 0.78 | learning rate: 9.761E-05 | global batch size: 256 | lm loss: 1.997568E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 326.180 | TFLOPs: 19.73 | 31: iteration 95180/ 173500 | consumed samples: 24366080 | consumed tokens: 49901731840 | elapsed time per iteration (s): 0.74 | learning rate: 9.759E-05 | global batch size: 256 | lm loss: 1.983025E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.405 | TFLOPs: 20.84 | 31: iteration 95190/ 173500 | consumed samples: 24368640 | consumed tokens: 49906974720 | elapsed time per iteration (s): 0.78 | learning rate: 9.758E-05 | global batch size: 256 | lm loss: 1.994926E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.178 | TFLOPs: 19.79 | 31: iteration 95200/ 173500 | consumed samples: 24371200 | consumed tokens: 49912217600 | elapsed time per iteration (s): 0.76 | learning rate: 9.756E-05 | global batch size: 256 | lm loss: 1.974164E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.979 | TFLOPs: 20.51 | 31: iteration 95210/ 173500 | consumed samples: 24373760 | consumed tokens: 49917460480 | elapsed time per iteration (s): 0.83 | learning rate: 9.754E-05 | global batch size: 256 | lm loss: 1.992512E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.446 | TFLOPs: 18.72 | 31: iteration 95220/ 173500 | consumed samples: 24376320 | consumed tokens: 49922703360 | elapsed time per iteration (s): 0.74 | learning rate: 9.753E-05 | global batch size: 256 | lm loss: 1.995190E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.087 | TFLOPs: 20.82 | 31: iteration 95230/ 173500 | consumed samples: 24378880 | consumed tokens: 49927946240 | elapsed time per iteration (s): 0.79 | learning rate: 
9.751E-05 | global batch size: 256 | lm loss: 1.966783E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.856 | TFLOPs: 19.65 | 31: iteration 95240/ 173500 | consumed samples: 24381440 | consumed tokens: 49933189120 | elapsed time per iteration (s): 0.78 | learning rate: 9.749E-05 | global batch size: 256 | lm loss: 1.986322E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.286 | TFLOPs: 19.86 | 31: iteration 95250/ 173500 | consumed samples: 24384000 | consumed tokens: 49938432000 | elapsed time per iteration (s): 0.77 | learning rate: 9.748E-05 | global batch size: 256 | lm loss: 1.989647E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.603 | TFLOPs: 20.12 | 31: iteration 95260/ 173500 | consumed samples: 24386560 | consumed tokens: 49943674880 | elapsed time per iteration (s): 0.77 | learning rate: 9.746E-05 | global batch size: 256 | lm loss: 1.970553E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.002 | TFLOPs: 20.02 | 31: iteration 95270/ 173500 | consumed samples: 24389120 | consumed tokens: 49948917760 | elapsed time per iteration (s): 0.77 | learning rate: 9.744E-05 | global batch size: 256 | lm loss: 1.960249E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.675 | TFLOPs: 20.19 | 31: iteration 95280/ 173500 | consumed samples: 24391680 | consumed tokens: 49954160640 | elapsed time per iteration (s): 0.76 | learning rate: 9.743E-05 | global batch size: 256 | lm loss: 2.011133E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.442 | TFLOPs: 20.41 | 31: iteration 95290/ 173500 | 
consumed samples: 24394240 | consumed tokens: 49959403520 | elapsed time per iteration (s): 0.75 | learning rate: 9.741E-05 | global batch size: 256 | lm loss: 1.952193E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.678 | TFLOPs: 20.61 | 31: iteration 95300/ 173500 | consumed samples: 24396800 | consumed tokens: 49964646400 | elapsed time per iteration (s): 0.79 | learning rate: 9.740E-05 | global batch size: 256 | lm loss: 1.998297E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.971 | TFLOPs: 19.60 | 31: iteration 95310/ 173500 | consumed samples: 24399360 | consumed tokens: 49969889280 | elapsed time per iteration (s): 0.79 | learning rate: 9.738E-05 | global batch size: 256 | lm loss: 1.974278E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.275 | TFLOPs: 19.62 | 31: iteration 95320/ 173500 | consumed samples: 24401920 | consumed tokens: 49975132160 | elapsed time per iteration (s): 0.78 | learning rate: 9.736E-05 | global batch size: 256 | lm loss: 1.963076E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.717 | TFLOPs: 19.89 | 31: iteration 95330/ 173500 | consumed samples: 24404480 | consumed tokens: 49980375040 | elapsed time per iteration (s): 0.79 | learning rate: 9.735E-05 | global batch size: 256 | lm loss: 1.991087E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.517 | TFLOPs: 19.57 | 31: iteration 95340/ 173500 | consumed samples: 24407040 | consumed tokens: 49985617920 | elapsed time per iteration (s): 0.81 | learning rate: 9.733E-05 | global batch size: 256 | lm loss: 2.001602E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 317.902 | TFLOPs: 19.23 | 31: iteration 95350/ 173500 | consumed samples: 24409600 | consumed tokens: 49990860800 | elapsed time per iteration (s): 0.82 | learning rate: 9.731E-05 | global batch size: 256 | lm loss: 1.954296E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.712 | TFLOPs: 18.98 | 31: iteration 95360/ 173500 | consumed samples: 24412160 | consumed tokens: 49996103680 | elapsed time per iteration (s): 0.78 | learning rate: 9.730E-05 | global batch size: 256 | lm loss: 1.945128E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.272 | TFLOPs: 19.80 | 31: iteration 95370/ 173500 | consumed samples: 24414720 | consumed tokens: 50001346560 | elapsed time per iteration (s): 0.84 | learning rate: 9.728E-05 | global batch size: 256 | lm loss: 1.989797E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.230 | TFLOPs: 18.34 | 31: iteration 95380/ 173500 | consumed samples: 24417280 | consumed tokens: 50006589440 | elapsed time per iteration (s): 0.81 | learning rate: 9.727E-05 | global batch size: 256 | lm loss: 1.974142E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.504 | TFLOPs: 19.15 | 31: iteration 95390/ 173500 | consumed samples: 24419840 | consumed tokens: 50011832320 | elapsed time per iteration (s): 0.84 | learning rate: 9.725E-05 | global batch size: 256 | lm loss: 1.998717E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.382 | TFLOPs: 18.35 | 31: iteration 95400/ 173500 | consumed samples: 24422400 | consumed tokens: 50017075200 | elapsed time per iteration (s): 0.80 | learning rate: 9.723E-05 | global batch 
size: 256 | lm loss: 1.972154E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.528 | TFLOPs: 19.45 | 31: iteration 95410/ 173500 | consumed samples: 24424960 | consumed tokens: 50022318080 | elapsed time per iteration (s): 0.87 | learning rate: 9.722E-05 | global batch size: 256 | lm loss: 1.956245E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.472 | TFLOPs: 17.75 | 31: iteration 95420/ 173500 | consumed samples: 24427520 | consumed tokens: 50027560960 | elapsed time per iteration (s): 0.85 | learning rate: 9.720E-05 | global batch size: 256 | lm loss: 1.974905E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.366 | TFLOPs: 18.23 | 31: iteration 95430/ 173500 | consumed samples: 24430080 | consumed tokens: 50032803840 | elapsed time per iteration (s): 0.83 | learning rate: 9.718E-05 | global batch size: 256 | lm loss: 1.955468E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.872 | TFLOPs: 18.56 | 31: iteration 95440/ 173500 | consumed samples: 24432640 | consumed tokens: 50038046720 | elapsed time per iteration (s): 1.40 | learning rate: 9.717E-05 | global batch size: 256 | lm loss: 1.991491E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 183.043 | TFLOPs: 11.07 | 31: iteration 95450/ 173500 | consumed samples: 24435200 | consumed tokens: 50043289600 | elapsed time per iteration (s): 0.81 | learning rate: 9.715E-05 | global batch size: 256 | lm loss: 1.976453E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.329 | TFLOPs: 19.20 | 31: iteration 95460/ 173500 | consumed samples: 24437760 | 
consumed tokens: 50048532480 | elapsed time per iteration (s): 0.83 | learning rate: 9.714E-05 | global batch size: 256 | lm loss: 2.001273E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.263 | TFLOPs: 18.77 | 31: iteration 95470/ 173500 | consumed samples: 24440320 | consumed tokens: 50053775360 | elapsed time per iteration (s): 0.82 | learning rate: 9.712E-05 | global batch size: 256 | lm loss: 1.999879E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.322 | TFLOPs: 18.77 | 31: iteration 95480/ 173500 | consumed samples: 24442880 | consumed tokens: 50059018240 | elapsed time per iteration (s): 0.82 | learning rate: 9.710E-05 | global batch size: 256 | lm loss: 1.972714E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.082 | TFLOPs: 18.94 | 31: iteration 95490/ 173500 | consumed samples: 24445440 | consumed tokens: 50064261120 | elapsed time per iteration (s): 0.94 | learning rate: 9.709E-05 | global batch size: 256 | lm loss: 1.966849E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 272.738 | TFLOPs: 16.50 | 31: iteration 95500/ 173500 | consumed samples: 24448000 | consumed tokens: 50069504000 | elapsed time per iteration (s): 0.83 | learning rate: 9.707E-05 | global batch size: 256 | lm loss: 2.005167E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.086 | TFLOPs: 18.58 | 31: iteration 95510/ 173500 | consumed samples: 24450560 | consumed tokens: 50074746880 | elapsed time per iteration (s): 0.83 | learning rate: 9.705E-05 | global batch size: 256 | lm loss: 1.989598E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 309.013 | TFLOPs: 18.69 | 31: iteration 95520/ 173500 | consumed samples: 24453120 | consumed tokens: 50079989760 | elapsed time per iteration (s): 0.86 | learning rate: 9.704E-05 | global batch size: 256 | lm loss: 2.017918E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.454 | TFLOPs: 18.00 | 31: iteration 95530/ 173500 | consumed samples: 24455680 | consumed tokens: 50085232640 | elapsed time per iteration (s): 0.86 | learning rate: 9.702E-05 | global batch size: 256 | lm loss: 2.004645E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.223 | TFLOPs: 17.92 | 31: iteration 95540/ 173500 | consumed samples: 24458240 | consumed tokens: 50090475520 | elapsed time per iteration (s): 0.89 | learning rate: 9.700E-05 | global batch size: 256 | lm loss: 1.965067E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.180 | TFLOPs: 17.31 | 31: iteration 95550/ 173500 | consumed samples: 24460800 | consumed tokens: 50095718400 | elapsed time per iteration (s): 0.88 | learning rate: 9.699E-05 | global batch size: 256 | lm loss: 1.969669E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.444 | TFLOPs: 17.57 | 31: iteration 95560/ 173500 | consumed samples: 24463360 | consumed tokens: 50100961280 | elapsed time per iteration (s): 0.92 | learning rate: 9.697E-05 | global batch size: 256 | lm loss: 1.981875E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 277.862 | TFLOPs: 16.81 | 31: iteration 95570/ 173500 | consumed samples: 24465920 | consumed tokens: 50106204160 | elapsed time per iteration (s): 0.82 | learning rate: 9.696E-05 | global batch size: 256 | lm loss: 
1.985681E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.897 | TFLOPs: 18.87 | 31: iteration 95580/ 173500 | consumed samples: 24468480 | consumed tokens: 50111447040 | elapsed time per iteration (s): 0.84 | learning rate: 9.694E-05 | global batch size: 256 | lm loss: 1.967645E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.991 | TFLOPs: 18.51 | 31: iteration 95590/ 173500 | consumed samples: 24471040 | consumed tokens: 50116689920 | elapsed time per iteration (s): 0.86 | learning rate: 9.692E-05 | global batch size: 256 | lm loss: 1.976726E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.751 | TFLOPs: 17.95 | 31: iteration 95600/ 173500 | consumed samples: 24473600 | consumed tokens: 50121932800 | elapsed time per iteration (s): 0.94 | learning rate: 9.691E-05 | global batch size: 256 | lm loss: 1.960344E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 273.021 | TFLOPs: 16.52 | 31: iteration 95610/ 173500 | consumed samples: 24476160 | consumed tokens: 50127175680 | elapsed time per iteration (s): 0.84 | learning rate: 9.689E-05 | global batch size: 256 | lm loss: 1.972683E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.037 | TFLOPs: 18.33 | 31: iteration 95620/ 173500 | consumed samples: 24478720 | consumed tokens: 50132418560 | elapsed time per iteration (s): 0.83 | learning rate: 9.687E-05 | global batch size: 256 | lm loss: 1.982407E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.701 | TFLOPs: 18.62 | 31: iteration 95630/ 173500 | consumed samples: 24481280 | consumed tokens: 
50137661440 | elapsed time per iteration (s): 0.89 | learning rate: 9.686E-05 | global batch size: 256 | lm loss: 1.988660E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.409 | TFLOPs: 17.45 | 31: iteration 95640/ 173500 | consumed samples: 24483840 | consumed tokens: 50142904320 | elapsed time per iteration (s): 0.93 | learning rate: 9.684E-05 | global batch size: 256 | lm loss: 1.985329E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 275.756 | TFLOPs: 16.68 | 31: iteration 95650/ 173500 | consumed samples: 24486400 | consumed tokens: 50148147200 | elapsed time per iteration (s): 0.81 | learning rate: 9.683E-05 | global batch size: 256 | lm loss: 1.992054E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.533 | TFLOPs: 19.03 | 31: iteration 95660/ 173500 | consumed samples: 24488960 | consumed tokens: 50153390080 | elapsed time per iteration (s): 0.85 | learning rate: 9.681E-05 | global batch size: 256 | lm loss: 1.954781E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.497 | TFLOPs: 18.18 | 31: iteration 95670/ 173500 | consumed samples: 24491520 | consumed tokens: 50158632960 | elapsed time per iteration (s): 0.84 | learning rate: 9.679E-05 | global batch size: 256 | lm loss: 1.961178E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.886 | TFLOPs: 18.44 | 31: iteration 95680/ 173500 | consumed samples: 24494080 | consumed tokens: 50163875840 | elapsed time per iteration (s): 0.84 | learning rate: 9.678E-05 | global batch size: 256 | lm loss: 1.994084E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 305.044 | TFLOPs: 18.45 | 31: iteration 95690/ 173500 | consumed samples: 24496640 | consumed tokens: 50169118720 | elapsed time per iteration (s): 0.82 | learning rate: 9.676E-05 | global batch size: 256 | lm loss: 1.980379E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.564 | TFLOPs: 18.91 | 31: iteration 95700/ 173500 | consumed samples: 24499200 | consumed tokens: 50174361600 | elapsed time per iteration (s): 0.81 | learning rate: 9.674E-05 | global batch size: 256 | lm loss: 2.018265E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.433 | TFLOPs: 19.20 | 31: iteration 95710/ 173500 | consumed samples: 24501760 | consumed tokens: 50179604480 | elapsed time per iteration (s): 0.78 | learning rate: 9.673E-05 | global batch size: 256 | lm loss: 1.983566E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.422 | TFLOPs: 19.75 | 31: iteration 95720/ 173500 | consumed samples: 24504320 | consumed tokens: 50184847360 | elapsed time per iteration (s): 0.81 | learning rate: 9.671E-05 | global batch size: 256 | lm loss: 1.978819E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.287 | TFLOPs: 19.20 | 31: iteration 95730/ 173500 | consumed samples: 24506880 | consumed tokens: 50190090240 | elapsed time per iteration (s): 0.80 | learning rate: 9.670E-05 | global batch size: 256 | lm loss: 1.950655E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.549 | TFLOPs: 19.33 | 31: iteration 95740/ 173500 | consumed samples: 24509440 | consumed tokens: 50195333120 | elapsed time per iteration (s): 0.80 | learning rate: 9.668E-05 | global batch size: 256 | lm loss: 1.980629E+00 | grad 
norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.152 | TFLOPs: 19.43 | 31: iteration 95750/ 173500 | consumed samples: 24512000 | consumed tokens: 50200576000 | elapsed time per iteration (s): 0.84 | learning rate: 9.666E-05 | global batch size: 256 | lm loss: 1.954304E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.054 | TFLOPs: 18.45 | 31: iteration 95760/ 173500 | consumed samples: 24514560 | consumed tokens: 50205818880 | elapsed time per iteration (s): 0.83 | learning rate: 9.665E-05 | global batch size: 256 | lm loss: 1.958864E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.002 | TFLOPs: 18.75 | 31: iteration 95770/ 173500 | consumed samples: 24517120 | consumed tokens: 50211061760 | elapsed time per iteration (s): 0.81 | learning rate: 9.663E-05 | global batch size: 256 | lm loss: 1.973322E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.746 | TFLOPs: 19.22 | 31: iteration 95780/ 173500 | consumed samples: 24519680 | consumed tokens: 50216304640 | elapsed time per iteration (s): 0.85 | learning rate: 9.661E-05 | global batch size: 256 | lm loss: 1.978025E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.638 | TFLOPs: 18.25 | 31: iteration 95790/ 173500 | consumed samples: 24522240 | consumed tokens: 50221547520 | elapsed time per iteration (s): 0.78 | learning rate: 9.660E-05 | global batch size: 256 | lm loss: 1.993076E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.466 | TFLOPs: 19.75 | 31: iteration 95800/ 173500 | consumed samples: 24524800 | consumed tokens: 50226790400 | elapsed time 
per iteration (s): 0.73 | learning rate: 9.658E-05 | global batch size: 256 | lm loss: 1.990435E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.745 | TFLOPs: 21.22 | 31: iteration 95810/ 173500 | consumed samples: 24527360 | consumed tokens: 50232033280 | elapsed time per iteration (s): 0.81 | learning rate: 9.657E-05 | global batch size: 256 | lm loss: 1.982397E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.156 | TFLOPs: 19.19 | 31: iteration 95820/ 173500 | consumed samples: 24529920 | consumed tokens: 50237276160 | elapsed time per iteration (s): 0.73 | learning rate: 9.655E-05 | global batch size: 256 | lm loss: 1.993814E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.785 | TFLOPs: 21.10 | 31: iteration 95830/ 173500 | consumed samples: 24532480 | consumed tokens: 50242519040 | elapsed time per iteration (s): 0.77 | learning rate: 9.653E-05 | global batch size: 256 | lm loss: 2.005157E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.146 | TFLOPs: 20.15 | 31: iteration 95840/ 173500 | consumed samples: 24535040 | consumed tokens: 50247761920 | elapsed time per iteration (s): 0.80 | learning rate: 9.652E-05 | global batch size: 256 | lm loss: 1.979388E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.178 | TFLOPs: 19.37 | 31: iteration 95850/ 173500 | consumed samples: 24537600 | consumed tokens: 50253004800 | elapsed time per iteration (s): 0.74 | learning rate: 9.650E-05 | global batch size: 256 | lm loss: 1.989125E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.223 | TFLOPs: 
21.07 | 31: iteration 95860/ 173500 | consumed samples: 24540160 | consumed tokens: 50258247680 | elapsed time per iteration (s): 0.78 | learning rate: 9.648E-05 | global batch size: 256 | lm loss: 1.994869E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.319 | TFLOPs: 19.74 | 31: iteration 95870/ 173500 | consumed samples: 24542720 | consumed tokens: 50263490560 | elapsed time per iteration (s): 0.78 | learning rate: 9.647E-05 | global batch size: 256 | lm loss: 2.013829E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.762 | TFLOPs: 19.83 | 31: iteration 95880/ 173500 | consumed samples: 24545280 | consumed tokens: 50268733440 | elapsed time per iteration (s): 0.76 | learning rate: 9.645E-05 | global batch size: 256 | lm loss: 1.957152E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.759 | TFLOPs: 20.31 | 31: iteration 95890/ 173500 | consumed samples: 24547840 | consumed tokens: 50273976320 | elapsed time per iteration (s): 0.76 | learning rate: 9.643E-05 | global batch size: 256 | lm loss: 1.988905E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.875 | TFLOPs: 20.38 | 31: iteration 95900/ 173500 | consumed samples: 24550400 | consumed tokens: 50279219200 | elapsed time per iteration (s): 0.78 | learning rate: 9.642E-05 | global batch size: 256 | lm loss: 1.987402E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.383 | TFLOPs: 19.81 | 31: iteration 95910/ 173500 | consumed samples: 24552960 | consumed tokens: 50284462080 | elapsed time per iteration (s): 0.80 | learning rate: 9.640E-05 | global batch size: 256 | lm loss: 1.986285E+00 | grad norm: 0.144 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.148 | TFLOPs: 19.43 | 31: iteration 95920/ 173500 | consumed samples: 24555520 | consumed tokens: 50289704960 | elapsed time per iteration (s): 0.75 | learning rate: 9.639E-05 | global batch size: 256 | lm loss: 1.965157E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.916 | TFLOPs: 20.75 | 31: iteration 95930/ 173500 | consumed samples: 24558080 | consumed tokens: 50294947840 | elapsed time per iteration (s): 0.79 | learning rate: 9.637E-05 | global batch size: 256 | lm loss: 1.973985E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.091 | TFLOPs: 19.67 | 31: iteration 95940/ 173500 | consumed samples: 24560640 | consumed tokens: 50300190720 | elapsed time per iteration (s): 0.76 | learning rate: 9.635E-05 | global batch size: 256 | lm loss: 1.958144E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.010 | TFLOPs: 20.33 | 31: iteration 95950/ 173500 | consumed samples: 24563200 | consumed tokens: 50305433600 | elapsed time per iteration (s): 0.76 | learning rate: 9.634E-05 | global batch size: 256 | lm loss: 1.973658E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.619 | TFLOPs: 20.43 | 31: iteration 95960/ 173500 | consumed samples: 24565760 | consumed tokens: 50310676480 | elapsed time per iteration (s): 0.74 | learning rate: 9.632E-05 | global batch size: 256 | lm loss: 1.964610E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.752 | TFLOPs: 21.04 | 31: iteration 95970/ 173500 | consumed samples: 24568320 | consumed tokens: 50315919360 | elapsed time per iteration (s): 0.88 | 
learning rate: 9.630E-05 | global batch size: 256 | lm loss: 1.946271E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.628 | TFLOPs: 17.52 | 31: iteration 95980/ 173500 | consumed samples: 24570880 | consumed tokens: 50321162240 | elapsed time per iteration (s): 0.78 | learning rate: 9.629E-05 | global batch size: 256 | lm loss: 1.964056E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.337 | TFLOPs: 19.92 | 31: iteration 95990/ 173500 | consumed samples: 24573440 | consumed tokens: 50326405120 | elapsed time per iteration (s): 0.80 | learning rate: 9.627E-05 | global batch size: 256 | lm loss: 2.004500E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.699 | TFLOPs: 19.40 | 0: [2022-11-26 15:48:08,532] [INFO] [logging.py:68:log_dist] [Rank 0] step=96000, skipped=0, lr=[9.625601507010446e-05, 9.625601507010446e-05, 9.625601507010446e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 96000/ 173500 | consumed samples: 24576000 | consumed tokens: 50331648000 | elapsed time per iteration (s): 0.81 | learning rate: 9.626E-05 | global batch size: 256 | lm loss: 2.005529E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.391 | TFLOPs: 19.08 | 0: steps: 96000 loss: 1.9910 iter time (s): 0.809 samples/sec: 316.283 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 96000 | lm loss value: 1.920308E+00 | lm loss PPL: 6.823060E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 96000 to checkpoints_1b1long 0: [2022-11-26 15:48:08,834] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint 
global_step96000 is begin to save! 0: [2022-11-26 15:48:08,847] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_01-model_00-model_states.pt... 0: [2022-11-26 15:48:09,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_01-model_00-model_states.pt. 0: [2022-11-26 15:48:09,052] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_03-model_00-model_states.pt... 0: [2022-11-26 15:48:09,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_03-model_00-model_states.pt. 0: [2022-11-26 15:48:09,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_04-model_00-model_states.pt... 0: [2022-11-26 15:48:09,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_04-model_00-model_states.pt. 0: [2022-11-26 15:48:09,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_05-model_00-model_states.pt... 0: [2022-11-26 15:48:09,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_05-model_00-model_states.pt. 0: [2022-11-26 15:48:09,287] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_06-model_00-model_states.pt... 0: [2022-11-26 15:48:09,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_06-model_00-model_states.pt. 0: [2022-11-26 15:48:09,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_07-model_00-model_states.pt... 0: [2022-11-26 15:48:09,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 15:48:09,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_08-model_00-model_states.pt... 0: [2022-11-26 15:48:09,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_08-model_00-model_states.pt. 0: [2022-11-26 15:48:09,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_09-model_00-model_states.pt... 0: [2022-11-26 15:48:09,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_09-model_00-model_states.pt. 0: [2022-11-26 15:48:09,591] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_10-model_00-model_states.pt... 0: [2022-11-26 15:48:09,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_10-model_00-model_states.pt. 0: [2022-11-26 15:48:09,674] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_11-model_00-model_states.pt... 0: [2022-11-26 15:48:09,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_11-model_00-model_states.pt. 0: [2022-11-26 15:48:09,748] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_12-model_00-model_states.pt... 0: [2022-11-26 15:48:09,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_12-model_00-model_states.pt. 0: [2022-11-26 15:48:09,820] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_13-model_00-model_states.pt... 0: [2022-11-26 15:48:09,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 15:48:09,896] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_14-model_00-model_states.pt... 0: [2022-11-26 15:48:09,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_14-model_00-model_states.pt. 0: [2022-11-26 15:48:09,972] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_15-model_00-model_states.pt... 0: [2022-11-26 15:48:10,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_15-model_00-model_states.pt. 0: [2022-11-26 15:48:10,045] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_16-model_00-model_states.pt... 0: [2022-11-26 15:48:10,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_16-model_00-model_states.pt. 0: [2022-11-26 15:48:10,118] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_17-model_00-model_states.pt... 0: [2022-11-26 15:48:10,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_17-model_00-model_states.pt. 0: [2022-11-26 15:48:10,197] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_18-model_00-model_states.pt... 0: [2022-11-26 15:48:10,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_18-model_00-model_states.pt. 0: [2022-11-26 15:48:10,275] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_19-model_00-model_states.pt... 0: [2022-11-26 15:48:10,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 15:48:10,349] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_20-model_00-model_states.pt... 0: [2022-11-26 15:48:10,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_20-model_00-model_states.pt. 0: [2022-11-26 15:48:10,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_21-model_00-model_states.pt... 0: [2022-11-26 15:48:10,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_21-model_00-model_states.pt. 0: [2022-11-26 15:48:10,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_22-model_00-model_states.pt... 0: [2022-11-26 15:48:10,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_22-model_00-model_states.pt. 0: [2022-11-26 15:48:10,583] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_23-model_00-model_states.pt... 0: [2022-11-26 15:48:10,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_23-model_00-model_states.pt. 0: [2022-11-26 15:48:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_24-model_00-model_states.pt... 0: [2022-11-26 15:48:10,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_24-model_00-model_states.pt. 0: [2022-11-26 15:48:10,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_25-model_00-model_states.pt... 0: [2022-11-26 15:48:10,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 15:48:10,806] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_26-model_00-model_states.pt... 0: [2022-11-26 15:48:10,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_26-model_00-model_states.pt. 0: [2022-11-26 15:48:10,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_27-model_00-model_states.pt... 0: [2022-11-26 15:48:10,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_27-model_00-model_states.pt. 0: [2022-11-26 15:48:10,959] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_28-model_00-model_states.pt... 0: [2022-11-26 15:48:11,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_28-model_00-model_states.pt. 0: [2022-11-26 15:48:11,035] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/layer_30-model_00-model_states.pt... 0: [2022-11-26 15:48:11,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/layer_30-model_00-model_states.pt. 0: [2022-11-26 15:48:11,037] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step96000/mp_rank_00_model_states.pt 0: [2022-11-26 15:48:11,037] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/mp_rank_00_model_states.pt... 0: [2022-11-26 15:48:11,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/mp_rank_00_model_states.pt. 0: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
6: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
9: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 
16: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
3: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 
23: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 
28: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 
30: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 
26: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 
6: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
9: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
1: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
3: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 
23: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
14: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 
17: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 
27: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 
8: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 
25: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
14: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 
19: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
15: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 
30: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
20: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 24: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 31: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 22: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 17: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 27: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 
31: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 18: [2022-11-26 15:48:11,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 29: [2022-11-26 15:48:11,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:48:11,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 15:48:11,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 0: [2022-11-26 15:48:11,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:48:11,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 15:48:11,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 27: [2022-11-26 15:48:11,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:48:11,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
27: [2022-11-26 15:48:11,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 16: [2022-11-26 15:48:11,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 15:48:11,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 27: [2022-11-26 15:48:11,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 30: [2022-11-26 15:48:11,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:48:11,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:48:11,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 30: [2022-11-26 15:48:11,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 15:48:11,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 17: [2022-11-26 15:48:11,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 6: [2022-11-26 15:48:11,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 15:48:11,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 15:48:11,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 12: [2022-11-26 15:48:11,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:48:11,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 15:48:11,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 1: [2022-11-26 15:48:11,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:48:11,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:48:11,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 15:48:11,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 1: [2022-11-26 15:48:11,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 11: [2022-11-26 15:48:11,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:48:11,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
11: [2022-11-26 15:48:11,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 15:48:11,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 8: [2022-11-26 15:48:11,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:48:11,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:48:11,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 5: [2022-11-26 15:48:11,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:48:11,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 5: [2022-11-26 15:48:11,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 7: [2022-11-26 15:48:11,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:48:11,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 15:48:11,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 5: [2022-11-26 15:48:11,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
7: [2022-11-26 15:48:11,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 15:48:11,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 23: [2022-11-26 15:48:11,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 15:48:11,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 15:48:11,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 0: [2022-11-26 15:48:11,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:48:11,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 15:48:11,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 15: [2022-11-26 15:48:11,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:48:11,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:48:11,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 15:48:11,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
28: [2022-11-26 15:48:11,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:48:11,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 15:48:11,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 29: [2022-11-26 15:48:11,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:48:11,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 15:48:11,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 1: [2022-11-26 15:48:11,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 25: [2022-11-26 15:48:11,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:48:11,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 25: [2022-11-26 15:48:11,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 1: [2022-11-26 15:48:11,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 3: [2022-11-26 15:48:11,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 15:48:11,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 25: [2022-11-26 15:48:11,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 28: [2022-11-26 15:48:11,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 27: [2022-11-26 15:48:11,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:48:11,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 28: [2022-11-26 15:48:11,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 27: [2022-11-26 15:48:11,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 3: [2022-11-26 15:48:11,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 15:48:11,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 3: [2022-11-26 15:48:11,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 27: [2022-11-26 15:48:11,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 24: [2022-11-26 15:48:11,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
24: [2022-11-26 15:48:11,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 15:48:11,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 14: [2022-11-26 15:48:11,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:48:11,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 15:48:11,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 6: [2022-11-26 15:48:11,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:48:11,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 15:48:11,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 9: [2022-11-26 15:48:11,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:48:11,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:48:11,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 15:48:11,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
17: [2022-11-26 15:48:11,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 2: [2022-11-26 15:48:11,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:48:11,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 2: [2022-11-26 15:48:11,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 15:48:11,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 8: [2022-11-26 15:48:11,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:48:11,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 15:48:11,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 5: [2022-11-26 15:48:11,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:48:11,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:48:11,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 15:48:11,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
16: [2022-11-26 15:48:11,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 11: [2022-11-26 15:48:11,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:48:11,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:48:11,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 11: [2022-11-26 15:48:11,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 29: [2022-11-26 15:48:11,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 11: [2022-11-26 15:48:11,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 29: [2022-11-26 15:48:11,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 30: [2022-11-26 15:48:11,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 15:48:11,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 15:48:11,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 22: [2022-11-26 15:48:11,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-26 15:48:11,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 15:48:11,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 13: [2022-11-26 15:48:11,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:48:11,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:48:11,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 15:48:11,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 15:48:11,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 13: [2022-11-26 15:48:11,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 15: [2022-11-26 15:48:11,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:48:11,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 15:48:11,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 7: [2022-11-26 15:48:11,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 15:48:11,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 15:48:11,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 26: [2022-11-26 15:48:11,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:48:11,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 15:48:11,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 2: [2022-11-26 15:48:11,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 25: [2022-11-26 15:48:11,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:48:11,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 15:48:11,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 25: [2022-11-26 15:48:11,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 15:48:11,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 24: [2022-11-26 15:48:11,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-26 15:48:11,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 0: [2022-11-26 15:48:11,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:48:11,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 0: [2022-11-26 15:48:11,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 15:48:11,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 28: [2022-11-26 15:48:11,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 15:48:11,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 15:48:11,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 14: [2022-11-26 15:48:11,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:48:11,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 12: [2022-11-26 15:48:11,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:48:11,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
12: [2022-11-26 15:48:11,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 15:48:11,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 23: [2022-11-26 15:48:11,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 15:48:11,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 15:48:11,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 15:48:11,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 23: [2022-11-26 15:48:11,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 15:48:11,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 9: [2022-11-26 15:48:11,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:48:11,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 15:48:11,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 3: [2022-11-26 15:48:11,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
12: [2022-11-26 15:48:11,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:48:11,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 15:48:11,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 12: [2022-11-26 15:48:11,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 15:48:11,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 1: [2022-11-26 15:48:11,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:48:11,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 15:48:11,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 6: [2022-11-26 15:48:11,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:48:11,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 15:48:11,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 17: [2022-11-26 15:48:11,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-26 15:48:11,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 15:48:11,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 15: [2022-11-26 15:48:11,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:48:11,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 15:48:11,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 27: [2022-11-26 15:48:11,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:48:11,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:48:11,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 15:48:11,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 31: [2022-11-26 15:48:11,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 15:48:11,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 26: [2022-11-26 15:48:11,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
16: [2022-11-26 15:48:11,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:48:11,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 16: [2022-11-26 15:48:11,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 15:48:11,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 26: [2022-11-26 15:48:11,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 24: [2022-11-26 15:48:11,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:48:11,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 15:48:11,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 14: [2022-11-26 15:48:11,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:48:11,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 11: [2022-11-26 15:48:11,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:48:11,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
11: [2022-11-26 15:48:11,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 15:48:11,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 21: [2022-11-26 15:48:11,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:48:11,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:48:11,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:48:11,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 21: [2022-11-26 15:48:11,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 2: [2022-11-26 15:48:11,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 21: [2022-11-26 15:48:11,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 15:48:11,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 21: [2022-11-26 15:48:11,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 20: [2022-11-26 15:48:11,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 
30: [2022-11-26 15:48:11,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:48:11,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:48:11,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 30: [2022-11-26 15:48:11,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 15:48:11,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 20: [2022-11-26 15:48:11,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 15:48:11,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 20: [2022-11-26 15:48:11,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 28: [2022-11-26 15:48:11,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:48:11,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:48:11,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 15:48:11,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
9: [2022-11-26 15:48:11,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:48:11,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 15:48:11,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 10: [2022-11-26 15:48:11,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:48:11,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 15:48:11,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 5: [2022-11-26 15:48:11,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:48:11,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 15:48:11,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 8: [2022-11-26 15:48:11,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:48:11,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
8: [2022-11-26 15:48:11,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 15:48:11,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 10: [2022-11-26 15:48:11,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 7: [2022-11-26 15:48:11,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:48:11,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 7: [2022-11-26 15:48:11,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 15:48:11,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 31: [2022-11-26 15:48:11,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 25: [2022-11-26 15:48:11,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 15:48:11,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 15:48:11,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
31: [2022-11-26 15:48:11,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 15:48:11,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 4: [2022-11-26 15:48:11,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:48:11,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:48:11,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:48:11,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 15:48:11,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 15:48:11,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 15:48:11,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 4: [2022-11-26 15:48:11,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 4: [2022-11-26 15:48:11,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
28: [2022-11-26 15:48:11,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 15:48:11,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 16: [2022-11-26 15:48:11,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:48:11,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 15:48:11,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 5: [2022-11-26 15:48:11,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:48:11,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 15:48:11,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 6: [2022-11-26 15:48:11,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:48:11,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 15:48:11,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 13: [2022-11-26 15:48:11,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 15:48:11,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 15:48:11,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 29: [2022-11-26 15:48:11,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:48:11,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 15:48:11,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 15: [2022-11-26 15:48:11,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:48:11,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 15:48:11,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 27: [2022-11-26 15:48:11,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:48:11,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 15:48:11,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 20: [2022-11-26 15:48:11,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
20: [2022-11-26 15:48:11,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 15:48:11,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 18: [2022-11-26 15:48:11,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:48:11,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:48:11,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 15:48:11,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 15:48:11,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 18: [2022-11-26 15:48:11,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 18: [2022-11-26 15:48:11,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:48:11,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 15:48:11,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 7: [2022-11-26 15:48:11,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 15:48:11,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 15:48:11,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 11: [2022-11-26 15:48:11,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:48:11,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:48:11,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 19: [2022-11-26 15:48:11,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:48:11,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:48:11,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:48:11,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 11: [2022-11-26 15:48:11,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 19: [2022-11-26 15:48:11,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
19: [2022-11-26 15:48:11,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 15:48:11,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 15:48:11,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 15:48:11,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 19: [2022-11-26 15:48:11,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 19: [2022-11-26 15:48:11,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 3: [2022-11-26 15:48:11,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:48:11,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 15:48:11,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 17: [2022-11-26 15:48:11,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:48:11,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
17: [2022-11-26 15:48:11,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 15:48:11,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 0: [2022-11-26 15:48:11,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 15:48:11,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 23: [2022-11-26 15:48:11,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 15:48:11,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 15:48:11,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 30: [2022-11-26 15:48:11,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 15:48:11,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 15:48:11,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 13: [2022-11-26 15:48:11,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 15:48:11,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 15:48:11,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 1: [2022-11-26 15:48:11,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:48:11,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 15:48:11,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 12: [2022-11-26 15:48:11,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:48:11,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 15:48:11,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 4: [2022-11-26 15:48:11,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:48:11,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 15:48:11,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 22: [2022-11-26 15:48:11,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
22: [2022-11-26 15:48:11,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 15:48:11,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 24: [2022-11-26 15:48:11,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:48:11,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 15:48:11,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 26: [2022-11-26 15:48:11,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:48:11,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 15:48:11,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 18: [2022-11-26 15:48:11,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:48:11,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 15:48:11,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 9: [2022-11-26 15:48:11,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-26 15:48:11,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 15:48:11,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 21: [2022-11-26 15:48:11,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:48:11,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 15:48:11,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 2: [2022-11-26 15:48:11,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:48:11,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 15:48:11,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 8: [2022-11-26 15:48:11,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:48:11,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 15:48:11,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 14: [2022-11-26 15:48:11,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-26 15:48:11,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 15:48:11,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 28: [2022-11-26 15:48:11,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 15:48:11,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 15:48:11,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 25: [2022-11-26 15:48:11,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 15:48:11,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 15:48:11,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 31: [2022-11-26 15:48:11,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:48:11,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 15:48:11,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 10: [2022-11-26 15:48:11,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-26 15:48:11,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 15:48:11,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 16: [2022-11-26 15:48:11,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:48:11,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 15:48:11,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 20: [2022-11-26 15:48:11,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:48:11,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 15:48:11,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 15: [2022-11-26 15:48:11,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:48:11,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 15:48:11,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 29: [2022-11-26 15:48:11,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-26 15:48:11,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 15:48:11,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 6: [2022-11-26 15:48:11,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:48:11,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 15:48:11,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 3: [2022-11-26 15:48:11,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:48:11,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 15:48:11,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 19: [2022-11-26 15:48:11,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:48:11,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 15:48:11,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 5: [2022-11-26 15:48:11,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 15:48:11,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 15:48:11,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 7: [2022-11-26 15:48:11,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:48:11,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 15:48:11,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 11: [2022-11-26 15:48:11,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:48:11,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 15:48:11,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 27: [2022-11-26 15:48:11,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:48:11,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 15:48:11,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 0: [2022-11-26 15:48:11,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
4: [2022-11-26 15:48:11,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:48:11,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 15:48:11,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 0: [2022-11-26 15:48:11,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 15:48:11,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 12: [2022-11-26 15:48:11,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:48:11,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 15:48:11,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 23: [2022-11-26 15:48:11,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:48:11,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
23: [2022-11-26 15:48:11,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 22: [2022-11-26 15:48:11,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 15:48:11,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 23: [2022-11-26 15:48:11,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 17: [2022-11-26 15:48:11,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:48:11,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 15:48:11,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 1: [2022-11-26 15:48:11,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 30: [2022-11-26 15:48:11,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:48:11,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 30: [2022-11-26 15:48:11,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 1: [2022-11-26 15:48:11,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
30: [2022-11-26 15:48:11,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 18: [2022-11-26 15:48:11,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:48:11,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 15:48:11,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 24: [2022-11-26 15:48:11,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:48:11,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 15:48:11,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 13: [2022-11-26 15:48:11,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:48:11,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 15:48:11,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 26: [2022-11-26 15:48:11,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-26 15:48:11,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 15:48:11,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 25: [2022-11-26 15:48:11,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 15:48:11,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 15:48:11,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 9: [2022-11-26 15:48:11,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:48:11,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 15:48:11,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 8: [2022-11-26 15:48:11,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:48:11,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 15:48:11,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 2: [2022-11-26 15:48:11,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-26 15:48:11,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 15:48:11,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 14: [2022-11-26 15:48:11,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:48:11,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 15:48:11,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 28: [2022-11-26 15:48:11,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 15:48:11,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 15:48:11,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 21: [2022-11-26 15:48:11,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:48:11,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 15:48:11,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 5: [2022-11-26 15:48:11,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
31: [2022-11-26 15:48:11,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:48:11,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 31: [2022-11-26 15:48:11,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 5: [2022-11-26 15:48:11,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 31: [2022-11-26 15:48:11,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 29: [2022-11-26 15:48:11,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:48:11,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:48:11,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 15:48:11,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 29: [2022-11-26 15:48:11,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 15:48:11,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 15: [2022-11-26 15:48:11,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-26 15:48:11,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 15:48:11,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 20: [2022-11-26 15:48:11,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 15:48:11,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 15:48:11,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 16: [2022-11-26 15:48:11,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:48:11,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 15:48:11,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 3: [2022-11-26 15:48:11,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:48:11,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 15:48:11,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 6: [2022-11-26 15:48:11,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 15:48:11,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 15:48:11,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 7: [2022-11-26 15:48:11,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:48:11,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 15:48:11,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 19: [2022-11-26 15:48:11,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 15:48:11,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 15:48:11,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 11: [2022-11-26 15:48:11,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:48:11,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 15:48:11,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 27: [2022-11-26 15:48:11,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
27: [2022-11-26 15:48:11,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 15:48:11,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 26: [2022-11-26 15:48:11,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:48:11,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 15:48:11,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 17: [2022-11-26 15:48:11,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:48:11,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 15:48:11,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 1: [2022-11-26 15:48:11,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:48:11,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
1: [2022-11-26 15:48:11,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 0: [2022-11-26 15:48:11,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 1: [2022-11-26 15:48:11,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 0: [2022-11-26 15:48:11,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 30: [2022-11-26 15:48:11,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 15:48:11,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 15:48:11,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 18: [2022-11-26 15:48:11,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:48:11,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 15:48:11,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 22: [2022-11-26 15:48:11,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 23: [2022-11-26 15:48:11,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
22: [2022-11-26 15:48:11,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 23: [2022-11-26 15:48:11,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 22: [2022-11-26 15:48:11,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 23: [2022-11-26 15:48:11,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 24: [2022-11-26 15:48:11,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:48:11,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 15:48:11,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 9: [2022-11-26 15:48:11,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:48:11,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 15:48:11,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 4: [2022-11-26 15:48:11,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 15:48:11,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 15:48:11,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 12: [2022-11-26 15:48:11,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:48:11,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 15:48:11,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 13: [2022-11-26 15:48:11,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:48:11,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 15:48:11,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 25: [2022-11-26 15:48:11,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 15:48:11,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 2: [2022-11-26 15:48:11,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 25: [2022-11-26 15:48:11,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
2: [2022-11-26 15:48:11,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 15:48:11,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 28: [2022-11-26 15:48:11,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:48:11,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 28: [2022-11-26 15:48:11,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 15:48:11,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 8: [2022-11-26 15:48:11,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 15:48:11,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 21: [2022-11-26 15:48:11,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:48:11,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 15:48:11,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 14: [2022-11-26 15:48:11,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 15:48:11,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 15:48:11,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 29: [2022-11-26 15:48:11,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:48:11,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 15:48:11,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 31: [2022-11-26 15:48:11,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:48:11,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 15:48:11,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 16: [2022-11-26 15:48:11,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:48:11,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 15:48:11,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 20: [2022-11-26 15:48:11,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
20: [2022-11-26 15:48:11,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 15:48:11,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 10: [2022-11-26 15:48:11,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:48:11,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:48:11,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 5: [2022-11-26 15:48:11,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 15:48:11,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 10: [2022-11-26 15:48:11,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 15: [2022-11-26 15:48:11,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:48:11,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 15:48:11,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 3: [2022-11-26 15:48:11,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 15:48:11,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 15:48:11,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 7: [2022-11-26 15:48:11,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:48:11,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 15:48:11,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 27: [2022-11-26 15:48:11,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:48:11,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 15:48:11,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 11: [2022-11-26 15:48:11,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:48:11,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 15:48:11,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 19: [2022-11-26 15:48:11,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
19: [2022-11-26 15:48:11,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 15:48:11,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 6: [2022-11-26 15:48:11,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:48:11,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 15:48:11,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 30: [2022-11-26 15:48:11,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 15:48:11,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 15:48:11,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 17: [2022-11-26 15:48:11,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:48:11,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 15:48:11,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 23: [2022-11-26 15:48:11,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
0: [2022-11-26 15:48:11,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 23: [2022-11-26 15:48:11,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 15:48:11,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 0: [2022-11-26 15:48:11,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 15:48:11,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 18: [2022-11-26 15:48:11,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 15:48:11,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 15:48:11,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 1: [2022-11-26 15:48:11,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:48:11,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 15:48:11,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 24: [2022-11-26 15:48:11,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
13: [2022-11-26 15:48:11,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 24: [2022-11-26 15:48:11,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 15:48:11,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 13: [2022-11-26 15:48:11,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 15:48:11,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 22: [2022-11-26 15:48:11,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:48:11,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 25: [2022-11-26 15:48:11,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:48:11,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 25: [2022-11-26 15:48:11,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 15:48:11,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 26: [2022-11-26 15:48:11,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-26 15:48:11,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 15:48:11,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 12: [2022-11-26 15:48:11,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:48:11,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 15:48:11,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 9: [2022-11-26 15:48:11,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:48:11,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 2: [2022-11-26 15:48:11,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:48:11,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 15:48:11,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 2: [2022-11-26 15:48:11,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 4: [2022-11-26 15:48:11,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 2: [2022-11-26 15:48:11,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 9: [2022-11-26 15:48:11,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 28: [2022-11-26 15:48:11,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 15:48:11,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 15:48:11,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 31: [2022-11-26 15:48:11,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:48:11,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 15:48:11,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 14: [2022-11-26 15:48:11,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-26 15:48:11,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 15:48:11,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 16: [2022-11-26 15:48:11,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 15:48:11,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 15:48:11,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 27: [2022-11-26 15:48:11,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 15:48:11,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 15:48:11,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 26: [2022-11-26 15:48:11,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 15:48:11,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 15:48:11,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 15: [2022-11-26 15:48:11,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-26 15:48:11,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 15:48:11,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 11: [2022-11-26 15:48:11,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:48:11,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 25: [2022-11-26 15:48:11,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:48:11,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 7: [2022-11-26 15:48:11,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 25: [2022-11-26 15:48:11,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 7: [2022-11-26 15:48:11,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 25: [2022-11-26 15:48:11,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 7: [2022-11-26 15:48:11,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 3: [2022-11-26 15:48:11,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 15:48:11,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 15:48:11,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 8: [2022-11-26 15:48:11,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:48:11,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 29: [2022-11-26 15:48:11,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:48:11,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 8: [2022-11-26 15:48:11,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 15:48:11,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 12: [2022-11-26 15:48:11,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 29: [2022-11-26 15:48:11,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 15:48:11,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 18: [2022-11-26 15:48:11,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
18: [2022-11-26 15:48:11,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 15:48:11,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 6: [2022-11-26 15:48:11,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:48:11,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:48:11,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:48:11,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 15:48:11,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 5: [2022-11-26 15:48:11,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 1: [2022-11-26 15:48:11,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 5: [2022-11-26 15:48:11,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 20: [2022-11-26 15:48:11,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:48:11,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
20: [2022-11-26 15:48:11,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 15:48:11,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 17: [2022-11-26 15:48:11,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:48:11,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 14: [2022-11-26 15:48:11,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 17: [2022-11-26 15:48:11,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 28: [2022-11-26 15:48:11,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:48:11,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 28: [2022-11-26 15:48:11,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 14: [2022-11-26 15:48:11,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 28: [2022-11-26 15:48:11,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 2: [2022-11-26 15:48:11,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 15:48:11,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 15:48:11,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 30: [2022-11-26 15:48:11,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 15:48:11,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 15:48:11,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 13: [2022-11-26 15:48:11,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:48:11,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 24: [2022-11-26 15:48:11,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:48:11,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 24: [2022-11-26 15:48:11,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 15:48:11,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 19: [2022-11-26 15:48:11,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
19: [2022-11-26 15:48:11,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 15:48:11,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 9: [2022-11-26 15:48:11,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:48:11,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 15:48:11,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 21: [2022-11-26 15:48:11,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 22: [2022-11-26 15:48:11,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:48:11,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 22: [2022-11-26 15:48:11,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 21: [2022-11-26 15:48:11,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 22: [2022-11-26 15:48:11,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 21: [2022-11-26 15:48:11,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-26 15:48:11,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 15:48:11,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 15:48:11,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 15:48:11,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 21: [2022-11-26 15:48:11,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 4: [2022-11-26 15:48:11,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:48:11,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 15:48:11,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 10: [2022-11-26 15:48:11,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:48:11,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 15:48:11,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 23: [2022-11-26 15:48:11,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-26 15:48:11,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 15:48:11,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 0: [2022-11-26 15:48:11,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:48:11,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 15:48:11,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 8: [2022-11-26 15:48:11,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:48:11,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 15:48:11,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 31: [2022-11-26 15:48:11,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 15:48:11,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 15:48:11,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 31: [2022-11-26 15:48:11,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 
31: [2022-11-26 15:48:11,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 15:48:11,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 10: [2022-11-26 15:48:11,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:48:11,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 15:48:11,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:48:11,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 10: [2022-11-26 15:48:11,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step96000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 15:48:11,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
0: successfully saved checkpoint at iteration 96000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2564.10 31: iteration 96010/ 173500 | consumed samples: 24578560 | consumed tokens: 50336890880 | elapsed time per iteration (s): 1.16 | learning rate: 9.624E-05 | global batch size: 256 | lm loss: 1.975483E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 221.121 | TFLOPs: 13.38 | 31: iteration 96020/ 173500 | consumed samples: 24581120 | consumed tokens: 50342133760 | elapsed time per iteration (s): 0.86 | learning rate: 9.622E-05 | global batch size: 256 | lm loss: 2.001519E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.024 | TFLOPs: 17.91 | 31: iteration 96030/ 173500 | consumed samples: 24583680 | consumed tokens: 50347376640 | elapsed time per iteration (s): 0.80 | learning rate: 9.621E-05 | global batch size: 256 | lm loss: 2.003183E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.183 | TFLOPs: 19.25 | 31: iteration 96040/ 173500 | consumed samples: 24586240 | consumed tokens: 50352619520 | elapsed time per iteration (s): 0.83 | learning rate: 9.619E-05 | global batch size: 256 | lm loss: 1.977968E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.652 | TFLOPs: 18.67 | 31: iteration 96050/ 173500 | consumed samples: 24588800 | consumed tokens: 50357862400 | elapsed time per iteration (s): 0.81 | learning rate: 9.617E-05 | global batch size: 256 | lm loss: 1.983972E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.990 | TFLOPs: 19.12 | 31: iteration 96060/ 173500 | consumed samples: 24591360 | consumed tokens: 50363105280 | elapsed time per iteration (s): 0.80 | 
learning rate: 9.616E-05 | global batch size: 256 | lm loss: 1.952889E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.151 | TFLOPs: 19.31 | 31: iteration 96070/ 173500 | consumed samples: 24593920 | consumed tokens: 50368348160 | elapsed time per iteration (s): 0.80 | learning rate: 9.614E-05 | global batch size: 256 | lm loss: 1.983178E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.239 | TFLOPs: 19.43 | 31: iteration 96080/ 173500 | consumed samples: 24596480 | consumed tokens: 50373591040 | elapsed time per iteration (s): 0.79 | learning rate: 9.613E-05 | global batch size: 256 | lm loss: 1.979402E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.987 | TFLOPs: 19.54 | 31: iteration 96090/ 173500 | consumed samples: 24599040 | consumed tokens: 50378833920 | elapsed time per iteration (s): 0.79 | learning rate: 9.611E-05 | global batch size: 256 | lm loss: 1.981604E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.802 | TFLOPs: 19.65 | 31: iteration 96100/ 173500 | consumed samples: 24601600 | consumed tokens: 50384076800 | elapsed time per iteration (s): 0.82 | learning rate: 9.609E-05 | global batch size: 256 | lm loss: 1.997111E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.558 | TFLOPs: 18.85 | 31: iteration 96110/ 173500 | consumed samples: 24604160 | consumed tokens: 50389319680 | elapsed time per iteration (s): 0.82 | learning rate: 9.608E-05 | global batch size: 256 | lm loss: 1.983081E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.731 | TFLOPs: 18.98 | 31: iteration 96120/ 
173500 | consumed samples: 24606720 | consumed tokens: 50394562560 | elapsed time per iteration (s): 0.84 | learning rate: 9.606E-05 | global batch size: 256 | lm loss: 2.005836E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.129 | TFLOPs: 18.52 | 31: iteration 96130/ 173500 | consumed samples: 24609280 | consumed tokens: 50399805440 | elapsed time per iteration (s): 0.82 | learning rate: 9.604E-05 | global batch size: 256 | lm loss: 1.982911E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.451 | TFLOPs: 18.96 | 31: iteration 96140/ 173500 | consumed samples: 24611840 | consumed tokens: 50405048320 | elapsed time per iteration (s): 0.82 | learning rate: 9.603E-05 | global batch size: 256 | lm loss: 1.964232E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.729 | TFLOPs: 18.92 | 31: iteration 96150/ 173500 | consumed samples: 24614400 | consumed tokens: 50410291200 | elapsed time per iteration (s): 0.81 | learning rate: 9.601E-05 | global batch size: 256 | lm loss: 1.985047E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.317 | TFLOPs: 19.02 | 31: iteration 96160/ 173500 | consumed samples: 24616960 | consumed tokens: 50415534080 | elapsed time per iteration (s): 0.76 | learning rate: 9.600E-05 | global batch size: 256 | lm loss: 1.975739E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.690 | TFLOPs: 20.43 | 31: iteration 96170/ 173500 | consumed samples: 24619520 | consumed tokens: 50420776960 | elapsed time per iteration (s): 0.78 | learning rate: 9.598E-05 | global batch size: 256 | lm loss: 1.979494E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 327.818 | TFLOPs: 19.83 | 31: iteration 96180/ 173500 | consumed samples: 24622080 | consumed tokens: 50426019840 | elapsed time per iteration (s): 0.85 | learning rate: 9.596E-05 | global batch size: 256 | lm loss: 1.960762E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.456 | TFLOPs: 18.24 | 31: iteration 96190/ 173500 | consumed samples: 24624640 | consumed tokens: 50431262720 | elapsed time per iteration (s): 0.78 | learning rate: 9.595E-05 | global batch size: 256 | lm loss: 1.952141E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.745 | TFLOPs: 19.95 | 31: iteration 96200/ 173500 | consumed samples: 24627200 | consumed tokens: 50436505600 | elapsed time per iteration (s): 0.77 | learning rate: 9.593E-05 | global batch size: 256 | lm loss: 1.962892E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.876 | TFLOPs: 20.02 | 31: iteration 96210/ 173500 | consumed samples: 24629760 | consumed tokens: 50441748480 | elapsed time per iteration (s): 0.77 | learning rate: 9.591E-05 | global batch size: 256 | lm loss: 2.002647E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.400 | TFLOPs: 20.11 | 31: iteration 96220/ 173500 | consumed samples: 24632320 | consumed tokens: 50446991360 | elapsed time per iteration (s): 0.75 | learning rate: 9.590E-05 | global batch size: 256 | lm loss: 1.964527E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.737 | TFLOPs: 20.67 | 31: iteration 96230/ 173500 | consumed samples: 24634880 | consumed tokens: 50452234240 | elapsed time per iteration (s): 0.77 | learning rate: 
9.588E-05 | global batch size: 256 | lm loss: 1.983384E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.206 | TFLOPs: 20.04 | 31: iteration 96240/ 173500 | consumed samples: 24637440 | consumed tokens: 50457477120 | elapsed time per iteration (s): 0.72 | learning rate: 9.587E-05 | global batch size: 256 | lm loss: 1.975493E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.631 | TFLOPs: 21.58 | 31: iteration 96250/ 173500 | consumed samples: 24640000 | consumed tokens: 50462720000 | elapsed time per iteration (s): 0.71 | learning rate: 9.585E-05 | global batch size: 256 | lm loss: 1.972005E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 359.102 | TFLOPs: 21.72 | 31: iteration 96260/ 173500 | consumed samples: 24642560 | consumed tokens: 50467962880 | elapsed time per iteration (s): 0.74 | learning rate: 9.583E-05 | global batch size: 256 | lm loss: 1.984891E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.325 | TFLOPs: 20.83 | 31: iteration 96270/ 173500 | consumed samples: 24645120 | consumed tokens: 50473205760 | elapsed time per iteration (s): 0.77 | learning rate: 9.582E-05 | global batch size: 256 | lm loss: 1.966310E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.139 | TFLOPs: 20.03 | 31: iteration 96280/ 173500 | consumed samples: 24647680 | consumed tokens: 50478448640 | elapsed time per iteration (s): 0.76 | learning rate: 9.580E-05 | global batch size: 256 | lm loss: 1.981043E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.981 | TFLOPs: 20.51 | 31: iteration 96290/ 173500 | 
consumed samples: 24650240 | consumed tokens: 50483691520 | elapsed time per iteration (s): 0.74 | learning rate: 9.578E-05 | global batch size: 256 | lm loss: 1.958863E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.033 | TFLOPs: 20.93 | 31: iteration 96300/ 173500 | consumed samples: 24652800 | consumed tokens: 50488934400 | elapsed time per iteration (s): 0.79 | learning rate: 9.577E-05 | global batch size: 256 | lm loss: 1.960728E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.948 | TFLOPs: 19.60 | 31: iteration 96310/ 173500 | consumed samples: 24655360 | consumed tokens: 50494177280 | elapsed time per iteration (s): 0.73 | learning rate: 9.575E-05 | global batch size: 256 | lm loss: 1.982063E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.688 | TFLOPs: 21.16 | 31: iteration 96320/ 173500 | consumed samples: 24657920 | consumed tokens: 50499420160 | elapsed time per iteration (s): 0.75 | learning rate: 9.574E-05 | global batch size: 256 | lm loss: 1.982590E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.457 | TFLOPs: 20.54 | 31: iteration 96330/ 173500 | consumed samples: 24660480 | consumed tokens: 50504663040 | elapsed time per iteration (s): 0.75 | learning rate: 9.572E-05 | global batch size: 256 | lm loss: 2.002928E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.366 | TFLOPs: 20.71 | 31: iteration 96340/ 173500 | consumed samples: 24663040 | consumed tokens: 50509905920 | elapsed time per iteration (s): 0.81 | learning rate: 9.570E-05 | global batch size: 256 | lm loss: 1.973531E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 316.972 | TFLOPs: 19.18 | 31: iteration 96350/ 173500 | consumed samples: 24665600 | consumed tokens: 50515148800 | elapsed time per iteration (s): 0.80 | learning rate: 9.569E-05 | global batch size: 256 | lm loss: 1.967938E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.657 | TFLOPs: 19.28 | 31: iteration 96360/ 173500 | consumed samples: 24668160 | consumed tokens: 50520391680 | elapsed time per iteration (s): 0.79 | learning rate: 9.567E-05 | global batch size: 256 | lm loss: 1.981102E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.042 | TFLOPs: 19.60 | 31: iteration 96370/ 173500 | consumed samples: 24670720 | consumed tokens: 50525634560 | elapsed time per iteration (s): 0.78 | learning rate: 9.565E-05 | global batch size: 256 | lm loss: 1.997672E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.309 | TFLOPs: 19.80 | 31: iteration 96380/ 173500 | consumed samples: 24673280 | consumed tokens: 50530877440 | elapsed time per iteration (s): 0.74 | learning rate: 9.564E-05 | global batch size: 256 | lm loss: 1.955524E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.031 | TFLOPs: 20.99 | 31: iteration 96390/ 173500 | consumed samples: 24675840 | consumed tokens: 50536120320 | elapsed time per iteration (s): 0.75 | learning rate: 9.562E-05 | global batch size: 256 | lm loss: 1.989844E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.496 | TFLOPs: 20.66 | 31: iteration 96400/ 173500 | consumed samples: 24678400 | consumed tokens: 50541363200 | elapsed time per iteration (s): 0.75 | learning rate: 9.561E-05 | global batch 
size: 256 | lm loss: 1.982308E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.432 | TFLOPs: 20.53 | 31: iteration 96410/ 173500 | consumed samples: 24680960 | consumed tokens: 50546606080 | elapsed time per iteration (s): 0.73 | learning rate: 9.559E-05 | global batch size: 256 | lm loss: 1.981401E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.458 | TFLOPs: 21.20 | 31: iteration 96420/ 173500 | consumed samples: 24683520 | consumed tokens: 50551848960 | elapsed time per iteration (s): 0.80 | learning rate: 9.557E-05 | global batch size: 256 | lm loss: 1.963328E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.105 | TFLOPs: 19.24 | 31: iteration 96430/ 173500 | consumed samples: 24686080 | consumed tokens: 50557091840 | elapsed time per iteration (s): 0.73 | learning rate: 9.556E-05 | global batch size: 256 | lm loss: 2.004343E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.876 | TFLOPs: 21.11 | 31: iteration 96440/ 173500 | consumed samples: 24688640 | consumed tokens: 50562334720 | elapsed time per iteration (s): 0.78 | learning rate: 9.554E-05 | global batch size: 256 | lm loss: 1.961443E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.461 | TFLOPs: 19.87 | 31: iteration 96450/ 173500 | consumed samples: 24691200 | consumed tokens: 50567577600 | elapsed time per iteration (s): 0.74 | learning rate: 9.552E-05 | global batch size: 256 | lm loss: 1.990600E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.007 | TFLOPs: 21.05 | 31: iteration 96460/ 173500 | consumed samples: 24693760 | 
consumed tokens: 50572820480 | elapsed time per iteration (s): 0.74 | learning rate: 9.551E-05 | global batch size: 256 | lm loss: 1.985040E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.235 | TFLOPs: 20.89 | 31: iteration 96470/ 173500 | consumed samples: 24696320 | consumed tokens: 50578063360 | elapsed time per iteration (s): 0.74 | learning rate: 9.549E-05 | global batch size: 256 | lm loss: 1.967819E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.422 | TFLOPs: 20.84 | 31: iteration 96480/ 173500 | consumed samples: 24698880 | consumed tokens: 50583306240 | elapsed time per iteration (s): 0.74 | learning rate: 9.548E-05 | global batch size: 256 | lm loss: 1.974421E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.257 | TFLOPs: 21.01 | 31: iteration 96490/ 173500 | consumed samples: 24701440 | consumed tokens: 50588549120 | elapsed time per iteration (s): 0.73 | learning rate: 9.546E-05 | global batch size: 256 | lm loss: 1.972669E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.272 | TFLOPs: 21.19 | 31: iteration 96500/ 173500 | consumed samples: 24704000 | consumed tokens: 50593792000 | elapsed time per iteration (s): 0.76 | learning rate: 9.544E-05 | global batch size: 256 | lm loss: 2.011987E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.030 | TFLOPs: 20.45 | 31: iteration 96510/ 173500 | consumed samples: 24706560 | consumed tokens: 50599034880 | elapsed time per iteration (s): 0.72 | learning rate: 9.543E-05 | global batch size: 256 | lm loss: 1.993649E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 353.979 | TFLOPs: 21.41 | 31: iteration 96520/ 173500 | consumed samples: 24709120 | consumed tokens: 50604277760 | elapsed time per iteration (s): 0.75 | learning rate: 9.541E-05 | global batch size: 256 | lm loss: 1.987528E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.518 | TFLOPs: 20.60 | 31: iteration 96530/ 173500 | consumed samples: 24711680 | consumed tokens: 50609520640 | elapsed time per iteration (s): 0.72 | learning rate: 9.539E-05 | global batch size: 256 | lm loss: 1.986957E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.210 | TFLOPs: 21.55 | 31: iteration 96540/ 173500 | consumed samples: 24714240 | consumed tokens: 50614763520 | elapsed time per iteration (s): 0.73 | learning rate: 9.538E-05 | global batch size: 256 | lm loss: 1.991101E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.765 | TFLOPs: 21.22 | 31: iteration 96550/ 173500 | consumed samples: 24716800 | consumed tokens: 50620006400 | elapsed time per iteration (s): 0.76 | learning rate: 9.536E-05 | global batch size: 256 | lm loss: 1.983722E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.782 | TFLOPs: 20.31 | 31: iteration 96560/ 173500 | consumed samples: 24719360 | consumed tokens: 50625249280 | elapsed time per iteration (s): 0.74 | learning rate: 9.535E-05 | global batch size: 256 | lm loss: 1.995167E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.779 | TFLOPs: 20.98 | 31: iteration 96570/ 173500 | consumed samples: 24721920 | consumed tokens: 50630492160 | elapsed time per iteration (s): 0.78 | learning rate: 9.533E-05 | global batch size: 256 | lm loss: 
1.974071E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.703 | TFLOPs: 19.89 | 31: iteration 96580/ 173500 | consumed samples: 24724480 | consumed tokens: 50635735040 | elapsed time per iteration (s): 0.79 | learning rate: 9.531E-05 | global batch size: 256 | lm loss: 1.966263E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.579 | TFLOPs: 19.52 | 31: iteration 96590/ 173500 | consumed samples: 24727040 | consumed tokens: 50640977920 | elapsed time per iteration (s): 0.81 | learning rate: 9.530E-05 | global batch size: 256 | lm loss: 1.952805E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.268 | TFLOPs: 19.19 | 31: iteration 96600/ 173500 | consumed samples: 24729600 | consumed tokens: 50646220800 | elapsed time per iteration (s): 0.77 | learning rate: 9.528E-05 | global batch size: 256 | lm loss: 1.985693E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.395 | TFLOPs: 19.99 | 31: iteration 96610/ 173500 | consumed samples: 24732160 | consumed tokens: 50651463680 | elapsed time per iteration (s): 0.77 | learning rate: 9.526E-05 | global batch size: 256 | lm loss: 2.010992E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.991 | TFLOPs: 20.08 | 31: iteration 96620/ 173500 | consumed samples: 24734720 | consumed tokens: 50656706560 | elapsed time per iteration (s): 0.77 | learning rate: 9.525E-05 | global batch size: 256 | lm loss: 1.996878E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.786 | TFLOPs: 20.13 | 31: iteration 96630/ 173500 | consumed samples: 24737280 | consumed tokens: 
50661949440 | elapsed time per iteration (s): 0.81 | learning rate: 9.523E-05 | global batch size: 256 | lm loss: 1.968615E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.705 | TFLOPs: 19.04 | 31: iteration 96640/ 173500 | consumed samples: 24739840 | consumed tokens: 50667192320 | elapsed time per iteration (s): 0.79 | learning rate: 9.522E-05 | global batch size: 256 | lm loss: 1.970695E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.097 | TFLOPs: 19.67 | 31: iteration 96650/ 173500 | consumed samples: 24742400 | consumed tokens: 50672435200 | elapsed time per iteration (s): 0.93 | learning rate: 9.520E-05 | global batch size: 256 | lm loss: 1.986057E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 276.109 | TFLOPs: 16.70 | 31: iteration 96660/ 173500 | consumed samples: 24744960 | consumed tokens: 50677678080 | elapsed time per iteration (s): 0.79 | learning rate: 9.518E-05 | global batch size: 256 | lm loss: 2.003917E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.547 | TFLOPs: 19.57 | 31: iteration 96670/ 173500 | consumed samples: 24747520 | consumed tokens: 50682920960 | elapsed time per iteration (s): 0.79 | learning rate: 9.517E-05 | global batch size: 256 | lm loss: 1.966942E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.333 | TFLOPs: 19.50 | 31: iteration 96680/ 173500 | consumed samples: 24750080 | consumed tokens: 50688163840 | elapsed time per iteration (s): 0.84 | learning rate: 9.515E-05 | global batch size: 256 | lm loss: 1.992844E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 306.074 | TFLOPs: 18.52 | 31: iteration 96690/ 173500 | consumed samples: 24752640 | consumed tokens: 50693406720 | elapsed time per iteration (s): 0.82 | learning rate: 9.513E-05 | global batch size: 256 | lm loss: 1.973733E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.963 | TFLOPs: 18.81 | 31: iteration 96700/ 173500 | consumed samples: 24755200 | consumed tokens: 50698649600 | elapsed time per iteration (s): 0.87 | learning rate: 9.512E-05 | global batch size: 256 | lm loss: 1.985441E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.117 | TFLOPs: 17.73 | 31: iteration 96710/ 173500 | consumed samples: 24757760 | consumed tokens: 50703892480 | elapsed time per iteration (s): 0.79 | learning rate: 9.510E-05 | global batch size: 256 | lm loss: 1.991149E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.501 | TFLOPs: 19.57 | 31: iteration 96720/ 173500 | consumed samples: 24760320 | consumed tokens: 50709135360 | elapsed time per iteration (s): 0.81 | learning rate: 9.509E-05 | global batch size: 256 | lm loss: 1.975387E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.782 | TFLOPs: 19.16 | 31: iteration 96730/ 173500 | consumed samples: 24762880 | consumed tokens: 50714378240 | elapsed time per iteration (s): 0.79 | learning rate: 9.507E-05 | global batch size: 256 | lm loss: 1.976631E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.635 | TFLOPs: 19.64 | 31: iteration 96740/ 173500 | consumed samples: 24765440 | consumed tokens: 50719621120 | elapsed time per iteration (s): 0.79 | learning rate: 9.505E-05 | global batch size: 256 | lm loss: 1.952428E+00 | grad 
norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.908 | TFLOPs: 19.66 | 31: iteration 96750/ 173500 | consumed samples: 24768000 | consumed tokens: 50724864000 | elapsed time per iteration (s): 0.81 | learning rate: 9.504E-05 | global batch size: 256 | lm loss: 1.988913E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.953 | TFLOPs: 19.11 | 31: iteration 96760/ 173500 | consumed samples: 24770560 | consumed tokens: 50730106880 | elapsed time per iteration (s): 0.82 | learning rate: 9.502E-05 | global batch size: 256 | lm loss: 1.926533E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.616 | TFLOPs: 18.97 | 31: iteration 96770/ 173500 | consumed samples: 24773120 | consumed tokens: 50735349760 | elapsed time per iteration (s): 3.23 | learning rate: 9.500E-05 | global batch size: 256 | lm loss: 1.991207E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 79.138 | TFLOPs: 4.79 | 31: iteration 96780/ 173500 | consumed samples: 24775680 | consumed tokens: 50740592640 | elapsed time per iteration (s): 0.79 | learning rate: 9.499E-05 | global batch size: 256 | lm loss: 1.968277E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.023 | TFLOPs: 19.72 | 31: iteration 96790/ 173500 | consumed samples: 24778240 | consumed tokens: 50745835520 | elapsed time per iteration (s): 0.81 | learning rate: 9.497E-05 | global batch size: 256 | lm loss: 1.984393E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.828 | TFLOPs: 19.11 | 31: iteration 96800/ 173500 | consumed samples: 24780800 | consumed tokens: 50751078400 | elapsed time 
per iteration (s): 0.80 | learning rate: 9.496E-05 | global batch size: 256 | lm loss: 1.962832E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.394 | TFLOPs: 19.26 | 31: iteration 96810/ 173500 | consumed samples: 24783360 | consumed tokens: 50756321280 | elapsed time per iteration (s): 0.82 | learning rate: 9.494E-05 | global batch size: 256 | lm loss: 1.966410E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.489 | TFLOPs: 18.97 | 31: iteration 96820/ 173500 | consumed samples: 24785920 | consumed tokens: 50761564160 | elapsed time per iteration (s): 0.81 | learning rate: 9.492E-05 | global batch size: 256 | lm loss: 1.970324E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.169 | TFLOPs: 19.19 | 31: iteration 96830/ 173500 | consumed samples: 24788480 | consumed tokens: 50766807040 | elapsed time per iteration (s): 0.85 | learning rate: 9.491E-05 | global batch size: 256 | lm loss: 1.981944E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.458 | TFLOPs: 18.24 | 31: iteration 96840/ 173500 | consumed samples: 24791040 | consumed tokens: 50772049920 | elapsed time per iteration (s): 0.90 | learning rate: 9.489E-05 | global batch size: 256 | lm loss: 1.988264E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.547 | TFLOPs: 17.21 | 31: iteration 96850/ 173500 | consumed samples: 24793600 | consumed tokens: 50777292800 | elapsed time per iteration (s): 0.80 | learning rate: 9.487E-05 | global batch size: 256 | lm loss: 2.002508E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.394 | TFLOPs: 
19.44 | 31: iteration 96860/ 173500 | consumed samples: 24796160 | consumed tokens: 50782535680 | elapsed time per iteration (s): 0.83 | learning rate: 9.486E-05 | global batch size: 256 | lm loss: 1.971659E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.325 | TFLOPs: 18.59 | 31: iteration 96870/ 173500 | consumed samples: 24798720 | consumed tokens: 50787778560 | elapsed time per iteration (s): 0.83 | learning rate: 9.484E-05 | global batch size: 256 | lm loss: 1.965923E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.181 | TFLOPs: 18.58 | 31: iteration 96880/ 173500 | consumed samples: 24801280 | consumed tokens: 50793021440 | elapsed time per iteration (s): 0.82 | learning rate: 9.483E-05 | global batch size: 256 | lm loss: 1.963465E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.532 | TFLOPs: 18.79 | 31: iteration 96890/ 173500 | consumed samples: 24803840 | consumed tokens: 50798264320 | elapsed time per iteration (s): 0.75 | learning rate: 9.481E-05 | global batch size: 256 | lm loss: 1.969360E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.228 | TFLOPs: 20.76 | 31: iteration 96900/ 173500 | consumed samples: 24806400 | consumed tokens: 50803507200 | elapsed time per iteration (s): 0.74 | learning rate: 9.479E-05 | global batch size: 256 | lm loss: 1.961556E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.626 | TFLOPs: 20.91 | 31: iteration 96910/ 173500 | consumed samples: 24808960 | consumed tokens: 50808750080 | elapsed time per iteration (s): 0.72 | learning rate: 9.478E-05 | global batch size: 256 | lm loss: 2.003262E+00 | grad norm: 0.147 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.399 | TFLOPs: 21.50 | 31: iteration 96920/ 173500 | consumed samples: 24811520 | consumed tokens: 50813992960 | elapsed time per iteration (s): 0.72 | learning rate: 9.476E-05 | global batch size: 256 | lm loss: 1.967724E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.116 | TFLOPs: 21.36 | 31: iteration 96930/ 173500 | consumed samples: 24814080 | consumed tokens: 50819235840 | elapsed time per iteration (s): 0.73 | learning rate: 9.475E-05 | global batch size: 256 | lm loss: 1.984256E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.899 | TFLOPs: 21.23 | 31: iteration 96940/ 173500 | consumed samples: 24816640 | consumed tokens: 50824478720 | elapsed time per iteration (s): 0.73 | learning rate: 9.473E-05 | global batch size: 256 | lm loss: 1.963728E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.685 | TFLOPs: 21.28 | 31: iteration 96950/ 173500 | consumed samples: 24819200 | consumed tokens: 50829721600 | elapsed time per iteration (s): 0.75 | learning rate: 9.471E-05 | global batch size: 256 | lm loss: 1.964797E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.460 | TFLOPs: 20.54 | 31: iteration 96960/ 173500 | consumed samples: 24821760 | consumed tokens: 50834964480 | elapsed time per iteration (s): 0.74 | learning rate: 9.470E-05 | global batch size: 256 | lm loss: 1.979604E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.350 | TFLOPs: 20.89 | 31: iteration 96970/ 173500 | consumed samples: 24824320 | consumed tokens: 50840207360 | elapsed time per iteration (s): 0.81 | 
learning rate: 9.468E-05 | global batch size: 256 | lm loss: 1.983692E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.444 | TFLOPs: 19.14 | 31: iteration 96980/ 173500 | consumed samples: 24826880 | consumed tokens: 50845450240 | elapsed time per iteration (s): 0.74 | learning rate: 9.466E-05 | global batch size: 256 | lm loss: 1.997225E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.228 | TFLOPs: 20.82 | 31: iteration 96990/ 173500 | consumed samples: 24829440 | consumed tokens: 50850693120 | elapsed time per iteration (s): 0.78 | learning rate: 9.465E-05 | global batch size: 256 | lm loss: 1.972011E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.431 | TFLOPs: 19.75 | 31: iteration 97000/ 173500 | consumed samples: 24832000 | consumed tokens: 50855936000 | elapsed time per iteration (s): 0.74 | learning rate: 9.463E-05 | global batch size: 256 | lm loss: 2.005594E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.180 | TFLOPs: 21.00 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 97000 | lm loss value: 1.952218E+00 | lm loss PPL: 7.044292E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 97000 to checkpoints_1b1long 0: [2022-11-26 16:01:38,913] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step97000 is begin to save! 0: [2022-11-26 16:01:38,925] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 16:01:39,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_01-model_00-model_states.pt. 0: [2022-11-26 16:01:39,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_03-model_00-model_states.pt... 0: [2022-11-26 16:01:39,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_03-model_00-model_states.pt. 0: [2022-11-26 16:01:39,228] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_04-model_00-model_states.pt... 0: [2022-11-26 16:01:39,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_04-model_00-model_states.pt. 0: [2022-11-26 16:01:39,304] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_05-model_00-model_states.pt... 0: [2022-11-26 16:01:39,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_05-model_00-model_states.pt. 0: [2022-11-26 16:01:39,381] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_06-model_00-model_states.pt... 0: [2022-11-26 16:01:39,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_06-model_00-model_states.pt. 0: [2022-11-26 16:01:39,459] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_07-model_00-model_states.pt... 0: [2022-11-26 16:01:39,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_07-model_00-model_states.pt. 0: [2022-11-26 16:01:39,536] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 16:01:39,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_08-model_00-model_states.pt. 0: [2022-11-26 16:01:39,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_09-model_00-model_states.pt... 0: [2022-11-26 16:01:39,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_09-model_00-model_states.pt. 0: [2022-11-26 16:01:39,698] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_10-model_00-model_states.pt... 0: [2022-11-26 16:01:39,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_10-model_00-model_states.pt. 0: [2022-11-26 16:01:39,774] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_11-model_00-model_states.pt... 0: [2022-11-26 16:01:39,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_11-model_00-model_states.pt. 0: [2022-11-26 16:01:39,847] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_12-model_00-model_states.pt... 0: [2022-11-26 16:01:39,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_12-model_00-model_states.pt. 0: [2022-11-26 16:01:39,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_13-model_00-model_states.pt... 0: [2022-11-26 16:01:40,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_13-model_00-model_states.pt. 0: [2022-11-26 16:01:40,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 16:01:40,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_14-model_00-model_states.pt. 0: [2022-11-26 16:01:40,078] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_15-model_00-model_states.pt... 0: [2022-11-26 16:01:40,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_15-model_00-model_states.pt. 0: [2022-11-26 16:01:40,154] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_16-model_00-model_states.pt... 0: [2022-11-26 16:01:40,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_16-model_00-model_states.pt. 0: [2022-11-26 16:01:40,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_17-model_00-model_states.pt... 0: [2022-11-26 16:01:40,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_17-model_00-model_states.pt. 0: [2022-11-26 16:01:40,304] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_18-model_00-model_states.pt... 0: [2022-11-26 16:01:40,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_18-model_00-model_states.pt. 0: [2022-11-26 16:01:40,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_19-model_00-model_states.pt... 0: [2022-11-26 16:01:40,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_19-model_00-model_states.pt. 0: [2022-11-26 16:01:40,456] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 16:01:40,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_20-model_00-model_states.pt. 0: [2022-11-26 16:01:40,529] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_21-model_00-model_states.pt... 0: [2022-11-26 16:01:40,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_21-model_00-model_states.pt. 0: [2022-11-26 16:01:40,604] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_22-model_00-model_states.pt... 0: [2022-11-26 16:01:40,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_22-model_00-model_states.pt. 0: [2022-11-26 16:01:40,679] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_23-model_00-model_states.pt... 0: [2022-11-26 16:01:40,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_23-model_00-model_states.pt. 0: [2022-11-26 16:01:40,756] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_24-model_00-model_states.pt... 0: [2022-11-26 16:01:40,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_24-model_00-model_states.pt. 0: [2022-11-26 16:01:40,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_25-model_00-model_states.pt... 0: [2022-11-26 16:01:40,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_25-model_00-model_states.pt. 0: [2022-11-26 16:01:40,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 16:01:40,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_26-model_00-model_states.pt. 0: [2022-11-26 16:01:40,978] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_27-model_00-model_states.pt... 0: [2022-11-26 16:01:41,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_27-model_00-model_states.pt. 0: [2022-11-26 16:01:41,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_28-model_00-model_states.pt... 0: [2022-11-26 16:01:41,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_28-model_00-model_states.pt. 0: [2022-11-26 16:01:41,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/layer_30-model_00-model_states.pt... 0: [2022-11-26 16:01:41,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/layer_30-model_00-model_states.pt. 0: [2022-11-26 16:01:41,134] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step97000/mp_rank_00_model_states.pt 0: [2022-11-26 16:01:41,134] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/mp_rank_00_model_states.pt... 0: [2022-11-26 16:01:41,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/mp_rank_00_model_states.pt. 0: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 
0: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
4: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
2: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 
12: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 
25: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 
28: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 
31: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 
17: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 
19: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 
6: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 
9: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
1: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
12: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 
28: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 
21: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 
27: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
10: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
12: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 
14: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 
17: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
7: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 
28: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
0: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:01:41,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
24: [2022-11-26 16:01:41,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:01:41,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:01:41,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:01:41,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 26: [2022-11-26 16:01:41,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 30: [2022-11-26 16:01:41,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 26: [2022-11-26 16:01:41,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 7: [2022-11-26 16:01:41,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:01:41,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 16:01:41,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 29: [2022-11-26 16:01:41,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
29: [2022-11-26 16:01:41,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 16:01:41,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 23: [2022-11-26 16:01:41,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:01:41,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 16:01:41,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 9: [2022-11-26 16:01:41,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:01:41,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 16:01:41,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 19: [2022-11-26 16:01:41,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:01:41,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 16:01:41,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 13: [2022-11-26 16:01:41,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 16:01:41,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 16:01:41,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 3: [2022-11-26 16:01:41,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:01:41,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 16:01:41,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 16: [2022-11-26 16:01:41,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:01:41,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 16:01:41,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 31: [2022-11-26 16:01:41,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:01:41,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 16:01:41,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 7: [2022-11-26 16:01:41,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-26 16:01:41,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 16:01:41,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 14: [2022-11-26 16:01:41,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:01:41,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 16:01:41,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 26: [2022-11-26 16:01:41,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:01:41,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 16:01:41,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 10: [2022-11-26 16:01:41,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:01:41,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 16:01:41,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 27: [2022-11-26 16:01:41,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
6: [2022-11-26 16:01:41,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:01:41,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:01:41,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 10: [2022-11-26 16:01:41,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 6: [2022-11-26 16:01:41,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 10: [2022-11-26 16:01:41,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 27: [2022-11-26 16:01:41,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 16:01:41,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 11: [2022-11-26 16:01:41,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:01:41,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 21: [2022-11-26 16:01:41,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:01:41,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 
21: [2022-11-26 16:01:41,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 11: [2022-11-26 16:01:41,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:01:41,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 11: [2022-11-26 16:01:41,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 16:01:41,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 27: [2022-11-26 16:01:41,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:01:41,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:01:41,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 9: [2022-11-26 16:01:41,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:01:41,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 
27: [2022-11-26 16:01:41,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 9: [2022-11-26 16:01:41,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 27: [2022-11-26 16:01:41,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 9: [2022-11-26 16:01:41,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 8: [2022-11-26 16:01:41,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:01:41,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 16:01:41,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 16: [2022-11-26 16:01:41,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:01:41,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 16:01:41,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 16: [2022-11-26 16:01:41,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 
16: [2022-11-26 16:01:41,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 16:01:41,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 12: [2022-11-26 16:01:41,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:01:41,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 31: [2022-11-26 16:01:41,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:01:41,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:01:41,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 12: [2022-11-26 16:01:41,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 31: [2022-11-26 16:01:41,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 19: [2022-11-26 16:01:41,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 31: [2022-11-26 16:01:41,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 6: [2022-11-26 16:01:41,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 16:01:41,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 16:01:41,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 21: [2022-11-26 16:01:41,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:01:41,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 16:01:41,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 24: [2022-11-26 16:01:41,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:01:41,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 16:01:41,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 18: [2022-11-26 16:01:41,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:01:41,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 30: [2022-11-26 16:01:41,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:01:41,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 
30: [2022-11-26 16:01:41,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 16:01:41,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 18: [2022-11-26 16:01:41,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:01:41,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 16:01:41,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 30: [2022-11-26 16:01:41,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:01:41,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 16:01:41,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 7: [2022-11-26 16:01:41,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:01:41,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 16:01:41,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 3: [2022-11-26 16:01:41,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 16:01:41,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 16:01:41,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 4: [2022-11-26 16:01:41,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:01:41,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:01:41,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 8: [2022-11-26 16:01:41,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 4: [2022-11-26 16:01:41,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 8: [2022-11-26 16:01:41,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 24: [2022-11-26 16:01:41,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:01:41,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 16:01:41,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 10: [2022-11-26 16:01:41,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-26 16:01:41,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 16:01:41,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 29: [2022-11-26 16:01:41,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:01:41,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:01:41,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:01:41,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 2: [2022-11-26 16:01:41,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 16:01:41,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 16:01:41,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 2: [2022-11-26 16:01:41,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 29: [2022-11-26 16:01:41,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 7: [2022-11-26 16:01:41,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 16:01:41,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 16:01:41,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 12: [2022-11-26 16:01:41,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:01:41,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 19: [2022-11-26 16:01:41,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:01:41,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 19: [2022-11-26 16:01:41,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 16:01:41,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 21: [2022-11-26 16:01:41,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:01:41,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 9: [2022-11-26 16:01:41,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:01:41,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 
9: [2022-11-26 16:01:41,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 16: [2022-11-26 16:01:41,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:01:41,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 16: [2022-11-26 16:01:41,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 16:01:41,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 14: [2022-11-26 16:01:41,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:01:41,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 24: [2022-11-26 16:01:41,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:01:41,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 24: [2022-11-26 16:01:41,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 16:01:41,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 18: [2022-11-26 16:01:41,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
18: [2022-11-26 16:01:41,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 31: [2022-11-26 16:01:41,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:01:41,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 31: [2022-11-26 16:01:41,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 9: [2022-11-26 16:01:41,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:01:41,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 9: [2022-11-26 16:01:41,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 31: [2022-11-26 16:01:41,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:01:41,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 31: [2022-11-26 16:01:41,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 16:01:41,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 5: [2022-11-26 16:01:41,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 16:01:41,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:01:41,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 16:01:41,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 16:01:41,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 5: [2022-11-26 16:01:41,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 5: [2022-11-26 16:01:41,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:01:41,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 16:01:41,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 6: [2022-11-26 16:01:41,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:01:41,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 16:01:41,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 16:01:41,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 16:01:41,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 6: [2022-11-26 16:01:41,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 27: [2022-11-26 16:01:41,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:01:41,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 16:01:41,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 28: [2022-11-26 16:01:41,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:01:41,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:01:41,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 16:01:41,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 29: [2022-11-26 16:01:41,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
29: [2022-11-26 16:01:41,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 16:01:41,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 19: [2022-11-26 16:01:41,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:01:41,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 16:01:41,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 8: [2022-11-26 16:01:41,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:01:41,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:01:41,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 14: [2022-11-26 16:01:41,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 16:01:41,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 8: [2022-11-26 16:01:41,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 14: [2022-11-26 16:01:41,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-26 16:01:41,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 5: [2022-11-26 16:01:41,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:01:41,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 5: [2022-11-26 16:01:41,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 16:01:41,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 17: [2022-11-26 16:01:41,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:01:41,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 21: [2022-11-26 16:01:41,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:01:41,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 21: [2022-11-26 16:01:41,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 17: [2022-11-26 16:01:41,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-26 16:01:41,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 21: [2022-11-26 16:01:41,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 17: [2022-11-26 16:01:41,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 17: [2022-11-26 16:01:41,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:01:41,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 16:01:41,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 10: [2022-11-26 16:01:41,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:01:41,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:01:41,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 23: [2022-11-26 16:01:41,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 10: [2022-11-26 16:01:41,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 23: [2022-11-26 16:01:41,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 
11: [2022-11-26 16:01:41,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:01:41,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:01:41,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 16:01:41,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 16:01:41,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 11: [2022-11-26 16:01:41,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 24: [2022-11-26 16:01:41,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:01:41,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 16:01:41,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 12: [2022-11-26 16:01:41,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 16:01:41,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 4: [2022-11-26 16:01:41,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:01:41,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 4: [2022-11-26 16:01:41,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 16:01:41,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 17: [2022-11-26 16:01:41,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:01:41,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 16:01:41,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 18: [2022-11-26 16:01:41,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:01:41,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 4: [2022-11-26 16:01:41,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:01:41,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 
4: [2022-11-26 16:01:41,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:01:41,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 23: [2022-11-26 16:01:41,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:01:41,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 23: [2022-11-26 16:01:41,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 4: [2022-11-26 16:01:41,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 23: [2022-11-26 16:01:41,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 4: [2022-11-26 16:01:41,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 23: [2022-11-26 16:01:41,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:01:41,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:01:41,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 16:01:41,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 
30: [2022-11-26 16:01:41,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 16:01:41,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 28: [2022-11-26 16:01:41,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 16:01:41,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 28: [2022-11-26 16:01:41,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:01:41,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 16:01:41,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 28: [2022-11-26 16:01:41,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:01:41,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 16:01:41,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 29: [2022-11-26 16:01:41,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-26 16:01:41,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 16:01:41,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 20: [2022-11-26 16:01:41,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:01:41,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:01:41,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:01:41,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 16:01:41,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 16:01:41,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 20: [2022-11-26 16:01:41,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 20: [2022-11-26 16:01:41,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 16:01:41,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 2: [2022-11-26 16:01:41,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 16:01:41,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 16:01:41,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:01:41,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 13: [2022-11-26 16:01:41,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:01:41,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 13: [2022-11-26 16:01:41,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 16:01:41,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 13: [2022-11-26 16:01:41,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:01:41,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 2: [2022-11-26 16:01:41,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 13: [2022-11-26 16:01:41,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 13: [2022-11-26 16:01:41,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 16:01:41,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:01:41,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 16:01:41,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 16:01:41,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 13: [2022-11-26 16:01:41,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 26: [2022-11-26 16:01:41,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:01:41,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 16:01:41,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 7: [2022-11-26 16:01:41,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:01:41,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 16:01:41,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 3: [2022-11-26 16:01:41,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 16:01:41,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 16:01:41,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 3: [2022-11-26 16:01:41,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:01:41,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 14: [2022-11-26 16:01:41,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:01:41,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 14: [2022-11-26 16:01:41,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 16:01:41,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 5: [2022-11-26 16:01:41,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:01:41,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 16:01:41,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 15: [2022-11-26 16:01:41,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-26 16:01:41,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:01:41,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:01:41,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:01:41,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 16:01:41,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 16:01:41,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 16:01:41,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 15: [2022-11-26 16:01:41,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 15: [2022-11-26 16:01:41,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 15: [2022-11-26 16:01:41,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 16:01:41,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 22: [2022-11-26 16:01:41,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
22: [2022-11-26 16:01:41,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:01:41,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:01:41,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:01:41,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 16:01:41,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 22: [2022-11-26 16:01:41,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 16:01:41,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 16:01:41,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 16:01:41,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 22: [2022-11-26 16:01:41,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 22: [2022-11-26 16:01:41,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 8: [2022-11-26 16:01:41,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 16:01:41,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 16:01:41,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 25: [2022-11-26 16:01:41,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:01:41,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:01:41,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:01:41,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:01:41,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 16:01:41,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 16:01:41,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 16:01:41,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 16:01:41,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 
25: [2022-11-26 16:01:41,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 25: [2022-11-26 16:01:41,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 25: [2022-11-26 16:01:41,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 16: [2022-11-26 16:01:41,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:01:41,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 16:01:41,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 9: [2022-11-26 16:01:41,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:01:41,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 16:01:41,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 1: [2022-11-26 16:01:41,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:01:41,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:01:41,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 16:01:41,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 16:01:41,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 1: [2022-11-26 16:01:41,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 16:01:41,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 16:01:41,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 1: [2022-11-26 16:01:41,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 23: [2022-11-26 16:01:41,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:01:41,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 16:01:41,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 29: [2022-11-26 16:01:41,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:01:41,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 16:01:41,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 
17: [2022-11-26 16:01:41,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:01:41,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 16:01:41,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 0: [2022-11-26 16:01:41,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:01:41,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:01:41,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 16:01:41,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 0: [2022-11-26 16:01:41,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 16:01:41,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 26: [2022-11-26 16:01:41,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:01:41,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 16:01:41,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 
6: [2022-11-26 16:01:41,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:01:41,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 16:01:41,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 11: [2022-11-26 16:01:41,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:01:41,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 16:01:41,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 0: [2022-11-26 16:01:41,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:01:41,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 16:01:41,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:01:41,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 0: [2022-11-26 16:01:41,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 16:01:41,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 16:01:41,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 26: [2022-11-26 16:01:41,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:01:41,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 16:01:41,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 27: [2022-11-26 16:01:41,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:01:41,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 16:01:41,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 30: [2022-11-26 16:01:41,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:01:41,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 16:01:41,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 25: [2022-11-26 16:01:41,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
25: [2022-11-26 16:01:41,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 16:01:41,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 19: [2022-11-26 16:01:41,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:01:41,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 16:01:41,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 0: [2022-11-26 16:01:41,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 16:01:41,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 10: [2022-11-26 16:01:41,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:01:41,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 16:01:41,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 24: [2022-11-26 16:01:41,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
24: [2022-11-26 16:01:41,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 16:01:41,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 22: [2022-11-26 16:01:41,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:01:41,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 16:01:41,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 3: [2022-11-26 16:01:41,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:01:41,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 16:01:41,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 21: [2022-11-26 16:01:41,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:01:41,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 16:01:41,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 8: [2022-11-26 16:01:41,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 16:01:41,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 16:01:41,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 15: [2022-11-26 16:01:41,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:01:41,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:01:41,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 20: [2022-11-26 16:01:41,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 15: [2022-11-26 16:01:41,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 20: [2022-11-26 16:01:41,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 28: [2022-11-26 16:01:41,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:01:41,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 16:01:41,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 1: [2022-11-26 16:01:41,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 16:01:41,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 16:01:41,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 12: [2022-11-26 16:01:41,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:01:41,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 16:01:41,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 2: [2022-11-26 16:01:41,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:01:41,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 16:01:41,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 7: [2022-11-26 16:01:41,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:01:41,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 16:01:41,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 18: [2022-11-26 16:01:41,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
18: [2022-11-26 16:01:41,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 16:01:41,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 31: [2022-11-26 16:01:41,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:01:41,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 16:01:41,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 4: [2022-11-26 16:01:41,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:01:41,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 16:01:41,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 14: [2022-11-26 16:01:41,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:01:41,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 16:01:41,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 16: [2022-11-26 16:01:41,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
16: [2022-11-26 16:01:41,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 16:01:41,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 5: [2022-11-26 16:01:41,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:01:41,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 16:01:41,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 13: [2022-11-26 16:01:41,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:01:41,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 16:01:41,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 11: [2022-11-26 16:01:41,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:01:41,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 16:01:41,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 26: [2022-11-26 16:01:41,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
26: [2022-11-26 16:01:41,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 16:01:41,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 6: [2022-11-26 16:01:41,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:01:41,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 16:01:41,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 23: [2022-11-26 16:01:41,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:01:41,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 16:01:41,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 29: [2022-11-26 16:01:41,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:01:41,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:01:41,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 16:01:41,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 
17: [2022-11-26 16:01:41,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 16:01:41,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 27: [2022-11-26 16:01:41,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:01:41,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 16:01:41,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 25: [2022-11-26 16:01:41,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:01:41,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 16:01:41,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 0: [2022-11-26 16:01:41,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:01:41,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 16:01:41,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 21: [2022-11-26 16:01:41,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
21: [2022-11-26 16:01:41,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 16:01:41,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 9: [2022-11-26 16:01:41,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:01:41,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 16:01:41,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 30: [2022-11-26 16:01:41,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:01:41,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 16:01:41,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 24: [2022-11-26 16:01:41,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:01:41,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 16:01:41,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 22: [2022-11-26 16:01:41,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
22: [2022-11-26 16:01:41,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 16:01:41,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 10: [2022-11-26 16:01:41,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:01:41,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 16:01:41,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 31: [2022-11-26 16:01:41,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:01:41,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 16:01:41,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 19: [2022-11-26 16:01:41,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:01:41,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 16:01:41,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 15: [2022-11-26 16:01:41,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 16:01:41,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 16:01:41,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 3: [2022-11-26 16:01:41,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:01:41,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 16:01:41,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 20: [2022-11-26 16:01:41,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:01:41,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 16:01:41,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 8: [2022-11-26 16:01:41,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:01:41,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 16:01:41,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 4: [2022-11-26 16:01:41,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 16:01:41,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 12: [2022-11-26 16:01:41,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:01:41,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 12: [2022-11-26 16:01:41,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 16:01:41,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 7: [2022-11-26 16:01:41,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:01:41,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 16:01:41,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 14: [2022-11-26 16:01:41,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:01:41,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 16:01:41,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 18: [2022-11-26 16:01:41,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-26 16:01:41,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 16:01:41,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 16: [2022-11-26 16:01:41,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:01:41,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 16:01:41,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 13: [2022-11-26 16:01:41,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:01:41,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:01:41,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 16:01:41,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 1: [2022-11-26 16:01:41,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 16:01:41,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 23: [2022-11-26 16:01:41,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-26 16:01:41,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 16:01:41,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 2: [2022-11-26 16:01:41,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:01:41,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 16:01:41,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 11: [2022-11-26 16:01:41,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:01:41,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 16:01:41,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 0: [2022-11-26 16:01:41,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:01:41,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 16:01:41,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 6: [2022-11-26 16:01:41,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
5: [2022-11-26 16:01:41,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:01:41,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 5: [2022-11-26 16:01:41,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 6: [2022-11-26 16:01:41,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 5: [2022-11-26 16:01:41,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 28: [2022-11-26 16:01:41,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:01:41,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:01:41,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 16:01:41,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 27: [2022-11-26 16:01:41,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:01:41,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:01:41,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 
27: [2022-11-26 16:01:41,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 16:01:41,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 29: [2022-11-26 16:01:41,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 16:01:41,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 17: [2022-11-26 16:01:41,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 16:01:41,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 28: [2022-11-26 16:01:41,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 16:01:41,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 25: [2022-11-26 16:01:41,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:01:41,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 16:01:41,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 26: [2022-11-26 16:01:41,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-26 16:01:41,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 16:01:41,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 30: [2022-11-26 16:01:41,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:01:41,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 16:01:41,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 10: [2022-11-26 16:01:41,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:01:41,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 16:01:41,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 24: [2022-11-26 16:01:41,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:01:41,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 16:01:41,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 19: [2022-11-26 16:01:41,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-26 16:01:41,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 16:01:41,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 21: [2022-11-26 16:01:41,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:01:41,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 16:01:41,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 22: [2022-11-26 16:01:41,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:01:41,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:01:41,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 16:01:41,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 31: [2022-11-26 16:01:41,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 16:01:41,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 28: [2022-11-26 16:01:41,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
3: [2022-11-26 16:01:41,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:01:41,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 16:01:41,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 15: [2022-11-26 16:01:41,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:01:41,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 16:01:41,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 28: [2022-11-26 16:01:41,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 16:01:41,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 20: [2022-11-26 16:01:41,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:01:41,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 16:01:41,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 18: [2022-11-26 16:01:41,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
18: [2022-11-26 16:01:41,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 16:01:41,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 21: [2022-11-26 16:01:41,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:01:41,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 16:01:41,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 7: [2022-11-26 16:01:41,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:01:41,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:01:41,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 13: [2022-11-26 16:01:41,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:01:41,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 7: [2022-11-26 16:01:41,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 6: [2022-11-26 16:01:41,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 
13: [2022-11-26 16:01:41,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 16:01:41,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 24: [2022-11-26 16:01:41,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:01:41,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 16:01:41,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 30: [2022-11-26 16:01:41,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:01:41,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:01:41,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:01:41,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 16: [2022-11-26 16:01:41,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 16:01:41,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 
14: [2022-11-26 16:01:41,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 30: [2022-11-26 16:01:41,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 14: [2022-11-26 16:01:41,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 12: [2022-11-26 16:01:41,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:01:41,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 16:01:41,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 4: [2022-11-26 16:01:41,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:01:41,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 16:01:41,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 23: [2022-11-26 16:01:41,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:01:41,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 16:01:41,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 
17: [2022-11-26 16:01:41,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:01:41,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 10: [2022-11-26 16:01:41,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:01:41,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 10: [2022-11-26 16:01:41,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 16:01:41,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 8: [2022-11-26 16:01:41,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:01:41,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 16:01:41,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 1: [2022-11-26 16:01:41,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:01:41,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
29: [2022-11-26 16:01:41,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 1: [2022-11-26 16:01:41,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 29: [2022-11-26 16:01:41,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 1: [2022-11-26 16:01:41,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 11: [2022-11-26 16:01:41,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:01:41,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 16:01:41,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 25: [2022-11-26 16:01:41,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:01:41,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 16:01:41,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 5: [2022-11-26 16:01:41,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 16:01:41,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 16:01:41,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 19: [2022-11-26 16:01:41,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:01:41,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 16:01:41,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 2: [2022-11-26 16:01:41,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:01:41,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 16:01:41,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 20: [2022-11-26 16:01:41,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:01:41,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 16:01:41,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 0: [2022-11-26 16:01:41,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 16:01:41,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 16:01:41,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 22: [2022-11-26 16:01:41,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:01:41,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 16:01:41,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 31: [2022-11-26 16:01:41,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:01:41,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 16:01:41,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 8: [2022-11-26 16:01:41,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:01:41,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 16:01:41,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 3: [2022-11-26 16:01:41,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 16:01:41,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 16:01:41,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 28: [2022-11-26 16:01:41,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:01:41,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:01:41,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:01:41,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 26: [2022-11-26 16:01:41,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 27: [2022-11-26 16:01:41,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 26: [2022-11-26 16:01:41,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 9: [2022-11-26 16:01:41,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:01:41,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
9: [2022-11-26 16:01:41,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 18: [2022-11-26 16:01:41,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 16:01:41,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 9: [2022-11-26 16:01:41,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 15: [2022-11-26 16:01:41,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:01:41,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 16:01:41,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 12: [2022-11-26 16:01:41,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:01:41,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:01:41,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 20: [2022-11-26 16:01:41,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 12: [2022-11-26 16:01:41,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 
20: [2022-11-26 16:01:41,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 4: [2022-11-26 16:01:41,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:01:41,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 16:01:41,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 1: [2022-11-26 16:01:41,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:01:41,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 16:01:41,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 2: [2022-11-26 16:01:41,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:01:41,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 16:01:41,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 1: [2022-11-26 16:01:41,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 16:01:41,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 16:01:41,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 28: [2022-11-26 16:01:41,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 16:01:41,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 28: [2022-11-26 16:01:41,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:01:41,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step97000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 16:01:41,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 
0: successfully saved checkpoint at iteration 97000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2559.00 31: iteration 97010/ 173500 | consumed samples: 24834560 | consumed tokens: 50861178880 | elapsed time per iteration (s): 1.12 | learning rate: 9.462E-05 | global batch size: 256 | lm loss: 2.000168E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.403 | TFLOPs: 13.82 | 31: iteration 97020/ 173500 | consumed samples: 24837120 | consumed tokens: 50866421760 | elapsed time per iteration (s): 0.75 | learning rate: 9.460E-05 | global batch size: 256 | lm loss: 1.963061E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.935 | TFLOPs: 20.57 | 31: iteration 97030/ 173500 | consumed samples: 24839680 | consumed tokens: 50871664640 | elapsed time per iteration (s): 0.84 | learning rate: 9.458E-05 | global batch size: 256 | lm loss: 1.968834E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.615 | TFLOPs: 18.49 | 31: iteration 97040/ 173500 | consumed samples: 24842240 | consumed tokens: 50876907520 | elapsed time per iteration (s): 0.74 | learning rate: 9.457E-05 | global batch size: 256 | lm loss: 1.962327E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.134 | TFLOPs: 21.00 | 31: iteration 97050/ 173500 | consumed samples: 24844800 | consumed tokens: 50882150400 | elapsed time per iteration (s): 0.91 | learning rate: 9.455E-05 | global batch size: 256 | lm loss: 1.971342E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.802 | TFLOPs: 17.11 | 31: iteration 97060/ 173500 | consumed samples: 24847360 | consumed tokens: 50887393280 | elapsed time per iteration (s): 0.79 | 
learning rate: 9.453E-05 | global batch size: 256 | lm loss: 1.973993E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.371 | TFLOPs: 19.62 | 31: iteration 97070/ 173500 | consumed samples: 24849920 | consumed tokens: 50892636160 | elapsed time per iteration (s): 0.79 | learning rate: 9.452E-05 | global batch size: 256 | lm loss: 1.989215E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.396 | TFLOPs: 19.56 | 31: iteration 97080/ 173500 | consumed samples: 24852480 | consumed tokens: 50897879040 | elapsed time per iteration (s): 0.77 | learning rate: 9.450E-05 | global batch size: 256 | lm loss: 1.979530E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.666 | TFLOPs: 20.19 | 31: iteration 97090/ 173500 | consumed samples: 24855040 | consumed tokens: 50903121920 | elapsed time per iteration (s): 0.87 | learning rate: 9.449E-05 | global batch size: 256 | lm loss: 2.003765E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.975 | TFLOPs: 17.78 | 31: iteration 97100/ 173500 | consumed samples: 24857600 | consumed tokens: 50908364800 | elapsed time per iteration (s): 0.87 | learning rate: 9.447E-05 | global batch size: 256 | lm loss: 1.965106E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.567 | TFLOPs: 17.76 | 31: iteration 97110/ 173500 | consumed samples: 24860160 | consumed tokens: 50913607680 | elapsed time per iteration (s): 0.81 | learning rate: 9.445E-05 | global batch size: 256 | lm loss: 1.973967E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.849 | TFLOPs: 19.17 | 31: iteration 97120/ 
173500 | consumed samples: 24862720 | consumed tokens: 50918850560 | elapsed time per iteration (s): 0.78 | learning rate: 9.444E-05 | global batch size: 256 | lm loss: 1.981002E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.407 | TFLOPs: 19.93 | 31: iteration 97130/ 173500 | consumed samples: 24865280 | consumed tokens: 50924093440 | elapsed time per iteration (s): 0.82 | learning rate: 9.442E-05 | global batch size: 256 | lm loss: 1.998818E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.499 | TFLOPs: 18.78 | 31: iteration 97140/ 173500 | consumed samples: 24867840 | consumed tokens: 50929336320 | elapsed time per iteration (s): 0.81 | learning rate: 9.440E-05 | global batch size: 256 | lm loss: 1.981337E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.437 | TFLOPs: 19.02 | 31: iteration 97150/ 173500 | consumed samples: 24870400 | consumed tokens: 50934579200 | elapsed time per iteration (s): 0.90 | learning rate: 9.439E-05 | global batch size: 256 | lm loss: 2.008547E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.513 | TFLOPs: 17.27 | 31: iteration 97160/ 173500 | consumed samples: 24872960 | consumed tokens: 50939822080 | elapsed time per iteration (s): 0.89 | learning rate: 9.437E-05 | global batch size: 256 | lm loss: 2.008801E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.528 | TFLOPs: 17.33 | 31: iteration 97170/ 173500 | consumed samples: 24875520 | consumed tokens: 50945064960 | elapsed time per iteration (s): 0.82 | learning rate: 9.436E-05 | global batch size: 256 | lm loss: 2.000168E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 313.847 | TFLOPs: 18.99 | 31: iteration 97180/ 173500 | consumed samples: 24878080 | consumed tokens: 50950307840 | elapsed time per iteration (s): 0.77 | learning rate: 9.434E-05 | global batch size: 256 | lm loss: 2.008146E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.233 | TFLOPs: 20.10 | 31: iteration 97190/ 173500 | consumed samples: 24880640 | consumed tokens: 50955550720 | elapsed time per iteration (s): 0.80 | learning rate: 9.432E-05 | global batch size: 256 | lm loss: 1.955540E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.515 | TFLOPs: 19.45 | 31: iteration 97200/ 173500 | consumed samples: 24883200 | consumed tokens: 50960793600 | elapsed time per iteration (s): 2.77 | learning rate: 9.431E-05 | global batch size: 256 | lm loss: 1.970486E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 92.573 | TFLOPs: 5.60 | 31: iteration 97210/ 173500 | consumed samples: 24885760 | consumed tokens: 50966036480 | elapsed time per iteration (s): 0.77 | learning rate: 9.429E-05 | global batch size: 256 | lm loss: 1.985565E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.563 | TFLOPs: 20.06 | 31: iteration 97220/ 173500 | consumed samples: 24888320 | consumed tokens: 50971279360 | elapsed time per iteration (s): 0.74 | learning rate: 9.427E-05 | global batch size: 256 | lm loss: 1.938389E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.693 | TFLOPs: 21.03 | 31: iteration 97230/ 173500 | consumed samples: 24890880 | consumed tokens: 50976522240 | elapsed time per iteration (s): 0.88 | learning rate: 9.426E-05 
| global batch size: 256 | lm loss: 1.962259E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.741 | TFLOPs: 17.59 | 31: iteration 97240/ 173500 | consumed samples: 24893440 | consumed tokens: 50981765120 | elapsed time per iteration (s): 0.82 | learning rate: 9.424E-05 | global batch size: 256 | lm loss: 1.966189E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.865 | TFLOPs: 18.87 | 31: iteration 97250/ 173500 | consumed samples: 24896000 | consumed tokens: 50987008000 | elapsed time per iteration (s): 0.78 | learning rate: 9.423E-05 | global batch size: 256 | lm loss: 1.990772E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.990 | TFLOPs: 19.96 | 31: iteration 97260/ 173500 | consumed samples: 24898560 | consumed tokens: 50992250880 | elapsed time per iteration (s): 0.82 | learning rate: 9.421E-05 | global batch size: 256 | lm loss: 1.955007E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.915 | TFLOPs: 18.99 | 31: iteration 97270/ 173500 | consumed samples: 24901120 | consumed tokens: 50997493760 | elapsed time per iteration (s): 0.81 | learning rate: 9.419E-05 | global batch size: 256 | lm loss: 1.983355E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.760 | TFLOPs: 19.16 | 31: iteration 97280/ 173500 | consumed samples: 24903680 | consumed tokens: 51002736640 | elapsed time per iteration (s): 0.79 | learning rate: 9.418E-05 | global batch size: 256 | lm loss: 1.977371E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.465 | TFLOPs: 19.57 | 31: iteration 97290/ 173500 | consumed samples: 
24906240 | consumed tokens: 51007979520 | elapsed time per iteration (s): 0.85 | learning rate: 9.416E-05 | global batch size: 256 | lm loss: 1.962728E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.942 | TFLOPs: 18.21 | 31: iteration 97300/ 173500 | consumed samples: 24908800 | consumed tokens: 51013222400 | elapsed time per iteration (s): 0.74 | learning rate: 9.415E-05 | global batch size: 256 | lm loss: 1.962535E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.673 | TFLOPs: 20.79 | 31: iteration 97310/ 173500 | consumed samples: 24911360 | consumed tokens: 51018465280 | elapsed time per iteration (s): 0.77 | learning rate: 9.413E-05 | global batch size: 256 | lm loss: 1.990059E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.602 | TFLOPs: 20.24 | 31: iteration 97320/ 173500 | consumed samples: 24913920 | consumed tokens: 51023708160 | elapsed time per iteration (s): 0.77 | learning rate: 9.411E-05 | global batch size: 256 | lm loss: 1.970544E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.668 | TFLOPs: 20.07 | 31: iteration 97330/ 173500 | consumed samples: 24916480 | consumed tokens: 51028951040 | elapsed time per iteration (s): 0.75 | learning rate: 9.410E-05 | global batch size: 256 | lm loss: 1.969863E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.859 | TFLOPs: 20.74 | 31: iteration 97340/ 173500 | consumed samples: 24919040 | consumed tokens: 51034193920 | elapsed time per iteration (s): 0.74 | learning rate: 9.408E-05 | global batch size: 256 | lm loss: 1.977479E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 344.332 | TFLOPs: 20.83 | 31: iteration 97350/ 173500 | consumed samples: 24921600 | consumed tokens: 51039436800 | elapsed time per iteration (s): 0.75 | learning rate: 9.406E-05 | global batch size: 256 | lm loss: 1.984336E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.075 | TFLOPs: 20.76 | 31: iteration 97360/ 173500 | consumed samples: 24924160 | consumed tokens: 51044679680 | elapsed time per iteration (s): 0.74 | learning rate: 9.405E-05 | global batch size: 256 | lm loss: 1.975455E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.786 | TFLOPs: 20.92 | 31: iteration 97370/ 173500 | consumed samples: 24926720 | consumed tokens: 51049922560 | elapsed time per iteration (s): 0.80 | learning rate: 9.403E-05 | global batch size: 256 | lm loss: 1.967729E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.362 | TFLOPs: 19.38 | 31: iteration 97380/ 173500 | consumed samples: 24929280 | consumed tokens: 51055165440 | elapsed time per iteration (s): 0.81 | learning rate: 9.402E-05 | global batch size: 256 | lm loss: 1.974699E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.293 | TFLOPs: 19.13 | 31: iteration 97390/ 173500 | consumed samples: 24931840 | consumed tokens: 51060408320 | elapsed time per iteration (s): 0.81 | learning rate: 9.400E-05 | global batch size: 256 | lm loss: 1.990644E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.905 | TFLOPs: 19.23 | 31: iteration 97400/ 173500 | consumed samples: 24934400 | consumed tokens: 51065651200 | elapsed time per iteration (s): 0.81 | learning rate: 9.398E-05 | global batch size: 256 | 
lm loss: 2.010034E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.314 | TFLOPs: 19.02 | 31: iteration 97410/ 173500 | consumed samples: 24936960 | consumed tokens: 51070894080 | elapsed time per iteration (s): 1.84 | learning rate: 9.397E-05 | global batch size: 256 | lm loss: 1.994173E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 139.425 | TFLOPs: 8.43 | 31: iteration 97420/ 173500 | consumed samples: 24939520 | consumed tokens: 51076136960 | elapsed time per iteration (s): 0.80 | learning rate: 9.395E-05 | global batch size: 256 | lm loss: 1.971799E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.115 | TFLOPs: 19.25 | 31: iteration 97430/ 173500 | consumed samples: 24942080 | consumed tokens: 51081379840 | elapsed time per iteration (s): 0.83 | learning rate: 9.393E-05 | global batch size: 256 | lm loss: 1.962466E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.941 | TFLOPs: 18.63 | 31: iteration 97440/ 173500 | consumed samples: 24944640 | consumed tokens: 51086622720 | elapsed time per iteration (s): 0.82 | learning rate: 9.392E-05 | global batch size: 256 | lm loss: 1.971106E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.549 | TFLOPs: 18.91 | 31: iteration 97450/ 173500 | consumed samples: 24947200 | consumed tokens: 51091865600 | elapsed time per iteration (s): 0.82 | learning rate: 9.390E-05 | global batch size: 256 | lm loss: 1.979342E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.709 | TFLOPs: 18.98 | 31: iteration 97460/ 173500 | consumed samples: 24949760 | consumed tokens: 
51097108480 | elapsed time per iteration (s): 0.80 | learning rate: 9.389E-05 | global batch size: 256 | lm loss: 1.980210E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.110 | TFLOPs: 19.37 | 31: iteration 97470/ 173500 | consumed samples: 24952320 | consumed tokens: 51102351360 | elapsed time per iteration (s): 0.81 | learning rate: 9.387E-05 | global batch size: 256 | lm loss: 1.982715E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.832 | TFLOPs: 19.17 | 31: iteration 97480/ 173500 | consumed samples: 24954880 | consumed tokens: 51107594240 | elapsed time per iteration (s): 0.83 | learning rate: 9.385E-05 | global batch size: 256 | lm loss: 1.948574E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.179 | TFLOPs: 18.77 | 31: iteration 97490/ 173500 | consumed samples: 24957440 | consumed tokens: 51112837120 | elapsed time per iteration (s): 0.81 | learning rate: 9.384E-05 | global batch size: 256 | lm loss: 1.976691E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.710 | TFLOPs: 19.04 | 31: iteration 97500/ 173500 | consumed samples: 24960000 | consumed tokens: 51118080000 | elapsed time per iteration (s): 0.84 | learning rate: 9.382E-05 | global batch size: 256 | lm loss: 1.967460E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.406 | TFLOPs: 18.42 | 31: iteration 97510/ 173500 | consumed samples: 24962560 | consumed tokens: 51123322880 | elapsed time per iteration (s): 0.82 | learning rate: 9.381E-05 | global batch size: 256 | lm loss: 1.960000E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 313.172 | TFLOPs: 18.95 | 31: iteration 97520/ 173500 | consumed samples: 24965120 | consumed tokens: 51128565760 | elapsed time per iteration (s): 0.83 | learning rate: 9.379E-05 | global batch size: 256 | lm loss: 1.986575E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.546 | TFLOPs: 18.61 | 31: iteration 97530/ 173500 | consumed samples: 24967680 | consumed tokens: 51133808640 | elapsed time per iteration (s): 0.81 | learning rate: 9.377E-05 | global batch size: 256 | lm loss: 1.989786E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.584 | TFLOPs: 19.09 | 31: iteration 97540/ 173500 | consumed samples: 24970240 | consumed tokens: 51139051520 | elapsed time per iteration (s): 0.83 | learning rate: 9.376E-05 | global batch size: 256 | lm loss: 1.994352E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.055 | TFLOPs: 18.76 | 31: iteration 97550/ 173500 | consumed samples: 24972800 | consumed tokens: 51144294400 | elapsed time per iteration (s): 0.83 | learning rate: 9.374E-05 | global batch size: 256 | lm loss: 1.980792E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.022 | TFLOPs: 18.70 | 31: iteration 97560/ 173500 | consumed samples: 24975360 | consumed tokens: 51149537280 | elapsed time per iteration (s): 0.85 | learning rate: 9.372E-05 | global batch size: 256 | lm loss: 1.959797E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.808 | TFLOPs: 18.20 | 31: iteration 97570/ 173500 | consumed samples: 24977920 | consumed tokens: 51154780160 | elapsed time per iteration (s): 0.85 | learning rate: 9.371E-05 | global batch size: 256 | lm loss: 1.995300E+00 | grad 
norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.882 | TFLOPs: 18.14 | 31: iteration 97580/ 173500 | consumed samples: 24980480 | consumed tokens: 51160023040 | elapsed time per iteration (s): 0.83 | learning rate: 9.369E-05 | global batch size: 256 | lm loss: 1.979344E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.623 | TFLOPs: 18.55 | 31: iteration 97590/ 173500 | consumed samples: 24983040 | consumed tokens: 51165265920 | elapsed time per iteration (s): 0.80 | learning rate: 9.368E-05 | global batch size: 256 | lm loss: 1.979433E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.457 | TFLOPs: 19.45 | 31: iteration 97600/ 173500 | consumed samples: 24985600 | consumed tokens: 51170508800 | elapsed time per iteration (s): 0.97 | learning rate: 9.366E-05 | global batch size: 256 | lm loss: 1.979755E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 262.822 | TFLOPs: 15.90 | 31: iteration 97610/ 173500 | consumed samples: 24988160 | consumed tokens: 51175751680 | elapsed time per iteration (s): 0.83 | learning rate: 9.364E-05 | global batch size: 256 | lm loss: 1.965678E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.571 | TFLOPs: 18.73 | 31: iteration 97620/ 173500 | consumed samples: 24990720 | consumed tokens: 51180994560 | elapsed time per iteration (s): 0.81 | learning rate: 9.363E-05 | global batch size: 256 | lm loss: 2.018056E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.591 | TFLOPs: 19.09 | 31: iteration 97630/ 173500 | consumed samples: 24993280 | consumed tokens: 51186237440 | elapsed time 
per iteration (s): 0.83 | learning rate: 9.361E-05 | global batch size: 256 | lm loss: 2.006175E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.054 | TFLOPs: 18.58 | 31: iteration 97640/ 173500 | consumed samples: 24995840 | consumed tokens: 51191480320 | elapsed time per iteration (s): 0.82 | learning rate: 9.359E-05 | global batch size: 256 | lm loss: 1.987486E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.620 | TFLOPs: 18.97 | 31: iteration 97650/ 173500 | consumed samples: 24998400 | consumed tokens: 51196723200 | elapsed time per iteration (s): 0.83 | learning rate: 9.358E-05 | global batch size: 256 | lm loss: 1.987510E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.154 | TFLOPs: 18.76 | 31: iteration 97660/ 173500 | consumed samples: 25000960 | consumed tokens: 51201966080 | elapsed time per iteration (s): 0.82 | learning rate: 9.356E-05 | global batch size: 256 | lm loss: 2.005057E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.695 | TFLOPs: 18.92 | 31: iteration 97670/ 173500 | consumed samples: 25003520 | consumed tokens: 51207208960 | elapsed time per iteration (s): 0.79 | learning rate: 9.355E-05 | global batch size: 256 | lm loss: 1.985547E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.571 | TFLOPs: 19.51 | 31: iteration 97680/ 173500 | consumed samples: 25006080 | consumed tokens: 51212451840 | elapsed time per iteration (s): 0.82 | learning rate: 9.353E-05 | global batch size: 256 | lm loss: 1.988755E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.532 | TFLOPs: 
18.85 | 31: iteration 97690/ 173500 | consumed samples: 25008640 | consumed tokens: 51217694720 | elapsed time per iteration (s): 0.78 | learning rate: 9.351E-05 | global batch size: 256 | lm loss: 1.979077E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.197 | TFLOPs: 19.73 | 31: iteration 97700/ 173500 | consumed samples: 25011200 | consumed tokens: 51222937600 | elapsed time per iteration (s): 0.81 | learning rate: 9.350E-05 | global batch size: 256 | lm loss: 2.015755E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.287 | TFLOPs: 19.07 | 31: iteration 97710/ 173500 | consumed samples: 25013760 | consumed tokens: 51228180480 | elapsed time per iteration (s): 0.81 | learning rate: 9.348E-05 | global batch size: 256 | lm loss: 1.964824E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.306 | TFLOPs: 19.20 | 31: iteration 97720/ 173500 | consumed samples: 25016320 | consumed tokens: 51233423360 | elapsed time per iteration (s): 0.84 | learning rate: 9.347E-05 | global batch size: 256 | lm loss: 1.988545E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.168 | TFLOPs: 18.34 | 31: iteration 97730/ 173500 | consumed samples: 25018880 | consumed tokens: 51238666240 | elapsed time per iteration (s): 0.77 | learning rate: 9.345E-05 | global batch size: 256 | lm loss: 1.964798E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.833 | TFLOPs: 20.01 | 31: iteration 97740/ 173500 | consumed samples: 25021440 | consumed tokens: 51243909120 | elapsed time per iteration (s): 0.86 | learning rate: 9.343E-05 | global batch size: 256 | lm loss: 1.962793E+00 | grad norm: 0.154 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.026 | TFLOPs: 18.09 | 31: iteration 97750/ 173500 | consumed samples: 25024000 | consumed tokens: 51249152000 | elapsed time per iteration (s): 0.81 | learning rate: 9.342E-05 | global batch size: 256 | lm loss: 1.997388E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.644 | TFLOPs: 19.22 | 31: iteration 97760/ 173500 | consumed samples: 25026560 | consumed tokens: 51254394880 | elapsed time per iteration (s): 0.84 | learning rate: 9.340E-05 | global batch size: 256 | lm loss: 1.991252E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.681 | TFLOPs: 18.49 | 31: iteration 97770/ 173500 | consumed samples: 25029120 | consumed tokens: 51259637760 | elapsed time per iteration (s): 0.82 | learning rate: 9.338E-05 | global batch size: 256 | lm loss: 1.991263E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.227 | TFLOPs: 18.83 | 31: iteration 97780/ 173500 | consumed samples: 25031680 | consumed tokens: 51264880640 | elapsed time per iteration (s): 0.85 | learning rate: 9.337E-05 | global batch size: 256 | lm loss: 1.967988E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.248 | TFLOPs: 18.29 | 31: iteration 97790/ 173500 | consumed samples: 25034240 | consumed tokens: 51270123520 | elapsed time per iteration (s): 0.82 | learning rate: 9.335E-05 | global batch size: 256 | lm loss: 1.994720E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.318 | TFLOPs: 18.89 | 31: iteration 97800/ 173500 | consumed samples: 25036800 | consumed tokens: 51275366400 | elapsed time per iteration (s): 0.83 | 
learning rate: 9.334E-05 | global batch size: 256 | lm loss: 1.971989E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.748 | TFLOPs: 18.68 | 31: iteration 97810/ 173500 | consumed samples: 25039360 | consumed tokens: 51280609280 | elapsed time per iteration (s): 0.78 | learning rate: 9.332E-05 | global batch size: 256 | lm loss: 1.996403E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.204 | TFLOPs: 19.86 | 31: iteration 97820/ 173500 | consumed samples: 25041920 | consumed tokens: 51285852160 | elapsed time per iteration (s): 0.82 | learning rate: 9.330E-05 | global batch size: 256 | lm loss: 1.972510E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.513 | TFLOPs: 18.97 | 31: iteration 97830/ 173500 | consumed samples: 25044480 | consumed tokens: 51291095040 | elapsed time per iteration (s): 0.76 | learning rate: 9.329E-05 | global batch size: 256 | lm loss: 1.974833E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.420 | TFLOPs: 20.35 | 31: iteration 97840/ 173500 | consumed samples: 25047040 | consumed tokens: 51296337920 | elapsed time per iteration (s): 0.74 | learning rate: 9.327E-05 | global batch size: 256 | lm loss: 1.985551E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.860 | TFLOPs: 20.86 | 31: iteration 97850/ 173500 | consumed samples: 25049600 | consumed tokens: 51301580800 | elapsed time per iteration (s): 0.74 | learning rate: 9.325E-05 | global batch size: 256 | lm loss: 1.982389E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.548 | TFLOPs: 20.84 | 31: iteration 97860/ 
173500 | consumed samples: 25052160 | consumed tokens: 51306823680 | elapsed time per iteration (s): 0.79 | learning rate: 9.324E-05 | global batch size: 256 | lm loss: 1.976250E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.080 | TFLOPs: 19.73 | 31: iteration 97870/ 173500 | consumed samples: 25054720 | consumed tokens: 51312066560 | elapsed time per iteration (s): 0.73 | learning rate: 9.322E-05 | global batch size: 256 | lm loss: 2.005553E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.032 | TFLOPs: 21.30 | 31: iteration 97880/ 173500 | consumed samples: 25057280 | consumed tokens: 51317309440 | elapsed time per iteration (s): 0.83 | learning rate: 9.321E-05 | global batch size: 256 | lm loss: 1.961614E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.640 | TFLOPs: 18.61 | 31: iteration 97890/ 173500 | consumed samples: 25059840 | consumed tokens: 51322552320 | elapsed time per iteration (s): 0.75 | learning rate: 9.319E-05 | global batch size: 256 | lm loss: 1.973413E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.493 | TFLOPs: 20.72 | 31: iteration 97900/ 173500 | consumed samples: 25062400 | consumed tokens: 51327795200 | elapsed time per iteration (s): 0.76 | learning rate: 9.317E-05 | global batch size: 256 | lm loss: 1.963784E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.676 | TFLOPs: 20.25 | 31: iteration 97910/ 173500 | consumed samples: 25064960 | consumed tokens: 51333038080 | elapsed time per iteration (s): 0.79 | learning rate: 9.316E-05 | global batch size: 256 | lm loss: 1.985502E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 324.907 | TFLOPs: 19.66 | 31: iteration 97920/ 173500 | consumed samples: 25067520 | consumed tokens: 51338280960 | elapsed time per iteration (s): 0.76 | learning rate: 9.314E-05 | global batch size: 256 | lm loss: 1.975042E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.805 | TFLOPs: 20.25 | 31: iteration 97930/ 173500 | consumed samples: 25070080 | consumed tokens: 51343523840 | elapsed time per iteration (s): 0.77 | learning rate: 9.313E-05 | global batch size: 256 | lm loss: 1.996287E+00 | grad norm: 0.367 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.723 | TFLOPs: 20.19 | 31: iteration 97940/ 173500 | consumed samples: 25072640 | consumed tokens: 51348766720 | elapsed time per iteration (s): 0.83 | learning rate: 9.311E-05 | global batch size: 256 | lm loss: 1.998514E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.786 | TFLOPs: 18.68 | 31: iteration 97950/ 173500 | consumed samples: 25075200 | consumed tokens: 51354009600 | elapsed time per iteration (s): 0.76 | learning rate: 9.309E-05 | global batch size: 256 | lm loss: 1.970404E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.707 | TFLOPs: 20.49 | 31: iteration 97960/ 173500 | consumed samples: 25077760 | consumed tokens: 51359252480 | elapsed time per iteration (s): 0.79 | learning rate: 9.308E-05 | global batch size: 256 | lm loss: 2.012131E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.067 | TFLOPs: 19.54 | 31: iteration 97970/ 173500 | consumed samples: 25080320 | consumed tokens: 51364495360 | elapsed time per iteration (s): 0.75 | learning rate: 
9.306E-05 | global batch size: 256 | lm loss: 1.975985E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.684 | TFLOPs: 20.61 | 31: iteration 97980/ 173500 | consumed samples: 25082880 | consumed tokens: 51369738240 | elapsed time per iteration (s): 0.74 | learning rate: 9.304E-05 | global batch size: 256 | lm loss: 2.003636E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.294 | TFLOPs: 20.95 | 31: iteration 97990/ 173500 | consumed samples: 25085440 | consumed tokens: 51374981120 | elapsed time per iteration (s): 0.78 | learning rate: 9.303E-05 | global batch size: 256 | lm loss: 1.978466E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.283 | TFLOPs: 19.80 | 0: [2022-11-26 16:15:36,750] [INFO] [logging.py:68:log_dist] [Rank 0] step=98000, skipped=0, lr=[9.301234885879047e-05, 9.301234885879047e-05, 9.301234885879047e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 98000/ 173500 | consumed samples: 25088000 | consumed tokens: 51380224000 | elapsed time per iteration (s): 0.79 | learning rate: 9.301E-05 | global batch size: 256 | lm loss: 1.947434E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.578 | TFLOPs: 19.64 | 0: steps: 98000 loss: 1.9122 iter time (s): 0.819 samples/sec: 312.645 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 98000 | lm loss value: 1.969223E+00 | lm loss PPL: 7.165107E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 98000 to checkpoints_1b1long 0: [2022-11-26 16:15:37,002] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step98000 
is begin to save! 0: [2022-11-26 16:15:37,009] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_01-model_00-model_states.pt... 0: [2022-11-26 16:15:37,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_01-model_00-model_states.pt. 0: [2022-11-26 16:15:37,220] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_03-model_00-model_states.pt... 0: [2022-11-26 16:15:37,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_03-model_00-model_states.pt. 0: [2022-11-26 16:15:37,298] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_04-model_00-model_states.pt... 0: [2022-11-26 16:15:37,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_04-model_00-model_states.pt. 0: [2022-11-26 16:15:37,376] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_05-model_00-model_states.pt... 0: [2022-11-26 16:15:37,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_05-model_00-model_states.pt. 0: [2022-11-26 16:15:37,456] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_06-model_00-model_states.pt... 0: [2022-11-26 16:15:37,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_06-model_00-model_states.pt. 0: [2022-11-26 16:15:37,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_07-model_00-model_states.pt... 0: [2022-11-26 16:15:37,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 16:15:37,609] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_08-model_00-model_states.pt... 0: [2022-11-26 16:15:37,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_08-model_00-model_states.pt. 0: [2022-11-26 16:15:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_09-model_00-model_states.pt... 0: [2022-11-26 16:15:37,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_09-model_00-model_states.pt. 0: [2022-11-26 16:15:37,758] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_10-model_00-model_states.pt... 0: [2022-11-26 16:15:37,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_10-model_00-model_states.pt. 0: [2022-11-26 16:15:37,829] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_11-model_00-model_states.pt... 0: [2022-11-26 16:15:37,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_11-model_00-model_states.pt. 0: [2022-11-26 16:15:37,904] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_12-model_00-model_states.pt... 0: [2022-11-26 16:15:37,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_12-model_00-model_states.pt. 0: [2022-11-26 16:15:37,978] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_13-model_00-model_states.pt... 0: [2022-11-26 16:15:38,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 16:15:38,055] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_14-model_00-model_states.pt... 0: [2022-11-26 16:15:38,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_14-model_00-model_states.pt. 0: [2022-11-26 16:15:38,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_15-model_00-model_states.pt... 0: [2022-11-26 16:15:38,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_15-model_00-model_states.pt. 0: [2022-11-26 16:15:38,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_16-model_00-model_states.pt... 0: [2022-11-26 16:15:38,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_16-model_00-model_states.pt. 0: [2022-11-26 16:15:38,279] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_17-model_00-model_states.pt... 0: [2022-11-26 16:15:38,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_17-model_00-model_states.pt. 0: [2022-11-26 16:15:38,352] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_18-model_00-model_states.pt... 0: [2022-11-26 16:15:38,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_18-model_00-model_states.pt. 0: [2022-11-26 16:15:38,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_19-model_00-model_states.pt... 0: [2022-11-26 16:15:38,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 16:15:38,503] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_20-model_00-model_states.pt... 0: [2022-11-26 16:15:38,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_20-model_00-model_states.pt. 0: [2022-11-26 16:15:38,577] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_21-model_00-model_states.pt... 0: [2022-11-26 16:15:38,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_21-model_00-model_states.pt. 0: [2022-11-26 16:15:38,652] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_22-model_00-model_states.pt... 0: [2022-11-26 16:15:38,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_22-model_00-model_states.pt. 0: [2022-11-26 16:15:38,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_23-model_00-model_states.pt... 0: [2022-11-26 16:15:38,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_23-model_00-model_states.pt. 0: [2022-11-26 16:15:38,799] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_24-model_00-model_states.pt... 0: [2022-11-26 16:15:38,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_24-model_00-model_states.pt. 0: [2022-11-26 16:15:38,872] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_25-model_00-model_states.pt... 0: [2022-11-26 16:15:38,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 16:15:38,946] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_26-model_00-model_states.pt... 0: [2022-11-26 16:15:39,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_26-model_00-model_states.pt. 0: [2022-11-26 16:15:39,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_27-model_00-model_states.pt... 0: [2022-11-26 16:15:39,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_27-model_00-model_states.pt. 0: [2022-11-26 16:15:39,093] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_28-model_00-model_states.pt... 0: [2022-11-26 16:15:39,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_28-model_00-model_states.pt. 0: [2022-11-26 16:15:39,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/layer_30-model_00-model_states.pt... 0: [2022-11-26 16:15:39,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/layer_30-model_00-model_states.pt. 0: [2022-11-26 16:15:39,169] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step98000/mp_rank_00_model_states.pt 0: [2022-11-26 16:15:39,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/mp_rank_00_model_states.pt... 0: [2022-11-26 16:15:39,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/mp_rank_00_model_states.pt. 0: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
0: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
4: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 
16: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
3: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 
20: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
11: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 
22: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 
21: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 
19: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
7: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 
10: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 
2: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 
12: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
11: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 
29: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 
17: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 
27: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
4: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
12: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
14: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 
18: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
16: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 
29: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 
8: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:15:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:15:39,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:15:39,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 16:15:39,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 13: [2022-11-26 16:15:39,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:15:39,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 16:15:39,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 13: [2022-11-26 16:15:39,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:15:39,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:15:39,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 16:15:39,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
28: [2022-11-26 16:15:39,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 16:15:39,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 11: [2022-11-26 16:15:39,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:15:39,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 16:15:39,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 4: [2022-11-26 16:15:39,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:15:39,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:15:39,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:15:39,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 12: [2022-11-26 16:15:39,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:15:39,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 4: [2022-11-26 16:15:39,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
12: [2022-11-26 16:15:39,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 18: [2022-11-26 16:15:39,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 12: [2022-11-26 16:15:39,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 7: [2022-11-26 16:15:39,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:15:39,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 16:15:39,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 26: [2022-11-26 16:15:39,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:15:39,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 16:15:39,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 27: [2022-11-26 16:15:39,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:15:39,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 16:15:39,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
16: [2022-11-26 16:15:39,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:15:39,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 16:15:39,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 29: [2022-11-26 16:15:39,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:15:39,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 16:15:39,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 11: [2022-11-26 16:15:39,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:15:39,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 16:15:39,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 19: [2022-11-26 16:15:39,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:15:39,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 16:15:39,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
9: [2022-11-26 16:15:39,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:15:39,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 16:15:39,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 14: [2022-11-26 16:15:39,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:15:39,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 16:15:39,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 7: [2022-11-26 16:15:39,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:15:39,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 16:15:39,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 14: [2022-11-26 16:15:39,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:15:39,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 16:15:39,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
12: [2022-11-26 16:15:39,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:15:39,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 16:15:39,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 10: [2022-11-26 16:15:39,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:15:39,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 16:15:39,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 2: [2022-11-26 16:15:39,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:15:39,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 16:15:39,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 24: [2022-11-26 16:15:39,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:15:39,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
24: [2022-11-26 16:15:39,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 29: [2022-11-26 16:15:39,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 16:15:39,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 28: [2022-11-26 16:15:39,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:15:39,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 1: [2022-11-26 16:15:39,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:15:39,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 16:15:39,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 2: [2022-11-26 16:15:39,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:15:39,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
2: [2022-11-26 16:15:39,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 20: [2022-11-26 16:15:39,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 2: [2022-11-26 16:15:39,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 20: [2022-11-26 16:15:39,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 25: [2022-11-26 16:15:39,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:15:39,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:15:39,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 16:15:39,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 25: [2022-11-26 16:15:39,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 15: [2022-11-26 16:15:39,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:15:39,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 16:15:39,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
10: [2022-11-26 16:15:39,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:15:39,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:15:39,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 16: [2022-11-26 16:15:39,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 25: [2022-11-26 16:15:39,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 10: [2022-11-26 16:15:39,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 16: [2022-11-26 16:15:39,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 25: [2022-11-26 16:15:39,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:15:39,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 16:15:39,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 27: [2022-11-26 16:15:39,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
27: [2022-11-26 16:15:39,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 16:15:39,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 4: [2022-11-26 16:15:39,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:15:39,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 4: [2022-11-26 16:15:39,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 28: [2022-11-26 16:15:39,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 30: [2022-11-26 16:15:39,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:15:39,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 30: [2022-11-26 16:15:39,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 19: [2022-11-26 16:15:39,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:15:39,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
19: [2022-11-26 16:15:39,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 16:15:39,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 23: [2022-11-26 16:15:39,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:15:39,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:15:39,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:15:39,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 31: [2022-11-26 16:15:39,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:15:39,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 21: [2022-11-26 16:15:39,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:15:39,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 23: [2022-11-26 16:15:39,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
31: [2022-11-26 16:15:39,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 16:15:39,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 21: [2022-11-26 16:15:39,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 13: [2022-11-26 16:15:39,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 31: [2022-11-26 16:15:39,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 21: [2022-11-26 16:15:39,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 18: [2022-11-26 16:15:39,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:15:39,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 16:15:39,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 21: [2022-11-26 16:15:39,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:15:39,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 16:15:39,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
4: [2022-11-26 16:15:39,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:15:39,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 16:15:39,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 28: [2022-11-26 16:15:39,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:15:39,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:15:39,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 16:15:39,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 26: [2022-11-26 16:15:39,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:15:39,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
26: [2022-11-26 16:15:39,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 19: [2022-11-26 16:15:39,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 26: [2022-11-26 16:15:39,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 19: [2022-11-26 16:15:39,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 17: [2022-11-26 16:15:39,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:15:39,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 16:15:39,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 12: [2022-11-26 16:15:39,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:15:39,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:15:39,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
12: [2022-11-26 16:15:39,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 15: [2022-11-26 16:15:39,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 24: [2022-11-26 16:15:39,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 12: [2022-11-26 16:15:39,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 15: [2022-11-26 16:15:39,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 24: [2022-11-26 16:15:39,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 27: [2022-11-26 16:15:39,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:15:39,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 16:15:39,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 11: [2022-11-26 16:15:39,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:15:39,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
11: [2022-11-26 16:15:39,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 26: [2022-11-26 16:15:39,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 11: [2022-11-26 16:15:39,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 26: [2022-11-26 16:15:39,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 7: [2022-11-26 16:15:39,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:15:39,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 16:15:39,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 2: [2022-11-26 16:15:39,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:15:39,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 16:15:39,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 9: [2022-11-26 16:15:39,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-26 16:15:39,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 16: [2022-11-26 16:15:39,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:15:39,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:15:39,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:15:39,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 16: [2022-11-26 16:15:39,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 29: [2022-11-26 16:15:39,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 17: [2022-11-26 16:15:39,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 9: [2022-11-26 16:15:39,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:15:39,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 14: [2022-11-26 16:15:39,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
29: [2022-11-26 16:15:39,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 17: [2022-11-26 16:15:39,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 9: [2022-11-26 16:15:39,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 16:15:39,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 14: [2022-11-26 16:15:39,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 16:15:39,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 8: [2022-11-26 16:15:39,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:15:39,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:15:39,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:15:39,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
8: [2022-11-26 16:15:39,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 16:15:39,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 1: [2022-11-26 16:15:39,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 30: [2022-11-26 16:15:39,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 8: [2022-11-26 16:15:39,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 8: [2022-11-26 16:15:39,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 1: [2022-11-26 16:15:39,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 30: [2022-11-26 16:15:39,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 10: [2022-11-26 16:15:39,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:15:39,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
10: [2022-11-26 16:15:39,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 15: [2022-11-26 16:15:39,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 10: [2022-11-26 16:15:39,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 15: [2022-11-26 16:15:39,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 23: [2022-11-26 16:15:39,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:15:39,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 16:15:39,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 31: [2022-11-26 16:15:39,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:15:39,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 16:15:39,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 30: [2022-11-26 16:15:39,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 
30: [2022-11-26 16:15:39,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 16:15:39,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 26: [2022-11-26 16:15:39,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:15:39,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:15:39,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 11: [2022-11-26 16:15:39,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:15:39,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 26: [2022-11-26 16:15:39,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 11: [2022-11-26 16:15:39,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 21: [2022-11-26 16:15:39,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 11: [2022-11-26 16:15:39,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 18: [2022-11-26 16:15:39,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
18: [2022-11-26 16:15:39,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 16:15:39,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 10: [2022-11-26 16:15:39,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:15:39,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 20: [2022-11-26 16:15:39,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:15:39,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 20: [2022-11-26 16:15:39,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 16:15:39,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 12: [2022-11-26 16:15:39,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:15:39,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 16:15:39,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 0: [2022-11-26 16:15:39,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 16:15:39,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 28: [2022-11-26 16:15:39,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 16:15:39,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 28: [2022-11-26 16:15:39,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:15:39,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 28: [2022-11-26 16:15:39,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 16:15:39,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 13: [2022-11-26 16:15:39,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:15:39,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 16:15:39,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 23: [2022-11-26 16:15:39,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
23: [2022-11-26 16:15:39,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 16:15:39,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 1: [2022-11-26 16:15:39,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:15:39,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 16:15:39,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 29: [2022-11-26 16:15:39,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:15:39,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 16:15:39,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 30: [2022-11-26 16:15:39,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:15:39,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 2: [2022-11-26 16:15:39,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:15:39,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
2: [2022-11-26 16:15:39,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 16:15:39,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 16: [2022-11-26 16:15:39,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:15:39,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 16:15:39,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 22: [2022-11-26 16:15:39,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:15:39,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 16:15:39,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 20: [2022-11-26 16:15:39,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:15:39,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 16:15:39,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 27: [2022-11-26 16:15:39,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
27: [2022-11-26 16:15:39,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 16:15:39,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 25: [2022-11-26 16:15:39,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:15:39,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:15:39,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 25: [2022-11-26 16:15:39,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 16:15:39,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 22: [2022-11-26 16:15:39,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 19: [2022-11-26 16:15:39,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:15:39,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 16:15:39,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 24: [2022-11-26 16:15:39,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
24: [2022-11-26 16:15:39,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 16:15:39,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 4: [2022-11-26 16:15:39,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:15:39,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 16:15:39,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 3: [2022-11-26 16:15:39,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:15:39,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 16:15:39,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 6: [2022-11-26 16:15:39,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:15:39,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 16:15:39,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 25: [2022-11-26 16:15:39,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
25: [2022-11-26 16:15:39,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 16:15:39,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 14: [2022-11-26 16:15:39,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:15:39,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 16:15:39,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 6: [2022-11-26 16:15:39,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:15:39,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:15:39,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 7: [2022-11-26 16:15:39,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 16:15:39,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 6: [2022-11-26 16:15:39,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 1: [2022-11-26 16:15:39,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 16:15:39,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 16:15:39,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 3: [2022-11-26 16:15:39,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:15:39,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 16:15:39,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 18: [2022-11-26 16:15:39,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:15:39,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 16:15:39,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 23: [2022-11-26 16:15:39,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:15:39,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 16:15:39,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 9: [2022-11-26 16:15:39,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 16:15:39,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 16:15:39,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 3: [2022-11-26 16:15:39,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:15:39,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 16:15:39,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 31: [2022-11-26 16:15:39,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:15:39,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 16:15:39,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 24: [2022-11-26 16:15:39,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:15:39,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 16:15:39,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 17: [2022-11-26 16:15:39,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-26 16:15:39,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 16:15:39,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 8: [2022-11-26 16:15:39,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:15:39,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 16:15:39,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 0: [2022-11-26 16:15:39,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:15:39,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 16:15:39,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:15:39,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 0: [2022-11-26 16:15:39,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 16:15:39,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:15:39,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
0: [2022-11-26 16:15:39,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 16:15:39,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 22: [2022-11-26 16:15:39,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:15:39,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 16:15:39,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 21: [2022-11-26 16:15:39,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:15:39,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 16:15:39,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 6: [2022-11-26 16:15:39,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:15:39,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 16:15:39,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
0: [2022-11-26 16:15:39,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 16:15:39,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 5: [2022-11-26 16:15:39,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:15:39,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:15:39,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:15:39,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:15:39,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 16:15:39,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 16:15:39,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 16:15:39,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 16:15:39,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
5: [2022-11-26 16:15:39,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 5: [2022-11-26 16:15:39,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 5: [2022-11-26 16:15:39,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 20: [2022-11-26 16:15:39,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:15:39,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 16:15:39,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 25: [2022-11-26 16:15:39,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:15:39,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 16:15:39,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 1: [2022-11-26 16:15:39,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:15:39,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 16:15:39,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
11: [2022-11-26 16:15:39,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:15:39,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 16:15:39,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 13: [2022-11-26 16:15:39,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:15:39,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 16:15:39,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 26: [2022-11-26 16:15:39,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:15:39,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 16:15:39,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 28: [2022-11-26 16:15:39,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:15:39,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 16:15:39,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
15: [2022-11-26 16:15:39,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:15:39,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 16:15:39,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 2: [2022-11-26 16:15:39,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:15:39,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 16:15:39,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 29: [2022-11-26 16:15:39,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:15:39,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 16:15:39,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 27: [2022-11-26 16:15:39,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:15:39,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 16:15:39,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
10: [2022-11-26 16:15:39,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:15:39,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 16:15:39,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 18: [2022-11-26 16:15:39,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:15:39,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 16:15:39,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 12: [2022-11-26 16:15:39,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:15:39,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 16:15:39,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 7: [2022-11-26 16:15:39,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:15:39,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 16:15:39,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
30: [2022-11-26 16:15:39,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:15:39,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 16:15:39,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 16: [2022-11-26 16:15:39,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:15:39,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:15:39,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 23: [2022-11-26 16:15:39,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 16: [2022-11-26 16:15:39,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 23: [2022-11-26 16:15:39,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 21: [2022-11-26 16:15:39,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:15:39,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 16:15:39,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
19: [2022-11-26 16:15:39,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:15:39,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 16:15:39,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 4: [2022-11-26 16:15:39,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:15:39,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 16:15:39,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 3: [2022-11-26 16:15:39,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:15:39,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 16:15:39,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 0: [2022-11-26 16:15:39,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:15:39,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 16:15:39,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
17: [2022-11-26 16:15:39,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:15:39,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 16:15:39,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 8: [2022-11-26 16:15:39,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:15:39,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:15:39,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 31: [2022-11-26 16:15:39,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:15:39,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 20: [2022-11-26 16:15:39,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 31: [2022-11-26 16:15:39,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 20: [2022-11-26 16:15:39,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 31: [2022-11-26 16:15:39,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
5: [2022-11-26 16:15:39,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:15:39,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:15:39,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 9: [2022-11-26 16:15:39,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:15:39,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 5: [2022-11-26 16:15:39,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 9: [2022-11-26 16:15:39,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 14: [2022-11-26 16:15:39,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 9: [2022-11-26 16:15:39,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 24: [2022-11-26 16:15:39,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:15:39,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 22: [2022-11-26 16:15:39,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
24: [2022-11-26 16:15:39,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 22: [2022-11-26 16:15:39,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 16:15:39,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 6: [2022-11-26 16:15:39,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:15:39,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 16:15:39,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 1: [2022-11-26 16:15:39,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:15:39,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 16:15:39,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 11: [2022-11-26 16:15:39,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:15:39,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 16:15:39,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
26: [2022-11-26 16:15:39,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:15:39,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 16:15:39,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 13: [2022-11-26 16:15:39,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:15:39,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 16:15:39,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 15: [2022-11-26 16:15:39,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:15:39,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 16:15:39,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 28: [2022-11-26 16:15:39,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 
28: [2022-11-26 16:15:39,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 25: [2022-11-26 16:15:39,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:15:39,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 25: [2022-11-26 16:15:39,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 16:15:39,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 7: [2022-11-26 16:15:39,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:15:39,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 16:15:39,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 10: [2022-11-26 16:15:39,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:15:39,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
10: [2022-11-26 16:15:39,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 27: [2022-11-26 16:15:39,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:15:39,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:15:39,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 10: [2022-11-26 16:15:39,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 12: [2022-11-26 16:15:39,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 27: [2022-11-26 16:15:39,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 4: [2022-11-26 16:15:39,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 16:15:39,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 27: [2022-11-26 16:15:39,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 2: [2022-11-26 16:15:39,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 16:15:39,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 16:15:39,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 18: [2022-11-26 16:15:39,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:15:39,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 16:15:39,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 30: [2022-11-26 16:15:39,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:15:39,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 16:15:39,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 16: [2022-11-26 16:15:39,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:15:39,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 16:15:39,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 21: [2022-11-26 16:15:39,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
21: [2022-11-26 16:15:39,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 16:15:39,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 29: [2022-11-26 16:15:39,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:15:39,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 16:15:39,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 23: [2022-11-26 16:15:39,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:15:39,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 16:15:39,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 17: [2022-11-26 16:15:39,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:15:39,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
17: [2022-11-26 16:15:39,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 0: [2022-11-26 16:15:39,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 16:15:39,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 17: [2022-11-26 16:15:39,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 19: [2022-11-26 16:15:39,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:15:39,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 16:15:39,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 9: [2022-11-26 16:15:39,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:15:39,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 16:15:39,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 3: [2022-11-26 16:15:39,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 16:15:39,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 16:15:39,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 5: [2022-11-26 16:15:39,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:15:39,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 16:15:39,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 8: [2022-11-26 16:15:39,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:15:39,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 16:15:39,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 1: [2022-11-26 16:15:39,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:15:39,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 16:15:39,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 31: [2022-11-26 16:15:39,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 
31: [2022-11-26 16:15:39,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 16:15:39,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 20: [2022-11-26 16:15:39,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:15:39,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:15:39,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 22: [2022-11-26 16:15:39,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 16:15:39,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 20: [2022-11-26 16:15:39,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 6: [2022-11-26 16:15:39,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:15:39,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 16:15:39,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 14: [2022-11-26 16:15:39,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 16:15:39,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 16:15:39,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 24: [2022-11-26 16:15:39,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:15:39,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 16:15:39,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 11: [2022-11-26 16:15:39,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:15:39,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 16:15:39,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 25: [2022-11-26 16:15:39,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:15:39,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 16:15:39,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 13: [2022-11-26 16:15:39,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 16:15:39,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 16:15:39,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 26: [2022-11-26 16:15:39,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:15:39,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 16:15:39,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 15: [2022-11-26 16:15:39,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:15:39,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 16:15:39,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 28: [2022-11-26 16:15:39,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:15:39,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 16:15:39,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 2: [2022-11-26 16:15:39,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-26 16:15:39,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 16:15:39,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 10: [2022-11-26 16:15:39,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:15:39,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 16:15:39,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 27: [2022-11-26 16:15:39,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:15:39,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 16:15:39,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 12: [2022-11-26 16:15:39,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:15:39,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 16:15:39,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 18: [2022-11-26 16:15:39,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
18: [2022-11-26 16:15:39,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 16:15:39,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 7: [2022-11-26 16:15:39,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:15:39,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 16:15:39,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 29: [2022-11-26 16:15:39,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:15:39,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 16:15:39,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 23: [2022-11-26 16:15:39,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:15:39,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 16:15:39,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 4: [2022-11-26 16:15:39,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 16:15:39,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 16:15:39,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 16: [2022-11-26 16:15:39,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:15:39,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 16:15:39,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 30: [2022-11-26 16:15:39,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:15:39,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 16:15:39,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 17: [2022-11-26 16:15:39,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:15:39,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:15:39,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 16:15:39,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
17: [2022-11-26 16:15:39,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 16:15:39,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 0: [2022-11-26 16:15:39,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:15:39,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 16:15:39,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 20: [2022-11-26 16:15:39,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:15:39,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 16:15:39,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 1: [2022-11-26 16:15:39,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:15:39,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 16:15:39,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 24: [2022-11-26 16:15:39,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
24: [2022-11-26 16:15:39,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 16:15:39,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 17: [2022-11-26 16:15:39,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:15:39,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 16:15:39,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 19: [2022-11-26 16:15:39,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:15:39,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:15:39,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 21: [2022-11-26 16:15:39,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 19: [2022-11-26 16:15:39,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 4: [2022-11-26 16:15:39,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:15:39,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
4: [2022-11-26 16:15:39,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 16:15:39,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 9: [2022-11-26 16:15:39,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:15:39,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 16:15:39,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 11: [2022-11-26 16:15:39,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:15:39,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 16:15:39,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 31: [2022-11-26 16:15:39,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:15:39,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 16:15:39,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 15: [2022-11-26 16:15:39,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-26 16:15:39,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 16:15:39,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 7: [2022-11-26 16:15:39,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:15:39,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 16:15:39,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 31: [2022-11-26 16:15:39,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:15:39,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 16:15:39,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 3: [2022-11-26 16:15:39,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:15:39,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:15:39,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 16:15:39,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
10: [2022-11-26 16:15:39,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 2: [2022-11-26 16:15:39,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:15:39,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:15:39,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:15:39,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 14: [2022-11-26 16:15:39,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 16:15:39,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 2: [2022-11-26 16:15:39,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 14: [2022-11-26 16:15:39,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 14: [2022-11-26 16:15:39,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 2: [2022-11-26 16:15:39,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 5: [2022-11-26 16:15:39,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 16:15:39,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 16:15:39,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 25: [2022-11-26 16:15:39,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:15:39,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:15:39,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 18: [2022-11-26 16:15:39,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 25: [2022-11-26 16:15:39,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 21: [2022-11-26 16:15:39,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:15:39,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 21: [2022-11-26 16:15:39,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 16:15:39,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 28: [2022-11-26 16:15:39,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
27: [2022-11-26 16:15:39,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:15:39,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 27: [2022-11-26 16:15:39,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 16:15:39,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 28: [2022-11-26 16:15:39,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 12: [2022-11-26 16:15:39,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:15:39,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 16:15:39,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 24: [2022-11-26 16:15:39,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:15:39,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 16:15:39,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 16: [2022-11-26 16:15:39,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 
13: [2022-11-26 16:15:39,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:15:39,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:15:39,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 16:15:39,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 19: [2022-11-26 16:15:39,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 13: [2022-11-26 16:15:39,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 16:15:39,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 19: [2022-11-26 16:15:39,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 30: [2022-11-26 16:15:39,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:15:39,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 16:15:39,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 8: [2022-11-26 16:15:39,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 16:15:39,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 16:15:39,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 23: [2022-11-26 16:15:39,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:15:39,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 16:15:39,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 22: [2022-11-26 16:15:39,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:15:39,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:15:39,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 16:15:39,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 16:15:39,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 22: [2022-11-26 16:15:39,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 3: [2022-11-26 16:15:39,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 16:15:39,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 16:15:39,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 8: [2022-11-26 16:15:39,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:15:39,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 16:15:39,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 26: [2022-11-26 16:15:39,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:15:39,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 16:15:39,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 6: [2022-11-26 16:15:39,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:15:39,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 16:15:39,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 16:15:39,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 16:15:39,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 6: [2022-11-26 16:15:39,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 29: [2022-11-26 16:15:39,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:15:39,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 16:15:39,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 9: [2022-11-26 16:15:39,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:15:39,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 16:15:39,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 6: [2022-11-26 16:15:39,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 16:15:39,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 16:15:39,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 8: [2022-11-26 16:15:39,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:15:39,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 16:15:39,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 22: [2022-11-26 16:15:39,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:15:39,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 16:15:39,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 3: [2022-11-26 16:15:39,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:15:39,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step98000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 16:15:39,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
0: successfully saved checkpoint at iteration 98000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2488.07 31: iteration 98010/ 173500 | consumed samples: 25090560 | consumed tokens: 51385466880 | elapsed time per iteration (s): 1.07 | learning rate: 9.300E-05 | global batch size: 256 | lm loss: 2.003559E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.190 | TFLOPs: 14.47 | 31: iteration 98020/ 173500 | consumed samples: 25093120 | consumed tokens: 51390709760 | elapsed time per iteration (s): 0.80 | learning rate: 9.298E-05 | global batch size: 256 | lm loss: 1.935598E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.156 | TFLOPs: 19.31 | 31: iteration 98030/ 173500 | consumed samples: 25095680 | consumed tokens: 51395952640 | elapsed time per iteration (s): 0.78 | learning rate: 9.296E-05 | global batch size: 256 | lm loss: 1.994173E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.988 | TFLOPs: 19.78 | 31: iteration 98040/ 173500 | consumed samples: 25098240 | consumed tokens: 51401195520 | elapsed time per iteration (s): 0.79 | learning rate: 9.295E-05 | global batch size: 256 | lm loss: 1.993354E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.403 | TFLOPs: 19.63 | 31: iteration 98050/ 173500 | consumed samples: 25100800 | consumed tokens: 51406438400 | elapsed time per iteration (s): 0.73 | learning rate: 9.293E-05 | global batch size: 256 | lm loss: 1.998500E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.047 | TFLOPs: 21.18 | 31: iteration 98060/ 173500 | consumed samples: 25103360 | consumed tokens: 51411681280 | elapsed time per iteration (s): 0.75 | 
learning rate: 9.292E-05 | global batch size: 256 | lm loss: 1.986521E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.291 | TFLOPs: 20.59 | 31: iteration 98070/ 173500 | consumed samples: 25105920 | consumed tokens: 51416924160 | elapsed time per iteration (s): 0.75 | learning rate: 9.290E-05 | global batch size: 256 | lm loss: 2.011484E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.213 | TFLOPs: 20.52 | 31: iteration 98080/ 173500 | consumed samples: 25108480 | consumed tokens: 51422167040 | elapsed time per iteration (s): 0.79 | learning rate: 9.288E-05 | global batch size: 256 | lm loss: 1.961404E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.593 | TFLOPs: 19.52 | 31: iteration 98090/ 173500 | consumed samples: 25111040 | consumed tokens: 51427409920 | elapsed time per iteration (s): 0.79 | learning rate: 9.287E-05 | global batch size: 256 | lm loss: 1.994283E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.178 | TFLOPs: 19.61 | 31: iteration 98100/ 173500 | consumed samples: 25113600 | consumed tokens: 51432652800 | elapsed time per iteration (s): 0.78 | learning rate: 9.285E-05 | global batch size: 256 | lm loss: 1.971115E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.425 | TFLOPs: 19.75 | 31: iteration 98110/ 173500 | consumed samples: 25116160 | consumed tokens: 51437895680 | elapsed time per iteration (s): 0.89 | learning rate: 9.283E-05 | global batch size: 256 | lm loss: 1.990415E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.088 | TFLOPs: 17.43 | 31: iteration 98120/ 
173500 | consumed samples: 25118720 | consumed tokens: 51443138560 | elapsed time per iteration (s): 0.80 | learning rate: 9.282E-05 | global batch size: 256 | lm loss: 1.969570E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.884 | TFLOPs: 19.41 | 31: iteration 98130/ 173500 | consumed samples: 25121280 | consumed tokens: 51448381440 | elapsed time per iteration (s): 0.82 | learning rate: 9.280E-05 | global batch size: 256 | lm loss: 1.962962E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.362 | TFLOPs: 18.96 | 31: iteration 98140/ 173500 | consumed samples: 25123840 | consumed tokens: 51453624320 | elapsed time per iteration (s): 0.80 | learning rate: 9.279E-05 | global batch size: 256 | lm loss: 1.974461E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.255 | TFLOPs: 19.44 | 31: iteration 98150/ 173500 | consumed samples: 25126400 | consumed tokens: 51458867200 | elapsed time per iteration (s): 0.81 | learning rate: 9.277E-05 | global batch size: 256 | lm loss: 1.979153E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.460 | TFLOPs: 19.02 | 31: iteration 98160/ 173500 | consumed samples: 25128960 | consumed tokens: 51464110080 | elapsed time per iteration (s): 0.80 | learning rate: 9.275E-05 | global batch size: 256 | lm loss: 1.960871E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.143 | TFLOPs: 19.43 | 31: iteration 98170/ 173500 | consumed samples: 25131520 | consumed tokens: 51469352960 | elapsed time per iteration (s): 0.78 | learning rate: 9.274E-05 | global batch size: 256 | lm loss: 1.985756E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 326.594 | TFLOPs: 19.76 | 31: iteration 98180/ 173500 | consumed samples: 25134080 | consumed tokens: 51474595840 | elapsed time per iteration (s): 0.80 | learning rate: 9.272E-05 | global batch size: 256 | lm loss: 2.001369E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.734 | TFLOPs: 19.40 | 31: iteration 98190/ 173500 | consumed samples: 25136640 | consumed tokens: 51479838720 | elapsed time per iteration (s): 0.77 | learning rate: 9.271E-05 | global batch size: 256 | lm loss: 1.977349E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.357 | TFLOPs: 20.11 | 31: iteration 98200/ 173500 | consumed samples: 25139200 | consumed tokens: 51485081600 | elapsed time per iteration (s): 0.79 | learning rate: 9.269E-05 | global batch size: 256 | lm loss: 1.987068E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.755 | TFLOPs: 19.53 | 31: iteration 98210/ 173500 | consumed samples: 25141760 | consumed tokens: 51490324480 | elapsed time per iteration (s): 0.78 | learning rate: 9.267E-05 | global batch size: 256 | lm loss: 1.988067E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.153 | TFLOPs: 19.97 | 31: iteration 98220/ 173500 | consumed samples: 25144320 | consumed tokens: 51495567360 | elapsed time per iteration (s): 0.75 | learning rate: 9.266E-05 | global batch size: 256 | lm loss: 1.964890E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.674 | TFLOPs: 20.73 | 31: iteration 98230/ 173500 | consumed samples: 25146880 | consumed tokens: 51500810240 | elapsed time per iteration (s): 0.78 | learning rate: 
9.264E-05 | global batch size: 256 | lm loss: 1.988186E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.898 | TFLOPs: 19.78 | 31: iteration 98240/ 173500 | consumed samples: 25149440 | consumed tokens: 51506053120 | elapsed time per iteration (s): 0.75 | learning rate: 9.262E-05 | global batch size: 256 | lm loss: 1.965550E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.309 | TFLOPs: 20.71 | 31: iteration 98250/ 173500 | consumed samples: 25152000 | consumed tokens: 51511296000 | elapsed time per iteration (s): 0.76 | learning rate: 9.261E-05 | global batch size: 256 | lm loss: 1.974587E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.931 | TFLOPs: 20.44 | 31: iteration 98260/ 173500 | consumed samples: 25154560 | consumed tokens: 51516538880 | elapsed time per iteration (s): 0.76 | learning rate: 9.259E-05 | global batch size: 256 | lm loss: 1.997604E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.986 | TFLOPs: 20.27 | 31: iteration 98270/ 173500 | consumed samples: 25157120 | consumed tokens: 51521781760 | elapsed time per iteration (s): 0.75 | learning rate: 9.258E-05 | global batch size: 256 | lm loss: 1.973748E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.571 | TFLOPs: 20.54 | 31: iteration 98280/ 173500 | consumed samples: 25159680 | consumed tokens: 51527024640 | elapsed time per iteration (s): 0.75 | learning rate: 9.256E-05 | global batch size: 256 | lm loss: 2.007972E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.588 | TFLOPs: 20.60 | 31: iteration 98290/ 173500 | 
consumed samples: 25162240 | consumed tokens: 51532267520 | elapsed time per iteration (s): 0.82 | learning rate: 9.254E-05 | global batch size: 256 | lm loss: 1.956186E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.223 | TFLOPs: 18.83 | 31: iteration 98300/ 173500 | consumed samples: 25164800 | consumed tokens: 51537510400 | elapsed time per iteration (s): 0.75 | learning rate: 9.253E-05 | global batch size: 256 | lm loss: 1.958743E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.103 | TFLOPs: 20.58 | 31: iteration 98310/ 173500 | consumed samples: 25167360 | consumed tokens: 51542753280 | elapsed time per iteration (s): 0.77 | learning rate: 9.251E-05 | global batch size: 256 | lm loss: 1.990304E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.724 | TFLOPs: 20.01 | 31: iteration 98320/ 173500 | consumed samples: 25169920 | consumed tokens: 51547996160 | elapsed time per iteration (s): 0.75 | learning rate: 9.250E-05 | global batch size: 256 | lm loss: 1.976173E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.264 | TFLOPs: 20.52 | 31: iteration 98330/ 173500 | consumed samples: 25172480 | consumed tokens: 51553239040 | elapsed time per iteration (s): 0.78 | learning rate: 9.248E-05 | global batch size: 256 | lm loss: 1.987925E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.412 | TFLOPs: 19.87 | 31: iteration 98340/ 173500 | consumed samples: 25175040 | consumed tokens: 51558481920 | elapsed time per iteration (s): 0.83 | learning rate: 9.246E-05 | global batch size: 256 | lm loss: 1.959855E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 308.448 | TFLOPs: 18.66 | 31: iteration 98350/ 173500 | consumed samples: 25177600 | consumed tokens: 51563724800 | elapsed time per iteration (s): 0.82 | learning rate: 9.245E-05 | global batch size: 256 | lm loss: 1.983483E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.146 | TFLOPs: 18.88 | 31: iteration 98360/ 173500 | consumed samples: 25180160 | consumed tokens: 51568967680 | elapsed time per iteration (s): 0.79 | learning rate: 9.243E-05 | global batch size: 256 | lm loss: 1.977237E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.974 | TFLOPs: 19.54 | 31: iteration 98370/ 173500 | consumed samples: 25182720 | consumed tokens: 51574210560 | elapsed time per iteration (s): 0.87 | learning rate: 9.241E-05 | global batch size: 256 | lm loss: 1.975125E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.464 | TFLOPs: 17.87 | 31: iteration 98380/ 173500 | consumed samples: 25185280 | consumed tokens: 51579453440 | elapsed time per iteration (s): 0.81 | learning rate: 9.240E-05 | global batch size: 256 | lm loss: 2.014349E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.197 | TFLOPs: 19.01 | 31: iteration 98390/ 173500 | consumed samples: 25187840 | consumed tokens: 51584696320 | elapsed time per iteration (s): 0.83 | learning rate: 9.238E-05 | global batch size: 256 | lm loss: 1.973356E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.418 | TFLOPs: 18.60 | 31: iteration 98400/ 173500 | consumed samples: 25190400 | consumed tokens: 51589939200 | elapsed time per iteration (s): 0.80 | learning rate: 9.237E-05 | global batch 
size: 256 | lm loss: 1.979819E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.104 | TFLOPs: 19.30 | 31: iteration 98410/ 173500 | consumed samples: 25192960 | consumed tokens: 51595182080 | elapsed time per iteration (s): 0.83 | learning rate: 9.235E-05 | global batch size: 256 | lm loss: 1.954048E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.206 | TFLOPs: 18.77 | 31: iteration 98420/ 173500 | consumed samples: 25195520 | consumed tokens: 51600424960 | elapsed time per iteration (s): 0.80 | learning rate: 9.233E-05 | global batch size: 256 | lm loss: 2.003289E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.236 | TFLOPs: 19.25 | 31: iteration 98430/ 173500 | consumed samples: 25198080 | consumed tokens: 51605667840 | elapsed time per iteration (s): 0.84 | learning rate: 9.232E-05 | global batch size: 256 | lm loss: 1.952658E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.187 | TFLOPs: 18.52 | 31: iteration 98440/ 173500 | consumed samples: 25200640 | consumed tokens: 51610910720 | elapsed time per iteration (s): 0.79 | learning rate: 9.230E-05 | global batch size: 256 | lm loss: 1.997850E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.944 | TFLOPs: 19.60 | 31: iteration 98450/ 173500 | consumed samples: 25203200 | consumed tokens: 51616153600 | elapsed time per iteration (s): 0.83 | learning rate: 9.229E-05 | global batch size: 256 | lm loss: 1.986219E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.403 | TFLOPs: 18.72 | 31: iteration 98460/ 173500 | consumed samples: 25205760 | 
consumed tokens: 51621396480 | elapsed time per iteration (s): 0.81 | learning rate: 9.227E-05 | global batch size: 256 | lm loss: 1.955750E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.000 | TFLOPs: 19.24 | 31: iteration 98470/ 173500 | consumed samples: 25208320 | consumed tokens: 51626639360 | elapsed time per iteration (s): 0.79 | learning rate: 9.225E-05 | global batch size: 256 | lm loss: 1.986155E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.648 | TFLOPs: 19.70 | 31: iteration 98480/ 173500 | consumed samples: 25210880 | consumed tokens: 51631882240 | elapsed time per iteration (s): 0.81 | learning rate: 9.224E-05 | global batch size: 256 | lm loss: 2.001200E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.873 | TFLOPs: 19.17 | 31: iteration 98490/ 173500 | consumed samples: 25213440 | consumed tokens: 51637125120 | elapsed time per iteration (s): 0.79 | learning rate: 9.222E-05 | global batch size: 256 | lm loss: 1.989939E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.206 | TFLOPs: 19.49 | 31: iteration 98500/ 173500 | consumed samples: 25216000 | consumed tokens: 51642368000 | elapsed time per iteration (s): 0.79 | learning rate: 9.220E-05 | global batch size: 256 | lm loss: 1.987841E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.207 | TFLOPs: 19.49 | 31: iteration 98510/ 173500 | consumed samples: 25218560 | consumed tokens: 51647610880 | elapsed time per iteration (s): 1.17 | learning rate: 9.219E-05 | global batch size: 256 | lm loss: 1.966196E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 218.322 | TFLOPs: 13.21 | 31: iteration 98520/ 173500 | consumed samples: 25221120 | consumed tokens: 51652853760 | elapsed time per iteration (s): 0.80 | learning rate: 9.217E-05 | global batch size: 256 | lm loss: 1.962914E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.751 | TFLOPs: 19.47 | 31: iteration 98530/ 173500 | consumed samples: 25223680 | consumed tokens: 51658096640 | elapsed time per iteration (s): 0.83 | learning rate: 9.216E-05 | global batch size: 256 | lm loss: 1.980003E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.362 | TFLOPs: 18.66 | 31: iteration 98540/ 173500 | consumed samples: 25226240 | consumed tokens: 51663339520 | elapsed time per iteration (s): 0.83 | learning rate: 9.214E-05 | global batch size: 256 | lm loss: 1.959720E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.805 | TFLOPs: 18.56 | 31: iteration 98550/ 173500 | consumed samples: 25228800 | consumed tokens: 51668582400 | elapsed time per iteration (s): 0.76 | learning rate: 9.212E-05 | global batch size: 256 | lm loss: 1.982425E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.796 | TFLOPs: 20.44 | 31: iteration 98560/ 173500 | consumed samples: 25231360 | consumed tokens: 51673825280 | elapsed time per iteration (s): 0.79 | learning rate: 9.211E-05 | global batch size: 256 | lm loss: 1.972547E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.802 | TFLOPs: 19.65 | 31: iteration 98570/ 173500 | consumed samples: 25233920 | consumed tokens: 51679068160 | elapsed time per iteration (s): 0.79 | learning rate: 9.209E-05 | global batch size: 256 | lm loss: 
2.013011E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.524 | TFLOPs: 19.63 | 31: iteration 98580/ 173500 | consumed samples: 25236480 | consumed tokens: 51684311040 | elapsed time per iteration (s): 0.77 | learning rate: 9.208E-05 | global batch size: 256 | lm loss: 1.963315E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.718 | TFLOPs: 20.19 | 31: iteration 98590/ 173500 | consumed samples: 25239040 | consumed tokens: 51689553920 | elapsed time per iteration (s): 0.75 | learning rate: 9.206E-05 | global batch size: 256 | lm loss: 1.946328E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.448 | TFLOPs: 20.78 | 31: iteration 98600/ 173500 | consumed samples: 25241600 | consumed tokens: 51694796800 | elapsed time per iteration (s): 0.77 | learning rate: 9.204E-05 | global batch size: 256 | lm loss: 1.991409E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.969 | TFLOPs: 20.14 | 31: iteration 98610/ 173500 | consumed samples: 25244160 | consumed tokens: 51700039680 | elapsed time per iteration (s): 0.75 | learning rate: 9.203E-05 | global batch size: 256 | lm loss: 1.962306E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.974 | TFLOPs: 20.57 | 31: iteration 98620/ 173500 | consumed samples: 25246720 | consumed tokens: 51705282560 | elapsed time per iteration (s): 0.73 | learning rate: 9.201E-05 | global batch size: 256 | lm loss: 1.977019E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.803 | TFLOPs: 21.28 | 31: iteration 98630/ 173500 | consumed samples: 25249280 | consumed tokens: 
51710525440 | elapsed time per iteration (s): 0.78 | learning rate: 9.200E-05 | global batch size: 256 | lm loss: 1.977739E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.356 | TFLOPs: 19.80 | 31: iteration 98640/ 173500 | consumed samples: 25251840 | consumed tokens: 51715768320 | elapsed time per iteration (s): 0.75 | learning rate: 9.198E-05 | global batch size: 256 | lm loss: 1.943236E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.002 | TFLOPs: 20.63 | 31: iteration 98650/ 173500 | consumed samples: 25254400 | consumed tokens: 51721011200 | elapsed time per iteration (s): 0.74 | learning rate: 9.196E-05 | global batch size: 256 | lm loss: 1.958356E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.266 | TFLOPs: 21.01 | 31: iteration 98660/ 173500 | consumed samples: 25256960 | consumed tokens: 51726254080 | elapsed time per iteration (s): 0.78 | learning rate: 9.195E-05 | global batch size: 256 | lm loss: 1.956766E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.688 | TFLOPs: 19.95 | 31: iteration 98670/ 173500 | consumed samples: 25259520 | consumed tokens: 51731496960 | elapsed time per iteration (s): 0.77 | learning rate: 9.193E-05 | global batch size: 256 | lm loss: 2.002030E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.609 | TFLOPs: 20.18 | 31: iteration 98680/ 173500 | consumed samples: 25262080 | consumed tokens: 51736739840 | elapsed time per iteration (s): 0.79 | learning rate: 9.191E-05 | global batch size: 256 | lm loss: 1.970398E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 323.013 | TFLOPs: 19.54 | 31: iteration 98690/ 173500 | consumed samples: 25264640 | consumed tokens: 51741982720 | elapsed time per iteration (s): 0.86 | learning rate: 9.190E-05 | global batch size: 256 | lm loss: 1.968398E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.577 | TFLOPs: 17.94 | 31: iteration 98700/ 173500 | consumed samples: 25267200 | consumed tokens: 51747225600 | elapsed time per iteration (s): 0.84 | learning rate: 9.188E-05 | global batch size: 256 | lm loss: 1.964304E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.345 | TFLOPs: 18.41 | 31: iteration 98710/ 173500 | consumed samples: 25269760 | consumed tokens: 51752468480 | elapsed time per iteration (s): 0.86 | learning rate: 9.187E-05 | global batch size: 256 | lm loss: 1.983589E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.950 | TFLOPs: 18.09 | 31: iteration 98720/ 173500 | consumed samples: 25272320 | consumed tokens: 51757711360 | elapsed time per iteration (s): 0.78 | learning rate: 9.185E-05 | global batch size: 256 | lm loss: 1.965072E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.086 | TFLOPs: 19.85 | 31: iteration 98730/ 173500 | consumed samples: 25274880 | consumed tokens: 51762954240 | elapsed time per iteration (s): 0.84 | learning rate: 9.183E-05 | global batch size: 256 | lm loss: 1.946551E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.009 | TFLOPs: 18.39 | 31: iteration 98740/ 173500 | consumed samples: 25277440 | consumed tokens: 51768197120 | elapsed time per iteration (s): 0.79 | learning rate: 9.182E-05 | global batch size: 256 | lm loss: 1.953174E+00 | grad 
norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.006 | TFLOPs: 19.54 | 31: iteration 98750/ 173500 | consumed samples: 25280000 | consumed tokens: 51773440000 | elapsed time per iteration (s): 0.85 | learning rate: 9.180E-05 | global batch size: 256 | lm loss: 1.983292E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.907 | TFLOPs: 18.14 | 31: iteration 98760/ 173500 | consumed samples: 25282560 | consumed tokens: 51778682880 | elapsed time per iteration (s): 0.80 | learning rate: 9.179E-05 | global batch size: 256 | lm loss: 1.947714E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.504 | TFLOPs: 19.39 | 31: iteration 98770/ 173500 | consumed samples: 25285120 | consumed tokens: 51783925760 | elapsed time per iteration (s): 0.78 | learning rate: 9.177E-05 | global batch size: 256 | lm loss: 1.988456E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.491 | TFLOPs: 19.87 | 31: iteration 98780/ 173500 | consumed samples: 25287680 | consumed tokens: 51789168640 | elapsed time per iteration (s): 0.81 | learning rate: 9.175E-05 | global batch size: 256 | lm loss: 1.961227E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.248 | TFLOPs: 19.13 | 31: iteration 98790/ 173500 | consumed samples: 25290240 | consumed tokens: 51794411520 | elapsed time per iteration (s): 0.78 | learning rate: 9.174E-05 | global batch size: 256 | lm loss: 1.976439E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.122 | TFLOPs: 19.85 | 31: iteration 98800/ 173500 | consumed samples: 25292800 | consumed tokens: 51799654400 | elapsed time 
per iteration (s): 0.74 | learning rate: 9.172E-05 | global batch size: 256 | lm loss: 1.975100E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.411 | TFLOPs: 20.84 | 31: iteration 98810/ 173500 | consumed samples: 25295360 | consumed tokens: 51804897280 | elapsed time per iteration (s): 0.75 | learning rate: 9.170E-05 | global batch size: 256 | lm loss: 1.988120E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.940 | TFLOPs: 20.69 | 31: iteration 98820/ 173500 | consumed samples: 25297920 | consumed tokens: 51810140160 | elapsed time per iteration (s): 0.83 | learning rate: 9.169E-05 | global batch size: 256 | lm loss: 2.006968E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.429 | TFLOPs: 18.72 | 31: iteration 98830/ 173500 | consumed samples: 25300480 | consumed tokens: 51815383040 | elapsed time per iteration (s): 0.86 | learning rate: 9.167E-05 | global batch size: 256 | lm loss: 1.954936E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.835 | TFLOPs: 18.02 | 31: iteration 98840/ 173500 | consumed samples: 25303040 | consumed tokens: 51820625920 | elapsed time per iteration (s): 0.76 | learning rate: 9.166E-05 | global batch size: 256 | lm loss: 1.997933E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.322 | TFLOPs: 20.47 | 31: iteration 98850/ 173500 | consumed samples: 25305600 | consumed tokens: 51825868800 | elapsed time per iteration (s): 0.74 | learning rate: 9.164E-05 | global batch size: 256 | lm loss: 2.015446E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.436 | TFLOPs: 
20.90 | 31: iteration 98860/ 173500 | consumed samples: 25308160 | consumed tokens: 51831111680 | elapsed time per iteration (s): 0.74 | learning rate: 9.162E-05 | global batch size: 256 | lm loss: 1.991210E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.436 | TFLOPs: 20.96 | 31: iteration 98870/ 173500 | consumed samples: 25310720 | consumed tokens: 51836354560 | elapsed time per iteration (s): 0.75 | learning rate: 9.161E-05 | global batch size: 256 | lm loss: 1.948566E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.506 | TFLOPs: 20.72 | 31: iteration 98880/ 173500 | consumed samples: 25313280 | consumed tokens: 51841597440 | elapsed time per iteration (s): 0.75 | learning rate: 9.159E-05 | global batch size: 256 | lm loss: 1.982893E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.477 | TFLOPs: 20.78 | 31: iteration 98890/ 173500 | consumed samples: 25315840 | consumed tokens: 51846840320 | elapsed time per iteration (s): 0.76 | learning rate: 9.158E-05 | global batch size: 256 | lm loss: 2.001871E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.257 | TFLOPs: 20.40 | 31: iteration 98900/ 173500 | consumed samples: 25318400 | consumed tokens: 51852083200 | elapsed time per iteration (s): 0.80 | learning rate: 9.156E-05 | global batch size: 256 | lm loss: 1.978183E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.377 | TFLOPs: 19.32 | 31: iteration 98910/ 173500 | consumed samples: 25320960 | consumed tokens: 51857326080 | elapsed time per iteration (s): 0.85 | learning rate: 9.154E-05 | global batch size: 256 | lm loss: 1.991475E+00 | grad norm: 0.148 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.055 | TFLOPs: 18.27 | 31: iteration 98920/ 173500 | consumed samples: 25323520 | consumed tokens: 51862568960 | elapsed time per iteration (s): 0.82 | learning rate: 9.153E-05 | global batch size: 256 | lm loss: 2.035662E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.841 | TFLOPs: 18.99 | 31: iteration 98930/ 173500 | consumed samples: 25326080 | consumed tokens: 51867811840 | elapsed time per iteration (s): 0.77 | learning rate: 9.151E-05 | global batch size: 256 | lm loss: 1.966970E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.018 | TFLOPs: 20.15 | 31: iteration 98940/ 173500 | consumed samples: 25328640 | consumed tokens: 51873054720 | elapsed time per iteration (s): 0.73 | learning rate: 9.150E-05 | global batch size: 256 | lm loss: 1.985417E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.721 | TFLOPs: 21.16 | 31: iteration 98950/ 173500 | consumed samples: 25331200 | consumed tokens: 51878297600 | elapsed time per iteration (s): 0.74 | learning rate: 9.148E-05 | global batch size: 256 | lm loss: 1.969341E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.968 | TFLOPs: 20.87 | 31: iteration 98960/ 173500 | consumed samples: 25333760 | consumed tokens: 51883540480 | elapsed time per iteration (s): 0.80 | learning rate: 9.146E-05 | global batch size: 256 | lm loss: 1.952291E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.604 | TFLOPs: 19.46 | 31: iteration 98970/ 173500 | consumed samples: 25336320 | consumed tokens: 51888783360 | elapsed time per iteration (s): 1.01 | 
learning rate: 9.145E-05 | global batch size: 256 | lm loss: 1.972137E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 254.041 | TFLOPs: 15.37 | 31: iteration 98980/ 173500 | consumed samples: 25338880 | consumed tokens: 51894026240 | elapsed time per iteration (s): 0.77 | learning rate: 9.143E-05 | global batch size: 256 | lm loss: 1.983722E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.922 | TFLOPs: 20.08 | 31: iteration 98990/ 173500 | consumed samples: 25341440 | consumed tokens: 51899269120 | elapsed time per iteration (s): 0.73 | learning rate: 9.141E-05 | global batch size: 256 | lm loss: 1.980334E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.963 | TFLOPs: 21.17 | 31: iteration 99000/ 173500 | consumed samples: 25344000 | consumed tokens: 51904512000 | elapsed time per iteration (s): 0.72 | learning rate: 9.140E-05 | global batch size: 256 | lm loss: 1.978793E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.628 | TFLOPs: 21.45 | 31: ------------------------------------------------------------------------------------------- 31: valid loss at iteration 99000 | lm loss value: 2.040536E+00 | lm loss PPL: 7.694736E+00 | 31: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 99000 to checkpoints_1b1long 0: [2022-11-26 16:28:53,383] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step99000 is begin to save! 0: [2022-11-26 16:28:53,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 16:28:53,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_01-model_00-model_states.pt. 0: [2022-11-26 16:28:53,637] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_03-model_00-model_states.pt... 0: [2022-11-26 16:28:53,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_03-model_00-model_states.pt. 0: [2022-11-26 16:28:53,717] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_04-model_00-model_states.pt... 0: [2022-11-26 16:28:53,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_04-model_00-model_states.pt. 0: [2022-11-26 16:28:53,800] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_05-model_00-model_states.pt... 0: [2022-11-26 16:28:53,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_05-model_00-model_states.pt. 0: [2022-11-26 16:28:53,879] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_06-model_00-model_states.pt... 0: [2022-11-26 16:28:53,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_06-model_00-model_states.pt. 0: [2022-11-26 16:28:53,955] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_07-model_00-model_states.pt... 0: [2022-11-26 16:28:54,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_07-model_00-model_states.pt. 0: [2022-11-26 16:28:54,030] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 16:28:54,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_08-model_00-model_states.pt. 0: [2022-11-26 16:28:54,106] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_09-model_00-model_states.pt... 0: [2022-11-26 16:28:54,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_09-model_00-model_states.pt. 0: [2022-11-26 16:28:54,183] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_10-model_00-model_states.pt... 0: [2022-11-26 16:28:54,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_10-model_00-model_states.pt. 0: [2022-11-26 16:28:54,267] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_11-model_00-model_states.pt... 0: [2022-11-26 16:28:54,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_11-model_00-model_states.pt. 0: [2022-11-26 16:28:54,342] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_12-model_00-model_states.pt... 0: [2022-11-26 16:28:54,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_12-model_00-model_states.pt. 0: [2022-11-26 16:28:54,419] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_13-model_00-model_states.pt... 0: [2022-11-26 16:28:54,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_13-model_00-model_states.pt. 0: [2022-11-26 16:28:54,495] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 16:28:54,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_14-model_00-model_states.pt. 0: [2022-11-26 16:28:54,572] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_15-model_00-model_states.pt... 0: [2022-11-26 16:28:54,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_15-model_00-model_states.pt. 0: [2022-11-26 16:28:54,646] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_16-model_00-model_states.pt... 0: [2022-11-26 16:28:54,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_16-model_00-model_states.pt. 0: [2022-11-26 16:28:54,721] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_17-model_00-model_states.pt... 0: [2022-11-26 16:28:54,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_17-model_00-model_states.pt. 0: [2022-11-26 16:28:54,794] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_18-model_00-model_states.pt... 0: [2022-11-26 16:28:54,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_18-model_00-model_states.pt. 0: [2022-11-26 16:28:54,870] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_19-model_00-model_states.pt... 0: [2022-11-26 16:28:54,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_19-model_00-model_states.pt. 0: [2022-11-26 16:28:54,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 16:28:55,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_20-model_00-model_states.pt. 0: [2022-11-26 16:28:55,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_21-model_00-model_states.pt... 0: [2022-11-26 16:28:55,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_21-model_00-model_states.pt. 0: [2022-11-26 16:28:55,092] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_22-model_00-model_states.pt... 0: [2022-11-26 16:28:55,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_22-model_00-model_states.pt. 0: [2022-11-26 16:28:55,163] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_23-model_00-model_states.pt... 0: [2022-11-26 16:28:55,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_23-model_00-model_states.pt. 0: [2022-11-26 16:28:55,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_24-model_00-model_states.pt... 0: [2022-11-26 16:28:55,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_24-model_00-model_states.pt. 0: [2022-11-26 16:28:55,312] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_25-model_00-model_states.pt... 0: [2022-11-26 16:28:55,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_25-model_00-model_states.pt. 0: [2022-11-26 16:28:55,385] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 16:28:55,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_26-model_00-model_states.pt. 0: [2022-11-26 16:28:55,459] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_27-model_00-model_states.pt... 0: [2022-11-26 16:28:55,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_27-model_00-model_states.pt. 0: [2022-11-26 16:28:55,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_28-model_00-model_states.pt... 0: [2022-11-26 16:28:55,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_28-model_00-model_states.pt. 0: [2022-11-26 16:28:55,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/layer_30-model_00-model_states.pt... 0: [2022-11-26 16:28:55,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/layer_30-model_00-model_states.pt. 0: [2022-11-26 16:28:55,610] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step99000/mp_rank_00_model_states.pt 0: [2022-11-26 16:28:55,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/mp_rank_00_model_states.pt... 0: [2022-11-26 16:28:55,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/mp_rank_00_model_states.pt. 0: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 
25: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 
24: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 
30: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 
26: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
6: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
7: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 
10: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
2: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
15: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 
28: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 
30: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
19: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
4: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 
2: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 
15: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
28: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 
22: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 
19: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
8: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
3: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 
17: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
10: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
9: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 
31: [2022-11-26 16:28:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:28:55,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:28:55,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 16:28:55,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 0: [2022-11-26 16:28:55,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:28:55,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 16:28:55,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 5: [2022-11-26 16:28:55,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:28:55,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 16:28:55,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 12: [2022-11-26 16:28:55,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 16:28:55,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 16:28:55,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 24: [2022-11-26 16:28:55,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:28:55,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:28:55,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 23: [2022-11-26 16:28:55,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 16:28:55,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 24: [2022-11-26 16:28:55,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 25: [2022-11-26 16:28:55,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:28:55,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 16:28:55,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 17: [2022-11-26 16:28:55,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-26 16:28:55,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 16:28:55,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 1: [2022-11-26 16:28:55,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:28:55,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 17: [2022-11-26 16:28:55,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:28:55,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:28:55,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 17: [2022-11-26 16:28:55,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 26: [2022-11-26 16:28:55,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 16:28:55,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 17: [2022-11-26 16:28:55,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 30: [2022-11-26 16:28:55,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-26 16:28:55,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 16:28:55,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 30: [2022-11-26 16:28:55,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:28:55,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 16:28:55,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 19: [2022-11-26 16:28:55,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:28:55,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:28:55,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 16:28:55,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 20: [2022-11-26 16:28:55,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 16:28:55,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 20: [2022-11-26 16:28:55,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 
20: [2022-11-26 16:28:55,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 16:28:55,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 7: [2022-11-26 16:28:55,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:28:55,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 16:28:55,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 18: [2022-11-26 16:28:55,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:28:55,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 16:28:55,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 24: [2022-11-26 16:28:55,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:28:55,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
24: [2022-11-26 16:28:55,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 5: [2022-11-26 16:28:55,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 16:28:55,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 24: [2022-11-26 16:28:55,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 26: [2022-11-26 16:28:55,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:28:55,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 16:28:55,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 22: [2022-11-26 16:28:55,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:28:55,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:28:55,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:28:55,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
26: [2022-11-26 16:28:55,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 4: [2022-11-26 16:28:55,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 22: [2022-11-26 16:28:55,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 26: [2022-11-26 16:28:55,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 4: [2022-11-26 16:28:55,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 22: [2022-11-26 16:28:55,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 17: [2022-11-26 16:28:55,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:28:55,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 22: [2022-11-26 16:28:55,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 0: [2022-11-26 16:28:55,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:28:55,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 16:28:55,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
0: [2022-11-26 16:28:55,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:28:55,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 16:28:55,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 12: [2022-11-26 16:28:55,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:28:55,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 16:28:55,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 23: [2022-11-26 16:28:55,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:28:55,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 16:28:55,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:28:55,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 23: [2022-11-26 16:28:55,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 16:28:55,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
6: [2022-11-26 16:28:55,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:28:55,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:28:55,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 10: [2022-11-26 16:28:55,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 6: [2022-11-26 16:28:55,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 10: [2022-11-26 16:28:55,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 18: [2022-11-26 16:28:55,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:28:55,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 16:28:55,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 1: [2022-11-26 16:28:55,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:28:55,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
19: [2022-11-26 16:28:55,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 1: [2022-11-26 16:28:55,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 19: [2022-11-26 16:28:55,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 1: [2022-11-26 16:28:55,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 6: [2022-11-26 16:28:55,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:28:55,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 19: [2022-11-26 16:28:55,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:28:55,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 19: [2022-11-26 16:28:55,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 16:28:55,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 7: [2022-11-26 16:28:55,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-26 16:28:55,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 16:28:55,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 2: [2022-11-26 16:28:55,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:28:55,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 16:28:55,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 16: [2022-11-26 16:28:55,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:28:55,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 16:28:55,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 5: [2022-11-26 16:28:55,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:28:55,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 24: [2022-11-26 16:28:55,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:28:55,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
24: [2022-11-26 16:28:55,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 16:28:55,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 23: [2022-11-26 16:28:55,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:28:55,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 16:28:55,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 20: [2022-11-26 16:28:55,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:28:55,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 16:28:55,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 4: [2022-11-26 16:28:55,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:28:55,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:28:55,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 16:28:55,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
10: [2022-11-26 16:28:55,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 16:28:55,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 10: [2022-11-26 16:28:55,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:28:55,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 16:28:55,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 12: [2022-11-26 16:28:55,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:28:55,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 16:28:55,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 22: [2022-11-26 16:28:55,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:28:55,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 16:28:55,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 18: [2022-11-26 16:28:55,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
18: [2022-11-26 16:28:55,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 16:28:55,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 30: [2022-11-26 16:28:55,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:28:55,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 16:28:55,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 19: [2022-11-26 16:28:55,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:28:55,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 16:28:55,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 2: [2022-11-26 16:28:55,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:28:55,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 18: [2022-11-26 16:28:55,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:28:55,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
18: [2022-11-26 16:28:55,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 16:28:55,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 25: [2022-11-26 16:28:55,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:28:55,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 16:28:55,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 16: [2022-11-26 16:28:55,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:28:55,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 16:28:55,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:28:55,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 16: [2022-11-26 16:28:55,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 16:28:55,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 7: [2022-11-26 16:28:55,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 16:28:55,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:28:55,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 16:28:55,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 24: [2022-11-26 16:28:55,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:28:55,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 16:28:55,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 24: [2022-11-26 16:28:55,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 16:28:55,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 0: [2022-11-26 16:28:55,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:28:55,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:28:55,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 16:28:55,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
2: [2022-11-26 16:28:55,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 16:28:55,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 3: [2022-11-26 16:28:55,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:28:55,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:28:55,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:28:55,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:28:55,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 16:28:55,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 16:28:55,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 16:28:55,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 16:28:55,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
3: [2022-11-26 16:28:55,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 3: [2022-11-26 16:28:55,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 3: [2022-11-26 16:28:55,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 5: [2022-11-26 16:28:55,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:28:55,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 16:28:55,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 1: [2022-11-26 16:28:55,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:28:55,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 4: [2022-11-26 16:28:55,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:28:55,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 1: [2022-11-26 16:28:55,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 4: [2022-11-26 16:28:55,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
26: [2022-11-26 16:28:55,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:28:55,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 16:28:55,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 28: [2022-11-26 16:28:55,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:28:55,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:28:55,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:28:55,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:28:55,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:28:55,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 28: [2022-11-26 16:28:55,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-26 16:28:55,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 16:28:55,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 16:28:55,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 16:28:55,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 6: [2022-11-26 16:28:55,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:28:55,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 28: [2022-11-26 16:28:55,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 6: [2022-11-26 16:28:55,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 28: [2022-11-26 16:28:55,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 28: [2022-11-26 16:28:55,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 6: [2022-11-26 16:28:55,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 28: [2022-11-26 16:28:55,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
6: [2022-11-26 16:28:55,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 16:28:55,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 10: [2022-11-26 16:28:55,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:28:55,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 16:28:55,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 30: [2022-11-26 16:28:55,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:28:55,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 16:28:55,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 0: [2022-11-26 16:28:55,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 16:28:55,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 9: [2022-11-26 16:28:55,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:28:55,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-26 16:28:55,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:28:55,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:28:55,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 16:28:55,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 16:28:55,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 16:28:55,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 16:28:55,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 9: [2022-11-26 16:28:55,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 9: [2022-11-26 16:28:55,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 9: [2022-11-26 16:28:55,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 17: [2022-11-26 16:28:55,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 
17: [2022-11-26 16:28:55,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 16:28:55,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 1: [2022-11-26 16:28:55,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:28:55,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 16:28:55,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 16: [2022-11-26 16:28:55,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:28:55,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 16:28:55,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 12: [2022-11-26 16:28:55,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:28:55,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 16:28:55,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 22: [2022-11-26 16:28:55,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
22: [2022-11-26 16:28:55,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 16:28:55,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 6: [2022-11-26 16:28:55,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:28:55,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 16:28:55,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 27: [2022-11-26 16:28:55,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:28:55,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:28:55,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:28:55,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
27: [2022-11-26 16:28:55,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 16:28:55,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 16:28:55,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 16:28:55,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 16:28:55,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 27: [2022-11-26 16:28:55,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 27: [2022-11-26 16:28:55,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 27: [2022-11-26 16:28:55,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 20: [2022-11-26 16:28:55,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:28:55,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 16:28:55,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 1: [2022-11-26 16:28:55,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 16:28:55,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 16:28:55,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 21: [2022-11-26 16:28:55,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:28:55,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:28:55,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:28:55,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:28:55,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 16:28:55,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 16:28:55,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 16:28:55,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 16:28:55,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
21: [2022-11-26 16:28:55,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 21: [2022-11-26 16:28:55,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 21: [2022-11-26 16:28:55,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 20: [2022-11-26 16:28:55,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:28:55,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:28:55,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:28:55,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:28:55,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 
20: [2022-11-26 16:28:55,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 31: [2022-11-26 16:28:55,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 16:28:55,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 16:28:55,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 16:28:55,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 16:28:55,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 31: [2022-11-26 16:28:55,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 31: [2022-11-26 16:28:55,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 31: [2022-11-26 16:28:55,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 20: [2022-11-26 16:28:55,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 25: [2022-11-26 16:28:55,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:28:55,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
25: [2022-11-26 16:28:55,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 16:28:55,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 25: [2022-11-26 16:28:55,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 16:28:55,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 14: [2022-11-26 16:28:55,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:28:55,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:28:55,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:28:55,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 16:28:55,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 16:28:55,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 16:28:55,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 16:28:55,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 16:28:55,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 8: [2022-11-26 16:28:55,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:28:55,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:28:55,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:28:55,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:28:55,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
8: [2022-11-26 16:28:55,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 16:28:55,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 16:28:55,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 16:28:55,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 14: [2022-11-26 16:28:55,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 8: [2022-11-26 16:28:55,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 8: [2022-11-26 16:28:55,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 8: [2022-11-26 16:28:55,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 14: [2022-11-26 16:28:55,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 8: [2022-11-26 16:28:55,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 25: [2022-11-26 16:28:55,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:28:55,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
25: [2022-11-26 16:28:55,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 14: [2022-11-26 16:28:55,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 16:28:55,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 25: [2022-11-26 16:28:55,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 2: [2022-11-26 16:28:55,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:28:55,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 16:28:55,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 26: [2022-11-26 16:28:55,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:28:55,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 16:28:55,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 3: [2022-11-26 16:28:55,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 16:28:55,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 16:28:55,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 17: [2022-11-26 16:28:55,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:28:55,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 16:28:55,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 8: [2022-11-26 16:28:55,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:28:55,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 16:28:55,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 23: [2022-11-26 16:28:55,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:28:55,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 16:28:55,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 28: [2022-11-26 16:28:55,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-26 16:28:55,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 16:28:55,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 9: [2022-11-26 16:28:55,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:28:55,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 16:28:55,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 24: [2022-11-26 16:28:55,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:28:55,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 16:28:55,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 0: [2022-11-26 16:28:55,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:28:55,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 16:28:55,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 30: [2022-11-26 16:28:55,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
30: [2022-11-26 16:28:55,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 16:28:55,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 7: [2022-11-26 16:28:55,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:28:55,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 16:28:55,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 18: [2022-11-26 16:28:55,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:28:55,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 16:28:55,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 16: [2022-11-26 16:28:55,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:28:55,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 16:28:55,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 19: [2022-11-26 16:28:55,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
19: [2022-11-26 16:28:55,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 16:28:55,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 31: [2022-11-26 16:28:55,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:28:55,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 12: [2022-11-26 16:28:55,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:28:55,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 12: [2022-11-26 16:28:55,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 16:28:55,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 27: [2022-11-26 16:28:55,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:28:55,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 16:28:55,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 4: [2022-11-26 16:28:55,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 16:28:55,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 16:28:55,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 22: [2022-11-26 16:28:55,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:28:55,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 21: [2022-11-26 16:28:55,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:28:55,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 21: [2022-11-26 16:28:55,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 16:28:55,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 5: [2022-11-26 16:28:55,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:28:55,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 16:28:55,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 1: [2022-11-26 16:28:55,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 16:28:55,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 16:28:55,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 20: [2022-11-26 16:28:55,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:28:55,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 16:28:55,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 14: [2022-11-26 16:28:55,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:28:55,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 16:28:55,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 6: [2022-11-26 16:28:55,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:28:55,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 16:28:55,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 2: [2022-11-26 16:28:55,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 16:28:55,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 16:28:55,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 26: [2022-11-26 16:28:55,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:28:55,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:28:55,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 16:28:55,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 26: [2022-11-26 16:28:55,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 16:28:55,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 17: [2022-11-26 16:28:55,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:28:55,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 16:28:55,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 8: [2022-11-26 16:28:55,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-26 16:28:55,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 16:28:55,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 23: [2022-11-26 16:28:55,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:28:55,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 16:28:55,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 25: [2022-11-26 16:28:55,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:28:55,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 16:28:55,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 9: [2022-11-26 16:28:55,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:28:55,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 16:28:55,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 28: [2022-11-26 16:28:55,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
28: [2022-11-26 16:28:55,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 16:28:55,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 24: [2022-11-26 16:28:55,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:28:55,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 16:28:55,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 30: [2022-11-26 16:28:55,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:28:55,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 16:28:55,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 10: [2022-11-26 16:28:55,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:28:55,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 16:28:55,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 21: [2022-11-26 16:28:55,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-26 16:28:55,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 16:28:55,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 0: [2022-11-26 16:28:55,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:28:55,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 16:28:55,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 7: [2022-11-26 16:28:55,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:28:55,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 16:28:55,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 18: [2022-11-26 16:28:55,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:28:55,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 16:28:55,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 31: [2022-11-26 16:28:55,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-26 16:28:55,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 16:28:55,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 16: [2022-11-26 16:28:55,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:28:55,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 16:28:55,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 12: [2022-11-26 16:28:55,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:28:55,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 16:28:55,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 19: [2022-11-26 16:28:55,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:28:55,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 16:28:55,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 5: [2022-11-26 16:28:55,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-26 16:28:55,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 16:28:55,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 4: [2022-11-26 16:28:55,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:28:55,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 16:28:55,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 27: [2022-11-26 16:28:55,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:28:55,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 16:28:55,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 22: [2022-11-26 16:28:55,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:28:55,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 16:28:55,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 10: [2022-11-26 16:28:55,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 16:28:55,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 16:28:55,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 20: [2022-11-26 16:28:55,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:28:55,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 16:28:55,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 1: [2022-11-26 16:28:55,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:28:55,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:28:55,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 1: [2022-11-26 16:28:55,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 17: [2022-11-26 16:28:55,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:28:55,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-26 16:28:55,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 8: [2022-11-26 16:28:55,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 1: [2022-11-26 16:28:55,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 14: [2022-11-26 16:28:55,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 17: [2022-11-26 16:28:55,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 16:28:55,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 6: [2022-11-26 16:28:55,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:28:55,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 16:28:55,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 25: [2022-11-26 16:28:55,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:28:55,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 16:28:55,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
2: [2022-11-26 16:28:55,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:28:55,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 16:28:55,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 26: [2022-11-26 16:28:55,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:28:55,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 16:28:55,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 23: [2022-11-26 16:28:55,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:28:55,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 16:28:55,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 7: [2022-11-26 16:28:55,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:28:55,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 16:28:55,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
21: [2022-11-26 16:28:55,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:28:55,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 16:28:55,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 28: [2022-11-26 16:28:55,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:28:55,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 16:28:55,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 30: [2022-11-26 16:28:55,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:28:55,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 16:28:55,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 18: [2022-11-26 16:28:55,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:28:55,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 16:28:55,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
24: [2022-11-26 16:28:55,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:28:55,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 16:28:55,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 3: [2022-11-26 16:28:55,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:28:55,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 16:28:55,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 9: [2022-11-26 16:28:55,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:28:55,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 16:28:55,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 0: [2022-11-26 16:28:55,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:28:55,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 16:28:55,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
16: [2022-11-26 16:28:55,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:28:55,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 16:28:55,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 31: [2022-11-26 16:28:55,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:28:55,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 16:28:55,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 19: [2022-11-26 16:28:55,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:28:55,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 16:28:55,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 4: [2022-11-26 16:28:55,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:28:55,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 16:28:55,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
5: [2022-11-26 16:28:55,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:28:55,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 16:28:55,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 12: [2022-11-26 16:28:55,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:28:55,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 16:28:55,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 0: [2022-11-26 16:28:55,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:28:55,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 16:28:55,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 20: [2022-11-26 16:28:55,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:28:55,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 16:28:55,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
22: [2022-11-26 16:28:55,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:28:55,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 16:28:55,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 9: [2022-11-26 16:28:55,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:28:55,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 16:28:55,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 10: [2022-11-26 16:28:55,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:28:55,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:28:55,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 16:28:55,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 10: [2022-11-26 16:28:55,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 16:28:55,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
4: [2022-11-26 16:28:55,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:28:55,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 16:28:55,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 8: [2022-11-26 16:28:55,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:28:55,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 16:28:55,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 24: [2022-11-26 16:28:55,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:28:55,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 5: [2022-11-26 16:28:55,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:28:55,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:28:55,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
5: [2022-11-26 16:28:55,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 1: [2022-11-26 16:28:55,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 16:28:55,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 5: [2022-11-26 16:28:55,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 27: [2022-11-26 16:28:55,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:28:55,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:28:55,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 16:28:55,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 16:28:55,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 27: [2022-11-26 16:28:55,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 18: [2022-11-26 16:28:55,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-26 16:28:55,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 16:28:55,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 30: [2022-11-26 16:28:55,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:28:55,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 16:28:55,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 28: [2022-11-26 16:28:55,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:28:55,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 16:28:55,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 21: [2022-11-26 16:28:55,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:28:55,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 16:28:55,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 26: [2022-11-26 16:28:55,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-26 16:28:55,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 22: [2022-11-26 16:28:55,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:28:55,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 16: [2022-11-26 16:28:55,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:28:55,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 16:28:55,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 16: [2022-11-26 16:28:55,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 16:28:55,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 14: [2022-11-26 16:28:55,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:28:55,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 16:28:55,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 6: [2022-11-26 16:28:55,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-26 16:28:55,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 16:28:55,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 17: [2022-11-26 16:28:55,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:28:55,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 16:28:55,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 7: [2022-11-26 16:28:55,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:28:55,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:28:55,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 7: [2022-11-26 16:28:55,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 12: [2022-11-26 16:28:55,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 3: [2022-11-26 16:28:55,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:28:55,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
3: [2022-11-26 16:28:55,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 16:28:55,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 19: [2022-11-26 16:28:55,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:28:55,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 16:28:55,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 25: [2022-11-26 16:28:55,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:28:55,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 16:28:55,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 23: [2022-11-26 16:28:55,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:28:55,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 16:28:55,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 31: [2022-11-26 16:28:55,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-26 16:28:55,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 16:28:55,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 15: [2022-11-26 16:28:55,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:28:55,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:28:55,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:28:55,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:28:55,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 16:28:55,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 16:28:55,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 16:28:55,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 16:28:55,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
15: [2022-11-26 16:28:55,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 15: [2022-11-26 16:28:55,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 15: [2022-11-26 16:28:55,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 29: [2022-11-26 16:28:55,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:28:55,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:28:55,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:28:55,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:28:55,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 16:28:55,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 15: [2022-11-26 16:28:55,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 29: [2022-11-26 16:28:55,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 15: [2022-11-26 16:28:55,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
29: [2022-11-26 16:28:55,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 29: [2022-11-26 16:28:55,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 29: [2022-11-26 16:28:55,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 29: [2022-11-26 16:28:55,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:28:55,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 16:28:55,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 2: [2022-11-26 16:28:55,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:28:55,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 16:28:55,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 15: [2022-11-26 16:28:55,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:28:55,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 16:28:55,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
15: [2022-11-26 16:28:55,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:28:55,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 16:28:55,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 15: [2022-11-26 16:28:55,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:28:55,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:28:55,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:28:55,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 29: [2022-11-26 16:28:55,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:28:55,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:28:55,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
29: [2022-11-26 16:28:55,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 16:28:55,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 16:28:55,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 16:28:55,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 16:28:55,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 29: [2022-11-26 16:28:55,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 29: [2022-11-26 16:28:55,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 29: [2022-11-26 16:28:55,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 13: [2022-11-26 16:28:55,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:28:55,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 16:28:55,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 13: [2022-11-26 16:28:55,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 16:28:55,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:28:55,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:28:55,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 16:28:55,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 16:28:55,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 16:28:55,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 13: [2022-11-26 16:28:55,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 13: [2022-11-26 16:28:55,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 13: [2022-11-26 16:28:55,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:28:55,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:28:55,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:28:55,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 16:28:55,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 16:28:55,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 16:28:55,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 16:28:55,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 16:28:55,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 13: [2022-11-26 16:28:55,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 13: [2022-11-26 16:28:55,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 13: [2022-11-26 16:28:55,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 11: [2022-11-26 16:28:56,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:28:56,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:28:56,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:28:56,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-26 16:28:56,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 16:28:56,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 16:28:56,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 16:28:56,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:28:56,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:28:56,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:28:56,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 16:28:56,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 11: [2022-11-26 16:28:56,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 11: [2022-11-26 16:28:56,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
11: [2022-11-26 16:28:56,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 16:28:56,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 16:28:56,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 16:28:56,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 11: [2022-11-26 16:28:56,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 11: [2022-11-26 16:28:56,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 11: [2022-11-26 16:28:56,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:28:56,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 11: [2022-11-26 16:28:56,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step99000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 16:28:56,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
0: successfully saved checkpoint at iteration 99000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2650.23 31: iteration 99010/ 173500 | consumed samples: 25346560 | consumed tokens: 51909754880 | elapsed time per iteration (s): 1.09 | learning rate: 9.138E-05 | global batch size: 256 | lm loss: 1.965661E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.778 | TFLOPs: 14.20 | 31: iteration 99020/ 173500 | consumed samples: 25349120 | consumed tokens: 51914997760 | elapsed time per iteration (s): 0.76 | learning rate: 9.137E-05 | global batch size: 256 | lm loss: 1.989798E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.733 | TFLOPs: 20.43 | 31: iteration 99030/ 173500 | consumed samples: 25351680 | consumed tokens: 51920240640 | elapsed time per iteration (s): 0.79 | learning rate: 9.135E-05 | global batch size: 256 | lm loss: 1.967173E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.049 | TFLOPs: 19.48 | 31: iteration 99040/ 173500 | consumed samples: 25354240 | consumed tokens: 51925483520 | elapsed time per iteration (s): 0.73 | learning rate: 9.133E-05 | global batch size: 256 | lm loss: 1.988380E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.416 | TFLOPs: 21.08 | 31: iteration 99050/ 173500 | consumed samples: 25356800 | consumed tokens: 51930726400 | elapsed time per iteration (s): 0.74 | learning rate: 9.132E-05 | global batch size: 256 | lm loss: 1.984556E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.154 | TFLOPs: 20.82 | 31: iteration 99060/ 173500 | consumed samples: 25359360 | consumed tokens: 51935969280 | elapsed time per iteration (s): 0.78 | 
learning rate: 9.130E-05 | global batch size: 256 | lm loss: 1.959858E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.473 | TFLOPs: 19.81 | 31: iteration 99070/ 173500 | consumed samples: 25361920 | consumed tokens: 51941212160 | elapsed time per iteration (s): 0.77 | learning rate: 9.129E-05 | global batch size: 256 | lm loss: 1.963570E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.114 | TFLOPs: 20.03 | 31: iteration 99080/ 173500 | consumed samples: 25364480 | consumed tokens: 51946455040 | elapsed time per iteration (s): 0.83 | learning rate: 9.127E-05 | global batch size: 256 | lm loss: 1.978506E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.398 | TFLOPs: 18.66 | 31: iteration 99090/ 173500 | consumed samples: 25367040 | consumed tokens: 51951697920 | elapsed time per iteration (s): 0.82 | learning rate: 9.125E-05 | global batch size: 256 | lm loss: 1.983811E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.205 | TFLOPs: 18.83 | 31: iteration 99100/ 173500 | consumed samples: 25369600 | consumed tokens: 51956940800 | elapsed time per iteration (s): 0.83 | learning rate: 9.124E-05 | global batch size: 256 | lm loss: 1.967227E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.192 | TFLOPs: 18.77 | 31: iteration 99110/ 173500 | consumed samples: 25372160 | consumed tokens: 51962183680 | elapsed time per iteration (s): 0.74 | learning rate: 9.122E-05 | global batch size: 256 | lm loss: 1.965978E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.230 | TFLOPs: 20.83 | 31: iteration 99120/ 
173500 | consumed samples: 25374720 | consumed tokens: 51967426560 | elapsed time per iteration (s): 0.76 | learning rate: 9.121E-05 | global batch size: 256 | lm loss: 1.948573E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.893 | TFLOPs: 20.32 | 31: iteration 99130/ 173500 | consumed samples: 25377280 | consumed tokens: 51972669440 | elapsed time per iteration (s): 0.74 | learning rate: 9.119E-05 | global batch size: 256 | lm loss: 1.965872E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.742 | TFLOPs: 20.86 | 31: iteration 99140/ 173500 | consumed samples: 25379840 | consumed tokens: 51977912320 | elapsed time per iteration (s): 0.78 | learning rate: 9.117E-05 | global batch size: 256 | lm loss: 1.988578E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.059 | TFLOPs: 19.97 | 31: iteration 99150/ 173500 | consumed samples: 25382400 | consumed tokens: 51983155200 | elapsed time per iteration (s): 0.73 | learning rate: 9.116E-05 | global batch size: 256 | lm loss: 1.979871E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.349 | TFLOPs: 21.26 | 31: iteration 99160/ 173500 | consumed samples: 25384960 | consumed tokens: 51988398080 | elapsed time per iteration (s): 0.73 | learning rate: 9.114E-05 | global batch size: 256 | lm loss: 1.989147E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.968 | TFLOPs: 21.17 | 31: iteration 99170/ 173500 | consumed samples: 25387520 | consumed tokens: 51993640960 | elapsed time per iteration (s): 0.76 | learning rate: 9.113E-05 | global batch size: 256 | lm loss: 1.975603E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 337.297 | TFLOPs: 20.41 | 31: iteration 99180/ 173500 | consumed samples: 25390080 | consumed tokens: 51998883840 | elapsed time per iteration (s): 0.75 | learning rate: 9.111E-05 | global batch size: 256 | lm loss: 1.966435E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.619 | TFLOPs: 20.67 | 31: iteration 99190/ 173500 | consumed samples: 25392640 | consumed tokens: 52004126720 | elapsed time per iteration (s): 0.72 | learning rate: 9.109E-05 | global batch size: 256 | lm loss: 1.977751E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.575 | TFLOPs: 21.45 | 31: iteration 99200/ 173500 | consumed samples: 25395200 | consumed tokens: 52009369600 | elapsed time per iteration (s): 0.79 | learning rate: 9.108E-05 | global batch size: 256 | lm loss: 1.954926E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.264 | TFLOPs: 19.50 | 31: iteration 99210/ 173500 | consumed samples: 25397760 | consumed tokens: 52014612480 | elapsed time per iteration (s): 0.79 | learning rate: 9.106E-05 | global batch size: 256 | lm loss: 1.973529E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.512 | TFLOPs: 19.51 | 31: iteration 99220/ 173500 | consumed samples: 25400320 | consumed tokens: 52019855360 | elapsed time per iteration (s): 0.83 | learning rate: 9.104E-05 | global batch size: 256 | lm loss: 1.980606E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.429 | TFLOPs: 18.72 | 31: iteration 99230/ 173500 | consumed samples: 25402880 | consumed tokens: 52025098240 | elapsed time per iteration (s): 0.83 | learning rate: 
9.103E-05 | global batch size: 256 | lm loss: 1.996957E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.782 | TFLOPs: 18.56 | 31: iteration 99240/ 173500 | consumed samples: 25405440 | consumed tokens: 52030341120 | elapsed time per iteration (s): 0.75 | learning rate: 9.101E-05 | global batch size: 256 | lm loss: 1.987640E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.685 | TFLOPs: 20.55 | 31: iteration 99250/ 173500 | consumed samples: 25408000 | consumed tokens: 52035584000 | elapsed time per iteration (s): 0.76 | learning rate: 9.100E-05 | global batch size: 256 | lm loss: 1.964440E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.835 | TFLOPs: 20.50 | 31: iteration 99260/ 173500 | consumed samples: 25410560 | consumed tokens: 52040826880 | elapsed time per iteration (s): 0.77 | learning rate: 9.098E-05 | global batch size: 256 | lm loss: 1.961659E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.614 | TFLOPs: 20.00 | 31: iteration 99270/ 173500 | consumed samples: 25413120 | consumed tokens: 52046069760 | elapsed time per iteration (s): 0.77 | learning rate: 9.096E-05 | global batch size: 256 | lm loss: 1.985082E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.631 | TFLOPs: 20.18 | 31: iteration 99280/ 173500 | consumed samples: 25415680 | consumed tokens: 52051312640 | elapsed time per iteration (s): 0.85 | learning rate: 9.095E-05 | global batch size: 256 | lm loss: 2.001951E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.395 | TFLOPs: 18.17 | 31: iteration 99290/ 173500 | 
consumed samples: 25418240 | consumed tokens: 52056555520 | elapsed time per iteration (s): 0.77 | learning rate: 9.093E-05 | global batch size: 256 | lm loss: 1.957126E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.504 | TFLOPs: 20.24 | 31: iteration 99300/ 173500 | consumed samples: 25420800 | consumed tokens: 52061798400 | elapsed time per iteration (s): 0.79 | learning rate: 9.092E-05 | global batch size: 256 | lm loss: 2.012735E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.716 | TFLOPs: 19.70 | 31: iteration 99310/ 173500 | consumed samples: 25423360 | consumed tokens: 52067041280 | elapsed time per iteration (s): 0.77 | learning rate: 9.090E-05 | global batch size: 256 | lm loss: 1.972130E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.750 | TFLOPs: 20.19 | 31: iteration 99320/ 173500 | consumed samples: 25425920 | consumed tokens: 52072284160 | elapsed time per iteration (s): 0.83 | learning rate: 9.088E-05 | global batch size: 256 | lm loss: 1.982074E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.880 | TFLOPs: 18.75 | 31: iteration 99330/ 173500 | consumed samples: 25428480 | consumed tokens: 52077527040 | elapsed time per iteration (s): 0.75 | learning rate: 9.087E-05 | global batch size: 256 | lm loss: 1.987364E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.436 | TFLOPs: 20.66 | 31: iteration 99340/ 173500 | consumed samples: 25431040 | consumed tokens: 52082769920 | elapsed time per iteration (s): 0.77 | learning rate: 9.085E-05 | global batch size: 256 | lm loss: 1.977141E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 330.516 | TFLOPs: 20.00 | 31: iteration 99350/ 173500 | consumed samples: 25433600 | consumed tokens: 52088012800 | elapsed time per iteration (s): 0.73 | learning rate: 9.084E-05 | global batch size: 256 | lm loss: 1.958255E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.208 | TFLOPs: 21.25 | 31: iteration 99360/ 173500 | consumed samples: 25436160 | consumed tokens: 52093255680 | elapsed time per iteration (s): 0.79 | learning rate: 9.082E-05 | global batch size: 256 | lm loss: 1.985756E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.351 | TFLOPs: 19.50 | 31: iteration 99370/ 173500 | consumed samples: 25438720 | consumed tokens: 52098498560 | elapsed time per iteration (s): 0.71 | learning rate: 9.080E-05 | global batch size: 256 | lm loss: 1.963287E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 359.939 | TFLOPs: 21.78 | 31: iteration 99380/ 173500 | consumed samples: 25441280 | consumed tokens: 52103741440 | elapsed time per iteration (s): 0.84 | learning rate: 9.079E-05 | global batch size: 256 | lm loss: 1.983239E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.062 | TFLOPs: 18.52 | 31: iteration 99390/ 173500 | consumed samples: 25443840 | consumed tokens: 52108984320 | elapsed time per iteration (s): 0.73 | learning rate: 9.077E-05 | global batch size: 256 | lm loss: 1.996829E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.404 | TFLOPs: 21.26 | 31: iteration 99400/ 173500 | consumed samples: 25446400 | consumed tokens: 52114227200 | elapsed time per iteration (s): 0.77 | learning rate: 9.076E-05 | global batch 
size: 256 | lm loss: 1.991828E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.312 | TFLOPs: 20.23 | 31: iteration 99410/ 173500 | consumed samples: 25448960 | consumed tokens: 52119470080 | elapsed time per iteration (s): 2.50 | learning rate: 9.074E-05 | global batch size: 256 | lm loss: 1.982590E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 102.512 | TFLOPs: 6.20 | 31: iteration 99420/ 173500 | consumed samples: 25451520 | consumed tokens: 52124712960 | elapsed time per iteration (s): 0.77 | learning rate: 9.072E-05 | global batch size: 256 | lm loss: 1.979247E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.669 | TFLOPs: 20.19 | 31: iteration 99430/ 173500 | consumed samples: 25454080 | consumed tokens: 52129955840 | elapsed time per iteration (s): 0.75 | learning rate: 9.071E-05 | global batch size: 256 | lm loss: 1.975721E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.925 | TFLOPs: 20.56 | 31: iteration 99440/ 173500 | consumed samples: 25456640 | consumed tokens: 52135198720 | elapsed time per iteration (s): 0.76 | learning rate: 9.069E-05 | global batch size: 256 | lm loss: 1.992874E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.383 | TFLOPs: 20.41 | 31: iteration 99450/ 173500 | consumed samples: 25459200 | consumed tokens: 52140441600 | elapsed time per iteration (s): 0.82 | learning rate: 9.067E-05 | global batch size: 256 | lm loss: 1.984946E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.265 | TFLOPs: 18.83 | 31: iteration 99460/ 173500 | consumed samples: 25461760 | 
consumed tokens: 52145684480 | elapsed time per iteration (s): 0.72 | learning rate: 9.066E-05 | global batch size: 256 | lm loss: 1.943800E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.187 | TFLOPs: 21.43 | 31: iteration 99470/ 173500 | consumed samples: 25464320 | consumed tokens: 52150927360 | elapsed time per iteration (s): 0.79 | learning rate: 9.064E-05 | global batch size: 256 | lm loss: 1.992872E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.198 | TFLOPs: 19.55 | 31: iteration 99480/ 173500 | consumed samples: 25466880 | consumed tokens: 52156170240 | elapsed time per iteration (s): 0.76 | learning rate: 9.063E-05 | global batch size: 256 | lm loss: 1.981576E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.051 | TFLOPs: 20.39 | 31: iteration 99490/ 173500 | consumed samples: 25469440 | consumed tokens: 52161413120 | elapsed time per iteration (s): 0.72 | learning rate: 9.061E-05 | global batch size: 256 | lm loss: 1.966920E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.733 | TFLOPs: 21.40 | 31: iteration 99500/ 173500 | consumed samples: 25472000 | consumed tokens: 52166656000 | elapsed time per iteration (s): 0.86 | learning rate: 9.059E-05 | global batch size: 256 | lm loss: 1.997909E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.049 | TFLOPs: 18.03 | 31: iteration 99510/ 173500 | consumed samples: 25474560 | consumed tokens: 52171898880 | elapsed time per iteration (s): 0.77 | learning rate: 9.058E-05 | global batch size: 256 | lm loss: 1.992257E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 333.862 | TFLOPs: 20.20 | 31: iteration 99520/ 173500 | consumed samples: 25477120 | consumed tokens: 52177141760 | elapsed time per iteration (s): 0.75 | learning rate: 9.056E-05 | global batch size: 256 | lm loss: 1.984113E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.698 | TFLOPs: 20.73 | 31: iteration 99530/ 173500 | consumed samples: 25479680 | consumed tokens: 52182384640 | elapsed time per iteration (s): 0.80 | learning rate: 9.055E-05 | global batch size: 256 | lm loss: 1.963886E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.403 | TFLOPs: 19.32 | 31: iteration 99540/ 173500 | consumed samples: 25482240 | consumed tokens: 52187627520 | elapsed time per iteration (s): 0.76 | learning rate: 9.053E-05 | global batch size: 256 | lm loss: 1.974796E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.942 | TFLOPs: 20.32 | 31: iteration 99550/ 173500 | consumed samples: 25484800 | consumed tokens: 52192870400 | elapsed time per iteration (s): 0.75 | learning rate: 9.051E-05 | global batch size: 256 | lm loss: 1.969382E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.615 | TFLOPs: 20.67 | 31: iteration 99560/ 173500 | consumed samples: 25487360 | consumed tokens: 52198113280 | elapsed time per iteration (s): 0.77 | learning rate: 9.050E-05 | global batch size: 256 | lm loss: 1.954189E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.620 | TFLOPs: 20.12 | 31: iteration 99570/ 173500 | consumed samples: 25489920 | consumed tokens: 52203356160 | elapsed time per iteration (s): 0.74 | learning rate: 9.048E-05 | global batch size: 256 | lm loss: 
1.997745E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.659 | TFLOPs: 20.91 | 31: iteration 99580/ 173500 | consumed samples: 25492480 | consumed tokens: 52208599040 | elapsed time per iteration (s): 0.77 | learning rate: 9.047E-05 | global batch size: 256 | lm loss: 1.944160E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.088 | TFLOPs: 20.03 | 31: iteration 99590/ 173500 | consumed samples: 25495040 | consumed tokens: 52213841920 | elapsed time per iteration (s): 0.80 | learning rate: 9.045E-05 | global batch size: 256 | lm loss: 1.955777E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.142 | TFLOPs: 19.43 | 31: iteration 99600/ 173500 | consumed samples: 25497600 | consumed tokens: 52219084800 | elapsed time per iteration (s): 0.81 | learning rate: 9.043E-05 | global batch size: 256 | lm loss: 1.965552E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.522 | TFLOPs: 19.03 | 31: iteration 99610/ 173500 | consumed samples: 25500160 | consumed tokens: 52224327680 | elapsed time per iteration (s): 0.75 | learning rate: 9.042E-05 | global batch size: 256 | lm loss: 1.971425E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.890 | TFLOPs: 20.74 | 31: iteration 99620/ 173500 | consumed samples: 25502720 | consumed tokens: 52229570560 | elapsed time per iteration (s): 0.76 | learning rate: 9.040E-05 | global batch size: 256 | lm loss: 1.957940E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.050 | TFLOPs: 20.45 | 31: iteration 99630/ 173500 | consumed samples: 25505280 | consumed tokens: 
52234813440 | elapsed time per iteration (s): 0.74 | learning rate: 9.039E-05 | global batch size: 256 | lm loss: 1.985198E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.863 | TFLOPs: 21.04 | 31: iteration 99640/ 173500 | consumed samples: 25507840 | consumed tokens: 52240056320 | elapsed time per iteration (s): 0.79 | learning rate: 9.037E-05 | global batch size: 256 | lm loss: 1.997232E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.830 | TFLOPs: 19.71 | 31: iteration 99650/ 173500 | consumed samples: 25510400 | consumed tokens: 52245299200 | elapsed time per iteration (s): 0.74 | learning rate: 9.035E-05 | global batch size: 256 | lm loss: 1.981988E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.109 | TFLOPs: 21.06 | 31: iteration 99660/ 173500 | consumed samples: 25512960 | consumed tokens: 52250542080 | elapsed time per iteration (s): 0.74 | learning rate: 9.034E-05 | global batch size: 256 | lm loss: 1.977028E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.726 | TFLOPs: 20.79 | 31: iteration 99670/ 173500 | consumed samples: 25515520 | consumed tokens: 52255784960 | elapsed time per iteration (s): 0.80 | learning rate: 9.032E-05 | global batch size: 256 | lm loss: 1.964216E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.818 | TFLOPs: 19.35 | 31: iteration 99680/ 173500 | consumed samples: 25518080 | consumed tokens: 52261027840 | elapsed time per iteration (s): 0.78 | learning rate: 9.031E-05 | global batch size: 256 | lm loss: 1.999378E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 328.038 | TFLOPs: 19.85 | 31: iteration 99690/ 173500 | consumed samples: 25520640 | consumed tokens: 52266270720 | elapsed time per iteration (s): 0.76 | learning rate: 9.029E-05 | global batch size: 256 | lm loss: 2.000782E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.420 | TFLOPs: 20.41 | 31: iteration 99700/ 173500 | consumed samples: 25523200 | consumed tokens: 52271513600 | elapsed time per iteration (s): 0.78 | learning rate: 9.027E-05 | global batch size: 256 | lm loss: 1.980778E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.177 | TFLOPs: 19.79 | 31: iteration 99710/ 173500 | consumed samples: 25525760 | consumed tokens: 52276756480 | elapsed time per iteration (s): 0.79 | learning rate: 9.026E-05 | global batch size: 256 | lm loss: 1.984256E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.446 | TFLOPs: 19.63 | 31: iteration 99720/ 173500 | consumed samples: 25528320 | consumed tokens: 52281999360 | elapsed time per iteration (s): 0.73 | learning rate: 9.024E-05 | global batch size: 256 | lm loss: 2.004155E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.875 | TFLOPs: 21.11 | 31: iteration 99730/ 173500 | consumed samples: 25530880 | consumed tokens: 52287242240 | elapsed time per iteration (s): 0.82 | learning rate: 9.022E-05 | global batch size: 256 | lm loss: 1.975728E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.574 | TFLOPs: 18.91 | 31: iteration 99740/ 173500 | consumed samples: 25533440 | consumed tokens: 52292485120 | elapsed time per iteration (s): 0.96 | learning rate: 9.021E-05 | global batch size: 256 | lm loss: 1.975763E+00 | grad 
norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 265.693 | TFLOPs: 16.07 | 31: iteration 99750/ 173500 | consumed samples: 25536000 | consumed tokens: 52297728000 | elapsed time per iteration (s): 0.84 | learning rate: 9.019E-05 | global batch size: 256 | lm loss: 1.984544E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.204 | TFLOPs: 18.40 | 31: iteration 99760/ 173500 | consumed samples: 25538560 | consumed tokens: 52302970880 | elapsed time per iteration (s): 0.79 | learning rate: 9.018E-05 | global batch size: 256 | lm loss: 1.956759E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.700 | TFLOPs: 19.70 | 31: iteration 99770/ 173500 | consumed samples: 25541120 | consumed tokens: 52308213760 | elapsed time per iteration (s): 0.73 | learning rate: 9.016E-05 | global batch size: 256 | lm loss: 1.984082E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.119 | TFLOPs: 21.12 | 31: iteration 99780/ 173500 | consumed samples: 25543680 | consumed tokens: 52313456640 | elapsed time per iteration (s): 0.74 | learning rate: 9.014E-05 | global batch size: 256 | lm loss: 1.972100E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.942 | TFLOPs: 20.81 | 31: iteration 99790/ 173500 | consumed samples: 25546240 | consumed tokens: 52318699520 | elapsed time per iteration (s): 0.79 | learning rate: 9.013E-05 | global batch size: 256 | lm loss: 1.969502E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.523 | TFLOPs: 19.63 | 31: iteration 99800/ 173500 | consumed samples: 25548800 | consumed tokens: 52323942400 | elapsed time 
per iteration (s): 0.81 | learning rate: 9.011E-05 | global batch size: 256 | lm loss: 1.977776E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.211 | TFLOPs: 19.19 | 31: iteration 99810/ 173500 | consumed samples: 25551360 | consumed tokens: 52329185280 | elapsed time per iteration (s): 0.80 | learning rate: 9.010E-05 | global batch size: 256 | lm loss: 1.999627E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.722 | TFLOPs: 19.46 | 31: iteration 99820/ 173500 | consumed samples: 25553920 | consumed tokens: 52334428160 | elapsed time per iteration (s): 0.82 | learning rate: 9.008E-05 | global batch size: 256 | lm loss: 1.967915E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.778 | TFLOPs: 18.98 | 31: iteration 99830/ 173500 | consumed samples: 25556480 | consumed tokens: 52339671040 | elapsed time per iteration (s): 0.78 | learning rate: 9.006E-05 | global batch size: 256 | lm loss: 1.976675E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.668 | TFLOPs: 19.82 | 31: iteration 99840/ 173500 | consumed samples: 25559040 | consumed tokens: 52344913920 | elapsed time per iteration (s): 0.81 | learning rate: 9.005E-05 | global batch size: 256 | lm loss: 1.957800E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.455 | TFLOPs: 19.02 | 31: iteration 99850/ 173500 | consumed samples: 25561600 | consumed tokens: 52350156800 | elapsed time per iteration (s): 0.86 | learning rate: 9.003E-05 | global batch size: 256 | lm loss: 1.980162E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.466 | TFLOPs: 
18.00 | 31: iteration 99860/ 173500 | consumed samples: 25564160 | consumed tokens: 52355399680 | elapsed time per iteration (s): 1.50 | learning rate: 9.002E-05 | global batch size: 256 | lm loss: 1.985156E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 170.730 | TFLOPs: 10.33 | 31: iteration 99870/ 173500 | consumed samples: 25566720 | consumed tokens: 52360642560 | elapsed time per iteration (s): 0.80 | learning rate: 9.000E-05 | global batch size: 256 | lm loss: 1.989092E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.528 | TFLOPs: 19.45 | 31: iteration 99880/ 173500 | consumed samples: 25569280 | consumed tokens: 52365885440 | elapsed time per iteration (s): 0.85 | learning rate: 8.998E-05 | global batch size: 256 | lm loss: 2.001787E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.689 | TFLOPs: 18.31 | 31: iteration 99890/ 173500 | consumed samples: 25571840 | consumed tokens: 52371128320 | elapsed time per iteration (s): 0.79 | learning rate: 8.997E-05 | global batch size: 256 | lm loss: 1.932167E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.377 | TFLOPs: 19.68 | 31: iteration 99900/ 173500 | consumed samples: 25574400 | consumed tokens: 52376371200 | elapsed time per iteration (s): 0.85 | learning rate: 8.995E-05 | global batch size: 256 | lm loss: 1.992043E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.825 | TFLOPs: 18.20 | 31: iteration 99910/ 173500 | consumed samples: 25576960 | consumed tokens: 52381614080 | elapsed time per iteration (s): 0.82 | learning rate: 8.994E-05 | global batch size: 256 | lm loss: 1.967859E+00 | grad norm: 0.146 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.654 | TFLOPs: 18.91 | 31: iteration 99920/ 173500 | consumed samples: 25579520 | consumed tokens: 52386856960 | elapsed time per iteration (s): 0.85 | learning rate: 8.992E-05 | global batch size: 256 | lm loss: 1.976417E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.958 | TFLOPs: 18.27 | 31: iteration 99930/ 173500 | consumed samples: 25582080 | consumed tokens: 52392099840 | elapsed time per iteration (s): 0.84 | learning rate: 8.990E-05 | global batch size: 256 | lm loss: 1.973793E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.674 | TFLOPs: 18.49 | 31: iteration 99940/ 173500 | consumed samples: 25584640 | consumed tokens: 52397342720 | elapsed time per iteration (s): 0.82 | learning rate: 8.989E-05 | global batch size: 256 | lm loss: 1.983969E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.196 | TFLOPs: 18.95 | 31: iteration 99950/ 173500 | consumed samples: 25587200 | consumed tokens: 52402585600 | elapsed time per iteration (s): 0.78 | learning rate: 8.987E-05 | global batch size: 256 | lm loss: 1.970089E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.430 | TFLOPs: 19.87 | 31: iteration 99960/ 173500 | consumed samples: 25589760 | consumed tokens: 52407828480 | elapsed time per iteration (s): 0.80 | learning rate: 8.986E-05 | global batch size: 256 | lm loss: 1.985553E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.565 | TFLOPs: 19.45 | 31: iteration 99970/ 173500 | consumed samples: 25592320 | consumed tokens: 52413071360 | elapsed time per iteration (s): 0.80 | 
learning rate: 8.984E-05 | global batch size: 256 | lm loss: 1.977474E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.563 | TFLOPs: 19.33 | 31: iteration 99980/ 173500 | consumed samples: 25594880 | consumed tokens: 52418314240 | elapsed time per iteration (s): 0.85 | learning rate: 8.982E-05 | global batch size: 256 | lm loss: 1.953709E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.345 | TFLOPs: 18.17 | 31: iteration 99990/ 173500 | consumed samples: 25597440 | consumed tokens: 52423557120 | elapsed time per iteration (s): 0.82 | learning rate: 8.981E-05 | global batch size: 256 | lm loss: 1.957083E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.650 | TFLOPs: 18.85 | 0: [2022-11-26 16:42:23,914] [INFO] [logging.py:68:log_dist] [Rank 0] step=100000, skipped=0, lr=[8.979141123724914e-05, 8.979141123724914e-05, 8.979141123724914e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 100000/ 173500 | consumed samples: 25600000 | consumed tokens: 52428800000 | elapsed time per iteration (s): 0.79 | learning rate: 8.979E-05 | global batch size: 256 | lm loss: 1.963673E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.216 | TFLOPs: 19.55 | 0: steps: 100000 loss: 2.0223 iter time (s): 0.798 samples/sec: 320.682 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 100000 | lm loss value: 1.974387E+00 | lm loss PPL: 7.202204E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 100000 to checkpoints_1b1long 0: [2022-11-26 16:42:24,195] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] 
Checkpoint global_step100000 is begin to save! 0: [2022-11-26 16:42:24,207] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_01-model_00-model_states.pt... 0: [2022-11-26 16:42:24,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_01-model_00-model_states.pt. 0: [2022-11-26 16:42:24,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_03-model_00-model_states.pt... 0: [2022-11-26 16:42:24,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_03-model_00-model_states.pt. 0: [2022-11-26 16:42:24,523] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_04-model_00-model_states.pt... 0: [2022-11-26 16:42:24,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_04-model_00-model_states.pt. 0: [2022-11-26 16:42:24,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_05-model_00-model_states.pt... 0: [2022-11-26 16:42:24,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_05-model_00-model_states.pt. 0: [2022-11-26 16:42:24,695] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_06-model_00-model_states.pt... 0: [2022-11-26 16:42:24,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_06-model_00-model_states.pt. 0: [2022-11-26 16:42:24,784] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_07-model_00-model_states.pt... 0: [2022-11-26 16:42:24,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 16:42:24,864] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_08-model_00-model_states.pt... 0: [2022-11-26 16:42:24,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_08-model_00-model_states.pt. 0: [2022-11-26 16:42:24,945] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_09-model_00-model_states.pt... 0: [2022-11-26 16:42:25,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_09-model_00-model_states.pt. 0: [2022-11-26 16:42:25,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_10-model_00-model_states.pt... 0: [2022-11-26 16:42:25,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_10-model_00-model_states.pt. 0: [2022-11-26 16:42:25,098] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_11-model_00-model_states.pt... 0: [2022-11-26 16:42:25,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_11-model_00-model_states.pt. 0: [2022-11-26 16:42:25,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_12-model_00-model_states.pt... 0: [2022-11-26 16:42:25,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_12-model_00-model_states.pt. 0: [2022-11-26 16:42:25,250] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_13-model_00-model_states.pt... 0: [2022-11-26 16:42:25,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 16:42:25,325] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_14-model_00-model_states.pt... 0: [2022-11-26 16:42:25,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_14-model_00-model_states.pt. 0: [2022-11-26 16:42:25,402] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_15-model_00-model_states.pt... 0: [2022-11-26 16:42:25,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_15-model_00-model_states.pt. 0: [2022-11-26 16:42:25,476] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_16-model_00-model_states.pt... 0: [2022-11-26 16:42:25,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_16-model_00-model_states.pt. 0: [2022-11-26 16:42:25,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_17-model_00-model_states.pt... 0: [2022-11-26 16:42:25,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_17-model_00-model_states.pt. 0: [2022-11-26 16:42:25,630] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_18-model_00-model_states.pt... 0: [2022-11-26 16:42:25,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_18-model_00-model_states.pt. 0: [2022-11-26 16:42:25,706] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_19-model_00-model_states.pt... 0: [2022-11-26 16:42:25,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 16:42:25,781] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_20-model_00-model_states.pt... 0: [2022-11-26 16:42:25,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_20-model_00-model_states.pt. 0: [2022-11-26 16:42:25,856] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_21-model_00-model_states.pt... 0: [2022-11-26 16:42:25,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_21-model_00-model_states.pt. 0: [2022-11-26 16:42:25,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_22-model_00-model_states.pt... 0: [2022-11-26 16:42:26,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_22-model_00-model_states.pt. 0: [2022-11-26 16:42:26,004] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_23-model_00-model_states.pt... 0: [2022-11-26 16:42:26,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_23-model_00-model_states.pt. 0: [2022-11-26 16:42:26,081] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_24-model_00-model_states.pt... 0: [2022-11-26 16:42:26,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_24-model_00-model_states.pt. 0: [2022-11-26 16:42:26,156] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_25-model_00-model_states.pt... 0: [2022-11-26 16:42:26,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 16:42:26,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_26-model_00-model_states.pt... 0: [2022-11-26 16:42:26,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_26-model_00-model_states.pt. 0: [2022-11-26 16:42:26,306] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_27-model_00-model_states.pt... 0: [2022-11-26 16:42:26,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_27-model_00-model_states.pt. 0: [2022-11-26 16:42:26,381] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_28-model_00-model_states.pt... 0: [2022-11-26 16:42:26,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_28-model_00-model_states.pt. 0: [2022-11-26 16:42:26,455] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/layer_30-model_00-model_states.pt... 0: [2022-11-26 16:42:26,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/layer_30-model_00-model_states.pt. 0: [2022-11-26 16:42:26,458] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step100000/mp_rank_00_model_states.pt 0: [2022-11-26 16:42:26,458] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/mp_rank_00_model_states.pt... 0: [2022-11-26 16:42:26,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/mp_rank_00_model_states.pt. 0: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
6: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
4: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
1: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
3: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
15: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 
11: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
14: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 
30: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 
18: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
6: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 
8: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
2: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 
25: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 
24: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 
22: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
26: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
6: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 
16: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
28: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 
17: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 
8: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 
29: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 
8: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 
0: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:42:26,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:42:26,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:42:26,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 16:42:26,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 0: [2022-11-26 16:42:26,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:42:26,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 16:42:26,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 5: [2022-11-26 16:42:26,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:42:26,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 16:42:26,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 3: [2022-11-26 16:42:26,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
21: [2022-11-26 16:42:26,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:42:26,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 21: [2022-11-26 16:42:26,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 3: [2022-11-26 16:42:26,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 21: [2022-11-26 16:42:26,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 1: [2022-11-26 16:42:26,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:42:26,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:42:26,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
14: [2022-11-26 16:42:26,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 4: [2022-11-26 16:42:26,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 1: [2022-11-26 16:42:26,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 14: [2022-11-26 16:42:26,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 4: [2022-11-26 16:42:26,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 1: [2022-11-26 16:42:26,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 13: [2022-11-26 16:42:26,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:42:26,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 16:42:26,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 30: [2022-11-26 16:42:26,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:42:26,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 16:42:26,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
9: [2022-11-26 16:42:26,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:42:26,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 16:42:26,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 20: [2022-11-26 16:42:26,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:42:26,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 16:42:26,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 8: [2022-11-26 16:42:26,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:42:26,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 16:42:26,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 19: [2022-11-26 16:42:26,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
19: [2022-11-26 16:42:26,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 0: [2022-11-26 16:42:26,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:42:26,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 0: [2022-11-26 16:42:26,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 16:42:26,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 15: [2022-11-26 16:42:26,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:42:26,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:42:26,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 16:42:26,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 24: [2022-11-26 16:42:26,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 16:42:26,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 24: [2022-11-26 16:42:26,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
24: [2022-11-26 16:42:26,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 16:42:26,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 9: [2022-11-26 16:42:26,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:42:26,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:42:26,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 31: [2022-11-26 16:42:26,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 9: [2022-11-26 16:42:26,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 31: [2022-11-26 16:42:26,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 27: [2022-11-26 16:42:26,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:42:26,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 16:42:26,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 10: [2022-11-26 16:42:26,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-26 16:42:26,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 16:42:26,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 30: [2022-11-26 16:42:26,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:42:26,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:42:26,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:42:26,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 5: [2022-11-26 16:42:26,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 4: [2022-11-26 16:42:26,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 30: [2022-11-26 16:42:26,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 6: [2022-11-26 16:42:26,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:42:26,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 4: [2022-11-26 16:42:26,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
6: [2022-11-26 16:42:26,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 16:42:26,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 15: [2022-11-26 16:42:26,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:42:26,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 16:42:26,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 5: [2022-11-26 16:42:26,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:42:26,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:42:26,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 4: [2022-11-26 16:42:26,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 5: [2022-11-26 16:42:26,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 4: [2022-11-26 16:42:26,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 11: [2022-11-26 16:42:26,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 16:42:26,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:42:26,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:42:26,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:42:26,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:42:26,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 16:42:26,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 16:42:26,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 16:42:26,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 16:42:26,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 11: [2022-11-26 16:42:26,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 11: [2022-11-26 16:42:26,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 11: [2022-11-26 16:42:26,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
17: [2022-11-26 16:42:26,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 16:42:26,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 13: [2022-11-26 16:42:26,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:42:26,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:42:26,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:42:26,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 16:42:26,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 16:42:26,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 27: [2022-11-26 16:42:26,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 16: [2022-11-26 16:42:26,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
16: [2022-11-26 16:42:26,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 16:42:26,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:42:26,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 16: [2022-11-26 16:42:26,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 16:42:26,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 22: [2022-11-26 16:42:26,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:42:26,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 16:42:26,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 8: [2022-11-26 16:42:26,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:42:26,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 16:42:26,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 14: [2022-11-26 16:42:26,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-26 16:42:26,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 16:42:26,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 22: [2022-11-26 16:42:26,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:42:26,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 16:42:26,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 29: [2022-11-26 16:42:26,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:42:26,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:42:26,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 29: [2022-11-26 16:42:26,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 23: [2022-11-26 16:42:26,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 29: [2022-11-26 16:42:26,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 29: [2022-11-26 16:42:26,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-26 16:42:26,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 16:42:26,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 14: [2022-11-26 16:42:26,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:42:26,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:42:26,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 23: [2022-11-26 16:42:26,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:42:26,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 23: [2022-11-26 16:42:26,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 24: [2022-11-26 16:42:26,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 23: [2022-11-26 16:42:26,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 24: [2022-11-26 16:42:26,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 3: [2022-11-26 16:42:26,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
14: [2022-11-26 16:42:26,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:42:26,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 14: [2022-11-26 16:42:26,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 3: [2022-11-26 16:42:26,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 14: [2022-11-26 16:42:26,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 27: [2022-11-26 16:42:26,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:42:26,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:42:26,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 28: [2022-11-26 16:42:26,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 27: [2022-11-26 16:42:26,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 28: [2022-11-26 16:42:26,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 29: [2022-11-26 16:42:26,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
21: [2022-11-26 16:42:26,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:42:26,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:42:26,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 29: [2022-11-26 16:42:26,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 9: [2022-11-26 16:42:26,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 29: [2022-11-26 16:42:26,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 21: [2022-11-26 16:42:26,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 9: [2022-11-26 16:42:26,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 1: [2022-11-26 16:42:26,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:42:26,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
1: [2022-11-26 16:42:26,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 19: [2022-11-26 16:42:26,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 1: [2022-11-26 16:42:26,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 19: [2022-11-26 16:42:26,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 3: [2022-11-26 16:42:26,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:42:26,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 16:42:26,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 1: [2022-11-26 16:42:26,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:42:26,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:42:26,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
1: [2022-11-26 16:42:26,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 20: [2022-11-26 16:42:26,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:42:26,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:42:26,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:42:26,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 13: [2022-11-26 16:42:26,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 20: [2022-11-26 16:42:26,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 16:42:26,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 16:42:26,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 29: [2022-11-26 16:42:26,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 22: [2022-11-26 16:42:26,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 13: [2022-11-26 
16:42:26,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 20: [2022-11-26 16:42:26,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 20: [2022-11-26 16:42:26,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 29: [2022-11-26 16:42:26,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 22: [2022-11-26 16:42:26,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 20: [2022-11-26 16:42:26,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 13: [2022-11-26 16:42:26,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:42:26,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:42:26,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 16:42:26,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 16:42:26,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 13: [2022-11-26 16:42:26,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 11: [2022-11-26 16:42:26,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 16:42:26,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 16:42:26,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 4: [2022-11-26 16:42:26,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:42:26,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 16:42:26,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 15: [2022-11-26 16:42:26,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:42:26,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 16:42:26,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 18: [2022-11-26 16:42:26,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:42:26,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:42:26,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
18: [2022-11-26 16:42:26,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 13: [2022-11-26 16:42:26,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:42:26,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 13: [2022-11-26 16:42:26,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 18: [2022-11-26 16:42:26,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 16:42:26,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 18: [2022-11-26 16:42:26,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 18: [2022-11-26 16:42:26,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 24: [2022-11-26 16:42:26,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:42:26,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 30: [2022-11-26 16:42:26,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 
24: [2022-11-26 16:42:26,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:42:26,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 24: [2022-11-26 16:42:26,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 30: [2022-11-26 16:42:26,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 13: [2022-11-26 16:42:26,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 24: [2022-11-26 16:42:26,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 30: [2022-11-26 16:42:26,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:42:26,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 30: [2022-11-26 16:42:26,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 16:42:26,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 15: [2022-11-26 16:42:26,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:42:26,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
19: [2022-11-26 16:42:26,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:42:26,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 22: [2022-11-26 16:42:26,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 15: [2022-11-26 16:42:26,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 22: [2022-11-26 16:42:26,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 19: [2022-11-26 16:42:26,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 16:42:26,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 19: [2022-11-26 16:42:26,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:42:26,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 16:42:26,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 6: [2022-11-26 16:42:26,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 16:42:26,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 16:42:26,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:42:26,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 6: [2022-11-26 16:42:26,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 16:42:26,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 14: [2022-11-26 16:42:26,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:42:26,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 16:42:26,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 28: [2022-11-26 16:42:26,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:42:26,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 16:42:26,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:42:26,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
19: [2022-11-26 16:42:26,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:42:26,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 19: [2022-11-26 16:42:26,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 6: [2022-11-26 16:42:26,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:42:26,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:42:26,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 6: [2022-11-26 16:42:26,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 28: [2022-11-26 16:42:26,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 6: [2022-11-26 16:42:26,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 9: [2022-11-26 16:42:26,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 16:42:26,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 6: [2022-11-26 16:42:26,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 16:42:26,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 16:42:26,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 26: [2022-11-26 16:42:26,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:42:26,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:42:26,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:42:26,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:42:26,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 16:42:26,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 26: [2022-11-26 16:42:26,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 3: [2022-11-26 16:42:26,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
26: [2022-11-26 16:42:26,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 16:42:26,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 3: [2022-11-26 16:42:26,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 26: [2022-11-26 16:42:26,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 26: [2022-11-26 16:42:26,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 26: [2022-11-26 16:42:26,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 16: [2022-11-26 16:42:26,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:42:26,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:42:26,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 26: [2022-11-26 16:42:26,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:42:26,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
26: [2022-11-26 16:42:26,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 16: [2022-11-26 16:42:26,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 16:42:26,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 16:42:26,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 16: [2022-11-26 16:42:26,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 16:42:26,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 16: [2022-11-26 16:42:26,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 8: [2022-11-26 16:42:26,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:42:26,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 8: [2022-11-26 16:42:26,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 30: [2022-11-26 16:42:26,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:42:26,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
30: [2022-11-26 16:42:26,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 16:42:26,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 25: [2022-11-26 16:42:26,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:42:26,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:42:26,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 16:42:26,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 16:42:26,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 25: [2022-11-26 16:42:26,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 1: [2022-11-26 16:42:26,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:42:26,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
1: [2022-11-26 16:42:26,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 22: [2022-11-26 16:42:26,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 1: [2022-11-26 16:42:26,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 22: [2022-11-26 16:42:26,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 0: [2022-11-26 16:42:26,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:42:26,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:42:26,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:42:26,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 20: [2022-11-26 16:42:26,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 0: [2022-11-26 16:42:26,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 1: [2022-11-26 16:42:26,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 20: [2022-11-26 16:42:26,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
0: [2022-11-26 16:42:26,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 17: [2022-11-26 16:42:26,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:42:26,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:42:26,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 16:42:26,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 16:42:26,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 17: [2022-11-26 16:42:26,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 31: [2022-11-26 16:42:26,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:42:26,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 16:42:26,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 17: [2022-11-26 16:42:26,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 
17: [2022-11-26 16:42:26,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 3: [2022-11-26 16:42:26,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:42:26,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 3: [2022-11-26 16:42:26,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 16:42:26,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 17: [2022-11-26 16:42:26,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:42:26,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 16:42:26,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 4: [2022-11-26 16:42:26,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:42:26,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 16:42:26,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 2: [2022-11-26 16:42:26,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 16:42:26,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:42:26,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:42:26,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 16:42:26,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 16:42:26,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 16:42:26,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 2: [2022-11-26 16:42:26,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 2: [2022-11-26 16:42:26,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 2: [2022-11-26 16:42:26,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:42:26,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 16:42:26,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 18: [2022-11-26 16:42:26,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
18: [2022-11-26 16:42:26,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 16:42:26,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 2: [2022-11-26 16:42:26,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:42:26,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 16:42:26,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 27: [2022-11-26 16:42:26,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:42:26,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 16:42:26,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 10: [2022-11-26 16:42:26,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:42:26,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:42:26,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 16:42:26,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
10: [2022-11-26 16:42:26,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 16:42:26,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 28: [2022-11-26 16:42:26,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:42:26,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:42:26,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 16:42:26,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 16:42:26,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 28: [2022-11-26 16:42:26,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 23: [2022-11-26 16:42:26,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:42:26,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 16:42:26,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 15: [2022-11-26 16:42:26,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
23: [2022-11-26 16:42:26,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:42:26,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 23: [2022-11-26 16:42:26,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 15: [2022-11-26 16:42:26,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 23: [2022-11-26 16:42:26,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 9: [2022-11-26 16:42:26,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:42:26,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 16:42:26,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 12: [2022-11-26 16:42:26,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:42:26,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:42:26,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 16:42:26,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 16:42:26,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:42:26,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 16:42:26,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 12: [2022-11-26 16:42:26,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 16:42:26,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 16:42:26,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 12: [2022-11-26 16:42:26,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 12: [2022-11-26 16:42:26,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 21: [2022-11-26 16:42:26,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:42:26,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 16:42:26,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
5: [2022-11-26 16:42:26,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:42:26,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:42:26,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 10: [2022-11-26 16:42:26,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 5: [2022-11-26 16:42:26,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 10: [2022-11-26 16:42:26,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 5: [2022-11-26 16:42:26,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:42:26,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:42:26,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 16:42:26,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
10: [2022-11-26 16:42:26,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 25: [2022-11-26 16:42:26,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:42:26,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:42:26,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 25: [2022-11-26 16:42:26,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 16:42:26,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 16:42:26,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 25: [2022-11-26 16:42:26,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 29: [2022-11-26 16:42:26,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:42:26,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 16:42:26,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 23: [2022-11-26 16:42:26,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
23: [2022-11-26 16:42:26,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 16:42:26,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 16: [2022-11-26 16:42:26,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:42:26,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 16:42:26,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 31: [2022-11-26 16:42:26,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:42:26,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:42:26,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 16:42:26,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 16:42:26,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 31: [2022-11-26 16:42:26,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 24: [2022-11-26 16:42:26,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-26 16:42:26,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 16:42:26,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 0: [2022-11-26 16:42:26,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:42:26,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:42:26,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 16:42:26,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 21: [2022-11-26 16:42:26,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:42:26,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 16:42:26,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 21: [2022-11-26 16:42:26,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:42:26,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 16:42:26,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
8: [2022-11-26 16:42:26,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:42:26,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:42:26,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 16:42:26,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 16:42:26,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 8: [2022-11-26 16:42:26,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 20: [2022-11-26 16:42:26,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:42:26,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 16:42:26,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 0: [2022-11-26 16:42:26,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 16:42:26,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 7: [2022-11-26 16:42:26,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 16:42:26,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:42:26,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:42:26,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:42:26,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:42:26,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 16:42:26,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 16:42:26,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 16:42:26,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 16:42:26,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 16:42:26,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 7: [2022-11-26 16:42:26,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
7: [2022-11-26 16:42:26,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 7: [2022-11-26 16:42:26,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 7: [2022-11-26 16:42:26,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 11: [2022-11-26 16:42:26,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:42:26,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 16:42:26,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 9: [2022-11-26 16:42:26,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:42:26,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 16:42:26,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 0: [2022-11-26 16:42:26,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:42:26,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 16:42:26,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
6: [2022-11-26 16:42:26,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:42:26,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 16:42:26,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 4: [2022-11-26 16:42:26,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:42:26,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 16:42:26,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 19: [2022-11-26 16:42:26,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:42:26,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 16:42:26,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 30: [2022-11-26 16:42:26,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:42:26,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 16:42:26,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
29: [2022-11-26 16:42:26,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:42:26,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:42:26,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:42:26,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 14: [2022-11-26 16:42:26,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 16:42:26,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 29: [2022-11-26 16:42:26,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 27: [2022-11-26 16:42:26,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 17: [2022-11-26 16:42:26,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:42:26,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 17: [2022-11-26 16:42:26,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 16:42:26,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
26: [2022-11-26 16:42:26,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:42:26,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 13: [2022-11-26 16:42:26,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:42:26,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 13: [2022-11-26 16:42:26,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 16:42:26,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 1: [2022-11-26 16:42:26,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:42:26,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 16:42:26,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 8: [2022-11-26 16:42:26,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:42:26,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 16:42:26,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
21: [2022-11-26 16:42:26,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:42:26,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 16:42:26,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 28: [2022-11-26 16:42:26,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:42:26,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 16:42:26,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 7: [2022-11-26 16:42:26,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:42:26,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 16:42:26,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 22: [2022-11-26 16:42:26,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:42:26,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 16:42:26,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
5: [2022-11-26 16:42:26,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:42:26,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 16:42:26,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 31: [2022-11-26 16:42:26,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:42:26,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 16:42:26,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 25: [2022-11-26 16:42:26,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:42:26,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 15: [2022-11-26 16:42:26,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:42:26,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
15: [2022-11-26 16:42:26,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 18: [2022-11-26 16:42:26,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:42:26,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 18: [2022-11-26 16:42:26,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 16:42:26,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 23: [2022-11-26 16:42:26,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:42:26,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 16:42:26,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 3: [2022-11-26 16:42:26,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:42:26,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 16:42:26,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 12: [2022-11-26 16:42:26,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 16:42:26,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 16:42:26,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 2: [2022-11-26 16:42:26,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:42:26,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 16:42:26,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 20: [2022-11-26 16:42:26,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:42:26,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:42:26,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 24: [2022-11-26 16:42:26,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 20: [2022-11-26 16:42:26,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 24: [2022-11-26 16:42:26,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 9: [2022-11-26 16:42:26,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 16:42:26,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 16:42:26,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 4: [2022-11-26 16:42:26,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:42:26,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 16:42:26,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 11: [2022-11-26 16:42:26,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:42:26,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 0: [2022-11-26 16:42:26,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:42:26,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 0: [2022-11-26 16:42:26,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 16:42:26,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 16: [2022-11-26 16:42:26,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 
16: [2022-11-26 16:42:26,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 16:42:26,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 10: [2022-11-26 16:42:26,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:42:26,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 16:42:26,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 6: [2022-11-26 16:42:26,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:42:26,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 16:42:26,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 19: [2022-11-26 16:42:26,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:42:26,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 16:42:26,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 10: [2022-11-26 16:42:26,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 16:42:26,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 16:42:26,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 30: [2022-11-26 16:42:26,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:42:26,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 16:42:26,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 14: [2022-11-26 16:42:26,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:42:26,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 16:42:26,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 27: [2022-11-26 16:42:26,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:42:26,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 16:42:26,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 17: [2022-11-26 16:42:26,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-26 16:42:26,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 16:42:26,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 29: [2022-11-26 16:42:26,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:42:26,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 16:42:26,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 1: [2022-11-26 16:42:26,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:42:26,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 16:42:26,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 3: [2022-11-26 16:42:26,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:42:26,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 16:42:26,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 26: [2022-11-26 16:42:26,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 
26: [2022-11-26 16:42:26,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 16:42:26,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 5: [2022-11-26 16:42:26,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:42:26,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:42:26,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 13: [2022-11-26 16:42:26,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 5: [2022-11-26 16:42:26,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 13: [2022-11-26 16:42:26,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 8: [2022-11-26 16:42:26,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:42:26,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:42:26,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 16:42:26,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
28: [2022-11-26 16:42:26,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 16:42:26,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 21: [2022-11-26 16:42:26,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:42:26,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 16:42:26,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 7: [2022-11-26 16:42:26,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:42:26,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 16:42:26,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 31: [2022-11-26 16:42:26,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:42:26,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 16:42:26,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 23: [2022-11-26 16:42:26,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
23: [2022-11-26 16:42:26,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 16:42:26,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 15: [2022-11-26 16:42:26,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:42:26,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 16:42:26,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 2: [2022-11-26 16:42:26,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:42:26,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 16:42:26,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 4: [2022-11-26 16:42:26,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:42:26,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 16:42:26,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 20: [2022-11-26 16:42:26,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-26 16:42:26,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 16:42:26,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 30: [2022-11-26 16:42:26,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:42:26,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 16:42:26,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 29: [2022-11-26 16:42:26,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:42:26,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 16:42:26,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 0: [2022-11-26 16:42:26,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:42:26,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 16:42:26,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 17: [2022-11-26 16:42:26,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 
17: [2022-11-26 16:42:26,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 16:42:26,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 1: [2022-11-26 16:42:26,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:42:26,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:42:26,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:42:26,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 1: [2022-11-26 16:42:26,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 16: [2022-11-26 16:42:26,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 27: [2022-11-26 16:42:26,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 16:42:26,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 1: [2022-11-26 16:42:26,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 9: [2022-11-26 16:42:26,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
14: [2022-11-26 16:42:26,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:42:26,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 14: [2022-11-26 16:42:26,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 31: [2022-11-26 16:42:26,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:42:26,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 14: [2022-11-26 16:42:26,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 31: [2022-11-26 16:42:26,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 16:42:26,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 6: [2022-11-26 16:42:26,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:42:26,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 16:42:26,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 26: [2022-11-26 16:42:26,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-26 16:42:26,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 16:42:26,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 24: [2022-11-26 16:42:26,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:42:26,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 3: [2022-11-26 16:42:26,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:42:26,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 3: [2022-11-26 16:42:26,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 16:42:26,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 28: [2022-11-26 16:42:26,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:42:26,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
28: [2022-11-26 16:42:26,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 5: [2022-11-26 16:42:26,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:42:26,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:42:26,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 16:42:26,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 10: [2022-11-26 16:42:26,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 16:42:26,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 22: [2022-11-26 16:42:26,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 16:42:26,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 28: [2022-11-26 16:42:26,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 21: [2022-11-26 16:42:26,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-26 16:42:26,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 16:42:26,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 23: [2022-11-26 16:42:26,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:42:26,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 16:42:26,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 19: [2022-11-26 16:42:26,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:42:26,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 16:42:26,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 15: [2022-11-26 16:42:26,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:42:26,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 16:42:26,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 13: [2022-11-26 16:42:26,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 16:42:26,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 16:42:26,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 7: [2022-11-26 16:42:26,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:42:26,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 16:42:26,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 18: [2022-11-26 16:42:26,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:42:26,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 16:42:26,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 25: [2022-11-26 16:42:26,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:42:26,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
25: [2022-11-26 16:42:26,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 16:42:26,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 16:42:26,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 25: [2022-11-26 16:42:26,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 8: [2022-11-26 16:42:26,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:42:26,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 16:42:26,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 12: [2022-11-26 16:42:26,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:42:26,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 16:42:26,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 2: [2022-11-26 16:42:26,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-26 16:42:26,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 16:42:26,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 25: [2022-11-26 16:42:26,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:42:26,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 18: [2022-11-26 16:42:26,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:42:26,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 25: [2022-11-26 16:42:26,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 22: [2022-11-26 16:42:26,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:42:26,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 22: [2022-11-26 16:42:26,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 16:42:26,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 12: [2022-11-26 16:42:26,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 16:42:26,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 16:42:26,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 18: [2022-11-26 16:42:26,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:42:26,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 16:42:26,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 12: [2022-11-26 16:42:26,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:42:26,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 16:42:26,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 11: [2022-11-26 16:42:26,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:42:26,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step100000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 16:42:26,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
0: successfully saved checkpoint at iteration 100000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2636.41 31: iteration 100010/ 173500 | consumed samples: 25602560 | consumed tokens: 52434042880 | elapsed time per iteration (s): 1.12 | learning rate: 8.978E-05 | global batch size: 256 | lm loss: 2.016262E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.243 | TFLOPs: 13.87 | 31: iteration 100020/ 173500 | consumed samples: 25605120 | consumed tokens: 52439285760 | elapsed time per iteration (s): 0.79 | learning rate: 8.976E-05 | global batch size: 256 | lm loss: 2.003531E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.551 | TFLOPs: 19.57 | 31: iteration 100030/ 173500 | consumed samples: 25607680 | consumed tokens: 52444528640 | elapsed time per iteration (s): 0.84 | learning rate: 8.974E-05 | global batch size: 256 | lm loss: 1.988095E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.265 | TFLOPs: 18.53 | 31: iteration 100040/ 173500 | consumed samples: 25610240 | consumed tokens: 52449771520 | elapsed time per iteration (s): 0.79 | learning rate: 8.973E-05 | global batch size: 256 | lm loss: 2.010317E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.079 | TFLOPs: 19.67 | 31: iteration 100050/ 173500 | consumed samples: 25612800 | consumed tokens: 52455014400 | elapsed time per iteration (s): 0.82 | learning rate: 8.971E-05 | global batch size: 256 | lm loss: 1.971997E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.613 | TFLOPs: 18.97 | 31: iteration 100060/ 173500 | consumed samples: 25615360 | consumed tokens: 52460257280 | elapsed time per iteration (s): 
0.79 | learning rate: 8.970E-05 | global batch size: 256 | lm loss: 1.959133E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.584 | TFLOPs: 19.70 | 31: iteration 100070/ 173500 | consumed samples: 25617920 | consumed tokens: 52465500160 | elapsed time per iteration (s): 0.80 | learning rate: 8.968E-05 | global batch size: 256 | lm loss: 1.974608E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.480 | TFLOPs: 19.33 | 31: iteration 100080/ 173500 | consumed samples: 25620480 | consumed tokens: 52470743040 | elapsed time per iteration (s): 0.81 | learning rate: 8.966E-05 | global batch size: 256 | lm loss: 1.995654E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.141 | TFLOPs: 19.19 | 31: iteration 100090/ 173500 | consumed samples: 25623040 | consumed tokens: 52475985920 | elapsed time per iteration (s): 0.82 | learning rate: 8.965E-05 | global batch size: 256 | lm loss: 1.979384E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.989 | TFLOPs: 19.00 | 31: iteration 100100/ 173500 | consumed samples: 25625600 | consumed tokens: 52481228800 | elapsed time per iteration (s): 0.77 | learning rate: 8.963E-05 | global batch size: 256 | lm loss: 1.977825E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.590 | TFLOPs: 20.06 | 31: iteration 100110/ 173500 | consumed samples: 25628160 | consumed tokens: 52486471680 | elapsed time per iteration (s): 0.75 | learning rate: 8.962E-05 | global batch size: 256 | lm loss: 1.989797E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.860 | TFLOPs: 20.56 | 31: 
iteration 100120/ 173500 | consumed samples: 25630720 | consumed tokens: 52491714560 | elapsed time per iteration (s): 0.77 | learning rate: 8.960E-05 | global batch size: 256 | lm loss: 2.013514E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.555 | TFLOPs: 20.00 | 31: iteration 100130/ 173500 | consumed samples: 25633280 | consumed tokens: 52496957440 | elapsed time per iteration (s): 0.81 | learning rate: 8.958E-05 | global batch size: 256 | lm loss: 1.966890E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.132 | TFLOPs: 19.06 | 31: iteration 100140/ 173500 | consumed samples: 25635840 | consumed tokens: 52502200320 | elapsed time per iteration (s): 0.83 | learning rate: 8.957E-05 | global batch size: 256 | lm loss: 1.955525E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.881 | TFLOPs: 18.63 | 31: iteration 100150/ 173500 | consumed samples: 25638400 | consumed tokens: 52507443200 | elapsed time per iteration (s): 0.77 | learning rate: 8.955E-05 | global batch size: 256 | lm loss: 1.986818E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.159 | TFLOPs: 20.09 | 31: iteration 100160/ 173500 | consumed samples: 25640960 | consumed tokens: 52512686080 | elapsed time per iteration (s): 0.70 | learning rate: 8.953E-05 | global batch size: 256 | lm loss: 1.986472E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 367.500 | TFLOPs: 22.23 | 31: iteration 100170/ 173500 | consumed samples: 25643520 | consumed tokens: 52517928960 | elapsed time per iteration (s): 0.74 | learning rate: 8.952E-05 | global batch size: 256 | lm loss: 1.979166E+00 | grad norm: 0.150 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.672 | TFLOPs: 20.79 | 31: iteration 100180/ 173500 | consumed samples: 25646080 | consumed tokens: 52523171840 | elapsed time per iteration (s): 0.79 | learning rate: 8.950E-05 | global batch size: 256 | lm loss: 1.978596E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.555 | TFLOPs: 19.51 | 31: iteration 100190/ 173500 | consumed samples: 25648640 | consumed tokens: 52528414720 | elapsed time per iteration (s): 0.77 | learning rate: 8.949E-05 | global batch size: 256 | lm loss: 1.969253E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.186 | TFLOPs: 20.10 | 31: iteration 100200/ 173500 | consumed samples: 25651200 | consumed tokens: 52533657600 | elapsed time per iteration (s): 0.75 | learning rate: 8.947E-05 | global batch size: 256 | lm loss: 2.002739E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.161 | TFLOPs: 20.58 | 31: iteration 100210/ 173500 | consumed samples: 25653760 | consumed tokens: 52538900480 | elapsed time per iteration (s): 0.80 | learning rate: 8.945E-05 | global batch size: 256 | lm loss: 1.974443E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.057 | TFLOPs: 19.30 | 31: iteration 100220/ 173500 | consumed samples: 25656320 | consumed tokens: 52544143360 | elapsed time per iteration (s): 0.75 | learning rate: 8.944E-05 | global batch size: 256 | lm loss: 1.958595E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.093 | TFLOPs: 20.70 | 31: iteration 100230/ 173500 | consumed samples: 25658880 | consumed tokens: 52549386240 | elapsed time per iteration (s): 0.78 | 
learning rate: 8.942E-05 | global batch size: 256 | lm loss: 1.953201E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.022 | TFLOPs: 19.90 | 31: iteration 100240/ 173500 | consumed samples: 25661440 | consumed tokens: 52554629120 | elapsed time per iteration (s): 0.72 | learning rate: 8.941E-05 | global batch size: 256 | lm loss: 1.959413E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.677 | TFLOPs: 21.40 | 31: iteration 100250/ 173500 | consumed samples: 25664000 | consumed tokens: 52559872000 | elapsed time per iteration (s): 0.77 | learning rate: 8.939E-05 | global batch size: 256 | lm loss: 1.982058E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.271 | TFLOPs: 20.10 | 31: iteration 100260/ 173500 | consumed samples: 25666560 | consumed tokens: 52565114880 | elapsed time per iteration (s): 0.78 | learning rate: 8.937E-05 | global batch size: 256 | lm loss: 1.965926E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.724 | TFLOPs: 19.89 | 31: iteration 100270/ 173500 | consumed samples: 25669120 | consumed tokens: 52570357760 | elapsed time per iteration (s): 0.77 | learning rate: 8.936E-05 | global batch size: 256 | lm loss: 1.985392E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.597 | TFLOPs: 20.06 | 31: iteration 100280/ 173500 | consumed samples: 25671680 | consumed tokens: 52575600640 | elapsed time per iteration (s): 0.80 | learning rate: 8.934E-05 | global batch size: 256 | lm loss: 1.962121E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.407 | TFLOPs: 19.44 | 31: iteration 
100290/ 173500 | consumed samples: 25674240 | consumed tokens: 52580843520 | elapsed time per iteration (s): 0.89 | learning rate: 8.933E-05 | global batch size: 256 | lm loss: 1.973031E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.648 | TFLOPs: 17.46 | 31: iteration 100300/ 173500 | consumed samples: 25676800 | consumed tokens: 52586086400 | elapsed time per iteration (s): 0.80 | learning rate: 8.931E-05 | global batch size: 256 | lm loss: 1.997684E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.573 | TFLOPs: 19.33 | 31: iteration 100310/ 173500 | consumed samples: 25679360 | consumed tokens: 52591329280 | elapsed time per iteration (s): 0.84 | learning rate: 8.929E-05 | global batch size: 256 | lm loss: 1.963731E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.860 | TFLOPs: 18.38 | 31: iteration 100320/ 173500 | consumed samples: 25681920 | consumed tokens: 52596572160 | elapsed time per iteration (s): 0.82 | learning rate: 8.928E-05 | global batch size: 256 | lm loss: 1.971106E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.806 | TFLOPs: 18.98 | 31: iteration 100330/ 173500 | consumed samples: 25684480 | consumed tokens: 52601815040 | elapsed time per iteration (s): 0.85 | learning rate: 8.926E-05 | global batch size: 256 | lm loss: 1.982421E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.555 | TFLOPs: 18.18 | 31: iteration 100340/ 173500 | consumed samples: 25687040 | consumed tokens: 52607057920 | elapsed time per iteration (s): 0.81 | learning rate: 8.925E-05 | global batch size: 256 | lm loss: 2.005937E+00 | grad norm: 0.148 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.640 | TFLOPs: 19.16 | 31: iteration 100350/ 173500 | consumed samples: 25689600 | consumed tokens: 52612300800 | elapsed time per iteration (s): 0.82 | learning rate: 8.923E-05 | global batch size: 256 | lm loss: 1.972537E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.873 | TFLOPs: 18.93 | 31: iteration 100360/ 173500 | consumed samples: 25692160 | consumed tokens: 52617543680 | elapsed time per iteration (s): 0.81 | learning rate: 8.921E-05 | global batch size: 256 | lm loss: 1.990462E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.328 | TFLOPs: 19.08 | 31: iteration 100370/ 173500 | consumed samples: 25694720 | consumed tokens: 52622786560 | elapsed time per iteration (s): 0.93 | learning rate: 8.920E-05 | global batch size: 256 | lm loss: 1.977839E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 274.074 | TFLOPs: 16.58 | 31: iteration 100380/ 173500 | consumed samples: 25697280 | consumed tokens: 52628029440 | elapsed time per iteration (s): 0.85 | learning rate: 8.918E-05 | global batch size: 256 | lm loss: 1.996947E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.772 | TFLOPs: 18.14 | 31: iteration 100390/ 173500 | consumed samples: 25699840 | consumed tokens: 52633272320 | elapsed time per iteration (s): 0.84 | learning rate: 8.917E-05 | global batch size: 256 | lm loss: 1.990400E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.982 | TFLOPs: 18.45 | 31: iteration 100400/ 173500 | consumed samples: 25702400 | consumed tokens: 52638515200 | elapsed time per iteration (s): 0.90 | learning 
rate: 8.915E-05 | global batch size: 256 | lm loss: 1.982759E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.806 | TFLOPs: 17.23 | 31: iteration 100410/ 173500 | consumed samples: 25704960 | consumed tokens: 52643758080 | elapsed time per iteration (s): 0.82 | learning rate: 8.913E-05 | global batch size: 256 | lm loss: 1.997304E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.051 | TFLOPs: 18.82 | 31: iteration 100420/ 173500 | consumed samples: 25707520 | consumed tokens: 52649000960 | elapsed time per iteration (s): 0.86 | learning rate: 8.912E-05 | global batch size: 256 | lm loss: 1.984649E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.140 | TFLOPs: 17.98 | 31: iteration 100430/ 173500 | consumed samples: 25710080 | consumed tokens: 52654243840 | elapsed time per iteration (s): 0.80 | learning rate: 8.910E-05 | global batch size: 256 | lm loss: 1.991376E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.719 | TFLOPs: 19.40 | 31: iteration 100440/ 173500 | consumed samples: 25712640 | consumed tokens: 52659486720 | elapsed time per iteration (s): 0.86 | learning rate: 8.909E-05 | global batch size: 256 | lm loss: 1.965367E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.127 | TFLOPs: 17.91 | 31: iteration 100450/ 173500 | consumed samples: 25715200 | consumed tokens: 52664729600 | elapsed time per iteration (s): 0.79 | learning rate: 8.907E-05 | global batch size: 256 | lm loss: 1.976259E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.213 | TFLOPs: 19.67 | 31: iteration 100460/ 
173500 | consumed samples: 25717760 | consumed tokens: 52669972480 | elapsed time per iteration (s): 0.80 | learning rate: 8.905E-05 | global batch size: 256 | lm loss: 1.983821E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.380 | TFLOPs: 19.38 | 31: iteration 100470/ 173500 | consumed samples: 25720320 | consumed tokens: 52675215360 | elapsed time per iteration (s): 0.77 | learning rate: 8.904E-05 | global batch size: 256 | lm loss: 1.960025E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.906 | TFLOPs: 20.02 | 31: iteration 100480/ 173500 | consumed samples: 25722880 | consumed tokens: 52680458240 | elapsed time per iteration (s): 0.82 | learning rate: 8.902E-05 | global batch size: 256 | lm loss: 1.970565E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.784 | TFLOPs: 18.98 | 31: iteration 100490/ 173500 | consumed samples: 25725440 | consumed tokens: 52685701120 | elapsed time per iteration (s): 0.83 | learning rate: 8.901E-05 | global batch size: 256 | lm loss: 1.992862E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.354 | TFLOPs: 18.65 | 31: iteration 100500/ 173500 | consumed samples: 25728000 | consumed tokens: 52690944000 | elapsed time per iteration (s): 0.80 | learning rate: 8.899E-05 | global batch size: 256 | lm loss: 1.959272E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.196 | TFLOPs: 19.25 | 31: iteration 100510/ 173500 | consumed samples: 25730560 | consumed tokens: 52696186880 | elapsed time per iteration (s): 0.80 | learning rate: 8.897E-05 | global batch size: 256 | lm loss: 1.994578E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 320.504 | TFLOPs: 19.39 | 31: iteration 100520/ 173500 | consumed samples: 25733120 | consumed tokens: 52701429760 | elapsed time per iteration (s): 0.79 | learning rate: 8.896E-05 | global batch size: 256 | lm loss: 2.001279E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.783 | TFLOPs: 19.59 | 31: iteration 100530/ 173500 | consumed samples: 25735680 | consumed tokens: 52706672640 | elapsed time per iteration (s): 0.80 | learning rate: 8.894E-05 | global batch size: 256 | lm loss: 1.976668E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.779 | TFLOPs: 19.29 | 31: iteration 100540/ 173500 | consumed samples: 25738240 | consumed tokens: 52711915520 | elapsed time per iteration (s): 0.75 | learning rate: 8.893E-05 | global batch size: 256 | lm loss: 1.983586E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.728 | TFLOPs: 20.61 | 31: iteration 100550/ 173500 | consumed samples: 25740800 | consumed tokens: 52717158400 | elapsed time per iteration (s): 0.74 | learning rate: 8.891E-05 | global batch size: 256 | lm loss: 1.967654E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.585 | TFLOPs: 20.85 | 31: iteration 100560/ 173500 | consumed samples: 25743360 | consumed tokens: 52722401280 | elapsed time per iteration (s): 0.86 | learning rate: 8.889E-05 | global batch size: 256 | lm loss: 1.972563E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.481 | TFLOPs: 18.00 | 31: iteration 100570/ 173500 | consumed samples: 25745920 | consumed tokens: 52727644160 | elapsed time per iteration (s): 0.78 | learning rate: 
8.888E-05 | global batch size: 256 | lm loss: 1.968966E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.529 | TFLOPs: 19.88 | 31: iteration 100580/ 173500 | consumed samples: 25748480 | consumed tokens: 52732887040 | elapsed time per iteration (s): 0.80 | learning rate: 8.886E-05 | global batch size: 256 | lm loss: 1.978502E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.221 | TFLOPs: 19.37 | 31: iteration 100590/ 173500 | consumed samples: 25751040 | consumed tokens: 52738129920 | elapsed time per iteration (s): 0.72 | learning rate: 8.885E-05 | global batch size: 256 | lm loss: 1.969404E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.606 | TFLOPs: 21.57 | 31: iteration 100600/ 173500 | consumed samples: 25753600 | consumed tokens: 52743372800 | elapsed time per iteration (s): 0.74 | learning rate: 8.883E-05 | global batch size: 256 | lm loss: 1.973327E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.328 | TFLOPs: 21.01 | 31: iteration 100610/ 173500 | consumed samples: 25756160 | consumed tokens: 52748615680 | elapsed time per iteration (s): 0.78 | learning rate: 8.881E-05 | global batch size: 256 | lm loss: 1.992887E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.085 | TFLOPs: 19.91 | 31: iteration 100620/ 173500 | consumed samples: 25758720 | consumed tokens: 52753858560 | elapsed time per iteration (s): 0.75 | learning rate: 8.880E-05 | global batch size: 256 | lm loss: 1.965488E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.511 | TFLOPs: 20.60 | 31: iteration 100630/ 173500 | 
consumed samples: 25761280 | consumed tokens: 52759101440 | elapsed time per iteration (s): 0.75 | learning rate: 8.878E-05 | global batch size: 256 | lm loss: 1.997849E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.091 | TFLOPs: 20.64 | 31: iteration 100640/ 173500 | consumed samples: 25763840 | consumed tokens: 52764344320 | elapsed time per iteration (s): 0.79 | learning rate: 8.877E-05 | global batch size: 256 | lm loss: 1.983643E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.768 | TFLOPs: 19.71 | 31: iteration 100650/ 173500 | consumed samples: 25766400 | consumed tokens: 52769587200 | elapsed time per iteration (s): 0.81 | learning rate: 8.875E-05 | global batch size: 256 | lm loss: 1.952950E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.912 | TFLOPs: 19.23 | 31: iteration 100660/ 173500 | consumed samples: 25768960 | consumed tokens: 52774830080 | elapsed time per iteration (s): 0.80 | learning rate: 8.873E-05 | global batch size: 256 | lm loss: 1.981253E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.145 | TFLOPs: 19.31 | 31: iteration 100670/ 173500 | consumed samples: 25771520 | consumed tokens: 52780072960 | elapsed time per iteration (s): 0.81 | learning rate: 8.872E-05 | global batch size: 256 | lm loss: 1.971651E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.841 | TFLOPs: 19.05 | 31: iteration 100680/ 173500 | consumed samples: 25774080 | consumed tokens: 52785315840 | elapsed time per iteration (s): 0.82 | learning rate: 8.870E-05 | global batch size: 256 | lm loss: 1.992342E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 312.517 | TFLOPs: 18.91 | 31: iteration 100690/ 173500 | consumed samples: 25776640 | consumed tokens: 52790558720 | elapsed time per iteration (s): 0.77 | learning rate: 8.869E-05 | global batch size: 256 | lm loss: 2.004008E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.464 | TFLOPs: 20.23 | 31: iteration 100700/ 173500 | consumed samples: 25779200 | consumed tokens: 52795801600 | elapsed time per iteration (s): 0.72 | learning rate: 8.867E-05 | global batch size: 256 | lm loss: 1.955235E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.569 | TFLOPs: 21.63 | 31: iteration 100710/ 173500 | consumed samples: 25781760 | consumed tokens: 52801044480 | elapsed time per iteration (s): 0.78 | learning rate: 8.865E-05 | global batch size: 256 | lm loss: 1.952642E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.556 | TFLOPs: 19.88 | 31: iteration 100720/ 173500 | consumed samples: 25784320 | consumed tokens: 52806287360 | elapsed time per iteration (s): 0.78 | learning rate: 8.864E-05 | global batch size: 256 | lm loss: 1.955793E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.984 | TFLOPs: 19.90 | 31: iteration 100730/ 173500 | consumed samples: 25786880 | consumed tokens: 52811530240 | elapsed time per iteration (s): 0.80 | learning rate: 8.862E-05 | global batch size: 256 | lm loss: 1.977406E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.384 | TFLOPs: 19.26 | 31: iteration 100740/ 173500 | consumed samples: 25789440 | consumed tokens: 52816773120 | elapsed time per iteration (s): 0.79 | learning rate: 
8.861E-05 | global batch size: 256 | lm loss: 1.964321E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.130 | TFLOPs: 19.61 | 31: iteration 100750/ 173500 | consumed samples: 25792000 | consumed tokens: 52822016000 | elapsed time per iteration (s): 0.74 | learning rate: 8.859E-05 | global batch size: 256 | lm loss: 1.986694E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.970 | TFLOPs: 20.81 | 31: iteration 100760/ 173500 | consumed samples: 25794560 | consumed tokens: 52827258880 | elapsed time per iteration (s): 0.76 | learning rate: 8.857E-05 | global batch size: 256 | lm loss: 1.971052E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.713 | TFLOPs: 20.49 | 31: iteration 100770/ 173500 | consumed samples: 25797120 | consumed tokens: 52832501760 | elapsed time per iteration (s): 0.79 | learning rate: 8.856E-05 | global batch size: 256 | lm loss: 1.978589E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.365 | TFLOPs: 19.50 | 31: iteration 100780/ 173500 | consumed samples: 25799680 | consumed tokens: 52837744640 | elapsed time per iteration (s): 0.88 | learning rate: 8.854E-05 | global batch size: 256 | lm loss: 1.941698E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.945 | TFLOPs: 17.54 | 31: iteration 100790/ 173500 | consumed samples: 25802240 | consumed tokens: 52842987520 | elapsed time per iteration (s): 0.76 | learning rate: 8.853E-05 | global batch size: 256 | lm loss: 1.954772E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.473 | TFLOPs: 20.36 | 31: iteration 100800/ 173500 | 
consumed samples: 25804800 | consumed tokens: 52848230400 | elapsed time per iteration (s): 0.78 | learning rate: 8.851E-05 | global batch size: 256 | lm loss: 1.958002E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.889 | TFLOPs: 19.90 | 31: iteration 100810/ 173500 | consumed samples: 25807360 | consumed tokens: 52853473280 | elapsed time per iteration (s): 0.73 | learning rate: 8.849E-05 | global batch size: 256 | lm loss: 1.980045E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.056 | TFLOPs: 21.18 | 31: iteration 100820/ 173500 | consumed samples: 25809920 | consumed tokens: 52858716160 | elapsed time per iteration (s): 0.83 | learning rate: 8.848E-05 | global batch size: 256 | lm loss: 1.981981E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.610 | TFLOPs: 18.61 | 31: iteration 100830/ 173500 | consumed samples: 25812480 | consumed tokens: 52863959040 | elapsed time per iteration (s): 0.77 | learning rate: 8.846E-05 | global batch size: 256 | lm loss: 1.967710E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.748 | TFLOPs: 20.19 | 31: iteration 100840/ 173500 | consumed samples: 25815040 | consumed tokens: 52869201920 | elapsed time per iteration (s): 0.78 | learning rate: 8.845E-05 | global batch size: 256 | lm loss: 1.952331E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.431 | TFLOPs: 19.81 | 31: iteration 100850/ 173500 | consumed samples: 25817600 | consumed tokens: 52874444800 | elapsed time per iteration (s): 0.76 | learning rate: 8.843E-05 | global batch size: 256 | lm loss: 1.990791E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 338.387 | TFLOPs: 20.47 | 31: iteration 100860/ 173500 | consumed samples: 25820160 | consumed tokens: 52879687680 | elapsed time per iteration (s): 0.78 | learning rate: 8.841E-05 | global batch size: 256 | lm loss: 1.987719E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.941 | TFLOPs: 19.90 | 31: iteration 100870/ 173500 | consumed samples: 25822720 | consumed tokens: 52884930560 | elapsed time per iteration (s): 0.76 | learning rate: 8.840E-05 | global batch size: 256 | lm loss: 1.981224E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.865 | TFLOPs: 20.44 | 31: iteration 100880/ 173500 | consumed samples: 25825280 | consumed tokens: 52890173440 | elapsed time per iteration (s): 0.82 | learning rate: 8.838E-05 | global batch size: 256 | lm loss: 1.985551E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.086 | TFLOPs: 18.82 | 31: iteration 100890/ 173500 | consumed samples: 25827840 | consumed tokens: 52895416320 | elapsed time per iteration (s): 0.82 | learning rate: 8.837E-05 | global batch size: 256 | lm loss: 1.965486E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.762 | TFLOPs: 18.92 | 31: iteration 100900/ 173500 | consumed samples: 25830400 | consumed tokens: 52900659200 | elapsed time per iteration (s): 0.80 | learning rate: 8.835E-05 | global batch size: 256 | lm loss: 1.971235E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.711 | TFLOPs: 19.40 | 31: iteration 100910/ 173500 | consumed samples: 25832960 | consumed tokens: 52905902080 | elapsed time per iteration (s): 0.79 | learning rate: 
8.833E-05 | global batch size: 256 | lm loss: 1.976921E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.908 | TFLOPs: 19.72 | 31: iteration 100920/ 173500 | consumed samples: 25835520 | consumed tokens: 52911144960 | elapsed time per iteration (s): 0.87 | learning rate: 8.832E-05 | global batch size: 256 | lm loss: 1.975355E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.659 | TFLOPs: 17.77 | 31: iteration 100930/ 173500 | consumed samples: 25838080 | consumed tokens: 52916387840 | elapsed time per iteration (s): 0.79 | learning rate: 8.830E-05 | global batch size: 256 | lm loss: 1.970973E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.445 | TFLOPs: 19.57 | 31: iteration 100940/ 173500 | consumed samples: 25840640 | consumed tokens: 52921630720 | elapsed time per iteration (s): 0.83 | learning rate: 8.829E-05 | global batch size: 256 | lm loss: 1.981202E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.776 | TFLOPs: 18.68 | 31: iteration 100950/ 173500 | consumed samples: 25843200 | consumed tokens: 52926873600 | elapsed time per iteration (s): 0.82 | learning rate: 8.827E-05 | global batch size: 256 | lm loss: 1.965143E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.797 | TFLOPs: 18.80 | 31: iteration 100960/ 173500 | consumed samples: 25845760 | consumed tokens: 52932116480 | elapsed time per iteration (s): 0.84 | learning rate: 8.825E-05 | global batch size: 256 | lm loss: 1.981061E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.787 | TFLOPs: 18.38 | 31: iteration 100970/ 173500 | 
consumed samples: 25848320 | consumed tokens: 52937359360 | elapsed time per iteration (s): 0.75 | learning rate: 8.824E-05 | global batch size: 256 | lm loss: 1.972371E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.777 | TFLOPs: 20.68 | 31: iteration 100980/ 173500 | consumed samples: 25850880 | consumed tokens: 52942602240 | elapsed time per iteration (s): 0.78 | learning rate: 8.822E-05 | global batch size: 256 | lm loss: 1.971105E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.513 | TFLOPs: 19.87 | 31: iteration 100990/ 173500 | consumed samples: 25853440 | consumed tokens: 52947845120 | elapsed time per iteration (s): 0.76 | learning rate: 8.821E-05 | global batch size: 256 | lm loss: 1.956544E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.741 | TFLOPs: 20.49 | 31: iteration 101000/ 173500 | consumed samples: 25856000 | consumed tokens: 52953088000 | elapsed time per iteration (s): 0.76 | learning rate: 8.819E-05 | global batch size: 256 | lm loss: 1.941996E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.277 | TFLOPs: 20.40 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 101000 | lm loss value: 2.040429E+00 | lm loss PPL: 7.693910E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 101000 to checkpoints_1b1long 0: [2022-11-26 16:55:42,047] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step101000 is begin to save! 
0: [2022-11-26 16:55:42,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_01-model_00-model_states.pt... 0: [2022-11-26 16:55:42,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_01-model_00-model_states.pt. 0: [2022-11-26 16:55:42,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_03-model_00-model_states.pt... 0: [2022-11-26 16:55:42,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_03-model_00-model_states.pt. 0: [2022-11-26 16:55:42,345] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_04-model_00-model_states.pt... 0: [2022-11-26 16:55:42,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_04-model_00-model_states.pt. 0: [2022-11-26 16:55:42,424] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_05-model_00-model_states.pt... 0: [2022-11-26 16:55:42,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_05-model_00-model_states.pt. 0: [2022-11-26 16:55:42,502] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_06-model_00-model_states.pt... 0: [2022-11-26 16:55:42,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_06-model_00-model_states.pt. 0: [2022-11-26 16:55:42,577] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_07-model_00-model_states.pt... 0: [2022-11-26 16:55:42,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 16:55:42,649] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_08-model_00-model_states.pt... 0: [2022-11-26 16:55:42,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_08-model_00-model_states.pt. 0: [2022-11-26 16:55:42,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_09-model_00-model_states.pt... 0: [2022-11-26 16:55:42,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_09-model_00-model_states.pt. 0: [2022-11-26 16:55:42,799] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_10-model_00-model_states.pt... 0: [2022-11-26 16:55:42,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_10-model_00-model_states.pt. 0: [2022-11-26 16:55:42,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_11-model_00-model_states.pt... 0: [2022-11-26 16:55:42,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_11-model_00-model_states.pt. 0: [2022-11-26 16:55:42,968] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_12-model_00-model_states.pt... 0: [2022-11-26 16:55:43,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_12-model_00-model_states.pt. 0: [2022-11-26 16:55:43,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_13-model_00-model_states.pt... 0: [2022-11-26 16:55:43,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 16:55:43,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_14-model_00-model_states.pt... 0: [2022-11-26 16:55:43,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_14-model_00-model_states.pt. 0: [2022-11-26 16:55:43,213] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_15-model_00-model_states.pt... 0: [2022-11-26 16:55:43,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_15-model_00-model_states.pt. 0: [2022-11-26 16:55:43,289] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_16-model_00-model_states.pt... 0: [2022-11-26 16:55:43,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_16-model_00-model_states.pt. 0: [2022-11-26 16:55:43,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_17-model_00-model_states.pt... 0: [2022-11-26 16:55:43,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_17-model_00-model_states.pt. 0: [2022-11-26 16:55:43,437] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_18-model_00-model_states.pt... 0: [2022-11-26 16:55:43,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_18-model_00-model_states.pt. 0: [2022-11-26 16:55:43,515] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_19-model_00-model_states.pt... 0: [2022-11-26 16:55:43,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 16:55:43,591] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_20-model_00-model_states.pt... 0: [2022-11-26 16:55:43,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_20-model_00-model_states.pt. 0: [2022-11-26 16:55:43,681] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_21-model_00-model_states.pt... 0: [2022-11-26 16:55:43,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_21-model_00-model_states.pt. 0: [2022-11-26 16:55:43,757] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_22-model_00-model_states.pt... 0: [2022-11-26 16:55:43,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_22-model_00-model_states.pt. 0: [2022-11-26 16:55:43,833] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_23-model_00-model_states.pt... 0: [2022-11-26 16:55:43,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_23-model_00-model_states.pt. 0: [2022-11-26 16:55:43,909] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_24-model_00-model_states.pt... 0: [2022-11-26 16:55:43,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_24-model_00-model_states.pt. 0: [2022-11-26 16:55:43,984] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_25-model_00-model_states.pt... 0: [2022-11-26 16:55:44,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 16:55:44,060] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_26-model_00-model_states.pt... 0: [2022-11-26 16:55:44,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_26-model_00-model_states.pt. 0: [2022-11-26 16:55:44,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_27-model_00-model_states.pt... 0: [2022-11-26 16:55:44,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_27-model_00-model_states.pt. 0: [2022-11-26 16:55:44,225] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_28-model_00-model_states.pt... 0: [2022-11-26 16:55:44,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_28-model_00-model_states.pt. 0: [2022-11-26 16:55:44,298] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/layer_30-model_00-model_states.pt... 0: [2022-11-26 16:55:44,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/layer_30-model_00-model_states.pt. 0: [2022-11-26 16:55:44,303] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step101000/mp_rank_00_model_states.pt 0: [2022-11-26 16:55:44,303] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/mp_rank_00_model_states.pt... 0: [2022-11-26 16:55:44,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/mp_rank_00_model_states.pt. 0: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
0: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 
4: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 
10: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 
13: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 
12: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 
23: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 
31: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 
30: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 
19: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 
6: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
8: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 
16: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 
15: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
11: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 
29: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 
18: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
5: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 
16: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 
11: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 
17: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 
0: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 16: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 20: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 
23: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 
26: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 25: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 26: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
14: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 22: [2022-11-26 16:55:44,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:55:44,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:55:44,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 16:55:44,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 21: [2022-11-26 16:55:44,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:55:44,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 16:55:44,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 16: [2022-11-26 16:55:44,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:55:44,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
16: [2022-11-26 16:55:44,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 16:55:44,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 29: [2022-11-26 16:55:44,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:55:44,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 16:55:44,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 24: [2022-11-26 16:55:44,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 16:55:44,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 0: [2022-11-26 16:55:44,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:55:44,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:55:44,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 0: [2022-11-26 16:55:44,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 9: [2022-11-26 16:55:44,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
0: [2022-11-26 16:55:44,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 20: [2022-11-26 16:55:44,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:55:44,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 16:55:44,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 17: [2022-11-26 16:55:44,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:55:44,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 16:55:44,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 20: [2022-11-26 16:55:44,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:55:44,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 16:55:44,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 25: [2022-11-26 16:55:44,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 
25: [2022-11-26 16:55:44,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 9: [2022-11-26 16:55:44,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:55:44,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:55:44,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 25: [2022-11-26 16:55:44,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 9: [2022-11-26 16:55:44,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 6: [2022-11-26 16:55:44,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 16:55:44,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 22: [2022-11-26 16:55:44,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:55:44,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 16:55:44,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 3: [2022-11-26 16:55:44,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 16:55:44,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 16:55:44,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 12: [2022-11-26 16:55:44,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:55:44,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 30: [2022-11-26 16:55:44,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:55:44,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 30: [2022-11-26 16:55:44,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 16:55:44,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 15: [2022-11-26 16:55:44,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:55:44,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 16:55:44,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 26: [2022-11-26 16:55:44,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
26: [2022-11-26 16:55:44,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 16:55:44,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 22: [2022-11-26 16:55:44,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:55:44,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 16:55:44,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 17: [2022-11-26 16:55:44,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:55:44,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 16:55:44,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 8: [2022-11-26 16:55:44,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:55:44,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 16:55:44,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 27: [2022-11-26 16:55:44,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
25: [2022-11-26 16:55:44,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:55:44,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:55:44,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:55:44,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 14: [2022-11-26 16:55:44,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 27: [2022-11-26 16:55:44,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 14: [2022-11-26 16:55:44,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 16:55:44,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 15: [2022-11-26 16:55:44,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:55:44,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 14: [2022-11-26 16:55:44,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
1: [2022-11-26 16:55:44,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:55:44,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 25: [2022-11-26 16:55:44,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 27: [2022-11-26 16:55:44,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:55:44,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 1: [2022-11-26 16:55:44,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 27: [2022-11-26 16:55:44,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 1: [2022-11-26 16:55:44,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 27: [2022-11-26 16:55:44,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 1: [2022-11-26 16:55:44,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:55:44,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 16:55:44,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
20: [2022-11-26 16:55:44,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:55:44,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 16:55:44,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 7: [2022-11-26 16:55:44,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:55:44,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 16:55:44,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 5: [2022-11-26 16:55:44,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:55:44,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 16:55:44,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 12: [2022-11-26 16:55:44,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:55:44,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 16:55:44,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
29: [2022-11-26 16:55:44,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:55:44,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:55:44,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 7: [2022-11-26 16:55:44,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 29: [2022-11-26 16:55:44,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 7: [2022-11-26 16:55:44,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 23: [2022-11-26 16:55:44,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:55:44,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 16:55:44,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 3: [2022-11-26 16:55:44,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:55:44,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
3: [2022-11-26 16:55:44,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 6: [2022-11-26 16:55:44,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 3: [2022-11-26 16:55:44,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 6: [2022-11-26 16:55:44,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 17: [2022-11-26 16:55:44,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:55:44,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 16:55:44,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 21: [2022-11-26 16:55:44,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:55:44,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 16:55:44,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 5: [2022-11-26 16:55:44,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:55:44,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
5: [2022-11-26 16:55:44,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 2: [2022-11-26 16:55:44,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 5: [2022-11-26 16:55:44,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 2: [2022-11-26 16:55:44,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 21: [2022-11-26 16:55:44,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:55:44,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 5: [2022-11-26 16:55:44,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:55:44,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 5: [2022-11-26 16:55:44,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 16:55:44,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 9: [2022-11-26 16:55:44,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:55:44,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 16:55:44,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:55:44,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 11: [2022-11-26 16:55:44,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 9: [2022-11-26 16:55:44,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 11: [2022-11-26 16:55:44,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 16:55:44,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 11: [2022-11-26 16:55:44,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 11: [2022-11-26 16:55:44,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:55:44,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 30: [2022-11-26 16:55:44,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:55:44,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
30: [2022-11-26 16:55:44,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 16:55:44,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 24: [2022-11-26 16:55:44,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:55:44,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 27: [2022-11-26 16:55:44,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:55:44,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:55:44,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:55:44,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 24: [2022-11-26 16:55:44,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 14: [2022-11-26 16:55:44,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:55:44,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
27: [2022-11-26 16:55:44,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 16:55:44,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 0: [2022-11-26 16:55:44,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 14: [2022-11-26 16:55:44,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 16:55:44,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 0: [2022-11-26 16:55:44,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 30: [2022-11-26 16:55:44,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:55:44,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 16:55:44,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 26: [2022-11-26 16:55:44,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:55:44,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 16:55:44,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
1: [2022-11-26 16:55:44,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:55:44,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:55:44,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 16:55:44,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 16:55:44,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 1: [2022-11-26 16:55:44,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 12: [2022-11-26 16:55:44,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:55:44,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 16:55:44,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 6: [2022-11-26 16:55:44,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 16:55:44,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 24: [2022-11-26 16:55:44,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:55:44,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 25: [2022-11-26 16:55:44,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:55:44,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 25: [2022-11-26 16:55:44,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 24: [2022-11-26 16:55:44,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 25: [2022-11-26 16:55:44,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 16: [2022-11-26 16:55:44,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:55:44,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 16:55:44,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 29: [2022-11-26 16:55:44,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-26 16:55:44,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 16:55:44,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 8: [2022-11-26 16:55:44,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:55:44,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 16:55:44,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 28: [2022-11-26 16:55:44,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:55:44,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:55:44,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 16:55:44,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 3: [2022-11-26 16:55:44,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:55:44,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 16:55:44,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
16: [2022-11-26 16:55:44,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:55:44,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 16:55:44,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 16: [2022-11-26 16:55:44,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:55:44,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:55:44,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 16: [2022-11-26 16:55:44,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 22: [2022-11-26 16:55:44,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 16: [2022-11-26 16:55:44,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 20: [2022-11-26 16:55:44,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:55:44,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 16:55:44,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
23: [2022-11-26 16:55:44,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:55:44,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:55:44,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 11: [2022-11-26 16:55:44,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:55:44,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:55:44,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:55:44,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:55:44,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:55:44,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 11: [2022-11-26 16:55:44,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 16:55:44,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
4: [2022-11-26 16:55:44,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 16:55:44,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 25: [2022-11-26 16:55:44,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 4: [2022-11-26 16:55:44,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 16:55:44,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 7: [2022-11-26 16:55:44,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:55:44,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 4: [2022-11-26 16:55:44,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 4: [2022-11-26 16:55:44,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 4: [2022-11-26 16:55:44,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 7: [2022-11-26 16:55:44,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 16:55:44,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
25: [2022-11-26 16:55:44,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 29: [2022-11-26 16:55:44,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:55:44,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:55:44,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 29: [2022-11-26 16:55:44,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 16:55:44,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 26: [2022-11-26 16:55:44,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 26: [2022-11-26 16:55:44,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:55:44,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:55:44,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:55:44,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 
0: [2022-11-26 16:55:44,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 1: [2022-11-26 16:55:44,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 26: [2022-11-26 16:55:44,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 0: [2022-11-26 16:55:44,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 26: [2022-11-26 16:55:44,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 17: [2022-11-26 16:55:44,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 1: [2022-11-26 16:55:44,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 17: [2022-11-26 16:55:44,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 7: [2022-11-26 16:55:44,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:55:44,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 16:55:44,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 2: [2022-11-26 16:55:44,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 16:55:44,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 16:55:44,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 2: [2022-11-26 16:55:44,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:55:44,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 16:55:44,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 5: [2022-11-26 16:55:44,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:55:44,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 16:55:44,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 2: [2022-11-26 16:55:44,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:55:44,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 16:55:44,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 23: [2022-11-26 16:55:44,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
3: [2022-11-26 16:55:44,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:55:44,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 16:55:44,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 3: [2022-11-26 16:55:44,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 16:55:44,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 8: [2022-11-26 16:55:44,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:55:44,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 16:55:44,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 14: [2022-11-26 16:55:44,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:55:44,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 15: [2022-11-26 16:55:44,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:55:44,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
15: [2022-11-26 16:55:44,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 16:55:44,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 9: [2022-11-26 16:55:44,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:55:44,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 12: [2022-11-26 16:55:44,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:55:44,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:55:44,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:55:44,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 9: [2022-11-26 16:55:44,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 13: [2022-11-26 16:55:44,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 12: [2022-11-26 16:55:44,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 13: [2022-11-26 16:55:44,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
13: [2022-11-26 16:55:44,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 16:55:44,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 15: [2022-11-26 16:55:44,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:55:44,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:55:44,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:55:44,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 16:55:44,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 13: [2022-11-26 16:55:44,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 16:55:44,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 16:55:44,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 13: [2022-11-26 16:55:44,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 30: [2022-11-26 16:55:44,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
30: [2022-11-26 16:55:44,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 16:55:44,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 10: [2022-11-26 16:55:44,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:55:44,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:55:44,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:55:44,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 16:55:44,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 16:55:44,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 16:55:44,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 10: [2022-11-26 16:55:44,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 10: [2022-11-26 16:55:44,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 18: [2022-11-26 16:55:44,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
18: [2022-11-26 16:55:44,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 16:55:44,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 28: [2022-11-26 16:55:44,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 16:55:44,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 23: [2022-11-26 16:55:44,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:55:44,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:55:44,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 23: [2022-11-26 16:55:44,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 28: [2022-11-26 16:55:44,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 23: [2022-11-26 16:55:44,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 28: [2022-11-26 16:55:44,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-26 16:55:44,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 16:55:44,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 28: [2022-11-26 16:55:44,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:55:44,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 16:55:44,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 6: [2022-11-26 16:55:44,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:55:44,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 16:55:44,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 19: [2022-11-26 16:55:44,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:55:44,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:55:44,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-26 16:55:44,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:55:44,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 16:55:44,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 16:55:44,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 16:55:44,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 16:55:44,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 19: [2022-11-26 16:55:44,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 19: [2022-11-26 16:55:44,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 19: [2022-11-26 16:55:44,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 27: [2022-11-26 16:55:44,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:55:44,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 16:55:44,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
24: [2022-11-26 16:55:44,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:55:44,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 16:55:44,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 31: [2022-11-26 16:55:44,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:55:44,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:55:44,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 16:55:44,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:55:44,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 16:55:44,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 31: [2022-11-26 16:55:44,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 31: [2022-11-26 16:55:44,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 16:55:44,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
0: [2022-11-26 16:55:44,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:55:44,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:55:44,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 16:55:44,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 0: [2022-11-26 16:55:44,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 16:55:44,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 5: [2022-11-26 16:55:44,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:55:44,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 16:55:44,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 0: [2022-11-26 16:55:44,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:55:44,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 16:55:44,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
18: [2022-11-26 16:55:44,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:55:44,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 16:55:44,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:55:44,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 18: [2022-11-26 16:55:44,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 16:55:44,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 8: [2022-11-26 16:55:44,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:55:44,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 16:55:44,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 18: [2022-11-26 16:55:44,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:55:44,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 16:55:44,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
2: [2022-11-26 16:55:44,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:55:44,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 16:55:44,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 21: [2022-11-26 16:55:44,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:55:44,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 16:55:44,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 27: [2022-11-26 16:55:44,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:55:44,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 16:55:44,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 17: [2022-11-26 16:55:44,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:55:44,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 16:55:44,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
16: [2022-11-26 16:55:44,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:55:44,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 16:55:44,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 18: [2022-11-26 16:55:44,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:55:44,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 16:55:44,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 9: [2022-11-26 16:55:44,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:55:44,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 16:55:44,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 3: [2022-11-26 16:55:44,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:55:44,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 16:55:44,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
25: [2022-11-26 16:55:44,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:55:44,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 16:55:44,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 30: [2022-11-26 16:55:44,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:55:44,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 16:55:44,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 6: [2022-11-26 16:55:44,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:55:44,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 16:55:44,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 7: [2022-11-26 16:55:44,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:55:44,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 16:55:44,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
14: [2022-11-26 16:55:44,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:55:44,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 16:55:44,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 26: [2022-11-26 16:55:44,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:55:44,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 16:55:44,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 22: [2022-11-26 16:55:44,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:55:44,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:55:44,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 15: [2022-11-26 16:55:44,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 22: [2022-11-26 16:55:44,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 15: [2022-11-26 16:55:44,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
23: [2022-11-26 16:55:44,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:55:44,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 16:55:44,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 12: [2022-11-26 16:55:44,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:55:44,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 16:55:44,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 24: [2022-11-26 16:55:44,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:55:44,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:55:44,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 16:55:44,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 24: [2022-11-26 16:55:44,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 16:55:44,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
4: [2022-11-26 16:55:44,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:55:44,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 16:55:44,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 10: [2022-11-26 16:55:44,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:55:44,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 16:55:44,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 20: [2022-11-26 16:55:44,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:55:44,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 16:55:44,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 28: [2022-11-26 16:55:44,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:55:44,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
29: [2022-11-26 16:55:44,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 16:55:44,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 28: [2022-11-26 16:55:44,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 16:55:44,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 31: [2022-11-26 16:55:44,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:55:44,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 16:55:44,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 13: [2022-11-26 16:55:44,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:55:44,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 16:55:44,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 11: [2022-11-26 16:55:44,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 16:55:44,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 16:55:44,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 1: [2022-11-26 16:55:44,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:55:44,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 16:55:44,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 27: [2022-11-26 16:55:44,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:55:44,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 16:55:44,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 0: [2022-11-26 16:55:44,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:55:44,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 16:55:44,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 17: [2022-11-26 16:55:44,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 
17: [2022-11-26 16:55:44,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 21: [2022-11-26 16:55:44,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:55:44,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 16:55:44,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 8: [2022-11-26 16:55:44,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:55:44,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 16:55:44,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 5: [2022-11-26 16:55:44,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:55:44,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
5: [2022-11-26 16:55:44,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 2: [2022-11-26 16:55:44,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 5: [2022-11-26 16:55:44,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 2: [2022-11-26 16:55:44,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 4: [2022-11-26 16:55:44,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:55:44,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 16:55:44,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 16: [2022-11-26 16:55:44,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 16:55:44,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 16:55:44,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 25: [2022-11-26 16:55:44,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
25: [2022-11-26 16:55:44,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 16:55:44,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 17: [2022-11-26 16:55:44,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 18: [2022-11-26 16:55:44,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:55:44,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 16:55:44,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 9: [2022-11-26 16:55:44,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:55:44,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 16:55:44,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 3: [2022-11-26 16:55:44,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:55:44,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 16:55:44,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
6: [2022-11-26 16:55:44,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:55:44,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 16:55:44,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 7: [2022-11-26 16:55:44,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:55:44,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 16:55:44,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 30: [2022-11-26 16:55:44,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:55:44,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 16:55:44,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 15: [2022-11-26 16:55:44,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:55:44,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-26 16:55:44,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 16:55:44,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 15: [2022-11-26 16:55:44,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 16:55:44,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 28: [2022-11-26 16:55:44,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:55:44,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 16:55:44,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 23: [2022-11-26 16:55:44,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:55:44,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 16:55:44,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 12: [2022-11-26 16:55:44,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 16:55:44,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 16:55:44,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 14: [2022-11-26 16:55:44,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:55:44,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:55:44,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 16:55:44,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 29: [2022-11-26 16:55:44,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 16:55:44,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 10: [2022-11-26 16:55:44,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:55:44,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 16:55:44,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 11: [2022-11-26 16:55:44,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-26 16:55:44,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 16:55:44,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 22: [2022-11-26 16:55:44,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:55:44,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 24: [2022-11-26 16:55:44,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:55:44,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 26: [2022-11-26 16:55:44,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:55:44,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 26: [2022-11-26 16:55:44,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 16:55:44,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 24: [2022-11-26 16:55:44,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 13: [2022-11-26 16:55:44,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 16:55:44,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 16:55:44,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 1: [2022-11-26 16:55:44,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:55:44,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 16:55:44,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 21: [2022-11-26 16:55:44,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 16:55:44,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 16:55:44,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 20: [2022-11-26 16:55:44,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:55:44,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 16:55:44,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 31: [2022-11-26 16:55:44,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 
31: [2022-11-26 16:55:44,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 16:55:44,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 2: [2022-11-26 16:55:44,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:55:44,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 16:55:44,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 27: [2022-11-26 16:55:44,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:55:44,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 16:55:44,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 0: [2022-11-26 16:55:44,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:55:44,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 16:55:44,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 4: [2022-11-26 16:55:44,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 16:55:44,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 16:55:44,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 18: [2022-11-26 16:55:44,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:55:44,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 8: [2022-11-26 16:55:44,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:55:44,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 8: [2022-11-26 16:55:44,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 16:55:44,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 17: [2022-11-26 16:55:44,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:55:44,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 16:55:44,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 25: [2022-11-26 16:55:44,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
25: [2022-11-26 16:55:44,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 16:55:44,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 5: [2022-11-26 16:55:44,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:55:44,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 16:55:44,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 9: [2022-11-26 16:55:44,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:55:44,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 16:55:44,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 15: [2022-11-26 16:55:44,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:55:44,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 16:55:44,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 16: [2022-11-26 16:55:44,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
16: [2022-11-26 16:55:44,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 16:55:44,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 19: [2022-11-26 16:55:44,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 16:55:44,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 16:55:44,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 3: [2022-11-26 16:55:44,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:55:44,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 16:55:44,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 22: [2022-11-26 16:55:44,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 16:55:44,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 16:55:44,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 6: [2022-11-26 16:55:44,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 16:55:44,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 16:55:44,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 30: [2022-11-26 16:55:44,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:55:44,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 16:55:44,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 12: [2022-11-26 16:55:44,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:55:44,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 16:55:44,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 14: [2022-11-26 16:55:44,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:55:44,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 16:55:44,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 23: [2022-11-26 16:55:44,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
26: [2022-11-26 16:55:44,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:55:44,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 26: [2022-11-26 16:55:44,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 23: [2022-11-26 16:55:44,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 26: [2022-11-26 16:55:44,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 7: [2022-11-26 16:55:44,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:55:44,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 16:55:44,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 29: [2022-11-26 16:55:44,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:55:44,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 16:55:44,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 13: [2022-11-26 16:55:44,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 16:55:44,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 16:55:44,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 8: [2022-11-26 16:55:44,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:55:44,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 16:55:44,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 23: [2022-11-26 16:55:44,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 16:55:44,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 16:55:44,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 28: [2022-11-26 16:55:44,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:55:44,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 24: [2022-11-26 16:55:44,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
22: [2022-11-26 16:55:44,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:55:44,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 22: [2022-11-26 16:55:44,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 24: [2022-11-26 16:55:44,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 22: [2022-11-26 16:55:44,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 24: [2022-11-26 16:55:44,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 14: [2022-11-26 16:55:44,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:55:44,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 16:55:44,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 2: [2022-11-26 16:55:44,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:55:44,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 16:55:44,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
31: [2022-11-26 16:55:44,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:55:44,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 16:55:44,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 17: [2022-11-26 16:55:44,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:55:44,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:55:44,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 16:55:44,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 6: [2022-11-26 16:55:44,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 17: [2022-11-26 16:55:44,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 16:55:44,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 6: [2022-11-26 16:55:44,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 16:55:44,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
26: [2022-11-26 16:55:44,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 16:55:44,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 16:55:44,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 5: [2022-11-26 16:55:44,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:55:44,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:55:44,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 3: [2022-11-26 16:55:44,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 5: [2022-11-26 16:55:44,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 3: [2022-11-26 16:55:44,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 4: [2022-11-26 16:55:44,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:55:44,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
4: [2022-11-26 16:55:44,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 16:55:44,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 7: [2022-11-26 16:55:44,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 16:55:44,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 18: [2022-11-26 16:55:44,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:55:44,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 18: [2022-11-26 16:55:44,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 15: [2022-11-26 16:55:44,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 18: [2022-11-26 16:55:44,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 15: [2022-11-26 16:55:44,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 21: [2022-11-26 16:55:44,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:55:44,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
21: [2022-11-26 16:55:44,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 16:55:44,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 11: [2022-11-26 16:55:44,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 16:55:44,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 24: [2022-11-26 16:55:44,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 27: [2022-11-26 16:55:44,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:55:44,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:55:44,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 27: [2022-11-26 16:55:44,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 0: [2022-11-26 16:55:44,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 24: [2022-11-26 16:55:44,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 27: [2022-11-26 16:55:44,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
0: [2022-11-26 16:55:44,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 1: [2022-11-26 16:55:44,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 0: [2022-11-26 16:55:44,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 1: [2022-11-26 16:55:44,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 9: [2022-11-26 16:55:44,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 25: [2022-11-26 16:55:44,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:55:44,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 16:55:44,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 25: [2022-11-26 16:55:44,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 16:55:44,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 16: [2022-11-26 16:55:44,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
16: [2022-11-26 16:55:44,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 16:55:44,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 20: [2022-11-26 16:55:44,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 16:55:44,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 16:55:44,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 30: [2022-11-26 16:55:44,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 16:55:44,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 16:55:44,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 11: [2022-11-26 16:55:44,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:55:44,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 16:55:44,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 19: [2022-11-26 16:55:44,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
19: [2022-11-26 16:55:44,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 31: [2022-11-26 16:55:44,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:55:44,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 16:55:44,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 19: [2022-11-26 16:55:44,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 13: [2022-11-26 16:55:44,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:55:44,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 16:55:44,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 29: [2022-11-26 16:55:44,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 28: [2022-11-26 16:55:44,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 29: [2022-11-26 16:55:44,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 16:55:44,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
28: [2022-11-26 16:55:44,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 16:55:44,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 10: [2022-11-26 16:55:44,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:55:44,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 16:55:44,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 10: [2022-11-26 16:55:44,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:55:44,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 16:55:44,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 31: [2022-11-26 16:55:44,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 16:55:44,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 16:55:44,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 10: [2022-11-26 16:55:44,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-26 16:55:44,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step101000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 16:55:44,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 0: successfully saved checkpoint at iteration 101000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2570.75 31: iteration 101010/ 173500 | consumed samples: 25858560 | consumed tokens: 52958330880 | elapsed time per iteration (s): 0.99 | learning rate: 8.817E-05 | global batch size: 256 | lm loss: 1.979412E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 258.034 | TFLOPs: 15.61 | 31: iteration 101020/ 173500 | consumed samples: 25861120 | consumed tokens: 52963573760 | elapsed time per iteration (s): 0.84 | learning rate: 8.816E-05 | global batch size: 256 | lm loss: 1.961576E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.505 | TFLOPs: 18.36 | 31: iteration 101030/ 173500 | consumed samples: 25863680 | consumed tokens: 52968816640 | elapsed time per iteration (s): 0.75 | learning rate: 8.814E-05 | global batch size: 256 | lm loss: 1.989753E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.565 | TFLOPs: 20.54 | 31: iteration 101040/ 173500 | consumed samples: 25866240 | consumed tokens: 52974059520 | elapsed time per iteration (s): 0.77 | learning rate: 8.813E-05 | global batch size: 256 | lm loss: 1.981937E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.226 | TFLOPs: 20.16 | 31: iteration 101050/ 173500 | consumed samples: 25868800 | consumed tokens: 52979302400 | elapsed time per iteration (s): 0.75 | learning rate: 8.811E-05 | 
global batch size: 256 | lm loss: 1.977508E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.409 | TFLOPs: 20.65 | 31: iteration 101060/ 173500 | consumed samples: 25871360 | consumed tokens: 52984545280 | elapsed time per iteration (s): 0.79 | learning rate: 8.810E-05 | global batch size: 256 | lm loss: 1.934559E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.424 | TFLOPs: 19.51 | 31: iteration 101070/ 173500 | consumed samples: 25873920 | consumed tokens: 52989788160 | elapsed time per iteration (s): 0.77 | learning rate: 8.808E-05 | global batch size: 256 | lm loss: 1.964370E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.531 | TFLOPs: 20.12 | 31: iteration 101080/ 173500 | consumed samples: 25876480 | consumed tokens: 52995031040 | elapsed time per iteration (s): 0.78 | learning rate: 8.806E-05 | global batch size: 256 | lm loss: 1.998473E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.239 | TFLOPs: 19.92 | 31: iteration 101090/ 173500 | consumed samples: 25879040 | consumed tokens: 53000273920 | elapsed time per iteration (s): 0.73 | learning rate: 8.805E-05 | global batch size: 256 | lm loss: 1.960979E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.876 | TFLOPs: 21.29 | 31: iteration 101100/ 173500 | consumed samples: 25881600 | consumed tokens: 53005516800 | elapsed time per iteration (s): 0.76 | learning rate: 8.803E-05 | global batch size: 256 | lm loss: 1.957018E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.008 | TFLOPs: 20.27 | 31: iteration 101110/ 173500 | consumed 
samples: 25884160 | consumed tokens: 53010759680 | elapsed time per iteration (s): 0.76 | learning rate: 8.802E-05 | global batch size: 256 | lm loss: 1.983895E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.285 | TFLOPs: 20.47 | 31: iteration 101120/ 173500 | consumed samples: 25886720 | consumed tokens: 53016002560 | elapsed time per iteration (s): 0.81 | learning rate: 8.800E-05 | global batch size: 256 | lm loss: 1.941296E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.910 | TFLOPs: 19.23 | 31: iteration 101130/ 173500 | consumed samples: 25889280 | consumed tokens: 53021245440 | elapsed time per iteration (s): 0.78 | learning rate: 8.798E-05 | global batch size: 256 | lm loss: 1.968386E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.176 | TFLOPs: 19.73 | 31: iteration 101140/ 173500 | consumed samples: 25891840 | consumed tokens: 53026488320 | elapsed time per iteration (s): 0.78 | learning rate: 8.797E-05 | global batch size: 256 | lm loss: 1.968902E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.845 | TFLOPs: 19.83 | 31: iteration 101150/ 173500 | consumed samples: 25894400 | consumed tokens: 53031731200 | elapsed time per iteration (s): 0.72 | learning rate: 8.795E-05 | global batch size: 256 | lm loss: 1.986931E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.544 | TFLOPs: 21.51 | 31: iteration 101160/ 173500 | consumed samples: 25896960 | consumed tokens: 53036974080 | elapsed time per iteration (s): 0.78 | learning rate: 8.794E-05 | global batch size: 256 | lm loss: 1.945644E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 326.584 | TFLOPs: 19.76 | 31: iteration 101170/ 173500 | consumed samples: 25899520 | consumed tokens: 53042216960 | elapsed time per iteration (s): 0.73 | learning rate: 8.792E-05 | global batch size: 256 | lm loss: 1.973312E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.672 | TFLOPs: 21.28 | 31: iteration 101180/ 173500 | consumed samples: 25902080 | consumed tokens: 53047459840 | elapsed time per iteration (s): 0.78 | learning rate: 8.790E-05 | global batch size: 256 | lm loss: 1.979541E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.058 | TFLOPs: 19.85 | 31: iteration 101190/ 173500 | consumed samples: 25904640 | consumed tokens: 53052702720 | elapsed time per iteration (s): 0.77 | learning rate: 8.789E-05 | global batch size: 256 | lm loss: 1.959489E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.247 | TFLOPs: 20.16 | 31: iteration 101200/ 173500 | consumed samples: 25907200 | consumed tokens: 53057945600 | elapsed time per iteration (s): 0.77 | learning rate: 8.787E-05 | global batch size: 256 | lm loss: 1.981203E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.758 | TFLOPs: 20.01 | 31: iteration 101210/ 173500 | consumed samples: 25909760 | consumed tokens: 53063188480 | elapsed time per iteration (s): 0.73 | learning rate: 8.786E-05 | global batch size: 256 | lm loss: 1.990879E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.560 | TFLOPs: 21.27 | 31: iteration 101220/ 173500 | consumed samples: 25912320 | consumed tokens: 53068431360 | elapsed time per iteration (s): 0.86 | learning rate: 8.784E-05 | global 
batch size: 256 | lm loss: 1.950869E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.314 | TFLOPs: 17.93 | 31: iteration 101230/ 173500 | consumed samples: 25914880 | consumed tokens: 53073674240 | elapsed time per iteration (s): 0.76 | learning rate: 8.782E-05 | global batch size: 256 | lm loss: 1.964779E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.339 | TFLOPs: 20.29 | 31: iteration 101240/ 173500 | consumed samples: 25917440 | consumed tokens: 53078917120 | elapsed time per iteration (s): 0.76 | learning rate: 8.781E-05 | global batch size: 256 | lm loss: 1.984676E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.284 | TFLOPs: 20.34 | 31: iteration 101250/ 173500 | consumed samples: 25920000 | consumed tokens: 53084160000 | elapsed time per iteration (s): 0.75 | learning rate: 8.779E-05 | global batch size: 256 | lm loss: 1.959986E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.367 | TFLOPs: 20.65 | 31: iteration 101260/ 173500 | consumed samples: 25922560 | consumed tokens: 53089402880 | elapsed time per iteration (s): 0.72 | learning rate: 8.778E-05 | global batch size: 256 | lm loss: 1.967974E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.784 | TFLOPs: 21.52 | 31: iteration 101270/ 173500 | consumed samples: 25925120 | consumed tokens: 53094645760 | elapsed time per iteration (s): 0.78 | learning rate: 8.776E-05 | global batch size: 256 | lm loss: 1.951412E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.833 | TFLOPs: 19.77 | 31: iteration 101280/ 173500 | consumed samples: 
25927680 | consumed tokens: 53099888640 | elapsed time per iteration (s): 0.72 | learning rate: 8.774E-05 | global batch size: 256 | lm loss: 1.968149E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.357 | TFLOPs: 21.38 | 31: iteration 101290/ 173500 | consumed samples: 25930240 | consumed tokens: 53105131520 | elapsed time per iteration (s): 0.77 | learning rate: 8.773E-05 | global batch size: 256 | lm loss: 2.002891E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.421 | TFLOPs: 20.23 | 31: iteration 101300/ 173500 | consumed samples: 25932800 | consumed tokens: 53110374400 | elapsed time per iteration (s): 0.76 | learning rate: 8.771E-05 | global batch size: 256 | lm loss: 1.966231E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.187 | TFLOPs: 20.40 | 31: iteration 101310/ 173500 | consumed samples: 25935360 | consumed tokens: 53115617280 | elapsed time per iteration (s): 0.84 | learning rate: 8.770E-05 | global batch size: 256 | lm loss: 1.955734E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.452 | TFLOPs: 18.54 | 31: iteration 101320/ 173500 | consumed samples: 25937920 | consumed tokens: 53120860160 | elapsed time per iteration (s): 0.82 | learning rate: 8.768E-05 | global batch size: 256 | lm loss: 1.965857E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.626 | TFLOPs: 18.79 | 31: iteration 101330/ 173500 | consumed samples: 25940480 | consumed tokens: 53126103040 | elapsed time per iteration (s): 0.74 | learning rate: 8.766E-05 | global batch size: 256 | lm loss: 2.002260E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 347.944 | TFLOPs: 21.05 | 31: iteration 101340/ 173500 | consumed samples: 25943040 | consumed tokens: 53131345920 | elapsed time per iteration (s): 0.79 | learning rate: 8.765E-05 | global batch size: 256 | lm loss: 2.004339E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.607 | TFLOPs: 19.70 | 31: iteration 101350/ 173500 | consumed samples: 25945600 | consumed tokens: 53136588800 | elapsed time per iteration (s): 0.73 | learning rate: 8.763E-05 | global batch size: 256 | lm loss: 1.977671E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.382 | TFLOPs: 21.20 | 31: iteration 101360/ 173500 | consumed samples: 25948160 | consumed tokens: 53141831680 | elapsed time per iteration (s): 0.78 | learning rate: 8.762E-05 | global batch size: 256 | lm loss: 1.975973E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.601 | TFLOPs: 19.82 | 31: iteration 101370/ 173500 | consumed samples: 25950720 | consumed tokens: 53147074560 | elapsed time per iteration (s): 0.78 | learning rate: 8.760E-05 | global batch size: 256 | lm loss: 1.976219E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.146 | TFLOPs: 19.97 | 31: iteration 101380/ 173500 | consumed samples: 25953280 | consumed tokens: 53152317440 | elapsed time per iteration (s): 0.77 | learning rate: 8.758E-05 | global batch size: 256 | lm loss: 1.973519E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.480 | TFLOPs: 20.17 | 31: iteration 101390/ 173500 | consumed samples: 25955840 | consumed tokens: 53157560320 | elapsed time per iteration (s): 0.79 | learning rate: 8.757E-05 | global batch 
size: 256 | lm loss: 1.970260E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.349 | TFLOPs: 19.62 | 31: iteration 101400/ 173500 | consumed samples: 25958400 | consumed tokens: 53162803200 | elapsed time per iteration (s): 0.80 | learning rate: 8.755E-05 | global batch size: 256 | lm loss: 1.976265E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.678 | TFLOPs: 19.40 | 31: iteration 101410/ 173500 | consumed samples: 25960960 | consumed tokens: 53168046080 | elapsed time per iteration (s): 0.82 | learning rate: 8.754E-05 | global batch size: 256 | lm loss: 1.972803E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.129 | TFLOPs: 18.82 | 31: iteration 101420/ 173500 | consumed samples: 25963520 | consumed tokens: 53173288960 | elapsed time per iteration (s): 0.82 | learning rate: 8.752E-05 | global batch size: 256 | lm loss: 1.997445E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.155 | TFLOPs: 18.82 | 31: iteration 101430/ 173500 | consumed samples: 25966080 | consumed tokens: 53178531840 | elapsed time per iteration (s): 0.82 | learning rate: 8.750E-05 | global batch size: 256 | lm loss: 1.984191E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.839 | TFLOPs: 18.80 | 31: iteration 101440/ 173500 | consumed samples: 25968640 | consumed tokens: 53183774720 | elapsed time per iteration (s): 0.79 | learning rate: 8.749E-05 | global batch size: 256 | lm loss: 1.977372E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.059 | TFLOPs: 19.73 | 31: iteration 101450/ 173500 | consumed samples: 25971200 
| consumed tokens: 53189017600 | elapsed time per iteration (s): 0.82 | learning rate: 8.747E-05 | global batch size: 256 | lm loss: 1.951493E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.587 | TFLOPs: 18.97 | 31: iteration 101460/ 173500 | consumed samples: 25973760 | consumed tokens: 53194260480 | elapsed time per iteration (s): 0.89 | learning rate: 8.746E-05 | global batch size: 256 | lm loss: 1.974331E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.626 | TFLOPs: 17.46 | 31: iteration 101470/ 173500 | consumed samples: 25976320 | consumed tokens: 53199503360 | elapsed time per iteration (s): 0.78 | learning rate: 8.744E-05 | global batch size: 256 | lm loss: 1.986558E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.142 | TFLOPs: 19.73 | 31: iteration 101480/ 173500 | consumed samples: 25978880 | consumed tokens: 53204746240 | elapsed time per iteration (s): 0.82 | learning rate: 8.743E-05 | global batch size: 256 | lm loss: 1.953012E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.540 | TFLOPs: 18.79 | 31: iteration 101490/ 173500 | consumed samples: 25981440 | consumed tokens: 53209989120 | elapsed time per iteration (s): 0.82 | learning rate: 8.741E-05 | global batch size: 256 | lm loss: 1.974720E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.123 | TFLOPs: 18.82 | 31: iteration 101500/ 173500 | consumed samples: 25984000 | consumed tokens: 53215232000 | elapsed time per iteration (s): 0.91 | learning rate: 8.739E-05 | global batch size: 256 | lm loss: 1.968316E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 279.800 | TFLOPs: 16.93 | 31: iteration 101510/ 173500 | consumed samples: 25986560 | consumed tokens: 53220474880 | elapsed time per iteration (s): 0.78 | learning rate: 8.738E-05 | global batch size: 256 | lm loss: 1.981139E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.891 | TFLOPs: 19.96 | 31: iteration 101520/ 173500 | consumed samples: 25989120 | consumed tokens: 53225717760 | elapsed time per iteration (s): 0.85 | learning rate: 8.736E-05 | global batch size: 256 | lm loss: 1.957373E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.559 | TFLOPs: 18.30 | 31: iteration 101530/ 173500 | consumed samples: 25991680 | consumed tokens: 53230960640 | elapsed time per iteration (s): 0.82 | learning rate: 8.735E-05 | global batch size: 256 | lm loss: 1.956190E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.380 | TFLOPs: 18.90 | 31: iteration 101540/ 173500 | consumed samples: 25994240 | consumed tokens: 53236203520 | elapsed time per iteration (s): 0.84 | learning rate: 8.733E-05 | global batch size: 256 | lm loss: 1.977027E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.182 | TFLOPs: 18.46 | 31: iteration 101550/ 173500 | consumed samples: 25996800 | consumed tokens: 53241446400 | elapsed time per iteration (s): 0.78 | learning rate: 8.731E-05 | global batch size: 256 | lm loss: 1.970123E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.982 | TFLOPs: 19.84 | 31: iteration 101560/ 173500 | consumed samples: 25999360 | consumed tokens: 53246689280 | elapsed time per iteration (s): 0.78 | learning rate: 8.730E-05 | global batch size: 
256 | lm loss: 1.976792E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.703 | TFLOPs: 19.83 | 31: iteration 101570/ 173500 | consumed samples: 26001920 | consumed tokens: 53251932160 | elapsed time per iteration (s): 0.79 | learning rate: 8.728E-05 | global batch size: 256 | lm loss: 1.972372E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.567 | TFLOPs: 19.51 | 31: iteration 101580/ 173500 | consumed samples: 26004480 | consumed tokens: 53257175040 | elapsed time per iteration (s): 0.79 | learning rate: 8.727E-05 | global batch size: 256 | lm loss: 1.969295E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.741 | TFLOPs: 19.65 | 31: iteration 101590/ 173500 | consumed samples: 26007040 | consumed tokens: 53262417920 | elapsed time per iteration (s): 0.79 | learning rate: 8.725E-05 | global batch size: 256 | lm loss: 1.962580E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.042 | TFLOPs: 19.54 | 31: iteration 101600/ 173500 | consumed samples: 26009600 | consumed tokens: 53267660800 | elapsed time per iteration (s): 0.80 | learning rate: 8.723E-05 | global batch size: 256 | lm loss: 1.957557E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.964 | TFLOPs: 19.36 | 31: iteration 101610/ 173500 | consumed samples: 26012160 | consumed tokens: 53272903680 | elapsed time per iteration (s): 0.83 | learning rate: 8.722E-05 | global batch size: 256 | lm loss: 1.983890E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.896 | TFLOPs: 18.63 | 31: iteration 101620/ 173500 | consumed samples: 26014720 | 
consumed tokens: 53278146560 | elapsed time per iteration (s): 0.82 | learning rate: 8.720E-05 | global batch size: 256 | lm loss: 1.930932E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.104 | TFLOPs: 18.94 | 31: iteration 101630/ 173500 | consumed samples: 26017280 | consumed tokens: 53283389440 | elapsed time per iteration (s): 0.80 | learning rate: 8.719E-05 | global batch size: 256 | lm loss: 1.968307E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.171 | TFLOPs: 19.25 | 31: iteration 101640/ 173500 | consumed samples: 26019840 | consumed tokens: 53288632320 | elapsed time per iteration (s): 0.79 | learning rate: 8.717E-05 | global batch size: 256 | lm loss: 1.995027E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.022 | TFLOPs: 19.72 | 31: iteration 101650/ 173500 | consumed samples: 26022400 | consumed tokens: 53293875200 | elapsed time per iteration (s): 0.81 | learning rate: 8.715E-05 | global batch size: 256 | lm loss: 1.989547E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.067 | TFLOPs: 19.06 | 31: iteration 101660/ 173500 | consumed samples: 26024960 | consumed tokens: 53299118080 | elapsed time per iteration (s): 0.81 | learning rate: 8.714E-05 | global batch size: 256 | lm loss: 1.973273E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.965 | TFLOPs: 19.18 | 31: iteration 101670/ 173500 | consumed samples: 26027520 | consumed tokens: 53304360960 | elapsed time per iteration (s): 1.12 | learning rate: 8.712E-05 | global batch size: 256 | lm loss: 1.971290E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 227.986 | TFLOPs: 13.79 | 31: iteration 101680/ 173500 | consumed samples: 26030080 | consumed tokens: 53309603840 | elapsed time per iteration (s): 0.75 | learning rate: 8.711E-05 | global batch size: 256 | lm loss: 1.968090E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.068 | TFLOPs: 20.63 | 31: iteration 101690/ 173500 | consumed samples: 26032640 | consumed tokens: 53314846720 | elapsed time per iteration (s): 0.77 | learning rate: 8.709E-05 | global batch size: 256 | lm loss: 1.982379E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.345 | TFLOPs: 20.23 | 31: iteration 101700/ 173500 | consumed samples: 26035200 | consumed tokens: 53320089600 | elapsed time per iteration (s): 0.73 | learning rate: 8.707E-05 | global batch size: 256 | lm loss: 1.988625E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.843 | TFLOPs: 21.16 | 31: iteration 101710/ 173500 | consumed samples: 26037760 | consumed tokens: 53325332480 | elapsed time per iteration (s): 0.76 | learning rate: 8.706E-05 | global batch size: 256 | lm loss: 1.942147E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.094 | TFLOPs: 20.27 | 31: iteration 101720/ 173500 | consumed samples: 26040320 | consumed tokens: 53330575360 | elapsed time per iteration (s): 0.82 | learning rate: 8.704E-05 | global batch size: 256 | lm loss: 1.970282E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.530 | TFLOPs: 18.97 | 31: iteration 101730/ 173500 | consumed samples: 26042880 | consumed tokens: 53335818240 | elapsed time per iteration (s): 0.75 | learning rate: 8.703E-05 | global batch size: 
256 | lm loss: 1.969060E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.338 | TFLOPs: 20.65 | 31: iteration 101740/ 173500 | consumed samples: 26045440 | consumed tokens: 53341061120 | elapsed time per iteration (s): 0.81 | learning rate: 8.701E-05 | global batch size: 256 | lm loss: 1.991116E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.058 | TFLOPs: 19.18 | 31: iteration 101750/ 173500 | consumed samples: 26048000 | consumed tokens: 53346304000 | elapsed time per iteration (s): 0.83 | learning rate: 8.700E-05 | global batch size: 256 | lm loss: 1.948724E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.350 | TFLOPs: 18.65 | 31: iteration 101760/ 173500 | consumed samples: 26050560 | consumed tokens: 53351546880 | elapsed time per iteration (s): 0.77 | learning rate: 8.698E-05 | global batch size: 256 | lm loss: 1.989347E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.513 | TFLOPs: 20.06 | 31: iteration 101770/ 173500 | consumed samples: 26053120 | consumed tokens: 53356789760 | elapsed time per iteration (s): 0.80 | learning rate: 8.696E-05 | global batch size: 256 | lm loss: 1.970996E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.319 | TFLOPs: 19.32 | 31: iteration 101780/ 173500 | consumed samples: 26055680 | consumed tokens: 53362032640 | elapsed time per iteration (s): 0.80 | learning rate: 8.695E-05 | global batch size: 256 | lm loss: 1.959702E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.708 | TFLOPs: 19.46 | 31: iteration 101790/ 173500 | consumed samples: 26058240 | 
consumed tokens: 53367275520 | elapsed time per iteration (s): 0.81 | learning rate: 8.693E-05 | global batch size: 256 | lm loss: 1.965168E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.490 | TFLOPs: 19.21 | 31: iteration 101800/ 173500 | consumed samples: 26060800 | consumed tokens: 53372518400 | elapsed time per iteration (s): 0.80 | learning rate: 8.692E-05 | global batch size: 256 | lm loss: 1.929860E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.091 | TFLOPs: 19.43 | 31: iteration 101810/ 173500 | consumed samples: 26063360 | consumed tokens: 53377761280 | elapsed time per iteration (s): 0.84 | learning rate: 8.690E-05 | global batch size: 256 | lm loss: 1.988491E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.702 | TFLOPs: 18.49 | 31: iteration 101820/ 173500 | consumed samples: 26065920 | consumed tokens: 53383004160 | elapsed time per iteration (s): 0.80 | learning rate: 8.688E-05 | global batch size: 256 | lm loss: 1.971746E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.909 | TFLOPs: 19.29 | 31: iteration 101830/ 173500 | consumed samples: 26068480 | consumed tokens: 53388247040 | elapsed time per iteration (s): 0.83 | learning rate: 8.687E-05 | global batch size: 256 | lm loss: 1.946570E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.872 | TFLOPs: 18.75 | 31: iteration 101840/ 173500 | consumed samples: 26071040 | consumed tokens: 53393489920 | elapsed time per iteration (s): 0.81 | learning rate: 8.685E-05 | global batch size: 256 | lm loss: 1.995597E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 316.876 | TFLOPs: 19.17 | 31: iteration 101850/ 173500 | consumed samples: 26073600 | consumed tokens: 53398732800 | elapsed time per iteration (s): 0.85 | learning rate: 8.684E-05 | global batch size: 256 | lm loss: 1.972191E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.269 | TFLOPs: 18.29 | 31: iteration 101860/ 173500 | consumed samples: 26076160 | consumed tokens: 53403975680 | elapsed time per iteration (s): 0.94 | learning rate: 8.682E-05 | global batch size: 256 | lm loss: 1.975393E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 272.057 | TFLOPs: 16.46 | 31: iteration 101870/ 173500 | consumed samples: 26078720 | consumed tokens: 53409218560 | elapsed time per iteration (s): 0.81 | learning rate: 8.680E-05 | global batch size: 256 | lm loss: 1.969219E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.126 | TFLOPs: 19.00 | 31: iteration 101880/ 173500 | consumed samples: 26081280 | consumed tokens: 53414461440 | elapsed time per iteration (s): 0.87 | learning rate: 8.679E-05 | global batch size: 256 | lm loss: 1.975940E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.604 | TFLOPs: 17.76 | 31: iteration 101890/ 173500 | consumed samples: 26083840 | consumed tokens: 53419704320 | elapsed time per iteration (s): 0.81 | learning rate: 8.677E-05 | global batch size: 256 | lm loss: 1.967945E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.432 | TFLOPs: 19.14 | 31: iteration 101900/ 173500 | consumed samples: 26086400 | consumed tokens: 53424947200 | elapsed time per iteration (s): 0.84 | learning rate: 8.676E-05 | global batch size: 
256 | lm loss: 1.986696E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.884 | TFLOPs: 18.51 | 31: iteration 101910/ 173500 | consumed samples: 26088960 | consumed tokens: 53430190080 | elapsed time per iteration (s): 0.83 | learning rate: 8.674E-05 | global batch size: 256 | lm loss: 1.950664E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.681 | TFLOPs: 18.61 | 31: iteration 101920/ 173500 | consumed samples: 26091520 | consumed tokens: 53435432960 | elapsed time per iteration (s): 0.79 | learning rate: 8.672E-05 | global batch size: 256 | lm loss: 1.964176E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.767 | TFLOPs: 19.59 | 31: iteration 101930/ 173500 | consumed samples: 26094080 | consumed tokens: 53440675840 | elapsed time per iteration (s): 0.85 | learning rate: 8.671E-05 | global batch size: 256 | lm loss: 1.957164E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.304 | TFLOPs: 18.29 | 31: iteration 101940/ 173500 | consumed samples: 26096640 | consumed tokens: 53445918720 | elapsed time per iteration (s): 0.79 | learning rate: 8.669E-05 | global batch size: 256 | lm loss: 1.963458E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.217 | TFLOPs: 19.49 | 31: iteration 101950/ 173500 | consumed samples: 26099200 | consumed tokens: 53451161600 | elapsed time per iteration (s): 0.82 | learning rate: 8.668E-05 | global batch size: 256 | lm loss: 1.968628E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.718 | TFLOPs: 18.92 | 31: iteration 101960/ 173500 | consumed samples: 26101760 | 
consumed tokens: 53456404480 | elapsed time per iteration (s): 0.79 | learning rate: 8.666E-05 | global batch size: 256 | lm loss: 1.958669E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.098 | TFLOPs: 19.49 | 31: iteration 101970/ 173500 | consumed samples: 26104320 | consumed tokens: 53461647360 | elapsed time per iteration (s): 0.79 | learning rate: 8.665E-05 | global batch size: 256 | lm loss: 1.977970E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.829 | TFLOPs: 19.65 | 31: iteration 101980/ 173500 | consumed samples: 26106880 | consumed tokens: 53466890240 | elapsed time per iteration (s): 0.90 | learning rate: 8.663E-05 | global batch size: 256 | lm loss: 1.996393E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.907 | TFLOPs: 17.30 | 31: iteration 101990/ 173500 | consumed samples: 26109440 | consumed tokens: 53472133120 | elapsed time per iteration (s): 0.87 | learning rate: 8.661E-05 | global batch size: 256 | lm loss: 2.002344E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.173 | TFLOPs: 17.74 | 0: [2022-11-26 17:09:03,853] [INFO] [logging.py:68:log_dist] [Rank 0] step=102000, skipped=0, lr=[8.659751165175261e-05, 8.659751165175261e-05, 8.659751165175261e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 102000/ 173500 | consumed samples: 26112000 | consumed tokens: 53477376000 | elapsed time per iteration (s): 0.82 | learning rate: 8.660E-05 | global batch size: 256 | lm loss: 1.961108E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.221 | TFLOPs: 18.95 | 0: steps: 102000 loss: 1.9486 iter time (s): 0.794 samples/sec: 322.523 31: 
-------------------------------------------------------------------------------------------- 31: valid loss at iteration 102000 | lm loss value: 1.849083E+00 | lm loss PPL: 6.353988E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 102000 to checkpoints_1b1long 0: [2022-11-26 17:09:04,123] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step102000 is begin to save! 0: [2022-11-26 17:09:04,134] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_01-model_00-model_states.pt... 0: [2022-11-26 17:09:04,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_01-model_00-model_states.pt. 0: [2022-11-26 17:09:04,356] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_03-model_00-model_states.pt... 0: [2022-11-26 17:09:04,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_03-model_00-model_states.pt. 0: [2022-11-26 17:09:04,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_04-model_00-model_states.pt... 0: [2022-11-26 17:09:04,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_04-model_00-model_states.pt. 0: [2022-11-26 17:09:04,524] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_05-model_00-model_states.pt... 0: [2022-11-26 17:09:04,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_05-model_00-model_states.pt. 0: [2022-11-26 17:09:04,608] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_06-model_00-model_states.pt... 
0: [2022-11-26 17:09:04,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_06-model_00-model_states.pt. 0: [2022-11-26 17:09:04,686] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_07-model_00-model_states.pt... 0: [2022-11-26 17:09:04,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_07-model_00-model_states.pt. 0: [2022-11-26 17:09:04,762] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_08-model_00-model_states.pt... 0: [2022-11-26 17:09:04,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_08-model_00-model_states.pt. 0: [2022-11-26 17:09:04,841] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_09-model_00-model_states.pt... 0: [2022-11-26 17:09:04,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_09-model_00-model_states.pt. 0: [2022-11-26 17:09:04,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_10-model_00-model_states.pt... 0: [2022-11-26 17:09:04,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_10-model_00-model_states.pt. 0: [2022-11-26 17:09:04,991] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_11-model_00-model_states.pt... 0: [2022-11-26 17:09:05,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_11-model_00-model_states.pt. 0: [2022-11-26 17:09:05,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_12-model_00-model_states.pt... 
0: [2022-11-26 17:09:05,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_12-model_00-model_states.pt. 0: [2022-11-26 17:09:05,144] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_13-model_00-model_states.pt... 0: [2022-11-26 17:09:05,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_13-model_00-model_states.pt. 0: [2022-11-26 17:09:05,219] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_14-model_00-model_states.pt... 0: [2022-11-26 17:09:05,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_14-model_00-model_states.pt. 0: [2022-11-26 17:09:05,295] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_15-model_00-model_states.pt... 0: [2022-11-26 17:09:05,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_15-model_00-model_states.pt. 0: [2022-11-26 17:09:05,371] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_16-model_00-model_states.pt... 0: [2022-11-26 17:09:05,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_16-model_00-model_states.pt. 0: [2022-11-26 17:09:05,446] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_17-model_00-model_states.pt... 0: [2022-11-26 17:09:05,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_17-model_00-model_states.pt. 0: [2022-11-26 17:09:05,523] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_18-model_00-model_states.pt... 
0: [2022-11-26 17:09:05,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_18-model_00-model_states.pt. 0: [2022-11-26 17:09:05,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_19-model_00-model_states.pt... 0: [2022-11-26 17:09:05,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_19-model_00-model_states.pt. 0: [2022-11-26 17:09:05,675] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_20-model_00-model_states.pt... 0: [2022-11-26 17:09:05,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_20-model_00-model_states.pt. 0: [2022-11-26 17:09:05,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_21-model_00-model_states.pt... 0: [2022-11-26 17:09:05,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_21-model_00-model_states.pt. 0: [2022-11-26 17:09:05,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_22-model_00-model_states.pt... 0: [2022-11-26 17:09:05,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_22-model_00-model_states.pt. 0: [2022-11-26 17:09:05,906] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_23-model_00-model_states.pt... 0: [2022-11-26 17:09:05,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_23-model_00-model_states.pt. 0: [2022-11-26 17:09:05,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_24-model_00-model_states.pt... 
0: [2022-11-26 17:09:06,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_24-model_00-model_states.pt. 0: [2022-11-26 17:09:06,055] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_25-model_00-model_states.pt... 0: [2022-11-26 17:09:06,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_25-model_00-model_states.pt. 0: [2022-11-26 17:09:06,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_26-model_00-model_states.pt... 0: [2022-11-26 17:09:06,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_26-model_00-model_states.pt. 0: [2022-11-26 17:09:06,211] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_27-model_00-model_states.pt... 0: [2022-11-26 17:09:06,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_27-model_00-model_states.pt. 0: [2022-11-26 17:09:06,287] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_28-model_00-model_states.pt... 0: [2022-11-26 17:09:06,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_28-model_00-model_states.pt. 0: [2022-11-26 17:09:06,360] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/layer_30-model_00-model_states.pt... 0: [2022-11-26 17:09:06,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/layer_30-model_00-model_states.pt. 
0: [2022-11-26 17:09:06,364] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step102000/mp_rank_00_model_states.pt 0: [2022-11-26 17:09:06,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/mp_rank_00_model_states.pt... 0: [2022-11-26 17:09:06,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/mp_rank_00_model_states.pt. 0: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
7: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
1: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 
13: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 
20: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 
23: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
14: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 
17: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 
0: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
9: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
13: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 
20: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 
24: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 
22: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 
21: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 
0: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 
1: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
25: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 
22: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 
26: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
8: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
31: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
4: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
11: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:09:06,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:09:06,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:09:06,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 17:09:06,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 24: [2022-11-26 17:09:06,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
24: [2022-11-26 17:09:06,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 17:09:06,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 1: [2022-11-26 17:09:06,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:09:06,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 17:09:06,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 12: [2022-11-26 17:09:06,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:09:06,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 17:09:06,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 11: [2022-11-26 17:09:06,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:09:06,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 17:09:06,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 18: [2022-11-26 17:09:06,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
18: [2022-11-26 17:09:06,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 17:09:06,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 19: [2022-11-26 17:09:06,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 17:09:06,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 17:09:06,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 16: [2022-11-26 17:09:06,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 17:09:06,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 17:09:06,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 23: [2022-11-26 17:09:06,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:09:06,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
23: [2022-11-26 17:09:06,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 17: [2022-11-26 17:09:06,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:09:06,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 6: [2022-11-26 17:09:06,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:09:06,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 17:09:06,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 17:09:06,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 17: [2022-11-26 17:09:06,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 8: [2022-11-26 17:09:06,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 17: [2022-11-26 17:09:06,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 6: [2022-11-26 17:09:06,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
8: [2022-11-26 17:09:06,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 17:09:06,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 7: [2022-11-26 17:09:06,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:09:06,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 17:09:06,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 31: [2022-11-26 17:09:06,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 17:09:06,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 17:09:06,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 20: [2022-11-26 17:09:06,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 17:09:06,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 17:09:06,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 20: [2022-11-26 17:09:06,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
25: [2022-11-26 17:09:06,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 20: [2022-11-26 17:09:06,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 25: [2022-11-26 17:09:06,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 20: [2022-11-26 17:09:06,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 25: [2022-11-26 17:09:06,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 10: [2022-11-26 17:09:06,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:09:06,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 17:09:06,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 13: [2022-11-26 17:09:06,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 24: [2022-11-26 17:09:06,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:09:06,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 17:09:06,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
24: [2022-11-26 17:09:06,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 17:09:06,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 25: [2022-11-26 17:09:06,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:09:06,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:09:06,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 25: [2022-11-26 17:09:06,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 7: [2022-11-26 17:09:06,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 4: [2022-11-26 17:09:06,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:09:06,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 7: [2022-11-26 17:09:06,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
4: [2022-11-26 17:09:06,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 2: [2022-11-26 17:09:06,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 3: [2022-11-26 17:09:06,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 25: [2022-11-26 17:09:06,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 4: [2022-11-26 17:09:06,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 3: [2022-11-26 17:09:06,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 17:09:06,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 4: [2022-11-26 17:09:06,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:09:06,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 17:09:06,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 3: [2022-11-26 17:09:06,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 17:09:06,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 17:09:06,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 18: [2022-11-26 17:09:06,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:09:06,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 17:09:06,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 12: [2022-11-26 17:09:06,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:09:06,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 17:09:06,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 17: [2022-11-26 17:09:06,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:09:06,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
17: [2022-11-26 17:09:06,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 14: [2022-11-26 17:09:06,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 17:09:06,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 26: [2022-11-26 17:09:06,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:09:06,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 26: [2022-11-26 17:09:06,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 17:09:06,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 10: [2022-11-26 17:09:06,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:09:06,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
27: [2022-11-26 17:09:06,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 10: [2022-11-26 17:09:06,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 1: [2022-11-26 17:09:06,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 16: [2022-11-26 17:09:06,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:09:06,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 1: [2022-11-26 17:09:06,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 16: [2022-11-26 17:09:06,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 10: [2022-11-26 17:09:06,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 16: [2022-11-26 17:09:06,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 29: [2022-11-26 17:09:06,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:09:06,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 17:09:06,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 17:09:06,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 13: [2022-11-26 17:09:06,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:09:06,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 17:09:06,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 4: [2022-11-26 17:09:06,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:09:06,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 17:09:06,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 27: [2022-11-26 17:09:06,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:09:06,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 14: [2022-11-26 17:09:06,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
29: [2022-11-26 17:09:06,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 27: [2022-11-26 17:09:06,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 14: [2022-11-26 17:09:06,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 29: [2022-11-26 17:09:06,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 14: [2022-11-26 17:09:06,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 23: [2022-11-26 17:09:06,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:09:06,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 17:09:06,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 19: [2022-11-26 17:09:06,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 17:09:06,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 17:09:06,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 20: [2022-11-26 17:09:06,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 
20: [2022-11-26 17:09:06,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 17:09:06,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 8: [2022-11-26 17:09:06,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:09:06,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 17:09:06,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 29: [2022-11-26 17:09:06,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:09:06,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 29: [2022-11-26 17:09:06,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 5: [2022-11-26 17:09:06,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:09:06,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 29: [2022-11-26 17:09:06,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
5: [2022-11-26 17:09:06,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 17:09:06,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 27: [2022-11-26 17:09:06,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 5: [2022-11-26 17:09:06,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 5: [2022-11-26 17:09:06,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 27: [2022-11-26 17:09:06,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 0: [2022-11-26 17:09:06,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:09:06,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:09:06,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:09:06,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 0: [2022-11-26 17:09:06,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 18: [2022-11-26 17:09:06,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
0: [2022-11-26 17:09:06,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 30: [2022-11-26 17:09:06,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 17:09:06,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 25: [2022-11-26 17:09:06,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:09:06,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:09:06,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 25: [2022-11-26 17:09:06,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 6: [2022-11-26 17:09:06,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 17:09:06,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 25: [2022-11-26 17:09:06,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 11: [2022-11-26 17:09:06,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
24: [2022-11-26 17:09:06,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:09:06,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 6: [2022-11-26 17:09:06,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 11: [2022-11-26 17:09:06,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 24: [2022-11-26 17:09:06,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 11: [2022-11-26 17:09:06,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 24: [2022-11-26 17:09:06,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 11: [2022-11-26 17:09:06,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:09:06,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 17:09:06,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 14: [2022-11-26 17:09:06,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-26 17:09:06,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 17:09:06,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 13: [2022-11-26 17:09:06,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:09:06,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 17:09:06,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 26: [2022-11-26 17:09:06,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:09:06,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 26: [2022-11-26 17:09:06,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 5: [2022-11-26 17:09:06,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 26: [2022-11-26 17:09:06,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 5: [2022-11-26 17:09:06,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 29: [2022-11-26 17:09:06,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
29: [2022-11-26 17:09:06,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 17:09:06,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 26: [2022-11-26 17:09:06,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 17:09:06,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 9: [2022-11-26 17:09:06,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:09:06,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 28: [2022-11-26 17:09:06,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 26: [2022-11-26 17:09:06,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 9: [2022-11-26 17:09:06,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 16: [2022-11-26 17:09:06,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 17:09:06,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 
16: [2022-11-26 17:09:06,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 17:09:06,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 9: [2022-11-26 17:09:06,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:09:06,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 16: [2022-11-26 17:09:06,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 16: [2022-11-26 17:09:06,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 9: [2022-11-26 17:09:06,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 17:09:06,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 31: [2022-11-26 17:09:06,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:09:06,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
31: [2022-11-26 17:09:06,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 30: [2022-11-26 17:09:06,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:09:06,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 9: [2022-11-26 17:09:06,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 31: [2022-11-26 17:09:06,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 30: [2022-11-26 17:09:06,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 1: [2022-11-26 17:09:06,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 30: [2022-11-26 17:09:06,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 1: [2022-11-26 17:09:06,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 2: [2022-11-26 17:09:06,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:09:06,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 17:09:06,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
10: [2022-11-26 17:09:06,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:09:06,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 0: [2022-11-26 17:09:06,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:09:06,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 0: [2022-11-26 17:09:06,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 2: [2022-11-26 17:09:06,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:09:06,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 2: [2022-11-26 17:09:06,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 17:09:06,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 14: [2022-11-26 17:09:06,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 17:09:06,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 26: [2022-11-26 17:09:06,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:09:06,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 26: [2022-11-26 17:09:06,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 17:09:06,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 24: [2022-11-26 17:09:06,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 17:09:06,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 17: [2022-11-26 17:09:06,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 24: [2022-11-26 17:09:06,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 17: [2022-11-26 17:09:06,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 17:09:06,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 17:09:06,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
8: [2022-11-26 17:09:06,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:09:06,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 7: [2022-11-26 17:09:06,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:09:06,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 8: [2022-11-26 17:09:06,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 17: [2022-11-26 17:09:06,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:09:06,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 17: [2022-11-26 17:09:06,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 17:09:06,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 7: [2022-11-26 17:09:06,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
28: [2022-11-26 17:09:06,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 7: [2022-11-26 17:09:06,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 28: [2022-11-26 17:09:06,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 28: [2022-11-26 17:09:06,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:09:06,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 28: [2022-11-26 17:09:06,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 17:09:06,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 19: [2022-11-26 17:09:06,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 17:09:06,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 17:09:06,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 5: [2022-11-26 17:09:06,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 17:09:06,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 17:09:06,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 21: [2022-11-26 17:09:06,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:09:06,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 4: [2022-11-26 17:09:06,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:09:06,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 4: [2022-11-26 17:09:06,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 17:09:06,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 29: [2022-11-26 17:09:06,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 17:09:06,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 12: [2022-11-26 17:09:06,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
20: [2022-11-26 17:09:06,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 29: [2022-11-26 17:09:06,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 12: [2022-11-26 17:09:06,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 20: [2022-11-26 17:09:06,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 12: [2022-11-26 17:09:06,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 20: [2022-11-26 17:09:06,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 10: [2022-11-26 17:09:06,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:09:06,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 2: [2022-11-26 17:09:06,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:09:06,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 2: [2022-11-26 17:09:06,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 27: [2022-11-26 17:09:06,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
27: [2022-11-26 17:09:06,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 2: [2022-11-26 17:09:06,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 27: [2022-11-26 17:09:06,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 30: [2022-11-26 17:09:06,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:09:06,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 17:09:06,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 23: [2022-11-26 17:09:06,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:09:06,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:09:06,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 23: [2022-11-26 17:09:06,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 1: [2022-11-26 17:09:06,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 23: [2022-11-26 17:09:06,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
31: [2022-11-26 17:09:06,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 17:09:06,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 28: [2022-11-26 17:09:06,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 31: [2022-11-26 17:09:06,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 25: [2022-11-26 17:09:06,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 17:09:06,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 28: [2022-11-26 17:09:06,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 25: [2022-11-26 17:09:06,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 0: [2022-11-26 17:09:06,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:09:06,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 17:09:06,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 17:09:06,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 17:09:06,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 0: [2022-11-26 17:09:06,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 13: [2022-11-26 17:09:06,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:09:06,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 17:09:06,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 21: [2022-11-26 17:09:06,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:09:06,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 17:09:06,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 12: [2022-11-26 17:09:06,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 17:09:06,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 17:09:06,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 28: [2022-11-26 17:09:06,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 3: [2022-11-26 17:09:06,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:09:06,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 17:09:06,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 21: [2022-11-26 17:09:06,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:09:06,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 17:09:06,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 8: [2022-11-26 17:09:06,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:09:06,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 17:09:06,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
18: [2022-11-26 17:09:06,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:09:06,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 17:09:06,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 19: [2022-11-26 17:09:06,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 17:09:06,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 17:09:06,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 10: [2022-11-26 17:09:06,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:09:06,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 17:09:06,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 9: [2022-11-26 17:09:06,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:09:06,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 17:09:06,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
22: [2022-11-26 17:09:06,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:09:06,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:09:06,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:09:06,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 17:09:06,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 17:09:06,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:09:06,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 17:09:06,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 22: [2022-11-26 17:09:06,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 22: [2022-11-26 17:09:06,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 22: [2022-11-26 17:09:06,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 17:09:06,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
31: [2022-11-26 17:09:06,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 17:09:06,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 17:09:06,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 20: [2022-11-26 17:09:06,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 17:09:06,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 17:09:06,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 19: [2022-11-26 17:09:06,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 17:09:06,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 17:09:06,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 23: [2022-11-26 17:09:06,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:09:06,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 17:09:06,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
15: [2022-11-26 17:09:06,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:09:06,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:09:06,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:09:06,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:09:06,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 17:09:06,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 17:09:06,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 17:09:06,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 17:09:06,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 15: [2022-11-26 17:09:06,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 15: [2022-11-26 17:09:06,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 15: [2022-11-26 17:09:06,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
25: [2022-11-26 17:09:06,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 17:09:06,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 17:09:06,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 11: [2022-11-26 17:09:06,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:09:06,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 17:09:06,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 6: [2022-11-26 17:09:06,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:09:06,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 17:09:06,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 16: [2022-11-26 17:09:06,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 17:09:06,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 17:09:06,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
24: [2022-11-26 17:09:06,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 17:09:06,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 17:09:06,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 5: [2022-11-26 17:09:06,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:09:06,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 17:09:06,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 17: [2022-11-26 17:09:06,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 17:09:06,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 17:09:06,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 23: [2022-11-26 17:09:06,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:09:06,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 17:09:06,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
3: [2022-11-26 17:09:06,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:09:06,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:09:06,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 13: [2022-11-26 17:09:06,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 3: [2022-11-26 17:09:06,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 13: [2022-11-26 17:09:06,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 26: [2022-11-26 17:09:06,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 17:09:06,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 17:09:06,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 14: [2022-11-26 17:09:06,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:09:06,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 17:09:06,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
27: [2022-11-26 17:09:06,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:09:06,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 17:09:06,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 1: [2022-11-26 17:09:06,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:09:06,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 17:09:06,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 0: [2022-11-26 17:09:06,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:09:06,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:09:06,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 17:09:06,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 28: [2022-11-26 17:09:06,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:09:06,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 17:09:06,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 17:09:06,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 7: [2022-11-26 17:09:06,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:09:06,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:09:06,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 18: [2022-11-26 17:09:06,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 7: [2022-11-26 17:09:06,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 18: [2022-11-26 17:09:06,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 31: [2022-11-26 17:09:06,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 17:09:06,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 17:09:06,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 8: [2022-11-26 17:09:06,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
28: [2022-11-26 17:09:06,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 8: [2022-11-26 17:09:06,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 28: [2022-11-26 17:09:06,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 8: [2022-11-26 17:09:06,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 2: [2022-11-26 17:09:06,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:09:06,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 17:09:06,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 22: [2022-11-26 17:09:06,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:09:06,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 17:09:06,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 21: [2022-11-26 17:09:06,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-26 17:09:06,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 17:09:06,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 0: [2022-11-26 17:09:06,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 17:09:06,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 4: [2022-11-26 17:09:06,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:09:06,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 17:09:06,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 6: [2022-11-26 17:09:06,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:09:06,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 17:09:06,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 19: [2022-11-26 17:09:06,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-26 17:09:06,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 17:09:06,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 12: [2022-11-26 17:09:06,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:09:06,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 17:09:06,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 30: [2022-11-26 17:09:06,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:09:06,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 17:09:06,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 16: [2022-11-26 17:09:06,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 17:09:06,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 17:09:06,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 5: [2022-11-26 17:09:06,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-26 17:09:06,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 17:09:06,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 20: [2022-11-26 17:09:06,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 17:09:06,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 17:09:06,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 29: [2022-11-26 17:09:06,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 17:09:06,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 17:09:06,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 10: [2022-11-26 17:09:06,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:09:06,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 17:09:06,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 24: [2022-11-26 17:09:06,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-26 17:09:06,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 17:09:06,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 13: [2022-11-26 17:09:06,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:09:06,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 17:09:06,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 11: [2022-11-26 17:09:06,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:09:06,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 17:09:06,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 3: [2022-11-26 17:09:06,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:09:06,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 17:09:06,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 14: [2022-11-26 17:09:06,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-26 17:09:06,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 17:09:06,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 25: [2022-11-26 17:09:06,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 17:09:06,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 17:09:06,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 23: [2022-11-26 17:09:06,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:09:06,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 17:09:06,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 26: [2022-11-26 17:09:06,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 17:09:06,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 17:09:06,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 27: [2022-11-26 17:09:06,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
27: [2022-11-26 17:09:06,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 17:09:06,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 17: [2022-11-26 17:09:06,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 17:09:06,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 17:09:06,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 1: [2022-11-26 17:09:06,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:09:06,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 17:09:06,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 7: [2022-11-26 17:09:06,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:09:06,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 15: [2022-11-26 17:09:06,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:09:06,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
15: [2022-11-26 17:09:06,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 17:09:06,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 9: [2022-11-26 17:09:06,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:09:06,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 17:09:06,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 31: [2022-11-26 17:09:06,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 17:09:06,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 17:09:06,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 2: [2022-11-26 17:09:06,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:09:06,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 17:09:06,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 8: [2022-11-26 17:09:06,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-26 17:09:06,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 17:09:06,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 12: [2022-11-26 17:09:06,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:09:06,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 17:09:06,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 28: [2022-11-26 17:09:06,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 17:09:06,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 17:09:06,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 22: [2022-11-26 17:09:06,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:09:06,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 17:09:06,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 18: [2022-11-26 17:09:06,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
18: [2022-11-26 17:09:06,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 17:09:06,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 21: [2022-11-26 17:09:06,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:09:06,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 17:09:06,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 4: [2022-11-26 17:09:06,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:09:06,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 17:09:06,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 19: [2022-11-26 17:09:06,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 17:09:06,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 17:09:06,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 20: [2022-11-26 17:09:06,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-26 17:09:06,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 17:09:06,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 0: [2022-11-26 17:09:06,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:09:06,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 17:09:06,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 5: [2022-11-26 17:09:06,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:09:06,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 17:09:06,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 30: [2022-11-26 17:09:06,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:09:06,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 17:09:06,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 6: [2022-11-26 17:09:06,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 17:09:06,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 17:09:06,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 10: [2022-11-26 17:09:06,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:09:06,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 17:09:06,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 24: [2022-11-26 17:09:06,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 17:09:06,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 17:09:06,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 11: [2022-11-26 17:09:06,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:09:06,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 17:09:06,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 25: [2022-11-26 17:09:06,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 
29: [2022-11-26 17:09:06,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 17:09:06,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 17:09:06,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 25: [2022-11-26 17:09:06,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 17:09:06,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 1: [2022-11-26 17:09:06,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:09:06,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 17:09:06,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 16: [2022-11-26 17:09:06,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 17:09:06,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 17:09:06,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 27: [2022-11-26 17:09:06,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
27: [2022-11-26 17:09:06,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 17:09:06,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 14: [2022-11-26 17:09:06,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:09:06,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 17:09:06,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 9: [2022-11-26 17:09:06,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:09:06,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 17:09:06,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 23: [2022-11-26 17:09:06,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:09:06,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 17:09:06,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 17: [2022-11-26 17:09:06,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 
17: [2022-11-26 17:09:06,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 17:09:06,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 18: [2022-11-26 17:09:06,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 26: [2022-11-26 17:09:06,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:09:06,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:09:06,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 26: [2022-11-26 17:09:06,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 4: [2022-11-26 17:09:06,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:09:06,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 18: [2022-11-26 17:09:06,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 15: [2022-11-26 17:09:06,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 26: [2022-11-26 17:09:06,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
4: [2022-11-26 17:09:06,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 17:09:06,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 3: [2022-11-26 17:09:06,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:09:06,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 17:09:06,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 13: [2022-11-26 17:09:06,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:09:06,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 17:09:06,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 12: [2022-11-26 17:09:06,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:09:06,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 17:09:06,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 7: [2022-11-26 17:09:06,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 17:09:06,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 17:09:06,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 20: [2022-11-26 17:09:06,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 17:09:06,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 17:09:06,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 31: [2022-11-26 17:09:06,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 17:09:06,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 17:09:06,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 0: [2022-11-26 17:09:06,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:09:06,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:09:06,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 17:09:06,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
8: [2022-11-26 17:09:06,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 17:09:06,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 2: [2022-11-26 17:09:06,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:09:06,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 17:09:06,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 22: [2022-11-26 17:09:06,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:09:06,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 17:09:06,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 19: [2022-11-26 17:09:06,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 17:09:06,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 17:09:06,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 5: [2022-11-26 17:09:06,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 17:09:06,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 17:09:06,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 27: [2022-11-26 17:09:06,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:09:06,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 17:09:06,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 28: [2022-11-26 17:09:06,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:09:06,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:09:06,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 14: [2022-11-26 17:09:06,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:09:06,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 14: [2022-11-26 17:09:06,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 17:09:06,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
17: [2022-11-26 17:09:06,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 17:09:06,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 3: [2022-11-26 17:09:06,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:09:06,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 17:09:06,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 1: [2022-11-26 17:09:06,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 28: [2022-11-26 17:09:06,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 17: [2022-11-26 17:09:06,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 28: [2022-11-26 17:09:06,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 1: [2022-11-26 17:09:06,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 17:09:06,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 6: [2022-11-26 17:09:06,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 17:09:06,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 17:09:06,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 21: [2022-11-26 17:09:06,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:09:06,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 17:09:06,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 16: [2022-11-26 17:09:06,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 17:09:06,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 17:09:06,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 15: [2022-11-26 17:09:06,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:09:06,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:09:06,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 17:09:06,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
13: [2022-11-26 17:09:06,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 17:09:06,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 24: [2022-11-26 17:09:06,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 17:09:06,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 17:09:06,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 8: [2022-11-26 17:09:06,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:09:06,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:09:06,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 17:09:06,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 7: [2022-11-26 17:09:06,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 17:09:06,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 12: [2022-11-26 17:09:06,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
0: [2022-11-26 17:09:06,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:09:06,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 0: [2022-11-26 17:09:06,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 12: [2022-11-26 17:09:06,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 0: [2022-11-26 17:09:06,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 18: [2022-11-26 17:09:06,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:09:06,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 17:09:06,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 2: [2022-11-26 17:09:06,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:09:06,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 21: [2022-11-26 17:09:06,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
31: [2022-11-26 17:09:06,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:09:06,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 31: [2022-11-26 17:09:06,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 21: [2022-11-26 17:09:06,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 17:09:06,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 31: [2022-11-26 17:09:06,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 23: [2022-11-26 17:09:06,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:09:06,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 17:09:06,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 22: [2022-11-26 17:09:06,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:09:06,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 17:09:06,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
11: [2022-11-26 17:09:06,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:09:06,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 17:09:06,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 26: [2022-11-26 17:09:06,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 17:09:06,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 17:09:06,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 29: [2022-11-26 17:09:06,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 17:09:06,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 17:09:06,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 17:09:06,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 17:09:06,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 29: [2022-11-26 17:09:06,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
25: [2022-11-26 17:09:06,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 28: [2022-11-26 17:09:06,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 25: [2022-11-26 17:09:06,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 17:09:06,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 21: [2022-11-26 17:09:06,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:09:06,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 17:09:06,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 30: [2022-11-26 17:09:06,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:09:06,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 17:09:06,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 30: [2022-11-26 17:09:06,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-26 17:09:06,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 17:09:06,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 28: [2022-11-26 17:09:06,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 17:09:06,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 4: [2022-11-26 17:09:06,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:09:06,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 17:09:06,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 28: [2022-11-26 17:09:06,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 17:09:06,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 17:09:06,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 10: [2022-11-26 17:09:06,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-26 17:09:06,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt
10: [2022-11-26 17:09:06,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now!
9: [2022-11-26 17:09:06,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt.
9: [2022-11-26 17:09:06,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step102000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt
9: [2022-11-26 17:09:06,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now!
0: successfully saved checkpoint at iteration 102000 to checkpoints_1b1long
31: time (ms) | save-checkpoint: 2573.70
31: iteration 102010/ 173500 | consumed samples: 26114560 | consumed tokens: 53482618880 | elapsed time per iteration (s): 1.09 | learning rate: 8.658E-05 | global batch size: 256 | lm loss: 1.975584E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.719 | TFLOPs: 14.20 |
31: iteration 102020/ 173500 | consumed samples: 26117120 | consumed tokens: 53487861760 | elapsed time per iteration (s): 0.77 | learning rate: 8.657E-05 | global batch size: 256 | lm loss: 1.957259E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.375 | TFLOPs: 20.23 |
31: iteration 102030/ 173500 | consumed samples: 26119680 | consumed tokens: 53493104640 | elapsed time per iteration (s): 0.79 | learning rate: 8.655E-05 | global batch size: 256 | lm loss: 1.957100E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.047 | TFLOPs: 19.48 |
31: iteration 102040/ 173500 | consumed samples: 26122240 | consumed tokens: 53498347520 | elapsed time per iteration (s): 0.78 | learning rate: 8.653E-05 | global batch size: 256 | lm loss: 1.941517E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.115 | TFLOPs: 19.79 |
31: iteration 102050/ 173500 | consumed samples: 26124800 | consumed tokens: 53503590400 | elapsed time per iteration (s): 0.86 | learning rate: 8.652E-05 | global batch size: 256 | lm loss: 1.966186E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.933 | TFLOPs: 18.08 |
31: iteration 102060/ 173500 | consumed samples: 26127360 | consumed tokens: 53508833280 | elapsed time per iteration (s): 0.77 | learning rate: 8.650E-05 | global batch size: 256 | lm loss: 1.996973E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.464 | TFLOPs: 20.05 |
31: iteration 102070/ 173500 | consumed samples: 26129920 | consumed tokens: 53514076160 | elapsed time per iteration (s): 0.74 | learning rate: 8.649E-05 | global batch size: 256 | lm loss: 1.962541E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.968 | TFLOPs: 20.99 |
31: iteration 102080/ 173500 | consumed samples: 26132480 | consumed tokens: 53519319040 | elapsed time per iteration (s): 0.83 | learning rate: 8.647E-05 | global batch size: 256 | lm loss: 1.968047E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.739 | TFLOPs: 18.74 |
31: iteration 102090/ 173500 | consumed samples: 26135040 | consumed tokens: 53524561920 | elapsed time per iteration (s): 0.85 | learning rate: 8.645E-05 | global batch size: 256 | lm loss: 2.016659E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.121 | TFLOPs: 18.22 |
31: iteration 102100/ 173500 | consumed samples: 26137600 | consumed tokens: 53529804800 | elapsed time per iteration (s): 0.76 | learning rate: 8.644E-05 | global batch size: 256 | lm loss: 1.982111E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.158 | TFLOPs: 20.40 |
31: iteration 102110/ 173500 | consumed samples: 26140160 | consumed tokens: 53535047680 | elapsed time per iteration (s): 0.78 | learning rate: 8.642E-05 | global batch size: 256 | lm loss: 1.973202E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.322 | TFLOPs: 19.92 |
31: iteration 102120/ 173500 | consumed samples: 26142720 | consumed tokens: 53540290560 | elapsed time per iteration (s): 0.81 | learning rate: 8.641E-05 | global batch size: 256 | lm loss: 1.937684E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.084 | TFLOPs: 19.06 |
31: iteration 102130/ 173500 | consumed samples: 26145280 | consumed tokens: 53545533440 | elapsed time per iteration (s): 0.74 | learning rate: 8.639E-05 | global batch size: 256 | lm loss: 1.971556E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.198 | TFLOPs: 20.82 |
31: iteration 102140/ 173500 | consumed samples: 26147840 | consumed tokens: 53550776320 | elapsed time per iteration (s): 0.78 | learning rate: 8.638E-05 | global batch size: 256 | lm loss: 1.977517E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.040 | TFLOPs: 19.85 |
31: iteration 102150/ 173500 | consumed samples: 26150400 | consumed tokens: 53556019200 | elapsed time per iteration (s): 0.77 | learning rate: 8.636E-05 | global batch size: 256 | lm loss: 1.961792E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.856 | TFLOPs: 20.20 |
31: iteration 102160/ 173500 | consumed samples: 26152960 | consumed tokens: 53561262080 | elapsed time per iteration (s): 0.78 | learning rate: 8.634E-05 | global batch size: 256 | lm loss: 1.955828E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.759 | TFLOPs: 19.83 |
31: iteration 102170/ 173500 | consumed samples: 26155520 | consumed tokens: 53566504960 | elapsed time per iteration (s): 0.78 | learning rate: 8.633E-05 | global batch size: 256 | lm loss: 1.964645E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.843 | TFLOPs: 19.95 |
31: iteration 102180/ 173500 | consumed samples: 26158080 | consumed tokens: 53571747840 | elapsed time per iteration (s): 0.81 | learning rate: 8.631E-05 | global batch size: 256 | lm loss: 1.940597E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.427 | TFLOPs: 19.20 |
31: iteration 102190/ 173500 | consumed samples: 26160640 | consumed tokens: 53576990720 | elapsed time per iteration (s): 0.78 | learning rate: 8.630E-05 | global batch size: 256 | lm loss: 1.944130E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.013 | TFLOPs: 19.78 |
31: iteration 102200/ 173500 | consumed samples: 26163200 | consumed tokens: 53582233600 | elapsed time per iteration (s): 0.75 | learning rate: 8.628E-05 | global batch size: 256 | lm loss: 1.971151E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.295 | TFLOPs: 20.71 |
31: iteration 102210/ 173500 | 
consumed samples: 26165760 | consumed tokens: 53587476480 | elapsed time per iteration (s): 0.72 | learning rate: 8.626E-05 | global batch size: 256 | lm loss: 1.959801E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.253 | TFLOPs: 21.49 | 31: iteration 102220/ 173500 | consumed samples: 26168320 | consumed tokens: 53592719360 | elapsed time per iteration (s): 0.75 | learning rate: 8.625E-05 | global batch size: 256 | lm loss: 1.954762E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.929 | TFLOPs: 20.69 | 31: iteration 102230/ 173500 | consumed samples: 26170880 | consumed tokens: 53597962240 | elapsed time per iteration (s): 0.75 | learning rate: 8.623E-05 | global batch size: 256 | lm loss: 1.951879E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.605 | TFLOPs: 20.67 | 31: iteration 102240/ 173500 | consumed samples: 26173440 | consumed tokens: 53603205120 | elapsed time per iteration (s): 0.79 | learning rate: 8.622E-05 | global batch size: 256 | lm loss: 1.975004E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.108 | TFLOPs: 19.73 | 31: iteration 102250/ 173500 | consumed samples: 26176000 | consumed tokens: 53608448000 | elapsed time per iteration (s): 0.85 | learning rate: 8.620E-05 | global batch size: 256 | lm loss: 1.968919E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.701 | TFLOPs: 18.31 | 31: iteration 102260/ 173500 | consumed samples: 26178560 | consumed tokens: 53613690880 | elapsed time per iteration (s): 0.77 | learning rate: 8.618E-05 | global batch size: 256 | lm loss: 1.981900E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 331.580 | TFLOPs: 20.06 | 31: iteration 102270/ 173500 | consumed samples: 26181120 | consumed tokens: 53618933760 | elapsed time per iteration (s): 0.74 | learning rate: 8.617E-05 | global batch size: 256 | lm loss: 1.966208E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.111 | TFLOPs: 20.82 | 31: iteration 102280/ 173500 | consumed samples: 26183680 | consumed tokens: 53624176640 | elapsed time per iteration (s): 0.85 | learning rate: 8.615E-05 | global batch size: 256 | lm loss: 1.985822E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.756 | TFLOPs: 18.13 | 31: iteration 102290/ 173500 | consumed samples: 26186240 | consumed tokens: 53629419520 | elapsed time per iteration (s): 0.73 | learning rate: 8.614E-05 | global batch size: 256 | lm loss: 1.963442E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.847 | TFLOPs: 21.10 | 31: iteration 102300/ 173500 | consumed samples: 26188800 | consumed tokens: 53634662400 | elapsed time per iteration (s): 0.77 | learning rate: 8.612E-05 | global batch size: 256 | lm loss: 1.977506E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.436 | TFLOPs: 20.11 | 31: iteration 102310/ 173500 | consumed samples: 26191360 | consumed tokens: 53639905280 | elapsed time per iteration (s): 0.78 | learning rate: 8.611E-05 | global batch size: 256 | lm loss: 1.956208E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.836 | TFLOPs: 19.89 | 31: iteration 102320/ 173500 | consumed samples: 26193920 | consumed tokens: 53645148160 | elapsed time per iteration (s): 0.76 | learning rate: 
8.609E-05 | global batch size: 256 | lm loss: 1.947352E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.282 | TFLOPs: 20.28 | 31: iteration 102330/ 173500 | consumed samples: 26196480 | consumed tokens: 53650391040 | elapsed time per iteration (s): 0.76 | learning rate: 8.607E-05 | global batch size: 256 | lm loss: 1.970957E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.132 | TFLOPs: 20.40 | 31: iteration 102340/ 173500 | consumed samples: 26199040 | consumed tokens: 53655633920 | elapsed time per iteration (s): 0.72 | learning rate: 8.606E-05 | global batch size: 256 | lm loss: 2.008582E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.430 | TFLOPs: 21.50 | 31: iteration 102350/ 173500 | consumed samples: 26201600 | consumed tokens: 53660876800 | elapsed time per iteration (s): 0.81 | learning rate: 8.604E-05 | global batch size: 256 | lm loss: 1.976493E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.185 | TFLOPs: 19.07 | 31: iteration 102360/ 173500 | consumed samples: 26204160 | consumed tokens: 53666119680 | elapsed time per iteration (s): 0.81 | learning rate: 8.603E-05 | global batch size: 256 | lm loss: 1.986567E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.207 | TFLOPs: 19.13 | 31: iteration 102370/ 173500 | consumed samples: 26206720 | consumed tokens: 53671362560 | elapsed time per iteration (s): 0.77 | learning rate: 8.601E-05 | global batch size: 256 | lm loss: 1.970336E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.299 | TFLOPs: 20.04 | 31: iteration 102380/ 173500 | 
consumed samples: 26209280 | consumed tokens: 53676605440 | elapsed time per iteration (s): 0.77 | learning rate: 8.599E-05 | global batch size: 256 | lm loss: 1.970697E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.331 | TFLOPs: 20.23 | 31: iteration 102390/ 173500 | consumed samples: 26211840 | consumed tokens: 53681848320 | elapsed time per iteration (s): 0.76 | learning rate: 8.598E-05 | global batch size: 256 | lm loss: 1.984544E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.972 | TFLOPs: 20.33 | 31: iteration 102400/ 173500 | consumed samples: 26214400 | consumed tokens: 53687091200 | elapsed time per iteration (s): 0.80 | learning rate: 8.596E-05 | global batch size: 256 | lm loss: 1.954099E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.818 | TFLOPs: 19.35 | 31: iteration 102410/ 173500 | consumed samples: 26216960 | consumed tokens: 53692334080 | elapsed time per iteration (s): 0.75 | learning rate: 8.595E-05 | global batch size: 256 | lm loss: 1.958980E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.976 | TFLOPs: 20.69 | 31: iteration 102420/ 173500 | consumed samples: 26219520 | consumed tokens: 53697576960 | elapsed time per iteration (s): 0.77 | learning rate: 8.593E-05 | global batch size: 256 | lm loss: 1.993418E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.168 | TFLOPs: 20.16 | 31: iteration 102430/ 173500 | consumed samples: 26222080 | consumed tokens: 53702819840 | elapsed time per iteration (s): 0.79 | learning rate: 8.591E-05 | global batch size: 256 | lm loss: 1.941327E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 323.887 | TFLOPs: 19.59 | 31: iteration 102440/ 173500 | consumed samples: 26224640 | consumed tokens: 53708062720 | elapsed time per iteration (s): 0.75 | learning rate: 8.590E-05 | global batch size: 256 | lm loss: 1.984094E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.094 | TFLOPs: 20.64 | 31: iteration 102450/ 173500 | consumed samples: 26227200 | consumed tokens: 53713305600 | elapsed time per iteration (s): 0.77 | learning rate: 8.588E-05 | global batch size: 256 | lm loss: 1.968522E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.689 | TFLOPs: 20.01 | 31: iteration 102460/ 173500 | consumed samples: 26229760 | consumed tokens: 53718548480 | elapsed time per iteration (s): 0.82 | learning rate: 8.587E-05 | global batch size: 256 | lm loss: 1.962519E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.862 | TFLOPs: 18.81 | 31: iteration 102470/ 173500 | consumed samples: 26232320 | consumed tokens: 53723791360 | elapsed time per iteration (s): 0.79 | learning rate: 8.585E-05 | global batch size: 256 | lm loss: 1.965485E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.109 | TFLOPs: 19.67 | 31: iteration 102480/ 173500 | consumed samples: 26234880 | consumed tokens: 53729034240 | elapsed time per iteration (s): 0.97 | learning rate: 8.584E-05 | global batch size: 256 | lm loss: 2.013667E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 264.094 | TFLOPs: 15.98 | 31: iteration 102490/ 173500 | consumed samples: 26237440 | consumed tokens: 53734277120 | elapsed time per iteration (s): 0.75 | learning rate: 
8.582E-05 | global batch size: 256 | lm loss: 1.969523E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.175 | TFLOPs: 20.64 | 31: iteration 102500/ 173500 | consumed samples: 26240000 | consumed tokens: 53739520000 | elapsed time per iteration (s): 0.80 | learning rate: 8.580E-05 | global batch size: 256 | lm loss: 1.975450E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.211 | TFLOPs: 19.43 | 31: iteration 102510/ 173500 | consumed samples: 26242560 | consumed tokens: 53744762880 | elapsed time per iteration (s): 0.77 | learning rate: 8.579E-05 | global batch size: 256 | lm loss: 1.979903E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.385 | TFLOPs: 20.17 | 31: iteration 102520/ 173500 | consumed samples: 26245120 | consumed tokens: 53750005760 | elapsed time per iteration (s): 0.75 | learning rate: 8.577E-05 | global batch size: 256 | lm loss: 1.983449E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.537 | TFLOPs: 20.78 | 31: iteration 102530/ 173500 | consumed samples: 26247680 | consumed tokens: 53755248640 | elapsed time per iteration (s): 0.72 | learning rate: 8.576E-05 | global batch size: 256 | lm loss: 1.963896E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.656 | TFLOPs: 21.52 | 31: iteration 102540/ 173500 | consumed samples: 26250240 | consumed tokens: 53760491520 | elapsed time per iteration (s): 0.74 | learning rate: 8.574E-05 | global batch size: 256 | lm loss: 1.998965E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.133 | TFLOPs: 21.00 | 31: iteration 102550/ 173500 | 
consumed samples: 26252800 | consumed tokens: 53765734400 | elapsed time per iteration (s): 0.72 | learning rate: 8.572E-05 | global batch size: 256 | lm loss: 1.970418E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.644 | TFLOPs: 21.46 | 31: iteration 102560/ 173500 | consumed samples: 26255360 | consumed tokens: 53770977280 | elapsed time per iteration (s): 2.40 | learning rate: 8.571E-05 | global batch size: 256 | lm loss: 1.978822E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 106.485 | TFLOPs: 6.44 | 31: iteration 102570/ 173500 | consumed samples: 26257920 | consumed tokens: 53776220160 | elapsed time per iteration (s): 0.75 | learning rate: 8.569E-05 | global batch size: 256 | lm loss: 1.991774E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.466 | TFLOPs: 20.54 | 31: iteration 102580/ 173500 | consumed samples: 26260480 | consumed tokens: 53781463040 | elapsed time per iteration (s): 0.78 | learning rate: 8.568E-05 | global batch size: 256 | lm loss: 1.955719E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.871 | TFLOPs: 19.96 | 31: iteration 102590/ 173500 | consumed samples: 26263040 | consumed tokens: 53786705920 | elapsed time per iteration (s): 0.73 | learning rate: 8.566E-05 | global batch size: 256 | lm loss: 1.969869E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.924 | TFLOPs: 21.17 | 31: iteration 102600/ 173500 | consumed samples: 26265600 | consumed tokens: 53791948800 | elapsed time per iteration (s): 0.78 | learning rate: 8.565E-05 | global batch size: 256 | lm loss: 1.961836E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 
0 | number of nan iterations: 0 | samples per second: 327.193 | TFLOPs: 19.79 | 31: iteration 102610/ 173500 | consumed samples: 26268160 | consumed tokens: 53797191680 | elapsed time per iteration (s): 0.74 | learning rate: 8.563E-05 | global batch size: 256 | lm loss: 1.983370E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.976 | TFLOPs: 20.87 | 31: iteration 102620/ 173500 | consumed samples: 26270720 | consumed tokens: 53802434560 | elapsed time per iteration (s): 0.78 | learning rate: 8.561E-05 | global batch size: 256 | lm loss: 1.974910E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.365 | TFLOPs: 19.74 | 31: iteration 102630/ 173500 | consumed samples: 26273280 | consumed tokens: 53807677440 | elapsed time per iteration (s): 0.80 | learning rate: 8.560E-05 | global batch size: 256 | lm loss: 1.962668E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.193 | TFLOPs: 19.43 | 31: iteration 102640/ 173500 | consumed samples: 26275840 | consumed tokens: 53812920320 | elapsed time per iteration (s): 0.77 | learning rate: 8.558E-05 | global batch size: 256 | lm loss: 1.981493E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.721 | TFLOPs: 20.07 | 31: iteration 102650/ 173500 | consumed samples: 26278400 | consumed tokens: 53818163200 | elapsed time per iteration (s): 0.76 | learning rate: 8.557E-05 | global batch size: 256 | lm loss: 1.962909E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.233 | TFLOPs: 20.28 | 31: iteration 102660/ 173500 | consumed samples: 26280960 | consumed tokens: 53823406080 | elapsed time per iteration (s): 0.75 | learning rate: 8.555E-05 | 
global batch size: 256 | lm loss: 1.971571E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.025 | TFLOPs: 20.57 | 31: iteration 102670/ 173500 | consumed samples: 26283520 | consumed tokens: 53828648960 | elapsed time per iteration (s): 0.88 | learning rate: 8.553E-05 | global batch size: 256 | lm loss: 1.964161E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.989 | TFLOPs: 17.54 | 31: iteration 102680/ 173500 | consumed samples: 26286080 | consumed tokens: 53833891840 | elapsed time per iteration (s): 0.77 | learning rate: 8.552E-05 | global batch size: 256 | lm loss: 1.977258E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.486 | TFLOPs: 20.11 | 31: iteration 102690/ 173500 | consumed samples: 26288640 | consumed tokens: 53839134720 | elapsed time per iteration (s): 0.76 | learning rate: 8.550E-05 | global batch size: 256 | lm loss: 2.000506E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.478 | TFLOPs: 20.48 | 31: iteration 102700/ 173500 | consumed samples: 26291200 | consumed tokens: 53844377600 | elapsed time per iteration (s): 0.80 | learning rate: 8.549E-05 | global batch size: 256 | lm loss: 1.969945E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.542 | TFLOPs: 19.33 | 31: iteration 102710/ 173500 | consumed samples: 26293760 | consumed tokens: 53849620480 | elapsed time per iteration (s): 0.84 | learning rate: 8.547E-05 | global batch size: 256 | lm loss: 1.977526E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.099 | TFLOPs: 18.46 | 31: iteration 102720/ 173500 | consumed 
samples: 26296320 | consumed tokens: 53854863360 | elapsed time per iteration (s): 0.81 | learning rate: 8.546E-05 | global batch size: 256 | lm loss: 1.986260E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.988 | TFLOPs: 19.06 | 31: iteration 102730/ 173500 | consumed samples: 26298880 | consumed tokens: 53860106240 | elapsed time per iteration (s): 0.83 | learning rate: 8.544E-05 | global batch size: 256 | lm loss: 1.966528E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.064 | TFLOPs: 18.76 | 31: iteration 102740/ 173500 | consumed samples: 26301440 | consumed tokens: 53865349120 | elapsed time per iteration (s): 0.83 | learning rate: 8.542E-05 | global batch size: 256 | lm loss: 1.938567E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.353 | TFLOPs: 18.65 | 31: iteration 102750/ 173500 | consumed samples: 26304000 | consumed tokens: 53870592000 | elapsed time per iteration (s): 0.81 | learning rate: 8.541E-05 | global batch size: 256 | lm loss: 1.991500E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.224 | TFLOPs: 19.13 | 31: iteration 102760/ 173500 | consumed samples: 26306560 | consumed tokens: 53875834880 | elapsed time per iteration (s): 0.88 | learning rate: 8.539E-05 | global batch size: 256 | lm loss: 1.992879E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.514 | TFLOPs: 17.70 | 31: iteration 102770/ 173500 | consumed samples: 26309120 | consumed tokens: 53881077760 | elapsed time per iteration (s): 0.75 | learning rate: 8.538E-05 | global batch size: 256 | lm loss: 1.978855E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 343.021 | TFLOPs: 20.75 | 31: iteration 102780/ 173500 | consumed samples: 26311680 | consumed tokens: 53886320640 | elapsed time per iteration (s): 0.75 | learning rate: 8.536E-05 | global batch size: 256 | lm loss: 1.951510E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.362 | TFLOPs: 20.53 | 31: iteration 102790/ 173500 | consumed samples: 26314240 | consumed tokens: 53891563520 | elapsed time per iteration (s): 0.80 | learning rate: 8.534E-05 | global batch size: 256 | lm loss: 1.999215E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.012 | TFLOPs: 19.30 | 31: iteration 102800/ 173500 | consumed samples: 26316800 | consumed tokens: 53896806400 | elapsed time per iteration (s): 0.74 | learning rate: 8.533E-05 | global batch size: 256 | lm loss: 1.998114E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.270 | TFLOPs: 21.07 | 31: iteration 102810/ 173500 | consumed samples: 26319360 | consumed tokens: 53902049280 | elapsed time per iteration (s): 0.79 | learning rate: 8.531E-05 | global batch size: 256 | lm loss: 1.994959E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.391 | TFLOPs: 19.69 | 31: iteration 102820/ 173500 | consumed samples: 26321920 | consumed tokens: 53907292160 | elapsed time per iteration (s): 0.76 | learning rate: 8.530E-05 | global batch size: 256 | lm loss: 1.978775E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.126 | TFLOPs: 20.46 | 31: iteration 102830/ 173500 | consumed samples: 26324480 | consumed tokens: 53912535040 | elapsed time per iteration (s): 0.76 | learning rate: 8.528E-05 | global 
batch size: 256 | lm loss: 1.965055E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.459 | TFLOPs: 20.48 | 31: iteration 102840/ 173500 | consumed samples: 26327040 | consumed tokens: 53917777920 | elapsed time per iteration (s): 0.76 | learning rate: 8.527E-05 | global batch size: 256 | lm loss: 2.010017E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.670 | TFLOPs: 20.31 | 31: iteration 102850/ 173500 | consumed samples: 26329600 | consumed tokens: 53923020800 | elapsed time per iteration (s): 0.76 | learning rate: 8.525E-05 | global batch size: 256 | lm loss: 1.969405E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.760 | TFLOPs: 20.43 | 31: iteration 102860/ 173500 | consumed samples: 26332160 | consumed tokens: 53928263680 | elapsed time per iteration (s): 0.73 | learning rate: 8.523E-05 | global batch size: 256 | lm loss: 1.970058E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.399 | TFLOPs: 21.32 | 31: iteration 102870/ 173500 | consumed samples: 26334720 | consumed tokens: 53933506560 | elapsed time per iteration (s): 0.82 | learning rate: 8.522E-05 | global batch size: 256 | lm loss: 1.974176E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.953 | TFLOPs: 18.87 | 31: iteration 102880/ 173500 | consumed samples: 26337280 | consumed tokens: 53938749440 | elapsed time per iteration (s): 0.74 | learning rate: 8.520E-05 | global batch size: 256 | lm loss: 1.960264E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.240 | TFLOPs: 20.95 | 31: iteration 102890/ 173500 | consumed samples: 
26339840 | consumed tokens: 53943992320 | elapsed time per iteration (s): 0.72 | learning rate: 8.519E-05 | global batch size: 256 | lm loss: 1.984621E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.119 | TFLOPs: 21.48 | 31: iteration 102900/ 173500 | consumed samples: 26342400 | consumed tokens: 53949235200 | elapsed time per iteration (s): 0.78 | learning rate: 8.517E-05 | global batch size: 256 | lm loss: 1.967666E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.657 | TFLOPs: 19.88 | 31: iteration 102910/ 173500 | consumed samples: 26344960 | consumed tokens: 53954478080 | elapsed time per iteration (s): 0.70 | learning rate: 8.515E-05 | global batch size: 256 | lm loss: 1.964194E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 363.540 | TFLOPs: 21.99 | 31: iteration 102920/ 173500 | consumed samples: 26347520 | consumed tokens: 53959720960 | elapsed time per iteration (s): 0.79 | learning rate: 8.514E-05 | global batch size: 256 | lm loss: 1.997819E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.165 | TFLOPs: 19.55 | 31: iteration 102930/ 173500 | consumed samples: 26350080 | consumed tokens: 53964963840 | elapsed time per iteration (s): 0.78 | learning rate: 8.512E-05 | global batch size: 256 | lm loss: 1.979864E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.128 | TFLOPs: 19.85 | 31: iteration 102940/ 173500 | consumed samples: 26352640 | consumed tokens: 53970206720 | elapsed time per iteration (s): 0.75 | learning rate: 8.511E-05 | global batch size: 256 | lm loss: 1.982671E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 339.648 | TFLOPs: 20.55 | 31: iteration 102950/ 173500 | consumed samples: 26355200 | consumed tokens: 53975449600 | elapsed time per iteration (s): 0.74 | learning rate: 8.509E-05 | global batch size: 256 | lm loss: 1.966883E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.818 | TFLOPs: 20.80 | 31: iteration 102960/ 173500 | consumed samples: 26357760 | consumed tokens: 53980692480 | elapsed time per iteration (s): 0.80 | learning rate: 8.508E-05 | global batch size: 256 | lm loss: 1.976818E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.119 | TFLOPs: 19.37 | 31: iteration 102970/ 173500 | consumed samples: 26360320 | consumed tokens: 53985935360 | elapsed time per iteration (s): 0.80 | learning rate: 8.506E-05 | global batch size: 256 | lm loss: 1.952494E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.087 | TFLOPs: 19.36 | 31: iteration 102980/ 173500 | consumed samples: 26362880 | consumed tokens: 53991178240 | elapsed time per iteration (s): 0.79 | learning rate: 8.504E-05 | global batch size: 256 | lm loss: 1.972886E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.634 | TFLOPs: 19.58 | 31: iteration 102990/ 173500 | consumed samples: 26365440 | consumed tokens: 53996421120 | elapsed time per iteration (s): 0.76 | learning rate: 8.503E-05 | global batch size: 256 | lm loss: 1.947254E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.007 | TFLOPs: 20.51 | 31: iteration 103000/ 173500 | consumed samples: 26368000 | consumed tokens: 54001664000 | elapsed time per iteration (s): 0.78 | learning rate: 8.501E-05 | global batch 
size: 256 | lm loss: 1.983343E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.149 | TFLOPs: 19.73 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 103000 | lm loss value: 1.965976E+00 | lm loss PPL: 7.141880E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 103000 to checkpoints_1b1long 0: [2022-11-26 17:22:21,568] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step103000 is begin to save! 0: [2022-11-26 17:22:21,583] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_01-model_00-model_states.pt... 0: [2022-11-26 17:22:21,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_01-model_00-model_states.pt. 0: [2022-11-26 17:22:21,810] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_03-model_00-model_states.pt... 0: [2022-11-26 17:22:21,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_03-model_00-model_states.pt. 0: [2022-11-26 17:22:21,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_04-model_00-model_states.pt... 0: [2022-11-26 17:22:21,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_04-model_00-model_states.pt. 0: [2022-11-26 17:22:21,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_05-model_00-model_states.pt... 0: [2022-11-26 17:22:22,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_05-model_00-model_states.pt. 
0: [2022-11-26 17:22:22,039] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_06-model_00-model_states.pt... 0: [2022-11-26 17:22:22,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_06-model_00-model_states.pt. 0: [2022-11-26 17:22:22,115] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_07-model_00-model_states.pt... 0: [2022-11-26 17:22:22,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_07-model_00-model_states.pt. 0: [2022-11-26 17:22:22,191] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_08-model_00-model_states.pt... 0: [2022-11-26 17:22:22,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_08-model_00-model_states.pt. 0: [2022-11-26 17:22:22,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_09-model_00-model_states.pt... 0: [2022-11-26 17:22:22,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_09-model_00-model_states.pt. 0: [2022-11-26 17:22:22,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_10-model_00-model_states.pt... 0: [2022-11-26 17:22:22,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_10-model_00-model_states.pt. 0: [2022-11-26 17:22:22,427] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_11-model_00-model_states.pt... 0: [2022-11-26 17:22:22,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_11-model_00-model_states.pt. 
0: [2022-11-26 17:22:22,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_12-model_00-model_states.pt... 0: [2022-11-26 17:22:22,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_12-model_00-model_states.pt. 0: [2022-11-26 17:22:22,577] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_13-model_00-model_states.pt... 0: [2022-11-26 17:22:22,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_13-model_00-model_states.pt. 0: [2022-11-26 17:22:22,652] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_14-model_00-model_states.pt... 0: [2022-11-26 17:22:22,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_14-model_00-model_states.pt. 0: [2022-11-26 17:22:22,727] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_15-model_00-model_states.pt... 0: [2022-11-26 17:22:22,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_15-model_00-model_states.pt. 0: [2022-11-26 17:22:22,804] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_16-model_00-model_states.pt... 0: [2022-11-26 17:22:22,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_16-model_00-model_states.pt. 0: [2022-11-26 17:22:22,878] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_17-model_00-model_states.pt... 0: [2022-11-26 17:22:22,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_17-model_00-model_states.pt. 
0: [2022-11-26 17:22:22,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_18-model_00-model_states.pt... 0: [2022-11-26 17:22:23,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_18-model_00-model_states.pt. 0: [2022-11-26 17:22:23,029] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_19-model_00-model_states.pt... 0: [2022-11-26 17:22:23,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_19-model_00-model_states.pt. 0: [2022-11-26 17:22:23,105] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_20-model_00-model_states.pt... 0: [2022-11-26 17:22:23,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_20-model_00-model_states.pt. 0: [2022-11-26 17:22:23,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_21-model_00-model_states.pt... 0: [2022-11-26 17:22:23,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_21-model_00-model_states.pt. 0: [2022-11-26 17:22:23,258] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_22-model_00-model_states.pt... 0: [2022-11-26 17:22:23,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_22-model_00-model_states.pt. 0: [2022-11-26 17:22:23,330] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_23-model_00-model_states.pt... 0: [2022-11-26 17:22:23,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_23-model_00-model_states.pt. 
0: [2022-11-26 17:22:23,407] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_24-model_00-model_states.pt... 0: [2022-11-26 17:22:23,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_24-model_00-model_states.pt. 0: [2022-11-26 17:22:23,483] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_25-model_00-model_states.pt... 0: [2022-11-26 17:22:23,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_25-model_00-model_states.pt. 0: [2022-11-26 17:22:23,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_26-model_00-model_states.pt... 0: [2022-11-26 17:22:23,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_26-model_00-model_states.pt. 0: [2022-11-26 17:22:23,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_27-model_00-model_states.pt... 0: [2022-11-26 17:22:23,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_27-model_00-model_states.pt. 0: [2022-11-26 17:22:23,709] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_28-model_00-model_states.pt... 0: [2022-11-26 17:22:23,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_28-model_00-model_states.pt. 0: [2022-11-26 17:22:23,787] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/layer_30-model_00-model_states.pt... 0: [2022-11-26 17:22:23,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/layer_30-model_00-model_states.pt. 
0: [2022-11-26 17:22:23,789] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step103000/mp_rank_00_model_states.pt 0: [2022-11-26 17:22:23,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/mp_rank_00_model_states.pt... 0: [2022-11-26 17:22:23,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/mp_rank_00_model_states.pt. 0: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 
7: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
1: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
2: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
15: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 
28: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
14: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 
17: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 
26: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
6: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
4: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
1: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
12: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 
11: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 
30: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 
26: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
9: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
3: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 
25: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 
22: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 
0: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 
28: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 
27: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
0: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:22:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:22:23,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 20: [2022-11-26 17:22:23,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 17:22:23,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 17:22:23,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 5: [2022-11-26 17:22:23,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
1: [2022-11-26 17:22:23,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:22:23,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:22:23,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 12: [2022-11-26 17:22:23,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 5: [2022-11-26 17:22:23,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 1: [2022-11-26 17:22:23,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 12: [2022-11-26 17:22:23,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 1: [2022-11-26 17:22:23,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 19: [2022-11-26 17:22:23,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 17:22:23,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 17:22:23,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 24: [2022-11-26 17:22:23,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 
24: [2022-11-26 17:22:23,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 17:22:23,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 28: [2022-11-26 17:22:23,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:22:23,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:22:23,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 17:22:23,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 28: [2022-11-26 17:22:23,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 17:22:23,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 21: [2022-11-26 17:22:23,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:22:23,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 8: [2022-11-26 17:22:23,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
14: [2022-11-26 17:22:23,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:22:23,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 19: [2022-11-26 17:22:23,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:22:23,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 19: [2022-11-26 17:22:23,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 8: [2022-11-26 17:22:23,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 14: [2022-11-26 17:22:23,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 19: [2022-11-26 17:22:23,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 14: [2022-11-26 17:22:23,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 2: [2022-11-26 17:22:23,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:22:23,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 17:22:23,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 
14: [2022-11-26 17:22:23,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:22:23,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 17:22:23,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 4: [2022-11-26 17:22:23,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:22:23,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 17:22:23,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 23: [2022-11-26 17:22:23,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:22:23,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 17:22:23,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 5: [2022-11-26 17:22:23,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:22:23,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 17:22:23,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 
30: [2022-11-26 17:22:23,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:22:23,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 17:22:23,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 27: [2022-11-26 17:22:23,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:22:23,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:22:23,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:22:23,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 17:22:23,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 17:22:23,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 8: [2022-11-26 17:22:23,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 27: [2022-11-26 17:22:23,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 8: [2022-11-26 17:22:23,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 
1: [2022-11-26 17:22:23,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:22:23,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:22:23,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:22:23,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 26: [2022-11-26 17:22:23,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:22:23,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 12: [2022-11-26 17:22:23,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 21: [2022-11-26 17:22:23,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 1: [2022-11-26 17:22:23,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 26: [2022-11-26 17:22:23,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 31: [2022-11-26 17:22:23,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 
31: [2022-11-26 17:22:23,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 26: [2022-11-26 17:22:23,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 31: [2022-11-26 17:22:23,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 21: [2022-11-26 17:22:23,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 30: [2022-11-26 17:22:23,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:22:23,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 17:22:23,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 24: [2022-11-26 17:22:23,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 17:22:23,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 17:22:23,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 10: [2022-11-26 17:22:23,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:22:23,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-26 17:22:23,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 17:22:23,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 17:22:23,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 10: [2022-11-26 17:22:23,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 31: [2022-11-26 17:22:23,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 17:22:23,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 17:22:23,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 0: [2022-11-26 17:22:23,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 28: [2022-11-26 17:22:23,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:22:23,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 17:22:23,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 4: [2022-11-26 17:22:23,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 17:22:23,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 17:22:23,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 20: [2022-11-26 17:22:23,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 17:22:23,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 17:22:23,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 2: [2022-11-26 17:22:23,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:22:23,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 17:22:23,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 18: [2022-11-26 17:22:23,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:22:23,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 17:22:23,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 23: [2022-11-26 17:22:23,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
23: [2022-11-26 17:22:23,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 17:22:23,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 13: [2022-11-26 17:22:23,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:22:23,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 17:22:23,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 9: [2022-11-26 17:22:23,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:22:23,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 17:22:23,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 13: [2022-11-26 17:22:23,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:22:23,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 17:22:23,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 24: [2022-11-26 17:22:23,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
24: [2022-11-26 17:22:23,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 13: [2022-11-26 17:22:23,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:22:23,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 17:22:23,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 24: [2022-11-26 17:22:23,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 15: [2022-11-26 17:22:23,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:22:23,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 17:22:23,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 30: [2022-11-26 17:22:23,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:22:23,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 25: [2022-11-26 17:22:23,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
25: [2022-11-26 17:22:23,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 17:22:23,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:22:23,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 0: [2022-11-26 17:22:23,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 30: [2022-11-26 17:22:23,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 0: [2022-11-26 17:22:23,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 25: [2022-11-26 17:22:23,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 20: [2022-11-26 17:22:23,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 25: [2022-11-26 17:22:23,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 17:22:23,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 17:22:23,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 25: [2022-11-26 17:22:23,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 
20: [2022-11-26 17:22:23,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 25: [2022-11-26 17:22:23,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 20: [2022-11-26 17:22:23,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 8: [2022-11-26 17:22:23,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:22:23,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 17:22:23,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 5: [2022-11-26 17:22:23,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:22:23,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:22:23,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 2: [2022-11-26 17:22:23,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 5: [2022-11-26 17:22:23,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 2: [2022-11-26 17:22:23,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 
27: [2022-11-26 17:22:23,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:22:23,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 1: [2022-11-26 17:22:23,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:22:23,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 1: [2022-11-26 17:22:23,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 17:22:23,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 15: [2022-11-26 17:22:23,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:22:23,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 17:22:23,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 6: [2022-11-26 17:22:23,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:22:23,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 17:22:23,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 
28: [2022-11-26 17:22:23,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 12: [2022-11-26 17:22:23,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 28: [2022-11-26 17:22:23,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 12: [2022-11-26 17:22:23,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 17:22:23,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 9: [2022-11-26 17:22:23,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:22:23,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:22:23,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 17:22:23,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 17:22:23,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 9: [2022-11-26 17:22:23,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 19: [2022-11-26 17:22:23,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
19: [2022-11-26 17:22:23,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 17:22:23,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 10: [2022-11-26 17:22:23,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 25: [2022-11-26 17:22:23,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:22:23,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:22:23,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:22:23,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 25: [2022-11-26 17:22:23,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 18: [2022-11-26 17:22:23,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 10: [2022-11-26 17:22:23,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 25: [2022-11-26 17:22:23,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 18: [2022-11-26 17:22:23,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 
30: [2022-11-26 17:22:23,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 17:22:23,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 7: [2022-11-26 17:22:23,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:22:23,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:22:23,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:22:23,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 17:22:23,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 17:22:23,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 17:22:23,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 7: [2022-11-26 17:22:23,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 7: [2022-11-26 17:22:23,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 24: [2022-11-26 17:22:23,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
24: [2022-11-26 17:22:23,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 17:22:23,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 23: [2022-11-26 17:22:23,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:22:23,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 17:22:23,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 2: [2022-11-26 17:22:23,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:22:23,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 17:22:23,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 26: [2022-11-26 17:22:23,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 17:22:23,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 17:22:23,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 1: [2022-11-26 17:22:23,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
5: [2022-11-26 17:22:23,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:22:23,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 1: [2022-11-26 17:22:23,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 31: [2022-11-26 17:22:23,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:22:23,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 1: [2022-11-26 17:22:23,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 21: [2022-11-26 17:22:23,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 31: [2022-11-26 17:22:23,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 17:22:23,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 21: [2022-11-26 17:22:23,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 17:22:23,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 26: [2022-11-26 17:22:23,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-26 17:22:23,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 17:22:23,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 26: [2022-11-26 17:22:23,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:22:23,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 26: [2022-11-26 17:22:23,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 21: [2022-11-26 17:22:23,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 26: [2022-11-26 17:22:23,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 21: [2022-11-26 17:22:23,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 4: [2022-11-26 17:22:23,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:22:23,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 17:22:23,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 19: [2022-11-26 17:22:23,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
8: [2022-11-26 17:22:23,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:22:23,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 17:22:23,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 3: [2022-11-26 17:22:23,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 19: [2022-11-26 17:22:23,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 17:22:23,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 3: [2022-11-26 17:22:23,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 17:22:23,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 15: [2022-11-26 17:22:23,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:22:23,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 17:22:23,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 6: [2022-11-26 17:22:23,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 17:22:23,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 17:22:23,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 0: [2022-11-26 17:22:23,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:22:23,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 17:22:23,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 11: [2022-11-26 17:22:23,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:22:23,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 17:22:23,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 11: [2022-11-26 17:22:23,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:22:23,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 17:22:23,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 11: [2022-11-26 17:22:23,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-26 17:22:23,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 17:22:23,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 11: [2022-11-26 17:22:23,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:22:23,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 17:22:23,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 13: [2022-11-26 17:22:23,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:22:23,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:22:23,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 28: [2022-11-26 17:22:23,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:22:23,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 
27: [2022-11-26 17:22:23,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 28: [2022-11-26 17:22:23,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 27: [2022-11-26 17:22:23,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 3: [2022-11-26 17:22:23,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:22:23,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 15: [2022-11-26 17:22:23,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:22:23,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:22:23,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 15: [2022-11-26 17:22:23,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 14: [2022-11-26 17:22:23,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 20: [2022-11-26 17:22:23,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
14: [2022-11-26 17:22:23,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 15: [2022-11-26 17:22:23,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 20: [2022-11-26 17:22:23,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 17:22:23,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 12: [2022-11-26 17:22:23,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:22:23,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 17:22:23,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 9: [2022-11-26 17:22:23,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:22:23,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 3: [2022-11-26 17:22:23,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:22:23,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 
3: [2022-11-26 17:22:23,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 17:22:23,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 31: [2022-11-26 17:22:23,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 17:22:23,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 22: [2022-11-26 17:22:23,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:22:23,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 31: [2022-11-26 17:22:23,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 22: [2022-11-26 17:22:23,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 17:22:23,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 17:22:23,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 22: [2022-11-26 17:22:23,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 28: [2022-11-26 17:22:23,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 
28: [2022-11-26 17:22:23,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 17:22:23,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 17:22:23,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 14: [2022-11-26 17:22:23,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:22:23,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 17:22:23,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 6: [2022-11-26 17:22:23,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:22:23,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 17:22:23,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 22: [2022-11-26 17:22:23,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:22:23,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-26 17:22:23,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 22: [2022-11-26 17:22:23,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 17:22:23,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 18: [2022-11-26 17:22:23,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 10: [2022-11-26 17:22:23,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:22:23,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 17:22:23,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 17: [2022-11-26 17:22:23,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 17:22:23,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 17:22:23,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 
17: [2022-11-26 17:22:23,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 17:22:23,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 17:22:23,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 17:22:23,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 17: [2022-11-26 17:22:23,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 17: [2022-11-26 17:22:23,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 17: [2022-11-26 17:22:23,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 17:22:23,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 17:22:23,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 0: [2022-11-26 17:22:23,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 17:22:23,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 19: [2022-11-26 17:22:23,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
19: [2022-11-26 17:22:23,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 17:22:23,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 23: [2022-11-26 17:22:23,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:22:23,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 17:22:23,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 1: [2022-11-26 17:22:23,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:22:23,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 17:22:23,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 26: [2022-11-26 17:22:23,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 17:22:23,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 17:22:23,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 4: [2022-11-26 17:22:23,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 17:22:23,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 17:22:23,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 20: [2022-11-26 17:22:23,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 17:22:23,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 17:22:23,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 2: [2022-11-26 17:22:23,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:22:23,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 17:22:23,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 17: [2022-11-26 17:22:23,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 17:22:23,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 17:22:23,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 16: [2022-11-26 17:22:23,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
16: [2022-11-26 17:22:23,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 17:22:23,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 17:22:23,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 17:22:23,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 17:22:23,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 17:22:23,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 17:22:23,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 17:22:23,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 16: [2022-11-26 17:22:23,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 16: [2022-11-26 17:22:23,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 16: [2022-11-26 17:22:23,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 11: [2022-11-26 17:22:23,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-26 17:22:23,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 17:22:23,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 29: [2022-11-26 17:22:23,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 17:22:23,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 17:22:23,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 17:22:23,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 17:22:23,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 17:22:23,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 17:22:23,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 17:22:23,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 17:22:23,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 
29: [2022-11-26 17:22:23,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 29: [2022-11-26 17:22:23,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 29: [2022-11-26 17:22:23,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 16: [2022-11-26 17:22:23,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 17:22:23,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 17:22:23,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 29: [2022-11-26 17:22:23,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 17:22:23,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 17:22:23,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 24: [2022-11-26 17:22:23,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 17:22:23,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 17:22:23,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 
8: [2022-11-26 17:22:23,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:22:23,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 17:22:23,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 13: [2022-11-26 17:22:23,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:22:23,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 17:22:23,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 27: [2022-11-26 17:22:23,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:22:23,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 17:22:23,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 10: [2022-11-26 17:22:23,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:22:23,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 17:22:23,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 
5: [2022-11-26 17:22:24,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:22:24,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 17:22:24,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 18: [2022-11-26 17:22:24,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:22:24,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 17:22:24,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 25: [2022-11-26 17:22:24,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 17:22:24,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 17:22:24,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 7: [2022-11-26 17:22:24,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:22:24,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 17:22:24,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 
21: [2022-11-26 17:22:24,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:22:24,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 17:22:24,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 0: [2022-11-26 17:22:24,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:22:24,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 17:22:24,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 3: [2022-11-26 17:22:24,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:22:24,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 17:22:24,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 12: [2022-11-26 17:22:24,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:22:24,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
12: [2022-11-26 17:22:24,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 30: [2022-11-26 17:22:24,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 17:22:24,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 12: [2022-11-26 17:22:24,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 28: [2022-11-26 17:22:24,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 17:22:24,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 17:22:24,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 31: [2022-11-26 17:22:24,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 17:22:24,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 17:22:24,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 15: [2022-11-26 17:22:24,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-26 17:22:24,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 17:22:24,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 23: [2022-11-26 17:22:24,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:22:24,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 17:22:24,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 22: [2022-11-26 17:22:24,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:22:24,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 17:22:24,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 6: [2022-11-26 17:22:24,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:22:24,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 17:22:24,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 14: [2022-11-26 17:22:24,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-26 17:22:24,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 17:22:24,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 1: [2022-11-26 17:22:24,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:22:24,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 17:22:24,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 4: [2022-11-26 17:22:24,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:22:24,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 17:22:24,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 20: [2022-11-26 17:22:24,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 17:22:24,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 17:22:24,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 17: [2022-11-26 17:22:24,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 
11: [2022-11-26 17:22:24,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 17: [2022-11-26 17:22:24,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 17:22:24,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 11: [2022-11-26 17:22:24,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 17:22:24,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 26: [2022-11-26 17:22:24,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 17:22:24,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 17:22:24,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 2: [2022-11-26 17:22:24,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:22:24,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 17:22:24,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 16: [2022-11-26 17:22:24,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
16: [2022-11-26 17:22:24,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 17:22:24,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 27: [2022-11-26 17:22:24,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 19: [2022-11-26 17:22:24,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:22:24,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 19: [2022-11-26 17:22:24,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 27: [2022-11-26 17:22:24,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 19: [2022-11-26 17:22:24,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 5: [2022-11-26 17:22:24,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:22:24,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 17:22:24,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 10: [2022-11-26 17:22:24,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-26 17:22:24,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 17:22:24,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 8: [2022-11-26 17:22:24,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:22:24,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 17:22:24,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 24: [2022-11-26 17:22:24,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 17:22:24,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 17:22:24,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 29: [2022-11-26 17:22:24,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 17:22:24,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 17:22:24,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 9: [2022-11-26 17:22:24,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 17:22:24,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 17:22:24,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 18: [2022-11-26 17:22:24,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:22:24,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 17:22:24,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 13: [2022-11-26 17:22:24,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:22:24,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 17:22:24,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 7: [2022-11-26 17:22:24,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:22:24,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 17:22:24,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 7: [2022-11-26 17:22:24,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 17:22:24,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 3: [2022-11-26 17:22:24,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 12: [2022-11-26 17:22:24,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:22:24,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 17:22:24,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 25: [2022-11-26 17:22:24,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 17:22:24,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 6: [2022-11-26 17:22:24,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 25: [2022-11-26 17:22:24,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 
6: [2022-11-26 17:22:24,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 17:22:24,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 28: [2022-11-26 17:22:24,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 17:22:24,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 17:22:24,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 30: [2022-11-26 17:22:24,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:22:24,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 17:22:24,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 21: [2022-11-26 17:22:24,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:22:24,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 17:22:24,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 0: [2022-11-26 17:22:24,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 17:22:24,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 17:22:24,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 31: [2022-11-26 17:22:24,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 17:22:24,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 17:22:24,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 9: [2022-11-26 17:22:24,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:22:24,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 17:22:24,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 14: [2022-11-26 17:22:24,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:22:24,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 17:22:24,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 23: [2022-11-26 17:22:24,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-26 17:22:24,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 15: [2022-11-26 17:22:24,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:22:24,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 20: [2022-11-26 17:22:24,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:22:24,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 17:22:24,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 20: [2022-11-26 17:22:24,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 17:22:24,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 22: [2022-11-26 17:22:24,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:22:24,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 17:22:24,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 26: [2022-11-26 17:22:24,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
26: [2022-11-26 17:22:24,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 17:22:24,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 4: [2022-11-26 17:22:24,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:22:24,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 1: [2022-11-26 17:22:24,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:22:24,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 11: [2022-11-26 17:22:24,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:22:24,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 1: [2022-11-26 17:22:24,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 11: [2022-11-26 17:22:24,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 1: [2022-11-26 17:22:24,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 16: [2022-11-26 17:22:24,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 
16: [2022-11-26 17:22:24,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 17:22:24,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 19: [2022-11-26 17:22:24,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 17:22:24,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 17:22:24,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 27: [2022-11-26 17:22:24,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:22:24,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 17:22:24,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 17: [2022-11-26 17:22:24,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 17:22:24,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 17:22:24,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 2: [2022-11-26 17:22:24,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 17:22:24,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 17:22:24,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 8: [2022-11-26 17:22:24,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:22:24,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 17:22:24,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 10: [2022-11-26 17:22:24,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:22:24,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 17:22:24,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 28: [2022-11-26 17:22:24,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 17:22:24,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 17:22:24,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 29: [2022-11-26 17:22:24,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-26 17:22:24,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 17:22:24,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 24: [2022-11-26 17:22:24,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:22:24,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:22:24,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 24: [2022-11-26 17:22:24,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 13: [2022-11-26 17:22:24,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 24: [2022-11-26 17:22:24,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 5: [2022-11-26 17:22:24,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:22:24,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 17:22:24,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 30: [2022-11-26 17:22:24,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-26 17:22:24,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 17:22:24,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 21: [2022-11-26 17:22:24,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:22:24,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 17:22:24,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 7: [2022-11-26 17:22:24,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:22:24,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 18: [2022-11-26 17:22:24,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:22:24,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 18: [2022-11-26 17:22:24,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 17:22:24,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 5: [2022-11-26 17:22:24,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 17:22:24,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 17:22:24,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 13: [2022-11-26 17:22:24,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:22:24,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 17:22:24,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 9: [2022-11-26 17:22:24,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:22:24,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 17:22:24,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 25: [2022-11-26 17:22:24,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 19: [2022-11-26 17:22:24,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
25: [2022-11-26 17:22:24,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 19: [2022-11-26 17:22:24,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 17:22:24,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 25: [2022-11-26 17:22:24,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 11: [2022-11-26 17:22:24,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:22:24,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 17:22:24,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 20: [2022-11-26 17:22:24,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 17:22:24,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 1: [2022-11-26 17:22:24,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 20: [2022-11-26 17:22:24,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 23: [2022-11-26 17:22:24,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
23: [2022-11-26 17:22:24,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 1: [2022-11-26 17:22:24,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 23: [2022-11-26 17:22:24,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 1: [2022-11-26 17:22:24,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 28: [2022-11-26 17:22:24,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:22:24,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:22:24,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 3: [2022-11-26 17:22:24,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:22:24,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 3: [2022-11-26 17:22:24,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 17:22:24,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 29: [2022-11-26 17:22:24,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
29: [2022-11-26 17:22:24,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 17:22:24,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 27: [2022-11-26 17:22:24,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:22:24,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:22:24,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 17:22:24,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 27: [2022-11-26 17:22:24,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 17:22:24,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 24: [2022-11-26 17:22:24,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 17:22:24,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 17:22:24,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 31: [2022-11-26 17:22:24,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
31: [2022-11-26 17:22:24,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 17:22:24,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 6: [2022-11-26 17:22:24,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:22:24,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 17:22:24,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 28: [2022-11-26 17:22:24,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 17:22:24,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 4: [2022-11-26 17:22:24,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:22:24,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 17:22:24,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 0: [2022-11-26 17:22:24,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:22:24,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
0: [2022-11-26 17:22:24,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 17:22:24,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 9: [2022-11-26 17:22:24,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 17:22:24,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 8: [2022-11-26 17:22:24,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:22:24,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 17:22:24,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 2: [2022-11-26 17:22:24,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:22:24,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 17:22:24,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 15: [2022-11-26 17:22:24,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-26 17:22:24,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 17:22:24,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 6: [2022-11-26 17:22:24,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:22:24,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:22:24,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 30: [2022-11-26 17:22:24,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 6: [2022-11-26 17:22:24,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 30: [2022-11-26 17:22:24,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 10: [2022-11-26 17:22:24,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:22:24,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 17:22:24,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 16: [2022-11-26 17:22:24,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-26 17:22:24,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 17:22:24,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 21: [2022-11-26 17:22:24,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:22:24,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 17:22:24,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 26: [2022-11-26 17:22:24,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 17:22:24,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 17:22:24,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 3: [2022-11-26 17:22:24,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 17: [2022-11-26 17:22:24,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
3: [2022-11-26 17:22:24,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 17: [2022-11-26 17:22:24,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 3: [2022-11-26 17:22:24,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 17: [2022-11-26 17:22:24,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 14: [2022-11-26 17:22:24,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:22:24,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:22:24,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 12: [2022-11-26 17:22:24,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:22:24,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 17:22:24,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 12: [2022-11-26 17:22:24,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 14: [2022-11-26 17:22:24,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 
12: [2022-11-26 17:22:24,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 22: [2022-11-26 17:22:24,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:22:24,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 17:22:24,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 0: [2022-11-26 17:22:24,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:22:24,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 17:22:24,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 7: [2022-11-26 17:22:24,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:22:24,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:22:24,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 17:22:24,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 17:22:24,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 
7: [2022-11-26 17:22:24,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 23: [2022-11-26 17:22:24,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:22:24,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 17:22:24,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 25: [2022-11-26 17:22:24,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:22:24,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 25: [2022-11-26 17:22:24,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 6: [2022-11-26 17:22:24,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 17:22:24,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 25: [2022-11-26 17:22:24,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 4: [2022-11-26 17:22:24,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 17:22:24,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 17:22:24,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 15: [2022-11-26 17:22:24,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 31: [2022-11-26 17:22:24,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:22:24,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 17:22:24,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 31: [2022-11-26 17:22:24,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 17:22:24,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 3: [2022-11-26 17:22:24,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:22:24,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 17:22:24,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 22: [2022-11-26 17:22:24,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
22: [2022-11-26 17:22:24,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 17:22:24,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 22: [2022-11-26 17:22:24,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:22:24,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step103000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 17:22:24,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 0: successfully saved checkpoint at iteration 103000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2557.79 31: iteration 103010/ 173500 | consumed samples: 26370560 | consumed tokens: 54006906880 | elapsed time per iteration (s): 1.04 | learning rate: 8.500E-05 | global batch size: 256 | lm loss: 1.978015E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.153 | TFLOPs: 14.89 | 31: iteration 103020/ 173500 | consumed samples: 26373120 | consumed tokens: 54012149760 | elapsed time per iteration (s): 0.74 | learning rate: 8.498E-05 | global batch size: 256 | lm loss: 1.974658E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.046 | TFLOPs: 21.00 | 31: iteration 103030/ 173500 | consumed samples: 26375680 | consumed tokens: 54017392640 | elapsed time per iteration (s): 0.79 | learning rate: 8.496E-05 | global batch size: 256 | lm loss: 1.956619E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.433 | TFLOPs: 19.63 | 31: iteration 
103040/ 173500 | consumed samples: 26378240 | consumed tokens: 54022635520 | elapsed time per iteration (s): 0.77 | learning rate: 8.495E-05 | global batch size: 256 | lm loss: 1.961651E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.702 | TFLOPs: 20.07 | 31: iteration 103050/ 173500 | consumed samples: 26380800 | consumed tokens: 54027878400 | elapsed time per iteration (s): 0.74 | learning rate: 8.493E-05 | global batch size: 256 | lm loss: 1.975091E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.213 | TFLOPs: 20.94 | 31: iteration 103060/ 173500 | consumed samples: 26383360 | consumed tokens: 54033121280 | elapsed time per iteration (s): 0.80 | learning rate: 8.492E-05 | global batch size: 256 | lm loss: 1.970477E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.871 | TFLOPs: 19.35 | 31: iteration 103070/ 173500 | consumed samples: 26385920 | consumed tokens: 54038364160 | elapsed time per iteration (s): 0.74 | learning rate: 8.490E-05 | global batch size: 256 | lm loss: 1.970689E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.612 | TFLOPs: 20.85 | 31: iteration 103080/ 173500 | consumed samples: 26388480 | consumed tokens: 54043607040 | elapsed time per iteration (s): 0.77 | learning rate: 8.489E-05 | global batch size: 256 | lm loss: 2.006860E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.488 | TFLOPs: 20.05 | 31: iteration 103090/ 173500 | consumed samples: 26391040 | consumed tokens: 54048849920 | elapsed time per iteration (s): 0.76 | learning rate: 8.487E-05 | global batch size: 256 | lm loss: 1.978852E+00 | grad norm: 0.149 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.276 | TFLOPs: 20.46 | 31: iteration 103100/ 173500 | consumed samples: 26393600 | consumed tokens: 54054092800 | elapsed time per iteration (s): 0.79 | learning rate: 8.485E-05 | global batch size: 256 | lm loss: 2.010775E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.323 | TFLOPs: 19.56 | 31: iteration 103110/ 173500 | consumed samples: 26396160 | consumed tokens: 54059335680 | elapsed time per iteration (s): 0.76 | learning rate: 8.484E-05 | global batch size: 256 | lm loss: 1.975299E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.123 | TFLOPs: 20.33 | 31: iteration 103120/ 173500 | consumed samples: 26398720 | consumed tokens: 54064578560 | elapsed time per iteration (s): 0.77 | learning rate: 8.482E-05 | global batch size: 256 | lm loss: 1.992756E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.409 | TFLOPs: 20.23 | 31: iteration 103130/ 173500 | consumed samples: 26401280 | consumed tokens: 54069821440 | elapsed time per iteration (s): 0.75 | learning rate: 8.481E-05 | global batch size: 256 | lm loss: 1.984027E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.117 | TFLOPs: 20.58 | 31: iteration 103140/ 173500 | consumed samples: 26403840 | consumed tokens: 54075064320 | elapsed time per iteration (s): 0.81 | learning rate: 8.479E-05 | global batch size: 256 | lm loss: 2.006558E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.734 | TFLOPs: 19.10 | 31: iteration 103150/ 173500 | consumed samples: 26406400 | consumed tokens: 54080307200 | elapsed time per iteration (s): 0.76 | learning 
rate: 8.477E-05 | global batch size: 256 | lm loss: 1.950813E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.354 | TFLOPs: 20.29 | 31: iteration 103160/ 173500 | consumed samples: 26408960 | consumed tokens: 54085550080 | elapsed time per iteration (s): 0.76 | learning rate: 8.476E-05 | global batch size: 256 | lm loss: 1.984109E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.692 | TFLOPs: 20.49 | 31: iteration 103170/ 173500 | consumed samples: 26411520 | consumed tokens: 54090792960 | elapsed time per iteration (s): 0.75 | learning rate: 8.474E-05 | global batch size: 256 | lm loss: 1.950316E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.686 | TFLOPs: 20.67 | 31: iteration 103180/ 173500 | consumed samples: 26414080 | consumed tokens: 54096035840 | elapsed time per iteration (s): 0.72 | learning rate: 8.473E-05 | global batch size: 256 | lm loss: 1.980137E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.465 | TFLOPs: 21.57 | 31: iteration 103190/ 173500 | consumed samples: 26416640 | consumed tokens: 54101278720 | elapsed time per iteration (s): 0.75 | learning rate: 8.471E-05 | global batch size: 256 | lm loss: 1.976436E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.983 | TFLOPs: 20.69 | 31: iteration 103200/ 173500 | consumed samples: 26419200 | consumed tokens: 54106521600 | elapsed time per iteration (s): 0.77 | learning rate: 8.470E-05 | global batch size: 256 | lm loss: 1.968938E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.379 | TFLOPs: 20.05 | 31: iteration 103210/ 
173500 | consumed samples: 26421760 | consumed tokens: 54111764480 | elapsed time per iteration (s): 0.75 | learning rate: 8.468E-05 | global batch size: 256 | lm loss: 1.996203E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.226 | TFLOPs: 20.52 | 31: iteration 103220/ 173500 | consumed samples: 26424320 | consumed tokens: 54117007360 | elapsed time per iteration (s): 0.79 | learning rate: 8.466E-05 | global batch size: 256 | lm loss: 1.963517E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.832 | TFLOPs: 19.53 | 31: iteration 103230/ 173500 | consumed samples: 26426880 | consumed tokens: 54122250240 | elapsed time per iteration (s): 0.75 | learning rate: 8.465E-05 | global batch size: 256 | lm loss: 1.958387E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.191 | TFLOPs: 20.70 | 31: iteration 103240/ 173500 | consumed samples: 26429440 | consumed tokens: 54127493120 | elapsed time per iteration (s): 0.79 | learning rate: 8.463E-05 | global batch size: 256 | lm loss: 1.981655E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.833 | TFLOPs: 19.71 | 31: iteration 103250/ 173500 | consumed samples: 26432000 | consumed tokens: 54132736000 | elapsed time per iteration (s): 0.76 | learning rate: 8.462E-05 | global batch size: 256 | lm loss: 1.981697E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.933 | TFLOPs: 20.44 | 31: iteration 103260/ 173500 | consumed samples: 26434560 | consumed tokens: 54137978880 | elapsed time per iteration (s): 0.73 | learning rate: 8.460E-05 | global batch size: 256 | lm loss: 1.979395E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 348.956 | TFLOPs: 21.11 | 31: iteration 103270/ 173500 | consumed samples: 26437120 | consumed tokens: 54143221760 | elapsed time per iteration (s): 0.79 | learning rate: 8.459E-05 | global batch size: 256 | lm loss: 1.973685E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.883 | TFLOPs: 19.59 | 31: iteration 103280/ 173500 | consumed samples: 26439680 | consumed tokens: 54148464640 | elapsed time per iteration (s): 0.79 | learning rate: 8.457E-05 | global batch size: 256 | lm loss: 1.993117E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.118 | TFLOPs: 19.55 | 31: iteration 103290/ 173500 | consumed samples: 26442240 | consumed tokens: 54153707520 | elapsed time per iteration (s): 0.77 | learning rate: 8.455E-05 | global batch size: 256 | lm loss: 1.951348E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.530 | TFLOPs: 20.24 | 31: iteration 103300/ 173500 | consumed samples: 26444800 | consumed tokens: 54158950400 | elapsed time per iteration (s): 0.82 | learning rate: 8.454E-05 | global batch size: 256 | lm loss: 1.979152E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.364 | TFLOPs: 18.96 | 31: iteration 103310/ 173500 | consumed samples: 26447360 | consumed tokens: 54164193280 | elapsed time per iteration (s): 0.77 | learning rate: 8.452E-05 | global batch size: 256 | lm loss: 1.964179E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.319 | TFLOPs: 20.04 | 31: iteration 103320/ 173500 | consumed samples: 26449920 | consumed tokens: 54169436160 | elapsed time per iteration (s): 0.75 | learning rate: 
8.451E-05 | global batch size: 256 | lm loss: 1.956739E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.870 | TFLOPs: 20.62 | 31: iteration 103330/ 173500 | consumed samples: 26452480 | consumed tokens: 54174679040 | elapsed time per iteration (s): 0.74 | learning rate: 8.449E-05 | global batch size: 256 | lm loss: 1.975477E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.009 | TFLOPs: 20.99 | 31: iteration 103340/ 173500 | consumed samples: 26455040 | consumed tokens: 54179921920 | elapsed time per iteration (s): 0.75 | learning rate: 8.447E-05 | global batch size: 256 | lm loss: 1.991647E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.702 | TFLOPs: 20.67 | 31: iteration 103350/ 173500 | consumed samples: 26457600 | consumed tokens: 54185164800 | elapsed time per iteration (s): 0.75 | learning rate: 8.446E-05 | global batch size: 256 | lm loss: 1.970206E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.136 | TFLOPs: 20.70 | 31: iteration 103360/ 173500 | consumed samples: 26460160 | consumed tokens: 54190407680 | elapsed time per iteration (s): 0.77 | learning rate: 8.444E-05 | global batch size: 256 | lm loss: 1.978878E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.743 | TFLOPs: 20.07 | 31: iteration 103370/ 173500 | consumed samples: 26462720 | consumed tokens: 54195650560 | elapsed time per iteration (s): 0.74 | learning rate: 8.443E-05 | global batch size: 256 | lm loss: 1.989631E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.429 | TFLOPs: 20.84 | 31: iteration 103380/ 173500 | 
consumed samples: 26465280 | consumed tokens: 54200893440 | elapsed time per iteration (s): 0.71 | learning rate: 8.441E-05 | global batch size: 256 | lm loss: 1.965248E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.270 | TFLOPs: 21.67 | 31: iteration 103390/ 173500 | consumed samples: 26467840 | consumed tokens: 54206136320 | elapsed time per iteration (s): 0.76 | learning rate: 8.440E-05 | global batch size: 256 | lm loss: 2.004454E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.450 | TFLOPs: 20.41 | 31: iteration 103400/ 173500 | consumed samples: 26470400 | consumed tokens: 54211379200 | elapsed time per iteration (s): 0.78 | learning rate: 8.438E-05 | global batch size: 256 | lm loss: 1.985220E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.632 | TFLOPs: 19.88 | 31: iteration 103410/ 173500 | consumed samples: 26472960 | consumed tokens: 54216622080 | elapsed time per iteration (s): 0.86 | learning rate: 8.436E-05 | global batch size: 256 | lm loss: 1.986699E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.516 | TFLOPs: 17.94 | 31: iteration 103420/ 173500 | consumed samples: 26475520 | consumed tokens: 54221864960 | elapsed time per iteration (s): 0.79 | learning rate: 8.435E-05 | global batch size: 256 | lm loss: 1.957325E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.473 | TFLOPs: 19.69 | 31: iteration 103430/ 173500 | consumed samples: 26478080 | consumed tokens: 54227107840 | elapsed time per iteration (s): 0.76 | learning rate: 8.433E-05 | global batch size: 256 | lm loss: 1.961222E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 336.460 | TFLOPs: 20.35 | 31: iteration 103440/ 173500 | consumed samples: 26480640 | consumed tokens: 54232350720 | elapsed time per iteration (s): 0.75 | learning rate: 8.432E-05 | global batch size: 256 | lm loss: 1.981216E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.082 | TFLOPs: 20.70 | 31: iteration 103450/ 173500 | consumed samples: 26483200 | consumed tokens: 54237593600 | elapsed time per iteration (s): 0.72 | learning rate: 8.430E-05 | global batch size: 256 | lm loss: 1.979478E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.831 | TFLOPs: 21.65 | 31: iteration 103460/ 173500 | consumed samples: 26485760 | consumed tokens: 54242836480 | elapsed time per iteration (s): 0.75 | learning rate: 8.429E-05 | global batch size: 256 | lm loss: 1.983958E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.183 | TFLOPs: 20.58 | 31: iteration 103470/ 173500 | consumed samples: 26488320 | consumed tokens: 54248079360 | elapsed time per iteration (s): 0.77 | learning rate: 8.427E-05 | global batch size: 256 | lm loss: 1.963416E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.375 | TFLOPs: 20.23 | 31: iteration 103480/ 173500 | consumed samples: 26490880 | consumed tokens: 54253322240 | elapsed time per iteration (s): 0.71 | learning rate: 8.425E-05 | global batch size: 256 | lm loss: 1.968753E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 362.499 | TFLOPs: 21.93 | 31: iteration 103490/ 173500 | consumed samples: 26493440 | consumed tokens: 54258565120 | elapsed time per iteration (s): 0.72 | learning rate: 
8.424E-05 | global batch size: 256 | lm loss: 1.957437E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.662 | TFLOPs: 21.58 | 31: iteration 103500/ 173500 | consumed samples: 26496000 | consumed tokens: 54263808000 | elapsed time per iteration (s): 0.74 | learning rate: 8.422E-05 | global batch size: 256 | lm loss: 2.003379E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.166 | TFLOPs: 21.06 | 31: iteration 103510/ 173500 | consumed samples: 26498560 | consumed tokens: 54269050880 | elapsed time per iteration (s): 0.80 | learning rate: 8.421E-05 | global batch size: 256 | lm loss: 1.962993E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.343 | TFLOPs: 19.38 | 31: iteration 103520/ 173500 | consumed samples: 26501120 | consumed tokens: 54274293760 | elapsed time per iteration (s): 0.80 | learning rate: 8.419E-05 | global batch size: 256 | lm loss: 1.992022E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.333 | TFLOPs: 19.44 | 31: iteration 103530/ 173500 | consumed samples: 26503680 | consumed tokens: 54279536640 | elapsed time per iteration (s): 0.84 | learning rate: 8.418E-05 | global batch size: 256 | lm loss: 1.976874E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.933 | TFLOPs: 18.39 | 31: iteration 103540/ 173500 | consumed samples: 26506240 | consumed tokens: 54284779520 | elapsed time per iteration (s): 0.80 | learning rate: 8.416E-05 | global batch size: 256 | lm loss: 1.957824E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.947 | TFLOPs: 19.36 | 31: iteration 103550/ 173500 | 
consumed samples: 26508800 | consumed tokens: 54290022400 | elapsed time per iteration (s): 0.81 | learning rate: 8.414E-05 | global batch size: 256 | lm loss: 1.957003E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.818 | TFLOPs: 19.17 | 31: iteration 103560/ 173500 | consumed samples: 26511360 | consumed tokens: 54295265280 | elapsed time per iteration (s): 0.79 | learning rate: 8.413E-05 | global batch size: 256 | lm loss: 1.961403E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.171 | TFLOPs: 19.67 | 31: iteration 103570/ 173500 | consumed samples: 26513920 | consumed tokens: 54300508160 | elapsed time per iteration (s): 0.79 | learning rate: 8.411E-05 | global batch size: 256 | lm loss: 1.975782E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.116 | TFLOPs: 19.55 | 31: iteration 103580/ 173500 | consumed samples: 26516480 | consumed tokens: 54305751040 | elapsed time per iteration (s): 0.78 | learning rate: 8.410E-05 | global batch size: 256 | lm loss: 1.993970E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.616 | TFLOPs: 19.94 | 31: iteration 103590/ 173500 | consumed samples: 26519040 | consumed tokens: 54310993920 | elapsed time per iteration (s): 0.81 | learning rate: 8.408E-05 | global batch size: 256 | lm loss: 1.967730E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.814 | TFLOPs: 19.11 | 31: iteration 103600/ 173500 | consumed samples: 26521600 | consumed tokens: 54316236800 | elapsed time per iteration (s): 0.78 | learning rate: 8.406E-05 | global batch size: 256 | lm loss: 2.012014E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 328.546 | TFLOPs: 19.88 | 31: iteration 103610/ 173500 | consumed samples: 26524160 | consumed tokens: 54321479680 | elapsed time per iteration (s): 0.81 | learning rate: 8.405E-05 | global batch size: 256 | lm loss: 1.973361E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.055 | TFLOPs: 19.18 | 31: iteration 103620/ 173500 | consumed samples: 26526720 | consumed tokens: 54326722560 | elapsed time per iteration (s): 0.78 | learning rate: 8.403E-05 | global batch size: 256 | lm loss: 1.964530E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.083 | TFLOPs: 19.79 | 31: iteration 103630/ 173500 | consumed samples: 26529280 | consumed tokens: 54331965440 | elapsed time per iteration (s): 0.80 | learning rate: 8.402E-05 | global batch size: 256 | lm loss: 1.987213E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.532 | TFLOPs: 19.45 | 31: iteration 103640/ 173500 | consumed samples: 26531840 | consumed tokens: 54337208320 | elapsed time per iteration (s): 0.80 | learning rate: 8.400E-05 | global batch size: 256 | lm loss: 1.957197E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.437 | TFLOPs: 19.33 | 31: iteration 103650/ 173500 | consumed samples: 26534400 | consumed tokens: 54342451200 | elapsed time per iteration (s): 0.79 | learning rate: 8.399E-05 | global batch size: 256 | lm loss: 1.957299E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.457 | TFLOPs: 19.57 | 31: iteration 103660/ 173500 | consumed samples: 26536960 | consumed tokens: 54347694080 | elapsed time per iteration (s): 0.71 | learning rate: 
8.397E-05 | global batch size: 256 | lm loss: 1.964711E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 359.876 | TFLOPs: 21.77 | 31: iteration 103670/ 173500 | consumed samples: 26539520 | consumed tokens: 54352936960 | elapsed time per iteration (s): 0.76 | learning rate: 8.395E-05 | global batch size: 256 | lm loss: 1.974528E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.338 | TFLOPs: 20.29 | 31: iteration 103680/ 173500 | consumed samples: 26542080 | consumed tokens: 54358179840 | elapsed time per iteration (s): 0.75 | learning rate: 8.394E-05 | global batch size: 256 | lm loss: 1.986742E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.689 | TFLOPs: 20.67 | 31: iteration 103690/ 173500 | consumed samples: 26544640 | consumed tokens: 54363422720 | elapsed time per iteration (s): 0.79 | learning rate: 8.392E-05 | global batch size: 256 | lm loss: 1.978246E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.903 | TFLOPs: 19.66 | 31: iteration 103700/ 173500 | consumed samples: 26547200 | consumed tokens: 54368665600 | elapsed time per iteration (s): 0.76 | learning rate: 8.391E-05 | global batch size: 256 | lm loss: 1.946973E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.430 | TFLOPs: 20.29 | 31: iteration 103710/ 173500 | consumed samples: 26549760 | consumed tokens: 54373908480 | elapsed time per iteration (s): 0.79 | learning rate: 8.389E-05 | global batch size: 256 | lm loss: 1.960964E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.820 | TFLOPs: 19.71 | 31: iteration 103720/ 173500 | 
consumed samples: 26552320 | consumed tokens: 54379151360 | elapsed time per iteration (s): 0.73 | learning rate: 8.388E-05 | global batch size: 256 | lm loss: 1.937682E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.326 | TFLOPs: 21.13 | 31: iteration 103730/ 173500 | consumed samples: 26554880 | consumed tokens: 54384394240 | elapsed time per iteration (s): 0.79 | learning rate: 8.386E-05 | global batch size: 256 | lm loss: 1.961981E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.428 | TFLOPs: 19.69 | 31: iteration 103740/ 173500 | consumed samples: 26557440 | consumed tokens: 54389637120 | elapsed time per iteration (s): 0.77 | learning rate: 8.384E-05 | global batch size: 256 | lm loss: 1.951697E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.527 | TFLOPs: 20.00 | 31: iteration 103750/ 173500 | consumed samples: 26560000 | consumed tokens: 54394880000 | elapsed time per iteration (s): 0.82 | learning rate: 8.383E-05 | global batch size: 256 | lm loss: 1.998130E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.487 | TFLOPs: 18.97 | 31: iteration 103760/ 173500 | consumed samples: 26562560 | consumed tokens: 54400122880 | elapsed time per iteration (s): 0.72 | learning rate: 8.381E-05 | global batch size: 256 | lm loss: 1.955562E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.752 | TFLOPs: 21.46 | 31: iteration 103770/ 173500 | consumed samples: 26565120 | consumed tokens: 54405365760 | elapsed time per iteration (s): 0.77 | learning rate: 8.380E-05 | global batch size: 256 | lm loss: 1.944764E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 330.650 | TFLOPs: 20.00 | 31: iteration 103780/ 173500 | consumed samples: 26567680 | consumed tokens: 54410608640 | elapsed time per iteration (s): 0.76 | learning rate: 8.378E-05 | global batch size: 256 | lm loss: 1.997833E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.790 | TFLOPs: 20.25 | 31: iteration 103790/ 173500 | consumed samples: 26570240 | consumed tokens: 54415851520 | elapsed time per iteration (s): 0.72 | learning rate: 8.377E-05 | global batch size: 256 | lm loss: 1.945114E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.749 | TFLOPs: 21.64 | 31: iteration 103800/ 173500 | consumed samples: 26572800 | consumed tokens: 54421094400 | elapsed time per iteration (s): 0.78 | learning rate: 8.375E-05 | global batch size: 256 | lm loss: 1.989137E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.620 | TFLOPs: 19.82 | 31: iteration 103810/ 173500 | consumed samples: 26575360 | consumed tokens: 54426337280 | elapsed time per iteration (s): 0.79 | learning rate: 8.373E-05 | global batch size: 256 | lm loss: 1.967469E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.033 | TFLOPs: 19.48 | 31: iteration 103820/ 173500 | consumed samples: 26577920 | consumed tokens: 54431580160 | elapsed time per iteration (s): 0.76 | learning rate: 8.372E-05 | global batch size: 256 | lm loss: 1.976968E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.804 | TFLOPs: 20.38 | 31: iteration 103830/ 173500 | consumed samples: 26580480 | consumed tokens: 54436823040 | elapsed time per iteration (s): 0.77 | learning rate: 
8.370E-05 | global batch size: 256 | lm loss: 1.989452E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.458 | TFLOPs: 20.17 | 31: iteration 103840/ 173500 | consumed samples: 26583040 | consumed tokens: 54442065920 | elapsed time per iteration (s): 0.78 | learning rate: 8.369E-05 | global batch size: 256 | lm loss: 1.959351E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.511 | TFLOPs: 19.81 | 31: iteration 103850/ 173500 | consumed samples: 26585600 | consumed tokens: 54447308800 | elapsed time per iteration (s): 0.77 | learning rate: 8.367E-05 | global batch size: 256 | lm loss: 1.977909E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.908 | TFLOPs: 20.08 | 31: iteration 103860/ 173500 | consumed samples: 26588160 | consumed tokens: 54452551680 | elapsed time per iteration (s): 0.74 | learning rate: 8.366E-05 | global batch size: 256 | lm loss: 1.950732E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.533 | TFLOPs: 20.96 | 31: iteration 103870/ 173500 | consumed samples: 26590720 | consumed tokens: 54457794560 | elapsed time per iteration (s): 0.78 | learning rate: 8.364E-05 | global batch size: 256 | lm loss: 1.963759E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.140 | TFLOPs: 19.79 | 31: iteration 103880/ 173500 | consumed samples: 26593280 | consumed tokens: 54463037440 | elapsed time per iteration (s): 0.72 | learning rate: 8.362E-05 | global batch size: 256 | lm loss: 1.961686E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.008 | TFLOPs: 21.42 | 31: iteration 103890/ 173500 | 
consumed samples: 26595840 | consumed tokens: 54468280320 | elapsed time per iteration (s): 0.79 | learning rate: 8.361E-05 | global batch size: 256 | lm loss: 2.005750E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.379 | TFLOPs: 19.62 | 31: iteration 103900/ 173500 | consumed samples: 26598400 | consumed tokens: 54473523200 | elapsed time per iteration (s): 0.73 | learning rate: 8.359E-05 | global batch size: 256 | lm loss: 1.975691E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.212 | TFLOPs: 21.25 | 31: iteration 103910/ 173500 | consumed samples: 26600960 | consumed tokens: 54478766080 | elapsed time per iteration (s): 0.77 | learning rate: 8.358E-05 | global batch size: 256 | lm loss: 1.958573E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.633 | TFLOPs: 20.18 | 31: iteration 103920/ 173500 | consumed samples: 26603520 | consumed tokens: 54484008960 | elapsed time per iteration (s): 0.73 | learning rate: 8.356E-05 | global batch size: 256 | lm loss: 1.946313E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.170 | TFLOPs: 21.12 | 31: iteration 103930/ 173500 | consumed samples: 26606080 | consumed tokens: 54489251840 | elapsed time per iteration (s): 0.75 | learning rate: 8.355E-05 | global batch size: 256 | lm loss: 1.972517E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.034 | TFLOPs: 20.63 | 31: iteration 103940/ 173500 | consumed samples: 26608640 | consumed tokens: 54494494720 | elapsed time per iteration (s): 0.72 | learning rate: 8.353E-05 | global batch size: 256 | lm loss: 1.974326E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 355.082 | TFLOPs: 21.48 |
31: iteration 103950/ 173500 | consumed samples: 26611200 | consumed tokens: 54499737600 | elapsed time per iteration (s): 0.76 | learning rate: 8.351E-05 | global batch size: 256 | lm loss: 2.000306E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.183 | TFLOPs: 20.46 |
31: iteration 103960/ 173500 | consumed samples: 26613760 | consumed tokens: 54504980480 | elapsed time per iteration (s): 0.79 | learning rate: 8.350E-05 | global batch size: 256 | lm loss: 1.960357E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.994 | TFLOPs: 19.60 |
31: iteration 103970/ 173500 | consumed samples: 26616320 | consumed tokens: 54510223360 | elapsed time per iteration (s): 0.75 | learning rate: 8.348E-05 | global batch size: 256 | lm loss: 1.969007E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.568 | TFLOPs: 20.72 |
31: iteration 103980/ 173500 | consumed samples: 26618880 | consumed tokens: 54515466240 | elapsed time per iteration (s): 0.76 | learning rate: 8.347E-05 | global batch size: 256 | lm loss: 1.942582E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.025 | TFLOPs: 20.51 |
31: iteration 103990/ 173500 | consumed samples: 26621440 | consumed tokens: 54520709120 | elapsed time per iteration (s): 0.79 | learning rate: 8.345E-05 | global batch size: 256 | lm loss: 1.959589E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.721 | TFLOPs: 19.64 |
0: [2022-11-26 17:35:11,330] [INFO] [logging.py:68:log_dist] [Rank 0] step=104000, skipped=0, lr=[8.343492337309329e-05, 8.343492337309329e-05, 8.343492337309329e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
31: iteration 104000/ 173500 | consumed samples: 26624000 | consumed tokens: 54525952000 | elapsed time per iteration (s): 0.80 | learning rate: 8.343E-05 | global batch size: 256 | lm loss: 1.967511E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.507 | TFLOPs: 19.39 |
0: steps: 104000 loss: 1.9716 iter time (s): 0.778 samples/sec: 328.846
31: --------------------------------------------------------------------------------------------
31: valid loss at iteration 104000 | lm loss value: 1.901816E+00 | lm loss PPL: 6.698048E+00 |
31: --------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 104000 to checkpoints_1b1long
0: [2022-11-26 17:35:11,611] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step104000 is begin to save!
0: [2022-11-26 17:35:11,624] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_01-model_00-model_states.pt...
0: [2022-11-26 17:35:11,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_01-model_00-model_states.pt.
0: [2022-11-26 17:35:11,876] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_03-model_00-model_states.pt...
0: [2022-11-26 17:35:11,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_03-model_00-model_states.pt.
0: [2022-11-26 17:35:11,957] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_04-model_00-model_states.pt...
0: [2022-11-26 17:35:12,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_04-model_00-model_states.pt.
0: [2022-11-26 17:35:12,034] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_05-model_00-model_states.pt... 0: [2022-11-26 17:35:12,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_05-model_00-model_states.pt. 0: [2022-11-26 17:35:12,117] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_06-model_00-model_states.pt... 0: [2022-11-26 17:35:12,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_06-model_00-model_states.pt. 0: [2022-11-26 17:35:12,197] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_07-model_00-model_states.pt... 0: [2022-11-26 17:35:12,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_07-model_00-model_states.pt. 0: [2022-11-26 17:35:12,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_08-model_00-model_states.pt... 0: [2022-11-26 17:35:12,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_08-model_00-model_states.pt. 0: [2022-11-26 17:35:12,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_09-model_00-model_states.pt... 0: [2022-11-26 17:35:12,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_09-model_00-model_states.pt. 0: [2022-11-26 17:35:12,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_10-model_00-model_states.pt... 0: [2022-11-26 17:35:12,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_10-model_00-model_states.pt. 
0: [2022-11-26 17:35:12,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_11-model_00-model_states.pt... 0: [2022-11-26 17:35:12,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_11-model_00-model_states.pt. 0: [2022-11-26 17:35:12,575] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_12-model_00-model_states.pt... 0: [2022-11-26 17:35:12,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_12-model_00-model_states.pt. 0: [2022-11-26 17:35:12,651] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_13-model_00-model_states.pt... 0: [2022-11-26 17:35:12,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_13-model_00-model_states.pt. 0: [2022-11-26 17:35:12,726] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_14-model_00-model_states.pt... 0: [2022-11-26 17:35:12,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_14-model_00-model_states.pt. 0: [2022-11-26 17:35:12,803] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_15-model_00-model_states.pt... 0: [2022-11-26 17:35:12,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_15-model_00-model_states.pt. 0: [2022-11-26 17:35:12,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_16-model_00-model_states.pt... 0: [2022-11-26 17:35:12,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_16-model_00-model_states.pt. 
0: [2022-11-26 17:35:12,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_17-model_00-model_states.pt... 0: [2022-11-26 17:35:13,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_17-model_00-model_states.pt. 0: [2022-11-26 17:35:13,224] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_18-model_00-model_states.pt... 0: [2022-11-26 17:35:13,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_18-model_00-model_states.pt. 0: [2022-11-26 17:35:13,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_19-model_00-model_states.pt... 0: [2022-11-26 17:35:13,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_19-model_00-model_states.pt. 0: [2022-11-26 17:35:13,374] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_20-model_00-model_states.pt... 0: [2022-11-26 17:35:13,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_20-model_00-model_states.pt. 0: [2022-11-26 17:35:13,449] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_21-model_00-model_states.pt... 0: [2022-11-26 17:35:13,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_21-model_00-model_states.pt. 0: [2022-11-26 17:35:13,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_22-model_00-model_states.pt... 0: [2022-11-26 17:35:13,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_22-model_00-model_states.pt. 
0: [2022-11-26 17:35:13,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_23-model_00-model_states.pt... 0: [2022-11-26 17:35:13,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_23-model_00-model_states.pt. 0: [2022-11-26 17:35:13,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_24-model_00-model_states.pt... 0: [2022-11-26 17:35:13,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_24-model_00-model_states.pt. 0: [2022-11-26 17:35:13,744] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_25-model_00-model_states.pt... 0: [2022-11-26 17:35:13,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_25-model_00-model_states.pt. 0: [2022-11-26 17:35:13,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_26-model_00-model_states.pt... 0: [2022-11-26 17:35:13,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_26-model_00-model_states.pt. 0: [2022-11-26 17:35:13,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_27-model_00-model_states.pt... 0: [2022-11-26 17:35:13,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_27-model_00-model_states.pt. 0: [2022-11-26 17:35:13,982] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_28-model_00-model_states.pt... 0: [2022-11-26 17:35:14,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_28-model_00-model_states.pt. 
0: [2022-11-26 17:35:14,059] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/layer_30-model_00-model_states.pt... 0: [2022-11-26 17:35:14,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/layer_30-model_00-model_states.pt. 0: [2022-11-26 17:35:14,061] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step104000/mp_rank_00_model_states.pt 0: [2022-11-26 17:35:14,061] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/mp_rank_00_model_states.pt... 0: [2022-11-26 17:35:14,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/mp_rank_00_model_states.pt. 0: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
6: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
4: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
1: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
13: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
20: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 
11: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 
29: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 
17: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 
26: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
6: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
8: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
2: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 
15: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 
11: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
14: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 
17: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 
19: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
10: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
20: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 
29: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 
0: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 
28: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
8: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:35:14,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:35:14,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:35:14,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 17:35:14,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
10: [2022-11-26 17:35:14,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:35:14,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 17:35:14,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 4: [2022-11-26 17:35:14,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:35:14,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 17:35:14,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 21: [2022-11-26 17:35:14,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:35:14,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 17:35:14,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 28: [2022-11-26 17:35:14,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 25: [2022-11-26 17:35:14,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 19: [2022-11-26 17:35:14,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
19: [2022-11-26 17:35:14,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 25: [2022-11-26 17:35:14,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 19: [2022-11-26 17:35:14,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 25: [2022-11-26 17:35:14,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 10: [2022-11-26 17:35:14,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:35:14,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 17:35:14,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 29: [2022-11-26 17:35:14,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:35:14,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 29: [2022-11-26 17:35:14,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 8: [2022-11-26 17:35:14,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 29: [2022-11-26 17:35:14,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
8: [2022-11-26 17:35:14,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 30: [2022-11-26 17:35:14,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:35:14,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 17:35:14,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 4: [2022-11-26 17:35:14,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:35:14,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 17:35:14,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 29: [2022-11-26 17:35:14,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:35:14,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:35:14,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 17:35:14,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 22: [2022-11-26 17:35:14,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
22: [2022-11-26 17:35:14,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 17:35:14,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 1: [2022-11-26 17:35:14,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 29: [2022-11-26 17:35:14,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 17:35:14,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 1: [2022-11-26 17:35:14,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 17:35:14,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 7: [2022-11-26 17:35:14,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:35:14,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 17:35:14,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 16: [2022-11-26 17:35:14,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 25: [2022-11-26 17:35:14,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
16: [2022-11-26 17:35:14,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 17:35:14,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 25: [2022-11-26 17:35:14,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 9: [2022-11-26 17:35:14,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:35:14,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:35:14,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 25: [2022-11-26 17:35:14,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 21: [2022-11-26 17:35:14,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 9: [2022-11-26 17:35:14,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 21: [2022-11-26 17:35:14,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 26: [2022-11-26 17:35:14,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 
26: [2022-11-26 17:35:14,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 17:35:14,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 26: [2022-11-26 17:35:14,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 17:35:14,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 17:35:14,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 8: [2022-11-26 17:35:14,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:35:14,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 17:35:14,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 16: [2022-11-26 17:35:14,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 17:35:14,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 17:35:14,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 9: [2022-11-26 17:35:14,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-26 17:35:14,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 24: [2022-11-26 17:35:14,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:35:14,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 24: [2022-11-26 17:35:14,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 17:35:14,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 6: [2022-11-26 17:35:14,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:35:14,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 17:35:14,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 6: [2022-11-26 17:35:14,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:35:14,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:35:14,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 17:35:14,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
30: [2022-11-26 17:35:14,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:35:14,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 30: [2022-11-26 17:35:14,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 17:35:14,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 2: [2022-11-26 17:35:14,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:35:14,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 2: [2022-11-26 17:35:14,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 17:35:14,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 11: [2022-11-26 17:35:14,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:35:14,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 17:35:14,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 27: [2022-11-26 17:35:14,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
27: [2022-11-26 17:35:14,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 17:35:14,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 29: [2022-11-26 17:35:14,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 17:35:14,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 17:35:14,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 21: [2022-11-26 17:35:14,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:35:14,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 17:35:14,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 16: [2022-11-26 17:35:14,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 24: [2022-11-26 17:35:14,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 16: [2022-11-26 17:35:14,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 17:35:14,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
24: [2022-11-26 17:35:14,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 1: [2022-11-26 17:35:14,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 24: [2022-11-26 17:35:14,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 19: [2022-11-26 17:35:14,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:35:14,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 19: [2022-11-26 17:35:14,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 1: [2022-11-26 17:35:14,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 19: [2022-11-26 17:35:14,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 10: [2022-11-26 17:35:14,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:35:14,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-26 17:35:14,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 10: [2022-11-26 17:35:14,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 13: [2022-11-26 17:35:14,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 10: [2022-11-26 17:35:14,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 22: [2022-11-26 17:35:14,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 25: [2022-11-26 17:35:14,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:35:14,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 17:35:14,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 25: [2022-11-26 17:35:14,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 17:35:14,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 8: [2022-11-26 17:35:14,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 17:35:14,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 17:35:14,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 18: [2022-11-26 17:35:14,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:35:14,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 17:35:14,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 2: [2022-11-26 17:35:14,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:35:14,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 17:35:14,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 2: [2022-11-26 17:35:14,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:35:14,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 17:35:14,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 18: [2022-11-26 17:35:14,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
18: [2022-11-26 17:35:14,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 7: [2022-11-26 17:35:14,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:35:14,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:35:14,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 17:35:14,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 18: [2022-11-26 17:35:14,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 7: [2022-11-26 17:35:14,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 7: [2022-11-26 17:35:14,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 13: [2022-11-26 17:35:14,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:35:14,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 17:35:14,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 19: [2022-11-26 17:35:14,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-26 17:35:14,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 17:35:14,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 24: [2022-11-26 17:35:14,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 17:35:14,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 17:35:14,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 30: [2022-11-26 17:35:14,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:35:14,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 17:35:14,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 27: [2022-11-26 17:35:14,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:35:14,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 17:35:14,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 26: [2022-11-26 17:35:14,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
26: [2022-11-26 17:35:14,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 17:35:14,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 22: [2022-11-26 17:35:14,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:35:14,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 17:35:14,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 0: [2022-11-26 17:35:14,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:35:14,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 25: [2022-11-26 17:35:14,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 17:35:14,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 0: [2022-11-26 17:35:14,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 25: [2022-11-26 17:35:14,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 21: [2022-11-26 17:35:14,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
21: [2022-11-26 17:35:14,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 17:35:14,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 28: [2022-11-26 17:35:14,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 17:35:14,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 28: [2022-11-26 17:35:14,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:35:14,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 28: [2022-11-26 17:35:14,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 17:35:14,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 4: [2022-11-26 17:35:14,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 17:35:14,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 4: [2022-11-26 17:35:14,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 17:35:14,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 17:35:14,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 28: [2022-11-26 17:35:14,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 17:35:14,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 16: [2022-11-26 17:35:14,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 28: [2022-11-26 17:35:14,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 16: [2022-11-26 17:35:14,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 17:35:14,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 8: [2022-11-26 17:35:14,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:35:14,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 17:35:14,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 29: [2022-11-26 17:35:14,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-26 17:35:14,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 17:35:14,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 13: [2022-11-26 17:35:14,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:35:14,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 17:35:14,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 6: [2022-11-26 17:35:14,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:35:14,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:35:14,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 17:35:14,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 17:35:14,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 6: [2022-11-26 17:35:14,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 3: [2022-11-26 17:35:14,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 17:35:14,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:35:14,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:35:14,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:35:14,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 17:35:14,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 17:35:14,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 17:35:14,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 17:35:14,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 3: [2022-11-26 17:35:14,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 3: [2022-11-26 17:35:14,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 3: [2022-11-26 17:35:14,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 15: [2022-11-26 17:35:14,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-26 17:35:14,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 17:35:14,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 19: [2022-11-26 17:35:14,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 17:35:14,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 17:35:14,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 15: [2022-11-26 17:35:14,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 24: [2022-11-26 17:35:14,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:35:14,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 17:35:14,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 24: [2022-11-26 17:35:14,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 17:35:14,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 15: [2022-11-26 17:35:14,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-26 17:35:14,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 17:35:14,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 22: [2022-11-26 17:35:14,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:35:14,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 17:35:14,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 27: [2022-11-26 17:35:14,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:35:14,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 17:35:14,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 7: [2022-11-26 17:35:14,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:35:14,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 17:35:14,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 1: [2022-11-26 17:35:14,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 17:35:14,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:35:14,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 17:35:14,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 9: [2022-11-26 17:35:14,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:35:14,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 9: [2022-11-26 17:35:14,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 1: [2022-11-26 17:35:14,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 9: [2022-11-26 17:35:14,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 10: [2022-11-26 17:35:14,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:35:14,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 17:35:14,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 26: [2022-11-26 17:35:14,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-26 17:35:14,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 17:35:14,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 2: [2022-11-26 17:35:14,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:35:14,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 17:35:14,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 30: [2022-11-26 17:35:14,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:35:14,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 17:35:14,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 14: [2022-11-26 17:35:14,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:35:14,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:35:14,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-26 17:35:14,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 17:35:14,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:35:14,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 17:35:14,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 20: [2022-11-26 17:35:14,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 17:35:14,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:35:14,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 20: [2022-11-26 17:35:14,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 14: [2022-11-26 17:35:14,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 20: [2022-11-26 17:35:14,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 14: [2022-11-26 17:35:14,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 20: [2022-11-26 17:35:14,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
20: [2022-11-26 17:35:14,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 14: [2022-11-26 17:35:14,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 20: [2022-11-26 17:35:14,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:35:14,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 20: [2022-11-26 17:35:14,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 17:35:14,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 28: [2022-11-26 17:35:14,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 17:35:14,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 17:35:14,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 12: [2022-11-26 17:35:14,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:35:14,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:35:14,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 17:35:14,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:35:14,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 17:35:14,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 17:35:14,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 17:35:14,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 17:35:14,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 12: [2022-11-26 17:35:14,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 12: [2022-11-26 17:35:14,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 12: [2022-11-26 17:35:14,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 1: [2022-11-26 17:35:14,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:35:14,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 17:35:14,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
23: [2022-11-26 17:35:14,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:35:14,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:35:14,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:35:14,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 17:35:14,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 17:35:14,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 17:35:14,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 23: [2022-11-26 17:35:14,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 23: [2022-11-26 17:35:14,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 5: [2022-11-26 17:35:14,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:35:14,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-26 17:35:14,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:35:14,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:35:14,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 17:35:14,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 17:35:14,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 17:35:14,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 5: [2022-11-26 17:35:14,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 17:35:14,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 5: [2022-11-26 17:35:14,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 5: [2022-11-26 17:35:14,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 4: [2022-11-26 17:35:14,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 17:35:14,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 17:35:14,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 6: [2022-11-26 17:35:14,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:35:14,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 17:35:14,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 0: [2022-11-26 17:35:14,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:35:14,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 17:35:14,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 9: [2022-11-26 17:35:14,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:35:14,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 17:35:14,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 11: [2022-11-26 17:35:14,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-26 17:35:14,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 17:35:14,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 11: [2022-11-26 17:35:14,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:35:14,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 17:35:14,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 11: [2022-11-26 17:35:14,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:35:14,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 17:35:14,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 11: [2022-11-26 17:35:14,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:35:14,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 17:35:14,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 5: [2022-11-26 17:35:14,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 17:35:14,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 17:35:14,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 0: [2022-11-26 17:35:14,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:35:14,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 17:35:14,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 0: [2022-11-26 17:35:14,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:35:14,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 17:35:14,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 10: [2022-11-26 17:35:14,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:35:14,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 17:35:14,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 9: [2022-11-26 17:35:14,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 17:35:14,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 17:35:14,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 3: [2022-11-26 17:35:14,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:35:14,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 17:35:14,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 16: [2022-11-26 17:35:14,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 31: [2022-11-26 17:35:14,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 17:35:14,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 17:35:14,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 17:35:14,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 17:35:14,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 
16: [2022-11-26 17:35:14,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 31: [2022-11-26 17:35:14,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 17:35:14,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 17:35:14,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 17:35:14,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 16: [2022-11-26 17:35:14,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 31: [2022-11-26 17:35:14,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 17:35:14,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 31: [2022-11-26 17:35:14,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 31: [2022-11-26 17:35:14,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 31: [2022-11-26 17:35:14,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 31: [2022-11-26 17:35:14,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
30: [2022-11-26 17:35:14,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:35:14,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 17:35:14,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 26: [2022-11-26 17:35:14,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 17:35:14,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 17:35:14,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 29: [2022-11-26 17:35:14,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 17:35:14,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 17:35:14,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 25: [2022-11-26 17:35:14,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 17:35:14,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 17:35:14,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
17: [2022-11-26 17:35:14,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 17:35:14,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 17:35:14,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 17:35:14,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 17:35:14,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 17:35:14,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 17:35:14,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 17:35:14,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 17:35:14,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 17: [2022-11-26 17:35:14,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 17: [2022-11-26 17:35:14,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 17: [2022-11-26 17:35:14,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
17: [2022-11-26 17:35:14,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 17:35:14,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 17:35:14,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 7: [2022-11-26 17:35:14,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:35:14,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 17:35:14,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 19: [2022-11-26 17:35:14,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 17:35:14,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 17:35:14,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 12: [2022-11-26 17:35:14,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:35:14,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 17:35:14,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
21: [2022-11-26 17:35:14,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:35:14,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 17:35:14,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 18: [2022-11-26 17:35:14,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:35:14,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 17:35:14,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 2: [2022-11-26 17:35:14,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:35:14,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 17:35:14,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 15: [2022-11-26 17:35:14,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:35:14,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 17:35:14,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
23: [2022-11-26 17:35:14,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:35:14,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 17:35:14,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 27: [2022-11-26 17:35:14,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:35:14,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 17:35:14,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 8: [2022-11-26 17:35:14,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:35:14,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 17:35:14,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 28: [2022-11-26 17:35:14,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 17:35:14,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 17:35:14,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
13: [2022-11-26 17:35:14,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:35:14,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 17:35:14,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 20: [2022-11-26 17:35:14,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:35:14,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:35:14,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 20: [2022-11-26 17:35:14,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 22: [2022-11-26 17:35:14,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 20: [2022-11-26 17:35:14,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 24: [2022-11-26 17:35:14,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:35:14,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
24: [2022-11-26 17:35:14,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 14: [2022-11-26 17:35:14,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 24: [2022-11-26 17:35:14,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 14: [2022-11-26 17:35:14,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 31: [2022-11-26 17:35:14,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 17:35:14,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 17:35:14,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 0: [2022-11-26 17:35:14,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:35:14,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:35:14,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 17:35:14,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 4: [2022-11-26 17:35:14,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 17:35:14,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 17:35:14,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 0: [2022-11-26 17:35:14,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 17:35:14,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 5: [2022-11-26 17:35:14,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:35:14,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 17:35:14,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 9: [2022-11-26 17:35:14,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:35:14,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 17:35:14,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 3: [2022-11-26 17:35:14,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 17:35:14,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 17:35:14,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 1: [2022-11-26 17:35:14,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:35:14,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 17:35:14,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 17: [2022-11-26 17:35:14,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 17:35:14,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 17:35:14,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 16: [2022-11-26 17:35:14,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 17:35:14,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 17:35:14,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 29: [2022-11-26 17:35:14,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
29: [2022-11-26 17:35:14,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 17:35:14,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 30: [2022-11-26 17:35:14,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:35:14,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 17:35:14,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 19: [2022-11-26 17:35:14,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:35:14,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 19: [2022-11-26 17:35:14,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 26: [2022-11-26 17:35:14,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 19: [2022-11-26 17:35:14,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 10: [2022-11-26 17:35:14,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 17:35:14,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
26: [2022-11-26 17:35:14,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 17:35:14,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 21: [2022-11-26 17:35:14,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:35:14,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 17:35:14,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 7: [2022-11-26 17:35:14,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:35:14,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 17:35:14,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 25: [2022-11-26 17:35:14,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 17:35:14,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 17:35:14,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 18: [2022-11-26 17:35:14,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-26 17:35:14,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 17:35:14,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 2: [2022-11-26 17:35:14,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:35:14,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 6: [2022-11-26 17:35:14,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:35:14,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 6: [2022-11-26 17:35:14,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 17:35:14,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 14: [2022-11-26 17:35:14,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:35:14,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:35:14,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 17:35:14,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
27: [2022-11-26 17:35:14,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 17:35:14,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 24: [2022-11-26 17:35:14,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 28: [2022-11-26 17:35:14,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 24: [2022-11-26 17:35:14,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 28: [2022-11-26 17:35:14,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 17:35:14,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 24: [2022-11-26 17:35:14,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 15: [2022-11-26 17:35:14,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:35:14,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 17:35:14,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 22: [2022-11-26 17:35:14,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
22: [2022-11-26 17:35:14,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 17:35:14,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 23: [2022-11-26 17:35:14,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:35:14,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 17:35:14,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 0: [2022-11-26 17:35:14,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:35:14,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:35:14,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 17:35:14,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 8: [2022-11-26 17:35:14,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 17:35:14,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 12: [2022-11-26 17:35:14,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 17:35:14,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 17:35:14,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 13: [2022-11-26 17:35:14,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:35:14,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 17:35:14,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 20: [2022-11-26 17:35:14,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 31: [2022-11-26 17:35:14,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 20: [2022-11-26 17:35:14,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 31: [2022-11-26 17:35:14,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 20: [2022-11-26 17:35:14,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 31: [2022-11-26 17:35:14,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 11: [2022-11-26 17:35:14,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-26 17:35:14,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 17:35:14,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 4: [2022-11-26 17:35:14,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:35:14,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 17:35:14,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 5: [2022-11-26 17:35:14,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:35:14,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 17:35:14,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 17: [2022-11-26 17:35:14,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 17:35:14,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 17:35:14,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 3: [2022-11-26 17:35:14,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 17:35:14,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 17:35:14,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 16: [2022-11-26 17:35:14,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 17:35:14,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 17:35:14,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 26: [2022-11-26 17:35:14,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 17:35:14,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 17:35:14,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 30: [2022-11-26 17:35:14,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:35:14,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 17:35:14,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 10: [2022-11-26 17:35:14,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-26 17:35:14,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 17:35:14,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 19: [2022-11-26 17:35:14,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 17:35:14,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 17:35:14,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 9: [2022-11-26 17:35:14,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 29: [2022-11-26 17:35:14,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:35:14,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 29: [2022-11-26 17:35:14,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 9: [2022-11-26 17:35:14,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 29: [2022-11-26 17:35:14,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 7: [2022-11-26 17:35:14,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-26 17:35:14,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 17:35:14,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 18: [2022-11-26 17:35:14,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:35:14,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 6: [2022-11-26 17:35:14,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:35:14,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 6: [2022-11-26 17:35:14,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 21: [2022-11-26 17:35:14,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:35:14,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 21: [2022-11-26 17:35:14,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 17:35:14,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 1: [2022-11-26 17:35:14,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 17:35:14,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 17:35:14,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 2: [2022-11-26 17:35:14,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:35:14,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 17:35:14,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 12: [2022-11-26 17:35:14,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:35:14,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 17:35:14,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 3: [2022-11-26 17:35:14,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:35:14,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 17:35:14,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 25: [2022-11-26 17:35:14,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
25: [2022-11-26 17:35:14,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 17:35:14,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 15: [2022-11-26 17:35:14,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:35:14,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:35:14,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 30: [2022-11-26 17:35:14,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 15: [2022-11-26 17:35:14,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 30: [2022-11-26 17:35:14,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 28: [2022-11-26 17:35:14,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 17:35:14,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 17:35:14,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 12: [2022-11-26 17:35:14,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 17:35:14,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 17:35:14,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 14: [2022-11-26 17:35:14,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:35:14,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 17:35:14,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 1: [2022-11-26 17:35:14,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:35:14,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 17:35:14,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 29: [2022-11-26 17:35:14,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 17:35:14,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 17:35:14,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 31: [2022-11-26 17:35:14,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
31: [2022-11-26 17:35:14,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 17:35:14,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 23: [2022-11-26 17:35:14,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:35:14,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 17:35:14,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 19: [2022-11-26 17:35:14,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 17:35:14,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 17:35:14,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 9: [2022-11-26 17:35:14,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:35:14,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 17:35:14,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 8: [2022-11-26 17:35:14,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 17:35:14,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 17:35:14,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 6: [2022-11-26 17:35:14,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:35:14,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:35:14,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:35:14,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 11: [2022-11-26 17:35:14,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:35:14,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 16: [2022-11-26 17:35:14,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:35:14,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
0: [2022-11-26 17:35:14,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 6: [2022-11-26 17:35:14,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 11: [2022-11-26 17:35:14,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 0: [2022-11-26 17:35:14,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 11: [2022-11-26 17:35:14,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 16: [2022-11-26 17:35:14,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 17:35:14,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 4: [2022-11-26 17:35:14,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:35:14,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:35:14,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 10: [2022-11-26 17:35:14,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:35:14,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
27: [2022-11-26 17:35:14,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 10: [2022-11-26 17:35:14,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 17:35:14,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 27: [2022-11-26 17:35:14,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 22: [2022-11-26 17:35:14,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:35:14,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 17:35:14,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 17: [2022-11-26 17:35:14,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 17:35:14,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 17:35:14,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 2: [2022-11-26 17:35:14,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 17:35:14,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 17:35:14,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 24: [2022-11-26 17:35:14,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 17:35:14,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 17:35:14,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 23: [2022-11-26 17:35:14,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:35:14,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 17:35:14,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 25: [2022-11-26 17:35:14,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 17:35:14,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 17:35:14,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 18: [2022-11-26 17:35:14,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
18: [2022-11-26 17:35:14,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 26: [2022-11-26 17:35:14,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:35:14,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 26: [2022-11-26 17:35:14,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 17:35:14,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 20: [2022-11-26 17:35:14,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 17:35:14,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 17:35:14,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 7: [2022-11-26 17:35:14,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:35:14,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 17:35:14,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 8: [2022-11-26 17:35:14,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 17:35:14,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 17:35:14,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 14: [2022-11-26 17:35:14,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:35:14,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 17:35:14,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 28: [2022-11-26 17:35:14,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 17:35:14,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 17:35:14,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 22: [2022-11-26 17:35:14,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:35:14,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 27: [2022-11-26 17:35:14,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
27: [2022-11-26 17:35:14,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:35:14,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 27: [2022-11-26 17:35:14,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 17:35:14,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 13: [2022-11-26 17:35:14,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:35:14,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 27: [2022-11-26 17:35:14,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 13: [2022-11-26 17:35:14,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 15: [2022-11-26 17:35:14,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:35:14,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 15: [2022-11-26 17:35:14,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 17:35:14,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
15: [2022-11-26 17:35:14,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:35:14,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 17:35:14,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 24: [2022-11-26 17:35:14,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 17:35:14,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 17:35:14,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 23: [2022-11-26 17:35:14,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:35:14,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 17:35:14,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 20: [2022-11-26 17:35:14,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 17:35:14,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 17:35:14,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
13: [2022-11-26 17:35:14,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 20: [2022-11-26 17:35:14,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:35:14,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 20: [2022-11-26 17:35:14,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 13: [2022-11-26 17:35:14,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 20: [2022-11-26 17:35:14,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 13: [2022-11-26 17:35:14,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:35:14,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 17:35:14,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 5: [2022-11-26 17:35:14,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:35:14,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step104000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 17:35:14,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
0: successfully saved checkpoint at iteration 104000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2795.93 31: iteration 104010/ 173500 | consumed samples: 26626560 | consumed tokens: 54531194880 | elapsed time per iteration (s): 1.10 | learning rate: 8.342E-05 | global batch size: 256 | lm loss: 1.980769E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.991 | TFLOPs: 14.03 | 31: iteration 104020/ 173500 | consumed samples: 26629120 | consumed tokens: 54536437760 | elapsed time per iteration (s): 0.79 | learning rate: 8.340E-05 | global batch size: 256 | lm loss: 1.987375E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.590 | TFLOPs: 19.58 | 31: iteration 104030/ 173500 | consumed samples: 26631680 | consumed tokens: 54541680640 | elapsed time per iteration (s): 0.78 | learning rate: 8.339E-05 | global batch size: 256 | lm loss: 1.946234E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.052 | TFLOPs: 19.97 | 31: iteration 104040/ 173500 | consumed samples: 26634240 | consumed tokens: 54546923520 | elapsed time per iteration (s): 0.81 | learning rate: 8.337E-05 | global batch size: 256 | lm loss: 1.955344E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.133 | TFLOPs: 19.19 | 31: iteration 104050/ 173500 | consumed samples: 26636800 | consumed tokens: 54552166400 | elapsed time per iteration (s): 0.81 | learning rate: 8.336E-05 | global batch size: 256 | lm loss: 1.966353E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.852 | TFLOPs: 19.23 | 31: iteration 104060/ 173500 | consumed samples: 26639360 | consumed tokens: 54557409280 | elapsed time per iteration (s): 
0.87 | learning rate: 8.334E-05 | global batch size: 256 | lm loss: 1.955654E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.539 | TFLOPs: 17.76 | 31: iteration 104070/ 173500 | consumed samples: 26641920 | consumed tokens: 54562652160 | elapsed time per iteration (s): 0.84 | learning rate: 8.332E-05 | global batch size: 256 | lm loss: 1.979456E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.567 | TFLOPs: 18.37 | 31: iteration 104080/ 173500 | consumed samples: 26644480 | consumed tokens: 54567895040 | elapsed time per iteration (s): 0.84 | learning rate: 8.331E-05 | global batch size: 256 | lm loss: 1.939662E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.978 | TFLOPs: 18.39 | 31: iteration 104090/ 173500 | consumed samples: 26647040 | consumed tokens: 54573137920 | elapsed time per iteration (s): 0.84 | learning rate: 8.329E-05 | global batch size: 256 | lm loss: 1.977247E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.963 | TFLOPs: 18.39 | 31: iteration 104100/ 173500 | consumed samples: 26649600 | consumed tokens: 54578380800 | elapsed time per iteration (s): 0.80 | learning rate: 8.328E-05 | global batch size: 256 | lm loss: 1.986189E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.174 | TFLOPs: 19.43 | 31: iteration 104110/ 173500 | consumed samples: 26652160 | consumed tokens: 54583623680 | elapsed time per iteration (s): 0.81 | learning rate: 8.326E-05 | global batch size: 256 | lm loss: 1.952151E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.251 | TFLOPs: 19.13 | 31: 
iteration 104120/ 173500 | consumed samples: 26654720 | consumed tokens: 54588866560 | elapsed time per iteration (s): 0.82 | learning rate: 8.325E-05 | global batch size: 256 | lm loss: 1.975431E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.553 | TFLOPs: 18.97 | 31: iteration 104130/ 173500 | consumed samples: 26657280 | consumed tokens: 54594109440 | elapsed time per iteration (s): 0.81 | learning rate: 8.323E-05 | global batch size: 256 | lm loss: 1.971171E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.168 | TFLOPs: 19.01 | 31: iteration 104140/ 173500 | consumed samples: 26659840 | consumed tokens: 54599352320 | elapsed time per iteration (s): 0.80 | learning rate: 8.321E-05 | global batch size: 256 | lm loss: 1.962357E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.990 | TFLOPs: 19.48 | 31: iteration 104150/ 173500 | consumed samples: 26662400 | consumed tokens: 54604595200 | elapsed time per iteration (s): 0.79 | learning rate: 8.320E-05 | global batch size: 256 | lm loss: 1.974225E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.771 | TFLOPs: 19.71 | 31: iteration 104160/ 173500 | consumed samples: 26664960 | consumed tokens: 54609838080 | elapsed time per iteration (s): 0.84 | learning rate: 8.318E-05 | global batch size: 256 | lm loss: 1.978850E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.196 | TFLOPs: 18.52 | 31: iteration 104170/ 173500 | consumed samples: 26667520 | consumed tokens: 54615080960 | elapsed time per iteration (s): 0.79 | learning rate: 8.317E-05 | global batch size: 256 | lm loss: 1.948169E+00 | grad norm: 0.151 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.996 | TFLOPs: 19.66 | 31: iteration 104180/ 173500 | consumed samples: 26670080 | consumed tokens: 54620323840 | elapsed time per iteration (s): 0.86 | learning rate: 8.315E-05 | global batch size: 256 | lm loss: 1.968603E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.200 | TFLOPs: 17.92 | 31: iteration 104190/ 173500 | consumed samples: 26672640 | consumed tokens: 54625566720 | elapsed time per iteration (s): 0.80 | learning rate: 8.314E-05 | global batch size: 256 | lm loss: 1.961561E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.929 | TFLOPs: 19.29 | 31: iteration 104200/ 173500 | consumed samples: 26675200 | consumed tokens: 54630809600 | elapsed time per iteration (s): 0.80 | learning rate: 8.312E-05 | global batch size: 256 | lm loss: 1.990121E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.212 | TFLOPs: 19.37 | 31: iteration 104210/ 173500 | consumed samples: 26677760 | consumed tokens: 54636052480 | elapsed time per iteration (s): 0.81 | learning rate: 8.310E-05 | global batch size: 256 | lm loss: 1.972128E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.527 | TFLOPs: 19.15 | 31: iteration 104220/ 173500 | consumed samples: 26680320 | consumed tokens: 54641295360 | elapsed time per iteration (s): 0.85 | learning rate: 8.309E-05 | global batch size: 256 | lm loss: 1.973357E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.146 | TFLOPs: 18.22 | 31: iteration 104230/ 173500 | consumed samples: 26682880 | consumed tokens: 54646538240 | elapsed time per iteration (s): 0.80 | 
learning rate: 8.307E-05 | global batch size: 256 | lm loss: 1.941306E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.358 | TFLOPs: 19.26 | 31: iteration 104240/ 173500 | consumed samples: 26685440 | consumed tokens: 54651781120 | elapsed time per iteration (s): 0.83 | learning rate: 8.306E-05 | global batch size: 256 | lm loss: 1.937313E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.434 | TFLOPs: 18.66 | 31: iteration 104250/ 173500 | consumed samples: 26688000 | consumed tokens: 54657024000 | elapsed time per iteration (s): 0.80 | learning rate: 8.304E-05 | global batch size: 256 | lm loss: 1.973649E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.247 | TFLOPs: 19.25 | 31: iteration 104260/ 173500 | consumed samples: 26690560 | consumed tokens: 54662266880 | elapsed time per iteration (s): 0.83 | learning rate: 8.303E-05 | global batch size: 256 | lm loss: 1.969341E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.263 | TFLOPs: 18.71 | 31: iteration 104270/ 173500 | consumed samples: 26693120 | consumed tokens: 54667509760 | elapsed time per iteration (s): 0.80 | learning rate: 8.301E-05 | global batch size: 256 | lm loss: 1.952401E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.335 | TFLOPs: 19.26 | 31: iteration 104280/ 173500 | consumed samples: 26695680 | consumed tokens: 54672752640 | elapsed time per iteration (s): 0.81 | learning rate: 8.299E-05 | global batch size: 256 | lm loss: 1.961764E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.364 | TFLOPs: 19.20 | 31: iteration 
104290/ 173500 | consumed samples: 26698240 | consumed tokens: 54677995520 | elapsed time per iteration (s): 0.82 | learning rate: 8.298E-05 | global batch size: 256 | lm loss: 1.943008E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.173 | TFLOPs: 18.89 | 31: iteration 104300/ 173500 | consumed samples: 26700800 | consumed tokens: 54683238400 | elapsed time per iteration (s): 0.81 | learning rate: 8.296E-05 | global batch size: 256 | lm loss: 1.972788E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.295 | TFLOPs: 19.14 | 31: iteration 104310/ 173500 | consumed samples: 26703360 | consumed tokens: 54688481280 | elapsed time per iteration (s): 0.79 | learning rate: 8.295E-05 | global batch size: 256 | lm loss: 1.985101E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.880 | TFLOPs: 19.65 | 31: iteration 104320/ 173500 | consumed samples: 26705920 | consumed tokens: 54693724160 | elapsed time per iteration (s): 0.90 | learning rate: 8.293E-05 | global batch size: 256 | lm loss: 1.980142E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.743 | TFLOPs: 17.23 | 31: iteration 104330/ 173500 | consumed samples: 26708480 | consumed tokens: 54698967040 | elapsed time per iteration (s): 0.79 | learning rate: 8.292E-05 | global batch size: 256 | lm loss: 1.976624E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.152 | TFLOPs: 19.61 | 31: iteration 104340/ 173500 | consumed samples: 26711040 | consumed tokens: 54704209920 | elapsed time per iteration (s): 0.80 | learning rate: 8.290E-05 | global batch size: 256 | lm loss: 1.961046E+00 | grad norm: 0.161 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.109 | TFLOPs: 19.43 | 31: iteration 104350/ 173500 | consumed samples: 26713600 | consumed tokens: 54709452800 | elapsed time per iteration (s): 0.80 | learning rate: 8.289E-05 | global batch size: 256 | lm loss: 1.963219E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.339 | TFLOPs: 19.26 | 31: iteration 104360/ 173500 | consumed samples: 26716160 | consumed tokens: 54714695680 | elapsed time per iteration (s): 0.82 | learning rate: 8.287E-05 | global batch size: 256 | lm loss: 1.965217E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.790 | TFLOPs: 18.86 | 31: iteration 104370/ 173500 | consumed samples: 26718720 | consumed tokens: 54719938560 | elapsed time per iteration (s): 0.78 | learning rate: 8.285E-05 | global batch size: 256 | lm loss: 2.003988E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.677 | TFLOPs: 19.82 | 31: iteration 104380/ 173500 | consumed samples: 26721280 | consumed tokens: 54725181440 | elapsed time per iteration (s): 0.89 | learning rate: 8.284E-05 | global batch size: 256 | lm loss: 1.973288E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.474 | TFLOPs: 17.45 | 31: iteration 104390/ 173500 | consumed samples: 26723840 | consumed tokens: 54730424320 | elapsed time per iteration (s): 0.91 | learning rate: 8.282E-05 | global batch size: 256 | lm loss: 1.971922E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 281.699 | TFLOPs: 17.04 | 31: iteration 104400/ 173500 | consumed samples: 26726400 | consumed tokens: 54735667200 | elapsed time per iteration (s): 1.43 | learning 
rate: 8.281E-05 | global batch size: 256 | lm loss: 1.950317E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 179.505 | TFLOPs: 10.86 | 31: iteration 104410/ 173500 | consumed samples: 26728960 | consumed tokens: 54740910080 | elapsed time per iteration (s): 0.82 | learning rate: 8.279E-05 | global batch size: 256 | lm loss: 1.981382E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.982 | TFLOPs: 18.81 | 31: iteration 104420/ 173500 | consumed samples: 26731520 | consumed tokens: 54746152960 | elapsed time per iteration (s): 0.82 | learning rate: 8.278E-05 | global batch size: 256 | lm loss: 1.950981E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.379 | TFLOPs: 18.78 | 31: iteration 104430/ 173500 | consumed samples: 26734080 | consumed tokens: 54751395840 | elapsed time per iteration (s): 0.83 | learning rate: 8.276E-05 | global batch size: 256 | lm loss: 1.954878E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.975 | TFLOPs: 18.57 | 31: iteration 104440/ 173500 | consumed samples: 26736640 | consumed tokens: 54756638720 | elapsed time per iteration (s): 0.85 | learning rate: 8.274E-05 | global batch size: 256 | lm loss: 1.997572E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.239 | TFLOPs: 18.22 | 31: iteration 104450/ 173500 | consumed samples: 26739200 | consumed tokens: 54761881600 | elapsed time per iteration (s): 0.82 | learning rate: 8.273E-05 | global batch size: 256 | lm loss: 1.974611E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.931 | TFLOPs: 18.81 | 31: iteration 104460/ 
173500 | consumed samples: 26741760 | consumed tokens: 54767124480 | elapsed time per iteration (s): 0.84 | learning rate: 8.271E-05 | global batch size: 256 | lm loss: 1.969023E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.214 | TFLOPs: 18.46 | 31: iteration 104470/ 173500 | consumed samples: 26744320 | consumed tokens: 54772367360 | elapsed time per iteration (s): 0.86 | learning rate: 8.270E-05 | global batch size: 256 | lm loss: 1.965041E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.474 | TFLOPs: 18.00 | 31: iteration 104480/ 173500 | consumed samples: 26746880 | consumed tokens: 54777610240 | elapsed time per iteration (s): 0.81 | learning rate: 8.268E-05 | global batch size: 256 | lm loss: 1.983128E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.204 | TFLOPs: 19.13 | 31: iteration 104490/ 173500 | consumed samples: 26749440 | consumed tokens: 54782853120 | elapsed time per iteration (s): 0.83 | learning rate: 8.267E-05 | global batch size: 256 | lm loss: 1.964954E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.701 | TFLOPs: 18.68 | 31: iteration 104500/ 173500 | consumed samples: 26752000 | consumed tokens: 54788096000 | elapsed time per iteration (s): 0.85 | learning rate: 8.265E-05 | global batch size: 256 | lm loss: 1.944478E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.214 | TFLOPs: 18.28 | 31: iteration 104510/ 173500 | consumed samples: 26754560 | consumed tokens: 54793338880 | elapsed time per iteration (s): 0.80 | learning rate: 8.263E-05 | global batch size: 256 | lm loss: 1.971738E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 321.172 | TFLOPs: 19.43 | 31: iteration 104520/ 173500 | consumed samples: 26757120 | consumed tokens: 54798581760 | elapsed time per iteration (s): 0.84 | learning rate: 8.262E-05 | global batch size: 256 | lm loss: 1.970860E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.899 | TFLOPs: 18.39 | 31: iteration 104530/ 173500 | consumed samples: 26759680 | consumed tokens: 54803824640 | elapsed time per iteration (s): 0.83 | learning rate: 8.260E-05 | global batch size: 256 | lm loss: 1.966688E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.089 | TFLOPs: 18.70 | 31: iteration 104540/ 173500 | consumed samples: 26762240 | consumed tokens: 54809067520 | elapsed time per iteration (s): 0.83 | learning rate: 8.259E-05 | global batch size: 256 | lm loss: 1.990418E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.573 | TFLOPs: 18.67 | 31: iteration 104550/ 173500 | consumed samples: 26764800 | consumed tokens: 54814310400 | elapsed time per iteration (s): 0.83 | learning rate: 8.257E-05 | global batch size: 256 | lm loss: 1.981331E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.822 | TFLOPs: 18.68 | 31: iteration 104560/ 173500 | consumed samples: 26767360 | consumed tokens: 54819553280 | elapsed time per iteration (s): 0.85 | learning rate: 8.256E-05 | global batch size: 256 | lm loss: 1.984003E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.553 | TFLOPs: 18.30 | 31: iteration 104570/ 173500 | consumed samples: 26769920 | consumed tokens: 54824796160 | elapsed time per iteration (s): 0.85 | learning rate: 
8.254E-05 | global batch size: 256 | lm loss: 1.941240E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.626 | TFLOPs: 18.13 | 31: iteration 104580/ 173500 | consumed samples: 26772480 | consumed tokens: 54830039040 | elapsed time per iteration (s): 0.86 | learning rate: 8.252E-05 | global batch size: 256 | lm loss: 1.980749E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.135 | TFLOPs: 18.10 | 31: iteration 104590/ 173500 | consumed samples: 26775040 | consumed tokens: 54835281920 | elapsed time per iteration (s): 0.81 | learning rate: 8.251E-05 | global batch size: 256 | lm loss: 1.967015E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.496 | TFLOPs: 19.15 | 31: iteration 104600/ 173500 | consumed samples: 26777600 | consumed tokens: 54840524800 | elapsed time per iteration (s): 0.80 | learning rate: 8.249E-05 | global batch size: 256 | lm loss: 1.953718E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.389 | TFLOPs: 19.32 | 31: iteration 104610/ 173500 | consumed samples: 26780160 | consumed tokens: 54845767680 | elapsed time per iteration (s): 0.79 | learning rate: 8.248E-05 | global batch size: 256 | lm loss: 1.937428E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.040 | TFLOPs: 19.54 | 31: iteration 104620/ 173500 | consumed samples: 26782720 | consumed tokens: 54851010560 | elapsed time per iteration (s): 0.81 | learning rate: 8.246E-05 | global batch size: 256 | lm loss: 1.992689E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.873 | TFLOPs: 19.11 | 31: iteration 104630/ 173500 | 
consumed samples: 26785280 | consumed tokens: 54856253440 | elapsed time per iteration (s): 0.87 | learning rate: 8.245E-05 | global batch size: 256 | lm loss: 1.967646E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.591 | TFLOPs: 17.70 | 31: iteration 104640/ 173500 | consumed samples: 26787840 | consumed tokens: 54861496320 | elapsed time per iteration (s): 0.84 | learning rate: 8.243E-05 | global batch size: 256 | lm loss: 1.965123E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.004 | TFLOPs: 18.51 | 31: iteration 104650/ 173500 | consumed samples: 26790400 | consumed tokens: 54866739200 | elapsed time per iteration (s): 0.82 | learning rate: 8.241E-05 | global batch size: 256 | lm loss: 1.937257E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.665 | TFLOPs: 18.85 | 31: iteration 104660/ 173500 | consumed samples: 26792960 | consumed tokens: 54871982080 | elapsed time per iteration (s): 0.86 | learning rate: 8.240E-05 | global batch size: 256 | lm loss: 1.959461E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.080 | TFLOPs: 18.09 | 31: iteration 104670/ 173500 | consumed samples: 26795520 | consumed tokens: 54877224960 | elapsed time per iteration (s): 0.82 | learning rate: 8.238E-05 | global batch size: 256 | lm loss: 1.979769E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.686 | TFLOPs: 18.80 | 31: iteration 104680/ 173500 | consumed samples: 26798080 | consumed tokens: 54882467840 | elapsed time per iteration (s): 0.84 | learning rate: 8.237E-05 | global batch size: 256 | lm loss: 1.981337E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 303.809 | TFLOPs: 18.38 | 31: iteration 104690/ 173500 | consumed samples: 26800640 | consumed tokens: 54887710720 | elapsed time per iteration (s): 0.75 | learning rate: 8.235E-05 | global batch size: 256 | lm loss: 1.950207E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.181 | TFLOPs: 20.70 | 31: iteration 104700/ 173500 | consumed samples: 26803200 | consumed tokens: 54892953600 | elapsed time per iteration (s): 0.75 | learning rate: 8.234E-05 | global batch size: 256 | lm loss: 1.970694E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.813 | TFLOPs: 20.68 | 31: iteration 104710/ 173500 | consumed samples: 26805760 | consumed tokens: 54898196480 | elapsed time per iteration (s): 0.82 | learning rate: 8.232E-05 | global batch size: 256 | lm loss: 1.980066E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.136 | TFLOPs: 18.82 | 31: iteration 104720/ 173500 | consumed samples: 26808320 | consumed tokens: 54903439360 | elapsed time per iteration (s): 0.81 | learning rate: 8.230E-05 | global batch size: 256 | lm loss: 1.989646E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.118 | TFLOPs: 19.00 | 31: iteration 104730/ 173500 | consumed samples: 26810880 | consumed tokens: 54908682240 | elapsed time per iteration (s): 0.75 | learning rate: 8.229E-05 | global batch size: 256 | lm loss: 2.001697E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.315 | TFLOPs: 20.71 | 31: iteration 104740/ 173500 | consumed samples: 26813440 | consumed tokens: 54913925120 | elapsed time per iteration (s): 0.81 | learning rate: 
8.227E-05 | global batch size: 256 | lm loss: 1.974293E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.475 | TFLOPs: 19.15 | 31: iteration 104750/ 173500 | consumed samples: 26816000 | consumed tokens: 54919168000 | elapsed time per iteration (s): 0.79 | learning rate: 8.226E-05 | global batch size: 256 | lm loss: 1.931020E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.404 | TFLOPs: 19.63 | 31: iteration 104760/ 173500 | consumed samples: 26818560 | consumed tokens: 54924410880 | elapsed time per iteration (s): 0.73 | learning rate: 8.224E-05 | global batch size: 256 | lm loss: 1.960894E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.386 | TFLOPs: 21.08 | 31: iteration 104770/ 173500 | consumed samples: 26821120 | consumed tokens: 54929653760 | elapsed time per iteration (s): 0.77 | learning rate: 8.223E-05 | global batch size: 256 | lm loss: 1.969790E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.205 | TFLOPs: 20.04 | 31: iteration 104780/ 173500 | consumed samples: 26823680 | consumed tokens: 54934896640 | elapsed time per iteration (s): 0.78 | learning rate: 8.221E-05 | global batch size: 256 | lm loss: 1.957816E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.898 | TFLOPs: 19.78 | 31: iteration 104790/ 173500 | consumed samples: 26826240 | consumed tokens: 54940139520 | elapsed time per iteration (s): 0.78 | learning rate: 8.220E-05 | global batch size: 256 | lm loss: 1.970543E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.240 | TFLOPs: 19.86 | 31: iteration 104800/ 173500 | 
consumed samples: 26828800 | consumed tokens: 54945382400 | elapsed time per iteration (s): 0.79 | learning rate: 8.218E-05 | global batch size: 256 | lm loss: 1.965800E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.205 | TFLOPs: 19.67 | 31: iteration 104810/ 173500 | consumed samples: 26831360 | consumed tokens: 54950625280 | elapsed time per iteration (s): 0.78 | learning rate: 8.216E-05 | global batch size: 256 | lm loss: 1.989511E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.636 | TFLOPs: 19.88 | 31: iteration 104820/ 173500 | consumed samples: 26833920 | consumed tokens: 54955868160 | elapsed time per iteration (s): 0.79 | learning rate: 8.215E-05 | global batch size: 256 | lm loss: 2.001356E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.725 | TFLOPs: 19.71 | 31: iteration 104830/ 173500 | consumed samples: 26836480 | consumed tokens: 54961111040 | elapsed time per iteration (s): 0.75 | learning rate: 8.213E-05 | global batch size: 256 | lm loss: 1.949372E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.899 | TFLOPs: 20.74 | 31: iteration 104840/ 173500 | consumed samples: 26839040 | consumed tokens: 54966353920 | elapsed time per iteration (s): 0.79 | learning rate: 8.212E-05 | global batch size: 256 | lm loss: 2.010191E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.347 | TFLOPs: 19.62 | 31: iteration 104850/ 173500 | consumed samples: 26841600 | consumed tokens: 54971596800 | elapsed time per iteration (s): 0.77 | learning rate: 8.210E-05 | global batch size: 256 | lm loss: 1.975388E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 331.428 | TFLOPs: 20.05 | 31: iteration 104860/ 173500 | consumed samples: 26844160 | consumed tokens: 54976839680 | elapsed time per iteration (s): 0.77 | learning rate: 8.209E-05 | global batch size: 256 | lm loss: 1.986562E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.045 | TFLOPs: 20.21 | 31: iteration 104870/ 173500 | consumed samples: 26846720 | consumed tokens: 54982082560 | elapsed time per iteration (s): 0.78 | learning rate: 8.207E-05 | global batch size: 256 | lm loss: 1.950751E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.158 | TFLOPs: 19.91 | 31: iteration 104880/ 173500 | consumed samples: 26849280 | consumed tokens: 54987325440 | elapsed time per iteration (s): 0.78 | learning rate: 8.205E-05 | global batch size: 256 | lm loss: 1.992249E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.487 | TFLOPs: 19.75 | 31: iteration 104890/ 173500 | consumed samples: 26851840 | consumed tokens: 54992568320 | elapsed time per iteration (s): 0.80 | learning rate: 8.204E-05 | global batch size: 256 | lm loss: 2.011085E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.917 | TFLOPs: 19.35 | 31: iteration 104900/ 173500 | consumed samples: 26854400 | consumed tokens: 54997811200 | elapsed time per iteration (s): 0.78 | learning rate: 8.202E-05 | global batch size: 256 | lm loss: 1.956730E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.178 | TFLOPs: 19.73 | 31: iteration 104910/ 173500 | consumed samples: 26856960 | consumed tokens: 55003054080 | elapsed time per iteration (s): 0.75 | learning rate: 
8.201E-05 | global batch size: 256 | lm loss: 1.959610E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.351 | TFLOPs: 20.65 | 31: iteration 104920/ 173500 | consumed samples: 26859520 | consumed tokens: 55008296960 | elapsed time per iteration (s): 0.77 | learning rate: 8.199E-05 | global batch size: 256 | lm loss: 1.963548E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.799 | TFLOPs: 20.13 | 31: iteration 104930/ 173500 | consumed samples: 26862080 | consumed tokens: 55013539840 | elapsed time per iteration (s): 0.75 | learning rate: 8.198E-05 | global batch size: 256 | lm loss: 1.999539E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.680 | TFLOPs: 20.61 | 31: iteration 104940/ 173500 | consumed samples: 26864640 | consumed tokens: 55018782720 | elapsed time per iteration (s): 0.79 | learning rate: 8.196E-05 | global batch size: 256 | lm loss: 1.941716E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.794 | TFLOPs: 19.71 | 31: iteration 104950/ 173500 | consumed samples: 26867200 | consumed tokens: 55024025600 | elapsed time per iteration (s): 0.78 | learning rate: 8.194E-05 | global batch size: 256 | lm loss: 1.958911E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.637 | TFLOPs: 19.94 | 31: iteration 104960/ 173500 | consumed samples: 26869760 | consumed tokens: 55029268480 | elapsed time per iteration (s): 0.75 | learning rate: 8.193E-05 | global batch size: 256 | lm loss: 1.981030E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.922 | TFLOPs: 20.69 | 31: iteration 104970/ 173500 | 
consumed samples: 26872320 | consumed tokens: 55034511360 | elapsed time per iteration (s): 0.72 | learning rate: 8.191E-05 | global batch size: 256 | lm loss: 1.952139E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.857 | TFLOPs: 21.41 | 31: iteration 104980/ 173500 | consumed samples: 26874880 | consumed tokens: 55039754240 | elapsed time per iteration (s): 0.72 | learning rate: 8.190E-05 | global batch size: 256 | lm loss: 1.980659E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.122 | TFLOPs: 21.36 | 31: iteration 104990/ 173500 | consumed samples: 26877440 | consumed tokens: 55044997120 | elapsed time per iteration (s): 0.77 | learning rate: 8.188E-05 | global batch size: 256 | lm loss: 1.983992E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.409 | TFLOPs: 20.23 | 31: iteration 105000/ 173500 | consumed samples: 26880000 | consumed tokens: 55050240000 | elapsed time per iteration (s): 0.78 | learning rate: 8.187E-05 | global batch size: 256 | lm loss: 1.944225E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.756 | TFLOPs: 19.77 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 105000 | lm loss value: 1.870390E+00 | lm loss PPL: 6.490827E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 105000 to checkpoints_1b1long 0: [2022-11-26 17:48:48,164] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step105000 is begin to save! 
0: [2022-11-26 17:48:48,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_01-model_00-model_states.pt... 0: [2022-11-26 17:48:48,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_01-model_00-model_states.pt. 0: [2022-11-26 17:48:48,496] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_03-model_00-model_states.pt... 0: [2022-11-26 17:48:48,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_03-model_00-model_states.pt. 0: [2022-11-26 17:48:48,606] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_04-model_00-model_states.pt... 0: [2022-11-26 17:48:48,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_04-model_00-model_states.pt. 0: [2022-11-26 17:48:48,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_05-model_00-model_states.pt... 0: [2022-11-26 17:48:48,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_05-model_00-model_states.pt. 0: [2022-11-26 17:48:48,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_06-model_00-model_states.pt... 0: [2022-11-26 17:48:48,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_06-model_00-model_states.pt. 0: [2022-11-26 17:48:48,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_07-model_00-model_states.pt... 0: [2022-11-26 17:48:49,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 17:48:49,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_08-model_00-model_states.pt... 0: [2022-11-26 17:48:49,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_08-model_00-model_states.pt. 0: [2022-11-26 17:48:49,178] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_09-model_00-model_states.pt... 0: [2022-11-26 17:48:49,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_09-model_00-model_states.pt. 0: [2022-11-26 17:48:49,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_10-model_00-model_states.pt... 0: [2022-11-26 17:48:49,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_10-model_00-model_states.pt. 0: [2022-11-26 17:48:49,331] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_11-model_00-model_states.pt... 0: [2022-11-26 17:48:49,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_11-model_00-model_states.pt. 0: [2022-11-26 17:48:49,408] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_12-model_00-model_states.pt... 0: [2022-11-26 17:48:49,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_12-model_00-model_states.pt. 0: [2022-11-26 17:48:49,487] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_13-model_00-model_states.pt... 0: [2022-11-26 17:48:49,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 17:48:49,569] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_14-model_00-model_states.pt... 0: [2022-11-26 17:48:49,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_14-model_00-model_states.pt. 0: [2022-11-26 17:48:49,642] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_15-model_00-model_states.pt... 0: [2022-11-26 17:48:49,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_15-model_00-model_states.pt. 0: [2022-11-26 17:48:49,718] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_16-model_00-model_states.pt... 0: [2022-11-26 17:48:49,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_16-model_00-model_states.pt. 0: [2022-11-26 17:48:49,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_17-model_00-model_states.pt... 0: [2022-11-26 17:48:49,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_17-model_00-model_states.pt. 0: [2022-11-26 17:48:49,870] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_18-model_00-model_states.pt... 0: [2022-11-26 17:48:49,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_18-model_00-model_states.pt. 0: [2022-11-26 17:48:49,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_19-model_00-model_states.pt... 0: [2022-11-26 17:48:50,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 17:48:50,020] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_20-model_00-model_states.pt... 0: [2022-11-26 17:48:50,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_20-model_00-model_states.pt. 0: [2022-11-26 17:48:50,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_21-model_00-model_states.pt... 0: [2022-11-26 17:48:50,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_21-model_00-model_states.pt. 0: [2022-11-26 17:48:50,170] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_22-model_00-model_states.pt... 0: [2022-11-26 17:48:50,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_22-model_00-model_states.pt. 0: [2022-11-26 17:48:50,268] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_23-model_00-model_states.pt... 0: [2022-11-26 17:48:50,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_23-model_00-model_states.pt. 0: [2022-11-26 17:48:50,375] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_24-model_00-model_states.pt... 0: [2022-11-26 17:48:50,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_24-model_00-model_states.pt. 0: [2022-11-26 17:48:50,483] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_25-model_00-model_states.pt... 0: [2022-11-26 17:48:50,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 17:48:50,591] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_26-model_00-model_states.pt... 0: [2022-11-26 17:48:50,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_26-model_00-model_states.pt. 0: [2022-11-26 17:48:50,698] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_27-model_00-model_states.pt... 0: [2022-11-26 17:48:50,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_27-model_00-model_states.pt. 0: [2022-11-26 17:48:50,804] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_28-model_00-model_states.pt... 0: [2022-11-26 17:48:50,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_28-model_00-model_states.pt. 0: [2022-11-26 17:48:50,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/layer_30-model_00-model_states.pt... 0: [2022-11-26 17:48:50,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/layer_30-model_00-model_states.pt. 0: [2022-11-26 17:48:50,916] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step105000/mp_rank_00_model_states.pt 0: [2022-11-26 17:48:50,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/mp_rank_00_model_states.pt... 0: [2022-11-26 17:48:50,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/mp_rank_00_model_states.pt. 0: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
6: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 
4: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
10: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
13: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 
25: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 
28: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 
29: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 
17: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 
18: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 
0: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
9: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 
16: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
12: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
28: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 
22: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 
19: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
7: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
13: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 
25: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 
31: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 17: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 
27: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 23: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 
24: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 18: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
3: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 20: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 31: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 27: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
20: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:48:50,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:48:51,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:48:51,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 17:48:51,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 0: [2022-11-26 17:48:51,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:48:51,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 17:48:51,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 26: [2022-11-26 17:48:51,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 17:48:51,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 17:48:51,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 9: [2022-11-26 17:48:51,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-26 17:48:51,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 17:48:51,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 12: [2022-11-26 17:48:51,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:48:51,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 17:48:51,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 4: [2022-11-26 17:48:51,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:48:51,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:48:51,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 17:48:51,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 0: [2022-11-26 17:48:51,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 17:48:51,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 11: [2022-11-26 17:48:51,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
27: [2022-11-26 17:48:51,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:48:51,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:48:51,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 17:48:51,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 8: [2022-11-26 17:48:51,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:48:51,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 8: [2022-11-26 17:48:51,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 27: [2022-11-26 17:48:51,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 17:48:51,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 27: [2022-11-26 17:48:51,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 8: [2022-11-26 17:48:51,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 19: [2022-11-26 17:48:51,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
14: [2022-11-26 17:48:51,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 19: [2022-11-26 17:48:51,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 14: [2022-11-26 17:48:51,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 19: [2022-11-26 17:48:51,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 14: [2022-11-26 17:48:51,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 30: [2022-11-26 17:48:51,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:48:51,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:48:51,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 17:48:51,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 17:48:51,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 30: [2022-11-26 17:48:51,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 30: [2022-11-26 17:48:51,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-26 17:48:51,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 17:48:51,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 3: [2022-11-26 17:48:51,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:48:51,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:48:51,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 17:48:51,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 1: [2022-11-26 17:48:51,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 17:48:51,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 11: [2022-11-26 17:48:51,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:48:51,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 17:48:51,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 21: [2022-11-26 17:48:51,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
31: [2022-11-26 17:48:51,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:48:51,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 17:48:51,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:48:51,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:48:51,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 31: [2022-11-26 17:48:51,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 21: [2022-11-26 17:48:51,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 7: [2022-11-26 17:48:51,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 12: [2022-11-26 17:48:51,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 31: [2022-11-26 17:48:51,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 17:48:51,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
21: [2022-11-26 17:48:51,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 7: [2022-11-26 17:48:51,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 12: [2022-11-26 17:48:51,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 31: [2022-11-26 17:48:51,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 17:48:51,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:48:51,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 31: [2022-11-26 17:48:51,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 31: [2022-11-26 17:48:51,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 17:48:51,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 29: [2022-11-26 17:48:51,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 26: [2022-11-26 17:48:51,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
29: [2022-11-26 17:48:51,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 17:48:51,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 26: [2022-11-26 17:48:51,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 17:48:51,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 9: [2022-11-26 17:48:51,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:48:51,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:48:51,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 17:48:51,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 17:48:51,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 9: [2022-11-26 17:48:51,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 5: [2022-11-26 17:48:51,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 17:48:51,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 17:48:51,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 13: [2022-11-26 17:48:51,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:48:51,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:48:51,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:48:51,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 17:48:51,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 17:48:51,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 13: [2022-11-26 17:48:51,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 17:48:51,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 13: [2022-11-26 17:48:51,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 8: [2022-11-26 17:48:51,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 17:48:51,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 27: [2022-11-26 17:48:51,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:48:51,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 17: [2022-11-26 17:48:51,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:48:51,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:48:51,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 17:48:51,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 17:48:51,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 27: [2022-11-26 17:48:51,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 30: [2022-11-26 17:48:51,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:48:51,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 17:48:51,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
11: [2022-11-26 17:48:51,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:48:51,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:48:51,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:48:51,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:48:51,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 22: [2022-11-26 17:48:51,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 17:48:51,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 23: [2022-11-26 17:48:51,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 11: [2022-11-26 17:48:51,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 22: [2022-11-26 17:48:51,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 22: [2022-11-26 17:48:51,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
9: [2022-11-26 17:48:51,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:48:51,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 23: [2022-11-26 17:48:51,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:48:51,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:48:51,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:48:51,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 13: [2022-11-26 17:48:51,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:48:51,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 22: [2022-11-26 17:48:51,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 5: [2022-11-26 17:48:51,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 9: [2022-11-26 17:48:51,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
13: [2022-11-26 17:48:51,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 23: [2022-11-26 17:48:51,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 22: [2022-11-26 17:48:51,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 5: [2022-11-26 17:48:51,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 4: [2022-11-26 17:48:51,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:48:51,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 4: [2022-11-26 17:48:51,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 17:48:51,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 5: [2022-11-26 17:48:51,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 29: [2022-11-26 17:48:51,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:48:51,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 15: [2022-11-26 17:48:51,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
5: [2022-11-26 17:48:51,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 29: [2022-11-26 17:48:51,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 15: [2022-11-26 17:48:51,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 29: [2022-11-26 17:48:51,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 15: [2022-11-26 17:48:51,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 21: [2022-11-26 17:48:51,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:48:51,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:48:51,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 14: [2022-11-26 17:48:51,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 21: [2022-11-26 17:48:51,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 14: [2022-11-26 17:48:51,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 26: [2022-11-26 17:48:51,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
26: [2022-11-26 17:48:51,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 17:48:51,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 21: [2022-11-26 17:48:51,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:48:51,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 17:48:51,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 25: [2022-11-26 17:48:51,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 17:48:51,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 16: [2022-11-26 17:48:51,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 17:48:51,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 25: [2022-11-26 17:48:51,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 16: [2022-11-26 17:48:51,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
25: [2022-11-26 17:48:51,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 17:48:51,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 14: [2022-11-26 17:48:51,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:48:51,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 17:48:51,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 25: [2022-11-26 17:48:51,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 16: [2022-11-26 17:48:51,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 17:48:51,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 17:48:51,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 8: [2022-11-26 17:48:51,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:48:51,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 17:48:51,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
7: [2022-11-26 17:48:51,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:48:51,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 17:48:51,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 4: [2022-11-26 17:48:51,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:48:51,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:48:51,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 17:48:51,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 17:48:51,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 4: [2022-11-26 17:48:51,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 8: [2022-11-26 17:48:51,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:48:51,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 17:48:51,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
5: [2022-11-26 17:48:51,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:48:51,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 17:48:51,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 17: [2022-11-26 17:48:51,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 17:48:51,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 17: [2022-11-26 17:48:51,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:48:51,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 17: [2022-11-26 17:48:51,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 17:48:51,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 7: [2022-11-26 17:48:51,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 17: [2022-11-26 17:48:51,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:48:51,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
17: [2022-11-26 17:48:51,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 17:48:51,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 19: [2022-11-26 17:48:51,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 17:48:51,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 17:48:51,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 15: [2022-11-26 17:48:51,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:48:51,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 17:48:51,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 28: [2022-11-26 17:48:51,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 17:48:51,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 17:48:51,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-26 17:48:51,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 31: [2022-11-26 17:48:51,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:48:51,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 28: [2022-11-26 17:48:51,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:48:51,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 28: [2022-11-26 17:48:51,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 17:48:51,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 31: [2022-11-26 17:48:51,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 15: [2022-11-26 17:48:51,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 28: [2022-11-26 17:48:51,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 31: [2022-11-26 17:48:51,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
28: [2022-11-26 17:48:51,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 17:48:51,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 26: [2022-11-26 17:48:51,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 28: [2022-11-26 17:48:51,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 28: [2022-11-26 17:48:51,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 26: [2022-11-26 17:48:51,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 0: [2022-11-26 17:48:51,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 26: [2022-11-26 17:48:51,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 0: [2022-11-26 17:48:51,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 17:48:51,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 19: [2022-11-26 17:48:51,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
19: [2022-11-26 17:48:51,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 17:48:51,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 6: [2022-11-26 17:48:51,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:48:51,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:48:51,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 17:48:51,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 29: [2022-11-26 17:48:51,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:48:51,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 29: [2022-11-26 17:48:51,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 6: [2022-11-26 17:48:51,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 29: [2022-11-26 17:48:51,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 25: [2022-11-26 17:48:51,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
25: [2022-11-26 17:48:51,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 17:48:51,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 27: [2022-11-26 17:48:51,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:48:51,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 17:48:51,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 11: [2022-11-26 17:48:51,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:48:51,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 17:48:51,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 5: [2022-11-26 17:48:51,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:48:51,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 17:48:51,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 14: [2022-11-26 17:48:51,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-26 17:48:51,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 17:48:51,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 21: [2022-11-26 17:48:51,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:48:51,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:48:51,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 11: [2022-11-26 17:48:51,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 21: [2022-11-26 17:48:51,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 11: [2022-11-26 17:48:51,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 9: [2022-11-26 17:48:51,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:48:51,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 19: [2022-11-26 17:48:51,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:48:51,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
19: [2022-11-26 17:48:51,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 17:48:51,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 17: [2022-11-26 17:48:51,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 17:48:51,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 17:48:51,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 0: [2022-11-26 17:48:51,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:48:51,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 17:48:51,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 30: [2022-11-26 17:48:51,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:48:51,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 17:48:51,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 12: [2022-11-26 17:48:51,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 17:48:51,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:48:51,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 17:48:51,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 17:48:51,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 12: [2022-11-26 17:48:51,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 28: [2022-11-26 17:48:51,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 17:48:51,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 17:48:51,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 8: [2022-11-26 17:48:51,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:48:51,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
8: [2022-11-26 17:48:51,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 4: [2022-11-26 17:48:51,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 8: [2022-11-26 17:48:51,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 4: [2022-11-26 17:48:51,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 26: [2022-11-26 17:48:51,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 17:48:51,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 17:48:51,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 13: [2022-11-26 17:48:51,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:48:51,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 17:48:51,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 23: [2022-11-26 17:48:51,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
23: [2022-11-26 17:48:51,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 17:48:51,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 22: [2022-11-26 17:48:51,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:48:51,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 17:48:51,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 3: [2022-11-26 17:48:51,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:48:51,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 17:48:51,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:48:51,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 2: [2022-11-26 17:48:51,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
3: [2022-11-26 17:48:51,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 15: [2022-11-26 17:48:51,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:48:51,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:48:51,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:48:51,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 15: [2022-11-26 17:48:51,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 2: [2022-11-26 17:48:51,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 3: [2022-11-26 17:48:51,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:48:51,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 2: [2022-11-26 17:48:51,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
2: [2022-11-26 17:48:51,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 3: [2022-11-26 17:48:51,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 2: [2022-11-26 17:48:51,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 3: [2022-11-26 17:48:51,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 2: [2022-11-26 17:48:51,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 2: [2022-11-26 17:48:51,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 23: [2022-11-26 17:48:51,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:48:51,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:48:51,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:48:51,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 23: [2022-11-26 17:48:51,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 3: [2022-11-26 17:48:51,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
23: [2022-11-26 17:48:51,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 17:48:51,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 23: [2022-11-26 17:48:51,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 17: [2022-11-26 17:48:51,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 17:48:51,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 17:48:51,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 16: [2022-11-26 17:48:51,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 17:48:51,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 17:48:51,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 2: [2022-11-26 17:48:51,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:48:51,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 15: [2022-11-26 17:48:51,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
19: [2022-11-26 17:48:51,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:48:51,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 15: [2022-11-26 17:48:51,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 19: [2022-11-26 17:48:51,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 15: [2022-11-26 17:48:51,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 19: [2022-11-26 17:48:51,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 16: [2022-11-26 17:48:51,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 17:48:51,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 17:48:51,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 2: [2022-11-26 17:48:51,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:48:51,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 17:48:51,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
16: [2022-11-26 17:48:51,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 17:48:51,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 17:48:51,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 7: [2022-11-26 17:48:51,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:48:51,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 17:48:51,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:48:51,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 7: [2022-11-26 17:48:51,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 17:48:51,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 6: [2022-11-26 17:48:51,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:48:51,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:48:51,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 17:48:51,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 17:48:51,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 6: [2022-11-26 17:48:51,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 17:48:51,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 17:48:51,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 6: [2022-11-26 17:48:51,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 1: [2022-11-26 17:48:51,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:48:51,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 17:48:51,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 18: [2022-11-26 17:48:51,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:48:51,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:48:51,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-26 17:48:51,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:48:51,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:48:51,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 17:48:51,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 17:48:51,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 17:48:51,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 17:48:51,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 17:48:51,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 18: [2022-11-26 17:48:51,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 18: [2022-11-26 17:48:51,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 25: [2022-11-26 17:48:51,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:48:51,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
18: [2022-11-26 17:48:51,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 25: [2022-11-26 17:48:51,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 17:48:51,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 25: [2022-11-26 17:48:51,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 17:48:51,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 17:48:51,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 1: [2022-11-26 17:48:51,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:48:51,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:48:51,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:48:51,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
1: [2022-11-26 17:48:51,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 17:48:51,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 14: [2022-11-26 17:48:51,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 1: [2022-11-26 17:48:51,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 14: [2022-11-26 17:48:51,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 1: [2022-11-26 17:48:51,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 1: [2022-11-26 17:48:51,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 1: [2022-11-26 17:48:51,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 10: [2022-11-26 17:48:51,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:48:51,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:48:51,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:48:51,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-26 17:48:51,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:48:51,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 17:48:51,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 17:48:51,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 17:48:51,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 17:48:51,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 17:48:51,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 10: [2022-11-26 17:48:51,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 10: [2022-11-26 17:48:51,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 10: [2022-11-26 17:48:51,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 10: [2022-11-26 17:48:51,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 22: [2022-11-26 17:48:51,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
22: [2022-11-26 17:48:51,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 17:48:51,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 12: [2022-11-26 17:48:51,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:48:51,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 17:48:51,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 1: [2022-11-26 17:48:51,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:48:51,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 17:48:51,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 24: [2022-11-26 17:48:51,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 17:48:51,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 17:48:51,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
24: [2022-11-26 17:48:51,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 17:48:51,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 17:48:51,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 17:48:51,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 17:48:51,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 17:48:51,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 17:48:51,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 24: [2022-11-26 17:48:51,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 24: [2022-11-26 17:48:51,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 24: [2022-11-26 17:48:51,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 24: [2022-11-26 17:48:51,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 17:48:51,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
31: [2022-11-26 17:48:51,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 17:48:51,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 17:48:51,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 29: [2022-11-26 17:48:51,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 17:48:51,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 17:48:51,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 20: [2022-11-26 17:48:51,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 17:48:51,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 17:48:51,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 17:48:51,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 17:48:51,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-26 17:48:51,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 17:48:51,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 17:48:51,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 17:48:51,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 17:48:51,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 17:48:51,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 17:48:51,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 17:48:51,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 20: [2022-11-26 17:48:51,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 20: [2022-11-26 17:48:51,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 20: [2022-11-26 17:48:51,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 20: [2022-11-26 17:48:51,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
20: [2022-11-26 17:48:51,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 9: [2022-11-26 17:48:51,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:48:51,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 17:48:51,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 18: [2022-11-26 17:48:51,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:48:51,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 17:48:51,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 26: [2022-11-26 17:48:51,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 17:48:51,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 17:48:51,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 11: [2022-11-26 17:48:51,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-26 17:48:51,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 17:48:51,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 13: [2022-11-26 17:48:51,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:48:51,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 17:48:51,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 10: [2022-11-26 17:48:51,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:48:51,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 17:48:51,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 0: [2022-11-26 17:48:51,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 16: [2022-11-26 17:48:51,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 17:48:51,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 17:48:51,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
27: [2022-11-26 17:48:51,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:48:51,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:48:51,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 17:48:51,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 21: [2022-11-26 17:48:51,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 17:48:51,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 30: [2022-11-26 17:48:51,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:48:51,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 17:48:51,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 5: [2022-11-26 17:48:51,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:48:51,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 17:48:51,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
3: [2022-11-26 17:48:51,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:48:51,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 17:48:51,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 6: [2022-11-26 17:48:51,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:48:51,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 17:48:51,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 28: [2022-11-26 17:48:51,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:48:51,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:48:51,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 17:48:51,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 28: [2022-11-26 17:48:51,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 17:48:51,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
23: [2022-11-26 17:48:51,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:48:51,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 17:48:51,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 25: [2022-11-26 17:48:51,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 17:48:51,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 17:48:51,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 4: [2022-11-26 17:48:51,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:48:51,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 15: [2022-11-26 17:48:51,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:48:51,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 15: [2022-11-26 17:48:51,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 17:48:51,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
2: [2022-11-26 17:48:51,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:48:51,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:48:51,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 17:48:51,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 7: [2022-11-26 17:48:51,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 17:48:51,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 14: [2022-11-26 17:48:51,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:48:51,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 17:48:51,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 24: [2022-11-26 17:48:51,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
24: [2022-11-26 17:48:51,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 19: [2022-11-26 17:48:51,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 17:48:51,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 24: [2022-11-26 17:48:51,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 19: [2022-11-26 17:48:51,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 12: [2022-11-26 17:48:51,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:48:51,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 17:48:51,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 31: [2022-11-26 17:48:51,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 17:48:51,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 17:48:51,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 17: [2022-11-26 17:48:51,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 
17: [2022-11-26 17:48:51,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 17:48:51,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 22: [2022-11-26 17:48:51,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 18: [2022-11-26 17:48:51,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:48:51,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 18: [2022-11-26 17:48:51,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 22: [2022-11-26 17:48:51,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 18: [2022-11-26 17:48:51,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 20: [2022-11-26 17:48:51,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 17:48:51,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 17:48:51,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 26: [2022-11-26 17:48:51,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
26: [2022-11-26 17:48:51,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 17:48:51,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 13: [2022-11-26 17:48:51,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:48:51,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 17:48:51,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 29: [2022-11-26 17:48:51,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 17:48:51,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 17:48:51,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 11: [2022-11-26 17:48:51,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:48:51,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 17:48:51,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 9: [2022-11-26 17:48:51,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 17:48:51,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 17:48:51,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 21: [2022-11-26 17:48:51,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:48:51,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 17:48:51,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 0: [2022-11-26 17:48:51,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 17:48:51,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 10: [2022-11-26 17:48:51,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:48:51,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 17:48:51,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 1: [2022-11-26 17:48:51,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 17:48:51,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 17:48:51,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 16: [2022-11-26 17:48:51,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 17:48:51,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 17:48:51,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 27: [2022-11-26 17:48:51,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 17:48:51,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 17:48:51,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 3: [2022-11-26 17:48:51,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 30: [2022-11-26 17:48:51,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:48:51,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 17:48:51,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
30: [2022-11-26 17:48:51,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 17:48:51,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 28: [2022-11-26 17:48:51,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 17:48:51,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 17:48:51,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 5: [2022-11-26 17:48:51,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:48:51,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 17:48:51,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 4: [2022-11-26 17:48:51,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:48:51,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:48:51,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 17:48:51,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
6: [2022-11-26 17:48:51,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 17:48:51,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 0: [2022-11-26 17:48:51,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:48:51,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 17:48:51,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 8: [2022-11-26 17:48:51,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:48:51,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:48:51,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 17:48:51,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 23: [2022-11-26 17:48:51,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 17:48:51,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 2: [2022-11-26 17:48:51,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 17:48:51,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 15: [2022-11-26 17:48:51,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:48:51,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 15: [2022-11-26 17:48:51,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 17:48:51,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 17: [2022-11-26 17:48:51,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 25: [2022-11-26 17:48:51,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 17: [2022-11-26 17:48:51,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 25: [2022-11-26 17:48:51,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 17: [2022-11-26 17:48:51,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 25: [2022-11-26 17:48:51,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 30: [2022-11-26 17:48:51,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-26 17:48:51,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 17:48:51,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 20: [2022-11-26 17:48:51,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 17:48:51,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 17:48:51,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 27: [2022-11-26 17:48:51,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 19: [2022-11-26 17:48:51,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 17:48:51,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 27: [2022-11-26 17:48:51,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 19: [2022-11-26 17:48:51,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 17:48:51,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 27: [2022-11-26 17:48:51,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
19: [2022-11-26 17:48:51,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 17:48:51,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 21: [2022-11-26 17:48:51,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 17:48:51,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 17:48:51,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 7: [2022-11-26 17:48:51,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:48:51,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 17:48:51,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 14: [2022-11-26 17:48:51,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:48:51,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 24: [2022-11-26 17:48:51,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 
14: [2022-11-26 17:48:51,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 17:48:51,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 17:48:51,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 14: [2022-11-26 17:48:51,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 24: [2022-11-26 17:48:51,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 17:48:51,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 26: [2022-11-26 17:48:51,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 17:48:51,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 17:48:51,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 11: [2022-11-26 17:48:51,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 17: [2022-11-26 17:48:51,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
11: [2022-11-26 17:48:51,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 17:48:51,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 17: [2022-11-26 17:48:51,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 17:48:51,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 0: [2022-11-26 17:48:51,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:48:51,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 9: [2022-11-26 17:48:51,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:48:51,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 9: [2022-11-26 17:48:51,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 17:48:51,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 18: [2022-11-26 17:48:51,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
18: [2022-11-26 17:48:51,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 17:48:51,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 5: [2022-11-26 17:48:51,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:48:51,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 22: [2022-11-26 17:48:51,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:48:51,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 22: [2022-11-26 17:48:51,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 17:48:51,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 13: [2022-11-26 17:48:51,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:48:51,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 17:48:51,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 25: [2022-11-26 17:48:51,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
25: [2022-11-26 17:48:51,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 17:48:51,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 23: [2022-11-26 17:48:51,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:48:51,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 23: [2022-11-26 17:48:51,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 15: [2022-11-26 17:48:51,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 23: [2022-11-26 17:48:51,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 15: [2022-11-26 17:48:51,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 4: [2022-11-26 17:48:51,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:48:51,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 17:48:51,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 7: [2022-11-26 17:48:51,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 17:48:51,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 17:48:51,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 8: [2022-11-26 17:48:51,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:48:51,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:48:51,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 28: [2022-11-26 17:48:51,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:48:51,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 16: [2022-11-26 17:48:51,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:48:51,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 28: [2022-11-26 17:48:51,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 31: [2022-11-26 17:48:51,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
16: [2022-11-26 17:48:51,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 28: [2022-11-26 17:48:51,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 16: [2022-11-26 17:48:51,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 3: [2022-11-26 17:48:51,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 31: [2022-11-26 17:48:51,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 17:48:51,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 17:48:51,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 17:48:51,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 31: [2022-11-26 17:48:51,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 2: [2022-11-26 17:48:51,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:48:51,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 17:48:51,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
1: [2022-11-26 17:48:51,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:48:51,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 17:48:51,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 17:48:51,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 1: [2022-11-26 17:48:51,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 24: [2022-11-26 17:48:51,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:48:51,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:48:51,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:48:51,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
24: [2022-11-26 17:48:51,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 12: [2022-11-26 17:48:51,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 17:48:51,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 24: [2022-11-26 17:48:51,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 12: [2022-11-26 17:48:51,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 12: [2022-11-26 17:48:51,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 29: [2022-11-26 17:48:51,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 17:48:51,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 17:48:51,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 
29: [2022-11-26 17:48:51,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 17:48:51,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 17:48:51,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 17:48:51,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 29: [2022-11-26 17:48:51,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 29: [2022-11-26 17:48:51,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 10: [2022-11-26 17:48:51,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:48:51,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 17:48:51,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 6: [2022-11-26 17:48:51,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:48:51,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step105000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 17:48:51,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
0: successfully saved checkpoint at iteration 105000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 3090.67 31: iteration 105010/ 173500 | consumed samples: 26882560 | consumed tokens: 55055482880 | elapsed time per iteration (s): 1.12 | learning rate: 8.185E-05 | global batch size: 256 | lm loss: 1.986267E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.082 | TFLOPs: 13.80 | 31: iteration 105020/ 173500 | consumed samples: 26885120 | consumed tokens: 55060725760 | elapsed time per iteration (s): 0.73 | learning rate: 8.184E-05 | global batch size: 256 | lm loss: 1.968978E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.631 | TFLOPs: 21.09 | 31: iteration 105030/ 173500 | consumed samples: 26887680 | consumed tokens: 55065968640 | elapsed time per iteration (s): 0.81 | learning rate: 8.182E-05 | global batch size: 256 | lm loss: 1.968179E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.356 | TFLOPs: 19.20 | 31: iteration 105040/ 173500 | consumed samples: 26890240 | consumed tokens: 55071211520 | elapsed time per iteration (s): 0.76 | learning rate: 8.180E-05 | global batch size: 256 | lm loss: 1.953864E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.399 | TFLOPs: 20.47 | 31: iteration 105050/ 173500 | consumed samples: 26892800 | consumed tokens: 55076454400 | elapsed time per iteration (s): 0.77 | learning rate: 8.179E-05 | global batch size: 256 | lm loss: 1.961704E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.751 | TFLOPs: 20.07 | 31: iteration 105060/ 173500 | consumed samples: 26895360 | consumed tokens: 55081697280 | elapsed time per iteration (s): 
0.75 | learning rate: 8.177E-05 | global batch size: 256 | lm loss: 1.985046E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.572 | TFLOPs: 20.54 | 31: iteration 105070/ 173500 | consumed samples: 26897920 | consumed tokens: 55086940160 | elapsed time per iteration (s): 0.80 | learning rate: 8.176E-05 | global batch size: 256 | lm loss: 1.973326E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.501 | TFLOPs: 19.45 | 31: iteration 105080/ 173500 | consumed samples: 26900480 | consumed tokens: 55092183040 | elapsed time per iteration (s): 0.75 | learning rate: 8.174E-05 | global batch size: 256 | lm loss: 1.972240E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.308 | TFLOPs: 20.53 | 31: iteration 105090/ 173500 | consumed samples: 26903040 | consumed tokens: 55097425920 | elapsed time per iteration (s): 0.79 | learning rate: 8.173E-05 | global batch size: 256 | lm loss: 1.971379E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.728 | TFLOPs: 19.58 | 31: iteration 105100/ 173500 | consumed samples: 26905600 | consumed tokens: 55102668800 | elapsed time per iteration (s): 0.86 | learning rate: 8.171E-05 | global batch size: 256 | lm loss: 1.943604E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.402 | TFLOPs: 18.11 | 31: iteration 105110/ 173500 | consumed samples: 26908160 | consumed tokens: 55107911680 | elapsed time per iteration (s): 0.76 | learning rate: 8.169E-05 | global batch size: 256 | lm loss: 1.953655E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.749 | TFLOPs: 20.43 | 31: 
iteration 105120/ 173500 | consumed samples: 26910720 | consumed tokens: 55113154560 | elapsed time per iteration (s): 0.83 | learning rate: 8.168E-05 | global batch size: 256 | lm loss: 1.974865E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.377 | TFLOPs: 18.66 | 31: iteration 105130/ 173500 | consumed samples: 26913280 | consumed tokens: 55118397440 | elapsed time per iteration (s): 0.77 | learning rate: 8.166E-05 | global batch size: 256 | lm loss: 1.984506E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.094 | TFLOPs: 20.21 | 31: iteration 105140/ 173500 | consumed samples: 26915840 | consumed tokens: 55123640320 | elapsed time per iteration (s): 0.77 | learning rate: 8.165E-05 | global batch size: 256 | lm loss: 1.963283E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.168 | TFLOPs: 20.16 | 31: iteration 105150/ 173500 | consumed samples: 26918400 | consumed tokens: 55128883200 | elapsed time per iteration (s): 0.76 | learning rate: 8.163E-05 | global batch size: 256 | lm loss: 1.925369E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.748 | TFLOPs: 20.43 | 31: iteration 105160/ 173500 | consumed samples: 26920960 | consumed tokens: 55134126080 | elapsed time per iteration (s): 0.82 | learning rate: 8.162E-05 | global batch size: 256 | lm loss: 1.970383E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.465 | TFLOPs: 18.84 | 31: iteration 105170/ 173500 | consumed samples: 26923520 | consumed tokens: 55139368960 | elapsed time per iteration (s): 0.85 | learning rate: 8.160E-05 | global batch size: 256 | lm loss: 1.977543E+00 | grad norm: 0.160 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.365 | TFLOPs: 18.29 | 31: iteration 105180/ 173500 | consumed samples: 26926080 | consumed tokens: 55144611840 | elapsed time per iteration (s): 0.97 | learning rate: 8.159E-05 | global batch size: 256 | lm loss: 1.986308E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 264.757 | TFLOPs: 16.02 | 31: iteration 105190/ 173500 | consumed samples: 26928640 | consumed tokens: 55149854720 | elapsed time per iteration (s): 0.87 | learning rate: 8.157E-05 | global batch size: 256 | lm loss: 1.972696E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.135 | TFLOPs: 17.85 | 31: iteration 105200/ 173500 | consumed samples: 26931200 | consumed tokens: 55155097600 | elapsed time per iteration (s): 0.79 | learning rate: 8.155E-05 | global batch size: 256 | lm loss: 1.968069E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.433 | TFLOPs: 19.63 | 31: iteration 105210/ 173500 | consumed samples: 26933760 | consumed tokens: 55160340480 | elapsed time per iteration (s): 0.88 | learning rate: 8.154E-05 | global batch size: 256 | lm loss: 1.965103E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.988 | TFLOPs: 17.60 | 31: iteration 105220/ 173500 | consumed samples: 26936320 | consumed tokens: 55165583360 | elapsed time per iteration (s): 0.82 | learning rate: 8.152E-05 | global batch size: 256 | lm loss: 1.994266E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.183 | TFLOPs: 18.83 | 31: iteration 105230/ 173500 | consumed samples: 26938880 | consumed tokens: 55170826240 | elapsed time per iteration (s): 0.82 | 
learning rate: 8.151E-05 | global batch size: 256 | lm loss: 1.987736E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.811 | TFLOPs: 18.92 | 31: iteration 105240/ 173500 | consumed samples: 26941440 | consumed tokens: 55176069120 | elapsed time per iteration (s): 0.80 | learning rate: 8.149E-05 | global batch size: 256 | lm loss: 1.977558E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.958 | TFLOPs: 19.48 | 31: iteration 105250/ 173500 | consumed samples: 26944000 | consumed tokens: 55181312000 | elapsed time per iteration (s): 0.82 | learning rate: 8.148E-05 | global batch size: 256 | lm loss: 1.979844E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.862 | TFLOPs: 18.81 | 31: iteration 105260/ 173500 | consumed samples: 26946560 | consumed tokens: 55186554880 | elapsed time per iteration (s): 0.81 | learning rate: 8.146E-05 | global batch size: 256 | lm loss: 1.958384E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.998 | TFLOPs: 19.18 | 31: iteration 105270/ 173500 | consumed samples: 26949120 | consumed tokens: 55191797760 | elapsed time per iteration (s): 0.84 | learning rate: 8.144E-05 | global batch size: 256 | lm loss: 1.963977E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.359 | TFLOPs: 18.41 | 31: iteration 105280/ 173500 | consumed samples: 26951680 | consumed tokens: 55197040640 | elapsed time per iteration (s): 0.75 | learning rate: 8.143E-05 | global batch size: 256 | lm loss: 1.943894E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.334 | TFLOPs: 20.71 | 31: iteration 
105290/ 173500 | consumed samples: 26954240 | consumed tokens: 55202283520 | elapsed time per iteration (s): 0.77 | learning rate: 8.141E-05 | global batch size: 256 | lm loss: 1.973176E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.100 | TFLOPs: 20.15 | 31: iteration 105300/ 173500 | consumed samples: 26956800 | consumed tokens: 55207526400 | elapsed time per iteration (s): 0.81 | learning rate: 8.140E-05 | global batch size: 256 | lm loss: 1.981412E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.021 | TFLOPs: 19.06 | 31: iteration 105310/ 173500 | consumed samples: 26959360 | consumed tokens: 55212769280 | elapsed time per iteration (s): 0.74 | learning rate: 8.138E-05 | global batch size: 256 | lm loss: 1.958358E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.533 | TFLOPs: 20.84 | 31: iteration 105320/ 173500 | consumed samples: 26961920 | consumed tokens: 55218012160 | elapsed time per iteration (s): 0.76 | learning rate: 8.137E-05 | global batch size: 256 | lm loss: 1.935976E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.123 | TFLOPs: 20.40 | 31: iteration 105330/ 173500 | consumed samples: 26964480 | consumed tokens: 55223255040 | elapsed time per iteration (s): 0.77 | learning rate: 8.135E-05 | global batch size: 256 | lm loss: 1.936583E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.416 | TFLOPs: 20.11 | 31: iteration 105340/ 173500 | consumed samples: 26967040 | consumed tokens: 55228497920 | elapsed time per iteration (s): 0.90 | learning rate: 8.134E-05 | global batch size: 256 | lm loss: 1.978694E+00 | grad norm: 0.147 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 283.769 | TFLOPs: 17.17 | 31: iteration 105350/ 173500 | consumed samples: 26969600 | consumed tokens: 55233740800 | elapsed time per iteration (s): 0.75 | learning rate: 8.132E-05 | global batch size: 256 | lm loss: 1.964432E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.932 | TFLOPs: 20.75 | 31: iteration 105360/ 173500 | consumed samples: 26972160 | consumed tokens: 55238983680 | elapsed time per iteration (s): 0.74 | learning rate: 8.130E-05 | global batch size: 256 | lm loss: 1.969925E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.157 | TFLOPs: 20.88 | 31: iteration 105370/ 173500 | consumed samples: 26974720 | consumed tokens: 55244226560 | elapsed time per iteration (s): 0.81 | learning rate: 8.129E-05 | global batch size: 256 | lm loss: 1.962439E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.499 | TFLOPs: 19.21 | 31: iteration 105380/ 173500 | consumed samples: 26977280 | consumed tokens: 55249469440 | elapsed time per iteration (s): 0.76 | learning rate: 8.127E-05 | global batch size: 256 | lm loss: 1.969996E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.244 | TFLOPs: 20.34 | 31: iteration 105390/ 173500 | consumed samples: 26979840 | consumed tokens: 55254712320 | elapsed time per iteration (s): 0.73 | learning rate: 8.126E-05 | global batch size: 256 | lm loss: 1.977740E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.687 | TFLOPs: 21.09 | 31: iteration 105400/ 173500 | consumed samples: 26982400 | consumed tokens: 55259955200 | elapsed time per iteration (s): 0.75 | learning 
rate: 8.124E-05 | global batch size: 256 | lm loss: 1.975972E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.886 | TFLOPs: 20.74 | 31: iteration 105410/ 173500 | consumed samples: 26984960 | consumed tokens: 55265198080 | elapsed time per iteration (s): 0.75 | learning rate: 8.123E-05 | global batch size: 256 | lm loss: 1.977117E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.174 | TFLOPs: 20.76 | 31: iteration 105420/ 173500 | consumed samples: 26987520 | consumed tokens: 55270440960 | elapsed time per iteration (s): 0.77 | learning rate: 8.121E-05 | global batch size: 256 | lm loss: 1.979940E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.733 | TFLOPs: 20.01 | 31: iteration 105430/ 173500 | consumed samples: 26990080 | consumed tokens: 55275683840 | elapsed time per iteration (s): 0.74 | learning rate: 8.120E-05 | global batch size: 256 | lm loss: 1.982865E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.544 | TFLOPs: 20.90 | 31: iteration 105440/ 173500 | consumed samples: 26992640 | consumed tokens: 55280926720 | elapsed time per iteration (s): 0.80 | learning rate: 8.118E-05 | global batch size: 256 | lm loss: 1.962922E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.682 | TFLOPs: 19.46 | 31: iteration 105450/ 173500 | consumed samples: 26995200 | consumed tokens: 55286169600 | elapsed time per iteration (s): 0.70 | learning rate: 8.116E-05 | global batch size: 256 | lm loss: 1.957518E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 364.443 | TFLOPs: 22.05 | 31: iteration 105460/ 
173500 | consumed samples: 26997760 | consumed tokens: 55291412480 | elapsed time per iteration (s): 0.80 | learning rate: 8.115E-05 | global batch size: 256 | lm loss: 1.997302E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.925 | TFLOPs: 19.42 | 31: iteration 105470/ 173500 | consumed samples: 27000320 | consumed tokens: 55296655360 | elapsed time per iteration (s): 0.81 | learning rate: 8.113E-05 | global batch size: 256 | lm loss: 1.969824E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.783 | TFLOPs: 19.23 | 31: iteration 105480/ 173500 | consumed samples: 27002880 | consumed tokens: 55301898240 | elapsed time per iteration (s): 0.80 | learning rate: 8.112E-05 | global batch size: 256 | lm loss: 1.963362E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.673 | TFLOPs: 19.28 | 31: iteration 105490/ 173500 | consumed samples: 27005440 | consumed tokens: 55307141120 | elapsed time per iteration (s): 0.75 | learning rate: 8.110E-05 | global batch size: 256 | lm loss: 1.971241E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.738 | TFLOPs: 20.61 | 31: iteration 105500/ 173500 | consumed samples: 27008000 | consumed tokens: 55312384000 | elapsed time per iteration (s): 0.77 | learning rate: 8.109E-05 | global batch size: 256 | lm loss: 1.966076E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.510 | TFLOPs: 20.06 | 31: iteration 105510/ 173500 | consumed samples: 27010560 | consumed tokens: 55317626880 | elapsed time per iteration (s): 0.78 | learning rate: 8.107E-05 | global batch size: 256 | lm loss: 1.947233E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 329.234 | TFLOPs: 19.92 | 31: iteration 105520/ 173500 | consumed samples: 27013120 | consumed tokens: 55322869760 | elapsed time per iteration (s): 0.74 | learning rate: 8.105E-05 | global batch size: 256 | lm loss: 1.988038E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.640 | TFLOPs: 20.91 | 31: iteration 105530/ 173500 | consumed samples: 27015680 | consumed tokens: 55328112640 | elapsed time per iteration (s): 0.78 | learning rate: 8.104E-05 | global batch size: 256 | lm loss: 1.944072E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.991 | TFLOPs: 19.78 | 31: iteration 105540/ 173500 | consumed samples: 27018240 | consumed tokens: 55333355520 | elapsed time per iteration (s): 0.80 | learning rate: 8.102E-05 | global batch size: 256 | lm loss: 1.986459E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.929 | TFLOPs: 19.48 | 31: iteration 105550/ 173500 | consumed samples: 27020800 | consumed tokens: 55338598400 | elapsed time per iteration (s): 0.81 | learning rate: 8.101E-05 | global batch size: 256 | lm loss: 1.949164E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.588 | TFLOPs: 19.09 | 31: iteration 105560/ 173500 | consumed samples: 27023360 | consumed tokens: 55343841280 | elapsed time per iteration (s): 0.70 | learning rate: 8.099E-05 | global batch size: 256 | lm loss: 1.968982E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 368.158 | TFLOPs: 22.27 | 31: iteration 105570/ 173500 | consumed samples: 27025920 | consumed tokens: 55349084160 | elapsed time per iteration (s): 0.75 | learning rate: 
8.098E-05 | global batch size: 256 | lm loss: 1.975325E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.706 | TFLOPs: 20.73 | 31: iteration 105580/ 173500 | consumed samples: 27028480 | consumed tokens: 55354327040 | elapsed time per iteration (s): 0.78 | learning rate: 8.096E-05 | global batch size: 256 | lm loss: 1.997088E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.406 | TFLOPs: 19.93 | 31: iteration 105590/ 173500 | consumed samples: 27031040 | consumed tokens: 55359569920 | elapsed time per iteration (s): 0.71 | learning rate: 8.095E-05 | global batch size: 256 | lm loss: 1.985723E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.323 | TFLOPs: 21.68 | 31: iteration 105600/ 173500 | consumed samples: 27033600 | consumed tokens: 55364812800 | elapsed time per iteration (s): 0.75 | learning rate: 8.093E-05 | global batch size: 256 | lm loss: 1.954263E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.973 | TFLOPs: 20.75 | 31: iteration 105610/ 173500 | consumed samples: 27036160 | consumed tokens: 55370055680 | elapsed time per iteration (s): 0.75 | learning rate: 8.091E-05 | global batch size: 256 | lm loss: 1.985821E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.130 | TFLOPs: 20.52 | 31: iteration 105620/ 173500 | consumed samples: 27038720 | consumed tokens: 55375298560 | elapsed time per iteration (s): 0.76 | learning rate: 8.090E-05 | global batch size: 256 | lm loss: 1.969152E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.861 | TFLOPs: 20.44 | 31: iteration 105630/ 173500 | 
consumed samples: 27041280 | consumed tokens: 55380541440 | elapsed time per iteration (s): 0.76 | learning rate: 8.088E-05 | global batch size: 256 | lm loss: 1.966585E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.348 | TFLOPs: 20.41 | 31: iteration 105640/ 173500 | consumed samples: 27043840 | consumed tokens: 55385784320 | elapsed time per iteration (s): 0.72 | learning rate: 8.087E-05 | global batch size: 256 | lm loss: 1.967949E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.920 | TFLOPs: 21.65 | 31: iteration 105650/ 173500 | consumed samples: 27046400 | consumed tokens: 55391027200 | elapsed time per iteration (s): 0.79 | learning rate: 8.085E-05 | global batch size: 256 | lm loss: 1.967851E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.023 | TFLOPs: 19.54 | 31: iteration 105660/ 173500 | consumed samples: 27048960 | consumed tokens: 55396270080 | elapsed time per iteration (s): 0.74 | learning rate: 8.084E-05 | global batch size: 256 | lm loss: 1.963746E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.538 | TFLOPs: 21.03 | 31: iteration 105670/ 173500 | consumed samples: 27051520 | consumed tokens: 55401512960 | elapsed time per iteration (s): 0.78 | learning rate: 8.082E-05 | global batch size: 256 | lm loss: 1.993485E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.810 | TFLOPs: 19.89 | 31: iteration 105680/ 173500 | consumed samples: 27054080 | consumed tokens: 55406755840 | elapsed time per iteration (s): 0.80 | learning rate: 8.081E-05 | global batch size: 256 | lm loss: 1.970119E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 319.458 | TFLOPs: 19.33 | 31: iteration 105690/ 173500 | consumed samples: 27056640 | consumed tokens: 55411998720 | elapsed time per iteration (s): 0.80 | learning rate: 8.079E-05 | global batch size: 256 | lm loss: 1.966509E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.362 | TFLOPs: 19.26 | 31: iteration 105700/ 173500 | consumed samples: 27059200 | consumed tokens: 55417241600 | elapsed time per iteration (s): 0.76 | learning rate: 8.077E-05 | global batch size: 256 | lm loss: 1.960973E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.070 | TFLOPs: 20.33 | 31: iteration 105710/ 173500 | consumed samples: 27061760 | consumed tokens: 55422484480 | elapsed time per iteration (s): 0.80 | learning rate: 8.076E-05 | global batch size: 256 | lm loss: 1.977578E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.372 | TFLOPs: 19.32 | 31: iteration 105720/ 173500 | consumed samples: 27064320 | consumed tokens: 55427727360 | elapsed time per iteration (s): 0.90 | learning rate: 8.074E-05 | global batch size: 256 | lm loss: 1.967303E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.762 | TFLOPs: 17.23 | 31: iteration 105730/ 173500 | consumed samples: 27066880 | consumed tokens: 55432970240 | elapsed time per iteration (s): 0.85 | learning rate: 8.073E-05 | global batch size: 256 | lm loss: 1.952025E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.630 | TFLOPs: 18.19 | 31: iteration 105740/ 173500 | consumed samples: 27069440 | consumed tokens: 55438213120 | elapsed time per iteration (s): 0.79 | learning rate: 
8.071E-05 | global batch size: 256 | lm loss: 1.947831E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.586 | TFLOPs: 19.58 | 31: iteration 105750/ 173500 | consumed samples: 27072000 | consumed tokens: 55443456000 | elapsed time per iteration (s): 0.87 | learning rate: 8.070E-05 | global batch size: 256 | lm loss: 1.984665E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.377 | TFLOPs: 17.75 | 31: iteration 105760/ 173500 | consumed samples: 27074560 | consumed tokens: 55448698880 | elapsed time per iteration (s): 0.75 | learning rate: 8.068E-05 | global batch size: 256 | lm loss: 1.949682E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.457 | TFLOPs: 20.66 | 31: iteration 105770/ 173500 | consumed samples: 27077120 | consumed tokens: 55453941760 | elapsed time per iteration (s): 0.74 | learning rate: 8.067E-05 | global batch size: 256 | lm loss: 1.967977E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.766 | TFLOPs: 20.86 | 31: iteration 105780/ 173500 | consumed samples: 27079680 | consumed tokens: 55459184640 | elapsed time per iteration (s): 0.83 | learning rate: 8.065E-05 | global batch size: 256 | lm loss: 1.962628E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.874 | TFLOPs: 18.75 | 31: iteration 105790/ 173500 | consumed samples: 27082240 | consumed tokens: 55464427520 | elapsed time per iteration (s): 0.81 | learning rate: 8.063E-05 | global batch size: 256 | lm loss: 1.968415E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.623 | TFLOPs: 19.09 | 31: iteration 105800/ 173500 | 
consumed samples: 27084800 | consumed tokens: 55469670400 | elapsed time per iteration (s): 0.79 | learning rate: 8.062E-05 | global batch size: 256 | lm loss: 1.952783E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.986 | TFLOPs: 19.72 | 31: iteration 105810/ 173500 | consumed samples: 27087360 | consumed tokens: 55474913280 | elapsed time per iteration (s): 0.79 | learning rate: 8.060E-05 | global batch size: 256 | lm loss: 1.961097E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.978 | TFLOPs: 19.66 | 31: iteration 105820/ 173500 | consumed samples: 27089920 | consumed tokens: 55480156160 | elapsed time per iteration (s): 0.84 | learning rate: 8.059E-05 | global batch size: 256 | lm loss: 1.975142E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.083 | TFLOPs: 18.52 | 31: iteration 105830/ 173500 | consumed samples: 27092480 | consumed tokens: 55485399040 | elapsed time per iteration (s): 0.87 | learning rate: 8.057E-05 | global batch size: 256 | lm loss: 1.950263E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.026 | TFLOPs: 17.79 | 31: iteration 105840/ 173500 | consumed samples: 27095040 | consumed tokens: 55490641920 | elapsed time per iteration (s): 0.85 | learning rate: 8.056E-05 | global batch size: 256 | lm loss: 1.979407E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.137 | TFLOPs: 18.22 | 31: iteration 105850/ 173500 | consumed samples: 27097600 | consumed tokens: 55495884800 | elapsed time per iteration (s): 0.89 | learning rate: 8.054E-05 | global batch size: 256 | lm loss: 1.958407E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 289.079 | TFLOPs: 17.49 | 31: iteration 105860/ 173500 | consumed samples: 27100160 | consumed tokens: 55501127680 | elapsed time per iteration (s): 0.84 | learning rate: 8.053E-05 | global batch size: 256 | lm loss: 1.950821E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.720 | TFLOPs: 18.50 | 31: iteration 105870/ 173500 | consumed samples: 27102720 | consumed tokens: 55506370560 | elapsed time per iteration (s): 0.73 | learning rate: 8.051E-05 | global batch size: 256 | lm loss: 1.963691E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.314 | TFLOPs: 21.31 | 31: iteration 105880/ 173500 | consumed samples: 27105280 | consumed tokens: 55511613440 | elapsed time per iteration (s): 0.76 | learning rate: 8.049E-05 | global batch size: 256 | lm loss: 1.971730E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.351 | TFLOPs: 20.35 | 31: iteration 105890/ 173500 | consumed samples: 27107840 | consumed tokens: 55516856320 | elapsed time per iteration (s): 0.79 | learning rate: 8.048E-05 | global batch size: 256 | lm loss: 1.964470E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.429 | TFLOPs: 19.57 | 31: iteration 105900/ 173500 | consumed samples: 27110400 | consumed tokens: 55522099200 | elapsed time per iteration (s): 0.78 | learning rate: 8.046E-05 | global batch size: 256 | lm loss: 1.968205E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.386 | TFLOPs: 19.93 | 31: iteration 105910/ 173500 | consumed samples: 27112960 | consumed tokens: 55527342080 | elapsed time per iteration (s): 0.82 | learning rate: 
8.045E-05 | global batch size: 256 | lm loss: 1.964295E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.067 | TFLOPs: 19.00 | 31: iteration 105920/ 173500 | consumed samples: 27115520 | consumed tokens: 55532584960 | elapsed time per iteration (s): 0.88 | learning rate: 8.043E-05 | global batch size: 256 | lm loss: 1.954550E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.844 | TFLOPs: 17.66 | 31: iteration 105930/ 173500 | consumed samples: 27118080 | consumed tokens: 55537827840 | elapsed time per iteration (s): 0.78 | learning rate: 8.042E-05 | global batch size: 256 | lm loss: 1.950906E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.248 | TFLOPs: 19.74 | 31: iteration 105940/ 173500 | consumed samples: 27120640 | consumed tokens: 55543070720 | elapsed time per iteration (s): 0.78 | learning rate: 8.040E-05 | global batch size: 256 | lm loss: 1.969945E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.985 | TFLOPs: 19.84 | 31: iteration 105950/ 173500 | consumed samples: 27123200 | consumed tokens: 55548313600 | elapsed time per iteration (s): 0.80 | learning rate: 8.039E-05 | global batch size: 256 | lm loss: 1.984365E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.407 | TFLOPs: 19.32 | 31: iteration 105960/ 173500 | consumed samples: 27125760 | consumed tokens: 55553556480 | elapsed time per iteration (s): 0.86 | learning rate: 8.037E-05 | global batch size: 256 | lm loss: 1.938551E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.186 | TFLOPs: 17.98 | 31: iteration 105970/ 173500 | 
consumed samples: 27128320 | consumed tokens: 55558799360 | elapsed time per iteration (s): 0.86 | learning rate: 8.035E-05 | global batch size: 256 | lm loss: 1.984921E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.635 | TFLOPs: 18.07 | 31: iteration 105980/ 173500 | consumed samples: 27130880 | consumed tokens: 55564042240 | elapsed time per iteration (s): 0.85 | learning rate: 8.034E-05 | global batch size: 256 | lm loss: 1.964889E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.859 | TFLOPs: 18.14 | 31: iteration 105990/ 173500 | consumed samples: 27133440 | consumed tokens: 55569285120 | elapsed time per iteration (s): 0.85 | learning rate: 8.032E-05 | global batch size: 256 | lm loss: 1.952677E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.448 | TFLOPs: 18.30 | 0: [2022-11-26 18:02:04,423] [INFO] [logging.py:68:log_dist] [Rank 0] step=106000, skipped=0, lr=[8.030787777917086e-05, 8.030787777917086e-05, 8.030787777917086e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 106000/ 173500 | consumed samples: 27136000 | consumed tokens: 55574528000 | elapsed time per iteration (s): 0.84 | learning rate: 8.031E-05 | global batch size: 256 | lm loss: 1.965882E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.364 | TFLOPs: 18.47 | 0: steps: 106000 loss: 1.9320 iter time (s): 0.801 samples/sec: 319.640 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 106000 | lm loss value: 1.888047E+00 | lm loss PPL: 6.606454E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 106000 to 
checkpoints_1b1long 0: [2022-11-26 18:02:04,701] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step106000 is begin to save! 0: [2022-11-26 18:02:04,713] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_01-model_00-model_states.pt... 0: [2022-11-26 18:02:04,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_01-model_00-model_states.pt. 0: [2022-11-26 18:02:04,939] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_03-model_00-model_states.pt... 0: [2022-11-26 18:02:05,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_03-model_00-model_states.pt. 0: [2022-11-26 18:02:05,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_04-model_00-model_states.pt... 0: [2022-11-26 18:02:05,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_04-model_00-model_states.pt. 0: [2022-11-26 18:02:05,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_05-model_00-model_states.pt... 0: [2022-11-26 18:02:05,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_05-model_00-model_states.pt. 0: [2022-11-26 18:02:05,180] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_06-model_00-model_states.pt... 0: [2022-11-26 18:02:05,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_06-model_00-model_states.pt. 0: [2022-11-26 18:02:05,257] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_07-model_00-model_states.pt... 
0: [2022-11-26 18:02:05,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_07-model_00-model_states.pt. 0: [2022-11-26 18:02:05,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_08-model_00-model_states.pt... 0: [2022-11-26 18:02:05,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_08-model_00-model_states.pt. 0: [2022-11-26 18:02:05,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_09-model_00-model_states.pt... 0: [2022-11-26 18:02:05,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_09-model_00-model_states.pt. 0: [2022-11-26 18:02:05,495] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_10-model_00-model_states.pt... 0: [2022-11-26 18:02:05,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_10-model_00-model_states.pt. 0: [2022-11-26 18:02:05,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_11-model_00-model_states.pt... 0: [2022-11-26 18:02:05,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_11-model_00-model_states.pt. 0: [2022-11-26 18:02:05,649] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_12-model_00-model_states.pt... 0: [2022-11-26 18:02:05,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_12-model_00-model_states.pt. 0: [2022-11-26 18:02:05,722] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_13-model_00-model_states.pt... 
0: [2022-11-26 18:02:05,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_13-model_00-model_states.pt. 0: [2022-11-26 18:02:05,801] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_14-model_00-model_states.pt... 0: [2022-11-26 18:02:05,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_14-model_00-model_states.pt. 0: [2022-11-26 18:02:05,879] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_15-model_00-model_states.pt... 0: [2022-11-26 18:02:05,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_15-model_00-model_states.pt. 0: [2022-11-26 18:02:05,954] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_16-model_00-model_states.pt... 0: [2022-11-26 18:02:06,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_16-model_00-model_states.pt. 0: [2022-11-26 18:02:06,029] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_17-model_00-model_states.pt... 0: [2022-11-26 18:02:06,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_17-model_00-model_states.pt. 0: [2022-11-26 18:02:06,109] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_18-model_00-model_states.pt... 0: [2022-11-26 18:02:06,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_18-model_00-model_states.pt. 0: [2022-11-26 18:02:06,183] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_19-model_00-model_states.pt... 
0: [2022-11-26 18:02:06,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_19-model_00-model_states.pt. 0: [2022-11-26 18:02:06,259] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_20-model_00-model_states.pt... 0: [2022-11-26 18:02:06,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_20-model_00-model_states.pt. 0: [2022-11-26 18:02:06,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_21-model_00-model_states.pt... 0: [2022-11-26 18:02:06,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_21-model_00-model_states.pt. 0: [2022-11-26 18:02:06,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_22-model_00-model_states.pt... 0: [2022-11-26 18:02:06,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_22-model_00-model_states.pt. 0: [2022-11-26 18:02:06,524] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_23-model_00-model_states.pt... 0: [2022-11-26 18:02:06,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_23-model_00-model_states.pt. 0: [2022-11-26 18:02:06,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_24-model_00-model_states.pt... 0: [2022-11-26 18:02:06,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_24-model_00-model_states.pt. 0: [2022-11-26 18:02:06,679] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_25-model_00-model_states.pt... 
0: [2022-11-26 18:02:06,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_25-model_00-model_states.pt. 0: [2022-11-26 18:02:06,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_26-model_00-model_states.pt... 0: [2022-11-26 18:02:06,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_26-model_00-model_states.pt. 0: [2022-11-26 18:02:06,833] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_27-model_00-model_states.pt... 0: [2022-11-26 18:02:06,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_27-model_00-model_states.pt. 0: [2022-11-26 18:02:06,908] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_28-model_00-model_states.pt... 0: [2022-11-26 18:02:06,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_28-model_00-model_states.pt. 0: [2022-11-26 18:02:06,984] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/layer_30-model_00-model_states.pt... 0: [2022-11-26 18:02:06,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/layer_30-model_00-model_states.pt. 0: [2022-11-26 18:02:06,987] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step106000/mp_rank_00_model_states.pt 0: [2022-11-26 18:02:06,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/mp_rank_00_model_states.pt... 0: [2022-11-26 18:02:06,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/mp_rank_00_model_states.pt. 
0: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
5: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
1: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
13: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
12: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 
23: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
14: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 
22: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 
21: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 
27: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 
4: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
16: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 
3: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 
23: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 
29: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 
26: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 
5: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
10: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 
25: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 
31: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 
19: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 
15: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 
30: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 
26: [2022-11-26 18:02:07,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:02:07,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:02:07,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:02:07,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:02:07,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 0: [2022-11-26 18:02:07,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 18:02:07,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 6: [2022-11-26 18:02:07,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 1: [2022-11-26 18:02:07,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:02:07,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 18:02:07,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
29: [2022-11-26 18:02:07,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:02:07,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 18:02:07,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 30: [2022-11-26 18:02:07,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:02:07,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 18:02:07,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 3: [2022-11-26 18:02:07,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:02:07,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 18:02:07,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 23: [2022-11-26 18:02:07,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:02:07,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 18:02:07,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
8: [2022-11-26 18:02:07,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:02:07,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 18:02:07,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 2: [2022-11-26 18:02:07,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:02:07,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 18:02:07,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 17: [2022-11-26 18:02:07,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:02:07,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:02:07,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 27: [2022-11-26 18:02:07,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 17: [2022-11-26 18:02:07,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 27: [2022-11-26 18:02:07,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
5: [2022-11-26 18:02:07,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:02:07,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 18:02:07,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 21: [2022-11-26 18:02:07,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:02:07,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 18:02:07,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 14: [2022-11-26 18:02:07,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:02:07,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 18:02:07,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 31: [2022-11-26 18:02:07,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:02:07,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 18:02:07,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
12: [2022-11-26 18:02:07,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:02:07,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:02:07,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:02:07,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 15: [2022-11-26 18:02:07,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 22: [2022-11-26 18:02:07,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 12: [2022-11-26 18:02:07,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 15: [2022-11-26 18:02:07,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 22: [2022-11-26 18:02:07,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 24: [2022-11-26 18:02:07,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:02:07,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 18:02:07,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
20: [2022-11-26 18:02:07,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:02:07,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 18:02:07,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 6: [2022-11-26 18:02:07,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:02:07,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:02:07,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 18:02:07,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 1: [2022-11-26 18:02:07,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 26: [2022-11-26 18:02:07,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:02:07,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 26: [2022-11-26 18:02:07,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 18:02:07,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
20: [2022-11-26 18:02:07,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:02:07,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:02:07,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 30: [2022-11-26 18:02:07,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 20: [2022-11-26 18:02:07,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 30: [2022-11-26 18:02:07,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 9: [2022-11-26 18:02:07,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:02:07,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 18:02:07,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 7: [2022-11-26 18:02:07,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:02:07,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 18:02:07,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
3: [2022-11-26 18:02:07,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:02:07,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 18:02:07,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 21: [2022-11-26 18:02:07,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:02:07,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:02:07,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 2: [2022-11-26 18:02:07,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 21: [2022-11-26 18:02:07,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 2: [2022-11-26 18:02:07,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 26: [2022-11-26 18:02:07,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:02:07,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 18:02:07,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
7: [2022-11-26 18:02:07,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:02:07,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 18:02:07,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 25: [2022-11-26 18:02:07,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:02:07,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 18:02:07,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 31: [2022-11-26 18:02:07,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:02:07,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:02:07,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 29: [2022-11-26 18:02:07,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 31: [2022-11-26 18:02:07,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 29: [2022-11-26 18:02:07,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
17: [2022-11-26 18:02:07,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:02:07,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 18:02:07,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 27: [2022-11-26 18:02:07,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:02:07,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 23: [2022-11-26 18:02:07,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:02:07,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 23: [2022-11-26 18:02:07,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 15: [2022-11-26 18:02:07,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:02:07,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 15: [2022-11-26 18:02:07,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 18:02:07,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
5: [2022-11-26 18:02:07,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:02:07,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 18:02:07,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 20: [2022-11-26 18:02:07,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:02:07,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 18:02:07,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 30: [2022-11-26 18:02:07,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:02:07,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 18:02:07,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 26: [2022-11-26 18:02:07,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:02:07,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:02:07,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
26: [2022-11-26 18:02:07,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 12: [2022-11-26 18:02:07,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 26: [2022-11-26 18:02:07,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 8: [2022-11-26 18:02:07,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 12: [2022-11-26 18:02:07,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 8: [2022-11-26 18:02:07,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 29: [2022-11-26 18:02:07,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:02:07,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 18:02:07,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 22: [2022-11-26 18:02:07,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:02:07,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 18:02:07,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
1: [2022-11-26 18:02:07,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:02:07,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 18:02:07,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 22: [2022-11-26 18:02:07,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:02:07,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:02:07,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 18:02:07,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 10: [2022-11-26 18:02:07,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:02:07,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
10: [2022-11-26 18:02:07,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 6: [2022-11-26 18:02:07,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 10: [2022-11-26 18:02:07,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 18:02:07,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 6: [2022-11-26 18:02:07,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 10: [2022-11-26 18:02:07,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 25: [2022-11-26 18:02:07,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:02:07,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 14: [2022-11-26 18:02:07,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:02:07,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 18:02:07,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 25: [2022-11-26 18:02:07,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
14: [2022-11-26 18:02:07,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:02:07,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:02:07,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 18:02:07,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 17: [2022-11-26 18:02:07,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 18:02:07,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 12: [2022-11-26 18:02:07,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:02:07,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 18:02:07,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 2: [2022-11-26 18:02:07,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:02:07,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 18:02:07,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
24: [2022-11-26 18:02:07,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:02:07,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:02:07,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:02:07,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 18:02:07,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 18:02:07,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 28: [2022-11-26 18:02:07,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 24: [2022-11-26 18:02:07,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 18:02:07,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 24: [2022-11-26 18:02:07,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:02:07,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 18:02:07,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
31: [2022-11-26 18:02:07,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:02:07,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 18:02:07,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 16: [2022-11-26 18:02:07,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:02:07,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 9: [2022-11-26 18:02:07,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:02:07,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 9: [2022-11-26 18:02:07,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 18:02:07,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 3: [2022-11-26 18:02:07,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:02:07,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 18:02:07,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
8: [2022-11-26 18:02:07,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:02:07,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 18:02:07,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 21: [2022-11-26 18:02:07,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:02:07,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:02:07,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 15: [2022-11-26 18:02:07,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 18:02:07,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 21: [2022-11-26 18:02:07,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 7: [2022-11-26 18:02:07,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:02:07,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 18:02:07,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
27: [2022-11-26 18:02:07,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:02:07,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:02:07,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 27: [2022-11-26 18:02:07,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 16: [2022-11-26 18:02:07,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 27: [2022-11-26 18:02:07,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 28: [2022-11-26 18:02:07,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:02:07,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 18:02:07,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 20: [2022-11-26 18:02:07,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:02:07,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
25: [2022-11-26 18:02:07,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 20: [2022-11-26 18:02:07,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 18:02:07,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 25: [2022-11-26 18:02:07,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 5: [2022-11-26 18:02:07,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:02:07,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 18:02:07,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 4: [2022-11-26 18:02:07,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:02:07,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:02:07,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:02:07,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 18:02:07,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 18:02:07,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 18:02:07,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 4: [2022-11-26 18:02:07,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 4: [2022-11-26 18:02:07,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 6: [2022-11-26 18:02:07,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 4: [2022-11-26 18:02:07,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 6: [2022-11-26 18:02:07,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 19: [2022-11-26 18:02:07,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:02:07,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:02:07,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-26 18:02:07,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 18:02:07,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 18:02:07,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 18:02:07,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 19: [2022-11-26 18:02:07,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 19: [2022-11-26 18:02:07,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 1: [2022-11-26 18:02:07,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:02:07,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 18:02:07,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 9: [2022-11-26 18:02:07,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:02:07,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 18:02:07,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
17: [2022-11-26 18:02:07,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:02:07,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 18:02:07,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 2: [2022-11-26 18:02:07,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:02:07,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 18:02:07,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 21: [2022-11-26 18:02:07,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:02:07,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 18:02:07,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 0: [2022-11-26 18:02:07,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:02:07,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 18:02:07,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
0: [2022-11-26 18:02:07,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:02:07,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 18:02:07,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 26: [2022-11-26 18:02:07,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:02:07,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 18:02:07,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 13: [2022-11-26 18:02:07,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:02:07,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:02:07,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:02:07,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 18:02:07,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 18:02:07,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 18:02:07,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 13: [2022-11-26 18:02:07,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 18:02:07,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 18:02:07,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 13: [2022-11-26 18:02:07,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 13: [2022-11-26 18:02:07,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 23: [2022-11-26 18:02:07,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:02:07,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 18:02:07,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 3: [2022-11-26 18:02:07,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 18:02:07,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 18:02:07,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 18: [2022-11-26 18:02:07,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:02:07,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 18:02:07,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 0: [2022-11-26 18:02:07,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:02:07,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 18:02:07,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 25: [2022-11-26 18:02:07,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:02:07,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 18:02:07,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 11: [2022-11-26 18:02:07,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 18:02:07,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:02:07,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:02:07,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 18:02:07,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 18:02:07,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 18:02:07,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 11: [2022-11-26 18:02:07,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 11: [2022-11-26 18:02:07,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 11: [2022-11-26 18:02:07,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:02:07,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 18:02:07,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 18: [2022-11-26 18:02:07,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
18: [2022-11-26 18:02:07,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 18:02:07,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 24: [2022-11-26 18:02:07,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:02:07,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 18:02:07,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 23: [2022-11-26 18:02:07,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:02:07,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:02:07,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 18:02:07,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 18: [2022-11-26 18:02:07,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 18:02:07,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 30: [2022-11-26 18:02:07,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-26 18:02:07,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 18:02:07,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 29: [2022-11-26 18:02:07,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:02:07,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 18:02:07,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 14: [2022-11-26 18:02:07,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:02:07,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 18:02:07,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 27: [2022-11-26 18:02:07,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:02:07,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
27: [2022-11-26 18:02:07,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 5: [2022-11-26 18:02:07,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 27: [2022-11-26 18:02:07,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 5: [2022-11-26 18:02:07,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 28: [2022-11-26 18:02:07,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:02:07,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 4: [2022-11-26 18:02:07,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:02:07,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 4: [2022-11-26 18:02:07,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 18:02:07,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 8: [2022-11-26 18:02:07,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 18:02:07,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 18:02:07,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 16: [2022-11-26 18:02:07,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:02:07,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 18:02:07,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 31: [2022-11-26 18:02:07,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:02:07,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 18:02:07,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 19: [2022-11-26 18:02:07,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:02:07,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 18:02:07,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 7: [2022-11-26 18:02:07,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 18:02:07,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 18:02:07,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 12: [2022-11-26 18:02:07,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:02:07,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 18:02:07,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 6: [2022-11-26 18:02:07,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:02:07,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 18:02:07,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 22: [2022-11-26 18:02:07,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:02:07,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 18:02:07,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 10: [2022-11-26 18:02:07,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-26 18:02:07,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 18:02:07,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 15: [2022-11-26 18:02:07,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:02:07,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 18:02:07,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 20: [2022-11-26 18:02:07,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:02:07,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 18:02:07,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 0: [2022-11-26 18:02:07,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:02:07,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 18:02:07,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 9: [2022-11-26 18:02:07,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 18:02:07,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 18:02:07,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 18: [2022-11-26 18:02:07,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:02:07,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 18:02:07,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 2: [2022-11-26 18:02:07,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:02:07,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 18:02:07,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 26: [2022-11-26 18:02:07,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:02:07,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
26: [2022-11-26 18:02:07,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 1: [2022-11-26 18:02:07,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 26: [2022-11-26 18:02:07,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 1: [2022-11-26 18:02:07,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 3: [2022-11-26 18:02:07,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:02:07,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 18:02:07,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 17: [2022-11-26 18:02:07,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:02:07,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 18:02:07,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 21: [2022-11-26 18:02:07,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
21: [2022-11-26 18:02:07,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 18:02:07,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 13: [2022-11-26 18:02:07,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:02:07,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 18:02:07,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 30: [2022-11-26 18:02:07,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:02:07,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 18:02:07,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 28: [2022-11-26 18:02:07,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:02:07,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 18:02:07,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 29: [2022-11-26 18:02:07,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 
29: [2022-11-26 18:02:07,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 18:02:07,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 27: [2022-11-26 18:02:07,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:02:07,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 18:02:07,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 24: [2022-11-26 18:02:07,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:02:07,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 10: [2022-11-26 18:02:07,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:02:07,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 10: [2022-11-26 18:02:07,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 18:02:07,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 8: [2022-11-26 18:02:07,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
11: [2022-11-26 18:02:07,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:02:07,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 18:02:07,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 11: [2022-11-26 18:02:07,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 18:02:07,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 25: [2022-11-26 18:02:07,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:02:07,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 18:02:07,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 16: [2022-11-26 18:02:07,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:02:07,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 18:02:07,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 31: [2022-11-26 18:02:07,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-26 18:02:07,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 18:02:07,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 23: [2022-11-26 18:02:07,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:02:07,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 18:02:07,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 4: [2022-11-26 18:02:07,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:02:07,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 18:02:07,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 14: [2022-11-26 18:02:07,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:02:07,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 18:02:07,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 5: [2022-11-26 18:02:07,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 18:02:07,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 18:02:07,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 6: [2022-11-26 18:02:07,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:02:07,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 18:02:07,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 19: [2022-11-26 18:02:07,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:02:07,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 18:02:07,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 7: [2022-11-26 18:02:07,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:02:07,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 18:02:07,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 15: [2022-11-26 18:02:07,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 18:02:07,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 18:02:07,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 0: [2022-11-26 18:02:07,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:02:07,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 18: [2022-11-26 18:02:07,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:02:07,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 18: [2022-11-26 18:02:07,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 18:02:07,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 12: [2022-11-26 18:02:07,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:02:07,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 18:02:07,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 22: [2022-11-26 18:02:07,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
22: [2022-11-26 18:02:07,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 18:02:07,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 1: [2022-11-26 18:02:07,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:02:07,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 18:02:07,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 9: [2022-11-26 18:02:07,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:02:07,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 18:02:07,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 20: [2022-11-26 18:02:07,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:02:07,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 18:02:07,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 17: [2022-11-26 18:02:07,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-26 18:02:07,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 18:02:07,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 3: [2022-11-26 18:02:07,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:02:07,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 18:02:07,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 2: [2022-11-26 18:02:07,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:02:07,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 18:02:07,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 26: [2022-11-26 18:02:07,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:02:07,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 18:02:07,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 30: [2022-11-26 18:02:07,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-26 18:02:07,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 13: [2022-11-26 18:02:07,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:02:07,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 13: [2022-11-26 18:02:07,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 18:02:07,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 11: [2022-11-26 18:02:07,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:02:07,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 18:02:07,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 21: [2022-11-26 18:02:07,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:02:07,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 18:02:07,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 24: [2022-11-26 18:02:07,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
24: [2022-11-26 18:02:07,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 18:02:07,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 27: [2022-11-26 18:02:07,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:02:07,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 18:02:07,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 29: [2022-11-26 18:02:07,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:02:07,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:02:07,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 23: [2022-11-26 18:02:07,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 29: [2022-11-26 18:02:07,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 23: [2022-11-26 18:02:07,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 28: [2022-11-26 18:02:07,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-26 18:02:07,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 25: [2022-11-26 18:02:07,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:02:07,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 25: [2022-11-26 18:02:07,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 18:02:07,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 8: [2022-11-26 18:02:07,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:02:07,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 18:02:07,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 4: [2022-11-26 18:02:07,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:02:07,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 18:02:07,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 19: [2022-11-26 18:02:07,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-26 18:02:07,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 18:02:07,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 5: [2022-11-26 18:02:07,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:02:07,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 18:02:07,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 16: [2022-11-26 18:02:07,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:02:07,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 18:02:07,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 14: [2022-11-26 18:02:07,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:02:07,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 31: [2022-11-26 18:02:07,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:02:07,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
31: [2022-11-26 18:02:07,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 18:02:07,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 7: [2022-11-26 18:02:07,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:02:07,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 18:02:07,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 12: [2022-11-26 18:02:07,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:02:07,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 18:02:07,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 6: [2022-11-26 18:02:07,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:02:07,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 18:02:07,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 10: [2022-11-26 18:02:07,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 18:02:07,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 18:02:07,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 22: [2022-11-26 18:02:07,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:02:07,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 18:02:07,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 0: [2022-11-26 18:02:07,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:02:07,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 18:02:07,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 18: [2022-11-26 18:02:07,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:02:07,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 18:02:07,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 1: [2022-11-26 18:02:07,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 18:02:07,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 3: [2022-11-26 18:02:07,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:02:07,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 3: [2022-11-26 18:02:07,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 18:02:07,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 15: [2022-11-26 18:02:07,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:02:07,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 18:02:07,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 20: [2022-11-26 18:02:07,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:02:07,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:02:07,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 18:02:07,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
9: [2022-11-26 18:02:07,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 18:02:07,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 13: [2022-11-26 18:02:07,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:02:07,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 18:02:07,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 21: [2022-11-26 18:02:07,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:02:07,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 18:02:07,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 26: [2022-11-26 18:02:07,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:02:07,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 18:02:07,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 2: [2022-11-26 18:02:07,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 18:02:07,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 18:02:07,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 25: [2022-11-26 18:02:07,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:02:07,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 18:02:07,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 30: [2022-11-26 18:02:07,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:02:07,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 18:02:07,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 29: [2022-11-26 18:02:07,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:02:07,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:02:07,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 18:02:07,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
8: [2022-11-26 18:02:07,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 18:02:07,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 27: [2022-11-26 18:02:07,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:02:07,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 18:02:07,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 5: [2022-11-26 18:02:07,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:02:07,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:02:07,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 18:02:07,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 28: [2022-11-26 18:02:07,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 18:02:07,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 11: [2022-11-26 18:02:07,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-26 18:02:07,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 18:02:07,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 23: [2022-11-26 18:02:07,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:02:07,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 18:02:07,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 24: [2022-11-26 18:02:07,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:02:07,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 18:02:07,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 14: [2022-11-26 18:02:07,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:02:07,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 18:02:07,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 31: [2022-11-26 18:02:07,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
31: [2022-11-26 18:02:07,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 18:02:07,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 16: [2022-11-26 18:02:07,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:02:07,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 18:02:07,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 7: [2022-11-26 18:02:07,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:02:07,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 18:02:07,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 12: [2022-11-26 18:02:07,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:02:07,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 18:02:07,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 6: [2022-11-26 18:02:07,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 18:02:07,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 18:02:07,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 19: [2022-11-26 18:02:07,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:02:07,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 18:02:07,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 4: [2022-11-26 18:02:07,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:02:07,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 22: [2022-11-26 18:02:07,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:02:07,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 22: [2022-11-26 18:02:07,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 18:02:07,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 15: [2022-11-26 18:02:07,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-26 18:02:07,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 18:02:07,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 0: [2022-11-26 18:02:07,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:02:07,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:02:07,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 18:02:07,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 10: [2022-11-26 18:02:07,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:02:07,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 18:02:07,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 17: [2022-11-26 18:02:07,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:02:07,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
17: [2022-11-26 18:02:07,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 8: [2022-11-26 18:02:07,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 18:02:07,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 17: [2022-11-26 18:02:07,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 9: [2022-11-26 18:02:07,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:02:07,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 18:02:07,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 27: [2022-11-26 18:02:07,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:02:07,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 18:02:07,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 20: [2022-11-26 18:02:07,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
20: [2022-11-26 18:02:07,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 18:02:07,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 24: [2022-11-26 18:02:07,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:02:07,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:02:07,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 18:02:07,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 16: [2022-11-26 18:02:07,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 18:02:07,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 3: [2022-11-26 18:02:07,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:02:07,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:02:07,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 18:02:07,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
31: [2022-11-26 18:02:07,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 29: [2022-11-26 18:02:07,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:02:07,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 29: [2022-11-26 18:02:07,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 18:02:07,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 0: [2022-11-26 18:02:07,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 18:02:07,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 17: [2022-11-26 18:02:07,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:02:07,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 18:02:07,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 30: [2022-11-26 18:02:07,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
30: [2022-11-26 18:02:07,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 18:02:07,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 11: [2022-11-26 18:02:07,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:02:07,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 18:02:07,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 5: [2022-11-26 18:02:07,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:02:07,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 18:02:07,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 21: [2022-11-26 18:02:07,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:02:07,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 18:02:07,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 26: [2022-11-26 18:02:07,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
28: [2022-11-26 18:02:07,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:02:07,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:02:07,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 28: [2022-11-26 18:02:07,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 7: [2022-11-26 18:02:07,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 28: [2022-11-26 18:02:07,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 26: [2022-11-26 18:02:07,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 7: [2022-11-26 18:02:07,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 2: [2022-11-26 18:02:07,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:02:07,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 18:02:07,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 4: [2022-11-26 18:02:07,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 18:02:07,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 18:02:07,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 23: [2022-11-26 18:02:07,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:02:07,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 13: [2022-11-26 18:02:07,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:02:07,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 23: [2022-11-26 18:02:07,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 13: [2022-11-26 18:02:07,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 15: [2022-11-26 18:02:07,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:02:07,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 18:02:07,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 14: [2022-11-26 18:02:07,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-26 18:02:07,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 18:02:07,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 22: [2022-11-26 18:02:07,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:02:07,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 18:02:07,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 25: [2022-11-26 18:02:07,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:02:07,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 18:02:07,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 16: [2022-11-26 18:02:07,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:02:07,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 18:02:07,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 12: [2022-11-26 18:02:07,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 18:02:07,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 18:02:07,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 9: [2022-11-26 18:02:07,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:02:07,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 18:02:07,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 18: [2022-11-26 18:02:07,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:02:07,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 18:02:07,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 10: [2022-11-26 18:02:07,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:02:07,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:02:07,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 18:02:07,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
19: [2022-11-26 18:02:07,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 18:02:07,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 10: [2022-11-26 18:02:07,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:02:07,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 18:02:07,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 18: [2022-11-26 18:02:07,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:02:07,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step106000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 18:02:07,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
0: successfully saved checkpoint at iteration 106000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2641.88 31: iteration 106010/ 173500 | consumed samples: 27138560 | consumed tokens: 55579770880 | elapsed time per iteration (s): 1.05 | learning rate: 8.029E-05 | global batch size: 256 | lm loss: 1.956525E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.882 | TFLOPs: 14.75 | 31: iteration 106020/ 173500 | consumed samples: 27141120 | consumed tokens: 55585013760 | elapsed time per iteration (s): 0.82 | learning rate: 8.028E-05 | global batch size: 256 | lm loss: 1.944998E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.597 | TFLOPs: 18.79 | 31: iteration 106030/ 173500 | consumed samples: 27143680 | consumed tokens: 55590256640 | elapsed time per iteration (s): 0.77 | learning rate: 8.026E-05 | global batch size: 256 | lm loss: 1.952910E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.097 | TFLOPs: 20.15 | 31: iteration 106040/ 173500 | consumed samples: 27146240 | consumed tokens: 55595499520 | elapsed time per iteration (s): 0.79 | learning rate: 8.025E-05 | global batch size: 256 | lm loss: 1.964590E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.359 | TFLOPs: 19.56 | 31: iteration 106050/ 173500 | consumed samples: 27148800 | consumed tokens: 55600742400 | elapsed time per iteration (s): 0.78 | learning rate: 8.023E-05 | global batch size: 256 | lm loss: 1.970065E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.197 | TFLOPs: 19.92 | 31: iteration 106060/ 173500 | consumed samples: 27151360 | consumed tokens: 55605985280 | elapsed time per iteration (s): 
0.75 | learning rate: 8.021E-05 | global batch size: 256 | lm loss: 1.962258E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.495 | TFLOPs: 20.54 | 31: iteration 106070/ 173500 | consumed samples: 27153920 | consumed tokens: 55611228160 | elapsed time per iteration (s): 0.80 | learning rate: 8.020E-05 | global batch size: 256 | lm loss: 1.967200E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.934 | TFLOPs: 19.48 | 31: iteration 106080/ 173500 | consumed samples: 27156480 | consumed tokens: 55616471040 | elapsed time per iteration (s): 0.75 | learning rate: 8.018E-05 | global batch size: 256 | lm loss: 1.958101E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.647 | TFLOPs: 20.55 | 31: iteration 106090/ 173500 | consumed samples: 27159040 | consumed tokens: 55621713920 | elapsed time per iteration (s): 0.75 | learning rate: 8.017E-05 | global batch size: 256 | lm loss: 1.961772E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.029 | TFLOPs: 20.69 | 31: iteration 106100/ 173500 | consumed samples: 27161600 | consumed tokens: 55626956800 | elapsed time per iteration (s): 0.76 | learning rate: 8.015E-05 | global batch size: 256 | lm loss: 1.962570E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.119 | TFLOPs: 20.46 | 31: iteration 106110/ 173500 | consumed samples: 27164160 | consumed tokens: 55632199680 | elapsed time per iteration (s): 0.76 | learning rate: 8.014E-05 | global batch size: 256 | lm loss: 1.955496E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.809 | TFLOPs: 20.44 | 31: 
iteration 106120/ 173500 | consumed samples: 27166720 | consumed tokens: 55637442560 | elapsed time per iteration (s): 0.75 | learning rate: 8.012E-05 | global batch size: 256 | lm loss: 1.972214E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.154 | TFLOPs: 20.58 | 31: iteration 106130/ 173500 | consumed samples: 27169280 | consumed tokens: 55642685440 | elapsed time per iteration (s): 0.78 | learning rate: 8.011E-05 | global batch size: 256 | lm loss: 1.969436E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.453 | TFLOPs: 19.81 | 31: iteration 106140/ 173500 | consumed samples: 27171840 | consumed tokens: 55647928320 | elapsed time per iteration (s): 0.78 | learning rate: 8.009E-05 | global batch size: 256 | lm loss: 1.958977E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.890 | TFLOPs: 19.78 | 31: iteration 106150/ 173500 | consumed samples: 27174400 | consumed tokens: 55653171200 | elapsed time per iteration (s): 0.73 | learning rate: 8.007E-05 | global batch size: 256 | lm loss: 1.953725E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.955 | TFLOPs: 21.23 | 31: iteration 106160/ 173500 | consumed samples: 27176960 | consumed tokens: 55658414080 | elapsed time per iteration (s): 0.76 | learning rate: 8.006E-05 | global batch size: 256 | lm loss: 1.990471E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.561 | TFLOPs: 20.42 | 31: iteration 106170/ 173500 | consumed samples: 27179520 | consumed tokens: 55663656960 | elapsed time per iteration (s): 0.78 | learning rate: 8.004E-05 | global batch size: 256 | lm loss: 1.974692E+00 | grad norm: 0.159 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.707 | TFLOPs: 19.89 | 31: iteration 106180/ 173500 | consumed samples: 27182080 | consumed tokens: 55668899840 | elapsed time per iteration (s): 0.76 | learning rate: 8.003E-05 | global batch size: 256 | lm loss: 1.967330E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.442 | TFLOPs: 20.41 | 31: iteration 106190/ 173500 | consumed samples: 27184640 | consumed tokens: 55674142720 | elapsed time per iteration (s): 0.76 | learning rate: 8.001E-05 | global batch size: 256 | lm loss: 1.977913E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.924 | TFLOPs: 20.38 | 31: iteration 106200/ 173500 | consumed samples: 27187200 | consumed tokens: 55679385600 | elapsed time per iteration (s): 0.85 | learning rate: 8.000E-05 | global batch size: 256 | lm loss: 1.964950E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.427 | TFLOPs: 18.24 | 31: iteration 106210/ 173500 | consumed samples: 27189760 | consumed tokens: 55684628480 | elapsed time per iteration (s): 0.74 | learning rate: 7.998E-05 | global batch size: 256 | lm loss: 1.972913E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.693 | TFLOPs: 20.79 | 31: iteration 106220/ 173500 | consumed samples: 27192320 | consumed tokens: 55689871360 | elapsed time per iteration (s): 0.78 | learning rate: 7.997E-05 | global batch size: 256 | lm loss: 1.951992E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.593 | TFLOPs: 19.76 | 31: iteration 106230/ 173500 | consumed samples: 27194880 | consumed tokens: 55695114240 | elapsed time per iteration (s): 0.75 | 
learning rate: 7.995E-05 | global batch size: 256 | lm loss: 1.975602E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.679 | TFLOPs: 20.55 | 31: iteration 106240/ 173500 | consumed samples: 27197440 | consumed tokens: 55700357120 | elapsed time per iteration (s): 0.79 | learning rate: 7.994E-05 | global batch size: 256 | lm loss: 1.960657E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.974 | TFLOPs: 19.54 | 31: iteration 106250/ 173500 | consumed samples: 27200000 | consumed tokens: 55705600000 | elapsed time per iteration (s): 0.78 | learning rate: 7.992E-05 | global batch size: 256 | lm loss: 1.950994E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.941 | TFLOPs: 19.78 | 31: iteration 106260/ 173500 | consumed samples: 27202560 | consumed tokens: 55710842880 | elapsed time per iteration (s): 0.79 | learning rate: 7.990E-05 | global batch size: 256 | lm loss: 1.976794E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.322 | TFLOPs: 19.62 | 31: iteration 106270/ 173500 | consumed samples: 27205120 | consumed tokens: 55716085760 | elapsed time per iteration (s): 0.81 | learning rate: 7.989E-05 | global batch size: 256 | lm loss: 1.976730E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.040 | TFLOPs: 19.18 | 31: iteration 106280/ 173500 | consumed samples: 27207680 | consumed tokens: 55721328640 | elapsed time per iteration (s): 0.79 | learning rate: 7.987E-05 | global batch size: 256 | lm loss: 1.971059E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.015 | TFLOPs: 19.60 | 31: iteration 
106290/ 173500 | consumed samples: 27210240 | consumed tokens: 55726571520 | elapsed time per iteration (s): 0.73 | learning rate: 7.986E-05 | global batch size: 256 | lm loss: 1.964142E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.864 | TFLOPs: 21.35 | 31: iteration 106300/ 173500 | consumed samples: 27212800 | consumed tokens: 55731814400 | elapsed time per iteration (s): 0.76 | learning rate: 7.984E-05 | global batch size: 256 | lm loss: 1.986421E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.578 | TFLOPs: 20.42 | 31: iteration 106310/ 173500 | consumed samples: 27215360 | consumed tokens: 55737057280 | elapsed time per iteration (s): 0.73 | learning rate: 7.983E-05 | global batch size: 256 | lm loss: 1.973246E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.644 | TFLOPs: 21.21 | 31: iteration 106320/ 173500 | consumed samples: 27217920 | consumed tokens: 55742300160 | elapsed time per iteration (s): 0.81 | learning rate: 7.981E-05 | global batch size: 256 | lm loss: 1.988333E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.320 | TFLOPs: 19.02 | 31: iteration 106330/ 173500 | consumed samples: 27220480 | consumed tokens: 55747543040 | elapsed time per iteration (s): 0.77 | learning rate: 7.980E-05 | global batch size: 256 | lm loss: 1.954548E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.872 | TFLOPs: 20.14 | 31: iteration 106340/ 173500 | consumed samples: 27223040 | consumed tokens: 55752785920 | elapsed time per iteration (s): 0.82 | learning rate: 7.978E-05 | global batch size: 256 | lm loss: 1.948829E+00 | grad norm: 0.150 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.497 | TFLOPs: 18.84 | 31: iteration 106350/ 173500 | consumed samples: 27225600 | consumed tokens: 55758028800 | elapsed time per iteration (s): 0.89 | learning rate: 7.976E-05 | global batch size: 256 | lm loss: 1.986192E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.587 | TFLOPs: 17.40 | 31: iteration 106360/ 173500 | consumed samples: 27228160 | consumed tokens: 55763271680 | elapsed time per iteration (s): 0.83 | learning rate: 7.975E-05 | global batch size: 256 | lm loss: 1.956211E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.827 | TFLOPs: 18.56 | 31: iteration 106370/ 173500 | consumed samples: 27230720 | consumed tokens: 55768514560 | elapsed time per iteration (s): 0.76 | learning rate: 7.973E-05 | global batch size: 256 | lm loss: 1.946115E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.086 | TFLOPs: 20.33 | 31: iteration 106380/ 173500 | consumed samples: 27233280 | consumed tokens: 55773757440 | elapsed time per iteration (s): 0.82 | learning rate: 7.972E-05 | global batch size: 256 | lm loss: 1.967091E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.105 | TFLOPs: 18.94 | 31: iteration 106390/ 173500 | consumed samples: 27235840 | consumed tokens: 55779000320 | elapsed time per iteration (s): 0.79 | learning rate: 7.970E-05 | global batch size: 256 | lm loss: 1.967171E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.729 | TFLOPs: 19.58 | 31: iteration 106400/ 173500 | consumed samples: 27238400 | consumed tokens: 55784243200 | elapsed time per iteration (s): 0.88 | learning 
rate: 7.969E-05 | global batch size: 256 | lm loss: 1.992693E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.452 | TFLOPs: 17.63 | 31: iteration 106410/ 173500 | consumed samples: 27240960 | consumed tokens: 55789486080 | elapsed time per iteration (s): 0.76 | learning rate: 7.967E-05 | global batch size: 256 | lm loss: 1.955444E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.631 | TFLOPs: 20.49 | 31: iteration 106420/ 173500 | consumed samples: 27243520 | consumed tokens: 55794728960 | elapsed time per iteration (s): 0.77 | learning rate: 7.966E-05 | global batch size: 256 | lm loss: 1.967506E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.440 | TFLOPs: 20.17 | 31: iteration 106430/ 173500 | consumed samples: 27246080 | consumed tokens: 55799971840 | elapsed time per iteration (s): 0.77 | learning rate: 7.964E-05 | global batch size: 256 | lm loss: 1.945140E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.740 | TFLOPs: 20.07 | 31: iteration 106440/ 173500 | consumed samples: 27248640 | consumed tokens: 55805214720 | elapsed time per iteration (s): 0.76 | learning rate: 7.963E-05 | global batch size: 256 | lm loss: 1.924023E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.988 | TFLOPs: 20.51 | 31: iteration 106450/ 173500 | consumed samples: 27251200 | consumed tokens: 55810457600 | elapsed time per iteration (s): 0.82 | learning rate: 7.961E-05 | global batch size: 256 | lm loss: 1.967577E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.712 | TFLOPs: 18.92 | 31: iteration 106460/ 
173500 | consumed samples: 27253760 | consumed tokens: 55815700480 | elapsed time per iteration (s): 0.83 | learning rate: 7.959E-05 | global batch size: 256 | lm loss: 1.944707E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.822 | TFLOPs: 18.68 | 31: iteration 106470/ 173500 | consumed samples: 27256320 | consumed tokens: 55820943360 | elapsed time per iteration (s): 0.77 | learning rate: 7.958E-05 | global batch size: 256 | lm loss: 1.956184E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.861 | TFLOPs: 20.02 | 31: iteration 106480/ 173500 | consumed samples: 27258880 | consumed tokens: 55826186240 | elapsed time per iteration (s): 0.80 | learning rate: 7.956E-05 | global batch size: 256 | lm loss: 1.965821E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.595 | TFLOPs: 19.33 | 31: iteration 106490/ 173500 | consumed samples: 27261440 | consumed tokens: 55831429120 | elapsed time per iteration (s): 0.78 | learning rate: 7.955E-05 | global batch size: 256 | lm loss: 1.976506E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.019 | TFLOPs: 19.78 | 31: iteration 106500/ 173500 | consumed samples: 27264000 | consumed tokens: 55836672000 | elapsed time per iteration (s): 0.81 | learning rate: 7.953E-05 | global batch size: 256 | lm loss: 1.978174E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.606 | TFLOPs: 19.21 | 31: iteration 106510/ 173500 | consumed samples: 27266560 | consumed tokens: 55841914880 | elapsed time per iteration (s): 0.74 | learning rate: 7.952E-05 | global batch size: 256 | lm loss: 1.976845E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 347.039 | TFLOPs: 20.99 | 31: iteration 106520/ 173500 | consumed samples: 27269120 | consumed tokens: 55847157760 | elapsed time per iteration (s): 0.77 | learning rate: 7.950E-05 | global batch size: 256 | lm loss: 1.980504E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.366 | TFLOPs: 19.99 | 31: iteration 106530/ 173500 | consumed samples: 27271680 | consumed tokens: 55852400640 | elapsed time per iteration (s): 0.76 | learning rate: 7.949E-05 | global batch size: 256 | lm loss: 1.944711E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.129 | TFLOPs: 20.27 | 31: iteration 106540/ 173500 | consumed samples: 27274240 | consumed tokens: 55857643520 | elapsed time per iteration (s): 0.73 | learning rate: 7.947E-05 | global batch size: 256 | lm loss: 1.948378E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.129 | TFLOPs: 21.24 | 31: iteration 106550/ 173500 | consumed samples: 27276800 | consumed tokens: 55862886400 | elapsed time per iteration (s): 0.78 | learning rate: 7.945E-05 | global batch size: 256 | lm loss: 1.976790E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.695 | TFLOPs: 19.76 | 31: iteration 106560/ 173500 | consumed samples: 27279360 | consumed tokens: 55868129280 | elapsed time per iteration (s): 0.75 | learning rate: 7.944E-05 | global batch size: 256 | lm loss: 1.965282E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.255 | TFLOPs: 20.65 | 31: iteration 106570/ 173500 | consumed samples: 27281920 | consumed tokens: 55873372160 | elapsed time per iteration (s): 0.75 | learning rate: 
7.942E-05 | global batch size: 256 | lm loss: 1.964429E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.092 | TFLOPs: 20.64 | 31: iteration 106580/ 173500 | consumed samples: 27284480 | consumed tokens: 55878615040 | elapsed time per iteration (s): 0.97 | learning rate: 7.941E-05 | global batch size: 256 | lm loss: 1.994848E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 263.271 | TFLOPs: 15.93 | 31: iteration 106590/ 173500 | consumed samples: 27287040 | consumed tokens: 55883857920 | elapsed time per iteration (s): 0.83 | learning rate: 7.939E-05 | global batch size: 256 | lm loss: 1.958239E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.767 | TFLOPs: 18.62 | 31: iteration 106600/ 173500 | consumed samples: 27289600 | consumed tokens: 55889100800 | elapsed time per iteration (s): 0.82 | learning rate: 7.938E-05 | global batch size: 256 | lm loss: 1.945247E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.175 | TFLOPs: 18.83 | 31: iteration 106610/ 173500 | consumed samples: 27292160 | consumed tokens: 55894343680 | elapsed time per iteration (s): 0.73 | learning rate: 7.936E-05 | global batch size: 256 | lm loss: 1.981735E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.699 | TFLOPs: 21.34 | 31: iteration 106620/ 173500 | consumed samples: 27294720 | consumed tokens: 55899586560 | elapsed time per iteration (s): 0.88 | learning rate: 7.935E-05 | global batch size: 256 | lm loss: 1.955179E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.642 | TFLOPs: 17.52 | 31: iteration 106630/ 173500 | 
consumed samples: 27297280 | consumed tokens: 55904829440 | elapsed time per iteration (s): 0.73 | learning rate: 7.933E-05 | global batch size: 256 | lm loss: 1.958241E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.745 | TFLOPs: 21.10 | 31: iteration 106640/ 173500 | consumed samples: 27299840 | consumed tokens: 55910072320 | elapsed time per iteration (s): 0.80 | learning rate: 7.932E-05 | global batch size: 256 | lm loss: 1.963619E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.340 | TFLOPs: 19.38 | 31: iteration 106650/ 173500 | consumed samples: 27302400 | consumed tokens: 55915315200 | elapsed time per iteration (s): 0.75 | learning rate: 7.930E-05 | global batch size: 256 | lm loss: 1.950522E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.952 | TFLOPs: 20.57 | 31: iteration 106660/ 173500 | consumed samples: 27304960 | consumed tokens: 55920558080 | elapsed time per iteration (s): 0.78 | learning rate: 7.928E-05 | global batch size: 256 | lm loss: 1.985195E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.393 | TFLOPs: 19.87 | 31: iteration 106670/ 173500 | consumed samples: 27307520 | consumed tokens: 55925800960 | elapsed time per iteration (s): 0.77 | learning rate: 7.927E-05 | global batch size: 256 | lm loss: 1.930945E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.912 | TFLOPs: 20.02 | 31: iteration 106680/ 173500 | consumed samples: 27310080 | consumed tokens: 55931043840 | elapsed time per iteration (s): 0.76 | learning rate: 7.925E-05 | global batch size: 256 | lm loss: 1.951415E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 337.518 | TFLOPs: 20.42 | 31: iteration 106690/ 173500 | consumed samples: 27312640 | consumed tokens: 55936286720 | elapsed time per iteration (s): 0.75 | learning rate: 7.924E-05 | global batch size: 256 | lm loss: 1.973362E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.241 | TFLOPs: 20.52 | 31: iteration 106700/ 173500 | consumed samples: 27315200 | consumed tokens: 55941529600 | elapsed time per iteration (s): 0.75 | learning rate: 7.922E-05 | global batch size: 256 | lm loss: 1.965683E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.227 | TFLOPs: 20.76 | 31: iteration 106710/ 173500 | consumed samples: 27317760 | consumed tokens: 55946772480 | elapsed time per iteration (s): 0.76 | learning rate: 7.921E-05 | global batch size: 256 | lm loss: 1.943177E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.016 | TFLOPs: 20.27 | 31: iteration 106720/ 173500 | consumed samples: 27320320 | consumed tokens: 55952015360 | elapsed time per iteration (s): 0.78 | learning rate: 7.919E-05 | global batch size: 256 | lm loss: 1.993666E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.182 | TFLOPs: 19.98 | 31: iteration 106730/ 173500 | consumed samples: 27322880 | consumed tokens: 55957258240 | elapsed time per iteration (s): 0.78 | learning rate: 7.918E-05 | global batch size: 256 | lm loss: 1.957124E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.045 | TFLOPs: 19.91 | 31: iteration 106740/ 173500 | consumed samples: 27325440 | consumed tokens: 55962501120 | elapsed time per iteration (s): 0.91 | learning rate: 
7.916E-05 | global batch size: 256 | lm loss: 1.967603E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 281.975 | TFLOPs: 17.06 | 31: iteration 106750/ 173500 | consumed samples: 27328000 | consumed tokens: 55967744000 | elapsed time per iteration (s): 0.78 | learning rate: 7.915E-05 | global batch size: 256 | lm loss: 1.973567E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.122 | TFLOPs: 19.73 | 31: iteration 106760/ 173500 | consumed samples: 27330560 | consumed tokens: 55972986880 | elapsed time per iteration (s): 0.79 | learning rate: 7.913E-05 | global batch size: 256 | lm loss: 1.967743E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.984 | TFLOPs: 19.60 | 31: iteration 106770/ 173500 | consumed samples: 27333120 | consumed tokens: 55978229760 | elapsed time per iteration (s): 0.92 | learning rate: 7.911E-05 | global batch size: 256 | lm loss: 1.956532E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 279.005 | TFLOPs: 16.88 | 31: iteration 106780/ 173500 | consumed samples: 27335680 | consumed tokens: 55983472640 | elapsed time per iteration (s): 0.78 | learning rate: 7.910E-05 | global batch size: 256 | lm loss: 1.982288E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.637 | TFLOPs: 19.76 | 31: iteration 106790/ 173500 | consumed samples: 27338240 | consumed tokens: 55988715520 | elapsed time per iteration (s): 0.96 | learning rate: 7.908E-05 | global batch size: 256 | lm loss: 1.945442E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 265.525 | TFLOPs: 16.06 | 31: iteration 106800/ 173500 | 
consumed samples: 27340800 | consumed tokens: 55993958400 | elapsed time per iteration (s): 0.81 | learning rate: 7.907E-05 | global batch size: 256 | lm loss: 1.952732E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.569 | TFLOPs: 19.03 | 31: iteration 106810/ 173500 | consumed samples: 27343360 | consumed tokens: 55999201280 | elapsed time per iteration (s): 0.83 | learning rate: 7.905E-05 | global batch size: 256 | lm loss: 1.968946E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.505 | TFLOPs: 18.72 | 31: iteration 106820/ 173500 | consumed samples: 27345920 | consumed tokens: 56004444160 | elapsed time per iteration (s): 0.83 | learning rate: 7.904E-05 | global batch size: 256 | lm loss: 1.998436E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.209 | TFLOPs: 18.77 | 31: iteration 106830/ 173500 | consumed samples: 27348480 | consumed tokens: 56009687040 | elapsed time per iteration (s): 0.86 | learning rate: 7.902E-05 | global batch size: 256 | lm loss: 1.985462E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.918 | TFLOPs: 18.08 | 31: iteration 106840/ 173500 | consumed samples: 27351040 | consumed tokens: 56014929920 | elapsed time per iteration (s): 0.91 | learning rate: 7.901E-05 | global batch size: 256 | lm loss: 1.957637E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.852 | TFLOPs: 17.11 | 31: iteration 106850/ 173500 | consumed samples: 27353600 | consumed tokens: 56020172800 | elapsed time per iteration (s): 0.89 | learning rate: 7.899E-05 | global batch size: 256 | lm loss: 2.001195E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 287.338 | TFLOPs: 17.38 | 31: iteration 106860/ 173500 | consumed samples: 27356160 | consumed tokens: 56025415680 | elapsed time per iteration (s): 0.93 | learning rate: 7.898E-05 | global batch size: 256 | lm loss: 1.963974E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 276.059 | TFLOPs: 16.70 | 31: iteration 106870/ 173500 | consumed samples: 27358720 | consumed tokens: 56030658560 | elapsed time per iteration (s): 0.85 | learning rate: 7.896E-05 | global batch size: 256 | lm loss: 1.977019E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.568 | TFLOPs: 18.24 | 31: iteration 106880/ 173500 | consumed samples: 27361280 | consumed tokens: 56035901440 | elapsed time per iteration (s): 0.97 | learning rate: 7.894E-05 | global batch size: 256 | lm loss: 1.946864E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 263.889 | TFLOPs: 15.96 | 31: iteration 106890/ 173500 | consumed samples: 27363840 | consumed tokens: 56041144320 | elapsed time per iteration (s): 0.82 | learning rate: 7.893E-05 | global batch size: 256 | lm loss: 1.939944E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.579 | TFLOPs: 18.97 | 31: iteration 106900/ 173500 | consumed samples: 27366400 | consumed tokens: 56046387200 | elapsed time per iteration (s): 0.86 | learning rate: 7.891E-05 | global batch size: 256 | lm loss: 1.956030E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.758 | TFLOPs: 18.07 | 31: iteration 106910/ 173500 | consumed samples: 27368960 | consumed tokens: 56051630080 | elapsed time per iteration (s): 0.87 | learning rate: 
7.890E-05 | global batch size: 256 | lm loss: 1.943027E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.657 | TFLOPs: 17.89 | 31: iteration 106920/ 173500 | consumed samples: 27371520 | consumed tokens: 56056872960 | elapsed time per iteration (s): 0.86 | learning rate: 7.888E-05 | global batch size: 256 | lm loss: 1.944796E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.249 | TFLOPs: 17.92 | 31: iteration 106930/ 173500 | consumed samples: 27374080 | consumed tokens: 56062115840 | elapsed time per iteration (s): 0.89 | learning rate: 7.887E-05 | global batch size: 256 | lm loss: 1.965708E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.241 | TFLOPs: 17.38 | 31: iteration 106940/ 173500 | consumed samples: 27376640 | consumed tokens: 56067358720 | elapsed time per iteration (s): 0.86 | learning rate: 7.885E-05 | global batch size: 256 | lm loss: 2.003131E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.885 | TFLOPs: 17.96 | 31: iteration 106950/ 173500 | consumed samples: 27379200 | consumed tokens: 56072601600 | elapsed time per iteration (s): 0.95 | learning rate: 7.884E-05 | global batch size: 256 | lm loss: 1.953711E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 269.572 | TFLOPs: 16.31 | 31: iteration 106960/ 173500 | consumed samples: 27381760 | consumed tokens: 56077844480 | elapsed time per iteration (s): 0.83 | learning rate: 7.882E-05 | global batch size: 256 | lm loss: 1.998075E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.121 | TFLOPs: 18.70 | 31: iteration 106970/ 173500 | 
consumed samples: 27384320 | consumed tokens: 56083087360 | elapsed time per iteration (s): 0.83 | learning rate: 7.881E-05 | global batch size: 256 | lm loss: 1.955726E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.493 | TFLOPs: 18.72 | 31: iteration 106980/ 173500 | consumed samples: 27386880 | consumed tokens: 56088330240 | elapsed time per iteration (s): 0.94 | learning rate: 7.879E-05 | global batch size: 256 | lm loss: 1.934403E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 271.503 | TFLOPs: 16.43 | 31: iteration 106990/ 173500 | consumed samples: 27389440 | consumed tokens: 56093573120 | elapsed time per iteration (s): 0.83 | learning rate: 7.877E-05 | global batch size: 256 | lm loss: 1.982087E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.248 | TFLOPs: 18.59 | 31: iteration 107000/ 173500 | consumed samples: 27392000 | consumed tokens: 56098816000 | elapsed time per iteration (s): 0.84 | learning rate: 7.876E-05 | global batch size: 256 | lm loss: 1.975402E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.006 | TFLOPs: 18.51 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 107000 | lm loss value: 1.922081E+00 | lm loss PPL: 6.835165E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 107000 to checkpoints_1b1long 0: [2022-11-26 18:15:32,277] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step107000 is begin to save! 
0: [2022-11-26 18:15:32,291] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_01-model_00-model_states.pt... 0: [2022-11-26 18:15:32,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_01-model_00-model_states.pt. 0: [2022-11-26 18:15:32,512] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_03-model_00-model_states.pt... 0: [2022-11-26 18:15:32,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_03-model_00-model_states.pt. 0: [2022-11-26 18:15:32,594] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_04-model_00-model_states.pt... 0: [2022-11-26 18:15:32,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_04-model_00-model_states.pt. 0: [2022-11-26 18:15:32,675] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_05-model_00-model_states.pt... 0: [2022-11-26 18:15:32,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_05-model_00-model_states.pt. 0: [2022-11-26 18:15:32,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_06-model_00-model_states.pt... 0: [2022-11-26 18:15:32,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_06-model_00-model_states.pt. 0: [2022-11-26 18:15:32,827] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_07-model_00-model_states.pt... 0: [2022-11-26 18:15:32,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 18:15:32,901] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_08-model_00-model_states.pt... 0: [2022-11-26 18:15:32,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_08-model_00-model_states.pt. 0: [2022-11-26 18:15:32,982] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_09-model_00-model_states.pt... 0: [2022-11-26 18:15:33,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_09-model_00-model_states.pt. 0: [2022-11-26 18:15:33,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_10-model_00-model_states.pt... 0: [2022-11-26 18:15:33,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_10-model_00-model_states.pt. 0: [2022-11-26 18:15:33,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_11-model_00-model_states.pt... 0: [2022-11-26 18:15:33,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_11-model_00-model_states.pt. 0: [2022-11-26 18:15:33,205] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_12-model_00-model_states.pt... 0: [2022-11-26 18:15:33,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_12-model_00-model_states.pt. 0: [2022-11-26 18:15:33,281] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_13-model_00-model_states.pt... 0: [2022-11-26 18:15:33,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 18:15:33,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_14-model_00-model_states.pt... 0: [2022-11-26 18:15:33,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_14-model_00-model_states.pt. 0: [2022-11-26 18:15:33,427] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_15-model_00-model_states.pt... 0: [2022-11-26 18:15:33,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_15-model_00-model_states.pt. 0: [2022-11-26 18:15:33,502] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_16-model_00-model_states.pt... 0: [2022-11-26 18:15:33,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_16-model_00-model_states.pt. 0: [2022-11-26 18:15:33,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_17-model_00-model_states.pt... 0: [2022-11-26 18:15:33,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_17-model_00-model_states.pt. 0: [2022-11-26 18:15:33,650] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_18-model_00-model_states.pt... 0: [2022-11-26 18:15:33,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_18-model_00-model_states.pt. 0: [2022-11-26 18:15:33,727] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_19-model_00-model_states.pt... 0: [2022-11-26 18:15:33,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 18:15:33,799] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_20-model_00-model_states.pt... 0: [2022-11-26 18:15:33,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_20-model_00-model_states.pt. 0: [2022-11-26 18:15:33,875] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_21-model_00-model_states.pt... 0: [2022-11-26 18:15:33,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_21-model_00-model_states.pt. 0: [2022-11-26 18:15:33,946] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_22-model_00-model_states.pt... 0: [2022-11-26 18:15:34,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_22-model_00-model_states.pt. 0: [2022-11-26 18:15:34,020] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_23-model_00-model_states.pt... 0: [2022-11-26 18:15:34,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_23-model_00-model_states.pt. 0: [2022-11-26 18:15:34,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_24-model_00-model_states.pt... 0: [2022-11-26 18:15:34,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_24-model_00-model_states.pt. 0: [2022-11-26 18:15:34,170] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_25-model_00-model_states.pt... 0: [2022-11-26 18:15:34,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 18:15:34,248] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_26-model_00-model_states.pt... 0: [2022-11-26 18:15:34,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_26-model_00-model_states.pt. 0: [2022-11-26 18:15:34,322] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_27-model_00-model_states.pt... 0: [2022-11-26 18:15:34,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_27-model_00-model_states.pt. 0: [2022-11-26 18:15:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_28-model_00-model_states.pt... 0: [2022-11-26 18:15:34,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_28-model_00-model_states.pt. 0: [2022-11-26 18:15:34,472] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/layer_30-model_00-model_states.pt... 0: [2022-11-26 18:15:34,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/layer_30-model_00-model_states.pt. 0: [2022-11-26 18:15:34,475] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step107000/mp_rank_00_model_states.pt 0: [2022-11-26 18:15:34,475] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/mp_rank_00_model_states.pt... 0: [2022-11-26 18:15:34,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/mp_rank_00_model_states.pt. 0: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
6: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
10: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 
3: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 
23: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 
24: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 
29: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 
30: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 
18: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 
6: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
8: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
12: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 
11: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
31: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 
18: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
5: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
8: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
12: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 
24: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 
26: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
7: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
12: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 
17: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
16: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
1: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:15:34,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:15:34,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:15:34,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 18:15:34,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 18: [2022-11-26 18:15:34,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:15:34,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 18:15:34,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 24: [2022-11-26 18:15:34,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:15:34,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 18:15:34,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
10: [2022-11-26 18:15:34,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:15:34,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 18:15:34,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 14: [2022-11-26 18:15:34,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:15:34,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 18:15:34,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 23: [2022-11-26 18:15:34,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:15:34,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:15:34,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 31: [2022-11-26 18:15:34,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 23: [2022-11-26 18:15:34,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 31: [2022-11-26 18:15:34,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
7: [2022-11-26 18:15:34,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:15:34,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 18:15:34,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 2: [2022-11-26 18:15:34,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:15:34,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 18:15:34,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 24: [2022-11-26 18:15:34,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:15:34,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 18:15:34,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 2: [2022-11-26 18:15:34,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:15:34,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
2: [2022-11-26 18:15:34,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 18:15:34,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 0: [2022-11-26 18:15:34,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 18:15:34,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 0: [2022-11-26 18:15:34,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:15:34,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 18:15:34,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 18: [2022-11-26 18:15:34,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:15:34,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 18:15:34,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 1: [2022-11-26 18:15:34,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:15:34,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 18:15:34,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:15:34,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 6: [2022-11-26 18:15:34,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 30: [2022-11-26 18:15:34,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:15:34,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 18:15:34,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 6: [2022-11-26 18:15:34,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 1: [2022-11-26 18:15:34,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 30: [2022-11-26 18:15:34,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 18:15:34,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 16: [2022-11-26 18:15:34,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
16: [2022-11-26 18:15:34,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 18:15:34,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 29: [2022-11-26 18:15:34,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:15:34,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 18:15:34,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 3: [2022-11-26 18:15:34,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:15:34,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 20: [2022-11-26 18:15:34,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:15:34,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 20: [2022-11-26 18:15:34,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 18:15:34,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 14: [2022-11-26 18:15:34,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 18:15:34,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 18:15:34,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 31: [2022-11-26 18:15:34,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:15:34,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:15:34,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 18:15:34,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 27: [2022-11-26 18:15:34,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 30: [2022-11-26 18:15:34,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:15:34,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 30: [2022-11-26 18:15:34,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 12: [2022-11-26 18:15:34,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:15:34,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
12: [2022-11-26 18:15:34,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 18:15:34,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 4: [2022-11-26 18:15:34,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:15:34,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 18:15:34,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 31: [2022-11-26 18:15:34,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:15:34,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 17: [2022-11-26 18:15:34,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:15:34,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 20: [2022-11-26 18:15:34,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:15:34,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 18:15:34,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
20: [2022-11-26 18:15:34,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 17: [2022-11-26 18:15:34,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:15:34,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 20: [2022-11-26 18:15:34,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 17: [2022-11-26 18:15:34,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 12: [2022-11-26 18:15:34,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:15:34,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 30: [2022-11-26 18:15:34,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:15:34,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 30: [2022-11-26 18:15:34,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 18:15:34,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 3: [2022-11-26 18:15:34,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 18:15:34,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 27: [2022-11-26 18:15:34,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:15:34,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 9: [2022-11-26 18:15:34,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:15:34,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 27: [2022-11-26 18:15:34,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 9: [2022-11-26 18:15:34,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 27: [2022-11-26 18:15:34,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 3: [2022-11-26 18:15:34,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:15:34,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 18:15:34,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 23: [2022-11-26 18:15:34,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-26 18:15:34,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 18:15:34,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 16: [2022-11-26 18:15:34,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:15:34,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 10: [2022-11-26 18:15:34,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:15:34,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 10: [2022-11-26 18:15:34,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 18:15:34,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 10: [2022-11-26 18:15:34,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:15:34,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
10: [2022-11-26 18:15:34,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 1: [2022-11-26 18:15:34,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:15:34,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 10: [2022-11-26 18:15:34,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 23: [2022-11-26 18:15:34,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:15:34,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 4: [2022-11-26 18:15:34,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 1: [2022-11-26 18:15:34,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:15:34,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 25: [2022-11-26 18:15:34,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:15:34,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 1: [2022-11-26 18:15:34,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
25: [2022-11-26 18:15:34,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 18:15:34,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:15:34,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:15:34,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 25: [2022-11-26 18:15:34,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 1: [2022-11-26 18:15:34,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 25: [2022-11-26 18:15:34,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 18: [2022-11-26 18:15:34,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:15:34,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 25: [2022-11-26 18:15:34,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 18: [2022-11-26 18:15:34,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 18:15:34,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
4: [2022-11-26 18:15:34,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 16: [2022-11-26 18:15:34,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:15:34,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 18:15:34,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 6: [2022-11-26 18:15:34,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:15:34,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 18:15:34,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 25: [2022-11-26 18:15:34,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:15:34,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:15:34,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 18:15:34,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
7: [2022-11-26 18:15:34,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 18:15:34,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 9: [2022-11-26 18:15:34,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:15:34,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 18:15:34,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 9: [2022-11-26 18:15:34,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:15:34,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:15:34,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 9: [2022-11-26 18:15:34,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 2: [2022-11-26 18:15:34,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 9: [2022-11-26 18:15:34,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 14: [2022-11-26 18:15:34,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-26 18:15:34,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 18:15:34,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 9: [2022-11-26 18:15:34,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:15:34,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 18:15:34,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 12: [2022-11-26 18:15:34,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:15:34,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 18:15:34,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 18: [2022-11-26 18:15:34,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:15:34,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 18:15:34,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 24: [2022-11-26 18:15:34,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
19: [2022-11-26 18:15:34,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:15:34,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 18:15:34,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 24: [2022-11-26 18:15:34,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 18:15:34,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 2: [2022-11-26 18:15:34,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:15:34,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:15:34,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 18:15:34,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 13: [2022-11-26 18:15:34,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
24: [2022-11-26 18:15:34,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 13: [2022-11-26 18:15:34,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 24: [2022-11-26 18:15:34,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 13: [2022-11-26 18:15:34,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 13: [2022-11-26 18:15:34,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:15:34,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 18:15:34,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 10: [2022-11-26 18:15:34,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:15:34,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 18:15:34,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 19: [2022-11-26 18:15:34,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
19: [2022-11-26 18:15:34,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 18:15:34,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 13: [2022-11-26 18:15:34,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:15:34,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:15:34,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 29: [2022-11-26 18:15:34,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:15:34,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 18:15:34,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 29: [2022-11-26 18:15:34,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 13: [2022-11-26 18:15:34,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 3: [2022-11-26 18:15:34,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:15:34,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
3: [2022-11-26 18:15:34,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 18:15:34,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 27: [2022-11-26 18:15:34,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:15:34,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 18:15:34,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 13: [2022-11-26 18:15:34,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:15:34,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:15:34,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 1: [2022-11-26 18:15:34,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 13: [2022-11-26 18:15:34,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 1: [2022-11-26 18:15:34,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 27: [2022-11-26 18:15:34,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
27: [2022-11-26 18:15:34,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 18:15:34,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 16: [2022-11-26 18:15:34,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:15:34,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 25: [2022-11-26 18:15:34,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:15:34,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 25: [2022-11-26 18:15:34,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 18:15:34,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 19: [2022-11-26 18:15:34,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:15:34,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 18:15:34,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 19: [2022-11-26 18:15:34,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-26 18:15:34,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 12: [2022-11-26 18:15:34,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:15:34,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 19: [2022-11-26 18:15:34,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 12: [2022-11-26 18:15:34,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 23: [2022-11-26 18:15:34,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:15:34,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 18:15:34,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 20: [2022-11-26 18:15:34,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:15:34,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
20: [2022-11-26 18:15:34,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 18:15:34,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 18:15:34,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 20: [2022-11-26 18:15:34,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 30: [2022-11-26 18:15:34,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:15:34,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 18:15:34,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 6: [2022-11-26 18:15:34,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:15:34,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 18:15:34,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 4: [2022-11-26 18:15:34,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 18:15:34,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 18:15:34,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 29: [2022-11-26 18:15:34,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:15:34,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 0: [2022-11-26 18:15:34,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:15:34,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 0: [2022-11-26 18:15:34,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 18:15:34,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 31: [2022-11-26 18:15:34,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:15:34,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 17: [2022-11-26 18:15:34,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:15:34,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
17: [2022-11-26 18:15:34,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 18:15:34,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 17: [2022-11-26 18:15:34,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:15:34,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 18:15:34,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 7: [2022-11-26 18:15:34,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:15:34,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 18:15:34,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 8: [2022-11-26 18:15:34,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:15:34,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:15:34,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 18:15:34,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 18:15:34,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:15:34,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 18:15:34,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 18:15:34,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 8: [2022-11-26 18:15:34,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 8: [2022-11-26 18:15:34,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 18:15:34,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 8: [2022-11-26 18:15:34,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 19: [2022-11-26 18:15:34,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:15:34,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 18:15:34,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
14: [2022-11-26 18:15:34,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:15:34,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 18:15:34,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 22: [2022-11-26 18:15:34,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:15:34,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:15:34,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:15:34,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-26 18:15:34,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 18:15:34,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 18:15:34,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 18:15:34,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 18:15:34,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 22: [2022-11-26 18:15:34,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 22: [2022-11-26 18:15:34,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 22: [2022-11-26 18:15:34,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 9: [2022-11-26 18:15:34,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:15:34,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 18:15:34,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 0: [2022-11-26 18:15:34,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-26 18:15:34,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 18:15:34,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 0: [2022-11-26 18:15:34,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:15:34,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 18:15:34,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 29: [2022-11-26 18:15:34,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:15:34,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 18:15:34,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 15: [2022-11-26 18:15:34,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:15:34,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:15:34,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:15:34,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 18:15:34,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:15:34,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 18:15:34,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 18:15:34,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 18:15:34,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 18:15:34,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 15: [2022-11-26 18:15:34,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 15: [2022-11-26 18:15:34,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 18:15:34,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 15: [2022-11-26 18:15:34,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 15: [2022-11-26 18:15:34,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 6: [2022-11-26 18:15:34,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 18:15:34,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 18:15:34,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 25: [2022-11-26 18:15:34,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:15:34,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 18:15:34,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 13: [2022-11-26 18:15:34,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:15:34,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 18:15:34,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 20: [2022-11-26 18:15:34,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:15:34,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 18:15:34,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 31: [2022-11-26 18:15:34,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
31: [2022-11-26 18:15:34,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 18:15:34,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 3: [2022-11-26 18:15:34,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:15:34,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 18:15:34,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 10: [2022-11-26 18:15:34,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:15:34,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 18:15:34,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 18: [2022-11-26 18:15:34,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:15:34,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 18:15:34,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 4: [2022-11-26 18:15:34,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 18:15:34,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 18:15:34,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 2: [2022-11-26 18:15:34,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:15:34,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 18:15:34,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 23: [2022-11-26 18:15:34,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:15:34,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 18:15:34,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 30: [2022-11-26 18:15:34,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:15:34,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 18:15:34,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 7: [2022-11-26 18:15:34,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 18:15:34,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 18:15:34,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 1: [2022-11-26 18:15:34,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:15:34,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 18:15:34,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 22: [2022-11-26 18:15:34,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:15:34,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:15:34,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:15:34,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 18:15:34,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
27: [2022-11-26 18:15:34,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 8: [2022-11-26 18:15:34,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 27: [2022-11-26 18:15:34,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 8: [2022-11-26 18:15:34,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 16: [2022-11-26 18:15:34,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:15:34,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 24: [2022-11-26 18:15:34,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:15:34,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 24: [2022-11-26 18:15:34,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 18:15:34,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 0: [2022-11-26 18:15:34,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-26 18:15:34,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 18:15:34,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 12: [2022-11-26 18:15:34,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:15:34,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 18:15:34,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 14: [2022-11-26 18:15:34,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:15:34,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 18:15:34,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 9: [2022-11-26 18:15:34,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:15:34,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 18:15:34,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 6: [2022-11-26 18:15:34,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 18:15:34,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 18:15:34,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 13: [2022-11-26 18:15:34,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:15:34,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 18:15:34,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 17: [2022-11-26 18:15:34,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:15:34,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 18:15:34,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 31: [2022-11-26 18:15:34,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:15:34,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 18:15:34,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 19: [2022-11-26 18:15:34,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
19: [2022-11-26 18:15:34,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 18:15:34,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 25: [2022-11-26 18:15:34,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:15:34,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 18:15:34,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 18: [2022-11-26 18:15:34,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:15:34,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 18:15:34,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 29: [2022-11-26 18:15:34,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:15:34,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 1: [2022-11-26 18:15:34,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:15:34,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
1: [2022-11-26 18:15:34,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 18:15:34,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 15: [2022-11-26 18:15:34,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:15:34,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 18:15:34,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 20: [2022-11-26 18:15:34,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:15:34,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 18:15:34,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 4: [2022-11-26 18:15:34,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:15:34,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 18:15:34,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 10: [2022-11-26 18:15:34,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-26 18:15:34,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 18:15:34,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 2: [2022-11-26 18:15:34,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:15:34,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 18:15:34,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 27: [2022-11-26 18:15:34,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:15:34,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 18:15:34,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 14: [2022-11-26 18:15:34,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:15:34,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 18:15:34,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 24: [2022-11-26 18:15:34,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
30: [2022-11-26 18:15:34,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:15:34,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 18:15:34,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 30: [2022-11-26 18:15:34,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 18:15:34,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 7: [2022-11-26 18:15:34,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:15:34,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 18:15:34,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 23: [2022-11-26 18:15:34,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:15:34,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 18:15:34,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 22: [2022-11-26 18:15:34,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
22: [2022-11-26 18:15:34,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 18:15:34,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 16: [2022-11-26 18:15:34,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:15:34,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 18:15:34,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 17: [2022-11-26 18:15:34,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:15:34,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:15:34,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 18:15:34,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 17: [2022-11-26 18:15:34,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 18:15:34,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 13: [2022-11-26 18:15:34,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
9: [2022-11-26 18:15:34,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:15:34,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 9: [2022-11-26 18:15:34,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 13: [2022-11-26 18:15:34,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 9: [2022-11-26 18:15:34,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 12: [2022-11-26 18:15:34,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:15:34,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 18:15:34,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 8: [2022-11-26 18:15:34,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:15:34,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 18:15:34,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 6: [2022-11-26 18:15:34,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 18:15:34,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 18:15:34,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 3: [2022-11-26 18:15:34,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:15:34,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 18:15:34,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 31: [2022-11-26 18:15:34,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:15:34,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:15:34,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 25: [2022-11-26 18:15:34,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 18:15:34,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 31: [2022-11-26 18:15:34,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 29: [2022-11-26 18:15:34,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
15: [2022-11-26 18:15:34,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:15:34,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 29: [2022-11-26 18:15:34,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 15: [2022-11-26 18:15:34,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 29: [2022-11-26 18:15:34,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 4: [2022-11-26 18:15:34,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:15:34,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 18:15:34,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 20: [2022-11-26 18:15:34,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:15:34,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 18:15:34,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 1: [2022-11-26 18:15:34,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-26 18:15:34,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 18:15:34,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 10: [2022-11-26 18:15:34,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:15:34,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 18:15:34,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 18: [2022-11-26 18:15:34,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:15:34,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 18:15:34,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 19: [2022-11-26 18:15:34,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:15:34,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 18:15:34,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 3: [2022-11-26 18:15:34,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 18:15:34,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 18:15:34,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 27: [2022-11-26 18:15:34,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:15:34,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 18:15:34,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 30: [2022-11-26 18:15:34,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:15:34,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 18:15:34,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 7: [2022-11-26 18:15:34,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:15:34,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 18:15:34,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 2: [2022-11-26 18:15:34,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 18:15:34,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 18:15:34,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 23: [2022-11-26 18:15:34,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:15:34,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 18:15:34,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 24: [2022-11-26 18:15:34,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:15:34,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 18:15:34,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 0: [2022-11-26 18:15:34,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:15:34,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:15:34,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 18:15:34,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
5: [2022-11-26 18:15:34,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:15:34,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 18:15:34,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 12: [2022-11-26 18:15:34,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:15:34,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:15:34,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 18:15:34,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 8: [2022-11-26 18:15:34,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 18:15:34,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 6: [2022-11-26 18:15:34,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:15:34,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
6: [2022-11-26 18:15:34,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 11: [2022-11-26 18:15:34,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 6: [2022-11-26 18:15:34,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 11: [2022-11-26 18:15:34,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 11: [2022-11-26 18:15:34,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:15:34,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:15:34,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:15:34,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 18:15:34,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 14: [2022-11-26 18:15:34,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 13: [2022-11-26 18:15:34,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 14: [2022-11-26 18:15:34,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
13: [2022-11-26 18:15:34,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 0: [2022-11-26 18:15:34,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 18:15:34,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 31: [2022-11-26 18:15:34,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:15:34,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 18:15:34,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 16: [2022-11-26 18:15:34,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:15:34,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 18:15:34,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 26: [2022-11-26 18:15:34,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:15:34,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 18:15:34,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
26: [2022-11-26 18:15:34,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:15:34,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 18:15:34,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 9: [2022-11-26 18:15:34,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:15:34,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 25: [2022-11-26 18:15:34,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:15:34,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 25: [2022-11-26 18:15:34,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 18:15:34,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 1: [2022-11-26 18:15:34,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:15:34,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 18:15:34,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
24: [2022-11-26 18:15:34,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:15:34,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 18:15:34,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 22: [2022-11-26 18:15:34,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:15:34,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 18:15:34,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 30: [2022-11-26 18:15:34,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:15:34,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 18:15:34,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 17: [2022-11-26 18:15:34,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:15:34,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 18:15:34,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
20: [2022-11-26 18:15:34,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:15:34,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 18:15:34,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 12: [2022-11-26 18:15:34,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:15:34,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:15:34,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 18:15:34,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 14: [2022-11-26 18:15:34,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 18:15:34,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 19: [2022-11-26 18:15:34,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:15:34,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 18:15:34,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
15: [2022-11-26 18:15:34,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:15:34,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 18:15:34,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 16: [2022-11-26 18:15:34,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:15:34,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 18:15:34,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 10: [2022-11-26 18:15:34,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:15:34,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 18:15:34,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 17: [2022-11-26 18:15:34,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:15:34,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 18:15:34,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
8: [2022-11-26 18:15:34,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:15:34,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 18:15:34,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 23: [2022-11-26 18:15:34,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:15:34,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 18:15:34,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 18: [2022-11-26 18:15:34,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:15:34,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 18:15:34,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 7: [2022-11-26 18:15:34,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:15:34,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
7: [2022-11-26 18:15:34,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 22: [2022-11-26 18:15:34,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 18:15:34,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 26: [2022-11-26 18:15:34,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:15:34,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 26: [2022-11-26 18:15:34,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 18:15:34,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 4: [2022-11-26 18:15:34,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:15:34,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 18:15:34,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 27: [2022-11-26 18:15:34,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-26 18:15:34,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 18:15:34,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 2: [2022-11-26 18:15:34,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:15:34,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 18:15:34,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 3: [2022-11-26 18:15:34,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:15:34,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 18:15:34,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 29: [2022-11-26 18:15:34,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:15:34,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 18:15:34,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 5: [2022-11-26 18:15:34,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 18:15:34,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 18:15:34,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 11: [2022-11-26 18:15:34,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:15:34,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 18:15:34,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 26: [2022-11-26 18:15:34,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:15:34,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 18:15:34,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 11: [2022-11-26 18:15:34,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:15:34,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 18:15:34,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 5: [2022-11-26 18:15:34,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-26 18:15:34,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 18:15:34,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 11: [2022-11-26 18:15:34,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:15:34,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 18:15:34,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 26: [2022-11-26 18:15:34,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:15:34,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 18:15:34,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 26: [2022-11-26 18:15:34,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:15:34,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 18:15:34,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 11: [2022-11-26 18:15:34,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 18:15:34,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 18:15:34,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 5: [2022-11-26 18:15:34,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:15:34,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 18:15:34,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 5: [2022-11-26 18:15:34,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:15:34,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 18:15:34,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 5: [2022-11-26 18:15:34,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:15:34,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 18:15:34,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 11: [2022-11-26 18:15:34,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-26 18:15:34,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 18:15:34,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 26: [2022-11-26 18:15:34,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:15:34,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 18:15:34,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 11: [2022-11-26 18:15:34,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:15:34,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 18:15:34,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 5: [2022-11-26 18:15:34,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:15:34,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 18:15:34,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 26: [2022-11-26 18:15:34,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-26 18:15:34,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 18:15:34,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 21: [2022-11-26 18:15:34,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:15:34,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:15:34,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:15:34,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:15:34,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:15:34,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:15:34,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-26 18:15:34,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 18:15:34,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 18:15:34,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 18:15:34,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 18:15:34,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 18:15:34,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 21: [2022-11-26 18:15:34,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 18:15:34,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 18:15:34,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 21: [2022-11-26 18:15:34,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 21: [2022-11-26 18:15:34,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 21: [2022-11-26 18:15:34,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
21: [2022-11-26 18:15:34,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 21: [2022-11-26 18:15:34,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 21: [2022-11-26 18:15:34,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:15:34,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 18:15:34,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 28: [2022-11-26 18:15:34,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:15:34,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:15:34,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:15:34,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:15:34,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:15:34,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:15:34,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-26 18:15:34,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 18:15:34,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 18:15:34,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 18:15:34,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 18:15:34,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 18:15:34,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 28: [2022-11-26 18:15:34,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 18:15:34,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 28: [2022-11-26 18:15:34,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 28: [2022-11-26 18:15:34,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
28: [2022-11-26 18:15:34,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 18:15:34,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 28: [2022-11-26 18:15:34,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 28: [2022-11-26 18:15:34,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 28: [2022-11-26 18:15:34,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 28: [2022-11-26 18:15:34,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step107000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 18:15:34,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
0: successfully saved checkpoint at iteration 107000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2615.71 31: iteration 107010/ 173500 | consumed samples: 27394560 | consumed tokens: 56104058880 | elapsed time per iteration (s): 1.06 | learning rate: 7.874E-05 | global batch size: 256 | lm loss: 1.936513E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.800 | TFLOPs: 14.57 | 31: iteration 107020/ 173500 | consumed samples: 27397120 | consumed tokens: 56109301760 | elapsed time per iteration (s): 0.74 | learning rate: 7.873E-05 | global batch size: 256 | lm loss: 1.986761E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.540 | TFLOPs: 20.96 | 31: iteration 107030/ 173500 | consumed samples: 27399680 | consumed tokens: 56114544640 | elapsed time per iteration (s): 0.78 | learning rate: 7.871E-05 | global batch size: 256 | lm loss: 1.975163E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.949 | TFLOPs: 19.96 | 31: iteration 107040/ 173500 | consumed samples: 27402240 | consumed tokens: 56119787520 | elapsed time per iteration (s): 0.75 | learning rate: 7.870E-05 | global batch size: 256 | lm loss: 1.955589E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.948 | TFLOPs: 20.57 | 31: iteration 107050/ 173500 | consumed samples: 27404800 | consumed tokens: 56125030400 | elapsed time per iteration (s): 0.77 | learning rate: 7.868E-05 | global batch size: 256 | lm loss: 1.954530E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.213 | TFLOPs: 20.22 | 31: iteration 107060/ 173500 | consumed samples: 27407360 | consumed tokens: 56130273280 | elapsed time per iteration (s): 
0.77 | learning rate: 7.867E-05 | global batch size: 256 | lm loss: 1.979856E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.710 | TFLOPs: 20.13 | 31: iteration 107070/ 173500 | consumed samples: 27409920 | consumed tokens: 56135516160 | elapsed time per iteration (s): 0.78 | learning rate: 7.865E-05 | global batch size: 256 | lm loss: 1.946770E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.657 | TFLOPs: 19.82 | 31: iteration 107080/ 173500 | consumed samples: 27412480 | consumed tokens: 56140759040 | elapsed time per iteration (s): 0.78 | learning rate: 7.864E-05 | global batch size: 256 | lm loss: 1.974711E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.559 | TFLOPs: 19.76 | 31: iteration 107090/ 173500 | consumed samples: 27415040 | consumed tokens: 56146001920 | elapsed time per iteration (s): 0.73 | learning rate: 7.862E-05 | global batch size: 256 | lm loss: 1.967315E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.533 | TFLOPs: 21.21 | 31: iteration 107100/ 173500 | consumed samples: 27417600 | consumed tokens: 56151244800 | elapsed time per iteration (s): 0.72 | learning rate: 7.860E-05 | global batch size: 256 | lm loss: 1.998536E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.695 | TFLOPs: 21.40 | 31: iteration 107110/ 173500 | consumed samples: 27420160 | consumed tokens: 56156487680 | elapsed time per iteration (s): 0.73 | learning rate: 7.859E-05 | global batch size: 256 | lm loss: 1.980619E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.460 | TFLOPs: 21.26 | 31: 
iteration 107120/ 173500 | consumed samples: 27422720 | consumed tokens: 56161730560 | elapsed time per iteration (s): 0.75 | learning rate: 7.857E-05 | global batch size: 256 | lm loss: 1.973994E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.705 | TFLOPs: 20.73 | 31: iteration 107130/ 173500 | consumed samples: 27425280 | consumed tokens: 56166973440 | elapsed time per iteration (s): 0.75 | learning rate: 7.856E-05 | global batch size: 256 | lm loss: 1.941108E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.443 | TFLOPs: 20.66 | 31: iteration 107140/ 173500 | consumed samples: 27427840 | consumed tokens: 56172216320 | elapsed time per iteration (s): 0.77 | learning rate: 7.854E-05 | global batch size: 256 | lm loss: 1.968271E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.538 | TFLOPs: 20.00 | 31: iteration 107150/ 173500 | consumed samples: 27430400 | consumed tokens: 56177459200 | elapsed time per iteration (s): 0.77 | learning rate: 7.853E-05 | global batch size: 256 | lm loss: 1.967032E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.580 | TFLOPs: 20.12 | 31: iteration 107160/ 173500 | consumed samples: 27432960 | consumed tokens: 56182702080 | elapsed time per iteration (s): 0.77 | learning rate: 7.851E-05 | global batch size: 256 | lm loss: 1.947596E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.304 | TFLOPs: 20.22 | 31: iteration 107170/ 173500 | consumed samples: 27435520 | consumed tokens: 56187944960 | elapsed time per iteration (s): 0.95 | learning rate: 7.850E-05 | global batch size: 256 | lm loss: 1.966818E+00 | grad norm: 0.161 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 268.997 | TFLOPs: 16.27 | 31: iteration 107180/ 173500 | consumed samples: 27438080 | consumed tokens: 56193187840 | elapsed time per iteration (s): 0.80 | learning rate: 7.848E-05 | global batch size: 256 | lm loss: 1.940382E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.545 | TFLOPs: 19.39 | 31: iteration 107190/ 173500 | consumed samples: 27440640 | consumed tokens: 56198430720 | elapsed time per iteration (s): 0.81 | learning rate: 7.847E-05 | global batch size: 256 | lm loss: 1.998906E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.198 | TFLOPs: 19.07 | 31: iteration 107200/ 173500 | consumed samples: 27443200 | consumed tokens: 56203673600 | elapsed time per iteration (s): 0.78 | learning rate: 7.845E-05 | global batch size: 256 | lm loss: 1.955104E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.453 | TFLOPs: 19.81 | 31: iteration 107210/ 173500 | consumed samples: 27445760 | consumed tokens: 56208916480 | elapsed time per iteration (s): 0.75 | learning rate: 7.844E-05 | global batch size: 256 | lm loss: 1.960180E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.587 | TFLOPs: 20.67 | 31: iteration 107220/ 173500 | consumed samples: 27448320 | consumed tokens: 56214159360 | elapsed time per iteration (s): 0.78 | learning rate: 7.842E-05 | global batch size: 256 | lm loss: 1.974142E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.255 | TFLOPs: 19.80 | 31: iteration 107230/ 173500 | consumed samples: 27450880 | consumed tokens: 56219402240 | elapsed time per iteration (s): 0.76 | 
learning rate: 7.840E-05 | global batch size: 256 | lm loss: 1.954020E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.798 | TFLOPs: 20.25 | 31: iteration 107240/ 173500 | consumed samples: 27453440 | consumed tokens: 56224645120 | elapsed time per iteration (s): 0.75 | learning rate: 7.839E-05 | global batch size: 256 | lm loss: 1.972926E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.214 | TFLOPs: 20.70 | 31: iteration 107250/ 173500 | consumed samples: 27456000 | consumed tokens: 56229888000 | elapsed time per iteration (s): 0.74 | learning rate: 7.837E-05 | global batch size: 256 | lm loss: 1.968541E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.720 | TFLOPs: 20.92 | 31: iteration 107260/ 173500 | consumed samples: 27458560 | consumed tokens: 56235130880 | elapsed time per iteration (s): 0.78 | learning rate: 7.836E-05 | global batch size: 256 | lm loss: 1.967230E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.158 | TFLOPs: 19.97 | 31: iteration 107270/ 173500 | consumed samples: 27461120 | consumed tokens: 56240373760 | elapsed time per iteration (s): 0.72 | learning rate: 7.834E-05 | global batch size: 256 | lm loss: 1.964367E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.489 | TFLOPs: 21.45 | 31: iteration 107280/ 173500 | consumed samples: 27463680 | consumed tokens: 56245616640 | elapsed time per iteration (s): 0.75 | learning rate: 7.833E-05 | global batch size: 256 | lm loss: 1.938382E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.530 | TFLOPs: 20.66 | 31: iteration 
107290/ 173500 | consumed samples: 27466240 | consumed tokens: 56250859520 | elapsed time per iteration (s): 0.80 | learning rate: 7.831E-05 | global batch size: 256 | lm loss: 1.927800E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.314 | TFLOPs: 19.26 | 31: iteration 107300/ 173500 | consumed samples: 27468800 | consumed tokens: 56256102400 | elapsed time per iteration (s): 0.78 | learning rate: 7.830E-05 | global batch size: 256 | lm loss: 1.951543E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.367 | TFLOPs: 19.87 | 31: iteration 107310/ 173500 | consumed samples: 27471360 | consumed tokens: 56261345280 | elapsed time per iteration (s): 0.82 | learning rate: 7.828E-05 | global batch size: 256 | lm loss: 1.942265E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.952 | TFLOPs: 18.99 | 31: iteration 107320/ 173500 | consumed samples: 27473920 | consumed tokens: 56266588160 | elapsed time per iteration (s): 0.79 | learning rate: 7.827E-05 | global batch size: 256 | lm loss: 1.967701E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.037 | TFLOPs: 19.72 | 31: iteration 107330/ 173500 | consumed samples: 27476480 | consumed tokens: 56271831040 | elapsed time per iteration (s): 0.79 | learning rate: 7.825E-05 | global batch size: 256 | lm loss: 1.962383E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.353 | TFLOPs: 19.56 | 31: iteration 107340/ 173500 | consumed samples: 27479040 | consumed tokens: 56277073920 | elapsed time per iteration (s): 0.77 | learning rate: 7.823E-05 | global batch size: 256 | lm loss: 1.967650E+00 | grad norm: 0.150 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.316 | TFLOPs: 20.04 | 31: iteration 107350/ 173500 | consumed samples: 27481600 | consumed tokens: 56282316800 | elapsed time per iteration (s): 0.79 | learning rate: 7.822E-05 | global batch size: 256 | lm loss: 1.976808E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.103 | TFLOPs: 19.55 | 31: iteration 107360/ 173500 | consumed samples: 27484160 | consumed tokens: 56287559680 | elapsed time per iteration (s): 0.77 | learning rate: 7.820E-05 | global batch size: 256 | lm loss: 1.962929E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.115 | TFLOPs: 20.21 | 31: iteration 107370/ 173500 | consumed samples: 27486720 | consumed tokens: 56292802560 | elapsed time per iteration (s): 0.96 | learning rate: 7.819E-05 | global batch size: 256 | lm loss: 1.963909E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 266.548 | TFLOPs: 16.13 | 31: iteration 107380/ 173500 | consumed samples: 27489280 | consumed tokens: 56298045440 | elapsed time per iteration (s): 0.79 | learning rate: 7.817E-05 | global batch size: 256 | lm loss: 1.964349E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.366 | TFLOPs: 19.50 | 31: iteration 107390/ 173500 | consumed samples: 27491840 | consumed tokens: 56303288320 | elapsed time per iteration (s): 0.77 | learning rate: 7.816E-05 | global batch size: 256 | lm loss: 1.968144E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.888 | TFLOPs: 20.08 | 31: iteration 107400/ 173500 | consumed samples: 27494400 | consumed tokens: 56308531200 | elapsed time per iteration (s): 0.79 | learning 
rate: 7.814E-05 | global batch size: 256 | lm loss: 1.999471E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.590 | TFLOPs: 19.52 | 31: iteration 107410/ 173500 | consumed samples: 27496960 | consumed tokens: 56313774080 | elapsed time per iteration (s): 0.77 | learning rate: 7.813E-05 | global batch size: 256 | lm loss: 1.954362E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.981 | TFLOPs: 20.08 | 31: iteration 107420/ 173500 | consumed samples: 27499520 | consumed tokens: 56319016960 | elapsed time per iteration (s): 0.78 | learning rate: 7.811E-05 | global batch size: 256 | lm loss: 1.964394E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.848 | TFLOPs: 19.95 | 31: iteration 107430/ 173500 | consumed samples: 27502080 | consumed tokens: 56324259840 | elapsed time per iteration (s): 0.81 | learning rate: 7.810E-05 | global batch size: 256 | lm loss: 2.005650E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.586 | TFLOPs: 19.15 | 31: iteration 107440/ 173500 | consumed samples: 27504640 | consumed tokens: 56329502720 | elapsed time per iteration (s): 0.74 | learning rate: 7.808E-05 | global batch size: 256 | lm loss: 1.956268E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.552 | TFLOPs: 21.03 | 31: iteration 107450/ 173500 | consumed samples: 27507200 | consumed tokens: 56334745600 | elapsed time per iteration (s): 0.82 | learning rate: 7.807E-05 | global batch size: 256 | lm loss: 1.982616E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.346 | TFLOPs: 18.84 | 31: iteration 107460/ 
173500 | consumed samples: 27509760 | consumed tokens: 56339988480 | elapsed time per iteration (s): 0.75 | learning rate: 7.805E-05 | global batch size: 256 | lm loss: 1.953463E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.008 | TFLOPs: 20.75 | 31: iteration 107470/ 173500 | consumed samples: 27512320 | consumed tokens: 56345231360 | elapsed time per iteration (s): 0.81 | learning rate: 7.803E-05 | global batch size: 256 | lm loss: 1.960020E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.244 | TFLOPs: 19.19 | 31: iteration 107480/ 173500 | consumed samples: 27514880 | consumed tokens: 56350474240 | elapsed time per iteration (s): 0.77 | learning rate: 7.802E-05 | global batch size: 256 | lm loss: 1.981931E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.247 | TFLOPs: 20.16 | 31: iteration 107490/ 173500 | consumed samples: 27517440 | consumed tokens: 56355717120 | elapsed time per iteration (s): 0.80 | learning rate: 7.800E-05 | global batch size: 256 | lm loss: 1.947165E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.670 | TFLOPs: 19.40 | 31: iteration 107500/ 173500 | consumed samples: 27520000 | consumed tokens: 56360960000 | elapsed time per iteration (s): 0.81 | learning rate: 7.799E-05 | global batch size: 256 | lm loss: 1.969434E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.206 | TFLOPs: 19.01 | 31: iteration 107510/ 173500 | consumed samples: 27522560 | consumed tokens: 56366202880 | elapsed time per iteration (s): 0.74 | learning rate: 7.797E-05 | global batch size: 256 | lm loss: 1.947067E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 346.238 | TFLOPs: 20.95 | 31: iteration 107520/ 173500 | consumed samples: 27525120 | consumed tokens: 56371445760 | elapsed time per iteration (s): 0.79 | learning rate: 7.796E-05 | global batch size: 256 | lm loss: 1.971473E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.419 | TFLOPs: 19.57 | 31: iteration 107530/ 173500 | consumed samples: 27527680 | consumed tokens: 56376688640 | elapsed time per iteration (s): 0.75 | learning rate: 7.794E-05 | global batch size: 256 | lm loss: 1.952876E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.727 | TFLOPs: 20.55 | 31: iteration 107540/ 173500 | consumed samples: 27530240 | consumed tokens: 56381931520 | elapsed time per iteration (s): 0.77 | learning rate: 7.793E-05 | global batch size: 256 | lm loss: 1.967705E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.282 | TFLOPs: 20.10 | 31: iteration 107550/ 173500 | consumed samples: 27532800 | consumed tokens: 56387174400 | elapsed time per iteration (s): 0.77 | learning rate: 7.791E-05 | global batch size: 256 | lm loss: 1.974712E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.474 | TFLOPs: 20.11 | 31: iteration 107560/ 173500 | consumed samples: 27535360 | consumed tokens: 56392417280 | elapsed time per iteration (s): 0.93 | learning rate: 7.790E-05 | global batch size: 256 | lm loss: 1.972563E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 275.593 | TFLOPs: 16.67 | 31: iteration 107570/ 173500 | consumed samples: 27537920 | consumed tokens: 56397660160 | elapsed time per iteration (s): 0.76 | learning rate: 
7.788E-05 | global batch size: 256 | lm loss: 1.973684E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.084 | TFLOPs: 20.45 | 31: iteration 107580/ 173500 | consumed samples: 27540480 | consumed tokens: 56402903040 | elapsed time per iteration (s): 0.79 | learning rate: 7.787E-05 | global batch size: 256 | lm loss: 1.957687E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.031 | TFLOPs: 19.66 | 31: iteration 107590/ 173500 | consumed samples: 27543040 | consumed tokens: 56408145920 | elapsed time per iteration (s): 0.73 | learning rate: 7.785E-05 | global batch size: 256 | lm loss: 1.954301E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.923 | TFLOPs: 21.17 | 31: iteration 107600/ 173500 | consumed samples: 27545600 | consumed tokens: 56413388800 | elapsed time per iteration (s): 0.80 | learning rate: 7.783E-05 | global batch size: 256 | lm loss: 1.966561E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.112 | TFLOPs: 19.37 | 31: iteration 107610/ 173500 | consumed samples: 27548160 | consumed tokens: 56418631680 | elapsed time per iteration (s): 0.74 | learning rate: 7.782E-05 | global batch size: 256 | lm loss: 1.975100E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.111 | TFLOPs: 20.88 | 31: iteration 107620/ 173500 | consumed samples: 27550720 | consumed tokens: 56423874560 | elapsed time per iteration (s): 0.78 | learning rate: 7.780E-05 | global batch size: 256 | lm loss: 1.986633E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.780 | TFLOPs: 19.77 | 31: iteration 107630/ 173500 | 
consumed samples: 27553280 | consumed tokens: 56429117440 | elapsed time per iteration (s): 0.79 | learning rate: 7.779E-05 | global batch size: 256 | lm loss: 1.940856E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.905 | TFLOPs: 19.60 | 31: iteration 107640/ 173500 | consumed samples: 27555840 | consumed tokens: 56434360320 | elapsed time per iteration (s): 0.72 | learning rate: 7.777E-05 | global batch size: 256 | lm loss: 1.987100E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.901 | TFLOPs: 21.59 | 31: iteration 107650/ 173500 | consumed samples: 27558400 | consumed tokens: 56439603200 | elapsed time per iteration (s): 0.83 | learning rate: 7.776E-05 | global batch size: 256 | lm loss: 1.958734E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.078 | TFLOPs: 18.76 | 31: iteration 107660/ 173500 | consumed samples: 27560960 | consumed tokens: 56444846080 | elapsed time per iteration (s): 0.79 | learning rate: 7.774E-05 | global batch size: 256 | lm loss: 1.967375E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.777 | TFLOPs: 19.53 | 31: iteration 107670/ 173500 | consumed samples: 27563520 | consumed tokens: 56450088960 | elapsed time per iteration (s): 0.77 | learning rate: 7.773E-05 | global batch size: 256 | lm loss: 1.977132E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.098 | TFLOPs: 20.09 | 31: iteration 107680/ 173500 | consumed samples: 27566080 | consumed tokens: 56455331840 | elapsed time per iteration (s): 0.79 | learning rate: 7.771E-05 | global batch size: 256 | lm loss: 1.939534E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 325.465 | TFLOPs: 19.69 | 31: iteration 107690/ 173500 | consumed samples: 27568640 | consumed tokens: 56460574720 | elapsed time per iteration (s): 0.79 | learning rate: 7.770E-05 | global batch size: 256 | lm loss: 2.000558E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.365 | TFLOPs: 19.68 | 31: iteration 107700/ 173500 | consumed samples: 27571200 | consumed tokens: 56465817600 | elapsed time per iteration (s): 0.76 | learning rate: 7.768E-05 | global batch size: 256 | lm loss: 1.972595E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.048 | TFLOPs: 20.39 | 31: iteration 107710/ 173500 | consumed samples: 27573760 | consumed tokens: 56471060480 | elapsed time per iteration (s): 0.76 | learning rate: 7.767E-05 | global batch size: 256 | lm loss: 1.950828E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.369 | TFLOPs: 20.29 | 31: iteration 107720/ 173500 | consumed samples: 27576320 | consumed tokens: 56476303360 | elapsed time per iteration (s): 0.78 | learning rate: 7.765E-05 | global batch size: 256 | lm loss: 1.933696E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.085 | TFLOPs: 19.91 | 31: iteration 107730/ 173500 | consumed samples: 27578880 | consumed tokens: 56481546240 | elapsed time per iteration (s): 0.78 | learning rate: 7.763E-05 | global batch size: 256 | lm loss: 1.966158E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.960 | TFLOPs: 19.84 | 31: iteration 107740/ 173500 | consumed samples: 27581440 | consumed tokens: 56486789120 | elapsed time per iteration (s): 0.85 | learning rate: 
7.762E-05 | global batch size: 256 | lm loss: 1.969721E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.981 | TFLOPs: 18.21 | 31: iteration 107750/ 173500 | consumed samples: 27584000 | consumed tokens: 56492032000 | elapsed time per iteration (s): 0.83 | learning rate: 7.760E-05 | global batch size: 256 | lm loss: 1.973963E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.672 | TFLOPs: 18.61 | 31: iteration 107760/ 173500 | consumed samples: 27586560 | consumed tokens: 56497274880 | elapsed time per iteration (s): 0.80 | learning rate: 7.759E-05 | global batch size: 256 | lm loss: 1.980310E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.074 | TFLOPs: 19.24 | 31: iteration 107770/ 173500 | consumed samples: 27589120 | consumed tokens: 56502517760 | elapsed time per iteration (s): 0.81 | learning rate: 7.757E-05 | global batch size: 256 | lm loss: 1.980703E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.334 | TFLOPs: 19.08 | 31: iteration 107780/ 173500 | consumed samples: 27591680 | consumed tokens: 56507760640 | elapsed time per iteration (s): 0.84 | learning rate: 7.756E-05 | global batch size: 256 | lm loss: 1.958305E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.618 | TFLOPs: 18.49 | 31: iteration 107790/ 173500 | consumed samples: 27594240 | consumed tokens: 56513003520 | elapsed time per iteration (s): 0.86 | learning rate: 7.754E-05 | global batch size: 256 | lm loss: 1.977540E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.030 | TFLOPs: 17.97 | 31: iteration 107800/ 173500 | 
consumed samples: 27596800 | consumed tokens: 56518246400 | elapsed time per iteration (s): 0.80 | learning rate: 7.753E-05 | global batch size: 256 | lm loss: 1.974456E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.702 | TFLOPs: 19.34 | 31: iteration 107810/ 173500 | consumed samples: 27599360 | consumed tokens: 56523489280 | elapsed time per iteration (s): 0.78 | learning rate: 7.751E-05 | global batch size: 256 | lm loss: 1.970360E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.054 | TFLOPs: 19.79 | 31: iteration 107820/ 173500 | consumed samples: 27601920 | consumed tokens: 56528732160 | elapsed time per iteration (s): 0.79 | learning rate: 7.750E-05 | global batch size: 256 | lm loss: 1.964936E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.462 | TFLOPs: 19.57 | 31: iteration 107830/ 173500 | consumed samples: 27604480 | consumed tokens: 56533975040 | elapsed time per iteration (s): 0.80 | learning rate: 7.748E-05 | global batch size: 256 | lm loss: 1.950197E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.104 | TFLOPs: 19.30 | 31: iteration 107840/ 173500 | consumed samples: 27607040 | consumed tokens: 56539217920 | elapsed time per iteration (s): 0.79 | learning rate: 7.747E-05 | global batch size: 256 | lm loss: 1.975862E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.082 | TFLOPs: 19.61 | 31: iteration 107850/ 173500 | consumed samples: 27609600 | consumed tokens: 56544460800 | elapsed time per iteration (s): 0.82 | learning rate: 7.745E-05 | global batch size: 256 | lm loss: 1.975986E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 310.547 | TFLOPs: 18.79 | 31: iteration 107860/ 173500 | consumed samples: 27612160 | consumed tokens: 56549703680 | elapsed time per iteration (s): 0.81 | learning rate: 7.744E-05 | global batch size: 256 | lm loss: 1.917859E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.436 | TFLOPs: 19.20 | 31: iteration 107870/ 173500 | consumed samples: 27614720 | consumed tokens: 56554946560 | elapsed time per iteration (s): 0.73 | learning rate: 7.742E-05 | global batch size: 256 | lm loss: 1.962519E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.856 | TFLOPs: 21.10 | 31: iteration 107880/ 173500 | consumed samples: 27617280 | consumed tokens: 56560189440 | elapsed time per iteration (s): 0.78 | learning rate: 7.740E-05 | global batch size: 256 | lm loss: 1.965093E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.720 | TFLOPs: 19.77 | 31: iteration 107890/ 173500 | consumed samples: 27619840 | consumed tokens: 56565432320 | elapsed time per iteration (s): 0.74 | learning rate: 7.739E-05 | global batch size: 256 | lm loss: 1.966413E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.262 | TFLOPs: 21.07 | 31: iteration 107900/ 173500 | consumed samples: 27622400 | consumed tokens: 56570675200 | elapsed time per iteration (s): 0.77 | learning rate: 7.737E-05 | global batch size: 256 | lm loss: 1.972016E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.809 | TFLOPs: 20.01 | 31: iteration 107910/ 173500 | consumed samples: 27624960 | consumed tokens: 56575918080 | elapsed time per iteration (s): 0.74 | learning rate: 
7.736E-05 | global batch size: 256 | lm loss: 1.969632E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.627 | TFLOPs: 20.97 | 31: iteration 107920/ 173500 | consumed samples: 27627520 | consumed tokens: 56581160960 | elapsed time per iteration (s): 0.78 | learning rate: 7.734E-05 | global batch size: 256 | lm loss: 1.959278E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.384 | TFLOPs: 19.81 | 31: iteration 107930/ 173500 | consumed samples: 27630080 | consumed tokens: 56586403840 | elapsed time per iteration (s): 0.73 | learning rate: 7.733E-05 | global batch size: 256 | lm loss: 1.957628E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.573 | TFLOPs: 21.09 | 31: iteration 107940/ 173500 | consumed samples: 27632640 | consumed tokens: 56591646720 | elapsed time per iteration (s): 0.82 | learning rate: 7.731E-05 | global batch size: 256 | lm loss: 1.939762E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.726 | TFLOPs: 18.86 | 31: iteration 107950/ 173500 | consumed samples: 27635200 | consumed tokens: 56596889600 | elapsed time per iteration (s): 0.85 | learning rate: 7.730E-05 | global batch size: 256 | lm loss: 1.959850E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.423 | TFLOPs: 18.30 | 31: iteration 107960/ 173500 | consumed samples: 27637760 | consumed tokens: 56602132480 | elapsed time per iteration (s): 0.84 | learning rate: 7.728E-05 | global batch size: 256 | lm loss: 1.951977E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.611 | TFLOPs: 18.49 | 31: iteration 107970/ 173500 | 
consumed samples: 27640320 | consumed tokens: 56607375360 | elapsed time per iteration (s): 0.81 | learning rate: 7.727E-05 | global batch size: 256 | lm loss: 1.977072E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.211 | TFLOPs: 19.13 |
31: iteration 107980/ 173500 | consumed samples: 27642880 | consumed tokens: 56612618240 | elapsed time per iteration (s): 0.83 | learning rate: 7.725E-05 | global batch size: 256 | lm loss: 1.924931E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.401 | TFLOPs: 18.72 |
31: iteration 107990/ 173500 | consumed samples: 27645440 | consumed tokens: 56617861120 | elapsed time per iteration (s): 0.83 | learning rate: 7.724E-05 | global batch size: 256 | lm loss: 1.944288E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.100 | TFLOPs: 18.58 |
0: [2022-11-26 18:28:40,060] [INFO] [logging.py:68:log_dist] [Rank 0] step=108000, skipped=0, lr=[7.722055869362951e-05, 7.722055869362951e-05, 7.722055869362951e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
31: iteration 108000/ 173500 | consumed samples: 27648000 | consumed tokens: 56623104000 | elapsed time per iteration (s): 0.77 | learning rate: 7.722E-05 | global batch size: 256 | lm loss: 1.984188E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.524 | TFLOPs: 20.12 |
0: steps: 108000 loss: 2.0431 iter time (s): 0.792 samples/sec: 323.045
31: --------------------------------------------------------------------------------------------
31: valid loss at iteration 108000 | lm loss value: 1.941695E+00 | lm loss PPL: 6.970559E+00 |
31: --------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 108000 to checkpoints_1b1long
0: [2022-11-26 18:28:40,348] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step108000 is begin to save!
0: [2022-11-26 18:28:40,362] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_01-model_00-model_states.pt...
0: [2022-11-26 18:28:40,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_01-model_00-model_states.pt.
0: [2022-11-26 18:28:40,594] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_03-model_00-model_states.pt...
0: [2022-11-26 18:28:40,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_03-model_00-model_states.pt.
0: [2022-11-26 18:28:40,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_04-model_00-model_states.pt...
0: [2022-11-26 18:28:40,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_04-model_00-model_states.pt.
0: [2022-11-26 18:28:40,742] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_05-model_00-model_states.pt...
0: [2022-11-26 18:28:40,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_05-model_00-model_states.pt.
0: [2022-11-26 18:28:40,820] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_06-model_00-model_states.pt...
0: [2022-11-26 18:28:40,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_06-model_00-model_states.pt.
0: [2022-11-26 18:28:40,904] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_07-model_00-model_states.pt...
0: [2022-11-26 18:28:40,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_07-model_00-model_states.pt.
0: [2022-11-26 18:28:40,983] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_08-model_00-model_states.pt...
0: [2022-11-26 18:28:41,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_08-model_00-model_states.pt.
0: [2022-11-26 18:28:41,062] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_09-model_00-model_states.pt...
0: [2022-11-26 18:28:41,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_09-model_00-model_states.pt.
0: [2022-11-26 18:28:41,138] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_10-model_00-model_states.pt...
0: [2022-11-26 18:28:41,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_10-model_00-model_states.pt.
0: [2022-11-26 18:28:41,211] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_11-model_00-model_states.pt...
0: [2022-11-26 18:28:41,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_11-model_00-model_states.pt.
0: [2022-11-26 18:28:41,285] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_12-model_00-model_states.pt...
0: [2022-11-26 18:28:41,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_12-model_00-model_states.pt.
0: [2022-11-26 18:28:41,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_13-model_00-model_states.pt...
0: [2022-11-26 18:28:41,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_13-model_00-model_states.pt.
0: [2022-11-26 18:28:41,430] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_14-model_00-model_states.pt...
0: [2022-11-26 18:28:41,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_14-model_00-model_states.pt.
0: [2022-11-26 18:28:41,505] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_15-model_00-model_states.pt...
0: [2022-11-26 18:28:41,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_15-model_00-model_states.pt.
0: [2022-11-26 18:28:41,578] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_16-model_00-model_states.pt...
0: [2022-11-26 18:28:41,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_16-model_00-model_states.pt.
0: [2022-11-26 18:28:41,654] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_17-model_00-model_states.pt...
0: [2022-11-26 18:28:41,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_17-model_00-model_states.pt.
0: [2022-11-26 18:28:41,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_18-model_00-model_states.pt...
0: [2022-11-26 18:28:41,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_18-model_00-model_states.pt.
0: [2022-11-26 18:28:41,802] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_19-model_00-model_states.pt...
0: [2022-11-26 18:28:41,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_19-model_00-model_states.pt.
0: [2022-11-26 18:28:41,876] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_20-model_00-model_states.pt...
0: [2022-11-26 18:28:41,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_20-model_00-model_states.pt.
0: [2022-11-26 18:28:41,949] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_21-model_00-model_states.pt...
0: [2022-11-26 18:28:42,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_21-model_00-model_states.pt.
0: [2022-11-26 18:28:42,024] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_22-model_00-model_states.pt...
0: [2022-11-26 18:28:42,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_22-model_00-model_states.pt.
0: [2022-11-26 18:28:42,098] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_23-model_00-model_states.pt...
0: [2022-11-26 18:28:42,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_23-model_00-model_states.pt.
0: [2022-11-26 18:28:42,175] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_24-model_00-model_states.pt...
0: [2022-11-26 18:28:42,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_24-model_00-model_states.pt.
0: [2022-11-26 18:28:42,247] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_25-model_00-model_states.pt...
0: [2022-11-26 18:28:42,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_25-model_00-model_states.pt.
0: [2022-11-26 18:28:42,322] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_26-model_00-model_states.pt...
0: [2022-11-26 18:28:42,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_26-model_00-model_states.pt.
0: [2022-11-26 18:28:42,393] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_27-model_00-model_states.pt...
0: [2022-11-26 18:28:42,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_27-model_00-model_states.pt.
0: [2022-11-26 18:28:42,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_28-model_00-model_states.pt...
0: [2022-11-26 18:28:42,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_28-model_00-model_states.pt.
0: [2022-11-26 18:28:42,541] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/layer_30-model_00-model_states.pt...
0: [2022-11-26 18:28:42,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/layer_30-model_00-model_states.pt.
0: [2022-11-26 18:28:42,543] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step108000/mp_rank_00_model_states.pt
0: [2022-11-26 18:28:42,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/mp_rank_00_model_states.pt...
0: [2022-11-26 18:28:42,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/mp_rank_00_model_states.pt.
31: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
4: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
16: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 
3: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
11: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 
22: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
26: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 
7: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 
1: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
15: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 
11: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
31: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 
18: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 
27: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 
10: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 
3: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 
28: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 
21: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
7: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 
11: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 
19: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 
25: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 
21: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 
18: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:28:42,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:28:42,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:28:42,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 18:28:42,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 23: [2022-11-26 18:28:42,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:28:42,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 18:28:42,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 12: [2022-11-26 18:28:42,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 18:28:42,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 18:28:42,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 21: [2022-11-26 18:28:42,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:28:42,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 18:28:42,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 1: [2022-11-26 18:28:42,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:28:42,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 18:28:42,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 21: [2022-11-26 18:28:42,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:28:42,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 4: [2022-11-26 18:28:42,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:28:42,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
6: [2022-11-26 18:28:42,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:28:42,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 6: [2022-11-26 18:28:42,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 4: [2022-11-26 18:28:42,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 6: [2022-11-26 18:28:42,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 0: [2022-11-26 18:28:42,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:28:42,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 18:28:42,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 8: [2022-11-26 18:28:42,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:28:42,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 18:28:42,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 18: [2022-11-26 18:28:42,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
18: [2022-11-26 18:28:42,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 18:28:42,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 1: [2022-11-26 18:28:42,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:28:42,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 18:28:42,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 10: [2022-11-26 18:28:42,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:28:42,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 18:28:42,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 23: [2022-11-26 18:28:42,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:28:42,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:28:42,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 18:28:42,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
8: [2022-11-26 18:28:42,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 20: [2022-11-26 18:28:42,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:28:42,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 20: [2022-11-26 18:28:42,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 6: [2022-11-26 18:28:42,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:28:42,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:28:42,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 19: [2022-11-26 18:28:42,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
6: [2022-11-26 18:28:42,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 15: [2022-11-26 18:28:42,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 19: [2022-11-26 18:28:42,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 6: [2022-11-26 18:28:42,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 15: [2022-11-26 18:28:42,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 19: [2022-11-26 18:28:42,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 0: [2022-11-26 18:28:42,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:28:42,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:28:42,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 26: [2022-11-26 18:28:42,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 0: [2022-11-26 18:28:42,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 26: [2022-11-26 18:28:42,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
14: [2022-11-26 18:28:42,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:28:42,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 18:28:42,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 30: [2022-11-26 18:28:42,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:28:42,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 18:28:42,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 25: [2022-11-26 18:28:42,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:28:42,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 28: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:28:42,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
27: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 17: [2022-11-26 18:28:42,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 19: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:28:42,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 2: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 19: [2022-11-26 18:28:42,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 27: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 2: [2022-11-26 18:28:42,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 17: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
2: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 17: [2022-11-26 18:28:42,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 7: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 7: [2022-11-26 18:28:42,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 12: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:28:42,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 12: [2022-11-26 18:28:42,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 2: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
24: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:28:42,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 30: [2022-11-26 18:28:42,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 24: [2022-11-26 18:28:42,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 18:28:42,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 30: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 24: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 24: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
13: [2022-11-26 18:28:42,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:28:42,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 18:28:42,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 13: [2022-11-26 18:28:42,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:28:42,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 18:28:42,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 10: [2022-11-26 18:28:42,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:28:42,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 18:28:42,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 22: [2022-11-26 18:28:42,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:28:42,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 18:28:42,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
24: [2022-11-26 18:28:42,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:28:42,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 18:28:42,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 28: [2022-11-26 18:28:42,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 18:28:42,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 28: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 20: [2022-11-26 18:28:42,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:28:42,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:28:42,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 22: [2022-11-26 18:28:42,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 20: [2022-11-26 18:28:42,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
22: [2022-11-26 18:28:42,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 4: [2022-11-26 18:28:42,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:28:42,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 27: [2022-11-26 18:28:42,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:28:42,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 27: [2022-11-26 18:28:42,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 18:28:42,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 15: [2022-11-26 18:28:42,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:28:42,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 10: [2022-11-26 18:28:42,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:28:42,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
10: [2022-11-26 18:28:42,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 23: [2022-11-26 18:28:42,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:28:42,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 23: [2022-11-26 18:28:42,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 17: [2022-11-26 18:28:42,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:28:42,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 17: [2022-11-26 18:28:42,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 18:28:42,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 18: [2022-11-26 18:28:42,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:28:42,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 18:28:42,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 2: [2022-11-26 18:28:42,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 18:28:42,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 18:28:42,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 5: [2022-11-26 18:28:42,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:28:42,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 5: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:28:42,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 18:28:42,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 5: [2022-11-26 18:28:42,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:28:42,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 18:28:42,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 0: [2022-11-26 18:28:42,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
20: [2022-11-26 18:28:42,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:28:42,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 0: [2022-11-26 18:28:42,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 20: [2022-11-26 18:28:42,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 0: [2022-11-26 18:28:42,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 15: [2022-11-26 18:28:42,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:28:42,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 18:28:42,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 7: [2022-11-26 18:28:42,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:28:42,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 
7: [2022-11-26 18:28:42,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 12: [2022-11-26 18:28:42,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:28:42,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 12: [2022-11-26 18:28:42,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 18:28:42,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 17: [2022-11-26 18:28:42,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 1: [2022-11-26 18:28:42,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:28:42,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 30: [2022-11-26 18:28:42,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
1: [2022-11-26 18:28:42,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 30: [2022-11-26 18:28:42,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 7: [2022-11-26 18:28:42,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:28:42,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 30: [2022-11-26 18:28:42,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 7: [2022-11-26 18:28:42,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 18:28:42,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 31: [2022-11-26 18:28:42,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:28:42,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:28:42,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
31: [2022-11-26 18:28:42,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 18:28:42,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 18:28:42,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 18:28:42,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 31: [2022-11-26 18:28:42,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 31: [2022-11-26 18:28:42,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 28: [2022-11-26 18:28:42,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:28:42,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:28:42,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 18:28:42,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 8: [2022-11-26 18:28:42,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:28:42,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-26 18:28:42,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 18:28:42,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 18:28:42,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 8: [2022-11-26 18:28:42,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 19: [2022-11-26 18:28:42,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:28:42,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:28:42,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 26: [2022-11-26 18:28:42,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 19: [2022-11-26 18:28:42,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:28:42,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
28: [2022-11-26 18:28:42,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 19: [2022-11-26 18:28:42,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 4: [2022-11-26 18:28:42,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 13: [2022-11-26 18:28:42,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:28:42,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 19: [2022-11-26 18:28:42,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 4: [2022-11-26 18:28:42,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 13: [2022-11-26 18:28:42,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 26: [2022-11-26 18:28:42,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 19: [2022-11-26 18:28:42,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 13: [2022-11-26 18:28:42,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 6: [2022-11-26 18:28:42,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-26 18:28:42,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 18:28:42,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 25: [2022-11-26 18:28:42,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:28:42,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 18:28:42,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 25: [2022-11-26 18:28:42,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:28:42,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 18:28:42,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 13: [2022-11-26 18:28:42,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:28:42,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 26: [2022-11-26 18:28:42,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:28:42,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
26: [2022-11-26 18:28:42,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 18:28:42,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 5: [2022-11-26 18:28:42,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:28:42,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:28:42,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 18:28:42,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 14: [2022-11-26 18:28:42,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 18:28:42,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 24: [2022-11-26 18:28:42,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:28:42,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 18:28:42,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 10: [2022-11-26 18:28:42,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-26 18:28:42,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 18:28:42,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 6: [2022-11-26 18:28:42,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:28:42,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 18:28:42,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 18: [2022-11-26 18:28:42,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:28:42,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 18:28:42,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 21: [2022-11-26 18:28:42,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:28:42,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 18:28:42,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 15: [2022-11-26 18:28:42,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
21: [2022-11-26 18:28:42,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:28:42,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 1: [2022-11-26 18:28:42,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:28:42,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 21: [2022-11-26 18:28:42,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 27: [2022-11-26 18:28:42,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:28:42,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 12: [2022-11-26 18:28:42,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:28:42,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 1: [2022-11-26 18:28:42,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 12: [2022-11-26 18:28:42,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
27: [2022-11-26 18:28:42,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 1: [2022-11-26 18:28:42,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 27: [2022-11-26 18:28:42,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 7: [2022-11-26 18:28:42,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:28:42,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 18:28:42,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 30: [2022-11-26 18:28:42,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:28:42,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 11: [2022-11-26 18:28:42,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:28:42,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 11: [2022-11-26 18:28:42,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 18:28:42,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
11: [2022-11-26 18:28:42,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:28:42,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 18:28:42,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 11: [2022-11-26 18:28:42,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:28:42,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 18:28:42,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 11: [2022-11-26 18:28:42,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:28:42,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:28:42,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:28:42,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
11: [2022-11-26 18:28:42,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 28: [2022-11-26 18:28:42,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:28:42,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 9: [2022-11-26 18:28:42,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 18:28:42,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:28:42,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:28:42,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 22: [2022-11-26 18:28:42,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 9: [2022-11-26 18:28:42,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 18:28:42,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
9: [2022-11-26 18:28:42,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 18:28:42,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 18:28:42,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 28: [2022-11-26 18:28:42,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 9: [2022-11-26 18:28:42,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 9: [2022-11-26 18:28:42,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 28: [2022-11-26 18:28:42,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 25: [2022-11-26 18:28:42,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:28:42,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 18:28:42,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 18: [2022-11-26 18:28:42,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-26 18:28:42,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 20: [2022-11-26 18:28:42,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:28:42,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 20: [2022-11-26 18:28:42,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 18:28:42,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 4: [2022-11-26 18:28:42,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:28:42,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 14: [2022-11-26 18:28:42,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:28:42,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 14: [2022-11-26 18:28:42,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 18:28:42,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 2: [2022-11-26 18:28:42,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 18:28:42,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 18:28:42,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 23: [2022-11-26 18:28:42,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:28:42,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 18:28:42,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 31: [2022-11-26 18:28:42,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:28:42,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 18:28:42,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 3: [2022-11-26 18:28:42,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:28:42,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:28:42,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 18:28:42,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 18:28:42,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 3: [2022-11-26 18:28:42,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 18:28:42,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 18:28:42,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 3: [2022-11-26 18:28:42,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 21: [2022-11-26 18:28:42,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:28:42,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:28:42,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 18:28:42,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 27: [2022-11-26 18:28:42,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 18:28:42,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
26: [2022-11-26 18:28:42,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:28:42,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 18:28:42,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 6: [2022-11-26 18:28:42,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:28:42,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 18:28:42,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 16: [2022-11-26 18:28:42,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:28:42,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:28:42,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:28:42,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
16: [2022-11-26 18:28:42,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 18:28:42,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 18:28:42,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 18:28:42,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 18:28:42,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 16: [2022-11-26 18:28:42,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 16: [2022-11-26 18:28:42,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 16: [2022-11-26 18:28:42,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 29: [2022-11-26 18:28:42,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:28:42,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:28:42,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:28:42,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-26 18:28:42,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 18:28:42,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 18:28:42,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 18:28:42,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 18:28:42,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 29: [2022-11-26 18:28:42,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 29: [2022-11-26 18:28:42,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 29: [2022-11-26 18:28:42,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 0: [2022-11-26 18:28:42,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:28:42,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 18:28:42,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 8: [2022-11-26 18:28:42,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 18:28:42,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 18:28:42,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 1: [2022-11-26 18:28:42,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:28:42,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 18:28:42,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 2: [2022-11-26 18:28:42,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:28:42,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 18:28:42,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 9: [2022-11-26 18:28:42,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:28:42,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 18:28:42,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 19: [2022-11-26 18:28:42,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
19: [2022-11-26 18:28:42,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 18:28:42,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 24: [2022-11-26 18:28:42,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:28:42,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 18:28:42,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 28: [2022-11-26 18:28:42,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:28:42,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 18:28:42,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 4: [2022-11-26 18:28:42,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:28:42,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 18:28:42,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 13: [2022-11-26 18:28:42,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 18:28:42,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 18:28:42,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 5: [2022-11-26 18:28:42,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:28:42,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 18:28:42,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 20: [2022-11-26 18:28:42,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:28:42,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 18:28:42,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 18: [2022-11-26 18:28:42,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:28:42,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 18:28:42,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 12: [2022-11-26 18:28:42,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 18:28:42,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 18:28:42,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 22: [2022-11-26 18:28:42,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:28:42,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 18:28:42,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 23: [2022-11-26 18:28:42,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:28:42,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:28:42,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 18:28:42,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 27: [2022-11-26 18:28:42,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 17: [2022-11-26 18:28:42,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:28:42,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
17: [2022-11-26 18:28:42,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 30: [2022-11-26 18:28:42,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:28:42,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 30: [2022-11-26 18:28:42,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 18:28:42,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 29: [2022-11-26 18:28:42,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:28:42,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 18:28:42,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 11: [2022-11-26 18:28:42,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:28:42,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 18:28:42,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 15: [2022-11-26 18:28:42,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-26 18:28:42,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 18:28:42,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 7: [2022-11-26 18:28:42,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:28:42,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 18:28:42,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 6: [2022-11-26 18:28:42,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:28:42,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 18:28:42,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 10: [2022-11-26 18:28:42,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:28:42,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 18:28:42,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 26: [2022-11-26 18:28:42,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-26 18:28:42,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 18:28:42,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 14: [2022-11-26 18:28:42,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:28:42,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 18:28:42,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 3: [2022-11-26 18:28:42,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:28:42,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 18:28:42,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 31: [2022-11-26 18:28:42,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:28:42,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:28:42,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 18:28:42,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
16: [2022-11-26 18:28:42,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 18:28:42,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 9: [2022-11-26 18:28:42,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:28:42,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 18:28:42,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 0: [2022-11-26 18:28:42,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:28:42,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:28:42,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:28:42,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 18:28:42,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 13: [2022-11-26 18:28:42,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 18:28:42,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
24: [2022-11-26 18:28:42,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 18:28:42,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 19: [2022-11-26 18:28:42,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:28:42,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 18:28:42,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 4: [2022-11-26 18:28:42,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:28:42,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 18:28:42,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 1: [2022-11-26 18:28:42,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:28:42,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 18:28:42,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 15: [2022-11-26 18:28:42,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
5: [2022-11-26 18:28:42,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:28:42,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 5: [2022-11-26 18:28:42,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 15: [2022-11-26 18:28:42,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 5: [2022-11-26 18:28:42,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 21: [2022-11-26 18:28:42,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:28:42,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 18:28:42,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 16: [2022-11-26 18:28:42,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:28:42,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 18:28:42,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 12: [2022-11-26 18:28:42,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 18:28:42,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 18:28:42,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 7: [2022-11-26 18:28:42,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:28:42,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 18:28:42,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 8: [2022-11-26 18:28:42,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:28:42,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:28:42,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 18:28:42,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 30: [2022-11-26 18:28:42,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 10: [2022-11-26 18:28:42,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 18:28:42,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 30: [2022-11-26 18:28:42,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 10: [2022-11-26 18:28:42,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 29: [2022-11-26 18:28:42,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:28:42,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 18:28:42,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 2: [2022-11-26 18:28:42,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:28:42,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 18:28:42,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 31: [2022-11-26 18:28:42,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:28:42,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 18:28:42,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
20: [2022-11-26 18:28:42,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:28:42,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 18:28:42,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 27: [2022-11-26 18:28:42,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:28:42,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 25: [2022-11-26 18:28:42,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:28:42,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:28:42,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:28:42,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
25: [2022-11-26 18:28:42,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 17: [2022-11-26 18:28:42,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 25: [2022-11-26 18:28:42,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 17: [2022-11-26 18:28:42,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 25: [2022-11-26 18:28:42,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 3: [2022-11-26 18:28:42,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:28:42,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 3: [2022-11-26 18:28:42,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 18:28:42,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 26: [2022-11-26 18:28:42,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:28:42,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 6: [2022-11-26 18:28:42,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
23: [2022-11-26 18:28:42,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:28:42,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 23: [2022-11-26 18:28:42,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 6: [2022-11-26 18:28:42,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 18:28:42,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 23: [2022-11-26 18:28:42,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 21: [2022-11-26 18:28:42,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:28:42,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:28:42,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 18: [2022-11-26 18:28:42,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 18:28:42,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 21: [2022-11-26 18:28:42,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
22: [2022-11-26 18:28:42,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:28:42,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:28:42,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 18:28:42,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 28: [2022-11-26 18:28:42,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 18:28:42,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 14: [2022-11-26 18:28:42,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:28:42,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 18:28:42,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 11: [2022-11-26 18:28:42,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:28:42,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 18:28:42,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
24: [2022-11-26 18:28:42,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:28:42,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 18:28:42,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 19: [2022-11-26 18:28:42,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:28:42,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 18:28:42,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 1: [2022-11-26 18:28:42,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:28:42,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 18:28:42,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 28: [2022-11-26 18:28:42,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:28:42,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 18:28:42,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
8: [2022-11-26 18:28:42,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:28:42,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 18:28:42,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 2: [2022-11-26 18:28:42,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:28:42,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 18:28:42,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 9: [2022-11-26 18:28:42,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:28:42,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 18:28:42,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 4: [2022-11-26 18:28:42,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:28:42,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
4: [2022-11-26 18:28:42,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 18:28:42,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 0: [2022-11-26 18:28:42,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 18:28:42,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 13: [2022-11-26 18:28:42,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:28:42,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 18:28:42,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 17: [2022-11-26 18:28:42,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:28:42,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 18:28:42,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 11: [2022-11-26 18:28:42,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 18:28:42,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 18:28:42,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 29: [2022-11-26 18:28:42,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:28:42,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 18:28:42,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 14: [2022-11-26 18:28:42,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:28:42,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 18:28:42,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 25: [2022-11-26 18:28:42,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:28:42,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 18:28:42,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 15: [2022-11-26 18:28:42,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 18:28:42,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 18:28:42,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 5: [2022-11-26 18:28:42,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:28:42,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 18:28:42,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 10: [2022-11-26 18:28:42,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:28:42,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 18:28:42,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 18: [2022-11-26 18:28:42,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:28:42,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 18:28:42,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 22: [2022-11-26 18:28:42,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
22: [2022-11-26 18:28:42,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 18:28:42,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 12: [2022-11-26 18:28:42,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:28:42,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 18:28:42,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 20: [2022-11-26 18:28:42,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:28:42,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 18:28:42,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 23: [2022-11-26 18:28:42,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:28:42,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 18:28:42,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 30: [2022-11-26 18:28:42,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-26 18:28:42,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 18:28:42,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 16: [2022-11-26 18:28:42,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:28:42,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 18:28:42,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 27: [2022-11-26 18:28:42,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:28:42,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 18:28:42,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 5: [2022-11-26 18:28:42,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:28:42,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 18:28:42,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 21: [2022-11-26 18:28:42,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-26 18:28:42,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 18:28:42,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 2: [2022-11-26 18:28:42,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:28:42,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 18:28:42,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 12: [2022-11-26 18:28:42,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:28:42,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 18:28:42,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 31: [2022-11-26 18:28:42,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:28:42,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 18:28:42,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 8: [2022-11-26 18:28:42,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 18:28:42,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 18:28:42,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 16: [2022-11-26 18:28:42,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:28:42,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:28:42,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 18:28:42,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 27: [2022-11-26 18:28:42,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 18:28:42,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 20: [2022-11-26 18:28:42,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:28:42,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 18:28:42,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 13: [2022-11-26 18:28:42,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
4: [2022-11-26 18:28:42,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:28:42,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 18:28:42,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 15: [2022-11-26 18:28:42,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:28:42,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 18:28:42,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 15: [2022-11-26 18:28:42,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 18:28:42,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 24: [2022-11-26 18:28:42,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:28:42,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:28:42,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 18:28:42,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
14: [2022-11-26 18:28:42,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 18:28:42,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 10: [2022-11-26 18:28:42,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:28:42,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:28:42,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 18:28:42,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 23: [2022-11-26 18:28:42,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 18:28:42,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 7: [2022-11-26 18:28:42,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:28:42,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 18:28:42,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 18:28:42,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 28: [2022-11-26 18:28:42,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:28:42,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 7: [2022-11-26 18:28:42,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 28: [2022-11-26 18:28:42,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 11: [2022-11-26 18:28:42,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:28:42,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 11: [2022-11-26 18:28:42,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 18:28:42,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 6: [2022-11-26 18:28:42,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:28:42,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
6: [2022-11-26 18:28:42,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 22: [2022-11-26 18:28:42,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 6: [2022-11-26 18:28:42,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 22: [2022-11-26 18:28:42,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 26: [2022-11-26 18:28:42,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:28:42,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 18:28:42,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 1: [2022-11-26 18:28:42,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:28:42,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 18:28:42,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 18: [2022-11-26 18:28:42,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
18: [2022-11-26 18:28:42,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 18:28:42,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 0: [2022-11-26 18:28:42,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:28:42,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:28:42,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 18:28:42,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 19: [2022-11-26 18:28:42,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:28:42,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 18:28:42,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 9: [2022-11-26 18:28:42,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:28:42,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 18:28:42,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
0: [2022-11-26 18:28:42,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 18:28:42,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 17: [2022-11-26 18:28:42,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:28:42,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:28:42,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 18:28:42,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 30: [2022-11-26 18:28:42,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 18:28:42,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 3: [2022-11-26 18:28:42,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:28:42,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 18:28:42,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 31: [2022-11-26 18:28:42,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
31: [2022-11-26 18:28:42,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 18:28:42,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 25: [2022-11-26 18:28:42,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:28:42,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 18:28:42,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 26: [2022-11-26 18:28:42,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:28:42,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 18:28:42,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 3: [2022-11-26 18:28:42,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:28:42,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 18:28:42,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 3: [2022-11-26 18:28:42,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 18:28:42,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step108000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 18:28:42,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 0: successfully saved checkpoint at iteration 108000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2557.50 31: iteration 108010/ 173500 | consumed samples: 27650560 | consumed tokens: 56628346880 | elapsed time per iteration (s): 1.04 | learning rate: 7.721E-05 | global batch size: 256 | lm loss: 1.970207E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.244 | TFLOPs: 14.90 | 31: iteration 108020/ 173500 | consumed samples: 27653120 | consumed tokens: 56633589760 | elapsed time per iteration (s): 0.78 | learning rate: 7.719E-05 | global batch size: 256 | lm loss: 1.972606E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.060 | TFLOPs: 19.91 | 31: iteration 108030/ 173500 | consumed samples: 27655680 | consumed tokens: 56638832640 | elapsed time per iteration (s): 0.79 | learning rate: 7.717E-05 | global batch size: 256 | lm loss: 1.936124E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.382 | TFLOPs: 19.62 | 31: iteration 108040/ 173500 | consumed samples: 27658240 | consumed tokens: 56644075520 | elapsed time per iteration (s): 0.77 | learning rate: 7.716E-05 | global batch size: 256 | lm loss: 1.960234E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.350 | TFLOPs: 20.11 | 31: iteration 108050/ 173500 | consumed samples: 27660800 | consumed tokens: 56649318400 | elapsed time per iteration (s): 0.78 | learning rate: 7.714E-05 | 
global batch size: 256 | lm loss: 1.973921E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.192 | TFLOPs: 19.92 | 31: iteration 108060/ 173500 | consumed samples: 27663360 | consumed tokens: 56654561280 | elapsed time per iteration (s): 0.77 | learning rate: 7.713E-05 | global batch size: 256 | lm loss: 1.939243E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.873 | TFLOPs: 20.14 | 31: iteration 108070/ 173500 | consumed samples: 27665920 | consumed tokens: 56659804160 | elapsed time per iteration (s): 0.72 | learning rate: 7.711E-05 | global batch size: 256 | lm loss: 1.966816E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.276 | TFLOPs: 21.49 | 31: iteration 108080/ 173500 | consumed samples: 27668480 | consumed tokens: 56665047040 | elapsed time per iteration (s): 0.78 | learning rate: 7.710E-05 | global batch size: 256 | lm loss: 1.955018E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.771 | TFLOPs: 19.89 | 31: iteration 108090/ 173500 | consumed samples: 27671040 | consumed tokens: 56670289920 | elapsed time per iteration (s): 0.75 | learning rate: 7.708E-05 | global batch size: 256 | lm loss: 1.970609E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.205 | TFLOPs: 20.52 | 31: iteration 108100/ 173500 | consumed samples: 27673600 | consumed tokens: 56675532800 | elapsed time per iteration (s): 0.84 | learning rate: 7.707E-05 | global batch size: 256 | lm loss: 1.979238E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.192 | TFLOPs: 18.52 | 31: iteration 108110/ 173500 | consumed 
samples: 27676160 | consumed tokens: 56680775680 | elapsed time per iteration (s): 0.79 | learning rate: 7.705E-05 | global batch size: 256 | lm loss: 1.958881E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.142 | TFLOPs: 19.49 | 31: iteration 108120/ 173500 | consumed samples: 27678720 | consumed tokens: 56686018560 | elapsed time per iteration (s): 0.87 | learning rate: 7.704E-05 | global batch size: 256 | lm loss: 1.954724E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.920 | TFLOPs: 17.78 | 31: iteration 108130/ 173500 | consumed samples: 27681280 | consumed tokens: 56691261440 | elapsed time per iteration (s): 0.80 | learning rate: 7.702E-05 | global batch size: 256 | lm loss: 1.956535E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.008 | TFLOPs: 19.42 | 31: iteration 108140/ 173500 | consumed samples: 27683840 | consumed tokens: 56696504320 | elapsed time per iteration (s): 0.80 | learning rate: 7.701E-05 | global batch size: 256 | lm loss: 1.958526E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.524 | TFLOPs: 19.33 | 31: iteration 108150/ 173500 | consumed samples: 27686400 | consumed tokens: 56701747200 | elapsed time per iteration (s): 0.75 | learning rate: 7.699E-05 | global batch size: 256 | lm loss: 1.961166E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.180 | TFLOPs: 20.70 | 31: iteration 108160/ 173500 | consumed samples: 27688960 | consumed tokens: 56706990080 | elapsed time per iteration (s): 0.75 | learning rate: 7.698E-05 | global batch size: 256 | lm loss: 1.967192E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 341.676 | TFLOPs: 20.67 | 31: iteration 108170/ 173500 | consumed samples: 27691520 | consumed tokens: 56712232960 | elapsed time per iteration (s): 0.78 | learning rate: 7.696E-05 | global batch size: 256 | lm loss: 1.957745E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.231 | TFLOPs: 19.80 | 31: iteration 108180/ 173500 | consumed samples: 27694080 | consumed tokens: 56717475840 | elapsed time per iteration (s): 0.77 | learning rate: 7.694E-05 | global batch size: 256 | lm loss: 1.959552E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.501 | TFLOPs: 20.18 | 31: iteration 108190/ 173500 | consumed samples: 27696640 | consumed tokens: 56722718720 | elapsed time per iteration (s): 0.75 | learning rate: 7.693E-05 | global batch size: 256 | lm loss: 1.954878E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.426 | TFLOPs: 20.66 | 31: iteration 108200/ 173500 | consumed samples: 27699200 | consumed tokens: 56727961600 | elapsed time per iteration (s): 0.73 | learning rate: 7.691E-05 | global batch size: 256 | lm loss: 1.983361E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.578 | TFLOPs: 21.27 | 31: iteration 108210/ 173500 | consumed samples: 27701760 | consumed tokens: 56733204480 | elapsed time per iteration (s): 0.77 | learning rate: 7.690E-05 | global batch size: 256 | lm loss: 1.961016E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.448 | TFLOPs: 20.11 | 31: iteration 108220/ 173500 | consumed samples: 27704320 | consumed tokens: 56738447360 | elapsed time per iteration (s): 0.77 | learning rate: 7.688E-05 | global 
batch size: 256 | lm loss: 1.967972E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.968 | TFLOPs: 20.20 | 31: iteration 108230/ 173500 | consumed samples: 27706880 | consumed tokens: 56743690240 | elapsed time per iteration (s): 0.75 | learning rate: 7.687E-05 | global batch size: 256 | lm loss: 1.951209E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.954 | TFLOPs: 20.69 | 31: iteration 108240/ 173500 | consumed samples: 27709440 | consumed tokens: 56748933120 | elapsed time per iteration (s): 0.78 | learning rate: 7.685E-05 | global batch size: 256 | lm loss: 1.946646E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.407 | TFLOPs: 19.87 | 31: iteration 108250/ 173500 | consumed samples: 27712000 | consumed tokens: 56754176000 | elapsed time per iteration (s): 0.82 | learning rate: 7.684E-05 | global batch size: 256 | lm loss: 1.975464E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.413 | TFLOPs: 18.78 | 31: iteration 108260/ 173500 | consumed samples: 27714560 | consumed tokens: 56759418880 | elapsed time per iteration (s): 0.79 | learning rate: 7.682E-05 | global batch size: 256 | lm loss: 1.947203E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.018 | TFLOPs: 19.66 | 31: iteration 108270/ 173500 | consumed samples: 27717120 | consumed tokens: 56764661760 | elapsed time per iteration (s): 0.80 | learning rate: 7.681E-05 | global batch size: 256 | lm loss: 1.962307E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.828 | TFLOPs: 19.35 | 31: iteration 108280/ 173500 | consumed samples: 
27719680 | consumed tokens: 56769904640 | elapsed time per iteration (s): 0.76 | learning rate: 7.679E-05 | global batch size: 256 | lm loss: 1.988428E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.431 | TFLOPs: 20.41 | 31: iteration 108290/ 173500 | consumed samples: 27722240 | consumed tokens: 56775147520 | elapsed time per iteration (s): 0.74 | learning rate: 7.678E-05 | global batch size: 256 | lm loss: 1.965530E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.233 | TFLOPs: 21.01 | 31: iteration 108300/ 173500 | consumed samples: 27724800 | consumed tokens: 56780390400 | elapsed time per iteration (s): 0.77 | learning rate: 7.676E-05 | global batch size: 256 | lm loss: 1.970201E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.312 | TFLOPs: 20.23 | 31: iteration 108310/ 173500 | consumed samples: 27727360 | consumed tokens: 56785633280 | elapsed time per iteration (s): 0.77 | learning rate: 7.675E-05 | global batch size: 256 | lm loss: 1.954404E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.750 | TFLOPs: 20.13 | 31: iteration 108320/ 173500 | consumed samples: 27729920 | consumed tokens: 56790876160 | elapsed time per iteration (s): 0.82 | learning rate: 7.673E-05 | global batch size: 256 | lm loss: 1.967896E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.882 | TFLOPs: 18.87 | 31: iteration 108330/ 173500 | consumed samples: 27732480 | consumed tokens: 56796119040 | elapsed time per iteration (s): 0.79 | learning rate: 7.672E-05 | global batch size: 256 | lm loss: 1.940548E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 324.864 | TFLOPs: 19.65 | 31: iteration 108340/ 173500 | consumed samples: 27735040 | consumed tokens: 56801361920 | elapsed time per iteration (s): 0.73 | learning rate: 7.670E-05 | global batch size: 256 | lm loss: 1.960703E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.978 | TFLOPs: 21.17 | 31: iteration 108350/ 173500 | consumed samples: 27737600 | consumed tokens: 56806604800 | elapsed time per iteration (s): 0.77 | learning rate: 7.668E-05 | global batch size: 256 | lm loss: 1.971592E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.375 | TFLOPs: 19.99 | 31: iteration 108360/ 173500 | consumed samples: 27740160 | consumed tokens: 56811847680 | elapsed time per iteration (s): 0.75 | learning rate: 7.667E-05 | global batch size: 256 | lm loss: 1.983782E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.591 | TFLOPs: 20.67 | 31: iteration 108370/ 173500 | consumed samples: 27742720 | consumed tokens: 56817090560 | elapsed time per iteration (s): 0.81 | learning rate: 7.665E-05 | global batch size: 256 | lm loss: 1.983968E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.732 | TFLOPs: 19.16 | 31: iteration 108380/ 173500 | consumed samples: 27745280 | consumed tokens: 56822333440 | elapsed time per iteration (s): 0.78 | learning rate: 7.664E-05 | global batch size: 256 | lm loss: 1.979135E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.940 | TFLOPs: 19.84 | 31: iteration 108390/ 173500 | consumed samples: 27747840 | consumed tokens: 56827576320 | elapsed time per iteration (s): 0.78 | learning rate: 7.662E-05 | global batch 
size: 256 | lm loss: 1.970205E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.151 | TFLOPs: 19.79 | 31: iteration 108400/ 173500 | consumed samples: 27750400 | consumed tokens: 56832819200 | elapsed time per iteration (s): 0.73 | learning rate: 7.661E-05 | global batch size: 256 | lm loss: 1.964176E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.671 | TFLOPs: 21.09 | 31: iteration 108410/ 173500 | consumed samples: 27752960 | consumed tokens: 56838062080 | elapsed time per iteration (s): 0.89 | learning rate: 7.659E-05 | global batch size: 256 | lm loss: 1.970780E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.868 | TFLOPs: 17.35 | 31: iteration 108420/ 173500 | consumed samples: 27755520 | consumed tokens: 56843304960 | elapsed time per iteration (s): 0.74 | learning rate: 7.658E-05 | global batch size: 256 | lm loss: 1.962367E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.299 | TFLOPs: 20.89 | 31: iteration 108430/ 173500 | consumed samples: 27758080 | consumed tokens: 56848547840 | elapsed time per iteration (s): 0.79 | learning rate: 7.656E-05 | global batch size: 256 | lm loss: 1.969496E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.834 | TFLOPs: 19.65 | 31: iteration 108440/ 173500 | consumed samples: 27760640 | consumed tokens: 56853790720 | elapsed time per iteration (s): 0.78 | learning rate: 7.655E-05 | global batch size: 256 | lm loss: 1.960407E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.929 | TFLOPs: 19.90 | 31: iteration 108450/ 173500 | consumed samples: 27763200 
| consumed tokens: 56859033600 | elapsed time per iteration (s): 0.75 | learning rate: 7.653E-05 | global batch size: 256 | lm loss: 1.954554E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.420 | TFLOPs: 20.53 | 31: iteration 108460/ 173500 | consumed samples: 27765760 | consumed tokens: 56864276480 | elapsed time per iteration (s): 0.76 | learning rate: 7.652E-05 | global batch size: 256 | lm loss: 1.959393E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.562 | TFLOPs: 20.42 | 31: iteration 108470/ 173500 | consumed samples: 27768320 | consumed tokens: 56869519360 | elapsed time per iteration (s): 0.77 | learning rate: 7.650E-05 | global batch size: 256 | lm loss: 1.973838E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.502 | TFLOPs: 20.18 | 31: iteration 108480/ 173500 | consumed samples: 27770880 | consumed tokens: 56874762240 | elapsed time per iteration (s): 0.74 | learning rate: 7.649E-05 | global batch size: 256 | lm loss: 1.982685E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.954 | TFLOPs: 20.99 | 31: iteration 108490/ 173500 | consumed samples: 27773440 | consumed tokens: 56880005120 | elapsed time per iteration (s): 0.82 | learning rate: 7.647E-05 | global batch size: 256 | lm loss: 1.958018E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.746 | TFLOPs: 18.92 | 31: iteration 108500/ 173500 | consumed samples: 27776000 | consumed tokens: 56885248000 | elapsed time per iteration (s): 0.80 | learning rate: 7.646E-05 | global batch size: 256 | lm loss: 1.979133E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 321.244 | TFLOPs: 19.43 | 31: iteration 108510/ 173500 | consumed samples: 27778560 | consumed tokens: 56890490880 | elapsed time per iteration (s): 0.87 | learning rate: 7.644E-05 | global batch size: 256 | lm loss: 1.989556E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.857 | TFLOPs: 17.90 | 31: iteration 108520/ 173500 | consumed samples: 27781120 | consumed tokens: 56895733760 | elapsed time per iteration (s): 0.85 | learning rate: 7.642E-05 | global batch size: 256 | lm loss: 1.941221E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.913 | TFLOPs: 18.14 | 31: iteration 108530/ 173500 | consumed samples: 27783680 | consumed tokens: 56900976640 | elapsed time per iteration (s): 0.84 | learning rate: 7.641E-05 | global batch size: 256 | lm loss: 1.961080E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.840 | TFLOPs: 18.44 | 31: iteration 108540/ 173500 | consumed samples: 27786240 | consumed tokens: 56906219520 | elapsed time per iteration (s): 0.86 | learning rate: 7.639E-05 | global batch size: 256 | lm loss: 1.943143E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.858 | TFLOPs: 18.08 | 31: iteration 108550/ 173500 | consumed samples: 27788800 | consumed tokens: 56911462400 | elapsed time per iteration (s): 0.84 | learning rate: 7.638E-05 | global batch size: 256 | lm loss: 1.995557E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.094 | TFLOPs: 18.46 | 31: iteration 108560/ 173500 | consumed samples: 27791360 | consumed tokens: 56916705280 | elapsed time per iteration (s): 0.81 | learning rate: 7.636E-05 | global batch size: 
256 | lm loss: 1.978391E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.897 | TFLOPs: 19.11 | 31: iteration 108570/ 173500 | consumed samples: 27793920 | consumed tokens: 56921948160 | elapsed time per iteration (s): 0.81 | learning rate: 7.635E-05 | global batch size: 256 | lm loss: 1.947402E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.274 | TFLOPs: 19.13 | 31: iteration 108580/ 173500 | consumed samples: 27796480 | consumed tokens: 56927191040 | elapsed time per iteration (s): 0.82 | learning rate: 7.633E-05 | global batch size: 256 | lm loss: 1.953323E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.795 | TFLOPs: 18.92 | 31: iteration 108590/ 173500 | consumed samples: 27799040 | consumed tokens: 56932433920 | elapsed time per iteration (s): 0.79 | learning rate: 7.632E-05 | global batch size: 256 | lm loss: 1.995459E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.423 | TFLOPs: 19.57 | 31: iteration 108600/ 173500 | consumed samples: 27801600 | consumed tokens: 56937676800 | elapsed time per iteration (s): 0.81 | learning rate: 7.630E-05 | global batch size: 256 | lm loss: 1.993340E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.435 | TFLOPs: 19.14 | 31: iteration 108610/ 173500 | consumed samples: 27804160 | consumed tokens: 56942919680 | elapsed time per iteration (s): 0.80 | learning rate: 7.629E-05 | global batch size: 256 | lm loss: 1.963113E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.778 | TFLOPs: 19.41 | 31: iteration 108620/ 173500 | consumed samples: 27806720 | 
consumed tokens: 56948162560 | elapsed time per iteration (s): 0.85 | learning rate: 7.627E-05 | global batch size: 256 | lm loss: 1.977478E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.284 | TFLOPs: 18.17 | 31: iteration 108630/ 173500 | consumed samples: 27809280 | consumed tokens: 56953405440 | elapsed time per iteration (s): 0.82 | learning rate: 7.626E-05 | global batch size: 256 | lm loss: 1.984825E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.080 | TFLOPs: 18.94 | 31: iteration 108640/ 173500 | consumed samples: 27811840 | consumed tokens: 56958648320 | elapsed time per iteration (s): 0.82 | learning rate: 7.624E-05 | global batch size: 256 | lm loss: 1.955787E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.005 | TFLOPs: 18.88 | 31: iteration 108650/ 173500 | consumed samples: 27814400 | consumed tokens: 56963891200 | elapsed time per iteration (s): 0.82 | learning rate: 7.623E-05 | global batch size: 256 | lm loss: 1.982594E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.175 | TFLOPs: 18.95 | 31: iteration 108660/ 173500 | consumed samples: 27816960 | consumed tokens: 56969134080 | elapsed time per iteration (s): 0.81 | learning rate: 7.621E-05 | global batch size: 256 | lm loss: 1.963434E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.081 | TFLOPs: 19.12 | 31: iteration 108670/ 173500 | consumed samples: 27819520 | consumed tokens: 56974376960 | elapsed time per iteration (s): 0.83 | learning rate: 7.620E-05 | global batch size: 256 | lm loss: 1.961290E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 309.781 | TFLOPs: 18.74 | 31: iteration 108680/ 173500 | consumed samples: 27822080 | consumed tokens: 56979619840 | elapsed time per iteration (s): 0.83 | learning rate: 7.618E-05 | global batch size: 256 | lm loss: 1.952617E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.732 | TFLOPs: 18.68 | 31: iteration 108690/ 173500 | consumed samples: 27824640 | consumed tokens: 56984862720 | elapsed time per iteration (s): 0.82 | learning rate: 7.617E-05 | global batch size: 256 | lm loss: 1.975779E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.150 | TFLOPs: 18.82 | 31: iteration 108700/ 173500 | consumed samples: 27827200 | consumed tokens: 56990105600 | elapsed time per iteration (s): 0.79 | learning rate: 7.615E-05 | global batch size: 256 | lm loss: 1.960655E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.441 | TFLOPs: 19.57 | 31: iteration 108710/ 173500 | consumed samples: 27829760 | consumed tokens: 56995348480 | elapsed time per iteration (s): 0.84 | learning rate: 7.613E-05 | global batch size: 256 | lm loss: 1.976268E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.893 | TFLOPs: 18.38 | 31: iteration 108720/ 173500 | consumed samples: 27832320 | consumed tokens: 57000591360 | elapsed time per iteration (s): 0.74 | learning rate: 7.612E-05 | global batch size: 256 | lm loss: 1.956798E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.669 | TFLOPs: 20.97 | 31: iteration 108730/ 173500 | consumed samples: 27834880 | consumed tokens: 57005834240 | elapsed time per iteration (s): 0.78 | learning rate: 7.610E-05 | global batch size: 
256 | lm loss: 1.954972E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.693 | TFLOPs: 19.82 | 31: iteration 108740/ 173500 | consumed samples: 27837440 | consumed tokens: 57011077120 | elapsed time per iteration (s): 0.73 | learning rate: 7.609E-05 | global batch size: 256 | lm loss: 1.953830E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.831 | TFLOPs: 21.28 | 31: iteration 108750/ 173500 | consumed samples: 27840000 | consumed tokens: 57016320000 | elapsed time per iteration (s): 0.77 | learning rate: 7.607E-05 | global batch size: 256 | lm loss: 1.987433E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.733 | TFLOPs: 20.13 | 31: iteration 108760/ 173500 | consumed samples: 27842560 | consumed tokens: 57021562880 | elapsed time per iteration (s): 0.75 | learning rate: 7.606E-05 | global batch size: 256 | lm loss: 1.985093E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.025 | TFLOPs: 20.75 | 31: iteration 108770/ 173500 | consumed samples: 27845120 | consumed tokens: 57026805760 | elapsed time per iteration (s): 0.78 | learning rate: 7.604E-05 | global batch size: 256 | lm loss: 1.960277E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.158 | TFLOPs: 19.97 | 31: iteration 108780/ 173500 | consumed samples: 27847680 | consumed tokens: 57032048640 | elapsed time per iteration (s): 0.78 | learning rate: 7.603E-05 | global batch size: 256 | lm loss: 2.005172E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.506 | TFLOPs: 19.93 | 31: iteration 108790/ 173500 | consumed samples: 27850240 | 
consumed tokens: 57037291520 | elapsed time per iteration (s): 0.76 | learning rate: 7.601E-05 | global batch size: 256 | lm loss: 1.972806E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.583 | TFLOPs: 20.36 | 31: iteration 108800/ 173500 | consumed samples: 27852800 | consumed tokens: 57042534400 | elapsed time per iteration (s): 0.76 | learning rate: 7.600E-05 | global batch size: 256 | lm loss: 1.965487E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.463 | TFLOPs: 20.48 | 31: iteration 108810/ 173500 | consumed samples: 27855360 | consumed tokens: 57047777280 | elapsed time per iteration (s): 0.75 | learning rate: 7.598E-05 | global batch size: 256 | lm loss: 1.978153E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.417 | TFLOPs: 20.65 | 31: iteration 108820/ 173500 | consumed samples: 27857920 | consumed tokens: 57053020160 | elapsed time per iteration (s): 0.81 | learning rate: 7.597E-05 | global batch size: 256 | lm loss: 2.022921E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.710 | TFLOPs: 19.16 | 31: iteration 108830/ 173500 | consumed samples: 27860480 | consumed tokens: 57058263040 | elapsed time per iteration (s): 0.81 | learning rate: 7.595E-05 | global batch size: 256 | lm loss: 1.947576E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.040 | TFLOPs: 19.12 | 31: iteration 108840/ 173500 | consumed samples: 27863040 | consumed tokens: 57063505920 | elapsed time per iteration (s): 0.85 | learning rate: 7.594E-05 | global batch size: 256 | lm loss: 1.956303E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 300.551 | TFLOPs: 18.18 | 31: iteration 108850/ 173500 | consumed samples: 27865600 | consumed tokens: 57068748800 | elapsed time per iteration (s): 0.77 | learning rate: 7.592E-05 | global batch size: 256 | lm loss: 1.948212E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.907 | TFLOPs: 20.08 | 31: iteration 108860/ 173500 | consumed samples: 27868160 | consumed tokens: 57073991680 | elapsed time per iteration (s): 0.73 | learning rate: 7.591E-05 | global batch size: 256 | lm loss: 1.949822E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.095 | TFLOPs: 21.30 | 31: iteration 108870/ 173500 | consumed samples: 27870720 | consumed tokens: 57079234560 | elapsed time per iteration (s): 0.78 | learning rate: 7.589E-05 | global batch size: 256 | lm loss: 1.984733E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.836 | TFLOPs: 19.77 | 31: iteration 108880/ 173500 | consumed samples: 27873280 | consumed tokens: 57084477440 | elapsed time per iteration (s): 0.75 | learning rate: 7.588E-05 | global batch size: 256 | lm loss: 1.981174E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.411 | TFLOPs: 20.65 | 31: iteration 108890/ 173500 | consumed samples: 27875840 | consumed tokens: 57089720320 | elapsed time per iteration (s): 0.76 | learning rate: 7.586E-05 | global batch size: 256 | lm loss: 1.961274E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.094 | TFLOPs: 20.45 | 31: iteration 108900/ 173500 | consumed samples: 27878400 | consumed tokens: 57094963200 | elapsed time per iteration (s): 0.78 | learning rate: 7.585E-05 | global batch size: 
256 | lm loss: 1.944494E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.290 | TFLOPs: 19.92 | 31: iteration 108910/ 173500 | consumed samples: 27880960 | consumed tokens: 57100206080 | elapsed time per iteration (s): 0.81 | learning rate: 7.583E-05 | global batch size: 256 | lm loss: 1.972610E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.522 | TFLOPs: 19.21 | 31: iteration 108920/ 173500 | consumed samples: 27883520 | consumed tokens: 57105448960 | elapsed time per iteration (s): 1.55 | learning rate: 7.581E-05 | global batch size: 256 | lm loss: 1.967048E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 165.683 | TFLOPs: 10.02 | 31: iteration 108930/ 173500 | consumed samples: 27886080 | consumed tokens: 57110691840 | elapsed time per iteration (s): 0.76 | learning rate: 7.580E-05 | global batch size: 256 | lm loss: 1.956834E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.970 | TFLOPs: 20.33 | 31: iteration 108940/ 173500 | consumed samples: 27888640 | consumed tokens: 57115934720 | elapsed time per iteration (s): 0.77 | learning rate: 7.578E-05 | global batch size: 256 | lm loss: 1.932428E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.686 | TFLOPs: 20.01 | 31: iteration 108950/ 173500 | consumed samples: 27891200 | consumed tokens: 57121177600 | elapsed time per iteration (s): 0.78 | learning rate: 7.577E-05 | global batch size: 256 | lm loss: 1.995024E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.477 | TFLOPs: 19.75 | 31: iteration 108960/ 173500 | consumed samples: 27893760 | 
consumed tokens: 57126420480 | elapsed time per iteration (s): 0.78 | learning rate: 7.575E-05 | global batch size: 256 | lm loss: 1.966421E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.293 | TFLOPs: 19.92 | 31: iteration 108970/ 173500 | consumed samples: 27896320 | consumed tokens: 57131663360 | elapsed time per iteration (s): 0.75 | learning rate: 7.574E-05 | global batch size: 256 | lm loss: 1.963299E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.669 | TFLOPs: 20.67 | 31: iteration 108980/ 173500 | consumed samples: 27898880 | consumed tokens: 57136906240 | elapsed time per iteration (s): 0.76 | learning rate: 7.572E-05 | global batch size: 256 | lm loss: 1.956645E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.500 | TFLOPs: 20.42 | 31: iteration 108990/ 173500 | consumed samples: 27901440 | consumed tokens: 57142149120 | elapsed time per iteration (s): 0.75 | learning rate: 7.571E-05 | global batch size: 256 | lm loss: 1.983068E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.060 | TFLOPs: 20.69 | 31: iteration 109000/ 173500 | consumed samples: 27904000 | consumed tokens: 57147392000 | elapsed time per iteration (s): 0.73 | learning rate: 7.569E-05 | global batch size: 256 | lm loss: 1.964218E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.084 | TFLOPs: 21.24 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 109000 | lm loss value: 1.971281E+00 | lm loss PPL: 7.179867E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving 
checkpoint at iteration 109000 to checkpoints_1b1long 0: [2022-11-26 18:41:55,226] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step109000 is begin to save! 0: [2022-11-26 18:41:55,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_01-model_00-model_states.pt... 0: [2022-11-26 18:41:55,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_01-model_00-model_states.pt. 0: [2022-11-26 18:41:55,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_03-model_00-model_states.pt... 0: [2022-11-26 18:41:55,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_03-model_00-model_states.pt. 0: [2022-11-26 18:41:55,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_04-model_00-model_states.pt... 0: [2022-11-26 18:41:55,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_04-model_00-model_states.pt. 0: [2022-11-26 18:41:55,643] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_05-model_00-model_states.pt... 0: [2022-11-26 18:41:55,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_05-model_00-model_states.pt. 0: [2022-11-26 18:41:55,720] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_06-model_00-model_states.pt... 0: [2022-11-26 18:41:55,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_06-model_00-model_states.pt. 0: [2022-11-26 18:41:55,797] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_07-model_00-model_states.pt... 
0: [2022-11-26 18:41:55,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_07-model_00-model_states.pt. 0: [2022-11-26 18:41:55,871] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_08-model_00-model_states.pt... 0: [2022-11-26 18:41:55,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_08-model_00-model_states.pt. 0: [2022-11-26 18:41:55,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_09-model_00-model_states.pt... 0: [2022-11-26 18:41:56,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_09-model_00-model_states.pt. 0: [2022-11-26 18:41:56,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_10-model_00-model_states.pt... 0: [2022-11-26 18:41:56,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_10-model_00-model_states.pt. 0: [2022-11-26 18:41:56,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_11-model_00-model_states.pt... 0: [2022-11-26 18:41:56,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_11-model_00-model_states.pt. 0: [2022-11-26 18:41:56,170] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_12-model_00-model_states.pt... 0: [2022-11-26 18:41:56,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_12-model_00-model_states.pt. 0: [2022-11-26 18:41:56,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_13-model_00-model_states.pt... 
0: [2022-11-26 18:41:56,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_13-model_00-model_states.pt. 0: [2022-11-26 18:41:56,325] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_14-model_00-model_states.pt... 0: [2022-11-26 18:41:56,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_14-model_00-model_states.pt. 0: [2022-11-26 18:41:56,399] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_15-model_00-model_states.pt... 0: [2022-11-26 18:41:56,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_15-model_00-model_states.pt. 0: [2022-11-26 18:41:56,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_16-model_00-model_states.pt... 0: [2022-11-26 18:41:56,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_16-model_00-model_states.pt. 0: [2022-11-26 18:41:56,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_17-model_00-model_states.pt... 0: [2022-11-26 18:41:56,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_17-model_00-model_states.pt. 0: [2022-11-26 18:41:56,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_18-model_00-model_states.pt... 0: [2022-11-26 18:41:56,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_18-model_00-model_states.pt. 0: [2022-11-26 18:41:56,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_19-model_00-model_states.pt... 
0: [2022-11-26 18:41:56,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_19-model_00-model_states.pt. 0: [2022-11-26 18:41:56,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_20-model_00-model_states.pt... 0: [2022-11-26 18:41:56,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_20-model_00-model_states.pt. 0: [2022-11-26 18:41:56,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_21-model_00-model_states.pt... 0: [2022-11-26 18:41:56,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_21-model_00-model_states.pt. 0: [2022-11-26 18:41:56,917] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_22-model_00-model_states.pt... 0: [2022-11-26 18:41:56,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_22-model_00-model_states.pt. 0: [2022-11-26 18:41:56,991] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_23-model_00-model_states.pt... 0: [2022-11-26 18:41:57,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_23-model_00-model_states.pt. 0: [2022-11-26 18:41:57,065] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_24-model_00-model_states.pt... 0: [2022-11-26 18:41:57,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_24-model_00-model_states.pt. 0: [2022-11-26 18:41:57,139] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_25-model_00-model_states.pt... 
0: [2022-11-26 18:41:57,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_25-model_00-model_states.pt. 0: [2022-11-26 18:41:57,214] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_26-model_00-model_states.pt... 0: [2022-11-26 18:41:57,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_26-model_00-model_states.pt. 0: [2022-11-26 18:41:57,284] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_27-model_00-model_states.pt... 0: [2022-11-26 18:41:57,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_27-model_00-model_states.pt. 0: [2022-11-26 18:41:57,357] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_28-model_00-model_states.pt... 0: [2022-11-26 18:41:57,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_28-model_00-model_states.pt. 0: [2022-11-26 18:41:57,431] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/layer_30-model_00-model_states.pt... 0: [2022-11-26 18:41:57,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/layer_30-model_00-model_states.pt. 0: [2022-11-26 18:41:57,436] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step109000/mp_rank_00_model_states.pt 0: [2022-11-26 18:41:57,436] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/mp_rank_00_model_states.pt... 0: [2022-11-26 18:41:57,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/mp_rank_00_model_states.pt. 
0: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
14: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 
18: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 
0: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
7: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
10: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 
16: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
3: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 
23: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 
28: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 
31: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 
21: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 
27: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
9: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
16: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
15: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
11: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 
30: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 
27: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 
2: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 
24: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 
5: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 
22: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 
12: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:41:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:41:57,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:41:57,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 18:41:57,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 16: [2022-11-26 18:41:57,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:41:57,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 18:41:57,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 30: [2022-11-26 18:41:57,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:41:57,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
30: [2022-11-26 18:41:57,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 19: [2022-11-26 18:41:57,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 30: [2022-11-26 18:41:57,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 19: [2022-11-26 18:41:57,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 21: [2022-11-26 18:41:57,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:41:57,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:41:57,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 18:41:57,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 21: [2022-11-26 18:41:57,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 18:41:57,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 4: [2022-11-26 18:41:57,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 18:41:57,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 18:41:57,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 9: [2022-11-26 18:41:57,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:41:57,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 18:41:57,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 29: [2022-11-26 18:41:57,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:41:57,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 6: [2022-11-26 18:41:57,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:41:57,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 6: [2022-11-26 18:41:57,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 18:41:57,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 6: [2022-11-26 18:41:57,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 18:41:57,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 18:41:57,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 30: [2022-11-26 18:41:57,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:41:57,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 18:41:57,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 28: [2022-11-26 18:41:57,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:41:57,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 18:41:57,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 18: [2022-11-26 18:41:57,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:41:57,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:41:57,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
29: [2022-11-26 18:41:57,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:41:57,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 18: [2022-11-26 18:41:57,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 18:41:57,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 29: [2022-11-26 18:41:57,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 18: [2022-11-26 18:41:57,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 18:41:57,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 18: [2022-11-26 18:41:57,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 18: [2022-11-26 18:41:57,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 8: [2022-11-26 18:41:57,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:41:57,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 18:41:57,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
23: [2022-11-26 18:41:57,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:41:57,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 18:41:57,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:41:57,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 23: [2022-11-26 18:41:57,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 18:41:57,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 8: [2022-11-26 18:41:57,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:41:57,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 19: [2022-11-26 18:41:57,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:41:57,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 19: [2022-11-26 18:41:57,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 18:41:57,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
0: [2022-11-26 18:41:57,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:41:57,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 18:41:57,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 0: [2022-11-26 18:41:57,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:41:57,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:41:57,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 18:41:57,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 9: [2022-11-26 18:41:57,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:41:57,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 18:41:57,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 13: [2022-11-26 18:41:57,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 18:41:57,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 18:41:57,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 7: [2022-11-26 18:41:57,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:41:57,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 18:41:57,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 7: [2022-11-26 18:41:57,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:41:57,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 18:41:57,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 24: [2022-11-26 18:41:57,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:41:57,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 18:41:57,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 28: [2022-11-26 18:41:57,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
26: [2022-11-26 18:41:57,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:41:57,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 21: [2022-11-26 18:41:57,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:41:57,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 21: [2022-11-26 18:41:57,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 26: [2022-11-26 18:41:57,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 21: [2022-11-26 18:41:57,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 10: [2022-11-26 18:41:57,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:41:57,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 18:41:57,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 9: [2022-11-26 18:41:57,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-26 18:41:57,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 18:41:57,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 6: [2022-11-26 18:41:57,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:41:57,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 18:41:57,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 26: [2022-11-26 18:41:57,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:41:57,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 26: [2022-11-26 18:41:57,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 18:41:57,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 11: [2022-11-26 18:41:57,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:41:57,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 18:41:57,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
10: [2022-11-26 18:41:57,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:41:57,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:41:57,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 18:41:57,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 19: [2022-11-26 18:41:57,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 18:41:57,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 26: [2022-11-26 18:41:57,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:41:57,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 18:41:57,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 11: [2022-11-26 18:41:57,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:41:57,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 18:41:57,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
24: [2022-11-26 18:41:57,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:41:57,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:41:57,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 8: [2022-11-26 18:41:57,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 18:41:57,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 24: [2022-11-26 18:41:57,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 21: [2022-11-26 18:41:57,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:41:57,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 18:41:57,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 15: [2022-11-26 18:41:57,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:41:57,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 18:41:57,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
4: [2022-11-26 18:41:57,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:41:57,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 18:41:57,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 0: [2022-11-26 18:41:57,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:41:57,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 29: [2022-11-26 18:41:57,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:41:57,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 29: [2022-11-26 18:41:57,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 18:41:57,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 28: [2022-11-26 18:41:57,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:41:57,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 18:41:57,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
10: [2022-11-26 18:41:57,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:41:57,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 18:41:57,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 26: [2022-11-26 18:41:57,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:41:57,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 18:41:57,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 18: [2022-11-26 18:41:57,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:41:57,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 18:41:57,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 16: [2022-11-26 18:41:57,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:41:57,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
16: [2022-11-26 18:41:57,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 9: [2022-11-26 18:41:57,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 16: [2022-11-26 18:41:57,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 9: [2022-11-26 18:41:57,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 20: [2022-11-26 18:41:57,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:41:57,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:41:57,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 18:41:57,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 18:41:57,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 20: [2022-11-26 18:41:57,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 2: [2022-11-26 18:41:57,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 18:41:57,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 18:41:57,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 13: [2022-11-26 18:41:57,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:41:57,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 18:41:57,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 15: [2022-11-26 18:41:57,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:41:57,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 18:41:57,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 13: [2022-11-26 18:41:57,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:41:57,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 18:41:57,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 8: [2022-11-26 18:41:57,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 18:41:57,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 18:41:57,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 19: [2022-11-26 18:41:57,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:41:57,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 18:41:57,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 17: [2022-11-26 18:41:57,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:41:57,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:41:57,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:41:57,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 
17: [2022-11-26 18:41:57,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 18:41:57,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 18:41:57,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 18:41:57,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 18:41:57,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 17: [2022-11-26 18:41:57,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 29: [2022-11-26 18:41:57,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:41:57,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 29: [2022-11-26 18:41:57,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 17: [2022-11-26 18:41:57,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 29: [2022-11-26 18:41:57,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 24: [2022-11-26 18:41:57,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
24: [2022-11-26 18:41:57,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 18:41:57,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 2: [2022-11-26 18:41:57,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:41:57,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 18:41:57,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 6: [2022-11-26 18:41:57,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:41:57,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 30: [2022-11-26 18:41:57,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:41:57,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:41:57,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
30: [2022-11-26 18:41:57,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 18:41:57,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 18:41:57,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 28: [2022-11-26 18:41:57,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:41:57,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 20: [2022-11-26 18:41:57,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:41:57,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 18:41:57,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 25: [2022-11-26 18:41:57,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:41:57,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 18:41:57,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 2: [2022-11-26 18:41:57,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 18:41:57,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 22: [2022-11-26 18:41:57,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:41:57,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:41:57,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 22: [2022-11-26 18:41:57,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 18:41:57,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 18:41:57,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 22: [2022-11-26 18:41:57,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 22: [2022-11-26 18:41:57,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:41:57,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 18:41:57,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 22: [2022-11-26 18:41:57,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
22: [2022-11-26 18:41:57,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 18:41:57,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 4: [2022-11-26 18:41:57,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:41:57,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 7: [2022-11-26 18:41:57,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:41:57,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 7: [2022-11-26 18:41:57,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 18:41:57,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:41:57,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 25: [2022-11-26 18:41:57,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:41:57,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 18:41:57,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
25: [2022-11-26 18:41:57,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 18:41:57,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 28: [2022-11-26 18:41:57,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 18:41:57,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 10: [2022-11-26 18:41:57,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:41:57,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 18:41:57,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 24: [2022-11-26 18:41:57,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:41:57,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 16: [2022-11-26 18:41:57,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:41:57,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
16: [2022-11-26 18:41:57,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 18:41:57,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 24: [2022-11-26 18:41:57,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 25: [2022-11-26 18:41:57,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:41:57,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:41:57,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:41:57,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 16: [2022-11-26 18:41:57,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 27: [2022-11-26 18:41:57,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 
27: [2022-11-26 18:41:57,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 25: [2022-11-26 18:41:57,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 27: [2022-11-26 18:41:57,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 25: [2022-11-26 18:41:57,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 27: [2022-11-26 18:41:57,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 27: [2022-11-26 18:41:57,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 27: [2022-11-26 18:41:57,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 18:41:57,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 23: [2022-11-26 18:41:57,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:41:57,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
23: [2022-11-26 18:41:57,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 18:41:57,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 18:41:57,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 23: [2022-11-26 18:41:57,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 17: [2022-11-26 18:41:57,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:41:57,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 18:41:57,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 15: [2022-11-26 18:41:57,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:41:57,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 18:41:57,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 11: [2022-11-26 18:41:57,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 18:41:57,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 18:41:57,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 0: [2022-11-26 18:41:57,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:41:57,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 18:41:57,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 3: [2022-11-26 18:41:57,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:41:57,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:41:57,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:41:57,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:41:57,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
3: [2022-11-26 18:41:57,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 18:41:57,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 18:41:57,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 18:41:57,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 18:41:57,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 11: [2022-11-26 18:41:57,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 3: [2022-11-26 18:41:57,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 3: [2022-11-26 18:41:57,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 3: [2022-11-26 18:41:57,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 11: [2022-11-26 18:41:57,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 31: [2022-11-26 18:41:57,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:41:57,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
31: [2022-11-26 18:41:57,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:41:57,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:41:57,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 18:41:57,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 18:41:57,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 18:41:57,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 18:41:57,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 31: [2022-11-26 18:41:57,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 31: [2022-11-26 18:41:57,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 31: [2022-11-26 18:41:57,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 1: [2022-11-26 18:41:57,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:41:57,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 18:41:57,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:41:57,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:41:57,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 18:41:57,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 18:41:57,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 18:41:57,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 18:41:57,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 1: [2022-11-26 18:41:57,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 1: [2022-11-26 18:41:57,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 1: [2022-11-26 18:41:57,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 12: [2022-11-26 18:41:57,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:41:57,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 18:41:57,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:41:57,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:41:57,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 18:41:57,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 18:41:57,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 18:41:57,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 14: [2022-11-26 18:41:57,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:41:57,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 12: [2022-11-26 18:41:57,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 12: [2022-11-26 18:41:57,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 12: [2022-11-26 18:41:57,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 14: [2022-11-26 18:41:57,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-26 18:41:57,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:41:57,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:41:57,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 18:41:57,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 14: [2022-11-26 18:41:57,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 18:41:57,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 18:41:57,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 18:41:57,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 14: [2022-11-26 18:41:57,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 14: [2022-11-26 18:41:57,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 6: [2022-11-26 18:41:57,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:41:57,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
6: [2022-11-26 18:41:57,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 26: [2022-11-26 18:41:57,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 18:41:57,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 6: [2022-11-26 18:41:57,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 0: [2022-11-26 18:41:57,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:41:57,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 18:41:57,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 9: [2022-11-26 18:41:57,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:41:57,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 18:41:57,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 18: [2022-11-26 18:41:57,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
18: [2022-11-26 18:41:57,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 18:41:57,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 0: [2022-11-26 18:41:57,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 18:41:57,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 8: [2022-11-26 18:41:57,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:41:57,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 18:41:57,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 19: [2022-11-26 18:41:57,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:41:57,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 18:41:57,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 3: [2022-11-26 18:41:57,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 18:41:57,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 18:41:57,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 5: [2022-11-26 18:41:57,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:41:57,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:41:57,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 18:41:57,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:41:57,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:41:57,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:41:57,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 18:41:57,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
5: [2022-11-26 18:41:57,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 18:41:57,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 18:41:57,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 5: [2022-11-26 18:41:57,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 18:41:57,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 5: [2022-11-26 18:41:57,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 5: [2022-11-26 18:41:57,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 22: [2022-11-26 18:41:57,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:41:57,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 18:41:57,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 21: [2022-11-26 18:41:57,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-26 18:41:57,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 18:41:57,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 10: [2022-11-26 18:41:57,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:41:57,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 18:41:57,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 29: [2022-11-26 18:41:57,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:41:57,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 18:41:57,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 23: [2022-11-26 18:41:57,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:41:57,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 18:41:57,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 31: [2022-11-26 18:41:57,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-26 18:41:57,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 18:41:57,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 30: [2022-11-26 18:41:57,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:41:57,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 18:41:57,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 4: [2022-11-26 18:41:57,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:41:57,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 18:41:57,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 2: [2022-11-26 18:41:57,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:41:57,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 18:41:57,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 27: [2022-11-26 18:41:57,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
28: [2022-11-26 18:41:57,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:41:57,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 18:41:57,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 28: [2022-11-26 18:41:57,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 18:41:57,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 1: [2022-11-26 18:41:57,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:41:57,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:41:57,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 24: [2022-11-26 18:41:57,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 1: [2022-11-26 18:41:57,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 24: [2022-11-26 18:41:57,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 14: [2022-11-26 18:41:57,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 18:41:57,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 18:41:57,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 16: [2022-11-26 18:41:57,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:41:57,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 18:41:57,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 7: [2022-11-26 18:41:57,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:41:57,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 18:41:57,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 13: [2022-11-26 18:41:57,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:41:57,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
13: [2022-11-26 18:41:57,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 20: [2022-11-26 18:41:57,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 13: [2022-11-26 18:41:57,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 20: [2022-11-26 18:41:57,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 25: [2022-11-26 18:41:57,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:41:57,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 18:41:57,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 15: [2022-11-26 18:41:57,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:41:57,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 18:41:57,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 12: [2022-11-26 18:41:57,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 18:41:57,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 18:41:57,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 17: [2022-11-26 18:41:57,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:41:57,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 18:41:57,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 18: [2022-11-26 18:41:57,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:41:57,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 18:41:57,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 11: [2022-11-26 18:41:57,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:41:57,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 18:41:57,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 6: [2022-11-26 18:41:57,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-26 18:41:57,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 18:41:57,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 22: [2022-11-26 18:41:57,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:41:57,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 18:41:57,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 21: [2022-11-26 18:41:57,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:41:57,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 18:41:57,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 10: [2022-11-26 18:41:57,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:41:57,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 18:41:57,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 3: [2022-11-26 18:41:57,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 18:41:57,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 18:41:57,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 9: [2022-11-26 18:41:57,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:41:57,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 18:41:57,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 19: [2022-11-26 18:41:57,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:41:57,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 18:41:57,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 26: [2022-11-26 18:41:57,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:41:57,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 18:41:57,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 8: [2022-11-26 18:41:57,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-26 18:41:57,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 18:41:57,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 5: [2022-11-26 18:41:57,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:41:57,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 18:41:57,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 23: [2022-11-26 18:41:57,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:41:57,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 18:41:57,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 29: [2022-11-26 18:41:57,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:41:57,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:41:57,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 18:41:57,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
31: [2022-11-26 18:41:57,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 18:41:57,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 2: [2022-11-26 18:41:57,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:41:57,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 18:41:57,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 1: [2022-11-26 18:41:57,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:41:57,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 30: [2022-11-26 18:41:57,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:41:57,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 30: [2022-11-26 18:41:57,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 18:41:57,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 4: [2022-11-26 18:41:57,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
13: [2022-11-26 18:41:57,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:41:57,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 4: [2022-11-26 18:41:57,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 13: [2022-11-26 18:41:57,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 4: [2022-11-26 18:41:57,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 24: [2022-11-26 18:41:57,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:41:57,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 18:41:57,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 28: [2022-11-26 18:41:57,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:41:57,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 18:41:57,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 14: [2022-11-26 18:41:57,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-26 18:41:57,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 18:41:57,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 12: [2022-11-26 18:41:57,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:41:57,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 18:41:57,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 27: [2022-11-26 18:41:57,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:41:57,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 18:41:57,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 20: [2022-11-26 18:41:57,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:41:57,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 18:41:57,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 7: [2022-11-26 18:41:57,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-26 18:41:57,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 18:41:57,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 0: [2022-11-26 18:41:57,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:41:57,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 18:41:57,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 16: [2022-11-26 18:41:57,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:41:57,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:41:57,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:41:57,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 16: [2022-11-26 18:41:57,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 18:41:57,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
15: [2022-11-26 18:41:57,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 18:41:57,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 25: [2022-11-26 18:41:57,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 17: [2022-11-26 18:41:57,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:41:57,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 18:41:57,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 18: [2022-11-26 18:41:57,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:41:57,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 18:41:57,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 26: [2022-11-26 18:41:57,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:41:57,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 18:41:57,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
11: [2022-11-26 18:41:57,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:41:57,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 18:41:57,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 5: [2022-11-26 18:41:57,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:41:57,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 18:41:57,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 6: [2022-11-26 18:41:57,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:41:57,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 18:41:57,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 0: [2022-11-26 18:41:57,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:41:57,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 18:41:57,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
9: [2022-11-26 18:41:57,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:41:57,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 18:41:57,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 22: [2022-11-26 18:41:57,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:41:57,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 18:41:57,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 3: [2022-11-26 18:41:57,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:41:57,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 18:41:57,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 21: [2022-11-26 18:41:57,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:41:57,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 18:41:57,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
19: [2022-11-26 18:41:57,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:41:57,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 18:41:57,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 23: [2022-11-26 18:41:57,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:41:57,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 18:41:57,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 10: [2022-11-26 18:41:57,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:41:57,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 18:41:57,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 8: [2022-11-26 18:41:57,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:41:57,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 18:41:57,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
31: [2022-11-26 18:41:57,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:41:57,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 18:41:57,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 29: [2022-11-26 18:41:57,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:41:57,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 18:41:57,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 30: [2022-11-26 18:41:57,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:41:57,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 18:41:57,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 29: [2022-11-26 18:41:57,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:41:57,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 18:41:57,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
0: [2022-11-26 18:41:57,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:41:57,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 18:41:57,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 1: [2022-11-26 18:41:57,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:41:57,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:41:57,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 18:41:57,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 1: [2022-11-26 18:41:57,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 18:41:57,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 16: [2022-11-26 18:41:57,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:41:57,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 18:41:57,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
24: [2022-11-26 18:41:57,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:41:57,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 18:41:57,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 14: [2022-11-26 18:41:57,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:41:57,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:41:57,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 12: [2022-11-26 18:41:57,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 14: [2022-11-26 18:41:57,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 12: [2022-11-26 18:41:57,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 28: [2022-11-26 18:41:57,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:41:57,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 18:41:57,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 18:41:57,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 28: [2022-11-26 18:41:57,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 18:41:57,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 17: [2022-11-26 18:41:57,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:41:57,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:41:57,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 26: [2022-11-26 18:41:57,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 17: [2022-11-26 18:41:57,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 26: [2022-11-26 18:41:57,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 3: [2022-11-26 18:41:57,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 18:41:57,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 18:41:57,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 22: [2022-11-26 18:41:57,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:41:57,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 18:41:57,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 5: [2022-11-26 18:41:57,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:41:57,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 18:41:57,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 27: [2022-11-26 18:41:57,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:41:57,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:41:57,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
6: [2022-11-26 18:41:57,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 27: [2022-11-26 18:41:57,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 6: [2022-11-26 18:41:57,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 4: [2022-11-26 18:41:57,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 18:41:57,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 27: [2022-11-26 18:41:57,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 8: [2022-11-26 18:41:57,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:41:57,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 18:41:57,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 18: [2022-11-26 18:41:57,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:41:57,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 
18: [2022-11-26 18:41:57,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 18:41:57,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 31: [2022-11-26 18:41:57,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 18:41:57,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 13: [2022-11-26 18:41:57,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:41:57,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 18:41:57,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 21: [2022-11-26 18:41:57,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:41:57,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 18:41:57,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 23: [2022-11-26 18:41:57,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-26 18:41:57,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 18:41:57,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 9: [2022-11-26 18:41:57,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:41:57,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 18:41:57,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 28: [2022-11-26 18:41:57,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:41:57,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:41:57,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 24: [2022-11-26 18:41:57,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:41:57,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
19: [2022-11-26 18:41:57,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 24: [2022-11-26 18:41:57,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 18:41:57,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 19: [2022-11-26 18:41:57,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 16: [2022-11-26 18:41:57,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:41:57,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:41:57,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:41:57,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 20: [2022-11-26 18:41:57,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 18:41:57,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 30: [2022-11-26 18:41:57,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 16: [2022-11-26 18:41:57,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
15: [2022-11-26 18:41:57,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:41:57,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 15: [2022-11-26 18:41:57,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 18:41:57,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 7: [2022-11-26 18:41:57,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:41:57,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 1: [2022-11-26 18:41:57,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:41:57,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 1: [2022-11-26 18:41:57,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 18:41:57,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 25: [2022-11-26 18:41:57,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
25: [2022-11-26 18:41:57,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 2: [2022-11-26 18:41:57,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:41:57,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 25: [2022-11-26 18:41:57,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 2: [2022-11-26 18:41:57,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 2: [2022-11-26 18:41:57,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:41:57,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 18:41:57,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 14: [2022-11-26 18:41:57,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:41:57,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 18:41:57,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 4: [2022-11-26 18:41:57,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 18:41:57,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 18:41:57,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 15: [2022-11-26 18:41:57,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:41:57,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 18:41:57,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 20: [2022-11-26 18:41:57,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:41:57,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 18:41:57,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 12: [2022-11-26 18:41:57,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:41:57,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
12: [2022-11-26 18:41:57,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 27: [2022-11-26 18:41:57,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 12: [2022-11-26 18:41:57,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 27: [2022-11-26 18:41:57,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 2: [2022-11-26 18:41:57,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:41:57,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 18:41:57,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 13: [2022-11-26 18:41:57,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:41:57,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 18:41:57,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 25: [2022-11-26 18:41:57,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
25: [2022-11-26 18:41:57,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 20: [2022-11-26 18:41:57,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:41:57,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 20: [2022-11-26 18:41:57,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 18:41:57,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 27: [2022-11-26 18:41:57,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:41:57,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 18:41:57,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 15: [2022-11-26 18:41:57,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:41:57,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 18:41:57,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 13: [2022-11-26 18:41:57,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 18:41:57,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 18:41:57,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 25: [2022-11-26 18:41:57,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:41:57,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:41:57,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 25: [2022-11-26 18:41:57,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step109000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 10: [2022-11-26 18:41:57,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 25: [2022-11-26 18:41:57,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
0: successfully saved checkpoint at iteration 109000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2544.90 31: iteration 109010/ 173500 | consumed samples: 27906560 | consumed tokens: 57152634880 | elapsed time per iteration (s): 0.99 | learning rate: 7.568E-05 | global batch size: 256 | lm loss: 1.967957E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 257.614 | TFLOPs: 15.59 | 31: iteration 109020/ 173500 | consumed samples: 27909120 | consumed tokens: 57157877760 | elapsed time per iteration (s): 0.74 | learning rate: 7.566E-05 | global batch size: 256 | lm loss: 1.971961E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.822 | TFLOPs: 20.98 | 31: iteration 109030/ 173500 | consumed samples: 27911680 | consumed tokens: 57163120640 | elapsed time per iteration (s): 0.73 | learning rate: 7.565E-05 | global batch size: 256 | lm loss: 1.952362E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.888 | TFLOPs: 21.17 | 31: iteration 109040/ 173500 | consumed samples: 27914240 | consumed tokens: 57168363520 | elapsed time per iteration (s): 0.77 | learning rate: 7.563E-05 | global batch size: 256 | lm loss: 1.974428E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.427 | TFLOPs: 19.99 | 31: iteration 109050/ 173500 | consumed samples: 27916800 | consumed tokens: 57173606400 | elapsed time per iteration (s): 0.81 | learning rate: 7.562E-05 | global batch size: 256 | lm loss: 1.962767E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.136 | TFLOPs: 19.00 | 31: iteration 109060/ 173500 | consumed samples: 27919360 | consumed tokens: 57178849280 | elapsed time per iteration (s): 
0.79 | learning rate: 7.560E-05 | global batch size: 256 | lm loss: 1.981941E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.761 | TFLOPs: 19.71 | 31: iteration 109070/ 173500 | consumed samples: 27921920 | consumed tokens: 57184092160 | elapsed time per iteration (s): 0.78 | learning rate: 7.559E-05 | global batch size: 256 | lm loss: 1.972955E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.649 | TFLOPs: 19.94 | 31: iteration 109080/ 173500 | consumed samples: 27924480 | consumed tokens: 57189335040 | elapsed time per iteration (s): 0.75 | learning rate: 7.557E-05 | global batch size: 256 | lm loss: 1.950129E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.242 | TFLOPs: 20.52 | 31: iteration 109090/ 173500 | consumed samples: 27927040 | consumed tokens: 57194577920 | elapsed time per iteration (s): 0.77 | learning rate: 7.556E-05 | global batch size: 256 | lm loss: 1.983228E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.095 | TFLOPs: 20.03 | 31: iteration 109100/ 173500 | consumed samples: 27929600 | consumed tokens: 57199820800 | elapsed time per iteration (s): 0.78 | learning rate: 7.554E-05 | global batch size: 256 | lm loss: 1.985298E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.720 | TFLOPs: 19.95 | 31: iteration 109110/ 173500 | consumed samples: 27932160 | consumed tokens: 57205063680 | elapsed time per iteration (s): 0.73 | learning rate: 7.553E-05 | global batch size: 256 | lm loss: 1.927093E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.068 | TFLOPs: 21.36 | 31: 
iteration 109120/ 173500 | consumed samples: 27934720 | consumed tokens: 57210306560 | elapsed time per iteration (s): 0.77 | learning rate: 7.551E-05 | global batch size: 256 | lm loss: 1.964918E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.996 | TFLOPs: 20.21 | 31: iteration 109130/ 173500 | consumed samples: 27937280 | consumed tokens: 57215549440 | elapsed time per iteration (s): 0.73 | learning rate: 7.550E-05 | global batch size: 256 | lm loss: 1.945455E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.651 | TFLOPs: 21.27 | 31: iteration 109140/ 173500 | consumed samples: 27939840 | consumed tokens: 57220792320 | elapsed time per iteration (s): 0.71 | learning rate: 7.548E-05 | global batch size: 256 | lm loss: 1.982397E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.525 | TFLOPs: 21.69 | 31: iteration 109150/ 173500 | consumed samples: 27942400 | consumed tokens: 57226035200 | elapsed time per iteration (s): 0.77 | learning rate: 7.546E-05 | global batch size: 256 | lm loss: 1.938375E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.015 | TFLOPs: 20.09 | 31: iteration 109160/ 173500 | consumed samples: 27944960 | consumed tokens: 57231278080 | elapsed time per iteration (s): 0.77 | learning rate: 7.545E-05 | global batch size: 256 | lm loss: 1.977547E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.777 | TFLOPs: 20.13 | 31: iteration 109170/ 173500 | consumed samples: 27947520 | consumed tokens: 57236520960 | elapsed time per iteration (s): 0.75 | learning rate: 7.543E-05 | global batch size: 256 | lm loss: 1.969213E+00 | grad norm: 0.158 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.834 | TFLOPs: 20.68 | 31: iteration 109180/ 173500 | consumed samples: 27950080 | consumed tokens: 57241763840 | elapsed time per iteration (s): 0.79 | learning rate: 7.542E-05 | global batch size: 256 | lm loss: 1.941688E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.853 | TFLOPs: 19.71 | 31: iteration 109190/ 173500 | consumed samples: 27952640 | consumed tokens: 57247006720 | elapsed time per iteration (s): 0.80 | learning rate: 7.540E-05 | global batch size: 256 | lm loss: 1.967148E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.383 | TFLOPs: 19.44 | 31: iteration 109200/ 173500 | consumed samples: 27955200 | consumed tokens: 57252249600 | elapsed time per iteration (s): 0.78 | learning rate: 7.539E-05 | global batch size: 256 | lm loss: 1.943792E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.614 | TFLOPs: 19.76 | 31: iteration 109210/ 173500 | consumed samples: 27957760 | consumed tokens: 57257492480 | elapsed time per iteration (s): 0.84 | learning rate: 7.537E-05 | global batch size: 256 | lm loss: 1.963021E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.809 | TFLOPs: 18.44 | 31: iteration 109220/ 173500 | consumed samples: 27960320 | consumed tokens: 57262735360 | elapsed time per iteration (s): 0.78 | learning rate: 7.536E-05 | global batch size: 256 | lm loss: 1.979226E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.480 | TFLOPs: 19.93 | 31: iteration 109230/ 173500 | consumed samples: 27962880 | consumed tokens: 57267978240 | elapsed time per iteration (s): 0.78 | 
learning rate: 7.534E-05 | global batch size: 256 | lm loss: 1.973247E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.920 | TFLOPs: 19.90 | 31: iteration 109240/ 173500 | consumed samples: 27965440 | consumed tokens: 57273221120 | elapsed time per iteration (s): 0.79 | learning rate: 7.533E-05 | global batch size: 256 | lm loss: 1.962750E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.882 | TFLOPs: 19.65 | 31: iteration 109250/ 173500 | consumed samples: 27968000 | consumed tokens: 57278464000 | elapsed time per iteration (s): 0.81 | learning rate: 7.531E-05 | global batch size: 256 | lm loss: 1.980918E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.534 | TFLOPs: 19.03 | 31: iteration 109260/ 173500 | consumed samples: 27970560 | consumed tokens: 57283706880 | elapsed time per iteration (s): 0.74 | learning rate: 7.530E-05 | global batch size: 256 | lm loss: 1.961452E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.352 | TFLOPs: 21.01 | 31: iteration 109270/ 173500 | consumed samples: 27973120 | consumed tokens: 57288949760 | elapsed time per iteration (s): 0.72 | learning rate: 7.528E-05 | global batch size: 256 | lm loss: 1.978623E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.303 | TFLOPs: 21.56 | 31: iteration 109280/ 173500 | consumed samples: 27975680 | consumed tokens: 57294192640 | elapsed time per iteration (s): 0.79 | learning rate: 7.527E-05 | global batch size: 256 | lm loss: 1.970791E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.570 | TFLOPs: 19.51 | 31: iteration 
109290/ 173500 | consumed samples: 27978240 | consumed tokens: 57299435520 | elapsed time per iteration (s): 0.84 | learning rate: 7.525E-05 | global batch size: 256 | lm loss: 1.968569E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.802 | TFLOPs: 18.50 | 31: iteration 109300/ 173500 | consumed samples: 27980800 | consumed tokens: 57304678400 | elapsed time per iteration (s): 0.73 | learning rate: 7.524E-05 | global batch size: 256 | lm loss: 2.001634E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.967 | TFLOPs: 21.23 | 31: iteration 109310/ 173500 | consumed samples: 27983360 | consumed tokens: 57309921280 | elapsed time per iteration (s): 0.80 | learning rate: 7.522E-05 | global batch size: 256 | lm loss: 1.979660E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.125 | TFLOPs: 19.25 | 31: iteration 109320/ 173500 | consumed samples: 27985920 | consumed tokens: 57315164160 | elapsed time per iteration (s): 0.79 | learning rate: 7.521E-05 | global batch size: 256 | lm loss: 1.973629E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.534 | TFLOPs: 19.63 | 31: iteration 109330/ 173500 | consumed samples: 27988480 | consumed tokens: 57320407040 | elapsed time per iteration (s): 0.80 | learning rate: 7.519E-05 | global batch size: 256 | lm loss: 1.966521E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.622 | TFLOPs: 19.34 | 31: iteration 109340/ 173500 | consumed samples: 27991040 | consumed tokens: 57325649920 | elapsed time per iteration (s): 0.76 | learning rate: 7.518E-05 | global batch size: 256 | lm loss: 1.950410E+00 | grad norm: 0.156 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.310 | TFLOPs: 20.29 | 31: iteration 109350/ 173500 | consumed samples: 27993600 | consumed tokens: 57330892800 | elapsed time per iteration (s): 0.73 | learning rate: 7.516E-05 | global batch size: 256 | lm loss: 1.947026E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.485 | TFLOPs: 21.14 | 31: iteration 109360/ 173500 | consumed samples: 27996160 | consumed tokens: 57336135680 | elapsed time per iteration (s): 0.77 | learning rate: 7.515E-05 | global batch size: 256 | lm loss: 1.957242E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.109 | TFLOPs: 20.09 | 31: iteration 109370/ 173500 | consumed samples: 27998720 | consumed tokens: 57341378560 | elapsed time per iteration (s): 0.76 | learning rate: 7.513E-05 | global batch size: 256 | lm loss: 1.978634E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.069 | TFLOPs: 20.51 | 31: iteration 109380/ 173500 | consumed samples: 28001280 | consumed tokens: 57346621440 | elapsed time per iteration (s): 0.87 | learning rate: 7.512E-05 | global batch size: 256 | lm loss: 1.962511E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.913 | TFLOPs: 17.78 | 31: iteration 109390/ 173500 | consumed samples: 28003840 | consumed tokens: 57351864320 | elapsed time per iteration (s): 0.85 | learning rate: 7.510E-05 | global batch size: 256 | lm loss: 1.978980E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.221 | TFLOPs: 18.22 | 31: iteration 109400/ 173500 | consumed samples: 28006400 | consumed tokens: 57357107200 | elapsed time per iteration (s): 0.75 | learning 
rate: 7.509E-05 | global batch size: 256 | lm loss: 1.945170E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.000 | TFLOPs: 20.69 | 31: iteration 109410/ 173500 | consumed samples: 28008960 | consumed tokens: 57362350080 | elapsed time per iteration (s): 0.75 | learning rate: 7.507E-05 | global batch size: 256 | lm loss: 1.946426E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.524 | TFLOPs: 20.72 | 31: iteration 109420/ 173500 | consumed samples: 28011520 | consumed tokens: 57367592960 | elapsed time per iteration (s): 0.82 | learning rate: 7.505E-05 | global batch size: 256 | lm loss: 1.973396E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.428 | TFLOPs: 18.96 | 31: iteration 109430/ 173500 | consumed samples: 28014080 | consumed tokens: 57372835840 | elapsed time per iteration (s): 0.78 | learning rate: 7.504E-05 | global batch size: 256 | lm loss: 1.944933E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.591 | TFLOPs: 19.94 | 31: iteration 109440/ 173500 | consumed samples: 28016640 | consumed tokens: 57378078720 | elapsed time per iteration (s): 0.76 | learning rate: 7.502E-05 | global batch size: 256 | lm loss: 1.936710E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.704 | TFLOPs: 20.43 | 31: iteration 109450/ 173500 | consumed samples: 28019200 | consumed tokens: 57383321600 | elapsed time per iteration (s): 0.80 | learning rate: 7.501E-05 | global batch size: 256 | lm loss: 1.961448E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.844 | TFLOPs: 19.41 | 31: iteration 109460/ 
173500 | consumed samples: 28021760 | consumed tokens: 57388564480 | elapsed time per iteration (s): 0.79 | learning rate: 7.499E-05 | global batch size: 256 | lm loss: 1.963468E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.740 | TFLOPs: 19.71 | 31: iteration 109470/ 173500 | consumed samples: 28024320 | consumed tokens: 57393807360 | elapsed time per iteration (s): 0.74 | learning rate: 7.498E-05 | global batch size: 256 | lm loss: 1.987983E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.367 | TFLOPs: 20.95 | 31: iteration 109480/ 173500 | consumed samples: 28026880 | consumed tokens: 57399050240 | elapsed time per iteration (s): 0.78 | learning rate: 7.496E-05 | global batch size: 256 | lm loss: 1.960076E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.770 | TFLOPs: 19.95 | 31: iteration 109490/ 173500 | consumed samples: 28029440 | consumed tokens: 57404293120 | elapsed time per iteration (s): 0.75 | learning rate: 7.495E-05 | global batch size: 256 | lm loss: 1.939137E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.586 | TFLOPs: 20.67 | 31: iteration 109500/ 173500 | consumed samples: 28032000 | consumed tokens: 57409536000 | elapsed time per iteration (s): 0.76 | learning rate: 7.493E-05 | global batch size: 256 | lm loss: 1.965443E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.763 | TFLOPs: 20.43 | 31: iteration 109510/ 173500 | consumed samples: 28034560 | consumed tokens: 57414778880 | elapsed time per iteration (s): 0.75 | learning rate: 7.492E-05 | global batch size: 256 | lm loss: 1.963291E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 339.461 | TFLOPs: 20.54 | 31: iteration 109520/ 173500 | consumed samples: 28037120 | consumed tokens: 57420021760 | elapsed time per iteration (s): 0.76 | learning rate: 7.490E-05 | global batch size: 256 | lm loss: 1.981946E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.731 | TFLOPs: 20.43 | 31: iteration 109530/ 173500 | consumed samples: 28039680 | consumed tokens: 57425264640 | elapsed time per iteration (s): 0.76 | learning rate: 7.489E-05 | global batch size: 256 | lm loss: 1.962223E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.271 | TFLOPs: 20.46 | 31: iteration 109540/ 173500 | consumed samples: 28042240 | consumed tokens: 57430507520 | elapsed time per iteration (s): 0.75 | learning rate: 7.487E-05 | global batch size: 256 | lm loss: 1.968530E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.948 | TFLOPs: 20.69 | 31: iteration 109550/ 173500 | consumed samples: 28044800 | consumed tokens: 57435750400 | elapsed time per iteration (s): 0.77 | learning rate: 7.486E-05 | global batch size: 256 | lm loss: 1.952592E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.287 | TFLOPs: 20.04 | 31: iteration 109560/ 173500 | consumed samples: 28047360 | consumed tokens: 57440993280 | elapsed time per iteration (s): 0.79 | learning rate: 7.484E-05 | global batch size: 256 | lm loss: 1.981824E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.321 | TFLOPs: 19.56 | 31: iteration 109570/ 173500 | consumed samples: 28049920 | consumed tokens: 57446236160 | elapsed time per iteration (s): 0.77 | learning rate: 
7.483E-05 | global batch size: 256 | lm loss: 1.963821E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.536 | TFLOPs: 20.12 | 31: iteration 109580/ 173500 | consumed samples: 28052480 | consumed tokens: 57451479040 | elapsed time per iteration (s): 0.74 | learning rate: 7.481E-05 | global batch size: 256 | lm loss: 1.983943E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.249 | TFLOPs: 20.95 | 31: iteration 109590/ 173500 | consumed samples: 28055040 | consumed tokens: 57456721920 | elapsed time per iteration (s): 0.82 | learning rate: 7.480E-05 | global batch size: 256 | lm loss: 1.960411E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.987 | TFLOPs: 18.87 | 31: iteration 109600/ 173500 | consumed samples: 28057600 | consumed tokens: 57461964800 | elapsed time per iteration (s): 0.77 | learning rate: 7.478E-05 | global batch size: 256 | lm loss: 1.962864E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.299 | TFLOPs: 20.22 | 31: iteration 109610/ 173500 | consumed samples: 28060160 | consumed tokens: 57467207680 | elapsed time per iteration (s): 0.81 | learning rate: 7.477E-05 | global batch size: 256 | lm loss: 1.953641E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.757 | TFLOPs: 19.04 | 31: iteration 109620/ 173500 | consumed samples: 28062720 | consumed tokens: 57472450560 | elapsed time per iteration (s): 0.81 | learning rate: 7.475E-05 | global batch size: 256 | lm loss: 1.974023E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.706 | TFLOPs: 19.10 | 31: iteration 109630/ 173500 | 
consumed samples: 28065280 | consumed tokens: 57477693440 | elapsed time per iteration (s): 0.80 | learning rate: 7.474E-05 | global batch size: 256 | lm loss: 1.961425E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.443 | TFLOPs: 19.33 | 31: iteration 109640/ 173500 | consumed samples: 28067840 | consumed tokens: 57482936320 | elapsed time per iteration (s): 0.79 | learning rate: 7.472E-05 | global batch size: 256 | lm loss: 1.968761E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.335 | TFLOPs: 19.56 | 31: iteration 109650/ 173500 | consumed samples: 28070400 | consumed tokens: 57488179200 | elapsed time per iteration (s): 0.77 | learning rate: 7.471E-05 | global batch size: 256 | lm loss: 1.975185E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.399 | TFLOPs: 20.11 | 31: iteration 109660/ 173500 | consumed samples: 28072960 | consumed tokens: 57493422080 | elapsed time per iteration (s): 0.80 | learning rate: 7.469E-05 | global batch size: 256 | lm loss: 1.982704E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.359 | TFLOPs: 19.44 | 31: iteration 109670/ 173500 | consumed samples: 28075520 | consumed tokens: 57498664960 | elapsed time per iteration (s): 0.76 | learning rate: 7.468E-05 | global batch size: 256 | lm loss: 1.945362E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.422 | TFLOPs: 20.47 | 31: iteration 109680/ 173500 | consumed samples: 28078080 | consumed tokens: 57503907840 | elapsed time per iteration (s): 0.80 | learning rate: 7.466E-05 | global batch size: 256 | lm loss: 1.953752E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 320.798 | TFLOPs: 19.41 | 31: iteration 109690/ 173500 | consumed samples: 28080640 | consumed tokens: 57509150720 | elapsed time per iteration (s): 0.78 | learning rate: 7.465E-05 | global batch size: 256 | lm loss: 1.988437E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.773 | TFLOPs: 19.95 | 31: iteration 109700/ 173500 | consumed samples: 28083200 | consumed tokens: 57514393600 | elapsed time per iteration (s): 0.82 | learning rate: 7.463E-05 | global batch size: 256 | lm loss: 1.968684E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.749 | TFLOPs: 18.98 | 31: iteration 109710/ 173500 | consumed samples: 28085760 | consumed tokens: 57519636480 | elapsed time per iteration (s): 0.80 | learning rate: 7.462E-05 | global batch size: 256 | lm loss: 1.976545E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.098 | TFLOPs: 19.30 | 31: iteration 109720/ 173500 | consumed samples: 28088320 | consumed tokens: 57524879360 | elapsed time per iteration (s): 0.76 | learning rate: 7.460E-05 | global batch size: 256 | lm loss: 1.983130E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.200 | TFLOPs: 20.40 | 31: iteration 109730/ 173500 | consumed samples: 28090880 | consumed tokens: 57530122240 | elapsed time per iteration (s): 0.79 | learning rate: 7.459E-05 | global batch size: 256 | lm loss: 1.943945E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.921 | TFLOPs: 19.54 | 31: iteration 109740/ 173500 | consumed samples: 28093440 | consumed tokens: 57535365120 | elapsed time per iteration (s): 0.75 | learning rate: 
7.457E-05 | global batch size: 256 | lm loss: 1.968154E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.435 | TFLOPs: 20.53 | 31: iteration 109750/ 173500 | consumed samples: 28096000 | consumed tokens: 57540608000 | elapsed time per iteration (s): 0.79 | learning rate: 7.455E-05 | global batch size: 256 | lm loss: 1.974843E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.409 | TFLOPs: 19.50 | 31: iteration 109760/ 173500 | consumed samples: 28098560 | consumed tokens: 57545850880 | elapsed time per iteration (s): 0.75 | learning rate: 7.454E-05 | global batch size: 256 | lm loss: 1.949805E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.478 | TFLOPs: 20.78 | 31: iteration 109770/ 173500 | consumed samples: 28101120 | consumed tokens: 57551093760 | elapsed time per iteration (s): 0.76 | learning rate: 7.452E-05 | global batch size: 256 | lm loss: 1.933786E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.474 | TFLOPs: 20.42 | 31: iteration 109780/ 173500 | consumed samples: 28103680 | consumed tokens: 57556336640 | elapsed time per iteration (s): 0.75 | learning rate: 7.451E-05 | global batch size: 256 | lm loss: 1.932478E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.262 | TFLOPs: 20.71 | 31: iteration 109790/ 173500 | consumed samples: 28106240 | consumed tokens: 57561579520 | elapsed time per iteration (s): 0.75 | learning rate: 7.449E-05 | global batch size: 256 | lm loss: 1.979342E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.175 | TFLOPs: 20.76 | 31: iteration 109800/ 173500 | 
consumed samples: 28108800 | consumed tokens: 57566822400 | elapsed time per iteration (s): 0.76 | learning rate: 7.448E-05 | global batch size: 256 | lm loss: 1.985996E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.355 | TFLOPs: 20.41 | 31: iteration 109810/ 173500 | consumed samples: 28111360 | consumed tokens: 57572065280 | elapsed time per iteration (s): 0.74 | learning rate: 7.446E-05 | global batch size: 256 | lm loss: 1.957211E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.094 | TFLOPs: 20.82 | 31: iteration 109820/ 173500 | consumed samples: 28113920 | consumed tokens: 57577308160 | elapsed time per iteration (s): 0.90 | learning rate: 7.445E-05 | global batch size: 256 | lm loss: 1.976728E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.878 | TFLOPs: 17.23 | 31: iteration 109830/ 173500 | consumed samples: 28116480 | consumed tokens: 57582551040 | elapsed time per iteration (s): 0.78 | learning rate: 7.443E-05 | global batch size: 256 | lm loss: 1.951615E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.877 | TFLOPs: 19.90 | 31: iteration 109840/ 173500 | consumed samples: 28119040 | consumed tokens: 57587793920 | elapsed time per iteration (s): 0.88 | learning rate: 7.442E-05 | global batch size: 256 | lm loss: 1.955605E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.366 | TFLOPs: 17.69 | 31: iteration 109850/ 173500 | consumed samples: 28121600 | consumed tokens: 57593036800 | elapsed time per iteration (s): 0.83 | learning rate: 7.440E-05 | global batch size: 256 | lm loss: 1.940062E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 307.701 | TFLOPs: 18.62 | 31: iteration 109860/ 173500 | consumed samples: 28124160 | consumed tokens: 57598279680 | elapsed time per iteration (s): 0.83 | learning rate: 7.439E-05 | global batch size: 256 | lm loss: 1.985876E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.293 | TFLOPs: 18.65 | 31: iteration 109870/ 173500 | consumed samples: 28126720 | consumed tokens: 57603522560 | elapsed time per iteration (s): 0.85 | learning rate: 7.437E-05 | global batch size: 256 | lm loss: 1.941060E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.505 | TFLOPs: 18.30 | 31: iteration 109880/ 173500 | consumed samples: 28129280 | consumed tokens: 57608765440 | elapsed time per iteration (s): 0.88 | learning rate: 7.436E-05 | global batch size: 256 | lm loss: 1.946309E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.489 | TFLOPs: 17.69 | 31: iteration 109890/ 173500 | consumed samples: 28131840 | consumed tokens: 57614008320 | elapsed time per iteration (s): 0.79 | learning rate: 7.434E-05 | global batch size: 256 | lm loss: 1.955108E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.335 | TFLOPs: 19.56 | 31: iteration 109900/ 173500 | consumed samples: 28134400 | consumed tokens: 57619251200 | elapsed time per iteration (s): 0.82 | learning rate: 7.433E-05 | global batch size: 256 | lm loss: 1.937621E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.665 | TFLOPs: 18.79 | 31: iteration 109910/ 173500 | consumed samples: 28136960 | consumed tokens: 57624494080 | elapsed time per iteration (s): 0.80 | learning rate: 
7.431E-05 | global batch size: 256 | lm loss: 1.986678E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.859 | TFLOPs: 19.29 | 31: iteration 109920/ 173500 | consumed samples: 28139520 | consumed tokens: 57629736960 | elapsed time per iteration (s): 0.79 | learning rate: 7.430E-05 | global batch size: 256 | lm loss: 1.961681E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.744 | TFLOPs: 19.59 | 31: iteration 109930/ 173500 | consumed samples: 28142080 | consumed tokens: 57634979840 | elapsed time per iteration (s): 0.81 | learning rate: 7.428E-05 | global batch size: 256 | lm loss: 1.962987E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.071 | TFLOPs: 19.18 | 31: iteration 109940/ 173500 | consumed samples: 28144640 | consumed tokens: 57640222720 | elapsed time per iteration (s): 0.84 | learning rate: 7.427E-05 | global batch size: 256 | lm loss: 1.965100E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.064 | TFLOPs: 18.52 | 31: iteration 109950/ 173500 | consumed samples: 28147200 | consumed tokens: 57645465600 | elapsed time per iteration (s): 0.80 | learning rate: 7.425E-05 | global batch size: 256 | lm loss: 1.963268E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.600 | TFLOPs: 19.40 | 31: iteration 109960/ 173500 | consumed samples: 28149760 | consumed tokens: 57650708480 | elapsed time per iteration (s): 0.78 | learning rate: 7.424E-05 | global batch size: 256 | lm loss: 1.959428E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.939 | TFLOPs: 19.96 | 31: iteration 109970/ 173500 | 
consumed samples: 28152320 | consumed tokens: 57655951360 | elapsed time per iteration (s): 0.73 | learning rate: 7.422E-05 | global batch size: 256 | lm loss: 1.959943E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.383 | TFLOPs: 21.26 | 31: iteration 109980/ 173500 | consumed samples: 28154880 | consumed tokens: 57661194240 | elapsed time per iteration (s): 0.76 | learning rate: 7.421E-05 | global batch size: 256 | lm loss: 1.960618E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.584 | TFLOPs: 20.42 | 31: iteration 109990/ 173500 | consumed samples: 28157440 | consumed tokens: 57666437120 | elapsed time per iteration (s): 0.76 | learning rate: 7.419E-05 | global batch size: 256 | lm loss: 1.939281E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.275 | TFLOPs: 20.40 | 0: [2022-11-26 18:54:57,823] [INFO] [logging.py:68:log_dist] [Rank 0] step=110000, skipped=0, lr=[7.417709678812063e-05, 7.417709678812063e-05, 7.417709678812063e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 110000/ 173500 | consumed samples: 28160000 | consumed tokens: 57671680000 | elapsed time per iteration (s): 0.76 | learning rate: 7.418E-05 | global batch size: 256 | lm loss: 1.957934E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.852 | TFLOPs: 20.50 | 0: steps: 110000 loss: 1.9095 iter time (s): 0.784 samples/sec: 326.679 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 110000 | lm loss value: 2.041759E+00 | lm loss PPL: 7.704153E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 110000 to 
checkpoints_1b1long 0: [2022-11-26 18:54:58,142] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step110000 is begin to save! 0: [2022-11-26 18:54:58,152] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_01-model_00-model_states.pt... 0: [2022-11-26 18:54:58,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_01-model_00-model_states.pt. 0: [2022-11-26 18:54:58,369] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_03-model_00-model_states.pt... 0: [2022-11-26 18:54:58,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_03-model_00-model_states.pt. 0: [2022-11-26 18:54:58,452] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_04-model_00-model_states.pt... 0: [2022-11-26 18:54:58,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_04-model_00-model_states.pt. 0: [2022-11-26 18:54:58,529] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_05-model_00-model_states.pt... 0: [2022-11-26 18:54:58,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_05-model_00-model_states.pt. 0: [2022-11-26 18:54:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_06-model_00-model_states.pt... 0: [2022-11-26 18:54:58,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_06-model_00-model_states.pt. 0: [2022-11-26 18:54:58,689] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_07-model_00-model_states.pt... 
0: [2022-11-26 18:54:58,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_07-model_00-model_states.pt. 0: [2022-11-26 18:54:58,765] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_08-model_00-model_states.pt... 0: [2022-11-26 18:54:58,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_08-model_00-model_states.pt. 0: [2022-11-26 18:54:58,841] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_09-model_00-model_states.pt... 0: [2022-11-26 18:54:58,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_09-model_00-model_states.pt. 0: [2022-11-26 18:54:58,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_10-model_00-model_states.pt... 0: [2022-11-26 18:54:58,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_10-model_00-model_states.pt. 0: [2022-11-26 18:54:58,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_11-model_00-model_states.pt... 0: [2022-11-26 18:54:59,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_11-model_00-model_states.pt. 0: [2022-11-26 18:54:59,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_12-model_00-model_states.pt... 0: [2022-11-26 18:54:59,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_12-model_00-model_states.pt. 0: [2022-11-26 18:54:59,142] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_13-model_00-model_states.pt... 
0: [2022-11-26 18:54:59,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_13-model_00-model_states.pt. 0: [2022-11-26 18:54:59,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_14-model_00-model_states.pt... 0: [2022-11-26 18:54:59,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_14-model_00-model_states.pt. 0: [2022-11-26 18:54:59,290] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_15-model_00-model_states.pt... 0: [2022-11-26 18:54:59,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_15-model_00-model_states.pt. 0: [2022-11-26 18:54:59,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_16-model_00-model_states.pt... 0: [2022-11-26 18:54:59,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_16-model_00-model_states.pt. 0: [2022-11-26 18:54:59,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_17-model_00-model_states.pt... 0: [2022-11-26 18:54:59,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_17-model_00-model_states.pt. 0: [2022-11-26 18:54:59,516] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_18-model_00-model_states.pt... 0: [2022-11-26 18:54:59,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_18-model_00-model_states.pt. 0: [2022-11-26 18:54:59,589] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_19-model_00-model_states.pt... 
0: [2022-11-26 18:54:59,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_19-model_00-model_states.pt. 0: [2022-11-26 18:54:59,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_20-model_00-model_states.pt... 0: [2022-11-26 18:54:59,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_20-model_00-model_states.pt. 0: [2022-11-26 18:54:59,738] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_21-model_00-model_states.pt... 0: [2022-11-26 18:54:59,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_21-model_00-model_states.pt. 0: [2022-11-26 18:54:59,811] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_22-model_00-model_states.pt... 0: [2022-11-26 18:54:59,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_22-model_00-model_states.pt. 0: [2022-11-26 18:54:59,888] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_23-model_00-model_states.pt... 0: [2022-11-26 18:54:59,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_23-model_00-model_states.pt. 0: [2022-11-26 18:54:59,961] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_24-model_00-model_states.pt... 0: [2022-11-26 18:55:00,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_24-model_00-model_states.pt. 0: [2022-11-26 18:55:00,037] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_25-model_00-model_states.pt... 
0: [2022-11-26 18:55:00,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_25-model_00-model_states.pt. 0: [2022-11-26 18:55:00,109] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_26-model_00-model_states.pt... 0: [2022-11-26 18:55:00,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_26-model_00-model_states.pt. 0: [2022-11-26 18:55:00,186] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_27-model_00-model_states.pt... 0: [2022-11-26 18:55:00,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_27-model_00-model_states.pt. 0: [2022-11-26 18:55:00,262] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_28-model_00-model_states.pt... 0: [2022-11-26 18:55:00,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_28-model_00-model_states.pt. 0: [2022-11-26 18:55:00,333] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/layer_30-model_00-model_states.pt... 0: [2022-11-26 18:55:00,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/layer_30-model_00-model_states.pt. 0: [2022-11-26 18:55:00,338] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step110000/mp_rank_00_model_states.pt 0: [2022-11-26 18:55:00,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/mp_rank_00_model_states.pt... 0: [2022-11-26 18:55:00,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/mp_rank_00_model_states.pt. 
0: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
4: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
12: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 
23: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 
29: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 
26: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
5: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
10: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
13: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
20: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 
28: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 
30: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 
19: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
4: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 
16: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 
25: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 
14: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 
21: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 
8: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
12: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
31: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 
27: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 28: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 
24: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 27: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 20: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 30: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 
0: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:55:00,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 17: [2022-11-26 18:55:00,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:55:00,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:55:00,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 18:55:00,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 0: [2022-11-26 18:55:00,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:55:00,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 18:55:00,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:55:00,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 0: [2022-11-26 18:55:00,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 18:55:00,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
0: [2022-11-26 18:55:00,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:55:00,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 18:55:00,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 13: [2022-11-26 18:55:00,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:55:00,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:55:00,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:55:00,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 18:55:00,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 18:55:00,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 21: [2022-11-26 18:55:00,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:55:00,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
13: [2022-11-26 18:55:00,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 13: [2022-11-26 18:55:00,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 21: [2022-11-26 18:55:00,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 18:55:00,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 11: [2022-11-26 18:55:00,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:55:00,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:55:00,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 18:55:00,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 18:55:00,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 11: [2022-11-26 18:55:00,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 5: [2022-11-26 18:55:00,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-26 18:55:00,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 18:55:00,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 14: [2022-11-26 18:55:00,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:55:00,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 18:55:00,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 4: [2022-11-26 18:55:00,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:55:00,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 18:55:00,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 4: [2022-11-26 18:55:00,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:55:00,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 18:55:00,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 29: [2022-11-26 18:55:00,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-26 18:55:00,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 18:55:00,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 29: [2022-11-26 18:55:00,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:55:00,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:55:00,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 8: [2022-11-26 18:55:00,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 20: [2022-11-26 18:55:00,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:55:00,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 8: [2022-11-26 18:55:00,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 20: [2022-11-26 18:55:00,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 18:55:00,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 28: [2022-11-26 18:55:00,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
1: [2022-11-26 18:55:00,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:55:00,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:55:00,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 27: [2022-11-26 18:55:00,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 1: [2022-11-26 18:55:00,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 27: [2022-11-26 18:55:00,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 17: [2022-11-26 18:55:00,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:55:00,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:55:00,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 27: [2022-11-26 18:55:00,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 17: [2022-11-26 18:55:00,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
27: [2022-11-26 18:55:00,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 12: [2022-11-26 18:55:00,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:55:00,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:55:00,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:55:00,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 18:55:00,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 12: [2022-11-26 18:55:00,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 18:55:00,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 17: [2022-11-26 18:55:00,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 12: [2022-11-26 18:55:00,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 18:55:00,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 12: [2022-11-26 18:55:00,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
12: [2022-11-26 18:55:00,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 9: [2022-11-26 18:55:00,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:55:00,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:55:00,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:55:00,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 18:55:00,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 9: [2022-11-26 18:55:00,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 18:55:00,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 29: [2022-11-26 18:55:00,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 29: [2022-11-26 18:55:00,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 9: [2022-11-26 18:55:00,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:55:00,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
17: [2022-11-26 18:55:00,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:55:00,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 1: [2022-11-26 18:55:00,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 17: [2022-11-26 18:55:00,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 9: [2022-11-26 18:55:00,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 1: [2022-11-26 18:55:00,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 17: [2022-11-26 18:55:00,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 20: [2022-11-26 18:55:00,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:55:00,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 18:55:00,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 20: [2022-11-26 18:55:00,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 
20: [2022-11-26 18:55:00,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 18:55:00,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 16: [2022-11-26 18:55:00,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:55:00,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:55:00,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 16: [2022-11-26 18:55:00,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 18:55:00,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 12: [2022-11-26 18:55:00,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 1: [2022-11-26 18:55:00,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:55:00,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:55:00,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
1: [2022-11-26 18:55:00,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 23: [2022-11-26 18:55:00,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:55:00,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:55:00,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 23: [2022-11-26 18:55:00,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 18:55:00,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 18:55:00,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 23: [2022-11-26 18:55:00,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 18:55:00,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 23: [2022-11-26 18:55:00,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 26: [2022-11-26 18:55:00,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
23: [2022-11-26 18:55:00,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 26: [2022-11-26 18:55:00,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 23: [2022-11-26 18:55:00,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 26: [2022-11-26 18:55:00,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 29: [2022-11-26 18:55:00,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:55:00,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 18:55:00,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 16: [2022-11-26 18:55:00,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:55:00,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:55:00,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 18:55:00,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 13: [2022-11-26 18:55:00,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 18:55:00,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 18:55:00,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 18:55:00,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 13: [2022-11-26 18:55:00,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 1: [2022-11-26 18:55:00,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:55:00,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 18:55:00,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 17: [2022-11-26 18:55:00,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:55:00,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 18:55:00,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 0: [2022-11-26 18:55:00,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:55:00,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 18:55:00,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 18:55:00,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 14: [2022-11-26 18:55:00,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:55:00,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:55:00,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:55:00,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:55:00,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:55:00,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:55:00,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 8: [2022-11-26 18:55:00,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:55:00,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
21: [2022-11-26 18:55:00,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 8: [2022-11-26 18:55:00,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:55:00,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 14: [2022-11-26 18:55:00,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 21: [2022-11-26 18:55:00,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 5: [2022-11-26 18:55:00,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:55:00,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 14: [2022-11-26 18:55:00,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 21: [2022-11-26 18:55:00,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 5: [2022-11-26 18:55:00,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 8: [2022-11-26 18:55:00,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
8: [2022-11-26 18:55:00,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 14: [2022-11-26 18:55:00,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 18:55:00,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 18:55:00,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 5: [2022-11-26 18:55:00,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 8: [2022-11-26 18:55:00,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 14: [2022-11-26 18:55:00,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 8: [2022-11-26 18:55:00,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 14: [2022-11-26 18:55:00,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 14: [2022-11-26 18:55:00,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 11: [2022-11-26 18:55:00,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:55:00,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:55:00,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 
11: [2022-11-26 18:55:00,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 18:55:00,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 31: [2022-11-26 18:55:00,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 11: [2022-11-26 18:55:00,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:55:00,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 11: [2022-11-26 18:55:00,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 31: [2022-11-26 18:55:00,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 11: [2022-11-26 18:55:00,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 18:55:00,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 31: [2022-11-26 18:55:00,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:55:00,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 18:55:00,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
22: [2022-11-26 18:55:00,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:55:00,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:55:00,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:55:00,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:55:00,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 18:55:00,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 18:55:00,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 19: [2022-11-26 18:55:00,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 9: [2022-11-26 18:55:00,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:55:00,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 22: [2022-11-26 18:55:00,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
22: [2022-11-26 18:55:00,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 19: [2022-11-26 18:55:00,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 9: [2022-11-26 18:55:00,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 22: [2022-11-26 18:55:00,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:55:00,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 22: [2022-11-26 18:55:00,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:55:00,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 19: [2022-11-26 18:55:00,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:55:00,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:55:00,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
22: [2022-11-26 18:55:00,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 19: [2022-11-26 18:55:00,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 7: [2022-11-26 18:55:00,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:55:00,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:55:00,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:55:00,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:55:00,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:55:00,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 22: [2022-11-26 18:55:00,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 19: [2022-11-26 18:55:00,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
7: [2022-11-26 18:55:00,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 18:55:00,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 18:55:00,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 18:55:00,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 18:55:00,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 9: [2022-11-26 18:55:00,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 7: [2022-11-26 18:55:00,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 7: [2022-11-26 18:55:00,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 7: [2022-11-26 18:55:00,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 7: [2022-11-26 18:55:00,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 7: [2022-11-26 18:55:00,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 23: [2022-11-26 18:55:00,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
23: [2022-11-26 18:55:00,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 18:55:00,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 17: [2022-11-26 18:55:00,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:55:00,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 18:55:00,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 20: [2022-11-26 18:55:00,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:55:00,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 18:55:00,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 4: [2022-11-26 18:55:00,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:55:00,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 13: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
20: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:55:00,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 20: [2022-11-26 18:55:00,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 16: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:55:00,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 2: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
24: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:55:00,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 3: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:55:00,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 30: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 
24: [2022-11-26 18:55:00,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 18:55:00,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 18:55:00,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 18:55:00,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 30: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:55:00,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 27: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
24: [2022-11-26 18:55:00,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 29: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 30: [2022-11-26 18:55:00,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 18:55:00,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 18:55:00,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 18:55:00,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 16: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 24: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 29: [2022-11-26 18:55:00,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 30: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
30: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 30: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 30: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 16: [2022-11-26 18:55:00,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 3: [2022-11-26 18:55:00,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 24: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 29: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 27: [2022-11-26 18:55:00,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 16: [2022-11-26 18:55:00,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 3: [2022-11-26 18:55:00,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 24: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 24: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 27: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
27: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:55:00,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 16: [2022-11-26 18:55:00,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 27: [2022-11-26 18:55:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:55:00,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 18:55:00,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 18:55:00,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 27: [2022-11-26 18:55:00,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 9: [2022-11-26 18:55:00,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:55:00,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 18:55:00,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 9: [2022-11-26 18:55:00,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 18:55:00,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 3: [2022-11-26 18:55:00,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 0: [2022-11-26 18:55:00,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:55:00,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:55:00,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 0: [2022-11-26 18:55:00,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 12: [2022-11-26 18:55:00,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 4: [2022-11-26 18:55:00,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:55:00,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
4: [2022-11-26 18:55:00,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 18:55:00,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 4: [2022-11-26 18:55:00,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:55:00,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 18:55:00,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:55:00,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:55:00,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 2: [2022-11-26 18:55:00,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:55:00,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 18:55:00,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:55:00,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
2: [2022-11-26 18:55:00,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 15: [2022-11-26 18:55:00,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:55:00,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 2: [2022-11-26 18:55:00,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 18:55:00,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 2: [2022-11-26 18:55:00,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 15: [2022-11-26 18:55:00,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:55:00,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 15: [2022-11-26 18:55:00,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 18:55:00,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 18:55:00,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 15: [2022-11-26 18:55:00,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
31: [2022-11-26 18:55:00,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:55:00,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 18:55:00,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 26: [2022-11-26 18:55:00,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:55:00,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 18:55:00,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 26: [2022-11-26 18:55:00,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:55:00,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 18:55:00,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 27: [2022-11-26 18:55:00,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:55:00,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
19: [2022-11-26 18:55:00,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 27: [2022-11-26 18:55:00,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 19: [2022-11-26 18:55:00,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:55:00,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:55:00,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 27: [2022-11-26 18:55:00,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 19: [2022-11-26 18:55:00,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 18:55:00,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 18:55:00,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 19: [2022-11-26 18:55:00,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 1: [2022-11-26 18:55:00,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 18:55:00,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 17: [2022-11-26 18:55:00,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:55:00,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 17: [2022-11-26 18:55:00,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 18:55:00,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 18: [2022-11-26 18:55:00,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:55:00,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:55:00,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:55:00,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 18: [2022-11-26 18:55:00,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:55:00,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
5: [2022-11-26 18:55:00,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:55:00,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 18: [2022-11-26 18:55:00,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 18:55:00,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 5: [2022-11-26 18:55:00,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 18: [2022-11-26 18:55:00,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 18: [2022-11-26 18:55:00,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 18:55:00,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 18: [2022-11-26 18:55:00,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 18:55:00,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 18: [2022-11-26 18:55:00,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 3: [2022-11-26 18:55:00,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 18:55:00,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 18:55:00,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 5: [2022-11-26 18:55:00,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 3: [2022-11-26 18:55:00,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:55:00,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 18:55:00,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 2: [2022-11-26 18:55:00,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:55:00,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:55:00,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 18:55:00,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 2: [2022-11-26 18:55:00,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 18:55:00,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
8: [2022-11-26 18:55:00,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:55:00,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 18:55:00,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 10: [2022-11-26 18:55:00,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:55:00,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:55:00,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:55:00,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 18:55:00,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 18:55:00,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 18:55:00,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 10: [2022-11-26 18:55:00,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 10: [2022-11-26 18:55:00,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
5: [2022-11-26 18:55:00,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:55:00,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 18:55:00,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 5: [2022-11-26 18:55:00,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:55:00,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 18:55:00,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 24: [2022-11-26 18:55:00,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:55:00,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 18:55:00,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 15: [2022-11-26 18:55:00,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:55:00,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-26 18:55:00,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 18:55:00,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 18:55:00,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 15: [2022-11-26 18:55:00,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 21: [2022-11-26 18:55:00,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:55:00,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:55:00,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 18:55:00,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 15: [2022-11-26 18:55:00,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 18:55:00,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 18: [2022-11-26 18:55:00,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-26 18:55:00,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 18:55:00,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 22: [2022-11-26 18:55:00,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:55:00,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 18:55:00,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 21: [2022-11-26 18:55:00,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:55:00,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 18:55:00,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 28: [2022-11-26 18:55:00,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 18:55:00,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 28: [2022-11-26 18:55:00,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
28: [2022-11-26 18:55:00,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 18:55:00,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 28: [2022-11-26 18:55:00,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:55:00,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 18:55:00,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 28: [2022-11-26 18:55:00,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:55:00,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 18:55:00,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 20: [2022-11-26 18:55:00,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:55:00,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:55:00,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 18:55:00,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
20: [2022-11-26 18:55:00,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 18:55:00,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 9: [2022-11-26 18:55:00,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:55:00,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 18:55:00,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 1: [2022-11-26 18:55:00,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:55:00,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:55:00,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 1: [2022-11-26 18:55:00,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 30: [2022-11-26 18:55:00,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 1: [2022-11-26 18:55:00,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 26: [2022-11-26 18:55:00,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-26 18:55:00,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:55:00,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 18:55:00,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 7: [2022-11-26 18:55:00,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:55:00,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 7: [2022-11-26 18:55:00,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 26: [2022-11-26 18:55:00,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 7: [2022-11-26 18:55:00,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 31: [2022-11-26 18:55:00,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:55:00,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-26 18:55:00,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 18:55:00,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 18:55:00,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 31: [2022-11-26 18:55:00,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 5: [2022-11-26 18:55:00,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:55:00,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 18:55:00,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 0: [2022-11-26 18:55:00,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 18:55:00,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 10: [2022-11-26 18:55:00,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:55:00,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 18:55:00,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
10: [2022-11-26 18:55:00,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:55:00,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 18:55:00,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 6: [2022-11-26 18:55:00,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:55:00,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:55:00,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:55:00,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:55:00,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:55:00,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 18:55:00,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 18:55:00,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 18:55:00,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 18:55:00,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 18:55:00,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 18:55:00,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 6: [2022-11-26 18:55:00,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 18:55:00,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 6: [2022-11-26 18:55:00,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 6: [2022-11-26 18:55:00,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 6: [2022-11-26 18:55:00,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 6: [2022-11-26 18:55:00,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
11: [2022-11-26 18:55:00,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:55:00,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 18:55:00,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 14: [2022-11-26 18:55:00,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:55:00,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 18:55:00,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 12: [2022-11-26 18:55:00,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:55:00,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 18:55:00,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 28: [2022-11-26 18:55:00,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:55:00,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 18:55:00,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
23: [2022-11-26 18:55:00,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:55:00,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 18:55:00,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 19: [2022-11-26 18:55:00,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:55:00,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 18:55:00,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 25: [2022-11-26 18:55:00,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:55:00,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:55:00,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:55:00,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:55:00,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
25: [2022-11-26 18:55:00,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 18:55:00,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 18:55:00,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:55:00,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 25: [2022-11-26 18:55:00,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 25: [2022-11-26 18:55:00,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 18:55:00,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 18:55:00,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 18:55:00,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 18:55:00,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 25: [2022-11-26 18:55:00,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 25: [2022-11-26 18:55:00,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
25: [2022-11-26 18:55:00,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 31: [2022-11-26 18:55:00,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:55:00,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 18:55:00,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 21: [2022-11-26 18:55:00,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:55:00,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 18:55:00,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 3: [2022-11-26 18:55:00,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:55:00,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 18:55:00,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 0: [2022-11-26 18:55:00,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 18:55:00,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 18:55:00,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 16: [2022-11-26 18:55:00,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:55:00,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 18:55:00,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 26: [2022-11-26 18:55:00,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:55:00,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 18:55:00,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 10: [2022-11-26 18:55:00,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:55:00,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 18:55:00,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 5: [2022-11-26 18:55:00,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-26 18:55:00,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 18:55:00,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 20: [2022-11-26 18:55:00,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 18:55:00,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 18:55:00,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 6: [2022-11-26 18:55:00,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:55:00,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 18:55:00,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 13: [2022-11-26 18:55:00,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:55:00,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 8: [2022-11-26 18:55:00,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:55:00,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
8: [2022-11-26 18:55:00,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 18:55:00,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 25: [2022-11-26 18:55:00,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:55:00,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 18:55:00,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 9: [2022-11-26 18:55:00,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:55:00,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 18:55:00,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 17: [2022-11-26 18:55:00,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 18:55:00,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 18:55:00,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 4: [2022-11-26 18:55:00,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-26 18:55:00,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 18:55:00,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 29: [2022-11-26 18:55:00,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:55:00,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 18:55:00,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 24: [2022-11-26 18:55:00,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 18:55:00,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 18:55:00,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 22: [2022-11-26 18:55:00,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:55:00,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 18:55:00,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 2: [2022-11-26 18:55:00,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 18:55:00,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 18:55:00,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 15: [2022-11-26 18:55:00,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:55:00,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 18:55:00,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 30: [2022-11-26 18:55:00,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:55:00,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 18:55:00,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 1: [2022-11-26 18:55:00,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:55:00,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
1: [2022-11-26 18:55:00,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 27: [2022-11-26 18:55:00,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 1: [2022-11-26 18:55:00,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 27: [2022-11-26 18:55:00,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 23: [2022-11-26 18:55:00,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:55:00,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 18:55:00,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 12: [2022-11-26 18:55:00,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:55:00,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 18:55:00,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 19: [2022-11-26 18:55:00,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 
19: [2022-11-26 18:55:00,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 18:55:00,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 11: [2022-11-26 18:55:00,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:55:00,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 18:55:00,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 7: [2022-11-26 18:55:00,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:55:00,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 18:55:00,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 0: [2022-11-26 18:55:00,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:55:00,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 18:55:00,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 28: [2022-11-26 18:55:00,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-26 18:55:00,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 18:55:00,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 3: [2022-11-26 18:55:00,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:55:00,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 18:55:00,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 13: [2022-11-26 18:55:00,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:55:00,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 18:55:00,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 31: [2022-11-26 18:55:00,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:55:00,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 18:55:00,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 17: [2022-11-26 18:55:00,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 
17: [2022-11-26 18:55:00,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 18:55:00,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 26: [2022-11-26 18:55:00,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 18:55:00,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 18:55:00,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 14: [2022-11-26 18:55:00,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:55:00,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 18:55:00,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 29: [2022-11-26 18:55:00,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 18:55:00,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 18:55:00,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 1: [2022-11-26 18:55:00,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
9: [2022-11-26 18:55:00,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:55:00,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:55:00,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 1: [2022-11-26 18:55:00,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 4: [2022-11-26 18:55:00,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 9: [2022-11-26 18:55:00,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 1: [2022-11-26 18:55:00,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 4: [2022-11-26 18:55:00,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 16: [2022-11-26 18:55:00,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:55:00,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
16: [2022-11-26 18:55:00,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 15: [2022-11-26 18:55:00,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 16: [2022-11-26 18:55:00,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 27: [2022-11-26 18:55:00,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 15: [2022-11-26 18:55:00,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 18:55:00,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 24: [2022-11-26 18:55:00,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 27: [2022-11-26 18:55:00,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 24: [2022-11-26 18:55:00,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 18:55:00,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 6: [2022-11-26 18:55:00,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 18:55:00,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 2: [2022-11-26 18:55:00,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:55:00,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 2: [2022-11-26 18:55:00,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 18:55:00,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 18: [2022-11-26 18:55:00,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:55:00,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 18:55:00,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 25: [2022-11-26 18:55:00,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:55:00,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 8: [2022-11-26 18:55:00,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 25: [2022-11-26 18:55:00,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
8: [2022-11-26 18:55:00,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 18:55:00,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 23: [2022-11-26 18:55:00,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 18:55:00,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 18:55:00,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 7: [2022-11-26 18:55:00,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:55:00,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 18:55:00,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 30: [2022-11-26 18:55:00,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 18:55:00,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 18:55:00,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 18: [2022-11-26 18:55:00,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
10: [2022-11-26 18:55:00,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:55:00,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 10: [2022-11-26 18:55:00,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 18:55:00,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 18: [2022-11-26 18:55:00,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 11: [2022-11-26 18:55:00,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:55:00,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 18:55:00,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 22: [2022-11-26 18:55:00,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 18:55:00,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 18:55:00,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 16: [2022-11-26 18:55:00,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 
16: [2022-11-26 18:55:00,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 18:55:00,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 14: [2022-11-26 18:55:00,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:55:00,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 18:55:00,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 5: [2022-11-26 18:55:00,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:55:00,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 18:55:00,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 31: [2022-11-26 18:55:00,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:55:00,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 31: [2022-11-26 18:55:00,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 18:55:00,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
3: [2022-11-26 18:55:00,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 26: [2022-11-26 18:55:00,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:55:00,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 26: [2022-11-26 18:55:00,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 18:55:00,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 28: [2022-11-26 18:55:00,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 19: [2022-11-26 18:55:00,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 28: [2022-11-26 18:55:00,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 19: [2022-11-26 18:55:00,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 18:55:00,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 28: [2022-11-26 18:55:00,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 12: [2022-11-26 18:55:00,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-26 18:55:00,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 18:55:00,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 21: [2022-11-26 18:55:00,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:55:00,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 18:55:00,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 10: [2022-11-26 18:55:00,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:55:00,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 18:55:00,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 18: [2022-11-26 18:55:00,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 18:55:00,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 18:55:00,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 20: [2022-11-26 18:55:00,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 
20: [2022-11-26 18:55:00,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 18:55:00,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 21: [2022-11-26 18:55:00,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 18:55:00,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step110000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 18:55:00,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 0: successfully saved checkpoint at iteration 110000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2551.83 31: iteration 110010/ 173500 | consumed samples: 28162560 | consumed tokens: 57676922880 | elapsed time per iteration (s): 1.06 | learning rate: 7.416E-05 | global batch size: 256 | lm loss: 1.972123E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.253 | TFLOPs: 14.60 | 31: iteration 110020/ 173500 | consumed samples: 28165120 | consumed tokens: 57682165760 | elapsed time per iteration (s): 0.79 | learning rate: 7.415E-05 | global batch size: 256 | lm loss: 1.981140E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.533 | TFLOPs: 19.51 | 31: iteration 110030/ 173500 | consumed samples: 28167680 | consumed tokens: 57687408640 | elapsed time per iteration (s): 0.84 | learning rate: 7.413E-05 | global batch size: 256 | lm loss: 1.966180E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.375 | TFLOPs: 18.41 | 31: iteration 
110040/ 173500 | consumed samples: 28170240 | consumed tokens: 57692651520 | elapsed time per iteration (s): 0.75 | learning rate: 7.412E-05 | global batch size: 256 | lm loss: 1.983091E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.707 | TFLOPs: 20.55 | 31: iteration 110050/ 173500 | consumed samples: 28172800 | consumed tokens: 57697894400 | elapsed time per iteration (s): 0.82 | learning rate: 7.410E-05 | global batch size: 256 | lm loss: 1.974158E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.063 | TFLOPs: 18.88 | 31: iteration 110060/ 173500 | consumed samples: 28175360 | consumed tokens: 57703137280 | elapsed time per iteration (s): 0.83 | learning rate: 7.409E-05 | global batch size: 256 | lm loss: 1.951943E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.423 | TFLOPs: 18.60 | 31: iteration 110070/ 173500 | consumed samples: 28177920 | consumed tokens: 57708380160 | elapsed time per iteration (s): 0.82 | learning rate: 7.407E-05 | global batch size: 256 | lm loss: 1.948891E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.249 | TFLOPs: 18.89 | 31: iteration 110080/ 173500 | consumed samples: 28180480 | consumed tokens: 57713623040 | elapsed time per iteration (s): 0.78 | learning rate: 7.406E-05 | global batch size: 256 | lm loss: 1.953300E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.197 | TFLOPs: 19.86 | 31: iteration 110090/ 173500 | consumed samples: 28183040 | consumed tokens: 57718865920 | elapsed time per iteration (s): 0.75 | learning rate: 7.404E-05 | global batch size: 256 | lm loss: 1.961580E+00 | grad norm: 0.159 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.432 | TFLOPs: 20.66 | 31: iteration 110100/ 173500 | consumed samples: 28185600 | consumed tokens: 57724108800 | elapsed time per iteration (s): 0.74 | learning rate: 7.403E-05 | global batch size: 256 | lm loss: 1.945250E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.008 | TFLOPs: 21.05 | 31: iteration 110110/ 173500 | consumed samples: 28188160 | consumed tokens: 57729351680 | elapsed time per iteration (s): 0.76 | learning rate: 7.401E-05 | global batch size: 256 | lm loss: 1.940396E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.302 | TFLOPs: 20.41 | 31: iteration 110120/ 173500 | consumed samples: 28190720 | consumed tokens: 57734594560 | elapsed time per iteration (s): 0.75 | learning rate: 7.400E-05 | global batch size: 256 | lm loss: 1.939101E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.703 | TFLOPs: 20.73 | 31: iteration 110130/ 173500 | consumed samples: 28193280 | consumed tokens: 57739837440 | elapsed time per iteration (s): 0.73 | learning rate: 7.398E-05 | global batch size: 256 | lm loss: 1.953116E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.770 | TFLOPs: 21.34 | 31: iteration 110140/ 173500 | consumed samples: 28195840 | consumed tokens: 57745080320 | elapsed time per iteration (s): 0.81 | learning rate: 7.397E-05 | global batch size: 256 | lm loss: 1.978380E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.246 | TFLOPs: 19.07 | 31: iteration 110150/ 173500 | consumed samples: 28198400 | consumed tokens: 57750323200 | elapsed time per iteration (s): 0.77 | learning 
rate: 7.395E-05 | global batch size: 256 | lm loss: 1.915487E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.035 | TFLOPs: 20.21 | 31: iteration 110160/ 173500 | consumed samples: 28200960 | consumed tokens: 57755566080 | elapsed time per iteration (s): 0.75 | learning rate: 7.394E-05 | global batch size: 256 | lm loss: 1.957087E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.485 | TFLOPs: 20.54 | 31: iteration 110170/ 173500 | consumed samples: 28203520 | consumed tokens: 57760808960 | elapsed time per iteration (s): 0.73 | learning rate: 7.392E-05 | global batch size: 256 | lm loss: 1.956234E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.927 | TFLOPs: 21.29 | 31: iteration 110180/ 173500 | consumed samples: 28206080 | consumed tokens: 57766051840 | elapsed time per iteration (s): 0.73 | learning rate: 7.391E-05 | global batch size: 256 | lm loss: 1.975410E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.501 | TFLOPs: 21.26 | 31: iteration 110190/ 173500 | consumed samples: 28208640 | consumed tokens: 57771294720 | elapsed time per iteration (s): 0.78 | learning rate: 7.389E-05 | global batch size: 256 | lm loss: 1.965873E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.493 | TFLOPs: 19.93 | 31: iteration 110200/ 173500 | consumed samples: 28211200 | consumed tokens: 57776537600 | elapsed time per iteration (s): 0.77 | learning rate: 7.388E-05 | global batch size: 256 | lm loss: 1.960261E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.134 | TFLOPs: 20.15 | 31: iteration 110210/ 
173500 | consumed samples: 28213760 | consumed tokens: 57781780480 | elapsed time per iteration (s): 0.78 | learning rate: 7.386E-05 | global batch size: 256 | lm loss: 1.950261E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.970 | TFLOPs: 19.84 | 31: iteration 110220/ 173500 | consumed samples: 28216320 | consumed tokens: 57787023360 | elapsed time per iteration (s): 0.80 | learning rate: 7.385E-05 | global batch size: 256 | lm loss: 1.969444E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.333 | TFLOPs: 19.44 | 31: iteration 110230/ 173500 | consumed samples: 28218880 | consumed tokens: 57792266240 | elapsed time per iteration (s): 0.86 | learning rate: 7.383E-05 | global batch size: 256 | lm loss: 1.967438E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.010 | TFLOPs: 17.97 | 31: iteration 110240/ 173500 | consumed samples: 28221440 | consumed tokens: 57797509120 | elapsed time per iteration (s): 0.79 | learning rate: 7.382E-05 | global batch size: 256 | lm loss: 1.961457E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.126 | TFLOPs: 19.55 | 31: iteration 110250/ 173500 | consumed samples: 28224000 | consumed tokens: 57802752000 | elapsed time per iteration (s): 0.82 | learning rate: 7.380E-05 | global batch size: 256 | lm loss: 1.963723E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.821 | TFLOPs: 18.92 | 31: iteration 110260/ 173500 | consumed samples: 28226560 | consumed tokens: 57807994880 | elapsed time per iteration (s): 0.95 | learning rate: 7.378E-05 | global batch size: 256 | lm loss: 1.947998E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 270.663 | TFLOPs: 16.37 | 31: iteration 110270/ 173500 | consumed samples: 28229120 | consumed tokens: 57813237760 | elapsed time per iteration (s): 0.84 | learning rate: 7.377E-05 | global batch size: 256 | lm loss: 1.957845E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.402 | TFLOPs: 18.42 | 31: iteration 110280/ 173500 | consumed samples: 28231680 | consumed tokens: 57818480640 | elapsed time per iteration (s): 0.80 | learning rate: 7.375E-05 | global batch size: 256 | lm loss: 1.958293E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.541 | TFLOPs: 19.45 | 31: iteration 110290/ 173500 | consumed samples: 28234240 | consumed tokens: 57823723520 | elapsed time per iteration (s): 0.79 | learning rate: 7.374E-05 | global batch size: 256 | lm loss: 1.953540E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.559 | TFLOPs: 19.70 | 31: iteration 110300/ 173500 | consumed samples: 28236800 | consumed tokens: 57828966400 | elapsed time per iteration (s): 1.03 | learning rate: 7.372E-05 | global batch size: 256 | lm loss: 1.968337E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.762 | TFLOPs: 15.05 | 31: iteration 110310/ 173500 | consumed samples: 28239360 | consumed tokens: 57834209280 | elapsed time per iteration (s): 0.80 | learning rate: 7.371E-05 | global batch size: 256 | lm loss: 1.959917E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.226 | TFLOPs: 19.37 | 31: iteration 110320/ 173500 | consumed samples: 28241920 | consumed tokens: 57839452160 | elapsed time per iteration (s): 0.77 | learning rate: 
7.369E-05 | global batch size: 256 | lm loss: 1.925520E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.492 | TFLOPs: 19.99 | 31: iteration 110330/ 173500 | consumed samples: 28244480 | consumed tokens: 57844695040 | elapsed time per iteration (s): 0.84 | learning rate: 7.368E-05 | global batch size: 256 | lm loss: 1.958966E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.057 | TFLOPs: 18.52 | 31: iteration 110340/ 173500 | consumed samples: 28247040 | consumed tokens: 57849937920 | elapsed time per iteration (s): 0.81 | learning rate: 7.366E-05 | global batch size: 256 | lm loss: 1.967426E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.569 | TFLOPs: 19.15 | 31: iteration 110350/ 173500 | consumed samples: 28249600 | consumed tokens: 57855180800 | elapsed time per iteration (s): 0.78 | learning rate: 7.365E-05 | global batch size: 256 | lm loss: 1.958438E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.946 | TFLOPs: 19.96 | 31: iteration 110360/ 173500 | consumed samples: 28252160 | consumed tokens: 57860423680 | elapsed time per iteration (s): 0.79 | learning rate: 7.363E-05 | global batch size: 256 | lm loss: 1.982481E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.210 | TFLOPs: 19.61 | 31: iteration 110370/ 173500 | consumed samples: 28254720 | consumed tokens: 57865666560 | elapsed time per iteration (s): 0.78 | learning rate: 7.362E-05 | global batch size: 256 | lm loss: 1.972958E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.354 | TFLOPs: 19.86 | 31: iteration 110380/ 173500 | 
consumed samples: 28257280 | consumed tokens: 57870909440 | elapsed time per iteration (s): 0.78 | learning rate: 7.360E-05 | global batch size: 256 | lm loss: 1.976731E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.135 | TFLOPs: 19.97 | 31: iteration 110390/ 173500 | consumed samples: 28259840 | consumed tokens: 57876152320 | elapsed time per iteration (s): 0.80 | learning rate: 7.359E-05 | global batch size: 256 | lm loss: 1.965469E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.770 | TFLOPs: 19.28 | 31: iteration 110400/ 173500 | consumed samples: 28262400 | consumed tokens: 57881395200 | elapsed time per iteration (s): 0.81 | learning rate: 7.357E-05 | global batch size: 256 | lm loss: 1.955230E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.070 | TFLOPs: 19.06 | 31: iteration 110410/ 173500 | consumed samples: 28264960 | consumed tokens: 57886638080 | elapsed time per iteration (s): 0.73 | learning rate: 7.356E-05 | global batch size: 256 | lm loss: 1.982838E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.292 | TFLOPs: 21.25 | 31: iteration 110420/ 173500 | consumed samples: 28267520 | consumed tokens: 57891880960 | elapsed time per iteration (s): 0.72 | learning rate: 7.354E-05 | global batch size: 256 | lm loss: 1.962371E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.442 | TFLOPs: 21.44 | 31: iteration 110430/ 173500 | consumed samples: 28270080 | consumed tokens: 57897123840 | elapsed time per iteration (s): 0.76 | learning rate: 7.353E-05 | global batch size: 256 | lm loss: 1.945686E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 336.214 | TFLOPs: 20.34 | 31: iteration 110440/ 173500 | consumed samples: 28272640 | consumed tokens: 57902366720 | elapsed time per iteration (s): 0.75 | learning rate: 7.351E-05 | global batch size: 256 | lm loss: 1.944987E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.782 | TFLOPs: 20.56 | 31: iteration 110450/ 173500 | consumed samples: 28275200 | consumed tokens: 57907609600 | elapsed time per iteration (s): 0.72 | learning rate: 7.350E-05 | global batch size: 256 | lm loss: 1.947967E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.602 | TFLOPs: 21.39 | 31: iteration 110460/ 173500 | consumed samples: 28277760 | consumed tokens: 57912852480 | elapsed time per iteration (s): 0.76 | learning rate: 7.348E-05 | global batch size: 256 | lm loss: 1.973524E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.494 | TFLOPs: 20.42 | 31: iteration 110470/ 173500 | consumed samples: 28280320 | consumed tokens: 57918095360 | elapsed time per iteration (s): 0.79 | learning rate: 7.347E-05 | global batch size: 256 | lm loss: 1.945015E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.256 | TFLOPs: 19.50 | 31: iteration 110480/ 173500 | consumed samples: 28282880 | consumed tokens: 57923338240 | elapsed time per iteration (s): 0.75 | learning rate: 7.345E-05 | global batch size: 256 | lm loss: 1.961135E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.775 | TFLOPs: 20.74 | 31: iteration 110490/ 173500 | consumed samples: 28285440 | consumed tokens: 57928581120 | elapsed time per iteration (s): 0.73 | learning rate: 
7.344E-05 | global batch size: 256 | lm loss: 1.940801E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.019 | TFLOPs: 21.24 | 31: iteration 110500/ 173500 | consumed samples: 28288000 | consumed tokens: 57933824000 | elapsed time per iteration (s): 0.74 | learning rate: 7.342E-05 | global batch size: 256 | lm loss: 1.958368E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.936 | TFLOPs: 20.99 | 31: iteration 110510/ 173500 | consumed samples: 28290560 | consumed tokens: 57939066880 | elapsed time per iteration (s): 0.77 | learning rate: 7.341E-05 | global batch size: 256 | lm loss: 1.942469E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.634 | TFLOPs: 20.18 | 31: iteration 110520/ 173500 | consumed samples: 28293120 | consumed tokens: 57944309760 | elapsed time per iteration (s): 0.72 | learning rate: 7.339E-05 | global batch size: 256 | lm loss: 1.929183E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.431 | TFLOPs: 21.44 | 31: iteration 110530/ 173500 | consumed samples: 28295680 | consumed tokens: 57949552640 | elapsed time per iteration (s): 0.74 | learning rate: 7.338E-05 | global batch size: 256 | lm loss: 1.947337E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.014 | TFLOPs: 20.81 | 31: iteration 110540/ 173500 | consumed samples: 28298240 | consumed tokens: 57954795520 | elapsed time per iteration (s): 0.71 | learning rate: 7.336E-05 | global batch size: 256 | lm loss: 1.958988E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.936 | TFLOPs: 21.71 | 31: iteration 110550/ 173500 | 
consumed samples: 28300800 | consumed tokens: 57960038400 | elapsed time per iteration (s): 0.76 | learning rate: 7.335E-05 | global batch size: 256 | lm loss: 1.944510E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.212 | TFLOPs: 20.34 | 31: iteration 110560/ 173500 | consumed samples: 28303360 | consumed tokens: 57965281280 | elapsed time per iteration (s): 0.75 | learning rate: 7.333E-05 | global batch size: 256 | lm loss: 1.976520E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.699 | TFLOPs: 20.67 | 31: iteration 110570/ 173500 | consumed samples: 28305920 | consumed tokens: 57970524160 | elapsed time per iteration (s): 0.74 | learning rate: 7.332E-05 | global batch size: 256 | lm loss: 1.965498E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.637 | TFLOPs: 20.97 | 31: iteration 110580/ 173500 | consumed samples: 28308480 | consumed tokens: 57975767040 | elapsed time per iteration (s): 0.76 | learning rate: 7.330E-05 | global batch size: 256 | lm loss: 1.954529E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.331 | TFLOPs: 20.29 | 31: iteration 110590/ 173500 | consumed samples: 28311040 | consumed tokens: 57981009920 | elapsed time per iteration (s): 0.76 | learning rate: 7.329E-05 | global batch size: 256 | lm loss: 1.947139E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.783 | TFLOPs: 20.25 | 31: iteration 110600/ 173500 | consumed samples: 28313600 | consumed tokens: 57986252800 | elapsed time per iteration (s): 0.75 | learning rate: 7.327E-05 | global batch size: 256 | lm loss: 1.959476E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 341.563 | TFLOPs: 20.66 | 31: iteration 110610/ 173500 | consumed samples: 28316160 | consumed tokens: 57991495680 | elapsed time per iteration (s): 0.76 | learning rate: 7.326E-05 | global batch size: 256 | lm loss: 1.956647E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.876 | TFLOPs: 20.44 | 31: iteration 110620/ 173500 | consumed samples: 28318720 | consumed tokens: 57996738560 | elapsed time per iteration (s): 0.78 | learning rate: 7.324E-05 | global batch size: 256 | lm loss: 1.969188E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.418 | TFLOPs: 19.87 | 31: iteration 110630/ 173500 | consumed samples: 28321280 | consumed tokens: 58001981440 | elapsed time per iteration (s): 0.77 | learning rate: 7.323E-05 | global batch size: 256 | lm loss: 1.946691E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.606 | TFLOPs: 20.18 | 31: iteration 110640/ 173500 | consumed samples: 28323840 | consumed tokens: 58007224320 | elapsed time per iteration (s): 0.74 | learning rate: 7.321E-05 | global batch size: 256 | lm loss: 1.963900E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.297 | TFLOPs: 21.07 | 31: iteration 110650/ 173500 | consumed samples: 28326400 | consumed tokens: 58012467200 | elapsed time per iteration (s): 0.72 | learning rate: 7.320E-05 | global batch size: 256 | lm loss: 1.976298E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.217 | TFLOPs: 21.55 | 31: iteration 110660/ 173500 | consumed samples: 28328960 | consumed tokens: 58017710080 | elapsed time per iteration (s): 0.76 | learning rate: 
7.318E-05 | global batch size: 256 | lm loss: 1.961898E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.861 | TFLOPs: 20.50 | 31: iteration 110670/ 173500 | consumed samples: 28331520 | consumed tokens: 58022952960 | elapsed time per iteration (s): 0.78 | learning rate: 7.317E-05 | global batch size: 256 | lm loss: 1.961267E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.952 | TFLOPs: 19.90 | 31: iteration 110680/ 173500 | consumed samples: 28334080 | consumed tokens: 58028195840 | elapsed time per iteration (s): 0.73 | learning rate: 7.315E-05 | global batch size: 256 | lm loss: 1.951314E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.303 | TFLOPs: 21.25 | 31: iteration 110690/ 173500 | consumed samples: 28336640 | consumed tokens: 58033438720 | elapsed time per iteration (s): 0.78 | learning rate: 7.314E-05 | global batch size: 256 | lm loss: 1.982022E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.288 | TFLOPs: 19.98 | 31: iteration 110700/ 173500 | consumed samples: 28339200 | consumed tokens: 58038681600 | elapsed time per iteration (s): 0.72 | learning rate: 7.312E-05 | global batch size: 256 | lm loss: 1.953197E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.974 | TFLOPs: 21.48 | 31: iteration 110710/ 173500 | consumed samples: 28341760 | consumed tokens: 58043924480 | elapsed time per iteration (s): 0.74 | learning rate: 7.311E-05 | global batch size: 256 | lm loss: 1.964970E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.705 | TFLOPs: 20.79 | 31: iteration 110720/ 173500 | 
consumed samples: 28344320 | consumed tokens: 58049167360 | elapsed time per iteration (s): 0.73 | learning rate: 7.309E-05 | global batch size: 256 | lm loss: 1.978874E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.074 | TFLOPs: 21.30 | 31: iteration 110730/ 173500 | consumed samples: 28346880 | consumed tokens: 58054410240 | elapsed time per iteration (s): 0.75 | learning rate: 7.308E-05 | global batch size: 256 | lm loss: 1.960026E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.859 | TFLOPs: 20.62 | 31: iteration 110740/ 173500 | consumed samples: 28349440 | consumed tokens: 58059653120 | elapsed time per iteration (s): 0.78 | learning rate: 7.306E-05 | global batch size: 256 | lm loss: 1.970247E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.093 | TFLOPs: 19.79 | 31: iteration 110750/ 173500 | consumed samples: 28352000 | consumed tokens: 58064896000 | elapsed time per iteration (s): 0.74 | learning rate: 7.305E-05 | global batch size: 256 | lm loss: 1.940093E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.263 | TFLOPs: 21.01 | 31: iteration 110760/ 173500 | consumed samples: 28354560 | consumed tokens: 58070138880 | elapsed time per iteration (s): 0.74 | learning rate: 7.303E-05 | global batch size: 256 | lm loss: 1.958872E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.495 | TFLOPs: 21.02 | 31: iteration 110770/ 173500 | consumed samples: 28357120 | consumed tokens: 58075381760 | elapsed time per iteration (s): 0.80 | learning rate: 7.302E-05 | global batch size: 256 | lm loss: 1.946629E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 318.509 | TFLOPs: 19.27 | 31: iteration 110780/ 173500 | consumed samples: 28359680 | consumed tokens: 58080624640 | elapsed time per iteration (s): 0.79 | learning rate: 7.300E-05 | global batch size: 256 | lm loss: 1.951185E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.571 | TFLOPs: 19.70 | 31: iteration 110790/ 173500 | consumed samples: 28362240 | consumed tokens: 58085867520 | elapsed time per iteration (s): 0.75 | learning rate: 7.299E-05 | global batch size: 256 | lm loss: 1.972629E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.610 | TFLOPs: 20.67 | 31: iteration 110800/ 173500 | consumed samples: 28364800 | consumed tokens: 58091110400 | elapsed time per iteration (s): 0.76 | learning rate: 7.297E-05 | global batch size: 256 | lm loss: 1.936268E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.804 | TFLOPs: 20.32 | 31: iteration 110810/ 173500 | consumed samples: 28367360 | consumed tokens: 58096353280 | elapsed time per iteration (s): 0.76 | learning rate: 7.296E-05 | global batch size: 256 | lm loss: 1.959570E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.216 | TFLOPs: 20.40 | 31: iteration 110820/ 173500 | consumed samples: 28369920 | consumed tokens: 58101596160 | elapsed time per iteration (s): 0.72 | learning rate: 7.294E-05 | global batch size: 256 | lm loss: 1.936204E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.998 | TFLOPs: 21.60 | 31: iteration 110830/ 173500 | consumed samples: 28372480 | consumed tokens: 58106839040 | elapsed time per iteration (s): 0.76 | learning rate: 
7.293E-05 | global batch size: 256 | lm loss: 1.964251E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.709 | TFLOPs: 20.37 | 31: iteration 110840/ 173500 | consumed samples: 28375040 | consumed tokens: 58112081920 | elapsed time per iteration (s): 0.73 | learning rate: 7.291E-05 | global batch size: 256 | lm loss: 1.971337E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.842 | TFLOPs: 21.16 | 31: iteration 110850/ 173500 | consumed samples: 28377600 | consumed tokens: 58117324800 | elapsed time per iteration (s): 0.78 | learning rate: 7.290E-05 | global batch size: 256 | lm loss: 1.971168E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.846 | TFLOPs: 19.95 | 31: iteration 110860/ 173500 | consumed samples: 28380160 | consumed tokens: 58122567680 | elapsed time per iteration (s): 0.73 | learning rate: 7.288E-05 | global batch size: 256 | lm loss: 1.969341E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.908 | TFLOPs: 21.17 | 31: iteration 110870/ 173500 | consumed samples: 28382720 | consumed tokens: 58127810560 | elapsed time per iteration (s): 0.76 | learning rate: 7.287E-05 | global batch size: 256 | lm loss: 1.943454E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.082 | TFLOPs: 20.45 | 31: iteration 110880/ 173500 | consumed samples: 28385280 | consumed tokens: 58133053440 | elapsed time per iteration (s): 0.74 | learning rate: 7.285E-05 | global batch size: 256 | lm loss: 1.964494E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.093 | TFLOPs: 20.94 | 31: iteration 110890/ 173500 | 
consumed samples: 28387840 | consumed tokens: 58138296320 | elapsed time per iteration (s): 0.76 | learning rate: 7.284E-05 | global batch size: 256 | lm loss: 1.957260E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.414 | TFLOPs: 20.47 | 31: iteration 110900/ 173500 | consumed samples: 28390400 | consumed tokens: 58143539200 | elapsed time per iteration (s): 0.72 | learning rate: 7.282E-05 | global batch size: 256 | lm loss: 1.946923E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.786 | TFLOPs: 21.58 | 31: iteration 110910/ 173500 | consumed samples: 28392960 | consumed tokens: 58148782080 | elapsed time per iteration (s): 0.81 | learning rate: 7.281E-05 | global batch size: 256 | lm loss: 1.964833E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.356 | TFLOPs: 19.20 | 31: iteration 110920/ 173500 | consumed samples: 28395520 | consumed tokens: 58154024960 | elapsed time per iteration (s): 0.75 | learning rate: 7.279E-05 | global batch size: 256 | lm loss: 1.968317E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.229 | TFLOPs: 20.58 | 31: iteration 110930/ 173500 | consumed samples: 28398080 | consumed tokens: 58159267840 | elapsed time per iteration (s): 0.73 | learning rate: 7.278E-05 | global batch size: 256 | lm loss: 1.929851E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.690 | TFLOPs: 21.22 | 31: iteration 110940/ 173500 | consumed samples: 28400640 | consumed tokens: 58164510720 | elapsed time per iteration (s): 0.74 | learning rate: 7.276E-05 | global batch size: 256 | lm loss: 1.952421E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 344.467 | TFLOPs: 20.84 | 31: iteration 110950/ 173500 | consumed samples: 28403200 | consumed tokens: 58169753600 | elapsed time per iteration (s): 0.76 | learning rate: 7.275E-05 | global batch size: 256 | lm loss: 1.928565E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.721 | TFLOPs: 20.25 | 31: iteration 110960/ 173500 | consumed samples: 28405760 | consumed tokens: 58174996480 | elapsed time per iteration (s): 0.75 | learning rate: 7.273E-05 | global batch size: 256 | lm loss: 1.960980E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.386 | TFLOPs: 20.59 | 31: iteration 110970/ 173500 | consumed samples: 28408320 | consumed tokens: 58180239360 | elapsed time per iteration (s): 0.73 | learning rate: 7.272E-05 | global batch size: 256 | lm loss: 2.011386E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.411 | TFLOPs: 21.08 | 31: iteration 110980/ 173500 | consumed samples: 28410880 | consumed tokens: 58185482240 | elapsed time per iteration (s): 0.74 | learning rate: 7.270E-05 | global batch size: 256 | lm loss: 1.970583E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.609 | TFLOPs: 20.85 | 31: iteration 110990/ 173500 | consumed samples: 28413440 | consumed tokens: 58190725120 | elapsed time per iteration (s): 0.80 | learning rate: 7.269E-05 | global batch size: 256 | lm loss: 1.969150E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.873 | TFLOPs: 19.35 | 31: iteration 111000/ 173500 | consumed samples: 28416000 | consumed tokens: 58195968000 | elapsed time per iteration (s): 0.78 | learning rate: 
7.267E-05 | global batch size: 256 | lm loss: 1.960560E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.470 | TFLOPs: 19.81 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 111000 | lm loss value: 1.904855E+00 | lm loss PPL: 6.718431E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 111000 to checkpoints_1b1long 0: [2022-11-26 19:07:50,536] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step111000 is begin to save! 0: [2022-11-26 19:07:50,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_01-model_00-model_states.pt... 0: [2022-11-26 19:07:50,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_01-model_00-model_states.pt. 0: [2022-11-26 19:07:50,790] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_03-model_00-model_states.pt... 0: [2022-11-26 19:07:50,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_03-model_00-model_states.pt. 0: [2022-11-26 19:07:50,869] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_04-model_00-model_states.pt... 0: [2022-11-26 19:07:50,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_04-model_00-model_states.pt. 0: [2022-11-26 19:07:50,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_05-model_00-model_states.pt... 0: [2022-11-26 19:07:51,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_05-model_00-model_states.pt. 
0: [2022-11-26 19:07:51,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_06-model_00-model_states.pt... 0: [2022-11-26 19:07:51,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_06-model_00-model_states.pt. 0: [2022-11-26 19:07:51,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_07-model_00-model_states.pt... 0: [2022-11-26 19:07:51,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_07-model_00-model_states.pt. 0: [2022-11-26 19:07:51,178] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_08-model_00-model_states.pt... 0: [2022-11-26 19:07:51,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_08-model_00-model_states.pt. 0: [2022-11-26 19:07:51,250] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_09-model_00-model_states.pt... 0: [2022-11-26 19:07:51,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_09-model_00-model_states.pt. 0: [2022-11-26 19:07:51,328] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_10-model_00-model_states.pt... 0: [2022-11-26 19:07:51,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_10-model_00-model_states.pt. 0: [2022-11-26 19:07:51,403] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_11-model_00-model_states.pt... 0: [2022-11-26 19:07:51,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_11-model_00-model_states.pt. 
0: [2022-11-26 19:07:51,478] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_12-model_00-model_states.pt... 0: [2022-11-26 19:07:51,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_12-model_00-model_states.pt. 0: [2022-11-26 19:07:51,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_13-model_00-model_states.pt... 0: [2022-11-26 19:07:51,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_13-model_00-model_states.pt. 0: [2022-11-26 19:07:51,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_14-model_00-model_states.pt... 0: [2022-11-26 19:07:51,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_14-model_00-model_states.pt. 0: [2022-11-26 19:07:51,699] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_15-model_00-model_states.pt... 0: [2022-11-26 19:07:51,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_15-model_00-model_states.pt. 0: [2022-11-26 19:07:51,771] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_16-model_00-model_states.pt... 0: [2022-11-26 19:07:51,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_16-model_00-model_states.pt. 0: [2022-11-26 19:07:51,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_17-model_00-model_states.pt... 0: [2022-11-26 19:07:51,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_17-model_00-model_states.pt. 
0: [2022-11-26 19:07:51,918] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_18-model_00-model_states.pt... 0: [2022-11-26 19:07:51,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_18-model_00-model_states.pt. 0: [2022-11-26 19:07:51,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_19-model_00-model_states.pt... 0: [2022-11-26 19:07:52,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_19-model_00-model_states.pt. 0: [2022-11-26 19:07:52,068] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_20-model_00-model_states.pt... 0: [2022-11-26 19:07:52,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_20-model_00-model_states.pt. 0: [2022-11-26 19:07:52,150] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_21-model_00-model_states.pt... 0: [2022-11-26 19:07:52,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_21-model_00-model_states.pt. 0: [2022-11-26 19:07:52,226] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_22-model_00-model_states.pt... 0: [2022-11-26 19:07:52,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_22-model_00-model_states.pt. 0: [2022-11-26 19:07:52,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_23-model_00-model_states.pt... 0: [2022-11-26 19:07:52,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_23-model_00-model_states.pt. 
0: [2022-11-26 19:07:52,373] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_24-model_00-model_states.pt... 0: [2022-11-26 19:07:52,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_24-model_00-model_states.pt. 0: [2022-11-26 19:07:52,447] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_25-model_00-model_states.pt... 0: [2022-11-26 19:07:52,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_25-model_00-model_states.pt. 0: [2022-11-26 19:07:52,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_26-model_00-model_states.pt... 0: [2022-11-26 19:07:52,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_26-model_00-model_states.pt. 0: [2022-11-26 19:07:52,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_27-model_00-model_states.pt... 0: [2022-11-26 19:07:52,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_27-model_00-model_states.pt. 0: [2022-11-26 19:07:52,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_28-model_00-model_states.pt... 0: [2022-11-26 19:07:52,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_28-model_00-model_states.pt. 0: [2022-11-26 19:07:52,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/layer_30-model_00-model_states.pt... 0: [2022-11-26 19:07:52,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/layer_30-model_00-model_states.pt. 
0: [2022-11-26 19:07:52,742] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step111000/mp_rank_00_model_states.pt 0: [2022-11-26 19:07:52,742] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/mp_rank_00_model_states.pt... 0: [2022-11-26 19:07:52,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/mp_rank_00_model_states.pt. 0: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
7: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
10: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 
12: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 
23: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 
24: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 
17: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:07:52,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:07:52,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
26: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 
6: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 
9: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:07:52,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
16: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
20: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
11: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 
22: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:07:52,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:07:52,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 
18: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:07:52,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 
6: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
16: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 
25: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 
22: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:07:52,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:07:52,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 
0: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 
3: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 
29: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
8: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 
17: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:07:52,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:07:52,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 19:07:52,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 19:07:52,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 26: [2022-11-26 19:07:52,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
26: [2022-11-26 19:07:52,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 19:07:52,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 20: [2022-11-26 19:07:52,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 19:07:52,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 19:07:52,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 30: [2022-11-26 19:07:52,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 19:07:52,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 19:07:52,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 2: [2022-11-26 19:07:52,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:07:52,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 19:07:52,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 23: [2022-11-26 19:07:52,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
23: [2022-11-26 19:07:52,885] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 19:07:52,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 13: [2022-11-26 19:07:52,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:07:52,885] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 19:07:52,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 8: [2022-11-26 19:07:52,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:07:52,885] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 19:07:52,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 27: [2022-11-26 19:07:52,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:07:52,885] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 19:07:52,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 28: [2022-11-26 19:07:52,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
28: [2022-11-26 19:07:52,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 21: [2022-11-26 19:07:52,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 19:07:52,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 19:07:52,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 28: [2022-11-26 19:07:52,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 19: [2022-11-26 19:07:52,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:07:52,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 19:07:52,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 13: [2022-11-26 19:07:52,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 30: [2022-11-26 19:07:52,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
30: [2022-11-26 19:07:52,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 13: [2022-11-26 19:07:52,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 30: [2022-11-26 19:07:52,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 13: [2022-11-26 19:07:52,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 12: [2022-11-26 19:07:52,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:07:52,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 19:07:52,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 22: [2022-11-26 19:07:52,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:07:52,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 22: [2022-11-26 19:07:52,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 1: [2022-11-26 19:07:52,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
12: [2022-11-26 19:07:52,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 22: [2022-11-26 19:07:52,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 12: [2022-11-26 19:07:52,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 15: [2022-11-26 19:07:52,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 26: [2022-11-26 19:07:52,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:07:52,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 15: [2022-11-26 19:07:52,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 19:07:52,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 5: [2022-11-26 19:07:52,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:07:52,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 
26: [2022-11-26 19:07:52,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 5: [2022-11-26 19:07:52,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 19:07:52,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 26: [2022-11-26 19:07:52,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 3: [2022-11-26 19:07:52,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:07:52,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:07:52,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 9: [2022-11-26 19:07:52,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:07:52,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 9: [2022-11-26 19:07:52,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 3: [2022-11-26 19:07:52,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 9: [2022-11-26 19:07:52,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 
3: [2022-11-26 19:07:52,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 24: [2022-11-26 19:07:52,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:07:52,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 21: [2022-11-26 19:07:52,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:07:52,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 21: [2022-11-26 19:07:52,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 14: [2022-11-26 19:07:52,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 24: [2022-11-26 19:07:52,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 19:07:52,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 21: [2022-11-26 19:07:52,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 24: [2022-11-26 19:07:52,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-26 19:07:52,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 2: [2022-11-26 19:07:52,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:07:52,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 2: [2022-11-26 19:07:52,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 19:07:52,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 7: [2022-11-26 19:07:52,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:07:52,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 19:07:52,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 22: [2022-11-26 19:07:52,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 19:07:52,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 19:07:52,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 19: [2022-11-26 19:07:52,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-26 19:07:52,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 20: [2022-11-26 19:07:52,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:07:52,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 20: [2022-11-26 19:07:52,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 19:07:52,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 14: [2022-11-26 19:07:52,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 31: [2022-11-26 19:07:52,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:07:52,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 19:07:52,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 31: [2022-11-26 19:07:52,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 19:07:52,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 5: [2022-11-26 19:07:52,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 19:07:52,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 31: [2022-11-26 19:07:52,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:07:52,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 31: [2022-11-26 19:07:52,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 19:07:52,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 25: [2022-11-26 19:07:52,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:07:52,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:07:52,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 19:07:52,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 24: [2022-11-26 19:07:52,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 19:07:52,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 6: [2022-11-26 19:07:52,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 19:07:52,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 19:07:52,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 28: [2022-11-26 19:07:52,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 19:07:52,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 19:07:52,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 9: [2022-11-26 19:07:52,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:07:52,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 19:07:52,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 19: [2022-11-26 19:07:52,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:07:52,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 19:07:52,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 8: [2022-11-26 19:07:52,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-26 19:07:52,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 19:07:52,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 23: [2022-11-26 19:07:52,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:07:52,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 19:07:52,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 12: [2022-11-26 19:07:52,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:07:52,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 19:07:52,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 31: [2022-11-26 19:07:52,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:07:52,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-26 19:07:52,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 31: [2022-11-26 19:07:52,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 26: [2022-11-26 19:07:52,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:07:52,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 31: [2022-11-26 19:07:52,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 26: [2022-11-26 19:07:52,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 19:07:52,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 21: [2022-11-26 19:07:52,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 19:07:52,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 19:07:52,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 13: [2022-11-26 19:07:52,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 19:07:52,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 22: [2022-11-26 19:07:52,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:07:52,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 22: [2022-11-26 19:07:52,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 19:07:52,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 3: [2022-11-26 19:07:52,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:07:52,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 19:07:52,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 30: [2022-11-26 19:07:52,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 19:07:52,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 19:07:52,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 5: [2022-11-26 19:07:52,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 19:07:52,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 19:07:52,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 20: [2022-11-26 19:07:52,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 19:07:52,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 19:07:52,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 15: [2022-11-26 19:07:52,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:07:52,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 19:07:52,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 23: [2022-11-26 19:07:52,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:07:52,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 19:07:52,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 27: [2022-11-26 19:07:52,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
30: [2022-11-26 19:07:52,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:07:52,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 30: [2022-11-26 19:07:52,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 19:07:52,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 27: [2022-11-26 19:07:52,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 27: [2022-11-26 19:07:52,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:07:52,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 0: [2022-11-26 19:07:52,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:07:52,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 0: [2022-11-26 19:07:52,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 19:07:52,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 2: [2022-11-26 19:07:52,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 19:07:52,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 19:07:52,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 14: [2022-11-26 19:07:52,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:07:52,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 19:07:52,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 25: [2022-11-26 19:07:52,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:07:52,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 19:07:52,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 3: [2022-11-26 19:07:52,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:07:52,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 19:07:52,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 21: [2022-11-26 19:07:52,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-26 19:07:52,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 19:07:52,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 2: [2022-11-26 19:07:52,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:07:52,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 19:07:52,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 1: [2022-11-26 19:07:52,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:07:52,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 24: [2022-11-26 19:07:52,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:07:52,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 24: [2022-11-26 19:07:52,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 19:07:52,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 20: [2022-11-26 19:07:52,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 
28: [2022-11-26 19:07:52,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 20: [2022-11-26 19:07:52,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 19:07:52,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 28: [2022-11-26 19:07:52,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 19:07:52,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 26: [2022-11-26 19:07:52,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 19:07:52,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 19:07:52,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 6: [2022-11-26 19:07:52,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:07:52,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:07:52,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
6: [2022-11-26 19:07:52,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 5: [2022-11-26 19:07:52,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:07:52,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 5: [2022-11-26 19:07:52,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 1: [2022-11-26 19:07:52,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:07:52,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 19:07:52,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 19:07:52,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 1: [2022-11-26 19:07:52,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 19: [2022-11-26 19:07:52,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 5: [2022-11-26 19:07:52,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 19: [2022-11-26 19:07:52,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 
9: [2022-11-26 19:07:52,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:07:52,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 19:07:52,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 12: [2022-11-26 19:07:52,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:07:52,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 19:07:52,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 9: [2022-11-26 19:07:52,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:07:52,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:07:52,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 19:07:52,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 6: [2022-11-26 19:07:52,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:07:52,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 19:07:52,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 19:07:52,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 9: [2022-11-26 19:07:52,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 19:07:52,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 6: [2022-11-26 19:07:52,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 6: [2022-11-26 19:07:52,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 8: [2022-11-26 19:07:52,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 22: [2022-11-26 19:07:52,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:07:52,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 22: [2022-11-26 19:07:52,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 8: [2022-11-26 19:07:52,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 22: [2022-11-26 19:07:52,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 
13: [2022-11-26 19:07:52,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:07:52,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 19:07:52,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 14: [2022-11-26 19:07:52,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:07:52,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 19:07:52,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 15: [2022-11-26 19:07:52,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:07:52,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 19:07:52,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 17: [2022-11-26 19:07:52,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 19:07:52,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 19:07:52,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 
25: [2022-11-26 19:07:52,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:07:52,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 19:07:52,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 31: [2022-11-26 19:07:52,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 19:07:52,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 19:07:52,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 17: [2022-11-26 19:07:52,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 19:07:52,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 19:07:52,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 19:07:52,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 19:07:52,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 17: [2022-11-26 19:07:52,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 
25: [2022-11-26 19:07:52,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:07:52,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 19:07:52,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 4: [2022-11-26 19:07:52,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:07:52,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:07:52,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:07:52,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:07:52,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:07:52,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:07:52,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:07:52,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
4: [2022-11-26 19:07:52,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 19:07:52,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 19:07:52,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 19:07:52,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 19:07:52,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 11: [2022-11-26 19:07:52,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 4: [2022-11-26 19:07:52,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 4: [2022-11-26 19:07:52,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 4: [2022-11-26 19:07:52,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 
11: [2022-11-26 19:07:52,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 19:07:52,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 19:07:52,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 19:07:52,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 11: [2022-11-26 19:07:52,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 11: [2022-11-26 19:07:52,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 11: [2022-11-26 19:07:52,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 10: [2022-11-26 19:07:52,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:07:52,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:07:52,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 19:07:52,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 
10: [2022-11-26 19:07:52,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 19:07:52,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 8: [2022-11-26 19:07:52,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:07:52,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:07:52,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 23: [2022-11-26 19:07:52,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:07:52,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 8: [2022-11-26 19:07:52,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 29: [2022-11-26 19:07:52,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 23: [2022-11-26 19:07:52,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 19:07:52,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 5: [2022-11-26 19:07:52,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 19:07:52,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 19:07:52,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 3: [2022-11-26 19:07:52,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:07:52,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 19:07:52,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 15: [2022-11-26 19:07:52,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:07:52,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 19:07:52,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 10: [2022-11-26 19:07:52,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:07:52,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 19:07:52,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 1: [2022-11-26 19:07:52,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
0: [2022-11-26 19:07:52,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:07:52,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:07:52,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:07:52,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 7: [2022-11-26 19:07:52,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:07:52,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 0: [2022-11-26 19:07:52,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 19:07:52,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 7: [2022-11-26 19:07:52,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 19:07:52,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 0: [2022-11-26 19:07:52,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 0: [2022-11-26 19:07:52,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 
29: [2022-11-26 19:07:52,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:07:52,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 19:07:52,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 29: [2022-11-26 19:07:52,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:07:52,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 19:07:52,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 18: [2022-11-26 19:07:52,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:07:52,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 19:07:52,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 18: [2022-11-26 19:07:52,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:07:52,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 19:07:52,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 
18: [2022-11-26 19:07:52,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:07:52,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 19:07:52,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 18: [2022-11-26 19:07:52,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:07:52,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 19:07:52,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 0: [2022-11-26 19:07:52,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 19:07:52,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 2: [2022-11-26 19:07:52,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:07:52,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 19:07:52,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 7: [2022-11-26 19:07:52,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 19:07:52,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:07:52,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:07:52,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 19:07:52,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 19:07:52,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 19:07:52,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 7: [2022-11-26 19:07:52,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 7: [2022-11-26 19:07:52,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 6: [2022-11-26 19:07:52,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:07:52,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 19:07:52,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 16: [2022-11-26 19:07:52,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
16: [2022-11-26 19:07:52,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 19:07:52,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 19:07:52,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 16: [2022-11-26 19:07:52,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 19:07:52,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 19:07:52,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 16: [2022-11-26 19:07:52,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 19:07:52,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 21: [2022-11-26 19:07:52,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 19:07:52,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 19:07:52,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 11: [2022-11-26 19:07:52,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 19:07:52,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 19:07:52,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 18: [2022-11-26 19:07:52,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:07:52,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 19:07:52,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 31: [2022-11-26 19:07:52,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 19:07:52,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 19:07:52,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 20: [2022-11-26 19:07:52,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 19:07:52,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 19:07:52,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 9: [2022-11-26 19:07:52,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 19:07:52,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 19:07:52,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 25: [2022-11-26 19:07:52,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:07:52,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 27: [2022-11-26 19:07:52,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:07:52,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 27: [2022-11-26 19:07:52,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 19:07:52,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 22: [2022-11-26 19:07:52,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 19:07:52,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 19:07:52,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 28: [2022-11-26 19:07:52,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-26 19:07:52,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 4: [2022-11-26 19:07:52,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:07:52,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 19:07:52,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 28: [2022-11-26 19:07:52,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 19: [2022-11-26 19:07:52,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:07:52,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 19:07:52,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 14: [2022-11-26 19:07:52,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:07:52,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:07:52,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 19:07:52,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 
24: [2022-11-26 19:07:52,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 19:07:52,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 26: [2022-11-26 19:07:52,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 19:07:52,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 19:07:52,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 30: [2022-11-26 19:07:52,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 19:07:52,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 19:07:52,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 12: [2022-11-26 19:07:52,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:07:52,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 19:07:52,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 10: [2022-11-26 19:07:52,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 19:07:52,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 19:07:52,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 13: [2022-11-26 19:07:52,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:07:52,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 17: [2022-11-26 19:07:52,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:07:52,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 17: [2022-11-26 19:07:52,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 19:07:52,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 8: [2022-11-26 19:07:52,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:07:52,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 19:07:52,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 29: [2022-11-26 19:07:52,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-26 19:07:52,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 19:07:52,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 15: [2022-11-26 19:07:52,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:07:52,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 19:07:52,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 23: [2022-11-26 19:07:52,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:07:52,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 19:07:52,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 6: [2022-11-26 19:07:52,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:07:52,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 19:07:52,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 3: [2022-11-26 19:07:52,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 19:07:52,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 19:07:52,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 5: [2022-11-26 19:07:52,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:07:52,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 19:07:52,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 1: [2022-11-26 19:07:52,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:07:52,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 19:07:52,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 0: [2022-11-26 19:07:52,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:07:52,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 19:07:52,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 16: [2022-11-26 19:07:52,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
16: [2022-11-26 19:07:52,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 19:07:52,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 2: [2022-11-26 19:07:52,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:07:52,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 19:07:52,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 11: [2022-11-26 19:07:52,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:07:53,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 19:07:53,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 7: [2022-11-26 19:07:53,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:07:53,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 19:07:53,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 21: [2022-11-26 19:07:53,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-26 19:07:53,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 19:07:53,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 27: [2022-11-26 19:07:53,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:07:53,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 19:07:53,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 4: [2022-11-26 19:07:53,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:07:53,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 19:07:53,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 28: [2022-11-26 19:07:53,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:07:53,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-26 19:07:53,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 19: [2022-11-26 19:07:53,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 28: [2022-11-26 19:07:53,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 18: [2022-11-26 19:07:53,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 19: [2022-11-26 19:07:53,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 19:07:53,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 28: [2022-11-26 19:07:53,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 25: [2022-11-26 19:07:53,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:07:53,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 19:07:53,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 13: [2022-11-26 19:07:53,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 19:07:53,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 22: [2022-11-26 19:07:53,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:07:53,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 22: [2022-11-26 19:07:53,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 24: [2022-11-26 19:07:53,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 22: [2022-11-26 19:07:53,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 24: [2022-11-26 19:07:53,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 19:07:53,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 30: [2022-11-26 19:07:53,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 19:07:53,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 19:07:53,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 20: [2022-11-26 19:07:53,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
20: [2022-11-26 19:07:53,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 19:07:53,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 14: [2022-11-26 19:07:53,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:07:53,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 19:07:53,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 31: [2022-11-26 19:07:53,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 19:07:53,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 19:07:53,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 17: [2022-11-26 19:07:53,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:07:53,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 19:07:53,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 17: [2022-11-26 19:07:53,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 19:07:53,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 10: [2022-11-26 19:07:53,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:07:53,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 10: [2022-11-26 19:07:53,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 19:07:53,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 26: [2022-11-26 19:07:53,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 19:07:53,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 19:07:53,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 8: [2022-11-26 19:07:53,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 19:07:53,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 19:07:53,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 29: [2022-11-26 19:07:53,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:07:53,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 19:07:53,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 12: [2022-11-26 19:07:53,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:07:53,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 19:07:53,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 15: [2022-11-26 19:07:53,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:07:53,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 19:07:53,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 3: [2022-11-26 19:07:53,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 19:07:53,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 19:07:53,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 23: [2022-11-26 19:07:53,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:07:53,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 19:07:53,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 6: [2022-11-26 19:07:53,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:07:53,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 19:07:53,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 5: [2022-11-26 19:07:53,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:07:53,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 19:07:53,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 1: [2022-11-26 19:07:53,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-26 19:07:53,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 19:07:53,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 21: [2022-11-26 19:07:53,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 19:07:53,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 19:07:53,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 7: [2022-11-26 19:07:53,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:07:53,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 19:07:53,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 0: [2022-11-26 19:07:53,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:07:53,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 19:07:53,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 4: [2022-11-26 19:07:53,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 19:07:53,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 19:07:53,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 2: [2022-11-26 19:07:53,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:07:53,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 19:07:53,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 16: [2022-11-26 19:07:53,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 19:07:53,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 19:07:53,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 28: [2022-11-26 19:07:53,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 19:07:53,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 19:07:53,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 11: [2022-11-26 19:07:53,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 19:07:53,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 19:07:53,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 18: [2022-11-26 19:07:53,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:07:53,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 19:07:53,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 26: [2022-11-26 19:07:53,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 19:07:53,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 19:07:53,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 8: [2022-11-26 19:07:53,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:07:53,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:07:53,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
20: [2022-11-26 19:07:53,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:07:53,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:07:53,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 10: [2022-11-26 19:07:53,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 19:07:53,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 25: [2022-11-26 19:07:53,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 19: [2022-11-26 19:07:53,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 8: [2022-11-26 19:07:53,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 20: [2022-11-26 19:07:53,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 25: [2022-11-26 19:07:53,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 19: [2022-11-26 19:07:53,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 20: [2022-11-26 19:07:53,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 
22: [2022-11-26 19:07:53,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 19:07:53,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 19:07:53,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 12: [2022-11-26 19:07:53,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:07:53,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 14: [2022-11-26 19:07:53,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 31: [2022-11-26 19:07:53,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:07:53,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 14: [2022-11-26 19:07:53,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 19:07:53,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 24: [2022-11-26 19:07:53,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
31: [2022-11-26 19:07:53,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 19:07:53,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 24: [2022-11-26 19:07:53,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 19:07:53,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 1: [2022-11-26 19:07:53,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:07:53,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 19:07:53,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 13: [2022-11-26 19:07:53,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:07:53,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:07:53,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 19:07:53,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 
3: [2022-11-26 19:07:53,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 19:07:53,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 30: [2022-11-26 19:07:53,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 19:07:53,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 19:07:53,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 9: [2022-11-26 19:07:53,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:07:53,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 19:07:53,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 29: [2022-11-26 19:07:53,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:07:53,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 19:07:53,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 6: [2022-11-26 19:07:53,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 19:07:53,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 19:07:53,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 2: [2022-11-26 19:07:53,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:07:53,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 19:07:53,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 7: [2022-11-26 19:07:53,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:07:53,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 19:07:53,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 4: [2022-11-26 19:07:53,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:07:53,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 19:07:53,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 0: [2022-11-26 19:07:53,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
18: [2022-11-26 19:07:53,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:07:53,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 18: [2022-11-26 19:07:53,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 0: [2022-11-26 19:07:53,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 18: [2022-11-26 19:07:53,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 9: [2022-11-26 19:07:53,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 20: [2022-11-26 19:07:53,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:07:53,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 19:07:53,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 20: [2022-11-26 19:07:53,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 19:07:53,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 17: [2022-11-26 19:07:53,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-26 19:07:53,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 28: [2022-11-26 19:07:53,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 17: [2022-11-26 19:07:53,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 28: [2022-11-26 19:07:53,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 19:07:53,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 24: [2022-11-26 19:07:53,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:07:53,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 19:07:53,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 5: [2022-11-26 19:07:53,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:07:53,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 19:07:53,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 19: [2022-11-26 19:07:53,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
27: [2022-11-26 19:07:53,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:07:53,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 25: [2022-11-26 19:07:53,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:07:53,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 19: [2022-11-26 19:07:53,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 27: [2022-11-26 19:07:53,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 25: [2022-11-26 19:07:53,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 19:07:53,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 22: [2022-11-26 19:07:53,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 19:07:53,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 14: [2022-11-26 19:07:53,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 22: [2022-11-26 19:07:53,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 
14: [2022-11-26 19:07:53,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 19:07:53,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 15: [2022-11-26 19:07:53,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:07:53,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 19:07:53,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 23: [2022-11-26 19:07:53,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:07:53,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 19:07:53,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 31: [2022-11-26 19:07:53,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 16: [2022-11-26 19:07:53,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 31: [2022-11-26 19:07:53,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 19:07:53,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 
16: [2022-11-26 19:07:53,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 19:07:53,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 13: [2022-11-26 19:07:53,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:07:53,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 8: [2022-11-26 19:07:53,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:07:53,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 26: [2022-11-26 19:07:53,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:07:53,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 26: [2022-11-26 19:07:53,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 8: [2022-11-26 19:07:53,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 26: [2022-11-26 19:07:53,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 11: [2022-11-26 19:07:53,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 19:07:53,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 19:07:53,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 21: [2022-11-26 19:07:53,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 19:07:53,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 19:07:53,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 29: [2022-11-26 19:07:53,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:07:53,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 19:07:53,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 12: [2022-11-26 19:07:53,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:07:53,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 19:07:53,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 10: [2022-11-26 19:07:53,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-26 19:07:53,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 19:07:53,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 15: [2022-11-26 19:07:53,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:07:53,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 19:07:53,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 27: [2022-11-26 19:07:53,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:07:53,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 19:07:53,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 23: [2022-11-26 19:07:53,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:07:53,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 19:07:53,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 30: [2022-11-26 19:07:53,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-26 19:07:53,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 19:07:53,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 17: [2022-11-26 19:07:53,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 19:07:53,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 19:07:53,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 19:07:53,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 17: [2022-11-26 19:07:53,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 19:07:53,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 10: [2022-11-26 19:07:53,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:07:53,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 19:07:53,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 29: [2022-11-26 19:07:53,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
29: [2022-11-26 19:07:53,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 19:07:53,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 16: [2022-11-26 19:07:53,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 19:07:53,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 19:07:53,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 19:07:53,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step111000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 19:07:53,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 16: [2022-11-26 19:07:53,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 
0: successfully saved checkpoint at iteration 111000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2551.53 31: iteration 111010/ 173500 | consumed samples: 28418560 | consumed tokens: 58201210880 | elapsed time per iteration (s): 1.03 | learning rate: 7.266E-05 | global batch size: 256 | lm loss: 1.986036E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.699 | TFLOPs: 15.11 | 31: iteration 111020/ 173500 | consumed samples: 28421120 | consumed tokens: 58206453760 | elapsed time per iteration (s): 0.76 | learning rate: 7.264E-05 | global batch size: 256 | lm loss: 1.975161E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.056 | TFLOPs: 20.39 | 31: iteration 111030/ 173500 | consumed samples: 28423680 | consumed tokens: 58211696640 | elapsed time per iteration (s): 0.75 | learning rate: 7.263E-05 | global batch size: 256 | lm loss: 1.971186E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.132 | TFLOPs: 20.70 | 31: iteration 111040/ 173500 | consumed samples: 28426240 | consumed tokens: 58216939520 | elapsed time per iteration (s): 0.73 | learning rate: 7.261E-05 | global batch size: 256 | lm loss: 1.959251E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.596 | TFLOPs: 21.33 | 31: iteration 111050/ 173500 | consumed samples: 28428800 | consumed tokens: 58222182400 | elapsed time per iteration (s): 0.77 | learning rate: 7.260E-05 | global batch size: 256 | lm loss: 1.931302E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.608 | TFLOPs: 20.00 | 31: iteration 111060/ 173500 | consumed samples: 28431360 | consumed tokens: 58227425280 | elapsed time per iteration (s): 
0.78 | learning rate: 7.258E-05 | global batch size: 256 | lm loss: 1.933726E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.277 | TFLOPs: 19.80 | 31: iteration 111070/ 173500 | consumed samples: 28433920 | consumed tokens: 58232668160 | elapsed time per iteration (s): 0.81 | learning rate: 7.257E-05 | global batch size: 256 | lm loss: 1.949994E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.451 | TFLOPs: 19.14 | 31: iteration 111080/ 173500 | consumed samples: 28436480 | consumed tokens: 58237911040 | elapsed time per iteration (s): 0.78 | learning rate: 7.255E-05 | global batch size: 256 | lm loss: 1.953463E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.381 | TFLOPs: 19.93 | 31: iteration 111090/ 173500 | consumed samples: 28439040 | consumed tokens: 58243153920 | elapsed time per iteration (s): 0.75 | learning rate: 7.254E-05 | global batch size: 256 | lm loss: 1.967098E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.534 | TFLOPs: 20.72 | 31: iteration 111100/ 173500 | consumed samples: 28441600 | consumed tokens: 58248396800 | elapsed time per iteration (s): 0.79 | learning rate: 7.252E-05 | global batch size: 256 | lm loss: 1.986645E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.187 | TFLOPs: 19.61 | 31: iteration 111110/ 173500 | consumed samples: 28444160 | consumed tokens: 58253639680 | elapsed time per iteration (s): 0.75 | learning rate: 7.251E-05 | global batch size: 256 | lm loss: 1.972041E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.472 | TFLOPs: 20.60 | 31: 
iteration 111120/ 173500 | consumed samples: 28446720 | consumed tokens: 58258882560 | elapsed time per iteration (s): 0.70 | learning rate: 7.249E-05 | global batch size: 256 | lm loss: 1.915290E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 363.974 | TFLOPs: 22.02 | 31: iteration 111130/ 173500 | consumed samples: 28449280 | consumed tokens: 58264125440 | elapsed time per iteration (s): 0.77 | learning rate: 7.248E-05 | global batch size: 256 | lm loss: 1.966089E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.190 | TFLOPs: 20.16 | 31: iteration 111140/ 173500 | consumed samples: 28451840 | consumed tokens: 58269368320 | elapsed time per iteration (s): 0.79 | learning rate: 7.246E-05 | global batch size: 256 | lm loss: 1.970025E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.216 | TFLOPs: 19.61 | 31: iteration 111150/ 173500 | consumed samples: 28454400 | consumed tokens: 58274611200 | elapsed time per iteration (s): 0.74 | learning rate: 7.245E-05 | global batch size: 256 | lm loss: 1.952897E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.168 | TFLOPs: 21.00 | 31: iteration 111160/ 173500 | consumed samples: 28456960 | consumed tokens: 58279854080 | elapsed time per iteration (s): 0.86 | learning rate: 7.243E-05 | global batch size: 256 | lm loss: 1.935629E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.917 | TFLOPs: 18.08 | 31: iteration 111170/ 173500 | consumed samples: 28459520 | consumed tokens: 58285096960 | elapsed time per iteration (s): 0.72 | learning rate: 7.242E-05 | global batch size: 256 | lm loss: 1.985591E+00 | grad norm: 0.162 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.178 | TFLOPs: 21.43 | 31: iteration 111180/ 173500 | consumed samples: 28462080 | consumed tokens: 58290339840 | elapsed time per iteration (s): 0.77 | learning rate: 7.240E-05 | global batch size: 256 | lm loss: 1.949745E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.661 | TFLOPs: 20.06 | 31: iteration 111190/ 173500 | consumed samples: 28464640 | consumed tokens: 58295582720 | elapsed time per iteration (s): 0.82 | learning rate: 7.239E-05 | global batch size: 256 | lm loss: 1.937834E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.883 | TFLOPs: 18.99 | 31: iteration 111200/ 173500 | consumed samples: 28467200 | consumed tokens: 58300825600 | elapsed time per iteration (s): 0.76 | learning rate: 7.237E-05 | global batch size: 256 | lm loss: 1.972386E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.314 | TFLOPs: 20.41 | 31: iteration 111210/ 173500 | consumed samples: 28469760 | consumed tokens: 58306068480 | elapsed time per iteration (s): 0.71 | learning rate: 7.236E-05 | global batch size: 256 | lm loss: 1.925478E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 360.250 | TFLOPs: 21.79 | 31: iteration 111220/ 173500 | consumed samples: 28472320 | consumed tokens: 58311311360 | elapsed time per iteration (s): 0.72 | learning rate: 7.234E-05 | global batch size: 256 | lm loss: 1.976799E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.380 | TFLOPs: 21.50 | 31: iteration 111230/ 173500 | consumed samples: 28474880 | consumed tokens: 58316554240 | elapsed time per iteration (s): 0.83 | 
learning rate: 7.233E-05 | global batch size: 256 | lm loss: 1.950714E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.224 | TFLOPs: 18.77 | 31: iteration 111240/ 173500 | consumed samples: 28477440 | consumed tokens: 58321797120 | elapsed time per iteration (s): 0.78 | learning rate: 7.231E-05 | global batch size: 256 | lm loss: 1.989210E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.578 | TFLOPs: 19.88 | 31: iteration 111250/ 173500 | consumed samples: 28480000 | consumed tokens: 58327040000 | elapsed time per iteration (s): 0.90 | learning rate: 7.230E-05 | global batch size: 256 | lm loss: 1.949902E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.541 | TFLOPs: 17.21 | 31: iteration 111260/ 173500 | consumed samples: 28482560 | consumed tokens: 58332282880 | elapsed time per iteration (s): 0.74 | learning rate: 7.228E-05 | global batch size: 256 | lm loss: 1.952980E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.982 | TFLOPs: 21.05 | 31: iteration 111270/ 173500 | consumed samples: 28485120 | consumed tokens: 58337525760 | elapsed time per iteration (s): 0.71 | learning rate: 7.227E-05 | global batch size: 256 | lm loss: 1.947040E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 360.487 | TFLOPs: 21.81 | 31: iteration 111280/ 173500 | consumed samples: 28487680 | consumed tokens: 58342768640 | elapsed time per iteration (s): 0.79 | learning rate: 7.225E-05 | global batch size: 256 | lm loss: 1.949161E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.064 | TFLOPs: 19.54 | 31: iteration 
111290/ 173500 | consumed samples: 28490240 | consumed tokens: 58348011520 | elapsed time per iteration (s): 0.79 | learning rate: 7.224E-05 | global batch size: 256 | lm loss: 1.933457E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.530 | TFLOPs: 19.57 | 31: iteration 111300/ 173500 | consumed samples: 28492800 | consumed tokens: 58353254400 | elapsed time per iteration (s): 0.92 | learning rate: 7.222E-05 | global batch size: 256 | lm loss: 1.966271E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 276.791 | TFLOPs: 16.75 | 31: iteration 111310/ 173500 | consumed samples: 28495360 | consumed tokens: 58358497280 | elapsed time per iteration (s): 0.84 | learning rate: 7.221E-05 | global batch size: 256 | lm loss: 1.939824E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.239 | TFLOPs: 18.47 | 31: iteration 111320/ 173500 | consumed samples: 28497920 | consumed tokens: 58363740160 | elapsed time per iteration (s): 0.82 | learning rate: 7.219E-05 | global batch size: 256 | lm loss: 1.946776E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.632 | TFLOPs: 18.79 | 31: iteration 111330/ 173500 | consumed samples: 28500480 | consumed tokens: 58368983040 | elapsed time per iteration (s): 0.81 | learning rate: 7.218E-05 | global batch size: 256 | lm loss: 1.955511E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.105 | TFLOPs: 19.06 | 31: iteration 111340/ 173500 | consumed samples: 28503040 | consumed tokens: 58374225920 | elapsed time per iteration (s): 0.82 | learning rate: 7.216E-05 | global batch size: 256 | lm loss: 1.958644E+00 | grad norm: 0.162 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.998 | TFLOPs: 18.81 | 31: iteration 111350/ 173500 | consumed samples: 28505600 | consumed tokens: 58379468800 | elapsed time per iteration (s): 0.76 | learning rate: 7.215E-05 | global batch size: 256 | lm loss: 1.953495E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.488 | TFLOPs: 20.30 | 31: iteration 111360/ 173500 | consumed samples: 28508160 | consumed tokens: 58384711680 | elapsed time per iteration (s): 0.86 | learning rate: 7.213E-05 | global batch size: 256 | lm loss: 1.971728E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.704 | TFLOPs: 18.07 | 31: iteration 111370/ 173500 | consumed samples: 28510720 | consumed tokens: 58389954560 | elapsed time per iteration (s): 0.79 | learning rate: 7.212E-05 | global batch size: 256 | lm loss: 1.957626E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.402 | TFLOPs: 19.63 | 31: iteration 111380/ 173500 | consumed samples: 28513280 | consumed tokens: 58395197440 | elapsed time per iteration (s): 0.79 | learning rate: 7.210E-05 | global batch size: 256 | lm loss: 1.978797E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.893 | TFLOPs: 19.66 | 31: iteration 111390/ 173500 | consumed samples: 28515840 | consumed tokens: 58400440320 | elapsed time per iteration (s): 0.78 | learning rate: 7.209E-05 | global batch size: 256 | lm loss: 1.969502E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.268 | TFLOPs: 19.74 | 31: iteration 111400/ 173500 | consumed samples: 28518400 | consumed tokens: 58405683200 | elapsed time per iteration (s): 0.80 | learning 
rate: 7.207E-05 | global batch size: 256 | lm loss: 1.933631E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.975 | TFLOPs: 19.42 | 31: iteration 111410/ 173500 | consumed samples: 28520960 | consumed tokens: 58410926080 | elapsed time per iteration (s): 0.79 | learning rate: 7.206E-05 | global batch size: 256 | lm loss: 1.953401E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.495 | TFLOPs: 19.63 | 31: iteration 111420/ 173500 | consumed samples: 28523520 | consumed tokens: 58416168960 | elapsed time per iteration (s): 0.77 | learning rate: 7.205E-05 | global batch size: 256 | lm loss: 1.946505E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.145 | TFLOPs: 20.15 | 31: iteration 111430/ 173500 | consumed samples: 28526080 | consumed tokens: 58421411840 | elapsed time per iteration (s): 0.72 | learning rate: 7.203E-05 | global batch size: 256 | lm loss: 1.992421E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.488 | TFLOPs: 21.57 | 31: iteration 111440/ 173500 | consumed samples: 28528640 | consumed tokens: 58426654720 | elapsed time per iteration (s): 0.74 | learning rate: 7.202E-05 | global batch size: 256 | lm loss: 1.944078E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.968 | TFLOPs: 20.93 | 31: iteration 111450/ 173500 | consumed samples: 28531200 | consumed tokens: 58431897600 | elapsed time per iteration (s): 0.74 | learning rate: 7.200E-05 | global batch size: 256 | lm loss: 1.947877E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.602 | TFLOPs: 20.91 | 31: iteration 111460/ 
173500 | consumed samples: 28533760 | consumed tokens: 58437140480 | elapsed time per iteration (s): 0.76 | learning rate: 7.199E-05 | global batch size: 256 | lm loss: 1.947882E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.595 | TFLOPs: 20.36 | 31: iteration 111470/ 173500 | consumed samples: 28536320 | consumed tokens: 58442383360 | elapsed time per iteration (s): 0.75 | learning rate: 7.197E-05 | global batch size: 256 | lm loss: 1.971216E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.491 | TFLOPs: 20.78 | 31: iteration 111480/ 173500 | consumed samples: 28538880 | consumed tokens: 58447626240 | elapsed time per iteration (s): 0.78 | learning rate: 7.196E-05 | global batch size: 256 | lm loss: 1.985730E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.787 | TFLOPs: 19.83 | 31: iteration 111490/ 173500 | consumed samples: 28541440 | consumed tokens: 58452869120 | elapsed time per iteration (s): 0.75 | learning rate: 7.194E-05 | global batch size: 256 | lm loss: 1.953369E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.015 | TFLOPs: 20.69 | 31: iteration 111500/ 173500 | consumed samples: 28544000 | consumed tokens: 58458112000 | elapsed time per iteration (s): 0.75 | learning rate: 7.193E-05 | global batch size: 256 | lm loss: 1.990660E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.859 | TFLOPs: 20.74 | 31: iteration 111510/ 173500 | consumed samples: 28546560 | consumed tokens: 58463354880 | elapsed time per iteration (s): 0.72 | learning rate: 7.191E-05 | global batch size: 256 | lm loss: 1.981390E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 355.622 | TFLOPs: 21.51 | 31: iteration 111520/ 173500 | consumed samples: 28549120 | consumed tokens: 58468597760 | elapsed time per iteration (s): 0.85 | learning rate: 7.190E-05 | global batch size: 256 | lm loss: 1.948364E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.118 | TFLOPs: 18.22 | 31: iteration 111530/ 173500 | consumed samples: 28551680 | consumed tokens: 58473840640 | elapsed time per iteration (s): 0.75 | learning rate: 7.188E-05 | global batch size: 256 | lm loss: 1.944671E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.378 | TFLOPs: 20.71 | 31: iteration 111540/ 173500 | consumed samples: 28554240 | consumed tokens: 58479083520 | elapsed time per iteration (s): 0.75 | learning rate: 7.187E-05 | global batch size: 256 | lm loss: 1.950530E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.468 | TFLOPs: 20.54 | 31: iteration 111550/ 173500 | consumed samples: 28556800 | consumed tokens: 58484326400 | elapsed time per iteration (s): 0.77 | learning rate: 7.185E-05 | global batch size: 256 | lm loss: 1.958018E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.560 | TFLOPs: 20.18 | 31: iteration 111560/ 173500 | consumed samples: 28559360 | consumed tokens: 58489569280 | elapsed time per iteration (s): 0.76 | learning rate: 7.184E-05 | global batch size: 256 | lm loss: 1.955330E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.970 | TFLOPs: 20.51 | 31: iteration 111570/ 173500 | consumed samples: 28561920 | consumed tokens: 58494812160 | elapsed time per iteration (s): 0.87 | learning rate: 
7.182E-05 | global batch size: 256 | lm loss: 1.983176E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.591 | TFLOPs: 17.88 | 31: iteration 111580/ 173500 | consumed samples: 28564480 | consumed tokens: 58500055040 | elapsed time per iteration (s): 0.77 | learning rate: 7.181E-05 | global batch size: 256 | lm loss: 1.945461E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.582 | TFLOPs: 20.06 | 31: iteration 111590/ 173500 | consumed samples: 28567040 | consumed tokens: 58505297920 | elapsed time per iteration (s): 0.83 | learning rate: 7.179E-05 | global batch size: 256 | lm loss: 1.955326E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.919 | TFLOPs: 18.69 | 31: iteration 111600/ 173500 | consumed samples: 28569600 | consumed tokens: 58510540800 | elapsed time per iteration (s): 0.75 | learning rate: 7.178E-05 | global batch size: 256 | lm loss: 1.967986E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.892 | TFLOPs: 20.56 | 31: iteration 111610/ 173500 | consumed samples: 28572160 | consumed tokens: 58515783680 | elapsed time per iteration (s): 0.80 | learning rate: 7.176E-05 | global batch size: 256 | lm loss: 1.965067E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.362 | TFLOPs: 19.26 | 31: iteration 111620/ 173500 | consumed samples: 28574720 | consumed tokens: 58521026560 | elapsed time per iteration (s): 0.80 | learning rate: 7.175E-05 | global batch size: 256 | lm loss: 1.957161E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.920 | TFLOPs: 19.29 | 31: iteration 111630/ 173500 | 
consumed samples: 28577280 | consumed tokens: 58526269440 | elapsed time per iteration (s): 0.80 | learning rate: 7.173E-05 | global batch size: 256 | lm loss: 1.940076E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.772 | TFLOPs: 19.28 | 31: iteration 111640/ 173500 | consumed samples: 28579840 | consumed tokens: 58531512320 | elapsed time per iteration (s): 0.82 | learning rate: 7.172E-05 | global batch size: 256 | lm loss: 1.968814E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.544 | TFLOPs: 18.79 | 31: iteration 111650/ 173500 | consumed samples: 28582400 | consumed tokens: 58536755200 | elapsed time per iteration (s): 0.75 | learning rate: 7.170E-05 | global batch size: 256 | lm loss: 1.953913E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.786 | TFLOPs: 20.68 | 31: iteration 111660/ 173500 | consumed samples: 28584960 | consumed tokens: 58541998080 | elapsed time per iteration (s): 0.74 | learning rate: 7.169E-05 | global batch size: 256 | lm loss: 1.912968E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.940 | TFLOPs: 20.81 | 31: iteration 111670/ 173500 | consumed samples: 28587520 | consumed tokens: 58547240960 | elapsed time per iteration (s): 0.76 | learning rate: 7.167E-05 | global batch size: 256 | lm loss: 1.953805E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.899 | TFLOPs: 20.26 | 31: iteration 111680/ 173500 | consumed samples: 28590080 | consumed tokens: 58552483840 | elapsed time per iteration (s): 0.79 | learning rate: 7.166E-05 | global batch size: 256 | lm loss: 1.977748E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 325.463 | TFLOPs: 19.69 | 31: iteration 111690/ 173500 | consumed samples: 28592640 | consumed tokens: 58557726720 | elapsed time per iteration (s): 0.86 | learning rate: 7.164E-05 | global batch size: 256 | lm loss: 1.951910E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.627 | TFLOPs: 17.95 | 31: iteration 111700/ 173500 | consumed samples: 28595200 | consumed tokens: 58562969600 | elapsed time per iteration (s): 0.89 | learning rate: 7.163E-05 | global batch size: 256 | lm loss: 1.945000E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.537 | TFLOPs: 17.46 | 31: iteration 111710/ 173500 | consumed samples: 28597760 | consumed tokens: 58568212480 | elapsed time per iteration (s): 0.76 | learning rate: 7.161E-05 | global batch size: 256 | lm loss: 1.947713E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.686 | TFLOPs: 20.43 | 31: iteration 111720/ 173500 | consumed samples: 28600320 | consumed tokens: 58573455360 | elapsed time per iteration (s): 0.85 | learning rate: 7.160E-05 | global batch size: 256 | lm loss: 1.957903E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.055 | TFLOPs: 18.27 | 31: iteration 111730/ 173500 | consumed samples: 28602880 | consumed tokens: 58578698240 | elapsed time per iteration (s): 0.74 | learning rate: 7.158E-05 | global batch size: 256 | lm loss: 1.966900E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.819 | TFLOPs: 20.80 | 31: iteration 111740/ 173500 | consumed samples: 28605440 | consumed tokens: 58583941120 | elapsed time per iteration (s): 0.86 | learning rate: 
7.157E-05 | global batch size: 256 | lm loss: 1.963144E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.942 | TFLOPs: 18.09 | 31: iteration 111750/ 173500 | consumed samples: 28608000 | consumed tokens: 58589184000 | elapsed time per iteration (s): 0.74 | learning rate: 7.155E-05 | global batch size: 256 | lm loss: 1.975826E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.076 | TFLOPs: 21.06 | 31: iteration 111760/ 173500 | consumed samples: 28610560 | consumed tokens: 58594426880 | elapsed time per iteration (s): 0.76 | learning rate: 7.154E-05 | global batch size: 256 | lm loss: 1.957453E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.388 | TFLOPs: 20.35 | 31: iteration 111770/ 173500 | consumed samples: 28613120 | consumed tokens: 58599669760 | elapsed time per iteration (s): 0.78 | learning rate: 7.152E-05 | global batch size: 256 | lm loss: 1.996506E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.092 | TFLOPs: 19.97 | 31: iteration 111780/ 173500 | consumed samples: 28615680 | consumed tokens: 58604912640 | elapsed time per iteration (s): 0.76 | learning rate: 7.151E-05 | global batch size: 256 | lm loss: 1.943549E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.039 | TFLOPs: 20.51 | 31: iteration 111790/ 173500 | consumed samples: 28618240 | consumed tokens: 58610155520 | elapsed time per iteration (s): 0.81 | learning rate: 7.149E-05 | global batch size: 256 | lm loss: 1.946930E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.261 | TFLOPs: 19.01 | 31: iteration 111800/ 173500 | 
consumed samples: 28620800 | consumed tokens: 58615398400 | elapsed time per iteration (s): 0.78 | learning rate: 7.148E-05 | global batch size: 256 | lm loss: 1.923707E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.564 | TFLOPs: 19.76 | 31: iteration 111810/ 173500 | consumed samples: 28623360 | consumed tokens: 58620641280 | elapsed time per iteration (s): 0.73 | learning rate: 7.146E-05 | global batch size: 256 | lm loss: 1.969949E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.428 | TFLOPs: 21.20 | 31: iteration 111820/ 173500 | consumed samples: 28625920 | consumed tokens: 58625884160 | elapsed time per iteration (s): 0.79 | learning rate: 7.145E-05 | global batch size: 256 | lm loss: 1.977459E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.512 | TFLOPs: 19.69 | 31: iteration 111830/ 173500 | consumed samples: 28628480 | consumed tokens: 58631127040 | elapsed time per iteration (s): 0.78 | learning rate: 7.143E-05 | global batch size: 256 | lm loss: 1.946834E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.872 | TFLOPs: 19.84 | 31: iteration 111840/ 173500 | consumed samples: 28631040 | consumed tokens: 58636369920 | elapsed time per iteration (s): 0.85 | learning rate: 7.142E-05 | global batch size: 256 | lm loss: 1.947206E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.996 | TFLOPs: 18.21 | 31: iteration 111850/ 173500 | consumed samples: 28633600 | consumed tokens: 58641612800 | elapsed time per iteration (s): 0.72 | learning rate: 7.140E-05 | global batch size: 256 | lm loss: 1.918544E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 356.719 | TFLOPs: 21.58 | 31: iteration 111860/ 173500 | consumed samples: 28636160 | consumed tokens: 58646855680 | elapsed time per iteration (s): 0.76 | learning rate: 7.139E-05 | global batch size: 256 | lm loss: 1.960639E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.179 | TFLOPs: 20.34 | 31: iteration 111870/ 173500 | consumed samples: 28638720 | consumed tokens: 58652098560 | elapsed time per iteration (s): 0.74 | learning rate: 7.137E-05 | global batch size: 256 | lm loss: 1.958938E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.219 | TFLOPs: 20.88 | 31: iteration 111880/ 173500 | consumed samples: 28641280 | consumed tokens: 58657341440 | elapsed time per iteration (s): 0.78 | learning rate: 7.136E-05 | global batch size: 256 | lm loss: 1.930464E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.579 | TFLOPs: 19.82 | 31: iteration 111890/ 173500 | consumed samples: 28643840 | consumed tokens: 58662584320 | elapsed time per iteration (s): 0.80 | learning rate: 7.135E-05 | global batch size: 256 | lm loss: 1.969315E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.997 | TFLOPs: 19.48 | 31: iteration 111900/ 173500 | consumed samples: 28646400 | consumed tokens: 58667827200 | elapsed time per iteration (s): 0.81 | learning rate: 7.133E-05 | global batch size: 256 | lm loss: 1.939235E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.081 | TFLOPs: 19.06 | 31: iteration 111910/ 173500 | consumed samples: 28648960 | consumed tokens: 58673070080 | elapsed time per iteration (s): 0.93 | learning rate: 
7.132E-05 | global batch size: 256 | lm loss: 1.963870E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 274.530 | TFLOPs: 16.61 | 31: iteration 111920/ 173500 | consumed samples: 28651520 | consumed tokens: 58678312960 | elapsed time per iteration (s): 0.85 | learning rate: 7.130E-05 | global batch size: 256 | lm loss: 1.938492E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.689 | TFLOPs: 18.19 | 31: iteration 111930/ 173500 | consumed samples: 28654080 | consumed tokens: 58683555840 | elapsed time per iteration (s): 0.79 | learning rate: 7.129E-05 | global batch size: 256 | lm loss: 1.970407E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.066 | TFLOPs: 19.73 | 31: iteration 111940/ 173500 | consumed samples: 28656640 | consumed tokens: 58688798720 | elapsed time per iteration (s): 0.81 | learning rate: 7.127E-05 | global batch size: 256 | lm loss: 1.979189E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.637 | TFLOPs: 19.16 | 31: iteration 111950/ 173500 | consumed samples: 28659200 | consumed tokens: 58694041600 | elapsed time per iteration (s): 0.92 | learning rate: 7.126E-05 | global batch size: 256 | lm loss: 1.983212E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 278.120 | TFLOPs: 16.83 | 31: iteration 111960/ 173500 | consumed samples: 28661760 | consumed tokens: 58699284480 | elapsed time per iteration (s): 0.87 | learning rate: 7.124E-05 | global batch size: 256 | lm loss: 1.958229E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.075 | TFLOPs: 17.85 | 31: iteration 111970/ 173500 | 
consumed samples: 28664320 | consumed tokens: 58704527360 | elapsed time per iteration (s): 0.88 | learning rate: 7.123E-05 | global batch size: 256 | lm loss: 1.967490E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.586 | TFLOPs: 17.52 | 31: iteration 111980/ 173500 | consumed samples: 28666880 | consumed tokens: 58709770240 | elapsed time per iteration (s): 0.81 | learning rate: 7.121E-05 | global batch size: 256 | lm loss: 1.923576E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.260 | TFLOPs: 19.01 | 31: iteration 111990/ 173500 | consumed samples: 28669440 | consumed tokens: 58715013120 | elapsed time per iteration (s): 0.78 | learning rate: 7.120E-05 | global batch size: 256 | lm loss: 1.972321E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.932 | TFLOPs: 19.96 | 0: [2022-11-26 19:21:00,161] [INFO] [logging.py:68:log_dist] [Rank 0] step=112000, skipped=0, lr=[7.118156405567987e-05, 7.118156405567987e-05, 7.118156405567987e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 112000/ 173500 | consumed samples: 28672000 | consumed tokens: 58720256000 | elapsed time per iteration (s): 0.79 | learning rate: 7.118E-05 | global batch size: 256 | lm loss: 1.965426E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.275 | TFLOPs: 19.50 | 0: steps: 112000 loss: 1.9457 iter time (s): 0.776 samples/sec: 329.963 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 112000 | lm loss value: 1.913700E+00 | lm loss PPL: 6.778125E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 112000 to 
checkpoints_1b1long 0: [2022-11-26 19:21:00,412] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step112000 is begin to save! 0: [2022-11-26 19:21:00,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_01-model_00-model_states.pt... 0: [2022-11-26 19:21:00,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_01-model_00-model_states.pt. 0: [2022-11-26 19:21:00,649] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_03-model_00-model_states.pt... 0: [2022-11-26 19:21:00,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_03-model_00-model_states.pt. 0: [2022-11-26 19:21:00,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_04-model_00-model_states.pt... 0: [2022-11-26 19:21:00,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_04-model_00-model_states.pt. 0: [2022-11-26 19:21:00,814] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_05-model_00-model_states.pt... 0: [2022-11-26 19:21:00,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_05-model_00-model_states.pt. 0: [2022-11-26 19:21:00,891] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_06-model_00-model_states.pt... 0: [2022-11-26 19:21:00,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_06-model_00-model_states.pt. 0: [2022-11-26 19:21:00,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_07-model_00-model_states.pt... 
0: [2022-11-26 19:21:01,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_07-model_00-model_states.pt. 0: [2022-11-26 19:21:01,040] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_08-model_00-model_states.pt... 0: [2022-11-26 19:21:01,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_08-model_00-model_states.pt. 0: [2022-11-26 19:21:01,116] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_09-model_00-model_states.pt... 0: [2022-11-26 19:21:01,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_09-model_00-model_states.pt. 0: [2022-11-26 19:21:01,192] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_10-model_00-model_states.pt... 0: [2022-11-26 19:21:01,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_10-model_00-model_states.pt. 0: [2022-11-26 19:21:01,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_11-model_00-model_states.pt... 0: [2022-11-26 19:21:01,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_11-model_00-model_states.pt. 0: [2022-11-26 19:21:01,339] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_12-model_00-model_states.pt... 0: [2022-11-26 19:21:01,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_12-model_00-model_states.pt. 0: [2022-11-26 19:21:01,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_13-model_00-model_states.pt... 
0: [2022-11-26 19:21:01,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_13-model_00-model_states.pt. 0: [2022-11-26 19:21:01,488] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_14-model_00-model_states.pt... 0: [2022-11-26 19:21:01,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_14-model_00-model_states.pt. 0: [2022-11-26 19:21:01,559] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_15-model_00-model_states.pt... 0: [2022-11-26 19:21:01,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_15-model_00-model_states.pt. 0: [2022-11-26 19:21:01,633] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_16-model_00-model_states.pt... 0: [2022-11-26 19:21:01,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_16-model_00-model_states.pt. 0: [2022-11-26 19:21:01,709] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_17-model_00-model_states.pt... 0: [2022-11-26 19:21:01,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_17-model_00-model_states.pt. 0: [2022-11-26 19:21:01,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_18-model_00-model_states.pt... 0: [2022-11-26 19:21:01,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_18-model_00-model_states.pt. 0: [2022-11-26 19:21:01,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_19-model_00-model_states.pt... 
0: [2022-11-26 19:21:01,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_19-model_00-model_states.pt. 0: [2022-11-26 19:21:01,930] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_20-model_00-model_states.pt... 0: [2022-11-26 19:21:02,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_20-model_00-model_states.pt. 0: [2022-11-26 19:21:02,003] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_21-model_00-model_states.pt... 0: [2022-11-26 19:21:02,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_21-model_00-model_states.pt. 0: [2022-11-26 19:21:02,076] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_22-model_00-model_states.pt... 0: [2022-11-26 19:21:02,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_22-model_00-model_states.pt. 0: [2022-11-26 19:21:02,149] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_23-model_00-model_states.pt... 0: [2022-11-26 19:21:02,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_23-model_00-model_states.pt. 0: [2022-11-26 19:21:02,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_24-model_00-model_states.pt... 0: [2022-11-26 19:21:02,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_24-model_00-model_states.pt. 0: [2022-11-26 19:21:02,296] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_25-model_00-model_states.pt... 
0: [2022-11-26 19:21:02,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_25-model_00-model_states.pt. 0: [2022-11-26 19:21:02,368] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_26-model_00-model_states.pt... 0: [2022-11-26 19:21:02,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_26-model_00-model_states.pt. 0: [2022-11-26 19:21:02,442] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_27-model_00-model_states.pt... 0: [2022-11-26 19:21:02,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_27-model_00-model_states.pt. 0: [2022-11-26 19:21:02,515] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_28-model_00-model_states.pt... 0: [2022-11-26 19:21:02,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_28-model_00-model_states.pt. 0: [2022-11-26 19:21:02,585] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/layer_30-model_00-model_states.pt... 0: [2022-11-26 19:21:02,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/layer_30-model_00-model_states.pt. 0: [2022-11-26 19:21:02,590] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step112000/mp_rank_00_model_states.pt 0: [2022-11-26 19:21:02,590] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/mp_rank_00_model_states.pt... 0: [2022-11-26 19:21:02,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/mp_rank_00_model_states.pt. 
0: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
7: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
10: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 
2: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
3: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 
25: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 
24: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 
30: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 
18: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 
27: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
4: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
13: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 
20: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 
31: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 
17: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 
19: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
10: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
15: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
28: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 
21: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 
6: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
12: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 
29: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 
24: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:21:02,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:21:02,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:21:02,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 19:21:02,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 0: [2022-11-26 19:21:02,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:21:02,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 19:21:02,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 30: [2022-11-26 19:21:02,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-26 19:21:02,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 19:21:02,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 31: [2022-11-26 19:21:02,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 19:21:02,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 19:21:02,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 2: [2022-11-26 19:21:02,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:21:02,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 19:21:02,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 4: [2022-11-26 19:21:02,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:21:02,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
4: [2022-11-26 19:21:02,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 6: [2022-11-26 19:21:02,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 4: [2022-11-26 19:21:02,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 6: [2022-11-26 19:21:02,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 18: [2022-11-26 19:21:02,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:21:02,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 19:21:02,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 1: [2022-11-26 19:21:02,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:21:02,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 19:21:02,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 16: [2022-11-26 19:21:02,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
16: [2022-11-26 19:21:02,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 19:21:02,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 26: [2022-11-26 19:21:02,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 19:21:02,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 23: [2022-11-26 19:21:02,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 26: [2022-11-26 19:21:02,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 23: [2022-11-26 19:21:02,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 19:21:02,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 17: [2022-11-26 19:21:02,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 19:21:02,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 19:21:02,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 26: [2022-11-26 19:21:02,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
12: [2022-11-26 19:21:02,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:21:02,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 26: [2022-11-26 19:21:02,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 12: [2022-11-26 19:21:02,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 19:21:02,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 3: [2022-11-26 19:21:02,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 26: [2022-11-26 19:21:02,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 3: [2022-11-26 19:21:02,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 25: [2022-11-26 19:21:02,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 3: [2022-11-26 19:21:02,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 25: [2022-11-26 19:21:02,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 25: [2022-11-26 19:21:02,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
25: [2022-11-26 19:21:02,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 19:21:02,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 2: [2022-11-26 19:21:02,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:21:02,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 19:21:02,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 24: [2022-11-26 19:21:02,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:21:02,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 19:21:02,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 27: [2022-11-26 19:21:02,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:21:02,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 29: [2022-11-26 19:21:02,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:21:02,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
29: [2022-11-26 19:21:02,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 19:21:02,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 10: [2022-11-26 19:21:02,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:21:02,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 19:21:02,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 13: [2022-11-26 19:21:02,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:21:02,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:21:02,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 14: [2022-11-26 19:21:02,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 13: [2022-11-26 19:21:02,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 14: [2022-11-26 19:21:02,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 17: [2022-11-26 19:21:02,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-26 19:21:02,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 19:21:02,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 19: [2022-11-26 19:21:02,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:21:02,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:21:02,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 5: [2022-11-26 19:21:02,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 19: [2022-11-26 19:21:02,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 5: [2022-11-26 19:21:02,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 5: [2022-11-26 19:21:02,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:21:02,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 19:21:02,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 31: [2022-11-26 19:21:02,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-26 19:21:02,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 19:21:02,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 21: [2022-11-26 19:21:02,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 19:21:02,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 19:21:02,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 1: [2022-11-26 19:21:02,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:21:02,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:21:02,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
11: [2022-11-26 19:21:02,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 18: [2022-11-26 19:21:02,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 1: [2022-11-26 19:21:02,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 11: [2022-11-26 19:21:02,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 18: [2022-11-26 19:21:02,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 1: [2022-11-26 19:21:02,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 7: [2022-11-26 19:21:02,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:21:02,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:21:02,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 19:21:02,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:21:02,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
4: [2022-11-26 19:21:02,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 9: [2022-11-26 19:21:02,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 16: [2022-11-26 19:21:02,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:21:02,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 9: [2022-11-26 19:21:02,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 16: [2022-11-26 19:21:02,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 14: [2022-11-26 19:21:02,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:21:02,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 9: [2022-11-26 19:21:02,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 16: [2022-11-26 19:21:02,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 14: [2022-11-26 19:21:02,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 4: [2022-11-26 19:21:02,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
14: [2022-11-26 19:21:02,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 18: [2022-11-26 19:21:02,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:21:02,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 19:21:02,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 12: [2022-11-26 19:21:02,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 30: [2022-11-26 19:21:02,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:21:02,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 30: [2022-11-26 19:21:02,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 12: [2022-11-26 19:21:02,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 30: [2022-11-26 19:21:02,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 23: [2022-11-26 19:21:02,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:21:02,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-26 19:21:02,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 23: [2022-11-26 19:21:02,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 19: [2022-11-26 19:21:02,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 23: [2022-11-26 19:21:02,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 3: [2022-11-26 19:21:02,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:21:02,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 19:21:02,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 10: [2022-11-26 19:21:02,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 26: [2022-11-26 19:21:02,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:21:02,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
10: [2022-11-26 19:21:02,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 26: [2022-11-26 19:21:02,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 10: [2022-11-26 19:21:02,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 11: [2022-11-26 19:21:02,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 26: [2022-11-26 19:21:02,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 12: [2022-11-26 19:21:02,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:21:02,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 12: [2022-11-26 19:21:02,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 21: [2022-11-26 19:21:02,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:21:02,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 21: [2022-11-26 19:21:02,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 19:21:02,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
6: [2022-11-26 19:21:02,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:21:02,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 19:21:02,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 24: [2022-11-26 19:21:02,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:21:02,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 25: [2022-11-26 19:21:02,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:21:02,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 2: [2022-11-26 19:21:02,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:21:02,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 24: [2022-11-26 19:21:02,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
2: [2022-11-26 19:21:02,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 25: [2022-11-26 19:21:02,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 24: [2022-11-26 19:21:02,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 2: [2022-11-26 19:21:02,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 24: [2022-11-26 19:21:02,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 0: [2022-11-26 19:21:02,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:21:02,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 19:21:02,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 31: [2022-11-26 19:21:02,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:21:02,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
31: [2022-11-26 19:21:02,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 13: [2022-11-26 19:21:02,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 31: [2022-11-26 19:21:02,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 13: [2022-11-26 19:21:02,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 20: [2022-11-26 19:21:02,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 19:21:02,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 19:21:02,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 19:21:02,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 20: [2022-11-26 19:21:02,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 19:21:02,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 27: [2022-11-26 19:21:02,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-26 19:21:02,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 19:21:02,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 6: [2022-11-26 19:21:02,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:21:02,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 19:21:02,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 27: [2022-11-26 19:21:02,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:21:02,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:21:02,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 19:21:02,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 19: [2022-11-26 19:21:02,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 20: [2022-11-26 19:21:02,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:21:02,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
20: [2022-11-26 19:21:02,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 19:21:02,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 27: [2022-11-26 19:21:02,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:21:02,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 19:21:02,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 13: [2022-11-26 19:21:02,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:21:02,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 19:21:02,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 13: [2022-11-26 19:21:02,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:21:02,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
13: [2022-11-26 19:21:02,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 7: [2022-11-26 19:21:02,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 13: [2022-11-26 19:21:02,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 7: [2022-11-26 19:21:02,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 9: [2022-11-26 19:21:02,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:21:02,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 19:21:02,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 26: [2022-11-26 19:21:02,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 30: [2022-11-26 19:21:02,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 26: [2022-11-26 19:21:02,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 29: [2022-11-26 19:21:02,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 26: [2022-11-26 19:21:02,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
29: [2022-11-26 19:21:02,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 30: [2022-11-26 19:21:02,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 29: [2022-11-26 19:21:02,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 30: [2022-11-26 19:21:02,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 0: [2022-11-26 19:21:02,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:21:02,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 23: [2022-11-26 19:21:02,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:21:02,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 21: [2022-11-26 19:21:02,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:21:02,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
23: [2022-11-26 19:21:02,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 11: [2022-11-26 19:21:02,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 21: [2022-11-26 19:21:02,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 23: [2022-11-26 19:21:02,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 11: [2022-11-26 19:21:02,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 21: [2022-11-26 19:21:02,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 4: [2022-11-26 19:21:02,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:21:02,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 19:21:02,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 25: [2022-11-26 19:21:02,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:21:02,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 19:21:02,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
24: [2022-11-26 19:21:02,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:21:02,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 1: [2022-11-26 19:21:02,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:21:02,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:21:02,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:21:02,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:21:02,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:21:02,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 17: [2022-11-26 19:21:02,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 
1: [2022-11-26 19:21:02,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 15: [2022-11-26 19:21:02,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 19:21:02,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 19:21:02,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 19:21:02,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 17: [2022-11-26 19:21:02,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 1: [2022-11-26 19:21:02,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 15: [2022-11-26 19:21:02,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 15: [2022-11-26 19:21:02,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 15: [2022-11-26 19:21:02,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 15: [2022-11-26 19:21:02,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 17: [2022-11-26 19:21:02,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
17: [2022-11-26 19:21:02,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 19:21:02,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 19:21:02,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 4: [2022-11-26 19:21:02,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:21:02,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 19:21:02,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 29: [2022-11-26 19:21:02,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:21:02,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 19:21:02,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 3: [2022-11-26 19:21:02,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:21:02,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 19:21:02,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
20: [2022-11-26 19:21:02,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 19:21:02,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 31: [2022-11-26 19:21:02,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 20: [2022-11-26 19:21:02,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 31: [2022-11-26 19:21:02,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 19:21:02,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 9: [2022-11-26 19:21:02,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:21:02,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 19:21:02,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 16: [2022-11-26 19:21:02,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 19:21:02,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 19:21:02,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
10: [2022-11-26 19:21:02,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:21:02,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 19:21:02,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 14: [2022-11-26 19:21:02,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:21:02,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 6: [2022-11-26 19:21:02,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:21:02,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:21:02,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:21:02,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
6: [2022-11-26 19:21:02,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 5: [2022-11-26 19:21:02,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 12: [2022-11-26 19:21:02,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 6: [2022-11-26 19:21:02,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 5: [2022-11-26 19:21:02,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 12: [2022-11-26 19:21:02,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 30: [2022-11-26 19:21:02,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 19:21:02,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 19:21:02,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 9: [2022-11-26 19:21:02,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:21:02,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 19:21:02,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
14: [2022-11-26 19:21:02,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:21:02,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 19:21:02,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 19: [2022-11-26 19:21:02,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:21:02,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 19:21:02,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 5: [2022-11-26 19:21:02,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:21:02,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 19:21:02,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 1: [2022-11-26 19:21:02,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:21:02,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 19:21:02,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
7: [2022-11-26 19:21:02,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:21:02,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 19:21:02,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 3: [2022-11-26 19:21:02,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:21:02,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 19:21:02,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 0: [2022-11-26 19:21:02,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:21:02,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 2: [2022-11-26 19:21:02,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:21:02,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 2: [2022-11-26 19:21:02,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 19:21:02,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
23: [2022-11-26 19:21:02,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:21:02,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 19:21:02,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 29: [2022-11-26 19:21:02,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:21:02,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 19:21:02,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 18: [2022-11-26 19:21:02,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:21:02,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 28: [2022-11-26 19:21:02,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 19:21:02,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 19:21:02,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
18: [2022-11-26 19:21:02,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 28: [2022-11-26 19:21:02,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 19:21:02,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 19:21:02,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 19:21:02,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 28: [2022-11-26 19:21:02,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 28: [2022-11-26 19:21:02,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 16: [2022-11-26 19:21:02,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 19:21:02,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 19:21:02,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 22: [2022-11-26 19:21:02,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 19:21:02,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-26 19:21:02,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 19:21:02,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 19:21:02,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 19:21:02,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 19:21:02,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 19:21:02,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 19:21:02,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 22: [2022-11-26 19:21:02,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 22: [2022-11-26 19:21:02,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 22: [2022-11-26 19:21:02,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 6: [2022-11-26 19:21:02,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 19:21:02,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 19:21:02,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 8: [2022-11-26 19:21:02,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:21:02,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:21:02,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:21:02,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:21:02,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 19:21:02,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 19:21:02,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 19:21:02,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 19:21:02,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
8: [2022-11-26 19:21:02,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 8: [2022-11-26 19:21:02,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 8: [2022-11-26 19:21:02,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 5: [2022-11-26 19:21:02,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:21:02,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 19:21:02,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 1: [2022-11-26 19:21:02,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:21:02,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 19:21:02,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 14: [2022-11-26 19:21:02,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:21:02,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 19:21:02,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
20: [2022-11-26 19:21:02,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 19:21:02,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 19:21:02,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 13: [2022-11-26 19:21:02,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:21:02,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 19:21:02,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 22: [2022-11-26 19:21:02,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 19:21:02,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 19:21:02,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 17: [2022-11-26 19:21:02,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 26: [2022-11-26 19:21:02,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
21: [2022-11-26 19:21:02,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 19:21:02,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 17: [2022-11-26 19:21:02,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 21: [2022-11-26 19:21:02,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 26: [2022-11-26 19:21:02,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 17: [2022-11-26 19:21:02,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 26: [2022-11-26 19:21:02,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 27: [2022-11-26 19:21:02,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:21:02,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 19:21:02,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 4: [2022-11-26 19:21:02,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 19:21:02,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 19:21:02,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 0: [2022-11-26 19:21:02,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:21:02,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 19:21:02,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 7: [2022-11-26 19:21:02,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:21:02,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 19:21:02,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 9: [2022-11-26 19:21:02,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:21:02,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 19:21:02,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 25: [2022-11-26 19:21:02,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 
25: [2022-11-26 19:21:02,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 19:21:02,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 31: [2022-11-26 19:21:02,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 19:21:02,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 19:21:02,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 15: [2022-11-26 19:21:02,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:21:02,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 19:21:02,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 28: [2022-11-26 19:21:02,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:21:02,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:21:02,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 19:21:02,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
2: [2022-11-26 19:21:02,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:21:02,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 19:21:02,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 28: [2022-11-26 19:21:02,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 19:21:02,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 3: [2022-11-26 19:21:02,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:21:02,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 19:21:02,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 19: [2022-11-26 19:21:02,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:21:02,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
19: [2022-11-26 19:21:02,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 11: [2022-11-26 19:21:02,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 19: [2022-11-26 19:21:02,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 11: [2022-11-26 19:21:02,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 23: [2022-11-26 19:21:02,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:21:02,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 19:21:02,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 12: [2022-11-26 19:21:02,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:21:02,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 19:21:02,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 10: [2022-11-26 19:21:02,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 19:21:02,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 19:21:02,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 24: [2022-11-26 19:21:02,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:21:02,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 19:21:02,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 30: [2022-11-26 19:21:02,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 19:21:02,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 19:21:02,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 16: [2022-11-26 19:21:02,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:21:02,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 16: [2022-11-26 19:21:02,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 19:21:02,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
29: [2022-11-26 19:21:02,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 19:21:02,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 18: [2022-11-26 19:21:02,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:21:02,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 19:21:02,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 6: [2022-11-26 19:21:02,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:21:02,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 19:21:02,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 5: [2022-11-26 19:21:02,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:21:02,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 19:21:02,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 13: [2022-11-26 19:21:02,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 19:21:02,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 19:21:02,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 20: [2022-11-26 19:21:02,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 19:21:02,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 19:21:02,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 21: [2022-11-26 19:21:02,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 19:21:02,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 19:21:02,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 14: [2022-11-26 19:21:02,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:21:02,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 19:21:02,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 26: [2022-11-26 19:21:02,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-26 19:21:02,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 19:21:02,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 1: [2022-11-26 19:21:02,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:21:02,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 19:21:02,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 27: [2022-11-26 19:21:02,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:21:02,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 19:21:02,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 4: [2022-11-26 19:21:02,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:21:02,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 19:21:02,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 22: [2022-11-26 19:21:02,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
22: [2022-11-26 19:21:02,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 19:21:02,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 17: [2022-11-26 19:21:02,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 19:21:02,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 19:21:02,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 0: [2022-11-26 19:21:02,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:21:02,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:21:02,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 19:21:02,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 9: [2022-11-26 19:21:02,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:21:02,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 19:21:02,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
15: [2022-11-26 19:21:02,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:21:02,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 19:21:02,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 10: [2022-11-26 19:21:02,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:21:02,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:21:02,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 19:21:02,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 10: [2022-11-26 19:21:02,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 19:21:02,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 28: [2022-11-26 19:21:02,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:21:02,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 19:21:02,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 19:21:02,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 31: [2022-11-26 19:21:02,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 19:21:02,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 19:21:02,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 23: [2022-11-26 19:21:02,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:21:02,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 19:21:02,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 18: [2022-11-26 19:21:02,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:21:02,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 19:21:02,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 3: [2022-11-26 19:21:02,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 19:21:02,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 19:21:02,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 30: [2022-11-26 19:21:02,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 19:21:02,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 19:21:02,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 19: [2022-11-26 19:21:02,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:21:02,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 19:21:02,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 29: [2022-11-26 19:21:02,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:21:02,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 19:21:02,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
0: [2022-11-26 19:21:02,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 16: [2022-11-26 19:21:02,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:21:02,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 28: [2022-11-26 19:21:02,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 16: [2022-11-26 19:21:02,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 28: [2022-11-26 19:21:02,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 16: [2022-11-26 19:21:02,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 12: [2022-11-26 19:21:02,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:21:02,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 19:21:02,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 14: [2022-11-26 19:21:02,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:21:02,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
14: [2022-11-26 19:21:02,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 11: [2022-11-26 19:21:02,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 14: [2022-11-26 19:21:02,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 11: [2022-11-26 19:21:02,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 21: [2022-11-26 19:21:02,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 19:21:02,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 19:21:02,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 7: [2022-11-26 19:21:02,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:21:02,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
7: [2022-11-26 19:21:02,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 13: [2022-11-26 19:21:02,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 5: [2022-11-26 19:21:02,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:21:02,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 13: [2022-11-26 19:21:02,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 5: [2022-11-26 19:21:02,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 19:21:02,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 6: [2022-11-26 19:21:02,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 20: [2022-11-26 19:21:02,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:21:02,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 19:21:02,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
20: [2022-11-26 19:21:02,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 19:21:02,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 7: [2022-11-26 19:21:02,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:21:02,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 19:21:02,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 1: [2022-11-26 19:21:02,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:21:02,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 19:21:02,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 25: [2022-11-26 19:21:02,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:21:02,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 19:21:02,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 24: [2022-11-26 19:21:02,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
24: [2022-11-26 19:21:02,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 19:21:02,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 26: [2022-11-26 19:21:02,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 19:21:02,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 19:21:02,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 22: [2022-11-26 19:21:02,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 19:21:02,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 19:21:02,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 3: [2022-11-26 19:21:02,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:21:02,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 19:21:02,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 4: [2022-11-26 19:21:02,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 19:21:02,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 27: [2022-11-26 19:21:02,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:21:02,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 2: [2022-11-26 19:21:02,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:21:02,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:21:02,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 2: [2022-11-26 19:21:02,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 15: [2022-11-26 19:21:02,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 2: [2022-11-26 19:21:02,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 15: [2022-11-26 19:21:02,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 27: [2022-11-26 19:21:02,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 16: [2022-11-26 19:21:02,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 
16: [2022-11-26 19:21:02,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 19:21:02,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 12: [2022-11-26 19:21:02,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:21:02,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 19:21:02,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 10: [2022-11-26 19:21:02,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:21:02,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 19:21:02,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 31: [2022-11-26 19:21:02,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 19:21:02,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 19:21:02,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 19: [2022-11-26 19:21:02,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
19: [2022-11-26 19:21:02,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 19:21:02,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 0: [2022-11-26 19:21:02,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:21:02,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 13: [2022-11-26 19:21:02,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:21:02,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 13: [2022-11-26 19:21:02,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 19:21:02,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 30: [2022-11-26 19:21:02,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 19:21:02,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 19:21:02,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 23: [2022-11-26 19:21:02,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
20: [2022-11-26 19:21:02,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:21:02,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 6: [2022-11-26 19:21:02,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:21:02,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 20: [2022-11-26 19:21:02,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 19:21:02,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 6: [2022-11-26 19:21:02,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 19:21:02,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 21: [2022-11-26 19:21:02,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:21:02,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
21: [2022-11-26 19:21:02,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 14: [2022-11-26 19:21:02,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 21: [2022-11-26 19:21:02,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 14: [2022-11-26 19:21:02,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 9: [2022-11-26 19:21:02,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:21:02,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 19:21:02,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 8: [2022-11-26 19:21:02,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:21:02,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 19:21:02,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 1: [2022-11-26 19:21:02,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 19:21:02,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 19:21:02,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 7: [2022-11-26 19:21:02,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:21:02,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 19:21:02,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 18: [2022-11-26 19:21:02,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:21:02,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 19:21:02,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 5: [2022-11-26 19:21:02,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:21:02,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 19:21:02,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 4: [2022-11-26 19:21:02,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 19:21:02,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 19:21:02,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 29: [2022-11-26 19:21:02,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:21:02,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 19:21:02,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 12: [2022-11-26 19:21:02,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:21:02,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 19:21:02,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 10: [2022-11-26 19:21:02,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:21:02,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 19:21:02,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 3: [2022-11-26 19:21:02,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
2: [2022-11-26 19:21:02,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:21:02,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 19:21:02,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 2: [2022-11-26 19:21:02,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 19:21:02,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 26: [2022-11-26 19:21:02,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 19:21:02,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 19:21:02,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 25: [2022-11-26 19:21:02,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:21:02,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 19:21:02,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 24: [2022-11-26 19:21:02,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
24: [2022-11-26 19:21:02,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 19:21:02,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 17: [2022-11-26 19:21:02,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 19:21:02,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 19:21:02,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 8: [2022-11-26 19:21:02,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:21:02,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 19:21:02,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 16: [2022-11-26 19:21:02,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 19:21:02,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 19:21:02,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 22: [2022-11-26 19:21:02,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
22: [2022-11-26 19:21:02,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 19:21:02,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 27: [2022-11-26 19:21:02,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:21:02,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:21:02,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 19: [2022-11-26 19:21:02,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 18: [2022-11-26 19:21:02,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:21:02,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 27: [2022-11-26 19:21:02,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 18: [2022-11-26 19:21:02,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 19:21:02,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 21: [2022-11-26 19:21:02,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-26 19:21:02,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 19:21:02,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 9: [2022-11-26 19:21:02,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:21:02,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 11: [2022-11-26 19:21:02,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:21:02,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 11: [2022-11-26 19:21:02,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 19:21:02,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 0: [2022-11-26 19:21:02,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:21:02,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
0: [2022-11-26 19:21:02,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 28: [2022-11-26 19:21:02,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:21:02,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 0: [2022-11-26 19:21:02,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 11: [2022-11-26 19:21:02,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 28: [2022-11-26 19:21:02,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 19:21:02,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 19:21:02,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 28: [2022-11-26 19:21:02,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 19:21:02,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 31: [2022-11-26 19:21:02,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 
31: [2022-11-26 19:21:02,885] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 19:21:02,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 17: [2022-11-26 19:21:02,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:21:02,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:21:02,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 19:21:02,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 17: [2022-11-26 19:21:02,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 19:21:02,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 30: [2022-11-26 19:21:02,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 19:21:02,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 19:21:02,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 23: [2022-11-26 19:21:02,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-26 19:21:02,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 19:21:02,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 24: [2022-11-26 19:21:02,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:21:02,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 19:21:02,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 29: [2022-11-26 19:21:02,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:21:02,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 19:21:02,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 28: [2022-11-26 19:21:02,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 19:21:02,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 11: [2022-11-26 19:21:02,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 28: [2022-11-26 19:21:02,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
11: [2022-11-26 19:21:02,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step112000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 19:21:02,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 0: successfully saved checkpoint at iteration 112000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2488.40 31: iteration 112010/ 173500 | consumed samples: 28674560 | consumed tokens: 58725498880 | elapsed time per iteration (s): 1.11 | learning rate: 7.117E-05 | global batch size: 256 | lm loss: 1.966692E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.034 | TFLOPs: 13.98 | 31: iteration 112020/ 173500 | consumed samples: 28677120 | consumed tokens: 58730741760 | elapsed time per iteration (s): 0.81 | learning rate: 7.115E-05 | global batch size: 256 | lm loss: 1.967155E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.504 | TFLOPs: 19.15 | 31: iteration 112030/ 173500 | consumed samples: 28679680 | consumed tokens: 58735984640 | elapsed time per iteration (s): 0.85 | learning rate: 7.114E-05 | global batch size: 256 | lm loss: 1.981006E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.619 | TFLOPs: 18.13 | 31: iteration 112040/ 173500 | consumed samples: 28682240 | consumed tokens: 58741227520 | elapsed time per iteration (s): 0.81 | learning rate: 7.112E-05 | global batch size: 256 | lm loss: 1.954071E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.730 | TFLOPs: 19.16 | 31: iteration 112050/ 173500 | consumed samples: 28684800 | consumed tokens: 58746470400 | elapsed time per iteration (s): 0.80 | learning rate: 7.111E-05 | 
global batch size: 256 | lm loss: 1.957038E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.978 | TFLOPs: 19.48 | 31: iteration 112060/ 173500 | consumed samples: 28687360 | consumed tokens: 58751713280 | elapsed time per iteration (s): 0.79 | learning rate: 7.109E-05 | global batch size: 256 | lm loss: 1.985917E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.422 | TFLOPs: 19.69 | 31: iteration 112070/ 173500 | consumed samples: 28689920 | consumed tokens: 58756956160 | elapsed time per iteration (s): 0.77 | learning rate: 7.108E-05 | global batch size: 256 | lm loss: 1.997901E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.418 | TFLOPs: 20.05 | 31: iteration 112080/ 173500 | consumed samples: 28692480 | consumed tokens: 58762199040 | elapsed time per iteration (s): 0.76 | learning rate: 7.106E-05 | global batch size: 256 | lm loss: 1.940134E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.711 | TFLOPs: 20.43 | 31: iteration 112090/ 173500 | consumed samples: 28695040 | consumed tokens: 58767441920 | elapsed time per iteration (s): 0.79 | learning rate: 7.105E-05 | global batch size: 256 | lm loss: 1.961684E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.157 | TFLOPs: 19.61 | 31: iteration 112100/ 173500 | consumed samples: 28697600 | consumed tokens: 58772684800 | elapsed time per iteration (s): 0.79 | learning rate: 7.103E-05 | global batch size: 256 | lm loss: 1.957302E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.699 | TFLOPs: 19.52 | 31: iteration 112110/ 173500 | consumed 
samples: 28700160 | consumed tokens: 58777927680 | elapsed time per iteration (s): 0.80 | learning rate: 7.102E-05 | global batch size: 256 | lm loss: 1.939875E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.439 | TFLOPs: 19.26 | 31: iteration 112120/ 173500 | consumed samples: 28702720 | consumed tokens: 58783170560 | elapsed time per iteration (s): 0.76 | learning rate: 7.100E-05 | global batch size: 256 | lm loss: 1.936218E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.876 | TFLOPs: 20.38 | 31: iteration 112130/ 173500 | consumed samples: 28705280 | consumed tokens: 58788413440 | elapsed time per iteration (s): 0.87 | learning rate: 7.099E-05 | global batch size: 256 | lm loss: 1.969396E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.778 | TFLOPs: 17.83 | 31: iteration 112140/ 173500 | consumed samples: 28707840 | consumed tokens: 58793656320 | elapsed time per iteration (s): 0.72 | learning rate: 7.097E-05 | global batch size: 256 | lm loss: 1.978353E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.692 | TFLOPs: 21.40 | 31: iteration 112150/ 173500 | consumed samples: 28710400 | consumed tokens: 58798899200 | elapsed time per iteration (s): 0.76 | learning rate: 7.096E-05 | global batch size: 256 | lm loss: 1.961749E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.073 | TFLOPs: 20.27 | 31: iteration 112160/ 173500 | consumed samples: 28712960 | consumed tokens: 58804142080 | elapsed time per iteration (s): 0.76 | learning rate: 7.094E-05 | global batch size: 256 | lm loss: 1.978745E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 335.342 | TFLOPs: 20.29 | 31: iteration 112170/ 173500 | consumed samples: 28715520 | consumed tokens: 58809384960 | elapsed time per iteration (s): 0.77 | learning rate: 7.093E-05 | global batch size: 256 | lm loss: 1.964091E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.589 | TFLOPs: 20.24 | 31: iteration 112180/ 173500 | consumed samples: 28718080 | consumed tokens: 58814627840 | elapsed time per iteration (s): 0.80 | learning rate: 7.091E-05 | global batch size: 256 | lm loss: 1.967097E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.322 | TFLOPs: 19.38 | 31: iteration 112190/ 173500 | consumed samples: 28720640 | consumed tokens: 58819870720 | elapsed time per iteration (s): 0.73 | learning rate: 7.090E-05 | global batch size: 256 | lm loss: 1.946040E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.870 | TFLOPs: 21.11 | 31: iteration 112200/ 173500 | consumed samples: 28723200 | consumed tokens: 58825113600 | elapsed time per iteration (s): 0.75 | learning rate: 7.088E-05 | global batch size: 256 | lm loss: 1.983878E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.368 | TFLOPs: 20.77 | 31: iteration 112210/ 173500 | consumed samples: 28725760 | consumed tokens: 58830356480 | elapsed time per iteration (s): 0.74 | learning rate: 7.087E-05 | global batch size: 256 | lm loss: 1.941248E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.302 | TFLOPs: 20.95 | 31: iteration 112220/ 173500 | consumed samples: 28728320 | consumed tokens: 58835599360 | elapsed time per iteration (s): 0.77 | learning rate: 7.086E-05 | global 
batch size: 256 | lm loss: 1.966172E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.711 | TFLOPs: 20.19 | 31: iteration 112230/ 173500 | consumed samples: 28730880 | consumed tokens: 58840842240 | elapsed time per iteration (s): 0.77 | learning rate: 7.084E-05 | global batch size: 256 | lm loss: 1.969547E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.948 | TFLOPs: 20.08 | 31: iteration 112240/ 173500 | consumed samples: 28733440 | consumed tokens: 58846085120 | elapsed time per iteration (s): 0.76 | learning rate: 7.083E-05 | global batch size: 256 | lm loss: 1.927203E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.611 | TFLOPs: 20.36 | 31: iteration 112250/ 173500 | consumed samples: 28736000 | consumed tokens: 58851328000 | elapsed time per iteration (s): 0.77 | learning rate: 7.081E-05 | global batch size: 256 | lm loss: 1.946099E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.454 | TFLOPs: 20.05 | 31: iteration 112260/ 173500 | consumed samples: 28738560 | consumed tokens: 58856570880 | elapsed time per iteration (s): 0.74 | learning rate: 7.080E-05 | global batch size: 256 | lm loss: 1.992419E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.089 | TFLOPs: 20.94 | 31: iteration 112270/ 173500 | consumed samples: 28741120 | consumed tokens: 58861813760 | elapsed time per iteration (s): 0.76 | learning rate: 7.078E-05 | global batch size: 256 | lm loss: 1.950239E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.386 | TFLOPs: 20.41 | 31: iteration 112280/ 173500 | consumed samples: 
28743680 | consumed tokens: 58867056640 | elapsed time per iteration (s): 0.77 | learning rate: 7.077E-05 | global batch size: 256 | lm loss: 1.971060E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.057 | TFLOPs: 20.03 | 31: iteration 112290/ 173500 | consumed samples: 28746240 | consumed tokens: 58872299520 | elapsed time per iteration (s): 0.80 | learning rate: 7.075E-05 | global batch size: 256 | lm loss: 1.969662E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.780 | TFLOPs: 19.35 | 31: iteration 112300/ 173500 | consumed samples: 28748800 | consumed tokens: 58877542400 | elapsed time per iteration (s): 0.78 | learning rate: 7.074E-05 | global batch size: 256 | lm loss: 1.952427E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.615 | TFLOPs: 19.76 | 31: iteration 112310/ 173500 | consumed samples: 28751360 | consumed tokens: 58882785280 | elapsed time per iteration (s): 0.81 | learning rate: 7.072E-05 | global batch size: 256 | lm loss: 1.967192E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.316 | TFLOPs: 19.20 | 31: iteration 112320/ 173500 | consumed samples: 28753920 | consumed tokens: 58888028160 | elapsed time per iteration (s): 0.80 | learning rate: 7.071E-05 | global batch size: 256 | lm loss: 1.938907E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.974 | TFLOPs: 19.42 | 31: iteration 112330/ 173500 | consumed samples: 28756480 | consumed tokens: 58893271040 | elapsed time per iteration (s): 0.93 | learning rate: 7.069E-05 | global batch size: 256 | lm loss: 1.946021E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 274.611 | TFLOPs: 16.61 | 31: iteration 112340/ 173500 | consumed samples: 28759040 | consumed tokens: 58898513920 | elapsed time per iteration (s): 0.74 | learning rate: 7.068E-05 | global batch size: 256 | lm loss: 1.974756E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.944 | TFLOPs: 21.05 | 31: iteration 112350/ 173500 | consumed samples: 28761600 | consumed tokens: 58903756800 | elapsed time per iteration (s): 0.75 | learning rate: 7.066E-05 | global batch size: 256 | lm loss: 1.929946E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.744 | TFLOPs: 20.67 | 31: iteration 112360/ 173500 | consumed samples: 28764160 | consumed tokens: 58908999680 | elapsed time per iteration (s): 0.72 | learning rate: 7.065E-05 | global batch size: 256 | lm loss: 1.941617E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.998 | TFLOPs: 21.66 | 31: iteration 112370/ 173500 | consumed samples: 28766720 | consumed tokens: 58914242560 | elapsed time per iteration (s): 0.73 | learning rate: 7.063E-05 | global batch size: 256 | lm loss: 1.964593E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.983 | TFLOPs: 21.17 | 31: iteration 112380/ 173500 | consumed samples: 28769280 | consumed tokens: 58919485440 | elapsed time per iteration (s): 0.77 | learning rate: 7.062E-05 | global batch size: 256 | lm loss: 1.976401E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.970 | TFLOPs: 20.20 | 31: iteration 112390/ 173500 | consumed samples: 28771840 | consumed tokens: 58924728320 | elapsed time per iteration (s): 0.75 | learning rate: 7.060E-05 | global batch 
size: 256 | lm loss: 1.947300E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.878 | TFLOPs: 20.62 | 31: iteration 112400/ 173500 | consumed samples: 28774400 | consumed tokens: 58929971200 | elapsed time per iteration (s): 0.74 | learning rate: 7.059E-05 | global batch size: 256 | lm loss: 1.952017E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.106 | TFLOPs: 20.82 | 31: iteration 112410/ 173500 | consumed samples: 28776960 | consumed tokens: 58935214080 | elapsed time per iteration (s): 0.79 | learning rate: 7.057E-05 | global batch size: 256 | lm loss: 1.947664E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.141 | TFLOPs: 19.49 | 31: iteration 112420/ 173500 | consumed samples: 28779520 | consumed tokens: 58940456960 | elapsed time per iteration (s): 0.82 | learning rate: 7.056E-05 | global batch size: 256 | lm loss: 1.979276E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.738 | TFLOPs: 18.86 | 31: iteration 112430/ 173500 | consumed samples: 28782080 | consumed tokens: 58945699840 | elapsed time per iteration (s): 0.81 | learning rate: 7.054E-05 | global batch size: 256 | lm loss: 1.939284E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.369 | TFLOPs: 19.08 | 31: iteration 112440/ 173500 | consumed samples: 28784640 | consumed tokens: 58950942720 | elapsed time per iteration (s): 0.81 | learning rate: 7.053E-05 | global batch size: 256 | lm loss: 1.934986E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.547 | TFLOPs: 19.15 | 31: iteration 112450/ 173500 | consumed samples: 28787200 
| consumed tokens: 58956185600 | elapsed time per iteration (s): 0.84 | learning rate: 7.051E-05 | global batch size: 256 | lm loss: 1.925822E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.992 | TFLOPs: 18.51 | 31: iteration 112460/ 173500 | consumed samples: 28789760 | consumed tokens: 58961428480 | elapsed time per iteration (s): 0.82 | learning rate: 7.050E-05 | global batch size: 256 | lm loss: 1.955576E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.353 | TFLOPs: 18.96 | 31: iteration 112470/ 173500 | consumed samples: 28792320 | consumed tokens: 58966671360 | elapsed time per iteration (s): 0.81 | learning rate: 7.049E-05 | global batch size: 256 | lm loss: 1.973170E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.638 | TFLOPs: 19.22 | 31: iteration 112480/ 173500 | consumed samples: 28794880 | consumed tokens: 58971914240 | elapsed time per iteration (s): 0.78 | learning rate: 7.047E-05 | global batch size: 256 | lm loss: 1.962387E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.623 | TFLOPs: 19.82 | 31: iteration 112490/ 173500 | consumed samples: 28797440 | consumed tokens: 58977157120 | elapsed time per iteration (s): 0.85 | learning rate: 7.046E-05 | global batch size: 256 | lm loss: 1.920917E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.290 | TFLOPs: 18.23 | 31: iteration 112500/ 173500 | consumed samples: 28800000 | consumed tokens: 58982400000 | elapsed time per iteration (s): 0.79 | learning rate: 7.044E-05 | global batch size: 256 | lm loss: 1.958876E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 324.267 | TFLOPs: 19.62 | 31: iteration 112510/ 173500 | consumed samples: 28802560 | consumed tokens: 58987642880 | elapsed time per iteration (s): 0.79 | learning rate: 7.043E-05 | global batch size: 256 | lm loss: 1.960141E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.416 | TFLOPs: 19.69 | 31: iteration 112520/ 173500 | consumed samples: 28805120 | consumed tokens: 58992885760 | elapsed time per iteration (s): 0.78 | learning rate: 7.041E-05 | global batch size: 256 | lm loss: 1.940630E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.195 | TFLOPs: 19.98 | 31: iteration 112530/ 173500 | consumed samples: 28807680 | consumed tokens: 58998128640 | elapsed time per iteration (s): 0.74 | learning rate: 7.040E-05 | global batch size: 256 | lm loss: 1.969643E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.000 | TFLOPs: 20.87 | 31: iteration 112540/ 173500 | consumed samples: 28810240 | consumed tokens: 59003371520 | elapsed time per iteration (s): 0.74 | learning rate: 7.038E-05 | global batch size: 256 | lm loss: 1.951421E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.964 | TFLOPs: 21.05 | 31: iteration 112550/ 173500 | consumed samples: 28812800 | consumed tokens: 59008614400 | elapsed time per iteration (s): 0.81 | learning rate: 7.037E-05 | global batch size: 256 | lm loss: 1.961201E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.684 | TFLOPs: 19.10 | 31: iteration 112560/ 173500 | consumed samples: 28815360 | consumed tokens: 59013857280 | elapsed time per iteration (s): 0.90 | learning rate: 7.035E-05 | global batch size: 
256 | lm loss: 1.941481E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.745 | TFLOPs: 17.29 | 31: iteration 112570/ 173500 | consumed samples: 28817920 | consumed tokens: 59019100160 | elapsed time per iteration (s): 0.85 | learning rate: 7.034E-05 | global batch size: 256 | lm loss: 1.974139E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.952 | TFLOPs: 18.27 | 31: iteration 112580/ 173500 | consumed samples: 28820480 | consumed tokens: 59024343040 | elapsed time per iteration (s): 0.87 | learning rate: 7.032E-05 | global batch size: 256 | lm loss: 1.939579E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.233 | TFLOPs: 17.74 | 31: iteration 112590/ 173500 | consumed samples: 28823040 | consumed tokens: 59029585920 | elapsed time per iteration (s): 0.80 | learning rate: 7.031E-05 | global batch size: 256 | lm loss: 1.969323E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.075 | TFLOPs: 19.42 | 31: iteration 112600/ 173500 | consumed samples: 28825600 | consumed tokens: 59034828800 | elapsed time per iteration (s): 0.80 | learning rate: 7.029E-05 | global batch size: 256 | lm loss: 1.952973E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.502 | TFLOPs: 19.39 | 31: iteration 112610/ 173500 | consumed samples: 28828160 | consumed tokens: 59040071680 | elapsed time per iteration (s): 0.82 | learning rate: 7.028E-05 | global batch size: 256 | lm loss: 1.945560E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.880 | TFLOPs: 18.81 | 31: iteration 112620/ 173500 | consumed samples: 28830720 | 
consumed tokens: 59045314560 | elapsed time per iteration (s): 0.79 | learning rate: 7.026E-05 | global batch size: 256 | lm loss: 1.971579E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.734 | TFLOPs: 19.59 | 31: iteration 112630/ 173500 | consumed samples: 28833280 | consumed tokens: 59050557440 | elapsed time per iteration (s): 0.83 | learning rate: 7.025E-05 | global batch size: 256 | lm loss: 1.975721E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.098 | TFLOPs: 18.64 | 31: iteration 112640/ 173500 | consumed samples: 28835840 | consumed tokens: 59055800320 | elapsed time per iteration (s): 0.75 | learning rate: 7.023E-05 | global batch size: 256 | lm loss: 1.969373E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.441 | TFLOPs: 20.60 | 31: iteration 112650/ 173500 | consumed samples: 28838400 | consumed tokens: 59061043200 | elapsed time per iteration (s): 0.73 | learning rate: 7.022E-05 | global batch size: 256 | lm loss: 1.953874E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.033 | TFLOPs: 21.18 | 31: iteration 112660/ 173500 | consumed samples: 28840960 | consumed tokens: 59066286080 | elapsed time per iteration (s): 0.76 | learning rate: 7.020E-05 | global batch size: 256 | lm loss: 1.942289E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.637 | TFLOPs: 20.49 | 31: iteration 112670/ 173500 | consumed samples: 28843520 | consumed tokens: 59071528960 | elapsed time per iteration (s): 0.77 | learning rate: 7.019E-05 | global batch size: 256 | lm loss: 1.949212E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 330.610 | TFLOPs: 20.00 | 31: iteration 112680/ 173500 | consumed samples: 28846080 | consumed tokens: 59076771840 | elapsed time per iteration (s): 0.78 | learning rate: 7.017E-05 | global batch size: 256 | lm loss: 1.954917E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.395 | TFLOPs: 19.93 | 31: iteration 112690/ 173500 | consumed samples: 28848640 | consumed tokens: 59082014720 | elapsed time per iteration (s): 0.74 | learning rate: 7.016E-05 | global batch size: 256 | lm loss: 1.957708E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.503 | TFLOPs: 21.02 | 31: iteration 112700/ 173500 | consumed samples: 28851200 | consumed tokens: 59087257600 | elapsed time per iteration (s): 0.74 | learning rate: 7.015E-05 | global batch size: 256 | lm loss: 1.964576E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.080 | TFLOPs: 21.00 | 31: iteration 112710/ 173500 | consumed samples: 28853760 | consumed tokens: 59092500480 | elapsed time per iteration (s): 0.76 | learning rate: 7.013E-05 | global batch size: 256 | lm loss: 1.946846E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.291 | TFLOPs: 20.28 | 31: iteration 112720/ 173500 | consumed samples: 28856320 | consumed tokens: 59097743360 | elapsed time per iteration (s): 0.77 | learning rate: 7.012E-05 | global batch size: 256 | lm loss: 1.961178E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.490 | TFLOPs: 20.05 | 31: iteration 112730/ 173500 | consumed samples: 28858880 | consumed tokens: 59102986240 | elapsed time per iteration (s): 0.76 | learning rate: 7.010E-05 | global batch size: 
256 | lm loss: 1.953996E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.510 | TFLOPs: 20.36 | 31: iteration 112740/ 173500 | consumed samples: 28861440 | consumed tokens: 59108229120 | elapsed time per iteration (s): 0.78 | learning rate: 7.009E-05 | global batch size: 256 | lm loss: 1.972098E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.400 | TFLOPs: 19.81 | 31: iteration 112750/ 173500 | consumed samples: 28864000 | consumed tokens: 59113472000 | elapsed time per iteration (s): 0.77 | learning rate: 7.007E-05 | global batch size: 256 | lm loss: 1.985228E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.866 | TFLOPs: 20.14 | 31: iteration 112760/ 173500 | consumed samples: 28866560 | consumed tokens: 59118714880 | elapsed time per iteration (s): 0.82 | learning rate: 7.006E-05 | global batch size: 256 | lm loss: 1.925364E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.009 | TFLOPs: 19.00 | 31: iteration 112770/ 173500 | consumed samples: 28869120 | consumed tokens: 59123957760 | elapsed time per iteration (s): 0.82 | learning rate: 7.004E-05 | global batch size: 256 | lm loss: 1.956251E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.091 | TFLOPs: 18.88 | 31: iteration 112780/ 173500 | consumed samples: 28871680 | consumed tokens: 59129200640 | elapsed time per iteration (s): 0.79 | learning rate: 7.003E-05 | global batch size: 256 | lm loss: 1.964020E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.963 | TFLOPs: 19.60 | 31: iteration 112790/ 173500 | consumed samples: 28874240 | 
consumed tokens: 59134443520 | elapsed time per iteration (s): 0.80 | learning rate: 7.001E-05 | global batch size: 256 | lm loss: 1.932248E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.832 | TFLOPs: 19.41 | 31: iteration 112800/ 173500 | consumed samples: 28876800 | consumed tokens: 59139686400 | elapsed time per iteration (s): 0.78 | learning rate: 7.000E-05 | global batch size: 256 | lm loss: 1.982557E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.032 | TFLOPs: 19.91 | 31: iteration 112810/ 173500 | consumed samples: 28879360 | consumed tokens: 59144929280 | elapsed time per iteration (s): 0.82 | learning rate: 6.998E-05 | global batch size: 256 | lm loss: 1.955022E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.125 | TFLOPs: 18.82 | 31: iteration 112820/ 173500 | consumed samples: 28881920 | consumed tokens: 59150172160 | elapsed time per iteration (s): 0.79 | learning rate: 6.997E-05 | global batch size: 256 | lm loss: 1.961945E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.134 | TFLOPs: 19.67 | 31: iteration 112830/ 173500 | consumed samples: 28884480 | consumed tokens: 59155415040 | elapsed time per iteration (s): 0.75 | learning rate: 6.995E-05 | global batch size: 256 | lm loss: 1.971352E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.060 | TFLOPs: 20.63 | 31: iteration 112840/ 173500 | consumed samples: 28887040 | consumed tokens: 59160657920 | elapsed time per iteration (s): 0.77 | learning rate: 6.994E-05 | global batch size: 256 | lm loss: 1.948757E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 333.599 | TFLOPs: 20.18 | 31: iteration 112850/ 173500 | consumed samples: 28889600 | consumed tokens: 59165900800 | elapsed time per iteration (s): 0.77 | learning rate: 6.992E-05 | global batch size: 256 | lm loss: 1.959418E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.134 | TFLOPs: 20.21 | 31: iteration 112860/ 173500 | consumed samples: 28892160 | consumed tokens: 59171143680 | elapsed time per iteration (s): 0.77 | learning rate: 6.991E-05 | global batch size: 256 | lm loss: 1.960532E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.623 | TFLOPs: 20.12 | 31: iteration 112870/ 173500 | consumed samples: 28894720 | consumed tokens: 59176386560 | elapsed time per iteration (s): 0.75 | learning rate: 6.989E-05 | global batch size: 256 | lm loss: 1.939032E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.703 | TFLOPs: 20.55 | 31: iteration 112880/ 173500 | consumed samples: 28897280 | consumed tokens: 59181629440 | elapsed time per iteration (s): 0.80 | learning rate: 6.988E-05 | global batch size: 256 | lm loss: 1.976260E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.390 | TFLOPs: 19.38 | 31: iteration 112890/ 173500 | consumed samples: 28899840 | consumed tokens: 59186872320 | elapsed time per iteration (s): 0.79 | learning rate: 6.987E-05 | global batch size: 256 | lm loss: 1.957281E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.924 | TFLOPs: 19.54 | 31: iteration 112900/ 173500 | consumed samples: 28902400 | consumed tokens: 59192115200 | elapsed time per iteration (s): 0.81 | learning rate: 6.985E-05 | global batch size: 
256 | lm loss: 1.951812E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.799 | TFLOPs: 19.11 | 31: iteration 112910/ 173500 | consumed samples: 28904960 | consumed tokens: 59197358080 | elapsed time per iteration (s): 0.80 | learning rate: 6.984E-05 | global batch size: 256 | lm loss: 1.948022E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.683 | TFLOPs: 19.34 | 31: iteration 112920/ 173500 | consumed samples: 28907520 | consumed tokens: 59202600960 | elapsed time per iteration (s): 0.81 | learning rate: 6.982E-05 | global batch size: 256 | lm loss: 1.977864E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.637 | TFLOPs: 19.10 | 31: iteration 112930/ 173500 | consumed samples: 28910080 | consumed tokens: 59207843840 | elapsed time per iteration (s): 0.80 | learning rate: 6.981E-05 | global batch size: 256 | lm loss: 1.940008E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.847 | TFLOPs: 19.47 | 31: iteration 112940/ 173500 | consumed samples: 28912640 | consumed tokens: 59213086720 | elapsed time per iteration (s): 0.80 | learning rate: 6.979E-05 | global batch size: 256 | lm loss: 1.979819E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.833 | TFLOPs: 19.35 | 31: iteration 112950/ 173500 | consumed samples: 28915200 | consumed tokens: 59218329600 | elapsed time per iteration (s): 0.80 | learning rate: 6.978E-05 | global batch size: 256 | lm loss: 1.948903E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.078 | TFLOPs: 19.24 | 31: iteration 112960/ 173500 | consumed samples: 28917760 | 
consumed tokens: 59223572480 | elapsed time per iteration (s): 0.76 | learning rate: 6.976E-05 | global batch size: 256 | lm loss: 1.961850E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.555 | TFLOPs: 20.30 | 31: iteration 112970/ 173500 | consumed samples: 28920320 | consumed tokens: 59228815360 | elapsed time per iteration (s): 0.73 | learning rate: 6.975E-05 | global batch size: 256 | lm loss: 1.963826E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.977 | TFLOPs: 21.29 | 31: iteration 112980/ 173500 | consumed samples: 28922880 | consumed tokens: 59234058240 | elapsed time per iteration (s): 0.77 | learning rate: 6.973E-05 | global batch size: 256 | lm loss: 1.963617E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.711 | TFLOPs: 20.19 | 31: iteration 112990/ 173500 | consumed samples: 28925440 | consumed tokens: 59239301120 | elapsed time per iteration (s): 0.82 | learning rate: 6.972E-05 | global batch size: 256 | lm loss: 1.950937E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.538 | TFLOPs: 18.97 | 31: iteration 113000/ 173500 | consumed samples: 28928000 | consumed tokens: 59244544000 | elapsed time per iteration (s): 0.83 | learning rate: 6.970E-05 | global batch size: 256 | lm loss: 1.952280E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.926 | TFLOPs: 18.69 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 113000 | lm loss value: 1.848138E+00 | lm loss PPL: 6.347991E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving 
checkpoint at iteration 113000 to checkpoints_1b1long 0: [2022-11-26 19:34:09,161] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step113000 is begin to save! 0: [2022-11-26 19:34:09,173] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_01-model_00-model_states.pt... 0: [2022-11-26 19:34:09,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_01-model_00-model_states.pt. 0: [2022-11-26 19:34:09,395] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_03-model_00-model_states.pt... 0: [2022-11-26 19:34:09,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_03-model_00-model_states.pt. 0: [2022-11-26 19:34:09,475] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_04-model_00-model_states.pt... 0: [2022-11-26 19:34:09,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_04-model_00-model_states.pt. 0: [2022-11-26 19:34:09,555] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_05-model_00-model_states.pt... 0: [2022-11-26 19:34:09,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_05-model_00-model_states.pt. 0: [2022-11-26 19:34:09,630] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_06-model_00-model_states.pt... 0: [2022-11-26 19:34:09,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_06-model_00-model_states.pt. 0: [2022-11-26 19:34:09,708] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_07-model_00-model_states.pt... 
0: [2022-11-26 19:34:09,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_07-model_00-model_states.pt. 0: [2022-11-26 19:34:09,780] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_08-model_00-model_states.pt... 0: [2022-11-26 19:34:09,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_08-model_00-model_states.pt. 0: [2022-11-26 19:34:09,861] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_09-model_00-model_states.pt... 0: [2022-11-26 19:34:09,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_09-model_00-model_states.pt. 0: [2022-11-26 19:34:09,938] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_10-model_00-model_states.pt... 0: [2022-11-26 19:34:10,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_10-model_00-model_states.pt. 0: [2022-11-26 19:34:10,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_11-model_00-model_states.pt... 0: [2022-11-26 19:34:10,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_11-model_00-model_states.pt. 0: [2022-11-26 19:34:10,090] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_12-model_00-model_states.pt... 0: [2022-11-26 19:34:10,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_12-model_00-model_states.pt. 0: [2022-11-26 19:34:10,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_13-model_00-model_states.pt... 
0: [2022-11-26 19:34:10,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_13-model_00-model_states.pt. 0: [2022-11-26 19:34:10,240] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_14-model_00-model_states.pt... 0: [2022-11-26 19:34:10,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_14-model_00-model_states.pt. 0: [2022-11-26 19:34:10,312] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_15-model_00-model_states.pt... 0: [2022-11-26 19:34:10,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_15-model_00-model_states.pt. 0: [2022-11-26 19:34:10,388] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_16-model_00-model_states.pt... 0: [2022-11-26 19:34:10,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_16-model_00-model_states.pt. 0: [2022-11-26 19:34:10,459] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_17-model_00-model_states.pt... 0: [2022-11-26 19:34:10,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_17-model_00-model_states.pt. 0: [2022-11-26 19:34:10,536] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_18-model_00-model_states.pt... 0: [2022-11-26 19:34:10,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_18-model_00-model_states.pt. 0: [2022-11-26 19:34:10,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_19-model_00-model_states.pt... 
0: [2022-11-26 19:34:10,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_19-model_00-model_states.pt. 0: [2022-11-26 19:34:10,681] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_20-model_00-model_states.pt... 0: [2022-11-26 19:34:10,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_20-model_00-model_states.pt. 0: [2022-11-26 19:34:10,758] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_21-model_00-model_states.pt... 0: [2022-11-26 19:34:10,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_21-model_00-model_states.pt. 0: [2022-11-26 19:34:10,829] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_22-model_00-model_states.pt... 0: [2022-11-26 19:34:10,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_22-model_00-model_states.pt. 0: [2022-11-26 19:34:10,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_23-model_00-model_states.pt... 0: [2022-11-26 19:34:10,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_23-model_00-model_states.pt. 0: [2022-11-26 19:34:10,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_24-model_00-model_states.pt... 0: [2022-11-26 19:34:11,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_24-model_00-model_states.pt. 0: [2022-11-26 19:34:11,049] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_25-model_00-model_states.pt... 
0: [2022-11-26 19:34:11,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_25-model_00-model_states.pt. 0: [2022-11-26 19:34:11,126] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_26-model_00-model_states.pt... 0: [2022-11-26 19:34:11,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_26-model_00-model_states.pt. 0: [2022-11-26 19:34:11,199] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_27-model_00-model_states.pt... 0: [2022-11-26 19:34:11,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_27-model_00-model_states.pt. 0: [2022-11-26 19:34:11,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_28-model_00-model_states.pt... 0: [2022-11-26 19:34:11,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_28-model_00-model_states.pt. 0: [2022-11-26 19:34:11,343] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/layer_30-model_00-model_states.pt... 0: [2022-11-26 19:34:11,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/layer_30-model_00-model_states.pt. 0: [2022-11-26 19:34:11,347] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step113000/mp_rank_00_model_states.pt 0: [2022-11-26 19:34:11,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/mp_rank_00_model_states.pt... 0: [2022-11-26 19:34:11,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/mp_rank_00_model_states.pt. 
25: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 
24: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 
17: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 
27: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
5: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
4: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
10: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 
3: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
25: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
14: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 
18: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 
7: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 
16: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 
25: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 
31: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 
21: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 
5: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
1: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 
20: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 
29: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 
5: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 
31: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 
16: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 
26: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:34:11,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:34:11,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 19:34:11,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 19:34:11,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 21: [2022-11-26 19:34:11,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 19:34:11,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 19:34:11,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 0: [2022-11-26 19:34:11,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:34:11,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 19:34:11,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 19:34:11,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 2: [2022-11-26 19:34:11,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:34:11,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 19:34:11,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 28: [2022-11-26 19:34:11,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 19:34:11,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 19:34:11,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 19: [2022-11-26 19:34:11,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:34:11,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 19:34:11,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 6: [2022-11-26 19:34:11,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 19:34:11,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 27: [2022-11-26 19:34:11,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:34:11,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 26: [2022-11-26 19:34:11,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:34:11,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:34:11,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 26: [2022-11-26 19:34:11,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 27: [2022-11-26 19:34:11,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 18: [2022-11-26 19:34:11,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 27: [2022-11-26 19:34:11,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 26: [2022-11-26 19:34:11,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 7: [2022-11-26 19:34:11,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
14: [2022-11-26 19:34:11,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:34:11,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 14: [2022-11-26 19:34:11,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 7: [2022-11-26 19:34:11,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 14: [2022-11-26 19:34:11,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 8: [2022-11-26 19:34:11,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:34:11,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 19:34:11,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 25: [2022-11-26 19:34:11,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:34:11,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 19:34:11,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 24: [2022-11-26 19:34:11,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
24: [2022-11-26 19:34:11,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 19:34:11,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 30: [2022-11-26 19:34:11,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 19:34:11,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 19:34:11,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 12: [2022-11-26 19:34:11,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:34:11,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 19:34:11,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 29: [2022-11-26 19:34:11,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:34:11,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 4: [2022-11-26 19:34:11,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:34:11,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
4: [2022-11-26 19:34:11,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 19:34:11,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 25: [2022-11-26 19:34:11,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:34:11,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 19:34:11,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 18: [2022-11-26 19:34:11,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:34:11,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 19:34:11,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 3: [2022-11-26 19:34:11,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:34:11,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 19:34:11,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 20: [2022-11-26 19:34:11,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
20: [2022-11-26 19:34:11,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 19:34:11,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 29: [2022-11-26 19:34:11,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:34:11,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 19:34:11,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 20: [2022-11-26 19:34:11,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:34:11,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:34:11,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:34:11,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 20: [2022-11-26 19:34:11,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 19:34:11,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
24: [2022-11-26 19:34:11,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 27: [2022-11-26 19:34:11,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 7: [2022-11-26 19:34:11,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:34:11,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 30: [2022-11-26 19:34:11,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:34:11,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 19:34:11,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 30: [2022-11-26 19:34:11,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 19:34:11,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 14: [2022-11-26 19:34:11,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:34:11,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 19:34:11,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
5: [2022-11-26 19:34:11,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:34:11,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 19:34:11,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 5: [2022-11-26 19:34:11,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:34:11,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 19:34:11,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 17: [2022-11-26 19:34:11,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 19:34:11,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 19:34:11,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 17: [2022-11-26 19:34:11,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-26 19:34:11,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 12: [2022-11-26 19:34:11,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 17: [2022-11-26 19:34:11,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 12: [2022-11-26 19:34:11,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 19:34:11,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 10: [2022-11-26 19:34:11,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:34:11,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 19:34:11,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 15: [2022-11-26 19:34:11,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:34:11,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 19:34:11,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 7: [2022-11-26 19:34:11,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 19:34:11,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 19:34:11,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 21: [2022-11-26 19:34:11,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 19:34:11,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 14: [2022-11-26 19:34:11,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 21: [2022-11-26 19:34:11,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 14: [2022-11-26 19:34:11,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 19:34:11,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 24: [2022-11-26 19:34:11,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:34:11,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 19:34:11,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 2: [2022-11-26 19:34:11,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 19:34:11,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 19:34:11,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 8: [2022-11-26 19:34:11,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:34:11,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 19:34:11,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 4: [2022-11-26 19:34:11,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:34:11,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 19:34:11,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 10: [2022-11-26 19:34:11,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:34:11,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 19:34:11,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 19: [2022-11-26 19:34:11,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
19: [2022-11-26 19:34:11,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 19:34:11,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 0: [2022-11-26 19:34:11,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:34:11,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 19:34:11,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 6: [2022-11-26 19:34:11,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:34:11,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 29: [2022-11-26 19:34:11,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:34:11,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 29: [2022-11-26 19:34:11,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 19:34:11,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 26: [2022-11-26 19:34:11,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
3: [2022-11-26 19:34:11,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 26: [2022-11-26 19:34:11,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 3: [2022-11-26 19:34:11,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 19:34:11,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 26: [2022-11-26 19:34:11,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 12: [2022-11-26 19:34:11,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:34:11,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:34:11,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 19:34:11,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 19: [2022-11-26 19:34:11,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 19:34:11,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 2: [2022-11-26 19:34:11,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
28: [2022-11-26 19:34:11,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:34:11,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 19:34:11,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 30: [2022-11-26 19:34:11,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 19:34:11,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 19:34:11,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 21: [2022-11-26 19:34:11,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 19:34:11,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 19:34:11,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 3: [2022-11-26 19:34:11,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:34:11,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 19:34:11,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
1: [2022-11-26 19:34:11,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:34:11,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 15: [2022-11-26 19:34:11,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:34:11,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 19:34:11,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 1: [2022-11-26 19:34:11,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 22: [2022-11-26 19:34:11,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 19:34:11,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 19:34:11,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 11: [2022-11-26 19:34:11,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:34:11,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 19:34:11,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
22: [2022-11-26 19:34:11,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 19:34:11,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 19:34:11,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 8: [2022-11-26 19:34:11,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:34:11,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 19:34:11,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 5: [2022-11-26 19:34:11,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:34:11,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:34:11,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 19:34:11,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 5: [2022-11-26 19:34:11,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 19:34:11,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
27: [2022-11-26 19:34:11,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:34:11,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 18: [2022-11-26 19:34:11,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:34:11,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 19:34:11,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 27: [2022-11-26 19:34:11,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:34:11,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 27: [2022-11-26 19:34:11,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 28: [2022-11-26 19:34:11,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 27: [2022-11-26 19:34:11,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 28: [2022-11-26 19:34:11,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 15: [2022-11-26 19:34:11,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
28: [2022-11-26 19:34:11,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:34:11,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 28: [2022-11-26 19:34:11,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 19:34:11,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 15: [2022-11-26 19:34:11,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 17: [2022-11-26 19:34:11,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 19:34:11,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 19:34:11,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 7: [2022-11-26 19:34:11,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:34:11,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 19:34:11,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 25: [2022-11-26 19:34:11,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 
26: [2022-11-26 19:34:11,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:34:11,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 19:34:11,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 26: [2022-11-26 19:34:11,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 19:34:11,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 23: [2022-11-26 19:34:11,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:34:11,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:34:11,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:34:11,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 19:34:11,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 16: [2022-11-26 19:34:11,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 
16: [2022-11-26 19:34:11,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:34:11,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 16: [2022-11-26 19:34:11,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:34:11,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 23: [2022-11-26 19:34:11,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 16: [2022-11-26 19:34:11,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 19:34:11,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 23: [2022-11-26 19:34:11,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 16: [2022-11-26 19:34:11,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 19:34:11,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 16: [2022-11-26 19:34:11,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 16: [2022-11-26 19:34:11,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
20: [2022-11-26 19:34:11,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 19:34:11,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 19:34:11,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 10: [2022-11-26 19:34:11,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:34:11,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 19:34:11,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 1: [2022-11-26 19:34:11,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:34:11,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:34:11,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:34:11,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:34:11,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
1: [2022-11-26 19:34:11,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 2: [2022-11-26 19:34:11,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 13: [2022-11-26 19:34:11,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 1: [2022-11-26 19:34:11,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 2: [2022-11-26 19:34:11,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 13: [2022-11-26 19:34:11,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 19:34:11,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 19:34:11,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 13: [2022-11-26 19:34:11,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 13: [2022-11-26 19:34:11,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 31: [2022-11-26 19:34:11,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 19:34:11,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 
31: [2022-11-26 19:34:11,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 19:34:11,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 19:34:11,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 19:34:11,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 31: [2022-11-26 19:34:11,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 31: [2022-11-26 19:34:11,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 19:34:11,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 6: [2022-11-26 19:34:11,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:34:11,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 19:34:11,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 4: [2022-11-26 19:34:11,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-26 19:34:11,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 19:34:11,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 5: [2022-11-26 19:34:11,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:34:11,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 19:34:11,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 26: [2022-11-26 19:34:11,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 19:34:11,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 19:34:11,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 0: [2022-11-26 19:34:11,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:34:11,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 19:34:11,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 11: [2022-11-26 19:34:11,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-26 19:34:11,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 19:34:11,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 19: [2022-11-26 19:34:11,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:34:11,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 19:34:11,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 9: [2022-11-26 19:34:11,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:34:11,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:34:11,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 19:34:11,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 19:34:11,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 9: [2022-11-26 19:34:11,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 14: [2022-11-26 19:34:11,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-26 19:34:11,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 19:34:11,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 28: [2022-11-26 19:34:11,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:34:11,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:34:11,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 19:34:11,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 10: [2022-11-26 19:34:11,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:34:11,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 19:34:11,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 0: [2022-11-26 19:34:11,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:34:11,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 19:34:11,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
0: [2022-11-26 19:34:11,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 19:34:11,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 21: [2022-11-26 19:34:11,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 19:34:11,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 19:34:11,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 11: [2022-11-26 19:34:11,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:34:11,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 19:34:11,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 17: [2022-11-26 19:34:11,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 19:34:11,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 30: [2022-11-26 19:34:11,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
28: [2022-11-26 19:34:11,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 30: [2022-11-26 19:34:11,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 28: [2022-11-26 19:34:11,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 30: [2022-11-26 19:34:11,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 17: [2022-11-26 19:34:11,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 1: [2022-11-26 19:34:11,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:34:11,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 19:34:11,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 8: [2022-11-26 19:34:11,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:34:11,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 19:34:11,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 31: [2022-11-26 19:34:11,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-26 19:34:11,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 3: [2022-11-26 19:34:11,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 31: [2022-11-26 19:34:11,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 3: [2022-11-26 19:34:11,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 19:34:11,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 22: [2022-11-26 19:34:11,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 19:34:11,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 19:34:11,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 29: [2022-11-26 19:34:11,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:34:11,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 19:34:11,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 9: [2022-11-26 19:34:11,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-26 19:34:11,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 19:34:11,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 13: [2022-11-26 19:34:11,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:34:11,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 19:34:11,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 6: [2022-11-26 19:34:11,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:34:11,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 19:34:11,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 18: [2022-11-26 19:34:11,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:34:11,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 19:34:11,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 9: [2022-11-26 19:34:11,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 19:34:11,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 19:34:11,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 25: [2022-11-26 19:34:11,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:34:11,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 19:34:11,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 15: [2022-11-26 19:34:11,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:34:11,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 19:34:11,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 0: [2022-11-26 19:34:11,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:34:11,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 19:34:11,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 12: [2022-11-26 19:34:11,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-26 19:34:11,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 19:34:11,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 16: [2022-11-26 19:34:11,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 19:34:11,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 19:34:11,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 5: [2022-11-26 19:34:11,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:34:11,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 19:34:11,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 20: [2022-11-26 19:34:11,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 19:34:11,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 19:34:11,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 23: [2022-11-26 19:34:11,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-26 19:34:11,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 19:34:11,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 27: [2022-11-26 19:34:11,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:34:11,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 19:34:11,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 26: [2022-11-26 19:34:11,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 19:34:11,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 19:34:11,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 4: [2022-11-26 19:34:11,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:34:11,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 19:34:11,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 21: [2022-11-26 19:34:11,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-26 19:34:11,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 19:34:11,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 7: [2022-11-26 19:34:11,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:34:11,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 19:34:11,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 28: [2022-11-26 19:34:11,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:34:11,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 28: [2022-11-26 19:34:11,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 24: [2022-11-26 19:34:11,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 28: [2022-11-26 19:34:11,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 24: [2022-11-26 19:34:11,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 10: [2022-11-26 19:34:11,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-26 19:34:11,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 19:34:11,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 2: [2022-11-26 19:34:11,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:34:11,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 19:34:11,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 19: [2022-11-26 19:34:11,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:34:11,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 19:34:11,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 17: [2022-11-26 19:34:11,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 19:34:11,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 19:34:11,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 11: [2022-11-26 19:34:11,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 19:34:11,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 19:34:11,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 14: [2022-11-26 19:34:11,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:34:11,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 19:34:11,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 3: [2022-11-26 19:34:11,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:34:11,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 19:34:11,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 22: [2022-11-26 19:34:11,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 19:34:11,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 19:34:11,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 8: [2022-11-26 19:34:11,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 19:34:11,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 19:34:11,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 1: [2022-11-26 19:34:11,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:34:11,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 19:34:11,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 31: [2022-11-26 19:34:11,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 19:34:11,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 19:34:11,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 25: [2022-11-26 19:34:11,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:34:11,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 19:34:11,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 9: [2022-11-26 19:34:11,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 19:34:11,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 19:34:11,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 30: [2022-11-26 19:34:11,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 19:34:11,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 19:34:11,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 29: [2022-11-26 19:34:11,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:34:11,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 19:34:11,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 18: [2022-11-26 19:34:11,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:34:11,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 19:34:11,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 12: [2022-11-26 19:34:11,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 19:34:11,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 19:34:11,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 23: [2022-11-26 19:34:11,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:34:11,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 19:34:11,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 6: [2022-11-26 19:34:11,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:34:11,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 19:34:11,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 16: [2022-11-26 19:34:11,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 19:34:11,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 19:34:11,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 5: [2022-11-26 19:34:11,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-26 19:34:11,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 19:34:11,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 7: [2022-11-26 19:34:11,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:34:11,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 19:34:11,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 20: [2022-11-26 19:34:11,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 19:34:11,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 19:34:11,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 15: [2022-11-26 19:34:11,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:34:11,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 19:34:11,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 26: [2022-11-26 19:34:11,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-26 19:34:11,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 19:34:11,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 0: [2022-11-26 19:34:11,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:34:11,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 19:34:11,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 11: [2022-11-26 19:34:11,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:34:11,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 19:34:11,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 21: [2022-11-26 19:34:11,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 19:34:11,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 19:34:11,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 13: [2022-11-26 19:34:11,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 19:34:11,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 19:34:11,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 10: [2022-11-26 19:34:11,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:34:11,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 19:34:11,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 4: [2022-11-26 19:34:11,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:34:11,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:34:11,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:34:11,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 19:34:11,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
24: [2022-11-26 19:34:11,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 27: [2022-11-26 19:34:11,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 24: [2022-11-26 19:34:11,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 27: [2022-11-26 19:34:11,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 17: [2022-11-26 19:34:11,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 19:34:11,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 19:34:11,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 14: [2022-11-26 19:34:11,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:34:11,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:34:11,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 19:34:11,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
19: [2022-11-26 19:34:11,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 19:34:11,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 1: [2022-11-26 19:34:11,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:34:11,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 19:34:11,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 2: [2022-11-26 19:34:11,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:34:11,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 19:34:11,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 3: [2022-11-26 19:34:11,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:34:11,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 19:34:11,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 8: [2022-11-26 19:34:11,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 19:34:11,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 19:34:11,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 31: [2022-11-26 19:34:11,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 19:34:11,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 19:34:11,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 6: [2022-11-26 19:34:11,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:34:11,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 19:34:11,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 22: [2022-11-26 19:34:11,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 19:34:11,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 19:34:11,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 29: [2022-11-26 19:34:11,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
29: [2022-11-26 19:34:11,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 0: [2022-11-26 19:34:11,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:34:11,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 29: [2022-11-26 19:34:11,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 0: [2022-11-26 19:34:11,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 25: [2022-11-26 19:34:11,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:34:11,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 19:34:11,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 16: [2022-11-26 19:34:11,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:34:11,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:34:11,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
16: [2022-11-26 19:34:11,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 19:34:11,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 18: [2022-11-26 19:34:11,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 19:34:11,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 7: [2022-11-26 19:34:11,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 19:34:11,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 23: [2022-11-26 19:34:11,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:34:11,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 19:34:11,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 30: [2022-11-26 19:34:11,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 19:34:11,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 28: [2022-11-26 19:34:11,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
30: [2022-11-26 19:34:11,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 15: [2022-11-26 19:34:11,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:34:11,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 19:34:11,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 9: [2022-11-26 19:34:11,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:34:11,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 19:34:11,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 12: [2022-11-26 19:34:11,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:34:11,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 19:34:11,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 20: [2022-11-26 19:34:11,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 
20: [2022-11-26 19:34:11,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 27: [2022-11-26 19:34:11,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 20: [2022-11-26 19:34:11,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 27: [2022-11-26 19:34:11,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 19:34:11,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 28: [2022-11-26 19:34:11,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 19:34:11,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 5: [2022-11-26 19:34:11,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:34:11,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 19:34:11,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 2: [2022-11-26 19:34:11,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 19:34:11,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 19:34:11,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 26: [2022-11-26 19:34:11,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 19:34:11,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 19:34:11,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 4: [2022-11-26 19:34:11,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:34:11,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 19:34:11,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 11: [2022-11-26 19:34:11,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:34:11,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 19:34:11,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 21: [2022-11-26 19:34:11,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-26 19:34:11,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 19:34:11,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 24: [2022-11-26 19:34:11,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:34:11,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 19:34:11,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 10: [2022-11-26 19:34:11,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:34:11,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 19:34:11,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 19: [2022-11-26 19:34:11,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:34:11,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 19:34:11,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 17: [2022-11-26 19:34:11,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-26 19:34:11,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 19:34:11,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 28: [2022-11-26 19:34:11,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 19:34:11,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 19:34:11,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 30: [2022-11-26 19:34:11,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 19:34:11,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 19:34:11,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 13: [2022-11-26 19:34:11,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:34:11,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 19:34:11,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 31: [2022-11-26 19:34:11,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 
31: [2022-11-26 19:34:11,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 19:34:11,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 8: [2022-11-26 19:34:11,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:34:11,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 19:34:11,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 3: [2022-11-26 19:34:11,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:34:11,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 19:34:11,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 29: [2022-11-26 19:34:11,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:34:11,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 19:34:11,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 15: [2022-11-26 19:34:11,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
14: [2022-11-26 19:34:11,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 16: [2022-11-26 19:34:11,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:34:11,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 16: [2022-11-26 19:34:11,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 15: [2022-11-26 19:34:11,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 16: [2022-11-26 19:34:11,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 14: [2022-11-26 19:34:11,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 19:34:11,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 6: [2022-11-26 19:34:11,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:34:11,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 19:34:11,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 6: [2022-11-26 19:34:11,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 19:34:11,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 0: [2022-11-26 19:34:11,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 20: [2022-11-26 19:34:11,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 19:34:11,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 19:34:11,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 13: [2022-11-26 19:34:11,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:34:11,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 19:34:11,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 9: [2022-11-26 19:34:11,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 19:34:11,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 19:34:11,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 7: [2022-11-26 19:34:11,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:34:11,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 22: [2022-11-26 19:34:11,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:34:11,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 18: [2022-11-26 19:34:11,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 7: [2022-11-26 19:34:11,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 22: [2022-11-26 19:34:11,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 18: [2022-11-26 19:34:11,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 22: [2022-11-26 19:34:11,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 5: [2022-11-26 19:34:11,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 19:34:11,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 19:34:11,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 25: [2022-11-26 19:34:11,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:34:11,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 19:34:11,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 11: [2022-11-26 19:34:11,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:34:11,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 19:34:11,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 27: [2022-11-26 19:34:11,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:34:11,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 19:34:11,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 19: [2022-11-26 19:34:11,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
19: [2022-11-26 19:34:11,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 19:34:11,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 24: [2022-11-26 19:34:11,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:34:11,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 19:34:11,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 4: [2022-11-26 19:34:11,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:34:11,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 19:34:11,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 10: [2022-11-26 19:34:11,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:34:11,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
10: [2022-11-26 19:34:11,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 1: [2022-11-26 19:34:11,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 10: [2022-11-26 19:34:11,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 1: [2022-11-26 19:34:11,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 2: [2022-11-26 19:34:11,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:34:11,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:34:11,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 19:34:11,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 23: [2022-11-26 19:34:11,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 19:34:11,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 17: [2022-11-26 19:34:11,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 30: [2022-11-26 19:34:11,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
17: [2022-11-26 19:34:11,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 19:34:11,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 30: [2022-11-26 19:34:11,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 19:34:11,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 12: [2022-11-26 19:34:11,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:34:11,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 19:34:11,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 8: [2022-11-26 19:34:11,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:34:11,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 19:34:11,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 21: [2022-11-26 19:34:11,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-26 19:34:11,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 19:34:11,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 18: [2022-11-26 19:34:11,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:34:11,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 19:34:11,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 31: [2022-11-26 19:34:11,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 19:34:11,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 19:34:11,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 3: [2022-11-26 19:34:11,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:34:11,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 19:34:11,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 29: [2022-11-26 19:34:11,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-26 19:34:11,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 19:34:11,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 14: [2022-11-26 19:34:11,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:34:11,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 19:34:11,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 16: [2022-11-26 19:34:11,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 19:34:11,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 19:34:11,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 23: [2022-11-26 19:34:11,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:34:11,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 19:34:11,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 9: [2022-11-26 19:34:11,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
22: [2022-11-26 19:34:11,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:34:11,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 22: [2022-11-26 19:34:11,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 19:34:11,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 9: [2022-11-26 19:34:11,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 25: [2022-11-26 19:34:11,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:34:11,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 19:34:11,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 13: [2022-11-26 19:34:11,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:34:11,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 1: [2022-11-26 19:34:11,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:34:11,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
1: [2022-11-26 19:34:11,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 19:34:11,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 15: [2022-11-26 19:34:11,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:34:11,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 19:34:11,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 22: [2022-11-26 19:34:11,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 19:34:11,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 19:34:11,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 12: [2022-11-26 19:34:11,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:34:11,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 19:34:11,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 28: [2022-11-26 19:34:11,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 
28: [2022-11-26 19:34:11,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 19:34:11,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 6: [2022-11-26 19:34:11,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:34:11,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 19:34:11,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 26: [2022-11-26 19:34:11,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 19:34:11,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 19:34:11,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 1: [2022-11-26 19:34:11,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:34:11,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step113000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 19:34:11,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
0: successfully saved checkpoint at iteration 113000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2572.06 31: iteration 113010/ 173500 | consumed samples: 28930560 | consumed tokens: 59249786880 | elapsed time per iteration (s): 1.12 | learning rate: 6.969E-05 | global batch size: 256 | lm loss: 1.973505E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.019 | TFLOPs: 13.79 | 31: iteration 113020/ 173500 | consumed samples: 28933120 | consumed tokens: 59255029760 | elapsed time per iteration (s): 0.85 | learning rate: 6.967E-05 | global batch size: 256 | lm loss: 1.973914E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.872 | TFLOPs: 18.14 | 31: iteration 113030/ 173500 | consumed samples: 28935680 | consumed tokens: 59260272640 | elapsed time per iteration (s): 0.84 | learning rate: 6.966E-05 | global batch size: 256 | lm loss: 1.943464E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.018 | TFLOPs: 18.39 | 31: iteration 113040/ 173500 | consumed samples: 28938240 | consumed tokens: 59265515520 | elapsed time per iteration (s): 0.88 | learning rate: 6.964E-05 | global batch size: 256 | lm loss: 1.963262E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.485 | TFLOPs: 17.51 | 31: iteration 113050/ 173500 | consumed samples: 28940800 | consumed tokens: 59270758400 | elapsed time per iteration (s): 0.90 | learning rate: 6.963E-05 | global batch size: 256 | lm loss: 1.970086E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.099 | TFLOPs: 17.19 | 31: iteration 113060/ 173500 | consumed samples: 28943360 | consumed tokens: 59276001280 | elapsed time per iteration (s): 
1.34 | learning rate: 6.961E-05 | global batch size: 256 | lm loss: 1.988265E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 190.772 | TFLOPs: 11.54 | 31: iteration 113070/ 173500 | consumed samples: 28945920 | consumed tokens: 59281244160 | elapsed time per iteration (s): 0.86 | learning rate: 6.960E-05 | global batch size: 256 | lm loss: 1.921522E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.994 | TFLOPs: 18.09 | 31: iteration 113080/ 173500 | consumed samples: 28948480 | consumed tokens: 59286487040 | elapsed time per iteration (s): 0.89 | learning rate: 6.959E-05 | global batch size: 256 | lm loss: 1.958016E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.761 | TFLOPs: 17.35 | 31: iteration 113090/ 173500 | consumed samples: 28951040 | consumed tokens: 59291729920 | elapsed time per iteration (s): 0.86 | learning rate: 6.957E-05 | global batch size: 256 | lm loss: 1.963337E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.915 | TFLOPs: 17.96 | 31: iteration 113100/ 173500 | consumed samples: 28953600 | consumed tokens: 59296972800 | elapsed time per iteration (s): 0.82 | learning rate: 6.956E-05 | global batch size: 256 | lm loss: 1.960534E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.248 | TFLOPs: 18.89 | 31: iteration 113110/ 173500 | consumed samples: 28956160 | consumed tokens: 59302215680 | elapsed time per iteration (s): 0.89 | learning rate: 6.954E-05 | global batch size: 256 | lm loss: 1.948161E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.453 | TFLOPs: 17.45 | 31: 
iteration 113120/ 173500 | consumed samples: 28958720 | consumed tokens: 59307458560 | elapsed time per iteration (s): 0.88 | learning rate: 6.953E-05 | global batch size: 256 | lm loss: 1.985950E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.532 | TFLOPs: 17.52 | 31: iteration 113130/ 173500 | consumed samples: 28961280 | consumed tokens: 59312701440 | elapsed time per iteration (s): 0.77 | learning rate: 6.951E-05 | global batch size: 256 | lm loss: 1.973305E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.460 | TFLOPs: 20.05 | 31: iteration 113140/ 173500 | consumed samples: 28963840 | consumed tokens: 59317944320 | elapsed time per iteration (s): 0.81 | learning rate: 6.950E-05 | global batch size: 256 | lm loss: 1.935093E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.461 | TFLOPs: 19.15 | 31: iteration 113150/ 173500 | consumed samples: 28966400 | consumed tokens: 59323187200 | elapsed time per iteration (s): 0.75 | learning rate: 6.948E-05 | global batch size: 256 | lm loss: 1.950957E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.722 | TFLOPs: 20.67 | 31: iteration 113160/ 173500 | consumed samples: 28968960 | consumed tokens: 59328430080 | elapsed time per iteration (s): 0.73 | learning rate: 6.947E-05 | global batch size: 256 | lm loss: 1.959906E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.772 | TFLOPs: 21.10 | 31: iteration 113170/ 173500 | consumed samples: 28971520 | consumed tokens: 59333672960 | elapsed time per iteration (s): 0.78 | learning rate: 6.945E-05 | global batch size: 256 | lm loss: 1.927371E+00 | grad norm: 0.159 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.240 | TFLOPs: 19.86 | 31: iteration 113180/ 173500 | consumed samples: 28974080 | consumed tokens: 59338915840 | elapsed time per iteration (s): 0.75 | learning rate: 6.944E-05 | global batch size: 256 | lm loss: 1.972320E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.978 | TFLOPs: 20.69 | 31: iteration 113190/ 173500 | consumed samples: 28976640 | consumed tokens: 59344158720 | elapsed time per iteration (s): 0.77 | learning rate: 6.942E-05 | global batch size: 256 | lm loss: 1.966801E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.392 | TFLOPs: 20.11 | 31: iteration 113200/ 173500 | consumed samples: 28979200 | consumed tokens: 59349401600 | elapsed time per iteration (s): 0.77 | learning rate: 6.941E-05 | global batch size: 256 | lm loss: 1.984218E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.286 | TFLOPs: 20.10 | 31: iteration 113210/ 173500 | consumed samples: 28981760 | consumed tokens: 59354644480 | elapsed time per iteration (s): 0.72 | learning rate: 6.939E-05 | global batch size: 256 | lm loss: 1.957682E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.098 | TFLOPs: 21.60 | 31: iteration 113220/ 173500 | consumed samples: 28984320 | consumed tokens: 59359887360 | elapsed time per iteration (s): 0.79 | learning rate: 6.938E-05 | global batch size: 256 | lm loss: 1.909979E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.802 | TFLOPs: 19.65 | 31: iteration 113230/ 173500 | consumed samples: 28986880 | consumed tokens: 59365130240 | elapsed time per iteration (s): 0.78 | 
learning rate: 6.936E-05 | global batch size: 256 | lm loss: 1.950864E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.632 | TFLOPs: 19.94 | 31: iteration 113240/ 173500 | consumed samples: 28989440 | consumed tokens: 59370373120 | elapsed time per iteration (s): 0.78 | learning rate: 6.935E-05 | global batch size: 256 | lm loss: 1.955470E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.781 | TFLOPs: 19.83 | 31: iteration 113250/ 173500 | consumed samples: 28992000 | consumed tokens: 59375616000 | elapsed time per iteration (s): 0.74 | learning rate: 6.934E-05 | global batch size: 256 | lm loss: 1.945231E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.068 | TFLOPs: 21.06 | 31: iteration 113260/ 173500 | consumed samples: 28994560 | consumed tokens: 59380858880 | elapsed time per iteration (s): 0.75 | learning rate: 6.932E-05 | global batch size: 256 | lm loss: 1.954708E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.364 | TFLOPs: 20.71 | 31: iteration 113270/ 173500 | consumed samples: 28997120 | consumed tokens: 59386101760 | elapsed time per iteration (s): 0.76 | learning rate: 6.931E-05 | global batch size: 256 | lm loss: 1.955240E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.002 | TFLOPs: 20.45 | 31: iteration 113280/ 173500 | consumed samples: 28999680 | consumed tokens: 59391344640 | elapsed time per iteration (s): 0.75 | learning rate: 6.929E-05 | global batch size: 256 | lm loss: 1.977335E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.424 | TFLOPs: 20.53 | 31: iteration 
113290/ 173500 | consumed samples: 29002240 | consumed tokens: 59396587520 | elapsed time per iteration (s): 0.76 | learning rate: 6.928E-05 | global batch size: 256 | lm loss: 1.955250E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.621 | TFLOPs: 20.36 | 31: iteration 113300/ 173500 | consumed samples: 29004800 | consumed tokens: 59401830400 | elapsed time per iteration (s): 0.75 | learning rate: 6.926E-05 | global batch size: 256 | lm loss: 1.937342E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.550 | TFLOPs: 20.54 | 31: iteration 113310/ 173500 | consumed samples: 29007360 | consumed tokens: 59407073280 | elapsed time per iteration (s): 0.76 | learning rate: 6.925E-05 | global batch size: 256 | lm loss: 1.949182E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.829 | TFLOPs: 20.50 | 31: iteration 113320/ 173500 | consumed samples: 29009920 | consumed tokens: 59412316160 | elapsed time per iteration (s): 0.73 | learning rate: 6.923E-05 | global batch size: 256 | lm loss: 1.936149E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.985 | TFLOPs: 21.29 | 31: iteration 113330/ 173500 | consumed samples: 29012480 | consumed tokens: 59417559040 | elapsed time per iteration (s): 0.76 | learning rate: 6.922E-05 | global batch size: 256 | lm loss: 1.956642E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.717 | TFLOPs: 20.43 | 31: iteration 113340/ 173500 | consumed samples: 29015040 | consumed tokens: 59422801920 | elapsed time per iteration (s): 0.79 | learning rate: 6.920E-05 | global batch size: 256 | lm loss: 1.949814E+00 | grad norm: 0.160 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.055 | TFLOPs: 19.73 | 31: iteration 113350/ 173500 | consumed samples: 29017600 | consumed tokens: 59428044800 | elapsed time per iteration (s): 0.72 | learning rate: 6.919E-05 | global batch size: 256 | lm loss: 1.932187E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.564 | TFLOPs: 21.51 | 31: iteration 113360/ 173500 | consumed samples: 29020160 | consumed tokens: 59433287680 | elapsed time per iteration (s): 0.77 | learning rate: 6.917E-05 | global batch size: 256 | lm loss: 1.969220E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.108 | TFLOPs: 20.15 | 31: iteration 113370/ 173500 | consumed samples: 29022720 | consumed tokens: 59438530560 | elapsed time per iteration (s): 0.74 | learning rate: 6.916E-05 | global batch size: 256 | lm loss: 1.933400E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.682 | TFLOPs: 20.91 | 31: iteration 113380/ 173500 | consumed samples: 29025280 | consumed tokens: 59443773440 | elapsed time per iteration (s): 0.74 | learning rate: 6.914E-05 | global batch size: 256 | lm loss: 1.938190E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.636 | TFLOPs: 20.85 | 31: iteration 113390/ 173500 | consumed samples: 29027840 | consumed tokens: 59449016320 | elapsed time per iteration (s): 0.77 | learning rate: 6.913E-05 | global batch size: 256 | lm loss: 1.942662E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.085 | TFLOPs: 20.21 | 31: iteration 113400/ 173500 | consumed samples: 29030400 | consumed tokens: 59454259200 | elapsed time per iteration (s): 0.75 | learning 
rate: 6.912E-05 | global batch size: 256 | lm loss: 1.963039E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.782 | TFLOPs: 20.62 | 31: iteration 113410/ 173500 | consumed samples: 29032960 | consumed tokens: 59459502080 | elapsed time per iteration (s): 0.83 | learning rate: 6.910E-05 | global batch size: 256 | lm loss: 1.965875E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.619 | TFLOPs: 18.67 | 31: iteration 113420/ 173500 | consumed samples: 29035520 | consumed tokens: 59464744960 | elapsed time per iteration (s): 0.78 | learning rate: 6.909E-05 | global batch size: 256 | lm loss: 1.941004E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.178 | TFLOPs: 19.79 | 31: iteration 113430/ 173500 | consumed samples: 29038080 | consumed tokens: 59469987840 | elapsed time per iteration (s): 0.76 | learning rate: 6.907E-05 | global batch size: 256 | lm loss: 1.958574E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.030 | TFLOPs: 20.39 | 31: iteration 113440/ 173500 | consumed samples: 29040640 | consumed tokens: 59475230720 | elapsed time per iteration (s): 0.78 | learning rate: 6.906E-05 | global batch size: 256 | lm loss: 1.977216E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.346 | TFLOPs: 19.80 | 31: iteration 113450/ 173500 | consumed samples: 29043200 | consumed tokens: 59480473600 | elapsed time per iteration (s): 0.76 | learning rate: 6.904E-05 | global batch size: 256 | lm loss: 1.936298E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.736 | TFLOPs: 20.25 | 31: iteration 113460/ 
173500 | consumed samples: 29045760 | consumed tokens: 59485716480 | elapsed time per iteration (s): 0.75 | learning rate: 6.903E-05 | global batch size: 256 | lm loss: 1.963553E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.373 | TFLOPs: 20.53 | 31: iteration 113470/ 173500 | consumed samples: 29048320 | consumed tokens: 59490959360 | elapsed time per iteration (s): 0.72 | learning rate: 6.901E-05 | global batch size: 256 | lm loss: 1.939197E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.331 | TFLOPs: 21.50 | 31: iteration 113480/ 173500 | consumed samples: 29050880 | consumed tokens: 59496202240 | elapsed time per iteration (s): 0.79 | learning rate: 6.900E-05 | global batch size: 256 | lm loss: 1.965374E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.714 | TFLOPs: 19.52 | 31: iteration 113490/ 173500 | consumed samples: 29053440 | consumed tokens: 59501445120 | elapsed time per iteration (s): 0.87 | learning rate: 6.898E-05 | global batch size: 256 | lm loss: 1.929023E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.115 | TFLOPs: 17.73 | 31: iteration 113500/ 173500 | consumed samples: 29056000 | consumed tokens: 59506688000 | elapsed time per iteration (s): 0.82 | learning rate: 6.897E-05 | global batch size: 256 | lm loss: 1.939231E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.543 | TFLOPs: 18.97 | 31: iteration 113510/ 173500 | consumed samples: 29058560 | consumed tokens: 59511930880 | elapsed time per iteration (s): 0.79 | learning rate: 6.895E-05 | global batch size: 256 | lm loss: 1.957944E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 322.776 | TFLOPs: 19.53 | 31: iteration 113520/ 173500 | consumed samples: 29061120 | consumed tokens: 59517173760 | elapsed time per iteration (s): 0.86 | learning rate: 6.894E-05 | global batch size: 256 | lm loss: 1.972124E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.383 | TFLOPs: 18.05 | 31: iteration 113530/ 173500 | consumed samples: 29063680 | consumed tokens: 59522416640 | elapsed time per iteration (s): 0.78 | learning rate: 6.892E-05 | global batch size: 256 | lm loss: 1.962235E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.857 | TFLOPs: 19.96 | 31: iteration 113540/ 173500 | consumed samples: 29066240 | consumed tokens: 59527659520 | elapsed time per iteration (s): 0.79 | learning rate: 6.891E-05 | global batch size: 256 | lm loss: 1.962023E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.292 | TFLOPs: 19.68 | 31: iteration 113550/ 173500 | consumed samples: 29068800 | consumed tokens: 59532902400 | elapsed time per iteration (s): 0.74 | learning rate: 6.890E-05 | global batch size: 256 | lm loss: 1.965955E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.225 | TFLOPs: 20.89 | 31: iteration 113560/ 173500 | consumed samples: 29071360 | consumed tokens: 59538145280 | elapsed time per iteration (s): 0.74 | learning rate: 6.888E-05 | global batch size: 256 | lm loss: 1.949386E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.666 | TFLOPs: 20.79 | 31: iteration 113570/ 173500 | consumed samples: 29073920 | consumed tokens: 59543388160 | elapsed time per iteration (s): 0.78 | learning rate: 
6.887E-05 | global batch size: 256 | lm loss: 1.971387E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.370 | TFLOPs: 19.81 | 31: iteration 113580/ 173500 | consumed samples: 29076480 | consumed tokens: 59548631040 | elapsed time per iteration (s): 0.75 | learning rate: 6.885E-05 | global batch size: 256 | lm loss: 1.996505E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.733 | TFLOPs: 20.73 | 31: iteration 113590/ 173500 | consumed samples: 29079040 | consumed tokens: 59553873920 | elapsed time per iteration (s): 0.73 | learning rate: 6.884E-05 | global batch size: 256 | lm loss: 1.930001E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.414 | TFLOPs: 21.08 | 31: iteration 113600/ 173500 | consumed samples: 29081600 | consumed tokens: 59559116800 | elapsed time per iteration (s): 0.78 | learning rate: 6.882E-05 | global batch size: 256 | lm loss: 1.954670E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.996 | TFLOPs: 19.90 | 31: iteration 113610/ 173500 | consumed samples: 29084160 | consumed tokens: 59564359680 | elapsed time per iteration (s): 0.76 | learning rate: 6.881E-05 | global batch size: 256 | lm loss: 1.952157E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.548 | TFLOPs: 20.30 | 31: iteration 113620/ 173500 | consumed samples: 29086720 | consumed tokens: 59569602560 | elapsed time per iteration (s): 0.76 | learning rate: 6.879E-05 | global batch size: 256 | lm loss: 1.919490E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.919 | TFLOPs: 20.32 | 31: iteration 113630/ 173500 | 
consumed samples: 29089280 | consumed tokens: 59574845440 | elapsed time per iteration (s): 0.75 | learning rate: 6.878E-05 | global batch size: 256 | lm loss: 1.976044E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.892 | TFLOPs: 20.74 | 31: iteration 113640/ 173500 | consumed samples: 29091840 | consumed tokens: 59580088320 | elapsed time per iteration (s): 0.78 | learning rate: 6.876E-05 | global batch size: 256 | lm loss: 1.968545E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.397 | TFLOPs: 19.87 | 31: iteration 113650/ 173500 | consumed samples: 29094400 | consumed tokens: 59585331200 | elapsed time per iteration (s): 0.75 | learning rate: 6.875E-05 | global batch size: 256 | lm loss: 1.968106E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.107 | TFLOPs: 20.52 | 31: iteration 113660/ 173500 | consumed samples: 29096960 | consumed tokens: 59590574080 | elapsed time per iteration (s): 0.77 | learning rate: 6.873E-05 | global batch size: 256 | lm loss: 1.958474E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.608 | TFLOPs: 20.24 | 31: iteration 113670/ 173500 | consumed samples: 29099520 | consumed tokens: 59595816960 | elapsed time per iteration (s): 0.76 | learning rate: 6.872E-05 | global batch size: 256 | lm loss: 1.982271E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.214 | TFLOPs: 20.34 | 31: iteration 113680/ 173500 | consumed samples: 29102080 | consumed tokens: 59601059840 | elapsed time per iteration (s): 0.82 | learning rate: 6.871E-05 | global batch size: 256 | lm loss: 1.968727E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 313.758 | TFLOPs: 18.98 | 31: iteration 113690/ 173500 | consumed samples: 29104640 | consumed tokens: 59606302720 | elapsed time per iteration (s): 0.84 | learning rate: 6.869E-05 | global batch size: 256 | lm loss: 1.963060E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.530 | TFLOPs: 18.42 | 31: iteration 113700/ 173500 | consumed samples: 29107200 | consumed tokens: 59611545600 | elapsed time per iteration (s): 0.87 | learning rate: 6.868E-05 | global batch size: 256 | lm loss: 1.969389E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.113 | TFLOPs: 17.85 | 31: iteration 113710/ 173500 | consumed samples: 29109760 | consumed tokens: 59616788480 | elapsed time per iteration (s): 0.80 | learning rate: 6.866E-05 | global batch size: 256 | lm loss: 1.949314E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.575 | TFLOPs: 19.27 | 31: iteration 113720/ 173500 | consumed samples: 29112320 | consumed tokens: 59622031360 | elapsed time per iteration (s): 0.79 | learning rate: 6.865E-05 | global batch size: 256 | lm loss: 1.929284E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.763 | TFLOPs: 19.53 | 31: iteration 113730/ 173500 | consumed samples: 29114880 | consumed tokens: 59627274240 | elapsed time per iteration (s): 0.82 | learning rate: 6.863E-05 | global batch size: 256 | lm loss: 1.950693E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.740 | TFLOPs: 18.80 | 31: iteration 113740/ 173500 | consumed samples: 29117440 | consumed tokens: 59632517120 | elapsed time per iteration (s): 0.78 | learning rate: 
6.862E-05 | global batch size: 256 | lm loss: 1.952420E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.183 | TFLOPs: 19.79 | 31: iteration 113750/ 173500 | consumed samples: 29120000 | consumed tokens: 59637760000 | elapsed time per iteration (s): 0.80 | learning rate: 6.860E-05 | global batch size: 256 | lm loss: 1.927460E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.219 | TFLOPs: 19.31 | 31: iteration 113760/ 173500 | consumed samples: 29122560 | consumed tokens: 59643002880 | elapsed time per iteration (s): 0.81 | learning rate: 6.859E-05 | global batch size: 256 | lm loss: 1.947478E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.714 | TFLOPs: 19.22 | 31: iteration 113770/ 173500 | consumed samples: 29125120 | consumed tokens: 59648245760 | elapsed time per iteration (s): 0.86 | learning rate: 6.857E-05 | global batch size: 256 | lm loss: 1.954038E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.013 | TFLOPs: 17.91 | 31: iteration 113780/ 173500 | consumed samples: 29127680 | consumed tokens: 59653488640 | elapsed time per iteration (s): 0.83 | learning rate: 6.856E-05 | global batch size: 256 | lm loss: 1.959258E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.262 | TFLOPs: 18.59 | 31: iteration 113790/ 173500 | consumed samples: 29130240 | consumed tokens: 59658731520 | elapsed time per iteration (s): 0.86 | learning rate: 6.854E-05 | global batch size: 256 | lm loss: 1.969789E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.232 | TFLOPs: 17.92 | 31: iteration 113800/ 173500 | 
consumed samples: 29132800 | consumed tokens: 59663974400 | elapsed time per iteration (s): 0.76 | learning rate: 6.853E-05 | global batch size: 256 | lm loss: 1.941843E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.829 | TFLOPs: 20.32 | 31: iteration 113810/ 173500 | consumed samples: 29135360 | consumed tokens: 59669217280 | elapsed time per iteration (s): 0.76 | learning rate: 6.852E-05 | global batch size: 256 | lm loss: 1.941499E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.488 | TFLOPs: 20.36 | 31: iteration 113820/ 173500 | consumed samples: 29137920 | consumed tokens: 59674460160 | elapsed time per iteration (s): 0.76 | learning rate: 6.850E-05 | global batch size: 256 | lm loss: 1.934723E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.109 | TFLOPs: 20.27 | 31: iteration 113830/ 173500 | consumed samples: 29140480 | consumed tokens: 59679703040 | elapsed time per iteration (s): 0.73 | learning rate: 6.849E-05 | global batch size: 256 | lm loss: 1.939628E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.536 | TFLOPs: 21.09 | 31: iteration 113840/ 173500 | consumed samples: 29143040 | consumed tokens: 59684945920 | elapsed time per iteration (s): 0.87 | learning rate: 6.847E-05 | global batch size: 256 | lm loss: 1.957088E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.847 | TFLOPs: 17.84 | 31: iteration 113850/ 173500 | consumed samples: 29145600 | consumed tokens: 59690188800 | elapsed time per iteration (s): 0.76 | learning rate: 6.846E-05 | global batch size: 256 | lm loss: 1.946287E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 338.361 | TFLOPs: 20.47 | 31: iteration 113860/ 173500 | consumed samples: 29148160 | consumed tokens: 59695431680 | elapsed time per iteration (s): 0.77 | learning rate: 6.844E-05 | global batch size: 256 | lm loss: 1.949032E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.165 | TFLOPs: 20.22 | 31: iteration 113870/ 173500 | consumed samples: 29150720 | consumed tokens: 59700674560 | elapsed time per iteration (s): 0.83 | learning rate: 6.843E-05 | global batch size: 256 | lm loss: 1.941614E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.782 | TFLOPs: 18.56 | 31: iteration 113880/ 173500 | consumed samples: 29153280 | consumed tokens: 59705917440 | elapsed time per iteration (s): 0.74 | learning rate: 6.841E-05 | global batch size: 256 | lm loss: 1.941488E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.762 | TFLOPs: 21.04 | 31: iteration 113890/ 173500 | consumed samples: 29155840 | consumed tokens: 59711160320 | elapsed time per iteration (s): 0.80 | learning rate: 6.840E-05 | global batch size: 256 | lm loss: 1.966873E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.484 | TFLOPs: 19.39 | 31: iteration 113900/ 173500 | consumed samples: 29158400 | consumed tokens: 59716403200 | elapsed time per iteration (s): 0.81 | learning rate: 6.838E-05 | global batch size: 256 | lm loss: 1.962291E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.300 | TFLOPs: 19.07 | 31: iteration 113910/ 173500 | consumed samples: 29160960 | consumed tokens: 59721646080 | elapsed time per iteration (s): 0.76 | learning rate: 
6.837E-05 | global batch size: 256 | lm loss: 1.966624E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.666 | TFLOPs: 20.37 | 31: iteration 113920/ 173500 | consumed samples: 29163520 | consumed tokens: 59726888960 | elapsed time per iteration (s): 0.76 | learning rate: 6.835E-05 | global batch size: 256 | lm loss: 1.961953E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.686 | TFLOPs: 20.25 | 31: iteration 113930/ 173500 | consumed samples: 29166080 | consumed tokens: 59732131840 | elapsed time per iteration (s): 0.78 | learning rate: 6.834E-05 | global batch size: 256 | lm loss: 1.940452E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.277 | TFLOPs: 19.92 | 31: iteration 113940/ 173500 | consumed samples: 29168640 | consumed tokens: 59737374720 | elapsed time per iteration (s): 0.75 | learning rate: 6.833E-05 | global batch size: 256 | lm loss: 1.981191E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.426 | TFLOPs: 20.78 | 31: iteration 113950/ 173500 | consumed samples: 29171200 | consumed tokens: 59742617600 | elapsed time per iteration (s): 0.75 | learning rate: 6.831E-05 | global batch size: 256 | lm loss: 1.975661E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.140 | TFLOPs: 20.52 | 31: iteration 113960/ 173500 | consumed samples: 29173760 | consumed tokens: 59747860480 | elapsed time per iteration (s): 0.77 | learning rate: 6.830E-05 | global batch size: 256 | lm loss: 1.934466E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.669 | TFLOPs: 20.19 | 31: iteration 113970/ 173500 | 
consumed samples: 29176320 | consumed tokens: 59753103360 | elapsed time per iteration (s): 0.77 | learning rate: 6.828E-05 | global batch size: 256 | lm loss: 1.962766E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.861 | TFLOPs: 20.20 | 31: iteration 113980/ 173500 | consumed samples: 29178880 | consumed tokens: 59758346240 | elapsed time per iteration (s): 0.94 | learning rate: 6.827E-05 | global batch size: 256 | lm loss: 1.914937E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 272.580 | TFLOPs: 16.49 | 31: iteration 113990/ 173500 | consumed samples: 29181440 | consumed tokens: 59763589120 | elapsed time per iteration (s): 0.73 | learning rate: 6.825E-05 | global batch size: 256 | lm loss: 1.978837E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.407 | TFLOPs: 21.26 | 0: [2022-11-26 19:47:24,292] [INFO] [logging.py:68:log_dist] [Rank 0] step=114000, skipped=0, lr=[6.823796836261315e-05, 6.823796836261315e-05, 6.823796836261315e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 114000/ 173500 | consumed samples: 29184000 | consumed tokens: 59768832000 | elapsed time per iteration (s): 0.72 | learning rate: 6.824E-05 | global batch size: 256 | lm loss: 1.969692E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.436 | TFLOPs: 21.38 | 0: steps: 114000 loss: 1.9759 iter time (s): 0.787 samples/sec: 325.366 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 114000 | lm loss value: 1.935553E+00 | lm loss PPL: 6.927871E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 114000 to 
checkpoints_1b1long 0: [2022-11-26 19:47:24,543] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step114000 is begin to save! 0: [2022-11-26 19:47:24,551] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_01-model_00-model_states.pt... 0: [2022-11-26 19:47:24,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_01-model_00-model_states.pt. 0: [2022-11-26 19:47:24,802] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_03-model_00-model_states.pt... 0: [2022-11-26 19:47:24,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_03-model_00-model_states.pt. 0: [2022-11-26 19:47:24,882] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_04-model_00-model_states.pt... 0: [2022-11-26 19:47:24,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_04-model_00-model_states.pt. 0: [2022-11-26 19:47:24,955] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_05-model_00-model_states.pt... 0: [2022-11-26 19:47:25,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_05-model_00-model_states.pt. 0: [2022-11-26 19:47:25,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_06-model_00-model_states.pt... 0: [2022-11-26 19:47:25,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_06-model_00-model_states.pt. 0: [2022-11-26 19:47:25,106] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_07-model_00-model_states.pt... 
0: [2022-11-26 19:47:25,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_07-model_00-model_states.pt. 0: [2022-11-26 19:47:25,183] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_08-model_00-model_states.pt... 0: [2022-11-26 19:47:25,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_08-model_00-model_states.pt. 0: [2022-11-26 19:47:25,258] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_09-model_00-model_states.pt... 0: [2022-11-26 19:47:25,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_09-model_00-model_states.pt. 0: [2022-11-26 19:47:25,331] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_10-model_00-model_states.pt... 0: [2022-11-26 19:47:25,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_10-model_00-model_states.pt. 0: [2022-11-26 19:47:25,407] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_11-model_00-model_states.pt... 0: [2022-11-26 19:47:25,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_11-model_00-model_states.pt. 0: [2022-11-26 19:47:25,481] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_12-model_00-model_states.pt... 0: [2022-11-26 19:47:25,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_12-model_00-model_states.pt. 0: [2022-11-26 19:47:25,557] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_13-model_00-model_states.pt... 
0: [2022-11-26 19:47:25,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_13-model_00-model_states.pt. 0: [2022-11-26 19:47:25,630] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_14-model_00-model_states.pt... 0: [2022-11-26 19:47:25,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_14-model_00-model_states.pt. 0: [2022-11-26 19:47:25,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_15-model_00-model_states.pt... 0: [2022-11-26 19:47:25,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_15-model_00-model_states.pt. 0: [2022-11-26 19:47:25,779] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_16-model_00-model_states.pt... 0: [2022-11-26 19:47:25,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_16-model_00-model_states.pt. 0: [2022-11-26 19:47:25,857] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_17-model_00-model_states.pt... 0: [2022-11-26 19:47:25,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_17-model_00-model_states.pt. 0: [2022-11-26 19:47:25,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_18-model_00-model_states.pt... 0: [2022-11-26 19:47:26,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_18-model_00-model_states.pt. 0: [2022-11-26 19:47:26,003] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_19-model_00-model_states.pt... 
0: [2022-11-26 19:47:26,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_19-model_00-model_states.pt. 0: [2022-11-26 19:47:26,079] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_20-model_00-model_states.pt... 0: [2022-11-26 19:47:26,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_20-model_00-model_states.pt. 0: [2022-11-26 19:47:26,154] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_21-model_00-model_states.pt... 0: [2022-11-26 19:47:26,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_21-model_00-model_states.pt. 0: [2022-11-26 19:47:26,229] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_22-model_00-model_states.pt... 0: [2022-11-26 19:47:26,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_22-model_00-model_states.pt. 0: [2022-11-26 19:47:26,304] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_23-model_00-model_states.pt... 0: [2022-11-26 19:47:26,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_23-model_00-model_states.pt. 0: [2022-11-26 19:47:26,377] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_24-model_00-model_states.pt... 0: [2022-11-26 19:47:26,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_24-model_00-model_states.pt. 0: [2022-11-26 19:47:26,453] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_25-model_00-model_states.pt... 
0: [2022-11-26 19:47:26,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_25-model_00-model_states.pt. 0: [2022-11-26 19:47:26,527] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_26-model_00-model_states.pt... 0: [2022-11-26 19:47:26,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_26-model_00-model_states.pt. 0: [2022-11-26 19:47:26,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_27-model_00-model_states.pt... 0: [2022-11-26 19:47:26,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_27-model_00-model_states.pt. 0: [2022-11-26 19:47:26,674] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_28-model_00-model_states.pt... 0: [2022-11-26 19:47:26,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_28-model_00-model_states.pt. 0: [2022-11-26 19:47:26,750] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/layer_30-model_00-model_states.pt... 0: [2022-11-26 19:47:26,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/layer_30-model_00-model_states.pt. 0: [2022-11-26 19:47:26,753] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step114000/mp_rank_00_model_states.pt 0: [2022-11-26 19:47:26,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/mp_rank_00_model_states.pt... 0: [2022-11-26 19:47:26,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/mp_rank_00_model_states.pt. 
0: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
9: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
1: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
3: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 
20: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 
28: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 
30: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 
26: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
5: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
10: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
13: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 
25: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 
14: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 
22: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 
18: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 
27: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 
7: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
13: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 
23: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 
30: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
6: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
20: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 24: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 26: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 
19: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 25: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 
25: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 30: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 25: [2022-11-26 19:47:26,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:47:26,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:47:26,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 19:47:26,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 23: [2022-11-26 19:47:26,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:47:26,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 19:47:26,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 8: [2022-11-26 19:47:26,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 19:47:26,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 19:47:26,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 30: [2022-11-26 19:47:26,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 19:47:26,885] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 19:47:26,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 15: [2022-11-26 19:47:26,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:47:26,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 19:47:26,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 11: [2022-11-26 19:47:26,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:47:26,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 19:47:26,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 19: [2022-11-26 19:47:26,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
19: [2022-11-26 19:47:26,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 19:47:26,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 9: [2022-11-26 19:47:26,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:47:26,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 19:47:26,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 20: [2022-11-26 19:47:26,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 19:47:26,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 19:47:26,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 20: [2022-11-26 19:47:26,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 19:47:26,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 19:47:26,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 30: [2022-11-26 19:47:26,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
16: [2022-11-26 19:47:26,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 30: [2022-11-26 19:47:26,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 16: [2022-11-26 19:47:26,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 30: [2022-11-26 19:47:26,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 16: [2022-11-26 19:47:26,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 24: [2022-11-26 19:47:26,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:47:26,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:47:26,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 24: [2022-11-26 19:47:26,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 15: [2022-11-26 19:47:26,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 24: [2022-11-26 19:47:26,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 4: [2022-11-26 19:47:26,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-26 19:47:26,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 19:47:26,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 24: [2022-11-26 19:47:26,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:47:26,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 31: [2022-11-26 19:47:26,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:47:26,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 31: [2022-11-26 19:47:26,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 19:47:26,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 6: [2022-11-26 19:47:26,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:47:26,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 19:47:26,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 25: [2022-11-26 19:47:26,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
25: [2022-11-26 19:47:26,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:47:26,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 19:47:26,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 19:47:26,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 25: [2022-11-26 19:47:26,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 21: [2022-11-26 19:47:26,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 19:47:26,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 8: [2022-11-26 19:47:26,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 21: [2022-11-26 19:47:26,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 8: [2022-11-26 19:47:26,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 19:47:26,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 14: [2022-11-26 19:47:26,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-26 19:47:26,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:47:26,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 19:47:26,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 19:47:26,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 14: [2022-11-26 19:47:26,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 1: [2022-11-26 19:47:26,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:47:26,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:47:26,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 29: [2022-11-26 19:47:26,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:47:26,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 1: [2022-11-26 19:47:26,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
29: [2022-11-26 19:47:26,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 7: [2022-11-26 19:47:26,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 29: [2022-11-26 19:47:26,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 8: [2022-11-26 19:47:26,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:47:26,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 19:47:26,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 23: [2022-11-26 19:47:26,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:47:26,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 19:47:26,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 18: [2022-11-26 19:47:26,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:47:26,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 19:47:26,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
21: [2022-11-26 19:47:26,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 19:47:26,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 19:47:26,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 19: [2022-11-26 19:47:26,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:47:26,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:47:26,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 19:47:26,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 19:47:26,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 19: [2022-11-26 19:47:26,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 4: [2022-11-26 19:47:26,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:47:26,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 19:47:26,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
0: [2022-11-26 19:47:26,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:47:26,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 19:47:26,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 28: [2022-11-26 19:47:26,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:47:26,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:47:26,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 19:47:26,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 18: [2022-11-26 19:47:26,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:47:26,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 19:47:26,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 4: [2022-11-26 19:47:26,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 19:47:26,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 10: [2022-11-26 19:47:26,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:47:26,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 4: [2022-11-26 19:47:26,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 10: [2022-11-26 19:47:26,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 23: [2022-11-26 19:47:26,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:47:26,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 16: [2022-11-26 19:47:26,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:47:26,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 16: [2022-11-26 19:47:26,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 15: [2022-11-26 19:47:26,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
1: [2022-11-26 19:47:26,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 16: [2022-11-26 19:47:26,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 15: [2022-11-26 19:47:26,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 19:47:26,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 1: [2022-11-26 19:47:26,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 23: [2022-11-26 19:47:26,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 25: [2022-11-26 19:47:26,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:47:26,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 19:47:26,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 17: [2022-11-26 19:47:26,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 19:47:26,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 19:47:26,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
4: [2022-11-26 19:47:26,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:47:26,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 19:47:26,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 6: [2022-11-26 19:47:26,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 17: [2022-11-26 19:47:26,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:47:26,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 17: [2022-11-26 19:47:26,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:47:26,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 17: [2022-11-26 19:47:26,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 19:47:26,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 19:47:26,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 17: [2022-11-26 19:47:26,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
28: [2022-11-26 19:47:26,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 19:47:26,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 28: [2022-11-26 19:47:26,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 19:47:26,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 19:47:26,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 30: [2022-11-26 19:47:26,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 19:47:26,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 19:47:26,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 2: [2022-11-26 19:47:26,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:47:26,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 19:47:26,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 20: [2022-11-26 19:47:26,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 
20: [2022-11-26 19:47:26,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 13: [2022-11-26 19:47:26,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 20: [2022-11-26 19:47:26,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 24: [2022-11-26 19:47:26,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:47:26,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 13: [2022-11-26 19:47:26,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 24: [2022-11-26 19:47:26,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 13: [2022-11-26 19:47:26,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 6: [2022-11-26 19:47:26,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:47:26,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 19:47:26,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 19:47:26,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 19:47:26,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 6: [2022-11-26 19:47:26,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 13: [2022-11-26 19:47:26,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:47:26,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:47:26,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 19:47:26,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 19:47:26,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 30: [2022-11-26 19:47:26,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:47:26,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
30: [2022-11-26 19:47:26,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 19:47:26,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 11: [2022-11-26 19:47:26,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:47:26,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 19:47:26,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 11: [2022-11-26 19:47:26,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:47:26,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 19:47:26,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 21: [2022-11-26 19:47:26,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 19:47:26,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 14: [2022-11-26 19:47:26,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 21: [2022-11-26 19:47:26,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
14: [2022-11-26 19:47:26,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 19:47:26,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 2: [2022-11-26 19:47:26,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:47:26,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 0: [2022-11-26 19:47:26,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:47:26,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 0: [2022-11-26 19:47:26,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 11: [2022-11-26 19:47:26,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:47:26,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 11: [2022-11-26 19:47:26,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 19:47:26,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 20: [2022-11-26 19:47:26,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
20: [2022-11-26 19:47:26,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 19:47:26,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 24: [2022-11-26 19:47:26,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:47:26,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:47:26,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 17: [2022-11-26 19:47:26,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:47:26,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 24: [2022-11-26 19:47:26,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 17: [2022-11-26 19:47:26,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 0: [2022-11-26 19:47:26,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:47:26,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 17: [2022-11-26 19:47:26,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
28: [2022-11-26 19:47:26,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 19:47:26,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 19:47:26,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 23: [2022-11-26 19:47:26,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:47:26,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:47:26,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 29: [2022-11-26 19:47:26,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 23: [2022-11-26 19:47:26,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 29: [2022-11-26 19:47:26,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 1: [2022-11-26 19:47:26,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:47:26,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 19:47:26,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
16: [2022-11-26 19:47:26,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 19:47:26,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 19:47:26,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 9: [2022-11-26 19:47:26,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:47:26,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 19:47:26,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 9: [2022-11-26 19:47:26,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:47:26,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 19:47:26,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 21: [2022-11-26 19:47:26,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:47:26,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:47:26,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
21: [2022-11-26 19:47:26,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 3: [2022-11-26 19:47:26,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:47:26,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:47:26,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 19:47:26,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 21: [2022-11-26 19:47:26,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 3: [2022-11-26 19:47:26,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 3: [2022-11-26 19:47:26,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 19:47:26,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 19:47:26,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 3: [2022-11-26 19:47:26,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 3: [2022-11-26 19:47:26,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
19: [2022-11-26 19:47:26,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:47:26,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 31: [2022-11-26 19:47:26,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:47:26,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:47:26,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 29: [2022-11-26 19:47:26,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 31: [2022-11-26 19:47:26,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 19:47:26,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 19:47:26,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 29: [2022-11-26 19:47:26,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 31: [2022-11-26 19:47:26,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 19:47:26,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
2: [2022-11-26 19:47:26,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:47:26,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 18: [2022-11-26 19:47:26,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:47:26,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 18: [2022-11-26 19:47:26,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 19:47:26,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 22: [2022-11-26 19:47:26,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 19:47:26,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 19:47:26,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 22: [2022-11-26 19:47:26,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 19:47:26,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 19:47:26,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
22: [2022-11-26 19:47:26,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 19:47:26,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 19:47:26,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 22: [2022-11-26 19:47:26,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 19:47:26,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 19:47:26,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 5: [2022-11-26 19:47:26,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:47:26,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:47:26,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-26 19:47:26,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 19:47:26,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 19:47:26,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 19:47:26,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 5: [2022-11-26 19:47:26,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 5: [2022-11-26 19:47:26,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 13: [2022-11-26 19:47:26,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:47:26,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 19:47:26,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 27: [2022-11-26 19:47:26,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:47:26,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
27: [2022-11-26 19:47:26,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 19:47:26,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 19:47:26,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 27: [2022-11-26 19:47:26,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 10: [2022-11-26 19:47:26,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 16: [2022-11-26 19:47:26,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:47:26,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 19:47:26,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 16: [2022-11-26 19:47:26,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 19:47:26,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 27: [2022-11-26 19:47:26,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
27: [2022-11-26 19:47:26,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 7: [2022-11-26 19:47:26,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:47:26,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 27: [2022-11-26 19:47:26,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 7: [2022-11-26 19:47:26,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 7: [2022-11-26 19:47:26,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:47:26,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 19:47:26,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 14: [2022-11-26 19:47:26,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:47:26,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
14: [2022-11-26 19:47:26,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 8: [2022-11-26 19:47:26,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 14: [2022-11-26 19:47:26,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 8: [2022-11-26 19:47:26,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 18: [2022-11-26 19:47:26,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:47:26,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 19:47:26,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 31: [2022-11-26 19:47:26,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 19:47:26,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 19:47:26,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 15: [2022-11-26 19:47:26,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 19:47:26,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 19:47:26,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 29: [2022-11-26 19:47:26,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:47:26,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 19:47:26,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 0: [2022-11-26 19:47:26,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:47:26,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 19:47:26,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 25: [2022-11-26 19:47:26,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:47:26,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 19:47:26,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 28: [2022-11-26 19:47:26,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
28: [2022-11-26 19:47:26,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 10: [2022-11-26 19:47:26,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:47:26,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 28: [2022-11-26 19:47:26,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 10: [2022-11-26 19:47:26,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 26: [2022-11-26 19:47:26,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 19:47:26,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 19:47:26,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 19:47:26,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
26: [2022-11-26 19:47:26,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 19:47:26,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 19:47:26,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 19:47:26,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 19:47:26,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 26: [2022-11-26 19:47:26,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 26: [2022-11-26 19:47:26,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 26: [2022-11-26 19:47:26,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 1: [2022-11-26 19:47:26,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:47:26,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 19:47:26,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 0: [2022-11-26 19:47:26,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-26 19:47:26,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 19:47:26,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 12: [2022-11-26 19:47:26,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:47:26,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:47:26,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:47:26,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:47:26,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 19:47:26,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 19:47:26,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 19:47:26,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 19:47:26,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
12: [2022-11-26 19:47:26,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 12: [2022-11-26 19:47:26,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 12: [2022-11-26 19:47:26,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 23: [2022-11-26 19:47:26,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:47:26,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 19:47:26,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 4: [2022-11-26 19:47:26,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:47:26,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 19:47:26,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 11: [2022-11-26 19:47:26,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:47:26,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 19:47:26,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
9: [2022-11-26 19:47:26,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:47:26,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 19:47:26,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 6: [2022-11-26 19:47:26,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:47:26,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 19:47:26,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 20: [2022-11-26 19:47:26,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 19:47:26,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 19:47:26,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 0: [2022-11-26 19:47:26,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 19:47:26,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 17: [2022-11-26 19:47:26,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-26 19:47:26,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 19:47:26,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 24: [2022-11-26 19:47:26,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:47:26,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 19:47:26,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 8: [2022-11-26 19:47:26,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:47:26,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 19:47:26,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 7: [2022-11-26 19:47:26,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:47:26,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 19:47:26,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 9: [2022-11-26 19:47:26,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
25: [2022-11-26 19:47:26,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:47:26,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 19:47:26,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 25: [2022-11-26 19:47:26,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 19:47:26,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 2: [2022-11-26 19:47:26,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:47:26,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 19:47:26,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 19: [2022-11-26 19:47:26,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:47:26,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 19:47:26,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 22: [2022-11-26 19:47:26,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-26 19:47:26,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 19:47:26,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 5: [2022-11-26 19:47:26,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:47:26,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 19:47:26,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 30: [2022-11-26 19:47:26,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 19:47:26,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 19:47:26,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 29: [2022-11-26 19:47:26,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:47:26,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 19:47:26,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 14: [2022-11-26 19:47:26,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 19:47:26,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 19:47:26,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 15: [2022-11-26 19:47:26,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:47:26,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 19:47:26,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 16: [2022-11-26 19:47:26,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 19:47:26,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 19:47:26,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 1: [2022-11-26 19:47:26,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:47:26,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 19:47:26,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 28: [2022-11-26 19:47:26,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-26 19:47:26,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 19:47:26,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 3: [2022-11-26 19:47:26,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:47:26,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 19:47:26,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 27: [2022-11-26 19:47:26,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:47:26,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 19:47:26,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 18: [2022-11-26 19:47:26,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:47:26,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 19:47:26,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 26: [2022-11-26 19:47:26,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
26: [2022-11-26 19:47:26,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 19:47:26,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 21: [2022-11-26 19:47:26,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 19:47:26,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 19:47:26,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 31: [2022-11-26 19:47:26,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 19:47:26,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 19:47:26,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 12: [2022-11-26 19:47:26,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:47:26,982] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 19:47:26,982] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 13: [2022-11-26 19:47:26,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 19:47:26,982] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 19:47:26,982] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 10: [2022-11-26 19:47:26,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:47:26,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 19:47:26,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 0: [2022-11-26 19:47:26,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:47:26,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 19:47:26,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 6: [2022-11-26 19:47:26,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:47:26,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
6: [2022-11-26 19:47:26,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 19: [2022-11-26 19:47:26,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 19:47:26,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 6: [2022-11-26 19:47:26,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 23: [2022-11-26 19:47:26,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:47:26,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 19:47:26,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 4: [2022-11-26 19:47:26,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:47:26,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:47:26,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 4: [2022-11-26 19:47:26,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 11: [2022-11-26 19:47:26,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
4: [2022-11-26 19:47:26,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 24: [2022-11-26 19:47:26,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:47:26,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 19:47:26,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 17: [2022-11-26 19:47:26,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 19:47:26,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 19:47:26,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 20: [2022-11-26 19:47:26,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 19:47:26,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 19:47:26,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 25: [2022-11-26 19:47:26,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
25: [2022-11-26 19:47:26,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 19:47:26,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 9: [2022-11-26 19:47:26,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:47:26,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 19:47:26,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 14: [2022-11-26 19:47:26,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:47:26,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 19:47:26,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 22: [2022-11-26 19:47:26,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 19:47:26,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 19:47:26,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 29: [2022-11-26 19:47:26,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-26 19:47:26,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 7: [2022-11-26 19:47:26,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:47:26,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 7: [2022-11-26 19:47:26,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 19:47:26,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 5: [2022-11-26 19:47:26,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:47:26,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 2: [2022-11-26 19:47:26,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:47:26,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 2: [2022-11-26 19:47:26,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 19:47:26,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 15: [2022-11-26 19:47:26,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-26 19:47:26,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 19:47:26,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 30: [2022-11-26 19:47:26,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 19:47:26,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 19:47:26,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 28: [2022-11-26 19:47:26,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 19:47:26,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 19:47:26,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 16: [2022-11-26 19:47:27,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 19:47:27,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 19:47:27,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 1: [2022-11-26 19:47:27,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-26 19:47:27,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 19:47:27,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 21: [2022-11-26 19:47:27,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:47:27,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 21: [2022-11-26 19:47:27,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 19:47:27,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 8: [2022-11-26 19:47:27,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 19:47:27,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 3: [2022-11-26 19:47:27,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:47:27,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 19:47:27,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 26: [2022-11-26 19:47:27,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-26 19:47:27,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 19:47:27,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 27: [2022-11-26 19:47:27,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:47:27,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 19:47:27,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 18: [2022-11-26 19:47:27,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:47:27,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 19:47:27,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 12: [2022-11-26 19:47:27,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:47:27,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 19:47:27,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 13: [2022-11-26 19:47:27,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 19:47:27,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 19:47:27,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 10: [2022-11-26 19:47:27,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:47:27,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 19:47:27,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 31: [2022-11-26 19:47:27,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 19:47:27,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 19:47:27,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 23: [2022-11-26 19:47:27,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:47:27,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 19:47:27,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 6: [2022-11-26 19:47:27,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-26 19:47:27,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 19:47:27,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 4: [2022-11-26 19:47:27,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:47:27,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 19:47:27,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 20: [2022-11-26 19:47:27,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 19:47:27,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 19:47:27,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 5: [2022-11-26 19:47:27,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:47:27,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 17: [2022-11-26 19:47:27,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:47:27,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
17: [2022-11-26 19:47:27,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 19:47:27,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 24: [2022-11-26 19:47:27,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:47:27,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 19:47:27,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 19: [2022-11-26 19:47:27,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:47:27,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 19:47:27,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 9: [2022-11-26 19:47:27,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:47:27,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 19:47:27,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 30: [2022-11-26 19:47:27,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-26 19:47:27,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 19:47:27,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 25: [2022-11-26 19:47:27,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 19:47:27,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 19:47:27,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 2: [2022-11-26 19:47:27,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:47:27,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 19:47:27,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 29: [2022-11-26 19:47:27,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:47:27,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 19:47:27,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 11: [2022-11-26 19:47:27,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 19:47:27,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 19:47:27,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 15: [2022-11-26 19:47:27,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:47:27,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 19:47:27,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 14: [2022-11-26 19:47:27,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 22: [2022-11-26 19:47:27,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:47:27,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 22: [2022-11-26 19:47:27,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 19:47:27,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 14: [2022-11-26 19:47:27,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 8: [2022-11-26 19:47:27,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 19:47:27,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 19:47:27,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 16: [2022-11-26 19:47:27,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 19:47:27,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 19:47:27,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 28: [2022-11-26 19:47:27,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:47:27,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:47:27,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 19:47:27,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 28: [2022-11-26 19:47:27,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 19:47:27,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 18: [2022-11-26 19:47:27,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-26 19:47:27,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 19:47:27,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 3: [2022-11-26 19:47:27,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:47:27,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 0: [2022-11-26 19:47:27,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:47:27,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 0: [2022-11-26 19:47:27,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 19:47:27,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 26: [2022-11-26 19:47:27,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 19:47:27,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 19:47:27,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 31: [2022-11-26 19:47:27,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 
31: [2022-11-26 19:47:27,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 19:47:27,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 10: [2022-11-26 19:47:27,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:47:27,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 19:47:27,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 23: [2022-11-26 19:47:27,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 19:47:27,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 19:47:27,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 1: [2022-11-26 19:47:27,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:47:27,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 19:47:27,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 6: [2022-11-26 19:47:27,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 19:47:27,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 19:47:27,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 21: [2022-11-26 19:47:27,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 20: [2022-11-26 19:47:27,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 21: [2022-11-26 19:47:27,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 20: [2022-11-26 19:47:27,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 21: [2022-11-26 19:47:27,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 20: [2022-11-26 19:47:27,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 0: [2022-11-26 19:47:27,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:47:27,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 19:47:27,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 19:47:27,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
0: [2022-11-26 19:47:27,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 19:47:27,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 19: [2022-11-26 19:47:27,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 19:47:27,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 19:47:27,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 11: [2022-11-26 19:47:27,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:47:27,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 19:47:27,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 7: [2022-11-26 19:47:27,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:47:27,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 19:47:27,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 4: [2022-11-26 19:47:27,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 19:47:27,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 19:47:27,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 30: [2022-11-26 19:47:27,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 19:47:27,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 19:47:27,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 28: [2022-11-26 19:47:27,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:47:27,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:47:27,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 2: [2022-11-26 19:47:27,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 29: [2022-11-26 19:47:27,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 28: [2022-11-26 19:47:27,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 19:47:27,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
2: [2022-11-26 19:47:27,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 19:47:27,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 9: [2022-11-26 19:47:27,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:47:27,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 5: [2022-11-26 19:47:27,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:47:27,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 5: [2022-11-26 19:47:27,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 19:47:27,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 17: [2022-11-26 19:47:27,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:47:27,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 17: [2022-11-26 19:47:27,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 19:47:27,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
7: [2022-11-26 19:47:27,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:47:27,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 19:47:27,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 7: [2022-11-26 19:47:27,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 19:47:27,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 14: [2022-11-26 19:47:27,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:47:27,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 19:47:27,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 12: [2022-11-26 19:47:27,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:47:27,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 19:47:27,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 31: [2022-11-26 19:47:27,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
15: [2022-11-26 19:47:27,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 31: [2022-11-26 19:47:27,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 1: [2022-11-26 19:47:27,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 31: [2022-11-26 19:47:27,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 15: [2022-11-26 19:47:27,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 19:47:27,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 1: [2022-11-26 19:47:27,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 19:47:27,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 16: [2022-11-26 19:47:27,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 19:47:27,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 19:47:27,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 27: [2022-11-26 19:47:27,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
27: [2022-11-26 19:47:27,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 19:47:27,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 25: [2022-11-26 19:47:27,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:47:27,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:47:27,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 25: [2022-11-26 19:47:27,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 22: [2022-11-26 19:47:27,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 18: [2022-11-26 19:47:27,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 25: [2022-11-26 19:47:27,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 22: [2022-11-26 19:47:27,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 19:47:27,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 13: [2022-11-26 19:47:27,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
21: [2022-11-26 19:47:27,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:47:27,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 21: [2022-11-26 19:47:27,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 13: [2022-11-26 19:47:27,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 21: [2022-11-26 19:47:27,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 7: [2022-11-26 19:47:27,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:47:27,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 19:47:27,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 8: [2022-11-26 19:47:27,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:47:27,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 19:47:27,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 5: [2022-11-26 19:47:27,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 19:47:27,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 19:47:27,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 26: [2022-11-26 19:47:27,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 19:47:27,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 19:47:27,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 27: [2022-11-26 19:47:27,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 19:47:27,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 19:47:27,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 12: [2022-11-26 19:47:27,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:47:27,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 19:47:27,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 3: [2022-11-26 19:47:27,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 19:47:27,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step114000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 19:47:27,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 0: successfully saved checkpoint at iteration 114000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2528.84 31: iteration 114010/ 173500 | consumed samples: 29186560 | consumed tokens: 59774074880 | elapsed time per iteration (s): 1.04 | learning rate: 6.822E-05 | global batch size: 256 | lm loss: 1.952132E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.445 | TFLOPs: 14.85 | 31: iteration 114020/ 173500 | consumed samples: 29189120 | consumed tokens: 59779317760 | elapsed time per iteration (s): 0.79 | learning rate: 6.821E-05 | global batch size: 256 | lm loss: 1.946630E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.063 | TFLOPs: 19.73 | 31: iteration 114030/ 173500 | consumed samples: 29191680 | consumed tokens: 59784560640 | elapsed time per iteration (s): 0.75 | learning rate: 6.819E-05 | global batch size: 256 | lm loss: 1.967795E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.268 | TFLOPs: 20.71 | 31: iteration 114040/ 173500 | consumed samples: 29194240 | consumed tokens: 59789803520 | elapsed time per iteration (s): 0.76 | learning rate: 6.818E-05 | global batch size: 256 | lm loss: 1.951555E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.348 | TFLOPs: 20.47 | 31: iteration 114050/ 173500 | consumed samples: 29196800 | consumed tokens: 59795046400 | elapsed time per iteration (s): 0.82 | learning rate: 6.817E-05 | 
global batch size: 256 | lm loss: 1.967017E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.522 | TFLOPs: 18.85 | 31: iteration 114060/ 173500 | consumed samples: 29199360 | consumed tokens: 59800289280 | elapsed time per iteration (s): 0.79 | learning rate: 6.815E-05 | global batch size: 256 | lm loss: 1.942612E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.971 | TFLOPs: 19.60 | 31: iteration 114070/ 173500 | consumed samples: 29201920 | consumed tokens: 59805532160 | elapsed time per iteration (s): 0.81 | learning rate: 6.814E-05 | global batch size: 256 | lm loss: 1.952390E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.753 | TFLOPs: 19.16 | 31: iteration 114080/ 173500 | consumed samples: 29204480 | consumed tokens: 59810775040 | elapsed time per iteration (s): 0.83 | learning rate: 6.812E-05 | global batch size: 256 | lm loss: 1.941703E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.574 | TFLOPs: 18.67 | 31: iteration 114090/ 173500 | consumed samples: 29207040 | consumed tokens: 59816017920 | elapsed time per iteration (s): 0.79 | learning rate: 6.811E-05 | global batch size: 256 | lm loss: 1.916921E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.765 | TFLOPs: 19.53 | 31: iteration 114100/ 173500 | consumed samples: 29209600 | consumed tokens: 59821260800 | elapsed time per iteration (s): 0.84 | learning rate: 6.809E-05 | global batch size: 256 | lm loss: 1.976808E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.901 | TFLOPs: 18.39 | 31: iteration 114110/ 173500 | consumed 
samples: 29212160 | consumed tokens: 59826503680 | elapsed time per iteration (s): 0.81 | learning rate: 6.808E-05 | global batch size: 256 | lm loss: 1.966434E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.258 | TFLOPs: 19.01 | 31: iteration 114120/ 173500 | consumed samples: 29214720 | consumed tokens: 59831746560 | elapsed time per iteration (s): 0.77 | learning rate: 6.806E-05 | global batch size: 256 | lm loss: 1.974449E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.579 | TFLOPs: 20.06 | 31: iteration 114130/ 173500 | consumed samples: 29217280 | consumed tokens: 59836989440 | elapsed time per iteration (s): 0.73 | learning rate: 6.805E-05 | global batch size: 256 | lm loss: 1.951050E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.323 | TFLOPs: 21.13 | 31: iteration 114140/ 173500 | consumed samples: 29219840 | consumed tokens: 59842232320 | elapsed time per iteration (s): 0.71 | learning rate: 6.803E-05 | global batch size: 256 | lm loss: 1.964532E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 362.733 | TFLOPs: 21.94 | 31: iteration 114150/ 173500 | consumed samples: 29222400 | consumed tokens: 59847475200 | elapsed time per iteration (s): 0.76 | learning rate: 6.802E-05 | global batch size: 256 | lm loss: 1.957952E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.209 | TFLOPs: 20.28 | 31: iteration 114160/ 173500 | consumed samples: 29224960 | consumed tokens: 59852718080 | elapsed time per iteration (s): 0.77 | learning rate: 6.800E-05 | global batch size: 256 | lm loss: 1.968248E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 333.372 | TFLOPs: 20.17 | 31: iteration 114170/ 173500 | consumed samples: 29227520 | consumed tokens: 59857960960 | elapsed time per iteration (s): 0.86 | learning rate: 6.799E-05 | global batch size: 256 | lm loss: 1.942778E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.039 | TFLOPs: 18.03 | 31: iteration 114180/ 173500 | consumed samples: 29230080 | consumed tokens: 59863203840 | elapsed time per iteration (s): 0.78 | learning rate: 6.798E-05 | global batch size: 256 | lm loss: 1.968457E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.020 | TFLOPs: 19.97 | 31: iteration 114190/ 173500 | consumed samples: 29232640 | consumed tokens: 59868446720 | elapsed time per iteration (s): 0.79 | learning rate: 6.796E-05 | global batch size: 256 | lm loss: 1.977881E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.521 | TFLOPs: 19.63 | 31: iteration 114200/ 173500 | consumed samples: 29235200 | consumed tokens: 59873689600 | elapsed time per iteration (s): 0.74 | learning rate: 6.795E-05 | global batch size: 256 | lm loss: 1.950134E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.919 | TFLOPs: 21.05 | 31: iteration 114210/ 173500 | consumed samples: 29237760 | consumed tokens: 59878932480 | elapsed time per iteration (s): 0.79 | learning rate: 6.793E-05 | global batch size: 256 | lm loss: 1.941102E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.812 | TFLOPs: 19.71 | 31: iteration 114220/ 173500 | consumed samples: 29240320 | consumed tokens: 59884175360 | elapsed time per iteration (s): 0.77 | learning rate: 6.792E-05 | global 
batch size: 256 | lm loss: 1.950826E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.557 | TFLOPs: 20.24 | 31: iteration 114230/ 173500 | consumed samples: 29242880 | consumed tokens: 59889418240 | elapsed time per iteration (s): 0.82 | learning rate: 6.790E-05 | global batch size: 256 | lm loss: 1.976492E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.521 | TFLOPs: 18.85 | 31: iteration 114240/ 173500 | consumed samples: 29245440 | consumed tokens: 59894661120 | elapsed time per iteration (s): 0.83 | learning rate: 6.789E-05 | global batch size: 256 | lm loss: 1.968295E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.810 | TFLOPs: 18.62 | 31: iteration 114250/ 173500 | consumed samples: 29248000 | consumed tokens: 59899904000 | elapsed time per iteration (s): 0.87 | learning rate: 6.787E-05 | global batch size: 256 | lm loss: 1.938985E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.745 | TFLOPs: 17.77 | 31: iteration 114260/ 173500 | consumed samples: 29250560 | consumed tokens: 59905146880 | elapsed time per iteration (s): 0.74 | learning rate: 6.786E-05 | global batch size: 256 | lm loss: 1.929102E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.363 | TFLOPs: 21.01 | 31: iteration 114270/ 173500 | consumed samples: 29253120 | consumed tokens: 59910389760 | elapsed time per iteration (s): 0.80 | learning rate: 6.784E-05 | global batch size: 256 | lm loss: 1.979514E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.826 | TFLOPs: 19.29 | 31: iteration 114280/ 173500 | consumed samples: 
29255680 | consumed tokens: 59915632640 | elapsed time per iteration (s): 0.79 | learning rate: 6.783E-05 | global batch size: 256 | lm loss: 1.950264E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.252 | TFLOPs: 19.50 | 31: iteration 114290/ 173500 | consumed samples: 29258240 | consumed tokens: 59920875520 | elapsed time per iteration (s): 0.78 | learning rate: 6.782E-05 | global batch size: 256 | lm loss: 1.955326E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.210 | TFLOPs: 19.80 | 31: iteration 114300/ 173500 | consumed samples: 29260800 | consumed tokens: 59926118400 | elapsed time per iteration (s): 0.72 | learning rate: 6.780E-05 | global batch size: 256 | lm loss: 1.967472E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.951 | TFLOPs: 21.41 | 31: iteration 114310/ 173500 | consumed samples: 29263360 | consumed tokens: 59931361280 | elapsed time per iteration (s): 0.75 | learning rate: 6.779E-05 | global batch size: 256 | lm loss: 1.966917E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.948 | TFLOPs: 20.57 | 31: iteration 114320/ 173500 | consumed samples: 29265920 | consumed tokens: 59936604160 | elapsed time per iteration (s): 0.74 | learning rate: 6.777E-05 | global batch size: 256 | lm loss: 1.933182E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.894 | TFLOPs: 20.93 | 31: iteration 114330/ 173500 | consumed samples: 29268480 | consumed tokens: 59941847040 | elapsed time per iteration (s): 0.72 | learning rate: 6.776E-05 | global batch size: 256 | lm loss: 1.951775E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 355.942 | TFLOPs: 21.53 | 31: iteration 114340/ 173500 | consumed samples: 29271040 | consumed tokens: 59947089920 | elapsed time per iteration (s): 0.78 | learning rate: 6.774E-05 | global batch size: 256 | lm loss: 1.999265E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.862 | TFLOPs: 19.83 | 31: iteration 114350/ 173500 | consumed samples: 29273600 | consumed tokens: 59952332800 | elapsed time per iteration (s): 0.77 | learning rate: 6.773E-05 | global batch size: 256 | lm loss: 1.939712E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.643 | TFLOPs: 20.06 | 31: iteration 114360/ 173500 | consumed samples: 29276160 | consumed tokens: 59957575680 | elapsed time per iteration (s): 0.76 | learning rate: 6.771E-05 | global batch size: 256 | lm loss: 1.986048E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.994 | TFLOPs: 20.39 | 31: iteration 114370/ 173500 | consumed samples: 29278720 | consumed tokens: 59962818560 | elapsed time per iteration (s): 0.75 | learning rate: 6.770E-05 | global batch size: 256 | lm loss: 1.948126E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.828 | TFLOPs: 20.68 | 31: iteration 114380/ 173500 | consumed samples: 29281280 | consumed tokens: 59968061440 | elapsed time per iteration (s): 0.80 | learning rate: 6.768E-05 | global batch size: 256 | lm loss: 1.936785E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.953 | TFLOPs: 19.30 | 31: iteration 114390/ 173500 | consumed samples: 29283840 | consumed tokens: 59973304320 | elapsed time per iteration (s): 0.78 | learning rate: 6.767E-05 | global batch 
size: 256 | lm loss: 1.950042E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.157 | TFLOPs: 19.85 | 31: iteration 114400/ 173500 | consumed samples: 29286400 | consumed tokens: 59978547200 | elapsed time per iteration (s): 0.76 | learning rate: 6.766E-05 | global batch size: 256 | lm loss: 1.939272E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.852 | TFLOPs: 20.38 | 31: iteration 114410/ 173500 | consumed samples: 29288960 | consumed tokens: 59983790080 | elapsed time per iteration (s): 0.79 | learning rate: 6.764E-05 | global batch size: 256 | lm loss: 1.969152E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.160 | TFLOPs: 19.61 | 31: iteration 114420/ 173500 | consumed samples: 29291520 | consumed tokens: 59989032960 | elapsed time per iteration (s): 0.76 | learning rate: 6.763E-05 | global batch size: 256 | lm loss: 1.954578E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.396 | TFLOPs: 20.29 | 31: iteration 114430/ 173500 | consumed samples: 29294080 | consumed tokens: 59994275840 | elapsed time per iteration (s): 0.75 | learning rate: 6.761E-05 | global batch size: 256 | lm loss: 1.933665E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.616 | TFLOPs: 20.73 | 31: iteration 114440/ 173500 | consumed samples: 29296640 | consumed tokens: 59999518720 | elapsed time per iteration (s): 0.80 | learning rate: 6.760E-05 | global batch size: 256 | lm loss: 1.941402E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.627 | TFLOPs: 19.28 | 31: iteration 114450/ 173500 | consumed samples: 29299200 
| consumed tokens: 60004761600 | elapsed time per iteration (s): 0.79 | learning rate: 6.758E-05 | global batch size: 256 | lm loss: 1.959719E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.676 | TFLOPs: 19.52 | 31: iteration 114460/ 173500 | consumed samples: 29301760 | consumed tokens: 60010004480 | elapsed time per iteration (s): 0.82 | learning rate: 6.757E-05 | global batch size: 256 | lm loss: 1.941038E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.980 | TFLOPs: 18.81 | 31: iteration 114470/ 173500 | consumed samples: 29304320 | consumed tokens: 60015247360 | elapsed time per iteration (s): 0.81 | learning rate: 6.755E-05 | global batch size: 256 | lm loss: 1.955386E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.102 | TFLOPs: 19.12 | 31: iteration 114480/ 173500 | consumed samples: 29306880 | consumed tokens: 60020490240 | elapsed time per iteration (s): 0.83 | learning rate: 6.754E-05 | global batch size: 256 | lm loss: 1.949952E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.063 | TFLOPs: 18.58 | 31: iteration 114490/ 173500 | consumed samples: 29309440 | consumed tokens: 60025733120 | elapsed time per iteration (s): 0.79 | learning rate: 6.753E-05 | global batch size: 256 | lm loss: 1.975802E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.381 | TFLOPs: 19.56 | 31: iteration 114500/ 173500 | consumed samples: 29312000 | consumed tokens: 60030976000 | elapsed time per iteration (s): 0.75 | learning rate: 6.751E-05 | global batch size: 256 | lm loss: 1.943030E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 343.217 | TFLOPs: 20.76 | 31: iteration 114510/ 173500 | consumed samples: 29314560 | consumed tokens: 60036218880 | elapsed time per iteration (s): 0.78 | learning rate: 6.750E-05 | global batch size: 256 | lm loss: 1.927781E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.790 | TFLOPs: 19.95 | 31: iteration 114520/ 173500 | consumed samples: 29317120 | consumed tokens: 60041461760 | elapsed time per iteration (s): 0.81 | learning rate: 6.748E-05 | global batch size: 256 | lm loss: 1.978264E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.182 | TFLOPs: 19.13 | 31: iteration 114530/ 173500 | consumed samples: 29319680 | consumed tokens: 60046704640 | elapsed time per iteration (s): 0.77 | learning rate: 6.747E-05 | global batch size: 256 | lm loss: 1.964981E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.435 | TFLOPs: 20.05 | 31: iteration 114540/ 173500 | consumed samples: 29322240 | consumed tokens: 60051947520 | elapsed time per iteration (s): 0.74 | learning rate: 6.745E-05 | global batch size: 256 | lm loss: 1.946192E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.144 | TFLOPs: 20.82 | 31: iteration 114550/ 173500 | consumed samples: 29324800 | consumed tokens: 60057190400 | elapsed time per iteration (s): 0.79 | learning rate: 6.744E-05 | global batch size: 256 | lm loss: 1.996726E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.158 | TFLOPs: 19.61 | 31: iteration 114560/ 173500 | consumed samples: 29327360 | consumed tokens: 60062433280 | elapsed time per iteration (s): 0.77 | learning rate: 6.742E-05 | global batch size: 
256 | lm loss: 1.928245E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.492 | TFLOPs: 20.11 | 31: iteration 114570/ 173500 | consumed samples: 29329920 | consumed tokens: 60067676160 | elapsed time per iteration (s): 0.77 | learning rate: 6.741E-05 | global batch size: 256 | lm loss: 1.979510E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.125 | TFLOPs: 20.21 | 31: iteration 114580/ 173500 | consumed samples: 29332480 | consumed tokens: 60072919040 | elapsed time per iteration (s): 0.75 | learning rate: 6.739E-05 | global batch size: 256 | lm loss: 1.977455E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.581 | TFLOPs: 20.60 | 31: iteration 114590/ 173500 | consumed samples: 29335040 | consumed tokens: 60078161920 | elapsed time per iteration (s): 0.76 | learning rate: 6.738E-05 | global batch size: 256 | lm loss: 1.948137E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.558 | TFLOPs: 20.42 | 31: iteration 114600/ 173500 | consumed samples: 29337600 | consumed tokens: 60083404800 | elapsed time per iteration (s): 0.77 | learning rate: 6.737E-05 | global batch size: 256 | lm loss: 1.982530E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.687 | TFLOPs: 20.01 | 31: iteration 114610/ 173500 | consumed samples: 29340160 | consumed tokens: 60088647680 | elapsed time per iteration (s): 0.79 | learning rate: 6.735E-05 | global batch size: 256 | lm loss: 1.959155E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.608 | TFLOPs: 19.52 | 31: iteration 114620/ 173500 | consumed samples: 29342720 | 
consumed tokens: 60093890560 | elapsed time per iteration (s): 0.84 | learning rate: 6.734E-05 | global batch size: 256 | lm loss: 1.964923E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.401 | TFLOPs: 18.54 | 31: iteration 114630/ 173500 | consumed samples: 29345280 | consumed tokens: 60099133440 | elapsed time per iteration (s): 0.81 | learning rate: 6.732E-05 | global batch size: 256 | lm loss: 1.962206E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.428 | TFLOPs: 19.20 | 31: iteration 114640/ 173500 | consumed samples: 29347840 | consumed tokens: 60104376320 | elapsed time per iteration (s): 0.83 | learning rate: 6.731E-05 | global batch size: 256 | lm loss: 1.930368E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.602 | TFLOPs: 18.61 | 31: iteration 114650/ 173500 | consumed samples: 29350400 | consumed tokens: 60109619200 | elapsed time per iteration (s): 0.78 | learning rate: 6.729E-05 | global batch size: 256 | lm loss: 1.969804E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.982 | TFLOPs: 19.78 | 31: iteration 114660/ 173500 | consumed samples: 29352960 | consumed tokens: 60114862080 | elapsed time per iteration (s): 0.85 | learning rate: 6.728E-05 | global batch size: 256 | lm loss: 1.939549E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.472 | TFLOPs: 18.12 | 31: iteration 114670/ 173500 | consumed samples: 29355520 | consumed tokens: 60120104960 | elapsed time per iteration (s): 0.84 | learning rate: 6.726E-05 | global batch size: 256 | lm loss: 1.933648E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 306.574 | TFLOPs: 18.55 | 31: iteration 114680/ 173500 | consumed samples: 29358080 | consumed tokens: 60125347840 | elapsed time per iteration (s): 0.83 | learning rate: 6.725E-05 | global batch size: 256 | lm loss: 2.003808E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.940 | TFLOPs: 18.69 | 31: iteration 114690/ 173500 | consumed samples: 29360640 | consumed tokens: 60130590720 | elapsed time per iteration (s): 0.81 | learning rate: 6.724E-05 | global batch size: 256 | lm loss: 1.953559E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.133 | TFLOPs: 19.13 | 31: iteration 114700/ 173500 | consumed samples: 29363200 | consumed tokens: 60135833600 | elapsed time per iteration (s): 0.82 | learning rate: 6.722E-05 | global batch size: 256 | lm loss: 1.973858E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.417 | TFLOPs: 18.90 | 31: iteration 114710/ 173500 | consumed samples: 29365760 | consumed tokens: 60141076480 | elapsed time per iteration (s): 0.82 | learning rate: 6.721E-05 | global batch size: 256 | lm loss: 1.970786E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.102 | TFLOPs: 18.94 | 31: iteration 114720/ 173500 | consumed samples: 29368320 | consumed tokens: 60146319360 | elapsed time per iteration (s): 0.84 | learning rate: 6.719E-05 | global batch size: 256 | lm loss: 1.958844E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.308 | TFLOPs: 18.41 | 31: iteration 114730/ 173500 | consumed samples: 29370880 | consumed tokens: 60151562240 | elapsed time per iteration (s): 0.81 | learning rate: 6.718E-05 | global batch size: 
256 | lm loss: 1.976541E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.707 | TFLOPs: 19.16 | 31: iteration 114740/ 173500 | consumed samples: 29373440 | consumed tokens: 60156805120 | elapsed time per iteration (s): 0.79 | learning rate: 6.716E-05 | global batch size: 256 | lm loss: 1.951048E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.537 | TFLOPs: 19.69 | 31: iteration 114750/ 173500 | consumed samples: 29376000 | consumed tokens: 60162048000 | elapsed time per iteration (s): 0.81 | learning rate: 6.715E-05 | global batch size: 256 | lm loss: 1.961915E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.384 | TFLOPs: 19.02 | 31: iteration 114760/ 173500 | consumed samples: 29378560 | consumed tokens: 60167290880 | elapsed time per iteration (s): 0.79 | learning rate: 6.713E-05 | global batch size: 256 | lm loss: 1.931576E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.413 | TFLOPs: 19.63 | 31: iteration 114770/ 173500 | consumed samples: 29381120 | consumed tokens: 60172533760 | elapsed time per iteration (s): 0.87 | learning rate: 6.712E-05 | global batch size: 256 | lm loss: 1.941156E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.159 | TFLOPs: 17.86 | 31: iteration 114780/ 173500 | consumed samples: 29383680 | consumed tokens: 60177776640 | elapsed time per iteration (s): 0.82 | learning rate: 6.710E-05 | global batch size: 256 | lm loss: 1.953655E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.620 | TFLOPs: 18.91 | 31: iteration 114790/ 173500 | consumed samples: 29386240 | 
consumed tokens: 60183019520 | elapsed time per iteration (s): 0.83 | learning rate: 6.709E-05 | global batch size: 256 | lm loss: 1.953807E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.429 | TFLOPs: 18.60 | 31: iteration 114800/ 173500 | consumed samples: 29388800 | consumed tokens: 60188262400 | elapsed time per iteration (s): 0.82 | learning rate: 6.708E-05 | global batch size: 256 | lm loss: 1.978443E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.442 | TFLOPs: 18.90 | 31: iteration 114810/ 173500 | consumed samples: 29391360 | consumed tokens: 60193505280 | elapsed time per iteration (s): 0.80 | learning rate: 6.706E-05 | global batch size: 256 | lm loss: 1.974986E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.452 | TFLOPs: 19.33 | 31: iteration 114820/ 173500 | consumed samples: 29393920 | consumed tokens: 60198748160 | elapsed time per iteration (s): 0.80 | learning rate: 6.705E-05 | global batch size: 256 | lm loss: 1.947390E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.896 | TFLOPs: 19.41 | 31: iteration 114830/ 173500 | consumed samples: 29396480 | consumed tokens: 60203991040 | elapsed time per iteration (s): 0.78 | learning rate: 6.703E-05 | global batch size: 256 | lm loss: 1.960373E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.608 | TFLOPs: 19.94 | 31: iteration 114840/ 173500 | consumed samples: 29399040 | consumed tokens: 60209233920 | elapsed time per iteration (s): 0.80 | learning rate: 6.702E-05 | global batch size: 256 | lm loss: 1.954107E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 319.051 | TFLOPs: 19.30 | 31: iteration 114850/ 173500 | consumed samples: 29401600 | consumed tokens: 60214476800 | elapsed time per iteration (s): 0.79 | learning rate: 6.700E-05 | global batch size: 256 | lm loss: 1.963688E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.653 | TFLOPs: 19.52 | 31: iteration 114860/ 173500 | consumed samples: 29404160 | consumed tokens: 60219719680 | elapsed time per iteration (s): 0.81 | learning rate: 6.699E-05 | global batch size: 256 | lm loss: 1.943589E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.973 | TFLOPs: 19.24 | 31: iteration 114870/ 173500 | consumed samples: 29406720 | consumed tokens: 60224962560 | elapsed time per iteration (s): 0.88 | learning rate: 6.697E-05 | global batch size: 256 | lm loss: 1.953824E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.533 | TFLOPs: 17.58 | 31: iteration 114880/ 173500 | consumed samples: 29409280 | consumed tokens: 60230205440 | elapsed time per iteration (s): 0.81 | learning rate: 6.696E-05 | global batch size: 256 | lm loss: 1.961449E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.690 | TFLOPs: 19.22 | 31: iteration 114890/ 173500 | consumed samples: 29411840 | consumed tokens: 60235448320 | elapsed time per iteration (s): 0.80 | learning rate: 6.695E-05 | global batch size: 256 | lm loss: 1.956952E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.213 | TFLOPs: 19.25 | 31: iteration 114900/ 173500 | consumed samples: 29414400 | consumed tokens: 60240691200 | elapsed time per iteration (s): 0.79 | learning rate: 6.693E-05 | global batch size: 
256 | lm loss: 1.951334E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.421 | TFLOPs: 19.51 | 31: iteration 114910/ 173500 | consumed samples: 29416960 | consumed tokens: 60245934080 | elapsed time per iteration (s): 0.80 | learning rate: 6.692E-05 | global batch size: 256 | lm loss: 1.946175E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.784 | TFLOPs: 19.47 | 31: iteration 114920/ 173500 | consumed samples: 29419520 | consumed tokens: 60251176960 | elapsed time per iteration (s): 0.76 | learning rate: 6.690E-05 | global batch size: 256 | lm loss: 1.963083E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.126 | TFLOPs: 20.27 | 31: iteration 114930/ 173500 | consumed samples: 29422080 | consumed tokens: 60256419840 | elapsed time per iteration (s): 0.80 | learning rate: 6.689E-05 | global batch size: 256 | lm loss: 1.966970E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.607 | TFLOPs: 19.40 | 31: iteration 114940/ 173500 | consumed samples: 29424640 | consumed tokens: 60261662720 | elapsed time per iteration (s): 0.73 | learning rate: 6.687E-05 | global batch size: 256 | lm loss: 1.952587E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.398 | TFLOPs: 21.14 | 31: iteration 114950/ 173500 | consumed samples: 29427200 | consumed tokens: 60266905600 | elapsed time per iteration (s): 0.75 | learning rate: 6.686E-05 | global batch size: 256 | lm loss: 1.940906E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.204 | TFLOPs: 20.58 | 31: iteration 114960/ 173500 | consumed samples: 29429760 | 
consumed tokens: 60272148480 | elapsed time per iteration (s): 0.75 | learning rate: 6.684E-05 | global batch size: 256 | lm loss: 1.963925E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.398 | TFLOPs: 20.65 | 31: iteration 114970/ 173500 | consumed samples: 29432320 | consumed tokens: 60277391360 | elapsed time per iteration (s): 0.72 | learning rate: 6.683E-05 | global batch size: 256 | lm loss: 1.968129E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.027 | TFLOPs: 21.48 | 31: iteration 114980/ 173500 | consumed samples: 29434880 | consumed tokens: 60282634240 | elapsed time per iteration (s): 0.83 | learning rate: 6.682E-05 | global batch size: 256 | lm loss: 1.938238E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.818 | TFLOPs: 18.68 | 31: iteration 114990/ 173500 | consumed samples: 29437440 | consumed tokens: 60287877120 | elapsed time per iteration (s): 0.75 | learning rate: 6.680E-05 | global batch size: 256 | lm loss: 1.924986E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.895 | TFLOPs: 20.56 | 31: iteration 115000/ 173500 | consumed samples: 29440000 | consumed tokens: 60293120000 | elapsed time per iteration (s): 0.85 | learning rate: 6.679E-05 | global batch size: 256 | lm loss: 1.943948E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.844 | TFLOPs: 18.26 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 115000 | lm loss value: 2.043278E+00 | lm loss PPL: 7.715858E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving 
checkpoint at iteration 115000 to checkpoints_1b1long 0: [2022-11-26 20:00:37,395] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step115000 is begin to save! 0: [2022-11-26 20:00:37,409] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_01-model_00-model_states.pt... 0: [2022-11-26 20:00:37,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_01-model_00-model_states.pt. 0: [2022-11-26 20:00:37,628] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_03-model_00-model_states.pt... 0: [2022-11-26 20:00:37,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_03-model_00-model_states.pt. 0: [2022-11-26 20:00:37,726] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_04-model_00-model_states.pt... 0: [2022-11-26 20:00:37,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_04-model_00-model_states.pt. 0: [2022-11-26 20:00:37,801] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_05-model_00-model_states.pt... 0: [2022-11-26 20:00:37,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_05-model_00-model_states.pt. 0: [2022-11-26 20:00:37,874] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_06-model_00-model_states.pt... 0: [2022-11-26 20:00:37,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_06-model_00-model_states.pt. 0: [2022-11-26 20:00:37,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_07-model_00-model_states.pt... 
0: [2022-11-26 20:00:38,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_07-model_00-model_states.pt. 0: [2022-11-26 20:00:38,028] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_08-model_00-model_states.pt... 0: [2022-11-26 20:00:38,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_08-model_00-model_states.pt. 0: [2022-11-26 20:00:38,103] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_09-model_00-model_states.pt... 0: [2022-11-26 20:00:38,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_09-model_00-model_states.pt. 0: [2022-11-26 20:00:38,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_10-model_00-model_states.pt... 0: [2022-11-26 20:00:38,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_10-model_00-model_states.pt. 0: [2022-11-26 20:00:38,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_11-model_00-model_states.pt... 0: [2022-11-26 20:00:38,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_11-model_00-model_states.pt. 0: [2022-11-26 20:00:38,327] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_12-model_00-model_states.pt... 0: [2022-11-26 20:00:38,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_12-model_00-model_states.pt. 0: [2022-11-26 20:00:38,402] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_13-model_00-model_states.pt... 
0: [2022-11-26 20:00:38,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_13-model_00-model_states.pt. 0: [2022-11-26 20:00:38,490] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_14-model_00-model_states.pt... 0: [2022-11-26 20:00:38,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_14-model_00-model_states.pt. 0: [2022-11-26 20:00:38,566] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_15-model_00-model_states.pt... 0: [2022-11-26 20:00:38,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_15-model_00-model_states.pt. 0: [2022-11-26 20:00:38,643] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_16-model_00-model_states.pt... 0: [2022-11-26 20:00:38,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_16-model_00-model_states.pt. 0: [2022-11-26 20:00:38,717] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_17-model_00-model_states.pt... 0: [2022-11-26 20:00:38,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_17-model_00-model_states.pt. 0: [2022-11-26 20:00:38,795] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_18-model_00-model_states.pt... 0: [2022-11-26 20:00:38,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_18-model_00-model_states.pt. 0: [2022-11-26 20:00:38,869] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_19-model_00-model_states.pt... 
0: [2022-11-26 20:00:38,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_19-model_00-model_states.pt. 0: [2022-11-26 20:00:38,946] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_20-model_00-model_states.pt... 0: [2022-11-26 20:00:39,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_20-model_00-model_states.pt. 0: [2022-11-26 20:00:39,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_21-model_00-model_states.pt... 0: [2022-11-26 20:00:39,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_21-model_00-model_states.pt. 0: [2022-11-26 20:00:39,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_22-model_00-model_states.pt... 0: [2022-11-26 20:00:39,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_22-model_00-model_states.pt. 0: [2022-11-26 20:00:39,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_23-model_00-model_states.pt... 0: [2022-11-26 20:00:39,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_23-model_00-model_states.pt. 0: [2022-11-26 20:00:39,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_24-model_00-model_states.pt... 0: [2022-11-26 20:00:39,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_24-model_00-model_states.pt. 0: [2022-11-26 20:00:39,322] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_25-model_00-model_states.pt... 
0: [2022-11-26 20:00:39,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_25-model_00-model_states.pt. 0: [2022-11-26 20:00:39,399] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_26-model_00-model_states.pt... 0: [2022-11-26 20:00:39,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_26-model_00-model_states.pt. 0: [2022-11-26 20:00:39,476] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_27-model_00-model_states.pt... 0: [2022-11-26 20:00:39,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_27-model_00-model_states.pt. 0: [2022-11-26 20:00:39,553] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_28-model_00-model_states.pt... 0: [2022-11-26 20:00:39,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_28-model_00-model_states.pt. 0: [2022-11-26 20:00:39,627] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/layer_30-model_00-model_states.pt... 0: [2022-11-26 20:00:39,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/layer_30-model_00-model_states.pt. 0: [2022-11-26 20:00:39,631] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step115000/mp_rank_00_model_states.pt 0: [2022-11-26 20:00:39,631] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/mp_rank_00_model_states.pt... 0: [2022-11-26 20:00:39,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/mp_rank_00_model_states.pt. 
0: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
7: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
10: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
13: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 
20: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 
23: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
14: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 
22: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 
26: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 
5: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 
10: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
3: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 
25: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
14: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 
21: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 
19: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 
7: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
13: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 
28: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 
17: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 
6: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 
12: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 
17: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 
26: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:00:39,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:00:39,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:00:39,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 20:00:39,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 0: [2022-11-26 20:00:39,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:00:39,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 20:00:39,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 18: [2022-11-26 20:00:39,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:00:39,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 20:00:39,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
18: [2022-11-26 20:00:39,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:00:39,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 20:00:39,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 0: [2022-11-26 20:00:39,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:00:39,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 20:00:39,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 0: [2022-11-26 20:00:39,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:00:39,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:00:39,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 20:00:39,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 7: [2022-11-26 20:00:39,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 20:00:39,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 20:00:39,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 5: [2022-11-26 20:00:39,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:00:39,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 10: [2022-11-26 20:00:39,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:00:39,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:00:39,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 10: [2022-11-26 20:00:39,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 20:00:39,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 20:00:39,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 10: [2022-11-26 20:00:39,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 10: [2022-11-26 20:00:39,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 20:00:39,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 20:00:39,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 8: [2022-11-26 20:00:39,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:00:39,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 20:00:39,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 20: [2022-11-26 20:00:39,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:00:39,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 20:00:39,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 8: [2022-11-26 20:00:39,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:00:39,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-26 20:00:39,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 20:00:39,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 27: [2022-11-26 20:00:39,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:00:39,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 8: [2022-11-26 20:00:39,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 27: [2022-11-26 20:00:39,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 20:00:39,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 12: [2022-11-26 20:00:39,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:00:39,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:00:39,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 20: [2022-11-26 20:00:39,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 12: [2022-11-26 20:00:39,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
20: [2022-11-26 20:00:39,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:00:39,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 30: [2022-11-26 20:00:39,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:00:39,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 30: [2022-11-26 20:00:39,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 20: [2022-11-26 20:00:39,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 30: [2022-11-26 20:00:39,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 24: [2022-11-26 20:00:39,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:00:39,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 20:00:39,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 27: [2022-11-26 20:00:39,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:00:39,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
27: [2022-11-26 20:00:39,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 0: [2022-11-26 20:00:39,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 20:00:39,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 27: [2022-11-26 20:00:39,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 20: [2022-11-26 20:00:39,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 27: [2022-11-26 20:00:39,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:00:39,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 27: [2022-11-26 20:00:39,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 20: [2022-11-26 20:00:39,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 27: [2022-11-26 20:00:39,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 23: [2022-11-26 20:00:39,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-26 20:00:39,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 20:00:39,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 27: [2022-11-26 20:00:39,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 20:00:39,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 20:00:39,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 24: [2022-11-26 20:00:39,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:00:39,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 20:00:39,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 18: [2022-11-26 20:00:39,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:00:39,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 24: [2022-11-26 20:00:39,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:00:39,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
24: [2022-11-26 20:00:39,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 20:00:39,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 8: [2022-11-26 20:00:39,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:00:39,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 20:00:39,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 15: [2022-11-26 20:00:39,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:00:39,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 20:00:39,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 20: [2022-11-26 20:00:39,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:00:39,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 20:00:39,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 24: [2022-11-26 20:00:39,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
24: [2022-11-26 20:00:39,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 20:00:39,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 15: [2022-11-26 20:00:39,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:00:39,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 20:00:39,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 10: [2022-11-26 20:00:39,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:00:39,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:00:39,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:00:39,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 20:00:39,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
11: [2022-11-26 20:00:39,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 20:00:39,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 20:00:39,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 11: [2022-11-26 20:00:39,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 8: [2022-11-26 20:00:39,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:00:39,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 28: [2022-11-26 20:00:39,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:00:39,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:00:39,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 7: [2022-11-26 20:00:39,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 10: [2022-11-26 20:00:39,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
7: [2022-11-26 20:00:39,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:00:39,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 7: [2022-11-26 20:00:39,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 10: [2022-11-26 20:00:39,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 7: [2022-11-26 20:00:39,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 20:00:39,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 20: [2022-11-26 20:00:39,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:00:39,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 20:00:39,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 23: [2022-11-26 20:00:39,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:00:39,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 20:00:39,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
30: [2022-11-26 20:00:39,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:00:39,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 20:00:39,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:00:39,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 19: [2022-11-26 20:00:39,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:00:39,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:00:39,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 30: [2022-11-26 20:00:39,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 19: [2022-11-26 20:00:39,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 20:00:39,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 30: [2022-11-26 20:00:39,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 19: [2022-11-26 20:00:39,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
19: [2022-11-26 20:00:39,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:00:39,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 20:00:39,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:00:39,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 19: [2022-11-26 20:00:39,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 20:00:39,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 28: [2022-11-26 20:00:39,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 20:00:39,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 7: [2022-11-26 20:00:39,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:00:39,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 20:00:39,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 24: [2022-11-26 20:00:39,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
5: [2022-11-26 20:00:39,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:00:39,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 20:00:39,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 5: [2022-11-26 20:00:39,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 21: [2022-11-26 20:00:39,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:00:39,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:00:39,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:00:39,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:00:39,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:00:39,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
21: [2022-11-26 20:00:39,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 5: [2022-11-26 20:00:39,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 7: [2022-11-26 20:00:39,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 21: [2022-11-26 20:00:39,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 20:00:39,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 20:00:39,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 5: [2022-11-26 20:00:39,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 20:00:39,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 20:00:39,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 5: [2022-11-26 20:00:39,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 7: [2022-11-26 20:00:39,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 21: [2022-11-26 20:00:39,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
21: [2022-11-26 20:00:39,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 5: [2022-11-26 20:00:39,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:00:39,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:00:39,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 20:00:39,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 20:00:39,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 5: [2022-11-26 20:00:39,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 16: [2022-11-26 20:00:39,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:00:39,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:00:39,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 20:00:39,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 20:00:39,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
15: [2022-11-26 20:00:39,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:00:39,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 15: [2022-11-26 20:00:39,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 20:00:39,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 11: [2022-11-26 20:00:39,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:00:39,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:00:39,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:00:39,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:00:39,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
11: [2022-11-26 20:00:39,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 20:00:39,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 20:00:39,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 20:00:39,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 11: [2022-11-26 20:00:39,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 30: [2022-11-26 20:00:39,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 20:00:39,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 11: [2022-11-26 20:00:39,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 30: [2022-11-26 20:00:39,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 30: [2022-11-26 20:00:39,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 16: [2022-11-26 20:00:39,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-26 20:00:39,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 20:00:39,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 14: [2022-11-26 20:00:39,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:00:39,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:00:39,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:00:39,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 20:00:39,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 20:00:39,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 12: [2022-11-26 20:00:39,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:00:39,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 14: [2022-11-26 20:00:39,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 14: [2022-11-26 20:00:39,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
12: [2022-11-26 20:00:39,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 16: [2022-11-26 20:00:39,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:00:39,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:00:39,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 14: [2022-11-26 20:00:39,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 20:00:39,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 16: [2022-11-26 20:00:39,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 20:00:39,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 9: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:00:39,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
3: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:00:39,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 3: [2022-11-26 20:00:39,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 20:00:39,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 20:00:39,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 3: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 3: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 23: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 3: [2022-11-26 20:00:39,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 20:00:39,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 23: [2022-11-26 20:00:39,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 20:00:39,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 3: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 3: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 23: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 23: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
9: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:00:39,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 15: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:00:39,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 20:00:39,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 20:00:39,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 20:00:39,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 15: [2022-11-26 20:00:39,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 15: [2022-11-26 20:00:39,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
12: [2022-11-26 20:00:39,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:00:39,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:00:39,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 20:00:39,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 20:00:39,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 12: [2022-11-26 20:00:39,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 12: [2022-11-26 20:00:39,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:00:39,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:00:39,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 20:00:39,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 12: [2022-11-26 20:00:39,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 20:00:39,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
23: [2022-11-26 20:00:39,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:00:39,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 20:00:39,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 22: [2022-11-26 20:00:39,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:00:39,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 20:00:39,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 22: [2022-11-26 20:00:39,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:00:39,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:00:39,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:00:39,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 20:00:39,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
9: [2022-11-26 20:00:39,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 20:00:39,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 20:00:39,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 9: [2022-11-26 20:00:39,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 29: [2022-11-26 20:00:39,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:00:39,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:00:39,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 20:00:39,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 20:00:39,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 29: [2022-11-26 20:00:39,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 31: [2022-11-26 20:00:39,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
31: [2022-11-26 20:00:39,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 20:00:39,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 21: [2022-11-26 20:00:39,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:00:39,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:00:39,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:00:39,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 20:00:39,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 20:00:39,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 20:00:39,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 21: [2022-11-26 20:00:39,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 21: [2022-11-26 20:00:39,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
28: [2022-11-26 20:00:39,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 28: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:00:39,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 20:00:39,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 28: [2022-11-26 20:00:39,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:00:39,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 20:00:39,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 28: [2022-11-26 20:00:39,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:00:39,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-26 20:00:39,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 20:00:39,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 20:00:39,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 28: [2022-11-26 20:00:39,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 29: [2022-11-26 20:00:39,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:00:39,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 20:00:39,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 13: [2022-11-26 20:00:39,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:00:39,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:00:39,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 20:00:39,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 20:00:39,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
13: [2022-11-26 20:00:39,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 22: [2022-11-26 20:00:39,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:00:39,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:00:39,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 20:00:39,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 20:00:39,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 22: [2022-11-26 20:00:39,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 22: [2022-11-26 20:00:39,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:00:39,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 20:00:39,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 4: [2022-11-26 20:00:39,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 20:00:39,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 20:00:39,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 4: [2022-11-26 20:00:39,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:00:39,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 20:00:39,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 6: [2022-11-26 20:00:39,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:00:39,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:00:39,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:00:39,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:00:39,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:00:39,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-26 20:00:39,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 20:00:39,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 6: [2022-11-26 20:00:39,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 20:00:39,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 20:00:39,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 20:00:39,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 20:00:39,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 20:00:39,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 6: [2022-11-26 20:00:39,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 6: [2022-11-26 20:00:39,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 6: [2022-11-26 20:00:39,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 6: [2022-11-26 20:00:39,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
31: [2022-11-26 20:00:39,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:00:39,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 20:00:39,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 31: [2022-11-26 20:00:39,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:00:39,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 20:00:39,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 4: [2022-11-26 20:00:39,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:00:39,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 20:00:39,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 31: [2022-11-26 20:00:39,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:00:39,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 20:00:39,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
13: [2022-11-26 20:00:39,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:00:39,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:00:39,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:00:39,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 20:00:39,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 20:00:39,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 20:00:39,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 13: [2022-11-26 20:00:39,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 13: [2022-11-26 20:00:39,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 26: [2022-11-26 20:00:39,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:00:39,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 
26: [2022-11-26 20:00:39,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:00:39,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 20:00:39,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 20:00:39,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 20:00:39,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:00:39,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 26: [2022-11-26 20:00:39,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 26: [2022-11-26 20:00:39,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 26: [2022-11-26 20:00:39,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 20:00:39,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 13: [2022-11-26 20:00:39,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 20:00:39,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 20:00:39,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 25: [2022-11-26 20:00:39,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:00:39,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:00:39,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:00:39,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:00:39,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 20:00:39,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 20:00:39,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 20:00:39,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 20:00:39,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
25: [2022-11-26 20:00:39,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 25: [2022-11-26 20:00:39,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 25: [2022-11-26 20:00:39,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 1: [2022-11-26 20:00:39,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:00:39,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:00:39,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:00:39,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:00:39,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 20:00:39,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 20:00:39,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 20:00:39,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 20:00:39,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 20:00:39,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 20:00:39,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 1: [2022-11-26 20:00:39,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 1: [2022-11-26 20:00:39,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 1: [2022-11-26 20:00:39,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 1: [2022-11-26 20:00:39,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 16: [2022-11-26 20:00:39,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
16: [2022-11-26 20:00:39,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 20:00:39,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 29: [2022-11-26 20:00:39,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:00:39,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 20:00:39,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 29: [2022-11-26 20:00:39,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:00:39,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 20:00:39,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 29: [2022-11-26 20:00:39,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:00:39,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 20:00:39,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 27: [2022-11-26 20:00:39,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
27: [2022-11-26 20:00:39,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 20:00:39,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 31: [2022-11-26 20:00:39,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:00:39,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:00:39,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 20:00:39,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 20:00:39,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 31: [2022-11-26 20:00:39,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 4: [2022-11-26 20:00:39,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:00:39,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 20:00:39,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 0: [2022-11-26 20:00:39,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 20:00:39,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 20:00:39,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 2: [2022-11-26 20:00:39,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:00:39,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:00:39,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:00:39,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:00:39,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:00:39,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 20:00:39,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 20:00:39,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 20:00:39,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 2: [2022-11-26 20:00:39,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
2: [2022-11-26 20:00:39,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 20:00:39,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 2: [2022-11-26 20:00:39,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 20:00:39,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 2: [2022-11-26 20:00:39,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 0: [2022-11-26 20:00:39,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 20:00:39,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 17: [2022-11-26 20:00:39,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:00:39,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:00:39,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:00:39,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:00:39,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-26 20:00:39,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:00:39,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 20:00:39,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 20:00:39,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 20:00:39,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 17: [2022-11-26 20:00:39,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 20:00:39,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 17: [2022-11-26 20:00:39,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 20:00:39,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 20:00:39,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 17: [2022-11-26 20:00:39,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 17: [2022-11-26 20:00:39,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
17: [2022-11-26 20:00:39,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 4: [2022-11-26 20:00:39,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:00:39,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 20:00:39,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 4: [2022-11-26 20:00:39,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:00:39,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 20:00:39,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 11: [2022-11-26 20:00:39,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:00:39,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 20:00:39,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 8: [2022-11-26 20:00:39,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 20:00:39,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 20:00:39,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 7: [2022-11-26 20:00:39,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:00:39,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 20:00:39,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 10: [2022-11-26 20:00:39,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:00:39,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 20:00:39,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 24: [2022-11-26 20:00:39,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:00:39,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 20:00:39,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 30: [2022-11-26 20:00:39,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-26 20:00:39,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 20:00:39,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 22: [2022-11-26 20:00:39,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:00:39,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 20:00:39,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 1: [2022-11-26 20:00:39,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:00:39,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 20:00:39,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 16: [2022-11-26 20:00:39,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:00:39,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 20:00:39,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 2: [2022-11-26 20:00:39,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 20:00:39,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 20:00:39,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 26: [2022-11-26 20:00:39,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:00:39,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 20:00:39,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 18: [2022-11-26 20:00:39,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 27: [2022-11-26 20:00:39,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:00:39,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 20:00:39,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 27: [2022-11-26 20:00:39,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 20:00:39,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 19: [2022-11-26 20:00:39,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
19: [2022-11-26 20:00:39,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 20:00:39,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 3: [2022-11-26 20:00:39,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:00:39,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 20:00:39,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 9: [2022-11-26 20:00:39,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:00:39,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 20:00:39,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 5: [2022-11-26 20:00:39,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:00:39,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 20:00:39,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 25: [2022-11-26 20:00:39,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
25: [2022-11-26 20:00:39,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 20:00:39,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 14: [2022-11-26 20:00:39,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:00:39,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 20:00:39,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 29: [2022-11-26 20:00:39,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:00:39,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 20:00:39,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 20: [2022-11-26 20:00:39,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:00:39,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 20:00:39,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 6: [2022-11-26 20:00:39,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 20:00:39,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 20:00:39,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 23: [2022-11-26 20:00:39,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:00:39,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 20:00:39,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 15: [2022-11-26 20:00:39,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:00:39,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 20:00:39,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 4: [2022-11-26 20:00:39,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:00:39,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 20:00:39,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 17: [2022-11-26 20:00:39,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 
17: [2022-11-26 20:00:39,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 20:00:39,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 12: [2022-11-26 20:00:39,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:00:39,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 20:00:39,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 0: [2022-11-26 20:00:39,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:00:39,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 20:00:39,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 8: [2022-11-26 20:00:39,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:00:39,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 21: [2022-11-26 20:00:39,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
21: [2022-11-26 20:00:39,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 20:00:39,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 8: [2022-11-26 20:00:39,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 28: [2022-11-26 20:00:39,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:00:39,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 20:00:39,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 13: [2022-11-26 20:00:39,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:00:39,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 20:00:39,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 7: [2022-11-26 20:00:39,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:00:39,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
7: [2022-11-26 20:00:39,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 20:00:39,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 24: [2022-11-26 20:00:39,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 20:00:39,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 18: [2022-11-26 20:00:39,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:00:39,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 20:00:39,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 10: [2022-11-26 20:00:39,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:00:39,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 20:00:39,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 11: [2022-11-26 20:00:39,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 20:00:39,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 20:00:39,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 22: [2022-11-26 20:00:39,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:00:39,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 16: [2022-11-26 20:00:39,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:00:39,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 16: [2022-11-26 20:00:39,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 20:00:39,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 31: [2022-11-26 20:00:39,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:00:39,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:00:39,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 20:00:39,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
23: [2022-11-26 20:00:39,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 20:00:39,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 3: [2022-11-26 20:00:39,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:00:39,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 20:00:39,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 1: [2022-11-26 20:00:39,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:00:39,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:00:39,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 20:00:39,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 10: [2022-11-26 20:00:39,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 20:00:39,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 5: [2022-11-26 20:00:39,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 20:00:39,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 20:00:39,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 30: [2022-11-26 20:00:39,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:00:39,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 20:00:39,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 26: [2022-11-26 20:00:39,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:00:39,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 20:00:39,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 2: [2022-11-26 20:00:39,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:00:39,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 8: [2022-11-26 20:00:39,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:00:39,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
8: [2022-11-26 20:00:39,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 20:00:39,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 7: [2022-11-26 20:00:39,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:00:39,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 20:00:39,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 0: [2022-11-26 20:00:39,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:00:39,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 20:00:39,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 6: [2022-11-26 20:00:39,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:00:39,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 20:00:39,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 3: [2022-11-26 20:00:39,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
11: [2022-11-26 20:00:39,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:00:39,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 20:00:39,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 11: [2022-11-26 20:00:39,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 20:00:39,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 28: [2022-11-26 20:00:39,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:00:39,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 17: [2022-11-26 20:00:39,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:00:39,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 17: [2022-11-26 20:00:39,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 19: [2022-11-26 20:00:39,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:00:39,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
19: [2022-11-26 20:00:39,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 9: [2022-11-26 20:00:39,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:00:39,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 9: [2022-11-26 20:00:39,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 20:00:39,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 24: [2022-11-26 20:00:39,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:00:39,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 20:00:39,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 1: [2022-11-26 20:00:39,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:00:39,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 20:00:39,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 21: [2022-11-26 20:00:39,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-26 20:00:39,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 20:00:39,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 27: [2022-11-26 20:00:39,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 20:00:39,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 20:00:39,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 15: [2022-11-26 20:00:39,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:00:39,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 20:00:39,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 13: [2022-11-26 20:00:39,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:00:39,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 20:00:39,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 12: [2022-11-26 20:00:39,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 20:00:39,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 20:00:39,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 20: [2022-11-26 20:00:39,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:00:39,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 20:00:39,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 31: [2022-11-26 20:00:39,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:00:39,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 20:00:39,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 22: [2022-11-26 20:00:39,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:00:39,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 20:00:39,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 19: [2022-11-26 20:00:39,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-26 20:00:39,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 20:00:39,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 4: [2022-11-26 20:00:39,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:00:39,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 20:00:39,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 16: [2022-11-26 20:00:39,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:00:39,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 20:00:39,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 2: [2022-11-26 20:00:39,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:00:39,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 20:00:39,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 29: [2022-11-26 20:00:39,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
29: [2022-11-26 20:00:39,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 20:00:39,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 30: [2022-11-26 20:00:39,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:00:39,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 20:00:39,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 26: [2022-11-26 20:00:39,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:00:39,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 20:00:39,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 26: [2022-11-26 20:00:39,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:00:39,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 20:00:39,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 18: [2022-11-26 20:00:39,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
18: [2022-11-26 20:00:39,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 20:00:39,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 14: [2022-11-26 20:00:39,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:00:39,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 20:00:39,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 27: [2022-11-26 20:00:39,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:00:39,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:00:39,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 20:00:39,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 9: [2022-11-26 20:00:39,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
27: [2022-11-26 20:00:39,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 9: [2022-11-26 20:00:39,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 27: [2022-11-26 20:00:39,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 9: [2022-11-26 20:00:39,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 18: [2022-11-26 20:00:39,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:00:39,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 20:00:39,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 25: [2022-11-26 20:00:39,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:00:39,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:00:39,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 9: [2022-11-26 20:00:39,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:00:39,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
25: [2022-11-26 20:00:39,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 9: [2022-11-26 20:00:39,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 20:00:39,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 25: [2022-11-26 20:00:39,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 19: [2022-11-26 20:00:39,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:00:39,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 20:00:39,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 14: [2022-11-26 20:00:39,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:00:39,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 20:00:39,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 25: [2022-11-26 20:00:39,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
25: [2022-11-26 20:00:39,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step115000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 20:00:39,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 0: successfully saved checkpoint at iteration 115000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2633.72 31: iteration 115010/ 173500 | consumed samples: 29442560 | consumed tokens: 60298362880 | elapsed time per iteration (s): 1.07 | learning rate: 6.677E-05 | global batch size: 256 | lm loss: 1.950254E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.897 | TFLOPs: 14.51 | 31: iteration 115020/ 173500 | consumed samples: 29445120 | consumed tokens: 60303605760 | elapsed time per iteration (s): 0.80 | learning rate: 6.676E-05 | global batch size: 256 | lm loss: 1.923536E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.402 | TFLOPs: 19.38 | 31: iteration 115030/ 173500 | consumed samples: 29447680 | consumed tokens: 60308848640 | elapsed time per iteration (s): 0.86 | learning rate: 6.674E-05 | global batch size: 256 | lm loss: 1.931205E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.277 | TFLOPs: 18.11 | 31: iteration 115040/ 173500 | consumed samples: 29450240 | consumed tokens: 60314091520 | elapsed time per iteration (s): 0.80 | learning rate: 6.673E-05 | global batch size: 256 | lm loss: 1.966467E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.946 | TFLOPs: 19.48 | 31: iteration 115050/ 173500 | consumed samples: 29452800 | consumed tokens: 60319334400 | elapsed time per iteration (s): 0.88 | learning rate: 6.671E-05 | 
global batch size: 256 | lm loss: 1.973369E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.208 | TFLOPs: 17.68 | 31: iteration 115060/ 173500 | consumed samples: 29455360 | consumed tokens: 60324577280 | elapsed time per iteration (s): 0.82 | learning rate: 6.670E-05 | global batch size: 256 | lm loss: 1.963887E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.260 | TFLOPs: 18.95 | 31: iteration 115070/ 173500 | consumed samples: 29457920 | consumed tokens: 60329820160 | elapsed time per iteration (s): 0.85 | learning rate: 6.669E-05 | global batch size: 256 | lm loss: 1.940505E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.222 | TFLOPs: 18.28 | 31: iteration 115080/ 173500 | consumed samples: 29460480 | consumed tokens: 60335063040 | elapsed time per iteration (s): 0.74 | learning rate: 6.667E-05 | global batch size: 256 | lm loss: 1.939018E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.507 | TFLOPs: 20.90 | 31: iteration 115090/ 173500 | consumed samples: 29463040 | consumed tokens: 60340305920 | elapsed time per iteration (s): 0.72 | learning rate: 6.666E-05 | global batch size: 256 | lm loss: 1.973144E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.899 | TFLOPs: 21.53 | 31: iteration 115100/ 173500 | consumed samples: 29465600 | consumed tokens: 60345548800 | elapsed time per iteration (s): 0.74 | learning rate: 6.664E-05 | global batch size: 256 | lm loss: 1.940890E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.322 | TFLOPs: 21.01 | 31: iteration 115110/ 173500 | consumed 
samples: 29468160 | consumed tokens: 60350791680 | elapsed time per iteration (s): 0.74 | learning rate: 6.663E-05 | global batch size: 256 | lm loss: 1.960476E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.160 | TFLOPs: 21.00 | 31: iteration 115120/ 173500 | consumed samples: 29470720 | consumed tokens: 60356034560 | elapsed time per iteration (s): 0.75 | learning rate: 6.661E-05 | global batch size: 256 | lm loss: 1.956551E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.492 | TFLOPs: 20.72 | 31: iteration 115130/ 173500 | consumed samples: 29473280 | consumed tokens: 60361277440 | elapsed time per iteration (s): 0.81 | learning rate: 6.660E-05 | global batch size: 256 | lm loss: 1.950304E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.394 | TFLOPs: 19.20 | 31: iteration 115140/ 173500 | consumed samples: 29475840 | consumed tokens: 60366520320 | elapsed time per iteration (s): 0.77 | learning rate: 6.658E-05 | global batch size: 256 | lm loss: 1.932239E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.245 | TFLOPs: 20.16 | 31: iteration 115150/ 173500 | consumed samples: 29478400 | consumed tokens: 60371763200 | elapsed time per iteration (s): 0.80 | learning rate: 6.657E-05 | global batch size: 256 | lm loss: 1.956599E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.653 | TFLOPs: 19.28 | 31: iteration 115160/ 173500 | consumed samples: 29480960 | consumed tokens: 60377006080 | elapsed time per iteration (s): 0.79 | learning rate: 6.656E-05 | global batch size: 256 | lm loss: 1.939775E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 325.923 | TFLOPs: 19.72 | 31: iteration 115170/ 173500 | consumed samples: 29483520 | consumed tokens: 60382248960 | elapsed time per iteration (s): 0.79 | learning rate: 6.654E-05 | global batch size: 256 | lm loss: 1.940217E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.400 | TFLOPs: 19.69 | 31: iteration 115180/ 173500 | consumed samples: 29486080 | consumed tokens: 60387491840 | elapsed time per iteration (s): 0.75 | learning rate: 6.653E-05 | global batch size: 256 | lm loss: 1.984063E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.886 | TFLOPs: 20.62 | 31: iteration 115190/ 173500 | consumed samples: 29488640 | consumed tokens: 60392734720 | elapsed time per iteration (s): 0.78 | learning rate: 6.651E-05 | global batch size: 256 | lm loss: 1.977334E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.576 | TFLOPs: 19.76 | 31: iteration 115200/ 173500 | consumed samples: 29491200 | consumed tokens: 60397977600 | elapsed time per iteration (s): 0.78 | learning rate: 6.650E-05 | global batch size: 256 | lm loss: 1.955807E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.742 | TFLOPs: 19.83 | 31: iteration 115210/ 173500 | consumed samples: 29493760 | consumed tokens: 60403220480 | elapsed time per iteration (s): 0.89 | learning rate: 6.648E-05 | global batch size: 256 | lm loss: 1.972637E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.519 | TFLOPs: 17.39 | 31: iteration 115220/ 173500 | consumed samples: 29496320 | consumed tokens: 60408463360 | elapsed time per iteration (s): 0.80 | learning rate: 6.647E-05 | global 
batch size: 256 | lm loss: 1.966251E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.695 | TFLOPs: 19.40 | 31: iteration 115230/ 173500 | consumed samples: 29498880 | consumed tokens: 60413706240 | elapsed time per iteration (s): 0.76 | learning rate: 6.646E-05 | global batch size: 256 | lm loss: 1.970911E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.893 | TFLOPs: 20.44 | 31: iteration 115240/ 173500 | consumed samples: 29501440 | consumed tokens: 60418949120 | elapsed time per iteration (s): 0.73 | learning rate: 6.644E-05 | global batch size: 256 | lm loss: 1.971197E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.156 | TFLOPs: 21.12 | 31: iteration 115250/ 173500 | consumed samples: 29504000 | consumed tokens: 60424192000 | elapsed time per iteration (s): 0.76 | learning rate: 6.643E-05 | global batch size: 256 | lm loss: 1.963760E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.879 | TFLOPs: 20.50 | 31: iteration 115260/ 173500 | consumed samples: 29506560 | consumed tokens: 60429434880 | elapsed time per iteration (s): 0.85 | learning rate: 6.641E-05 | global batch size: 256 | lm loss: 1.940691E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.103 | TFLOPs: 18.16 | 31: iteration 115270/ 173500 | consumed samples: 29509120 | consumed tokens: 60434677760 | elapsed time per iteration (s): 0.76 | learning rate: 6.640E-05 | global batch size: 256 | lm loss: 1.947904E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.564 | TFLOPs: 20.48 | 31: iteration 115280/ 173500 | consumed samples: 
29511680 | consumed tokens: 60439920640 | elapsed time per iteration (s): 0.75 | learning rate: 6.638E-05 | global batch size: 256 | lm loss: 1.955221E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.541 | TFLOPs: 20.66 | 31: iteration 115290/ 173500 | consumed samples: 29514240 | consumed tokens: 60445163520 | elapsed time per iteration (s): 0.76 | learning rate: 6.637E-05 | global batch size: 256 | lm loss: 1.956469E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.597 | TFLOPs: 20.30 | 31: iteration 115300/ 173500 | consumed samples: 29516800 | consumed tokens: 60450406400 | elapsed time per iteration (s): 0.74 | learning rate: 6.635E-05 | global batch size: 256 | lm loss: 1.971087E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.309 | TFLOPs: 20.83 | 31: iteration 115310/ 173500 | consumed samples: 29519360 | consumed tokens: 60455649280 | elapsed time per iteration (s): 0.83 | learning rate: 6.634E-05 | global batch size: 256 | lm loss: 1.959397E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.470 | TFLOPs: 18.60 | 31: iteration 115320/ 173500 | consumed samples: 29521920 | consumed tokens: 60460892160 | elapsed time per iteration (s): 0.83 | learning rate: 6.633E-05 | global batch size: 256 | lm loss: 1.930289E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.235 | TFLOPs: 18.59 | 31: iteration 115330/ 173500 | consumed samples: 29524480 | consumed tokens: 60466135040 | elapsed time per iteration (s): 0.79 | learning rate: 6.631E-05 | global batch size: 256 | lm loss: 1.953247E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 323.632 | TFLOPs: 19.58 | 31: iteration 115340/ 173500 | consumed samples: 29527040 | consumed tokens: 60471377920 | elapsed time per iteration (s): 0.83 | learning rate: 6.630E-05 | global batch size: 256 | lm loss: 1.953070E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.854 | TFLOPs: 18.62 | 31: iteration 115350/ 173500 | consumed samples: 29529600 | consumed tokens: 60476620800 | elapsed time per iteration (s): 0.82 | learning rate: 6.628E-05 | global batch size: 256 | lm loss: 1.951212E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.570 | TFLOPs: 18.97 | 31: iteration 115360/ 173500 | consumed samples: 29532160 | consumed tokens: 60481863680 | elapsed time per iteration (s): 0.84 | learning rate: 6.627E-05 | global batch size: 256 | lm loss: 1.968727E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.180 | TFLOPs: 18.52 | 31: iteration 115370/ 173500 | consumed samples: 29534720 | consumed tokens: 60487106560 | elapsed time per iteration (s): 0.80 | learning rate: 6.625E-05 | global batch size: 256 | lm loss: 1.933753E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.319 | TFLOPs: 19.38 | 31: iteration 115380/ 173500 | consumed samples: 29537280 | consumed tokens: 60492349440 | elapsed time per iteration (s): 0.84 | learning rate: 6.624E-05 | global batch size: 256 | lm loss: 1.943330E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.452 | TFLOPs: 18.48 | 31: iteration 115390/ 173500 | consumed samples: 29539840 | consumed tokens: 60497592320 | elapsed time per iteration (s): 0.85 | learning rate: 6.622E-05 | global batch 
size: 256 | lm loss: 1.972728E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.449 | TFLOPs: 18.24 | 31: iteration 115400/ 173500 | consumed samples: 29542400 | consumed tokens: 60502835200 | elapsed time per iteration (s): 0.81 | learning rate: 6.621E-05 | global batch size: 256 | lm loss: 1.952556E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.581 | TFLOPs: 19.03 | 31: iteration 115410/ 173500 | consumed samples: 29544960 | consumed tokens: 60508078080 | elapsed time per iteration (s): 0.85 | learning rate: 6.620E-05 | global batch size: 256 | lm loss: 1.963103E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.210 | TFLOPs: 18.16 | 31: iteration 115420/ 173500 | consumed samples: 29547520 | consumed tokens: 60513320960 | elapsed time per iteration (s): 0.82 | learning rate: 6.618E-05 | global batch size: 256 | lm loss: 1.934902E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.186 | TFLOPs: 18.89 | 31: iteration 115430/ 173500 | consumed samples: 29550080 | consumed tokens: 60518563840 | elapsed time per iteration (s): 0.84 | learning rate: 6.617E-05 | global batch size: 256 | lm loss: 1.957377E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.238 | TFLOPs: 18.41 | 31: iteration 115440/ 173500 | consumed samples: 29552640 | consumed tokens: 60523806720 | elapsed time per iteration (s): 0.89 | learning rate: 6.615E-05 | global batch size: 256 | lm loss: 1.936990E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.398 | TFLOPs: 17.33 | 31: iteration 115450/ 173500 | consumed samples: 29555200 
| consumed tokens: 60529049600 | elapsed time per iteration (s): 0.89 | learning rate: 6.614E-05 | global batch size: 256 | lm loss: 1.946959E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.830 | TFLOPs: 17.41 | 31: iteration 115460/ 173500 | consumed samples: 29557760 | consumed tokens: 60534292480 | elapsed time per iteration (s): 0.85 | learning rate: 6.612E-05 | global batch size: 256 | lm loss: 1.937838E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.757 | TFLOPs: 18.32 | 31: iteration 115470/ 173500 | consumed samples: 29560320 | consumed tokens: 60539535360 | elapsed time per iteration (s): 0.83 | learning rate: 6.611E-05 | global batch size: 256 | lm loss: 1.959916E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.231 | TFLOPs: 18.59 | 31: iteration 115480/ 173500 | consumed samples: 29562880 | consumed tokens: 60544778240 | elapsed time per iteration (s): 0.88 | learning rate: 6.610E-05 | global batch size: 256 | lm loss: 1.954642E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.781 | TFLOPs: 17.65 | 31: iteration 115490/ 173500 | consumed samples: 29565440 | consumed tokens: 60550021120 | elapsed time per iteration (s): 0.85 | learning rate: 6.608E-05 | global batch size: 256 | lm loss: 1.957898E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.744 | TFLOPs: 18.13 | 31: iteration 115500/ 173500 | consumed samples: 29568000 | consumed tokens: 60555264000 | elapsed time per iteration (s): 0.79 | learning rate: 6.607E-05 | global batch size: 256 | lm loss: 1.973255E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 325.592 | TFLOPs: 19.70 | 31: iteration 115510/ 173500 | consumed samples: 29570560 | consumed tokens: 60560506880 | elapsed time per iteration (s): 0.87 | learning rate: 6.605E-05 | global batch size: 256 | lm loss: 1.962091E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.015 | TFLOPs: 17.85 | 31: iteration 115520/ 173500 | consumed samples: 29573120 | consumed tokens: 60565749760 | elapsed time per iteration (s): 0.85 | learning rate: 6.604E-05 | global batch size: 256 | lm loss: 1.959786E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.843 | TFLOPs: 18.26 | 31: iteration 115530/ 173500 | consumed samples: 29575680 | consumed tokens: 60570992640 | elapsed time per iteration (s): 0.81 | learning rate: 6.602E-05 | global batch size: 256 | lm loss: 1.936341E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.863 | TFLOPs: 19.23 | 31: iteration 115540/ 173500 | consumed samples: 29578240 | consumed tokens: 60576235520 | elapsed time per iteration (s): 0.78 | learning rate: 6.601E-05 | global batch size: 256 | lm loss: 1.938800E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.609 | TFLOPs: 19.88 | 31: iteration 115550/ 173500 | consumed samples: 29580800 | consumed tokens: 60581478400 | elapsed time per iteration (s): 0.83 | learning rate: 6.599E-05 | global batch size: 256 | lm loss: 1.928259E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.920 | TFLOPs: 18.57 | 31: iteration 115560/ 173500 | consumed samples: 29583360 | consumed tokens: 60586721280 | elapsed time per iteration (s): 0.77 | learning rate: 6.598E-05 | global batch size: 
256 | lm loss: 1.942098E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.426 | TFLOPs: 19.99 | 31: iteration 115570/ 173500 | consumed samples: 29585920 | consumed tokens: 60591964160 | elapsed time per iteration (s): 0.77 | learning rate: 6.597E-05 | global batch size: 256 | lm loss: 1.963397E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.441 | TFLOPs: 20.11 | 31: iteration 115580/ 173500 | consumed samples: 29588480 | consumed tokens: 60597207040 | elapsed time per iteration (s): 0.75 | learning rate: 6.595E-05 | global batch size: 256 | lm loss: 1.938803E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.174 | TFLOPs: 20.52 | 31: iteration 115590/ 173500 | consumed samples: 29591040 | consumed tokens: 60602449920 | elapsed time per iteration (s): 0.79 | learning rate: 6.594E-05 | global batch size: 256 | lm loss: 1.961521E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.667 | TFLOPs: 19.64 | 31: iteration 115600/ 173500 | consumed samples: 29593600 | consumed tokens: 60607692800 | elapsed time per iteration (s): 0.76 | learning rate: 6.592E-05 | global batch size: 256 | lm loss: 1.963340E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.036 | TFLOPs: 20.33 | 31: iteration 115610/ 173500 | consumed samples: 29596160 | consumed tokens: 60612935680 | elapsed time per iteration (s): 0.79 | learning rate: 6.591E-05 | global batch size: 256 | lm loss: 1.965119E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.741 | TFLOPs: 19.52 | 31: iteration 115620/ 173500 | consumed samples: 29598720 | 
consumed tokens: 60618178560 | elapsed time per iteration (s): 0.79 | learning rate: 6.589E-05 | global batch size: 256 | lm loss: 1.946416E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.581 | TFLOPs: 19.64 | 31: iteration 115630/ 173500 | consumed samples: 29601280 | consumed tokens: 60623421440 | elapsed time per iteration (s): 0.76 | learning rate: 6.588E-05 | global batch size: 256 | lm loss: 1.966255E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.047 | TFLOPs: 20.27 | 31: iteration 115640/ 173500 | consumed samples: 29603840 | consumed tokens: 60628664320 | elapsed time per iteration (s): 0.73 | learning rate: 6.587E-05 | global batch size: 256 | lm loss: 1.959608E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.735 | TFLOPs: 21.10 | 31: iteration 115650/ 173500 | consumed samples: 29606400 | consumed tokens: 60633907200 | elapsed time per iteration (s): 0.83 | learning rate: 6.585E-05 | global batch size: 256 | lm loss: 1.946408E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.070 | TFLOPs: 18.76 | 31: iteration 115660/ 173500 | consumed samples: 29608960 | consumed tokens: 60639150080 | elapsed time per iteration (s): 0.78 | learning rate: 6.584E-05 | global batch size: 256 | lm loss: 1.954952E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.622 | TFLOPs: 19.94 | 31: iteration 115670/ 173500 | consumed samples: 29611520 | consumed tokens: 60644392960 | elapsed time per iteration (s): 0.83 | learning rate: 6.582E-05 | global batch size: 256 | lm loss: 1.934585E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 307.995 | TFLOPs: 18.63 | 31: iteration 115680/ 173500 | consumed samples: 29614080 | consumed tokens: 60649635840 | elapsed time per iteration (s): 0.79 | learning rate: 6.581E-05 | global batch size: 256 | lm loss: 1.961804E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.848 | TFLOPs: 19.53 | 31: iteration 115690/ 173500 | consumed samples: 29616640 | consumed tokens: 60654878720 | elapsed time per iteration (s): 0.83 | learning rate: 6.579E-05 | global batch size: 256 | lm loss: 1.979538E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.194 | TFLOPs: 18.71 | 31: iteration 115700/ 173500 | consumed samples: 29619200 | consumed tokens: 60660121600 | elapsed time per iteration (s): 0.81 | learning rate: 6.578E-05 | global batch size: 256 | lm loss: 1.920355E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.175 | TFLOPs: 19.07 | 31: iteration 115710/ 173500 | consumed samples: 29621760 | consumed tokens: 60665364480 | elapsed time per iteration (s): 0.79 | learning rate: 6.577E-05 | global batch size: 256 | lm loss: 1.949883E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.293 | TFLOPs: 19.68 | 31: iteration 115720/ 173500 | consumed samples: 29624320 | consumed tokens: 60670607360 | elapsed time per iteration (s): 0.78 | learning rate: 6.575E-05 | global batch size: 256 | lm loss: 1.965247E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.258 | TFLOPs: 19.98 | 31: iteration 115730/ 173500 | consumed samples: 29626880 | consumed tokens: 60675850240 | elapsed time per iteration (s): 0.74 | learning rate: 6.574E-05 | global batch size: 
256 | lm loss: 1.939829E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.889 | TFLOPs: 20.80 | 31: iteration 115740/ 173500 | consumed samples: 29629440 | consumed tokens: 60681093120 | elapsed time per iteration (s): 0.74 | learning rate: 6.572E-05 | global batch size: 256 | lm loss: 1.966966E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.534 | TFLOPs: 20.90 | 31: iteration 115750/ 173500 | consumed samples: 29632000 | consumed tokens: 60686336000 | elapsed time per iteration (s): 0.79 | learning rate: 6.571E-05 | global batch size: 256 | lm loss: 1.943435E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.789 | TFLOPs: 19.71 | 31: iteration 115760/ 173500 | consumed samples: 29634560 | consumed tokens: 60691578880 | elapsed time per iteration (s): 0.79 | learning rate: 6.569E-05 | global batch size: 256 | lm loss: 1.958348E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.495 | TFLOPs: 19.63 | 31: iteration 115770/ 173500 | consumed samples: 29637120 | consumed tokens: 60696821760 | elapsed time per iteration (s): 0.76 | learning rate: 6.568E-05 | global batch size: 256 | lm loss: 1.970671E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.475 | TFLOPs: 20.36 | 31: iteration 115780/ 173500 | consumed samples: 29639680 | consumed tokens: 60702064640 | elapsed time per iteration (s): 0.80 | learning rate: 6.567E-05 | global batch size: 256 | lm loss: 1.930549E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.628 | TFLOPs: 19.46 | 31: iteration 115790/ 173500 | consumed samples: 29642240 | 
consumed tokens: 60707307520 | elapsed time per iteration (s): 0.86 | learning rate: 6.565E-05 | global batch size: 256 | lm loss: 1.946740E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.026 | TFLOPs: 17.91 | 31: iteration 115800/ 173500 | consumed samples: 29644800 | consumed tokens: 60712550400 | elapsed time per iteration (s): 0.73 | learning rate: 6.564E-05 | global batch size: 256 | lm loss: 1.954765E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.083 | TFLOPs: 21.18 | 31: iteration 115810/ 173500 | consumed samples: 29647360 | consumed tokens: 60717793280 | elapsed time per iteration (s): 0.80 | learning rate: 6.562E-05 | global batch size: 256 | lm loss: 1.945163E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.028 | TFLOPs: 19.42 | 31: iteration 115820/ 173500 | consumed samples: 29649920 | consumed tokens: 60723036160 | elapsed time per iteration (s): 0.73 | learning rate: 6.561E-05 | global batch size: 256 | lm loss: 1.949546E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.244 | TFLOPs: 21.31 | 31: iteration 115830/ 173500 | consumed samples: 29652480 | consumed tokens: 60728279040 | elapsed time per iteration (s): 0.81 | learning rate: 6.559E-05 | global batch size: 256 | lm loss: 1.949057E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.862 | TFLOPs: 19.17 | 31: iteration 115840/ 173500 | consumed samples: 29655040 | consumed tokens: 60733521920 | elapsed time per iteration (s): 0.74 | learning rate: 6.558E-05 | global batch size: 256 | lm loss: 1.963171E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 347.886 | TFLOPs: 21.05 | 31: iteration 115850/ 173500 | consumed samples: 29657600 | consumed tokens: 60738764800 | elapsed time per iteration (s): 0.77 | learning rate: 6.556E-05 | global batch size: 256 | lm loss: 1.937337E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.552 | TFLOPs: 20.12 | 31: iteration 115860/ 173500 | consumed samples: 29660160 | consumed tokens: 60744007680 | elapsed time per iteration (s): 0.77 | learning rate: 6.555E-05 | global batch size: 256 | lm loss: 1.930001E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.857 | TFLOPs: 20.20 | 31: iteration 115870/ 173500 | consumed samples: 29662720 | consumed tokens: 60749250560 | elapsed time per iteration (s): 0.74 | learning rate: 6.554E-05 | global batch size: 256 | lm loss: 1.959315E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.387 | TFLOPs: 20.83 | 31: iteration 115880/ 173500 | consumed samples: 29665280 | consumed tokens: 60754493440 | elapsed time per iteration (s): 0.73 | learning rate: 6.552E-05 | global batch size: 256 | lm loss: 1.953336E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.485 | TFLOPs: 21.14 | 31: iteration 115890/ 173500 | consumed samples: 29667840 | consumed tokens: 60759736320 | elapsed time per iteration (s): 0.73 | learning rate: 6.551E-05 | global batch size: 256 | lm loss: 1.959032E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.573 | TFLOPs: 21.21 | 31: iteration 115900/ 173500 | consumed samples: 29670400 | consumed tokens: 60764979200 | elapsed time per iteration (s): 0.76 | learning rate: 6.549E-05 | global batch size: 
256 | lm loss: 1.956184E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.210 | TFLOPs: 20.34 | 31: iteration 115910/ 173500 | consumed samples: 29672960 | consumed tokens: 60770222080 | elapsed time per iteration (s): 0.80 | learning rate: 6.548E-05 | global batch size: 256 | lm loss: 1.974660E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.695 | TFLOPs: 19.34 | 31: iteration 115920/ 173500 | consumed samples: 29675520 | consumed tokens: 60775464960 | elapsed time per iteration (s): 0.79 | learning rate: 6.546E-05 | global batch size: 256 | lm loss: 1.927904E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.217 | TFLOPs: 19.49 | 31: iteration 115930/ 173500 | consumed samples: 29678080 | consumed tokens: 60780707840 | elapsed time per iteration (s): 0.77 | learning rate: 6.545E-05 | global batch size: 256 | lm loss: 1.957562E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.190 | TFLOPs: 20.16 | 31: iteration 115940/ 173500 | consumed samples: 29680640 | consumed tokens: 60785950720 | elapsed time per iteration (s): 0.74 | learning rate: 6.544E-05 | global batch size: 256 | lm loss: 1.970736E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.052 | TFLOPs: 21.00 | 31: iteration 115950/ 173500 | consumed samples: 29683200 | consumed tokens: 60791193600 | elapsed time per iteration (s): 0.77 | learning rate: 6.542E-05 | global batch size: 256 | lm loss: 1.941326E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.584 | TFLOPs: 20.00 | 31: iteration 115960/ 173500 | consumed samples: 29685760 | 
consumed tokens: 60796436480 | elapsed time per iteration (s): 0.78 | learning rate: 6.541E-05 | global batch size: 256 | lm loss: 1.944613E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.083 | TFLOPs: 19.97 | 31: iteration 115970/ 173500 | consumed samples: 29688320 | consumed tokens: 60801679360 | elapsed time per iteration (s): 0.84 | learning rate: 6.539E-05 | global batch size: 256 | lm loss: 1.955197E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.466 | TFLOPs: 18.42 | 31: iteration 115980/ 173500 | consumed samples: 29690880 | consumed tokens: 60806922240 | elapsed time per iteration (s): 0.82 | learning rate: 6.538E-05 | global batch size: 256 | lm loss: 1.927554E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.579 | TFLOPs: 18.85 | 31: iteration 115990/ 173500 | consumed samples: 29693440 | consumed tokens: 60812165120 | elapsed time per iteration (s): 0.85 | learning rate: 6.536E-05 | global batch size: 256 | lm loss: 1.935944E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.947 | TFLOPs: 18.33 | 0: [2022-11-26 20:13:55,078] [INFO] [logging.py:68:log_dist] [Rank 0] step=116000, skipped=0, lr=[6.535024808618106e-05, 6.535024808618106e-05, 6.535024808618106e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 116000/ 173500 | consumed samples: 29696000 | consumed tokens: 60817408000 | elapsed time per iteration (s): 0.83 | learning rate: 6.535E-05 | global batch size: 256 | lm loss: 1.949747E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.138 | TFLOPs: 18.76 | 0: steps: 116000 loss: 2.0002 iter time (s): 0.790 samples/sec: 323.995 31: 
-------------------------------------------------------------------------------------------- 31: valid loss at iteration 116000 | lm loss value: 1.958388E+00 | lm loss PPL: 7.087889E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 116000 to checkpoints_1b1long 0: [2022-11-26 20:13:55,408] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step116000 is begin to save! 0: [2022-11-26 20:13:55,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_01-model_00-model_states.pt... 0: [2022-11-26 20:13:55,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_01-model_00-model_states.pt. 0: [2022-11-26 20:13:55,652] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_03-model_00-model_states.pt... 0: [2022-11-26 20:13:55,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_03-model_00-model_states.pt. 0: [2022-11-26 20:13:55,739] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_04-model_00-model_states.pt... 0: [2022-11-26 20:13:55,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_04-model_00-model_states.pt. 0: [2022-11-26 20:13:55,819] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_05-model_00-model_states.pt... 0: [2022-11-26 20:13:55,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_05-model_00-model_states.pt. 0: [2022-11-26 20:13:55,897] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_06-model_00-model_states.pt... 
0: [2022-11-26 20:13:55,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_06-model_00-model_states.pt. 0: [2022-11-26 20:13:55,971] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_07-model_00-model_states.pt... 0: [2022-11-26 20:13:56,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_07-model_00-model_states.pt. 0: [2022-11-26 20:13:56,051] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_08-model_00-model_states.pt... 0: [2022-11-26 20:13:56,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_08-model_00-model_states.pt. 0: [2022-11-26 20:13:56,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_09-model_00-model_states.pt... 0: [2022-11-26 20:13:56,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_09-model_00-model_states.pt. 0: [2022-11-26 20:13:56,211] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_10-model_00-model_states.pt... 0: [2022-11-26 20:13:56,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_10-model_00-model_states.pt. 0: [2022-11-26 20:13:56,291] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_11-model_00-model_states.pt... 0: [2022-11-26 20:13:56,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_11-model_00-model_states.pt. 0: [2022-11-26 20:13:56,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_12-model_00-model_states.pt... 
0: [2022-11-26 20:13:56,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_12-model_00-model_states.pt. 0: [2022-11-26 20:13:56,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_13-model_00-model_states.pt... 0: [2022-11-26 20:13:56,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_13-model_00-model_states.pt. 0: [2022-11-26 20:13:56,519] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_14-model_00-model_states.pt... 0: [2022-11-26 20:13:56,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_14-model_00-model_states.pt. 0: [2022-11-26 20:13:56,594] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_15-model_00-model_states.pt... 0: [2022-11-26 20:13:56,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_15-model_00-model_states.pt. 0: [2022-11-26 20:13:56,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_16-model_00-model_states.pt... 0: [2022-11-26 20:13:56,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_16-model_00-model_states.pt. 0: [2022-11-26 20:13:56,748] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_17-model_00-model_states.pt... 0: [2022-11-26 20:13:56,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_17-model_00-model_states.pt. 0: [2022-11-26 20:13:56,822] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_18-model_00-model_states.pt... 
0: [2022-11-26 20:13:56,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_18-model_00-model_states.pt. 0: [2022-11-26 20:13:56,898] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_19-model_00-model_states.pt... 0: [2022-11-26 20:13:56,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_19-model_00-model_states.pt. 0: [2022-11-26 20:13:56,978] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_20-model_00-model_states.pt... 0: [2022-11-26 20:13:57,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_20-model_00-model_states.pt. 0: [2022-11-26 20:13:57,054] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_21-model_00-model_states.pt... 0: [2022-11-26 20:13:57,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_21-model_00-model_states.pt. 0: [2022-11-26 20:13:57,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_22-model_00-model_states.pt... 0: [2022-11-26 20:13:57,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_22-model_00-model_states.pt. 0: [2022-11-26 20:13:57,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_23-model_00-model_states.pt... 0: [2022-11-26 20:13:57,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_23-model_00-model_states.pt. 0: [2022-11-26 20:13:57,302] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_24-model_00-model_states.pt... 
0: [2022-11-26 20:13:57,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_24-model_00-model_states.pt. 0: [2022-11-26 20:13:57,377] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_25-model_00-model_states.pt... 0: [2022-11-26 20:13:57,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_25-model_00-model_states.pt. 0: [2022-11-26 20:13:57,455] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_26-model_00-model_states.pt... 0: [2022-11-26 20:13:57,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_26-model_00-model_states.pt. 0: [2022-11-26 20:13:57,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_27-model_00-model_states.pt... 0: [2022-11-26 20:13:57,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_27-model_00-model_states.pt. 0: [2022-11-26 20:13:57,606] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_28-model_00-model_states.pt... 0: [2022-11-26 20:13:57,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_28-model_00-model_states.pt. 0: [2022-11-26 20:13:57,682] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/layer_30-model_00-model_states.pt... 0: [2022-11-26 20:13:57,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/layer_30-model_00-model_states.pt. 
0: [2022-11-26 20:13:57,684] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step116000/mp_rank_00_model_states.pt 0: [2022-11-26 20:13:57,684] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/mp_rank_00_model_states.pt... 0: [2022-11-26 20:13:57,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/mp_rank_00_model_states.pt. 0: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
7: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 
10: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
13: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
20: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 
11: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 
29: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 
21: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 
19: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
5: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
10: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
13: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 
20: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 
24: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 
17: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 
19: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
4: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 
16: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 
23: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 
29: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 
19: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
15: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 
21: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 
0: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:13:57,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:13:57,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:13:57,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 20:13:57,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 9: [2022-11-26 20:13:57,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-26 20:13:57,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 20:13:57,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 28: [2022-11-26 20:13:57,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:13:57,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:13:57,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 20:13:57,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 6: [2022-11-26 20:13:57,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:13:57,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 20:13:57,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 7: [2022-11-26 20:13:57,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:13:57,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 20:13:57,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
19: [2022-11-26 20:13:57,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:13:57,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:13:57,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 20:13:57,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 20:13:57,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 19: [2022-11-26 20:13:57,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 23: [2022-11-26 20:13:57,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:13:57,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 20:13:57,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 25: [2022-11-26 20:13:57,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:13:57,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
25: [2022-11-26 20:13:57,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 24: [2022-11-26 20:13:57,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:13:57,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 25: [2022-11-26 20:13:57,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 24: [2022-11-26 20:13:57,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 22: [2022-11-26 20:13:57,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:13:57,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 24: [2022-11-26 20:13:57,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 22: [2022-11-26 20:13:57,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 20:13:57,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 21: [2022-11-26 20:13:57,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
21: [2022-11-26 20:13:57,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 28: [2022-11-26 20:13:57,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 21: [2022-11-26 20:13:57,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 15: [2022-11-26 20:13:57,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:13:57,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 15: [2022-11-26 20:13:57,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 20:13:57,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 28: [2022-11-26 20:13:57,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:13:57,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 24: [2022-11-26 20:13:57,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:13:57,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
24: [2022-11-26 20:13:57,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 1: [2022-11-26 20:13:57,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:13:57,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 1: [2022-11-26 20:13:57,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 9: [2022-11-26 20:13:57,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:13:57,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 20:13:57,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 0: [2022-11-26 20:13:57,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:13:57,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 3: [2022-11-26 20:13:57,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
0: [2022-11-26 20:13:57,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 3: [2022-11-26 20:13:57,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 0: [2022-11-26 20:13:57,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 3: [2022-11-26 20:13:57,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 3: [2022-11-26 20:13:57,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:13:57,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 20:13:57,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 27: [2022-11-26 20:13:57,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 20:13:57,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 20:13:57,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 12: [2022-11-26 20:13:57,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 20:13:57,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 20:13:57,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 29: [2022-11-26 20:13:57,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:13:57,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 20:13:57,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 21: [2022-11-26 20:13:57,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:13:57,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 20:13:57,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 5: [2022-11-26 20:13:57,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:13:57,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 20:13:57,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 16: [2022-11-26 20:13:57,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
23: [2022-11-26 20:13:57,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:13:57,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 23: [2022-11-26 20:13:57,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 6: [2022-11-26 20:13:57,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:13:57,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 23: [2022-11-26 20:13:57,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 6: [2022-11-26 20:13:57,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 25: [2022-11-26 20:13:57,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:13:57,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 25: [2022-11-26 20:13:57,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 15: [2022-11-26 20:13:57,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-26 20:13:57,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 20:13:57,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 25: [2022-11-26 20:13:57,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 19: [2022-11-26 20:13:57,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:13:57,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 20:13:57,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 5: [2022-11-26 20:13:57,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:13:57,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:13:57,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 20:13:57,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 2: [2022-11-26 20:13:57,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-26 20:13:57,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 20:13:57,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 13: [2022-11-26 20:13:57,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 20:13:57,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 4: [2022-11-26 20:13:57,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 27: [2022-11-26 20:13:57,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:13:57,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 20:13:57,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 27: [2022-11-26 20:13:57,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 0: [2022-11-26 20:13:57,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:13:57,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
0: [2022-11-26 20:13:57,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 13: [2022-11-26 20:13:57,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 27: [2022-11-26 20:13:57,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 0: [2022-11-26 20:13:57,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 13: [2022-11-26 20:13:57,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 7: [2022-11-26 20:13:57,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:13:57,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 20:13:57,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 2: [2022-11-26 20:13:57,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:13:57,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 20:13:57,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 4: [2022-11-26 20:13:57,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 20:13:57,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 20:13:57,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 8: [2022-11-26 20:13:57,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:13:57,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 20:13:57,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 8: [2022-11-26 20:13:57,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:13:57,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 20:13:57,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 24: [2022-11-26 20:13:57,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:13:57,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 20:13:57,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 1: [2022-11-26 20:13:57,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
26: [2022-11-26 20:13:57,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:13:57,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 26: [2022-11-26 20:13:57,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 1: [2022-11-26 20:13:57,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 26: [2022-11-26 20:13:57,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 1: [2022-11-26 20:13:57,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:13:57,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 20:13:57,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 2: [2022-11-26 20:13:57,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:13:57,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 20:13:57,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 7: [2022-11-26 20:13:57,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 20:13:57,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 20:13:57,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 27: [2022-11-26 20:13:57,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 20:13:57,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 20:13:57,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 26: [2022-11-26 20:13:57,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:13:57,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 20:13:57,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 16: [2022-11-26 20:13:57,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:13:57,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 14: [2022-11-26 20:13:57,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:13:57,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
14: [2022-11-26 20:13:57,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 20:13:57,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 4: [2022-11-26 20:13:57,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:13:57,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 20:13:57,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 16: [2022-11-26 20:13:57,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:13:57,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 20:13:57,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 9: [2022-11-26 20:13:57,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:13:57,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
19: [2022-11-26 20:13:57,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 9: [2022-11-26 20:13:57,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 19: [2022-11-26 20:13:57,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 9: [2022-11-26 20:13:57,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 28: [2022-11-26 20:13:57,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:13:57,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:13:57,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 20:13:57,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 22: [2022-11-26 20:13:57,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:13:57,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
22: [2022-11-26 20:13:57,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 20:13:57,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 20:13:57,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 22: [2022-11-26 20:13:57,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 7: [2022-11-26 20:13:57,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:13:57,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 3: [2022-11-26 20:13:57,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:13:57,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 20:13:57,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 7: [2022-11-26 20:13:57,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 28: [2022-11-26 20:13:57,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 20:13:57,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
24: [2022-11-26 20:13:57,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:13:57,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 12: [2022-11-26 20:13:57,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:13:57,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 12: [2022-11-26 20:13:57,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 20:13:57,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 12: [2022-11-26 20:13:57,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:13:57,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 20:13:57,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 0: [2022-11-26 20:13:57,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:13:57,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-26 20:13:57,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 20:13:57,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 26: [2022-11-26 20:13:57,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:13:57,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 20:13:57,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 8: [2022-11-26 20:13:57,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:13:57,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 20:13:57,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 2: [2022-11-26 20:13:57,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:13:57,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 20:13:57,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 2: [2022-11-26 20:13:57,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 15: [2022-11-26 20:13:57,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 2: [2022-11-26 20:13:57,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 6: [2022-11-26 20:13:57,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:13:57,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 20:13:57,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 14: [2022-11-26 20:13:57,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:13:57,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 20:13:57,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 3: [2022-11-26 20:13:57,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 20:13:57,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 20:13:57,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 25: [2022-11-26 20:13:57,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:13:57,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 20:13:57,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 15: [2022-11-26 20:13:57,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:13:57,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 20:13:57,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 13: [2022-11-26 20:13:57,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:13:57,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 20:13:57,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 28: [2022-11-26 20:13:57,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
28: [2022-11-26 20:13:57,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 20:13:57,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 31: [2022-11-26 20:13:57,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:13:57,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:13:57,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:13:57,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 20:13:57,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 20:13:57,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 27: [2022-11-26 20:13:57,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:13:57,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 17: [2022-11-26 20:13:57,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 
27: [2022-11-26 20:13:57,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 31: [2022-11-26 20:13:57,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 17: [2022-11-26 20:13:57,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 27: [2022-11-26 20:13:57,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 31: [2022-11-26 20:13:57,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 17: [2022-11-26 20:13:57,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 17: [2022-11-26 20:13:57,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:13:57,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 20:13:57,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 17: [2022-11-26 20:13:57,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:13:57,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 20:13:57,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
0: [2022-11-26 20:13:57,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:13:57,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 20:13:57,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 29: [2022-11-26 20:13:57,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:13:57,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 20:13:57,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 5: [2022-11-26 20:13:57,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:13:57,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 20:13:57,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 5: [2022-11-26 20:13:57,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:13:57,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 20:13:57,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
16: [2022-11-26 20:13:57,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:13:57,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 20:13:57,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 1: [2022-11-26 20:13:57,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:13:57,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 20:13:57,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 9: [2022-11-26 20:13:57,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:13:57,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:13:57,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 21: [2022-11-26 20:13:57,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 9: [2022-11-26 20:13:57,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 21: [2022-11-26 20:13:57,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
11: [2022-11-26 20:13:57,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:13:57,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:13:57,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:13:57,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:13:57,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 20:13:57,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 20:13:57,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 20:13:57,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 20:13:57,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 11: [2022-11-26 20:13:57,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 11: [2022-11-26 20:13:57,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 11: [2022-11-26 20:13:57,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
25: [2022-11-26 20:13:57,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:13:57,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 20:13:57,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 30: [2022-11-26 20:13:57,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:13:57,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:13:57,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:13:57,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:13:57,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 20:13:57,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
30: [2022-11-26 20:13:57,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 20:13:57,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 20:13:57,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 14: [2022-11-26 20:13:57,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:13:57,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 30: [2022-11-26 20:13:57,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 30: [2022-11-26 20:13:57,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 14: [2022-11-26 20:13:57,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 20:13:57,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 29: [2022-11-26 20:13:57,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:13:57,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 
29: [2022-11-26 20:13:57,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 20:13:57,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 20:13:57,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 29: [2022-11-26 20:13:57,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 6: [2022-11-26 20:13:57,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:13:57,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 20:13:57,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 0: [2022-11-26 20:13:57,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 20:13:57,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 23: [2022-11-26 20:13:57,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:13:57,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 20:13:57,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
12: [2022-11-26 20:13:57,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:13:57,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 20:13:57,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 13: [2022-11-26 20:13:57,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:13:57,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 20:13:57,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 22: [2022-11-26 20:13:57,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:13:57,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 20:13:57,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 18: [2022-11-26 20:13:57,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:13:57,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
18: [2022-11-26 20:13:57,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 20:13:57,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 20:13:57,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 18: [2022-11-26 20:13:57,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 20: [2022-11-26 20:13:57,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:13:57,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:13:57,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:13:57,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
20: [2022-11-26 20:13:57,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 20:13:57,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 20:13:57,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 20:13:57,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 20:13:57,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 20: [2022-11-26 20:13:57,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 20: [2022-11-26 20:13:57,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 20: [2022-11-26 20:13:57,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 5: [2022-11-26 20:13:57,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:13:57,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 20:13:57,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 18: [2022-11-26 20:13:57,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
18: [2022-11-26 20:13:57,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 20:13:57,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 18: [2022-11-26 20:13:57,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:13:57,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 20:13:57,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 28: [2022-11-26 20:13:57,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:13:57,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:13:57,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:13:57,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-26 20:13:57,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 20:13:57,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 20:13:57,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 20:13:57,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 10: [2022-11-26 20:13:57,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 10: [2022-11-26 20:13:57,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 19: [2022-11-26 20:13:57,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:13:57,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 19: [2022-11-26 20:13:57,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 28: [2022-11-26 20:13:57,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 19: [2022-11-26 20:13:57,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 20: [2022-11-26 20:13:57,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-26 20:13:57,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 20:13:57,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 11: [2022-11-26 20:13:57,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:13:57,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 20:13:57,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 15: [2022-11-26 20:13:57,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:13:57,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 20:13:57,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 29: [2022-11-26 20:13:57,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:13:57,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 20:13:57,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 24: [2022-11-26 20:13:57,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
24: [2022-11-26 20:13:57,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 20:13:57,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 18: [2022-11-26 20:13:57,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:13:57,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 20:13:57,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 4: [2022-11-26 20:13:57,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:13:57,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 20:13:57,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 0: [2022-11-26 20:13:57,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:13:57,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 20:13:57,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 8: [2022-11-26 20:13:57,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-26 20:13:57,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 20:13:57,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 3: [2022-11-26 20:13:57,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:13:57,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 20:13:57,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 17: [2022-11-26 20:13:57,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:13:57,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 20:13:57,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 7: [2022-11-26 20:13:57,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:13:57,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 20:13:57,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 26: [2022-11-26 20:13:57,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
26: [2022-11-26 20:13:57,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 20:13:57,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 31: [2022-11-26 20:13:57,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:13:57,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 20:13:57,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 30: [2022-11-26 20:13:57,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:13:57,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 20:13:57,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 9: [2022-11-26 20:13:57,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:13:57,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 20:13:57,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 21: [2022-11-26 20:13:57,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-26 20:13:57,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 20:13:57,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 10: [2022-11-26 20:13:57,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:13:57,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 20:13:57,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 1: [2022-11-26 20:13:57,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:13:57,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 20:13:57,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 25: [2022-11-26 20:13:57,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:13:57,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 27: [2022-11-26 20:13:57,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:13:57,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
27: [2022-11-26 20:13:57,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 20:13:57,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 2: [2022-11-26 20:13:57,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:13:57,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 20:13:57,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 13: [2022-11-26 20:13:57,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:13:57,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 20:13:57,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 14: [2022-11-26 20:13:57,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:13:57,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 20:13:57,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 12: [2022-11-26 20:13:57,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 20:13:57,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 20:13:57,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 6: [2022-11-26 20:13:57,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:13:57,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 20:13:57,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 22: [2022-11-26 20:13:57,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:13:57,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 20:13:57,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 16: [2022-11-26 20:13:57,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:13:57,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 20:13:57,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 23: [2022-11-26 20:13:57,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
23: [2022-11-26 20:13:57,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 20:13:57,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 5: [2022-11-26 20:13:57,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:13:57,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 20:13:57,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 19: [2022-11-26 20:13:57,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:13:57,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 20:13:57,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 15: [2022-11-26 20:13:57,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:13:57,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 20:13:57,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 28: [2022-11-26 20:13:57,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-26 20:13:57,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 20: [2022-11-26 20:13:57,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:13:57,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 20:13:57,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 11: [2022-11-26 20:13:57,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:13:57,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 20:13:57,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 0: [2022-11-26 20:13:57,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:13:57,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 20:13:57,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 28: [2022-11-26 20:13:57,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 4: [2022-11-26 20:13:57,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 20:13:57,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 20:13:57,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 26: [2022-11-26 20:13:57,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:13:57,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 29: [2022-11-26 20:13:57,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:13:57,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 29: [2022-11-26 20:13:57,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 20:13:57,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 24: [2022-11-26 20:13:57,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:13:57,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 20:13:57,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 17: [2022-11-26 20:13:57,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 
17: [2022-11-26 20:13:57,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 20:13:57,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 7: [2022-11-26 20:13:57,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:13:57,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 20:13:57,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 18: [2022-11-26 20:13:57,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:13:57,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 20:13:57,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 31: [2022-11-26 20:13:57,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:13:57,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 20:13:57,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 30: [2022-11-26 20:13:57,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-26 20:13:57,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 20:13:57,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 3: [2022-11-26 20:13:57,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:13:57,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 20:13:57,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 8: [2022-11-26 20:13:57,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:13:57,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 20:13:57,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 1: [2022-11-26 20:13:57,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:13:57,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 20:13:57,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 9: [2022-11-26 20:13:57,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
16: [2022-11-26 20:13:57,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:13:57,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 16: [2022-11-26 20:13:57,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 20:13:57,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 9: [2022-11-26 20:13:57,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 2: [2022-11-26 20:13:57,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:13:57,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 20:13:57,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 25: [2022-11-26 20:13:57,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:13:57,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 20:13:57,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 27: [2022-11-26 20:13:57,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
27: [2022-11-26 20:13:57,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 20:13:57,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 14: [2022-11-26 20:13:57,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:13:57,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 20:13:57,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 21: [2022-11-26 20:13:57,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:13:57,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 20:13:57,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 6: [2022-11-26 20:13:57,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:13:57,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 20:13:57,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 13: [2022-11-26 20:13:57,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
22: [2022-11-26 20:13:57,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:13:57,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 22: [2022-11-26 20:13:57,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 20:13:57,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 23: [2022-11-26 20:13:57,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:13:57,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 20:13:57,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 13: [2022-11-26 20:13:57,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 28: [2022-11-26 20:13:57,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:13:57,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 19: [2022-11-26 20:13:57,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-26 20:13:57,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 28: [2022-11-26 20:13:57,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 19: [2022-11-26 20:13:57,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 5: [2022-11-26 20:13:57,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:13:57,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 20:13:57,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 29: [2022-11-26 20:13:57,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:13:57,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:13:57,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 11: [2022-11-26 20:13:57,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 20:13:57,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 29: [2022-11-26 20:13:57,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
20: [2022-11-26 20:13:57,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:13:57,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 20:13:57,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 18: [2022-11-26 20:13:57,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:13:57,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 20:13:57,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 0: [2022-11-26 20:13:57,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:13:57,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 20:13:57,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 24: [2022-11-26 20:13:57,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:13:57,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 20:13:57,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
15: [2022-11-26 20:13:57,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:13:57,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 20:13:57,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 26: [2022-11-26 20:13:57,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:13:57,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 20:13:57,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 4: [2022-11-26 20:13:57,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:13:57,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 20:13:57,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 1: [2022-11-26 20:13:57,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:13:57,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 20:13:57,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
25: [2022-11-26 20:13:57,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:13:57,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 20:13:57,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 16: [2022-11-26 20:13:57,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:13:57,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 20:13:57,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 2: [2022-11-26 20:13:57,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:13:57,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 20:13:57,982] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 14: [2022-11-26 20:13:57,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:13:57,982] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 20:13:57,982] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
12: [2022-11-26 20:13:57,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:13:57,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 20:13:57,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 8: [2022-11-26 20:13:57,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:13:57,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 20:13:57,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 30: [2022-11-26 20:13:57,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:13:57,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 20:13:57,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 7: [2022-11-26 20:13:57,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:13:57,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 20:13:57,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
3: [2022-11-26 20:13:57,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:13:57,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 20:13:57,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 27: [2022-11-26 20:13:57,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 20:13:57,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 20:13:57,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 21: [2022-11-26 20:13:57,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:13:57,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 20:13:57,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 19: [2022-11-26 20:13:57,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:13:57,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 20:13:57,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
10: [2022-11-26 20:13:57,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:13:57,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:13:57,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 20:13:57,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 26: [2022-11-26 20:13:57,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 20:13:57,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 28: [2022-11-26 20:13:57,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:13:57,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 0: [2022-11-26 20:13:57,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:13:57,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 17: [2022-11-26 20:13:57,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
0: [2022-11-26 20:13:57,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 9: [2022-11-26 20:13:57,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:13:57,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 17: [2022-11-26 20:13:57,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 20:13:57,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 9: [2022-11-26 20:13:57,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 23: [2022-11-26 20:13:57,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:13:57,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 23: [2022-11-26 20:13:57,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 20:13:57,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 18: [2022-11-26 20:13:57,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
18: [2022-11-26 20:13:57,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 13: [2022-11-26 20:13:57,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:13:57,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:13:57,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 13: [2022-11-26 20:13:57,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 22: [2022-11-26 20:13:57,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 13: [2022-11-26 20:13:57,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 22: [2022-11-26 20:13:57,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 31: [2022-11-26 20:13:57,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:13:57,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 20:13:57,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 4: [2022-11-26 20:13:57,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 20:13:57,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 20:13:57,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 2: [2022-11-26 20:13:57,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:13:57,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 20:13:57,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 1: [2022-11-26 20:13:57,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:13:57,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:13:57,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
1: [2022-11-26 20:13:57,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 11: [2022-11-26 20:13:57,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 6: [2022-11-26 20:13:57,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 11: [2022-11-26 20:13:57,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 6: [2022-11-26 20:13:57,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 1: [2022-11-26 20:13:57,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 14: [2022-11-26 20:13:57,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:13:57,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 20:13:57,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 5: [2022-11-26 20:13:57,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:13:57,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 20:13:57,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
20: [2022-11-26 20:13:57,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:13:57,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 20:13:57,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 15: [2022-11-26 20:13:57,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:13:57,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 20:13:57,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 24: [2022-11-26 20:13:57,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:13:57,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 20:13:57,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 7: [2022-11-26 20:13:57,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:13:57,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 20:13:57,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
8: [2022-11-26 20:13:57,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:13:57,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:13:57,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 29: [2022-11-26 20:13:57,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 8: [2022-11-26 20:13:57,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 29: [2022-11-26 20:13:57,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 23: [2022-11-26 20:13:57,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:13:57,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 20:13:57,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 16: [2022-11-26 20:13:57,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:13:57,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
16: [2022-11-26 20:13:57,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 9: [2022-11-26 20:13:57,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 16: [2022-11-26 20:13:57,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 9: [2022-11-26 20:13:57,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 21: [2022-11-26 20:13:57,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:13:57,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 20:13:57,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 30: [2022-11-26 20:13:57,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:13:57,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 13: [2022-11-26 20:13:57,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:13:57,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 25: [2022-11-26 20:13:57,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 
13: [2022-11-26 20:13:57,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 20:13:57,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 25: [2022-11-26 20:13:57,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 20:13:57,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 6: [2022-11-26 20:13:57,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:13:57,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 20:13:57,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 31: [2022-11-26 20:13:57,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:13:57,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 27: [2022-11-26 20:13:57,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:13:57,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
27: [2022-11-26 20:13:57,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 20:13:57,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 22: [2022-11-26 20:13:57,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:13:57,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 20:13:57,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 10: [2022-11-26 20:13:57,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:13:57,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:13:57,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 20:13:57,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 20:13:57,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 20:13:57,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 20:13:57,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 10: [2022-11-26 20:13:57,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 10: [2022-11-26 20:13:57,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 3: [2022-11-26 20:13:57,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:13:57,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:13:57,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 20:13:57,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 17: [2022-11-26 20:13:57,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 20:13:57,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
8: [2022-11-26 20:13:57,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:13:57,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 20:13:57,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 12: [2022-11-26 20:13:57,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:13:57,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 20:13:57,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:13:57,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 12: [2022-11-26 20:13:57,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 20:13:57,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 31: [2022-11-26 20:13:58,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:13:58,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 20:13:58,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
17: [2022-11-26 20:13:58,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:13:58,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step116000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 20:13:58,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 0: successfully saved checkpoint at iteration 116000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2633.26 31: iteration 116010/ 173500 | consumed samples: 29698560 | consumed tokens: 60822650880 | elapsed time per iteration (s): 1.07 | learning rate: 6.534E-05 | global batch size: 256 | lm loss: 1.933317E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.199 | TFLOPs: 14.47 | 31: iteration 116020/ 173500 | consumed samples: 29701120 | consumed tokens: 60827893760 | elapsed time per iteration (s): 0.83 | learning rate: 6.532E-05 | global batch size: 256 | lm loss: 1.928714E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.381 | TFLOPs: 18.72 | 31: iteration 116030/ 173500 | consumed samples: 29703680 | consumed tokens: 60833136640 | elapsed time per iteration (s): 0.84 | learning rate: 6.531E-05 | global batch size: 256 | lm loss: 1.941530E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.718 | TFLOPs: 18.43 | 31: iteration 116040/ 173500 | consumed samples: 29706240 | consumed tokens: 60838379520 | elapsed time per iteration (s): 0.79 | learning rate: 6.529E-05 | global batch size: 256 | lm loss: 1.954920E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.720 
| TFLOPs: 19.64 | 31: iteration 116050/ 173500 | consumed samples: 29708800 | consumed tokens: 60843622400 | elapsed time per iteration (s): 0.76 | learning rate: 6.528E-05 | global batch size: 256 | lm loss: 1.962273E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.449 | TFLOPs: 20.41 | 31: iteration 116060/ 173500 | consumed samples: 29711360 | consumed tokens: 60848865280 | elapsed time per iteration (s): 0.73 | learning rate: 6.526E-05 | global batch size: 256 | lm loss: 1.975595E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.757 | TFLOPs: 21.34 | 31: iteration 116070/ 173500 | consumed samples: 29713920 | consumed tokens: 60854108160 | elapsed time per iteration (s): 0.78 | learning rate: 6.525E-05 | global batch size: 256 | lm loss: 1.958975E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.451 | TFLOPs: 19.81 | 31: iteration 116080/ 173500 | consumed samples: 29716480 | consumed tokens: 60859351040 | elapsed time per iteration (s): 0.80 | learning rate: 6.524E-05 | global batch size: 256 | lm loss: 1.964517E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.995 | TFLOPs: 19.48 | 31: iteration 116090/ 173500 | consumed samples: 29719040 | consumed tokens: 60864593920 | elapsed time per iteration (s): 0.78 | learning rate: 6.522E-05 | global batch size: 256 | lm loss: 1.938984E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.915 | TFLOPs: 19.90 | 31: iteration 116100/ 173500 | consumed samples: 29721600 | consumed tokens: 60869836800 | elapsed time per iteration (s): 0.78 | learning rate: 6.521E-05 | global batch size: 256 | lm loss: 1.970972E+00 | grad norm: 
0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.546 | TFLOPs: 19.88 | 31: iteration 116110/ 173500 | consumed samples: 29724160 | consumed tokens: 60875079680 | elapsed time per iteration (s): 0.75 | learning rate: 6.519E-05 | global batch size: 256 | lm loss: 1.939033E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.962 | TFLOPs: 20.75 | 31: iteration 116120/ 173500 | consumed samples: 29726720 | consumed tokens: 60880322560 | elapsed time per iteration (s): 0.81 | learning rate: 6.518E-05 | global batch size: 256 | lm loss: 1.949711E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.940 | TFLOPs: 19.11 | 31: iteration 116130/ 173500 | consumed samples: 29729280 | consumed tokens: 60885565440 | elapsed time per iteration (s): 0.74 | learning rate: 6.516E-05 | global batch size: 256 | lm loss: 1.944279E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.251 | TFLOPs: 20.95 | 31: iteration 116140/ 173500 | consumed samples: 29731840 | consumed tokens: 60890808320 | elapsed time per iteration (s): 0.83 | learning rate: 6.515E-05 | global batch size: 256 | lm loss: 1.982980E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.493 | TFLOPs: 18.66 | 31: iteration 116150/ 173500 | consumed samples: 29734400 | consumed tokens: 60896051200 | elapsed time per iteration (s): 0.76 | learning rate: 6.514E-05 | global batch size: 256 | lm loss: 1.950336E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.455 | TFLOPs: 20.29 | 31: iteration 116160/ 173500 | consumed samples: 29736960 | consumed tokens: 60901294080 | elapsed time 
per iteration (s): 0.84 | learning rate: 6.512E-05 | global batch size: 256 | lm loss: 1.968637E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.907 | TFLOPs: 18.45 | 31: iteration 116170/ 173500 | consumed samples: 29739520 | consumed tokens: 60906536960 | elapsed time per iteration (s): 0.75 | learning rate: 6.511E-05 | global batch size: 256 | lm loss: 1.948360E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.314 | TFLOPs: 20.77 | 31: iteration 116180/ 173500 | consumed samples: 29742080 | consumed tokens: 60911779840 | elapsed time per iteration (s): 0.79 | learning rate: 6.509E-05 | global batch size: 256 | lm loss: 1.938532E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.998 | TFLOPs: 19.54 | 31: iteration 116190/ 173500 | consumed samples: 29744640 | consumed tokens: 60917022720 | elapsed time per iteration (s): 0.75 | learning rate: 6.508E-05 | global batch size: 256 | lm loss: 1.971555E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.812 | TFLOPs: 20.68 | 31: iteration 116200/ 173500 | consumed samples: 29747200 | consumed tokens: 60922265600 | elapsed time per iteration (s): 0.81 | learning rate: 6.506E-05 | global batch size: 256 | lm loss: 1.928800E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.589 | TFLOPs: 19.09 | 31: iteration 116210/ 173500 | consumed samples: 29749760 | consumed tokens: 60927508480 | elapsed time per iteration (s): 0.88 | learning rate: 6.505E-05 | global batch size: 256 | lm loss: 1.942356E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.274 | TFLOPs: 
17.50 | 31: iteration 116220/ 173500 | consumed samples: 29752320 | consumed tokens: 60932751360 | elapsed time per iteration (s): 0.81 | learning rate: 6.504E-05 | global batch size: 256 | lm loss: 1.946379E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.074 | TFLOPs: 19.18 | 31: iteration 116230/ 173500 | consumed samples: 29754880 | consumed tokens: 60937994240 | elapsed time per iteration (s): 0.82 | learning rate: 6.502E-05 | global batch size: 256 | lm loss: 1.931326E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.764 | TFLOPs: 18.98 | 31: iteration 116240/ 173500 | consumed samples: 29757440 | consumed tokens: 60943237120 | elapsed time per iteration (s): 0.78 | learning rate: 6.501E-05 | global batch size: 256 | lm loss: 1.936657E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.180 | TFLOPs: 19.98 | 31: iteration 116250/ 173500 | consumed samples: 29760000 | consumed tokens: 60948480000 | elapsed time per iteration (s): 0.84 | learning rate: 6.499E-05 | global batch size: 256 | lm loss: 1.982425E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.283 | TFLOPs: 18.41 | 31: iteration 116260/ 173500 | consumed samples: 29762560 | consumed tokens: 60953722880 | elapsed time per iteration (s): 0.79 | learning rate: 6.498E-05 | global batch size: 256 | lm loss: 1.959751E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.779 | TFLOPs: 19.59 | 31: iteration 116270/ 173500 | consumed samples: 29765120 | consumed tokens: 60958965760 | elapsed time per iteration (s): 0.78 | learning rate: 6.496E-05 | global batch size: 256 | lm loss: 1.954792E+00 | grad norm: 0.165 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.666 | TFLOPs: 19.94 | 31: iteration 116280/ 173500 | consumed samples: 29767680 | consumed tokens: 60964208640 | elapsed time per iteration (s): 0.74 | learning rate: 6.495E-05 | global batch size: 256 | lm loss: 1.966146E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.734 | TFLOPs: 21.04 | 31: iteration 116290/ 173500 | consumed samples: 29770240 | consumed tokens: 60969451520 | elapsed time per iteration (s): 0.74 | learning rate: 6.494E-05 | global batch size: 256 | lm loss: 1.944215E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.697 | TFLOPs: 20.85 | 31: iteration 116300/ 173500 | consumed samples: 29772800 | consumed tokens: 60974694400 | elapsed time per iteration (s): 0.79 | learning rate: 6.492E-05 | global batch size: 256 | lm loss: 1.928642E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.288 | TFLOPs: 19.56 | 31: iteration 116310/ 173500 | consumed samples: 29775360 | consumed tokens: 60979937280 | elapsed time per iteration (s): 0.76 | learning rate: 6.491E-05 | global batch size: 256 | lm loss: 1.957549E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.009 | TFLOPs: 20.45 | 31: iteration 116320/ 173500 | consumed samples: 29777920 | consumed tokens: 60985180160 | elapsed time per iteration (s): 0.80 | learning rate: 6.489E-05 | global batch size: 256 | lm loss: 1.940528E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.527 | TFLOPs: 19.45 | 31: iteration 116330/ 173500 | consumed samples: 29780480 | consumed tokens: 60990423040 | elapsed time per 
iteration (s): 0.76 | learning rate: 6.488E-05 | global batch size: 256 | lm loss: 1.925857E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.127 | TFLOPs: 20.46 | 31: iteration 116340/ 173500 | consumed samples: 29783040 | consumed tokens: 60995665920 | elapsed time per iteration (s): 0.84 | learning rate: 6.487E-05 | global batch size: 256 | lm loss: 1.986574E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.480 | TFLOPs: 18.54 | 31: iteration 116350/ 173500 | consumed samples: 29785600 | consumed tokens: 61000908800 | elapsed time per iteration (s): 0.77 | learning rate: 6.485E-05 | global batch size: 256 | lm loss: 1.961519E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.954 | TFLOPs: 20.14 | 31: iteration 116360/ 173500 | consumed samples: 29788160 | consumed tokens: 61006151680 | elapsed time per iteration (s): 0.75 | learning rate: 6.484E-05 | global batch size: 256 | lm loss: 1.978093E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.386 | TFLOPs: 20.53 | 31: iteration 116370/ 173500 | consumed samples: 29790720 | consumed tokens: 61011394560 | elapsed time per iteration (s): 0.75 | learning rate: 6.482E-05 | global batch size: 256 | lm loss: 1.932410E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.408 | TFLOPs: 20.59 | 31: iteration 116380/ 173500 | consumed samples: 29793280 | consumed tokens: 61016637440 | elapsed time per iteration (s): 0.72 | learning rate: 6.481E-05 | global batch size: 256 | lm loss: 1.946219E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.707 | TFLOPs: 
21.64 | 31: iteration 116390/ 173500 | consumed samples: 29795840 | consumed tokens: 61021880320 | elapsed time per iteration (s): 0.77 | learning rate: 6.479E-05 | global batch size: 256 | lm loss: 1.959536E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.732 | TFLOPs: 20.13 | 31: iteration 116400/ 173500 | consumed samples: 29798400 | consumed tokens: 61027123200 | elapsed time per iteration (s): 0.76 | learning rate: 6.478E-05 | global batch size: 256 | lm loss: 1.933932E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.254 | TFLOPs: 20.34 | 31: iteration 116410/ 173500 | consumed samples: 29800960 | consumed tokens: 61032366080 | elapsed time per iteration (s): 0.74 | learning rate: 6.477E-05 | global batch size: 256 | lm loss: 1.930257E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.996 | TFLOPs: 20.81 | 31: iteration 116420/ 173500 | consumed samples: 29803520 | consumed tokens: 61037608960 | elapsed time per iteration (s): 0.80 | learning rate: 6.475E-05 | global batch size: 256 | lm loss: 1.943553E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.969 | TFLOPs: 19.48 | 31: iteration 116430/ 173500 | consumed samples: 29806080 | consumed tokens: 61042851840 | elapsed time per iteration (s): 0.74 | learning rate: 6.474E-05 | global batch size: 256 | lm loss: 1.941495E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.617 | TFLOPs: 20.97 | 31: iteration 116440/ 173500 | consumed samples: 29808640 | consumed tokens: 61048094720 | elapsed time per iteration (s): 0.89 | learning rate: 6.472E-05 | global batch size: 256 | lm loss: 1.934606E+00 | grad norm: 0.161 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.643 | TFLOPs: 17.40 | 31: iteration 116450/ 173500 | consumed samples: 29811200 | consumed tokens: 61053337600 | elapsed time per iteration (s): 0.73 | learning rate: 6.471E-05 | global batch size: 256 | lm loss: 1.951090E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.271 | TFLOPs: 21.25 | 31: iteration 116460/ 173500 | consumed samples: 29813760 | consumed tokens: 61058580480 | elapsed time per iteration (s): 0.79 | learning rate: 6.469E-05 | global batch size: 256 | lm loss: 1.935232E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.596 | TFLOPs: 19.64 | 31: iteration 116470/ 173500 | consumed samples: 29816320 | consumed tokens: 61063823360 | elapsed time per iteration (s): 0.78 | learning rate: 6.468E-05 | global batch size: 256 | lm loss: 1.960187E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.717 | TFLOPs: 19.95 | 31: iteration 116480/ 173500 | consumed samples: 29818880 | consumed tokens: 61069066240 | elapsed time per iteration (s): 0.85 | learning rate: 6.467E-05 | global batch size: 256 | lm loss: 1.945580E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.127 | TFLOPs: 18.22 | 31: iteration 116490/ 173500 | consumed samples: 29821440 | consumed tokens: 61074309120 | elapsed time per iteration (s): 0.79 | learning rate: 6.465E-05 | global batch size: 256 | lm loss: 1.933063E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.126 | TFLOPs: 19.61 | 31: iteration 116500/ 173500 | consumed samples: 29824000 | consumed tokens: 61079552000 | elapsed time per 
iteration (s): 0.79 | learning rate: 6.464E-05 | global batch size: 256 | lm loss: 1.978037E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.874 | TFLOPs: 19.71 | 31: iteration 116510/ 173500 | consumed samples: 29826560 | consumed tokens: 61084794880 | elapsed time per iteration (s): 0.85 | learning rate: 6.462E-05 | global batch size: 256 | lm loss: 1.958727E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.714 | TFLOPs: 18.25 | 31: iteration 116520/ 173500 | consumed samples: 29829120 | consumed tokens: 61090037760 | elapsed time per iteration (s): 0.76 | learning rate: 6.461E-05 | global batch size: 256 | lm loss: 1.942852E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.954 | TFLOPs: 20.45 | 31: iteration 116530/ 173500 | consumed samples: 29831680 | consumed tokens: 61095280640 | elapsed time per iteration (s): 0.78 | learning rate: 6.459E-05 | global batch size: 256 | lm loss: 1.962492E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.655 | TFLOPs: 19.94 | 31: iteration 116540/ 173500 | consumed samples: 29834240 | consumed tokens: 61100523520 | elapsed time per iteration (s): 0.75 | learning rate: 6.458E-05 | global batch size: 256 | lm loss: 1.954919E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.483 | TFLOPs: 20.78 | 31: iteration 116550/ 173500 | consumed samples: 29836800 | consumed tokens: 61105766400 | elapsed time per iteration (s): 0.82 | learning rate: 6.457E-05 | global batch size: 256 | lm loss: 1.948349E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.465 | TFLOPs: 
18.78 | 31: iteration 116560/ 173500 | consumed samples: 29839360 | consumed tokens: 61111009280 | elapsed time per iteration (s): 0.81 | learning rate: 6.455E-05 | global batch size: 256 | lm loss: 1.956527E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.511 | TFLOPs: 19.21 | 31: iteration 116570/ 173500 | consumed samples: 29841920 | consumed tokens: 61116252160 | elapsed time per iteration (s): 0.78 | learning rate: 6.454E-05 | global batch size: 256 | lm loss: 1.961571E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.181 | TFLOPs: 19.91 | 31: iteration 116580/ 173500 | consumed samples: 29844480 | consumed tokens: 61121495040 | elapsed time per iteration (s): 0.78 | learning rate: 6.452E-05 | global batch size: 256 | lm loss: 1.957289E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.066 | TFLOPs: 19.97 | 31: iteration 116590/ 173500 | consumed samples: 29847040 | consumed tokens: 61126737920 | elapsed time per iteration (s): 0.73 | learning rate: 6.451E-05 | global batch size: 256 | lm loss: 1.948332E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.357 | TFLOPs: 21.14 | 31: iteration 116600/ 173500 | consumed samples: 29849600 | consumed tokens: 61131980800 | elapsed time per iteration (s): 0.87 | learning rate: 6.450E-05 | global batch size: 256 | lm loss: 1.969535E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.488 | TFLOPs: 17.76 | 31: iteration 116610/ 173500 | consumed samples: 29852160 | consumed tokens: 61137223680 | elapsed time per iteration (s): 0.76 | learning rate: 6.448E-05 | global batch size: 256 | lm loss: 1.940018E+00 | grad norm: 0.170 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.851 | TFLOPs: 20.38 | 31: iteration 116620/ 173500 | consumed samples: 29854720 | consumed tokens: 61142466560 | elapsed time per iteration (s): 0.84 | learning rate: 6.447E-05 | global batch size: 256 | lm loss: 1.954026E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.594 | TFLOPs: 18.37 | 31: iteration 116630/ 173500 | consumed samples: 29857280 | consumed tokens: 61147709440 | elapsed time per iteration (s): 0.84 | learning rate: 6.445E-05 | global batch size: 256 | lm loss: 1.955264E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.556 | TFLOPs: 18.42 | 31: iteration 116640/ 173500 | consumed samples: 29859840 | consumed tokens: 61152952320 | elapsed time per iteration (s): 0.79 | learning rate: 6.444E-05 | global batch size: 256 | lm loss: 1.969138E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.063 | TFLOPs: 19.67 | 31: iteration 116650/ 173500 | consumed samples: 29862400 | consumed tokens: 61158195200 | elapsed time per iteration (s): 0.89 | learning rate: 6.442E-05 | global batch size: 256 | lm loss: 1.955546E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.313 | TFLOPs: 17.38 | 31: iteration 116660/ 173500 | consumed samples: 29864960 | consumed tokens: 61163438080 | elapsed time per iteration (s): 0.82 | learning rate: 6.441E-05 | global batch size: 256 | lm loss: 1.935304E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.333 | TFLOPs: 18.90 | 31: iteration 116670/ 173500 | consumed samples: 29867520 | consumed tokens: 61168680960 | elapsed time per 
iteration (s): 0.79 | learning rate: 6.440E-05 | global batch size: 256 | lm loss: 1.924757E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.460 | TFLOPs: 19.63 | 31: iteration 116680/ 173500 | consumed samples: 29870080 | consumed tokens: 61173923840 | elapsed time per iteration (s): 0.81 | learning rate: 6.438E-05 | global batch size: 256 | lm loss: 1.980556E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.970 | TFLOPs: 19.12 | 31: iteration 116690/ 173500 | consumed samples: 29872640 | consumed tokens: 61179166720 | elapsed time per iteration (s): 0.84 | learning rate: 6.437E-05 | global batch size: 256 | lm loss: 1.968380E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.301 | TFLOPs: 18.53 | 31: iteration 116700/ 173500 | consumed samples: 29875200 | consumed tokens: 61184409600 | elapsed time per iteration (s): 0.78 | learning rate: 6.435E-05 | global batch size: 256 | lm loss: 1.946414E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.393 | TFLOPs: 19.81 | 31: iteration 116710/ 173500 | consumed samples: 29877760 | consumed tokens: 61189652480 | elapsed time per iteration (s): 0.82 | learning rate: 6.434E-05 | global batch size: 256 | lm loss: 1.943077E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.384 | TFLOPs: 18.90 | 31: iteration 116720/ 173500 | consumed samples: 29880320 | consumed tokens: 61194895360 | elapsed time per iteration (s): 0.81 | learning rate: 6.433E-05 | global batch size: 256 | lm loss: 1.973371E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.405 | TFLOPs: 
19.08 | 31: iteration 116730/ 173500 | consumed samples: 29882880 | consumed tokens: 61200138240 | elapsed time per iteration (s): 0.84 | learning rate: 6.431E-05 | global batch size: 256 | lm loss: 1.943691E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.835 | TFLOPs: 18.44 | 31: iteration 116740/ 173500 | consumed samples: 29885440 | consumed tokens: 61205381120 | elapsed time per iteration (s): 0.80 | learning rate: 6.430E-05 | global batch size: 256 | lm loss: 1.967765E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.868 | TFLOPs: 19.29 | 31: iteration 116750/ 173500 | consumed samples: 29888000 | consumed tokens: 61210624000 | elapsed time per iteration (s): 0.87 | learning rate: 6.428E-05 | global batch size: 256 | lm loss: 1.939487E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.920 | TFLOPs: 17.78 | 31: iteration 116760/ 173500 | consumed samples: 29890560 | consumed tokens: 61215866880 | elapsed time per iteration (s): 0.79 | learning rate: 6.427E-05 | global batch size: 256 | lm loss: 1.937551E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.813 | TFLOPs: 19.71 | 31: iteration 116770/ 173500 | consumed samples: 29893120 | consumed tokens: 61221109760 | elapsed time per iteration (s): 0.83 | learning rate: 6.425E-05 | global batch size: 256 | lm loss: 1.986530E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.900 | TFLOPs: 18.75 | 31: iteration 116780/ 173500 | consumed samples: 29895680 | consumed tokens: 61226352640 | elapsed time per iteration (s): 0.79 | learning rate: 6.424E-05 | global batch size: 256 | lm loss: 1.941624E+00 | grad norm: 0.165 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.029 | TFLOPs: 19.60 | 31: iteration 116790/ 173500 | consumed samples: 29898240 | consumed tokens: 61231595520 | elapsed time per iteration (s): 0.78 | learning rate: 6.423E-05 | global batch size: 256 | lm loss: 1.964130E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.123 | TFLOPs: 19.85 | 31: iteration 116800/ 173500 | consumed samples: 29900800 | consumed tokens: 61236838400 | elapsed time per iteration (s): 0.82 | learning rate: 6.421E-05 | global batch size: 256 | lm loss: 1.956118E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.389 | TFLOPs: 18.78 | 31: iteration 116810/ 173500 | consumed samples: 29903360 | consumed tokens: 61242081280 | elapsed time per iteration (s): 0.77 | learning rate: 6.420E-05 | global batch size: 256 | lm loss: 1.966566E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.313 | TFLOPs: 20.16 | 31: iteration 116820/ 173500 | consumed samples: 29905920 | consumed tokens: 61247324160 | elapsed time per iteration (s): 0.90 | learning rate: 6.418E-05 | global batch size: 256 | lm loss: 1.947868E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.088 | TFLOPs: 17.25 | 31: iteration 116830/ 173500 | consumed samples: 29908480 | consumed tokens: 61252567040 | elapsed time per iteration (s): 0.76 | learning rate: 6.417E-05 | global batch size: 256 | lm loss: 1.945366E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.865 | TFLOPs: 20.50 | 31: iteration 116840/ 173500 | consumed samples: 29911040 | consumed tokens: 61257809920 | elapsed time per 
iteration (s): 0.85 | learning rate: 6.415E-05 | global batch size: 256 | lm loss: 1.944180E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.189 | TFLOPs: 18.22 | 31: iteration 116850/ 173500 | consumed samples: 29913600 | consumed tokens: 61263052800 | elapsed time per iteration (s): 0.75 | learning rate: 6.414E-05 | global batch size: 256 | lm loss: 1.991615E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.326 | TFLOPs: 20.71 | 31: iteration 116860/ 173500 | consumed samples: 29916160 | consumed tokens: 61268295680 | elapsed time per iteration (s): 0.76 | learning rate: 6.413E-05 | global batch size: 256 | lm loss: 1.964598E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.358 | TFLOPs: 20.41 | 31: iteration 116870/ 173500 | consumed samples: 29918720 | consumed tokens: 61273538560 | elapsed time per iteration (s): 0.76 | learning rate: 6.411E-05 | global batch size: 256 | lm loss: 1.965993E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.183 | TFLOPs: 20.46 | 31: iteration 116880/ 173500 | consumed samples: 29921280 | consumed tokens: 61278781440 | elapsed time per iteration (s): 0.72 | learning rate: 6.410E-05 | global batch size: 256 | lm loss: 1.949600E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.184 | TFLOPs: 21.43 | 31: iteration 116890/ 173500 | consumed samples: 29923840 | consumed tokens: 61284024320 | elapsed time per iteration (s): 0.85 | learning rate: 6.408E-05 | global batch size: 256 | lm loss: 1.966972E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.430 | TFLOPs: 
18.24 | 31: iteration 116900/ 173500 | consumed samples: 29926400 | consumed tokens: 61289267200 | elapsed time per iteration (s): 0.79 | learning rate: 6.407E-05 | global batch size: 256 | lm loss: 1.960164E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.479 | TFLOPs: 19.69 | 31: iteration 116910/ 173500 | consumed samples: 29928960 | consumed tokens: 61294510080 | elapsed time per iteration (s): 0.75 | learning rate: 6.406E-05 | global batch size: 256 | lm loss: 1.922237E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.451 | TFLOPs: 20.54 | 31: iteration 116920/ 173500 | consumed samples: 29931520 | consumed tokens: 61299752960 | elapsed time per iteration (s): 0.76 | learning rate: 6.404E-05 | global batch size: 256 | lm loss: 1.976291E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.816 | TFLOPs: 20.38 | 31: iteration 116930/ 173500 | consumed samples: 29934080 | consumed tokens: 61304995840 | elapsed time per iteration (s): 0.80 | learning rate: 6.403E-05 | global batch size: 256 | lm loss: 1.947068E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.797 | TFLOPs: 19.29 | 31: iteration 116940/ 173500 | consumed samples: 29936640 | consumed tokens: 61310238720 | elapsed time per iteration (s): 0.76 | learning rate: 6.401E-05 | global batch size: 256 | lm loss: 1.917962E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.179 | TFLOPs: 20.40 | 31: iteration 116950/ 173500 | consumed samples: 29939200 | consumed tokens: 61315481600 | elapsed time per iteration (s): 0.78 | learning rate: 6.400E-05 | global batch size: 256 | lm loss: 1.954662E+00 | grad norm: 0.223 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.654 | TFLOPs: 19.82 | 31: iteration 116960/ 173500 | consumed samples: 29941760 | consumed tokens: 61320724480 | elapsed time per iteration (s): 0.83 | learning rate: 6.399E-05 | global batch size: 256 | lm loss: 1.916275E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.876 | TFLOPs: 18.63 | 31: iteration 116970/ 173500 | consumed samples: 29944320 | consumed tokens: 61325967360 | elapsed time per iteration (s): 0.81 | learning rate: 6.397E-05 | global batch size: 256 | lm loss: 1.947368E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.061 | TFLOPs: 19.18 | 31: iteration 116980/ 173500 | consumed samples: 29946880 | consumed tokens: 61331210240 | elapsed time per iteration (s): 0.74 | learning rate: 6.396E-05 | global batch size: 256 | lm loss: 1.957598E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.600 | TFLOPs: 20.91 | 31: iteration 116990/ 173500 | consumed samples: 29949440 | consumed tokens: 61336453120 | elapsed time per iteration (s): 0.74 | learning rate: 6.394E-05 | global batch size: 256 | lm loss: 1.959413E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.319 | TFLOPs: 20.89 | 31: iteration 117000/ 173500 | consumed samples: 29952000 | consumed tokens: 61341696000 | elapsed time per iteration (s): 0.78 | learning rate: 6.393E-05 | global batch size: 256 | lm loss: 1.975796E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.279 | TFLOPs: 19.92 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at 
iteration 117000 | lm loss value: 1.925241E+00 | lm loss PPL: 6.856798E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 117000 to checkpoints_1b1long 0: [2022-11-26 20:27:09,263] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step117000 is begin to save! 0: [2022-11-26 20:27:09,275] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_01-model_00-model_states.pt... 0: [2022-11-26 20:27:09,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_01-model_00-model_states.pt. 0: [2022-11-26 20:27:09,495] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_03-model_00-model_states.pt... 0: [2022-11-26 20:27:09,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_03-model_00-model_states.pt. 0: [2022-11-26 20:27:09,573] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_04-model_00-model_states.pt... 0: [2022-11-26 20:27:09,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_04-model_00-model_states.pt. 0: [2022-11-26 20:27:09,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_05-model_00-model_states.pt... 0: [2022-11-26 20:27:09,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_05-model_00-model_states.pt. 0: [2022-11-26 20:27:09,735] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_06-model_00-model_states.pt... 0: [2022-11-26 20:27:09,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_06-model_00-model_states.pt. 
0: [2022-11-26 20:27:09,809] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_07-model_00-model_states.pt... 0: [2022-11-26 20:27:09,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_07-model_00-model_states.pt. 0: [2022-11-26 20:27:09,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_08-model_00-model_states.pt... 0: [2022-11-26 20:27:09,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_08-model_00-model_states.pt. 0: [2022-11-26 20:27:09,965] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_09-model_00-model_states.pt... 0: [2022-11-26 20:27:10,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_09-model_00-model_states.pt. 0: [2022-11-26 20:27:10,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_10-model_00-model_states.pt... 0: [2022-11-26 20:27:10,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_10-model_00-model_states.pt. 0: [2022-11-26 20:27:10,127] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_11-model_00-model_states.pt... 0: [2022-11-26 20:27:10,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_11-model_00-model_states.pt. 0: [2022-11-26 20:27:10,205] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_12-model_00-model_states.pt... 0: [2022-11-26 20:27:10,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_12-model_00-model_states.pt. 
0: [2022-11-26 20:27:10,281] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_13-model_00-model_states.pt... 0: [2022-11-26 20:27:10,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_13-model_00-model_states.pt. 0: [2022-11-26 20:27:10,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_14-model_00-model_states.pt... 0: [2022-11-26 20:27:10,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_14-model_00-model_states.pt. 0: [2022-11-26 20:27:10,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_15-model_00-model_states.pt... 0: [2022-11-26 20:27:10,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_15-model_00-model_states.pt. 0: [2022-11-26 20:27:10,515] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_16-model_00-model_states.pt... 0: [2022-11-26 20:27:10,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_16-model_00-model_states.pt. 0: [2022-11-26 20:27:10,590] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_17-model_00-model_states.pt... 0: [2022-11-26 20:27:10,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_17-model_00-model_states.pt. 0: [2022-11-26 20:27:10,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_18-model_00-model_states.pt... 0: [2022-11-26 20:27:10,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_18-model_00-model_states.pt. 
0: [2022-11-26 20:27:10,751] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_19-model_00-model_states.pt... 0: [2022-11-26 20:27:10,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_19-model_00-model_states.pt. 0: [2022-11-26 20:27:10,823] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_20-model_00-model_states.pt... 0: [2022-11-26 20:27:10,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_20-model_00-model_states.pt. 0: [2022-11-26 20:27:10,901] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_21-model_00-model_states.pt... 0: [2022-11-26 20:27:10,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_21-model_00-model_states.pt. 0: [2022-11-26 20:27:10,977] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_22-model_00-model_states.pt... 0: [2022-11-26 20:27:11,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_22-model_00-model_states.pt. 0: [2022-11-26 20:27:11,060] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_23-model_00-model_states.pt... 0: [2022-11-26 20:27:11,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_23-model_00-model_states.pt. 0: [2022-11-26 20:27:11,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_24-model_00-model_states.pt... 0: [2022-11-26 20:27:11,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_24-model_00-model_states.pt. 
0: [2022-11-26 20:27:11,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_25-model_00-model_states.pt... 0: [2022-11-26 20:27:11,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_25-model_00-model_states.pt. 0: [2022-11-26 20:27:11,287] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_26-model_00-model_states.pt... 0: [2022-11-26 20:27:11,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_26-model_00-model_states.pt. 0: [2022-11-26 20:27:11,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_27-model_00-model_states.pt... 0: [2022-11-26 20:27:11,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_27-model_00-model_states.pt. 0: [2022-11-26 20:27:11,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_28-model_00-model_states.pt... 0: [2022-11-26 20:27:11,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_28-model_00-model_states.pt. 0: [2022-11-26 20:27:11,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/layer_30-model_00-model_states.pt... 0: [2022-11-26 20:27:11,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/layer_30-model_00-model_states.pt. 0: [2022-11-26 20:27:11,516] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step117000/mp_rank_00_model_states.pt 0: [2022-11-26 20:27:11,516] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/mp_rank_00_model_states.pt... 
0: [2022-11-26 20:27:11,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/mp_rank_00_model_states.pt. 0: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 
10: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
13: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 
23: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 
24: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 
17: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 
27: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
7: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 
8: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
13: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
15: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 
23: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
31: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 
30: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 
19: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
9: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
16: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 
20: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 
22: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 
19: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 
16: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
31: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
7: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
26: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:27:11,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:27:11,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:27:11,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 20:27:11,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 7: [2022-11-26 20:27:11,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:27:11,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 20:27:11,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 15: [2022-11-26 20:27:11,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-26 20:27:11,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 20:27:11,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 6: [2022-11-26 20:27:11,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:27:11,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 20:27:11,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 24: [2022-11-26 20:27:11,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:27:11,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 27: [2022-11-26 20:27:11,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:27:11,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:27:11,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
23: [2022-11-26 20:27:11,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 27: [2022-11-26 20:27:11,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 11: [2022-11-26 20:27:11,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 27: [2022-11-26 20:27:11,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 23: [2022-11-26 20:27:11,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 11: [2022-11-26 20:27:11,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 20:27:11,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 18: [2022-11-26 20:27:11,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:27:11,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 20:27:11,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 31: [2022-11-26 20:27:11,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 
31: [2022-11-26 20:27:11,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 20:27:11,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 10: [2022-11-26 20:27:11,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:27:11,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 20:27:11,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 17: [2022-11-26 20:27:11,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:27:11,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 20:27:11,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 7: [2022-11-26 20:27:11,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:27:11,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
7: [2022-11-26 20:27:11,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 6: [2022-11-26 20:27:11,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 7: [2022-11-26 20:27:11,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 6: [2022-11-26 20:27:11,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 4: [2022-11-26 20:27:11,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:27:11,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:27:11,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 20:27:11,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 20:27:11,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 4: [2022-11-26 20:27:11,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 13: [2022-11-26 20:27:11,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 20:27:11,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 20:27:11,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 11: [2022-11-26 20:27:11,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:27:11,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:27:11,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:27:11,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 5: [2022-11-26 20:27:11,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:27:11,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 13: [2022-11-26 20:27:11,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 11: [2022-11-26 20:27:11,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
5: [2022-11-26 20:27:11,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 16: [2022-11-26 20:27:11,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 13: [2022-11-26 20:27:11,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 5: [2022-11-26 20:27:11,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 14: [2022-11-26 20:27:11,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:27:11,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 20:27:11,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 24: [2022-11-26 20:27:11,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:27:11,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 20:27:11,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 19: [2022-11-26 20:27:11,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
19: [2022-11-26 20:27:11,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 12: [2022-11-26 20:27:11,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:27:11,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 12: [2022-11-26 20:27:11,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 20:27:11,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 20: [2022-11-26 20:27:11,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:27:11,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 18: [2022-11-26 20:27:11,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:27:11,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 18: [2022-11-26 20:27:11,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 17: [2022-11-26 20:27:11,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:27:11,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
20: [2022-11-26 20:27:11,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:27:11,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:27:11,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 12: [2022-11-26 20:27:11,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:27:11,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:27:11,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 17: [2022-11-26 20:27:11,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 4: [2022-11-26 20:27:11,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 1: [2022-11-26 20:27:11,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 12: [2022-11-26 20:27:11,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
20: [2022-11-26 20:27:11,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 4: [2022-11-26 20:27:11,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 1: [2022-11-26 20:27:11,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 20: [2022-11-26 20:27:11,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 16: [2022-11-26 20:27:11,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:27:11,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 3: [2022-11-26 20:27:11,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:27:11,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 3: [2022-11-26 20:27:11,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 20:27:11,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 31: [2022-11-26 20:27:11,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:27:11,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 20:27:11,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:27:11,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:27:11,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 2: [2022-11-26 20:27:11,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 20:27:11,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 3: [2022-11-26 20:27:11,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 2: [2022-11-26 20:27:11,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 2: [2022-11-26 20:27:11,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 3: [2022-11-26 20:27:11,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 31: [2022-11-26 20:27:11,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 27: [2022-11-26 20:27:11,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
27: [2022-11-26 20:27:11,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 20:27:11,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 26: [2022-11-26 20:27:11,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:27:11,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 11: [2022-11-26 20:27:11,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:27:11,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 11: [2022-11-26 20:27:11,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 20:27:11,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 29: [2022-11-26 20:27:11,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:27:11,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 20:27:11,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 27: [2022-11-26 20:27:11,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
27: [2022-11-26 20:27:11,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 20:27:11,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 9: [2022-11-26 20:27:11,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:27:11,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 20:27:11,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:27:11,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 9: [2022-11-26 20:27:11,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 20:27:11,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 16: [2022-11-26 20:27:11,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:27:11,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
16: [2022-11-26 20:27:11,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 14: [2022-11-26 20:27:11,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 16: [2022-11-26 20:27:11,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 2: [2022-11-26 20:27:11,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:27:11,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 2: [2022-11-26 20:27:11,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 20:27:11,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 19: [2022-11-26 20:27:11,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:27:11,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 20:27:11,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 29: [2022-11-26 20:27:11,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-26 20:27:11,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 20:27:11,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 9: [2022-11-26 20:27:11,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:27:11,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 20:27:11,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 19: [2022-11-26 20:27:11,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:27:11,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:27:11,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 6: [2022-11-26 20:27:11,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 10: [2022-11-26 20:27:11,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:27:11,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 6: [2022-11-26 20:27:11,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
10: [2022-11-26 20:27:11,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 20:27:11,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 26: [2022-11-26 20:27:11,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:27:11,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:27:11,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 20:27:11,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 20:27:11,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 26: [2022-11-26 20:27:11,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 23: [2022-11-26 20:27:11,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:27:11,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 20:27:11,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 1: [2022-11-26 20:27:11,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 20:27:11,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 23: [2022-11-26 20:27:11,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:27:11,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 23: [2022-11-26 20:27:11,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 20:27:11,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 31: [2022-11-26 20:27:11,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:27:11,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 20:27:11,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 25: [2022-11-26 20:27:11,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:27:11,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 20:27:11,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 24: [2022-11-26 20:27:11,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 
24: [2022-11-26 20:27:11,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 20:27:11,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 22: [2022-11-26 20:27:11,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:27:11,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 20:27:11,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 13: [2022-11-26 20:27:11,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:27:11,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:27:11,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 20:27:11,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 13: [2022-11-26 20:27:11,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 20:27:11,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 13: [2022-11-26 20:27:11,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
10: [2022-11-26 20:27:11,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:27:11,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 13: [2022-11-26 20:27:11,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 20:27:11,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 15: [2022-11-26 20:27:11,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:27:11,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 15: [2022-11-26 20:27:11,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 20:27:11,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 25: [2022-11-26 20:27:11,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:27:11,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 20:27:11,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 4: [2022-11-26 20:27:11,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 20:27:11,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 20:27:11,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 27: [2022-11-26 20:27:11,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:27:11,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 27: [2022-11-26 20:27:11,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 11: [2022-11-26 20:27:11,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 27: [2022-11-26 20:27:11,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 11: [2022-11-26 20:27:11,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 6: [2022-11-26 20:27:11,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:27:11,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 20:27:11,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 19: [2022-11-26 20:27:11,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-26 20:27:11,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 20:27:11,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 29: [2022-11-26 20:27:11,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:27:11,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 20: [2022-11-26 20:27:11,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:27:11,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:27:11,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:27:11,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 29: [2022-11-26 20:27:11,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 1: [2022-11-26 20:27:11,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
20: [2022-11-26 20:27:11,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 20:27:11,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 20:27:11,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 20: [2022-11-26 20:27:11,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 14: [2022-11-26 20:27:11,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:27:11,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:27:11,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:27:11,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 20:27:11,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 20:27:11,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 14: [2022-11-26 20:27:11,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 18: [2022-11-26 20:27:11,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 5: [2022-11-26 20:27:11,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 18: [2022-11-26 20:27:11,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 5: [2022-11-26 20:27:11,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 14: [2022-11-26 20:27:11,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 0: [2022-11-26 20:27:11,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:27:11,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:27:11,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 20:27:11,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
15: [2022-11-26 20:27:11,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:27:11,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 20:27:11,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 17: [2022-11-26 20:27:11,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:27:11,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 26: [2022-11-26 20:27:11,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:27:11,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 26: [2022-11-26 20:27:11,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 20:27:11,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 3: [2022-11-26 20:27:11,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:27:11,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
3: [2022-11-26 20:27:11,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 20:27:11,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 22: [2022-11-26 20:27:11,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 20:27:11,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 17: [2022-11-26 20:27:11,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:27:11,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 20:27:11,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 22: [2022-11-26 20:27:11,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:27:11,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 20:27:11,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 23: [2022-11-26 20:27:11,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:27:11,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-26 20:27:11,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:27:11,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:27:11,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 30: [2022-11-26 20:27:11,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 20:27:11,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 7: [2022-11-26 20:27:11,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 12: [2022-11-26 20:27:11,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:27:11,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:27:11,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 30: [2022-11-26 20:27:11,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 30: [2022-11-26 20:27:11,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 7: [2022-11-26 20:27:11,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
12: [2022-11-26 20:27:11,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 20:27:11,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 20:27:11,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 12: [2022-11-26 20:27:11,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 28: [2022-11-26 20:27:11,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:27:11,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:27:11,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 20:27:11,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 20:27:11,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 28: [2022-11-26 20:27:11,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 28: [2022-11-26 20:27:11,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-26 20:27:11,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 31: [2022-11-26 20:27:11,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:27:11,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 31: [2022-11-26 20:27:11,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 20:27:11,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 3: [2022-11-26 20:27:11,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:27:11,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 20:27:11,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 5: [2022-11-26 20:27:11,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:27:11,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 20:27:11,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 2: [2022-11-26 20:27:11,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 20:27:11,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 20:27:11,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 0: [2022-11-26 20:27:11,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:27:11,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 20:27:11,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 25: [2022-11-26 20:27:11,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:27:11,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 20:27:11,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 25: [2022-11-26 20:27:11,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:27:11,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 20:27:11,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 0: [2022-11-26 20:27:11,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-26 20:27:11,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:27:11,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 20:27:11,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 20:27:11,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 0: [2022-11-26 20:27:11,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 18: [2022-11-26 20:27:11,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:27:11,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 20:27:11,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 0: [2022-11-26 20:27:11,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 30: [2022-11-26 20:27:11,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:27:11,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 10: [2022-11-26 20:27:11,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 20:27:11,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 29: [2022-11-26 20:27:11,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:27:11,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 30: [2022-11-26 20:27:11,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 29: [2022-11-26 20:27:11,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 20:27:11,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 30: [2022-11-26 20:27:11,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 7: [2022-11-26 20:27:11,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:27:11,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 20:27:11,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 9: [2022-11-26 20:27:11,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-26 20:27:11,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 20:27:11,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 14: [2022-11-26 20:27:11,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:27:11,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 20:27:11,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 24: [2022-11-26 20:27:11,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:27:11,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 20:27:11,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 19: [2022-11-26 20:27:11,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:27:11,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 20:27:11,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 8: [2022-11-26 20:27:11,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 20:27:11,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:27:11,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 20:27:11,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:27:11,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:27:11,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 20:27:11,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 8: [2022-11-26 20:27:11,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 8: [2022-11-26 20:27:11,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 20:27:11,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 20:27:11,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 8: [2022-11-26 20:27:11,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 9: [2022-11-26 20:27:11,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 20:27:11,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 20:27:11,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 6: [2022-11-26 20:27:11,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:27:11,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 20:27:11,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 13: [2022-11-26 20:27:11,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:27:11,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 20:27:11,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 21: [2022-11-26 20:27:11,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:27:11,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:27:11,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
21: [2022-11-26 20:27:11,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 20:27:11,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 20:27:11,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 20:27:11,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 21: [2022-11-26 20:27:11,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 21: [2022-11-26 20:27:11,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 21: [2022-11-26 20:27:11,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:27:11,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 20:27:11,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 23: [2022-11-26 20:27:11,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:27:11,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 20:27:11,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
20: [2022-11-26 20:27:11,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:27:11,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 20:27:11,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 1: [2022-11-26 20:27:11,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:27:11,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 20:27:11,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 4: [2022-11-26 20:27:11,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:27:11,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 20:27:11,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 25: [2022-11-26 20:27:11,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:27:11,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 20:27:11,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
21: [2022-11-26 20:27:11,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:27:11,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 20:27:11,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 27: [2022-11-26 20:27:11,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 20:27:11,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 20:27:11,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 16: [2022-11-26 20:27:11,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:27:11,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 20:27:11,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 0: [2022-11-26 20:27:11,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:27:11,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 20:27:11,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
15: [2022-11-26 20:27:11,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:27:11,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 20:27:11,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 2: [2022-11-26 20:27:11,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:27:11,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 20:27:11,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 31: [2022-11-26 20:27:11,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:27:11,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 20:27:11,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 17: [2022-11-26 20:27:11,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:27:11,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 20:27:11,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
7: [2022-11-26 20:27:11,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:27:11,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 20:27:11,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 12: [2022-11-26 20:27:11,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:27:11,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 20:27:11,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 8: [2022-11-26 20:27:11,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:27:11,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 5: [2022-11-26 20:27:11,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:27:11,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 5: [2022-11-26 20:27:11,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 20:27:11,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
30: [2022-11-26 20:27:11,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:27:11,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 20:27:11,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 28: [2022-11-26 20:27:11,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:27:11,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 20:27:11,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 3: [2022-11-26 20:27:11,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:27:11,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 20:27:11,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 22: [2022-11-26 20:27:11,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
22: [2022-11-26 20:27:11,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 29: [2022-11-26 20:27:11,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:27:11,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 29: [2022-11-26 20:27:11,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 20:27:11,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 24: [2022-11-26 20:27:11,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:27:11,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 20:27:11,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 26: [2022-11-26 20:27:11,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:27:11,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 20:27:11,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 18: [2022-11-26 20:27:11,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
18: [2022-11-26 20:27:11,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 20:27:11,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 6: [2022-11-26 20:27:11,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:27:11,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 20:27:11,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 10: [2022-11-26 20:27:11,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:27:11,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 20:27:11,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 13: [2022-11-26 20:27:11,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:27:11,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 20:27:11,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 9: [2022-11-26 20:27:11,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 20:27:11,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 20:27:11,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 14: [2022-11-26 20:27:11,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:27:11,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 20:27:11,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 1: [2022-11-26 20:27:11,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:27:11,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 20:27:11,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 4: [2022-11-26 20:27:11,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:27:11,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 21: [2022-11-26 20:27:11,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-26 20:27:11,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 4: [2022-11-26 20:27:11,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 21: [2022-11-26 20:27:11,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 20: [2022-11-26 20:27:11,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:27:11,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 20:27:11,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 16: [2022-11-26 20:27:11,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:27:11,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 20:27:11,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 19: [2022-11-26 20:27:11,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:27:11,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 20:27:11,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
25: [2022-11-26 20:27:11,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:27:11,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 20:27:11,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 23: [2022-11-26 20:27:11,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:27:11,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 20:27:11,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 27: [2022-11-26 20:27:11,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 20:27:11,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 20:27:11,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 17: [2022-11-26 20:27:11,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:27:11,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
17: [2022-11-26 20:27:11,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 2: [2022-11-26 20:27:11,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 17: [2022-11-26 20:27:11,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 2: [2022-11-26 20:27:11,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 0: [2022-11-26 20:27:11,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:27:11,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 20:27:11,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 15: [2022-11-26 20:27:11,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:27:11,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 20:27:11,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 8: [2022-11-26 20:27:11,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-26 20:27:11,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 20:27:11,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 22: [2022-11-26 20:27:11,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:27:11,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 20:27:11,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 12: [2022-11-26 20:27:11,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:27:11,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 20:27:11,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 5: [2022-11-26 20:27:11,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:27:11,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 20:27:11,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 29: [2022-11-26 20:27:11,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
29: [2022-11-26 20:27:11,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 20:27:11,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 30: [2022-11-26 20:27:11,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:27:11,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:27:11,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 20:27:11,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 11: [2022-11-26 20:27:11,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 20:27:11,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 31: [2022-11-26 20:27:11,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:27:11,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 20:27:11,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 24: [2022-11-26 20:27:11,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-26 20:27:11,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 20:27:11,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 28: [2022-11-26 20:27:11,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:27:11,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:27:11,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 20:27:11,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 7: [2022-11-26 20:27:11,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:27:11,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:27:11,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 20:27:11,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 10: [2022-11-26 20:27:11,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 20:27:11,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
26: [2022-11-26 20:27:11,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:27:11,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 20:27:11,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 28: [2022-11-26 20:27:11,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 20:27:11,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 14: [2022-11-26 20:27:11,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:27:11,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 20:27:11,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 6: [2022-11-26 20:27:11,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:27:11,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
6: [2022-11-26 20:27:11,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 18: [2022-11-26 20:27:11,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 6: [2022-11-26 20:27:11,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 18: [2022-11-26 20:27:11,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 9: [2022-11-26 20:27:11,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:27:11,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 20:27:11,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 11: [2022-11-26 20:27:11,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:27:11,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 20:27:11,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 20: [2022-11-26 20:27:11,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-26 20:27:11,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 20:27:11,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 19: [2022-11-26 20:27:11,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:27:11,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 20:27:11,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 23: [2022-11-26 20:27:11,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:27:11,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 20:27:11,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 13: [2022-11-26 20:27:11,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:27:11,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 20:27:11,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 16: [2022-11-26 20:27:11,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
16: [2022-11-26 20:27:11,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 20:27:11,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 1: [2022-11-26 20:27:11,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:27:11,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:27:11,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 1: [2022-11-26 20:27:11,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 21: [2022-11-26 20:27:11,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 1: [2022-11-26 20:27:11,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 25: [2022-11-26 20:27:11,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:27:11,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 20:27:11,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 4: [2022-11-26 20:27:11,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 20:27:11,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 20:27:11,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 15: [2022-11-26 20:27:11,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:27:11,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 20:27:11,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 0: [2022-11-26 20:27:11,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:27:11,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 20:27:11,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 27: [2022-11-26 20:27:11,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 20:27:11,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 20:27:11,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 2: [2022-11-26 20:27:11,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 20:27:11,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 20:27:11,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 17: [2022-11-26 20:27:11,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:27:11,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:27:11,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 20:27:11,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 17: [2022-11-26 20:27:11,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 3: [2022-11-26 20:27:11,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:27:11,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 3: [2022-11-26 20:27:11,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 20:27:11,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 12: [2022-11-26 20:27:11,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 20:27:11,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 20:27:11,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 22: [2022-11-26 20:27:11,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:27:11,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 20:27:11,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 10: [2022-11-26 20:27:11,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:27:11,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 20:27:11,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 7: [2022-11-26 20:27:11,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:27:11,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 20:27:11,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 11: [2022-11-26 20:27:11,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 20:27:11,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 20:27:11,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 19: [2022-11-26 20:27:11,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:27:11,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:27:11,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 31: [2022-11-26 20:27:11,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 19: [2022-11-26 20:27:11,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 31: [2022-11-26 20:27:11,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 24: [2022-11-26 20:27:11,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:27:11,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 20:27:11,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 30: [2022-11-26 20:27:11,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
30: [2022-11-26 20:27:11,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 20:27:11,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 4: [2022-11-26 20:27:11,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:27:11,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 20:27:11,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 5: [2022-11-26 20:27:11,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:27:11,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 20:27:11,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 26: [2022-11-26 20:27:11,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:27:11,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 20:27:11,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 0: [2022-11-26 20:27:11,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 20:27:11,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 20:27:11,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 29: [2022-11-26 20:27:11,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:27:11,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 20:27:11,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 13: [2022-11-26 20:27:11,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:27:11,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 20:27:11,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 28: [2022-11-26 20:27:11,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:27:11,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:27:11,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
28: [2022-11-26 20:27:11,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 17: [2022-11-26 20:27:11,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 28: [2022-11-26 20:27:11,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 18: [2022-11-26 20:27:11,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 17: [2022-11-26 20:27:11,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 18: [2022-11-26 20:27:11,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 6: [2022-11-26 20:27:11,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:27:11,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:27:11,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 20:27:11,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 23: [2022-11-26 20:27:11,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 20:27:11,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
25: [2022-11-26 20:27:11,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:27:11,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 20:27:11,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 16: [2022-11-26 20:27:11,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:27:11,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:27:11,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 1: [2022-11-26 20:27:11,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 16: [2022-11-26 20:27:11,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 1: [2022-11-26 20:27:11,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 27: [2022-11-26 20:27:11,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:27:11,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
27: [2022-11-26 20:27:11,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 11: [2022-11-26 20:27:11,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 27: [2022-11-26 20:27:11,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 9: [2022-11-26 20:27:11,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 2: [2022-11-26 20:27:11,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:27:11,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 11: [2022-11-26 20:27:11,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 2: [2022-11-26 20:27:11,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 11: [2022-11-26 20:27:11,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 14: [2022-11-26 20:27:11,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:27:11,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
14: [2022-11-26 20:27:11,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 20:27:11,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 21: [2022-11-26 20:27:11,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:27:11,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 20:27:11,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 15: [2022-11-26 20:27:11,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:27:11,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 20:27:11,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 8: [2022-11-26 20:27:11,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:27:11,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 20:27:11,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 20: [2022-11-26 20:27:11,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
20: [2022-11-26 20:27:11,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 22: [2022-11-26 20:27:11,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:27:11,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:27:11,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 22: [2022-11-26 20:27:11,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 12: [2022-11-26 20:27:11,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 22: [2022-11-26 20:27:11,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 12: [2022-11-26 20:27:11,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 7: [2022-11-26 20:27:11,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:27:11,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 20:27:11,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 31: [2022-11-26 20:27:11,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
10: [2022-11-26 20:27:11,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:27:11,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 20:27:11,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 10: [2022-11-26 20:27:11,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 20:27:11,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 5: [2022-11-26 20:27:11,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:27:11,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 20:27:11,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 18: [2022-11-26 20:27:11,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:27:11,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 20:27:11,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 30: [2022-11-26 20:27:11,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
24: [2022-11-26 20:27:11,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:27:11,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 20:27:11,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 24: [2022-11-26 20:27:11,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 20:27:11,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 14: [2022-11-26 20:27:11,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:27:11,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 20:27:11,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 26: [2022-11-26 20:27:11,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:27:11,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 20:27:11,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 22: [2022-11-26 20:27:11,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
22: [2022-11-26 20:27:11,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 20:27:11,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 28: [2022-11-26 20:27:11,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:27:11,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 20:27:11,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 28: [2022-11-26 20:27:11,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:27:11,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:27:11,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 20:27:11,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 3: [2022-11-26 20:27:11,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 20:27:11,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 29: [2022-11-26 20:27:11,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
29: [2022-11-26 20:27:11,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 20:27:11,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 30: [2022-11-26 20:27:11,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:27:11,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step117000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 20:27:11,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 0: successfully saved checkpoint at iteration 117000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2612.62 31: iteration 117010/ 173500 | consumed samples: 29954560 | consumed tokens: 61346938880 | elapsed time per iteration (s): 1.08 | learning rate: 6.391E-05 | global batch size: 256 | lm loss: 1.941738E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.148 | TFLOPs: 14.35 | 31: iteration 117020/ 173500 | consumed samples: 29957120 | consumed tokens: 61352181760 | elapsed time per iteration (s): 0.74 | learning rate: 6.390E-05 | global batch size: 256 | lm loss: 1.973722E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.776 | TFLOPs: 20.86 | 31: iteration 117030/ 173500 | consumed samples: 29959680 | consumed tokens: 61357424640 | elapsed time per iteration (s): 0.80 | learning rate: 6.389E-05 | global batch size: 256 | lm loss: 1.977558E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.097 | TFLOPs: 19.37 | 31: iteration 
117040/ 173500 | consumed samples: 29962240 | consumed tokens: 61362667520 | elapsed time per iteration (s): 0.81 | learning rate: 6.387E-05 | global batch size: 256 | lm loss: 1.952529E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.087 | TFLOPs: 19.12 | 31: iteration 117050/ 173500 | consumed samples: 29964800 | consumed tokens: 61367910400 | elapsed time per iteration (s): 0.88 | learning rate: 6.386E-05 | global batch size: 256 | lm loss: 1.920466E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.683 | TFLOPs: 17.59 | 31: iteration 117060/ 173500 | consumed samples: 29967360 | consumed tokens: 61373153280 | elapsed time per iteration (s): 0.71 | learning rate: 6.384E-05 | global batch size: 256 | lm loss: 1.943225E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 359.772 | TFLOPs: 21.77 | 31: iteration 117070/ 173500 | consumed samples: 29969920 | consumed tokens: 61378396160 | elapsed time per iteration (s): 0.76 | learning rate: 6.383E-05 | global batch size: 256 | lm loss: 1.956641E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.987 | TFLOPs: 20.33 | 31: iteration 117080/ 173500 | consumed samples: 29972480 | consumed tokens: 61383639040 | elapsed time per iteration (s): 0.78 | learning rate: 6.382E-05 | global batch size: 256 | lm loss: 1.931933E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.524 | TFLOPs: 19.87 | 31: iteration 117090/ 173500 | consumed samples: 29975040 | consumed tokens: 61388881920 | elapsed time per iteration (s): 0.79 | learning rate: 6.380E-05 | global batch size: 256 | lm loss: 1.944272E+00 | grad norm: 0.159 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.626 | TFLOPs: 19.52 | 31: iteration 117100/ 173500 | consumed samples: 29977600 | consumed tokens: 61394124800 | elapsed time per iteration (s): 0.76 | learning rate: 6.379E-05 | global batch size: 256 | lm loss: 1.960589E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.718 | TFLOPs: 20.43 | 31: iteration 117110/ 173500 | consumed samples: 29980160 | consumed tokens: 61399367680 | elapsed time per iteration (s): 0.76 | learning rate: 6.377E-05 | global batch size: 256 | lm loss: 1.954878E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.243 | TFLOPs: 20.34 | 31: iteration 117120/ 173500 | consumed samples: 29982720 | consumed tokens: 61404610560 | elapsed time per iteration (s): 0.80 | learning rate: 6.376E-05 | global batch size: 256 | lm loss: 1.956585E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.349 | TFLOPs: 19.38 | 31: iteration 117130/ 173500 | consumed samples: 29985280 | consumed tokens: 61409853440 | elapsed time per iteration (s): 0.79 | learning rate: 6.374E-05 | global batch size: 256 | lm loss: 1.923637E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.988 | TFLOPs: 19.54 | 31: iteration 117140/ 173500 | consumed samples: 29987840 | consumed tokens: 61415096320 | elapsed time per iteration (s): 0.79 | learning rate: 6.373E-05 | global batch size: 256 | lm loss: 1.954142E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.515 | TFLOPs: 19.63 | 31: iteration 117150/ 173500 | consumed samples: 29990400 | consumed tokens: 61420339200 | elapsed time per iteration (s): 0.82 | learning 
rate: 6.372E-05 | global batch size: 256 | lm loss: 1.947966E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.999 | TFLOPs: 18.81 | 31: iteration 117160/ 173500 | consumed samples: 29992960 | consumed tokens: 61425582080 | elapsed time per iteration (s): 0.85 | learning rate: 6.370E-05 | global batch size: 256 | lm loss: 1.956026E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.963 | TFLOPs: 18.15 | 31: iteration 117170/ 173500 | consumed samples: 29995520 | consumed tokens: 61430824960 | elapsed time per iteration (s): 0.81 | learning rate: 6.369E-05 | global batch size: 256 | lm loss: 1.960105E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.989 | TFLOPs: 19.18 | 31: iteration 117180/ 173500 | consumed samples: 29998080 | consumed tokens: 61436067840 | elapsed time per iteration (s): 0.78 | learning rate: 6.367E-05 | global batch size: 256 | lm loss: 1.974471E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.168 | TFLOPs: 19.97 | 31: iteration 117190/ 173500 | consumed samples: 30000640 | consumed tokens: 61441310720 | elapsed time per iteration (s): 0.81 | learning rate: 6.366E-05 | global batch size: 256 | lm loss: 1.947681E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.656 | TFLOPs: 19.10 | 31: iteration 117200/ 173500 | consumed samples: 30003200 | consumed tokens: 61446553600 | elapsed time per iteration (s): 0.80 | learning rate: 6.365E-05 | global batch size: 256 | lm loss: 1.965270E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.563 | TFLOPs: 19.27 | 31: iteration 117210/ 
173500 | consumed samples: 30005760 | consumed tokens: 61451796480 | elapsed time per iteration (s): 0.83 | learning rate: 6.363E-05 | global batch size: 256 | lm loss: 1.951281E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.582 | TFLOPs: 18.73 | 31: iteration 117220/ 173500 | consumed samples: 30008320 | consumed tokens: 61457039360 | elapsed time per iteration (s): 0.86 | learning rate: 6.362E-05 | global batch size: 256 | lm loss: 1.936009E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.329 | TFLOPs: 18.11 | 31: iteration 117230/ 173500 | consumed samples: 30010880 | consumed tokens: 61462282240 | elapsed time per iteration (s): 0.88 | learning rate: 6.360E-05 | global batch size: 256 | lm loss: 1.937402E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.207 | TFLOPs: 17.56 | 31: iteration 117240/ 173500 | consumed samples: 30013440 | consumed tokens: 61467525120 | elapsed time per iteration (s): 1.08 | learning rate: 6.359E-05 | global batch size: 256 | lm loss: 1.969729E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.728 | TFLOPs: 14.32 | 31: iteration 117250/ 173500 | consumed samples: 30016000 | consumed tokens: 61472768000 | elapsed time per iteration (s): 0.87 | learning rate: 6.358E-05 | global batch size: 256 | lm loss: 1.957925E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.521 | TFLOPs: 17.88 | 31: iteration 117260/ 173500 | consumed samples: 30018560 | consumed tokens: 61478010880 | elapsed time per iteration (s): 0.88 | learning rate: 6.356E-05 | global batch size: 256 | lm loss: 1.947015E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 292.450 | TFLOPs: 17.69 | 31: iteration 117270/ 173500 | consumed samples: 30021120 | consumed tokens: 61483253760 | elapsed time per iteration (s): 0.88 | learning rate: 6.355E-05 | global batch size: 256 | lm loss: 1.933673E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.783 | TFLOPs: 17.53 | 31: iteration 117280/ 173500 | consumed samples: 30023680 | consumed tokens: 61488496640 | elapsed time per iteration (s): 0.84 | learning rate: 6.353E-05 | global batch size: 256 | lm loss: 1.986147E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.634 | TFLOPs: 18.37 | 31: iteration 117290/ 173500 | consumed samples: 30026240 | consumed tokens: 61493739520 | elapsed time per iteration (s): 0.85 | learning rate: 6.352E-05 | global batch size: 256 | lm loss: 1.959373E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.314 | TFLOPs: 18.23 | 31: iteration 117300/ 173500 | consumed samples: 30028800 | consumed tokens: 61498982400 | elapsed time per iteration (s): 0.89 | learning rate: 6.351E-05 | global batch size: 256 | lm loss: 1.952276E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.265 | TFLOPs: 17.32 | 31: iteration 117310/ 173500 | consumed samples: 30031360 | consumed tokens: 61504225280 | elapsed time per iteration (s): 0.85 | learning rate: 6.349E-05 | global batch size: 256 | lm loss: 1.957880E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.198 | TFLOPs: 18.16 | 31: iteration 117320/ 173500 | consumed samples: 30033920 | consumed tokens: 61509468160 | elapsed time per iteration (s): 0.88 | learning rate: 
6.348E-05 | global batch size: 256 | lm loss: 1.980734E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.438 | TFLOPs: 17.57 | 31: iteration 117330/ 173500 | consumed samples: 30036480 | consumed tokens: 61514711040 | elapsed time per iteration (s): 0.84 | learning rate: 6.346E-05 | global batch size: 256 | lm loss: 1.940194E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.148 | TFLOPs: 18.34 | 31: iteration 117340/ 173500 | consumed samples: 30039040 | consumed tokens: 61519953920 | elapsed time per iteration (s): 0.79 | learning rate: 6.345E-05 | global batch size: 256 | lm loss: 1.954349E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.889 | TFLOPs: 19.53 | 31: iteration 117350/ 173500 | consumed samples: 30041600 | consumed tokens: 61525196800 | elapsed time per iteration (s): 0.77 | learning rate: 6.343E-05 | global batch size: 256 | lm loss: 1.976014E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.827 | TFLOPs: 20.01 | 31: iteration 117360/ 173500 | consumed samples: 30044160 | consumed tokens: 61530439680 | elapsed time per iteration (s): 0.85 | learning rate: 6.342E-05 | global batch size: 256 | lm loss: 1.953063E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.033 | TFLOPs: 18.21 | 31: iteration 117370/ 173500 | consumed samples: 30046720 | consumed tokens: 61535682560 | elapsed time per iteration (s): 0.74 | learning rate: 6.341E-05 | global batch size: 256 | lm loss: 1.960157E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.878 | TFLOPs: 20.86 | 31: iteration 117380/ 173500 | 
consumed samples: 30049280 | consumed tokens: 61540925440 | elapsed time per iteration (s): 0.76 | learning rate: 6.339E-05 | global batch size: 256 | lm loss: 1.961709E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.095 | TFLOPs: 20.39 | 31: iteration 117390/ 173500 | consumed samples: 30051840 | consumed tokens: 61546168320 | elapsed time per iteration (s): 0.75 | learning rate: 6.338E-05 | global batch size: 256 | lm loss: 1.948039E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.368 | TFLOPs: 20.53 | 31: iteration 117400/ 173500 | consumed samples: 30054400 | consumed tokens: 61551411200 | elapsed time per iteration (s): 0.80 | learning rate: 6.336E-05 | global batch size: 256 | lm loss: 1.966749E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.615 | TFLOPs: 19.40 | 31: iteration 117410/ 173500 | consumed samples: 30056960 | consumed tokens: 61556654080 | elapsed time per iteration (s): 0.77 | learning rate: 6.335E-05 | global batch size: 256 | lm loss: 1.940897E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.918 | TFLOPs: 20.14 | 31: iteration 117420/ 173500 | consumed samples: 30059520 | consumed tokens: 61561896960 | elapsed time per iteration (s): 0.75 | learning rate: 6.334E-05 | global batch size: 256 | lm loss: 1.956083E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.274 | TFLOPs: 20.53 | 31: iteration 117430/ 173500 | consumed samples: 30062080 | consumed tokens: 61567139840 | elapsed time per iteration (s): 0.80 | learning rate: 6.332E-05 | global batch size: 256 | lm loss: 1.928421E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 321.292 | TFLOPs: 19.44 | 31: iteration 117440/ 173500 | consumed samples: 30064640 | consumed tokens: 61572382720 | elapsed time per iteration (s): 0.77 | learning rate: 6.331E-05 | global batch size: 256 | lm loss: 1.955640E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.146 | TFLOPs: 20.03 | 31: iteration 117450/ 173500 | consumed samples: 30067200 | consumed tokens: 61577625600 | elapsed time per iteration (s): 0.81 | learning rate: 6.329E-05 | global batch size: 256 | lm loss: 1.944963E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.007 | TFLOPs: 19.18 | 31: iteration 117460/ 173500 | consumed samples: 30069760 | consumed tokens: 61582868480 | elapsed time per iteration (s): 0.78 | learning rate: 6.328E-05 | global batch size: 256 | lm loss: 1.971091E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.393 | TFLOPs: 19.87 | 31: iteration 117470/ 173500 | consumed samples: 30072320 | consumed tokens: 61588111360 | elapsed time per iteration (s): 0.77 | learning rate: 6.327E-05 | global batch size: 256 | lm loss: 1.953759E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.446 | TFLOPs: 19.99 | 31: iteration 117480/ 173500 | consumed samples: 30074880 | consumed tokens: 61593354240 | elapsed time per iteration (s): 0.76 | learning rate: 6.325E-05 | global batch size: 256 | lm loss: 1.931560E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.887 | TFLOPs: 20.38 | 31: iteration 117490/ 173500 | consumed samples: 30077440 | consumed tokens: 61598597120 | elapsed time per iteration (s): 0.80 | learning rate: 
6.324E-05 | global batch size: 256 | lm loss: 1.946727E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.315 | TFLOPs: 19.32 | 31: iteration 117500/ 173500 | consumed samples: 30080000 | consumed tokens: 61603840000 | elapsed time per iteration (s): 0.77 | learning rate: 6.322E-05 | global batch size: 256 | lm loss: 1.947870E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.287 | TFLOPs: 20.04 | 31: iteration 117510/ 173500 | consumed samples: 30082560 | consumed tokens: 61609082880 | elapsed time per iteration (s): 0.74 | learning rate: 6.321E-05 | global batch size: 256 | lm loss: 1.973803E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.472 | TFLOPs: 21.02 | 31: iteration 117520/ 173500 | consumed samples: 30085120 | consumed tokens: 61614325760 | elapsed time per iteration (s): 0.76 | learning rate: 6.320E-05 | global batch size: 256 | lm loss: 1.955472E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.559 | TFLOPs: 20.36 | 31: iteration 117530/ 173500 | consumed samples: 30087680 | consumed tokens: 61619568640 | elapsed time per iteration (s): 0.78 | learning rate: 6.318E-05 | global batch size: 256 | lm loss: 1.935769E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.401 | TFLOPs: 19.93 | 31: iteration 117540/ 173500 | consumed samples: 30090240 | consumed tokens: 61624811520 | elapsed time per iteration (s): 0.73 | learning rate: 6.317E-05 | global batch size: 256 | lm loss: 1.953593E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.297 | TFLOPs: 21.19 | 31: iteration 117550/ 173500 | 
consumed samples: 30092800 | consumed tokens: 61630054400 | elapsed time per iteration (s): 0.77 | learning rate: 6.315E-05 | global batch size: 256 | lm loss: 1.975915E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.303 | TFLOPs: 20.22 | 31: iteration 117560/ 173500 | consumed samples: 30095360 | consumed tokens: 61635297280 | elapsed time per iteration (s): 0.75 | learning rate: 6.314E-05 | global batch size: 256 | lm loss: 1.965108E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.685 | TFLOPs: 20.67 | 31: iteration 117570/ 173500 | consumed samples: 30097920 | consumed tokens: 61640540160 | elapsed time per iteration (s): 0.79 | learning rate: 6.313E-05 | global batch size: 256 | lm loss: 1.978509E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.292 | TFLOPs: 19.68 | 31: iteration 117580/ 173500 | consumed samples: 30100480 | consumed tokens: 61645783040 | elapsed time per iteration (s): 0.79 | learning rate: 6.311E-05 | global batch size: 256 | lm loss: 1.969608E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.838 | TFLOPs: 19.65 | 31: iteration 117590/ 173500 | consumed samples: 30103040 | consumed tokens: 61651025920 | elapsed time per iteration (s): 0.79 | learning rate: 6.310E-05 | global batch size: 256 | lm loss: 1.938399E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.812 | TFLOPs: 19.65 | 31: iteration 117600/ 173500 | consumed samples: 30105600 | consumed tokens: 61656268800 | elapsed time per iteration (s): 0.74 | learning rate: 6.308E-05 | global batch size: 256 | lm loss: 1.947351E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 345.829 | TFLOPs: 20.92 | 31: iteration 117610/ 173500 | consumed samples: 30108160 | consumed tokens: 61661511680 | elapsed time per iteration (s): 0.76 | learning rate: 6.307E-05 | global batch size: 256 | lm loss: 1.944695E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.428 | TFLOPs: 20.47 | 31: iteration 117620/ 173500 | consumed samples: 30110720 | consumed tokens: 61666754560 | elapsed time per iteration (s): 0.77 | learning rate: 6.305E-05 | global batch size: 256 | lm loss: 1.955047E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.763 | TFLOPs: 20.07 | 31: iteration 117630/ 173500 | consumed samples: 30113280 | consumed tokens: 61671997440 | elapsed time per iteration (s): 0.82 | learning rate: 6.304E-05 | global batch size: 256 | lm loss: 1.970815E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.380 | TFLOPs: 18.96 | 31: iteration 117640/ 173500 | consumed samples: 30115840 | consumed tokens: 61677240320 | elapsed time per iteration (s): 0.79 | learning rate: 6.303E-05 | global batch size: 256 | lm loss: 1.981249E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.182 | TFLOPs: 19.55 | 31: iteration 117650/ 173500 | consumed samples: 30118400 | consumed tokens: 61682483200 | elapsed time per iteration (s): 0.75 | learning rate: 6.301E-05 | global batch size: 256 | lm loss: 1.910711E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.117 | TFLOPs: 20.58 | 31: iteration 117660/ 173500 | consumed samples: 30120960 | consumed tokens: 61687726080 | elapsed time per iteration (s): 0.73 | learning rate: 
6.300E-05 | global batch size: 256 | lm loss: 1.974576E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.381 | TFLOPs: 21.14 | 31: iteration 117670/ 173500 | consumed samples: 30123520 | consumed tokens: 61692968960 | elapsed time per iteration (s): 0.79 | learning rate: 6.298E-05 | global batch size: 256 | lm loss: 1.989214E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.888 | TFLOPs: 19.53 | 31: iteration 117680/ 173500 | consumed samples: 30126080 | consumed tokens: 61698211840 | elapsed time per iteration (s): 0.80 | learning rate: 6.297E-05 | global batch size: 256 | lm loss: 1.978922E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.734 | TFLOPs: 19.28 | 31: iteration 117690/ 173500 | consumed samples: 30128640 | consumed tokens: 61703454720 | elapsed time per iteration (s): 0.74 | learning rate: 6.296E-05 | global batch size: 256 | lm loss: 1.932002E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.522 | TFLOPs: 21.02 | 31: iteration 117700/ 173500 | consumed samples: 30131200 | consumed tokens: 61708697600 | elapsed time per iteration (s): 0.77 | learning rate: 6.294E-05 | global batch size: 256 | lm loss: 1.958405E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.744 | TFLOPs: 20.07 | 31: iteration 117710/ 173500 | consumed samples: 30133760 | consumed tokens: 61713940480 | elapsed time per iteration (s): 0.79 | learning rate: 6.293E-05 | global batch size: 256 | lm loss: 1.949916E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.570 | TFLOPs: 19.70 | 31: iteration 117720/ 173500 | 
consumed samples: 30136320 | consumed tokens: 61719183360 | elapsed time per iteration (s): 0.83 | learning rate: 6.291E-05 | global batch size: 256 | lm loss: 1.976301E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.743 | TFLOPs: 18.68 | 31: iteration 117730/ 173500 | consumed samples: 30138880 | consumed tokens: 61724426240 | elapsed time per iteration (s): 0.83 | learning rate: 6.290E-05 | global batch size: 256 | lm loss: 1.963202E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.805 | TFLOPs: 18.68 | 31: iteration 117740/ 173500 | consumed samples: 30141440 | consumed tokens: 61729669120 | elapsed time per iteration (s): 0.79 | learning rate: 6.289E-05 | global batch size: 256 | lm loss: 1.936646E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.705 | TFLOPs: 19.58 | 31: iteration 117750/ 173500 | consumed samples: 30144000 | consumed tokens: 61734912000 | elapsed time per iteration (s): 0.86 | learning rate: 6.287E-05 | global batch size: 256 | lm loss: 1.948260E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.743 | TFLOPs: 18.01 | 31: iteration 117760/ 173500 | consumed samples: 30146560 | consumed tokens: 61740154880 | elapsed time per iteration (s): 0.84 | learning rate: 6.286E-05 | global batch size: 256 | lm loss: 1.947972E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.217 | TFLOPs: 18.40 | 31: iteration 117770/ 173500 | consumed samples: 30149120 | consumed tokens: 61745397760 | elapsed time per iteration (s): 0.85 | learning rate: 6.284E-05 | global batch size: 256 | lm loss: 1.960473E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 302.338 | TFLOPs: 18.29 | 31: iteration 117780/ 173500 | consumed samples: 30151680 | consumed tokens: 61750640640 | elapsed time per iteration (s): 0.79 | learning rate: 6.283E-05 | global batch size: 256 | lm loss: 1.945362E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.967 | TFLOPs: 19.60 | 31: iteration 117790/ 173500 | consumed samples: 30154240 | consumed tokens: 61755883520 | elapsed time per iteration (s): 0.84 | learning rate: 6.282E-05 | global batch size: 256 | lm loss: 1.938874E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.777 | TFLOPs: 18.44 | 31: iteration 117800/ 173500 | consumed samples: 30156800 | consumed tokens: 61761126400 | elapsed time per iteration (s): 0.80 | learning rate: 6.280E-05 | global batch size: 256 | lm loss: 1.963532E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.029 | TFLOPs: 19.42 | 31: iteration 117810/ 173500 | consumed samples: 30159360 | consumed tokens: 61766369280 | elapsed time per iteration (s): 0.80 | learning rate: 6.279E-05 | global batch size: 256 | lm loss: 1.937650E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.099 | TFLOPs: 19.24 | 31: iteration 117820/ 173500 | consumed samples: 30161920 | consumed tokens: 61771612160 | elapsed time per iteration (s): 0.80 | learning rate: 6.277E-05 | global batch size: 256 | lm loss: 1.946927E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.290 | TFLOPs: 19.38 | 31: iteration 117830/ 173500 | consumed samples: 30164480 | consumed tokens: 61776855040 | elapsed time per iteration (s): 0.81 | learning rate: 
6.276E-05 | global batch size: 256 | lm loss: 1.946041E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.566 | TFLOPs: 19.15 | 31: iteration 117840/ 173500 | consumed samples: 30167040 | consumed tokens: 61782097920 | elapsed time per iteration (s): 0.79 | learning rate: 6.275E-05 | global batch size: 256 | lm loss: 1.961887E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.971 | TFLOPs: 19.66 | 31: iteration 117850/ 173500 | consumed samples: 30169600 | consumed tokens: 61787340800 | elapsed time per iteration (s): 0.82 | learning rate: 6.273E-05 | global batch size: 256 | lm loss: 1.920326E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.631 | TFLOPs: 18.91 | 31: iteration 117860/ 173500 | consumed samples: 30172160 | consumed tokens: 61792583680 | elapsed time per iteration (s): 0.81 | learning rate: 6.272E-05 | global batch size: 256 | lm loss: 1.938599E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.228 | TFLOPs: 19.13 | 31: iteration 117870/ 173500 | consumed samples: 30174720 | consumed tokens: 61797826560 | elapsed time per iteration (s): 0.91 | learning rate: 6.270E-05 | global batch size: 256 | lm loss: 1.934422E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.761 | TFLOPs: 17.11 | 31: iteration 117880/ 173500 | consumed samples: 30177280 | consumed tokens: 61803069440 | elapsed time per iteration (s): 0.83 | learning rate: 6.269E-05 | global batch size: 256 | lm loss: 1.953829E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.601 | TFLOPs: 18.61 | 31: iteration 117890/ 173500 | 
consumed samples: 30179840 | consumed tokens: 61808312320 | elapsed time per iteration (s): 0.82 | learning rate: 6.268E-05 | global batch size: 256 | lm loss: 1.945913E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.073 | TFLOPs: 18.94 | 31: iteration 117900/ 173500 | consumed samples: 30182400 | consumed tokens: 61813555200 | elapsed time per iteration (s): 0.85 | learning rate: 6.266E-05 | global batch size: 256 | lm loss: 1.953918E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.279 | TFLOPs: 18.17 | 31: iteration 117910/ 173500 | consumed samples: 30184960 | consumed tokens: 61818798080 | elapsed time per iteration (s): 0.80 | learning rate: 6.265E-05 | global batch size: 256 | lm loss: 1.945681E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.573 | TFLOPs: 19.27 | 31: iteration 117920/ 173500 | consumed samples: 30187520 | consumed tokens: 61824040960 | elapsed time per iteration (s): 0.80 | learning rate: 6.263E-05 | global batch size: 256 | lm loss: 1.966395E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.341 | TFLOPs: 19.32 | 31: iteration 117930/ 173500 | consumed samples: 30190080 | consumed tokens: 61829283840 | elapsed time per iteration (s): 0.76 | learning rate: 6.262E-05 | global batch size: 256 | lm loss: 1.923564E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.413 | TFLOPs: 20.47 | 31: iteration 117940/ 173500 | consumed samples: 30192640 | consumed tokens: 61834526720 | elapsed time per iteration (s): 0.79 | learning rate: 6.261E-05 | global batch size: 256 | lm loss: 1.953928E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 324.317 | TFLOPs: 19.62 | 31: iteration 117950/ 173500 | consumed samples: 30195200 | consumed tokens: 61839769600 | elapsed time per iteration (s): 0.80 | learning rate: 6.259E-05 | global batch size: 256 | lm loss: 1.931865E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.318 | TFLOPs: 19.38 | 31: iteration 117960/ 173500 | consumed samples: 30197760 | consumed tokens: 61845012480 | elapsed time per iteration (s): 0.75 | learning rate: 6.258E-05 | global batch size: 256 | lm loss: 1.955512E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.625 | TFLOPs: 20.55 | 31: iteration 117970/ 173500 | consumed samples: 30200320 | consumed tokens: 61850255360 | elapsed time per iteration (s): 0.82 | learning rate: 6.256E-05 | global batch size: 256 | lm loss: 1.987072E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.082 | TFLOPs: 19.00 | 31: iteration 117980/ 173500 | consumed samples: 30202880 | consumed tokens: 61855498240 | elapsed time per iteration (s): 0.81 | learning rate: 6.255E-05 | global batch size: 256 | lm loss: 1.979147E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.086 | TFLOPs: 19.12 | 31: iteration 117990/ 173500 | consumed samples: 30205440 | consumed tokens: 61860741120 | elapsed time per iteration (s): 0.82 | learning rate: 6.254E-05 | global batch size: 256 | lm loss: 1.953699E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.763 | TFLOPs: 18.92 | 0: [2022-11-26 20:40:34,347] [INFO] [logging.py:68:log_dist] [Rank 0] step=118000, skipped=0, lr=[6.252226684525562e-05, 6.252226684525562e-05, 
6.252226684525562e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 118000/ 173500 | consumed samples: 30208000 | consumed tokens: 61865984000 | elapsed time per iteration (s): 0.78 | learning rate: 6.252E-05 | global batch size: 256 | lm loss: 1.945777E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.890 | TFLOPs: 19.90 | 0: steps: 118000 loss: 1.8830 iter time (s): 0.794 samples/sec: 322.347 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 118000 | lm loss value: 1.907793E+00 | lm loss PPL: 6.738203E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 118000 to checkpoints_1b1long 0: [2022-11-26 20:40:34,649] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step118000 is begin to save! 0: [2022-11-26 20:40:34,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_01-model_00-model_states.pt... 0: [2022-11-26 20:40:34,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_01-model_00-model_states.pt. 0: [2022-11-26 20:40:34,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_03-model_00-model_states.pt... 0: [2022-11-26 20:40:34,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_03-model_00-model_states.pt. 0: [2022-11-26 20:40:34,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_04-model_00-model_states.pt... 0: [2022-11-26 20:40:35,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_04-model_00-model_states.pt. 
0: [2022-11-26 20:40:35,042] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_05-model_00-model_states.pt... 0: [2022-11-26 20:40:35,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_05-model_00-model_states.pt. 0: [2022-11-26 20:40:35,120] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_06-model_00-model_states.pt... 0: [2022-11-26 20:40:35,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_06-model_00-model_states.pt. 0: [2022-11-26 20:40:35,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_07-model_00-model_states.pt... 0: [2022-11-26 20:40:35,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_07-model_00-model_states.pt. 0: [2022-11-26 20:40:35,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_08-model_00-model_states.pt... 0: [2022-11-26 20:40:35,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_08-model_00-model_states.pt. 0: [2022-11-26 20:40:35,347] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_09-model_00-model_states.pt... 0: [2022-11-26 20:40:35,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_09-model_00-model_states.pt. 0: [2022-11-26 20:40:35,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_10-model_00-model_states.pt... 0: [2022-11-26 20:40:35,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_10-model_00-model_states.pt. 
0: [2022-11-26 20:40:35,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_11-model_00-model_states.pt... 0: [2022-11-26 20:40:35,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_11-model_00-model_states.pt. 0: [2022-11-26 20:40:35,575] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_12-model_00-model_states.pt... 0: [2022-11-26 20:40:35,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_12-model_00-model_states.pt. 0: [2022-11-26 20:40:35,647] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_13-model_00-model_states.pt... 0: [2022-11-26 20:40:35,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_13-model_00-model_states.pt. 0: [2022-11-26 20:40:35,723] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_14-model_00-model_states.pt... 0: [2022-11-26 20:40:35,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_14-model_00-model_states.pt. 0: [2022-11-26 20:40:35,804] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_15-model_00-model_states.pt... 0: [2022-11-26 20:40:35,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_15-model_00-model_states.pt. 0: [2022-11-26 20:40:35,881] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_16-model_00-model_states.pt... 0: [2022-11-26 20:40:35,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_16-model_00-model_states.pt. 
0: [2022-11-26 20:40:35,965] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_17-model_00-model_states.pt... 0: [2022-11-26 20:40:36,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_17-model_00-model_states.pt. 0: [2022-11-26 20:40:36,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_18-model_00-model_states.pt... 0: [2022-11-26 20:40:36,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_18-model_00-model_states.pt. 0: [2022-11-26 20:40:36,120] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_19-model_00-model_states.pt... 0: [2022-11-26 20:40:36,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_19-model_00-model_states.pt. 0: [2022-11-26 20:40:36,197] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_20-model_00-model_states.pt... 0: [2022-11-26 20:40:36,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_20-model_00-model_states.pt. 0: [2022-11-26 20:40:36,273] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_21-model_00-model_states.pt... 0: [2022-11-26 20:40:36,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_21-model_00-model_states.pt. 0: [2022-11-26 20:40:36,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_22-model_00-model_states.pt... 0: [2022-11-26 20:40:36,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_22-model_00-model_states.pt. 
0: [2022-11-26 20:40:36,424] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_23-model_00-model_states.pt... 0: [2022-11-26 20:40:36,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_23-model_00-model_states.pt. 0: [2022-11-26 20:40:36,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_24-model_00-model_states.pt... 0: [2022-11-26 20:40:36,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_24-model_00-model_states.pt. 0: [2022-11-26 20:40:36,574] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_25-model_00-model_states.pt... 0: [2022-11-26 20:40:36,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_25-model_00-model_states.pt. 0: [2022-11-26 20:40:36,648] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_26-model_00-model_states.pt... 0: [2022-11-26 20:40:36,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_26-model_00-model_states.pt. 0: [2022-11-26 20:40:36,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_27-model_00-model_states.pt... 0: [2022-11-26 20:40:36,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_27-model_00-model_states.pt. 0: [2022-11-26 20:40:36,800] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_28-model_00-model_states.pt... 0: [2022-11-26 20:40:36,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_28-model_00-model_states.pt. 
0: [2022-11-26 20:40:36,873] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/layer_30-model_00-model_states.pt... 0: [2022-11-26 20:40:36,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/layer_30-model_00-model_states.pt. 0: [2022-11-26 20:40:36,877] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step118000/mp_rank_00_model_states.pt 0: [2022-11-26 20:40:36,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/mp_rank_00_model_states.pt... 0: [2022-11-26 20:40:36,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/mp_rank_00_model_states.pt. 0: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
5: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
4: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
1: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
13: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
20: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 
11: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
14: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 
17: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
26: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 
0: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
8: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 
13: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 
20: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 
14: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 
17: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 
27: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 
10: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 
25: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 
31: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 
19: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
2: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 
31: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
15: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:40:36,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:40:37,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:40:37,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 20:40:37,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
7: [2022-11-26 20:40:37,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:40:37,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 20:40:37,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 12: [2022-11-26 20:40:37,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:40:37,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 20:40:37,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 0: [2022-11-26 20:40:37,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:40:37,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 20:40:37,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 26: [2022-11-26 20:40:37,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:40:37,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 20:40:37,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
13: [2022-11-26 20:40:37,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:40:37,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:40:37,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 20:40:37,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 22: [2022-11-26 20:40:37,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 25: [2022-11-26 20:40:37,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:40:37,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 25: [2022-11-26 20:40:37,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 20:40:37,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 3: [2022-11-26 20:40:37,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:40:37,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 20:40:37,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
1: [2022-11-26 20:40:37,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:40:37,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:40:37,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 6: [2022-11-26 20:40:37,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 1: [2022-11-26 20:40:37,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 6: [2022-11-26 20:40:37,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 20: [2022-11-26 20:40:37,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:40:37,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 20:40:37,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 6: [2022-11-26 20:40:37,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:40:37,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 20:40:37,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
24: [2022-11-26 20:40:37,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:40:37,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:40:37,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 30: [2022-11-26 20:40:37,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:40:37,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 24: [2022-11-26 20:40:37,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 5: [2022-11-26 20:40:37,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 29: [2022-11-26 20:40:37,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:40:37,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 29: [2022-11-26 20:40:37,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 30: [2022-11-26 20:40:37,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
27: [2022-11-26 20:40:37,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:40:37,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 27: [2022-11-26 20:40:37,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 20:40:37,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 4: [2022-11-26 20:40:37,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:40:37,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 20:40:37,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 1: [2022-11-26 20:40:37,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:40:37,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 20:40:37,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 18: [2022-11-26 20:40:37,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
18: [2022-11-26 20:40:37,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 20:40:37,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 11: [2022-11-26 20:40:37,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:40:37,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:40:37,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 20:40:37,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 20:40:37,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 11: [2022-11-26 20:40:37,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 0: [2022-11-26 20:40:37,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:40:37,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 20:40:37,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 24: [2022-11-26 20:40:37,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
5: [2022-11-26 20:40:37,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:40:37,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 20:40:37,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 5: [2022-11-26 20:40:37,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 20:40:37,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 12: [2022-11-26 20:40:37,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:40:37,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 20:40:37,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 8: [2022-11-26 20:40:37,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:40:37,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
18: [2022-11-26 20:40:37,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 8: [2022-11-26 20:40:37,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 20:40:37,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 18: [2022-11-26 20:40:37,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 29: [2022-11-26 20:40:37,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:40:37,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 20:40:37,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 7: [2022-11-26 20:40:37,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:40:37,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:40:37,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 30: [2022-11-26 20:40:37,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 7: [2022-11-26 20:40:37,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
30: [2022-11-26 20:40:37,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 10: [2022-11-26 20:40:37,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:40:37,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 28: [2022-11-26 20:40:37,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:40:37,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 15: [2022-11-26 20:40:37,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:40:37,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 20:40:37,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 17: [2022-11-26 20:40:37,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:40:37,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:40:37,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 
17: [2022-11-26 20:40:37,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 20:40:37,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 26: [2022-11-26 20:40:37,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:40:37,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 17: [2022-11-26 20:40:37,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 17: [2022-11-26 20:40:37,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 25: [2022-11-26 20:40:37,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 26: [2022-11-26 20:40:37,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 20:40:37,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 8: [2022-11-26 20:40:37,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:40:37,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 20:40:37,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
24: [2022-11-26 20:40:37,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:40:37,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 20:40:37,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 20: [2022-11-26 20:40:37,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:40:37,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 20:40:37,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 2: [2022-11-26 20:40:37,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:40:37,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 20:40:37,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 5: [2022-11-26 20:40:37,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:40:37,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 20:40:37,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
18: [2022-11-26 20:40:37,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:40:37,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 20:40:37,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 7: [2022-11-26 20:40:37,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:40:37,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 20:40:37,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 16: [2022-11-26 20:40:37,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:40:37,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 0: [2022-11-26 20:40:37,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:40:37,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:40:37,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
0: [2022-11-26 20:40:37,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 20:40:37,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 20:40:37,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 0: [2022-11-26 20:40:37,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 11: [2022-11-26 20:40:37,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:40:37,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:40:37,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 12: [2022-11-26 20:40:37,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 11: [2022-11-26 20:40:37,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 12: [2022-11-26 20:40:37,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 3: [2022-11-26 20:40:37,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 20:40:37,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 8: [2022-11-26 20:40:37,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:40:37,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 3: [2022-11-26 20:40:37,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 8: [2022-11-26 20:40:37,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 22: [2022-11-26 20:40:37,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:40:37,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 20:40:37,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 4: [2022-11-26 20:40:37,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:40:37,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 20:40:37,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 10: [2022-11-26 20:40:37,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-26 20:40:37,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 20:40:37,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 30: [2022-11-26 20:40:37,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:40:37,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 20:40:37,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 1: [2022-11-26 20:40:37,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:40:37,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:40:37,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:40:37,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 1: [2022-11-26 20:40:37,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 5: [2022-11-26 20:40:37,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
6: [2022-11-26 20:40:37,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 20:40:37,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 1: [2022-11-26 20:40:37,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 13: [2022-11-26 20:40:37,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:40:37,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:40:37,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 20:40:37,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 20:40:37,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 13: [2022-11-26 20:40:37,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 22: [2022-11-26 20:40:37,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:40:37,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 20:40:37,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
23: [2022-11-26 20:40:37,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:40:37,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 20:40:37,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 23: [2022-11-26 20:40:37,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:40:37,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 20:40:37,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:40:37,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 27: [2022-11-26 20:40:37,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:40:37,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 15: [2022-11-26 20:40:37,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
23: [2022-11-26 20:40:37,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 20:40:37,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 29: [2022-11-26 20:40:37,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:40:37,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 23: [2022-11-26 20:40:37,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 27: [2022-11-26 20:40:37,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 23: [2022-11-26 20:40:37,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 29: [2022-11-26 20:40:37,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 15: [2022-11-26 20:40:37,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 29: [2022-11-26 20:40:37,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 27: [2022-11-26 20:40:37,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 24: [2022-11-26 20:40:37,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 
24: [2022-11-26 20:40:37,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 20:40:37,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 6: [2022-11-26 20:40:37,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:40:37,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 20:40:37,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 25: [2022-11-26 20:40:37,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:40:37,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:40:37,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:40:37,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-26 20:40:37,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 20:40:37,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 25: [2022-11-26 20:40:37,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 19: [2022-11-26 20:40:37,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 19: [2022-11-26 20:40:37,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 25: [2022-11-26 20:40:37,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 19: [2022-11-26 20:40:37,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 19: [2022-11-26 20:40:37,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 21: [2022-11-26 20:40:37,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:40:37,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:40:37,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 
21: [2022-11-26 20:40:37,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 20: [2022-11-26 20:40:37,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:40:37,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 20:40:37,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 21: [2022-11-26 20:40:37,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 17: [2022-11-26 20:40:37,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 20: [2022-11-26 20:40:37,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 17: [2022-11-26 20:40:37,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 20: [2022-11-26 20:40:37,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 17: [2022-11-26 20:40:37,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:40:37,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 20:40:37,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
15: [2022-11-26 20:40:37,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:40:37,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 20:40:37,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:40:37,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 15: [2022-11-26 20:40:37,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 20:40:37,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 28: [2022-11-26 20:40:37,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 20:40:37,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 28: [2022-11-26 20:40:37,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:40:37,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 20:40:37,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 28: [2022-11-26 20:40:37,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
28: [2022-11-26 20:40:37,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 20:40:37,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 10: [2022-11-26 20:40:37,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:40:37,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 20:40:37,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 7: [2022-11-26 20:40:37,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:40:37,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:40:37,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 20:40:37,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 10: [2022-11-26 20:40:37,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 20:40:37,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 8: [2022-11-26 20:40:37,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 20:40:37,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 20:40:37,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 3: [2022-11-26 20:40:37,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:40:37,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 20:40:37,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 4: [2022-11-26 20:40:37,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:40:37,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 20:40:37,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 27: [2022-11-26 20:40:37,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 20:40:37,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 20:40:37,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 30: [2022-11-26 20:40:37,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
30: [2022-11-26 20:40:37,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 20:40:37,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 14: [2022-11-26 20:40:37,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:40:37,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:40:37,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:40:37,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 20:40:37,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 20:40:37,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 20:40:37,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 14: [2022-11-26 20:40:37,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 14: [2022-11-26 20:40:37,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 16: [2022-11-26 20:40:37,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
16: [2022-11-26 20:40:37,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 20:40:37,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 16: [2022-11-26 20:40:37,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:40:37,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 20:40:37,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 1: [2022-11-26 20:40:37,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:40:37,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 20:40:37,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 2: [2022-11-26 20:40:37,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:40:37,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 20:40:37,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 20:40:37,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 20:40:37,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 2: [2022-11-26 20:40:37,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 11: [2022-11-26 20:40:37,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:40:37,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 20:40:37,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 27: [2022-11-26 20:40:37,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:40:37,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:40:37,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:40:37,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 20:40:37,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 27: [2022-11-26 20:40:37,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 9: [2022-11-26 20:40:37,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 20:40:37,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 20:40:37,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 27: [2022-11-26 20:40:37,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 9: [2022-11-26 20:40:37,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 9: [2022-11-26 20:40:37,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 28: [2022-11-26 20:40:37,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:40:37,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:40:37,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 20:40:37,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
20: [2022-11-26 20:40:37,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:40:37,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 20:40:37,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 3: [2022-11-26 20:40:37,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:40:37,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 20:40:37,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 28: [2022-11-26 20:40:37,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 20:40:37,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 18: [2022-11-26 20:40:37,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:40:37,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 20:40:37,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 31: [2022-11-26 20:40:37,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 
31: [2022-11-26 20:40:37,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:40:37,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:40:37,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:40:37,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 20:40:37,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 20:40:37,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 20:40:37,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 20:40:37,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 31: [2022-11-26 20:40:37,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 31: [2022-11-26 20:40:37,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 31: [2022-11-26 20:40:37,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 26: [2022-11-26 20:40:37,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
26: [2022-11-26 20:40:37,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 19: [2022-11-26 20:40:37,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:40:37,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 19: [2022-11-26 20:40:37,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 20:40:37,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 13: [2022-11-26 20:40:37,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:40:37,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 20:40:37,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 16: [2022-11-26 20:40:37,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:40:37,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 20:40:37,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 14: [2022-11-26 20:40:37,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
9: [2022-11-26 20:40:37,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:40:37,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 9: [2022-11-26 20:40:37,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 14: [2022-11-26 20:40:37,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 9: [2022-11-26 20:40:37,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 29: [2022-11-26 20:40:37,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:40:37,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 20:40:37,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 25: [2022-11-26 20:40:37,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:40:37,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 20:40:37,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 12: [2022-11-26 20:40:37,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 20:40:37,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 20:40:37,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 21: [2022-11-26 20:40:37,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:40:37,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 20:40:37,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 22: [2022-11-26 20:40:37,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:40:37,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 20:40:37,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 2: [2022-11-26 20:40:37,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:40:37,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 20:40:37,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 6: [2022-11-26 20:40:37,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 20:40:37,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 20:40:37,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 5: [2022-11-26 20:40:37,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:40:37,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 20:40:37,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 0: [2022-11-26 20:40:37,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:40:37,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 20:40:37,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 10: [2022-11-26 20:40:37,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:40:37,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:40:37,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 20:40:37,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
1: [2022-11-26 20:40:37,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 20:40:37,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 24: [2022-11-26 20:40:37,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:40:37,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 20:40:37,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 11: [2022-11-26 20:40:37,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:40:37,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 20:40:37,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 4: [2022-11-26 20:40:37,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:40:37,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 20:40:37,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 27: [2022-11-26 20:40:37,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
27: [2022-11-26 20:40:37,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 20:40:37,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 31: [2022-11-26 20:40:37,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:40:37,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:40:37,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 30: [2022-11-26 20:40:37,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 31: [2022-11-26 20:40:37,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 30: [2022-11-26 20:40:37,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 8: [2022-11-26 20:40:37,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:40:37,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 20:40:37,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 20: [2022-11-26 20:40:37,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
20: [2022-11-26 20:40:37,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 20:40:37,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 7: [2022-11-26 20:40:37,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:40:37,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:40:37,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 20:40:37,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 3: [2022-11-26 20:40:37,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:40:37,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 3: [2022-11-26 20:40:37,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 15: [2022-11-26 20:40:37,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 3: [2022-11-26 20:40:37,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 17: [2022-11-26 20:40:37,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 
17: [2022-11-26 20:40:37,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 20:40:37,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 19: [2022-11-26 20:40:37,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:40:37,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 20:40:37,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 23: [2022-11-26 20:40:37,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:40:37,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 20:40:37,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 18: [2022-11-26 20:40:37,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:40:37,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 20:40:37,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 25: [2022-11-26 20:40:37,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
29: [2022-11-26 20:40:37,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:40:37,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 20:40:37,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 25: [2022-11-26 20:40:37,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 20:40:37,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 28: [2022-11-26 20:40:37,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:40:37,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 20:40:37,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 26: [2022-11-26 20:40:37,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:40:37,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 20:40:37,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 13: [2022-11-26 20:40:37,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 20:40:37,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 20:40:37,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 9: [2022-11-26 20:40:37,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:40:37,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 20:40:37,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 16: [2022-11-26 20:40:37,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:40:37,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 20:40:37,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 14: [2022-11-26 20:40:37,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:40:37,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 20:40:37,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 21: [2022-11-26 20:40:37,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-26 20:40:37,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 20:40:37,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 12: [2022-11-26 20:40:37,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:40:37,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 0: [2022-11-26 20:40:37,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:40:37,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 0: [2022-11-26 20:40:37,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 20:40:37,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 6: [2022-11-26 20:40:37,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:40:37,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 20:40:37,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 2: [2022-11-26 20:40:37,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 20:40:37,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 20:40:37,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 5: [2022-11-26 20:40:37,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:40:37,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:40:37,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 22: [2022-11-26 20:40:37,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 20:40:37,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 5: [2022-11-26 20:40:37,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 24: [2022-11-26 20:40:37,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:40:37,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 20:40:37,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 23: [2022-11-26 20:40:37,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-26 20:40:37,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 20:40:37,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 10: [2022-11-26 20:40:37,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:40:37,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:40:37,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 20:40:37,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 10: [2022-11-26 20:40:37,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 20:40:37,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 30: [2022-11-26 20:40:37,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:40:37,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 20:40:37,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 20: [2022-11-26 20:40:37,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
27: [2022-11-26 20:40:37,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:40:37,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 3: [2022-11-26 20:40:37,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:40:37,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 27: [2022-11-26 20:40:37,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 20:40:37,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 3: [2022-11-26 20:40:37,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 20:40:37,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 17: [2022-11-26 20:40:37,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:40:37,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:40:37,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 
17: [2022-11-26 20:40:37,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 20:40:37,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 31: [2022-11-26 20:40:37,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 8: [2022-11-26 20:40:37,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 31: [2022-11-26 20:40:37,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 8: [2022-11-26 20:40:37,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 11: [2022-11-26 20:40:37,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:40:37,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 20:40:37,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 15: [2022-11-26 20:40:37,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:40:37,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 20:40:37,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
7: [2022-11-26 20:40:37,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:40:37,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 20:40:37,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 29: [2022-11-26 20:40:37,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:40:37,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 20:40:37,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 1: [2022-11-26 20:40:37,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:40:37,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 20:40:37,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 18: [2022-11-26 20:40:37,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:40:37,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 20:40:37,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
25: [2022-11-26 20:40:37,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:40:37,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 20:40:37,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 26: [2022-11-26 20:40:37,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:40:37,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 20:40:37,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 28: [2022-11-26 20:40:37,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:40:37,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:40:37,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 20:40:37,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 0: [2022-11-26 20:40:37,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
9: [2022-11-26 20:40:37,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:40:37,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 20:40:37,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 13: [2022-11-26 20:40:37,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:40:37,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 28: [2022-11-26 20:40:37,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 16: [2022-11-26 20:40:37,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:40:37,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 28: [2022-11-26 20:40:37,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 16: [2022-11-26 20:40:37,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 20:40:37,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 14: [2022-11-26 20:40:37,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 20:40:37,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 20:40:37,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 0: [2022-11-26 20:40:37,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 20:40:37,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 5: [2022-11-26 20:40:37,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:40:37,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 20:40:37,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 21: [2022-11-26 20:40:37,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:40:37,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 20:40:37,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 6: [2022-11-26 20:40:37,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-26 20:40:37,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 20:40:37,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 22: [2022-11-26 20:40:37,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:40:37,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 20:40:37,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 12: [2022-11-26 20:40:37,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:40:37,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 20:40:37,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 15: [2022-11-26 20:40:37,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:40:37,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 20:40:37,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 23: [2022-11-26 20:40:37,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
4: [2022-11-26 20:40:37,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:40:37,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 4: [2022-11-26 20:40:37,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 23: [2022-11-26 20:40:37,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 4: [2022-11-26 20:40:37,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 2: [2022-11-26 20:40:37,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:40:37,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 20:40:37,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 30: [2022-11-26 20:40:37,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:40:37,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 20:40:37,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 10: [2022-11-26 20:40:37,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 20:40:37,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 20:40:37,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 24: [2022-11-26 20:40:37,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:40:37,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:40:37,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 1: [2022-11-26 20:40:37,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 24: [2022-11-26 20:40:37,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 1: [2022-11-26 20:40:37,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 17: [2022-11-26 20:40:37,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:40:37,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 20:40:37,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 7: [2022-11-26 20:40:37,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 20:40:37,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 20:40:37,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 27: [2022-11-26 20:40:37,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 20:40:37,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 20:40:37,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 20: [2022-11-26 20:40:37,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:40:37,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 20:40:37,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 31: [2022-11-26 20:40:37,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:40:37,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 20:40:37,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 11: [2022-11-26 20:40:37,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
3: [2022-11-26 20:40:37,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:40:37,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 3: [2022-11-26 20:40:37,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 11: [2022-11-26 20:40:37,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 3: [2022-11-26 20:40:37,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 0: [2022-11-26 20:40:37,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:40:37,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:40:37,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 13: [2022-11-26 20:40:37,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 20:40:37,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 0: [2022-11-26 20:40:37,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 14: [2022-11-26 20:40:37,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-26 20:40:37,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 20:40:37,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 19: [2022-11-26 20:40:37,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:40:37,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 20:40:37,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 28: [2022-11-26 20:40:37,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:40:37,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 20:40:37,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 12: [2022-11-26 20:40:37,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:40:37,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:40:37,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 20:40:37,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
29: [2022-11-26 20:40:37,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 8: [2022-11-26 20:40:37,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:40:37,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 8: [2022-11-26 20:40:37,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 20:40:37,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 25: [2022-11-26 20:40:37,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:40:37,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 20:40:37,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 1: [2022-11-26 20:40:37,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:40:37,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 20:40:37,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 9: [2022-11-26 20:40:37,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-26 20:40:37,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 20:40:37,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 5: [2022-11-26 20:40:37,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:40:37,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 20:40:37,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 13: [2022-11-26 20:40:37,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:40:37,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:40:37,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 31: [2022-11-26 20:40:37,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:40:37,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
31: [2022-11-26 20:40:37,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 17: [2022-11-26 20:40:37,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 31: [2022-11-26 20:40:37,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 17: [2022-11-26 20:40:37,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 20: [2022-11-26 20:40:37,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:40:37,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 20:40:37,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 18: [2022-11-26 20:40:37,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:40:37,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 6: [2022-11-26 20:40:37,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:40:37,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
6: [2022-11-26 20:40:37,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 20:40:37,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 3: [2022-11-26 20:40:37,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:40:37,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 20:40:37,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 7: [2022-11-26 20:40:37,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:40:37,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 20:40:37,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 22: [2022-11-26 20:40:37,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:40:37,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:40:37,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
22: [2022-11-26 20:40:37,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 2: [2022-11-26 20:40:37,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:40:37,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 22: [2022-11-26 20:40:37,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 12: [2022-11-26 20:40:37,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 30: [2022-11-26 20:40:37,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 2: [2022-11-26 20:40:37,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 30: [2022-11-26 20:40:37,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 2: [2022-11-26 20:40:37,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 24: [2022-11-26 20:40:37,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:40:37,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 20:40:37,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
23: [2022-11-26 20:40:37,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:40:37,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 20:40:37,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 26: [2022-11-26 20:40:37,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:40:37,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 20:40:37,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 4: [2022-11-26 20:40:37,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:40:37,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:40:37,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 15: [2022-11-26 20:40:37,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 4: [2022-11-26 20:40:37,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 15: [2022-11-26 20:40:37,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
8: [2022-11-26 20:40:37,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:40:37,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 20:40:37,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 25: [2022-11-26 20:40:37,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:40:37,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 20:40:37,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 10: [2022-11-26 20:40:37,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:40:37,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 20:40:37,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 9: [2022-11-26 20:40:37,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:40:37,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 20:40:37,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
11: [2022-11-26 20:40:37,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:40:37,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 20:40:37,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 16: [2022-11-26 20:40:37,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:40:37,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 20:40:37,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 29: [2022-11-26 20:40:37,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:40:37,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 20:40:37,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 27: [2022-11-26 20:40:37,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 20:40:37,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 20:40:37,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
19: [2022-11-26 20:40:37,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:40:37,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:40:37,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 20:40:37,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 28: [2022-11-26 20:40:37,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 20:40:37,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 21: [2022-11-26 20:40:37,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:40:37,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:40:37,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 20:40:37,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 20:40:37,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 21: [2022-11-26 20:40:37,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
26: [2022-11-26 20:40:37,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:40:37,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 20:40:37,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 14: [2022-11-26 20:40:37,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:40:37,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 20:40:37,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 2: [2022-11-26 20:40:37,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:40:37,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 20:40:37,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 22: [2022-11-26 20:40:37,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:40:37,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 20:40:37,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
16: [2022-11-26 20:40:37,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:40:37,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 20:40:37,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 18: [2022-11-26 20:40:37,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:40:37,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 20:40:37,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 21: [2022-11-26 20:40:37,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:40:37,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step118000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 20:40:37,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
0: successfully saved checkpoint at iteration 118000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2565.84 31: iteration 118010/ 173500 | consumed samples: 30210560 | consumed tokens: 61871226880 | elapsed time per iteration (s): 1.07 | learning rate: 6.251E-05 | global batch size: 256 | lm loss: 1.952425E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.584 | TFLOPs: 14.49 | 31: iteration 118020/ 173500 | consumed samples: 30213120 | consumed tokens: 61876469760 | elapsed time per iteration (s): 0.86 | learning rate: 6.249E-05 | global batch size: 256 | lm loss: 1.957352E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.736 | TFLOPs: 18.01 | 31: iteration 118030/ 173500 | consumed samples: 30215680 | consumed tokens: 61881712640 | elapsed time per iteration (s): 2.56 | learning rate: 6.248E-05 | global batch size: 256 | lm loss: 1.963715E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 99.896 | TFLOPs: 6.04 | 31: iteration 118040/ 173500 | consumed samples: 30218240 | consumed tokens: 61886955520 | elapsed time per iteration (s): 0.83 | learning rate: 6.247E-05 | global batch size: 256 | lm loss: 1.949339E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.929 | TFLOPs: 18.75 | 31: iteration 118050/ 173500 | consumed samples: 30220800 | consumed tokens: 61892198400 | elapsed time per iteration (s): 0.80 | learning rate: 6.245E-05 | global batch size: 256 | lm loss: 1.961239E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.880 | TFLOPs: 19.47 | 31: iteration 118060/ 173500 | consumed samples: 30223360 | consumed tokens: 61897441280 | elapsed time per iteration (s): 
0.82 | learning rate: 6.244E-05 | global batch size: 256 | lm loss: 1.942435E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.440 | TFLOPs: 18.78 | 31: iteration 118070/ 173500 | consumed samples: 30225920 | consumed tokens: 61902684160 | elapsed time per iteration (s): 0.81 | learning rate: 6.242E-05 | global batch size: 256 | lm loss: 1.920930E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.389 | TFLOPs: 19.20 | 31: iteration 118080/ 173500 | consumed samples: 30228480 | consumed tokens: 61907927040 | elapsed time per iteration (s): 0.80 | learning rate: 6.241E-05 | global batch size: 256 | lm loss: 1.961188E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.698 | TFLOPs: 19.46 | 31: iteration 118090/ 173500 | consumed samples: 30231040 | consumed tokens: 61913169920 | elapsed time per iteration (s): 0.81 | learning rate: 6.240E-05 | global batch size: 256 | lm loss: 1.968011E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.215 | TFLOPs: 19.13 | 31: iteration 118100/ 173500 | consumed samples: 30233600 | consumed tokens: 61918412800 | elapsed time per iteration (s): 0.82 | learning rate: 6.238E-05 | global batch size: 256 | lm loss: 1.931632E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.016 | TFLOPs: 18.82 | 31: iteration 118110/ 173500 | consumed samples: 30236160 | consumed tokens: 61923655680 | elapsed time per iteration (s): 0.85 | learning rate: 6.237E-05 | global batch size: 256 | lm loss: 1.939959E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.345 | TFLOPs: 18.29 | 31: 
iteration 118120/ 173500 | consumed samples: 30238720 | consumed tokens: 61928898560 | elapsed time per iteration (s): 0.82 | learning rate: 6.235E-05 | global batch size: 256 | lm loss: 1.937755E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.090 | TFLOPs: 18.94 | 31: iteration 118130/ 173500 | consumed samples: 30241280 | consumed tokens: 61934141440 | elapsed time per iteration (s): 0.82 | learning rate: 6.234E-05 | global batch size: 256 | lm loss: 1.939175E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.532 | TFLOPs: 18.91 | 31: iteration 118140/ 173500 | consumed samples: 30243840 | consumed tokens: 61939384320 | elapsed time per iteration (s): 0.80 | learning rate: 6.233E-05 | global batch size: 256 | lm loss: 1.973755E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.645 | TFLOPs: 19.40 | 31: iteration 118150/ 173500 | consumed samples: 30246400 | consumed tokens: 61944627200 | elapsed time per iteration (s): 0.79 | learning rate: 6.231E-05 | global batch size: 256 | lm loss: 1.967837E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.493 | TFLOPs: 19.51 | 31: iteration 118160/ 173500 | consumed samples: 30248960 | consumed tokens: 61949870080 | elapsed time per iteration (s): 0.80 | learning rate: 6.230E-05 | global batch size: 256 | lm loss: 1.962193E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.694 | TFLOPs: 19.28 | 31: iteration 118170/ 173500 | consumed samples: 30251520 | consumed tokens: 61955112960 | elapsed time per iteration (s): 0.81 | learning rate: 6.228E-05 | global batch size: 256 | lm loss: 1.940920E+00 | grad norm: 0.166 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.551 | TFLOPs: 19.09 | 31: iteration 118180/ 173500 | consumed samples: 30254080 | consumed tokens: 61960355840 | elapsed time per iteration (s): 0.81 | learning rate: 6.227E-05 | global batch size: 256 | lm loss: 1.952351E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.060 | TFLOPs: 19.12 | 31: iteration 118190/ 173500 | consumed samples: 30256640 | consumed tokens: 61965598720 | elapsed time per iteration (s): 0.82 | learning rate: 6.226E-05 | global batch size: 256 | lm loss: 1.944682E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.585 | TFLOPs: 18.79 | 31: iteration 118200/ 173500 | consumed samples: 30259200 | consumed tokens: 61970841600 | elapsed time per iteration (s): 0.80 | learning rate: 6.224E-05 | global batch size: 256 | lm loss: 1.965263E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.767 | TFLOPs: 19.41 | 31: iteration 118210/ 173500 | consumed samples: 30261760 | consumed tokens: 61976084480 | elapsed time per iteration (s): 0.79 | learning rate: 6.223E-05 | global batch size: 256 | lm loss: 1.932883E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.672 | TFLOPs: 19.52 | 31: iteration 118220/ 173500 | consumed samples: 30264320 | consumed tokens: 61981327360 | elapsed time per iteration (s): 0.84 | learning rate: 6.221E-05 | global batch size: 256 | lm loss: 1.937702E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.252 | TFLOPs: 18.35 | 31: iteration 118230/ 173500 | consumed samples: 30266880 | consumed tokens: 61986570240 | elapsed time per iteration (s): 0.84 | 
learning rate: 6.220E-05 | global batch size: 256 | lm loss: 1.914697E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.149 | TFLOPs: 18.52 | 31: iteration 118240/ 173500 | consumed samples: 30269440 | consumed tokens: 61991813120 | elapsed time per iteration (s): 0.87 | learning rate: 6.219E-05 | global batch size: 256 | lm loss: 1.932217E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.385 | TFLOPs: 17.75 | 31: iteration 118250/ 173500 | consumed samples: 30272000 | consumed tokens: 61997056000 | elapsed time per iteration (s): 0.80 | learning rate: 6.217E-05 | global batch size: 256 | lm loss: 1.931235E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.498 | TFLOPs: 19.39 | 31: iteration 118260/ 173500 | consumed samples: 30274560 | consumed tokens: 62002298880 | elapsed time per iteration (s): 0.85 | learning rate: 6.216E-05 | global batch size: 256 | lm loss: 1.948484E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.392 | TFLOPs: 18.29 | 31: iteration 118270/ 173500 | consumed samples: 30277120 | consumed tokens: 62007541760 | elapsed time per iteration (s): 0.75 | learning rate: 6.215E-05 | global batch size: 256 | lm loss: 1.977014E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.749 | TFLOPs: 20.67 | 31: iteration 118280/ 173500 | consumed samples: 30279680 | consumed tokens: 62012784640 | elapsed time per iteration (s): 0.74 | learning rate: 6.213E-05 | global batch size: 256 | lm loss: 1.972385E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.812 | TFLOPs: 20.80 | 31: iteration 
118290/ 173500 | consumed samples: 30282240 | consumed tokens: 62018027520 | elapsed time per iteration (s): 0.77 | learning rate: 6.212E-05 | global batch size: 256 | lm loss: 1.940313E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.518 | TFLOPs: 20.00 | 31: iteration 118300/ 173500 | consumed samples: 30284800 | consumed tokens: 62023270400 | elapsed time per iteration (s): 0.73 | learning rate: 6.210E-05 | global batch size: 256 | lm loss: 1.963758E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.843 | TFLOPs: 21.16 | 31: iteration 118310/ 173500 | consumed samples: 30287360 | consumed tokens: 62028513280 | elapsed time per iteration (s): 0.77 | learning rate: 6.209E-05 | global batch size: 256 | lm loss: 1.943665E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.486 | TFLOPs: 20.24 | 31: iteration 118320/ 173500 | consumed samples: 30289920 | consumed tokens: 62033756160 | elapsed time per iteration (s): 0.76 | learning rate: 6.208E-05 | global batch size: 256 | lm loss: 1.937876E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.958 | TFLOPs: 20.32 | 31: iteration 118330/ 173500 | consumed samples: 30292480 | consumed tokens: 62038999040 | elapsed time per iteration (s): 0.76 | learning rate: 6.206E-05 | global batch size: 256 | lm loss: 1.951966E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.419 | TFLOPs: 20.41 | 31: iteration 118340/ 173500 | consumed samples: 30295040 | consumed tokens: 62044241920 | elapsed time per iteration (s): 0.77 | learning rate: 6.205E-05 | global batch size: 256 | lm loss: 1.921551E+00 | grad norm: 0.163 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.803 | TFLOPs: 20.13 | 31: iteration 118350/ 173500 | consumed samples: 30297600 | consumed tokens: 62049484800 | elapsed time per iteration (s): 0.80 | learning rate: 6.203E-05 | global batch size: 256 | lm loss: 1.938740E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.205 | TFLOPs: 19.25 | 31: iteration 118360/ 173500 | consumed samples: 30300160 | consumed tokens: 62054727680 | elapsed time per iteration (s): 0.78 | learning rate: 6.202E-05 | global batch size: 256 | lm loss: 1.945210E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.234 | TFLOPs: 19.74 | 31: iteration 118370/ 173500 | consumed samples: 30302720 | consumed tokens: 62059970560 | elapsed time per iteration (s): 0.74 | learning rate: 6.201E-05 | global batch size: 256 | lm loss: 1.945068E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.728 | TFLOPs: 20.92 | 31: iteration 118380/ 173500 | consumed samples: 30305280 | consumed tokens: 62065213440 | elapsed time per iteration (s): 0.80 | learning rate: 6.199E-05 | global batch size: 256 | lm loss: 1.933281E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.979 | TFLOPs: 19.48 | 31: iteration 118390/ 173500 | consumed samples: 30307840 | consumed tokens: 62070456320 | elapsed time per iteration (s): 0.78 | learning rate: 6.198E-05 | global batch size: 256 | lm loss: 1.918403E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.545 | TFLOPs: 19.76 | 31: iteration 118400/ 173500 | consumed samples: 30310400 | consumed tokens: 62075699200 | elapsed time per iteration (s): 0.76 | learning 
rate: 6.196E-05 | global batch size: 256 | lm loss: 1.969657E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.929 | TFLOPs: 20.32 | 31: iteration 118410/ 173500 | consumed samples: 30312960 | consumed tokens: 62080942080 | elapsed time per iteration (s): 0.82 | learning rate: 6.195E-05 | global batch size: 256 | lm loss: 1.932969E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.563 | TFLOPs: 18.79 | 31: iteration 118420/ 173500 | consumed samples: 30315520 | consumed tokens: 62086184960 | elapsed time per iteration (s): 0.84 | learning rate: 6.194E-05 | global batch size: 256 | lm loss: 1.944374E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.145 | TFLOPs: 18.52 | 31: iteration 118430/ 173500 | consumed samples: 30318080 | consumed tokens: 62091427840 | elapsed time per iteration (s): 0.78 | learning rate: 6.192E-05 | global batch size: 256 | lm loss: 1.969538E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.614 | TFLOPs: 19.88 | 31: iteration 118440/ 173500 | consumed samples: 30320640 | consumed tokens: 62096670720 | elapsed time per iteration (s): 0.81 | learning rate: 6.191E-05 | global batch size: 256 | lm loss: 1.958459E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.378 | TFLOPs: 19.08 | 31: iteration 118450/ 173500 | consumed samples: 30323200 | consumed tokens: 62101913600 | elapsed time per iteration (s): 0.81 | learning rate: 6.189E-05 | global batch size: 256 | lm loss: 1.917195E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.492 | TFLOPs: 19.21 | 31: iteration 118460/ 
173500 | consumed samples: 30325760 | consumed tokens: 62107156480 | elapsed time per iteration (s): 0.83 | learning rate: 6.188E-05 | global batch size: 256 | lm loss: 1.956265E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.432 | TFLOPs: 18.66 | 31: iteration 118470/ 173500 | consumed samples: 30328320 | consumed tokens: 62112399360 | elapsed time per iteration (s): 0.80 | learning rate: 6.187E-05 | global batch size: 256 | lm loss: 1.940038E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.267 | TFLOPs: 19.38 | 31: iteration 118480/ 173500 | consumed samples: 30330880 | consumed tokens: 62117642240 | elapsed time per iteration (s): 0.81 | learning rate: 6.185E-05 | global batch size: 256 | lm loss: 1.968044E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.776 | TFLOPs: 19.16 | 31: iteration 118490/ 173500 | consumed samples: 30333440 | consumed tokens: 62122885120 | elapsed time per iteration (s): 0.85 | learning rate: 6.184E-05 | global batch size: 256 | lm loss: 1.947308E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.237 | TFLOPs: 18.28 | 31: iteration 118500/ 173500 | consumed samples: 30336000 | consumed tokens: 62128128000 | elapsed time per iteration (s): 0.85 | learning rate: 6.183E-05 | global batch size: 256 | lm loss: 1.955618E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.156 | TFLOPs: 18.22 | 31: iteration 118510/ 173500 | consumed samples: 30338560 | consumed tokens: 62133370880 | elapsed time per iteration (s): 0.80 | learning rate: 6.181E-05 | global batch size: 256 | lm loss: 1.934518E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 319.993 | TFLOPs: 19.36 | 31: iteration 118520/ 173500 | consumed samples: 30341120 | consumed tokens: 62138613760 | elapsed time per iteration (s): 0.83 | learning rate: 6.180E-05 | global batch size: 256 | lm loss: 1.952708E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.507 | TFLOPs: 18.72 | 31: iteration 118530/ 173500 | consumed samples: 30343680 | consumed tokens: 62143856640 | elapsed time per iteration (s): 0.81 | learning rate: 6.178E-05 | global batch size: 256 | lm loss: 1.924999E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.370 | TFLOPs: 19.14 | 31: iteration 118540/ 173500 | consumed samples: 30346240 | consumed tokens: 62149099520 | elapsed time per iteration (s): 0.79 | learning rate: 6.177E-05 | global batch size: 256 | lm loss: 1.978821E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.215 | TFLOPs: 19.61 | 31: iteration 118550/ 173500 | consumed samples: 30348800 | consumed tokens: 62154342400 | elapsed time per iteration (s): 0.82 | learning rate: 6.176E-05 | global batch size: 256 | lm loss: 1.942610E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.026 | TFLOPs: 18.88 | 31: iteration 118560/ 173500 | consumed samples: 30351360 | consumed tokens: 62159585280 | elapsed time per iteration (s): 0.82 | learning rate: 6.174E-05 | global batch size: 256 | lm loss: 1.956289E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.029 | TFLOPs: 18.94 | 31: iteration 118570/ 173500 | consumed samples: 30353920 | consumed tokens: 62164828160 | elapsed time per iteration (s): 0.82 | learning rate: 
6.173E-05 | global batch size: 256 | lm loss: 1.934555E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.513 | TFLOPs: 18.85 | 31: iteration 118580/ 173500 | consumed samples: 30356480 | consumed tokens: 62170071040 | elapsed time per iteration (s): 0.84 | learning rate: 6.171E-05 | global batch size: 256 | lm loss: 1.938787E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.157 | TFLOPs: 18.52 | 31: iteration 118590/ 173500 | consumed samples: 30359040 | consumed tokens: 62175313920 | elapsed time per iteration (s): 0.77 | learning rate: 6.170E-05 | global batch size: 256 | lm loss: 1.963455E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.408 | TFLOPs: 20.11 | 31: iteration 118600/ 173500 | consumed samples: 30361600 | consumed tokens: 62180556800 | elapsed time per iteration (s): 0.78 | learning rate: 6.169E-05 | global batch size: 256 | lm loss: 1.945874E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.619 | TFLOPs: 19.94 | 31: iteration 118610/ 173500 | consumed samples: 30364160 | consumed tokens: 62185799680 | elapsed time per iteration (s): 0.75 | learning rate: 6.167E-05 | global batch size: 256 | lm loss: 1.960528E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.644 | TFLOPs: 20.73 | 31: iteration 118620/ 173500 | consumed samples: 30366720 | consumed tokens: 62191042560 | elapsed time per iteration (s): 0.78 | learning rate: 6.166E-05 | global batch size: 256 | lm loss: 1.944142E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.001 | TFLOPs: 19.90 | 31: iteration 118630/ 173500 | 
consumed samples: 30369280 | consumed tokens: 62196285440 | elapsed time per iteration (s): 0.74 | learning rate: 6.164E-05 | global batch size: 256 | lm loss: 1.965746E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.444 | TFLOPs: 20.90 | 31: iteration 118640/ 173500 | consumed samples: 30371840 | consumed tokens: 62201528320 | elapsed time per iteration (s): 0.80 | learning rate: 6.163E-05 | global batch size: 256 | lm loss: 1.953468E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.204 | TFLOPs: 19.31 | 31: iteration 118650/ 173500 | consumed samples: 30374400 | consumed tokens: 62206771200 | elapsed time per iteration (s): 0.85 | learning rate: 6.162E-05 | global batch size: 256 | lm loss: 1.944151E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.756 | TFLOPs: 18.26 | 31: iteration 118660/ 173500 | consumed samples: 30376960 | consumed tokens: 62212014080 | elapsed time per iteration (s): 0.82 | learning rate: 6.160E-05 | global batch size: 256 | lm loss: 1.932132E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.794 | TFLOPs: 18.92 | 31: iteration 118670/ 173500 | consumed samples: 30379520 | consumed tokens: 62217256960 | elapsed time per iteration (s): 0.76 | learning rate: 6.159E-05 | global batch size: 256 | lm loss: 1.942795E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.443 | TFLOPs: 20.35 | 31: iteration 118680/ 173500 | consumed samples: 30382080 | consumed tokens: 62222499840 | elapsed time per iteration (s): 0.79 | learning rate: 6.158E-05 | global batch size: 256 | lm loss: 1.944408E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 324.309 | TFLOPs: 19.62 | 31: iteration 118690/ 173500 | consumed samples: 30384640 | consumed tokens: 62227742720 | elapsed time per iteration (s): 0.80 | learning rate: 6.156E-05 | global batch size: 256 | lm loss: 1.955823E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.271 | TFLOPs: 19.25 | 31: iteration 118700/ 173500 | consumed samples: 30387200 | consumed tokens: 62232985600 | elapsed time per iteration (s): 0.90 | learning rate: 6.155E-05 | global batch size: 256 | lm loss: 1.963229E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.007 | TFLOPs: 17.24 | 31: iteration 118710/ 173500 | consumed samples: 30389760 | consumed tokens: 62238228480 | elapsed time per iteration (s): 0.81 | learning rate: 6.153E-05 | global batch size: 256 | lm loss: 1.962336E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.298 | TFLOPs: 19.14 | 31: iteration 118720/ 173500 | consumed samples: 30392320 | consumed tokens: 62243471360 | elapsed time per iteration (s): 0.84 | learning rate: 6.152E-05 | global batch size: 256 | lm loss: 1.937485E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.665 | TFLOPs: 18.43 | 31: iteration 118730/ 173500 | consumed samples: 30394880 | consumed tokens: 62248714240 | elapsed time per iteration (s): 0.80 | learning rate: 6.151E-05 | global batch size: 256 | lm loss: 1.935323E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.875 | TFLOPs: 19.41 | 31: iteration 118740/ 173500 | consumed samples: 30397440 | consumed tokens: 62253957120 | elapsed time per iteration (s): 0.84 | learning rate: 
6.149E-05 | global batch size: 256 | lm loss: 1.957349E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.510 | TFLOPs: 18.42 | 31: iteration 118750/ 173500 | consumed samples: 30400000 | consumed tokens: 62259200000 | elapsed time per iteration (s): 0.80 | learning rate: 6.148E-05 | global batch size: 256 | lm loss: 1.922343E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.336 | TFLOPs: 19.38 | 31: iteration 118760/ 173500 | consumed samples: 30402560 | consumed tokens: 62264442880 | elapsed time per iteration (s): 0.82 | learning rate: 6.146E-05 | global batch size: 256 | lm loss: 1.943129E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.230 | TFLOPs: 18.83 | 31: iteration 118770/ 173500 | consumed samples: 30405120 | consumed tokens: 62269685760 | elapsed time per iteration (s): 0.83 | learning rate: 6.145E-05 | global batch size: 256 | lm loss: 1.926578E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.326 | TFLOPs: 18.71 | 31: iteration 118780/ 173500 | consumed samples: 30407680 | consumed tokens: 62274928640 | elapsed time per iteration (s): 0.79 | learning rate: 6.144E-05 | global batch size: 256 | lm loss: 1.936098E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.550 | TFLOPs: 19.63 | 31: iteration 118790/ 173500 | consumed samples: 30410240 | consumed tokens: 62280171520 | elapsed time per iteration (s): 0.84 | learning rate: 6.142E-05 | global batch size: 256 | lm loss: 1.976491E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.909 | TFLOPs: 18.39 | 31: iteration 118800/ 173500 | 
consumed samples: 30412800 | consumed tokens: 62285414400 | elapsed time per iteration (s): 0.78 | learning rate: 6.141E-05 | global batch size: 256 | lm loss: 1.944620E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.790 | TFLOPs: 19.89 | 31: iteration 118810/ 173500 | consumed samples: 30415360 | consumed tokens: 62290657280 | elapsed time per iteration (s): 0.79 | learning rate: 6.139E-05 | global batch size: 256 | lm loss: 1.954882E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.645 | TFLOPs: 19.52 | 31: iteration 118820/ 173500 | consumed samples: 30417920 | consumed tokens: 62295900160 | elapsed time per iteration (s): 0.81 | learning rate: 6.138E-05 | global batch size: 256 | lm loss: 1.958176E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.178 | TFLOPs: 19.07 | 31: iteration 118830/ 173500 | consumed samples: 30420480 | consumed tokens: 62301143040 | elapsed time per iteration (s): 0.79 | learning rate: 6.137E-05 | global batch size: 256 | lm loss: 1.970175E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.085 | TFLOPs: 19.73 | 31: iteration 118840/ 173500 | consumed samples: 30423040 | consumed tokens: 62306385920 | elapsed time per iteration (s): 0.82 | learning rate: 6.135E-05 | global batch size: 256 | lm loss: 1.949941E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.237 | TFLOPs: 18.95 | 31: iteration 118850/ 173500 | consumed samples: 30425600 | consumed tokens: 62311628800 | elapsed time per iteration (s): 0.82 | learning rate: 6.134E-05 | global batch size: 256 | lm loss: 1.960693E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 313.453 | TFLOPs: 18.96 | 31: iteration 118860/ 173500 | consumed samples: 30428160 | consumed tokens: 62316871680 | elapsed time per iteration (s): 0.74 | learning rate: 6.133E-05 | global batch size: 256 | lm loss: 1.945192E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.991 | TFLOPs: 20.81 | 31: iteration 118870/ 173500 | consumed samples: 30430720 | consumed tokens: 62322114560 | elapsed time per iteration (s): 0.75 | learning rate: 6.131E-05 | global batch size: 256 | lm loss: 1.970978E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.041 | TFLOPs: 20.57 | 31: iteration 118880/ 173500 | consumed samples: 30433280 | consumed tokens: 62327357440 | elapsed time per iteration (s): 0.81 | learning rate: 6.130E-05 | global batch size: 256 | lm loss: 1.943085E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.878 | TFLOPs: 19.23 | 31: iteration 118890/ 173500 | consumed samples: 30435840 | consumed tokens: 62332600320 | elapsed time per iteration (s): 0.76 | learning rate: 6.128E-05 | global batch size: 256 | lm loss: 1.961795E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.875 | TFLOPs: 20.44 | 31: iteration 118900/ 173500 | consumed samples: 30438400 | consumed tokens: 62337843200 | elapsed time per iteration (s): 0.79 | learning rate: 6.127E-05 | global batch size: 256 | lm loss: 1.967053E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.263 | TFLOPs: 19.62 | 31: iteration 118910/ 173500 | consumed samples: 30440960 | consumed tokens: 62343086080 | elapsed time per iteration (s): 0.78 | learning rate: 
6.126E-05 | global batch size: 256 | lm loss: 1.934389E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.345 | TFLOPs: 19.74 | 31: iteration 118920/ 173500 | consumed samples: 30443520 | consumed tokens: 62348328960 | elapsed time per iteration (s): 0.82 | learning rate: 6.124E-05 | global batch size: 256 | lm loss: 1.960166E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.492 | TFLOPs: 18.90 | 31: iteration 118930/ 173500 | consumed samples: 30446080 | consumed tokens: 62353571840 | elapsed time per iteration (s): 0.71 | learning rate: 6.123E-05 | global batch size: 256 | lm loss: 1.946426E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.850 | TFLOPs: 21.71 | 31: iteration 118940/ 173500 | consumed samples: 30448640 | consumed tokens: 62358814720 | elapsed time per iteration (s): 0.79 | learning rate: 6.121E-05 | global batch size: 256 | lm loss: 1.939665E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.791 | TFLOPs: 19.59 | 31: iteration 118950/ 173500 | consumed samples: 30451200 | consumed tokens: 62364057600 | elapsed time per iteration (s): 0.80 | learning rate: 6.120E-05 | global batch size: 256 | lm loss: 1.936053E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.055 | TFLOPs: 19.24 | 31: iteration 118960/ 173500 | consumed samples: 30453760 | consumed tokens: 62369300480 | elapsed time per iteration (s): 0.77 | learning rate: 6.119E-05 | global batch size: 256 | lm loss: 1.947866E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.563 | TFLOPs: 20.12 | 31: iteration 118970/ 173500 | 
consumed samples: 30456320 | consumed tokens: 62374543360 | elapsed time per iteration (s): 0.73 | learning rate: 6.117E-05 | global batch size: 256 | lm loss: 1.944036E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.310 | TFLOPs: 21.19 | 31: iteration 118980/ 173500 | consumed samples: 30458880 | consumed tokens: 62379786240 | elapsed time per iteration (s): 0.78 | learning rate: 6.116E-05 | global batch size: 256 | lm loss: 1.926586E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.135 | TFLOPs: 19.91 | 31: iteration 118990/ 173500 | consumed samples: 30461440 | consumed tokens: 62385029120 | elapsed time per iteration (s): 0.73 | learning rate: 6.115E-05 | global batch size: 256 | lm loss: 1.939453E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.044 | TFLOPs: 21.24 | 31: iteration 119000/ 173500 | consumed samples: 30464000 | consumed tokens: 62390272000 | elapsed time per iteration (s): 0.77 | learning rate: 6.113E-05 | global batch size: 256 | lm loss: 1.938335E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.858 | TFLOPs: 20.08 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 119000 | lm loss value: 1.877051E+00 | lm loss PPL: 6.534210E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 119000 to checkpoints_1b1long 0: [2022-11-26 20:54:14,359] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step119000 is begin to save! 
0: [2022-11-26 20:54:14,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_01-model_00-model_states.pt... 0: [2022-11-26 20:54:14,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_01-model_00-model_states.pt. 0: [2022-11-26 20:54:14,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_03-model_00-model_states.pt... 0: [2022-11-26 20:54:14,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_03-model_00-model_states.pt. 0: [2022-11-26 20:54:14,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_04-model_00-model_states.pt... 0: [2022-11-26 20:54:14,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_04-model_00-model_states.pt. 0: [2022-11-26 20:54:14,771] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_05-model_00-model_states.pt... 0: [2022-11-26 20:54:14,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_05-model_00-model_states.pt. 0: [2022-11-26 20:54:14,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_06-model_00-model_states.pt... 0: [2022-11-26 20:54:14,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_06-model_00-model_states.pt. 0: [2022-11-26 20:54:14,927] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_07-model_00-model_states.pt... 0: [2022-11-26 20:54:15,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 20:54:15,004] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_08-model_00-model_states.pt... 0: [2022-11-26 20:54:15,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_08-model_00-model_states.pt. 0: [2022-11-26 20:54:15,085] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_09-model_00-model_states.pt... 0: [2022-11-26 20:54:15,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_09-model_00-model_states.pt. 0: [2022-11-26 20:54:15,158] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_10-model_00-model_states.pt... 0: [2022-11-26 20:54:15,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_10-model_00-model_states.pt. 0: [2022-11-26 20:54:15,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_11-model_00-model_states.pt... 0: [2022-11-26 20:54:15,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_11-model_00-model_states.pt. 0: [2022-11-26 20:54:15,307] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_12-model_00-model_states.pt... 0: [2022-11-26 20:54:15,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_12-model_00-model_states.pt. 0: [2022-11-26 20:54:15,382] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_13-model_00-model_states.pt... 0: [2022-11-26 20:54:15,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 20:54:15,456] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_14-model_00-model_states.pt... 0: [2022-11-26 20:54:15,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_14-model_00-model_states.pt. 0: [2022-11-26 20:54:15,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_15-model_00-model_states.pt... 0: [2022-11-26 20:54:15,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_15-model_00-model_states.pt. 0: [2022-11-26 20:54:15,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_16-model_00-model_states.pt... 0: [2022-11-26 20:54:15,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_16-model_00-model_states.pt. 0: [2022-11-26 20:54:15,679] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_17-model_00-model_states.pt... 0: [2022-11-26 20:54:15,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_17-model_00-model_states.pt. 0: [2022-11-26 20:54:15,756] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_18-model_00-model_states.pt... 0: [2022-11-26 20:54:15,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_18-model_00-model_states.pt. 0: [2022-11-26 20:54:15,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_19-model_00-model_states.pt... 0: [2022-11-26 20:54:15,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 20:54:15,905] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_20-model_00-model_states.pt... 0: [2022-11-26 20:54:15,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_20-model_00-model_states.pt. 0: [2022-11-26 20:54:15,980] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_21-model_00-model_states.pt... 0: [2022-11-26 20:54:16,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_21-model_00-model_states.pt. 0: [2022-11-26 20:54:16,055] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_22-model_00-model_states.pt... 0: [2022-11-26 20:54:16,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_22-model_00-model_states.pt. 0: [2022-11-26 20:54:16,127] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_23-model_00-model_states.pt... 0: [2022-11-26 20:54:16,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_23-model_00-model_states.pt. 0: [2022-11-26 20:54:16,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_24-model_00-model_states.pt... 0: [2022-11-26 20:54:16,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_24-model_00-model_states.pt. 0: [2022-11-26 20:54:16,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_25-model_00-model_states.pt... 0: [2022-11-26 20:54:16,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 20:54:16,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_26-model_00-model_states.pt... 0: [2022-11-26 20:54:16,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_26-model_00-model_states.pt. 0: [2022-11-26 20:54:16,428] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_27-model_00-model_states.pt... 0: [2022-11-26 20:54:16,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_27-model_00-model_states.pt. 0: [2022-11-26 20:54:16,504] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_28-model_00-model_states.pt... 0: [2022-11-26 20:54:16,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_28-model_00-model_states.pt. 0: [2022-11-26 20:54:16,581] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/layer_30-model_00-model_states.pt... 0: [2022-11-26 20:54:16,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/layer_30-model_00-model_states.pt. 0: [2022-11-26 20:54:16,583] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step119000/mp_rank_00_model_states.pt 0: [2022-11-26 20:54:16,583] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/mp_rank_00_model_states.pt... 0: [2022-11-26 20:54:16,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/mp_rank_00_model_states.pt. 0: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
6: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
7: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 
10: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
2: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 
25: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 
28: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 
30: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 
26: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
7: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
1: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
3: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 
25: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 
29: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 
18: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 
27: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 
8: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
15: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
31: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 
19: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
20: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 
17: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 
15: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 29: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 30: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 
26: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 28: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 21: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 
28: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 26: [2022-11-26 20:54:16,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 18: [2022-11-26 20:54:16,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:54:16,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 20:54:16,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 20: [2022-11-26 20:54:16,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:54:16,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 20:54:16,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 18: [2022-11-26 20:54:16,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:54:16,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 20:54:16,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
0: [2022-11-26 20:54:16,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:54:16,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 20:54:16,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 16: [2022-11-26 20:54:16,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:54:16,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:54:16,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 9: [2022-11-26 20:54:16,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 16: [2022-11-26 20:54:16,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 9: [2022-11-26 20:54:16,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 10: [2022-11-26 20:54:16,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:54:16,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 20:54:16,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
30: [2022-11-26 20:54:16,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:54:16,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 20:54:16,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 24: [2022-11-26 20:54:16,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:54:16,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 20:54:16,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 9: [2022-11-26 20:54:16,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:54:16,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:54:16,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 6: [2022-11-26 20:54:16,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 9: [2022-11-26 20:54:16,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 6: [2022-11-26 20:54:16,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
1: [2022-11-26 20:54:16,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:54:16,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:54:16,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 20: [2022-11-26 20:54:16,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 20:54:16,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 1: [2022-11-26 20:54:16,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 4: [2022-11-26 20:54:16,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:54:16,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:54:16,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 20:54:16,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 20:54:16,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 4: [2022-11-26 20:54:16,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
13: [2022-11-26 20:54:16,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:54:16,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:54:16,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:54:16,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 20:54:16,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 26: [2022-11-26 20:54:16,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:54:16,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 20:54:16,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 7: [2022-11-26 20:54:16,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 20:54:16,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 28: [2022-11-26 20:54:16,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 20:54:16,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 7: [2022-11-26 20:54:16,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 28: [2022-11-26 20:54:16,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 28: [2022-11-26 20:54:16,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 30: [2022-11-26 20:54:16,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:54:16,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 0: [2022-11-26 20:54:16,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:54:16,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 0: [2022-11-26 20:54:16,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 20:54:16,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
31: [2022-11-26 20:54:16,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:54:16,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 20:54:16,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 21: [2022-11-26 20:54:16,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:54:16,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 20:54:16,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 29: [2022-11-26 20:54:16,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:54:16,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 20:54:16,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 21: [2022-11-26 20:54:16,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:54:16,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 20:54:16,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
26: [2022-11-26 20:54:16,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:54:16,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 20:54:16,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 6: [2022-11-26 20:54:16,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:54:16,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:54:16,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 8: [2022-11-26 20:54:16,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 6: [2022-11-26 20:54:16,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 8: [2022-11-26 20:54:16,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 9: [2022-11-26 20:54:16,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:54:16,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 20:54:16,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
11: [2022-11-26 20:54:16,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:54:16,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:54:16,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 20:54:16,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 20:54:16,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 11: [2022-11-26 20:54:16,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 28: [2022-11-26 20:54:16,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:54:16,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:54:16,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
10: [2022-11-26 20:54:16,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 29: [2022-11-26 20:54:16,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 10: [2022-11-26 20:54:16,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 29: [2022-11-26 20:54:16,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 11: [2022-11-26 20:54:16,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:54:16,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 20:54:16,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 7: [2022-11-26 20:54:16,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:54:16,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 20:54:16,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 8: [2022-11-26 20:54:16,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:54:16,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
8: [2022-11-26 20:54:16,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 15: [2022-11-26 20:54:16,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 8: [2022-11-26 20:54:16,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 15: [2022-11-26 20:54:16,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 13: [2022-11-26 20:54:16,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:54:16,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 24: [2022-11-26 20:54:16,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:54:16,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 24: [2022-11-26 20:54:16,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 20:54:16,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 24: [2022-11-26 20:54:16,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
24: [2022-11-26 20:54:16,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 20:54:16,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 11: [2022-11-26 20:54:16,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:54:16,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 20:54:16,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 20: [2022-11-26 20:54:16,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:54:16,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 20:54:16,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 31: [2022-11-26 20:54:16,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:54:16,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:54:16,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 20:54:16,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
16: [2022-11-26 20:54:16,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 20:54:16,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 22: [2022-11-26 20:54:16,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:54:16,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 20:54:16,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 22: [2022-11-26 20:54:16,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:54:16,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 20:54:16,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 18: [2022-11-26 20:54:16,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:54:16,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 20:54:16,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 16: [2022-11-26 20:54:16,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
16: [2022-11-26 20:54:16,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 4: [2022-11-26 20:54:16,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:54:16,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 17: [2022-11-26 20:54:16,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:54:16,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 20:54:16,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 17: [2022-11-26 20:54:16,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 20:54:16,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 17: [2022-11-26 20:54:16,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:54:16,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 20:54:16,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 17: [2022-11-26 20:54:16,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-26 20:54:16,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 20:54:16,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 28: [2022-11-26 20:54:16,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 20:54:16,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 10: [2022-11-26 20:54:16,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:54:16,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 28: [2022-11-26 20:54:16,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:54:16,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 28: [2022-11-26 20:54:16,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 20:54:16,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 21: [2022-11-26 20:54:16,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-26 20:54:16,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 20:54:16,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 4: [2022-11-26 20:54:16,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:54:16,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 20:54:16,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 8: [2022-11-26 20:54:16,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:54:16,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 20:54:16,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 1: [2022-11-26 20:54:16,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:54:16,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 20:54:16,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 13: [2022-11-26 20:54:16,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 20:54:16,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 20:54:16,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 30: [2022-11-26 20:54:16,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:54:16,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:54:16,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 30: [2022-11-26 20:54:16,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 20:54:16,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 18: [2022-11-26 20:54:16,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 1: [2022-11-26 20:54:16,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:54:16,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 20:54:16,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 8: [2022-11-26 20:54:16,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-26 20:54:16,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 20:54:16,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 30: [2022-11-26 20:54:16,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:54:16,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 20:54:16,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 22: [2022-11-26 20:54:16,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:54:16,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 20:54:16,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 26: [2022-11-26 20:54:16,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:54:16,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-26 20:54:16,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 20:54:16,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 20:54:16,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 26: [2022-11-26 20:54:16,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 6: [2022-11-26 20:54:16,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:54:16,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:54:16,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 7: [2022-11-26 20:54:16,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 6: [2022-11-26 20:54:16,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 7: [2022-11-26 20:54:16,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 1: [2022-11-26 20:54:16,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 20:54:16,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 20:54:16,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 13: [2022-11-26 20:54:16,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:54:16,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 15: [2022-11-26 20:54:16,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:54:16,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:54:16,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 15: [2022-11-26 20:54:16,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 20:54:16,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 20:54:16,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 15: [2022-11-26 20:54:16,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 9: [2022-11-26 20:54:16,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-26 20:54:16,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 15: [2022-11-26 20:54:16,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:54:16,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 21: [2022-11-26 20:54:16,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:54:16,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 21: [2022-11-26 20:54:16,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 29: [2022-11-26 20:54:16,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:54:16,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 29: [2022-11-26 20:54:16,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 21: [2022-11-26 20:54:16,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 29: [2022-11-26 20:54:16,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 20: [2022-11-26 20:54:16,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
29: [2022-11-26 20:54:16,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:54:16,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 29: [2022-11-26 20:54:16,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 20: [2022-11-26 20:54:16,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 29: [2022-11-26 20:54:16,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 3: [2022-11-26 20:54:16,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:54:16,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:54:16,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:54:16,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 20:54:16,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 20:54:16,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 20:54:16,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 20:54:16,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 20:54:16,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 3: [2022-11-26 20:54:16,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 3: [2022-11-26 20:54:16,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 3: [2022-11-26 20:54:16,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 10: [2022-11-26 20:54:16,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:54:16,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 7: [2022-11-26 20:54:16,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:54:16,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
7: [2022-11-26 20:54:16,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 20:54:16,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 14: [2022-11-26 20:54:16,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:54:16,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:54:16,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 23: [2022-11-26 20:54:16,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 14: [2022-11-26 20:54:16,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 23: [2022-11-26 20:54:16,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:54:16,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 23: [2022-11-26 20:54:16,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 20:54:16,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 23: [2022-11-26 20:54:16,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
23: [2022-11-26 20:54:16,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 20:54:16,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 23: [2022-11-26 20:54:16,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:54:16,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 24: [2022-11-26 20:54:16,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:54:16,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 24: [2022-11-26 20:54:16,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 20:54:16,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 22: [2022-11-26 20:54:16,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:54:16,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 20:54:16,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 14: [2022-11-26 20:54:16,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
31: [2022-11-26 20:54:16,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:54:16,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:54:16,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 31: [2022-11-26 20:54:16,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 20:54:16,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 14: [2022-11-26 20:54:16,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 14: [2022-11-26 20:54:16,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:54:16,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 31: [2022-11-26 20:54:16,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 14: [2022-11-26 20:54:16,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 20:54:16,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 17: [2022-11-26 20:54:16,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 
16: [2022-11-26 20:54:16,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:54:16,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 20:54:16,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 17: [2022-11-26 20:54:16,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 20:54:16,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 0: [2022-11-26 20:54:16,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:54:16,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 20:54:16,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 0: [2022-11-26 20:54:16,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:54:16,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 20:54:16,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 0: [2022-11-26 20:54:16,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 20:54:16,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 20:54:16,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 23: [2022-11-26 20:54:16,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:54:16,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 20:54:16,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 28: [2022-11-26 20:54:16,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:54:16,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:54:16,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 20:54:16,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 5: [2022-11-26 20:54:16,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:54:16,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 20:54:16,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 20:54:16,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 5: [2022-11-26 20:54:16,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 20:54:16,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 25: [2022-11-26 20:54:16,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:54:16,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:54:16,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 20:54:16,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 20:54:16,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 25: [2022-11-26 20:54:16,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 28: [2022-11-26 20:54:16,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 20:54:16,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
24: [2022-11-26 20:54:16,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:54:16,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 20:54:16,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 12: [2022-11-26 20:54:16,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:54:16,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:54:16,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:54:16,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-26 20:54:16,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 20:54:16,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 20:54:16,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 20:54:16,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 20:54:16,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 12: [2022-11-26 20:54:16,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 12: [2022-11-26 20:54:16,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 12: [2022-11-26 20:54:16,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 14: [2022-11-26 20:54:16,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:54:16,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 20:54:16,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 19: [2022-11-26 20:54:16,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
19: [2022-11-26 20:54:16,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:54:16,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 20:54:16,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 20:54:16,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 19: [2022-11-26 20:54:16,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 1: [2022-11-26 20:54:16,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:54:16,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 20:54:16,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 27: [2022-11-26 20:54:16,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 20:54:16,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:54:16,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:54:16,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-26 20:54:16,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 27: [2022-11-26 20:54:16,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 20:54:16,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 2: [2022-11-26 20:54:16,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 27: [2022-11-26 20:54:16,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 27: [2022-11-26 20:54:16,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 2: [2022-11-26 20:54:16,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 2: [2022-11-26 20:54:16,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 6: [2022-11-26 20:54:16,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:54:16,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 20:54:16,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 17: [2022-11-26 20:54:16,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-26 20:54:16,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 20:54:16,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 4: [2022-11-26 20:54:16,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:54:16,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 20:54:16,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 20: [2022-11-26 20:54:16,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:54:16,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 20:54:16,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 11: [2022-11-26 20:54:16,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:54:16,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 20:54:16,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 18: [2022-11-26 20:54:16,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
18: [2022-11-26 20:54:16,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 20:54:16,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 14: [2022-11-26 20:54:16,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:54:16,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 20:54:16,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 30: [2022-11-26 20:54:16,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:54:16,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 20:54:16,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 31: [2022-11-26 20:54:16,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:54:16,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 20:54:16,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 7: [2022-11-26 20:54:16,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 20:54:16,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 20:54:16,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 3: [2022-11-26 20:54:16,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:54:16,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 20:54:16,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 8: [2022-11-26 20:54:16,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:54:16,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 20:54:16,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 29: [2022-11-26 20:54:16,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:54:16,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 20:54:16,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 25: [2022-11-26 20:54:16,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 
25: [2022-11-26 20:54:16,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 20:54:16,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 26: [2022-11-26 20:54:16,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:54:16,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 20:54:16,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 19: [2022-11-26 20:54:16,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:54:16,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 20:54:16,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 0: [2022-11-26 20:54:16,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:54:16,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 20:54:16,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 15: [2022-11-26 20:54:16,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
22: [2022-11-26 20:54:16,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:54:16,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 22: [2022-11-26 20:54:16,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 15: [2022-11-26 20:54:16,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 22: [2022-11-26 20:54:16,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 12: [2022-11-26 20:54:16,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:54:16,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 20:54:16,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 17: [2022-11-26 20:54:16,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:54:16,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 24: [2022-11-26 20:54:16,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:54:16,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
24: [2022-11-26 20:54:16,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 27: [2022-11-26 20:54:16,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 20:54:16,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 24: [2022-11-26 20:54:16,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 27: [2022-11-26 20:54:16,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 13: [2022-11-26 20:54:16,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:54:16,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 28: [2022-11-26 20:54:16,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:54:16,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 10: [2022-11-26 20:54:16,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
28: [2022-11-26 20:54:16,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 10: [2022-11-26 20:54:16,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 28: [2022-11-26 20:54:16,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 10: [2022-11-26 20:54:16,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 9: [2022-11-26 20:54:16,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:54:16,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 20:54:16,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 4: [2022-11-26 20:54:16,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:54:16,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 20:54:16,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 5: [2022-11-26 20:54:16,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-26 20:54:16,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 20:54:16,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 16: [2022-11-26 20:54:16,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:54:16,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 20:54:16,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 6: [2022-11-26 20:54:16,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:54:16,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 20:54:16,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 2: [2022-11-26 20:54:16,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:54:16,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 20:54:16,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 23: [2022-11-26 20:54:16,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-26 20:54:16,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 20:54:16,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 1: [2022-11-26 20:54:16,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:54:16,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 20:54:16,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 20: [2022-11-26 20:54:16,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:54:16,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 20:54:16,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 18: [2022-11-26 20:54:16,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:54:16,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 20:54:16,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 3: [2022-11-26 20:54:16,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 20:54:16,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 20:54:16,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 14: [2022-11-26 20:54:16,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:54:16,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:54:16,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 11: [2022-11-26 20:54:16,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 14: [2022-11-26 20:54:16,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 11: [2022-11-26 20:54:16,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 7: [2022-11-26 20:54:16,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:54:16,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 20:54:16,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 30: [2022-11-26 20:54:16,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
31: [2022-11-26 20:54:16,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:54:16,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 31: [2022-11-26 20:54:16,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 30: [2022-11-26 20:54:16,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 31: [2022-11-26 20:54:16,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 26: [2022-11-26 20:54:16,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:54:16,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 20:54:16,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 8: [2022-11-26 20:54:16,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:54:16,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 20:54:16,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 27: [2022-11-26 20:54:16,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
27: [2022-11-26 20:54:16,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 20:54:16,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 29: [2022-11-26 20:54:16,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:54:16,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 20:54:16,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 19: [2022-11-26 20:54:16,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:54:16,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 20:54:16,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 12: [2022-11-26 20:54:16,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:54:16,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 20:54:16,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 5: [2022-11-26 20:54:16,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 20:54:16,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 20:54:16,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 25: [2022-11-26 20:54:16,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 28: [2022-11-26 20:54:16,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:54:16,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:54:16,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 9: [2022-11-26 20:54:16,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 25: [2022-11-26 20:54:16,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 9: [2022-11-26 20:54:16,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 16: [2022-11-26 20:54:16,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:54:16,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 20:54:16,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
10: [2022-11-26 20:54:16,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:54:16,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:54:16,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 20:54:16,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 0: [2022-11-26 20:54:16,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 20:54:16,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 24: [2022-11-26 20:54:16,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:54:16,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:54:16,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 4: [2022-11-26 20:54:16,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
15: [2022-11-26 20:54:16,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 24: [2022-11-26 20:54:16,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 15: [2022-11-26 20:54:16,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 4: [2022-11-26 20:54:16,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 28: [2022-11-26 20:54:16,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 4: [2022-11-26 20:54:16,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 28: [2022-11-26 20:54:16,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 22: [2022-11-26 20:54:16,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:54:16,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 20:54:16,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 2: [2022-11-26 20:54:16,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 20:54:16,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 20:54:16,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 23: [2022-11-26 20:54:16,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:54:16,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 20:54:16,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 13: [2022-11-26 20:54:16,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:54:16,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 20:54:16,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 6: [2022-11-26 20:54:16,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:54:16,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 20:54:16,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 1: [2022-11-26 20:54:16,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 20:54:16,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 20:54:16,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 20: [2022-11-26 20:54:16,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:54:16,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 20:54:16,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 11: [2022-11-26 20:54:16,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:54:16,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 20:54:16,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 18: [2022-11-26 20:54:16,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 20:54:16,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 20:54:16,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 17: [2022-11-26 20:54:16,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-26 20:54:16,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 20:54:16,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 8: [2022-11-26 20:54:16,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:54:16,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 20:54:16,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 30: [2022-11-26 20:54:16,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:54:16,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 20:54:16,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 14: [2022-11-26 20:54:16,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:54:16,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 20:54:16,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 7: [2022-11-26 20:54:16,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 20:54:16,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 20:54:16,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 31: [2022-11-26 20:54:16,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:54:16,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 20:54:16,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 3: [2022-11-26 20:54:16,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:54:16,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 20:54:16,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 16: [2022-11-26 20:54:16,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:54:16,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 20:54:16,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 26: [2022-11-26 20:54:16,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
26: [2022-11-26 20:54:16,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 20:54:16,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 29: [2022-11-26 20:54:16,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:54:16,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 20:54:16,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 12: [2022-11-26 20:54:16,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:54:16,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 20:54:16,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 27: [2022-11-26 20:54:16,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 20:54:16,885] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 20:54:16,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 0: [2022-11-26 20:54:16,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
23: [2022-11-26 20:54:16,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 20:54:16,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 20:54:16,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 24: [2022-11-26 20:54:16,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 20:54:16,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 20:54:16,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 9: [2022-11-26 20:54:16,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:54:16,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 20:54:16,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 10: [2022-11-26 20:54:16,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:54:16,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 20:54:16,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
17: [2022-11-26 20:54:16,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 20:54:16,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 20:54:16,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 15: [2022-11-26 20:54:16,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:54:16,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 20:54:16,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 25: [2022-11-26 20:54:16,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:54:16,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 20:54:16,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 4: [2022-11-26 20:54:16,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:54:16,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 20:54:16,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
22: [2022-11-26 20:54:16,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:54:16,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 20:54:16,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 20: [2022-11-26 20:54:16,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 20:54:16,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 20:54:16,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 3: [2022-11-26 20:54:16,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:54:16,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:54:16,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 3: [2022-11-26 20:54:16,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 2: [2022-11-26 20:54:16,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 3: [2022-11-26 20:54:16,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
31: [2022-11-26 20:54:16,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 20:54:16,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 20:54:16,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 13: [2022-11-26 20:54:16,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:54:16,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 20:54:16,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 30: [2022-11-26 20:54:16,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 20:54:16,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 20:54:16,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 29: [2022-11-26 20:54:16,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 20:54:16,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 20:54:16,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
8: [2022-11-26 20:54:16,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:54:16,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 20:54:16,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 1: [2022-11-26 20:54:16,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:54:16,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:54:16,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 20:54:16,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 18: [2022-11-26 20:54:16,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:54:16,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 20:54:16,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 11: [2022-11-26 20:54:16,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
18: [2022-11-26 20:54:16,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 20:54:16,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 11: [2022-11-26 20:54:16,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 20:54:16,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 16: [2022-11-26 20:54:16,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 20:54:16,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 20:54:16,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 6: [2022-11-26 20:54:16,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:54:16,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 20:54:16,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 14: [2022-11-26 20:54:16,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-26 20:54:16,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 20:54:16,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 15: [2022-11-26 20:54:16,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:54:16,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:54:16,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 7: [2022-11-26 20:54:16,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:54:16,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 20:54:16,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 15: [2022-11-26 20:54:16,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 7: [2022-11-26 20:54:16,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 20:54:16,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 25: [2022-11-26 20:54:16,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
25: [2022-11-26 20:54:16,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 20:54:16,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 26: [2022-11-26 20:54:16,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 20:54:16,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 20:54:16,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 5: [2022-11-26 20:54:16,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:54:16,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 20:54:16,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 27: [2022-11-26 20:54:16,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 20:54:16,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 20:54:16,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 13: [2022-11-26 20:54:16,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-26 20:54:16,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 20:54:16,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 0: [2022-11-26 20:54:16,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 20:54:16,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 10: [2022-11-26 20:54:16,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:54:16,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 20:54:16,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 6: [2022-11-26 20:54:16,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:54:16,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 20:54:16,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 2: [2022-11-26 20:54:16,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 20:54:16,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 20:54:16,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 5: [2022-11-26 20:54:16,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:54:16,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 20:54:16,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 22: [2022-11-26 20:54:16,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 20:54:16,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 20:54:16,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 2: [2022-11-26 20:54:16,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:54:16,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 20:54:16,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 25: [2022-11-26 20:54:16,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
25: [2022-11-26 20:54:16,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 27: [2022-11-26 20:54:16,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 20:54:16,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 25: [2022-11-26 20:54:16,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 27: [2022-11-26 20:54:16,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 19: [2022-11-26 20:54:16,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:54:16,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 20:54:16,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 28: [2022-11-26 20:54:16,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:54:16,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:54:16,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 20:54:16,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
19: [2022-11-26 20:54:16,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:54:16,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 20:54:16,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 19: [2022-11-26 20:54:16,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 20:54:16,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 20:54:16,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 28: [2022-11-26 20:54:16,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 20:54:16,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 5: [2022-11-26 20:54:16,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:54:16,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 20:54:16,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 2: [2022-11-26 20:54:16,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 20:54:16,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 20:54:16,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 27: [2022-11-26 20:54:16,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 20:54:16,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 20:54:16,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 25: [2022-11-26 20:54:16,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 20:54:16,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 20:54:16,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 21: [2022-11-26 20:54:17,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:54:17,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 20:54:17,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 21: [2022-11-26 20:54:17,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
21: [2022-11-26 20:54:17,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 20:54:17,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 21: [2022-11-26 20:54:17,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:54:17,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 20:54:17,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 21: [2022-11-26 20:54:17,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 20:54:17,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step119000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 20:54:17,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
0: successfully saved checkpoint at iteration 119000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2750.08 31: iteration 119010/ 173500 | consumed samples: 30466560 | consumed tokens: 62395514880 | elapsed time per iteration (s): 1.12 | learning rate: 6.112E-05 | global batch size: 256 | lm loss: 1.941600E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.572 | TFLOPs: 13.89 | 31: iteration 119020/ 173500 | consumed samples: 30469120 | consumed tokens: 62400757760 | elapsed time per iteration (s): 0.82 | learning rate: 6.110E-05 | global batch size: 256 | lm loss: 1.955852E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.384 | TFLOPs: 18.78 | 31: iteration 119030/ 173500 | consumed samples: 30471680 | consumed tokens: 62406000640 | elapsed time per iteration (s): 0.98 | learning rate: 6.109E-05 | global batch size: 256 | lm loss: 1.966933E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 261.416 | TFLOPs: 15.82 | 31: iteration 119040/ 173500 | consumed samples: 30474240 | consumed tokens: 62411243520 | elapsed time per iteration (s): 0.79 | learning rate: 6.108E-05 | global batch size: 256 | lm loss: 1.947404E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.372 | TFLOPs: 19.62 | 31: iteration 119050/ 173500 | consumed samples: 30476800 | consumed tokens: 62416486400 | elapsed time per iteration (s): 0.84 | learning rate: 6.106E-05 | global batch size: 256 | lm loss: 1.944233E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.025 | TFLOPs: 18.33 | 31: iteration 119060/ 173500 | consumed samples: 30479360 | consumed tokens: 62421729280 | elapsed time per iteration (s): 
0.83 | learning rate: 6.105E-05 | global batch size: 256 | lm loss: 1.956007E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.349 | TFLOPs: 18.65 | 31: iteration 119070/ 173500 | consumed samples: 30481920 | consumed tokens: 62426972160 | elapsed time per iteration (s): 0.84 | learning rate: 6.104E-05 | global batch size: 256 | lm loss: 1.932200E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.204 | TFLOPs: 18.34 | 31: iteration 119080/ 173500 | consumed samples: 30484480 | consumed tokens: 62432215040 | elapsed time per iteration (s): 0.84 | learning rate: 6.102E-05 | global batch size: 256 | lm loss: 1.952228E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.352 | TFLOPs: 18.47 | 31: iteration 119090/ 173500 | consumed samples: 30487040 | consumed tokens: 62437457920 | elapsed time per iteration (s): 0.86 | learning rate: 6.101E-05 | global batch size: 256 | lm loss: 1.938617E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.815 | TFLOPs: 17.96 | 31: iteration 119100/ 173500 | consumed samples: 30489600 | consumed tokens: 62442700800 | elapsed time per iteration (s): 0.84 | learning rate: 6.099E-05 | global batch size: 256 | lm loss: 1.961225E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.843 | TFLOPs: 18.38 | 31: iteration 119110/ 173500 | consumed samples: 30492160 | consumed tokens: 62447943680 | elapsed time per iteration (s): 0.96 | learning rate: 6.098E-05 | global batch size: 256 | lm loss: 1.957093E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 266.958 | TFLOPs: 16.15 | 31: 
iteration 119120/ 173500 | consumed samples: 30494720 | consumed tokens: 62453186560 | elapsed time per iteration (s): 0.86 | learning rate: 6.097E-05 | global batch size: 256 | lm loss: 1.972740E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.359 | TFLOPs: 17.93 | 31: iteration 119130/ 173500 | consumed samples: 30497280 | consumed tokens: 62458429440 | elapsed time per iteration (s): 0.84 | learning rate: 6.095E-05 | global batch size: 256 | lm loss: 1.971351E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.030 | TFLOPs: 18.51 | 31: iteration 119140/ 173500 | consumed samples: 30499840 | consumed tokens: 62463672320 | elapsed time per iteration (s): 0.82 | learning rate: 6.094E-05 | global batch size: 256 | lm loss: 1.934018E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.981 | TFLOPs: 18.87 | 31: iteration 119150/ 173500 | consumed samples: 30502400 | consumed tokens: 62468915200 | elapsed time per iteration (s): 0.85 | learning rate: 6.092E-05 | global batch size: 256 | lm loss: 1.932388E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.318 | TFLOPs: 18.29 | 31: iteration 119160/ 173500 | consumed samples: 30504960 | consumed tokens: 62474158080 | elapsed time per iteration (s): 0.81 | learning rate: 6.091E-05 | global batch size: 256 | lm loss: 1.934449E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.503 | TFLOPs: 19.03 | 31: iteration 119170/ 173500 | consumed samples: 30507520 | consumed tokens: 62479400960 | elapsed time per iteration (s): 0.83 | learning rate: 6.090E-05 | global batch size: 256 | lm loss: 1.968069E+00 | grad norm: 0.170 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.017 | TFLOPs: 18.76 | 31: iteration 119180/ 173500 | consumed samples: 30510080 | consumed tokens: 62484643840 | elapsed time per iteration (s): 0.83 | learning rate: 6.088E-05 | global batch size: 256 | lm loss: 1.937303E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.816 | TFLOPs: 18.74 | 31: iteration 119190/ 173500 | consumed samples: 30512640 | consumed tokens: 62489886720 | elapsed time per iteration (s): 0.81 | learning rate: 6.087E-05 | global batch size: 256 | lm loss: 1.942117E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.226 | TFLOPs: 19.07 | 31: iteration 119200/ 173500 | consumed samples: 30515200 | consumed tokens: 62495129600 | elapsed time per iteration (s): 0.82 | learning rate: 6.086E-05 | global batch size: 256 | lm loss: 1.940969E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.994 | TFLOPs: 18.94 | 31: iteration 119210/ 173500 | consumed samples: 30517760 | consumed tokens: 62500372480 | elapsed time per iteration (s): 0.79 | learning rate: 6.084E-05 | global batch size: 256 | lm loss: 1.950337E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.316 | TFLOPs: 19.50 | 31: iteration 119220/ 173500 | consumed samples: 30520320 | consumed tokens: 62505615360 | elapsed time per iteration (s): 0.73 | learning rate: 6.083E-05 | global batch size: 256 | lm loss: 1.942529E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.127 | TFLOPs: 21.12 | 31: iteration 119230/ 173500 | consumed samples: 30522880 | consumed tokens: 62510858240 | elapsed time per iteration (s): 0.76 | 
learning rate: 6.081E-05 | global batch size: 256 | lm loss: 1.936883E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.150 | TFLOPs: 20.28 | 31: iteration 119240/ 173500 | consumed samples: 30525440 | consumed tokens: 62516101120 | elapsed time per iteration (s): 0.71 | learning rate: 6.080E-05 | global batch size: 256 | lm loss: 1.967833E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 359.159 | TFLOPs: 21.73 | 31: iteration 119250/ 173500 | consumed samples: 30528000 | consumed tokens: 62521344000 | elapsed time per iteration (s): 0.82 | learning rate: 6.079E-05 | global batch size: 256 | lm loss: 1.935483E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.647 | TFLOPs: 18.85 | 31: iteration 119260/ 173500 | consumed samples: 30530560 | consumed tokens: 62526586880 | elapsed time per iteration (s): 0.77 | learning rate: 6.077E-05 | global batch size: 256 | lm loss: 1.939966E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.109 | TFLOPs: 20.15 | 31: iteration 119270/ 173500 | consumed samples: 30533120 | consumed tokens: 62531829760 | elapsed time per iteration (s): 0.75 | learning rate: 6.076E-05 | global batch size: 256 | lm loss: 1.942900E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.081 | TFLOPs: 20.63 | 31: iteration 119280/ 173500 | consumed samples: 30535680 | consumed tokens: 62537072640 | elapsed time per iteration (s): 0.71 | learning rate: 6.075E-05 | global batch size: 256 | lm loss: 1.939286E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 359.263 | TFLOPs: 21.73 | 31: iteration 
119290/ 173500 | consumed samples: 30538240 | consumed tokens: 62542315520 | elapsed time per iteration (s): 0.72 | learning rate: 6.073E-05 | global batch size: 256 | lm loss: 1.981635E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.935 | TFLOPs: 21.41 | 31: iteration 119300/ 173500 | consumed samples: 30540800 | consumed tokens: 62547558400 | elapsed time per iteration (s): 0.73 | learning rate: 6.072E-05 | global batch size: 256 | lm loss: 1.962611E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.871 | TFLOPs: 21.17 | 31: iteration 119310/ 173500 | consumed samples: 30543360 | consumed tokens: 62552801280 | elapsed time per iteration (s): 0.72 | learning rate: 6.070E-05 | global batch size: 256 | lm loss: 1.947657E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.116 | TFLOPs: 21.48 | 31: iteration 119320/ 173500 | consumed samples: 30545920 | consumed tokens: 62558044160 | elapsed time per iteration (s): 0.73 | learning rate: 6.069E-05 | global batch size: 256 | lm loss: 1.925763E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.387 | TFLOPs: 21.32 | 31: iteration 119330/ 173500 | consumed samples: 30548480 | consumed tokens: 62563287040 | elapsed time per iteration (s): 0.77 | learning rate: 6.068E-05 | global batch size: 256 | lm loss: 1.939666E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.555 | TFLOPs: 20.24 | 31: iteration 119340/ 173500 | consumed samples: 30551040 | consumed tokens: 62568529920 | elapsed time per iteration (s): 0.80 | learning rate: 6.066E-05 | global batch size: 256 | lm loss: 1.949068E+00 | grad norm: 0.155 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.896 | TFLOPs: 19.29 | 31: iteration 119350/ 173500 | consumed samples: 30553600 | consumed tokens: 62573772800 | elapsed time per iteration (s): 0.75 | learning rate: 6.065E-05 | global batch size: 256 | lm loss: 1.958776E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.409 | TFLOPs: 20.65 | 31: iteration 119360/ 173500 | consumed samples: 30556160 | consumed tokens: 62579015680 | elapsed time per iteration (s): 0.78 | learning rate: 6.064E-05 | global batch size: 256 | lm loss: 1.946242E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.176 | TFLOPs: 19.91 | 31: iteration 119370/ 173500 | consumed samples: 30558720 | consumed tokens: 62584258560 | elapsed time per iteration (s): 0.77 | learning rate: 6.062E-05 | global batch size: 256 | lm loss: 1.951271E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.777 | TFLOPs: 20.13 | 31: iteration 119380/ 173500 | consumed samples: 30561280 | consumed tokens: 62589501440 | elapsed time per iteration (s): 0.77 | learning rate: 6.061E-05 | global batch size: 256 | lm loss: 1.936829E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.781 | TFLOPs: 20.19 | 31: iteration 119390/ 173500 | consumed samples: 30563840 | consumed tokens: 62594744320 | elapsed time per iteration (s): 0.75 | learning rate: 6.059E-05 | global batch size: 256 | lm loss: 1.954431E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.585 | TFLOPs: 20.67 | 31: iteration 119400/ 173500 | consumed samples: 30566400 | consumed tokens: 62599987200 | elapsed time per iteration (s): 0.74 | learning 
rate: 6.058E-05 | global batch size: 256 | lm loss: 1.912576E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.464 | TFLOPs: 21.02 | 31: iteration 119410/ 173500 | consumed samples: 30568960 | consumed tokens: 62605230080 | elapsed time per iteration (s): 0.76 | learning rate: 6.057E-05 | global batch size: 256 | lm loss: 1.973306E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.515 | TFLOPs: 20.30 | 31: iteration 119420/ 173500 | consumed samples: 30571520 | consumed tokens: 62610472960 | elapsed time per iteration (s): 0.73 | learning rate: 6.055E-05 | global batch size: 256 | lm loss: 1.963527E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.561 | TFLOPs: 21.15 | 31: iteration 119430/ 173500 | consumed samples: 30574080 | consumed tokens: 62615715840 | elapsed time per iteration (s): 0.76 | learning rate: 6.054E-05 | global batch size: 256 | lm loss: 1.964401E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.607 | TFLOPs: 20.36 | 31: iteration 119440/ 173500 | consumed samples: 30576640 | consumed tokens: 62620958720 | elapsed time per iteration (s): 0.80 | learning rate: 6.053E-05 | global batch size: 256 | lm loss: 1.920730E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.062 | TFLOPs: 19.36 | 31: iteration 119450/ 173500 | consumed samples: 30579200 | consumed tokens: 62626201600 | elapsed time per iteration (s): 0.74 | learning rate: 6.051E-05 | global batch size: 256 | lm loss: 1.935525E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.841 | TFLOPs: 20.86 | 31: iteration 119460/ 
173500 | consumed samples: 30581760 | consumed tokens: 62631444480 | elapsed time per iteration (s): 0.75 | learning rate: 6.050E-05 | global batch size: 256 | lm loss: 1.956361E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.759 | TFLOPs: 20.55 | 31: iteration 119470/ 173500 | consumed samples: 30584320 | consumed tokens: 62636687360 | elapsed time per iteration (s): 0.76 | learning rate: 6.048E-05 | global batch size: 256 | lm loss: 1.943985E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.219 | TFLOPs: 20.28 | 31: iteration 119480/ 173500 | consumed samples: 30586880 | consumed tokens: 62641930240 | elapsed time per iteration (s): 0.74 | learning rate: 6.047E-05 | global batch size: 256 | lm loss: 1.963007E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.376 | TFLOPs: 21.02 | 31: iteration 119490/ 173500 | consumed samples: 30589440 | consumed tokens: 62647173120 | elapsed time per iteration (s): 0.76 | learning rate: 6.046E-05 | global batch size: 256 | lm loss: 1.944101E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.235 | TFLOPs: 20.40 | 31: iteration 119500/ 173500 | consumed samples: 30592000 | consumed tokens: 62652416000 | elapsed time per iteration (s): 0.76 | learning rate: 6.044E-05 | global batch size: 256 | lm loss: 1.936978E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.582 | TFLOPs: 20.36 | 31: iteration 119510/ 173500 | consumed samples: 30594560 | consumed tokens: 62657658880 | elapsed time per iteration (s): 0.75 | learning rate: 6.043E-05 | global batch size: 256 | lm loss: 1.965550E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 339.462 | TFLOPs: 20.54 | 31: iteration 119520/ 173500 | consumed samples: 30597120 | consumed tokens: 62662901760 | elapsed time per iteration (s): 0.79 | learning rate: 6.042E-05 | global batch size: 256 | lm loss: 1.955099E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.794 | TFLOPs: 19.53 | 31: iteration 119530/ 173500 | consumed samples: 30599680 | consumed tokens: 62668144640 | elapsed time per iteration (s): 0.78 | learning rate: 6.040E-05 | global batch size: 256 | lm loss: 1.944945E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.630 | TFLOPs: 19.94 | 31: iteration 119540/ 173500 | consumed samples: 30602240 | consumed tokens: 62673387520 | elapsed time per iteration (s): 0.79 | learning rate: 6.039E-05 | global batch size: 256 | lm loss: 1.960293E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.061 | TFLOPs: 19.67 | 31: iteration 119550/ 173500 | consumed samples: 30604800 | consumed tokens: 62678630400 | elapsed time per iteration (s): 0.77 | learning rate: 6.037E-05 | global batch size: 256 | lm loss: 1.947307E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.339 | TFLOPs: 20.23 | 31: iteration 119560/ 173500 | consumed samples: 30607360 | consumed tokens: 62683873280 | elapsed time per iteration (s): 0.80 | learning rate: 6.036E-05 | global batch size: 256 | lm loss: 1.942846E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.283 | TFLOPs: 19.38 | 31: iteration 119570/ 173500 | consumed samples: 30609920 | consumed tokens: 62689116160 | elapsed time per iteration (s): 0.77 | learning rate: 
6.035E-05 | global batch size: 256 | lm loss: 1.936814E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.494 | TFLOPs: 20.05 | 31: iteration 119580/ 173500 | consumed samples: 30612480 | consumed tokens: 62694359040 | elapsed time per iteration (s): 0.77 | learning rate: 6.033E-05 | global batch size: 256 | lm loss: 1.929262E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.976 | TFLOPs: 20.08 | 31: iteration 119590/ 173500 | consumed samples: 30615040 | consumed tokens: 62699601920 | elapsed time per iteration (s): 0.79 | learning rate: 6.032E-05 | global batch size: 256 | lm loss: 1.963720E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.268 | TFLOPs: 19.50 | 31: iteration 119600/ 173500 | consumed samples: 30617600 | consumed tokens: 62704844800 | elapsed time per iteration (s): 0.73 | learning rate: 6.031E-05 | global batch size: 256 | lm loss: 1.938549E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.120 | TFLOPs: 21.24 | 31: iteration 119610/ 173500 | consumed samples: 30620160 | consumed tokens: 62710087680 | elapsed time per iteration (s): 0.79 | learning rate: 6.029E-05 | global batch size: 256 | lm loss: 1.956412E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.979 | TFLOPs: 19.60 | 31: iteration 119620/ 173500 | consumed samples: 30622720 | consumed tokens: 62715330560 | elapsed time per iteration (s): 0.76 | learning rate: 6.028E-05 | global batch size: 256 | lm loss: 1.970585E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.988 | TFLOPs: 20.45 | 31: iteration 119630/ 173500 | 
consumed samples: 30625280 | consumed tokens: 62720573440 | elapsed time per iteration (s): 0.95 | learning rate: 6.026E-05 | global batch size: 256 | lm loss: 1.951709E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 270.149 | TFLOPs: 16.34 | 31: iteration 119640/ 173500 | consumed samples: 30627840 | consumed tokens: 62725816320 | elapsed time per iteration (s): 0.79 | learning rate: 6.025E-05 | global batch size: 256 | lm loss: 1.957443E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.801 | TFLOPs: 19.59 | 31: iteration 119650/ 173500 | consumed samples: 30630400 | consumed tokens: 62731059200 | elapsed time per iteration (s): 0.78 | learning rate: 6.024E-05 | global batch size: 256 | lm loss: 1.964241E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.652 | TFLOPs: 19.76 | 31: iteration 119660/ 173500 | consumed samples: 30632960 | consumed tokens: 62736302080 | elapsed time per iteration (s): 0.80 | learning rate: 6.022E-05 | global batch size: 256 | lm loss: 1.958207E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.342 | TFLOPs: 19.32 | 31: iteration 119670/ 173500 | consumed samples: 30635520 | consumed tokens: 62741544960 | elapsed time per iteration (s): 0.75 | learning rate: 6.021E-05 | global batch size: 256 | lm loss: 1.935024E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.166 | TFLOPs: 20.52 | 31: iteration 119680/ 173500 | consumed samples: 30638080 | consumed tokens: 62746787840 | elapsed time per iteration (s): 0.81 | learning rate: 6.020E-05 | global batch size: 256 | lm loss: 1.952161E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 317.387 | TFLOPs: 19.20 | 31: iteration 119690/ 173500 | consumed samples: 30640640 | consumed tokens: 62752030720 | elapsed time per iteration (s): 0.77 | learning rate: 6.018E-05 | global batch size: 256 | lm loss: 1.968313E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.258 | TFLOPs: 20.16 | 31: iteration 119700/ 173500 | consumed samples: 30643200 | consumed tokens: 62757273600 | elapsed time per iteration (s): 0.75 | learning rate: 6.017E-05 | global batch size: 256 | lm loss: 1.974418E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.570 | TFLOPs: 20.54 | 31: iteration 119710/ 173500 | consumed samples: 30645760 | consumed tokens: 62762516480 | elapsed time per iteration (s): 0.79 | learning rate: 6.015E-05 | global batch size: 256 | lm loss: 1.937733E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.021 | TFLOPs: 19.66 | 31: iteration 119720/ 173500 | consumed samples: 30648320 | consumed tokens: 62767759360 | elapsed time per iteration (s): 0.76 | learning rate: 6.014E-05 | global batch size: 256 | lm loss: 1.966697E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.135 | TFLOPs: 20.40 | 31: iteration 119730/ 173500 | consumed samples: 30650880 | consumed tokens: 62773002240 | elapsed time per iteration (s): 0.79 | learning rate: 6.013E-05 | global batch size: 256 | lm loss: 1.947078E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.164 | TFLOPs: 19.55 | 31: iteration 119740/ 173500 | consumed samples: 30653440 | consumed tokens: 62778245120 | elapsed time per iteration (s): 0.84 | learning rate: 
6.011E-05 | global batch size: 256 | lm loss: 1.965657E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.573 | TFLOPs: 18.49 | 31: iteration 119750/ 173500 | consumed samples: 30656000 | consumed tokens: 62783488000 | elapsed time per iteration (s): 0.77 | learning rate: 6.010E-05 | global batch size: 256 | lm loss: 1.955234E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.517 | TFLOPs: 20.06 | 31: iteration 119760/ 173500 | consumed samples: 30658560 | consumed tokens: 62788730880 | elapsed time per iteration (s): 0.76 | learning rate: 6.009E-05 | global batch size: 256 | lm loss: 1.945349E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.394 | TFLOPs: 20.41 | 31: iteration 119770/ 173500 | consumed samples: 30661120 | consumed tokens: 62793973760 | elapsed time per iteration (s): 0.72 | learning rate: 6.007E-05 | global batch size: 256 | lm loss: 1.940663E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.354 | TFLOPs: 21.50 | 31: iteration 119780/ 173500 | consumed samples: 30663680 | consumed tokens: 62799216640 | elapsed time per iteration (s): 0.77 | learning rate: 6.006E-05 | global batch size: 256 | lm loss: 1.937769E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.553 | TFLOPs: 20.18 | 31: iteration 119790/ 173500 | consumed samples: 30666240 | consumed tokens: 62804459520 | elapsed time per iteration (s): 0.76 | learning rate: 6.004E-05 | global batch size: 256 | lm loss: 1.961220E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.749 | TFLOPs: 20.31 | 31: iteration 119800/ 173500 | 
consumed samples: 30668800 | consumed tokens: 62809702400 | elapsed time per iteration (s): 0.73 | learning rate: 6.003E-05 | global batch size: 256 | lm loss: 1.924641E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.467 | TFLOPs: 21.20 | 31: iteration 119810/ 173500 | consumed samples: 30671360 | consumed tokens: 62814945280 | elapsed time per iteration (s): 0.72 | learning rate: 6.002E-05 | global batch size: 256 | lm loss: 1.943811E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.663 | TFLOPs: 21.58 | 31: iteration 119820/ 173500 | consumed samples: 30673920 | consumed tokens: 62820188160 | elapsed time per iteration (s): 0.74 | learning rate: 6.000E-05 | global batch size: 256 | lm loss: 1.965788E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.222 | TFLOPs: 20.89 | 31: iteration 119830/ 173500 | consumed samples: 30676480 | consumed tokens: 62825431040 | elapsed time per iteration (s): 0.76 | learning rate: 5.999E-05 | global batch size: 256 | lm loss: 1.960391E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.952 | TFLOPs: 20.51 | 31: iteration 119840/ 173500 | consumed samples: 30679040 | consumed tokens: 62830673920 | elapsed time per iteration (s): 0.83 | learning rate: 5.998E-05 | global batch size: 256 | lm loss: 1.920351E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.048 | TFLOPs: 18.58 | 31: iteration 119850/ 173500 | consumed samples: 30681600 | consumed tokens: 62835916800 | elapsed time per iteration (s): 0.89 | learning rate: 5.996E-05 | global batch size: 256 | lm loss: 1.947372E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 289.189 | TFLOPs: 17.50 | 31: iteration 119860/ 173500 | consumed samples: 30684160 | consumed tokens: 62841159680 | elapsed time per iteration (s): 0.75 | learning rate: 5.995E-05 | global batch size: 256 | lm loss: 1.951350E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.351 | TFLOPs: 20.53 | 31: iteration 119870/ 173500 | consumed samples: 30686720 | consumed tokens: 62846402560 | elapsed time per iteration (s): 0.75 | learning rate: 5.994E-05 | global batch size: 256 | lm loss: 1.944845E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.894 | TFLOPs: 20.62 | 31: iteration 119880/ 173500 | consumed samples: 30689280 | consumed tokens: 62851645440 | elapsed time per iteration (s): 0.73 | learning rate: 5.992E-05 | global batch size: 256 | lm loss: 1.957468E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.718 | TFLOPs: 21.22 | 31: iteration 119890/ 173500 | consumed samples: 30691840 | consumed tokens: 62856888320 | elapsed time per iteration (s): 0.72 | learning rate: 5.991E-05 | global batch size: 256 | lm loss: 1.911109E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.714 | TFLOPs: 21.52 | 31: iteration 119900/ 173500 | consumed samples: 30694400 | consumed tokens: 62862131200 | elapsed time per iteration (s): 0.86 | learning rate: 5.989E-05 | global batch size: 256 | lm loss: 1.947377E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.873 | TFLOPs: 18.08 | 31: iteration 119910/ 173500 | consumed samples: 30696960 | consumed tokens: 62867374080 | elapsed time per iteration (s): 0.86 | learning rate: 
5.988E-05 | global batch size: 256 | lm loss: 1.915703E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.852 | TFLOPs: 17.96 | 31: iteration 119920/ 173500 | consumed samples: 30699520 | consumed tokens: 62872616960 | elapsed time per iteration (s): 0.89 | learning rate: 5.987E-05 | global batch size: 256 | lm loss: 1.926796E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.128 | TFLOPs: 17.37 | 31: iteration 119930/ 173500 | consumed samples: 30702080 | consumed tokens: 62877859840 | elapsed time per iteration (s): 0.80 | learning rate: 5.985E-05 | global batch size: 256 | lm loss: 1.967822E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.625 | TFLOPs: 19.28 | 31: iteration 119940/ 173500 | consumed samples: 30704640 | consumed tokens: 62883102720 | elapsed time per iteration (s): 0.81 | learning rate: 5.984E-05 | global batch size: 256 | lm loss: 1.933282E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.744 | TFLOPs: 19.10 | 31: iteration 119950/ 173500 | consumed samples: 30707200 | consumed tokens: 62888345600 | elapsed time per iteration (s): 0.73 | learning rate: 5.983E-05 | global batch size: 256 | lm loss: 1.963898E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.303 | TFLOPs: 21.13 | 31: iteration 119960/ 173500 | consumed samples: 30709760 | consumed tokens: 62893588480 | elapsed time per iteration (s): 0.82 | learning rate: 5.981E-05 | global batch size: 256 | lm loss: 1.966001E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.593 | TFLOPs: 18.91 | 31: iteration 119970/ 173500 | 
consumed samples: 30712320 | consumed tokens: 62898831360 | elapsed time per iteration (s): 0.75 | learning rate: 5.980E-05 | global batch size: 256 | lm loss: 1.962990E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.595 | TFLOPs: 20.67 | 31: iteration 119980/ 173500 | consumed samples: 30714880 | consumed tokens: 62904074240 | elapsed time per iteration (s): 0.83 | learning rate: 5.979E-05 | global batch size: 256 | lm loss: 1.937527E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.553 | TFLOPs: 18.61 | 31: iteration 119990/ 173500 | consumed samples: 30717440 | consumed tokens: 62909317120 | elapsed time per iteration (s): 0.78 | learning rate: 5.977E-05 | global batch size: 256 | lm loss: 1.950174E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.740 | TFLOPs: 19.77 | 0: [2022-11-26 21:07:24,555] [INFO] [logging.py:68:log_dist] [Rank 0] step=120000, skipped=0, lr=[5.975780833100023e-05, 5.975780833100023e-05, 5.975780833100023e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 120000/ 173500 | consumed samples: 30720000 | consumed tokens: 62914560000 | elapsed time per iteration (s): 0.79 | learning rate: 5.976E-05 | global batch size: 256 | lm loss: 1.945624E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.963 | TFLOPs: 19.72 | 0: steps: 120000 loss: 1.9231 iter time (s): 0.800 samples/sec: 320.108 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 120000 | lm loss value: 1.863736E+00 | lm loss PPL: 6.447781E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 120000 to 
checkpoints_1b1long 0: [2022-11-26 21:07:24,828] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step120000 is begin to save! 0: [2022-11-26 21:07:24,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_01-model_00-model_states.pt... 0: [2022-11-26 21:07:25,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_01-model_00-model_states.pt. 0: [2022-11-26 21:07:25,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_03-model_00-model_states.pt... 0: [2022-11-26 21:07:25,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_03-model_00-model_states.pt. 0: [2022-11-26 21:07:25,146] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_04-model_00-model_states.pt... 0: [2022-11-26 21:07:25,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_04-model_00-model_states.pt. 0: [2022-11-26 21:07:25,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_05-model_00-model_states.pt... 0: [2022-11-26 21:07:25,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_05-model_00-model_states.pt. 0: [2022-11-26 21:07:25,304] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_06-model_00-model_states.pt... 0: [2022-11-26 21:07:25,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_06-model_00-model_states.pt. 0: [2022-11-26 21:07:25,376] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_07-model_00-model_states.pt... 
0: [2022-11-26 21:07:25,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_07-model_00-model_states.pt. 0: [2022-11-26 21:07:25,457] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_08-model_00-model_states.pt... 0: [2022-11-26 21:07:25,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_08-model_00-model_states.pt. 0: [2022-11-26 21:07:25,529] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_09-model_00-model_states.pt... 0: [2022-11-26 21:07:25,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_09-model_00-model_states.pt. 0: [2022-11-26 21:07:25,604] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_10-model_00-model_states.pt... 0: [2022-11-26 21:07:25,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_10-model_00-model_states.pt. 0: [2022-11-26 21:07:25,678] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_11-model_00-model_states.pt... 0: [2022-11-26 21:07:25,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_11-model_00-model_states.pt. 0: [2022-11-26 21:07:25,750] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_12-model_00-model_states.pt... 0: [2022-11-26 21:07:25,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_12-model_00-model_states.pt. 0: [2022-11-26 21:07:25,824] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_13-model_00-model_states.pt... 
0: [2022-11-26 21:07:25,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_13-model_00-model_states.pt. 0: [2022-11-26 21:07:25,898] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_14-model_00-model_states.pt... 0: [2022-11-26 21:07:25,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_14-model_00-model_states.pt. 0: [2022-11-26 21:07:25,970] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_15-model_00-model_states.pt... 0: [2022-11-26 21:07:26,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_15-model_00-model_states.pt. 0: [2022-11-26 21:07:26,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_16-model_00-model_states.pt... 0: [2022-11-26 21:07:26,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_16-model_00-model_states.pt. 0: [2022-11-26 21:07:26,117] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_17-model_00-model_states.pt... 0: [2022-11-26 21:07:26,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_17-model_00-model_states.pt. 0: [2022-11-26 21:07:26,188] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_18-model_00-model_states.pt... 0: [2022-11-26 21:07:26,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_18-model_00-model_states.pt. 0: [2022-11-26 21:07:26,263] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_19-model_00-model_states.pt... 
0: [2022-11-26 21:07:26,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_19-model_00-model_states.pt. 0: [2022-11-26 21:07:26,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_20-model_00-model_states.pt... 0: [2022-11-26 21:07:26,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_20-model_00-model_states.pt. 0: [2022-11-26 21:07:26,408] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_21-model_00-model_states.pt... 0: [2022-11-26 21:07:26,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_21-model_00-model_states.pt. 0: [2022-11-26 21:07:26,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_22-model_00-model_states.pt... 0: [2022-11-26 21:07:26,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_22-model_00-model_states.pt. 0: [2022-11-26 21:07:26,553] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_23-model_00-model_states.pt... 0: [2022-11-26 21:07:26,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_23-model_00-model_states.pt. 0: [2022-11-26 21:07:26,628] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_24-model_00-model_states.pt... 0: [2022-11-26 21:07:26,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_24-model_00-model_states.pt. 0: [2022-11-26 21:07:26,699] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_25-model_00-model_states.pt... 
0: [2022-11-26 21:07:26,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_25-model_00-model_states.pt. 0: [2022-11-26 21:07:26,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_26-model_00-model_states.pt... 0: [2022-11-26 21:07:26,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_26-model_00-model_states.pt. 0: [2022-11-26 21:07:26,850] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_27-model_00-model_states.pt... 0: [2022-11-26 21:07:26,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_27-model_00-model_states.pt. 0: [2022-11-26 21:07:26,921] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_28-model_00-model_states.pt... 0: [2022-11-26 21:07:26,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_28-model_00-model_states.pt. 0: [2022-11-26 21:07:26,997] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/layer_30-model_00-model_states.pt... 0: [2022-11-26 21:07:26,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/layer_30-model_00-model_states.pt. 0: [2022-11-26 21:07:26,999] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step120000/mp_rank_00_model_states.pt 0: [2022-11-26 21:07:26,999] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/mp_rank_00_model_states.pt... 0: [2022-11-26 21:07:27,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/mp_rank_00_model_states.pt. 
0: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
4: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
16: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
3: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
20: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 
24: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 
30: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 
26: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
6: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 
8: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 
2: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
12: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 
11: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 
31: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 
21: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 
27: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 
9: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
20: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 
24: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 
30: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 
6: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
20: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 
19: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:07:27,073] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
10: [2022-11-26 21:07:27,073] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:07:27,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:07:27,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:07:27,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 21:07:27,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 0: [2022-11-26 21:07:27,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:07:27,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 21:07:27,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 18: [2022-11-26 21:07:27,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-26 21:07:27,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 21:07:27,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 0: [2022-11-26 21:07:27,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:07:27,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 21:07:27,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 0: [2022-11-26 21:07:27,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:07:27,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 21:07:27,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 18: [2022-11-26 21:07:27,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:07:27,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 21:07:27,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 8: [2022-11-26 21:07:27,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 21:07:27,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 21:07:27,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 24: [2022-11-26 21:07:27,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:07:27,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 21:07:27,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 18: [2022-11-26 21:07:27,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:07:27,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 21:07:27,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 2: [2022-11-26 21:07:27,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:07:27,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 21:07:27,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 21: [2022-11-26 21:07:27,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
21: [2022-11-26 21:07:27,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 21:07:27,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 9: [2022-11-26 21:07:27,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:07:27,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 21:07:27,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 30: [2022-11-26 21:07:27,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:07:27,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 21:07:27,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 12: [2022-11-26 21:07:27,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:07:27,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 21:07:27,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 11: [2022-11-26 21:07:27,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
21: [2022-11-26 21:07:27,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:07:27,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 28: [2022-11-26 21:07:27,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:07:27,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 11: [2022-11-26 21:07:27,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 28: [2022-11-26 21:07:27,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 21: [2022-11-26 21:07:27,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 28: [2022-11-26 21:07:27,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 21: [2022-11-26 21:07:27,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:07:27,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 9: [2022-11-26 21:07:27,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:07:27,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
9: [2022-11-26 21:07:27,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 21:07:27,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 27: [2022-11-26 21:07:27,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 21:07:27,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 21:07:27,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 11: [2022-11-26 21:07:27,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:07:27,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 21:07:27,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 30: [2022-11-26 21:07:27,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:07:27,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 21:07:27,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 24: [2022-11-26 21:07:27,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-26 21:07:27,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 7: [2022-11-26 21:07:27,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:07:27,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 7: [2022-11-26 21:07:27,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 21:07:27,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 7: [2022-11-26 21:07:27,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:07:27,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 21:07:27,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 13: [2022-11-26 21:07:27,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:07:27,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 31: [2022-11-26 21:07:27,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
13: [2022-11-26 21:07:27,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 21:07:27,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 3: [2022-11-26 21:07:27,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 31: [2022-11-26 21:07:27,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 3: [2022-11-26 21:07:27,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 31: [2022-11-26 21:07:27,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 8: [2022-11-26 21:07:27,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:07:27,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 21:07:27,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 30: [2022-11-26 21:07:27,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:07:27,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 21:07:27,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
8: [2022-11-26 21:07:27,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:07:27,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 0: [2022-11-26 21:07:27,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:07:27,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 14: [2022-11-26 21:07:27,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:07:27,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 21:07:27,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 30: [2022-11-26 21:07:27,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:07:27,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
30: [2022-11-26 21:07:27,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 7: [2022-11-26 21:07:27,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 30: [2022-11-26 21:07:27,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 7: [2022-11-26 21:07:27,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 8: [2022-11-26 21:07:27,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:07:27,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 21:07:27,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 8: [2022-11-26 21:07:27,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:07:27,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:07:27,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 21:07:27,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
11: [2022-11-26 21:07:27,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 31: [2022-11-26 21:07:27,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:07:27,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 31: [2022-11-26 21:07:27,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 21:07:27,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 21:07:27,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 9: [2022-11-26 21:07:27,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 31: [2022-11-26 21:07:27,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 9: [2022-11-26 21:07:27,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 31: [2022-11-26 21:07:27,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 9: [2022-11-26 21:07:27,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 12: [2022-11-26 21:07:27,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 21:07:27,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 21:07:27,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 21: [2022-11-26 21:07:27,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:07:27,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 21:07:27,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 24: [2022-11-26 21:07:27,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 23: [2022-11-26 21:07:27,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:07:27,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 23: [2022-11-26 21:07:27,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 21:07:27,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
7: [2022-11-26 21:07:27,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 24: [2022-11-26 21:07:27,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 7: [2022-11-26 21:07:27,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 28: [2022-11-26 21:07:27,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:07:27,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 3: [2022-11-26 21:07:27,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:07:27,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 21:07:27,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 27: [2022-11-26 21:07:27,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 21:07:27,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 21:07:27,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
27: [2022-11-26 21:07:27,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 21:07:27,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 21:07:27,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 21:07:27,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 27: [2022-11-26 21:07:27,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 27: [2022-11-26 21:07:27,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 2: [2022-11-26 21:07:27,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:07:27,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 21:07:27,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 19: [2022-11-26 21:07:27,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:07:27,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 
19: [2022-11-26 21:07:27,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 21:07:27,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 15: [2022-11-26 21:07:27,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:07:27,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 19: [2022-11-26 21:07:27,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 15: [2022-11-26 21:07:27,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 21:07:27,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 4: [2022-11-26 21:07:27,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:07:27,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 21:07:27,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 4: [2022-11-26 21:07:27,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 21:07:27,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 31: [2022-11-26 21:07:27,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:07:27,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:07:27,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 31: [2022-11-26 21:07:27,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 21:07:27,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 4: [2022-11-26 21:07:27,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 10: [2022-11-26 21:07:27,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 21:07:27,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 21:07:27,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 10: [2022-11-26 21:07:27,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 2: [2022-11-26 21:07:27,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-26 21:07:27,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 21:07:27,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 10: [2022-11-26 21:07:27,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:07:27,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:07:27,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 14: [2022-11-26 21:07:27,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:07:27,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 10: [2022-11-26 21:07:27,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:07:27,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 7: [2022-11-26 21:07:27,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 10: [2022-11-26 21:07:27,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 14: [2022-11-26 21:07:27,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
10: [2022-11-26 21:07:27,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 21:07:27,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 11: [2022-11-26 21:07:27,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:07:27,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 21:07:27,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 15: [2022-11-26 21:07:27,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:07:27,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 21:07:27,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 1: [2022-11-26 21:07:27,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:07:27,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:07:27,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-26 21:07:27,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 21:07:27,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 21:07:27,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 21:07:27,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 1: [2022-11-26 21:07:27,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 1: [2022-11-26 21:07:27,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 11: [2022-11-26 21:07:27,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:07:27,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 28: [2022-11-26 21:07:27,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 21:07:27,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 11: [2022-11-26 21:07:27,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
28: [2022-11-26 21:07:27,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 21:07:27,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 28: [2022-11-26 21:07:27,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 28: [2022-11-26 21:07:27,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 21:07:27,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 21:07:27,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 20: [2022-11-26 21:07:27,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:07:27,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:07:27,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 21:07:27,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 21:07:27,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 20: [2022-11-26 21:07:27,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
1: [2022-11-26 21:07:27,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:07:27,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:07:27,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 19: [2022-11-26 21:07:27,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:07:27,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 19: [2022-11-26 21:07:27,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 21:07:27,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 21:07:27,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 19: [2022-11-26 21:07:27,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 3: [2022-11-26 21:07:27,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:07:27,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 21:07:27,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
1: [2022-11-26 21:07:27,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:07:27,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 21:07:27,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 3: [2022-11-26 21:07:27,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:07:27,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 21:07:27,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 31: [2022-11-26 21:07:27,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 21:07:27,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 21:07:27,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 30: [2022-11-26 21:07:27,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:07:27,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 21:07:27,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
27: [2022-11-26 21:07:27,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 21:07:27,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 21:07:27,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 14: [2022-11-26 21:07:27,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:07:27,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 30: [2022-11-26 21:07:27,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:07:27,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 30: [2022-11-26 21:07:27,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 9: [2022-11-26 21:07:27,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:07:27,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 30: [2022-11-26 21:07:27,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 9: [2022-11-26 21:07:27,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
15: [2022-11-26 21:07:27,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:07:27,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 21:07:27,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 8: [2022-11-26 21:07:27,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:07:27,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 12: [2022-11-26 21:07:27,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:07:27,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 12: [2022-11-26 21:07:27,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 15: [2022-11-26 21:07:27,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:07:27,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 12: [2022-11-26 21:07:27,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 15: [2022-11-26 21:07:27,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
24: [2022-11-26 21:07:27,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:07:27,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 2: [2022-11-26 21:07:27,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:07:27,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 2: [2022-11-26 21:07:27,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 15: [2022-11-26 21:07:27,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:07:27,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 15: [2022-11-26 21:07:27,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 21:07:27,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 2: [2022-11-26 21:07:27,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 21:07:27,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 21:07:27,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:07:27,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 2: [2022-11-26 21:07:27,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 21:07:27,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 28: [2022-11-26 21:07:27,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:07:27,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 28: [2022-11-26 21:07:27,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 21:07:27,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 14: [2022-11-26 21:07:27,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 11: [2022-11-26 21:07:27,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:07:27,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
11: [2022-11-26 21:07:27,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 21:07:27,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 21: [2022-11-26 21:07:27,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:07:27,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 21:07:27,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 22: [2022-11-26 21:07:27,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:07:27,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 22: [2022-11-26 21:07:27,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 21:07:27,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 18: [2022-11-26 21:07:27,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 22: [2022-11-26 21:07:27,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
22: [2022-11-26 21:07:27,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 21:07:27,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:07:27,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 22: [2022-11-26 21:07:27,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 21:07:27,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 21:07:27,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 21:07:27,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 21:07:27,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 22: [2022-11-26 21:07:27,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 22: [2022-11-26 21:07:27,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 21:07:27,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 22: [2022-11-26 21:07:27,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
12: [2022-11-26 21:07:27,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:07:27,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 21:07:27,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 1: [2022-11-26 21:07:27,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:07:27,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 21:07:27,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 25: [2022-11-26 21:07:27,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 21:07:27,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 21:07:27,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 21:07:27,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 21:07:27,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
25: [2022-11-26 21:07:27,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 21:07:27,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 21:07:27,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 12: [2022-11-26 21:07:27,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 25: [2022-11-26 21:07:27,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 12: [2022-11-26 21:07:27,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 25: [2022-11-26 21:07:27,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 24: [2022-11-26 21:07:27,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:07:27,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
25: [2022-11-26 21:07:27,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 24: [2022-11-26 21:07:27,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 25: [2022-11-26 21:07:27,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 21:07:27,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 24: [2022-11-26 21:07:27,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 25: [2022-11-26 21:07:27,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 25: [2022-11-26 21:07:27,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 25: [2022-11-26 21:07:27,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 25: [2022-11-26 21:07:27,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 25: [2022-11-26 21:07:27,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 31: [2022-11-26 21:07:27,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 21:07:27,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 21:07:27,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
13: [2022-11-26 21:07:27,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:07:27,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 21:07:27,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 13: [2022-11-26 21:07:27,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:07:27,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 21:07:27,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 13: [2022-11-26 21:07:27,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:07:27,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 21:07:27,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 4: [2022-11-26 21:07:27,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:07:27,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 21:07:27,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:07:27,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:07:27,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 21:07:27,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 21:07:27,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 21:07:27,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 9: [2022-11-26 21:07:27,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:07:27,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 9: [2022-11-26 21:07:27,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 4: [2022-11-26 21:07:27,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 4: [2022-11-26 21:07:27,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 4: [2022-11-26 21:07:27,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
9: [2022-11-26 21:07:27,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 13: [2022-11-26 21:07:27,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:07:27,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:07:27,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:07:27,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:07:27,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:07:27,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 6: [2022-11-26 21:07:27,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:07:27,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 21:07:27,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 13: [2022-11-26 21:07:27,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
6: [2022-11-26 21:07:27,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 21:07:27,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 13: [2022-11-26 21:07:27,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:07:27,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 6: [2022-11-26 21:07:27,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 6: [2022-11-26 21:07:27,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 21:07:27,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 6: [2022-11-26 21:07:27,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 13: [2022-11-26 21:07:27,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 6: [2022-11-26 21:07:27,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 24: [2022-11-26 21:07:27,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:07:27,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
24: [2022-11-26 21:07:27,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 21:07:27,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 16: [2022-11-26 21:07:27,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:07:27,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 21:07:27,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 16: [2022-11-26 21:07:27,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:07:27,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 21:07:27,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 16: [2022-11-26 21:07:27,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:07:27,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 19: [2022-11-26 21:07:27,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-26 21:07:27,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:07:27,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:07:27,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 16: [2022-11-26 21:07:27,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 19: [2022-11-26 21:07:27,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 16: [2022-11-26 21:07:27,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 20: [2022-11-26 21:07:27,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:07:27,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 16: [2022-11-26 21:07:27,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 20: [2022-11-26 21:07:27,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 19: [2022-11-26 21:07:27,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 20: [2022-11-26 21:07:27,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
15: [2022-11-26 21:07:27,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 23: [2022-11-26 21:07:27,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 21:07:27,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 21:07:27,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 21:07:27,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 21:07:27,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 10: [2022-11-26 21:07:27,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:07:27,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 23: [2022-11-26 21:07:27,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 10: [2022-11-26 21:07:27,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 15: [2022-11-26 21:07:27,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
23: [2022-11-26 21:07:27,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 9: [2022-11-26 21:07:27,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:07:27,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 23: [2022-11-26 21:07:27,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 23: [2022-11-26 21:07:27,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 9: [2022-11-26 21:07:27,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 21:07:27,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 22: [2022-11-26 21:07:27,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 21:07:27,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 21:07:27,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 0: [2022-11-26 21:07:27,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:07:27,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 21:07:27,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
12: [2022-11-26 21:07:27,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:07:27,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 21:07:27,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 20: [2022-11-26 21:07:27,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:07:27,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:07:27,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:07:27,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 21:07:27,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 21:07:27,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 21:07:27,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 20: [2022-11-26 21:07:27,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 20: [2022-11-26 21:07:27,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
6: [2022-11-26 21:07:27,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:07:27,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 21:07:27,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 29: [2022-11-26 21:07:27,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:07:27,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:07:27,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:07:27,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 21:07:27,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 21:07:27,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 21:07:27,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 29: [2022-11-26 21:07:27,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 29: [2022-11-26 21:07:27,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
29: [2022-11-26 21:07:27,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:07:27,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 21:07:27,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 29: [2022-11-26 21:07:27,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:07:27,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 21:07:27,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 29: [2022-11-26 21:07:27,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:07:27,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 21:07:27,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 3: [2022-11-26 21:07:27,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:07:27,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 21:07:27,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
5: [2022-11-26 21:07:27,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:07:27,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:07:27,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:07:27,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:07:27,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:07:27,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 21:07:27,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 21:07:27,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 21:07:27,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 21:07:27,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 21:07:27,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 21:07:27,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 21:07:27,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 5: [2022-11-26 21:07:27,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 5: [2022-11-26 21:07:27,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 5: [2022-11-26 21:07:27,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 5: [2022-11-26 21:07:27,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 5: [2022-11-26 21:07:27,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
23: [2022-11-26 21:07:27,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 21:07:27,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 21:07:27,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 0: [2022-11-26 21:07:27,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 21:07:27,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 18: [2022-11-26 21:07:27,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:07:27,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 21:07:27,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 17: [2022-11-26 21:07:27,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 21:07:27,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 21:07:27,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-26 21:07:27,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 21:07:27,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 21:07:27,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 21:07:27,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 21:07:27,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 21:07:27,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 21:07:27,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 21:07:27,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 21:07:27,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 21:07:27,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 17: [2022-11-26 21:07:27,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
17: [2022-11-26 21:07:27,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 17: [2022-11-26 21:07:27,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 17: [2022-11-26 21:07:27,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 17: [2022-11-26 21:07:27,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 26: [2022-11-26 21:07:27,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 21:07:27,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 21:07:27,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 21:07:27,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 21:07:27,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 21:07:27,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 
26: [2022-11-26 21:07:27,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 21:07:27,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 21:07:27,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 21:07:27,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 21:07:27,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 21:07:27,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 21:07:27,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 26: [2022-11-26 21:07:27,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 26: [2022-11-26 21:07:27,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 26: [2022-11-26 21:07:27,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 26: [2022-11-26 21:07:27,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 26: [2022-11-26 21:07:27,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
0: [2022-11-26 21:07:27,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:07:27,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 21:07:27,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 7: [2022-11-26 21:07:27,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:07:27,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 21:07:27,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 14: [2022-11-26 21:07:27,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:07:27,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 21:07:27,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 27: [2022-11-26 21:07:27,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 21:07:27,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 21:07:27,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
23: [2022-11-26 21:07:27,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 21:07:27,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 21:07:27,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 21: [2022-11-26 21:07:27,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:07:27,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 21:07:27,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 3: [2022-11-26 21:07:27,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:07:27,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 21:07:27,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 10: [2022-11-26 21:07:27,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:07:27,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 21:07:27,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
16: [2022-11-26 21:07:27,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:07:27,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 21:07:27,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 28: [2022-11-26 21:07:27,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 21:07:27,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 21:07:27,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 25: [2022-11-26 21:07:27,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 21:07:27,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 21:07:27,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 31: [2022-11-26 21:07:27,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 21:07:27,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 21:07:27,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
26: [2022-11-26 21:07:27,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 21:07:27,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 21:07:27,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 17: [2022-11-26 21:07:27,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 21:07:27,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 21:07:27,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 2: [2022-11-26 21:07:27,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:07:27,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 21:07:27,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 6: [2022-11-26 21:07:27,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:07:27,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 21:07:27,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
13: [2022-11-26 21:07:27,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:07:27,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 21:07:27,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 8: [2022-11-26 21:07:27,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:07:27,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 21:07:27,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 30: [2022-11-26 21:07:27,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:07:27,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 21:07:27,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 24: [2022-11-26 21:07:27,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:07:27,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 21:07:27,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
1: [2022-11-26 21:07:27,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:07:27,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:07:27,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 15: [2022-11-26 21:07:27,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 21:07:27,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 1: [2022-11-26 21:07:27,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 22: [2022-11-26 21:07:27,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 21:07:27,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 21:07:27,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 29: [2022-11-26 21:07:27,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:07:27,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 21:07:27,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
11: [2022-11-26 21:07:27,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:07:27,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 21:07:27,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 4: [2022-11-26 21:07:27,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:07:27,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 21:07:27,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 9: [2022-11-26 21:07:27,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:07:27,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 21:07:27,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 0: [2022-11-26 21:07:27,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:07:27,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
0: [2022-11-26 21:07:27,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 20: [2022-11-26 21:07:27,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 0: [2022-11-26 21:07:27,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 20: [2022-11-26 21:07:27,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 18: [2022-11-26 21:07:27,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:07:27,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 21:07:27,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 12: [2022-11-26 21:07:27,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:07:27,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 21:07:27,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 19: [2022-11-26 21:07:27,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
19: [2022-11-26 21:07:27,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 21:07:27,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 27: [2022-11-26 21:07:27,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 21:07:27,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 21:07:27,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 7: [2022-11-26 21:07:27,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:07:27,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 21:07:27,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 23: [2022-11-26 21:07:27,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 28: [2022-11-26 21:07:27,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 23: [2022-11-26 21:07:27,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 21:07:27,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
4: [2022-11-26 21:07:27,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:07:27,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 5: [2022-11-26 21:07:27,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:07:27,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 5: [2022-11-26 21:07:27,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 21:07:27,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 14: [2022-11-26 21:07:27,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:07:27,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 21:07:27,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 1: [2022-11-26 21:07:27,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:07:27,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 21:07:27,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
3: [2022-11-26 21:07:27,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 28: [2022-11-26 21:07:27,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 21:07:27,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 3: [2022-11-26 21:07:27,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 21:07:27,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 21: [2022-11-26 21:07:27,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:07:27,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 21:07:27,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 8: [2022-11-26 21:07:27,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:07:27,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 21:07:27,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 17: [2022-11-26 21:07:27,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-26 21:07:27,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 21:07:27,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 24: [2022-11-26 21:07:27,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:07:27,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 21:07:27,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 18: [2022-11-26 21:07:27,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:07:27,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 21:07:27,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 3: [2022-11-26 21:07:27,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:07:27,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
3: [2022-11-26 21:07:27,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 30: [2022-11-26 21:07:27,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 3: [2022-11-26 21:07:27,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 31: [2022-11-26 21:07:27,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:07:27,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 31: [2022-11-26 21:07:27,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 21:07:27,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 15: [2022-11-26 21:07:27,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:07:27,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 21:07:27,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 29: [2022-11-26 21:07:27,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-26 21:07:27,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 16: [2022-11-26 21:07:27,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:07:27,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 9: [2022-11-26 21:07:27,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:07:27,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 21:07:27,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 9: [2022-11-26 21:07:27,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 21:07:27,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 10: [2022-11-26 21:07:27,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:07:27,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:07:27,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 21:07:27,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
12: [2022-11-26 21:07:27,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 21:07:27,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 22: [2022-11-26 21:07:27,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:07:27,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:07:27,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 22: [2022-11-26 21:07:27,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 0: [2022-11-26 21:07:27,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 5: [2022-11-26 21:07:27,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 22: [2022-11-26 21:07:27,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 5: [2022-11-26 21:07:27,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 0: [2022-11-26 21:07:27,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 6: [2022-11-26 21:07:27,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
19: [2022-11-26 21:07:27,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:07:27,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 21:07:27,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 2: [2022-11-26 21:07:27,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:07:27,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 26: [2022-11-26 21:07:27,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:07:27,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 2: [2022-11-26 21:07:27,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 26: [2022-11-26 21:07:27,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 2: [2022-11-26 21:07:27,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 26: [2022-11-26 21:07:27,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 11: [2022-11-26 21:07:27,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-26 21:07:27,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 21:07:27,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 7: [2022-11-26 21:07:27,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:07:27,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 21:07:27,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 13: [2022-11-26 21:07:27,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:07:27,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 27: [2022-11-26 21:07:27,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:07:27,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 27: [2022-11-26 21:07:27,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 21:07:27,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 23: [2022-11-26 21:07:27,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-26 21:07:27,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 21:07:27,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 14: [2022-11-26 21:07:27,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:07:27,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 21:07:27,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 20: [2022-11-26 21:07:27,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:07:27,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 21:07:27,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 25: [2022-11-26 21:07:27,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 21:07:27,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 21:07:27,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 21: [2022-11-26 21:07:27,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-26 21:07:27,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 21:07:27,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 14: [2022-11-26 21:07:27,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:07:27,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 21:07:27,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 28: [2022-11-26 21:07:27,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:07:27,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:07:27,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 21:07:27,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 16: [2022-11-26 21:07:27,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:07:27,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 21:07:27,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
28: [2022-11-26 21:07:27,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt
28: [2022-11-26 21:07:27,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now!
10: [2022-11-26 21:07:27,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt.
10: [2022-11-26 21:07:27,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step120000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt
10: [2022-11-26 21:07:27,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now!
0: successfully saved checkpoint at iteration 120000 to checkpoints_1b1long
31: time (ms) | save-checkpoint: 2515.71
31: iteration 120010/ 173500 | consumed samples: 30722560 | consumed tokens: 62919802880 | elapsed time per iteration (s): 1.06 | learning rate: 5.974E-05 | global batch size: 256 | lm loss: 1.930281E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.927 | TFLOPs: 14.64 |
31: iteration 120020/ 173500 | consumed samples: 30725120 | consumed tokens: 62925045760 | elapsed time per iteration (s): 0.73 | learning rate: 5.973E-05 | global batch size: 256 | lm loss: 1.940177E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.987 | TFLOPs: 21.29 |
31: iteration 120030/ 173500 | consumed samples: 30727680 | consumed tokens: 62930288640 | elapsed time per iteration (s): 0.78 | learning rate: 5.972E-05 | global batch size: 256 | lm loss: 1.936905E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.827 | TFLOPs: 19.83 |
31: iteration 120040/ 173500 | consumed samples: 30730240 | consumed tokens: 62935531520 | elapsed time per iteration (s): 0.78 | learning rate: 5.970E-05 | global batch size: 256 | lm loss: 1.945001E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.530 | TFLOPs: 19.94 |
31: iteration 120050/ 173500 | consumed samples: 30732800 | consumed tokens: 62940774400 | elapsed time per iteration (s): 0.79 | learning rate: 5.969E-05 | global batch size: 256 | lm loss: 1.956716E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.844 | TFLOPs: 19.53 |
31: iteration 120060/ 173500 | consumed samples: 30735360 | consumed tokens: 62946017280 | elapsed time per iteration (s): 0.81 | learning rate: 5.968E-05 | global batch size: 256 | lm loss: 1.962759E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.711 | TFLOPs: 19.04 |
31: iteration 120070/ 173500 | consumed samples: 30737920 | consumed tokens: 62951260160 | elapsed time per iteration (s): 0.83 | learning rate: 5.966E-05 | global batch size: 256 | lm loss: 1.918846E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.502 | TFLOPs: 18.72 |
31: iteration 120080/ 173500 | consumed samples: 30740480 | consumed tokens: 62956503040 | elapsed time per iteration (s): 0.90 | learning rate: 5.965E-05 | global batch size: 256 | lm loss: 1.954033E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.861 | TFLOPs: 17.29 |
31: iteration 120090/ 173500 | consumed samples: 30743040 | consumed tokens: 62961745920 | elapsed time per iteration (s): 0.82 | learning rate: 5.963E-05 | global batch size: 256 | lm loss: 1.960437E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.795 | TFLOPs: 18.80 |
31: iteration 120100/ 173500 | consumed samples: 30745600 | consumed tokens: 62966988800 | elapsed time per iteration (s): 0.80 | learning rate: 5.962E-05 | global batch size: 256 | lm loss: 1.945741E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.241 | TFLOPs: 19.43 |
31: iteration 120110/ 173500 | consumed samples: 30748160 | consumed tokens: 62972231680 | elapsed time per iteration (s): 0.82 | learning rate: 5.961E-05 | global batch size: 256 | lm loss: 1.951291E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.144 | TFLOPs: 18.88 |
31: iteration 120120/ 173500 | consumed samples: 30750720 | consumed tokens: 62977474560 | elapsed time per iteration (s): 0.91 | learning rate: 5.959E-05 | global batch size: 256 | lm loss: 1.944011E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 280.629 | TFLOPs: 16.98 |
31: iteration 120130/ 173500 | consumed samples: 30753280 | consumed tokens: 62982717440 | elapsed time per iteration (s): 0.81 | learning rate: 5.958E-05 | global batch size: 256 | lm loss: 1.943709E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.632 | TFLOPs: 19.09 |
31: iteration 120140/ 173500 | consumed samples: 30755840 | consumed tokens: 62987960320 | elapsed time per iteration (s): 0.83 | learning rate: 5.957E-05 | global batch size: 256 | lm loss: 1.932512E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.419 | TFLOPs: 18.66 |
31: iteration 120150/ 173500 | consumed samples: 30758400 | consumed tokens: 62993203200 | elapsed time per iteration (s): 0.80 | learning rate: 5.955E-05 | global batch size: 256 | lm loss: 1.950130E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.830 | TFLOPs: 19.35 |
31: iteration 120160/ 173500 | consumed samples: 30760960 | consumed tokens: 62998446080 | elapsed time per iteration (s): 0.79 | learning rate: 5.954E-05 | global batch size: 256 | lm loss: 1.980647E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.048 | TFLOPs: 19.54 |
31: iteration 120170/ 173500 | consumed samples: 30763520 | consumed tokens: 63003688960 | elapsed time per iteration (s): 0.81 | learning rate: 5.953E-05 | global batch size: 256 | lm loss: 1.979336E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.629 | TFLOPs: 19.22 |
31: iteration 120180/ 173500 | consumed samples: 30766080 | consumed tokens: 63008931840 | elapsed time per iteration (s): 0.80 | learning rate: 5.951E-05 | global batch size: 256 | lm loss: 1.961771E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.004 | TFLOPs: 19.36 |
31: iteration 120190/ 173500 | consumed samples: 30768640 | consumed tokens: 63014174720 | elapsed time per iteration (s): 0.80 | learning rate: 5.950E-05 | global batch size: 256 | lm loss: 1.983704E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.881 | TFLOPs: 19.47 |
31: iteration 120200/ 173500 | consumed samples: 30771200 | consumed tokens: 63019417600 | elapsed time per iteration (s): 0.84 | learning rate: 5.948E-05 | global batch size: 256 | lm loss: 1.942267E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.998 | TFLOPs: 18.51 |
31: iteration 120210/ 173500 | consumed samples: 30773760 | consumed tokens: 63024660480 | elapsed time per iteration (s): 0.81 | learning rate: 5.947E-05 | global batch size: 256 | lm loss: 1.953706E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.743 | TFLOPs: 19.04 |
31: iteration 120220/ 173500 | consumed samples: 30776320 | consumed tokens: 63029903360 | elapsed time per iteration (s): 0.85 | learning rate: 5.946E-05 | global batch size: 256 | lm loss: 1.937395E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.606 | TFLOPs: 18.31 |
31: iteration 120230/ 173500 | consumed samples: 30778880 | consumed tokens: 63035146240 | elapsed time per iteration (s): 0.81 | learning rate: 5.944E-05 | global batch size: 256 | lm loss: 1.917405E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.077 | TFLOPs: 19.12 |
31: iteration 120240/ 173500 | consumed samples: 30781440 | consumed tokens: 63040389120 | elapsed time per iteration (s): 0.80 | learning rate: 5.943E-05 | global batch size: 256 | lm loss: 1.963104E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.956 | TFLOPs: 19.48 |
31: iteration 120250/ 173500 | consumed samples: 30784000 | consumed tokens: 63045632000 | elapsed time per iteration (s): 0.80 | learning rate: 5.942E-05 | global batch size: 256 | lm loss: 1.941633E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.394 | TFLOPs: 19.38 |
31: iteration 120260/ 173500 | consumed samples: 30786560 | consumed tokens: 63050874880 | elapsed time per iteration (s): 0.81 | learning rate: 5.940E-05 | global batch size: 256 | lm loss: 1.945207E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.758 | TFLOPs: 19.16 |
31: iteration 120270/ 173500 | consumed samples: 30789120 | consumed tokens: 63056117760 | elapsed time per iteration (s): 0.78 | learning rate: 5.939E-05 | global batch size: 256 | lm loss: 1.943903E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.514 | TFLOPs: 19.81 |
31: iteration 120280/ 173500 | consumed samples: 30791680 | consumed tokens: 63061360640 | elapsed time per iteration (s): 0.79 | learning rate: 5.938E-05 | global batch size: 256 | lm loss: 1.925163E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.726 | TFLOPs: 19.52 |
31: iteration 120290/ 173500 | consumed samples: 30794240 | consumed tokens: 63066603520 | elapsed time per iteration (s): 1.35 | learning rate: 5.936E-05 | global batch size: 256 | lm loss: 1.941922E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 190.239 | TFLOPs: 11.51 |
31: iteration 120300/ 173500 | consumed samples: 30796800 | consumed tokens: 63071846400 | elapsed time per iteration (s): 0.81 | learning rate: 5.935E-05 | global batch size: 256 | lm loss: 1.935937E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.922 | TFLOPs: 19.11 |
31: iteration 120310/ 173500 | consumed samples: 30799360 | consumed tokens: 63077089280 | elapsed time per iteration (s): 0.83 | learning rate: 5.934E-05 | global batch size: 256 | lm loss: 1.950194E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.714 | TFLOPs: 18.62 |
31: iteration 120320/ 173500 | consumed samples: 30801920 | consumed tokens: 63082332160 | elapsed time per iteration (s): 0.79 | learning rate: 
5.932E-05 | global batch size: 256 | lm loss: 1.966297E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.604 | TFLOPs: 19.70 | 31: iteration 120330/ 173500 | consumed samples: 30804480 | consumed tokens: 63087575040 | elapsed time per iteration (s): 0.73 | learning rate: 5.931E-05 | global batch size: 256 | lm loss: 1.947632E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.697 | TFLOPs: 21.34 | 31: iteration 120340/ 173500 | consumed samples: 30807040 | consumed tokens: 63092817920 | elapsed time per iteration (s): 0.77 | learning rate: 5.929E-05 | global batch size: 256 | lm loss: 1.948380E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.282 | TFLOPs: 20.22 | 31: iteration 120350/ 173500 | consumed samples: 30809600 | consumed tokens: 63098060800 | elapsed time per iteration (s): 0.77 | learning rate: 5.928E-05 | global batch size: 256 | lm loss: 1.948637E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.209 | TFLOPs: 20.10 | 31: iteration 120360/ 173500 | consumed samples: 30812160 | consumed tokens: 63103303680 | elapsed time per iteration (s): 0.79 | learning rate: 5.927E-05 | global batch size: 256 | lm loss: 1.950937E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.974 | TFLOPs: 19.66 | 31: iteration 120370/ 173500 | consumed samples: 30814720 | consumed tokens: 63108546560 | elapsed time per iteration (s): 0.80 | learning rate: 5.925E-05 | global batch size: 256 | lm loss: 1.951011E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.609 | TFLOPs: 19.46 | 31: iteration 120380/ 173500 | 
consumed samples: 30817280 | consumed tokens: 63113789440 | elapsed time per iteration (s): 0.77 | learning rate: 5.924E-05 | global batch size: 256 | lm loss: 1.943566E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.124 | TFLOPs: 20.03 | 31: iteration 120390/ 173500 | consumed samples: 30819840 | consumed tokens: 63119032320 | elapsed time per iteration (s): 0.74 | learning rate: 5.923E-05 | global batch size: 256 | lm loss: 1.932078E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.368 | TFLOPs: 21.01 | 31: iteration 120400/ 173500 | consumed samples: 30822400 | consumed tokens: 63124275200 | elapsed time per iteration (s): 0.77 | learning rate: 5.921E-05 | global batch size: 256 | lm loss: 1.956945E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.497 | TFLOPs: 20.12 | 31: iteration 120410/ 173500 | consumed samples: 30824960 | consumed tokens: 63129518080 | elapsed time per iteration (s): 0.75 | learning rate: 5.920E-05 | global batch size: 256 | lm loss: 1.945162E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.979 | TFLOPs: 20.69 | 31: iteration 120420/ 173500 | consumed samples: 30827520 | consumed tokens: 63134760960 | elapsed time per iteration (s): 0.76 | learning rate: 5.919E-05 | global batch size: 256 | lm loss: 1.961204E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.665 | TFLOPs: 20.31 | 31: iteration 120430/ 173500 | consumed samples: 30830080 | consumed tokens: 63140003840 | elapsed time per iteration (s): 0.78 | learning rate: 5.917E-05 | global batch size: 256 | lm loss: 1.953577E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 327.897 | TFLOPs: 19.84 | 31: iteration 120440/ 173500 | consumed samples: 30832640 | consumed tokens: 63145246720 | elapsed time per iteration (s): 0.79 | learning rate: 5.916E-05 | global batch size: 256 | lm loss: 1.955710E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.491 | TFLOPs: 19.57 | 31: iteration 120450/ 173500 | consumed samples: 30835200 | consumed tokens: 63150489600 | elapsed time per iteration (s): 0.76 | learning rate: 5.914E-05 | global batch size: 256 | lm loss: 1.938534E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.241 | TFLOPs: 20.46 | 31: iteration 120460/ 173500 | consumed samples: 30837760 | consumed tokens: 63155732480 | elapsed time per iteration (s): 0.76 | learning rate: 5.913E-05 | global batch size: 256 | lm loss: 1.943979E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.194 | TFLOPs: 20.28 | 31: iteration 120470/ 173500 | consumed samples: 30840320 | consumed tokens: 63160975360 | elapsed time per iteration (s): 0.79 | learning rate: 5.912E-05 | global batch size: 256 | lm loss: 1.930509E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.040 | TFLOPs: 19.72 | 31: iteration 120480/ 173500 | consumed samples: 30842880 | consumed tokens: 63166218240 | elapsed time per iteration (s): 0.75 | learning rate: 5.910E-05 | global batch size: 256 | lm loss: 1.942729E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.691 | TFLOPs: 20.61 | 31: iteration 120490/ 173500 | consumed samples: 30845440 | consumed tokens: 63171461120 | elapsed time per iteration (s): 0.76 | learning rate: 
5.909E-05 | global batch size: 256 | lm loss: 1.946218E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.978 | TFLOPs: 20.39 | 31: iteration 120500/ 173500 | consumed samples: 30848000 | consumed tokens: 63176704000 | elapsed time per iteration (s): 0.78 | learning rate: 5.908E-05 | global batch size: 256 | lm loss: 1.943674E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.134 | TFLOPs: 19.91 | 31: iteration 120510/ 173500 | consumed samples: 30850560 | consumed tokens: 63181946880 | elapsed time per iteration (s): 0.77 | learning rate: 5.906E-05 | global batch size: 256 | lm loss: 1.898019E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.081 | TFLOPs: 20.15 | 31: iteration 120520/ 173500 | consumed samples: 30853120 | consumed tokens: 63187189760 | elapsed time per iteration (s): 0.79 | learning rate: 5.905E-05 | global batch size: 256 | lm loss: 1.958185E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.924 | TFLOPs: 19.54 | 31: iteration 120530/ 173500 | consumed samples: 30855680 | consumed tokens: 63192432640 | elapsed time per iteration (s): 0.77 | learning rate: 5.904E-05 | global batch size: 256 | lm loss: 1.926018E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.829 | TFLOPs: 20.14 | 31: iteration 120540/ 173500 | consumed samples: 30858240 | consumed tokens: 63197675520 | elapsed time per iteration (s): 0.74 | learning rate: 5.902E-05 | global batch size: 256 | lm loss: 1.954845E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.790 | TFLOPs: 20.80 | 31: iteration 120550/ 173500 | 
consumed samples: 30860800 | consumed tokens: 63202918400 | elapsed time per iteration (s): 0.79 | learning rate: 5.901E-05 | global batch size: 256 | lm loss: 1.966682E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.876 | TFLOPs: 19.53 | 31: iteration 120560/ 173500 | consumed samples: 30863360 | consumed tokens: 63208161280 | elapsed time per iteration (s): 0.76 | learning rate: 5.900E-05 | global batch size: 256 | lm loss: 1.957809E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.989 | TFLOPs: 20.45 | 31: iteration 120570/ 173500 | consumed samples: 30865920 | consumed tokens: 63213404160 | elapsed time per iteration (s): 0.78 | learning rate: 5.898E-05 | global batch size: 256 | lm loss: 1.974430E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.515 | TFLOPs: 19.93 | 31: iteration 120580/ 173500 | consumed samples: 30868480 | consumed tokens: 63218647040 | elapsed time per iteration (s): 0.79 | learning rate: 5.897E-05 | global batch size: 256 | lm loss: 1.924695E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.809 | TFLOPs: 19.53 | 31: iteration 120590/ 173500 | consumed samples: 30871040 | consumed tokens: 63223889920 | elapsed time per iteration (s): 0.74 | learning rate: 5.895E-05 | global batch size: 256 | lm loss: 1.933586E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.951 | TFLOPs: 20.99 | 31: iteration 120600/ 173500 | consumed samples: 30873600 | consumed tokens: 63229132800 | elapsed time per iteration (s): 0.74 | learning rate: 5.894E-05 | global batch size: 256 | lm loss: 1.959887E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 347.429 | TFLOPs: 21.02 | 31: iteration 120610/ 173500 | consumed samples: 30876160 | consumed tokens: 63234375680 | elapsed time per iteration (s): 0.75 | learning rate: 5.893E-05 | global batch size: 256 | lm loss: 1.961038E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.549 | TFLOPs: 20.78 | 31: iteration 120620/ 173500 | consumed samples: 30878720 | consumed tokens: 63239618560 | elapsed time per iteration (s): 0.75 | learning rate: 5.891E-05 | global batch size: 256 | lm loss: 1.957162E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.915 | TFLOPs: 20.68 | 31: iteration 120630/ 173500 | consumed samples: 30881280 | consumed tokens: 63244861440 | elapsed time per iteration (s): 0.78 | learning rate: 5.890E-05 | global batch size: 256 | lm loss: 1.917705E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.899 | TFLOPs: 19.78 | 31: iteration 120640/ 173500 | consumed samples: 30883840 | consumed tokens: 63250104320 | elapsed time per iteration (s): 0.73 | learning rate: 5.889E-05 | global batch size: 256 | lm loss: 1.949860E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.030 | TFLOPs: 21.24 | 31: iteration 120650/ 173500 | consumed samples: 30886400 | consumed tokens: 63255347200 | elapsed time per iteration (s): 0.78 | learning rate: 5.887E-05 | global batch size: 256 | lm loss: 1.952587E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.403 | TFLOPs: 19.87 | 31: iteration 120660/ 173500 | consumed samples: 30888960 | consumed tokens: 63260590080 | elapsed time per iteration (s): 0.78 | learning rate: 
5.886E-05 | global batch size: 256 | lm loss: 1.930290E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.392 | TFLOPs: 19.93 | 31: iteration 120670/ 173500 | consumed samples: 30891520 | consumed tokens: 63265832960 | elapsed time per iteration (s): 0.80 | learning rate: 5.885E-05 | global batch size: 256 | lm loss: 1.968214E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.472 | TFLOPs: 19.27 | 31: iteration 120680/ 173500 | consumed samples: 30894080 | consumed tokens: 63271075840 | elapsed time per iteration (s): 0.79 | learning rate: 5.883E-05 | global batch size: 256 | lm loss: 1.956093E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.334 | TFLOPs: 19.50 | 31: iteration 120690/ 173500 | consumed samples: 30896640 | consumed tokens: 63276318720 | elapsed time per iteration (s): 0.89 | learning rate: 5.882E-05 | global batch size: 256 | lm loss: 1.948702E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.762 | TFLOPs: 17.41 | 31: iteration 120700/ 173500 | consumed samples: 30899200 | consumed tokens: 63281561600 | elapsed time per iteration (s): 0.80 | learning rate: 5.881E-05 | global batch size: 256 | lm loss: 1.922121E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.967 | TFLOPs: 19.30 | 31: iteration 120710/ 173500 | consumed samples: 30901760 | consumed tokens: 63286804480 | elapsed time per iteration (s): 0.80 | learning rate: 5.879E-05 | global batch size: 256 | lm loss: 1.957669E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.943 | TFLOPs: 19.30 | 31: iteration 120720/ 173500 | 
consumed samples: 30904320 | consumed tokens: 63292047360 | elapsed time per iteration (s): 0.75 | learning rate: 5.878E-05 | global batch size: 256 | lm loss: 1.951743E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.825 | TFLOPs: 20.62 | 31: iteration 120730/ 173500 | consumed samples: 30906880 | consumed tokens: 63297290240 | elapsed time per iteration (s): 0.81 | learning rate: 5.877E-05 | global batch size: 256 | lm loss: 1.928381E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.536 | TFLOPs: 19.21 | 31: iteration 120740/ 173500 | consumed samples: 30909440 | consumed tokens: 63302533120 | elapsed time per iteration (s): 0.83 | learning rate: 5.875E-05 | global batch size: 256 | lm loss: 1.941598E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.088 | TFLOPs: 18.70 | 31: iteration 120750/ 173500 | consumed samples: 30912000 | consumed tokens: 63307776000 | elapsed time per iteration (s): 0.76 | learning rate: 5.874E-05 | global batch size: 256 | lm loss: 1.942576E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.765 | TFLOPs: 20.43 | 31: iteration 120760/ 173500 | consumed samples: 30914560 | consumed tokens: 63313018880 | elapsed time per iteration (s): 0.76 | learning rate: 5.872E-05 | global batch size: 256 | lm loss: 1.933807E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.444 | TFLOPs: 20.47 | 31: iteration 120770/ 173500 | consumed samples: 30917120 | consumed tokens: 63318261760 | elapsed time per iteration (s): 0.74 | learning rate: 5.871E-05 | global batch size: 256 | lm loss: 1.953170E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 344.201 | TFLOPs: 20.82 | 31: iteration 120780/ 173500 | consumed samples: 30919680 | consumed tokens: 63323504640 | elapsed time per iteration (s): 0.77 | learning rate: 5.870E-05 | global batch size: 256 | lm loss: 1.936926E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.720 | TFLOPs: 20.01 | 31: iteration 120790/ 173500 | consumed samples: 30922240 | consumed tokens: 63328747520 | elapsed time per iteration (s): 0.89 | learning rate: 5.868E-05 | global batch size: 256 | lm loss: 1.968352E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.139 | TFLOPs: 17.31 | 31: iteration 120800/ 173500 | consumed samples: 30924800 | consumed tokens: 63333990400 | elapsed time per iteration (s): 0.75 | learning rate: 5.867E-05 | global batch size: 256 | lm loss: 1.918798E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.773 | TFLOPs: 20.74 | 31: iteration 120810/ 173500 | consumed samples: 30927360 | consumed tokens: 63339233280 | elapsed time per iteration (s): 0.83 | learning rate: 5.866E-05 | global batch size: 256 | lm loss: 1.950179E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.512 | TFLOPs: 18.66 | 31: iteration 120820/ 173500 | consumed samples: 30929920 | consumed tokens: 63344476160 | elapsed time per iteration (s): 0.85 | learning rate: 5.864E-05 | global batch size: 256 | lm loss: 1.965147E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.821 | TFLOPs: 18.14 | 31: iteration 120830/ 173500 | consumed samples: 30932480 | consumed tokens: 63349719040 | elapsed time per iteration (s): 0.88 | learning rate: 
5.863E-05 | global batch size: 256 | lm loss: 1.924559E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.249 | TFLOPs: 17.56 | 31: iteration 120840/ 173500 | consumed samples: 30935040 | consumed tokens: 63354961920 | elapsed time per iteration (s): 0.88 | learning rate: 5.862E-05 | global batch size: 256 | lm loss: 1.958479E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.449 | TFLOPs: 17.57 | 31: iteration 120850/ 173500 | consumed samples: 30937600 | consumed tokens: 63360204800 | elapsed time per iteration (s): 0.86 | learning rate: 5.860E-05 | global batch size: 256 | lm loss: 1.954451E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.190 | TFLOPs: 17.92 | 31: iteration 120860/ 173500 | consumed samples: 30940160 | consumed tokens: 63365447680 | elapsed time per iteration (s): 0.87 | learning rate: 5.859E-05 | global batch size: 256 | lm loss: 1.933666E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.230 | TFLOPs: 17.86 | 31: iteration 120870/ 173500 | consumed samples: 30942720 | consumed tokens: 63370690560 | elapsed time per iteration (s): 0.82 | learning rate: 5.858E-05 | global batch size: 256 | lm loss: 1.960304E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.530 | TFLOPs: 18.97 | 31: iteration 120880/ 173500 | consumed samples: 30945280 | consumed tokens: 63375933440 | elapsed time per iteration (s): 0.82 | learning rate: 5.856E-05 | global batch size: 256 | lm loss: 1.974444E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.443 | TFLOPs: 18.78 | 31: iteration 120890/ 173500 | 
consumed samples: 30947840 | consumed tokens: 63381176320 | elapsed time per iteration (s): 0.83 | learning rate: 5.855E-05 | global batch size: 256 | lm loss: 1.978521E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.201 | TFLOPs: 18.77 | 31: iteration 120900/ 173500 | consumed samples: 30950400 | consumed tokens: 63386419200 | elapsed time per iteration (s): 0.80 | learning rate: 5.854E-05 | global batch size: 256 | lm loss: 1.960612E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.781 | TFLOPs: 19.41 | 31: iteration 120910/ 173500 | consumed samples: 30952960 | consumed tokens: 63391662080 | elapsed time per iteration (s): 0.81 | learning rate: 5.852E-05 | global batch size: 256 | lm loss: 1.934085E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.690 | TFLOPs: 19.22 | 31: iteration 120920/ 173500 | consumed samples: 30955520 | consumed tokens: 63396904960 | elapsed time per iteration (s): 0.79 | learning rate: 5.851E-05 | global batch size: 256 | lm loss: 1.973101E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.186 | TFLOPs: 19.55 | 31: iteration 120930/ 173500 | consumed samples: 30958080 | consumed tokens: 63402147840 | elapsed time per iteration (s): 0.90 | learning rate: 5.850E-05 | global batch size: 256 | lm loss: 1.953186E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.178 | TFLOPs: 17.25 | 31: iteration 120940/ 173500 | consumed samples: 30960640 | consumed tokens: 63407390720 | elapsed time per iteration (s): 0.81 | learning rate: 5.848E-05 | global batch size: 256 | lm loss: 1.948338E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 315.561 | TFLOPs: 19.09 | 31: iteration 120950/ 173500 | consumed samples: 30963200 | consumed tokens: 63412633600 | elapsed time per iteration (s): 0.82 | learning rate: 5.847E-05 | global batch size: 256 | lm loss: 1.946838E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.317 | TFLOPs: 18.95 | 31: iteration 120960/ 173500 | consumed samples: 30965760 | consumed tokens: 63417876480 | elapsed time per iteration (s): 0.81 | learning rate: 5.845E-05 | global batch size: 256 | lm loss: 1.957361E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.241 | TFLOPs: 19.01 | 31: iteration 120970/ 173500 | consumed samples: 30968320 | consumed tokens: 63423119360 | elapsed time per iteration (s): 0.87 | learning rate: 5.844E-05 | global batch size: 256 | lm loss: 1.977979E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.196 | TFLOPs: 17.86 | 31: iteration 120980/ 173500 | consumed samples: 30970880 | consumed tokens: 63428362240 | elapsed time per iteration (s): 0.86 | learning rate: 5.843E-05 | global batch size: 256 | lm loss: 1.945118E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.875 | TFLOPs: 18.08 | 31: iteration 120990/ 173500 | consumed samples: 30973440 | consumed tokens: 63433605120 | elapsed time per iteration (s): 0.87 | learning rate: 5.841E-05 | global batch size: 256 | lm loss: 1.966485E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.956 | TFLOPs: 17.72 | 31: iteration 121000/ 173500 | consumed samples: 30976000 | consumed tokens: 63438848000 | elapsed time per iteration (s): 0.81 | learning rate: 
5.840E-05 | global batch size: 256 | lm loss: 1.945331E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.876 | TFLOPs: 19.17 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 121000 | lm loss value: 1.890003E+00 | lm loss PPL: 6.619391E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 121000 to checkpoints_1b1long 0: [2022-11-26 21:20:51,741] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step121000 is begin to save! 0: [2022-11-26 21:20:51,750] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_01-model_00-model_states.pt... 0: [2022-11-26 21:20:51,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_01-model_00-model_states.pt. 0: [2022-11-26 21:20:51,974] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_03-model_00-model_states.pt... 0: [2022-11-26 21:20:52,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_03-model_00-model_states.pt. 0: [2022-11-26 21:20:52,059] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_04-model_00-model_states.pt... 0: [2022-11-26 21:20:52,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_04-model_00-model_states.pt. 0: [2022-11-26 21:20:52,144] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_05-model_00-model_states.pt... 0: [2022-11-26 21:20:52,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_05-model_00-model_states.pt. 
0: [2022-11-26 21:20:52,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_06-model_00-model_states.pt... 0: [2022-11-26 21:20:52,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_06-model_00-model_states.pt. 0: [2022-11-26 21:20:52,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_07-model_00-model_states.pt... 0: [2022-11-26 21:20:52,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_07-model_00-model_states.pt. 0: [2022-11-26 21:20:52,377] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_08-model_00-model_states.pt... 0: [2022-11-26 21:20:52,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_08-model_00-model_states.pt. 0: [2022-11-26 21:20:52,458] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_09-model_00-model_states.pt... 0: [2022-11-26 21:20:52,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_09-model_00-model_states.pt. 0: [2022-11-26 21:20:52,539] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_10-model_00-model_states.pt... 0: [2022-11-26 21:20:52,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_10-model_00-model_states.pt. 0: [2022-11-26 21:20:52,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_11-model_00-model_states.pt... 0: [2022-11-26 21:20:52,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_11-model_00-model_states.pt. 
0: [2022-11-26 21:20:52,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_12-model_00-model_states.pt... 0: [2022-11-26 21:20:52,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_12-model_00-model_states.pt. 0: [2022-11-26 21:20:52,774] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_13-model_00-model_states.pt... 0: [2022-11-26 21:20:52,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_13-model_00-model_states.pt. 0: [2022-11-26 21:20:52,847] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_14-model_00-model_states.pt... 0: [2022-11-26 21:20:52,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_14-model_00-model_states.pt. 0: [2022-11-26 21:20:52,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_15-model_00-model_states.pt... 0: [2022-11-26 21:20:53,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_15-model_00-model_states.pt. 0: [2022-11-26 21:20:53,001] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_16-model_00-model_states.pt... 0: [2022-11-26 21:20:53,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_16-model_00-model_states.pt. 0: [2022-11-26 21:20:53,077] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_17-model_00-model_states.pt... 0: [2022-11-26 21:20:53,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_17-model_00-model_states.pt. 
0: [2022-11-26 21:20:53,155] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_18-model_00-model_states.pt... 0: [2022-11-26 21:20:53,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_18-model_00-model_states.pt. 0: [2022-11-26 21:20:53,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_19-model_00-model_states.pt... 0: [2022-11-26 21:20:53,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_19-model_00-model_states.pt. 0: [2022-11-26 21:20:53,305] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_20-model_00-model_states.pt... 0: [2022-11-26 21:20:53,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_20-model_00-model_states.pt. 0: [2022-11-26 21:20:53,381] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_21-model_00-model_states.pt... 0: [2022-11-26 21:20:53,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_21-model_00-model_states.pt. 0: [2022-11-26 21:20:53,457] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_22-model_00-model_states.pt... 0: [2022-11-26 21:20:53,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_22-model_00-model_states.pt. 0: [2022-11-26 21:20:53,536] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_23-model_00-model_states.pt... 0: [2022-11-26 21:20:53,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_23-model_00-model_states.pt. 
0: [2022-11-26 21:20:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_24-model_00-model_states.pt... 0: [2022-11-26 21:20:53,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_24-model_00-model_states.pt. 0: [2022-11-26 21:20:53,686] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_25-model_00-model_states.pt... 0: [2022-11-26 21:20:53,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_25-model_00-model_states.pt. 0: [2022-11-26 21:20:53,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_26-model_00-model_states.pt... 0: [2022-11-26 21:20:53,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_26-model_00-model_states.pt. 0: [2022-11-26 21:20:53,838] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_27-model_00-model_states.pt... 0: [2022-11-26 21:20:53,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_27-model_00-model_states.pt. 0: [2022-11-26 21:20:53,911] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_28-model_00-model_states.pt... 0: [2022-11-26 21:20:53,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_28-model_00-model_states.pt. 0: [2022-11-26 21:20:53,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/layer_30-model_00-model_states.pt... 0: [2022-11-26 21:20:53,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/layer_30-model_00-model_states.pt. 
0: [2022-11-26 21:20:53,991] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step121000/mp_rank_00_model_states.pt 0: [2022-11-26 21:20:53,991] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/mp_rank_00_model_states.pt... 0: [2022-11-26 21:20:53,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/mp_rank_00_model_states.pt. 0: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
9: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
13: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
20: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 
28: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 
31: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 
17: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 
19: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
9: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
2: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
15: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 
11: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 
29: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 
19: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
6: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
8: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
15: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 
14: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 
18: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
9: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 
24: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 
27: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 
22: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 
16: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:20:54,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:20:54,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 21:20:54,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 21:20:54,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 6: [2022-11-26 21:20:54,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:20:54,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
6: [2022-11-26 21:20:54,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 12: [2022-11-26 21:20:54,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 6: [2022-11-26 21:20:54,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 12: [2022-11-26 21:20:54,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 31: [2022-11-26 21:20:54,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 21:20:54,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 21:20:54,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 23: [2022-11-26 21:20:54,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 21:20:54,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 21:20:54,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 28: [2022-11-26 21:20:54,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:20:54,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-26 21:20:54,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 21:20:54,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 0: [2022-11-26 21:20:54,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 27: [2022-11-26 21:20:54,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 21:20:54,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 14: [2022-11-26 21:20:54,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:20:54,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 21:20:54,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 27: [2022-11-26 21:20:54,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 9: [2022-11-26 21:20:54,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
28: [2022-11-26 21:20:54,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 9: [2022-11-26 21:20:54,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 2: [2022-11-26 21:20:54,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 28: [2022-11-26 21:20:54,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 9: [2022-11-26 21:20:54,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 2: [2022-11-26 21:20:54,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 21:20:54,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 25: [2022-11-26 21:20:54,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 21:20:54,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 21:20:54,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 3: [2022-11-26 21:20:54,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 21:20:54,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 21:20:54,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 19: [2022-11-26 21:20:54,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:20:54,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 21:20:54,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 30: [2022-11-26 21:20:54,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:20:54,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 21:20:54,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 15: [2022-11-26 21:20:54,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:20:54,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 21:20:54,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 7: [2022-11-26 21:20:54,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 21:20:54,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 21:20:54,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 10: [2022-11-26 21:20:54,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:20:54,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 21:20:54,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 11: [2022-11-26 21:20:54,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:20:54,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:20:54,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 21:20:54,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 5: [2022-11-26 21:20:54,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 17: [2022-11-26 21:20:54,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 
5: [2022-11-26 21:20:54,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 21:20:54,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 17: [2022-11-26 21:20:54,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 21:20:54,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 13: [2022-11-26 21:20:54,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:20:54,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 13: [2022-11-26 21:20:54,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 11: [2022-11-26 21:20:54,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 0: [2022-11-26 21:20:54,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 13: [2022-11-26 21:20:54,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 0: [2022-11-26 21:20:54,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 21: [2022-11-26 21:20:54,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
21: [2022-11-26 21:20:54,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 21:20:54,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 8: [2022-11-26 21:20:54,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:20:54,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 21:20:54,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 21: [2022-11-26 21:20:54,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:20:54,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 21:20:54,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 26: [2022-11-26 21:20:54,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:20:54,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
26: [2022-11-26 21:20:54,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 2: [2022-11-26 21:20:54,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 26: [2022-11-26 21:20:54,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 2: [2022-11-26 21:20:54,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 15: [2022-11-26 21:20:54,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 22: [2022-11-26 21:20:54,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 21:20:54,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:20:54,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 22: [2022-11-26 21:20:54,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 21:20:54,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 15: [2022-11-26 21:20:54,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 22: [2022-11-26 21:20:54,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
22: [2022-11-26 21:20:54,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 20: [2022-11-26 21:20:54,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:20:54,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 21:20:54,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 10: [2022-11-26 21:20:54,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:20:54,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 21:20:54,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 12: [2022-11-26 21:20:54,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:20:54,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 21:20:54,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 14: [2022-11-26 21:20:54,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-26 21:20:54,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 21:20:54,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 6: [2022-11-26 21:20:54,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:20:54,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 21:20:54,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:20:54,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 6: [2022-11-26 21:20:54,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 21:20:54,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 3: [2022-11-26 21:20:54,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:20:54,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 19: [2022-11-26 21:20:54,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:20:54,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
20: [2022-11-26 21:20:54,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:20:54,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 20: [2022-11-26 21:20:54,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 25: [2022-11-26 21:20:54,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:20:54,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 25: [2022-11-26 21:20:54,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 20: [2022-11-26 21:20:54,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 25: [2022-11-26 21:20:54,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 13: [2022-11-26 21:20:54,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:20:54,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 21:20:54,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 27: [2022-11-26 21:20:54,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
27: [2022-11-26 21:20:54,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 21:20:54,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 30: [2022-11-26 21:20:54,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:20:54,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:20:54,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 23: [2022-11-26 21:20:54,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 21:20:54,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 30: [2022-11-26 21:20:54,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 21:20:54,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 23: [2022-11-26 21:20:54,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 30: [2022-11-26 21:20:54,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 12: [2022-11-26 21:20:54,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 21:20:54,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 21:20:54,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 0: [2022-11-26 21:20:54,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:20:54,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 21:20:54,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 9: [2022-11-26 21:20:54,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:20:54,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 21:20:54,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 9: [2022-11-26 21:20:54,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:20:54,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 21:20:54,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 0: [2022-11-26 21:20:54,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 21:20:54,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 26: [2022-11-26 21:20:54,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:20:54,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 26: [2022-11-26 21:20:54,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 21:20:54,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 10: [2022-11-26 21:20:54,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:20:54,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 21:20:54,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 8: [2022-11-26 21:20:54,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:20:54,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 21:20:54,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 2: [2022-11-26 21:20:54,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 21:20:54,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 21:20:54,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 7: [2022-11-26 21:20:54,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:20:54,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 21:20:54,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:20:54,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 7: [2022-11-26 21:20:54,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 21:20:54,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 29: [2022-11-26 21:20:54,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:20:54,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 21:20:54,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 29: [2022-11-26 21:20:54,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
29: [2022-11-26 21:20:54,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 21:20:54,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 24: [2022-11-26 21:20:54,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 31: [2022-11-26 21:20:54,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:20:54,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 31: [2022-11-26 21:20:54,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 21:20:54,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 21:20:54,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 31: [2022-11-26 21:20:54,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 24: [2022-11-26 21:20:54,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 31: [2022-11-26 21:20:54,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 4: [2022-11-26 21:20:54,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 21:20:54,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:20:54,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 21:20:54,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 21:20:54,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 4: [2022-11-26 21:20:54,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 27: [2022-11-26 21:20:54,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 21:20:54,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 3: [2022-11-26 21:20:54,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 27: [2022-11-26 21:20:54,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 3: [2022-11-26 21:20:54,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 21:20:54,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 17: [2022-11-26 21:20:54,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 
14: [2022-11-26 21:20:54,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:20:54,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 21:20:54,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 17: [2022-11-26 21:20:54,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:20:54,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 17: [2022-11-26 21:20:54,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 21:20:54,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 21:20:54,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 28: [2022-11-26 21:20:54,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:20:54,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 17: [2022-11-26 21:20:54,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 24: [2022-11-26 21:20:54,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
28: [2022-11-26 21:20:54,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 21:20:54,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 21:20:54,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 21:20:54,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 28: [2022-11-26 21:20:54,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 21: [2022-11-26 21:20:54,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:20:54,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 21:20:54,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 23: [2022-11-26 21:20:54,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 21:20:54,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 21:20:54,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 19: [2022-11-26 21:20:54,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
19: [2022-11-26 21:20:54,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 21:20:54,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 20: [2022-11-26 21:20:54,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:20:54,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 21:20:54,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 25: [2022-11-26 21:20:54,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 21:20:54,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 21:20:54,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 4: [2022-11-26 21:20:54,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:20:54,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 21:20:54,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 13: [2022-11-26 21:20:54,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 21:20:54,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 21:20:54,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 11: [2022-11-26 21:20:54,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:20:54,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:20:54,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 21:20:54,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 21:20:54,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 11: [2022-11-26 21:20:54,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 6: [2022-11-26 21:20:54,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:20:54,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 21:20:54,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 16: [2022-11-26 21:20:54,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 
16: [2022-11-26 21:20:54,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:20:54,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:20:54,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 21:20:54,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 21:20:54,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 16: [2022-11-26 21:20:54,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 16: [2022-11-26 21:20:54,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 21:20:54,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 26: [2022-11-26 21:20:54,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 21:20:54,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 21:20:54,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 22: [2022-11-26 21:20:54,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
22: [2022-11-26 21:20:54,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 21:20:54,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 15: [2022-11-26 21:20:54,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:20:54,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 21:20:54,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 12: [2022-11-26 21:20:54,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:20:54,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 21:20:54,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 14: [2022-11-26 21:20:54,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:20:54,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 21:20:54,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 31: [2022-11-26 21:20:54,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 
31: [2022-11-26 21:20:54,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 21:20:54,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 1: [2022-11-26 21:20:54,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:20:54,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:20:54,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:20:54,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:20:54,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 21:20:54,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 21:20:54,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
1: [2022-11-26 21:20:54,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 21:20:54,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 21:20:54,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 1: [2022-11-26 21:20:54,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 1: [2022-11-26 21:20:54,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 23: [2022-11-26 21:20:54,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 21:20:54,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 21:20:54,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 10: [2022-11-26 21:20:54,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:20:54,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 21:20:54,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 7: [2022-11-26 21:20:54,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 21:20:54,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 21:20:54,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 5: [2022-11-26 21:20:54,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:20:54,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 21:20:54,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 5: [2022-11-26 21:20:54,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:20:54,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 21:20:54,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 5: [2022-11-26 21:20:54,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:20:54,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 21:20:54,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 21: [2022-11-26 21:20:54,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-26 21:20:54,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 21:20:54,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 2: [2022-11-26 21:20:54,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:20:54,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 21:20:54,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 29: [2022-11-26 21:20:54,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:20:54,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 21:20:54,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 18: [2022-11-26 21:20:54,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:20:54,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 21:20:54,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 28: [2022-11-26 21:20:54,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-26 21:20:54,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 21:20:54,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 15: [2022-11-26 21:20:54,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:20:54,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 21:20:54,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 18: [2022-11-26 21:20:54,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:20:54,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:20:54,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 21:20:54,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 18: [2022-11-26 21:20:54,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 21:20:54,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 30: [2022-11-26 21:20:54,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-26 21:20:54,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 21:20:54,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 24: [2022-11-26 21:20:54,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:20:54,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 21:20:54,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 17: [2022-11-26 21:20:54,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 21:20:54,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 21:20:54,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 25: [2022-11-26 21:20:54,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 21:20:54,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 11: [2022-11-26 21:20:54,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-26 21:20:54,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 21:20:54,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 25: [2022-11-26 21:20:54,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 9: [2022-11-26 21:20:54,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:20:54,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 21:20:54,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 22: [2022-11-26 21:20:54,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 21:20:54,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 21:20:54,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 18: [2022-11-26 21:20:54,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:20:54,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 21:20:54,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
0: [2022-11-26 21:20:54,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:20:54,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 21:20:54,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 8: [2022-11-26 21:20:54,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:20:54,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 21:20:54,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 13: [2022-11-26 21:20:54,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:20:54,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 21:20:54,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 27: [2022-11-26 21:20:54,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 21:20:54,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 21:20:54,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
3: [2022-11-26 21:20:54,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:20:54,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 21:20:54,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 19: [2022-11-26 21:20:54,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:20:54,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 21:20:54,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 16: [2022-11-26 21:20:54,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:20:54,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 21:20:54,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 20: [2022-11-26 21:20:54,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:20:54,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 21:20:54,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
26: [2022-11-26 21:20:54,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 21:20:54,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 21:20:54,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 6: [2022-11-26 21:20:54,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:20:54,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 21:20:54,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 5: [2022-11-26 21:20:54,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:20:54,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 21:20:54,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 1: [2022-11-26 21:20:54,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:20:54,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 21:20:54,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
18: [2022-11-26 21:20:54,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:20:54,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 21:20:54,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 2: [2022-11-26 21:20:54,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:20:54,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 21:20:54,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 7: [2022-11-26 21:20:54,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:20:54,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:20:54,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 7: [2022-11-26 21:20:54,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 21:20:54,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 4: [2022-11-26 21:20:54,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
12: [2022-11-26 21:20:54,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:20:54,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 21:20:54,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 14: [2022-11-26 21:20:54,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:20:54,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 21:20:54,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 31: [2022-11-26 21:20:54,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 21:20:54,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 21:20:54,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 29: [2022-11-26 21:20:54,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:20:54,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 21:20:54,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
19: [2022-11-26 21:20:54,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:20:54,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 21:20:54,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 25: [2022-11-26 21:20:54,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:20:54,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 25: [2022-11-26 21:20:54,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 21:20:54,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 0: [2022-11-26 21:20:54,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 21:20:54,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 23: [2022-11-26 21:20:54,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 21:20:54,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 21:20:54,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
10: [2022-11-26 21:20:54,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:20:54,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 21:20:54,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 28: [2022-11-26 21:20:54,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 21:20:54,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 21:20:54,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 24: [2022-11-26 21:20:54,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:20:54,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 21:20:54,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 9: [2022-11-26 21:20:54,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:20:54,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 21:20:54,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
15: [2022-11-26 21:20:54,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:20:54,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 21:20:54,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 30: [2022-11-26 21:20:54,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:20:54,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 17: [2022-11-26 21:20:54,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:20:54,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 17: [2022-11-26 21:20:54,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 21:20:54,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 21: [2022-11-26 21:20:54,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:20:54,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 21:20:54,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
8: [2022-11-26 21:20:54,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:20:54,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 21:20:54,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 13: [2022-11-26 21:20:54,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:20:54,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 21:20:54,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 22: [2022-11-26 21:20:54,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 21:20:54,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 21:20:54,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 3: [2022-11-26 21:20:54,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:20:54,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 21:20:54,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
27: [2022-11-26 21:20:54,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 21:20:54,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 21:20:54,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 26: [2022-11-26 21:20:54,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 21:20:54,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 21:20:54,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 7: [2022-11-26 21:20:54,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:20:54,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 21:20:54,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 4: [2022-11-26 21:20:54,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:20:54,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 21:20:54,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
6: [2022-11-26 21:20:54,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:20:54,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:20:54,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 20: [2022-11-26 21:20:54,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 6: [2022-11-26 21:20:54,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 20: [2022-11-26 21:20:54,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 12: [2022-11-26 21:20:54,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:20:54,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 21:20:54,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 14: [2022-11-26 21:20:54,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:20:54,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
14: [2022-11-26 21:20:54,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 16: [2022-11-26 21:20:54,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 14: [2022-11-26 21:20:54,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 16: [2022-11-26 21:20:54,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 5: [2022-11-26 21:20:54,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:20:54,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 21:20:54,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 21: [2022-11-26 21:20:54,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:20:54,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 21:20:54,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 1: [2022-11-26 21:20:54,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 21:20:54,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 21:20:54,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 18: [2022-11-26 21:20:54,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:20:54,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 21:20:54,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 11: [2022-11-26 21:20:54,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:20:54,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 21:20:54,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 28: [2022-11-26 21:20:54,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 21:20:54,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 21:20:54,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 9: [2022-11-26 21:20:54,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 21:20:54,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 21:20:54,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 30: [2022-11-26 21:20:54,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:20:54,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 21:20:54,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 0: [2022-11-26 21:20:54,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:20:54,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 21:20:54,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 15: [2022-11-26 21:20:54,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:20:54,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 21:20:54,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 19: [2022-11-26 21:20:54,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-26 21:20:54,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 21:20:54,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 3: [2022-11-26 21:20:54,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:20:54,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:20:54,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 21:20:54,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 10: [2022-11-26 21:20:54,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 21:20:54,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 27: [2022-11-26 21:20:54,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:20:54,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
27: [2022-11-26 21:20:54,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 13: [2022-11-26 21:20:54,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:20:54,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 27: [2022-11-26 21:20:54,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 13: [2022-11-26 21:20:54,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 21:20:54,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 24: [2022-11-26 21:20:54,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 8: [2022-11-26 21:20:54,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:20:54,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 21:20:54,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 23: [2022-11-26 21:20:54,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
23: [2022-11-26 21:20:54,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 21:20:54,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 17: [2022-11-26 21:20:54,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 21:20:54,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 25: [2022-11-26 21:20:54,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 21:20:54,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 21:20:54,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 31: [2022-11-26 21:20:54,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 21:20:54,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 21:20:54,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 2: [2022-11-26 21:20:54,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 17: [2022-11-26 21:20:54,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
2: [2022-11-26 21:20:54,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 29: [2022-11-26 21:20:54,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:20:54,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 2: [2022-11-26 21:20:54,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 29: [2022-11-26 21:20:54,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 1: [2022-11-26 21:20:54,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:20:54,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 21:20:54,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 6: [2022-11-26 21:20:54,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:20:54,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 21:20:54,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 26: [2022-11-26 21:20:54,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
26: [2022-11-26 21:20:54,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 21:20:54,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 20: [2022-11-26 21:20:54,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:20:54,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 21:20:54,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 11: [2022-11-26 21:20:54,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:20:54,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 21:20:54,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 16: [2022-11-26 21:20:54,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:20:54,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 21:20:54,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 22: [2022-11-26 21:20:54,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
22: [2022-11-26 21:20:54,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 21:20:54,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 5: [2022-11-26 21:20:54,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:20:54,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 21:20:54,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 14: [2022-11-26 21:20:54,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:20:54,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 21:20:54,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 4: [2022-11-26 21:20:54,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:20:54,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 21:20:54,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 23: [2022-11-26 21:20:54,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-26 21:20:54,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 21:20:54,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 21: [2022-11-26 21:20:54,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:20:54,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 21:20:54,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 12: [2022-11-26 21:20:54,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:20:54,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 21:20:54,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 2: [2022-11-26 21:20:54,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:20:54,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 21:20:54,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 31: [2022-11-26 21:20:54,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-26 21:20:54,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 21:20:54,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 7: [2022-11-26 21:20:54,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:20:54,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 21:20:54,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 29: [2022-11-26 21:20:54,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:20:54,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 21:20:54,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 28: [2022-11-26 21:20:54,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 21:20:54,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 21:20:54,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 10: [2022-11-26 21:20:54,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-26 21:20:54,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 21:20:54,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 0: [2022-11-26 21:20:54,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:20:54,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 21:20:54,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 30: [2022-11-26 21:20:54,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:20:54,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 21:20:54,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 19: [2022-11-26 21:20:54,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:20:54,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 21:20:54,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 15: [2022-11-26 21:20:54,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-26 21:20:54,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 21:20:54,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 25: [2022-11-26 21:20:54,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 21:20:54,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 21:20:54,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 17: [2022-11-26 21:20:54,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:20:54,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:20:54,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 17: [2022-11-26 21:20:54,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 11: [2022-11-26 21:20:54,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 17: [2022-11-26 21:20:54,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 8: [2022-11-26 21:20:54,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
22: [2022-11-26 21:20:54,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:20:54,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 21:20:54,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 22: [2022-11-26 21:20:54,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 21:20:54,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 3: [2022-11-26 21:20:54,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:20:54,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 20: [2022-11-26 21:20:54,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:20:54,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 20: [2022-11-26 21:20:54,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 21:20:54,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 9: [2022-11-26 21:20:54,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
4: [2022-11-26 21:20:54,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:20:54,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 4: [2022-11-26 21:20:54,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 9: [2022-11-26 21:20:54,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 4: [2022-11-26 21:20:54,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 5: [2022-11-26 21:20:54,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:20:54,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 24: [2022-11-26 21:20:54,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:20:54,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 24: [2022-11-26 21:20:54,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 21:20:54,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 26: [2022-11-26 21:20:54,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
26: [2022-11-26 21:20:54,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 21:20:54,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 7: [2022-11-26 21:20:54,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:20:54,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 21:20:54,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 1: [2022-11-26 21:20:54,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:20:54,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:20:54,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 1: [2022-11-26 21:20:54,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 12: [2022-11-26 21:20:54,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 1: [2022-11-26 21:20:54,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 13: [2022-11-26 21:20:54,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-26 21:20:54,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 21:20:54,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 28: [2022-11-26 21:20:54,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 21:20:54,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 21:20:54,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 21: [2022-11-26 21:20:54,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:20:54,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 21:20:54,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 2: [2022-11-26 21:20:54,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:20:54,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 21:20:54,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 0: [2022-11-26 21:20:54,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 21:20:54,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 21:20:54,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 6: [2022-11-26 21:20:54,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:20:54,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 21:20:54,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 30: [2022-11-26 21:20:54,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:20:54,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 21:20:54,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 14: [2022-11-26 21:20:54,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:20:54,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 21:20:54,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 23: [2022-11-26 21:20:54,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-26 21:20:54,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 21:20:54,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 18: [2022-11-26 21:20:54,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:20:54,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 21:20:54,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 10: [2022-11-26 21:20:54,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:20:54,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:20:54,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 16: [2022-11-26 21:20:54,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 21:20:54,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 10: [2022-11-26 21:20:54,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 27: [2022-11-26 21:20:54,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 
27: [2022-11-26 21:20:54,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 21:20:54,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 25: [2022-11-26 21:20:54,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 31: [2022-11-26 21:20:54,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 25: [2022-11-26 21:20:54,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 21:20:54,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 31: [2022-11-26 21:20:54,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 21:20:54,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 13: [2022-11-26 21:20:54,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:20:54,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 21:20:54,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 4: [2022-11-26 21:20:54,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 21:20:54,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 21:20:54,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 16: [2022-11-26 21:20:54,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:20:54,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 21:20:54,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 15: [2022-11-26 21:20:54,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:20:54,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 21:20:54,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 11: [2022-11-26 21:20:54,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:20:54,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
11: [2022-11-26 21:20:54,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 19: [2022-11-26 21:20:54,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 11: [2022-11-26 21:20:54,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 19: [2022-11-26 21:20:54,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 18: [2022-11-26 21:20:54,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:20:54,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 21:20:54,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 8: [2022-11-26 21:20:54,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:20:54,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 21:20:54,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 27: [2022-11-26 21:20:54,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
27: [2022-11-26 21:20:54,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 21:20:54,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 17: [2022-11-26 21:20:54,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 21:20:54,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 21:20:54,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 29: [2022-11-26 21:20:54,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:20:54,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 9: [2022-11-26 21:20:54,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:20:54,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 9: [2022-11-26 21:20:54,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 21:20:54,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 22: [2022-11-26 21:20:54,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
22: [2022-11-26 21:20:54,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 21:20:54,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 24: [2022-11-26 21:20:54,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:20:54,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 21:20:54,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 3: [2022-11-26 21:20:54,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:20:54,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 21:20:54,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 24: [2022-11-26 21:20:54,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:20:54,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 21:20:54,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 29: [2022-11-26 21:20:54,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-26 21:20:54,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step121000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 21:20:54,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 0: successfully saved checkpoint at iteration 121000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2580.03 31: iteration 121010/ 173500 | consumed samples: 30978560 | consumed tokens: 63444090880 | elapsed time per iteration (s): 1.20 | learning rate: 5.839E-05 | global batch size: 256 | lm loss: 1.970988E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 212.651 | TFLOPs: 12.86 | 31: iteration 121020/ 173500 | consumed samples: 30981120 | consumed tokens: 63449333760 | elapsed time per iteration (s): 0.83 | learning rate: 5.837E-05 | global batch size: 256 | lm loss: 1.960253E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.488 | TFLOPs: 18.60 | 31: iteration 121030/ 173500 | consumed samples: 30983680 | consumed tokens: 63454576640 | elapsed time per iteration (s): 0.83 | learning rate: 5.836E-05 | global batch size: 256 | lm loss: 1.950506E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.036 | TFLOPs: 18.57 | 31: iteration 121040/ 173500 | consumed samples: 30986240 | consumed tokens: 63459819520 | elapsed time per iteration (s): 0.84 | learning rate: 5.835E-05 | global batch size: 256 | lm loss: 1.940698E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.517 | TFLOPs: 18.48 | 31: iteration 121050/ 173500 | consumed samples: 30988800 | consumed tokens: 63465062400 | elapsed time per iteration (s): 0.83 | learning rate: 5.833E-05 | 
global batch size: 256 | lm loss: 1.964325E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.791 | TFLOPs: 18.62 | 31: iteration 121060/ 173500 | consumed samples: 30991360 | consumed tokens: 63470305280 | elapsed time per iteration (s): 0.91 | learning rate: 5.832E-05 | global batch size: 256 | lm loss: 1.947072E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.287 | TFLOPs: 17.08 | 31: iteration 121070/ 173500 | consumed samples: 30993920 | consumed tokens: 63475548160 | elapsed time per iteration (s): 0.80 | learning rate: 5.831E-05 | global batch size: 256 | lm loss: 1.950459E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.288 | TFLOPs: 19.32 | 31: iteration 121080/ 173500 | consumed samples: 30996480 | consumed tokens: 63480791040 | elapsed time per iteration (s): 0.83 | learning rate: 5.829E-05 | global batch size: 256 | lm loss: 1.976341E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.197 | TFLOPs: 18.77 | 31: iteration 121090/ 173500 | consumed samples: 30999040 | consumed tokens: 63486033920 | elapsed time per iteration (s): 0.84 | learning rate: 5.828E-05 | global batch size: 256 | lm loss: 1.977438E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.669 | TFLOPs: 18.49 | 31: iteration 121100/ 173500 | consumed samples: 31001600 | consumed tokens: 63491276800 | elapsed time per iteration (s): 0.91 | learning rate: 5.827E-05 | global batch size: 256 | lm loss: 1.958077E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 281.548 | TFLOPs: 17.03 | 31: iteration 121110/ 173500 | consumed 
samples: 31004160 | consumed tokens: 63496519680 | elapsed time per iteration (s): 0.81 | learning rate: 5.825E-05 | global batch size: 256 | lm loss: 1.969765E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.265 | TFLOPs: 19.13 | 31: iteration 121120/ 173500 | consumed samples: 31006720 | consumed tokens: 63501762560 | elapsed time per iteration (s): 0.84 | learning rate: 5.824E-05 | global batch size: 256 | lm loss: 1.942681E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.253 | TFLOPs: 18.47 | 31: iteration 121130/ 173500 | consumed samples: 31009280 | consumed tokens: 63507005440 | elapsed time per iteration (s): 0.85 | learning rate: 5.823E-05 | global batch size: 256 | lm loss: 1.955171E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.049 | TFLOPs: 18.27 | 31: iteration 121140/ 173500 | consumed samples: 31011840 | consumed tokens: 63512248320 | elapsed time per iteration (s): 0.86 | learning rate: 5.821E-05 | global batch size: 256 | lm loss: 1.969412E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.077 | TFLOPs: 17.91 | 31: iteration 121150/ 173500 | consumed samples: 31014400 | consumed tokens: 63517491200 | elapsed time per iteration (s): 3.13 | learning rate: 5.820E-05 | global batch size: 256 | lm loss: 1.962187E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 81.752 | TFLOPs: 4.95 | 31: iteration 121160/ 173500 | consumed samples: 31016960 | consumed tokens: 63522734080 | elapsed time per iteration (s): 0.85 | learning rate: 5.818E-05 | global batch size: 256 | lm loss: 1.926876E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 299.865 | TFLOPs: 18.14 | 31: iteration 121170/ 173500 | consumed samples: 31019520 | consumed tokens: 63527976960 | elapsed time per iteration (s): 0.87 | learning rate: 5.817E-05 | global batch size: 256 | lm loss: 1.918272E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.099 | TFLOPs: 17.85 | 31: iteration 121180/ 173500 | consumed samples: 31022080 | consumed tokens: 63533219840 | elapsed time per iteration (s): 0.85 | learning rate: 5.816E-05 | global batch size: 256 | lm loss: 1.940336E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.704 | TFLOPs: 18.25 | 31: iteration 121190/ 173500 | consumed samples: 31024640 | consumed tokens: 63538462720 | elapsed time per iteration (s): 0.81 | learning rate: 5.814E-05 | global batch size: 256 | lm loss: 1.944682E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.415 | TFLOPs: 19.02 | 31: iteration 121200/ 173500 | consumed samples: 31027200 | consumed tokens: 63543705600 | elapsed time per iteration (s): 0.86 | learning rate: 5.813E-05 | global batch size: 256 | lm loss: 1.936765E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.381 | TFLOPs: 18.11 | 31: iteration 121210/ 173500 | consumed samples: 31029760 | consumed tokens: 63548948480 | elapsed time per iteration (s): 0.81 | learning rate: 5.812E-05 | global batch size: 256 | lm loss: 1.958903E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.653 | TFLOPs: 19.22 | 31: iteration 121220/ 173500 | consumed samples: 31032320 | consumed tokens: 63554191360 | elapsed time per iteration (s): 0.82 | learning rate: 5.810E-05 | global 
batch size: 256 | lm loss: 1.948965E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.334 | TFLOPs: 18.96 | 31: iteration 121230/ 173500 | consumed samples: 31034880 | consumed tokens: 63559434240 | elapsed time per iteration (s): 0.81 | learning rate: 5.809E-05 | global batch size: 256 | lm loss: 1.953466E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.282 | TFLOPs: 19.19 | 31: iteration 121240/ 173500 | consumed samples: 31037440 | consumed tokens: 63564677120 | elapsed time per iteration (s): 0.92 | learning rate: 5.808E-05 | global batch size: 256 | lm loss: 1.926036E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 279.548 | TFLOPs: 16.91 | 31: iteration 121250/ 173500 | consumed samples: 31040000 | consumed tokens: 63569920000 | elapsed time per iteration (s): 0.81 | learning rate: 5.806E-05 | global batch size: 256 | lm loss: 1.951136E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.887 | TFLOPs: 19.23 | 31: iteration 121260/ 173500 | consumed samples: 31042560 | consumed tokens: 63575162880 | elapsed time per iteration (s): 0.78 | learning rate: 5.805E-05 | global batch size: 256 | lm loss: 1.928545E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.430 | TFLOPs: 19.75 | 31: iteration 121270/ 173500 | consumed samples: 31045120 | consumed tokens: 63580405760 | elapsed time per iteration (s): 0.81 | learning rate: 5.804E-05 | global batch size: 256 | lm loss: 1.951625E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.223 | TFLOPs: 19.13 | 31: iteration 121280/ 173500 | consumed samples: 
31047680 | consumed tokens: 63585648640 | elapsed time per iteration (s): 0.83 | learning rate: 5.802E-05 | global batch size: 256 | lm loss: 1.969304E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.428 | TFLOPs: 18.60 | 31: iteration 121290/ 173500 | consumed samples: 31050240 | consumed tokens: 63590891520 | elapsed time per iteration (s): 0.90 | learning rate: 5.801E-05 | global batch size: 256 | lm loss: 1.951800E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.404 | TFLOPs: 17.21 | 31: iteration 121300/ 173500 | consumed samples: 31052800 | consumed tokens: 63596134400 | elapsed time per iteration (s): 0.82 | learning rate: 5.800E-05 | global batch size: 256 | lm loss: 1.936989E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.928 | TFLOPs: 18.93 | 31: iteration 121310/ 173500 | consumed samples: 31055360 | consumed tokens: 63601377280 | elapsed time per iteration (s): 0.80 | learning rate: 5.798E-05 | global batch size: 256 | lm loss: 1.956075E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.451 | TFLOPs: 19.45 | 31: iteration 121320/ 173500 | consumed samples: 31057920 | consumed tokens: 63606620160 | elapsed time per iteration (s): 0.86 | learning rate: 5.797E-05 | global batch size: 256 | lm loss: 1.933153E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.115 | TFLOPs: 17.97 | 31: iteration 121330/ 173500 | consumed samples: 31060480 | consumed tokens: 63611863040 | elapsed time per iteration (s): 0.84 | learning rate: 5.796E-05 | global batch size: 256 | lm loss: 1.955949E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 304.650 | TFLOPs: 18.43 | 31: iteration 121340/ 173500 | consumed samples: 31063040 | consumed tokens: 63617105920 | elapsed time per iteration (s): 0.79 | learning rate: 5.794E-05 | global batch size: 256 | lm loss: 1.903033E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.480 | TFLOPs: 19.51 | 31: iteration 121350/ 173500 | consumed samples: 31065600 | consumed tokens: 63622348800 | elapsed time per iteration (s): 0.82 | learning rate: 5.793E-05 | global batch size: 256 | lm loss: 1.926175E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.125 | TFLOPs: 18.88 | 31: iteration 121360/ 173500 | consumed samples: 31068160 | consumed tokens: 63627591680 | elapsed time per iteration (s): 0.79 | learning rate: 5.792E-05 | global batch size: 256 | lm loss: 1.958503E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.663 | TFLOPs: 19.64 | 31: iteration 121370/ 173500 | consumed samples: 31070720 | consumed tokens: 63632834560 | elapsed time per iteration (s): 0.78 | learning rate: 5.790E-05 | global batch size: 256 | lm loss: 1.935437E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.759 | TFLOPs: 19.89 | 31: iteration 121380/ 173500 | consumed samples: 31073280 | consumed tokens: 63638077440 | elapsed time per iteration (s): 0.83 | learning rate: 5.789E-05 | global batch size: 256 | lm loss: 1.949419E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.263 | TFLOPs: 18.65 | 31: iteration 121390/ 173500 | consumed samples: 31075840 | consumed tokens: 63643320320 | elapsed time per iteration (s): 0.85 | learning rate: 5.788E-05 | global batch 
size: 256 | lm loss: 1.987867E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.822 | TFLOPs: 18.32 | 31: iteration 121400/ 173500 | consumed samples: 31078400 | consumed tokens: 63648563200 | elapsed time per iteration (s): 0.78 | learning rate: 5.786E-05 | global batch size: 256 | lm loss: 1.959926E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.852 | TFLOPs: 19.77 | 31: iteration 121410/ 173500 | consumed samples: 31080960 | consumed tokens: 63653806080 | elapsed time per iteration (s): 0.84 | learning rate: 5.785E-05 | global batch size: 256 | lm loss: 1.947649E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.534 | TFLOPs: 18.36 | 31: iteration 121420/ 173500 | consumed samples: 31083520 | consumed tokens: 63659048960 | elapsed time per iteration (s): 0.83 | learning rate: 5.784E-05 | global batch size: 256 | lm loss: 1.944232E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.661 | TFLOPs: 18.61 | 31: iteration 121430/ 173500 | consumed samples: 31086080 | consumed tokens: 63664291840 | elapsed time per iteration (s): 0.84 | learning rate: 5.782E-05 | global batch size: 256 | lm loss: 1.928519E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.610 | TFLOPs: 18.43 | 31: iteration 121440/ 173500 | consumed samples: 31088640 | consumed tokens: 63669534720 | elapsed time per iteration (s): 0.83 | learning rate: 5.781E-05 | global batch size: 256 | lm loss: 1.942989E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.211 | TFLOPs: 18.71 | 31: iteration 121450/ 173500 | consumed samples: 31091200 
| consumed tokens: 63674777600 | elapsed time per iteration (s): 0.93 | learning rate: 5.780E-05 | global batch size: 256 | lm loss: 1.948281E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 275.248 | TFLOPs: 16.65 | 31: iteration 121460/ 173500 | consumed samples: 31093760 | consumed tokens: 63680020480 | elapsed time per iteration (s): 0.81 | learning rate: 5.778E-05 | global batch size: 256 | lm loss: 1.939160E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.495 | TFLOPs: 19.15 | 31: iteration 121470/ 173500 | consumed samples: 31096320 | consumed tokens: 63685263360 | elapsed time per iteration (s): 0.80 | learning rate: 5.777E-05 | global batch size: 256 | lm loss: 1.922456E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.055 | TFLOPs: 19.24 | 31: iteration 121480/ 173500 | consumed samples: 31098880 | consumed tokens: 63690506240 | elapsed time per iteration (s): 0.80 | learning rate: 5.776E-05 | global batch size: 256 | lm loss: 1.932607E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.459 | TFLOPs: 19.39 | 31: iteration 121490/ 173500 | consumed samples: 31101440 | consumed tokens: 63695749120 | elapsed time per iteration (s): 0.78 | learning rate: 5.774E-05 | global batch size: 256 | lm loss: 1.967556E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.267 | TFLOPs: 19.80 | 31: iteration 121500/ 173500 | consumed samples: 31104000 | consumed tokens: 63700992000 | elapsed time per iteration (s): 0.84 | learning rate: 5.773E-05 | global batch size: 256 | lm loss: 1.954340E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 306.067 | TFLOPs: 18.52 | 31: iteration 121510/ 173500 | consumed samples: 31106560 | consumed tokens: 63706234880 | elapsed time per iteration (s): 0.80 | learning rate: 5.771E-05 | global batch size: 256 | lm loss: 1.934777E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.760 | TFLOPs: 19.28 | 31: iteration 121520/ 173500 | consumed samples: 31109120 | consumed tokens: 63711477760 | elapsed time per iteration (s): 0.80 | learning rate: 5.770E-05 | global batch size: 256 | lm loss: 1.925689E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.226 | TFLOPs: 19.37 | 31: iteration 121530/ 173500 | consumed samples: 31111680 | consumed tokens: 63716720640 | elapsed time per iteration (s): 0.79 | learning rate: 5.769E-05 | global batch size: 256 | lm loss: 1.954001E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.924 | TFLOPs: 19.66 | 31: iteration 121540/ 173500 | consumed samples: 31114240 | consumed tokens: 63721963520 | elapsed time per iteration (s): 0.80 | learning rate: 5.767E-05 | global batch size: 256 | lm loss: 1.966171E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.519 | TFLOPs: 19.45 | 31: iteration 121550/ 173500 | consumed samples: 31116800 | consumed tokens: 63727206400 | elapsed time per iteration (s): 0.78 | learning rate: 5.766E-05 | global batch size: 256 | lm loss: 1.977902E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.308 | TFLOPs: 19.98 | 31: iteration 121560/ 173500 | consumed samples: 31119360 | consumed tokens: 63732449280 | elapsed time per iteration (s): 0.78 | learning rate: 5.765E-05 | global batch size: 
256 | lm loss: 1.948886E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.250 | TFLOPs: 19.98 | 31: iteration 121570/ 173500 | consumed samples: 31121920 | consumed tokens: 63737692160 | elapsed time per iteration (s): 0.81 | learning rate: 5.763E-05 | global batch size: 256 | lm loss: 1.960323E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.541 | TFLOPs: 19.03 | 31: iteration 121580/ 173500 | consumed samples: 31124480 | consumed tokens: 63742935040 | elapsed time per iteration (s): 1.04 | learning rate: 5.762E-05 | global batch size: 256 | lm loss: 1.964492E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.787 | TFLOPs: 14.93 | 31: iteration 121590/ 173500 | consumed samples: 31127040 | consumed tokens: 63748177920 | elapsed time per iteration (s): 0.80 | learning rate: 5.761E-05 | global batch size: 256 | lm loss: 1.967846E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.577 | TFLOPs: 19.39 | 31: iteration 121600/ 173500 | consumed samples: 31129600 | consumed tokens: 63753420800 | elapsed time per iteration (s): 0.85 | learning rate: 5.759E-05 | global batch size: 256 | lm loss: 1.958386E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.726 | TFLOPs: 18.19 | 31: iteration 121610/ 173500 | consumed samples: 31132160 | consumed tokens: 63758663680 | elapsed time per iteration (s): 0.78 | learning rate: 5.758E-05 | global batch size: 256 | lm loss: 1.946442E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.357 | TFLOPs: 19.74 | 31: iteration 121620/ 173500 | consumed samples: 31134720 | 
consumed tokens: 63763906560 | elapsed time per iteration (s): 0.96 | learning rate: 5.757E-05 | global batch size: 256 | lm loss: 1.958517E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 266.835 | TFLOPs: 16.14 | 31: iteration 121630/ 173500 | consumed samples: 31137280 | consumed tokens: 63769149440 | elapsed time per iteration (s): 0.81 | learning rate: 5.755E-05 | global batch size: 256 | lm loss: 1.941362E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.559 | TFLOPs: 19.21 | 31: iteration 121640/ 173500 | consumed samples: 31139840 | consumed tokens: 63774392320 | elapsed time per iteration (s): 0.81 | learning rate: 5.754E-05 | global batch size: 256 | lm loss: 1.961885E+00 | grad norm: 0.215 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.530 | TFLOPs: 19.15 | 31: iteration 121650/ 173500 | consumed samples: 31142400 | consumed tokens: 63779635200 | elapsed time per iteration (s): 0.80 | learning rate: 5.753E-05 | global batch size: 256 | lm loss: 1.941999E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.604 | TFLOPs: 19.34 | 31: iteration 121660/ 173500 | consumed samples: 31144960 | consumed tokens: 63784878080 | elapsed time per iteration (s): 0.79 | learning rate: 5.751E-05 | global batch size: 256 | lm loss: 1.934110E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.223 | TFLOPs: 19.55 | 31: iteration 121670/ 173500 | consumed samples: 31147520 | consumed tokens: 63790120960 | elapsed time per iteration (s): 0.77 | learning rate: 5.750E-05 | global batch size: 256 | lm loss: 1.911083E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 332.639 | TFLOPs: 20.12 | 31: iteration 121680/ 173500 | consumed samples: 31150080 | consumed tokens: 63795363840 | elapsed time per iteration (s): 0.75 | learning rate: 5.749E-05 | global batch size: 256 | lm loss: 1.952469E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.517 | TFLOPs: 20.60 | 31: iteration 121690/ 173500 | consumed samples: 31152640 | consumed tokens: 63800606720 | elapsed time per iteration (s): 0.76 | learning rate: 5.747E-05 | global batch size: 256 | lm loss: 1.939584E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.808 | TFLOPs: 20.26 | 31: iteration 121700/ 173500 | consumed samples: 31155200 | consumed tokens: 63805849600 | elapsed time per iteration (s): 0.74 | learning rate: 5.746E-05 | global batch size: 256 | lm loss: 1.940738E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.436 | TFLOPs: 20.90 | 31: iteration 121710/ 173500 | consumed samples: 31157760 | consumed tokens: 63811092480 | elapsed time per iteration (s): 0.75 | learning rate: 5.745E-05 | global batch size: 256 | lm loss: 1.978155E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.404 | TFLOPs: 20.65 | 31: iteration 121720/ 173500 | consumed samples: 31160320 | consumed tokens: 63816335360 | elapsed time per iteration (s): 0.76 | learning rate: 5.743E-05 | global batch size: 256 | lm loss: 1.960858E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.041 | TFLOPs: 20.45 | 31: iteration 121730/ 173500 | consumed samples: 31162880 | consumed tokens: 63821578240 | elapsed time per iteration (s): 0.78 | learning rate: 5.742E-05 | global batch size: 
256 | lm loss: 1.922777E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.339 | TFLOPs: 19.74 | 31: iteration 121740/ 173500 | consumed samples: 31165440 | consumed tokens: 63826821120 | elapsed time per iteration (s): 0.78 | learning rate: 5.741E-05 | global batch size: 256 | lm loss: 1.942057E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.557 | TFLOPs: 19.88 | 31: iteration 121750/ 173500 | consumed samples: 31168000 | consumed tokens: 63832064000 | elapsed time per iteration (s): 0.77 | learning rate: 5.739E-05 | global batch size: 256 | lm loss: 1.946827E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.873 | TFLOPs: 20.14 | 31: iteration 121760/ 173500 | consumed samples: 31170560 | consumed tokens: 63837306880 | elapsed time per iteration (s): 0.80 | learning rate: 5.738E-05 | global batch size: 256 | lm loss: 1.948295E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.927 | TFLOPs: 19.35 | 31: iteration 121770/ 173500 | consumed samples: 31173120 | consumed tokens: 63842549760 | elapsed time per iteration (s): 0.74 | learning rate: 5.737E-05 | global batch size: 256 | lm loss: 1.984092E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.130 | TFLOPs: 20.94 | 31: iteration 121780/ 173500 | consumed samples: 31175680 | consumed tokens: 63847792640 | elapsed time per iteration (s): 0.81 | learning rate: 5.735E-05 | global batch size: 256 | lm loss: 1.925924E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.774 | TFLOPs: 19.16 | 31: iteration 121790/ 173500 | consumed samples: 31178240 | 
consumed tokens: 63853035520 | elapsed time per iteration (s): 0.76 | learning rate: 5.734E-05 | global batch size: 256 | lm loss: 1.931230E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.706 | TFLOPs: 20.31 | 31: iteration 121800/ 173500 | consumed samples: 31180800 | consumed tokens: 63858278400 | elapsed time per iteration (s): 0.78 | learning rate: 5.733E-05 | global batch size: 256 | lm loss: 1.961513E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.121 | TFLOPs: 19.91 | 31: iteration 121810/ 173500 | consumed samples: 31183360 | consumed tokens: 63863521280 | elapsed time per iteration (s): 0.74 | learning rate: 5.731E-05 | global batch size: 256 | lm loss: 1.936596E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.843 | TFLOPs: 21.04 | 31: iteration 121820/ 173500 | consumed samples: 31185920 | consumed tokens: 63868764160 | elapsed time per iteration (s): 0.81 | learning rate: 5.730E-05 | global batch size: 256 | lm loss: 1.983067E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.176 | TFLOPs: 19.13 | 31: iteration 121830/ 173500 | consumed samples: 31188480 | consumed tokens: 63874007040 | elapsed time per iteration (s): 0.74 | learning rate: 5.729E-05 | global batch size: 256 | lm loss: 1.956501E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.569 | TFLOPs: 20.97 | 31: iteration 121840/ 173500 | consumed samples: 31191040 | consumed tokens: 63879249920 | elapsed time per iteration (s): 0.84 | learning rate: 5.727E-05 | global batch size: 256 | lm loss: 1.951627E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 305.867 | TFLOPs: 18.50 | 31: iteration 121850/ 173500 | consumed samples: 31193600 | consumed tokens: 63884492800 | elapsed time per iteration (s): 0.95 | learning rate: 5.726E-05 | global batch size: 256 | lm loss: 1.947800E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 270.732 | TFLOPs: 16.38 | 31: iteration 121860/ 173500 | consumed samples: 31196160 | consumed tokens: 63889735680 | elapsed time per iteration (s): 0.77 | learning rate: 5.725E-05 | global batch size: 256 | lm loss: 1.952799E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.653 | TFLOPs: 20.06 | 31: iteration 121870/ 173500 | consumed samples: 31198720 | consumed tokens: 63894978560 | elapsed time per iteration (s): 0.82 | learning rate: 5.723E-05 | global batch size: 256 | lm loss: 1.955466E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.586 | TFLOPs: 18.91 | 31: iteration 121880/ 173500 | consumed samples: 31201280 | consumed tokens: 63900221440 | elapsed time per iteration (s): 0.83 | learning rate: 5.722E-05 | global batch size: 256 | lm loss: 1.967425E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.183 | TFLOPs: 18.77 | 31: iteration 121890/ 173500 | consumed samples: 31203840 | consumed tokens: 63905464320 | elapsed time per iteration (s): 0.77 | learning rate: 5.721E-05 | global batch size: 256 | lm loss: 1.954938E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.909 | TFLOPs: 20.14 | 31: iteration 121900/ 173500 | consumed samples: 31206400 | consumed tokens: 63910707200 | elapsed time per iteration (s): 0.82 | learning rate: 5.719E-05 | global batch size: 
256 | lm loss: 1.920826E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.113 | TFLOPs: 18.88 | 31: iteration 121910/ 173500 | consumed samples: 31208960 | consumed tokens: 63915950080 | elapsed time per iteration (s): 0.80 | learning rate: 5.718E-05 | global batch size: 256 | lm loss: 1.926782E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.500 | TFLOPs: 19.27 | 31: iteration 121920/ 173500 | consumed samples: 31211520 | consumed tokens: 63921192960 | elapsed time per iteration (s): 0.80 | learning rate: 5.717E-05 | global batch size: 256 | lm loss: 1.929863E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.491 | TFLOPs: 19.45 | 31: iteration 121930/ 173500 | consumed samples: 31214080 | consumed tokens: 63926435840 | elapsed time per iteration (s): 0.80 | learning rate: 5.715E-05 | global batch size: 256 | lm loss: 1.974356E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.492 | TFLOPs: 19.27 | 31: iteration 121940/ 173500 | consumed samples: 31216640 | consumed tokens: 63931678720 | elapsed time per iteration (s): 0.82 | learning rate: 5.714E-05 | global batch size: 256 | lm loss: 1.946393E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.262 | TFLOPs: 18.95 | 31: iteration 121950/ 173500 | consumed samples: 31219200 | consumed tokens: 63936921600 | elapsed time per iteration (s): 0.84 | learning rate: 5.713E-05 | global batch size: 256 | lm loss: 1.979913E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.823 | TFLOPs: 18.50 | 31: iteration 121960/ 173500 | consumed samples: 31221760 | 
consumed tokens: 63942164480 | elapsed time per iteration (s): 0.88 | learning rate: 5.711E-05 | global batch size: 256 | lm loss: 1.916149E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.846 | TFLOPs: 17.53 | 31: iteration 121970/ 173500 | consumed samples: 31224320 | consumed tokens: 63947407360 | elapsed time per iteration (s): 0.83 | learning rate: 5.710E-05 | global batch size: 256 | lm loss: 1.938327E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.915 | TFLOPs: 18.69 | 31: iteration 121980/ 173500 | consumed samples: 31226880 | consumed tokens: 63952650240 | elapsed time per iteration (s): 0.84 | learning rate: 5.709E-05 | global batch size: 256 | lm loss: 1.945663E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.745 | TFLOPs: 18.38 | 31: iteration 121990/ 173500 | consumed samples: 31229440 | consumed tokens: 63957893120 | elapsed time per iteration (s): 0.82 | learning rate: 5.707E-05 | global batch size: 256 | lm loss: 1.944550E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.174 | TFLOPs: 18.83 | 0: [2022-11-26 21:34:57,115] [INFO] [logging.py:68:log_dist] [Rank 0] step=122000, skipped=0, lr=[5.706057124448849e-05, 5.706057124448849e-05, 5.706057124448849e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 122000/ 173500 | consumed samples: 31232000 | consumed tokens: 63963136000 | elapsed time per iteration (s): 0.83 | learning rate: 5.706E-05 | global batch size: 256 | lm loss: 1.904436E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.191 | TFLOPs: 18.71 | 0: steps: 122000 loss: 1.9233 iter time (s): 0.821 samples/sec: 311.836 31: 
-------------------------------------------------------------------------------------------- 31: valid loss at iteration 122000 | lm loss value: 2.061312E+00 | lm loss PPL: 7.856272E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 122000 to checkpoints_1b1long 0: [2022-11-26 21:34:57,476] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step122000 is begin to save! 0: [2022-11-26 21:34:57,487] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_01-model_00-model_states.pt... 0: [2022-11-26 21:34:57,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_01-model_00-model_states.pt. 0: [2022-11-26 21:34:57,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_03-model_00-model_states.pt... 0: [2022-11-26 21:34:57,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_03-model_00-model_states.pt. 0: [2022-11-26 21:34:57,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_04-model_00-model_states.pt... 0: [2022-11-26 21:34:57,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_04-model_00-model_states.pt. 0: [2022-11-26 21:34:57,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_05-model_00-model_states.pt... 0: [2022-11-26 21:34:57,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_05-model_00-model_states.pt. 0: [2022-11-26 21:34:57,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_06-model_00-model_states.pt... 
0: [2022-11-26 21:34:58,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_06-model_00-model_states.pt. 0: [2022-11-26 21:34:58,014] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_07-model_00-model_states.pt... 0: [2022-11-26 21:34:58,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_07-model_00-model_states.pt. 0: [2022-11-26 21:34:58,092] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_08-model_00-model_states.pt... 0: [2022-11-26 21:34:58,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_08-model_00-model_states.pt. 0: [2022-11-26 21:34:58,168] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_09-model_00-model_states.pt... 0: [2022-11-26 21:34:58,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_09-model_00-model_states.pt. 0: [2022-11-26 21:34:58,247] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_10-model_00-model_states.pt... 0: [2022-11-26 21:34:58,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_10-model_00-model_states.pt. 0: [2022-11-26 21:34:58,325] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_11-model_00-model_states.pt... 0: [2022-11-26 21:34:58,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_11-model_00-model_states.pt. 0: [2022-11-26 21:34:58,402] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_12-model_00-model_states.pt... 
0: [2022-11-26 21:34:58,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_12-model_00-model_states.pt. 0: [2022-11-26 21:34:58,479] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_13-model_00-model_states.pt... 0: [2022-11-26 21:34:58,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_13-model_00-model_states.pt. 0: [2022-11-26 21:34:58,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_14-model_00-model_states.pt... 0: [2022-11-26 21:34:58,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_14-model_00-model_states.pt. 0: [2022-11-26 21:34:58,634] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_15-model_00-model_states.pt... 0: [2022-11-26 21:34:58,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_15-model_00-model_states.pt. 0: [2022-11-26 21:34:58,716] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_16-model_00-model_states.pt... 0: [2022-11-26 21:34:58,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_16-model_00-model_states.pt. 0: [2022-11-26 21:34:58,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_17-model_00-model_states.pt... 0: [2022-11-26 21:34:58,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_17-model_00-model_states.pt. 0: [2022-11-26 21:34:58,871] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_18-model_00-model_states.pt... 
0: [2022-11-26 21:34:58,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_18-model_00-model_states.pt. 0: [2022-11-26 21:34:58,945] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_19-model_00-model_states.pt... 0: [2022-11-26 21:34:59,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_19-model_00-model_states.pt. 0: [2022-11-26 21:34:59,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_20-model_00-model_states.pt... 0: [2022-11-26 21:34:59,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_20-model_00-model_states.pt. 0: [2022-11-26 21:34:59,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_21-model_00-model_states.pt... 0: [2022-11-26 21:34:59,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_21-model_00-model_states.pt. 0: [2022-11-26 21:34:59,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_22-model_00-model_states.pt... 0: [2022-11-26 21:34:59,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_22-model_00-model_states.pt. 0: [2022-11-26 21:34:59,256] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_23-model_00-model_states.pt... 0: [2022-11-26 21:34:59,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_23-model_00-model_states.pt. 0: [2022-11-26 21:34:59,341] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_24-model_00-model_states.pt... 
0: [2022-11-26 21:34:59,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_24-model_00-model_states.pt. 0: [2022-11-26 21:34:59,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_25-model_00-model_states.pt... 0: [2022-11-26 21:34:59,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_25-model_00-model_states.pt. 0: [2022-11-26 21:34:59,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_26-model_00-model_states.pt... 0: [2022-11-26 21:34:59,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_26-model_00-model_states.pt. 0: [2022-11-26 21:34:59,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_27-model_00-model_states.pt... 0: [2022-11-26 21:34:59,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_27-model_00-model_states.pt. 0: [2022-11-26 21:34:59,653] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_28-model_00-model_states.pt... 0: [2022-11-26 21:34:59,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_28-model_00-model_states.pt. 0: [2022-11-26 21:34:59,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/layer_30-model_00-model_states.pt... 0: [2022-11-26 21:34:59,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/layer_30-model_00-model_states.pt. 
0: [2022-11-26 21:34:59,733] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step122000/mp_rank_00_model_states.pt 0: [2022-11-26 21:34:59,733] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/mp_rank_00_model_states.pt... 0: [2022-11-26 21:34:59,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/mp_rank_00_model_states.pt. 0: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 
7: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 
10: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
12: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 
11: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 
22: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 
19: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
8: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
3: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 
24: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 
17: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
6: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
1: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 
25: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 
30: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 
27: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
16: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 
25: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 
29: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 
0: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 
25: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 
18: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 
23: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:35:00,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 
18: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:35:00,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:35:01,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:35:01,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 21:35:01,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 30: [2022-11-26 21:35:01,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-26 21:35:01,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 5: [2022-11-26 21:35:01,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:35:01,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:35:01,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 5: [2022-11-26 21:35:01,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 3: [2022-11-26 21:35:01,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 5: [2022-11-26 21:35:01,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 3: [2022-11-26 21:35:01,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 2: [2022-11-26 21:35:01,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:35:01,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 21:35:01,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 24: [2022-11-26 21:35:01,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-26 21:35:01,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 21:35:01,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 10: [2022-11-26 21:35:01,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:35:01,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 21:35:01,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 16: [2022-11-26 21:35:01,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 25: [2022-11-26 21:35:01,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:35:01,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 21:35:01,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 25: [2022-11-26 21:35:01,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 21:35:01,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 11: [2022-11-26 21:35:01,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-26 21:35:01,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:35:01,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 21:35:01,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 21:35:01,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 11: [2022-11-26 21:35:01,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 6: [2022-11-26 21:35:01,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:35:01,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 21:35:01,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 4: [2022-11-26 21:35:01,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:35:01,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 21:35:01,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 27: [2022-11-26 21:35:01,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 
27: [2022-11-26 21:35:01,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 21:35:01,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 21: [2022-11-26 21:35:01,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:35:01,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 21:35:01,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 16: [2022-11-26 21:35:01,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:35:01,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 15: [2022-11-26 21:35:01,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:35:01,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 15: [2022-11-26 21:35:01,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 31: [2022-11-26 21:35:01,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:35:01,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 
31: [2022-11-26 21:35:01,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 21:35:01,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 18: [2022-11-26 21:35:01,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:35:01,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:35:01,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 0: [2022-11-26 21:35:01,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 18: [2022-11-26 21:35:01,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 0: [2022-11-26 21:35:01,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 0: [2022-11-26 21:35:01,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:35:01,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
0: [2022-11-26 21:35:01,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 30: [2022-11-26 21:35:01,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 0: [2022-11-26 21:35:01,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 30: [2022-11-26 21:35:01,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 3: [2022-11-26 21:35:01,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:35:01,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 21:35:01,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 27: [2022-11-26 21:35:01,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 21:35:01,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 21:35:01,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 8: [2022-11-26 21:35:01,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 21:35:01,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 21:35:01,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 2: [2022-11-26 21:35:01,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:35:01,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 19: [2022-11-26 21:35:01,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:35:01,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 19: [2022-11-26 21:35:01,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 24: [2022-11-26 21:35:01,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:35:01,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 24: [2022-11-26 21:35:01,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 21:35:01,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 8: [2022-11-26 21:35:01,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 21:35:01,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 21:35:01,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 12: [2022-11-26 21:35:01,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:35:01,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 21:35:01,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 12: [2022-11-26 21:35:01,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:35:01,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 21:35:01,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 9: [2022-11-26 21:35:01,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:35:01,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
9: [2022-11-26 21:35:01,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 13: [2022-11-26 21:35:01,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:35:01,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 13: [2022-11-26 21:35:01,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 21: [2022-11-26 21:35:01,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 9: [2022-11-26 21:35:01,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 13: [2022-11-26 21:35:01,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 17: [2022-11-26 21:35:01,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 26: [2022-11-26 21:35:01,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 17: [2022-11-26 21:35:01,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 26: [2022-11-26 21:35:01,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 17: [2022-11-26 21:35:01,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 
26: [2022-11-26 21:35:01,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 13: [2022-11-26 21:35:01,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:35:01,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 21:35:01,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 12: [2022-11-26 21:35:01,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:35:01,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 21:35:01,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 26: [2022-11-26 21:35:01,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:35:01,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:35:01,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 26: [2022-11-26 21:35:01,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 21:35:01,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 
4: [2022-11-26 21:35:01,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 14: [2022-11-26 21:35:01,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:35:01,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:35:01,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 21:35:01,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 21:35:01,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 20: [2022-11-26 21:35:01,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:35:01,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 14: [2022-11-26 21:35:01,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 20: [2022-11-26 21:35:01,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 3: [2022-11-26 21:35:01,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 21:35:01,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 21:35:01,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 1: [2022-11-26 21:35:01,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:35:01,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 21:35:01,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 31: [2022-11-26 21:35:01,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 21:35:01,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 21:35:01,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 20: [2022-11-26 21:35:01,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:35:01,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 21:35:01,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 18: [2022-11-26 21:35:01,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
19: [2022-11-26 21:35:01,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:35:01,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 21:35:01,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 19: [2022-11-26 21:35:01,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 21:35:01,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 6: [2022-11-26 21:35:01,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:35:01,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 21:35:01,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 5: [2022-11-26 21:35:01,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:35:01,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 21:35:01,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 16: [2022-11-26 21:35:01,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
16: [2022-11-26 21:35:01,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 21:35:01,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 11: [2022-11-26 21:35:01,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:35:01,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 21:35:01,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 25: [2022-11-26 21:35:01,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 21:35:01,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 21:35:01,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 7: [2022-11-26 21:35:01,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:35:01,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 21:35:01,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 10: [2022-11-26 21:35:01,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 21:35:01,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 13: [2022-11-26 21:35:01,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:35:01,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 13: [2022-11-26 21:35:01,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 21:35:01,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 29: [2022-11-26 21:35:01,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:35:01,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 21:35:01,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 15: [2022-11-26 21:35:01,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:35:01,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 21:35:01,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 28: [2022-11-26 21:35:01,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
28: [2022-11-26 21:35:01,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 21:35:01,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 30: [2022-11-26 21:35:01,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:35:01,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 21:35:01,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 26: [2022-11-26 21:35:01,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 22: [2022-11-26 21:35:01,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 26: [2022-11-26 21:35:01,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 21:35:01,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 22: [2022-11-26 21:35:01,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 21:35:01,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 22: [2022-11-26 21:35:01,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
22: [2022-11-26 21:35:01,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 21:35:01,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 21: [2022-11-26 21:35:01,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:35:01,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 21:35:01,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 10: [2022-11-26 21:35:01,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:35:01,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 21:35:01,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 22: [2022-11-26 21:35:01,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 21:35:01,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 21:35:01,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 4: [2022-11-26 21:35:01,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 21:35:01,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 21:35:01,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 8: [2022-11-26 21:35:01,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:35:01,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 21:35:01,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 28: [2022-11-26 21:35:01,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:35:01,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 28: [2022-11-26 21:35:01,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 21:35:01,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 1: [2022-11-26 21:35:01,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 21:35:01,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 0: [2022-11-26 21:35:01,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 21:35:01,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 21:35:01,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 18: [2022-11-26 21:35:01,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:35:01,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 21:35:01,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 9: [2022-11-26 21:35:01,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:35:01,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 2: [2022-11-26 21:35:01,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:35:01,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 2: [2022-11-26 21:35:01,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 9: [2022-11-26 21:35:01,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:35:01,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 
9: [2022-11-26 21:35:01,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 21:35:01,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 15: [2022-11-26 21:35:01,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:35:01,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 21:35:01,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 23: [2022-11-26 21:35:01,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 21:35:01,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:35:01,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:35:01,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 21:35:01,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 
23: [2022-11-26 21:35:01,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 21:35:01,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 21:35:01,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 23: [2022-11-26 21:35:01,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 5: [2022-11-26 21:35:01,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:35:01,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:35:01,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 14: [2022-11-26 21:35:01,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 5: [2022-11-26 21:35:01,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 14: [2022-11-26 21:35:01,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 6: [2022-11-26 21:35:01,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 21:35:01,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 21:35:01,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 24: [2022-11-26 21:35:01,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:35:01,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 21:35:01,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 25: [2022-11-26 21:35:01,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 21:35:01,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 27: [2022-11-26 21:35:01,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 25: [2022-11-26 21:35:01,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 27: [2022-11-26 21:35:01,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 21:35:01,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 31: [2022-11-26 21:35:01,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-26 21:35:01,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 21:35:01,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 20: [2022-11-26 21:35:01,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:35:01,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 21:35:01,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 19: [2022-11-26 21:35:01,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:35:01,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 21:35:01,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 17: [2022-11-26 21:35:01,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 21:35:01,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 21:35:01,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 17: [2022-11-26 21:35:01,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 
17: [2022-11-26 21:35:01,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 21:35:01,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 17: [2022-11-26 21:35:01,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 21:35:01,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 12: [2022-11-26 21:35:01,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 17: [2022-11-26 21:35:01,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 12: [2022-11-26 21:35:01,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 21:35:01,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 16: [2022-11-26 21:35:01,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:35:01,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 21:35:01,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 29: [2022-11-26 21:35:01,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-26 21:35:01,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 21:35:01,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 10: [2022-11-26 21:35:01,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:35:01,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 21:35:01,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 24: [2022-11-26 21:35:01,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:35:01,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 21:35:01,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 3: [2022-11-26 21:35:01,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:35:01,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 21:35:01,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 11: [2022-11-26 21:35:01,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 21:35:01,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 21:35:01,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 2: [2022-11-26 21:35:01,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:35:01,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 21:35:01,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 8: [2022-11-26 21:35:01,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:35:01,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 21:35:01,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 9: [2022-11-26 21:35:01,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:35:01,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 21:35:01,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 28: [2022-11-26 21:35:01,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 
28: [2022-11-26 21:35:01,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 21:35:01,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 0: [2022-11-26 21:35:01,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:35:01,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 21:35:01,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 7: [2022-11-26 21:35:01,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:35:01,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 21:35:01,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 4: [2022-11-26 21:35:01,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:35:01,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 21:35:01,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 20: [2022-11-26 21:35:01,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-26 21:35:01,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 21:35:01,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 30: [2022-11-26 21:35:01,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:35:01,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 21:35:01,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 5: [2022-11-26 21:35:01,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:35:01,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 21:35:01,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 25: [2022-11-26 21:35:01,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 21:35:01,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 21:35:01,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 29: [2022-11-26 21:35:01,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
29: [2022-11-26 21:35:01,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 21:35:01,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 26: [2022-11-26 21:35:01,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 21:35:01,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 21:35:01,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 27: [2022-11-26 21:35:01,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 21:35:01,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 21:35:01,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 13: [2022-11-26 21:35:01,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:35:01,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 21:35:01,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 15: [2022-11-26 21:35:01,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
31: [2022-11-26 21:35:01,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:35:01,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 31: [2022-11-26 21:35:01,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 15: [2022-11-26 21:35:01,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 31: [2022-11-26 21:35:01,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 18: [2022-11-26 21:35:01,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:35:01,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 21:35:01,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 6: [2022-11-26 21:35:01,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:35:01,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 21:35:01,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 23: [2022-11-26 21:35:01,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
23: [2022-11-26 21:35:01,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 21:35:01,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 1: [2022-11-26 21:35:01,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:35:01,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 22: [2022-11-26 21:35:01,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:35:01,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 22: [2022-11-26 21:35:01,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 21:35:01,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 11: [2022-11-26 21:35:01,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:35:01,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 21:35:01,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 19: [2022-11-26 21:35:01,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
19: [2022-11-26 21:35:01,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 21:35:01,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 21: [2022-11-26 21:35:01,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:35:01,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 21:35:01,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 14: [2022-11-26 21:35:01,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:35:01,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 21:35:01,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 12: [2022-11-26 21:35:01,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:35:01,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 21:35:01,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 17: [2022-11-26 21:35:01,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
3: [2022-11-26 21:35:01,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 17: [2022-11-26 21:35:01,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 3: [2022-11-26 21:35:01,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 21:35:01,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 17: [2022-11-26 21:35:01,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 16: [2022-11-26 21:35:01,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:35:01,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 21:35:01,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 9: [2022-11-26 21:35:01,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:35:01,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 21:35:01,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 24: [2022-11-26 21:35:01,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
24: [2022-11-26 21:35:01,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 21:35:01,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 2: [2022-11-26 21:35:01,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:35:01,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 21:35:01,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 28: [2022-11-26 21:35:01,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 21:35:01,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 21:35:01,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 4: [2022-11-26 21:35:01,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:35:01,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 21:35:01,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 8: [2022-11-26 21:35:01,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
7: [2022-11-26 21:35:01,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:35:01,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 8: [2022-11-26 21:35:01,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 7: [2022-11-26 21:35:01,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 8: [2022-11-26 21:35:01,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 20: [2022-11-26 21:35:01,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:35:01,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 21:35:01,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 5: [2022-11-26 21:35:01,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:35:01,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 21:35:01,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 25: [2022-11-26 21:35:01,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
25: [2022-11-26 21:35:01,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 21:35:01,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 0: [2022-11-26 21:35:01,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:35:01,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:35:01,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:35:01,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 21:35:01,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 1: [2022-11-26 21:35:01,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 21:35:01,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 31: [2022-11-26 21:35:01,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 23: [2022-11-26 21:35:01,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
31: [2022-11-26 21:35:01,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 23: [2022-11-26 21:35:01,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 31: [2022-11-26 21:35:01,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 23: [2022-11-26 21:35:01,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 13: [2022-11-26 21:35:01,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:35:01,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 21:35:01,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 22: [2022-11-26 21:35:01,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 21:35:01,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 21:35:01,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 26: [2022-11-26 21:35:01,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-26 21:35:01,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 21:35:01,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 6: [2022-11-26 21:35:01,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:35:01,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 21:35:01,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 29: [2022-11-26 21:35:01,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:35:01,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 21:35:01,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 21: [2022-11-26 21:35:01,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:35:01,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 21:35:01,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 27: [2022-11-26 21:35:01,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
27: [2022-11-26 21:35:01,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 21:35:01,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 30: [2022-11-26 21:35:01,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:35:01,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 21:35:01,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 11: [2022-11-26 21:35:01,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:35:01,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 21:35:01,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 17: [2022-11-26 21:35:01,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 21:35:01,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 3: [2022-11-26 21:35:01,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 21:35:01,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 17: [2022-11-26 21:35:01,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 3: [2022-11-26 21:35:01,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 12: [2022-11-26 21:35:01,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:35:01,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:35:01,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 9: [2022-11-26 21:35:01,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:35:01,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 16: [2022-11-26 21:35:01,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 19: [2022-11-26 21:35:01,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:35:01,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 16: [2022-11-26 21:35:01,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 
9: [2022-11-26 21:35:01,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 19: [2022-11-26 21:35:01,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 21:35:01,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 15: [2022-11-26 21:35:01,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:35:01,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 0: [2022-11-26 21:35:01,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 15: [2022-11-26 21:35:01,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 0: [2022-11-26 21:35:01,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 14: [2022-11-26 21:35:01,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:35:01,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 21:35:01,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 24: [2022-11-26 21:35:01,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
24: [2022-11-26 21:35:01,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 21:35:01,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 2: [2022-11-26 21:35:01,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:35:01,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 21:35:01,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 10: [2022-11-26 21:35:01,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:35:01,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:35:01,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 21:35:01,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 21:35:01,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 10: [2022-11-26 21:35:01,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 8: [2022-11-26 21:35:01,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-26 21:35:01,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 21:35:01,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 28: [2022-11-26 21:35:01,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 21:35:01,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 4: [2022-11-26 21:35:01,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:35:01,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 21:35:01,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 7: [2022-11-26 21:35:01,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:35:01,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 21:35:01,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 28: [2022-11-26 21:35:01,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 20: [2022-11-26 21:35:01,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
20: [2022-11-26 21:35:01,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 21:35:01,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 0: [2022-11-26 21:35:01,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:35:01,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 21:35:01,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 5: [2022-11-26 21:35:01,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:35:01,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 21:35:01,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 25: [2022-11-26 21:35:01,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 21:35:01,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 21:35:01,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 1: [2022-11-26 21:35:01,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-26 21:35:01,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 21:35:01,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 23: [2022-11-26 21:35:01,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 21:35:01,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 21:35:01,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 13: [2022-11-26 21:35:01,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:35:01,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 21:35:01,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 29: [2022-11-26 21:35:01,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:35:01,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 21:35:01,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 29: [2022-11-26 21:35:01,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
29: [2022-11-26 21:35:01,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 21:35:01,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 15: [2022-11-26 21:35:01,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:35:01,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 21:35:01,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 18: [2022-11-26 21:35:01,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:35:01,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 21:35:01,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 30: [2022-11-26 21:35:01,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:35:01,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 21:35:01,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 27: [2022-11-26 21:35:01,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-26 21:35:01,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 21:35:01,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 22: [2022-11-26 21:35:01,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 21:35:01,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 21:35:01,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 6: [2022-11-26 21:35:01,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:35:01,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 21:35:01,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 14: [2022-11-26 21:35:01,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:35:01,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 21:35:01,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 11: [2022-11-26 21:35:01,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 21:35:01,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 21:35:01,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 12: [2022-11-26 21:35:01,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:35:01,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 21:35:01,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 19: [2022-11-26 21:35:01,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:35:01,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 26: [2022-11-26 21:35:01,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:35:01,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 26: [2022-11-26 21:35:01,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 21:35:01,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 17: [2022-11-26 21:35:01,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 
31: [2022-11-26 21:35:01,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 17: [2022-11-26 21:35:01,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 31: [2022-11-26 21:35:01,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 17: [2022-11-26 21:35:01,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 31: [2022-11-26 21:35:01,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 3: [2022-11-26 21:35:01,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:35:01,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 21:35:01,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 9: [2022-11-26 21:35:01,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:35:01,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 21:35:01,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 2: [2022-11-26 21:35:01,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 21:35:01,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 21:35:01,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 16: [2022-11-26 21:35:01,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:35:01,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 21:35:01,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 24: [2022-11-26 21:35:01,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:35:01,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 21:35:01,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 8: [2022-11-26 21:35:01,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:35:01,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 21:35:01,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 4: [2022-11-26 21:35:01,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 21:35:01,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 21:35:01,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 7: [2022-11-26 21:35:01,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:35:01,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 21:35:01,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 20: [2022-11-26 21:35:01,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:35:01,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 21:35:01,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 21: [2022-11-26 21:35:01,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:35:01,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 21:35:01,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 25: [2022-11-26 21:35:01,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
25: [2022-11-26 21:35:01,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 21:35:01,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 30: [2022-11-26 21:35:01,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:35:01,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 21:35:01,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 18: [2022-11-26 21:35:01,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:35:01,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 21:35:01,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 5: [2022-11-26 21:35:01,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:35:01,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 21:35:01,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 13: [2022-11-26 21:35:01,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 21:35:01,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 21:35:01,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 10: [2022-11-26 21:35:01,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:35:01,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 21:35:01,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 27: [2022-11-26 21:35:01,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 21:35:01,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 21:35:01,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 0: [2022-11-26 21:35:01,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:35:01,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 21:35:01,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 8: [2022-11-26 21:35:01,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 21:35:01,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 21:35:01,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 21: [2022-11-26 21:35:01,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:35:01,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 7: [2022-11-26 21:35:01,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:35:01,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 7: [2022-11-26 21:35:01,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 21:35:01,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 28: [2022-11-26 21:35:01,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:35:01,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:35:01,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 21:35:01,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 
3: [2022-11-26 21:35:01,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 31: [2022-11-26 21:35:01,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 21:35:01,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 3: [2022-11-26 21:35:01,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 21:35:01,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 31: [2022-11-26 21:35:01,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 2: [2022-11-26 21:35:01,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:35:01,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 21:35:01,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 22: [2022-11-26 21:35:01,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 21:35:01,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 21:35:01,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 
17: [2022-11-26 21:35:01,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 21:35:01,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 21:35:01,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 23: [2022-11-26 21:35:01,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 21:35:01,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 21:35:01,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 16: [2022-11-26 21:35:01,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:35:01,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 21:35:01,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 12: [2022-11-26 21:35:01,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 21:35:01,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 27: [2022-11-26 21:35:01,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:35:01,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 9: [2022-11-26 21:35:01,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:35:01,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 27: [2022-11-26 21:35:01,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 9: [2022-11-26 21:35:01,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 21:35:01,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 1: [2022-11-26 21:35:01,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 27: [2022-11-26 21:35:01,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 1: [2022-11-26 21:35:01,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 19: [2022-11-26 21:35:01,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
19: [2022-11-26 21:35:01,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 21:35:01,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 13: [2022-11-26 21:35:01,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:35:01,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 21:35:01,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 1: [2022-11-26 21:35:01,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:35:01,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 21:35:01,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 11: [2022-11-26 21:35:01,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:35:01,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 21:35:01,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 30: [2022-11-26 21:35:01,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-26 21:35:01,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 21:35:01,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 18: [2022-11-26 21:35:01,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:35:01,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 21:35:01,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 25: [2022-11-26 21:35:01,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 21:35:01,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 21:35:01,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 31: [2022-11-26 21:35:01,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 26: [2022-11-26 21:35:01,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
31: [2022-11-26 21:35:01,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 26: [2022-11-26 21:35:01,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 31: [2022-11-26 21:35:01,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 26: [2022-11-26 21:35:01,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 29: [2022-11-26 21:35:01,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:35:01,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 21:35:01,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:35:01,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 29: [2022-11-26 21:35:01,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 21:35:01,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 15: [2022-11-26 21:35:01,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:35:01,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 21:35:01,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 21:35:01,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 21:35:01,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 15: [2022-11-26 21:35:01,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 28: [2022-11-26 21:35:01,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 21:35:01,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 28: [2022-11-26 21:35:01,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 21:35:01,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 21:35:01,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 28: [2022-11-26 21:35:01,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:35:01,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
28: [2022-11-26 21:35:01,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 14: [2022-11-26 21:35:01,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 28: [2022-11-26 21:35:01,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 14: [2022-11-26 21:35:01,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 24: [2022-11-26 21:35:01,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:35:01,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 21:35:01,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 21: [2022-11-26 21:35:01,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 22: [2022-11-26 21:35:01,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:35:01,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 22: [2022-11-26 21:35:01,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 21: [2022-11-26 21:35:01,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 
22: [2022-11-26 21:35:01,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 0: [2022-11-26 21:35:01,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:35:01,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 21:35:01,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 20: [2022-11-26 21:35:01,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:35:01,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 21:35:01,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 19: [2022-11-26 21:35:01,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:35:01,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 21:35:01,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 23: [2022-11-26 21:35:01,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
23: [2022-11-26 21:35:01,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 21:35:01,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 5: [2022-11-26 21:35:01,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:35:01,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 21:35:01,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 6: [2022-11-26 21:35:01,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:35:01,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 21:35:01,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 23: [2022-11-26 21:35:01,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-26 21:35:01,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 21:35:01,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 1: [2022-11-26 21:35:01,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 21:35:01,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 26: [2022-11-26 21:35:01,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:35:01,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 26: [2022-11-26 21:35:01,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 21:35:01,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 14: [2022-11-26 21:35:01,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:35:01,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 21:35:01,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 10: [2022-11-26 21:35:01,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:35:01,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 21:35:01,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 6: [2022-11-26 21:35:01,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 21:35:01,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step122000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 21:35:01,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 0: successfully saved checkpoint at iteration 122000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 3784.57 31: iteration 122010/ 173500 | consumed samples: 31234560 | consumed tokens: 63968378880 | elapsed time per iteration (s): 1.23 | learning rate: 5.705E-05 | global batch size: 256 | lm loss: 1.961773E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 208.877 | TFLOPs: 12.64 | 31: iteration 122020/ 173500 | consumed samples: 31237120 | consumed tokens: 63973621760 | elapsed time per iteration (s): 0.81 | learning rate: 5.703E-05 | global batch size: 256 | lm loss: 1.921981E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.830 | TFLOPs: 19.11 | 31: iteration 122030/ 173500 | consumed samples: 31239680 | consumed tokens: 63978864640 | elapsed time per iteration (s): 1.00 | learning rate: 5.702E-05 | global batch size: 256 | lm loss: 1.939293E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 257.174 | TFLOPs: 15.56 | 31: iteration 122040/ 173500 | consumed samples: 31242240 | consumed tokens: 63984107520 | elapsed time per iteration (s): 0.80 | learning rate: 5.701E-05 | global batch size: 256 | lm loss: 1.923656E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.777 | TFLOPs: 19.29 | 31: iteration 122050/ 173500 | consumed samples: 31244800 | consumed tokens: 63989350400 | elapsed time per iteration (s): 0.80 | learning rate: 5.699E-05 | 
global batch size: 256 | lm loss: 1.973597E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.124 | TFLOPs: 19.25 | 31: iteration 122060/ 173500 | consumed samples: 31247360 | consumed tokens: 63994593280 | elapsed time per iteration (s): 0.81 | learning rate: 5.698E-05 | global batch size: 256 | lm loss: 1.962928E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.136 | TFLOPs: 19.13 | 31: iteration 122070/ 173500 | consumed samples: 31249920 | consumed tokens: 63999836160 | elapsed time per iteration (s): 0.80 | learning rate: 5.697E-05 | global batch size: 256 | lm loss: 1.933691E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.372 | TFLOPs: 19.38 | 31: iteration 122080/ 173500 | consumed samples: 31252480 | consumed tokens: 64005079040 | elapsed time per iteration (s): 0.82 | learning rate: 5.695E-05 | global batch size: 256 | lm loss: 1.931798E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.626 | TFLOPs: 18.79 | 31: iteration 122090/ 173500 | consumed samples: 31255040 | consumed tokens: 64010321920 | elapsed time per iteration (s): 0.79 | learning rate: 5.694E-05 | global batch size: 256 | lm loss: 1.943656E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.115 | TFLOPs: 19.49 | 31: iteration 122100/ 173500 | consumed samples: 31257600 | consumed tokens: 64015564800 | elapsed time per iteration (s): 0.80 | learning rate: 5.693E-05 | global batch size: 256 | lm loss: 1.930684E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.420 | TFLOPs: 19.38 | 31: iteration 122110/ 173500 | consumed 
samples: 31260160 | consumed tokens: 64020807680 | elapsed time per iteration (s): 0.81 | learning rate: 5.691E-05 | global batch size: 256 | lm loss: 1.923731E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.168 | TFLOPs: 19.19 | 31: iteration 122120/ 173500 | consumed samples: 31262720 | consumed tokens: 64026050560 | elapsed time per iteration (s): 0.80 | learning rate: 5.690E-05 | global batch size: 256 | lm loss: 1.948342E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.127 | TFLOPs: 19.25 | 31: iteration 122130/ 173500 | consumed samples: 31265280 | consumed tokens: 64031293440 | elapsed time per iteration (s): 0.79 | learning rate: 5.689E-05 | global batch size: 256 | lm loss: 1.949332E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.598 | TFLOPs: 19.52 | 31: iteration 122140/ 173500 | consumed samples: 31267840 | consumed tokens: 64036536320 | elapsed time per iteration (s): 0.80 | learning rate: 5.687E-05 | global batch size: 256 | lm loss: 1.944239E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.978 | TFLOPs: 19.30 | 31: iteration 122150/ 173500 | consumed samples: 31270400 | consumed tokens: 64041779200 | elapsed time per iteration (s): 0.81 | learning rate: 5.686E-05 | global batch size: 256 | lm loss: 1.930699E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.481 | TFLOPs: 19.21 | 31: iteration 122160/ 173500 | consumed samples: 31272960 | consumed tokens: 64047022080 | elapsed time per iteration (s): 0.80 | learning rate: 5.685E-05 | global batch size: 256 | lm loss: 1.928912E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 321.493 | TFLOPs: 19.45 | 31: iteration 122170/ 173500 | consumed samples: 31275520 | consumed tokens: 64052264960 | elapsed time per iteration (s): 0.80 | learning rate: 5.683E-05 | global batch size: 256 | lm loss: 1.949179E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.375 | TFLOPs: 19.38 | 31: iteration 122180/ 173500 | consumed samples: 31278080 | consumed tokens: 64057507840 | elapsed time per iteration (s): 0.77 | learning rate: 5.682E-05 | global batch size: 256 | lm loss: 1.951588E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.361 | TFLOPs: 19.99 | 31: iteration 122190/ 173500 | consumed samples: 31280640 | consumed tokens: 64062750720 | elapsed time per iteration (s): 0.79 | learning rate: 5.681E-05 | global batch size: 256 | lm loss: 1.928532E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.535 | TFLOPs: 19.51 | 31: iteration 122200/ 173500 | consumed samples: 31283200 | consumed tokens: 64067993600 | elapsed time per iteration (s): 0.80 | learning rate: 5.679E-05 | global batch size: 256 | lm loss: 1.939189E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.429 | TFLOPs: 19.26 | 31: iteration 122210/ 173500 | consumed samples: 31285760 | consumed tokens: 64073236480 | elapsed time per iteration (s): 0.81 | learning rate: 5.678E-05 | global batch size: 256 | lm loss: 1.955447E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.640 | TFLOPs: 19.22 | 31: iteration 122220/ 173500 | consumed samples: 31288320 | consumed tokens: 64078479360 | elapsed time per iteration (s): 0.81 | learning rate: 5.677E-05 | global 
batch size: 256 | lm loss: 1.928479E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.095 | TFLOPs: 19.12 | 31: iteration 122230/ 173500 | consumed samples: 31290880 | consumed tokens: 64083722240 | elapsed time per iteration (s): 0.81 | learning rate: 5.675E-05 | global batch size: 256 | lm loss: 1.971033E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.832 | TFLOPs: 19.23 | 31: iteration 122240/ 173500 | consumed samples: 31293440 | consumed tokens: 64088965120 | elapsed time per iteration (s): 0.79 | learning rate: 5.674E-05 | global batch size: 256 | lm loss: 1.940269E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.268 | TFLOPs: 19.62 | 31: iteration 122250/ 173500 | consumed samples: 31296000 | consumed tokens: 64094208000 | elapsed time per iteration (s): 0.81 | learning rate: 5.673E-05 | global batch size: 256 | lm loss: 1.944949E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.825 | TFLOPs: 19.05 | 31: iteration 122260/ 173500 | consumed samples: 31298560 | consumed tokens: 64099450880 | elapsed time per iteration (s): 0.82 | learning rate: 5.672E-05 | global batch size: 256 | lm loss: 1.936585E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.825 | TFLOPs: 18.86 | 31: iteration 122270/ 173500 | consumed samples: 31301120 | consumed tokens: 64104693760 | elapsed time per iteration (s): 0.82 | learning rate: 5.670E-05 | global batch size: 256 | lm loss: 1.962250E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.401 | TFLOPs: 18.96 | 31: iteration 122280/ 173500 | consumed samples: 
31303680 | consumed tokens: 64109936640 | elapsed time per iteration (s): 0.82 | learning rate: 5.669E-05 | global batch size: 256 | lm loss: 1.931474E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.448 | TFLOPs: 18.84 | 31: iteration 122290/ 173500 | consumed samples: 31306240 | consumed tokens: 64115179520 | elapsed time per iteration (s): 0.81 | learning rate: 5.668E-05 | global batch size: 256 | lm loss: 1.939725E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.577 | TFLOPs: 19.09 | 31: iteration 122300/ 173500 | consumed samples: 31308800 | consumed tokens: 64120422400 | elapsed time per iteration (s): 0.83 | learning rate: 5.666E-05 | global batch size: 256 | lm loss: 1.954862E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.947 | TFLOPs: 18.63 | 31: iteration 122310/ 173500 | consumed samples: 31311360 | consumed tokens: 64125665280 | elapsed time per iteration (s): 0.81 | learning rate: 5.665E-05 | global batch size: 256 | lm loss: 1.959303E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.546 | TFLOPs: 19.03 | 31: iteration 122320/ 173500 | consumed samples: 31313920 | consumed tokens: 64130908160 | elapsed time per iteration (s): 0.79 | learning rate: 5.664E-05 | global batch size: 256 | lm loss: 1.946053E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.660 | TFLOPs: 19.52 | 31: iteration 122330/ 173500 | consumed samples: 31316480 | consumed tokens: 64136151040 | elapsed time per iteration (s): 0.80 | learning rate: 5.662E-05 | global batch size: 256 | lm loss: 1.932492E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 319.405 | TFLOPs: 19.32 | 31: iteration 122340/ 173500 | consumed samples: 31319040 | consumed tokens: 64141393920 | elapsed time per iteration (s): 0.80 | learning rate: 5.661E-05 | global batch size: 256 | lm loss: 1.919607E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.616 | TFLOPs: 19.34 | 31: iteration 122350/ 173500 | consumed samples: 31321600 | consumed tokens: 64146636800 | elapsed time per iteration (s): 0.79 | learning rate: 5.660E-05 | global batch size: 256 | lm loss: 1.944856E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.645 | TFLOPs: 19.58 | 31: iteration 122360/ 173500 | consumed samples: 31324160 | consumed tokens: 64151879680 | elapsed time per iteration (s): 0.79 | learning rate: 5.658E-05 | global batch size: 256 | lm loss: 1.967048E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.045 | TFLOPs: 19.66 | 31: iteration 122370/ 173500 | consumed samples: 31326720 | consumed tokens: 64157122560 | elapsed time per iteration (s): 0.79 | learning rate: 5.657E-05 | global batch size: 256 | lm loss: 1.978900E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.598 | TFLOPs: 19.52 | 31: iteration 122380/ 173500 | consumed samples: 31329280 | consumed tokens: 64162365440 | elapsed time per iteration (s): 0.79 | learning rate: 5.656E-05 | global batch size: 256 | lm loss: 1.940527E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.975 | TFLOPs: 19.60 | 31: iteration 122390/ 173500 | consumed samples: 31331840 | consumed tokens: 64167608320 | elapsed time per iteration (s): 0.80 | learning rate: 5.654E-05 | global batch 
size: 256 | lm loss: 1.928618E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.233 | TFLOPs: 19.43 | 31: iteration 122400/ 173500 | consumed samples: 31334400 | consumed tokens: 64172851200 | elapsed time per iteration (s): 0.79 | learning rate: 5.653E-05 | global batch size: 256 | lm loss: 1.969632E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.585 | TFLOPs: 19.58 | 31: iteration 122410/ 173500 | consumed samples: 31336960 | consumed tokens: 64178094080 | elapsed time per iteration (s): 0.86 | learning rate: 5.652E-05 | global batch size: 256 | lm loss: 1.942407E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.331 | TFLOPs: 17.99 | 31: iteration 122420/ 173500 | consumed samples: 31339520 | consumed tokens: 64183336960 | elapsed time per iteration (s): 0.82 | learning rate: 5.650E-05 | global batch size: 256 | lm loss: 1.938568E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.835 | TFLOPs: 18.99 | 31: iteration 122430/ 173500 | consumed samples: 31342080 | consumed tokens: 64188579840 | elapsed time per iteration (s): 0.82 | learning rate: 5.649E-05 | global batch size: 256 | lm loss: 1.955082E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.142 | TFLOPs: 18.82 | 31: iteration 122440/ 173500 | consumed samples: 31344640 | consumed tokens: 64193822720 | elapsed time per iteration (s): 0.80 | learning rate: 5.648E-05 | global batch size: 256 | lm loss: 1.936596E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.256 | TFLOPs: 19.44 | 31: iteration 122450/ 173500 | consumed samples: 31347200 
| consumed tokens: 64199065600 | elapsed time per iteration (s): 0.83 | learning rate: 5.646E-05 | global batch size: 256 | lm loss: 1.961202E+00 | grad norm: 0.217 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.510 | TFLOPs: 18.72 | 31: iteration 122460/ 173500 | consumed samples: 31349760 | consumed tokens: 64204308480 | elapsed time per iteration (s): 2.91 | learning rate: 5.645E-05 | global batch size: 256 | lm loss: 1.930695E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 88.077 | TFLOPs: 5.33 | 31: iteration 122470/ 173500 | consumed samples: 31352320 | consumed tokens: 64209551360 | elapsed time per iteration (s): 0.82 | learning rate: 5.644E-05 | global batch size: 256 | lm loss: 1.984497E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.561 | TFLOPs: 18.97 | 31: iteration 122480/ 173500 | consumed samples: 31354880 | consumed tokens: 64214794240 | elapsed time per iteration (s): 0.81 | learning rate: 5.642E-05 | global batch size: 256 | lm loss: 1.958063E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.927 | TFLOPs: 19.05 | 31: iteration 122490/ 173500 | consumed samples: 31357440 | consumed tokens: 64220037120 | elapsed time per iteration (s): 0.84 | learning rate: 5.641E-05 | global batch size: 256 | lm loss: 1.956969E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.293 | TFLOPs: 18.35 | 31: iteration 122500/ 173500 | consumed samples: 31360000 | consumed tokens: 64225280000 | elapsed time per iteration (s): 0.82 | learning rate: 5.640E-05 | global batch size: 256 | lm loss: 1.925620E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 313.953 | TFLOPs: 18.99 | 31: iteration 122510/ 173500 | consumed samples: 31362560 | consumed tokens: 64230522880 | elapsed time per iteration (s): 0.80 | learning rate: 5.638E-05 | global batch size: 256 | lm loss: 1.964402E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.813 | TFLOPs: 19.35 | 31: iteration 122520/ 173500 | consumed samples: 31365120 | consumed tokens: 64235765760 | elapsed time per iteration (s): 0.83 | learning rate: 5.637E-05 | global batch size: 256 | lm loss: 1.953127E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.921 | TFLOPs: 18.75 | 31: iteration 122530/ 173500 | consumed samples: 31367680 | consumed tokens: 64241008640 | elapsed time per iteration (s): 0.81 | learning rate: 5.636E-05 | global batch size: 256 | lm loss: 1.952947E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.098 | TFLOPs: 19.06 | 31: iteration 122540/ 173500 | consumed samples: 31370240 | consumed tokens: 64246251520 | elapsed time per iteration (s): 0.80 | learning rate: 5.634E-05 | global batch size: 256 | lm loss: 1.940882E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.018 | TFLOPs: 19.42 | 31: iteration 122550/ 173500 | consumed samples: 31372800 | consumed tokens: 64251494400 | elapsed time per iteration (s): 0.80 | learning rate: 5.633E-05 | global batch size: 256 | lm loss: 1.952127E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.824 | TFLOPs: 19.41 | 31: iteration 122560/ 173500 | consumed samples: 31375360 | consumed tokens: 64256737280 | elapsed time per iteration (s): 0.82 | learning rate: 5.632E-05 | global batch size: 
256 | lm loss: 1.940538E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.453 | TFLOPs: 18.84 | 31: iteration 122570/ 173500 | consumed samples: 31377920 | consumed tokens: 64261980160 | elapsed time per iteration (s): 0.84 | learning rate: 5.630E-05 | global batch size: 256 | lm loss: 1.961819E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.028 | TFLOPs: 18.33 | 31: iteration 122580/ 173500 | consumed samples: 31380480 | consumed tokens: 64267223040 | elapsed time per iteration (s): 0.80 | learning rate: 5.629E-05 | global batch size: 256 | lm loss: 1.931055E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.425 | TFLOPs: 19.32 | 31: iteration 122590/ 173500 | consumed samples: 31383040 | consumed tokens: 64272465920 | elapsed time per iteration (s): 0.87 | learning rate: 5.628E-05 | global batch size: 256 | lm loss: 1.945868E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.322 | TFLOPs: 17.81 | 31: iteration 122600/ 173500 | consumed samples: 31385600 | consumed tokens: 64277708800 | elapsed time per iteration (s): 0.81 | learning rate: 5.627E-05 | global batch size: 256 | lm loss: 1.961654E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.522 | TFLOPs: 19.09 | 31: iteration 122610/ 173500 | consumed samples: 31388160 | consumed tokens: 64282951680 | elapsed time per iteration (s): 0.82 | learning rate: 5.625E-05 | global batch size: 256 | lm loss: 1.935039E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.485 | TFLOPs: 18.90 | 31: iteration 122620/ 173500 | consumed samples: 31390720 | 
consumed tokens: 64288194560 | elapsed time per iteration (s): 0.83 | learning rate: 5.624E-05 | global batch size: 256 | lm loss: 1.944125E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.505 | TFLOPs: 18.66 | 31: iteration 122630/ 173500 | consumed samples: 31393280 | consumed tokens: 64293437440 | elapsed time per iteration (s): 0.80 | learning rate: 5.623E-05 | global batch size: 256 | lm loss: 1.906908E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.250 | TFLOPs: 19.31 | 31: iteration 122640/ 173500 | consumed samples: 31395840 | consumed tokens: 64298680320 | elapsed time per iteration (s): 0.86 | learning rate: 5.621E-05 | global batch size: 256 | lm loss: 1.925469E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.600 | TFLOPs: 18.00 | 31: iteration 122650/ 173500 | consumed samples: 31398400 | consumed tokens: 64303923200 | elapsed time per iteration (s): 0.90 | learning rate: 5.620E-05 | global batch size: 256 | lm loss: 1.942198E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.147 | TFLOPs: 17.25 | 31: iteration 122660/ 173500 | consumed samples: 31400960 | consumed tokens: 64309166080 | elapsed time per iteration (s): 0.90 | learning rate: 5.619E-05 | global batch size: 256 | lm loss: 1.930563E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.090 | TFLOPs: 17.19 | 31: iteration 122670/ 173500 | consumed samples: 31403520 | consumed tokens: 64314408960 | elapsed time per iteration (s): 0.91 | learning rate: 5.617E-05 | global batch size: 256 | lm loss: 1.920655E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 281.118 | TFLOPs: 17.01 | 31: iteration 122680/ 173500 | consumed samples: 31406080 | consumed tokens: 64319651840 | elapsed time per iteration (s): 0.87 | learning rate: 5.616E-05 | global batch size: 256 | lm loss: 1.970393E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.984 | TFLOPs: 17.85 | 31: iteration 122690/ 173500 | consumed samples: 31408640 | consumed tokens: 64324894720 | elapsed time per iteration (s): 0.85 | learning rate: 5.615E-05 | global batch size: 256 | lm loss: 1.923259E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.685 | TFLOPs: 18.25 | 31: iteration 122700/ 173500 | consumed samples: 31411200 | consumed tokens: 64330137600 | elapsed time per iteration (s): 0.88 | learning rate: 5.613E-05 | global batch size: 256 | lm loss: 1.962309E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.689 | TFLOPs: 17.53 | 31: iteration 122710/ 173500 | consumed samples: 31413760 | consumed tokens: 64335380480 | elapsed time per iteration (s): 0.88 | learning rate: 5.612E-05 | global batch size: 256 | lm loss: 1.951810E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.442 | TFLOPs: 17.51 | 31: iteration 122720/ 173500 | consumed samples: 31416320 | consumed tokens: 64340623360 | elapsed time per iteration (s): 0.83 | learning rate: 5.611E-05 | global batch size: 256 | lm loss: 1.957159E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.872 | TFLOPs: 18.75 | 31: iteration 122730/ 173500 | consumed samples: 31418880 | consumed tokens: 64345866240 | elapsed time per iteration (s): 0.82 | learning rate: 5.609E-05 | global batch size: 
256 | lm loss: 1.920635E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.237 | TFLOPs: 18.83 | 31: iteration 122740/ 173500 | consumed samples: 31421440 | consumed tokens: 64351109120 | elapsed time per iteration (s): 0.86 | learning rate: 5.608E-05 | global batch size: 256 | lm loss: 1.945237E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.998 | TFLOPs: 17.91 | 31: iteration 122750/ 173500 | consumed samples: 31424000 | consumed tokens: 64356352000 | elapsed time per iteration (s): 0.91 | learning rate: 5.607E-05 | global batch size: 256 | lm loss: 1.906889E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 280.652 | TFLOPs: 16.98 | 31: iteration 122760/ 173500 | consumed samples: 31426560 | consumed tokens: 64361594880 | elapsed time per iteration (s): 0.89 | learning rate: 5.605E-05 | global batch size: 256 | lm loss: 1.971605E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.180 | TFLOPs: 17.31 | 31: iteration 122770/ 173500 | consumed samples: 31429120 | consumed tokens: 64366837760 | elapsed time per iteration (s): 0.84 | learning rate: 5.604E-05 | global batch size: 256 | lm loss: 1.953797E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.899 | TFLOPs: 18.51 | 31: iteration 122780/ 173500 | consumed samples: 31431680 | consumed tokens: 64372080640 | elapsed time per iteration (s): 0.85 | learning rate: 5.603E-05 | global batch size: 256 | lm loss: 1.956751E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.354 | TFLOPs: 18.17 | 31: iteration 122790/ 173500 | consumed samples: 31434240 | 
consumed tokens: 64377323520 | elapsed time per iteration (s): 0.84 | learning rate: 5.601E-05 | global batch size: 256 | lm loss: 1.958771E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.453 | TFLOPs: 18.48 | 31: iteration 122800/ 173500 | consumed samples: 31436800 | consumed tokens: 64382566400 | elapsed time per iteration (s): 0.85 | learning rate: 5.600E-05 | global batch size: 256 | lm loss: 1.933675E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.903 | TFLOPs: 18.26 | 31: iteration 122810/ 173500 | consumed samples: 31439360 | consumed tokens: 64387809280 | elapsed time per iteration (s): 0.80 | learning rate: 5.599E-05 | global batch size: 256 | lm loss: 1.972132E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.592 | TFLOPs: 19.33 | 31: iteration 122820/ 173500 | consumed samples: 31441920 | consumed tokens: 64393052160 | elapsed time per iteration (s): 0.80 | learning rate: 5.597E-05 | global batch size: 256 | lm loss: 1.954386E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.686 | TFLOPs: 19.28 | 31: iteration 122830/ 173500 | consumed samples: 31444480 | consumed tokens: 64398295040 | elapsed time per iteration (s): 0.81 | learning rate: 5.596E-05 | global batch size: 256 | lm loss: 1.958811E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.974 | TFLOPs: 19.12 | 31: iteration 122840/ 173500 | consumed samples: 31447040 | consumed tokens: 64403537920 | elapsed time per iteration (s): 0.77 | learning rate: 5.595E-05 | global batch size: 256 | lm loss: 1.934008E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 333.123 | TFLOPs: 20.15 | 31: iteration 122850/ 173500 | consumed samples: 31449600 | consumed tokens: 64408780800 | elapsed time per iteration (s): 0.80 | learning rate: 5.594E-05 | global batch size: 256 | lm loss: 1.942579E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.082 | TFLOPs: 19.36 | 31: iteration 122860/ 173500 | consumed samples: 31452160 | consumed tokens: 64414023680 | elapsed time per iteration (s): 0.81 | learning rate: 5.592E-05 | global batch size: 256 | lm loss: 1.958452E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.011 | TFLOPs: 19.24 | 31: iteration 122870/ 173500 | consumed samples: 31454720 | consumed tokens: 64419266560 | elapsed time per iteration (s): 0.79 | learning rate: 5.591E-05 | global batch size: 256 | lm loss: 1.968180E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.815 | TFLOPs: 19.53 | 31: iteration 122880/ 173500 | consumed samples: 31457280 | consumed tokens: 64424509440 | elapsed time per iteration (s): 0.89 | learning rate: 5.590E-05 | global batch size: 256 | lm loss: 1.933992E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.642 | TFLOPs: 17.40 | 31: iteration 122890/ 173500 | consumed samples: 31459840 | consumed tokens: 64429752320 | elapsed time per iteration (s): 0.78 | learning rate: 5.588E-05 | global batch size: 256 | lm loss: 1.973521E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.703 | TFLOPs: 19.95 | 31: iteration 122900/ 173500 | consumed samples: 31462400 | consumed tokens: 64434995200 | elapsed time per iteration (s): 0.86 | learning rate: 5.587E-05 | global batch size: 
256 | lm loss: 1.939897E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.922 | TFLOPs: 18.08 | 31: iteration 122910/ 173500 | consumed samples: 31464960 | consumed tokens: 64440238080 | elapsed time per iteration (s): 0.80 | learning rate: 5.586E-05 | global batch size: 256 | lm loss: 1.955232E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.436 | TFLOPs: 19.26 | 31: iteration 122920/ 173500 | consumed samples: 31467520 | consumed tokens: 64445480960 | elapsed time per iteration (s): 0.80 | learning rate: 5.584E-05 | global batch size: 256 | lm loss: 1.973443E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.953 | TFLOPs: 19.36 | 31: iteration 122930/ 173500 | consumed samples: 31470080 | consumed tokens: 64450723840 | elapsed time per iteration (s): 0.82 | learning rate: 5.583E-05 | global batch size: 256 | lm loss: 1.968425E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.307 | TFLOPs: 18.83 | 31: iteration 122940/ 173500 | consumed samples: 31472640 | consumed tokens: 64455966720 | elapsed time per iteration (s): 0.79 | learning rate: 5.582E-05 | global batch size: 256 | lm loss: 1.942499E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.689 | TFLOPs: 19.52 | 31: iteration 122950/ 173500 | consumed samples: 31475200 | consumed tokens: 64461209600 | elapsed time per iteration (s): 0.81 | learning rate: 5.580E-05 | global batch size: 256 | lm loss: 1.948660E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.604 | TFLOPs: 19.09 | 31: iteration 122960/ 173500 | consumed samples: 31477760 | 
consumed tokens: 64466452480 | elapsed time per iteration (s): 0.79 | learning rate: 5.579E-05 | global batch size: 256 | lm loss: 1.967027E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.224 | TFLOPs: 19.49 | 31: iteration 122970/ 173500 | consumed samples: 31480320 | consumed tokens: 64471695360 | elapsed time per iteration (s): 0.81 | learning rate: 5.578E-05 | global batch size: 256 | lm loss: 1.932421E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.348 | TFLOPs: 19.14 | 31: iteration 122980/ 173500 | consumed samples: 31482880 | consumed tokens: 64476938240 | elapsed time per iteration (s): 0.78 | learning rate: 5.576E-05 | global batch size: 256 | lm loss: 1.943992E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.742 | TFLOPs: 19.83 | 31: iteration 122990/ 173500 | consumed samples: 31485440 | consumed tokens: 64482181120 | elapsed time per iteration (s): 0.78 | learning rate: 5.575E-05 | global batch size: 256 | lm loss: 1.941861E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.705 | TFLOPs: 19.83 | 31: iteration 123000/ 173500 | consumed samples: 31488000 | consumed tokens: 64487424000 | elapsed time per iteration (s): 0.77 | learning rate: 5.574E-05 | global batch size: 256 | lm loss: 1.936872E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.659 | TFLOPs: 20.00 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 123000 | lm loss value: 1.888385E+00 | lm loss PPL: 6.608685E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving 
checkpoint at iteration 123000 to checkpoints_1b1long 0: [2022-11-26 21:49:02,587] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step123000 is begin to save! 0: [2022-11-26 21:49:02,598] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_01-model_00-model_states.pt... 0: [2022-11-26 21:49:02,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_01-model_00-model_states.pt. 0: [2022-11-26 21:49:02,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_03-model_00-model_states.pt... 0: [2022-11-26 21:49:02,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_03-model_00-model_states.pt. 0: [2022-11-26 21:49:02,921] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_04-model_00-model_states.pt... 0: [2022-11-26 21:49:03,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_04-model_00-model_states.pt. 0: [2022-11-26 21:49:03,003] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_05-model_00-model_states.pt... 0: [2022-11-26 21:49:03,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_05-model_00-model_states.pt. 0: [2022-11-26 21:49:03,080] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_06-model_00-model_states.pt... 0: [2022-11-26 21:49:03,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_06-model_00-model_states.pt. 0: [2022-11-26 21:49:03,160] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_07-model_00-model_states.pt... 
0: [2022-11-26 21:49:03,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_07-model_00-model_states.pt. 0: [2022-11-26 21:49:03,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_08-model_00-model_states.pt... 0: [2022-11-26 21:49:03,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_08-model_00-model_states.pt. 0: [2022-11-26 21:49:03,319] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_09-model_00-model_states.pt... 0: [2022-11-26 21:49:03,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_09-model_00-model_states.pt. 0: [2022-11-26 21:49:03,399] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_10-model_00-model_states.pt... 0: [2022-11-26 21:49:03,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_10-model_00-model_states.pt. 0: [2022-11-26 21:49:03,475] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_11-model_00-model_states.pt... 0: [2022-11-26 21:49:03,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_11-model_00-model_states.pt. 0: [2022-11-26 21:49:03,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_12-model_00-model_states.pt... 0: [2022-11-26 21:49:03,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_12-model_00-model_states.pt. 0: [2022-11-26 21:49:03,640] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_13-model_00-model_states.pt... 
0: [2022-11-26 21:49:03,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_13-model_00-model_states.pt. 0: [2022-11-26 21:49:03,718] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_14-model_00-model_states.pt... 0: [2022-11-26 21:49:03,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_14-model_00-model_states.pt. 0: [2022-11-26 21:49:03,797] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_15-model_00-model_states.pt... 0: [2022-11-26 21:49:03,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_15-model_00-model_states.pt. 0: [2022-11-26 21:49:03,876] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_16-model_00-model_states.pt... 0: [2022-11-26 21:49:03,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_16-model_00-model_states.pt. 0: [2022-11-26 21:49:03,955] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_17-model_00-model_states.pt... 0: [2022-11-26 21:49:04,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_17-model_00-model_states.pt. 0: [2022-11-26 21:49:04,035] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_18-model_00-model_states.pt... 0: [2022-11-26 21:49:04,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_18-model_00-model_states.pt. 0: [2022-11-26 21:49:04,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_19-model_00-model_states.pt... 
0: [2022-11-26 21:49:04,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_19-model_00-model_states.pt. 0: [2022-11-26 21:49:04,192] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_20-model_00-model_states.pt... 0: [2022-11-26 21:49:04,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_20-model_00-model_states.pt. 0: [2022-11-26 21:49:04,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_21-model_00-model_states.pt... 0: [2022-11-26 21:49:04,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_21-model_00-model_states.pt. 0: [2022-11-26 21:49:04,352] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_22-model_00-model_states.pt... 0: [2022-11-26 21:49:04,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_22-model_00-model_states.pt. 0: [2022-11-26 21:49:04,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_23-model_00-model_states.pt... 0: [2022-11-26 21:49:04,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_23-model_00-model_states.pt. 0: [2022-11-26 21:49:04,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_24-model_00-model_states.pt... 0: [2022-11-26 21:49:04,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_24-model_00-model_states.pt. 0: [2022-11-26 21:49:04,591] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_25-model_00-model_states.pt... 
0: [2022-11-26 21:49:04,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_25-model_00-model_states.pt. 0: [2022-11-26 21:49:04,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_26-model_00-model_states.pt... 0: [2022-11-26 21:49:04,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_26-model_00-model_states.pt. 0: [2022-11-26 21:49:04,748] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_27-model_00-model_states.pt... 0: [2022-11-26 21:49:04,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_27-model_00-model_states.pt. 0: [2022-11-26 21:49:04,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_28-model_00-model_states.pt... 0: [2022-11-26 21:49:04,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_28-model_00-model_states.pt. 0: [2022-11-26 21:49:04,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/layer_30-model_00-model_states.pt... 0: [2022-11-26 21:49:04,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/layer_30-model_00-model_states.pt. 0: [2022-11-26 21:49:04,911] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step123000/mp_rank_00_model_states.pt 0: [2022-11-26 21:49:04,911] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/mp_rank_00_model_states.pt... 0: [2022-11-26 21:49:04,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/mp_rank_00_model_states.pt. 
0: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 
5: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
4: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
1: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 
3: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 
20: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 
28: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 
30: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 
26: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
0: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 
8: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
13: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
15: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 
11: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 
29: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 
18: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 
5: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
12: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 
28: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 
22: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 
19: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 
25: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 
21: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 16: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 30: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
9: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:49:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 26: [2022-11-26 21:49:05,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 21:49:05,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 21:49:05,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 5: [2022-11-26 21:49:05,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:49:05,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 21:49:05,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 30: [2022-11-26 21:49:05,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-26 21:49:05,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 29: [2022-11-26 21:49:05,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:49:05,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 29: [2022-11-26 21:49:05,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 21:49:05,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 11: [2022-11-26 21:49:05,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:49:05,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 21:49:05,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 13: [2022-11-26 21:49:05,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:49:05,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 21:49:05,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 2: [2022-11-26 21:49:05,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 21:49:05,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 21:49:05,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 20: [2022-11-26 21:49:05,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:49:05,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 21:49:05,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 17: [2022-11-26 21:49:05,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 21:49:05,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 21:49:05,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 23: [2022-11-26 21:49:05,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:49:05,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
23: [2022-11-26 21:49:05,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 22: [2022-11-26 21:49:05,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:49:05,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 23: [2022-11-26 21:49:05,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 22: [2022-11-26 21:49:05,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 3: [2022-11-26 21:49:05,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 22: [2022-11-26 21:49:05,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 28: [2022-11-26 21:49:05,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 17: [2022-11-26 21:49:05,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:49:05,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
17: [2022-11-26 21:49:05,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 10: [2022-11-26 21:49:05,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 17: [2022-11-26 21:49:05,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 10: [2022-11-26 21:49:05,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 18: [2022-11-26 21:49:05,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:49:05,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:49:05,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:49:05,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 12: [2022-11-26 21:49:05,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 18: [2022-11-26 21:49:05,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 12: [2022-11-26 21:49:05,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 18: [2022-11-26 21:49:05,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
18: [2022-11-26 21:49:05,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 28: [2022-11-26 21:49:05,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 21:49:05,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 15: [2022-11-26 21:49:05,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:49:05,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 21:49:05,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 27: [2022-11-26 21:49:05,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 21:49:05,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 21:49:05,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 19: [2022-11-26 21:49:05,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:49:05,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 21:49:05,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
19: [2022-11-26 21:49:05,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:49:05,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 21:49:05,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 4: [2022-11-26 21:49:05,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:49:05,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 25: [2022-11-26 21:49:05,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:49:05,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 25: [2022-11-26 21:49:05,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 21:49:05,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 29: [2022-11-26 21:49:05,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-26 21:49:05,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 26: [2022-11-26 21:49:05,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:49:05,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 26: [2022-11-26 21:49:05,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 13: [2022-11-26 21:49:05,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 26: [2022-11-26 21:49:05,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 13: [2022-11-26 21:49:05,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 21:49:05,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 12: [2022-11-26 21:49:05,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:49:05,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 25: [2022-11-26 21:49:05,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:49:05,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
25: [2022-11-26 21:49:05,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 21:49:05,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 2: [2022-11-26 21:49:05,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:49:05,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 21:49:05,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 9: [2022-11-26 21:49:05,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:49:05,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:49:05,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 21:49:05,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 21:49:05,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 9: [2022-11-26 21:49:05,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 22: [2022-11-26 21:49:05,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
22: [2022-11-26 21:49:05,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 21:49:05,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 3: [2022-11-26 21:49:05,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:49:05,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 21:49:05,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 11: [2022-11-26 21:49:05,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:49:05,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 21:49:05,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 5: [2022-11-26 21:49:05,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:49:05,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 21:49:05,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 20: [2022-11-26 21:49:05,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
20: [2022-11-26 21:49:05,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 21:49:05,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 23: [2022-11-26 21:49:05,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 21:49:05,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 21:49:05,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 21: [2022-11-26 21:49:05,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:49:05,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:49:05,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 15: [2022-11-26 21:49:05,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 21: [2022-11-26 21:49:05,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 15: [2022-11-26 21:49:05,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 0: [2022-11-26 21:49:05,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 21:49:05,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 21:49:05,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 27: [2022-11-26 21:49:05,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 21:49:05,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 21:49:05,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 2: [2022-11-26 21:49:05,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:49:05,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 18: [2022-11-26 21:49:05,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:49:05,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 18: [2022-11-26 21:49:05,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 4: [2022-11-26 21:49:05,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:49:05,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
4: [2022-11-26 21:49:05,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 21:49:05,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 1: [2022-11-26 21:49:05,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:49:05,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 21:49:05,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 29: [2022-11-26 21:49:05,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:49:05,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 21:49:05,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 10: [2022-11-26 21:49:05,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:49:05,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 21:49:05,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 21: [2022-11-26 21:49:05,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-26 21:49:05,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 21:49:05,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 28: [2022-11-26 21:49:05,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 21:49:05,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 21:49:05,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 21:49:05,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 28: [2022-11-26 21:49:05,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 21:49:05,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 30: [2022-11-26 21:49:05,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:49:05,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 21:49:05,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 20: [2022-11-26 21:49:05,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
18: [2022-11-26 21:49:05,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:49:05,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 18: [2022-11-26 21:49:05,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 20: [2022-11-26 21:49:05,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 18: [2022-11-26 21:49:05,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 9: [2022-11-26 21:49:05,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:49:05,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 21:49:05,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 0: [2022-11-26 21:49:05,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:49:05,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:49:05,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
16: [2022-11-26 21:49:05,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 0: [2022-11-26 21:49:05,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 16: [2022-11-26 21:49:05,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 11: [2022-11-26 21:49:05,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:49:05,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:49:05,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 22: [2022-11-26 21:49:05,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:49:05,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 3: [2022-11-26 21:49:05,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 11: [2022-11-26 21:49:05,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 22: [2022-11-26 21:49:05,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 3: [2022-11-26 21:49:05,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
22: [2022-11-26 21:49:05,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 17: [2022-11-26 21:49:05,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 21:49:05,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 23: [2022-11-26 21:49:05,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 17: [2022-11-26 21:49:05,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 23: [2022-11-26 21:49:05,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 21:49:05,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 5: [2022-11-26 21:49:05,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 17: [2022-11-26 21:49:05,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:49:05,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 21:49:05,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 5: [2022-11-26 21:49:05,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 21:49:05,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 17: [2022-11-26 21:49:05,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 5: [2022-11-26 21:49:05,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 17: [2022-11-26 21:49:05,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 15: [2022-11-26 21:49:05,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:49:05,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 21:49:05,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 25: [2022-11-26 21:49:05,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 21:49:05,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 21:49:05,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 8: [2022-11-26 21:49:05,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:49:05,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 21:49:05,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:49:05,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:49:05,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 29: [2022-11-26 21:49:05,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 8: [2022-11-26 21:49:05,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 29: [2022-11-26 21:49:05,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 8: [2022-11-26 21:49:05,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 21:49:05,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 8: [2022-11-26 21:49:05,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 8: [2022-11-26 21:49:05,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 27: [2022-11-26 21:49:05,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:49:05,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
27: [2022-11-26 21:49:05,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 4: [2022-11-26 21:49:05,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 27: [2022-11-26 21:49:05,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 4: [2022-11-26 21:49:05,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 7: [2022-11-26 21:49:05,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:49:05,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:49:05,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:49:05,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
7: [2022-11-26 21:49:05,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 21:49:05,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 21:49:05,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 1: [2022-11-26 21:49:05,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 7: [2022-11-26 21:49:05,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 7: [2022-11-26 21:49:05,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 1: [2022-11-26 21:49:05,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 7: [2022-11-26 21:49:05,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 1: [2022-11-26 21:49:05,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:49:05,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 21:49:05,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 30: [2022-11-26 21:49:05,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 
30: [2022-11-26 21:49:05,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 21:49:05,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 4: [2022-11-26 21:49:05,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:49:05,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 21:49:05,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 13: [2022-11-26 21:49:05,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:49:05,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 21:49:05,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 16: [2022-11-26 21:49:05,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:49:05,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-26 21:49:05,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 21:49:05,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 21:49:05,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 16: [2022-11-26 21:49:05,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 26: [2022-11-26 21:49:05,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 21:49:05,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 21:49:05,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 0: [2022-11-26 21:49:05,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 9: [2022-11-26 21:49:05,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:49:05,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 9: [2022-11-26 21:49:05,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 21:49:05,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
19: [2022-11-26 21:49:05,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:49:05,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 21:49:05,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 13: [2022-11-26 21:49:05,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:49:05,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 21:49:05,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 1: [2022-11-26 21:49:05,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:49:05,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 21:49:05,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 30: [2022-11-26 21:49:05,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:49:05,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 21:49:05,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
6: [2022-11-26 21:49:05,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:49:05,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:49:05,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 21:49:05,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 21:49:05,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 6: [2022-11-26 21:49:05,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 6: [2022-11-26 21:49:05,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:49:05,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:49:05,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 21:49:05,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 21:49:05,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 6: [2022-11-26 21:49:05,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
2: [2022-11-26 21:49:05,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:49:05,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 19: [2022-11-26 21:49:05,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:49:05,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 2: [2022-11-26 21:49:05,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 19: [2022-11-26 21:49:05,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 22: [2022-11-26 21:49:05,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 21:49:05,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 21:49:05,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 26: [2022-11-26 21:49:05,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 21:49:05,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 21:49:05,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
12: [2022-11-26 21:49:05,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:49:05,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 21:49:05,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 20: [2022-11-26 21:49:05,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:49:05,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 21:49:05,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 12: [2022-11-26 21:49:05,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:49:05,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:49:05,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 10: [2022-11-26 21:49:05,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:49:05,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
10: [2022-11-26 21:49:05,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 21:49:05,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 21:49:05,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 10: [2022-11-26 21:49:05,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 25: [2022-11-26 21:49:05,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 21:49:05,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 21:49:05,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 14: [2022-11-26 21:49:05,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:49:05,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:49:05,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:49:05,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-26 21:49:05,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 21:49:05,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 21:49:05,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 21:49:05,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 21:49:05,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 14: [2022-11-26 21:49:05,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 14: [2022-11-26 21:49:05,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 14: [2022-11-26 21:49:05,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 24: [2022-11-26 21:49:05,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:49:05,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:49:05,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:49:05,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
24: [2022-11-26 21:49:05,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 21:49:05,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 21:49:05,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 21:49:05,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 21:49:05,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 24: [2022-11-26 21:49:05,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 24: [2022-11-26 21:49:05,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 24: [2022-11-26 21:49:05,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 31: [2022-11-26 21:49:05,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 21:49:05,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 21:49:05,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 21:49:05,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 
31: [2022-11-26 21:49:05,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 21:49:05,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 21:49:05,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 21:49:05,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 21:49:05,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 31: [2022-11-26 21:49:05,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 31: [2022-11-26 21:49:05,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 31: [2022-11-26 21:49:05,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 23: [2022-11-26 21:49:05,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 21:49:05,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 21:49:05,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 0: [2022-11-26 21:49:05,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-26 21:49:05,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 21:49:05,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 16: [2022-11-26 21:49:05,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:49:05,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 21:49:05,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 8: [2022-11-26 21:49:05,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:49:05,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 21:49:05,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 28: [2022-11-26 21:49:05,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 21:49:05,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 21:49:05,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 7: [2022-11-26 21:49:05,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 21:49:05,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 21:49:05,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 15: [2022-11-26 21:49:05,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:49:05,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 21:49:05,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 21: [2022-11-26 21:49:05,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:49:05,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 21:49:05,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 3: [2022-11-26 21:49:05,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:49:05,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 11: [2022-11-26 21:49:05,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:49:05,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
11: [2022-11-26 21:49:05,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 21:49:05,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 19: [2022-11-26 21:49:05,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:49:05,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 21:49:05,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 27: [2022-11-26 21:49:05,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 21:49:05,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 21:49:05,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 6: [2022-11-26 21:49:05,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:49:05,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 21:49:05,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 4: [2022-11-26 21:49:05,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 21:49:05,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 21:49:05,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 18: [2022-11-26 21:49:05,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:49:05,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 21:49:05,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 1: [2022-11-26 21:49:05,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:49:05,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 21:49:05,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 17: [2022-11-26 21:49:05,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:49:05,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
17: [2022-11-26 21:49:05,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 29: [2022-11-26 21:49:05,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 17: [2022-11-26 21:49:05,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 29: [2022-11-26 21:49:05,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 24: [2022-11-26 21:49:05,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:49:05,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:49:05,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 20: [2022-11-26 21:49:05,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 21:49:05,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 24: [2022-11-26 21:49:05,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 23: [2022-11-26 21:49:05,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-26 21:49:05,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 21:49:05,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 2: [2022-11-26 21:49:05,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:49:05,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 21:49:05,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 26: [2022-11-26 21:49:05,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 21:49:05,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 21:49:05,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 12: [2022-11-26 21:49:05,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:49:05,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 21:49:05,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 30: [2022-11-26 21:49:05,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-26 21:49:05,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 21:49:05,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 0: [2022-11-26 21:49:05,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:49:05,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 21:49:05,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 14: [2022-11-26 21:49:05,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:49:05,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 21:49:05,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 5: [2022-11-26 21:49:05,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:49:05,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 22: [2022-11-26 21:49:05,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:49:05,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
22: [2022-11-26 21:49:05,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 21:49:05,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 25: [2022-11-26 21:49:05,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 21:49:05,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 13: [2022-11-26 21:49:05,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:49:05,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 25: [2022-11-26 21:49:05,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 13: [2022-11-26 21:49:05,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 9: [2022-11-26 21:49:05,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 31: [2022-11-26 21:49:05,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:49:05,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 21:49:05,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
31: [2022-11-26 21:49:05,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 21:49:05,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 10: [2022-11-26 21:49:05,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:49:05,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 21:49:05,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 7: [2022-11-26 21:49:05,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:49:05,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 21:49:05,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 8: [2022-11-26 21:49:05,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:49:05,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 21:49:05,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 11: [2022-11-26 21:49:05,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 21:49:05,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 21:49:05,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 28: [2022-11-26 21:49:05,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:49:05,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:49:05,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 21:49:05,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 3: [2022-11-26 21:49:05,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:49:05,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 21:49:05,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 15: [2022-11-26 21:49:05,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:49:05,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 21:49:05,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
28: [2022-11-26 21:49:05,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 21:49:05,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 21: [2022-11-26 21:49:05,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:49:05,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 21:49:05,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 19: [2022-11-26 21:49:05,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:49:05,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 21:49:05,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 6: [2022-11-26 21:49:05,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:49:05,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 21:49:05,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 27: [2022-11-26 21:49:05,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
27: [2022-11-26 21:49:05,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 21:49:05,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 4: [2022-11-26 21:49:05,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:49:05,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 21:49:05,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 18: [2022-11-26 21:49:05,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:49:05,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 21:49:05,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 29: [2022-11-26 21:49:05,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:49:05,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 21:49:05,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 17: [2022-11-26 21:49:05,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 
17: [2022-11-26 21:49:05,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 21:49:05,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 1: [2022-11-26 21:49:05,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:49:05,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 21:49:05,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 24: [2022-11-26 21:49:05,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:49:05,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 21:49:05,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 22: [2022-11-26 21:49:05,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 21:49:05,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 21:49:05,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 0: [2022-11-26 21:49:05,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
5: [2022-11-26 21:49:05,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:49:05,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 0: [2022-11-26 21:49:05,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 5: [2022-11-26 21:49:05,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 0: [2022-11-26 21:49:05,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 13: [2022-11-26 21:49:05,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:49:05,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 21:49:05,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 12: [2022-11-26 21:49:05,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:49:05,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 21:49:05,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 2: [2022-11-26 21:49:05,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 21:49:05,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 21:49:05,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 30: [2022-11-26 21:49:05,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:49:05,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 21:49:05,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 14: [2022-11-26 21:49:05,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:49:05,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 20: [2022-11-26 21:49:05,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:49:05,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 26: [2022-11-26 21:49:05,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
20: [2022-11-26 21:49:05,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 26: [2022-11-26 21:49:05,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 20: [2022-11-26 21:49:05,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 26: [2022-11-26 21:49:05,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 23: [2022-11-26 21:49:05,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 21:49:05,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 21:49:05,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 31: [2022-11-26 21:49:05,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 21:49:05,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 21:49:05,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 25: [2022-11-26 21:49:05,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
25: [2022-11-26 21:49:05,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 21:49:05,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 16: [2022-11-26 21:49:05,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:49:05,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 21:49:05,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 8: [2022-11-26 21:49:05,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:49:05,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 21:49:05,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 7: [2022-11-26 21:49:05,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:49:05,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 21:49:05,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 3: [2022-11-26 21:49:05,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 21:49:05,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 21:49:05,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 28: [2022-11-26 21:49:05,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 21:49:05,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 21:49:05,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 15: [2022-11-26 21:49:05,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:49:05,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 21:49:05,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 4: [2022-11-26 21:49:05,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:49:05,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-26 21:49:05,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 4: [2022-11-26 21:49:05,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 21: [2022-11-26 21:49:05,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 4: [2022-11-26 21:49:05,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 6: [2022-11-26 21:49:05,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:49:05,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 21:49:05,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 19: [2022-11-26 21:49:05,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:49:05,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 21:49:05,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 10: [2022-11-26 21:49:05,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 21:49:05,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 21:49:05,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 10: [2022-11-26 21:49:05,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:49:05,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 21:49:05,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 27: [2022-11-26 21:49:05,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 21:49:05,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 21:49:05,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 9: [2022-11-26 21:49:05,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:49:05,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 21:49:05,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:49:05,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
9: [2022-11-26 21:49:05,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 21:49:05,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 11: [2022-11-26 21:49:05,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:49:05,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 17: [2022-11-26 21:49:05,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:49:05,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 17: [2022-11-26 21:49:05,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 21:49:05,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 26: [2022-11-26 21:49:05,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 21:49:05,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 21:49:05,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 1: [2022-11-26 21:49:05,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 21:49:05,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 21:49:05,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 24: [2022-11-26 21:49:05,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:49:05,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 21:49:05,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 18: [2022-11-26 21:49:05,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:49:05,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 21:49:05,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 29: [2022-11-26 21:49:05,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:49:05,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 22: [2022-11-26 21:49:05,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:49:05,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
22: [2022-11-26 21:49:05,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 21:49:05,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 5: [2022-11-26 21:49:05,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:49:05,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 21:49:05,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 20: [2022-11-26 21:49:05,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:49:05,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 21:49:05,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 13: [2022-11-26 21:49:05,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:49:05,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 21:49:05,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 2: [2022-11-26 21:49:05,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 21:49:05,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 23: [2022-11-26 21:49:05,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:49:05,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 23: [2022-11-26 21:49:05,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 21:49:05,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 30: [2022-11-26 21:49:05,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:49:05,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 21:49:05,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 14: [2022-11-26 21:49:05,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:49:05,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 21:49:05,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 25: [2022-11-26 21:49:05,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 
25: [2022-11-26 21:49:05,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 21:49:05,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 16: [2022-11-26 21:49:05,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 21:49:05,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 21:49:05,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 3: [2022-11-26 21:49:05,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:49:05,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 21:49:05,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 8: [2022-11-26 21:49:05,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:49:05,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 21:49:05,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 31: [2022-11-26 21:49:05,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-26 21:49:05,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 21:49:05,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 11: [2022-11-26 21:49:05,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:49:05,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 21:49:05,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 12: [2022-11-26 21:49:05,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:49:05,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:49:05,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 7: [2022-11-26 21:49:05,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 12: [2022-11-26 21:49:05,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 7: [2022-11-26 21:49:05,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 0: [2022-11-26 21:49:05,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 21:49:05,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 21:49:05,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 5: [2022-11-26 21:49:05,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:49:05,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:49:05,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 21:49:05,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 9: [2022-11-26 21:49:05,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 21:49:05,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 13: [2022-11-26 21:49:05,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:49:05,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 21:49:05,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 2: [2022-11-26 21:49:05,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
19: [2022-11-26 21:49:05,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:49:05,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 19: [2022-11-26 21:49:05,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 2: [2022-11-26 21:49:05,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 23: [2022-11-26 21:49:05,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 19: [2022-11-26 21:49:05,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 23: [2022-11-26 21:49:05,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 21:49:05,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 8: [2022-11-26 21:49:05,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:49:05,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
8: [2022-11-26 21:49:05,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 16: [2022-11-26 21:49:05,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:49:05,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 3: [2022-11-26 21:49:05,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:49:05,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 16: [2022-11-26 21:49:05,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 21: [2022-11-26 21:49:05,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 16: [2022-11-26 21:49:05,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 3: [2022-11-26 21:49:05,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 21:49:05,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 6: [2022-11-26 21:49:05,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 21:49:05,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 21:49:05,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 28: [2022-11-26 21:49:05,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:49:05,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:49:05,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 11: [2022-11-26 21:49:05,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 20: [2022-11-26 21:49:05,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 11: [2022-11-26 21:49:05,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 21:49:05,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 10: [2022-11-26 21:49:05,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:49:05,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 21:49:05,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
22: [2022-11-26 21:49:05,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:49:05,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:49:05,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 22: [2022-11-26 21:49:05,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 7: [2022-11-26 21:49:05,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 15: [2022-11-26 21:49:05,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 22: [2022-11-26 21:49:05,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 15: [2022-11-26 21:49:05,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 21:49:05,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 30: [2022-11-26 21:49:05,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 
30: [2022-11-26 21:49:05,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 12: [2022-11-26 21:49:05,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 30: [2022-11-26 21:49:05,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 12: [2022-11-26 21:49:05,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 21:49:05,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 14: [2022-11-26 21:49:05,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:49:05,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 21:49:05,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 0: [2022-11-26 21:49:05,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 25: [2022-11-26 21:49:05,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:49:05,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
0: [2022-11-26 21:49:05,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 25: [2022-11-26 21:49:05,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 0: [2022-11-26 21:49:05,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 25: [2022-11-26 21:49:05,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 1: [2022-11-26 21:49:05,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 21:49:05,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 28: [2022-11-26 21:49:05,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 21:49:05,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 28: [2022-11-26 21:49:05,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 21:49:05,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 21:49:05,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 31: [2022-11-26 21:49:05,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-26 21:49:05,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 21:49:05,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 4: [2022-11-26 21:49:05,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:49:05,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 21:49:05,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 26: [2022-11-26 21:49:05,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 21:49:05,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 21:49:05,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 15: [2022-11-26 21:49:05,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:49:05,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 21:49:05,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 17: [2022-11-26 21:49:05,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 
17: [2022-11-26 21:49:05,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 21:49:05,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 18: [2022-11-26 21:49:05,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 21:49:05,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 21:49:05,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 21: [2022-11-26 21:49:05,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:49:05,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 21:49:05,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 29: [2022-11-26 21:49:05,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 21:49:05,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 21:49:05,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 27: [2022-11-26 21:49:05,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-26 21:49:05,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 21:49:05,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 21:49:05,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 21:49:05,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 27: [2022-11-26 21:49:05,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 24: [2022-11-26 21:49:05,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 21:49:05,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 21:49:05,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 21: [2022-11-26 21:49:05,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 21:49:05,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step123000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 21:49:05,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
0: successfully saved checkpoint at iteration 123000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2688.90 31: iteration 123010/ 173500 | consumed samples: 31490560 | consumed tokens: 64492666880 | elapsed time per iteration (s): 1.09 | learning rate: 5.573E-05 | global batch size: 256 | lm loss: 1.945782E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.572 | TFLOPs: 14.19 | 31: iteration 123020/ 173500 | consumed samples: 31493120 | consumed tokens: 64497909760 | elapsed time per iteration (s): 0.81 | learning rate: 5.571E-05 | global batch size: 256 | lm loss: 1.923547E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.874 | TFLOPs: 19.23 | 31: iteration 123030/ 173500 | consumed samples: 31495680 | consumed tokens: 64503152640 | elapsed time per iteration (s): 0.82 | learning rate: 5.570E-05 | global batch size: 256 | lm loss: 1.936614E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.299 | TFLOPs: 18.95 | 31: iteration 123040/ 173500 | consumed samples: 31498240 | consumed tokens: 64508395520 | elapsed time per iteration (s): 0.79 | learning rate: 5.569E-05 | global batch size: 256 | lm loss: 1.929775E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.265 | TFLOPs: 19.68 | 31: iteration 123050/ 173500 | consumed samples: 31500800 | consumed tokens: 64513638400 | elapsed time per iteration (s): 0.77 | learning rate: 5.567E-05 | global batch size: 256 | lm loss: 1.951163E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.968 | TFLOPs: 20.02 | 31: iteration 123060/ 173500 | consumed samples: 31503360 | consumed tokens: 64518881280 | elapsed time per iteration (s): 
0.81 | learning rate: 5.566E-05 | global batch size: 256 | lm loss: 1.964277E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.328 | TFLOPs: 19.02 | 31: iteration 123070/ 173500 | consumed samples: 31505920 | consumed tokens: 64524124160 | elapsed time per iteration (s): 0.80 | learning rate: 5.565E-05 | global batch size: 256 | lm loss: 1.927312E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.177 | TFLOPs: 19.25 | 31: iteration 123080/ 173500 | consumed samples: 31508480 | consumed tokens: 64529367040 | elapsed time per iteration (s): 0.80 | learning rate: 5.563E-05 | global batch size: 256 | lm loss: 1.973446E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.745 | TFLOPs: 19.34 | 31: iteration 123090/ 173500 | consumed samples: 31511040 | consumed tokens: 64534609920 | elapsed time per iteration (s): 0.80 | learning rate: 5.562E-05 | global batch size: 256 | lm loss: 1.932143E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.993 | TFLOPs: 19.36 | 31: iteration 123100/ 173500 | consumed samples: 31513600 | consumed tokens: 64539852800 | elapsed time per iteration (s): 0.78 | learning rate: 5.561E-05 | global batch size: 256 | lm loss: 1.954348E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.451 | TFLOPs: 19.75 | 31: iteration 123110/ 173500 | consumed samples: 31516160 | consumed tokens: 64545095680 | elapsed time per iteration (s): 0.80 | learning rate: 5.559E-05 | global batch size: 256 | lm loss: 1.972091E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.991 | TFLOPs: 19.30 | 31: 
iteration 123120/ 173500 | consumed samples: 31518720 | consumed tokens: 64550338560 | elapsed time per iteration (s): 0.83 | learning rate: 5.558E-05 | global batch size: 256 | lm loss: 1.927818E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.716 | TFLOPs: 18.68 | 31: iteration 123130/ 173500 | consumed samples: 31521280 | consumed tokens: 64555581440 | elapsed time per iteration (s): 0.81 | learning rate: 5.557E-05 | global batch size: 256 | lm loss: 1.959067E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.682 | TFLOPs: 19.16 | 31: iteration 123140/ 173500 | consumed samples: 31523840 | consumed tokens: 64560824320 | elapsed time per iteration (s): 0.79 | learning rate: 5.555E-05 | global batch size: 256 | lm loss: 1.939443E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.640 | TFLOPs: 19.64 | 31: iteration 123150/ 173500 | consumed samples: 31526400 | consumed tokens: 64566067200 | elapsed time per iteration (s): 0.84 | learning rate: 5.554E-05 | global batch size: 256 | lm loss: 1.909457E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.490 | TFLOPs: 18.42 | 31: iteration 123160/ 173500 | consumed samples: 31528960 | consumed tokens: 64571310080 | elapsed time per iteration (s): 0.82 | learning rate: 5.553E-05 | global batch size: 256 | lm loss: 1.937708E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.595 | TFLOPs: 18.85 | 31: iteration 123170/ 173500 | consumed samples: 31531520 | consumed tokens: 64576552960 | elapsed time per iteration (s): 0.80 | learning rate: 5.552E-05 | global batch size: 256 | lm loss: 1.951175E+00 | grad norm: 0.160 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.281 | TFLOPs: 19.26 | 31: iteration 123180/ 173500 | consumed samples: 31534080 | consumed tokens: 64581795840 | elapsed time per iteration (s): 0.80 | learning rate: 5.550E-05 | global batch size: 256 | lm loss: 1.958457E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.267 | TFLOPs: 19.25 | 31: iteration 123190/ 173500 | consumed samples: 31536640 | consumed tokens: 64587038720 | elapsed time per iteration (s): 0.79 | learning rate: 5.549E-05 | global batch size: 256 | lm loss: 1.939543E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.902 | TFLOPs: 19.66 | 31: iteration 123200/ 173500 | consumed samples: 31539200 | consumed tokens: 64592281600 | elapsed time per iteration (s): 0.84 | learning rate: 5.548E-05 | global batch size: 256 | lm loss: 1.965010E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.042 | TFLOPs: 18.39 | 31: iteration 123210/ 173500 | consumed samples: 31541760 | consumed tokens: 64597524480 | elapsed time per iteration (s): 0.80 | learning rate: 5.546E-05 | global batch size: 256 | lm loss: 1.965249E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.247 | TFLOPs: 19.37 | 31: iteration 123220/ 173500 | consumed samples: 31544320 | consumed tokens: 64602767360 | elapsed time per iteration (s): 0.79 | learning rate: 5.545E-05 | global batch size: 256 | lm loss: 1.928591E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.966 | TFLOPs: 19.66 | 31: iteration 123230/ 173500 | consumed samples: 31546880 | consumed tokens: 64608010240 | elapsed time per iteration (s): 0.81 | 
learning rate: 5.544E-05 | global batch size: 256 | lm loss: 1.939204E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.896 | TFLOPs: 19.17 | 31: iteration 123240/ 173500 | consumed samples: 31549440 | consumed tokens: 64613253120 | elapsed time per iteration (s): 0.81 | learning rate: 5.542E-05 | global batch size: 256 | lm loss: 1.935685E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.665 | TFLOPs: 19.04 | 31: iteration 123250/ 173500 | consumed samples: 31552000 | consumed tokens: 64618496000 | elapsed time per iteration (s): 0.78 | learning rate: 5.541E-05 | global batch size: 256 | lm loss: 1.970615E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.381 | TFLOPs: 19.81 | 31: iteration 123260/ 173500 | consumed samples: 31554560 | consumed tokens: 64623738880 | elapsed time per iteration (s): 0.85 | learning rate: 5.540E-05 | global batch size: 256 | lm loss: 1.934335E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.193 | TFLOPs: 18.22 | 31: iteration 123270/ 173500 | consumed samples: 31557120 | consumed tokens: 64628981760 | elapsed time per iteration (s): 0.77 | learning rate: 5.538E-05 | global batch size: 256 | lm loss: 1.911904E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.311 | TFLOPs: 20.04 | 31: iteration 123280/ 173500 | consumed samples: 31559680 | consumed tokens: 64634224640 | elapsed time per iteration (s): 0.79 | learning rate: 5.537E-05 | global batch size: 256 | lm loss: 1.943348E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.885 | TFLOPs: 19.53 | 31: iteration 
123290/ 173500 | consumed samples: 31562240 | consumed tokens: 64639467520 | elapsed time per iteration (s): 0.78 | learning rate: 5.536E-05 | global batch size: 256 | lm loss: 1.959754E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.729 | TFLOPs: 19.89 | 31: iteration 123300/ 173500 | consumed samples: 31564800 | consumed tokens: 64644710400 | elapsed time per iteration (s): 0.78 | learning rate: 5.535E-05 | global batch size: 256 | lm loss: 1.920036E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.641 | TFLOPs: 19.94 | 31: iteration 123310/ 173500 | consumed samples: 31567360 | consumed tokens: 64649953280 | elapsed time per iteration (s): 0.78 | learning rate: 5.533E-05 | global batch size: 256 | lm loss: 1.944695E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.317 | TFLOPs: 19.98 | 31: iteration 123320/ 173500 | consumed samples: 31569920 | consumed tokens: 64655196160 | elapsed time per iteration (s): 0.77 | learning rate: 5.532E-05 | global batch size: 256 | lm loss: 1.954620E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.464 | TFLOPs: 19.99 | 31: iteration 123330/ 173500 | consumed samples: 31572480 | consumed tokens: 64660439040 | elapsed time per iteration (s): 0.80 | learning rate: 5.531E-05 | global batch size: 256 | lm loss: 1.909076E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.920 | TFLOPs: 19.48 | 31: iteration 123340/ 173500 | consumed samples: 31575040 | consumed tokens: 64665681920 | elapsed time per iteration (s): 0.82 | learning rate: 5.529E-05 | global batch size: 256 | lm loss: 1.951271E+00 | grad norm: 0.158 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.241 | TFLOPs: 18.95 | 31: iteration 123350/ 173500 | consumed samples: 31577600 | consumed tokens: 64670924800 | elapsed time per iteration (s): 0.81 | learning rate: 5.528E-05 | global batch size: 256 | lm loss: 1.944649E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.516 | TFLOPs: 19.03 | 31: iteration 123360/ 173500 | consumed samples: 31580160 | consumed tokens: 64676167680 | elapsed time per iteration (s): 0.78 | learning rate: 5.527E-05 | global batch size: 256 | lm loss: 1.951708E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.477 | TFLOPs: 19.75 | 31: iteration 123370/ 173500 | consumed samples: 31582720 | consumed tokens: 64681410560 | elapsed time per iteration (s): 0.81 | learning rate: 5.525E-05 | global batch size: 256 | lm loss: 1.954920E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.439 | TFLOPs: 19.14 | 31: iteration 123380/ 173500 | consumed samples: 31585280 | consumed tokens: 64686653440 | elapsed time per iteration (s): 0.81 | learning rate: 5.524E-05 | global batch size: 256 | lm loss: 1.908935E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.739 | TFLOPs: 19.22 | 31: iteration 123390/ 173500 | consumed samples: 31587840 | consumed tokens: 64691896320 | elapsed time per iteration (s): 0.80 | learning rate: 5.523E-05 | global batch size: 256 | lm loss: 1.940177E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.486 | TFLOPs: 19.27 | 31: iteration 123400/ 173500 | consumed samples: 31590400 | consumed tokens: 64697139200 | elapsed time per iteration (s): 0.81 | learning 
rate: 5.521E-05 | global batch size: 256 | lm loss: 1.942160E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.950 | TFLOPs: 19.11 | 31: iteration 123410/ 173500 | consumed samples: 31592960 | consumed tokens: 64702382080 | elapsed time per iteration (s): 0.82 | learning rate: 5.520E-05 | global batch size: 256 | lm loss: 1.938591E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.074 | TFLOPs: 18.82 | 31: iteration 123420/ 173500 | consumed samples: 31595520 | consumed tokens: 64707624960 | elapsed time per iteration (s): 0.83 | learning rate: 5.519E-05 | global batch size: 256 | lm loss: 1.946399E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.817 | TFLOPs: 18.68 | 31: iteration 123430/ 173500 | consumed samples: 31598080 | consumed tokens: 64712867840 | elapsed time per iteration (s): 0.80 | learning rate: 5.518E-05 | global batch size: 256 | lm loss: 1.950377E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.250 | TFLOPs: 19.37 | 31: iteration 123440/ 173500 | consumed samples: 31600640 | consumed tokens: 64718110720 | elapsed time per iteration (s): 0.77 | learning rate: 5.516E-05 | global batch size: 256 | lm loss: 1.938690E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.823 | TFLOPs: 20.20 | 31: iteration 123450/ 173500 | consumed samples: 31603200 | consumed tokens: 64723353600 | elapsed time per iteration (s): 0.82 | learning rate: 5.515E-05 | global batch size: 256 | lm loss: 1.923813E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.372 | TFLOPs: 18.78 | 31: iteration 123460/ 
173500 | consumed samples: 31605760 | consumed tokens: 64728596480 | elapsed time per iteration (s): 0.83 | learning rate: 5.514E-05 | global batch size: 256 | lm loss: 1.946561E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.118 | TFLOPs: 18.76 | 31: iteration 123470/ 173500 | consumed samples: 31608320 | consumed tokens: 64733839360 | elapsed time per iteration (s): 0.93 | learning rate: 5.512E-05 | global batch size: 256 | lm loss: 1.921681E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 276.311 | TFLOPs: 16.72 | 31: iteration 123480/ 173500 | consumed samples: 31610880 | consumed tokens: 64739082240 | elapsed time per iteration (s): 0.80 | learning rate: 5.511E-05 | global batch size: 256 | lm loss: 1.928962E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.238 | TFLOPs: 19.43 | 31: iteration 123490/ 173500 | consumed samples: 31613440 | consumed tokens: 64744325120 | elapsed time per iteration (s): 0.78 | learning rate: 5.510E-05 | global batch size: 256 | lm loss: 1.936879E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.145 | TFLOPs: 19.73 | 31: iteration 123500/ 173500 | consumed samples: 31616000 | consumed tokens: 64749568000 | elapsed time per iteration (s): 0.80 | learning rate: 5.508E-05 | global batch size: 256 | lm loss: 1.924740E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.442 | TFLOPs: 19.26 | 31: iteration 123510/ 173500 | consumed samples: 31618560 | consumed tokens: 64754810880 | elapsed time per iteration (s): 0.82 | learning rate: 5.507E-05 | global batch size: 256 | lm loss: 1.957066E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 312.253 | TFLOPs: 18.89 | 31: iteration 123520/ 173500 | consumed samples: 31621120 | consumed tokens: 64760053760 | elapsed time per iteration (s): 0.78 | learning rate: 5.506E-05 | global batch size: 256 | lm loss: 1.916740E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.896 | TFLOPs: 19.90 | 31: iteration 123530/ 173500 | consumed samples: 31623680 | consumed tokens: 64765296640 | elapsed time per iteration (s): 0.83 | learning rate: 5.504E-05 | global batch size: 256 | lm loss: 1.930713E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.506 | TFLOPs: 18.66 | 31: iteration 123540/ 173500 | consumed samples: 31626240 | consumed tokens: 64770539520 | elapsed time per iteration (s): 0.80 | learning rate: 5.503E-05 | global batch size: 256 | lm loss: 1.936999E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.393 | TFLOPs: 19.44 | 31: iteration 123550/ 173500 | consumed samples: 31628800 | consumed tokens: 64775782400 | elapsed time per iteration (s): 0.79 | learning rate: 5.502E-05 | global batch size: 256 | lm loss: 1.945940E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.097 | TFLOPs: 19.55 | 31: iteration 123560/ 173500 | consumed samples: 31631360 | consumed tokens: 64781025280 | elapsed time per iteration (s): 0.81 | learning rate: 5.501E-05 | global batch size: 256 | lm loss: 1.940539E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.174 | TFLOPs: 19.01 | 31: iteration 123570/ 173500 | consumed samples: 31633920 | consumed tokens: 64786268160 | elapsed time per iteration (s): 0.77 | learning rate: 
5.499E-05 | global batch size: 256 | lm loss: 1.946998E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.979 | TFLOPs: 20.20 | 31: iteration 123580/ 173500 | consumed samples: 31636480 | consumed tokens: 64791511040 | elapsed time per iteration (s): 0.73 | learning rate: 5.498E-05 | global batch size: 256 | lm loss: 1.946716E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.108 | TFLOPs: 21.12 | 31: iteration 123590/ 173500 | consumed samples: 31639040 | consumed tokens: 64796753920 | elapsed time per iteration (s): 0.72 | learning rate: 5.497E-05 | global batch size: 256 | lm loss: 1.958459E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.271 | TFLOPs: 21.55 | 31: iteration 123600/ 173500 | consumed samples: 31641600 | consumed tokens: 64801996800 | elapsed time per iteration (s): 0.80 | learning rate: 5.495E-05 | global batch size: 256 | lm loss: 1.936855E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.566 | TFLOPs: 19.33 | 31: iteration 123610/ 173500 | consumed samples: 31644160 | consumed tokens: 64807239680 | elapsed time per iteration (s): 0.81 | learning rate: 5.494E-05 | global batch size: 256 | lm loss: 1.957878E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.274 | TFLOPs: 19.07 | 31: iteration 123620/ 173500 | consumed samples: 31646720 | consumed tokens: 64812482560 | elapsed time per iteration (s): 0.80 | learning rate: 5.493E-05 | global batch size: 256 | lm loss: 1.946947E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.781 | TFLOPs: 19.35 | 31: iteration 123630/ 173500 | 
consumed samples: 31649280 | consumed tokens: 64817725440 | elapsed time per iteration (s): 0.79 | learning rate: 5.491E-05 | global batch size: 256 | lm loss: 1.934330E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.292 | TFLOPs: 19.56 | 31: iteration 123640/ 173500 | consumed samples: 31651840 | consumed tokens: 64822968320 | elapsed time per iteration (s): 0.78 | learning rate: 5.490E-05 | global batch size: 256 | lm loss: 1.941888E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.925 | TFLOPs: 19.96 | 31: iteration 123650/ 173500 | consumed samples: 31654400 | consumed tokens: 64828211200 | elapsed time per iteration (s): 0.81 | learning rate: 5.489E-05 | global batch size: 256 | lm loss: 1.932111E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.276 | TFLOPs: 19.13 | 31: iteration 123660/ 173500 | consumed samples: 31656960 | consumed tokens: 64833454080 | elapsed time per iteration (s): 0.79 | learning rate: 5.488E-05 | global batch size: 256 | lm loss: 1.988668E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.330 | TFLOPs: 19.56 | 31: iteration 123670/ 173500 | consumed samples: 31659520 | consumed tokens: 64838696960 | elapsed time per iteration (s): 0.77 | learning rate: 5.486E-05 | global batch size: 256 | lm loss: 1.957150E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.834 | TFLOPs: 20.14 | 31: iteration 123680/ 173500 | consumed samples: 31662080 | consumed tokens: 64843939840 | elapsed time per iteration (s): 0.80 | learning rate: 5.485E-05 | global batch size: 256 | lm loss: 1.939350E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 320.517 | TFLOPs: 19.39 | 31: iteration 123690/ 173500 | consumed samples: 31664640 | consumed tokens: 64849182720 | elapsed time per iteration (s): 0.78 | learning rate: 5.484E-05 | global batch size: 256 | lm loss: 1.941455E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.204 | TFLOPs: 19.86 | 31: iteration 123700/ 173500 | consumed samples: 31667200 | consumed tokens: 64854425600 | elapsed time per iteration (s): 0.79 | learning rate: 5.482E-05 | global batch size: 256 | lm loss: 1.952267E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.601 | TFLOPs: 19.58 | 31: iteration 123710/ 173500 | consumed samples: 31669760 | consumed tokens: 64859668480 | elapsed time per iteration (s): 0.81 | learning rate: 5.481E-05 | global batch size: 256 | lm loss: 1.946746E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.582 | TFLOPs: 19.09 | 31: iteration 123720/ 173500 | consumed samples: 31672320 | consumed tokens: 64864911360 | elapsed time per iteration (s): 0.85 | learning rate: 5.480E-05 | global batch size: 256 | lm loss: 1.953169E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.031 | TFLOPs: 18.27 | 31: iteration 123730/ 173500 | consumed samples: 31674880 | consumed tokens: 64870154240 | elapsed time per iteration (s): 0.79 | learning rate: 5.478E-05 | global batch size: 256 | lm loss: 1.935399E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.999 | TFLOPs: 19.54 | 31: iteration 123740/ 173500 | consumed samples: 31677440 | consumed tokens: 64875397120 | elapsed time per iteration (s): 0.80 | learning rate: 
5.477E-05 | global batch size: 256 | lm loss: 1.954521E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.752 | TFLOPs: 19.47 | 31: iteration 123750/ 173500 | consumed samples: 31680000 | consumed tokens: 64880640000 | elapsed time per iteration (s): 0.75 | learning rate: 5.476E-05 | global batch size: 256 | lm loss: 1.956356E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.080 | TFLOPs: 20.51 | 31: iteration 123760/ 173500 | consumed samples: 31682560 | consumed tokens: 64885882880 | elapsed time per iteration (s): 0.78 | learning rate: 5.475E-05 | global batch size: 256 | lm loss: 1.966000E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.686 | TFLOPs: 19.82 | 31: iteration 123770/ 173500 | consumed samples: 31685120 | consumed tokens: 64891125760 | elapsed time per iteration (s): 0.81 | learning rate: 5.473E-05 | global batch size: 256 | lm loss: 1.931422E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.072 | TFLOPs: 19.12 | 31: iteration 123780/ 173500 | consumed samples: 31687680 | consumed tokens: 64896368640 | elapsed time per iteration (s): 0.75 | learning rate: 5.472E-05 | global batch size: 256 | lm loss: 1.950216E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.329 | TFLOPs: 20.71 | 31: iteration 123790/ 173500 | consumed samples: 31690240 | consumed tokens: 64901611520 | elapsed time per iteration (s): 0.75 | learning rate: 5.471E-05 | global batch size: 256 | lm loss: 1.962175E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.464 | TFLOPs: 20.72 | 31: iteration 123800/ 173500 | 
consumed samples: 31692800 | consumed tokens: 64906854400 | elapsed time per iteration (s): 0.88 | learning rate: 5.469E-05 | global batch size: 256 | lm loss: 1.939872E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.870 | TFLOPs: 17.54 | 31: iteration 123810/ 173500 | consumed samples: 31695360 | consumed tokens: 64912097280 | elapsed time per iteration (s): 0.83 | learning rate: 5.468E-05 | global batch size: 256 | lm loss: 1.956625E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.936 | TFLOPs: 18.63 | 31: iteration 123820/ 173500 | consumed samples: 31697920 | consumed tokens: 64917340160 | elapsed time per iteration (s): 0.81 | learning rate: 5.467E-05 | global batch size: 256 | lm loss: 1.935591E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.475 | TFLOPs: 19.21 | 31: iteration 123830/ 173500 | consumed samples: 31700480 | consumed tokens: 64922583040 | elapsed time per iteration (s): 0.78 | learning rate: 5.465E-05 | global batch size: 256 | lm loss: 1.933189E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.069 | TFLOPs: 19.79 | 31: iteration 123840/ 173500 | consumed samples: 31703040 | consumed tokens: 64927825920 | elapsed time per iteration (s): 0.78 | learning rate: 5.464E-05 | global batch size: 256 | lm loss: 1.935435E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.138 | TFLOPs: 19.91 | 31: iteration 123850/ 173500 | consumed samples: 31705600 | consumed tokens: 64933068800 | elapsed time per iteration (s): 0.82 | learning rate: 5.463E-05 | global batch size: 256 | lm loss: 1.942152E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 311.392 | TFLOPs: 18.84 | 31: iteration 123860/ 173500 | consumed samples: 31708160 | consumed tokens: 64938311680 | elapsed time per iteration (s): 0.80 | learning rate: 5.462E-05 | global batch size: 256 | lm loss: 1.903938E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.660 | TFLOPs: 19.46 | 31: iteration 123870/ 173500 | consumed samples: 31710720 | consumed tokens: 64943554560 | elapsed time per iteration (s): 0.80 | learning rate: 5.460E-05 | global batch size: 256 | lm loss: 1.969267E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.618 | TFLOPs: 19.46 | 31: iteration 123880/ 173500 | consumed samples: 31713280 | consumed tokens: 64948797440 | elapsed time per iteration (s): 0.87 | learning rate: 5.459E-05 | global batch size: 256 | lm loss: 1.919058E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.097 | TFLOPs: 17.79 | 31: iteration 123890/ 173500 | consumed samples: 31715840 | consumed tokens: 64954040320 | elapsed time per iteration (s): 0.79 | learning rate: 5.458E-05 | global batch size: 256 | lm loss: 1.959995E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.069 | TFLOPs: 19.61 | 31: iteration 123900/ 173500 | consumed samples: 31718400 | consumed tokens: 64959283200 | elapsed time per iteration (s): 0.78 | learning rate: 5.456E-05 | global batch size: 256 | lm loss: 1.951048E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.602 | TFLOPs: 19.88 | 31: iteration 123910/ 173500 | consumed samples: 31720960 | consumed tokens: 64964526080 | elapsed time per iteration (s): 0.84 | learning rate: 
5.455E-05 | global batch size: 256 | lm loss: 1.932224E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.327 | TFLOPs: 18.53 | 31: iteration 123920/ 173500 | consumed samples: 31723520 | consumed tokens: 64969768960 | elapsed time per iteration (s): 0.84 | learning rate: 5.454E-05 | global batch size: 256 | lm loss: 1.920305E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.072 | TFLOPs: 18.46 | 31: iteration 123930/ 173500 | consumed samples: 31726080 | consumed tokens: 64975011840 | elapsed time per iteration (s): 0.82 | learning rate: 5.452E-05 | global batch size: 256 | lm loss: 1.941958E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.413 | TFLOPs: 18.84 | 31: iteration 123940/ 173500 | consumed samples: 31728640 | consumed tokens: 64980254720 | elapsed time per iteration (s): 0.83 | learning rate: 5.451E-05 | global batch size: 256 | lm loss: 1.964054E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.048 | TFLOPs: 18.64 | 31: iteration 123950/ 173500 | consumed samples: 31731200 | consumed tokens: 64985497600 | elapsed time per iteration (s): 0.80 | learning rate: 5.450E-05 | global batch size: 256 | lm loss: 1.960027E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.115 | TFLOPs: 19.31 | 31: iteration 123960/ 173500 | consumed samples: 31733760 | consumed tokens: 64990740480 | elapsed time per iteration (s): 0.78 | learning rate: 5.449E-05 | global batch size: 256 | lm loss: 1.941908E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.923 | TFLOPs: 19.78 | 31: iteration 123970/ 173500 | 
consumed samples: 31736320 | consumed tokens: 64995983360 | elapsed time per iteration (s): 0.81 | learning rate: 5.447E-05 | global batch size: 256 | lm loss: 1.940106E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.670 | TFLOPs: 19.16 | 31: iteration 123980/ 173500 | consumed samples: 31738880 | consumed tokens: 65001226240 | elapsed time per iteration (s): 0.76 | learning rate: 5.446E-05 | global batch size: 256 | lm loss: 1.914596E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.841 | TFLOPs: 20.32 | 31: iteration 123990/ 173500 | consumed samples: 31741440 | consumed tokens: 65006469120 | elapsed time per iteration (s): 0.74 | learning rate: 5.445E-05 | global batch size: 256 | lm loss: 1.940827E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.881 | TFLOPs: 20.99 | 0: [2022-11-26 22:02:25,264] [INFO] [logging.py:68:log_dist] [Rank 0] step=124000, skipped=0, lr=[5.443416434803536e-05, 5.443416434803536e-05, 5.443416434803536e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 124000/ 173500 | consumed samples: 31744000 | consumed tokens: 65011712000 | elapsed time per iteration (s): 0.74 | learning rate: 5.443E-05 | global batch size: 256 | lm loss: 1.957792E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.642 | TFLOPs: 21.03 | 0: steps: 124000 loss: 2.0363 iter time (s): 0.818 samples/sec: 312.907 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 124000 | lm loss value: 1.856464E+00 | lm loss PPL: 6.401061E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 124000 to 
checkpoints_1b1long 0: [2022-11-26 22:02:25,528] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step124000 is begin to save! 0: [2022-11-26 22:02:25,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_01-model_00-model_states.pt... 0: [2022-11-26 22:02:25,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_01-model_00-model_states.pt. 0: [2022-11-26 22:02:25,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_03-model_00-model_states.pt... 0: [2022-11-26 22:02:25,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_03-model_00-model_states.pt. 0: [2022-11-26 22:02:25,880] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_04-model_00-model_states.pt... 0: [2022-11-26 22:02:25,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_04-model_00-model_states.pt. 0: [2022-11-26 22:02:25,955] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_05-model_00-model_states.pt... 0: [2022-11-26 22:02:26,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_05-model_00-model_states.pt. 0: [2022-11-26 22:02:26,033] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_06-model_00-model_states.pt... 0: [2022-11-26 22:02:26,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_06-model_00-model_states.pt. 0: [2022-11-26 22:02:26,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_07-model_00-model_states.pt... 
0: [2022-11-26 22:02:26,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_07-model_00-model_states.pt. 0: [2022-11-26 22:02:26,188] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_08-model_00-model_states.pt... 0: [2022-11-26 22:02:26,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_08-model_00-model_states.pt. 0: [2022-11-26 22:02:26,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_09-model_00-model_states.pt... 0: [2022-11-26 22:02:26,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_09-model_00-model_states.pt. 0: [2022-11-26 22:02:26,347] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_10-model_00-model_states.pt... 0: [2022-11-26 22:02:26,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_10-model_00-model_states.pt. 0: [2022-11-26 22:02:26,424] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_11-model_00-model_states.pt... 0: [2022-11-26 22:02:26,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_11-model_00-model_states.pt. 0: [2022-11-26 22:02:26,498] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_12-model_00-model_states.pt... 0: [2022-11-26 22:02:26,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_12-model_00-model_states.pt. 0: [2022-11-26 22:02:26,577] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_13-model_00-model_states.pt... 
0: [2022-11-26 22:02:26,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_13-model_00-model_states.pt. 0: [2022-11-26 22:02:26,650] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_14-model_00-model_states.pt... 0: [2022-11-26 22:02:26,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_14-model_00-model_states.pt. 0: [2022-11-26 22:02:26,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_15-model_00-model_states.pt... 0: [2022-11-26 22:02:26,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_15-model_00-model_states.pt. 0: [2022-11-26 22:02:26,809] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_16-model_00-model_states.pt... 0: [2022-11-26 22:02:26,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_16-model_00-model_states.pt. 0: [2022-11-26 22:02:26,883] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_17-model_00-model_states.pt... 0: [2022-11-26 22:02:26,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_17-model_00-model_states.pt. 0: [2022-11-26 22:02:26,965] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_18-model_00-model_states.pt... 0: [2022-11-26 22:02:27,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_18-model_00-model_states.pt. 0: [2022-11-26 22:02:27,040] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_19-model_00-model_states.pt... 
0: [2022-11-26 22:02:27,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_19-model_00-model_states.pt. 0: [2022-11-26 22:02:27,116] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_20-model_00-model_states.pt... 0: [2022-11-26 22:02:27,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_20-model_00-model_states.pt. 0: [2022-11-26 22:02:27,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_21-model_00-model_states.pt... 0: [2022-11-26 22:02:27,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_21-model_00-model_states.pt. 0: [2022-11-26 22:02:27,271] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_22-model_00-model_states.pt... 0: [2022-11-26 22:02:27,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_22-model_00-model_states.pt. 0: [2022-11-26 22:02:27,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_23-model_00-model_states.pt... 0: [2022-11-26 22:02:27,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_23-model_00-model_states.pt. 0: [2022-11-26 22:02:27,426] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_24-model_00-model_states.pt... 0: [2022-11-26 22:02:27,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_24-model_00-model_states.pt. 0: [2022-11-26 22:02:27,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_25-model_00-model_states.pt... 
0: [2022-11-26 22:02:27,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_25-model_00-model_states.pt. 0: [2022-11-26 22:02:27,573] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_26-model_00-model_states.pt... 0: [2022-11-26 22:02:27,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_26-model_00-model_states.pt. 0: [2022-11-26 22:02:27,653] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_27-model_00-model_states.pt... 0: [2022-11-26 22:02:27,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_27-model_00-model_states.pt. 0: [2022-11-26 22:02:27,727] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_28-model_00-model_states.pt... 0: [2022-11-26 22:02:27,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_28-model_00-model_states.pt. 0: [2022-11-26 22:02:27,808] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/layer_30-model_00-model_states.pt... 0: [2022-11-26 22:02:27,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/layer_30-model_00-model_states.pt. 0: [2022-11-26 22:02:27,813] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step124000/mp_rank_00_model_states.pt 0: [2022-11-26 22:02:27,813] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/mp_rank_00_model_states.pt... 0: [2022-11-26 22:02:27,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/mp_rank_00_model_states.pt. 
0: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
8: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 
13: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
12: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 
23: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 
24: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 
22: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 
21: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 
19: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 
6: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
4: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
16: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
12: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 
23: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
31: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 
18: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
4: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 
13: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 
28: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 
18: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 
10: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 
29: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 
5: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 
5: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:02:27,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:02:27,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:02:27,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 22:02:27,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 20: [2022-11-26 22:02:27,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:02:27,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 22:02:27,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 24: [2022-11-26 22:02:27,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
24: [2022-11-26 22:02:27,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 22:02:27,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 5: [2022-11-26 22:02:27,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:02:27,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:02:27,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 14: [2022-11-26 22:02:27,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 5: [2022-11-26 22:02:27,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 13: [2022-11-26 22:02:27,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:02:27,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 13: [2022-11-26 22:02:27,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 22:02:27,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 15: [2022-11-26 22:02:27,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-26 22:02:27,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 22:02:27,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 11: [2022-11-26 22:02:27,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:02:27,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 22:02:27,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 8: [2022-11-26 22:02:27,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:02:27,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 22:02:27,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 17: [2022-11-26 22:02:27,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:02:27,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:02:27,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 22:02:27,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 
17: [2022-11-26 22:02:27,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 22:02:27,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 4: [2022-11-26 22:02:27,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:02:27,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:02:27,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 11: [2022-11-26 22:02:27,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 4: [2022-11-26 22:02:27,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 11: [2022-11-26 22:02:27,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 23: [2022-11-26 22:02:27,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:02:27,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 22:02:27,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 1: [2022-11-26 22:02:27,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-26 22:02:27,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:02:27,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 22:02:27,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 22:02:27,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 1: [2022-11-26 22:02:27,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 8: [2022-11-26 22:02:27,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:02:27,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 22:02:27,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 23: [2022-11-26 22:02:27,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:02:27,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 22:02:27,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 20: [2022-11-26 22:02:27,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-26 22:02:27,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 22:02:27,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 7: [2022-11-26 22:02:27,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:02:27,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 22:02:27,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 25: [2022-11-26 22:02:27,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:02:27,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 22:02:27,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 22: [2022-11-26 22:02:27,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:02:27,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 22:02:27,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 0: [2022-11-26 22:02:27,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-26 22:02:27,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 22:02:27,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 17: [2022-11-26 22:02:27,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:02:27,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:02:27,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 22:02:27,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 29: [2022-11-26 22:02:27,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:02:27,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 29: [2022-11-26 22:02:27,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:02:27,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 29: [2022-11-26 22:02:27,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 22:02:27,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 
29: [2022-11-26 22:02:27,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 22:02:27,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 0: [2022-11-26 22:02:27,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:02:27,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 22:02:27,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 3: [2022-11-26 22:02:27,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:02:27,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 22:02:27,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 30: [2022-11-26 22:02:27,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 22:02:27,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 22:02:27,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 21: [2022-11-26 22:02:27,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-26 22:02:27,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 22:02:27,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 6: [2022-11-26 22:02:27,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:02:27,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 22:02:27,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 26: [2022-11-26 22:02:27,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:02:27,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 22:02:27,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 21: [2022-11-26 22:02:27,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:02:27,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 22:02:27,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 16: [2022-11-26 22:02:27,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
5: [2022-11-26 22:02:27,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:02:27,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 22:02:27,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 3: [2022-11-26 22:02:27,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:02:27,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 26: [2022-11-26 22:02:27,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:02:27,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 22:02:27,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 18: [2022-11-26 22:02:27,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:02:27,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 
18: [2022-11-26 22:02:27,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 26: [2022-11-26 22:02:27,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 18: [2022-11-26 22:02:27,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 26: [2022-11-26 22:02:27,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 9: [2022-11-26 22:02:27,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:02:27,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 22:02:27,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 18: [2022-11-26 22:02:27,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:02:27,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 9: [2022-11-26 22:02:27,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:02:27,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 
9: [2022-11-26 22:02:27,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 22:02:27,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 10: [2022-11-26 22:02:27,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:02:27,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:02:27,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 18: [2022-11-26 22:02:27,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 10: [2022-11-26 22:02:27,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 18: [2022-11-26 22:02:27,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 4: [2022-11-26 22:02:27,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:02:27,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 22:02:27,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 23: [2022-11-26 22:02:27,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-26 22:02:27,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 22:02:27,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 17: [2022-11-26 22:02:27,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:02:27,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 10: [2022-11-26 22:02:27,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:02:27,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:02:27,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 15: [2022-11-26 22:02:27,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 8: [2022-11-26 22:02:27,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:02:27,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 15: [2022-11-26 22:02:27,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 17: [2022-11-26 22:02:27,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 
8: [2022-11-26 22:02:27,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 22:02:27,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 10: [2022-11-26 22:02:27,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:02:27,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 22:02:27,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 30: [2022-11-26 22:02:27,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 22:02:27,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 22:02:27,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 6: [2022-11-26 22:02:27,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:02:27,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
6: [2022-11-26 22:02:27,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 25: [2022-11-26 22:02:27,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 6: [2022-11-26 22:02:27,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 25: [2022-11-26 22:02:27,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 28: [2022-11-26 22:02:27,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 22:02:27,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 22:02:27,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 22:02:27,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 22:02:27,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 28: [2022-11-26 22:02:27,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 6: [2022-11-26 22:02:27,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 22:02:27,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 22:02:27,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 31: [2022-11-26 22:02:27,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:02:27,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:02:27,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 11: [2022-11-26 22:02:27,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 31: [2022-11-26 22:02:27,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 11: [2022-11-26 22:02:27,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 19: [2022-11-26 22:02:27,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:02:27,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 22:02:27,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 11: [2022-11-26 22:02:27,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
29: [2022-11-26 22:02:27,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:02:27,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 29: [2022-11-26 22:02:27,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 11: [2022-11-26 22:02:27,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 29: [2022-11-26 22:02:27,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 7: [2022-11-26 22:02:27,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:02:27,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 22:02:27,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 1: [2022-11-26 22:02:27,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:02:27,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 22:02:27,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 22: [2022-11-26 22:02:27,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
22: [2022-11-26 22:02:27,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 22:02:27,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 19: [2022-11-26 22:02:27,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:02:27,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 22:02:27,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 4: [2022-11-26 22:02:27,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:02:27,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:02:27,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
4: [2022-11-26 22:02:27,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 15: [2022-11-26 22:02:27,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 26: [2022-11-26 22:02:27,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 4: [2022-11-26 22:02:27,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 15: [2022-11-26 22:02:27,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 26: [2022-11-26 22:02:27,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 29: [2022-11-26 22:02:27,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:02:27,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 22:02:27,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 1: [2022-11-26 22:02:27,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:02:27,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:02:27,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
1: [2022-11-26 22:02:27,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 23: [2022-11-26 22:02:27,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 14: [2022-11-26 22:02:27,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:02:27,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 1: [2022-11-26 22:02:27,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 23: [2022-11-26 22:02:27,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 14: [2022-11-26 22:02:27,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 21: [2022-11-26 22:02:27,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 14: [2022-11-26 22:02:27,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 14: [2022-11-26 22:02:27,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:02:27,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
14: [2022-11-26 22:02:27,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 18: [2022-11-26 22:02:27,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:02:27,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 14: [2022-11-26 22:02:27,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 18: [2022-11-26 22:02:27,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 7: [2022-11-26 22:02:27,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 13: [2022-11-26 22:02:27,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 28: [2022-11-26 22:02:27,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:02:27,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 13: [2022-11-26 22:02:27,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 22:02:27,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 
28: [2022-11-26 22:02:27,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 22:02:27,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 20: [2022-11-26 22:02:27,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:02:27,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 22:02:27,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 10: [2022-11-26 22:02:27,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:02:27,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 22:02:27,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 20: [2022-11-26 22:02:27,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:02:27,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 31: [2022-11-26 22:02:27,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 
31: [2022-11-26 22:02:27,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 20: [2022-11-26 22:02:27,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 31: [2022-11-26 22:02:27,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 21: [2022-11-26 22:02:27,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:02:27,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:02:27,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 13: [2022-11-26 22:02:27,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 21: [2022-11-26 22:02:27,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 13: [2022-11-26 22:02:27,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 9: [2022-11-26 22:02:27,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:02:27,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 
9: [2022-11-26 22:02:27,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 19: [2022-11-26 22:02:27,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 9: [2022-11-26 22:02:27,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 19: [2022-11-26 22:02:27,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 5: [2022-11-26 22:02:27,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:02:27,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 22:02:27,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 9: [2022-11-26 22:02:27,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:02:27,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 22:02:27,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 24: [2022-11-26 22:02:27,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-26 22:02:27,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 22:02:27,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 30: [2022-11-26 22:02:27,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 22:02:27,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 22:02:27,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 16: [2022-11-26 22:02:27,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:02:27,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 22:02:27,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 31: [2022-11-26 22:02:27,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:02:27,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 22:02:27,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 30: [2022-11-26 22:02:27,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-26 22:02:27,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 22:02:27,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 26: [2022-11-26 22:02:27,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:02:27,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 22:02:27,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 8: [2022-11-26 22:02:27,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:02:27,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 0: [2022-11-26 22:02:27,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:02:27,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 0: [2022-11-26 22:02:27,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 22:02:27,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 16: [2022-11-26 22:02:27,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 
16: [2022-11-26 22:02:27,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 22:02:27,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 24: [2022-11-26 22:02:27,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:02:27,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 22:02:27,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 3: [2022-11-26 22:02:27,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:02:27,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 22:02:27,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 0: [2022-11-26 22:02:27,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:02:27,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 22:02:27,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 13: [2022-11-26 22:02:27,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 22:02:27,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 22:02:27,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 6: [2022-11-26 22:02:27,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:02:27,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 22:02:27,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 5: [2022-11-26 22:02:27,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:02:27,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 22:02:27,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 3: [2022-11-26 22:02:27,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:02:27,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 22:02:27,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 15: [2022-11-26 22:02:27,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-26 22:02:27,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 22:02:27,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 14: [2022-11-26 22:02:27,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:02:27,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 7: [2022-11-26 22:02:27,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:02:27,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 7: [2022-11-26 22:02:27,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 28: [2022-11-26 22:02:27,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:02:27,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:02:27,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 22: [2022-11-26 22:02:27,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 22:02:27,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 
28: [2022-11-26 22:02:27,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 22:02:27,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 17: [2022-11-26 22:02:27,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:02:27,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 22:02:27,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 25: [2022-11-26 22:02:27,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:02:27,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:02:27,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 22:02:27,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 25: [2022-11-26 22:02:27,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 22:02:27,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 4: [2022-11-26 22:02:27,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-26 22:02:27,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 22:02:27,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 27: [2022-11-26 22:02:27,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:02:27,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:02:27,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:02:27,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 22:02:27,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 22:02:27,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 22:02:27,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 27: [2022-11-26 22:02:27,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 27: [2022-11-26 22:02:27,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 2: [2022-11-26 22:02:27,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-26 22:02:27,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:02:27,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:02:27,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:02:27,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 22:02:27,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 22:02:27,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 22:02:27,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 2: [2022-11-26 22:02:27,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 22:02:27,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 2: [2022-11-26 22:02:27,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 2: [2022-11-26 22:02:27,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 20: [2022-11-26 22:02:27,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
20: [2022-11-26 22:02:27,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 18: [2022-11-26 22:02:27,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:02:27,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 18: [2022-11-26 22:02:27,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 22:02:27,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 0: [2022-11-26 22:02:27,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:02:27,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 22:02:27,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 17: [2022-11-26 22:02:28,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:02:28,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 22:02:28,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 6: [2022-11-26 22:02:28,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 22:02:28,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 22:02:28,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 11: [2022-11-26 22:02:28,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:02:28,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 22:02:28,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 12: [2022-11-26 22:02:28,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:02:28,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:02:28,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:02:28,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:02:28,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 22:02:28,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 22:02:28,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 22:02:28,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 22:02:28,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 22:02:28,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 22:02:28,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 12: [2022-11-26 22:02:28,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 12: [2022-11-26 22:02:28,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 12: [2022-11-26 22:02:28,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 12: [2022-11-26 22:02:28,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 24: [2022-11-26 22:02:28,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-26 22:02:28,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 22:02:28,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 19: [2022-11-26 22:02:28,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:02:28,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 22:02:28,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 23: [2022-11-26 22:02:28,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:02:28,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 22:02:28,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 10: [2022-11-26 22:02:28,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:02:28,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 22:02:28,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 29: [2022-11-26 22:02:28,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 
29: [2022-11-26 22:02:28,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 22:02:28,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 8: [2022-11-26 22:02:28,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:02:28,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 22:02:28,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 2: [2022-11-26 22:02:28,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:02:28,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 22:02:28,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 22: [2022-11-26 22:02:28,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:02:28,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 22:02:28,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 31: [2022-11-26 22:02:28,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 
31: [2022-11-26 22:02:28,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 22:02:28,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 15: [2022-11-26 22:02:28,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:02:28,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 22:02:28,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 16: [2022-11-26 22:02:28,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:02:28,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 22:02:28,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 30: [2022-11-26 22:02:28,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 22:02:28,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 22:02:28,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 21: [2022-11-26 22:02:28,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-26 22:02:28,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 22:02:28,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 13: [2022-11-26 22:02:28,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:02:28,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:02:28,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 4: [2022-11-26 22:02:28,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 9: [2022-11-26 22:02:28,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:02:28,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 4: [2022-11-26 22:02:28,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 9: [2022-11-26 22:02:28,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 22:02:28,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 28: [2022-11-26 22:02:28,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
28: [2022-11-26 22:02:28,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 22:02:28,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 26: [2022-11-26 22:02:28,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:02:28,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 25: [2022-11-26 22:02:28,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:02:28,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 25: [2022-11-26 22:02:28,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 22:02:28,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 1: [2022-11-26 22:02:28,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:02:28,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 22:02:28,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 18: [2022-11-26 22:02:28,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
18: [2022-11-26 22:02:28,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 22:02:28,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 14: [2022-11-26 22:02:28,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:02:28,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 22:02:28,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 7: [2022-11-26 22:02:28,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:02:28,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:02:28,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:02:28,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 7: [2022-11-26 22:02:28,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 5: [2022-11-26 22:02:28,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 7: [2022-11-26 22:02:28,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 
3: [2022-11-26 22:02:28,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:02:28,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 22:02:28,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 27: [2022-11-26 22:02:28,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:02:28,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 22:02:28,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 20: [2022-11-26 22:02:28,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:02:28,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 22:02:28,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 11: [2022-11-26 22:02:28,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:02:28,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 22:02:28,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 
6: [2022-11-26 22:02:28,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:02:28,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 22:02:28,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 0: [2022-11-26 22:02:28,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 22:02:28,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 23: [2022-11-26 22:02:28,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:02:28,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 22:02:28,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 17: [2022-11-26 22:02:28,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:02:28,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 22:02:28,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 10: [2022-11-26 22:02:28,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-26 22:02:28,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 22:02:28,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 24: [2022-11-26 22:02:28,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:02:28,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 22:02:28,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 8: [2022-11-26 22:02:28,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:02:28,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 22:02:28,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 19: [2022-11-26 22:02:28,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:02:28,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
19: [2022-11-26 22:02:28,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 2: [2022-11-26 22:02:28,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 22:02:28,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 19: [2022-11-26 22:02:28,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 29: [2022-11-26 22:02:28,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:02:28,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 22:02:28,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 22: [2022-11-26 22:02:28,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:02:28,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 22:02:28,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 30: [2022-11-26 22:02:28,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
30: [2022-11-26 22:02:28,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 22:02:28,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 13: [2022-11-26 22:02:28,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:02:28,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 22:02:28,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 12: [2022-11-26 22:02:28,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:02:28,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 22:02:28,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 15: [2022-11-26 22:02:28,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:02:28,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 22:02:28,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 4: [2022-11-26 22:02:28,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 22:02:28,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 22:02:28,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 26: [2022-11-26 22:02:28,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:02:28,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 22:02:28,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 31: [2022-11-26 22:02:28,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:02:28,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 22:02:28,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 21: [2022-11-26 22:02:28,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:02:28,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 22:02:28,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 25: [2022-11-26 22:02:28,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 
25: [2022-11-26 22:02:28,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 22:02:28,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 9: [2022-11-26 22:02:28,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:02:28,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 22:02:28,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 1: [2022-11-26 22:02:28,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:02:28,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 22:02:28,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 28: [2022-11-26 22:02:28,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 22:02:28,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 22:02:28,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 16: [2022-11-26 22:02:28,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-26 22:02:28,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 22:02:28,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 3: [2022-11-26 22:02:28,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:02:28,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 22:02:28,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 7: [2022-11-26 22:02:28,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:02:28,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 22:02:28,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 5: [2022-11-26 22:02:28,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:02:28,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 22:02:28,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 20: [2022-11-26 22:02:28,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 
20: [2022-11-26 22:02:28,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 22:02:28,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 18: [2022-11-26 22:02:28,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:02:28,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 22:02:28,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 14: [2022-11-26 22:02:28,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:02:28,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 22:02:28,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 0: [2022-11-26 22:02:28,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:02:28,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 22:02:28,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 6: [2022-11-26 22:02:28,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 22:02:28,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 22:02:28,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 27: [2022-11-26 22:02:28,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:02:28,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 22:02:28,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 23: [2022-11-26 22:02:28,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:02:28,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:02:28,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 11: [2022-11-26 22:02:28,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 23: [2022-11-26 22:02:28,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 11: [2022-11-26 22:02:28,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 19: [2022-11-26 22:02:28,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
19: [2022-11-26 22:02:28,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 22:02:28,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 24: [2022-11-26 22:02:28,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:02:28,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 12: [2022-11-26 22:02:28,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:02:28,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 24: [2022-11-26 22:02:28,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 12: [2022-11-26 22:02:28,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 17: [2022-11-26 22:02:28,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:02:28,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 22:02:28,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 8: [2022-11-26 22:02:28,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-26 22:02:28,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 22:02:28,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 10: [2022-11-26 22:02:28,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:02:28,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:02:28,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 10: [2022-11-26 22:02:28,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 2: [2022-11-26 22:02:28,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 10: [2022-11-26 22:02:28,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 13: [2022-11-26 22:02:28,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:02:28,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 22:02:28,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 29: [2022-11-26 22:02:28,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
29: [2022-11-26 22:02:28,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 22:02:28,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 22: [2022-11-26 22:02:28,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:02:28,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 22:02:28,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 30: [2022-11-26 22:02:28,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 22:02:28,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 22:02:28,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 16: [2022-11-26 22:02:28,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:02:28,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 22:02:28,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 31: [2022-11-26 22:02:28,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-26 22:02:28,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 22:02:28,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 25: [2022-11-26 22:02:28,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:02:28,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 21: [2022-11-26 22:02:28,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:02:28,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 21: [2022-11-26 22:02:28,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 22:02:28,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 26: [2022-11-26 22:02:28,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:02:28,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 22:02:28,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 15: [2022-11-26 22:02:28,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-26 22:02:28,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 22:02:28,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 9: [2022-11-26 22:02:28,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:02:28,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 22:02:28,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 14: [2022-11-26 22:02:28,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:02:28,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:02:28,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 0: [2022-11-26 22:02:28,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 14: [2022-11-26 22:02:28,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 0: [2022-11-26 22:02:28,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 4: [2022-11-26 22:02:28,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
20: [2022-11-26 22:02:28,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:02:28,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 22:02:28,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 3: [2022-11-26 22:02:28,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:02:28,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 3: [2022-11-26 22:02:28,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 20: [2022-11-26 22:02:28,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 3: [2022-11-26 22:02:28,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 11: [2022-11-26 22:02:28,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:02:28,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 22:02:28,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 18: [2022-11-26 22:02:28,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
18: [2022-11-26 22:02:28,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 22:02:28,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 19: [2022-11-26 22:02:28,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:02:28,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 22:02:28,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 1: [2022-11-26 22:02:28,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:02:28,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 23: [2022-11-26 22:02:28,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:02:28,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 23: [2022-11-26 22:02:28,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 22:02:28,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 27: [2022-11-26 22:02:28,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
27: [2022-11-26 22:02:28,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 22:02:28,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 6: [2022-11-26 22:02:28,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:02:28,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 22:02:28,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 24: [2022-11-26 22:02:28,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:02:28,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:02:28,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 7: [2022-11-26 22:02:28,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 24: [2022-11-26 22:02:28,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 7: [2022-11-26 22:02:28,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 28: [2022-11-26 22:02:28,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
28: [2022-11-26 22:02:28,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 22:02:28,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 17: [2022-11-26 22:02:28,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:02:28,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 22:02:28,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 8: [2022-11-26 22:02:28,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:02:28,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 22:02:28,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 12: [2022-11-26 22:02:28,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:02:28,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 22:02:28,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 2: [2022-11-26 22:02:28,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 22:02:28,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 22:02:28,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 10: [2022-11-26 22:02:28,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:02:28,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 22:02:28,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 22: [2022-11-26 22:02:28,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:02:28,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 22:02:28,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 29: [2022-11-26 22:02:28,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:02:28,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 22:02:28,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 9: [2022-11-26 22:02:28,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 22:02:28,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 22:02:28,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 13: [2022-11-26 22:02:28,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:02:28,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 22:02:28,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 5: [2022-11-26 22:02:28,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:02:28,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 22:02:28,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 15: [2022-11-26 22:02:28,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:02:28,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
15: [2022-11-26 22:02:28,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 14: [2022-11-26 22:02:28,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 15: [2022-11-26 22:02:28,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 14: [2022-11-26 22:02:28,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 30: [2022-11-26 22:02:28,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 22:02:28,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 22:02:28,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 19: [2022-11-26 22:02:28,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:02:28,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:02:28,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 4: [2022-11-26 22:02:28,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 19: [2022-11-26 22:02:28,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 
4: [2022-11-26 22:02:28,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 25: [2022-11-26 22:02:28,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:02:28,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 22:02:28,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 16: [2022-11-26 22:02:28,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:02:28,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 22:02:28,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 3: [2022-11-26 22:02:28,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:02:28,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 22:02:28,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 1: [2022-11-26 22:02:28,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 22:02:28,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 22:02:28,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 21: [2022-11-26 22:02:28,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:02:28,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 22:02:28,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 26: [2022-11-26 22:02:28,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:02:28,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 22:02:28,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 5: [2022-11-26 22:02:28,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:02:28,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 22:02:28,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 27: [2022-11-26 22:02:28,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
28: [2022-11-26 22:02:28,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:02:28,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 28: [2022-11-26 22:02:28,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 27: [2022-11-26 22:02:28,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 28: [2022-11-26 22:02:28,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 7: [2022-11-26 22:02:28,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:02:28,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 22:02:28,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 31: [2022-11-26 22:02:28,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:02:28,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 22:02:28,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 27: [2022-11-26 22:02:28,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
31: [2022-11-26 22:02:28,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:02:28,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 31: [2022-11-26 22:02:28,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step124000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 27: [2022-11-26 22:02:28,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 31: [2022-11-26 22:02:28,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 0: successfully saved checkpoint at iteration 124000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2625.41 31: iteration 124010/ 173500 | consumed samples: 31746560 | consumed tokens: 65016954880 | elapsed time per iteration (s): 1.01 | learning rate: 5.442E-05 | global batch size: 256 | lm loss: 1.921442E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 253.831 | TFLOPs: 15.36 | 31: iteration 124020/ 173500 | consumed samples: 31749120 | consumed tokens: 65022197760 | elapsed time per iteration (s): 0.74 | learning rate: 5.441E-05 | global batch size: 256 | lm loss: 1.939953E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.366 | TFLOPs: 20.95 | 31: iteration 124030/ 173500 | consumed samples: 31751680 | consumed tokens: 65027440640 | elapsed time per iteration (s): 0.79 | learning rate: 5.440E-05 | global batch size: 256 | lm loss: 1.944173E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.824 | TFLOPs: 19.59 | 31: iteration 
124040/ 173500 | consumed samples: 31754240 | consumed tokens: 65032683520 | elapsed time per iteration (s): 0.79 | learning rate: 5.438E-05 | global batch size: 256 | lm loss: 1.936148E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.959 | TFLOPs: 19.60 | 31: iteration 124050/ 173500 | consumed samples: 31756800 | consumed tokens: 65037926400 | elapsed time per iteration (s): 0.79 | learning rate: 5.437E-05 | global batch size: 256 | lm loss: 1.904134E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.153 | TFLOPs: 19.49 | 31: iteration 124060/ 173500 | consumed samples: 31759360 | consumed tokens: 65043169280 | elapsed time per iteration (s): 0.78 | learning rate: 5.436E-05 | global batch size: 256 | lm loss: 1.927793E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.256 | TFLOPs: 19.86 | 31: iteration 124070/ 173500 | consumed samples: 31761920 | consumed tokens: 65048412160 | elapsed time per iteration (s): 0.78 | learning rate: 5.434E-05 | global batch size: 256 | lm loss: 1.939773E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.628 | TFLOPs: 19.88 | 31: iteration 124080/ 173500 | consumed samples: 31764480 | consumed tokens: 65053655040 | elapsed time per iteration (s): 0.74 | learning rate: 5.433E-05 | global batch size: 256 | lm loss: 1.919931E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.146 | TFLOPs: 21.00 | 31: iteration 124090/ 173500 | consumed samples: 31767040 | consumed tokens: 65058897920 | elapsed time per iteration (s): 0.74 | learning rate: 5.432E-05 | global batch size: 256 | lm loss: 1.962346E+00 | grad norm: 0.187 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.271 | TFLOPs: 20.89 | 31: iteration 124100/ 173500 | consumed samples: 31769600 | consumed tokens: 65064140800 | elapsed time per iteration (s): 0.73 | learning rate: 5.430E-05 | global batch size: 256 | lm loss: 1.961451E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.494 | TFLOPs: 21.20 | 31: iteration 124110/ 173500 | consumed samples: 31772160 | consumed tokens: 65069383680 | elapsed time per iteration (s): 0.75 | learning rate: 5.429E-05 | global batch size: 256 | lm loss: 1.911810E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.745 | TFLOPs: 20.55 | 31: iteration 124120/ 173500 | consumed samples: 31774720 | consumed tokens: 65074626560 | elapsed time per iteration (s): 0.79 | learning rate: 5.428E-05 | global batch size: 256 | lm loss: 1.963687E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.864 | TFLOPs: 19.65 | 31: iteration 124130/ 173500 | consumed samples: 31777280 | consumed tokens: 65079869440 | elapsed time per iteration (s): 0.78 | learning rate: 5.427E-05 | global batch size: 256 | lm loss: 1.939089E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.484 | TFLOPs: 19.87 | 31: iteration 124140/ 173500 | consumed samples: 31779840 | consumed tokens: 65085112320 | elapsed time per iteration (s): 0.76 | learning rate: 5.425E-05 | global batch size: 256 | lm loss: 1.961627E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.267 | TFLOPs: 20.28 | 31: iteration 124150/ 173500 | consumed samples: 31782400 | consumed tokens: 65090355200 | elapsed time per iteration (s): 0.74 | learning 
rate: 5.424E-05 | global batch size: 256 | lm loss: 1.969938E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.572 | TFLOPs: 20.85 | 31: iteration 124160/ 173500 | consumed samples: 31784960 | consumed tokens: 65095598080 | elapsed time per iteration (s): 0.75 | learning rate: 5.423E-05 | global batch size: 256 | lm loss: 1.950747E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.455 | TFLOPs: 20.78 | 31: iteration 124170/ 173500 | consumed samples: 31787520 | consumed tokens: 65100840960 | elapsed time per iteration (s): 0.78 | learning rate: 5.421E-05 | global batch size: 256 | lm loss: 1.959276E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.372 | TFLOPs: 19.74 | 31: iteration 124180/ 173500 | consumed samples: 31790080 | consumed tokens: 65106083840 | elapsed time per iteration (s): 0.75 | learning rate: 5.420E-05 | global batch size: 256 | lm loss: 1.919444E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.356 | TFLOPs: 20.71 | 31: iteration 124190/ 173500 | consumed samples: 31792640 | consumed tokens: 65111326720 | elapsed time per iteration (s): 0.79 | learning rate: 5.419E-05 | global batch size: 256 | lm loss: 1.965944E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.581 | TFLOPs: 19.52 | 31: iteration 124200/ 173500 | consumed samples: 31795200 | consumed tokens: 65116569600 | elapsed time per iteration (s): 0.73 | learning rate: 5.418E-05 | global batch size: 256 | lm loss: 1.958565E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.723 | TFLOPs: 21.34 | 31: iteration 124210/ 
173500 | consumed samples: 31797760 | consumed tokens: 65121812480 | elapsed time per iteration (s): 0.77 | learning rate: 5.416E-05 | global batch size: 256 | lm loss: 1.943241E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.624 | TFLOPs: 20.00 | 31: iteration 124220/ 173500 | consumed samples: 31800320 | consumed tokens: 65127055360 | elapsed time per iteration (s): 0.81 | learning rate: 5.415E-05 | global batch size: 256 | lm loss: 1.970917E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.746 | TFLOPs: 19.22 | 31: iteration 124230/ 173500 | consumed samples: 31802880 | consumed tokens: 65132298240 | elapsed time per iteration (s): 0.86 | learning rate: 5.414E-05 | global batch size: 256 | lm loss: 1.939333E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.078 | TFLOPs: 18.09 | 31: iteration 124240/ 173500 | consumed samples: 31805440 | consumed tokens: 65137541120 | elapsed time per iteration (s): 0.85 | learning rate: 5.412E-05 | global batch size: 256 | lm loss: 1.930918E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.116 | TFLOPs: 18.28 | 31: iteration 124250/ 173500 | consumed samples: 31808000 | consumed tokens: 65142784000 | elapsed time per iteration (s): 0.79 | learning rate: 5.411E-05 | global batch size: 256 | lm loss: 1.946556E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.683 | TFLOPs: 19.52 | 31: iteration 124260/ 173500 | consumed samples: 31810560 | consumed tokens: 65148026880 | elapsed time per iteration (s): 0.82 | learning rate: 5.410E-05 | global batch size: 256 | lm loss: 1.944300E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 313.245 | TFLOPs: 18.95 | 31: iteration 124270/ 173500 | consumed samples: 31813120 | consumed tokens: 65153269760 | elapsed time per iteration (s): 0.87 | learning rate: 5.409E-05 | global batch size: 256 | lm loss: 1.965656E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.246 | TFLOPs: 17.74 | 31: iteration 124280/ 173500 | consumed samples: 31815680 | consumed tokens: 65158512640 | elapsed time per iteration (s): 0.85 | learning rate: 5.407E-05 | global batch size: 256 | lm loss: 1.942619E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.701 | TFLOPs: 18.13 | 31: iteration 124290/ 173500 | consumed samples: 31818240 | consumed tokens: 65163755520 | elapsed time per iteration (s): 0.84 | learning rate: 5.406E-05 | global batch size: 256 | lm loss: 1.955920E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.232 | TFLOPs: 18.41 | 31: iteration 124300/ 173500 | consumed samples: 31820800 | consumed tokens: 65168998400 | elapsed time per iteration (s): 0.81 | learning rate: 5.405E-05 | global batch size: 256 | lm loss: 1.927239E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.172 | TFLOPs: 19.01 | 31: iteration 124310/ 173500 | consumed samples: 31823360 | consumed tokens: 65174241280 | elapsed time per iteration (s): 0.79 | learning rate: 5.403E-05 | global batch size: 256 | lm loss: 1.928342E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.809 | TFLOPs: 19.53 | 31: iteration 124320/ 173500 | consumed samples: 31825920 | consumed tokens: 65179484160 | elapsed time per iteration (s): 0.85 | learning rate: 
5.402E-05 | global batch size: 256 | lm loss: 1.947385E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.232 | TFLOPs: 18.28 | 31: iteration 124330/ 173500 | consumed samples: 31828480 | consumed tokens: 65184727040 | elapsed time per iteration (s): 0.82 | learning rate: 5.401E-05 | global batch size: 256 | lm loss: 1.954558E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.525 | TFLOPs: 18.91 | 31: iteration 124340/ 173500 | consumed samples: 31831040 | consumed tokens: 65189969920 | elapsed time per iteration (s): 0.84 | learning rate: 5.399E-05 | global batch size: 256 | lm loss: 1.930892E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.956 | TFLOPs: 18.39 | 31: iteration 124350/ 173500 | consumed samples: 31833600 | consumed tokens: 65195212800 | elapsed time per iteration (s): 0.81 | learning rate: 5.398E-05 | global batch size: 256 | lm loss: 1.941405E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.476 | TFLOPs: 19.15 | 31: iteration 124360/ 173500 | consumed samples: 31836160 | consumed tokens: 65200455680 | elapsed time per iteration (s): 0.77 | learning rate: 5.397E-05 | global batch size: 256 | lm loss: 1.926044E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.675 | TFLOPs: 20.01 | 31: iteration 124370/ 173500 | consumed samples: 31838720 | consumed tokens: 65205698560 | elapsed time per iteration (s): 0.78 | learning rate: 5.396E-05 | global batch size: 256 | lm loss: 1.954432E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.047 | TFLOPs: 19.79 | 31: iteration 124380/ 173500 | 
consumed samples: 31841280 | consumed tokens: 65210941440 | elapsed time per iteration (s): 0.81 | learning rate: 5.394E-05 | global batch size: 256 | lm loss: 1.943817E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.088 | TFLOPs: 19.18 | 31: iteration 124390/ 173500 | consumed samples: 31843840 | consumed tokens: 65216184320 | elapsed time per iteration (s): 1.00 | learning rate: 5.393E-05 | global batch size: 256 | lm loss: 1.944868E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 256.649 | TFLOPs: 15.53 | 31: iteration 124400/ 173500 | consumed samples: 31846400 | consumed tokens: 65221427200 | elapsed time per iteration (s): 0.73 | learning rate: 5.392E-05 | global batch size: 256 | lm loss: 1.948043E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.587 | TFLOPs: 21.09 | 31: iteration 124410/ 173500 | consumed samples: 31848960 | consumed tokens: 65226670080 | elapsed time per iteration (s): 0.76 | learning rate: 5.390E-05 | global batch size: 256 | lm loss: 1.943015E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.296 | TFLOPs: 20.41 | 31: iteration 124420/ 173500 | consumed samples: 31851520 | consumed tokens: 65231912960 | elapsed time per iteration (s): 0.77 | learning rate: 5.389E-05 | global batch size: 256 | lm loss: 1.944477E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.881 | TFLOPs: 20.08 | 31: iteration 124430/ 173500 | consumed samples: 31854080 | consumed tokens: 65237155840 | elapsed time per iteration (s): 0.74 | learning rate: 5.388E-05 | global batch size: 256 | lm loss: 1.922644E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 346.908 | TFLOPs: 20.99 | 31: iteration 124440/ 173500 | consumed samples: 31856640 | consumed tokens: 65242398720 | elapsed time per iteration (s): 0.74 | learning rate: 5.387E-05 | global batch size: 256 | lm loss: 1.934971E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.424 | TFLOPs: 20.96 | 31: iteration 124450/ 173500 | consumed samples: 31859200 | consumed tokens: 65247641600 | elapsed time per iteration (s): 0.74 | learning rate: 5.385E-05 | global batch size: 256 | lm loss: 1.959983E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.064 | TFLOPs: 21.00 | 31: iteration 124460/ 173500 | consumed samples: 31861760 | consumed tokens: 65252884480 | elapsed time per iteration (s): 0.74 | learning rate: 5.384E-05 | global batch size: 256 | lm loss: 1.925003E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.142 | TFLOPs: 21.06 | 31: iteration 124470/ 173500 | consumed samples: 31864320 | consumed tokens: 65258127360 | elapsed time per iteration (s): 0.76 | learning rate: 5.383E-05 | global batch size: 256 | lm loss: 1.956444E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.285 | TFLOPs: 20.34 | 31: iteration 124480/ 173500 | consumed samples: 31866880 | consumed tokens: 65263370240 | elapsed time per iteration (s): 0.77 | learning rate: 5.381E-05 | global batch size: 256 | lm loss: 1.959723E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.326 | TFLOPs: 20.10 | 31: iteration 124490/ 173500 | consumed samples: 31869440 | consumed tokens: 65268613120 | elapsed time per iteration (s): 0.81 | learning rate: 
5.380E-05 | global batch size: 256 | lm loss: 1.944130E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.445 | TFLOPs: 19.02 | 31: iteration 124500/ 173500 | consumed samples: 31872000 | consumed tokens: 65273856000 | elapsed time per iteration (s): 0.93 | learning rate: 5.379E-05 | global batch size: 256 | lm loss: 1.920473E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 274.199 | TFLOPs: 16.59 | 31: iteration 124510/ 173500 | consumed samples: 31874560 | consumed tokens: 65279098880 | elapsed time per iteration (s): 0.87 | learning rate: 5.378E-05 | global batch size: 256 | lm loss: 1.973335E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.337 | TFLOPs: 17.87 | 31: iteration 124520/ 173500 | consumed samples: 31877120 | consumed tokens: 65284341760 | elapsed time per iteration (s): 0.85 | learning rate: 5.376E-05 | global batch size: 256 | lm loss: 1.914257E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.501 | TFLOPs: 18.18 | 31: iteration 124530/ 173500 | consumed samples: 31879680 | consumed tokens: 65289584640 | elapsed time per iteration (s): 0.86 | learning rate: 5.375E-05 | global batch size: 256 | lm loss: 1.947853E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.136 | TFLOPs: 18.04 | 31: iteration 124540/ 173500 | consumed samples: 31882240 | consumed tokens: 65294827520 | elapsed time per iteration (s): 0.88 | learning rate: 5.374E-05 | global batch size: 256 | lm loss: 1.954425E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.452 | TFLOPs: 17.69 | 31: iteration 124550/ 173500 | 
consumed samples: 31884800 | consumed tokens: 65300070400 | elapsed time per iteration (s): 0.90 | learning rate: 5.372E-05 | global batch size: 256 | lm loss: 1.941085E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.272 | TFLOPs: 17.26 | 31: iteration 124560/ 173500 | consumed samples: 31887360 | consumed tokens: 65305313280 | elapsed time per iteration (s): 0.88 | learning rate: 5.371E-05 | global batch size: 256 | lm loss: 1.985598E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.199 | TFLOPs: 17.62 | 31: iteration 124570/ 173500 | consumed samples: 31889920 | consumed tokens: 65310556160 | elapsed time per iteration (s): 0.87 | learning rate: 5.370E-05 | global batch size: 256 | lm loss: 1.951528E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.325 | TFLOPs: 17.75 | 31: iteration 124580/ 173500 | consumed samples: 31892480 | consumed tokens: 65315799040 | elapsed time per iteration (s): 0.86 | learning rate: 5.369E-05 | global batch size: 256 | lm loss: 1.954688E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.102 | TFLOPs: 17.97 | 31: iteration 124590/ 173500 | consumed samples: 31895040 | consumed tokens: 65321041920 | elapsed time per iteration (s): 0.84 | learning rate: 5.367E-05 | global batch size: 256 | lm loss: 1.949509E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.052 | TFLOPs: 18.33 | 31: iteration 124600/ 173500 | consumed samples: 31897600 | consumed tokens: 65326284800 | elapsed time per iteration (s): 0.90 | learning rate: 5.366E-05 | global batch size: 256 | lm loss: 1.929830E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 285.581 | TFLOPs: 17.28 | 31: iteration 124610/ 173500 | consumed samples: 31900160 | consumed tokens: 65331527680 | elapsed time per iteration (s): 0.86 | learning rate: 5.365E-05 | global batch size: 256 | lm loss: 1.951022E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.393 | TFLOPs: 17.99 | 31: iteration 124620/ 173500 | consumed samples: 31902720 | consumed tokens: 65336770560 | elapsed time per iteration (s): 0.84 | learning rate: 5.363E-05 | global batch size: 256 | lm loss: 1.956584E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.214 | TFLOPs: 18.46 | 31: iteration 124630/ 173500 | consumed samples: 31905280 | consumed tokens: 65342013440 | elapsed time per iteration (s): 0.82 | learning rate: 5.362E-05 | global batch size: 256 | lm loss: 1.955556E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.936 | TFLOPs: 18.93 | 31: iteration 124640/ 173500 | consumed samples: 31907840 | consumed tokens: 65347256320 | elapsed time per iteration (s): 0.93 | learning rate: 5.361E-05 | global batch size: 256 | lm loss: 1.935976E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 275.923 | TFLOPs: 16.69 | 31: iteration 124650/ 173500 | consumed samples: 31910400 | consumed tokens: 65352499200 | elapsed time per iteration (s): 0.78 | learning rate: 5.360E-05 | global batch size: 256 | lm loss: 1.933739E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.432 | TFLOPs: 19.75 | 31: iteration 124660/ 173500 | consumed samples: 31912960 | consumed tokens: 65357742080 | elapsed time per iteration (s): 1.03 | learning rate: 
5.358E-05 | global batch size: 256 | lm loss: 1.962479E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.880 | TFLOPs: 15.06 | 31: iteration 124670/ 173500 | consumed samples: 31915520 | consumed tokens: 65362984960 | elapsed time per iteration (s): 0.81 | learning rate: 5.357E-05 | global batch size: 256 | lm loss: 1.935454E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.029 | TFLOPs: 19.12 | 31: iteration 124680/ 173500 | consumed samples: 31918080 | consumed tokens: 65368227840 | elapsed time per iteration (s): 0.93 | learning rate: 5.356E-05 | global batch size: 256 | lm loss: 1.950694E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 274.340 | TFLOPs: 16.60 | 31: iteration 124690/ 173500 | consumed samples: 31920640 | consumed tokens: 65373470720 | elapsed time per iteration (s): 0.85 | learning rate: 5.355E-05 | global batch size: 256 | lm loss: 1.951713E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.354 | TFLOPs: 18.23 | 31: iteration 124700/ 173500 | consumed samples: 31923200 | consumed tokens: 65378713600 | elapsed time per iteration (s): 0.80 | learning rate: 5.353E-05 | global batch size: 256 | lm loss: 1.940461E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.245 | TFLOPs: 19.37 | 31: iteration 124710/ 173500 | consumed samples: 31925760 | consumed tokens: 65383956480 | elapsed time per iteration (s): 0.82 | learning rate: 5.352E-05 | global batch size: 256 | lm loss: 1.933022E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.273 | TFLOPs: 18.83 | 31: iteration 124720/ 173500 | 
consumed samples: 31928320 | consumed tokens: 65389199360 | elapsed time per iteration (s): 0.81 | learning rate: 5.351E-05 | global batch size: 256 | lm loss: 1.948757E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.284 | TFLOPs: 19.01 | 31: iteration 124730/ 173500 | consumed samples: 31930880 | consumed tokens: 65394442240 | elapsed time per iteration (s): 0.82 | learning rate: 5.349E-05 | global batch size: 256 | lm loss: 1.942356E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.294 | TFLOPs: 18.83 | 31: iteration 124740/ 173500 | consumed samples: 31933440 | consumed tokens: 65399685120 | elapsed time per iteration (s): 0.80 | learning rate: 5.348E-05 | global batch size: 256 | lm loss: 1.988447E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.836 | TFLOPs: 19.41 | 31: iteration 124750/ 173500 | consumed samples: 31936000 | consumed tokens: 65404928000 | elapsed time per iteration (s): 0.79 | learning rate: 5.347E-05 | global batch size: 256 | lm loss: 1.961563E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.725 | TFLOPs: 19.58 | 31: iteration 124760/ 173500 | consumed samples: 31938560 | consumed tokens: 65410170880 | elapsed time per iteration (s): 0.80 | learning rate: 5.346E-05 | global batch size: 256 | lm loss: 1.945189E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.032 | TFLOPs: 19.42 | 31: iteration 124770/ 173500 | consumed samples: 31941120 | consumed tokens: 65415413760 | elapsed time per iteration (s): 0.81 | learning rate: 5.344E-05 | global batch size: 256 | lm loss: 1.936951E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 315.436 | TFLOPs: 19.08 |
31: iteration 124780/ 173500 | consumed samples: 31943680 | consumed tokens: 65420656640 | elapsed time per iteration (s): 0.81 | learning rate: 5.343E-05 | global batch size: 256 | lm loss: 1.943599E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.972 | TFLOPs: 19.24 |
31: iteration 124790/ 173500 | consumed samples: 31946240 | consumed tokens: 65425899520 | elapsed time per iteration (s): 0.82 | learning rate: 5.342E-05 | global batch size: 256 | lm loss: 1.950352E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.979 | TFLOPs: 18.81 |
31: iteration 124800/ 173500 | consumed samples: 31948800 | consumed tokens: 65431142400 | elapsed time per iteration (s): 0.83 | learning rate: 5.340E-05 | global batch size: 256 | lm loss: 1.919947E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.911 | TFLOPs: 18.63 |
31: iteration 124810/ 173500 | consumed samples: 31951360 | consumed tokens: 65436385280 | elapsed time per iteration (s): 0.82 | learning rate: 5.339E-05 | global batch size: 256 | lm loss: 1.912217E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.101 | TFLOPs: 18.88 |
31: iteration 124820/ 173500 | consumed samples: 31953920 | consumed tokens: 65441628160 | elapsed time per iteration (s): 0.86 | learning rate: 5.338E-05 | global batch size: 256 | lm loss: 1.943071E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.161 | TFLOPs: 17.98 |
31: iteration 124830/ 173500 | consumed samples: 31956480 | consumed tokens: 65446871040 | elapsed time per iteration (s): 0.81 | learning rate: 5.337E-05 | global batch size: 256 | lm loss: 1.927359E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.490 | TFLOPs: 19.09 |
31: iteration 124840/ 173500 | consumed samples: 31959040 | consumed tokens: 65452113920 | elapsed time per iteration (s): 0.81 | learning rate: 5.335E-05 | global batch size: 256 | lm loss: 1.938153E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.194 | TFLOPs: 19.07 |
31: iteration 124850/ 173500 | consumed samples: 31961600 | consumed tokens: 65457356800 | elapsed time per iteration (s): 0.83 | learning rate: 5.334E-05 | global batch size: 256 | lm loss: 1.938873E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.235 | TFLOPs: 18.65 |
31: iteration 124860/ 173500 | consumed samples: 31964160 | consumed tokens: 65462599680 | elapsed time per iteration (s): 0.81 | learning rate: 5.333E-05 | global batch size: 256 | lm loss: 1.953286E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.630 | TFLOPs: 19.03 |
31: iteration 124870/ 173500 | consumed samples: 31966720 | consumed tokens: 65467842560 | elapsed time per iteration (s): 0.79 | learning rate: 5.331E-05 | global batch size: 256 | lm loss: 1.944723E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.040 | TFLOPs: 19.66 |
31: iteration 124880/ 173500 | consumed samples: 31969280 | consumed tokens: 65473085440 | elapsed time per iteration (s): 0.83 | learning rate: 5.330E-05 | global batch size: 256 | lm loss: 1.953623E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.316 | TFLOPs: 18.71 |
31: iteration 124890/ 173500 | consumed samples: 31971840 | consumed tokens: 65478328320 | elapsed time per iteration (s): 0.79 | learning rate: 5.329E-05 | global batch size: 256 | lm loss: 1.939110E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.078 | TFLOPs: 19.61 |
31: iteration 124900/ 173500 | consumed samples: 31974400 | consumed tokens: 65483571200 | elapsed time per iteration (s): 0.80 | learning rate: 5.328E-05 | global batch size: 256 | lm loss: 1.948676E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.747 | TFLOPs: 19.28 |
31: iteration 124910/ 173500 | consumed samples: 31976960 | consumed tokens: 65488814080 | elapsed time per iteration (s): 0.78 | learning rate: 5.326E-05 | global batch size: 256 | lm loss: 1.924601E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.246 | TFLOPs: 19.80 |
31: iteration 124920/ 173500 | consumed samples: 31979520 | consumed tokens: 65494056960 | elapsed time per iteration (s): 0.83 | learning rate: 5.325E-05 | global batch size: 256 | lm loss: 1.942920E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.424 | TFLOPs: 18.60 |
31: iteration 124930/ 173500 | consumed samples: 31982080 | consumed tokens: 65499299840 | elapsed time per iteration (s): 0.78 | learning rate: 5.324E-05 | global batch size: 256 | lm loss: 1.970363E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.450 | TFLOPs: 19.81 |
31: iteration 124940/ 173500 | consumed samples: 31984640 | consumed tokens: 65504542720 | elapsed time per iteration (s): 0.77 | learning rate: 5.323E-05 | global batch size: 256 | lm loss: 1.909700E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.362 | TFLOPs: 20.05 |
31: iteration 124950/ 173500 | consumed samples: 31987200 | consumed tokens: 65509785600 | elapsed time per iteration (s): 0.79 | learning rate: 5.321E-05 | global batch size: 256 | lm loss: 1.969159E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.499 | TFLOPs: 19.69 |
31: iteration 124960/ 173500 | consumed samples: 31989760 | consumed tokens: 65515028480 | elapsed time per iteration (s): 0.82 | learning rate: 5.320E-05 | global batch size: 256 | lm loss: 1.901853E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.854 | TFLOPs: 18.99 |
31: iteration 124970/ 173500 | consumed samples: 31992320 | consumed tokens: 65520271360 | elapsed time per iteration (s): 0.80 | learning rate: 5.319E-05 | global batch size: 256 | lm loss: 1.941936E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.518 | TFLOPs: 19.39 |
31: iteration 124980/ 173500 | consumed samples: 31994880 | consumed tokens: 65525514240 | elapsed time per iteration (s): 0.81 | learning rate: 5.317E-05 | global batch size: 256 | lm loss: 1.959391E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.258 | TFLOPs: 19.19 |
31: iteration 124990/ 173500 | consumed samples: 31997440 | consumed tokens: 65530757120 | elapsed time per iteration (s): 0.81 | learning rate: 5.316E-05 | global batch size: 256 | lm loss: 1.933546E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.259 | TFLOPs: 19.07 |
31: iteration 125000/ 173500 | consumed samples: 32000000 | consumed tokens: 65536000000 | elapsed time per iteration (s): 0.78 | learning rate: 5.315E-05 | global batch size: 256 | lm loss: 1.923236E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.728 | TFLOPs: 19.83 |
31: --------------------------------------------------------------------------------------------
31: valid loss at iteration 125000 | lm loss value: 1.899797E+00 | lm loss PPL: 6.684539E+00 |
31: --------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 125000 to checkpoints_1b1long
0: [2022-11-26 22:15:59,120] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step125000 is begin to save!
0: [2022-11-26 22:15:59,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_01-model_00-model_states.pt...
0: [2022-11-26 22:15:59,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_01-model_00-model_states.pt.
0: [2022-11-26 22:15:59,357] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_03-model_00-model_states.pt...
0: [2022-11-26 22:15:59,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_03-model_00-model_states.pt.
0: [2022-11-26 22:15:59,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_04-model_00-model_states.pt...
0: [2022-11-26 22:15:59,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_04-model_00-model_states.pt.
0: [2022-11-26 22:15:59,519] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_05-model_00-model_states.pt...
0: [2022-11-26 22:15:59,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_05-model_00-model_states.pt.
0: [2022-11-26 22:15:59,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_06-model_00-model_states.pt... 0: [2022-11-26 22:15:59,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_06-model_00-model_states.pt. 0: [2022-11-26 22:15:59,673] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_07-model_00-model_states.pt... 0: [2022-11-26 22:15:59,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_07-model_00-model_states.pt. 0: [2022-11-26 22:15:59,750] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_08-model_00-model_states.pt... 0: [2022-11-26 22:15:59,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_08-model_00-model_states.pt. 0: [2022-11-26 22:15:59,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_09-model_00-model_states.pt... 0: [2022-11-26 22:15:59,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_09-model_00-model_states.pt. 0: [2022-11-26 22:15:59,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_10-model_00-model_states.pt... 0: [2022-11-26 22:15:59,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_10-model_00-model_states.pt. 0: [2022-11-26 22:15:59,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_11-model_00-model_states.pt... 0: [2022-11-26 22:16:00,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_11-model_00-model_states.pt. 
0: [2022-11-26 22:16:00,055] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_12-model_00-model_states.pt... 0: [2022-11-26 22:16:00,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_12-model_00-model_states.pt. 0: [2022-11-26 22:16:00,148] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_13-model_00-model_states.pt... 0: [2022-11-26 22:16:00,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_13-model_00-model_states.pt. 0: [2022-11-26 22:16:00,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_14-model_00-model_states.pt... 0: [2022-11-26 22:16:00,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_14-model_00-model_states.pt. 0: [2022-11-26 22:16:00,305] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_15-model_00-model_states.pt... 0: [2022-11-26 22:16:00,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_15-model_00-model_states.pt. 0: [2022-11-26 22:16:00,380] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_16-model_00-model_states.pt... 0: [2022-11-26 22:16:00,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_16-model_00-model_states.pt. 0: [2022-11-26 22:16:00,454] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_17-model_00-model_states.pt... 0: [2022-11-26 22:16:00,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_17-model_00-model_states.pt. 
0: [2022-11-26 22:16:00,527] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_18-model_00-model_states.pt... 0: [2022-11-26 22:16:00,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_18-model_00-model_states.pt. 0: [2022-11-26 22:16:00,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_19-model_00-model_states.pt... 0: [2022-11-26 22:16:00,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_19-model_00-model_states.pt. 0: [2022-11-26 22:16:00,673] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_20-model_00-model_states.pt... 0: [2022-11-26 22:16:00,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_20-model_00-model_states.pt. 0: [2022-11-26 22:16:00,749] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_21-model_00-model_states.pt... 0: [2022-11-26 22:16:00,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_21-model_00-model_states.pt. 0: [2022-11-26 22:16:00,824] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_22-model_00-model_states.pt... 0: [2022-11-26 22:16:00,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_22-model_00-model_states.pt. 0: [2022-11-26 22:16:00,899] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_23-model_00-model_states.pt... 0: [2022-11-26 22:16:00,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_23-model_00-model_states.pt. 
0: [2022-11-26 22:16:00,972] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_24-model_00-model_states.pt... 0: [2022-11-26 22:16:01,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_24-model_00-model_states.pt. 0: [2022-11-26 22:16:01,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_25-model_00-model_states.pt... 0: [2022-11-26 22:16:01,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_25-model_00-model_states.pt. 0: [2022-11-26 22:16:01,121] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_26-model_00-model_states.pt... 0: [2022-11-26 22:16:01,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_26-model_00-model_states.pt. 0: [2022-11-26 22:16:01,199] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_27-model_00-model_states.pt... 0: [2022-11-26 22:16:01,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_27-model_00-model_states.pt. 0: [2022-11-26 22:16:01,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_28-model_00-model_states.pt... 0: [2022-11-26 22:16:01,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_28-model_00-model_states.pt. 0: [2022-11-26 22:16:01,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/layer_30-model_00-model_states.pt... 0: [2022-11-26 22:16:01,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/layer_30-model_00-model_states.pt. 
0: [2022-11-26 22:16:01,351] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step125000/mp_rank_00_model_states.pt 0: [2022-11-26 22:16:01,351] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/mp_rank_00_model_states.pt... 0: [2022-11-26 22:16:01,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/mp_rank_00_model_states.pt. 0: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 
7: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 
9: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 
16: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
3: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 
25: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 
24: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 
29: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 
26: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 
5: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
10: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 
3: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 
23: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
14: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 
30: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 
18: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 
19: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 
8: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
12: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 
24: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 
21: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 
8: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 
31: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
8: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:16:01,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:16:01,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:16:01,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:16:01,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 22:16:01,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 0: [2022-11-26 22:16:01,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 22:16:01,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 22:16:01,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 18: [2022-11-26 22:16:01,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:16:01,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 22:16:01,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 18: [2022-11-26 22:16:01,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:16:01,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:16:01,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 22:16:01,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 22:16:01,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 18: [2022-11-26 22:16:01,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 0: [2022-11-26 22:16:01,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-26 22:16:01,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 22:16:01,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 17: [2022-11-26 22:16:01,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:16:01,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 22:16:01,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 23: [2022-11-26 22:16:01,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:16:01,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 22:16:01,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 27: [2022-11-26 22:16:01,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:16:01,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 22:16:01,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 9: [2022-11-26 22:16:01,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-26 22:16:01,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 22:16:01,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 20: [2022-11-26 22:16:01,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:16:01,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:16:01,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 20: [2022-11-26 22:16:01,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 7: [2022-11-26 22:16:01,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 20: [2022-11-26 22:16:01,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 1: [2022-11-26 22:16:01,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:16:01,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:16:01,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 22:16:01,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
2: [2022-11-26 22:16:01,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 22:16:01,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 2: [2022-11-26 22:16:01,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:16:01,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:16:01,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 18: [2022-11-26 22:16:01,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 2: [2022-11-26 22:16:01,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 18: [2022-11-26 22:16:01,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 4: [2022-11-26 22:16:01,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:16:01,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 22:16:01,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 30: [2022-11-26 22:16:01,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
11: [2022-11-26 22:16:01,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 30: [2022-11-26 22:16:01,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 22:16:01,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 23: [2022-11-26 22:16:01,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:16:01,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 22:16:01,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 28: [2022-11-26 22:16:01,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:16:01,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 22:16:01,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 20: [2022-11-26 22:16:01,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:16:01,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 22:16:01,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
25: [2022-11-26 22:16:01,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:16:01,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:16:01,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 12: [2022-11-26 22:16:01,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 25: [2022-11-26 22:16:01,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 12: [2022-11-26 22:16:01,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 22: [2022-11-26 22:16:01,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:16:01,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:16:01,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 22:16:01,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 17: [2022-11-26 22:16:01,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:16:01,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
17: [2022-11-26 22:16:01,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 22:16:01,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 0: [2022-11-26 22:16:01,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 20: [2022-11-26 22:16:01,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:16:01,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 20: [2022-11-26 22:16:01,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 22:16:01,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 23: [2022-11-26 22:16:01,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:16:01,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 22:16:01,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 22: [2022-11-26 22:16:01,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 22:16:01,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
25: [2022-11-26 22:16:01,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:16:01,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:16:01,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 22: [2022-11-26 22:16:01,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 25: [2022-11-26 22:16:01,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 22: [2022-11-26 22:16:01,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 16: [2022-11-26 22:16:01,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:16:01,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 22:16:01,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 28: [2022-11-26 22:16:01,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 22:16:01,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 7: [2022-11-26 22:16:01,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
2: [2022-11-26 22:16:01,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:16:01,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 2: [2022-11-26 22:16:01,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 7: [2022-11-26 22:16:01,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 2: [2022-11-26 22:16:01,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 31: [2022-11-26 22:16:01,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:16:01,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 22:16:01,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 7: [2022-11-26 22:16:01,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:16:01,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 22:16:01,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 2: [2022-11-26 22:16:01,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 22:16:01,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 15: [2022-11-26 22:16:01,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:16:01,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:16:01,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 22:16:01,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 15: [2022-11-26 22:16:01,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 22:16:01,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 2: [2022-11-26 22:16:01,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 19: [2022-11-26 22:16:01,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:16:01,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 22:16:01,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 4: [2022-11-26 22:16:01,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 22:16:01,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 22:16:01,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 28: [2022-11-26 22:16:01,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:16:01,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:16:01,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:16:01,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 22:16:01,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 22:16:01,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 26: [2022-11-26 22:16:01,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 23: [2022-11-26 22:16:01,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-26 22:16:01,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 21: [2022-11-26 22:16:01,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:16:01,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 21: [2022-11-26 22:16:01,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 22:16:01,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 21: [2022-11-26 22:16:01,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:16:01,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 22:16:01,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 31: [2022-11-26 22:16:01,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:16:01,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 22:16:01,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 5: [2022-11-26 22:16:01,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-26 22:16:01,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 22:16:01,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 28: [2022-11-26 22:16:01,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 22:16:01,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 20: [2022-11-26 22:16:01,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:16:01,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 30: [2022-11-26 22:16:01,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:16:01,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 30: [2022-11-26 22:16:01,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 22:16:01,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 24: [2022-11-26 22:16:01,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
24: [2022-11-26 22:16:01,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 22:16:01,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 31: [2022-11-26 22:16:01,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:16:01,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 22: [2022-11-26 22:16:01,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:16:01,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 20: [2022-11-26 22:16:01,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:16:01,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:16:01,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 22:16:01,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
20: [2022-11-26 22:16:01,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 22:16:01,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 10: [2022-11-26 22:16:01,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:16:01,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:16:01,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:16:01,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 20: [2022-11-26 22:16:01,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 10: [2022-11-26 22:16:01,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 22:16:01,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 22:16:01,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 22:16:01,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
23: [2022-11-26 22:16:01,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:16:01,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 10: [2022-11-26 22:16:01,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 23: [2022-11-26 22:16:01,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 10: [2022-11-26 22:16:01,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 23: [2022-11-26 22:16:01,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 23: [2022-11-26 22:16:01,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:16:01,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 23: [2022-11-26 22:16:01,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 10: [2022-11-26 22:16:01,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 23: [2022-11-26 22:16:01,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 10: [2022-11-26 22:16:01,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-26 22:16:01,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:16:01,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 22:16:01,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 22:16:01,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 10: [2022-11-26 22:16:01,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 7: [2022-11-26 22:16:01,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:16:01,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 22:16:01,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 16: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:16:01,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 25: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 
25: [2022-11-26 22:16:01,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 17: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:16:01,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 22:16:01,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 17: [2022-11-26 22:16:01,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 27: [2022-11-26 22:16:01,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 25: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
25: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 17: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 18: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 17: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:16:01,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 27: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:16:01,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 17: [2022-11-26 22:16:01,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 18: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 27: [2022-11-26 22:16:01,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 17: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
27: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 0: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 17: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:16:01,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 15: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:16:01,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 17: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 15: [2022-11-26 22:16:01,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 15: [2022-11-26 22:16:01,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 22:16:01,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 2: [2022-11-26 22:16:01,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-26 22:16:01,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 22:16:01,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 30: [2022-11-26 22:16:01,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 22:16:01,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 22:16:01,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 14: [2022-11-26 22:16:01,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:16:01,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:16:01,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 22:16:01,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 22:16:01,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 14: [2022-11-26 22:16:01,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 30: [2022-11-26 22:16:01,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-26 22:16:01,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 22:16:01,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 16: [2022-11-26 22:16:01,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:16:01,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 22:16:01,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 7: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:16:01,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 22:16:01,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 7: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 4: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
31: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:16:01,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 31: [2022-11-26 22:16:01,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 4: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 31: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 11: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:16:01,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 12: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:16:01,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 26: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 
26: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:16:01,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 22:16:01,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 4: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:16:01,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 3: [2022-11-26 22:16:01,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 22:16:01,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 22:16:01,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 26: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 26: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 4: [2022-11-26 22:16:01,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 3: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 26: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 3: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 3: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
26: [2022-11-26 22:16:01,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 22:16:01,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 4: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 3: [2022-11-26 22:16:01,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 26: [2022-11-26 22:16:01,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 26: [2022-11-26 22:16:01,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 9: [2022-11-26 22:16:01,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:16:01,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:16:01,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 12: [2022-11-26 22:16:01,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
9: [2022-11-26 22:16:01,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 22:16:01,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 8: [2022-11-26 22:16:01,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 12: [2022-11-26 22:16:01,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 9: [2022-11-26 22:16:01,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 9: [2022-11-26 22:16:01,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 8: [2022-11-26 22:16:01,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 12: [2022-11-26 22:16:01,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 1: [2022-11-26 22:16:01,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:16:01,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
1: [2022-11-26 22:16:01,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 24: [2022-11-26 22:16:01,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 1: [2022-11-26 22:16:01,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 12: [2022-11-26 22:16:01,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:16:01,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 12: [2022-11-26 22:16:01,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 22:16:01,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 22: [2022-11-26 22:16:01,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:16:01,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 22:16:01,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 22: [2022-11-26 22:16:01,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:16:01,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
22: [2022-11-26 22:16:01,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 22:16:01,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 1: [2022-11-26 22:16:01,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:16:01,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 22: [2022-11-26 22:16:01,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 1: [2022-11-26 22:16:01,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:16:01,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 22:16:01,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 22:16:01,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 1: [2022-11-26 22:16:01,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 12: [2022-11-26 22:16:01,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 22:16:01,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 22:16:01,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 12: [2022-11-26 22:16:01,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:16:01,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 22:16:01,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 28: [2022-11-26 22:16:01,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 22:16:01,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:16:01,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:16:01,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 22:16:01,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 22:16:01,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 22:16:01,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 1: [2022-11-26 22:16:01,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 15: [2022-11-26 22:16:01,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:16:01,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:16:01,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 22:16:01,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 22:16:01,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 15: [2022-11-26 22:16:01,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 8: [2022-11-26 22:16:01,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:16:01,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 22:16:01,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 22:16:01,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 22:16:01,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 8: [2022-11-26 22:16:01,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 14: [2022-11-26 22:16:01,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:16:01,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:16:01,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 31: [2022-11-26 22:16:01,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 14: [2022-11-26 22:16:01,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 31: [2022-11-26 22:16:01,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 5: [2022-11-26 22:16:01,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 22:16:01,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 14: [2022-11-26 22:16:01,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:16:01,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 5: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:16:01,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 5: [2022-11-26 22:16:01,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 14: [2022-11-26 22:16:01,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 5: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 16: [2022-11-26 22:16:01,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
16: [2022-11-26 22:16:01,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 5: [2022-11-26 22:16:01,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 16: [2022-11-26 22:16:01,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 5: [2022-11-26 22:16:01,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 5: [2022-11-26 22:16:01,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:16:01,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:16:01,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 22:16:01,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 22:16:01,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 5: [2022-11-26 22:16:01,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 9: [2022-11-26 22:16:01,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:16:01,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 
9: [2022-11-26 22:16:01,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 8: [2022-11-26 22:16:01,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:16:01,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 9: [2022-11-26 22:16:01,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 8: [2022-11-26 22:16:01,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 16: [2022-11-26 22:16:01,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 8: [2022-11-26 22:16:01,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:16:01,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:16:01,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 8: [2022-11-26 22:16:01,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 22:16:01,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 22:16:01,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
8: [2022-11-26 22:16:01,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 9: [2022-11-26 22:16:01,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:16:01,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:16:01,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:16:01,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 11: [2022-11-26 22:16:01,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 30: [2022-11-26 22:16:01,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:16:01,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 9: [2022-11-26 22:16:01,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:16:01,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
30: [2022-11-26 22:16:01,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 27: [2022-11-26 22:16:01,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:16:01,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 9: [2022-11-26 22:16:01,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 27: [2022-11-26 22:16:01,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 9: [2022-11-26 22:16:01,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 30: [2022-11-26 22:16:01,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 27: [2022-11-26 22:16:01,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 9: [2022-11-26 22:16:01,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 27: [2022-11-26 22:16:01,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:16:01,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 30: [2022-11-26 22:16:01,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 
27: [2022-11-26 22:16:01,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 30: [2022-11-26 22:16:01,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 22:16:01,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 24: [2022-11-26 22:16:01,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:16:01,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:16:01,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 29: [2022-11-26 22:16:01,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:16:01,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:16:01,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:16:01,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:16:01,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
11: [2022-11-26 22:16:01,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 22:16:01,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:16:01,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 29: [2022-11-26 22:16:01,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:16:01,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 22:16:01,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 11: [2022-11-26 22:16:01,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
29: [2022-11-26 22:16:01,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 22:16:01,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 22:16:01,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 22:16:01,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 11: [2022-11-26 22:16:01,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 29: [2022-11-26 22:16:01,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 29: [2022-11-26 22:16:01,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 29: [2022-11-26 22:16:01,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 29: [2022-11-26 22:16:01,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 29: [2022-11-26 22:16:01,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 11: [2022-11-26 22:16:01,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 29: [2022-11-26 22:16:01,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
11: [2022-11-26 22:16:01,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:16:01,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 22:16:01,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 24: [2022-11-26 22:16:01,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:16:01,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 22:16:01,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 24: [2022-11-26 22:16:01,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:16:01,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 22:16:01,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 24: [2022-11-26 22:16:01,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:16:01,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 22:16:01,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
6: [2022-11-26 22:16:01,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:16:01,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:16:01,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:16:01,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:16:01,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:16:01,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-26 22:16:01,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 22:16:01,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 22:16:01,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 22:16:01,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 22:16:01,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 22:16:01,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 22:16:01,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 6: [2022-11-26 22:16:01,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 6: [2022-11-26 22:16:01,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 6: [2022-11-26 22:16:01,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 6: [2022-11-26 22:16:01,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 6: [2022-11-26 22:16:01,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
19: [2022-11-26 22:16:01,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:16:01,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:16:01,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:16:01,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:16:01,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 22:16:01,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 22:16:01,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 22:16:01,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 22:16:01,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 19: [2022-11-26 22:16:01,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 19: [2022-11-26 22:16:01,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 19: [2022-11-26 22:16:01,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
31: [2022-11-26 22:16:01,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:16:01,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 22:16:01,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 28: [2022-11-26 22:16:01,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 22:16:01,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 22:16:01,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 21: [2022-11-26 22:16:01,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 28: [2022-11-26 22:16:01,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 28: [2022-11-26 22:16:01,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:16:01,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 28: [2022-11-26 22:16:01,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 21: [2022-11-26 22:16:01,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
28: [2022-11-26 22:16:01,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 28: [2022-11-26 22:16:01,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 22:16:01,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 22:16:01,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 19: [2022-11-26 22:16:01,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:16:01,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 22:16:01,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 16: [2022-11-26 22:16:01,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:16:01,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 22:16:01,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 17: [2022-11-26 22:16:01,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 
17: [2022-11-26 22:16:01,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 22:16:01,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 25: [2022-11-26 22:16:01,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:16:01,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 22:16:01,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 4: [2022-11-26 22:16:01,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:16:01,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 22:16:01,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 21: [2022-11-26 22:16:01,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:16:01,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:16:01,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-26 22:16:01,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 22:16:01,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 22:16:01,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 22:16:01,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 21: [2022-11-26 22:16:01,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 21: [2022-11-26 22:16:01,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 14: [2022-11-26 22:16:01,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:16:01,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 22:16:01,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 13: [2022-11-26 22:16:01,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:16:01,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:16:01,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 22:16:01,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:16:01,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:16:01,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 22:16:01,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 22:16:01,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 22:16:01,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 22:16:01,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 22:16:01,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 13: [2022-11-26 22:16:01,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 13: [2022-11-26 22:16:01,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 13: [2022-11-26 22:16:01,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 13: [2022-11-26 22:16:01,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
13: [2022-11-26 22:16:01,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:16:01,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 22:16:01,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 0: [2022-11-26 22:16:01,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:16:01,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 22:16:01,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 3: [2022-11-26 22:16:01,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:16:01,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 22:16:01,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 2: [2022-11-26 22:16:01,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:16:01,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 22:16:01,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
24: [2022-11-26 22:16:01,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:16:01,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 22:16:01,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 1: [2022-11-26 22:16:01,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:16:01,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 22:16:01,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 23: [2022-11-26 22:16:01,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:16:01,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 22:16:01,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 6: [2022-11-26 22:16:01,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:16:01,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 22:16:01,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
26: [2022-11-26 22:16:01,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:16:01,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 22:16:01,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 15: [2022-11-26 22:16:01,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:16:01,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 22:16:01,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 31: [2022-11-26 22:16:01,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:16:01,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 22:16:01,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 29: [2022-11-26 22:16:01,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:16:01,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 22:16:01,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
28: [2022-11-26 22:16:01,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:16:01,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 28: [2022-11-26 22:16:01,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 22:16:01,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 11: [2022-11-26 22:16:01,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 22:16:01,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 22: [2022-11-26 22:16:01,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:16:01,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 22:16:01,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 8: [2022-11-26 22:16:01,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:16:01,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 22:16:01,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
12: [2022-11-26 22:16:01,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:16:01,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 22:16:01,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 10: [2022-11-26 22:16:01,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:16:01,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 22:16:01,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 27: [2022-11-26 22:16:01,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:16:01,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:16:01,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 27: [2022-11-26 22:16:01,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 9: [2022-11-26 22:16:01,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 27: [2022-11-26 22:16:01,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
19: [2022-11-26 22:16:01,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:16:01,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 22:16:01,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 18: [2022-11-26 22:16:01,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:16:01,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 25: [2022-11-26 22:16:01,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:16:01,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 25: [2022-11-26 22:16:01,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 22:16:01,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 5: [2022-11-26 22:16:01,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:16:01,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 22:16:01,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
21: [2022-11-26 22:16:01,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:16:01,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 22:16:01,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 20: [2022-11-26 22:16:01,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:16:01,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 22:16:01,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 2: [2022-11-26 22:16:01,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:16:01,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 22:16:01,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 0: [2022-11-26 22:16:01,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:16:01,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 22:16:01,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
30: [2022-11-26 22:16:01,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:16:01,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:16:01,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 30: [2022-11-26 22:16:01,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 18: [2022-11-26 22:16:01,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 30: [2022-11-26 22:16:01,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 4: [2022-11-26 22:16:01,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:16:01,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 22:16:01,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 26: [2022-11-26 22:16:01,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:16:01,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 22:16:01,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
27: [2022-11-26 22:16:01,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:16:01,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:16:01,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 9: [2022-11-26 22:16:01,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 27: [2022-11-26 22:16:01,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 19: [2022-11-26 22:16:01,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:16:01,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 19: [2022-11-26 22:16:01,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 22:16:01,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 16: [2022-11-26 22:16:01,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:16:01,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 22:16:01,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
31: [2022-11-26 22:16:01,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:16:01,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:16:01,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 22:16:01,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 22: [2022-11-26 22:16:01,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:16:01,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 22:16:01,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 22: [2022-11-26 22:16:01,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 22:16:01,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 7: [2022-11-26 22:16:01,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:16:01,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 22:16:01,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
13: [2022-11-26 22:16:01,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:16:01,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 22:16:01,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 6: [2022-11-26 22:16:01,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:16:01,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 23: [2022-11-26 22:16:01,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:16:01,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 23: [2022-11-26 22:16:01,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 22:16:01,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 8: [2022-11-26 22:16:01,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:16:01,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 22:16:01,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
28: [2022-11-26 22:16:01,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:16:01,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:16:01,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 22:16:01,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 10: [2022-11-26 22:16:01,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:16:01,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:16:01,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 20: [2022-11-26 22:16:01,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 10: [2022-11-26 22:16:01,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 20: [2022-11-26 22:16:01,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 2: [2022-11-26 22:16:01,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 22:16:01,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 22:16:01,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 0: [2022-11-26 22:16:01,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:16:01,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:16:01,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:16:01,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 1: [2022-11-26 22:16:01,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 21: [2022-11-26 22:16:01,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 0: [2022-11-26 22:16:01,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 1: [2022-11-26 22:16:01,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 21: [2022-11-26 22:16:01,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 15: [2022-11-26 22:16:01,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 22:16:01,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 3: [2022-11-26 22:16:01,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:16:01,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 3: [2022-11-26 22:16:01,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 22:16:01,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 28: [2022-11-26 22:16:01,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 22:16:01,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 4: [2022-11-26 22:16:01,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:16:01,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:16:01,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 5: [2022-11-26 22:16:01,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 4: [2022-11-26 22:16:01,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
5: [2022-11-26 22:16:01,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 14: [2022-11-26 22:16:01,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:16:01,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 22:16:01,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 16: [2022-11-26 22:16:01,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:16:01,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:16:01,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 22:16:01,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 14: [2022-11-26 22:16:01,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 22:16:01,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 30: [2022-11-26 22:16:01,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:16:01,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
30: [2022-11-26 22:16:01,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 22:16:01,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 7: [2022-11-26 22:16:01,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 22:16:01,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 3: [2022-11-26 22:16:01,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:16:01,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 22:16:01,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 12: [2022-11-26 22:16:01,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:16:01,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 22:16:01,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 24: [2022-11-26 22:16:01,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-26 22:16:01,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 22:16:01,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 14: [2022-11-26 22:16:01,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:16:01,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 22:16:01,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 17: [2022-11-26 22:16:01,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:16:01,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 22:16:01,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:16:01,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 17: [2022-11-26 22:16:01,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 22:16:01,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 11: [2022-11-26 22:16:01,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 22:16:01,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 22:16:01,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 13: [2022-11-26 22:16:01,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:16:01,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 22:16:01,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 3: [2022-11-26 22:16:01,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:16:01,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step125000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 22:16:01,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
0: successfully saved checkpoint at iteration 125000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2560.47 31: iteration 125010/ 173500 | consumed samples: 32002560 | consumed tokens: 65541242880 | elapsed time per iteration (s): 1.07 | learning rate: 5.314E-05 | global batch size: 256 | lm loss: 1.938357E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.843 | TFLOPs: 14.45 | 31: iteration 125020/ 173500 | consumed samples: 32005120 | consumed tokens: 65546485760 | elapsed time per iteration (s): 0.81 | learning rate: 5.312E-05 | global batch size: 256 | lm loss: 1.979143E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.354 | TFLOPs: 19.14 | 31: iteration 125030/ 173500 | consumed samples: 32007680 | consumed tokens: 65551728640 | elapsed time per iteration (s): 0.80 | learning rate: 5.311E-05 | global batch size: 256 | lm loss: 1.943239E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.409 | TFLOPs: 19.26 | 31: iteration 125040/ 173500 | consumed samples: 32010240 | consumed tokens: 65556971520 | elapsed time per iteration (s): 0.81 | learning rate: 5.310E-05 | global batch size: 256 | lm loss: 1.929119E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.240 | TFLOPs: 19.07 | 31: iteration 125050/ 173500 | consumed samples: 32012800 | consumed tokens: 65562214400 | elapsed time per iteration (s): 0.84 | learning rate: 5.308E-05 | global batch size: 256 | lm loss: 1.916015E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.592 | TFLOPs: 18.49 | 31: iteration 125060/ 173500 | consumed samples: 32015360 | consumed tokens: 65567457280 | elapsed time per iteration (s): 
0.82 | learning rate: 5.307E-05 | global batch size: 256 | lm loss: 1.939470E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.906 | TFLOPs: 18.81 | 31: iteration 125070/ 173500 | consumed samples: 32017920 | consumed tokens: 65572700160 | elapsed time per iteration (s): 0.80 | learning rate: 5.306E-05 | global batch size: 256 | lm loss: 1.897424E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.381 | TFLOPs: 19.38 | 31: iteration 125080/ 173500 | consumed samples: 32020480 | consumed tokens: 65577943040 | elapsed time per iteration (s): 0.81 | learning rate: 5.305E-05 | global batch size: 256 | lm loss: 1.951799E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.293 | TFLOPs: 19.13 | 31: iteration 125090/ 173500 | consumed samples: 32023040 | consumed tokens: 65583185920 | elapsed time per iteration (s): 0.81 | learning rate: 5.303E-05 | global batch size: 256 | lm loss: 1.914055E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.300 | TFLOPs: 19.14 | 31: iteration 125100/ 173500 | consumed samples: 32025600 | consumed tokens: 65588428800 | elapsed time per iteration (s): 0.79 | learning rate: 5.302E-05 | global batch size: 256 | lm loss: 1.965618E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.306 | TFLOPs: 19.56 | 31: iteration 125110/ 173500 | consumed samples: 32028160 | consumed tokens: 65593671680 | elapsed time per iteration (s): 0.89 | learning rate: 5.301E-05 | global batch size: 256 | lm loss: 1.930550E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.510 | TFLOPs: 17.45 | 31: 
iteration 125120/ 173500 | consumed samples: 32030720 | consumed tokens: 65598914560 | elapsed time per iteration (s): 0.78 | learning rate: 5.300E-05 | global batch size: 256 | lm loss: 1.961698E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.574 | TFLOPs: 19.82 | 31: iteration 125130/ 173500 | consumed samples: 32033280 | consumed tokens: 65604157440 | elapsed time per iteration (s): 0.73 | learning rate: 5.298E-05 | global batch size: 256 | lm loss: 1.951880E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.620 | TFLOPs: 21.21 | 31: iteration 125140/ 173500 | consumed samples: 32035840 | consumed tokens: 65609400320 | elapsed time per iteration (s): 0.82 | learning rate: 5.297E-05 | global batch size: 256 | lm loss: 1.946874E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.621 | TFLOPs: 18.97 | 31: iteration 125150/ 173500 | consumed samples: 32038400 | consumed tokens: 65614643200 | elapsed time per iteration (s): 0.74 | learning rate: 5.296E-05 | global batch size: 256 | lm loss: 1.938067E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.638 | TFLOPs: 20.91 | 31: iteration 125160/ 173500 | consumed samples: 32040960 | consumed tokens: 65619886080 | elapsed time per iteration (s): 0.79 | learning rate: 5.294E-05 | global batch size: 256 | lm loss: 1.931517E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.338 | TFLOPs: 19.62 | 31: iteration 125170/ 173500 | consumed samples: 32043520 | consumed tokens: 65625128960 | elapsed time per iteration (s): 0.75 | learning rate: 5.293E-05 | global batch size: 256 | lm loss: 1.923722E+00 | grad norm: 0.173 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.446 | TFLOPs: 20.66 | 31: iteration 125180/ 173500 | consumed samples: 32046080 | consumed tokens: 65630371840 | elapsed time per iteration (s): 0.83 | learning rate: 5.292E-05 | global batch size: 256 | lm loss: 1.954228E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.047 | TFLOPs: 18.58 | 31: iteration 125190/ 173500 | consumed samples: 32048640 | consumed tokens: 65635614720 | elapsed time per iteration (s): 0.77 | learning rate: 5.291E-05 | global batch size: 256 | lm loss: 1.964719E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.087 | TFLOPs: 20.03 | 31: iteration 125200/ 173500 | consumed samples: 32051200 | consumed tokens: 65640857600 | elapsed time per iteration (s): 0.77 | learning rate: 5.289E-05 | global batch size: 256 | lm loss: 1.947907E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.931 | TFLOPs: 20.02 | 31: iteration 125210/ 173500 | consumed samples: 32053760 | consumed tokens: 65646100480 | elapsed time per iteration (s): 0.76 | learning rate: 5.288E-05 | global batch size: 256 | lm loss: 1.933258E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.987 | TFLOPs: 20.27 | 31: iteration 125220/ 173500 | consumed samples: 32056320 | consumed tokens: 65651343360 | elapsed time per iteration (s): 0.78 | learning rate: 5.287E-05 | global batch size: 256 | lm loss: 1.919090E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.359 | TFLOPs: 19.86 | 31: iteration 125230/ 173500 | consumed samples: 32058880 | consumed tokens: 65656586240 | elapsed time per iteration (s): 0.75 | 
learning rate: 5.286E-05 | global batch size: 256 | lm loss: 1.946003E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.931 | TFLOPs: 20.56 | 31: iteration 125240/ 173500 | consumed samples: 32061440 | consumed tokens: 65661829120 | elapsed time per iteration (s): 0.76 | learning rate: 5.284E-05 | global batch size: 256 | lm loss: 1.948104E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.622 | TFLOPs: 20.30 | 31: iteration 125250/ 173500 | consumed samples: 32064000 | consumed tokens: 65667072000 | elapsed time per iteration (s): 0.75 | learning rate: 5.283E-05 | global batch size: 256 | lm loss: 1.945721E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.658 | TFLOPs: 20.61 | 31: iteration 125260/ 173500 | consumed samples: 32066560 | consumed tokens: 65672314880 | elapsed time per iteration (s): 0.79 | learning rate: 5.282E-05 | global batch size: 256 | lm loss: 1.909660E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.931 | TFLOPs: 19.66 | 31: iteration 125270/ 173500 | consumed samples: 32069120 | consumed tokens: 65677557760 | elapsed time per iteration (s): 0.78 | learning rate: 5.280E-05 | global batch size: 256 | lm loss: 1.939206E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.620 | TFLOPs: 19.76 | 31: iteration 125280/ 173500 | consumed samples: 32071680 | consumed tokens: 65682800640 | elapsed time per iteration (s): 0.79 | learning rate: 5.279E-05 | global batch size: 256 | lm loss: 1.930244E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.059 | TFLOPs: 19.60 | 31: iteration 
125290/ 173500 | consumed samples: 32074240 | consumed tokens: 65688043520 | elapsed time per iteration (s): 0.71 | learning rate: 5.278E-05 | global batch size: 256 | lm loss: 1.952263E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 359.464 | TFLOPs: 21.75 | 31: iteration 125300/ 173500 | consumed samples: 32076800 | consumed tokens: 65693286400 | elapsed time per iteration (s): 0.78 | learning rate: 5.277E-05 | global batch size: 256 | lm loss: 1.967073E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.378 | TFLOPs: 19.81 | 31: iteration 125310/ 173500 | consumed samples: 32079360 | consumed tokens: 65698529280 | elapsed time per iteration (s): 0.82 | learning rate: 5.275E-05 | global batch size: 256 | lm loss: 1.972825E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.064 | TFLOPs: 18.88 | 31: iteration 125320/ 173500 | consumed samples: 32081920 | consumed tokens: 65703772160 | elapsed time per iteration (s): 0.80 | learning rate: 5.274E-05 | global batch size: 256 | lm loss: 1.962280E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.925 | TFLOPs: 19.35 | 31: iteration 125330/ 173500 | consumed samples: 32084480 | consumed tokens: 65709015040 | elapsed time per iteration (s): 0.74 | learning rate: 5.273E-05 | global batch size: 256 | lm loss: 1.931462E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.314 | TFLOPs: 20.95 | 31: iteration 125340/ 173500 | consumed samples: 32087040 | consumed tokens: 65714257920 | elapsed time per iteration (s): 0.79 | learning rate: 5.272E-05 | global batch size: 256 | lm loss: 1.938187E+00 | grad norm: 0.168 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.055 | TFLOPs: 19.60 | 31: iteration 125350/ 173500 | consumed samples: 32089600 | consumed tokens: 65719500800 | elapsed time per iteration (s): 0.81 | learning rate: 5.270E-05 | global batch size: 256 | lm loss: 1.947128E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.981 | TFLOPs: 19.06 | 31: iteration 125360/ 173500 | consumed samples: 32092160 | consumed tokens: 65724743680 | elapsed time per iteration (s): 0.81 | learning rate: 5.269E-05 | global batch size: 256 | lm loss: 1.943409E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.485 | TFLOPs: 19.03 | 31: iteration 125370/ 173500 | consumed samples: 32094720 | consumed tokens: 65729986560 | elapsed time per iteration (s): 0.79 | learning rate: 5.268E-05 | global batch size: 256 | lm loss: 1.942575E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.968 | TFLOPs: 19.54 | 31: iteration 125380/ 173500 | consumed samples: 32097280 | consumed tokens: 65735229440 | elapsed time per iteration (s): 0.87 | learning rate: 5.267E-05 | global batch size: 256 | lm loss: 1.964192E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.026 | TFLOPs: 17.79 | 31: iteration 125390/ 173500 | consumed samples: 32099840 | consumed tokens: 65740472320 | elapsed time per iteration (s): 0.83 | learning rate: 5.265E-05 | global batch size: 256 | lm loss: 1.933621E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.712 | TFLOPs: 18.74 | 31: iteration 125400/ 173500 | consumed samples: 32102400 | consumed tokens: 65745715200 | elapsed time per iteration (s): 0.78 | learning 
rate: 5.264E-05 | global batch size: 256 | lm loss: 1.941136E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.384 | TFLOPs: 19.81 | 31: iteration 125410/ 173500 | consumed samples: 32104960 | consumed tokens: 65750958080 | elapsed time per iteration (s): 0.74 | learning rate: 5.263E-05 | global batch size: 256 | lm loss: 1.908792E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.251 | TFLOPs: 21.01 | 31: iteration 125420/ 173500 | consumed samples: 32107520 | consumed tokens: 65756200960 | elapsed time per iteration (s): 0.75 | learning rate: 5.261E-05 | global batch size: 256 | lm loss: 1.940335E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.183 | TFLOPs: 20.58 | 31: iteration 125430/ 173500 | consumed samples: 32110080 | consumed tokens: 65761443840 | elapsed time per iteration (s): 0.76 | learning rate: 5.260E-05 | global batch size: 256 | lm loss: 1.960736E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.279 | TFLOPs: 20.46 | 31: iteration 125440/ 173500 | consumed samples: 32112640 | consumed tokens: 65766686720 | elapsed time per iteration (s): 0.73 | learning rate: 5.259E-05 | global batch size: 256 | lm loss: 1.947748E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.661 | TFLOPs: 21.15 | 31: iteration 125450/ 173500 | consumed samples: 32115200 | consumed tokens: 65771929600 | elapsed time per iteration (s): 0.78 | learning rate: 5.258E-05 | global batch size: 256 | lm loss: 1.962821E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.630 | TFLOPs: 19.88 | 31: iteration 125460/ 
173500 | consumed samples: 32117760 | consumed tokens: 65777172480 | elapsed time per iteration (s): 0.83 | learning rate: 5.256E-05 | global batch size: 256 | lm loss: 1.938582E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.605 | TFLOPs: 18.55 | 31: iteration 125470/ 173500 | consumed samples: 32120320 | consumed tokens: 65782415360 | elapsed time per iteration (s): 0.80 | learning rate: 5.255E-05 | global batch size: 256 | lm loss: 1.946833E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.846 | TFLOPs: 19.41 | 31: iteration 125480/ 173500 | consumed samples: 32122880 | consumed tokens: 65787658240 | elapsed time per iteration (s): 0.77 | learning rate: 5.254E-05 | global batch size: 256 | lm loss: 1.944165E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.713 | TFLOPs: 20.19 | 31: iteration 125490/ 173500 | consumed samples: 32125440 | consumed tokens: 65792901120 | elapsed time per iteration (s): 0.74 | learning rate: 5.253E-05 | global batch size: 256 | lm loss: 1.965959E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.703 | TFLOPs: 20.79 | 31: iteration 125500/ 173500 | consumed samples: 32128000 | consumed tokens: 65798144000 | elapsed time per iteration (s): 0.75 | learning rate: 5.251E-05 | global batch size: 256 | lm loss: 1.943234E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.344 | TFLOPs: 20.53 | 31: iteration 125510/ 173500 | consumed samples: 32130560 | consumed tokens: 65803386880 | elapsed time per iteration (s): 0.74 | learning rate: 5.250E-05 | global batch size: 256 | lm loss: 1.951637E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 346.588 | TFLOPs: 20.97 | 31: iteration 125520/ 173500 | consumed samples: 32133120 | consumed tokens: 65808629760 | elapsed time per iteration (s): 0.77 | learning rate: 5.249E-05 | global batch size: 256 | lm loss: 1.935966E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.993 | TFLOPs: 20.21 | 31: iteration 125530/ 173500 | consumed samples: 32135680 | consumed tokens: 65813872640 | elapsed time per iteration (s): 0.75 | learning rate: 5.247E-05 | global batch size: 256 | lm loss: 1.954181E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.056 | TFLOPs: 20.69 | 31: iteration 125540/ 173500 | consumed samples: 32138240 | consumed tokens: 65819115520 | elapsed time per iteration (s): 0.74 | learning rate: 5.246E-05 | global batch size: 256 | lm loss: 1.946763E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.097 | TFLOPs: 21.00 | 31: iteration 125550/ 173500 | consumed samples: 32140800 | consumed tokens: 65824358400 | elapsed time per iteration (s): 0.76 | learning rate: 5.245E-05 | global batch size: 256 | lm loss: 1.939879E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.008 | TFLOPs: 20.45 | 31: iteration 125560/ 173500 | consumed samples: 32143360 | consumed tokens: 65829601280 | elapsed time per iteration (s): 0.79 | learning rate: 5.244E-05 | global batch size: 256 | lm loss: 1.932893E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.709 | TFLOPs: 19.52 | 31: iteration 125570/ 173500 | consumed samples: 32145920 | consumed tokens: 65834844160 | elapsed time per iteration (s): 0.77 | learning rate: 
5.242E-05 | global batch size: 256 | lm loss: 1.916245E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.277 | TFLOPs: 20.22 | 31: iteration 125580/ 173500 | consumed samples: 32148480 | consumed tokens: 65840087040 | elapsed time per iteration (s): 0.93 | learning rate: 5.241E-05 | global batch size: 256 | lm loss: 1.937159E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 275.882 | TFLOPs: 16.69 | 31: iteration 125590/ 173500 | consumed samples: 32151040 | consumed tokens: 65845329920 | elapsed time per iteration (s): 0.74 | learning rate: 5.240E-05 | global batch size: 256 | lm loss: 1.938805E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.104 | TFLOPs: 20.94 | 31: iteration 125600/ 173500 | consumed samples: 32153600 | consumed tokens: 65850572800 | elapsed time per iteration (s): 0.75 | learning rate: 5.239E-05 | global batch size: 256 | lm loss: 1.943507E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.454 | TFLOPs: 20.72 | 31: iteration 125610/ 173500 | consumed samples: 32156160 | consumed tokens: 65855815680 | elapsed time per iteration (s): 0.88 | learning rate: 5.237E-05 | global batch size: 256 | lm loss: 1.957709E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.436 | TFLOPs: 17.57 | 31: iteration 125620/ 173500 | consumed samples: 32158720 | consumed tokens: 65861058560 | elapsed time per iteration (s): 0.75 | learning rate: 5.236E-05 | global batch size: 256 | lm loss: 1.938774E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.761 | TFLOPs: 20.68 | 31: iteration 125630/ 173500 | 
consumed samples: 32161280 | consumed tokens: 65866301440 | elapsed time per iteration (s): 0.75 | learning rate: 5.235E-05 | global batch size: 256 | lm loss: 1.925735E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.153 | TFLOPs: 20.58 | 31: iteration 125640/ 173500 | consumed samples: 32163840 | consumed tokens: 65871544320 | elapsed time per iteration (s): 0.78 | learning rate: 5.234E-05 | global batch size: 256 | lm loss: 1.961224E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.949 | TFLOPs: 19.78 | 31: iteration 125650/ 173500 | consumed samples: 32166400 | consumed tokens: 65876787200 | elapsed time per iteration (s): 0.74 | learning rate: 5.232E-05 | global batch size: 256 | lm loss: 1.918395E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.121 | TFLOPs: 20.94 | 31: iteration 125660/ 173500 | consumed samples: 32168960 | consumed tokens: 65882030080 | elapsed time per iteration (s): 0.82 | learning rate: 5.231E-05 | global batch size: 256 | lm loss: 1.923446E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.540 | TFLOPs: 18.79 | 31: iteration 125670/ 173500 | consumed samples: 32171520 | consumed tokens: 65887272960 | elapsed time per iteration (s): 0.73 | learning rate: 5.230E-05 | global batch size: 256 | lm loss: 1.960137E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.745 | TFLOPs: 21.10 | 31: iteration 125680/ 173500 | consumed samples: 32174080 | consumed tokens: 65892515840 | elapsed time per iteration (s): 0.71 | learning rate: 5.229E-05 | global batch size: 256 | lm loss: 1.909076E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 358.838 | TFLOPs: 21.71 | 31: iteration 125690/ 173500 | consumed samples: 32176640 | consumed tokens: 65897758720 | elapsed time per iteration (s): 0.75 | learning rate: 5.227E-05 | global batch size: 256 | lm loss: 1.928964E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.254 | TFLOPs: 20.77 | 31: iteration 125700/ 173500 | consumed samples: 32179200 | consumed tokens: 65903001600 | elapsed time per iteration (s): 0.77 | learning rate: 5.226E-05 | global batch size: 256 | lm loss: 1.923277E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.628 | TFLOPs: 20.06 | 31: iteration 125710/ 173500 | consumed samples: 32181760 | consumed tokens: 65908244480 | elapsed time per iteration (s): 0.76 | learning rate: 5.225E-05 | global batch size: 256 | lm loss: 1.940558E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.056 | TFLOPs: 20.27 | 31: iteration 125720/ 173500 | consumed samples: 32184320 | consumed tokens: 65913487360 | elapsed time per iteration (s): 0.78 | learning rate: 5.223E-05 | global batch size: 256 | lm loss: 1.942051E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.100 | TFLOPs: 19.85 | 31: iteration 125730/ 173500 | consumed samples: 32186880 | consumed tokens: 65918730240 | elapsed time per iteration (s): 0.77 | learning rate: 5.222E-05 | global batch size: 256 | lm loss: 1.948542E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.973 | TFLOPs: 20.08 | 31: iteration 125740/ 173500 | consumed samples: 32189440 | consumed tokens: 65923973120 | elapsed time per iteration (s): 0.72 | learning rate: 
5.221E-05 | global batch size: 256 | lm loss: 1.940187E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.181 | TFLOPs: 21.61 | 31: iteration 125750/ 173500 | consumed samples: 32192000 | consumed tokens: 65929216000 | elapsed time per iteration (s): 0.74 | learning rate: 5.220E-05 | global batch size: 256 | lm loss: 1.954520E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.711 | TFLOPs: 20.79 | 31: iteration 125760/ 173500 | consumed samples: 32194560 | consumed tokens: 65934458880 | elapsed time per iteration (s): 0.73 | learning rate: 5.218E-05 | global batch size: 256 | lm loss: 1.923702E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.705 | TFLOPs: 21.22 | 31: iteration 125770/ 173500 | consumed samples: 32197120 | consumed tokens: 65939701760 | elapsed time per iteration (s): 0.77 | learning rate: 5.217E-05 | global batch size: 256 | lm loss: 1.925748E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.794 | TFLOPs: 20.13 | 31: iteration 125780/ 173500 | consumed samples: 32199680 | consumed tokens: 65944944640 | elapsed time per iteration (s): 0.72 | learning rate: 5.216E-05 | global batch size: 256 | lm loss: 1.941280E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.226 | TFLOPs: 21.61 | 31: iteration 125790/ 173500 | consumed samples: 32202240 | consumed tokens: 65950187520 | elapsed time per iteration (s): 0.76 | learning rate: 5.215E-05 | global batch size: 256 | lm loss: 1.924311E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.722 | TFLOPs: 20.25 | 31: iteration 125800/ 173500 | 
consumed samples: 32204800 | consumed tokens: 65955430400 | elapsed time per iteration (s): 0.76 | learning rate: 5.213E-05 | global batch size: 256 | lm loss: 1.930140E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.157 | TFLOPs: 20.46 | 31: iteration 125810/ 173500 | consumed samples: 32207360 | consumed tokens: 65960673280 | elapsed time per iteration (s): 0.81 | learning rate: 5.212E-05 | global batch size: 256 | lm loss: 1.956464E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.505 | TFLOPs: 19.09 | 31: iteration 125820/ 173500 | consumed samples: 32209920 | consumed tokens: 65965916160 | elapsed time per iteration (s): 0.87 | learning rate: 5.211E-05 | global batch size: 256 | lm loss: 1.940307E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.758 | TFLOPs: 17.89 | 31: iteration 125830/ 173500 | consumed samples: 32212480 | consumed tokens: 65971159040 | elapsed time per iteration (s): 0.89 | learning rate: 5.210E-05 | global batch size: 256 | lm loss: 1.966417E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.025 | TFLOPs: 17.42 | 31: iteration 125840/ 173500 | consumed samples: 32215040 | consumed tokens: 65976401920 | elapsed time per iteration (s): 0.79 | learning rate: 5.208E-05 | global batch size: 256 | lm loss: 1.945794E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.650 | TFLOPs: 19.58 | 31: iteration 125850/ 173500 | consumed samples: 32217600 | consumed tokens: 65981644800 | elapsed time per iteration (s): 0.77 | learning rate: 5.207E-05 | global batch size: 256 | lm loss: 1.948477E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 331.519 | TFLOPs: 20.06 | 31: iteration 125860/ 173500 | consumed samples: 32220160 | consumed tokens: 65986887680 | elapsed time per iteration (s): 0.73 | learning rate: 5.206E-05 | global batch size: 256 | lm loss: 1.965042E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.813 | TFLOPs: 21.22 | 31: iteration 125870/ 173500 | consumed samples: 32222720 | consumed tokens: 65992130560 | elapsed time per iteration (s): 0.86 | learning rate: 5.205E-05 | global batch size: 256 | lm loss: 1.968956E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.514 | TFLOPs: 18.00 | 31: iteration 125880/ 173500 | consumed samples: 32225280 | consumed tokens: 65997373440 | elapsed time per iteration (s): 0.75 | learning rate: 5.203E-05 | global batch size: 256 | lm loss: 1.953537E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.042 | TFLOPs: 20.75 | 31: iteration 125890/ 173500 | consumed samples: 32227840 | consumed tokens: 66002616320 | elapsed time per iteration (s): 0.74 | learning rate: 5.202E-05 | global batch size: 256 | lm loss: 1.955418E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.911 | TFLOPs: 20.81 | 31: iteration 125900/ 173500 | consumed samples: 32230400 | consumed tokens: 66007859200 | elapsed time per iteration (s): 0.78 | learning rate: 5.201E-05 | global batch size: 256 | lm loss: 1.936051E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.815 | TFLOPs: 19.95 | 31: iteration 125910/ 173500 | consumed samples: 32232960 | consumed tokens: 66013102080 | elapsed time per iteration (s): 0.76 | learning rate: 
5.200E-05 | global batch size: 256 | lm loss: 1.950171E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.750 | TFLOPs: 20.43 | 31: iteration 125920/ 173500 | consumed samples: 32235520 | consumed tokens: 66018344960 | elapsed time per iteration (s): 0.75 | learning rate: 5.198E-05 | global batch size: 256 | lm loss: 1.927969E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.827 | TFLOPs: 20.74 | 31: iteration 125930/ 173500 | consumed samples: 32238080 | consumed tokens: 66023587840 | elapsed time per iteration (s): 0.84 | learning rate: 5.197E-05 | global batch size: 256 | lm loss: 1.931083E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.218 | TFLOPs: 18.34 | 31: iteration 125940/ 173500 | consumed samples: 32240640 | consumed tokens: 66028830720 | elapsed time per iteration (s): 0.81 | learning rate: 5.196E-05 | global batch size: 256 | lm loss: 1.939723E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.170 | TFLOPs: 19.19 | 31: iteration 125950/ 173500 | consumed samples: 32243200 | consumed tokens: 66034073600 | elapsed time per iteration (s): 0.92 | learning rate: 5.194E-05 | global batch size: 256 | lm loss: 1.929510E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 277.669 | TFLOPs: 16.80 | 31: iteration 125960/ 173500 | consumed samples: 32245760 | consumed tokens: 66039316480 | elapsed time per iteration (s): 0.78 | learning rate: 5.193E-05 | global batch size: 256 | lm loss: 1.947469E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.486 | TFLOPs: 19.75 | 31: iteration 125970/ 173500 | 
consumed samples: 32248320 | consumed tokens: 66044559360 | elapsed time per iteration (s): 0.80 | learning rate: 5.192E-05 | global batch size: 256 | lm loss: 1.947925E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.477 | TFLOPs: 19.45 | 31: iteration 125980/ 173500 | consumed samples: 32250880 | consumed tokens: 66049802240 | elapsed time per iteration (s): 0.83 | learning rate: 5.191E-05 | global batch size: 256 | lm loss: 1.961735E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.503 | TFLOPs: 18.72 | 31: iteration 125990/ 173500 | consumed samples: 32253440 | consumed tokens: 66055045120 | elapsed time per iteration (s): 0.77 | learning rate: 5.189E-05 | global batch size: 256 | lm loss: 1.937999E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.595 | TFLOPs: 20.18 | 0: [2022-11-26 22:29:04,949] [INFO] [logging.py:68:log_dist] [Rank 0] step=126000, skipped=0, lr=[5.188210163686188e-05, 5.188210163686188e-05, 5.188210163686188e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 126000/ 173500 | consumed samples: 32256000 | consumed tokens: 66060288000 | elapsed time per iteration (s): 0.82 | learning rate: 5.188E-05 | global batch size: 256 | lm loss: 1.963055E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.520 | TFLOPs: 18.85 | 0: steps: 126000 loss: 1.9794 iter time (s): 0.795 samples/sec: 322.196 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 126000 | lm loss value: 1.963925E+00 | lm loss PPL: 7.127248E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 126000 to 
checkpoints_1b1long 0: [2022-11-26 22:29:05,200] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step126000 is begin to save! 0: [2022-11-26 22:29:05,206] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_01-model_00-model_states.pt... 0: [2022-11-26 22:29:05,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_01-model_00-model_states.pt. 0: [2022-11-26 22:29:05,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_03-model_00-model_states.pt... 0: [2022-11-26 22:29:05,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_03-model_00-model_states.pt. 0: [2022-11-26 22:29:05,584] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_04-model_00-model_states.pt... 0: [2022-11-26 22:29:05,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_04-model_00-model_states.pt. 0: [2022-11-26 22:29:05,662] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_05-model_00-model_states.pt... 0: [2022-11-26 22:29:05,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_05-model_00-model_states.pt. 0: [2022-11-26 22:29:05,743] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_06-model_00-model_states.pt... 0: [2022-11-26 22:29:05,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_06-model_00-model_states.pt. 0: [2022-11-26 22:29:05,819] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_07-model_00-model_states.pt... 
0: [2022-11-26 22:29:05,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_07-model_00-model_states.pt. 0: [2022-11-26 22:29:05,898] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_08-model_00-model_states.pt... 0: [2022-11-26 22:29:05,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_08-model_00-model_states.pt. 0: [2022-11-26 22:29:05,978] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_09-model_00-model_states.pt... 0: [2022-11-26 22:29:06,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_09-model_00-model_states.pt. 0: [2022-11-26 22:29:06,054] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_10-model_00-model_states.pt... 0: [2022-11-26 22:29:06,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_10-model_00-model_states.pt. 0: [2022-11-26 22:29:06,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_11-model_00-model_states.pt... 0: [2022-11-26 22:29:06,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_11-model_00-model_states.pt. 0: [2022-11-26 22:29:06,205] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_12-model_00-model_states.pt... 0: [2022-11-26 22:29:06,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_12-model_00-model_states.pt. 0: [2022-11-26 22:29:06,281] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_13-model_00-model_states.pt... 
0: [2022-11-26 22:29:06,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_13-model_00-model_states.pt. 0: [2022-11-26 22:29:06,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_14-model_00-model_states.pt... 0: [2022-11-26 22:29:06,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_14-model_00-model_states.pt. 0: [2022-11-26 22:29:06,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_15-model_00-model_states.pt... 0: [2022-11-26 22:29:06,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_15-model_00-model_states.pt. 0: [2022-11-26 22:29:06,510] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_16-model_00-model_states.pt... 0: [2022-11-26 22:29:06,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_16-model_00-model_states.pt. 0: [2022-11-26 22:29:06,583] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_17-model_00-model_states.pt... 0: [2022-11-26 22:29:06,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_17-model_00-model_states.pt. 0: [2022-11-26 22:29:06,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_18-model_00-model_states.pt... 0: [2022-11-26 22:29:06,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_18-model_00-model_states.pt. 0: [2022-11-26 22:29:06,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_19-model_00-model_states.pt... 
0: [2022-11-26 22:29:06,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_19-model_00-model_states.pt. 0: [2022-11-26 22:29:06,811] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_20-model_00-model_states.pt... 0: [2022-11-26 22:29:06,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_20-model_00-model_states.pt. 0: [2022-11-26 22:29:06,886] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_21-model_00-model_states.pt... 0: [2022-11-26 22:29:06,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_21-model_00-model_states.pt. 0: [2022-11-26 22:29:06,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_22-model_00-model_states.pt... 0: [2022-11-26 22:29:07,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_22-model_00-model_states.pt. 0: [2022-11-26 22:29:07,036] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_23-model_00-model_states.pt... 0: [2022-11-26 22:29:07,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_23-model_00-model_states.pt. 0: [2022-11-26 22:29:07,110] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_24-model_00-model_states.pt... 0: [2022-11-26 22:29:07,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_24-model_00-model_states.pt. 0: [2022-11-26 22:29:07,188] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_25-model_00-model_states.pt... 
0: [2022-11-26 22:29:07,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_25-model_00-model_states.pt. 0: [2022-11-26 22:29:07,262] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_26-model_00-model_states.pt... 0: [2022-11-26 22:29:07,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_26-model_00-model_states.pt. 0: [2022-11-26 22:29:07,340] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_27-model_00-model_states.pt... 0: [2022-11-26 22:29:07,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_27-model_00-model_states.pt. 0: [2022-11-26 22:29:07,415] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_28-model_00-model_states.pt... 0: [2022-11-26 22:29:07,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_28-model_00-model_states.pt. 0: [2022-11-26 22:29:07,489] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/layer_30-model_00-model_states.pt... 0: [2022-11-26 22:29:07,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/layer_30-model_00-model_states.pt. 0: [2022-11-26 22:29:07,492] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step126000/mp_rank_00_model_states.pt 0: [2022-11-26 22:29:07,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/mp_rank_00_model_states.pt... 0: [2022-11-26 22:29:07,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/mp_rank_00_model_states.pt. 
0: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
4: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
16: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 
3: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 
23: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 
31: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 
17: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 
19: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 
5: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
9: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 
2: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
20: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 
11: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 
29: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 
18: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
4: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 
2: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 
23: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
14: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 
18: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
1: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 
22: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 
27: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 
21: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:29:07,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:29:07,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:29:07,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 22:29:07,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 31: [2022-11-26 22:29:07,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:29:07,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 22:29:07,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 6: [2022-11-26 22:29:07,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 22:29:07,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 22:29:07,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 10: [2022-11-26 22:29:07,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:29:07,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 22:29:07,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 20: [2022-11-26 22:29:07,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:29:07,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:29:07,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 22:29:07,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 22:29:07,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 20: [2022-11-26 22:29:07,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 15: [2022-11-26 22:29:07,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 22:29:07,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 22:29:07,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 8: [2022-11-26 22:29:07,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:29:07,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 22:29:07,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 12: [2022-11-26 22:29:07,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:29:07,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 22:29:07,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 11: [2022-11-26 22:29:07,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:29:07,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:29:07,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 22:29:07,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 
11: [2022-11-26 22:29:07,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 19: [2022-11-26 22:29:07,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:29:07,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 23: [2022-11-26 22:29:07,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:29:07,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:29:07,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 14: [2022-11-26 22:29:07,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 26: [2022-11-26 22:29:07,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:29:07,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 14: [2022-11-26 22:29:07,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 
26: [2022-11-26 22:29:07,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 22:29:07,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:29:07,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 22:29:07,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 26: [2022-11-26 22:29:07,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 26: [2022-11-26 22:29:07,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 22:29:07,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 13: [2022-11-26 22:29:07,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:29:07,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 22:29:07,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 13: [2022-11-26 22:29:07,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 22:29:07,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 22:29:07,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 23: [2022-11-26 22:29:07,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:29:07,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 22:29:07,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 28: [2022-11-26 22:29:07,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:29:07,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:29:07,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 20: [2022-11-26 22:29:07,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:29:07,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 27: [2022-11-26 22:29:07,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 
27: [2022-11-26 22:29:07,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 20: [2022-11-26 22:29:07,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 27: [2022-11-26 22:29:07,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 20: [2022-11-26 22:29:07,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 14: [2022-11-26 22:29:07,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:29:07,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 22:29:07,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 19: [2022-11-26 22:29:07,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:29:07,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 22:29:07,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 25: [2022-11-26 22:29:07,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:29:07,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
27: [2022-11-26 22:29:07,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 25: [2022-11-26 22:29:07,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 27: [2022-11-26 22:29:07,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 3: [2022-11-26 22:29:07,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:29:07,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 3: [2022-11-26 22:29:07,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 22:29:07,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 0: [2022-11-26 22:29:07,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:29:07,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:29:07,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 6: [2022-11-26 22:29:07,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 22:29:07,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 
0: [2022-11-26 22:29:07,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 10: [2022-11-26 22:29:07,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:29:07,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 22:29:07,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 2: [2022-11-26 22:29:07,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:29:07,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 12: [2022-11-26 22:29:07,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:29:07,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 12: [2022-11-26 22:29:07,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 22:29:07,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 10: [2022-11-26 22:29:07,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-26 22:29:07,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 22:29:07,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 13: [2022-11-26 22:29:07,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:29:07,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 22:29:07,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 8: [2022-11-26 22:29:07,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:29:07,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 26: [2022-11-26 22:29:07,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:29:07,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 26: [2022-11-26 22:29:07,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 25: [2022-11-26 22:29:07,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:29:07,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 
25: [2022-11-26 22:29:07,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 22:29:07,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 27: [2022-11-26 22:29:07,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:29:07,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 29: [2022-11-26 22:29:07,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:29:07,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 29: [2022-11-26 22:29:07,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 22:29:07,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 23: [2022-11-26 22:29:07,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:29:07,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 14: [2022-11-26 22:29:07,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:29:07,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 
14: [2022-11-26 22:29:07,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 22:29:07,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 31: [2022-11-26 22:29:07,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:29:07,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 22:29:07,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 9: [2022-11-26 22:29:07,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:29:07,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 22:29:07,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 22: [2022-11-26 22:29:07,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:29:07,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 22:29:07,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 22: [2022-11-26 22:29:07,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-26 22:29:07,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 22:29:07,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 22: [2022-11-26 22:29:07,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:29:07,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 23: [2022-11-26 22:29:07,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:29:07,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 23: [2022-11-26 22:29:07,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 22:29:07,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 2: [2022-11-26 22:29:07,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:29:07,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 22:29:07,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 29: [2022-11-26 22:29:07,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-26 22:29:07,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 22:29:07,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 25: [2022-11-26 22:29:07,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:29:07,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:29:07,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:29:07,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 9: [2022-11-26 22:29:07,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 2: [2022-11-26 22:29:07,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 9: [2022-11-26 22:29:07,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 2: [2022-11-26 22:29:07,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 25: [2022-11-26 22:29:07,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 15: [2022-11-26 22:29:07,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 22:29:07,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 22:29:07,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 12: [2022-11-26 22:29:07,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:29:07,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 22:29:07,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 10: [2022-11-26 22:29:07,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:29:07,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 22:29:07,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 26: [2022-11-26 22:29:07,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:29:07,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 22:29:07,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 22: [2022-11-26 22:29:07,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
22: [2022-11-26 22:29:07,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 22:29:07,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 19: [2022-11-26 22:29:07,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:29:07,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:29:07,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 22:29:07,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 22:29:07,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 19: [2022-11-26 22:29:07,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 11: [2022-11-26 22:29:07,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:29:07,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 22:29:07,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 30: [2022-11-26 22:29:07,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 
30: [2022-11-26 22:29:07,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:29:07,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 30: [2022-11-26 22:29:07,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:29:07,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 30: [2022-11-26 22:29:07,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 6: [2022-11-26 22:29:07,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 30: [2022-11-26 22:29:07,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 22:29:07,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 22:29:07,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 30: [2022-11-26 22:29:07,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 30: [2022-11-26 22:29:07,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 12: [2022-11-26 22:29:07,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 22:29:07,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 22:29:07,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 30: [2022-11-26 22:29:07,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 22:29:07,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 22:29:07,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 3: [2022-11-26 22:29:07,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:29:07,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 22:29:07,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 6: [2022-11-26 22:29:07,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:29:07,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 22:29:07,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 3: [2022-11-26 22:29:07,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 22:29:07,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 9: [2022-11-26 22:29:07,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:29:07,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 3: [2022-11-26 22:29:07,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 9: [2022-11-26 22:29:07,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 20: [2022-11-26 22:29:07,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:29:07,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 22:29:07,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 29: [2022-11-26 22:29:07,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:29:07,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 1: [2022-11-26 22:29:07,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 22:29:07,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:29:07,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:29:07,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:29:07,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 1: [2022-11-26 22:29:07,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 22:29:07,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 22:29:07,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 22:29:07,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 8: [2022-11-26 22:29:07,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:29:07,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 1: [2022-11-26 22:29:07,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 
28: [2022-11-26 22:29:07,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 8: [2022-11-26 22:29:07,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 1: [2022-11-26 22:29:07,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 1: [2022-11-26 22:29:07,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 1: [2022-11-26 22:29:07,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 28: [2022-11-26 22:29:07,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 28: [2022-11-26 22:29:07,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 22:29:07,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 22:29:07,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 28: [2022-11-26 22:29:07,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 22:29:07,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 22:29:07,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 28: [2022-11-26 22:29:07,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
28: [2022-11-26 22:29:07,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 22:29:07,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 8: [2022-11-26 22:29:07,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:29:07,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 22:29:07,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 24: [2022-11-26 22:29:07,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:29:07,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:29:07,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 22:29:07,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 22:29:07,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 24: [2022-11-26 22:29:07,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 24: [2022-11-26 22:29:07,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-26 22:29:07,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 22:29:07,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 11: [2022-11-26 22:29:07,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:29:07,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 14: [2022-11-26 22:29:07,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:29:07,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 14: [2022-11-26 22:29:07,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 22:29:07,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 4: [2022-11-26 22:29:07,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:29:07,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:29:07,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:29:07,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 22:29:07,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 22:29:07,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 22:29:07,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 17: [2022-11-26 22:29:07,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:29:07,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 31: [2022-11-26 22:29:07,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:29:07,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 4: [2022-11-26 22:29:07,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 22:29:07,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 17: [2022-11-26 22:29:07,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:29:07,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 
17: [2022-11-26 22:29:07,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 31: [2022-11-26 22:29:07,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 17: [2022-11-26 22:29:07,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 22:29:07,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 17: [2022-11-26 22:29:07,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 17: [2022-11-26 22:29:07,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:29:07,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 17: [2022-11-26 22:29:07,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 31: [2022-11-26 22:29:07,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:29:07,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 31: [2022-11-26 22:29:07,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 17: [2022-11-26 22:29:07,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 
31: [2022-11-26 22:29:07,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:29:07,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 2: [2022-11-26 22:29:07,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:29:07,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 17: [2022-11-26 22:29:07,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 31: [2022-11-26 22:29:07,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 2: [2022-11-26 22:29:07,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 31: [2022-11-26 22:29:07,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 2: [2022-11-26 22:29:07,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 13: [2022-11-26 22:29:07,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:29:07,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 22:29:07,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 
27: [2022-11-26 22:29:07,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:29:07,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 22:29:07,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 0: [2022-11-26 22:29:07,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:29:07,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:29:07,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 22:29:07,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 0: [2022-11-26 22:29:07,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 22:29:07,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 20: [2022-11-26 22:29:07,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:29:07,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 22:29:07,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 
5: [2022-11-26 22:29:07,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:29:07,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:29:07,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 22:29:07,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 22:29:07,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 5: [2022-11-26 22:29:07,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 5: [2022-11-26 22:29:07,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:29:07,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 22:29:07,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 7: [2022-11-26 22:29:07,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:29:07,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-26 22:29:07,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 22:29:07,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 22:29:07,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:29:07,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 7: [2022-11-26 22:29:07,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 22:29:07,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 7: [2022-11-26 22:29:07,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 15: [2022-11-26 22:29:07,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:29:07,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:29:07,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 22:29:07,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 22:29:07,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 
16: [2022-11-26 22:29:07,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:29:07,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 16: [2022-11-26 22:29:07,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:29:07,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:29:07,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 22:29:07,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:29:07,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 16: [2022-11-26 22:29:07,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 22:29:07,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 22:29:07,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 22:29:07,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 16: [2022-11-26 22:29:07,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 
16: [2022-11-26 22:29:07,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 1: [2022-11-26 22:29:07,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:29:07,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 22:29:07,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 21: [2022-11-26 22:29:07,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:29:07,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:29:07,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:29:07,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-26 22:29:07,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 22:29:07,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 22:29:07,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 22:29:07,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 22:29:07,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 21: [2022-11-26 22:29:07,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 21: [2022-11-26 22:29:07,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 21: [2022-11-26 22:29:07,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 6: [2022-11-26 22:29:07,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:29:07,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 22:29:07,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 27: [2022-11-26 22:29:07,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-26 22:29:07,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 22:29:07,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 4: [2022-11-26 22:29:07,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:29:07,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 22:29:07,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 0: [2022-11-26 22:29:07,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:29:07,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:29:07,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 0: [2022-11-26 22:29:07,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 11: [2022-11-26 22:29:07,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 0: [2022-11-26 22:29:07,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 18: [2022-11-26 22:29:07,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
18: [2022-11-26 22:29:07,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 22:29:07,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 18: [2022-11-26 22:29:07,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:29:07,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 22:29:07,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 18: [2022-11-26 22:29:07,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:29:07,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 22:29:07,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 18: [2022-11-26 22:29:07,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:29:07,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 22:29:07,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 28: [2022-11-26 22:29:07,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
0: [2022-11-26 22:29:07,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:29:07,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 22:29:07,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 10: [2022-11-26 22:29:07,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:29:07,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 22:29:07,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 14: [2022-11-26 22:29:07,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:29:07,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 22:29:07,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 28: [2022-11-26 22:29:07,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 22:29:07,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 2: [2022-11-26 22:29:07,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 22:29:07,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 22:29:07,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 18: [2022-11-26 22:29:07,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:29:07,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 22:29:07,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 23: [2022-11-26 22:29:07,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:29:07,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 22:29:07,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 16: [2022-11-26 22:29:07,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:29:07,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 22:29:07,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 8: [2022-11-26 22:29:07,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 22:29:07,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 22:29:07,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 25: [2022-11-26 22:29:07,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:29:07,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 22:29:07,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 12: [2022-11-26 22:29:07,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:29:07,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 22:29:07,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 30: [2022-11-26 22:29:07,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 22:29:07,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 22:29:07,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 13: [2022-11-26 22:29:07,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-26 22:29:07,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 22:29:07,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 19: [2022-11-26 22:29:07,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:29:07,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 22:29:07,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 26: [2022-11-26 22:29:07,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:29:07,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 22:29:07,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 17: [2022-11-26 22:29:07,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:29:07,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 22:29:07,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 24: [2022-11-26 22:29:07,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
29: [2022-11-26 22:29:07,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:29:07,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 22:29:07,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 29: [2022-11-26 22:29:07,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 22:29:07,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 21: [2022-11-26 22:29:07,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:29:07,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 22:29:07,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 22: [2022-11-26 22:29:07,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:29:07,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 22:29:07,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 15: [2022-11-26 22:29:07,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-26 22:29:07,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 22:29:07,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 3: [2022-11-26 22:29:07,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:29:07,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 22:29:07,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 7: [2022-11-26 22:29:07,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:29:07,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 22:29:07,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 5: [2022-11-26 22:29:07,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:29:07,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 22:29:07,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 20: [2022-11-26 22:29:07,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 
20: [2022-11-26 22:29:07,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 22:29:07,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 9: [2022-11-26 22:29:07,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:29:07,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 22:29:07,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 31: [2022-11-26 22:29:07,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:29:07,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 22:29:07,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 1: [2022-11-26 22:29:07,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:29:07,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 22:29:07,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 6: [2022-11-26 22:29:07,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 22:29:07,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 22:29:07,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 0: [2022-11-26 22:29:07,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:29:07,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 22:29:07,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 4: [2022-11-26 22:29:07,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:29:07,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 22:29:07,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 14: [2022-11-26 22:29:07,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:29:07,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:29:07,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
16: [2022-11-26 22:29:07,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 14: [2022-11-26 22:29:07,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 16: [2022-11-26 22:29:07,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 14: [2022-11-26 22:29:07,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 10: [2022-11-26 22:29:07,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 22:29:07,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 2: [2022-11-26 22:29:07,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:29:07,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 22:29:07,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 11: [2022-11-26 22:29:07,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:29:07,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 22:29:07,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 
27: [2022-11-26 22:29:07,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:29:07,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 22:29:07,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 25: [2022-11-26 22:29:07,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:29:07,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 22:29:07,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 23: [2022-11-26 22:29:07,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:29:07,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 22:29:07,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 28: [2022-11-26 22:29:07,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 30: [2022-11-26 22:29:07,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-26 22:29:07,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 22:29:07,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 8: [2022-11-26 22:29:07,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 28: [2022-11-26 22:29:07,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 22:29:07,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 8: [2022-11-26 22:29:07,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 22:29:07,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 21: [2022-11-26 22:29:07,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:29:07,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 17: [2022-11-26 22:29:07,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:29:07,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 
17: [2022-11-26 22:29:07,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 22:29:07,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 12: [2022-11-26 22:29:07,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:29:07,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 26: [2022-11-26 22:29:07,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:29:07,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 26: [2022-11-26 22:29:07,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 22:29:07,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 13: [2022-11-26 22:29:07,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:29:07,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 22:29:07,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 19: [2022-11-26 22:29:07,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-26 22:29:07,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 22:29:07,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 5: [2022-11-26 22:29:07,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:29:07,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 22:29:07,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 29: [2022-11-26 22:29:07,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:29:07,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 22:29:07,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 18: [2022-11-26 22:29:07,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:29:07,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 22:29:07,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 22: [2022-11-26 22:29:07,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
22: [2022-11-26 22:29:07,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 22:29:07,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 24: [2022-11-26 22:29:07,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:29:07,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:29:07,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 22:29:07,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 1: [2022-11-26 22:29:07,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 3: [2022-11-26 22:29:07,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:29:07,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 3: [2022-11-26 22:29:07,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 22:29:07,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 31: [2022-11-26 22:29:07,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
31: [2022-11-26 22:29:07,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 22:29:07,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 6: [2022-11-26 22:29:07,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:29:07,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 22:29:07,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 10: [2022-11-26 22:29:07,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:29:07,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 22:29:07,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 0: [2022-11-26 22:29:07,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:29:07,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 22:29:07,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 0: [2022-11-26 22:29:07,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 4: [2022-11-26 22:29:07,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 0: [2022-11-26 22:29:07,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 20: [2022-11-26 22:29:07,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:29:07,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 22:29:07,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 7: [2022-11-26 22:29:07,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:29:07,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 22:29:07,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 16: [2022-11-26 22:29:07,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:29:07,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
16: [2022-11-26 22:29:07,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 15: [2022-11-26 22:29:07,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 16: [2022-11-26 22:29:07,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 15: [2022-11-26 22:29:07,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 14: [2022-11-26 22:29:07,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:29:07,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 22:29:07,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 27: [2022-11-26 22:29:07,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:29:07,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:29:07,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 9: [2022-11-26 22:29:07,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 27: [2022-11-26 22:29:07,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 
9: [2022-11-26 22:29:07,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 11: [2022-11-26 22:29:07,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:29:07,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 22:29:07,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 2: [2022-11-26 22:29:07,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:29:07,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 22:29:07,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 28: [2022-11-26 22:29:07,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:29:07,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 28: [2022-11-26 22:29:07,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 25: [2022-11-26 22:29:07,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 28: [2022-11-26 22:29:07,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 
25: [2022-11-26 22:29:07,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 30: [2022-11-26 22:29:07,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 22:29:07,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 22:29:07,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 8: [2022-11-26 22:29:07,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:29:07,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 22:29:07,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 23: [2022-11-26 22:29:07,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:29:07,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 22:29:07,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 19: [2022-11-26 22:29:07,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:29:07,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
19: [2022-11-26 22:29:07,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 29: [2022-11-26 22:29:07,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 19: [2022-11-26 22:29:07,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 29: [2022-11-26 22:29:07,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 5: [2022-11-26 22:29:07,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:29:07,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 22:29:07,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 12: [2022-11-26 22:29:07,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:29:07,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 22:29:07,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 21: [2022-11-26 22:29:07,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-26 22:29:07,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 22:29:07,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 13: [2022-11-26 22:29:07,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:29:07,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 22:29:07,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 17: [2022-11-26 22:29:07,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:29:07,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 18: [2022-11-26 22:29:07,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:29:07,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:29:07,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 17: [2022-11-26 22:29:07,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 18: [2022-11-26 22:29:07,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 
26: [2022-11-26 22:29:07,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 22:29:07,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 1: [2022-11-26 22:29:07,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:29:07,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 22:29:07,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 3: [2022-11-26 22:29:07,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:29:07,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 22:29:07,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 22: [2022-11-26 22:29:07,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:29:07,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 22:29:07,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 31: [2022-11-26 22:29:07,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-26 22:29:07,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 22:29:07,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 24: [2022-11-26 22:29:07,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:29:07,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 22:29:07,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 20: [2022-11-26 22:29:07,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:29:07,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 22:29:07,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 0: [2022-11-26 22:29:07,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:29:07,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:29:07,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 22:29:07,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 
27: [2022-11-26 22:29:07,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:29:07,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 22:29:07,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 15: [2022-11-26 22:29:07,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:29:07,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 22:29:07,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 19: [2022-11-26 22:29:07,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:29:07,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:29:07,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 6: [2022-11-26 22:29:07,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 19: [2022-11-26 22:29:07,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 6: [2022-11-26 22:29:07,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 
10: [2022-11-26 22:29:07,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:29:07,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:29:07,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 11: [2022-11-26 22:29:07,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:29:07,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 10: [2022-11-26 22:29:07,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 7: [2022-11-26 22:29:07,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 11: [2022-11-26 22:29:07,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 22:29:07,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 30: [2022-11-26 22:29:07,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 22:29:07,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 22:29:07,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 
2: [2022-11-26 22:29:07,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:29:07,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 22:29:07,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 14: [2022-11-26 22:29:07,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:29:07,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 22:29:07,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 0: [2022-11-26 22:29:07,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 22:29:07,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 13: [2022-11-26 22:29:07,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:29:07,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 22:29:07,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 9: [2022-11-26 22:29:07,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 22:29:07,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 22:29:07,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 25: [2022-11-26 22:29:07,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:29:07,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 22:29:07,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 4: [2022-11-26 22:29:07,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:29:07,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 22:29:07,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 12: [2022-11-26 22:29:07,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:29:07,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 28: [2022-11-26 22:29:07,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:29:07,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 
28: [2022-11-26 22:29:07,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 22:29:07,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 23: [2022-11-26 22:29:07,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:29:07,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 22:29:07,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 8: [2022-11-26 22:29:07,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:29:07,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 22:29:07,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 17: [2022-11-26 22:29:07,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:29:07,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 22:29:07,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 21: [2022-11-26 22:29:07,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
21: [2022-11-26 22:29:07,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 22:29:07,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 26: [2022-11-26 22:29:07,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:29:07,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 22:29:07,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 3: [2022-11-26 22:29:07,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:29:07,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 22:29:07,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 29: [2022-11-26 22:29:07,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:29:07,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 22:29:07,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 5: [2022-11-26 22:29:07,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 22:29:07,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 22:29:07,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 22: [2022-11-26 22:29:07,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:29:07,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 22:29:07,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 24: [2022-11-26 22:29:07,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:29:07,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 22:29:07,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 18: [2022-11-26 22:29:07,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:29:07,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 22:29:07,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 29: [2022-11-26 22:29:07,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
29: [2022-11-26 22:29:07,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 22:29:07,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 7: [2022-11-26 22:29:07,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:29:07,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:29:07,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 5: [2022-11-26 22:29:07,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:29:07,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 22:29:07,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 5: [2022-11-26 22:29:07,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 22:29:07,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 7: [2022-11-26 22:29:07,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 9: [2022-11-26 22:29:07,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-26 22:29:07,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:29:07,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 22:29:07,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 22:29:07,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 9: [2022-11-26 22:29:07,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 15: [2022-11-26 22:29:07,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:29:07,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 22:29:07,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 24: [2022-11-26 22:29:07,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:29:07,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step126000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 22:29:07,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step126000 is ready now! 
0: successfully saved checkpoint at iteration 126000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2619.89 31: iteration 126010/ 173500 | consumed samples: 32258560 | consumed tokens: 66065530880 | elapsed time per iteration (s): 1.12 | learning rate: 5.187E-05 | global batch size: 256 | lm loss: 1.903994E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.976 | TFLOPs: 13.79 | 31: iteration 126020/ 173500 | consumed samples: 32261120 | consumed tokens: 66070773760 | elapsed time per iteration (s): 0.82 | learning rate: 5.186E-05 | global batch size: 256 | lm loss: 1.940995E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.034 | TFLOPs: 18.82 | 31: iteration 126030/ 173500 | consumed samples: 32263680 | consumed tokens: 66076016640 | elapsed time per iteration (s): 1.73 | learning rate: 5.184E-05 | global batch size: 256 | lm loss: 1.950236E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 147.671 | TFLOPs: 8.93 | 31: iteration 126040/ 173500 | consumed samples: 32266240 | consumed tokens: 66081259520 | elapsed time per iteration (s): 0.81 | learning rate: 5.183E-05 | global batch size: 256 | lm loss: 1.940841E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.485 | TFLOPs: 19.09 | 31: iteration 126050/ 173500 | consumed samples: 32268800 | consumed tokens: 66086502400 | elapsed time per iteration (s): 0.80 | learning rate: 5.182E-05 | global batch size: 256 | lm loss: 1.950909E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.672 | TFLOPs: 19.34 | 31: iteration 126060/ 173500 | consumed samples: 32271360 | consumed tokens: 66091745280 | elapsed time per iteration (s): 
0.79 | learning rate: 5.181E-05 | global batch size: 256 | lm loss: 1.922285E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.252 | TFLOPs: 19.68 | 31: iteration 126070/ 173500 | consumed samples: 32273920 | consumed tokens: 66096988160 | elapsed time per iteration (s): 0.80 | learning rate: 5.179E-05 | global batch size: 256 | lm loss: 1.948565E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.815 | TFLOPs: 19.47 | 31: iteration 126080/ 173500 | consumed samples: 32276480 | consumed tokens: 66102231040 | elapsed time per iteration (s): 0.88 | learning rate: 5.178E-05 | global batch size: 256 | lm loss: 1.966540E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.035 | TFLOPs: 17.55 | 31: iteration 126090/ 173500 | consumed samples: 32279040 | consumed tokens: 66107473920 | elapsed time per iteration (s): 0.80 | learning rate: 5.177E-05 | global batch size: 256 | lm loss: 1.973718E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.403 | TFLOPs: 19.32 | 31: iteration 126100/ 173500 | consumed samples: 32281600 | consumed tokens: 66112716800 | elapsed time per iteration (s): 0.78 | learning rate: 5.176E-05 | global batch size: 256 | lm loss: 1.943798E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.089 | TFLOPs: 19.97 | 31: iteration 126110/ 173500 | consumed samples: 32284160 | consumed tokens: 66117959680 | elapsed time per iteration (s): 0.76 | learning rate: 5.174E-05 | global batch size: 256 | lm loss: 1.921584E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.858 | TFLOPs: 20.26 | 31: 
iteration 126120/ 173500 | consumed samples: 32286720 | consumed tokens: 66123202560 | elapsed time per iteration (s): 0.82 | learning rate: 5.173E-05 | global batch size: 256 | lm loss: 1.929869E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.613 | TFLOPs: 18.97 | 31: iteration 126130/ 173500 | consumed samples: 32289280 | consumed tokens: 66128445440 | elapsed time per iteration (s): 0.79 | learning rate: 5.172E-05 | global batch size: 256 | lm loss: 1.923017E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.516 | TFLOPs: 19.51 | 31: iteration 126140/ 173500 | consumed samples: 32291840 | consumed tokens: 66133688320 | elapsed time per iteration (s): 0.81 | learning rate: 5.171E-05 | global batch size: 256 | lm loss: 1.959879E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.519 | TFLOPs: 19.15 | 31: iteration 126150/ 173500 | consumed samples: 32294400 | consumed tokens: 66138931200 | elapsed time per iteration (s): 0.83 | learning rate: 5.169E-05 | global batch size: 256 | lm loss: 1.935995E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.384 | TFLOPs: 18.72 | 31: iteration 126160/ 173500 | consumed samples: 32296960 | consumed tokens: 66144174080 | elapsed time per iteration (s): 0.83 | learning rate: 5.168E-05 | global batch size: 256 | lm loss: 1.953984E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.567 | TFLOPs: 18.67 | 31: iteration 126170/ 173500 | consumed samples: 32299520 | consumed tokens: 66149416960 | elapsed time per iteration (s): 0.82 | learning rate: 5.167E-05 | global batch size: 256 | lm loss: 1.970852E+00 | grad norm: 0.169 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.503 | TFLOPs: 18.78 | 31: iteration 126180/ 173500 | consumed samples: 32302080 | consumed tokens: 66154659840 | elapsed time per iteration (s): 0.82 | learning rate: 5.166E-05 | global batch size: 256 | lm loss: 1.941359E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.556 | TFLOPs: 18.97 | 31: iteration 126190/ 173500 | consumed samples: 32304640 | consumed tokens: 66159902720 | elapsed time per iteration (s): 0.77 | learning rate: 5.164E-05 | global batch size: 256 | lm loss: 1.936669E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.153 | TFLOPs: 20.22 | 31: iteration 126200/ 173500 | consumed samples: 32307200 | consumed tokens: 66165145600 | elapsed time per iteration (s): 0.79 | learning rate: 5.163E-05 | global batch size: 256 | lm loss: 1.897817E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.127 | TFLOPs: 19.49 | 31: iteration 126210/ 173500 | consumed samples: 32309760 | consumed tokens: 66170388480 | elapsed time per iteration (s): 0.81 | learning rate: 5.162E-05 | global batch size: 256 | lm loss: 1.931841E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.928 | TFLOPs: 19.11 | 31: iteration 126220/ 173500 | consumed samples: 32312320 | consumed tokens: 66175631360 | elapsed time per iteration (s): 0.78 | learning rate: 5.161E-05 | global batch size: 256 | lm loss: 1.961724E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.169 | TFLOPs: 19.73 | 31: iteration 126230/ 173500 | consumed samples: 32314880 | consumed tokens: 66180874240 | elapsed time per iteration (s): 0.81 | 
learning rate: 5.159E-05 | global batch size: 256 | lm loss: 1.947206E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.728 | TFLOPs: 19.04 | 31: iteration 126240/ 173500 | consumed samples: 32317440 | consumed tokens: 66186117120 | elapsed time per iteration (s): 0.89 | learning rate: 5.158E-05 | global batch size: 256 | lm loss: 1.944808E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.603 | TFLOPs: 17.40 | 31: iteration 126250/ 173500 | consumed samples: 32320000 | consumed tokens: 66191360000 | elapsed time per iteration (s): 0.77 | learning rate: 5.157E-05 | global batch size: 256 | lm loss: 1.942023E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.806 | TFLOPs: 20.01 | 31: iteration 126260/ 173500 | consumed samples: 32322560 | consumed tokens: 66196602880 | elapsed time per iteration (s): 0.79 | learning rate: 5.156E-05 | global batch size: 256 | lm loss: 1.933899E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.849 | TFLOPs: 19.53 | 31: iteration 126270/ 173500 | consumed samples: 32325120 | consumed tokens: 66201845760 | elapsed time per iteration (s): 0.81 | learning rate: 5.154E-05 | global batch size: 256 | lm loss: 1.906020E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.634 | TFLOPs: 19.22 | 31: iteration 126280/ 173500 | consumed samples: 32327680 | consumed tokens: 66207088640 | elapsed time per iteration (s): 0.77 | learning rate: 5.153E-05 | global batch size: 256 | lm loss: 1.896525E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.470 | TFLOPs: 20.05 | 31: iteration 
126290/ 173500 | consumed samples: 32330240 | consumed tokens: 66212331520 | elapsed time per iteration (s): 0.80 | learning rate: 5.152E-05 | global batch size: 256 | lm loss: 1.934387E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.009 | TFLOPs: 19.48 | 31: iteration 126300/ 173500 | consumed samples: 32332800 | consumed tokens: 66217574400 | elapsed time per iteration (s): 0.80 | learning rate: 5.151E-05 | global batch size: 256 | lm loss: 1.944739E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.320 | TFLOPs: 19.44 | 31: iteration 126310/ 173500 | consumed samples: 32335360 | consumed tokens: 66222817280 | elapsed time per iteration (s): 0.81 | learning rate: 5.149E-05 | global batch size: 256 | lm loss: 1.944035E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.838 | TFLOPs: 19.11 | 31: iteration 126320/ 173500 | consumed samples: 32337920 | consumed tokens: 66228060160 | elapsed time per iteration (s): 0.82 | learning rate: 5.148E-05 | global batch size: 256 | lm loss: 1.944751E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.411 | TFLOPs: 18.90 | 31: iteration 126330/ 173500 | consumed samples: 32340480 | consumed tokens: 66233303040 | elapsed time per iteration (s): 0.84 | learning rate: 5.147E-05 | global batch size: 256 | lm loss: 1.940796E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.572 | TFLOPs: 18.43 | 31: iteration 126340/ 173500 | consumed samples: 32343040 | consumed tokens: 66238545920 | elapsed time per iteration (s): 0.83 | learning rate: 5.146E-05 | global batch size: 256 | lm loss: 1.947919E+00 | grad norm: 0.173 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.576 | TFLOPs: 18.61 | 31: iteration 126350/ 173500 | consumed samples: 32345600 | consumed tokens: 66243788800 | elapsed time per iteration (s): 0.84 | learning rate: 5.144E-05 | global batch size: 256 | lm loss: 1.942223E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.192 | TFLOPs: 18.34 | 31: iteration 126360/ 173500 | consumed samples: 32348160 | consumed tokens: 66249031680 | elapsed time per iteration (s): 0.88 | learning rate: 5.143E-05 | global batch size: 256 | lm loss: 1.910386E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.966 | TFLOPs: 17.54 | 31: iteration 126370/ 173500 | consumed samples: 32350720 | consumed tokens: 66254274560 | elapsed time per iteration (s): 0.91 | learning rate: 5.142E-05 | global batch size: 256 | lm loss: 1.946803E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 280.811 | TFLOPs: 16.99 | 31: iteration 126380/ 173500 | consumed samples: 32353280 | consumed tokens: 66259517440 | elapsed time per iteration (s): 0.90 | learning rate: 5.141E-05 | global batch size: 256 | lm loss: 1.940587E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.930 | TFLOPs: 17.24 | 31: iteration 126390/ 173500 | consumed samples: 32355840 | consumed tokens: 66264760320 | elapsed time per iteration (s): 0.87 | learning rate: 5.139E-05 | global batch size: 256 | lm loss: 1.964594E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.757 | TFLOPs: 17.77 | 31: iteration 126400/ 173500 | consumed samples: 32358400 | consumed tokens: 66270003200 | elapsed time per iteration (s): 0.86 | learning 
rate: 5.138E-05 | global batch size: 256 | lm loss: 1.929002E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.965 | TFLOPs: 17.97 | 31: iteration 126410/ 173500 | consumed samples: 32360960 | consumed tokens: 66275246080 | elapsed time per iteration (s): 0.95 | learning rate: 5.137E-05 | global batch size: 256 | lm loss: 1.956800E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 268.756 | TFLOPs: 16.26 | 31: iteration 126420/ 173500 | consumed samples: 32363520 | consumed tokens: 66280488960 | elapsed time per iteration (s): 0.92 | learning rate: 5.136E-05 | global batch size: 256 | lm loss: 1.939097E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 278.493 | TFLOPs: 16.85 | 31: iteration 126430/ 173500 | consumed samples: 32366080 | consumed tokens: 66285731840 | elapsed time per iteration (s): 0.82 | learning rate: 5.134E-05 | global batch size: 256 | lm loss: 1.953908E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.136 | TFLOPs: 18.94 | 31: iteration 126440/ 173500 | consumed samples: 32368640 | consumed tokens: 66290974720 | elapsed time per iteration (s): 0.84 | learning rate: 5.133E-05 | global batch size: 256 | lm loss: 1.944353E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.262 | TFLOPs: 18.47 | 31: iteration 126450/ 173500 | consumed samples: 32371200 | consumed tokens: 66296217600 | elapsed time per iteration (s): 0.86 | learning rate: 5.132E-05 | global batch size: 256 | lm loss: 1.936055E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.995 | TFLOPs: 18.09 | 31: iteration 126460/ 
173500 | consumed samples: 32373760 | consumed tokens: 66301460480 | elapsed time per iteration (s): 0.89 | learning rate: 5.131E-05 | global batch size: 256 | lm loss: 1.950437E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.159 | TFLOPs: 17.43 | 31: iteration 126470/ 173500 | consumed samples: 32376320 | consumed tokens: 66306703360 | elapsed time per iteration (s): 0.90 | learning rate: 5.129E-05 | global batch size: 256 | lm loss: 1.967050E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.223 | TFLOPs: 17.19 | 31: iteration 126480/ 173500 | consumed samples: 32378880 | consumed tokens: 66311946240 | elapsed time per iteration (s): 0.91 | learning rate: 5.128E-05 | global batch size: 256 | lm loss: 1.909737E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.116 | TFLOPs: 17.07 | 31: iteration 126490/ 173500 | consumed samples: 32381440 | consumed tokens: 66317189120 | elapsed time per iteration (s): 0.87 | learning rate: 5.127E-05 | global batch size: 256 | lm loss: 1.951225E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.359 | TFLOPs: 17.75 | 31: iteration 126500/ 173500 | consumed samples: 32384000 | consumed tokens: 66322432000 | elapsed time per iteration (s): 0.92 | learning rate: 5.126E-05 | global batch size: 256 | lm loss: 1.942785E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 279.722 | TFLOPs: 16.92 | 31: iteration 126510/ 173500 | consumed samples: 32386560 | consumed tokens: 66327674880 | elapsed time per iteration (s): 0.92 | learning rate: 5.124E-05 | global batch size: 256 | lm loss: 1.917093E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 276.794 | TFLOPs: 16.75 | 31: iteration 126520/ 173500 | consumed samples: 32389120 | consumed tokens: 66332917760 | elapsed time per iteration (s): 0.82 | learning rate: 5.123E-05 | global batch size: 256 | lm loss: 1.965624E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.488 | TFLOPs: 18.90 | 31: iteration 126530/ 173500 | consumed samples: 32391680 | consumed tokens: 66338160640 | elapsed time per iteration (s): 0.83 | learning rate: 5.122E-05 | global batch size: 256 | lm loss: 1.944371E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.641 | TFLOPs: 18.61 | 31: iteration 126540/ 173500 | consumed samples: 32394240 | consumed tokens: 66343403520 | elapsed time per iteration (s): 0.87 | learning rate: 5.121E-05 | global batch size: 256 | lm loss: 1.913147E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.982 | TFLOPs: 17.72 | 31: iteration 126550/ 173500 | consumed samples: 32396800 | consumed tokens: 66348646400 | elapsed time per iteration (s): 0.80 | learning rate: 5.119E-05 | global batch size: 256 | lm loss: 1.937628E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.601 | TFLOPs: 19.46 | 31: iteration 126560/ 173500 | consumed samples: 32399360 | consumed tokens: 66353889280 | elapsed time per iteration (s): 0.78 | learning rate: 5.118E-05 | global batch size: 256 | lm loss: 1.947833E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.138 | TFLOPs: 19.73 | 31: iteration 126570/ 173500 | consumed samples: 32401920 | consumed tokens: 66359132160 | elapsed time per iteration (s): 0.80 | learning rate: 
5.117E-05 | global batch size: 256 | lm loss: 1.938740E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.481 | TFLOPs: 19.39 | 31: iteration 126580/ 173500 | consumed samples: 32404480 | consumed tokens: 66364375040 | elapsed time per iteration (s): 0.90 | learning rate: 5.116E-05 | global batch size: 256 | lm loss: 1.940257E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.683 | TFLOPs: 17.28 | 31: iteration 126590/ 173500 | consumed samples: 32407040 | consumed tokens: 66369617920 | elapsed time per iteration (s): 0.83 | learning rate: 5.114E-05 | global batch size: 256 | lm loss: 1.945449E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.934 | TFLOPs: 18.69 | 31: iteration 126600/ 173500 | consumed samples: 32409600 | consumed tokens: 66374860800 | elapsed time per iteration (s): 0.78 | learning rate: 5.113E-05 | global batch size: 256 | lm loss: 1.948911E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.444 | TFLOPs: 19.75 | 31: iteration 126610/ 173500 | consumed samples: 32412160 | consumed tokens: 66380103680 | elapsed time per iteration (s): 0.80 | learning rate: 5.112E-05 | global batch size: 256 | lm loss: 1.946114E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.478 | TFLOPs: 19.45 | 31: iteration 126620/ 173500 | consumed samples: 32414720 | consumed tokens: 66385346560 | elapsed time per iteration (s): 0.94 | learning rate: 5.111E-05 | global batch size: 256 | lm loss: 1.960957E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 273.402 | TFLOPs: 16.54 | 31: iteration 126630/ 173500 | 
consumed samples: 32417280 | consumed tokens: 66390589440 | elapsed time per iteration (s): 0.88 | learning rate: 5.109E-05 | global batch size: 256 | lm loss: 1.944855E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.727 | TFLOPs: 17.53 | 31: iteration 126640/ 173500 | consumed samples: 32419840 | consumed tokens: 66395832320 | elapsed time per iteration (s): 0.86 | learning rate: 5.108E-05 | global batch size: 256 | lm loss: 1.919963E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.950 | TFLOPs: 18.09 | 31: iteration 126650/ 173500 | consumed samples: 32422400 | consumed tokens: 66401075200 | elapsed time per iteration (s): 0.81 | learning rate: 5.107E-05 | global batch size: 256 | lm loss: 1.930096E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.122 | TFLOPs: 19.00 | 31: iteration 126660/ 173500 | consumed samples: 32424960 | consumed tokens: 66406318080 | elapsed time per iteration (s): 0.81 | learning rate: 5.106E-05 | global batch size: 256 | lm loss: 1.935720E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.819 | TFLOPs: 19.23 | 31: iteration 126670/ 173500 | consumed samples: 32427520 | consumed tokens: 66411560960 | elapsed time per iteration (s): 0.82 | learning rate: 5.104E-05 | global batch size: 256 | lm loss: 1.944144E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.001 | TFLOPs: 18.88 | 31: iteration 126680/ 173500 | consumed samples: 32430080 | consumed tokens: 66416803840 | elapsed time per iteration (s): 0.79 | learning rate: 5.103E-05 | global batch size: 256 | lm loss: 1.942100E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 322.305 | TFLOPs: 19.50 | 31: iteration 126690/ 173500 | consumed samples: 32432640 | consumed tokens: 66422046720 | elapsed time per iteration (s): 0.87 | learning rate: 5.102E-05 | global batch size: 256 | lm loss: 1.945843E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.580 | TFLOPs: 17.70 | 31: iteration 126700/ 173500 | consumed samples: 32435200 | consumed tokens: 66427289600 | elapsed time per iteration (s): 0.83 | learning rate: 5.101E-05 | global batch size: 256 | lm loss: 1.972415E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.669 | TFLOPs: 18.61 | 31: iteration 126710/ 173500 | consumed samples: 32437760 | consumed tokens: 66432532480 | elapsed time per iteration (s): 0.85 | learning rate: 5.099E-05 | global batch size: 256 | lm loss: 1.925609E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.728 | TFLOPs: 18.25 | 31: iteration 126720/ 173500 | consumed samples: 32440320 | consumed tokens: 66437775360 | elapsed time per iteration (s): 0.81 | learning rate: 5.098E-05 | global batch size: 256 | lm loss: 1.930384E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.284 | TFLOPs: 19.13 | 31: iteration 126730/ 173500 | consumed samples: 32442880 | consumed tokens: 66443018240 | elapsed time per iteration (s): 0.79 | learning rate: 5.097E-05 | global batch size: 256 | lm loss: 1.937919E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.848 | TFLOPs: 19.65 | 31: iteration 126740/ 173500 | consumed samples: 32445440 | consumed tokens: 66448261120 | elapsed time per iteration (s): 0.81 | learning rate: 
5.096E-05 | global batch size: 256 | lm loss: 1.935421E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.310 | TFLOPs: 19.01 | 31: iteration 126750/ 173500 | consumed samples: 32448000 | consumed tokens: 66453504000 | elapsed time per iteration (s): 0.79 | learning rate: 5.094E-05 | global batch size: 256 | lm loss: 1.937499E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.913 | TFLOPs: 19.72 | 31: iteration 126760/ 173500 | consumed samples: 32450560 | consumed tokens: 66458746880 | elapsed time per iteration (s): 0.82 | learning rate: 5.093E-05 | global batch size: 256 | lm loss: 1.926273E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.129 | TFLOPs: 18.88 | 31: iteration 126770/ 173500 | consumed samples: 32453120 | consumed tokens: 66463989760 | elapsed time per iteration (s): 0.80 | learning rate: 5.092E-05 | global batch size: 256 | lm loss: 1.955022E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.991 | TFLOPs: 19.48 | 31: iteration 126780/ 173500 | consumed samples: 32455680 | consumed tokens: 66469232640 | elapsed time per iteration (s): 0.83 | learning rate: 5.091E-05 | global batch size: 256 | lm loss: 1.965435E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.846 | TFLOPs: 18.62 | 31: iteration 126790/ 173500 | consumed samples: 32458240 | consumed tokens: 66474475520 | elapsed time per iteration (s): 0.84 | learning rate: 5.090E-05 | global batch size: 256 | lm loss: 1.942949E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.076 | TFLOPs: 18.34 | 31: iteration 126800/ 173500 | 
consumed samples: 32460800 | consumed tokens: 66479718400 | elapsed time per iteration (s): 0.84 | learning rate: 5.088E-05 | global batch size: 256 | lm loss: 1.944364E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.208 | TFLOPs: 18.52 | 31: iteration 126810/ 173500 | consumed samples: 32463360 | consumed tokens: 66484961280 | elapsed time per iteration (s): 1.03 | learning rate: 5.087E-05 | global batch size: 256 | lm loss: 1.949169E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.116 | TFLOPs: 15.07 | 31: iteration 126820/ 173500 | consumed samples: 32465920 | consumed tokens: 66490204160 | elapsed time per iteration (s): 0.82 | learning rate: 5.086E-05 | global batch size: 256 | lm loss: 1.933500E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.677 | TFLOPs: 18.98 | 31: iteration 126830/ 173500 | consumed samples: 32468480 | consumed tokens: 66495447040 | elapsed time per iteration (s): 0.82 | learning rate: 5.085E-05 | global batch size: 256 | lm loss: 1.935792E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.462 | TFLOPs: 18.90 | 31: iteration 126840/ 173500 | consumed samples: 32471040 | consumed tokens: 66500689920 | elapsed time per iteration (s): 0.82 | learning rate: 5.083E-05 | global batch size: 256 | lm loss: 1.920005E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.199 | TFLOPs: 18.95 | 31: iteration 126850/ 173500 | consumed samples: 32473600 | consumed tokens: 66505932800 | elapsed time per iteration (s): 0.80 | learning rate: 5.082E-05 | global batch size: 256 | lm loss: 1.948444E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 320.137 | TFLOPs: 19.37 | 31: iteration 126860/ 173500 | consumed samples: 32476160 | consumed tokens: 66511175680 | elapsed time per iteration (s): 0.77 | learning rate: 5.081E-05 | global batch size: 256 | lm loss: 1.921863E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.508 | TFLOPs: 19.99 | 31: iteration 126870/ 173500 | consumed samples: 32478720 | consumed tokens: 66516418560 | elapsed time per iteration (s): 0.81 | learning rate: 5.080E-05 | global batch size: 256 | lm loss: 1.940957E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.332 | TFLOPs: 19.14 | 31: iteration 126880/ 173500 | consumed samples: 32481280 | consumed tokens: 66521661440 | elapsed time per iteration (s): 0.78 | learning rate: 5.078E-05 | global batch size: 256 | lm loss: 1.937053E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.342 | TFLOPs: 19.74 | 31: iteration 126890/ 173500 | consumed samples: 32483840 | consumed tokens: 66526904320 | elapsed time per iteration (s): 0.79 | learning rate: 5.077E-05 | global batch size: 256 | lm loss: 1.910242E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.441 | TFLOPs: 19.63 | 31: iteration 126900/ 173500 | consumed samples: 32486400 | consumed tokens: 66532147200 | elapsed time per iteration (s): 0.91 | learning rate: 5.076E-05 | global batch size: 256 | lm loss: 1.931786E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 281.988 | TFLOPs: 17.06 | 31: iteration 126910/ 173500 | consumed samples: 32488960 | consumed tokens: 66537390080 | elapsed time per iteration (s): 0.81 | learning rate: 
5.075E-05 | global batch size: 256 | lm loss: 1.945397E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.415 | TFLOPs: 19.14 | 31: iteration 126920/ 173500 | consumed samples: 32491520 | consumed tokens: 66542632960 | elapsed time per iteration (s): 0.78 | learning rate: 5.073E-05 | global batch size: 256 | lm loss: 1.949873E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.847 | TFLOPs: 19.83 | 31: iteration 126930/ 173500 | consumed samples: 32494080 | consumed tokens: 66547875840 | elapsed time per iteration (s): 0.77 | learning rate: 5.072E-05 | global batch size: 256 | lm loss: 1.947028E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.951 | TFLOPs: 20.20 | 31: iteration 126940/ 173500 | consumed samples: 32496640 | consumed tokens: 66553118720 | elapsed time per iteration (s): 0.82 | learning rate: 5.071E-05 | global batch size: 256 | lm loss: 1.956828E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.454 | TFLOPs: 18.90 | 31: iteration 126950/ 173500 | consumed samples: 32499200 | consumed tokens: 66558361600 | elapsed time per iteration (s): 0.79 | learning rate: 5.070E-05 | global batch size: 256 | lm loss: 1.957233E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.450 | TFLOPs: 19.57 | 31: iteration 126960/ 173500 | consumed samples: 32501760 | consumed tokens: 66563604480 | elapsed time per iteration (s): 0.80 | learning rate: 5.068E-05 | global batch size: 256 | lm loss: 1.975418E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.403 | TFLOPs: 19.32 | 31: iteration 126970/ 173500 | 
consumed samples: 32504320 | consumed tokens: 66568847360 | elapsed time per iteration (s): 0.82 | learning rate: 5.067E-05 | global batch size: 256 | lm loss: 1.941797E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.670 | TFLOPs: 18.86 | 31: iteration 126980/ 173500 | consumed samples: 32506880 | consumed tokens: 66574090240 | elapsed time per iteration (s): 0.82 | learning rate: 5.066E-05 | global batch size: 256 | lm loss: 1.920439E+00 | grad norm: 0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.820 | TFLOPs: 18.99 | 31: iteration 126990/ 173500 | consumed samples: 32509440 | consumed tokens: 66579333120 | elapsed time per iteration (s): 0.78 | learning rate: 5.065E-05 | global batch size: 256 | lm loss: 1.958212E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.993 | TFLOPs: 19.90 | 31: iteration 127000/ 173500 | consumed samples: 32512000 | consumed tokens: 66584576000 | elapsed time per iteration (s): 0.84 | learning rate: 5.064E-05 | global batch size: 256 | lm loss: 1.934016E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.760 | TFLOPs: 18.38 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 127000 | lm loss value: 1.893294E+00 | lm loss PPL: 6.641210E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 127000 to checkpoints_1b1long 0: [2022-11-26 22:43:06,727] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step127000 is begin to save! 
0: [2022-11-26 22:43:06,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_01-model_00-model_states.pt... 0: [2022-11-26 22:43:06,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_01-model_00-model_states.pt. 0: [2022-11-26 22:43:06,954] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_03-model_00-model_states.pt... 0: [2022-11-26 22:43:07,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_03-model_00-model_states.pt. 0: [2022-11-26 22:43:07,034] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_04-model_00-model_states.pt... 0: [2022-11-26 22:43:07,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_04-model_00-model_states.pt. 0: [2022-11-26 22:43:07,121] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_05-model_00-model_states.pt... 0: [2022-11-26 22:43:07,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_05-model_00-model_states.pt. 0: [2022-11-26 22:43:07,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_06-model_00-model_states.pt... 0: [2022-11-26 22:43:07,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_06-model_00-model_states.pt. 0: [2022-11-26 22:43:07,275] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_07-model_00-model_states.pt... 0: [2022-11-26 22:43:07,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 22:43:07,352] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_08-model_00-model_states.pt... 0: [2022-11-26 22:43:07,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_08-model_00-model_states.pt. 0: [2022-11-26 22:43:07,434] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_09-model_00-model_states.pt... 0: [2022-11-26 22:43:07,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_09-model_00-model_states.pt. 0: [2022-11-26 22:43:07,512] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_10-model_00-model_states.pt... 0: [2022-11-26 22:43:07,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_10-model_00-model_states.pt. 0: [2022-11-26 22:43:07,587] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_11-model_00-model_states.pt... 0: [2022-11-26 22:43:07,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_11-model_00-model_states.pt. 0: [2022-11-26 22:43:07,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_12-model_00-model_states.pt... 0: [2022-11-26 22:43:07,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_12-model_00-model_states.pt. 0: [2022-11-26 22:43:07,736] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_13-model_00-model_states.pt... 0: [2022-11-26 22:43:07,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 22:43:07,810] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_14-model_00-model_states.pt... 0: [2022-11-26 22:43:07,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_14-model_00-model_states.pt. 0: [2022-11-26 22:43:07,887] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_15-model_00-model_states.pt... 0: [2022-11-26 22:43:07,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_15-model_00-model_states.pt. 0: [2022-11-26 22:43:07,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_16-model_00-model_states.pt... 0: [2022-11-26 22:43:08,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_16-model_00-model_states.pt. 0: [2022-11-26 22:43:08,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_17-model_00-model_states.pt... 0: [2022-11-26 22:43:08,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_17-model_00-model_states.pt. 0: [2022-11-26 22:43:08,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_18-model_00-model_states.pt... 0: [2022-11-26 22:43:08,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_18-model_00-model_states.pt. 0: [2022-11-26 22:43:08,189] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_19-model_00-model_states.pt... 0: [2022-11-26 22:43:08,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 22:43:08,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_20-model_00-model_states.pt... 0: [2022-11-26 22:43:08,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_20-model_00-model_states.pt. 0: [2022-11-26 22:43:08,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_21-model_00-model_states.pt... 0: [2022-11-26 22:43:08,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_21-model_00-model_states.pt. 0: [2022-11-26 22:43:08,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_22-model_00-model_states.pt... 0: [2022-11-26 22:43:08,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_22-model_00-model_states.pt. 0: [2022-11-26 22:43:08,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_23-model_00-model_states.pt... 0: [2022-11-26 22:43:08,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_23-model_00-model_states.pt. 0: [2022-11-26 22:43:08,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_24-model_00-model_states.pt... 0: [2022-11-26 22:43:08,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_24-model_00-model_states.pt. 0: [2022-11-26 22:43:08,645] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_25-model_00-model_states.pt... 0: [2022-11-26 22:43:08,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 22:43:08,721] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_26-model_00-model_states.pt... 0: [2022-11-26 22:43:08,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_26-model_00-model_states.pt. 0: [2022-11-26 22:43:08,795] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_27-model_00-model_states.pt... 0: [2022-11-26 22:43:08,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_27-model_00-model_states.pt. 0: [2022-11-26 22:43:08,872] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_28-model_00-model_states.pt... 0: [2022-11-26 22:43:08,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_28-model_00-model_states.pt. 0: [2022-11-26 22:43:08,946] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/layer_30-model_00-model_states.pt... 0: [2022-11-26 22:43:08,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/layer_30-model_00-model_states.pt. 0: [2022-11-26 22:43:08,949] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step127000/mp_rank_00_model_states.pt 0: [2022-11-26 22:43:08,949] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/mp_rank_00_model_states.pt... 0: [2022-11-26 22:43:08,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/mp_rank_00_model_states.pt. 0: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
0: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
7: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
1: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
3: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 
23: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 
14: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 
18: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 
27: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
4: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
10: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
2: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
15: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 
24: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 
30: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 
18: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
5: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
3: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
14: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 
21: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:43:09,028] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
8: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
24: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 
26: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:43:09,028] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 
20: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 
13: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:43:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:43:09,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:43:09,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 22:43:09,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 19: [2022-11-26 22:43:09,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:43:09,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 22:43:09,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 24: [2022-11-26 22:43:09,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:43:09,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 22:43:09,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 
15: [2022-11-26 22:43:09,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:43:09,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 22:43:09,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 1: [2022-11-26 22:43:09,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:43:09,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 22:43:09,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 4: [2022-11-26 22:43:09,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:43:09,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 22:43:09,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 6: [2022-11-26 22:43:09,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:43:09,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 22:43:09,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 
13: [2022-11-26 22:43:09,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:43:09,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 22:43:09,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 17: [2022-11-26 22:43:09,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:43:09,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 22:43:09,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 29: [2022-11-26 22:43:09,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:43:09,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:43:09,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-26 22:43:09,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 0: [2022-11-26 22:43:09,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 8: [2022-11-26 22:43:09,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 0: [2022-11-26 22:43:09,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 29: [2022-11-26 22:43:09,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 22:43:09,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 30: [2022-11-26 22:43:09,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 22:43:09,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 22:43:09,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 18: [2022-11-26 22:43:09,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:43:09,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 22:43:09,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 
22: [2022-11-26 22:43:09,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:43:09,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:43:09,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 22:43:09,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 15: [2022-11-26 22:43:09,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 22:43:09,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 9: [2022-11-26 22:43:09,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:43:09,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:43:09,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 26: [2022-11-26 22:43:09,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 9: [2022-11-26 22:43:09,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 26: [2022-11-26 22:43:09,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 
17: [2022-11-26 22:43:09,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:43:09,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:43:09,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 26: [2022-11-26 22:43:09,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 9: [2022-11-26 22:43:09,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:43:09,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 9: [2022-11-26 22:43:09,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 22: [2022-11-26 22:43:09,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:43:09,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 22: [2022-11-26 22:43:09,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 9: [2022-11-26 22:43:09,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 
20: [2022-11-26 22:43:09,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:43:09,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 20: [2022-11-26 22:43:09,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 22:43:09,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 27: [2022-11-26 22:43:09,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:43:09,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:43:09,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:43:09,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 22:43:09,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 4: [2022-11-26 22:43:09,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 22:43:09,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 
27: [2022-11-26 22:43:09,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 22:43:09,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 27: [2022-11-26 22:43:09,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:43:09,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 22:43:09,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 14: [2022-11-26 22:43:09,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:43:09,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 12: [2022-11-26 22:43:09,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:43:09,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 12: [2022-11-26 22:43:09,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 22:43:09,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 14: [2022-11-26 22:43:09,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
19: [2022-11-26 22:43:09,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:43:09,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 19: [2022-11-26 22:43:09,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 14: [2022-11-26 22:43:09,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 19: [2022-11-26 22:43:09,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 12: [2022-11-26 22:43:09,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:43:09,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 22:43:09,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 29: [2022-11-26 22:43:09,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:43:09,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 22:43:09,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 3: [2022-11-26 22:43:09,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 22:43:09,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 30: [2022-11-26 22:43:09,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:43:09,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 30: [2022-11-26 22:43:09,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 3: [2022-11-26 22:43:09,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 30: [2022-11-26 22:43:09,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 3: [2022-11-26 22:43:09,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 31: [2022-11-26 22:43:09,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:43:09,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 31: [2022-11-26 22:43:09,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 22:43:09,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 1: [2022-11-26 22:43:09,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 22:43:09,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 24: [2022-11-26 22:43:09,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:43:09,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 24: [2022-11-26 22:43:09,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 22:43:09,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 23: [2022-11-26 22:43:09,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:43:09,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 22:43:09,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:43:09,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 23: [2022-11-26 22:43:09,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 22:43:09,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 21: [2022-11-26 22:43:09,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
21: [2022-11-26 22:43:09,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 22:43:09,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 15: [2022-11-26 22:43:09,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:43:09,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 22:43:09,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 26: [2022-11-26 22:43:09,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:43:09,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 22:43:09,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 0: [2022-11-26 22:43:09,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:43:09,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 22:43:09,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 18: [2022-11-26 22:43:09,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-26 22:43:09,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 7: [2022-11-26 22:43:09,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:43:09,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 18: [2022-11-26 22:43:09,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 7: [2022-11-26 22:43:09,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 5: [2022-11-26 22:43:09,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:43:09,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:43:09,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 22:43:09,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 22:43:09,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 5: [2022-11-26 22:43:09,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 8: [2022-11-26 22:43:09,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-26 22:43:09,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 22:43:09,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 22: [2022-11-26 22:43:09,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:43:09,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 22:43:09,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 28: [2022-11-26 22:43:09,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:43:09,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:43:09,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:43:09,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 22:43:09,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 27: [2022-11-26 22:43:09,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 22:43:09,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 
3: [2022-11-26 22:43:09,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:43:09,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 22:43:09,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 26: [2022-11-26 22:43:09,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:43:09,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 22:43:09,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 10: [2022-11-26 22:43:09,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:43:09,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:43:09,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 22:43:09,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 22:43:09,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 10: [2022-11-26 22:43:09,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 
18: [2022-11-26 22:43:09,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:43:09,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 22:43:09,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 10: [2022-11-26 22:43:09,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:43:09,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:43:09,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 9: [2022-11-26 22:43:09,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 22:43:09,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 6: [2022-11-26 22:43:09,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:43:09,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 6: [2022-11-26 22:43:09,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 22:43:09,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 
8: [2022-11-26 22:43:09,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:43:09,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 22:43:09,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 7: [2022-11-26 22:43:09,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:43:09,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:43:09,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:43:09,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 22:43:09,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 13: [2022-11-26 22:43:09,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 27: [2022-11-26 22:43:09,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 13: [2022-11-26 22:43:09,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 27: [2022-11-26 22:43:09,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 
1: [2022-11-26 22:43:09,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:43:09,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 22:43:09,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 14: [2022-11-26 22:43:09,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:43:09,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 22:43:09,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 19: [2022-11-26 22:43:09,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:43:09,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 22:43:09,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 15: [2022-11-26 22:43:09,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:43:09,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 22:43:09,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 
30: [2022-11-26 22:43:09,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 22:43:09,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 22:43:09,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 2: [2022-11-26 22:43:09,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:43:09,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:43:09,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 22:43:09,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 22:43:09,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 2: [2022-11-26 22:43:09,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 22: [2022-11-26 22:43:09,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-26 22:43:09,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 24: [2022-11-26 22:43:09,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:43:09,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:43:09,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 24: [2022-11-26 22:43:09,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 19: [2022-11-26 22:43:09,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 24: [2022-11-26 22:43:09,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 19: [2022-11-26 22:43:09,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 11: [2022-11-26 22:43:09,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:43:09,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 22:43:09,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 31: [2022-11-26 22:43:09,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 
31: [2022-11-26 22:43:09,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 22:43:09,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 2: [2022-11-26 22:43:09,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:43:09,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 31: [2022-11-26 22:43:09,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:43:09,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:43:09,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 31: [2022-11-26 22:43:09,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 16: [2022-11-26 22:43:09,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 22:43:09,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 31: [2022-11-26 22:43:09,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 2: [2022-11-26 22:43:09,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
16: [2022-11-26 22:43:09,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:43:09,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 17: [2022-11-26 22:43:09,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:43:09,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 2: [2022-11-26 22:43:09,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 20: [2022-11-26 22:43:09,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:43:09,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:43:09,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 2: [2022-11-26 22:43:09,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 20: [2022-11-26 22:43:09,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 17: [2022-11-26 22:43:09,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 
3: [2022-11-26 22:43:09,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:43:09,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 11: [2022-11-26 22:43:09,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 17: [2022-11-26 22:43:09,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:43:09,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 11: [2022-11-26 22:43:09,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 17: [2022-11-26 22:43:09,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 3: [2022-11-26 22:43:09,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 17: [2022-11-26 22:43:09,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 29: [2022-11-26 22:43:09,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:43:09,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 22:43:09,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 
6: [2022-11-26 22:43:09,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:43:09,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 22:43:09,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 28: [2022-11-26 22:43:09,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 22:43:09,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 28: [2022-11-26 22:43:09,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 22:43:09,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 22:43:09,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 28: [2022-11-26 22:43:09,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:43:09,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 28: [2022-11-26 22:43:09,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 22:43:09,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 
20: [2022-11-26 22:43:09,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 22:43:09,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 4: [2022-11-26 22:43:09,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:43:09,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 22:43:09,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 14: [2022-11-26 22:43:09,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:43:09,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 22:43:09,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 23: [2022-11-26 22:43:09,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:43:09,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-26 22:43:09,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 22:43:09,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 22:43:09,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 23: [2022-11-26 22:43:09,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 30: [2022-11-26 22:43:09,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-26 22:43:09,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 22:43:09,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 5: [2022-11-26 22:43:09,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:43:09,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 22:43:09,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 22:43:09,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 11: [2022-11-26 22:43:09,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:43:09,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 11: [2022-11-26 22:43:09,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 5: [2022-11-26 22:43:09,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 11: [2022-11-26 22:43:09,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 18: [2022-11-26 22:43:09,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:43:09,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 22:43:09,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 21: [2022-11-26 22:43:09,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-26 22:43:09,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 22:43:09,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 31: [2022-11-26 22:43:09,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:43:09,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 22:43:09,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 0: [2022-11-26 22:43:09,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:43:09,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:43:09,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 22:43:09,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 0: [2022-11-26 22:43:09,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 22:43:09,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 21: [2022-11-26 22:43:09,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-26 22:43:09,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:43:09,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 22:43:09,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 22:43:09,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 21: [2022-11-26 22:43:09,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 29: [2022-11-26 22:43:09,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:43:09,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 22:43:09,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 4: [2022-11-26 22:43:09,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:43:09,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 22:43:09,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 16: [2022-11-26 22:43:09,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
16: [2022-11-26 22:43:09,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 22:43:09,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 9: [2022-11-26 22:43:09,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:43:09,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 22:43:09,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 24: [2022-11-26 22:43:09,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:43:09,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 22:43:09,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 12: [2022-11-26 22:43:09,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:43:09,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
12: [2022-11-26 22:43:09,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 8: [2022-11-26 22:43:09,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 12: [2022-11-26 22:43:09,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 8: [2022-11-26 22:43:09,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 25: [2022-11-26 22:43:09,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:43:09,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:43:09,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:43:09,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:43:09,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:43:09,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 22:43:09,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 
25: [2022-11-26 22:43:09,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 22:43:09,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 22:43:09,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 22:43:09,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 22:43:09,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 25: [2022-11-26 22:43:09,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 25: [2022-11-26 22:43:09,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 25: [2022-11-26 22:43:09,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 1: [2022-11-26 22:43:09,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:43:09,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 22:43:09,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 26: [2022-11-26 22:43:09,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-26 22:43:09,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 22:43:09,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 0: [2022-11-26 22:43:09,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:43:09,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:43:09,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 22:43:09,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 4: [2022-11-26 22:43:09,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:43:09,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 22:43:09,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 12: [2022-11-26 22:43:09,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:43:09,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 22:43:09,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 
5: [2022-11-26 22:43:09,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:43:09,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 0: [2022-11-26 22:43:09,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 5: [2022-11-26 22:43:09,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 0: [2022-11-26 22:43:09,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 13: [2022-11-26 22:43:09,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:43:09,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 22:43:09,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 20: [2022-11-26 22:43:09,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:43:09,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 22:43:09,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 19: [2022-11-26 22:43:09,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
19: [2022-11-26 22:43:09,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 22:43:09,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 2: [2022-11-26 22:43:09,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:43:09,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 22:43:09,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 25: [2022-11-26 22:43:09,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:43:09,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 22:43:09,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 9: [2022-11-26 22:43:09,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:43:09,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 22:43:09,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 3: [2022-11-26 22:43:09,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 22:43:09,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 22:43:09,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 27: [2022-11-26 22:43:09,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:43:09,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 22:43:09,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 6: [2022-11-26 22:43:09,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:43:09,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 22:43:09,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 22: [2022-11-26 22:43:09,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:43:09,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 22:43:09,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 10: [2022-11-26 22:43:09,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 22:43:09,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 22:43:09,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 23: [2022-11-26 22:43:09,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:43:09,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 22:43:09,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 17: [2022-11-26 22:43:09,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:43:09,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 22:43:09,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 29: [2022-11-26 22:43:09,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:43:09,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 22:43:09,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 24: [2022-11-26 22:43:09,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
24: [2022-11-26 22:43:09,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 22:43:09,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 8: [2022-11-26 22:43:09,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:43:09,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 22:43:09,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 16: [2022-11-26 22:43:09,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:43:09,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 22:43:09,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 11: [2022-11-26 22:43:09,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:43:09,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
11: [2022-11-26 22:43:09,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 15: [2022-11-26 22:43:09,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 11: [2022-11-26 22:43:09,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 28: [2022-11-26 22:43:09,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:43:09,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 28: [2022-11-26 22:43:09,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 22:43:09,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 7: [2022-11-26 22:43:09,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:43:09,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:43:09,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 22:43:09,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 
1: [2022-11-26 22:43:09,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 22:43:09,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 30: [2022-11-26 22:43:09,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:43:09,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 30: [2022-11-26 22:43:09,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 14: [2022-11-26 22:43:09,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 30: [2022-11-26 22:43:09,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 14: [2022-11-26 22:43:09,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 0: [2022-11-26 22:43:09,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:43:09,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 22:43:09,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 31: [2022-11-26 22:43:09,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
31: [2022-11-26 22:43:09,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 22:43:09,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 21: [2022-11-26 22:43:09,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:43:09,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 22:43:09,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 13: [2022-11-26 22:43:09,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:43:09,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 22:43:09,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 5: [2022-11-26 22:43:09,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:43:09,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 22:43:09,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 18: [2022-11-26 22:43:09,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
18: [2022-11-26 22:43:09,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 22:43:09,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 4: [2022-11-26 22:43:09,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:43:09,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 22:43:09,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 12: [2022-11-26 22:43:09,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:43:09,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 22:43:09,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 26: [2022-11-26 22:43:09,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:43:09,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 22:43:09,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 20: [2022-11-26 22:43:09,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 
20: [2022-11-26 22:43:09,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 22:43:09,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 6: [2022-11-26 22:43:09,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:43:09,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 22:43:09,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 3: [2022-11-26 22:43:09,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:43:09,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 22:43:09,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 9: [2022-11-26 22:43:09,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:43:09,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 22:43:09,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 2: [2022-11-26 22:43:09,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 22:43:09,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 22:43:09,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 22: [2022-11-26 22:43:09,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:43:09,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 22:43:09,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 25: [2022-11-26 22:43:09,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:43:09,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 22:43:09,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 19: [2022-11-26 22:43:09,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:43:09,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 22:43:09,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 29: [2022-11-26 22:43:09,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
29: [2022-11-26 22:43:09,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 22:43:09,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 10: [2022-11-26 22:43:09,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:43:09,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 22:43:09,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 27: [2022-11-26 22:43:09,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:43:09,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:43:09,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 17: [2022-11-26 22:43:09,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 27: [2022-11-26 22:43:09,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 17: [2022-11-26 22:43:09,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 15: [2022-11-26 22:43:09,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 22:43:09,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 22:43:09,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 23: [2022-11-26 22:43:09,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:43:09,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 22:43:09,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 11: [2022-11-26 22:43:09,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:43:09,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 22:43:09,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 24: [2022-11-26 22:43:09,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:43:09,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 22:43:09,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 16: [2022-11-26 22:43:09,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 
16: [2022-11-26 22:43:09,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 22:43:09,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 7: [2022-11-26 22:43:09,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:43:09,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:43:09,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 22:43:09,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 1: [2022-11-26 22:43:09,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 22:43:09,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 0: [2022-11-26 22:43:09,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:43:09,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 22:43:09,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 8: [2022-11-26 22:43:09,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 22:43:09,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 22:43:09,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 5: [2022-11-26 22:43:09,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:43:09,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:43:09,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:43:09,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 22:43:09,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 13: [2022-11-26 22:43:09,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 22:43:09,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 31: [2022-11-26 22:43:09,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 22:43:09,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 4: [2022-11-26 22:43:09,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
28: [2022-11-26 22:43:09,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:43:09,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 22:43:09,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 21: [2022-11-26 22:43:09,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:43:09,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 22:43:09,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 28: [2022-11-26 22:43:09,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 22:43:09,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 14: [2022-11-26 22:43:09,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:43:09,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 22:43:09,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 30: [2022-11-26 22:43:09,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-26 22:43:09,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 22:43:09,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 12: [2022-11-26 22:43:09,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:43:09,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 22:43:09,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 26: [2022-11-26 22:43:09,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:43:09,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 22:43:09,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 18: [2022-11-26 22:43:09,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:43:09,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 22:43:09,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 19: [2022-11-26 22:43:09,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
19: [2022-11-26 22:43:09,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 22:43:09,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 25: [2022-11-26 22:43:09,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:43:09,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 22:43:09,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 20: [2022-11-26 22:43:09,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:43:09,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 9: [2022-11-26 22:43:09,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:43:09,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 9: [2022-11-26 22:43:09,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 22:43:09,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 22: [2022-11-26 22:43:09,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
22: [2022-11-26 22:43:09,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 22:43:09,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 3: [2022-11-26 22:43:09,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:43:09,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 22:43:09,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 2: [2022-11-26 22:43:09,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:43:09,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 22:43:09,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 6: [2022-11-26 22:43:09,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:43:09,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
6: [2022-11-26 22:43:09,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 29: [2022-11-26 22:43:09,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 6: [2022-11-26 22:43:09,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 29: [2022-11-26 22:43:09,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 15: [2022-11-26 22:43:09,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:43:09,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 22:43:09,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 18: [2022-11-26 22:43:09,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:43:09,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 22:43:09,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 0: [2022-11-26 22:43:09,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-26 22:43:09,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 22:43:09,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 10: [2022-11-26 22:43:09,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:43:09,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 22:43:09,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 24: [2022-11-26 22:43:09,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:43:09,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 22:43:09,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 27: [2022-11-26 22:43:09,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:43:09,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 22:43:09,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 30: [2022-11-26 22:43:09,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 
30: [2022-11-26 22:43:09,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 22:43:09,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 17: [2022-11-26 22:43:09,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:43:09,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 22:43:09,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 31: [2022-11-26 22:43:09,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:43:09,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 22:43:09,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 23: [2022-11-26 22:43:09,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:43:09,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 22:43:09,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 8: [2022-11-26 22:43:09,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-26 22:43:09,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 22:43:09,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 11: [2022-11-26 22:43:09,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:43:09,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 22:43:09,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 7: [2022-11-26 22:43:09,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:43:09,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 22:43:09,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 14: [2022-11-26 22:43:09,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:43:09,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 22:43:09,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 21: [2022-11-26 22:43:09,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
21: [2022-11-26 22:43:09,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 22:43:09,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 15: [2022-11-26 22:43:09,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:43:09,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 22:43:09,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 12: [2022-11-26 22:43:09,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:43:09,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 22:43:09,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 22: [2022-11-26 22:43:09,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:43:09,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 22:43:09,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 5: [2022-11-26 22:43:09,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 22:43:09,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 22:43:09,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 1: [2022-11-26 22:43:09,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:43:09,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 22:43:09,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 16: [2022-11-26 22:43:09,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:43:09,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 22:43:09,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 20: [2022-11-26 22:43:09,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:43:09,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 22:43:09,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 9: [2022-11-26 22:43:09,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
13: [2022-11-26 22:43:09,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:43:09,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 22:43:09,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 13: [2022-11-26 22:43:09,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 22:43:09,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 27: [2022-11-26 22:43:09,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:43:09,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 22:43:09,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 28: [2022-11-26 22:43:09,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:43:09,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 28: [2022-11-26 22:43:09,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 22:43:09,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 
29: [2022-11-26 22:43:09,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 22:43:09,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 4: [2022-11-26 22:43:09,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:43:09,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 22:43:09,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 2: [2022-11-26 22:43:09,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:43:09,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:43:09,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 18: [2022-11-26 22:43:09,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 3: [2022-11-26 22:43:09,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:43:09,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 
3: [2022-11-26 22:43:09,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 2: [2022-11-26 22:43:09,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 3: [2022-11-26 22:43:09,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 24: [2022-11-26 22:43:09,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:43:09,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:43:09,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 26: [2022-11-26 22:43:09,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 22:43:09,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 24: [2022-11-26 22:43:09,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 6: [2022-11-26 22:43:09,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:43:09,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 22:43:09,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 
19: [2022-11-26 22:43:09,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:43:09,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 22:43:09,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 10: [2022-11-26 22:43:09,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:43:09,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 22:43:09,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 25: [2022-11-26 22:43:09,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:43:09,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 22:43:09,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 28: [2022-11-26 22:43:09,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:43:09,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-26 22:43:09,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 22:43:09,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 11: [2022-11-26 22:43:09,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:43:09,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 22:43:09,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 0: [2022-11-26 22:43:09,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:43:09,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 22:43:09,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 23: [2022-11-26 22:43:09,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 28: [2022-11-26 22:43:09,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 22:43:09,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 
23: [2022-11-26 22:43:09,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 22:43:09,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 11: [2022-11-26 22:43:09,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:43:09,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:43:09,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 31: [2022-11-26 22:43:09,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 11: [2022-11-26 22:43:09,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 31: [2022-11-26 22:43:09,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 7: [2022-11-26 22:43:09,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:43:09,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 22:43:09,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 8: [2022-11-26 22:43:09,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
10: [2022-11-26 22:43:09,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:43:09,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 10: [2022-11-26 22:43:09,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 8: [2022-11-26 22:43:09,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 10: [2022-11-26 22:43:09,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 16: [2022-11-26 22:43:09,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:43:09,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 22:43:09,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 21: [2022-11-26 22:43:09,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:43:09,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 22:43:09,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 30: [2022-11-26 22:43:09,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-26 22:43:09,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 1: [2022-11-26 22:43:09,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 30: [2022-11-26 22:43:09,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 1: [2022-11-26 22:43:09,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 22:43:09,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 13: [2022-11-26 22:43:09,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:43:09,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 22:43:09,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 14: [2022-11-26 22:43:09,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:43:09,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 22:43:09,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now! 28: [2022-11-26 22:43:09,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
28: [2022-11-26 22:43:09,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt
28: [2022-11-26 22:43:09,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now!
16: [2022-11-26 22:43:09,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt.
16: [2022-11-26 22:43:09,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step127000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt
16: [2022-11-26 22:43:09,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step127000 is ready now!
0: successfully saved checkpoint at iteration 127000 to checkpoints_1b1long
31: time (ms) | save-checkpoint: 2549.47
31: iteration 127010/ 173500 | consumed samples: 32514560 | consumed tokens: 66589818880 | elapsed time per iteration (s): 1.08 | learning rate: 5.062E-05 | global batch size: 256 | lm loss: 1.944118E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.028 | TFLOPs: 14.34 |
31: iteration 127020/ 173500 | consumed samples: 32517120 | consumed tokens: 66595061760 | elapsed time per iteration (s): 0.79 | learning rate: 5.061E-05 | global batch size: 256 | lm loss: 1.965573E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.774 | TFLOPs: 19.71 |
31: iteration 127030/ 173500 | consumed samples: 32519680 | consumed tokens: 66600304640 | elapsed time per iteration (s): 0.85 | learning rate: 5.060E-05 | global batch size: 256 | lm loss: 1.932037E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.833 | TFLOPs: 18.26 |
31: iteration 127040/ 173500 | consumed samples: 32522240 | consumed tokens: 66605547520 | elapsed time per iteration (s): 0.80 | learning rate: 5.059E-05 | global batch size: 256 | lm loss: 1.919380E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.439 | TFLOPs: 19.33 |
31: iteration 127050/ 173500 | consumed samples: 32524800 | consumed tokens: 66610790400 | elapsed time per iteration (s): 0.83 | learning rate: 5.057E-05 | global batch size: 256 | lm loss: 1.945092E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.051 | TFLOPs: 18.70 |
31: iteration 127060/ 173500 | consumed samples: 32527360 | consumed tokens: 66616033280 | elapsed time per iteration (s): 0.82 | learning rate: 5.056E-05 | global batch size: 256 | lm loss: 1.951414E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.326 | TFLOPs: 18.89 |
31: iteration 127070/ 173500 | consumed samples: 32529920 | consumed tokens: 66621276160 | elapsed time per iteration (s): 0.81 | learning rate: 5.055E-05 | global batch size: 256 | lm loss: 1.957336E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.302 | TFLOPs: 19.07 |
31: iteration 127080/ 173500 | consumed samples: 32532480 | consumed tokens: 66626519040 | elapsed time per iteration (s): 0.81 | learning rate: 5.054E-05 | global batch size: 256 | lm loss: 1.942595E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.648 | TFLOPs: 19.10 |
31: iteration 127090/ 173500 | consumed samples: 32535040 | consumed tokens: 66631761920 | elapsed time per iteration (s): 0.85 | learning rate: 5.052E-05 | global batch size: 256 | lm loss: 1.915637E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.654 | TFLOPs: 18.19 |
31: iteration 127100/ 173500 | consumed samples: 32537600 | consumed tokens: 66637004800 | elapsed time per iteration (s): 0.82 | learning rate: 5.051E-05 | global batch size: 256 | lm loss: 1.945179E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.185 | TFLOPs: 18.83 |
31: iteration 127110/ 173500 | consumed samples: 32540160 | consumed tokens: 66642247680 | elapsed time per iteration (s): 0.76 | learning rate: 5.050E-05 | global batch size: 256 | lm loss: 1.937825E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.032 | TFLOPs: 20.33 |
31: iteration 127120/ 173500 | consumed samples: 32542720 | consumed tokens: 66647490560 | elapsed time per iteration (s): 0.81 | learning rate: 5.049E-05 | global batch size: 256 | lm loss: 1.935824E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.378 | TFLOPs: 19.20 |
31: iteration 127130/ 173500 | consumed samples: 32545280 | consumed tokens: 66652733440 | elapsed time per iteration (s): 0.78 | learning rate: 5.047E-05 | global batch size: 256 | lm loss: 1.960906E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.901 | TFLOPs: 19.90 |
31: iteration 127140/ 173500 | consumed samples: 32547840 | consumed tokens: 66657976320 | elapsed time per iteration (s): 0.79 | learning rate: 5.046E-05 | global batch size: 256 | lm loss: 1.929873E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.071 | TFLOPs: 19.61 |
31: iteration 127150/ 173500 | consumed samples: 32550400 | consumed tokens: 66663219200 | elapsed time per iteration (s): 0.79 | learning rate: 5.045E-05 | global batch size: 256 | lm loss: 1.929134E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.441 | TFLOPs: 19.63 |
31: iteration 127160/ 173500 | consumed samples: 32552960 | consumed tokens: 66668462080 | elapsed time per iteration (s): 0.81 | learning rate: 5.044E-05 | global batch size: 256 | lm loss: 1.938416E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.944 | TFLOPs: 19.05 |
31: iteration 127170/ 173500 | consumed samples: 32555520 | consumed tokens: 66673704960 | elapsed time per iteration (s): 0.81 | learning rate: 5.042E-05 | global batch size: 256 | lm loss: 1.949031E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.474 | TFLOPs: 19.15 |
31: iteration 127180/ 173500 | consumed samples: 32558080 | consumed tokens: 66678947840 | elapsed time per iteration (s): 0.80 | learning rate: 5.041E-05 | global batch size: 256 | lm loss: 1.928561E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.417 | TFLOPs: 19.32 |
31: iteration 127190/ 173500 | consumed samples: 32560640 | consumed tokens: 66684190720 | elapsed time per iteration (s): 0.76 | learning rate: 5.040E-05 | global batch size: 256 | lm loss: 1.962194E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.525 | TFLOPs: 20.42 |
31: iteration 127200/ 173500 | consumed samples: 32563200 | consumed tokens: 66689433600 | elapsed time per iteration (s): 0.74 | learning rate: 5.039E-05 | global batch size: 256 | lm loss: 1.957048E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.107 | TFLOPs: 21.06 |
31: iteration 127210/ 173500 | consumed samples: 32565760 | consumed tokens: 66694676480 | elapsed time per iteration (s): 0.76 | learning rate: 5.038E-05 | global batch size: 256 | lm loss: 1.924629E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.795 | TFLOPs: 20.44 |
31: iteration 127220/ 173500 | consumed samples: 32568320 | consumed tokens: 66699919360 | elapsed time per iteration (s): 0.73 | learning rate: 5.036E-05 | global batch size: 256 | lm loss: 1.933745E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.629 | TFLOPs: 21.15 |
31: iteration 127230/ 173500 | consumed samples: 32570880 | consumed tokens: 66705162240 | elapsed time per iteration (s): 0.71 | learning rate: 5.035E-05 | global batch size: 256 | lm loss: 1.933837E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 362.166 | TFLOPs: 21.91 |
31: iteration 127240/ 173500 | consumed samples: 32573440 | consumed tokens: 66710405120 | elapsed time per iteration (s): 0.83 | learning rate: 5.034E-05 | global batch size: 256 | lm loss: 1.938103E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.887 | TFLOPs: 18.75 |
31: iteration 127250/ 173500 | consumed samples: 32576000 | consumed tokens: 66715648000 | elapsed time per iteration (s): 0.79 | learning rate: 5.033E-05 | global batch size: 256 | lm loss: 1.941095E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.325 | TFLOPs: 19.56 |
31: iteration 127260/ 173500 | consumed samples: 32578560 | consumed tokens: 66720890880 | elapsed time per iteration (s): 0.78 | learning rate: 5.031E-05 | global batch size: 256 | lm loss: 1.953524E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.343 | TFLOPs: 19.86 |
31: iteration 127270/ 173500 | consumed samples: 32581120 | consumed tokens: 66726133760 | elapsed time per iteration (s): 0.77 | learning rate: 5.030E-05 | global batch size: 256 | lm loss: 1.936203E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.811 | TFLOPs: 20.13 |
31: iteration 127280/ 173500 | consumed samples: 32583680 | consumed tokens: 66731376640 | elapsed time per iteration (s): 0.76 | learning rate: 5.029E-05 | global batch size: 256 | lm loss: 1.953776E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.813 | TFLOPs: 20.38 |
31: iteration 127290/ 173500 | consumed samples: 32586240 | consumed tokens: 66736619520 | elapsed time per iteration (s): 0.76 | learning rate: 5.028E-05 | global batch size: 256 | lm loss: 1.923942E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.887 | TFLOPs: 20.26 |
31: iteration 127300/ 173500 | consumed samples: 32588800 | consumed tokens: 66741862400 | elapsed time per iteration (s): 0.73 | learning rate: 5.026E-05 | global batch size: 256 | lm loss: 1.911633E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.879 | TFLOPs: 21.17 |
31: iteration 127310/ 173500 | consumed samples: 32591360 | consumed tokens: 66747105280 | elapsed time per iteration (s): 0.78 | learning rate: 5.025E-05 | global batch size: 256 | lm loss: 1.954225E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.281 | TFLOPs: 19.98 |
31: iteration 127320/ 173500 | consumed samples: 32593920 | consumed tokens: 66752348160 | elapsed time per iteration (s): 0.75 | learning rate: 
5.024E-05 | global batch size: 256 | lm loss: 1.963132E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.207 | TFLOPs: 20.70 | 31: iteration 127330/ 173500 | consumed samples: 32596480 | consumed tokens: 66757591040 | elapsed time per iteration (s): 0.74 | learning rate: 5.023E-05 | global batch size: 256 | lm loss: 1.958817E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.177 | TFLOPs: 20.82 | 31: iteration 127340/ 173500 | consumed samples: 32599040 | consumed tokens: 66762833920 | elapsed time per iteration (s): 0.87 | learning rate: 5.022E-05 | global batch size: 256 | lm loss: 1.936334E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.643 | TFLOPs: 17.83 | 31: iteration 127350/ 173500 | consumed samples: 32601600 | consumed tokens: 66768076800 | elapsed time per iteration (s): 0.78 | learning rate: 5.020E-05 | global batch size: 256 | lm loss: 1.941383E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.053 | TFLOPs: 19.91 | 31: iteration 127360/ 173500 | consumed samples: 32604160 | consumed tokens: 66773319680 | elapsed time per iteration (s): 0.75 | learning rate: 5.019E-05 | global batch size: 256 | lm loss: 1.925344E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.298 | TFLOPs: 20.59 | 31: iteration 127370/ 173500 | consumed samples: 32606720 | consumed tokens: 66778562560 | elapsed time per iteration (s): 0.78 | learning rate: 5.018E-05 | global batch size: 256 | lm loss: 1.913343E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.031 | TFLOPs: 19.85 | 31: iteration 127380/ 173500 | 
consumed samples: 32609280 | consumed tokens: 66783805440 | elapsed time per iteration (s): 0.79 | learning rate: 5.017E-05 | global batch size: 256 | lm loss: 1.973533E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.927 | TFLOPs: 19.72 | 31: iteration 127390/ 173500 | consumed samples: 32611840 | consumed tokens: 66789048320 | elapsed time per iteration (s): 0.75 | learning rate: 5.015E-05 | global batch size: 256 | lm loss: 1.931892E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.722 | TFLOPs: 20.61 | 31: iteration 127400/ 173500 | consumed samples: 32614400 | consumed tokens: 66794291200 | elapsed time per iteration (s): 0.75 | learning rate: 5.014E-05 | global batch size: 256 | lm loss: 1.944803E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.936 | TFLOPs: 20.63 | 31: iteration 127410/ 173500 | consumed samples: 32616960 | consumed tokens: 66799534080 | elapsed time per iteration (s): 0.78 | learning rate: 5.013E-05 | global batch size: 256 | lm loss: 1.968165E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.288 | TFLOPs: 19.98 | 31: iteration 127420/ 173500 | consumed samples: 32619520 | consumed tokens: 66804776960 | elapsed time per iteration (s): 0.75 | learning rate: 5.012E-05 | global batch size: 256 | lm loss: 1.956744E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.277 | TFLOPs: 20.71 | 31: iteration 127430/ 173500 | consumed samples: 32622080 | consumed tokens: 66810019840 | elapsed time per iteration (s): 0.82 | learning rate: 5.010E-05 | global batch size: 256 | lm loss: 1.930704E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 312.820 | TFLOPs: 18.92 | 31: iteration 127440/ 173500 | consumed samples: 32624640 | consumed tokens: 66815262720 | elapsed time per iteration (s): 0.78 | learning rate: 5.009E-05 | global batch size: 256 | lm loss: 1.902549E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.279 | TFLOPs: 19.86 | 31: iteration 127450/ 173500 | consumed samples: 32627200 | consumed tokens: 66820505600 | elapsed time per iteration (s): 0.77 | learning rate: 5.008E-05 | global batch size: 256 | lm loss: 1.923069E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.820 | TFLOPs: 20.20 | 31: iteration 127460/ 173500 | consumed samples: 32629760 | consumed tokens: 66825748480 | elapsed time per iteration (s): 0.81 | learning rate: 5.007E-05 | global batch size: 256 | lm loss: 1.931531E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.718 | TFLOPs: 19.10 | 31: iteration 127470/ 173500 | consumed samples: 32632320 | consumed tokens: 66830991360 | elapsed time per iteration (s): 0.79 | learning rate: 5.006E-05 | global batch size: 256 | lm loss: 1.948031E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.476 | TFLOPs: 19.63 | 31: iteration 127480/ 173500 | consumed samples: 32634880 | consumed tokens: 66836234240 | elapsed time per iteration (s): 0.73 | learning rate: 5.004E-05 | global batch size: 256 | lm loss: 1.927627E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.953 | TFLOPs: 21.23 | 31: iteration 127490/ 173500 | consumed samples: 32637440 | consumed tokens: 66841477120 | elapsed time per iteration (s): 0.72 | learning rate: 
5.003E-05 | global batch size: 256 | lm loss: 1.923379E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.005 | TFLOPs: 21.42 | 31: iteration 127500/ 173500 | consumed samples: 32640000 | consumed tokens: 66846720000 | elapsed time per iteration (s): 0.83 | learning rate: 5.002E-05 | global batch size: 256 | lm loss: 1.940861E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.751 | TFLOPs: 18.74 | 31: iteration 127510/ 173500 | consumed samples: 32642560 | consumed tokens: 66851962880 | elapsed time per iteration (s): 0.80 | learning rate: 5.001E-05 | global batch size: 256 | lm loss: 1.944797E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.635 | TFLOPs: 19.40 | 31: iteration 127520/ 173500 | consumed samples: 32645120 | consumed tokens: 66857205760 | elapsed time per iteration (s): 0.84 | learning rate: 4.999E-05 | global batch size: 256 | lm loss: 1.936623E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.058 | TFLOPs: 18.52 | 31: iteration 127530/ 173500 | consumed samples: 32647680 | consumed tokens: 66862448640 | elapsed time per iteration (s): 0.79 | learning rate: 4.998E-05 | global batch size: 256 | lm loss: 1.940969E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.962 | TFLOPs: 19.72 | 31: iteration 127540/ 173500 | consumed samples: 32650240 | consumed tokens: 66867691520 | elapsed time per iteration (s): 0.82 | learning rate: 4.997E-05 | global batch size: 256 | lm loss: 1.963414E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.719 | TFLOPs: 18.86 | 31: iteration 127550/ 173500 | 
consumed samples: 32652800 | consumed tokens: 66872934400 | elapsed time per iteration (s): 0.81 | learning rate: 4.996E-05 | global batch size: 256 | lm loss: 1.930935E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.987 | TFLOPs: 19.18 | 31: iteration 127560/ 173500 | consumed samples: 32655360 | consumed tokens: 66878177280 | elapsed time per iteration (s): 0.80 | learning rate: 4.995E-05 | global batch size: 256 | lm loss: 1.947429E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.972 | TFLOPs: 19.36 | 31: iteration 127570/ 173500 | consumed samples: 32657920 | consumed tokens: 66883420160 | elapsed time per iteration (s): 0.83 | learning rate: 4.993E-05 | global batch size: 256 | lm loss: 1.934789E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.967 | TFLOPs: 18.75 | 31: iteration 127580/ 173500 | consumed samples: 32660480 | consumed tokens: 66888663040 | elapsed time per iteration (s): 0.80 | learning rate: 4.992E-05 | global batch size: 256 | lm loss: 1.948407E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.523 | TFLOPs: 19.33 | 31: iteration 127590/ 173500 | consumed samples: 32663040 | consumed tokens: 66893905920 | elapsed time per iteration (s): 0.77 | learning rate: 4.991E-05 | global batch size: 256 | lm loss: 1.942471E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.767 | TFLOPs: 20.13 | 31: iteration 127600/ 173500 | consumed samples: 32665600 | consumed tokens: 66899148800 | elapsed time per iteration (s): 0.79 | learning rate: 4.990E-05 | global batch size: 256 | lm loss: 1.937570E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 323.180 | TFLOPs: 19.55 | 31: iteration 127610/ 173500 | consumed samples: 32668160 | consumed tokens: 66904391680 | elapsed time per iteration (s): 0.79 | learning rate: 4.988E-05 | global batch size: 256 | lm loss: 1.930528E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.166 | TFLOPs: 19.67 | 31: iteration 127620/ 173500 | consumed samples: 32670720 | consumed tokens: 66909634560 | elapsed time per iteration (s): 0.74 | learning rate: 4.987E-05 | global batch size: 256 | lm loss: 1.927758E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.175 | TFLOPs: 21.06 | 31: iteration 127630/ 173500 | consumed samples: 32673280 | consumed tokens: 66914877440 | elapsed time per iteration (s): 0.81 | learning rate: 4.986E-05 | global batch size: 256 | lm loss: 1.924264E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.933 | TFLOPs: 19.23 | 31: iteration 127640/ 173500 | consumed samples: 32675840 | consumed tokens: 66920120320 | elapsed time per iteration (s): 0.79 | learning rate: 4.985E-05 | global batch size: 256 | lm loss: 1.921446E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.430 | TFLOPs: 19.63 | 31: iteration 127650/ 173500 | consumed samples: 32678400 | consumed tokens: 66925363200 | elapsed time per iteration (s): 0.78 | learning rate: 4.984E-05 | global batch size: 256 | lm loss: 1.933723E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.805 | TFLOPs: 19.77 | 31: iteration 127660/ 173500 | consumed samples: 32680960 | consumed tokens: 66930606080 | elapsed time per iteration (s): 0.81 | learning rate: 
4.982E-05 | global batch size: 256 | lm loss: 1.966486E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.072 | TFLOPs: 19.18 | 31: iteration 127670/ 173500 | consumed samples: 32683520 | consumed tokens: 66935848960 | elapsed time per iteration (s): 0.78 | learning rate: 4.981E-05 | global batch size: 256 | lm loss: 1.951018E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.519 | TFLOPs: 19.81 | 31: iteration 127680/ 173500 | consumed samples: 32686080 | consumed tokens: 66941091840 | elapsed time per iteration (s): 0.79 | learning rate: 4.980E-05 | global batch size: 256 | lm loss: 1.934537E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.262 | TFLOPs: 19.62 | 31: iteration 127690/ 173500 | consumed samples: 32688640 | consumed tokens: 66946334720 | elapsed time per iteration (s): 0.78 | learning rate: 4.979E-05 | global batch size: 256 | lm loss: 1.939947E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.356 | TFLOPs: 19.80 | 31: iteration 127700/ 173500 | consumed samples: 32691200 | consumed tokens: 66951577600 | elapsed time per iteration (s): 0.74 | learning rate: 4.977E-05 | global batch size: 256 | lm loss: 1.966642E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.404 | TFLOPs: 20.84 | 31: iteration 127710/ 173500 | consumed samples: 32693760 | consumed tokens: 66956820480 | elapsed time per iteration (s): 0.75 | learning rate: 4.976E-05 | global batch size: 256 | lm loss: 1.958222E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.261 | TFLOPs: 20.71 | 31: iteration 127720/ 173500 | 
consumed samples: 32696320 | consumed tokens: 66962063360 | elapsed time per iteration (s): 0.71 | learning rate: 4.975E-05 | global batch size: 256 | lm loss: 1.963668E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 359.604 | TFLOPs: 21.76 | 31: iteration 127730/ 173500 | consumed samples: 32698880 | consumed tokens: 66967306240 | elapsed time per iteration (s): 0.79 | learning rate: 4.974E-05 | global batch size: 256 | lm loss: 1.929141E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.449 | TFLOPs: 19.63 | 31: iteration 127740/ 173500 | consumed samples: 32701440 | consumed tokens: 66972549120 | elapsed time per iteration (s): 0.76 | learning rate: 4.972E-05 | global batch size: 256 | lm loss: 1.925646E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.820 | TFLOPs: 20.50 | 31: iteration 127750/ 173500 | consumed samples: 32704000 | consumed tokens: 66977792000 | elapsed time per iteration (s): 0.84 | learning rate: 4.971E-05 | global batch size: 256 | lm loss: 1.941537E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.874 | TFLOPs: 18.44 | 31: iteration 127760/ 173500 | consumed samples: 32706560 | consumed tokens: 66983034880 | elapsed time per iteration (s): 0.80 | learning rate: 4.970E-05 | global batch size: 256 | lm loss: 1.924363E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.406 | TFLOPs: 19.38 | 31: iteration 127770/ 173500 | consumed samples: 32709120 | consumed tokens: 66988277760 | elapsed time per iteration (s): 0.80 | learning rate: 4.969E-05 | global batch size: 256 | lm loss: 1.934010E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 318.054 | TFLOPs: 19.24 | 31: iteration 127780/ 173500 | consumed samples: 32711680 | consumed tokens: 66993520640 | elapsed time per iteration (s): 0.74 | learning rate: 4.968E-05 | global batch size: 256 | lm loss: 1.935677E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.180 | TFLOPs: 20.88 | 31: iteration 127790/ 173500 | consumed samples: 32714240 | consumed tokens: 66998763520 | elapsed time per iteration (s): 0.81 | learning rate: 4.966E-05 | global batch size: 256 | lm loss: 1.961714E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.978 | TFLOPs: 19.06 | 31: iteration 127800/ 173500 | consumed samples: 32716800 | consumed tokens: 67004006400 | elapsed time per iteration (s): 0.82 | learning rate: 4.965E-05 | global batch size: 256 | lm loss: 1.921219E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.518 | TFLOPs: 18.85 | 31: iteration 127810/ 173500 | consumed samples: 32719360 | consumed tokens: 67009249280 | elapsed time per iteration (s): 0.72 | learning rate: 4.964E-05 | global batch size: 256 | lm loss: 1.940048E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.364 | TFLOPs: 21.38 | 31: iteration 127820/ 173500 | consumed samples: 32721920 | consumed tokens: 67014492160 | elapsed time per iteration (s): 0.81 | learning rate: 4.963E-05 | global batch size: 256 | lm loss: 1.934256E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.883 | TFLOPs: 19.23 | 31: iteration 127830/ 173500 | consumed samples: 32724480 | consumed tokens: 67019735040 | elapsed time per iteration (s): 0.83 | learning rate: 
4.962E-05 | global batch size: 256 | lm loss: 1.936524E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.933 | TFLOPs: 18.69 | 31: iteration 127840/ 173500 | consumed samples: 32727040 | consumed tokens: 67024977920 | elapsed time per iteration (s): 0.76 | learning rate: 4.960E-05 | global batch size: 256 | lm loss: 1.935291E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.907 | TFLOPs: 20.32 | 31: iteration 127850/ 173500 | consumed samples: 32729600 | consumed tokens: 67030220800 | elapsed time per iteration (s): 0.77 | learning rate: 4.959E-05 | global batch size: 256 | lm loss: 1.933248E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.659 | TFLOPs: 20.19 | 31: iteration 127860/ 173500 | consumed samples: 32732160 | consumed tokens: 67035463680 | elapsed time per iteration (s): 0.77 | learning rate: 4.958E-05 | global batch size: 256 | lm loss: 1.943049E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.610 | TFLOPs: 20.18 | 31: iteration 127870/ 173500 | consumed samples: 32734720 | consumed tokens: 67040706560 | elapsed time per iteration (s): 0.81 | learning rate: 4.957E-05 | global batch size: 256 | lm loss: 1.995214E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.309 | TFLOPs: 19.08 | 31: iteration 127880/ 173500 | consumed samples: 32737280 | consumed tokens: 67045949440 | elapsed time per iteration (s): 0.79 | learning rate: 4.955E-05 | global batch size: 256 | lm loss: 1.955087E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.144 | TFLOPs: 19.61 | 31: iteration 127890/ 173500 | 
consumed samples: 32739840 | consumed tokens: 67051192320 | elapsed time per iteration (s): 0.83 | learning rate: 4.954E-05 | global batch size: 256 | lm loss: 1.917101E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.312 | TFLOPs: 18.65 | 31: iteration 127900/ 173500 | consumed samples: 32742400 | consumed tokens: 67056435200 | elapsed time per iteration (s): 0.75 | learning rate: 4.953E-05 | global batch size: 256 | lm loss: 1.958111E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.449 | TFLOPs: 20.72 | 31: iteration 127910/ 173500 | consumed samples: 32744960 | consumed tokens: 67061678080 | elapsed time per iteration (s): 0.82 | learning rate: 4.952E-05 | global batch size: 256 | lm loss: 1.923572E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.851 | TFLOPs: 18.93 | 31: iteration 127920/ 173500 | consumed samples: 32747520 | consumed tokens: 67066920960 | elapsed time per iteration (s): 0.76 | learning rate: 4.951E-05 | global batch size: 256 | lm loss: 1.945150E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.092 | TFLOPs: 20.33 | 31: iteration 127930/ 173500 | consumed samples: 32750080 | consumed tokens: 67072163840 | elapsed time per iteration (s): 0.84 | learning rate: 4.949E-05 | global batch size: 256 | lm loss: 1.969595E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.480 | TFLOPs: 18.54 | 31: iteration 127940/ 173500 | consumed samples: 32752640 | consumed tokens: 67077406720 | elapsed time per iteration (s): 0.80 | learning rate: 4.948E-05 | global batch size: 256 | lm loss: 1.943252E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 318.808 | TFLOPs: 19.29 | 31: iteration 127950/ 173500 | consumed samples: 32755200 | consumed tokens: 67082649600 | elapsed time per iteration (s): 0.77 | learning rate: 4.947E-05 | global batch size: 256 | lm loss: 1.932824E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.260 | TFLOPs: 20.22 | 31: iteration 127960/ 173500 | consumed samples: 32757760 | consumed tokens: 67087892480 | elapsed time per iteration (s): 0.76 | learning rate: 4.946E-05 | global batch size: 256 | lm loss: 1.939548E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.965 | TFLOPs: 20.45 | 31: iteration 127970/ 173500 | consumed samples: 32760320 | consumed tokens: 67093135360 | elapsed time per iteration (s): 0.72 | learning rate: 4.944E-05 | global batch size: 256 | lm loss: 1.923035E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.283 | TFLOPs: 21.61 | 31: iteration 127980/ 173500 | consumed samples: 32762880 | consumed tokens: 67098378240 | elapsed time per iteration (s): 0.77 | learning rate: 4.943E-05 | global batch size: 256 | lm loss: 1.929001E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.476 | TFLOPs: 20.17 | 31: iteration 127990/ 173500 | consumed samples: 32765440 | consumed tokens: 67103621120 | elapsed time per iteration (s): 0.78 | learning rate: 4.942E-05 | global batch size: 256 | lm loss: 1.898883E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.855 | TFLOPs: 19.83 | 0: [2022-11-26 22:56:13,631] [INFO] [logging.py:68:log_dist] [Rank 0] step=128000, skipped=0, lr=[4.94077976375529e-05, 4.94077976375529e-05, 
4.94077976375529e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 128000/ 173500 | consumed samples: 32768000 | consumed tokens: 67108864000 | elapsed time per iteration (s): 0.78 | learning rate: 4.941E-05 | global batch size: 256 | lm loss: 1.947661E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.257 | TFLOPs: 19.74 | 0: steps: 128000 loss: 1.9087 iter time (s): 0.809 samples/sec: 316.433 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 128000 | lm loss value: 1.904491E+00 | lm loss PPL: 6.715991E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 128000 to checkpoints_1b1long 0: [2022-11-26 22:56:13,917] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step128000 is begin to save! 0: [2022-11-26 22:56:13,930] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_01-model_00-model_states.pt... 0: [2022-11-26 22:56:14,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_01-model_00-model_states.pt. 0: [2022-11-26 22:56:14,148] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_03-model_00-model_states.pt... 0: [2022-11-26 22:56:14,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_03-model_00-model_states.pt. 0: [2022-11-26 22:56:14,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_04-model_00-model_states.pt... 0: [2022-11-26 22:56:14,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_04-model_00-model_states.pt. 
0: [2022-11-26 22:56:14,304] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_05-model_00-model_states.pt... 0: [2022-11-26 22:56:14,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_05-model_00-model_states.pt. 0: [2022-11-26 22:56:14,388] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_06-model_00-model_states.pt... 0: [2022-11-26 22:56:14,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_06-model_00-model_states.pt. 0: [2022-11-26 22:56:14,466] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_07-model_00-model_states.pt... 0: [2022-11-26 22:56:14,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_07-model_00-model_states.pt. 0: [2022-11-26 22:56:14,541] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_08-model_00-model_states.pt... 0: [2022-11-26 22:56:14,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_08-model_00-model_states.pt. 0: [2022-11-26 22:56:14,620] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_09-model_00-model_states.pt... 0: [2022-11-26 22:56:14,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_09-model_00-model_states.pt. 0: [2022-11-26 22:56:14,701] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_10-model_00-model_states.pt... 0: [2022-11-26 22:56:14,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_10-model_00-model_states.pt. 
0: [2022-11-26 22:56:14,779] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_11-model_00-model_states.pt... 0: [2022-11-26 22:56:14,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_11-model_00-model_states.pt. 0: [2022-11-26 22:56:14,855] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_12-model_00-model_states.pt... 0: [2022-11-26 22:56:14,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_12-model_00-model_states.pt. 0: [2022-11-26 22:56:14,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_13-model_00-model_states.pt... 0: [2022-11-26 22:56:15,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_13-model_00-model_states.pt. 0: [2022-11-26 22:56:15,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_14-model_00-model_states.pt... 0: [2022-11-26 22:56:15,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_14-model_00-model_states.pt. 0: [2022-11-26 22:56:15,089] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_15-model_00-model_states.pt... 0: [2022-11-26 22:56:15,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_15-model_00-model_states.pt. 0: [2022-11-26 22:56:15,166] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_16-model_00-model_states.pt... 0: [2022-11-26 22:56:15,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_16-model_00-model_states.pt. 
0: [2022-11-26 22:56:15,240] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_17-model_00-model_states.pt... 0: [2022-11-26 22:56:15,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_17-model_00-model_states.pt. 0: [2022-11-26 22:56:15,318] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_18-model_00-model_states.pt... 0: [2022-11-26 22:56:15,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_18-model_00-model_states.pt. 0: [2022-11-26 22:56:15,391] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_19-model_00-model_states.pt... 0: [2022-11-26 22:56:15,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_19-model_00-model_states.pt. 0: [2022-11-26 22:56:15,469] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_20-model_00-model_states.pt... 0: [2022-11-26 22:56:15,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_20-model_00-model_states.pt. 0: [2022-11-26 22:56:15,542] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_21-model_00-model_states.pt... 0: [2022-11-26 22:56:15,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_21-model_00-model_states.pt. 0: [2022-11-26 22:56:15,619] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_22-model_00-model_states.pt... 0: [2022-11-26 22:56:15,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_22-model_00-model_states.pt. 
0: [2022-11-26 22:56:15,696] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_23-model_00-model_states.pt... 0: [2022-11-26 22:56:15,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_23-model_00-model_states.pt. 0: [2022-11-26 22:56:15,771] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_24-model_00-model_states.pt... 0: [2022-11-26 22:56:15,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_24-model_00-model_states.pt. 0: [2022-11-26 22:56:15,848] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_25-model_00-model_states.pt... 0: [2022-11-26 22:56:15,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_25-model_00-model_states.pt. 0: [2022-11-26 22:56:15,925] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_26-model_00-model_states.pt... 0: [2022-11-26 22:56:16,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_26-model_00-model_states.pt. 0: [2022-11-26 22:56:16,011] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_27-model_00-model_states.pt... 0: [2022-11-26 22:56:16,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_27-model_00-model_states.pt. 0: [2022-11-26 22:56:16,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_28-model_00-model_states.pt... 0: [2022-11-26 22:56:16,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_28-model_00-model_states.pt. 
0: [2022-11-26 22:56:16,163] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/layer_30-model_00-model_states.pt... 0: [2022-11-26 22:56:16,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/layer_30-model_00-model_states.pt. 0: [2022-11-26 22:56:16,166] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step128000/mp_rank_00_model_states.pt 0: [2022-11-26 22:56:16,166] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/mp_rank_00_model_states.pt... 0: [2022-11-26 22:56:16,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/mp_rank_00_model_states.pt. 0: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
7: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 
10: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 
2: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
12: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 
23: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 
31: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 
21: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 
27: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 
4: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 
16: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
12: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 
25: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
31: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 
17: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 
19: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
4: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
15: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
31: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 
19: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
16: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 
14: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 26: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
7: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 23: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 22: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 
30: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 20: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:56:16,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 28: [2022-11-26 22:56:16,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 22:56:16,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 19: [2022-11-26 22:56:16,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
28: [2022-11-26 22:56:16,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 19: [2022-11-26 22:56:16,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-26 22:56:16,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 18: [2022-11-26 22:56:16,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:56:16,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 22:56:16,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 26: [2022-11-26 22:56:16,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:56:16,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 22:56:16,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 26: [2022-11-26 22:56:16,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:56:16,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 22:56:16,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
4: [2022-11-26 22:56:16,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:56:16,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 22:56:16,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 0: [2022-11-26 22:56:16,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:56:16,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 22:56:16,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 1: [2022-11-26 22:56:16,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:56:16,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:56:16,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 1: [2022-11-26 22:56:16,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 3: [2022-11-26 22:56:16,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 1: [2022-11-26 22:56:16,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 22:56:16,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 1: [2022-11-26 22:56:16,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 22:56:16,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 5: [2022-11-26 22:56:16,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:56:16,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 22:56:16,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 5: [2022-11-26 22:56:16,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:56:16,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 22:56:16,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 12: [2022-11-26 22:56:16,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:56:16,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 22:56:16,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
7: [2022-11-26 22:56:16,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:56:16,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 22:56:16,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 14: [2022-11-26 22:56:16,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:56:16,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 22:56:16,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 28: [2022-11-26 22:56:16,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:56:16,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:56:16,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 22:56:16,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 11: [2022-11-26 22:56:16,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 22:56:16,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 22:56:16,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 16: [2022-11-26 22:56:16,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:56:16,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:56:16,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:56:16,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 9: [2022-11-26 22:56:16,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:56:16,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
25: [2022-11-26 22:56:16,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 22:56:16,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 9: [2022-11-26 22:56:16,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 25: [2022-11-26 22:56:16,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 9: [2022-11-26 22:56:16,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 25: [2022-11-26 22:56:16,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 18: [2022-11-26 22:56:16,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:56:16,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 0: [2022-11-26 22:56:16,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:56:16,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 0: [2022-11-26 22:56:16,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 22:56:16,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
2: [2022-11-26 22:56:16,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:56:16,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 15: [2022-11-26 22:56:16,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:56:16,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 15: [2022-11-26 22:56:16,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 22:56:16,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 15: [2022-11-26 22:56:16,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:56:16,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:56:16,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 2: [2022-11-26 22:56:16,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 15: [2022-11-26 22:56:16,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 2: [2022-11-26 22:56:16,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
19: [2022-11-26 22:56:16,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:56:16,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 22:56:16,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 9: [2022-11-26 22:56:16,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:56:16,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 16: [2022-11-26 22:56:16,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:56:16,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 16: [2022-11-26 22:56:16,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 22:56:16,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 22: [2022-11-26 22:56:16,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:56:16,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 22:56:16,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
5: [2022-11-26 22:56:16,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:56:16,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:56:16,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 14: [2022-11-26 22:56:16,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 30: [2022-11-26 22:56:16,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:56:16,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 14: [2022-11-26 22:56:16,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 30: [2022-11-26 22:56:16,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 22:56:16,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 27: [2022-11-26 22:56:16,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:56:16,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 22:56:16,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
3: [2022-11-26 22:56:16,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:56:16,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 22:56:16,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 29: [2022-11-26 22:56:16,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:56:16,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 22:56:16,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 22: [2022-11-26 22:56:16,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:56:16,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 22:56:16,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 30: [2022-11-26 22:56:16,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 22:56:16,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 22:56:16,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
15: [2022-11-26 22:56:16,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:56:16,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:56:16,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 29: [2022-11-26 22:56:16,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 15: [2022-11-26 22:56:16,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 29: [2022-11-26 22:56:16,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 4: [2022-11-26 22:56:16,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:56:16,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 22:56:16,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 11: [2022-11-26 22:56:16,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:56:16,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 22:56:16,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
4: [2022-11-26 22:56:16,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:56:16,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 22:56:16,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 28: [2022-11-26 22:56:16,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 22:56:16,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 14: [2022-11-26 22:56:16,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:56:16,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:56:16,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 22:56:16,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 12: [2022-11-26 22:56:16,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 22:56:16,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 7: [2022-11-26 22:56:16,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-26 22:56:16,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 22:56:16,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 11: [2022-11-26 22:56:16,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:56:16,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 22:56:16,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 19: [2022-11-26 22:56:16,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:56:16,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 16: [2022-11-26 22:56:16,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:56:16,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 16: [2022-11-26 22:56:16,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 22:56:16,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 3: [2022-11-26 22:56:16,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 22:56:16,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 22:56:16,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 23: [2022-11-26 22:56:16,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:56:16,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 22:56:16,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 2: [2022-11-26 22:56:16,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:56:16,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:56:16,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 22:56:16,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 18: [2022-11-26 22:56:16,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 22:56:16,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 29: [2022-11-26 22:56:16,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
29: [2022-11-26 22:56:16,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 22:56:16,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 5: [2022-11-26 22:56:16,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:56:16,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 22:56:16,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 1: [2022-11-26 22:56:16,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:56:16,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 22:56:16,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 25: [2022-11-26 22:56:16,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 28: [2022-11-26 22:56:16,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
25: [2022-11-26 22:56:16,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 28: [2022-11-26 22:56:16,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 25: [2022-11-26 22:56:16,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 28: [2022-11-26 22:56:16,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 4: [2022-11-26 22:56:16,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:56:16,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 22:56:16,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 20: [2022-11-26 22:56:16,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:56:16,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 22:56:16,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 9: [2022-11-26 22:56:16,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 22:56:16,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 16: [2022-11-26 22:56:16,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:56:16,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 9: [2022-11-26 22:56:16,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 16: [2022-11-26 22:56:16,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 15: [2022-11-26 22:56:16,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:56:16,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 22:56:16,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 9: [2022-11-26 22:56:16,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:56:16,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 22:56:16,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 21: [2022-11-26 22:56:16,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
21: [2022-11-26 22:56:16,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 22: [2022-11-26 22:56:16,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:56:16,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 22: [2022-11-26 22:56:16,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 19: [2022-11-26 22:56:16,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:56:16,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:56:16,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 19: [2022-11-26 22:56:16,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 22:56:16,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 25: [2022-11-26 22:56:16,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 22:56:16,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 6: [2022-11-26 22:56:16,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 22:56:16,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:56:16,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:56:16,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 22:56:16,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 22:56:16,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 22:56:16,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 6: [2022-11-26 22:56:16,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 6: [2022-11-26 22:56:16,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 22: [2022-11-26 22:56:16,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:56:16,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 22:56:16,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 1: [2022-11-26 22:56:16,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 22:56:16,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 18: [2022-11-26 22:56:16,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:56:16,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 1: [2022-11-26 22:56:16,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 3: [2022-11-26 22:56:16,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:56:16,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 3: [2022-11-26 22:56:16,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 22:56:16,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 13: [2022-11-26 22:56:16,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:56:16,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:56:16,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 22:56:16,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 22:56:16,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 22:56:16,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 22:56:16,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 13: [2022-11-26 22:56:16,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 13: [2022-11-26 22:56:16,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 14: [2022-11-26 22:56:16,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:56:16,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 20: [2022-11-26 22:56:16,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:56:16,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 28: [2022-11-26 22:56:16,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:56:16,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
20: [2022-11-26 22:56:16,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 22:56:16,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 26: [2022-11-26 22:56:16,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 22:56:16,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 13: [2022-11-26 22:56:16,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:56:16,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 22:56:16,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 20: [2022-11-26 22:56:16,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:56:16,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-26 22:56:16,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 28: [2022-11-26 22:56:16,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 20: [2022-11-26 22:56:16,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:56:16,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 23: [2022-11-26 22:56:16,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 28: [2022-11-26 22:56:16,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 20: [2022-11-26 22:56:16,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 0: [2022-11-26 22:56:16,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:56:16,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 20: [2022-11-26 22:56:16,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 22:56:16,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 0: [2022-11-26 22:56:16,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
23: [2022-11-26 22:56:16,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:56:16,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 22:56:16,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 27: [2022-11-26 22:56:16,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 30: [2022-11-26 22:56:16,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:56:16,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 30: [2022-11-26 22:56:16,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 27: [2022-11-26 22:56:16,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 30: [2022-11-26 22:56:16,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 30: [2022-11-26 22:56:16,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 22:56:16,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 22:56:16,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
31: [2022-11-26 22:56:16,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:56:16,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:56:16,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:56:16,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 22:56:16,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 22:56:16,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 31: [2022-11-26 22:56:16,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 31: [2022-11-26 22:56:16,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 22:56:16,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 11: [2022-11-26 22:56:16,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:56:16,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 22:56:16,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
0: [2022-11-26 22:56:16,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:56:16,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 22:56:16,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 21: [2022-11-26 22:56:16,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:56:16,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 22:56:16,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 7: [2022-11-26 22:56:16,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:56:16,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 22:56:16,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 21: [2022-11-26 22:56:16,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:56:16,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 22:56:16,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
21: [2022-11-26 22:56:16,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:56:16,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 22:56:16,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 26: [2022-11-26 22:56:16,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:56:16,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 7: [2022-11-26 22:56:16,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:56:16,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:56:16,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 7: [2022-11-26 22:56:16,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 22:56:16,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 2: [2022-11-26 22:56:16,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 22:56:16,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
4: [2022-11-26 22:56:16,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:56:16,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 22:56:16,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 23: [2022-11-26 22:56:16,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:56:16,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 12: [2022-11-26 22:56:16,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:56:16,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 12: [2022-11-26 22:56:16,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 27: [2022-11-26 22:56:16,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:56:16,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
27: [2022-11-26 22:56:16,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 5: [2022-11-26 22:56:16,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:56:16,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 5: [2022-11-26 22:56:16,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 27: [2022-11-26 22:56:16,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:56:16,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 27: [2022-11-26 22:56:16,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 22:56:16,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 1: [2022-11-26 22:56:16,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:56:16,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 22:56:16,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 8: [2022-11-26 22:56:16,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 22:56:16,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:56:16,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:56:16,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 22:56:16,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 22:56:16,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 22:56:16,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 8: [2022-11-26 22:56:16,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 8: [2022-11-26 22:56:16,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 10: [2022-11-26 22:56:16,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:56:16,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:56:16,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:56:16,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-26 22:56:16,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 22:56:16,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 22:56:16,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 22:56:16,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 22:56:16,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 10: [2022-11-26 22:56:16,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 10: [2022-11-26 22:56:16,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 10: [2022-11-26 22:56:16,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 29: [2022-11-26 22:56:16,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:56:16,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 22:56:16,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 31: [2022-11-26 22:56:16,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
31: [2022-11-26 22:56:16,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 22:56:16,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 20: [2022-11-26 22:56:16,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:56:16,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 22:56:16,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 26: [2022-11-26 22:56:16,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:56:16,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 22:56:16,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 25: [2022-11-26 22:56:16,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:56:16,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 22:56:16,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 23: [2022-11-26 22:56:16,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
23: [2022-11-26 22:56:16,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 22:56:16,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 22: [2022-11-26 22:56:16,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:56:16,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 22:56:16,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 13: [2022-11-26 22:56:16,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:56:16,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 22:56:16,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 16: [2022-11-26 22:56:16,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:56:16,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 22:56:16,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 21: [2022-11-26 22:56:16,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-26 22:56:16,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 22:56:16,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 17: [2022-11-26 22:56:16,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:56:16,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:56:16,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:56:16,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:56:16,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-26 22:56:16,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 22:56:16,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 22:56:16,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 22:56:16,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 22:56:16,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 22:56:16,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 17: [2022-11-26 22:56:16,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 17: [2022-11-26 22:56:16,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 17: [2022-11-26 22:56:16,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 17: [2022-11-26 22:56:16,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 30: [2022-11-26 22:56:16,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 
30: [2022-11-26 22:56:16,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 22:56:16,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 19: [2022-11-26 22:56:16,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:56:16,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 22:56:16,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 11: [2022-11-26 22:56:16,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:56:16,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 22:56:16,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 2: [2022-11-26 22:56:16,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:56:16,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 22:56:16,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 24: [2022-11-26 22:56:16,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
24: [2022-11-26 22:56:16,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:56:16,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:56:16,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:56:16,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:56:16,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 22:56:16,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 22:56:16,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 22:56:16,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 22:56:16,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 22:56:16,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 24: [2022-11-26 22:56:16,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
24: [2022-11-26 22:56:16,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 24: [2022-11-26 22:56:16,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 24: [2022-11-26 22:56:16,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 15: [2022-11-26 22:56:16,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:56:16,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 22:56:16,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 12: [2022-11-26 22:56:16,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:56:16,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 22:56:16,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 31: [2022-11-26 22:56:16,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:56:16,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 22:56:16,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
9: [2022-11-26 22:56:16,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:56:16,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 22:56:16,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 28: [2022-11-26 22:56:16,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-26 22:56:16,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 22:56:16,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 10: [2022-11-26 22:56:16,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:56:16,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 22:56:16,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 29: [2022-11-26 22:56:16,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:56:16,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 22:56:16,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
18: [2022-11-26 22:56:16,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:56:16,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 22:56:16,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 0: [2022-11-26 22:56:16,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:56:16,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 22:56:16,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 14: [2022-11-26 22:56:16,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:56:16,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 22:56:16,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 6: [2022-11-26 22:56:16,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:56:16,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 22:56:16,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
8: [2022-11-26 22:56:16,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:56:16,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 22:56:16,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 7: [2022-11-26 22:56:16,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:56:16,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 22:56:16,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 3: [2022-11-26 22:56:16,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:56:16,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 22:56:16,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 27: [2022-11-26 22:56:16,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:56:16,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 22:56:16,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
25: [2022-11-26 22:56:16,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:56:16,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 22:56:16,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 5: [2022-11-26 22:56:16,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:56:16,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 22:56:16,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 1: [2022-11-26 22:56:16,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:56:16,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 22:56:16,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 4: [2022-11-26 22:56:16,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:56:16,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 22:56:16,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
20: [2022-11-26 22:56:16,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:56:16,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 22:56:16,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 26: [2022-11-26 22:56:16,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:56:16,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 22:56:16,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 17: [2022-11-26 22:56:16,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:56:16,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 22:56:16,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 16: [2022-11-26 22:56:16,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:56:16,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 22:56:16,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
13: [2022-11-26 22:56:16,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:56:16,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 22:56:16,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 22: [2022-11-26 22:56:16,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 22:56:16,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 22:56:16,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 24: [2022-11-26 22:56:16,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 22:56:16,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 22:56:16,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 2: [2022-11-26 22:56:16,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:56:16,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 22:56:16,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
19: [2022-11-26 22:56:16,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:56:16,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 22:56:16,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 11: [2022-11-26 22:56:16,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:56:16,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 22:56:16,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 30: [2022-11-26 22:56:16,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 22:56:16,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 22:56:16,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 15: [2022-11-26 22:56:16,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:56:16,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 22:56:16,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
29: [2022-11-26 22:56:16,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:56:16,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:56:16,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 21: [2022-11-26 22:56:16,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 29: [2022-11-26 22:56:16,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 21: [2022-11-26 22:56:16,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 9: [2022-11-26 22:56:16,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:56:16,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 22:56:16,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 0: [2022-11-26 22:56:16,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:56:16,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 22:56:16,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
28: [2022-11-26 22:56:16,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 22:56:16,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 22:56:16,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 18: [2022-11-26 22:56:16,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:56:16,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 22:56:16,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 12: [2022-11-26 22:56:16,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:56:16,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:56:16,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 12: [2022-11-26 22:56:16,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 31: [2022-11-26 22:56:16,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 12: [2022-11-26 22:56:16,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
10: [2022-11-26 22:56:16,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:56:16,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 22:56:16,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 8: [2022-11-26 22:56:16,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:56:16,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 22:56:16,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 6: [2022-11-26 22:56:16,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:56:16,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 22:56:16,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 27: [2022-11-26 22:56:16,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 22:56:16,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 22:56:16,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
14: [2022-11-26 22:56:16,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:56:16,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 22:56:16,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 7: [2022-11-26 22:56:16,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:56:16,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 22:56:16,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 20: [2022-11-26 22:56:16,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 22:56:16,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 22:56:16,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 5: [2022-11-26 22:56:16,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:56:16,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 22:56:16,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 
3: [2022-11-26 22:56:16,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:56:16,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 22:56:16,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 4: [2022-11-26 22:56:16,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 25: [2022-11-26 22:56:16,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:56:16,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 22:56:16,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 25: [2022-11-26 22:56:16,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 22:56:16,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 26: [2022-11-26 22:56:16,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
26: [2022-11-26 22:56:16,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 24: [2022-11-26 22:56:16,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 26: [2022-11-26 22:56:16,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 24: [2022-11-26 22:56:16,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 22:56:16,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 23: [2022-11-26 22:56:16,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:56:16,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 22:56:16,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 17: [2022-11-26 22:56:16,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 22:56:16,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 22:56:16,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 1: [2022-11-26 22:56:16,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-26 22:56:16,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 22:56:16,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 13: [2022-11-26 22:56:16,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:56:16,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 22:56:16,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 16: [2022-11-26 22:56:16,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 23: [2022-11-26 22:56:16,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 16: [2022-11-26 22:56:16,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 22:56:16,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 23: [2022-11-26 22:56:16,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 22:56:16,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 22: [2022-11-26 22:56:16,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
22: [2022-11-26 22:56:16,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 22:56:16,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 2: [2022-11-26 22:56:16,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:56:16,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 22:56:16,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 21: [2022-11-26 22:56:16,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 22:56:16,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 22:56:16,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 11: [2022-11-26 22:56:16,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:56:16,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
11: [2022-11-26 22:56:16,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 15: [2022-11-26 22:56:16,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 22:56:16,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 11: [2022-11-26 22:56:16,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 19: [2022-11-26 22:56:16,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 22:56:16,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 22:56:16,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 9: [2022-11-26 22:56:16,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:56:16,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 22:56:16,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 30: [2022-11-26 22:56:16,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
30: [2022-11-26 22:56:16,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 22:56:16,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 29: [2022-11-26 22:56:16,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:56:16,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 22:56:16,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 22:56:16,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 10: [2022-11-26 22:56:16,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 29: [2022-11-26 22:56:16,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 10: [2022-11-26 22:56:16,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 29: [2022-11-26 22:56:16,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 10: [2022-11-26 22:56:16,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 12: [2022-11-26 22:56:16,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 22:56:16,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 22:56:16,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 31: [2022-11-26 22:56:16,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 22:56:16,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 22:56:16,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 7: [2022-11-26 22:56:16,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:56:16,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 2: [2022-11-26 22:56:16,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:56:16,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 2: [2022-11-26 22:56:16,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 22:56:16,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 22: [2022-11-26 22:56:16,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
22: [2022-11-26 22:56:16,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 22:56:16,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 0: [2022-11-26 22:56:16,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:56:16,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:56:16,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 22:56:16,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 14: [2022-11-26 22:56:16,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:56:16,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 28: [2022-11-26 22:56:16,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 22:56:16,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
14: [2022-11-26 22:56:16,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 22:56:16,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 22:56:16,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 14: [2022-11-26 22:56:16,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 19: [2022-11-26 22:56:16,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 28: [2022-11-26 22:56:16,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 22:56:16,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 19: [2022-11-26 22:56:16,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 22:56:16,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 28: [2022-11-26 22:56:16,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 22:56:16,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now! 18: [2022-11-26 22:56:16,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
13: [2022-11-26 22:56:16,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt.
18: [2022-11-26 22:56:16,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt
18: [2022-11-26 22:56:16,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
13: [2022-11-26 22:56:16,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt
13: [2022-11-26 22:56:16,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
9: [2022-11-26 22:56:16,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt.
5: [2022-11-26 22:56:16,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt.
9: [2022-11-26 22:56:16,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt
5: [2022-11-26 22:56:16,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt
9: [2022-11-26 22:56:16,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
5: [2022-11-26 22:56:16,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
10: [2022-11-26 22:56:16,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt.
10: [2022-11-26 22:56:16,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt
10: [2022-11-26 22:56:16,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
15: [2022-11-26 22:56:16,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt.
15: [2022-11-26 22:56:16,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt
6: [2022-11-26 22:56:16,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt.
15: [2022-11-26 22:56:16,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
6: [2022-11-26 22:56:16,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt
6: [2022-11-26 22:56:16,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
20: [2022-11-26 22:56:16,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt.
23: [2022-11-26 22:56:16,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt.
20: [2022-11-26 22:56:16,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt
20: [2022-11-26 22:56:16,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
23: [2022-11-26 22:56:16,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt
27: [2022-11-26 22:56:16,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt.
23: [2022-11-26 22:56:16,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
27: [2022-11-26 22:56:16,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt
27: [2022-11-26 22:56:16,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
16: [2022-11-26 22:56:16,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt.
31: [2022-11-26 22:56:16,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt.
16: [2022-11-26 22:56:16,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt
31: [2022-11-26 22:56:16,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt
31: [2022-11-26 22:56:16,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
16: [2022-11-26 22:56:16,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
29: [2022-11-26 22:56:16,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt.
25: [2022-11-26 22:56:16,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt.
29: [2022-11-26 22:56:16,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt
29: [2022-11-26 22:56:16,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
25: [2022-11-26 22:56:16,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt
25: [2022-11-26 22:56:16,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
7: [2022-11-26 22:56:16,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt.
4: [2022-11-26 22:56:16,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt.
7: [2022-11-26 22:56:16,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt
4: [2022-11-26 22:56:16,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt
7: [2022-11-26 22:56:16,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
4: [2022-11-26 22:56:16,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
0: [2022-11-26 22:56:16,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
0: [2022-11-26 22:56:16,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
17: [2022-11-26 22:56:16,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt.
17: [2022-11-26 22:56:16,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt
17: [2022-11-26 22:56:16,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
30: [2022-11-26 22:56:16,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt.
30: [2022-11-26 22:56:16,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt
30: [2022-11-26 22:56:16,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
12: [2022-11-26 22:56:16,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt.
12: [2022-11-26 22:56:16,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt
12: [2022-11-26 22:56:16,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
21: [2022-11-26 22:56:16,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt.
21: [2022-11-26 22:56:16,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt
8: [2022-11-26 22:56:16,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt.
21: [2022-11-26 22:56:16,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
8: [2022-11-26 22:56:16,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt
8: [2022-11-26 22:56:16,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
1: [2022-11-26 22:56:16,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt.
1: [2022-11-26 22:56:16,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt
24: [2022-11-26 22:56:16,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt.
0: [2022-11-26 22:56:16,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt.
0: [2022-11-26 22:56:16,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
8: [2022-11-26 22:56:16,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt.
1: [2022-11-26 22:56:16,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
24: [2022-11-26 22:56:16,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt
0: [2022-11-26 22:56:16,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
8: [2022-11-26 22:56:16,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt
24: [2022-11-26 22:56:16,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
8: [2022-11-26 22:56:16,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
3: [2022-11-26 22:56:16,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt.
3: [2022-11-26 22:56:16,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt
3: [2022-11-26 22:56:16,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
3: [2022-11-26 22:56:16,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt.
3: [2022-11-26 22:56:16,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt
3: [2022-11-26 22:56:16,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
26: [2022-11-26 22:56:16,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt.
8: [2022-11-26 22:56:16,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt.
26: [2022-11-26 22:56:16,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt
8: [2022-11-26 22:56:16,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt
8: [2022-11-26 22:56:16,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
26: [2022-11-26 22:56:16,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
27: [2022-11-26 22:56:16,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt.
27: [2022-11-26 22:56:16,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt
27: [2022-11-26 22:56:16,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
6: [2022-11-26 22:56:16,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt.
6: [2022-11-26 22:56:16,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt
6: [2022-11-26 22:56:16,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
6: [2022-11-26 22:56:16,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt.
6: [2022-11-26 22:56:16,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step128000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt
6: [2022-11-26 22:56:16,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step128000 is ready now!
0: successfully saved checkpoint at iteration 128000 to checkpoints_1b1long
31: time (ms) | save-checkpoint: 2601.69
31: iteration 128010/ 173500 | consumed samples: 32770560 | consumed tokens: 67114106880 | elapsed time per iteration (s): 1.08 | learning rate: 4.940E-05 | global batch size: 256 | lm loss: 1.951483E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.293 | TFLOPs: 14.30 |
31: iteration 128020/ 173500 | consumed samples: 32773120 | consumed tokens: 67119349760 | elapsed time per iteration (s): 0.80 | learning rate: 4.938E-05 | global batch size: 256 | lm loss: 1.927382E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.516 | TFLOPs: 19.39 |
31: iteration 128030/ 173500 | consumed samples: 32775680 | consumed tokens: 67124592640 | elapsed time per iteration (s): 0.80 | learning rate: 4.937E-05 | global batch size: 256 | lm loss: 1.938840E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.477 | TFLOPs: 19.39 |
31: iteration 128040/ 173500 | consumed samples: 32778240 | consumed tokens: 67129835520 | elapsed time per iteration (s): 0.82 | learning rate: 4.936E-05 | global batch size: 256 | lm loss: 1.925867E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.973 | TFLOPs: 18.93 |
31: iteration 128050/ 173500 | consumed samples: 32780800 | consumed tokens: 67135078400 | elapsed time per iteration (s): 0.78 | learning rate: 4.935E-05 | global batch size: 256 | lm loss: 1.941340E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.846 | TFLOPs: 19.83 |
31: iteration 128060/ 173500 | consumed samples: 32783360 | consumed tokens: 67140321280 | elapsed time per iteration (s): 0.84 | learning rate: 4.933E-05 | global batch size: 256 | lm loss: 1.934414E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.419 | TFLOPs: 18.48 |
31: iteration 128070/ 173500 | consumed samples: 32785920 | consumed tokens: 67145564160 | elapsed time per iteration (s): 0.78 | learning rate: 4.932E-05 | global batch size: 256 | lm loss: 1.935336E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.072 | TFLOPs: 19.79 |
31: iteration 128080/ 173500 | consumed samples: 32788480 | consumed tokens: 67150807040 | elapsed time per iteration (s): 0.84 | learning rate: 4.931E-05 | global batch size: 256 | lm loss: 1.932997E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.050 | TFLOPs: 18.39 |
31: iteration 128090/ 173500 | consumed samples: 32791040 | consumed tokens: 67156049920 | elapsed time per iteration (s): 0.79 | learning rate: 4.930E-05 | global batch size: 256 | lm loss: 1.949687E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.638 | TFLOPs: 19.58 |
31: iteration 128100/ 173500 | consumed samples: 32793600 | consumed tokens: 67161292800 | elapsed time per iteration (s): 0.74 | learning rate: 4.929E-05 | global batch size: 256 | lm loss: 1.931347E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.458 | TFLOPs: 20.90 |
31: iteration 128110/ 173500 | consumed samples: 32796160 | consumed tokens: 67166535680 | elapsed time per iteration (s): 0.76 | learning rate: 4.927E-05 | global batch size: 256 | lm loss: 1.969511E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.317 | TFLOPs: 20.41 |
31: iteration 128120/ 173500 | consumed samples: 32798720 | consumed tokens: 67171778560 | elapsed time per iteration (s): 0.74 | learning rate: 4.926E-05 | global batch size: 256 | lm loss: 1.966758E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.305 | TFLOPs: 20.89 |
31: iteration 128130/ 173500 | consumed samples: 32801280 | consumed tokens: 67177021440 | elapsed time per iteration (s): 0.78 | learning rate: 4.925E-05 | global batch size: 256 | lm loss: 1.914432E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.010 | TFLOPs: 19.90 |
31: iteration 128140/ 173500 | consumed samples: 32803840 | consumed tokens: 67182264320 | elapsed time per iteration (s): 0.83 | learning rate: 4.924E-05 | global batch size: 256 | lm loss: 1.975016E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.938 | TFLOPs: 18.75 |
31: iteration 128150/ 173500 | consumed samples: 32806400 | consumed tokens: 67187507200 | elapsed time per iteration (s): 0.80 | learning rate: 4.923E-05 | global batch size: 256 | lm loss: 1.948477E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.973 | TFLOPs: 19.30 |
31: iteration 128160/ 173500 | consumed samples: 32808960 | consumed tokens: 67192750080 | elapsed time per iteration (s): 0.77 | learning rate: 4.921E-05 | global batch size: 256 | lm loss: 1.918725E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.297 | TFLOPs: 20.22 |
31: iteration 128170/ 173500 | consumed samples: 32811520 | consumed tokens: 67197992960 | elapsed time per iteration (s): 0.77 | learning rate: 4.920E-05 | global batch size: 256 | lm loss: 1.941995E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.828 | TFLOPs: 20.01 |
31: iteration 128180/ 173500 | consumed samples: 32814080 | consumed tokens: 67203235840 | elapsed time per iteration (s): 0.75 | learning rate: 4.919E-05 | global batch size: 256 | lm loss: 1.937342E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.677 | TFLOPs: 20.73 |
31: iteration 128190/ 173500 | consumed samples: 32816640 | consumed tokens: 67208478720 | elapsed time per iteration (s): 0.76 | learning rate: 4.918E-05 | global batch size: 256 | lm loss: 1.938203E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.247 | TFLOPs: 20.46 |
31: iteration 128200/ 173500 | consumed samples: 32819200 | consumed tokens: 67213721600 | elapsed time per iteration (s): 0.76 | learning rate: 4.916E-05 | global batch size: 256 | lm loss: 1.968425E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.146 | TFLOPs: 20.34 |
31: iteration 128210/ 173500 | consumed samples: 32821760 | consumed tokens: 67218964480 | elapsed time per iteration (s): 0.81 | learning rate: 4.915E-05 | global batch size: 256 | lm loss: 1.916260E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.168 | TFLOPs: 19.19 |
31: iteration 128220/ 173500 | consumed samples: 32824320 | consumed tokens: 67224207360 | elapsed time per iteration (s): 0.85 | learning rate: 4.914E-05 | global batch size: 256 | lm loss: 1.947369E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.701 | TFLOPs: 18.13 |
31: iteration 128230/ 173500 | consumed samples: 32826880 | consumed tokens: 67229450240 | elapsed time per iteration (s): 0.82 | learning rate: 4.913E-05 | global batch size: 256 | lm loss: 1.904900E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.414 | TFLOPs: 18.78 |
31: iteration 128240/ 173500 | consumed samples: 32829440 | consumed tokens: 67234693120 | elapsed time per iteration (s): 0.88 | learning rate: 4.912E-05 | global batch size: 256 | lm loss: 1.984657E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.219 | TFLOPs: 17.56 |
31: iteration 128250/ 173500 | consumed samples: 32832000 | consumed tokens: 67239936000 | elapsed time per iteration (s): 0.99 | learning rate: 4.910E-05 | global batch size: 256 | lm loss: 1.921945E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 258.887 | TFLOPs: 15.66 |
31: iteration 128260/ 173500 | consumed samples: 32834560 | consumed tokens: 67245178880 | elapsed time per iteration (s): 0.92 | learning rate: 4.909E-05 | global batch size: 256 | lm loss: 1.928568E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 278.962 | TFLOPs: 16.88 |
31: iteration 128270/ 173500 | consumed samples: 32837120 | consumed tokens: 67250421760 | elapsed time per iteration (s): 0.89 | learning rate: 4.908E-05 | global batch size: 256 | lm loss: 1.966303E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.903 | TFLOPs: 17.42 |
31: iteration 128280/ 173500 | consumed samples: 32839680 | consumed tokens: 67255664640 | elapsed time per iteration (s): 0.90 | learning rate: 4.907E-05 | global batch size: 256 | lm loss: 1.969086E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.809 | TFLOPs: 17.29 |
31: iteration 128290/ 173500 | consumed samples: 32842240 | consumed tokens: 67260907520 | elapsed time per iteration (s): 0.82 | learning rate: 4.906E-05 | global batch size: 256 | lm loss: 1.921019E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.423 | TFLOPs: 18.90 |
31: iteration 128300/ 173500 | consumed samples: 32844800 | consumed tokens: 67266150400 | elapsed time per iteration (s): 0.94 | learning rate: 4.904E-05 | global batch size: 256 | lm loss: 1.911583E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 272.598 | TFLOPs: 16.49 |
31: iteration 128310/ 173500 | consumed samples: 32847360 | consumed tokens: 67271393280 | elapsed time per iteration (s): 0.82 | learning rate: 4.903E-05 | global batch size: 256 | lm loss: 1.948240E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.579 | TFLOPs: 18.91 |
31: iteration 128320/ 173500 | consumed samples: 32849920 | consumed tokens: 67276636160 | elapsed time per iteration (s): 0.82 | learning rate: 4.902E-05 | global batch size: 256 | lm loss: 1.930408E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.870 | TFLOPs: 18.93 |
31: iteration 128330/ 173500 | consumed samples: 32852480 | consumed tokens: 67281879040 | elapsed time per iteration (s): 0.78 | learning rate: 4.901E-05 | global batch size: 256 | lm loss: 1.963701E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.444 | TFLOPs: 19.87 |
31: iteration 128340/ 173500 | consumed samples: 32855040 | consumed tokens: 67287121920 | elapsed time per iteration (s): 0.78 | learning rate: 4.900E-05 | global batch size: 256 | lm loss: 1.971528E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.135 | TFLOPs: 19.73 |
31: iteration 128350/ 173500 | consumed samples: 32857600 | consumed tokens: 67292364800 | elapsed time per iteration (s): 0.89 | learning rate: 4.898E-05 | global batch size: 256 | lm loss: 1.924007E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.849 | TFLOPs: 17.35 |
31: iteration 128360/ 173500 | consumed samples: 32860160 | consumed tokens: 67297607680 | elapsed time per iteration (s): 0.85 | learning rate: 4.897E-05 | global batch size: 256 | lm loss: 1.898312E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.121 | TFLOPs: 18.28 |
31: iteration 128370/ 173500 | consumed samples: 32862720 | consumed tokens: 67302850560 | elapsed time per iteration (s): 0.88 | learning rate: 4.896E-05 | global batch size: 256 | lm loss: 1.974962E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.339 | TFLOPs: 17.56 |
31: iteration 128380/ 173500 | consumed samples: 32865280 | consumed tokens: 67308093440 | elapsed time per iteration (s): 0.88 | learning rate: 4.895E-05 | global batch size: 256 | lm loss: 1.958667E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.306 | TFLOPs: 17.68 |
31: iteration 128390/ 173500 | consumed samples: 32867840 | consumed tokens: 67313336320 | elapsed time per iteration (s): 0.83 | learning rate: 4.893E-05 | global batch size: 256 | lm loss: 1.950285E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.726 | TFLOPs: 18.68 |
31: iteration 128400/ 173500 | consumed samples: 32870400 | consumed tokens: 67318579200 | elapsed time per iteration (s): 0.88 | learning rate: 4.892E-05 | global batch size: 256 | lm loss: 1.932639E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.070 | TFLOPs: 17.67 |
31: iteration 128410/ 173500 | consumed samples: 32872960 | consumed tokens: 67323822080 | elapsed time per iteration (s): 0.82 | learning rate: 4.891E-05 | global batch size: 256 | lm loss: 1.937269E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.916 | TFLOPs: 18.99 |
31: iteration 128420/ 173500 | consumed samples: 32875520 | consumed tokens: 67329064960 | elapsed time per iteration (s): 0.85 | learning rate: 4.890E-05 | global batch size: 256 | lm loss: 1.925516E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.953 | TFLOPs: 18.21 |
31: iteration 128430/ 173500 | consumed samples: 32878080 | consumed tokens: 67334307840 | elapsed time per iteration (s): 0.82 | learning rate: 4.889E-05 | global batch size: 256 | lm loss: 1.942916E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.394 | TFLOPs: 18.78 |
31: iteration 128440/ 173500 | consumed samples: 32880640 | consumed tokens: 67339550720 | elapsed time per iteration (s): 0.80 | learning rate: 4.887E-05 | global batch size: 256 | lm loss: 1.963769E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.033 | TFLOPs: 19.36 |
31: iteration 128450/ 173500 | consumed samples: 32883200 | consumed tokens: 67344793600 | elapsed time per iteration (s): 0.74 | learning rate: 4.886E-05 | global batch size: 256 | lm loss: 1.923242E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.592 | TFLOPs: 20.97 |
31: iteration 128460/ 173500 | consumed samples: 32885760 | consumed tokens: 67350036480 | elapsed time per iteration (s): 0.82 | learning rate: 4.885E-05 | global batch size: 256 | lm loss: 1.937491E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.887 | TFLOPs: 18.93 |
31: iteration 128470/ 173500 | consumed samples: 32888320 | consumed tokens: 67355279360 | elapsed time per iteration (s): 0.79 | learning rate: 4.884E-05 | global batch size: 256 | lm loss: 1.947715E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.173 | TFLOPs: 19.55 |
31: iteration 128480/ 173500 | consumed samples: 32890880 | consumed tokens: 67360522240 | elapsed time per iteration (s): 0.74 | learning rate: 4.883E-05 | global batch size: 256 | lm loss: 1.927224E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.268 | TFLOPs: 20.83 |
31: iteration 128490/ 173500 | consumed samples: 32893440 | consumed tokens: 67365765120 | elapsed time per iteration (s): 0.78 | learning rate: 4.881E-05 | global batch size: 256 | lm loss: 1.971710E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.760 | TFLOPs: 19.83 |
31: iteration 128500/ 173500 | consumed samples: 32896000 | consumed tokens: 67371008000 | elapsed time per iteration (s): 0.79 | learning rate: 4.880E-05 | global batch size: 256 | lm loss: 1.926838E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.778 | TFLOPs: 19.59 |
31: iteration 128510/ 173500 | consumed samples: 32898560 | consumed tokens: 67376250880 | elapsed time per iteration (s): 0.82 | learning rate: 4.879E-05 | global batch size: 256 | lm loss: 1.914964E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.515 | TFLOPs: 18.97 |
31: iteration 128520/ 173500 | consumed samples: 32901120 | consumed tokens: 67381493760 | elapsed time per iteration (s): 0.82 | learning rate: 4.878E-05 | global batch size: 256 | lm loss: 1.952498E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.262 | TFLOPs: 18.95 |
31: iteration 128530/ 173500 | consumed samples: 32903680 | consumed tokens: 67386736640 | elapsed time per iteration (s): 0.81 | learning rate: 4.877E-05 | global batch size: 256 | lm loss: 1.914365E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.460 | TFLOPs: 19.15 |
31: iteration 128540/ 173500 | consumed samples: 32906240 | consumed tokens: 67391979520 | elapsed time per iteration (s): 0.80 | learning rate: 4.875E-05 | global batch size: 256 | lm loss: 1.918163E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.777 | TFLOPs: 19.29 |
31: iteration 128550/ 173500 | consumed samples: 32908800 | consumed tokens: 67397222400 | elapsed time per iteration (s): 0.74 | learning rate: 4.874E-05 | global batch size: 256 | lm loss: 1.919559E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.570 | TFLOPs: 20.85 |
31: iteration 128560/ 173500 | consumed samples: 32911360 | consumed tokens: 67402465280 | elapsed time per iteration (s): 0.81 | learning rate: 4.873E-05 | global batch size: 256 | lm loss: 1.957766E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.659 | TFLOPs: 19.10 |
31: iteration 128570/ 173500 | consumed samples: 32913920 | consumed tokens: 67407708160 | elapsed time per iteration (s): 0.75 | learning rate: 4.872E-05 | global batch size: 256 | lm loss: 1.928436E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.424 | TFLOPs: 20.78 |
31: iteration 128580/ 173500 | consumed samples: 32916480 | consumed tokens: 67412951040 | elapsed time per iteration (s): 0.83 | learning rate: 4.871E-05 | global batch size: 256 | lm loss: 1.967719E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.774 | TFLOPs: 18.74 |
31: iteration 128590/ 173500 | consumed samples: 32919040 | consumed tokens: 67418193920 | elapsed time per iteration (s): 0.76 | learning rate: 4.869E-05 | global batch size: 256 | lm loss: 1.947818E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.733 | TFLOPs: 20.49 |
31: iteration 128600/ 173500 | consumed samples: 32921600 | consumed tokens: 67423436800 | elapsed time per iteration (s): 0.83 | learning rate: 4.868E-05 | global batch size: 256 | lm loss: 1.925617E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.313 | TFLOPs: 18.59 |
31: iteration 128610/ 173500 | consumed samples: 32924160 | consumed tokens: 67428679680 | elapsed time per iteration (s): 0.85 | learning rate: 4.867E-05 | global batch size: 256 | lm loss: 1.939279E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.312 | TFLOPs: 18.29 |
31: iteration 128620/ 173500 | consumed samples: 32926720 | consumed tokens: 67433922560 | elapsed time per iteration (s): 0.83 | learning rate: 4.866E-05 | global batch size: 256 | lm loss: 1.955412E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.569 | TFLOPs: 18.61 |
31: iteration 128630/ 173500 | consumed samples: 32929280 | consumed tokens: 67439165440 | elapsed time per iteration (s): 0.83 | learning rate: 4.865E-05 | global batch size: 256 | lm loss: 1.945040E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.503 | TFLOPs: 18.72 |
31: iteration 128640/ 173500 | consumed samples: 32931840 | consumed tokens: 67444408320 | elapsed time per iteration (s): 0.73 | learning rate: 4.863E-05 | global batch size: 256 | lm loss: 1.933492E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.182 | TFLOPs: 21.25 |
31: iteration 128650/ 173500 | consumed samples: 32934400 | consumed tokens: 67449651200 | elapsed time per iteration (s): 0.76 | learning rate: 4.862E-05 | global batch size: 256 | lm loss: 1.950340E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.408 | TFLOPs: 20.35 |
31: iteration 128660/ 173500 | consumed samples: 32936960 | consumed tokens: 67454894080 | elapsed time per iteration (s): 0.75 | learning rate: 4.861E-05 | global batch size: 256 | lm loss: 1.960929E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.862 | TFLOPs: 20.62 |
31: iteration 128670/ 173500 | consumed samples: 32939520 | consumed tokens: 67460136960 | elapsed time per iteration (s): 0.77 | learning rate: 4.860E-05 | global batch size: 256 | lm loss: 1.943074E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.031 | TFLOPs: 20.15 |
31: iteration 128680/ 173500 | consumed samples: 32942080 | consumed tokens: 67465379840 | elapsed time per iteration (s): 0.76 | learning rate: 4.858E-05 | global batch size: 256 | lm loss: 1.938654E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.828 | TFLOPs: 20.26 |
31: iteration 128690/ 173500 | consumed samples: 32944640 | consumed tokens: 67470622720 | elapsed time per iteration (s): 0.73 | learning rate: 4.857E-05 | global batch size: 256 | lm loss: 1.940080E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.692 | TFLOPs: 21.09 |
31: iteration 128700/ 173500 | consumed samples: 32947200 | consumed tokens: 67475865600 | elapsed time per iteration (s): 0.84 | learning rate: 4.856E-05 | global batch size: 256 | lm loss: 1.938877E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.420 | TFLOPs: 18.42 |
31: iteration 128710/ 173500 | consumed samples: 32949760 | consumed tokens: 67481108480 | elapsed time per iteration (s): 0.87 | learning rate: 4.855E-05 | global batch size: 256 | lm loss: 1.949910E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.064 | TFLOPs: 17.85 |
31: iteration 128720/ 173500 | consumed samples: 32952320 | consumed tokens: 67486351360 | elapsed time per iteration (s): 0.77 | learning rate: 4.854E-05 | global batch size: 256 | lm loss: 1.927920E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.690 | TFLOPs: 
20.13 | 31: iteration 128730/ 173500 | consumed samples: 32954880 | consumed tokens: 67491594240 | elapsed time per iteration (s): 0.79 | learning rate: 4.852E-05 | global batch size: 256 | lm loss: 1.932310E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.186 | TFLOPs: 19.67 | 31: iteration 128740/ 173500 | consumed samples: 32957440 | consumed tokens: 67496837120 | elapsed time per iteration (s): 0.76 | learning rate: 4.851E-05 | global batch size: 256 | lm loss: 1.925252E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.827 | TFLOPs: 20.32 | 31: iteration 128750/ 173500 | consumed samples: 32960000 | consumed tokens: 67502080000 | elapsed time per iteration (s): 0.78 | learning rate: 4.850E-05 | global batch size: 256 | lm loss: 1.904484E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.049 | TFLOPs: 19.97 | 31: iteration 128760/ 173500 | consumed samples: 32962560 | consumed tokens: 67507322880 | elapsed time per iteration (s): 0.75 | learning rate: 4.849E-05 | global batch size: 256 | lm loss: 1.955536E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.215 | TFLOPs: 20.58 | 31: iteration 128770/ 173500 | consumed samples: 32965120 | consumed tokens: 67512565760 | elapsed time per iteration (s): 0.76 | learning rate: 4.848E-05 | global batch size: 256 | lm loss: 1.943561E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.809 | TFLOPs: 20.32 | 31: iteration 128780/ 173500 | consumed samples: 32967680 | consumed tokens: 67517808640 | elapsed time per iteration (s): 0.73 | learning rate: 4.846E-05 | global batch size: 256 | lm loss: 1.942665E+00 | grad norm: 0.167 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.663 | TFLOPs: 21.21 | 31: iteration 128790/ 173500 | consumed samples: 32970240 | consumed tokens: 67523051520 | elapsed time per iteration (s): 0.80 | learning rate: 4.845E-05 | global batch size: 256 | lm loss: 1.922523E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.928 | TFLOPs: 19.35 | 31: iteration 128800/ 173500 | consumed samples: 32972800 | consumed tokens: 67528294400 | elapsed time per iteration (s): 0.74 | learning rate: 4.844E-05 | global batch size: 256 | lm loss: 1.969057E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.101 | TFLOPs: 21.06 | 31: iteration 128810/ 173500 | consumed samples: 32975360 | consumed tokens: 67533537280 | elapsed time per iteration (s): 0.80 | learning rate: 4.843E-05 | global batch size: 256 | lm loss: 1.947714E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.603 | TFLOPs: 19.40 | 31: iteration 128820/ 173500 | consumed samples: 32977920 | consumed tokens: 67538780160 | elapsed time per iteration (s): 0.82 | learning rate: 4.842E-05 | global batch size: 256 | lm loss: 1.927405E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.790 | TFLOPs: 18.86 | 31: iteration 128830/ 173500 | consumed samples: 32980480 | consumed tokens: 67544023040 | elapsed time per iteration (s): 0.79 | learning rate: 4.840E-05 | global batch size: 256 | lm loss: 1.935360E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.012 | TFLOPs: 19.60 | 31: iteration 128840/ 173500 | consumed samples: 32983040 | consumed tokens: 67549265920 | elapsed time per 
iteration (s): 0.75 | learning rate: 4.839E-05 | global batch size: 256 | lm loss: 1.960698E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.079 | TFLOPs: 20.63 | 31: iteration 128850/ 173500 | consumed samples: 32985600 | consumed tokens: 67554508800 | elapsed time per iteration (s): 0.80 | learning rate: 4.838E-05 | global batch size: 256 | lm loss: 1.932720E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.960 | TFLOPs: 19.30 | 31: iteration 128860/ 173500 | consumed samples: 32988160 | consumed tokens: 67559751680 | elapsed time per iteration (s): 0.81 | learning rate: 4.837E-05 | global batch size: 256 | lm loss: 1.934574E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.103 | TFLOPs: 19.18 | 31: iteration 128870/ 173500 | consumed samples: 32990720 | consumed tokens: 67564994560 | elapsed time per iteration (s): 0.78 | learning rate: 4.836E-05 | global batch size: 256 | lm loss: 1.943168E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.468 | TFLOPs: 19.81 | 31: iteration 128880/ 173500 | consumed samples: 32993280 | consumed tokens: 67570237440 | elapsed time per iteration (s): 0.76 | learning rate: 4.834E-05 | global batch size: 256 | lm loss: 1.934495E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.499 | TFLOPs: 20.42 | 31: iteration 128890/ 173500 | consumed samples: 32995840 | consumed tokens: 67575480320 | elapsed time per iteration (s): 0.75 | learning rate: 4.833E-05 | global batch size: 256 | lm loss: 1.943047E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.716 | TFLOPs: 
20.67 | 31: iteration 128900/ 173500 | consumed samples: 32998400 | consumed tokens: 67580723200 | elapsed time per iteration (s): 0.84 | learning rate: 4.832E-05 | global batch size: 256 | lm loss: 1.941070E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.186 | TFLOPs: 18.40 | 31: iteration 128910/ 173500 | consumed samples: 33000960 | consumed tokens: 67585966080 | elapsed time per iteration (s): 0.73 | learning rate: 4.831E-05 | global batch size: 256 | lm loss: 1.956881E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.393 | TFLOPs: 21.20 | 31: iteration 128920/ 173500 | consumed samples: 33003520 | consumed tokens: 67591208960 | elapsed time per iteration (s): 0.72 | learning rate: 4.830E-05 | global batch size: 256 | lm loss: 1.946988E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.246 | TFLOPs: 21.55 | 31: iteration 128930/ 173500 | consumed samples: 33006080 | consumed tokens: 67596451840 | elapsed time per iteration (s): 0.79 | learning rate: 4.828E-05 | global batch size: 256 | lm loss: 1.969025E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.271 | TFLOPs: 19.62 | 31: iteration 128940/ 173500 | consumed samples: 33008640 | consumed tokens: 67601694720 | elapsed time per iteration (s): 0.73 | learning rate: 4.827E-05 | global batch size: 256 | lm loss: 1.938965E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.810 | TFLOPs: 21.34 | 31: iteration 128950/ 173500 | consumed samples: 33011200 | consumed tokens: 67606937600 | elapsed time per iteration (s): 0.79 | learning rate: 4.826E-05 | global batch size: 256 | lm loss: 1.942994E+00 | grad norm: 0.189 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.685 | TFLOPs: 19.70 | 31: iteration 128960/ 173500 | consumed samples: 33013760 | consumed tokens: 67612180480 | elapsed time per iteration (s): 0.82 | learning rate: 4.825E-05 | global batch size: 256 | lm loss: 1.945205E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.112 | TFLOPs: 18.94 | 31: iteration 128970/ 173500 | consumed samples: 33016320 | consumed tokens: 67617423360 | elapsed time per iteration (s): 0.82 | learning rate: 4.824E-05 | global batch size: 256 | lm loss: 1.909052E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.186 | TFLOPs: 18.95 | 31: iteration 128980/ 173500 | consumed samples: 33018880 | consumed tokens: 67622666240 | elapsed time per iteration (s): 0.78 | learning rate: 4.822E-05 | global batch size: 256 | lm loss: 1.963662E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.313 | TFLOPs: 19.92 | 31: iteration 128990/ 173500 | consumed samples: 33021440 | consumed tokens: 67627909120 | elapsed time per iteration (s): 0.79 | learning rate: 4.821E-05 | global batch size: 256 | lm loss: 1.928271E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.015 | TFLOPs: 19.66 | 31: iteration 129000/ 173500 | consumed samples: 33024000 | consumed tokens: 67633152000 | elapsed time per iteration (s): 0.77 | learning rate: 4.820E-05 | global batch size: 256 | lm loss: 1.937563E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.495 | TFLOPs: 20.24 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at 
iteration 129000 | lm loss value: 1.884587E+00 | lm loss PPL: 6.583638E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 129000 to checkpoints_1b1long 0: [2022-11-26 23:09:37,346] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step129000 is begin to save! 0: [2022-11-26 23:09:37,357] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_01-model_00-model_states.pt... 0: [2022-11-26 23:09:37,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_01-model_00-model_states.pt. 0: [2022-11-26 23:09:37,575] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_03-model_00-model_states.pt... 0: [2022-11-26 23:09:37,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_03-model_00-model_states.pt. 0: [2022-11-26 23:09:37,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_04-model_00-model_states.pt... 0: [2022-11-26 23:09:37,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_04-model_00-model_states.pt. 0: [2022-11-26 23:09:37,739] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_05-model_00-model_states.pt... 0: [2022-11-26 23:09:37,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_05-model_00-model_states.pt. 0: [2022-11-26 23:09:37,819] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_06-model_00-model_states.pt... 0: [2022-11-26 23:09:37,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_06-model_00-model_states.pt. 
0: [2022-11-26 23:09:37,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_07-model_00-model_states.pt... 0: [2022-11-26 23:09:37,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_07-model_00-model_states.pt. 0: [2022-11-26 23:09:37,971] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_08-model_00-model_states.pt... 0: [2022-11-26 23:09:38,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_08-model_00-model_states.pt. 0: [2022-11-26 23:09:38,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_09-model_00-model_states.pt... 0: [2022-11-26 23:09:38,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_09-model_00-model_states.pt. 0: [2022-11-26 23:09:38,121] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_10-model_00-model_states.pt... 0: [2022-11-26 23:09:38,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_10-model_00-model_states.pt. 0: [2022-11-26 23:09:38,194] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_11-model_00-model_states.pt... 0: [2022-11-26 23:09:38,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_11-model_00-model_states.pt. 0: [2022-11-26 23:09:38,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_12-model_00-model_states.pt... 0: [2022-11-26 23:09:38,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_12-model_00-model_states.pt. 
0: [2022-11-26 23:09:38,347] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_13-model_00-model_states.pt... 0: [2022-11-26 23:09:38,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_13-model_00-model_states.pt. 0: [2022-11-26 23:09:38,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_14-model_00-model_states.pt... 0: [2022-11-26 23:09:38,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_14-model_00-model_states.pt. 0: [2022-11-26 23:09:38,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_15-model_00-model_states.pt... 0: [2022-11-26 23:09:38,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_15-model_00-model_states.pt. 0: [2022-11-26 23:09:38,568] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_16-model_00-model_states.pt... 0: [2022-11-26 23:09:38,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_16-model_00-model_states.pt. 0: [2022-11-26 23:09:38,642] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_17-model_00-model_states.pt... 0: [2022-11-26 23:09:38,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_17-model_00-model_states.pt. 0: [2022-11-26 23:09:38,716] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_18-model_00-model_states.pt... 0: [2022-11-26 23:09:38,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_18-model_00-model_states.pt. 
0: [2022-11-26 23:09:38,790] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_19-model_00-model_states.pt... 0: [2022-11-26 23:09:38,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_19-model_00-model_states.pt. 0: [2022-11-26 23:09:38,863] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_20-model_00-model_states.pt... 0: [2022-11-26 23:09:38,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_20-model_00-model_states.pt. 0: [2022-11-26 23:09:38,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_21-model_00-model_states.pt... 0: [2022-11-26 23:09:39,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_21-model_00-model_states.pt. 0: [2022-11-26 23:09:39,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_22-model_00-model_states.pt... 0: [2022-11-26 23:09:39,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_22-model_00-model_states.pt. 0: [2022-11-26 23:09:39,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_23-model_00-model_states.pt... 0: [2022-11-26 23:09:39,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_23-model_00-model_states.pt. 0: [2022-11-26 23:09:39,157] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_24-model_00-model_states.pt... 0: [2022-11-26 23:09:39,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_24-model_00-model_states.pt. 
0: [2022-11-26 23:09:39,228] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_25-model_00-model_states.pt... 0: [2022-11-26 23:09:39,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_25-model_00-model_states.pt. 0: [2022-11-26 23:09:39,303] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_26-model_00-model_states.pt... 0: [2022-11-26 23:09:39,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_26-model_00-model_states.pt. 0: [2022-11-26 23:09:39,374] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_27-model_00-model_states.pt... 0: [2022-11-26 23:09:39,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_27-model_00-model_states.pt. 0: [2022-11-26 23:09:39,446] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_28-model_00-model_states.pt... 0: [2022-11-26 23:09:39,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_28-model_00-model_states.pt. 0: [2022-11-26 23:09:39,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/layer_30-model_00-model_states.pt... 0: [2022-11-26 23:09:39,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/layer_30-model_00-model_states.pt. 0: [2022-11-26 23:09:39,526] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step129000/mp_rank_00_model_states.pt 0: [2022-11-26 23:09:39,526] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/mp_rank_00_model_states.pt... 
0: [2022-11-26 23:09:39,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/mp_rank_00_model_states.pt. 0: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
7: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
10: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 
2: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
12: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 
11: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
14: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 
22: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 
21: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 
27: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
5: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
4: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
2: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
12: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
28: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 
30: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 
26: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 
8: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
20: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 
14: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 
26: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
15: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 28: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 
19: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 
0: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:09:39,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:09:39,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:09:39,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 23:09:39,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 21: [2022-11-26 23:09:39,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:09:39,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 23:09:39,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 6: [2022-11-26 23:09:39,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 23:09:39,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 23:09:39,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 19: [2022-11-26 23:09:39,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:09:39,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 12: [2022-11-26 23:09:39,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:09:39,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 12: [2022-11-26 23:09:39,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 23:09:39,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 17: [2022-11-26 23:09:39,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:09:39,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:09:39,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 23:09:39,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 
16: [2022-11-26 23:09:39,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:09:39,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 23:09:39,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 20: [2022-11-26 23:09:39,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 23:09:39,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 23:09:39,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 22: [2022-11-26 23:09:39,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:09:39,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:09:39,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 8: [2022-11-26 23:09:39,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 2: [2022-11-26 23:09:39,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
22: [2022-11-26 23:09:39,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 8: [2022-11-26 23:09:39,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 2: [2022-11-26 23:09:39,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 23:09:39,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 9: [2022-11-26 23:09:39,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:09:39,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:09:39,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 19: [2022-11-26 23:09:39,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 9: [2022-11-26 23:09:39,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 19: [2022-11-26 23:09:39,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 17: [2022-11-26 23:09:39,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 23:09:39,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 
26: [2022-11-26 23:09:39,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:09:39,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 26: [2022-11-26 23:09:39,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 3: [2022-11-26 23:09:39,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 26: [2022-11-26 23:09:39,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 3: [2022-11-26 23:09:39,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 21: [2022-11-26 23:09:39,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:09:39,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 23:09:39,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 20: [2022-11-26 23:09:39,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 23:09:39,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 23:09:39,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 
13: [2022-11-26 23:09:39,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:09:39,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 23:09:39,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 15: [2022-11-26 23:09:39,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:09:39,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 23:09:39,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:09:39,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:09:39,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 13: [2022-11-26 23:09:39,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 15: [2022-11-26 23:09:39,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 23:09:39,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 13: [2022-11-26 23:09:39,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 
23: [2022-11-26 23:09:39,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:09:39,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:09:39,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:09:39,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 23: [2022-11-26 23:09:39,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 23:09:39,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 18: [2022-11-26 23:09:39,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:09:39,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 14: [2022-11-26 23:09:39,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 18: [2022-11-26 23:09:39,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 23:09:39,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
18: [2022-11-26 23:09:39,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 18: [2022-11-26 23:09:39,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 23:09:39,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 24: [2022-11-26 23:09:39,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 23: [2022-11-26 23:09:39,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 23:09:39,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 23:09:39,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 6: [2022-11-26 23:09:39,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:09:39,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 23:09:39,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 10: [2022-11-26 23:09:39,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-26 23:09:39,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 23:09:39,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 9: [2022-11-26 23:09:39,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:09:39,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 23:09:39,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 22: [2022-11-26 23:09:39,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:09:39,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 23:09:39,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 8: [2022-11-26 23:09:39,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:09:39,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 23:09:39,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 12: [2022-11-26 23:09:39,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 23:09:39,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 23:09:39,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 17: [2022-11-26 23:09:39,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:09:39,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 23:09:39,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 25: [2022-11-26 23:09:39,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:09:39,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 11: [2022-11-26 23:09:39,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:09:39,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 11: [2022-11-26 23:09:39,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 23:09:39,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 11: [2022-11-26 23:09:39,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
3: [2022-11-26 23:09:39,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:09:39,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 3: [2022-11-26 23:09:39,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 11: [2022-11-26 23:09:39,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 3: [2022-11-26 23:09:39,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 10: [2022-11-26 23:09:39,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:09:39,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 23:09:39,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 2: [2022-11-26 23:09:39,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:09:39,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 23:09:39,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 10: [2022-11-26 23:09:39,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 23:09:39,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 0: [2022-11-26 23:09:39,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:09:39,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 10: [2022-11-26 23:09:39,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 29: [2022-11-26 23:09:39,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:09:39,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 29: [2022-11-26 23:09:39,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 23:09:39,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 18: [2022-11-26 23:09:39,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 23: [2022-11-26 23:09:39,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
18: [2022-11-26 23:09:39,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 23: [2022-11-26 23:09:39,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 18: [2022-11-26 23:09:39,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 23: [2022-11-26 23:09:39,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 19: [2022-11-26 23:09:39,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:09:39,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 23:09:39,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 26: [2022-11-26 23:09:39,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 23:09:39,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 23:09:39,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 22: [2022-11-26 23:09:39,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
22: [2022-11-26 23:09:39,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 23:09:39,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 24: [2022-11-26 23:09:39,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:09:39,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 23:09:39,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 14: [2022-11-26 23:09:39,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:09:39,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 23:09:39,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 21: [2022-11-26 23:09:39,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:09:39,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
21: [2022-11-26 23:09:39,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 6: [2022-11-26 23:09:39,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 21: [2022-11-26 23:09:39,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 6: [2022-11-26 23:09:39,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 20: [2022-11-26 23:09:39,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:09:39,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 20: [2022-11-26 23:09:39,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 23:09:39,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 16: [2022-11-26 23:09:39,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 23:09:39,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 3: [2022-11-26 23:09:39,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 23:09:39,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 23:09:39,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 2: [2022-11-26 23:09:39,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:09:39,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 23:09:39,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 14: [2022-11-26 23:09:39,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:09:39,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 23:09:39,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 25: [2022-11-26 23:09:39,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:09:39,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:09:39,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 23:09:39,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 
25: [2022-11-26 23:09:39,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 23:09:39,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 20: [2022-11-26 23:09:39,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 23:09:39,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 23:09:39,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 24: [2022-11-26 23:09:39,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:09:39,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 23:09:39,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 29: [2022-11-26 23:09:39,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:09:39,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:09:39,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
24: [2022-11-26 23:09:39,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:09:39,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:09:39,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 12: [2022-11-26 23:09:39,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 24: [2022-11-26 23:09:39,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 14: [2022-11-26 23:09:39,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 29: [2022-11-26 23:09:39,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 12: [2022-11-26 23:09:39,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 14: [2022-11-26 23:09:39,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 29: [2022-11-26 23:09:39,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 24: [2022-11-26 23:09:39,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 29: [2022-11-26 23:09:39,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 
9: [2022-11-26 23:09:39,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:09:39,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 23:09:39,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 23: [2022-11-26 23:09:39,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 23:09:39,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 23:09:39,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 11: [2022-11-26 23:09:39,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:09:39,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 23:09:39,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 19: [2022-11-26 23:09:39,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:09:39,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 23:09:39,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 
8: [2022-11-26 23:09:39,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:09:39,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 23:09:39,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 5: [2022-11-26 23:09:39,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:09:39,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:09:39,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:09:39,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:09:39,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 23:09:39,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 23:09:39,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 
5: [2022-11-26 23:09:39,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 23:09:39,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 23:09:39,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 5: [2022-11-26 23:09:39,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 5: [2022-11-26 23:09:39,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 6: [2022-11-26 23:09:39,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:09:39,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 23:09:39,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 8: [2022-11-26 23:09:39,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:09:39,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 23:09:39,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 13: [2022-11-26 23:09:39,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 23:09:39,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 26: [2022-11-26 23:09:39,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:09:39,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 26: [2022-11-26 23:09:39,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 23:09:39,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 23:09:39,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 26: [2022-11-26 23:09:39,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 23:09:39,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 13: [2022-11-26 23:09:39,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:09:39,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 0: [2022-11-26 23:09:39,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-26 23:09:39,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:09:39,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 0: [2022-11-26 23:09:39,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 23:09:39,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 21: [2022-11-26 23:09:39,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:09:39,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 23:09:39,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 3: [2022-11-26 23:09:39,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:09:39,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 23:09:39,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 30: [2022-11-26 23:09:39,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 23:09:39,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
30: [2022-11-26 23:09:39,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 23:09:39,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 23:09:39,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 30: [2022-11-26 23:09:39,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 23:09:39,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 23:09:39,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 30: [2022-11-26 23:09:39,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 17: [2022-11-26 23:09:39,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:09:39,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:09:39,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 16: [2022-11-26 23:09:39,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 23:09:39,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 
17: [2022-11-26 23:09:39,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 22: [2022-11-26 23:09:39,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:09:39,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 9: [2022-11-26 23:09:39,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:09:39,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 9: [2022-11-26 23:09:39,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 23:09:39,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 4: [2022-11-26 23:09:39,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:09:39,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:09:39,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:09:39,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 23:09:39,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 23:09:39,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 18: [2022-11-26 23:09:39,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 4: [2022-11-26 23:09:39,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 23:09:39,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 23:09:39,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 4: [2022-11-26 23:09:39,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 18: [2022-11-26 23:09:39,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 2: [2022-11-26 23:09:39,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:09:39,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 10: [2022-11-26 23:09:39,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:09:39,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 
10: [2022-11-26 23:09:39,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 23:09:39,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 1: [2022-11-26 23:09:39,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:09:39,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:09:39,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 23:09:39,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 1: [2022-11-26 23:09:39,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 23:09:39,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 1: [2022-11-26 23:09:39,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:09:39,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
1: [2022-11-26 23:09:39,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 15: [2022-11-26 23:09:39,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 23:09:39,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 1: [2022-11-26 23:09:39,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 27: [2022-11-26 23:09:39,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:09:39,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:09:39,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:09:39,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
27: [2022-11-26 23:09:39,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 23:09:39,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 23:09:39,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 23:09:39,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 27: [2022-11-26 23:09:39,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 23:09:39,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 27: [2022-11-26 23:09:39,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 27: [2022-11-26 23:09:39,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 25: [2022-11-26 23:09:39,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:09:39,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 23:09:39,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 25: [2022-11-26 23:09:39,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
25: [2022-11-26 23:09:39,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 23:09:39,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 29: [2022-11-26 23:09:39,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:09:39,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 23:09:39,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 17: [2022-11-26 23:09:39,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:09:39,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 23:09:39,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 28: [2022-11-26 23:09:39,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-26 23:09:39,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 23:09:39,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
28: [2022-11-26 23:09:39,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 23:09:39,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 23:09:39,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 23:09:39,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 23:09:39,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 23:09:39,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 28: [2022-11-26 23:09:39,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 28: [2022-11-26 23:09:39,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 28: [2022-11-26 23:09:39,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 31: [2022-11-26 23:09:39,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 23:09:39,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 23:09:39,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
1: [2022-11-26 23:09:39,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 31: [2022-11-26 23:09:39,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 23:09:39,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 23:09:39,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 23:09:39,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 31: [2022-11-26 23:09:39,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 1: [2022-11-26 23:09:39,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 23:09:39,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 31: [2022-11-26 23:09:39,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 0: [2022-11-26 23:09:39,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 23:09:39,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 20: [2022-11-26 23:09:39,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-26 23:09:39,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 23:09:39,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 7: [2022-11-26 23:09:39,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:09:39,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:09:39,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:09:39,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:09:39,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 23:09:39,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 23:09:39,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 23:09:39,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 23:09:39,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 
7: [2022-11-26 23:09:39,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 7: [2022-11-26 23:09:39,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 7: [2022-11-26 23:09:39,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 15: [2022-11-26 23:09:39,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:09:39,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 23:09:39,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 5: [2022-11-26 23:09:39,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:09:39,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 23:09:39,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 6: [2022-11-26 23:09:39,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:09:39,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 23:09:39,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 
9: [2022-11-26 23:09:39,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:09:39,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 23:09:39,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 23: [2022-11-26 23:09:39,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-26 23:09:39,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 23:09:39,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 16: [2022-11-26 23:09:39,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:09:39,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 23:09:39,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 28: [2022-11-26 23:09:39,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:09:39,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-26 23:09:39,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 23:09:39,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 28: [2022-11-26 23:09:39,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 23:09:39,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 17: [2022-11-26 23:09:39,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:09:39,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 23:09:39,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 26: [2022-11-26 23:09:39,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 23:09:39,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 23:09:39,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 27: [2022-11-26 23:09:39,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-26 23:09:39,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 23:09:39,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 22: [2022-11-26 23:09:39,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:09:39,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 23:09:39,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 29: [2022-11-26 23:09:39,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:09:39,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 23:09:39,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 19: [2022-11-26 23:09:39,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:09:39,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 23:09:39,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 12: [2022-11-26 23:09:39,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-26 23:09:39,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 23:09:39,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 18: [2022-11-26 23:09:39,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:09:39,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 23:09:39,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 0: [2022-11-26 23:09:39,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:09:39,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 23:09:39,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 24: [2022-11-26 23:09:39,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:09:39,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 23:09:39,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 8: [2022-11-26 23:09:39,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 23:09:39,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 23:09:39,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 10: [2022-11-26 23:09:39,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:09:39,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 23:09:39,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 21: [2022-11-26 23:09:39,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:09:39,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 23:09:39,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 15: [2022-11-26 23:09:39,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:09:39,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 23:09:39,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 13: [2022-11-26 23:09:39,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 23:09:39,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 23:09:39,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 3: [2022-11-26 23:09:39,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:09:39,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 23:09:39,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 11: [2022-11-26 23:09:39,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:09:39,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 23:09:39,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 2: [2022-11-26 23:09:39,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:09:39,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 23:09:39,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 14: [2022-11-26 23:09:39,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-26 23:09:39,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 1: [2022-11-26 23:09:39,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:09:39,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 1: [2022-11-26 23:09:39,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 4: [2022-11-26 23:09:39,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:09:39,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 4: [2022-11-26 23:09:39,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 23:09:39,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 30: [2022-11-26 23:09:39,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 23:09:39,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 23:09:39,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 31: [2022-11-26 23:09:39,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-26 23:09:39,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 23:09:39,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 25: [2022-11-26 23:09:39,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:09:39,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 23:09:39,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 5: [2022-11-26 23:09:39,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:09:39,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 23:09:39,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 20: [2022-11-26 23:09:39,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-26 23:09:39,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 23:09:39,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 6: [2022-11-26 23:09:39,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 23:09:39,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 23:09:39,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 28: [2022-11-26 23:09:39,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 23:09:39,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 23:09:39,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 23: [2022-11-26 23:09:39,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 23:09:39,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 23:09:39,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 9: [2022-11-26 23:09:39,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:09:39,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 23:09:39,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 16: [2022-11-26 23:09:39,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
16: [2022-11-26 23:09:39,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 23:09:39,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 26: [2022-11-26 23:09:39,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 23:09:39,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 23:09:39,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 7: [2022-11-26 23:09:39,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:09:39,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 23:09:39,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 12: [2022-11-26 23:09:39,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:09:39,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 23:09:39,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 19: [2022-11-26 23:09:39,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-26 23:09:39,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 23:09:39,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 17: [2022-11-26 23:09:39,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:09:39,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 23:09:39,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 29: [2022-11-26 23:09:39,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:09:39,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 23:09:39,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 0: [2022-11-26 23:09:39,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:09:39,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
0: [2022-11-26 23:09:39,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 22: [2022-11-26 23:09:39,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 0: [2022-11-26 23:09:39,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 22: [2022-11-26 23:09:39,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 10: [2022-11-26 23:09:39,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:09:39,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:09:39,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 23:09:39,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 24: [2022-11-26 23:09:39,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 8: [2022-11-26 23:09:39,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:09:39,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 
8: [2022-11-26 23:09:39,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 23:09:39,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 2: [2022-11-26 23:09:39,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:09:39,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 23:09:39,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 27: [2022-11-26 23:09:39,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:09:39,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 23:09:39,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 16: [2022-11-26 23:09:39,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:09:39,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:09:39,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
21: [2022-11-26 23:09:39,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 5: [2022-11-26 23:09:39,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 16: [2022-11-26 23:09:39,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 5: [2022-11-26 23:09:39,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 16: [2022-11-26 23:09:39,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 21: [2022-11-26 23:09:39,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 18: [2022-11-26 23:09:39,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:09:39,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 23:09:39,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 13: [2022-11-26 23:09:39,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:09:39,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 23:09:39,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 
14: [2022-11-26 23:09:39,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:09:39,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 23:09:39,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 3: [2022-11-26 23:09:39,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:09:39,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 23:09:39,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 25: [2022-11-26 23:09:39,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:09:39,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 23: [2022-11-26 23:09:39,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:09:39,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 23: [2022-11-26 23:09:39,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 23:09:39,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 
20: [2022-11-26 23:09:39,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 23:09:39,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 11: [2022-11-26 23:09:39,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 20: [2022-11-26 23:09:39,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 11: [2022-11-26 23:09:39,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 15: [2022-11-26 23:09:39,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:09:39,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 15: [2022-11-26 23:09:39,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 19: [2022-11-26 23:09:39,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:09:39,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 19: [2022-11-26 23:09:39,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 23:09:39,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 
4: [2022-11-26 23:09:39,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:09:39,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 23:09:39,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 31: [2022-11-26 23:09:39,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 23:09:39,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 28: [2022-11-26 23:09:39,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 31: [2022-11-26 23:09:39,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 7: [2022-11-26 23:09:39,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 28: [2022-11-26 23:09:39,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 7: [2022-11-26 23:09:39,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 23:09:39,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 28: [2022-11-26 23:09:39,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 
6: [2022-11-26 23:09:39,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:09:39,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 23:09:39,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 29: [2022-11-26 23:09:39,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:09:39,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:09:39,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 23:09:39,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 10: [2022-11-26 23:09:39,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 23:09:39,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 24: [2022-11-26 23:09:39,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:09:39,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 23:09:39,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 
12: [2022-11-26 23:09:39,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:09:39,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 23:09:39,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 1: [2022-11-26 23:09:39,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:09:39,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:09:39,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 23:09:39,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 27: [2022-11-26 23:09:39,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 30: [2022-11-26 23:09:39,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:09:39,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:09:39,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 
21: [2022-11-26 23:09:39,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 23:09:39,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 30: [2022-11-26 23:09:39,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 23:09:39,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 18: [2022-11-26 23:09:39,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:09:39,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 23:09:39,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 7: [2022-11-26 23:09:39,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:09:39,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 23:09:39,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 26: [2022-11-26 23:09:39,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-26 23:09:39,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 23:09:39,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 6: [2022-11-26 23:09:39,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:09:39,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:09:39,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 19: [2022-11-26 23:09:39,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:09:39,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 9: [2022-11-26 23:09:39,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 19: [2022-11-26 23:09:39,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 23:09:39,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 9: [2022-11-26 23:09:39,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 23: [2022-11-26 23:09:39,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-26 23:09:39,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 0: [2022-11-26 23:09:39,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 23: [2022-11-26 23:09:39,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 0: [2022-11-26 23:09:39,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 23:09:39,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 22: [2022-11-26 23:09:39,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:09:39,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 23:09:39,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 11: [2022-11-26 23:09:39,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:09:39,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 23:09:39,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 17: [2022-11-26 23:09:39,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 
17: [2022-11-26 23:09:39,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 23:09:39,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 8: [2022-11-26 23:09:39,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:09:39,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 23:09:39,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 16: [2022-11-26 23:09:39,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:09:39,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:09:39,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:09:39,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 25: [2022-11-26 23:09:39,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:09:39,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 
15: [2022-11-26 23:09:39,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 21: [2022-11-26 23:09:39,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 15: [2022-11-26 23:09:39,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 21: [2022-11-26 23:09:39,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 25: [2022-11-26 23:09:39,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 23:09:39,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 30: [2022-11-26 23:09:39,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 28: [2022-11-26 23:09:39,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 30: [2022-11-26 23:09:39,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 28: [2022-11-26 23:09:39,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 23:09:39,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 30: [2022-11-26 23:09:39,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 
26: [2022-11-26 23:09:39,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 23:09:39,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 23:09:39,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 20: [2022-11-26 23:09:39,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:09:39,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:09:39,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 20: [2022-11-26 23:09:39,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 13: [2022-11-26 23:09:39,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 23:09:39,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 20: [2022-11-26 23:09:39,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 13: [2022-11-26 23:09:39,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 13: [2022-11-26 23:09:39,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 
12: [2022-11-26 23:09:39,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:09:39,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 31: [2022-11-26 23:09:39,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:09:39,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 31: [2022-11-26 23:09:39,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 23:09:39,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 23:09:39,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 23:09:39,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 31: [2022-11-26 23:09:39,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 5: [2022-11-26 23:09:39,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:09:39,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 23:09:39,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 
2: [2022-11-26 23:09:39,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:09:39,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 23:09:39,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 4: [2022-11-26 23:09:39,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:09:39,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 23:09:39,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 15: [2022-11-26 23:09:39,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:09:39,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:09:39,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 23:09:39,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 
3: [2022-11-26 23:09:39,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 23:09:39,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:09:39,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 3: [2022-11-26 23:09:39,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 29: [2022-11-26 23:09:39,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:09:39,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 14: [2022-11-26 23:09:39,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:09:39,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 14: [2022-11-26 23:09:39,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 29: [2022-11-26 23:09:39,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 2: [2022-11-26 23:09:39,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:09:39,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 
2: [2022-11-26 23:09:39,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 23:09:39,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 4: [2022-11-26 23:09:39,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:09:39,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 23:09:39,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 22: [2022-11-26 23:09:39,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:09:39,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 8: [2022-11-26 23:09:39,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:09:39,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 8: [2022-11-26 23:09:39,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 23:09:39,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 0: [2022-11-26 23:09:39,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
14: [2022-11-26 23:09:39,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:09:39,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 14: [2022-11-26 23:09:39,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 0: [2022-11-26 23:09:39,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 14: [2022-11-26 23:09:39,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 30: [2022-11-26 23:09:39,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:09:39,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 30: [2022-11-26 23:09:39,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 23:09:39,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 24: [2022-11-26 23:09:39,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 23:09:39,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 10: [2022-11-26 23:09:39,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-26 23:09:39,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 23:09:39,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 27: [2022-11-26 23:09:39,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:09:39,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 23:09:39,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 11: [2022-11-26 23:09:39,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:09:39,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 23:09:39,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 17: [2022-11-26 23:09:39,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:09:39,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 23:09:39,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 9: [2022-11-26 23:09:39,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 23:09:39,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 23:09:39,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 31: [2022-11-26 23:09:39,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 23:09:39,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 23:09:39,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 25: [2022-11-26 23:09:39,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:09:39,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 23:09:39,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 18: [2022-11-26 23:09:39,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:09:39,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 23:09:39,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 30: [2022-11-26 23:09:39,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-26 23:09:39,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 23:09:39,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 1: [2022-11-26 23:09:39,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:09:39,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 23:09:39,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 1: [2022-11-26 23:09:39,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:09:39,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 23:09:39,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:09:39,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 1: [2022-11-26 23:09:39,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 23:09:39,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 4: [2022-11-26 23:09:39,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 23:09:39,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step129000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 23:09:39,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step129000 is ready now! 0: successfully saved checkpoint at iteration 129000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2527.62 31: iteration 129010/ 173500 | consumed samples: 33026560 | consumed tokens: 67638394880 | elapsed time per iteration (s): 1.06 | learning rate: 4.819E-05 | global batch size: 256 | lm loss: 1.947958E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.902 | TFLOPs: 14.57 | 31: iteration 129020/ 173500 | consumed samples: 33029120 | consumed tokens: 67643637760 | elapsed time per iteration (s): 0.74 | learning rate: 4.818E-05 | global batch size: 256 | lm loss: 1.907235E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.691 | TFLOPs: 21.03 | 31: iteration 129030/ 173500 | consumed samples: 33031680 | consumed tokens: 67648880640 | elapsed time per iteration (s): 0.82 | learning rate: 4.816E-05 | global batch size: 256 | lm loss: 1.943545E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.892 | TFLOPs: 18.87 | 31: iteration 129040/ 173500 | consumed samples: 33034240 | consumed tokens: 67654123520 | elapsed time per iteration (s): 0.76 | learning rate: 4.815E-05 | global batch size: 256 | lm loss: 1.939982E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.618 | TFLOPs: 20.36 | 31: iteration 129050/ 173500 | consumed samples: 33036800 | consumed tokens: 67659366400 | elapsed time per iteration (s): 0.84 | learning rate: 4.814E-05 | 
global batch size: 256 | lm loss: 1.943086E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.349 | TFLOPs: 18.47 | 31: iteration 129060/ 173500 | consumed samples: 33039360 | consumed tokens: 67664609280 | elapsed time per iteration (s): 0.78 | learning rate: 4.813E-05 | global batch size: 256 | lm loss: 1.927359E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.296 | TFLOPs: 19.92 | 31: iteration 129070/ 173500 | consumed samples: 33041920 | consumed tokens: 67669852160 | elapsed time per iteration (s): 0.77 | learning rate: 4.812E-05 | global batch size: 256 | lm loss: 1.923930E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.573 | TFLOPs: 20.12 | 31: iteration 129080/ 173500 | consumed samples: 33044480 | consumed tokens: 67675095040 | elapsed time per iteration (s): 0.73 | learning rate: 4.811E-05 | global batch size: 256 | lm loss: 1.957806E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.909 | TFLOPs: 21.23 | 31: iteration 129090/ 173500 | consumed samples: 33047040 | consumed tokens: 67680337920 | elapsed time per iteration (s): 0.83 | learning rate: 4.809E-05 | global batch size: 256 | lm loss: 1.939575E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.100 | TFLOPs: 18.64 | 31: iteration 129100/ 173500 | consumed samples: 33049600 | consumed tokens: 67685580800 | elapsed time per iteration (s): 0.75 | learning rate: 4.808E-05 | global batch size: 256 | lm loss: 1.954629E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.523 | TFLOPs: 20.78 | 31: iteration 129110/ 173500 | consumed 
samples: 33052160 | consumed tokens: 67690823680 | elapsed time per iteration (s): 0.75 | learning rate: 4.807E-05 | global batch size: 256 | lm loss: 1.961677E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.827 | TFLOPs: 20.68 | 31: iteration 129120/ 173500 | consumed samples: 33054720 | consumed tokens: 67696066560 | elapsed time per iteration (s): 0.78 | learning rate: 4.806E-05 | global batch size: 256 | lm loss: 1.949218E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.678 | TFLOPs: 19.82 | 31: iteration 129130/ 173500 | consumed samples: 33057280 | consumed tokens: 67701309440 | elapsed time per iteration (s): 0.77 | learning rate: 4.805E-05 | global batch size: 256 | lm loss: 1.947772E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.689 | TFLOPs: 20.07 | 31: iteration 129140/ 173500 | consumed samples: 33059840 | consumed tokens: 67706552320 | elapsed time per iteration (s): 0.80 | learning rate: 4.803E-05 | global batch size: 256 | lm loss: 1.935431E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.612 | TFLOPs: 19.28 | 31: iteration 129150/ 173500 | consumed samples: 33062400 | consumed tokens: 67711795200 | elapsed time per iteration (s): 0.81 | learning rate: 4.802E-05 | global batch size: 256 | lm loss: 1.934931E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.878 | TFLOPs: 19.23 | 31: iteration 129160/ 173500 | consumed samples: 33064960 | consumed tokens: 67717038080 | elapsed time per iteration (s): 0.77 | learning rate: 4.801E-05 | global batch size: 256 | lm loss: 1.963317E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 330.662 | TFLOPs: 20.00 | 31: iteration 129170/ 173500 | consumed samples: 33067520 | consumed tokens: 67722280960 | elapsed time per iteration (s): 0.84 | learning rate: 4.800E-05 | global batch size: 256 | lm loss: 1.925467E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.129 | TFLOPs: 18.46 | 31: iteration 129180/ 173500 | consumed samples: 33070080 | consumed tokens: 67727523840 | elapsed time per iteration (s): 0.80 | learning rate: 4.799E-05 | global batch size: 256 | lm loss: 1.960138E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.473 | TFLOPs: 19.45 | 31: iteration 129190/ 173500 | consumed samples: 33072640 | consumed tokens: 67732766720 | elapsed time per iteration (s): 0.81 | learning rate: 4.797E-05 | global batch size: 256 | lm loss: 1.935702E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.025 | TFLOPs: 19.18 | 31: iteration 129200/ 173500 | consumed samples: 33075200 | consumed tokens: 67738009600 | elapsed time per iteration (s): 0.78 | learning rate: 4.796E-05 | global batch size: 256 | lm loss: 1.928173E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.992 | TFLOPs: 19.96 | 31: iteration 129210/ 173500 | consumed samples: 33077760 | consumed tokens: 67743252480 | elapsed time per iteration (s): 0.78 | learning rate: 4.795E-05 | global batch size: 256 | lm loss: 1.947655E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.732 | TFLOPs: 19.89 | 31: iteration 129220/ 173500 | consumed samples: 33080320 | consumed tokens: 67748495360 | elapsed time per iteration (s): 0.72 | learning rate: 4.794E-05 | global 
batch size: 256 | lm loss: 1.952316E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.389 | TFLOPs: 21.56 | 31: iteration 129230/ 173500 | consumed samples: 33082880 | consumed tokens: 67753738240 | elapsed time per iteration (s): 0.85 | learning rate: 4.793E-05 | global batch size: 256 | lm loss: 1.943992E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.178 | TFLOPs: 18.28 | 31: iteration 129240/ 173500 | consumed samples: 33085440 | consumed tokens: 67758981120 | elapsed time per iteration (s): 0.74 | learning rate: 4.791E-05 | global batch size: 256 | lm loss: 1.970462E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.536 | TFLOPs: 20.96 | 31: iteration 129250/ 173500 | consumed samples: 33088000 | consumed tokens: 67764224000 | elapsed time per iteration (s): 0.77 | learning rate: 4.790E-05 | global batch size: 256 | lm loss: 1.957788E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.591 | TFLOPs: 20.18 | 31: iteration 129260/ 173500 | consumed samples: 33090560 | consumed tokens: 67769466880 | elapsed time per iteration (s): 0.75 | learning rate: 4.789E-05 | global batch size: 256 | lm loss: 1.968636E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.898 | TFLOPs: 20.74 | 31: iteration 129270/ 173500 | consumed samples: 33093120 | consumed tokens: 67774709760 | elapsed time per iteration (s): 0.75 | learning rate: 4.788E-05 | global batch size: 256 | lm loss: 1.937152E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.322 | TFLOPs: 20.71 | 31: iteration 129280/ 173500 | consumed samples: 
33095680 | consumed tokens: 67779952640 | elapsed time per iteration (s): 0.76 | learning rate: 4.787E-05 | global batch size: 256 | lm loss: 1.933999E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.963 | TFLOPs: 20.32 | 31: iteration 129290/ 173500 | consumed samples: 33098240 | consumed tokens: 67785195520 | elapsed time per iteration (s): 0.74 | learning rate: 4.785E-05 | global batch size: 256 | lm loss: 1.928033E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.197 | TFLOPs: 20.82 | 31: iteration 129300/ 173500 | consumed samples: 33100800 | consumed tokens: 67790438400 | elapsed time per iteration (s): 0.76 | learning rate: 4.784E-05 | global batch size: 256 | lm loss: 1.948473E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.056 | TFLOPs: 20.33 | 31: iteration 129310/ 173500 | consumed samples: 33103360 | consumed tokens: 67795681280 | elapsed time per iteration (s): 0.78 | learning rate: 4.783E-05 | global batch size: 256 | lm loss: 1.912683E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.833 | TFLOPs: 19.89 | 31: iteration 129320/ 173500 | consumed samples: 33105920 | consumed tokens: 67800924160 | elapsed time per iteration (s): 0.74 | learning rate: 4.782E-05 | global batch size: 256 | lm loss: 1.931468E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.905 | TFLOPs: 20.93 | 31: iteration 129330/ 173500 | consumed samples: 33108480 | consumed tokens: 67806167040 | elapsed time per iteration (s): 0.83 | learning rate: 4.781E-05 | global batch size: 256 | lm loss: 1.946656E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 308.921 | TFLOPs: 18.69 | 31: iteration 129340/ 173500 | consumed samples: 33111040 | consumed tokens: 67811409920 | elapsed time per iteration (s): 0.86 | learning rate: 4.780E-05 | global batch size: 256 | lm loss: 1.923892E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.458 | TFLOPs: 18.06 | 31: iteration 129350/ 173500 | consumed samples: 33113600 | consumed tokens: 67816652800 | elapsed time per iteration (s): 0.77 | learning rate: 4.778E-05 | global batch size: 256 | lm loss: 1.922694E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.585 | TFLOPs: 20.18 | 31: iteration 129360/ 173500 | consumed samples: 33116160 | consumed tokens: 67821895680 | elapsed time per iteration (s): 0.82 | learning rate: 4.777E-05 | global batch size: 256 | lm loss: 1.940014E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.226 | TFLOPs: 18.95 | 31: iteration 129370/ 173500 | consumed samples: 33118720 | consumed tokens: 67827138560 | elapsed time per iteration (s): 0.75 | learning rate: 4.776E-05 | global batch size: 256 | lm loss: 1.968014E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.970 | TFLOPs: 20.57 | 31: iteration 129380/ 173500 | consumed samples: 33121280 | consumed tokens: 67832381440 | elapsed time per iteration (s): 0.73 | learning rate: 4.775E-05 | global batch size: 256 | lm loss: 1.946666E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.774 | TFLOPs: 21.10 | 31: iteration 129390/ 173500 | consumed samples: 33123840 | consumed tokens: 67837624320 | elapsed time per iteration (s): 0.75 | learning rate: 4.774E-05 | global batch 
size: 256 | lm loss: 1.919312E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.742 | TFLOPs: 20.55 | 31: iteration 129400/ 173500 | consumed samples: 33126400 | consumed tokens: 67842867200 | elapsed time per iteration (s): 0.75 | learning rate: 4.772E-05 | global batch size: 256 | lm loss: 1.955799E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.129 | TFLOPs: 20.70 | 31: iteration 129410/ 173500 | consumed samples: 33128960 | consumed tokens: 67848110080 | elapsed time per iteration (s): 0.76 | learning rate: 4.771E-05 | global batch size: 256 | lm loss: 1.913093E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.751 | TFLOPs: 20.49 | 31: iteration 129420/ 173500 | consumed samples: 33131520 | consumed tokens: 67853352960 | elapsed time per iteration (s): 0.83 | learning rate: 4.770E-05 | global batch size: 256 | lm loss: 1.930729E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.736 | TFLOPs: 18.56 | 31: iteration 129430/ 173500 | consumed samples: 33134080 | consumed tokens: 67858595840 | elapsed time per iteration (s): 0.74 | learning rate: 4.769E-05 | global batch size: 256 | lm loss: 1.943617E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.595 | TFLOPs: 20.85 | 31: iteration 129440/ 173500 | consumed samples: 33136640 | consumed tokens: 67863838720 | elapsed time per iteration (s): 0.75 | learning rate: 4.768E-05 | global batch size: 256 | lm loss: 1.956136E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.745 | TFLOPs: 20.61 | 31: iteration 129450/ 173500 | consumed samples: 33139200 
| consumed tokens: 67869081600 | elapsed time per iteration (s): 0.76 | learning rate: 4.766E-05 | global batch size: 256 | lm loss: 1.938928E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.564 | TFLOPs: 20.30 | 31: iteration 129460/ 173500 | consumed samples: 33141760 | consumed tokens: 67874324480 | elapsed time per iteration (s): 0.80 | learning rate: 4.765E-05 | global batch size: 256 | lm loss: 1.930210E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.774 | TFLOPs: 19.47 | 31: iteration 129470/ 173500 | consumed samples: 33144320 | consumed tokens: 67879567360 | elapsed time per iteration (s): 0.80 | learning rate: 4.764E-05 | global batch size: 256 | lm loss: 1.936099E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.861 | TFLOPs: 19.35 | 31: iteration 129480/ 173500 | consumed samples: 33146880 | consumed tokens: 67884810240 | elapsed time per iteration (s): 0.75 | learning rate: 4.763E-05 | global batch size: 256 | lm loss: 1.933930E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.409 | TFLOPs: 20.53 | 31: iteration 129490/ 173500 | consumed samples: 33149440 | consumed tokens: 67890053120 | elapsed time per iteration (s): 0.85 | learning rate: 4.762E-05 | global batch size: 256 | lm loss: 1.952674E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.507 | TFLOPs: 18.12 | 31: iteration 129500/ 173500 | consumed samples: 33152000 | consumed tokens: 67895296000 | elapsed time per iteration (s): 0.78 | learning rate: 4.761E-05 | global batch size: 256 | lm loss: 1.949168E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 328.871 | TFLOPs: 19.90 | 31: iteration 129510/ 173500 | consumed samples: 33154560 | consumed tokens: 67900538880 | elapsed time per iteration (s): 0.86 | learning rate: 4.759E-05 | global batch size: 256 | lm loss: 1.920005E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.367 | TFLOPs: 17.93 | 31: iteration 129520/ 173500 | consumed samples: 33157120 | consumed tokens: 67905781760 | elapsed time per iteration (s): 0.78 | learning rate: 4.758E-05 | global batch size: 256 | lm loss: 1.939366E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.130 | TFLOPs: 19.97 | 31: iteration 129530/ 173500 | consumed samples: 33159680 | consumed tokens: 67911024640 | elapsed time per iteration (s): 0.78 | learning rate: 4.757E-05 | global batch size: 256 | lm loss: 1.929926E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.007 | TFLOPs: 19.78 | 31: iteration 129540/ 173500 | consumed samples: 33162240 | consumed tokens: 67916267520 | elapsed time per iteration (s): 0.78 | learning rate: 4.756E-05 | global batch size: 256 | lm loss: 1.946482E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.764 | TFLOPs: 19.95 | 31: iteration 129550/ 173500 | consumed samples: 33164800 | consumed tokens: 67921510400 | elapsed time per iteration (s): 0.84 | learning rate: 4.755E-05 | global batch size: 256 | lm loss: 1.900557E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.348 | TFLOPs: 18.47 | 31: iteration 129560/ 173500 | consumed samples: 33167360 | consumed tokens: 67926753280 | elapsed time per iteration (s): 0.82 | learning rate: 4.753E-05 | global batch size: 
256 | lm loss: 1.931961E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.840 | TFLOPs: 18.87 | 31: iteration 129570/ 173500 | consumed samples: 33169920 | consumed tokens: 67931996160 | elapsed time per iteration (s): 0.83 | learning rate: 4.752E-05 | global batch size: 256 | lm loss: 1.920250E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.916 | TFLOPs: 18.69 | 31: iteration 129580/ 173500 | consumed samples: 33172480 | consumed tokens: 67937239040 | elapsed time per iteration (s): 0.80 | learning rate: 4.751E-05 | global batch size: 256 | lm loss: 1.916291E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.942 | TFLOPs: 19.36 | 31: iteration 129590/ 173500 | consumed samples: 33175040 | consumed tokens: 67942481920 | elapsed time per iteration (s): 0.78 | learning rate: 4.750E-05 | global batch size: 256 | lm loss: 1.952187E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.848 | TFLOPs: 19.89 | 31: iteration 129600/ 173500 | consumed samples: 33177600 | consumed tokens: 67947724800 | elapsed time per iteration (s): 0.94 | learning rate: 4.749E-05 | global batch size: 256 | lm loss: 1.919371E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 272.941 | TFLOPs: 16.51 | 31: iteration 129610/ 173500 | consumed samples: 33180160 | consumed tokens: 67952967680 | elapsed time per iteration (s): 0.73 | learning rate: 4.747E-05 | global batch size: 256 | lm loss: 1.944040E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.302 | TFLOPs: 21.13 | 31: iteration 129620/ 173500 | consumed samples: 33182720 | 
consumed tokens: 67958210560 | elapsed time per iteration (s): 0.92 | learning rate: 4.746E-05 | global batch size: 256 | lm loss: 1.931962E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 277.723 | TFLOPs: 16.80 | 31: iteration 129630/ 173500 | consumed samples: 33185280 | consumed tokens: 67963453440 | elapsed time per iteration (s): 0.92 | learning rate: 4.745E-05 | global batch size: 256 | lm loss: 1.933365E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 278.111 | TFLOPs: 16.83 | 31: iteration 129640/ 173500 | consumed samples: 33187840 | consumed tokens: 67968696320 | elapsed time per iteration (s): 0.82 | learning rate: 4.744E-05 | global batch size: 256 | lm loss: 1.933169E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.602 | TFLOPs: 18.97 | 31: iteration 129650/ 173500 | consumed samples: 33190400 | consumed tokens: 67973939200 | elapsed time per iteration (s): 0.82 | learning rate: 4.743E-05 | global batch size: 256 | lm loss: 1.937367E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.898 | TFLOPs: 18.87 | 31: iteration 129660/ 173500 | consumed samples: 33192960 | consumed tokens: 67979182080 | elapsed time per iteration (s): 0.83 | learning rate: 4.742E-05 | global batch size: 256 | lm loss: 1.955605E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.961 | TFLOPs: 18.75 | 31: iteration 129670/ 173500 | consumed samples: 33195520 | consumed tokens: 67984424960 | elapsed time per iteration (s): 0.78 | learning rate: 4.740E-05 | global batch size: 256 | lm loss: 1.919555E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 328.548 | TFLOPs: 19.88 | 31: iteration 129680/ 173500 | consumed samples: 33198080 | consumed tokens: 67989667840 | elapsed time per iteration (s): 0.81 | learning rate: 4.739E-05 | global batch size: 256 | lm loss: 1.951118E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.540 | TFLOPs: 19.15 | 31: iteration 129690/ 173500 | consumed samples: 33200640 | consumed tokens: 67994910720 | elapsed time per iteration (s): 0.80 | learning rate: 4.738E-05 | global batch size: 256 | lm loss: 1.928001E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.153 | TFLOPs: 19.31 | 31: iteration 129700/ 173500 | consumed samples: 33203200 | consumed tokens: 68000153600 | elapsed time per iteration (s): 0.79 | learning rate: 4.737E-05 | global batch size: 256 | lm loss: 1.930909E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.907 | TFLOPs: 19.60 | 31: iteration 129710/ 173500 | consumed samples: 33205760 | consumed tokens: 68005396480 | elapsed time per iteration (s): 0.82 | learning rate: 4.736E-05 | global batch size: 256 | lm loss: 1.918521E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.858 | TFLOPs: 18.99 | 31: iteration 129720/ 173500 | consumed samples: 33208320 | consumed tokens: 68010639360 | elapsed time per iteration (s): 0.83 | learning rate: 4.734E-05 | global batch size: 256 | lm loss: 1.949564E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.160 | TFLOPs: 18.58 | 31: iteration 129730/ 173500 | consumed samples: 33210880 | consumed tokens: 68015882240 | elapsed time per iteration (s): 0.85 | learning rate: 4.733E-05 | global batch size: 
256 | lm loss: 1.926913E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.020 | TFLOPs: 18.27 | 31: iteration 129740/ 173500 | consumed samples: 33213440 | consumed tokens: 68021125120 | elapsed time per iteration (s): 0.82 | learning rate: 4.732E-05 | global batch size: 256 | lm loss: 1.937078E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.260 | TFLOPs: 18.95 | 31: iteration 129750/ 173500 | consumed samples: 33216000 | consumed tokens: 68026368000 | elapsed time per iteration (s): 0.83 | learning rate: 4.731E-05 | global batch size: 256 | lm loss: 1.906586E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.640 | TFLOPs: 18.67 | 31: iteration 129760/ 173500 | consumed samples: 33218560 | consumed tokens: 68031610880 | elapsed time per iteration (s): 0.87 | learning rate: 4.730E-05 | global batch size: 256 | lm loss: 1.955367E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.694 | TFLOPs: 17.89 | 31: iteration 129770/ 173500 | consumed samples: 33221120 | consumed tokens: 68036853760 | elapsed time per iteration (s): 0.79 | learning rate: 4.729E-05 | global batch size: 256 | lm loss: 1.910621E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.753 | TFLOPs: 19.65 | 31: iteration 129780/ 173500 | consumed samples: 33223680 | consumed tokens: 68042096640 | elapsed time per iteration (s): 0.75 | learning rate: 4.727E-05 | global batch size: 256 | lm loss: 1.939131E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.423 | TFLOPs: 20.78 | 31: iteration 129790/ 173500 | consumed samples: 33226240 | 
consumed tokens: 68047339520 | elapsed time per iteration (s): 0.77 | learning rate: 4.726E-05 | global batch size: 256 | lm loss: 1.952495E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.336 | TFLOPs: 19.98 | 31: iteration 129800/ 173500 | consumed samples: 33228800 | consumed tokens: 68052582400 | elapsed time per iteration (s): 0.78 | learning rate: 4.725E-05 | global batch size: 256 | lm loss: 1.913909E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.647 | TFLOPs: 19.88 | 31: iteration 129810/ 173500 | consumed samples: 33231360 | consumed tokens: 68057825280 | elapsed time per iteration (s): 0.80 | learning rate: 4.724E-05 | global batch size: 256 | lm loss: 1.938140E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.569 | TFLOPs: 19.33 | 31: iteration 129820/ 173500 | consumed samples: 33233920 | consumed tokens: 68063068160 | elapsed time per iteration (s): 0.80 | learning rate: 4.723E-05 | global batch size: 256 | lm loss: 1.924804E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.535 | TFLOPs: 19.33 | 31: iteration 129830/ 173500 | consumed samples: 33236480 | consumed tokens: 68068311040 | elapsed time per iteration (s): 0.76 | learning rate: 4.721E-05 | global batch size: 256 | lm loss: 1.926002E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.689 | TFLOPs: 20.43 | 31: iteration 129840/ 173500 | consumed samples: 33239040 | consumed tokens: 68073553920 | elapsed time per iteration (s): 0.80 | learning rate: 4.720E-05 | global batch size: 256 | lm loss: 1.954621E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 319.440 | TFLOPs: 19.33 | 31: iteration 129850/ 173500 | consumed samples: 33241600 | consumed tokens: 68078796800 | elapsed time per iteration (s): 0.76 | learning rate: 4.719E-05 | global batch size: 256 | lm loss: 1.937729E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.707 | TFLOPs: 20.31 | 31: iteration 129860/ 173500 | consumed samples: 33244160 | consumed tokens: 68084039680 | elapsed time per iteration (s): 0.75 | learning rate: 4.718E-05 | global batch size: 256 | lm loss: 1.922847E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.157 | TFLOPs: 20.76 | 31: iteration 129870/ 173500 | consumed samples: 33246720 | consumed tokens: 68089282560 | elapsed time per iteration (s): 0.76 | learning rate: 4.717E-05 | global batch size: 256 | lm loss: 1.928085E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.041 | TFLOPs: 20.27 | 31: iteration 129880/ 173500 | consumed samples: 33249280 | consumed tokens: 68094525440 | elapsed time per iteration (s): 0.76 | learning rate: 4.716E-05 | global batch size: 256 | lm loss: 1.918385E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.097 | TFLOPs: 20.39 | 31: iteration 129890/ 173500 | consumed samples: 33251840 | consumed tokens: 68099768320 | elapsed time per iteration (s): 0.81 | learning rate: 4.714E-05 | global batch size: 256 | lm loss: 1.935141E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.989 | TFLOPs: 19.12 | 31: iteration 129900/ 173500 | consumed samples: 33254400 | consumed tokens: 68105011200 | elapsed time per iteration (s): 0.76 | learning rate: 4.713E-05 | global batch size: 
256 | lm loss: 1.942688E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.412 | TFLOPs: 20.41 | 31: iteration 129910/ 173500 | consumed samples: 33256960 | consumed tokens: 68110254080 | elapsed time per iteration (s): 0.79 | learning rate: 4.712E-05 | global batch size: 256 | lm loss: 1.924441E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.364 | TFLOPs: 19.56 | 31: iteration 129920/ 173500 | consumed samples: 33259520 | consumed tokens: 68115496960 | elapsed time per iteration (s): 0.80 | learning rate: 4.711E-05 | global batch size: 256 | lm loss: 1.923809E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.587 | TFLOPs: 19.33 | 31: iteration 129930/ 173500 | consumed samples: 33262080 | consumed tokens: 68120739840 | elapsed time per iteration (s): 0.75 | learning rate: 4.710E-05 | global batch size: 256 | lm loss: 1.933474E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.366 | TFLOPs: 20.53 | 31: iteration 129940/ 173500 | consumed samples: 33264640 | consumed tokens: 68125982720 | elapsed time per iteration (s): 0.74 | learning rate: 4.709E-05 | global batch size: 256 | lm loss: 1.911534E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.193 | TFLOPs: 20.82 | 31: iteration 129950/ 173500 | consumed samples: 33267200 | consumed tokens: 68131225600 | elapsed time per iteration (s): 0.75 | learning rate: 4.707E-05 | global batch size: 256 | lm loss: 1.952858E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.438 | TFLOPs: 20.72 | 31: iteration 129960/ 173500 | consumed samples: 33269760 | 
consumed tokens: 68136468480 | elapsed time per iteration (s): 0.86 | learning rate: 4.706E-05 | global batch size: 256 | lm loss: 1.936914E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.011 | TFLOPs: 18.09 | 31: iteration 129970/ 173500 | consumed samples: 33272320 | consumed tokens: 68141711360 | elapsed time per iteration (s): 0.80 | learning rate: 4.705E-05 | global batch size: 256 | lm loss: 1.924829E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.197 | TFLOPs: 19.31 | 31: iteration 129980/ 173500 | consumed samples: 33274880 | consumed tokens: 68146954240 | elapsed time per iteration (s): 0.78 | learning rate: 4.704E-05 | global batch size: 256 | lm loss: 1.911934E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.084 | TFLOPs: 19.79 | 31: iteration 129990/ 173500 | consumed samples: 33277440 | consumed tokens: 68152197120 | elapsed time per iteration (s): 0.74 | learning rate: 4.703E-05 | global batch size: 256 | lm loss: 1.950269E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.722 | TFLOPs: 20.98 | 0: [2022-11-26 23:22:49,651] [INFO] [logging.py:68:log_dist] [Rank 0] step=130000, skipped=0, lr=[4.7014562839599005e-05, 4.7014562839599005e-05, 4.7014562839599005e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 130000/ 173500 | consumed samples: 33280000 | consumed tokens: 68157440000 | elapsed time per iteration (s): 0.76 | learning rate: 4.701E-05 | global batch size: 256 | lm loss: 1.956251E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.811 | TFLOPs: 20.26 | 0: steps: 130000 loss: 1.9289 iter time (s): 0.793 samples/sec: 322.934 31: 
-------------------------------------------------------------------------------------------- 31: valid loss at iteration 130000 | lm loss value: 1.830915E+00 | lm loss PPL: 6.239592E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 130000 to checkpoints_1b1long 0: [2022-11-26 23:22:49,916] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step130000 is begin to save! 0: [2022-11-26 23:22:49,927] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_01-model_00-model_states.pt... 0: [2022-11-26 23:22:50,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_01-model_00-model_states.pt. 0: [2022-11-26 23:22:50,154] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_03-model_00-model_states.pt... 0: [2022-11-26 23:22:50,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_03-model_00-model_states.pt. 0: [2022-11-26 23:22:50,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_04-model_00-model_states.pt... 0: [2022-11-26 23:22:50,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_04-model_00-model_states.pt. 0: [2022-11-26 23:22:50,312] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_05-model_00-model_states.pt... 0: [2022-11-26 23:22:50,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_05-model_00-model_states.pt. 0: [2022-11-26 23:22:50,391] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_06-model_00-model_states.pt... 
0: [2022-11-26 23:22:50,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_06-model_00-model_states.pt. 0: [2022-11-26 23:22:50,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_07-model_00-model_states.pt... 0: [2022-11-26 23:22:50,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_07-model_00-model_states.pt. 0: [2022-11-26 23:22:50,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_08-model_00-model_states.pt... 0: [2022-11-26 23:22:50,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_08-model_00-model_states.pt. 0: [2022-11-26 23:22:50,632] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_09-model_00-model_states.pt... 0: [2022-11-26 23:22:50,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_09-model_00-model_states.pt. 0: [2022-11-26 23:22:50,709] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_10-model_00-model_states.pt... 0: [2022-11-26 23:22:50,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_10-model_00-model_states.pt. 0: [2022-11-26 23:22:50,795] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_11-model_00-model_states.pt... 0: [2022-11-26 23:22:50,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_11-model_00-model_states.pt. 0: [2022-11-26 23:22:50,871] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_12-model_00-model_states.pt... 
0: [2022-11-26 23:22:50,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_12-model_00-model_states.pt. 0: [2022-11-26 23:22:50,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_13-model_00-model_states.pt... 0: [2022-11-26 23:22:51,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_13-model_00-model_states.pt. 0: [2022-11-26 23:22:51,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_14-model_00-model_states.pt... 0: [2022-11-26 23:22:51,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_14-model_00-model_states.pt. 0: [2022-11-26 23:22:51,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_15-model_00-model_states.pt... 0: [2022-11-26 23:22:51,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_15-model_00-model_states.pt. 0: [2022-11-26 23:22:51,183] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_16-model_00-model_states.pt... 0: [2022-11-26 23:22:51,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_16-model_00-model_states.pt. 0: [2022-11-26 23:22:51,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_17-model_00-model_states.pt... 0: [2022-11-26 23:22:51,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_17-model_00-model_states.pt. 0: [2022-11-26 23:22:51,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_18-model_00-model_states.pt... 
0: [2022-11-26 23:22:51,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_18-model_00-model_states.pt. 0: [2022-11-26 23:22:51,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_19-model_00-model_states.pt... 0: [2022-11-26 23:22:51,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_19-model_00-model_states.pt. 0: [2022-11-26 23:22:51,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_20-model_00-model_states.pt... 0: [2022-11-26 23:22:51,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_20-model_00-model_states.pt. 0: [2022-11-26 23:22:51,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_21-model_00-model_states.pt... 0: [2022-11-26 23:22:51,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_21-model_00-model_states.pt. 0: [2022-11-26 23:22:51,649] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_22-model_00-model_states.pt... 0: [2022-11-26 23:22:51,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_22-model_00-model_states.pt. 0: [2022-11-26 23:22:51,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_23-model_00-model_states.pt... 0: [2022-11-26 23:22:51,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_23-model_00-model_states.pt. 0: [2022-11-26 23:22:51,800] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_24-model_00-model_states.pt... 
0: [2022-11-26 23:22:51,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_24-model_00-model_states.pt. 0: [2022-11-26 23:22:51,881] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_25-model_00-model_states.pt... 0: [2022-11-26 23:22:51,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_25-model_00-model_states.pt. 0: [2022-11-26 23:22:51,960] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_26-model_00-model_states.pt... 0: [2022-11-26 23:22:52,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_26-model_00-model_states.pt. 0: [2022-11-26 23:22:52,035] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_27-model_00-model_states.pt... 0: [2022-11-26 23:22:52,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_27-model_00-model_states.pt. 0: [2022-11-26 23:22:52,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_28-model_00-model_states.pt... 0: [2022-11-26 23:22:52,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_28-model_00-model_states.pt. 0: [2022-11-26 23:22:52,189] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/layer_30-model_00-model_states.pt... 0: [2022-11-26 23:22:52,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/layer_30-model_00-model_states.pt. 
0: [2022-11-26 23:22:52,193] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step130000/mp_rank_00_model_states.pt 0: [2022-11-26 23:22:52,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/mp_rank_00_model_states.pt... 0: [2022-11-26 23:22:52,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/mp_rank_00_model_states.pt. 0: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
7: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
8: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
13: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
15: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 
28: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 
17: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
5: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
9: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 
2: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
20: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
14: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 
17: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
26: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
4: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
3: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 
11: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 
22: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 
27: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
1: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 
28: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 
30: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 
25: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 28: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 
23: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:22:52,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:22:52,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:22:52,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 23:22:52,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 0: [2022-11-26 23:22:52,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 23:22:52,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 23:22:52,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 0: [2022-11-26 23:22:52,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:22:52,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 23:22:52,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 0: [2022-11-26 23:22:52,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:22:52,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 23:22:52,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 25: [2022-11-26 23:22:52,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:22:52,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 23:22:52,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 19: [2022-11-26 23:22:52,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
19: [2022-11-26 23:22:52,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 29: [2022-11-26 23:22:52,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:22:52,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 29: [2022-11-26 23:22:52,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 23:22:52,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 2: [2022-11-26 23:22:52,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:22:52,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 23:22:52,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 7: [2022-11-26 23:22:52,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:22:52,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 8: [2022-11-26 23:22:52,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:22:52,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
8: [2022-11-26 23:22:52,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 23:22:52,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 16: [2022-11-26 23:22:52,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 30: [2022-11-26 23:22:52,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:22:52,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 30: [2022-11-26 23:22:52,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 16: [2022-11-26 23:22:52,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 30: [2022-11-26 23:22:52,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 23: [2022-11-26 23:22:52,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-26 23:22:52,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 23:22:52,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 0: [2022-11-26 23:22:52,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
21: [2022-11-26 23:22:52,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:22:52,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 0: [2022-11-26 23:22:52,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 21: [2022-11-26 23:22:52,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 0: [2022-11-26 23:22:52,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 9: [2022-11-26 23:22:52,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:22:52,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:22:52,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:22:52,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 9: [2022-11-26 23:22:52,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 29: [2022-11-26 23:22:52,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
9: [2022-11-26 23:22:52,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 23:22:52,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 9: [2022-11-26 23:22:52,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 7: [2022-11-26 23:22:52,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:22:52,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 23:22:52,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 21: [2022-11-26 23:22:52,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:22:52,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 23:22:52,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 2: [2022-11-26 23:22:52,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:22:52,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 23:22:52,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
3: [2022-11-26 23:22:52,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:22:52,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:22:52,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:22:52,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 23:22:52,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 17: [2022-11-26 23:22:52,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 24: [2022-11-26 23:22:52,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 17: [2022-11-26 23:22:52,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 24: [2022-11-26 23:22:52,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 17: [2022-11-26 23:22:52,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:22:52,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 23:22:52,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
14: [2022-11-26 23:22:52,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:22:52,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 23:22:52,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 8: [2022-11-26 23:22:52,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:22:52,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 10: [2022-11-26 23:22:52,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:22:52,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 29: [2022-11-26 23:22:52,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:22:52,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:22:52,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
29: [2022-11-26 23:22:52,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 22: [2022-11-26 23:22:52,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:22:52,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:22:52,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 29: [2022-11-26 23:22:52,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 22: [2022-11-26 23:22:52,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 8: [2022-11-26 23:22:52,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 22: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 8: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 27: [2022-11-26 23:22:52,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 5: [2022-11-26 23:22:52,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
2: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:22:52,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 9: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 28: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:22:52,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
28: [2022-11-26 23:22:52,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 23:22:52,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 5: [2022-11-26 23:22:52,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 28: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 28: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 5: [2022-11-26 23:22:52,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 7: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 28: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
7: [2022-11-26 23:22:52,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 14: [2022-11-26 23:22:52,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 5: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:22:52,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 7: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 14: [2022-11-26 23:22:52,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 5: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 5: [2022-11-26 23:22:52,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 14: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 5: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 28: [2022-11-26 23:22:52,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 14: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
27: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 28: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 27: [2022-11-26 23:22:52,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 6: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:22:52,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 24: [2022-11-26 23:22:52,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 6: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 24: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 6: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 23:22:52,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 23:22:52,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 9: [2022-11-26 23:22:52,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:22:52,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 23:22:52,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 28: [2022-11-26 23:22:52,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 23:22:52,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 23:22:52,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 3: [2022-11-26 23:22:52,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:22:52,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 23:22:52,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 4: [2022-11-26 23:22:52,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 23:22:52,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 11: [2022-11-26 23:22:52,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:22:52,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:22:52,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 11: [2022-11-26 23:22:52,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 23:22:52,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 30: [2022-11-26 23:22:52,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:22:52,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:22:52,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 11: [2022-11-26 23:22:52,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
30: [2022-11-26 23:22:52,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 4: [2022-11-26 23:22:52,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 11: [2022-11-26 23:22:52,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 30: [2022-11-26 23:22:52,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 4: [2022-11-26 23:22:52,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 11: [2022-11-26 23:22:52,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:22:52,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 22: [2022-11-26 23:22:52,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:22:52,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 22: [2022-11-26 23:22:52,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 11: [2022-11-26 23:22:52,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 22: [2022-11-26 23:22:52,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
9: [2022-11-26 23:22:52,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:22:52,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 9: [2022-11-26 23:22:52,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 11: [2022-11-26 23:22:52,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:22:52,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 11: [2022-11-26 23:22:52,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 23:22:52,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 12: [2022-11-26 23:22:52,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:22:52,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 23:22:52,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 23: [2022-11-26 23:22:52,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 30: [2022-11-26 23:22:52,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
23: [2022-11-26 23:22:52,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 30: [2022-11-26 23:22:52,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 23: [2022-11-26 23:22:52,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 30: [2022-11-26 23:22:52,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 3: [2022-11-26 23:22:52,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:22:52,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 23:22:52,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 1: [2022-11-26 23:22:52,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:22:52,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:22:52,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:22:52,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:22:52,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
1: [2022-11-26 23:22:52,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 14: [2022-11-26 23:22:52,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:22:52,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 23:22:52,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 1: [2022-11-26 23:22:52,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 14: [2022-11-26 23:22:52,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 18: [2022-11-26 23:22:52,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 23:22:52,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 23:22:52,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 18: [2022-11-26 23:22:52,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 18: [2022-11-26 23:22:52,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 18: [2022-11-26 23:22:52,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
16: [2022-11-26 23:22:52,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:22:52,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:22:52,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 23:22:52,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 25: [2022-11-26 23:22:52,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 14: [2022-11-26 23:22:52,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 25: [2022-11-26 23:22:52,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 28: [2022-11-26 23:22:52,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 23:22:52,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 23:22:52,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 3: [2022-11-26 23:22:52,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 23:22:52,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 20: [2022-11-26 23:22:52,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:22:52,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 24: [2022-11-26 23:22:52,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 20: [2022-11-26 23:22:52,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 23:22:52,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 24: [2022-11-26 23:22:52,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-26 23:22:52,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 30: [2022-11-26 23:22:52,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 23:22:52,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 4: [2022-11-26 23:22:52,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 23:22:52,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 30: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 4: [2022-11-26 23:22:52,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:22:52,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 23:22:52,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 23:22:52,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 4: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 4: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 24: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:22:52,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
19: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:22:52,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 23:22:52,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 17: [2022-11-26 23:22:52,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:22:52,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 23:22:52,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 19: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 19: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 17: [2022-11-26 23:22:52,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-26 23:22:52,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 20: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:22:52,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 1: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
17: [2022-11-26 23:22:52,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 1: [2022-11-26 23:22:52,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 20: [2022-11-26 23:22:52,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 25: [2022-11-26 23:22:52,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 17: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 8: [2022-11-26 23:22:52,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 1: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 20: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 25: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 8: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 19: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-26 23:22:52,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 23:22:52,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 20: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 23:22:52,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 19: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 19: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 20: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 7: [2022-11-26 23:22:52,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:22:52,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:22:52,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 23:22:52,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 23:22:52,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
7: [2022-11-26 23:22:52,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 5: [2022-11-26 23:22:52,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:22:52,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 23:22:52,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 5: [2022-11-26 23:22:52,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:22:52,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 23:22:52,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 2: [2022-11-26 23:22:52,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:22:52,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 23:22:52,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 23: [2022-11-26 23:22:52,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-26 23:22:52,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 23:22:52,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 27: [2022-11-26 23:22:52,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:22:52,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 22: [2022-11-26 23:22:52,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:22:52,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 22: [2022-11-26 23:22:52,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 23:22:52,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:22:52,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:22:52,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
22: [2022-11-26 23:22:52,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 23:22:52,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 23:22:52,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 22: [2022-11-26 23:22:52,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 2: [2022-11-26 23:22:52,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:22:52,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 30: [2022-11-26 23:22:52,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:22:52,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 30: [2022-11-26 23:22:52,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 23:22:52,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 16: [2022-11-26 23:22:52,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
16: [2022-11-26 23:22:52,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 23: [2022-11-26 23:22:52,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 23:22:52,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:22:52,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 23: [2022-11-26 23:22:52,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 23:22:52,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 23:22:52,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 23: [2022-11-26 23:22:52,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 12: [2022-11-26 23:22:52,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:22:52,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 23:22:52,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 23: [2022-11-26 23:22:52,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
23: [2022-11-26 23:22:52,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 23:22:52,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 25: [2022-11-26 23:22:52,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:22:52,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:22:52,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 23:22:52,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 23:22:52,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 25: [2022-11-26 23:22:52,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 24: [2022-11-26 23:22:52,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:22:52,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 23:22:52,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 10: [2022-11-26 23:22:52,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 23:22:52,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:22:52,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:22:52,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:22:52,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 23:22:52,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 23:22:52,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 10: [2022-11-26 23:22:52,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 23:22:52,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 20: [2022-11-26 23:22:52,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:22:52,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:22:52,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
20: [2022-11-26 23:22:52,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 18: [2022-11-26 23:22:52,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 10: [2022-11-26 23:22:52,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 10: [2022-11-26 23:22:52,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 20: [2022-11-26 23:22:52,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 18: [2022-11-26 23:22:52,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 27: [2022-11-26 23:22:52,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:22:52,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 23:22:52,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 3: [2022-11-26 23:22:52,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:22:52,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 23:22:52,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
20: [2022-11-26 23:22:52,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 23:22:52,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 23:22:52,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 4: [2022-11-26 23:22:52,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:22:52,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 23:22:52,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 5: [2022-11-26 23:22:52,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:22:52,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:22:52,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 5: [2022-11-26 23:22:52,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 8: [2022-11-26 23:22:52,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 5: [2022-11-26 23:22:52,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
8: [2022-11-26 23:22:52,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:22:52,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 23:22:52,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 12: [2022-11-26 23:22:52,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:22:52,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 23:22:52,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:22:52,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 12: [2022-11-26 23:22:52,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 23:22:52,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 14: [2022-11-26 23:22:52,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:22:52,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 23:22:52,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
31: [2022-11-26 23:22:52,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-26 23:22:52,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 23:22:52,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 23:22:52,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 23:22:52,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 23:22:52,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 31: [2022-11-26 23:22:52,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 23:22:52,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 31: [2022-11-26 23:22:52,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 31: [2022-11-26 23:22:52,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:22:52,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
31: [2022-11-26 23:22:52,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 22: [2022-11-26 23:22:52,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 31: [2022-11-26 23:22:52,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 22: [2022-11-26 23:22:52,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 12: [2022-11-26 23:22:52,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 28: [2022-11-26 23:22:52,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:22:52,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 28: [2022-11-26 23:22:52,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 12: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 28: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 12: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 23:22:52,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 31: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 31: [2022-11-26 23:22:52,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 21: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 31: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
21: [2022-11-26 23:22:52,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 23:22:52,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 23:22:52,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 7: [2022-11-26 23:22:52,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 31: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 31: [2022-11-26 23:22:52,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 21: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 10: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 7: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
10: [2022-11-26 23:22:52,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 31: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 10: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 30: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 30: [2022-11-26 23:22:52,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 25: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:22:52,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 24: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 30: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 27: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
25: [2022-11-26 23:22:52,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 24: [2022-11-26 23:22:52,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 6: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 6: [2022-11-26 23:22:52,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 16: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 6: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 16: [2022-11-26 23:22:52,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 23:22:52,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
16: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 17: [2022-11-26 23:22:52,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:22:52,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 23:22:52,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 29: [2022-11-26 23:22:52,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:22:52,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:22:52,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 23:22:52,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 23:22:52,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 29: [2022-11-26 23:22:52,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 26: [2022-11-26 23:22:52,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 23:22:52,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
26: [2022-11-26 23:22:52,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-26 23:22:52,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 23:22:52,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-26 23:22:52,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 23:22:52,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 23:22:52,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 23:22:52,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 23:22:52,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 23:22:52,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 23:22:52,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 23:22:52,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
26: [2022-11-26 23:22:52,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 26: [2022-11-26 23:22:52,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 26: [2022-11-26 23:22:52,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 26: [2022-11-26 23:22:52,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 26: [2022-11-26 23:22:52,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 19: [2022-11-26 23:22:52,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:22:52,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 3: [2022-11-26 23:22:52,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:22:52,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 3: [2022-11-26 23:22:52,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 2: [2022-11-26 23:22:52,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:22:52,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
2: [2022-11-26 23:22:52,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 16: [2022-11-26 23:22:52,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:22:52,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 29: [2022-11-26 23:22:52,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:22:52,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 29: [2022-11-26 23:22:52,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 23:22:52,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 16: [2022-11-26 23:22:52,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 11: [2022-11-26 23:22:52,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:22:52,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 23:22:52,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 11: [2022-11-26 23:22:52,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 23:22:52,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 21: [2022-11-26 23:22:52,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:22:52,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 21: [2022-11-26 23:22:52,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 23:22:52,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 13: [2022-11-26 23:22:52,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:22:52,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:22:52,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:22:52,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:22:52,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:22:52,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-26 23:22:52,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 23:22:52,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 23:22:52,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 23:22:52,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 23:22:52,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 23:22:52,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 13: [2022-11-26 23:22:52,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 13: [2022-11-26 23:22:52,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 23:22:52,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 13: [2022-11-26 23:22:52,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 13: [2022-11-26 23:22:52,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 13: [2022-11-26 23:22:52,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
26: [2022-11-26 23:22:52,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 23:22:52,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 23:22:52,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 9: [2022-11-26 23:22:52,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:22:52,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 23:22:52,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 31: [2022-11-26 23:22:52,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 23:22:52,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 23:22:52,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 1: [2022-11-26 23:22:52,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:22:52,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 23:22:52,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
1: [2022-11-26 23:22:52,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:22:52,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 23:22:52,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 6: [2022-11-26 23:22:52,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:22:52,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 23:22:52,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 6: [2022-11-26 23:22:52,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:22:52,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 23:22:52,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 6: [2022-11-26 23:22:52,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:22:52,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 23:22:52,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
6: [2022-11-26 23:22:52,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:22:52,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 23:22:52,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 1: [2022-11-26 23:22:52,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:22:52,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:22:52,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:22:52,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 23:22:52,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 23:22:52,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 1: [2022-11-26 23:22:52,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 1: [2022-11-26 23:22:52,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 23:22:52,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
4: [2022-11-26 23:22:52,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:22:52,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 23:22:52,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 20: [2022-11-26 23:22:52,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 23:22:52,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-26 23:22:52,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 23:22:52,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 23:22:52,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 20: [2022-11-26 23:22:52,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 7: [2022-11-26 23:22:52,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:22:52,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 23:22:52,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
9: [2022-11-26 23:22:52,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:22:52,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 23:22:52,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 15: [2022-11-26 23:22:52,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:22:52,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:22:52,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:22:52,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:22:52,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:22:52,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-26 23:22:52,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 23:22:52,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 23:22:52,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 23:22:52,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 23:22:52,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 23:22:52,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 15: [2022-11-26 23:22:52,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 15: [2022-11-26 23:22:52,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 23:22:52,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 15: [2022-11-26 23:22:52,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 15: [2022-11-26 23:22:52,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 15: [2022-11-26 23:22:52,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
8: [2022-11-26 23:22:52,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:22:52,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 23:22:52,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 29: [2022-11-26 23:22:52,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:22:52,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 23:22:52,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 13: [2022-11-26 23:22:52,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:22:52,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 23:22:52,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 30: [2022-11-26 23:22:52,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 23:22:52,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 23:22:52,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
5: [2022-11-26 23:22:52,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:22:52,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 23:22:52,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 15: [2022-11-26 23:22:52,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:22:52,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 23:22:52,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 23: [2022-11-26 23:22:52,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:22:52,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:22:52,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 23:22:52,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 23: [2022-11-26 23:22:52,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 23:22:52,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
2: [2022-11-26 23:22:52,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:22:52,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 23:22:52,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 18: [2022-11-26 23:22:52,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:22:52,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 23:22:52,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 10: [2022-11-26 23:22:52,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:22:52,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 23:22:52,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 25: [2022-11-26 23:22:52,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:22:52,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 23:22:52,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
12: [2022-11-26 23:22:52,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:22:52,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 23:22:52,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 28: [2022-11-26 23:22:52,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:22:52,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:22:52,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 23:22:52,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 27: [2022-11-26 23:22:52,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:22:52,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 23:22:52,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 16: [2022-11-26 23:22:52,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 
16: [2022-11-26 23:22:52,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 23:22:52,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 28: [2022-11-26 23:22:52,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 23:22:52,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 17: [2022-11-26 23:22:52,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:22:52,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 23:22:52,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 22: [2022-11-26 23:22:52,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:22:52,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 23:22:52,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 11: [2022-11-26 23:22:52,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 23:22:52,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 23:22:52,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 14: [2022-11-26 23:22:52,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:22:52,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 23:22:52,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 21: [2022-11-26 23:22:52,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:22:52,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-26 23:22:52,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 26: [2022-11-26 23:22:52,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 23:22:52,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 23:22:52,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 20: [2022-11-26 23:22:52,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 
20: [2022-11-26 23:22:52,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 23:22:52,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 6: [2022-11-26 23:22:52,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:22:52,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 23:22:52,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 4: [2022-11-26 23:22:52,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:22:52,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 23:22:52,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 3: [2022-11-26 23:22:52,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:22:52,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 23:22:52,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 7: [2022-11-26 23:22:52,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-26 23:22:52,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 23:22:52,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 31: [2022-11-26 23:22:52,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 23:22:52,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 23:22:52,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 0: [2022-11-26 23:22:52,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:22:52,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:22:52,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 23:22:52,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 25: [2022-11-26 23:22:52,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:22:52,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 23:22:52,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
1: [2022-11-26 23:22:52,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:22:52,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 23:22:52,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 5: [2022-11-26 23:22:52,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:22:52,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 23:22:52,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 8: [2022-11-26 23:22:52,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 23: [2022-11-26 23:22:52,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:22:52,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
23: [2022-11-26 23:22:52,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 18: [2022-11-26 23:22:52,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 8: [2022-11-26 23:22:52,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 23: [2022-11-26 23:22:52,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 18: [2022-11-26 23:22:52,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 8: [2022-11-26 23:22:52,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 15: [2022-11-26 23:22:52,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:22:52,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 23:22:52,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 29: [2022-11-26 23:22:52,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:22:52,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 23:22:52,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 
19: [2022-11-26 23:22:52,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:22:52,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-26 23:22:52,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 9: [2022-11-26 23:22:52,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:22:52,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 23:22:52,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 2: [2022-11-26 23:22:52,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:22:52,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 23:22:52,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 28: [2022-11-26 23:22:52,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-26 23:22:52,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 24: [2022-11-26 23:22:52,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 28: [2022-11-26 23:22:52,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 24: [2022-11-26 23:22:52,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 23:22:52,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 12: [2022-11-26 23:22:52,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:22:52,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:22:52,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 23:22:52,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 17: [2022-11-26 23:22:52,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 23:22:52,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 27: [2022-11-26 23:22:52,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
27: [2022-11-26 23:22:52,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 23:22:52,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 13: [2022-11-26 23:22:52,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:22:52,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 23:22:52,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 0: [2022-11-26 23:22:52,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 10: [2022-11-26 23:22:52,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:22:52,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 10: [2022-11-26 23:22:52,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 23:22:52,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 30: [2022-11-26 23:22:52,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 
30: [2022-11-26 23:22:52,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 23:22:52,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 16: [2022-11-26 23:22:52,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:22:52,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 23:22:52,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 18: [2022-11-26 23:22:52,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:22:52,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 23:22:52,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 22: [2022-11-26 23:22:52,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:22:52,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 23:22:52,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 0: [2022-11-26 23:22:52,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-26 23:22:52,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 23:22:52,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 3: [2022-11-26 23:22:52,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:22:52,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 23:22:52,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 21: [2022-11-26 23:22:52,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:22:52,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 23:22:52,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 14: [2022-11-26 23:22:52,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:22:52,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 23:22:52,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 27: [2022-11-26 23:22:52,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
27: [2022-11-26 23:22:52,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 23:22:52,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 14: [2022-11-26 23:22:52,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:22:52,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step130000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 23:22:52,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step130000 is ready now! 0: successfully saved checkpoint at iteration 130000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2617.45 31: iteration 130010/ 173500 | consumed samples: 33282560 | consumed tokens: 68162682880 | elapsed time per iteration (s): 1.04 | learning rate: 4.700E-05 | global batch size: 256 | lm loss: 1.951669E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.163 | TFLOPs: 14.89 | 31: iteration 130020/ 173500 | consumed samples: 33285120 | consumed tokens: 68167925760 | elapsed time per iteration (s): 0.79 | learning rate: 4.699E-05 | global batch size: 256 | lm loss: 1.956026E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.189 | TFLOPs: 19.61 | 31: iteration 130030/ 173500 | consumed samples: 33287680 | consumed tokens: 68173168640 | elapsed time per iteration (s): 0.79 | learning rate: 4.698E-05 | global batch size: 256 | lm loss: 1.937066E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.034 | TFLOPs: 19.72 | 31: iteration 
130040/ 173500 | consumed samples: 33290240 | consumed tokens: 68178411520 | elapsed time per iteration (s): 0.76 | learning rate: 4.697E-05 | global batch size: 256 | lm loss: 1.915889E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.997 | TFLOPs: 20.33 | 31: iteration 130050/ 173500 | consumed samples: 33292800 | consumed tokens: 68183654400 | elapsed time per iteration (s): 0.75 | learning rate: 4.696E-05 | global batch size: 256 | lm loss: 1.914781E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.122 | TFLOPs: 20.76 | 31: iteration 130060/ 173500 | consumed samples: 33295360 | consumed tokens: 68188897280 | elapsed time per iteration (s): 1.06 | learning rate: 4.694E-05 | global batch size: 256 | lm loss: 1.917606E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.664 | TFLOPs: 14.62 | 31: iteration 130070/ 173500 | consumed samples: 33297920 | consumed tokens: 68194140160 | elapsed time per iteration (s): 0.77 | learning rate: 4.693E-05 | global batch size: 256 | lm loss: 1.942659E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.630 | TFLOPs: 20.00 | 31: iteration 130080/ 173500 | consumed samples: 33300480 | consumed tokens: 68199383040 | elapsed time per iteration (s): 0.82 | learning rate: 4.692E-05 | global batch size: 256 | lm loss: 1.925168E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.284 | TFLOPs: 18.83 | 31: iteration 130090/ 173500 | consumed samples: 33303040 | consumed tokens: 68204625920 | elapsed time per iteration (s): 0.73 | learning rate: 4.691E-05 | global batch size: 256 | lm loss: 1.962170E+00 | grad norm: 0.174 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.300 | TFLOPs: 21.07 | 31: iteration 130100/ 173500 | consumed samples: 33305600 | consumed tokens: 68209868800 | elapsed time per iteration (s): 0.82 | learning rate: 4.690E-05 | global batch size: 256 | lm loss: 1.947659E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.388 | TFLOPs: 18.78 | 31: iteration 130110/ 173500 | consumed samples: 33308160 | consumed tokens: 68215111680 | elapsed time per iteration (s): 0.74 | learning rate: 4.689E-05 | global batch size: 256 | lm loss: 1.948853E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.960 | TFLOPs: 20.93 | 31: iteration 130120/ 173500 | consumed samples: 33310720 | consumed tokens: 68220354560 | elapsed time per iteration (s): 0.78 | learning rate: 4.687E-05 | global batch size: 256 | lm loss: 1.951930E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.380 | TFLOPs: 19.75 | 31: iteration 130130/ 173500 | consumed samples: 33313280 | consumed tokens: 68225597440 | elapsed time per iteration (s): 0.77 | learning rate: 4.686E-05 | global batch size: 256 | lm loss: 1.924280E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.396 | TFLOPs: 20.17 | 31: iteration 130140/ 173500 | consumed samples: 33315840 | consumed tokens: 68230840320 | elapsed time per iteration (s): 0.80 | learning rate: 4.685E-05 | global batch size: 256 | lm loss: 1.935640E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.241 | TFLOPs: 19.25 | 31: iteration 130150/ 173500 | consumed samples: 33318400 | consumed tokens: 68236083200 | elapsed time per iteration (s): 0.82 | learning 
rate: 4.684E-05 | global batch size: 256 | lm loss: 1.931073E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.967 | TFLOPs: 18.99 | 31: iteration 130160/ 173500 | consumed samples: 33320960 | consumed tokens: 68241326080 | elapsed time per iteration (s): 0.85 | learning rate: 4.683E-05 | global batch size: 256 | lm loss: 1.917859E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.917 | TFLOPs: 18.27 | 31: iteration 130170/ 173500 | consumed samples: 33323520 | consumed tokens: 68246568960 | elapsed time per iteration (s): 0.84 | learning rate: 4.681E-05 | global batch size: 256 | lm loss: 1.948339E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.349 | TFLOPs: 18.47 | 31: iteration 130180/ 173500 | consumed samples: 33326080 | consumed tokens: 68251811840 | elapsed time per iteration (s): 0.89 | learning rate: 4.680E-05 | global batch size: 256 | lm loss: 1.910515E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.043 | TFLOPs: 17.43 | 31: iteration 130190/ 173500 | consumed samples: 33328640 | consumed tokens: 68257054720 | elapsed time per iteration (s): 0.89 | learning rate: 4.679E-05 | global batch size: 256 | lm loss: 1.923291E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.150 | TFLOPs: 17.43 | 31: iteration 130200/ 173500 | consumed samples: 33331200 | consumed tokens: 68262297600 | elapsed time per iteration (s): 0.91 | learning rate: 4.678E-05 | global batch size: 256 | lm loss: 1.934383E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 280.835 | TFLOPs: 16.99 | 31: iteration 130210/ 
173500 | consumed samples: 33333760 | consumed tokens: 68267540480 | elapsed time per iteration (s): 0.90 | learning rate: 4.677E-05 | global batch size: 256 | lm loss: 1.971918E+00 | grad norm: 0.253 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.219 | TFLOPs: 17.26 | 31: iteration 130220/ 173500 | consumed samples: 33336320 | consumed tokens: 68272783360 | elapsed time per iteration (s): 0.89 | learning rate: 4.676E-05 | global batch size: 256 | lm loss: 1.944687E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.094 | TFLOPs: 17.37 | 31: iteration 130230/ 173500 | consumed samples: 33338880 | consumed tokens: 68278026240 | elapsed time per iteration (s): 0.87 | learning rate: 4.674E-05 | global batch size: 256 | lm loss: 1.920101E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.760 | TFLOPs: 17.89 | 31: iteration 130240/ 173500 | consumed samples: 33341440 | consumed tokens: 68283269120 | elapsed time per iteration (s): 0.83 | learning rate: 4.673E-05 | global batch size: 256 | lm loss: 1.922986E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.931 | TFLOPs: 18.57 | 31: iteration 130250/ 173500 | consumed samples: 33344000 | consumed tokens: 68288512000 | elapsed time per iteration (s): 0.84 | learning rate: 4.672E-05 | global batch size: 256 | lm loss: 1.942279E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.911 | TFLOPs: 18.39 | 31: iteration 130260/ 173500 | consumed samples: 33346560 | consumed tokens: 68293754880 | elapsed time per iteration (s): 0.90 | learning rate: 4.671E-05 | global batch size: 256 | lm loss: 1.931196E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 283.583 | TFLOPs: 17.16 | 31: iteration 130270/ 173500 | consumed samples: 33349120 | consumed tokens: 68298997760 | elapsed time per iteration (s): 0.85 | learning rate: 4.670E-05 | global batch size: 256 | lm loss: 1.970568E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.182 | TFLOPs: 18.28 | 31: iteration 130280/ 173500 | consumed samples: 33351680 | consumed tokens: 68304240640 | elapsed time per iteration (s): 0.89 | learning rate: 4.669E-05 | global batch size: 256 | lm loss: 1.953591E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.341 | TFLOPs: 17.44 | 31: iteration 130290/ 173500 | consumed samples: 33354240 | consumed tokens: 68309483520 | elapsed time per iteration (s): 0.79 | learning rate: 4.667E-05 | global batch size: 256 | lm loss: 1.932746E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.872 | TFLOPs: 19.53 | 31: iteration 130300/ 173500 | consumed samples: 33356800 | consumed tokens: 68314726400 | elapsed time per iteration (s): 0.79 | learning rate: 4.666E-05 | global batch size: 256 | lm loss: 1.929951E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.828 | TFLOPs: 19.59 | 31: iteration 130310/ 173500 | consumed samples: 33359360 | consumed tokens: 68319969280 | elapsed time per iteration (s): 0.78 | learning rate: 4.665E-05 | global batch size: 256 | lm loss: 1.962456E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.048 | TFLOPs: 19.85 | 31: iteration 130320/ 173500 | consumed samples: 33361920 | consumed tokens: 68325212160 | elapsed time per iteration (s): 0.79 | learning rate: 
4.664E-05 | global batch size: 256 | lm loss: 1.954237E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.049 | TFLOPs: 19.48 | 31: iteration 130330/ 173500 | consumed samples: 33364480 | consumed tokens: 68330455040 | elapsed time per iteration (s): 0.80 | learning rate: 4.663E-05 | global batch size: 256 | lm loss: 1.923899E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.541 | TFLOPs: 19.33 | 31: iteration 130340/ 173500 | consumed samples: 33367040 | consumed tokens: 68335697920 | elapsed time per iteration (s): 0.77 | learning rate: 4.662E-05 | global batch size: 256 | lm loss: 1.973059E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.896 | TFLOPs: 20.08 | 31: iteration 130350/ 173500 | consumed samples: 33369600 | consumed tokens: 68340940800 | elapsed time per iteration (s): 0.77 | learning rate: 4.660E-05 | global batch size: 256 | lm loss: 1.905430E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.600 | TFLOPs: 20.12 | 31: iteration 130360/ 173500 | consumed samples: 33372160 | consumed tokens: 68346183680 | elapsed time per iteration (s): 0.78 | learning rate: 4.659E-05 | global batch size: 256 | lm loss: 1.930390E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.629 | TFLOPs: 19.76 | 31: iteration 130370/ 173500 | consumed samples: 33374720 | consumed tokens: 68351426560 | elapsed time per iteration (s): 0.73 | learning rate: 4.658E-05 | global batch size: 256 | lm loss: 1.938154E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.867 | TFLOPs: 21.29 | 31: iteration 130380/ 173500 | 
consumed samples: 33377280 | consumed tokens: 68356669440 | elapsed time per iteration (s): 0.83 | learning rate: 4.657E-05 | global batch size: 256 | lm loss: 1.921004E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.026 | TFLOPs: 18.57 | 31: iteration 130390/ 173500 | consumed samples: 33379840 | consumed tokens: 68361912320 | elapsed time per iteration (s): 0.73 | learning rate: 4.656E-05 | global batch size: 256 | lm loss: 1.945158E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.952 | TFLOPs: 21.17 | 31: iteration 130400/ 173500 | consumed samples: 33382400 | consumed tokens: 68367155200 | elapsed time per iteration (s): 0.81 | learning rate: 4.655E-05 | global batch size: 256 | lm loss: 1.946408E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.570 | TFLOPs: 19.09 | 31: iteration 130410/ 173500 | consumed samples: 33384960 | consumed tokens: 68372398080 | elapsed time per iteration (s): 0.73 | learning rate: 4.653E-05 | global batch size: 256 | lm loss: 1.915828E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.601 | TFLOPs: 21.21 | 31: iteration 130420/ 173500 | consumed samples: 33387520 | consumed tokens: 68377640960 | elapsed time per iteration (s): 0.77 | learning rate: 4.652E-05 | global batch size: 256 | lm loss: 1.957157E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.361 | TFLOPs: 19.99 | 31: iteration 130430/ 173500 | consumed samples: 33390080 | consumed tokens: 68382883840 | elapsed time per iteration (s): 0.76 | learning rate: 4.651E-05 | global batch size: 256 | lm loss: 1.951397E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 337.799 | TFLOPs: 20.44 | 31: iteration 130440/ 173500 | consumed samples: 33392640 | consumed tokens: 68388126720 | elapsed time per iteration (s): 0.73 | learning rate: 4.650E-05 | global batch size: 256 | lm loss: 1.922721E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.096 | TFLOPs: 21.18 | 31: iteration 130450/ 173500 | consumed samples: 33395200 | consumed tokens: 68393369600 | elapsed time per iteration (s): 0.80 | learning rate: 4.649E-05 | global batch size: 256 | lm loss: 1.923381E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.034 | TFLOPs: 19.36 | 31: iteration 130460/ 173500 | consumed samples: 33397760 | consumed tokens: 68398612480 | elapsed time per iteration (s): 0.86 | learning rate: 4.648E-05 | global batch size: 256 | lm loss: 1.941621E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.804 | TFLOPs: 18.02 | 31: iteration 130470/ 173500 | consumed samples: 33400320 | consumed tokens: 68403855360 | elapsed time per iteration (s): 0.76 | learning rate: 4.646E-05 | global batch size: 256 | lm loss: 1.939217E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.243 | TFLOPs: 20.28 | 31: iteration 130480/ 173500 | consumed samples: 33402880 | consumed tokens: 68409098240 | elapsed time per iteration (s): 0.75 | learning rate: 4.645E-05 | global batch size: 256 | lm loss: 1.913921E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.282 | TFLOPs: 20.71 | 31: iteration 130490/ 173500 | consumed samples: 33405440 | consumed tokens: 68414341120 | elapsed time per iteration (s): 0.78 | learning rate: 
4.644E-05 | global batch size: 256 | lm loss: 1.956149E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.857 | TFLOPs: 19.89 | 31: iteration 130500/ 173500 | consumed samples: 33408000 | consumed tokens: 68419584000 | elapsed time per iteration (s): 0.74 | learning rate: 4.643E-05 | global batch size: 256 | lm loss: 1.960658E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.688 | TFLOPs: 20.79 | 31: iteration 130510/ 173500 | consumed samples: 33410560 | consumed tokens: 68424826880 | elapsed time per iteration (s): 0.79 | learning rate: 4.642E-05 | global batch size: 256 | lm loss: 1.942530E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.674 | TFLOPs: 19.64 | 31: iteration 130520/ 173500 | consumed samples: 33413120 | consumed tokens: 68430069760 | elapsed time per iteration (s): 0.78 | learning rate: 4.641E-05 | global batch size: 256 | lm loss: 1.933066E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.900 | TFLOPs: 19.78 | 31: iteration 130530/ 173500 | consumed samples: 33415680 | consumed tokens: 68435312640 | elapsed time per iteration (s): 0.74 | learning rate: 4.639E-05 | global batch size: 256 | lm loss: 1.949796E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.647 | TFLOPs: 21.03 | 31: iteration 130540/ 173500 | consumed samples: 33418240 | consumed tokens: 68440555520 | elapsed time per iteration (s): 0.78 | learning rate: 4.638E-05 | global batch size: 256 | lm loss: 1.965151E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.783 | TFLOPs: 19.95 | 31: iteration 130550/ 173500 | 
consumed samples: 33420800 | consumed tokens: 68445798400 | elapsed time per iteration (s): 0.72 | learning rate: 4.637E-05 | global batch size: 256 | lm loss: 1.913630E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.484 | TFLOPs: 21.63 | 31: iteration 130560/ 173500 | consumed samples: 33423360 | consumed tokens: 68451041280 | elapsed time per iteration (s): 0.73 | learning rate: 4.636E-05 | global batch size: 256 | lm loss: 1.934104E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.983 | TFLOPs: 21.23 | 31: iteration 130570/ 173500 | consumed samples: 33425920 | consumed tokens: 68456284160 | elapsed time per iteration (s): 0.78 | learning rate: 4.635E-05 | global batch size: 256 | lm loss: 1.933855E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.742 | TFLOPs: 19.83 | 31: iteration 130580/ 173500 | consumed samples: 33428480 | consumed tokens: 68461527040 | elapsed time per iteration (s): 0.73 | learning rate: 4.634E-05 | global batch size: 256 | lm loss: 1.922069E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.753 | TFLOPs: 21.16 | 31: iteration 130590/ 173500 | consumed samples: 33431040 | consumed tokens: 68466769920 | elapsed time per iteration (s): 0.73 | learning rate: 4.632E-05 | global batch size: 256 | lm loss: 1.952659E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.822 | TFLOPs: 21.16 | 31: iteration 130600/ 173500 | consumed samples: 33433600 | consumed tokens: 68472012800 | elapsed time per iteration (s): 0.76 | learning rate: 4.631E-05 | global batch size: 256 | lm loss: 1.926084E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 336.086 | TFLOPs: 20.33 | 31: iteration 130610/ 173500 | consumed samples: 33436160 | consumed tokens: 68477255680 | elapsed time per iteration (s): 0.80 | learning rate: 4.630E-05 | global batch size: 256 | lm loss: 1.905404E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.524 | TFLOPs: 19.39 | 31: iteration 130620/ 173500 | consumed samples: 33438720 | consumed tokens: 68482498560 | elapsed time per iteration (s): 0.82 | learning rate: 4.629E-05 | global batch size: 256 | lm loss: 1.931022E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.189 | TFLOPs: 18.89 | 31: iteration 130630/ 173500 | consumed samples: 33441280 | consumed tokens: 68487741440 | elapsed time per iteration (s): 0.80 | learning rate: 4.628E-05 | global batch size: 256 | lm loss: 1.939719E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.609 | TFLOPs: 19.28 | 31: iteration 130640/ 173500 | consumed samples: 33443840 | consumed tokens: 68492984320 | elapsed time per iteration (s): 0.84 | learning rate: 4.627E-05 | global batch size: 256 | lm loss: 1.959837E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.383 | TFLOPs: 18.47 | 31: iteration 130650/ 173500 | consumed samples: 33446400 | consumed tokens: 68498227200 | elapsed time per iteration (s): 0.86 | learning rate: 4.625E-05 | global batch size: 256 | lm loss: 1.933990E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.367 | TFLOPs: 18.05 | 31: iteration 130660/ 173500 | consumed samples: 33448960 | consumed tokens: 68503470080 | elapsed time per iteration (s): 0.80 | learning rate: 
4.624E-05 | global batch size: 256 | lm loss: 1.952082E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.178 | TFLOPs: 19.43 | 31: iteration 130670/ 173500 | consumed samples: 33451520 | consumed tokens: 68508712960 | elapsed time per iteration (s): 0.95 | learning rate: 4.623E-05 | global batch size: 256 | lm loss: 1.922872E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 268.212 | TFLOPs: 16.23 | 31: iteration 130680/ 173500 | consumed samples: 33454080 | consumed tokens: 68513955840 | elapsed time per iteration (s): 0.87 | learning rate: 4.622E-05 | global batch size: 256 | lm loss: 1.930635E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.715 | TFLOPs: 17.83 | 31: iteration 130690/ 173500 | consumed samples: 33456640 | consumed tokens: 68519198720 | elapsed time per iteration (s): 0.83 | learning rate: 4.621E-05 | global batch size: 256 | lm loss: 1.944855E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.923 | TFLOPs: 18.57 | 31: iteration 130700/ 173500 | consumed samples: 33459200 | consumed tokens: 68524441600 | elapsed time per iteration (s): 0.84 | learning rate: 4.620E-05 | global batch size: 256 | lm loss: 1.940671E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.236 | TFLOPs: 18.53 | 31: iteration 130710/ 173500 | consumed samples: 33461760 | consumed tokens: 68529684480 | elapsed time per iteration (s): 0.91 | learning rate: 4.619E-05 | global batch size: 256 | lm loss: 1.930729E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 281.729 | TFLOPs: 17.04 | 31: iteration 130720/ 173500 | 
consumed samples: 33464320 | consumed tokens: 68534927360 | elapsed time per iteration (s): 0.82 | learning rate: 4.617E-05 | global batch size: 256 | lm loss: 1.943774E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.364 | TFLOPs: 18.84 | 31: iteration 130730/ 173500 | consumed samples: 33466880 | consumed tokens: 68540170240 | elapsed time per iteration (s): 0.81 | learning rate: 4.616E-05 | global batch size: 256 | lm loss: 1.934705E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.632 | TFLOPs: 19.09 | 31: iteration 130740/ 173500 | consumed samples: 33469440 | consumed tokens: 68545413120 | elapsed time per iteration (s): 0.80 | learning rate: 4.615E-05 | global batch size: 256 | lm loss: 1.924763E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.655 | TFLOPs: 19.46 | 31: iteration 130750/ 173500 | consumed samples: 33472000 | consumed tokens: 68550656000 | elapsed time per iteration (s): 0.80 | learning rate: 4.614E-05 | global batch size: 256 | lm loss: 1.933897E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.823 | TFLOPs: 19.35 | 31: iteration 130760/ 173500 | consumed samples: 33474560 | consumed tokens: 68555898880 | elapsed time per iteration (s): 0.80 | learning rate: 4.613E-05 | global batch size: 256 | lm loss: 1.915748E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.891 | TFLOPs: 19.47 | 31: iteration 130770/ 173500 | consumed samples: 33477120 | consumed tokens: 68561141760 | elapsed time per iteration (s): 0.76 | learning rate: 4.612E-05 | global batch size: 256 | lm loss: 1.927129E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 334.679 | TFLOPs: 20.25 | 31: iteration 130780/ 173500 | consumed samples: 33479680 | consumed tokens: 68566384640 | elapsed time per iteration (s): 0.75 | learning rate: 4.610E-05 | global batch size: 256 | lm loss: 1.938055E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.200 | TFLOPs: 20.76 | 31: iteration 130790/ 173500 | consumed samples: 33482240 | consumed tokens: 68571627520 | elapsed time per iteration (s): 0.90 | learning rate: 4.609E-05 | global batch size: 256 | lm loss: 1.945532E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 283.620 | TFLOPs: 17.16 | 31: iteration 130800/ 173500 | consumed samples: 33484800 | consumed tokens: 68576870400 | elapsed time per iteration (s): 0.73 | learning rate: 4.608E-05 | global batch size: 256 | lm loss: 1.940488E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.842 | TFLOPs: 21.23 | 31: iteration 130810/ 173500 | consumed samples: 33487360 | consumed tokens: 68582113280 | elapsed time per iteration (s): 0.78 | learning rate: 4.607E-05 | global batch size: 256 | lm loss: 1.923902E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.451 | TFLOPs: 19.81 | 31: iteration 130820/ 173500 | consumed samples: 33489920 | consumed tokens: 68587356160 | elapsed time per iteration (s): 0.82 | learning rate: 4.606E-05 | global batch size: 256 | lm loss: 1.958268E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.990 | TFLOPs: 18.87 | 31: iteration 130830/ 173500 | consumed samples: 33492480 | consumed tokens: 68592599040 | elapsed time per iteration (s): 0.81 | learning rate: 
4.605E-05 | global batch size: 256 | lm loss: 1.937899E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.912 | TFLOPs: 19.23 | 31: iteration 130840/ 173500 | consumed samples: 33495040 | consumed tokens: 68597841920 | elapsed time per iteration (s): 0.80 | learning rate: 4.603E-05 | global batch size: 256 | lm loss: 1.955157E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.537 | TFLOPs: 19.33 | 31: iteration 130850/ 173500 | consumed samples: 33497600 | consumed tokens: 68603084800 | elapsed time per iteration (s): 0.80 | learning rate: 4.602E-05 | global batch size: 256 | lm loss: 1.950076E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.715 | TFLOPs: 19.46 | 31: iteration 130860/ 173500 | consumed samples: 33500160 | consumed tokens: 68608327680 | elapsed time per iteration (s): 0.78 | learning rate: 4.601E-05 | global batch size: 256 | lm loss: 1.950648E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.377 | TFLOPs: 19.74 | 31: iteration 130870/ 173500 | consumed samples: 33502720 | consumed tokens: 68613570560 | elapsed time per iteration (s): 0.82 | learning rate: 4.600E-05 | global batch size: 256 | lm loss: 1.919891E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.567 | TFLOPs: 18.85 | 31: iteration 130880/ 173500 | consumed samples: 33505280 | consumed tokens: 68618813440 | elapsed time per iteration (s): 0.82 | learning rate: 4.599E-05 | global batch size: 256 | lm loss: 1.959974E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.842 | TFLOPs: 18.99 | 31: iteration 130890/ 173500 | 
consumed samples: 33507840 | consumed tokens: 68624056320 | elapsed time per iteration (s): 0.80 | learning rate: 4.598E-05 | global batch size: 256 | lm loss: 1.958109E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.160 | TFLOPs: 19.43 | 31: iteration 130900/ 173500 | consumed samples: 33510400 | consumed tokens: 68629299200 | elapsed time per iteration (s): 0.84 | learning rate: 4.596E-05 | global batch size: 256 | lm loss: 1.918661E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.084 | TFLOPs: 18.34 | 31: iteration 130910/ 173500 | consumed samples: 33512960 | consumed tokens: 68634542080 | elapsed time per iteration (s): 0.79 | learning rate: 4.595E-05 | global batch size: 256 | lm loss: 1.928746E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.107 | TFLOPs: 19.73 | 31: iteration 130920/ 173500 | consumed samples: 33515520 | consumed tokens: 68639784960 | elapsed time per iteration (s): 0.81 | learning rate: 4.594E-05 | global batch size: 256 | lm loss: 1.960147E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.777 | TFLOPs: 19.04 | 31: iteration 130930/ 173500 | consumed samples: 33518080 | consumed tokens: 68645027840 | elapsed time per iteration (s): 0.97 | learning rate: 4.593E-05 | global batch size: 256 | lm loss: 1.947264E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 265.198 | TFLOPs: 16.04 | 31: iteration 130940/ 173500 | consumed samples: 33520640 | consumed tokens: 68650270720 | elapsed time per iteration (s): 0.78 | learning rate: 4.592E-05 | global batch size: 256 | lm loss: 1.966677E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 327.893 | TFLOPs: 19.84 | 31: iteration 130950/ 173500 | consumed samples: 33523200 | consumed tokens: 68655513600 | elapsed time per iteration (s): 0.84 | learning rate: 4.591E-05 | global batch size: 256 | lm loss: 1.927618E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.641 | TFLOPs: 18.43 | 31: iteration 130960/ 173500 | consumed samples: 33525760 | consumed tokens: 68660756480 | elapsed time per iteration (s): 0.81 | learning rate: 4.590E-05 | global batch size: 256 | lm loss: 1.933137E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.279 | TFLOPs: 19.19 | 31: iteration 130970/ 173500 | consumed samples: 33528320 | consumed tokens: 68665999360 | elapsed time per iteration (s): 0.85 | learning rate: 4.588E-05 | global batch size: 256 | lm loss: 1.932473E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.972 | TFLOPs: 18.15 | 31: iteration 130980/ 173500 | consumed samples: 33530880 | consumed tokens: 68671242240 | elapsed time per iteration (s): 0.79 | learning rate: 4.587E-05 | global batch size: 256 | lm loss: 1.936557E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.946 | TFLOPs: 19.72 | 31: iteration 130990/ 173500 | consumed samples: 33533440 | consumed tokens: 68676485120 | elapsed time per iteration (s): 0.82 | learning rate: 4.586E-05 | global batch size: 256 | lm loss: 1.955507E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.966 | TFLOPs: 18.93 | 31: iteration 131000/ 173500 | consumed samples: 33536000 | consumed tokens: 68681728000 | elapsed time per iteration (s): 0.80 | learning rate: 
4.585E-05 | global batch size: 256 | lm loss: 1.924663E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.146 | TFLOPs: 19.25 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 131000 | lm loss value: 1.995430E+00 | lm loss PPL: 7.355363E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 131000 to checkpoints_1b1long 0: [2022-11-26 23:36:20,043] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step131000 is begin to save! 0: [2022-11-26 23:36:20,054] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_01-model_00-model_states.pt... 0: [2022-11-26 23:36:20,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_01-model_00-model_states.pt. 0: [2022-11-26 23:36:20,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_03-model_00-model_states.pt... 0: [2022-11-26 23:36:20,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_03-model_00-model_states.pt. 0: [2022-11-26 23:36:20,358] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_04-model_00-model_states.pt... 0: [2022-11-26 23:36:20,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_04-model_00-model_states.pt. 0: [2022-11-26 23:36:20,438] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_05-model_00-model_states.pt... 0: [2022-11-26 23:36:20,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_05-model_00-model_states.pt. 
0: [2022-11-26 23:36:20,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_06-model_00-model_states.pt... 0: [2022-11-26 23:36:20,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_06-model_00-model_states.pt. 0: [2022-11-26 23:36:20,594] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_07-model_00-model_states.pt... 0: [2022-11-26 23:36:20,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_07-model_00-model_states.pt. 0: [2022-11-26 23:36:20,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_08-model_00-model_states.pt... 0: [2022-11-26 23:36:20,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_08-model_00-model_states.pt. 0: [2022-11-26 23:36:20,746] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_09-model_00-model_states.pt... 0: [2022-11-26 23:36:20,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_09-model_00-model_states.pt. 0: [2022-11-26 23:36:20,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_10-model_00-model_states.pt... 0: [2022-11-26 23:36:20,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_10-model_00-model_states.pt. 0: [2022-11-26 23:36:20,901] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_11-model_00-model_states.pt... 0: [2022-11-26 23:36:20,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_11-model_00-model_states.pt. 
0: [2022-11-26 23:36:20,977] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_12-model_00-model_states.pt... 0: [2022-11-26 23:36:21,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_12-model_00-model_states.pt. 0: [2022-11-26 23:36:21,052] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_13-model_00-model_states.pt... 0: [2022-11-26 23:36:21,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_13-model_00-model_states.pt. 0: [2022-11-26 23:36:21,127] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_14-model_00-model_states.pt... 0: [2022-11-26 23:36:21,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_14-model_00-model_states.pt. 0: [2022-11-26 23:36:21,203] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_15-model_00-model_states.pt... 0: [2022-11-26 23:36:21,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_15-model_00-model_states.pt. 0: [2022-11-26 23:36:21,279] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_16-model_00-model_states.pt... 0: [2022-11-26 23:36:21,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_16-model_00-model_states.pt. 0: [2022-11-26 23:36:21,354] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_17-model_00-model_states.pt... 0: [2022-11-26 23:36:21,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_17-model_00-model_states.pt. 
0: [2022-11-26 23:36:21,430] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_18-model_00-model_states.pt... 0: [2022-11-26 23:36:21,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_18-model_00-model_states.pt. 0: [2022-11-26 23:36:21,505] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_19-model_00-model_states.pt... 0: [2022-11-26 23:36:21,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_19-model_00-model_states.pt. 0: [2022-11-26 23:36:21,581] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_20-model_00-model_states.pt... 0: [2022-11-26 23:36:21,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_20-model_00-model_states.pt. 0: [2022-11-26 23:36:21,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_21-model_00-model_states.pt... 0: [2022-11-26 23:36:21,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_21-model_00-model_states.pt. 0: [2022-11-26 23:36:21,730] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_22-model_00-model_states.pt... 0: [2022-11-26 23:36:21,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_22-model_00-model_states.pt. 0: [2022-11-26 23:36:21,808] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_23-model_00-model_states.pt... 0: [2022-11-26 23:36:21,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_23-model_00-model_states.pt. 
0: [2022-11-26 23:36:21,882] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_24-model_00-model_states.pt... 0: [2022-11-26 23:36:21,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_24-model_00-model_states.pt. 0: [2022-11-26 23:36:21,955] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_25-model_00-model_states.pt... 0: [2022-11-26 23:36:22,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_25-model_00-model_states.pt. 0: [2022-11-26 23:36:22,033] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_26-model_00-model_states.pt... 0: [2022-11-26 23:36:22,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_26-model_00-model_states.pt. 0: [2022-11-26 23:36:22,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_27-model_00-model_states.pt... 0: [2022-11-26 23:36:22,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_27-model_00-model_states.pt. 0: [2022-11-26 23:36:22,183] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_28-model_00-model_states.pt... 0: [2022-11-26 23:36:22,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_28-model_00-model_states.pt. 0: [2022-11-26 23:36:22,258] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/layer_30-model_00-model_states.pt... 0: [2022-11-26 23:36:22,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/layer_30-model_00-model_states.pt. 
0: [2022-11-26 23:36:22,260] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step131000/mp_rank_00_model_states.pt 0: [2022-11-26 23:36:22,260] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/mp_rank_00_model_states.pt... 0: [2022-11-26 23:36:22,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/mp_rank_00_model_states.pt. 0: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
9: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
15: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
31: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 
26: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
4: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 
10: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 
13: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 
20: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 
28: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 
31: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 
26: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
5: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 
1: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 
12: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 
28: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 
17: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 
19: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
8: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 
20: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 
22: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 
16: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 
18: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 
30: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:36:22,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:36:22,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:36:22,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 10: [2022-11-26 23:36:22,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:36:22,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 
10: [2022-11-26 23:36:22,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 23:36:22,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 26: [2022-11-26 23:36:22,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 23:36:22,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 4: [2022-11-26 23:36:22,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 26: [2022-11-26 23:36:22,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 4: [2022-11-26 23:36:22,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 23:36:22,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 0: [2022-11-26 23:36:22,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:36:22,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 23:36:22,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 6: [2022-11-26 23:36:22,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 23:36:22,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 23:36:22,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 1: [2022-11-26 23:36:22,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:36:22,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 23:36:22,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 22: [2022-11-26 23:36:22,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:36:22,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 23:36:22,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 30: [2022-11-26 23:36:22,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-26 23:36:22,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 23:36:22,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 11: [2022-11-26 23:36:22,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 23:36:22,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 23:36:22,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 2: [2022-11-26 23:36:22,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:36:22,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 23:36:22,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 15: [2022-11-26 23:36:22,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:36:22,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 23:36:22,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 4: [2022-11-26 23:36:22,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:36:22,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 23:36:22,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 24: [2022-11-26 23:36:22,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
24: [2022-11-26 23:36:22,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 13: [2022-11-26 23:36:22,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:36:22,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 7: [2022-11-26 23:36:22,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:36:22,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 7: [2022-11-26 23:36:22,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 10: [2022-11-26 23:36:22,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:36:22,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 7: [2022-11-26 23:36:22,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 10: [2022-11-26 23:36:22,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 23:36:22,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 17: [2022-11-26 23:36:22,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
25: [2022-11-26 23:36:22,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:36:22,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 23:36:22,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 25: [2022-11-26 23:36:22,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 13: [2022-11-26 23:36:22,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:36:22,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 23:36:22,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 25: [2022-11-26 23:36:22,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 25: [2022-11-26 23:36:22,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:36:22,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 23:36:22,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 8: [2022-11-26 23:36:22,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-26 23:36:22,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:36:22,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:36:22,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:36:22,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 5: [2022-11-26 23:36:22,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 23:36:22,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 8: [2022-11-26 23:36:22,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 5: [2022-11-26 23:36:22,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 5: [2022-11-26 23:36:22,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 8: [2022-11-26 23:36:22,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 8: [2022-11-26 23:36:22,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 29: [2022-11-26 23:36:22,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
29: [2022-11-26 23:36:22,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 23:36:22,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 14: [2022-11-26 23:36:22,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:36:22,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:36:22,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 26: [2022-11-26 23:36:22,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:36:22,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 26: [2022-11-26 23:36:22,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 14: [2022-11-26 23:36:22,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 26: [2022-11-26 23:36:22,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 14: [2022-11-26 23:36:22,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 22: [2022-11-26 23:36:22,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
22: [2022-11-26 23:36:22,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-26 23:36:22,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 24: [2022-11-26 23:36:22,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 31: [2022-11-26 23:36:22,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:36:22,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 27: [2022-11-26 23:36:22,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:36:22,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 31: [2022-11-26 23:36:22,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 27: [2022-11-26 23:36:22,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 23:36:22,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 31: [2022-11-26 23:36:22,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 3: [2022-11-26 23:36:22,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
23: [2022-11-26 23:36:22,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:36:22,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 23:36:22,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 23: [2022-11-26 23:36:22,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 1: [2022-11-26 23:36:22,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 23: [2022-11-26 23:36:22,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 1: [2022-11-26 23:36:22,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 11: [2022-11-26 23:36:22,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:36:22,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:36:22,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 
11: [2022-11-26 23:36:22,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 15: [2022-11-26 23:36:22,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 11: [2022-11-26 23:36:22,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 15: [2022-11-26 23:36:22,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 17: [2022-11-26 23:36:22,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:36:22,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 23:36:22,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 19: [2022-11-26 23:36:22,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:36:22,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-26 23:36:22,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 31: [2022-11-26 23:36:22,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 
31: [2022-11-26 23:36:22,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 23:36:22,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 25: [2022-11-26 23:36:22,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:36:22,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 23:36:22,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 19: [2022-11-26 23:36:22,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:36:22,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 23:36:22,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 29: [2022-11-26 23:36:22,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:36:22,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-26 23:36:22,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 24: [2022-11-26 23:36:22,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:36:22,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 24: [2022-11-26 23:36:22,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 29: [2022-11-26 23:36:22,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 24: [2022-11-26 23:36:22,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 29: [2022-11-26 23:36:22,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 2: [2022-11-26 23:36:22,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:36:22,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 23:36:22,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 13: [2022-11-26 23:36:22,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 23:36:22,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 30: [2022-11-26 23:36:22,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:36:22,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 30: [2022-11-26 23:36:22,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 23:36:22,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 20: [2022-11-26 23:36:22,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 23:36:22,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 23:36:22,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 26: [2022-11-26 23:36:22,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 23:36:22,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 23:36:22,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 3: [2022-11-26 23:36:22,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 23:36:22,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 23:36:22,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 0: [2022-11-26 23:36:22,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:36:22,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 23:36:22,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 6: [2022-11-26 23:36:22,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:36:22,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 23:36:22,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 6: [2022-11-26 23:36:22,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:36:22,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 23:36:22,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 10: [2022-11-26 23:36:22,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 23:36:22,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 23:36:22,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 8: [2022-11-26 23:36:22,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:36:22,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 23:36:22,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 2: [2022-11-26 23:36:22,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:36:22,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 23:36:22,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 4: [2022-11-26 23:36:22,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:36:22,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 23:36:22,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 23: [2022-11-26 23:36:22,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-26 23:36:22,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-26 23:36:22,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 4: [2022-11-26 23:36:22,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:36:22,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 23:36:22,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 17: [2022-11-26 23:36:22,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:36:22,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 23:36:22,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 20: [2022-11-26 23:36:22,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:36:22,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:36:22,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
3: [2022-11-26 23:36:22,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 23:36:22,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 14: [2022-11-26 23:36:22,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 20: [2022-11-26 23:36:22,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 14: [2022-11-26 23:36:22,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 7: [2022-11-26 23:36:22,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 20: [2022-11-26 23:36:22,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 7: [2022-11-26 23:36:22,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 20: [2022-11-26 23:36:22,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:36:22,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 20: [2022-11-26 23:36:22,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 23:36:22,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 
27: [2022-11-26 23:36:22,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:36:22,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-26 23:36:22,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 7: [2022-11-26 23:36:22,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:36:22,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 23:36:22,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 19: [2022-11-26 23:36:22,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:36:22,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 23:36:22,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 23: [2022-11-26 23:36:22,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 23:36:22,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 23:36:22,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 
1: [2022-11-26 23:36:22,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:36:22,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 23:36:22,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 26: [2022-11-26 23:36:22,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 23:36:22,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 13: [2022-11-26 23:36:22,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 26: [2022-11-26 23:36:22,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 13: [2022-11-26 23:36:22,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 23:36:22,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 8: [2022-11-26 23:36:22,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:36:22,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 23:36:22,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 
31: [2022-11-26 23:36:22,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-26 23:36:22,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 23:36:22,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 11: [2022-11-26 23:36:22,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:36:22,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 23:36:22,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 15: [2022-11-26 23:36:22,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:36:22,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 23:36:22,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 24: [2022-11-26 23:36:22,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:36:22,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 23:36:22,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 
5: [2022-11-26 23:36:22,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:36:22,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:36:22,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 15: [2022-11-26 23:36:22,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:36:22,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 5: [2022-11-26 23:36:22,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 23:36:22,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 15: [2022-11-26 23:36:22,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 23:36:22,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 22: [2022-11-26 23:36:22,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:36:22,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 23:36:22,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 
29: [2022-11-26 23:36:22,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:36:22,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 23:36:22,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 25: [2022-11-26 23:36:22,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:36:22,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 23:36:22,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 2: [2022-11-26 23:36:22,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:36:22,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 23:36:22,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 18: [2022-11-26 23:36:22,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:36:22,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-26 23:36:22,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 
0: [2022-11-26 23:36:22,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:36:22,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 23:36:22,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 9: [2022-11-26 23:36:22,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:36:22,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 23:36:22,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 17: [2022-11-26 23:36:22,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:36:22,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 30: [2022-11-26 23:36:22,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:36:22,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 30: [2022-11-26 23:36:22,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 23:36:22,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 
23: [2022-11-26 23:36:22,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:36:22,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 23: [2022-11-26 23:36:22,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 23:36:22,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 10: [2022-11-26 23:36:22,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 23:36:22,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 20: [2022-11-26 23:36:22,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 23:36:22,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 23:36:22,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 11: [2022-11-26 23:36:22,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:36:22,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 23:36:22,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 23:36:22,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 11: [2022-11-26 23:36:22,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 23:36:22,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 1: [2022-11-26 23:36:22,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:36:22,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 28: [2022-11-26 23:36:22,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 30: [2022-11-26 23:36:22,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:36:22,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 28: [2022-11-26 23:36:22,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 23:36:22,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 
30: [2022-11-26 23:36:22,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 23:36:22,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 27: [2022-11-26 23:36:22,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:36:22,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 23:36:22,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 14: [2022-11-26 23:36:22,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:36:22,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 23:36:22,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 12: [2022-11-26 23:36:22,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:36:22,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:36:22,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 23:36:22,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 23:36:22,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 23:36:22,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 12: [2022-11-26 23:36:22,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 12: [2022-11-26 23:36:22,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 23:36:22,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 28: [2022-11-26 23:36:22,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 23:36:22,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-26 23:36:22,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 22: [2022-11-26 23:36:22,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:36:22,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-26 23:36:22,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 
21: [2022-11-26 23:36:22,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:36:22,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:36:22,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:36:22,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:36:22,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:36:22,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-26 23:36:22,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 18: [2022-11-26 23:36:22,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 21: [2022-11-26 23:36:22,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-26 23:36:22,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt
18: [2022-11-26 23:36:22,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:36:22,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 21: [2022-11-26 23:36:22,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 18: [2022-11-26 23:36:22,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 21: [2022-11-26 23:36:22,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 21: [2022-11-26 23:36:22,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 18: [2022-11-26 23:36:22,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-26 23:36:22,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 31: [2022-11-26 23:36:22,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-26 23:36:22,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 23:36:22,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 9: [2022-11-26 23:36:22,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt.
9: [2022-11-26 23:36:22,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 7: [2022-11-26 23:36:22,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:36:22,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 7: [2022-11-26 23:36:22,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 23:36:22,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 9: [2022-11-26 23:36:22,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:36:22,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 23:36:22,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 6: [2022-11-26 23:36:22,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:36:22,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 23:36:22,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 19: [2022-11-26 23:36:22,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
19: [2022-11-26 23:36:22,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-26 23:36:22,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 16: [2022-11-26 23:36:22,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:36:22,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:36:22,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:36:22,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:36:22,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 23:36:22,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 23:36:22,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 23:36:22,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 
16: [2022-11-26 23:36:22,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 23:36:22,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 16: [2022-11-26 23:36:22,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 16: [2022-11-26 23:36:22,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 12: [2022-11-26 23:36:22,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:36:22,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 23:36:22,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 0: [2022-11-26 23:36:22,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:36:22,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 23:36:22,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 18: [2022-11-26 23:36:22,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
18: [2022-11-26 23:36:22,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 23:36:22,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 28: [2022-11-26 23:36:22,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-26 23:36:22,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 23:36:22,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 4: [2022-11-26 23:36:22,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:36:22,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 23:36:22,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 23: [2022-11-26 23:36:22,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 23:36:22,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 23:36:22,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 20: [2022-11-26 23:36:22,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
20: [2022-11-26 23:36:22,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 23:36:22,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 1: [2022-11-26 23:36:22,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:36:22,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 23:36:22,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 24: [2022-11-26 23:36:22,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:36:22,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:36:22,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 24: [2022-11-26 23:36:22,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 29: [2022-11-26 23:36:22,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 24: [2022-11-26 23:36:22,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 16: [2022-11-26 23:36:22,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-26 23:36:22,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-26 23:36:22,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 25: [2022-11-26 23:36:22,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:36:22,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 26: [2022-11-26 23:36:22,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:36:22,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 26: [2022-11-26 23:36:22,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 23:36:22,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 17: [2022-11-26 23:36:22,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:36:22,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 23:36:22,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 13: [2022-11-26 23:36:22,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-26 23:36:22,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 23:36:22,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 27: [2022-11-26 23:36:22,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:36:22,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-26 23:36:22,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 11: [2022-11-26 23:36:22,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:36:22,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 23:36:22,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 6: [2022-11-26 23:36:22,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:36:22,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 23:36:22,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 10: [2022-11-26 23:36:22,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 23:36:22,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 23:36:22,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 5: [2022-11-26 23:36:22,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:36:22,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 23:36:22,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 12: [2022-11-26 23:36:22,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:36:22,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:36:22,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 23:36:22,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 22: [2022-11-26 23:36:22,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 23:36:22,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 2: [2022-11-26 23:36:22,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 23:36:22,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 23:36:22,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 31: [2022-11-26 23:36:22,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-26 23:36:22,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-26 23:36:22,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 8: [2022-11-26 23:36:22,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:36:22,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 23:36:22,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 19: [2022-11-26 23:36:22,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:36:22,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 23:36:22,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 15: [2022-11-26 23:36:22,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-26 23:36:22,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 23:36:22,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 28: [2022-11-26 23:36:22,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 30: [2022-11-26 23:36:22,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 28: [2022-11-26 23:36:22,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 30: [2022-11-26 23:36:22,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 23:36:22,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 28: [2022-11-26 23:36:22,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 14: [2022-11-26 23:36:22,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:36:22,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 23:36:22,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 0: [2022-11-26 23:36:22,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-26 23:36:22,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 3: [2022-11-26 23:36:22,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:36:22,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 3: [2022-11-26 23:36:22,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 23:36:22,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 7: [2022-11-26 23:36:22,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:36:22,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 4: [2022-11-26 23:36:22,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:36:22,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 4: [2022-11-26 23:36:22,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 23:36:22,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 9: [2022-11-26 23:36:22,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 23:36:22,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 23:36:22,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 1: [2022-11-26 23:36:22,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:36:22,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 23:36:22,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 23: [2022-11-26 23:36:22,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-26 23:36:22,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 23:36:22,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 29: [2022-11-26 23:36:22,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:36:22,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 23:36:22,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 24: [2022-11-26 23:36:22,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
24: [2022-11-26 23:36:22,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-26 23:36:22,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 21: [2022-11-26 23:36:22,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:36:22,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 23:36:22,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 18: [2022-11-26 23:36:22,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:36:22,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 23:36:22,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 21: [2022-11-26 23:36:22,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:36:22,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 23:36:22,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 20: [2022-11-26 23:36:22,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
20: [2022-11-26 23:36:22,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-26 23:36:22,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 13: [2022-11-26 23:36:22,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:36:22,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 23:36:22,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 17: [2022-11-26 23:36:22,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:36:22,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 23:36:22,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 16: [2022-11-26 23:36:22,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:36:22,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 23:36:22,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 30: [2022-11-26 23:36:22,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
30: [2022-11-26 23:36:22,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 23:36:22,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 25: [2022-11-26 23:36:22,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:36:22,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 23:36:22,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 26: [2022-11-26 23:36:22,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:36:22,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 26: [2022-11-26 23:36:22,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 27: [2022-11-26 23:36:22,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 23:36:22,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 26: [2022-11-26 23:36:22,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 10: [2022-11-26 23:36:22,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-26 23:36:22,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 23:36:22,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 8: [2022-11-26 23:36:22,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:36:22,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:36:22,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 23:36:22,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 2: [2022-11-26 23:36:22,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 23:36:22,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 12: [2022-11-26 23:36:22,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:36:22,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 23:36:22,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 5: [2022-11-26 23:36:22,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 23:36:22,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 23:36:22,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 31: [2022-11-26 23:36:22,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-26 23:36:22,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 23:36:22,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 11: [2022-11-26 23:36:22,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:36:22,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 23:36:22,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 22: [2022-11-26 23:36:22,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:36:22,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 23:36:22,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 0: [2022-11-26 23:36:22,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
6: [2022-11-26 23:36:22,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:36:22,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 23:36:22,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 19: [2022-11-26 23:36:22,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:36:22,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 23:36:22,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 14: [2022-11-26 23:36:22,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:36:22,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:36:22,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 23:36:22,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 15: [2022-11-26 23:36:22,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 23:36:22,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 
7: [2022-11-26 23:36:22,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:36:22,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 23:36:22,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 4: [2022-11-26 23:36:22,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:36:22,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 23:36:22,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 18: [2022-11-26 23:36:22,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:36:22,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-26 23:36:22,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 9: [2022-11-26 23:36:22,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:36:22,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 23:36:22,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 
3: [2022-11-26 23:36:22,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:36:22,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 3: [2022-11-26 23:36:22,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 0: [2022-11-26 23:36:22,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 3: [2022-11-26 23:36:22,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 20: [2022-11-26 23:36:22,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 23:36:22,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 23:36:22,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 1: [2022-11-26 23:36:22,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:36:22,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 23:36:22,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 28: [2022-11-26 23:36:22,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
28: [2022-11-26 23:36:22,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 23: [2022-11-26 23:36:22,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 28: [2022-11-26 23:36:22,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 23: [2022-11-26 23:36:22,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 23:36:22,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 29: [2022-11-26 23:36:22,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:36:22,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 23:36:22,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 8: [2022-11-26 23:36:22,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:36:22,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 23:36:22,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 24: [2022-11-26 23:36:22,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-26 23:36:22,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 23:36:22,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 21: [2022-11-26 23:36:22,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:36:22,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 23:36:22,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 13: [2022-11-26 23:36:22,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:36:22,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 23:36:22,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 25: [2022-11-26 23:36:22,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:36:22,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 23:36:22,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 17: [2022-11-26 23:36:22,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 
16: [2022-11-26 23:36:22,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:36:22,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 16: [2022-11-26 23:36:22,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 23:36:22,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 17: [2022-11-26 23:36:22,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 6: [2022-11-26 23:36:22,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:36:22,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 23:36:22,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 15: [2022-11-26 23:36:22,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:36:22,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 23:36:22,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 26: [2022-11-26 23:36:22,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
26: [2022-11-26 23:36:22,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-26 23:36:22,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 11: [2022-11-26 23:36:22,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:36:22,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 23:36:22,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 27: [2022-11-26 23:36:22,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:36:22,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:36:22,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 27: [2022-11-26 23:36:22,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 2: [2022-11-26 23:36:22,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 27: [2022-11-26 23:36:22,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 30: [2022-11-26 23:36:22,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 
30: [2022-11-26 23:36:22,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 23:36:22,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 10: [2022-11-26 23:36:22,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:36:22,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 23:36:22,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 31: [2022-11-26 23:36:22,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 23:36:22,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 23:36:22,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 7: [2022-11-26 23:36:22,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:36:22,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 23:36:22,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 22: [2022-11-26 23:36:22,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
22: [2022-11-26 23:36:22,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 23:36:22,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 3: [2022-11-26 23:36:22,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:36:22,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:36:22,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 19: [2022-11-26 23:36:22,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 3: [2022-11-26 23:36:22,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 19: [2022-11-26 23:36:22,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 4: [2022-11-26 23:36:22,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:36:22,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 23:36:22,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 8: [2022-11-26 23:36:22,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-26 23:36:22,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 23:36:22,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 31: [2022-11-26 23:36:22,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 23:36:22,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 23:36:22,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 14: [2022-11-26 23:36:22,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:36:22,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:36:22,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 14: [2022-11-26 23:36:22,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 5: [2022-11-26 23:36:22,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 14: [2022-11-26 23:36:22,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 16: [2022-11-26 23:36:22,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
16: [2022-11-26 23:36:22,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 23:36:22,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 5: [2022-11-26 23:36:22,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:36:22,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:36:22,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:36:22,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 5: [2022-11-26 23:36:22,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 23:36:22,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 22: [2022-11-26 23:36:22,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 23:36:22,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 21: [2022-11-26 23:36:22,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 1: [2022-11-26 23:36:22,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 23:36:22,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 24: [2022-11-26 23:36:22,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:36:22,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:36:22,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 24: [2022-11-26 23:36:22,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 27: [2022-11-26 23:36:22,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 24: [2022-11-26 23:36:22,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 27: [2022-11-26 23:36:22,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 12: [2022-11-26 23:36:22,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:36:22,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:36:22,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
25: [2022-11-26 23:36:22,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:36:22,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 23:36:22,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 9: [2022-11-26 23:36:22,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 23:36:22,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 12: [2022-11-26 23:36:22,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 12: [2022-11-26 23:36:22,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 25: [2022-11-26 23:36:22,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-26 23:36:22,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 13: [2022-11-26 23:36:22,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:36:22,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 23:36:22,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 
17: [2022-11-26 23:36:22,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:36:22,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 23:36:22,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 29: [2022-11-26 23:36:22,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:36:22,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 23:36:22,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 18: [2022-11-26 23:36:22,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:36:22,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 23:36:22,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 20: [2022-11-26 23:36:22,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:36:22,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
19: [2022-11-26 23:36:22,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 20: [2022-11-26 23:36:22,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 23:36:22,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 19: [2022-11-26 23:36:22,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 28: [2022-11-26 23:36:22,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:36:22,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:36:22,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:36:22,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 3: [2022-11-26 23:36:22,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 11: [2022-11-26 23:36:22,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 3: [2022-11-26 23:36:22,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 23: [2022-11-26 23:36:22,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
23: [2022-11-26 23:36:22,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 23:36:22,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 2: [2022-11-26 23:36:22,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:36:22,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 30: [2022-11-26 23:36:22,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:36:22,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 30: [2022-11-26 23:36:22,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 10: [2022-11-26 23:36:22,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 30: [2022-11-26 23:36:22,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 10: [2022-11-26 23:36:22,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 0: [2022-11-26 23:36:22,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-26 23:36:22,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 23:36:22,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 0: [2022-11-26 23:36:22,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:36:22,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 0: [2022-11-26 23:36:22,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 23:36:22,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 26: [2022-11-26 23:36:22,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:36:22,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 26: [2022-11-26 23:36:22,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 7: [2022-11-26 23:36:22,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 26: [2022-11-26 23:36:22,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 7: [2022-11-26 23:36:22,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 
14: [2022-11-26 23:36:22,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:36:22,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 23:36:22,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 6: [2022-11-26 23:36:22,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:36:22,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 23:36:22,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 15: [2022-11-26 23:36:22,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:36:22,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 23:36:22,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 18: [2022-11-26 23:36:22,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:36:22,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 23:36:22,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 
9: [2022-11-26 23:36:22,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:36:22,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 23:36:22,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 28: [2022-11-26 23:36:22,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 23:36:22,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 28: [2022-11-26 23:36:22,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-26 23:36:22,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-26 23:36:22,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 9: [2022-11-26 23:36:22,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:36:22,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 23:36:22,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 28: [2022-11-26 23:36:22,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-26 23:36:22,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step131000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 23:36:22,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step131000 is ready now! 0: successfully saved checkpoint at iteration 131000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2541.81 31: iteration 131010/ 173500 | consumed samples: 33538560 | consumed tokens: 68686970880 | elapsed time per iteration (s): 1.12 | learning rate: 4.584E-05 | global batch size: 256 | lm loss: 1.929476E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.129 | TFLOPs: 13.80 | 31: iteration 131020/ 173500 | consumed samples: 33541120 | consumed tokens: 68692213760 | elapsed time per iteration (s): 0.80 | learning rate: 4.583E-05 | global batch size: 256 | lm loss: 1.888642E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.712 | TFLOPs: 19.28 | 31: iteration 131030/ 173500 | consumed samples: 33543680 | consumed tokens: 68697456640 | elapsed time per iteration (s): 0.81 | learning rate: 4.581E-05 | global batch size: 256 | lm loss: 1.937064E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.562 | TFLOPs: 19.21 | 31: iteration 131040/ 173500 | consumed samples: 33546240 | consumed tokens: 68702699520 | elapsed time per iteration (s): 0.81 | learning rate: 4.580E-05 | global batch size: 256 | lm loss: 1.941500E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.432 | TFLOPs: 19.20 | 31: iteration 131050/ 173500 | consumed samples: 33548800 | consumed tokens: 68707942400 | elapsed time per iteration (s): 0.76 | learning rate: 4.579E-05 | 
global batch size: 256 | lm loss: 1.946378E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.797 | TFLOPs: 20.31 | 31: iteration 131060/ 173500 | consumed samples: 33551360 | consumed tokens: 68713185280 | elapsed time per iteration (s): 0.77 | learning rate: 4.578E-05 | global batch size: 256 | lm loss: 1.923093E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.252 | TFLOPs: 20.22 | 31: iteration 131070/ 173500 | consumed samples: 33553920 | consumed tokens: 68718428160 | elapsed time per iteration (s): 0.72 | learning rate: 4.577E-05 | global batch size: 256 | lm loss: 1.953862E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.509 | TFLOPs: 21.39 | 31: iteration 131080/ 173500 | consumed samples: 33556480 | consumed tokens: 68723671040 | elapsed time per iteration (s): 0.75 | learning rate: 4.576E-05 | global batch size: 256 | lm loss: 1.943700E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.694 | TFLOPs: 20.67 | 31: iteration 131090/ 173500 | consumed samples: 33559040 | consumed tokens: 68728913920 | elapsed time per iteration (s): 0.86 | learning rate: 4.575E-05 | global batch size: 256 | lm loss: 1.938733E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.414 | TFLOPs: 17.93 | 31: iteration 131100/ 173500 | consumed samples: 33561600 | consumed tokens: 68734156800 | elapsed time per iteration (s): 0.77 | learning rate: 4.573E-05 | global batch size: 256 | lm loss: 1.931420E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.900 | TFLOPs: 20.02 | 31: iteration 131110/ 173500 | consumed 
samples: 33564160 | consumed tokens: 68739399680 | elapsed time per iteration (s): 0.73 | learning rate: 4.572E-05 | global batch size: 256 | lm loss: 1.932100E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.911 | TFLOPs: 21.17 | 31: iteration 131120/ 173500 | consumed samples: 33566720 | consumed tokens: 68744642560 | elapsed time per iteration (s): 0.78 | learning rate: 4.571E-05 | global batch size: 256 | lm loss: 1.927610E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.849 | TFLOPs: 19.89 | 31: iteration 131130/ 173500 | consumed samples: 33569280 | consumed tokens: 68749885440 | elapsed time per iteration (s): 0.73 | learning rate: 4.570E-05 | global batch size: 256 | lm loss: 1.924303E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.940 | TFLOPs: 21.11 | 31: iteration 131140/ 173500 | consumed samples: 33571840 | consumed tokens: 68755128320 | elapsed time per iteration (s): 0.77 | learning rate: 4.569E-05 | global batch size: 256 | lm loss: 1.926608E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.458 | TFLOPs: 20.17 | 31: iteration 131150/ 173500 | consumed samples: 33574400 | consumed tokens: 68760371200 | elapsed time per iteration (s): 0.75 | learning rate: 4.568E-05 | global batch size: 256 | lm loss: 1.957400E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.530 | TFLOPs: 20.66 | 31: iteration 131160/ 173500 | consumed samples: 33576960 | consumed tokens: 68765614080 | elapsed time per iteration (s): 0.75 | learning rate: 4.566E-05 | global batch size: 256 | lm loss: 1.909102E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 342.910 | TFLOPs: 20.75 | 31: iteration 131170/ 173500 | consumed samples: 33579520 | consumed tokens: 68770856960 | elapsed time per iteration (s): 0.73 | learning rate: 4.565E-05 | global batch size: 256 | lm loss: 1.922824E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.322 | TFLOPs: 21.13 | 31: iteration 131180/ 173500 | consumed samples: 33582080 | consumed tokens: 68776099840 | elapsed time per iteration (s): 0.78 | learning rate: 4.564E-05 | global batch size: 256 | lm loss: 1.922825E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.172 | TFLOPs: 19.85 | 31: iteration 131190/ 173500 | consumed samples: 33584640 | consumed tokens: 68781342720 | elapsed time per iteration (s): 0.77 | learning rate: 4.563E-05 | global batch size: 256 | lm loss: 1.932363E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.989 | TFLOPs: 20.21 | 31: iteration 131200/ 173500 | consumed samples: 33587200 | consumed tokens: 68786585600 | elapsed time per iteration (s): 0.74 | learning rate: 4.562E-05 | global batch size: 256 | lm loss: 1.924976E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.180 | TFLOPs: 20.88 | 31: iteration 131210/ 173500 | consumed samples: 33589760 | consumed tokens: 68791828480 | elapsed time per iteration (s): 0.78 | learning rate: 4.561E-05 | global batch size: 256 | lm loss: 1.944385E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.801 | TFLOPs: 19.89 | 31: iteration 131220/ 173500 | consumed samples: 33592320 | consumed tokens: 68797071360 | elapsed time per iteration (s): 0.78 | learning rate: 4.560E-05 | global 
batch size: 256 | lm loss: 1.942792E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.350 | TFLOPs: 19.92 | 31: iteration 131230/ 173500 | consumed samples: 33594880 | consumed tokens: 68802314240 | elapsed time per iteration (s): 0.75 | learning rate: 4.558E-05 | global batch size: 256 | lm loss: 1.942662E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.279 | TFLOPs: 20.77 | 31: iteration 131240/ 173500 | consumed samples: 33597440 | consumed tokens: 68807557120 | elapsed time per iteration (s): 0.78 | learning rate: 4.557E-05 | global batch size: 256 | lm loss: 1.919709E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.014 | TFLOPs: 19.97 | 31: iteration 131250/ 173500 | consumed samples: 33600000 | consumed tokens: 68812800000 | elapsed time per iteration (s): 0.76 | learning rate: 4.556E-05 | global batch size: 256 | lm loss: 1.945940E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.212 | TFLOPs: 20.34 | 31: iteration 131260/ 173500 | consumed samples: 33602560 | consumed tokens: 68818042880 | elapsed time per iteration (s): 0.74 | learning rate: 4.555E-05 | global batch size: 256 | lm loss: 1.977326E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.402 | TFLOPs: 21.02 | 31: iteration 131270/ 173500 | consumed samples: 33605120 | consumed tokens: 68823285760 | elapsed time per iteration (s): 0.84 | learning rate: 4.554E-05 | global batch size: 256 | lm loss: 1.951447E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.493 | TFLOPs: 18.42 | 31: iteration 131280/ 173500 | consumed samples: 
33607680 | consumed tokens: 68828528640 | elapsed time per iteration (s): 0.77 | learning rate: 4.553E-05 | global batch size: 256 | lm loss: 1.944687E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.492 | TFLOPs: 20.18 | 31: iteration 131290/ 173500 | consumed samples: 33610240 | consumed tokens: 68833771520 | elapsed time per iteration (s): 0.78 | learning rate: 4.552E-05 | global batch size: 256 | lm loss: 1.947832E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.328 | TFLOPs: 19.80 | 31: iteration 131300/ 173500 | consumed samples: 33612800 | consumed tokens: 68839014400 | elapsed time per iteration (s): 0.78 | learning rate: 4.550E-05 | global batch size: 256 | lm loss: 1.918096E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.234 | TFLOPs: 19.80 | 31: iteration 131310/ 173500 | consumed samples: 33615360 | consumed tokens: 68844257280 | elapsed time per iteration (s): 0.79 | learning rate: 4.549E-05 | global batch size: 256 | lm loss: 1.925141E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.321 | TFLOPs: 19.50 | 31: iteration 131320/ 173500 | consumed samples: 33617920 | consumed tokens: 68849500160 | elapsed time per iteration (s): 0.76 | learning rate: 4.548E-05 | global batch size: 256 | lm loss: 1.972542E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.592 | TFLOPs: 20.30 | 31: iteration 131330/ 173500 | consumed samples: 33620480 | consumed tokens: 68854743040 | elapsed time per iteration (s): 0.73 | learning rate: 4.547E-05 | global batch size: 256 | lm loss: 1.933237E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 350.931 | TFLOPs: 21.23 | 31: iteration 131340/ 173500 | consumed samples: 33623040 | consumed tokens: 68859985920 | elapsed time per iteration (s): 0.77 | learning rate: 4.546E-05 | global batch size: 256 | lm loss: 1.951969E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.782 | TFLOPs: 20.01 | 31: iteration 131350/ 173500 | consumed samples: 33625600 | consumed tokens: 68865228800 | elapsed time per iteration (s): 0.72 | learning rate: 4.545E-05 | global batch size: 256 | lm loss: 1.939825E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.524 | TFLOPs: 21.45 | 31: iteration 131360/ 173500 | consumed samples: 33628160 | consumed tokens: 68870471680 | elapsed time per iteration (s): 0.75 | learning rate: 4.544E-05 | global batch size: 256 | lm loss: 1.921174E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.315 | TFLOPs: 20.59 | 31: iteration 131370/ 173500 | consumed samples: 33630720 | consumed tokens: 68875714560 | elapsed time per iteration (s): 0.74 | learning rate: 4.542E-05 | global batch size: 256 | lm loss: 1.932377E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.568 | TFLOPs: 20.85 | 31: iteration 131380/ 173500 | consumed samples: 33633280 | consumed tokens: 68880957440 | elapsed time per iteration (s): 0.76 | learning rate: 4.541E-05 | global batch size: 256 | lm loss: 1.953858E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.907 | TFLOPs: 20.38 | 31: iteration 131390/ 173500 | consumed samples: 33635840 | consumed tokens: 68886200320 | elapsed time per iteration (s): 0.73 | learning rate: 4.540E-05 | global batch 
size: 256 | lm loss: 1.934941E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.659 | TFLOPs: 21.15 | 31: iteration 131400/ 173500 | consumed samples: 33638400 | consumed tokens: 68891443200 | elapsed time per iteration (s): 0.75 | learning rate: 4.539E-05 | global batch size: 256 | lm loss: 1.932655E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.531 | TFLOPs: 20.72 | 31: iteration 131410/ 173500 | consumed samples: 33640960 | consumed tokens: 68896686080 | elapsed time per iteration (s): 0.73 | learning rate: 4.538E-05 | global batch size: 256 | lm loss: 1.944069E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.787 | TFLOPs: 21.28 | 31: iteration 131420/ 173500 | consumed samples: 33643520 | consumed tokens: 68901928960 | elapsed time per iteration (s): 0.74 | learning rate: 4.537E-05 | global batch size: 256 | lm loss: 1.933403E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.912 | TFLOPs: 21.05 | 31: iteration 131430/ 173500 | consumed samples: 33646080 | consumed tokens: 68907171840 | elapsed time per iteration (s): 0.79 | learning rate: 4.535E-05 | global batch size: 256 | lm loss: 1.928039E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.871 | TFLOPs: 19.53 | 31: iteration 131440/ 173500 | consumed samples: 33648640 | consumed tokens: 68912414720 | elapsed time per iteration (s): 0.77 | learning rate: 4.534E-05 | global batch size: 256 | lm loss: 1.937914E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.496 | TFLOPs: 19.99 | 31: iteration 131450/ 173500 | consumed samples: 33651200 
| consumed tokens: 68917657600 | elapsed time per iteration (s): 0.75 | learning rate: 4.533E-05 | global batch size: 256 | lm loss: 1.934951E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.144 | TFLOPs: 20.70 | 31: iteration 131460/ 173500 | consumed samples: 33653760 | consumed tokens: 68922900480 | elapsed time per iteration (s): 1.01 | learning rate: 4.532E-05 | global batch size: 256 | lm loss: 1.946435E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 253.636 | TFLOPs: 15.34 | 31: iteration 131470/ 173500 | consumed samples: 33656320 | consumed tokens: 68928143360 | elapsed time per iteration (s): 0.80 | learning rate: 4.531E-05 | global batch size: 256 | lm loss: 1.960533E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.200 | TFLOPs: 19.37 | 31: iteration 131480/ 173500 | consumed samples: 33658880 | consumed tokens: 68933386240 | elapsed time per iteration (s): 0.83 | learning rate: 4.530E-05 | global batch size: 256 | lm loss: 1.937761E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.922 | TFLOPs: 18.69 | 31: iteration 131490/ 173500 | consumed samples: 33661440 | consumed tokens: 68938629120 | elapsed time per iteration (s): 0.81 | learning rate: 4.529E-05 | global batch size: 256 | lm loss: 1.927345E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.311 | TFLOPs: 19.14 | 31: iteration 131500/ 173500 | consumed samples: 33664000 | consumed tokens: 68943872000 | elapsed time per iteration (s): 0.94 | learning rate: 4.527E-05 | global batch size: 256 | lm loss: 1.952791E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 272.965 | TFLOPs: 16.51 | 31: iteration 131510/ 173500 | consumed samples: 33666560 | consumed tokens: 68949114880 | elapsed time per iteration (s): 0.79 | learning rate: 4.526E-05 | global batch size: 256 | lm loss: 1.915040E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.233 | TFLOPs: 19.68 | 31: iteration 131520/ 173500 | consumed samples: 33669120 | consumed tokens: 68954357760 | elapsed time per iteration (s): 0.78 | learning rate: 4.525E-05 | global batch size: 256 | lm loss: 1.947579E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.065 | TFLOPs: 19.97 | 31: iteration 131530/ 173500 | consumed samples: 33671680 | consumed tokens: 68959600640 | elapsed time per iteration (s): 0.81 | learning rate: 4.524E-05 | global batch size: 256 | lm loss: 1.921549E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.123 | TFLOPs: 19.00 | 31: iteration 131540/ 173500 | consumed samples: 33674240 | consumed tokens: 68964843520 | elapsed time per iteration (s): 0.84 | learning rate: 4.523E-05 | global batch size: 256 | lm loss: 1.937245E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.885 | TFLOPs: 18.51 | 31: iteration 131550/ 173500 | consumed samples: 33676800 | consumed tokens: 68970086400 | elapsed time per iteration (s): 0.79 | learning rate: 4.522E-05 | global batch size: 256 | lm loss: 1.988553E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.503 | TFLOPs: 19.69 | 31: iteration 131560/ 173500 | consumed samples: 33679360 | consumed tokens: 68975329280 | elapsed time per iteration (s): 0.81 | learning rate: 4.521E-05 | global batch size: 
256 | lm loss: 1.910291E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.200 | TFLOPs: 19.01 | 31: iteration 131570/ 173500 | consumed samples: 33681920 | consumed tokens: 68980572160 | elapsed time per iteration (s): 0.81 | learning rate: 4.519E-05 | global batch size: 256 | lm loss: 1.928235E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.384 | TFLOPs: 19.08 | 31: iteration 131580/ 173500 | consumed samples: 33684480 | consumed tokens: 68985815040 | elapsed time per iteration (s): 0.82 | learning rate: 4.518E-05 | global batch size: 256 | lm loss: 1.957461E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.939 | TFLOPs: 18.93 | 31: iteration 131590/ 173500 | consumed samples: 33687040 | consumed tokens: 68991057920 | elapsed time per iteration (s): 0.81 | learning rate: 4.517E-05 | global batch size: 256 | lm loss: 1.949579E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.205 | TFLOPs: 19.01 | 31: iteration 131600/ 173500 | consumed samples: 33689600 | consumed tokens: 68996300800 | elapsed time per iteration (s): 0.79 | learning rate: 4.516E-05 | global batch size: 256 | lm loss: 1.919261E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.858 | TFLOPs: 19.65 | 31: iteration 131610/ 173500 | consumed samples: 33692160 | consumed tokens: 69001543680 | elapsed time per iteration (s): 0.81 | learning rate: 4.515E-05 | global batch size: 256 | lm loss: 1.932203E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.633 | TFLOPs: 19.09 | 31: iteration 131620/ 173500 | consumed samples: 33694720 | 
consumed tokens: 69006786560 | elapsed time per iteration (s): 0.81 | learning rate: 4.514E-05 | global batch size: 256 | lm loss: 1.923612E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.766 | TFLOPs: 19.16 | 31: iteration 131630/ 173500 | consumed samples: 33697280 | consumed tokens: 69012029440 | elapsed time per iteration (s): 0.79 | learning rate: 4.513E-05 | global batch size: 256 | lm loss: 1.930724E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.476 | TFLOPs: 19.51 | 31: iteration 131640/ 173500 | consumed samples: 33699840 | consumed tokens: 69017272320 | elapsed time per iteration (s): 0.90 | learning rate: 4.511E-05 | global batch size: 256 | lm loss: 1.932674E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.774 | TFLOPs: 17.29 | 31: iteration 131650/ 173500 | consumed samples: 33702400 | consumed tokens: 69022515200 | elapsed time per iteration (s): 0.87 | learning rate: 4.510E-05 | global batch size: 256 | lm loss: 1.940129E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.586 | TFLOPs: 17.76 | 31: iteration 131660/ 173500 | consumed samples: 33704960 | consumed tokens: 69027758080 | elapsed time per iteration (s): 0.80 | learning rate: 4.509E-05 | global batch size: 256 | lm loss: 1.944326E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.323 | TFLOPs: 19.32 | 31: iteration 131670/ 173500 | consumed samples: 33707520 | consumed tokens: 69033000960 | elapsed time per iteration (s): 0.92 | learning rate: 4.508E-05 | global batch size: 256 | lm loss: 1.927207E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 277.058 | TFLOPs: 16.76 | 31: iteration 131680/ 173500 | consumed samples: 33710080 | consumed tokens: 69038243840 | elapsed time per iteration (s): 0.82 | learning rate: 4.507E-05 | global batch size: 256 | lm loss: 1.945523E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.969 | TFLOPs: 18.87 | 31: iteration 131690/ 173500 | consumed samples: 33712640 | consumed tokens: 69043486720 | elapsed time per iteration (s): 1.02 | learning rate: 4.506E-05 | global batch size: 256 | lm loss: 1.943460E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 252.182 | TFLOPs: 15.26 | 31: iteration 131700/ 173500 | consumed samples: 33715200 | consumed tokens: 69048729600 | elapsed time per iteration (s): 0.81 | learning rate: 4.505E-05 | global batch size: 256 | lm loss: 1.925795E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.020 | TFLOPs: 19.12 | 31: iteration 131710/ 173500 | consumed samples: 33717760 | consumed tokens: 69053972480 | elapsed time per iteration (s): 0.83 | learning rate: 4.504E-05 | global batch size: 256 | lm loss: 1.947479E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.508 | TFLOPs: 18.66 | 31: iteration 131720/ 173500 | consumed samples: 33720320 | consumed tokens: 69059215360 | elapsed time per iteration (s): 0.82 | learning rate: 4.502E-05 | global batch size: 256 | lm loss: 1.931337E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.461 | TFLOPs: 18.90 | 31: iteration 131730/ 173500 | consumed samples: 33722880 | consumed tokens: 69064458240 | elapsed time per iteration (s): 0.79 | learning rate: 4.501E-05 | global batch size: 
256 | lm loss: 1.931572E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.167 | TFLOPs: 19.67 | 31: iteration 131740/ 173500 | consumed samples: 33725440 | consumed tokens: 69069701120 | elapsed time per iteration (s): 0.80 | learning rate: 4.500E-05 | global batch size: 256 | lm loss: 1.931980E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.954 | TFLOPs: 19.36 | 31: iteration 131750/ 173500 | consumed samples: 33728000 | consumed tokens: 69074944000 | elapsed time per iteration (s): 0.78 | learning rate: 4.499E-05 | global batch size: 256 | lm loss: 1.924506E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.387 | TFLOPs: 19.93 | 31: iteration 131760/ 173500 | consumed samples: 33730560 | consumed tokens: 69080186880 | elapsed time per iteration (s): 0.82 | learning rate: 4.498E-05 | global batch size: 256 | lm loss: 1.952668E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.953 | TFLOPs: 18.99 | 31: iteration 131770/ 173500 | consumed samples: 33733120 | consumed tokens: 69085429760 | elapsed time per iteration (s): 0.84 | learning rate: 4.497E-05 | global batch size: 256 | lm loss: 1.943674E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.080 | TFLOPs: 18.40 | 31: iteration 131780/ 173500 | consumed samples: 33735680 | consumed tokens: 69090672640 | elapsed time per iteration (s): 0.79 | learning rate: 4.496E-05 | global batch size: 256 | lm loss: 1.923884E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.384 | TFLOPs: 19.68 | 31: iteration 131790/ 173500 | consumed samples: 33738240 | 
consumed tokens: 69095915520 | elapsed time per iteration (s): 0.83 | learning rate: 4.494E-05 | global batch size: 256 | lm loss: 1.917879E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.962 | TFLOPs: 18.57 | 31: iteration 131800/ 173500 | consumed samples: 33740800 | consumed tokens: 69101158400 | elapsed time per iteration (s): 0.83 | learning rate: 4.493E-05 | global batch size: 256 | lm loss: 1.957412E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.537 | TFLOPs: 18.61 | 31: iteration 131810/ 173500 | consumed samples: 33743360 | consumed tokens: 69106401280 | elapsed time per iteration (s): 0.82 | learning rate: 4.492E-05 | global batch size: 256 | lm loss: 1.953678E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.887 | TFLOPs: 18.87 | 31: iteration 131820/ 173500 | consumed samples: 33745920 | consumed tokens: 69111644160 | elapsed time per iteration (s): 0.76 | learning rate: 4.491E-05 | global batch size: 256 | lm loss: 1.940110E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.444 | TFLOPs: 20.41 | 31: iteration 131830/ 173500 | consumed samples: 33748480 | consumed tokens: 69116887040 | elapsed time per iteration (s): 0.76 | learning rate: 4.490E-05 | global batch size: 256 | lm loss: 1.939519E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.622 | TFLOPs: 20.43 | 31: iteration 131840/ 173500 | consumed samples: 33751040 | consumed tokens: 69122129920 | elapsed time per iteration (s): 0.76 | learning rate: 4.489E-05 | global batch size: 256 | lm loss: 1.920980E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 335.921 | TFLOPs: 20.32 | 31: iteration 131850/ 173500 | consumed samples: 33753600 | consumed tokens: 69127372800 | elapsed time per iteration (s): 0.74 | learning rate: 4.488E-05 | global batch size: 256 | lm loss: 1.933063E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.245 | TFLOPs: 20.89 | 31: iteration 131860/ 173500 | consumed samples: 33756160 | consumed tokens: 69132615680 | elapsed time per iteration (s): 0.79 | learning rate: 4.486E-05 | global batch size: 256 | lm loss: 1.932462E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.357 | TFLOPs: 19.68 | 31: iteration 131870/ 173500 | consumed samples: 33758720 | consumed tokens: 69137858560 | elapsed time per iteration (s): 0.79 | learning rate: 4.485E-05 | global batch size: 256 | lm loss: 1.935337E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.702 | TFLOPs: 19.58 | 31: iteration 131880/ 173500 | consumed samples: 33761280 | consumed tokens: 69143101440 | elapsed time per iteration (s): 0.77 | learning rate: 4.484E-05 | global batch size: 256 | lm loss: 1.938727E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.928 | TFLOPs: 20.02 | 31: iteration 131890/ 173500 | consumed samples: 33763840 | consumed tokens: 69148344320 | elapsed time per iteration (s): 0.75 | learning rate: 4.483E-05 | global batch size: 256 | lm loss: 1.913450E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.714 | TFLOPs: 20.67 | 31: iteration 131900/ 173500 | consumed samples: 33766400 | consumed tokens: 69153587200 | elapsed time per iteration (s): 0.73 | learning rate: 4.482E-05 | global batch size: 
256 | lm loss: 1.932311E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.880 | TFLOPs: 21.11 | 31: iteration 131910/ 173500 | consumed samples: 33768960 | consumed tokens: 69158830080 | elapsed time per iteration (s): 0.76 | learning rate: 4.481E-05 | global batch size: 256 | lm loss: 1.944811E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.897 | TFLOPs: 20.38 | 31: iteration 131920/ 173500 | consumed samples: 33771520 | consumed tokens: 69164072960 | elapsed time per iteration (s): 0.77 | learning rate: 4.480E-05 | global batch size: 256 | lm loss: 1.915202E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.116 | TFLOPs: 20.03 | 31: iteration 131930/ 173500 | consumed samples: 33774080 | consumed tokens: 69169315840 | elapsed time per iteration (s): 0.71 | learning rate: 4.478E-05 | global batch size: 256 | lm loss: 1.970052E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 362.159 | TFLOPs: 21.91 | 31: iteration 131940/ 173500 | consumed samples: 33776640 | consumed tokens: 69174558720 | elapsed time per iteration (s): 0.76 | learning rate: 4.477E-05 | global batch size: 256 | lm loss: 1.923120E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.026 | TFLOPs: 20.39 | 31: iteration 131950/ 173500 | consumed samples: 33779200 | consumed tokens: 69179801600 | elapsed time per iteration (s): 0.73 | learning rate: 4.476E-05 | global batch size: 256 | lm loss: 1.939763E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.372 | TFLOPs: 21.32 | 31: iteration 131960/ 173500 | consumed samples: 33781760 | 
consumed tokens: 69185044480 | elapsed time per iteration (s): 0.72 | learning rate: 4.475E-05 | global batch size: 256 | lm loss: 1.887036E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.723 | TFLOPs: 21.58 | 31: iteration 131970/ 173500 | consumed samples: 33784320 | consumed tokens: 69190287360 | elapsed time per iteration (s): 0.73 | learning rate: 4.474E-05 | global batch size: 256 | lm loss: 1.937900E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.692 | TFLOPs: 21.22 | 31: iteration 131980/ 173500 | consumed samples: 33786880 | consumed tokens: 69195530240 | elapsed time per iteration (s): 0.76 | learning rate: 4.473E-05 | global batch size: 256 | lm loss: 1.931018E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.441 | TFLOPs: 20.47 | 31: iteration 131990/ 173500 | consumed samples: 33789440 | consumed tokens: 69200773120 | elapsed time per iteration (s): 0.74 | learning rate: 4.472E-05 | global batch size: 256 | lm loss: 1.944267E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.650 | TFLOPs: 20.85 | 0: [2022-11-26 23:49:29,097] [INFO] [logging.py:68:log_dist] [Rank 0] step=132000, skipped=0, lr=[4.4705599266134565e-05, 4.4705599266134565e-05, 4.4705599266134565e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 132000/ 173500 | consumed samples: 33792000 | consumed tokens: 69206016000 | elapsed time per iteration (s): 0.72 | learning rate: 4.471E-05 | global batch size: 256 | lm loss: 1.936622E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.887 | TFLOPs: 21.47 | 0: steps: 132000 loss: 1.9424 iter time (s): 0.794 samples/sec: 322.250 31: 
-------------------------------------------------------------------------------------------- 31: valid loss at iteration 132000 | lm loss value: 1.887680E+00 | lm loss PPL: 6.604028E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 132000 to checkpoints_1b1long 0: [2022-11-26 23:49:29,347] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step132000 is begin to save! 0: [2022-11-26 23:49:29,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_01-model_00-model_states.pt... 0: [2022-11-26 23:49:29,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_01-model_00-model_states.pt. 0: [2022-11-26 23:49:29,572] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_03-model_00-model_states.pt... 0: [2022-11-26 23:49:29,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_03-model_00-model_states.pt. 0: [2022-11-26 23:49:29,650] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_04-model_00-model_states.pt... 0: [2022-11-26 23:49:29,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_04-model_00-model_states.pt. 0: [2022-11-26 23:49:29,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_05-model_00-model_states.pt... 0: [2022-11-26 23:49:29,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_05-model_00-model_states.pt. 0: [2022-11-26 23:49:29,806] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_06-model_00-model_states.pt... 
0: [2022-11-26 23:49:29,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_06-model_00-model_states.pt. 0: [2022-11-26 23:49:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_07-model_00-model_states.pt... 0: [2022-11-26 23:49:29,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_07-model_00-model_states.pt. 0: [2022-11-26 23:49:29,957] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_08-model_00-model_states.pt... 0: [2022-11-26 23:49:30,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_08-model_00-model_states.pt. 0: [2022-11-26 23:49:30,047] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_09-model_00-model_states.pt... 0: [2022-11-26 23:49:30,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_09-model_00-model_states.pt. 0: [2022-11-26 23:49:30,121] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_10-model_00-model_states.pt... 0: [2022-11-26 23:49:30,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_10-model_00-model_states.pt. 0: [2022-11-26 23:49:30,199] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_11-model_00-model_states.pt... 0: [2022-11-26 23:49:30,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_11-model_00-model_states.pt. 0: [2022-11-26 23:49:30,273] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_12-model_00-model_states.pt... 
0: [2022-11-26 23:49:30,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_12-model_00-model_states.pt. 0: [2022-11-26 23:49:30,351] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_13-model_00-model_states.pt... 0: [2022-11-26 23:49:30,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_13-model_00-model_states.pt. 0: [2022-11-26 23:49:30,430] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_14-model_00-model_states.pt... 0: [2022-11-26 23:49:30,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_14-model_00-model_states.pt. 0: [2022-11-26 23:49:30,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_15-model_00-model_states.pt... 0: [2022-11-26 23:49:30,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_15-model_00-model_states.pt. 0: [2022-11-26 23:49:30,586] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_16-model_00-model_states.pt... 0: [2022-11-26 23:49:30,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_16-model_00-model_states.pt. 0: [2022-11-26 23:49:30,662] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_17-model_00-model_states.pt... 0: [2022-11-26 23:49:30,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_17-model_00-model_states.pt. 0: [2022-11-26 23:49:30,741] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_18-model_00-model_states.pt... 
0: [2022-11-26 23:49:30,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_18-model_00-model_states.pt. 0: [2022-11-26 23:49:30,822] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_19-model_00-model_states.pt... 0: [2022-11-26 23:49:30,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_19-model_00-model_states.pt. 0: [2022-11-26 23:49:30,898] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_20-model_00-model_states.pt... 0: [2022-11-26 23:49:30,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_20-model_00-model_states.pt. 0: [2022-11-26 23:49:30,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_21-model_00-model_states.pt... 0: [2022-11-26 23:49:31,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_21-model_00-model_states.pt. 0: [2022-11-26 23:49:31,051] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_22-model_00-model_states.pt... 0: [2022-11-26 23:49:31,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_22-model_00-model_states.pt. 0: [2022-11-26 23:49:31,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_23-model_00-model_states.pt... 0: [2022-11-26 23:49:31,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_23-model_00-model_states.pt. 0: [2022-11-26 23:49:31,206] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_24-model_00-model_states.pt... 
0: [2022-11-26 23:49:31,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_24-model_00-model_states.pt. 0: [2022-11-26 23:49:31,284] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_25-model_00-model_states.pt... 0: [2022-11-26 23:49:31,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_25-model_00-model_states.pt. 0: [2022-11-26 23:49:31,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_26-model_00-model_states.pt... 0: [2022-11-26 23:49:31,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_26-model_00-model_states.pt. 0: [2022-11-26 23:49:31,441] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_27-model_00-model_states.pt... 0: [2022-11-26 23:49:31,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_27-model_00-model_states.pt. 0: [2022-11-26 23:49:31,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_28-model_00-model_states.pt... 0: [2022-11-26 23:49:31,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_28-model_00-model_states.pt. 0: [2022-11-26 23:49:31,591] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/layer_30-model_00-model_states.pt... 0: [2022-11-26 23:49:31,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/layer_30-model_00-model_states.pt. 
0: [2022-11-26 23:49:31,594] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step132000/mp_rank_00_model_states.pt 0: [2022-11-26 23:49:31,594] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/mp_rank_00_model_states.pt... 0: [2022-11-26 23:49:31,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/mp_rank_00_model_states.pt. 0: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
4: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
13: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 
25: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 
28: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 
22: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 
21: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 
19: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 
6: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
9: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
1: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
3: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 
11: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 
31: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 
17: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 
27: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 
8: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
13: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
28: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 
18: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 
8: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 
23: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 17: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 27: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 
0: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 23: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 
27: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 16: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 24: [2022-11-26 23:49:31,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:49:31,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:49:31,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:49:31,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 23:49:31,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 
17: [2022-11-26 23:49:31,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-26 23:49:31,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 3: [2022-11-26 23:49:31,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:49:31,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 23:49:31,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 31: [2022-11-26 23:49:31,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-26 23:49:31,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-26 23:49:31,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 24: [2022-11-26 23:49:31,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:49:31,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
24: [2022-11-26 23:49:31,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 6: [2022-11-26 23:49:31,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 24: [2022-11-26 23:49:31,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 6: [2022-11-26 23:49:31,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 11: [2022-11-26 23:49:31,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:49:31,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 23:49:31,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 14: [2022-11-26 23:49:31,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:49:31,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 23:49:31,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 6: [2022-11-26 23:49:31,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 23:49:31,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 23:49:31,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 18: [2022-11-26 23:49:31,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:49:31,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-26 23:49:31,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 15: [2022-11-26 23:49:31,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:49:31,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 23:49:31,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:49:31,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 1: [2022-11-26 23:49:31,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:49:31,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 23:49:31,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 
11: [2022-11-26 23:49:31,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:49:31,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 23:49:31,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 1: [2022-11-26 23:49:31,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 23:49:31,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 3: [2022-11-26 23:49:31,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:49:31,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 23:49:31,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 24: [2022-11-26 23:49:31,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:49:31,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-26 23:49:31,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 1: [2022-11-26 23:49:31,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 23:49:31,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 23:49:31,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 4: [2022-11-26 23:49:31,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:49:31,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:49:31,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 23:49:31,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 23:49:31,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 4: [2022-11-26 23:49:31,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 7: [2022-11-26 23:49:31,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:49:31,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 29: [2022-11-26 23:49:31,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:49:31,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 
29: [2022-11-26 23:49:31,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-26 23:49:31,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 17: [2022-11-26 23:49:31,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:49:31,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-26 23:49:31,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 28: [2022-11-26 23:49:31,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:49:31,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:49:31,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-26 23:49:31,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 23: [2022-11-26 23:49:31,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-26 23:49:31,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-26 23:49:31,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 
13: [2022-11-26 23:49:31,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:49:31,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 23:49:31,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 20: [2022-11-26 23:49:31,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-26 23:49:31,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 0: [2022-11-26 23:49:31,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 20: [2022-11-26 23:49:31,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 0: [2022-11-26 23:49:31,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 20: [2022-11-26 23:49:31,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:49:31,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 20: [2022-11-26 23:49:31,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-26 23:49:31,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 
0: [2022-11-26 23:49:31,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:49:31,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:49:31,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 0: [2022-11-26 23:49:31,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 8: [2022-11-26 23:49:31,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 0: [2022-11-26 23:49:31,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 1: [2022-11-26 23:49:31,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:49:31,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:49:31,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
1: [2022-11-26 23:49:31,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 0: [2022-11-26 23:49:31,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 6: [2022-11-26 23:49:31,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 23:49:31,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 1: [2022-11-26 23:49:31,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 0: [2022-11-26 23:49:31,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 4: [2022-11-26 23:49:31,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:49:31,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 21: [2022-11-26 23:49:31,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:49:31,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 4: [2022-11-26 23:49:31,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 21: [2022-11-26 23:49:31,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 
18: [2022-11-26 23:49:31,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:49:31,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 19: [2022-11-26 23:49:31,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:49:31,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 19: [2022-11-26 23:49:31,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-26 23:49:31,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 20: [2022-11-26 23:49:31,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-26 23:49:31,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-26 23:49:31,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 8: [2022-11-26 23:49:31,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 23:49:31,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 11: [2022-11-26 23:49:31,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:49:31,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 11: [2022-11-26 23:49:31,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 23:49:31,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 0: [2022-11-26 23:49:31,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:49:31,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:49:31,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 23:49:31,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 28: [2022-11-26 23:49:31,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-26 23:49:31,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 28: [2022-11-26 23:49:31,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
28: [2022-11-26 23:49:31,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-26 23:49:31,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 28: [2022-11-26 23:49:31,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-26 23:49:31,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-26 23:49:31,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 16: [2022-11-26 23:49:31,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:49:31,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-26 23:49:31,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 10: [2022-11-26 23:49:31,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:49:31,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 23:49:31,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 23: [2022-11-26 23:49:31,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
19: [2022-11-26 23:49:31,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 23: [2022-11-26 23:49:31,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 19: [2022-11-26 23:49:31,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 23: [2022-11-26 23:49:31,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 19: [2022-11-26 23:49:31,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 24: [2022-11-26 23:49:31,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:49:31,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-26 23:49:31,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 14: [2022-11-26 23:49:31,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:49:31,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 15: [2022-11-26 23:49:31,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:49:31,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 
15: [2022-11-26 23:49:31,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 23:49:31,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 16: [2022-11-26 23:49:31,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:49:31,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-26 23:49:31,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 28: [2022-11-26 23:49:31,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:49:31,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:49:31,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 11: [2022-11-26 23:49:31,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:49:31,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 11: [2022-11-26 23:49:31,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 23:49:31,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 
7: [2022-11-26 23:49:31,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:49:31,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 23:49:31,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 17: [2022-11-26 23:49:31,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:49:31,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-26 23:49:31,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 17: [2022-11-26 23:49:31,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:49:31,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-26 23:49:31,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 13: [2022-11-26 23:49:31,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:49:31,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 23:49:31,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 
2: [2022-11-26 23:49:31,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:49:31,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:49:31,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 18: [2022-11-26 23:49:31,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 2: [2022-11-26 23:49:31,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 18: [2022-11-26 23:49:31,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 31: [2022-11-26 23:49:31,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 28: [2022-11-26 23:49:31,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-26 23:49:31,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 31: [2022-11-26 23:49:31,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-26 23:49:31,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 31: [2022-11-26 23:49:31,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-26 23:49:31,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-26 23:49:31,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 10: [2022-11-26 23:49:31,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:49:31,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 23:49:31,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 31: [2022-11-26 23:49:31,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-26 23:49:31,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-26 23:49:31,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 20: [2022-11-26 23:49:31,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-26 23:49:31,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-26 23:49:31,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 6: [2022-11-26 23:49:31,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 23:49:31,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 4: [2022-11-26 23:49:31,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:49:31,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:49:31,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:49:31,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 5: [2022-11-26 23:49:31,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 23:49:31,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 4: [2022-11-26 23:49:31,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 5: [2022-11-26 23:49:31,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 5: [2022-11-26 23:49:31,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 4: [2022-11-26 23:49:31,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 19: [2022-11-26 23:49:31,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
13: [2022-11-26 23:49:31,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:49:31,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 18: [2022-11-26 23:49:31,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:49:31,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 18: [2022-11-26 23:49:31,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 19: [2022-11-26 23:49:31,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 13: [2022-11-26 23:49:31,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 18: [2022-11-26 23:49:31,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 21: [2022-11-26 23:49:31,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:49:31,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-26 23:49:31,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 15: [2022-11-26 23:49:31,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 23:49:31,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 23:49:31,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 8: [2022-11-26 23:49:31,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:49:31,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 23:49:31,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 16: [2022-11-26 23:49:31,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:49:31,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-26 23:49:31,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 14: [2022-11-26 23:49:31,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:49:31,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 23:49:31,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 1: [2022-11-26 23:49:31,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 23:49:31,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 23:49:31,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 21: [2022-11-26 23:49:31,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:49:31,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-26 23:49:31,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 24: [2022-11-26 23:49:31,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:49:31,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:49:31,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:49:31,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:49:31,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
22: [2022-11-26 23:49:31,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-26 23:49:31,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-26 23:49:31,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 22: [2022-11-26 23:49:31,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 5: [2022-11-26 23:49:31,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:49:31,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 22: [2022-11-26 23:49:31,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 5: [2022-11-26 23:49:31,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 22: [2022-11-26 23:49:31,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 5: [2022-11-26 23:49:31,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 22: [2022-11-26 23:49:31,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 22: [2022-11-26 23:49:31,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 
24: [2022-11-26 23:49:31,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 29: [2022-11-26 23:49:31,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:49:31,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-26 23:49:31,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 2: [2022-11-26 23:49:31,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:49:31,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 23:49:31,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:49:31,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:49:31,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 2: [2022-11-26 23:49:31,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 23:49:31,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 23:49:31,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 
2: [2022-11-26 23:49:31,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 14: [2022-11-26 23:49:31,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:49:31,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 23:49:31,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 25: [2022-11-26 23:49:31,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:49:31,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:49:31,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:49:31,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
25: [2022-11-26 23:49:31,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-26 23:49:31,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-26 23:49:31,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-26 23:49:31,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-26 23:49:31,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 25: [2022-11-26 23:49:31,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 7: [2022-11-26 23:49:31,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:49:31,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 7: [2022-11-26 23:49:31,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 25: [2022-11-26 23:49:31,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 7: [2022-11-26 23:49:31,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 8: [2022-11-26 23:49:31,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 23:49:31,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 23:49:31,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 3: [2022-11-26 23:49:31,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:49:31,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 23:49:31,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 20: [2022-11-26 23:49:31,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-26 23:49:31,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-26 23:49:31,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 7: [2022-11-26 23:49:31,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:49:31,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 23:49:31,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 23: [2022-11-26 23:49:31,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
23: [2022-11-26 23:49:31,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-26 23:49:31,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 0: [2022-11-26 23:49:31,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:49:31,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 23:49:31,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 4: [2022-11-26 23:49:31,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:49:31,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 23:49:31,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 6: [2022-11-26 23:49:31,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:49:31,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 23:49:31,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 31: [2022-11-26 23:49:31,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
31: [2022-11-26 23:49:31,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-26 23:49:31,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 1: [2022-11-26 23:49:31,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:49:31,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 23:49:31,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 23: [2022-11-26 23:49:31,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-26 23:49:31,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-26 23:49:31,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 30: [2022-11-26 23:49:31,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-26 23:49:31,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-26 23:49:31,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 23: [2022-11-26 23:49:31,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-26 23:49:31,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-26 23:49:31,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 29: [2022-11-26 23:49:31,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:49:31,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:49:31,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-26 23:49:31,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-26 23:49:31,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 29: [2022-11-26 23:49:31,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 29: [2022-11-26 23:49:31,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:49:31,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-26 23:49:31,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 17: [2022-11-26 23:49:31,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-26 23:49:31,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-26 23:49:31,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 15: [2022-11-26 23:49:31,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:49:31,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 23:49:31,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 0: [2022-11-26 23:49:31,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 23:49:31,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 25: [2022-11-26 23:49:31,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:49:31,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-26 23:49:31,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 30: [2022-11-26 23:49:31,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-26 23:49:31,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-26 23:49:31,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 8: [2022-11-26 23:49:31,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:49:31,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 23:49:31,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 12: [2022-11-26 23:49:31,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:49:31,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 27: [2022-11-26 23:49:31,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:49:31,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 27: [2022-11-26 23:49:31,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-26 23:49:31,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 24: [2022-11-26 23:49:31,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
24: [2022-11-26 23:49:31,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-26 23:49:31,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 30: [2022-11-26 23:49:31,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-26 23:49:31,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-26 23:49:31,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 26: [2022-11-26 23:49:31,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-26 23:49:31,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-26 23:49:31,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 9: [2022-11-26 23:49:31,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:49:31,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:49:31,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 23:49:31,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 
27: [2022-11-26 23:49:31,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 12: [2022-11-26 23:49:31,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:49:31,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 12: [2022-11-26 23:49:31,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 23:49:31,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 19: [2022-11-26 23:49:31,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:49:31,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-26 23:49:31,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 30: [2022-11-26 23:49:31,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-26 23:49:31,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
30: [2022-11-26 23:49:31,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-26 23:49:31,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-26 23:49:31,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 30: [2022-11-26 23:49:31,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 12: [2022-11-26 23:49:31,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:49:31,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 23:49:31,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 27: [2022-11-26 23:49:31,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:49:31,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-26 23:49:31,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 27: [2022-11-26 23:49:31,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-26 23:49:31,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-26 23:49:31,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 9: [2022-11-26 23:49:31,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:49:31,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 23:49:31,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 12: [2022-11-26 23:49:31,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:49:31,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 23:49:31,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 26: [2022-11-26 23:49:31,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-26 23:49:31,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-26 23:49:31,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 27: [2022-11-26 23:49:31,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
26: [2022-11-26 23:49:31,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:49:31,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 26: [2022-11-26 23:49:31,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 27: [2022-11-26 23:49:31,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 26: [2022-11-26 23:49:31,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 9: [2022-11-26 23:49:31,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:49:31,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 23:49:31,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 9: [2022-11-26 23:49:31,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:49:31,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 23:49:31,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 10: [2022-11-26 23:49:31,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 23:49:31,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 23:49:31,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 28: [2022-11-26 23:49:31,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-26 23:49:31,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 11: [2022-11-26 23:49:31,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:49:31,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 23:49:31,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 9: [2022-11-26 23:49:31,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:49:31,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 23:49:31,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 2: [2022-11-26 23:49:31,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 23:49:31,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 23:49:31,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 28: [2022-11-26 23:49:31,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 18: [2022-11-26 23:49:31,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:49:31,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-26 23:49:31,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 3: [2022-11-26 23:49:31,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:49:31,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 23:49:31,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 12: [2022-11-26 23:49:31,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:49:31,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 23:49:31,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 
21: [2022-11-26 23:49:31,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:49:31,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-26 23:49:31,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 16: [2022-11-26 23:49:31,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:49:31,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-26 23:49:31,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 26: [2022-11-26 23:49:31,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-26 23:49:31,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-26 23:49:31,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 5: [2022-11-26 23:49:31,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:49:31,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 23:49:31,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 
13: [2022-11-26 23:49:31,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:49:31,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 23:49:31,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 7: [2022-11-26 23:49:31,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:49:31,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 23:49:31,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 14: [2022-11-26 23:49:31,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:49:31,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 23:49:31,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 0: [2022-11-26 23:49:31,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:49:31,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 23:49:31,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 
22: [2022-11-26 23:49:31,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:49:31,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-26 23:49:31,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 6: [2022-11-26 23:49:31,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:49:31,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 23:49:31,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 20: [2022-11-26 23:49:31,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-26 23:49:31,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-26 23:49:31,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 23: [2022-11-26 23:49:31,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-26 23:49:31,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-26 23:49:31,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 
25: [2022-11-26 23:49:31,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:49:31,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-26 23:49:31,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 17: [2022-11-26 23:49:31,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:49:31,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-26 23:49:31,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 1: [2022-11-26 23:49:31,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:49:31,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 23:49:31,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 31: [2022-11-26 23:49:31,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-26 23:49:31,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-26 23:49:31,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 
15: [2022-11-26 23:49:31,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:49:31,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 23:49:31,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 8: [2022-11-26 23:49:31,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:49:31,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 23:49:31,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 29: [2022-11-26 23:49:31,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:49:31,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-26 23:49:31,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 4: [2022-11-26 23:49:31,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:49:31,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 23:49:31,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 
27: [2022-11-26 23:49:31,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:49:31,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:49:31,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-26 23:49:31,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 24: [2022-11-26 23:49:31,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-26 23:49:31,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 30: [2022-11-26 23:49:31,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-26 23:49:31,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-26 23:49:31,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 19: [2022-11-26 23:49:31,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:49:31,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-26 23:49:31,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 
3: [2022-11-26 23:49:31,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:49:31,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 23:49:31,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 12: [2022-11-26 23:49:31,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:49:31,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 23:49:31,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 10: [2022-11-26 23:49:31,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:49:31,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 23:49:31,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 28: [2022-11-26 23:49:31,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
28: [2022-11-26 23:49:31,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 13: [2022-11-26 23:49:31,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:49:31,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 23:49:31,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 5: [2022-11-26 23:49:31,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:49:31,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 23:49:31,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 16: [2022-11-26 23:49:31,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:49:31,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 22: [2022-11-26 23:49:31,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:49:31,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 
22: [2022-11-26 23:49:31,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-26 23:49:31,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 0: [2022-11-26 23:49:31,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:49:31,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 23:49:31,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 7: [2022-11-26 23:49:31,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:49:31,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 23:49:31,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 28: [2022-11-26 23:49:31,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 14: [2022-11-26 23:49:31,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:49:31,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 23:49:31,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 
9: [2022-11-26 23:49:31,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:49:31,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 23:49:31,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 26: [2022-11-26 23:49:31,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-26 23:49:31,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-26 23:49:31,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 11: [2022-11-26 23:49:31,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:49:31,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 23:49:31,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 6: [2022-11-26 23:49:31,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 23:49:31,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 2: [2022-11-26 23:49:31,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:49:31,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 2: [2022-11-26 23:49:31,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 31: [2022-11-26 23:49:31,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:49:31,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 31: [2022-11-26 23:49:31,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-26 23:49:31,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 17: [2022-11-26 23:49:31,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-26 23:49:31,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-26 23:49:31,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 18: [2022-11-26 23:49:31,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
18: [2022-11-26 23:49:31,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-26 23:49:31,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 25: [2022-11-26 23:49:31,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:49:31,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-26 23:49:31,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 4: [2022-11-26 23:49:31,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:49:31,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 23:49:31,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 8: [2022-11-26 23:49:31,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:49:31,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 23:49:31,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 29: [2022-11-26 23:49:31,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
29: [2022-11-26 23:49:31,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-26 23:49:31,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 15: [2022-11-26 23:49:31,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:49:31,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 23:49:31,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 20: [2022-11-26 23:49:31,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-26 23:49:31,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-26 23:49:31,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 30: [2022-11-26 23:49:31,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-26 23:49:31,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-26 23:49:31,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 23: [2022-11-26 23:49:31,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-26 23:49:31,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-26 23:49:31,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 1: [2022-11-26 23:49:31,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:49:31,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 23:49:31,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 24: [2022-11-26 23:49:31,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:49:31,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-26 23:49:31,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 27: [2022-11-26 23:49:31,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:49:31,885] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-26 23:49:31,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 12: [2022-11-26 23:49:31,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 23:49:31,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 23:49:31,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 28: [2022-11-26 23:49:31,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:49:31,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:49:31,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-26 23:49:31,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 19: [2022-11-26 23:49:31,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:49:31,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-26 23:49:31,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 28: [2022-11-26 23:49:31,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-26 23:49:31,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 0: [2022-11-26 23:49:31,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-26 23:49:31,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 23:49:31,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 3: [2022-11-26 23:49:31,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:49:31,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:49:31,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 23:49:31,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 13: [2022-11-26 23:49:31,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 23:49:31,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 22: [2022-11-26 23:49:31,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:49:31,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-26 23:49:31,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 7: [2022-11-26 23:49:31,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 23:49:31,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 23:49:31,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 27: [2022-11-26 23:49:31,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-26 23:49:31,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-26 23:49:31,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 21: [2022-11-26 23:49:31,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 31: [2022-11-26 23:49:31,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:49:31,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 31: [2022-11-26 23:49:31,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 21: [2022-11-26 23:49:31,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 31: [2022-11-26 23:49:31,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 21: [2022-11-26 23:49:31,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-26 23:49:31,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-26 23:49:31,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 10: [2022-11-26 23:49:31,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:49:31,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 23:49:31,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 11: [2022-11-26 23:49:31,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:49:31,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 23:49:31,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 15: [2022-11-26 23:49:31,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:49:31,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 23:49:31,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 20: [2022-11-26 23:49:31,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
20: [2022-11-26 23:49:31,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-26 23:49:31,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 26: [2022-11-26 23:49:31,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-26 23:49:31,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-26 23:49:31,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 6: [2022-11-26 23:49:31,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:49:31,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 23:49:31,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 2: [2022-11-26 23:49:31,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:49:31,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
2: [2022-11-26 23:49:31,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 5: [2022-11-26 23:49:31,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 23:49:31,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 2: [2022-11-26 23:49:31,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 4: [2022-11-26 23:49:31,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:49:31,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:49:31,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 23:49:31,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 10: [2022-11-26 23:49:31,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 24: [2022-11-26 23:49:31,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-26 23:49:31,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 
10: [2022-11-26 23:49:31,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 23:49:31,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 25: [2022-11-26 23:49:31,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 29: [2022-11-26 23:49:31,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 25: [2022-11-26 23:49:31,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 29: [2022-11-26 23:49:31,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-26 23:49:31,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 25: [2022-11-26 23:49:31,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 14: [2022-11-26 23:49:31,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:49:31,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 23:49:31,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 16: [2022-11-26 23:49:31,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
16: [2022-11-26 23:49:31,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-26 23:49:31,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 8: [2022-11-26 23:49:31,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:49:31,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 23:49:31,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 12: [2022-11-26 23:49:31,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:49:31,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 23:49:31,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 18: [2022-11-26 23:49:31,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-26 23:49:31,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-26 23:49:31,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 23: [2022-11-26 23:49:31,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
23: [2022-11-26 23:49:31,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-26 23:49:31,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 28: [2022-11-26 23:49:31,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-26 23:49:31,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-26 23:49:31,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 3: [2022-11-26 23:49:31,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:49:31,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 23:49:31,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 13: [2022-11-26 23:49:31,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:49:31,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:49:31,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
13: [2022-11-26 23:49:31,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 23:49:31,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 11: [2022-11-26 23:49:31,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 9: [2022-11-26 23:49:31,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 11: [2022-11-26 23:49:31,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 9: [2022-11-26 23:49:31,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 30: [2022-11-26 23:49:31,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-26 23:49:31,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-26 23:49:31,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 1: [2022-11-26 23:49:31,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:49:31,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 23:49:31,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 
7: [2022-11-26 23:49:31,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:49:31,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 23:49:31,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 26: [2022-11-26 23:49:31,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-26 23:49:31,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-26 23:49:31,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 22: [2022-11-26 23:49:31,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-26 23:49:31,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-26 23:49:31,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 21: [2022-11-26 23:49:31,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-26 23:49:31,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 19: [2022-11-26 23:49:31,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:49:31,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 21: [2022-11-26 23:49:31,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 19: [2022-11-26 23:49:31,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 10: [2022-11-26 23:49:31,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 19: [2022-11-26 23:49:31,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 10: [2022-11-26 23:49:31,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 5: [2022-11-26 23:49:31,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:49:31,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 23:49:31,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 5: [2022-11-26 23:49:31,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-26 23:49:31,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 23:49:31,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 14: [2022-11-26 23:49:31,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:49:31,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 23:49:31,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 16: [2022-11-26 23:49:31,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:49:31,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-26 23:49:31,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 16: [2022-11-26 23:49:31,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-26 23:49:31,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-26 23:49:31,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 26: [2022-11-26 23:49:31,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-26 23:49:31,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-26 23:49:31,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 19: [2022-11-26 23:49:31,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:49:31,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 2: [2022-11-26 23:49:31,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 19: [2022-11-26 23:49:31,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 2: [2022-11-26 23:49:31,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 23:49:31,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 9: [2022-11-26 23:49:31,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:49:31,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 23:49:31,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 17: [2022-11-26 23:49:31,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 
17: [2022-11-26 23:49:31,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step132000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-26 23:49:31,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step132000 is ready now! 0: successfully saved checkpoint at iteration 132000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2579.50 31: iteration 132010/ 173500 | consumed samples: 33794560 | consumed tokens: 69211258880 | elapsed time per iteration (s): 1.09 | learning rate: 4.469E-05 | global batch size: 256 | lm loss: 1.936848E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.208 | TFLOPs: 14.23 | 31: iteration 132020/ 173500 | consumed samples: 33797120 | consumed tokens: 69216501760 | elapsed time per iteration (s): 0.81 | learning rate: 4.468E-05 | global batch size: 256 | lm loss: 1.941926E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.394 | TFLOPs: 19.14 | 31: iteration 132030/ 173500 | consumed samples: 33799680 | consumed tokens: 69221744640 | elapsed time per iteration (s): 0.82 | learning rate: 4.467E-05 | global batch size: 256 | lm loss: 1.974006E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.058 | TFLOPs: 19.00 | 31: iteration 132040/ 173500 | consumed samples: 33802240 | consumed tokens: 69226987520 | elapsed time per iteration (s): 0.79 | learning rate: 4.466E-05 | global batch size: 256 | lm loss: 1.932316E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.036 | TFLOPs: 19.48 | 31: iteration 132050/ 173500 | consumed samples: 33804800 | consumed tokens: 69232230400 | elapsed time per iteration (s): 0.78 | learning rate: 4.465E-05 | 
global batch size: 256 | lm loss: 1.954575E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.870 | TFLOPs: 19.77 | 31: iteration 132060/ 173500 | consumed samples: 33807360 | consumed tokens: 69237473280 | elapsed time per iteration (s): 0.78 | learning rate: 4.464E-05 | global batch size: 256 | lm loss: 1.931064E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.439 | TFLOPs: 19.81 | 31: iteration 132070/ 173500 | consumed samples: 33809920 | consumed tokens: 69242716160 | elapsed time per iteration (s): 0.75 | learning rate: 4.463E-05 | global batch size: 256 | lm loss: 1.938147E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.590 | TFLOPs: 20.60 | 31: iteration 132080/ 173500 | consumed samples: 33812480 | consumed tokens: 69247959040 | elapsed time per iteration (s): 0.80 | learning rate: 4.462E-05 | global batch size: 256 | lm loss: 1.929354E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.886 | TFLOPs: 19.35 | 31: iteration 132090/ 173500 | consumed samples: 33815040 | consumed tokens: 69253201920 | elapsed time per iteration (s): 0.87 | learning rate: 4.460E-05 | global batch size: 256 | lm loss: 1.905828E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.736 | TFLOPs: 17.77 | 31: iteration 132100/ 173500 | consumed samples: 33817600 | consumed tokens: 69258444800 | elapsed time per iteration (s): 0.89 | learning rate: 4.459E-05 | global batch size: 256 | lm loss: 1.950814E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.467 | TFLOPs: 17.33 | 31: iteration 132110/ 173500 | consumed 
samples: 33820160 | consumed tokens: 69263687680 | elapsed time per iteration (s): 0.86 | learning rate: 4.458E-05 | global batch size: 256 | lm loss: 1.930648E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.525 | TFLOPs: 18.06 | 31: iteration 132120/ 173500 | consumed samples: 33822720 | consumed tokens: 69268930560 | elapsed time per iteration (s): 0.88 | learning rate: 4.457E-05 | global batch size: 256 | lm loss: 1.929136E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.912 | TFLOPs: 17.66 | 31: iteration 132130/ 173500 | consumed samples: 33825280 | consumed tokens: 69274173440 | elapsed time per iteration (s): 0.88 | learning rate: 4.456E-05 | global batch size: 256 | lm loss: 1.947068E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.988 | TFLOPs: 17.60 | 31: iteration 132140/ 173500 | consumed samples: 33827840 | consumed tokens: 69279416320 | elapsed time per iteration (s): 0.83 | learning rate: 4.455E-05 | global batch size: 256 | lm loss: 1.941792E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.551 | TFLOPs: 18.73 | 31: iteration 132150/ 173500 | consumed samples: 33830400 | consumed tokens: 69284659200 | elapsed time per iteration (s): 0.79 | learning rate: 4.454E-05 | global batch size: 256 | lm loss: 1.957286E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.450 | TFLOPs: 19.63 | 31: iteration 132160/ 173500 | consumed samples: 33832960 | consumed tokens: 69289902080 | elapsed time per iteration (s): 0.82 | learning rate: 4.452E-05 | global batch size: 256 | lm loss: 1.918217E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 312.314 | TFLOPs: 18.89 | 31: iteration 132170/ 173500 | consumed samples: 33835520 | consumed tokens: 69295144960 | elapsed time per iteration (s): 0.84 | learning rate: 4.451E-05 | global batch size: 256 | lm loss: 1.927748E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.582 | TFLOPs: 18.49 | 31: iteration 132180/ 173500 | consumed samples: 33838080 | consumed tokens: 69300387840 | elapsed time per iteration (s): 0.81 | learning rate: 4.450E-05 | global batch size: 256 | lm loss: 1.969951E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.747 | TFLOPs: 19.10 | 31: iteration 132190/ 173500 | consumed samples: 33840640 | consumed tokens: 69305630720 | elapsed time per iteration (s): 0.82 | learning rate: 4.449E-05 | global batch size: 256 | lm loss: 1.932621E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.622 | TFLOPs: 18.97 | 31: iteration 132200/ 173500 | consumed samples: 33843200 | consumed tokens: 69310873600 | elapsed time per iteration (s): 0.80 | learning rate: 4.448E-05 | global batch size: 256 | lm loss: 1.945354E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.716 | TFLOPs: 19.46 | 31: iteration 132210/ 173500 | consumed samples: 33845760 | consumed tokens: 69316116480 | elapsed time per iteration (s): 0.82 | learning rate: 4.447E-05 | global batch size: 256 | lm loss: 1.947816E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.075 | TFLOPs: 18.88 | 31: iteration 132220/ 173500 | consumed samples: 33848320 | consumed tokens: 69321359360 | elapsed time per iteration (s): 0.76 | learning rate: 4.446E-05 | global 
batch size: 256 | lm loss: 1.923953E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.372 | TFLOPs: 20.29 | 31: iteration 132230/ 173500 | consumed samples: 33850880 | consumed tokens: 69326602240 | elapsed time per iteration (s): 0.79 | learning rate: 4.445E-05 | global batch size: 256 | lm loss: 1.944812E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.349 | TFLOPs: 19.56 | 31: iteration 132240/ 173500 | consumed samples: 33853440 | consumed tokens: 69331845120 | elapsed time per iteration (s): 0.77 | learning rate: 4.443E-05 | global batch size: 256 | lm loss: 1.949109E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.721 | TFLOPs: 20.19 | 31: iteration 132250/ 173500 | consumed samples: 33856000 | consumed tokens: 69337088000 | elapsed time per iteration (s): 0.77 | learning rate: 4.442E-05 | global batch size: 256 | lm loss: 1.910245E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.094 | TFLOPs: 20.21 | 31: iteration 132260/ 173500 | consumed samples: 33858560 | consumed tokens: 69342330880 | elapsed time per iteration (s): 0.77 | learning rate: 4.441E-05 | global batch size: 256 | lm loss: 1.954485E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.682 | TFLOPs: 20.13 | 31: iteration 132270/ 173500 | consumed samples: 33861120 | consumed tokens: 69347573760 | elapsed time per iteration (s): 0.77 | learning rate: 4.440E-05 | global batch size: 256 | lm loss: 1.928354E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.997 | TFLOPs: 20.02 | 31: iteration 132280/ 173500 | consumed samples: 
33863680 | consumed tokens: 69352816640 | elapsed time per iteration (s): 0.79 | learning rate: 4.439E-05 | global batch size: 256 | lm loss: 1.923617E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.641 | TFLOPs: 19.70 | 31: iteration 132290/ 173500 | consumed samples: 33866240 | consumed tokens: 69358059520 | elapsed time per iteration (s): 0.76 | learning rate: 4.438E-05 | global batch size: 256 | lm loss: 1.945667E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.677 | TFLOPs: 20.49 | 31: iteration 132300/ 173500 | consumed samples: 33868800 | consumed tokens: 69363302400 | elapsed time per iteration (s): 0.77 | learning rate: 4.437E-05 | global batch size: 256 | lm loss: 1.921851E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.269 | TFLOPs: 20.22 | 31: iteration 132310/ 173500 | consumed samples: 33871360 | consumed tokens: 69368545280 | elapsed time per iteration (s): 1.25 | learning rate: 4.436E-05 | global batch size: 256 | lm loss: 1.941865E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 204.748 | TFLOPs: 12.39 | 31: iteration 132320/ 173500 | consumed samples: 33873920 | consumed tokens: 69373788160 | elapsed time per iteration (s): 0.81 | learning rate: 4.434E-05 | global batch size: 256 | lm loss: 1.926497E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.364 | TFLOPs: 19.14 | 31: iteration 132330/ 173500 | consumed samples: 33876480 | consumed tokens: 69379031040 | elapsed time per iteration (s): 0.79 | learning rate: 4.433E-05 | global batch size: 256 | lm loss: 1.952245E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 323.053 | TFLOPs: 19.54 | 31: iteration 132340/ 173500 | consumed samples: 33879040 | consumed tokens: 69384273920 | elapsed time per iteration (s): 0.79 | learning rate: 4.432E-05 | global batch size: 256 | lm loss: 1.928433E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.119 | TFLOPs: 19.61 | 31: iteration 132350/ 173500 | consumed samples: 33881600 | consumed tokens: 69389516800 | elapsed time per iteration (s): 0.72 | learning rate: 4.431E-05 | global batch size: 256 | lm loss: 1.974010E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.071 | TFLOPs: 21.60 | 31: iteration 132360/ 173500 | consumed samples: 33884160 | consumed tokens: 69394759680 | elapsed time per iteration (s): 0.78 | learning rate: 4.430E-05 | global batch size: 256 | lm loss: 1.943326E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.303 | TFLOPs: 19.80 | 31: iteration 132370/ 173500 | consumed samples: 33886720 | consumed tokens: 69400002560 | elapsed time per iteration (s): 0.79 | learning rate: 4.429E-05 | global batch size: 256 | lm loss: 1.925395E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.467 | TFLOPs: 19.57 | 31: iteration 132380/ 173500 | consumed samples: 33889280 | consumed tokens: 69405245440 | elapsed time per iteration (s): 0.77 | learning rate: 4.428E-05 | global batch size: 256 | lm loss: 1.939331E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.531 | TFLOPs: 20.12 | 31: iteration 132390/ 173500 | consumed samples: 33891840 | consumed tokens: 69410488320 | elapsed time per iteration (s): 0.79 | learning rate: 4.427E-05 | global batch 
size: 256 | lm loss: 1.941316E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.035 | TFLOPs: 19.48 | 31: iteration 132400/ 173500 | consumed samples: 33894400 | consumed tokens: 69415731200 | elapsed time per iteration (s): 0.82 | learning rate: 4.425E-05 | global batch size: 256 | lm loss: 1.922842E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.561 | TFLOPs: 18.91 | 31: iteration 132410/ 173500 | consumed samples: 33896960 | consumed tokens: 69420974080 | elapsed time per iteration (s): 0.82 | learning rate: 4.424E-05 | global batch size: 256 | lm loss: 1.927653E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.294 | TFLOPs: 18.89 | 31: iteration 132420/ 173500 | consumed samples: 33899520 | consumed tokens: 69426216960 | elapsed time per iteration (s): 0.79 | learning rate: 4.423E-05 | global batch size: 256 | lm loss: 1.930639E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.857 | TFLOPs: 19.65 | 31: iteration 132430/ 173500 | consumed samples: 33902080 | consumed tokens: 69431459840 | elapsed time per iteration (s): 0.80 | learning rate: 4.422E-05 | global batch size: 256 | lm loss: 1.930693E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.373 | TFLOPs: 19.38 | 31: iteration 132440/ 173500 | consumed samples: 33904640 | consumed tokens: 69436702720 | elapsed time per iteration (s): 0.84 | learning rate: 4.421E-05 | global batch size: 256 | lm loss: 1.963086E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.039 | TFLOPs: 18.33 | 31: iteration 132450/ 173500 | consumed samples: 33907200 
| consumed tokens: 69441945600 | elapsed time per iteration (s): 0.89 | learning rate: 4.420E-05 | global batch size: 256 | lm loss: 1.953319E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.987 | TFLOPs: 17.42 | 31: iteration 132460/ 173500 | consumed samples: 33909760 | consumed tokens: 69447188480 | elapsed time per iteration (s): 0.83 | learning rate: 4.419E-05 | global batch size: 256 | lm loss: 1.937481E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.242 | TFLOPs: 18.71 | 31: iteration 132470/ 173500 | consumed samples: 33912320 | consumed tokens: 69452431360 | elapsed time per iteration (s): 0.79 | learning rate: 4.418E-05 | global batch size: 256 | lm loss: 1.942649E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.952 | TFLOPs: 19.54 | 31: iteration 132480/ 173500 | consumed samples: 33914880 | consumed tokens: 69457674240 | elapsed time per iteration (s): 0.82 | learning rate: 4.416E-05 | global batch size: 256 | lm loss: 1.927838E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.033 | TFLOPs: 18.94 | 31: iteration 132490/ 173500 | consumed samples: 33917440 | consumed tokens: 69462917120 | elapsed time per iteration (s): 0.80 | learning rate: 4.415E-05 | global batch size: 256 | lm loss: 1.953745E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.170 | TFLOPs: 19.25 | 31: iteration 132500/ 173500 | consumed samples: 33920000 | consumed tokens: 69468160000 | elapsed time per iteration (s): 0.81 | learning rate: 4.414E-05 | global batch size: 256 | lm loss: 1.924432E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 317.217 | TFLOPs: 19.19 | 31: iteration 132510/ 173500 | consumed samples: 33922560 | consumed tokens: 69473402880 | elapsed time per iteration (s): 0.80 | learning rate: 4.413E-05 | global batch size: 256 | lm loss: 1.928963E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.752 | TFLOPs: 19.47 | 31: iteration 132520/ 173500 | consumed samples: 33925120 | consumed tokens: 69478645760 | elapsed time per iteration (s): 0.77 | learning rate: 4.412E-05 | global batch size: 256 | lm loss: 1.957494E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.025 | TFLOPs: 20.03 | 31: iteration 132530/ 173500 | consumed samples: 33927680 | consumed tokens: 69483888640 | elapsed time per iteration (s): 0.78 | learning rate: 4.411E-05 | global batch size: 256 | lm loss: 1.928201E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.348 | TFLOPs: 19.86 | 31: iteration 132540/ 173500 | consumed samples: 33930240 | consumed tokens: 69489131520 | elapsed time per iteration (s): 0.80 | learning rate: 4.410E-05 | global batch size: 256 | lm loss: 1.940136E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.036 | TFLOPs: 19.30 | 31: iteration 132550/ 173500 | consumed samples: 33932800 | consumed tokens: 69494374400 | elapsed time per iteration (s): 0.84 | learning rate: 4.409E-05 | global batch size: 256 | lm loss: 1.935167E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.927 | TFLOPs: 18.51 | 31: iteration 132560/ 173500 | consumed samples: 33935360 | consumed tokens: 69499617280 | elapsed time per iteration (s): 0.79 | learning rate: 4.407E-05 | global batch size: 
256 | lm loss: 1.929439E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.406 | TFLOPs: 19.57 | 31: iteration 132570/ 173500 | consumed samples: 33937920 | consumed tokens: 69504860160 | elapsed time per iteration (s): 0.85 | learning rate: 4.406E-05 | global batch size: 256 | lm loss: 1.933060E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.533 | TFLOPs: 18.30 | 31: iteration 132580/ 173500 | consumed samples: 33940480 | consumed tokens: 69510103040 | elapsed time per iteration (s): 0.80 | learning rate: 4.405E-05 | global batch size: 256 | lm loss: 1.922515E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.459 | TFLOPs: 19.27 | 31: iteration 132590/ 173500 | consumed samples: 33943040 | consumed tokens: 69515345920 | elapsed time per iteration (s): 0.82 | learning rate: 4.404E-05 | global batch size: 256 | lm loss: 1.912252E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.558 | TFLOPs: 18.85 | 31: iteration 132600/ 173500 | consumed samples: 33945600 | consumed tokens: 69520588800 | elapsed time per iteration (s): 0.80 | learning rate: 4.403E-05 | global batch size: 256 | lm loss: 1.980084E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.991 | TFLOPs: 19.30 | 31: iteration 132610/ 173500 | consumed samples: 33948160 | consumed tokens: 69525831680 | elapsed time per iteration (s): 0.83 | learning rate: 4.402E-05 | global batch size: 256 | lm loss: 1.922433E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.634 | TFLOPs: 18.73 | 31: iteration 132620/ 173500 | consumed samples: 33950720 | 
consumed tokens: 69531074560 | elapsed time per iteration (s): 0.78 | learning rate: 4.401E-05 | global batch size: 256 | lm loss: 1.946317E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.980 | TFLOPs: 19.96 | 31: iteration 132630/ 173500 | consumed samples: 33953280 | consumed tokens: 69536317440 | elapsed time per iteration (s): 0.77 | learning rate: 4.400E-05 | global batch size: 256 | lm loss: 1.931787E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.823 | TFLOPs: 20.13 | 31: iteration 132640/ 173500 | consumed samples: 33955840 | consumed tokens: 69541560320 | elapsed time per iteration (s): 0.81 | learning rate: 4.399E-05 | global batch size: 256 | lm loss: 1.922807E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.130 | TFLOPs: 19.19 | 31: iteration 132650/ 173500 | consumed samples: 33958400 | consumed tokens: 69546803200 | elapsed time per iteration (s): 0.82 | learning rate: 4.397E-05 | global batch size: 256 | lm loss: 1.949140E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.971 | TFLOPs: 18.87 | 31: iteration 132660/ 173500 | consumed samples: 33960960 | consumed tokens: 69552046080 | elapsed time per iteration (s): 0.80 | learning rate: 4.396E-05 | global batch size: 256 | lm loss: 1.963754E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.165 | TFLOPs: 19.37 | 31: iteration 132670/ 173500 | consumed samples: 33963520 | consumed tokens: 69557288960 | elapsed time per iteration (s): 0.79 | learning rate: 4.395E-05 | global batch size: 256 | lm loss: 1.916672E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 322.187 | TFLOPs: 19.49 | 31: iteration 132680/ 173500 | consumed samples: 33966080 | consumed tokens: 69562531840 | elapsed time per iteration (s): 0.77 | learning rate: 4.394E-05 | global batch size: 256 | lm loss: 1.943275E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.600 | TFLOPs: 20.00 | 31: iteration 132690/ 173500 | consumed samples: 33968640 | consumed tokens: 69567774720 | elapsed time per iteration (s): 0.79 | learning rate: 4.393E-05 | global batch size: 256 | lm loss: 1.934600E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.033 | TFLOPs: 19.66 | 31: iteration 132700/ 173500 | consumed samples: 33971200 | consumed tokens: 69573017600 | elapsed time per iteration (s): 0.77 | learning rate: 4.392E-05 | global batch size: 256 | lm loss: 1.938237E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.287 | TFLOPs: 20.04 | 31: iteration 132710/ 173500 | consumed samples: 33973760 | consumed tokens: 69578260480 | elapsed time per iteration (s): 0.80 | learning rate: 4.391E-05 | global batch size: 256 | lm loss: 1.952668E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.413 | TFLOPs: 19.44 | 31: iteration 132720/ 173500 | consumed samples: 33976320 | consumed tokens: 69583503360 | elapsed time per iteration (s): 0.79 | learning rate: 4.390E-05 | global batch size: 256 | lm loss: 1.948268E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.275 | TFLOPs: 19.62 | 31: iteration 132730/ 173500 | consumed samples: 33978880 | consumed tokens: 69588746240 | elapsed time per iteration (s): 0.80 | learning rate: 4.388E-05 | global batch size: 
256 | lm loss: 1.933914E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.745 | TFLOPs: 19.28 | 31: iteration 132740/ 173500 | consumed samples: 33981440 | consumed tokens: 69593989120 | elapsed time per iteration (s): 0.78 | learning rate: 4.387E-05 | global batch size: 256 | lm loss: 1.945007E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.384 | TFLOPs: 19.75 | 31: iteration 132750/ 173500 | consumed samples: 33984000 | consumed tokens: 69599232000 | elapsed time per iteration (s): 0.79 | learning rate: 4.386E-05 | global batch size: 256 | lm loss: 1.927097E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.770 | TFLOPs: 19.71 | 31: iteration 132760/ 173500 | consumed samples: 33986560 | consumed tokens: 69604474880 | elapsed time per iteration (s): 0.81 | learning rate: 4.385E-05 | global batch size: 256 | lm loss: 1.940711E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.459 | TFLOPs: 19.14 | 31: iteration 132770/ 173500 | consumed samples: 33989120 | consumed tokens: 69609717760 | elapsed time per iteration (s): 0.80 | learning rate: 4.384E-05 | global batch size: 256 | lm loss: 1.933364E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.023 | TFLOPs: 19.42 | 31: iteration 132780/ 173500 | consumed samples: 33991680 | consumed tokens: 69614960640 | elapsed time per iteration (s): 0.80 | learning rate: 4.383E-05 | global batch size: 256 | lm loss: 1.943007E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.868 | TFLOPs: 19.29 | 31: iteration 132790/ 173500 | consumed samples: 33994240 | 
consumed tokens: 69620203520 | elapsed time per iteration (s): 0.83 | learning rate: 4.382E-05 | global batch size: 256 | lm loss: 1.927810E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.456 | TFLOPs: 18.72 | 31: iteration 132800/ 173500 | consumed samples: 33996800 | consumed tokens: 69625446400 | elapsed time per iteration (s): 0.85 | learning rate: 4.381E-05 | global batch size: 256 | lm loss: 1.963233E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.337 | TFLOPs: 18.23 | 31: iteration 132810/ 173500 | consumed samples: 33999360 | consumed tokens: 69630689280 | elapsed time per iteration (s): 0.80 | learning rate: 4.380E-05 | global batch size: 256 | lm loss: 1.959080E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.782 | TFLOPs: 19.47 | 31: iteration 132820/ 173500 | consumed samples: 34001920 | consumed tokens: 69635932160 | elapsed time per iteration (s): 0.82 | learning rate: 4.378E-05 | global batch size: 256 | lm loss: 1.941157E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.683 | TFLOPs: 18.86 | 31: iteration 132830/ 173500 | consumed samples: 34004480 | consumed tokens: 69641175040 | elapsed time per iteration (s): 0.79 | learning rate: 4.377E-05 | global batch size: 256 | lm loss: 1.896715E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.167 | TFLOPs: 19.49 | 31: iteration 132840/ 173500 | consumed samples: 34007040 | consumed tokens: 69646417920 | elapsed time per iteration (s): 0.79 | learning rate: 4.376E-05 | global batch size: 256 | lm loss: 1.935329E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 322.730 | TFLOPs: 19.52 | 31: iteration 132850/ 173500 | consumed samples: 34009600 | consumed tokens: 69651660800 | elapsed time per iteration (s): 0.82 | learning rate: 4.375E-05 | global batch size: 256 | lm loss: 1.925738E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.862 | TFLOPs: 18.87 | 31: iteration 132860/ 173500 | consumed samples: 34012160 | consumed tokens: 69656903680 | elapsed time per iteration (s): 0.81 | learning rate: 4.374E-05 | global batch size: 256 | lm loss: 1.941004E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.656 | TFLOPs: 19.10 | 31: iteration 132870/ 173500 | consumed samples: 34014720 | consumed tokens: 69662146560 | elapsed time per iteration (s): 0.82 | learning rate: 4.373E-05 | global batch size: 256 | lm loss: 1.921024E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.452 | TFLOPs: 18.84 | 31: iteration 132880/ 173500 | consumed samples: 34017280 | consumed tokens: 69667389440 | elapsed time per iteration (s): 0.79 | learning rate: 4.372E-05 | global batch size: 256 | lm loss: 1.923962E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.267 | TFLOPs: 19.56 | 31: iteration 132890/ 173500 | consumed samples: 34019840 | consumed tokens: 69672632320 | elapsed time per iteration (s): 0.79 | learning rate: 4.371E-05 | global batch size: 256 | lm loss: 1.932806E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.637 | TFLOPs: 19.52 | 31: iteration 132900/ 173500 | consumed samples: 34022400 | consumed tokens: 69677875200 | elapsed time per iteration (s): 0.80 | learning rate: 4.369E-05 | global batch size: 
256 | lm loss: 1.944305E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.993 | TFLOPs: 19.48 | 31: iteration 132910/ 173500 | consumed samples: 34024960 | consumed tokens: 69683118080 | elapsed time per iteration (s): 0.81 | learning rate: 4.368E-05 | global batch size: 256 | lm loss: 1.937201E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.899 | TFLOPs: 19.05 | 31: iteration 132920/ 173500 | consumed samples: 34027520 | consumed tokens: 69688360960 | elapsed time per iteration (s): 0.82 | learning rate: 4.367E-05 | global batch size: 256 | lm loss: 1.924709E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.705 | TFLOPs: 18.98 | 31: iteration 132930/ 173500 | consumed samples: 34030080 | consumed tokens: 69693603840 | elapsed time per iteration (s): 0.82 | learning rate: 4.366E-05 | global batch size: 256 | lm loss: 1.956158E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.548 | TFLOPs: 18.85 | 31: iteration 132940/ 173500 | consumed samples: 34032640 | consumed tokens: 69698846720 | elapsed time per iteration (s): 0.81 | learning rate: 4.365E-05 | global batch size: 256 | lm loss: 1.941349E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.323 | TFLOPs: 19.08 | 31: iteration 132950/ 173500 | consumed samples: 34035200 | consumed tokens: 69704089600 | elapsed time per iteration (s): 0.80 | learning rate: 4.364E-05 | global batch size: 256 | lm loss: 1.907047E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.833 | TFLOPs: 19.41 | 31: iteration 132960/ 173500 | consumed samples: 34037760 | 
consumed tokens: 69709332480 | elapsed time per iteration (s): 0.85 | learning rate: 4.363E-05 | global batch size: 256 | lm loss: 1.957237E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.143 | TFLOPs: 18.28 | 31: iteration 132970/ 173500 | consumed samples: 34040320 | consumed tokens: 69714575360 | elapsed time per iteration (s): 0.79 | learning rate: 4.362E-05 | global batch size: 256 | lm loss: 1.931625E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.623 | TFLOPs: 19.64 | 31: iteration 132980/ 173500 | consumed samples: 34042880 | consumed tokens: 69719818240 | elapsed time per iteration (s): 0.81 | learning rate: 4.361E-05 | global batch size: 256 | lm loss: 1.911919E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.286 | TFLOPs: 19.07 | 31: iteration 132990/ 173500 | consumed samples: 34045440 | consumed tokens: 69725061120 | elapsed time per iteration (s): 0.79 | learning rate: 4.359E-05 | global batch size: 256 | lm loss: 1.921955E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.189 | TFLOPs: 19.61 | 31: iteration 133000/ 173500 | consumed samples: 34048000 | consumed tokens: 69730304000 | elapsed time per iteration (s): 0.80 | learning rate: 4.358E-05 | global batch size: 256 | lm loss: 1.928811E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.687 | TFLOPs: 19.40 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 133000 | lm loss value: 1.797577E+00 | lm loss PPL: 6.035004E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving 
checkpoint at iteration 133000 to checkpoints_1b1long 0: [2022-11-27 00:03:00,637] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step133000 is begin to save! 0: [2022-11-27 00:03:00,647] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_01-model_00-model_states.pt... 0: [2022-11-27 00:03:00,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_01-model_00-model_states.pt. 0: [2022-11-27 00:03:00,864] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_03-model_00-model_states.pt... 0: [2022-11-27 00:03:00,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_03-model_00-model_states.pt. 0: [2022-11-27 00:03:00,949] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_04-model_00-model_states.pt... 0: [2022-11-27 00:03:01,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_04-model_00-model_states.pt. 0: [2022-11-27 00:03:01,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_05-model_00-model_states.pt... 0: [2022-11-27 00:03:01,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_05-model_00-model_states.pt. 0: [2022-11-27 00:03:01,109] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_06-model_00-model_states.pt... 0: [2022-11-27 00:03:01,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_06-model_00-model_states.pt. 0: [2022-11-27 00:03:01,186] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_07-model_00-model_states.pt... 
0: [2022-11-27 00:03:01,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_07-model_00-model_states.pt. 0: [2022-11-27 00:03:01,268] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_08-model_00-model_states.pt... 0: [2022-11-27 00:03:01,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_08-model_00-model_states.pt. 0: [2022-11-27 00:03:01,346] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_09-model_00-model_states.pt... 0: [2022-11-27 00:03:01,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_09-model_00-model_states.pt. 0: [2022-11-27 00:03:01,428] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_10-model_00-model_states.pt... 0: [2022-11-27 00:03:01,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_10-model_00-model_states.pt. 0: [2022-11-27 00:03:01,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_11-model_00-model_states.pt... 0: [2022-11-27 00:03:01,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_11-model_00-model_states.pt. 0: [2022-11-27 00:03:01,588] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_12-model_00-model_states.pt... 0: [2022-11-27 00:03:01,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_12-model_00-model_states.pt. 0: [2022-11-27 00:03:01,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 00:03:01,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_13-model_00-model_states.pt. 0: [2022-11-27 00:03:01,751] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_14-model_00-model_states.pt... 0: [2022-11-27 00:03:01,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_14-model_00-model_states.pt. 0: [2022-11-27 00:03:01,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_15-model_00-model_states.pt... 0: [2022-11-27 00:03:01,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_15-model_00-model_states.pt. 0: [2022-11-27 00:03:01,909] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_16-model_00-model_states.pt... 0: [2022-11-27 00:03:01,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_16-model_00-model_states.pt. 0: [2022-11-27 00:03:01,986] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_17-model_00-model_states.pt... 0: [2022-11-27 00:03:02,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_17-model_00-model_states.pt. 0: [2022-11-27 00:03:02,065] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_18-model_00-model_states.pt... 0: [2022-11-27 00:03:02,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_18-model_00-model_states.pt. 0: [2022-11-27 00:03:02,143] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 00:03:02,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_19-model_00-model_states.pt. 0: [2022-11-27 00:03:02,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_20-model_00-model_states.pt... 0: [2022-11-27 00:03:02,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_20-model_00-model_states.pt. 0: [2022-11-27 00:03:02,301] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_21-model_00-model_states.pt... 0: [2022-11-27 00:03:02,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_21-model_00-model_states.pt. 0: [2022-11-27 00:03:02,383] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_22-model_00-model_states.pt... 0: [2022-11-27 00:03:02,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_22-model_00-model_states.pt. 0: [2022-11-27 00:03:02,464] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_23-model_00-model_states.pt... 0: [2022-11-27 00:03:02,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_23-model_00-model_states.pt. 0: [2022-11-27 00:03:02,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_24-model_00-model_states.pt... 0: [2022-11-27 00:03:02,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_24-model_00-model_states.pt. 0: [2022-11-27 00:03:02,624] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 00:03:02,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_25-model_00-model_states.pt. 0: [2022-11-27 00:03:02,702] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_26-model_00-model_states.pt... 0: [2022-11-27 00:03:02,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_26-model_00-model_states.pt. 0: [2022-11-27 00:03:02,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_27-model_00-model_states.pt... 0: [2022-11-27 00:03:02,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_27-model_00-model_states.pt. 0: [2022-11-27 00:03:02,862] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_28-model_00-model_states.pt... 0: [2022-11-27 00:03:02,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_28-model_00-model_states.pt. 0: [2022-11-27 00:03:02,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/layer_30-model_00-model_states.pt... 0: [2022-11-27 00:03:02,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/layer_30-model_00-model_states.pt. 0: [2022-11-27 00:03:02,946] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step133000/mp_rank_00_model_states.pt 0: [2022-11-27 00:03:02,946] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/mp_rank_00_model_states.pt... 0: [2022-11-27 00:03:02,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/mp_rank_00_model_states.pt. 
0: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
2: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 
23: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 
29: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 
19: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
7: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
8: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 
16: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
12: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 
11: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 
29: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 
17: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 
18: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 
0: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
4: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 
16: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 
25: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 
14: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 
18: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 
5: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
13: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 
28: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 
26: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 
21: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:03:03,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:03:03,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:03:03,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 00:03:03,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 10: [2022-11-27 00:03:03,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:03:03,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 6: [2022-11-27 00:03:03,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:03:03,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 
6: [2022-11-27 00:03:03,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 00:03:03,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 25: [2022-11-27 00:03:03,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:03:03,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:03:03,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 19: [2022-11-27 00:03:03,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 25: [2022-11-27 00:03:03,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 19: [2022-11-27 00:03:03,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 22: [2022-11-27 00:03:03,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:03:03,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 00:03:03,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 4: [2022-11-27 00:03:03,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
2: [2022-11-27 00:03:03,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:03:03,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:03:03,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 2: [2022-11-27 00:03:03,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 3: [2022-11-27 00:03:03,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 4: [2022-11-27 00:03:03,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 2: [2022-11-27 00:03:03,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 3: [2022-11-27 00:03:03,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 5: [2022-11-27 00:03:03,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:03:03,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 00:03:03,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 30: [2022-11-27 00:03:03,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-27 00:03:03,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 00:03:03,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 8: [2022-11-27 00:03:03,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:03:03,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:03:03,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 13: [2022-11-27 00:03:03,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 8: [2022-11-27 00:03:03,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 13: [2022-11-27 00:03:03,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 11: [2022-11-27 00:03:03,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:03:03,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 00:03:03,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 28: [2022-11-27 00:03:03,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
1: [2022-11-27 00:03:03,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 28: [2022-11-27 00:03:03,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 1: [2022-11-27 00:03:03,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 28: [2022-11-27 00:03:03,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 1: [2022-11-27 00:03:03,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 29: [2022-11-27 00:03:03,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:03:03,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 00:03:03,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 12: [2022-11-27 00:03:03,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:03:03,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 00:03:03,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 0: [2022-11-27 00:03:03,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
16: [2022-11-27 00:03:03,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:03:03,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 00:03:03,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 18: [2022-11-27 00:03:03,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:03:03,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 00:03:03,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 5: [2022-11-27 00:03:03,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:03:03,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 00:03:03,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 10: [2022-11-27 00:03:03,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:03:03,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 00:03:03,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 
7: [2022-11-27 00:03:03,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:03:03,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:03:03,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 00:03:03,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 18: [2022-11-27 00:03:03,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 00:03:03,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 15: [2022-11-27 00:03:03,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:03:03,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 00:03:03,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 14: [2022-11-27 00:03:03,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:03:03,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
14: [2022-11-27 00:03:03,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 6: [2022-11-27 00:03:03,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 8: [2022-11-27 00:03:03,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:03:03,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 6: [2022-11-27 00:03:03,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 8: [2022-11-27 00:03:03,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 27: [2022-11-27 00:03:03,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:03:03,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 27: [2022-11-27 00:03:03,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 1: [2022-11-27 00:03:03,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:03:03,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 
1: [2022-11-27 00:03:03,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 14: [2022-11-27 00:03:03,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:03:03,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 14: [2022-11-27 00:03:03,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 00:03:03,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 16: [2022-11-27 00:03:03,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:03:03,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 00:03:03,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 2: [2022-11-27 00:03:03,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:03:03,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 00:03:03,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 30: [2022-11-27 00:03:03,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
30: [2022-11-27 00:03:03,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 00:03:03,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 3: [2022-11-27 00:03:03,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:03:03,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:03:03,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:03:03,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:03:03,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 00:03:03,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 25: [2022-11-27 00:03:03,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 26: [2022-11-27 00:03:03,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
26: [2022-11-27 00:03:03,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 13: [2022-11-27 00:03:03,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 00:03:03,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 26: [2022-11-27 00:03:03,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 25: [2022-11-27 00:03:03,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 26: [2022-11-27 00:03:03,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 00:03:03,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 1: [2022-11-27 00:03:03,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:03:03,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 00:03:03,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 28: [2022-11-27 00:03:03,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-27 00:03:03,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 00:03:03,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 19: [2022-11-27 00:03:03,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:03:03,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 00:03:03,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 15: [2022-11-27 00:03:03,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:03:03,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:03:03,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:03:03,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 17: [2022-11-27 00:03:03,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 9: [2022-11-27 00:03:03,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
12: [2022-11-27 00:03:03,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 15: [2022-11-27 00:03:03,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 17: [2022-11-27 00:03:03,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 9: [2022-11-27 00:03:03,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 12: [2022-11-27 00:03:03,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 17: [2022-11-27 00:03:03,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:03:03,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 17: [2022-11-27 00:03:03,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 00:03:03,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 19: [2022-11-27 00:03:03,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:03:03,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-27 00:03:03,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 
21: [2022-11-27 00:03:03,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:03:03,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 00:03:03,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 13: [2022-11-27 00:03:03,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:03:03,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 00:03:03,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 21: [2022-11-27 00:03:03,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:03:03,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:03:03,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 00:03:03,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 00:03:03,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 21: [2022-11-27 00:03:03,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 
14: [2022-11-27 00:03:03,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:03:03,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:03:03,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:03:03,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 14: [2022-11-27 00:03:03,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 16: [2022-11-27 00:03:03,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 11: [2022-11-27 00:03:03,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 14: [2022-11-27 00:03:03,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 16: [2022-11-27 00:03:03,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 10: [2022-11-27 00:03:03,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:03:03,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 00:03:03,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 
29: [2022-11-27 00:03:03,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:03:03,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 00:03:03,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 18: [2022-11-27 00:03:03,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:03:03,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 00:03:03,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 5: [2022-11-27 00:03:03,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:03:03,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 00:03:03,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 8: [2022-11-27 00:03:03,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:03:03,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 00:03:03,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 
27: [2022-11-27 00:03:03,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:03:03,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 00:03:03,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 4: [2022-11-27 00:03:03,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:03:03,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 00:03:03,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 9: [2022-11-27 00:03:03,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:03:03,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:03:03,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-27 00:03:03,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 22: [2022-11-27 00:03:03,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 9: [2022-11-27 00:03:03,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 00:03:03,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 22: [2022-11-27 00:03:03,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 9: [2022-11-27 00:03:03,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 22: [2022-11-27 00:03:03,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:03:03,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 00:03:03,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 12: [2022-11-27 00:03:03,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:03:03,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 00:03:03,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 
11: [2022-11-27 00:03:03,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:03:03,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 00:03:03,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 0: [2022-11-27 00:03:03,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:03:03,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 00:03:03,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 0: [2022-11-27 00:03:03,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:03:03,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 00:03:03,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 26: [2022-11-27 00:03:03,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:03:03,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 00:03:03,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 
1: [2022-11-27 00:03:03,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:03:03,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 00:03:03,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 8: [2022-11-27 00:03:03,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:03:03,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 00:03:03,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 29: [2022-11-27 00:03:03,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 28: [2022-11-27 00:03:03,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-27 00:03:03,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 00:03:03,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 29: [2022-11-27 00:03:03,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 00:03:03,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 
7: [2022-11-27 00:03:03,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:03:03,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:03:03,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:03:03,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 15: [2022-11-27 00:03:03,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 21: [2022-11-27 00:03:03,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:03:03,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 7: [2022-11-27 00:03:03,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 15: [2022-11-27 00:03:03,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 21: [2022-11-27 00:03:03,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 5: [2022-11-27 00:03:03,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 
7: [2022-11-27 00:03:03,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:03:03,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 7: [2022-11-27 00:03:03,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 00:03:03,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 2: [2022-11-27 00:03:03,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:03:03,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 3: [2022-11-27 00:03:03,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:03:03,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 2: [2022-11-27 00:03:03,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 3: [2022-11-27 00:03:03,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 24: [2022-11-27 00:03:03,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
24: [2022-11-27 00:03:03,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 00:03:03,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 10: [2022-11-27 00:03:03,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:03:03,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 00:03:03,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 9: [2022-11-27 00:03:03,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:03:03,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:03:03,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 00:03:03,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 23: [2022-11-27 00:03:03,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 00:03:03,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 25: [2022-11-27 00:03:03,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
25: [2022-11-27 00:03:03,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 0: [2022-11-27 00:03:03,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:03:03,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 0: [2022-11-27 00:03:03,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 00:03:03,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 17: [2022-11-27 00:03:03,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:03:03,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 00:03:03,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 24: [2022-11-27 00:03:03,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:03:03,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-27 00:03:03,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 6: [2022-11-27 00:03:03,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-27 00:03:03,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 00:03:03,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:03:03,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 6: [2022-11-27 00:03:03,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 00:03:03,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 14: [2022-11-27 00:03:03,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:03:03,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 00:03:03,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 0: [2022-11-27 00:03:03,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 00:03:03,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 13: [2022-11-27 00:03:03,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-27 00:03:03,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 00:03:03,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 30: [2022-11-27 00:03:03,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:03:03,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 00:03:03,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 23: [2022-11-27 00:03:03,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:03:03,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 00:03:03,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 27: [2022-11-27 00:03:03,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:03:03,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 00:03:03,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 4: [2022-11-27 00:03:03,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-27 00:03:03,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:03:03,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 00:03:03,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 4: [2022-11-27 00:03:03,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 00:03:03,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 31: [2022-11-27 00:03:03,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:03:03,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:03:03,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:03:03,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:03:03,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 00:03:03,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 
31: [2022-11-27 00:03:03,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 00:03:03,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 00:03:03,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 00:03:03,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 31: [2022-11-27 00:03:03,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 31: [2022-11-27 00:03:03,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 30: [2022-11-27 00:03:03,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:03:03,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 00:03:03,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 28: [2022-11-27 00:03:03,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:03:03,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-27 00:03:03,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 28: [2022-11-27 00:03:03,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 2: [2022-11-27 00:03:03,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 28: [2022-11-27 00:03:03,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 18: [2022-11-27 00:03:03,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:03:03,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 00:03:03,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 20: [2022-11-27 00:03:03,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:03:03,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:03:03,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 00:03:03,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 00:03:03,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 
20: [2022-11-27 00:03:03,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 22: [2022-11-27 00:03:03,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:03:03,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 00:03:03,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 16: [2022-11-27 00:03:03,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:03:03,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 00:03:03,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 17: [2022-11-27 00:03:03,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:03:03,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 19: [2022-11-27 00:03:03,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:03:03,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 
19: [2022-11-27 00:03:03,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 00:03:03,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 3: [2022-11-27 00:03:03,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:03:03,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 00:03:03,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 29: [2022-11-27 00:03:03,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:03:03,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 00:03:03,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 26: [2022-11-27 00:03:03,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:03:03,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 00:03:03,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 27: [2022-11-27 00:03:03,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
27: [2022-11-27 00:03:03,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 00:03:03,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 15: [2022-11-27 00:03:03,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:03:03,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 00:03:03,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 25: [2022-11-27 00:03:03,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:03:03,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 00:03:03,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 23: [2022-11-27 00:03:03,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:03:03,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 00:03:03,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 1: [2022-11-27 00:03:03,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-27 00:03:03,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 00:03:03,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 24: [2022-11-27 00:03:03,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:03:03,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 00:03:03,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 12: [2022-11-27 00:03:03,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:03:03,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 00:03:03,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 20: [2022-11-27 00:03:03,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:03:03,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:03:03,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 00:03:03,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 
11: [2022-11-27 00:03:03,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 00:03:03,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 6: [2022-11-27 00:03:03,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:03:03,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:03:03,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 14: [2022-11-27 00:03:03,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 6: [2022-11-27 00:03:03,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 14: [2022-11-27 00:03:03,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 5: [2022-11-27 00:03:03,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:03:03,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 00:03:03,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 8: [2022-11-27 00:03:03,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-27 00:03:03,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 00:03:03,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 0: [2022-11-27 00:03:03,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:03:03,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 00:03:03,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 10: [2022-11-27 00:03:03,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:03:03,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 00:03:03,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 9: [2022-11-27 00:03:03,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:03:03,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 00:03:03,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 7: [2022-11-27 00:03:03,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
4: [2022-11-27 00:03:03,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:03:03,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 4: [2022-11-27 00:03:03,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 7: [2022-11-27 00:03:03,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 4: [2022-11-27 00:03:03,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 16: [2022-11-27 00:03:03,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:03:03,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:03:03,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 16: [2022-11-27 00:03:03,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 18: [2022-11-27 00:03:03,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 16: [2022-11-27 00:03:03,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 13: [2022-11-27 00:03:03,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-27 00:03:03,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 00:03:03,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 19: [2022-11-27 00:03:03,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 28: [2022-11-27 00:03:03,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:03:03,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 28: [2022-11-27 00:03:03,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 30: [2022-11-27 00:03:03,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:03:03,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 28: [2022-11-27 00:03:03,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 30: [2022-11-27 00:03:03,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 00:03:03,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 29: [2022-11-27 00:03:03,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-27 00:03:03,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 00:03:03,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 26: [2022-11-27 00:03:03,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:03:03,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 00:03:03,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 22: [2022-11-27 00:03:03,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:03:03,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 31: [2022-11-27 00:03:03,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:03:03,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 31: [2022-11-27 00:03:03,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 00:03:03,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 21: [2022-11-27 00:03:03,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
21: [2022-11-27 00:03:03,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 00:03:03,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 2: [2022-11-27 00:03:03,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:03:03,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 00:03:03,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 17: [2022-11-27 00:03:03,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:03:03,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 00:03:03,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 25: [2022-11-27 00:03:03,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:03:03,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 00:03:03,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 23: [2022-11-27 00:03:03,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
23: [2022-11-27 00:03:03,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 00:03:03,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 27: [2022-11-27 00:03:03,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:03:03,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 00:03:03,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 5: [2022-11-27 00:03:03,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:03:03,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:03:03,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 15: [2022-11-27 00:03:03,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 5: [2022-11-27 00:03:03,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 15: [2022-11-27 00:03:03,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 3: [2022-11-27 00:03:03,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-27 00:03:03,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 00:03:03,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 24: [2022-11-27 00:03:03,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:03:03,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 00:03:03,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 12: [2022-11-27 00:03:03,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:03:03,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 00:03:03,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 1: [2022-11-27 00:03:03,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:03:03,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 00:03:03,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 11: [2022-11-27 00:03:03,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-27 00:03:03,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 00:03:03,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 20: [2022-11-27 00:03:03,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:03:03,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 00:03:03,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 6: [2022-11-27 00:03:03,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:03:03,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 00:03:03,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 0: [2022-11-27 00:03:03,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:03:03,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 00:03:03,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 7: [2022-11-27 00:03:03,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-27 00:03:03,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 00:03:03,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 19: [2022-11-27 00:03:03,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:03:03,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 00:03:03,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 21: [2022-11-27 00:03:03,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:03:03,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 00:03:03,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 10: [2022-11-27 00:03:03,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:03:03,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 00:03:03,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 4: [2022-11-27 00:03:03,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-27 00:03:03,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 00:03:03,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 28: [2022-11-27 00:03:03,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-27 00:03:03,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 00:03:03,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 14: [2022-11-27 00:03:03,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:03:03,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:03:03,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 00:03:03,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 8: [2022-11-27 00:03:03,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 00:03:03,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 9: [2022-11-27 00:03:03,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-27 00:03:03,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 00:03:03,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 18: [2022-11-27 00:03:03,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:03:03,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 00:03:03,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 31: [2022-11-27 00:03:03,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:03:03,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 00:03:03,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 13: [2022-11-27 00:03:03,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:03:03,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 00:03:03,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 29: [2022-11-27 00:03:03,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
29: [2022-11-27 00:03:03,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 00:03:03,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 22: [2022-11-27 00:03:03,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:03:03,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 30: [2022-11-27 00:03:03,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:03:03,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 30: [2022-11-27 00:03:03,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 00:03:03,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 16: [2022-11-27 00:03:03,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:03:03,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 00:03:03,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 26: [2022-11-27 00:03:03,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 
26: [2022-11-27 00:03:03,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 00:03:03,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 15: [2022-11-27 00:03:03,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:03:03,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 00:03:03,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 17: [2022-11-27 00:03:03,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:03:03,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:03:03,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:03:03,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 00:03:03,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 
3: [2022-11-27 00:03:03,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 12: [2022-11-27 00:03:03,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 3: [2022-11-27 00:03:03,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 12: [2022-11-27 00:03:03,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 25: [2022-11-27 00:03:03,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:03:03,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 00:03:03,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 11: [2022-11-27 00:03:03,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:03:03,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 00:03:03,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 23: [2022-11-27 00:03:03,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-27 00:03:03,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 00:03:03,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 6: [2022-11-27 00:03:03,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:03:03,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 00:03:03,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 20: [2022-11-27 00:03:03,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:03:03,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 00:03:03,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 5: [2022-11-27 00:03:03,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:03:03,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 1: [2022-11-27 00:03:03,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:03:03,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 
1: [2022-11-27 00:03:03,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 21: [2022-11-27 00:03:03,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:03:03,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 21: [2022-11-27 00:03:03,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 00:03:03,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 7: [2022-11-27 00:03:03,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:03:03,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 00:03:03,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 4: [2022-11-27 00:03:03,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:03:03,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 00:03:03,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 28: [2022-11-27 00:03:03,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
28: [2022-11-27 00:03:03,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 27: [2022-11-27 00:03:03,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 28: [2022-11-27 00:03:03,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 27: [2022-11-27 00:03:03,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 00:03:03,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 10: [2022-11-27 00:03:03,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:03:03,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 00:03:03,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 0: [2022-11-27 00:03:03,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:03:03,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 00:03:03,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 24: [2022-11-27 00:03:03,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-27 00:03:03,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 00:03:03,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 8: [2022-11-27 00:03:03,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:03:03,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 00:03:03,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 14: [2022-11-27 00:03:03,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:03:03,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 00:03:03,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 19: [2022-11-27 00:03:03,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:03:03,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 00:03:03,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 9: [2022-11-27 00:03:03,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-27 00:03:03,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 00:03:03,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 30: [2022-11-27 00:03:03,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:03:03,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 00:03:03,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 18: [2022-11-27 00:03:03,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:03:03,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-27 00:03:03,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 2: [2022-11-27 00:03:03,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:03:03,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 00:03:03,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 22: [2022-11-27 00:03:03,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-27 00:03:03,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 00:03:03,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 26: [2022-11-27 00:03:03,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:03:03,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 00:03:03,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 13: [2022-11-27 00:03:03,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:03:03,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 00:03:03,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 16: [2022-11-27 00:03:03,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:03:03,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 00:03:03,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 31: [2022-11-27 00:03:03,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
29: [2022-11-27 00:03:03,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:03:03,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 00:03:03,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 29: [2022-11-27 00:03:03,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 00:03:03,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 3: [2022-11-27 00:03:03,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:03:03,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 00:03:03,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 28: [2022-11-27 00:03:03,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-27 00:03:03,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 00:03:03,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 5: [2022-11-27 00:03:03,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-27 00:03:03,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 00:03:03,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 24: [2022-11-27 00:03:03,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:03:03,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 4: [2022-11-27 00:03:03,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:03:03,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 4: [2022-11-27 00:03:03,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 00:03:03,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 1: [2022-11-27 00:03:03,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:03:03,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 00:03:03,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 27: [2022-11-27 00:03:03,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
9: [2022-11-27 00:03:03,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:03:03,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 27: [2022-11-27 00:03:03,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 00:03:03,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 9: [2022-11-27 00:03:03,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 22: [2022-11-27 00:03:03,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:03:03,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 00:03:03,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 31: [2022-11-27 00:03:03,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:03:03,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 00:03:03,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 15: [2022-11-27 00:03:03,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
21: [2022-11-27 00:03:03,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:03:03,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 00:03:03,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 0: [2022-11-27 00:03:03,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:03:03,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 00:03:03,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 0: [2022-11-27 00:03:03,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 11: [2022-11-27 00:03:03,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:03:03,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 11: [2022-11-27 00:03:03,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 00:03:03,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 6: [2022-11-27 00:03:03,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
3: [2022-11-27 00:03:03,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:03:03,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 00:03:03,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 3: [2022-11-27 00:03:03,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 00:03:03,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 17: [2022-11-27 00:03:03,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:03:03,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 00:03:03,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 18: [2022-11-27 00:03:03,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:03:03,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 00:03:03,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 12: [2022-11-27 00:03:03,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-27 00:03:03,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 00:03:03,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 29: [2022-11-27 00:03:03,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:03:03,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 00:03:03,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 30: [2022-11-27 00:03:03,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:03:03,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 00:03:03,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 8: [2022-11-27 00:03:03,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:03:03,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 00:03:03,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 13: [2022-11-27 00:03:03,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
24: [2022-11-27 00:03:03,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:03:03,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:03:03,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 00:03:03,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 24: [2022-11-27 00:03:03,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 27: [2022-11-27 00:03:03,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 24: [2022-11-27 00:03:03,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 27: [2022-11-27 00:03:03,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 19: [2022-11-27 00:03:03,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:03:03,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 00:03:03,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 2: [2022-11-27 00:03:03,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
10: [2022-11-27 00:03:03,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:03:03,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:03:03,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 10: [2022-11-27 00:03:03,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 2: [2022-11-27 00:03:03,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 10: [2022-11-27 00:03:03,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 2: [2022-11-27 00:03:03,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 2: [2022-11-27 00:03:03,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 25: [2022-11-27 00:03:03,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:03:03,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:03:03,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 00:03:03,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 
14: [2022-11-27 00:03:03,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:03:03,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 00:03:03,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 14: [2022-11-27 00:03:03,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 00:03:03,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 23: [2022-11-27 00:03:03,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:03:03,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 00:03:03,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 15: [2022-11-27 00:03:03,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:03:03,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 00:03:03,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 17: [2022-11-27 00:03:03,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 
26: [2022-11-27 00:03:03,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:03:03,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 00:03:03,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 26: [2022-11-27 00:03:03,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-27 00:03:03,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 16: [2022-11-27 00:03:03,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:03:03,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 00:03:03,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 11: [2022-11-27 00:03:03,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:03:03,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 00:03:03,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 24: [2022-11-27 00:03:03,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-27 00:03:03,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-27 00:03:03,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 12: [2022-11-27 00:03:03,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:03:03,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 00:03:03,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 23: [2022-11-27 00:03:03,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:03:03,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:03:03,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 00:03:03,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 00:03:03,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 23: [2022-11-27 00:03:03,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 20: [2022-11-27 00:03:03,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
20: [2022-11-27 00:03:03,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 00:03:03,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 20: [2022-11-27 00:03:03,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:03:03,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 00:03:03,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 25: [2022-11-27 00:03:03,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:03:03,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 00:03:03,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 20: [2022-11-27 00:03:03,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:03:03,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step133000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 00:03:03,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step133000 is ready now! 
0: successfully saved checkpoint at iteration 133000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2644.27 31: iteration 133010/ 173500 | consumed samples: 34050560 | consumed tokens: 69735546880 | elapsed time per iteration (s): 1.12 | learning rate: 4.357E-05 | global batch size: 256 | lm loss: 1.949088E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.270 | TFLOPs: 13.87 | 31: iteration 133020/ 173500 | consumed samples: 34053120 | consumed tokens: 69740789760 | elapsed time per iteration (s): 0.79 | learning rate: 4.356E-05 | global batch size: 256 | lm loss: 1.898228E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.466 | TFLOPs: 19.63 | 31: iteration 133030/ 173500 | consumed samples: 34055680 | consumed tokens: 69746032640 | elapsed time per iteration (s): 0.80 | learning rate: 4.355E-05 | global batch size: 256 | lm loss: 1.917645E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.280 | TFLOPs: 19.44 | 31: iteration 133040/ 173500 | consumed samples: 34058240 | consumed tokens: 69751275520 | elapsed time per iteration (s): 0.80 | learning rate: 4.354E-05 | global batch size: 256 | lm loss: 1.957187E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.894 | TFLOPs: 19.29 | 31: iteration 133050/ 173500 | consumed samples: 34060800 | consumed tokens: 69756518400 | elapsed time per iteration (s): 0.82 | learning rate: 4.353E-05 | global batch size: 256 | lm loss: 1.922198E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.709 | TFLOPs: 18.92 | 31: iteration 133060/ 173500 | consumed samples: 34063360 | consumed tokens: 69761761280 | elapsed time per iteration (s): 
0.79 | learning rate: 4.352E-05 | global batch size: 256 | lm loss: 1.934804E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.766 | TFLOPs: 19.59 | 31: iteration 133070/ 173500 | consumed samples: 34065920 | consumed tokens: 69767004160 | elapsed time per iteration (s): 0.74 | learning rate: 4.351E-05 | global batch size: 256 | lm loss: 1.915250E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.732 | TFLOPs: 20.86 | 31: iteration 133080/ 173500 | consumed samples: 34068480 | consumed tokens: 69772247040 | elapsed time per iteration (s): 0.76 | learning rate: 4.349E-05 | global batch size: 256 | lm loss: 1.925094E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.129 | TFLOPs: 20.27 | 31: iteration 133090/ 173500 | consumed samples: 34071040 | consumed tokens: 69777489920 | elapsed time per iteration (s): 0.76 | learning rate: 4.348E-05 | global batch size: 256 | lm loss: 1.898689E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.144 | TFLOPs: 20.34 | 31: iteration 133100/ 173500 | consumed samples: 34073600 | consumed tokens: 69782732800 | elapsed time per iteration (s): 0.81 | learning rate: 4.347E-05 | global batch size: 256 | lm loss: 1.903039E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.601 | TFLOPs: 19.21 | 31: iteration 133110/ 173500 | consumed samples: 34076160 | consumed tokens: 69787975680 | elapsed time per iteration (s): 0.79 | learning rate: 4.346E-05 | global batch size: 256 | lm loss: 1.941257E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.160 | TFLOPs: 19.67 | 31: 
iteration 133120/ 173500 | consumed samples: 34078720 | consumed tokens: 69793218560 | elapsed time per iteration (s): 0.77 | learning rate: 4.345E-05 | global batch size: 256 | lm loss: 1.940462E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.998 | TFLOPs: 20.15 | 31: iteration 133130/ 173500 | consumed samples: 34081280 | consumed tokens: 69798461440 | elapsed time per iteration (s): 0.84 | learning rate: 4.344E-05 | global batch size: 256 | lm loss: 1.939424E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.154 | TFLOPs: 18.46 | 31: iteration 133140/ 173500 | consumed samples: 34083840 | consumed tokens: 69803704320 | elapsed time per iteration (s): 0.82 | learning rate: 4.343E-05 | global batch size: 256 | lm loss: 1.942478E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.801 | TFLOPs: 18.92 | 31: iteration 133150/ 173500 | consumed samples: 34086400 | consumed tokens: 69808947200 | elapsed time per iteration (s): 0.83 | learning rate: 4.342E-05 | global batch size: 256 | lm loss: 1.927696E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.102 | TFLOPs: 18.64 | 31: iteration 133160/ 173500 | consumed samples: 34088960 | consumed tokens: 69814190080 | elapsed time per iteration (s): 0.81 | learning rate: 4.341E-05 | global batch size: 256 | lm loss: 1.937726E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.638 | TFLOPs: 19.03 | 31: iteration 133170/ 173500 | consumed samples: 34091520 | consumed tokens: 69819432960 | elapsed time per iteration (s): 0.83 | learning rate: 4.340E-05 | global batch size: 256 | lm loss: 1.917445E+00 | grad norm: 0.182 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.434 | TFLOPs: 18.66 | 31: iteration 133180/ 173500 | consumed samples: 34094080 | consumed tokens: 69824675840 | elapsed time per iteration (s): 0.94 | learning rate: 4.338E-05 | global batch size: 256 | lm loss: 1.934407E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 271.142 | TFLOPs: 16.40 | 31: iteration 133190/ 173500 | consumed samples: 34096640 | consumed tokens: 69829918720 | elapsed time per iteration (s): 0.83 | learning rate: 4.337E-05 | global batch size: 256 | lm loss: 1.942241E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.202 | TFLOPs: 18.71 | 31: iteration 133200/ 173500 | consumed samples: 34099200 | consumed tokens: 69835161600 | elapsed time per iteration (s): 0.81 | learning rate: 4.336E-05 | global batch size: 256 | lm loss: 1.932668E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.429 | TFLOPs: 19.14 | 31: iteration 133210/ 173500 | consumed samples: 34101760 | consumed tokens: 69840404480 | elapsed time per iteration (s): 0.74 | learning rate: 4.335E-05 | global batch size: 256 | lm loss: 1.926595E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.052 | TFLOPs: 21.00 | 31: iteration 133220/ 173500 | consumed samples: 34104320 | consumed tokens: 69845647360 | elapsed time per iteration (s): 0.82 | learning rate: 4.334E-05 | global batch size: 256 | lm loss: 1.915405E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.120 | TFLOPs: 18.88 | 31: iteration 133230/ 173500 | consumed samples: 34106880 | consumed tokens: 69850890240 | elapsed time per iteration (s): 0.75 | 
learning rate: 4.333E-05 | global batch size: 256 | lm loss: 1.936763E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.124 | TFLOPs: 20.64 | 31: iteration 133240/ 173500 | consumed samples: 34109440 | consumed tokens: 69856133120 | elapsed time per iteration (s): 0.83 | learning rate: 4.332E-05 | global batch size: 256 | lm loss: 1.937683E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.576 | TFLOPs: 18.73 | 31: iteration 133250/ 173500 | consumed samples: 34112000 | consumed tokens: 69861376000 | elapsed time per iteration (s): 0.76 | learning rate: 4.331E-05 | global batch size: 256 | lm loss: 1.931851E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.317 | TFLOPs: 20.41 | 31: iteration 133260/ 173500 | consumed samples: 34114560 | consumed tokens: 69866618880 | elapsed time per iteration (s): 0.78 | learning rate: 4.330E-05 | global batch size: 256 | lm loss: 1.914622E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.242 | TFLOPs: 19.74 | 31: iteration 133270/ 173500 | consumed samples: 34117120 | consumed tokens: 69871861760 | elapsed time per iteration (s): 0.82 | learning rate: 4.328E-05 | global batch size: 256 | lm loss: 1.956231E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.289 | TFLOPs: 18.83 | 31: iteration 133280/ 173500 | consumed samples: 34119680 | consumed tokens: 69877104640 | elapsed time per iteration (s): 0.79 | learning rate: 4.327E-05 | global batch size: 256 | lm loss: 1.924824E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.326 | TFLOPs: 19.68 | 31: iteration 
133290/ 173500 | consumed samples: 34122240 | consumed tokens: 69882347520 | elapsed time per iteration (s): 0.81 | learning rate: 4.326E-05 | global batch size: 256 | lm loss: 1.942625E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.054 | TFLOPs: 19.06 | 31: iteration 133300/ 173500 | consumed samples: 34124800 | consumed tokens: 69887590400 | elapsed time per iteration (s): 0.77 | learning rate: 4.325E-05 | global batch size: 256 | lm loss: 1.923299E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.878 | TFLOPs: 20.08 | 31: iteration 133310/ 173500 | consumed samples: 34127360 | consumed tokens: 69892833280 | elapsed time per iteration (s): 0.73 | learning rate: 4.324E-05 | global batch size: 256 | lm loss: 1.953523E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.868 | TFLOPs: 21.11 | 31: iteration 133320/ 173500 | consumed samples: 34129920 | consumed tokens: 69898076160 | elapsed time per iteration (s): 0.79 | learning rate: 4.323E-05 | global batch size: 256 | lm loss: 1.941815E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.345 | TFLOPs: 19.56 | 31: iteration 133330/ 173500 | consumed samples: 34132480 | consumed tokens: 69903319040 | elapsed time per iteration (s): 0.73 | learning rate: 4.322E-05 | global batch size: 256 | lm loss: 1.941071E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.677 | TFLOPs: 21.28 | 31: iteration 133340/ 173500 | consumed samples: 34135040 | consumed tokens: 69908561920 | elapsed time per iteration (s): 0.84 | learning rate: 4.321E-05 | global batch size: 256 | lm loss: 1.937938E+00 | grad norm: 0.168 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.846 | TFLOPs: 18.44 | 31: iteration 133350/ 173500 | consumed samples: 34137600 | consumed tokens: 69913804800 | elapsed time per iteration (s): 0.73 | learning rate: 4.320E-05 | global batch size: 256 | lm loss: 1.961120E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.818 | TFLOPs: 21.16 | 31: iteration 133360/ 173500 | consumed samples: 34140160 | consumed tokens: 69919047680 | elapsed time per iteration (s): 0.76 | learning rate: 4.319E-05 | global batch size: 256 | lm loss: 1.920997E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.670 | TFLOPs: 20.25 | 31: iteration 133370/ 173500 | consumed samples: 34142720 | consumed tokens: 69924290560 | elapsed time per iteration (s): 0.74 | learning rate: 4.317E-05 | global batch size: 256 | lm loss: 1.921041E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.877 | TFLOPs: 20.92 | 31: iteration 133380/ 173500 | consumed samples: 34145280 | consumed tokens: 69929533440 | elapsed time per iteration (s): 0.78 | learning rate: 4.316E-05 | global batch size: 256 | lm loss: 1.923485E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.748 | TFLOPs: 19.89 | 31: iteration 133390/ 173500 | consumed samples: 34147840 | consumed tokens: 69934776320 | elapsed time per iteration (s): 0.77 | learning rate: 4.315E-05 | global batch size: 256 | lm loss: 1.955314E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.813 | TFLOPs: 20.19 | 31: iteration 133400/ 173500 | consumed samples: 34150400 | consumed tokens: 69940019200 | elapsed time per iteration (s): 0.78 | learning 
rate: 4.314E-05 | global batch size: 256 | lm loss: 1.964186E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.136 | TFLOPs: 19.91 | 31: iteration 133410/ 173500 | consumed samples: 34152960 | consumed tokens: 69945262080 | elapsed time per iteration (s): 0.76 | learning rate: 4.313E-05 | global batch size: 256 | lm loss: 1.953444E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.708 | TFLOPs: 20.49 | 31: iteration 133420/ 173500 | consumed samples: 34155520 | consumed tokens: 69950504960 | elapsed time per iteration (s): 0.76 | learning rate: 4.312E-05 | global batch size: 256 | lm loss: 1.922947E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.830 | TFLOPs: 20.44 | 31: iteration 133430/ 173500 | consumed samples: 34158080 | consumed tokens: 69955747840 | elapsed time per iteration (s): 0.75 | learning rate: 4.311E-05 | global batch size: 256 | lm loss: 1.935881E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.599 | TFLOPs: 20.54 | 31: iteration 133440/ 173500 | consumed samples: 34160640 | consumed tokens: 69960990720 | elapsed time per iteration (s): 0.79 | learning rate: 4.310E-05 | global batch size: 256 | lm loss: 1.962589E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.030 | TFLOPs: 19.66 | 31: iteration 133450/ 173500 | consumed samples: 34163200 | consumed tokens: 69966233600 | elapsed time per iteration (s): 0.83 | learning rate: 4.309E-05 | global batch size: 256 | lm loss: 1.931132E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.333 | TFLOPs: 18.59 | 31: iteration 133460/ 
173500 | consumed samples: 34165760 | consumed tokens: 69971476480 | elapsed time per iteration (s): 2.61 | learning rate: 4.308E-05 | global batch size: 256 | lm loss: 1.929928E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 97.945 | TFLOPs: 5.93 | 31: iteration 133470/ 173500 | consumed samples: 34168320 | consumed tokens: 69976719360 | elapsed time per iteration (s): 0.79 | learning rate: 4.306E-05 | global batch size: 256 | lm loss: 1.928289E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.358 | TFLOPs: 19.62 | 31: iteration 133480/ 173500 | consumed samples: 34170880 | consumed tokens: 69981962240 | elapsed time per iteration (s): 0.85 | learning rate: 4.305E-05 | global batch size: 256 | lm loss: 1.903389E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.474 | TFLOPs: 18.24 | 31: iteration 133490/ 173500 | consumed samples: 34173440 | consumed tokens: 69987205120 | elapsed time per iteration (s): 0.78 | learning rate: 4.304E-05 | global batch size: 256 | lm loss: 1.956244E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.306 | TFLOPs: 19.98 | 31: iteration 133500/ 173500 | consumed samples: 34176000 | consumed tokens: 69992448000 | elapsed time per iteration (s): 0.74 | learning rate: 4.303E-05 | global batch size: 256 | lm loss: 1.942939E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.973 | TFLOPs: 20.87 | 31: iteration 133510/ 173500 | consumed samples: 34178560 | consumed tokens: 69997690880 | elapsed time per iteration (s): 0.76 | learning rate: 4.302E-05 | global batch size: 256 | lm loss: 1.915847E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 336.061 | TFLOPs: 20.33 | 31: iteration 133520/ 173500 | consumed samples: 34181120 | consumed tokens: 70002933760 | elapsed time per iteration (s): 0.78 | learning rate: 4.301E-05 | global batch size: 256 | lm loss: 1.927511E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.376 | TFLOPs: 19.93 | 31: iteration 133530/ 173500 | consumed samples: 34183680 | consumed tokens: 70008176640 | elapsed time per iteration (s): 0.73 | learning rate: 4.300E-05 | global batch size: 256 | lm loss: 1.943814E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.607 | TFLOPs: 21.15 | 31: iteration 133540/ 173500 | consumed samples: 34186240 | consumed tokens: 70013419520 | elapsed time per iteration (s): 0.81 | learning rate: 4.299E-05 | global batch size: 256 | lm loss: 1.944213E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.500 | TFLOPs: 19.03 | 31: iteration 133550/ 173500 | consumed samples: 34188800 | consumed tokens: 70018662400 | elapsed time per iteration (s): 0.73 | learning rate: 4.298E-05 | global batch size: 256 | lm loss: 1.912195E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.314 | TFLOPs: 21.19 | 31: iteration 133560/ 173500 | consumed samples: 34191360 | consumed tokens: 70023905280 | elapsed time per iteration (s): 0.77 | learning rate: 4.297E-05 | global batch size: 256 | lm loss: 1.915125E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.391 | TFLOPs: 19.99 | 31: iteration 133570/ 173500 | consumed samples: 34193920 | consumed tokens: 70029148160 | elapsed time per iteration (s): 0.91 | learning rate: 
4.295E-05 | global batch size: 256 | lm loss: 1.926133E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 280.276 | TFLOPs: 16.96 | 31: iteration 133580/ 173500 | consumed samples: 34196480 | consumed tokens: 70034391040 | elapsed time per iteration (s): 0.79 | learning rate: 4.294E-05 | global batch size: 256 | lm loss: 1.914471E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.766 | TFLOPs: 19.53 | 31: iteration 133590/ 173500 | consumed samples: 34199040 | consumed tokens: 70039633920 | elapsed time per iteration (s): 0.76 | learning rate: 4.293E-05 | global batch size: 256 | lm loss: 1.935082E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.421 | TFLOPs: 20.41 | 31: iteration 133600/ 173500 | consumed samples: 34201600 | consumed tokens: 70044876800 | elapsed time per iteration (s): 0.77 | learning rate: 4.292E-05 | global batch size: 256 | lm loss: 1.939305E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.875 | TFLOPs: 20.20 | 31: iteration 133610/ 173500 | consumed samples: 34204160 | consumed tokens: 70050119680 | elapsed time per iteration (s): 0.76 | learning rate: 4.291E-05 | global batch size: 256 | lm loss: 1.941336E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.747 | TFLOPs: 20.37 | 31: iteration 133620/ 173500 | consumed samples: 34206720 | consumed tokens: 70055362560 | elapsed time per iteration (s): 0.81 | learning rate: 4.290E-05 | global batch size: 256 | lm loss: 1.946681E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.155 | TFLOPs: 19.07 | 31: iteration 133630/ 173500 | 
consumed samples: 34209280 | consumed tokens: 70060605440 | elapsed time per iteration (s): 0.76 | learning rate: 4.289E-05 | global batch size: 256 | lm loss: 1.960764E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.431 | TFLOPs: 20.47 | 31: iteration 133640/ 173500 | consumed samples: 34211840 | consumed tokens: 70065848320 | elapsed time per iteration (s): 0.73 | learning rate: 4.288E-05 | global batch size: 256 | lm loss: 1.946250E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.858 | TFLOPs: 21.17 | 31: iteration 133650/ 173500 | consumed samples: 34214400 | consumed tokens: 70071091200 | elapsed time per iteration (s): 0.73 | learning rate: 4.287E-05 | global batch size: 256 | lm loss: 1.896364E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.055 | TFLOPs: 21.18 | 31: iteration 133660/ 173500 | consumed samples: 34216960 | consumed tokens: 70076334080 | elapsed time per iteration (s): 0.80 | learning rate: 4.286E-05 | global batch size: 256 | lm loss: 1.917600E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.376 | TFLOPs: 19.32 | 31: iteration 133670/ 173500 | consumed samples: 34219520 | consumed tokens: 70081576960 | elapsed time per iteration (s): 0.80 | learning rate: 4.284E-05 | global batch size: 256 | lm loss: 1.933357E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.797 | TFLOPs: 19.29 | 31: iteration 133680/ 173500 | consumed samples: 34222080 | consumed tokens: 70086819840 | elapsed time per iteration (s): 0.80 | learning rate: 4.283E-05 | global batch size: 256 | lm loss: 1.928378E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 321.618 | TFLOPs: 19.46 | 31: iteration 133690/ 173500 | consumed samples: 34224640 | consumed tokens: 70092062720 | elapsed time per iteration (s): 0.74 | learning rate: 4.282E-05 | global batch size: 256 | lm loss: 1.947995E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.514 | TFLOPs: 20.96 | 31: iteration 133700/ 173500 | consumed samples: 34227200 | consumed tokens: 70097305600 | elapsed time per iteration (s): 0.82 | learning rate: 4.281E-05 | global batch size: 256 | lm loss: 1.932131E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.608 | TFLOPs: 18.85 | 31: iteration 133710/ 173500 | consumed samples: 34229760 | consumed tokens: 70102548480 | elapsed time per iteration (s): 0.78 | learning rate: 4.280E-05 | global batch size: 256 | lm loss: 1.917866E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.399 | TFLOPs: 19.93 | 31: iteration 133720/ 173500 | consumed samples: 34232320 | consumed tokens: 70107791360 | elapsed time per iteration (s): 0.82 | learning rate: 4.279E-05 | global batch size: 256 | lm loss: 1.936675E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.514 | TFLOPs: 18.91 | 31: iteration 133730/ 173500 | consumed samples: 34234880 | consumed tokens: 70113034240 | elapsed time per iteration (s): 0.76 | learning rate: 4.278E-05 | global batch size: 256 | lm loss: 1.989066E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.819 | TFLOPs: 20.44 | 31: iteration 133740/ 173500 | consumed samples: 34237440 | consumed tokens: 70118277120 | elapsed time per iteration (s): 0.78 | learning rate: 
4.277E-05 | global batch size: 256 | lm loss: 1.956479E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.475 | TFLOPs: 19.87 | 31: iteration 133750/ 173500 | consumed samples: 34240000 | consumed tokens: 70123520000 | elapsed time per iteration (s): 0.75 | learning rate: 4.276E-05 | global batch size: 256 | lm loss: 1.931005E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.196 | TFLOPs: 20.52 | 31: iteration 133760/ 173500 | consumed samples: 34242560 | consumed tokens: 70128762880 | elapsed time per iteration (s): 0.83 | learning rate: 4.275E-05 | global batch size: 256 | lm loss: 1.925170E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.562 | TFLOPs: 18.67 | 31: iteration 133770/ 173500 | consumed samples: 34245120 | consumed tokens: 70134005760 | elapsed time per iteration (s): 0.73 | learning rate: 4.273E-05 | global batch size: 256 | lm loss: 1.964986E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.911 | TFLOPs: 21.29 | 31: iteration 133780/ 173500 | consumed samples: 34247680 | consumed tokens: 70139248640 | elapsed time per iteration (s): 0.77 | learning rate: 4.272E-05 | global batch size: 256 | lm loss: 1.925442E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.132 | TFLOPs: 20.03 | 31: iteration 133790/ 173500 | consumed samples: 34250240 | consumed tokens: 70144491520 | elapsed time per iteration (s): 0.72 | learning rate: 4.271E-05 | global batch size: 256 | lm loss: 1.906025E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.439 | TFLOPs: 21.38 | 31: iteration 133800/ 173500 | 
consumed samples: 34252800 | consumed tokens: 70149734400 | elapsed time per iteration (s): 0.74 | learning rate: 4.270E-05 | global batch size: 256 | lm loss: 1.950386E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.354 | TFLOPs: 20.89 | 31: iteration 133810/ 173500 | consumed samples: 34255360 | consumed tokens: 70154977280 | elapsed time per iteration (s): 0.75 | learning rate: 4.269E-05 | global batch size: 256 | lm loss: 1.950541E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.232 | TFLOPs: 20.64 | 31: iteration 133820/ 173500 | consumed samples: 34257920 | consumed tokens: 70160220160 | elapsed time per iteration (s): 0.75 | learning rate: 4.268E-05 | global batch size: 256 | lm loss: 1.946400E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.368 | TFLOPs: 20.53 | 31: iteration 133830/ 173500 | consumed samples: 34260480 | consumed tokens: 70165463040 | elapsed time per iteration (s): 0.76 | learning rate: 4.267E-05 | global batch size: 256 | lm loss: 1.940244E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.215 | TFLOPs: 20.28 | 31: iteration 133840/ 173500 | consumed samples: 34263040 | consumed tokens: 70170705920 | elapsed time per iteration (s): 0.77 | learning rate: 4.266E-05 | global batch size: 256 | lm loss: 1.938388E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.988 | TFLOPs: 20.02 | 31: iteration 133850/ 173500 | consumed samples: 34265600 | consumed tokens: 70175948800 | elapsed time per iteration (s): 0.81 | learning rate: 4.265E-05 | global batch size: 256 | lm loss: 1.921307E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 314.285 | TFLOPs: 19.01 | 31: iteration 133860/ 173500 | consumed samples: 34268160 | consumed tokens: 70181191680 | elapsed time per iteration (s): 0.77 | learning rate: 4.264E-05 | global batch size: 256 | lm loss: 1.933002E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.872 | TFLOPs: 20.02 | 31: iteration 133870/ 173500 | consumed samples: 34270720 | consumed tokens: 70186434560 | elapsed time per iteration (s): 0.78 | learning rate: 4.263E-05 | global batch size: 256 | lm loss: 1.926830E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.571 | TFLOPs: 19.76 | 31: iteration 133880/ 173500 | consumed samples: 34273280 | consumed tokens: 70191677440 | elapsed time per iteration (s): 0.81 | learning rate: 4.261E-05 | global batch size: 256 | lm loss: 1.947479E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.743 | TFLOPs: 19.04 | 31: iteration 133890/ 173500 | consumed samples: 34275840 | consumed tokens: 70196920320 | elapsed time per iteration (s): 0.80 | learning rate: 4.260E-05 | global batch size: 256 | lm loss: 1.942765E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.038 | TFLOPs: 19.42 | 31: iteration 133900/ 173500 | consumed samples: 34278400 | consumed tokens: 70202163200 | elapsed time per iteration (s): 0.77 | learning rate: 4.259E-05 | global batch size: 256 | lm loss: 1.947383E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.361 | TFLOPs: 20.05 | 31: iteration 133910/ 173500 | consumed samples: 34280960 | consumed tokens: 70207406080 | elapsed time per iteration (s): 0.80 | learning rate: 
4.258E-05 | global batch size: 256 | lm loss: 1.941073E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.364 | TFLOPs: 19.44 | 31: iteration 133920/ 173500 | consumed samples: 34283520 | consumed tokens: 70212648960 | elapsed time per iteration (s): 0.78 | learning rate: 4.257E-05 | global batch size: 256 | lm loss: 1.948028E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.512 | TFLOPs: 19.93 | 31: iteration 133930/ 173500 | consumed samples: 34286080 | consumed tokens: 70217891840 | elapsed time per iteration (s): 0.77 | learning rate: 4.256E-05 | global batch size: 256 | lm loss: 1.929273E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.517 | TFLOPs: 20.24 | 31: iteration 133940/ 173500 | consumed samples: 34288640 | consumed tokens: 70223134720 | elapsed time per iteration (s): 0.77 | learning rate: 4.255E-05 | global batch size: 256 | lm loss: 1.938008E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.518 | TFLOPs: 20.24 | 31: iteration 133950/ 173500 | consumed samples: 34291200 | consumed tokens: 70228377600 | elapsed time per iteration (s): 0.77 | learning rate: 4.254E-05 | global batch size: 256 | lm loss: 1.948260E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.497 | TFLOPs: 20.18 | 31: iteration 133960/ 173500 | consumed samples: 34293760 | consumed tokens: 70233620480 | elapsed time per iteration (s): 0.77 | learning rate: 4.253E-05 | global batch size: 256 | lm loss: 1.949481E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.273 | TFLOPs: 20.04 | 31: iteration 133970/ 173500 | 
consumed samples: 34296320 | consumed tokens: 70238863360 | elapsed time per iteration (s): 0.72 | learning rate: 4.252E-05 | global batch size: 256 | lm loss: 1.902158E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.522 | TFLOPs: 21.63 | 31: iteration 133980/ 173500 | consumed samples: 34298880 | consumed tokens: 70244106240 | elapsed time per iteration (s): 0.76 | learning rate: 4.251E-05 | global batch size: 256 | lm loss: 1.916367E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.720 | TFLOPs: 20.37 | 31: iteration 133990/ 173500 | consumed samples: 34301440 | consumed tokens: 70249349120 | elapsed time per iteration (s): 0.74 | learning rate: 4.249E-05 | global batch size: 256 | lm loss: 1.927000E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.220 | TFLOPs: 20.88 | 0: [2022-11-27 00:16:23,391] [INFO] [logging.py:68:log_dist] [Rank 0] step=134000, skipped=0, lr=[4.248399618979796e-05, 4.248399618979796e-05, 4.248399618979796e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 134000/ 173500 | consumed samples: 34304000 | consumed tokens: 70254592000 | elapsed time per iteration (s): 0.74 | learning rate: 4.248E-05 | global batch size: 256 | lm loss: 1.950266E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.926 | TFLOPs: 20.93 | 0: steps: 134000 loss: 1.9363 iter time (s): 0.802 samples/sec: 319.259 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 134000 | lm loss value: 1.913822E+00 | lm loss PPL: 6.778951E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 134000 to 
checkpoints_1b1long 0: [2022-11-27 00:16:23,644] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step134000 is begin to save! 0: [2022-11-27 00:16:23,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_01-model_00-model_states.pt... 0: [2022-11-27 00:16:23,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_01-model_00-model_states.pt. 0: [2022-11-27 00:16:23,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_03-model_00-model_states.pt... 0: [2022-11-27 00:16:23,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_03-model_00-model_states.pt. 0: [2022-11-27 00:16:23,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_04-model_00-model_states.pt... 0: [2022-11-27 00:16:24,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_04-model_00-model_states.pt. 0: [2022-11-27 00:16:24,043] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_05-model_00-model_states.pt... 0: [2022-11-27 00:16:24,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_05-model_00-model_states.pt. 0: [2022-11-27 00:16:24,119] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_06-model_00-model_states.pt... 0: [2022-11-27 00:16:24,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_06-model_00-model_states.pt. 0: [2022-11-27 00:16:24,199] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_07-model_00-model_states.pt... 
0: [2022-11-27 00:16:24,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_07-model_00-model_states.pt. 0: [2022-11-27 00:16:24,275] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_08-model_00-model_states.pt... 0: [2022-11-27 00:16:24,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_08-model_00-model_states.pt. 0: [2022-11-27 00:16:24,351] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_09-model_00-model_states.pt... 0: [2022-11-27 00:16:24,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_09-model_00-model_states.pt. 0: [2022-11-27 00:16:24,427] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_10-model_00-model_states.pt... 0: [2022-11-27 00:16:24,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_10-model_00-model_states.pt. 0: [2022-11-27 00:16:24,502] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_11-model_00-model_states.pt... 0: [2022-11-27 00:16:24,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_11-model_00-model_states.pt. 0: [2022-11-27 00:16:24,575] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_12-model_00-model_states.pt... 0: [2022-11-27 00:16:24,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_12-model_00-model_states.pt. 0: [2022-11-27 00:16:24,650] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 00:16:24,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_13-model_00-model_states.pt. 0: [2022-11-27 00:16:24,726] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_14-model_00-model_states.pt... 0: [2022-11-27 00:16:24,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_14-model_00-model_states.pt. 0: [2022-11-27 00:16:24,804] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_15-model_00-model_states.pt... 0: [2022-11-27 00:16:24,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_15-model_00-model_states.pt. 0: [2022-11-27 00:16:24,878] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_16-model_00-model_states.pt... 0: [2022-11-27 00:16:24,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_16-model_00-model_states.pt. 0: [2022-11-27 00:16:24,954] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_17-model_00-model_states.pt... 0: [2022-11-27 00:16:25,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_17-model_00-model_states.pt. 0: [2022-11-27 00:16:25,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_18-model_00-model_states.pt... 0: [2022-11-27 00:16:25,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_18-model_00-model_states.pt. 0: [2022-11-27 00:16:25,106] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 00:16:25,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_19-model_00-model_states.pt. 0: [2022-11-27 00:16:25,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_20-model_00-model_states.pt... 0: [2022-11-27 00:16:25,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_20-model_00-model_states.pt. 0: [2022-11-27 00:16:25,257] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_21-model_00-model_states.pt... 0: [2022-11-27 00:16:25,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_21-model_00-model_states.pt. 0: [2022-11-27 00:16:25,334] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_22-model_00-model_states.pt... 0: [2022-11-27 00:16:25,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_22-model_00-model_states.pt. 0: [2022-11-27 00:16:25,407] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_23-model_00-model_states.pt... 0: [2022-11-27 00:16:25,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_23-model_00-model_states.pt. 0: [2022-11-27 00:16:25,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_24-model_00-model_states.pt... 0: [2022-11-27 00:16:25,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_24-model_00-model_states.pt. 0: [2022-11-27 00:16:25,562] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 00:16:25,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_25-model_00-model_states.pt. 0: [2022-11-27 00:16:25,636] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_26-model_00-model_states.pt... 0: [2022-11-27 00:16:25,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_26-model_00-model_states.pt. 0: [2022-11-27 00:16:25,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_27-model_00-model_states.pt... 0: [2022-11-27 00:16:25,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_27-model_00-model_states.pt. 0: [2022-11-27 00:16:25,787] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_28-model_00-model_states.pt... 0: [2022-11-27 00:16:25,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_28-model_00-model_states.pt. 0: [2022-11-27 00:16:25,861] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/layer_30-model_00-model_states.pt... 0: [2022-11-27 00:16:25,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/layer_30-model_00-model_states.pt. 0: [2022-11-27 00:16:25,866] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step134000/mp_rank_00_model_states.pt 0: [2022-11-27 00:16:25,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/mp_rank_00_model_states.pt... 0: [2022-11-27 00:16:25,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/mp_rank_00_model_states.pt. 
0: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 
30: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 
5: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
9: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
1: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 
3: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 
23: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
31: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 
17: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 
26: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
6: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
4: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
2: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
15: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
11: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 
29: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 
21: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 
27: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
16: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
15: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
14: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 
0: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 
28: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 
28: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:16:25,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:16:25,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:16:25,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:16:25,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 00:16:25,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 10: [2022-11-27 00:16:25,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:16:25,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 00:16:25,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 5: [2022-11-27 00:16:25,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-27 00:16:25,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 00:16:25,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 20: [2022-11-27 00:16:25,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:16:25,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 00:16:25,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 28: [2022-11-27 00:16:25,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 22: [2022-11-27 00:16:25,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 28: [2022-11-27 00:16:25,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 22: [2022-11-27 00:16:25,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 00:16:25,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 21: [2022-11-27 00:16:25,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-27 00:16:25,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 00:16:25,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 18: [2022-11-27 00:16:25,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:16:25,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 00:16:25,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 11: [2022-11-27 00:16:25,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:16:25,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:16:25,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 9: [2022-11-27 00:16:25,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:16:25,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 00:16:25,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 11: [2022-11-27 00:16:25,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 
9: [2022-11-27 00:16:25,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 00:16:25,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 8: [2022-11-27 00:16:26,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:16:26,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 00:16:26,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 11: [2022-11-27 00:16:26,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:16:26,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 00:16:26,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 14: [2022-11-27 00:16:26,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:16:26,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
14: [2022-11-27 00:16:26,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 12: [2022-11-27 00:16:26,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 14: [2022-11-27 00:16:26,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 13: [2022-11-27 00:16:26,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:16:26,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 13: [2022-11-27 00:16:26,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 00:16:26,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 25: [2022-11-27 00:16:26,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:16:26,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-27 00:16:26,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 15: [2022-11-27 00:16:26,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-27 00:16:26,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 00:16:26,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 3: [2022-11-27 00:16:26,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:16:26,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 00:16:26,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 19: [2022-11-27 00:16:26,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:16:26,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 00:16:26,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 27: [2022-11-27 00:16:26,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:16:26,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 00:16:26,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 17: [2022-11-27 00:16:26,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 
2: [2022-11-27 00:16:26,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:16:26,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 2: [2022-11-27 00:16:26,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 14: [2022-11-27 00:16:26,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:16:26,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 14: [2022-11-27 00:16:26,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 17: [2022-11-27 00:16:26,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 14: [2022-11-27 00:16:26,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 1: [2022-11-27 00:16:26,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:16:26,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
1: [2022-11-27 00:16:26,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 18: [2022-11-27 00:16:26,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 00:16:26,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 1: [2022-11-27 00:16:26,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 1: [2022-11-27 00:16:26,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:16:26,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 25: [2022-11-27 00:16:26,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:16:26,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 25: [2022-11-27 00:16:26,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 00:16:26,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 21: [2022-11-27 00:16:26,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-27 00:16:26,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 00:16:26,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 5: [2022-11-27 00:16:26,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:16:26,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 00:16:26,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 3: [2022-11-27 00:16:26,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:16:26,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 00:16:26,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 19: [2022-11-27 00:16:26,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:16:26,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-27 00:16:26,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 2: [2022-11-27 00:16:26,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-27 00:16:26,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 00:16:26,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 9: [2022-11-27 00:16:26,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:16:26,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 00:16:26,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 17: [2022-11-27 00:16:26,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:16:26,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 28: [2022-11-27 00:16:26,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:16:26,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 10: [2022-11-27 00:16:26,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:16:26,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
10: [2022-11-27 00:16:26,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 00:16:26,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 6: [2022-11-27 00:16:26,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 00:16:26,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 24: [2022-11-27 00:16:26,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:16:26,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 00:16:26,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 0: [2022-11-27 00:16:26,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:16:26,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 00:16:26,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 22: [2022-11-27 00:16:26,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:16:26,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
22: [2022-11-27 00:16:26,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 11: [2022-11-27 00:16:26,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 00:16:26,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 22: [2022-11-27 00:16:26,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 25: [2022-11-27 00:16:26,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:16:26,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 31: [2022-11-27 00:16:26,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:16:26,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 31: [2022-11-27 00:16:26,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 00:16:26,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 12: [2022-11-27 00:16:26,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-27 00:16:26,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 00:16:26,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:16:26,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 12: [2022-11-27 00:16:26,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 00:16:26,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 18: [2022-11-27 00:16:26,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:16:26,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 00:16:26,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 9: [2022-11-27 00:16:26,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:16:26,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
9: [2022-11-27 00:16:26,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 26: [2022-11-27 00:16:26,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:16:26,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 26: [2022-11-27 00:16:26,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 16: [2022-11-27 00:16:26,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:16:26,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 26: [2022-11-27 00:16:26,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 16: [2022-11-27 00:16:26,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 00:16:26,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 10: [2022-11-27 00:16:26,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:16:26,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 
10: [2022-11-27 00:16:26,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 00:16:26,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 14: [2022-11-27 00:16:26,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:16:26,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 00:16:26,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 19: [2022-11-27 00:16:26,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:16:26,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 00:16:26,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 26: [2022-11-27 00:16:26,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:16:26,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-27 00:16:26,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 26: [2022-11-27 00:16:26,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 27: [2022-11-27 00:16:26,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 26: [2022-11-27 00:16:26,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 4: [2022-11-27 00:16:26,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:16:26,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 00:16:26,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 15: [2022-11-27 00:16:26,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:16:26,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 00:16:26,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 3: [2022-11-27 00:16:26,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:16:26,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
3: [2022-11-27 00:16:26,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 8: [2022-11-27 00:16:26,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:16:26,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 24: [2022-11-27 00:16:26,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 8: [2022-11-27 00:16:26,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:16:26,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 8: [2022-11-27 00:16:26,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 00:16:26,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 00:16:26,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 8: [2022-11-27 00:16:26,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 31: [2022-11-27 00:16:26,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
31: [2022-11-27 00:16:26,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 00:16:26,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 6: [2022-11-27 00:16:26,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:16:26,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 00:16:26,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 13: [2022-11-27 00:16:26,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:16:26,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 00:16:26,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 28: [2022-11-27 00:16:26,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 00:16:26,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 28: [2022-11-27 00:16:26,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:16:26,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
28: [2022-11-27 00:16:26,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 00:16:26,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 13: [2022-11-27 00:16:26,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 00:16:26,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 5: [2022-11-27 00:16:26,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:16:26,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:16:26,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 00:16:26,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 24: [2022-11-27 00:16:26,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 00:16:26,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 30: [2022-11-27 00:16:26,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:16:26,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
4: [2022-11-27 00:16:26,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:16:26,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 00:16:26,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 30: [2022-11-27 00:16:26,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 00:16:26,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 00:16:26,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 30: [2022-11-27 00:16:26,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 16: [2022-11-27 00:16:26,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:16:26,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 00:16:26,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 7: [2022-11-27 00:16:26,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:16:26,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-27 00:16:26,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:16:26,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 00:16:26,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 00:16:26,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 00:16:26,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 7: [2022-11-27 00:16:26,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 20: [2022-11-27 00:16:26,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:16:26,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:16:26,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 21: [2022-11-27 00:16:26,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 00:16:26,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 20: [2022-11-27 00:16:26,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
20: [2022-11-27 00:16:26,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 22: [2022-11-27 00:16:26,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:16:26,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 20: [2022-11-27 00:16:26,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 00:16:26,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 22: [2022-11-27 00:16:26,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 00:16:26,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 2: [2022-11-27 00:16:26,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:16:26,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 00:16:26,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 15: [2022-11-27 00:16:26,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-27 00:16:26,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 00:16:26,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 22: [2022-11-27 00:16:26,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:16:26,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 00:16:26,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 10: [2022-11-27 00:16:26,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:16:26,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 1: [2022-11-27 00:16:26,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:16:26,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:16:26,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 17: [2022-11-27 00:16:26,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 
1: [2022-11-27 00:16:26,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 00:16:26,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 17: [2022-11-27 00:16:26,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 1: [2022-11-27 00:16:26,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 1: [2022-11-27 00:16:26,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 17: [2022-11-27 00:16:26,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 24: [2022-11-27 00:16:26,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:16:26,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-27 00:16:26,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 0: [2022-11-27 00:16:26,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:16:26,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 00:16:26,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 
27: [2022-11-27 00:16:26,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:16:26,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 00:16:26,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 6: [2022-11-27 00:16:26,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:16:26,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 00:16:26,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 26: [2022-11-27 00:16:26,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:16:26,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 00:16:26,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 14: [2022-11-27 00:16:26,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:16:26,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 00:16:26,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 
11: [2022-11-27 00:16:26,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:16:26,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 9: [2022-11-27 00:16:26,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:16:26,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 9: [2022-11-27 00:16:26,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 00:16:26,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 7: [2022-11-27 00:16:26,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:16:26,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 00:16:26,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 28: [2022-11-27 00:16:26,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:16:26,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:16:26,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-27 00:16:26,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:16:26,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 00:16:26,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 00:16:26,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 00:16:26,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 29: [2022-11-27 00:16:26,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 29: [2022-11-27 00:16:26,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 8: [2022-11-27 00:16:26,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:16:26,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:16:26,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 00:16:26,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 
17: [2022-11-27 00:16:26,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 00:16:26,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 28: [2022-11-27 00:16:26,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 00:16:26,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 25: [2022-11-27 00:16:26,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:16:26,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 00:16:26,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 23: [2022-11-27 00:16:26,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:16:26,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:16:26,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:16:26,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 
16: [2022-11-27 00:16:26,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 23: [2022-11-27 00:16:26,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 00:16:26,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 00:16:26,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 16: [2022-11-27 00:16:26,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 23: [2022-11-27 00:16:26,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 23: [2022-11-27 00:16:26,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 23: [2022-11-27 00:16:26,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 12: [2022-11-27 00:16:26,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:16:26,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 00:16:26,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 18: [2022-11-27 00:16:26,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
18: [2022-11-27 00:16:26,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 00:16:26,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 2: [2022-11-27 00:16:26,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:16:26,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 00:16:26,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 20: [2022-11-27 00:16:26,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:16:26,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 00:16:26,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 27: [2022-11-27 00:16:26,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:16:26,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 00:16:26,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 23: [2022-11-27 00:16:26,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-27 00:16:26,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 00:16:26,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 31: [2022-11-27 00:16:26,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:16:26,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 00:16:26,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 29: [2022-11-27 00:16:26,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:16:26,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 00:16:26,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 3: [2022-11-27 00:16:26,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:16:26,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 00:16:26,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 13: [2022-11-27 00:16:26,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-27 00:16:26,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 00:16:26,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 21: [2022-11-27 00:16:26,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:16:26,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 00:16:26,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 30: [2022-11-27 00:16:26,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:16:26,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 00:16:26,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 5: [2022-11-27 00:16:26,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:16:26,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 00:16:26,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 4: [2022-11-27 00:16:26,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-27 00:16:26,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 00:16:26,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 15: [2022-11-27 00:16:26,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:16:26,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 00:16:26,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 0: [2022-11-27 00:16:26,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:16:26,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:16:26,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 00:16:26,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 0: [2022-11-27 00:16:26,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 00:16:26,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 9: [2022-11-27 00:16:26,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-27 00:16:26,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 00:16:26,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 10: [2022-11-27 00:16:26,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:16:26,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 00:16:26,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 1: [2022-11-27 00:16:26,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:16:26,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 00:16:26,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 22: [2022-11-27 00:16:26,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:16:26,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 00:16:26,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 24: [2022-11-27 00:16:26,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
24: [2022-11-27 00:16:26,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 00:16:26,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 25: [2022-11-27 00:16:26,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:16:26,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 00:16:26,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 6: [2022-11-27 00:16:26,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:16:26,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 00:16:26,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 14: [2022-11-27 00:16:26,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:16:26,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 16: [2022-11-27 00:16:26,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:16:26,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 
16: [2022-11-27 00:16:26,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 11: [2022-11-27 00:16:26,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:16:26,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 11: [2022-11-27 00:16:26,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 00:16:26,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 17: [2022-11-27 00:16:26,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:16:26,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 00:16:26,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 20: [2022-11-27 00:16:26,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:16:26,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 00:16:26,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 31: [2022-11-27 00:16:26,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
31: [2022-11-27 00:16:26,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 00:16:26,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 12: [2022-11-27 00:16:26,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:16:26,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 00:16:26,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 8: [2022-11-27 00:16:26,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:16:26,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 00:16:26,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 18: [2022-11-27 00:16:26,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:16:26,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 00:16:26,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 23: [2022-11-27 00:16:26,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-27 00:16:26,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 00:16:26,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 28: [2022-11-27 00:16:26,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:16:26,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:16:26,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 00:16:26,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 2: [2022-11-27 00:16:26,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:16:26,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 00:16:26,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 7: [2022-11-27 00:16:26,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:16:26,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 00:16:26,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 
29: [2022-11-27 00:16:26,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:16:26,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 00:16:26,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 30: [2022-11-27 00:16:26,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:16:26,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 00:16:26,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 21: [2022-11-27 00:16:26,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:16:26,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:16:26,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 5: [2022-11-27 00:16:26,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 21: [2022-11-27 00:16:26,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 5: [2022-11-27 00:16:26,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 
26: [2022-11-27 00:16:26,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:16:26,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 00:16:26,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 4: [2022-11-27 00:16:26,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:16:26,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 00:16:26,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 9: [2022-11-27 00:16:26,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:16:26,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 00:16:26,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 10: [2022-11-27 00:16:26,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-27 00:16:26,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 19: [2022-11-27 00:16:26,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:16:26,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 19: [2022-11-27 00:16:26,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 00:16:26,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 13: [2022-11-27 00:16:26,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:16:26,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 00:16:26,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 27: [2022-11-27 00:16:26,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:16:26,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 00:16:26,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 22: [2022-11-27 00:16:26,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
22: [2022-11-27 00:16:26,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 12: [2022-11-27 00:16:26,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:16:26,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 12: [2022-11-27 00:16:26,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 00:16:26,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 7: [2022-11-27 00:16:26,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:16:26,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 00:16:26,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 16: [2022-11-27 00:16:26,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:16:26,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 00:16:26,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 15: [2022-11-27 00:16:26,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-27 00:16:26,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 00:16:26,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 28: [2022-11-27 00:16:26,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 00:16:26,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 28: [2022-11-27 00:16:26,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-27 00:16:26,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 00:16:26,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 25: [2022-11-27 00:16:26,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:16:26,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 00:16:26,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 31: [2022-11-27 00:16:26,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-27 00:16:26,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 00:16:26,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 18: [2022-11-27 00:16:26,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:16:26,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 00:16:26,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 3: [2022-11-27 00:16:26,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:16:26,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 00:16:26,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 14: [2022-11-27 00:16:26,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:16:26,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 00:16:26,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 21: [2022-11-27 00:16:26,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
20: [2022-11-27 00:16:26,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:16:26,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 00:16:26,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 20: [2022-11-27 00:16:26,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 00:16:26,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 11: [2022-11-27 00:16:26,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:16:26,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 00:16:26,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 1: [2022-11-27 00:16:26,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:16:26,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
1: [2022-11-27 00:16:26,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 13: [2022-11-27 00:16:26,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 1: [2022-11-27 00:16:26,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 13: [2022-11-27 00:16:26,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 30: [2022-11-27 00:16:26,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:16:26,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 00:16:26,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 5: [2022-11-27 00:16:26,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:16:26,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:16:26,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 00:16:26,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 
19: [2022-11-27 00:16:26,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 8: [2022-11-27 00:16:26,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:16:26,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 19: [2022-11-27 00:16:26,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 8: [2022-11-27 00:16:26,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 24: [2022-11-27 00:16:26,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:16:26,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:16:26,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 24: [2022-11-27 00:16:26,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 00:16:26,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 6: [2022-11-27 00:16:26,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 17: [2022-11-27 00:16:26,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-27 00:16:26,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 00:16:26,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 2: [2022-11-27 00:16:26,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:16:26,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 00:16:26,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 23: [2022-11-27 00:16:26,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:16:26,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 00:16:26,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 4: [2022-11-27 00:16:26,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:16:26,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 00:16:26,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 10: [2022-11-27 00:16:26,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-27 00:16:26,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 00:16:26,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 15: [2022-11-27 00:16:26,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:16:26,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 00:16:26,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 29: [2022-11-27 00:16:26,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:16:26,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 00:16:26,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 28: [2022-11-27 00:16:26,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:16:26,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:16:26,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
16: [2022-11-27 00:16:26,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 00:16:26,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 12: [2022-11-27 00:16:26,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 00:16:26,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 26: [2022-11-27 00:16:26,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:16:26,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 00:16:26,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 9: [2022-11-27 00:16:26,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:16:26,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 00:16:26,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 14: [2022-11-27 00:16:26,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-27 00:16:26,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 00:16:26,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 28: [2022-11-27 00:16:26,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 00:16:26,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 26: [2022-11-27 00:16:26,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:16:26,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 00:16:26,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 27: [2022-11-27 00:16:26,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:16:26,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 00:16:26,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 24: [2022-11-27 00:16:26,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:16:26,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-27 00:16:26,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 24: [2022-11-27 00:16:26,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 7: [2022-11-27 00:16:26,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 24: [2022-11-27 00:16:26,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 0: [2022-11-27 00:16:26,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:16:26,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 00:16:26,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 23: [2022-11-27 00:16:26,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:16:26,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 00:16:26,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 11: [2022-11-27 00:16:26,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-27 00:16:26,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 00:16:26,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 6: [2022-11-27 00:16:26,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:16:26,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 00:16:26,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 0: [2022-11-27 00:16:26,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:16:26,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 00:16:26,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 17: [2022-11-27 00:16:26,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:16:26,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 00:16:26,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 22: [2022-11-27 00:16:26,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
22: [2022-11-27 00:16:26,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 00:16:26,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 1: [2022-11-27 00:16:26,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:16:26,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 8: [2022-11-27 00:16:26,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:16:26,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 8: [2022-11-27 00:16:26,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 00:16:26,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 2: [2022-11-27 00:16:26,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:16:26,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 00:16:26,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 20: [2022-11-27 00:16:26,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 
20: [2022-11-27 00:16:26,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 00:16:26,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 27: [2022-11-27 00:16:26,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:16:26,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:16:26,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 00:16:26,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 18: [2022-11-27 00:16:26,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 00:16:26,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 31: [2022-11-27 00:16:26,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:16:26,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 00:16:26,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 29: [2022-11-27 00:16:26,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-27 00:16:26,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 00:16:26,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 25: [2022-11-27 00:16:26,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:16:26,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:16:26,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 25: [2022-11-27 00:16:26,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 4: [2022-11-27 00:16:26,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 25: [2022-11-27 00:16:26,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 3: [2022-11-27 00:16:26,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:16:26,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 00:16:26,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 21: [2022-11-27 00:16:26,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-27 00:16:26,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 00:16:26,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 30: [2022-11-27 00:16:26,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:16:26,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 00:16:26,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 13: [2022-11-27 00:16:26,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:16:26,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 00:16:26,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 19: [2022-11-27 00:16:26,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:16:26,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 00:16:26,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 5: [2022-11-27 00:16:26,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-27 00:16:26,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 00:16:26,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 15: [2022-11-27 00:16:26,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:16:26,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 00:16:26,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 24: [2022-11-27 00:16:26,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:16:26,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 00:16:26,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 22: [2022-11-27 00:16:26,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:16:26,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 00:16:26,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 11: [2022-11-27 00:16:26,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-27 00:16:26,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 00:16:26,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 8: [2022-11-27 00:16:26,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:16:26,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 00:16:26,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 18: [2022-11-27 00:16:26,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:16:26,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:16:26,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-27 00:16:26,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 14: [2022-11-27 00:16:26,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 00:16:26,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 16: [2022-11-27 00:16:26,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
16: [2022-11-27 00:16:26,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 00:16:26,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 31: [2022-11-27 00:16:26,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:16:26,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 00:16:26,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 1: [2022-11-27 00:16:26,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:16:26,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 00:16:26,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 20: [2022-11-27 00:16:26,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:16:26,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 00:16:26,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 6: [2022-11-27 00:16:26,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
28: [2022-11-27 00:16:26,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:16:26,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 00:16:26,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 28: [2022-11-27 00:16:26,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 00:16:26,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 9: [2022-11-27 00:16:26,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:16:26,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 12: [2022-11-27 00:16:26,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:16:26,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 12: [2022-11-27 00:16:26,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 00:16:26,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 23: [2022-11-27 00:16:26,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-27 00:16:26,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 00:16:26,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 10: [2022-11-27 00:16:26,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:16:26,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 00:16:26,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 27: [2022-11-27 00:16:26,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:16:26,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 00:16:26,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 7: [2022-11-27 00:16:26,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:16:26,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 00:16:26,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 0: [2022-11-27 00:16:26,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-27 00:16:26,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 00:16:26,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 17: [2022-11-27 00:16:26,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:16:26,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 00:16:26,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 29: [2022-11-27 00:16:26,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:16:26,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:16:26,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 00:16:26,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 25: [2022-11-27 00:16:26,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 00:16:26,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 3: [2022-11-27 00:16:26,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-27 00:16:26,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 00:16:26,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 2: [2022-11-27 00:16:26,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:16:26,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 00:16:26,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 4: [2022-11-27 00:16:26,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:16:26,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:16:26,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 00:16:26,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 21: [2022-11-27 00:16:26,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 00:16:26,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 13: [2022-11-27 00:16:26,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
26: [2022-11-27 00:16:26,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:16:26,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 26: [2022-11-27 00:16:26,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 00:16:26,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 13: [2022-11-27 00:16:26,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 19: [2022-11-27 00:16:26,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:16:26,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 00:16:26,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 31: [2022-11-27 00:16:26,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:16:26,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 00:16:26,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 5: [2022-11-27 00:16:26,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-27 00:16:26,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 00:16:26,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 0: [2022-11-27 00:16:26,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:16:26,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 00:16:26,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 4: [2022-11-27 00:16:26,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:16:26,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 00:16:26,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 6: [2022-11-27 00:16:26,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:16:26,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 00:16:26,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 15: [2022-11-27 00:16:26,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-27 00:16:26,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 00:16:26,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 30: [2022-11-27 00:16:26,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:16:26,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 00:16:26,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 30: [2022-11-27 00:16:26,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:16:26,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step134000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 00:16:26,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step134000 is ready now! 
0: successfully saved checkpoint at iteration 134000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2550.13 31: iteration 134010/ 173500 | consumed samples: 34306560 | consumed tokens: 70259834880 | elapsed time per iteration (s): 1.06 | learning rate: 4.247E-05 | global batch size: 256 | lm loss: 1.920226E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.406 | TFLOPs: 14.66 | 31: iteration 134020/ 173500 | consumed samples: 34309120 | consumed tokens: 70265077760 | elapsed time per iteration (s): 0.75 | learning rate: 4.246E-05 | global batch size: 256 | lm loss: 1.924883E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.368 | TFLOPs: 20.77 | 31: iteration 134030/ 173500 | consumed samples: 34311680 | consumed tokens: 70270320640 | elapsed time per iteration (s): 0.81 | learning rate: 4.245E-05 | global batch size: 256 | lm loss: 1.911953E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.356 | TFLOPs: 19.08 | 31: iteration 134040/ 173500 | consumed samples: 34314240 | consumed tokens: 70275563520 | elapsed time per iteration (s): 0.88 | learning rate: 4.244E-05 | global batch size: 256 | lm loss: 1.932428E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.232 | TFLOPs: 17.56 | 31: iteration 134050/ 173500 | consumed samples: 34316800 | consumed tokens: 70280806400 | elapsed time per iteration (s): 0.87 | learning rate: 4.243E-05 | global batch size: 256 | lm loss: 1.912786E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.417 | TFLOPs: 17.81 | 31: iteration 134060/ 173500 | consumed samples: 34319360 | consumed tokens: 70286049280 | elapsed time per iteration (s): 
0.91 | learning rate: 4.242E-05 | global batch size: 256 | lm loss: 1.895316E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.843 | TFLOPs: 17.11 | 31: iteration 134070/ 173500 | consumed samples: 34321920 | consumed tokens: 70291292160 | elapsed time per iteration (s): 0.87 | learning rate: 4.241E-05 | global batch size: 256 | lm loss: 1.933689E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.990 | TFLOPs: 17.79 | 31: iteration 134080/ 173500 | consumed samples: 34324480 | consumed tokens: 70296535040 | elapsed time per iteration (s): 0.95 | learning rate: 4.240E-05 | global batch size: 256 | lm loss: 1.929665E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 269.916 | TFLOPs: 16.33 | 31: iteration 134090/ 173500 | consumed samples: 34327040 | consumed tokens: 70301777920 | elapsed time per iteration (s): 0.95 | learning rate: 4.239E-05 | global batch size: 256 | lm loss: 1.933530E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 268.873 | TFLOPs: 16.27 | 31: iteration 134100/ 173500 | consumed samples: 34329600 | consumed tokens: 70307020800 | elapsed time per iteration (s): 0.78 | learning rate: 4.238E-05 | global batch size: 256 | lm loss: 1.945503E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.682 | TFLOPs: 19.76 | 31: iteration 134110/ 173500 | consumed samples: 34332160 | consumed tokens: 70312263680 | elapsed time per iteration (s): 0.80 | learning rate: 4.236E-05 | global batch size: 256 | lm loss: 1.942353E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.607 | TFLOPs: 19.46 | 31: 
iteration 134120/ 173500 | consumed samples: 34334720 | consumed tokens: 70317506560 | elapsed time per iteration (s): 0.84 | learning rate: 4.235E-05 | global batch size: 256 | lm loss: 1.945147E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.172 | TFLOPs: 18.40 | 31: iteration 134130/ 173500 | consumed samples: 34337280 | consumed tokens: 70322749440 | elapsed time per iteration (s): 0.80 | learning rate: 4.234E-05 | global batch size: 256 | lm loss: 1.940304E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.785 | TFLOPs: 19.41 | 31: iteration 134140/ 173500 | consumed samples: 34339840 | consumed tokens: 70327992320 | elapsed time per iteration (s): 0.82 | learning rate: 4.233E-05 | global batch size: 256 | lm loss: 1.945700E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.009 | TFLOPs: 18.94 | 31: iteration 134150/ 173500 | consumed samples: 34342400 | consumed tokens: 70333235200 | elapsed time per iteration (s): 0.84 | learning rate: 4.232E-05 | global batch size: 256 | lm loss: 1.907722E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.696 | TFLOPs: 18.37 | 31: iteration 134160/ 173500 | consumed samples: 34344960 | consumed tokens: 70338478080 | elapsed time per iteration (s): 0.77 | learning rate: 4.231E-05 | global batch size: 256 | lm loss: 1.930841E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.876 | TFLOPs: 20.14 | 31: iteration 134170/ 173500 | consumed samples: 34347520 | consumed tokens: 70343720960 | elapsed time per iteration (s): 0.81 | learning rate: 4.230E-05 | global batch size: 256 | lm loss: 1.922220E+00 | grad norm: 0.174 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.041 | TFLOPs: 19.06 | 31: iteration 134180/ 173500 | consumed samples: 34350080 | consumed tokens: 70348963840 | elapsed time per iteration (s): 0.78 | learning rate: 4.229E-05 | global batch size: 256 | lm loss: 1.939720E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.671 | TFLOPs: 19.82 | 31: iteration 134190/ 173500 | consumed samples: 34352640 | consumed tokens: 70354206720 | elapsed time per iteration (s): 0.91 | learning rate: 4.228E-05 | global batch size: 256 | lm loss: 1.928126E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 281.713 | TFLOPs: 17.04 | 31: iteration 134200/ 173500 | consumed samples: 34355200 | consumed tokens: 70359449600 | elapsed time per iteration (s): 0.76 | learning rate: 4.227E-05 | global batch size: 256 | lm loss: 1.956024E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.994 | TFLOPs: 20.27 | 31: iteration 134210/ 173500 | consumed samples: 34357760 | consumed tokens: 70364692480 | elapsed time per iteration (s): 0.82 | learning rate: 4.226E-05 | global batch size: 256 | lm loss: 1.911158E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.343 | TFLOPs: 18.84 | 31: iteration 134220/ 173500 | consumed samples: 34360320 | consumed tokens: 70369935360 | elapsed time per iteration (s): 0.74 | learning rate: 4.225E-05 | global batch size: 256 | lm loss: 1.928807E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.200 | TFLOPs: 20.88 | 31: iteration 134230/ 173500 | consumed samples: 34362880 | consumed tokens: 70375178240 | elapsed time per iteration (s): 0.78 | 
learning rate: 4.223E-05 | global batch size: 256 | lm loss: 1.964935E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.398 | TFLOPs: 19.75 | 31: iteration 134240/ 173500 | consumed samples: 34365440 | consumed tokens: 70380421120 | elapsed time per iteration (s): 0.76 | learning rate: 4.222E-05 | global batch size: 256 | lm loss: 1.931533E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.389 | TFLOPs: 20.41 | 31: iteration 134250/ 173500 | consumed samples: 34368000 | consumed tokens: 70385664000 | elapsed time per iteration (s): 0.81 | learning rate: 4.221E-05 | global batch size: 256 | lm loss: 1.944962E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.579 | TFLOPs: 19.21 | 31: iteration 134260/ 173500 | consumed samples: 34370560 | consumed tokens: 70390906880 | elapsed time per iteration (s): 0.77 | learning rate: 4.220E-05 | global batch size: 256 | lm loss: 1.913355E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.470 | TFLOPs: 20.11 | 31: iteration 134270/ 173500 | consumed samples: 34373120 | consumed tokens: 70396149760 | elapsed time per iteration (s): 0.79 | learning rate: 4.219E-05 | global batch size: 256 | lm loss: 1.927242E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.960 | TFLOPs: 19.60 | 31: iteration 134280/ 173500 | consumed samples: 34375680 | consumed tokens: 70401392640 | elapsed time per iteration (s): 0.80 | learning rate: 4.218E-05 | global batch size: 256 | lm loss: 1.940860E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.716 | TFLOPs: 19.40 | 31: iteration 
134290/ 173500 | consumed samples: 34378240 | consumed tokens: 70406635520 | elapsed time per iteration (s): 0.81 | learning rate: 4.217E-05 | global batch size: 256 | lm loss: 1.933189E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.963 | TFLOPs: 19.05 | 31: iteration 134300/ 173500 | consumed samples: 34380800 | consumed tokens: 70411878400 | elapsed time per iteration (s): 0.80 | learning rate: 4.216E-05 | global batch size: 256 | lm loss: 1.935869E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.880 | TFLOPs: 19.35 | 31: iteration 134310/ 173500 | consumed samples: 34383360 | consumed tokens: 70417121280 | elapsed time per iteration (s): 0.81 | learning rate: 4.215E-05 | global batch size: 256 | lm loss: 1.924117E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.979 | TFLOPs: 19.06 | 31: iteration 134320/ 173500 | consumed samples: 34385920 | consumed tokens: 70422364160 | elapsed time per iteration (s): 0.78 | learning rate: 4.214E-05 | global batch size: 256 | lm loss: 1.904236E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.509 | TFLOPs: 19.75 | 31: iteration 134330/ 173500 | consumed samples: 34388480 | consumed tokens: 70427607040 | elapsed time per iteration (s): 0.80 | learning rate: 4.213E-05 | global batch size: 256 | lm loss: 1.937407E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.388 | TFLOPs: 19.26 | 31: iteration 134340/ 173500 | consumed samples: 34391040 | consumed tokens: 70432849920 | elapsed time per iteration (s): 0.89 | learning rate: 4.212E-05 | global batch size: 256 | lm loss: 1.923220E+00 | grad norm: 0.181 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.427 | TFLOPs: 17.33 | 31: iteration 134350/ 173500 | consumed samples: 34393600 | consumed tokens: 70438092800 | elapsed time per iteration (s): 0.80 | learning rate: 4.210E-05 | global batch size: 256 | lm loss: 1.968117E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.668 | TFLOPs: 19.46 | 31: iteration 134360/ 173500 | consumed samples: 34396160 | consumed tokens: 70443335680 | elapsed time per iteration (s): 0.90 | learning rate: 4.209E-05 | global batch size: 256 | lm loss: 1.940817E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.698 | TFLOPs: 17.22 | 31: iteration 134370/ 173500 | consumed samples: 34398720 | consumed tokens: 70448578560 | elapsed time per iteration (s): 0.83 | learning rate: 4.208E-05 | global batch size: 256 | lm loss: 1.933410E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.715 | TFLOPs: 18.56 | 31: iteration 134380/ 173500 | consumed samples: 34401280 | consumed tokens: 70453821440 | elapsed time per iteration (s): 0.81 | learning rate: 4.207E-05 | global batch size: 256 | lm loss: 1.916332E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.577 | TFLOPs: 19.21 | 31: iteration 134390/ 173500 | consumed samples: 34403840 | consumed tokens: 70459064320 | elapsed time per iteration (s): 0.81 | learning rate: 4.206E-05 | global batch size: 256 | lm loss: 1.960054E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.241 | TFLOPs: 19.07 | 31: iteration 134400/ 173500 | consumed samples: 34406400 | consumed tokens: 70464307200 | elapsed time per iteration (s): 0.85 | learning 
rate: 4.205E-05 | global batch size: 256 | lm loss: 1.911329E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.801 | TFLOPs: 18.26 | 31: iteration 134410/ 173500 | consumed samples: 34408960 | consumed tokens: 70469550080 | elapsed time per iteration (s): 0.88 | learning rate: 4.204E-05 | global batch size: 256 | lm loss: 1.940841E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.600 | TFLOPs: 17.64 | 31: iteration 134420/ 173500 | consumed samples: 34411520 | consumed tokens: 70474792960 | elapsed time per iteration (s): 0.80 | learning rate: 4.203E-05 | global batch size: 256 | lm loss: 1.946625E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.025 | TFLOPs: 19.36 | 31: iteration 134430/ 173500 | consumed samples: 34414080 | consumed tokens: 70480035840 | elapsed time per iteration (s): 0.80 | learning rate: 4.202E-05 | global batch size: 256 | lm loss: 1.926551E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.146 | TFLOPs: 19.43 | 31: iteration 134440/ 173500 | consumed samples: 34416640 | consumed tokens: 70485278720 | elapsed time per iteration (s): 0.79 | learning rate: 4.201E-05 | global batch size: 256 | lm loss: 1.926992E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.557 | TFLOPs: 19.70 | 31: iteration 134450/ 173500 | consumed samples: 34419200 | consumed tokens: 70490521600 | elapsed time per iteration (s): 0.90 | learning rate: 4.200E-05 | global batch size: 256 | lm loss: 1.956714E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 283.357 | TFLOPs: 17.14 | 31: iteration 134460/ 
173500 | consumed samples: 34421760 | consumed tokens: 70495764480 | elapsed time per iteration (s): 0.91 | learning rate: 4.199E-05 | global batch size: 256 | lm loss: 1.945075E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.150 | TFLOPs: 17.07 | 31: iteration 134470/ 173500 | consumed samples: 34424320 | consumed tokens: 70501007360 | elapsed time per iteration (s): 0.81 | learning rate: 4.197E-05 | global batch size: 256 | lm loss: 1.922733E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.564 | TFLOPs: 19.21 | 31: iteration 134480/ 173500 | consumed samples: 34426880 | consumed tokens: 70506250240 | elapsed time per iteration (s): 0.90 | learning rate: 4.196E-05 | global batch size: 256 | lm loss: 1.922108E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 283.379 | TFLOPs: 17.14 | 31: iteration 134490/ 173500 | consumed samples: 34429440 | consumed tokens: 70511493120 | elapsed time per iteration (s): 0.80 | learning rate: 4.195E-05 | global batch size: 256 | lm loss: 1.931508E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.033 | TFLOPs: 19.36 | 31: iteration 134500/ 173500 | consumed samples: 34432000 | consumed tokens: 70516736000 | elapsed time per iteration (s): 0.80 | learning rate: 4.194E-05 | global batch size: 256 | lm loss: 1.948060E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.920 | TFLOPs: 19.41 | 31: iteration 134510/ 173500 | consumed samples: 34434560 | consumed tokens: 70521978880 | elapsed time per iteration (s): 0.83 | learning rate: 4.193E-05 | global batch size: 256 | lm loss: 1.939107E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 306.790 | TFLOPs: 18.56 | 31: iteration 134520/ 173500 | consumed samples: 34437120 | consumed tokens: 70527221760 | elapsed time per iteration (s): 0.84 | learning rate: 4.192E-05 | global batch size: 256 | lm loss: 1.944549E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.048 | TFLOPs: 18.52 | 31: iteration 134530/ 173500 | consumed samples: 34439680 | consumed tokens: 70532464640 | elapsed time per iteration (s): 0.82 | learning rate: 4.191E-05 | global batch size: 256 | lm loss: 1.921068E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.441 | TFLOPs: 18.84 | 31: iteration 134540/ 173500 | consumed samples: 34442240 | consumed tokens: 70537707520 | elapsed time per iteration (s): 0.82 | learning rate: 4.190E-05 | global batch size: 256 | lm loss: 1.941338E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.644 | TFLOPs: 18.85 | 31: iteration 134550/ 173500 | consumed samples: 34444800 | consumed tokens: 70542950400 | elapsed time per iteration (s): 0.91 | learning rate: 4.189E-05 | global batch size: 256 | lm loss: 1.930704E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 280.449 | TFLOPs: 16.97 | 31: iteration 134560/ 173500 | consumed samples: 34447360 | consumed tokens: 70548193280 | elapsed time per iteration (s): 0.82 | learning rate: 4.188E-05 | global batch size: 256 | lm loss: 1.910201E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.522 | TFLOPs: 18.97 | 31: iteration 134570/ 173500 | consumed samples: 34449920 | consumed tokens: 70553436160 | elapsed time per iteration (s): 0.79 | learning rate: 
4.187E-05 | global batch size: 256 | lm loss: 1.952325E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.045 | TFLOPs: 19.60 | 31: iteration 134580/ 173500 | consumed samples: 34452480 | consumed tokens: 70558679040 | elapsed time per iteration (s): 0.83 | learning rate: 4.186E-05 | global batch size: 256 | lm loss: 1.944403E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.256 | TFLOPs: 18.71 | 31: iteration 134590/ 173500 | consumed samples: 34455040 | consumed tokens: 70563921920 | elapsed time per iteration (s): 0.84 | learning rate: 4.185E-05 | global batch size: 256 | lm loss: 1.906078E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.640 | TFLOPs: 18.49 | 31: iteration 134600/ 173500 | consumed samples: 34457600 | consumed tokens: 70569164800 | elapsed time per iteration (s): 0.86 | learning rate: 4.183E-05 | global batch size: 256 | lm loss: 1.937232E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.045 | TFLOPs: 18.09 | 31: iteration 134610/ 173500 | consumed samples: 34460160 | consumed tokens: 70574407680 | elapsed time per iteration (s): 0.81 | learning rate: 4.182E-05 | global batch size: 256 | lm loss: 1.950916E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.036 | TFLOPs: 19.18 | 31: iteration 134620/ 173500 | consumed samples: 34462720 | consumed tokens: 70579650560 | elapsed time per iteration (s): 0.75 | learning rate: 4.181E-05 | global batch size: 256 | lm loss: 1.941533E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.162 | TFLOPs: 20.52 | 31: iteration 134630/ 173500 | 
consumed samples: 34465280 | consumed tokens: 70584893440 | elapsed time per iteration (s): 0.79 | learning rate: 4.180E-05 | global batch size: 256 | lm loss: 1.914921E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.836 | TFLOPs: 19.59 | 31: iteration 134640/ 173500 | consumed samples: 34467840 | consumed tokens: 70590136320 | elapsed time per iteration (s): 0.74 | learning rate: 4.179E-05 | global batch size: 256 | lm loss: 1.959717E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.146 | TFLOPs: 21.06 | 31: iteration 134650/ 173500 | consumed samples: 34470400 | consumed tokens: 70595379200 | elapsed time per iteration (s): 0.78 | learning rate: 4.178E-05 | global batch size: 256 | lm loss: 1.925521E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.735 | TFLOPs: 19.77 | 31: iteration 134660/ 173500 | consumed samples: 34472960 | consumed tokens: 70600622080 | elapsed time per iteration (s): 0.76 | learning rate: 4.177E-05 | global batch size: 256 | lm loss: 1.922691E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.840 | TFLOPs: 20.38 | 31: iteration 134670/ 173500 | consumed samples: 34475520 | consumed tokens: 70605864960 | elapsed time per iteration (s): 0.77 | learning rate: 4.176E-05 | global batch size: 256 | lm loss: 1.912652E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.282 | TFLOPs: 20.04 | 31: iteration 134680/ 173500 | consumed samples: 34478080 | consumed tokens: 70611107840 | elapsed time per iteration (s): 0.77 | learning rate: 4.175E-05 | global batch size: 256 | lm loss: 1.912718E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 331.408 | TFLOPs: 20.05 | 31: iteration 134690/ 173500 | consumed samples: 34480640 | consumed tokens: 70616350720 | elapsed time per iteration (s): 0.82 | learning rate: 4.174E-05 | global batch size: 256 | lm loss: 1.946803E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.072 | TFLOPs: 18.94 | 31: iteration 134700/ 173500 | consumed samples: 34483200 | consumed tokens: 70621593600 | elapsed time per iteration (s): 0.77 | learning rate: 4.173E-05 | global batch size: 256 | lm loss: 1.915659E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.824 | TFLOPs: 20.07 | 31: iteration 134710/ 173500 | consumed samples: 34485760 | consumed tokens: 70626836480 | elapsed time per iteration (s): 0.84 | learning rate: 4.172E-05 | global batch size: 256 | lm loss: 1.949903E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.407 | TFLOPs: 18.42 | 31: iteration 134720/ 173500 | consumed samples: 34488320 | consumed tokens: 70632079360 | elapsed time per iteration (s): 0.86 | learning rate: 4.171E-05 | global batch size: 256 | lm loss: 1.942532E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.204 | TFLOPs: 17.98 | 31: iteration 134730/ 173500 | consumed samples: 34490880 | consumed tokens: 70637322240 | elapsed time per iteration (s): 0.80 | learning rate: 4.170E-05 | global batch size: 256 | lm loss: 1.927544E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.375 | TFLOPs: 19.44 | 31: iteration 134740/ 173500 | consumed samples: 34493440 | consumed tokens: 70642565120 | elapsed time per iteration (s): 0.83 | learning rate: 
4.168E-05 | global batch size: 256 | lm loss: 1.945907E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.692 | TFLOPs: 18.74 | 31: iteration 134750/ 173500 | consumed samples: 34496000 | consumed tokens: 70647808000 | elapsed time per iteration (s): 0.75 | learning rate: 4.167E-05 | global batch size: 256 | lm loss: 1.897707E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.361 | TFLOPs: 20.59 | 31: iteration 134760/ 173500 | consumed samples: 34498560 | consumed tokens: 70653050880 | elapsed time per iteration (s): 0.79 | learning rate: 4.166E-05 | global batch size: 256 | lm loss: 1.944129E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.595 | TFLOPs: 19.58 | 31: iteration 134770/ 173500 | consumed samples: 34501120 | consumed tokens: 70658293760 | elapsed time per iteration (s): 0.82 | learning rate: 4.165E-05 | global batch size: 256 | lm loss: 1.928557E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.684 | TFLOPs: 18.80 | 31: iteration 134780/ 173500 | consumed samples: 34503680 | consumed tokens: 70663536640 | elapsed time per iteration (s): 0.80 | learning rate: 4.164E-05 | global batch size: 256 | lm loss: 1.952168E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.664 | TFLOPs: 19.28 | 31: iteration 134790/ 173500 | consumed samples: 34506240 | consumed tokens: 70668779520 | elapsed time per iteration (s): 0.79 | learning rate: 4.163E-05 | global batch size: 256 | lm loss: 1.917107E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.881 | TFLOPs: 19.71 | 31: iteration 134800/ 173500 | 
consumed samples: 34508800 | consumed tokens: 70674022400 | elapsed time per iteration (s): 0.87 | learning rate: 4.162E-05 | global batch size: 256 | lm loss: 1.916603E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.338 | TFLOPs: 17.87 | 31: iteration 134810/ 173500 | consumed samples: 34511360 | consumed tokens: 70679265280 | elapsed time per iteration (s): 0.77 | learning rate: 4.161E-05 | global batch size: 256 | lm loss: 1.938294E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.835 | TFLOPs: 20.14 | 31: iteration 134820/ 173500 | consumed samples: 34513920 | consumed tokens: 70684508160 | elapsed time per iteration (s): 0.78 | learning rate: 4.160E-05 | global batch size: 256 | lm loss: 1.935040E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.423 | TFLOPs: 19.75 | 31: iteration 134830/ 173500 | consumed samples: 34516480 | consumed tokens: 70689751040 | elapsed time per iteration (s): 0.74 | learning rate: 4.159E-05 | global batch size: 256 | lm loss: 1.915174E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.063 | TFLOPs: 20.88 | 31: iteration 134840/ 173500 | consumed samples: 34519040 | consumed tokens: 70694993920 | elapsed time per iteration (s): 0.77 | learning rate: 4.158E-05 | global batch size: 256 | lm loss: 1.932806E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.078 | TFLOPs: 20.21 | 31: iteration 134850/ 173500 | consumed samples: 34521600 | consumed tokens: 70700236800 | elapsed time per iteration (s): 0.74 | learning rate: 4.157E-05 | global batch size: 256 | lm loss: 1.937056E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 343.646 | TFLOPs: 20.79 | 31: iteration 134860/ 173500 | consumed samples: 34524160 | consumed tokens: 70705479680 | elapsed time per iteration (s): 0.75 | learning rate: 4.156E-05 | global batch size: 256 | lm loss: 1.924966E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.603 | TFLOPs: 20.73 | 31: iteration 134870/ 173500 | consumed samples: 34526720 | consumed tokens: 70710722560 | elapsed time per iteration (s): 0.77 | learning rate: 4.155E-05 | global batch size: 256 | lm loss: 1.919029E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.100 | TFLOPs: 20.03 | 31: iteration 134880/ 173500 | consumed samples: 34529280 | consumed tokens: 70715965440 | elapsed time per iteration (s): 0.76 | learning rate: 4.153E-05 | global batch size: 256 | lm loss: 1.898610E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.134 | TFLOPs: 20.40 | 31: iteration 134890/ 173500 | consumed samples: 34531840 | consumed tokens: 70721208320 | elapsed time per iteration (s): 0.82 | learning rate: 4.152E-05 | global batch size: 256 | lm loss: 1.938093E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.464 | TFLOPs: 18.96 | 31: iteration 134900/ 173500 | consumed samples: 34534400 | consumed tokens: 70726451200 | elapsed time per iteration (s): 0.76 | learning rate: 4.151E-05 | global batch size: 256 | lm loss: 1.915199E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.362 | TFLOPs: 20.41 | 31: iteration 134910/ 173500 | consumed samples: 34536960 | consumed tokens: 70731694080 | elapsed time per iteration (s): 0.78 | learning rate: 
4.150E-05 | global batch size: 256 | lm loss: 1.954022E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.944 | TFLOPs: 19.90 | 31: iteration 134920/ 173500 | consumed samples: 34539520 | consumed tokens: 70736936960 | elapsed time per iteration (s): 0.78 | learning rate: 4.149E-05 | global batch size: 256 | lm loss: 1.901018E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.928 | TFLOPs: 19.96 | 31: iteration 134930/ 173500 | consumed samples: 34542080 | consumed tokens: 70742179840 | elapsed time per iteration (s): 0.79 | learning rate: 4.148E-05 | global batch size: 256 | lm loss: 1.940559E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.294 | TFLOPs: 19.62 | 31: iteration 134940/ 173500 | consumed samples: 34544640 | consumed tokens: 70747422720 | elapsed time per iteration (s): 0.79 | learning rate: 4.147E-05 | global batch size: 256 | lm loss: 1.942160E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.046 | TFLOPs: 19.54 | 31: iteration 134950/ 173500 | consumed samples: 34547200 | consumed tokens: 70752665600 | elapsed time per iteration (s): 0.75 | learning rate: 4.146E-05 | global batch size: 256 | lm loss: 1.943558E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.707 | TFLOPs: 20.67 | 31: iteration 134960/ 173500 | consumed samples: 34549760 | consumed tokens: 70757908480 | elapsed time per iteration (s): 0.74 | learning rate: 4.145E-05 | global batch size: 256 | lm loss: 1.928564E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.224 | TFLOPs: 20.95 | 31: iteration 134970/ 173500 | 
consumed samples: 34552320 | consumed tokens: 70763151360 | elapsed time per iteration (s): 0.72 | learning rate: 4.144E-05 | global batch size: 256 | lm loss: 1.910190E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.516 | TFLOPs: 21.57 | 31: iteration 134980/ 173500 | consumed samples: 34554880 | consumed tokens: 70768394240 | elapsed time per iteration (s): 0.80 | learning rate: 4.143E-05 | global batch size: 256 | lm loss: 1.937199E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.386 | TFLOPs: 19.38 | 31: iteration 134990/ 173500 | consumed samples: 34557440 | consumed tokens: 70773637120 | elapsed time per iteration (s): 0.75 | learning rate: 4.142E-05 | global batch size: 256 | lm loss: 1.940005E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.644 | TFLOPs: 20.55 | 31: iteration 135000/ 173500 | consumed samples: 34560000 | consumed tokens: 70778880000 | elapsed time per iteration (s): 0.90 | learning rate: 4.141E-05 | global batch size: 256 | lm loss: 1.953914E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 283.854 | TFLOPs: 17.17 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 135000 | lm loss value: 1.965875E+00 | lm loss PPL: 7.141160E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 135000 to checkpoints_1b1long 0: [2022-11-27 00:29:56,047] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step135000 is begin to save! 
0: [2022-11-27 00:29:56,059] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_01-model_00-model_states.pt... 0: [2022-11-27 00:29:56,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_01-model_00-model_states.pt. 0: [2022-11-27 00:29:56,280] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_03-model_00-model_states.pt... 0: [2022-11-27 00:29:56,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_03-model_00-model_states.pt. 0: [2022-11-27 00:29:56,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_04-model_00-model_states.pt... 0: [2022-11-27 00:29:56,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_04-model_00-model_states.pt. 0: [2022-11-27 00:29:56,442] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_05-model_00-model_states.pt... 0: [2022-11-27 00:29:56,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_05-model_00-model_states.pt. 0: [2022-11-27 00:29:56,524] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_06-model_00-model_states.pt... 0: [2022-11-27 00:29:56,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_06-model_00-model_states.pt. 0: [2022-11-27 00:29:56,598] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_07-model_00-model_states.pt... 0: [2022-11-27 00:29:56,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_07-model_00-model_states.pt. 
0: [2022-11-27 00:29:56,675] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_08-model_00-model_states.pt... 0: [2022-11-27 00:29:56,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_08-model_00-model_states.pt. 0: [2022-11-27 00:29:56,750] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_09-model_00-model_states.pt... 0: [2022-11-27 00:29:56,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_09-model_00-model_states.pt. 0: [2022-11-27 00:29:56,824] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_10-model_00-model_states.pt... 0: [2022-11-27 00:29:56,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_10-model_00-model_states.pt. 0: [2022-11-27 00:29:56,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_11-model_00-model_states.pt... 0: [2022-11-27 00:29:56,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_11-model_00-model_states.pt. 0: [2022-11-27 00:29:56,976] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_12-model_00-model_states.pt... 0: [2022-11-27 00:29:57,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_12-model_00-model_states.pt. 0: [2022-11-27 00:29:57,054] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_13-model_00-model_states.pt... 0: [2022-11-27 00:29:57,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_13-model_00-model_states.pt. 
0: [2022-11-27 00:29:57,127] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_14-model_00-model_states.pt... 0: [2022-11-27 00:29:57,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_14-model_00-model_states.pt. 0: [2022-11-27 00:29:57,205] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_15-model_00-model_states.pt... 0: [2022-11-27 00:29:57,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_15-model_00-model_states.pt. 0: [2022-11-27 00:29:57,281] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_16-model_00-model_states.pt... 0: [2022-11-27 00:29:57,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_16-model_00-model_states.pt. 0: [2022-11-27 00:29:57,356] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_17-model_00-model_states.pt... 0: [2022-11-27 00:29:57,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_17-model_00-model_states.pt. 0: [2022-11-27 00:29:57,432] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_18-model_00-model_states.pt... 0: [2022-11-27 00:29:57,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_18-model_00-model_states.pt. 0: [2022-11-27 00:29:57,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_19-model_00-model_states.pt... 0: [2022-11-27 00:29:57,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_19-model_00-model_states.pt. 
0: [2022-11-27 00:29:57,585] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_20-model_00-model_states.pt... 0: [2022-11-27 00:29:57,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_20-model_00-model_states.pt. 0: [2022-11-27 00:29:57,659] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_21-model_00-model_states.pt... 0: [2022-11-27 00:29:57,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_21-model_00-model_states.pt. 0: [2022-11-27 00:29:57,736] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_22-model_00-model_states.pt... 0: [2022-11-27 00:29:57,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_22-model_00-model_states.pt. 0: [2022-11-27 00:29:57,820] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_23-model_00-model_states.pt... 0: [2022-11-27 00:29:57,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_23-model_00-model_states.pt. 0: [2022-11-27 00:29:57,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_24-model_00-model_states.pt... 0: [2022-11-27 00:29:57,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_24-model_00-model_states.pt. 0: [2022-11-27 00:29:57,971] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_25-model_00-model_states.pt... 0: [2022-11-27 00:29:58,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_25-model_00-model_states.pt. 
0: [2022-11-27 00:29:58,045] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_26-model_00-model_states.pt... 0: [2022-11-27 00:29:58,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_26-model_00-model_states.pt. 0: [2022-11-27 00:29:58,121] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_27-model_00-model_states.pt... 0: [2022-11-27 00:29:58,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_27-model_00-model_states.pt. 0: [2022-11-27 00:29:58,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_28-model_00-model_states.pt... 0: [2022-11-27 00:29:58,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_28-model_00-model_states.pt. 0: [2022-11-27 00:29:58,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/layer_30-model_00-model_states.pt... 0: [2022-11-27 00:29:58,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/layer_30-model_00-model_states.pt. 0: [2022-11-27 00:29:58,274] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step135000/mp_rank_00_model_states.pt 0: [2022-11-27 00:29:58,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/mp_rank_00_model_states.pt... 0: [2022-11-27 00:29:58,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/mp_rank_00_model_states.pt. 0: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
0: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
10: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
15: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 
14: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 
30: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 
19: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
7: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
8: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 
2: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
20: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 
28: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 
29: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 
17: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
26: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
6: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
10: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 
25: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 
14: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 
18: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 
8: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 
23: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 
27: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 
21: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 
31: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:29:58,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:29:58,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:29:58,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:29:58,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 00:29:58,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 00:29:58,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 30: [2022-11-27 00:29:58,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 28: [2022-11-27 00:29:58,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:29:58,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-27 00:29:58,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 00:29:58,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 22: [2022-11-27 00:29:58,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:29:58,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 00:29:58,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 25: [2022-11-27 00:29:58,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:29:58,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:29:58,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:29:58,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 00:29:58,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 
10: [2022-11-27 00:29:58,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 25: [2022-11-27 00:29:58,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 10: [2022-11-27 00:29:58,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 25: [2022-11-27 00:29:58,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 11: [2022-11-27 00:29:58,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:29:58,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 00:29:58,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 19: [2022-11-27 00:29:58,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:29:58,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:29:58,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 21: [2022-11-27 00:29:58,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 19: [2022-11-27 00:29:58,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 
21: [2022-11-27 00:29:58,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 29: [2022-11-27 00:29:58,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:29:58,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 00:29:58,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 5: [2022-11-27 00:29:58,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:29:58,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 00:29:58,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 18: [2022-11-27 00:29:58,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:29:58,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 00:29:58,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 24: [2022-11-27 00:29:58,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-27 00:29:58,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 19: [2022-11-27 00:29:58,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:29:58,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 24: [2022-11-27 00:29:58,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 19: [2022-11-27 00:29:58,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 28: [2022-11-27 00:29:58,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 00:29:58,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 28: [2022-11-27 00:29:58,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-27 00:29:58,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 00:29:58,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 10: [2022-11-27 00:29:58,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-27 00:29:58,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 00:29:58,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 0: [2022-11-27 00:29:58,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:29:58,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:29:58,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:29:58,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 00:29:58,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 25: [2022-11-27 00:29:58,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 00:29:58,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 24: [2022-11-27 00:29:58,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:29:58,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
16: [2022-11-27 00:29:58,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 00:29:58,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 24: [2022-11-27 00:29:58,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 00:29:58,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 30: [2022-11-27 00:29:58,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:29:58,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:29:58,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:29:58,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 30: [2022-11-27 00:29:58,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 4: [2022-11-27 00:29:58,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 31: [2022-11-27 00:29:58,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 30: [2022-11-27 00:29:58,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 
31: [2022-11-27 00:29:58,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 17: [2022-11-27 00:29:58,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:29:58,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:29:58,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 00:29:58,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 00:29:58,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 17: [2022-11-27 00:29:58,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 3: [2022-11-27 00:29:58,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:29:58,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 00:29:58,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 4: [2022-11-27 00:29:58,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:29:58,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
4: [2022-11-27 00:29:58,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 00:29:58,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 23: [2022-11-27 00:29:58,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 00:29:58,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 23: [2022-11-27 00:29:58,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:29:58,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 18: [2022-11-27 00:29:58,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:29:58,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 23: [2022-11-27 00:29:58,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 18: [2022-11-27 00:29:58,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 31: [2022-11-27 00:29:58,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 
31: [2022-11-27 00:29:58,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 00:29:58,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 9: [2022-11-27 00:29:58,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:29:58,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 00:29:58,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 18: [2022-11-27 00:29:58,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:29:58,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:29:58,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 12: [2022-11-27 00:29:58,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 18: [2022-11-27 00:29:58,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 12: [2022-11-27 00:29:58,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 7: [2022-11-27 00:29:58,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-27 00:29:58,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 00:29:58,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 9: [2022-11-27 00:29:58,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:29:58,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 25: [2022-11-27 00:29:58,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:29:58,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 25: [2022-11-27 00:29:58,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 00:29:58,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 19: [2022-11-27 00:29:58,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:29:58,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
19: [2022-11-27 00:29:58,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 3: [2022-11-27 00:29:58,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 19: [2022-11-27 00:29:58,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 3: [2022-11-27 00:29:58,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 7: [2022-11-27 00:29:58,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:29:58,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:29:58,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 3: [2022-11-27 00:29:58,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 7: [2022-11-27 00:29:58,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 3: [2022-11-27 00:29:58,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 4: [2022-11-27 00:29:58,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-27 00:29:58,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 00:29:58,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 22: [2022-11-27 00:29:58,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:29:58,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 00:29:58,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 21: [2022-11-27 00:29:58,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:29:58,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:29:58,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 00:29:58,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 21: [2022-11-27 00:29:58,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 00:29:58,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 10: [2022-11-27 00:29:58,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-27 00:29:58,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 00:29:58,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 29: [2022-11-27 00:29:58,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:29:58,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 00:29:58,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 28: [2022-11-27 00:29:58,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:29:58,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:29:58,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 00:29:58,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 28: [2022-11-27 00:29:58,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 00:29:58,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 11: [2022-11-27 00:29:58,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-27 00:29:58,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 00:29:58,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 23: [2022-11-27 00:29:58,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:29:58,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 9: [2022-11-27 00:29:58,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:29:58,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:29:58,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 9: [2022-11-27 00:29:58,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 11: [2022-11-27 00:29:58,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 9: [2022-11-27 00:29:58,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 12: [2022-11-27 00:29:58,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:29:58,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 
12: [2022-11-27 00:29:58,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 23: [2022-11-27 00:29:58,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:29:58,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 23: [2022-11-27 00:29:58,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 00:29:58,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 30: [2022-11-27 00:29:58,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:29:58,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 6: [2022-11-27 00:29:58,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:29:58,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 6: [2022-11-27 00:29:58,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 00:29:58,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 17: [2022-11-27 00:29:58,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 
24: [2022-11-27 00:29:58,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:29:58,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 6: [2022-11-27 00:29:58,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:29:58,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:29:58,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 3: [2022-11-27 00:29:58,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 00:29:58,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 12: [2022-11-27 00:29:58,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:29:58,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 6: [2022-11-27 00:29:58,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 
24: [2022-11-27 00:29:58,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 6: [2022-11-27 00:29:58,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:29:58,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 24: [2022-11-27 00:29:58,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 4: [2022-11-27 00:29:58,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:29:58,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 6: [2022-11-27 00:29:58,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 4: [2022-11-27 00:29:58,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 6: [2022-11-27 00:29:58,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 4: [2022-11-27 00:29:58,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 0: [2022-11-27 00:29:58,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-27 00:29:58,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 00:29:58,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:29:58,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 0: [2022-11-27 00:29:58,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 00:29:58,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 29: [2022-11-27 00:29:58,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:29:58,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 00:29:58,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 24: [2022-11-27 00:29:58,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:29:58,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 27: [2022-11-27 00:29:58,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
16: [2022-11-27 00:29:58,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:29:58,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 24: [2022-11-27 00:29:58,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 27: [2022-11-27 00:29:58,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 16: [2022-11-27 00:29:58,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 13: [2022-11-27 00:29:58,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:29:58,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:29:58,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:29:58,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:29:58,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 
13: [2022-11-27 00:29:58,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 00:29:58,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 00:29:58,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 12: [2022-11-27 00:29:58,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:29:58,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 13: [2022-11-27 00:29:58,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 00:29:58,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 13: [2022-11-27 00:29:58,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 12: [2022-11-27 00:29:58,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 13: [2022-11-27 00:29:58,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 12: [2022-11-27 00:29:58,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 18: [2022-11-27 00:29:58,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-27 00:29:58,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 00:29:58,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 19: [2022-11-27 00:29:58,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:29:58,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 00:29:58,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 27: [2022-11-27 00:29:58,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:29:58,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:29:58,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 00:29:58,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 27: [2022-11-27 00:29:58,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 00:29:58,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 21: [2022-11-27 00:29:58,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-27 00:29:58,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 00:29:58,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 25: [2022-11-27 00:29:58,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:29:58,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 00:29:58,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 31: [2022-11-27 00:29:58,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:29:58,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:29:58,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 9: [2022-11-27 00:29:58,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 00:29:58,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 31: [2022-11-27 00:29:58,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 27: [2022-11-27 00:29:58,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
31: [2022-11-27 00:29:58,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:29:58,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 00:29:58,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 31: [2022-11-27 00:29:58,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 11: [2022-11-27 00:29:58,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:29:58,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 11: [2022-11-27 00:29:58,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 00:29:58,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 2: [2022-11-27 00:29:58,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:29:58,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:29:58,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:29:58,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-27 00:29:58,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 00:29:58,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 00:29:58,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 00:29:58,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 00:29:58,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 2: [2022-11-27 00:29:58,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 2: [2022-11-27 00:29:58,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 2: [2022-11-27 00:29:58,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 16: [2022-11-27 00:29:58,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:29:58,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
16: [2022-11-27 00:29:58,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 00:29:58,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 00:29:58,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 16: [2022-11-27 00:29:58,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 14: [2022-11-27 00:29:58,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:29:58,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:29:58,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:29:58,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 00:29:58,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 00:29:58,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 00:29:58,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 14: [2022-11-27 00:29:58,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 
14: [2022-11-27 00:29:58,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 10: [2022-11-27 00:29:58,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:29:58,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 00:29:58,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 7: [2022-11-27 00:29:58,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:29:58,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 00:29:58,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 17: [2022-11-27 00:29:58,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:29:58,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 00:29:58,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 15: [2022-11-27 00:29:58,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:29:58,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-27 00:29:58,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:29:58,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:29:58,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 00:29:58,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 00:29:58,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 00:29:58,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 00:29:58,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 15: [2022-11-27 00:29:58,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 15: [2022-11-27 00:29:58,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 15: [2022-11-27 00:29:58,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 0: [2022-11-27 00:29:58,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 00:29:58,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 
28: [2022-11-27 00:29:58,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:29:58,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:29:58,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 00:29:58,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 1: [2022-11-27 00:29:58,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:29:58,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 00:29:58,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 8: [2022-11-27 00:29:58,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:29:58,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:29:58,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-27 00:29:58,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 00:29:58,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 00:29:58,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 00:29:58,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 8: [2022-11-27 00:29:58,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 8: [2022-11-27 00:29:58,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 8: [2022-11-27 00:29:58,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:29:58,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 00:29:58,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 28: [2022-11-27 00:29:58,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 00:29:58,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 1: [2022-11-27 00:29:58,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-27 00:29:58,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 00:29:58,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 11: [2022-11-27 00:29:58,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:29:58,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 00:29:58,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 5: [2022-11-27 00:29:58,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:29:58,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 00:29:58,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 5: [2022-11-27 00:29:58,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:29:58,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 00:29:58,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 5: [2022-11-27 00:29:58,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-27 00:29:58,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 00:29:58,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 5: [2022-11-27 00:29:58,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:29:58,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 00:29:58,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 18: [2022-11-27 00:29:58,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:29:58,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-27 00:29:58,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 17: [2022-11-27 00:29:58,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:29:58,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 00:29:58,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 10: [2022-11-27 00:29:58,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-27 00:29:58,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 00:29:58,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 7: [2022-11-27 00:29:58,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:29:58,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 00:29:58,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 24: [2022-11-27 00:29:58,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:29:58,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-27 00:29:58,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 26: [2022-11-27 00:29:58,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:29:58,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:29:58,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
26: [2022-11-27 00:29:58,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:29:58,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:29:58,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 00:29:58,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 00:29:58,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-27 00:29:58,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 00:29:58,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 00:29:58,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 26: [2022-11-27 00:29:58,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 26: [2022-11-27 00:29:58,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 26: [2022-11-27 00:29:58,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 26: [2022-11-27 00:29:58,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 
13: [2022-11-27 00:29:58,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:29:58,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 00:29:58,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 30: [2022-11-27 00:29:58,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:29:58,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 00:29:58,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 20: [2022-11-27 00:29:58,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:29:58,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:29:58,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:29:58,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:29:58,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
20: [2022-11-27 00:29:58,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 00:29:58,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 00:29:58,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 00:29:58,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 00:29:58,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 00:29:58,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 20: [2022-11-27 00:29:58,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 20: [2022-11-27 00:29:58,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 20: [2022-11-27 00:29:58,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 20: [2022-11-27 00:29:58,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 29: [2022-11-27 00:29:58,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
29: [2022-11-27 00:29:58,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 00:29:58,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 22: [2022-11-27 00:29:58,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:29:58,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 00:29:58,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 25: [2022-11-27 00:29:58,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:29:58,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:29:58,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 6: [2022-11-27 00:29:58,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 00:29:58,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 25: [2022-11-27 00:29:58,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 31: [2022-11-27 00:29:58,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-27 00:29:58,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 00:29:58,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 23: [2022-11-27 00:29:58,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:29:58,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 00:29:58,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 28: [2022-11-27 00:29:58,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-27 00:29:58,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 00:29:58,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 19: [2022-11-27 00:29:58,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:29:58,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 00:29:58,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 12: [2022-11-27 00:29:58,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-27 00:29:58,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 00:29:58,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 21: [2022-11-27 00:29:58,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:29:58,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 00:29:58,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 4: [2022-11-27 00:29:58,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:29:58,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 00:29:58,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 14: [2022-11-27 00:29:58,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:29:58,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 00:29:58,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 15: [2022-11-27 00:29:58,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-27 00:29:58,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 00:29:58,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 2: [2022-11-27 00:29:58,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:29:58,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 00:29:58,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 0: [2022-11-27 00:29:58,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:29:58,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 00:29:58,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 1: [2022-11-27 00:29:58,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:29:58,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 00:29:58,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 3: [2022-11-27 00:29:58,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-27 00:29:58,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 00:29:58,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 8: [2022-11-27 00:29:58,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:29:58,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 00:29:58,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 16: [2022-11-27 00:29:58,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:29:58,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 00:29:58,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 9: [2022-11-27 00:29:58,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:29:58,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:29:58,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 00:29:58,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 
27: [2022-11-27 00:29:58,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 00:29:58,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 11: [2022-11-27 00:29:58,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:29:58,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 00:29:58,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 26: [2022-11-27 00:29:58,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:29:58,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 00:29:58,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 20: [2022-11-27 00:29:58,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:29:58,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 00:29:58,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 17: [2022-11-27 00:29:58,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 
17: [2022-11-27 00:29:58,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 00:29:58,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 5: [2022-11-27 00:29:58,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:29:58,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 00:29:58,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 10: [2022-11-27 00:29:58,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:29:58,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 00:29:58,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 7: [2022-11-27 00:29:58,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:29:58,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
7: [2022-11-27 00:29:58,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 22: [2022-11-27 00:29:58,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 7: [2022-11-27 00:29:58,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 22: [2022-11-27 00:29:58,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 24: [2022-11-27 00:29:58,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:29:58,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 29: [2022-11-27 00:29:58,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:29:58,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 29: [2022-11-27 00:29:58,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 00:29:58,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 13: [2022-11-27 00:29:58,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-27 00:29:58,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 00:29:58,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 23: [2022-11-27 00:29:58,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:29:58,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:29:58,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 00:29:58,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 31: [2022-11-27 00:29:58,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:29:58,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 31: [2022-11-27 00:29:58,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 18: [2022-11-27 00:29:58,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 31: [2022-11-27 00:29:58,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 30: [2022-11-27 00:29:58,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-27 00:29:58,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 00:29:58,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 0: [2022-11-27 00:29:58,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:29:58,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 00:29:58,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 25: [2022-11-27 00:29:58,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:29:58,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 00:29:58,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 6: [2022-11-27 00:29:58,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:29:58,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 00:29:58,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 28: [2022-11-27 00:29:58,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
19: [2022-11-27 00:29:58,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 28: [2022-11-27 00:29:58,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 19: [2022-11-27 00:29:58,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 00:29:58,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 28: [2022-11-27 00:29:58,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 21: [2022-11-27 00:29:58,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:29:58,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 00:29:58,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 14: [2022-11-27 00:29:58,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:29:58,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 00:29:58,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 1: [2022-11-27 00:29:58,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-27 00:29:58,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 00:29:58,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 15: [2022-11-27 00:29:58,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:29:58,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 00:29:58,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 12: [2022-11-27 00:29:58,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:29:58,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 00:29:58,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 3: [2022-11-27 00:29:58,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:29:58,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 00:29:58,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 4: [2022-11-27 00:29:58,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-27 00:29:58,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 00:29:58,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 2: [2022-11-27 00:29:58,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:29:58,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 00:29:58,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 9: [2022-11-27 00:29:58,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:29:58,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 00:29:58,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 16: [2022-11-27 00:29:58,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:29:58,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 00:29:58,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 8: [2022-11-27 00:29:58,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-27 00:29:58,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 00:29:58,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 27: [2022-11-27 00:29:58,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:29:58,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 00:29:58,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 11: [2022-11-27 00:29:58,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:29:58,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 00:29:58,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 26: [2022-11-27 00:29:58,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:29:58,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 00:29:58,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 5: [2022-11-27 00:29:58,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-27 00:29:58,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 00:29:58,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 10: [2022-11-27 00:29:58,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:29:58,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 00:29:58,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 24: [2022-11-27 00:29:58,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:29:58,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 00:29:58,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 0: [2022-11-27 00:29:58,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:29:58,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 00:29:58,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 30: [2022-11-27 00:29:58,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 
30: [2022-11-27 00:29:58,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 00:29:58,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 20: [2022-11-27 00:29:58,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:29:58,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 00:29:58,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 13: [2022-11-27 00:29:58,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:29:58,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 00:29:58,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 22: [2022-11-27 00:29:58,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:29:58,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 00:29:58,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 17: [2022-11-27 00:29:58,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-27 00:29:58,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 00:29:58,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 7: [2022-11-27 00:29:58,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:29:58,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 00:29:58,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 29: [2022-11-27 00:29:58,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:29:58,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 00:29:58,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 23: [2022-11-27 00:29:58,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:29:58,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 00:29:58,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 31: [2022-11-27 00:29:58,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
31: [2022-11-27 00:29:58,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 00:29:58,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 18: [2022-11-27 00:29:58,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:29:58,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 00:29:58,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 19: [2022-11-27 00:29:58,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:29:58,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 00:29:58,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 25: [2022-11-27 00:29:58,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:29:58,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
25: [2022-11-27 00:29:58,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 6: [2022-11-27 00:29:58,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 00:29:58,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 25: [2022-11-27 00:29:58,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 21: [2022-11-27 00:29:58,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:29:58,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 00:29:58,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 28: [2022-11-27 00:29:58,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-27 00:29:58,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 00:29:58,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 14: [2022-11-27 00:29:58,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-27 00:29:58,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 00:29:58,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 12: [2022-11-27 00:29:58,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:29:58,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 00:29:58,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 4: [2022-11-27 00:29:58,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:29:58,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 00:29:58,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 15: [2022-11-27 00:29:58,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:29:58,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 00:29:58,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 16: [2022-11-27 00:29:58,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
1: [2022-11-27 00:29:58,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:29:58,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 00:29:58,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 1: [2022-11-27 00:29:58,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 00:29:58,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 3: [2022-11-27 00:29:58,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:29:58,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 00:29:58,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 8: [2022-11-27 00:29:58,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:29:58,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 00:29:58,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 0: [2022-11-27 00:29:58,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-27 00:29:58,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 00:29:58,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 26: [2022-11-27 00:29:58,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:29:58,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 9: [2022-11-27 00:29:58,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:29:58,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 9: [2022-11-27 00:29:58,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 00:29:58,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 2: [2022-11-27 00:29:58,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:29:58,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 00:29:58,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 10: [2022-11-27 00:29:58,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-27 00:29:58,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 00:29:58,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 23: [2022-11-27 00:29:58,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:29:58,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 11: [2022-11-27 00:29:58,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:29:58,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:29:58,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 11: [2022-11-27 00:29:58,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 00:29:58,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 27: [2022-11-27 00:29:58,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 00:29:58,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 6: [2022-11-27 00:29:58,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-27 00:29:58,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 00:29:58,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 30: [2022-11-27 00:29:58,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:29:58,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 00:29:58,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 18: [2022-11-27 00:29:58,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:29:58,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:29:58,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 00:29:58,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 17: [2022-11-27 00:29:58,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 00:29:58,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 7: [2022-11-27 00:29:58,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-27 00:29:58,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 31: [2022-11-27 00:29:58,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:29:58,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 31: [2022-11-27 00:29:58,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 00:29:58,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 29: [2022-11-27 00:29:58,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:29:58,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:29:58,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 00:29:58,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 5: [2022-11-27 00:29:58,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 00:29:58,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 28: [2022-11-27 00:29:58,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
22: [2022-11-27 00:29:58,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:29:58,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 00:29:58,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 20: [2022-11-27 00:29:58,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:29:58,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 24: [2022-11-27 00:29:58,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:29:58,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 24: [2022-11-27 00:29:58,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-27 00:29:58,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 28: [2022-11-27 00:29:58,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 00:29:58,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 19: [2022-11-27 00:29:58,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
19: [2022-11-27 00:29:58,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 00:29:58,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 25: [2022-11-27 00:29:58,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:29:58,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:29:58,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 25: [2022-11-27 00:29:58,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 4: [2022-11-27 00:29:58,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 25: [2022-11-27 00:29:58,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 14: [2022-11-27 00:29:58,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:29:58,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 00:29:58,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 12: [2022-11-27 00:29:58,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-27 00:29:58,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 00:29:58,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 15: [2022-11-27 00:29:58,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:29:58,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 00:29:58,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 13: [2022-11-27 00:29:58,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:29:58,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 00:29:58,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 9: [2022-11-27 00:29:58,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:29:58,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 00:29:58,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 21: [2022-11-27 00:29:58,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
21: [2022-11-27 00:29:58,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 00:29:58,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 2: [2022-11-27 00:29:58,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:29:58,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 00:29:58,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 3: [2022-11-27 00:29:58,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:29:58,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 00:29:58,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 8: [2022-11-27 00:29:58,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:29:58,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 00:29:58,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 16: [2022-11-27 00:29:58,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 
16: [2022-11-27 00:29:58,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 00:29:58,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 27: [2022-11-27 00:29:58,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:29:58,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 00:29:58,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 6: [2022-11-27 00:29:58,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:29:58,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 00:29:58,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 14: [2022-11-27 00:29:58,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:29:58,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 00:29:58,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 1: [2022-11-27 00:29:58,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-27 00:29:58,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:29:58,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 00:29:58,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 1: [2022-11-27 00:29:58,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 00:29:58,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 27: [2022-11-27 00:29:58,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:29:58,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step135000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 00:29:58,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step135000 is ready now! 
0: successfully saved checkpoint at iteration 135000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2560.48 31: iteration 135010/ 173500 | consumed samples: 34562560 | consumed tokens: 70784122880 | elapsed time per iteration (s): 1.09 | learning rate: 4.140E-05 | global batch size: 256 | lm loss: 1.943752E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.553 | TFLOPs: 14.19 | 31: iteration 135020/ 173500 | consumed samples: 34565120 | consumed tokens: 70789365760 | elapsed time per iteration (s): 0.84 | learning rate: 4.139E-05 | global batch size: 256 | lm loss: 1.943094E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.822 | TFLOPs: 18.50 | 31: iteration 135030/ 173500 | consumed samples: 34567680 | consumed tokens: 70794608640 | elapsed time per iteration (s): 0.81 | learning rate: 4.137E-05 | global batch size: 256 | lm loss: 1.911803E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.169 | TFLOPs: 19.01 | 31: iteration 135040/ 173500 | consumed samples: 34570240 | consumed tokens: 70799851520 | elapsed time per iteration (s): 0.80 | learning rate: 4.136E-05 | global batch size: 256 | lm loss: 1.932791E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.854 | TFLOPs: 19.47 | 31: iteration 135050/ 173500 | consumed samples: 34572800 | consumed tokens: 70805094400 | elapsed time per iteration (s): 0.83 | learning rate: 4.135E-05 | global batch size: 256 | lm loss: 1.917232E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.162 | TFLOPs: 18.70 | 31: iteration 135060/ 173500 | consumed samples: 34575360 | consumed tokens: 70810337280 | elapsed time per iteration (s): 
0.81 | learning rate: 4.134E-05 | global batch size: 256 | lm loss: 1.912434E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.001 | TFLOPs: 19.12 | 31: iteration 135070/ 173500 | consumed samples: 34577920 | consumed tokens: 70815580160 | elapsed time per iteration (s): 0.82 | learning rate: 4.133E-05 | global batch size: 256 | lm loss: 1.938245E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.398 | TFLOPs: 18.96 | 31: iteration 135080/ 173500 | consumed samples: 34580480 | consumed tokens: 70820823040 | elapsed time per iteration (s): 0.84 | learning rate: 4.132E-05 | global batch size: 256 | lm loss: 1.925097E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.028 | TFLOPs: 18.39 | 31: iteration 135090/ 173500 | consumed samples: 34583040 | consumed tokens: 70826065920 | elapsed time per iteration (s): 0.83 | learning rate: 4.131E-05 | global batch size: 256 | lm loss: 1.933135E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.869 | TFLOPs: 18.69 | 31: iteration 135100/ 173500 | consumed samples: 34585600 | consumed tokens: 70831308800 | elapsed time per iteration (s): 0.84 | learning rate: 4.130E-05 | global batch size: 256 | lm loss: 1.958247E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.479 | TFLOPs: 18.48 | 31: iteration 135110/ 173500 | consumed samples: 34588160 | consumed tokens: 70836551680 | elapsed time per iteration (s): 0.83 | learning rate: 4.129E-05 | global batch size: 256 | lm loss: 1.949237E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.868 | TFLOPs: 18.75 | 31: 
iteration 135120/ 173500 | consumed samples: 34590720 | consumed tokens: 70841794560 | elapsed time per iteration (s): 0.79 | learning rate: 4.128E-05 | global batch size: 256 | lm loss: 1.934714E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.090 | TFLOPs: 19.73 | 31: iteration 135130/ 173500 | consumed samples: 34593280 | consumed tokens: 70847037440 | elapsed time per iteration (s): 0.83 | learning rate: 4.127E-05 | global batch size: 256 | lm loss: 1.911506E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.662 | TFLOPs: 18.55 | 31: iteration 135140/ 173500 | consumed samples: 34595840 | consumed tokens: 70852280320 | elapsed time per iteration (s): 0.81 | learning rate: 4.126E-05 | global batch size: 256 | lm loss: 1.918438E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.950 | TFLOPs: 19.11 | 31: iteration 135150/ 173500 | consumed samples: 34598400 | consumed tokens: 70857523200 | elapsed time per iteration (s): 0.79 | learning rate: 4.125E-05 | global batch size: 256 | lm loss: 1.943012E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.049 | TFLOPs: 19.54 | 31: iteration 135160/ 173500 | consumed samples: 34600960 | consumed tokens: 70862766080 | elapsed time per iteration (s): 0.84 | learning rate: 4.124E-05 | global batch size: 256 | lm loss: 1.905230E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.902 | TFLOPs: 18.45 | 31: iteration 135170/ 173500 | consumed samples: 34603520 | consumed tokens: 70868008960 | elapsed time per iteration (s): 0.74 | learning rate: 4.123E-05 | global batch size: 256 | lm loss: 1.973958E+00 | grad norm: 0.184 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.213 | TFLOPs: 20.82 | 31: iteration 135180/ 173500 | consumed samples: 34606080 | consumed tokens: 70873251840 | elapsed time per iteration (s): 0.79 | learning rate: 4.122E-05 | global batch size: 256 | lm loss: 1.908105E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.805 | TFLOPs: 19.65 | 31: iteration 135190/ 173500 | consumed samples: 34608640 | consumed tokens: 70878494720 | elapsed time per iteration (s): 0.74 | learning rate: 4.120E-05 | global batch size: 256 | lm loss: 1.951372E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.060 | TFLOPs: 20.81 | 31: iteration 135200/ 173500 | consumed samples: 34611200 | consumed tokens: 70883737600 | elapsed time per iteration (s): 0.80 | learning rate: 4.119E-05 | global batch size: 256 | lm loss: 1.934382E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.704 | TFLOPs: 19.46 | 31: iteration 135210/ 173500 | consumed samples: 34613760 | consumed tokens: 70888980480 | elapsed time per iteration (s): 0.77 | learning rate: 4.118E-05 | global batch size: 256 | lm loss: 1.964726E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.382 | TFLOPs: 20.05 | 31: iteration 135220/ 173500 | consumed samples: 34616320 | consumed tokens: 70894223360 | elapsed time per iteration (s): 0.80 | learning rate: 4.117E-05 | global batch size: 256 | lm loss: 1.907314E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.318 | TFLOPs: 19.38 | 31: iteration 135230/ 173500 | consumed samples: 34618880 | consumed tokens: 70899466240 | elapsed time per iteration (s): 0.92 | 
learning rate: 4.116E-05 | global batch size: 256 | lm loss: 1.901478E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 277.421 | TFLOPs: 16.78 | 31: iteration 135240/ 173500 | consumed samples: 34621440 | consumed tokens: 70904709120 | elapsed time per iteration (s): 0.76 | learning rate: 4.115E-05 | global batch size: 256 | lm loss: 1.948249E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.357 | TFLOPs: 20.29 | 31: iteration 135250/ 173500 | consumed samples: 34624000 | consumed tokens: 70909952000 | elapsed time per iteration (s): 0.74 | learning rate: 4.114E-05 | global batch size: 256 | lm loss: 1.928002E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.179 | TFLOPs: 20.88 | 31: iteration 135260/ 173500 | consumed samples: 34626560 | consumed tokens: 70915194880 | elapsed time per iteration (s): 0.76 | learning rate: 4.113E-05 | global batch size: 256 | lm loss: 1.943855E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.654 | TFLOPs: 20.49 | 31: iteration 135270/ 173500 | consumed samples: 34629120 | consumed tokens: 70920437760 | elapsed time per iteration (s): 0.74 | learning rate: 4.112E-05 | global batch size: 256 | lm loss: 1.948193E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.634 | TFLOPs: 20.91 | 31: iteration 135280/ 173500 | consumed samples: 34631680 | consumed tokens: 70925680640 | elapsed time per iteration (s): 0.75 | learning rate: 4.111E-05 | global batch size: 256 | lm loss: 1.934777E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.113 | TFLOPs: 20.76 | 31: iteration 
135290/ 173500 | consumed samples: 34634240 | consumed tokens: 70930923520 | elapsed time per iteration (s): 0.74 | learning rate: 4.110E-05 | global batch size: 256 | lm loss: 1.956810E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.311 | TFLOPs: 20.95 | 31: iteration 135300/ 173500 | consumed samples: 34636800 | consumed tokens: 70936166400 | elapsed time per iteration (s): 0.77 | learning rate: 4.109E-05 | global batch size: 256 | lm loss: 1.915508E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.431 | TFLOPs: 20.11 | 31: iteration 135310/ 173500 | consumed samples: 34639360 | consumed tokens: 70941409280 | elapsed time per iteration (s): 1.00 | learning rate: 4.108E-05 | global batch size: 256 | lm loss: 1.921541E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 256.097 | TFLOPs: 15.49 | 31: iteration 135320/ 173500 | consumed samples: 34641920 | consumed tokens: 70946652160 | elapsed time per iteration (s): 0.80 | learning rate: 4.107E-05 | global batch size: 256 | lm loss: 1.967009E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.383 | TFLOPs: 19.26 | 31: iteration 135330/ 173500 | consumed samples: 34644480 | consumed tokens: 70951895040 | elapsed time per iteration (s): 0.78 | learning rate: 4.106E-05 | global batch size: 256 | lm loss: 1.914191E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.105 | TFLOPs: 19.91 | 31: iteration 135340/ 173500 | consumed samples: 34647040 | consumed tokens: 70957137920 | elapsed time per iteration (s): 0.81 | learning rate: 4.105E-05 | global batch size: 256 | lm loss: 1.928394E+00 | grad norm: 0.181 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.398 | TFLOPs: 19.02 | 31: iteration 135350/ 173500 | consumed samples: 34649600 | consumed tokens: 70962380800 | elapsed time per iteration (s): 0.77 | learning rate: 4.104E-05 | global batch size: 256 | lm loss: 1.945465E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.617 | TFLOPs: 20.06 | 31: iteration 135360/ 173500 | consumed samples: 34652160 | consumed tokens: 70967623680 | elapsed time per iteration (s): 0.84 | learning rate: 4.102E-05 | global batch size: 256 | lm loss: 1.946286E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.939 | TFLOPs: 18.45 | 31: iteration 135370/ 173500 | consumed samples: 34654720 | consumed tokens: 70972866560 | elapsed time per iteration (s): 0.79 | learning rate: 4.101E-05 | global batch size: 256 | lm loss: 1.972472E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.439 | TFLOPs: 19.51 | 31: iteration 135380/ 173500 | consumed samples: 34657280 | consumed tokens: 70978109440 | elapsed time per iteration (s): 0.85 | learning rate: 4.100E-05 | global batch size: 256 | lm loss: 1.911955E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.929 | TFLOPs: 18.33 | 31: iteration 135390/ 173500 | consumed samples: 34659840 | consumed tokens: 70983352320 | elapsed time per iteration (s): 0.74 | learning rate: 4.099E-05 | global batch size: 256 | lm loss: 1.945942E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.314 | TFLOPs: 20.89 | 31: iteration 135400/ 173500 | consumed samples: 34662400 | consumed tokens: 70988595200 | elapsed time per iteration (s): 0.75 | learning 
rate: 4.098E-05 | global batch size: 256 | lm loss: 1.905491E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.562 | TFLOPs: 20.78 | 31: iteration 135410/ 173500 | consumed samples: 34664960 | consumed tokens: 70993838080 | elapsed time per iteration (s): 0.73 | learning rate: 4.097E-05 | global batch size: 256 | lm loss: 1.920288E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.474 | TFLOPs: 21.20 | 31: iteration 135420/ 173500 | consumed samples: 34667520 | consumed tokens: 70999080960 | elapsed time per iteration (s): 0.74 | learning rate: 4.096E-05 | global batch size: 256 | lm loss: 1.960797E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.795 | TFLOPs: 20.80 | 31: iteration 135430/ 173500 | consumed samples: 34670080 | consumed tokens: 71004323840 | elapsed time per iteration (s): 0.81 | learning rate: 4.095E-05 | global batch size: 256 | lm loss: 1.939698E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.085 | TFLOPs: 19.18 | 31: iteration 135440/ 173500 | consumed samples: 34672640 | consumed tokens: 71009566720 | elapsed time per iteration (s): 2.43 | learning rate: 4.094E-05 | global batch size: 256 | lm loss: 1.930646E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 105.255 | TFLOPs: 6.37 | 31: iteration 135450/ 173500 | consumed samples: 34675200 | consumed tokens: 71014809600 | elapsed time per iteration (s): 0.75 | learning rate: 4.093E-05 | global batch size: 256 | lm loss: 1.905714E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.605 | TFLOPs: 20.55 | 31: iteration 135460/ 
173500 | consumed samples: 34677760 | consumed tokens: 71020052480 | elapsed time per iteration (s): 0.73 | learning rate: 4.092E-05 | global batch size: 256 | lm loss: 1.911882E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.863 | TFLOPs: 21.29 | 31: iteration 135470/ 173500 | consumed samples: 34680320 | consumed tokens: 71025295360 | elapsed time per iteration (s): 0.80 | learning rate: 4.091E-05 | global batch size: 256 | lm loss: 1.916034E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.971 | TFLOPs: 19.36 | 31: iteration 135480/ 173500 | consumed samples: 34682880 | consumed tokens: 71030538240 | elapsed time per iteration (s): 0.88 | learning rate: 4.090E-05 | global batch size: 256 | lm loss: 1.927320E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.619 | TFLOPs: 17.52 | 31: iteration 135490/ 173500 | consumed samples: 34685440 | consumed tokens: 71035781120 | elapsed time per iteration (s): 0.93 | learning rate: 4.089E-05 | global batch size: 256 | lm loss: 1.941447E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 274.269 | TFLOPs: 16.59 | 31: iteration 135500/ 173500 | consumed samples: 34688000 | consumed tokens: 71041024000 | elapsed time per iteration (s): 0.83 | learning rate: 4.088E-05 | global batch size: 256 | lm loss: 1.891168E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.716 | TFLOPs: 18.68 | 31: iteration 135510/ 173500 | consumed samples: 34690560 | consumed tokens: 71046266880 | elapsed time per iteration (s): 0.84 | learning rate: 4.087E-05 | global batch size: 256 | lm loss: 1.949449E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 304.072 | TFLOPs: 18.40 | 31: iteration 135520/ 173500 | consumed samples: 34693120 | consumed tokens: 71051509760 | elapsed time per iteration (s): 0.76 | learning rate: 4.086E-05 | global batch size: 256 | lm loss: 1.953055E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.609 | TFLOPs: 20.30 | 31: iteration 135530/ 173500 | consumed samples: 34695680 | consumed tokens: 71056752640 | elapsed time per iteration (s): 0.75 | learning rate: 4.085E-05 | global batch size: 256 | lm loss: 1.926837E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.453 | TFLOPs: 20.78 | 31: iteration 135540/ 173500 | consumed samples: 34698240 | consumed tokens: 71061995520 | elapsed time per iteration (s): 0.94 | learning rate: 4.083E-05 | global batch size: 256 | lm loss: 1.957504E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 271.001 | TFLOPs: 16.39 | 31: iteration 135550/ 173500 | consumed samples: 34700800 | consumed tokens: 71067238400 | elapsed time per iteration (s): 0.75 | learning rate: 4.082E-05 | global batch size: 256 | lm loss: 1.946545E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.171 | TFLOPs: 20.70 | 31: iteration 135560/ 173500 | consumed samples: 34703360 | consumed tokens: 71072481280 | elapsed time per iteration (s): 0.77 | learning rate: 4.081E-05 | global batch size: 256 | lm loss: 1.935900E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.462 | TFLOPs: 20.23 | 31: iteration 135570/ 173500 | consumed samples: 34705920 | consumed tokens: 71077724160 | elapsed time per iteration (s): 0.79 | learning rate: 
4.080E-05 | global batch size: 256 | lm loss: 1.927644E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.756 | TFLOPs: 19.53 | 31: iteration 135580/ 173500 | consumed samples: 34708480 | consumed tokens: 71082967040 | elapsed time per iteration (s): 0.74 | learning rate: 4.079E-05 | global batch size: 256 | lm loss: 1.937057E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.862 | TFLOPs: 20.92 | 31: iteration 135590/ 173500 | consumed samples: 34711040 | consumed tokens: 71088209920 | elapsed time per iteration (s): 0.78 | learning rate: 4.078E-05 | global batch size: 256 | lm loss: 1.923276E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.007 | TFLOPs: 19.96 | 31: iteration 135600/ 173500 | consumed samples: 34713600 | consumed tokens: 71093452800 | elapsed time per iteration (s): 0.75 | learning rate: 4.077E-05 | global batch size: 256 | lm loss: 1.938819E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.065 | TFLOPs: 20.57 | 31: iteration 135610/ 173500 | consumed samples: 34716160 | consumed tokens: 71098695680 | elapsed time per iteration (s): 0.77 | learning rate: 4.076E-05 | global batch size: 256 | lm loss: 1.904028E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.103 | TFLOPs: 20.15 | 31: iteration 135620/ 173500 | consumed samples: 34718720 | consumed tokens: 71103938560 | elapsed time per iteration (s): 0.90 | learning rate: 4.075E-05 | global batch size: 256 | lm loss: 1.918452E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.911 | TFLOPs: 17.24 | 31: iteration 135630/ 173500 | 
consumed samples: 34721280 | consumed tokens: 71109181440 | elapsed time per iteration (s): 0.75 | learning rate: 4.074E-05 | global batch size: 256 | lm loss: 1.942828E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.535 | TFLOPs: 20.66 | 31: iteration 135640/ 173500 | consumed samples: 34723840 | consumed tokens: 71114424320 | elapsed time per iteration (s): 0.73 | learning rate: 4.073E-05 | global batch size: 256 | lm loss: 1.950531E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.591 | TFLOPs: 21.09 | 31: iteration 135650/ 173500 | consumed samples: 34726400 | consumed tokens: 71119667200 | elapsed time per iteration (s): 0.74 | learning rate: 4.072E-05 | global batch size: 256 | lm loss: 1.910294E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.015 | TFLOPs: 21.05 | 31: iteration 135660/ 173500 | consumed samples: 34728960 | consumed tokens: 71124910080 | elapsed time per iteration (s): 0.78 | learning rate: 4.071E-05 | global batch size: 256 | lm loss: 1.930642E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.932 | TFLOPs: 19.90 | 31: iteration 135670/ 173500 | consumed samples: 34731520 | consumed tokens: 71130152960 | elapsed time per iteration (s): 0.76 | learning rate: 4.070E-05 | global batch size: 256 | lm loss: 1.939998E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.658 | TFLOPs: 20.31 | 31: iteration 135680/ 173500 | consumed samples: 34734080 | consumed tokens: 71135395840 | elapsed time per iteration (s): 0.75 | learning rate: 4.069E-05 | global batch size: 256 | lm loss: 1.946118E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 341.988 | TFLOPs: 20.69 | 31: iteration 135690/ 173500 | consumed samples: 34736640 | consumed tokens: 71140638720 | elapsed time per iteration (s): 0.76 | learning rate: 4.068E-05 | global batch size: 256 | lm loss: 1.917517E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.174 | TFLOPs: 20.28 | 31: iteration 135700/ 173500 | consumed samples: 34739200 | consumed tokens: 71145881600 | elapsed time per iteration (s): 0.78 | learning rate: 4.067E-05 | global batch size: 256 | lm loss: 1.913850E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.094 | TFLOPs: 19.79 | 31: iteration 135710/ 173500 | consumed samples: 34741760 | consumed tokens: 71151124480 | elapsed time per iteration (s): 0.76 | learning rate: 4.066E-05 | global batch size: 256 | lm loss: 1.923515E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.637 | TFLOPs: 20.43 | 31: iteration 135720/ 173500 | consumed samples: 34744320 | consumed tokens: 71156367360 | elapsed time per iteration (s): 0.77 | learning rate: 4.065E-05 | global batch size: 256 | lm loss: 1.910818E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.266 | TFLOPs: 20.16 | 31: iteration 135730/ 173500 | consumed samples: 34746880 | consumed tokens: 71161610240 | elapsed time per iteration (s): 0.73 | learning rate: 4.064E-05 | global batch size: 256 | lm loss: 1.923014E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.596 | TFLOPs: 21.33 | 31: iteration 135740/ 173500 | consumed samples: 34749440 | consumed tokens: 71166853120 | elapsed time per iteration (s): 0.75 | learning rate: 
4.062E-05 | global batch size: 256 | lm loss: 1.950109E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.658 | TFLOPs: 20.73 | 31: iteration 135750/ 173500 | consumed samples: 34752000 | consumed tokens: 71172096000 | elapsed time per iteration (s): 0.77 | learning rate: 4.061E-05 | global batch size: 256 | lm loss: 1.929317E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.612 | TFLOPs: 20.18 | 31: iteration 135760/ 173500 | consumed samples: 34754560 | consumed tokens: 71177338880 | elapsed time per iteration (s): 0.76 | learning rate: 4.060E-05 | global batch size: 256 | lm loss: 1.962927E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.195 | TFLOPs: 20.28 | 31: iteration 135770/ 173500 | consumed samples: 34757120 | consumed tokens: 71182581760 | elapsed time per iteration (s): 0.85 | learning rate: 4.059E-05 | global batch size: 256 | lm loss: 1.943884E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.487 | TFLOPs: 18.24 | 31: iteration 135780/ 173500 | consumed samples: 34759680 | consumed tokens: 71187824640 | elapsed time per iteration (s): 0.79 | learning rate: 4.058E-05 | global batch size: 256 | lm loss: 1.940767E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.397 | TFLOPs: 19.50 | 31: iteration 135790/ 173500 | consumed samples: 34762240 | consumed tokens: 71193067520 | elapsed time per iteration (s): 0.84 | learning rate: 4.057E-05 | global batch size: 256 | lm loss: 1.904091E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.014 | TFLOPs: 18.39 | 31: iteration 135800/ 173500 | 
consumed samples: 34764800 | consumed tokens: 71198310400 | elapsed time per iteration (s): 0.82 | learning rate: 4.056E-05 | global batch size: 256 | lm loss: 1.905643E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.357 | TFLOPs: 18.96 | 31: iteration 135810/ 173500 | consumed samples: 34767360 | consumed tokens: 71203553280 | elapsed time per iteration (s): 0.84 | learning rate: 4.055E-05 | global batch size: 256 | lm loss: 1.923046E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.352 | TFLOPs: 18.41 | 31: iteration 135820/ 173500 | consumed samples: 34769920 | consumed tokens: 71208796160 | elapsed time per iteration (s): 0.79 | learning rate: 4.054E-05 | global batch size: 256 | lm loss: 1.911409E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.439 | TFLOPs: 19.57 | 31: iteration 135830/ 173500 | consumed samples: 34772480 | consumed tokens: 71214039040 | elapsed time per iteration (s): 0.81 | learning rate: 4.053E-05 | global batch size: 256 | lm loss: 1.906107E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.466 | TFLOPs: 19.02 | 31: iteration 135840/ 173500 | consumed samples: 34775040 | consumed tokens: 71219281920 | elapsed time per iteration (s): 0.78 | learning rate: 4.052E-05 | global batch size: 256 | lm loss: 1.926362E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.694 | TFLOPs: 19.76 | 31: iteration 135850/ 173500 | consumed samples: 34777600 | consumed tokens: 71224524800 | elapsed time per iteration (s): 0.83 | learning rate: 4.051E-05 | global batch size: 256 | lm loss: 1.921083E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 308.336 | TFLOPs: 18.65 | 31: iteration 135860/ 173500 | consumed samples: 34780160 | consumed tokens: 71229767680 | elapsed time per iteration (s): 0.81 | learning rate: 4.050E-05 | global batch size: 256 | lm loss: 1.917074E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.680 | TFLOPs: 19.22 | 31: iteration 135870/ 173500 | consumed samples: 34782720 | consumed tokens: 71235010560 | elapsed time per iteration (s): 0.84 | learning rate: 4.049E-05 | global batch size: 256 | lm loss: 1.946414E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.034 | TFLOPs: 18.51 | 31: iteration 135880/ 173500 | consumed samples: 34785280 | consumed tokens: 71240253440 | elapsed time per iteration (s): 0.74 | learning rate: 4.048E-05 | global batch size: 256 | lm loss: 1.936379E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.208 | TFLOPs: 21.01 | 31: iteration 135890/ 173500 | consumed samples: 34787840 | consumed tokens: 71245496320 | elapsed time per iteration (s): 0.82 | learning rate: 4.047E-05 | global batch size: 256 | lm loss: 1.928883E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.761 | TFLOPs: 18.86 | 31: iteration 135900/ 173500 | consumed samples: 34790400 | consumed tokens: 71250739200 | elapsed time per iteration (s): 0.74 | learning rate: 4.046E-05 | global batch size: 256 | lm loss: 1.915844E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.946 | TFLOPs: 20.99 | 31: iteration 135910/ 173500 | consumed samples: 34792960 | consumed tokens: 71255982080 | elapsed time per iteration (s): 0.76 | learning rate: 
4.045E-05 | global batch size: 256 | lm loss: 1.935275E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.983 | TFLOPs: 20.51 | 31: iteration 135920/ 173500 | consumed samples: 34795520 | consumed tokens: 71261224960 | elapsed time per iteration (s): 0.79 | learning rate: 4.044E-05 | global batch size: 256 | lm loss: 1.943435E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.349 | TFLOPs: 19.62 | 31: iteration 135930/ 173500 | consumed samples: 34798080 | consumed tokens: 71266467840 | elapsed time per iteration (s): 0.88 | learning rate: 4.043E-05 | global batch size: 256 | lm loss: 1.895297E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.133 | TFLOPs: 17.61 | 31: iteration 135940/ 173500 | consumed samples: 34800640 | consumed tokens: 71271710720 | elapsed time per iteration (s): 0.77 | learning rate: 4.042E-05 | global batch size: 256 | lm loss: 1.926900E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.061 | TFLOPs: 20.03 | 31: iteration 135950/ 173500 | consumed samples: 34803200 | consumed tokens: 71276953600 | elapsed time per iteration (s): 0.79 | learning rate: 4.040E-05 | global batch size: 256 | lm loss: 1.925842E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.802 | TFLOPs: 19.65 | 31: iteration 135960/ 173500 | consumed samples: 34805760 | consumed tokens: 71282196480 | elapsed time per iteration (s): 0.75 | learning rate: 4.039E-05 | global batch size: 256 | lm loss: 1.910440E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.050 | TFLOPs: 20.57 | 31: iteration 135970/ 173500 | 
consumed samples: 34808320 | consumed tokens: 71287439360 | elapsed time per iteration (s): 0.74 | learning rate: 4.038E-05 | global batch size: 256 | lm loss: 1.889736E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.254 | TFLOPs: 20.89 | 31: iteration 135980/ 173500 | consumed samples: 34810880 | consumed tokens: 71292682240 | elapsed time per iteration (s): 0.77 | learning rate: 4.037E-05 | global batch size: 256 | lm loss: 1.922272E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.348 | TFLOPs: 20.17 | 31: iteration 135990/ 173500 | consumed samples: 34813440 | consumed tokens: 71297925120 | elapsed time per iteration (s): 0.78 | learning rate: 4.036E-05 | global batch size: 256 | lm loss: 1.911992E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.550 | TFLOPs: 19.88 | 0: [2022-11-27 00:43:29,167] [INFO] [logging.py:68:log_dist] [Rank 0] step=136000, skipped=0, lr=[4.035272599944626e-05, 4.035272599944626e-05, 4.035272599944626e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 136000/ 173500 | consumed samples: 34816000 | consumed tokens: 71303168000 | elapsed time per iteration (s): 0.90 | learning rate: 4.035E-05 | global batch size: 256 | lm loss: 1.936479E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.037 | TFLOPs: 17.24 | 0: steps: 136000 loss: 1.9018 iter time (s): 0.808 samples/sec: 316.979 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 136000 | lm loss value: 1.979792E+00 | lm loss PPL: 7.241238E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 136000 to 
checkpoints_1b1long 0: [2022-11-27 00:43:29,448] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step136000 is begin to save! 0: [2022-11-27 00:43:29,457] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_01-model_00-model_states.pt... 0: [2022-11-27 00:43:29,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_01-model_00-model_states.pt. 0: [2022-11-27 00:43:29,678] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_03-model_00-model_states.pt... 0: [2022-11-27 00:43:29,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_03-model_00-model_states.pt. 0: [2022-11-27 00:43:29,759] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_04-model_00-model_states.pt... 0: [2022-11-27 00:43:29,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_04-model_00-model_states.pt. 0: [2022-11-27 00:43:29,841] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_05-model_00-model_states.pt... 0: [2022-11-27 00:43:29,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_05-model_00-model_states.pt. 0: [2022-11-27 00:43:29,920] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_06-model_00-model_states.pt... 0: [2022-11-27 00:43:30,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_06-model_00-model_states.pt. 0: [2022-11-27 00:43:30,003] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_07-model_00-model_states.pt... 
0: [2022-11-27 00:43:30,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_07-model_00-model_states.pt. 0: [2022-11-27 00:43:30,081] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_08-model_00-model_states.pt... 0: [2022-11-27 00:43:30,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_08-model_00-model_states.pt. 0: [2022-11-27 00:43:30,157] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_09-model_00-model_states.pt... 0: [2022-11-27 00:43:30,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_09-model_00-model_states.pt. 0: [2022-11-27 00:43:30,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_10-model_00-model_states.pt... 0: [2022-11-27 00:43:30,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_10-model_00-model_states.pt. 0: [2022-11-27 00:43:30,309] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_11-model_00-model_states.pt... 0: [2022-11-27 00:43:30,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_11-model_00-model_states.pt. 0: [2022-11-27 00:43:30,382] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_12-model_00-model_states.pt... 0: [2022-11-27 00:43:30,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_12-model_00-model_states.pt. 0: [2022-11-27 00:43:30,458] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 00:43:30,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_13-model_00-model_states.pt. 0: [2022-11-27 00:43:30,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_14-model_00-model_states.pt... 0: [2022-11-27 00:43:30,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_14-model_00-model_states.pt. 0: [2022-11-27 00:43:30,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_15-model_00-model_states.pt... 0: [2022-11-27 00:43:30,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_15-model_00-model_states.pt. 0: [2022-11-27 00:43:30,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_16-model_00-model_states.pt... 0: [2022-11-27 00:43:30,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_16-model_00-model_states.pt. 0: [2022-11-27 00:43:30,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_17-model_00-model_states.pt... 0: [2022-11-27 00:43:30,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_17-model_00-model_states.pt. 0: [2022-11-27 00:43:30,837] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_18-model_00-model_states.pt... 0: [2022-11-27 00:43:30,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_18-model_00-model_states.pt. 0: [2022-11-27 00:43:30,914] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 00:43:30,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_19-model_00-model_states.pt. 0: [2022-11-27 00:43:30,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_20-model_00-model_states.pt... 0: [2022-11-27 00:43:31,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_20-model_00-model_states.pt. 0: [2022-11-27 00:43:31,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_21-model_00-model_states.pt... 0: [2022-11-27 00:43:31,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_21-model_00-model_states.pt. 0: [2022-11-27 00:43:31,139] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_22-model_00-model_states.pt... 0: [2022-11-27 00:43:31,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_22-model_00-model_states.pt. 0: [2022-11-27 00:43:31,212] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_23-model_00-model_states.pt... 0: [2022-11-27 00:43:31,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_23-model_00-model_states.pt. 0: [2022-11-27 00:43:31,289] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_24-model_00-model_states.pt... 0: [2022-11-27 00:43:31,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_24-model_00-model_states.pt. 0: [2022-11-27 00:43:31,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 00:43:31,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_25-model_00-model_states.pt. 0: [2022-11-27 00:43:31,436] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_26-model_00-model_states.pt... 0: [2022-11-27 00:43:31,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_26-model_00-model_states.pt. 0: [2022-11-27 00:43:31,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_27-model_00-model_states.pt... 0: [2022-11-27 00:43:31,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_27-model_00-model_states.pt. 0: [2022-11-27 00:43:31,587] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_28-model_00-model_states.pt... 0: [2022-11-27 00:43:31,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_28-model_00-model_states.pt. 0: [2022-11-27 00:43:31,662] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/layer_30-model_00-model_states.pt... 0: [2022-11-27 00:43:31,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/layer_30-model_00-model_states.pt. 0: [2022-11-27 00:43:31,665] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step136000/mp_rank_00_model_states.pt 0: [2022-11-27 00:43:31,665] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/mp_rank_00_model_states.pt... 0: [2022-11-27 00:43:31,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/mp_rank_00_model_states.pt. 
0: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 
16: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
20: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 
29: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 
21: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 
19: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 
4: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
1: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
3: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 
23: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 
31: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 
17: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 
27: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 
7: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
10: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 
20: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 
24: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 
18: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
9: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
20: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 
30: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
10: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 
22: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 
30: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:43:31,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:43:31,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:43:31,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 00:43:31,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 11: [2022-11-27 00:43:31,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:43:31,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:43:31,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 11: [2022-11-27 00:43:31,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 21: [2022-11-27 00:43:31,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 
11: [2022-11-27 00:43:31,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 30: [2022-11-27 00:43:31,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:43:31,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 13: [2022-11-27 00:43:31,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:43:31,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 13: [2022-11-27 00:43:31,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 00:43:31,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 19: [2022-11-27 00:43:31,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:43:31,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-27 00:43:31,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 0: [2022-11-27 00:43:31,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-27 00:43:31,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 00:43:31,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 22: [2022-11-27 00:43:31,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:43:31,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 00:43:31,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 5: [2022-11-27 00:43:31,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:43:31,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:43:31,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 00:43:31,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 24: [2022-11-27 00:43:31,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:43:31,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 00:43:31,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 
24: [2022-11-27 00:43:31,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 00:43:31,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 5: [2022-11-27 00:43:31,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:43:31,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 00:43:31,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 10: [2022-11-27 00:43:31,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:43:31,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:43:31,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:43:31,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 00:43:31,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 
10: [2022-11-27 00:43:31,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 3: [2022-11-27 00:43:31,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 10: [2022-11-27 00:43:31,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 3: [2022-11-27 00:43:31,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 16: [2022-11-27 00:43:31,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:43:31,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 00:43:31,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 30: [2022-11-27 00:43:31,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:43:31,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 00:43:31,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 6: [2022-11-27 00:43:31,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-27 00:43:31,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 00:43:31,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 17: [2022-11-27 00:43:31,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:43:31,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 00:43:31,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 26: [2022-11-27 00:43:31,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:43:31,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-27 00:43:31,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 19: [2022-11-27 00:43:31,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:43:31,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
19: [2022-11-27 00:43:31,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 15: [2022-11-27 00:43:31,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:43:31,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 19: [2022-11-27 00:43:31,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 26: [2022-11-27 00:43:31,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 15: [2022-11-27 00:43:31,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 00:43:31,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 11: [2022-11-27 00:43:31,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:43:31,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 00:43:31,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 8: [2022-11-27 00:43:31,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-27 00:43:31,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 00:43:31,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 1: [2022-11-27 00:43:31,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:43:31,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 00:43:31,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 1: [2022-11-27 00:43:31,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:43:31,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 00:43:31,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 13: [2022-11-27 00:43:31,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:43:31,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 00:43:31,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 0: [2022-11-27 00:43:31,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-27 00:43:31,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 00:43:31,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 22: [2022-11-27 00:43:31,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:43:31,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:43:31,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:43:31,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 21: [2022-11-27 00:43:31,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 10: [2022-11-27 00:43:31,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 22: [2022-11-27 00:43:31,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 21: [2022-11-27 00:43:31,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 10: [2022-11-27 00:43:31,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 7: [2022-11-27 00:43:31,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-27 00:43:31,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 00:43:31,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 23: [2022-11-27 00:43:31,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:43:31,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 00:43:31,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 17: [2022-11-27 00:43:31,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:43:31,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:43:31,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 29: [2022-11-27 00:43:31,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:43:31,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 17: [2022-11-27 00:43:31,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 00:43:31,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 
9: [2022-11-27 00:43:31,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:43:31,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 00:43:31,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 23: [2022-11-27 00:43:31,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:43:31,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 00:43:31,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 25: [2022-11-27 00:43:31,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:43:31,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 00:43:31,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 29: [2022-11-27 00:43:31,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 00:43:31,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 5: [2022-11-27 00:43:31,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-27 00:43:31,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 00:43:31,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 20: [2022-11-27 00:43:31,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:43:31,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:43:31,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:43:31,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 30: [2022-11-27 00:43:31,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 20: [2022-11-27 00:43:31,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 11: [2022-11-27 00:43:31,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 30: [2022-11-27 00:43:31,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 11: [2022-11-27 00:43:31,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 6: [2022-11-27 00:43:31,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-27 00:43:31,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 00:43:31,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 9: [2022-11-27 00:43:31,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:43:31,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:43:31,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:43:31,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 00:43:31,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 00:43:31,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 9: [2022-11-27 00:43:31,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 1: [2022-11-27 00:43:31,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 19: [2022-11-27 00:43:31,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 
19: [2022-11-27 00:43:31,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 1: [2022-11-27 00:43:31,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 19: [2022-11-27 00:43:31,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 31: [2022-11-27 00:43:31,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:43:31,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:43:31,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 00:43:31,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 00:43:31,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 31: [2022-11-27 00:43:31,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 16: [2022-11-27 00:43:31,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:43:31,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
16: [2022-11-27 00:43:31,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 00:43:31,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 00:43:31,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 16: [2022-11-27 00:43:31,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 25: [2022-11-27 00:43:31,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:43:31,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 00:43:31,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 3: [2022-11-27 00:43:31,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:43:31,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:43:31,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 00:43:31,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 
21: [2022-11-27 00:43:31,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 00:43:31,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 13: [2022-11-27 00:43:31,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:43:31,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 00:43:31,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 17: [2022-11-27 00:43:31,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:43:31,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:43:31,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 00:43:31,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 0: [2022-11-27 00:43:31,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 00:43:31,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 8: [2022-11-27 00:43:31,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-27 00:43:31,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 00:43:31,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 19: [2022-11-27 00:43:31,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:43:31,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 00:43:31,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 3: [2022-11-27 00:43:31,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:43:31,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 00:43:31,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 8: [2022-11-27 00:43:31,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:43:31,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
8: [2022-11-27 00:43:31,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 14: [2022-11-27 00:43:31,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:43:31,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 10: [2022-11-27 00:43:31,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 14: [2022-11-27 00:43:31,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 10: [2022-11-27 00:43:31,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 14: [2022-11-27 00:43:31,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:43:31,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 14: [2022-11-27 00:43:31,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 00:43:31,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 29: [2022-11-27 00:43:31,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:43:31,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
29: [2022-11-27 00:43:31,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 00:43:31,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 00:43:31,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 29: [2022-11-27 00:43:31,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 16: [2022-11-27 00:43:31,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:43:31,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 00:43:31,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 2: [2022-11-27 00:43:31,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:43:31,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 15: [2022-11-27 00:43:31,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:43:31,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 
15: [2022-11-27 00:43:31,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 00:43:31,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 24: [2022-11-27 00:43:31,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:43:31,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:43:31,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:43:31,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 24: [2022-11-27 00:43:31,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 2: [2022-11-27 00:43:31,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 31: [2022-11-27 00:43:31,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:43:31,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 31: [2022-11-27 00:43:31,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 00:43:31,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 
24: [2022-11-27 00:43:31,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 00:43:31,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 2: [2022-11-27 00:43:31,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:43:31,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 00:43:31,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 18: [2022-11-27 00:43:31,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:43:31,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 00:43:31,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 6: [2022-11-27 00:43:31,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:43:31,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
6: [2022-11-27 00:43:31,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 26: [2022-11-27 00:43:31,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:43:31,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:43:31,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 6: [2022-11-27 00:43:31,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 18: [2022-11-27 00:43:31,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 26: [2022-11-27 00:43:31,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 6: [2022-11-27 00:43:31,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 26: [2022-11-27 00:43:31,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 6: [2022-11-27 00:43:31,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 15: [2022-11-27 00:43:31,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-27 00:43:31,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 00:43:31,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 22: [2022-11-27 00:43:31,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:43:31,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 00:43:31,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 28: [2022-11-27 00:43:31,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-27 00:43:31,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 00:43:31,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 0: [2022-11-27 00:43:31,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:43:31,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 00:43:31,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 24: [2022-11-27 00:43:31,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-27 00:43:31,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-27 00:43:31,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 1: [2022-11-27 00:43:31,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:43:31,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 00:43:31,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 26: [2022-11-27 00:43:31,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:43:31,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 00:43:31,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 28: [2022-11-27 00:43:31,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-27 00:43:31,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 00:43:31,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 28: [2022-11-27 00:43:31,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
28: [2022-11-27 00:43:31,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 00:43:31,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 9: [2022-11-27 00:43:31,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:43:31,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 00:43:31,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 23: [2022-11-27 00:43:31,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:43:31,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 00:43:31,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 25: [2022-11-27 00:43:31,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:43:31,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 5: [2022-11-27 00:43:31,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-27 00:43:31,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 25: [2022-11-27 00:43:31,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 5: [2022-11-27 00:43:31,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 20: [2022-11-27 00:43:31,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:43:31,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 00:43:31,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 20: [2022-11-27 00:43:31,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:43:31,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 00:43:31,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 27: [2022-11-27 00:43:31,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:43:31,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:43:31,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
27: [2022-11-27 00:43:31,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 00:43:31,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 00:43:31,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 00:43:31,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 27: [2022-11-27 00:43:31,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 27: [2022-11-27 00:43:31,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 31: [2022-11-27 00:43:31,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:43:31,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 00:43:31,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 4: [2022-11-27 00:43:31,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:43:31,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:43:31,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
30: [2022-11-27 00:43:31,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:43:31,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 00:43:31,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 00:43:31,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 30: [2022-11-27 00:43:31,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 4: [2022-11-27 00:43:31,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 4: [2022-11-27 00:43:31,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 4: [2022-11-27 00:43:31,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 30: [2022-11-27 00:43:31,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 12: [2022-11-27 00:43:31,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:43:31,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-27 00:43:31,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 00:43:31,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 00:43:31,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 12: [2022-11-27 00:43:31,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 7: [2022-11-27 00:43:31,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:43:31,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 00:43:31,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 12: [2022-11-27 00:43:31,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:43:31,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 00:43:31,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 15: [2022-11-27 00:43:31,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-27 00:43:31,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 00:43:31,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 21: [2022-11-27 00:43:31,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:43:31,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 00:43:31,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 29: [2022-11-27 00:43:31,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:43:31,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 00:43:31,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 27: [2022-11-27 00:43:31,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:43:31,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 13: [2022-11-27 00:43:31,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:43:31,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 
13: [2022-11-27 00:43:31,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 00:43:31,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 14: [2022-11-27 00:43:31,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:43:31,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 00:43:31,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 17: [2022-11-27 00:43:31,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:43:31,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:43:31,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 00:43:31,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 17: [2022-11-27 00:43:31,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 00:43:31,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 10: [2022-11-27 00:43:31,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-27 00:43:31,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 00:43:31,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 8: [2022-11-27 00:43:31,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:43:31,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 00:43:31,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 4: [2022-11-27 00:43:31,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:43:31,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 00:43:31,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 18: [2022-11-27 00:43:31,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:43:31,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 00:43:31,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 3: [2022-11-27 00:43:31,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-27 00:43:31,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 00:43:31,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 20: [2022-11-27 00:43:31,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:43:31,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 00:43:31,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 22: [2022-11-27 00:43:31,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:43:31,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 00:43:31,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 11: [2022-11-27 00:43:31,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:43:31,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 00:43:31,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 12: [2022-11-27 00:43:31,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-27 00:43:31,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 00:43:31,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 5: [2022-11-27 00:43:31,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:43:31,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 00:43:31,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 16: [2022-11-27 00:43:31,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:43:31,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 00:43:31,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 9: [2022-11-27 00:43:31,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:43:31,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 00:43:31,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 28: [2022-11-27 00:43:31,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-27 00:43:31,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 00:43:31,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 6: [2022-11-27 00:43:31,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:43:31,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 00:43:31,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 19: [2022-11-27 00:43:31,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:43:31,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 00:43:31,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 26: [2022-11-27 00:43:31,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:43:31,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 00:43:31,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 25: [2022-11-27 00:43:31,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
25: [2022-11-27 00:43:31,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 00:43:31,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 1: [2022-11-27 00:43:31,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:43:31,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 00:43:31,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 7: [2022-11-27 00:43:31,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:43:31,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 00:43:31,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 24: [2022-11-27 00:43:31,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:43:31,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 00:43:31,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 23: [2022-11-27 00:43:31,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
0: [2022-11-27 00:43:31,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:43:31,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 30: [2022-11-27 00:43:31,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:43:31,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 0: [2022-11-27 00:43:31,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 30: [2022-11-27 00:43:31,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 0: [2022-11-27 00:43:31,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 30: [2022-11-27 00:43:31,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 31: [2022-11-27 00:43:31,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:43:31,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 00:43:31,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 15: [2022-11-27 00:43:31,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-27 00:43:31,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 00:43:31,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 13: [2022-11-27 00:43:31,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:43:31,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 00:43:31,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 2: [2022-11-27 00:43:31,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:43:31,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 00:43:31,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 27: [2022-11-27 00:43:31,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:43:31,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:43:31,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 00:43:31,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 
27: [2022-11-27 00:43:31,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 00:43:31,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 8: [2022-11-27 00:43:31,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:43:31,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 00:43:31,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 14: [2022-11-27 00:43:31,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:43:31,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 00:43:31,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 21: [2022-11-27 00:43:31,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:43:31,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 00:43:31,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 17: [2022-11-27 00:43:31,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-27 00:43:31,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 00:43:31,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 4: [2022-11-27 00:43:31,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:43:31,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 00:43:31,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 22: [2022-11-27 00:43:31,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:43:31,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 00:43:31,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 3: [2022-11-27 00:43:31,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:43:31,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 00:43:31,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 6: [2022-11-27 00:43:31,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-27 00:43:31,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 00:43:31,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 20: [2022-11-27 00:43:31,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:43:31,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 00:43:31,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 12: [2022-11-27 00:43:31,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:43:31,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 00:43:31,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 5: [2022-11-27 00:43:31,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:43:31,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 00:43:31,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 18: [2022-11-27 00:43:31,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
18: [2022-11-27 00:43:31,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 00:43:31,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 10: [2022-11-27 00:43:31,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:43:31,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 00:43:31,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 19: [2022-11-27 00:43:31,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:43:31,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 00:43:31,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 28: [2022-11-27 00:43:31,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-27 00:43:31,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 00:43:31,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 16: [2022-11-27 00:43:31,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
16: [2022-11-27 00:43:31,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 00:43:31,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 26: [2022-11-27 00:43:31,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:43:31,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 00:43:31,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 30: [2022-11-27 00:43:31,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:43:31,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:43:31,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 11: [2022-11-27 00:43:31,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 30: [2022-11-27 00:43:31,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 11: [2022-11-27 00:43:31,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 23: [2022-11-27 00:43:31,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
23: [2022-11-27 00:43:31,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 00:43:31,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 9: [2022-11-27 00:43:31,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:43:31,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 00:43:31,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 0: [2022-11-27 00:43:31,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:43:31,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:43:31,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 00:43:31,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 31: [2022-11-27 00:43:31,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-27 00:43:31,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 27: [2022-11-27 00:43:31,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:43:31,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 27: [2022-11-27 00:43:31,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 00:43:31,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 17: [2022-11-27 00:43:31,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:43:31,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 00:43:31,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 12: [2022-11-27 00:43:31,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:43:31,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 1: [2022-11-27 00:43:31,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:43:31,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 
15: [2022-11-27 00:43:31,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:43:31,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 00:43:31,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 1: [2022-11-27 00:43:31,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 00:43:31,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 13: [2022-11-27 00:43:31,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:43:31,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 00:43:31,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 21: [2022-11-27 00:43:31,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:43:31,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 00:43:31,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 25: [2022-11-27 00:43:31,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 
25: [2022-11-27 00:43:31,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 00:43:31,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 7: [2022-11-27 00:43:31,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:43:31,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 00:43:31,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 6: [2022-11-27 00:43:31,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:43:31,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 00:43:31,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 4: [2022-11-27 00:43:31,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:43:31,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 00:43:31,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 3: [2022-11-27 00:43:31,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
14: [2022-11-27 00:43:31,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:43:31,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 14: [2022-11-27 00:43:31,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 3: [2022-11-27 00:43:31,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 14: [2022-11-27 00:43:31,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 5: [2022-11-27 00:43:31,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:43:31,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 8: [2022-11-27 00:43:31,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:43:31,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 26: [2022-11-27 00:43:31,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:43:31,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 00:43:31,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 
20: [2022-11-27 00:43:31,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:43:31,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 00:43:31,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 20: [2022-11-27 00:43:31,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 00:43:31,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 19: [2022-11-27 00:43:31,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:43:31,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 29: [2022-11-27 00:43:31,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:43:31,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 29: [2022-11-27 00:43:31,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 00:43:31,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 28: [2022-11-27 00:43:31,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
22: [2022-11-27 00:43:31,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:43:31,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 00:43:31,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 7: [2022-11-27 00:43:31,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:43:31,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 00:43:31,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 9: [2022-11-27 00:43:31,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:43:31,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 00:43:31,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 2: [2022-11-27 00:43:31,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:43:31,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 00:43:31,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 
0: [2022-11-27 00:43:31,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:43:31,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 00:43:31,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 25: [2022-11-27 00:43:31,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:43:31,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:43:31,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 16: [2022-11-27 00:43:31,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 25: [2022-11-27 00:43:31,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 16: [2022-11-27 00:43:31,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 24: [2022-11-27 00:43:31,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:43:31,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 00:43:31,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 
10: [2022-11-27 00:43:31,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:43:31,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 1: [2022-11-27 00:43:31,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:43:31,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 1: [2022-11-27 00:43:31,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 00:43:31,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 31: [2022-11-27 00:43:31,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:43:31,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 00:43:31,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 15: [2022-11-27 00:43:31,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:43:31,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 00:43:31,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 
23: [2022-11-27 00:43:31,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:43:31,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 00:43:31,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 28: [2022-11-27 00:43:31,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 00:43:31,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 18: [2022-11-27 00:43:31,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:43:31,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 00:43:31,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 13: [2022-11-27 00:43:31,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:43:31,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 00:43:31,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 27: [2022-11-27 00:43:31,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 
27: [2022-11-27 00:43:31,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 00:43:31,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 4: [2022-11-27 00:43:31,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:43:31,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 00:43:31,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 30: [2022-11-27 00:43:31,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:43:31,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 00:43:31,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 19: [2022-11-27 00:43:31,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:43:31,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 13: [2022-11-27 00:43:31,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:43:31,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 
16: [2022-11-27 00:43:31,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:43:31,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:43:31,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 13: [2022-11-27 00:43:31,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 16: [2022-11-27 00:43:31,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 13: [2022-11-27 00:43:31,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 18: [2022-11-27 00:43:31,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 00:43:31,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 4: [2022-11-27 00:43:31,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:43:31,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 00:43:31,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 
0: [2022-11-27 00:43:31,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 00:43:31,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 5: [2022-11-27 00:43:31,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:43:31,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 00:43:31,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 6: [2022-11-27 00:43:31,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:43:31,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 00:43:31,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 9: [2022-11-27 00:43:31,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:43:31,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
9: [2022-11-27 00:43:31,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 26: [2022-11-27 00:43:31,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:43:31,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 9: [2022-11-27 00:43:31,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 23: [2022-11-27 00:43:31,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 20: [2022-11-27 00:43:31,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:43:31,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:43:31,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 20: [2022-11-27 00:43:31,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 14: [2022-11-27 00:43:31,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 26: [2022-11-27 00:43:31,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 20: [2022-11-27 00:43:31,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 
14: [2022-11-27 00:43:31,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 8: [2022-11-27 00:43:31,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:43:31,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 00:43:31,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 11: [2022-11-27 00:43:31,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:43:31,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:43:31,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 00:43:31,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 21: [2022-11-27 00:43:31,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 22: [2022-11-27 00:43:31,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:43:31,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 
22: [2022-11-27 00:43:31,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 00:43:31,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 31: [2022-11-27 00:43:31,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:43:31,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 00:43:31,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 2: [2022-11-27 00:43:31,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:43:31,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 00:43:31,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 30: [2022-11-27 00:43:31,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:43:31,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 
30: [2022-11-27 00:43:31,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 29: [2022-11-27 00:43:31,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 00:43:31,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 30: [2022-11-27 00:43:31,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 28: [2022-11-27 00:43:31,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:43:31,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:43:31,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 28: [2022-11-27 00:43:31,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 7: [2022-11-27 00:43:31,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 00:43:31,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 3: [2022-11-27 00:43:31,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 28: [2022-11-27 00:43:31,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 
27: [2022-11-27 00:43:31,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 3: [2022-11-27 00:43:31,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 27: [2022-11-27 00:43:31,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 3: [2022-11-27 00:43:31,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 12: [2022-11-27 00:43:31,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:43:31,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 00:43:31,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 0: [2022-11-27 00:43:31,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:43:31,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 00:43:31,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 25: [2022-11-27 00:43:31,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
25: [2022-11-27 00:43:31,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-27 00:43:31,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 15: [2022-11-27 00:43:31,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:43:31,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 00:43:31,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 17: [2022-11-27 00:43:31,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:43:31,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:43:31,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 17: [2022-11-27 00:43:31,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 00:43:31,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 29: [2022-11-27 00:43:31,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 21: [2022-11-27 00:43:31,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
21: [2022-11-27 00:43:31,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 00:43:31,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 14: [2022-11-27 00:43:31,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:43:31,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 00:43:31,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 22: [2022-11-27 00:43:31,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:43:31,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 00:43:31,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 1: [2022-11-27 00:43:31,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:43:31,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 00:43:31,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 18: [2022-11-27 00:43:31,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
18: [2022-11-27 00:43:31,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 00:43:31,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 11: [2022-11-27 00:43:31,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:43:31,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:43:31,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:43:31,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 20: [2022-11-27 00:43:31,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:43:31,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 10: [2022-11-27 00:43:31,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 00:43:31,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 
8: [2022-11-27 00:43:31,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 20: [2022-11-27 00:43:31,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 8: [2022-11-27 00:43:31,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 20: [2022-11-27 00:43:31,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 28: [2022-11-27 00:43:31,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-27 00:43:31,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 11: [2022-11-27 00:43:31,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 28: [2022-11-27 00:43:31,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 11: [2022-11-27 00:43:31,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 00:43:31,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 12: [2022-11-27 00:43:31,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:43:31,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
12: [2022-11-27 00:43:31,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 00:43:31,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 3: [2022-11-27 00:43:31,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 00:43:31,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 17: [2022-11-27 00:43:31,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:43:31,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:43:31,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 2: [2022-11-27 00:43:31,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 17: [2022-11-27 00:43:31,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 2: [2022-11-27 00:43:31,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 7: [2022-11-27 00:43:31,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-27 00:43:31,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 00:43:31,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 10: [2022-11-27 00:43:31,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:43:31,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 00:43:31,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 24: [2022-11-27 00:43:31,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:43:31,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-27 00:43:31,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 18: [2022-11-27 00:43:31,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:43:31,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step136000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-27 00:43:31,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step136000 is ready now! 
0: successfully saved checkpoint at iteration 136000 to checkpoints_1b1long
31: time (ms) | save-checkpoint: 2529.27
31: iteration 136010/ 173500 | consumed samples: 34818560 | consumed tokens: 71308410880 | elapsed time per iteration (s): 1.17 | learning rate: 4.034E-05 | global batch size: 256 | lm loss: 1.910849E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 218.134 | TFLOPs: 13.20 |
31: iteration 136020/ 173500 | consumed samples: 34821120 | consumed tokens: 71313653760 | elapsed time per iteration (s): 0.90 | learning rate: 4.033E-05 | global batch size: 256 | lm loss: 1.946462E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.575 | TFLOPs: 17.28 |
31: iteration 136030/ 173500 | consumed samples: 34823680 | consumed tokens: 71318896640 | elapsed time per iteration (s): 0.89 | learning rate: 4.032E-05 | global batch size: 256 | lm loss: 1.950578E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.693 | TFLOPs: 17.40 |
31: iteration 136040/ 173500 | consumed samples: 34826240 | consumed tokens: 71324139520 | elapsed time per iteration (s): 0.93 | learning rate: 4.031E-05 | global batch size: 256 | lm loss: 1.929432E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 276.717 | TFLOPs: 16.74 |
31: iteration 136050/ 173500 | consumed samples: 34828800 | consumed tokens: 71329382400 | elapsed time per iteration (s): 0.85 | learning rate: 4.030E-05 | global batch size: 256 | lm loss: 1.961996E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.106 | TFLOPs: 18.22 |
31: iteration 136060/ 173500 | consumed samples: 34831360 | consumed tokens: 71334625280 | elapsed time per iteration (s): 0.80 | learning rate: 4.029E-05 | global batch size: 256 | lm loss: 1.919474E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.257 | TFLOPs: 19.37 |
31: iteration 136070/ 173500 | consumed samples: 34833920 | consumed tokens: 71339868160 | elapsed time per iteration (s): 0.76 | learning rate: 4.028E-05 | global batch size: 256 | lm loss: 1.916311E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.770 | TFLOPs: 20.43 |
31: iteration 136080/ 173500 | consumed samples: 34836480 | consumed tokens: 71345111040 | elapsed time per iteration (s): 0.80 | learning rate: 4.027E-05 | global batch size: 256 | lm loss: 1.915631E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.304 | TFLOPs: 19.44 |
31: iteration 136090/ 173500 | consumed samples: 34839040 | consumed tokens: 71350353920 | elapsed time per iteration (s): 0.80 | learning rate: 4.026E-05 | global batch size: 256 | lm loss: 1.935109E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.763 | TFLOPs: 19.47 |
31: iteration 136100/ 173500 | consumed samples: 34841600 | consumed tokens: 71355596800 | elapsed time per iteration (s): 0.94 | learning rate: 4.025E-05 | global batch size: 256 | lm loss: 1.936128E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 273.553 | TFLOPs: 16.55 |
31: iteration 136110/ 173500 | consumed samples: 34844160 | consumed tokens: 71360839680 | elapsed time per iteration (s): 0.85 | learning rate: 4.024E-05 | global batch size: 256 | lm loss: 1.921551E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.179 | TFLOPs: 18.16 |
31: iteration 136120/ 173500 | consumed samples: 34846720 | consumed tokens: 71366082560 | elapsed time per iteration (s): 0.81 | learning rate: 4.023E-05 | global batch size: 256 | lm loss: 1.924630E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.815 | TFLOPs: 19.05 |
31: iteration 136130/ 173500 | consumed samples: 34849280 | consumed tokens: 71371325440 | elapsed time per iteration (s): 0.84 | learning rate: 4.022E-05 | global batch size: 256 | lm loss: 1.921980E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.383 | TFLOPs: 18.41 |
31: iteration 136140/ 173500 | consumed samples: 34851840 | consumed tokens: 71376568320 | elapsed time per iteration (s): 0.75 | learning rate: 4.021E-05 | global batch size: 256 | lm loss: 1.933263E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.129 | TFLOPs: 20.58 |
31: iteration 136150/ 173500 | consumed samples: 34854400 | consumed tokens: 71381811200 | elapsed time per iteration (s): 0.85 | learning rate: 4.020E-05 | global batch size: 256 | lm loss: 1.933902E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.428 | TFLOPs: 18.18 |
31: iteration 136160/ 173500 | consumed samples: 34856960 | consumed tokens: 71387054080 | elapsed time per iteration (s): 0.81 | learning rate: 4.019E-05 | global batch size: 256 | lm loss: 1.970321E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.831 | TFLOPs: 19.17 |
31: iteration 136170/ 173500 | consumed samples: 34859520 | consumed tokens: 71392296960 | elapsed time per iteration (s): 0.77 | learning rate: 4.018E-05 | global batch size: 256 | lm loss: 1.950690E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.772 | TFLOPs: 20.07 |
31: iteration 136180/ 173500 | consumed samples: 34862080 | consumed tokens: 71397539840 | elapsed time per iteration (s): 0.74 | learning rate: 4.017E-05 | global batch size: 256 | lm loss: 1.940358E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.219 | TFLOPs: 20.95 |
31: iteration 136190/ 173500 | consumed samples: 34864640 | consumed tokens: 71402782720 | elapsed time per iteration (s): 0.84 | learning rate: 4.016E-05 | global batch size: 256 | lm loss: 1.941832E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.993 | TFLOPs: 18.33 |
31: iteration 136200/ 173500 | consumed samples: 34867200 | consumed tokens: 71408025600 | elapsed time per iteration (s): 0.74 | learning rate: 4.014E-05 | global batch size: 256 | lm loss: 1.926488E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.821 | TFLOPs: 20.86 |
31: iteration 136210/ 173500 | consumed samples: 34869760 | consumed tokens: 71413268480 | elapsed time per iteration (s): 0.77 | learning rate: 4.013E-05 | global batch size: 256 | lm loss: 1.914345E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.837 | TFLOPs: 20.08 |
31: iteration 136220/ 173500 | consumed samples: 34872320 | consumed tokens: 71418511360 | elapsed time per iteration (s): 0.76 | learning rate: 4.012E-05 | global batch size: 256 | lm loss: 1.905058E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.030 | TFLOPs: 20.51 |
31: iteration 136230/ 173500 | consumed samples: 34874880 | consumed tokens: 71423754240 | elapsed time per iteration (s): 0.74 | 
learning rate: 4.011E-05 | global batch size: 256 | lm loss: 1.924778E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.463 | TFLOPs: 20.96 | 31: iteration 136240/ 173500 | consumed samples: 34877440 | consumed tokens: 71428997120 | elapsed time per iteration (s): 0.78 | learning rate: 4.010E-05 | global batch size: 256 | lm loss: 1.921548E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.180 | TFLOPs: 19.73 | 31: iteration 136250/ 173500 | consumed samples: 34880000 | consumed tokens: 71434240000 | elapsed time per iteration (s): 0.74 | learning rate: 4.009E-05 | global batch size: 256 | lm loss: 1.930323E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.643 | TFLOPs: 20.85 | 31: iteration 136260/ 173500 | consumed samples: 34882560 | consumed tokens: 71439482880 | elapsed time per iteration (s): 0.81 | learning rate: 4.008E-05 | global batch size: 256 | lm loss: 1.956473E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.236 | TFLOPs: 19.19 | 31: iteration 136270/ 173500 | consumed samples: 34885120 | consumed tokens: 71444725760 | elapsed time per iteration (s): 0.75 | learning rate: 4.007E-05 | global batch size: 256 | lm loss: 1.922769E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.905 | TFLOPs: 20.68 | 31: iteration 136280/ 173500 | consumed samples: 34887680 | consumed tokens: 71449968640 | elapsed time per iteration (s): 0.87 | learning rate: 4.006E-05 | global batch size: 256 | lm loss: 1.904603E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.276 | TFLOPs: 17.80 | 31: iteration 
136290/ 173500 | consumed samples: 34890240 | consumed tokens: 71455211520 | elapsed time per iteration (s): 0.79 | learning rate: 4.005E-05 | global batch size: 256 | lm loss: 1.912432E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.380 | TFLOPs: 19.68 | 31: iteration 136300/ 173500 | consumed samples: 34892800 | consumed tokens: 71460454400 | elapsed time per iteration (s): 0.78 | learning rate: 4.004E-05 | global batch size: 256 | lm loss: 1.918572E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.545 | TFLOPs: 19.88 | 31: iteration 136310/ 173500 | consumed samples: 34895360 | consumed tokens: 71465697280 | elapsed time per iteration (s): 0.81 | learning rate: 4.003E-05 | global batch size: 256 | lm loss: 1.955819E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.791 | TFLOPs: 19.23 | 31: iteration 136320/ 173500 | consumed samples: 34897920 | consumed tokens: 71470940160 | elapsed time per iteration (s): 0.92 | learning rate: 4.002E-05 | global batch size: 256 | lm loss: 1.962587E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 278.604 | TFLOPs: 16.85 | 31: iteration 136330/ 173500 | consumed samples: 34900480 | consumed tokens: 71476183040 | elapsed time per iteration (s): 0.74 | learning rate: 4.001E-05 | global batch size: 256 | lm loss: 1.928978E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.594 | TFLOPs: 20.97 | 31: iteration 136340/ 173500 | consumed samples: 34903040 | consumed tokens: 71481425920 | elapsed time per iteration (s): 0.77 | learning rate: 4.000E-05 | global batch size: 256 | lm loss: 1.931377E+00 | grad norm: 0.169 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.656 | TFLOPs: 20.12 | 31: iteration 136350/ 173500 | consumed samples: 34905600 | consumed tokens: 71486668800 | elapsed time per iteration (s): 0.76 | learning rate: 3.999E-05 | global batch size: 256 | lm loss: 1.938648E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.817 | TFLOPs: 20.38 | 31: iteration 136360/ 173500 | consumed samples: 34908160 | consumed tokens: 71491911680 | elapsed time per iteration (s): 0.78 | learning rate: 3.998E-05 | global batch size: 256 | lm loss: 1.944405E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.844 | TFLOPs: 19.89 | 31: iteration 136370/ 173500 | consumed samples: 34910720 | consumed tokens: 71497154560 | elapsed time per iteration (s): 0.75 | learning rate: 3.997E-05 | global batch size: 256 | lm loss: 1.916356E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.120 | TFLOPs: 20.58 | 31: iteration 136380/ 173500 | consumed samples: 34913280 | consumed tokens: 71502397440 | elapsed time per iteration (s): 0.74 | learning rate: 3.996E-05 | global batch size: 256 | lm loss: 1.942275E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.227 | TFLOPs: 20.82 | 31: iteration 136390/ 173500 | consumed samples: 34915840 | consumed tokens: 71507640320 | elapsed time per iteration (s): 0.81 | learning rate: 3.995E-05 | global batch size: 256 | lm loss: 1.937072E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.330 | TFLOPs: 19.02 | 31: iteration 136400/ 173500 | consumed samples: 34918400 | consumed tokens: 71512883200 | elapsed time per iteration (s): 0.74 | learning 
rate: 3.994E-05 | global batch size: 256 | lm loss: 1.942996E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.645 | TFLOPs: 20.85 | 31: iteration 136410/ 173500 | consumed samples: 34920960 | consumed tokens: 71518126080 | elapsed time per iteration (s): 0.80 | learning rate: 3.993E-05 | global batch size: 256 | lm loss: 1.923524E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.497 | TFLOPs: 19.27 | 31: iteration 136420/ 173500 | consumed samples: 34923520 | consumed tokens: 71523368960 | elapsed time per iteration (s): 0.75 | learning rate: 3.992E-05 | global batch size: 256 | lm loss: 1.940150E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.477 | TFLOPs: 20.60 | 31: iteration 136430/ 173500 | consumed samples: 34926080 | consumed tokens: 71528611840 | elapsed time per iteration (s): 0.77 | learning rate: 3.991E-05 | global batch size: 256 | lm loss: 1.945069E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.037 | TFLOPs: 20.15 | 31: iteration 136440/ 173500 | consumed samples: 34928640 | consumed tokens: 71533854720 | elapsed time per iteration (s): 0.77 | learning rate: 3.990E-05 | global batch size: 256 | lm loss: 1.952889E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.221 | TFLOPs: 20.04 | 31: iteration 136450/ 173500 | consumed samples: 34931200 | consumed tokens: 71539097600 | elapsed time per iteration (s): 0.75 | learning rate: 3.989E-05 | global batch size: 256 | lm loss: 1.913352E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.724 | TFLOPs: 20.61 | 31: iteration 136460/ 
173500 | consumed samples: 34933760 | consumed tokens: 71544340480 | elapsed time per iteration (s): 0.76 | learning rate: 3.988E-05 | global batch size: 256 | lm loss: 1.920496E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.733 | TFLOPs: 20.31 | 31: iteration 136470/ 173500 | consumed samples: 34936320 | consumed tokens: 71549583360 | elapsed time per iteration (s): 0.75 | learning rate: 3.987E-05 | global batch size: 256 | lm loss: 1.931183E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.155 | TFLOPs: 20.70 | 31: iteration 136480/ 173500 | consumed samples: 34938880 | consumed tokens: 71554826240 | elapsed time per iteration (s): 0.80 | learning rate: 3.985E-05 | global batch size: 256 | lm loss: 1.933644E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.284 | TFLOPs: 19.32 | 31: iteration 136490/ 173500 | consumed samples: 34941440 | consumed tokens: 71560069120 | elapsed time per iteration (s): 0.80 | learning rate: 3.984E-05 | global batch size: 256 | lm loss: 1.920147E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.631 | TFLOPs: 19.40 | 31: iteration 136500/ 173500 | consumed samples: 34944000 | consumed tokens: 71565312000 | elapsed time per iteration (s): 0.79 | learning rate: 3.983E-05 | global batch size: 256 | lm loss: 1.927223E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.342 | TFLOPs: 19.56 | 31: iteration 136510/ 173500 | consumed samples: 34946560 | consumed tokens: 71570554880 | elapsed time per iteration (s): 0.79 | learning rate: 3.982E-05 | global batch size: 256 | lm loss: 1.910668E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 323.317 | TFLOPs: 19.56 | 31: iteration 136520/ 173500 | consumed samples: 34949120 | consumed tokens: 71575797760 | elapsed time per iteration (s): 0.87 | learning rate: 3.981E-05 | global batch size: 256 | lm loss: 1.940975E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.169 | TFLOPs: 17.86 | 31: iteration 136530/ 173500 | consumed samples: 34951680 | consumed tokens: 71581040640 | elapsed time per iteration (s): 0.78 | learning rate: 3.980E-05 | global batch size: 256 | lm loss: 1.912012E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.763 | TFLOPs: 19.83 | 31: iteration 136540/ 173500 | consumed samples: 34954240 | consumed tokens: 71586283520 | elapsed time per iteration (s): 0.75 | learning rate: 3.979E-05 | global batch size: 256 | lm loss: 1.941610E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.614 | TFLOPs: 20.73 | 31: iteration 136550/ 173500 | consumed samples: 34956800 | consumed tokens: 71591526400 | elapsed time per iteration (s): 0.76 | learning rate: 3.978E-05 | global batch size: 256 | lm loss: 1.909320E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.327 | TFLOPs: 20.35 | 31: iteration 136560/ 173500 | consumed samples: 34959360 | consumed tokens: 71596769280 | elapsed time per iteration (s): 0.76 | learning rate: 3.977E-05 | global batch size: 256 | lm loss: 1.915242E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.776 | TFLOPs: 20.37 | 31: iteration 136570/ 173500 | consumed samples: 34961920 | consumed tokens: 71602012160 | elapsed time per iteration (s): 0.77 | learning rate: 
3.976E-05 | global batch size: 256 | lm loss: 1.921063E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.342 | TFLOPs: 20.05 | 31: iteration 136580/ 173500 | consumed samples: 34964480 | consumed tokens: 71607255040 | elapsed time per iteration (s): 0.75 | learning rate: 3.975E-05 | global batch size: 256 | lm loss: 1.932502E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.393 | TFLOPs: 20.59 | 31: iteration 136590/ 173500 | consumed samples: 34967040 | consumed tokens: 71612497920 | elapsed time per iteration (s): 0.76 | learning rate: 3.974E-05 | global batch size: 256 | lm loss: 1.973700E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.010 | TFLOPs: 20.33 | 31: iteration 136600/ 173500 | consumed samples: 34969600 | consumed tokens: 71617740800 | elapsed time per iteration (s): 0.82 | learning rate: 3.973E-05 | global batch size: 256 | lm loss: 1.912302E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.161 | TFLOPs: 18.88 | 31: iteration 136610/ 173500 | consumed samples: 34972160 | consumed tokens: 71622983680 | elapsed time per iteration (s): 0.78 | learning rate: 3.972E-05 | global batch size: 256 | lm loss: 1.934862E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.443 | TFLOPs: 19.81 | 31: iteration 136620/ 173500 | consumed samples: 34974720 | consumed tokens: 71628226560 | elapsed time per iteration (s): 0.74 | learning rate: 3.971E-05 | global batch size: 256 | lm loss: 1.941424E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.956 | TFLOPs: 20.81 | 31: iteration 136630/ 173500 | 
consumed samples: 34977280 | consumed tokens: 71633469440 | elapsed time per iteration (s): 0.80 | learning rate: 3.970E-05 | global batch size: 256 | lm loss: 1.916862E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.702 | TFLOPs: 19.46 | 31: iteration 136640/ 173500 | consumed samples: 34979840 | consumed tokens: 71638712320 | elapsed time per iteration (s): 0.73 | learning rate: 3.969E-05 | global batch size: 256 | lm loss: 1.915976E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.314 | TFLOPs: 21.07 | 31: iteration 136650/ 173500 | consumed samples: 34982400 | consumed tokens: 71643955200 | elapsed time per iteration (s): 0.77 | learning rate: 3.968E-05 | global batch size: 256 | lm loss: 1.899527E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.965 | TFLOPs: 20.02 | 31: iteration 136660/ 173500 | consumed samples: 34984960 | consumed tokens: 71649198080 | elapsed time per iteration (s): 0.79 | learning rate: 3.967E-05 | global batch size: 256 | lm loss: 1.931315E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.715 | TFLOPs: 19.70 | 31: iteration 136670/ 173500 | consumed samples: 34987520 | consumed tokens: 71654440960 | elapsed time per iteration (s): 0.83 | learning rate: 3.966E-05 | global batch size: 256 | lm loss: 1.950705E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.480 | TFLOPs: 18.66 | 31: iteration 136680/ 173500 | consumed samples: 34990080 | consumed tokens: 71659683840 | elapsed time per iteration (s): 0.83 | learning rate: 3.965E-05 | global batch size: 256 | lm loss: 1.929058E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 309.059 | TFLOPs: 18.70 | 31: iteration 136690/ 173500 | consumed samples: 34992640 | consumed tokens: 71664926720 | elapsed time per iteration (s): 0.80 | learning rate: 3.964E-05 | global batch size: 256 | lm loss: 1.905672E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.477 | TFLOPs: 19.33 | 31: iteration 136700/ 173500 | consumed samples: 34995200 | consumed tokens: 71670169600 | elapsed time per iteration (s): 0.75 | learning rate: 3.963E-05 | global batch size: 256 | lm loss: 1.936398E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.556 | TFLOPs: 20.60 | 31: iteration 136710/ 173500 | consumed samples: 34997760 | consumed tokens: 71675412480 | elapsed time per iteration (s): 0.76 | learning rate: 3.962E-05 | global batch size: 256 | lm loss: 1.929974E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.705 | TFLOPs: 20.25 | 31: iteration 136720/ 173500 | consumed samples: 35000320 | consumed tokens: 71680655360 | elapsed time per iteration (s): 0.77 | learning rate: 3.961E-05 | global batch size: 256 | lm loss: 1.906259E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.282 | TFLOPs: 20.10 | 31: iteration 136730/ 173500 | consumed samples: 35002880 | consumed tokens: 71685898240 | elapsed time per iteration (s): 0.74 | learning rate: 3.960E-05 | global batch size: 256 | lm loss: 1.922799E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.501 | TFLOPs: 20.90 | 31: iteration 136740/ 173500 | consumed samples: 35005440 | consumed tokens: 71691141120 | elapsed time per iteration (s): 0.77 | learning rate: 
3.959E-05 | global batch size: 256 | lm loss: 1.956681E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.897 | TFLOPs: 20.02 | 31: iteration 136750/ 173500 | consumed samples: 35008000 | consumed tokens: 71696384000 | elapsed time per iteration (s): 0.74 | learning rate: 3.958E-05 | global batch size: 256 | lm loss: 1.922678E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.016 | TFLOPs: 21.05 | 31: iteration 136760/ 173500 | consumed samples: 35010560 | consumed tokens: 71701626880 | elapsed time per iteration (s): 0.78 | learning rate: 3.957E-05 | global batch size: 256 | lm loss: 1.920474E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.284 | TFLOPs: 19.92 | 31: iteration 136770/ 173500 | consumed samples: 35013120 | consumed tokens: 71706869760 | elapsed time per iteration (s): 0.82 | learning rate: 3.956E-05 | global batch size: 256 | lm loss: 1.953498E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.024 | TFLOPs: 18.88 | 31: iteration 136780/ 173500 | consumed samples: 35015680 | consumed tokens: 71712112640 | elapsed time per iteration (s): 0.83 | learning rate: 3.955E-05 | global batch size: 256 | lm loss: 1.937795E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.358 | TFLOPs: 18.59 | 31: iteration 136790/ 173500 | consumed samples: 35018240 | consumed tokens: 71717355520 | elapsed time per iteration (s): 0.80 | learning rate: 3.954E-05 | global batch size: 256 | lm loss: 1.922101E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.703 | TFLOPs: 19.46 | 31: iteration 136800/ 173500 | 
consumed samples: 35020800 | consumed tokens: 71722598400 | elapsed time per iteration (s): 0.78 | learning rate: 3.953E-05 | global batch size: 256 | lm loss: 1.924691E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.559 | TFLOPs: 19.76 | 31: iteration 136810/ 173500 | consumed samples: 35023360 | consumed tokens: 71727841280 | elapsed time per iteration (s): 0.75 | learning rate: 3.952E-05 | global batch size: 256 | lm loss: 1.946032E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.552 | TFLOPs: 20.54 | 31: iteration 136820/ 173500 | consumed samples: 35025920 | consumed tokens: 71733084160 | elapsed time per iteration (s): 0.83 | learning rate: 3.951E-05 | global batch size: 256 | lm loss: 1.944142E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.632 | TFLOPs: 18.55 | 31: iteration 136830/ 173500 | consumed samples: 35028480 | consumed tokens: 71738327040 | elapsed time per iteration (s): 0.77 | learning rate: 3.950E-05 | global batch size: 256 | lm loss: 1.933523E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.206 | TFLOPs: 20.04 | 31: iteration 136840/ 173500 | consumed samples: 35031040 | consumed tokens: 71743569920 | elapsed time per iteration (s): 0.80 | learning rate: 3.949E-05 | global batch size: 256 | lm loss: 1.910827E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.106 | TFLOPs: 19.37 | 31: iteration 136850/ 173500 | consumed samples: 35033600 | consumed tokens: 71748812800 | elapsed time per iteration (s): 0.73 | learning rate: 3.947E-05 | global batch size: 256 | lm loss: 1.923858E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 349.957 | TFLOPs: 21.17 | 31: iteration 136860/ 173500 | consumed samples: 35036160 | consumed tokens: 71754055680 | elapsed time per iteration (s): 0.72 | learning rate: 3.946E-05 | global batch size: 256 | lm loss: 1.952316E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.290 | TFLOPs: 21.62 | 31: iteration 136870/ 173500 | consumed samples: 35038720 | consumed tokens: 71759298560 | elapsed time per iteration (s): 0.77 | learning rate: 3.945E-05 | global batch size: 256 | lm loss: 1.935455E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.467 | TFLOPs: 19.99 | 31: iteration 136880/ 173500 | consumed samples: 35041280 | consumed tokens: 71764541440 | elapsed time per iteration (s): 0.73 | learning rate: 3.944E-05 | global batch size: 256 | lm loss: 1.947532E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.266 | TFLOPs: 21.13 | 31: iteration 136890/ 173500 | consumed samples: 35043840 | consumed tokens: 71769784320 | elapsed time per iteration (s): 0.73 | learning rate: 3.943E-05 | global batch size: 256 | lm loss: 1.924263E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.075 | TFLOPs: 21.24 | 31: iteration 136900/ 173500 | consumed samples: 35046400 | consumed tokens: 71775027200 | elapsed time per iteration (s): 0.79 | learning rate: 3.942E-05 | global batch size: 256 | lm loss: 1.943610E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.210 | TFLOPs: 19.55 | 31: iteration 136910/ 173500 | consumed samples: 35048960 | consumed tokens: 71780270080 | elapsed time per iteration (s): 0.77 | learning rate: 
3.941E-05 | global batch size: 256 | lm loss: 1.934104E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.879 | TFLOPs: 20.14 | 31: iteration 136920/ 173500 | consumed samples: 35051520 | consumed tokens: 71785512960 | elapsed time per iteration (s): 0.78 | learning rate: 3.940E-05 | global batch size: 256 | lm loss: 1.925352E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.187 | TFLOPs: 19.85 | 31: iteration 136930/ 173500 | consumed samples: 35054080 | consumed tokens: 71790755840 | elapsed time per iteration (s): 0.76 | learning rate: 3.939E-05 | global batch size: 256 | lm loss: 1.940824E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.653 | TFLOPs: 20.43 | 31: iteration 136940/ 173500 | consumed samples: 35056640 | consumed tokens: 71795998720 | elapsed time per iteration (s): 0.81 | learning rate: 3.938E-05 | global batch size: 256 | lm loss: 1.908035E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.206 | TFLOPs: 19.19 | 31: iteration 136950/ 173500 | consumed samples: 35059200 | consumed tokens: 71801241600 | elapsed time per iteration (s): 0.79 | learning rate: 3.937E-05 | global batch size: 256 | lm loss: 1.917468E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.212 | TFLOPs: 19.49 | 31: iteration 136960/ 173500 | consumed samples: 35061760 | consumed tokens: 71806484480 | elapsed time per iteration (s): 0.79 | learning rate: 3.936E-05 | global batch size: 256 | lm loss: 1.925217E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.043 | TFLOPs: 19.60 | 31: iteration 136970/ 173500 | 
consumed samples: 35064320 | consumed tokens: 71811727360 | elapsed time per iteration (s): 0.74 | learning rate: 3.935E-05 | global batch size: 256 | lm loss: 1.925149E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.390 | TFLOPs: 20.83 | 31: iteration 136980/ 173500 | consumed samples: 35066880 | consumed tokens: 71816970240 | elapsed time per iteration (s): 0.75 | learning rate: 3.934E-05 | global batch size: 256 | lm loss: 1.921797E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.874 | TFLOPs: 20.56 | 31: iteration 136990/ 173500 | consumed samples: 35069440 | consumed tokens: 71822213120 | elapsed time per iteration (s): 0.73 | learning rate: 3.933E-05 | global batch size: 256 | lm loss: 1.929217E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.476 | TFLOPs: 21.32 | 31: iteration 137000/ 173500 | consumed samples: 35072000 | consumed tokens: 71827456000 | elapsed time per iteration (s): 0.80 | learning rate: 3.932E-05 | global batch size: 256 | lm loss: 1.941272E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.926 | TFLOPs: 19.35 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 137000 | lm loss value: 1.927287E+00 | lm loss PPL: 6.870847E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 137000 to checkpoints_1b1long 0: [2022-11-27 00:56:39,429] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step137000 is begin to save! 
0: [2022-11-27 00:56:39,438] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_01-model_00-model_states.pt... 0: [2022-11-27 00:56:39,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_01-model_00-model_states.pt. 0: [2022-11-27 00:56:39,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_03-model_00-model_states.pt... 0: [2022-11-27 00:56:39,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_03-model_00-model_states.pt. 0: [2022-11-27 00:56:39,739] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_04-model_00-model_states.pt... 0: [2022-11-27 00:56:39,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_04-model_00-model_states.pt. 0: [2022-11-27 00:56:39,820] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_05-model_00-model_states.pt... 0: [2022-11-27 00:56:39,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_05-model_00-model_states.pt. 0: [2022-11-27 00:56:39,899] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_06-model_00-model_states.pt... 0: [2022-11-27 00:56:39,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_06-model_00-model_states.pt. 0: [2022-11-27 00:56:39,974] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_07-model_00-model_states.pt... 0: [2022-11-27 00:56:40,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_07-model_00-model_states.pt. 
0: [2022-11-27 00:56:40,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_08-model_00-model_states.pt... 0: [2022-11-27 00:56:40,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_08-model_00-model_states.pt. 0: [2022-11-27 00:56:40,126] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_09-model_00-model_states.pt... 0: [2022-11-27 00:56:40,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_09-model_00-model_states.pt. 0: [2022-11-27 00:56:40,200] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_10-model_00-model_states.pt... 0: [2022-11-27 00:56:40,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_10-model_00-model_states.pt. 0: [2022-11-27 00:56:40,278] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_11-model_00-model_states.pt... 0: [2022-11-27 00:56:40,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_11-model_00-model_states.pt. 0: [2022-11-27 00:56:40,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_12-model_00-model_states.pt... 0: [2022-11-27 00:56:40,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_12-model_00-model_states.pt. 0: [2022-11-27 00:56:40,427] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_13-model_00-model_states.pt... 0: [2022-11-27 00:56:40,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_13-model_00-model_states.pt. 
0: [2022-11-27 00:56:40,503] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_14-model_00-model_states.pt... 0: [2022-11-27 00:56:40,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_14-model_00-model_states.pt. 0: [2022-11-27 00:56:40,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_15-model_00-model_states.pt... 0: [2022-11-27 00:56:40,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_15-model_00-model_states.pt. 0: [2022-11-27 00:56:40,651] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_16-model_00-model_states.pt... 0: [2022-11-27 00:56:40,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_16-model_00-model_states.pt. 0: [2022-11-27 00:56:40,726] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_17-model_00-model_states.pt... 0: [2022-11-27 00:56:40,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_17-model_00-model_states.pt. 0: [2022-11-27 00:56:40,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_18-model_00-model_states.pt... 0: [2022-11-27 00:56:40,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_18-model_00-model_states.pt. 0: [2022-11-27 00:56:40,875] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_19-model_00-model_states.pt... 0: [2022-11-27 00:56:40,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_19-model_00-model_states.pt. 
0: [2022-11-27 00:56:40,949] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_20-model_00-model_states.pt... 0: [2022-11-27 00:56:41,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_20-model_00-model_states.pt. 0: [2022-11-27 00:56:41,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_21-model_00-model_states.pt... 0: [2022-11-27 00:56:41,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_21-model_00-model_states.pt. 0: [2022-11-27 00:56:41,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_22-model_00-model_states.pt... 0: [2022-11-27 00:56:41,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_22-model_00-model_states.pt. 0: [2022-11-27 00:56:41,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_23-model_00-model_states.pt... 0: [2022-11-27 00:56:41,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_23-model_00-model_states.pt. 0: [2022-11-27 00:56:41,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_24-model_00-model_states.pt... 0: [2022-11-27 00:56:41,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_24-model_00-model_states.pt. 0: [2022-11-27 00:56:41,321] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_25-model_00-model_states.pt... 0: [2022-11-27 00:56:41,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_25-model_00-model_states.pt. 
0: [2022-11-27 00:56:41,393] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_26-model_00-model_states.pt... 0: [2022-11-27 00:56:41,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_26-model_00-model_states.pt. 0: [2022-11-27 00:56:41,469] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_27-model_00-model_states.pt... 0: [2022-11-27 00:56:41,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_27-model_00-model_states.pt. 0: [2022-11-27 00:56:41,542] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_28-model_00-model_states.pt... 0: [2022-11-27 00:56:41,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_28-model_00-model_states.pt. 0: [2022-11-27 00:56:41,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/layer_30-model_00-model_states.pt... 0: [2022-11-27 00:56:41,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/layer_30-model_00-model_states.pt. 0: [2022-11-27 00:56:41,620] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step137000/mp_rank_00_model_states.pt 0: [2022-11-27 00:56:41,620] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/mp_rank_00_model_states.pt... 0: [2022-11-27 00:56:41,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/mp_rank_00_model_states.pt. 0: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
0: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 
7: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
2: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 
23: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 
22: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 
18: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
7: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 
1: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
12: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 
24: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 
22: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 
21: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 
0: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
10: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
13: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 
25: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 
30: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 
5: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
13: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 
24: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 
19: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 
25: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 30: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 18: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 
19: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 29: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 
18: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:56:41,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:56:41,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:56:41,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 00:56:41,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 17: [2022-11-27 00:56:41,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:56:41,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:56:41,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 00:56:41,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 20: [2022-11-27 00:56:41,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 00:56:41,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 
8: [2022-11-27 00:56:41,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:56:41,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 7: [2022-11-27 00:56:41,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:56:41,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 7: [2022-11-27 00:56:41,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 00:56:41,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 5: [2022-11-27 00:56:41,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:56:41,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 00:56:41,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 18: [2022-11-27 00:56:41,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:56:41,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 00:56:41,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 
27: [2022-11-27 00:56:41,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:56:41,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:56:41,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:56:41,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 27: [2022-11-27 00:56:41,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 0: [2022-11-27 00:56:41,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 00:56:41,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 31: [2022-11-27 00:56:41,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 4: [2022-11-27 00:56:41,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:56:41,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-27 00:56:41,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 27: [2022-11-27 00:56:41,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 4: [2022-11-27 00:56:41,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 00:56:41,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 4: [2022-11-27 00:56:41,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 28: [2022-11-27 00:56:41,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-27 00:56:41,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 20: [2022-11-27 00:56:41,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:56:41,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 00:56:41,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 28: [2022-11-27 00:56:41,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 23: [2022-11-27 00:56:41,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
17: [2022-11-27 00:56:41,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:56:41,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 00:56:41,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 17: [2022-11-27 00:56:41,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 23: [2022-11-27 00:56:41,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:56:41,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:56:41,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 11: [2022-11-27 00:56:41,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 00:56:41,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 23: [2022-11-27 00:56:41,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 00:56:41,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 6: [2022-11-27 00:56:41,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-27 00:56:41,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 5: [2022-11-27 00:56:41,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:56:41,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 5: [2022-11-27 00:56:41,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 00:56:41,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 13: [2022-11-27 00:56:41,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:56:41,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 00:56:41,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 14: [2022-11-27 00:56:41,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:56:41,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 00:56:41,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 12: [2022-11-27 00:56:41,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-27 00:56:41,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 00:56:41,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 18: [2022-11-27 00:56:41,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:56:41,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:56:41,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 16: [2022-11-27 00:56:41,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 18: [2022-11-27 00:56:41,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 16: [2022-11-27 00:56:41,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 24: [2022-11-27 00:56:41,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:56:41,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 00:56:41,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 8: [2022-11-27 00:56:41,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-27 00:56:41,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 00:56:41,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 9: [2022-11-27 00:56:41,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:56:41,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 00:56:41,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 10: [2022-11-27 00:56:41,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:56:41,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:56:41,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:56:41,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 9: [2022-11-27 00:56:41,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
10: [2022-11-27 00:56:41,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 00:56:41,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 9: [2022-11-27 00:56:41,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 10: [2022-11-27 00:56:41,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 10: [2022-11-27 00:56:41,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 10: [2022-11-27 00:56:41,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 9: [2022-11-27 00:56:41,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 30: [2022-11-27 00:56:41,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:56:41,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 00:56:41,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 24: [2022-11-27 00:56:41,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-27 00:56:41,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-27 00:56:41,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 4: [2022-11-27 00:56:41,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:56:41,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:56:41,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 00:56:41,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 7: [2022-11-27 00:56:41,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 00:56:41,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 28: [2022-11-27 00:56:41,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-27 00:56:41,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 00:56:41,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 9: [2022-11-27 00:56:41,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
14: [2022-11-27 00:56:41,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:56:41,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 14: [2022-11-27 00:56:41,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 9: [2022-11-27 00:56:41,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 14: [2022-11-27 00:56:41,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 0: [2022-11-27 00:56:41,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:56:41,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 00:56:41,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 7: [2022-11-27 00:56:41,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:56:41,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 00:56:41,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 20: [2022-11-27 00:56:41,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-27 00:56:41,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 00:56:41,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 30: [2022-11-27 00:56:41,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:56:41,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 00:56:41,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 5: [2022-11-27 00:56:41,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:56:41,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:56:41,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 00:56:41,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 1: [2022-11-27 00:56:41,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:56:41,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:56:41,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
1: [2022-11-27 00:56:41,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 27: [2022-11-27 00:56:41,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 1: [2022-11-27 00:56:41,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 17: [2022-11-27 00:56:41,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 27: [2022-11-27 00:56:41,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 17: [2022-11-27 00:56:41,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 2: [2022-11-27 00:56:41,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:56:41,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:56:41,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-27 00:56:41,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 00:56:41,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 00:56:41,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 00:56:41,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 2: [2022-11-27 00:56:41,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 2: [2022-11-27 00:56:41,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 16: [2022-11-27 00:56:41,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:56:41,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 00:56:41,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 27: [2022-11-27 00:56:41,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:56:41,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
27: [2022-11-27 00:56:41,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 12: [2022-11-27 00:56:41,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 00:56:41,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 28: [2022-11-27 00:56:41,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:56:41,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 28: [2022-11-27 00:56:41,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 00:56:41,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 8: [2022-11-27 00:56:41,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:56:41,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:56:41,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 00:56:41,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 
24: [2022-11-27 00:56:41,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 11: [2022-11-27 00:56:41,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:56:41,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 11: [2022-11-27 00:56:41,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 00:56:41,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 1: [2022-11-27 00:56:41,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:56:41,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:56:41,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 18: [2022-11-27 00:56:41,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 1: [2022-11-27 00:56:41,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 18: [2022-11-27 00:56:41,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 13: [2022-11-27 00:56:41,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
14: [2022-11-27 00:56:41,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:56:41,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 13: [2022-11-27 00:56:41,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 14: [2022-11-27 00:56:41,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 13: [2022-11-27 00:56:41,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 6: [2022-11-27 00:56:41,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:56:41,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 00:56:41,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:56:41,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:56:41,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 
11: [2022-11-27 00:56:41,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 6: [2022-11-27 00:56:41,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 11: [2022-11-27 00:56:41,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 6: [2022-11-27 00:56:41,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 17: [2022-11-27 00:56:41,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:56:41,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 00:56:41,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 20: [2022-11-27 00:56:41,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:56:41,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 30: [2022-11-27 00:56:41,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:56:41,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 
30: [2022-11-27 00:56:41,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 00:56:41,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 19: [2022-11-27 00:56:41,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:56:41,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 00:56:41,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 5: [2022-11-27 00:56:41,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 00:56:41,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 4: [2022-11-27 00:56:41,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:56:41,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:56:41,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 4: [2022-11-27 00:56:41,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 5: [2022-11-27 00:56:41,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 
4: [2022-11-27 00:56:41,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 23: [2022-11-27 00:56:41,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:56:41,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:56:41,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:56:41,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 31: [2022-11-27 00:56:41,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:56:41,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 00:56:41,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 23: [2022-11-27 00:56:41,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 31: [2022-11-27 00:56:41,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 29: [2022-11-27 00:56:41,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 29: [2022-11-27 00:56:41,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 
31: [2022-11-27 00:56:41,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 15: [2022-11-27 00:56:41,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:56:41,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 00:56:41,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 31: [2022-11-27 00:56:41,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:56:41,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 15: [2022-11-27 00:56:41,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:56:41,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 31: [2022-11-27 00:56:41,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 15: [2022-11-27 00:56:41,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 0: [2022-11-27 00:56:41,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-27 00:56:41,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 00:56:41,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 23: [2022-11-27 00:56:41,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:56:41,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:56:41,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 23: [2022-11-27 00:56:41,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 10: [2022-11-27 00:56:41,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 23: [2022-11-27 00:56:41,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 19: [2022-11-27 00:56:41,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:56:41,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 00:56:41,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 8: [2022-11-27 00:56:41,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
13: [2022-11-27 00:56:41,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:56:41,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 00:56:41,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 13: [2022-11-27 00:56:41,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 00:56:41,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 18: [2022-11-27 00:56:41,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:56:41,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 00:56:41,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 9: [2022-11-27 00:56:41,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:56:41,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 00:56:41,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 12: [2022-11-27 00:56:41,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-27 00:56:41,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 00:56:41,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 24: [2022-11-27 00:56:41,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:56:41,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:56:41,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:56:41,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:56:41,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 00:56:41,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:56:41,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 
22: [2022-11-27 00:56:41,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 00:56:41,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 00:56:41,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 24: [2022-11-27 00:56:41,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 22: [2022-11-27 00:56:41,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 22: [2022-11-27 00:56:41,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 22: [2022-11-27 00:56:41,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 15: [2022-11-27 00:56:41,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:56:41,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 24: [2022-11-27 00:56:41,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 15: [2022-11-27 00:56:41,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 0: [2022-11-27 00:56:41,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
1: [2022-11-27 00:56:41,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:56:41,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:56:41,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 31: [2022-11-27 00:56:41,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 00:56:41,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 1: [2022-11-27 00:56:41,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 7: [2022-11-27 00:56:41,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:56:41,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 00:56:41,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 3: [2022-11-27 00:56:41,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:56:41,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:56:41,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-27 00:56:41,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 00:56:41,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 00:56:41,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 3: [2022-11-27 00:56:41,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 00:56:41,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 3: [2022-11-27 00:56:41,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 30: [2022-11-27 00:56:41,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:56:41,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 00:56:41,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 25: [2022-11-27 00:56:41,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:56:41,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:56:41,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 
25: [2022-11-27 00:56:41,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 00:56:41,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 00:56:41,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-27 00:56:41,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 25: [2022-11-27 00:56:41,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 25: [2022-11-27 00:56:41,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 16: [2022-11-27 00:56:41,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:56:41,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 00:56:41,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 29: [2022-11-27 00:56:41,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:56:41,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 00:56:41,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 
28: [2022-11-27 00:56:41,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-27 00:56:41,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 19: [2022-11-27 00:56:41,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:56:41,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 28: [2022-11-27 00:56:41,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 19: [2022-11-27 00:56:41,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 11: [2022-11-27 00:56:41,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:56:41,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 00:56:41,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 21: [2022-11-27 00:56:41,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:56:41,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
21: [2022-11-27 00:56:41,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:56:41,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 00:56:41,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 00:56:41,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 21: [2022-11-27 00:56:41,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 00:56:41,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 21: [2022-11-27 00:56:41,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 27: [2022-11-27 00:56:41,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:56:41,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 00:56:41,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 0: [2022-11-27 00:56:41,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 00:56:41,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 
26: [2022-11-27 00:56:41,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:56:41,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:56:41,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:56:41,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:56:41,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 00:56:41,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 00:56:41,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 00:56:41,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 26: [2022-11-27 00:56:41,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 26: [2022-11-27 00:56:41,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 14: [2022-11-27 00:56:41,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
21: [2022-11-27 00:56:41,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 2: [2022-11-27 00:56:41,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:56:41,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 21: [2022-11-27 00:56:41,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 2: [2022-11-27 00:56:41,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 14: [2022-11-27 00:56:41,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 2: [2022-11-27 00:56:41,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 16: [2022-11-27 00:56:41,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:56:41,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 00:56:41,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 3: [2022-11-27 00:56:41,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-27 00:56:41,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 00:56:41,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 25: [2022-11-27 00:56:41,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:56:41,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:56:41,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 12: [2022-11-27 00:56:41,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 00:56:41,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 25: [2022-11-27 00:56:41,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 13: [2022-11-27 00:56:41,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:56:41,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 00:56:41,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 5: [2022-11-27 00:56:41,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-27 00:56:41,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 00:56:41,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 15: [2022-11-27 00:56:41,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:56:41,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 00:56:41,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 9: [2022-11-27 00:56:41,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:56:41,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 00:56:41,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 23: [2022-11-27 00:56:41,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:56:41,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 00:56:41,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 20: [2022-11-27 00:56:41,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 
20: [2022-11-27 00:56:41,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 00:56:41,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 10: [2022-11-27 00:56:41,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:56:41,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 00:56:41,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 4: [2022-11-27 00:56:41,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:56:41,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 00:56:41,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 17: [2022-11-27 00:56:41,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:56:41,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 00:56:41,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 6: [2022-11-27 00:56:41,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-27 00:56:41,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 00:56:41,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 18: [2022-11-27 00:56:41,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:56:41,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 00:56:41,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 1: [2022-11-27 00:56:41,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:56:41,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 00:56:41,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 22: [2022-11-27 00:56:41,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:56:41,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 00:56:41,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 24: [2022-11-27 00:56:41,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
24: [2022-11-27 00:56:41,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 31: [2022-11-27 00:56:41,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:56:41,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 31: [2022-11-27 00:56:41,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 00:56:41,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 28: [2022-11-27 00:56:41,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:56:41,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 28: [2022-11-27 00:56:41,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 30: [2022-11-27 00:56:41,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 28: [2022-11-27 00:56:41,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 30: [2022-11-27 00:56:41,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 26: [2022-11-27 00:56:41,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
26: [2022-11-27 00:56:41,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 00:56:41,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 8: [2022-11-27 00:56:41,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:56:41,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 00:56:41,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 29: [2022-11-27 00:56:41,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:56:41,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 00:56:41,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 27: [2022-11-27 00:56:41,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-27 00:56:41,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 00:56:41,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 7: [2022-11-27 00:56:41,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-27 00:56:41,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 00:56:41,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 11: [2022-11-27 00:56:41,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:56:41,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:56:41,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 00:56:41,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 0: [2022-11-27 00:56:41,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 00:56:41,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 3: [2022-11-27 00:56:41,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:56:41,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 00:56:41,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 12: [2022-11-27 00:56:41,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-27 00:56:41,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 00:56:41,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 16: [2022-11-27 00:56:41,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:56:41,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 00:56:41,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 19: [2022-11-27 00:56:41,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:56:41,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 00:56:41,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 21: [2022-11-27 00:56:41,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:56:41,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 00:56:41,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 2: [2022-11-27 00:56:41,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-27 00:56:41,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 00:56:41,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 14: [2022-11-27 00:56:41,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:56:41,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 00:56:41,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 25: [2022-11-27 00:56:41,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:56:41,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 00:56:41,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 13: [2022-11-27 00:56:41,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:56:41,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 00:56:41,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 15: [2022-11-27 00:56:41,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-27 00:56:41,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 00:56:41,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 4: [2022-11-27 00:56:41,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:56:41,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 00:56:41,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 5: [2022-11-27 00:56:41,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:56:41,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 00:56:41,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 31: [2022-11-27 00:56:41,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:56:41,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 00:56:41,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 20: [2022-11-27 00:56:41,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 
20: [2022-11-27 00:56:41,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 23: [2022-11-27 00:56:41,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:56:41,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 23: [2022-11-27 00:56:41,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 00:56:41,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 10: [2022-11-27 00:56:41,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:56:41,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 00:56:41,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 22: [2022-11-27 00:56:41,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:56:41,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 00:56:41,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 6: [2022-11-27 00:56:41,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-27 00:56:41,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 7: [2022-11-27 00:56:41,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:56:41,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 7: [2022-11-27 00:56:41,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 1: [2022-11-27 00:56:41,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:56:41,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 1: [2022-11-27 00:56:41,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 00:56:41,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 17: [2022-11-27 00:56:41,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:56:41,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 00:56:41,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 28: [2022-11-27 00:56:41,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-27 00:56:41,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 00:56:41,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 11: [2022-11-27 00:56:41,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:56:41,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 00:56:41,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 29: [2022-11-27 00:56:41,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:56:41,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 00:56:41,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 8: [2022-11-27 00:56:41,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:56:41,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 00:56:41,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 27: [2022-11-27 00:56:41,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 
27: [2022-11-27 00:56:41,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 00:56:41,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 2: [2022-11-27 00:56:41,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:56:41,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 00:56:41,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 0: [2022-11-27 00:56:41,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:56:41,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 00:56:41,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 30: [2022-11-27 00:56:41,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:56:41,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:56:41,885] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 00:56:41,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 
25: [2022-11-27 00:56:41,885] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 00:56:41,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 12: [2022-11-27 00:56:41,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:56:41,885] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 00:56:41,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 16: [2022-11-27 00:56:41,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:56:41,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:56:41,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 00:56:41,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 13: [2022-11-27 00:56:41,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 00:56:41,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 24: [2022-11-27 00:56:41,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 
24: [2022-11-27 00:56:41,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 00:56:41,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 20: [2022-11-27 00:56:41,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:56:41,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 00:56:41,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 14: [2022-11-27 00:56:41,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:56:41,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 00:56:41,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 21: [2022-11-27 00:56:41,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:56:41,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 00:56:41,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 9: [2022-11-27 00:56:41,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-27 00:56:41,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 00:56:41,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 15: [2022-11-27 00:56:41,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:56:41,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 00:56:41,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 23: [2022-11-27 00:56:41,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:56:41,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:56:41,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 00:56:41,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 23: [2022-11-27 00:56:41,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 00:56:41,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 7: [2022-11-27 00:56:41,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-27 00:56:41,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 00:56:41,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 22: [2022-11-27 00:56:41,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:56:41,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 00:56:41,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 10: [2022-11-27 00:56:41,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:56:41,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 00:56:41,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 17: [2022-11-27 00:56:41,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-27 00:56:41,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 00:56:41,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 5: [2022-11-27 00:56:41,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
27: [2022-11-27 00:56:41,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:56:41,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 00:56:41,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 27: [2022-11-27 00:56:41,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 00:56:41,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 0: [2022-11-27 00:56:41,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:56:41,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 00:56:41,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 11: [2022-11-27 00:56:41,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:56:41,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 00:56:41,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 28: [2022-11-27 00:56:41,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 
2: [2022-11-27 00:56:41,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:56:41,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:56:41,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 16: [2022-11-27 00:56:41,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 2: [2022-11-27 00:56:41,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 28: [2022-11-27 00:56:41,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 24: [2022-11-27 00:56:41,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:56:41,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 28: [2022-11-27 00:56:41,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 24: [2022-11-27 00:56:41,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 00:56:41,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 8: [2022-11-27 00:56:41,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-27 00:56:41,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 00:56:41,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 31: [2022-11-27 00:56:41,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:56:41,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 00:56:41,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 6: [2022-11-27 00:56:41,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:56:41,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 00:56:41,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 1: [2022-11-27 00:56:41,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:56:41,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 00:56:41,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 25: [2022-11-27 00:56:41,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
13: [2022-11-27 00:56:41,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:56:41,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 13: [2022-11-27 00:56:41,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 00:56:41,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 25: [2022-11-27 00:56:41,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 4: [2022-11-27 00:56:41,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:56:41,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 00:56:41,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 29: [2022-11-27 00:56:41,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:56:41,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
29: [2022-11-27 00:56:41,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 18: [2022-11-27 00:56:41,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 29: [2022-11-27 00:56:41,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 18: [2022-11-27 00:56:41,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 23: [2022-11-27 00:56:41,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:56:41,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:56:41,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 23: [2022-11-27 00:56:41,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 26: [2022-11-27 00:56:41,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 3: [2022-11-27 00:56:41,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 23: [2022-11-27 00:56:41,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 
3: [2022-11-27 00:56:41,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 00:56:41,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 12: [2022-11-27 00:56:41,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:56:41,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 00:56:41,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 19: [2022-11-27 00:56:41,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:56:41,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 00:56:41,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 9: [2022-11-27 00:56:41,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 22: [2022-11-27 00:56:41,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
9: [2022-11-27 00:56:41,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 22: [2022-11-27 00:56:41,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 9: [2022-11-27 00:56:41,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 22: [2022-11-27 00:56:41,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 10: [2022-11-27 00:56:41,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:56:41,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 00:56:41,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 20: [2022-11-27 00:56:41,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-27 00:56:41,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 00:56:41,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 6: [2022-11-27 00:56:41,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-27 00:56:41,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 00:56:41,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 5: [2022-11-27 00:56:41,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:56:41,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 14: [2022-11-27 00:56:41,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:56:41,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 14: [2022-11-27 00:56:41,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 00:56:41,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 15: [2022-11-27 00:56:41,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:56:41,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 00:56:41,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 21: [2022-11-27 00:56:41,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
30: [2022-11-27 00:56:41,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:56:41,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 30: [2022-11-27 00:56:41,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 00:56:41,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 21: [2022-11-27 00:56:41,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 31: [2022-11-27 00:56:41,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:56:41,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 1: [2022-11-27 00:56:41,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 31: [2022-11-27 00:56:41,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 1: [2022-11-27 00:56:41,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 00:56:41,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 4: [2022-11-27 00:56:41,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-27 00:56:41,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 00:56:41,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 7: [2022-11-27 00:56:41,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:56:41,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 00:56:41,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 8: [2022-11-27 00:56:41,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:56:41,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 00:56:41,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 28: [2022-11-27 00:56:41,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:56:41,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:56:41,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 00:56:41,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 
16: [2022-11-27 00:56:41,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-27 00:56:41,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 00:56:41,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 27: [2022-11-27 00:56:41,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:56:41,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:56:41,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-27 00:56:41,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 27: [2022-11-27 00:56:41,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 18: [2022-11-27 00:56:41,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 27: [2022-11-27 00:56:41,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 18: [2022-11-27 00:56:41,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 00:56:41,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 
24: [2022-11-27 00:56:41,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-27 00:56:41,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 00:56:41,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 30: [2022-11-27 00:56:41,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-27 00:56:41,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 00:56:41,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 9: [2022-11-27 00:56:41,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:56:41,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 00:56:41,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 28: [2022-11-27 00:56:41,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 00:56:41,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 2: [2022-11-27 00:56:41,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-27 00:56:41,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 00:56:41,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 25: [2022-11-27 00:56:41,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:56:41,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 17: [2022-11-27 00:56:41,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 25: [2022-11-27 00:56:41,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 17: [2022-11-27 00:56:41,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 00:56:41,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 0: [2022-11-27 00:56:41,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:56:41,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 00:56:41,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 15: [2022-11-27 00:56:41,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
3: [2022-11-27 00:56:41,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:56:41,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 3: [2022-11-27 00:56:41,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 15: [2022-11-27 00:56:41,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 3: [2022-11-27 00:56:41,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 12: [2022-11-27 00:56:41,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:56:41,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 00:56:41,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 29: [2022-11-27 00:56:41,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:56:41,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 00:56:41,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 29: [2022-11-27 00:56:41,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
13: [2022-11-27 00:56:41,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:56:41,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:56:41,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 00:56:41,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 13: [2022-11-27 00:56:41,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 00:56:41,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 29: [2022-11-27 00:56:41,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 21: [2022-11-27 00:56:41,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:56:41,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 29: [2022-11-27 00:56:41,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 
21: [2022-11-27 00:56:41,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 19: [2022-11-27 00:56:41,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 00:56:41,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 21: [2022-11-27 00:56:41,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 19: [2022-11-27 00:56:41,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 19: [2022-11-27 00:56:41,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 00:56:41,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 19: [2022-11-27 00:56:41,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-27 00:56:41,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-27 00:56:41,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 14: [2022-11-27 00:56:41,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-27 00:56:41,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 00:56:41,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 26: [2022-11-27 00:56:41,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:56:41,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-27 00:56:41,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 00:56:41,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step137000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-27 00:56:41,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 26: [2022-11-27 00:56:41,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step137000 is ready now! 
0: successfully saved checkpoint at iteration 137000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2515.15 31: iteration 137010/ 173500 | consumed samples: 35074560 | consumed tokens: 71832698880 | elapsed time per iteration (s): 1.04 | learning rate: 3.931E-05 | global batch size: 256 | lm loss: 1.922432E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.279 | TFLOPs: 14.84 | 31: iteration 137020/ 173500 | consumed samples: 35077120 | consumed tokens: 71837941760 | elapsed time per iteration (s): 0.78 | learning rate: 3.930E-05 | global batch size: 256 | lm loss: 1.942324E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.425 | TFLOPs: 19.87 | 31: iteration 137030/ 173500 | consumed samples: 35079680 | consumed tokens: 71843184640 | elapsed time per iteration (s): 0.75 | learning rate: 3.929E-05 | global batch size: 256 | lm loss: 1.913112E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.727 | TFLOPs: 20.67 | 31: iteration 137040/ 173500 | consumed samples: 35082240 | consumed tokens: 71848427520 | elapsed time per iteration (s): 0.75 | learning rate: 3.928E-05 | global batch size: 256 | lm loss: 1.919042E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.080 | TFLOPs: 20.57 | 31: iteration 137050/ 173500 | consumed samples: 35084800 | consumed tokens: 71853670400 | elapsed time per iteration (s): 0.82 | learning rate: 3.927E-05 | global batch size: 256 | lm loss: 1.921170E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.660 | TFLOPs: 18.98 | 31: iteration 137060/ 173500 | consumed samples: 35087360 | consumed tokens: 71858913280 | elapsed time per iteration (s): 
0.86 | learning rate: 3.926E-05 | global batch size: 256 | lm loss: 1.924025E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.275 | TFLOPs: 17.98 | 31: iteration 137070/ 173500 | consumed samples: 35089920 | consumed tokens: 71864156160 | elapsed time per iteration (s): 0.82 | learning rate: 3.925E-05 | global batch size: 256 | lm loss: 1.932107E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.712 | TFLOPs: 18.98 | 31: iteration 137080/ 173500 | consumed samples: 35092480 | consumed tokens: 71869399040 | elapsed time per iteration (s): 0.83 | learning rate: 3.924E-05 | global batch size: 256 | lm loss: 1.910130E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.151 | TFLOPs: 18.70 | 31: iteration 137090/ 173500 | consumed samples: 35095040 | consumed tokens: 71874641920 | elapsed time per iteration (s): 0.82 | learning rate: 3.923E-05 | global batch size: 256 | lm loss: 1.917543E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.386 | TFLOPs: 18.84 | 31: iteration 137100/ 173500 | consumed samples: 35097600 | consumed tokens: 71879884800 | elapsed time per iteration (s): 0.83 | learning rate: 3.922E-05 | global batch size: 256 | lm loss: 1.936464E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.896 | TFLOPs: 18.57 | 31: iteration 137110/ 173500 | consumed samples: 35100160 | consumed tokens: 71885127680 | elapsed time per iteration (s): 0.80 | learning rate: 3.921E-05 | global batch size: 256 | lm loss: 1.926951E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.298 | TFLOPs: 19.32 | 31: 
iteration 137120/ 173500 | consumed samples: 35102720 | consumed tokens: 71890370560 | elapsed time per iteration (s): 0.82 | learning rate: 3.920E-05 | global batch size: 256 | lm loss: 1.928466E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.424 | TFLOPs: 18.78 | 31: iteration 137130/ 173500 | consumed samples: 35105280 | consumed tokens: 71895613440 | elapsed time per iteration (s): 0.80 | learning rate: 3.919E-05 | global batch size: 256 | lm loss: 1.922497E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.962 | TFLOPs: 19.30 | 31: iteration 137140/ 173500 | consumed samples: 35107840 | consumed tokens: 71900856320 | elapsed time per iteration (s): 0.85 | learning rate: 3.918E-05 | global batch size: 256 | lm loss: 1.906471E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.924 | TFLOPs: 18.14 | 31: iteration 137150/ 173500 | consumed samples: 35110400 | consumed tokens: 71906099200 | elapsed time per iteration (s): 0.83 | learning rate: 3.917E-05 | global batch size: 256 | lm loss: 1.930713E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.219 | TFLOPs: 18.59 | 31: iteration 137160/ 173500 | consumed samples: 35112960 | consumed tokens: 71911342080 | elapsed time per iteration (s): 0.85 | learning rate: 3.916E-05 | global batch size: 256 | lm loss: 1.934499E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.949 | TFLOPs: 18.33 | 31: iteration 137170/ 173500 | consumed samples: 35115520 | consumed tokens: 71916584960 | elapsed time per iteration (s): 0.84 | learning rate: 3.915E-05 | global batch size: 256 | lm loss: 1.962008E+00 | grad norm: 0.176 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.470 | TFLOPs: 18.36 | 31: iteration 137180/ 173500 | consumed samples: 35118080 | consumed tokens: 71921827840 | elapsed time per iteration (s): 0.81 | learning rate: 3.914E-05 | global batch size: 256 | lm loss: 1.927613E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.043 | TFLOPs: 19.06 | 31: iteration 137190/ 173500 | consumed samples: 35120640 | consumed tokens: 71927070720 | elapsed time per iteration (s): 0.77 | learning rate: 3.913E-05 | global batch size: 256 | lm loss: 1.965215E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.846 | TFLOPs: 20.08 | 31: iteration 137200/ 173500 | consumed samples: 35123200 | consumed tokens: 71932313600 | elapsed time per iteration (s): 0.85 | learning rate: 3.912E-05 | global batch size: 256 | lm loss: 1.925887E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.095 | TFLOPs: 18.15 | 31: iteration 137210/ 173500 | consumed samples: 35125760 | consumed tokens: 71937556480 | elapsed time per iteration (s): 0.82 | learning rate: 3.911E-05 | global batch size: 256 | lm loss: 1.930529E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.279 | TFLOPs: 18.89 | 31: iteration 137220/ 173500 | consumed samples: 35128320 | consumed tokens: 71942799360 | elapsed time per iteration (s): 0.80 | learning rate: 3.910E-05 | global batch size: 256 | lm loss: 1.929888E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.362 | TFLOPs: 19.26 | 31: iteration 137230/ 173500 | consumed samples: 35130880 | consumed tokens: 71948042240 | elapsed time per iteration (s): 0.82 | 
learning rate: 3.909E-05 | global batch size: 256 | lm loss: 1.954706E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.670 | TFLOPs: 18.92 | 31: iteration 137240/ 173500 | consumed samples: 35133440 | consumed tokens: 71953285120 | elapsed time per iteration (s): 0.86 | learning rate: 3.908E-05 | global batch size: 256 | lm loss: 1.933471E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.609 | TFLOPs: 17.94 | 31: iteration 137250/ 173500 | consumed samples: 35136000 | consumed tokens: 71958528000 | elapsed time per iteration (s): 0.77 | learning rate: 3.907E-05 | global batch size: 256 | lm loss: 1.937820E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.610 | TFLOPs: 20.00 | 31: iteration 137260/ 173500 | consumed samples: 35138560 | consumed tokens: 71963770880 | elapsed time per iteration (s): 0.81 | learning rate: 3.906E-05 | global batch size: 256 | lm loss: 1.901373E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.913 | TFLOPs: 19.11 | 31: iteration 137270/ 173500 | consumed samples: 35141120 | consumed tokens: 71969013760 | elapsed time per iteration (s): 0.76 | learning rate: 3.905E-05 | global batch size: 256 | lm loss: 1.936261E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.889 | TFLOPs: 20.50 | 31: iteration 137280/ 173500 | consumed samples: 35143680 | consumed tokens: 71974256640 | elapsed time per iteration (s): 0.79 | learning rate: 3.904E-05 | global batch size: 256 | lm loss: 1.925870E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.128 | TFLOPs: 19.61 | 31: iteration 
137290/ 173500 | consumed samples: 35146240 | consumed tokens: 71979499520 | elapsed time per iteration (s): 0.74 | learning rate: 3.903E-05 | global batch size: 256 | lm loss: 1.927111E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.038 | TFLOPs: 20.87 | 31: iteration 137300/ 173500 | consumed samples: 35148800 | consumed tokens: 71984742400 | elapsed time per iteration (s): 0.80 | learning rate: 3.902E-05 | global batch size: 256 | lm loss: 1.929919E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.278 | TFLOPs: 19.25 | 31: iteration 137310/ 173500 | consumed samples: 35151360 | consumed tokens: 71989985280 | elapsed time per iteration (s): 0.81 | learning rate: 3.901E-05 | global batch size: 256 | lm loss: 1.892591E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.343 | TFLOPs: 19.20 | 31: iteration 137320/ 173500 | consumed samples: 35153920 | consumed tokens: 71995228160 | elapsed time per iteration (s): 0.82 | learning rate: 3.900E-05 | global batch size: 256 | lm loss: 1.943096E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.127 | TFLOPs: 18.94 | 31: iteration 137330/ 173500 | consumed samples: 35156480 | consumed tokens: 72000471040 | elapsed time per iteration (s): 0.75 | learning rate: 3.899E-05 | global batch size: 256 | lm loss: 1.950952E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.579 | TFLOPs: 20.73 | 31: iteration 137340/ 173500 | consumed samples: 35159040 | consumed tokens: 72005713920 | elapsed time per iteration (s): 0.77 | learning rate: 3.898E-05 | global batch size: 256 | lm loss: 1.933803E+00 | grad norm: 0.177 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.264 | TFLOPs: 20.10 | 31: iteration 137350/ 173500 | consumed samples: 35161600 | consumed tokens: 72010956800 | elapsed time per iteration (s): 0.71 | learning rate: 3.897E-05 | global batch size: 256 | lm loss: 1.912435E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.573 | TFLOPs: 21.69 | 31: iteration 137360/ 173500 | consumed samples: 35164160 | consumed tokens: 72016199680 | elapsed time per iteration (s): 0.74 | learning rate: 3.896E-05 | global batch size: 256 | lm loss: 1.936096E+00 | grad norm: 0.213 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.725 | TFLOPs: 20.92 | 31: iteration 137370/ 173500 | consumed samples: 35166720 | consumed tokens: 72021442560 | elapsed time per iteration (s): 0.77 | learning rate: 3.895E-05 | global batch size: 256 | lm loss: 1.923841E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.344 | TFLOPs: 20.05 | 31: iteration 137380/ 173500 | consumed samples: 35169280 | consumed tokens: 72026685440 | elapsed time per iteration (s): 0.77 | learning rate: 3.894E-05 | global batch size: 256 | lm loss: 1.902008E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.227 | TFLOPs: 20.04 | 31: iteration 137390/ 173500 | consumed samples: 35171840 | consumed tokens: 72031928320 | elapsed time per iteration (s): 0.75 | learning rate: 3.893E-05 | global batch size: 256 | lm loss: 1.924951E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.654 | TFLOPs: 20.61 | 31: iteration 137400/ 173500 | consumed samples: 35174400 | consumed tokens: 72037171200 | elapsed time per iteration (s): 0.79 | learning 
rate: 3.892E-05 | global batch size: 256 | lm loss: 1.920569E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.952 | TFLOPs: 19.66 | 31: iteration 137410/ 173500 | consumed samples: 35176960 | consumed tokens: 72042414080 | elapsed time per iteration (s): 0.75 | learning rate: 3.891E-05 | global batch size: 256 | lm loss: 1.928743E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.544 | TFLOPs: 20.60 | 31: iteration 137420/ 173500 | consumed samples: 35179520 | consumed tokens: 72047656960 | elapsed time per iteration (s): 0.75 | learning rate: 3.890E-05 | global batch size: 256 | lm loss: 1.952059E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.433 | TFLOPs: 20.66 | 31: iteration 137430/ 173500 | consumed samples: 35182080 | consumed tokens: 72052899840 | elapsed time per iteration (s): 0.79 | learning rate: 3.889E-05 | global batch size: 256 | lm loss: 1.913504E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.920 | TFLOPs: 19.60 | 31: iteration 137440/ 173500 | consumed samples: 35184640 | consumed tokens: 72058142720 | elapsed time per iteration (s): 0.82 | learning rate: 3.888E-05 | global batch size: 256 | lm loss: 1.913351E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.601 | TFLOPs: 18.85 | 31: iteration 137450/ 173500 | consumed samples: 35187200 | consumed tokens: 72063385600 | elapsed time per iteration (s): 0.76 | learning rate: 3.887E-05 | global batch size: 256 | lm loss: 1.924027E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.881 | TFLOPs: 20.50 | 31: iteration 137460/ 
173500 | consumed samples: 35189760 | consumed tokens: 72068628480 | elapsed time per iteration (s): 0.76 | learning rate: 3.886E-05 | global batch size: 256 | lm loss: 1.939249E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.660 | TFLOPs: 20.49 | 31: iteration 137470/ 173500 | consumed samples: 35192320 | consumed tokens: 72073871360 | elapsed time per iteration (s): 0.76 | learning rate: 3.885E-05 | global batch size: 256 | lm loss: 1.926745E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.318 | TFLOPs: 20.29 | 31: iteration 137480/ 173500 | consumed samples: 35194880 | consumed tokens: 72079114240 | elapsed time per iteration (s): 0.81 | learning rate: 3.884E-05 | global batch size: 256 | lm loss: 1.949685E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.788 | TFLOPs: 19.10 | 31: iteration 137490/ 173500 | consumed samples: 35197440 | consumed tokens: 72084357120 | elapsed time per iteration (s): 0.78 | learning rate: 3.883E-05 | global batch size: 256 | lm loss: 1.951549E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.732 | TFLOPs: 19.95 | 31: iteration 137500/ 173500 | consumed samples: 35200000 | consumed tokens: 72089600000 | elapsed time per iteration (s): 0.77 | learning rate: 3.882E-05 | global batch size: 256 | lm loss: 1.959197E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.495 | TFLOPs: 20.24 | 31: iteration 137510/ 173500 | consumed samples: 35202560 | consumed tokens: 72094842880 | elapsed time per iteration (s): 0.80 | learning rate: 3.881E-05 | global batch size: 256 | lm loss: 1.931790E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 319.185 | TFLOPs: 19.31 | 31: iteration 137520/ 173500 | consumed samples: 35205120 | consumed tokens: 72100085760 | elapsed time per iteration (s): 0.79 | learning rate: 3.880E-05 | global batch size: 256 | lm loss: 1.917510E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.251 | TFLOPs: 19.56 | 31: iteration 137530/ 173500 | consumed samples: 35207680 | consumed tokens: 72105328640 | elapsed time per iteration (s): 0.79 | learning rate: 3.879E-05 | global batch size: 256 | lm loss: 1.903639E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.521 | TFLOPs: 19.63 | 31: iteration 137540/ 173500 | consumed samples: 35210240 | consumed tokens: 72110571520 | elapsed time per iteration (s): 0.81 | learning rate: 3.878E-05 | global batch size: 256 | lm loss: 1.918326E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.978 | TFLOPs: 19.24 | 31: iteration 137550/ 173500 | consumed samples: 35212800 | consumed tokens: 72115814400 | elapsed time per iteration (s): 0.80 | learning rate: 3.876E-05 | global batch size: 256 | lm loss: 1.925932E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.429 | TFLOPs: 19.39 | 31: iteration 137560/ 173500 | consumed samples: 35215360 | consumed tokens: 72121057280 | elapsed time per iteration (s): 0.86 | learning rate: 3.875E-05 | global batch size: 256 | lm loss: 1.948806E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.133 | TFLOPs: 18.10 | 31: iteration 137570/ 173500 | consumed samples: 35217920 | consumed tokens: 72126300160 | elapsed time per iteration (s): 0.79 | learning rate: 
3.874E-05 | global batch size: 256 | lm loss: 1.926306E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.017 | TFLOPs: 19.54 | 31: iteration 137580/ 173500 | consumed samples: 35220480 | consumed tokens: 72131543040 | elapsed time per iteration (s): 0.76 | learning rate: 3.873E-05 | global batch size: 256 | lm loss: 1.932418E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.849 | TFLOPs: 20.50 | 31: iteration 137590/ 173500 | consumed samples: 35223040 | consumed tokens: 72136785920 | elapsed time per iteration (s): 0.81 | learning rate: 3.872E-05 | global batch size: 256 | lm loss: 1.952512E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.908 | TFLOPs: 19.17 | 31: iteration 137600/ 173500 | consumed samples: 35225600 | consumed tokens: 72142028800 | elapsed time per iteration (s): 0.79 | learning rate: 3.871E-05 | global batch size: 256 | lm loss: 1.920293E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.629 | TFLOPs: 19.52 | 31: iteration 137610/ 173500 | consumed samples: 35228160 | consumed tokens: 72147271680 | elapsed time per iteration (s): 0.87 | learning rate: 3.870E-05 | global batch size: 256 | lm loss: 1.951787E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.953 | TFLOPs: 17.84 | 31: iteration 137620/ 173500 | consumed samples: 35230720 | consumed tokens: 72152514560 | elapsed time per iteration (s): 0.83 | learning rate: 3.869E-05 | global batch size: 256 | lm loss: 1.949644E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.180 | TFLOPs: 18.70 | 31: iteration 137630/ 173500 | 
consumed samples: 35233280 | consumed tokens: 72157757440 | elapsed time per iteration (s): 0.85 | learning rate: 3.868E-05 | global batch size: 256 | lm loss: 1.931169E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.939 | TFLOPs: 18.33 | 31: iteration 137640/ 173500 | consumed samples: 35235840 | consumed tokens: 72163000320 | elapsed time per iteration (s): 0.80 | learning rate: 3.867E-05 | global batch size: 256 | lm loss: 1.935722E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.141 | TFLOPs: 19.25 | 31: iteration 137650/ 173500 | consumed samples: 35238400 | consumed tokens: 72168243200 | elapsed time per iteration (s): 0.79 | learning rate: 3.866E-05 | global batch size: 256 | lm loss: 1.911774E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.084 | TFLOPs: 19.67 | 31: iteration 137660/ 173500 | consumed samples: 35240960 | consumed tokens: 72173486080 | elapsed time per iteration (s): 0.79 | learning rate: 3.865E-05 | global batch size: 256 | lm loss: 1.961972E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.120 | TFLOPs: 19.55 | 31: iteration 137670/ 173500 | consumed samples: 35243520 | consumed tokens: 72178728960 | elapsed time per iteration (s): 0.81 | learning rate: 3.864E-05 | global batch size: 256 | lm loss: 1.944927E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.829 | TFLOPs: 19.17 | 31: iteration 137680/ 173500 | consumed samples: 35246080 | consumed tokens: 72183971840 | elapsed time per iteration (s): 0.82 | learning rate: 3.863E-05 | global batch size: 256 | lm loss: 1.925564E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 313.688 | TFLOPs: 18.98 | 31: iteration 137690/ 173500 | consumed samples: 35248640 | consumed tokens: 72189214720 | elapsed time per iteration (s): 0.80 | learning rate: 3.862E-05 | global batch size: 256 | lm loss: 1.939571E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.432 | TFLOPs: 19.32 | 31: iteration 137700/ 173500 | consumed samples: 35251200 | consumed tokens: 72194457600 | elapsed time per iteration (s): 0.79 | learning rate: 3.861E-05 | global batch size: 256 | lm loss: 1.935727E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.024 | TFLOPs: 19.48 | 31: iteration 137710/ 173500 | consumed samples: 35253760 | consumed tokens: 72199700480 | elapsed time per iteration (s): 0.86 | learning rate: 3.860E-05 | global batch size: 256 | lm loss: 1.926997E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.614 | TFLOPs: 17.94 | 31: iteration 137720/ 173500 | consumed samples: 35256320 | consumed tokens: 72204943360 | elapsed time per iteration (s): 0.86 | learning rate: 3.859E-05 | global batch size: 256 | lm loss: 1.905799E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.803 | TFLOPs: 17.96 | 31: iteration 137730/ 173500 | consumed samples: 35258880 | consumed tokens: 72210186240 | elapsed time per iteration (s): 0.85 | learning rate: 3.858E-05 | global batch size: 256 | lm loss: 1.928878E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.903 | TFLOPs: 18.32 | 31: iteration 137740/ 173500 | consumed samples: 35261440 | consumed tokens: 72215429120 | elapsed time per iteration (s): 0.73 | learning rate: 
3.857E-05 | global batch size: 256 | lm loss: 1.947327E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.278 | TFLOPs: 21.19 | 31: iteration 137750/ 173500 | consumed samples: 35264000 | consumed tokens: 72220672000 | elapsed time per iteration (s): 0.75 | learning rate: 3.856E-05 | global batch size: 256 | lm loss: 1.939894E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.982 | TFLOPs: 20.75 | 31: iteration 137760/ 173500 | consumed samples: 35266560 | consumed tokens: 72225914880 | elapsed time per iteration (s): 0.74 | learning rate: 3.855E-05 | global batch size: 256 | lm loss: 1.886468E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.812 | TFLOPs: 21.04 | 31: iteration 137770/ 173500 | consumed samples: 35269120 | consumed tokens: 72231157760 | elapsed time per iteration (s): 0.80 | learning rate: 3.854E-05 | global batch size: 256 | lm loss: 1.928758E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.293 | TFLOPs: 19.44 | 31: iteration 137780/ 173500 | consumed samples: 35271680 | consumed tokens: 72236400640 | elapsed time per iteration (s): 0.78 | learning rate: 3.853E-05 | global batch size: 256 | lm loss: 1.945224E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.559 | TFLOPs: 19.88 | 31: iteration 137790/ 173500 | consumed samples: 35274240 | consumed tokens: 72241643520 | elapsed time per iteration (s): 0.76 | learning rate: 3.852E-05 | global batch size: 256 | lm loss: 1.935720E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.460 | TFLOPs: 20.36 | 31: iteration 137800/ 173500 | 
consumed samples: 35276800 | consumed tokens: 72246886400 | elapsed time per iteration (s): 0.74 | learning rate: 3.851E-05 | global batch size: 256 | lm loss: 1.938166E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.814 | TFLOPs: 20.98 | 31: iteration 137810/ 173500 | consumed samples: 35279360 | consumed tokens: 72252129280 | elapsed time per iteration (s): 0.75 | learning rate: 3.850E-05 | global batch size: 256 | lm loss: 1.951924E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.279 | TFLOPs: 20.77 | 31: iteration 137820/ 173500 | consumed samples: 35281920 | consumed tokens: 72257372160 | elapsed time per iteration (s): 0.79 | learning rate: 3.849E-05 | global batch size: 256 | lm loss: 1.911604E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.828 | TFLOPs: 19.65 | 31: iteration 137830/ 173500 | consumed samples: 35284480 | consumed tokens: 72262615040 | elapsed time per iteration (s): 0.79 | learning rate: 3.848E-05 | global batch size: 256 | lm loss: 1.930439E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.121 | TFLOPs: 19.55 | 31: iteration 137840/ 173500 | consumed samples: 35287040 | consumed tokens: 72267857920 | elapsed time per iteration (s): 0.79 | learning rate: 3.847E-05 | global batch size: 256 | lm loss: 1.949922E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.538 | TFLOPs: 19.63 | 31: iteration 137850/ 173500 | consumed samples: 35289600 | consumed tokens: 72273100800 | elapsed time per iteration (s): 0.82 | learning rate: 3.846E-05 | global batch size: 256 | lm loss: 1.931392E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 313.871 | TFLOPs: 18.99 | 31: iteration 137860/ 173500 | consumed samples: 35292160 | consumed tokens: 72278343680 | elapsed time per iteration (s): 0.81 | learning rate: 3.845E-05 | global batch size: 256 | lm loss: 1.931728E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.716 | TFLOPs: 19.10 | 31: iteration 137870/ 173500 | consumed samples: 35294720 | consumed tokens: 72283586560 | elapsed time per iteration (s): 0.78 | learning rate: 3.844E-05 | global batch size: 256 | lm loss: 1.909315E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.333 | TFLOPs: 19.86 | 31: iteration 137880/ 173500 | consumed samples: 35297280 | consumed tokens: 72288829440 | elapsed time per iteration (s): 0.79 | learning rate: 3.843E-05 | global batch size: 256 | lm loss: 1.934581E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.824 | TFLOPs: 19.53 | 31: iteration 137890/ 173500 | consumed samples: 35299840 | consumed tokens: 72294072320 | elapsed time per iteration (s): 0.81 | learning rate: 3.842E-05 | global batch size: 256 | lm loss: 1.925175E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.179 | TFLOPs: 19.13 | 31: iteration 137900/ 173500 | consumed samples: 35302400 | consumed tokens: 72299315200 | elapsed time per iteration (s): 0.80 | learning rate: 3.841E-05 | global batch size: 256 | lm loss: 1.910101E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.738 | TFLOPs: 19.40 | 31: iteration 137910/ 173500 | consumed samples: 35304960 | consumed tokens: 72304558080 | elapsed time per iteration (s): 0.80 | learning rate: 
3.840E-05 | global batch size: 256 | lm loss: 1.919968E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.555 | TFLOPs: 19.27 | 31: iteration 137920/ 173500 | consumed samples: 35307520 | consumed tokens: 72309800960 | elapsed time per iteration (s): 0.78 | learning rate: 3.839E-05 | global batch size: 256 | lm loss: 1.921156E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.180 | TFLOPs: 19.79 | 31: iteration 137930/ 173500 | consumed samples: 35310080 | consumed tokens: 72315043840 | elapsed time per iteration (s): 0.81 | learning rate: 3.838E-05 | global batch size: 256 | lm loss: 1.939370E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.427 | TFLOPs: 19.20 | 31: iteration 137940/ 173500 | consumed samples: 35312640 | consumed tokens: 72320286720 | elapsed time per iteration (s): 0.80 | learning rate: 3.837E-05 | global batch size: 256 | lm loss: 1.923551E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.131 | TFLOPs: 19.31 | 31: iteration 137950/ 173500 | consumed samples: 35315200 | consumed tokens: 72325529600 | elapsed time per iteration (s): 0.74 | learning rate: 3.836E-05 | global batch size: 256 | lm loss: 1.919886E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.905 | TFLOPs: 20.99 | 31: iteration 137960/ 173500 | consumed samples: 35317760 | consumed tokens: 72330772480 | elapsed time per iteration (s): 0.79 | learning rate: 3.835E-05 | global batch size: 256 | lm loss: 1.919006E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.674 | TFLOPs: 19.70 | 31: iteration 137970/ 173500 | 
consumed samples: 35320320 | consumed tokens: 72336015360 | elapsed time per iteration (s): 0.80 | learning rate: 3.834E-05 | global batch size: 256 | lm loss: 1.946990E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.214 | TFLOPs: 19.37 | 31: iteration 137980/ 173500 | consumed samples: 35322880 | consumed tokens: 72341258240 | elapsed time per iteration (s): 0.78 | learning rate: 3.833E-05 | global batch size: 256 | lm loss: 1.909353E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.726 | TFLOPs: 19.77 | 31: iteration 137990/ 173500 | consumed samples: 35325440 | consumed tokens: 72346501120 | elapsed time per iteration (s): 0.86 | learning rate: 3.832E-05 | global batch size: 256 | lm loss: 1.967125E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.569 | TFLOPs: 18.06 | 0: [2022-11-27 01:09:57,850] [INFO] [logging.py:68:log_dist] [Rank 0] step=138000, skipped=0, lr=[3.831464022325417e-05, 3.831464022325417e-05, 3.831464022325417e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 138000/ 173500 | consumed samples: 35328000 | consumed tokens: 72351744000 | elapsed time per iteration (s): 0.84 | learning rate: 3.831E-05 | global batch size: 256 | lm loss: 1.921942E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.660 | TFLOPs: 18.43 | 0: steps: 138000 loss: 1.8647 iter time (s): 0.789 samples/sec: 324.454 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 138000 | lm loss value: 1.900197E+00 | lm loss PPL: 6.687209E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 138000 to 
checkpoints_1b1long 0: [2022-11-27 01:09:58,117] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step138000 is begin to save! 0: [2022-11-27 01:09:58,128] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_01-model_00-model_states.pt... 0: [2022-11-27 01:09:58,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_01-model_00-model_states.pt. 0: [2022-11-27 01:09:58,339] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_03-model_00-model_states.pt... 0: [2022-11-27 01:09:58,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_03-model_00-model_states.pt. 0: [2022-11-27 01:09:58,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_04-model_00-model_states.pt... 0: [2022-11-27 01:09:58,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_04-model_00-model_states.pt. 0: [2022-11-27 01:09:58,493] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_05-model_00-model_states.pt... 0: [2022-11-27 01:09:58,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_05-model_00-model_states.pt. 0: [2022-11-27 01:09:58,571] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_06-model_00-model_states.pt... 0: [2022-11-27 01:09:58,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_06-model_00-model_states.pt. 0: [2022-11-27 01:09:58,645] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_07-model_00-model_states.pt... 
0: [2022-11-27 01:09:58,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_07-model_00-model_states.pt. 0: [2022-11-27 01:09:58,718] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_08-model_00-model_states.pt... 0: [2022-11-27 01:09:58,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_08-model_00-model_states.pt. 0: [2022-11-27 01:09:58,794] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_09-model_00-model_states.pt... 0: [2022-11-27 01:09:58,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_09-model_00-model_states.pt. 0: [2022-11-27 01:09:58,867] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_10-model_00-model_states.pt... 0: [2022-11-27 01:09:58,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_10-model_00-model_states.pt. 0: [2022-11-27 01:09:58,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_11-model_00-model_states.pt... 0: [2022-11-27 01:09:59,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_11-model_00-model_states.pt. 0: [2022-11-27 01:09:59,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_12-model_00-model_states.pt... 0: [2022-11-27 01:09:59,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_12-model_00-model_states.pt. 0: [2022-11-27 01:09:59,090] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 01:09:59,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_13-model_00-model_states.pt. 0: [2022-11-27 01:09:59,164] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_14-model_00-model_states.pt... 0: [2022-11-27 01:09:59,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_14-model_00-model_states.pt. 0: [2022-11-27 01:09:59,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_15-model_00-model_states.pt... 0: [2022-11-27 01:09:59,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_15-model_00-model_states.pt. 0: [2022-11-27 01:09:59,311] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_16-model_00-model_states.pt... 0: [2022-11-27 01:09:59,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_16-model_00-model_states.pt. 0: [2022-11-27 01:09:59,385] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_17-model_00-model_states.pt... 0: [2022-11-27 01:09:59,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_17-model_00-model_states.pt. 0: [2022-11-27 01:09:59,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_18-model_00-model_states.pt... 0: [2022-11-27 01:09:59,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_18-model_00-model_states.pt. 0: [2022-11-27 01:09:59,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 01:09:59,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_19-model_00-model_states.pt. 0: [2022-11-27 01:09:59,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_20-model_00-model_states.pt... 0: [2022-11-27 01:09:59,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_20-model_00-model_states.pt. 0: [2022-11-27 01:09:59,682] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_21-model_00-model_states.pt... 0: [2022-11-27 01:09:59,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_21-model_00-model_states.pt. 0: [2022-11-27 01:09:59,759] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_22-model_00-model_states.pt... 0: [2022-11-27 01:09:59,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_22-model_00-model_states.pt. 0: [2022-11-27 01:09:59,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_23-model_00-model_states.pt... 0: [2022-11-27 01:09:59,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_23-model_00-model_states.pt. 0: [2022-11-27 01:09:59,906] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_24-model_00-model_states.pt... 0: [2022-11-27 01:09:59,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_24-model_00-model_states.pt. 0: [2022-11-27 01:09:59,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 01:10:00,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_25-model_00-model_states.pt. 0: [2022-11-27 01:10:00,052] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_26-model_00-model_states.pt... 0: [2022-11-27 01:10:00,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_26-model_00-model_states.pt. 0: [2022-11-27 01:10:00,127] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_27-model_00-model_states.pt... 0: [2022-11-27 01:10:00,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_27-model_00-model_states.pt. 0: [2022-11-27 01:10:00,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_28-model_00-model_states.pt... 0: [2022-11-27 01:10:00,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_28-model_00-model_states.pt. 0: [2022-11-27 01:10:00,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/layer_30-model_00-model_states.pt... 0: [2022-11-27 01:10:00,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/layer_30-model_00-model_states.pt. 0: [2022-11-27 01:10:00,276] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step138000/mp_rank_00_model_states.pt 0: [2022-11-27 01:10:00,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/mp_rank_00_model_states.pt... 0: [2022-11-27 01:10:00,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/mp_rank_00_model_states.pt. 
0: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
4: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
10: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
3: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
15: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 
28: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 
22: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 
18: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 
27: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
8: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
2: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 
25: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 
31: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 
21: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 
0: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 
1: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
15: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
31: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 
19: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 
1: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 
22: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 
5: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 
0: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 
11: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:10:00,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:10:00,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:10:00,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 01:10:00,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 9: [2022-11-27 01:10:00,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:10:00,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 01:10:00,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 21: [2022-11-27 01:10:00,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:10:00,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
21: [2022-11-27 01:10:00,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 15: [2022-11-27 01:10:00,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 21: [2022-11-27 01:10:00,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 15: [2022-11-27 01:10:00,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 6: [2022-11-27 01:10:00,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:10:00,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 27: [2022-11-27 01:10:00,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:10:00,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 4: [2022-11-27 01:10:00,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 6: [2022-11-27 01:10:00,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 4: [2022-11-27 01:10:00,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 
27: [2022-11-27 01:10:00,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 01:10:00,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 18: [2022-11-27 01:10:00,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:10:00,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:10:00,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 18: [2022-11-27 01:10:00,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 7: [2022-11-27 01:10:00,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 16: [2022-11-27 01:10:00,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 18: [2022-11-27 01:10:00,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 7: [2022-11-27 01:10:00,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 16: [2022-11-27 01:10:00,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 10: [2022-11-27 01:10:00,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
28: [2022-11-27 01:10:00,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:10:00,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 01:10:00,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 28: [2022-11-27 01:10:00,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 01:10:00,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 2: [2022-11-27 01:10:00,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 24: [2022-11-27 01:10:00,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:10:00,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 01:10:00,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 24: [2022-11-27 01:10:00,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 01:10:00,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 12: [2022-11-27 01:10:00,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-27 01:10:00,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 01:10:00,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 29: [2022-11-27 01:10:00,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:10:00,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 01:10:00,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 21: [2022-11-27 01:10:00,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:10:00,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 3: [2022-11-27 01:10:00,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:10:00,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 3: [2022-11-27 01:10:00,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 01:10:00,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 1: [2022-11-27 01:10:00,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-27 01:10:00,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 14: [2022-11-27 01:10:00,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:10:00,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 1: [2022-11-27 01:10:00,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 14: [2022-11-27 01:10:00,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 29: [2022-11-27 01:10:00,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 17: [2022-11-27 01:10:00,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:10:00,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 01:10:00,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 17: [2022-11-27 01:10:00,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 01:10:00,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 22: [2022-11-27 01:10:00,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
22: [2022-11-27 01:10:00,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 01:10:00,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 0: [2022-11-27 01:10:00,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:10:00,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:10:00,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:10:00,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 15: [2022-11-27 01:10:00,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 0: [2022-11-27 01:10:00,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 10: [2022-11-27 01:10:00,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 15: [2022-11-27 01:10:00,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 0: [2022-11-27 01:10:00,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 4: [2022-11-27 01:10:00,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-27 01:10:00,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 01:10:00,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 14: [2022-11-27 01:10:00,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 27: [2022-11-27 01:10:00,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:10:00,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 01:10:00,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 27: [2022-11-27 01:10:00,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 01:10:00,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 25: [2022-11-27 01:10:00,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:10:00,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
19: [2022-11-27 01:10:00,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 8: [2022-11-27 01:10:00,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:10:00,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 24: [2022-11-27 01:10:00,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:10:00,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 8: [2022-11-27 01:10:00,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 01:10:00,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 25: [2022-11-27 01:10:00,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 24: [2022-11-27 01:10:00,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 19: [2022-11-27 01:10:00,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
19: [2022-11-27 01:10:00,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 11: [2022-11-27 01:10:00,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 24: [2022-11-27 01:10:00,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 19: [2022-11-27 01:10:00,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 11: [2022-11-27 01:10:00,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 01:10:00,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 16: [2022-11-27 01:10:00,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:10:00,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 01:10:00,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 9: [2022-11-27 01:10:00,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 28: [2022-11-27 01:10:00,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
9: [2022-11-27 01:10:00,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 01:10:00,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 28: [2022-11-27 01:10:00,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 26: [2022-11-27 01:10:00,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 28: [2022-11-27 01:10:00,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 2: [2022-11-27 01:10:00,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:10:00,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 2: [2022-11-27 01:10:00,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 26: [2022-11-27 01:10:00,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 2: [2022-11-27 01:10:00,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 18: [2022-11-27 01:10:00,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:10:00,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
18: [2022-11-27 01:10:00,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 26: [2022-11-27 01:10:00,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 18: [2022-11-27 01:10:00,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 26: [2022-11-27 01:10:00,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 5: [2022-11-27 01:10:00,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:10:00,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:10:00,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:10:00,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:10:00,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 20: [2022-11-27 01:10:00,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
5: [2022-11-27 01:10:00,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 7: [2022-11-27 01:10:00,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 4: [2022-11-27 01:10:00,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 10: [2022-11-27 01:10:00,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 13: [2022-11-27 01:10:00,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 20: [2022-11-27 01:10:00,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 5: [2022-11-27 01:10:00,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 7: [2022-11-27 01:10:00,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 4: [2022-11-27 01:10:00,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 10: [2022-11-27 01:10:00,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 13: [2022-11-27 01:10:00,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 20: [2022-11-27 01:10:00,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 
6: [2022-11-27 01:10:00,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:10:00,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 5: [2022-11-27 01:10:00,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:10:00,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 5: [2022-11-27 01:10:00,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 01:10:00,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 22: [2022-11-27 01:10:00,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:10:00,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 01:10:00,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 8: [2022-11-27 01:10:00,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:10:00,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 01:10:00,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 
13: [2022-11-27 01:10:00,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:10:00,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 12: [2022-11-27 01:10:00,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:10:00,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 12: [2022-11-27 01:10:00,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 01:10:00,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 24: [2022-11-27 01:10:00,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-27 01:10:00,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-27 01:10:00,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 30: [2022-11-27 01:10:00,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-27 01:10:00,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 01:10:00,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-27 01:10:00,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 30: [2022-11-27 01:10:00,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 11: [2022-11-27 01:10:00,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 30: [2022-11-27 01:10:00,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 11: [2022-11-27 01:10:00,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 01:10:00,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 12: [2022-11-27 01:10:00,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:10:00,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 01:10:00,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 1: [2022-11-27 01:10:00,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-27 01:10:00,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 15: [2022-11-27 01:10:00,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:10:00,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 1: [2022-11-27 01:10:00,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 15: [2022-11-27 01:10:00,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 27: [2022-11-27 01:10:00,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-27 01:10:00,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 01:10:00,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 8: [2022-11-27 01:10:00,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:10:00,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 01:10:00,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 7: [2022-11-27 01:10:00,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-27 01:10:00,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 01:10:00,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 28: [2022-11-27 01:10:00,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-27 01:10:00,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 01:10:00,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 13: [2022-11-27 01:10:00,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:10:00,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 01:10:00,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 3: [2022-11-27 01:10:00,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:10:00,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 01:10:00,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 3: [2022-11-27 01:10:00,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-27 01:10:00,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 0: [2022-11-27 01:10:00,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:10:00,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:10:00,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 22: [2022-11-27 01:10:00,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 0: [2022-11-27 01:10:00,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 22: [2022-11-27 01:10:00,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 20: [2022-11-27 01:10:00,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:10:00,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 20: [2022-11-27 01:10:00,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 01:10:00,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 25: [2022-11-27 01:10:00,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 
25: [2022-11-27 01:10:00,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 0: [2022-11-27 01:10:00,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:10:00,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 25: [2022-11-27 01:10:00,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:10:00,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 25: [2022-11-27 01:10:00,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 0: [2022-11-27 01:10:00,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 25: [2022-11-27 01:10:00,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 18: [2022-11-27 01:10:00,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-27 01:10:00,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 01:10:00,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 14: [2022-11-27 01:10:00,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-27 01:10:00,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 01:10:00,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 29: [2022-11-27 01:10:00,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:10:00,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 01:10:00,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 2: [2022-11-27 01:10:00,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:10:00,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 01:10:00,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 21: [2022-11-27 01:10:00,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:10:00,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 01:10:00,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 4: [2022-11-27 01:10:00,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-27 01:10:00,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 01:10:00,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 16: [2022-11-27 01:10:00,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:10:00,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 01:10:00,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 26: [2022-11-27 01:10:00,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:10:00,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-27 01:10:00,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 20: [2022-11-27 01:10:00,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-27 01:10:00,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 11: [2022-11-27 01:10:00,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 20: [2022-11-27 01:10:00,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 
11: [2022-11-27 01:10:00,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 01:10:00,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 1: [2022-11-27 01:10:00,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:10:00,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 17: [2022-11-27 01:10:00,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:10:00,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 17: [2022-11-27 01:10:00,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-27 01:10:00,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 01:10:00,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 17: [2022-11-27 01:10:00,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 01:10:00,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 27: [2022-11-27 01:10:00,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
27: [2022-11-27 01:10:00,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 01:10:00,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 1: [2022-11-27 01:10:00,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 23: [2022-11-27 01:10:00,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:10:00,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 23: [2022-11-27 01:10:00,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 1: [2022-11-27 01:10:00,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 23: [2022-11-27 01:10:00,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 6: [2022-11-27 01:10:00,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 18: [2022-11-27 01:10:00,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
6: [2022-11-27 01:10:00,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 18: [2022-11-27 01:10:00,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 6: [2022-11-27 01:10:00,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 18: [2022-11-27 01:10:00,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 19: [2022-11-27 01:10:00,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:10:00,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 01:10:00,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 9: [2022-11-27 01:10:00,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:10:00,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 01:10:00,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 23: [2022-11-27 01:10:00,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-27 01:10:00,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
23: [2022-11-27 01:10:00,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 01:10:00,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 01:10:00,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 23: [2022-11-27 01:10:00,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 31: [2022-11-27 01:10:00,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:10:00,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:10:00,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:10:00,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 01:10:00,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 01:10:00,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 01:10:00,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 31: [2022-11-27 01:10:00,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 
31: [2022-11-27 01:10:00,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 30: [2022-11-27 01:10:00,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 01:10:00,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 01:10:00,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 21: [2022-11-27 01:10:00,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:10:00,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 01:10:00,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 25: [2022-11-27 01:10:00,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:10:00,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 01:10:00,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 17: [2022-11-27 01:10:00,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 
17: [2022-11-27 01:10:00,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 01:10:00,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 2: [2022-11-27 01:10:00,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:10:00,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 01:10:00,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 10: [2022-11-27 01:10:00,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:10:00,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 01:10:00,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 5: [2022-11-27 01:10:00,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:10:00,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 01:10:00,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 14: [2022-11-27 01:10:00,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-27 01:10:00,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 01:10:00,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 9: [2022-11-27 01:10:00,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:10:00,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 01:10:00,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 5: [2022-11-27 01:10:00,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:10:00,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 01:10:00,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 29: [2022-11-27 01:10:00,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:10:00,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 01:10:00,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 0: [2022-11-27 01:10:00,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-27 01:10:00,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 01:10:00,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 30: [2022-11-27 01:10:00,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 01:10:00,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 01:10:00,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 31: [2022-11-27 01:10:00,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:10:00,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 01:10:00,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 26: [2022-11-27 01:10:00,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:10:00,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 01:10:00,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 6: [2022-11-27 01:10:00,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-27 01:10:00,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 01:10:00,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 12: [2022-11-27 01:10:00,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:10:00,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 01:10:00,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 28: [2022-11-27 01:10:00,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-27 01:10:00,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 01:10:00,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 11: [2022-11-27 01:10:00,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:10:00,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 01:10:00,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 15: [2022-11-27 01:10:00,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-27 01:10:00,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 01:10:00,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 8: [2022-11-27 01:10:00,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:10:00,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 01:10:00,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 24: [2022-11-27 01:10:00,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-27 01:10:00,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 01:10:00,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 16: [2022-11-27 01:10:00,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:10:00,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 01:10:00,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 3: [2022-11-27 01:10:00,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-27 01:10:00,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 01:10:00,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 7: [2022-11-27 01:10:00,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:10:00,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 01:10:00,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 13: [2022-11-27 01:10:00,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:10:00,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 01:10:00,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 22: [2022-11-27 01:10:00,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:10:00,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 01:10:00,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 20: [2022-11-27 01:10:00,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-27 01:10:00,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 01:10:00,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 23: [2022-11-27 01:10:00,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-27 01:10:00,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 01:10:00,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 4: [2022-11-27 01:10:00,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:10:00,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 01:10:00,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 25: [2022-11-27 01:10:00,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:10:00,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 01:10:00,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 1: [2022-11-27 01:10:00,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-27 01:10:00,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 01:10:00,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 19: [2022-11-27 01:10:00,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:10:00,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 01:10:00,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 2: [2022-11-27 01:10:00,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:10:00,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 01:10:00,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 0: [2022-11-27 01:10:00,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:10:00,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:10:00,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
9: [2022-11-27 01:10:00,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 14: [2022-11-27 01:10:00,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 9: [2022-11-27 01:10:00,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 14: [2022-11-27 01:10:00,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 21: [2022-11-27 01:10:00,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:10:00,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 01:10:00,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 27: [2022-11-27 01:10:00,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-27 01:10:00,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 01:10:00,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 18: [2022-11-27 01:10:00,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
18: [2022-11-27 01:10:00,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 01:10:00,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 0: [2022-11-27 01:10:00,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 5: [2022-11-27 01:10:00,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:10:00,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 5: [2022-11-27 01:10:00,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 01:10:00,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 29: [2022-11-27 01:10:00,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:10:00,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 01:10:00,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 10: [2022-11-27 01:10:00,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-27 01:10:00,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 01:10:00,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 17: [2022-11-27 01:10:00,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-27 01:10:00,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 01:10:00,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 11: [2022-11-27 01:10:00,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:10:00,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 01:10:00,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 6: [2022-11-27 01:10:00,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:10:00,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 01:10:00,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 31: [2022-11-27 01:10:00,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
31: [2022-11-27 01:10:00,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 01:10:00,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 30: [2022-11-27 01:10:00,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 01:10:00,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 01:10:00,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 7: [2022-11-27 01:10:00,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:10:00,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 01:10:00,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 3: [2022-11-27 01:10:00,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:10:00,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:10:00,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 01:10:00,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 
26: [2022-11-27 01:10:00,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 01:10:00,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 8: [2022-11-27 01:10:00,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:10:00,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 01:10:00,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 24: [2022-11-27 01:10:00,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-27 01:10:00,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 01:10:00,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 16: [2022-11-27 01:10:00,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:10:00,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 01:10:00,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 15: [2022-11-27 01:10:00,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-27 01:10:00,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 01:10:00,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 22: [2022-11-27 01:10:00,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:10:00,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 01:10:00,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 23: [2022-11-27 01:10:00,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-27 01:10:00,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 01:10:00,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 12: [2022-11-27 01:10:00,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:10:00,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 01:10:00,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 13: [2022-11-27 01:10:00,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-27 01:10:00,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 01:10:00,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 4: [2022-11-27 01:10:00,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 20: [2022-11-27 01:10:00,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 01:10:00,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 01:10:00,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 4: [2022-11-27 01:10:00,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 01:10:00,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 5: [2022-11-27 01:10:00,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:10:00,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 01:10:00,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 17: [2022-11-27 01:10:00,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-27 01:10:00,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 01:10:00,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 28: [2022-11-27 01:10:00,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:10:00,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:10:00,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 28: [2022-11-27 01:10:00,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 01:10:00,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 25: [2022-11-27 01:10:00,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 21: [2022-11-27 01:10:00,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:10:00,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 01:10:00,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 27: [2022-11-27 01:10:00,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
27: [2022-11-27 01:10:00,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 01:10:00,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 18: [2022-11-27 01:10:00,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 01:10:00,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 01:10:00,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 19: [2022-11-27 01:10:00,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:10:00,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 01:10:00,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 0: [2022-11-27 01:10:00,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:10:00,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 01:10:00,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 9: [2022-11-27 01:10:00,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
1: [2022-11-27 01:10:00,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:10:00,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 01:10:00,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 14: [2022-11-27 01:10:00,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:10:00,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 14: [2022-11-27 01:10:00,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 1: [2022-11-27 01:10:00,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 14: [2022-11-27 01:10:00,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 2: [2022-11-27 01:10:00,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:10:00,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 01:10:00,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 29: [2022-11-27 01:10:00,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-27 01:10:00,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 01:10:00,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 10: [2022-11-27 01:10:00,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:10:00,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 01:10:00,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 6: [2022-11-27 01:10:00,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:10:00,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 01:10:00,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 31: [2022-11-27 01:10:00,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:10:00,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 01:10:00,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 26: [2022-11-27 01:10:00,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
26: [2022-11-27 01:10:00,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 01:10:00,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 30: [2022-11-27 01:10:00,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-27 01:10:00,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 01:10:00,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 24: [2022-11-27 01:10:00,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-27 01:10:00,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 01:10:00,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 12: [2022-11-27 01:10:00,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:10:00,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 01:10:00,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 11: [2022-11-27 01:10:00,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
15: [2022-11-27 01:10:00,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:10:00,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 01:10:00,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 15: [2022-11-27 01:10:00,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 01:10:00,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 8: [2022-11-27 01:10:00,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:10:00,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 01:10:00,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 3: [2022-11-27 01:10:00,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:10:00,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 01:10:00,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 22: [2022-11-27 01:10:00,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
22: [2022-11-27 01:10:00,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 01:10:00,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 28: [2022-11-27 01:10:00,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:10:00,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:10:00,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 01:10:00,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 4: [2022-11-27 01:10:00,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:10:00,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 01:10:00,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 16: [2022-11-27 01:10:00,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:10:00,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 01:10:00,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 
28: [2022-11-27 01:10:00,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 01:10:00,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 20: [2022-11-27 01:10:00,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-27 01:10:00,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 01:10:00,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 5: [2022-11-27 01:10:00,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:10:00,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 01:10:00,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 17: [2022-11-27 01:10:00,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-27 01:10:00,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 01:10:00,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 25: [2022-11-27 01:10:00,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
25: [2022-11-27 01:10:00,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 01:10:00,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 27: [2022-11-27 01:10:00,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-27 01:10:00,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 01:10:00,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 19: [2022-11-27 01:10:00,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:10:00,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 01:10:00,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 23: [2022-11-27 01:10:00,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-27 01:10:00,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 01:10:00,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 14: [2022-11-27 01:10:00,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-27 01:10:00,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 01:10:00,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 1: [2022-11-27 01:10:00,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:10:00,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 01:10:00,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 18: [2022-11-27 01:10:00,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 01:10:00,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 01:10:00,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 13: [2022-11-27 01:10:00,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:10:00,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 01:10:00,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 2: [2022-11-27 01:10:00,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-27 01:10:00,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 01:10:00,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 9: [2022-11-27 01:10:00,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:10:00,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 01:10:00,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 29: [2022-11-27 01:10:00,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:10:00,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 01:10:00,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 0: [2022-11-27 01:10:00,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:10:00,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 01:10:00,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 21: [2022-11-27 01:10:00,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-27 01:10:00,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 01:10:00,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 6: [2022-11-27 01:10:00,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:10:00,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 01:10:00,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 31: [2022-11-27 01:10:00,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:10:00,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 01:10:00,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 26: [2022-11-27 01:10:00,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:10:00,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 01:10:00,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 30: [2022-11-27 01:10:00,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
30: [2022-11-27 01:10:00,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 01:10:00,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 11: [2022-11-27 01:10:00,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:10:00,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 01:10:00,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 24: [2022-11-27 01:10:00,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 01:10:00,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 01:10:00,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 12: [2022-11-27 01:10:00,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:10:00,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 01:10:00,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 11: [2022-11-27 01:10:00,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-27 01:10:00,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 01:10:00,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 3: [2022-11-27 01:10:00,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:10:00,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 01:10:00,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 24: [2022-11-27 01:10:00,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:10:00,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:10:00,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 24: [2022-11-27 01:10:00,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 19: [2022-11-27 01:10:00,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 24: [2022-11-27 01:10:00,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 
9: [2022-11-27 01:10:00,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 19: [2022-11-27 01:10:00,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 4: [2022-11-27 01:10:00,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 19: [2022-11-27 01:10:00,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 4: [2022-11-27 01:10:00,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 9: [2022-11-27 01:10:00,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 29: [2022-11-27 01:10:00,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 30: [2022-11-27 01:10:00,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:10:00,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:10:00,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 01:10:00,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 
30: [2022-11-27 01:10:00,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 0: [2022-11-27 01:10:00,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 30: [2022-11-27 01:10:00,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 0: [2022-11-27 01:10:00,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 21: [2022-11-27 01:10:00,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:10:00,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 01:10:00,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 2: [2022-11-27 01:10:00,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:10:00,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:10:00,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 14: [2022-11-27 01:10:00,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 2: [2022-11-27 01:10:00,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 
14: [2022-11-27 01:10:00,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 1: [2022-11-27 01:10:00,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 18: [2022-11-27 01:10:00,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:10:00,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 18: [2022-11-27 01:10:00,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 1: [2022-11-27 01:10:00,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 18: [2022-11-27 01:10:00,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 10: [2022-11-27 01:10:00,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:10:00,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:10:00,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 5: [2022-11-27 01:10:00,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 10: [2022-11-27 01:10:00,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 
5: [2022-11-27 01:10:00,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 20: [2022-11-27 01:10:00,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-27 01:10:00,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 01:10:00,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 23: [2022-11-27 01:10:00,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-27 01:10:00,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 01:10:00,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 7: [2022-11-27 01:10:00,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:10:00,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 01:10:00,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 12: [2022-11-27 01:10:00,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-27 01:10:00,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 01:10:00,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 28: [2022-11-27 01:10:00,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:10:00,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:10:00,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 01:10:00,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 25: [2022-11-27 01:10:00,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:10:00,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 01:10:00,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 8: [2022-11-27 01:10:00,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:10:00,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
28: [2022-11-27 01:10:00,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 01:10:00,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 8: [2022-11-27 01:10:00,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 01:10:00,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 01:10:00,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 8: [2022-11-27 01:10:00,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 22: [2022-11-27 01:10:00,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:10:00,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 17: [2022-11-27 01:10:00,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:10:00,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 01:10:00,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 01:10:00,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 
22: [2022-11-27 01:10:00,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 17: [2022-11-27 01:10:00,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 01:10:00,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 6: [2022-11-27 01:10:00,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:10:00,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 01:10:00,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 15: [2022-11-27 01:10:00,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:10:00,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:10:00,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 01:10:00,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 01:10:00,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 15: [2022-11-27 01:10:00,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 
31: [2022-11-27 01:10:00,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:10:00,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 23: [2022-11-27 01:10:00,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:10:00,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 23: [2022-11-27 01:10:00,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 01:10:00,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 3: [2022-11-27 01:10:00,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:10:00,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 01:10:00,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 20: [2022-11-27 01:10:00,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-27 01:10:00,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 01:10:00,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 
27: [2022-11-27 01:10:00,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-27 01:10:00,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 01:10:00,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 10: [2022-11-27 01:10:00,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:10:00,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 01:10:00,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 16: [2022-11-27 01:10:00,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:10:00,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 01:10:00,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 16: [2022-11-27 01:10:00,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:10:00,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
16: [2022-11-27 01:10:00,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 01:10:00,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 13: [2022-11-27 01:10:00,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 01:10:00,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 7: [2022-11-27 01:10:00,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:10:00,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 01:10:00,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 13: [2022-11-27 01:10:00,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:10:00,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 01:10:00,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now! 28: [2022-11-27 01:10:00,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
28: [2022-11-27 01:10:00,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step138000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt
28: [2022-11-27 01:10:00,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step138000 is ready now!
0: successfully saved checkpoint at iteration 138000 to checkpoints_1b1long
31: time (ms) | save-checkpoint: 2521.80
31: iteration 138010/ 173500 | consumed samples: 35330560 | consumed tokens: 72356986880 | elapsed time per iteration (s): 1.16 | learning rate: 3.830E-05 | global batch size: 256 | lm loss: 1.922307E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 221.548 | TFLOPs: 13.40 |
31: iteration 138020/ 173500 | consumed samples: 35333120 | consumed tokens: 72362229760 | elapsed time per iteration (s): 0.89 | learning rate: 3.829E-05 | global batch size: 256 | lm loss: 1.935988E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.124 | TFLOPs: 17.37 |
31: iteration 138030/ 173500 | consumed samples: 35335680 | consumed tokens: 72367472640 | elapsed time per iteration (s): 0.95 | learning rate: 3.828E-05 | global batch size: 256 | lm loss: 1.917280E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 268.943 | TFLOPs: 16.27 |
31: iteration 138040/ 173500 | consumed samples: 35338240 | consumed tokens: 72372715520 | elapsed time per iteration (s): 0.89 | learning rate: 3.827E-05 | global batch size: 256 | lm loss: 1.937840E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.016 | TFLOPs: 17.42 |
31: iteration 138050/ 173500 | consumed samples: 35340800 | consumed tokens: 72377958400 | elapsed time per iteration (s): 0.88 | learning rate: 3.826E-05 | global batch size: 256 | lm loss: 1.942578E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.638 | TFLOPs: 17.52 |
31: iteration 138060/ 173500 | consumed samples: 35343360 | consumed tokens: 72383201280 | elapsed time per iteration (s): 0.87 | learning rate: 3.825E-05 | global batch size: 256 | lm loss: 1.925914E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.247 | TFLOPs: 17.86 |
31: iteration 138070/ 173500 | consumed samples: 35345920 | consumed tokens: 72388444160 | elapsed time per iteration (s): 0.84 | learning rate: 3.825E-05 | global batch size: 256 | lm loss: 1.924134E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.143 | TFLOPs: 18.40 |
31: iteration 138080/ 173500 | consumed samples: 35348480 | consumed tokens: 72393687040 | elapsed time per iteration (s): 0.88 | learning rate: 3.824E-05 | global batch size: 256 | lm loss: 1.914713E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.724 | TFLOPs: 17.59 |
31: iteration 138090/ 173500 | consumed samples: 35351040 | consumed tokens: 72398929920 | elapsed time per iteration (s): 0.89 | learning rate: 3.823E-05 | global batch size: 256 | lm loss: 1.906203E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.456 | TFLOPs: 17.45 |
31: iteration 138100/ 173500 | consumed samples: 35353600 | consumed tokens: 72404172800 | elapsed time per iteration (s): 0.84 | learning rate: 3.822E-05 | global batch size: 256 | lm loss: 1.947796E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.334 | TFLOPs: 18.53 |
31: iteration 138110/ 173500 | consumed samples: 35356160 | consumed tokens: 72409415680 | elapsed time per iteration (s): 0.88 | learning rate: 3.821E-05 | global batch size: 256 | lm loss: 1.935714E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.566 | TFLOPs: 17.52 |
31: iteration 138120/ 173500 | consumed samples: 35358720 | consumed tokens: 72414658560 | elapsed time per iteration (s): 0.87 | learning rate: 3.820E-05 | global batch size: 256 | lm loss: 1.918158E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.664 | TFLOPs: 17.77 |
31: iteration 138130/ 173500 | consumed samples: 35361280 | consumed tokens: 72419901440 | elapsed time per iteration (s): 1.36 | learning rate: 3.819E-05 | global batch size: 256 | lm loss: 1.915608E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 187.570 | TFLOPs: 11.35 |
31: iteration 138140/ 173500 | consumed samples: 35363840 | consumed tokens: 72425144320 | elapsed time per iteration (s): 0.79 | learning rate: 3.818E-05 | global batch size: 256 | lm loss: 1.906377E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.409 | TFLOPs: 19.50 |
31: iteration 138150/ 173500 | consumed samples: 35366400 | consumed tokens: 72430387200 | elapsed time per iteration (s): 0.79 | learning rate: 3.817E-05 | global batch size: 256 | lm loss: 1.895826E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.685 | TFLOPs: 19.64 |
31: iteration 138160/ 173500 | consumed samples: 35368960 | consumed tokens: 72435630080 | elapsed time per iteration (s): 0.76 | learning rate: 3.816E-05 | global batch size: 256 | lm loss: 1.942587E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 336.407 | TFLOPs: 20.35 | 31: iteration 138170/ 173500 | consumed samples: 35371520 | consumed tokens: 72440872960 | elapsed time per iteration (s): 0.77 | learning rate: 3.815E-05 | global batch size: 256 | lm loss: 1.931236E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.775 | TFLOPs: 20.01 | 31: iteration 138180/ 173500 | consumed samples: 35374080 | consumed tokens: 72446115840 | elapsed time per iteration (s): 0.75 | learning rate: 3.814E-05 | global batch size: 256 | lm loss: 1.908721E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.867 | TFLOPs: 20.56 | 31: iteration 138190/ 173500 | consumed samples: 35376640 | consumed tokens: 72451358720 | elapsed time per iteration (s): 0.79 | learning rate: 3.813E-05 | global batch size: 256 | lm loss: 1.936438E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.087 | TFLOPs: 19.73 | 31: iteration 138200/ 173500 | consumed samples: 35379200 | consumed tokens: 72456601600 | elapsed time per iteration (s): 0.75 | learning rate: 3.812E-05 | global batch size: 256 | lm loss: 1.934106E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.205 | TFLOPs: 20.52 | 31: iteration 138210/ 173500 | consumed samples: 35381760 | consumed tokens: 72461844480 | elapsed time per iteration (s): 0.74 | learning rate: 3.811E-05 | global batch size: 256 | lm loss: 1.926009E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.443 | TFLOPs: 20.96 | 31: iteration 138220/ 173500 | consumed samples: 35384320 | consumed tokens: 72467087360 | elapsed time per iteration (s): 0.77 | learning rate: 3.810E-05 | global 
batch size: 256 | lm loss: 1.934237E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.663 | TFLOPs: 20.06 | 31: iteration 138230/ 173500 | consumed samples: 35386880 | consumed tokens: 72472330240 | elapsed time per iteration (s): 0.74 | learning rate: 3.809E-05 | global batch size: 256 | lm loss: 1.942316E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.003 | TFLOPs: 20.99 | 31: iteration 138240/ 173500 | consumed samples: 35389440 | consumed tokens: 72477573120 | elapsed time per iteration (s): 0.78 | learning rate: 3.808E-05 | global batch size: 256 | lm loss: 1.907936E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.673 | TFLOPs: 19.82 | 31: iteration 138250/ 173500 | consumed samples: 35392000 | consumed tokens: 72482816000 | elapsed time per iteration (s): 0.77 | learning rate: 3.807E-05 | global batch size: 256 | lm loss: 1.922522E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.024 | TFLOPs: 20.03 | 31: iteration 138260/ 173500 | consumed samples: 35394560 | consumed tokens: 72488058880 | elapsed time per iteration (s): 0.78 | learning rate: 3.806E-05 | global batch size: 256 | lm loss: 1.950275E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.225 | TFLOPs: 19.98 | 31: iteration 138270/ 173500 | consumed samples: 35397120 | consumed tokens: 72493301760 | elapsed time per iteration (s): 0.79 | learning rate: 3.805E-05 | global batch size: 256 | lm loss: 1.927615E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.682 | TFLOPs: 19.58 | 31: iteration 138280/ 173500 | consumed samples: 
35399680 | consumed tokens: 72498544640 | elapsed time per iteration (s): 0.71 | learning rate: 3.804E-05 | global batch size: 256 | lm loss: 1.899744E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 359.296 | TFLOPs: 21.74 | 31: iteration 138290/ 173500 | consumed samples: 35402240 | consumed tokens: 72503787520 | elapsed time per iteration (s): 0.74 | learning rate: 3.803E-05 | global batch size: 256 | lm loss: 1.947300E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.361 | TFLOPs: 20.95 | 31: iteration 138300/ 173500 | consumed samples: 35404800 | consumed tokens: 72509030400 | elapsed time per iteration (s): 0.75 | learning rate: 3.802E-05 | global batch size: 256 | lm loss: 1.933427E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.367 | TFLOPs: 20.59 | 31: iteration 138310/ 173500 | consumed samples: 35407360 | consumed tokens: 72514273280 | elapsed time per iteration (s): 0.77 | learning rate: 3.801E-05 | global batch size: 256 | lm loss: 1.931437E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.360 | TFLOPs: 20.11 | 31: iteration 138320/ 173500 | consumed samples: 35409920 | consumed tokens: 72519516160 | elapsed time per iteration (s): 0.90 | learning rate: 3.800E-05 | global batch size: 256 | lm loss: 1.926286E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 283.817 | TFLOPs: 17.17 | 31: iteration 138330/ 173500 | consumed samples: 35412480 | consumed tokens: 72524759040 | elapsed time per iteration (s): 0.85 | learning rate: 3.799E-05 | global batch size: 256 | lm loss: 1.945727E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 300.125 | TFLOPs: 18.16 | 31: iteration 138340/ 173500 | consumed samples: 35415040 | consumed tokens: 72530001920 | elapsed time per iteration (s): 0.87 | learning rate: 3.798E-05 | global batch size: 256 | lm loss: 1.921095E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.838 | TFLOPs: 17.90 | 31: iteration 138350/ 173500 | consumed samples: 35417600 | consumed tokens: 72535244800 | elapsed time per iteration (s): 0.76 | learning rate: 3.797E-05 | global batch size: 256 | lm loss: 1.894312E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.108 | TFLOPs: 20.45 | 31: iteration 138360/ 173500 | consumed samples: 35420160 | consumed tokens: 72540487680 | elapsed time per iteration (s): 0.77 | learning rate: 3.796E-05 | global batch size: 256 | lm loss: 1.950197E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.721 | TFLOPs: 20.07 | 31: iteration 138370/ 173500 | consumed samples: 35422720 | consumed tokens: 72545730560 | elapsed time per iteration (s): 0.77 | learning rate: 3.795E-05 | global batch size: 256 | lm loss: 1.905993E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.548 | TFLOPs: 20.06 | 31: iteration 138380/ 173500 | consumed samples: 35425280 | consumed tokens: 72550973440 | elapsed time per iteration (s): 0.87 | learning rate: 3.794E-05 | global batch size: 256 | lm loss: 1.939024E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.366 | TFLOPs: 17.75 | 31: iteration 138390/ 173500 | consumed samples: 35427840 | consumed tokens: 72556216320 | elapsed time per iteration (s): 0.83 | learning rate: 3.793E-05 | global batch 
size: 256 | lm loss: 1.921474E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.778 | TFLOPs: 18.56 | 31: iteration 138400/ 173500 | consumed samples: 35430400 | consumed tokens: 72561459200 | elapsed time per iteration (s): 0.77 | learning rate: 3.792E-05 | global batch size: 256 | lm loss: 1.951547E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.291 | TFLOPs: 20.10 | 31: iteration 138410/ 173500 | consumed samples: 35432960 | consumed tokens: 72566702080 | elapsed time per iteration (s): 0.78 | learning rate: 3.791E-05 | global batch size: 256 | lm loss: 1.919655E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.908 | TFLOPs: 19.84 | 31: iteration 138420/ 173500 | consumed samples: 35435520 | consumed tokens: 72571944960 | elapsed time per iteration (s): 0.81 | learning rate: 3.790E-05 | global batch size: 256 | lm loss: 1.947702E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.339 | TFLOPs: 19.02 | 31: iteration 138430/ 173500 | consumed samples: 35438080 | consumed tokens: 72577187840 | elapsed time per iteration (s): 0.95 | learning rate: 3.789E-05 | global batch size: 256 | lm loss: 1.957075E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 269.497 | TFLOPs: 16.30 | 31: iteration 138440/ 173500 | consumed samples: 35440640 | consumed tokens: 72582430720 | elapsed time per iteration (s): 0.84 | learning rate: 3.788E-05 | global batch size: 256 | lm loss: 1.941619E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.755 | TFLOPs: 18.38 | 31: iteration 138450/ 173500 | consumed samples: 35443200 
| consumed tokens: 72587673600 | elapsed time per iteration (s): 0.82 | learning rate: 3.787E-05 | global batch size: 256 | lm loss: 1.925365E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.774 | TFLOPs: 18.86 | 31: iteration 138460/ 173500 | consumed samples: 35445760 | consumed tokens: 72592916480 | elapsed time per iteration (s): 0.79 | learning rate: 3.786E-05 | global batch size: 256 | lm loss: 1.920722E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.419 | TFLOPs: 19.63 | 31: iteration 138470/ 173500 | consumed samples: 35448320 | consumed tokens: 72598159360 | elapsed time per iteration (s): 0.78 | learning rate: 3.785E-05 | global batch size: 256 | lm loss: 1.938663E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.926 | TFLOPs: 19.84 | 31: iteration 138480/ 173500 | consumed samples: 35450880 | consumed tokens: 72603402240 | elapsed time per iteration (s): 0.77 | learning rate: 3.784E-05 | global batch size: 256 | lm loss: 1.908303E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.206 | TFLOPs: 20.22 | 31: iteration 138490/ 173500 | consumed samples: 35453440 | consumed tokens: 72608645120 | elapsed time per iteration (s): 0.90 | learning rate: 3.783E-05 | global batch size: 256 | lm loss: 1.924877E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 283.334 | TFLOPs: 17.14 | 31: iteration 138500/ 173500 | consumed samples: 35456000 | consumed tokens: 72613888000 | elapsed time per iteration (s): 0.83 | learning rate: 3.782E-05 | global batch size: 256 | lm loss: 1.930224E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 309.883 | TFLOPs: 18.75 | 31: iteration 138510/ 173500 | consumed samples: 35458560 | consumed tokens: 72619130880 | elapsed time per iteration (s): 0.84 | learning rate: 3.781E-05 | global batch size: 256 | lm loss: 1.913955E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.108 | TFLOPs: 18.46 | 31: iteration 138520/ 173500 | consumed samples: 35461120 | consumed tokens: 72624373760 | elapsed time per iteration (s): 0.83 | learning rate: 3.780E-05 | global batch size: 256 | lm loss: 1.976107E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.150 | TFLOPs: 18.76 | 31: iteration 138530/ 173500 | consumed samples: 35463680 | consumed tokens: 72629616640 | elapsed time per iteration (s): 0.80 | learning rate: 3.779E-05 | global batch size: 256 | lm loss: 1.943439E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.962 | TFLOPs: 19.48 | 31: iteration 138540/ 173500 | consumed samples: 35466240 | consumed tokens: 72634859520 | elapsed time per iteration (s): 0.83 | learning rate: 3.778E-05 | global batch size: 256 | lm loss: 1.936837E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.528 | TFLOPs: 18.67 | 31: iteration 138550/ 173500 | consumed samples: 35468800 | consumed tokens: 72640102400 | elapsed time per iteration (s): 0.75 | learning rate: 3.777E-05 | global batch size: 256 | lm loss: 1.939458E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.168 | TFLOPs: 20.76 | 31: iteration 138560/ 173500 | consumed samples: 35471360 | consumed tokens: 72645345280 | elapsed time per iteration (s): 0.78 | learning rate: 3.776E-05 | global batch size: 
256 | lm loss: 1.905427E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.985 | TFLOPs: 19.84 | 31: iteration 138570/ 173500 | consumed samples: 35473920 | consumed tokens: 72650588160 | elapsed time per iteration (s): 0.78 | learning rate: 3.775E-05 | global batch size: 256 | lm loss: 1.905001E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.152 | TFLOPs: 19.73 | 31: iteration 138580/ 173500 | consumed samples: 35476480 | consumed tokens: 72655831040 | elapsed time per iteration (s): 0.75 | learning rate: 3.774E-05 | global batch size: 256 | lm loss: 1.946897E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.023 | TFLOPs: 20.57 | 31: iteration 138590/ 173500 | consumed samples: 35479040 | consumed tokens: 72661073920 | elapsed time per iteration (s): 0.81 | learning rate: 3.773E-05 | global batch size: 256 | lm loss: 1.897072E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.806 | TFLOPs: 19.04 | 31: iteration 138600/ 173500 | consumed samples: 35481600 | consumed tokens: 72666316800 | elapsed time per iteration (s): 0.75 | learning rate: 3.772E-05 | global batch size: 256 | lm loss: 1.910258E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.129 | TFLOPs: 20.70 | 31: iteration 138610/ 173500 | consumed samples: 35484160 | consumed tokens: 72671559680 | elapsed time per iteration (s): 0.79 | learning rate: 3.771E-05 | global batch size: 256 | lm loss: 1.918121E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.604 | TFLOPs: 19.58 | 31: iteration 138620/ 173500 | consumed samples: 35486720 | 
consumed tokens: 72676802560 | elapsed time per iteration (s): 0.80 | learning rate: 3.770E-05 | global batch size: 256 | lm loss: 1.943634E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.017 | TFLOPs: 19.42 | 31: iteration 138630/ 173500 | consumed samples: 35489280 | consumed tokens: 72682045440 | elapsed time per iteration (s): 0.81 | learning rate: 3.769E-05 | global batch size: 256 | lm loss: 1.938566E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.690 | TFLOPs: 19.04 | 31: iteration 138640/ 173500 | consumed samples: 35491840 | consumed tokens: 72687288320 | elapsed time per iteration (s): 0.82 | learning rate: 3.768E-05 | global batch size: 256 | lm loss: 1.932810E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.097 | TFLOPs: 19.00 | 31: iteration 138650/ 173500 | consumed samples: 35494400 | consumed tokens: 72692531200 | elapsed time per iteration (s): 0.78 | learning rate: 3.767E-05 | global batch size: 256 | lm loss: 1.936204E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.566 | TFLOPs: 19.82 | 31: iteration 138660/ 173500 | consumed samples: 35496960 | consumed tokens: 72697774080 | elapsed time per iteration (s): 0.75 | learning rate: 3.766E-05 | global batch size: 256 | lm loss: 1.867648E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.130 | TFLOPs: 20.58 | 31: iteration 138670/ 173500 | consumed samples: 35499520 | consumed tokens: 72703016960 | elapsed time per iteration (s): 0.81 | learning rate: 3.765E-05 | global batch size: 256 | lm loss: 1.912828E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 317.164 | TFLOPs: 19.19 | 31: iteration 138680/ 173500 | consumed samples: 35502080 | consumed tokens: 72708259840 | elapsed time per iteration (s): 0.85 | learning rate: 3.764E-05 | global batch size: 256 | lm loss: 1.906639E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.407 | TFLOPs: 18.29 | 31: iteration 138690/ 173500 | consumed samples: 35504640 | consumed tokens: 72713502720 | elapsed time per iteration (s): 0.81 | learning rate: 3.763E-05 | global batch size: 256 | lm loss: 1.915485E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.420 | TFLOPs: 19.02 | 31: iteration 138700/ 173500 | consumed samples: 35507200 | consumed tokens: 72718745600 | elapsed time per iteration (s): 0.82 | learning rate: 3.762E-05 | global batch size: 256 | lm loss: 1.946362E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.991 | TFLOPs: 18.87 | 31: iteration 138710/ 173500 | consumed samples: 35509760 | consumed tokens: 72723988480 | elapsed time per iteration (s): 0.83 | learning rate: 3.761E-05 | global batch size: 256 | lm loss: 1.932940E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.672 | TFLOPs: 18.55 | 31: iteration 138720/ 173500 | consumed samples: 35512320 | consumed tokens: 72729231360 | elapsed time per iteration (s): 0.86 | learning rate: 3.760E-05 | global batch size: 256 | lm loss: 1.936527E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.986 | TFLOPs: 17.91 | 31: iteration 138730/ 173500 | consumed samples: 35514880 | consumed tokens: 72734474240 | elapsed time per iteration (s): 0.80 | learning rate: 3.759E-05 | global batch size: 
256 | lm loss: 1.908038E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.216 | TFLOPs: 19.25 | 31: iteration 138740/ 173500 | consumed samples: 35517440 | consumed tokens: 72739717120 | elapsed time per iteration (s): 0.83 | learning rate: 3.758E-05 | global batch size: 256 | lm loss: 1.918686E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.651 | TFLOPs: 18.55 | 31: iteration 138750/ 173500 | consumed samples: 35520000 | consumed tokens: 72744960000 | elapsed time per iteration (s): 0.80 | learning rate: 3.757E-05 | global batch size: 256 | lm loss: 1.938173E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.285 | TFLOPs: 19.26 | 31: iteration 138760/ 173500 | consumed samples: 35522560 | consumed tokens: 72750202880 | elapsed time per iteration (s): 0.80 | learning rate: 3.757E-05 | global batch size: 256 | lm loss: 1.928083E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.781 | TFLOPs: 19.29 | 31: iteration 138770/ 173500 | consumed samples: 35525120 | consumed tokens: 72755445760 | elapsed time per iteration (s): 0.80 | learning rate: 3.756E-05 | global batch size: 256 | lm loss: 1.950557E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.371 | TFLOPs: 19.38 | 31: iteration 138780/ 173500 | consumed samples: 35527680 | consumed tokens: 72760688640 | elapsed time per iteration (s): 0.77 | learning rate: 3.755E-05 | global batch size: 256 | lm loss: 1.927454E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.156 | TFLOPs: 20.03 | 31: iteration 138790/ 173500 | consumed samples: 35530240 | 
consumed tokens: 72765931520 | elapsed time per iteration (s): 0.74 | learning rate: 3.754E-05 | global batch size: 256 | lm loss: 1.923693E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.372 | TFLOPs: 20.95 | 31: iteration 138800/ 173500 | consumed samples: 35532800 | consumed tokens: 72771174400 | elapsed time per iteration (s): 0.75 | learning rate: 3.753E-05 | global batch size: 256 | lm loss: 1.956243E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.375 | TFLOPs: 20.77 | 31: iteration 138810/ 173500 | consumed samples: 35535360 | consumed tokens: 72776417280 | elapsed time per iteration (s): 0.74 | learning rate: 3.752E-05 | global batch size: 256 | lm loss: 1.926445E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.340 | TFLOPs: 20.89 | 31: iteration 138820/ 173500 | consumed samples: 35537920 | consumed tokens: 72781660160 | elapsed time per iteration (s): 0.76 | learning rate: 3.751E-05 | global batch size: 256 | lm loss: 1.926534E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.067 | TFLOPs: 20.33 | 31: iteration 138830/ 173500 | consumed samples: 35540480 | consumed tokens: 72786903040 | elapsed time per iteration (s): 0.78 | learning rate: 3.750E-05 | global batch size: 256 | lm loss: 1.933984E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.278 | TFLOPs: 19.86 | 31: iteration 138840/ 173500 | consumed samples: 35543040 | consumed tokens: 72792145920 | elapsed time per iteration (s): 0.78 | learning rate: 3.749E-05 | global batch size: 256 | lm loss: 1.938161E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 326.345 | TFLOPs: 19.74 | 31: iteration 138850/ 173500 | consumed samples: 35545600 | consumed tokens: 72797388800 | elapsed time per iteration (s): 0.73 | learning rate: 3.748E-05 | global batch size: 256 | lm loss: 1.947624E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.685 | TFLOPs: 21.34 | 31: iteration 138860/ 173500 | consumed samples: 35548160 | consumed tokens: 72802631680 | elapsed time per iteration (s): 0.76 | learning rate: 3.747E-05 | global batch size: 256 | lm loss: 1.895279E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.352 | TFLOPs: 20.35 | 31: iteration 138870/ 173500 | consumed samples: 35550720 | consumed tokens: 72807874560 | elapsed time per iteration (s): 0.76 | learning rate: 3.746E-05 | global batch size: 256 | lm loss: 1.956763E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.820 | TFLOPs: 20.50 | 31: iteration 138880/ 173500 | consumed samples: 35553280 | consumed tokens: 72813117440 | elapsed time per iteration (s): 0.87 | learning rate: 3.745E-05 | global batch size: 256 | lm loss: 1.915182E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.499 | TFLOPs: 17.82 | 31: iteration 138890/ 173500 | consumed samples: 35555840 | consumed tokens: 72818360320 | elapsed time per iteration (s): 0.79 | learning rate: 3.744E-05 | global batch size: 256 | lm loss: 1.927553E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.566 | TFLOPs: 19.70 | 31: iteration 138900/ 173500 | consumed samples: 35558400 | consumed tokens: 72823603200 | elapsed time per iteration (s): 0.86 | learning rate: 3.743E-05 | global batch size: 
256 | lm loss: 1.882480E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.655 | TFLOPs: 17.95 | 31: iteration 138910/ 173500 | consumed samples: 35560960 | consumed tokens: 72828846080 | elapsed time per iteration (s): 0.75 | learning rate: 3.742E-05 | global batch size: 256 | lm loss: 1.955398E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.893 | TFLOPs: 20.74 | 31: iteration 138920/ 173500 | consumed samples: 35563520 | consumed tokens: 72834088960 | elapsed time per iteration (s): 0.78 | learning rate: 3.741E-05 | global batch size: 256 | lm loss: 1.928744E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.730 | TFLOPs: 19.89 | 31: iteration 138930/ 173500 | consumed samples: 35566080 | consumed tokens: 72839331840 | elapsed time per iteration (s): 1.02 | learning rate: 3.740E-05 | global batch size: 256 | lm loss: 1.935523E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.228 | TFLOPs: 15.14 | 31: iteration 138940/ 173500 | consumed samples: 35568640 | consumed tokens: 72844574720 | elapsed time per iteration (s): 0.76 | learning rate: 3.739E-05 | global batch size: 256 | lm loss: 1.940050E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.906 | TFLOPs: 20.32 | 31: iteration 138950/ 173500 | consumed samples: 35571200 | consumed tokens: 72849817600 | elapsed time per iteration (s): 0.74 | learning rate: 3.738E-05 | global batch size: 256 | lm loss: 1.915602E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.932 | TFLOPs: 20.87 | 31: iteration 138960/ 173500 | consumed samples: 35573760 | 
consumed tokens: 72855060480 | elapsed time per iteration (s): 0.75 | learning rate: 3.737E-05 | global batch size: 256 | lm loss: 1.946653E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.420 | TFLOPs: 20.72 |
31: iteration 138970/ 173500 | consumed samples: 35576320 | consumed tokens: 72860303360 | elapsed time per iteration (s): 0.75 | learning rate: 3.736E-05 | global batch size: 256 | lm loss: 1.926401E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.095 | TFLOPs: 20.70 |
31: iteration 138980/ 173500 | consumed samples: 35578880 | consumed tokens: 72865546240 | elapsed time per iteration (s): 0.75 | learning rate: 3.735E-05 | global batch size: 256 | lm loss: 1.927484E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.413 | TFLOPs: 20.53 |
31: iteration 138990/ 173500 | consumed samples: 35581440 | consumed tokens: 72870789120 | elapsed time per iteration (s): 0.73 | learning rate: 3.734E-05 | global batch size: 256 | lm loss: 1.963525E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.422 | TFLOPs: 21.26 |
31: iteration 139000/ 173500 | consumed samples: 35584000 | consumed tokens: 72876032000 | elapsed time per iteration (s): 0.83 | learning rate: 3.733E-05 | global batch size: 256 | lm loss: 1.911080E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.431 | TFLOPs: 18.72 |
31: --------------------------------------------------------------------------------------------
31: valid loss at iteration 139000 | lm loss value: 1.838041E+00 | lm loss PPL: 6.284217E+00 |
31: --------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 139000 to checkpoints_1b1long
0: [2022-11-27 01:23:30,864] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step139000 is begin to save!
0: [2022-11-27 01:23:30,876] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_01-model_00-model_states.pt...
0: [2022-11-27 01:23:31,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_01-model_00-model_states.pt.
0: [2022-11-27 01:23:31,109] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_03-model_00-model_states.pt...
0: [2022-11-27 01:23:31,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_03-model_00-model_states.pt.
0: [2022-11-27 01:23:31,194] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_04-model_00-model_states.pt...
0: [2022-11-27 01:23:31,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_04-model_00-model_states.pt.
0: [2022-11-27 01:23:31,283] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_05-model_00-model_states.pt...
0: [2022-11-27 01:23:31,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_05-model_00-model_states.pt.
0: [2022-11-27 01:23:31,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_06-model_00-model_states.pt...
0: [2022-11-27 01:23:31,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_06-model_00-model_states.pt.
0: [2022-11-27 01:23:31,432] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_07-model_00-model_states.pt...
0: [2022-11-27 01:23:31,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_07-model_00-model_states.pt. 0: [2022-11-27 01:23:31,510] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_08-model_00-model_states.pt... 0: [2022-11-27 01:23:31,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_08-model_00-model_states.pt. 0: [2022-11-27 01:23:31,586] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_09-model_00-model_states.pt... 0: [2022-11-27 01:23:31,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_09-model_00-model_states.pt. 0: [2022-11-27 01:23:31,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_10-model_00-model_states.pt... 0: [2022-11-27 01:23:31,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_10-model_00-model_states.pt. 0: [2022-11-27 01:23:31,736] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_11-model_00-model_states.pt... 0: [2022-11-27 01:23:31,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_11-model_00-model_states.pt. 0: [2022-11-27 01:23:31,808] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_12-model_00-model_states.pt... 0: [2022-11-27 01:23:31,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_12-model_00-model_states.pt. 0: [2022-11-27 01:23:31,891] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 01:23:31,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_13-model_00-model_states.pt. 0: [2022-11-27 01:23:31,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_14-model_00-model_states.pt... 0: [2022-11-27 01:23:32,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_14-model_00-model_states.pt. 0: [2022-11-27 01:23:32,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_15-model_00-model_states.pt... 0: [2022-11-27 01:23:32,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_15-model_00-model_states.pt. 0: [2022-11-27 01:23:32,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_16-model_00-model_states.pt... 0: [2022-11-27 01:23:32,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_16-model_00-model_states.pt. 0: [2022-11-27 01:23:32,186] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_17-model_00-model_states.pt... 0: [2022-11-27 01:23:32,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_17-model_00-model_states.pt. 0: [2022-11-27 01:23:32,263] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_18-model_00-model_states.pt... 0: [2022-11-27 01:23:32,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_18-model_00-model_states.pt. 0: [2022-11-27 01:23:32,336] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 01:23:32,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_19-model_00-model_states.pt. 0: [2022-11-27 01:23:32,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_20-model_00-model_states.pt... 0: [2022-11-27 01:23:32,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_20-model_00-model_states.pt. 0: [2022-11-27 01:23:32,487] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_21-model_00-model_states.pt... 0: [2022-11-27 01:23:32,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_21-model_00-model_states.pt. 0: [2022-11-27 01:23:32,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_22-model_00-model_states.pt... 0: [2022-11-27 01:23:32,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_22-model_00-model_states.pt. 0: [2022-11-27 01:23:32,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_23-model_00-model_states.pt... 0: [2022-11-27 01:23:32,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_23-model_00-model_states.pt. 0: [2022-11-27 01:23:32,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_24-model_00-model_states.pt... 0: [2022-11-27 01:23:32,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_24-model_00-model_states.pt. 0: [2022-11-27 01:23:32,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 01:23:32,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_25-model_00-model_states.pt. 0: [2022-11-27 01:23:32,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_26-model_00-model_states.pt... 0: [2022-11-27 01:23:32,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_26-model_00-model_states.pt. 0: [2022-11-27 01:23:32,936] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_27-model_00-model_states.pt... 0: [2022-11-27 01:23:33,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_27-model_00-model_states.pt. 0: [2022-11-27 01:23:33,011] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_28-model_00-model_states.pt... 0: [2022-11-27 01:23:33,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_28-model_00-model_states.pt. 0: [2022-11-27 01:23:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/layer_30-model_00-model_states.pt... 0: [2022-11-27 01:23:33,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/layer_30-model_00-model_states.pt. 0: [2022-11-27 01:23:33,088] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step139000/mp_rank_00_model_states.pt 0: [2022-11-27 01:23:33,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/mp_rank_00_model_states.pt... 0: [2022-11-27 01:23:33,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/mp_rank_00_model_states.pt. 
31: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
7: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
10: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
2: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 
3: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 
25: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 
14: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 
17: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 
19: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
4: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
1: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
12: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 
23: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 
29: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 
18: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 
27: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
9: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 
20: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
14: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 
18: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 
8: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 
29: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 
25: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 
31: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:23:33,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:23:33,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:23:33,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 01:23:33,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 4: [2022-11-27 01:23:33,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:23:33,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 01:23:33,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 8: [2022-11-27 01:23:33,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-27 01:23:33,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 01:23:33,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 18: [2022-11-27 01:23:33,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-27 01:23:33,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-27 01:23:33,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 14: [2022-11-27 01:23:33,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:23:33,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:23:33,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 9: [2022-11-27 01:23:33,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 14: [2022-11-27 01:23:33,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 9: [2022-11-27 01:23:33,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 27: [2022-11-27 01:23:33,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
24: [2022-11-27 01:23:33,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 27: [2022-11-27 01:23:33,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 01:23:33,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 24: [2022-11-27 01:23:33,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 01:23:33,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 6: [2022-11-27 01:23:33,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:23:33,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 01:23:33,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 25: [2022-11-27 01:23:33,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:23:33,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 01:23:33,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 10: [2022-11-27 01:23:33,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-27 01:23:33,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 01:23:33,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 12: [2022-11-27 01:23:33,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:23:33,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 01:23:33,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 2: [2022-11-27 01:23:33,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:23:33,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 01:23:33,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 26: [2022-11-27 01:23:33,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:23:33,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 01:23:33,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 14: [2022-11-27 01:23:33,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
17: [2022-11-27 01:23:33,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:23:33,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 01:23:33,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 17: [2022-11-27 01:23:33,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 01:23:33,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 2: [2022-11-27 01:23:33,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:23:33,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 01:23:33,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 22: [2022-11-27 01:23:33,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:23:33,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
22: [2022-11-27 01:23:33,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 19: [2022-11-27 01:23:33,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 22: [2022-11-27 01:23:33,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 19: [2022-11-27 01:23:33,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 13: [2022-11-27 01:23:33,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:23:33,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:23:33,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 9: [2022-11-27 01:23:33,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 01:23:33,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 1: [2022-11-27 01:23:33,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:23:33,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 11: [2022-11-27 01:23:33,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
28: [2022-11-27 01:23:33,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:23:33,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 11: [2022-11-27 01:23:33,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 01:23:33,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 1: [2022-11-27 01:23:33,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 5: [2022-11-27 01:23:33,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:23:33,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:23:33,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 01:23:33,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 01:23:33,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 5: [2022-11-27 01:23:33,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 8: [2022-11-27 01:23:33,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-27 01:23:33,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 01:23:33,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 4: [2022-11-27 01:23:33,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:23:33,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 01:23:33,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 3: [2022-11-27 01:23:33,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:23:33,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 01:23:33,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 10: [2022-11-27 01:23:33,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:23:33,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 01:23:33,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 16: [2022-11-27 01:23:33,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
20: [2022-11-27 01:23:33,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:23:33,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:23:33,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:23:33,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:23:33,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:23:33,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
20: [2022-11-27 01:23:33,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 11: [2022-11-27 01:23:33,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 21: [2022-11-27 01:23:33,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 26: [2022-11-27 01:23:33,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 16: [2022-11-27 01:23:33,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 20: [2022-11-27 01:23:33,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 11: [2022-11-27 01:23:33,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 21: [2022-11-27 01:23:33,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 26: [2022-11-27 01:23:33,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 16: [2022-11-27 01:23:33,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 21: [2022-11-27 01:23:33,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 16: [2022-11-27 01:23:33,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 
21: [2022-11-27 01:23:33,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 16: [2022-11-27 01:23:33,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 24: [2022-11-27 01:23:33,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-27 01:23:33,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 01:23:33,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 1: [2022-11-27 01:23:33,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:23:33,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 01:23:33,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 28: [2022-11-27 01:23:33,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 01:23:33,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 22: [2022-11-27 01:23:33,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
22: [2022-11-27 01:23:33,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 28: [2022-11-27 01:23:33,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:23:33,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 18: [2022-11-27 01:23:33,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 01:23:33,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 01:23:33,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 28: [2022-11-27 01:23:33,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 01:23:33,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 17: [2022-11-27 01:23:33,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-27 01:23:33,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 01:23:33,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 6: [2022-11-27 01:23:33,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-27 01:23:33,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 01:23:33,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 3: [2022-11-27 01:23:33,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:23:33,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 01:23:33,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 14: [2022-11-27 01:23:33,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:23:33,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 01:23:33,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 0: [2022-11-27 01:23:33,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:23:33,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:23:33,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 01:23:33,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 
0: [2022-11-27 01:23:33,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 01:23:33,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 4: [2022-11-27 01:23:33,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:23:33,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 01:23:33,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 8: [2022-11-27 01:23:33,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:23:33,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 01:23:33,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 5: [2022-11-27 01:23:33,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:23:33,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 01:23:33,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 15: [2022-11-27 01:23:33,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-27 01:23:33,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 01:23:33,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:23:33,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 15: [2022-11-27 01:23:33,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 01:23:33,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 6: [2022-11-27 01:23:33,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:23:33,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 01:23:33,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 9: [2022-11-27 01:23:33,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:23:33,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 01:23:33,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 30: [2022-11-27 01:23:33,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
30: [2022-11-27 01:23:33,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 01:23:33,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 20: [2022-11-27 01:23:33,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-27 01:23:33,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 01:23:33,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 27: [2022-11-27 01:23:33,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-27 01:23:33,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:23:33,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:23:33,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 01:23:33,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 
27: [2022-11-27 01:23:33,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 01:23:33,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 01:23:33,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 28: [2022-11-27 01:23:33,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 27: [2022-11-27 01:23:33,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 28: [2022-11-27 01:23:33,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 01:23:33,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 3: [2022-11-27 01:23:33,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:23:33,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:23:33,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 25: [2022-11-27 01:23:33,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:23:33,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 
11: [2022-11-27 01:23:33,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 01:23:33,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 25: [2022-11-27 01:23:33,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:23:33,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-27 01:23:33,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 01:23:33,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 25: [2022-11-27 01:23:33,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 24: [2022-11-27 01:23:33,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-27 01:23:33,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-27 01:23:33,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 0: [2022-11-27 01:23:33,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:23:33,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
18: [2022-11-27 01:23:33,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:23:33,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 01:23:33,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 18: [2022-11-27 01:23:33,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 01:23:33,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 13: [2022-11-27 01:23:33,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:23:33,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 01:23:33,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 1: [2022-11-27 01:23:33,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:23:33,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
1: [2022-11-27 01:23:33,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 2: [2022-11-27 01:23:33,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 1: [2022-11-27 01:23:33,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 16: [2022-11-27 01:23:33,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:23:33,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 16: [2022-11-27 01:23:33,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 01:23:33,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 13: [2022-11-27 01:23:33,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:23:33,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:23:33,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
26: [2022-11-27 01:23:33,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-27 01:23:33,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 01:23:33,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 26: [2022-11-27 01:23:33,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 13: [2022-11-27 01:23:33,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 01:23:33,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 19: [2022-11-27 01:23:33,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:23:33,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 01:23:33,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 9: [2022-11-27 01:23:33,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:23:33,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 01:23:33,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 
12: [2022-11-27 01:23:33,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:23:33,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 01:23:33,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:23:33,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 12: [2022-11-27 01:23:33,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 01:23:33,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 10: [2022-11-27 01:23:33,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:23:33,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 01:23:33,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 4: [2022-11-27 01:23:33,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:23:33,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 01:23:33,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 
31: [2022-11-27 01:23:33,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:23:33,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:23:33,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 01:23:33,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 01:23:33,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 31: [2022-11-27 01:23:33,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 16: [2022-11-27 01:23:33,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:23:33,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 01:23:33,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 8: [2022-11-27 01:23:33,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:23:33,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
8: [2022-11-27 01:23:33,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 01:23:33,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 22: [2022-11-27 01:23:33,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 01:23:33,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 5: [2022-11-27 01:23:33,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:23:33,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 01:23:33,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 7: [2022-11-27 01:23:33,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:23:33,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:23:33,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 01:23:33,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 
7: [2022-11-27 01:23:33,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 01:23:33,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:23:33,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 7: [2022-11-27 01:23:33,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 01:23:33,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 17: [2022-11-27 01:23:33,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-27 01:23:33,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 01:23:33,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 30: [2022-11-27 01:23:33,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 01:23:33,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 01:23:33,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 20: [2022-11-27 01:23:33,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
20: [2022-11-27 01:23:33,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 01:23:33,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 14: [2022-11-27 01:23:33,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:23:33,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 01:23:33,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 13: [2022-11-27 01:23:33,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:23:33,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 01:23:33,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 0: [2022-11-27 01:23:33,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:23:33,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 01:23:33,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 25: [2022-11-27 01:23:33,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
25: [2022-11-27 01:23:33,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 01:23:33,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 18: [2022-11-27 01:23:33,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-27 01:23:33,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 01:23:33,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 2: [2022-11-27 01:23:33,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:23:33,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 01:23:33,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 11: [2022-11-27 01:23:33,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:23:33,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 01:23:33,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 29: [2022-11-27 01:23:33,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-27 01:23:33,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 01:23:33,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:23:33,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:23:33,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:23:33,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 29: [2022-11-27 01:23:33,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 01:23:33,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 01:23:33,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 01:23:33,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 29: [2022-11-27 01:23:33,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 29: [2022-11-27 01:23:33,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 
0: [2022-11-27 01:23:33,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 01:23:33,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 19: [2022-11-27 01:23:33,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:23:33,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 01:23:33,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 17: [2022-11-27 01:23:33,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-27 01:23:33,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 3: [2022-11-27 01:23:33,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 17: [2022-11-27 01:23:33,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 3: [2022-11-27 01:23:33,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 01:23:33,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 23: [2022-11-27 01:23:33,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-27 01:23:33,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-27 01:23:33,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-27 01:23:33,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-27 01:23:33,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 01:23:33,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 01:23:33,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 01:23:33,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 01:23:33,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 23: [2022-11-27 01:23:33,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 23: [2022-11-27 01:23:33,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 23: [2022-11-27 01:23:33,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 27: [2022-11-27 01:23:33,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
27: [2022-11-27 01:23:33,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 01:23:33,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 24: [2022-11-27 01:23:33,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-27 01:23:33,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 01:23:33,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 12: [2022-11-27 01:23:33,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:23:33,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 01:23:33,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 6: [2022-11-27 01:23:33,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:23:33,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 01:23:33,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 15: [2022-11-27 01:23:33,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-27 01:23:33,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 01:23:33,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 1: [2022-11-27 01:23:33,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:23:33,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 01:23:33,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 30: [2022-11-27 01:23:33,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-27 01:23:33,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 01:23:33,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 21: [2022-11-27 01:23:33,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:23:33,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 01:23:33,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 28: [2022-11-27 01:23:33,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-27 01:23:33,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 01:23:33,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 7: [2022-11-27 01:23:33,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:23:33,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 01:23:33,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 10: [2022-11-27 01:23:33,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:23:33,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 01:23:33,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 31: [2022-11-27 01:23:33,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:23:33,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 01:23:33,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 20: [2022-11-27 01:23:33,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-27 01:23:33,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 01:23:33,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 22: [2022-11-27 01:23:33,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:23:33,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 01:23:33,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 9: [2022-11-27 01:23:33,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:23:33,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 01:23:33,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 2: [2022-11-27 01:23:33,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:23:33,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 01:23:33,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 23: [2022-11-27 01:23:33,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
23: [2022-11-27 01:23:33,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 01:23:33,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 5: [2022-11-27 01:23:33,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:23:33,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 01:23:33,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 8: [2022-11-27 01:23:33,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:23:33,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 01:23:33,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 0: [2022-11-27 01:23:33,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:23:33,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 18: [2022-11-27 01:23:33,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:23:33,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 
25: [2022-11-27 01:23:33,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:23:33,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 18: [2022-11-27 01:23:33,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 14: [2022-11-27 01:23:33,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 18: [2022-11-27 01:23:33,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 14: [2022-11-27 01:23:33,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 25: [2022-11-27 01:23:33,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 01:23:33,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 26: [2022-11-27 01:23:33,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:23:33,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 01:23:33,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 11: [2022-11-27 01:23:33,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-27 01:23:33,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 01:23:33,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 29: [2022-11-27 01:23:33,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:23:33,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 01:23:33,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 17: [2022-11-27 01:23:33,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-27 01:23:33,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 01:23:33,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 4: [2022-11-27 01:23:33,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:23:33,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 01:23:33,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 19: [2022-11-27 01:23:33,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-27 01:23:33,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 01:23:33,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 13: [2022-11-27 01:23:33,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:23:33,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:23:33,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 01:23:33,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 3: [2022-11-27 01:23:33,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 01:23:33,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 12: [2022-11-27 01:23:33,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:23:33,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 01:23:33,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 1: [2022-11-27 01:23:33,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-27 01:23:33,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 01:23:33,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 27: [2022-11-27 01:23:33,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:23:33,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:23:33,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 27: [2022-11-27 01:23:33,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 6: [2022-11-27 01:23:33,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 27: [2022-11-27 01:23:33,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 16: [2022-11-27 01:23:33,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:23:33,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 01:23:33,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 24: [2022-11-27 01:23:33,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-27 01:23:33,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-27 01:23:33,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 15: [2022-11-27 01:23:33,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:23:33,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 01:23:33,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 31: [2022-11-27 01:23:33,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:23:33,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 30: [2022-11-27 01:23:33,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:23:33,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 30: [2022-11-27 01:23:33,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 01:23:33,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 28: [2022-11-27 01:23:33,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
28: [2022-11-27 01:23:33,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 01:23:33,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 20: [2022-11-27 01:23:33,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 01:23:33,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 01:23:33,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 7: [2022-11-27 01:23:33,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:23:33,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 01:23:33,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 10: [2022-11-27 01:23:33,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:23:33,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 01:23:33,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 9: [2022-11-27 01:23:33,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-27 01:23:33,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 01:23:33,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 22: [2022-11-27 01:23:33,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:23:33,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 01:23:33,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 21: [2022-11-27 01:23:33,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:23:33,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 01:23:33,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 0: [2022-11-27 01:23:33,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:23:33,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 01:23:33,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 5: [2022-11-27 01:23:33,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-27 01:23:33,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 01:23:33,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 25: [2022-11-27 01:23:33,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:23:33,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 01:23:33,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 2: [2022-11-27 01:23:33,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 23: [2022-11-27 01:23:33,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:23:33,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 23: [2022-11-27 01:23:33,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 01:23:33,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 2: [2022-11-27 01:23:33,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 29: [2022-11-27 01:23:33,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
29: [2022-11-27 01:23:33,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 01:23:33,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 13: [2022-11-27 01:23:33,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:23:33,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 01:23:33,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 11: [2022-11-27 01:23:33,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:23:33,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 01:23:33,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 17: [2022-11-27 01:23:33,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:23:33,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:23:33,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
8: [2022-11-27 01:23:33,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 18: [2022-11-27 01:23:33,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:23:33,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 17: [2022-11-27 01:23:33,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 01:23:33,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 18: [2022-11-27 01:23:33,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 26: [2022-11-27 01:23:33,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 01:23:33,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 18: [2022-11-27 01:23:33,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 14: [2022-11-27 01:23:33,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:23:33,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 01:23:33,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 
19: [2022-11-27 01:23:33,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:23:33,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 01:23:33,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 3: [2022-11-27 01:23:33,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:23:33,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 01:23:33,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 4: [2022-11-27 01:23:33,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:23:33,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:23:33,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 12: [2022-11-27 01:23:33,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 4: [2022-11-27 01:23:33,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 12: [2022-11-27 01:23:33,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 
1: [2022-11-27 01:23:33,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:23:33,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 01:23:33,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 6: [2022-11-27 01:23:33,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:23:33,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 15: [2022-11-27 01:23:33,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:23:33,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 15: [2022-11-27 01:23:33,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 01:23:33,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 24: [2022-11-27 01:23:33,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 01:23:33,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 01:23:33,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 
16: [2022-11-27 01:23:33,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:23:33,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 01:23:33,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 27: [2022-11-27 01:23:33,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-27 01:23:33,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 01:23:33,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 28: [2022-11-27 01:23:33,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-27 01:23:33,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 01:23:33,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 21: [2022-11-27 01:23:33,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:23:33,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 01:23:33,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 
7: [2022-11-27 01:23:33,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:23:33,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 01:23:33,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 10: [2022-11-27 01:23:33,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:23:33,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 01:23:33,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 22: [2022-11-27 01:23:33,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:23:33,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 01:23:33,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 31: [2022-11-27 01:23:33,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:23:33,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 01:23:33,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 
20: [2022-11-27 01:23:33,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-27 01:23:33,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 01:23:33,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 9: [2022-11-27 01:23:33,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:23:33,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 01:23:33,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 5: [2022-11-27 01:23:33,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:23:33,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 01:23:33,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 0: [2022-11-27 01:23:33,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:23:33,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 01:23:33,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 
26: [2022-11-27 01:23:33,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:23:33,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 01:23:33,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 2: [2022-11-27 01:23:33,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:23:33,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:23:33,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 01:23:33,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 14: [2022-11-27 01:23:33,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 01:23:33,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 4: [2022-11-27 01:23:33,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:23:33,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 01:23:33,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 
23: [2022-11-27 01:23:33,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-27 01:23:33,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 01:23:33,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 25: [2022-11-27 01:23:33,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:23:33,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 01:23:33,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 29: [2022-11-27 01:23:33,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:23:33,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 01:23:33,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 18: [2022-11-27 01:23:33,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 01:23:33,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 01:23:33,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 
11: [2022-11-27 01:23:33,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:23:33,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 01:23:33,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 17: [2022-11-27 01:23:33,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 30: [2022-11-27 01:23:33,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 01:23:33,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 17: [2022-11-27 01:23:33,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 01:23:33,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 30: [2022-11-27 01:23:33,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 16: [2022-11-27 01:23:33,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:23:33,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 01:23:33,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 
8: [2022-11-27 01:23:33,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:23:33,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 01:23:33,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 13: [2022-11-27 01:23:33,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:23:33,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 01:23:33,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 19: [2022-11-27 01:23:33,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:23:33,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:23:33,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 01:23:33,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 3: [2022-11-27 01:23:33,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 01:23:33,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 
24: [2022-11-27 01:23:33,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-27 01:23:33,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 01:23:33,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 15: [2022-11-27 01:23:33,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:23:33,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 01:23:33,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 1: [2022-11-27 01:23:33,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:23:33,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 01:23:33,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 27: [2022-11-27 01:23:33,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:23:33,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-27 01:23:33,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 27: [2022-11-27 01:23:33,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 01:23:33,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 7: [2022-11-27 01:23:33,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 30: [2022-11-27 01:23:33,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 01:23:33,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 01:23:33,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 6: [2022-11-27 01:23:33,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 28: [2022-11-27 01:23:33,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:23:33,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 01:23:33,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 12: [2022-11-27 01:23:33,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-27 01:23:33,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 01:23:33,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 9: [2022-11-27 01:23:33,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:23:33,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 01:23:33,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 5: [2022-11-27 01:23:33,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:23:33,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 01:23:33,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 2: [2022-11-27 01:23:33,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:23:33,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:23:33,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 01:23:33,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 
14: [2022-11-27 01:23:33,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 01:23:33,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 20: [2022-11-27 01:23:33,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-27 01:23:33,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 01:23:33,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 28: [2022-11-27 01:23:33,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 01:23:33,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 23: [2022-11-27 01:23:33,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-27 01:23:33,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 01:23:33,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 24: [2022-11-27 01:23:33,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-27 01:23:33,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 01:23:33,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 26: [2022-11-27 01:23:33,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:23:33,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 01:23:33,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 13: [2022-11-27 01:23:33,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:23:33,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 01:23:33,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 31: [2022-11-27 01:23:33,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 17: [2022-11-27 01:23:33,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
31: [2022-11-27 01:23:33,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 17: [2022-11-27 01:23:33,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 01:23:33,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 31: [2022-11-27 01:23:33,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 25: [2022-11-27 01:23:33,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 18: [2022-11-27 01:23:33,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-27 01:23:33,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 01:23:33,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 25: [2022-11-27 01:23:33,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 01:23:33,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 22: [2022-11-27 01:23:33,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-27 01:23:33,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 01:23:33,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 1: [2022-11-27 01:23:33,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:23:33,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 3: [2022-11-27 01:23:33,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:23:33,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 3: [2022-11-27 01:23:33,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 01:23:33,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 27: [2022-11-27 01:23:33,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-27 01:23:33,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 01:23:33,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 10: [2022-11-27 01:23:33,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-27 01:23:33,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 19: [2022-11-27 01:23:33,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:23:33,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 19: [2022-11-27 01:23:33,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 01:23:33,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 11: [2022-11-27 01:23:33,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:23:33,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 01:23:33,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 15: [2022-11-27 01:23:33,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:23:33,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
15: [2022-11-27 01:23:33,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 8: [2022-11-27 01:23:33,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 15: [2022-11-27 01:23:33,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 8: [2022-11-27 01:23:33,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 29: [2022-11-27 01:23:33,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:23:33,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 01:23:33,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 28: [2022-11-27 01:23:33,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-27 01:23:33,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 16: [2022-11-27 01:23:33,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:23:33,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
16: [2022-11-27 01:23:33,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 01:23:33,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 12: [2022-11-27 01:23:33,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 01:23:33,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 0: [2022-11-27 01:23:33,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:23:33,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 01:23:33,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 7: [2022-11-27 01:23:33,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:23:33,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:23:33,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 7: [2022-11-27 01:23:33,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 01:23:33,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 
6: [2022-11-27 01:23:33,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 28: [2022-11-27 01:23:33,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 21: [2022-11-27 01:23:33,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:23:33,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 01:23:33,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 30: [2022-11-27 01:23:33,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-27 01:23:33,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 01:23:33,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 4: [2022-11-27 01:23:33,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:23:33,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 01:23:33,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 10: [2022-11-27 01:23:33,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-27 01:23:33,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 01:23:33,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 22: [2022-11-27 01:23:33,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:23:33,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 01:23:33,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 20: [2022-11-27 01:23:33,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-27 01:23:33,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 01:23:33,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 21: [2022-11-27 01:23:33,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:23:33,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 01:23:33,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 30: [2022-11-27 01:23:33,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-27 01:23:33,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 01:23:33,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 31: [2022-11-27 01:23:33,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:23:33,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:23:33,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 01:23:33,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step139000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 01:23:33,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 31: [2022-11-27 01:23:33,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step139000 is ready now! 
0: successfully saved checkpoint at iteration 139000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2572.22 31: iteration 139010/ 173500 | consumed samples: 35586560 | consumed tokens: 72881274880 | elapsed time per iteration (s): 1.08 | learning rate: 3.732E-05 | global batch size: 256 | lm loss: 1.931929E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.439 | TFLOPs: 14.30 | 31: iteration 139020/ 173500 | consumed samples: 35589120 | consumed tokens: 72886517760 | elapsed time per iteration (s): 0.82 | learning rate: 3.731E-05 | global batch size: 256 | lm loss: 1.934805E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.032 | TFLOPs: 19.00 | 31: iteration 139030/ 173500 | consumed samples: 35591680 | consumed tokens: 72891760640 | elapsed time per iteration (s): 0.82 | learning rate: 3.730E-05 | global batch size: 256 | lm loss: 1.929405E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.338 | TFLOPs: 18.77 | 31: iteration 139040/ 173500 | consumed samples: 35594240 | consumed tokens: 72897003520 | elapsed time per iteration (s): 1.04 | learning rate: 3.729E-05 | global batch size: 256 | lm loss: 1.925116E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.104 | TFLOPs: 14.83 | 31: iteration 139050/ 173500 | consumed samples: 35596800 | consumed tokens: 72902246400 | elapsed time per iteration (s): 0.80 | learning rate: 3.728E-05 | global batch size: 256 | lm loss: 1.899906E+00 | grad norm: 0.213 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.525 | TFLOPs: 19.45 | 31: iteration 139060/ 173500 | consumed samples: 35599360 | consumed tokens: 72907489280 | elapsed time per iteration (s): 
0.77 | learning rate: 3.727E-05 | global batch size: 256 | lm loss: 1.955923E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.006 | TFLOPs: 20.03 | 31: iteration 139070/ 173500 | consumed samples: 35601920 | consumed tokens: 72912732160 | elapsed time per iteration (s): 0.77 | learning rate: 3.726E-05 | global batch size: 256 | lm loss: 1.921276E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.838 | TFLOPs: 20.14 | 31: iteration 139080/ 173500 | consumed samples: 35604480 | consumed tokens: 72917975040 | elapsed time per iteration (s): 0.79 | learning rate: 3.725E-05 | global batch size: 256 | lm loss: 1.934000E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.910 | TFLOPs: 19.72 | 31: iteration 139090/ 173500 | consumed samples: 35607040 | consumed tokens: 72923217920 | elapsed time per iteration (s): 0.78 | learning rate: 3.724E-05 | global batch size: 256 | lm loss: 1.969301E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.781 | TFLOPs: 19.89 | 31: iteration 139100/ 173500 | consumed samples: 35609600 | consumed tokens: 72928460800 | elapsed time per iteration (s): 0.74 | learning rate: 3.723E-05 | global batch size: 256 | lm loss: 1.925428E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.250 | TFLOPs: 20.95 | 31: iteration 139110/ 173500 | consumed samples: 35612160 | consumed tokens: 72933703680 | elapsed time per iteration (s): 0.76 | learning rate: 3.722E-05 | global batch size: 256 | lm loss: 1.938442E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.053 | TFLOPs: 20.45 | 31: 
iteration 139120/ 173500 | consumed samples: 35614720 | consumed tokens: 72938946560 | elapsed time per iteration (s): 0.76 | learning rate: 3.722E-05 | global batch size: 256 | lm loss: 1.937300E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.032 | TFLOPs: 20.33 | 31: iteration 139130/ 173500 | consumed samples: 35617280 | consumed tokens: 72944189440 | elapsed time per iteration (s): 0.78 | learning rate: 3.721E-05 | global batch size: 256 | lm loss: 1.925547E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.589 | TFLOPs: 19.82 | 31: iteration 139140/ 173500 | consumed samples: 35619840 | consumed tokens: 72949432320 | elapsed time per iteration (s): 0.75 | learning rate: 3.720E-05 | global batch size: 256 | lm loss: 1.938491E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.560 | TFLOPs: 20.78 | 31: iteration 139150/ 173500 | consumed samples: 35622400 | consumed tokens: 72954675200 | elapsed time per iteration (s): 0.77 | learning rate: 3.719E-05 | global batch size: 256 | lm loss: 1.943619E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.930 | TFLOPs: 20.20 | 31: iteration 139160/ 173500 | consumed samples: 35624960 | consumed tokens: 72959918080 | elapsed time per iteration (s): 0.83 | learning rate: 3.718E-05 | global batch size: 256 | lm loss: 1.898260E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.352 | TFLOPs: 18.72 | 31: iteration 139170/ 173500 | consumed samples: 35627520 | consumed tokens: 72965160960 | elapsed time per iteration (s): 0.76 | learning rate: 3.717E-05 | global batch size: 256 | lm loss: 1.907844E+00 | grad norm: 0.174 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.125 | TFLOPs: 20.27 | 31: iteration 139180/ 173500 | consumed samples: 35630080 | consumed tokens: 72970403840 | elapsed time per iteration (s): 0.76 | learning rate: 3.716E-05 | global batch size: 256 | lm loss: 1.937910E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.268 | TFLOPs: 20.34 | 31: iteration 139190/ 173500 | consumed samples: 35632640 | consumed tokens: 72975646720 | elapsed time per iteration (s): 0.88 | learning rate: 3.715E-05 | global batch size: 256 | lm loss: 1.944212E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.529 | TFLOPs: 17.70 | 31: iteration 139200/ 173500 | consumed samples: 35635200 | consumed tokens: 72980889600 | elapsed time per iteration (s): 0.79 | learning rate: 3.714E-05 | global batch size: 256 | lm loss: 1.909944E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.969 | TFLOPs: 19.60 | 31: iteration 139210/ 173500 | consumed samples: 35637760 | consumed tokens: 72986132480 | elapsed time per iteration (s): 0.89 | learning rate: 3.713E-05 | global batch size: 256 | lm loss: 1.938600E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.827 | TFLOPs: 17.47 | 31: iteration 139220/ 173500 | consumed samples: 35640320 | consumed tokens: 72991375360 | elapsed time per iteration (s): 0.82 | learning rate: 3.712E-05 | global batch size: 256 | lm loss: 1.939621E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.977 | TFLOPs: 18.99 | 31: iteration 139230/ 173500 | consumed samples: 35642880 | consumed tokens: 72996618240 | elapsed time per iteration (s): 0.76 | 
learning rate: 3.711E-05 | global batch size: 256 | lm loss: 1.927627E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.223 | TFLOPs: 20.34 | 31: iteration 139240/ 173500 | consumed samples: 35645440 | consumed tokens: 73001861120 | elapsed time per iteration (s): 0.77 | learning rate: 3.710E-05 | global batch size: 256 | lm loss: 1.938205E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.512 | TFLOPs: 20.24 | 31: iteration 139250/ 173500 | consumed samples: 35648000 | consumed tokens: 73007104000 | elapsed time per iteration (s): 0.80 | learning rate: 3.709E-05 | global batch size: 256 | lm loss: 1.937450E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.614 | TFLOPs: 19.46 | 31: iteration 139260/ 173500 | consumed samples: 35650560 | consumed tokens: 73012346880 | elapsed time per iteration (s): 0.76 | learning rate: 3.708E-05 | global batch size: 256 | lm loss: 1.930071E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.584 | TFLOPs: 20.30 | 31: iteration 139270/ 173500 | consumed samples: 35653120 | consumed tokens: 73017589760 | elapsed time per iteration (s): 0.76 | learning rate: 3.707E-05 | global batch size: 256 | lm loss: 1.915308E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.167 | TFLOPs: 20.28 | 31: iteration 139280/ 173500 | consumed samples: 35655680 | consumed tokens: 73022832640 | elapsed time per iteration (s): 0.83 | learning rate: 3.706E-05 | global batch size: 256 | lm loss: 1.913749E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.630 | TFLOPs: 18.61 | 31: iteration 
139290/ 173500 | consumed samples: 35658240 | consumed tokens: 73028075520 | elapsed time per iteration (s): 0.77 | learning rate: 3.705E-05 | global batch size: 256 | lm loss: 1.884885E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.739 | TFLOPs: 20.07 | 31: iteration 139300/ 173500 | consumed samples: 35660800 | consumed tokens: 73033318400 | elapsed time per iteration (s): 0.90 | learning rate: 3.704E-05 | global batch size: 256 | lm loss: 1.922701E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.200 | TFLOPs: 17.19 | 31: iteration 139310/ 173500 | consumed samples: 35663360 | consumed tokens: 73038561280 | elapsed time per iteration (s): 0.80 | learning rate: 3.703E-05 | global batch size: 256 | lm loss: 1.890322E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.564 | TFLOPs: 19.39 | 31: iteration 139320/ 173500 | consumed samples: 35665920 | consumed tokens: 73043804160 | elapsed time per iteration (s): 0.91 | learning rate: 3.702E-05 | global batch size: 256 | lm loss: 1.920572E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.000 | TFLOPs: 17.06 | 31: iteration 139330/ 173500 | consumed samples: 35668480 | consumed tokens: 73049047040 | elapsed time per iteration (s): 0.96 | learning rate: 3.701E-05 | global batch size: 256 | lm loss: 1.913787E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 267.884 | TFLOPs: 16.21 | 31: iteration 139340/ 173500 | consumed samples: 35671040 | consumed tokens: 73054289920 | elapsed time per iteration (s): 0.80 | learning rate: 3.700E-05 | global batch size: 256 | lm loss: 1.951977E+00 | grad norm: 0.184 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.521 | TFLOPs: 19.45 | 31: iteration 139350/ 173500 | consumed samples: 35673600 | consumed tokens: 73059532800 | elapsed time per iteration (s): 0.84 | learning rate: 3.699E-05 | global batch size: 256 | lm loss: 1.916131E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.766 | TFLOPs: 18.44 | 31: iteration 139360/ 173500 | consumed samples: 35676160 | consumed tokens: 73064775680 | elapsed time per iteration (s): 0.81 | learning rate: 3.698E-05 | global batch size: 256 | lm loss: 1.927193E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.034 | TFLOPs: 19.18 | 31: iteration 139370/ 173500 | consumed samples: 35678720 | consumed tokens: 73070018560 | elapsed time per iteration (s): 0.82 | learning rate: 3.697E-05 | global batch size: 256 | lm loss: 1.916943E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.368 | TFLOPs: 18.90 | 31: iteration 139380/ 173500 | consumed samples: 35681280 | consumed tokens: 73075261440 | elapsed time per iteration (s): 0.79 | learning rate: 3.696E-05 | global batch size: 256 | lm loss: 1.921927E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.257 | TFLOPs: 19.50 | 31: iteration 139390/ 173500 | consumed samples: 35683840 | consumed tokens: 73080504320 | elapsed time per iteration (s): 0.83 | learning rate: 3.695E-05 | global batch size: 256 | lm loss: 1.907632E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.957 | TFLOPs: 18.69 | 31: iteration 139400/ 173500 | consumed samples: 35686400 | consumed tokens: 73085747200 | elapsed time per iteration (s): 0.82 | learning 
rate: 3.694E-05 | global batch size: 256 | lm loss: 1.923133E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.891 | TFLOPs: 18.99 | 31: iteration 139410/ 173500 | consumed samples: 35688960 | consumed tokens: 73090990080 | elapsed time per iteration (s): 0.81 | learning rate: 3.694E-05 | global batch size: 256 | lm loss: 1.937653E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.281 | TFLOPs: 19.07 | 31: iteration 139420/ 173500 | consumed samples: 35691520 | consumed tokens: 73096232960 | elapsed time per iteration (s): 0.77 | learning rate: 3.693E-05 | global batch size: 256 | lm loss: 1.922345E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.464 | TFLOPs: 19.99 | 31: iteration 139430/ 173500 | consumed samples: 35694080 | consumed tokens: 73101475840 | elapsed time per iteration (s): 0.83 | learning rate: 3.692E-05 | global batch size: 256 | lm loss: 1.930650E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.536 | TFLOPs: 18.61 | 31: iteration 139440/ 173500 | consumed samples: 35696640 | consumed tokens: 73106718720 | elapsed time per iteration (s): 0.83 | learning rate: 3.691E-05 | global batch size: 256 | lm loss: 1.926808E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.671 | TFLOPs: 18.55 | 31: iteration 139450/ 173500 | consumed samples: 35699200 | consumed tokens: 73111961600 | elapsed time per iteration (s): 0.78 | learning rate: 3.690E-05 | global batch size: 256 | lm loss: 1.889601E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.617 | TFLOPs: 19.82 | 31: iteration 139460/ 
173500 | consumed samples: 35701760 | consumed tokens: 73117204480 | elapsed time per iteration (s): 0.85 | learning rate: 3.689E-05 | global batch size: 256 | lm loss: 1.912884E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.105 | TFLOPs: 18.28 | 31: iteration 139470/ 173500 | consumed samples: 35704320 | consumed tokens: 73122447360 | elapsed time per iteration (s): 0.84 | learning rate: 3.688E-05 | global batch size: 256 | lm loss: 1.957257E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.156 | TFLOPs: 18.34 | 31: iteration 139480/ 173500 | consumed samples: 35706880 | consumed tokens: 73127690240 | elapsed time per iteration (s): 0.80 | learning rate: 3.687E-05 | global batch size: 256 | lm loss: 1.931431E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.609 | TFLOPs: 19.40 | 31: iteration 139490/ 173500 | consumed samples: 35709440 | consumed tokens: 73132933120 | elapsed time per iteration (s): 0.81 | learning rate: 3.686E-05 | global batch size: 256 | lm loss: 1.909619E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.918 | TFLOPs: 19.17 | 31: iteration 139500/ 173500 | consumed samples: 35712000 | consumed tokens: 73138176000 | elapsed time per iteration (s): 0.78 | learning rate: 3.685E-05 | global batch size: 256 | lm loss: 1.937435E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.590 | TFLOPs: 19.82 | 31: iteration 139510/ 173500 | consumed samples: 35714560 | consumed tokens: 73143418880 | elapsed time per iteration (s): 0.82 | learning rate: 3.684E-05 | global batch size: 256 | lm loss: 1.958669E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 313.858 | TFLOPs: 18.99 | 31: iteration 139520/ 173500 | consumed samples: 35717120 | consumed tokens: 73148661760 | elapsed time per iteration (s): 0.78 | learning rate: 3.683E-05 | global batch size: 256 | lm loss: 1.892785E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.990 | TFLOPs: 19.84 | 31: iteration 139530/ 173500 | consumed samples: 35719680 | consumed tokens: 73153904640 | elapsed time per iteration (s): 0.83 | learning rate: 3.682E-05 | global batch size: 256 | lm loss: 1.949623E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.087 | TFLOPs: 18.70 | 31: iteration 139540/ 173500 | consumed samples: 35722240 | consumed tokens: 73159147520 | elapsed time per iteration (s): 0.73 | learning rate: 3.681E-05 | global batch size: 256 | lm loss: 1.910662E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.958 | TFLOPs: 21.29 | 31: iteration 139550/ 173500 | consumed samples: 35724800 | consumed tokens: 73164390400 | elapsed time per iteration (s): 0.75 | learning rate: 3.680E-05 | global batch size: 256 | lm loss: 1.943485E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.557 | TFLOPs: 20.66 | 31: iteration 139560/ 173500 | consumed samples: 35727360 | consumed tokens: 73169633280 | elapsed time per iteration (s): 0.75 | learning rate: 3.679E-05 | global batch size: 256 | lm loss: 1.891518E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.908 | TFLOPs: 20.68 | 31: iteration 139570/ 173500 | consumed samples: 35729920 | consumed tokens: 73174876160 | elapsed time per iteration (s): 0.78 | learning rate: 
3.678E-05 | global batch size: 256 | lm loss: 1.927462E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.011 | TFLOPs: 19.96 | 31: iteration 139580/ 173500 | consumed samples: 35732480 | consumed tokens: 73180119040 | elapsed time per iteration (s): 0.71 | learning rate: 3.677E-05 | global batch size: 256 | lm loss: 1.925223E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 361.512 | TFLOPs: 21.87 | 31: iteration 139590/ 173500 | consumed samples: 35735040 | consumed tokens: 73185361920 | elapsed time per iteration (s): 0.75 | learning rate: 3.676E-05 | global batch size: 256 | lm loss: 1.934822E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.849 | TFLOPs: 20.74 | 31: iteration 139600/ 173500 | consumed samples: 35737600 | consumed tokens: 73190604800 | elapsed time per iteration (s): 0.75 | learning rate: 3.675E-05 | global batch size: 256 | lm loss: 1.942571E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.806 | TFLOPs: 20.68 | 31: iteration 139610/ 173500 | consumed samples: 35740160 | consumed tokens: 73195847680 | elapsed time per iteration (s): 0.79 | learning rate: 3.674E-05 | global batch size: 256 | lm loss: 1.931624E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.977 | TFLOPs: 19.72 | 31: iteration 139620/ 173500 | consumed samples: 35742720 | consumed tokens: 73201090560 | elapsed time per iteration (s): 0.77 | learning rate: 3.673E-05 | global batch size: 256 | lm loss: 1.933521E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.539 | TFLOPs: 20.24 | 31: iteration 139630/ 173500 | 
consumed samples: 35745280 | consumed tokens: 73206333440 | elapsed time per iteration (s): 0.76 | learning rate: 3.672E-05 | global batch size: 256 | lm loss: 1.968295E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.688 | TFLOPs: 20.37 | 31: iteration 139640/ 173500 | consumed samples: 35747840 | consumed tokens: 73211576320 | elapsed time per iteration (s): 0.73 | learning rate: 3.671E-05 | global batch size: 256 | lm loss: 1.916044E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.112 | TFLOPs: 21.12 | 31: iteration 139650/ 173500 | consumed samples: 35750400 | consumed tokens: 73216819200 | elapsed time per iteration (s): 0.79 | learning rate: 3.671E-05 | global batch size: 256 | lm loss: 1.906836E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.090 | TFLOPs: 19.61 | 31: iteration 139660/ 173500 | consumed samples: 35752960 | consumed tokens: 73222062080 | elapsed time per iteration (s): 0.78 | learning rate: 3.670E-05 | global batch size: 256 | lm loss: 1.908815E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.243 | TFLOPs: 19.74 | 31: iteration 139670/ 173500 | consumed samples: 35755520 | consumed tokens: 73227304960 | elapsed time per iteration (s): 0.83 | learning rate: 3.669E-05 | global batch size: 256 | lm loss: 1.930709E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.414 | TFLOPs: 18.72 | 31: iteration 139680/ 173500 | consumed samples: 35758080 | consumed tokens: 73232547840 | elapsed time per iteration (s): 0.81 | learning rate: 3.668E-05 | global batch size: 256 | lm loss: 1.911633E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 314.434 | TFLOPs: 19.02 | 31: iteration 139690/ 173500 | consumed samples: 35760640 | consumed tokens: 73237790720 | elapsed time per iteration (s): 0.89 | learning rate: 3.667E-05 | global batch size: 256 | lm loss: 1.964799E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.277 | TFLOPs: 17.44 | 31: iteration 139700/ 173500 | consumed samples: 35763200 | consumed tokens: 73243033600 | elapsed time per iteration (s): 0.78 | learning rate: 3.666E-05 | global batch size: 256 | lm loss: 1.913385E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.866 | TFLOPs: 19.96 | 31: iteration 139710/ 173500 | consumed samples: 35765760 | consumed tokens: 73248276480 | elapsed time per iteration (s): 0.83 | learning rate: 3.665E-05 | global batch size: 256 | lm loss: 1.942739E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.559 | TFLOPs: 18.67 | 31: iteration 139720/ 173500 | consumed samples: 35768320 | consumed tokens: 73253519360 | elapsed time per iteration (s): 0.80 | learning rate: 3.664E-05 | global batch size: 256 | lm loss: 1.931131E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.126 | TFLOPs: 19.43 | 31: iteration 139730/ 173500 | consumed samples: 35770880 | consumed tokens: 73258762240 | elapsed time per iteration (s): 0.79 | learning rate: 3.663E-05 | global batch size: 256 | lm loss: 1.951489E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.648 | TFLOPs: 19.52 | 31: iteration 139740/ 173500 | consumed samples: 35773440 | consumed tokens: 73264005120 | elapsed time per iteration (s): 0.83 | learning rate: 
3.662E-05 | global batch size: 256 | lm loss: 1.952807E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.950 | TFLOPs: 18.57 | 31: iteration 139750/ 173500 | consumed samples: 35776000 | consumed tokens: 73269248000 | elapsed time per iteration (s): 0.79 | learning rate: 3.661E-05 | global batch size: 256 | lm loss: 1.912881E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.190 | TFLOPs: 19.61 | 31: iteration 139760/ 173500 | consumed samples: 35778560 | consumed tokens: 73274490880 | elapsed time per iteration (s): 0.87 | learning rate: 3.660E-05 | global batch size: 256 | lm loss: 1.956215E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.949 | TFLOPs: 17.90 | 31: iteration 139770/ 173500 | consumed samples: 35781120 | consumed tokens: 73279733760 | elapsed time per iteration (s): 0.81 | learning rate: 3.659E-05 | global batch size: 256 | lm loss: 1.919443E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.502 | TFLOPs: 19.15 | 31: iteration 139780/ 173500 | consumed samples: 35783680 | consumed tokens: 73284976640 | elapsed time per iteration (s): 0.78 | learning rate: 3.658E-05 | global batch size: 256 | lm loss: 1.917077E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.576 | TFLOPs: 19.88 | 31: iteration 139790/ 173500 | consumed samples: 35786240 | consumed tokens: 73290219520 | elapsed time per iteration (s): 0.84 | learning rate: 3.657E-05 | global batch size: 256 | lm loss: 1.953096E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.760 | TFLOPs: 18.38 | 31: iteration 139800/ 173500 | 
consumed samples: 35788800 | consumed tokens: 73295462400 | elapsed time per iteration (s): 0.78 | learning rate: 3.656E-05 | global batch size: 256 | lm loss: 1.927139E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.156 | TFLOPs: 19.85 | 31: iteration 139810/ 173500 | consumed samples: 35791360 | consumed tokens: 73300705280 | elapsed time per iteration (s): 0.77 | learning rate: 3.655E-05 | global batch size: 256 | lm loss: 1.926628E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.527 | TFLOPs: 20.06 | 31: iteration 139820/ 173500 | consumed samples: 35793920 | consumed tokens: 73305948160 | elapsed time per iteration (s): 0.80 | learning rate: 3.654E-05 | global batch size: 256 | lm loss: 1.938452E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.798 | TFLOPs: 19.41 | 31: iteration 139830/ 173500 | consumed samples: 35796480 | consumed tokens: 73311191040 | elapsed time per iteration (s): 0.79 | learning rate: 3.653E-05 | global batch size: 256 | lm loss: 1.927545E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.774 | TFLOPs: 19.65 | 31: iteration 139840/ 173500 | consumed samples: 35799040 | consumed tokens: 73316433920 | elapsed time per iteration (s): 0.80 | learning rate: 3.652E-05 | global batch size: 256 | lm loss: 1.907129E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.400 | TFLOPs: 19.44 | 31: iteration 139850/ 173500 | consumed samples: 35801600 | consumed tokens: 73321676800 | elapsed time per iteration (s): 0.82 | learning rate: 3.651E-05 | global batch size: 256 | lm loss: 1.926766E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 310.775 | TFLOPs: 18.80 | 31: iteration 139860/ 173500 | consumed samples: 35804160 | consumed tokens: 73326919680 | elapsed time per iteration (s): 0.87 | learning rate: 3.651E-05 | global batch size: 256 | lm loss: 1.926887E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.791 | TFLOPs: 17.77 | 31: iteration 139870/ 173500 | consumed samples: 35806720 | consumed tokens: 73332162560 | elapsed time per iteration (s): 0.84 | learning rate: 3.650E-05 | global batch size: 256 | lm loss: 1.933475E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.251 | TFLOPs: 18.53 | 31: iteration 139880/ 173500 | consumed samples: 35809280 | consumed tokens: 73337405440 | elapsed time per iteration (s): 0.82 | learning rate: 3.649E-05 | global batch size: 256 | lm loss: 1.966880E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.712 | TFLOPs: 18.80 | 31: iteration 139890/ 173500 | consumed samples: 35811840 | consumed tokens: 73342648320 | elapsed time per iteration (s): 0.80 | learning rate: 3.648E-05 | global batch size: 256 | lm loss: 1.945449E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.112 | TFLOPs: 19.37 | 31: iteration 139900/ 173500 | consumed samples: 35814400 | consumed tokens: 73347891200 | elapsed time per iteration (s): 0.77 | learning rate: 3.647E-05 | global batch size: 256 | lm loss: 1.934256E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.791 | TFLOPs: 20.19 | 31: iteration 139910/ 173500 | consumed samples: 35816960 | consumed tokens: 73353134080 | elapsed time per iteration (s): 0.78 | learning rate: 
3.646E-05 | global batch size: 256 | lm loss: 1.917343E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.041 | TFLOPs: 19.85 | 31: iteration 139920/ 173500 | consumed samples: 35819520 | consumed tokens: 73358376960 | elapsed time per iteration (s): 0.76 | learning rate: 3.645E-05 | global batch size: 256 | lm loss: 1.914103E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.612 | TFLOPs: 20.36 | 31: iteration 139930/ 173500 | consumed samples: 35822080 | consumed tokens: 73363619840 | elapsed time per iteration (s): 0.77 | learning rate: 3.644E-05 | global batch size: 256 | lm loss: 1.946134E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.583 | TFLOPs: 20.06 | 31: iteration 139940/ 173500 | consumed samples: 35824640 | consumed tokens: 73368862720 | elapsed time per iteration (s): 0.78 | learning rate: 3.643E-05 | global batch size: 256 | lm loss: 1.920303E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.506 | TFLOPs: 19.93 | 31: iteration 139950/ 173500 | consumed samples: 35827200 | consumed tokens: 73374105600 | elapsed time per iteration (s): 0.73 | learning rate: 3.642E-05 | global batch size: 256 | lm loss: 1.935314E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.229 | TFLOPs: 21.19 | 31: iteration 139960/ 173500 | consumed samples: 35829760 | consumed tokens: 73379348480 | elapsed time per iteration (s): 0.78 | learning rate: 3.641E-05 | global batch size: 256 | lm loss: 1.938890E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.675 | TFLOPs: 19.88 | 31: iteration 139970/ 173500 | 
consumed samples: 35832320 | consumed tokens: 73384591360 | elapsed time per iteration (s): 0.79 | learning rate: 3.640E-05 | global batch size: 256 | lm loss: 1.951983E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.753 | TFLOPs: 19.59 | 31: iteration 139980/ 173500 | consumed samples: 35834880 | consumed tokens: 73389834240 | elapsed time per iteration (s): 0.84 | learning rate: 3.639E-05 | global batch size: 256 | lm loss: 1.889427E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.692 | TFLOPs: 18.37 | 31: iteration 139990/ 173500 | consumed samples: 35837440 | consumed tokens: 73395077120 | elapsed time per iteration (s): 0.89 | learning rate: 3.638E-05 | global batch size: 256 | lm loss: 1.934825E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.393 | TFLOPs: 17.33 | 0: [2022-11-27 01:36:56,107] [INFO] [logging.py:68:log_dist] [Rank 0] step=140000, skipped=0, lr=[3.63724657135183e-05, 3.63724657135183e-05, 3.63724657135183e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 140000/ 173500 | consumed samples: 35840000 | consumed tokens: 73400320000 | elapsed time per iteration (s): 0.90 | learning rate: 3.637E-05 | global batch size: 256 | lm loss: 1.940622E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.458 | TFLOPs: 17.27 | 0: steps: 140000 loss: 1.8531 iter time (s): 0.804 samples/sec: 318.462 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 140000 | lm loss value: 1.832360E+00 | lm loss PPL: 6.248615E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 140000 to 
checkpoints_1b1long 0: [2022-11-27 01:36:56,357] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step140000 is begin to save! 0: [2022-11-27 01:36:56,369] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_01-model_00-model_states.pt... 0: [2022-11-27 01:36:56,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_01-model_00-model_states.pt. 0: [2022-11-27 01:36:56,574] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_03-model_00-model_states.pt... 0: [2022-11-27 01:36:56,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_03-model_00-model_states.pt. 0: [2022-11-27 01:36:56,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_04-model_00-model_states.pt... 0: [2022-11-27 01:36:56,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_04-model_00-model_states.pt. 0: [2022-11-27 01:36:56,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_05-model_00-model_states.pt... 0: [2022-11-27 01:36:56,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_05-model_00-model_states.pt. 0: [2022-11-27 01:36:56,803] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_06-model_00-model_states.pt... 0: [2022-11-27 01:36:56,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_06-model_00-model_states.pt. 0: [2022-11-27 01:36:56,879] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_07-model_00-model_states.pt... 
0: [2022-11-27 01:36:56,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_07-model_00-model_states.pt. 0: [2022-11-27 01:36:56,951] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_08-model_00-model_states.pt... 0: [2022-11-27 01:36:57,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_08-model_00-model_states.pt. 0: [2022-11-27 01:36:57,029] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_09-model_00-model_states.pt... 0: [2022-11-27 01:36:57,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_09-model_00-model_states.pt. 0: [2022-11-27 01:36:57,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_10-model_00-model_states.pt... 0: [2022-11-27 01:36:57,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_10-model_00-model_states.pt. 0: [2022-11-27 01:36:57,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_11-model_00-model_states.pt... 0: [2022-11-27 01:36:57,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_11-model_00-model_states.pt. 0: [2022-11-27 01:36:57,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_12-model_00-model_states.pt... 0: [2022-11-27 01:36:57,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_12-model_00-model_states.pt. 0: [2022-11-27 01:36:57,329] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 01:36:57,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_13-model_00-model_states.pt. 0: [2022-11-27 01:36:57,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_14-model_00-model_states.pt... 0: [2022-11-27 01:36:57,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_14-model_00-model_states.pt. 0: [2022-11-27 01:36:57,478] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_15-model_00-model_states.pt... 0: [2022-11-27 01:36:57,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_15-model_00-model_states.pt. 0: [2022-11-27 01:36:57,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_16-model_00-model_states.pt... 0: [2022-11-27 01:36:57,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_16-model_00-model_states.pt. 0: [2022-11-27 01:36:57,630] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_17-model_00-model_states.pt... 0: [2022-11-27 01:36:57,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_17-model_00-model_states.pt. 0: [2022-11-27 01:36:57,706] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_18-model_00-model_states.pt... 0: [2022-11-27 01:36:57,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_18-model_00-model_states.pt. 0: [2022-11-27 01:36:57,781] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 01:36:57,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_19-model_00-model_states.pt. 0: [2022-11-27 01:36:57,856] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_20-model_00-model_states.pt... 0: [2022-11-27 01:36:57,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_20-model_00-model_states.pt. 0: [2022-11-27 01:36:57,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_21-model_00-model_states.pt... 0: [2022-11-27 01:36:58,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_21-model_00-model_states.pt. 0: [2022-11-27 01:36:58,005] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_22-model_00-model_states.pt... 0: [2022-11-27 01:36:58,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_22-model_00-model_states.pt. 0: [2022-11-27 01:36:58,081] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_23-model_00-model_states.pt... 0: [2022-11-27 01:36:58,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_23-model_00-model_states.pt. 0: [2022-11-27 01:36:58,155] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_24-model_00-model_states.pt... 0: [2022-11-27 01:36:58,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_24-model_00-model_states.pt. 0: [2022-11-27 01:36:58,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 01:36:58,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_25-model_00-model_states.pt. 0: [2022-11-27 01:36:58,306] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_26-model_00-model_states.pt... 0: [2022-11-27 01:36:58,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_26-model_00-model_states.pt. 0: [2022-11-27 01:36:58,383] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_27-model_00-model_states.pt... 0: [2022-11-27 01:36:58,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_27-model_00-model_states.pt. 0: [2022-11-27 01:36:58,457] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_28-model_00-model_states.pt... 0: [2022-11-27 01:36:58,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_28-model_00-model_states.pt. 0: [2022-11-27 01:36:58,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/layer_30-model_00-model_states.pt... 0: [2022-11-27 01:36:58,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/layer_30-model_00-model_states.pt. 0: [2022-11-27 01:36:58,536] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step140000/mp_rank_00_model_states.pt 0: [2022-11-27 01:36:58,536] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/mp_rank_00_model_states.pt... 0: [2022-11-27 01:36:58,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/mp_rank_00_model_states.pt. 
0: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
7: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
16: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
3: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 
20: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 
24: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 
29: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 
21: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 
6: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
10: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
13: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 
23: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 
14: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 
21: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 
19: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
5: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
1: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 
15: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 
14: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 
18: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 
8: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
31: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
8: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
20: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:36:58,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:36:58,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 01:36:58,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 01:36:58,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 21: [2022-11-27 01:36:58,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:36:58,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 01:36:58,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 19: [2022-11-27 01:36:58,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
19: [2022-11-27 01:36:58,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-27 01:36:58,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 10: [2022-11-27 01:36:58,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:36:58,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 30: [2022-11-27 01:36:58,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:36:58,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 11: [2022-11-27 01:36:58,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 30: [2022-11-27 01:36:58,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 10: [2022-11-27 01:36:58,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 30: [2022-11-27 01:36:58,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 20: [2022-11-27 01:36:58,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-27 01:36:58,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 01:36:58,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 8: [2022-11-27 01:36:58,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:36:58,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 18: [2022-11-27 01:36:58,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:36:58,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 18: [2022-11-27 01:36:58,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 8: [2022-11-27 01:36:58,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 18: [2022-11-27 01:36:58,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 11: [2022-11-27 01:36:58,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:36:58,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 01:36:58,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 
28: [2022-11-27 01:36:58,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 17: [2022-11-27 01:36:58,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 27: [2022-11-27 01:36:58,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-27 01:36:58,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 17: [2022-11-27 01:36:58,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 27: [2022-11-27 01:36:58,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 17: [2022-11-27 01:36:58,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 0: [2022-11-27 01:36:58,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 24: [2022-11-27 01:36:58,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:36:58,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 01:36:58,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 
24: [2022-11-27 01:36:58,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-27 01:36:58,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 4: [2022-11-27 01:36:58,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:36:58,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 26: [2022-11-27 01:36:58,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:36:58,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:36:58,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 26: [2022-11-27 01:36:58,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 01:36:58,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 01:36:58,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 26: [2022-11-27 01:36:58,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 8: [2022-11-27 01:36:58,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-27 01:36:58,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 01:36:58,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 5: [2022-11-27 01:36:58,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:36:58,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:36:58,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 01:36:58,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 3: [2022-11-27 01:36:58,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:36:58,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:36:58,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 29: [2022-11-27 01:36:58,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 3: [2022-11-27 01:36:58,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 29: [2022-11-27 01:36:58,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 
19: [2022-11-27 01:36:58,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:36:58,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 01:36:58,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 10: [2022-11-27 01:36:58,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 17: [2022-11-27 01:36:58,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:36:58,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 23: [2022-11-27 01:36:58,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:36:58,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 23: [2022-11-27 01:36:58,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 17: [2022-11-27 01:36:58,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 01:36:58,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 23: [2022-11-27 01:36:58,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 
16: [2022-11-27 01:36:58,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:36:58,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 24: [2022-11-27 01:36:58,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:36:58,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:36:58,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 01:36:58,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 24: [2022-11-27 01:36:58,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 01:36:58,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 30: [2022-11-27 01:36:58,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
28: [2022-11-27 01:36:58,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 30: [2022-11-27 01:36:58,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 28: [2022-11-27 01:36:58,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 9: [2022-11-27 01:36:58,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 30: [2022-11-27 01:36:58,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 9: [2022-11-27 01:36:58,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 01:36:58,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 30: [2022-11-27 01:36:58,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-27 01:36:58,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 01:36:58,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 20: [2022-11-27 01:36:58,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-27 01:36:58,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 01:36:58,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 17: [2022-11-27 01:36:58,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-27 01:36:58,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 01:36:58,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 0: [2022-11-27 01:36:58,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:36:58,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 16: [2022-11-27 01:36:58,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:36:58,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 01:36:58,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 0: [2022-11-27 01:36:58,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 4: [2022-11-27 01:36:58,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
0: [2022-11-27 01:36:58,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 4: [2022-11-27 01:36:58,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 12: [2022-11-27 01:36:58,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:36:58,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 12: [2022-11-27 01:36:58,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 01:36:58,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 31: [2022-11-27 01:36:58,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 28: [2022-11-27 01:36:58,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:36:58,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 01:36:58,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 16: [2022-11-27 01:36:58,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 
28: [2022-11-27 01:36:58,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 16: [2022-11-27 01:36:58,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 28: [2022-11-27 01:36:58,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 16: [2022-11-27 01:36:58,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 12: [2022-11-27 01:36:58,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:36:58,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 01:36:58,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 10: [2022-11-27 01:36:58,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:36:58,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 01:36:58,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 18: [2022-11-27 01:36:58,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-27 01:36:58,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 01:36:58,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 31: [2022-11-27 01:36:58,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:36:58,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:36:58,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 23: [2022-11-27 01:36:58,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:36:58,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 31: [2022-11-27 01:36:58,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 25: [2022-11-27 01:36:58,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 23: [2022-11-27 01:36:58,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 01:36:58,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 23: [2022-11-27 01:36:58,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
23: [2022-11-27 01:36:58,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 01:36:58,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 3: [2022-11-27 01:36:58,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 27: [2022-11-27 01:36:58,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:36:58,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 27: [2022-11-27 01:36:58,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 3: [2022-11-27 01:36:58,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 27: [2022-11-27 01:36:58,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 13: [2022-11-27 01:36:58,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:36:58,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-27 01:36:58,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 01:36:58,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 01:36:58,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 13: [2022-11-27 01:36:58,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 6: [2022-11-27 01:36:58,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:36:58,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 20: [2022-11-27 01:36:58,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:36:58,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 20: [2022-11-27 01:36:58,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 01:36:58,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 25: [2022-11-27 01:36:58,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 
25: [2022-11-27 01:36:58,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 01:36:58,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 31: [2022-11-27 01:36:58,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:36:58,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 01:36:58,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 8: [2022-11-27 01:36:58,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:36:58,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 01:36:58,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 16: [2022-11-27 01:36:58,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:36:58,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 01:36:58,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 7: [2022-11-27 01:36:58,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-27 01:36:58,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 01:36:58,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 19: [2022-11-27 01:36:58,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:36:58,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 7: [2022-11-27 01:36:58,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:36:58,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 7: [2022-11-27 01:36:58,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 01:36:58,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 11: [2022-11-27 01:36:58,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:36:58,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
11: [2022-11-27 01:36:58,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 8: [2022-11-27 01:36:58,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 11: [2022-11-27 01:36:58,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 8: [2022-11-27 01:36:58,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 1: [2022-11-27 01:36:58,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:36:58,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:36:58,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 1: [2022-11-27 01:36:58,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 11: [2022-11-27 01:36:58,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 1: [2022-11-27 01:36:58,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 6: [2022-11-27 01:36:58,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-27 01:36:58,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 01:36:58,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 20: [2022-11-27 01:36:58,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:36:58,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:36:58,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:36:58,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 01:36:58,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 20: [2022-11-27 01:36:58,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 0: [2022-11-27 01:36:58,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 20: [2022-11-27 01:36:58,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 14: [2022-11-27 01:36:58,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:36:58,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 
14: [2022-11-27 01:36:58,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 23: [2022-11-27 01:36:58,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-27 01:36:58,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 14: [2022-11-27 01:36:58,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 7: [2022-11-27 01:36:58,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 23: [2022-11-27 01:36:58,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 30: [2022-11-27 01:36:58,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:36:58,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 30: [2022-11-27 01:36:58,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 7: [2022-11-27 01:36:58,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 24: [2022-11-27 01:36:58,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 30: [2022-11-27 01:36:58,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 
24: [2022-11-27 01:36:58,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 01:36:58,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 21: [2022-11-27 01:36:58,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:36:58,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 01:36:58,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 10: [2022-11-27 01:36:58,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:36:58,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 01:36:58,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 3: [2022-11-27 01:36:58,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:36:58,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 18: [2022-11-27 01:36:58,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:36:58,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 
18: [2022-11-27 01:36:58,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 01:36:58,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 28: [2022-11-27 01:36:58,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-27 01:36:58,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 17: [2022-11-27 01:36:58,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 28: [2022-11-27 01:36:58,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 17: [2022-11-27 01:36:58,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 01:36:58,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 14: [2022-11-27 01:36:58,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:36:58,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 01:36:58,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 3: [2022-11-27 01:36:58,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-27 01:36:58,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 01:36:58,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 9: [2022-11-27 01:36:58,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:36:58,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 01:36:58,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 1: [2022-11-27 01:36:58,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:36:58,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 13: [2022-11-27 01:36:58,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 28: [2022-11-27 01:36:58,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:36:58,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 13: [2022-11-27 01:36:58,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-27 01:36:58,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 28: [2022-11-27 01:36:58,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 13: [2022-11-27 01:36:58,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 28: [2022-11-27 01:36:58,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 13: [2022-11-27 01:36:58,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 13: [2022-11-27 01:36:58,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 12: [2022-11-27 01:36:58,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:36:58,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 01:36:58,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 26: [2022-11-27 01:36:58,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:36:58,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
26: [2022-11-27 01:36:58,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 01:36:58,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-27 01:36:58,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 26: [2022-11-27 01:36:58,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 4: [2022-11-27 01:36:58,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:36:58,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 01:36:58,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 12: [2022-11-27 01:36:58,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:36:58,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 01:36:58,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 5: [2022-11-27 01:36:58,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 01:36:58,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 
25: [2022-11-27 01:36:58,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:36:58,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:36:58,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 01:36:58,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 25: [2022-11-27 01:36:58,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 5: [2022-11-27 01:36:58,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:36:58,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 5: [2022-11-27 01:36:58,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 01:36:58,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 22: [2022-11-27 01:36:58,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:36:58,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 01:36:58,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 
22: [2022-11-27 01:36:58,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:36:58,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 01:36:58,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 22: [2022-11-27 01:36:58,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:36:58,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 01:36:58,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 9: [2022-11-27 01:36:58,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:36:58,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 01:36:58,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 19: [2022-11-27 01:36:58,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:36:58,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 01:36:58,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 
14: [2022-11-27 01:36:58,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:36:58,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 01:36:58,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 4: [2022-11-27 01:36:58,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:36:58,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:36:58,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 31: [2022-11-27 01:36:58,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 4: [2022-11-27 01:36:58,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 31: [2022-11-27 01:36:58,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 0: [2022-11-27 01:36:58,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:36:58,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 01:36:58,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 
6: [2022-11-27 01:36:58,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:36:58,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 01:36:58,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 27: [2022-11-27 01:36:58,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-27 01:36:58,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 01:36:58,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-27 01:36:58,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 27: [2022-11-27 01:36:58,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 01:36:58,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 29: [2022-11-27 01:36:58,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:36:58,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 01:36:58,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 
29: [2022-11-27 01:36:58,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:36:58,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 21: [2022-11-27 01:36:58,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:36:58,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 29: [2022-11-27 01:36:58,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 21: [2022-11-27 01:36:58,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 25: [2022-11-27 01:36:58,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:36:58,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 01:36:58,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 1: [2022-11-27 01:36:58,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:36:58,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 01:36:58,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 
7: [2022-11-27 01:36:58,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:36:58,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 01:36:58,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 24: [2022-11-27 01:36:58,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-27 01:36:58,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 01:36:58,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 2: [2022-11-27 01:36:58,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:36:58,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:36:58,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-27 01:36:58,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 01:36:58,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 01:36:58,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:36:58,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 2: [2022-11-27 01:36:58,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 2: [2022-11-27 01:36:58,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 01:36:58,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 01:36:58,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 2: [2022-11-27 01:36:58,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 15: [2022-11-27 01:36:58,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:36:58,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:36:58,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-27 01:36:58,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 01:36:58,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 01:36:58,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 01:36:58,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 15: [2022-11-27 01:36:58,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 15: [2022-11-27 01:36:58,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 5: [2022-11-27 01:36:58,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:36:58,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 01:36:58,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 22: [2022-11-27 01:36:58,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:36:58,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 01:36:58,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 
17: [2022-11-27 01:36:58,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-27 01:36:58,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 01:36:58,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 25: [2022-11-27 01:36:58,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:36:58,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 01:36:58,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 15: [2022-11-27 01:36:58,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:36:58,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 01:36:58,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 8: [2022-11-27 01:36:58,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:36:58,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 01:36:58,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 
11: [2022-11-27 01:36:58,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:36:58,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 01:36:58,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 31: [2022-11-27 01:36:58,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:36:58,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 01:36:58,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 23: [2022-11-27 01:36:58,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-27 01:36:58,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 01:36:58,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 20: [2022-11-27 01:36:58,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
20: [2022-11-27 01:36:58,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 18: [2022-11-27 01:36:58,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 20: [2022-11-27 01:36:58,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 18: [2022-11-27 01:36:58,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 01:36:58,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 28: [2022-11-27 01:36:58,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-27 01:36:58,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 01:36:58,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 29: [2022-11-27 01:36:58,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:36:58,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 01:36:58,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 30: [2022-11-27 01:36:58,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 
30: [2022-11-27 01:36:58,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 01:36:58,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 19: [2022-11-27 01:36:58,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:36:58,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 01:36:58,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 16: [2022-11-27 01:36:58,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:36:58,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 01:36:58,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 6: [2022-11-27 01:36:58,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:36:58,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 01:36:58,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 9: [2022-11-27 01:36:58,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-27 01:36:58,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 01:36:58,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 10: [2022-11-27 01:36:58,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:36:58,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 01:36:58,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 26: [2022-11-27 01:36:58,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:36:58,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 01:36:58,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 2: [2022-11-27 01:36:58,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:36:58,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 01:36:58,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 7: [2022-11-27 01:36:58,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-27 01:36:58,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 01:36:58,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 3: [2022-11-27 01:36:58,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:36:58,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 01:36:58,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 21: [2022-11-27 01:36:58,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:36:58,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 01:36:58,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 13: [2022-11-27 01:36:58,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:36:58,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 01:36:58,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 27: [2022-11-27 01:36:58,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
27: [2022-11-27 01:36:58,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 01:36:58,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 12: [2022-11-27 01:36:58,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:36:58,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 01:36:58,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 4: [2022-11-27 01:36:58,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:36:58,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 01:36:58,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 0: [2022-11-27 01:36:58,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:36:58,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 01:36:58,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 14: [2022-11-27 01:36:58,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-27 01:36:58,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 01:36:58,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 1: [2022-11-27 01:36:58,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:36:58,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 22: [2022-11-27 01:36:58,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:36:58,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 01:36:58,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 1: [2022-11-27 01:36:58,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 5: [2022-11-27 01:36:58,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:36:58,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 01:36:58,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 24: [2022-11-27 01:36:58,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
24: [2022-11-27 01:36:58,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 01:36:58,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 17: [2022-11-27 01:36:58,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-27 01:36:58,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 01:36:58,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 15: [2022-11-27 01:36:58,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:36:58,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 01:36:58,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 11: [2022-11-27 01:36:58,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:36:58,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 01:36:58,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 8: [2022-11-27 01:36:58,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-27 01:36:58,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 01:36:58,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 31: [2022-11-27 01:36:58,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:36:58,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 01:36:58,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 23: [2022-11-27 01:36:58,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-27 01:36:58,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 01:36:58,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 25: [2022-11-27 01:36:58,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:36:58,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 01:36:58,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 18: [2022-11-27 01:36:58,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
18: [2022-11-27 01:36:58,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 01:36:58,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 16: [2022-11-27 01:36:58,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:36:58,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 01:36:58,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 3: [2022-11-27 01:36:58,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:36:58,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 01:36:58,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 6: [2022-11-27 01:36:58,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:36:58,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 01:36:58,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 30: [2022-11-27 01:36:58,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-27 01:36:58,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 01:36:58,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 9: [2022-11-27 01:36:58,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:36:58,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 01:36:58,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 19: [2022-11-27 01:36:58,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:36:58,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 01:36:58,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 29: [2022-11-27 01:36:58,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:36:58,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 01:36:58,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 28: [2022-11-27 01:36:58,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
28: [2022-11-27 01:36:58,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 01:36:58,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 4: [2022-11-27 01:36:58,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:36:58,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:36:58,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 26: [2022-11-27 01:36:58,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 4: [2022-11-27 01:36:58,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 26: [2022-11-27 01:36:58,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 20: [2022-11-27 01:36:58,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-27 01:36:58,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 01:36:58,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 13: [2022-11-27 01:36:58,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-27 01:36:58,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 01:36:58,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 2: [2022-11-27 01:36:58,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:36:58,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 01:36:58,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 21: [2022-11-27 01:36:58,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:36:58,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 01:36:58,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 27: [2022-11-27 01:36:58,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-27 01:36:58,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 01:36:58,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 7: [2022-11-27 01:36:58,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-27 01:36:58,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 01:36:58,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 12: [2022-11-27 01:36:58,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:36:58,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:36:58,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 01:36:58,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 14: [2022-11-27 01:36:58,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:36:58,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 01:36:58,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 1: [2022-11-27 01:36:58,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:36:58,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 01:36:58,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 
8: [2022-11-27 01:36:58,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:36:58,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 01:36:58,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 22: [2022-11-27 01:36:58,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:36:58,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 01:36:58,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 15: [2022-11-27 01:36:58,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:36:58,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 01:36:58,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 24: [2022-11-27 01:36:58,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-27 01:36:58,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-27 01:36:58,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 
11: [2022-11-27 01:36:58,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:36:58,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 01:36:58,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 10: [2022-11-27 01:36:58,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:36:58,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 01:36:58,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 25: [2022-11-27 01:36:58,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 17: [2022-11-27 01:36:58,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:36:58,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 23: [2022-11-27 01:36:58,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
17: [2022-11-27 01:36:58,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 01:36:58,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 31: [2022-11-27 01:36:58,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:36:58,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 23: [2022-11-27 01:36:58,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 01:36:58,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 31: [2022-11-27 01:36:58,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 01:36:58,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 5: [2022-11-27 01:36:58,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:36:58,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 01:36:58,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 28: [2022-11-27 01:36:58,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
28: [2022-11-27 01:36:58,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 01:36:58,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 20: [2022-11-27 01:36:58,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-27 01:36:58,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 01:36:58,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 19: [2022-11-27 01:36:58,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:36:58,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:36:58,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 16: [2022-11-27 01:36:58,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 19: [2022-11-27 01:36:58,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 16: [2022-11-27 01:36:58,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 18: [2022-11-27 01:36:58,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
18: [2022-11-27 01:36:58,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-27 01:36:58,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 29: [2022-11-27 01:36:58,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:36:58,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 01:36:58,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 30: [2022-11-27 01:36:58,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-27 01:36:58,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 01:36:58,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 0: [2022-11-27 01:36:58,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 01:36:58,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 6: [2022-11-27 01:36:58,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:36:58,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
6: [2022-11-27 01:36:58,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 3: [2022-11-27 01:36:58,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 01:36:58,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 6: [2022-11-27 01:36:58,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 9: [2022-11-27 01:36:58,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:36:58,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 01:36:58,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 26: [2022-11-27 01:36:58,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:36:58,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 01:36:58,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 27: [2022-11-27 01:36:58,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-27 01:36:58,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 01:36:58,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 13: [2022-11-27 01:36:58,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:36:58,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 01:36:58,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 7: [2022-11-27 01:36:58,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:36:58,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 01:36:58,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 0: [2022-11-27 01:36:58,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:36:58,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 01:36:58,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 2: [2022-11-27 01:36:58,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
21: [2022-11-27 01:36:58,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:36:58,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 21: [2022-11-27 01:36:58,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 2: [2022-11-27 01:36:58,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 21: [2022-11-27 01:36:58,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 11: [2022-11-27 01:36:58,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:36:58,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 01:36:58,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 14: [2022-11-27 01:36:58,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:36:58,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 01:36:58,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 4: [2022-11-27 01:36:58,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
12: [2022-11-27 01:36:58,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:36:58,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 12: [2022-11-27 01:36:58,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 4: [2022-11-27 01:36:58,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 12: [2022-11-27 01:36:58,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 30: [2022-11-27 01:36:58,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:36:58,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 30: [2022-11-27 01:36:58,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 22: [2022-11-27 01:36:58,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 30: [2022-11-27 01:36:58,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 22: [2022-11-27 01:36:58,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 23: [2022-11-27 01:36:58,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
24: [2022-11-27 01:36:58,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 23: [2022-11-27 01:36:58,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 01:36:58,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 24: [2022-11-27 01:36:58,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 01:36:58,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 8: [2022-11-27 01:36:58,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:36:58,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 01:36:58,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 15: [2022-11-27 01:36:58,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:36:58,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 01:36:58,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 5: [2022-11-27 01:36:58,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-27 01:36:58,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 01:36:58,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 25: [2022-11-27 01:36:58,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:36:58,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 01:36:58,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 10: [2022-11-27 01:36:58,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:36:58,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:36:58,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 17: [2022-11-27 01:36:58,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:36:58,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 16: [2022-11-27 01:36:58,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 10: [2022-11-27 01:36:58,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 
17: [2022-11-27 01:36:58,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 19: [2022-11-27 01:36:58,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 17: [2022-11-27 01:36:58,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 19: [2022-11-27 01:36:58,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 01:36:58,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 3: [2022-11-27 01:36:58,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:36:58,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 01:36:58,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 31: [2022-11-27 01:36:58,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:36:58,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 01:36:58,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 18: [2022-11-27 01:36:58,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
0: [2022-11-27 01:36:58,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 18: [2022-11-27 01:36:58,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 01:36:58,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 0: [2022-11-27 01:36:58,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 01:36:58,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 28: [2022-11-27 01:36:58,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-27 01:36:58,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 01:36:58,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 20: [2022-11-27 01:36:58,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-27 01:36:58,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 01:36:58,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 2: [2022-11-27 01:36:58,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-27 01:36:58,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 01:36:58,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 7: [2022-11-27 01:36:58,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:36:58,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 01:36:58,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 29: [2022-11-27 01:36:58,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:36:58,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 01:36:58,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 13: [2022-11-27 01:36:58,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:36:58,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 01:36:58,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 21: [2022-11-27 01:36:58,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-27 01:36:58,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 01:36:58,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 6: [2022-11-27 01:36:58,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:36:58,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 01:36:58,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 9: [2022-11-27 01:36:58,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:36:58,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 01:36:58,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 26: [2022-11-27 01:36:58,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:36:58,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 01:36:58,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 27: [2022-11-27 01:36:58,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
27: [2022-11-27 01:36:58,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 01:36:58,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 5: [2022-11-27 01:36:58,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:36:58,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:36:58,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 10: [2022-11-27 01:36:58,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 5: [2022-11-27 01:36:58,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 10: [2022-11-27 01:36:58,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 12: [2022-11-27 01:36:58,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:36:58,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 01:36:58,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 4: [2022-11-27 01:36:58,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-27 01:36:58,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 01:36:58,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 24: [2022-11-27 01:36:58,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-27 01:36:58,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 01:36:58,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 1: [2022-11-27 01:36:58,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:36:58,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 01:36:58,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 6: [2022-11-27 01:36:58,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:36:58,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 01:36:58,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 1: [2022-11-27 01:36:58,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-27 01:36:58,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:36:58,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 01:36:58,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 01:36:58,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 1: [2022-11-27 01:36:58,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 22: [2022-11-27 01:36:58,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:36:58,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 01:36:58,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 15: [2022-11-27 01:36:58,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:36:58,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 01:36:58,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 14: [2022-11-27 01:36:58,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-27 01:36:58,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 01:36:58,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 14: [2022-11-27 01:36:58,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:36:58,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step140000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 01:36:58,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step140000 is ready now! 0: successfully saved checkpoint at iteration 140000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2502.62 31: iteration 140010/ 173500 | consumed samples: 35842560 | consumed tokens: 73405562880 | elapsed time per iteration (s): 1.15 | learning rate: 3.636E-05 | global batch size: 256 | lm loss: 1.911339E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 221.675 | TFLOPs: 13.41 | 31: iteration 140020/ 173500 | consumed samples: 35845120 | consumed tokens: 73410805760 | elapsed time per iteration (s): 0.91 | learning rate: 3.635E-05 | global batch size: 256 | lm loss: 1.931073E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 280.760 | TFLOPs: 16.99 | 31: iteration 140030/ 173500 | consumed samples: 35847680 | consumed tokens: 73416048640 | elapsed time per iteration (s): 0.86 | learning rate: 3.634E-05 | global batch size: 256 | lm loss: 1.916372E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.107 | TFLOPs: 17.97 | 31: iteration 
140040/ 173500 | consumed samples: 35850240 | consumed tokens: 73421291520 | elapsed time per iteration (s): 0.87 | learning rate: 3.633E-05 | global batch size: 256 | lm loss: 1.932829E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.452 | TFLOPs: 17.75 | 31: iteration 140050/ 173500 | consumed samples: 35852800 | consumed tokens: 73426534400 | elapsed time per iteration (s): 0.83 | learning rate: 3.633E-05 | global batch size: 256 | lm loss: 1.905267E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.013 | TFLOPs: 18.69 | 31: iteration 140060/ 173500 | consumed samples: 35855360 | consumed tokens: 73431777280 | elapsed time per iteration (s): 0.82 | learning rate: 3.632E-05 | global batch size: 256 | lm loss: 1.920520E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.174 | TFLOPs: 18.89 | 31: iteration 140070/ 173500 | consumed samples: 35857920 | consumed tokens: 73437020160 | elapsed time per iteration (s): 0.80 | learning rate: 3.631E-05 | global batch size: 256 | lm loss: 1.930582E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.284 | TFLOPs: 19.44 | 31: iteration 140080/ 173500 | consumed samples: 35860480 | consumed tokens: 73442263040 | elapsed time per iteration (s): 0.88 | learning rate: 3.630E-05 | global batch size: 256 | lm loss: 1.924755E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.907 | TFLOPs: 17.54 | 31: iteration 140090/ 173500 | consumed samples: 35863040 | consumed tokens: 73447505920 | elapsed time per iteration (s): 0.77 | learning rate: 3.629E-05 | global batch size: 256 | lm loss: 1.924626E+00 | grad norm: 0.180 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.790 | TFLOPs: 20.13 | 31: iteration 140100/ 173500 | consumed samples: 35865600 | consumed tokens: 73452748800 | elapsed time per iteration (s): 0.73 | learning rate: 3.628E-05 | global batch size: 256 | lm loss: 1.939928E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.585 | TFLOPs: 21.15 | 31: iteration 140110/ 173500 | consumed samples: 35868160 | consumed tokens: 73457991680 | elapsed time per iteration (s): 0.78 | learning rate: 3.627E-05 | global batch size: 256 | lm loss: 1.937056E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.511 | TFLOPs: 19.75 | 31: iteration 140120/ 173500 | consumed samples: 35870720 | consumed tokens: 73463234560 | elapsed time per iteration (s): 0.85 | learning rate: 3.626E-05 | global batch size: 256 | lm loss: 1.953722E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.602 | TFLOPs: 18.31 | 31: iteration 140130/ 173500 | consumed samples: 35873280 | consumed tokens: 73468477440 | elapsed time per iteration (s): 0.78 | learning rate: 3.625E-05 | global batch size: 256 | lm loss: 1.929612E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.975 | TFLOPs: 19.90 | 31: iteration 140140/ 173500 | consumed samples: 35875840 | consumed tokens: 73473720320 | elapsed time per iteration (s): 0.82 | learning rate: 3.624E-05 | global batch size: 256 | lm loss: 1.918350E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.075 | TFLOPs: 18.94 | 31: iteration 140150/ 173500 | consumed samples: 35878400 | consumed tokens: 73478963200 | elapsed time per iteration (s): 1.02 | learning 
rate: 3.623E-05 | global batch size: 256 | lm loss: 1.949568E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.604 | TFLOPs: 15.22 | 31: iteration 140160/ 173500 | consumed samples: 35880960 | consumed tokens: 73484206080 | elapsed time per iteration (s): 0.84 | learning rate: 3.622E-05 | global batch size: 256 | lm loss: 1.901717E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.755 | TFLOPs: 18.38 | 31: iteration 140170/ 173500 | consumed samples: 35883520 | consumed tokens: 73489448960 | elapsed time per iteration (s): 0.83 | learning rate: 3.621E-05 | global batch size: 256 | lm loss: 1.939713E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.866 | TFLOPs: 18.56 | 31: iteration 140180/ 173500 | consumed samples: 35886080 | consumed tokens: 73494691840 | elapsed time per iteration (s): 0.79 | learning rate: 3.620E-05 | global batch size: 256 | lm loss: 1.934569E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.528 | TFLOPs: 19.51 | 31: iteration 140190/ 173500 | consumed samples: 35888640 | consumed tokens: 73499934720 | elapsed time per iteration (s): 0.78 | learning rate: 3.619E-05 | global batch size: 256 | lm loss: 1.914846E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.901 | TFLOPs: 19.78 | 31: iteration 140200/ 173500 | consumed samples: 35891200 | consumed tokens: 73505177600 | elapsed time per iteration (s): 0.73 | learning rate: 3.618E-05 | global batch size: 256 | lm loss: 1.913689E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.260 | TFLOPs: 21.13 | 31: iteration 140210/ 
173500 | consumed samples: 35893760 | consumed tokens: 73510420480 | elapsed time per iteration (s): 0.76 | learning rate: 3.617E-05 | global batch size: 256 | lm loss: 1.894500E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.799 | TFLOPs: 20.44 | 31: iteration 140220/ 173500 | consumed samples: 35896320 | consumed tokens: 73515663360 | elapsed time per iteration (s): 0.76 | learning rate: 3.616E-05 | global batch size: 256 | lm loss: 1.902580E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.046 | TFLOPs: 20.27 | 31: iteration 140230/ 173500 | consumed samples: 35898880 | consumed tokens: 73520906240 | elapsed time per iteration (s): 0.73 | learning rate: 3.616E-05 | global batch size: 256 | lm loss: 1.914511E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.670 | TFLOPs: 21.09 | 31: iteration 140240/ 173500 | consumed samples: 35901440 | consumed tokens: 73526149120 | elapsed time per iteration (s): 0.72 | learning rate: 3.615E-05 | global batch size: 256 | lm loss: 1.905627E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.893 | TFLOPs: 21.47 | 31: iteration 140250/ 173500 | consumed samples: 35904000 | consumed tokens: 73531392000 | elapsed time per iteration (s): 0.76 | learning rate: 3.614E-05 | global batch size: 256 | lm loss: 1.893201E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.127 | TFLOPs: 20.27 | 31: iteration 140260/ 173500 | consumed samples: 35906560 | consumed tokens: 73536634880 | elapsed time per iteration (s): 0.74 | learning rate: 3.613E-05 | global batch size: 256 | lm loss: 1.937704E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 345.840 | TFLOPs: 20.92 | 31: iteration 140270/ 173500 | consumed samples: 35909120 | consumed tokens: 73541877760 | elapsed time per iteration (s): 0.77 | learning rate: 3.612E-05 | global batch size: 256 | lm loss: 1.903681E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.219 | TFLOPs: 20.22 | 31: iteration 140280/ 173500 | consumed samples: 35911680 | consumed tokens: 73547120640 | elapsed time per iteration (s): 0.78 | learning rate: 3.611E-05 | global batch size: 256 | lm loss: 1.944866E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.732 | TFLOPs: 19.77 | 31: iteration 140290/ 173500 | consumed samples: 35914240 | consumed tokens: 73552363520 | elapsed time per iteration (s): 0.78 | learning rate: 3.610E-05 | global batch size: 256 | lm loss: 1.920471E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.272 | TFLOPs: 19.74 | 31: iteration 140300/ 173500 | consumed samples: 35916800 | consumed tokens: 73557606400 | elapsed time per iteration (s): 0.80 | learning rate: 3.609E-05 | global batch size: 256 | lm loss: 1.959353E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.908 | TFLOPs: 19.35 | 31: iteration 140310/ 173500 | consumed samples: 35919360 | consumed tokens: 73562849280 | elapsed time per iteration (s): 0.78 | learning rate: 3.608E-05 | global batch size: 256 | lm loss: 1.920584E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.548 | TFLOPs: 19.94 | 31: iteration 140320/ 173500 | consumed samples: 35921920 | consumed tokens: 73568092160 | elapsed time per iteration (s): 0.81 | learning rate: 
3.607E-05 | global batch size: 256 | lm loss: 1.957554E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.993 | TFLOPs: 19.18 | 31: iteration 140330/ 173500 | consumed samples: 35924480 | consumed tokens: 73573335040 | elapsed time per iteration (s): 0.74 | learning rate: 3.606E-05 | global batch size: 256 | lm loss: 1.908847E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.201 | TFLOPs: 20.94 | 31: iteration 140340/ 173500 | consumed samples: 35927040 | consumed tokens: 73578577920 | elapsed time per iteration (s): 0.82 | learning rate: 3.605E-05 | global batch size: 256 | lm loss: 1.941583E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.402 | TFLOPs: 18.90 | 31: iteration 140350/ 173500 | consumed samples: 35929600 | consumed tokens: 73583820800 | elapsed time per iteration (s): 0.87 | learning rate: 3.604E-05 | global batch size: 256 | lm loss: 1.913865E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.317 | TFLOPs: 17.81 | 31: iteration 140360/ 173500 | consumed samples: 35932160 | consumed tokens: 73589063680 | elapsed time per iteration (s): 0.88 | learning rate: 3.603E-05 | global batch size: 256 | lm loss: 1.913686E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.222 | TFLOPs: 17.56 | 31: iteration 140370/ 173500 | consumed samples: 35934720 | consumed tokens: 73594306560 | elapsed time per iteration (s): 0.77 | learning rate: 3.602E-05 | global batch size: 256 | lm loss: 1.918631E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.422 | TFLOPs: 20.05 | 31: iteration 140380/ 173500 | 
consumed samples: 35937280 | consumed tokens: 73599549440 | elapsed time per iteration (s): 0.82 | learning rate: 3.601E-05 | global batch size: 256 | lm loss: 1.934254E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.474 | TFLOPs: 18.90 | 31: iteration 140390/ 173500 | consumed samples: 35939840 | consumed tokens: 73604792320 | elapsed time per iteration (s): 0.85 | learning rate: 3.601E-05 | global batch size: 256 | lm loss: 1.945127E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.214 | TFLOPs: 18.22 | 31: iteration 140400/ 173500 | consumed samples: 35942400 | consumed tokens: 73610035200 | elapsed time per iteration (s): 0.92 | learning rate: 3.600E-05 | global batch size: 256 | lm loss: 1.901229E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 279.643 | TFLOPs: 16.92 | 31: iteration 140410/ 173500 | consumed samples: 35944960 | consumed tokens: 73615278080 | elapsed time per iteration (s): 0.82 | learning rate: 3.599E-05 | global batch size: 256 | lm loss: 1.895755E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.908 | TFLOPs: 18.81 | 31: iteration 140420/ 173500 | consumed samples: 35947520 | consumed tokens: 73620520960 | elapsed time per iteration (s): 0.80 | learning rate: 3.598E-05 | global batch size: 256 | lm loss: 1.943527E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.308 | TFLOPs: 19.26 | 31: iteration 140430/ 173500 | consumed samples: 35950080 | consumed tokens: 73625763840 | elapsed time per iteration (s): 0.83 | learning rate: 3.597E-05 | global batch size: 256 | lm loss: 1.906605E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 308.879 | TFLOPs: 18.69 | 31: iteration 140440/ 173500 | consumed samples: 35952640 | consumed tokens: 73631006720 | elapsed time per iteration (s): 0.83 | learning rate: 3.596E-05 | global batch size: 256 | lm loss: 1.909297E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.718 | TFLOPs: 18.74 | 31: iteration 140450/ 173500 | consumed samples: 35955200 | consumed tokens: 73636249600 | elapsed time per iteration (s): 0.78 | learning rate: 3.595E-05 | global batch size: 256 | lm loss: 1.916098E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.537 | TFLOPs: 19.94 | 31: iteration 140460/ 173500 | consumed samples: 35957760 | consumed tokens: 73641492480 | elapsed time per iteration (s): 0.85 | learning rate: 3.594E-05 | global batch size: 256 | lm loss: 1.949967E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.568 | TFLOPs: 18.18 | 31: iteration 140470/ 173500 | consumed samples: 35960320 | consumed tokens: 73646735360 | elapsed time per iteration (s): 0.85 | learning rate: 3.593E-05 | global batch size: 256 | lm loss: 1.921183E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.565 | TFLOPs: 18.12 | 31: iteration 140480/ 173500 | consumed samples: 35962880 | consumed tokens: 73651978240 | elapsed time per iteration (s): 0.91 | learning rate: 3.592E-05 | global batch size: 256 | lm loss: 1.927914E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 281.003 | TFLOPs: 17.00 | 31: iteration 140490/ 173500 | consumed samples: 35965440 | consumed tokens: 73657221120 | elapsed time per iteration (s): 0.78 | learning rate: 
3.591E-05 | global batch size: 256 | lm loss: 1.920579E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.851 | TFLOPs: 19.89 | 31: iteration 140500/ 173500 | consumed samples: 35968000 | consumed tokens: 73662464000 | elapsed time per iteration (s): 0.82 | learning rate: 3.590E-05 | global batch size: 256 | lm loss: 1.933763E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.162 | TFLOPs: 18.95 | 31: iteration 140510/ 173500 | consumed samples: 35970560 | consumed tokens: 73667706880 | elapsed time per iteration (s): 0.83 | learning rate: 3.589E-05 | global batch size: 256 | lm loss: 1.933648E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.667 | TFLOPs: 18.67 | 31: iteration 140520/ 173500 | consumed samples: 35973120 | consumed tokens: 73672949760 | elapsed time per iteration (s): 0.80 | learning rate: 3.588E-05 | global batch size: 256 | lm loss: 1.913134E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.653 | TFLOPs: 19.28 | 31: iteration 140530/ 173500 | consumed samples: 35975680 | consumed tokens: 73678192640 | elapsed time per iteration (s): 0.83 | learning rate: 3.587E-05 | global batch size: 256 | lm loss: 1.933357E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.942 | TFLOPs: 18.57 | 31: iteration 140540/ 173500 | consumed samples: 35978240 | consumed tokens: 73683435520 | elapsed time per iteration (s): 0.78 | learning rate: 3.586E-05 | global batch size: 256 | lm loss: 1.943935E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.902 | TFLOPs: 19.90 | 31: iteration 140550/ 173500 | 
consumed samples: 35980800 | consumed tokens: 73688678400 | elapsed time per iteration (s): 0.76 | learning rate: 3.586E-05 | global batch size: 256 | lm loss: 1.924446E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.359 | TFLOPs: 20.35 | 31: iteration 140560/ 173500 | consumed samples: 35983360 | consumed tokens: 73693921280 | elapsed time per iteration (s): 0.80 | learning rate: 3.585E-05 | global batch size: 256 | lm loss: 1.923979E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.354 | TFLOPs: 19.26 | 31: iteration 140570/ 173500 | consumed samples: 35985920 | consumed tokens: 73699164160 | elapsed time per iteration (s): 0.82 | learning rate: 3.584E-05 | global batch size: 256 | lm loss: 1.906666E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.090 | TFLOPs: 19.00 | 31: iteration 140580/ 173500 | consumed samples: 35988480 | consumed tokens: 73704407040 | elapsed time per iteration (s): 0.84 | learning rate: 3.583E-05 | global batch size: 256 | lm loss: 1.945782E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.006 | TFLOPs: 18.51 | 31: iteration 140590/ 173500 | consumed samples: 35991040 | consumed tokens: 73709649920 | elapsed time per iteration (s): 0.77 | learning rate: 3.582E-05 | global batch size: 256 | lm loss: 1.929444E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.910 | TFLOPs: 20.02 | 31: iteration 140600/ 173500 | consumed samples: 35993600 | consumed tokens: 73714892800 | elapsed time per iteration (s): 0.76 | learning rate: 3.581E-05 | global batch size: 256 | lm loss: 1.925851E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 337.332 | TFLOPs: 20.41 | 31: iteration 140610/ 173500 | consumed samples: 35996160 | consumed tokens: 73720135680 | elapsed time per iteration (s): 0.76 | learning rate: 3.580E-05 | global batch size: 256 | lm loss: 1.926080E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.141 | TFLOPs: 20.34 | 31: iteration 140620/ 173500 | consumed samples: 35998720 | consumed tokens: 73725378560 | elapsed time per iteration (s): 0.91 | learning rate: 3.579E-05 | global batch size: 256 | lm loss: 1.926134E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 280.615 | TFLOPs: 16.98 | 31: iteration 140630/ 173500 | consumed samples: 36001280 | consumed tokens: 73730621440 | elapsed time per iteration (s): 0.75 | learning rate: 3.578E-05 | global batch size: 256 | lm loss: 1.931257E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.864 | TFLOPs: 20.74 | 31: iteration 140640/ 173500 | consumed samples: 36003840 | consumed tokens: 73735864320 | elapsed time per iteration (s): 0.80 | learning rate: 3.577E-05 | global batch size: 256 | lm loss: 1.909989E+00 | grad norm: 0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.217 | TFLOPs: 19.37 | 31: iteration 140650/ 173500 | consumed samples: 36006400 | consumed tokens: 73741107200 | elapsed time per iteration (s): 0.71 | learning rate: 3.576E-05 | global batch size: 256 | lm loss: 1.908893E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.900 | TFLOPs: 21.71 | 31: iteration 140660/ 173500 | consumed samples: 36008960 | consumed tokens: 73746350080 | elapsed time per iteration (s): 0.74 | learning rate: 
3.575E-05 | global batch size: 256 | lm loss: 1.938410E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.370 | TFLOPs: 20.89 | 31: iteration 140670/ 173500 | consumed samples: 36011520 | consumed tokens: 73751592960 | elapsed time per iteration (s): 0.79 | learning rate: 3.574E-05 | global batch size: 256 | lm loss: 1.900321E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.092 | TFLOPs: 19.73 | 31: iteration 140680/ 173500 | consumed samples: 36014080 | consumed tokens: 73756835840 | elapsed time per iteration (s): 0.80 | learning rate: 3.573E-05 | global batch size: 256 | lm loss: 1.923426E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.587 | TFLOPs: 19.46 | 31: iteration 140690/ 173500 | consumed samples: 36016640 | consumed tokens: 73762078720 | elapsed time per iteration (s): 0.73 | learning rate: 3.573E-05 | global batch size: 256 | lm loss: 1.925177E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.292 | TFLOPs: 21.31 | 31: iteration 140700/ 173500 | consumed samples: 36019200 | consumed tokens: 73767321600 | elapsed time per iteration (s): 0.75 | learning rate: 3.572E-05 | global batch size: 256 | lm loss: 1.898507E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.749 | TFLOPs: 20.67 | 31: iteration 140710/ 173500 | consumed samples: 36021760 | consumed tokens: 73772564480 | elapsed time per iteration (s): 0.89 | learning rate: 3.571E-05 | global batch size: 256 | lm loss: 1.913667E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.746 | TFLOPs: 17.47 | 31: iteration 140720/ 173500 | 
consumed samples: 36024320 | consumed tokens: 73777807360 | elapsed time per iteration (s): 0.80 | learning rate: 3.570E-05 | global batch size: 256 | lm loss: 1.935585E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.735 | TFLOPs: 19.40 | 31: iteration 140730/ 173500 | consumed samples: 36026880 | consumed tokens: 73783050240 | elapsed time per iteration (s): 0.83 | learning rate: 3.569E-05 | global batch size: 256 | lm loss: 1.918604E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.934 | TFLOPs: 18.75 | 31: iteration 140740/ 173500 | consumed samples: 36029440 | consumed tokens: 73788293120 | elapsed time per iteration (s): 0.77 | learning rate: 3.568E-05 | global batch size: 256 | lm loss: 1.904359E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.268 | TFLOPs: 20.10 | 31: iteration 140750/ 173500 | consumed samples: 36032000 | consumed tokens: 73793536000 | elapsed time per iteration (s): 0.75 | learning rate: 3.567E-05 | global batch size: 256 | lm loss: 1.922779E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.095 | TFLOPs: 20.76 | 31: iteration 140760/ 173500 | consumed samples: 36034560 | consumed tokens: 73798778880 | elapsed time per iteration (s): 0.75 | learning rate: 3.566E-05 | global batch size: 256 | lm loss: 1.938537E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.664 | TFLOPs: 20.61 | 31: iteration 140770/ 173500 | consumed samples: 36037120 | consumed tokens: 73804021760 | elapsed time per iteration (s): 0.74 | learning rate: 3.565E-05 | global batch size: 256 | lm loss: 1.936512E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 344.661 | TFLOPs: 20.85 | 31: iteration 140780/ 173500 | consumed samples: 36039680 | consumed tokens: 73809264640 | elapsed time per iteration (s): 0.78 | learning rate: 3.564E-05 | global batch size: 256 | lm loss: 1.931779E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.725 | TFLOPs: 19.83 | 31: iteration 140790/ 173500 | consumed samples: 36042240 | consumed tokens: 73814507520 | elapsed time per iteration (s): 0.75 | learning rate: 3.563E-05 | global batch size: 256 | lm loss: 1.922482E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.514 | TFLOPs: 20.60 | 31: iteration 140800/ 173500 | consumed samples: 36044800 | consumed tokens: 73819750400 | elapsed time per iteration (s): 0.75 | learning rate: 3.562E-05 | global batch size: 256 | lm loss: 1.925094E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.580 | TFLOPs: 20.66 | 31: iteration 140810/ 173500 | consumed samples: 36047360 | consumed tokens: 73824993280 | elapsed time per iteration (s): 0.75 | learning rate: 3.561E-05 | global batch size: 256 | lm loss: 1.926034E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.263 | TFLOPs: 20.65 | 31: iteration 140820/ 173500 | consumed samples: 36049920 | consumed tokens: 73830236160 | elapsed time per iteration (s): 0.79 | learning rate: 3.560E-05 | global batch size: 256 | lm loss: 1.954872E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.961 | TFLOPs: 19.72 | 31: iteration 140830/ 173500 | consumed samples: 36052480 | consumed tokens: 73835479040 | elapsed time per iteration (s): 0.79 | learning rate: 
3.560E-05 | global batch size: 256 | lm loss: 1.932464E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.661 | TFLOPs: 19.70 | 31: iteration 140840/ 173500 | consumed samples: 36055040 | consumed tokens: 73840721920 | elapsed time per iteration (s): 0.79 | learning rate: 3.559E-05 | global batch size: 256 | lm loss: 1.933196E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.063 | TFLOPs: 19.67 | 31: iteration 140850/ 173500 | consumed samples: 36057600 | consumed tokens: 73845964800 | elapsed time per iteration (s): 0.78 | learning rate: 3.558E-05 | global batch size: 256 | lm loss: 1.908379E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.986 | TFLOPs: 19.84 | 31: iteration 140860/ 173500 | consumed samples: 36060160 | consumed tokens: 73851207680 | elapsed time per iteration (s): 0.77 | learning rate: 3.557E-05 | global batch size: 256 | lm loss: 1.900798E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.393 | TFLOPs: 20.05 | 31: iteration 140870/ 173500 | consumed samples: 36062720 | consumed tokens: 73856450560 | elapsed time per iteration (s): 0.79 | learning rate: 3.556E-05 | global batch size: 256 | lm loss: 1.929896E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.773 | TFLOPs: 19.65 | 31: iteration 140880/ 173500 | consumed samples: 36065280 | consumed tokens: 73861693440 | elapsed time per iteration (s): 0.75 | learning rate: 3.555E-05 | global batch size: 256 | lm loss: 1.943508E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.704 | TFLOPs: 20.55 | 31: iteration 140890/ 173500 | 
consumed samples: 36067840 | consumed tokens: 73866936320 | elapsed time per iteration (s): 0.74 | learning rate: 3.554E-05 | global batch size: 256 | lm loss: 1.943077E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.999 | TFLOPs: 20.87 | 31: iteration 140900/ 173500 | consumed samples: 36070400 | consumed tokens: 73872179200 | elapsed time per iteration (s): 0.78 | learning rate: 3.553E-05 | global batch size: 256 | lm loss: 1.938982E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.115 | TFLOPs: 19.79 | 31: iteration 140910/ 173500 | consumed samples: 36072960 | consumed tokens: 73877422080 | elapsed time per iteration (s): 0.80 | learning rate: 3.552E-05 | global batch size: 256 | lm loss: 1.948451E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.123 | TFLOPs: 19.25 | 31: iteration 140920/ 173500 | consumed samples: 36075520 | consumed tokens: 73882664960 | elapsed time per iteration (s): 0.76 | learning rate: 3.551E-05 | global batch size: 256 | lm loss: 1.928368E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.803 | TFLOPs: 20.25 | 31: iteration 140930/ 173500 | consumed samples: 36078080 | consumed tokens: 73887907840 | elapsed time per iteration (s): 0.76 | learning rate: 3.550E-05 | global batch size: 256 | lm loss: 1.904177E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.073 | TFLOPs: 20.39 | 31: iteration 140940/ 173500 | consumed samples: 36080640 | consumed tokens: 73893150720 | elapsed time per iteration (s): 0.76 | learning rate: 3.549E-05 | global batch size: 256 | lm loss: 1.943442E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 339.056 | TFLOPs: 20.51 | 31: iteration 140950/ 173500 | consumed samples: 36083200 | consumed tokens: 73898393600 | elapsed time per iteration (s): 0.76 | learning rate: 3.548E-05 | global batch size: 256 | lm loss: 1.900092E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.830 | TFLOPs: 20.38 | 31: iteration 140960/ 173500 | consumed samples: 36085760 | consumed tokens: 73903636480 | elapsed time per iteration (s): 0.83 | learning rate: 3.548E-05 | global batch size: 256 | lm loss: 1.943920E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.859 | TFLOPs: 18.75 | 31: iteration 140970/ 173500 | consumed samples: 36088320 | consumed tokens: 73908879360 | elapsed time per iteration (s): 0.90 | learning rate: 3.547E-05 | global batch size: 256 | lm loss: 1.908445E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.114 | TFLOPs: 17.25 | 31: iteration 140980/ 173500 | consumed samples: 36090880 | consumed tokens: 73914122240 | elapsed time per iteration (s): 0.77 | learning rate: 3.546E-05 | global batch size: 256 | lm loss: 1.942634E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.122 | TFLOPs: 20.03 | 31: iteration 140990/ 173500 | consumed samples: 36093440 | consumed tokens: 73919365120 | elapsed time per iteration (s): 0.79 | learning rate: 3.545E-05 | global batch size: 256 | lm loss: 1.935469E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.307 | TFLOPs: 19.68 | 31: iteration 141000/ 173500 | consumed samples: 36096000 | consumed tokens: 73924608000 | elapsed time per iteration (s): 0.75 | learning rate: 
3.544E-05 | global batch size: 256 | lm loss: 1.924268E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.930 | TFLOPs: 20.69 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 141000 | lm loss value: 1.877524E+00 | lm loss PPL: 6.537295E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 141000 to checkpoints_1b1long 0: [2022-11-27 01:50:17,727] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step141000 is begin to save! 0: [2022-11-27 01:50:17,738] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_01-model_00-model_states.pt... 0: [2022-11-27 01:50:17,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_01-model_00-model_states.pt. 0: [2022-11-27 01:50:17,959] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_03-model_00-model_states.pt... 0: [2022-11-27 01:50:18,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_03-model_00-model_states.pt. 0: [2022-11-27 01:50:18,045] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_04-model_00-model_states.pt... 0: [2022-11-27 01:50:18,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_04-model_00-model_states.pt. 0: [2022-11-27 01:50:18,127] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_05-model_00-model_states.pt... 0: [2022-11-27 01:50:18,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_05-model_00-model_states.pt. 
0: [2022-11-27 01:50:18,206] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_06-model_00-model_states.pt... 0: [2022-11-27 01:50:18,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_06-model_00-model_states.pt. 0: [2022-11-27 01:50:18,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_07-model_00-model_states.pt... 0: [2022-11-27 01:50:18,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_07-model_00-model_states.pt. 0: [2022-11-27 01:50:18,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_08-model_00-model_states.pt... 0: [2022-11-27 01:50:18,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_08-model_00-model_states.pt. 0: [2022-11-27 01:50:18,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_09-model_00-model_states.pt... 0: [2022-11-27 01:50:18,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_09-model_00-model_states.pt. 0: [2022-11-27 01:50:18,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_10-model_00-model_states.pt... 0: [2022-11-27 01:50:18,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_10-model_00-model_states.pt. 0: [2022-11-27 01:50:18,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_11-model_00-model_states.pt... 0: [2022-11-27 01:50:18,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_11-model_00-model_states.pt. 
0: [2022-11-27 01:50:18,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_12-model_00-model_states.pt... 0: [2022-11-27 01:50:18,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_12-model_00-model_states.pt. 0: [2022-11-27 01:50:18,746] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_13-model_00-model_states.pt... 0: [2022-11-27 01:50:18,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_13-model_00-model_states.pt. 0: [2022-11-27 01:50:18,819] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_14-model_00-model_states.pt... 0: [2022-11-27 01:50:18,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_14-model_00-model_states.pt. 0: [2022-11-27 01:50:18,897] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_15-model_00-model_states.pt... 0: [2022-11-27 01:50:18,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_15-model_00-model_states.pt. 0: [2022-11-27 01:50:18,970] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_16-model_00-model_states.pt... 0: [2022-11-27 01:50:19,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_16-model_00-model_states.pt. 0: [2022-11-27 01:50:19,047] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_17-model_00-model_states.pt... 0: [2022-11-27 01:50:19,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_17-model_00-model_states.pt. 
0: [2022-11-27 01:50:19,119] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_18-model_00-model_states.pt... 0: [2022-11-27 01:50:19,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_18-model_00-model_states.pt. 0: [2022-11-27 01:50:19,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_19-model_00-model_states.pt... 0: [2022-11-27 01:50:19,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_19-model_00-model_states.pt. 0: [2022-11-27 01:50:19,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_20-model_00-model_states.pt... 0: [2022-11-27 01:50:19,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_20-model_00-model_states.pt. 0: [2022-11-27 01:50:19,347] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_21-model_00-model_states.pt... 0: [2022-11-27 01:50:19,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_21-model_00-model_states.pt. 0: [2022-11-27 01:50:19,419] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_22-model_00-model_states.pt... 0: [2022-11-27 01:50:19,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_22-model_00-model_states.pt. 0: [2022-11-27 01:50:19,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_23-model_00-model_states.pt... 0: [2022-11-27 01:50:19,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_23-model_00-model_states.pt. 
0: [2022-11-27 01:50:19,572] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_24-model_00-model_states.pt... 0: [2022-11-27 01:50:19,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_24-model_00-model_states.pt. 0: [2022-11-27 01:50:19,646] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_25-model_00-model_states.pt... 0: [2022-11-27 01:50:19,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_25-model_00-model_states.pt. 0: [2022-11-27 01:50:19,718] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_26-model_00-model_states.pt... 0: [2022-11-27 01:50:19,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_26-model_00-model_states.pt. 0: [2022-11-27 01:50:19,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_27-model_00-model_states.pt... 0: [2022-11-27 01:50:19,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_27-model_00-model_states.pt. 0: [2022-11-27 01:50:19,869] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_28-model_00-model_states.pt... 0: [2022-11-27 01:50:19,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_28-model_00-model_states.pt. 0: [2022-11-27 01:50:19,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/layer_30-model_00-model_states.pt... 0: [2022-11-27 01:50:19,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/layer_30-model_00-model_states.pt. 
0: [2022-11-27 01:50:19,948] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step141000/mp_rank_00_model_states.pt 0: [2022-11-27 01:50:19,948] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/mp_rank_00_model_states.pt... 0: [2022-11-27 01:50:19,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/mp_rank_00_model_states.pt. 0: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 
4: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 
15: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 
24: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
26: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
7: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 
2: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 
12: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 
11: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 
29: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 
18: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
0: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 
8: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 
25: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 
22: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 
19: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
1: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 
25: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 
30: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 26: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
9: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 
23: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 
19: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 
25: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 28: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 
16: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 20: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 24: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 31: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 21: [2022-11-27 01:50:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:50:20,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:50:20,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 01:50:20,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 16: [2022-11-27 01:50:20,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
16: [2022-11-27 01:50:20,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 01:50:20,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 24: [2022-11-27 01:50:20,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:50:20,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:50:20,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 24: [2022-11-27 01:50:20,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 19: [2022-11-27 01:50:20,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 24: [2022-11-27 01:50:20,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 28: [2022-11-27 01:50:20,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:50:20,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-27 01:50:20,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 26: [2022-11-27 01:50:20,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:50:20,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 26: [2022-11-27 01:50:20,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-27 01:50:20,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 15: [2022-11-27 01:50:20,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:50:20,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 01:50:20,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 25: [2022-11-27 01:50:20,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:50:20,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-27 01:50:20,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 23: [2022-11-27 01:50:20,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-27 01:50:20,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 01:50:20,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 0: [2022-11-27 01:50:20,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:50:20,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:50:20,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:50:20,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 0: [2022-11-27 01:50:20,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 2: [2022-11-27 01:50:20,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 31: [2022-11-27 01:50:20,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 0: [2022-11-27 01:50:20,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 2: [2022-11-27 01:50:20,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 13: [2022-11-27 01:50:20,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-27 01:50:20,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 01:50:20,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 22: [2022-11-27 01:50:20,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:50:20,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 01:50:20,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 3: [2022-11-27 01:50:20,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:50:20,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 21: [2022-11-27 01:50:20,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:50:20,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 21: [2022-11-27 01:50:20,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 01:50:20,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 7: [2022-11-27 01:50:20,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
17: [2022-11-27 01:50:20,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:50:20,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 01:50:20,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 6: [2022-11-27 01:50:20,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 17: [2022-11-27 01:50:20,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 6: [2022-11-27 01:50:20,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 17: [2022-11-27 01:50:20,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 6: [2022-11-27 01:50:20,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 5: [2022-11-27 01:50:20,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:50:20,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 01:50:20,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 27: [2022-11-27 01:50:20,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
27: [2022-11-27 01:50:20,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 01:50:20,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 11: [2022-11-27 01:50:20,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:50:20,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 4: [2022-11-27 01:50:20,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:50:20,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 4: [2022-11-27 01:50:20,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 01:50:20,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 18: [2022-11-27 01:50:20,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-27 01:50:20,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 01:50:20,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 
28: [2022-11-27 01:50:20,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 01:50:20,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 20: [2022-11-27 01:50:20,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 01:50:20,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 01:50:20,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 10: [2022-11-27 01:50:20,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:50:20,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 01:50:20,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 29: [2022-11-27 01:50:20,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:50:20,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 01:50:20,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 17: [2022-11-27 01:50:20,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 
17: [2022-11-27 01:50:20,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 14: [2022-11-27 01:50:20,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 17: [2022-11-27 01:50:20,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 14: [2022-11-27 01:50:20,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 01:50:20,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 25: [2022-11-27 01:50:20,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:50:20,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 01:50:20,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 23: [2022-11-27 01:50:20,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-27 01:50:20,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 2: [2022-11-27 01:50:20,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 23: [2022-11-27 01:50:20,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 
2: [2022-11-27 01:50:20,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 01:50:20,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 11: [2022-11-27 01:50:20,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:50:20,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 01:50:20,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 22: [2022-11-27 01:50:20,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:50:20,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 01:50:20,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 5: [2022-11-27 01:50:20,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:50:20,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
5: [2022-11-27 01:50:20,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 8: [2022-11-27 01:50:20,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 5: [2022-11-27 01:50:20,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 8: [2022-11-27 01:50:20,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 15: [2022-11-27 01:50:20,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:50:20,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 01:50:20,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 31: [2022-11-27 01:50:20,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:50:20,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 01:50:20,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 18: [2022-11-27 01:50:20,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:50:20,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
18: [2022-11-27 01:50:20,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 26: [2022-11-27 01:50:20,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 18: [2022-11-27 01:50:20,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 26: [2022-11-27 01:50:20,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 7: [2022-11-27 01:50:20,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:50:20,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 01:50:20,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 9: [2022-11-27 01:50:20,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:50:20,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:50:20,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 13: [2022-11-27 01:50:20,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:50:20,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 
13: [2022-11-27 01:50:20,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 01:50:20,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 01:50:20,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 13: [2022-11-27 01:50:20,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 3: [2022-11-27 01:50:20,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:50:20,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 28: [2022-11-27 01:50:20,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 24: [2022-11-27 01:50:20,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:50:20,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 24: [2022-11-27 01:50:20,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-27 01:50:20,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 6: [2022-11-27 01:50:20,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-27 01:50:20,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 01:50:20,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 12: [2022-11-27 01:50:20,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:50:20,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 01:50:20,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 25: [2022-11-27 01:50:20,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:50:20,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 01:50:20,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 27: [2022-11-27 01:50:20,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-27 01:50:20,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 01:50:20,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 9: [2022-11-27 01:50:20,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-27 01:50:20,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 20: [2022-11-27 01:50:20,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-27 01:50:20,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 9: [2022-11-27 01:50:20,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 1: [2022-11-27 01:50:20,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 20: [2022-11-27 01:50:20,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 11: [2022-11-27 01:50:20,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:50:20,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 01:50:20,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 1: [2022-11-27 01:50:20,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 01:50:20,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 23: [2022-11-27 01:50:20,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-27 01:50:20,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 01:50:20,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 4: [2022-11-27 01:50:20,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:50:20,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:50:20,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 16: [2022-11-27 01:50:20,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 4: [2022-11-27 01:50:20,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 16: [2022-11-27 01:50:20,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 4: [2022-11-27 01:50:20,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:50:20,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 01:50:20,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 29: [2022-11-27 01:50:20,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 
29: [2022-11-27 01:50:20,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 01:50:20,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 12: [2022-11-27 01:50:20,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:50:20,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 01:50:20,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 6: [2022-11-27 01:50:20,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:50:20,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 01:50:20,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 19: [2022-11-27 01:50:20,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:50:20,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 30: [2022-11-27 01:50:20,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
30: [2022-11-27 01:50:20,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:50:20,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:50:20,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 30: [2022-11-27 01:50:20,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 01:50:20,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 01:50:20,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 30: [2022-11-27 01:50:20,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 19: [2022-11-27 01:50:20,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 15: [2022-11-27 01:50:20,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:50:20,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
15: [2022-11-27 01:50:20,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 22: [2022-11-27 01:50:20,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 19: [2022-11-27 01:50:20,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 15: [2022-11-27 01:50:20,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 22: [2022-11-27 01:50:20,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 14: [2022-11-27 01:50:20,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:50:20,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 01:50:20,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 10: [2022-11-27 01:50:20,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:50:20,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 01:50:20,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 2: [2022-11-27 01:50:20,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
24: [2022-11-27 01:50:20,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:50:20,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:50:20,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 21: [2022-11-27 01:50:20,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 2: [2022-11-27 01:50:20,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 24: [2022-11-27 01:50:20,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 21: [2022-11-27 01:50:20,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 7: [2022-11-27 01:50:20,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 24: [2022-11-27 01:50:20,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 7: [2022-11-27 01:50:20,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 01:50:20,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 0: [2022-11-27 01:50:20,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
0: [2022-11-27 01:50:20,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:50:20,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 16: [2022-11-27 01:50:20,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:50:20,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 16: [2022-11-27 01:50:20,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 01:50:20,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 3: [2022-11-27 01:50:20,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:50:20,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:50:20,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 26: [2022-11-27 01:50:20,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 01:50:20,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 3: [2022-11-27 01:50:20,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 
8: [2022-11-27 01:50:20,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:50:20,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 01:50:20,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 18: [2022-11-27 01:50:20,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 01:50:20,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 01:50:20,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 31: [2022-11-27 01:50:20,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:50:20,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 01:50:20,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 1: [2022-11-27 01:50:20,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:50:20,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 01:50:20,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 
14: [2022-11-27 01:50:20,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:50:20,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 01:50:20,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 17: [2022-11-27 01:50:20,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-27 01:50:20,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 01:50:20,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 29: [2022-11-27 01:50:20,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:50:20,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 01:50:20,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 20: [2022-11-27 01:50:20,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-27 01:50:20,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 01:50:20,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 
28: [2022-11-27 01:50:20,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 01:50:20,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 28: [2022-11-27 01:50:20,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-27 01:50:20,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 01:50:20,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 12: [2022-11-27 01:50:20,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:50:20,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 21: [2022-11-27 01:50:20,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:50:20,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 12: [2022-11-27 01:50:20,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 0: [2022-11-27 01:50:20,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 21: [2022-11-27 01:50:20,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 
0: [2022-11-27 01:50:20,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 28: [2022-11-27 01:50:20,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:50:20,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 27: [2022-11-27 01:50:20,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:50:20,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 27: [2022-11-27 01:50:20,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 5: [2022-11-27 01:50:20,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 27: [2022-11-27 01:50:20,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 30: [2022-11-27 01:50:20,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-27 01:50:20,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 01:50:20,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 9: [2022-11-27 01:50:20,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-27 01:50:20,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:50:20,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 01:50:20,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 01:50:20,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 9: [2022-11-27 01:50:20,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 24: [2022-11-27 01:50:20,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 01:50:20,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 28: [2022-11-27 01:50:20,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 24: [2022-11-27 01:50:20,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 19: [2022-11-27 01:50:20,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 28: [2022-11-27 01:50:20,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 
19: [2022-11-27 01:50:20,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 01:50:20,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 10: [2022-11-27 01:50:20,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:50:20,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 01:50:20,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 4: [2022-11-27 01:50:20,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:50:20,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 01:50:20,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 8: [2022-11-27 01:50:20,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:50:20,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 01:50:20,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 26: [2022-11-27 01:50:20,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
26: [2022-11-27 01:50:20,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 01:50:20,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 15: [2022-11-27 01:50:20,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:50:20,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 01:50:20,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 1: [2022-11-27 01:50:20,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:50:20,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 01:50:20,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 31: [2022-11-27 01:50:20,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:50:20,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 01:50:20,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 17: [2022-11-27 01:50:20,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-27 01:50:20,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 01:50:20,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 16: [2022-11-27 01:50:20,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:50:20,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 01:50:20,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 2: [2022-11-27 01:50:20,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:50:20,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 01:50:20,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 13: [2022-11-27 01:50:20,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:50:20,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 01:50:20,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 30: [2022-11-27 01:50:20,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-27 01:50:20,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 01:50:20,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 22: [2022-11-27 01:50:20,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:50:20,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 01:50:20,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 3: [2022-11-27 01:50:20,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:50:20,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 01:50:20,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 10: [2022-11-27 01:50:20,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:50:20,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 01:50:20,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 20: [2022-11-27 01:50:20,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 
20: [2022-11-27 01:50:20,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 01:50:20,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 27: [2022-11-27 01:50:20,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-27 01:50:20,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 01:50:20,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 6: [2022-11-27 01:50:20,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:50:20,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 01:50:20,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 23: [2022-11-27 01:50:20,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-27 01:50:20,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 01:50:20,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 0: [2022-11-27 01:50:20,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-27 01:50:20,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 01:50:20,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 8: [2022-11-27 01:50:20,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:50:20,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 01:50:20,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 25: [2022-11-27 01:50:20,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:50:20,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 01:50:20,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 11: [2022-11-27 01:50:20,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:50:20,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 01:50:20,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 5: [2022-11-27 01:50:20,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-27 01:50:20,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 01:50:20,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 29: [2022-11-27 01:50:20,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:50:20,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 01:50:20,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 21: [2022-11-27 01:50:20,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:50:20,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 01:50:20,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 18: [2022-11-27 01:50:20,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-27 01:50:20,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 01:50:20,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 14: [2022-11-27 01:50:20,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-27 01:50:20,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 01:50:20,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 12: [2022-11-27 01:50:20,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:50:20,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 01:50:20,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 24: [2022-11-27 01:50:20,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-27 01:50:20,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 01:50:20,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 19: [2022-11-27 01:50:20,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:50:20,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 01:50:20,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 9: [2022-11-27 01:50:20,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-27 01:50:20,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 01:50:20,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 26: [2022-11-27 01:50:20,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:50:20,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 01:50:20,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 15: [2022-11-27 01:50:20,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:50:20,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 01:50:20,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 4: [2022-11-27 01:50:20,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:50:20,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 01:50:20,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 28: [2022-11-27 01:50:20,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
1: [2022-11-27 01:50:20,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:50:20,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 01:50:20,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 31: [2022-11-27 01:50:20,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:50:20,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 01:50:20,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 28: [2022-11-27 01:50:20,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 01:50:20,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 17: [2022-11-27 01:50:20,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-27 01:50:20,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 01:50:20,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 30: [2022-11-27 01:50:20,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-27 01:50:20,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 01:50:20,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 13: [2022-11-27 01:50:20,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:50:20,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 01:50:20,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 10: [2022-11-27 01:50:20,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:50:20,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 01:50:20,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 16: [2022-11-27 01:50:20,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:50:20,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 2: [2022-11-27 01:50:20,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:50:20,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 
2: [2022-11-27 01:50:20,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 01:50:20,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 22: [2022-11-27 01:50:20,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:50:20,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 01:50:20,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 6: [2022-11-27 01:50:20,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:50:20,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 01:50:20,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 3: [2022-11-27 01:50:20,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:50:20,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 01:50:20,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 27: [2022-11-27 01:50:20,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
27: [2022-11-27 01:50:20,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 01:50:20,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 20: [2022-11-27 01:50:20,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-27 01:50:20,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 01:50:20,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 23: [2022-11-27 01:50:20,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-27 01:50:20,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 01:50:20,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 25: [2022-11-27 01:50:20,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:50:20,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 01:50:20,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 0: [2022-11-27 01:50:20,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-27 01:50:20,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 01:50:20,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 11: [2022-11-27 01:50:20,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:50:20,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 01:50:20,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 29: [2022-11-27 01:50:20,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:50:20,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 01:50:20,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 7: [2022-11-27 01:50:20,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:50:20,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 01:50:20,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 5: [2022-11-27 01:50:20,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
18: [2022-11-27 01:50:20,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:50:20,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 18: [2022-11-27 01:50:20,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 5: [2022-11-27 01:50:20,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 18: [2022-11-27 01:50:20,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 14: [2022-11-27 01:50:20,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:50:20,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 01:50:20,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 31: [2022-11-27 01:50:20,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:50:20,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 01:50:20,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 7: [2022-11-27 01:50:20,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-27 01:50:20,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 01:50:20,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 19: [2022-11-27 01:50:20,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:50:20,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 01:50:20,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 28: [2022-11-27 01:50:20,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 24: [2022-11-27 01:50:20,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-27 01:50:20,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 01:50:20,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 3: [2022-11-27 01:50:20,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:50:20,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
26: [2022-11-27 01:50:20,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:50:20,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 22: [2022-11-27 01:50:20,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 26: [2022-11-27 01:50:20,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 9: [2022-11-27 01:50:20,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:50:20,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 22: [2022-11-27 01:50:20,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 26: [2022-11-27 01:50:20,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 9: [2022-11-27 01:50:20,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 01:50:20,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 4: [2022-11-27 01:50:20,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-27 01:50:20,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 01:50:20,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 2: [2022-11-27 01:50:20,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:50:20,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:50:20,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 25: [2022-11-27 01:50:20,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 2: [2022-11-27 01:50:20,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 25: [2022-11-27 01:50:20,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 15: [2022-11-27 01:50:20,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:50:20,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 01:50:20,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 1: [2022-11-27 01:50:20,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-27 01:50:20,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 01:50:20,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 28: [2022-11-27 01:50:20,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 01:50:20,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 20: [2022-11-27 01:50:20,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-27 01:50:20,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 01:50:20,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 10: [2022-11-27 01:50:20,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:50:20,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 01:50:20,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 8: [2022-11-27 01:50:20,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-27 01:50:20,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 01:50:20,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 13: [2022-11-27 01:50:20,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:50:20,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 01:50:20,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 12: [2022-11-27 01:50:20,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:50:20,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 01:50:20,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 23: [2022-11-27 01:50:20,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-27 01:50:20,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 01:50:20,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 8: [2022-11-27 01:50:20,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-27 01:50:20,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 01:50:20,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 31: [2022-11-27 01:50:20,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:50:20,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 27: [2022-11-27 01:50:20,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:50:20,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 01:50:20,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 31: [2022-11-27 01:50:20,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 01:50:20,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 27: [2022-11-27 01:50:20,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 30: [2022-11-27 01:50:20,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 27: [2022-11-27 01:50:20,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 
30: [2022-11-27 01:50:20,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 01:50:20,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 22: [2022-11-27 01:50:20,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:50:20,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 6: [2022-11-27 01:50:20,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:50:20,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 6: [2022-11-27 01:50:20,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 01:50:20,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 9: [2022-11-27 01:50:20,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:50:20,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 01:50:20,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 7: [2022-11-27 01:50:20,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
1: [2022-11-27 01:50:20,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:50:20,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:50:20,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:50:20,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 1: [2022-11-27 01:50:20,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 16: [2022-11-27 01:50:20,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:50:20,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 17: [2022-11-27 01:50:20,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:50:20,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 1: [2022-11-27 01:50:20,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 
16: [2022-11-27 01:50:20,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 2: [2022-11-27 01:50:20,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:50:20,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 26: [2022-11-27 01:50:20,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 16: [2022-11-27 01:50:20,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 2: [2022-11-27 01:50:20,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 17: [2022-11-27 01:50:20,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 26: [2022-11-27 01:50:20,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 2: [2022-11-27 01:50:20,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 17: [2022-11-27 01:50:20,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 4: [2022-11-27 01:50:20,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-27 01:50:20,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 01:50:20,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 24: [2022-11-27 01:50:20,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-27 01:50:20,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 01:50:20,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 13: [2022-11-27 01:50:20,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:50:20,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 01:50:20,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 5: [2022-11-27 01:50:20,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:50:20,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 01:50:20,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 29: [2022-11-27 01:50:20,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
15: [2022-11-27 01:50:20,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:50:20,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 15: [2022-11-27 01:50:20,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 01:50:20,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:50:20,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 0: [2022-11-27 01:50:20,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:50:20,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 0: [2022-11-27 01:50:20,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 15: [2022-11-27 01:50:20,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 0: [2022-11-27 01:50:20,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 15: [2022-11-27 01:50:20,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 13: [2022-11-27 01:50:20,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-27 01:50:20,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 4: [2022-11-27 01:50:20,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:50:20,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 4: [2022-11-27 01:50:20,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 01:50:20,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 12: [2022-11-27 01:50:20,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:50:20,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:50:20,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 5: [2022-11-27 01:50:20,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 12: [2022-11-27 01:50:20,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 11: [2022-11-27 01:50:20,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-27 01:50:20,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:50:20,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 11: [2022-11-27 01:50:20,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 01:50:20,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 01:50:20,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 11: [2022-11-27 01:50:20,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 20: [2022-11-27 01:50:20,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-27 01:50:20,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 01:50:20,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 8: [2022-11-27 01:50:20,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 22: [2022-11-27 01:50:20,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
8: [2022-11-27 01:50:20,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 22: [2022-11-27 01:50:20,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 8: [2022-11-27 01:50:20,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 22: [2022-11-27 01:50:20,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 18: [2022-11-27 01:50:20,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 01:50:20,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 01:50:20,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 16: [2022-11-27 01:50:20,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:50:20,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 01:50:20,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-27 01:50:20,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 
16: [2022-11-27 01:50:20,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 01:50:20,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 9: [2022-11-27 01:50:20,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:50:20,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 01:50:20,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 31: [2022-11-27 01:50:20,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 30: [2022-11-27 01:50:20,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 31: [2022-11-27 01:50:20,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 30: [2022-11-27 01:50:20,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 31: [2022-11-27 01:50:20,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 30: [2022-11-27 01:50:20,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 12: [2022-11-27 01:50:20,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
6: [2022-11-27 01:50:20,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:50:20,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 6: [2022-11-27 01:50:20,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 12: [2022-11-27 01:50:20,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 6: [2022-11-27 01:50:20,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:50:20,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 6: [2022-11-27 01:50:20,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 01:50:20,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 26: [2022-11-27 01:50:20,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 01:50:20,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 01:50:20,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 28: [2022-11-27 01:50:20,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-27 01:50:20,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 23: [2022-11-27 01:50:20,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 28: [2022-11-27 01:50:20,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 23: [2022-11-27 01:50:20,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 01:50:20,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 23: [2022-11-27 01:50:20,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-27 01:50:20,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 2: [2022-11-27 01:50:20,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 25: [2022-11-27 01:50:20,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 23: [2022-11-27 01:50:20,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 21: [2022-11-27 01:50:20,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
2: [2022-11-27 01:50:20,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 25: [2022-11-27 01:50:20,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 21: [2022-11-27 01:50:20,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 2: [2022-11-27 01:50:20,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 25: [2022-11-27 01:50:20,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 21: [2022-11-27 01:50:20,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 1: [2022-11-27 01:50:20,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:50:20,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 01:50:20,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 14: [2022-11-27 01:50:20,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:50:20,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 01:50:20,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 
19: [2022-11-27 01:50:20,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:50:20,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 24: [2022-11-27 01:50:20,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 28: [2022-11-27 01:50:20,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 19: [2022-11-27 01:50:20,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 25: [2022-11-27 01:50:20,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 24: [2022-11-27 01:50:20,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-27 01:50:20,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 25: [2022-11-27 01:50:20,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 01:50:20,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 27: [2022-11-27 01:50:20,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
27: [2022-11-27 01:50:20,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 01:50:20,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 10: [2022-11-27 01:50:20,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:50:20,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:50:20,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 01:50:20,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 14: [2022-11-27 01:50:20,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 01:50:20,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 21: [2022-11-27 01:50:20,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:50:20,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 01:50:20,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 27: [2022-11-27 01:50:20,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 
27: [2022-11-27 01:50:20,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 01:50:20,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 20: [2022-11-27 01:50:20,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-27 01:50:20,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 01:50:20,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 0: [2022-11-27 01:50:20,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:50:20,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 01:50:20,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 0: [2022-11-27 01:50:20,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:50:20,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 01:50:20,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 17: [2022-11-27 01:50:20,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-27 01:50:20,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-27 01:50:20,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 01:50:20,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 01:50:20,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 17: [2022-11-27 01:50:20,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 3: [2022-11-27 01:50:20,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:50:20,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 01:50:20,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 18: [2022-11-27 01:50:20,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-27 01:50:20,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 01:50:20,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 8: [2022-11-27 01:50:20,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-27 01:50:20,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 01:50:20,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 11: [2022-11-27 01:50:20,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:50:20,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 01:50:20,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 30: [2022-11-27 01:50:20,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-27 01:50:20,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 01:50:20,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 29: [2022-11-27 01:50:20,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:50:20,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 01:50:20,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 
28: [2022-11-27 01:50:20,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 01:50:20,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 29: [2022-11-27 01:50:20,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-27 01:50:20,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 01:50:20,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 18: [2022-11-27 01:50:20,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-27 01:50:20,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 01:50:20,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 14: [2022-11-27 01:50:20,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:50:20,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 01:50:20,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 21: [2022-11-27 01:50:20,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
21: [2022-11-27 01:50:20,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-27 01:50:20,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 01:50:20,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 01:50:20,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 21: [2022-11-27 01:50:20,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 7: [2022-11-27 01:50:20,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:50:20,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:50:20,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 01:50:20,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 01:50:20,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 7: [2022-11-27 01:50:20,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 12: [2022-11-27 01:50:20,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-27 01:50:20,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 01:50:20,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 10: [2022-11-27 01:50:20,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:50:20,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step141000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 01:50:20,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step141000 is ready now! 0: successfully saved checkpoint at iteration 141000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2801.93 31: iteration 141010/ 173500 | consumed samples: 36098560 | consumed tokens: 73929850880 | elapsed time per iteration (s): 1.05 | learning rate: 3.543E-05 | global batch size: 256 | lm loss: 1.927612E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.808 | TFLOPs: 14.75 | 31: iteration 141020/ 173500 | consumed samples: 36101120 | consumed tokens: 73935093760 | elapsed time per iteration (s): 0.73 | learning rate: 3.542E-05 | global batch size: 256 | lm loss: 1.924208E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.217 | TFLOPs: 21.13 | 31: iteration 141030/ 173500 | consumed samples: 36103680 | consumed tokens: 73940336640 | elapsed time per iteration (s): 0.77 | learning rate: 3.541E-05 | global batch size: 256 | lm loss: 1.906262E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.279 | TFLOPs: 20.10 | 31: iteration 
141040/ 173500 | consumed samples: 36106240 | consumed tokens: 73945579520 | elapsed time per iteration (s): 0.75 | learning rate: 3.540E-05 | global batch size: 256 | lm loss: 1.875556E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.509 | TFLOPs: 20.54 | 31: iteration 141050/ 173500 | consumed samples: 36108800 | consumed tokens: 73950822400 | elapsed time per iteration (s): 0.75 | learning rate: 3.539E-05 | global batch size: 256 | lm loss: 1.962271E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.847 | TFLOPs: 20.68 | 31: iteration 141060/ 173500 | consumed samples: 36111360 | consumed tokens: 73956065280 | elapsed time per iteration (s): 0.78 | learning rate: 3.538E-05 | global batch size: 256 | lm loss: 1.941578E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.463 | TFLOPs: 19.75 | 31: iteration 141070/ 173500 | consumed samples: 36113920 | consumed tokens: 73961308160 | elapsed time per iteration (s): 0.73 | learning rate: 3.537E-05 | global batch size: 256 | lm loss: 1.922670E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.606 | TFLOPs: 21.15 | 31: iteration 141080/ 173500 | consumed samples: 36116480 | consumed tokens: 73966551040 | elapsed time per iteration (s): 0.76 | learning rate: 3.536E-05 | global batch size: 256 | lm loss: 1.928770E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.059 | TFLOPs: 20.45 | 31: iteration 141090/ 173500 | consumed samples: 36119040 | consumed tokens: 73971793920 | elapsed time per iteration (s): 0.72 | learning rate: 3.536E-05 | global batch size: 256 | lm loss: 1.958176E+00 | grad norm: 0.180 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.608 | TFLOPs: 21.45 | 31: iteration 141100/ 173500 | consumed samples: 36121600 | consumed tokens: 73977036800 | elapsed time per iteration (s): 0.77 | learning rate: 3.535E-05 | global batch size: 256 | lm loss: 1.931717E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.545 | TFLOPs: 20.12 | 31: iteration 141110/ 173500 | consumed samples: 36124160 | consumed tokens: 73982279680 | elapsed time per iteration (s): 0.78 | learning rate: 3.534E-05 | global batch size: 256 | lm loss: 1.933956E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.467 | TFLOPs: 19.87 | 31: iteration 141120/ 173500 | consumed samples: 36126720 | consumed tokens: 73987522560 | elapsed time per iteration (s): 0.78 | learning rate: 3.533E-05 | global batch size: 256 | lm loss: 1.938643E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.540 | TFLOPs: 19.82 | 31: iteration 141130/ 173500 | consumed samples: 36129280 | consumed tokens: 73992765440 | elapsed time per iteration (s): 0.86 | learning rate: 3.532E-05 | global batch size: 256 | lm loss: 1.932671E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.279 | TFLOPs: 17.98 | 31: iteration 141140/ 173500 | consumed samples: 36131840 | consumed tokens: 73998008320 | elapsed time per iteration (s): 0.77 | learning rate: 3.531E-05 | global batch size: 256 | lm loss: 1.881492E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.874 | TFLOPs: 20.08 | 31: iteration 141150/ 173500 | consumed samples: 36134400 | consumed tokens: 74003251200 | elapsed time per iteration (s): 0.95 | learning 
rate: 3.530E-05 | global batch size: 256 | lm loss: 1.943189E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 270.371 | TFLOPs: 16.36 | 31: iteration 141160/ 173500 | consumed samples: 36136960 | consumed tokens: 74008494080 | elapsed time per iteration (s): 0.77 | learning rate: 3.529E-05 | global batch size: 256 | lm loss: 1.908592E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.540 | TFLOPs: 20.06 | 31: iteration 141170/ 173500 | consumed samples: 36139520 | consumed tokens: 74013736960 | elapsed time per iteration (s): 0.78 | learning rate: 3.528E-05 | global batch size: 256 | lm loss: 1.924696E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.843 | TFLOPs: 19.89 | 31: iteration 141180/ 173500 | consumed samples: 36142080 | consumed tokens: 74018979840 | elapsed time per iteration (s): 0.81 | learning rate: 3.527E-05 | global batch size: 256 | lm loss: 1.928269E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.827 | TFLOPs: 19.17 | 31: iteration 141190/ 173500 | consumed samples: 36144640 | consumed tokens: 74024222720 | elapsed time per iteration (s): 0.78 | learning rate: 3.526E-05 | global batch size: 256 | lm loss: 1.927407E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.402 | TFLOPs: 19.93 | 31: iteration 141200/ 173500 | consumed samples: 36147200 | consumed tokens: 74029465600 | elapsed time per iteration (s): 0.87 | learning rate: 3.525E-05 | global batch size: 256 | lm loss: 1.940980E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.486 | TFLOPs: 17.88 | 31: iteration 141210/ 
173500 | consumed samples: 36149760 | consumed tokens: 74034708480 | elapsed time per iteration (s): 0.76 | learning rate: 3.525E-05 | global batch size: 256 | lm loss: 1.923897E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.922 | TFLOPs: 20.50 | 31: iteration 141220/ 173500 | consumed samples: 36152320 | consumed tokens: 74039951360 | elapsed time per iteration (s): 0.74 | learning rate: 3.524E-05 | global batch size: 256 | lm loss: 1.926372E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.415 | TFLOPs: 20.96 | 31: iteration 141230/ 173500 | consumed samples: 36154880 | consumed tokens: 74045194240 | elapsed time per iteration (s): 0.77 | learning rate: 3.523E-05 | global batch size: 256 | lm loss: 1.934443E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.535 | TFLOPs: 20.00 | 31: iteration 141240/ 173500 | consumed samples: 36157440 | consumed tokens: 74050437120 | elapsed time per iteration (s): 0.73 | learning rate: 3.522E-05 | global batch size: 256 | lm loss: 1.927037E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.165 | TFLOPs: 21.18 | 31: iteration 141250/ 173500 | consumed samples: 36160000 | consumed tokens: 74055680000 | elapsed time per iteration (s): 0.82 | learning rate: 3.521E-05 | global batch size: 256 | lm loss: 1.942132E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.453 | TFLOPs: 18.78 | 31: iteration 141260/ 173500 | consumed samples: 36162560 | consumed tokens: 74060922880 | elapsed time per iteration (s): 0.78 | learning rate: 3.520E-05 | global batch size: 256 | lm loss: 1.932296E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 327.209 | TFLOPs: 19.80 | 31: iteration 141270/ 173500 | consumed samples: 36165120 | consumed tokens: 74066165760 | elapsed time per iteration (s): 0.85 | learning rate: 3.519E-05 | global batch size: 256 | lm loss: 1.941555E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.791 | TFLOPs: 18.14 | 31: iteration 141280/ 173500 | consumed samples: 36167680 | consumed tokens: 74071408640 | elapsed time per iteration (s): 0.83 | learning rate: 3.518E-05 | global batch size: 256 | lm loss: 1.923761E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.756 | TFLOPs: 18.68 | 31: iteration 141290/ 173500 | consumed samples: 36170240 | consumed tokens: 74076651520 | elapsed time per iteration (s): 0.81 | learning rate: 3.517E-05 | global batch size: 256 | lm loss: 1.940058E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.653 | TFLOPs: 19.22 | 31: iteration 141300/ 173500 | consumed samples: 36172800 | consumed tokens: 74081894400 | elapsed time per iteration (s): 0.83 | learning rate: 3.516E-05 | global batch size: 256 | lm loss: 1.944822E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.135 | TFLOPs: 18.64 | 31: iteration 141310/ 173500 | consumed samples: 36175360 | consumed tokens: 74087137280 | elapsed time per iteration (s): 0.82 | learning rate: 3.515E-05 | global batch size: 256 | lm loss: 1.915428E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.576 | TFLOPs: 18.85 | 31: iteration 141320/ 173500 | consumed samples: 36177920 | consumed tokens: 74092380160 | elapsed time per iteration (s): 0.82 | learning rate: 
3.514E-05 | global batch size: 256 | lm loss: 1.940873E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.452 | TFLOPs: 18.78 | 31: iteration 141330/ 173500 | consumed samples: 36180480 | consumed tokens: 74097623040 | elapsed time per iteration (s): 0.83 | learning rate: 3.514E-05 | global batch size: 256 | lm loss: 1.944484E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.263 | TFLOPs: 18.65 | 31: iteration 141340/ 173500 | consumed samples: 36183040 | consumed tokens: 74102865920 | elapsed time per iteration (s): 0.80 | learning rate: 3.513E-05 | global batch size: 256 | lm loss: 1.926690E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.702 | TFLOPs: 19.34 | 31: iteration 141350/ 173500 | consumed samples: 36185600 | consumed tokens: 74108108800 | elapsed time per iteration (s): 0.80 | learning rate: 3.512E-05 | global batch size: 256 | lm loss: 1.912078E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.032 | TFLOPs: 19.42 | 31: iteration 141360/ 173500 | consumed samples: 36188160 | consumed tokens: 74113351680 | elapsed time per iteration (s): 0.80 | learning rate: 3.511E-05 | global batch size: 256 | lm loss: 1.909486E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.909 | TFLOPs: 19.41 | 31: iteration 141370/ 173500 | consumed samples: 36190720 | consumed tokens: 74118594560 | elapsed time per iteration (s): 0.72 | learning rate: 3.510E-05 | global batch size: 256 | lm loss: 1.916634E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.897 | TFLOPs: 21.65 | 31: iteration 141380/ 173500 | 
consumed samples: 36193280 | consumed tokens: 74123837440 | elapsed time per iteration (s): 0.76 | learning rate: 3.509E-05 | global batch size: 256 | lm loss: 1.938001E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.650 | TFLOPs: 20.43 | 31: iteration 141390/ 173500 | consumed samples: 36195840 | consumed tokens: 74129080320 | elapsed time per iteration (s): 0.87 | learning rate: 3.508E-05 | global batch size: 256 | lm loss: 1.936703E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.802 | TFLOPs: 17.71 | 31: iteration 141400/ 173500 | consumed samples: 36198400 | consumed tokens: 74134323200 | elapsed time per iteration (s): 0.74 | learning rate: 3.507E-05 | global batch size: 256 | lm loss: 1.906367E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.445 | TFLOPs: 21.02 | 31: iteration 141410/ 173500 | consumed samples: 36200960 | consumed tokens: 74139566080 | elapsed time per iteration (s): 0.77 | learning rate: 3.506E-05 | global batch size: 256 | lm loss: 1.933447E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.232 | TFLOPs: 20.04 | 31: iteration 141420/ 173500 | consumed samples: 36203520 | consumed tokens: 74144808960 | elapsed time per iteration (s): 0.77 | learning rate: 3.505E-05 | global batch size: 256 | lm loss: 1.916199E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.392 | TFLOPs: 19.99 | 31: iteration 141430/ 173500 | consumed samples: 36206080 | consumed tokens: 74150051840 | elapsed time per iteration (s): 0.75 | learning rate: 3.504E-05 | global batch size: 256 | lm loss: 1.887906E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 341.669 | TFLOPs: 20.67 | 31: iteration 141440/ 173500 | consumed samples: 36208640 | consumed tokens: 74155294720 | elapsed time per iteration (s): 0.76 | learning rate: 3.503E-05 | global batch size: 256 | lm loss: 1.904300E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.425 | TFLOPs: 20.47 | 31: iteration 141450/ 173500 | consumed samples: 36211200 | consumed tokens: 74160537600 | elapsed time per iteration (s): 0.80 | learning rate: 3.503E-05 | global batch size: 256 | lm loss: 1.944125E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.524 | TFLOPs: 19.33 | 31: iteration 141460/ 173500 | consumed samples: 36213760 | consumed tokens: 74165780480 | elapsed time per iteration (s): 0.86 | learning rate: 3.502E-05 | global batch size: 256 | lm loss: 1.933746E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.725 | TFLOPs: 17.95 | 31: iteration 141470/ 173500 | consumed samples: 36216320 | consumed tokens: 74171023360 | elapsed time per iteration (s): 0.77 | learning rate: 3.501E-05 | global batch size: 256 | lm loss: 1.913345E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.595 | TFLOPs: 20.12 | 31: iteration 141480/ 173500 | consumed samples: 36218880 | consumed tokens: 74176266240 | elapsed time per iteration (s): 0.80 | learning rate: 3.500E-05 | global batch size: 256 | lm loss: 1.952320E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.109 | TFLOPs: 19.43 | 31: iteration 141490/ 173500 | consumed samples: 36221440 | consumed tokens: 74181509120 | elapsed time per iteration (s): 0.78 | learning rate: 
3.499E-05 | global batch size: 256 | lm loss: 1.919314E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.012 | TFLOPs: 19.78 | 31: iteration 141500/ 173500 | consumed samples: 36224000 | consumed tokens: 74186752000 | elapsed time per iteration (s): 0.76 | learning rate: 3.498E-05 | global batch size: 256 | lm loss: 1.929998E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.387 | TFLOPs: 20.29 | 31: iteration 141510/ 173500 | consumed samples: 36226560 | consumed tokens: 74191994880 | elapsed time per iteration (s): 0.73 | learning rate: 3.497E-05 | global batch size: 256 | lm loss: 1.915852E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.839 | TFLOPs: 21.16 | 31: iteration 141520/ 173500 | consumed samples: 36229120 | consumed tokens: 74197237760 | elapsed time per iteration (s): 0.76 | learning rate: 3.496E-05 | global batch size: 256 | lm loss: 1.928127E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.902 | TFLOPs: 20.50 | 31: iteration 141530/ 173500 | consumed samples: 36231680 | consumed tokens: 74202480640 | elapsed time per iteration (s): 0.81 | learning rate: 3.495E-05 | global batch size: 256 | lm loss: 1.921436E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.512 | TFLOPs: 19.15 | 31: iteration 141540/ 173500 | consumed samples: 36234240 | consumed tokens: 74207723520 | elapsed time per iteration (s): 0.80 | learning rate: 3.494E-05 | global batch size: 256 | lm loss: 1.930630E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.650 | TFLOPs: 19.28 | 31: iteration 141550/ 173500 | 
consumed samples: 36236800 | consumed tokens: 74212966400 | elapsed time per iteration (s): 0.75 | learning rate: 3.493E-05 | global batch size: 256 | lm loss: 1.906444E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.786 | TFLOPs: 20.74 | 31: iteration 141560/ 173500 | consumed samples: 36239360 | consumed tokens: 74218209280 | elapsed time per iteration (s): 0.79 | learning rate: 3.493E-05 | global batch size: 256 | lm loss: 1.895589E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.087 | TFLOPs: 19.49 | 31: iteration 141570/ 173500 | consumed samples: 36241920 | consumed tokens: 74223452160 | elapsed time per iteration (s): 0.79 | learning rate: 3.492E-05 | global batch size: 256 | lm loss: 1.913047E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.977 | TFLOPs: 19.54 | 31: iteration 141580/ 173500 | consumed samples: 36244480 | consumed tokens: 74228695040 | elapsed time per iteration (s): 0.73 | learning rate: 3.491E-05 | global batch size: 256 | lm loss: 1.945111E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.306 | TFLOPs: 21.31 | 31: iteration 141590/ 173500 | consumed samples: 36247040 | consumed tokens: 74233937920 | elapsed time per iteration (s): 0.82 | learning rate: 3.490E-05 | global batch size: 256 | lm loss: 1.927733E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.548 | TFLOPs: 18.85 | 31: iteration 141600/ 173500 | consumed samples: 36249600 | consumed tokens: 74239180800 | elapsed time per iteration (s): 0.84 | learning rate: 3.489E-05 | global batch size: 256 | lm loss: 1.923821E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 304.277 | TFLOPs: 18.41 | 31: iteration 141610/ 173500 | consumed samples: 36252160 | consumed tokens: 74244423680 | elapsed time per iteration (s): 0.82 | learning rate: 3.488E-05 | global batch size: 256 | lm loss: 1.935712E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.790 | TFLOPs: 18.80 | 31: iteration 141620/ 173500 | consumed samples: 36254720 | consumed tokens: 74249666560 | elapsed time per iteration (s): 0.87 | learning rate: 3.487E-05 | global batch size: 256 | lm loss: 1.926350E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.751 | TFLOPs: 17.71 | 31: iteration 141630/ 173500 | consumed samples: 36257280 | consumed tokens: 74254909440 | elapsed time per iteration (s): 0.81 | learning rate: 3.486E-05 | global batch size: 256 | lm loss: 1.933205E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.438 | TFLOPs: 19.14 | 31: iteration 141640/ 173500 | consumed samples: 36259840 | consumed tokens: 74260152320 | elapsed time per iteration (s): 0.84 | learning rate: 3.485E-05 | global batch size: 256 | lm loss: 1.943924E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.432 | TFLOPs: 18.54 | 31: iteration 141650/ 173500 | consumed samples: 36262400 | consumed tokens: 74265395200 | elapsed time per iteration (s): 0.87 | learning rate: 3.484E-05 | global batch size: 256 | lm loss: 1.916963E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.092 | TFLOPs: 17.79 | 31: iteration 141660/ 173500 | consumed samples: 36264960 | consumed tokens: 74270638080 | elapsed time per iteration (s): 0.78 | learning rate: 
3.484E-05 | global batch size: 256 | lm loss: 1.943097E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.132 | TFLOPs: 19.85 | 31: iteration 141670/ 173500 | consumed samples: 36267520 | consumed tokens: 74275880960 | elapsed time per iteration (s): 0.83 | learning rate: 3.483E-05 | global batch size: 256 | lm loss: 1.912082E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.252 | TFLOPs: 18.59 | 31: iteration 141680/ 173500 | consumed samples: 36270080 | consumed tokens: 74281123840 | elapsed time per iteration (s): 0.82 | learning rate: 3.482E-05 | global batch size: 256 | lm loss: 1.940866E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.261 | TFLOPs: 18.95 | 31: iteration 141690/ 173500 | consumed samples: 36272640 | consumed tokens: 74286366720 | elapsed time per iteration (s): 0.80 | learning rate: 3.481E-05 | global batch size: 256 | lm loss: 1.901462E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.532 | TFLOPs: 19.45 | 31: iteration 141700/ 173500 | consumed samples: 36275200 | consumed tokens: 74291609600 | elapsed time per iteration (s): 0.83 | learning rate: 3.480E-05 | global batch size: 256 | lm loss: 1.917596E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.887 | TFLOPs: 18.69 | 31: iteration 141710/ 173500 | consumed samples: 36277760 | consumed tokens: 74296852480 | elapsed time per iteration (s): 0.81 | learning rate: 3.479E-05 | global batch size: 256 | lm loss: 1.916348E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.612 | TFLOPs: 19.03 | 31: iteration 141720/ 173500 | 
consumed samples: 36280320 | consumed tokens: 74302095360 | elapsed time per iteration (s): 0.72 | learning rate: 3.478E-05 | global batch size: 256 | lm loss: 1.915748E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.606 | TFLOPs: 21.45 | 31: iteration 141730/ 173500 | consumed samples: 36282880 | consumed tokens: 74307338240 | elapsed time per iteration (s): 0.77 | learning rate: 3.477E-05 | global batch size: 256 | lm loss: 1.905297E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.631 | TFLOPs: 20.24 | 31: iteration 141740/ 173500 | consumed samples: 36285440 | consumed tokens: 74312581120 | elapsed time per iteration (s): 0.77 | learning rate: 3.476E-05 | global batch size: 256 | lm loss: 1.898216E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.222 | TFLOPs: 20.16 | 31: iteration 141750/ 173500 | consumed samples: 36288000 | consumed tokens: 74317824000 | elapsed time per iteration (s): 0.79 | learning rate: 3.475E-05 | global batch size: 256 | lm loss: 1.911145E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.586 | TFLOPs: 19.58 | 31: iteration 141760/ 173500 | consumed samples: 36290560 | consumed tokens: 74323066880 | elapsed time per iteration (s): 0.83 | learning rate: 3.474E-05 | global batch size: 256 | lm loss: 1.951555E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.664 | TFLOPs: 18.61 | 31: iteration 141770/ 173500 | consumed samples: 36293120 | consumed tokens: 74328309760 | elapsed time per iteration (s): 0.77 | learning rate: 3.474E-05 | global batch size: 256 | lm loss: 1.931551E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 334.322 | TFLOPs: 20.23 | 31: iteration 141780/ 173500 | consumed samples: 36295680 | consumed tokens: 74333552640 | elapsed time per iteration (s): 0.78 | learning rate: 3.473E-05 | global batch size: 256 | lm loss: 1.937240E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.222 | TFLOPs: 19.86 | 31: iteration 141790/ 173500 | consumed samples: 36298240 | consumed tokens: 74338795520 | elapsed time per iteration (s): 0.74 | learning rate: 3.472E-05 | global batch size: 256 | lm loss: 1.944865E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.793 | TFLOPs: 21.04 | 31: iteration 141800/ 173500 | consumed samples: 36300800 | consumed tokens: 74344038400 | elapsed time per iteration (s): 0.77 | learning rate: 3.471E-05 | global batch size: 256 | lm loss: 1.889641E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.551 | TFLOPs: 20.24 | 31: iteration 141810/ 173500 | consumed samples: 36303360 | consumed tokens: 74349281280 | elapsed time per iteration (s): 0.76 | learning rate: 3.470E-05 | global batch size: 256 | lm loss: 1.931069E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.413 | TFLOPs: 20.35 | 31: iteration 141820/ 173500 | consumed samples: 36305920 | consumed tokens: 74354524160 | elapsed time per iteration (s): 0.84 | learning rate: 3.469E-05 | global batch size: 256 | lm loss: 1.930131E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.517 | TFLOPs: 18.54 | 31: iteration 141830/ 173500 | consumed samples: 36308480 | consumed tokens: 74359767040 | elapsed time per iteration (s): 0.79 | learning rate: 
3.468E-05 | global batch size: 256 | lm loss: 1.922188E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.301 | TFLOPs: 19.50 | 31: iteration 141840/ 173500 | consumed samples: 36311040 | consumed tokens: 74365009920 | elapsed time per iteration (s): 0.73 | learning rate: 3.467E-05 | global batch size: 256 | lm loss: 1.931778E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.757 | TFLOPs: 21.28 | 31: iteration 141850/ 173500 | consumed samples: 36313600 | consumed tokens: 74370252800 | elapsed time per iteration (s): 0.78 | learning rate: 3.466E-05 | global batch size: 256 | lm loss: 1.953486E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.870 | TFLOPs: 19.84 | 31: iteration 141860/ 173500 | consumed samples: 36316160 | consumed tokens: 74375495680 | elapsed time per iteration (s): 0.78 | learning rate: 3.465E-05 | global batch size: 256 | lm loss: 1.951467E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.512 | TFLOPs: 19.87 | 31: iteration 141870/ 173500 | consumed samples: 36318720 | consumed tokens: 74380738560 | elapsed time per iteration (s): 0.75 | learning rate: 3.465E-05 | global batch size: 256 | lm loss: 1.959135E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.270 | TFLOPs: 20.77 | 31: iteration 141880/ 173500 | consumed samples: 36321280 | consumed tokens: 74385981440 | elapsed time per iteration (s): 0.77 | learning rate: 3.464E-05 | global batch size: 256 | lm loss: 1.936241E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.456 | TFLOPs: 20.05 | 31: iteration 141890/ 173500 | 
consumed samples: 36323840 | consumed tokens: 74391224320 | elapsed time per iteration (s): 0.75 | learning rate: 3.463E-05 | global batch size: 256 | lm loss: 1.919075E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.094 | TFLOPs: 20.57 | 31: iteration 141900/ 173500 | consumed samples: 36326400 | consumed tokens: 74396467200 | elapsed time per iteration (s): 0.76 | learning rate: 3.462E-05 | global batch size: 256 | lm loss: 1.928784E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.174 | TFLOPs: 20.28 | 31: iteration 141910/ 173500 | consumed samples: 36328960 | consumed tokens: 74401710080 | elapsed time per iteration (s): 0.77 | learning rate: 3.461E-05 | global batch size: 256 | lm loss: 1.934529E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.598 | TFLOPs: 20.12 | 31: iteration 141920/ 173500 | consumed samples: 36331520 | consumed tokens: 74406952960 | elapsed time per iteration (s): 0.83 | learning rate: 3.460E-05 | global batch size: 256 | lm loss: 1.932396E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.772 | TFLOPs: 18.62 | 31: iteration 141930/ 173500 | consumed samples: 36334080 | consumed tokens: 74412195840 | elapsed time per iteration (s): 0.82 | learning rate: 3.459E-05 | global batch size: 256 | lm loss: 1.914245E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.682 | TFLOPs: 18.92 | 31: iteration 141940/ 173500 | consumed samples: 36336640 | consumed tokens: 74417438720 | elapsed time per iteration (s): 0.79 | learning rate: 3.458E-05 | global batch size: 256 | lm loss: 1.912665E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 323.542 | TFLOPs: 19.57 | 31: iteration 141950/ 173500 | consumed samples: 36339200 | consumed tokens: 74422681600 | elapsed time per iteration (s): 0.83 | learning rate: 3.457E-05 | global batch size: 256 | lm loss: 1.958800E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.837 | TFLOPs: 18.68 | 31: iteration 141960/ 173500 | consumed samples: 36341760 | consumed tokens: 74427924480 | elapsed time per iteration (s): 0.81 | learning rate: 3.456E-05 | global batch size: 256 | lm loss: 1.920625E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.661 | TFLOPs: 19.04 | 31: iteration 141970/ 173500 | consumed samples: 36344320 | consumed tokens: 74433167360 | elapsed time per iteration (s): 0.83 | learning rate: 3.456E-05 | global batch size: 256 | lm loss: 1.930147E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.719 | TFLOPs: 18.68 | 31: iteration 141980/ 173500 | consumed samples: 36346880 | consumed tokens: 74438410240 | elapsed time per iteration (s): 0.77 | learning rate: 3.455E-05 | global batch size: 256 | lm loss: 1.922343E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.910 | TFLOPs: 20.20 | 31: iteration 141990/ 173500 | consumed samples: 36349440 | consumed tokens: 74443653120 | elapsed time per iteration (s): 0.82 | learning rate: 3.454E-05 | global batch size: 256 | lm loss: 1.926790E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.078 | TFLOPs: 18.82 | 0: [2022-11-27 02:03:30,517] [INFO] [logging.py:68:log_dist] [Rank 0] step=142000, skipped=0, lr=[3.452880099827123e-05, 3.452880099827123e-05, 
3.452880099827123e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 142000/ 173500 | consumed samples: 36352000 | consumed tokens: 74448896000 | elapsed time per iteration (s): 0.81 | learning rate: 3.453E-05 | global batch size: 256 | lm loss: 1.937691E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.824 | TFLOPs: 19.17 | 0: steps: 142000 loss: 1.9330 iter time (s): 0.792 samples/sec: 323.303 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 142000 | lm loss value: 1.900146E+00 | lm loss PPL: 6.686868E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 142000 to checkpoints_1b1long 0: [2022-11-27 02:03:30,788] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step142000 is begin to save! 0: [2022-11-27 02:03:30,799] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_01-model_00-model_states.pt... 0: [2022-11-27 02:03:31,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_01-model_00-model_states.pt. 0: [2022-11-27 02:03:31,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_03-model_00-model_states.pt... 0: [2022-11-27 02:03:31,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_03-model_00-model_states.pt. 0: [2022-11-27 02:03:31,100] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_04-model_00-model_states.pt... 0: [2022-11-27 02:03:31,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_04-model_00-model_states.pt. 
0: [2022-11-27 02:03:31,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_05-model_00-model_states.pt... 0: [2022-11-27 02:03:31,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_05-model_00-model_states.pt. 0: [2022-11-27 02:03:31,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_06-model_00-model_states.pt... 0: [2022-11-27 02:03:31,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_06-model_00-model_states.pt. 0: [2022-11-27 02:03:31,331] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_07-model_00-model_states.pt... 0: [2022-11-27 02:03:31,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_07-model_00-model_states.pt. 0: [2022-11-27 02:03:31,406] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_08-model_00-model_states.pt... 0: [2022-11-27 02:03:31,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_08-model_00-model_states.pt. 0: [2022-11-27 02:03:31,481] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_09-model_00-model_states.pt... 0: [2022-11-27 02:03:31,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_09-model_00-model_states.pt. 0: [2022-11-27 02:03:31,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_10-model_00-model_states.pt... 0: [2022-11-27 02:03:31,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_10-model_00-model_states.pt. 
0: [2022-11-27 02:03:31,637] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_11-model_00-model_states.pt... 0: [2022-11-27 02:03:31,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_11-model_00-model_states.pt. 0: [2022-11-27 02:03:31,712] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_12-model_00-model_states.pt... 0: [2022-11-27 02:03:31,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_12-model_00-model_states.pt. 0: [2022-11-27 02:03:31,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_13-model_00-model_states.pt... 0: [2022-11-27 02:03:31,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_13-model_00-model_states.pt. 0: [2022-11-27 02:03:31,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_14-model_00-model_states.pt... 0: [2022-11-27 02:03:31,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_14-model_00-model_states.pt. 0: [2022-11-27 02:03:31,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_15-model_00-model_states.pt... 0: [2022-11-27 02:03:32,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_15-model_00-model_states.pt. 0: [2022-11-27 02:03:32,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_16-model_00-model_states.pt... 0: [2022-11-27 02:03:32,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_16-model_00-model_states.pt. 
0: [2022-11-27 02:03:32,090] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_17-model_00-model_states.pt... 0: [2022-11-27 02:03:32,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_17-model_00-model_states.pt. 0: [2022-11-27 02:03:32,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_18-model_00-model_states.pt... 0: [2022-11-27 02:03:32,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_18-model_00-model_states.pt. 0: [2022-11-27 02:03:32,249] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_19-model_00-model_states.pt... 0: [2022-11-27 02:03:32,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_19-model_00-model_states.pt. 0: [2022-11-27 02:03:32,326] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_20-model_00-model_states.pt... 0: [2022-11-27 02:03:32,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_20-model_00-model_states.pt. 0: [2022-11-27 02:03:32,400] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_21-model_00-model_states.pt... 0: [2022-11-27 02:03:32,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_21-model_00-model_states.pt. 0: [2022-11-27 02:03:32,477] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_22-model_00-model_states.pt... 0: [2022-11-27 02:03:32,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_22-model_00-model_states.pt. 
0: [2022-11-27 02:03:32,552] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_23-model_00-model_states.pt... 0: [2022-11-27 02:03:32,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_23-model_00-model_states.pt. 0: [2022-11-27 02:03:32,624] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_24-model_00-model_states.pt... 0: [2022-11-27 02:03:32,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_24-model_00-model_states.pt. 0: [2022-11-27 02:03:32,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_25-model_00-model_states.pt... 0: [2022-11-27 02:03:32,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_25-model_00-model_states.pt. 0: [2022-11-27 02:03:32,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_26-model_00-model_states.pt... 0: [2022-11-27 02:03:32,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_26-model_00-model_states.pt. 0: [2022-11-27 02:03:32,850] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_27-model_00-model_states.pt... 0: [2022-11-27 02:03:32,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_27-model_00-model_states.pt. 0: [2022-11-27 02:03:32,925] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_28-model_00-model_states.pt... 0: [2022-11-27 02:03:32,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_28-model_00-model_states.pt. 
0: [2022-11-27 02:03:32,999] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/layer_30-model_00-model_states.pt... 0: [2022-11-27 02:03:33,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/layer_30-model_00-model_states.pt. 0: [2022-11-27 02:03:33,004] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step142000/mp_rank_00_model_states.pt 0: [2022-11-27 02:03:33,004] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/mp_rank_00_model_states.pt... 0: [2022-11-27 02:03:33,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/mp_rank_00_model_states.pt. 0: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
5: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
8: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 
2: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
12: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 
23: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
31: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 
17: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 
19: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
6: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 
9: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
2: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
20: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 
24: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 
30: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 
26: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
5: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 
3: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 
14: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 
26: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
1: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 
28: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
0: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 
26: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:03:33,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:03:33,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:03:33,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 02:03:33,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 
23: [2022-11-27 02:03:33,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:03:33,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 02:03:33,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 0: [2022-11-27 02:03:33,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:03:33,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 02:03:33,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 4: [2022-11-27 02:03:33,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:03:33,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 02:03:33,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 1: [2022-11-27 02:03:33,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:03:33,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-27 02:03:33,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 02:03:33,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 1: [2022-11-27 02:03:33,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 02:03:33,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 9: [2022-11-27 02:03:33,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:03:33,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 02:03:33,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 7: [2022-11-27 02:03:33,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:03:33,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 02:03:33,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 29: [2022-11-27 02:03:33,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-27 02:03:33,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 02:03:33,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 30: [2022-11-27 02:03:33,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:03:33,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 02:03:33,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 6: [2022-11-27 02:03:33,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:03:33,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 02:03:33,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 10: [2022-11-27 02:03:33,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:03:33,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 
10: [2022-11-27 02:03:33,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 20: [2022-11-27 02:03:33,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 10: [2022-11-27 02:03:33,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 20: [2022-11-27 02:03:33,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 3: [2022-11-27 02:03:33,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:03:33,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 02:03:33,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 14: [2022-11-27 02:03:33,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:03:33,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 02:03:33,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 31: [2022-11-27 02:03:33,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 
31: [2022-11-27 02:03:33,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 02:03:33,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 12: [2022-11-27 02:03:33,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:03:33,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 02:03:33,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 29: [2022-11-27 02:03:33,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:03:33,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:03:33,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 2: [2022-11-27 02:03:33,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 29: [2022-11-27 02:03:33,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 2: [2022-11-27 02:03:33,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 11: [2022-11-27 02:03:33,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-27 02:03:33,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 16: [2022-11-27 02:03:33,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:03:33,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 11: [2022-11-27 02:03:33,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 16: [2022-11-27 02:03:33,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 9: [2022-11-27 02:03:33,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:03:33,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 11: [2022-11-27 02:03:33,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:03:33,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 23: [2022-11-27 02:03:33,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:03:33,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 02:03:33,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 
23: [2022-11-27 02:03:33,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 02:03:33,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 8: [2022-11-27 02:03:33,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:03:33,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 02:03:33,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 22: [2022-11-27 02:03:33,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:03:33,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:03:33,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 02:03:33,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 02:03:33,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 22: [2022-11-27 02:03:33,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 28: [2022-11-27 02:03:33,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
30: [2022-11-27 02:03:33,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:03:33,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 02:03:33,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 18: [2022-11-27 02:03:33,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:03:33,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 02:03:33,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 25: [2022-11-27 02:03:33,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:03:33,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 14: [2022-11-27 02:03:33,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:03:33,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 25: [2022-11-27 02:03:33,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 14: [2022-11-27 02:03:33,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 
3: [2022-11-27 02:03:33,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:03:33,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 02:03:33,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 28: [2022-11-27 02:03:33,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 02:03:33,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 25: [2022-11-27 02:03:33,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:03:33,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 02:03:33,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 7: [2022-11-27 02:03:33,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:03:33,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 02:03:33,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 20: [2022-11-27 02:03:33,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
20: [2022-11-27 02:03:33,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 02:03:33,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 10: [2022-11-27 02:03:33,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:03:33,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 02:03:33,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 12: [2022-11-27 02:03:33,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:03:33,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 02:03:33,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 24: [2022-11-27 02:03:33,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:03:33,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 02:03:33,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 15: [2022-11-27 02:03:33,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-27 02:03:33,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 02:03:33,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 6: [2022-11-27 02:03:33,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:03:33,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:03:33,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 1: [2022-11-27 02:03:33,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 11: [2022-11-27 02:03:33,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:03:33,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 1: [2022-11-27 02:03:33,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 15: [2022-11-27 02:03:33,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
11: [2022-11-27 02:03:33,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 15: [2022-11-27 02:03:33,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 11: [2022-11-27 02:03:33,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 15: [2022-11-27 02:03:33,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 21: [2022-11-27 02:03:33,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:03:33,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:03:33,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 02:03:33,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 02:03:33,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 21: [2022-11-27 02:03:33,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 8: [2022-11-27 02:03:33,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:03:33,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
8: [2022-11-27 02:03:33,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 4: [2022-11-27 02:03:33,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 8: [2022-11-27 02:03:33,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 18: [2022-11-27 02:03:33,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:03:33,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 18: [2022-11-27 02:03:33,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 02:03:33,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 21: [2022-11-27 02:03:33,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:03:33,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 02:03:33,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 4: [2022-11-27 02:03:33,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-27 02:03:33,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 28: [2022-11-27 02:03:33,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:03:33,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 02:03:33,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 4: [2022-11-27 02:03:33,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 13: [2022-11-27 02:03:33,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:03:33,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 02:03:33,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 24: [2022-11-27 02:03:33,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:03:33,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 02:03:33,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 30: [2022-11-27 02:03:33,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
30: [2022-11-27 02:03:33,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 02:03:33,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 9: [2022-11-27 02:03:33,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:03:33,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 02:03:33,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 23: [2022-11-27 02:03:33,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:03:33,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 02:03:33,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 1: [2022-11-27 02:03:33,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:03:33,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 02:03:33,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 7: [2022-11-27 02:03:33,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-27 02:03:33,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 02:03:33,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 31: [2022-11-27 02:03:33,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:03:33,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:03:33,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 24: [2022-11-27 02:03:33,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 02:03:33,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 31: [2022-11-27 02:03:33,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 5: [2022-11-27 02:03:33,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:03:33,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 02:03:33,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 5: [2022-11-27 02:03:33,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-27 02:03:33,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 02:03:33,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 19: [2022-11-27 02:03:33,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:03:33,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 02:03:33,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:03:33,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 19: [2022-11-27 02:03:33,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 02:03:33,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 0: [2022-11-27 02:03:33,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:03:33,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-27 02:03:33,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 02:03:33,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 02:03:33,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 0: [2022-11-27 02:03:33,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 2: [2022-11-27 02:03:33,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:03:33,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 02:03:33,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 14: [2022-11-27 02:03:33,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:03:33,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:03:33,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 3: [2022-11-27 02:03:33,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 14: [2022-11-27 02:03:33,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 
6: [2022-11-27 02:03:33,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:03:33,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 6: [2022-11-27 02:03:33,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 02:03:33,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 16: [2022-11-27 02:03:33,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:03:33,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:03:33,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 02:03:33,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 16: [2022-11-27 02:03:33,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 02:03:33,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 28: [2022-11-27 02:03:33,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 
28: [2022-11-27 02:03:33,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 02:03:33,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 26: [2022-11-27 02:03:33,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:03:33,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 02:03:33,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 27: [2022-11-27 02:03:33,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:03:33,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:03:33,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 02:03:33,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 02:03:33,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 20: [2022-11-27 02:03:33,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 
26: [2022-11-27 02:03:33,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:03:33,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 20: [2022-11-27 02:03:33,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 26: [2022-11-27 02:03:33,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 20: [2022-11-27 02:03:33,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 26: [2022-11-27 02:03:33,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 18: [2022-11-27 02:03:33,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:03:33,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 02:03:33,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 8: [2022-11-27 02:03:33,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:03:33,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 02:03:33,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 
22: [2022-11-27 02:03:33,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:03:33,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:03:33,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 02:03:33,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 10: [2022-11-27 02:03:33,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 02:03:33,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 31: [2022-11-27 02:03:33,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:03:33,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 02:03:33,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 29: [2022-11-27 02:03:33,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:03:33,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 02:03:33,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 
13: [2022-11-27 02:03:33,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:03:33,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 02:03:33,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 2: [2022-11-27 02:03:33,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:03:33,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:03:33,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 02:03:33,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 25: [2022-11-27 02:03:33,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 02:03:33,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 15: [2022-11-27 02:03:33,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:03:33,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 02:03:33,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 
9: [2022-11-27 02:03:33,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:03:33,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 02:03:33,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 4: [2022-11-27 02:03:33,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:03:33,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 02:03:33,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 12: [2022-11-27 02:03:33,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:03:33,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 02:03:33,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 8: [2022-11-27 02:03:33,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:03:33,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 02:03:33,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 
1: [2022-11-27 02:03:33,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:03:33,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 02:03:33,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 27: [2022-11-27 02:03:33,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:03:33,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:03:33,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 0: [2022-11-27 02:03:33,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 27: [2022-11-27 02:03:33,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 0: [2022-11-27 02:03:33,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 5: [2022-11-27 02:03:33,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:03:33,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 02:03:33,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 
20: [2022-11-27 02:03:33,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:03:33,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 02:03:33,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 25: [2022-11-27 02:03:33,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:03:33,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 02:03:33,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 17: [2022-11-27 02:03:33,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:03:33,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 02:03:33,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 17: [2022-11-27 02:03:33,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:03:33,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-27 02:03:33,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 02:03:33,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 17: [2022-11-27 02:03:33,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 02:03:33,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 17: [2022-11-27 02:03:33,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:03:33,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 02:03:33,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 13: [2022-11-27 02:03:33,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:03:33,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 02:03:33,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 23: [2022-11-27 02:03:33,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
23: [2022-11-27 02:03:33,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 02:03:33,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 22: [2022-11-27 02:03:33,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:03:33,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 02:03:33,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 11: [2022-11-27 02:03:33,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:03:33,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 02:03:33,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 29: [2022-11-27 02:03:33,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:03:33,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 02:03:33,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 21: [2022-11-27 02:03:33,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-27 02:03:33,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 02:03:33,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 3: [2022-11-27 02:03:33,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:03:33,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 02:03:33,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 27: [2022-11-27 02:03:33,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:03:33,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 02:03:33,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 7: [2022-11-27 02:03:33,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:03:33,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 02:03:33,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 30: [2022-11-27 02:03:33,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-27 02:03:33,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 02:03:33,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 16: [2022-11-27 02:03:33,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:03:33,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 02:03:33,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 6: [2022-11-27 02:03:33,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:03:33,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 02:03:33,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 10: [2022-11-27 02:03:33,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:03:33,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 02:03:33,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 18: [2022-11-27 02:03:33,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
18: [2022-11-27 02:03:33,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 02:03:33,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 31: [2022-11-27 02:03:33,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:03:33,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 02:03:33,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 19: [2022-11-27 02:03:33,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:03:33,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-27 02:03:33,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 14: [2022-11-27 02:03:33,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:03:33,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 02:03:33,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 28: [2022-11-27 02:03:33,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
28: [2022-11-27 02:03:33,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 26: [2022-11-27 02:03:33,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:03:33,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 02:03:33,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 28: [2022-11-27 02:03:33,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 2: [2022-11-27 02:03:33,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:03:33,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 02:03:33,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 15: [2022-11-27 02:03:33,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:03:33,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 02:03:33,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 0: [2022-11-27 02:03:33,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-27 02:03:33,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 02:03:33,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 24: [2022-11-27 02:03:33,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:03:33,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 02:03:33,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 9: [2022-11-27 02:03:33,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:03:33,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 02:03:33,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 5: [2022-11-27 02:03:33,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:03:33,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 02:03:33,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 1: [2022-11-27 02:03:33,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-27 02:03:33,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 02:03:33,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 25: [2022-11-27 02:03:33,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:03:33,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 02:03:33,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 20: [2022-11-27 02:03:33,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:03:33,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 02:03:33,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 4: [2022-11-27 02:03:33,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:03:33,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 02:03:33,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 11: [2022-11-27 02:03:33,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-27 02:03:33,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 02:03:33,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 13: [2022-11-27 02:03:33,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:03:33,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 02:03:33,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 8: [2022-11-27 02:03:33,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:03:33,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 02:03:33,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 17: [2022-11-27 02:03:33,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:03:33,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 02:03:33,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 12: [2022-11-27 02:03:33,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-27 02:03:33,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 02:03:33,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 23: [2022-11-27 02:03:33,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:03:33,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 02:03:33,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 3: [2022-11-27 02:03:33,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:03:33,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 02:03:33,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 22: [2022-11-27 02:03:33,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:03:33,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 02:03:33,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 29: [2022-11-27 02:03:33,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-27 02:03:33,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 02:03:33,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 21: [2022-11-27 02:03:33,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:03:33,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 02:03:33,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 6: [2022-11-27 02:03:33,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:03:33,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 02:03:33,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 10: [2022-11-27 02:03:33,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:03:33,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 02:03:33,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 30: [2022-11-27 02:03:33,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 
30: [2022-11-27 02:03:33,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 02:03:33,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 7: [2022-11-27 02:03:33,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:03:33,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 02:03:33,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 19: [2022-11-27 02:03:33,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:03:33,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 02:03:33,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 16: [2022-11-27 02:03:33,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:03:33,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 02:03:33,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 18: [2022-11-27 02:03:33,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
18: [2022-11-27 02:03:33,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 02:03:33,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 31: [2022-11-27 02:03:33,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:03:33,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 02:03:33,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 0: [2022-11-27 02:03:33,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:03:33,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:03:33,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:03:33,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 28: [2022-11-27 02:03:33,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 24: [2022-11-27 02:03:33,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 
26: [2022-11-27 02:03:33,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:03:33,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 26: [2022-11-27 02:03:33,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-27 02:03:33,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 9: [2022-11-27 02:03:33,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:03:33,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:03:33,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 02:03:33,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 27: [2022-11-27 02:03:33,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 5: [2022-11-27 02:03:33,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:03:33,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 
5: [2022-11-27 02:03:33,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 02:03:33,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 8: [2022-11-27 02:03:33,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:03:33,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 02:03:33,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 14: [2022-11-27 02:03:33,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:03:33,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 02:03:33,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 0: [2022-11-27 02:03:33,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 02:03:33,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 1: [2022-11-27 02:03:33,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-27 02:03:33,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 02:03:33,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 11: [2022-11-27 02:03:33,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:03:33,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 02:03:33,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 15: [2022-11-27 02:03:33,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:03:33,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:03:33,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 02:03:33,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 23: [2022-11-27 02:03:33,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 02:03:33,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 22: [2022-11-27 02:03:33,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
4: [2022-11-27 02:03:33,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:03:33,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 02:03:33,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 4: [2022-11-27 02:03:33,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 02:03:33,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 21: [2022-11-27 02:03:33,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:03:33,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 02:03:33,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 2: [2022-11-27 02:03:33,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:03:33,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:03:33,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 02:03:33,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 
25: [2022-11-27 02:03:33,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 02:03:33,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 13: [2022-11-27 02:03:33,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:03:33,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 02:03:33,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 20: [2022-11-27 02:03:33,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:03:33,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 02:03:33,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 17: [2022-11-27 02:03:33,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:03:33,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 02:03:33,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 3: [2022-11-27 02:03:33,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-27 02:03:33,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 02:03:33,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 12: [2022-11-27 02:03:33,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:03:33,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 02:03:33,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 6: [2022-11-27 02:03:33,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:03:33,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 02:03:33,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 29: [2022-11-27 02:03:33,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:03:33,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 02:03:33,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 30: [2022-11-27 02:03:33,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-27 02:03:33,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 02:03:33,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 10: [2022-11-27 02:03:33,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:03:33,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 02:03:33,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 7: [2022-11-27 02:03:33,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:03:33,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 02:03:33,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 31: [2022-11-27 02:03:33,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:03:33,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 02:03:33,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 14: [2022-11-27 02:03:33,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-27 02:03:33,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 02:03:33,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 19: [2022-11-27 02:03:33,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:03:33,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:03:33,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 16: [2022-11-27 02:03:33,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 19: [2022-11-27 02:03:33,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 16: [2022-11-27 02:03:33,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 27: [2022-11-27 02:03:33,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:03:33,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 02:03:33,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 18: [2022-11-27 02:03:33,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
18: [2022-11-27 02:03:33,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-27 02:03:33,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 24: [2022-11-27 02:03:33,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:03:33,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-27 02:03:33,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 0: [2022-11-27 02:03:33,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:03:33,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 02:03:33,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 5: [2022-11-27 02:03:33,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:03:33,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 02:03:33,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 1: [2022-11-27 02:03:33,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-27 02:03:33,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 02:03:33,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 26: [2022-11-27 02:03:33,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:03:33,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 02:03:33,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 9: [2022-11-27 02:03:33,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:03:33,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 02:03:33,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 15: [2022-11-27 02:03:33,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:03:33,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 02:03:33,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 4: [2022-11-27 02:03:33,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-27 02:03:33,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 02:03:33,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 12: [2022-11-27 02:03:33,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:03:33,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 02:03:33,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 20: [2022-11-27 02:03:33,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:03:33,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 25: [2022-11-27 02:03:33,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:03:33,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 25: [2022-11-27 02:03:33,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 02:03:33,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 8: [2022-11-27 02:03:33,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-27 02:03:33,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 02:03:33,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 13: [2022-11-27 02:03:33,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:03:33,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:03:33,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 02:03:33,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 23: [2022-11-27 02:03:33,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 02:03:33,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 11: [2022-11-27 02:03:33,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:03:33,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 02:03:33,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 22: [2022-11-27 02:03:33,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
22: [2022-11-27 02:03:33,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 02:03:33,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 21: [2022-11-27 02:03:33,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:03:33,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 02:03:33,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 3: [2022-11-27 02:03:33,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:03:33,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 02:03:33,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 29: [2022-11-27 02:03:33,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:03:33,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 02:03:33,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 17: [2022-11-27 02:03:33,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-27 02:03:33,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 6: [2022-11-27 02:03:33,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:03:33,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 6: [2022-11-27 02:03:33,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 02:03:33,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 7: [2022-11-27 02:03:33,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:03:33,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 02:03:33,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 27: [2022-11-27 02:03:33,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:03:33,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 02:03:33,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 30: [2022-11-27 02:03:33,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-27 02:03:33,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 02:03:33,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 10: [2022-11-27 02:03:33,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:03:33,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 02:03:33,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 5: [2022-11-27 02:03:33,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:03:33,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:03:33,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 02:03:33,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 31: [2022-11-27 02:03:33,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 02:03:33,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 0: [2022-11-27 02:03:33,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-27 02:03:33,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 02:03:33,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 18: [2022-11-27 02:03:33,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:03:33,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 02:03:33,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 16: [2022-11-27 02:03:33,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:03:33,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 02:03:33,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 17: [2022-11-27 02:03:33,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:03:33,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 02:03:33,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 8: [2022-11-27 02:03:33,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-27 02:03:33,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 02:03:33,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 9: [2022-11-27 02:03:33,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:03:33,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 02:03:33,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 14: [2022-11-27 02:03:33,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:03:33,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 02:03:33,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 15: [2022-11-27 02:03:33,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:03:33,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 7: [2022-11-27 02:03:33,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:03:33,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 
7: [2022-11-27 02:03:33,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 02:03:33,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 25: [2022-11-27 02:03:33,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:03:33,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 02:03:33,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 2: [2022-11-27 02:03:33,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:03:33,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:03:33,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 21: [2022-11-27 02:03:33,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 2: [2022-11-27 02:03:33,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 21: [2022-11-27 02:03:33,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 26: [2022-11-27 02:03:33,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
23: [2022-11-27 02:03:33,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:03:33,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 3: [2022-11-27 02:03:33,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:03:33,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 23: [2022-11-27 02:03:33,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 3: [2022-11-27 02:03:33,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 23: [2022-11-27 02:03:33,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 12: [2022-11-27 02:03:33,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:03:33,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 12: [2022-11-27 02:03:33,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 02:03:33,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 4: [2022-11-27 02:03:33,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-27 02:03:33,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 02:03:33,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 6: [2022-11-27 02:03:33,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:03:33,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 02:03:33,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 24: [2022-11-27 02:03:33,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:03:33,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:03:33,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:03:33,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-27 02:03:33,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 
5: [2022-11-27 02:03:33,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 20: [2022-11-27 02:03:33,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 5: [2022-11-27 02:03:33,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 20: [2022-11-27 02:03:33,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 29: [2022-11-27 02:03:33,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:03:33,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:03:33,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 30: [2022-11-27 02:03:33,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 02:03:33,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 29: [2022-11-27 02:03:33,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 11: [2022-11-27 02:03:33,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-27 02:03:33,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 02:03:33,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 2: [2022-11-27 02:03:33,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:03:33,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:03:33,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 13: [2022-11-27 02:03:33,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 02:03:33,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 2: [2022-11-27 02:03:33,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 16: [2022-11-27 02:03:33,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:03:33,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 02:03:33,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 1: [2022-11-27 02:03:33,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
22: [2022-11-27 02:03:33,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:03:33,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 1: [2022-11-27 02:03:33,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 22: [2022-11-27 02:03:33,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 1: [2022-11-27 02:03:33,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 27: [2022-11-27 02:03:33,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:03:33,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:03:33,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 02:03:33,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 10: [2022-11-27 02:03:33,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 02:03:33,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 28: [2022-11-27 02:03:33,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
2: [2022-11-27 02:03:33,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:03:33,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 02:03:33,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 15: [2022-11-27 02:03:33,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:03:33,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 02:03:33,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 26: [2022-11-27 02:03:33,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:03:33,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 02:03:33,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 31: [2022-11-27 02:03:33,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-27 02:03:33,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 14: [2022-11-27 02:03:33,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:03:33,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 18: [2022-11-27 02:03:33,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:03:33,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 18: [2022-11-27 02:03:33,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 14: [2022-11-27 02:03:33,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 18: [2022-11-27 02:03:33,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 26: [2022-11-27 02:03:33,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:03:33,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 02:03:33,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 24: [2022-11-27 02:03:33,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
24: [2022-11-27 02:03:33,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 02:03:33,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 28: [2022-11-27 02:03:33,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 02:03:33,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 28: [2022-11-27 02:03:33,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:03:33,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 02:03:33,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 28: [2022-11-27 02:03:33,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:03:33,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 02:03:33,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 19: [2022-11-27 02:03:33,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-27 02:03:33,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 02:03:33,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 19: [2022-11-27 02:03:33,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:03:33,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:03:33,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 02:03:33,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step142000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 02:03:33,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 19: [2022-11-27 02:03:33,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step142000 is ready now! 
0: successfully saved checkpoint at iteration 142000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2582.30 31: iteration 142010/ 173500 | consumed samples: 36354560 | consumed tokens: 74454138880 | elapsed time per iteration (s): 1.13 | learning rate: 3.452E-05 | global batch size: 256 | lm loss: 1.917574E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 226.926 | TFLOPs: 13.73 | 31: iteration 142020/ 173500 | consumed samples: 36357120 | consumed tokens: 74459381760 | elapsed time per iteration (s): 0.88 | learning rate: 3.451E-05 | global batch size: 256 | lm loss: 1.904925E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.556 | TFLOPs: 17.52 | 31: iteration 142030/ 173500 | consumed samples: 36359680 | consumed tokens: 74464624640 | elapsed time per iteration (s): 0.88 | learning rate: 3.450E-05 | global batch size: 256 | lm loss: 1.910567E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.746 | TFLOPs: 17.65 | 31: iteration 142040/ 173500 | consumed samples: 36362240 | consumed tokens: 74469867520 | elapsed time per iteration (s): 0.90 | learning rate: 3.449E-05 | global batch size: 256 | lm loss: 1.943364E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 283.232 | TFLOPs: 17.13 | 31: iteration 142050/ 173500 | consumed samples: 36364800 | consumed tokens: 74475110400 | elapsed time per iteration (s): 0.93 | learning rate: 3.448E-05 | global batch size: 256 | lm loss: 1.934751E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 276.355 | TFLOPs: 16.72 | 31: iteration 142060/ 173500 | consumed samples: 36367360 | consumed tokens: 74480353280 | elapsed time per iteration (s): 
0.92 | learning rate: 3.448E-05 | global batch size: 256 | lm loss: 1.926807E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 278.443 | TFLOPs: 16.85 | 31: iteration 142070/ 173500 | consumed samples: 36369920 | consumed tokens: 74485596160 | elapsed time per iteration (s): 0.85 | learning rate: 3.447E-05 | global batch size: 256 | lm loss: 1.915203E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.974 | TFLOPs: 18.21 | 31: iteration 142080/ 173500 | consumed samples: 36372480 | consumed tokens: 74490839040 | elapsed time per iteration (s): 0.92 | learning rate: 3.446E-05 | global batch size: 256 | lm loss: 1.928265E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 279.414 | TFLOPs: 16.90 | 31: iteration 142090/ 173500 | consumed samples: 36375040 | consumed tokens: 74496081920 | elapsed time per iteration (s): 0.89 | learning rate: 3.445E-05 | global batch size: 256 | lm loss: 1.927873E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.906 | TFLOPs: 17.36 | 31: iteration 142100/ 173500 | consumed samples: 36377600 | consumed tokens: 74501324800 | elapsed time per iteration (s): 0.91 | learning rate: 3.444E-05 | global batch size: 256 | lm loss: 1.918356E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.554 | TFLOPs: 17.09 | 31: iteration 142110/ 173500 | consumed samples: 36380160 | consumed tokens: 74506567680 | elapsed time per iteration (s): 0.85 | learning rate: 3.443E-05 | global batch size: 256 | lm loss: 1.891219E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.977 | TFLOPs: 18.27 | 31: 
iteration 142120/ 173500 | consumed samples: 36382720 | consumed tokens: 74511810560 | elapsed time per iteration (s): 0.86 | learning rate: 3.442E-05 | global batch size: 256 | lm loss: 1.932898E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.568 | TFLOPs: 18.00 | 31: iteration 142130/ 173500 | consumed samples: 36385280 | consumed tokens: 74517053440 | elapsed time per iteration (s): 0.84 | learning rate: 3.441E-05 | global batch size: 256 | lm loss: 1.924933E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.110 | TFLOPs: 18.34 | 31: iteration 142140/ 173500 | consumed samples: 36387840 | consumed tokens: 74522296320 | elapsed time per iteration (s): 0.84 | learning rate: 3.440E-05 | global batch size: 256 | lm loss: 1.917043E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.649 | TFLOPs: 18.37 | 31: iteration 142150/ 173500 | consumed samples: 36390400 | consumed tokens: 74527539200 | elapsed time per iteration (s): 0.87 | learning rate: 3.439E-05 | global batch size: 256 | lm loss: 1.925066E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.496 | TFLOPs: 17.76 | 31: iteration 142160/ 173500 | consumed samples: 36392960 | consumed tokens: 74532782080 | elapsed time per iteration (s): 0.91 | learning rate: 3.439E-05 | global batch size: 256 | lm loss: 1.928628E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 281.712 | TFLOPs: 17.04 | 31: iteration 142170/ 173500 | consumed samples: 36395520 | consumed tokens: 74538024960 | elapsed time per iteration (s): 0.82 | learning rate: 3.438E-05 | global batch size: 256 | lm loss: 1.971471E+00 | grad norm: 0.189 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.471 | TFLOPs: 18.84 | 31: iteration 142180/ 173500 | consumed samples: 36398080 | consumed tokens: 74543267840 | elapsed time per iteration (s): 0.86 | learning rate: 3.437E-05 | global batch size: 256 | lm loss: 1.919742E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.157 | TFLOPs: 17.92 | 31: iteration 142190/ 173500 | consumed samples: 36400640 | consumed tokens: 74548510720 | elapsed time per iteration (s): 0.82 | learning rate: 3.436E-05 | global batch size: 256 | lm loss: 1.925437E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.478 | TFLOPs: 18.90 | 31: iteration 142200/ 173500 | consumed samples: 36403200 | consumed tokens: 74553753600 | elapsed time per iteration (s): 0.80 | learning rate: 3.435E-05 | global batch size: 256 | lm loss: 1.931695E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.832 | TFLOPs: 19.29 | 31: iteration 142210/ 173500 | consumed samples: 36405760 | consumed tokens: 74558996480 | elapsed time per iteration (s): 0.83 | learning rate: 3.434E-05 | global batch size: 256 | lm loss: 1.940074E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.243 | TFLOPs: 18.77 | 31: iteration 142220/ 173500 | consumed samples: 36408320 | consumed tokens: 74564239360 | elapsed time per iteration (s): 0.84 | learning rate: 3.433E-05 | global batch size: 256 | lm loss: 1.929534E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.583 | TFLOPs: 18.43 | 31: iteration 142230/ 173500 | consumed samples: 36410880 | consumed tokens: 74569482240 | elapsed time per iteration (s): 0.77 | 
learning rate: 3.432E-05 | global batch size: 256 | lm loss: 1.933383E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.236 | TFLOPs: 20.16 | 31: iteration 142240/ 173500 | consumed samples: 36413440 | consumed tokens: 74574725120 | elapsed time per iteration (s): 0.81 | learning rate: 3.431E-05 | global batch size: 256 | lm loss: 1.933630E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.580 | TFLOPs: 19.21 | 31: iteration 142250/ 173500 | consumed samples: 36416000 | consumed tokens: 74579968000 | elapsed time per iteration (s): 0.80 | learning rate: 3.431E-05 | global batch size: 256 | lm loss: 1.920332E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.530 | TFLOPs: 19.45 | 31: iteration 142260/ 173500 | consumed samples: 36418560 | consumed tokens: 74585210880 | elapsed time per iteration (s): 0.84 | learning rate: 3.430E-05 | global batch size: 256 | lm loss: 1.952878E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.533 | TFLOPs: 18.36 | 31: iteration 142270/ 173500 | consumed samples: 36421120 | consumed tokens: 74590453760 | elapsed time per iteration (s): 0.81 | learning rate: 3.429E-05 | global batch size: 256 | lm loss: 1.898234E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.141 | TFLOPs: 19.00 | 31: iteration 142280/ 173500 | consumed samples: 36423680 | consumed tokens: 74595696640 | elapsed time per iteration (s): 0.82 | learning rate: 3.428E-05 | global batch size: 256 | lm loss: 1.918033E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.316 | TFLOPs: 18.89 | 31: iteration 
142290/ 173500 | consumed samples: 36426240 | consumed tokens: 74600939520 | elapsed time per iteration (s): 0.82 | learning rate: 3.427E-05 | global batch size: 256 | lm loss: 1.923832E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.124 | TFLOPs: 18.88 | 31: iteration 142300/ 173500 | consumed samples: 36428800 | consumed tokens: 74606182400 | elapsed time per iteration (s): 0.82 | learning rate: 3.426E-05 | global batch size: 256 | lm loss: 1.919353E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.848 | TFLOPs: 18.87 | 31: iteration 142310/ 173500 | consumed samples: 36431360 | consumed tokens: 74611425280 | elapsed time per iteration (s): 0.77 | learning rate: 3.425E-05 | global batch size: 256 | lm loss: 1.948649E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.391 | TFLOPs: 20.17 | 31: iteration 142320/ 173500 | consumed samples: 36433920 | consumed tokens: 74616668160 | elapsed time per iteration (s): 0.77 | learning rate: 3.424E-05 | global batch size: 256 | lm loss: 1.926756E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.374 | TFLOPs: 20.11 | 31: iteration 142330/ 173500 | consumed samples: 36436480 | consumed tokens: 74621911040 | elapsed time per iteration (s): 0.81 | learning rate: 3.423E-05 | global batch size: 256 | lm loss: 1.919984E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.064 | TFLOPs: 19.12 | 31: iteration 142340/ 173500 | consumed samples: 36439040 | consumed tokens: 74627153920 | elapsed time per iteration (s): 0.80 | learning rate: 3.423E-05 | global batch size: 256 | lm loss: 1.911475E+00 | grad norm: 0.181 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.918 | TFLOPs: 19.41 | 31: iteration 142350/ 173500 | consumed samples: 36441600 | consumed tokens: 74632396800 | elapsed time per iteration (s): 0.80 | learning rate: 3.422E-05 | global batch size: 256 | lm loss: 1.944278E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.333 | TFLOPs: 19.26 | 31: iteration 142360/ 173500 | consumed samples: 36444160 | consumed tokens: 74637639680 | elapsed time per iteration (s): 0.78 | learning rate: 3.421E-05 | global batch size: 256 | lm loss: 1.924994E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.264 | TFLOPs: 19.74 | 31: iteration 142370/ 173500 | consumed samples: 36446720 | consumed tokens: 74642882560 | elapsed time per iteration (s): 0.79 | learning rate: 3.420E-05 | global batch size: 256 | lm loss: 1.904565E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.003 | TFLOPs: 19.60 | 31: iteration 142380/ 173500 | consumed samples: 36449280 | consumed tokens: 74648125440 | elapsed time per iteration (s): 0.78 | learning rate: 3.419E-05 | global batch size: 256 | lm loss: 1.932652E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.505 | TFLOPs: 19.87 | 31: iteration 142390/ 173500 | consumed samples: 36451840 | consumed tokens: 74653368320 | elapsed time per iteration (s): 0.79 | learning rate: 3.418E-05 | global batch size: 256 | lm loss: 1.921358E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.485 | TFLOPs: 19.51 | 31: iteration 142400/ 173500 | consumed samples: 36454400 | consumed tokens: 74658611200 | elapsed time per iteration (s): 0.80 | learning 
rate: 3.417E-05 | global batch size: 256 | lm loss: 1.913546E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.396 | TFLOPs: 19.26 | 31: iteration 142410/ 173500 | consumed samples: 36456960 | consumed tokens: 74663854080 | elapsed time per iteration (s): 0.80 | learning rate: 3.416E-05 | global batch size: 256 | lm loss: 1.924863E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.242 | TFLOPs: 19.37 | 31: iteration 142420/ 173500 | consumed samples: 36459520 | consumed tokens: 74669096960 | elapsed time per iteration (s): 0.89 | learning rate: 3.415E-05 | global batch size: 256 | lm loss: 1.921511E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.538 | TFLOPs: 17.46 | 31: iteration 142430/ 173500 | consumed samples: 36462080 | consumed tokens: 74674339840 | elapsed time per iteration (s): 0.80 | learning rate: 3.415E-05 | global batch size: 256 | lm loss: 1.925222E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.907 | TFLOPs: 19.47 | 31: iteration 142440/ 173500 | consumed samples: 36464640 | consumed tokens: 74679582720 | elapsed time per iteration (s): 0.80 | learning rate: 3.414E-05 | global batch size: 256 | lm loss: 1.916172E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.029 | TFLOPs: 19.42 | 31: iteration 142450/ 173500 | consumed samples: 36467200 | consumed tokens: 74684825600 | elapsed time per iteration (s): 0.87 | learning rate: 3.413E-05 | global batch size: 256 | lm loss: 1.938187E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.740 | TFLOPs: 17.89 | 31: iteration 142460/ 
173500 | consumed samples: 36469760 | consumed tokens: 74690068480 | elapsed time per iteration (s): 0.78 | learning rate: 3.412E-05 | global batch size: 256 | lm loss: 1.921374E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.507 | TFLOPs: 19.81 | 31: iteration 142470/ 173500 | consumed samples: 36472320 | consumed tokens: 74695311360 | elapsed time per iteration (s): 0.78 | learning rate: 3.411E-05 | global batch size: 256 | lm loss: 1.914295E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.411 | TFLOPs: 19.75 | 31: iteration 142480/ 173500 | consumed samples: 36474880 | consumed tokens: 74700554240 | elapsed time per iteration (s): 0.84 | learning rate: 3.410E-05 | global batch size: 256 | lm loss: 1.942730E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.894 | TFLOPs: 18.51 | 31: iteration 142490/ 173500 | consumed samples: 36477440 | consumed tokens: 74705797120 | elapsed time per iteration (s): 0.82 | learning rate: 3.409E-05 | global batch size: 256 | lm loss: 1.904548E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.930 | TFLOPs: 18.87 | 31: iteration 142500/ 173500 | consumed samples: 36480000 | consumed tokens: 74711040000 | elapsed time per iteration (s): 0.83 | learning rate: 3.408E-05 | global batch size: 256 | lm loss: 1.941157E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.612 | TFLOPs: 18.55 | 31: iteration 142510/ 173500 | consumed samples: 36482560 | consumed tokens: 74716282880 | elapsed time per iteration (s): 0.80 | learning rate: 3.407E-05 | global batch size: 256 | lm loss: 1.916704E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 318.506 | TFLOPs: 19.27 | 31: iteration 142520/ 173500 | consumed samples: 36485120 | consumed tokens: 74721525760 | elapsed time per iteration (s): 0.79 | learning rate: 3.407E-05 | global batch size: 256 | lm loss: 1.932909E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.419 | TFLOPs: 19.63 | 31: iteration 142530/ 173500 | consumed samples: 36487680 | consumed tokens: 74726768640 | elapsed time per iteration (s): 0.85 | learning rate: 3.406E-05 | global batch size: 256 | lm loss: 1.920933E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.421 | TFLOPs: 18.17 | 31: iteration 142540/ 173500 | consumed samples: 36490240 | consumed tokens: 74732011520 | elapsed time per iteration (s): 0.81 | learning rate: 3.405E-05 | global batch size: 256 | lm loss: 1.934439E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.759 | TFLOPs: 19.10 | 31: iteration 142550/ 173500 | consumed samples: 36492800 | consumed tokens: 74737254400 | elapsed time per iteration (s): 0.81 | learning rate: 3.404E-05 | global batch size: 256 | lm loss: 1.929815E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.072 | TFLOPs: 19.12 | 31: iteration 142560/ 173500 | consumed samples: 36495360 | consumed tokens: 74742497280 | elapsed time per iteration (s): 0.85 | learning rate: 3.403E-05 | global batch size: 256 | lm loss: 1.916750E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.176 | TFLOPs: 18.28 | 31: iteration 142570/ 173500 | consumed samples: 36497920 | consumed tokens: 74747740160 | elapsed time per iteration (s): 0.83 | learning rate: 
3.402E-05 | global batch size: 256 | lm loss: 1.911129E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.192 | TFLOPs: 18.64 | 31: iteration 142580/ 173500 | consumed samples: 36500480 | consumed tokens: 74752983040 | elapsed time per iteration (s): 0.81 | learning rate: 3.401E-05 | global batch size: 256 | lm loss: 1.926028E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.956 | TFLOPs: 19.18 | 31: iteration 142590/ 173500 | consumed samples: 36503040 | consumed tokens: 74758225920 | elapsed time per iteration (s): 0.83 | learning rate: 3.400E-05 | global batch size: 256 | lm loss: 1.939729E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.301 | TFLOPs: 18.71 | 31: iteration 142600/ 173500 | consumed samples: 36505600 | consumed tokens: 74763468800 | elapsed time per iteration (s): 0.82 | learning rate: 3.400E-05 | global batch size: 256 | lm loss: 1.934922E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.430 | TFLOPs: 18.96 | 31: iteration 142610/ 173500 | consumed samples: 36508160 | consumed tokens: 74768711680 | elapsed time per iteration (s): 0.81 | learning rate: 3.399E-05 | global batch size: 256 | lm loss: 1.917063E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.279 | TFLOPs: 19.01 | 31: iteration 142620/ 173500 | consumed samples: 36510720 | consumed tokens: 74773954560 | elapsed time per iteration (s): 0.82 | learning rate: 3.398E-05 | global batch size: 256 | lm loss: 1.950396E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.432 | TFLOPs: 18.78 | 31: iteration 142630/ 173500 | 
consumed samples: 36513280 | consumed tokens: 74779197440 | elapsed time per iteration (s): 0.80 | learning rate: 3.397E-05 | global batch size: 256 | lm loss: 1.904119E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.748 | TFLOPs: 19.28 | 31: iteration 142640/ 173500 | consumed samples: 36515840 | consumed tokens: 74784440320 | elapsed time per iteration (s): 0.79 | learning rate: 3.396E-05 | global batch size: 256 | lm loss: 1.928464E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.938 | TFLOPs: 19.54 | 31: iteration 142650/ 173500 | consumed samples: 36518400 | consumed tokens: 74789683200 | elapsed time per iteration (s): 0.80 | learning rate: 3.395E-05 | global batch size: 256 | lm loss: 1.925648E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.095 | TFLOPs: 19.24 | 31: iteration 142660/ 173500 | consumed samples: 36520960 | consumed tokens: 74794926080 | elapsed time per iteration (s): 0.82 | learning rate: 3.394E-05 | global batch size: 256 | lm loss: 1.913735E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.884 | TFLOPs: 18.93 | 31: iteration 142670/ 173500 | consumed samples: 36523520 | consumed tokens: 74800168960 | elapsed time per iteration (s): 0.78 | learning rate: 3.393E-05 | global batch size: 256 | lm loss: 1.930495E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.920 | TFLOPs: 19.78 | 31: iteration 142680/ 173500 | consumed samples: 36526080 | consumed tokens: 74805411840 | elapsed time per iteration (s): 0.78 | learning rate: 3.392E-05 | global batch size: 256 | lm loss: 1.923752E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 326.224 | TFLOPs: 19.74 | 31: iteration 142690/ 173500 | consumed samples: 36528640 | consumed tokens: 74810654720 | elapsed time per iteration (s): 0.79 | learning rate: 3.392E-05 | global batch size: 256 | lm loss: 1.947524E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.998 | TFLOPs: 19.72 | 31: iteration 142700/ 173500 | consumed samples: 36531200 | consumed tokens: 74815897600 | elapsed time per iteration (s): 0.80 | learning rate: 3.391E-05 | global batch size: 256 | lm loss: 1.926892E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.696 | TFLOPs: 19.46 | 31: iteration 142710/ 173500 | consumed samples: 36533760 | consumed tokens: 74821140480 | elapsed time per iteration (s): 0.80 | learning rate: 3.390E-05 | global batch size: 256 | lm loss: 1.926149E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.117 | TFLOPs: 19.37 | 31: iteration 142720/ 173500 | consumed samples: 36536320 | consumed tokens: 74826383360 | elapsed time per iteration (s): 0.84 | learning rate: 3.389E-05 | global batch size: 256 | lm loss: 1.947835E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.807 | TFLOPs: 18.38 | 31: iteration 142730/ 173500 | consumed samples: 36538880 | consumed tokens: 74831626240 | elapsed time per iteration (s): 0.84 | learning rate: 3.388E-05 | global batch size: 256 | lm loss: 1.946617E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.806 | TFLOPs: 18.50 | 31: iteration 142740/ 173500 | consumed samples: 36541440 | consumed tokens: 74836869120 | elapsed time per iteration (s): 0.83 | learning rate: 
3.387E-05 | global batch size: 256 | lm loss: 1.912907E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.111 | TFLOPs: 18.58 | 31: iteration 142750/ 173500 | consumed samples: 36544000 | consumed tokens: 74842112000 | elapsed time per iteration (s): 0.85 | learning rate: 3.386E-05 | global batch size: 256 | lm loss: 1.905681E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.050 | TFLOPs: 18.21 | 31: iteration 142760/ 173500 | consumed samples: 36546560 | consumed tokens: 74847354880 | elapsed time per iteration (s): 0.84 | learning rate: 3.385E-05 | global batch size: 256 | lm loss: 1.913430E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.299 | TFLOPs: 18.35 | 31: iteration 142770/ 173500 | consumed samples: 36549120 | consumed tokens: 74852597760 | elapsed time per iteration (s): 0.82 | learning rate: 3.385E-05 | global batch size: 256 | lm loss: 1.899705E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.471 | TFLOPs: 18.90 | 31: iteration 142780/ 173500 | consumed samples: 36551680 | consumed tokens: 74857840640 | elapsed time per iteration (s): 0.84 | learning rate: 3.384E-05 | global batch size: 256 | lm loss: 1.916997E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.138 | TFLOPs: 18.34 | 31: iteration 142790/ 173500 | consumed samples: 36554240 | consumed tokens: 74863083520 | elapsed time per iteration (s): 0.77 | learning rate: 3.383E-05 | global batch size: 256 | lm loss: 1.914129E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.553 | TFLOPs: 20.00 | 31: iteration 142800/ 173500 | 
consumed samples: 36556800 | consumed tokens: 74868326400 | elapsed time per iteration (s): 0.76 | learning rate: 3.382E-05 | global batch size: 256 | lm loss: 1.944222E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.429 | TFLOPs: 20.47 | 31: iteration 142810/ 173500 | consumed samples: 36559360 | consumed tokens: 74873569280 | elapsed time per iteration (s): 0.74 | learning rate: 3.381E-05 | global batch size: 256 | lm loss: 1.922000E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.218 | TFLOPs: 21.01 | 31: iteration 142820/ 173500 | consumed samples: 36561920 | consumed tokens: 74878812160 | elapsed time per iteration (s): 0.77 | learning rate: 3.380E-05 | global batch size: 256 | lm loss: 1.920727E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.465 | TFLOPs: 20.11 | 31: iteration 142830/ 173500 | consumed samples: 36564480 | consumed tokens: 74884055040 | elapsed time per iteration (s): 0.79 | learning rate: 3.379E-05 | global batch size: 256 | lm loss: 1.926514E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.890 | TFLOPs: 19.59 | 31: iteration 142840/ 173500 | consumed samples: 36567040 | consumed tokens: 74889297920 | elapsed time per iteration (s): 0.74 | learning rate: 3.378E-05 | global batch size: 256 | lm loss: 1.921434E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.420 | TFLOPs: 20.90 | 31: iteration 142850/ 173500 | consumed samples: 36569600 | consumed tokens: 74894540800 | elapsed time per iteration (s): 0.78 | learning rate: 3.378E-05 | global batch size: 256 | lm loss: 1.925418E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 326.653 | TFLOPs: 19.76 | 31: iteration 142860/ 173500 | consumed samples: 36572160 | consumed tokens: 74899783680 | elapsed time per iteration (s): 0.75 | learning rate: 3.377E-05 | global batch size: 256 | lm loss: 1.905932E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.153 | TFLOPs: 20.58 | 31: iteration 142870/ 173500 | consumed samples: 36574720 | consumed tokens: 74905026560 | elapsed time per iteration (s): 0.89 | learning rate: 3.376E-05 | global batch size: 256 | lm loss: 1.936747E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.313 | TFLOPs: 17.32 | 31: iteration 142880/ 173500 | consumed samples: 36577280 | consumed tokens: 74910269440 | elapsed time per iteration (s): 0.83 | learning rate: 3.375E-05 | global batch size: 256 | lm loss: 1.919582E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.354 | TFLOPs: 18.59 | 31: iteration 142890/ 173500 | consumed samples: 36579840 | consumed tokens: 74915512320 | elapsed time per iteration (s): 0.84 | learning rate: 3.374E-05 | global batch size: 256 | lm loss: 1.916829E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.754 | TFLOPs: 18.44 | 31: iteration 142900/ 173500 | consumed samples: 36582400 | consumed tokens: 74920755200 | elapsed time per iteration (s): 0.81 | learning rate: 3.373E-05 | global batch size: 256 | lm loss: 1.914749E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.584 | TFLOPs: 19.03 | 31: iteration 142910/ 173500 | consumed samples: 36584960 | consumed tokens: 74925998080 | elapsed time per iteration (s): 0.82 | learning rate: 
3.372E-05 | global batch size: 256 | lm loss: 1.932486E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.992 | TFLOPs: 18.81 | 31: iteration 142920/ 173500 | consumed samples: 36587520 | consumed tokens: 74931240960 | elapsed time per iteration (s): 0.80 | learning rate: 3.371E-05 | global batch size: 256 | lm loss: 1.923299E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.093 | TFLOPs: 19.43 | 31: iteration 142930/ 173500 | consumed samples: 36590080 | consumed tokens: 74936483840 | elapsed time per iteration (s): 0.75 | learning rate: 3.371E-05 | global batch size: 256 | lm loss: 1.916709E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.157 | TFLOPs: 20.64 | 31: iteration 142940/ 173500 | consumed samples: 36592640 | consumed tokens: 74941726720 | elapsed time per iteration (s): 0.74 | learning rate: 3.370E-05 | global batch size: 256 | lm loss: 1.916875E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.444 | TFLOPs: 20.96 | 31: iteration 142950/ 173500 | consumed samples: 36595200 | consumed tokens: 74946969600 | elapsed time per iteration (s): 0.73 | learning rate: 3.369E-05 | global batch size: 256 | lm loss: 1.943844E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.685 | TFLOPs: 21.16 | 31: iteration 142960/ 173500 | consumed samples: 36597760 | consumed tokens: 74952212480 | elapsed time per iteration (s): 0.73 | learning rate: 3.368E-05 | global batch size: 256 | lm loss: 1.906657E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.900 | TFLOPs: 21.35 | 31: iteration 142970/ 173500 | 
consumed samples: 36600320 | consumed tokens: 74957455360 | elapsed time per iteration (s): 0.77 | learning rate: 3.367E-05 | global batch size: 256 | lm loss: 1.930915E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.473 | TFLOPs: 20.11 |
31: iteration 142980/ 173500 | consumed samples: 36602880 | consumed tokens: 74962698240 | elapsed time per iteration (s): 0.74 | learning rate: 3.366E-05 | global batch size: 256 | lm loss: 1.925753E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.822 | TFLOPs: 20.86 |
31: iteration 142990/ 173500 | consumed samples: 36605440 | consumed tokens: 74967941120 | elapsed time per iteration (s): 0.77 | learning rate: 3.365E-05 | global batch size: 256 | lm loss: 1.926530E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.547 | TFLOPs: 20.24 |
31: iteration 143000/ 173500 | consumed samples: 36608000 | consumed tokens: 74973184000 | elapsed time per iteration (s): 0.77 | learning rate: 3.364E-05 | global batch size: 256 | lm loss: 1.960360E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.564 | TFLOPs: 20.00 |
31: --------------------------------------------------------------------------------------------
31: valid loss at iteration 143000 | lm loss value: 1.929614E+00 | lm loss PPL: 6.886854E+00 |
31: --------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 143000 to checkpoints_1b1long
0: [2022-11-27 02:17:10,092] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step143000 is begin to save!
0: [2022-11-27 02:17:10,106] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_01-model_00-model_states.pt... 0: [2022-11-27 02:17:10,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_01-model_00-model_states.pt. 0: [2022-11-27 02:17:10,309] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_03-model_00-model_states.pt... 0: [2022-11-27 02:17:10,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_03-model_00-model_states.pt. 0: [2022-11-27 02:17:10,391] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_04-model_00-model_states.pt... 0: [2022-11-27 02:17:10,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_04-model_00-model_states.pt. 0: [2022-11-27 02:17:10,466] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_05-model_00-model_states.pt... 0: [2022-11-27 02:17:10,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_05-model_00-model_states.pt. 0: [2022-11-27 02:17:10,542] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_06-model_00-model_states.pt... 0: [2022-11-27 02:17:10,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_06-model_00-model_states.pt. 0: [2022-11-27 02:17:10,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_07-model_00-model_states.pt... 0: [2022-11-27 02:17:10,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_07-model_00-model_states.pt. 
0: [2022-11-27 02:17:10,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_08-model_00-model_states.pt... 0: [2022-11-27 02:17:10,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_08-model_00-model_states.pt. 0: [2022-11-27 02:17:10,774] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_09-model_00-model_states.pt... 0: [2022-11-27 02:17:10,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_09-model_00-model_states.pt. 0: [2022-11-27 02:17:10,849] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_10-model_00-model_states.pt... 0: [2022-11-27 02:17:10,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_10-model_00-model_states.pt. 0: [2022-11-27 02:17:10,923] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_11-model_00-model_states.pt... 0: [2022-11-27 02:17:10,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_11-model_00-model_states.pt. 0: [2022-11-27 02:17:10,998] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_12-model_00-model_states.pt... 0: [2022-11-27 02:17:11,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_12-model_00-model_states.pt. 0: [2022-11-27 02:17:11,071] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_13-model_00-model_states.pt... 0: [2022-11-27 02:17:11,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_13-model_00-model_states.pt. 
0: [2022-11-27 02:17:11,147] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_14-model_00-model_states.pt... 0: [2022-11-27 02:17:11,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_14-model_00-model_states.pt. 0: [2022-11-27 02:17:11,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_15-model_00-model_states.pt... 0: [2022-11-27 02:17:11,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_15-model_00-model_states.pt. 0: [2022-11-27 02:17:11,297] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_16-model_00-model_states.pt... 0: [2022-11-27 02:17:11,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_16-model_00-model_states.pt. 0: [2022-11-27 02:17:11,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_17-model_00-model_states.pt... 0: [2022-11-27 02:17:11,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_17-model_00-model_states.pt. 0: [2022-11-27 02:17:11,446] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_18-model_00-model_states.pt... 0: [2022-11-27 02:17:11,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_18-model_00-model_states.pt. 0: [2022-11-27 02:17:11,521] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_19-model_00-model_states.pt... 0: [2022-11-27 02:17:11,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_19-model_00-model_states.pt. 
0: [2022-11-27 02:17:11,597] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_20-model_00-model_states.pt... 0: [2022-11-27 02:17:11,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_20-model_00-model_states.pt. 0: [2022-11-27 02:17:11,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_21-model_00-model_states.pt... 0: [2022-11-27 02:17:11,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_21-model_00-model_states.pt. 0: [2022-11-27 02:17:11,747] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_22-model_00-model_states.pt... 0: [2022-11-27 02:17:11,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_22-model_00-model_states.pt. 0: [2022-11-27 02:17:11,822] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_23-model_00-model_states.pt... 0: [2022-11-27 02:17:11,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_23-model_00-model_states.pt. 0: [2022-11-27 02:17:11,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_24-model_00-model_states.pt... 0: [2022-11-27 02:17:11,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_24-model_00-model_states.pt. 0: [2022-11-27 02:17:11,970] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_25-model_00-model_states.pt... 0: [2022-11-27 02:17:12,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_25-model_00-model_states.pt. 
0: [2022-11-27 02:17:12,045] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_26-model_00-model_states.pt... 0: [2022-11-27 02:17:12,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_26-model_00-model_states.pt. 0: [2022-11-27 02:17:12,119] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_27-model_00-model_states.pt... 0: [2022-11-27 02:17:12,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_27-model_00-model_states.pt. 0: [2022-11-27 02:17:12,192] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_28-model_00-model_states.pt... 0: [2022-11-27 02:17:12,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_28-model_00-model_states.pt. 0: [2022-11-27 02:17:12,267] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/layer_30-model_00-model_states.pt... 0: [2022-11-27 02:17:12,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/layer_30-model_00-model_states.pt. 0: [2022-11-27 02:17:12,269] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step143000/mp_rank_00_model_states.pt 0: [2022-11-27 02:17:12,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/mp_rank_00_model_states.pt... 0: [2022-11-27 02:17:12,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/mp_rank_00_model_states.pt. 0: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
6: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 
4: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
2: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
15: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 
23: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 
29: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 
18: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
5: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 
9: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 
16: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
15: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 
11: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 
31: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 
30: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 
26: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 
5: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 
16: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 
24: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 
21: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
4: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 
11: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 
27: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 
20: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 
30: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:17:12,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:17:12,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:17:12,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 02:17:12,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 5: [2022-11-27 02:17:12,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:17:12,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 02:17:12,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 26: [2022-11-27 02:17:12,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:17:12,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 02:17:12,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 19: [2022-11-27 02:17:12,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-27 02:17:12,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 02:17:12,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 0: [2022-11-27 02:17:12,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:17:12,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:17:12,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 13: [2022-11-27 02:17:12,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 02:17:12,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 0: [2022-11-27 02:17:12,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 23: [2022-11-27 02:17:12,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:17:12,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 02:17:12,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 8: [2022-11-27 02:17:12,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-27 02:17:12,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 02:17:12,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 20: [2022-11-27 02:17:12,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:17:12,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:17:12,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 2: [2022-11-27 02:17:12,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 20: [2022-11-27 02:17:12,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 2: [2022-11-27 02:17:12,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 7: [2022-11-27 02:17:12,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:17:12,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 02:17:12,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 22: [2022-11-27 02:17:12,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
22: [2022-11-27 02:17:12,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 02:17:12,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 9: [2022-11-27 02:17:12,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:17:12,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 02:17:12,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 1: [2022-11-27 02:17:12,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:17:12,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:17:12,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 10: [2022-11-27 02:17:12,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 02:17:12,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 1: [2022-11-27 02:17:12,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 18: [2022-11-27 02:17:12,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
18: [2022-11-27 02:17:12,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 02:17:12,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 28: [2022-11-27 02:17:12,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:17:12,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 02:17:12,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:17:12,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 28: [2022-11-27 02:17:12,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 02:17:12,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 12: [2022-11-27 02:17:12,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:17:12,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
12: [2022-11-27 02:17:12,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 5: [2022-11-27 02:17:12,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 12: [2022-11-27 02:17:12,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 5: [2022-11-27 02:17:12,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 27: [2022-11-27 02:17:12,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:17:12,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 02:17:12,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 21: [2022-11-27 02:17:12,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:17:12,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 02:17:12,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 25: [2022-11-27 02:17:12,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
25: [2022-11-27 02:17:12,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 02:17:12,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 6: [2022-11-27 02:17:12,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:17:12,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 02:17:12,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:17:12,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 6: [2022-11-27 02:17:12,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 02:17:12,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 16: [2022-11-27 02:17:12,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:17:12,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 02:17:12,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 11: [2022-11-27 02:17:12,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
0: [2022-11-27 02:17:12,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:17:12,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 02:17:12,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 22: [2022-11-27 02:17:12,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:17:12,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 9: [2022-11-27 02:17:12,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:17:12,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 0: [2022-11-27 02:17:12,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 9: [2022-11-27 02:17:12,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 22: [2022-11-27 02:17:12,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 9: [2022-11-27 02:17:12,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 2: [2022-11-27 02:17:12,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-27 02:17:12,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 4: [2022-11-27 02:17:12,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:17:12,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 4: [2022-11-27 02:17:12,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 02:17:12,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 14: [2022-11-27 02:17:12,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:17:12,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 02:17:12,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 30: [2022-11-27 02:17:12,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:17:12,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 02:17:12,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 24: [2022-11-27 02:17:12,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
24: [2022-11-27 02:17:12,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 1: [2022-11-27 02:17:12,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:17:12,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 1: [2022-11-27 02:17:12,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 02:17:12,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 8: [2022-11-27 02:17:12,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:17:12,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 02:17:12,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 19: [2022-11-27 02:17:12,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:17:12,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-27 02:17:12,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 20: [2022-11-27 02:17:12,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
20: [2022-11-27 02:17:12,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 02:17:12,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 29: [2022-11-27 02:17:12,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:17:12,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:17:12,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 02:17:12,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 02:17:12,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 15: [2022-11-27 02:17:12,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:17:12,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 15: [2022-11-27 02:17:12,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 02:17:12,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 18: [2022-11-27 02:17:12,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
18: [2022-11-27 02:17:12,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 02:17:12,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 21: [2022-11-27 02:17:12,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:17:12,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 23: [2022-11-27 02:17:12,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:17:12,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 23: [2022-11-27 02:17:12,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 02:17:12,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 0: [2022-11-27 02:17:12,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:17:12,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-27 02:17:12,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 0: [2022-11-27 02:17:12,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 15: [2022-11-27 02:17:12,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 0: [2022-11-27 02:17:12,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 12: [2022-11-27 02:17:12,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:17:12,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 14: [2022-11-27 02:17:12,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:17:12,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 14: [2022-11-27 02:17:12,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 02:17:12,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 26: [2022-11-27 02:17:12,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-27 02:17:12,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 02:17:12,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 24: [2022-11-27 02:17:12,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:17:12,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 02:17:12,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 5: [2022-11-27 02:17:12,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:17:12,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 02:17:12,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 10: [2022-11-27 02:17:12,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:17:12,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:17:12,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 02:17:12,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 
17: [2022-11-27 02:17:12,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 02:17:12,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 17: [2022-11-27 02:17:12,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:17:12,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 02:17:12,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 27: [2022-11-27 02:17:12,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:17:12,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 02:17:12,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 30: [2022-11-27 02:17:12,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:17:12,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 02:17:12,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 2: [2022-11-27 02:17:12,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-27 02:17:12,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 02:17:12,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 9: [2022-11-27 02:17:12,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:17:12,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:17:12,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 8: [2022-11-27 02:17:12,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 6: [2022-11-27 02:17:12,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:17:12,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:17:12,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 8: [2022-11-27 02:17:12,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 
6: [2022-11-27 02:17:12,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 13: [2022-11-27 02:17:12,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:17:12,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 4: [2022-11-27 02:17:12,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:17:12,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 6: [2022-11-27 02:17:12,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 7: [2022-11-27 02:17:12,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:17:12,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 7: [2022-11-27 02:17:12,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 4: [2022-11-27 02:17:12,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 7: [2022-11-27 02:17:12,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 4: [2022-11-27 02:17:12,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 
7: [2022-11-27 02:17:12,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 28: [2022-11-27 02:17:12,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:17:12,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:17:12,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 02:17:12,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 16: [2022-11-27 02:17:12,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:17:12,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 02:17:12,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 19: [2022-11-27 02:17:12,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:17:12,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 02:17:12,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 29: [2022-11-27 02:17:12,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-27 02:17:12,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 02:17:12,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 24: [2022-11-27 02:17:12,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:17:12,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:17:12,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 23: [2022-11-27 02:17:12,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:17:12,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 24: [2022-11-27 02:17:12,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 17: [2022-11-27 02:17:12,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:17:12,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 24: [2022-11-27 02:17:12,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 23: [2022-11-27 02:17:12,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 
17: [2022-11-27 02:17:12,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 18: [2022-11-27 02:17:12,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:17:12,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 18: [2022-11-27 02:17:12,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 02:17:12,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 13: [2022-11-27 02:17:12,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:17:12,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 02:17:12,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 22: [2022-11-27 02:17:12,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:17:12,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 02:17:12,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 21: [2022-11-27 02:17:12,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
21: [2022-11-27 02:17:12,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 02:17:12,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 11: [2022-11-27 02:17:12,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:17:12,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 02:17:12,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 15: [2022-11-27 02:17:12,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:17:12,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 02:17:12,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 28: [2022-11-27 02:17:12,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 02:17:12,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 31: [2022-11-27 02:17:12,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:17:12,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 
31: [2022-11-27 02:17:12,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 02:17:12,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 02:17:12,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 20: [2022-11-27 02:17:12,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:17:12,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 20: [2022-11-27 02:17:12,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 31: [2022-11-27 02:17:12,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:17:12,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 31: [2022-11-27 02:17:12,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 02:17:12,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 30: [2022-11-27 02:17:12,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 
30: [2022-11-27 02:17:12,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 02:17:12,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 10: [2022-11-27 02:17:12,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:17:12,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 02:17:12,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 16: [2022-11-27 02:17:12,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:17:12,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 14: [2022-11-27 02:17:12,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:17:12,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 14: [2022-11-27 02:17:12,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 02:17:12,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 26: [2022-11-27 02:17:12,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
26: [2022-11-27 02:17:12,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 02:17:12,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 12: [2022-11-27 02:17:12,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:17:12,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 02:17:12,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 27: [2022-11-27 02:17:12,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:17:12,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 02:17:12,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 11: [2022-11-27 02:17:12,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:17:12,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 02:17:12,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 9: [2022-11-27 02:17:12,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-27 02:17:12,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 02:17:12,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 3: [2022-11-27 02:17:12,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:17:12,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:17:12,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:17:12,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 02:17:12,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 02:17:12,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 02:17:12,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 3: [2022-11-27 02:17:12,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 3: [2022-11-27 02:17:12,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 5: [2022-11-27 02:17:12,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-27 02:17:12,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 02:17:12,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 28: [2022-11-27 02:17:12,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:17:12,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:17:12,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 17: [2022-11-27 02:17:12,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 28: [2022-11-27 02:17:12,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 17: [2022-11-27 02:17:12,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 24: [2022-11-27 02:17:12,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:17:12,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 02:17:12,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 0: [2022-11-27 02:17:12,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-27 02:17:12,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 02:17:12,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 15: [2022-11-27 02:17:12,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:17:12,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 02:17:12,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 6: [2022-11-27 02:17:12,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:17:12,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 02:17:12,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 25: [2022-11-27 02:17:12,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:17:12,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-27 02:17:12,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 4: [2022-11-27 02:17:12,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-27 02:17:12,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 02:17:12,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 26: [2022-11-27 02:17:12,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:17:12,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 02:17:12,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 8: [2022-11-27 02:17:12,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:17:12,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 02:17:12,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 20: [2022-11-27 02:17:12,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:17:12,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 02:17:12,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 23: [2022-11-27 02:17:12,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-27 02:17:12,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 02:17:12,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 2: [2022-11-27 02:17:12,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:17:12,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 02:17:12,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 29: [2022-11-27 02:17:12,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:17:12,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 02:17:12,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 3: [2022-11-27 02:17:12,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:17:12,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 02:17:12,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 22: [2022-11-27 02:17:12,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
22: [2022-11-27 02:17:12,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 02:17:12,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 19: [2022-11-27 02:17:12,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:17:12,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 02:17:12,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 18: [2022-11-27 02:17:12,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:17:12,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 02:17:12,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 12: [2022-11-27 02:17:12,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:17:12,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 02:17:12,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 1: [2022-11-27 02:17:12,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-27 02:17:12,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 02:17:12,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 10: [2022-11-27 02:17:12,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:17:12,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 02:17:12,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 27: [2022-11-27 02:17:12,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:17:12,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 02:17:12,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 31: [2022-11-27 02:17:12,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:17:12,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 02:17:12,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 7: [2022-11-27 02:17:12,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-27 02:17:12,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 02:17:12,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 21: [2022-11-27 02:17:12,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:17:12,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 02:17:12,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 13: [2022-11-27 02:17:12,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:17:12,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 02:17:12,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 11: [2022-11-27 02:17:12,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:17:12,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
11: [2022-11-27 02:17:12,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 30: [2022-11-27 02:17:12,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 11: [2022-11-27 02:17:12,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 30: [2022-11-27 02:17:12,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 16: [2022-11-27 02:17:12,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:17:12,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 02:17:12,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 9: [2022-11-27 02:17:12,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:17:12,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 02:17:12,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 28: [2022-11-27 02:17:12,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-27 02:17:12,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 02:17:12,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 5: [2022-11-27 02:17:12,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:17:12,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 02:17:12,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 0: [2022-11-27 02:17:12,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:17:12,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 02:17:12,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 6: [2022-11-27 02:17:12,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:17:12,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 02:17:12,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 24: [2022-11-27 02:17:12,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
24: [2022-11-27 02:17:12,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 15: [2022-11-27 02:17:12,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:17:12,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 15: [2022-11-27 02:17:12,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 02:17:12,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 14: [2022-11-27 02:17:12,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:17:12,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 02:17:12,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 25: [2022-11-27 02:17:12,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:17:12,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 02:17:12,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 17: [2022-11-27 02:17:12,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
4: [2022-11-27 02:17:12,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:17:12,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 4: [2022-11-27 02:17:12,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 17: [2022-11-27 02:17:12,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 4: [2022-11-27 02:17:12,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 2: [2022-11-27 02:17:12,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:17:12,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 02:17:12,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 26: [2022-11-27 02:17:12,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:17:12,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-27 02:17:12,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 22: [2022-11-27 02:17:12,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
22: [2022-11-27 02:17:12,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 02:17:12,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 20: [2022-11-27 02:17:12,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:17:12,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 02:17:12,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 29: [2022-11-27 02:17:12,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:17:12,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 02:17:12,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 8: [2022-11-27 02:17:12,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:17:12,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 02:17:12,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 19: [2022-11-27 02:17:12,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
19: [2022-11-27 02:17:12,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 02:17:12,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 23: [2022-11-27 02:17:12,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:17:12,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 02:17:12,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 3: [2022-11-27 02:17:12,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:17:12,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 02:17:12,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 12: [2022-11-27 02:17:12,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:17:12,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 02:17:12,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 1: [2022-11-27 02:17:12,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-27 02:17:12,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 02:17:12,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 27: [2022-11-27 02:17:12,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:17:12,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 02:17:12,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 10: [2022-11-27 02:17:12,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:17:12,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 02:17:12,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 21: [2022-11-27 02:17:12,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:17:12,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 02:17:12,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 13: [2022-11-27 02:17:12,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-27 02:17:12,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 02:17:12,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 30: [2022-11-27 02:17:12,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:17:12,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 02:17:12,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 11: [2022-11-27 02:17:12,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:17:12,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 02:17:12,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 18: [2022-11-27 02:17:12,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:17:12,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 02:17:12,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 7: [2022-11-27 02:17:12,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-27 02:17:12,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 02:17:12,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 16: [2022-11-27 02:17:12,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:17:12,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 02:17:12,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 28: [2022-11-27 02:17:12,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:17:12,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 02:17:12,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 6: [2022-11-27 02:17:12,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:17:12,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 9: [2022-11-27 02:17:12,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
15: [2022-11-27 02:17:12,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:17:12,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 9: [2022-11-27 02:17:12,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 15: [2022-11-27 02:17:12,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 02:17:12,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 9: [2022-11-27 02:17:12,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 31: [2022-11-27 02:17:12,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:17:12,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 02:17:12,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 17: [2022-11-27 02:17:12,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:17:12,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
17: [2022-11-27 02:17:12,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 24: [2022-11-27 02:17:12,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 17: [2022-11-27 02:17:12,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 24: [2022-11-27 02:17:12,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 26: [2022-11-27 02:17:12,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:17:12,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:17:12,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 25: [2022-11-27 02:17:12,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 26: [2022-11-27 02:17:12,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 25: [2022-11-27 02:17:12,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 4: [2022-11-27 02:17:12,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-27 02:17:12,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 14: [2022-11-27 02:17:12,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:17:12,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 02:17:12,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 4: [2022-11-27 02:17:12,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 2: [2022-11-27 02:17:12,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:17:12,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 02:17:12,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 29: [2022-11-27 02:17:12,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:17:12,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 02:17:12,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 8: [2022-11-27 02:17:12,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-27 02:17:12,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 02:17:12,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 22: [2022-11-27 02:17:12,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:17:12,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 02:17:12,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 0: [2022-11-27 02:17:12,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:17:12,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 02:17:12,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 1: [2022-11-27 02:17:12,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:17:12,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 19: [2022-11-27 02:17:12,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:17:12,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 
3: [2022-11-27 02:17:12,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:17:12,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 02:17:12,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 19: [2022-11-27 02:17:12,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 02:17:12,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 20: [2022-11-27 02:17:12,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:17:12,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 02:17:12,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 23: [2022-11-27 02:17:12,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:17:12,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 02:17:12,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 5: [2022-11-27 02:17:12,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-27 02:17:12,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 02:17:12,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 7: [2022-11-27 02:17:12,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:17:12,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 02:17:12,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 12: [2022-11-27 02:17:12,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:17:12,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 02:17:12,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 31: [2022-11-27 02:17:12,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:17:12,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 02:17:12,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 10: [2022-11-27 02:17:12,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-27 02:17:12,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 02:17:12,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 30: [2022-11-27 02:17:12,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:17:12,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 02:17:12,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 11: [2022-11-27 02:17:12,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:17:12,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 02:17:12,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 27: [2022-11-27 02:17:12,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:17:12,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 02:17:12,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 21: [2022-11-27 02:17:12,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-27 02:17:12,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 02:17:12,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 13: [2022-11-27 02:17:12,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:17:12,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 02:17:12,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 16: [2022-11-27 02:17:12,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:17:12,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 02:17:12,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 18: [2022-11-27 02:17:12,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:17:12,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:17:12,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 02:17:12,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 
28: [2022-11-27 02:17:12,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 02:17:12,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 9: [2022-11-27 02:17:12,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:17:12,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 02:17:12,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 0: [2022-11-27 02:17:12,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:17:12,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:17:12,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 02:17:12,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 24: [2022-11-27 02:17:12,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:17:12,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 02:17:12,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 
14: [2022-11-27 02:17:12,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:17:12,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 02:17:12,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 25: [2022-11-27 02:17:12,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:17:12,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 02:17:12,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 15: [2022-11-27 02:17:12,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:17:12,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 02:17:12,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 20: [2022-11-27 02:17:12,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:17:12,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 02:17:12,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 
22: [2022-11-27 02:17:12,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:17:12,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 02:17:12,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 26: [2022-11-27 02:17:12,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:17:12,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 02:17:12,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 4: [2022-11-27 02:17:12,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:17:12,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 02:17:12,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 17: [2022-11-27 02:17:12,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:17:12,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 02:17:12,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 
2: [2022-11-27 02:17:12,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:17:12,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:17:12,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 29: [2022-11-27 02:17:12,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 2: [2022-11-27 02:17:12,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 29: [2022-11-27 02:17:12,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 0: [2022-11-27 02:17:12,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 02:17:12,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 19: [2022-11-27 02:17:12,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:17:12,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 02:17:12,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 5: [2022-11-27 02:17:12,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-27 02:17:12,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 02:17:12,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 23: [2022-11-27 02:17:12,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:17:12,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 02:17:12,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 18: [2022-11-27 02:17:12,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:17:12,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-27 02:17:12,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 7: [2022-11-27 02:17:12,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:17:12,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 02:17:12,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 3: [2022-11-27 02:17:12,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-27 02:17:12,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 02:17:12,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 11: [2022-11-27 02:17:12,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:17:12,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 02:17:12,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 1: [2022-11-27 02:17:12,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:17:12,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 02:17:12,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 12: [2022-11-27 02:17:12,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:17:12,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 02:17:12,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 30: [2022-11-27 02:17:12,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
30: [2022-11-27 02:17:12,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 02:17:12,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 21: [2022-11-27 02:17:12,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:17:12,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 02:17:12,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 13: [2022-11-27 02:17:12,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:17:12,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 02:17:12,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 27: [2022-11-27 02:17:12,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:17:12,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 02:17:12,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 10: [2022-11-27 02:17:12,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-27 02:17:12,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 02:17:12,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 24: [2022-11-27 02:17:12,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:17:12,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 02:17:12,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 22: [2022-11-27 02:17:12,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:17:12,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 02:17:12,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 7: [2022-11-27 02:17:12,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:17:12,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:17:12,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 02:17:12,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 
17: [2022-11-27 02:17:12,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 02:17:12,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 28: [2022-11-27 02:17:12,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:17:12,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 02:17:12,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 15: [2022-11-27 02:17:12,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:17:12,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:17:12,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 02:17:12,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 23: [2022-11-27 02:17:12,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 02:17:12,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 3: [2022-11-27 02:17:12,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-27 02:17:12,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 02:17:12,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 14: [2022-11-27 02:17:12,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:17:12,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 02:17:12,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 26: [2022-11-27 02:17:12,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:17:12,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:17:12,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 02:17:12,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 20: [2022-11-27 02:17:12,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 8: [2022-11-27 02:17:12,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:17:12,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 
30: [2022-11-27 02:17:12,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:17:12,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 13: [2022-11-27 02:17:12,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:17:12,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 8: [2022-11-27 02:17:12,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 13: [2022-11-27 02:17:12,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 02:17:12,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 30: [2022-11-27 02:17:12,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 9: [2022-11-27 02:17:12,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:17:12,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 02:17:12,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 4: [2022-11-27 02:17:12,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
25: [2022-11-27 02:17:12,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:17:12,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:17:12,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:17:12,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 21: [2022-11-27 02:17:12,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 4: [2022-11-27 02:17:12,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 25: [2022-11-27 02:17:12,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 27: [2022-11-27 02:17:12,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 12: [2022-11-27 02:17:12,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:17:12,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 21: [2022-11-27 02:17:12,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 27: [2022-11-27 02:17:12,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 
12: [2022-11-27 02:17:12,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 02:17:12,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 29: [2022-11-27 02:17:12,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:17:12,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 02:17:12,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 1: [2022-11-27 02:17:12,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:17:12,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 19: [2022-11-27 02:17:12,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:17:12,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 19: [2022-11-27 02:17:12,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 02:17:12,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 8: [2022-11-27 02:17:12,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-27 02:17:12,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 11: [2022-11-27 02:17:12,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:17:12,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 31: [2022-11-27 02:17:12,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:17:12,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 02:17:12,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 31: [2022-11-27 02:17:12,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:17:12,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 02:17:12,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 31: [2022-11-27 02:17:12,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 02:17:12,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 5: [2022-11-27 02:17:12,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-27 02:17:12,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 02:17:12,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 2: [2022-11-27 02:17:12,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:17:12,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 0: [2022-11-27 02:17:12,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:17:12,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 0: [2022-11-27 02:17:12,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 02:17:12,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 25: [2022-11-27 02:17:12,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:17:12,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 02:17:12,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 10: [2022-11-27 02:17:12,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-27 02:17:12,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 02:17:12,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 18: [2022-11-27 02:17:12,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:17:12,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 02:17:12,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 14: [2022-11-27 02:17:12,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:17:12,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 02:17:12,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 16: [2022-11-27 02:17:12,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:17:12,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 02:17:12,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:17:12,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 
16: [2022-11-27 02:17:12,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 02:17:12,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 6: [2022-11-27 02:17:12,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:17:12,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step143000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 02:17:12,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step143000 is ready now! 0: successfully saved checkpoint at iteration 143000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2537.18 31: iteration 143010/ 173500 | consumed samples: 36610560 | consumed tokens: 74978426880 | elapsed time per iteration (s): 1.03 | learning rate: 3.364E-05 | global batch size: 256 | lm loss: 1.926669E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.453 | TFLOPs: 14.97 | 31: iteration 143020/ 173500 | consumed samples: 36613120 | consumed tokens: 74983669760 | elapsed time per iteration (s): 0.79 | learning rate: 3.363E-05 | global batch size: 256 | lm loss: 1.939696E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.497 | TFLOPs: 19.69 | 31: iteration 143030/ 173500 | consumed samples: 36615680 | consumed tokens: 74988912640 | elapsed time per iteration (s): 0.78 | learning rate: 3.362E-05 | global batch size: 256 | lm loss: 1.937796E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.469 | TFLOPs: 19.75 | 31: iteration 143040/ 
173500 | consumed samples: 36618240 | consumed tokens: 74994155520 | elapsed time per iteration (s): 0.76 | learning rate: 3.361E-05 | global batch size: 256 | lm loss: 1.964785E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.137 | TFLOPs: 20.34 | 31: iteration 143050/ 173500 | consumed samples: 36620800 | consumed tokens: 74999398400 | elapsed time per iteration (s): 0.74 | learning rate: 3.360E-05 | global batch size: 256 | lm loss: 1.902650E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.523 | TFLOPs: 21.02 | 31: iteration 143060/ 173500 | consumed samples: 36623360 | consumed tokens: 75004641280 | elapsed time per iteration (s): 0.74 | learning rate: 3.359E-05 | global batch size: 256 | lm loss: 1.911470E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.982 | TFLOPs: 21.05 | 31: iteration 143070/ 173500 | consumed samples: 36625920 | consumed tokens: 75009884160 | elapsed time per iteration (s): 0.80 | learning rate: 3.358E-05 | global batch size: 256 | lm loss: 1.920903E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.686 | TFLOPs: 19.34 | 31: iteration 143080/ 173500 | consumed samples: 36628480 | consumed tokens: 75015127040 | elapsed time per iteration (s): 0.77 | learning rate: 3.358E-05 | global batch size: 256 | lm loss: 1.931185E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.570 | TFLOPs: 20.24 | 31: iteration 143090/ 173500 | consumed samples: 36631040 | consumed tokens: 75020369920 | elapsed time per iteration (s): 0.72 | learning rate: 3.357E-05 | global batch size: 256 | lm loss: 1.910532E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 354.751 | TFLOPs: 21.46 | 31: iteration 143100/ 173500 | consumed samples: 36633600 | consumed tokens: 75025612800 | elapsed time per iteration (s): 0.77 | learning rate: 3.356E-05 | global batch size: 256 | lm loss: 1.925588E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.379 | TFLOPs: 20.11 | 31: iteration 143110/ 173500 | consumed samples: 36636160 | consumed tokens: 75030855680 | elapsed time per iteration (s): 0.82 | learning rate: 3.355E-05 | global batch size: 256 | lm loss: 1.924817E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.641 | TFLOPs: 18.79 | 31: iteration 143120/ 173500 | consumed samples: 36638720 | consumed tokens: 75036098560 | elapsed time per iteration (s): 0.74 | learning rate: 3.354E-05 | global batch size: 256 | lm loss: 1.912893E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.028 | TFLOPs: 20.99 | 31: iteration 143130/ 173500 | consumed samples: 36641280 | consumed tokens: 75041341440 | elapsed time per iteration (s): 0.74 | learning rate: 3.353E-05 | global batch size: 256 | lm loss: 1.913866E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.843 | TFLOPs: 20.86 | 31: iteration 143140/ 173500 | consumed samples: 36643840 | consumed tokens: 75046584320 | elapsed time per iteration (s): 0.78 | learning rate: 3.352E-05 | global batch size: 256 | lm loss: 1.901588E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.342 | TFLOPs: 19.92 | 31: iteration 143150/ 173500 | consumed samples: 36646400 | consumed tokens: 75051827200 | elapsed time per iteration (s): 0.75 | learning rate: 
3.351E-05 | global batch size: 256 | lm loss: 1.903166E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.385 | TFLOPs: 20.65 | 31: iteration 143160/ 173500 | consumed samples: 36648960 | consumed tokens: 75057070080 | elapsed time per iteration (s): 0.73 | learning rate: 3.351E-05 | global batch size: 256 | lm loss: 1.922914E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.664 | TFLOPs: 21.21 | 31: iteration 143170/ 173500 | consumed samples: 36651520 | consumed tokens: 75062312960 | elapsed time per iteration (s): 0.85 | learning rate: 3.350E-05 | global batch size: 256 | lm loss: 1.919300E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.590 | TFLOPs: 18.18 | 31: iteration 143180/ 173500 | consumed samples: 36654080 | consumed tokens: 75067555840 | elapsed time per iteration (s): 0.78 | learning rate: 3.349E-05 | global batch size: 256 | lm loss: 1.899172E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.443 | TFLOPs: 19.87 | 31: iteration 143190/ 173500 | consumed samples: 36656640 | consumed tokens: 75072798720 | elapsed time per iteration (s): 0.77 | learning rate: 3.348E-05 | global batch size: 256 | lm loss: 1.928768E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.434 | TFLOPs: 20.23 | 31: iteration 143200/ 173500 | consumed samples: 36659200 | consumed tokens: 75078041600 | elapsed time per iteration (s): 0.90 | learning rate: 3.347E-05 | global batch size: 256 | lm loss: 1.932910E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.924 | TFLOPs: 17.30 | 31: iteration 143210/ 173500 | 
consumed samples: 36661760 | consumed tokens: 75083284480 | elapsed time per iteration (s): 0.75 | learning rate: 3.346E-05 | global batch size: 256 | lm loss: 1.902311E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.429 | TFLOPs: 20.60 | 31: iteration 143220/ 173500 | consumed samples: 36664320 | consumed tokens: 75088527360 | elapsed time per iteration (s): 0.81 | learning rate: 3.345E-05 | global batch size: 256 | lm loss: 1.947531E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.971 | TFLOPs: 19.18 | 31: iteration 143230/ 173500 | consumed samples: 36666880 | consumed tokens: 75093770240 | elapsed time per iteration (s): 0.78 | learning rate: 3.344E-05 | global batch size: 256 | lm loss: 1.887036E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.110 | TFLOPs: 19.85 | 31: iteration 143240/ 173500 | consumed samples: 36669440 | consumed tokens: 75099013120 | elapsed time per iteration (s): 0.77 | learning rate: 3.344E-05 | global batch size: 256 | lm loss: 1.941442E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.151 | TFLOPs: 20.09 | 31: iteration 143250/ 173500 | consumed samples: 36672000 | consumed tokens: 75104256000 | elapsed time per iteration (s): 0.77 | learning rate: 3.343E-05 | global batch size: 256 | lm loss: 1.932559E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.293 | TFLOPs: 20.04 | 31: iteration 143260/ 173500 | consumed samples: 36674560 | consumed tokens: 75109498880 | elapsed time per iteration (s): 0.78 | learning rate: 3.342E-05 | global batch size: 256 | lm loss: 1.920795E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 329.177 | TFLOPs: 19.91 | 31: iteration 143270/ 173500 | consumed samples: 36677120 | consumed tokens: 75114741760 | elapsed time per iteration (s): 0.82 | learning rate: 3.341E-05 | global batch size: 256 | lm loss: 1.930728E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.778 | TFLOPs: 18.98 | 31: iteration 143280/ 173500 | consumed samples: 36679680 | consumed tokens: 75119984640 | elapsed time per iteration (s): 0.88 | learning rate: 3.340E-05 | global batch size: 256 | lm loss: 1.919481E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.458 | TFLOPs: 17.63 | 31: iteration 143290/ 173500 | consumed samples: 36682240 | consumed tokens: 75125227520 | elapsed time per iteration (s): 0.77 | learning rate: 3.339E-05 | global batch size: 256 | lm loss: 1.925189E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.144 | TFLOPs: 20.09 | 31: iteration 143300/ 173500 | consumed samples: 36684800 | consumed tokens: 75130470400 | elapsed time per iteration (s): 0.79 | learning rate: 3.338E-05 | global batch size: 256 | lm loss: 1.946396E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.365 | TFLOPs: 19.56 | 31: iteration 143310/ 173500 | consumed samples: 36687360 | consumed tokens: 75135713280 | elapsed time per iteration (s): 0.77 | learning rate: 3.338E-05 | global batch size: 256 | lm loss: 1.899779E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.716 | TFLOPs: 20.13 | 31: iteration 143320/ 173500 | consumed samples: 36689920 | consumed tokens: 75140956160 | elapsed time per iteration (s): 0.77 | learning rate: 
3.337E-05 | global batch size: 256 | lm loss: 1.910726E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.442 | TFLOPs: 20.11 | 31: iteration 143330/ 173500 | consumed samples: 36692480 | consumed tokens: 75146199040 | elapsed time per iteration (s): 0.76 | learning rate: 3.336E-05 | global batch size: 256 | lm loss: 1.920290E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.915 | TFLOPs: 20.26 | 31: iteration 143340/ 173500 | consumed samples: 36695040 | consumed tokens: 75151441920 | elapsed time per iteration (s): 0.81 | learning rate: 3.335E-05 | global batch size: 256 | lm loss: 1.938917E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.391 | TFLOPs: 19.02 | 31: iteration 143350/ 173500 | consumed samples: 36697600 | consumed tokens: 75156684800 | elapsed time per iteration (s): 0.82 | learning rate: 3.334E-05 | global batch size: 256 | lm loss: 1.893746E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.991 | TFLOPs: 19.00 | 31: iteration 143360/ 173500 | consumed samples: 36700160 | consumed tokens: 75161927680 | elapsed time per iteration (s): 0.80 | learning rate: 3.333E-05 | global batch size: 256 | lm loss: 1.920659E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.427 | TFLOPs: 19.26 | 31: iteration 143370/ 173500 | consumed samples: 36702720 | consumed tokens: 75167170560 | elapsed time per iteration (s): 0.80 | learning rate: 3.332E-05 | global batch size: 256 | lm loss: 1.915348E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.724 | TFLOPs: 19.46 | 31: iteration 143380/ 173500 | 
consumed samples: 36705280 | consumed tokens: 75172413440 | elapsed time per iteration (s): 0.81 | learning rate: 3.332E-05 | global batch size: 256 | lm loss: 1.933917E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.645 | TFLOPs: 19.04 | 31: iteration 143390/ 173500 | consumed samples: 36707840 | consumed tokens: 75177656320 | elapsed time per iteration (s): 0.81 | learning rate: 3.331E-05 | global batch size: 256 | lm loss: 1.933875E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.064 | TFLOPs: 19.18 | 31: iteration 143400/ 173500 | consumed samples: 36710400 | consumed tokens: 75182899200 | elapsed time per iteration (s): 0.80 | learning rate: 3.330E-05 | global batch size: 256 | lm loss: 1.922741E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.141 | TFLOPs: 19.43 | 31: iteration 143410/ 173500 | consumed samples: 36712960 | consumed tokens: 75188142080 | elapsed time per iteration (s): 0.79 | learning rate: 3.329E-05 | global batch size: 256 | lm loss: 1.914703E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.858 | TFLOPs: 19.71 | 31: iteration 143420/ 173500 | consumed samples: 36715520 | consumed tokens: 75193384960 | elapsed time per iteration (s): 0.79 | learning rate: 3.328E-05 | global batch size: 256 | lm loss: 1.942501E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.082 | TFLOPs: 19.49 | 31: iteration 143430/ 173500 | consumed samples: 36718080 | consumed tokens: 75198627840 | elapsed time per iteration (s): 0.79 | learning rate: 3.327E-05 | global batch size: 256 | lm loss: 1.908848E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 324.242 | TFLOPs: 19.62 | 31: iteration 143440/ 173500 | consumed samples: 36720640 | consumed tokens: 75203870720 | elapsed time per iteration (s): 0.79 | learning rate: 3.326E-05 | global batch size: 256 | lm loss: 1.935063E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.170 | TFLOPs: 19.67 | 31: iteration 143450/ 173500 | consumed samples: 36723200 | consumed tokens: 75209113600 | elapsed time per iteration (s): 0.78 | learning rate: 3.326E-05 | global batch size: 256 | lm loss: 1.911749E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.842 | TFLOPs: 19.77 | 31: iteration 143460/ 173500 | consumed samples: 36725760 | consumed tokens: 75214356480 | elapsed time per iteration (s): 0.76 | learning rate: 3.325E-05 | global batch size: 256 | lm loss: 1.958121E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.975 | TFLOPs: 20.33 | 31: iteration 143470/ 173500 | consumed samples: 36728320 | consumed tokens: 75219599360 | elapsed time per iteration (s): 0.78 | learning rate: 3.324E-05 | global batch size: 256 | lm loss: 1.923987E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.466 | TFLOPs: 19.81 | 31: iteration 143480/ 173500 | consumed samples: 36730880 | consumed tokens: 75224842240 | elapsed time per iteration (s): 0.75 | learning rate: 3.323E-05 | global batch size: 256 | lm loss: 1.970741E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.894 | TFLOPs: 20.62 | 31: iteration 143490/ 173500 | consumed samples: 36733440 | consumed tokens: 75230085120 | elapsed time per iteration (s): 0.78 | learning rate: 
3.322E-05 | global batch size: 256 | lm loss: 1.934393E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.433 | TFLOPs: 19.93 | 31: iteration 143500/ 173500 | consumed samples: 36736000 | consumed tokens: 75235328000 | elapsed time per iteration (s): 0.74 | learning rate: 3.321E-05 | global batch size: 256 | lm loss: 1.913691E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.903 | TFLOPs: 20.81 | 31: iteration 143510/ 173500 | consumed samples: 36738560 | consumed tokens: 75240570880 | elapsed time per iteration (s): 0.71 | learning rate: 3.320E-05 | global batch size: 256 | lm loss: 1.928527E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 361.006 | TFLOPs: 21.84 | 31: iteration 143520/ 173500 | consumed samples: 36741120 | consumed tokens: 75245813760 | elapsed time per iteration (s): 0.75 | learning rate: 3.320E-05 | global batch size: 256 | lm loss: 1.920114E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.836 | TFLOPs: 20.56 | 31: iteration 143530/ 173500 | consumed samples: 36743680 | consumed tokens: 75251056640 | elapsed time per iteration (s): 0.71 | learning rate: 3.319E-05 | global batch size: 256 | lm loss: 1.910984E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 360.962 | TFLOPs: 21.84 | 31: iteration 143540/ 173500 | consumed samples: 36746240 | consumed tokens: 75256299520 | elapsed time per iteration (s): 0.76 | learning rate: 3.318E-05 | global batch size: 256 | lm loss: 1.910667E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.295 | TFLOPs: 20.34 | 31: iteration 143550/ 173500 | 
consumed samples: 36748800 | consumed tokens: 75261542400 | elapsed time per iteration (s): 0.78 | learning rate: 3.317E-05 | global batch size: 256 | lm loss: 1.909914E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.029 | TFLOPs: 19.97 | 31: iteration 143560/ 173500 | consumed samples: 36751360 | consumed tokens: 75266785280 | elapsed time per iteration (s): 0.78 | learning rate: 3.316E-05 | global batch size: 256 | lm loss: 1.931571E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.477 | TFLOPs: 19.93 | 31: iteration 143570/ 173500 | consumed samples: 36753920 | consumed tokens: 75272028160 | elapsed time per iteration (s): 0.80 | learning rate: 3.315E-05 | global batch size: 256 | lm loss: 1.959295E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.497 | TFLOPs: 19.33 | 31: iteration 143580/ 173500 | consumed samples: 36756480 | consumed tokens: 75277271040 | elapsed time per iteration (s): 0.76 | learning rate: 3.314E-05 | global batch size: 256 | lm loss: 1.951241E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.155 | TFLOPs: 20.40 | 31: iteration 143590/ 173500 | consumed samples: 36759040 | consumed tokens: 75282513920 | elapsed time per iteration (s): 0.75 | learning rate: 3.314E-05 | global batch size: 256 | lm loss: 1.937937E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.344 | TFLOPs: 20.59 | 31: iteration 143600/ 173500 | consumed samples: 36761600 | consumed tokens: 75287756800 | elapsed time per iteration (s): 0.79 | learning rate: 3.313E-05 | global batch size: 256 | lm loss: 1.928687E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 324.333 | TFLOPs: 19.62 | 31: iteration 143610/ 173500 | consumed samples: 36764160 | consumed tokens: 75292999680 | elapsed time per iteration (s): 0.77 | learning rate: 3.312E-05 | global batch size: 256 | lm loss: 1.910176E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.007 | TFLOPs: 20.21 | 31: iteration 143620/ 173500 | consumed samples: 36766720 | consumed tokens: 75298242560 | elapsed time per iteration (s): 0.79 | learning rate: 3.311E-05 | global batch size: 256 | lm loss: 1.935186E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.429 | TFLOPs: 19.51 | 31: iteration 143630/ 173500 | consumed samples: 36769280 | consumed tokens: 75303485440 | elapsed time per iteration (s): 0.76 | learning rate: 3.310E-05 | global batch size: 256 | lm loss: 1.910650E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.736 | TFLOPs: 20.37 | 31: iteration 143640/ 173500 | consumed samples: 36771840 | consumed tokens: 75308728320 | elapsed time per iteration (s): 0.78 | learning rate: 3.309E-05 | global batch size: 256 | lm loss: 1.917370E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.604 | TFLOPs: 19.94 | 31: iteration 143650/ 173500 | consumed samples: 36774400 | consumed tokens: 75313971200 | elapsed time per iteration (s): 0.81 | learning rate: 3.308E-05 | global batch size: 256 | lm loss: 1.923399E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.069 | TFLOPs: 19.12 | 31: iteration 143660/ 173500 | consumed samples: 36776960 | consumed tokens: 75319214080 | elapsed time per iteration (s): 0.81 | learning rate: 
3.308E-05 | global batch size: 256 | lm loss: 1.935681E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.627 | TFLOPs: 19.03 | 31: iteration 143670/ 173500 | consumed samples: 36779520 | consumed tokens: 75324456960 | elapsed time per iteration (s): 0.81 | learning rate: 3.307E-05 | global batch size: 256 | lm loss: 1.933386E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.020 | TFLOPs: 19.06 | 31: iteration 143680/ 173500 | consumed samples: 36782080 | consumed tokens: 75329699840 | elapsed time per iteration (s): 0.85 | learning rate: 3.306E-05 | global batch size: 256 | lm loss: 1.929412E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.460 | TFLOPs: 18.30 | 31: iteration 143690/ 173500 | consumed samples: 36784640 | consumed tokens: 75334942720 | elapsed time per iteration (s): 0.77 | learning rate: 3.305E-05 | global batch size: 256 | lm loss: 1.936644E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.233 | TFLOPs: 20.04 | 31: iteration 143700/ 173500 | consumed samples: 36787200 | consumed tokens: 75340185600 | elapsed time per iteration (s): 0.78 | learning rate: 3.304E-05 | global batch size: 256 | lm loss: 1.915366E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.114 | TFLOPs: 19.97 | 31: iteration 143710/ 173500 | consumed samples: 36789760 | consumed tokens: 75345428480 | elapsed time per iteration (s): 0.75 | learning rate: 3.303E-05 | global batch size: 256 | lm loss: 1.887127E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.887 | TFLOPs: 20.74 | 31: iteration 143720/ 173500 | 
consumed samples: 36792320 | consumed tokens: 75350671360 | elapsed time per iteration (s): 0.83 | learning rate: 3.302E-05 | global batch size: 256 | lm loss: 1.917899E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.211 | TFLOPs: 18.77 | 31: iteration 143730/ 173500 | consumed samples: 36794880 | consumed tokens: 75355914240 | elapsed time per iteration (s): 0.72 | learning rate: 3.302E-05 | global batch size: 256 | lm loss: 1.925024E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.157 | TFLOPs: 21.37 | 31: iteration 143740/ 173500 | consumed samples: 36797440 | consumed tokens: 75361157120 | elapsed time per iteration (s): 0.80 | learning rate: 3.301E-05 | global batch size: 256 | lm loss: 1.960667E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.181 | TFLOPs: 19.25 | 31: iteration 143750/ 173500 | consumed samples: 36800000 | consumed tokens: 75366400000 | elapsed time per iteration (s): 0.78 | learning rate: 3.300E-05 | global batch size: 256 | lm loss: 1.902499E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.820 | TFLOPs: 19.83 | 31: iteration 143760/ 173500 | consumed samples: 36802560 | consumed tokens: 75371642880 | elapsed time per iteration (s): 0.75 | learning rate: 3.299E-05 | global batch size: 256 | lm loss: 1.897467E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.535 | TFLOPs: 20.60 | 31: iteration 143770/ 173500 | consumed samples: 36805120 | consumed tokens: 75376885760 | elapsed time per iteration (s): 0.76 | learning rate: 3.298E-05 | global batch size: 256 | lm loss: 1.930332E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 338.540 | TFLOPs: 20.48 | 31: iteration 143780/ 173500 | consumed samples: 36807680 | consumed tokens: 75382128640 | elapsed time per iteration (s): 0.74 | learning rate: 3.297E-05 | global batch size: 256 | lm loss: 1.918420E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.958 | TFLOPs: 20.87 | 31: iteration 143790/ 173500 | consumed samples: 36810240 | consumed tokens: 75387371520 | elapsed time per iteration (s): 0.74 | learning rate: 3.296E-05 | global batch size: 256 | lm loss: 1.926231E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.852 | TFLOPs: 21.04 | 31: iteration 143800/ 173500 | consumed samples: 36812800 | consumed tokens: 75392614400 | elapsed time per iteration (s): 0.72 | learning rate: 3.296E-05 | global batch size: 256 | lm loss: 1.936289E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.356 | TFLOPs: 21.56 | 31: iteration 143810/ 173500 | consumed samples: 36815360 | consumed tokens: 75397857280 | elapsed time per iteration (s): 1.01 | learning rate: 3.295E-05 | global batch size: 256 | lm loss: 1.945225E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 253.293 | TFLOPs: 15.32 | 31: iteration 143820/ 173500 | consumed samples: 36817920 | consumed tokens: 75403100160 | elapsed time per iteration (s): 0.77 | learning rate: 3.294E-05 | global batch size: 256 | lm loss: 1.933479E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.422 | TFLOPs: 20.11 | 31: iteration 143830/ 173500 | consumed samples: 36820480 | consumed tokens: 75408343040 | elapsed time per iteration (s): 0.72 | learning rate: 
3.293E-05 | global batch size: 256 | lm loss: 1.885895E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.995 | TFLOPs: 21.66 | 31: iteration 143840/ 173500 | consumed samples: 36823040 | consumed tokens: 75413585920 | elapsed time per iteration (s): 0.77 | learning rate: 3.292E-05 | global batch size: 256 | lm loss: 1.924520E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.072 | TFLOPs: 20.09 | 31: iteration 143850/ 173500 | consumed samples: 36825600 | consumed tokens: 75418828800 | elapsed time per iteration (s): 0.71 | learning rate: 3.291E-05 | global batch size: 256 | lm loss: 1.938618E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.229 | TFLOPs: 21.67 | 31: iteration 143860/ 173500 | consumed samples: 36828160 | consumed tokens: 75424071680 | elapsed time per iteration (s): 0.78 | learning rate: 3.290E-05 | global batch size: 256 | lm loss: 1.932147E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.093 | TFLOPs: 19.91 | 31: iteration 143870/ 173500 | consumed samples: 36830720 | consumed tokens: 75429314560 | elapsed time per iteration (s): 0.75 | learning rate: 3.290E-05 | global batch size: 256 | lm loss: 1.938438E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.044 | TFLOPs: 20.75 | 31: iteration 143880/ 173500 | consumed samples: 36833280 | consumed tokens: 75434557440 | elapsed time per iteration (s): 0.83 | learning rate: 3.289E-05 | global batch size: 256 | lm loss: 1.910306E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.575 | TFLOPs: 18.67 | 31: iteration 143890/ 173500 | 
consumed samples: 36835840 | consumed tokens: 75439800320 | elapsed time per iteration (s): 0.77 | learning rate: 3.288E-05 | global batch size: 256 | lm loss: 1.905531E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.944 | TFLOPs: 20.02 | 31: iteration 143900/ 173500 | consumed samples: 36838400 | consumed tokens: 75445043200 | elapsed time per iteration (s): 0.77 | learning rate: 3.287E-05 | global batch size: 256 | lm loss: 1.900340E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.486 | TFLOPs: 19.99 | 31: iteration 143910/ 173500 | consumed samples: 36840960 | consumed tokens: 75450286080 | elapsed time per iteration (s): 0.79 | learning rate: 3.286E-05 | global batch size: 256 | lm loss: 1.892346E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.053 | TFLOPs: 19.60 | 31: iteration 143920/ 173500 | consumed samples: 36843520 | consumed tokens: 75455528960 | elapsed time per iteration (s): 0.86 | learning rate: 3.285E-05 | global batch size: 256 | lm loss: 1.923088E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.754 | TFLOPs: 18.07 | 31: iteration 143930/ 173500 | consumed samples: 36846080 | consumed tokens: 75460771840 | elapsed time per iteration (s): 0.83 | learning rate: 3.285E-05 | global batch size: 256 | lm loss: 1.929948E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.525 | TFLOPs: 18.66 | 31: iteration 143940/ 173500 | consumed samples: 36848640 | consumed tokens: 75466014720 | elapsed time per iteration (s): 0.75 | learning rate: 3.284E-05 | global batch size: 256 | lm loss: 1.904899E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 340.385 | TFLOPs: 20.59 |
31: iteration 143950/ 173500 | consumed samples: 36851200 | consumed tokens: 75471257600 | elapsed time per iteration (s): 0.79 | learning rate: 3.283E-05 | global batch size: 256 | lm loss: 1.945650E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.271 | TFLOPs: 19.56 |
31: iteration 143960/ 173500 | consumed samples: 36853760 | consumed tokens: 75476500480 | elapsed time per iteration (s): 0.79 | learning rate: 3.282E-05 | global batch size: 256 | lm loss: 1.912453E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.821 | TFLOPs: 19.71 |
31: iteration 143970/ 173500 | consumed samples: 36856320 | consumed tokens: 75481743360 | elapsed time per iteration (s): 0.80 | learning rate: 3.281E-05 | global batch size: 256 | lm loss: 1.906725E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.469 | TFLOPs: 19.39 |
31: iteration 143980/ 173500 | consumed samples: 36858880 | consumed tokens: 75486986240 | elapsed time per iteration (s): 0.78 | learning rate: 3.280E-05 | global batch size: 256 | lm loss: 1.935035E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.999 | TFLOPs: 19.84 |
31: iteration 143990/ 173500 | consumed samples: 36861440 | consumed tokens: 75492229120 | elapsed time per iteration (s): 0.78 | learning rate: 3.279E-05 | global batch size: 256 | lm loss: 1.929601E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.296 | TFLOPs: 19.92 |
0: [2022-11-27 02:30:17,583] [INFO] [logging.py:68:log_dist] [Rank 0] step=144000, skipped=0, lr=[3.278611280458685e-05, 3.278611280458685e-05, 3.278611280458685e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
31: iteration 144000/ 173500 | consumed samples: 36864000 | consumed tokens: 75497472000 | elapsed time per iteration (s): 1.25 | learning rate: 3.279E-05 | global batch size: 256 | lm loss: 1.930938E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 204.582 | TFLOPs: 12.38 |
0: steps: 144000 loss: 1.9128 iter time (s): 0.798 samples/sec: 320.684
31: --------------------------------------------------------------------------------------------
31: valid loss at iteration 144000 | lm loss value: 1.857637E+00 | lm loss PPL: 6.408578E+00 |
31: --------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 144000 to checkpoints_1b1long
0: [2022-11-27 02:30:17,837] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step144000 is begin to save!
0: [2022-11-27 02:30:17,847] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_01-model_00-model_states.pt...
0: [2022-11-27 02:30:18,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_01-model_00-model_states.pt.
0: [2022-11-27 02:30:18,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_03-model_00-model_states.pt...
0: [2022-11-27 02:30:18,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_03-model_00-model_states.pt.
0: [2022-11-27 02:30:18,212] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_04-model_00-model_states.pt...
0: [2022-11-27 02:30:18,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_04-model_00-model_states.pt.
0: [2022-11-27 02:30:18,295] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_05-model_00-model_states.pt... 0: [2022-11-27 02:30:18,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_05-model_00-model_states.pt. 0: [2022-11-27 02:30:18,376] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_06-model_00-model_states.pt... 0: [2022-11-27 02:30:18,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_06-model_00-model_states.pt. 0: [2022-11-27 02:30:18,457] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_07-model_00-model_states.pt... 0: [2022-11-27 02:30:18,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_07-model_00-model_states.pt. 0: [2022-11-27 02:30:18,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_08-model_00-model_states.pt... 0: [2022-11-27 02:30:18,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_08-model_00-model_states.pt. 0: [2022-11-27 02:30:18,620] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_09-model_00-model_states.pt... 0: [2022-11-27 02:30:18,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_09-model_00-model_states.pt. 0: [2022-11-27 02:30:18,699] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_10-model_00-model_states.pt... 0: [2022-11-27 02:30:18,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_10-model_00-model_states.pt. 
0: [2022-11-27 02:30:18,779] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_11-model_00-model_states.pt... 0: [2022-11-27 02:30:18,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_11-model_00-model_states.pt. 0: [2022-11-27 02:30:18,856] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_12-model_00-model_states.pt... 0: [2022-11-27 02:30:18,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_12-model_00-model_states.pt. 0: [2022-11-27 02:30:18,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_13-model_00-model_states.pt... 0: [2022-11-27 02:30:19,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_13-model_00-model_states.pt. 0: [2022-11-27 02:30:19,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_14-model_00-model_states.pt... 0: [2022-11-27 02:30:19,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_14-model_00-model_states.pt. 0: [2022-11-27 02:30:19,090] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_15-model_00-model_states.pt... 0: [2022-11-27 02:30:19,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_15-model_00-model_states.pt. 0: [2022-11-27 02:30:19,165] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_16-model_00-model_states.pt... 0: [2022-11-27 02:30:19,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_16-model_00-model_states.pt. 
0: [2022-11-27 02:30:19,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_17-model_00-model_states.pt... 0: [2022-11-27 02:30:19,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_17-model_00-model_states.pt. 0: [2022-11-27 02:30:19,326] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_18-model_00-model_states.pt... 0: [2022-11-27 02:30:19,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_18-model_00-model_states.pt. 0: [2022-11-27 02:30:19,400] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_19-model_00-model_states.pt... 0: [2022-11-27 02:30:19,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_19-model_00-model_states.pt. 0: [2022-11-27 02:30:19,478] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_20-model_00-model_states.pt... 0: [2022-11-27 02:30:19,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_20-model_00-model_states.pt. 0: [2022-11-27 02:30:19,555] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_21-model_00-model_states.pt... 0: [2022-11-27 02:30:19,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_21-model_00-model_states.pt. 0: [2022-11-27 02:30:19,630] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_22-model_00-model_states.pt... 0: [2022-11-27 02:30:19,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_22-model_00-model_states.pt. 
0: [2022-11-27 02:30:19,709] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_23-model_00-model_states.pt... 0: [2022-11-27 02:30:19,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_23-model_00-model_states.pt. 0: [2022-11-27 02:30:19,786] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_24-model_00-model_states.pt... 0: [2022-11-27 02:30:19,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_24-model_00-model_states.pt. 0: [2022-11-27 02:30:19,867] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_25-model_00-model_states.pt... 0: [2022-11-27 02:30:19,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_25-model_00-model_states.pt. 0: [2022-11-27 02:30:19,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_26-model_00-model_states.pt... 0: [2022-11-27 02:30:20,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_26-model_00-model_states.pt. 0: [2022-11-27 02:30:20,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_27-model_00-model_states.pt... 0: [2022-11-27 02:30:20,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_27-model_00-model_states.pt. 0: [2022-11-27 02:30:20,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_28-model_00-model_states.pt... 0: [2022-11-27 02:30:20,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_28-model_00-model_states.pt. 
0: [2022-11-27 02:30:20,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/layer_30-model_00-model_states.pt... 0: [2022-11-27 02:30:20,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/layer_30-model_00-model_states.pt. 0: [2022-11-27 02:30:20,175] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step144000/mp_rank_00_model_states.pt 0: [2022-11-27 02:30:20,175] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/mp_rank_00_model_states.pt... 0: [2022-11-27 02:30:20,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/mp_rank_00_model_states.pt. 0: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
5: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
4: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
10: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
13: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 
23: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 
29: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 
17: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 
0: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 
4: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
16: [2022-11-27 02:30:20,250] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
12: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 
11: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 
29: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 
18: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
0: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 
3: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 
11: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 
29: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 
18: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 
1: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 
28: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
1: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 
26: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:30:20,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:30:20,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:30:20,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 02:30:20,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 
0: [2022-11-27 02:30:20,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:30:20,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 02:30:20,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 20: [2022-11-27 02:30:20,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:30:20,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 02:30:20,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 8: [2022-11-27 02:30:20,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:30:20,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 02:30:20,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 16: [2022-11-27 02:30:20,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:30:20,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 02:30:20,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 
12: [2022-11-27 02:30:20,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:30:20,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 02:30:20,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 22: [2022-11-27 02:30:20,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:30:20,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:30:20,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 26: [2022-11-27 02:30:20,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 22: [2022-11-27 02:30:20,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 26: [2022-11-27 02:30:20,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 15: [2022-11-27 02:30:20,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:30:20,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 02:30:20,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 
19: [2022-11-27 02:30:20,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:30:20,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:30:20,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 02:30:20,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 19: [2022-11-27 02:30:20,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-27 02:30:20,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 28: [2022-11-27 02:30:20,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:30:20,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:30:20,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 
28: [2022-11-27 02:30:20,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 24: [2022-11-27 02:30:20,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 11: [2022-11-27 02:30:20,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:30:20,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 24: [2022-11-27 02:30:20,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 27: [2022-11-27 02:30:20,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 02:30:20,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 11: [2022-11-27 02:30:20,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 02:30:20,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 1: [2022-11-27 02:30:20,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:30:20,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:30:20,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
5: [2022-11-27 02:30:20,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:30:20,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 21: [2022-11-27 02:30:20,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 18: [2022-11-27 02:30:20,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:30:20,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 1: [2022-11-27 02:30:20,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 17: [2022-11-27 02:30:20,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 21: [2022-11-27 02:30:20,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 18: [2022-11-27 02:30:20,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 5: [2022-11-27 02:30:20,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 1: [2022-11-27 02:30:20,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 18: [2022-11-27 02:30:20,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 
6: [2022-11-27 02:30:20,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:30:20,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 02:30:20,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 8: [2022-11-27 02:30:20,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:30:20,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 02:30:20,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 6: [2022-11-27 02:30:20,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:30:20,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 02:30:20,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 0: [2022-11-27 02:30:20,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:30:20,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 02:30:20,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 
9: [2022-11-27 02:30:20,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:30:20,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 02:30:20,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:30:20,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 2: [2022-11-27 02:30:20,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:30:20,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 2: [2022-11-27 02:30:20,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 9: [2022-11-27 02:30:20,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 2: [2022-11-27 02:30:20,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 3: [2022-11-27 02:30:20,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:30:20,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 02:30:20,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 
31: [2022-11-27 02:30:20,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:30:20,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 02:30:20,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 26: [2022-11-27 02:30:20,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:30:20,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 13: [2022-11-27 02:30:20,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:30:20,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 13: [2022-11-27 02:30:20,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 29: [2022-11-27 02:30:20,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:30:20,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:30:20,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 
29: [2022-11-27 02:30:20,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 02:30:20,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 13: [2022-11-27 02:30:20,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:30:20,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 13: [2022-11-27 02:30:20,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 29: [2022-11-27 02:30:20,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:30:20,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 13: [2022-11-27 02:30:20,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 29: [2022-11-27 02:30:20,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 17: [2022-11-27 02:30:20,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 20: [2022-11-27 02:30:20,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 
20: [2022-11-27 02:30:20,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 02:30:20,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 12: [2022-11-27 02:30:20,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:30:20,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 02:30:20,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 5: [2022-11-27 02:30:20,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:30:20,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:30:20,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 10: [2022-11-27 02:30:20,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 5: [2022-11-27 02:30:20,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 10: [2022-11-27 02:30:20,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 22: [2022-11-27 02:30:20,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-27 02:30:20,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 02:30:20,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 24: [2022-11-27 02:30:20,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:30:20,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 9: [2022-11-27 02:30:20,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:30:20,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 9: [2022-11-27 02:30:20,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 02:30:20,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 15: [2022-11-27 02:30:20,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:30:20,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 24: [2022-11-27 02:30:20,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 
24: [2022-11-27 02:30:20,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 26: [2022-11-27 02:30:20,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:30:20,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 15: [2022-11-27 02:30:20,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 26: [2022-11-27 02:30:20,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 02:30:20,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 0: [2022-11-27 02:30:20,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:30:20,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 02:30:20,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 19: [2022-11-27 02:30:20,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:30:20,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-27 02:30:20,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 02:30:20,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 02:30:20,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 19: [2022-11-27 02:30:20,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 1: [2022-11-27 02:30:20,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:30:20,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:30:20,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:30:20,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 02:30:20,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 1: [2022-11-27 02:30:20,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 02:30:20,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 16: [2022-11-27 02:30:20,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
16: [2022-11-27 02:30:20,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 02:30:20,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 8: [2022-11-27 02:30:20,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:30:20,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 02:30:20,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 31: [2022-11-27 02:30:20,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:30:20,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 02:30:20,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 11: [2022-11-27 02:30:20,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:30:20,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-27 02:30:20,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 02:30:20,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 02:30:20,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 11: [2022-11-27 02:30:20,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 14: [2022-11-27 02:30:20,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:30:20,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 02:30:20,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 14: [2022-11-27 02:30:20,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:30:20,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 02:30:20,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 27: [2022-11-27 02:30:20,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:30:20,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-27 02:30:20,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:30:20,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:30:20,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:30:20,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 7: [2022-11-27 02:30:20,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 02:30:20,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 02:30:20,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 27: [2022-11-27 02:30:20,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 7: [2022-11-27 02:30:20,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 27: [2022-11-27 02:30:20,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 7: [2022-11-27 02:30:20,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 27: [2022-11-27 02:30:20,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 
7: [2022-11-27 02:30:20,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 3: [2022-11-27 02:30:20,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:30:20,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 02:30:20,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 6: [2022-11-27 02:30:20,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:30:20,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:30:20,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 10: [2022-11-27 02:30:20,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:30:20,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 6: [2022-11-27 02:30:20,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 10: [2022-11-27 02:30:20,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 14: [2022-11-27 02:30:20,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 
10: [2022-11-27 02:30:20,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 18: [2022-11-27 02:30:20,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:30:20,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 02:30:20,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 21: [2022-11-27 02:30:20,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:30:20,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 02:30:20,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 20: [2022-11-27 02:30:20,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:30:20,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:30:20,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 15: [2022-11-27 02:30:20,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 20: [2022-11-27 02:30:20,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 
15: [2022-11-27 02:30:20,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 6: [2022-11-27 02:30:20,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:30:20,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 02:30:20,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 2: [2022-11-27 02:30:20,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:30:20,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:30:20,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 02:30:20,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 02:30:20,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 2: [2022-11-27 02:30:20,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 29: [2022-11-27 02:30:20,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:30:20,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
29: [2022-11-27 02:30:20,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 13: [2022-11-27 02:30:20,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 29: [2022-11-27 02:30:20,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 13: [2022-11-27 02:30:20,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 9: [2022-11-27 02:30:20,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:30:20,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 02:30:20,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 22: [2022-11-27 02:30:20,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:30:20,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 02:30:20,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 31: [2022-11-27 02:30:20,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 
31: [2022-11-27 02:30:20,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 02:30:20,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 0: [2022-11-27 02:30:20,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:30:20,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 02:30:20,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 28: [2022-11-27 02:30:20,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 02:30:20,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 28: [2022-11-27 02:30:20,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:30:20,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 02:30:20,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 5: [2022-11-27 02:30:20,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:30:20,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
5: [2022-11-27 02:30:20,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 12: [2022-11-27 02:30:20,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 02:30:20,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 5: [2022-11-27 02:30:20,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 18: [2022-11-27 02:30:20,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:30:20,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 02:30:20,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 26: [2022-11-27 02:30:20,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:30:20,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 02:30:20,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 12: [2022-11-27 02:30:20,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-27 02:30:20,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 02:30:20,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 21: [2022-11-27 02:30:20,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:30:20,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 02:30:20,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 20: [2022-11-27 02:30:20,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:30:20,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 02:30:20,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 30: [2022-11-27 02:30:20,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:30:20,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 02:30:20,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 10: [2022-11-27 02:30:20,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-27 02:30:20,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 29: [2022-11-27 02:30:20,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:30:20,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 29: [2022-11-27 02:30:20,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 02:30:20,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 28: [2022-11-27 02:30:20,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:30:20,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:30:20,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:30:20,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-27 02:30:20,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 02:30:20,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 02:30:20,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 02:30:20,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 23: [2022-11-27 02:30:20,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 23: [2022-11-27 02:30:20,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 25: [2022-11-27 02:30:20,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:30:20,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:30:20,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 
25: [2022-11-27 02:30:20,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-27 02:30:20,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 02:30:20,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 02:30:20,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 25: [2022-11-27 02:30:20,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 25: [2022-11-27 02:30:20,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 17: [2022-11-27 02:30:20,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:30:20,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 02:30:20,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 17: [2022-11-27 02:30:20,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:30:20,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 02:30:20,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 
1: [2022-11-27 02:30:20,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:30:20,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 02:30:20,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 19: [2022-11-27 02:30:20,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:30:20,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 02:30:20,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 30: [2022-11-27 02:30:20,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:30:20,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 02:30:20,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 11: [2022-11-27 02:30:20,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:30:20,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 02:30:20,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 
8: [2022-11-27 02:30:20,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:30:20,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 02:30:20,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 14: [2022-11-27 02:30:20,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:30:20,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 02:30:20,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 28: [2022-11-27 02:30:20,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 02:30:20,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 16: [2022-11-27 02:30:20,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:30:20,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 02:30:20,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 21: [2022-11-27 02:30:20,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-27 02:30:20,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 02:30:20,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 5: [2022-11-27 02:30:20,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:30:20,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 02:30:20,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 1: [2022-11-27 02:30:20,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:30:20,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 02:30:20,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 4: [2022-11-27 02:30:20,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:30:20,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:30:20,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:30:20,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-27 02:30:20,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 02:30:20,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 02:30:20,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 02:30:20,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 4: [2022-11-27 02:30:20,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 4: [2022-11-27 02:30:20,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 4: [2022-11-27 02:30:20,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 02:30:20,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 25: [2022-11-27 02:30:20,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:30:20,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 3: [2022-11-27 02:30:20,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:30:20,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 
3: [2022-11-27 02:30:20,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 02:30:20,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 30: [2022-11-27 02:30:20,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:30:20,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 02:30:20,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 18: [2022-11-27 02:30:20,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:30:20,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 02:30:20,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 27: [2022-11-27 02:30:20,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:30:20,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 02:30:20,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 23: [2022-11-27 02:30:20,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
7: [2022-11-27 02:30:20,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:30:20,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 02:30:20,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 7: [2022-11-27 02:30:20,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 02:30:20,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 22: [2022-11-27 02:30:20,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:30:20,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 02:30:20,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 13: [2022-11-27 02:30:20,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:30:20,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 02:30:20,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 2: [2022-11-27 02:30:20,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-27 02:30:20,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 02:30:20,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 31: [2022-11-27 02:30:20,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:30:20,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 02:30:20,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 15: [2022-11-27 02:30:20,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:30:20,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 02:30:20,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 6: [2022-11-27 02:30:20,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:30:20,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 02:30:20,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 24: [2022-11-27 02:30:20,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
24: [2022-11-27 02:30:20,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-27 02:30:20,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 19: [2022-11-27 02:30:20,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:30:20,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 02:30:20,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 29: [2022-11-27 02:30:20,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:30:20,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 02:30:20,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 17: [2022-11-27 02:30:20,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:30:20,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 02:30:20,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 12: [2022-11-27 02:30:20,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-27 02:30:20,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 02:30:20,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 16: [2022-11-27 02:30:20,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:30:20,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 02:30:20,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 20: [2022-11-27 02:30:20,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:30:20,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 02:30:20,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 28: [2022-11-27 02:30:20,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:30:20,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 02:30:20,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 5: [2022-11-27 02:30:20,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-27 02:30:20,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 02:30:20,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 1: [2022-11-27 02:30:20,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:30:20,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 02:30:20,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 8: [2022-11-27 02:30:20,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:30:20,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 02:30:20,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 11: [2022-11-27 02:30:20,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:30:20,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 02:30:20,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 26: [2022-11-27 02:30:20,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
26: [2022-11-27 02:30:20,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 02:30:20,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 30: [2022-11-27 02:30:20,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:30:20,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 02:30:20,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 14: [2022-11-27 02:30:20,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:30:20,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:30:20,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 10: [2022-11-27 02:30:20,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 14: [2022-11-27 02:30:20,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 10: [2022-11-27 02:30:20,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 4: [2022-11-27 02:30:20,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-27 02:30:20,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 02:30:20,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 18: [2022-11-27 02:30:20,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:30:20,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 23: [2022-11-27 02:30:20,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:30:20,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 23: [2022-11-27 02:30:20,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 02:30:20,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 21: [2022-11-27 02:30:20,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:30:20,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 02:30:20,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 27: [2022-11-27 02:30:20,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
27: [2022-11-27 02:30:20,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 02:30:20,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 22: [2022-11-27 02:30:20,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:30:20,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 02:30:20,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 7: [2022-11-27 02:30:20,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:30:20,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 02:30:20,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 0: [2022-11-27 02:30:20,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:30:20,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 25: [2022-11-27 02:30:20,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:30:20,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 
25: [2022-11-27 02:30:20,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 02:30:20,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 24: [2022-11-27 02:30:20,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:30:20,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 02:30:20,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 3: [2022-11-27 02:30:20,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:30:20,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 02:30:20,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 13: [2022-11-27 02:30:20,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:30:20,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 02:30:20,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 9: [2022-11-27 02:30:20,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-27 02:30:20,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 02:30:20,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 15: [2022-11-27 02:30:20,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:30:20,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 02:30:20,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 31: [2022-11-27 02:30:20,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:30:20,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 02:30:20,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 2: [2022-11-27 02:30:20,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:30:20,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 02:30:20,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 29: [2022-11-27 02:30:20,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
29: [2022-11-27 02:30:20,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 02:30:20,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 17: [2022-11-27 02:30:20,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:30:20,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:30:20,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 02:30:20,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 9: [2022-11-27 02:30:20,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:30:20,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 02:30:20,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 16: [2022-11-27 02:30:20,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:30:20,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 02:30:20,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 
20: [2022-11-27 02:30:20,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:30:20,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 02:30:20,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 19: [2022-11-27 02:30:20,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:30:20,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 02:30:20,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 6: [2022-11-27 02:30:20,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:30:20,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 02:30:20,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 28: [2022-11-27 02:30:20,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:30:20,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 02:30:20,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 
10: [2022-11-27 02:30:20,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:30:20,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 02:30:20,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 12: [2022-11-27 02:30:20,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:30:20,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:30:20,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:30:20,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 12: [2022-11-27 02:30:20,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 26: [2022-11-27 02:30:20,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 12: [2022-11-27 02:30:20,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 26: [2022-11-27 02:30:20,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 5: [2022-11-27 02:30:20,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 
8: [2022-11-27 02:30:20,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:30:20,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 02:30:20,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 1: [2022-11-27 02:30:20,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:30:20,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 02:30:20,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 11: [2022-11-27 02:30:20,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:30:20,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 02:30:20,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 4: [2022-11-27 02:30:20,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:30:20,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 02:30:20,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 
0: [2022-11-27 02:30:20,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 23: [2022-11-27 02:30:20,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:30:20,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 25: [2022-11-27 02:30:20,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:30:20,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 0: [2022-11-27 02:30:20,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 25: [2022-11-27 02:30:20,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 02:30:20,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 14: [2022-11-27 02:30:20,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:30:20,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 02:30:20,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 3: [2022-11-27 02:30:20,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-27 02:30:20,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 21: [2022-11-27 02:30:20,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:30:20,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 21: [2022-11-27 02:30:20,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 02:30:20,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 18: [2022-11-27 02:30:20,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:30:20,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 02:30:20,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 30: [2022-11-27 02:30:20,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:30:20,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 02:30:20,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 27: [2022-11-27 02:30:20,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-27 02:30:20,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 02:30:20,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 7: [2022-11-27 02:30:20,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:30:20,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 02:30:20,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 31: [2022-11-27 02:30:20,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:30:20,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 02:30:20,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 13: [2022-11-27 02:30:20,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:30:20,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 02:30:20,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 22: [2022-11-27 02:30:20,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
22: [2022-11-27 02:30:20,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 02:30:20,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 0: [2022-11-27 02:30:20,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:30:20,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 02:30:20,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 2: [2022-11-27 02:30:20,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:30:20,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 02:30:20,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 15: [2022-11-27 02:30:20,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:30:20,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 02:30:20,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 29: [2022-11-27 02:30:20,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-27 02:30:20,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 02:30:20,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 6: [2022-11-27 02:30:20,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:30:20,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 02:30:20,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 16: [2022-11-27 02:30:20,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:30:20,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 02:30:20,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 12: [2022-11-27 02:30:20,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:30:20,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 02:30:20,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 26: [2022-11-27 02:30:20,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
26: [2022-11-27 02:30:20,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 02:30:20,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 9: [2022-11-27 02:30:20,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:30:20,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 02:30:20,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 24: [2022-11-27 02:30:20,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:30:20,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 02:30:20,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 17: [2022-11-27 02:30:20,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:30:20,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 02:30:20,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 28: [2022-11-27 02:30:20,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-27 02:30:20,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 02:30:20,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 20: [2022-11-27 02:30:20,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:30:20,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 02:30:20,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 10: [2022-11-27 02:30:20,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:30:20,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 02:30:20,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 8: [2022-11-27 02:30:20,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:30:20,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 02:30:20,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 19: [2022-11-27 02:30:20,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-27 02:30:20,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 02:30:20,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 1: [2022-11-27 02:30:20,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:30:20,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 02:30:20,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 11: [2022-11-27 02:30:20,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:30:20,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 02:30:20,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 21: [2022-11-27 02:30:20,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:30:20,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 02:30:20,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 25: [2022-11-27 02:30:20,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
25: [2022-11-27 02:30:20,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 5: [2022-11-27 02:30:20,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:30:20,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 5: [2022-11-27 02:30:20,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 02:30:20,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 14: [2022-11-27 02:30:20,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:30:20,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 02:30:20,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 4: [2022-11-27 02:30:20,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:30:20,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:30:20,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 02:30:20,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 
23: [2022-11-27 02:30:20,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 02:30:20,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 3: [2022-11-27 02:30:20,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:30:20,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:30:20,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 18: [2022-11-27 02:30:20,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 3: [2022-11-27 02:30:20,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 18: [2022-11-27 02:30:20,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 13: [2022-11-27 02:30:20,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:30:20,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 22: [2022-11-27 02:30:20,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:30:20,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 
22: [2022-11-27 02:30:20,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 02:30:20,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 27: [2022-11-27 02:30:20,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:30:20,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 02:30:20,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 30: [2022-11-27 02:30:20,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:30:20,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 02:30:20,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 7: [2022-11-27 02:30:20,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:30:20,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 02:30:20,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 9: [2022-11-27 02:30:20,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-27 02:30:20,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 02:30:20,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 0: [2022-11-27 02:30:20,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:30:20,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 02:30:20,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 29: [2022-11-27 02:30:20,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:30:20,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 02:30:20,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 31: [2022-11-27 02:30:20,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:30:20,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 02:30:20,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 6: [2022-11-27 02:30:20,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
24: [2022-11-27 02:30:20,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:30:20,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 02:30:20,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 24: [2022-11-27 02:30:20,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 02:30:20,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 26: [2022-11-27 02:30:20,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:30:20,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 02:30:20,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 2: [2022-11-27 02:30:20,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:30:20,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 02:30:20,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 17: [2022-11-27 02:30:20,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 
17: [2022-11-27 02:30:20,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 02:30:20,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 16: [2022-11-27 02:30:20,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:30:20,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 02:30:20,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 19: [2022-11-27 02:30:20,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:30:20,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 02:30:20,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 4: [2022-11-27 02:30:20,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:30:20,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 25: [2022-11-27 02:30:20,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:30:20,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 
25: [2022-11-27 02:30:20,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 02:30:20,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 20: [2022-11-27 02:30:20,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:30:20,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:30:20,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:30:20,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 23: [2022-11-27 02:30:20,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 14: [2022-11-27 02:30:20,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 20: [2022-11-27 02:30:20,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 23: [2022-11-27 02:30:20,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 14: [2022-11-27 02:30:20,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 21: [2022-11-27 02:30:20,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-27 02:30:20,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 02:30:20,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 3: [2022-11-27 02:30:20,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:30:20,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 02:30:20,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 28: [2022-11-27 02:30:20,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:30:20,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:30:20,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 28: [2022-11-27 02:30:20,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 15: [2022-11-27 02:30:20,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 28: [2022-11-27 02:30:20,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 24: [2022-11-27 02:30:20,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-27 02:30:20,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 02:30:20,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 10: [2022-11-27 02:30:20,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:30:20,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 02:30:20,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 30: [2022-11-27 02:30:20,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:30:20,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 02:30:20,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 11: [2022-11-27 02:30:20,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:30:20,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 02:30:20,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 12: [2022-11-27 02:30:20,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
8: [2022-11-27 02:30:20,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:30:20,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:30:20,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 18: [2022-11-27 02:30:20,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 8: [2022-11-27 02:30:20,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 02:30:20,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 18: [2022-11-27 02:30:20,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 12: [2022-11-27 02:30:20,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 31: [2022-11-27 02:30:20,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:30:20,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 02:30:20,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 13: [2022-11-27 02:30:20,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-27 02:30:20,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 02:30:20,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 27: [2022-11-27 02:30:20,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:30:20,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 02:30:20,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 2: [2022-11-27 02:30:20,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:30:20,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:30:20,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 22: [2022-11-27 02:30:20,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 02:30:20,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 2: [2022-11-27 02:30:20,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 30: [2022-11-27 02:30:20,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-27 02:30:20,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 02:30:20,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 7: [2022-11-27 02:30:20,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:30:20,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 02:30:20,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 1: [2022-11-27 02:30:20,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:30:20,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 02:30:20,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 15: [2022-11-27 02:30:20,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:30:20,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 02:30:20,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 5: [2022-11-27 02:30:20,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-27 02:30:20,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step144000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 02:30:20,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step144000 is ready now! 0: successfully saved checkpoint at iteration 144000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2677.83 31: iteration 144010/ 173500 | consumed samples: 36866560 | consumed tokens: 75502714880 | elapsed time per iteration (s): 1.01 | learning rate: 3.278E-05 | global batch size: 256 | lm loss: 1.930679E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 253.389 | TFLOPs: 15.33 | 31: iteration 144020/ 173500 | consumed samples: 36869120 | consumed tokens: 75507957760 | elapsed time per iteration (s): 0.78 | learning rate: 3.277E-05 | global batch size: 256 | lm loss: 1.917928E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.154 | TFLOPs: 19.85 | 31: iteration 144030/ 173500 | consumed samples: 36871680 | consumed tokens: 75513200640 | elapsed time per iteration (s): 0.82 | learning rate: 3.276E-05 | global batch size: 256 | lm loss: 1.921157E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.228 | TFLOPs: 18.89 | 31: iteration 144040/ 173500 | consumed samples: 36874240 | consumed tokens: 75518443520 | elapsed time per iteration (s): 0.81 | learning rate: 3.275E-05 | global batch size: 256 | lm loss: 1.913121E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.938 | TFLOPs: 19.11 | 31: iteration 144050/ 173500 | consumed samples: 36876800 | consumed tokens: 75523686400 | elapsed time per iteration (s): 0.78 | learning rate: 3.274E-05 | 
global batch size: 256 | lm loss: 1.927883E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.358 | TFLOPs: 19.86 | 31: iteration 144060/ 173500 | consumed samples: 36879360 | consumed tokens: 75528929280 | elapsed time per iteration (s): 0.93 | learning rate: 3.274E-05 | global batch size: 256 | lm loss: 1.916269E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 274.314 | TFLOPs: 16.60 | 31: iteration 144070/ 173500 | consumed samples: 36881920 | consumed tokens: 75534172160 | elapsed time per iteration (s): 0.86 | learning rate: 3.273E-05 | global batch size: 256 | lm loss: 1.936526E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.195 | TFLOPs: 18.10 | 31: iteration 144080/ 173500 | consumed samples: 36884480 | consumed tokens: 75539415040 | elapsed time per iteration (s): 1.01 | learning rate: 3.272E-05 | global batch size: 256 | lm loss: 1.933146E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 253.883 | TFLOPs: 15.36 | 31: iteration 144090/ 173500 | consumed samples: 36887040 | consumed tokens: 75544657920 | elapsed time per iteration (s): 0.91 | learning rate: 3.271E-05 | global batch size: 256 | lm loss: 1.895203E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 281.680 | TFLOPs: 17.04 | 31: iteration 144100/ 173500 | consumed samples: 36889600 | consumed tokens: 75549900800 | elapsed time per iteration (s): 0.94 | learning rate: 3.270E-05 | global batch size: 256 | lm loss: 1.920767E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 271.057 | TFLOPs: 16.40 | 31: iteration 144110/ 173500 | consumed 
samples: 36892160 | consumed tokens: 75555143680 | elapsed time per iteration (s): 0.91 | learning rate: 3.269E-05 | global batch size: 256 | lm loss: 1.931795E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.706 | TFLOPs: 17.10 | 31: iteration 144120/ 173500 | consumed samples: 36894720 | consumed tokens: 75560386560 | elapsed time per iteration (s): 0.92 | learning rate: 3.268E-05 | global batch size: 256 | lm loss: 1.930100E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 278.106 | TFLOPs: 16.82 | 31: iteration 144130/ 173500 | consumed samples: 36897280 | consumed tokens: 75565629440 | elapsed time per iteration (s): 0.87 | learning rate: 3.268E-05 | global batch size: 256 | lm loss: 1.901365E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.662 | TFLOPs: 17.77 | 31: iteration 144140/ 173500 | consumed samples: 36899840 | consumed tokens: 75570872320 | elapsed time per iteration (s): 0.86 | learning rate: 3.267E-05 | global batch size: 256 | lm loss: 1.925783E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.359 | TFLOPs: 18.11 | 31: iteration 144150/ 173500 | consumed samples: 36902400 | consumed tokens: 75576115200 | elapsed time per iteration (s): 0.86 | learning rate: 3.266E-05 | global batch size: 256 | lm loss: 1.935990E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.310 | TFLOPs: 17.93 | 31: iteration 144160/ 173500 | consumed samples: 36904960 | consumed tokens: 75581358080 | elapsed time per iteration (s): 0.98 | learning rate: 3.265E-05 | global batch size: 256 | lm loss: 1.949409E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 260.860 | TFLOPs: 15.78 | 31: iteration 144170/ 173500 | consumed samples: 36907520 | consumed tokens: 75586600960 | elapsed time per iteration (s): 0.96 | learning rate: 3.264E-05 | global batch size: 256 | lm loss: 1.933346E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 266.466 | TFLOPs: 16.12 | 31: iteration 144180/ 173500 | consumed samples: 36910080 | consumed tokens: 75591843840 | elapsed time per iteration (s): 0.81 | learning rate: 3.263E-05 | global batch size: 256 | lm loss: 1.920330E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.380 | TFLOPs: 19.02 | 31: iteration 144190/ 173500 | consumed samples: 36912640 | consumed tokens: 75597086720 | elapsed time per iteration (s): 0.95 | learning rate: 3.263E-05 | global batch size: 256 | lm loss: 1.925489E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 268.757 | TFLOPs: 16.26 | 31: iteration 144200/ 173500 | consumed samples: 36915200 | consumed tokens: 75602329600 | elapsed time per iteration (s): 0.86 | learning rate: 3.262E-05 | global batch size: 256 | lm loss: 1.912851E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.932 | TFLOPs: 18.08 | 31: iteration 144210/ 173500 | consumed samples: 36917760 | consumed tokens: 75607572480 | elapsed time per iteration (s): 0.82 | learning rate: 3.261E-05 | global batch size: 256 | lm loss: 1.934017E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.595 | TFLOPs: 18.79 | 31: iteration 144220/ 173500 | consumed samples: 36920320 | consumed tokens: 75612815360 | elapsed time per iteration (s): 0.82 | learning rate: 3.260E-05 | global 
batch size: 256 | lm loss: 1.942116E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.078 | TFLOPs: 18.94 | 31: iteration 144230/ 173500 | consumed samples: 36922880 | consumed tokens: 75618058240 | elapsed time per iteration (s): 0.79 | learning rate: 3.259E-05 | global batch size: 256 | lm loss: 1.894357E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.506 | TFLOPs: 19.57 | 31: iteration 144240/ 173500 | consumed samples: 36925440 | consumed tokens: 75623301120 | elapsed time per iteration (s): 0.87 | learning rate: 3.258E-05 | global batch size: 256 | lm loss: 1.950323E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.331 | TFLOPs: 17.87 | 31: iteration 144250/ 173500 | consumed samples: 36928000 | consumed tokens: 75628544000 | elapsed time per iteration (s): 0.78 | learning rate: 3.258E-05 | global batch size: 256 | lm loss: 1.933825E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.240 | TFLOPs: 19.74 | 31: iteration 144260/ 173500 | consumed samples: 36930560 | consumed tokens: 75633786880 | elapsed time per iteration (s): 0.77 | learning rate: 3.257E-05 | global batch size: 256 | lm loss: 1.948470E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.026 | TFLOPs: 20.21 | 31: iteration 144270/ 173500 | consumed samples: 36933120 | consumed tokens: 75639029760 | elapsed time per iteration (s): 0.77 | learning rate: 3.256E-05 | global batch size: 256 | lm loss: 1.932441E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.223 | TFLOPs: 20.10 | 31: iteration 144280/ 173500 | consumed samples: 
36935680 | consumed tokens: 75644272640 | elapsed time per iteration (s): 0.77 | learning rate: 3.255E-05 | global batch size: 256 | lm loss: 1.921054E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.712 | TFLOPs: 20.01 | 31: iteration 144290/ 173500 | consumed samples: 36938240 | consumed tokens: 75649515520 | elapsed time per iteration (s): 0.72 | learning rate: 3.254E-05 | global batch size: 256 | lm loss: 1.912434E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.589 | TFLOPs: 21.39 | 31: iteration 144300/ 173500 | consumed samples: 36940800 | consumed tokens: 75654758400 | elapsed time per iteration (s): 0.74 | learning rate: 3.253E-05 | global batch size: 256 | lm loss: 1.890172E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.635 | TFLOPs: 20.79 | 31: iteration 144310/ 173500 | consumed samples: 36943360 | consumed tokens: 75660001280 | elapsed time per iteration (s): 0.77 | learning rate: 3.253E-05 | global batch size: 256 | lm loss: 1.934212E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.888 | TFLOPs: 20.14 | 31: iteration 144320/ 173500 | consumed samples: 36945920 | consumed tokens: 75665244160 | elapsed time per iteration (s): 0.73 | learning rate: 3.252E-05 | global batch size: 256 | lm loss: 1.945654E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.762 | TFLOPs: 21.34 | 31: iteration 144330/ 173500 | consumed samples: 36948480 | consumed tokens: 75670487040 | elapsed time per iteration (s): 0.78 | learning rate: 3.251E-05 | global batch size: 256 | lm loss: 1.930082E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 328.574 | TFLOPs: 19.88 | 31: iteration 144340/ 173500 | consumed samples: 36951040 | consumed tokens: 75675729920 | elapsed time per iteration (s): 0.74 | learning rate: 3.250E-05 | global batch size: 256 | lm loss: 1.923256E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.685 | TFLOPs: 21.03 | 31: iteration 144350/ 173500 | consumed samples: 36953600 | consumed tokens: 75680972800 | elapsed time per iteration (s): 0.77 | learning rate: 3.249E-05 | global batch size: 256 | lm loss: 1.941305E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.066 | TFLOPs: 20.15 | 31: iteration 144360/ 173500 | consumed samples: 36956160 | consumed tokens: 75686215680 | elapsed time per iteration (s): 0.81 | learning rate: 3.248E-05 | global batch size: 256 | lm loss: 1.929138E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.581 | TFLOPs: 19.21 | 31: iteration 144370/ 173500 | consumed samples: 36958720 | consumed tokens: 75691458560 | elapsed time per iteration (s): 0.81 | learning rate: 3.247E-05 | global batch size: 256 | lm loss: 1.920113E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.468 | TFLOPs: 19.15 | 31: iteration 144380/ 173500 | consumed samples: 36961280 | consumed tokens: 75696701440 | elapsed time per iteration (s): 0.83 | learning rate: 3.247E-05 | global batch size: 256 | lm loss: 1.900293E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.830 | TFLOPs: 18.68 | 31: iteration 144390/ 173500 | consumed samples: 36963840 | consumed tokens: 75701944320 | elapsed time per iteration (s): 0.80 | learning rate: 3.246E-05 | global batch 
size: 256 | lm loss: 1.915574E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.585 | TFLOPs: 19.27 | 31: iteration 144400/ 173500 | consumed samples: 36966400 | consumed tokens: 75707187200 | elapsed time per iteration (s): 0.84 | learning rate: 3.245E-05 | global batch size: 256 | lm loss: 1.943813E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.092 | TFLOPs: 18.52 | 31: iteration 144410/ 173500 | consumed samples: 36968960 | consumed tokens: 75712430080 | elapsed time per iteration (s): 0.79 | learning rate: 3.244E-05 | global batch size: 256 | lm loss: 1.902442E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.098 | TFLOPs: 19.49 | 31: iteration 144420/ 173500 | consumed samples: 36971520 | consumed tokens: 75717672960 | elapsed time per iteration (s): 1.09 | learning rate: 3.243E-05 | global batch size: 256 | lm loss: 1.896447E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.795 | TFLOPs: 14.26 | 31: iteration 144430/ 173500 | consumed samples: 36974080 | consumed tokens: 75722915840 | elapsed time per iteration (s): 0.79 | learning rate: 3.242E-05 | global batch size: 256 | lm loss: 1.908918E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.649 | TFLOPs: 19.58 | 31: iteration 144440/ 173500 | consumed samples: 36976640 | consumed tokens: 75728158720 | elapsed time per iteration (s): 0.79 | learning rate: 3.242E-05 | global batch size: 256 | lm loss: 1.930715E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.926 | TFLOPs: 19.66 | 31: iteration 144450/ 173500 | consumed samples: 36979200 
| consumed tokens: 75733401600 | elapsed time per iteration (s): 0.82 | learning rate: 3.241E-05 | global batch size: 256 | lm loss: 1.908907E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.528 | TFLOPs: 18.85 | 31: iteration 144460/ 173500 | consumed samples: 36981760 | consumed tokens: 75738644480 | elapsed time per iteration (s): 0.80 | learning rate: 3.240E-05 | global batch size: 256 | lm loss: 1.905605E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.932 | TFLOPs: 19.48 | 31: iteration 144470/ 173500 | consumed samples: 36984320 | consumed tokens: 75743887360 | elapsed time per iteration (s): 0.80 | learning rate: 3.239E-05 | global batch size: 256 | lm loss: 1.920280E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.609 | TFLOPs: 19.34 | 31: iteration 144480/ 173500 | consumed samples: 36986880 | consumed tokens: 75749130240 | elapsed time per iteration (s): 0.82 | learning rate: 3.238E-05 | global batch size: 256 | lm loss: 1.953099E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.743 | TFLOPs: 18.86 | 31: iteration 144490/ 173500 | consumed samples: 36989440 | consumed tokens: 75754373120 | elapsed time per iteration (s): 0.86 | learning rate: 3.237E-05 | global batch size: 256 | lm loss: 1.928982E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.732 | TFLOPs: 18.01 | 31: iteration 144500/ 173500 | consumed samples: 36992000 | consumed tokens: 75759616000 | elapsed time per iteration (s): 0.92 | learning rate: 3.237E-05 | global batch size: 256 | lm loss: 1.928691E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 278.222 | TFLOPs: 16.83 | 31: iteration 144510/ 173500 | consumed samples: 36994560 | consumed tokens: 75764858880 | elapsed time per iteration (s): 0.83 | learning rate: 3.236E-05 | global batch size: 256 | lm loss: 1.911241E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.091 | TFLOPs: 18.64 | 31: iteration 144520/ 173500 | consumed samples: 36997120 | consumed tokens: 75770101760 | elapsed time per iteration (s): 0.85 | learning rate: 3.235E-05 | global batch size: 256 | lm loss: 1.926811E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.009 | TFLOPs: 18.21 | 31: iteration 144530/ 173500 | consumed samples: 36999680 | consumed tokens: 75775344640 | elapsed time per iteration (s): 0.85 | learning rate: 3.234E-05 | global batch size: 256 | lm loss: 1.920390E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.705 | TFLOPs: 18.31 | 31: iteration 144540/ 173500 | consumed samples: 37002240 | consumed tokens: 75780587520 | elapsed time per iteration (s): 0.80 | learning rate: 3.233E-05 | global batch size: 256 | lm loss: 1.939204E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.746 | TFLOPs: 19.46 | 31: iteration 144550/ 173500 | consumed samples: 37004800 | consumed tokens: 75785830400 | elapsed time per iteration (s): 0.85 | learning rate: 3.232E-05 | global batch size: 256 | lm loss: 1.934415E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.507 | TFLOPs: 18.12 | 31: iteration 144560/ 173500 | consumed samples: 37007360 | consumed tokens: 75791073280 | elapsed time per iteration (s): 0.84 | learning rate: 3.232E-05 | global batch size: 
256 | lm loss: 1.910420E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.775 | TFLOPs: 18.44 | 31: iteration 144570/ 173500 | consumed samples: 37009920 | consumed tokens: 75796316160 | elapsed time per iteration (s): 0.80 | learning rate: 3.231E-05 | global batch size: 256 | lm loss: 1.937674E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.434 | TFLOPs: 19.39 | 31: iteration 144580/ 173500 | consumed samples: 37012480 | consumed tokens: 75801559040 | elapsed time per iteration (s): 0.80 | learning rate: 3.230E-05 | global batch size: 256 | lm loss: 1.914593E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.579 | TFLOPs: 19.27 | 31: iteration 144590/ 173500 | consumed samples: 37015040 | consumed tokens: 75806801920 | elapsed time per iteration (s): 0.82 | learning rate: 3.229E-05 | global batch size: 256 | lm loss: 1.909309E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.535 | TFLOPs: 18.91 | 31: iteration 144600/ 173500 | consumed samples: 37017600 | consumed tokens: 75812044800 | elapsed time per iteration (s): 0.81 | learning rate: 3.228E-05 | global batch size: 256 | lm loss: 1.907403E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.603 | TFLOPs: 19.03 | 31: iteration 144610/ 173500 | consumed samples: 37020160 | consumed tokens: 75817287680 | elapsed time per iteration (s): 0.85 | learning rate: 3.228E-05 | global batch size: 256 | lm loss: 1.914746E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.536 | TFLOPs: 18.18 | 31: iteration 144620/ 173500 | consumed samples: 37022720 | 
consumed tokens: 75822530560 | elapsed time per iteration (s): 0.83 | learning rate: 3.227E-05 | global batch size: 256 | lm loss: 1.898439E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.315 | TFLOPs: 18.65 | 31: iteration 144630/ 173500 | consumed samples: 37025280 | consumed tokens: 75827773440 | elapsed time per iteration (s): 0.81 | learning rate: 3.226E-05 | global batch size: 256 | lm loss: 1.936822E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.189 | TFLOPs: 19.01 | 31: iteration 144640/ 173500 | consumed samples: 37027840 | consumed tokens: 75833016320 | elapsed time per iteration (s): 0.81 | learning rate: 3.225E-05 | global batch size: 256 | lm loss: 1.920552E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.831 | TFLOPs: 19.05 | 31: iteration 144650/ 173500 | consumed samples: 37030400 | consumed tokens: 75838259200 | elapsed time per iteration (s): 0.83 | learning rate: 3.224E-05 | global batch size: 256 | lm loss: 1.932438E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.000 | TFLOPs: 18.57 | 31: iteration 144660/ 173500 | consumed samples: 37032960 | consumed tokens: 75843502080 | elapsed time per iteration (s): 0.76 | learning rate: 3.223E-05 | global batch size: 256 | lm loss: 1.913475E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.628 | TFLOPs: 20.30 | 31: iteration 144670/ 173500 | consumed samples: 37035520 | consumed tokens: 75848744960 | elapsed time per iteration (s): 0.80 | learning rate: 3.223E-05 | global batch size: 256 | lm loss: 1.917632E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 321.762 | TFLOPs: 19.47 | 31: iteration 144680/ 173500 | consumed samples: 37038080 | consumed tokens: 75853987840 | elapsed time per iteration (s): 0.80 | learning rate: 3.222E-05 | global batch size: 256 | lm loss: 1.902546E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.915 | TFLOPs: 19.41 | 31: iteration 144690/ 173500 | consumed samples: 37040640 | consumed tokens: 75859230720 | elapsed time per iteration (s): 0.84 | learning rate: 3.221E-05 | global batch size: 256 | lm loss: 1.924919E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.520 | TFLOPs: 18.48 | 31: iteration 144700/ 173500 | consumed samples: 37043200 | consumed tokens: 75864473600 | elapsed time per iteration (s): 0.97 | learning rate: 3.220E-05 | global batch size: 256 | lm loss: 1.945520E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 264.845 | TFLOPs: 16.02 | 31: iteration 144710/ 173500 | consumed samples: 37045760 | consumed tokens: 75869716480 | elapsed time per iteration (s): 0.82 | learning rate: 3.219E-05 | global batch size: 256 | lm loss: 1.926566E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.148 | TFLOPs: 18.88 | 31: iteration 144720/ 173500 | consumed samples: 37048320 | consumed tokens: 75874959360 | elapsed time per iteration (s): 0.74 | learning rate: 3.218E-05 | global batch size: 256 | lm loss: 1.949710E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.581 | TFLOPs: 21.03 | 31: iteration 144730/ 173500 | consumed samples: 37050880 | consumed tokens: 75880202240 | elapsed time per iteration (s): 0.78 | learning rate: 3.218E-05 | global batch size: 
256 | lm loss: 1.931119E+00 | grad norm: 0.393 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.715 | TFLOPs: 19.89 | 31: iteration 144740/ 173500 | consumed samples: 37053440 | consumed tokens: 75885445120 | elapsed time per iteration (s): 0.75 | learning rate: 3.217E-05 | global batch size: 256 | lm loss: 1.941387E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.669 | TFLOPs: 20.73 | 31: iteration 144750/ 173500 | consumed samples: 37056000 | consumed tokens: 75890688000 | elapsed time per iteration (s): 0.74 | learning rate: 3.216E-05 | global batch size: 256 | lm loss: 1.913774E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.726 | TFLOPs: 20.92 | 31: iteration 144760/ 173500 | consumed samples: 37058560 | consumed tokens: 75895930880 | elapsed time per iteration (s): 0.76 | learning rate: 3.215E-05 | global batch size: 256 | lm loss: 1.937235E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.940 | TFLOPs: 20.26 | 31: iteration 144770/ 173500 | consumed samples: 37061120 | consumed tokens: 75901173760 | elapsed time per iteration (s): 0.77 | learning rate: 3.214E-05 | global batch size: 256 | lm loss: 1.883313E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.752 | TFLOPs: 20.19 | 31: iteration 144780/ 173500 | consumed samples: 37063680 | consumed tokens: 75906416640 | elapsed time per iteration (s): 0.75 | learning rate: 3.213E-05 | global batch size: 256 | lm loss: 1.907772E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.979 | TFLOPs: 20.69 | 31: iteration 144790/ 173500 | consumed samples: 37066240 | 
consumed tokens: 75911659520 | elapsed time per iteration (s): 0.77 | learning rate: 3.213E-05 | global batch size: 256 | lm loss: 1.924292E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.757 | TFLOPs: 20.13 | 31: iteration 144800/ 173500 | consumed samples: 37068800 | consumed tokens: 75916902400 | elapsed time per iteration (s): 0.73 | learning rate: 3.212E-05 | global batch size: 256 | lm loss: 1.913576E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.146 | TFLOPs: 21.12 | 31: iteration 144810/ 173500 | consumed samples: 37071360 | consumed tokens: 75922145280 | elapsed time per iteration (s): 0.78 | learning rate: 3.211E-05 | global batch size: 256 | lm loss: 1.944986E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.433 | TFLOPs: 19.81 | 31: iteration 144820/ 173500 | consumed samples: 37073920 | consumed tokens: 75927388160 | elapsed time per iteration (s): 0.79 | learning rate: 3.210E-05 | global batch size: 256 | lm loss: 1.912231E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.603 | TFLOPs: 19.64 | 31: iteration 144830/ 173500 | consumed samples: 37076480 | consumed tokens: 75932631040 | elapsed time per iteration (s): 0.75 | learning rate: 3.209E-05 | global batch size: 256 | lm loss: 1.907055E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.683 | TFLOPs: 20.55 | 31: iteration 144840/ 173500 | consumed samples: 37079040 | consumed tokens: 75937873920 | elapsed time per iteration (s): 0.80 | learning rate: 3.208E-05 | global batch size: 256 | lm loss: 1.907818E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 319.847 | TFLOPs: 19.35 | 31: iteration 144850/ 173500 | consumed samples: 37081600 | consumed tokens: 75943116800 | elapsed time per iteration (s): 2.65 | learning rate: 3.208E-05 | global batch size: 256 | lm loss: 1.949267E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 96.615 | TFLOPs: 5.84 | 31: iteration 144860/ 173500 | consumed samples: 37084160 | consumed tokens: 75948359680 | elapsed time per iteration (s): 0.82 | learning rate: 3.207E-05 | global batch size: 256 | lm loss: 1.925111E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.169 | TFLOPs: 18.95 | 31: iteration 144870/ 173500 | consumed samples: 37086720 | consumed tokens: 75953602560 | elapsed time per iteration (s): 0.81 | learning rate: 3.206E-05 | global batch size: 256 | lm loss: 1.934032E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.413 | TFLOPs: 19.08 | 31: iteration 144880/ 173500 | consumed samples: 37089280 | consumed tokens: 75958845440 | elapsed time per iteration (s): 0.82 | learning rate: 3.205E-05 | global batch size: 256 | lm loss: 1.942920E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.625 | TFLOPs: 18.97 | 31: iteration 144890/ 173500 | consumed samples: 37091840 | consumed tokens: 75964088320 | elapsed time per iteration (s): 0.82 | learning rate: 3.204E-05 | global batch size: 256 | lm loss: 1.947444E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.405 | TFLOPs: 18.96 | 31: iteration 144900/ 173500 | consumed samples: 37094400 | consumed tokens: 75969331200 | elapsed time per iteration (s): 0.79 | learning rate: 3.204E-05 | global batch size: 256 
| lm loss: 1.906647E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.227 | TFLOPs: 19.55 | 31: iteration 144910/ 173500 | consumed samples: 37096960 | consumed tokens: 75974574080 | elapsed time per iteration (s): 0.93 | learning rate: 3.203E-05 | global batch size: 256 | lm loss: 1.936257E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 275.931 | TFLOPs: 16.69 | 31: iteration 144920/ 173500 | consumed samples: 37099520 | consumed tokens: 75979816960 | elapsed time per iteration (s): 0.77 | learning rate: 3.202E-05 | global batch size: 256 | lm loss: 1.931027E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.652 | TFLOPs: 20.00 | 31: iteration 144930/ 173500 | consumed samples: 37102080 | consumed tokens: 75985059840 | elapsed time per iteration (s): 0.80 | learning rate: 3.201E-05 | global batch size: 256 | lm loss: 1.901807E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.327 | TFLOPs: 19.44 | 31: iteration 144940/ 173500 | consumed samples: 37104640 | consumed tokens: 75990302720 | elapsed time per iteration (s): 0.81 | learning rate: 3.200E-05 | global batch size: 256 | lm loss: 1.930594E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.362 | TFLOPs: 19.08 | 31: iteration 144950/ 173500 | consumed samples: 37107200 | consumed tokens: 75995545600 | elapsed time per iteration (s): 0.79 | learning rate: 3.199E-05 | global batch size: 256 | lm loss: 1.882207E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.254 | TFLOPs: 19.56 | 31: iteration 144960/ 173500 | consumed samples: 37109760 | 
consumed tokens: 76000788480 | elapsed time per iteration (s): 0.81 | learning rate: 3.199E-05 | global batch size: 256 | lm loss: 1.933209E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.233 | TFLOPs: 19.13 | 31: iteration 144970/ 173500 | consumed samples: 37112320 | consumed tokens: 76006031360 | elapsed time per iteration (s): 0.81 | learning rate: 3.198E-05 | global batch size: 256 | lm loss: 1.906685E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.412 | TFLOPs: 19.20 | 31: iteration 144980/ 173500 | consumed samples: 37114880 | consumed tokens: 76011274240 | elapsed time per iteration (s): 0.77 | learning rate: 3.197E-05 | global batch size: 256 | lm loss: 1.908930E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.349 | TFLOPs: 20.11 | 31: iteration 144990/ 173500 | consumed samples: 37117440 | consumed tokens: 76016517120 | elapsed time per iteration (s): 0.93 | learning rate: 3.196E-05 | global batch size: 256 | lm loss: 1.930418E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 274.371 | TFLOPs: 16.60 | 31: iteration 145000/ 173500 | consumed samples: 37120000 | consumed tokens: 76021760000 | elapsed time per iteration (s): 0.86 | learning rate: 3.195E-05 | global batch size: 256 | lm loss: 1.940369E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.655 | TFLOPs: 18.07 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 145000 | lm loss value: 1.925971E+00 | lm loss PPL: 6.861807E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving 
checkpoint at iteration 145000 to checkpoints_1b1long 0: [2022-11-27 02:44:20,957] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step145000 is begin to save! 0: [2022-11-27 02:44:20,976] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_01-model_00-model_states.pt... 0: [2022-11-27 02:44:21,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_01-model_00-model_states.pt. 0: [2022-11-27 02:44:21,189] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_03-model_00-model_states.pt... 0: [2022-11-27 02:44:21,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_03-model_00-model_states.pt. 0: [2022-11-27 02:44:21,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_04-model_00-model_states.pt... 0: [2022-11-27 02:44:21,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_04-model_00-model_states.pt. 0: [2022-11-27 02:44:21,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_05-model_00-model_states.pt... 0: [2022-11-27 02:44:21,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_05-model_00-model_states.pt. 0: [2022-11-27 02:44:21,437] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_06-model_00-model_states.pt... 0: [2022-11-27 02:44:21,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_06-model_00-model_states.pt. 0: [2022-11-27 02:44:21,513] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_07-model_00-model_states.pt... 
0: [2022-11-27 02:44:21,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_07-model_00-model_states.pt. 0: [2022-11-27 02:44:21,626] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_08-model_00-model_states.pt... 0: [2022-11-27 02:44:21,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_08-model_00-model_states.pt. 0: [2022-11-27 02:44:21,744] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_09-model_00-model_states.pt... 0: [2022-11-27 02:44:21,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_09-model_00-model_states.pt. 0: [2022-11-27 02:44:21,852] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_10-model_00-model_states.pt... 0: [2022-11-27 02:44:21,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_10-model_00-model_states.pt. 0: [2022-11-27 02:44:21,960] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_11-model_00-model_states.pt... 0: [2022-11-27 02:44:22,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_11-model_00-model_states.pt. 0: [2022-11-27 02:44:22,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_12-model_00-model_states.pt... 0: [2022-11-27 02:44:22,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_12-model_00-model_states.pt. 0: [2022-11-27 02:44:22,174] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 02:44:22,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_13-model_00-model_states.pt. 0: [2022-11-27 02:44:22,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_14-model_00-model_states.pt... 0: [2022-11-27 02:44:22,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_14-model_00-model_states.pt. 0: [2022-11-27 02:44:22,391] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_15-model_00-model_states.pt... 0: [2022-11-27 02:44:22,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_15-model_00-model_states.pt. 0: [2022-11-27 02:44:22,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_16-model_00-model_states.pt... 0: [2022-11-27 02:44:22,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_16-model_00-model_states.pt. 0: [2022-11-27 02:44:22,606] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_17-model_00-model_states.pt... 0: [2022-11-27 02:44:22,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_17-model_00-model_states.pt. 0: [2022-11-27 02:44:22,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_18-model_00-model_states.pt... 0: [2022-11-27 02:44:22,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_18-model_00-model_states.pt. 0: [2022-11-27 02:44:22,822] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 02:44:22,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_19-model_00-model_states.pt. 0: [2022-11-27 02:44:22,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_20-model_00-model_states.pt... 0: [2022-11-27 02:44:23,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_20-model_00-model_states.pt. 0: [2022-11-27 02:44:23,040] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_21-model_00-model_states.pt... 0: [2022-11-27 02:44:23,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_21-model_00-model_states.pt. 0: [2022-11-27 02:44:23,149] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_22-model_00-model_states.pt... 0: [2022-11-27 02:44:23,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_22-model_00-model_states.pt. 0: [2022-11-27 02:44:23,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_23-model_00-model_states.pt... 0: [2022-11-27 02:44:23,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_23-model_00-model_states.pt. 0: [2022-11-27 02:44:23,361] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_24-model_00-model_states.pt... 0: [2022-11-27 02:44:23,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_24-model_00-model_states.pt. 0: [2022-11-27 02:44:23,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 02:44:23,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_25-model_00-model_states.pt. 0: [2022-11-27 02:44:23,577] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_26-model_00-model_states.pt... 0: [2022-11-27 02:44:23,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_26-model_00-model_states.pt. 0: [2022-11-27 02:44:23,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_27-model_00-model_states.pt... 0: [2022-11-27 02:44:23,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_27-model_00-model_states.pt. 0: [2022-11-27 02:44:23,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_28-model_00-model_states.pt... 0: [2022-11-27 02:44:23,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_28-model_00-model_states.pt. 0: [2022-11-27 02:44:23,900] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/layer_30-model_00-model_states.pt... 0: [2022-11-27 02:44:23,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/layer_30-model_00-model_states.pt. 0: [2022-11-27 02:44:23,903] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step145000/mp_rank_00_model_states.pt 0: [2022-11-27 02:44:23,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/mp_rank_00_model_states.pt... 0: [2022-11-27 02:44:23,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/mp_rank_00_model_states.pt. 
0: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
7: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 
8: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 
3: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
15: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 
23: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 
24: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 
31: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 
17: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 
18: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 
27: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
4: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
1: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
3: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
11: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 
31: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 
21: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 
27: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 
10: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
20: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 
29: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
5: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 
29: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 
0: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:44:23,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:44:24,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:44:24,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 02:44:24,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 19: [2022-11-27 02:44:24,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:44:24,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 9: [2022-11-27 02:44:24,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:44:24,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 9: [2022-11-27 02:44:24,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-27 02:44:24,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 02:44:24,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 9: [2022-11-27 02:44:24,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 02:44:24,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 10: [2022-11-27 02:44:24,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:44:24,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:44:24,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 3: [2022-11-27 02:44:24,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 10: [2022-11-27 02:44:24,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 3: [2022-11-27 02:44:24,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 21: [2022-11-27 02:44:24,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
21: [2022-11-27 02:44:24,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 02:44:24,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 4: [2022-11-27 02:44:24,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:44:24,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:44:24,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 22: [2022-11-27 02:44:24,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 4: [2022-11-27 02:44:24,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:44:24,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 22: [2022-11-27 02:44:24,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 4: [2022-11-27 02:44:24,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 26: [2022-11-27 02:44:24,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:44:24,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 
26: [2022-11-27 02:44:24,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:44:24,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 02:44:24,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 02:44:24,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 26: [2022-11-27 02:44:24,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 25: [2022-11-27 02:44:24,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:44:24,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 10: [2022-11-27 02:44:24,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:44:24,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 
10: [2022-11-27 02:44:24,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 19: [2022-11-27 02:44:24,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 10: [2022-11-27 02:44:24,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 25: [2022-11-27 02:44:24,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 19: [2022-11-27 02:44:24,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 23: [2022-11-27 02:44:24,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:44:24,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:44:24,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 12: [2022-11-27 02:44:24,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:44:24,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 12: [2022-11-27 02:44:24,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 23: [2022-11-27 02:44:24,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 
12: [2022-11-27 02:44:24,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 23: [2022-11-27 02:44:24,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 31: [2022-11-27 02:44:24,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:44:24,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 02:44:24,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 5: [2022-11-27 02:44:24,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:44:24,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 02:44:24,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 5: [2022-11-27 02:44:24,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:44:24,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 02:44:24,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 28: [2022-11-27 02:44:24,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
1: [2022-11-27 02:44:24,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:44:24,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 02:44:24,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 25: [2022-11-27 02:44:24,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:44:24,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 31: [2022-11-27 02:44:24,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:44:24,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 25: [2022-11-27 02:44:24,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 31: [2022-11-27 02:44:24,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 21: [2022-11-27 02:44:24,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:44:24,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 02:44:24,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 
7: [2022-11-27 02:44:24,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:44:24,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 02:44:24,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 28: [2022-11-27 02:44:24,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 02:44:24,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 28: [2022-11-27 02:44:24,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:44:24,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 02:44:24,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 21: [2022-11-27 02:44:24,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:44:24,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:44:24,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 02:44:24,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 
27: [2022-11-27 02:44:24,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 02:44:24,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 8: [2022-11-27 02:44:24,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:44:24,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 02:44:24,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 27: [2022-11-27 02:44:24,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:44:24,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:44:24,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 02:44:24,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 0: [2022-11-27 02:44:24,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 02:44:24,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 7: [2022-11-27 02:44:24,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
30: [2022-11-27 02:44:24,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:44:24,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 02:44:24,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 8: [2022-11-27 02:44:24,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:44:24,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 8: [2022-11-27 02:44:24,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 30: [2022-11-27 02:44:24,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 8: [2022-11-27 02:44:24,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 1: [2022-11-27 02:44:24,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:44:24,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 02:44:24,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 12: [2022-11-27 02:44:24,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-27 02:44:24,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 30: [2022-11-27 02:44:24,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:44:24,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 30: [2022-11-27 02:44:24,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 02:44:24,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 4: [2022-11-27 02:44:24,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:44:24,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:44:24,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 1: [2022-11-27 02:44:24,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 4: [2022-11-27 02:44:24,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 1: [2022-11-27 02:44:24,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 12: [2022-11-27 02:44:24,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-27 02:44:24,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 02:44:24,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 22: [2022-11-27 02:44:24,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:44:24,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 02:44:24,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 19: [2022-11-27 02:44:24,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:44:24,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 02:44:24,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 30: [2022-11-27 02:44:24,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:44:24,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 3: [2022-11-27 02:44:24,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:44:24,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 
3: [2022-11-27 02:44:24,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 02:44:24,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 0: [2022-11-27 02:44:24,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:44:24,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:44:24,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:44:24,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:44:24,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 0: [2022-11-27 02:44:24,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 2: [2022-11-27 02:44:24,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 02:44:24,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 12: [2022-11-27 02:44:24,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
28: [2022-11-27 02:44:24,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 0: [2022-11-27 02:44:24,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 2: [2022-11-27 02:44:24,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 2: [2022-11-27 02:44:24,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 12: [2022-11-27 02:44:24,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 21: [2022-11-27 02:44:24,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:44:24,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:44:24,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 21: [2022-11-27 02:44:24,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 5: [2022-11-27 02:44:24,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 2: [2022-11-27 02:44:24,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:44:24,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 5: [2022-11-27 02:44:24,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 
2: [2022-11-27 02:44:24,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 02:44:24,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 8: [2022-11-27 02:44:24,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:44:24,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 02:44:24,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 27: [2022-11-27 02:44:24,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:44:24,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:44:24,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 02:44:24,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 4: [2022-11-27 02:44:24,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 02:44:24,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 13: [2022-11-27 02:44:24,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-27 02:44:24,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 02:44:24,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 10: [2022-11-27 02:44:24,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:44:24,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 02:44:24,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 31: [2022-11-27 02:44:24,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:44:24,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 0: [2022-11-27 02:44:24,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:44:24,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 0: [2022-11-27 02:44:24,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 20: [2022-11-27 02:44:24,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:44:24,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 
20: [2022-11-27 02:44:24,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 02:44:24,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 22: [2022-11-27 02:44:24,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:44:24,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 02:44:24,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 23: [2022-11-27 02:44:24,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:44:24,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 02:44:24,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 13: [2022-11-27 02:44:24,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:44:24,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:44:24,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
13: [2022-11-27 02:44:24,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 23: [2022-11-27 02:44:24,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 7: [2022-11-27 02:44:24,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 13: [2022-11-27 02:44:24,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 23: [2022-11-27 02:44:24,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 7: [2022-11-27 02:44:24,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 8: [2022-11-27 02:44:24,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:44:24,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 02:44:24,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 20: [2022-11-27 02:44:24,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:44:24,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
20: [2022-11-27 02:44:24,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 26: [2022-11-27 02:44:24,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:44:24,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:44:24,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 25: [2022-11-27 02:44:24,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 26: [2022-11-27 02:44:24,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 02:44:24,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 02:44:24,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 10: [2022-11-27 02:44:24,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:44:24,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 25: [2022-11-27 02:44:24,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 
10: [2022-11-27 02:44:24,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 02:44:24,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 22: [2022-11-27 02:44:24,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:44:24,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:44:24,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:44:24,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 02:44:24,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 5: [2022-11-27 02:44:24,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 02:44:24,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 27: [2022-11-27 02:44:24,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 02:44:24,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 14: [2022-11-27 02:44:24,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-27 02:44:24,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 02:44:24,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 2: [2022-11-27 02:44:24,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:44:24,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 02:44:24,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 1: [2022-11-27 02:44:24,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:44:24,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 24: [2022-11-27 02:44:24,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:44:24,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:44:24,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 
24: [2022-11-27 02:44:24,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-27 02:44:24,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 28: [2022-11-27 02:44:24,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:44:24,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 24: [2022-11-27 02:44:24,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 28: [2022-11-27 02:44:24,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 24: [2022-11-27 02:44:24,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:44:24,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:44:24,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 24: [2022-11-27 02:44:24,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-27 02:44:24,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 
25: [2022-11-27 02:44:24,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 11: [2022-11-27 02:44:24,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:44:24,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 02:44:24,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 25: [2022-11-27 02:44:24,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 11: [2022-11-27 02:44:24,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:44:24,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 02:44:24,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 11: [2022-11-27 02:44:24,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:44:24,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 02:44:24,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 11: [2022-11-27 02:44:24,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-27 02:44:24,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 02:44:24,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 9: [2022-11-27 02:44:24,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:44:24,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 6: [2022-11-27 02:44:24,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:44:24,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 6: [2022-11-27 02:44:24,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 02:44:24,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 31: [2022-11-27 02:44:24,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:44:24,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:44:24,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 02:44:24,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 
31: [2022-11-27 02:44:24,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 02:44:24,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 18: [2022-11-27 02:44:24,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:44:24,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:44:24,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:44:24,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:44:24,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 02:44:24,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 02:44:24,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 02:44:24,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 02:44:24,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 
18: [2022-11-27 02:44:24,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 18: [2022-11-27 02:44:24,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 9: [2022-11-27 02:44:24,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:44:24,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 9: [2022-11-27 02:44:24,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 02:44:24,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 17: [2022-11-27 02:44:24,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:44:24,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:44:24,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 02:44:24,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 17: [2022-11-27 02:44:24,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 02:44:24,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 
17: [2022-11-27 02:44:24,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:44:24,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 02:44:24,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 17: [2022-11-27 02:44:24,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:44:24,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 3: [2022-11-27 02:44:24,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:44:24,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 17: [2022-11-27 02:44:24,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:44:24,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 17: [2022-11-27 02:44:24,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 3: [2022-11-27 02:44:24,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
17: [2022-11-27 02:44:24,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 3: [2022-11-27 02:44:24,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 3: [2022-11-27 02:44:24,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 02:44:24,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 7: [2022-11-27 02:44:24,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:44:24,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 02:44:24,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 20: [2022-11-27 02:44:24,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:44:24,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:44:24,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:44:24,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
20: [2022-11-27 02:44:24,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 02:44:24,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 14: [2022-11-27 02:44:24,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:44:24,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 02:44:24,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 14: [2022-11-27 02:44:24,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 15: [2022-11-27 02:44:24,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 14: [2022-11-27 02:44:24,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 15: [2022-11-27 02:44:24,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 15: [2022-11-27 02:44:24,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 15: [2022-11-27 02:44:24,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 14: [2022-11-27 02:44:24,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-27 02:44:24,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 02:44:24,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 30: [2022-11-27 02:44:24,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:44:24,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 02:44:24,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 29: [2022-11-27 02:44:24,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:44:24,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:44:24,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 02:44:24,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 29: [2022-11-27 02:44:24,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 02:44:24,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 29: [2022-11-27 02:44:24,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-27 02:44:24,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:44:24,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 02:44:24,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 02:44:24,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 29: [2022-11-27 02:44:24,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 6: [2022-11-27 02:44:24,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:44:24,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 02:44:24,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 19: [2022-11-27 02:44:24,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:44:24,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 02:44:24,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 6: [2022-11-27 02:44:24,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-27 02:44:24,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 02:44:24,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 11: [2022-11-27 02:44:24,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:44:24,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 02:44:24,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 17: [2022-11-27 02:44:24,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:44:24,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 02:44:24,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 4: [2022-11-27 02:44:24,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:44:24,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 02:44:24,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 16: [2022-11-27 02:44:24,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
16: [2022-11-27 02:44:24,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:44:24,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:44:24,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:44:24,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 02:44:24,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 02:44:24,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 02:44:24,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 16: [2022-11-27 02:44:24,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 16: [2022-11-27 02:44:24,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 16: [2022-11-27 02:44:24,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 02:44:24,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 10: [2022-11-27 02:44:24,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-27 02:44:24,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 02:44:24,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 23: [2022-11-27 02:44:24,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:44:24,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 0: [2022-11-27 02:44:24,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:44:24,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 26: [2022-11-27 02:44:24,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:44:24,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 02:44:24,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 0: [2022-11-27 02:44:24,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 02:44:24,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 5: [2022-11-27 02:44:24,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-27 02:44:24,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 02:44:24,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 28: [2022-11-27 02:44:24,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:44:24,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 02:44:24,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 30: [2022-11-27 02:44:24,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:44:24,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 02:44:24,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 9: [2022-11-27 02:44:24,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:44:24,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 02:44:24,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 8: [2022-11-27 02:44:24,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-27 02:44:24,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 02:44:24,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 21: [2022-11-27 02:44:24,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:44:24,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 02:44:24,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 31: [2022-11-27 02:44:24,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:44:24,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:44:24,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 2: [2022-11-27 02:44:24,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 31: [2022-11-27 02:44:24,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 2: [2022-11-27 02:44:24,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 25: [2022-11-27 02:44:24,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
7: [2022-11-27 02:44:24,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:44:24,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 7: [2022-11-27 02:44:24,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 02:44:24,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 25: [2022-11-27 02:44:24,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 16: [2022-11-27 02:44:24,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:44:24,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:44:24,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 3: [2022-11-27 02:44:24,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:44:24,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 
22: [2022-11-27 02:44:24,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 3: [2022-11-27 02:44:24,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 22: [2022-11-27 02:44:24,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 3: [2022-11-27 02:44:24,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 0: [2022-11-27 02:44:24,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:44:24,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 02:44:24,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 24: [2022-11-27 02:44:24,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:44:24,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 02:44:24,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 1: [2022-11-27 02:44:24,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-27 02:44:24,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 02:44:24,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 18: [2022-11-27 02:44:24,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:44:24,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 29: [2022-11-27 02:44:24,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:44:24,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 27: [2022-11-27 02:44:24,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:44:24,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 02:44:24,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 27: [2022-11-27 02:44:24,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 02:44:24,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 15: [2022-11-27 02:44:24,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-27 02:44:24,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 02:44:24,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 13: [2022-11-27 02:44:24,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:44:24,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 02:44:24,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 12: [2022-11-27 02:44:24,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:44:24,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:44:24,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 02:44:24,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 6: [2022-11-27 02:44:24,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 02:44:24,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 20: [2022-11-27 02:44:24,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
20: [2022-11-27 02:44:24,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 02:44:24,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 14: [2022-11-27 02:44:24,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:44:24,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 02:44:24,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 4: [2022-11-27 02:44:24,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:44:24,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 02:44:24,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 26: [2022-11-27 02:44:24,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:44:24,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 02:44:24,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 11: [2022-11-27 02:44:24,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-27 02:44:24,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 02:44:24,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 23: [2022-11-27 02:44:24,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:44:24,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 02:44:24,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 5: [2022-11-27 02:44:24,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:44:24,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 02:44:24,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 8: [2022-11-27 02:44:24,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:44:24,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 02:44:24,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 3: [2022-11-27 02:44:24,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-27 02:44:24,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 02:44:24,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 31: [2022-11-27 02:44:24,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:44:24,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 02:44:24,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 9: [2022-11-27 02:44:24,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:44:24,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 02:44:24,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 21: [2022-11-27 02:44:24,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:44:24,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:44:24,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
29: [2022-11-27 02:44:24,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 21: [2022-11-27 02:44:24,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 26: [2022-11-27 02:44:24,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 29: [2022-11-27 02:44:24,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 26: [2022-11-27 02:44:24,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 21: [2022-11-27 02:44:24,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 2: [2022-11-27 02:44:24,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:44:24,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 02:44:24,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 28: [2022-11-27 02:44:24,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:44:24,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-27 02:44:24,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 02:44:24,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 19: [2022-11-27 02:44:24,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:44:24,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 02:44:24,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 18: [2022-11-27 02:44:24,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:44:24,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 02:44:24,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 25: [2022-11-27 02:44:24,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:44:24,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 02:44:24,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 7: [2022-11-27 02:44:24,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-27 02:44:24,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 02:44:24,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 17: [2022-11-27 02:44:24,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:44:24,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 02:44:24,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 30: [2022-11-27 02:44:24,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:44:24,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 02:44:24,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 21: [2022-11-27 02:44:24,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:44:24,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:44:24,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 02:44:24,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 
10: [2022-11-27 02:44:24,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:44:24,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 02:44:24,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 10: [2022-11-27 02:44:24,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 02:44:24,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 27: [2022-11-27 02:44:24,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:44:24,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:44:24,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 02:44:24,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 27: [2022-11-27 02:44:24,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 02:44:24,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 12: [2022-11-27 02:44:24,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-27 02:44:24,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 02:44:24,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 13: [2022-11-27 02:44:24,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:44:24,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 02:44:24,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 13: [2022-11-27 02:44:24,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 02:44:24,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 4: [2022-11-27 02:44:24,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:44:24,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 02:44:24,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 20: [2022-11-27 02:44:24,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
20: [2022-11-27 02:44:24,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 02:44:24,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 0: [2022-11-27 02:44:24,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:44:24,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 17: [2022-11-27 02:44:24,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:44:24,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 17: [2022-11-27 02:44:24,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 6: [2022-11-27 02:44:24,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:44:24,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 02:44:24,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 17: [2022-11-27 02:44:24,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 2: [2022-11-27 02:44:24,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-27 02:44:24,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 02:44:24,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 9: [2022-11-27 02:44:24,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:44:24,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 14: [2022-11-27 02:44:24,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:44:24,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:44:24,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 14: [2022-11-27 02:44:24,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 02:44:24,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 19: [2022-11-27 02:44:24,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 02:44:24,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 8: [2022-11-27 02:44:24,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-27 02:44:24,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 02:44:24,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 11: [2022-11-27 02:44:24,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:44:24,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 02:44:24,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 5: [2022-11-27 02:44:24,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:44:24,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 02:44:24,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 30: [2022-11-27 02:44:24,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:44:24,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 02:44:24,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 15: [2022-11-27 02:44:24,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-27 02:44:24,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 02:44:24,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 23: [2022-11-27 02:44:24,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:44:24,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 02:44:24,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 28: [2022-11-27 02:44:24,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:44:24,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 02:44:24,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 26: [2022-11-27 02:44:24,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:44:24,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-27 02:44:24,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 1: [2022-11-27 02:44:24,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-27 02:44:24,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 02:44:24,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 25: [2022-11-27 02:44:24,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:44:24,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:44:24,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 25: [2022-11-27 02:44:24,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 16: [2022-11-27 02:44:24,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 16: [2022-11-27 02:44:24,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:44:24,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 16: [2022-11-27 02:44:24,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 02:44:24,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 29: [2022-11-27 02:44:24,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-27 02:44:24,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 02:44:24,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 25: [2022-11-27 02:44:24,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:44:24,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 02:44:24,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 10: [2022-11-27 02:44:24,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:44:24,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 02:44:24,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 10: [2022-11-27 02:44:24,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:44:24,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 02:44:24,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 24: [2022-11-27 02:44:24,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
24: [2022-11-27 02:44:24,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 02:44:24,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 17: [2022-11-27 02:44:24,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:44:24,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:44:24,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 19: [2022-11-27 02:44:24,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:44:24,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 19: [2022-11-27 02:44:24,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 5: [2022-11-27 02:44:24,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 17: [2022-11-27 02:44:24,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 3: [2022-11-27 02:44:24,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:44:24,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 
31: [2022-11-27 02:44:24,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:44:24,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 31: [2022-11-27 02:44:24,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 02:44:24,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 3: [2022-11-27 02:44:24,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 16: [2022-11-27 02:44:24,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:44:24,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 02:44:24,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 4: [2022-11-27 02:44:24,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:44:24,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 02:44:24,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 27: [2022-11-27 02:44:24,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
7: [2022-11-27 02:44:24,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:44:24,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 27: [2022-11-27 02:44:24,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 02:44:24,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 7: [2022-11-27 02:44:24,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 22: [2022-11-27 02:44:24,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:44:24,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 02:44:24,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 0: [2022-11-27 02:44:24,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:44:24,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 02:44:24,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 28: [2022-11-27 02:44:24,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 
8: [2022-11-27 02:44:24,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:44:24,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 02:44:24,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 18: [2022-11-27 02:44:24,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:44:24,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 02:44:24,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 12: [2022-11-27 02:44:24,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:44:24,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:44:24,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 12: [2022-11-27 02:44:24,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 02:44:24,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 18: [2022-11-27 02:44:24,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 
15: [2022-11-27 02:44:24,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:44:24,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:44:24,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 02:44:24,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 02:44:24,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 15: [2022-11-27 02:44:24,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 11: [2022-11-27 02:44:24,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:44:24,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:44:24,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 02:44:24,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 21: [2022-11-27 02:44:24,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 02:44:24,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 
12: [2022-11-27 02:44:24,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:44:24,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 02:44:24,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 30: [2022-11-27 02:44:24,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:44:24,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 02:44:24,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 2: [2022-11-27 02:44:24,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:44:24,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 3: [2022-11-27 02:44:24,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:44:24,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 3: [2022-11-27 02:44:24,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 02:44:24,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 
29: [2022-11-27 02:44:24,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:44:24,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 14: [2022-11-27 02:44:24,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:44:24,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 14: [2022-11-27 02:44:24,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 02:44:24,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 7: [2022-11-27 02:44:24,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:44:24,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 02:44:24,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 23: [2022-11-27 02:44:24,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:44:24,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 02:44:24,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 
31: [2022-11-27 02:44:24,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:44:24,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 02:44:24,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 6: [2022-11-27 02:44:24,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:44:24,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:44:24,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 02:44:24,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 27: [2022-11-27 02:44:24,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 02:44:24,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 22: [2022-11-27 02:44:24,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:44:24,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 02:44:24,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 
9: [2022-11-27 02:44:24,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:44:24,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 02:44:24,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 20: [2022-11-27 02:44:24,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:44:24,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 02:44:24,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 28: [2022-11-27 02:44:24,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 02:44:24,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 24: [2022-11-27 02:44:24,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:44:24,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:44:24,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 02:44:24,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 
0: [2022-11-27 02:44:24,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 02:44:24,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 6: [2022-11-27 02:44:24,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:44:24,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 02:44:24,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 13: [2022-11-27 02:44:24,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:44:24,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 02:44:24,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 13: [2022-11-27 02:44:24,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:44:24,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 02:44:24,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 6: [2022-11-27 02:44:24,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-27 02:44:24,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 02:44:24,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 24: [2022-11-27 02:44:24,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:44:24,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 02:44:24,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 20: [2022-11-27 02:44:24,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:44:24,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 02:44:24,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 14: [2022-11-27 02:44:24,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:44:24,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 02:44:24,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 15: [2022-11-27 02:44:24,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-27 02:44:24,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 02:44:24,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 20: [2022-11-27 02:44:24,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:44:24,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 02:44:24,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 14: [2022-11-27 02:44:24,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:44:24,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 02:44:24,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 13: [2022-11-27 02:44:24,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:44:24,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step145000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 02:44:24,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step145000 is ready now! 
0: successfully saved checkpoint at iteration 145000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 3283.67 31: iteration 145010/ 173500 | consumed samples: 37122560 | consumed tokens: 76027002880 | elapsed time per iteration (s): 1.21 | learning rate: 3.195E-05 | global batch size: 256 | lm loss: 1.937641E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 211.961 | TFLOPs: 12.82 | 31: iteration 145020/ 173500 | consumed samples: 37125120 | consumed tokens: 76032245760 | elapsed time per iteration (s): 0.82 | learning rate: 3.194E-05 | global batch size: 256 | lm loss: 1.898278E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.351 | TFLOPs: 18.90 | 31: iteration 145030/ 173500 | consumed samples: 37127680 | consumed tokens: 76037488640 | elapsed time per iteration (s): 0.78 | learning rate: 3.193E-05 | global batch size: 256 | lm loss: 1.910940E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.664 | TFLOPs: 19.94 | 31: iteration 145040/ 173500 | consumed samples: 37130240 | consumed tokens: 76042731520 | elapsed time per iteration (s): 0.81 | learning rate: 3.192E-05 | global batch size: 256 | lm loss: 1.943919E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.592 | TFLOPs: 19.15 | 31: iteration 145050/ 173500 | consumed samples: 37132800 | consumed tokens: 76047974400 | elapsed time per iteration (s): 0.81 | learning rate: 3.191E-05 | global batch size: 256 | lm loss: 1.898580E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.063 | TFLOPs: 19.06 | 31: iteration 145060/ 173500 | consumed samples: 37135360 | consumed tokens: 76053217280 | elapsed time per iteration (s): 
0.81 | learning rate: 3.190E-05 | global batch size: 256 | lm loss: 1.909867E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.457 | TFLOPs: 19.08 | 31: iteration 145070/ 173500 | consumed samples: 37137920 | consumed tokens: 76058460160 | elapsed time per iteration (s): 0.82 | learning rate: 3.190E-05 | global batch size: 256 | lm loss: 1.947128E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.693 | TFLOPs: 18.98 | 31: iteration 145080/ 173500 | consumed samples: 37140480 | consumed tokens: 76063703040 | elapsed time per iteration (s): 0.82 | learning rate: 3.189E-05 | global batch size: 256 | lm loss: 1.920087E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.587 | TFLOPs: 18.85 | 31: iteration 145090/ 173500 | consumed samples: 37143040 | consumed tokens: 76068945920 | elapsed time per iteration (s): 0.83 | learning rate: 3.188E-05 | global batch size: 256 | lm loss: 1.946768E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.777 | TFLOPs: 18.56 | 31: iteration 145100/ 173500 | consumed samples: 37145600 | consumed tokens: 76074188800 | elapsed time per iteration (s): 0.81 | learning rate: 3.187E-05 | global batch size: 256 | lm loss: 1.958483E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.482 | TFLOPs: 19.21 | 31: iteration 145110/ 173500 | consumed samples: 37148160 | consumed tokens: 76079431680 | elapsed time per iteration (s): 0.82 | learning rate: 3.186E-05 | global batch size: 256 | lm loss: 1.906813E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.333 | TFLOPs: 18.77 | 31: 
iteration 145120/ 173500 | consumed samples: 37150720 | consumed tokens: 76084674560 | elapsed time per iteration (s): 0.80 | learning rate: 3.186E-05 | global batch size: 256 | lm loss: 1.927371E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.205 | TFLOPs: 19.31 | 31: iteration 145130/ 173500 | consumed samples: 37153280 | consumed tokens: 76089917440 | elapsed time per iteration (s): 0.81 | learning rate: 3.185E-05 | global batch size: 256 | lm loss: 1.942423E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.182 | TFLOPs: 19.13 | 31: iteration 145140/ 173500 | consumed samples: 37155840 | consumed tokens: 76095160320 | elapsed time per iteration (s): 0.82 | learning rate: 3.184E-05 | global batch size: 256 | lm loss: 1.936810E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.989 | TFLOPs: 18.81 | 31: iteration 145150/ 173500 | consumed samples: 37158400 | consumed tokens: 76100403200 | elapsed time per iteration (s): 0.80 | learning rate: 3.183E-05 | global batch size: 256 | lm loss: 1.935007E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.379 | TFLOPs: 19.32 | 31: iteration 145160/ 173500 | consumed samples: 37160960 | consumed tokens: 76105646080 | elapsed time per iteration (s): 0.80 | learning rate: 3.182E-05 | global batch size: 256 | lm loss: 1.951327E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.608 | TFLOPs: 19.27 | 31: iteration 145170/ 173500 | consumed samples: 37163520 | consumed tokens: 76110888960 | elapsed time per iteration (s): 0.90 | learning rate: 3.181E-05 | global batch size: 256 | lm loss: 1.894838E+00 | grad norm: 0.180 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 283.747 | TFLOPs: 17.17 | 31: iteration 145180/ 173500 | consumed samples: 37166080 | consumed tokens: 76116131840 | elapsed time per iteration (s): 0.82 | learning rate: 3.181E-05 | global batch size: 256 | lm loss: 1.926762E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.614 | TFLOPs: 18.97 | 31: iteration 145190/ 173500 | consumed samples: 37168640 | consumed tokens: 76121374720 | elapsed time per iteration (s): 0.82 | learning rate: 3.180E-05 | global batch size: 256 | lm loss: 1.916580E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.266 | TFLOPs: 18.89 | 31: iteration 145200/ 173500 | consumed samples: 37171200 | consumed tokens: 76126617600 | elapsed time per iteration (s): 0.83 | learning rate: 3.179E-05 | global batch size: 256 | lm loss: 1.919011E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.354 | TFLOPs: 18.65 | 31: iteration 145210/ 173500 | consumed samples: 37173760 | consumed tokens: 76131860480 | elapsed time per iteration (s): 0.79 | learning rate: 3.178E-05 | global batch size: 256 | lm loss: 1.941748E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.190 | TFLOPs: 19.67 | 31: iteration 145220/ 173500 | consumed samples: 37176320 | consumed tokens: 76137103360 | elapsed time per iteration (s): 0.74 | learning rate: 3.177E-05 | global batch size: 256 | lm loss: 1.904387E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.417 | TFLOPs: 20.84 | 31: iteration 145230/ 173500 | consumed samples: 37178880 | consumed tokens: 76142346240 | elapsed time per iteration (s): 0.75 | 
learning rate: 3.177E-05 | global batch size: 256 | lm loss: 1.969717E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.312 | TFLOPs: 20.59 | 31: iteration 145240/ 173500 | consumed samples: 37181440 | consumed tokens: 76147589120 | elapsed time per iteration (s): 0.77 | learning rate: 3.176E-05 | global batch size: 256 | lm loss: 1.924309E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.381 | TFLOPs: 20.05 | 31: iteration 145250/ 173500 | consumed samples: 37184000 | consumed tokens: 76152832000 | elapsed time per iteration (s): 0.74 | learning rate: 3.175E-05 | global batch size: 256 | lm loss: 1.932059E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.965 | TFLOPs: 20.87 | 31: iteration 145260/ 173500 | consumed samples: 37186560 | consumed tokens: 76158074880 | elapsed time per iteration (s): 0.79 | learning rate: 3.174E-05 | global batch size: 256 | lm loss: 1.922259E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.483 | TFLOPs: 19.57 | 31: iteration 145270/ 173500 | consumed samples: 37189120 | consumed tokens: 76163317760 | elapsed time per iteration (s): 0.78 | learning rate: 3.173E-05 | global batch size: 256 | lm loss: 1.951509E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.052 | TFLOPs: 19.91 | 31: iteration 145280/ 173500 | consumed samples: 37191680 | consumed tokens: 76168560640 | elapsed time per iteration (s): 0.77 | learning rate: 3.172E-05 | global batch size: 256 | lm loss: 1.945456E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.926 | TFLOPs: 20.02 | 31: iteration 
145290/ 173500 | consumed samples: 37194240 | consumed tokens: 76173803520 | elapsed time per iteration (s): 0.88 | learning rate: 3.172E-05 | global batch size: 256 | lm loss: 1.945218E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.635 | TFLOPs: 17.52 | 31: iteration 145300/ 173500 | consumed samples: 37196800 | consumed tokens: 76179046400 | elapsed time per iteration (s): 0.79 | learning rate: 3.171E-05 | global batch size: 256 | lm loss: 1.932276E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.966 | TFLOPs: 19.66 | 31: iteration 145310/ 173500 | consumed samples: 37199360 | consumed tokens: 76184289280 | elapsed time per iteration (s): 0.74 | learning rate: 3.170E-05 | global batch size: 256 | lm loss: 1.926876E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.130 | TFLOPs: 20.94 | 31: iteration 145320/ 173500 | consumed samples: 37201920 | consumed tokens: 76189532160 | elapsed time per iteration (s): 0.79 | learning rate: 3.169E-05 | global batch size: 256 | lm loss: 1.920516E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.964 | TFLOPs: 19.66 | 31: iteration 145330/ 173500 | consumed samples: 37204480 | consumed tokens: 76194775040 | elapsed time per iteration (s): 0.76 | learning rate: 3.168E-05 | global batch size: 256 | lm loss: 1.931654E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.173 | TFLOPs: 20.46 | 31: iteration 145340/ 173500 | consumed samples: 37207040 | consumed tokens: 76200017920 | elapsed time per iteration (s): 0.78 | learning rate: 3.168E-05 | global batch size: 256 | lm loss: 1.942238E+00 | grad norm: 0.174 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.320 | TFLOPs: 19.86 | 31: iteration 145350/ 173500 | consumed samples: 37209600 | consumed tokens: 76205260800 | elapsed time per iteration (s): 0.93 | learning rate: 3.167E-05 | global batch size: 256 | lm loss: 1.974532E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 276.298 | TFLOPs: 16.72 | 31: iteration 145360/ 173500 | consumed samples: 37212160 | consumed tokens: 76210503680 | elapsed time per iteration (s): 0.75 | learning rate: 3.166E-05 | global batch size: 256 | lm loss: 1.926581E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.330 | TFLOPs: 20.77 | 31: iteration 145370/ 173500 | consumed samples: 37214720 | consumed tokens: 76215746560 | elapsed time per iteration (s): 0.73 | learning rate: 3.165E-05 | global batch size: 256 | lm loss: 1.908018E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.652 | TFLOPs: 21.27 | 31: iteration 145380/ 173500 | consumed samples: 37217280 | consumed tokens: 76220989440 | elapsed time per iteration (s): 0.82 | learning rate: 3.164E-05 | global batch size: 256 | lm loss: 1.918280E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.469 | TFLOPs: 18.90 | 31: iteration 145390/ 173500 | consumed samples: 37219840 | consumed tokens: 76226232320 | elapsed time per iteration (s): 0.84 | learning rate: 3.164E-05 | global batch size: 256 | lm loss: 1.940100E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.844 | TFLOPs: 18.50 | 31: iteration 145400/ 173500 | consumed samples: 37222400 | consumed tokens: 76231475200 | elapsed time per iteration (s): 0.90 | learning 
rate: 3.163E-05 | global batch size: 256 | lm loss: 1.928085E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 283.412 | TFLOPs: 17.15 | 31: iteration 145410/ 173500 | consumed samples: 37224960 | consumed tokens: 76236718080 | elapsed time per iteration (s): 0.80 | learning rate: 3.162E-05 | global batch size: 256 | lm loss: 1.935270E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.188 | TFLOPs: 19.25 | 31: iteration 145420/ 173500 | consumed samples: 37227520 | consumed tokens: 76241960960 | elapsed time per iteration (s): 0.79 | learning rate: 3.161E-05 | global batch size: 256 | lm loss: 1.897665E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.902 | TFLOPs: 19.66 | 31: iteration 145430/ 173500 | consumed samples: 37230080 | consumed tokens: 76247203840 | elapsed time per iteration (s): 0.80 | learning rate: 3.160E-05 | global batch size: 256 | lm loss: 1.918661E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.225 | TFLOPs: 19.25 | 31: iteration 145440/ 173500 | consumed samples: 37232640 | consumed tokens: 76252446720 | elapsed time per iteration (s): 0.81 | learning rate: 3.160E-05 | global batch size: 256 | lm loss: 1.910547E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.216 | TFLOPs: 19.07 | 31: iteration 145450/ 173500 | consumed samples: 37235200 | consumed tokens: 76257689600 | elapsed time per iteration (s): 0.80 | learning rate: 3.159E-05 | global batch size: 256 | lm loss: 1.917372E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.391 | TFLOPs: 19.44 | 31: iteration 145460/ 
173500 | consumed samples: 37237760 | consumed tokens: 76262932480 | elapsed time per iteration (s): 0.80 | learning rate: 3.158E-05 | global batch size: 256 | lm loss: 1.950686E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.278 | TFLOPs: 19.26 | 31: iteration 145470/ 173500 | consumed samples: 37240320 | consumed tokens: 76268175360 | elapsed time per iteration (s): 0.77 | learning rate: 3.157E-05 | global batch size: 256 | lm loss: 1.905054E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.205 | TFLOPs: 20.16 | 31: iteration 145480/ 173500 | consumed samples: 37242880 | consumed tokens: 76273418240 | elapsed time per iteration (s): 0.77 | learning rate: 3.156E-05 | global batch size: 256 | lm loss: 1.931171E+00 | grad norm: 0.217 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.336 | TFLOPs: 20.11 | 31: iteration 145490/ 173500 | consumed samples: 37245440 | consumed tokens: 76278661120 | elapsed time per iteration (s): 0.75 | learning rate: 3.155E-05 | global batch size: 256 | lm loss: 1.933332E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.719 | TFLOPs: 20.55 | 31: iteration 145500/ 173500 | consumed samples: 37248000 | consumed tokens: 76283904000 | elapsed time per iteration (s): 0.77 | learning rate: 3.155E-05 | global batch size: 256 | lm loss: 1.912896E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.380 | TFLOPs: 20.05 | 31: iteration 145510/ 173500 | consumed samples: 37250560 | consumed tokens: 76289146880 | elapsed time per iteration (s): 0.75 | learning rate: 3.154E-05 | global batch size: 256 | lm loss: 1.917675E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 341.334 | TFLOPs: 20.65 | 31: iteration 145520/ 173500 | consumed samples: 37253120 | consumed tokens: 76294389760 | elapsed time per iteration (s): 0.76 | learning rate: 3.153E-05 | global batch size: 256 | lm loss: 1.943193E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.862 | TFLOPs: 20.44 | 31: iteration 145530/ 173500 | consumed samples: 37255680 | consumed tokens: 76299632640 | elapsed time per iteration (s): 0.86 | learning rate: 3.152E-05 | global batch size: 256 | lm loss: 1.931846E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.109 | TFLOPs: 18.10 | 31: iteration 145540/ 173500 | consumed samples: 37258240 | consumed tokens: 76304875520 | elapsed time per iteration (s): 0.74 | learning rate: 3.151E-05 | global batch size: 256 | lm loss: 1.910901E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.567 | TFLOPs: 20.97 | 31: iteration 145550/ 173500 | consumed samples: 37260800 | consumed tokens: 76310118400 | elapsed time per iteration (s): 0.75 | learning rate: 3.151E-05 | global batch size: 256 | lm loss: 1.929633E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.376 | TFLOPs: 20.65 | 31: iteration 145560/ 173500 | consumed samples: 37263360 | consumed tokens: 76315361280 | elapsed time per iteration (s): 0.72 | learning rate: 3.150E-05 | global batch size: 256 | lm loss: 1.909903E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.334 | TFLOPs: 21.38 | 31: iteration 145570/ 173500 | consumed samples: 37265920 | consumed tokens: 76320604160 | elapsed time per iteration (s): 0.73 | learning rate: 
3.149E-05 | global batch size: 256 | lm loss: 1.940315E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.330 | TFLOPs: 21.19 | 31: iteration 145580/ 173500 | consumed samples: 37268480 | consumed tokens: 76325847040 | elapsed time per iteration (s): 0.74 | learning rate: 3.148E-05 | global batch size: 256 | lm loss: 1.936553E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.156 | TFLOPs: 20.88 | 31: iteration 145590/ 173500 | consumed samples: 37271040 | consumed tokens: 76331089920 | elapsed time per iteration (s): 0.74 | learning rate: 3.147E-05 | global batch size: 256 | lm loss: 1.929085E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.765 | TFLOPs: 20.98 | 31: iteration 145600/ 173500 | consumed samples: 37273600 | consumed tokens: 76336332800 | elapsed time per iteration (s): 0.80 | learning rate: 3.147E-05 | global batch size: 256 | lm loss: 1.956868E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.689 | TFLOPs: 19.28 | 31: iteration 145610/ 173500 | consumed samples: 37276160 | consumed tokens: 76341575680 | elapsed time per iteration (s): 0.74 | learning rate: 3.146E-05 | global batch size: 256 | lm loss: 1.953873E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.259 | TFLOPs: 21.01 | 31: iteration 145620/ 173500 | consumed samples: 37278720 | consumed tokens: 76346818560 | elapsed time per iteration (s): 0.83 | learning rate: 3.145E-05 | global batch size: 256 | lm loss: 1.939208E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.452 | TFLOPs: 18.72 | 31: iteration 145630/ 173500 | 
consumed samples: 37281280 | consumed tokens: 76352061440 | elapsed time per iteration (s): 0.78 | learning rate: 3.144E-05 | global batch size: 256 | lm loss: 1.922919E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.171 | TFLOPs: 19.97 | 31: iteration 145640/ 173500 | consumed samples: 37283840 | consumed tokens: 76357304320 | elapsed time per iteration (s): 0.82 | learning rate: 3.143E-05 | global batch size: 256 | lm loss: 1.932558E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.836 | TFLOPs: 18.99 | 31: iteration 145650/ 173500 | consumed samples: 37286400 | consumed tokens: 76362547200 | elapsed time per iteration (s): 0.75 | learning rate: 3.143E-05 | global batch size: 256 | lm loss: 1.923217E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.950 | TFLOPs: 20.63 | 31: iteration 145660/ 173500 | consumed samples: 37288960 | consumed tokens: 76367790080 | elapsed time per iteration (s): 0.79 | learning rate: 3.142E-05 | global batch size: 256 | lm loss: 1.897034E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.797 | TFLOPs: 19.71 | 31: iteration 145670/ 173500 | consumed samples: 37291520 | consumed tokens: 76373032960 | elapsed time per iteration (s): 0.80 | learning rate: 3.141E-05 | global batch size: 256 | lm loss: 1.936344E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.801 | TFLOPs: 19.29 | 31: iteration 145680/ 173500 | consumed samples: 37294080 | consumed tokens: 76378275840 | elapsed time per iteration (s): 0.75 | learning rate: 3.140E-05 | global batch size: 256 | lm loss: 1.940947E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 340.865 | TFLOPs: 20.62 | 31: iteration 145690/ 173500 | consumed samples: 37296640 | consumed tokens: 76383518720 | elapsed time per iteration (s): 0.78 | learning rate: 3.139E-05 | global batch size: 256 | lm loss: 1.958132E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.449 | TFLOPs: 19.87 | 31: iteration 145700/ 173500 | consumed samples: 37299200 | consumed tokens: 76388761600 | elapsed time per iteration (s): 0.80 | learning rate: 3.139E-05 | global batch size: 256 | lm loss: 1.927196E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.196 | TFLOPs: 19.43 | 31: iteration 145710/ 173500 | consumed samples: 37301760 | consumed tokens: 76394004480 | elapsed time per iteration (s): 0.74 | learning rate: 3.138E-05 | global batch size: 256 | lm loss: 1.925712E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.782 | TFLOPs: 20.92 | 31: iteration 145720/ 173500 | consumed samples: 37304320 | consumed tokens: 76399247360 | elapsed time per iteration (s): 0.78 | learning rate: 3.137E-05 | global batch size: 256 | lm loss: 1.931079E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.442 | TFLOPs: 19.81 | 31: iteration 145730/ 173500 | consumed samples: 37306880 | consumed tokens: 76404490240 | elapsed time per iteration (s): 0.73 | learning rate: 3.136E-05 | global batch size: 256 | lm loss: 1.890891E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.940 | TFLOPs: 21.17 | 31: iteration 145740/ 173500 | consumed samples: 37309440 | consumed tokens: 76409733120 | elapsed time per iteration (s): 0.75 | learning rate: 
3.135E-05 | global batch size: 256 | lm loss: 1.911930E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.630 | TFLOPs: 20.61 | 31: iteration 145750/ 173500 | consumed samples: 37312000 | consumed tokens: 76414976000 | elapsed time per iteration (s): 0.79 | learning rate: 3.135E-05 | global batch size: 256 | lm loss: 1.900237E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.302 | TFLOPs: 19.68 | 31: iteration 145760/ 173500 | consumed samples: 37314560 | consumed tokens: 76420218880 | elapsed time per iteration (s): 0.72 | learning rate: 3.134E-05 | global batch size: 256 | lm loss: 1.902101E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.395 | TFLOPs: 21.44 | 31: iteration 145770/ 173500 | consumed samples: 37317120 | consumed tokens: 76425461760 | elapsed time per iteration (s): 0.74 | learning rate: 3.133E-05 | global batch size: 256 | lm loss: 1.898678E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.057 | TFLOPs: 21.00 | 31: iteration 145780/ 173500 | consumed samples: 37319680 | consumed tokens: 76430704640 | elapsed time per iteration (s): 0.78 | learning rate: 3.132E-05 | global batch size: 256 | lm loss: 1.942307E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.963 | TFLOPs: 19.84 | 31: iteration 145790/ 173500 | consumed samples: 37322240 | consumed tokens: 76435947520 | elapsed time per iteration (s): 0.77 | learning rate: 3.131E-05 | global batch size: 256 | lm loss: 1.914754E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.547 | TFLOPs: 20.24 | 31: iteration 145800/ 173500 | 
consumed samples: 37324800 | consumed tokens: 76441190400 | elapsed time per iteration (s): 0.91 | learning rate: 3.131E-05 | global batch size: 256 | lm loss: 1.921432E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 280.094 | TFLOPs: 16.94 | 31: iteration 145810/ 173500 | consumed samples: 37327360 | consumed tokens: 76446433280 | elapsed time per iteration (s): 0.82 | learning rate: 3.130E-05 | global batch size: 256 | lm loss: 1.967095E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.370 | TFLOPs: 18.96 | 31: iteration 145820/ 173500 | consumed samples: 37329920 | consumed tokens: 76451676160 | elapsed time per iteration (s): 0.77 | learning rate: 3.129E-05 | global batch size: 256 | lm loss: 1.893678E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.617 | TFLOPs: 20.18 | 31: iteration 145830/ 173500 | consumed samples: 37332480 | consumed tokens: 76456919040 | elapsed time per iteration (s): 0.80 | learning rate: 3.128E-05 | global batch size: 256 | lm loss: 1.916645E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.681 | TFLOPs: 19.34 | 31: iteration 145840/ 173500 | consumed samples: 37335040 | consumed tokens: 76462161920 | elapsed time per iteration (s): 0.81 | learning rate: 3.127E-05 | global batch size: 256 | lm loss: 1.937679E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.949 | TFLOPs: 19.17 | 31: iteration 145850/ 173500 | consumed samples: 37337600 | consumed tokens: 76467404800 | elapsed time per iteration (s): 0.81 | learning rate: 3.127E-05 | global batch size: 256 | lm loss: 1.906952E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 314.315 | TFLOPs: 19.02 | 31: iteration 145860/ 173500 | consumed samples: 37340160 | consumed tokens: 76472647680 | elapsed time per iteration (s): 0.81 | learning rate: 3.126E-05 | global batch size: 256 | lm loss: 1.907606E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.602 | TFLOPs: 19.21 | 31: iteration 145870/ 173500 | consumed samples: 37342720 | consumed tokens: 76477890560 | elapsed time per iteration (s): 0.81 | learning rate: 3.125E-05 | global batch size: 256 | lm loss: 1.927599E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.114 | TFLOPs: 19.12 | 31: iteration 145880/ 173500 | consumed samples: 37345280 | consumed tokens: 76483133440 | elapsed time per iteration (s): 0.87 | learning rate: 3.124E-05 | global batch size: 256 | lm loss: 1.928910E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.875 | TFLOPs: 17.72 | 31: iteration 145890/ 173500 | consumed samples: 37347840 | consumed tokens: 76488376320 | elapsed time per iteration (s): 0.81 | learning rate: 3.123E-05 | global batch size: 256 | lm loss: 1.938939E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.340 | TFLOPs: 19.08 | 31: iteration 145900/ 173500 | consumed samples: 37350400 | consumed tokens: 76493619200 | elapsed time per iteration (s): 0.81 | learning rate: 3.123E-05 | global batch size: 256 | lm loss: 1.945103E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.070 | TFLOPs: 19.18 | 31: iteration 145910/ 173500 | consumed samples: 37352960 | consumed tokens: 76498862080 | elapsed time per iteration (s): 0.81 | learning rate: 
3.122E-05 | global batch size: 256 | lm loss: 1.912752E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.558 | TFLOPs: 19.21 | 31: iteration 145920/ 173500 | consumed samples: 37355520 | consumed tokens: 76504104960 | elapsed time per iteration (s): 0.82 | learning rate: 3.121E-05 | global batch size: 256 | lm loss: 1.921607E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.950 | TFLOPs: 18.87 | 31: iteration 145930/ 173500 | consumed samples: 37358080 | consumed tokens: 76509347840 | elapsed time per iteration (s): 0.84 | learning rate: 3.120E-05 | global batch size: 256 | lm loss: 1.913156E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.214 | TFLOPs: 18.34 | 31: iteration 145940/ 173500 | consumed samples: 37360640 | consumed tokens: 76514590720 | elapsed time per iteration (s): 0.84 | learning rate: 3.119E-05 | global batch size: 256 | lm loss: 1.909787E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.654 | TFLOPs: 18.37 | 31: iteration 145950/ 173500 | consumed samples: 37363200 | consumed tokens: 76519833600 | elapsed time per iteration (s): 0.78 | learning rate: 3.119E-05 | global batch size: 256 | lm loss: 1.901210E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.295 | TFLOPs: 19.74 | 31: iteration 145960/ 173500 | consumed samples: 37365760 | consumed tokens: 76525076480 | elapsed time per iteration (s): 0.76 | learning rate: 3.118E-05 | global batch size: 256 | lm loss: 1.908888E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.761 | TFLOPs: 20.37 | 31: iteration 145970/ 173500 | 
consumed samples: 37368320 | consumed tokens: 76530319360 | elapsed time per iteration (s): 0.80 | learning rate: 3.117E-05 | global batch size: 256 | lm loss: 1.938263E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.413 | TFLOPs: 19.32 | 31: iteration 145980/ 173500 | consumed samples: 37370880 | consumed tokens: 76535562240 | elapsed time per iteration (s): 0.89 | learning rate: 3.116E-05 | global batch size: 256 | lm loss: 1.943630E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.432 | TFLOPs: 17.45 | 31: iteration 145990/ 173500 | consumed samples: 37373440 | consumed tokens: 76540805120 | elapsed time per iteration (s): 0.81 | learning rate: 3.115E-05 | global batch size: 256 | lm loss: 1.926708E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.854 | TFLOPs: 19.05 | 0: [2022-11-27 02:57:39,398] [INFO] [logging.py:68:log_dist] [Rank 0] step=146000, skipped=0, lr=[3.1146732758228304e-05, 3.1146732758228304e-05, 3.1146732758228304e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 146000/ 173500 | consumed samples: 37376000 | consumed tokens: 76546048000 | elapsed time per iteration (s): 0.81 | learning rate: 3.115E-05 | global batch size: 256 | lm loss: 1.925010E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.334 | TFLOPs: 19.02 | 0: steps: 146000 loss: 1.9157 iter time (s): 0.815 samples/sec: 314.021 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 146000 | lm loss value: 1.988679E+00 | lm loss PPL: 7.305877E+00 | 0: saving checkpoint at iteration 146000 to checkpoints_1b1long 31: 
-------------------------------------------------------------------------------------------- 0: [2022-11-27 02:57:39,689] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step146000 is begin to save! 0: [2022-11-27 02:57:39,698] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_01-model_00-model_states.pt... 0: [2022-11-27 02:57:39,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_01-model_00-model_states.pt. 0: [2022-11-27 02:57:39,923] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_03-model_00-model_states.pt... 0: [2022-11-27 02:57:40,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_03-model_00-model_states.pt. 0: [2022-11-27 02:57:40,003] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_04-model_00-model_states.pt... 0: [2022-11-27 02:57:40,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_04-model_00-model_states.pt. 0: [2022-11-27 02:57:40,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_05-model_00-model_states.pt... 0: [2022-11-27 02:57:40,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_05-model_00-model_states.pt. 0: [2022-11-27 02:57:40,158] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_06-model_00-model_states.pt... 0: [2022-11-27 02:57:40,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_06-model_00-model_states.pt. 0: [2022-11-27 02:57:40,240] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_07-model_00-model_states.pt... 
0: [2022-11-27 02:57:40,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_07-model_00-model_states.pt. 0: [2022-11-27 02:57:40,315] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_08-model_00-model_states.pt... 0: [2022-11-27 02:57:40,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_08-model_00-model_states.pt. 0: [2022-11-27 02:57:40,387] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_09-model_00-model_states.pt... 0: [2022-11-27 02:57:40,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_09-model_00-model_states.pt. 0: [2022-11-27 02:57:40,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_10-model_00-model_states.pt... 0: [2022-11-27 02:57:40,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_10-model_00-model_states.pt. 0: [2022-11-27 02:57:40,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_11-model_00-model_states.pt... 0: [2022-11-27 02:57:40,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_11-model_00-model_states.pt. 0: [2022-11-27 02:57:40,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_12-model_00-model_states.pt... 0: [2022-11-27 02:57:40,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_12-model_00-model_states.pt. 0: [2022-11-27 02:57:40,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 02:57:40,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_13-model_00-model_states.pt. 0: [2022-11-27 02:57:40,759] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_14-model_00-model_states.pt... 0: [2022-11-27 02:57:40,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_14-model_00-model_states.pt. 0: [2022-11-27 02:57:40,833] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_15-model_00-model_states.pt... 0: [2022-11-27 02:57:40,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_15-model_00-model_states.pt. 0: [2022-11-27 02:57:40,905] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_16-model_00-model_states.pt... 0: [2022-11-27 02:57:40,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_16-model_00-model_states.pt. 0: [2022-11-27 02:57:40,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_17-model_00-model_states.pt... 0: [2022-11-27 02:57:41,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_17-model_00-model_states.pt. 0: [2022-11-27 02:57:41,055] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_18-model_00-model_states.pt... 0: [2022-11-27 02:57:41,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_18-model_00-model_states.pt. 0: [2022-11-27 02:57:41,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 02:57:41,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_19-model_00-model_states.pt. 0: [2022-11-27 02:57:41,199] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_20-model_00-model_states.pt... 0: [2022-11-27 02:57:41,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_20-model_00-model_states.pt. 0: [2022-11-27 02:57:41,275] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_21-model_00-model_states.pt... 0: [2022-11-27 02:57:41,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_21-model_00-model_states.pt. 0: [2022-11-27 02:57:41,347] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_22-model_00-model_states.pt... 0: [2022-11-27 02:57:41,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_22-model_00-model_states.pt. 0: [2022-11-27 02:57:41,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_23-model_00-model_states.pt... 0: [2022-11-27 02:57:41,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_23-model_00-model_states.pt. 0: [2022-11-27 02:57:41,493] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_24-model_00-model_states.pt... 0: [2022-11-27 02:57:41,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_24-model_00-model_states.pt. 0: [2022-11-27 02:57:41,568] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 02:57:41,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_25-model_00-model_states.pt. 0: [2022-11-27 02:57:41,641] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_26-model_00-model_states.pt... 0: [2022-11-27 02:57:41,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_26-model_00-model_states.pt. 0: [2022-11-27 02:57:41,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_27-model_00-model_states.pt... 0: [2022-11-27 02:57:41,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_27-model_00-model_states.pt. 0: [2022-11-27 02:57:41,790] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_28-model_00-model_states.pt... 0: [2022-11-27 02:57:41,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_28-model_00-model_states.pt. 0: [2022-11-27 02:57:41,862] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/layer_30-model_00-model_states.pt... 0: [2022-11-27 02:57:41,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/layer_30-model_00-model_states.pt. 0: [2022-11-27 02:57:41,866] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step146000/mp_rank_00_model_states.pt 0: [2022-11-27 02:57:41,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/mp_rank_00_model_states.pt... 0: [2022-11-27 02:57:41,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/mp_rank_00_model_states.pt. 
0: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
4: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
1: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 
13: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 
15: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 
11: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 
31: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 
18: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
0: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 
9: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
1: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
12: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 
28: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 
29: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 
17: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 
19: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
7: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
3: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 
11: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 
30: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 
0: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 
25: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 
19: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 16: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 24: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 19: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
16: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 20: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:57:41,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-27 02:57:41,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:57:41,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 02:57:41,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 29: [2022-11-27 02:57:42,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:57:42,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 6: [2022-11-27 02:57:42,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:57:42,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 
6: [2022-11-27 02:57:42,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 02:57:42,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 25: [2022-11-27 02:57:42,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:57:42,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-27 02:57:42,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 14: [2022-11-27 02:57:42,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:57:42,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 02:57:42,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 24: [2022-11-27 02:57:42,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:57:42,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 02:57:42,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 19: [2022-11-27 02:57:42,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
19: [2022-11-27 02:57:42,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 02:57:42,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 18: [2022-11-27 02:57:42,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:57:42,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 02:57:42,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 2: [2022-11-27 02:57:42,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:57:42,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:57:42,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 3: [2022-11-27 02:57:42,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 2: [2022-11-27 02:57:42,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 3: [2022-11-27 02:57:42,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 21: [2022-11-27 02:57:42,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-27 02:57:42,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 02:57:42,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 17: [2022-11-27 02:57:41,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:57:41,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 02:57:41,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 1: [2022-11-27 02:57:42,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:57:42,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:57:42,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 20: [2022-11-27 02:57:42,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 1: [2022-11-27 02:57:42,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 20: [2022-11-27 02:57:42,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 6: [2022-11-27 02:57:42,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-27 02:57:42,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 02:57:42,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 23: [2022-11-27 02:57:42,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:57:42,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 02:57:42,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 30: [2022-11-27 02:57:42,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:57:42,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 02:57:42,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 1: [2022-11-27 02:57:42,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:57:42,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 02:57:42,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 29: [2022-11-27 02:57:42,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-27 02:57:42,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 02:57:42,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 4: [2022-11-27 02:57:42,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:57:42,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:57:42,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 02:57:42,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 0: [2022-11-27 02:57:42,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 02:57:42,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 25: [2022-11-27 02:57:42,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:57:42,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-27 02:57:42,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 25: [2022-11-27 02:57:42,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 14: [2022-11-27 02:57:42,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 25: [2022-11-27 02:57:42,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 28: [2022-11-27 02:57:42,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:57:42,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:57:42,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 18: [2022-11-27 02:57:42,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 28: [2022-11-27 02:57:42,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 18: [2022-11-27 02:57:42,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 13: [2022-11-27 02:57:42,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:57:42,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-27 02:57:42,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 8: [2022-11-27 02:57:42,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:57:42,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 15: [2022-11-27 02:57:42,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 13: [2022-11-27 02:57:42,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 15: [2022-11-27 02:57:42,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:57:42,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 4: [2022-11-27 02:57:42,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:57:42,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 15: [2022-11-27 02:57:42,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 4: [2022-11-27 02:57:42,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 02:57:42,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 
8: [2022-11-27 02:57:42,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 24: [2022-11-27 02:57:42,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:57:42,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 02:57:42,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 2: [2022-11-27 02:57:42,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:57:42,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 13: [2022-11-27 02:57:42,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:57:42,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:57:42,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 13: [2022-11-27 02:57:42,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 21: [2022-11-27 02:57:42,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-27 02:57:42,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 13: [2022-11-27 02:57:42,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 21: [2022-11-27 02:57:42,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 21: [2022-11-27 02:57:42,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 02:57:42,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 17: [2022-11-27 02:57:42,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:57:42,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:57:42,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 19: [2022-11-27 02:57:42,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 17: [2022-11-27 02:57:42,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 19: [2022-11-27 02:57:42,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 8: [2022-11-27 02:57:42,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-27 02:57:42,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 02:57:42,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 15: [2022-11-27 02:57:42,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:57:42,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 02:57:42,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 9: [2022-11-27 02:57:42,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:57:42,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 02:57:42,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 31: [2022-11-27 02:57:42,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:57:42,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 02:57:42,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 6: [2022-11-27 02:57:42,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-27 02:57:42,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 02:57:42,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 30: [2022-11-27 02:57:42,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:57:42,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 02:57:42,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 3: [2022-11-27 02:57:42,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:57:42,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 02:57:42,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 14: [2022-11-27 02:57:42,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:57:42,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:57:42,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
14: [2022-11-27 02:57:42,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 5: [2022-11-27 02:57:42,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 02:57:42,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 14: [2022-11-27 02:57:42,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 5: [2022-11-27 02:57:42,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 5: [2022-11-27 02:57:42,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 29: [2022-11-27 02:57:42,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:57:42,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 02:57:42,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 4: [2022-11-27 02:57:42,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:57:42,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 02:57:42,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 
0: [2022-11-27 02:57:42,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:57:42,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 02:57:42,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 13: [2022-11-27 02:57:42,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:57:42,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 02:57:42,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 6: [2022-11-27 02:57:42,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:57:42,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 02:57:42,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 10: [2022-11-27 02:57:42,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:57:42,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-27 02:57:42,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 10: [2022-11-27 02:57:42,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 25: [2022-11-27 02:57:42,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:57:42,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 10: [2022-11-27 02:57:42,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 31: [2022-11-27 02:57:42,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:57:42,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:57:42,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 17: [2022-11-27 02:57:42,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:57:42,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 25: [2022-11-27 02:57:42,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 
31: [2022-11-27 02:57:42,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 17: [2022-11-27 02:57:42,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 31: [2022-11-27 02:57:42,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 17: [2022-11-27 02:57:42,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 23: [2022-11-27 02:57:42,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:57:42,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:57:42,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 23: [2022-11-27 02:57:42,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 02:57:42,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 02:57:42,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 26: [2022-11-27 02:57:42,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:57:42,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 
26: [2022-11-27 02:57:42,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 02:57:42,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 24: [2022-11-27 02:57:42,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:57:42,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:57:42,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:57:42,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 26: [2022-11-27 02:57:42,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:57:42,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 24: [2022-11-27 02:57:42,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 8: [2022-11-27 02:57:42,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 10: [2022-11-27 02:57:42,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 
26: [2022-11-27 02:57:42,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 8: [2022-11-27 02:57:42,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 26: [2022-11-27 02:57:42,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 15: [2022-11-27 02:57:42,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:57:42,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 02:57:42,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 18: [2022-11-27 02:57:42,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:57:42,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:57:42,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 8: [2022-11-27 02:57:42,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 18: [2022-11-27 02:57:42,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 8: [2022-11-27 02:57:42,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 
16: [2022-11-27 02:57:42,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:57:42,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:57:42,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 02:57:42,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 20: [2022-11-27 02:57:42,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 02:57:42,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 20: [2022-11-27 02:57:42,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:57:42,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:57:42,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 2: [2022-11-27 02:57:42,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 02:57:42,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 20: [2022-11-27 02:57:42,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 
23: [2022-11-27 02:57:42,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:57:42,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 02:57:42,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 20: [2022-11-27 02:57:42,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:57:42,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 28: [2022-11-27 02:57:42,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:57:42,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 3: [2022-11-27 02:57:42,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:57:42,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
3: [2022-11-27 02:57:42,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 16: [2022-11-27 02:57:42,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 26: [2022-11-27 02:57:42,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:57:42,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 3: [2022-11-27 02:57:42,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 25: [2022-11-27 02:57:42,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:57:42,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 25: [2022-11-27 02:57:42,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 26: [2022-11-27 02:57:42,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 25: [2022-11-27 02:57:42,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 10: [2022-11-27 02:57:42,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-27 02:57:42,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 02:57:42,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 11: [2022-11-27 02:57:41,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:57:42,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 02:57:42,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 11: [2022-11-27 02:57:42,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:57:42,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 02:57:42,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 11: [2022-11-27 02:57:42,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:57:42,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 18: [2022-11-27 02:57:42,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:57:42,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 
18: [2022-11-27 02:57:42,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 11: [2022-11-27 02:57:42,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:57:42,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 1: [2022-11-27 02:57:42,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:57:42,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 9: [2022-11-27 02:57:42,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:57:42,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 1: [2022-11-27 02:57:42,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 02:57:42,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 9: [2022-11-27 02:57:42,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 02:57:42,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 9: [2022-11-27 02:57:42,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-27 02:57:42,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 02:57:42,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 26: [2022-11-27 02:57:42,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:57:42,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 02:57:42,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 12: [2022-11-27 02:57:42,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:57:42,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:57:42,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:57:42,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
9: [2022-11-27 02:57:42,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 12: [2022-11-27 02:57:42,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 9: [2022-11-27 02:57:42,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 12: [2022-11-27 02:57:42,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 12: [2022-11-27 02:57:42,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:57:42,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 02:57:42,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 02:57:42,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 12: [2022-11-27 02:57:42,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 12: [2022-11-27 02:57:42,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 1: [2022-11-27 02:57:42,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:57:42,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 
1: [2022-11-27 02:57:42,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 02:57:42,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 29: [2022-11-27 02:57:42,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:57:42,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 30: [2022-11-27 02:57:42,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:57:42,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 30: [2022-11-27 02:57:42,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 02:57:42,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 31: [2022-11-27 02:57:42,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:57:42,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 02:57:42,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 27: [2022-11-27 02:57:42,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-27 02:57:42,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:57:42,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:57:42,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:57:42,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 02:57:42,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 02:57:42,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 27: [2022-11-27 02:57:42,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 02:57:42,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 02:57:42,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 27: [2022-11-27 02:57:42,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 27: [2022-11-27 02:57:42,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 4: [2022-11-27 02:57:42,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
16: [2022-11-27 02:57:42,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:57:42,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 16: [2022-11-27 02:57:42,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 4: [2022-11-27 02:57:42,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 16: [2022-11-27 02:57:42,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 0: [2022-11-27 02:57:42,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:57:42,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 02:57:42,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 0: [2022-11-27 02:57:42,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:57:42,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:57:42,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 02:57:42,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 
5: [2022-11-27 02:57:42,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:57:42,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 02:57:42,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 14: [2022-11-27 02:57:42,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:57:42,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 17: [2022-11-27 02:57:42,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:57:42,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 17: [2022-11-27 02:57:42,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 02:57:42,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 28: [2022-11-27 02:57:42,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 02:57:42,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 28: [2022-11-27 02:57:42,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
28: [2022-11-27 02:57:42,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 02:57:42,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 28: [2022-11-27 02:57:42,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:57:42,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 02:57:42,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 19: [2022-11-27 02:57:42,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:57:42,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:57:42,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 02:57:42,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 02:57:42,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 19: [2022-11-27 02:57:42,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 13: [2022-11-27 02:57:42,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-27 02:57:42,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 02:57:42,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 24: [2022-11-27 02:57:42,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:57:42,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 02:57:42,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 22: [2022-11-27 02:57:42,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:57:42,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:57:42,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:57:42,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 02:57:42,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 02:57:42,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 
22: [2022-11-27 02:57:42,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 02:57:42,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 22: [2022-11-27 02:57:42,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 31: [2022-11-27 02:57:42,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:57:42,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 02:57:42,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 6: [2022-11-27 02:57:42,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:57:42,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 02:57:42,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 3: [2022-11-27 02:57:42,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:57:42,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 1: [2022-11-27 02:57:42,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
3: [2022-11-27 02:57:42,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 1: [2022-11-27 02:57:42,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 02:57:42,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 4: [2022-11-27 02:57:42,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:57:42,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 02:57:42,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 25: [2022-11-27 02:57:42,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:57:42,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 02:57:42,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 0: [2022-11-27 02:57:42,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:57:42,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 02:57:42,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 
2: [2022-11-27 02:57:42,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:57:42,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 02:57:42,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 7: [2022-11-27 02:57:42,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:57:42,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:57:42,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:57:42,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-27 02:57:42,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 02:57:42,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 02:57:42,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 02:57:42,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 02:57:42,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 7: [2022-11-27 02:57:42,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 7: [2022-11-27 02:57:42,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 7: [2022-11-27 02:57:42,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 15: [2022-11-27 02:57:42,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:57:42,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 02:57:42,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 
0: [2022-11-27 02:57:42,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 02:57:42,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 23: [2022-11-27 02:57:42,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:57:42,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 02:57:42,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 11: [2022-11-27 02:57:42,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:57:42,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 02:57:42,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 28: [2022-11-27 02:57:42,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:57:42,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 02:57:42,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 21: [2022-11-27 02:57:42,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
21: [2022-11-27 02:57:42,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 02:57:42,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 26: [2022-11-27 02:57:42,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:57:42,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 02:57:42,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 8: [2022-11-27 02:57:42,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:57:42,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 02:57:42,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 2: [2022-11-27 02:57:42,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:57:42,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 02:57:42,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 29: [2022-11-27 02:57:42,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 
29: [2022-11-27 02:57:42,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 02:57:42,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 30: [2022-11-27 02:57:42,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:57:42,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 02:57:42,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 7: [2022-11-27 02:57:42,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:57:42,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 02:57:42,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 9: [2022-11-27 02:57:42,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:57:42,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 02:57:42,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 12: [2022-11-27 02:57:42,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-27 02:57:42,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 02:57:42,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 5: [2022-11-27 02:57:42,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:57:42,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:57:42,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 5: [2022-11-27 02:57:42,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 20: [2022-11-27 02:57:42,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 5: [2022-11-27 02:57:42,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 17: [2022-11-27 02:57:42,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 19: [2022-11-27 02:57:42,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-27 02:57:42,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 17: [2022-11-27 02:57:42,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 19: [2022-11-27 02:57:42,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 17: [2022-11-27 02:57:42,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 10: [2022-11-27 02:57:42,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:57:42,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 02:57:42,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 18: [2022-11-27 02:57:42,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:57:42,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 02:57:42,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 3: [2022-11-27 02:57:42,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-27 02:57:42,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 02:57:42,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 13: [2022-11-27 02:57:42,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:57:42,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 02:57:42,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 24: [2022-11-27 02:57:42,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:57:42,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 02:57:42,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 27: [2022-11-27 02:57:42,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:57:42,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 02:57:42,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 22: [2022-11-27 02:57:42,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-27 02:57:42,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 02:57:42,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 16: [2022-11-27 02:57:42,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:57:42,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 02:57:42,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 14: [2022-11-27 02:57:42,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:57:42,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 02:57:42,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 0: [2022-11-27 02:57:42,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:57:42,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 02:57:42,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 6: [2022-11-27 02:57:42,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-27 02:57:42,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 02:57:42,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 31: [2022-11-27 02:57:42,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:57:42,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 02:57:42,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 25: [2022-11-27 02:57:42,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:57:42,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 02:57:42,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 1: [2022-11-27 02:57:42,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:57:42,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 02:57:42,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 21: [2022-11-27 02:57:42,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-27 02:57:42,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 02:57:42,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 15: [2022-11-27 02:57:42,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:57:42,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 23: [2022-11-27 02:57:42,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:57:42,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 23: [2022-11-27 02:57:42,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 02:57:42,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 4: [2022-11-27 02:57:42,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:57:42,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 02:57:42,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 7: [2022-11-27 02:57:42,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-27 02:57:42,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 02:57:42,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 2: [2022-11-27 02:57:42,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:57:42,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 02:57:42,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 11: [2022-11-27 02:57:42,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:57:42,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 28: [2022-11-27 02:57:42,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:57:42,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 28: [2022-11-27 02:57:42,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 02:57:42,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 26: [2022-11-27 02:57:42,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-27 02:57:42,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 02:57:42,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 8: [2022-11-27 02:57:42,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:57:42,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 02:57:42,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 5: [2022-11-27 02:57:42,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:57:42,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 02:57:42,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 20: [2022-11-27 02:57:42,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:57:42,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 02:57:42,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 18: [2022-11-27 02:57:42,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
18: [2022-11-27 02:57:42,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 27: [2022-11-27 02:57:42,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:57:42,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 27: [2022-11-27 02:57:42,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 02:57:42,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 17: [2022-11-27 02:57:42,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:57:42,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:57:42,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:57:42,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 19: [2022-11-27 02:57:42,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
12: [2022-11-27 02:57:42,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 29: [2022-11-27 02:57:42,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 17: [2022-11-27 02:57:42,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 19: [2022-11-27 02:57:42,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 12: [2022-11-27 02:57:42,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 19: [2022-11-27 02:57:42,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 29: [2022-11-27 02:57:42,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 16: [2022-11-27 02:57:42,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:57:42,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 02:57:42,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 24: [2022-11-27 02:57:42,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
24: [2022-11-27 02:57:42,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 13: [2022-11-27 02:57:42,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:57:42,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 13: [2022-11-27 02:57:42,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 02:57:42,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 6: [2022-11-27 02:57:42,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:57:42,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 02:57:42,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 30: [2022-11-27 02:57:42,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:57:42,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
30: [2022-11-27 02:57:42,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 10: [2022-11-27 02:57:42,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:57:42,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 02:57:42,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 3: [2022-11-27 02:57:42,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 30: [2022-11-27 02:57:42,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 3: [2022-11-27 02:57:42,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 14: [2022-11-27 02:57:42,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:57:42,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 02:57:42,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 25: [2022-11-27 02:57:42,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:57:42,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
25: [2022-11-27 02:57:42,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 02:57:42,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 0: [2022-11-27 02:57:42,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 02:57:42,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 31: [2022-11-27 02:57:42,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:57:42,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 02:57:42,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 22: [2022-11-27 02:57:42,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:57:42,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 02:57:42,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 9: [2022-11-27 02:57:42,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-27 02:57:42,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 02:57:42,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 4: [2022-11-27 02:57:42,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:57:42,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 02:57:42,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 1: [2022-11-27 02:57:42,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:57:42,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 02:57:42,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 21: [2022-11-27 02:57:42,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:57:42,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:57:42,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 02:57:42,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 
15: [2022-11-27 02:57:42,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 02:57:42,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 11: [2022-11-27 02:57:42,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:57:42,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 02:57:42,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 23: [2022-11-27 02:57:42,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:57:42,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 02:57:42,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 28: [2022-11-27 02:57:42,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-27 02:57:42,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 02:57:42,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 7: [2022-11-27 02:57:42,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-27 02:57:42,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 02:57:42,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 8: [2022-11-27 02:57:42,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:57:42,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 02:57:42,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 26: [2022-11-27 02:57:42,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:57:42,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 02:57:42,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 5: [2022-11-27 02:57:42,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:57:42,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 02:57:42,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 2: [2022-11-27 02:57:42,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-27 02:57:42,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 02:57:42,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 30: [2022-11-27 02:57:42,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 02:57:42,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 02:57:42,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 20: [2022-11-27 02:57:42,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-27 02:57:42,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 02:57:42,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 19: [2022-11-27 02:57:42,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:57:42,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:57:42,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
19: [2022-11-27 02:57:42,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 9: [2022-11-27 02:57:42,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 18: [2022-11-27 02:57:42,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 19: [2022-11-27 02:57:42,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 18: [2022-11-27 02:57:42,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 9: [2022-11-27 02:57:42,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 13: [2022-11-27 02:57:42,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:57:42,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:57:42,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 02:57:42,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 24: [2022-11-27 02:57:42,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-27 02:57:42,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 
12: [2022-11-27 02:57:42,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:57:42,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:57:42,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 29: [2022-11-27 02:57:42,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 02:57:42,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 12: [2022-11-27 02:57:42,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 27: [2022-11-27 02:57:42,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:57:42,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 02:57:42,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 2: [2022-11-27 02:57:42,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:57:42,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 02:57:42,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 
6: [2022-11-27 02:57:42,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:57:42,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:57:42,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-27 02:57:42,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 02:57:42,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 6: [2022-11-27 02:57:42,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 8: [2022-11-27 02:57:42,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 6: [2022-11-27 02:57:42,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 8: [2022-11-27 02:57:42,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 20: [2022-11-27 02:57:42,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:57:42,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
20: [2022-11-27 02:57:42,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 02:57:42,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 13: [2022-11-27 02:57:42,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 02:57:42,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 25: [2022-11-27 02:57:42,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-27 02:57:42,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 02:57:42,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 0: [2022-11-27 02:57:42,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:57:42,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 02:57:42,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 3: [2022-11-27 02:57:42,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-27 02:57:42,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 02:57:42,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 21: [2022-11-27 02:57:42,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 31: [2022-11-27 02:57:42,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 21: [2022-11-27 02:57:42,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 02:57:42,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 31: [2022-11-27 02:57:42,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 02:57:42,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 12: [2022-11-27 02:57:42,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 23: [2022-11-27 02:57:42,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
12: [2022-11-27 02:57:42,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 23: [2022-11-27 02:57:42,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 02:57:42,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 12: [2022-11-27 02:57:42,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 15: [2022-11-27 02:57:42,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:57:42,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 02:57:42,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 10: [2022-11-27 02:57:42,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:57:42,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 02:57:42,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 19: [2022-11-27 02:57:42,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:57:42,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-27 02:57:42,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 19: [2022-11-27 02:57:42,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 24: [2022-11-27 02:57:42,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 29: [2022-11-27 02:57:42,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 19: [2022-11-27 02:57:42,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 14: [2022-11-27 02:57:42,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 24: [2022-11-27 02:57:42,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 02:57:42,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 14: [2022-11-27 02:57:42,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 02:57:42,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 17: [2022-11-27 02:57:42,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:57:42,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 
11: [2022-11-27 02:57:42,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:57:42,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 02:57:42,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 17: [2022-11-27 02:57:42,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 02:57:42,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 22: [2022-11-27 02:57:42,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:57:42,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 22: [2022-11-27 02:57:42,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 02:57:42,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 30: [2022-11-27 02:57:42,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 17: [2022-11-27 02:57:42,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 
30: [2022-11-27 02:57:42,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 02:57:42,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 16: [2022-11-27 02:57:42,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:57:42,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 02:57:42,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 1: [2022-11-27 02:57:42,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:57:42,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 02:57:42,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 4: [2022-11-27 02:57:42,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 18: [2022-11-27 02:57:42,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
4: [2022-11-27 02:57:42,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 18: [2022-11-27 02:57:42,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 27: [2022-11-27 02:57:42,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:57:42,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 18: [2022-11-27 02:57:42,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 9: [2022-11-27 02:57:42,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:57:42,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 28: [2022-11-27 02:57:42,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 27: [2022-11-27 02:57:42,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 9: [2022-11-27 02:57:42,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 28: [2022-11-27 02:57:42,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 9: [2022-11-27 02:57:42,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 
28: [2022-11-27 02:57:42,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 7: [2022-11-27 02:57:42,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:57:42,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 02:57:42,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 3: [2022-11-27 02:57:42,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:57:42,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 02:57:42,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 22: [2022-11-27 02:57:42,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:57:42,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 02:57:42,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 5: [2022-11-27 02:57:42,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-27 02:57:42,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 02:57:42,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 14: [2022-11-27 02:57:42,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:57:42,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 02:57:42,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 16: [2022-11-27 02:57:42,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:57:42,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 16: [2022-11-27 02:57:42,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 02:57:42,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 10: [2022-11-27 02:57:42,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 02:57:42,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 16: [2022-11-27 02:57:42,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-27 02:57:42,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 02:57:42,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 22: [2022-11-27 02:57:42,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 02:57:42,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 02:57:42,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 10: [2022-11-27 02:57:42,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:57:42,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step146000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 02:57:42,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step146000 is ready now! 
0: successfully saved checkpoint at iteration 146000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2488.19 31: iteration 146010/ 173500 | consumed samples: 37378560 | consumed tokens: 76551290880 | elapsed time per iteration (s): 1.04 | learning rate: 3.114E-05 | global batch size: 256 | lm loss: 1.914632E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.014 | TFLOPs: 14.88 | 31: iteration 146020/ 173500 | consumed samples: 37381120 | consumed tokens: 76556533760 | elapsed time per iteration (s): 0.89 | learning rate: 3.113E-05 | global batch size: 256 | lm loss: 1.911698E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.205 | TFLOPs: 17.50 | 31: iteration 146030/ 173500 | consumed samples: 37383680 | consumed tokens: 76561776640 | elapsed time per iteration (s): 0.77 | learning rate: 3.112E-05 | global batch size: 256 | lm loss: 1.902161E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.244 | TFLOPs: 20.10 | 31: iteration 146040/ 173500 | consumed samples: 37386240 | consumed tokens: 76567019520 | elapsed time per iteration (s): 0.75 | learning rate: 3.112E-05 | global batch size: 256 | lm loss: 1.942709E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.246 | TFLOPs: 20.58 | 31: iteration 146050/ 173500 | consumed samples: 37388800 | consumed tokens: 76572262400 | elapsed time per iteration (s): 0.80 | learning rate: 3.111E-05 | global batch size: 256 | lm loss: 1.914067E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.416 | TFLOPs: 19.44 | 31: iteration 146060/ 173500 | consumed samples: 37391360 | consumed tokens: 76577505280 | elapsed time per iteration (s): 
0.73 | learning rate: 3.110E-05 | global batch size: 256 | lm loss: 1.941798E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.229 | TFLOPs: 21.13 | 31: iteration 146070/ 173500 | consumed samples: 37393920 | consumed tokens: 76582748160 | elapsed time per iteration (s): 0.71 | learning rate: 3.109E-05 | global batch size: 256 | lm loss: 1.914729E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.240 | TFLOPs: 21.67 | 31: iteration 146080/ 173500 | consumed samples: 37396480 | consumed tokens: 76587991040 | elapsed time per iteration (s): 0.74 | learning rate: 3.108E-05 | global batch size: 256 | lm loss: 1.923696E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.081 | TFLOPs: 21.00 | 31: iteration 146090/ 173500 | consumed samples: 37399040 | consumed tokens: 76593233920 | elapsed time per iteration (s): 0.77 | learning rate: 3.108E-05 | global batch size: 256 | lm loss: 1.914348E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.881 | TFLOPs: 20.02 | 31: iteration 146100/ 173500 | consumed samples: 37401600 | consumed tokens: 76598476800 | elapsed time per iteration (s): 0.74 | learning rate: 3.107E-05 | global batch size: 256 | lm loss: 1.930255E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.095 | TFLOPs: 20.82 | 31: iteration 146110/ 173500 | consumed samples: 37404160 | consumed tokens: 76603719680 | elapsed time per iteration (s): 0.77 | learning rate: 3.106E-05 | global batch size: 256 | lm loss: 1.924702E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.541 | TFLOPs: 20.24 | 31: 
iteration 146120/ 173500 | consumed samples: 37406720 | consumed tokens: 76608962560 | elapsed time per iteration (s): 0.81 | learning rate: 3.105E-05 | global batch size: 256 | lm loss: 1.935184E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.782 | TFLOPs: 19.10 | 31: iteration 146130/ 173500 | consumed samples: 37409280 | consumed tokens: 76614205440 | elapsed time per iteration (s): 0.76 | learning rate: 3.104E-05 | global batch size: 256 | lm loss: 1.921656E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.007 | TFLOPs: 20.27 | 31: iteration 146140/ 173500 | consumed samples: 37411840 | consumed tokens: 76619448320 | elapsed time per iteration (s): 0.79 | learning rate: 3.104E-05 | global batch size: 256 | lm loss: 1.912854E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.229 | TFLOPs: 19.49 | 31: iteration 146150/ 173500 | consumed samples: 37414400 | consumed tokens: 76624691200 | elapsed time per iteration (s): 0.95 | learning rate: 3.103E-05 | global batch size: 256 | lm loss: 1.948815E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 269.918 | TFLOPs: 16.33 | 31: iteration 146160/ 173500 | consumed samples: 37416960 | consumed tokens: 76629934080 | elapsed time per iteration (s): 0.93 | learning rate: 3.102E-05 | global batch size: 256 | lm loss: 1.894421E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 274.381 | TFLOPs: 16.60 | 31: iteration 146170/ 173500 | consumed samples: 37419520 | consumed tokens: 76635176960 | elapsed time per iteration (s): 0.91 | learning rate: 3.101E-05 | global batch size: 256 | lm loss: 1.926663E+00 | grad norm: 0.184 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.050 | TFLOPs: 17.06 | 31: iteration 146180/ 173500 | consumed samples: 37422080 | consumed tokens: 76640419840 | elapsed time per iteration (s): 0.93 | learning rate: 3.100E-05 | global batch size: 256 | lm loss: 1.921304E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 274.180 | TFLOPs: 16.59 | 31: iteration 146190/ 173500 | consumed samples: 37424640 | consumed tokens: 76645662720 | elapsed time per iteration (s): 0.90 | learning rate: 3.100E-05 | global batch size: 256 | lm loss: 1.913052E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.879 | TFLOPs: 17.11 | 31: iteration 146200/ 173500 | consumed samples: 37427200 | consumed tokens: 76650905600 | elapsed time per iteration (s): 1.01 | learning rate: 3.099E-05 | global batch size: 256 | lm loss: 1.923335E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 254.277 | TFLOPs: 15.38 | 31: iteration 146210/ 173500 | consumed samples: 37429760 | consumed tokens: 76656148480 | elapsed time per iteration (s): 1.03 | learning rate: 3.098E-05 | global batch size: 256 | lm loss: 1.936679E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.586 | TFLOPs: 15.10 | 31: iteration 146220/ 173500 | consumed samples: 37432320 | consumed tokens: 76661391360 | elapsed time per iteration (s): 0.79 | learning rate: 3.097E-05 | global batch size: 256 | lm loss: 1.921600E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.014 | TFLOPs: 19.48 | 31: iteration 146230/ 173500 | consumed samples: 37434880 | consumed tokens: 76666634240 | elapsed time per iteration (s): 0.80 | 
learning rate: 3.096E-05 | global batch size: 256 | lm loss: 1.926049E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.396 | TFLOPs: 19.32 | 31: iteration 146240/ 173500 | consumed samples: 37437440 | consumed tokens: 76671877120 | elapsed time per iteration (s): 0.77 | learning rate: 3.096E-05 | global batch size: 256 | lm loss: 1.896475E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.401 | TFLOPs: 20.23 | 31: iteration 146250/ 173500 | consumed samples: 37440000 | consumed tokens: 76677120000 | elapsed time per iteration (s): 0.78 | learning rate: 3.095E-05 | global batch size: 256 | lm loss: 1.923863E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.048 | TFLOPs: 19.85 | 31: iteration 146260/ 173500 | consumed samples: 37442560 | consumed tokens: 76682362880 | elapsed time per iteration (s): 0.77 | learning rate: 3.094E-05 | global batch size: 256 | lm loss: 1.914510E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.630 | TFLOPs: 20.12 | 31: iteration 146270/ 173500 | consumed samples: 37445120 | consumed tokens: 76687605760 | elapsed time per iteration (s): 0.79 | learning rate: 3.093E-05 | global batch size: 256 | lm loss: 1.932011E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.534 | TFLOPs: 19.63 | 31: iteration 146280/ 173500 | consumed samples: 37447680 | consumed tokens: 76692848640 | elapsed time per iteration (s): 0.80 | learning rate: 3.093E-05 | global batch size: 256 | lm loss: 1.908974E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.532 | TFLOPs: 19.27 | 31: iteration 
146290/ 173500 | consumed samples: 37450240 | consumed tokens: 76698091520 | elapsed time per iteration (s): 0.92 | learning rate: 3.092E-05 | global batch size: 256 | lm loss: 1.923718E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 278.424 | TFLOPs: 16.84 | 31: iteration 146300/ 173500 | consumed samples: 37452800 | consumed tokens: 76703334400 | elapsed time per iteration (s): 0.87 | learning rate: 3.091E-05 | global batch size: 256 | lm loss: 1.941627E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.692 | TFLOPs: 17.71 | 31: iteration 146310/ 173500 | consumed samples: 37455360 | consumed tokens: 76708577280 | elapsed time per iteration (s): 0.87 | learning rate: 3.090E-05 | global batch size: 256 | lm loss: 1.913546E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.977 | TFLOPs: 17.85 | 31: iteration 146320/ 173500 | consumed samples: 37457920 | consumed tokens: 76713820160 | elapsed time per iteration (s): 0.86 | learning rate: 3.089E-05 | global batch size: 256 | lm loss: 1.943375E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.823 | TFLOPs: 18.08 | 31: iteration 146330/ 173500 | consumed samples: 37460480 | consumed tokens: 76719063040 | elapsed time per iteration (s): 0.88 | learning rate: 3.089E-05 | global batch size: 256 | lm loss: 1.901338E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.616 | TFLOPs: 17.58 | 31: iteration 146340/ 173500 | consumed samples: 37463040 | consumed tokens: 76724305920 | elapsed time per iteration (s): 0.84 | learning rate: 3.088E-05 | global batch size: 256 | lm loss: 1.920698E+00 | grad norm: 0.173 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.721 | TFLOPs: 18.37 | 31: iteration 146350/ 173500 | consumed samples: 37465600 | consumed tokens: 76729548800 | elapsed time per iteration (s): 0.87 | learning rate: 3.087E-05 | global batch size: 256 | lm loss: 1.906078E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.088 | TFLOPs: 17.73 | 31: iteration 146360/ 173500 | consumed samples: 37468160 | consumed tokens: 76734791680 | elapsed time per iteration (s): 0.80 | learning rate: 3.086E-05 | global batch size: 256 | lm loss: 1.906508E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.525 | TFLOPs: 19.39 | 31: iteration 146370/ 173500 | consumed samples: 37470720 | consumed tokens: 76740034560 | elapsed time per iteration (s): 0.80 | learning rate: 3.085E-05 | global batch size: 256 | lm loss: 1.935184E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.587 | TFLOPs: 19.46 | 31: iteration 146380/ 173500 | consumed samples: 37473280 | consumed tokens: 76745277440 | elapsed time per iteration (s): 0.80 | learning rate: 3.085E-05 | global batch size: 256 | lm loss: 1.941603E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.881 | TFLOPs: 19.47 | 31: iteration 146390/ 173500 | consumed samples: 37475840 | consumed tokens: 76750520320 | elapsed time per iteration (s): 0.77 | learning rate: 3.084E-05 | global batch size: 256 | lm loss: 1.934249E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.920 | TFLOPs: 20.02 | 31: iteration 146400/ 173500 | consumed samples: 37478400 | consumed tokens: 76755763200 | elapsed time per iteration (s): 0.83 | learning 
rate: 3.083E-05 | global batch size: 256 | lm loss: 1.926147E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.027 | TFLOPs: 18.57 | 31: iteration 146410/ 173500 | consumed samples: 37480960 | consumed tokens: 76761006080 | elapsed time per iteration (s): 0.85 | learning rate: 3.082E-05 | global batch size: 256 | lm loss: 1.910634E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.067 | TFLOPs: 18.27 | 31: iteration 146420/ 173500 | consumed samples: 37483520 | consumed tokens: 76766248960 | elapsed time per iteration (s): 0.82 | learning rate: 3.082E-05 | global batch size: 256 | lm loss: 1.925515E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.504 | TFLOPs: 18.91 | 31: iteration 146430/ 173500 | consumed samples: 37486080 | consumed tokens: 76771491840 | elapsed time per iteration (s): 0.83 | learning rate: 3.081E-05 | global batch size: 256 | lm loss: 1.921498E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.967 | TFLOPs: 18.63 | 31: iteration 146440/ 173500 | consumed samples: 37488640 | consumed tokens: 76776734720 | elapsed time per iteration (s): 0.79 | learning rate: 3.080E-05 | global batch size: 256 | lm loss: 1.924349E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.514 | TFLOPs: 19.69 | 31: iteration 146450/ 173500 | consumed samples: 37491200 | consumed tokens: 76781977600 | elapsed time per iteration (s): 0.80 | learning rate: 3.079E-05 | global batch size: 256 | lm loss: 1.946492E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.916 | TFLOPs: 19.35 | 31: iteration 146460/ 
173500 | consumed samples: 37493760 | consumed tokens: 76787220480 | elapsed time per iteration (s): 0.73 | learning rate: 3.078E-05 | global batch size: 256 | lm loss: 1.905154E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.587 | TFLOPs: 21.27 | 31: iteration 146470/ 173500 | consumed samples: 37496320 | consumed tokens: 76792463360 | elapsed time per iteration (s): 0.83 | learning rate: 3.078E-05 | global batch size: 256 | lm loss: 1.895829E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.596 | TFLOPs: 18.61 | 31: iteration 146480/ 173500 | consumed samples: 37498880 | consumed tokens: 76797706240 | elapsed time per iteration (s): 0.78 | learning rate: 3.077E-05 | global batch size: 256 | lm loss: 1.910041E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.048 | TFLOPs: 19.79 | 31: iteration 146490/ 173500 | consumed samples: 37501440 | consumed tokens: 76802949120 | elapsed time per iteration (s): 0.78 | learning rate: 3.076E-05 | global batch size: 256 | lm loss: 1.922687E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.436 | TFLOPs: 19.81 | 31: iteration 146500/ 173500 | consumed samples: 37504000 | consumed tokens: 76808192000 | elapsed time per iteration (s): 0.74 | learning rate: 3.075E-05 | global batch size: 256 | lm loss: 1.895078E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.428 | TFLOPs: 20.96 | 31: iteration 146510/ 173500 | consumed samples: 37506560 | consumed tokens: 76813434880 | elapsed time per iteration (s): 0.76 | learning rate: 3.075E-05 | global batch size: 256 | lm loss: 1.949198E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 335.522 | TFLOPs: 20.30 | 31: iteration 146520/ 173500 | consumed samples: 37509120 | consumed tokens: 76818677760 | elapsed time per iteration (s): 0.75 | learning rate: 3.074E-05 | global batch size: 256 | lm loss: 1.933837E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.603 | TFLOPs: 20.61 | 31: iteration 146530/ 173500 | consumed samples: 37511680 | consumed tokens: 76823920640 | elapsed time per iteration (s): 0.82 | learning rate: 3.073E-05 | global batch size: 256 | lm loss: 1.931096E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.339 | TFLOPs: 18.96 | 31: iteration 146540/ 173500 | consumed samples: 37514240 | consumed tokens: 76829163520 | elapsed time per iteration (s): 0.74 | learning rate: 3.072E-05 | global batch size: 256 | lm loss: 1.911163E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.247 | TFLOPs: 20.83 | 31: iteration 146550/ 173500 | consumed samples: 37516800 | consumed tokens: 76834406400 | elapsed time per iteration (s): 0.74 | learning rate: 3.071E-05 | global batch size: 256 | lm loss: 1.943957E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.450 | TFLOPs: 20.84 | 31: iteration 146560/ 173500 | consumed samples: 37519360 | consumed tokens: 76839649280 | elapsed time per iteration (s): 0.72 | learning rate: 3.071E-05 | global batch size: 256 | lm loss: 1.904199E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.852 | TFLOPs: 21.41 | 31: iteration 146570/ 173500 | consumed samples: 37521920 | consumed tokens: 76844892160 | elapsed time per iteration (s): 0.76 | learning rate: 
3.070E-05 | global batch size: 256 | lm loss: 1.947983E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.794 | TFLOPs: 20.44 | 31: iteration 146580/ 173500 | consumed samples: 37524480 | consumed tokens: 76850135040 | elapsed time per iteration (s): 0.73 | learning rate: 3.069E-05 | global batch size: 256 | lm loss: 1.928111E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.587 | TFLOPs: 21.09 | 31: iteration 146590/ 173500 | consumed samples: 37527040 | consumed tokens: 76855377920 | elapsed time per iteration (s): 0.77 | learning rate: 3.068E-05 | global batch size: 256 | lm loss: 1.919506E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.404 | TFLOPs: 19.99 | 31: iteration 146600/ 173500 | consumed samples: 37529600 | consumed tokens: 76860620800 | elapsed time per iteration (s): 0.77 | learning rate: 3.068E-05 | global batch size: 256 | lm loss: 1.898428E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.117 | TFLOPs: 20.15 | 31: iteration 146610/ 173500 | consumed samples: 37532160 | consumed tokens: 76865863680 | elapsed time per iteration (s): 0.77 | learning rate: 3.067E-05 | global batch size: 256 | lm loss: 1.895959E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.871 | TFLOPs: 20.02 | 31: iteration 146620/ 173500 | consumed samples: 37534720 | consumed tokens: 76871106560 | elapsed time per iteration (s): 0.78 | learning rate: 3.066E-05 | global batch size: 256 | lm loss: 1.907082E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.126 | TFLOPs: 19.85 | 31: iteration 146630/ 173500 | 
consumed samples: 37537280 | consumed tokens: 76876349440 | elapsed time per iteration (s): 0.78 | learning rate: 3.065E-05 | global batch size: 256 | lm loss: 1.904126E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.394 | TFLOPs: 19.93 | 31: iteration 146640/ 173500 | consumed samples: 37539840 | consumed tokens: 76881592320 | elapsed time per iteration (s): 2.50 | learning rate: 3.064E-05 | global batch size: 256 | lm loss: 1.914049E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 102.435 | TFLOPs: 6.20 | 31: iteration 146650/ 173500 | consumed samples: 37542400 | consumed tokens: 76886835200 | elapsed time per iteration (s): 0.73 | learning rate: 3.064E-05 | global batch size: 256 | lm loss: 1.910082E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.023 | TFLOPs: 21.12 | 31: iteration 146660/ 173500 | consumed samples: 37544960 | consumed tokens: 76892078080 | elapsed time per iteration (s): 0.77 | learning rate: 3.063E-05 | global batch size: 256 | lm loss: 1.936478E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.986 | TFLOPs: 20.02 | 31: iteration 146670/ 173500 | consumed samples: 37547520 | consumed tokens: 76897320960 | elapsed time per iteration (s): 0.74 | learning rate: 3.062E-05 | global batch size: 256 | lm loss: 1.935170E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.198 | TFLOPs: 20.88 | 31: iteration 146680/ 173500 | consumed samples: 37550080 | consumed tokens: 76902563840 | elapsed time per iteration (s): 0.76 | learning rate: 3.061E-05 | global batch size: 256 | lm loss: 1.940895E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 
0 | number of nan iterations: 0 | samples per second: 335.100 | TFLOPs: 20.27 | 31: iteration 146690/ 173500 | consumed samples: 37552640 | consumed tokens: 76907806720 | elapsed time per iteration (s): 0.73 | learning rate: 3.061E-05 | global batch size: 256 | lm loss: 1.945029E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.422 | TFLOPs: 21.32 | 31: iteration 146700/ 173500 | consumed samples: 37555200 | consumed tokens: 76913049600 | elapsed time per iteration (s): 0.75 | learning rate: 3.060E-05 | global batch size: 256 | lm loss: 1.919917E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.560 | TFLOPs: 20.66 | 31: iteration 146710/ 173500 | consumed samples: 37557760 | consumed tokens: 76918292480 | elapsed time per iteration (s): 0.79 | learning rate: 3.059E-05 | global batch size: 256 | lm loss: 1.945918E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.567 | TFLOPs: 19.64 | 31: iteration 146720/ 173500 | consumed samples: 37560320 | consumed tokens: 76923535360 | elapsed time per iteration (s): 0.79 | learning rate: 3.058E-05 | global batch size: 256 | lm loss: 1.910223E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.237 | TFLOPs: 19.56 | 31: iteration 146730/ 173500 | consumed samples: 37562880 | consumed tokens: 76928778240 | elapsed time per iteration (s): 0.81 | learning rate: 3.057E-05 | global batch size: 256 | lm loss: 1.926844E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.677 | TFLOPs: 19.04 | 31: iteration 146740/ 173500 | consumed samples: 37565440 | consumed tokens: 76934021120 | elapsed time per iteration (s): 0.86 | learning rate: 3.057E-05 | 
global batch size: 256 | lm loss: 1.932810E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.988 | TFLOPs: 17.91 | 31: iteration 146750/ 173500 | consumed samples: 37568000 | consumed tokens: 76939264000 | elapsed time per iteration (s): 0.78 | learning rate: 3.056E-05 | global batch size: 256 | lm loss: 1.937635E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.741 | TFLOPs: 19.89 | 31: iteration 146760/ 173500 | consumed samples: 37570560 | consumed tokens: 76944506880 | elapsed time per iteration (s): 0.82 | learning rate: 3.055E-05 | global batch size: 256 | lm loss: 1.901496E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.567 | TFLOPs: 18.91 | 31: iteration 146770/ 173500 | consumed samples: 37573120 | consumed tokens: 76949749760 | elapsed time per iteration (s): 0.78 | learning rate: 3.054E-05 | global batch size: 256 | lm loss: 1.903236E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.343 | TFLOPs: 19.80 | 31: iteration 146780/ 173500 | consumed samples: 37575680 | consumed tokens: 76954992640 | elapsed time per iteration (s): 0.81 | learning rate: 3.054E-05 | global batch size: 256 | lm loss: 1.931603E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.695 | TFLOPs: 19.16 | 31: iteration 146790/ 173500 | consumed samples: 37578240 | consumed tokens: 76960235520 | elapsed time per iteration (s): 0.82 | learning rate: 3.053E-05 | global batch size: 256 | lm loss: 1.941381E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.408 | TFLOPs: 18.84 | 31: iteration 146800/ 173500 | consumed 
samples: 37580800 | consumed tokens: 76965478400 | elapsed time per iteration (s): 0.86 | learning rate: 3.052E-05 | global batch size: 256 | lm loss: 1.951465E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.982 | TFLOPs: 17.97 | 31: iteration 146810/ 173500 | consumed samples: 37583360 | consumed tokens: 76970721280 | elapsed time per iteration (s): 0.79 | learning rate: 3.051E-05 | global batch size: 256 | lm loss: 1.922483E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.577 | TFLOPs: 19.64 | 31: iteration 146820/ 173500 | consumed samples: 37585920 | consumed tokens: 76975964160 | elapsed time per iteration (s): 0.79 | learning rate: 3.050E-05 | global batch size: 256 | lm loss: 1.945070E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.886 | TFLOPs: 19.72 | 31: iteration 146830/ 173500 | consumed samples: 37588480 | consumed tokens: 76981207040 | elapsed time per iteration (s): 0.98 | learning rate: 3.050E-05 | global batch size: 256 | lm loss: 1.930590E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 262.111 | TFLOPs: 15.86 | 31: iteration 146840/ 173500 | consumed samples: 37591040 | consumed tokens: 76986449920 | elapsed time per iteration (s): 0.96 | learning rate: 3.049E-05 | global batch size: 256 | lm loss: 1.919904E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 266.397 | TFLOPs: 16.12 | 31: iteration 146850/ 173500 | consumed samples: 37593600 | consumed tokens: 76991692800 | elapsed time per iteration (s): 0.80 | learning rate: 3.048E-05 | global batch size: 256 | lm loss: 1.917600E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 320.645 | TFLOPs: 19.40 | 31: iteration 146860/ 173500 | consumed samples: 37596160 | consumed tokens: 76996935680 | elapsed time per iteration (s): 0.80 | learning rate: 3.047E-05 | global batch size: 256 | lm loss: 1.907428E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.382 | TFLOPs: 19.32 | 31: iteration 146870/ 173500 | consumed samples: 37598720 | consumed tokens: 77002178560 | elapsed time per iteration (s): 0.82 | learning rate: 3.047E-05 | global batch size: 256 | lm loss: 1.936445E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.512 | TFLOPs: 18.97 | 31: iteration 146880/ 173500 | consumed samples: 37601280 | consumed tokens: 77007421440 | elapsed time per iteration (s): 0.81 | learning rate: 3.046E-05 | global batch size: 256 | lm loss: 1.927757E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.052 | TFLOPs: 19.12 | 31: iteration 146890/ 173500 | consumed samples: 37603840 | consumed tokens: 77012664320 | elapsed time per iteration (s): 0.81 | learning rate: 3.045E-05 | global batch size: 256 | lm loss: 1.870868E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.566 | TFLOPs: 19.21 | 31: iteration 146900/ 173500 | consumed samples: 37606400 | consumed tokens: 77017907200 | elapsed time per iteration (s): 0.85 | learning rate: 3.044E-05 | global batch size: 256 | lm loss: 1.936861E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.173 | TFLOPs: 18.22 | 31: iteration 146910/ 173500 | consumed samples: 37608960 | consumed tokens: 77023150080 | elapsed time per iteration (s): 0.83 | learning rate: 3.044E-05 | global 
batch size: 256 | lm loss: 1.930726E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.120 | TFLOPs: 18.76 | 31: iteration 146920/ 173500 | consumed samples: 37611520 | consumed tokens: 77028392960 | elapsed time per iteration (s): 0.83 | learning rate: 3.043E-05 | global batch size: 256 | lm loss: 1.927401E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.570 | TFLOPs: 18.61 | 31: iteration 146930/ 173500 | consumed samples: 37614080 | consumed tokens: 77033635840 | elapsed time per iteration (s): 0.82 | learning rate: 3.042E-05 | global batch size: 256 | lm loss: 1.910689E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.610 | TFLOPs: 18.85 | 31: iteration 146940/ 173500 | consumed samples: 37616640 | consumed tokens: 77038878720 | elapsed time per iteration (s): 0.81 | learning rate: 3.041E-05 | global batch size: 256 | lm loss: 1.925832E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.567 | TFLOPs: 19.03 | 31: iteration 146950/ 173500 | consumed samples: 37619200 | consumed tokens: 77044121600 | elapsed time per iteration (s): 0.81 | learning rate: 3.040E-05 | global batch size: 256 | lm loss: 1.915262E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.641 | TFLOPs: 19.22 | 31: iteration 146960/ 173500 | consumed samples: 37621760 | consumed tokens: 77049364480 | elapsed time per iteration (s): 0.80 | learning rate: 3.040E-05 | global batch size: 256 | lm loss: 1.926003E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.896 | TFLOPs: 19.35 | 31: iteration 146970/ 173500 | consumed samples: 
37624320 | consumed tokens: 77054607360 | elapsed time per iteration (s): 0.81 | learning rate: 3.039E-05 | global batch size: 256 | lm loss: 1.912249E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.508 | TFLOPs: 19.15 | 31: iteration 146980/ 173500 | consumed samples: 37626880 | consumed tokens: 77059850240 | elapsed time per iteration (s): 0.79 | learning rate: 3.038E-05 | global batch size: 256 | lm loss: 1.893095E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.907 | TFLOPs: 19.72 | 31: iteration 146990/ 173500 | consumed samples: 37629440 | consumed tokens: 77065093120 | elapsed time per iteration (s): 0.79 | learning rate: 3.037E-05 | global batch size: 256 | lm loss: 1.937221E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.799 | TFLOPs: 19.53 | 31: iteration 147000/ 173500 | consumed samples: 37632000 | consumed tokens: 77070336000 | elapsed time per iteration (s): 0.77 | learning rate: 3.037E-05 | global batch size: 256 | lm loss: 1.921889E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.813 | TFLOPs: 20.19 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 147000 | lm loss value: 1.885301E+00 | lm loss PPL: 6.588335E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 147000 to checkpoints_1b1long 0: [2022-11-27 03:11:27,539] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step147000 is begin to save! 0: [2022-11-27 03:11:27,551] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_01-model_00-model_states.pt... 
0: [2022-11-27 03:11:27,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_01-model_00-model_states.pt. 0: [2022-11-27 03:11:27,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_03-model_00-model_states.pt... 0: [2022-11-27 03:11:27,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_03-model_00-model_states.pt. 0: [2022-11-27 03:11:27,853] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_04-model_00-model_states.pt... 0: [2022-11-27 03:11:27,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_04-model_00-model_states.pt. 0: [2022-11-27 03:11:27,933] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_05-model_00-model_states.pt... 0: [2022-11-27 03:11:28,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_05-model_00-model_states.pt. 0: [2022-11-27 03:11:28,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_06-model_00-model_states.pt... 0: [2022-11-27 03:11:28,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_06-model_00-model_states.pt. 0: [2022-11-27 03:11:28,091] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_07-model_00-model_states.pt... 0: [2022-11-27 03:11:28,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_07-model_00-model_states.pt. 0: [2022-11-27 03:11:28,168] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_08-model_00-model_states.pt... 
0: [2022-11-27 03:11:28,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_08-model_00-model_states.pt. 0: [2022-11-27 03:11:28,248] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_09-model_00-model_states.pt... 0: [2022-11-27 03:11:28,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_09-model_00-model_states.pt. 0: [2022-11-27 03:11:28,325] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_10-model_00-model_states.pt... 0: [2022-11-27 03:11:28,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_10-model_00-model_states.pt. 0: [2022-11-27 03:11:28,403] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_11-model_00-model_states.pt... 0: [2022-11-27 03:11:28,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_11-model_00-model_states.pt. 0: [2022-11-27 03:11:28,481] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_12-model_00-model_states.pt... 0: [2022-11-27 03:11:28,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_12-model_00-model_states.pt. 0: [2022-11-27 03:11:28,559] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_13-model_00-model_states.pt... 0: [2022-11-27 03:11:28,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_13-model_00-model_states.pt. 0: [2022-11-27 03:11:28,637] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_14-model_00-model_states.pt... 
0: [2022-11-27 03:11:28,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_14-model_00-model_states.pt. 0: [2022-11-27 03:11:28,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_15-model_00-model_states.pt... 0: [2022-11-27 03:11:28,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_15-model_00-model_states.pt. 0: [2022-11-27 03:11:28,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_16-model_00-model_states.pt... 0: [2022-11-27 03:11:28,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_16-model_00-model_states.pt. 0: [2022-11-27 03:11:28,871] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_17-model_00-model_states.pt... 0: [2022-11-27 03:11:28,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_17-model_00-model_states.pt. 0: [2022-11-27 03:11:28,949] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_18-model_00-model_states.pt... 0: [2022-11-27 03:11:29,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_18-model_00-model_states.pt. 0: [2022-11-27 03:11:29,028] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_19-model_00-model_states.pt... 0: [2022-11-27 03:11:29,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_19-model_00-model_states.pt. 0: [2022-11-27 03:11:29,107] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_20-model_00-model_states.pt... 
0: [2022-11-27 03:11:29,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_20-model_00-model_states.pt. 0: [2022-11-27 03:11:29,182] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_21-model_00-model_states.pt... 0: [2022-11-27 03:11:29,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_21-model_00-model_states.pt. 0: [2022-11-27 03:11:29,262] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_22-model_00-model_states.pt... 0: [2022-11-27 03:11:29,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_22-model_00-model_states.pt. 0: [2022-11-27 03:11:29,341] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_23-model_00-model_states.pt... 0: [2022-11-27 03:11:29,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_23-model_00-model_states.pt. 0: [2022-11-27 03:11:29,419] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_24-model_00-model_states.pt... 0: [2022-11-27 03:11:29,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_24-model_00-model_states.pt. 0: [2022-11-27 03:11:29,496] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_25-model_00-model_states.pt... 0: [2022-11-27 03:11:29,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_25-model_00-model_states.pt. 0: [2022-11-27 03:11:29,572] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_26-model_00-model_states.pt... 
0: [2022-11-27 03:11:29,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_26-model_00-model_states.pt. 0: [2022-11-27 03:11:29,651] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_27-model_00-model_states.pt... 0: [2022-11-27 03:11:29,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_27-model_00-model_states.pt. 0: [2022-11-27 03:11:29,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_28-model_00-model_states.pt... 0: [2022-11-27 03:11:29,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_28-model_00-model_states.pt. 0: [2022-11-27 03:11:29,805] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/layer_30-model_00-model_states.pt... 0: [2022-11-27 03:11:29,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/layer_30-model_00-model_states.pt. 0: [2022-11-27 03:11:29,809] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step147000/mp_rank_00_model_states.pt 0: [2022-11-27 03:11:29,809] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/mp_rank_00_model_states.pt... 0: [2022-11-27 03:11:29,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/mp_rank_00_model_states.pt. 0: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
6: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 
7: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
8: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
2: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
3: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 
25: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
11: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 
31: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 
30: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 
27: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
9: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 
16: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
15: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 
11: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 
29: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 
21: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 
19: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
5: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
13: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
14: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 
18: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
16: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 
26: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 24: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 24: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 
24: [2022-11-27 03:11:29,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:11:29,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:11:29,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 03:11:29,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 2: [2022-11-27 03:11:29,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:11:29,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 03:11:29,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 24: [2022-11-27 03:11:29,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 19: [2022-11-27 03:11:29,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
19: [2022-11-27 03:11:29,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 24: [2022-11-27 03:11:29,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 19: [2022-11-27 03:11:29,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 24: [2022-11-27 03:11:29,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 16: [2022-11-27 03:11:29,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:11:29,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 03:11:29,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 29: [2022-11-27 03:11:29,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:11:29,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 03:11:29,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 13: [2022-11-27 03:11:29,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:11:29,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
13: [2022-11-27 03:11:29,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 29: [2022-11-27 03:11:29,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 13: [2022-11-27 03:11:29,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 29: [2022-11-27 03:11:29,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 20: [2022-11-27 03:11:29,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-27 03:11:29,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 03:11:29,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 1: [2022-11-27 03:11:29,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:11:29,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 03:11:29,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 16: [2022-11-27 03:11:29,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
16: [2022-11-27 03:11:29,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 03:11:29,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 7: [2022-11-27 03:11:29,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:11:29,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:11:29,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 12: [2022-11-27 03:11:29,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 7: [2022-11-27 03:11:29,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 12: [2022-11-27 03:11:29,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 4: [2022-11-27 03:11:29,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:11:29,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 03:11:29,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 20: [2022-11-27 03:11:29,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
14: [2022-11-27 03:11:29,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 18: [2022-11-27 03:11:29,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 20: [2022-11-27 03:11:29,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 14: [2022-11-27 03:11:29,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 18: [2022-11-27 03:11:29,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 3: [2022-11-27 03:11:29,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:11:29,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 20: [2022-11-27 03:11:29,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 14: [2022-11-27 03:11:29,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 18: [2022-11-27 03:11:29,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 
3: [2022-11-27 03:11:29,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 24: [2022-11-27 03:11:29,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:11:29,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:11:29,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 03:11:29,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 14: [2022-11-27 03:11:29,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 3: [2022-11-27 03:11:29,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 14: [2022-11-27 03:11:29,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 24: [2022-11-27 03:11:29,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 11: [2022-11-27 03:11:29,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:11:29,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:11:29,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 
11: [2022-11-27 03:11:29,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 03:11:29,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 28: [2022-11-27 03:11:29,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 03:11:29,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 5: [2022-11-27 03:11:29,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:11:29,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 22: [2022-11-27 03:11:29,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 03:11:29,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:11:29,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 22: [2022-11-27 03:11:29,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 18: [2022-11-27 03:11:29,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
22: [2022-11-27 03:11:29,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 18: [2022-11-27 03:11:29,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 22: [2022-11-27 03:11:29,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 18: [2022-11-27 03:11:29,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 5: [2022-11-27 03:11:29,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 22: [2022-11-27 03:11:29,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 5: [2022-11-27 03:11:29,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 03:11:29,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 21: [2022-11-27 03:11:29,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:11:29,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 27: [2022-11-27 03:11:29,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:11:29,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 
27: [2022-11-27 03:11:29,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 03:11:29,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 28: [2022-11-27 03:11:29,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:11:29,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 03:11:29,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 7: [2022-11-27 03:11:29,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:11:29,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:11:29,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 03:11:29,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 8: [2022-11-27 03:11:29,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:11:29,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
8: [2022-11-27 03:11:29,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 2: [2022-11-27 03:11:29,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 8: [2022-11-27 03:11:29,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 2: [2022-11-27 03:11:29,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 4: [2022-11-27 03:11:29,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:11:29,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:11:29,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:11:29,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 15: [2022-11-27 03:11:29,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 4: [2022-11-27 03:11:29,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 4: [2022-11-27 03:11:29,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 15: [2022-11-27 03:11:29,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 
4: [2022-11-27 03:11:29,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 16: [2022-11-27 03:11:29,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:11:29,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 03:11:29,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 8: [2022-11-27 03:11:29,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:11:29,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:11:29,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 03:11:29,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 27: [2022-11-27 03:11:29,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 03:11:29,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 19: [2022-11-27 03:11:29,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
19: [2022-11-27 03:11:29,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 03:11:29,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 19: [2022-11-27 03:11:29,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-27 03:11:29,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 03:11:29,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 20: [2022-11-27 03:11:29,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:11:29,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:11:29,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 03:11:29,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 28: [2022-11-27 03:11:29,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
20: [2022-11-27 03:11:29,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 2: [2022-11-27 03:11:29,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 20: [2022-11-27 03:11:29,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 2: [2022-11-27 03:11:29,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 03:11:29,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 3: [2022-11-27 03:11:29,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:11:29,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 03:11:29,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 23: [2022-11-27 03:11:29,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-27 03:11:29,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 03:11:29,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 23: [2022-11-27 03:11:29,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
7: [2022-11-27 03:11:29,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:11:29,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 8: [2022-11-27 03:11:29,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:11:29,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 23: [2022-11-27 03:11:29,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 8: [2022-11-27 03:11:29,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 03:11:29,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 23: [2022-11-27 03:11:29,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 28: [2022-11-27 03:11:29,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 18: [2022-11-27 03:11:29,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:11:29,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 
18: [2022-11-27 03:11:29,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 03:11:29,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 5: [2022-11-27 03:11:29,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:11:29,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 03:11:29,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 16: [2022-11-27 03:11:29,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:11:29,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 03:11:29,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 5: [2022-11-27 03:11:29,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:11:29,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
5: [2022-11-27 03:11:29,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 21: [2022-11-27 03:11:29,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 5: [2022-11-27 03:11:29,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 10: [2022-11-27 03:11:29,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:11:29,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:11:29,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:11:29,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 10: [2022-11-27 03:11:29,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 03:11:29,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 03:11:29,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 03:11:29,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 10: [2022-11-27 03:11:29,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 
10: [2022-11-27 03:11:29,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 12: [2022-11-27 03:11:29,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:11:29,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 03:11:29,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:11:29,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:11:29,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 23: [2022-11-27 03:11:29,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:11:29,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 12: [2022-11-27 03:11:29,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 23: [2022-11-27 03:11:29,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 1: [2022-11-27 03:11:29,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
12: [2022-11-27 03:11:29,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 23: [2022-11-27 03:11:29,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 10: [2022-11-27 03:11:29,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 14: [2022-11-27 03:11:29,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 19: [2022-11-27 03:11:29,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:11:29,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 14: [2022-11-27 03:11:29,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 19: [2022-11-27 03:11:29,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 14: [2022-11-27 03:11:29,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 19: [2022-11-27 03:11:29,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 1: [2022-11-27 03:11:29,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 22: [2022-11-27 03:11:29,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
22: [2022-11-27 03:11:29,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 03:11:29,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 13: [2022-11-27 03:11:29,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 03:11:29,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 13: [2022-11-27 03:11:29,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:11:29,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 03:11:29,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 13: [2022-11-27 03:11:29,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 17: [2022-11-27 03:11:29,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:11:29,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 03:11:29,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 
17: [2022-11-27 03:11:29,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 03:11:29,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 17: [2022-11-27 03:11:29,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-27 03:11:29,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 03:11:29,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 17: [2022-11-27 03:11:29,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-27 03:11:29,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 03:11:29,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 12: [2022-11-27 03:11:29,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:11:29,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 03:11:29,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 21: [2022-11-27 03:11:29,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-27 03:11:29,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:11:29,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 03:11:29,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 03:11:29,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 21: [2022-11-27 03:11:29,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 17: [2022-11-27 03:11:29,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:11:29,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 17: [2022-11-27 03:11:29,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 29: [2022-11-27 03:11:29,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 17: [2022-11-27 03:11:29,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 29: [2022-11-27 03:11:29,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 7: [2022-11-27 03:11:29,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
14: [2022-11-27 03:11:29,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:11:29,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 03:11:29,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 15: [2022-11-27 03:11:29,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:11:29,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 03:11:29,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 7: [2022-11-27 03:11:29,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 03:11:29,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 27: [2022-11-27 03:11:29,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:11:29,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 1: [2022-11-27 03:11:29,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:11:29,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 
1: [2022-11-27 03:11:29,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 03:11:29,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 11: [2022-11-27 03:11:29,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:11:29,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 03:11:29,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 1: [2022-11-27 03:11:29,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:11:29,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:11:29,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 03:11:29,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 24: [2022-11-27 03:11:29,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 03:11:29,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 24: [2022-11-27 03:11:29,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
11: [2022-11-27 03:11:29,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:11:29,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:11:29,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 11: [2022-11-27 03:11:29,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 27: [2022-11-27 03:11:29,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 11: [2022-11-27 03:11:29,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 24: [2022-11-27 03:11:29,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 27: [2022-11-27 03:11:29,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 20: [2022-11-27 03:11:29,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-27 03:11:29,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 03:11:29,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 28: [2022-11-27 03:11:29,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
28: [2022-11-27 03:11:29,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 8: [2022-11-27 03:11:29,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:11:29,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 8: [2022-11-27 03:11:29,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 03:11:29,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 3: [2022-11-27 03:11:29,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:11:29,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 2: [2022-11-27 03:11:29,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:11:29,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 2: [2022-11-27 03:11:29,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 03:11:29,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 18: [2022-11-27 03:11:29,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
18: [2022-11-27 03:11:29,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 03:11:29,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 0: [2022-11-27 03:11:29,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:11:29,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 03:11:29,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 0: [2022-11-27 03:11:29,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:11:29,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:11:29,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 03:11:29,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 24: [2022-11-27 03:11:29,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:11:29,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 03:11:29,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 
22: [2022-11-27 03:11:29,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-27 03:11:29,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 03:11:29,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 11: [2022-11-27 03:11:29,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:11:29,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 03:11:29,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 23: [2022-11-27 03:11:29,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-27 03:11:29,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 03:11:29,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 15: [2022-11-27 03:11:29,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:11:29,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-27 03:11:29,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 03:11:29,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 15: [2022-11-27 03:11:29,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 03:11:29,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 19: [2022-11-27 03:11:29,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-27 03:11:29,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 03:11:29,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 30: [2022-11-27 03:11:29,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:11:29,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:11:29,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:11:29,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
30: [2022-11-27 03:11:29,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 03:11:29,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 03:11:29,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 03:11:29,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 03:11:29,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 30: [2022-11-27 03:11:29,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 30: [2022-11-27 03:11:29,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 30: [2022-11-27 03:11:29,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 0: [2022-11-27 03:11:29,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:11:29,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 03:11:29,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 0: [2022-11-27 03:11:29,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-27 03:11:29,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:11:29,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:11:29,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:11:29,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:11:29,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:11:29,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 03:11:29,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 6: [2022-11-27 03:11:29,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 03:11:29,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 03:11:29,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 03:11:29,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: 
[2022-11-27 03:11:29,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 6: [2022-11-27 03:11:29,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 6: [2022-11-27 03:11:29,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 6: [2022-11-27 03:11:29,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 0: [2022-11-27 03:11:29,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 0: [2022-11-27 03:11:29,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 17: [2022-11-27 03:11:29,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-27 03:11:29,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 03:11:29,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 31: [2022-11-27 03:11:29,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-27 03:11:29,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-27 03:11:29,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-27 03:11:29,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-27 03:11:29,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 03:11:29,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 03:11:29,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 31: [2022-11-27 03:11:29,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 03:11:29,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 03:11:29,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 31: [2022-11-27 03:11:29,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 31: [2022-11-27 03:11:29,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 20: [2022-11-27 03:11:29,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-27 03:11:29,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 03:11:29,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 
0: [2022-11-27 03:11:29,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 03:11:29,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 29: [2022-11-27 03:11:30,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:11:30,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 03:11:30,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 6: [2022-11-27 03:11:30,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:11:30,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 03:11:30,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 4: [2022-11-27 03:11:30,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:11:30,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 03:11:30,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 15: [2022-11-27 03:11:30,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-27 03:11:30,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 03:11:30,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 16: [2022-11-27 03:11:30,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:11:30,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 03:11:30,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 2: [2022-11-27 03:11:30,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:11:30,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 03:11:30,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 18: [2022-11-27 03:11:30,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 03:11:30,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 03:11:30,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 21: [2022-11-27 03:11:30,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-27 03:11:30,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 03:11:30,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 10: [2022-11-27 03:11:30,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:11:30,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 03:11:30,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 28: [2022-11-27 03:11:30,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:11:30,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 03:11:30,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 7: [2022-11-27 03:11:30,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:11:30,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 03:11:30,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 31: [2022-11-27 03:11:30,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 
31: [2022-11-27 03:11:30,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 03:11:30,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 1: [2022-11-27 03:11:30,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:11:30,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 03:11:30,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 12: [2022-11-27 03:11:30,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:11:30,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 03:11:30,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 3: [2022-11-27 03:11:30,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:11:30,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 03:11:30,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 30: [2022-11-27 03:11:30,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-27 03:11:30,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 03:11:30,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 23: [2022-11-27 03:11:30,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-27 03:11:30,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 03:11:30,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 13: [2022-11-27 03:11:30,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:11:30,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 03:11:30,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 22: [2022-11-27 03:11:30,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-27 03:11:30,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 03:11:30,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 27: [2022-11-27 03:11:30,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
27: [2022-11-27 03:11:30,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 03:11:30,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 5: [2022-11-27 03:11:30,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:11:30,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 03:11:30,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 8: [2022-11-27 03:11:30,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:11:30,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 03:11:30,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 0: [2022-11-27 03:11:30,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:11:30,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 03:11:30,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 20: [2022-11-27 03:11:30,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
20: [2022-11-27 03:11:30,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 03:11:30,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 24: [2022-11-27 03:11:30,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:11:30,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 03:11:30,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 11: [2022-11-27 03:11:30,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:11:30,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 03:11:30,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 19: [2022-11-27 03:11:30,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-27 03:11:30,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 03:11:30,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 4: [2022-11-27 03:11:30,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-27 03:11:30,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 03:11:30,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 14: [2022-11-27 03:11:30,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:11:30,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 03:11:30,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 29: [2022-11-27 03:11:30,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:11:30,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 03:11:30,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 6: [2022-11-27 03:11:30,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:11:30,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 03:11:30,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 15: [2022-11-27 03:11:30,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-27 03:11:30,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 03:11:30,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 16: [2022-11-27 03:11:30,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:11:30,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 03:11:30,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 21: [2022-11-27 03:11:30,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:11:30,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 03:11:30,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 17: [2022-11-27 03:11:30,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-27 03:11:30,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 03:11:30,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 2: [2022-11-27 03:11:30,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-27 03:11:30,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 03:11:30,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 10: [2022-11-27 03:11:30,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:11:30,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 03:11:30,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 18: [2022-11-27 03:11:30,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-27 03:11:30,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-27 03:11:30,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 14: [2022-11-27 03:11:30,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:11:30,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 03:11:30,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 12: [2022-11-27 03:11:30,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-27 03:11:30,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 03:11:30,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 7: [2022-11-27 03:11:30,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:11:30,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 03:11:30,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 31: [2022-11-27 03:11:30,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 03:11:30,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 03:11:30,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 13: [2022-11-27 03:11:30,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:11:30,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 03:11:30,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 3: [2022-11-27 03:11:30,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-27 03:11:30,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 03:11:30,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 30: [2022-11-27 03:11:30,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:11:30,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 03:11:30,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 25: [2022-11-27 03:11:30,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:11:30,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 03:11:30,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 27: [2022-11-27 03:11:30,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:11:30,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 03:11:30,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 0: [2022-11-27 03:11:30,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-27 03:11:30,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 03:11:30,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 23: [2022-11-27 03:11:30,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:11:30,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:11:30,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 23: [2022-11-27 03:11:30,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 03:11:30,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 1: [2022-11-27 03:11:30,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 24: [2022-11-27 03:11:30,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 1: [2022-11-27 03:11:30,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 24: [2022-11-27 03:11:30,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 22: [2022-11-27 03:11:30,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
22: [2022-11-27 03:11:30,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 03:11:30,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 5: [2022-11-27 03:11:30,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:11:30,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 03:11:30,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 11: [2022-11-27 03:11:30,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:11:30,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 03:11:30,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 20: [2022-11-27 03:11:30,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-27 03:11:30,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 03:11:30,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 8: [2022-11-27 03:11:30,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-27 03:11:30,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 03:11:30,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 28: [2022-11-27 03:11:30,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:11:30,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 03:11:30,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 17: [2022-11-27 03:11:30,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 19: [2022-11-27 03:11:30,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 17: [2022-11-27 03:11:30,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 19: [2022-11-27 03:11:30,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 17: [2022-11-27 03:11:30,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 19: [2022-11-27 03:11:30,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 6: [2022-11-27 03:11:30,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-27 03:11:30,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 03:11:30,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 4: [2022-11-27 03:11:30,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:11:30,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 03:11:30,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 25: [2022-11-27 03:11:30,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:11:30,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:11:30,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-27 03:11:30,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 25: [2022-11-27 03:11:30,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 03:11:30,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 15: [2022-11-27 03:11:30,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-27 03:11:30,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 03:11:30,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 29: [2022-11-27 03:11:30,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:11:30,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 03:11:30,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 16: [2022-11-27 03:11:30,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:11:30,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 03:11:30,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 25: [2022-11-27 03:11:30,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:11:30,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 31: [2022-11-27 03:11:30,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
25: [2022-11-27 03:11:30,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:11:30,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 31: [2022-11-27 03:11:30,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 25: [2022-11-27 03:11:30,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 31: [2022-11-27 03:11:30,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 25: [2022-11-27 03:11:30,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 18: [2022-11-27 03:11:30,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 03:11:30,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 03:11:30,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 21: [2022-11-27 03:11:30,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:11:30,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 03:11:30,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 
2: [2022-11-27 03:11:30,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:11:30,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 03:11:30,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 9: [2022-11-27 03:11:30,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:11:30,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 03:11:30,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 9: [2022-11-27 03:11:30,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:11:30,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 03:11:30,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 12: [2022-11-27 03:11:30,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:11:30,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 03:11:30,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 
13: [2022-11-27 03:11:30,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:11:30,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 03:11:30,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 10: [2022-11-27 03:11:30,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:11:30,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 03:11:30,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 27: [2022-11-27 03:11:30,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:11:30,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 03:11:30,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 9: [2022-11-27 03:11:30,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:11:30,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:11:30,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-27 03:11:30,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 03:11:30,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 03:11:30,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 03:11:30,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 9: [2022-11-27 03:11:30,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 9: [2022-11-27 03:11:30,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 28: [2022-11-27 03:11:30,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:11:30,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 03:11:30,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 3: [2022-11-27 03:11:30,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:11:30,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 03:11:30,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 
14: [2022-11-27 03:11:30,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:11:30,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 03:11:30,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 7: [2022-11-27 03:11:30,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:11:30,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 03:11:30,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 30: [2022-11-27 03:11:30,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:11:30,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 03:11:30,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 0: [2022-11-27 03:11:30,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:11:30,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 03:11:30,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 
16: [2022-11-27 03:11:30,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 22: [2022-11-27 03:11:30,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:11:30,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 22: [2022-11-27 03:11:30,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 16: [2022-11-27 03:11:30,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 22: [2022-11-27 03:11:30,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 23: [2022-11-27 03:11:30,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-27 03:11:30,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 03:11:30,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 24: [2022-11-27 03:11:30,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:11:30,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 03:11:30,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 
1: [2022-11-27 03:11:30,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:11:30,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 19: [2022-11-27 03:11:30,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:11:30,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 19: [2022-11-27 03:11:30,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 03:11:30,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 21: [2022-11-27 03:11:30,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:11:30,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 03:11:30,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 8: [2022-11-27 03:11:30,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:11:30,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 03:11:30,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 
29: [2022-11-27 03:11:30,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:11:30,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 03:11:30,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 11: [2022-11-27 03:11:30,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:11:30,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 03:11:30,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 2: [2022-11-27 03:11:30,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:11:30,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 20: [2022-11-27 03:11:30,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:11:30,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 31: [2022-11-27 03:11:30,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 
20: [2022-11-27 03:11:30,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 03:11:30,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 31: [2022-11-27 03:11:30,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 4: [2022-11-27 03:11:30,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 31: [2022-11-27 03:11:30,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 4: [2022-11-27 03:11:30,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 03:11:30,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 7: [2022-11-27 03:11:30,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:11:30,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 03:11:30,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 17: [2022-11-27 03:11:30,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 
17: [2022-11-27 03:11:30,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 03:11:30,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 5: [2022-11-27 03:11:30,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:11:30,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:11:30,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 03:11:30,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 15: [2022-11-27 03:11:30,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 03:11:30,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 22: [2022-11-27 03:11:30,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:11:30,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-27 03:11:30,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 22: [2022-11-27 03:11:30,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 6: [2022-11-27 03:11:30,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 9: [2022-11-27 03:11:30,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:11:30,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 23: [2022-11-27 03:11:30,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:11:30,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 22: [2022-11-27 03:11:30,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 23: [2022-11-27 03:11:30,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 03:11:30,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 18: [2022-11-27 03:11:30,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
18: [2022-11-27 03:11:30,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 03:11:30,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 1: [2022-11-27 03:11:30,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:11:30,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 03:11:30,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 14: [2022-11-27 03:11:30,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:11:30,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 03:11:30,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 12: [2022-11-27 03:11:30,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:11:30,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:11:30,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 03:11:30,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 
28: [2022-11-27 03:11:30,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 03:11:30,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 13: [2022-11-27 03:11:30,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:11:30,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:11:30,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:11:30,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 3: [2022-11-27 03:11:30,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 03:11:30,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 8: [2022-11-27 03:11:30,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 13: [2022-11-27 03:11:30,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 03:11:30,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 27: [2022-11-27 03:11:30,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
27: [2022-11-27 03:11:30,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 03:11:30,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 25: [2022-11-27 03:11:30,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:11:30,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 30: [2022-11-27 03:11:30,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:11:30,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 25: [2022-11-27 03:11:30,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 30: [2022-11-27 03:11:30,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 10: [2022-11-27 03:11:30,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:11:30,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 03:11:30,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 5: [2022-11-27 03:11:30,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-27 03:11:30,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 03:11:30,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 9: [2022-11-27 03:11:30,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:11:30,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 03:11:30,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 25: [2022-11-27 03:11:30,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:11:30,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 03:11:30,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 25: [2022-11-27 03:11:30,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:11:30,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 03:11:30,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 26: [2022-11-27 03:11:30,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
26: [2022-11-27 03:11:30,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:11:30,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:11:30,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:11:30,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:11:30,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:11:30,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
26: [2022-11-27 03:11:30,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 03:11:30,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-27 03:11:30,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 03:11:30,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 03:11:30,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 03:11:30,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 03:11:30,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 03:11:30,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 26: [2022-11-27 03:11:30,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 26: [2022-11-27 03:11:30,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 26: [2022-11-27 03:11:30,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 26: [2022-11-27 03:11:30,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 
26: [2022-11-27 03:11:30,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 26: [2022-11-27 03:11:30,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 26: [2022-11-27 03:11:30,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:11:30,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 03:11:30,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 9: [2022-11-27 03:11:30,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:11:30,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step147000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 03:11:30,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step147000 is ready now! 
0: successfully saved checkpoint at iteration 147000 to checkpoints_1b1long
31: time (ms) | save-checkpoint: 2633.14
31: iteration 147010/ 173500 | consumed samples: 37634560 | consumed tokens: 77075578880 | elapsed time per iteration (s): 1.11 | learning rate: 3.036E-05 | global batch size: 256 | lm loss: 1.901714E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.971 | TFLOPs: 13.97 |
31: iteration 147020/ 173500 | consumed samples: 37637120 | consumed tokens: 77080821760 | elapsed time per iteration (s): 0.77 | learning rate: 3.035E-05 | global batch size: 256 | lm loss: 1.917749E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.358 | TFLOPs: 20.17 |
31: iteration 147030/ 173500 | consumed samples: 37639680 | consumed tokens: 77086064640 | elapsed time per iteration (s): 0.75 | learning rate: 3.034E-05 | global batch size: 256 | lm loss: 1.905069E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.165 | TFLOPs: 20.64 |
31: iteration 147040/ 173500 | consumed samples: 37642240 | consumed tokens: 77091307520 | elapsed time per iteration (s): 0.79 | learning rate: 3.034E-05 | global batch size: 256 | lm loss: 1.918051E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.091 | TFLOPs: 19.67 |
31: iteration 147050/ 173500 | consumed samples: 37644800 | consumed tokens: 77096550400 | elapsed time per iteration (s): 0.72 | learning rate: 3.033E-05 | global batch size: 256 | lm loss: 1.913243E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.812 | TFLOPs: 21.47 |
31: iteration 147060/ 173500 | consumed samples: 37647360 | consumed tokens: 77101793280 | elapsed time per iteration (s): 0.81 | learning rate: 3.032E-05 | global batch size: 256 | lm loss: 1.923873E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.355 | TFLOPs: 19.08 |
31: iteration 147070/ 173500 | consumed samples: 37649920 | consumed tokens: 77107036160 | elapsed time per iteration (s): 0.75 | learning rate: 3.031E-05 | global batch size: 256 | lm loss: 1.882662E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.096 | TFLOPs: 20.57 |
31: iteration 147080/ 173500 | consumed samples: 37652480 | consumed tokens: 77112279040 | elapsed time per iteration (s): 0.77 | learning rate: 3.031E-05 | global batch size: 256 | lm loss: 1.937671E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.434 | TFLOPs: 20.23 |
31: iteration 147090/ 173500 | consumed samples: 37655040 | consumed tokens: 77117521920 | elapsed time per iteration (s): 0.75 | learning rate: 3.030E-05 | global batch size: 256 | lm loss: 1.945873E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.908 | TFLOPs: 20.75 |
31: iteration 147100/ 173500 | consumed samples: 37657600 | consumed tokens: 77122764800 | elapsed time per iteration (s): 0.82 | learning rate: 3.029E-05 | global batch size: 256 | lm loss: 1.899339E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.571 | TFLOPs: 18.97 |
31: iteration 147110/ 173500 | consumed samples: 37660160 | consumed tokens: 77128007680 | elapsed time per iteration (s): 0.83 | learning rate: 3.028E-05 | global batch size: 256 | lm loss: 1.908380E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.192 | TFLOPs: 18.77 |
31: iteration 147120/ 173500 | consumed samples: 37662720 | consumed tokens: 77133250560 | elapsed time per iteration (s): 0.82 | learning rate: 3.027E-05 | global batch size: 256 | lm loss: 1.914876E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.894 | TFLOPs: 18.87 |
31: iteration 147130/ 173500 | consumed samples: 37665280 | consumed tokens: 77138493440 | elapsed time per iteration (s): 0.84 | learning rate: 3.027E-05 | global batch size: 256 | lm loss: 1.937164E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.706 | TFLOPs: 18.43 |
31: iteration 147140/ 173500 | consumed samples: 37667840 | consumed tokens: 77143736320 | elapsed time per iteration (s): 0.85 | learning rate: 3.026E-05 | global batch size: 256 | lm loss: 1.896423E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.041 | TFLOPs: 18.15 |
31: iteration 147150/ 173500 | consumed samples: 37670400 | consumed tokens: 77148979200 | elapsed time per iteration (s): 0.80 | learning rate: 3.025E-05 | global batch size: 256 | lm loss: 1.929966E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.230 | TFLOPs: 19.31 |
31: iteration 147160/ 173500 | consumed samples: 37672960 | consumed tokens: 77154222080 | elapsed time per iteration (s): 0.83 | learning rate: 3.024E-05 | global batch size: 256 | lm loss: 1.914060E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.256 | TFLOPs: 18.77 |
31: iteration 147170/ 173500 | consumed samples: 37675520 | consumed tokens: 77159464960 | elapsed time per iteration (s): 0.82 | learning rate: 3.024E-05 | global batch size: 256 | lm loss: 1.913464E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.732 | TFLOPs: 18.86 |
31: iteration 147180/ 173500 | consumed samples: 37678080 | consumed tokens: 77164707840 | elapsed time per iteration (s): 0.81 | learning rate: 3.023E-05 | global batch size: 256 | lm loss: 1.913095E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.274 | TFLOPs: 19.19 |
31: iteration 147190/ 173500 | consumed samples: 37680640 | consumed tokens: 77169950720 | elapsed time per iteration (s): 0.80 | learning rate: 3.022E-05 | global batch size: 256 | lm loss: 1.927814E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.339 | TFLOPs: 19.38 |
31: iteration 147200/ 173500 | consumed samples: 37683200 | consumed tokens: 77175193600 | elapsed time per iteration (s): 0.76 | learning rate: 3.021E-05 | global batch size: 256 | lm loss: 1.914630E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.431 | TFLOPs: 20.35 |
31: iteration 147210/ 173500 | consumed samples: 37685760 | consumed tokens: 77180436480 | elapsed time per iteration (s): 0.76 | learning rate: 3.021E-05 | global batch size: 256 | lm loss: 1.912698E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.120 | TFLOPs: 20.27 |
31: iteration 147220/ 173500 | consumed samples: 37688320 | consumed tokens: 77185679360 | elapsed time per iteration (s): 0.78 | learning rate: 3.020E-05 | global batch size: 256 | lm loss: 1.893354E+00 | grad norm: 0.285 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.808 | TFLOPs: 19.83 |
31: iteration 147230/ 173500 | consumed samples: 37690880 | consumed tokens: 77190922240 | elapsed time per iteration (s): 0.76 |
learning rate: 3.019E-05 | global batch size: 256 | lm loss: 1.923858E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.889 | TFLOPs: 20.50 | 31: iteration 147240/ 173500 | consumed samples: 37693440 | consumed tokens: 77196165120 | elapsed time per iteration (s): 0.77 | learning rate: 3.018E-05 | global batch size: 256 | lm loss: 1.924110E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.503 | TFLOPs: 20.12 | 31: iteration 147250/ 173500 | consumed samples: 37696000 | consumed tokens: 77201408000 | elapsed time per iteration (s): 0.74 | learning rate: 3.018E-05 | global batch size: 256 | lm loss: 1.899637E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.949 | TFLOPs: 20.99 | 31: iteration 147260/ 173500 | consumed samples: 37698560 | consumed tokens: 77206650880 | elapsed time per iteration (s): 0.79 | learning rate: 3.017E-05 | global batch size: 256 | lm loss: 1.953916E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.339 | TFLOPs: 19.62 | 31: iteration 147270/ 173500 | consumed samples: 37701120 | consumed tokens: 77211893760 | elapsed time per iteration (s): 0.82 | learning rate: 3.016E-05 | global batch size: 256 | lm loss: 1.922580E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.161 | TFLOPs: 18.95 | 31: iteration 147280/ 173500 | consumed samples: 37703680 | consumed tokens: 77217136640 | elapsed time per iteration (s): 0.80 | learning rate: 3.015E-05 | global batch size: 256 | lm loss: 1.939980E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.173 | TFLOPs: 19.43 | 31: iteration 
147290/ 173500 | consumed samples: 37706240 | consumed tokens: 77222379520 | elapsed time per iteration (s): 0.80 | learning rate: 3.015E-05 | global batch size: 256 | lm loss: 1.922089E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.125 | TFLOPs: 19.43 | 31: iteration 147300/ 173500 | consumed samples: 37708800 | consumed tokens: 77227622400 | elapsed time per iteration (s): 0.77 | learning rate: 3.014E-05 | global batch size: 256 | lm loss: 1.934114E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.604 | TFLOPs: 20.00 | 31: iteration 147310/ 173500 | consumed samples: 37711360 | consumed tokens: 77232865280 | elapsed time per iteration (s): 0.75 | learning rate: 3.013E-05 | global batch size: 256 | lm loss: 1.937924E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.430 | TFLOPs: 20.60 | 31: iteration 147320/ 173500 | consumed samples: 37713920 | consumed tokens: 77238108160 | elapsed time per iteration (s): 0.80 | learning rate: 3.012E-05 | global batch size: 256 | lm loss: 1.910747E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.084 | TFLOPs: 19.36 | 31: iteration 147330/ 173500 | consumed samples: 37716480 | consumed tokens: 77243351040 | elapsed time per iteration (s): 0.75 | learning rate: 3.011E-05 | global batch size: 256 | lm loss: 1.896853E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.105 | TFLOPs: 20.64 | 31: iteration 147340/ 173500 | consumed samples: 37719040 | consumed tokens: 77248593920 | elapsed time per iteration (s): 0.73 | learning rate: 3.011E-05 | global batch size: 256 | lm loss: 1.898149E+00 | grad norm: 0.177 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.419 | TFLOPs: 21.32 | 31: iteration 147350/ 173500 | consumed samples: 37721600 | consumed tokens: 77253836800 | elapsed time per iteration (s): 0.78 | learning rate: 3.010E-05 | global batch size: 256 | lm loss: 1.945954E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.118 | TFLOPs: 19.85 | 31: iteration 147360/ 173500 | consumed samples: 37724160 | consumed tokens: 77259079680 | elapsed time per iteration (s): 0.75 | learning rate: 3.009E-05 | global batch size: 256 | lm loss: 1.941790E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.714 | TFLOPs: 20.55 | 31: iteration 147370/ 173500 | consumed samples: 37726720 | consumed tokens: 77264322560 | elapsed time per iteration (s): 0.75 | learning rate: 3.008E-05 | global batch size: 256 | lm loss: 1.920910E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.421 | TFLOPs: 20.53 | 31: iteration 147380/ 173500 | consumed samples: 37729280 | consumed tokens: 77269565440 | elapsed time per iteration (s): 0.73 | learning rate: 3.008E-05 | global batch size: 256 | lm loss: 1.953112E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.988 | TFLOPs: 21.29 | 31: iteration 147390/ 173500 | consumed samples: 37731840 | consumed tokens: 77274808320 | elapsed time per iteration (s): 0.74 | learning rate: 3.007E-05 | global batch size: 256 | lm loss: 1.947566E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.601 | TFLOPs: 21.03 | 31: iteration 147400/ 173500 | consumed samples: 37734400 | consumed tokens: 77280051200 | elapsed time per iteration (s): 0.78 | learning 
rate: 3.006E-05 | global batch size: 256 | lm loss: 1.938509E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.340 | TFLOPs: 19.80 | 31: iteration 147410/ 173500 | consumed samples: 37736960 | consumed tokens: 77285294080 | elapsed time per iteration (s): 0.73 | learning rate: 3.005E-05 | global batch size: 256 | lm loss: 1.956256E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.189 | TFLOPs: 21.13 | 31: iteration 147420/ 173500 | consumed samples: 37739520 | consumed tokens: 77290536960 | elapsed time per iteration (s): 0.78 | learning rate: 3.005E-05 | global batch size: 256 | lm loss: 1.908331E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.452 | TFLOPs: 19.81 | 31: iteration 147430/ 173500 | consumed samples: 37742080 | consumed tokens: 77295779840 | elapsed time per iteration (s): 0.77 | learning rate: 3.004E-05 | global batch size: 256 | lm loss: 1.949284E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.672 | TFLOPs: 20.00 | 31: iteration 147440/ 173500 | consumed samples: 37744640 | consumed tokens: 77301022720 | elapsed time per iteration (s): 0.74 | learning rate: 3.003E-05 | global batch size: 256 | lm loss: 1.951476E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.856 | TFLOPs: 20.80 | 31: iteration 147450/ 173500 | consumed samples: 37747200 | consumed tokens: 77306265600 | elapsed time per iteration (s): 0.76 | learning rate: 3.002E-05 | global batch size: 256 | lm loss: 1.929570E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.302 | TFLOPs: 20.47 | 31: iteration 147460/ 
173500 | consumed samples: 37749760 | consumed tokens: 77311508480 | elapsed time per iteration (s): 0.79 | learning rate: 3.002E-05 | global batch size: 256 | lm loss: 1.916751E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.147 | TFLOPs: 19.55 | 31: iteration 147470/ 173500 | consumed samples: 37752320 | consumed tokens: 77316751360 | elapsed time per iteration (s): 0.77 | learning rate: 3.001E-05 | global batch size: 256 | lm loss: 1.924395E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.558 | TFLOPs: 20.12 | 31: iteration 147480/ 173500 | consumed samples: 37754880 | consumed tokens: 77321994240 | elapsed time per iteration (s): 0.75 | learning rate: 3.000E-05 | global batch size: 256 | lm loss: 1.911569E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.795 | TFLOPs: 20.74 | 31: iteration 147490/ 173500 | consumed samples: 37757440 | consumed tokens: 77327237120 | elapsed time per iteration (s): 0.77 | learning rate: 2.999E-05 | global batch size: 256 | lm loss: 1.936659E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.215 | TFLOPs: 20.22 | 31: iteration 147500/ 173500 | consumed samples: 37760000 | consumed tokens: 77332480000 | elapsed time per iteration (s): 0.84 | learning rate: 2.999E-05 | global batch size: 256 | lm loss: 1.904927E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.103 | TFLOPs: 18.46 | 31: iteration 147510/ 173500 | consumed samples: 37762560 | consumed tokens: 77337722880 | elapsed time per iteration (s): 0.78 | learning rate: 2.998E-05 | global batch size: 256 | lm loss: 1.930128E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 326.501 | TFLOPs: 19.75 | 31: iteration 147520/ 173500 | consumed samples: 37765120 | consumed tokens: 77342965760 | elapsed time per iteration (s): 0.76 | learning rate: 2.997E-05 | global batch size: 256 | lm loss: 1.931667E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.109 | TFLOPs: 20.33 | 31: iteration 147530/ 173500 | consumed samples: 37767680 | consumed tokens: 77348208640 | elapsed time per iteration (s): 0.78 | learning rate: 2.996E-05 | global batch size: 256 | lm loss: 1.893687E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.517 | TFLOPs: 19.81 | 31: iteration 147540/ 173500 | consumed samples: 37770240 | consumed tokens: 77353451520 | elapsed time per iteration (s): 0.80 | learning rate: 2.996E-05 | global batch size: 256 | lm loss: 1.919082E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.901 | TFLOPs: 19.35 | 31: iteration 147550/ 173500 | consumed samples: 37772800 | consumed tokens: 77358694400 | elapsed time per iteration (s): 0.82 | learning rate: 2.995E-05 | global batch size: 256 | lm loss: 1.910595E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.381 | TFLOPs: 18.78 | 31: iteration 147560/ 173500 | consumed samples: 37775360 | consumed tokens: 77363937280 | elapsed time per iteration (s): 0.80 | learning rate: 2.994E-05 | global batch size: 256 | lm loss: 1.905907E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.211 | TFLOPs: 19.25 | 31: iteration 147570/ 173500 | consumed samples: 37777920 | consumed tokens: 77369180160 | elapsed time per iteration (s): 0.83 | learning rate: 
2.993E-05 | global batch size: 256 | lm loss: 1.929493E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.498 | TFLOPs: 18.60 | 31: iteration 147580/ 173500 | consumed samples: 37780480 | consumed tokens: 77374423040 | elapsed time per iteration (s): 0.80 | learning rate: 2.993E-05 | global batch size: 256 | lm loss: 1.936464E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.909 | TFLOPs: 19.41 | 31: iteration 147590/ 173500 | consumed samples: 37783040 | consumed tokens: 77379665920 | elapsed time per iteration (s): 0.81 | learning rate: 2.992E-05 | global batch size: 256 | lm loss: 1.906176E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.510 | TFLOPs: 19.09 | 31: iteration 147600/ 173500 | consumed samples: 37785600 | consumed tokens: 77384908800 | elapsed time per iteration (s): 0.85 | learning rate: 2.991E-05 | global batch size: 256 | lm loss: 1.944560E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.906 | TFLOPs: 18.33 | 31: iteration 147610/ 173500 | consumed samples: 37788160 | consumed tokens: 77390151680 | elapsed time per iteration (s): 0.80 | learning rate: 2.990E-05 | global batch size: 256 | lm loss: 1.941936E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.739 | TFLOPs: 19.34 | 31: iteration 147620/ 173500 | consumed samples: 37790720 | consumed tokens: 77395394560 | elapsed time per iteration (s): 0.87 | learning rate: 2.990E-05 | global batch size: 256 | lm loss: 1.909003E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.611 | TFLOPs: 17.88 | 31: iteration 147630/ 173500 | 
consumed samples: 37793280 | consumed tokens: 77400637440 | elapsed time per iteration (s): 0.80 | learning rate: 2.989E-05 | global batch size: 256 | lm loss: 1.903883E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.754 | TFLOPs: 19.40 | 31: iteration 147640/ 173500 | consumed samples: 37795840 | consumed tokens: 77405880320 | elapsed time per iteration (s): 0.81 | learning rate: 2.988E-05 | global batch size: 256 | lm loss: 1.922915E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.877 | TFLOPs: 19.17 | 31: iteration 147650/ 173500 | consumed samples: 37798400 | consumed tokens: 77411123200 | elapsed time per iteration (s): 0.80 | learning rate: 2.987E-05 | global batch size: 256 | lm loss: 1.909105E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.382 | TFLOPs: 19.26 | 31: iteration 147660/ 173500 | consumed samples: 37800960 | consumed tokens: 77416366080 | elapsed time per iteration (s): 0.83 | learning rate: 2.987E-05 | global batch size: 256 | lm loss: 1.920370E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.939 | TFLOPs: 18.69 | 31: iteration 147670/ 173500 | consumed samples: 37803520 | consumed tokens: 77421608960 | elapsed time per iteration (s): 0.82 | learning rate: 2.986E-05 | global batch size: 256 | lm loss: 1.918890E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.013 | TFLOPs: 19.00 | 31: iteration 147680/ 173500 | consumed samples: 37806080 | consumed tokens: 77426851840 | elapsed time per iteration (s): 0.90 | learning rate: 2.985E-05 | global batch size: 256 | lm loss: 1.901190E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 283.099 | TFLOPs: 17.13 | 31: iteration 147690/ 173500 | consumed samples: 37808640 | consumed tokens: 77432094720 | elapsed time per iteration (s): 0.74 | learning rate: 2.984E-05 | global batch size: 256 | lm loss: 1.923579E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.753 | TFLOPs: 21.04 | 31: iteration 147700/ 173500 | consumed samples: 37811200 | consumed tokens: 77437337600 | elapsed time per iteration (s): 0.79 | learning rate: 2.984E-05 | global batch size: 256 | lm loss: 1.926732E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.375 | TFLOPs: 19.50 | 31: iteration 147710/ 173500 | consumed samples: 37813760 | consumed tokens: 77442580480 | elapsed time per iteration (s): 0.77 | learning rate: 2.983E-05 | global batch size: 256 | lm loss: 1.927620E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.004 | TFLOPs: 20.15 | 31: iteration 147720/ 173500 | consumed samples: 37816320 | consumed tokens: 77447823360 | elapsed time per iteration (s): 0.79 | learning rate: 2.982E-05 | global batch size: 256 | lm loss: 1.887070E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.261 | TFLOPs: 19.62 | 31: iteration 147730/ 173500 | consumed samples: 37818880 | consumed tokens: 77453066240 | elapsed time per iteration (s): 0.76 | learning rate: 2.981E-05 | global batch size: 256 | lm loss: 1.914257E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.715 | TFLOPs: 20.37 | 31: iteration 147740/ 173500 | consumed samples: 37821440 | consumed tokens: 77458309120 | elapsed time per iteration (s): 0.81 | learning rate: 
2.981E-05 | global batch size: 256 | lm loss: 1.926650E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.939 | TFLOPs: 19.17 | 31: iteration 147750/ 173500 | consumed samples: 37824000 | consumed tokens: 77463552000 | elapsed time per iteration (s): 0.78 | learning rate: 2.980E-05 | global batch size: 256 | lm loss: 1.926217E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.232 | TFLOPs: 19.80 | 31: iteration 147760/ 173500 | consumed samples: 37826560 | consumed tokens: 77468794880 | elapsed time per iteration (s): 0.78 | learning rate: 2.979E-05 | global batch size: 256 | lm loss: 1.942162E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.655 | TFLOPs: 19.94 | 31: iteration 147770/ 173500 | consumed samples: 37829120 | consumed tokens: 77474037760 | elapsed time per iteration (s): 0.76 | learning rate: 2.978E-05 | global batch size: 256 | lm loss: 1.960593E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.490 | TFLOPs: 20.36 | 31: iteration 147780/ 173500 | consumed samples: 37831680 | consumed tokens: 77479280640 | elapsed time per iteration (s): 0.77 | learning rate: 2.978E-05 | global batch size: 256 | lm loss: 1.951052E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.365 | TFLOPs: 19.99 | 31: iteration 147790/ 173500 | consumed samples: 37834240 | consumed tokens: 77484523520 | elapsed time per iteration (s): 0.75 | learning rate: 2.977E-05 | global batch size: 256 | lm loss: 1.901999E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.472 | TFLOPs: 20.60 | 31: iteration 147800/ 173500 | 
consumed samples: 37836800 | consumed tokens: 77489766400 | elapsed time per iteration (s): 0.73 | learning rate: 2.976E-05 | global batch size: 256 | lm loss: 1.903135E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.327 | TFLOPs: 21.13 | 31: iteration 147810/ 173500 | consumed samples: 37839360 | consumed tokens: 77495009280 | elapsed time per iteration (s): 0.71 | learning rate: 2.975E-05 | global batch size: 256 | lm loss: 1.924885E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 362.377 | TFLOPs: 21.92 | 31: iteration 147820/ 173500 | consumed samples: 37841920 | consumed tokens: 77500252160 | elapsed time per iteration (s): 0.76 | learning rate: 2.975E-05 | global batch size: 256 | lm loss: 1.947582E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.148 | TFLOPs: 20.40 | 31: iteration 147830/ 173500 | consumed samples: 37844480 | consumed tokens: 77505495040 | elapsed time per iteration (s): 0.77 | learning rate: 2.974E-05 | global batch size: 256 | lm loss: 1.925868E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.534 | TFLOPs: 20.24 | 31: iteration 147840/ 173500 | consumed samples: 37847040 | consumed tokens: 77510737920 | elapsed time per iteration (s): 0.79 | learning rate: 2.973E-05 | global batch size: 256 | lm loss: 1.908085E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.680 | TFLOPs: 19.52 | 31: iteration 147850/ 173500 | consumed samples: 37849600 | consumed tokens: 77515980800 | elapsed time per iteration (s): 0.78 | learning rate: 2.972E-05 | global batch size: 256 | lm loss: 1.931992E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 328.335 | TFLOPs: 19.86 | 31: iteration 147860/ 173500 | consumed samples: 37852160 | consumed tokens: 77521223680 | elapsed time per iteration (s): 0.78 | learning rate: 2.972E-05 | global batch size: 256 | lm loss: 1.940961E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.442 | TFLOPs: 19.75 | 31: iteration 147870/ 173500 | consumed samples: 37854720 | consumed tokens: 77526466560 | elapsed time per iteration (s): 0.75 | learning rate: 2.971E-05 | global batch size: 256 | lm loss: 1.928862E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.523 | TFLOPs: 20.66 | 31: iteration 147880/ 173500 | consumed samples: 37857280 | consumed tokens: 77531709440 | elapsed time per iteration (s): 0.80 | learning rate: 2.970E-05 | global batch size: 256 | lm loss: 1.937314E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.727 | TFLOPs: 19.46 | 31: iteration 147890/ 173500 | consumed samples: 37859840 | consumed tokens: 77536952320 | elapsed time per iteration (s): 0.77 | learning rate: 2.969E-05 | global batch size: 256 | lm loss: 1.927334E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.340 | TFLOPs: 20.17 | 31: iteration 147900/ 173500 | consumed samples: 37862400 | consumed tokens: 77542195200 | elapsed time per iteration (s): 0.80 | learning rate: 2.969E-05 | global batch size: 256 | lm loss: 1.919499E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.408 | TFLOPs: 19.38 | 31: iteration 147910/ 173500 | consumed samples: 37864960 | consumed tokens: 77547438080 | elapsed time per iteration (s): 0.76 | learning rate: 
2.968E-05 | global batch size: 256 | lm loss: 1.929621E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.751 | TFLOPs: 20.43 | 31: iteration 147920/ 173500 | consumed samples: 37867520 | consumed tokens: 77552680960 | elapsed time per iteration (s): 0.76 | learning rate: 2.967E-05 | global batch size: 256 | lm loss: 1.920019E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.276 | TFLOPs: 20.28 | 31: iteration 147930/ 173500 | consumed samples: 37870080 | consumed tokens: 77557923840 | elapsed time per iteration (s): 0.74 | learning rate: 2.966E-05 | global batch size: 256 | lm loss: 1.920714E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.253 | TFLOPs: 20.95 | 31: iteration 147940/ 173500 | consumed samples: 37872640 | consumed tokens: 77563166720 | elapsed time per iteration (s): 0.80 | learning rate: 2.966E-05 | global batch size: 256 | lm loss: 1.930188E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.978 | TFLOPs: 19.48 | 31: iteration 147950/ 173500 | consumed samples: 37875200 | consumed tokens: 77568409600 | elapsed time per iteration (s): 0.76 | learning rate: 2.965E-05 | global batch size: 256 | lm loss: 1.931178E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.956 | TFLOPs: 20.26 | 31: iteration 147960/ 173500 | consumed samples: 37877760 | consumed tokens: 77573652480 | elapsed time per iteration (s): 0.78 | learning rate: 2.964E-05 | global batch size: 256 | lm loss: 1.894389E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.230 | TFLOPs: 19.92 | 31: iteration 147970/ 173500 | 
consumed samples: 37880320 | consumed tokens: 77578895360 | elapsed time per iteration (s): 0.74 | learning rate: 2.964E-05 | global batch size: 256 | lm loss: 1.907229E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.601 | TFLOPs: 20.97 | 31: iteration 147980/ 173500 | consumed samples: 37882880 | consumed tokens: 77584138240 | elapsed time per iteration (s): 0.78 | learning rate: 2.963E-05 | global batch size: 256 | lm loss: 1.928291E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.174 | TFLOPs: 19.85 | 31: iteration 147990/ 173500 | consumed samples: 37885440 | consumed tokens: 77589381120 | elapsed time per iteration (s): 1.01 | learning rate: 2.962E-05 | global batch size: 256 | lm loss: 1.920974E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 254.575 | TFLOPs: 15.40 | 0: [2022-11-27 03:24:34,631] [INFO] [logging.py:68:log_dist] [Rank 0] step=148000, skipped=0, lr=[2.9612854264054498e-05, 2.9612854264054498e-05, 2.9612854264054498e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 0: steps: 148000 loss: 1.9164 iter time (s): 0.802 samples/sec: 319.326 31: iteration 148000/ 173500 | consumed samples: 37888000 | consumed tokens: 77594624000 | elapsed time per iteration (s): 0.78 | learning rate: 2.961E-05 | global batch size: 256 | lm loss: 1.936948E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.096 | TFLOPs: 19.91 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 148000 | lm loss value: 1.844196E+00 | lm loss PPL: 6.323013E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 148000 to 
checkpoints_1b1long 0: [2022-11-27 03:24:34,923] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step148000 is begin to save! 0: [2022-11-27 03:24:34,937] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_01-model_00-model_states.pt... 0: [2022-11-27 03:24:35,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_01-model_00-model_states.pt. 0: [2022-11-27 03:24:35,158] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_03-model_00-model_states.pt... 0: [2022-11-27 03:24:35,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_03-model_00-model_states.pt. 0: [2022-11-27 03:24:35,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_04-model_00-model_states.pt... 0: [2022-11-27 03:24:35,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_04-model_00-model_states.pt. 0: [2022-11-27 03:24:35,326] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_05-model_00-model_states.pt... 0: [2022-11-27 03:24:35,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_05-model_00-model_states.pt. 0: [2022-11-27 03:24:35,400] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_06-model_00-model_states.pt... 0: [2022-11-27 03:24:35,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_06-model_00-model_states.pt. 0: [2022-11-27 03:24:35,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_07-model_00-model_states.pt... 
0: [2022-11-27 03:24:35,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_07-model_00-model_states.pt. 0: [2022-11-27 03:24:35,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_08-model_00-model_states.pt... 0: [2022-11-27 03:24:35,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_08-model_00-model_states.pt. 0: [2022-11-27 03:24:35,645] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_09-model_00-model_states.pt... 0: [2022-11-27 03:24:35,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_09-model_00-model_states.pt. 0: [2022-11-27 03:24:35,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_10-model_00-model_states.pt... 0: [2022-11-27 03:24:35,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_10-model_00-model_states.pt. 0: [2022-11-27 03:24:35,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_11-model_00-model_states.pt... 0: [2022-11-27 03:24:35,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_11-model_00-model_states.pt. 0: [2022-11-27 03:24:35,874] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_12-model_00-model_states.pt... 0: [2022-11-27 03:24:35,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_12-model_00-model_states.pt. 0: [2022-11-27 03:24:35,948] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 03:24:36,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_13-model_00-model_states.pt. 0: [2022-11-27 03:24:36,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_14-model_00-model_states.pt... 0: [2022-11-27 03:24:36,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_14-model_00-model_states.pt. 0: [2022-11-27 03:24:36,098] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_15-model_00-model_states.pt... 0: [2022-11-27 03:24:36,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_15-model_00-model_states.pt. 0: [2022-11-27 03:24:36,173] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_16-model_00-model_states.pt... 0: [2022-11-27 03:24:36,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_16-model_00-model_states.pt. 0: [2022-11-27 03:24:36,251] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_17-model_00-model_states.pt... 0: [2022-11-27 03:24:36,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_17-model_00-model_states.pt. 0: [2022-11-27 03:24:36,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_18-model_00-model_states.pt... 0: [2022-11-27 03:24:36,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_18-model_00-model_states.pt. 0: [2022-11-27 03:24:36,401] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 03:24:36,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_19-model_00-model_states.pt. 0: [2022-11-27 03:24:36,475] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_20-model_00-model_states.pt... 0: [2022-11-27 03:24:36,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_20-model_00-model_states.pt. 0: [2022-11-27 03:24:36,553] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_21-model_00-model_states.pt... 0: [2022-11-27 03:24:36,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_21-model_00-model_states.pt. 0: [2022-11-27 03:24:36,628] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_22-model_00-model_states.pt... 0: [2022-11-27 03:24:36,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_22-model_00-model_states.pt. 0: [2022-11-27 03:24:36,703] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_23-model_00-model_states.pt... 0: [2022-11-27 03:24:36,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_23-model_00-model_states.pt. 0: [2022-11-27 03:24:36,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_24-model_00-model_states.pt... 0: [2022-11-27 03:24:36,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_24-model_00-model_states.pt. 0: [2022-11-27 03:24:36,851] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 03:24:36,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_25-model_00-model_states.pt. 0: [2022-11-27 03:24:36,928] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_26-model_00-model_states.pt... 0: [2022-11-27 03:24:37,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_26-model_00-model_states.pt. 0: [2022-11-27 03:24:37,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_27-model_00-model_states.pt... 0: [2022-11-27 03:24:37,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_27-model_00-model_states.pt. 0: [2022-11-27 03:24:37,079] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_28-model_00-model_states.pt... 0: [2022-11-27 03:24:37,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_28-model_00-model_states.pt. 0: [2022-11-27 03:24:37,154] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/layer_30-model_00-model_states.pt... 0: [2022-11-27 03:24:37,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/layer_30-model_00-model_states.pt. 0: [2022-11-27 03:24:37,156] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step148000/mp_rank_00_model_states.pt 0: [2022-11-27 03:24:37,156] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/mp_rank_00_model_states.pt... 0: [2022-11-27 03:24:37,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/mp_rank_00_model_states.pt. 
0: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
4: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
8: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 
13: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 
20: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 
28: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 
22: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 
21: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 
26: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 
6: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 
9: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 
16: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
12: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 
25: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 
24: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 
29: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 
26: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
7: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
13: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 
11: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 
30: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
0: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 
24: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
12: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:24:37,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:24:37,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:24:37,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 03:24:37,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 31: [2022-11-27 03:24:37,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 19: [2022-11-27 03:24:37,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 31: [2022-11-27 03:24:37,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 19: [2022-11-27 03:24:37,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 31: [2022-11-27 03:24:37,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
19: [2022-11-27 03:24:37,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 27: [2022-11-27 03:24:37,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:24:37,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:24:37,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 5: [2022-11-27 03:24:37,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 27: [2022-11-27 03:24:37,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 5: [2022-11-27 03:24:37,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 18: [2022-11-27 03:24:37,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 03:24:37,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 03:24:37,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 10: [2022-11-27 03:24:37,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-27 03:24:37,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 2: [2022-11-27 03:24:37,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:24:37,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 2: [2022-11-27 03:24:37,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 03:24:37,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 4: [2022-11-27 03:24:37,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 20: [2022-11-27 03:24:37,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:24:37,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 20: [2022-11-27 03:24:37,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 4: [2022-11-27 03:24:37,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 20: [2022-11-27 03:24:37,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 15: [2022-11-27 03:24:37,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-27 03:24:37,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 03:24:37,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 9: [2022-11-27 03:24:37,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:24:37,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 03:24:37,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 29: [2022-11-27 03:24:37,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:24:37,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 03:24:37,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 23: [2022-11-27 03:24:37,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:24:37,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:24:37,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 03:24:37,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
23: [2022-11-27 03:24:37,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 03:24:37,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 0: [2022-11-27 03:24:37,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:24:37,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:24:37,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 16: [2022-11-27 03:24:37,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:24:37,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 0: [2022-11-27 03:24:37,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 26: [2022-11-27 03:24:37,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 16: [2022-11-27 03:24:37,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 11: [2022-11-27 03:24:37,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:24:37,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
11: [2022-11-27 03:24:37,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 03:24:37,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 6: [2022-11-27 03:24:37,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:24:37,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:24:37,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 03:24:37,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 12: [2022-11-27 03:24:37,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 03:24:37,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 6: [2022-11-27 03:24:37,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:24:37,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 03:24:37,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 21: [2022-11-27 03:24:37,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
30: [2022-11-27 03:24:37,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:24:37,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 30: [2022-11-27 03:24:37,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 21: [2022-11-27 03:24:37,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 30: [2022-11-27 03:24:37,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 21: [2022-11-27 03:24:37,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 17: [2022-11-27 03:24:37,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:24:37,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 03:24:37,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 8: [2022-11-27 03:24:37,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:24:37,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
30: [2022-11-27 03:24:37,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 17: [2022-11-27 03:24:37,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 18: [2022-11-27 03:24:37,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:24:37,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 17: [2022-11-27 03:24:37,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 18: [2022-11-27 03:24:37,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 8: [2022-11-27 03:24:37,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 30: [2022-11-27 03:24:37,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 17: [2022-11-27 03:24:37,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 18: [2022-11-27 03:24:37,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 8: [2022-11-27 03:24:37,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 30: [2022-11-27 03:24:37,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
17: [2022-11-27 03:24:37,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 8: [2022-11-27 03:24:37,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 17: [2022-11-27 03:24:37,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 10: [2022-11-27 03:24:37,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:24:37,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 03:24:37,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 22: [2022-11-27 03:24:37,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 19: [2022-11-27 03:24:37,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 22: [2022-11-27 03:24:37,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 03:24:37,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 19: [2022-11-27 03:24:37,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-27 03:24:37,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
2: [2022-11-27 03:24:37,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:24:37,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:24:37,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:24:37,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 7: [2022-11-27 03:24:37,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:24:37,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 24: [2022-11-27 03:24:37,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 7: [2022-11-27 03:24:37,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 2: [2022-11-27 03:24:37,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 27: [2022-11-27 03:24:37,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 7: [2022-11-27 03:24:37,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 27: [2022-11-27 03:24:37,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
13: [2022-11-27 03:24:37,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:24:37,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:24:37,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 13: [2022-11-27 03:24:37,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 29: [2022-11-27 03:24:37,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 13: [2022-11-27 03:24:37,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 14: [2022-11-27 03:24:37,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:24:37,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 03:24:37,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 25: [2022-11-27 03:24:37,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:24:37,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 03:24:37,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
25: [2022-11-27 03:24:37,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:24:37,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 03:24:37,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 26: [2022-11-27 03:24:37,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:24:37,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:24:37,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 5: [2022-11-27 03:24:37,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 26: [2022-11-27 03:24:37,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 5: [2022-11-27 03:24:37,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 15: [2022-11-27 03:24:37,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:24:37,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 03:24:37,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
23: [2022-11-27 03:24:37,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-27 03:24:37,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 6: [2022-11-27 03:24:37,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 23: [2022-11-27 03:24:37,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 6: [2022-11-27 03:24:37,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 03:24:37,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 8: [2022-11-27 03:24:37,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:24:37,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 03:24:37,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 13: [2022-11-27 03:24:37,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:24:37,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 03:24:37,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
27: [2022-11-27 03:24:37,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:24:37,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 03:24:37,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 9: [2022-11-27 03:24:37,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:24:37,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 16: [2022-11-27 03:24:37,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:24:37,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:24:37,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 2: [2022-11-27 03:24:37,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 5: [2022-11-27 03:24:37,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
16: [2022-11-27 03:24:37,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 2: [2022-11-27 03:24:37,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 5: [2022-11-27 03:24:37,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 16: [2022-11-27 03:24:37,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 5: [2022-11-27 03:24:37,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 24: [2022-11-27 03:24:37,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:24:37,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 03:24:37,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 30: [2022-11-27 03:24:37,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:24:37,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 03:24:37,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 0: [2022-11-27 03:24:37,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
7: [2022-11-27 03:24:37,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:24:37,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 7: [2022-11-27 03:24:37,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 03:24:37,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 0: [2022-11-27 03:24:37,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 16: [2022-11-27 03:24:37,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:24:37,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 22: [2022-11-27 03:24:37,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:24:37,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 22: [2022-11-27 03:24:37,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 03:24:37,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 18: [2022-11-27 03:24:37,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
22: [2022-11-27 03:24:37,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:24:37,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:24:37,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:24:37,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 22: [2022-11-27 03:24:37,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 18: [2022-11-27 03:24:37,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 26: [2022-11-27 03:24:37,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 11: [2022-11-27 03:24:37,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 29: [2022-11-27 03:24:37,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 11: [2022-11-27 03:24:37,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 29: [2022-11-27 03:24:37,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
22: [2022-11-27 03:24:37,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 18: [2022-11-27 03:24:37,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 26: [2022-11-27 03:24:37,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 24: [2022-11-27 03:24:37,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:24:37,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 03:24:37,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 13: [2022-11-27 03:24:37,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:24:37,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 03:24:37,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 31: [2022-11-27 03:24:37,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-27 03:24:37,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 03:24:37,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
19: [2022-11-27 03:24:37,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-27 03:24:37,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 03:24:37,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 21: [2022-11-27 03:24:37,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:24:37,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 25: [2022-11-27 03:24:37,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:24:37,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 25: [2022-11-27 03:24:37,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 03:24:37,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 20: [2022-11-27 03:24:37,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-27 03:24:37,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 03:24:37,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
22: [2022-11-27 03:24:37,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 03:24:37,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 03:24:37,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 4: [2022-11-27 03:24:37,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:24:37,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 03:24:37,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 8: [2022-11-27 03:24:37,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 31: [2022-11-27 03:24:37,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:24:37,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 03:24:37,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 29: [2022-11-27 03:24:37,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
31: [2022-11-27 03:24:37,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 29: [2022-11-27 03:24:37,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 31: [2022-11-27 03:24:37,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 29: [2022-11-27 03:24:37,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 14: [2022-11-27 03:24:37,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:24:37,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 13: [2022-11-27 03:24:37,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:24:37,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 26: [2022-11-27 03:24:37,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:24:37,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 26: [2022-11-27 03:24:37,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 03:24:37,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
13: [2022-11-27 03:24:37,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 12: [2022-11-27 03:24:37,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:24:37,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 03:24:37,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 18: [2022-11-27 03:24:37,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 03:24:37,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 6: [2022-11-27 03:24:37,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 18: [2022-11-27 03:24:37,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 4: [2022-11-27 03:24:37,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:24:37,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 4: [2022-11-27 03:24:37,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 6: [2022-11-27 03:24:37,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
4: [2022-11-27 03:24:37,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 10: [2022-11-27 03:24:37,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:24:37,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 03:24:37,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 0: [2022-11-27 03:24:37,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:24:37,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:24:37,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 0: [2022-11-27 03:24:37,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 03:24:37,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 30: [2022-11-27 03:24:37,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 2: [2022-11-27 03:24:37,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-27 03:24:37,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 03:24:37,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 20: [2022-11-27 03:24:37,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-27 03:24:37,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 03:24:37,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 17: [2022-11-27 03:24:37,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:24:37,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 17: [2022-11-27 03:24:37,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 03:24:37,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 17: [2022-11-27 03:24:37,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-27 03:24:37,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 03:24:37,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
4: [2022-11-27 03:24:37,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 03:24:37,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 10: [2022-11-27 03:24:37,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:24:37,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 03:24:37,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 9: [2022-11-27 03:24:37,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:24:37,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:24:37,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 25: [2022-11-27 03:24:37,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 9: [2022-11-27 03:24:37,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 25: [2022-11-27 03:24:37,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 20: [2022-11-27 03:24:37,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-27 03:24:37,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 03:24:37,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 31: [2022-11-27 03:24:37,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-27 03:24:37,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 03:24:37,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 9: [2022-11-27 03:24:37,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:24:37,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 24: [2022-11-27 03:24:37,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:24:37,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 24: [2022-11-27 03:24:37,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-27 03:24:37,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 21: [2022-11-27 03:24:37,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-27 03:24:37,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 03:24:37,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 15: [2022-11-27 03:24:37,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:24:37,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 03:24:37,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 15: [2022-11-27 03:24:37,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:24:37,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 03:24:37,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 1: [2022-11-27 03:24:37,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:24:37,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:24:37,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-27 03:24:37,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 03:24:37,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 03:24:37,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 03:24:37,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 1: [2022-11-27 03:24:37,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 1: [2022-11-27 03:24:37,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 28: [2022-11-27 03:24:37,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:24:37,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:24:37,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:24:37,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
28: [2022-11-27 03:24:37,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 03:24:37,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 03:24:37,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 03:24:37,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 03:24:37,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 28: [2022-11-27 03:24:37,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 28: [2022-11-27 03:24:37,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 28: [2022-11-27 03:24:37,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 19: [2022-11-27 03:24:37,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 03:24:37,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 03:24:37,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 14: [2022-11-27 03:24:37,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-27 03:24:37,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 03:24:37,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 27: [2022-11-27 03:24:37,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:24:37,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 03:24:37,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 16: [2022-11-27 03:24:37,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:24:37,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 03:24:37,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 12: [2022-11-27 03:24:37,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:24:37,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 03:24:37,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 3: [2022-11-27 03:24:37,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-27 03:24:37,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:24:37,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 03:24:37,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 03:24:37,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 3: [2022-11-27 03:24:37,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 3: [2022-11-27 03:24:37,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:24:37,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 03:24:37,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 28: [2022-11-27 03:24:37,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 23: [2022-11-27 03:24:37,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-27 03:24:37,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 03:24:37,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
11: [2022-11-27 03:24:37,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:24:37,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:24:37,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 03:24:37,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 03:24:37,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 11: [2022-11-27 03:24:37,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 5: [2022-11-27 03:24:37,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:24:37,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 03:24:37,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 0: [2022-11-27 03:24:37,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:24:37,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 03:24:37,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
9: [2022-11-27 03:24:37,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:24:37,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 03:24:37,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 23: [2022-11-27 03:24:37,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-27 03:24:37,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 03:24:37,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-27 03:24:37,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 23: [2022-11-27 03:24:37,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 03:24:37,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 7: [2022-11-27 03:24:37,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:24:37,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 03:24:37,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
15: [2022-11-27 03:24:37,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:24:37,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 03:24:37,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 28: [2022-11-27 03:24:37,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 12: [2022-11-27 03:24:37,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:24:37,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 12: [2022-11-27 03:24:37,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 03:24:37,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 8: [2022-11-27 03:24:37,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:24:37,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 03:24:37,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 0: [2022-11-27 03:24:37,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-27 03:24:37,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 03:24:37,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 6: [2022-11-27 03:24:37,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:24:37,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 03:24:37,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 25: [2022-11-27 03:24:37,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:24:37,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 03:24:37,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 20: [2022-11-27 03:24:37,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-27 03:24:37,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 03:24:37,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 11: [2022-11-27 03:24:37,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-27 03:24:37,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 03:24:37,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 5: [2022-11-27 03:24:37,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:24:37,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 03:24:37,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 27: [2022-11-27 03:24:37,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:24:37,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 03:24:37,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 17: [2022-11-27 03:24:37,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-27 03:24:37,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 03:24:37,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 2: [2022-11-27 03:24:37,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
18: [2022-11-27 03:24:37,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:24:37,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 18: [2022-11-27 03:24:37,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 2: [2022-11-27 03:24:37,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 18: [2022-11-27 03:24:37,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 10: [2022-11-27 03:24:37,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:24:37,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 03:24:37,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 30: [2022-11-27 03:24:37,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:24:37,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 03:24:37,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 13: [2022-11-27 03:24:37,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-27 03:24:37,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 03:24:37,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 19: [2022-11-27 03:24:37,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-27 03:24:37,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 03:24:37,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 24: [2022-11-27 03:24:37,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:24:37,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 03:24:37,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 21: [2022-11-27 03:24:37,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:24:37,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 03:24:37,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 29: [2022-11-27 03:24:37,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
4: [2022-11-27 03:24:37,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:24:37,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 4: [2022-11-27 03:24:37,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 03:24:37,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 29: [2022-11-27 03:24:37,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 31: [2022-11-27 03:24:37,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 03:24:37,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 03:24:37,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 16: [2022-11-27 03:24:37,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:24:37,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:24:37,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 03:24:37,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
16: [2022-11-27 03:24:37,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 03:24:37,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 1: [2022-11-27 03:24:37,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:24:37,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 03:24:37,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 14: [2022-11-27 03:24:37,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:24:37,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 03:24:37,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 22: [2022-11-27 03:24:37,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-27 03:24:37,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 03:24:37,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 12: [2022-11-27 03:24:37,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-27 03:24:37,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 03:24:37,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 3: [2022-11-27 03:24:37,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:24:37,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 03:24:37,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 7: [2022-11-27 03:24:37,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:24:37,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 03:24:37,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 0: [2022-11-27 03:24:37,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:24:37,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:24:37,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 03:24:37,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
23: [2022-11-27 03:24:37,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:24:37,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:24:37,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 03:24:37,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 23: [2022-11-27 03:24:37,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 03:24:37,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 9: [2022-11-27 03:24:37,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:24:37,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 03:24:37,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 15: [2022-11-27 03:24:37,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:24:37,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-27 03:24:37,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 03:24:37,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 15: [2022-11-27 03:24:37,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 03:24:37,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 0: [2022-11-27 03:24:37,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 03:24:37,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 6: [2022-11-27 03:24:37,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:24:37,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 03:24:37,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 20: [2022-11-27 03:24:37,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 03:24:37,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 03:24:37,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
5: [2022-11-27 03:24:37,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:24:37,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 03:24:37,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 25: [2022-11-27 03:24:37,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:24:37,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 03:24:37,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 18: [2022-11-27 03:24:37,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 03:24:37,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 03:24:37,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 17: [2022-11-27 03:24:37,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-27 03:24:37,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 03:24:37,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
10: [2022-11-27 03:24:37,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:24:37,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 03:24:37,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 27: [2022-11-27 03:24:37,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:24:37,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 03:24:37,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 2: [2022-11-27 03:24:37,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:24:37,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:24:37,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 03:24:37,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 30: [2022-11-27 03:24:37,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 03:24:37,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
29: [2022-11-27 03:24:37,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:24:37,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 03:24:37,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 13: [2022-11-27 03:24:37,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:24:37,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 03:24:37,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 22: [2022-11-27 03:24:37,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-27 03:24:37,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 03:24:37,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 19: [2022-11-27 03:24:37,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-27 03:24:37,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 03:24:37,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
24: [2022-11-27 03:24:37,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:24:37,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 03:24:37,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 21: [2022-11-27 03:24:37,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:24:37,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 03:24:37,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 4: [2022-11-27 03:24:37,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:24:37,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 03:24:37,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 12: [2022-11-27 03:24:37,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:24:37,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 03:24:37,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
1: [2022-11-27 03:24:37,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:24:37,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 03:24:37,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 0: [2022-11-27 03:24:37,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:24:37,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 03:24:37,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 26: [2022-11-27 03:24:37,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:24:37,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 03:24:37,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 31: [2022-11-27 03:24:37,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-27 03:24:37,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 03:24:37,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
14: [2022-11-27 03:24:37,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:24:37,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 03:24:37,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 3: [2022-11-27 03:24:37,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:24:37,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 03:24:37,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 16: [2022-11-27 03:24:37,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:24:37,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 03:24:37,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 15: [2022-11-27 03:24:37,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:24:37,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 03:24:37,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
11: [2022-11-27 03:24:37,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:24:37,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 03:24:37,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 7: [2022-11-27 03:24:37,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:24:37,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:24:37,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 8: [2022-11-27 03:24:37,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 7: [2022-11-27 03:24:37,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 8: [2022-11-27 03:24:37,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 23: [2022-11-27 03:24:37,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-27 03:24:37,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 03:24:37,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
9: [2022-11-27 03:24:37,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:24:37,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 03:24:37,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 28: [2022-11-27 03:24:37,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:24:37,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 17: [2022-11-27 03:24:37,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:24:37,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 17: [2022-11-27 03:24:37,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 03:24:37,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 5: [2022-11-27 03:24:37,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:24:37,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 03:24:37,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
25: [2022-11-27 03:24:37,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:24:37,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 03:24:37,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 20: [2022-11-27 03:24:37,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-27 03:24:37,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 03:24:37,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 30: [2022-11-27 03:24:37,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:24:37,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 03:24:37,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 27: [2022-11-27 03:24:37,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:24:37,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 03:24:37,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
18: [2022-11-27 03:24:37,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-27 03:24:37,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-27 03:24:37,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 2: [2022-11-27 03:24:37,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:24:37,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 03:24:37,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 6: [2022-11-27 03:24:37,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:24:37,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 03:24:37,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 10: [2022-11-27 03:24:37,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:24:37,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 03:24:37,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
13: [2022-11-27 03:24:37,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:24:37,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 03:24:37,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 29: [2022-11-27 03:24:37,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:24:37,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 03:24:37,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 19: [2022-11-27 03:24:37,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-27 03:24:37,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 03:24:37,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 4: [2022-11-27 03:24:37,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:24:37,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 03:24:37,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
22: [2022-11-27 03:24:37,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 03:24:37,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 03:24:37,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 16: [2022-11-27 03:24:37,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:24:37,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 03:24:37,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 0: [2022-11-27 03:24:37,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:24:37,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 03:24:37,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 21: [2022-11-27 03:24:37,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:24:37,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 03:24:37,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
16: [2022-11-27 03:24:37,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:24:37,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:24:37,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 03:24:37,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 1: [2022-11-27 03:24:37,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 03:24:37,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 3: [2022-11-27 03:24:37,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:24:37,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 11: [2022-11-27 03:24:37,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:24:37,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 11: [2022-11-27 03:24:37,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 03:24:37,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
31: [2022-11-27 03:24:37,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 03:24:37,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 03:24:37,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 7: [2022-11-27 03:24:37,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:24:37,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 03:24:37,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 26: [2022-11-27 03:24:37,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:24:37,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 03:24:37,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:24:37,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 26: [2022-11-27 03:24:37,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 03:24:37,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
15: [2022-11-27 03:24:37,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:24:37,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 14: [2022-11-27 03:24:37,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:24:37,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 14: [2022-11-27 03:24:37,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 03:24:37,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 17: [2022-11-27 03:24:37,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-27 03:24:37,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 10: [2022-11-27 03:24:37,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 17: [2022-11-27 03:24:37,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 10: [2022-11-27 03:24:37,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 03:24:37,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
23: [2022-11-27 03:24:37,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:24:37,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:24:37,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 23: [2022-11-27 03:24:37,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 03:24:37,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 12: [2022-11-27 03:24:37,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 24: [2022-11-27 03:24:37,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:24:37,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 31: [2022-11-27 03:24:37,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:24:37,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:24:37,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
24: [2022-11-27 03:24:37,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 31: [2022-11-27 03:24:37,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 29: [2022-11-27 03:24:37,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 30: [2022-11-27 03:24:37,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 31: [2022-11-27 03:24:37,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 29: [2022-11-27 03:24:37,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 30: [2022-11-27 03:24:37,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 5: [2022-11-27 03:24:37,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 19: [2022-11-27 03:24:37,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:24:37,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 03:24:37,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 13: [2022-11-27 03:24:37,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
19: [2022-11-27 03:24:37,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 03:24:37,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 13: [2022-11-27 03:24:37,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 03:24:37,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 9: [2022-11-27 03:24:37,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:24:37,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:24:37,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 03:24:37,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 8: [2022-11-27 03:24:37,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 03:24:37,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 25: [2022-11-27 03:24:37,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
25: [2022-11-27 03:24:37,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 03:24:37,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 4: [2022-11-27 03:24:37,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:24:37,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:24:37,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 28: [2022-11-27 03:24:37,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:24:37,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 4: [2022-11-27 03:24:37,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 27: [2022-11-27 03:24:37,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 28: [2022-11-27 03:24:37,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 03:24:37,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 21: [2022-11-27 03:24:37,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-27 03:24:37,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 03:24:37,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 22: [2022-11-27 03:24:37,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-27 03:24:37,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 03:24:37,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 6: [2022-11-27 03:24:37,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 20: [2022-11-27 03:24:37,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:24:37,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 03:24:37,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 20: [2022-11-27 03:24:37,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 03:24:37,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 18: [2022-11-27 03:24:37,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
18: [2022-11-27 03:24:37,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 03:24:37,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 2: [2022-11-27 03:24:37,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:24:37,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 03:24:37,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 3: [2022-11-27 03:24:37,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:24:37,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 03:24:37,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 12: [2022-11-27 03:24:37,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:24:37,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 03:24:37,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 7: [2022-11-27 03:24:37,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-27 03:24:37,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 03:24:37,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 1: [2022-11-27 03:24:37,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:24:37,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 03:24:37,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 24: [2022-11-27 03:24:37,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:24:37,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 03:24:37,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 1: [2022-11-27 03:24:37,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:24:37,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
1: [2022-11-27 03:24:37,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 14: [2022-11-27 03:24:37,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 1: [2022-11-27 03:24:37,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 14: [2022-11-27 03:24:37,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 3: [2022-11-27 03:24:37,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:24:37,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step148000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 03:24:37,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step148000 is ready now! 
0: successfully saved checkpoint at iteration 148000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2557.63 31: iteration 148010/ 173500 | consumed samples: 37890560 | consumed tokens: 77599866880 | elapsed time per iteration (s): 1.15 | learning rate: 2.961E-05 | global batch size: 256 | lm loss: 1.912872E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 223.466 | TFLOPs: 13.52 | 31: iteration 148020/ 173500 | consumed samples: 37893120 | consumed tokens: 77605109760 | elapsed time per iteration (s): 0.81 | learning rate: 2.960E-05 | global batch size: 256 | lm loss: 1.896950E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.498 | TFLOPs: 19.15 | 31: iteration 148030/ 173500 | consumed samples: 37895680 | consumed tokens: 77610352640 | elapsed time per iteration (s): 0.78 | learning rate: 2.959E-05 | global batch size: 256 | lm loss: 1.918489E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.381 | TFLOPs: 19.75 | 31: iteration 148040/ 173500 | consumed samples: 37898240 | consumed tokens: 77615595520 | elapsed time per iteration (s): 0.77 | learning rate: 2.958E-05 | global batch size: 256 | lm loss: 1.938132E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.530 | TFLOPs: 20.12 | 31: iteration 148050/ 173500 | consumed samples: 37900800 | consumed tokens: 77620838400 | elapsed time per iteration (s): 0.83 | learning rate: 2.958E-05 | global batch size: 256 | lm loss: 1.928826E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.146 | TFLOPs: 18.76 | 31: iteration 148060/ 173500 | consumed samples: 37903360 | consumed tokens: 77626081280 | elapsed time per iteration (s): 
0.74 | learning rate: 2.957E-05 | global batch size: 256 | lm loss: 1.933183E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.481 | TFLOPs: 20.84 | 31: iteration 148070/ 173500 | consumed samples: 37905920 | consumed tokens: 77631324160 | elapsed time per iteration (s): 0.81 | learning rate: 2.956E-05 | global batch size: 256 | lm loss: 1.911591E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.447 | TFLOPs: 19.20 | 31: iteration 148080/ 173500 | consumed samples: 37908480 | consumed tokens: 77636567040 | elapsed time per iteration (s): 0.80 | learning rate: 2.955E-05 | global batch size: 256 | lm loss: 1.914023E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.121 | TFLOPs: 19.37 | 31: iteration 148090/ 173500 | consumed samples: 37911040 | consumed tokens: 77641809920 | elapsed time per iteration (s): 0.78 | learning rate: 2.955E-05 | global batch size: 256 | lm loss: 1.914723E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.307 | TFLOPs: 19.92 | 31: iteration 148100/ 173500 | consumed samples: 37913600 | consumed tokens: 77647052800 | elapsed time per iteration (s): 0.89 | learning rate: 2.954E-05 | global batch size: 256 | lm loss: 1.911010E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.281 | TFLOPs: 17.32 | 31: iteration 148110/ 173500 | consumed samples: 37916160 | consumed tokens: 77652295680 | elapsed time per iteration (s): 0.80 | learning rate: 2.953E-05 | global batch size: 256 | lm loss: 1.901651E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.006 | TFLOPs: 19.30 | 31: 
iteration 148120/ 173500 | consumed samples: 37918720 | consumed tokens: 77657538560 | elapsed time per iteration (s): 0.82 | learning rate: 2.952E-05 | global batch size: 256 | lm loss: 1.936650E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.228 | TFLOPs: 18.89 | 31: iteration 148130/ 173500 | consumed samples: 37921280 | consumed tokens: 77662781440 | elapsed time per iteration (s): 0.76 | learning rate: 2.952E-05 | global batch size: 256 | lm loss: 1.918923E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.059 | TFLOPs: 20.33 | 31: iteration 148140/ 173500 | consumed samples: 37923840 | consumed tokens: 77668024320 | elapsed time per iteration (s): 0.78 | learning rate: 2.951E-05 | global batch size: 256 | lm loss: 1.926461E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.803 | TFLOPs: 19.77 | 31: iteration 148150/ 173500 | consumed samples: 37926400 | consumed tokens: 77673267200 | elapsed time per iteration (s): 0.74 | learning rate: 2.950E-05 | global batch size: 256 | lm loss: 1.969708E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.290 | TFLOPs: 20.95 | 31: iteration 148160/ 173500 | consumed samples: 37928960 | consumed tokens: 77678510080 | elapsed time per iteration (s): 0.74 | learning rate: 2.949E-05 | global batch size: 256 | lm loss: 1.868129E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.871 | TFLOPs: 20.86 | 31: iteration 148170/ 173500 | consumed samples: 37931520 | consumed tokens: 77683752960 | elapsed time per iteration (s): 0.77 | learning rate: 2.949E-05 | global batch size: 256 | lm loss: 1.905564E+00 | grad norm: 0.177 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.763 | TFLOPs: 20.19 | 31: iteration 148180/ 173500 | consumed samples: 37934080 | consumed tokens: 77688995840 | elapsed time per iteration (s): 0.76 | learning rate: 2.948E-05 | global batch size: 256 | lm loss: 1.907523E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.068 | TFLOPs: 20.39 | 31: iteration 148190/ 173500 | consumed samples: 37936640 | consumed tokens: 77694238720 | elapsed time per iteration (s): 0.81 | learning rate: 2.947E-05 | global batch size: 256 | lm loss: 1.930415E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.467 | TFLOPs: 19.08 | 31: iteration 148200/ 173500 | consumed samples: 37939200 | consumed tokens: 77699481600 | elapsed time per iteration (s): 0.78 | learning rate: 2.947E-05 | global batch size: 256 | lm loss: 1.944659E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.729 | TFLOPs: 19.95 | 31: iteration 148210/ 173500 | consumed samples: 37941760 | consumed tokens: 77704724480 | elapsed time per iteration (s): 0.80 | learning rate: 2.946E-05 | global batch size: 256 | lm loss: 1.942593E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.178 | TFLOPs: 19.37 | 31: iteration 148220/ 173500 | consumed samples: 37944320 | consumed tokens: 77709967360 | elapsed time per iteration (s): 0.78 | learning rate: 2.945E-05 | global batch size: 256 | lm loss: 1.898010E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.496 | TFLOPs: 19.87 | 31: iteration 148230/ 173500 | consumed samples: 37946880 | consumed tokens: 77715210240 | elapsed time per iteration (s): 0.83 | 
learning rate: 2.944E-05 | global batch size: 256 | lm loss: 1.951954E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.956 | TFLOPs: 18.57 | 31: iteration 148240/ 173500 | consumed samples: 37949440 | consumed tokens: 77720453120 | elapsed time per iteration (s): 0.92 | learning rate: 2.944E-05 | global batch size: 256 | lm loss: 1.928703E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 278.774 | TFLOPs: 16.87 | 31: iteration 148250/ 173500 | consumed samples: 37952000 | consumed tokens: 77725696000 | elapsed time per iteration (s): 0.95 | learning rate: 2.943E-05 | global batch size: 256 | lm loss: 1.963871E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 270.612 | TFLOPs: 16.37 | 31: iteration 148260/ 173500 | consumed samples: 37954560 | consumed tokens: 77730938880 | elapsed time per iteration (s): 0.96 | learning rate: 2.942E-05 | global batch size: 256 | lm loss: 1.921120E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 267.504 | TFLOPs: 16.18 | 31: iteration 148270/ 173500 | consumed samples: 37957120 | consumed tokens: 77736181760 | elapsed time per iteration (s): 0.94 | learning rate: 2.941E-05 | global batch size: 256 | lm loss: 1.935660E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 272.218 | TFLOPs: 16.47 | 31: iteration 148280/ 173500 | consumed samples: 37959680 | consumed tokens: 77741424640 | elapsed time per iteration (s): 0.84 | learning rate: 2.941E-05 | global batch size: 256 | lm loss: 1.903433E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.634 | TFLOPs: 18.37 | 31: iteration 
148290/ 173500 | consumed samples: 37962240 | consumed tokens: 77746667520 | elapsed time per iteration (s): 1.16 | learning rate: 2.940E-05 | global batch size: 256 | lm loss: 1.907797E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 220.004 | TFLOPs: 13.31 | 31: iteration 148300/ 173500 | consumed samples: 37964800 | consumed tokens: 77751910400 | elapsed time per iteration (s): 0.81 | learning rate: 2.939E-05 | global batch size: 256 | lm loss: 1.950356E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.070 | TFLOPs: 19.12 | 31: iteration 148310/ 173500 | consumed samples: 37967360 | consumed tokens: 77757153280 | elapsed time per iteration (s): 0.78 | learning rate: 2.938E-05 | global batch size: 256 | lm loss: 1.901074E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.104 | TFLOPs: 19.79 | 31: iteration 148320/ 173500 | consumed samples: 37969920 | consumed tokens: 77762396160 | elapsed time per iteration (s): 0.85 | learning rate: 2.938E-05 | global batch size: 256 | lm loss: 1.908943E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.452 | TFLOPs: 18.12 | 31: iteration 148330/ 173500 | consumed samples: 37972480 | consumed tokens: 77767639040 | elapsed time per iteration (s): 0.76 | learning rate: 2.937E-05 | global batch size: 256 | lm loss: 1.925185E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.814 | TFLOPs: 20.32 | 31: iteration 148340/ 173500 | consumed samples: 37975040 | consumed tokens: 77772881920 | elapsed time per iteration (s): 0.84 | learning rate: 2.936E-05 | global batch size: 256 | lm loss: 1.901622E+00 | grad norm: 0.181 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.198 | TFLOPs: 18.40 | 31: iteration 148350/ 173500 | consumed samples: 37977600 | consumed tokens: 77778124800 | elapsed time per iteration (s): 0.79 | learning rate: 2.936E-05 | global batch size: 256 | lm loss: 1.933352E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.622 | TFLOPs: 19.70 | 31: iteration 148360/ 173500 | consumed samples: 37980160 | consumed tokens: 77783367680 | elapsed time per iteration (s): 0.77 | learning rate: 2.935E-05 | global batch size: 256 | lm loss: 1.943236E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.424 | TFLOPs: 20.11 | 31: iteration 148370/ 173500 | consumed samples: 37982720 | consumed tokens: 77788610560 | elapsed time per iteration (s): 0.84 | learning rate: 2.934E-05 | global batch size: 256 | lm loss: 1.921988E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.303 | TFLOPs: 18.41 | 31: iteration 148380/ 173500 | consumed samples: 37985280 | consumed tokens: 77793853440 | elapsed time per iteration (s): 0.93 | learning rate: 2.933E-05 | global batch size: 256 | lm loss: 1.921932E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 274.165 | TFLOPs: 16.59 | 31: iteration 148390/ 173500 | consumed samples: 37987840 | consumed tokens: 77799096320 | elapsed time per iteration (s): 0.87 | learning rate: 2.933E-05 | global batch size: 256 | lm loss: 1.926810E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.752 | TFLOPs: 17.71 | 31: iteration 148400/ 173500 | consumed samples: 37990400 | consumed tokens: 77804339200 | elapsed time per iteration (s): 0.85 | learning 
rate: 2.932E-05 | global batch size: 256 | lm loss: 1.902988E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.075 | TFLOPs: 18.21 | 31: iteration 148410/ 173500 | consumed samples: 37992960 | consumed tokens: 77809582080 | elapsed time per iteration (s): 0.81 | learning rate: 2.931E-05 | global batch size: 256 | lm loss: 1.912475E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.388 | TFLOPs: 19.20 | 31: iteration 148420/ 173500 | consumed samples: 37995520 | consumed tokens: 77814824960 | elapsed time per iteration (s): 0.83 | learning rate: 2.930E-05 | global batch size: 256 | lm loss: 1.943155E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.477 | TFLOPs: 18.72 | 31: iteration 148430/ 173500 | consumed samples: 37998080 | consumed tokens: 77820067840 | elapsed time per iteration (s): 0.84 | learning rate: 2.930E-05 | global batch size: 256 | lm loss: 1.935941E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.856 | TFLOPs: 18.44 | 31: iteration 148440/ 173500 | consumed samples: 38000640 | consumed tokens: 77825310720 | elapsed time per iteration (s): 0.77 | learning rate: 2.929E-05 | global batch size: 256 | lm loss: 1.937900E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.845 | TFLOPs: 20.20 | 31: iteration 148450/ 173500 | consumed samples: 38003200 | consumed tokens: 77830553600 | elapsed time per iteration (s): 0.76 | learning rate: 2.928E-05 | global batch size: 256 | lm loss: 1.888650E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.284 | TFLOPs: 20.40 | 31: iteration 148460/ 
173500 | consumed samples: 38005760 | consumed tokens: 77835796480 | elapsed time per iteration (s): 0.76 | learning rate: 2.928E-05 | global batch size: 256 | lm loss: 1.948539E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.004 | TFLOPs: 20.45 | 31: iteration 148470/ 173500 | consumed samples: 38008320 | consumed tokens: 77841039360 | elapsed time per iteration (s): 0.74 | learning rate: 2.927E-05 | global batch size: 256 | lm loss: 1.944356E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.301 | TFLOPs: 20.83 | 31: iteration 148480/ 173500 | consumed samples: 38010880 | consumed tokens: 77846282240 | elapsed time per iteration (s): 0.77 | learning rate: 2.926E-05 | global batch size: 256 | lm loss: 1.919764E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.921 | TFLOPs: 20.02 | 31: iteration 148490/ 173500 | consumed samples: 38013440 | consumed tokens: 77851525120 | elapsed time per iteration (s): 0.78 | learning rate: 2.925E-05 | global batch size: 256 | lm loss: 1.927698E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.937 | TFLOPs: 19.96 | 31: iteration 148500/ 173500 | consumed samples: 38016000 | consumed tokens: 77856768000 | elapsed time per iteration (s): 0.75 | learning rate: 2.925E-05 | global batch size: 256 | lm loss: 1.913727E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.696 | TFLOPs: 20.73 | 31: iteration 148510/ 173500 | consumed samples: 38018560 | consumed tokens: 77862010880 | elapsed time per iteration (s): 0.75 | learning rate: 2.924E-05 | global batch size: 256 | lm loss: 1.915832E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 339.384 | TFLOPs: 20.53 | 31: iteration 148520/ 173500 | consumed samples: 38021120 | consumed tokens: 77867253760 | elapsed time per iteration (s): 0.74 | learning rate: 2.923E-05 | global batch size: 256 | lm loss: 1.907719E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.514 | TFLOPs: 20.96 | 31: iteration 148530/ 173500 | consumed samples: 38023680 | consumed tokens: 77872496640 | elapsed time per iteration (s): 0.86 | learning rate: 2.922E-05 | global batch size: 256 | lm loss: 1.932369E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.761 | TFLOPs: 18.01 | 31: iteration 148540/ 173500 | consumed samples: 38026240 | consumed tokens: 77877739520 | elapsed time per iteration (s): 0.79 | learning rate: 2.922E-05 | global batch size: 256 | lm loss: 1.907871E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.330 | TFLOPs: 19.68 | 31: iteration 148550/ 173500 | consumed samples: 38028800 | consumed tokens: 77882982400 | elapsed time per iteration (s): 0.76 | learning rate: 2.921E-05 | global batch size: 256 | lm loss: 1.916600E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.154 | TFLOPs: 20.40 | 31: iteration 148560/ 173500 | consumed samples: 38031360 | consumed tokens: 77888225280 | elapsed time per iteration (s): 0.74 | learning rate: 2.920E-05 | global batch size: 256 | lm loss: 1.892291E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.462 | TFLOPs: 21.02 | 31: iteration 148570/ 173500 | consumed samples: 38033920 | consumed tokens: 77893468160 | elapsed time per iteration (s): 0.80 | learning rate: 
2.920E-05 | global batch size: 256 | lm loss: 1.909864E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.742 | TFLOPs: 19.28 | 31: iteration 148580/ 173500 | consumed samples: 38036480 | consumed tokens: 77898711040 | elapsed time per iteration (s): 0.80 | learning rate: 2.919E-05 | global batch size: 256 | lm loss: 1.892527E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.035 | TFLOPs: 19.30 | 31: iteration 148590/ 173500 | consumed samples: 38039040 | consumed tokens: 77903953920 | elapsed time per iteration (s): 0.92 | learning rate: 2.918E-05 | global batch size: 256 | lm loss: 1.885979E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 279.333 | TFLOPs: 16.90 | 31: iteration 148600/ 173500 | consumed samples: 38041600 | consumed tokens: 77909196800 | elapsed time per iteration (s): 0.85 | learning rate: 2.917E-05 | global batch size: 256 | lm loss: 1.902669E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.887 | TFLOPs: 18.26 | 31: iteration 148610/ 173500 | consumed samples: 38044160 | consumed tokens: 77914439680 | elapsed time per iteration (s): 0.70 | learning rate: 2.917E-05 | global batch size: 256 | lm loss: 1.927198E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 363.878 | TFLOPs: 22.01 | 31: iteration 148620/ 173500 | consumed samples: 38046720 | consumed tokens: 77919682560 | elapsed time per iteration (s): 0.79 | learning rate: 2.916E-05 | global batch size: 256 | lm loss: 1.923664E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.810 | TFLOPs: 19.53 | 31: iteration 148630/ 173500 | 
consumed samples: 38049280 | consumed tokens: 77924925440 | elapsed time per iteration (s): 0.76 | learning rate: 2.915E-05 | global batch size: 256 | lm loss: 1.920231E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.389 | TFLOPs: 20.29 | 31: iteration 148640/ 173500 | consumed samples: 38051840 | consumed tokens: 77930168320 | elapsed time per iteration (s): 0.77 | learning rate: 2.914E-05 | global batch size: 256 | lm loss: 1.914019E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.687 | TFLOPs: 20.19 | 31: iteration 148650/ 173500 | consumed samples: 38054400 | consumed tokens: 77935411200 | elapsed time per iteration (s): 0.71 | learning rate: 2.914E-05 | global batch size: 256 | lm loss: 1.935244E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.444 | TFLOPs: 21.68 | 31: iteration 148660/ 173500 | consumed samples: 38056960 | consumed tokens: 77940654080 | elapsed time per iteration (s): 0.75 | learning rate: 2.913E-05 | global batch size: 256 | lm loss: 1.917757E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.565 | TFLOPs: 20.72 | 31: iteration 148670/ 173500 | consumed samples: 38059520 | consumed tokens: 77945896960 | elapsed time per iteration (s): 0.72 | learning rate: 2.912E-05 | global batch size: 256 | lm loss: 1.904496E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.436 | TFLOPs: 21.56 | 31: iteration 148680/ 173500 | consumed samples: 38062080 | consumed tokens: 77951139840 | elapsed time per iteration (s): 0.76 | learning rate: 2.912E-05 | global batch size: 256 | lm loss: 1.897934E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 337.988 | TFLOPs: 20.45 | 31: iteration 148690/ 173500 | consumed samples: 38064640 | consumed tokens: 77956382720 | elapsed time per iteration (s): 0.72 | learning rate: 2.911E-05 | global batch size: 256 | lm loss: 1.919275E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.697 | TFLOPs: 21.64 | 31: iteration 148700/ 173500 | consumed samples: 38067200 | consumed tokens: 77961625600 | elapsed time per iteration (s): 0.79 | learning rate: 2.910E-05 | global batch size: 256 | lm loss: 1.924088E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.489 | TFLOPs: 19.57 | 31: iteration 148710/ 173500 | consumed samples: 38069760 | consumed tokens: 77966868480 | elapsed time per iteration (s): 0.73 | learning rate: 2.909E-05 | global batch size: 256 | lm loss: 1.928428E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.654 | TFLOPs: 21.21 | 31: iteration 148720/ 173500 | consumed samples: 38072320 | consumed tokens: 77972111360 | elapsed time per iteration (s): 0.78 | learning rate: 2.909E-05 | global batch size: 256 | lm loss: 1.890271E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.297 | TFLOPs: 19.98 | 31: iteration 148730/ 173500 | consumed samples: 38074880 | consumed tokens: 77977354240 | elapsed time per iteration (s): 0.78 | learning rate: 2.908E-05 | global batch size: 256 | lm loss: 1.903044E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.098 | TFLOPs: 19.85 | 31: iteration 148740/ 173500 | consumed samples: 38077440 | consumed tokens: 77982597120 | elapsed time per iteration (s): 0.81 | learning rate: 
2.907E-05 | global batch size: 256 | lm loss: 1.919348E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.593 | TFLOPs: 19.09 | 31: iteration 148750/ 173500 | consumed samples: 38080000 | consumed tokens: 77987840000 | elapsed time per iteration (s): 0.75 | learning rate: 2.907E-05 | global batch size: 256 | lm loss: 1.909364E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.153 | TFLOPs: 20.64 | 31: iteration 148760/ 173500 | consumed samples: 38082560 | consumed tokens: 77993082880 | elapsed time per iteration (s): 0.78 | learning rate: 2.906E-05 | global batch size: 256 | lm loss: 1.946859E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.072 | TFLOPs: 19.97 | 31: iteration 148770/ 173500 | consumed samples: 38085120 | consumed tokens: 77998325760 | elapsed time per iteration (s): 0.78 | learning rate: 2.905E-05 | global batch size: 256 | lm loss: 1.918555E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.722 | TFLOPs: 19.95 | 31: iteration 148780/ 173500 | consumed samples: 38087680 | consumed tokens: 78003568640 | elapsed time per iteration (s): 0.80 | learning rate: 2.904E-05 | global batch size: 256 | lm loss: 1.926594E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.929 | TFLOPs: 19.48 | 31: iteration 148790/ 173500 | consumed samples: 38090240 | consumed tokens: 78008811520 | elapsed time per iteration (s): 0.82 | learning rate: 2.904E-05 | global batch size: 256 | lm loss: 1.912004E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.846 | TFLOPs: 18.99 | 31: iteration 148800/ 173500 | 
consumed samples: 38092800 | consumed tokens: 78014054400 | elapsed time per iteration (s): 0.85 | learning rate: 2.903E-05 | global batch size: 256 | lm loss: 1.917852E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.252 | TFLOPs: 18.29 | 31: iteration 148810/ 173500 | consumed samples: 38095360 | consumed tokens: 78019297280 | elapsed time per iteration (s): 0.81 | learning rate: 2.902E-05 | global batch size: 256 | lm loss: 1.931435E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.023 | TFLOPs: 19.18 | 31: iteration 148820/ 173500 | consumed samples: 38097920 | consumed tokens: 78024540160 | elapsed time per iteration (s): 0.82 | learning rate: 2.901E-05 | global batch size: 256 | lm loss: 1.942262E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.338 | TFLOPs: 18.77 | 31: iteration 148830/ 173500 | consumed samples: 38100480 | consumed tokens: 78029783040 | elapsed time per iteration (s): 0.82 | learning rate: 2.901E-05 | global batch size: 256 | lm loss: 1.904046E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.085 | TFLOPs: 18.94 | 31: iteration 148840/ 173500 | consumed samples: 38103040 | consumed tokens: 78035025920 | elapsed time per iteration (s): 0.79 | learning rate: 2.900E-05 | global batch size: 256 | lm loss: 1.908454E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.652 | TFLOPs: 19.64 | 31: iteration 148850/ 173500 | consumed samples: 38105600 | consumed tokens: 78040268800 | elapsed time per iteration (s): 0.79 | learning rate: 2.899E-05 | global batch size: 256 | lm loss: 1.902960E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 324.256 | TFLOPs: 19.62 | 31: iteration 148860/ 173500 | consumed samples: 38108160 | consumed tokens: 78045511680 | elapsed time per iteration (s): 0.97 | learning rate: 2.899E-05 | global batch size: 256 | lm loss: 1.897821E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 265.180 | TFLOPs: 16.04 | 31: iteration 148870/ 173500 | consumed samples: 38110720 | consumed tokens: 78050754560 | elapsed time per iteration (s): 2.78 | learning rate: 2.898E-05 | global batch size: 256 | lm loss: 1.922473E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 91.964 | TFLOPs: 5.56 | 31: iteration 148880/ 173500 | consumed samples: 38113280 | consumed tokens: 78055997440 | elapsed time per iteration (s): 0.81 | learning rate: 2.897E-05 | global batch size: 256 | lm loss: 1.904963E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.468 | TFLOPs: 19.15 | 31: iteration 148890/ 173500 | consumed samples: 38115840 | consumed tokens: 78061240320 | elapsed time per iteration (s): 0.77 | learning rate: 2.896E-05 | global batch size: 256 | lm loss: 1.887536E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.826 | TFLOPs: 20.14 | 31: iteration 148900/ 173500 | consumed samples: 38118400 | consumed tokens: 78066483200 | elapsed time per iteration (s): 0.73 | learning rate: 2.896E-05 | global batch size: 256 | lm loss: 1.916060E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.547 | TFLOPs: 21.21 | 31: iteration 148910/ 173500 | consumed samples: 38120960 | consumed tokens: 78071726080 | elapsed time per iteration (s): 0.77 | learning rate: 
2.895E-05 | global batch size: 256 | lm loss: 1.905125E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.318 | TFLOPs: 20.23 | 31: iteration 148920/ 173500 | consumed samples: 38123520 | consumed tokens: 78076968960 | elapsed time per iteration (s): 0.73 | learning rate: 2.894E-05 | global batch size: 256 | lm loss: 1.907969E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.592 | TFLOPs: 21.15 | 31: iteration 148930/ 173500 | consumed samples: 38126080 | consumed tokens: 78082211840 | elapsed time per iteration (s): 0.78 | learning rate: 2.894E-05 | global batch size: 256 | lm loss: 1.913163E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.428 | TFLOPs: 19.87 | 31: iteration 148940/ 173500 | consumed samples: 38128640 | consumed tokens: 78087454720 | elapsed time per iteration (s): 0.79 | learning rate: 2.893E-05 | global batch size: 256 | lm loss: 1.919278E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.549 | TFLOPs: 19.69 | 31: iteration 148950/ 173500 | consumed samples: 38131200 | consumed tokens: 78092697600 | elapsed time per iteration (s): 0.90 | learning rate: 2.892E-05 | global batch size: 256 | lm loss: 1.943142E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.064 | TFLOPs: 17.25 | 31: iteration 148960/ 173500 | consumed samples: 38133760 | consumed tokens: 78097940480 | elapsed time per iteration (s): 0.76 | learning rate: 2.891E-05 | global batch size: 256 | lm loss: 1.930000E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.710 | TFLOPs: 20.25 | 31: iteration 148970/ 173500 | 
consumed samples: 38136320 | consumed tokens: 78103183360 | elapsed time per iteration (s): 0.89 | learning rate: 2.891E-05 | global batch size: 256 | lm loss: 1.920848E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.266 | TFLOPs: 17.38 |
31: iteration 148980/ 173500 | consumed samples: 38138880 | consumed tokens: 78108426240 | elapsed time per iteration (s): 0.95 | learning rate: 2.890E-05 | global batch size: 256 | lm loss: 1.926630E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 268.676 | TFLOPs: 16.25 |
31: iteration 148990/ 173500 | consumed samples: 38141440 | consumed tokens: 78113669120 | elapsed time per iteration (s): 0.76 | learning rate: 2.889E-05 | global batch size: 256 | lm loss: 1.927331E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.258 | TFLOPs: 20.28 |
31: iteration 149000/ 173500 | consumed samples: 38144000 | consumed tokens: 78118912000 | elapsed time per iteration (s): 0.75 | learning rate: 2.889E-05 | global batch size: 256 | lm loss: 1.922496E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.881 | TFLOPs: 20.62 |
31: --------------------------------------------------------------------------------------------
31: valid loss at iteration 149000 | lm loss value: 1.923770E+00 | lm loss PPL: 6.846723E+00 |
31: --------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 149000 to checkpoints_1b1long
0: [2022-11-27 03:38:22,009] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step149000 is begin to save!
0: [2022-11-27 03:38:22,020] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_01-model_00-model_states.pt... 0: [2022-11-27 03:38:22,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_01-model_00-model_states.pt. 0: [2022-11-27 03:38:22,283] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_03-model_00-model_states.pt... 0: [2022-11-27 03:38:22,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_03-model_00-model_states.pt. 0: [2022-11-27 03:38:22,360] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_04-model_00-model_states.pt... 0: [2022-11-27 03:38:22,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_04-model_00-model_states.pt. 0: [2022-11-27 03:38:22,438] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_05-model_00-model_states.pt... 0: [2022-11-27 03:38:22,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_05-model_00-model_states.pt. 0: [2022-11-27 03:38:22,515] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_06-model_00-model_states.pt... 0: [2022-11-27 03:38:22,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_06-model_00-model_states.pt. 0: [2022-11-27 03:38:22,592] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_07-model_00-model_states.pt... 0: [2022-11-27 03:38:22,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_07-model_00-model_states.pt. 
0: [2022-11-27 03:38:22,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_08-model_00-model_states.pt... 0: [2022-11-27 03:38:22,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_08-model_00-model_states.pt. 0: [2022-11-27 03:38:22,746] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_09-model_00-model_states.pt... 0: [2022-11-27 03:38:22,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_09-model_00-model_states.pt. 0: [2022-11-27 03:38:22,821] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_10-model_00-model_states.pt... 0: [2022-11-27 03:38:22,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_10-model_00-model_states.pt. 0: [2022-11-27 03:38:22,898] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_11-model_00-model_states.pt... 0: [2022-11-27 03:38:22,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_11-model_00-model_states.pt. 0: [2022-11-27 03:38:22,975] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_12-model_00-model_states.pt... 0: [2022-11-27 03:38:23,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_12-model_00-model_states.pt. 0: [2022-11-27 03:38:23,049] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_13-model_00-model_states.pt... 0: [2022-11-27 03:38:23,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_13-model_00-model_states.pt. 
0: [2022-11-27 03:38:23,128] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_14-model_00-model_states.pt... 0: [2022-11-27 03:38:23,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_14-model_00-model_states.pt. 0: [2022-11-27 03:38:23,212] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_15-model_00-model_states.pt... 0: [2022-11-27 03:38:23,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_15-model_00-model_states.pt. 0: [2022-11-27 03:38:23,291] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_16-model_00-model_states.pt... 0: [2022-11-27 03:38:23,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_16-model_00-model_states.pt. 0: [2022-11-27 03:38:23,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_17-model_00-model_states.pt... 0: [2022-11-27 03:38:23,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_17-model_00-model_states.pt. 0: [2022-11-27 03:38:23,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_18-model_00-model_states.pt... 0: [2022-11-27 03:38:23,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_18-model_00-model_states.pt. 0: [2022-11-27 03:38:23,521] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_19-model_00-model_states.pt... 0: [2022-11-27 03:38:23,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_19-model_00-model_states.pt. 
0: [2022-11-27 03:38:23,597] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_20-model_00-model_states.pt... 0: [2022-11-27 03:38:23,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_20-model_00-model_states.pt. 0: [2022-11-27 03:38:23,674] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_21-model_00-model_states.pt... 0: [2022-11-27 03:38:23,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_21-model_00-model_states.pt. 0: [2022-11-27 03:38:23,750] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_22-model_00-model_states.pt... 0: [2022-11-27 03:38:23,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_22-model_00-model_states.pt. 0: [2022-11-27 03:38:23,827] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_23-model_00-model_states.pt... 0: [2022-11-27 03:38:23,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_23-model_00-model_states.pt. 0: [2022-11-27 03:38:23,904] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_24-model_00-model_states.pt... 0: [2022-11-27 03:38:23,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_24-model_00-model_states.pt. 0: [2022-11-27 03:38:23,980] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_25-model_00-model_states.pt... 0: [2022-11-27 03:38:24,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_25-model_00-model_states.pt. 
0: [2022-11-27 03:38:24,055] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_26-model_00-model_states.pt... 0: [2022-11-27 03:38:24,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_26-model_00-model_states.pt. 0: [2022-11-27 03:38:24,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_27-model_00-model_states.pt... 0: [2022-11-27 03:38:24,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_27-model_00-model_states.pt. 0: [2022-11-27 03:38:24,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_28-model_00-model_states.pt... 0: [2022-11-27 03:38:24,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_28-model_00-model_states.pt. 0: [2022-11-27 03:38:24,283] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/layer_30-model_00-model_states.pt... 0: [2022-11-27 03:38:24,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/layer_30-model_00-model_states.pt. 0: [2022-11-27 03:38:24,287] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step149000/mp_rank_00_model_states.pt 0: [2022-11-27 03:38:24,287] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/mp_rank_00_model_states.pt... 0: [2022-11-27 03:38:24,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/mp_rank_00_model_states.pt. 0: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
0: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 
7: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 
10: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 
13: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 
20: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 
24: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 
17: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 
19: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
6: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
9: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
2: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 
11: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 
22: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 
26: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 
8: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 
2: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 
25: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
14: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 
30: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
4: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 
25: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 
17: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
15: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 26: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
24: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:38:24,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:38:24,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:38:24,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:38:24,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:38:24,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 03:38:24,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 10: [2022-11-27 03:38:24,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 03:38:24,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 5: [2022-11-27 03:38:24,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:38:24,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 
5: [2022-11-27 03:38:24,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 03:38:24,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 24: [2022-11-27 03:38:24,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 03:38:24,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 7: [2022-11-27 03:38:24,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:38:24,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 03:38:24,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 9: [2022-11-27 03:38:24,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:38:24,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:38:24,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
9: [2022-11-27 03:38:24,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 11: [2022-11-27 03:38:24,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 9: [2022-11-27 03:38:24,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 11: [2022-11-27 03:38:24,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 0: [2022-11-27 03:38:24,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 03:38:24,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 3: [2022-11-27 03:38:24,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:38:24,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 03:38:24,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 27: [2022-11-27 03:38:24,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:38:24,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 03:38:24,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 
1: [2022-11-27 03:38:24,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:38:24,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 03:38:24,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 12: [2022-11-27 03:38:24,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:38:24,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 03:38:24,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 21: [2022-11-27 03:38:24,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:38:24,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 03:38:24,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 15: [2022-11-27 03:38:24,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:38:24,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 03:38:24,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 
19: [2022-11-27 03:38:24,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-27 03:38:24,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-27 03:38:24,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 29: [2022-11-27 03:38:24,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:38:24,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 03:38:24,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 13: [2022-11-27 03:38:24,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:38:24,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:38:24,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 03:38:24,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 13: [2022-11-27 03:38:24,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 03:38:24,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 
30: [2022-11-27 03:38:24,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:38:24,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 03:38:24,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 8: [2022-11-27 03:38:24,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:38:24,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 03:38:24,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 16: [2022-11-27 03:38:24,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 31: [2022-11-27 03:38:24,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 22: [2022-11-27 03:38:24,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:38:24,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
16: [2022-11-27 03:38:24,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 11: [2022-11-27 03:38:24,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 31: [2022-11-27 03:38:24,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 22: [2022-11-27 03:38:24,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 26: [2022-11-27 03:38:24,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 16: [2022-11-27 03:38:24,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 11: [2022-11-27 03:38:24,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 31: [2022-11-27 03:38:24,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 22: [2022-11-27 03:38:24,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 26: [2022-11-27 03:38:24,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 11: [2022-11-27 03:38:24,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 18: [2022-11-27 03:38:24,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
6: [2022-11-27 03:38:24,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 18: [2022-11-27 03:38:24,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 29: [2022-11-27 03:38:24,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 18: [2022-11-27 03:38:24,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 6: [2022-11-27 03:38:24,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 29: [2022-11-27 03:38:24,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 6: [2022-11-27 03:38:24,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 29: [2022-11-27 03:38:24,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 21: [2022-11-27 03:38:24,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:38:24,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 03:38:24,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 6: [2022-11-27 03:38:24,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-27 03:38:24,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 03:38:24,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 20: [2022-11-27 03:38:24,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-27 03:38:24,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:38:24,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 17: [2022-11-27 03:38:24,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 20: [2022-11-27 03:38:24,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 03:38:24,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 17: [2022-11-27 03:38:24,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 20: [2022-11-27 03:38:24,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 
25: [2022-11-27 03:38:24,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 17: [2022-11-27 03:38:24,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 20: [2022-11-27 03:38:24,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 25: [2022-11-27 03:38:24,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 17: [2022-11-27 03:38:24,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-27 03:38:24,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 03:38:24,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 0: [2022-11-27 03:38:24,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:38:24,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:38:24,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
3: [2022-11-27 03:38:24,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 0: [2022-11-27 03:38:24,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 8: [2022-11-27 03:38:24,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 3: [2022-11-27 03:38:24,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 0: [2022-11-27 03:38:24,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 8: [2022-11-27 03:38:24,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 1: [2022-11-27 03:38:24,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:38:24,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 9: [2022-11-27 03:38:24,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:38:24,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 9: [2022-11-27 03:38:24,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 03:38:24,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 
28: [2022-11-27 03:38:24,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:38:24,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:38:24,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 03:38:24,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 24: [2022-11-27 03:38:24,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 03:38:24,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 18: [2022-11-27 03:38:24,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-27 03:38:24,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 03:38:24,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 15: [2022-11-27 03:38:24,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:38:24,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 03:38:24,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 
14: [2022-11-27 03:38:24,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:38:24,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 25: [2022-11-27 03:38:24,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:38:24,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 25: [2022-11-27 03:38:24,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 03:38:24,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 31: [2022-11-27 03:38:24,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-27 03:38:24,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 03:38:24,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 18: [2022-11-27 03:38:24,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-27 03:38:24,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 03:38:24,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 
1: [2022-11-27 03:38:24,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:38:24,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 03:38:24,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 7: [2022-11-27 03:38:24,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:38:24,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 03:38:24,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 9: [2022-11-27 03:38:24,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:38:24,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 5: [2022-11-27 03:38:24,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:38:24,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 5: [2022-11-27 03:38:24,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 03:38:24,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 
16: [2022-11-27 03:38:24,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:38:24,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 19: [2022-11-27 03:38:24,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:38:24,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 26: [2022-11-27 03:38:24,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 03:38:24,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 19: [2022-11-27 03:38:24,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 16: [2022-11-27 03:38:24,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 19: [2022-11-27 03:38:24,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 13: [2022-11-27 03:38:24,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:38:24,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 03:38:24,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 
22: [2022-11-27 03:38:24,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 03:38:24,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 03:38:24,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 22: [2022-11-27 03:38:24,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:38:24,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:38:24,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 22: [2022-11-27 03:38:24,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 6: [2022-11-27 03:38:24,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:38:24,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:38:24,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 
6: [2022-11-27 03:38:24,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 12: [2022-11-27 03:38:24,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 22: [2022-11-27 03:38:24,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 6: [2022-11-27 03:38:24,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 12: [2022-11-27 03:38:24,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 10: [2022-11-27 03:38:24,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:38:24,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:38:24,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 03:38:24,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 12: [2022-11-27 03:38:24,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 03:38:24,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 21: [2022-11-27 03:38:24,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
10: [2022-11-27 03:38:24,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:38:24,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 03:38:24,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 10: [2022-11-27 03:38:24,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 03:38:24,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 30: [2022-11-27 03:38:24,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:38:24,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 03:38:24,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 13: [2022-11-27 03:38:24,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:38:24,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 03:38:24,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 19: [2022-11-27 03:38:24,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 
2: [2022-11-27 03:38:24,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 19: [2022-11-27 03:38:24,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 2: [2022-11-27 03:38:24,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 19: [2022-11-27 03:38:24,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 2: [2022-11-27 03:38:24,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:38:24,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 2: [2022-11-27 03:38:24,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 14: [2022-11-27 03:38:24,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:38:24,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 14: [2022-11-27 03:38:24,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 15: [2022-11-27 03:38:24,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:38:24,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 
15: [2022-11-27 03:38:24,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 03:38:24,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 23: [2022-11-27 03:38:24,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-27 03:38:24,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-27 03:38:24,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-27 03:38:24,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 03:38:24,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 03:38:24,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 03:38:24,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 23: [2022-11-27 03:38:24,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 23: [2022-11-27 03:38:24,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 17: [2022-11-27 03:38:24,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-27 03:38:24,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 03:38:24,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 29: [2022-11-27 03:38:24,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:38:24,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 03:38:24,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 24: [2022-11-27 03:38:24,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:38:24,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 03:38:24,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 3: [2022-11-27 03:38:24,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:38:24,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
3: [2022-11-27 03:38:24,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 8: [2022-11-27 03:38:24,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 3: [2022-11-27 03:38:24,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 8: [2022-11-27 03:38:24,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 11: [2022-11-27 03:38:24,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:38:24,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 03:38:24,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 27: [2022-11-27 03:38:24,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:38:24,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:38:24,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 18: [2022-11-27 03:38:24,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
28: [2022-11-27 03:38:24,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 03:38:24,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 18: [2022-11-27 03:38:24,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 27: [2022-11-27 03:38:24,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 18: [2022-11-27 03:38:24,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 4: [2022-11-27 03:38:24,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:38:24,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:38:24,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:38:24,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 03:38:24,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 03:38:24,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 03:38:24,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 
4: [2022-11-27 03:38:24,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 4: [2022-11-27 03:38:24,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 5: [2022-11-27 03:38:24,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:38:24,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:38:24,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 2: [2022-11-27 03:38:24,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 5: [2022-11-27 03:38:24,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 2: [2022-11-27 03:38:24,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 27: [2022-11-27 03:38:24,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:38:24,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 03:38:24,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 1: [2022-11-27 03:38:24,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-27 03:38:24,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 03:38:24,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 0: [2022-11-27 03:38:24,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:38:24,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 03:38:24,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 4: [2022-11-27 03:38:24,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:38:24,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 03:38:24,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 31: [2022-11-27 03:38:24,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 03:38:24,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 03:38:24,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 10: [2022-11-27 03:38:24,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-27 03:38:24,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 03:38:24,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 16: [2022-11-27 03:38:24,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:38:24,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:38:24,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 16: [2022-11-27 03:38:24,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 03:38:24,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 30: [2022-11-27 03:38:24,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 7: [2022-11-27 03:38:24,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:38:24,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 03:38:24,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 20: [2022-11-27 03:38:24,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-27 03:38:24,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 03:38:24,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 03:38:24,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 20: [2022-11-27 03:38:24,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 03:38:24,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 26: [2022-11-27 03:38:24,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:38:24,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 26: [2022-11-27 03:38:24,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 23: [2022-11-27 03:38:24,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:38:24,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 26: [2022-11-27 03:38:24,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 
23: [2022-11-27 03:38:24,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 03:38:24,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 5: [2022-11-27 03:38:24,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:38:24,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 03:38:24,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 9: [2022-11-27 03:38:24,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:38:24,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 03:38:24,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 24: [2022-11-27 03:38:24,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:38:24,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 14: [2022-11-27 03:38:24,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:38:24,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 
27: [2022-11-27 03:38:24,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:38:24,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 03:38:24,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 2: [2022-11-27 03:38:24,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:38:24,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 2: [2022-11-27 03:38:24,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 27: [2022-11-27 03:38:24,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 2: [2022-11-27 03:38:24,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 17: [2022-11-27 03:38:24,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-27 03:38:24,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 03:38:24,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 16: [2022-11-27 03:38:24,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-27 03:38:24,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 03:38:24,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 3: [2022-11-27 03:38:24,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:38:24,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 03:38:24,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 30: [2022-11-27 03:38:24,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:38:24,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 03:38:24,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 7: [2022-11-27 03:38:24,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:38:24,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 03:38:24,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 6: [2022-11-27 03:38:24,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-27 03:38:24,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 03:38:24,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 19: [2022-11-27 03:38:24,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-27 03:38:24,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 03:38:24,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 25: [2022-11-27 03:38:24,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:38:24,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 03:38:24,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 12: [2022-11-27 03:38:24,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:38:24,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 03:38:24,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 29: [2022-11-27 03:38:24,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 
29: [2022-11-27 03:38:24,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 03:38:24,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 11: [2022-11-27 03:38:24,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:38:24,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 03:38:24,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 21: [2022-11-27 03:38:24,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:38:24,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 03:38:24,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 15: [2022-11-27 03:38:24,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:38:24,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 03:38:24,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 31: [2022-11-27 03:38:24,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
31: [2022-11-27 03:38:24,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 03:38:24,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 28: [2022-11-27 03:38:24,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:38:24,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:38:24,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 03:38:24,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 22: [2022-11-27 03:38:24,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 03:38:24,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 03:38:24,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 8: [2022-11-27 03:38:24,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:38:24,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 03:38:24,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 
0: [2022-11-27 03:38:24,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:38:24,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 03:38:24,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 28: [2022-11-27 03:38:24,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 03:38:24,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 20: [2022-11-27 03:38:24,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-27 03:38:24,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 03:38:24,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 1: [2022-11-27 03:38:24,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:38:24,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 03:38:24,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 18: [2022-11-27 03:38:24,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
18: [2022-11-27 03:38:24,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 03:38:24,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 17: [2022-11-27 03:38:24,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-27 03:38:24,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 03:38:24,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 10: [2022-11-27 03:38:24,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:38:24,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 03:38:24,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 5: [2022-11-27 03:38:24,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:38:24,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 03:38:24,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 23: [2022-11-27 03:38:24,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-27 03:38:24,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 03:38:24,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 9: [2022-11-27 03:38:24,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:38:24,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 03:38:24,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 4: [2022-11-27 03:38:24,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:38:24,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 03:38:24,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 2: [2022-11-27 03:38:24,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:38:24,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 03:38:24,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 24: [2022-11-27 03:38:24,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-27 03:38:24,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-27 03:38:24,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 7: [2022-11-27 03:38:24,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:38:24,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 03:38:24,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 16: [2022-11-27 03:38:24,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:38:24,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 03:38:24,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 27: [2022-11-27 03:38:24,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:38:24,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 03:38:24,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 14: [2022-11-27 03:38:24,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-27 03:38:24,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 03:38:24,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 6: [2022-11-27 03:38:24,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:38:24,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 03:38:24,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 25: [2022-11-27 03:38:24,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:38:24,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 03:38:24,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 3: [2022-11-27 03:38:24,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:38:24,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 03:38:24,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 19: [2022-11-27 03:38:24,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
19: [2022-11-27 03:38:24,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 03:38:24,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 30: [2022-11-27 03:38:24,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:38:24,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 03:38:24,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 29: [2022-11-27 03:38:24,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:38:24,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 03:38:24,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 12: [2022-11-27 03:38:24,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:38:24,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 03:38:24,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 26: [2022-11-27 03:38:24,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
26: [2022-11-27 03:38:24,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 03:38:24,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 13: [2022-11-27 03:38:24,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:38:24,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 03:38:24,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 21: [2022-11-27 03:38:24,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:38:24,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 03:38:24,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 22: [2022-11-27 03:38:24,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 03:38:24,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 03:38:24,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 11: [2022-11-27 03:38:24,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-27 03:38:24,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 03:38:24,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 31: [2022-11-27 03:38:24,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-27 03:38:24,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 03:38:24,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 15: [2022-11-27 03:38:24,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:38:24,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 03:38:24,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 10: [2022-11-27 03:38:24,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:38:24,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 03:38:24,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 0: [2022-11-27 03:38:24,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-27 03:38:24,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 03:38:24,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 28: [2022-11-27 03:38:24,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:38:24,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:38:24,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 03:38:24,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 18: [2022-11-27 03:38:24,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 03:38:24,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 03:38:24,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 23: [2022-11-27 03:38:24,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-27 03:38:24,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 03:38:24,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 
1: [2022-11-27 03:38:24,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:38:24,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 03:38:24,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 28: [2022-11-27 03:38:24,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 03:38:24,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 4: [2022-11-27 03:38:24,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:38:24,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 03:38:24,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 2: [2022-11-27 03:38:24,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:38:24,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 03:38:24,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 20: [2022-11-27 03:38:24,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 
20: [2022-11-27 03:38:24,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 03:38:24,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 17: [2022-11-27 03:38:24,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-27 03:38:24,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 03:38:24,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 7: [2022-11-27 03:38:24,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:38:24,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 03:38:24,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 24: [2022-11-27 03:38:24,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:38:24,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 03:38:24,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 26: [2022-11-27 03:38:24,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 
26: [2022-11-27 03:38:24,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 03:38:24,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 16: [2022-11-27 03:38:24,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:38:24,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 03:38:24,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 14: [2022-11-27 03:38:24,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:38:24,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 03:38:24,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 9: [2022-11-27 03:38:24,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:38:24,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 03:38:24,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 6: [2022-11-27 03:38:24,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-27 03:38:24,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 03:38:24,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 30: [2022-11-27 03:38:24,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:38:24,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 03:38:24,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 3: [2022-11-27 03:38:24,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:38:24,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 03:38:24,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 27: [2022-11-27 03:38:24,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:38:24,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 03:38:24,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 19: [2022-11-27 03:38:24,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
19: [2022-11-27 03:38:24,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 03:38:24,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 29: [2022-11-27 03:38:24,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:38:24,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 03:38:24,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 5: [2022-11-27 03:38:24,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:38:24,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 03:38:24,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 25: [2022-11-27 03:38:24,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:38:24,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 03:38:24,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 12: [2022-11-27 03:38:24,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-27 03:38:24,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 03:38:24,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 11: [2022-11-27 03:38:24,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:38:24,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 03:38:24,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 21: [2022-11-27 03:38:24,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:38:24,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 03:38:24,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 15: [2022-11-27 03:38:24,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:38:24,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 03:38:24,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 13: [2022-11-27 03:38:24,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-27 03:38:24,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 03:38:24,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 31: [2022-11-27 03:38:24,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-27 03:38:24,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 03:38:24,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 8: [2022-11-27 03:38:24,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:38:24,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 03:38:24,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 28: [2022-11-27 03:38:24,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:38:24,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 03:38:24,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 0: [2022-11-27 03:38:24,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
22: [2022-11-27 03:38:24,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:38:24,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 03:38:24,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 22: [2022-11-27 03:38:24,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 03:38:24,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 20: [2022-11-27 03:38:24,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-27 03:38:24,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 03:38:24,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 1: [2022-11-27 03:38:24,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:38:24,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 03:38:24,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 17: [2022-11-27 03:38:24,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 
17: [2022-11-27 03:38:24,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 03:38:24,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 18: [2022-11-27 03:38:24,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-27 03:38:24,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 03:38:24,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 26: [2022-11-27 03:38:24,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:38:24,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 03:38:24,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 10: [2022-11-27 03:38:24,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:38:24,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 03:38:24,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 23: [2022-11-27 03:38:24,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
9: [2022-11-27 03:38:24,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 23: [2022-11-27 03:38:24,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 03:38:24,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 9: [2022-11-27 03:38:24,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 03:38:24,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 2: [2022-11-27 03:38:24,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:38:24,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 03:38:24,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 4: [2022-11-27 03:38:24,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:38:24,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 03:38:24,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 27: [2022-11-27 03:38:24,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 
27: [2022-11-27 03:38:24,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 03:38:24,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 5: [2022-11-27 03:38:24,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:38:24,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 03:38:24,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 24: [2022-11-27 03:38:24,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:38:24,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 03:38:24,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 7: [2022-11-27 03:38:24,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:38:24,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 03:38:24,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 3: [2022-11-27 03:38:24,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-27 03:38:24,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 03:38:24,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 30: [2022-11-27 03:38:24,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:38:24,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 03:38:24,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 6: [2022-11-27 03:38:24,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:38:24,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 03:38:24,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 14: [2022-11-27 03:38:24,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:38:24,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 03:38:24,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 11: [2022-11-27 03:38:24,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-27 03:38:24,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 03:38:24,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 21: [2022-11-27 03:38:24,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:38:24,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 03:38:24,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 25: [2022-11-27 03:38:24,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:38:24,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 03:38:24,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 13: [2022-11-27 03:38:24,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:38:24,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
13: [2022-11-27 03:38:24,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 29: [2022-11-27 03:38:24,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 13: [2022-11-27 03:38:24,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 29: [2022-11-27 03:38:24,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 15: [2022-11-27 03:38:24,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:38:24,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 03:38:24,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 16: [2022-11-27 03:38:24,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:38:24,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 03:38:24,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 31: [2022-11-27 03:38:24,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 
31: [2022-11-27 03:38:24,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 03:38:24,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 12: [2022-11-27 03:38:24,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:38:24,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 03:38:24,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 0: [2022-11-27 03:38:24,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:38:24,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:38:24,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 4: [2022-11-27 03:38:24,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:38:24,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 0: [2022-11-27 03:38:24,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 
4: [2022-11-27 03:38:24,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 2: [2022-11-27 03:38:24,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 9: [2022-11-27 03:38:24,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:38:24,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 9: [2022-11-27 03:38:24,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 03:38:24,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 20: [2022-11-27 03:38:24,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-27 03:38:24,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 19: [2022-11-27 03:38:24,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 20: [2022-11-27 03:38:24,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 19: [2022-11-27 03:38:24,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 30: [2022-11-27 03:38:24,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-27 03:38:24,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 19: [2022-11-27 03:38:24,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 23: [2022-11-27 03:38:24,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:38:24,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 22: [2022-11-27 03:38:24,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 23: [2022-11-27 03:38:24,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 03:38:24,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 22: [2022-11-27 03:38:24,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 03:38:24,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 8: [2022-11-27 03:38:24,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:38:24,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
8: [2022-11-27 03:38:24,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 10: [2022-11-27 03:38:24,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 03:38:24,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 8: [2022-11-27 03:38:24,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 1: [2022-11-27 03:38:24,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:38:24,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:38:24,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 24: [2022-11-27 03:38:24,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 1: [2022-11-27 03:38:24,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 24: [2022-11-27 03:38:24,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 26: [2022-11-27 03:38:24,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-27 03:38:24,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 03:38:24,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 28: [2022-11-27 03:38:24,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:38:24,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 7: [2022-11-27 03:38:24,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:38:24,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 7: [2022-11-27 03:38:24,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 03:38:24,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 18: [2022-11-27 03:38:24,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 03:38:24,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 03:38:24,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 16: [2022-11-27 03:38:24,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 
27: [2022-11-27 03:38:24,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:38:24,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 03:38:24,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 27: [2022-11-27 03:38:24,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 03:38:24,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 13: [2022-11-27 03:38:24,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:38:24,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:38:24,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 21: [2022-11-27 03:38:24,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 13: [2022-11-27 03:38:24,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 21: [2022-11-27 03:38:24,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 11: [2022-11-27 03:38:24,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-27 03:38:24,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 03:38:24,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 17: [2022-11-27 03:38:24,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-27 03:38:24,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 03:38:24,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 19: [2022-11-27 03:38:24,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 03:38:24,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 03:38:24,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 5: [2022-11-27 03:38:24,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:38:24,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 03:38:24,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 31: [2022-11-27 03:38:24,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-27 03:38:24,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 03:38:24,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 15: [2022-11-27 03:38:24,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:38:24,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 03:38:24,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 28: [2022-11-27 03:38:24,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:38:24,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 03:38:24,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 22: [2022-11-27 03:38:24,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-27 03:38:24,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 03:38:24,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 29: [2022-11-27 03:38:24,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-27 03:38:24,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 03:38:24,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 3: [2022-11-27 03:38:24,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:38:24,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:38:24,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 03:38:24,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 25: [2022-11-27 03:38:24,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 03:38:24,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 8: [2022-11-27 03:38:24,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:38:24,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 6: [2022-11-27 03:38:24,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:38:24,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 
6: [2022-11-27 03:38:24,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 03:38:24,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 14: [2022-11-27 03:38:24,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:38:24,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 03:38:24,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:38:24,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 14: [2022-11-27 03:38:24,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 03:38:24,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 12: [2022-11-27 03:38:24,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:38:24,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step149000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 03:38:24,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step149000 is ready now! 
0: successfully saved checkpoint at iteration 149000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2608.91 31: iteration 149010/ 173500 | consumed samples: 38146560 | consumed tokens: 78124154880 | elapsed time per iteration (s): 1.09 | learning rate: 2.888E-05 | global batch size: 256 | lm loss: 1.912471E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.892 | TFLOPs: 14.21 | 31: iteration 149020/ 173500 | consumed samples: 38149120 | consumed tokens: 78129397760 | elapsed time per iteration (s): 0.83 | learning rate: 2.887E-05 | global batch size: 256 | lm loss: 1.900636E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.243 | TFLOPs: 18.71 | 31: iteration 149030/ 173500 | consumed samples: 38151680 | consumed tokens: 78134640640 | elapsed time per iteration (s): 0.82 | learning rate: 2.886E-05 | global batch size: 256 | lm loss: 1.922898E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.722 | TFLOPs: 18.92 | 31: iteration 149040/ 173500 | consumed samples: 38154240 | consumed tokens: 78139883520 | elapsed time per iteration (s): 0.83 | learning rate: 2.886E-05 | global batch size: 256 | lm loss: 1.914689E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.343 | TFLOPs: 18.65 | 31: iteration 149050/ 173500 | consumed samples: 38156800 | consumed tokens: 78145126400 | elapsed time per iteration (s): 0.80 | learning rate: 2.885E-05 | global batch size: 256 | lm loss: 1.920635E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.884 | TFLOPs: 19.41 | 31: iteration 149060/ 173500 | consumed samples: 38159360 | consumed tokens: 78150369280 | elapsed time per iteration (s): 
0.84 | learning rate: 2.884E-05 | global batch size: 256 | lm loss: 1.928231E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.177 | TFLOPs: 18.34 | 31: iteration 149070/ 173500 | consumed samples: 38161920 | consumed tokens: 78155612160 | elapsed time per iteration (s): 0.82 | learning rate: 2.884E-05 | global batch size: 256 | lm loss: 1.904988E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.506 | TFLOPs: 18.91 | 31: iteration 149080/ 173500 | consumed samples: 38164480 | consumed tokens: 78160855040 | elapsed time per iteration (s): 0.87 | learning rate: 2.883E-05 | global batch size: 256 | lm loss: 1.924535E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.433 | TFLOPs: 17.87 | 31: iteration 149090/ 173500 | consumed samples: 38167040 | consumed tokens: 78166097920 | elapsed time per iteration (s): 0.83 | learning rate: 2.882E-05 | global batch size: 256 | lm loss: 1.922799E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.211 | TFLOPs: 18.59 | 31: iteration 149100/ 173500 | consumed samples: 38169600 | consumed tokens: 78171340800 | elapsed time per iteration (s): 0.81 | learning rate: 2.881E-05 | global batch size: 256 | lm loss: 1.918170E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.923 | TFLOPs: 19.05 | 31: iteration 149110/ 173500 | consumed samples: 38172160 | consumed tokens: 78176583680 | elapsed time per iteration (s): 0.84 | learning rate: 2.881E-05 | global batch size: 256 | lm loss: 1.929827E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.492 | TFLOPs: 18.54 | 31: 
iteration 149120/ 173500 | consumed samples: 38174720 | consumed tokens: 78181826560 | elapsed time per iteration (s): 0.83 | learning rate: 2.880E-05 | global batch size: 256 | lm loss: 1.942324E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.123 | TFLOPs: 18.76 | 31: iteration 149130/ 173500 | consumed samples: 38177280 | consumed tokens: 78187069440 | elapsed time per iteration (s): 0.85 | learning rate: 2.879E-05 | global batch size: 256 | lm loss: 1.920685E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.548 | TFLOPs: 18.30 | 31: iteration 149140/ 173500 | consumed samples: 38179840 | consumed tokens: 78192312320 | elapsed time per iteration (s): 0.85 | learning rate: 2.879E-05 | global batch size: 256 | lm loss: 1.939656E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.866 | TFLOPs: 18.14 | 31: iteration 149150/ 173500 | consumed samples: 38182400 | consumed tokens: 78197555200 | elapsed time per iteration (s): 0.81 | learning rate: 2.878E-05 | global batch size: 256 | lm loss: 1.936476E+00 | grad norm: 0.219 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.914 | TFLOPs: 19.11 | 31: iteration 149160/ 173500 | consumed samples: 38184960 | consumed tokens: 78202798080 | elapsed time per iteration (s): 0.83 | learning rate: 2.877E-05 | global batch size: 256 | lm loss: 1.905883E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.447 | TFLOPs: 18.60 | 31: iteration 149170/ 173500 | consumed samples: 38187520 | consumed tokens: 78208040960 | elapsed time per iteration (s): 0.86 | learning rate: 2.877E-05 | global batch size: 256 | lm loss: 1.902472E+00 | grad norm: 0.215 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.103 | TFLOPs: 17.91 | 31: iteration 149180/ 173500 | consumed samples: 38190080 | consumed tokens: 78213283840 | elapsed time per iteration (s): 0.81 | learning rate: 2.876E-05 | global batch size: 256 | lm loss: 1.911851E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.958 | TFLOPs: 19.11 | 31: iteration 149190/ 173500 | consumed samples: 38192640 | consumed tokens: 78218526720 | elapsed time per iteration (s): 0.83 | learning rate: 2.875E-05 | global batch size: 256 | lm loss: 1.873364E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.386 | TFLOPs: 18.66 | 31: iteration 149200/ 173500 | consumed samples: 38195200 | consumed tokens: 78223769600 | elapsed time per iteration (s): 0.80 | learning rate: 2.874E-05 | global batch size: 256 | lm loss: 1.911929E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.495 | TFLOPs: 19.39 | 31: iteration 149210/ 173500 | consumed samples: 38197760 | consumed tokens: 78229012480 | elapsed time per iteration (s): 0.82 | learning rate: 2.874E-05 | global batch size: 256 | lm loss: 1.913213E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.388 | TFLOPs: 18.90 | 31: iteration 149220/ 173500 | consumed samples: 38200320 | consumed tokens: 78234255360 | elapsed time per iteration (s): 0.86 | learning rate: 2.873E-05 | global batch size: 256 | lm loss: 1.908026E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.968 | TFLOPs: 17.91 | 31: iteration 149230/ 173500 | consumed samples: 38202880 | consumed tokens: 78239498240 | elapsed time per iteration (s): 0.82 | 
learning rate: 2.872E-05 | global batch size: 256 | lm loss: 1.910553E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.412 | TFLOPs: 18.96 | 31: iteration 149240/ 173500 | consumed samples: 38205440 | consumed tokens: 78244741120 | elapsed time per iteration (s): 0.81 | learning rate: 2.872E-05 | global batch size: 256 | lm loss: 1.919060E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.067 | TFLOPs: 19.06 | 31: iteration 149250/ 173500 | consumed samples: 38208000 | consumed tokens: 78249984000 | elapsed time per iteration (s): 0.81 | learning rate: 2.871E-05 | global batch size: 256 | lm loss: 1.919957E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.088 | TFLOPs: 19.12 | 31: iteration 149260/ 173500 | consumed samples: 38210560 | consumed tokens: 78255226880 | elapsed time per iteration (s): 0.79 | learning rate: 2.870E-05 | global batch size: 256 | lm loss: 1.902605E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.127 | TFLOPs: 19.67 | 31: iteration 149270/ 173500 | consumed samples: 38213120 | consumed tokens: 78260469760 | elapsed time per iteration (s): 0.87 | learning rate: 2.869E-05 | global batch size: 256 | lm loss: 1.924565E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.009 | TFLOPs: 17.85 | 31: iteration 149280/ 173500 | consumed samples: 38215680 | consumed tokens: 78265712640 | elapsed time per iteration (s): 0.83 | learning rate: 2.869E-05 | global batch size: 256 | lm loss: 1.915816E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.268 | TFLOPs: 18.71 | 31: iteration 
149290/ 173500 | consumed samples: 38218240 | consumed tokens: 78270955520 | elapsed time per iteration (s): 1.78 | learning rate: 2.868E-05 | global batch size: 256 | lm loss: 1.887799E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 143.707 | TFLOPs: 8.69 | 31: iteration 149300/ 173500 | consumed samples: 38220800 | consumed tokens: 78276198400 | elapsed time per iteration (s): 0.81 | learning rate: 2.867E-05 | global batch size: 256 | lm loss: 1.894840E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.743 | TFLOPs: 19.10 | 31: iteration 149310/ 173500 | consumed samples: 38223360 | consumed tokens: 78281441280 | elapsed time per iteration (s): 0.89 | learning rate: 2.867E-05 | global batch size: 256 | lm loss: 1.934378E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.242 | TFLOPs: 17.50 | 31: iteration 149320/ 173500 | consumed samples: 38225920 | consumed tokens: 78286684160 | elapsed time per iteration (s): 0.81 | learning rate: 2.866E-05 | global batch size: 256 | lm loss: 1.928075E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.727 | TFLOPs: 19.10 | 31: iteration 149330/ 173500 | consumed samples: 38228480 | consumed tokens: 78291927040 | elapsed time per iteration (s): 0.80 | learning rate: 2.865E-05 | global batch size: 256 | lm loss: 1.913140E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.595 | TFLOPs: 19.40 | 31: iteration 149340/ 173500 | consumed samples: 38231040 | consumed tokens: 78297169920 | elapsed time per iteration (s): 0.82 | learning rate: 2.865E-05 | global batch size: 256 | lm loss: 1.933263E+00 | grad norm: 0.176 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.783 | TFLOPs: 18.92 | 31: iteration 149350/ 173500 | consumed samples: 38233600 | consumed tokens: 78302412800 | elapsed time per iteration (s): 0.83 | learning rate: 2.864E-05 | global batch size: 256 | lm loss: 1.911012E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.652 | TFLOPs: 18.67 | 31: iteration 149360/ 173500 | consumed samples: 38236160 | consumed tokens: 78307655680 | elapsed time per iteration (s): 0.80 | learning rate: 2.863E-05 | global batch size: 256 | lm loss: 1.911675E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.017 | TFLOPs: 19.30 | 31: iteration 149370/ 173500 | consumed samples: 38238720 | consumed tokens: 78312898560 | elapsed time per iteration (s): 0.83 | learning rate: 2.862E-05 | global batch size: 256 | lm loss: 1.919602E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.452 | TFLOPs: 18.66 | 31: iteration 149380/ 173500 | consumed samples: 38241280 | consumed tokens: 78318141440 | elapsed time per iteration (s): 0.82 | learning rate: 2.862E-05 | global batch size: 256 | lm loss: 1.922859E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.655 | TFLOPs: 18.98 | 31: iteration 149390/ 173500 | consumed samples: 38243840 | consumed tokens: 78323384320 | elapsed time per iteration (s): 0.81 | learning rate: 2.861E-05 | global batch size: 256 | lm loss: 1.932198E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.572 | TFLOPs: 19.09 | 31: iteration 149400/ 173500 | consumed samples: 38246400 | consumed tokens: 78328627200 | elapsed time per iteration (s): 0.80 | learning 
rate: 2.860E-05 | global batch size: 256 | lm loss: 1.938098E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.128 | TFLOPs: 19.25 | 31: iteration 149410/ 173500 | consumed samples: 38248960 | consumed tokens: 78333870080 | elapsed time per iteration (s): 0.73 | learning rate: 2.860E-05 | global batch size: 256 | lm loss: 1.925574E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.462 | TFLOPs: 21.20 | 31: iteration 149420/ 173500 | consumed samples: 38251520 | consumed tokens: 78339112960 | elapsed time per iteration (s): 0.77 | learning rate: 2.859E-05 | global batch size: 256 | lm loss: 1.938685E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.773 | TFLOPs: 20.19 | 31: iteration 149430/ 173500 | consumed samples: 38254080 | consumed tokens: 78344355840 | elapsed time per iteration (s): 0.78 | learning rate: 2.858E-05 | global batch size: 256 | lm loss: 1.933666E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.312 | TFLOPs: 19.74 | 31: iteration 149440/ 173500 | consumed samples: 38256640 | consumed tokens: 78349598720 | elapsed time per iteration (s): 0.73 | learning rate: 2.857E-05 | global batch size: 256 | lm loss: 1.917637E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.667 | TFLOPs: 21.34 | 31: iteration 149450/ 173500 | consumed samples: 38259200 | consumed tokens: 78354841600 | elapsed time per iteration (s): 0.77 | learning rate: 2.857E-05 | global batch size: 256 | lm loss: 1.945801E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.495 | TFLOPs: 20.24 | 31: iteration 149460/ 
173500 | consumed samples: 38261760 | consumed tokens: 78360084480 | elapsed time per iteration (s): 0.78 | learning rate: 2.856E-05 | global batch size: 256 | lm loss: 1.923269E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.611 | TFLOPs: 19.88 | 31: iteration 149470/ 173500 | consumed samples: 38264320 | consumed tokens: 78365327360 | elapsed time per iteration (s): 0.80 | learning rate: 2.855E-05 | global batch size: 256 | lm loss: 1.917108E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.526 | TFLOPs: 19.39 | 31: iteration 149480/ 173500 | consumed samples: 38266880 | consumed tokens: 78370570240 | elapsed time per iteration (s): 0.99 | learning rate: 2.855E-05 | global batch size: 256 | lm loss: 1.906446E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 257.703 | TFLOPs: 15.59 | 31: iteration 149490/ 173500 | consumed samples: 38269440 | consumed tokens: 78375813120 | elapsed time per iteration (s): 0.79 | learning rate: 2.854E-05 | global batch size: 256 | lm loss: 1.932796E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.300 | TFLOPs: 19.50 | 31: iteration 149500/ 173500 | consumed samples: 38272000 | consumed tokens: 78381056000 | elapsed time per iteration (s): 0.76 | learning rate: 2.853E-05 | global batch size: 256 | lm loss: 1.894418E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.442 | TFLOPs: 20.41 | 31: iteration 149510/ 173500 | consumed samples: 38274560 | consumed tokens: 78386298880 | elapsed time per iteration (s): 0.82 | learning rate: 2.853E-05 | global batch size: 256 | lm loss: 1.908478E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 313.743 | TFLOPs: 18.98 | 31: iteration 149520/ 173500 | consumed samples: 38277120 | consumed tokens: 78391541760 | elapsed time per iteration (s): 0.80 | learning rate: 2.852E-05 | global batch size: 256 | lm loss: 1.967429E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.344 | TFLOPs: 19.44 | 31: iteration 149530/ 173500 | consumed samples: 38279680 | consumed tokens: 78396784640 | elapsed time per iteration (s): 0.79 | learning rate: 2.851E-05 | global batch size: 256 | lm loss: 1.911330E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.720 | TFLOPs: 19.58 | 31: iteration 149540/ 173500 | consumed samples: 38282240 | consumed tokens: 78402027520 | elapsed time per iteration (s): 0.77 | learning rate: 2.850E-05 | global batch size: 256 | lm loss: 1.916165E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.860 | TFLOPs: 20.20 | 31: iteration 149550/ 173500 | consumed samples: 38284800 | consumed tokens: 78407270400 | elapsed time per iteration (s): 0.79 | learning rate: 2.850E-05 | global batch size: 256 | lm loss: 1.926285E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.239 | TFLOPs: 19.56 | 31: iteration 149560/ 173500 | consumed samples: 38287360 | consumed tokens: 78412513280 | elapsed time per iteration (s): 0.74 | learning rate: 2.849E-05 | global batch size: 256 | lm loss: 1.918532E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.943 | TFLOPs: 20.87 | 31: iteration 149570/ 173500 | consumed samples: 38289920 | consumed tokens: 78417756160 | elapsed time per iteration (s): 0.75 | learning rate: 
2.848E-05 | global batch size: 256 | lm loss: 1.896939E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.617 | TFLOPs: 20.73 | 31: iteration 149580/ 173500 | consumed samples: 38292480 | consumed tokens: 78422999040 | elapsed time per iteration (s): 0.78 | learning rate: 2.848E-05 | global batch size: 256 | lm loss: 1.941466E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.259 | TFLOPs: 19.80 | 31: iteration 149590/ 173500 | consumed samples: 38295040 | consumed tokens: 78428241920 | elapsed time per iteration (s): 0.82 | learning rate: 2.847E-05 | global batch size: 256 | lm loss: 1.947713E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.586 | TFLOPs: 18.85 | 31: iteration 149600/ 173500 | consumed samples: 38297600 | consumed tokens: 78433484800 | elapsed time per iteration (s): 0.77 | learning rate: 2.846E-05 | global batch size: 256 | lm loss: 1.915278E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.535 | TFLOPs: 20.24 | 31: iteration 149610/ 173500 | consumed samples: 38300160 | consumed tokens: 78438727680 | elapsed time per iteration (s): 0.78 | learning rate: 2.846E-05 | global batch size: 256 | lm loss: 1.897614E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.280 | TFLOPs: 19.86 | 31: iteration 149620/ 173500 | consumed samples: 38302720 | consumed tokens: 78443970560 | elapsed time per iteration (s): 0.73 | learning rate: 2.845E-05 | global batch size: 256 | lm loss: 1.902735E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.877 | TFLOPs: 21.17 | 31: iteration 149630/ 173500 | 
consumed samples: 38305280 | consumed tokens: 78449213440 | elapsed time per iteration (s): 0.76 | learning rate: 2.844E-05 | global batch size: 256 | lm loss: 1.914838E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.779 | TFLOPs: 20.31 | 31: iteration 149640/ 173500 | consumed samples: 38307840 | consumed tokens: 78454456320 | elapsed time per iteration (s): 0.74 | learning rate: 2.844E-05 | global batch size: 256 | lm loss: 1.922864E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.623 | TFLOPs: 21.03 | 31: iteration 149650/ 173500 | consumed samples: 38310400 | consumed tokens: 78459699200 | elapsed time per iteration (s): 1.00 | learning rate: 2.843E-05 | global batch size: 256 | lm loss: 1.935494E+00 | grad norm: 0.212 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 256.796 | TFLOPs: 15.54 | 31: iteration 149660/ 173500 | consumed samples: 38312960 | consumed tokens: 78464942080 | elapsed time per iteration (s): 0.82 | learning rate: 2.842E-05 | global batch size: 256 | lm loss: 1.915258E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.749 | TFLOPs: 18.92 | 31: iteration 149670/ 173500 | consumed samples: 38315520 | consumed tokens: 78470184960 | elapsed time per iteration (s): 0.79 | learning rate: 2.841E-05 | global batch size: 256 | lm loss: 1.946663E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.513 | TFLOPs: 19.51 | 31: iteration 149680/ 173500 | consumed samples: 38318080 | consumed tokens: 78475427840 | elapsed time per iteration (s): 0.80 | learning rate: 2.841E-05 | global batch size: 256 | lm loss: 1.941403E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 318.862 | TFLOPs: 19.29 | 31: iteration 149690/ 173500 | consumed samples: 38320640 | consumed tokens: 78480670720 | elapsed time per iteration (s): 0.80 | learning rate: 2.840E-05 | global batch size: 256 | lm loss: 1.905166E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.120 | TFLOPs: 19.31 | 31: iteration 149700/ 173500 | consumed samples: 38323200 | consumed tokens: 78485913600 | elapsed time per iteration (s): 0.78 | learning rate: 2.839E-05 | global batch size: 256 | lm loss: 1.910323E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.268 | TFLOPs: 19.92 | 31: iteration 149710/ 173500 | consumed samples: 38325760 | consumed tokens: 78491156480 | elapsed time per iteration (s): 0.85 | learning rate: 2.839E-05 | global batch size: 256 | lm loss: 1.919873E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.516 | TFLOPs: 18.24 | 31: iteration 149720/ 173500 | consumed samples: 38328320 | consumed tokens: 78496399360 | elapsed time per iteration (s): 0.74 | learning rate: 2.838E-05 | global batch size: 256 | lm loss: 1.909458E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.812 | TFLOPs: 20.98 | 31: iteration 149730/ 173500 | consumed samples: 38330880 | consumed tokens: 78501642240 | elapsed time per iteration (s): 0.77 | learning rate: 2.837E-05 | global batch size: 256 | lm loss: 1.889882E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.281 | TFLOPs: 20.22 | 31: iteration 149740/ 173500 | consumed samples: 38333440 | consumed tokens: 78506885120 | elapsed time per iteration (s): 0.78 | learning rate: 
2.837E-05 | global batch size: 256 | lm loss: 1.942483E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.732 | TFLOPs: 19.95 | 31: iteration 149750/ 173500 | consumed samples: 38336000 | consumed tokens: 78512128000 | elapsed time per iteration (s): 0.72 | learning rate: 2.836E-05 | global batch size: 256 | lm loss: 1.908927E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.549 | TFLOPs: 21.57 | 31: iteration 149760/ 173500 | consumed samples: 38338560 | consumed tokens: 78517370880 | elapsed time per iteration (s): 0.71 | learning rate: 2.835E-05 | global batch size: 256 | lm loss: 1.924316E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 360.964 | TFLOPs: 21.84 | 31: iteration 149770/ 173500 | consumed samples: 38341120 | consumed tokens: 78522613760 | elapsed time per iteration (s): 0.77 | learning rate: 2.835E-05 | global batch size: 256 | lm loss: 1.915808E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.529 | TFLOPs: 20.00 | 31: iteration 149780/ 173500 | consumed samples: 38343680 | consumed tokens: 78527856640 | elapsed time per iteration (s): 0.71 | learning rate: 2.834E-05 | global batch size: 256 | lm loss: 1.893507E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 361.557 | TFLOPs: 21.87 | 31: iteration 149790/ 173500 | consumed samples: 38346240 | consumed tokens: 78533099520 | elapsed time per iteration (s): 0.89 | learning rate: 2.833E-05 | global batch size: 256 | lm loss: 1.917268E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.263 | TFLOPs: 17.32 | 31: iteration 149800/ 173500 | 
consumed samples: 38348800 | consumed tokens: 78538342400 | elapsed time per iteration (s): 0.72 | learning rate: 2.832E-05 | global batch size: 256 | lm loss: 1.937473E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.968 | TFLOPs: 21.41 | 31: iteration 149810/ 173500 | consumed samples: 38351360 | consumed tokens: 78543585280 | elapsed time per iteration (s): 0.75 | learning rate: 2.832E-05 | global batch size: 256 | lm loss: 1.931847E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.625 | TFLOPs: 20.61 | 31: iteration 149820/ 173500 | consumed samples: 38353920 | consumed tokens: 78548828160 | elapsed time per iteration (s): 0.72 | learning rate: 2.831E-05 | global batch size: 256 | lm loss: 1.946624E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.874 | TFLOPs: 21.59 | 31: iteration 149830/ 173500 | consumed samples: 38356480 | consumed tokens: 78554071040 | elapsed time per iteration (s): 0.80 | learning rate: 2.830E-05 | global batch size: 256 | lm loss: 1.873154E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.276 | TFLOPs: 19.32 | 31: iteration 149840/ 173500 | consumed samples: 38359040 | consumed tokens: 78559313920 | elapsed time per iteration (s): 0.82 | learning rate: 2.830E-05 | global batch size: 256 | lm loss: 1.913835E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.295 | TFLOPs: 18.89 | 31: iteration 149850/ 173500 | consumed samples: 38361600 | consumed tokens: 78564556800 | elapsed time per iteration (s): 0.82 | learning rate: 2.829E-05 | global batch size: 256 | lm loss: 1.885867E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 311.209 | TFLOPs: 18.83 | 31: iteration 149860/ 173500 | consumed samples: 38364160 | consumed tokens: 78569799680 | elapsed time per iteration (s): 0.81 | learning rate: 2.828E-05 | global batch size: 256 | lm loss: 1.889605E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.305 | TFLOPs: 19.01 | 31: iteration 149870/ 173500 | consumed samples: 38366720 | consumed tokens: 78575042560 | elapsed time per iteration (s): 0.79 | learning rate: 2.828E-05 | global batch size: 256 | lm loss: 1.917822E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.225 | TFLOPs: 19.55 | 31: iteration 149880/ 173500 | consumed samples: 38369280 | consumed tokens: 78580285440 | elapsed time per iteration (s): 0.78 | learning rate: 2.827E-05 | global batch size: 256 | lm loss: 1.916538E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.845 | TFLOPs: 19.77 | 31: iteration 149890/ 173500 | consumed samples: 38371840 | consumed tokens: 78585528320 | elapsed time per iteration (s): 0.79 | learning rate: 2.826E-05 | global batch size: 256 | lm loss: 1.894903E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.100 | TFLOPs: 19.49 | 31: iteration 149900/ 173500 | consumed samples: 38374400 | consumed tokens: 78590771200 | elapsed time per iteration (s): 0.86 | learning rate: 2.826E-05 | global batch size: 256 | lm loss: 1.934780E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.863 | TFLOPs: 18.02 | 31: iteration 149910/ 173500 | consumed samples: 38376960 | consumed tokens: 78596014080 | elapsed time per iteration (s): 0.81 | learning rate: 
2.825E-05 | global batch size: 256 | lm loss: 1.926021E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.507 | TFLOPs: 19.09 | 31: iteration 149920/ 173500 | consumed samples: 38379520 | consumed tokens: 78601256960 | elapsed time per iteration (s): 0.75 | learning rate: 2.824E-05 | global batch size: 256 | lm loss: 1.916417E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.549 | TFLOPs: 20.60 | 31: iteration 149930/ 173500 | consumed samples: 38382080 | consumed tokens: 78606499840 | elapsed time per iteration (s): 0.76 | learning rate: 2.823E-05 | global batch size: 256 | lm loss: 1.916591E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.319 | TFLOPs: 20.47 | 31: iteration 149940/ 173500 | consumed samples: 38384640 | consumed tokens: 78611742720 | elapsed time per iteration (s): 0.77 | learning rate: 2.823E-05 | global batch size: 256 | lm loss: 1.917360E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.124 | TFLOPs: 20.21 | 31: iteration 149950/ 173500 | consumed samples: 38387200 | consumed tokens: 78616985600 | elapsed time per iteration (s): 0.78 | learning rate: 2.822E-05 | global batch size: 256 | lm loss: 1.904937E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.073 | TFLOPs: 19.79 | 31: iteration 149960/ 173500 | consumed samples: 38389760 | consumed tokens: 78622228480 | elapsed time per iteration (s): 0.74 | learning rate: 2.821E-05 | global batch size: 256 | lm loss: 1.932812E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.437 | TFLOPs: 20.90 | 31: iteration 149970/ 173500 | 
consumed samples: 38392320 | consumed tokens: 78627471360 | elapsed time per iteration (s): 0.75 | learning rate: 2.821E-05 | global batch size: 256 | lm loss: 1.918343E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.844 | TFLOPs: 20.74 | 31: iteration 149980/ 173500 | consumed samples: 38394880 | consumed tokens: 78632714240 | elapsed time per iteration (s): 0.74 | learning rate: 2.820E-05 | global batch size: 256 | lm loss: 1.919090E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.194 | TFLOPs: 20.88 | 31: iteration 149990/ 173500 | consumed samples: 38397440 | consumed tokens: 78637957120 | elapsed time per iteration (s): 0.91 | learning rate: 2.819E-05 | global batch size: 256 | lm loss: 1.910473E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 280.285 | TFLOPs: 16.96 | 0: [2022-11-27 03:51:55,733] [INFO] [logging.py:68:log_dist] [Rank 0] step=150000, skipped=0, lr=[2.8186529571359086e-05, 2.8186529571359086e-05, 2.8186529571359086e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 150000/ 173500 | consumed samples: 38400000 | consumed tokens: 78643200000 | elapsed time per iteration (s): 0.81 | learning rate: 2.819E-05 | global batch size: 256 | lm loss: 1.900958E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.033 | TFLOPs: 19.18 | 0: steps: 150000 loss: 1.9108 iter time (s): 0.815 samples/sec: 314.007 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 150000 | lm loss value: 1.808251E+00 | lm loss PPL: 6.099771E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 150000 to 
checkpoints_1b1long 0: [2022-11-27 03:51:56,109] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step150000 is begin to save! 0: [2022-11-27 03:51:56,123] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_01-model_00-model_states.pt... 0: [2022-11-27 03:51:56,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_01-model_00-model_states.pt. 0: [2022-11-27 03:51:56,344] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_03-model_00-model_states.pt... 0: [2022-11-27 03:51:56,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_03-model_00-model_states.pt. 0: [2022-11-27 03:51:56,431] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_04-model_00-model_states.pt... 0: [2022-11-27 03:51:56,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_04-model_00-model_states.pt. 0: [2022-11-27 03:51:56,512] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_05-model_00-model_states.pt... 0: [2022-11-27 03:51:56,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_05-model_00-model_states.pt. 0: [2022-11-27 03:51:56,587] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_06-model_00-model_states.pt... 0: [2022-11-27 03:51:56,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_06-model_00-model_states.pt. 0: [2022-11-27 03:51:56,662] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_07-model_00-model_states.pt... 
0: [2022-11-27 03:51:56,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_07-model_00-model_states.pt. 0: [2022-11-27 03:51:56,737] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_08-model_00-model_states.pt... 0: [2022-11-27 03:51:56,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_08-model_00-model_states.pt. 0: [2022-11-27 03:51:56,812] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_09-model_00-model_states.pt... 0: [2022-11-27 03:51:56,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_09-model_00-model_states.pt. 0: [2022-11-27 03:51:56,888] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_10-model_00-model_states.pt... 0: [2022-11-27 03:51:56,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_10-model_00-model_states.pt. 0: [2022-11-27 03:51:56,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_11-model_00-model_states.pt... 0: [2022-11-27 03:51:57,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_11-model_00-model_states.pt. 0: [2022-11-27 03:51:57,036] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_12-model_00-model_states.pt... 0: [2022-11-27 03:51:57,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_12-model_00-model_states.pt. 0: [2022-11-27 03:51:57,115] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 03:51:57,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_13-model_00-model_states.pt. 0: [2022-11-27 03:51:57,190] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_14-model_00-model_states.pt... 0: [2022-11-27 03:51:57,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_14-model_00-model_states.pt. 0: [2022-11-27 03:51:57,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_15-model_00-model_states.pt... 0: [2022-11-27 03:51:57,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_15-model_00-model_states.pt. 0: [2022-11-27 03:51:57,339] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_16-model_00-model_states.pt... 0: [2022-11-27 03:51:57,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_16-model_00-model_states.pt. 0: [2022-11-27 03:51:57,409] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_17-model_00-model_states.pt... 0: [2022-11-27 03:51:57,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_17-model_00-model_states.pt. 0: [2022-11-27 03:51:57,484] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_18-model_00-model_states.pt... 0: [2022-11-27 03:51:57,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_18-model_00-model_states.pt. 0: [2022-11-27 03:51:57,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 03:51:57,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_19-model_00-model_states.pt. 0: [2022-11-27 03:51:57,630] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_20-model_00-model_states.pt... 0: [2022-11-27 03:51:57,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_20-model_00-model_states.pt. 0: [2022-11-27 03:51:57,702] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_21-model_00-model_states.pt... 0: [2022-11-27 03:51:57,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_21-model_00-model_states.pt. 0: [2022-11-27 03:51:57,779] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_22-model_00-model_states.pt... 0: [2022-11-27 03:51:57,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_22-model_00-model_states.pt. 0: [2022-11-27 03:51:57,850] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_23-model_00-model_states.pt... 0: [2022-11-27 03:51:57,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_23-model_00-model_states.pt. 0: [2022-11-27 03:51:57,923] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_24-model_00-model_states.pt... 0: [2022-11-27 03:51:57,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_24-model_00-model_states.pt. 0: [2022-11-27 03:51:58,000] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 03:51:58,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_25-model_00-model_states.pt. 0: [2022-11-27 03:51:58,071] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_26-model_00-model_states.pt... 0: [2022-11-27 03:51:58,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_26-model_00-model_states.pt. 0: [2022-11-27 03:51:58,146] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_27-model_00-model_states.pt... 0: [2022-11-27 03:51:58,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_27-model_00-model_states.pt. 0: [2022-11-27 03:51:58,218] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_28-model_00-model_states.pt... 0: [2022-11-27 03:51:58,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_28-model_00-model_states.pt. 0: [2022-11-27 03:51:58,293] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/layer_30-model_00-model_states.pt... 0: [2022-11-27 03:51:58,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/layer_30-model_00-model_states.pt. 0: [2022-11-27 03:51:58,295] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step150000/mp_rank_00_model_states.pt 0: [2022-11-27 03:51:58,296] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/mp_rank_00_model_states.pt... 0: [2022-11-27 03:51:58,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/mp_rank_00_model_states.pt. 
0: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
4: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
1: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 
13: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 
25: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
11: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
31: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 
17: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 
19: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
5: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 
8: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
2: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 
11: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 
22: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 
26: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 
0: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
13: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 
23: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 
30: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
8: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 24: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
14: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 
26: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 
24: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 31: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 26: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 24: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 
31: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 17: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 24: [2022-11-27 03:51:58,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:51:58,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:51:58,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 03:51:58,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 17: [2022-11-27 03:51:58,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-27 03:51:58,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 03:51:58,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 9: [2022-11-27 03:51:58,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 22: [2022-11-27 03:51:58,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
9: [2022-11-27 03:51:58,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 22: [2022-11-27 03:51:58,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 9: [2022-11-27 03:51:58,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 22: [2022-11-27 03:51:58,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 26: [2022-11-27 03:51:58,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:51:58,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 23: [2022-11-27 03:51:58,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:51:58,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 23: [2022-11-27 03:51:58,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 03:51:58,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 17: [2022-11-27 03:51:58,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-27 03:51:58,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 03:51:58,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 1: [2022-11-27 03:51:58,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:51:58,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 03:51:58,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 30: [2022-11-27 03:51:58,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:51:58,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 03:51:58,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 16: [2022-11-27 03:51:58,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:51:58,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 25: [2022-11-27 03:51:58,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
14: [2022-11-27 03:51:58,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:51:58,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 14: [2022-11-27 03:51:58,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 03:51:58,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 25: [2022-11-27 03:51:58,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 03:51:58,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 4: [2022-11-27 03:51:58,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 19: [2022-11-27 03:51:58,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:51:58,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 19: [2022-11-27 03:51:58,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 4: [2022-11-27 03:51:58,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 25: [2022-11-27 03:51:58,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
19: [2022-11-27 03:51:58,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 25: [2022-11-27 03:51:58,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 4: [2022-11-27 03:51:58,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:51:58,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 4: [2022-11-27 03:51:58,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 20: [2022-11-27 03:51:58,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:51:58,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 22: [2022-11-27 03:51:58,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:51:58,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-27 03:51:58,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 20: [2022-11-27 03:51:58,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 22: [2022-11-27 03:51:58,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 21: [2022-11-27 03:51:58,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 20: [2022-11-27 03:51:58,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 22: [2022-11-27 03:51:58,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 20: [2022-11-27 03:51:58,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-27 03:51:58,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 03:51:58,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 19: [2022-11-27 03:51:58,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:51:58,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 
19: [2022-11-27 03:51:58,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 30: [2022-11-27 03:51:58,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 19: [2022-11-27 03:51:58,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 30: [2022-11-27 03:51:58,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 23: [2022-11-27 03:51:58,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-27 03:51:58,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 03:51:58,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 7: [2022-11-27 03:51:58,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:51:58,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:51:58,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:51:58,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
31: [2022-11-27 03:51:58,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:51:58,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 10: [2022-11-27 03:51:58,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 15: [2022-11-27 03:51:58,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 31: [2022-11-27 03:51:58,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 7: [2022-11-27 03:51:58,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 10: [2022-11-27 03:51:58,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 1: [2022-11-27 03:51:58,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:51:58,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 31: [2022-11-27 03:51:58,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 10: [2022-11-27 03:51:58,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 10: [2022-11-27 03:51:58,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 
1: [2022-11-27 03:51:58,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 03:51:58,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 8: [2022-11-27 03:51:58,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:51:58,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:51:58,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 8: [2022-11-27 03:51:58,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 15: [2022-11-27 03:51:58,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:51:58,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 8: [2022-11-27 03:51:58,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 15: [2022-11-27 03:51:58,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 03:51:58,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 8: [2022-11-27 03:51:58,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-27 03:51:58,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 27: [2022-11-27 03:51:58,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:51:58,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 27: [2022-11-27 03:51:58,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 16: [2022-11-27 03:51:58,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:51:58,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 03:51:58,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 27: [2022-11-27 03:51:58,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 12: [2022-11-27 03:51:58,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:51:58,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 03:51:58,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 3: [2022-11-27 03:51:58,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-27 03:51:58,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:51:58,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 03:51:58,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 03:51:58,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 3: [2022-11-27 03:51:58,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 2: [2022-11-27 03:51:58,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:51:58,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:51:58,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 12: [2022-11-27 03:51:58,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:51:58,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
2: [2022-11-27 03:51:58,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 12: [2022-11-27 03:51:58,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 21: [2022-11-27 03:51:58,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 2: [2022-11-27 03:51:58,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 12: [2022-11-27 03:51:58,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 21: [2022-11-27 03:51:58,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 2: [2022-11-27 03:51:58,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 23: [2022-11-27 03:51:58,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 31: [2022-11-27 03:51:58,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:51:58,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 31: [2022-11-27 03:51:58,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 0: [2022-11-27 03:51:58,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
7: [2022-11-27 03:51:58,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 23: [2022-11-27 03:51:58,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 31: [2022-11-27 03:51:58,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 7: [2022-11-27 03:51:58,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 23: [2022-11-27 03:51:58,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 0: [2022-11-27 03:51:58,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 03:51:58,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 25: [2022-11-27 03:51:58,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:51:58,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 1: [2022-11-27 03:51:58,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:51:58,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 
1: [2022-11-27 03:51:58,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 03:51:58,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 26: [2022-11-27 03:51:58,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:51:58,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-27 03:51:58,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 14: [2022-11-27 03:51:58,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:51:58,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:51:58,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 14: [2022-11-27 03:51:58,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 8: [2022-11-27 03:51:58,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 14: [2022-11-27 03:51:58,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 9: [2022-11-27 03:51:58,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-27 03:51:58,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:51:58,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 03:51:58,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 03:51:58,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 9: [2022-11-27 03:51:58,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 14: [2022-11-27 03:51:58,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:51:58,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 03:51:58,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 28: [2022-11-27 03:51:58,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:51:58,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:51:58,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 22: [2022-11-27 03:51:58,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
19: [2022-11-27 03:51:58,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:51:58,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 16: [2022-11-27 03:51:58,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 22: [2022-11-27 03:51:58,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 19: [2022-11-27 03:51:58,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 4: [2022-11-27 03:51:58,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 16: [2022-11-27 03:51:58,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 22: [2022-11-27 03:51:58,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 19: [2022-11-27 03:51:58,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 3: [2022-11-27 03:51:58,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:51:58,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 03:51:58,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 
5: [2022-11-27 03:51:58,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:51:58,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:51:58,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 03:51:58,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 03:51:58,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 5: [2022-11-27 03:51:58,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 5: [2022-11-27 03:51:58,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:51:58,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 31: [2022-11-27 03:51:58,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:51:58,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 24: [2022-11-27 03:51:58,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:51:58,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-27 03:51:58,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 24: [2022-11-27 03:51:58,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 31: [2022-11-27 03:51:58,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 03:51:58,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 4: [2022-11-27 03:51:58,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 24: [2022-11-27 03:51:58,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 12: [2022-11-27 03:51:58,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:51:58,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 03:51:58,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 20: [2022-11-27 03:51:58,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 03:51:58,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 03:51:58,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 
8: [2022-11-27 03:51:58,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:51:58,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 03:51:58,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 23: [2022-11-27 03:51:58,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-27 03:51:58,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 03:51:58,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 7: [2022-11-27 03:51:58,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:51:58,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:51:58,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-27 03:51:58,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 03:51:58,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 7: [2022-11-27 03:51:58,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 20: [2022-11-27 03:51:58,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:51:58,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 6: [2022-11-27 03:51:58,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 7: [2022-11-27 03:51:58,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 20: [2022-11-27 03:51:58,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 03:51:58,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 7: [2022-11-27 03:51:58,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:51:58,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 03:51:58,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 
1: [2022-11-27 03:51:58,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:51:58,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 03:51:58,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 26: [2022-11-27 03:51:58,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:51:58,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 03:51:58,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 25: [2022-11-27 03:51:58,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:51:58,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 03:51:58,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 17: [2022-11-27 03:51:58,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:51:58,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:51:58,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
2: [2022-11-27 03:51:58,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:51:58,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 17: [2022-11-27 03:51:58,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 27: [2022-11-27 03:51:58,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 0: [2022-11-27 03:51:58,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 2: [2022-11-27 03:51:58,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 17: [2022-11-27 03:51:58,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 27: [2022-11-27 03:51:58,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 0: [2022-11-27 03:51:58,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 17: [2022-11-27 03:51:58,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-27 03:51:58,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 03:51:58,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 
21: [2022-11-27 03:51:58,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:51:58,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:51:58,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 27: [2022-11-27 03:51:58,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:51:58,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 21: [2022-11-27 03:51:58,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 27: [2022-11-27 03:51:58,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 30: [2022-11-27 03:51:58,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 03:51:58,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:51:58,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 30: [2022-11-27 03:51:58,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 03:51:58,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 
16: [2022-11-27 03:51:58,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:51:58,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 03:51:58,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 11: [2022-11-27 03:51:58,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:51:58,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 03:51:58,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 11: [2022-11-27 03:51:58,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:51:58,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 03:51:58,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 11: [2022-11-27 03:51:58,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:51:58,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 03:51:58,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 
5: [2022-11-27 03:51:58,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:51:58,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 03:51:58,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 21: [2022-11-27 03:51:58,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:51:58,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 03:51:58,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 14: [2022-11-27 03:51:58,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:51:58,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 13: [2022-11-27 03:51:58,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:51:58,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 19: [2022-11-27 03:51:58,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-27 03:51:58,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 13: [2022-11-27 03:51:58,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 03:51:58,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 12: [2022-11-27 03:51:58,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:51:58,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 19: [2022-11-27 03:51:58,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 12: [2022-11-27 03:51:58,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 13: [2022-11-27 03:51:58,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:51:58,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 03:51:58,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 24: [2022-11-27 03:51:58,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:51:58,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
24: [2022-11-27 03:51:58,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 6: [2022-11-27 03:51:58,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 03:51:58,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 24: [2022-11-27 03:51:58,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 31: [2022-11-27 03:51:58,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-27 03:51:58,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 03:51:58,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 22: [2022-11-27 03:51:58,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-27 03:51:58,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 03:51:58,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 26: [2022-11-27 03:51:58,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
26: [2022-11-27 03:51:58,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 03:51:58,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 24: [2022-11-27 03:51:58,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:51:58,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 03:51:58,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 28: [2022-11-27 03:51:58,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:51:58,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 03:51:58,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 24: [2022-11-27 03:51:58,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 28: [2022-11-27 03:51:58,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:51:58,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 
28: [2022-11-27 03:51:58,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 03:51:58,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 28: [2022-11-27 03:51:58,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:51:58,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 03:51:58,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 15: [2022-11-27 03:51:58,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:51:58,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:51:58,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 03:51:58,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 03:51:58,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 15: [2022-11-27 03:51:58,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 2: [2022-11-27 03:51:58,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-27 03:51:58,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 03:51:58,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 10: [2022-11-27 03:51:58,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:51:58,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 03:51:58,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 11: [2022-11-27 03:51:58,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:51:58,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 03:51:58,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 13: [2022-11-27 03:51:58,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:51:58,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 03:51:58,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 10: [2022-11-27 03:51:58,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-27 03:51:58,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 03:51:58,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 0: [2022-11-27 03:51:58,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:51:58,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 03:51:58,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 1: [2022-11-27 03:51:58,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:51:58,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 03:51:58,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 29: [2022-11-27 03:51:58,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:51:58,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:51:58,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:51:58,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-27 03:51:58,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 03:51:58,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 03:51:58,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 03:51:58,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 29: [2022-11-27 03:51:58,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 29: [2022-11-27 03:51:58,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 03:51:58,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 29: [2022-11-27 03:51:58,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 3: [2022-11-27 03:51:58,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:51:58,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 03:51:58,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 18: [2022-11-27 03:51:58,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
18: [2022-11-27 03:51:58,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-27 03:51:58,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 03:51:58,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 03:51:58,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 03:51:58,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 18: [2022-11-27 03:51:58,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 18: [2022-11-27 03:51:58,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 03:51:58,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 03:51:58,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 18: [2022-11-27 03:51:58,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 03:51:58,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 10: [2022-11-27 03:51:58,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-27 03:51:58,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 03:51:58,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 0: [2022-11-27 03:51:58,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:51:58,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 03:51:58,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 9: [2022-11-27 03:51:58,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:51:58,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 03:51:58,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 23: [2022-11-27 03:51:58,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-27 03:51:58,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 03:51:58,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 9: [2022-11-27 03:51:58,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-27 03:51:58,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 03:51:58,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 28: [2022-11-27 03:51:58,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:51:58,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:51:58,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 03:51:58,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 27: [2022-11-27 03:51:58,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:51:58,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 03:51:58,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 28: [2022-11-27 03:51:58,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 03:51:58,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 25: [2022-11-27 03:51:58,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
25: [2022-11-27 03:51:58,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 03:51:58,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 20: [2022-11-27 03:51:58,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-27 03:51:58,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 03:51:58,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 18: [2022-11-27 03:51:58,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-27 03:51:58,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 03:51:58,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 4: [2022-11-27 03:51:58,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:51:58,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 03:51:58,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 17: [2022-11-27 03:51:58,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 
17: [2022-11-27 03:51:58,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 03:51:58,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 14: [2022-11-27 03:51:58,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:51:58,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 03:51:58,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 22: [2022-11-27 03:51:58,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 03:51:58,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 03:51:58,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 19: [2022-11-27 03:51:58,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-27 03:51:58,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 03:51:58,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 29: [2022-11-27 03:51:58,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
29: [2022-11-27 03:51:58,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 03:51:58,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 8: [2022-11-27 03:51:58,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:51:58,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 03:51:58,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 30: [2022-11-27 03:51:58,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:51:58,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 03:51:58,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 16: [2022-11-27 03:51:58,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:51:58,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 03:51:58,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 15: [2022-11-27 03:51:58,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-27 03:51:58,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 03:51:58,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 31: [2022-11-27 03:51:58,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 03:51:58,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 03:51:58,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 26: [2022-11-27 03:51:58,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:51:58,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 03:51:58,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 12: [2022-11-27 03:51:58,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:51:58,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 03:51:58,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 2: [2022-11-27 03:51:58,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-27 03:51:58,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 03:51:58,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 24: [2022-11-27 03:51:58,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:51:58,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:51:58,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 24: [2022-11-27 03:51:58,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 3: [2022-11-27 03:51:58,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 24: [2022-11-27 03:51:58,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 13: [2022-11-27 03:51:58,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:51:58,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 03:51:58,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 6: [2022-11-27 03:51:58,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-27 03:51:58,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 03:51:58,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 7: [2022-11-27 03:51:58,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:51:58,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 03:51:58,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 1: [2022-11-27 03:51:58,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:51:58,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 03:51:58,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 11: [2022-11-27 03:51:58,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:51:58,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 03:51:58,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 10: [2022-11-27 03:51:58,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-27 03:51:58,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 03:51:58,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 0: [2022-11-27 03:51:58,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 23: [2022-11-27 03:51:58,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-27 03:51:58,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 03:51:58,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 9: [2022-11-27 03:51:58,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:51:58,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 03:51:58,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 27: [2022-11-27 03:51:58,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:51:58,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 03:51:58,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 
28: [2022-11-27 03:51:58,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-27 03:51:58,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 03:51:58,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 20: [2022-11-27 03:51:58,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-27 03:51:58,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 03:51:58,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 18: [2022-11-27 03:51:58,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-27 03:51:58,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 5: [2022-11-27 03:51:58,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 18: [2022-11-27 03:51:58,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 5: [2022-11-27 03:51:58,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 03:51:58,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 
30: [2022-11-27 03:51:58,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:51:58,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 03:51:58,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 14: [2022-11-27 03:51:58,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:51:58,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 03:51:58,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 25: [2022-11-27 03:51:58,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 03:51:58,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 03:51:58,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 0: [2022-11-27 03:51:58,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 8: [2022-11-27 03:51:58,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-27 03:51:58,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 0: [2022-11-27 03:51:58,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 8: [2022-11-27 03:51:58,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 31: [2022-11-27 03:51:58,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 03:51:58,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 03:51:58,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 22: [2022-11-27 03:51:58,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 03:51:58,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 03:51:58,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 29: [2022-11-27 03:51:58,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:51:58,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 15: [2022-11-27 03:51:58,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-27 03:51:58,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 29: [2022-11-27 03:51:58,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 15: [2022-11-27 03:51:58,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 19: [2022-11-27 03:51:58,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-27 03:51:58,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 03:51:58,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 4: [2022-11-27 03:51:58,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:51:58,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:51:58,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 2: [2022-11-27 03:51:58,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 4: [2022-11-27 03:51:58,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 2: [2022-11-27 03:51:58,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 
13: [2022-11-27 03:51:58,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:51:58,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:51:58,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:51:58,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 13: [2022-11-27 03:51:58,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 12: [2022-11-27 03:51:58,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 16: [2022-11-27 03:51:58,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 13: [2022-11-27 03:51:58,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 16: [2022-11-27 03:51:58,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 17: [2022-11-27 03:51:58,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-27 03:51:58,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 03:51:58,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 
7: [2022-11-27 03:51:58,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:51:58,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 03:51:58,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 6: [2022-11-27 03:51:58,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:51:58,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 03:51:58,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 24: [2022-11-27 03:51:58,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:51:58,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 03:51:58,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 3: [2022-11-27 03:51:58,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:51:58,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 03:51:58,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 
21: [2022-11-27 03:51:58,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:51:58,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 03:51:58,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 1: [2022-11-27 03:51:58,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:51:58,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 03:51:58,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 10: [2022-11-27 03:51:58,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:51:58,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 03:51:58,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 26: [2022-11-27 03:51:58,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:51:58,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 03:51:58,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 
5: [2022-11-27 03:51:58,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:51:58,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 03:51:58,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 11: [2022-11-27 03:51:58,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:51:58,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 03:51:58,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 21: [2022-11-27 03:51:58,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:51:58,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 03:51:58,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 28: [2022-11-27 03:51:58,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 23: [2022-11-27 03:51:58,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
28: [2022-11-27 03:51:58,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 23: [2022-11-27 03:51:58,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 28: [2022-11-27 03:51:58,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 23: [2022-11-27 03:51:58,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 0: [2022-11-27 03:51:58,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:51:58,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 03:51:58,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 27: [2022-11-27 03:51:58,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-27 03:51:58,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 03:51:58,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 9: [2022-11-27 03:51:58,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-27 03:51:58,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 03:51:58,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 17: [2022-11-27 03:51:58,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-27 03:51:58,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 03:51:58,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 25: [2022-11-27 03:51:58,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 18: [2022-11-27 03:51:58,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-27 03:51:58,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 25: [2022-11-27 03:51:58,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 03:51:58,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 18: [2022-11-27 03:51:58,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 20: [2022-11-27 03:51:58,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-27 03:51:58,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 03:51:58,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 4: [2022-11-27 03:51:58,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:51:58,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 03:51:58,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 19: [2022-11-27 03:51:58,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 03:51:58,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 03:51:58,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 8: [2022-11-27 03:51:58,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:51:58,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 03:51:58,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 15: [2022-11-27 03:51:58,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-27 03:51:58,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 03:51:58,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 26: [2022-11-27 03:51:58,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-27 03:51:58,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 03:51:58,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 12: [2022-11-27 03:51:58,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:51:58,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 03:51:58,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 14: [2022-11-27 03:51:58,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:51:58,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 03:51:58,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 30: [2022-11-27 03:51:58,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
30: [2022-11-27 03:51:58,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 03:51:58,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 7: [2022-11-27 03:51:58,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:51:58,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 03:51:58,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 16: [2022-11-27 03:51:58,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:51:58,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 03:51:58,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 24: [2022-11-27 03:51:58,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:51:58,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-27 03:51:58,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 3: [2022-11-27 03:51:58,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-27 03:51:58,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 03:51:58,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 29: [2022-11-27 03:51:58,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:51:58,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 03:51:58,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 31: [2022-11-27 03:51:58,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-27 03:51:58,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 03:51:58,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 19: [2022-11-27 03:51:58,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-27 03:51:58,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 03:51:58,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 8: [2022-11-27 03:51:58,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-27 03:51:58,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 03:51:58,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 30: [2022-11-27 03:51:58,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-27 03:51:58,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 03:51:58,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 15: [2022-11-27 03:51:58,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:51:58,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 03:51:58,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 22: [2022-11-27 03:51:58,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-27 03:51:58,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-27 03:51:58,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 03:51:58,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 
22: [2022-11-27 03:51:58,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 03:51:58,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 16: [2022-11-27 03:51:58,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:51:58,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:51:58,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 16: [2022-11-27 03:51:58,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 2: [2022-11-27 03:51:58,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 14: [2022-11-27 03:51:58,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 2: [2022-11-27 03:51:58,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 14: [2022-11-27 03:51:58,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 16: [2022-11-27 03:51:58,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 25: [2022-11-27 03:51:58,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 
0: [2022-11-27 03:51:58,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:51:58,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 03:51:58,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 25: [2022-11-27 03:51:58,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 03:51:58,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 18: [2022-11-27 03:51:58,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 31: [2022-11-27 03:51:58,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 18: [2022-11-27 03:51:58,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 03:51:58,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 31: [2022-11-27 03:51:58,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 03:51:58,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 23: [2022-11-27 03:51:58,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
27: [2022-11-27 03:51:58,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:51:58,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 23: [2022-11-27 03:51:58,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 27: [2022-11-27 03:51:58,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 10: [2022-11-27 03:51:58,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 23: [2022-11-27 03:51:58,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 27: [2022-11-27 03:51:58,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 10: [2022-11-27 03:51:58,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 17: [2022-11-27 03:51:58,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-27 03:51:58,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 03:51:58,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 20: [2022-11-27 03:51:58,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
20: [2022-11-27 03:51:58,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 03:51:58,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 24: [2022-11-27 03:51:58,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:51:58,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 03:51:58,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 11: [2022-11-27 03:51:58,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:51:58,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 03:51:58,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 5: [2022-11-27 03:51:58,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:51:58,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 9: [2022-11-27 03:51:58,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:51:58,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 
26: [2022-11-27 03:51:58,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:51:58,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 21: [2022-11-27 03:51:58,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:51:58,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 21: [2022-11-27 03:51:58,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 26: [2022-11-27 03:51:58,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 21: [2022-11-27 03:51:58,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 26: [2022-11-27 03:51:58,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 29: [2022-11-27 03:51:58,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-27 03:51:58,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 03:51:58,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 6: [2022-11-27 03:51:58,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
3: [2022-11-27 03:51:58,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:51:58,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 03:51:58,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 3: [2022-11-27 03:51:58,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 03:51:58,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 2: [2022-11-27 03:51:58,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:51:58,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 03:51:58,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 12: [2022-11-27 03:51:58,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:51:58,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 03:51:58,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 24: [2022-11-27 03:51:58,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
6: [2022-11-27 03:51:58,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:51:58,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 24: [2022-11-27 03:51:58,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 6: [2022-11-27 03:51:58,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 1: [2022-11-27 03:51:58,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 24: [2022-11-27 03:51:58,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 1: [2022-11-27 03:51:58,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 03:51:58,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 7: [2022-11-27 03:51:58,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:51:58,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:51:58,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 03:51:58,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 
13: [2022-11-27 03:51:58,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 03:51:58,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 13: [2022-11-27 03:51:58,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:51:58,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 4: [2022-11-27 03:51:58,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:51:58,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 4: [2022-11-27 03:51:58,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 03:51:58,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 21: [2022-11-27 03:51:58,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-27 03:51:58,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 03:51:58,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 28: [2022-11-27 03:51:58,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
28: [2022-11-27 03:51:58,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 03:51:58,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 11: [2022-11-27 03:51:58,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:51:58,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 03:51:58,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 13: [2022-11-27 03:51:58,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:51:58,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 03:51:58,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 6: [2022-11-27 03:51:58,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:51:58,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step150000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 03:51:58,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step150000 is ready now! 
0: successfully saved checkpoint at iteration 150000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2522.73 31: iteration 150010/ 173500 | consumed samples: 38402560 | consumed tokens: 78648442880 | elapsed time per iteration (s): 1.08 | learning rate: 2.818E-05 | global batch size: 256 | lm loss: 1.925529E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.632 | TFLOPs: 14.38 | 31: iteration 150020/ 173500 | consumed samples: 38405120 | consumed tokens: 78653685760 | elapsed time per iteration (s): 0.75 | learning rate: 2.817E-05 | global batch size: 256 | lm loss: 1.952839E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.067 | TFLOPs: 20.69 | 31: iteration 150030/ 173500 | consumed samples: 38407680 | consumed tokens: 78658928640 | elapsed time per iteration (s): 0.75 | learning rate: 2.817E-05 | global batch size: 256 | lm loss: 1.894772E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.342 | TFLOPs: 20.71 | 31: iteration 150040/ 173500 | consumed samples: 38410240 | consumed tokens: 78664171520 | elapsed time per iteration (s): 0.75 | learning rate: 2.816E-05 | global batch size: 256 | lm loss: 1.928151E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.734 | TFLOPs: 20.55 | 31: iteration 150050/ 173500 | consumed samples: 38412800 | consumed tokens: 78669414400 | elapsed time per iteration (s): 0.77 | learning rate: 2.815E-05 | global batch size: 256 | lm loss: 1.916358E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.788 | TFLOPs: 20.01 | 31: iteration 150060/ 173500 | consumed samples: 38415360 | consumed tokens: 78674657280 | elapsed time per iteration (s): 
0.77 | learning rate: 2.815E-05 | global batch size: 256 | lm loss: 1.911304E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.213 | TFLOPs: 20.16 | 31: iteration 150070/ 173500 | consumed samples: 38417920 | consumed tokens: 78679900160 | elapsed time per iteration (s): 0.80 | learning rate: 2.814E-05 | global batch size: 256 | lm loss: 1.902323E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.938 | TFLOPs: 19.36 | 31: iteration 150080/ 173500 | consumed samples: 38420480 | consumed tokens: 78685143040 | elapsed time per iteration (s): 0.78 | learning rate: 2.813E-05 | global batch size: 256 | lm loss: 1.914920E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.035 | TFLOPs: 19.78 | 31: iteration 150090/ 173500 | consumed samples: 38423040 | consumed tokens: 78690385920 | elapsed time per iteration (s): 0.75 | learning rate: 2.812E-05 | global batch size: 256 | lm loss: 1.913556E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.827 | TFLOPs: 20.56 | 31: iteration 150100/ 173500 | consumed samples: 38425600 | consumed tokens: 78695628800 | elapsed time per iteration (s): 0.80 | learning rate: 2.812E-05 | global batch size: 256 | lm loss: 1.902066E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.594 | TFLOPs: 19.27 | 31: iteration 150110/ 173500 | consumed samples: 38428160 | consumed tokens: 78700871680 | elapsed time per iteration (s): 0.76 | learning rate: 2.811E-05 | global batch size: 256 | lm loss: 1.938837E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.202 | TFLOPs: 20.28 | 31: 
iteration 150120/ 173500 | consumed samples: 38430720 | consumed tokens: 78706114560 | elapsed time per iteration (s): 0.76 | learning rate: 2.810E-05 | global batch size: 256 | lm loss: 1.951851E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.408 | TFLOPs: 20.47 | 31: iteration 150130/ 173500 | consumed samples: 38433280 | consumed tokens: 78711357440 | elapsed time per iteration (s): 0.83 | learning rate: 2.810E-05 | global batch size: 256 | lm loss: 1.911810E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.393 | TFLOPs: 18.72 | 31: iteration 150140/ 173500 | consumed samples: 38435840 | consumed tokens: 78716600320 | elapsed time per iteration (s): 0.75 | learning rate: 2.809E-05 | global batch size: 256 | lm loss: 1.903814E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.212 | TFLOPs: 20.76 | 31: iteration 150150/ 173500 | consumed samples: 38438400 | consumed tokens: 78721843200 | elapsed time per iteration (s): 0.79 | learning rate: 2.808E-05 | global batch size: 256 | lm loss: 1.902290E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.076 | TFLOPs: 19.67 | 31: iteration 150160/ 173500 | consumed samples: 38440960 | consumed tokens: 78727086080 | elapsed time per iteration (s): 0.80 | learning rate: 2.808E-05 | global batch size: 256 | lm loss: 1.918119E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.158 | TFLOPs: 19.43 | 31: iteration 150170/ 173500 | consumed samples: 38443520 | consumed tokens: 78732328960 | elapsed time per iteration (s): 0.78 | learning rate: 2.807E-05 | global batch size: 256 | lm loss: 1.923060E+00 | grad norm: 0.183 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.761 | TFLOPs: 19.83 | 31: iteration 150180/ 173500 | consumed samples: 38446080 | consumed tokens: 78737571840 | elapsed time per iteration (s): 0.76 | learning rate: 2.806E-05 | global batch size: 256 | lm loss: 1.936437E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.834 | TFLOPs: 20.44 | 31: iteration 150190/ 173500 | consumed samples: 38448640 | consumed tokens: 78742814720 | elapsed time per iteration (s): 0.75 | learning rate: 2.806E-05 | global batch size: 256 | lm loss: 1.909862E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.227 | TFLOPs: 20.70 | 31: iteration 150200/ 173500 | consumed samples: 38451200 | consumed tokens: 78748057600 | elapsed time per iteration (s): 0.81 | learning rate: 2.805E-05 | global batch size: 256 | lm loss: 1.907440E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.868 | TFLOPs: 19.17 | 31: iteration 150210/ 173500 | consumed samples: 38453760 | consumed tokens: 78753300480 | elapsed time per iteration (s): 0.84 | learning rate: 2.804E-05 | global batch size: 256 | lm loss: 1.907080E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.560 | TFLOPs: 18.49 | 31: iteration 150220/ 173500 | consumed samples: 38456320 | consumed tokens: 78758543360 | elapsed time per iteration (s): 0.91 | learning rate: 2.804E-05 | global batch size: 256 | lm loss: 1.925171E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 280.983 | TFLOPs: 17.00 | 31: iteration 150230/ 173500 | consumed samples: 38458880 | consumed tokens: 78763786240 | elapsed time per iteration (s): 0.85 | 
learning rate: 2.803E-05 | global batch size: 256 | lm loss: 1.931118E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.855 | TFLOPs: 18.20 | 31: iteration 150240/ 173500 | consumed samples: 38461440 | consumed tokens: 78769029120 | elapsed time per iteration (s): 0.82 | learning rate: 2.802E-05 | global batch size: 256 | lm loss: 1.902435E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.814 | TFLOPs: 18.80 | 31: iteration 150250/ 173500 | consumed samples: 38464000 | consumed tokens: 78774272000 | elapsed time per iteration (s): 0.80 | learning rate: 2.802E-05 | global batch size: 256 | lm loss: 1.913009E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.694 | TFLOPs: 19.34 | 31: iteration 150260/ 173500 | consumed samples: 38466560 | consumed tokens: 78779514880 | elapsed time per iteration (s): 0.78 | learning rate: 2.801E-05 | global batch size: 256 | lm loss: 1.916031E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.013 | TFLOPs: 19.84 | 31: iteration 150270/ 173500 | consumed samples: 38469120 | consumed tokens: 78784757760 | elapsed time per iteration (s): 0.78 | learning rate: 2.800E-05 | global batch size: 256 | lm loss: 1.905626E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.314 | TFLOPs: 19.74 | 31: iteration 150280/ 173500 | consumed samples: 38471680 | consumed tokens: 78790000640 | elapsed time per iteration (s): 0.87 | learning rate: 2.800E-05 | global batch size: 256 | lm loss: 1.906216E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.137 | TFLOPs: 17.86 | 31: iteration 
150290/ 173500 | consumed samples: 38474240 | consumed tokens: 78795243520 | elapsed time per iteration (s): 0.77 | learning rate: 2.799E-05 | global batch size: 256 | lm loss: 1.919497E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.867 | TFLOPs: 20.02 | 31: iteration 150300/ 173500 | consumed samples: 38476800 | consumed tokens: 78800486400 | elapsed time per iteration (s): 0.81 | learning rate: 2.798E-05 | global batch size: 256 | lm loss: 1.878517E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.754 | TFLOPs: 19.04 | 31: iteration 150310/ 173500 | consumed samples: 38479360 | consumed tokens: 78805729280 | elapsed time per iteration (s): 0.81 | learning rate: 2.798E-05 | global batch size: 256 | lm loss: 1.909275E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.720 | TFLOPs: 19.16 | 31: iteration 150320/ 173500 | consumed samples: 38481920 | consumed tokens: 78810972160 | elapsed time per iteration (s): 0.82 | learning rate: 2.797E-05 | global batch size: 256 | lm loss: 1.910622E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.272 | TFLOPs: 18.95 | 31: iteration 150330/ 173500 | consumed samples: 38484480 | consumed tokens: 78816215040 | elapsed time per iteration (s): 0.85 | learning rate: 2.796E-05 | global batch size: 256 | lm loss: 1.928739E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.983 | TFLOPs: 18.27 | 31: iteration 150340/ 173500 | consumed samples: 38487040 | consumed tokens: 78821457920 | elapsed time per iteration (s): 0.94 | learning rate: 2.795E-05 | global batch size: 256 | lm loss: 1.907557E+00 | grad norm: 0.196 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 273.240 | TFLOPs: 16.53 | 31: iteration 150350/ 173500 | consumed samples: 38489600 | consumed tokens: 78826700800 | elapsed time per iteration (s): 0.86 | learning rate: 2.795E-05 | global batch size: 256 | lm loss: 1.925607E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.685 | TFLOPs: 18.01 | 31: iteration 150360/ 173500 | consumed samples: 38492160 | consumed tokens: 78831943680 | elapsed time per iteration (s): 0.93 | learning rate: 2.794E-05 | global batch size: 256 | lm loss: 1.919890E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 274.730 | TFLOPs: 16.62 | 31: iteration 150370/ 173500 | consumed samples: 38494720 | consumed tokens: 78837186560 | elapsed time per iteration (s): 0.89 | learning rate: 2.793E-05 | global batch size: 256 | lm loss: 1.904709E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.250 | TFLOPs: 17.50 | 31: iteration 150380/ 173500 | consumed samples: 38497280 | consumed tokens: 78842429440 | elapsed time per iteration (s): 0.91 | learning rate: 2.793E-05 | global batch size: 256 | lm loss: 1.930091E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 281.723 | TFLOPs: 17.04 | 31: iteration 150390/ 173500 | consumed samples: 38499840 | consumed tokens: 78847672320 | elapsed time per iteration (s): 0.84 | learning rate: 2.792E-05 | global batch size: 256 | lm loss: 1.914381E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.279 | TFLOPs: 18.53 | 31: iteration 150400/ 173500 | consumed samples: 38502400 | consumed tokens: 78852915200 | elapsed time per iteration (s): 0.82 | learning 
rate: 2.791E-05 | global batch size: 256 | lm loss: 1.890715E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.982 | TFLOPs: 18.93 | 31: iteration 150410/ 173500 | consumed samples: 38504960 | consumed tokens: 78858158080 | elapsed time per iteration (s): 0.83 | learning rate: 2.791E-05 | global batch size: 256 | lm loss: 1.908665E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.142 | TFLOPs: 18.64 | 31: iteration 150420/ 173500 | consumed samples: 38507520 | consumed tokens: 78863400960 | elapsed time per iteration (s): 0.87 | learning rate: 2.790E-05 | global batch size: 256 | lm loss: 1.921349E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.116 | TFLOPs: 17.73 | 31: iteration 150430/ 173500 | consumed samples: 38510080 | consumed tokens: 78868643840 | elapsed time per iteration (s): 0.83 | learning rate: 2.789E-05 | global batch size: 256 | lm loss: 1.895867E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.779 | TFLOPs: 18.74 | 31: iteration 150440/ 173500 | consumed samples: 38512640 | consumed tokens: 78873886720 | elapsed time per iteration (s): 2.37 | learning rate: 2.789E-05 | global batch size: 256 | lm loss: 1.908636E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 108.231 | TFLOPs: 6.55 | 31: iteration 150450/ 173500 | consumed samples: 38515200 | consumed tokens: 78879129600 | elapsed time per iteration (s): 0.89 | learning rate: 2.788E-05 | global batch size: 256 | lm loss: 1.915717E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.776 | TFLOPs: 17.35 | 31: iteration 150460/ 
173500 | consumed samples: 38517760 | consumed tokens: 78884372480 | elapsed time per iteration (s): 0.88 | learning rate: 2.787E-05 | global batch size: 256 | lm loss: 1.943069E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.250 | TFLOPs: 17.68 | 31: iteration 150470/ 173500 | consumed samples: 38520320 | consumed tokens: 78889615360 | elapsed time per iteration (s): 0.79 | learning rate: 2.787E-05 | global batch size: 256 | lm loss: 1.926231E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.448 | TFLOPs: 19.63 | 31: iteration 150480/ 173500 | consumed samples: 38522880 | consumed tokens: 78894858240 | elapsed time per iteration (s): 0.80 | learning rate: 2.786E-05 | global batch size: 256 | lm loss: 1.917422E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.855 | TFLOPs: 19.35 | 31: iteration 150490/ 173500 | consumed samples: 38525440 | consumed tokens: 78900101120 | elapsed time per iteration (s): 0.82 | learning rate: 2.785E-05 | global batch size: 256 | lm loss: 1.890644E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.498 | TFLOPs: 18.91 | 31: iteration 150500/ 173500 | consumed samples: 38528000 | consumed tokens: 78905344000 | elapsed time per iteration (s): 0.84 | learning rate: 2.785E-05 | global batch size: 256 | lm loss: 1.938251E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.250 | TFLOPs: 18.53 | 31: iteration 150510/ 173500 | consumed samples: 38530560 | consumed tokens: 78910586880 | elapsed time per iteration (s): 0.81 | learning rate: 2.784E-05 | global batch size: 256 | lm loss: 1.909943E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 314.190 | TFLOPs: 19.01 | 31: iteration 150520/ 173500 | consumed samples: 38533120 | consumed tokens: 78915829760 | elapsed time per iteration (s): 0.80 | learning rate: 2.783E-05 | global batch size: 256 | lm loss: 1.914985E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.426 | TFLOPs: 19.38 | 31: iteration 150530/ 173500 | consumed samples: 38535680 | consumed tokens: 78921072640 | elapsed time per iteration (s): 0.78 | learning rate: 2.783E-05 | global batch size: 256 | lm loss: 1.892080E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.326 | TFLOPs: 19.74 | 31: iteration 150540/ 173500 | consumed samples: 38538240 | consumed tokens: 78926315520 | elapsed time per iteration (s): 0.77 | learning rate: 2.782E-05 | global batch size: 256 | lm loss: 1.929026E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.839 | TFLOPs: 20.08 | 31: iteration 150550/ 173500 | consumed samples: 38540800 | consumed tokens: 78931558400 | elapsed time per iteration (s): 0.78 | learning rate: 2.781E-05 | global batch size: 256 | lm loss: 1.950539E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.364 | TFLOPs: 19.93 | 31: iteration 150560/ 173500 | consumed samples: 38543360 | consumed tokens: 78936801280 | elapsed time per iteration (s): 0.76 | learning rate: 2.781E-05 | global batch size: 256 | lm loss: 1.913173E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.278 | TFLOPs: 20.34 | 31: iteration 150570/ 173500 | consumed samples: 38545920 | consumed tokens: 78942044160 | elapsed time per iteration (s): 0.87 | learning rate: 
2.780E-05 | global batch size: 256 | lm loss: 1.913965E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.766 | TFLOPs: 17.71 | 31: iteration 150580/ 173500 | consumed samples: 38548480 | consumed tokens: 78947287040 | elapsed time per iteration (s): 0.75 | learning rate: 2.779E-05 | global batch size: 256 | lm loss: 1.924282E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.711 | TFLOPs: 20.61 | 31: iteration 150590/ 173500 | consumed samples: 38551040 | consumed tokens: 78952529920 | elapsed time per iteration (s): 0.76 | learning rate: 2.779E-05 | global batch size: 256 | lm loss: 1.925448E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.671 | TFLOPs: 20.25 | 31: iteration 150600/ 173500 | consumed samples: 38553600 | consumed tokens: 78957772800 | elapsed time per iteration (s): 0.78 | learning rate: 2.778E-05 | global batch size: 256 | lm loss: 1.933329E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.523 | TFLOPs: 19.81 | 31: iteration 150610/ 173500 | consumed samples: 38556160 | consumed tokens: 78963015680 | elapsed time per iteration (s): 0.89 | learning rate: 2.777E-05 | global batch size: 256 | lm loss: 1.906815E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.217 | TFLOPs: 17.32 | 31: iteration 150620/ 173500 | consumed samples: 38558720 | consumed tokens: 78968258560 | elapsed time per iteration (s): 0.82 | learning rate: 2.777E-05 | global batch size: 256 | lm loss: 1.932946E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.039 | TFLOPs: 18.88 | 31: iteration 150630/ 173500 | 
consumed samples: 38561280 | consumed tokens: 78973501440 | elapsed time per iteration (s): 0.78 | learning rate: 2.776E-05 | global batch size: 256 | lm loss: 1.924796E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.846 | TFLOPs: 19.77 | 31: iteration 150640/ 173500 | consumed samples: 38563840 | consumed tokens: 78978744320 | elapsed time per iteration (s): 0.78 | learning rate: 2.775E-05 | global batch size: 256 | lm loss: 1.930003E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.653 | TFLOPs: 19.88 | 31: iteration 150650/ 173500 | consumed samples: 38566400 | consumed tokens: 78983987200 | elapsed time per iteration (s): 0.82 | learning rate: 2.775E-05 | global batch size: 256 | lm loss: 1.933133E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.465 | TFLOPs: 18.84 | 31: iteration 150660/ 173500 | consumed samples: 38568960 | consumed tokens: 78989230080 | elapsed time per iteration (s): 0.79 | learning rate: 2.774E-05 | global batch size: 256 | lm loss: 1.950317E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.423 | TFLOPs: 19.57 | 31: iteration 150670/ 173500 | consumed samples: 38571520 | consumed tokens: 78994472960 | elapsed time per iteration (s): 0.75 | learning rate: 2.773E-05 | global batch size: 256 | lm loss: 1.916272E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.722 | TFLOPs: 20.67 | 31: iteration 150680/ 173500 | consumed samples: 38574080 | consumed tokens: 78999715840 | elapsed time per iteration (s): 0.77 | learning rate: 2.773E-05 | global batch size: 256 | lm loss: 1.947966E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 331.534 | TFLOPs: 20.06 | 31: iteration 150690/ 173500 | consumed samples: 38576640 | consumed tokens: 79004958720 | elapsed time per iteration (s): 0.75 | learning rate: 2.772E-05 | global batch size: 256 | lm loss: 1.919892E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.869 | TFLOPs: 20.68 | 31: iteration 150700/ 173500 | consumed samples: 38579200 | consumed tokens: 79010201600 | elapsed time per iteration (s): 0.76 | learning rate: 2.771E-05 | global batch size: 256 | lm loss: 1.895060E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.618 | TFLOPs: 20.43 | 31: iteration 150710/ 173500 | consumed samples: 38581760 | consumed tokens: 79015444480 | elapsed time per iteration (s): 0.76 | learning rate: 2.771E-05 | global batch size: 256 | lm loss: 1.901180E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.666 | TFLOPs: 20.43 | 31: iteration 150720/ 173500 | consumed samples: 38584320 | consumed tokens: 79020687360 | elapsed time per iteration (s): 0.79 | learning rate: 2.770E-05 | global batch size: 256 | lm loss: 1.924414E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.176 | TFLOPs: 19.55 | 31: iteration 150730/ 173500 | consumed samples: 38586880 | consumed tokens: 79025930240 | elapsed time per iteration (s): 0.77 | learning rate: 2.769E-05 | global batch size: 256 | lm loss: 1.900111E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.422 | TFLOPs: 20.23 | 31: iteration 150740/ 173500 | consumed samples: 38589440 | consumed tokens: 79031173120 | elapsed time per iteration (s): 0.75 | learning rate: 
2.769E-05 | global batch size: 256 | lm loss: 1.889258E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.781 | TFLOPs: 20.68 | 31: iteration 150750/ 173500 | consumed samples: 38592000 | consumed tokens: 79036416000 | elapsed time per iteration (s): 0.77 | learning rate: 2.768E-05 | global batch size: 256 | lm loss: 1.958861E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.257 | TFLOPs: 20.04 | 31: iteration 150760/ 173500 | consumed samples: 38594560 | consumed tokens: 79041658880 | elapsed time per iteration (s): 0.87 | learning rate: 2.767E-05 | global batch size: 256 | lm loss: 1.928893E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.540 | TFLOPs: 17.82 | 31: iteration 150770/ 173500 | consumed samples: 38597120 | consumed tokens: 79046901760 | elapsed time per iteration (s): 0.77 | learning rate: 2.767E-05 | global batch size: 256 | lm loss: 1.913623E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.453 | TFLOPs: 20.05 | 31: iteration 150780/ 173500 | consumed samples: 38599680 | consumed tokens: 79052144640 | elapsed time per iteration (s): 0.78 | learning rate: 2.766E-05 | global batch size: 256 | lm loss: 1.916646E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.933 | TFLOPs: 19.90 | 31: iteration 150790/ 173500 | consumed samples: 38602240 | consumed tokens: 79057387520 | elapsed time per iteration (s): 0.75 | learning rate: 2.765E-05 | global batch size: 256 | lm loss: 1.908838E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.175 | TFLOPs: 20.52 | 31: iteration 150800/ 173500 | 
consumed samples: 38604800 | consumed tokens: 79062630400 | elapsed time per iteration (s): 0.91 | learning rate: 2.765E-05 | global batch size: 256 | lm loss: 1.948354E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 281.018 | TFLOPs: 17.00 | 31: iteration 150810/ 173500 | consumed samples: 38607360 | consumed tokens: 79067873280 | elapsed time per iteration (s): 0.76 | learning rate: 2.764E-05 | global batch size: 256 | lm loss: 1.920768E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.949 | TFLOPs: 20.51 | 31: iteration 150820/ 173500 | consumed samples: 38609920 | consumed tokens: 79073116160 | elapsed time per iteration (s): 0.75 | learning rate: 2.763E-05 | global batch size: 256 | lm loss: 1.925358E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.388 | TFLOPs: 20.59 | 31: iteration 150830/ 173500 | consumed samples: 38612480 | consumed tokens: 79078359040 | elapsed time per iteration (s): 0.72 | learning rate: 2.763E-05 | global batch size: 256 | lm loss: 1.935213E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.123 | TFLOPs: 21.36 | 31: iteration 150840/ 173500 | consumed samples: 38615040 | consumed tokens: 79083601920 | elapsed time per iteration (s): 0.80 | learning rate: 2.762E-05 | global batch size: 256 | lm loss: 1.917842E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.966 | TFLOPs: 19.30 | 31: iteration 150850/ 173500 | consumed samples: 38617600 | consumed tokens: 79088844800 | elapsed time per iteration (s): 0.80 | learning rate: 2.761E-05 | global batch size: 256 | lm loss: 1.927322E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 321.146 | TFLOPs: 19.43 | 31: iteration 150860/ 173500 | consumed samples: 38620160 | consumed tokens: 79094087680 | elapsed time per iteration (s): 0.76 | learning rate: 2.761E-05 | global batch size: 256 | lm loss: 1.931542E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.710 | TFLOPs: 20.43 | 31: iteration 150870/ 173500 | consumed samples: 38622720 | consumed tokens: 79099330560 | elapsed time per iteration (s): 0.76 | learning rate: 2.760E-05 | global batch size: 256 | lm loss: 1.924162E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.034 | TFLOPs: 20.39 | 31: iteration 150880/ 173500 | consumed samples: 38625280 | consumed tokens: 79104573440 | elapsed time per iteration (s): 0.76 | learning rate: 2.759E-05 | global batch size: 256 | lm loss: 1.897759E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.387 | TFLOPs: 20.41 | 31: iteration 150890/ 173500 | consumed samples: 38627840 | consumed tokens: 79109816320 | elapsed time per iteration (s): 0.76 | learning rate: 2.759E-05 | global batch size: 256 | lm loss: 1.933917E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.123 | TFLOPs: 20.40 | 31: iteration 150900/ 173500 | consumed samples: 38630400 | consumed tokens: 79115059200 | elapsed time per iteration (s): 0.75 | learning rate: 2.758E-05 | global batch size: 256 | lm loss: 1.901919E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.193 | TFLOPs: 20.76 | 31: iteration 150910/ 173500 | consumed samples: 38632960 | consumed tokens: 79120302080 | elapsed time per iteration (s): 0.74 | learning rate: 
2.757E-05 | global batch size: 256 | lm loss: 1.921631E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.378 | TFLOPs: 21.02 | 31: iteration 150920/ 173500 | consumed samples: 38635520 | consumed tokens: 79125544960 | elapsed time per iteration (s): 0.74 | learning rate: 2.757E-05 | global batch size: 256 | lm loss: 1.906534E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.998 | TFLOPs: 20.99 | 31: iteration 150930/ 173500 | consumed samples: 38638080 | consumed tokens: 79130787840 | elapsed time per iteration (s): 0.76 | learning rate: 2.756E-05 | global batch size: 256 | lm loss: 1.935954E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.332 | TFLOPs: 20.35 | 31: iteration 150940/ 173500 | consumed samples: 38640640 | consumed tokens: 79136030720 | elapsed time per iteration (s): 0.72 | learning rate: 2.755E-05 | global batch size: 256 | lm loss: 1.917949E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.696 | TFLOPs: 21.52 | 31: iteration 150950/ 173500 | consumed samples: 38643200 | consumed tokens: 79141273600 | elapsed time per iteration (s): 0.73 | learning rate: 2.755E-05 | global batch size: 256 | lm loss: 1.921214E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.635 | TFLOPs: 21.15 | 31: iteration 150960/ 173500 | consumed samples: 38645760 | consumed tokens: 79146516480 | elapsed time per iteration (s): 0.76 | learning rate: 2.754E-05 | global batch size: 256 | lm loss: 1.918110E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.960 | TFLOPs: 20.39 | 31: iteration 150970/ 173500 | 
consumed samples: 38648320 | consumed tokens: 79151759360 | elapsed time per iteration (s): 0.78 | learning rate: 2.753E-05 | global batch size: 256 | lm loss: 1.925474E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.368 | TFLOPs: 19.87 | 31: iteration 150980/ 173500 | consumed samples: 38650880 | consumed tokens: 79157002240 | elapsed time per iteration (s): 0.84 | learning rate: 2.753E-05 | global batch size: 256 | lm loss: 1.925947E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.951 | TFLOPs: 18.45 | 31: iteration 150990/ 173500 | consumed samples: 38653440 | consumed tokens: 79162245120 | elapsed time per iteration (s): 0.78 | learning rate: 2.752E-05 | global batch size: 256 | lm loss: 1.922120E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.814 | TFLOPs: 19.89 | 31: iteration 151000/ 173500 | consumed samples: 38656000 | consumed tokens: 79167488000 | elapsed time per iteration (s): 0.86 | learning rate: 2.751E-05 | global batch size: 256 | lm loss: 1.897414E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.309 | TFLOPs: 18.05 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 151000 | lm loss value: 1.807315E+00 | lm loss PPL: 6.094064E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 151000 to checkpoints_1b1long 0: [2022-11-27 04:05:32,312] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step151000 is begin to save! 
0: [2022-11-27 04:05:32,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_01-model_00-model_states.pt... 0: [2022-11-27 04:05:32,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_01-model_00-model_states.pt. 0: [2022-11-27 04:05:32,541] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_03-model_00-model_states.pt... 0: [2022-11-27 04:05:32,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_03-model_00-model_states.pt. 0: [2022-11-27 04:05:32,624] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_04-model_00-model_states.pt... 0: [2022-11-27 04:05:32,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_04-model_00-model_states.pt. 0: [2022-11-27 04:05:32,704] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_05-model_00-model_states.pt... 0: [2022-11-27 04:05:32,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_05-model_00-model_states.pt. 0: [2022-11-27 04:05:32,780] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_06-model_00-model_states.pt... 0: [2022-11-27 04:05:32,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_06-model_00-model_states.pt. 0: [2022-11-27 04:05:32,858] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_07-model_00-model_states.pt... 0: [2022-11-27 04:05:32,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_07-model_00-model_states.pt. 
0: [2022-11-27 04:05:32,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_08-model_00-model_states.pt... 0: [2022-11-27 04:05:33,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_08-model_00-model_states.pt. 0: [2022-11-27 04:05:33,006] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_09-model_00-model_states.pt... 0: [2022-11-27 04:05:33,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_09-model_00-model_states.pt. 0: [2022-11-27 04:05:33,078] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_10-model_00-model_states.pt... 0: [2022-11-27 04:05:33,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_10-model_00-model_states.pt. 0: [2022-11-27 04:05:33,154] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_11-model_00-model_states.pt... 0: [2022-11-27 04:05:33,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_11-model_00-model_states.pt. 0: [2022-11-27 04:05:33,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_12-model_00-model_states.pt... 0: [2022-11-27 04:05:33,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_12-model_00-model_states.pt. 0: [2022-11-27 04:05:33,304] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_13-model_00-model_states.pt... 0: [2022-11-27 04:05:33,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_13-model_00-model_states.pt. 
0: [2022-11-27 04:05:33,377] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_14-model_00-model_states.pt... 0: [2022-11-27 04:05:33,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_14-model_00-model_states.pt. 0: [2022-11-27 04:05:33,454] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_15-model_00-model_states.pt... 0: [2022-11-27 04:05:33,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_15-model_00-model_states.pt. 0: [2022-11-27 04:05:33,529] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_16-model_00-model_states.pt... 0: [2022-11-27 04:05:33,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_16-model_00-model_states.pt. 0: [2022-11-27 04:05:33,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_17-model_00-model_states.pt... 0: [2022-11-27 04:05:33,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_17-model_00-model_states.pt. 0: [2022-11-27 04:05:33,677] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_18-model_00-model_states.pt... 0: [2022-11-27 04:05:33,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_18-model_00-model_states.pt. 0: [2022-11-27 04:05:33,752] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_19-model_00-model_states.pt... 0: [2022-11-27 04:05:33,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_19-model_00-model_states.pt. 
0: [2022-11-27 04:05:33,827] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_20-model_00-model_states.pt... 0: [2022-11-27 04:05:33,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_20-model_00-model_states.pt. 0: [2022-11-27 04:05:33,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_21-model_00-model_states.pt... 0: [2022-11-27 04:05:33,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_21-model_00-model_states.pt. 0: [2022-11-27 04:05:33,977] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_22-model_00-model_states.pt... 0: [2022-11-27 04:05:34,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_22-model_00-model_states.pt. 0: [2022-11-27 04:05:34,051] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_23-model_00-model_states.pt... 0: [2022-11-27 04:05:34,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_23-model_00-model_states.pt. 0: [2022-11-27 04:05:34,126] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_24-model_00-model_states.pt... 0: [2022-11-27 04:05:34,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_24-model_00-model_states.pt. 0: [2022-11-27 04:05:34,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_25-model_00-model_states.pt... 0: [2022-11-27 04:05:34,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_25-model_00-model_states.pt. 
0: [2022-11-27 04:05:34,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_26-model_00-model_states.pt... 0: [2022-11-27 04:05:34,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_26-model_00-model_states.pt. 0: [2022-11-27 04:05:34,351] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_27-model_00-model_states.pt... 0: [2022-11-27 04:05:34,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_27-model_00-model_states.pt. 0: [2022-11-27 04:05:34,424] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_28-model_00-model_states.pt... 0: [2022-11-27 04:05:34,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_28-model_00-model_states.pt. 0: [2022-11-27 04:05:34,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/layer_30-model_00-model_states.pt... 0: [2022-11-27 04:05:34,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/layer_30-model_00-model_states.pt. 0: [2022-11-27 04:05:34,502] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step151000/mp_rank_00_model_states.pt 0: [2022-11-27 04:05:34,502] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/mp_rank_00_model_states.pt... 0: [2022-11-27 04:05:34,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/mp_rank_00_model_states.pt. 0: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
6: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
7: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
1: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
13: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 
20: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 
11: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
14: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 
30: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 
27: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
5: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 
10: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 
12: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 
28: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 
22: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 
19: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 
8: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
12: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 
28: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 
22: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 
0: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
12: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 
18: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 
23: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 
9: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:05:34,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:05:34,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:05:34,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 04:05:34,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 31: [2022-11-27 04:05:34,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:05:34,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 04:05:34,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 26: [2022-11-27 04:05:34,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:05:34,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 04:05:34,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 
6: [2022-11-27 04:05:34,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:05:34,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:05:34,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:05:34,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:05:34,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 4: [2022-11-27 04:05:34,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 13: [2022-11-27 04:05:34,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 6: [2022-11-27 04:05:34,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 4: [2022-11-27 04:05:34,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 13: [2022-11-27 04:05:34,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 1: [2022-11-27 04:05:34,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 04:05:34,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 
14: [2022-11-27 04:05:34,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:05:34,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 04:05:34,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 23: [2022-11-27 04:05:34,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:05:34,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 04:05:34,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 24: [2022-11-27 04:05:34,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:05:34,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:05:34,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 8: [2022-11-27 04:05:34,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:05:34,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 
8: [2022-11-27 04:05:34,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 24: [2022-11-27 04:05:34,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 8: [2022-11-27 04:05:34,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 24: [2022-11-27 04:05:34,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 18: [2022-11-27 04:05:34,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:05:34,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 04:05:34,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 25: [2022-11-27 04:05:34,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:05:34,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-27 04:05:34,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 20: [2022-11-27 04:05:34,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-27 04:05:34,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 04:05:34,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 10: [2022-11-27 04:05:34,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:05:34,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:05:34,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:05:34,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 12: [2022-11-27 04:05:34,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 20: [2022-11-27 04:05:34,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 10: [2022-11-27 04:05:34,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 12: [2022-11-27 04:05:34,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 20: [2022-11-27 04:05:34,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 29: [2022-11-27 04:05:34,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
3: [2022-11-27 04:05:34,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:05:34,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 04:05:34,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 9: [2022-11-27 04:05:34,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:05:34,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 04:05:34,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 19: [2022-11-27 04:05:34,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:05:34,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 04:05:34,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 11: [2022-11-27 04:05:34,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
29: [2022-11-27 04:05:34,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 11: [2022-11-27 04:05:34,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 29: [2022-11-27 04:05:34,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 11: [2022-11-27 04:05:34,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 19: [2022-11-27 04:05:34,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:05:34,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:05:34,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 19: [2022-11-27 04:05:34,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 6: [2022-11-27 04:05:34,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:05:34,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 19: [2022-11-27 04:05:34,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 
6: [2022-11-27 04:05:34,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 04:05:34,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 15: [2022-11-27 04:05:34,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:05:34,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 04:05:34,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 8: [2022-11-27 04:05:34,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:05:34,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:05:34,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 31: [2022-11-27 04:05:34,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 8: [2022-11-27 04:05:34,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 31: [2022-11-27 04:05:34,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 13: [2022-11-27 04:05:34,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-27 04:05:34,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 04:05:34,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 14: [2022-11-27 04:05:34,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:05:34,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:05:34,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 4: [2022-11-27 04:05:34,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 14: [2022-11-27 04:05:34,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 4: [2022-11-27 04:05:34,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 18: [2022-11-27 04:05:34,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:05:34,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 04:05:34,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 1: [2022-11-27 04:05:34,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-27 04:05:34,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 04:05:34,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 2: [2022-11-27 04:05:34,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:05:34,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 04:05:34,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 10: [2022-11-27 04:05:34,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:05:34,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 04:05:34,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 28: [2022-11-27 04:05:34,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:05:34,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:05:34,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-27 04:05:34,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 04:05:34,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 28: [2022-11-27 04:05:34,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 04:05:34,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 24: [2022-11-27 04:05:34,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:05:34,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 28: [2022-11-27 04:05:34,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 24: [2022-11-27 04:05:34,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 04:05:34,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 12: [2022-11-27 04:05:34,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:05:34,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 04:05:34,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 
23: [2022-11-27 04:05:34,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:05:34,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 04:05:34,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 13: [2022-11-27 04:05:34,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:05:34,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 04:05:34,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 0: [2022-11-27 04:05:34,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:05:34,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:05:34,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:05:34,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 26: [2022-11-27 04:05:34,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
0: [2022-11-27 04:05:34,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:05:34,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 26: [2022-11-27 04:05:34,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 0: [2022-11-27 04:05:34,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 26: [2022-11-27 04:05:34,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 0: [2022-11-27 04:05:34,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 26: [2022-11-27 04:05:34,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 26: [2022-11-27 04:05:34,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 11: [2022-11-27 04:05:34,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:05:34,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 04:05:34,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 12: [2022-11-27 04:05:34,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-27 04:05:34,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 04:05:34,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 6: [2022-11-27 04:05:34,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:05:34,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 04:05:34,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 4: [2022-11-27 04:05:34,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:05:34,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 04:05:34,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 15: [2022-11-27 04:05:34,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:05:34,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 04:05:34,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 21: [2022-11-27 04:05:34,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-27 04:05:34,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 04:05:34,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 0: [2022-11-27 04:05:34,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:05:34,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:05:34,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 0: [2022-11-27 04:05:34,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 31: [2022-11-27 04:05:34,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 0: [2022-11-27 04:05:34,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 25: [2022-11-27 04:05:34,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:05:34,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 04:05:34,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 3: [2022-11-27 04:05:34,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-27 04:05:34,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 04:05:34,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 25: [2022-11-27 04:05:34,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:05:34,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 19: [2022-11-27 04:05:34,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:05:34,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 19: [2022-11-27 04:05:34,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 04:05:34,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 27: [2022-11-27 04:05:34,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:05:34,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:05:34,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
27: [2022-11-27 04:05:34,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 04:05:34,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 04:05:34,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 04:05:34,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 27: [2022-11-27 04:05:34,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 27: [2022-11-27 04:05:34,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 2: [2022-11-27 04:05:34,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:05:34,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 04:05:34,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 13: [2022-11-27 04:05:34,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:05:34,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 04:05:34,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 
9: [2022-11-27 04:05:34,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:05:34,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 04:05:34,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 8: [2022-11-27 04:05:34,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:05:34,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 04:05:34,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 9: [2022-11-27 04:05:34,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:05:34,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 04:05:34,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 18: [2022-11-27 04:05:34,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:05:34,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
18: [2022-11-27 04:05:34,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 10: [2022-11-27 04:05:34,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 18: [2022-11-27 04:05:34,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 10: [2022-11-27 04:05:34,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 22: [2022-11-27 04:05:34,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:05:34,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 04:05:34,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 29: [2022-11-27 04:05:34,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:05:34,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:05:34,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 04:05:34,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 
29: [2022-11-27 04:05:34,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 04:05:34,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 20: [2022-11-27 04:05:34,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:05:34,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 04:05:34,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 24: [2022-11-27 04:05:34,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:05:34,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 04:05:34,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 1: [2022-11-27 04:05:34,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:05:34,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
1: [2022-11-27 04:05:34,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 28: [2022-11-27 04:05:34,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:05:34,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 28: [2022-11-27 04:05:34,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 04:05:34,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 04:05:34,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 28: [2022-11-27 04:05:34,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 29: [2022-11-27 04:05:34,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:05:34,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 04:05:34,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 23: [2022-11-27 04:05:34,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:05:34,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
21: [2022-11-27 04:05:34,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 23: [2022-11-27 04:05:34,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 04:05:34,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 21: [2022-11-27 04:05:34,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 22: [2022-11-27 04:05:34,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:05:34,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 04:05:34,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 23: [2022-11-27 04:05:34,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:05:34,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:05:34,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 23: [2022-11-27 04:05:34,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 11: [2022-11-27 04:05:34,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 
23: [2022-11-27 04:05:34,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 20: [2022-11-27 04:05:34,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:05:34,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:05:34,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 30: [2022-11-27 04:05:34,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 20: [2022-11-27 04:05:34,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 30: [2022-11-27 04:05:34,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 6: [2022-11-27 04:05:34,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:05:34,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 17: [2022-11-27 04:05:34,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:05:34,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 
17: [2022-11-27 04:05:34,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 04:05:34,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 31: [2022-11-27 04:05:34,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:05:34,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:05:34,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 04:05:34,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 10: [2022-11-27 04:05:34,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:05:34,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:05:34,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 14: [2022-11-27 04:05:34,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
31: [2022-11-27 04:05:34,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 17: [2022-11-27 04:05:34,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 10: [2022-11-27 04:05:34,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 14: [2022-11-27 04:05:34,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 31: [2022-11-27 04:05:34,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 17: [2022-11-27 04:05:34,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 14: [2022-11-27 04:05:34,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 3: [2022-11-27 04:05:34,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:05:34,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 04:05:34,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 2: [2022-11-27 04:05:34,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-27 04:05:34,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 04:05:34,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 15: [2022-11-27 04:05:34,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:05:34,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 04:05:34,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 12: [2022-11-27 04:05:34,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:05:34,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 04:05:34,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 26: [2022-11-27 04:05:34,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:05:34,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 04:05:34,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 25: [2022-11-27 04:05:34,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
25: [2022-11-27 04:05:34,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 04:05:34,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 19: [2022-11-27 04:05:34,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:05:34,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 04:05:34,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 0: [2022-11-27 04:05:34,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:05:34,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 04:05:34,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 1: [2022-11-27 04:05:34,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:05:34,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 04:05:34,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 16: [2022-11-27 04:05:34,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
4: [2022-11-27 04:05:34,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:05:34,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:05:34,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:05:34,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 04:05:34,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 4: [2022-11-27 04:05:34,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 16: [2022-11-27 04:05:34,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 04:05:34,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 4: [2022-11-27 04:05:34,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 16: [2022-11-27 04:05:34,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 16: [2022-11-27 04:05:34,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 24: [2022-11-27 04:05:34,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
24: [2022-11-27 04:05:34,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 04:05:34,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 18: [2022-11-27 04:05:34,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:05:34,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 04:05:34,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 17: [2022-11-27 04:05:34,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:05:34,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 04:05:34,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 7: [2022-11-27 04:05:34,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:05:34,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:05:34,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-27 04:05:34,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 04:05:34,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 04:05:34,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 7: [2022-11-27 04:05:34,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 7: [2022-11-27 04:05:34,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 04:05:34,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 30: [2022-11-27 04:05:34,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:05:34,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 04:05:34,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 8: [2022-11-27 04:05:34,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:05:34,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 04:05:34,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 
9: [2022-11-27 04:05:34,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:05:34,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 04:05:34,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 3: [2022-11-27 04:05:34,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:05:34,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 04:05:34,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 27: [2022-11-27 04:05:34,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:05:34,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 04:05:34,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 22: [2022-11-27 04:05:34,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:05:34,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 04:05:34,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 
7: [2022-11-27 04:05:34,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:05:34,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 04:05:34,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 30: [2022-11-27 04:05:34,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:05:34,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 04:05:34,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 21: [2022-11-27 04:05:34,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:05:34,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 04:05:34,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 0: [2022-11-27 04:05:34,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 04:05:34,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 30: [2022-11-27 04:05:34,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-27 04:05:34,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 04:05:34,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 1: [2022-11-27 04:05:34,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:05:34,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 04:05:34,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 5: [2022-11-27 04:05:34,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:05:34,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:05:34,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:05:34,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-27 04:05:34,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 04:05:34,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 04:05:34,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 04:05:34,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 04:05:34,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 5: [2022-11-27 04:05:34,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 5: [2022-11-27 04:05:34,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 5: [2022-11-27 04:05:34,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 20: [2022-11-27 04:05:34,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:05:34,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 04:05:34,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 11: [2022-11-27 04:05:34,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-27 04:05:34,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 04:05:34,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 5: [2022-11-27 04:05:34,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:05:34,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 04:05:34,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 17: [2022-11-27 04:05:34,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:05:34,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 04:05:34,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 6: [2022-11-27 04:05:34,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:05:34,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 04:05:34,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 4: [2022-11-27 04:05:34,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-27 04:05:34,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 04:05:34,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 31: [2022-11-27 04:05:34,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:05:34,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 04:05:34,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 29: [2022-11-27 04:05:34,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:05:34,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 04:05:34,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 23: [2022-11-27 04:05:34,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:05:34,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 04:05:34,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 24: [2022-11-27 04:05:34,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
28: [2022-11-27 04:05:34,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:05:34,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 24: [2022-11-27 04:05:34,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 04:05:34,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 28: [2022-11-27 04:05:34,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 26: [2022-11-27 04:05:34,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:05:34,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 04:05:34,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 13: [2022-11-27 04:05:34,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:05:34,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 04:05:34,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 15: [2022-11-27 04:05:34,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-27 04:05:34,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 04:05:34,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 0: [2022-11-27 04:05:34,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:05:34,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 04:05:34,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 10: [2022-11-27 04:05:34,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:05:34,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 04:05:34,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 3: [2022-11-27 04:05:34,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:05:34,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 04:05:34,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 12: [2022-11-27 04:05:34,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-27 04:05:34,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 04:05:34,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 19: [2022-11-27 04:05:34,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:05:34,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 04:05:34,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 2: [2022-11-27 04:05:34,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:05:34,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 16: [2022-11-27 04:05:34,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:05:34,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 16: [2022-11-27 04:05:34,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 04:05:34,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 8: [2022-11-27 04:05:34,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-27 04:05:34,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 22: [2022-11-27 04:05:34,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:05:34,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 22: [2022-11-27 04:05:34,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 04:05:34,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 25: [2022-11-27 04:05:34,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:05:34,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 04:05:34,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 27: [2022-11-27 04:05:34,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:05:34,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 04:05:34,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 21: [2022-11-27 04:05:34,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
21: [2022-11-27 04:05:34,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 04:05:34,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 9: [2022-11-27 04:05:34,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:05:34,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 14: [2022-11-27 04:05:34,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:05:34,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 14: [2022-11-27 04:05:34,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 04:05:34,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 1: [2022-11-27 04:05:34,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:05:34,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:05:34,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 04:05:34,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 
1: [2022-11-27 04:05:34,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 04:05:34,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 18: [2022-11-27 04:05:34,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:05:34,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 04:05:34,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 20: [2022-11-27 04:05:34,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:05:34,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 04:05:34,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 31: [2022-11-27 04:05:34,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:05:34,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 04:05:34,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 17: [2022-11-27 04:05:34,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 
17: [2022-11-27 04:05:34,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 04:05:34,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 6: [2022-11-27 04:05:34,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:05:34,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 4: [2022-11-27 04:05:34,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:05:34,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 4: [2022-11-27 04:05:34,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 04:05:34,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 11: [2022-11-27 04:05:34,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:05:34,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 04:05:34,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 5: [2022-11-27 04:05:34,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-27 04:05:34,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 04:05:34,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 29: [2022-11-27 04:05:34,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:05:34,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 04:05:34,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 23: [2022-11-27 04:05:34,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:05:34,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 04:05:34,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 24: [2022-11-27 04:05:34,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:05:34,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 04:05:34,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 15: [2022-11-27 04:05:34,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-27 04:05:34,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 04:05:34,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 26: [2022-11-27 04:05:34,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:05:34,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 04:05:34,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 13: [2022-11-27 04:05:34,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:05:34,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 04:05:34,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 28: [2022-11-27 04:05:34,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:05:34,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-27 04:05:34,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 0: [2022-11-27 04:05:34,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:05:34,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 0: [2022-11-27 04:05:34,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 04:05:34,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 18: [2022-11-27 04:05:34,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:05:34,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:05:34,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 04:05:34,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 3: [2022-11-27 04:05:34,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 04:05:34,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 
28: [2022-11-27 04:05:34,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 19: [2022-11-27 04:05:34,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:05:34,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 19: [2022-11-27 04:05:34,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 04:05:34,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 21: [2022-11-27 04:05:34,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:05:34,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 04:05:34,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 25: [2022-11-27 04:05:34,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:05:34,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:05:34,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 04:05:34,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 
25: [2022-11-27 04:05:34,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 04:05:34,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 22: [2022-11-27 04:05:34,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:05:34,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:05:34,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 12: [2022-11-27 04:05:34,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 22: [2022-11-27 04:05:34,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 12: [2022-11-27 04:05:34,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 27: [2022-11-27 04:05:34,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:05:34,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 04:05:34,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 5: [2022-11-27 04:05:34,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-27 04:05:34,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 2: [2022-11-27 04:05:34,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:05:34,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 2: [2022-11-27 04:05:34,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 04:05:34,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 7: [2022-11-27 04:05:34,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:05:34,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 04:05:34,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 9: [2022-11-27 04:05:34,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:05:34,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:05:34,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 04:05:34,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 
1: [2022-11-27 04:05:34,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 04:05:34,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 4: [2022-11-27 04:05:34,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:05:34,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 04:05:34,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 31: [2022-11-27 04:05:34,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:05:34,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 04:05:34,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 20: [2022-11-27 04:05:34,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:05:34,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 04:05:34,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 17: [2022-11-27 04:05:34,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-27 04:05:34,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 04:05:34,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 6: [2022-11-27 04:05:34,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:05:34,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 04:05:34,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 11: [2022-11-27 04:05:34,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:05:34,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 04:05:34,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 24: [2022-11-27 04:05:34,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:05:34,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 04:05:34,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 30: [2022-11-27 04:05:34,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
30: [2022-11-27 04:05:34,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 04:05:34,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 29: [2022-11-27 04:05:34,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:05:34,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 04:05:34,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 15: [2022-11-27 04:05:34,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:05:34,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 04:05:34,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 28: [2022-11-27 04:05:34,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:05:34,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:05:34,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 04:05:34,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 
23: [2022-11-27 04:05:34,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:05:34,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 26: [2022-11-27 04:05:34,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:05:34,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 26: [2022-11-27 04:05:34,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 04:05:34,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 10: [2022-11-27 04:05:34,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:05:34,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 04:05:34,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 28: [2022-11-27 04:05:34,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 04:05:34,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 14: [2022-11-27 04:05:34,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-27 04:05:34,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 13: [2022-11-27 04:05:34,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:05:34,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 13: [2022-11-27 04:05:34,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 04:05:34,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 12: [2022-11-27 04:05:34,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:05:34,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 04:05:34,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 0: [2022-11-27 04:05:34,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:05:34,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 04:05:34,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 2: [2022-11-27 04:05:34,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-27 04:05:34,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 04:05:34,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 10: [2022-11-27 04:05:34,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:05:34,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 04:05:34,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 9: [2022-11-27 04:05:34,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:05:34,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 04:05:34,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 14: [2022-11-27 04:05:34,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:05:34,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
14: [2022-11-27 04:05:34,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 18: [2022-11-27 04:05:34,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 14: [2022-11-27 04:05:34,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 18: [2022-11-27 04:05:34,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 20: [2022-11-27 04:05:34,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:05:34,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 04:05:34,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 19: [2022-11-27 04:05:34,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:05:34,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 04:05:34,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 27: [2022-11-27 04:05:34,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
27: [2022-11-27 04:05:34,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 04:05:34,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 26: [2022-11-27 04:05:34,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:05:34,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 04:05:34,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 0: [2022-11-27 04:05:34,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:05:34,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 04:05:34,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 6: [2022-11-27 04:05:34,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:05:34,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 04:05:34,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 8: [2022-11-27 04:05:34,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-27 04:05:34,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 04:05:34,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 9: [2022-11-27 04:05:34,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:05:34,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 04:05:34,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 23: [2022-11-27 04:05:34,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:05:34,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 04:05:34,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 7: [2022-11-27 04:05:34,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:05:34,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 04:05:34,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 12: [2022-11-27 04:05:34,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-27 04:05:34,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 04:05:34,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 10: [2022-11-27 04:05:34,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:05:34,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 04:05:34,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 21: [2022-11-27 04:05:34,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:05:34,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 3: [2022-11-27 04:05:34,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:05:34,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:05:34,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 1: [2022-11-27 04:05:34,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
3: [2022-11-27 04:05:34,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 22: [2022-11-27 04:05:34,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 04:05:34,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 1: [2022-11-27 04:05:34,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 3: [2022-11-27 04:05:34,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 1: [2022-11-27 04:05:34,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 14: [2022-11-27 04:05:34,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:05:34,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 04:05:34,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 17: [2022-11-27 04:05:34,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:05:34,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
17: [2022-11-27 04:05:34,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 24: [2022-11-27 04:05:34,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:05:34,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 24: [2022-11-27 04:05:34,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 17: [2022-11-27 04:05:34,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 4: [2022-11-27 04:05:34,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 24: [2022-11-27 04:05:34,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 31: [2022-11-27 04:05:34,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:05:34,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:05:34,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 04:05:34,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 
2: [2022-11-27 04:05:34,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 04:05:34,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 11: [2022-11-27 04:05:34,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:05:34,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 04:05:34,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 25: [2022-11-27 04:05:34,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:05:34,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:05:34,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 28: [2022-11-27 04:05:34,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:05:34,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 04:05:34,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 25: [2022-11-27 04:05:34,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 
28: [2022-11-27 04:05:34,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 15: [2022-11-27 04:05:34,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:05:34,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 15: [2022-11-27 04:05:34,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 04:05:34,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 16: [2022-11-27 04:05:34,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:05:34,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 04:05:34,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 29: [2022-11-27 04:05:34,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:05:34,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 04:05:34,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 13: [2022-11-27 04:05:34,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
22: [2022-11-27 04:05:34,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:05:34,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 22: [2022-11-27 04:05:34,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 13: [2022-11-27 04:05:34,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 22: [2022-11-27 04:05:34,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 5: [2022-11-27 04:05:34,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:05:34,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 04:05:34,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 18: [2022-11-27 04:05:34,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:05:34,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 3: [2022-11-27 04:05:34,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:05:34,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 
3: [2022-11-27 04:05:34,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 04:05:34,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 8: [2022-11-27 04:05:34,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:05:34,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 04:05:34,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 16: [2022-11-27 04:05:34,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:05:34,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 04:05:34,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 7: [2022-11-27 04:05:34,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:05:34,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 04:05:34,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 21: [2022-11-27 04:05:34,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-27 04:05:34,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 04:05:34,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 19: [2022-11-27 04:05:34,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:05:34,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 04:05:34,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 27: [2022-11-27 04:05:34,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:05:34,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 04:05:34,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 16: [2022-11-27 04:05:34,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:05:34,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 04:05:34,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 22: [2022-11-27 04:05:34,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
22: [2022-11-27 04:05:34,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 04:05:34,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 30: [2022-11-27 04:05:34,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:05:34,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 04:05:34,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 30: [2022-11-27 04:05:34,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:05:34,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step151000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 04:05:34,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151000 is ready now! 
0: successfully saved checkpoint at iteration 151000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2567.94 31: iteration 151010/ 173500 | consumed samples: 38658560 | consumed tokens: 79172730880 | elapsed time per iteration (s): 1.09 | learning rate: 2.751E-05 | global batch size: 256 | lm loss: 1.904547E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.143 | TFLOPs: 14.23 | 31: iteration 151020/ 173500 | consumed samples: 38661120 | consumed tokens: 79177973760 | elapsed time per iteration (s): 0.84 | learning rate: 2.750E-05 | global batch size: 256 | lm loss: 1.911792E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.419 | TFLOPs: 18.42 | 31: iteration 151030/ 173500 | consumed samples: 38663680 | consumed tokens: 79183216640 | elapsed time per iteration (s): 0.73 | learning rate: 2.749E-05 | global batch size: 256 | lm loss: 1.925712E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.641 | TFLOPs: 21.21 | 31: iteration 151040/ 173500 | consumed samples: 38666240 | consumed tokens: 79188459520 | elapsed time per iteration (s): 0.75 | learning rate: 2.749E-05 | global batch size: 256 | lm loss: 1.927288E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.205 | TFLOPs: 20.70 | 31: iteration 151050/ 173500 | consumed samples: 38668800 | consumed tokens: 79193702400 | elapsed time per iteration (s): 0.72 | learning rate: 2.748E-05 | global batch size: 256 | lm loss: 1.933084E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.924 | TFLOPs: 21.53 | 31: iteration 151060/ 173500 | consumed samples: 38671360 | consumed tokens: 79198945280 | elapsed time per iteration (s): 
0.73 | learning rate: 2.747E-05 | global batch size: 256 | lm loss: 1.919916E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.805 | TFLOPs: 21.10 | 31: iteration 151070/ 173500 | consumed samples: 38673920 | consumed tokens: 79204188160 | elapsed time per iteration (s): 0.71 | learning rate: 2.747E-05 | global batch size: 256 | lm loss: 1.934671E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 359.605 | TFLOPs: 21.76 | 31: iteration 151080/ 173500 | consumed samples: 38676480 | consumed tokens: 79209431040 | elapsed time per iteration (s): 1.20 | learning rate: 2.746E-05 | global batch size: 256 | lm loss: 1.900381E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 213.737 | TFLOPs: 12.93 | 31: iteration 151090/ 173500 | consumed samples: 38679040 | consumed tokens: 79214673920 | elapsed time per iteration (s): 0.80 | learning rate: 2.746E-05 | global batch size: 256 | lm loss: 1.926395E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.016 | TFLOPs: 19.30 | 31: iteration 151100/ 173500 | consumed samples: 38681600 | consumed tokens: 79219916800 | elapsed time per iteration (s): 0.75 | learning rate: 2.745E-05 | global batch size: 256 | lm loss: 1.917464E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.910 | TFLOPs: 20.56 | 31: iteration 151110/ 173500 | consumed samples: 38684160 | consumed tokens: 79225159680 | elapsed time per iteration (s): 0.76 | learning rate: 2.744E-05 | global batch size: 256 | lm loss: 1.928369E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.423 | TFLOPs: 20.47 | 31: 
iteration 151120/ 173500 | consumed samples: 38686720 | consumed tokens: 79230402560 | elapsed time per iteration (s): 0.73 | learning rate: 2.744E-05 | global batch size: 256 | lm loss: 1.930397E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.420 | TFLOPs: 21.26 | 31: iteration 151130/ 173500 | consumed samples: 38689280 | consumed tokens: 79235645440 | elapsed time per iteration (s): 0.75 | learning rate: 2.743E-05 | global batch size: 256 | lm loss: 1.917610E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.230 | TFLOPs: 20.52 | 31: iteration 151140/ 173500 | consumed samples: 38691840 | consumed tokens: 79240888320 | elapsed time per iteration (s): 0.75 | learning rate: 2.742E-05 | global batch size: 256 | lm loss: 1.926125E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.344 | TFLOPs: 20.53 | 31: iteration 151150/ 173500 | consumed samples: 38694400 | consumed tokens: 79246131200 | elapsed time per iteration (s): 0.75 | learning rate: 2.742E-05 | global batch size: 256 | lm loss: 1.900243E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.847 | TFLOPs: 20.62 | 31: iteration 151160/ 173500 | consumed samples: 38696960 | consumed tokens: 79251374080 | elapsed time per iteration (s): 0.78 | learning rate: 2.741E-05 | global batch size: 256 | lm loss: 1.917764E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.020 | TFLOPs: 19.84 | 31: iteration 151170/ 173500 | consumed samples: 38699520 | consumed tokens: 79256616960 | elapsed time per iteration (s): 0.85 | learning rate: 2.740E-05 | global batch size: 256 | lm loss: 1.910587E+00 | grad norm: 0.187 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.842 | TFLOPs: 18.32 | 31: iteration 151180/ 173500 | consumed samples: 38702080 | consumed tokens: 79261859840 | elapsed time per iteration (s): 0.78 | learning rate: 2.740E-05 | global batch size: 256 | lm loss: 1.907468E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.331 | TFLOPs: 19.80 | 31: iteration 151190/ 173500 | consumed samples: 38704640 | consumed tokens: 79267102720 | elapsed time per iteration (s): 0.78 | learning rate: 2.739E-05 | global batch size: 256 | lm loss: 1.908817E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.317 | TFLOPs: 19.74 | 31: iteration 151200/ 173500 | consumed samples: 38707200 | consumed tokens: 79272345600 | elapsed time per iteration (s): 0.74 | learning rate: 2.738E-05 | global batch size: 256 | lm loss: 1.907800E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.667 | TFLOPs: 20.91 | 31: iteration 151210/ 173500 | consumed samples: 38709760 | consumed tokens: 79277588480 | elapsed time per iteration (s): 0.83 | learning rate: 2.738E-05 | global batch size: 256 | lm loss: 1.918188E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.064 | TFLOPs: 18.64 | 31: iteration 151220/ 173500 | consumed samples: 38712320 | consumed tokens: 79282831360 | elapsed time per iteration (s): 0.81 | learning rate: 2.737E-05 | global batch size: 256 | lm loss: 1.886035E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.872 | TFLOPs: 19.17 | 31: iteration 151230/ 173500 | consumed samples: 38714880 | consumed tokens: 79288074240 | elapsed time per iteration (s): 0.86 | 
learning rate: 2.736E-05 | global batch size: 256 | lm loss: 1.942485E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.069 | TFLOPs: 18.03 | 31: iteration 151240/ 173500 | consumed samples: 38717440 | consumed tokens: 79293317120 | elapsed time per iteration (s): 0.83 | learning rate: 2.736E-05 | global batch size: 256 | lm loss: 1.916190E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.344 | TFLOPs: 18.71 | 31: iteration 151250/ 173500 | consumed samples: 38720000 | consumed tokens: 79298560000 | elapsed time per iteration (s): 0.81 | learning rate: 2.735E-05 | global batch size: 256 | lm loss: 1.929218E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.640 | TFLOPs: 19.03 | 31: iteration 151260/ 173500 | consumed samples: 38722560 | consumed tokens: 79303802880 | elapsed time per iteration (s): 0.79 | learning rate: 2.734E-05 | global batch size: 256 | lm loss: 1.907772E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.973 | TFLOPs: 19.72 | 31: iteration 151270/ 173500 | consumed samples: 38725120 | consumed tokens: 79309045760 | elapsed time per iteration (s): 0.78 | learning rate: 2.734E-05 | global batch size: 256 | lm loss: 1.873964E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.484 | TFLOPs: 19.93 | 31: iteration 151280/ 173500 | consumed samples: 38727680 | consumed tokens: 79314288640 | elapsed time per iteration (s): 0.81 | learning rate: 2.733E-05 | global batch size: 256 | lm loss: 1.919323E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.670 | TFLOPs: 19.22 | 31: iteration 
151290/ 173500 | consumed samples: 38730240 | consumed tokens: 79319531520 | elapsed time per iteration (s): 0.82 | learning rate: 2.732E-05 | global batch size: 256 | lm loss: 1.901184E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.010 | TFLOPs: 18.88 | 31: iteration 151300/ 173500 | consumed samples: 38732800 | consumed tokens: 79324774400 | elapsed time per iteration (s): 0.80 | learning rate: 2.732E-05 | global batch size: 256 | lm loss: 1.898145E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.803 | TFLOPs: 19.41 | 31: iteration 151310/ 173500 | consumed samples: 38735360 | consumed tokens: 79330017280 | elapsed time per iteration (s): 0.80 | learning rate: 2.731E-05 | global batch size: 256 | lm loss: 1.932471E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.610 | TFLOPs: 19.46 | 31: iteration 151320/ 173500 | consumed samples: 38737920 | consumed tokens: 79335260160 | elapsed time per iteration (s): 0.77 | learning rate: 2.731E-05 | global batch size: 256 | lm loss: 1.912931E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.409 | TFLOPs: 20.05 | 31: iteration 151330/ 173500 | consumed samples: 38740480 | consumed tokens: 79340503040 | elapsed time per iteration (s): 0.78 | learning rate: 2.730E-05 | global batch size: 256 | lm loss: 1.919389E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.251 | TFLOPs: 19.92 | 31: iteration 151340/ 173500 | consumed samples: 38743040 | consumed tokens: 79345745920 | elapsed time per iteration (s): 0.76 | learning rate: 2.729E-05 | global batch size: 256 | lm loss: 1.908342E+00 | grad norm: 0.185 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.907 | TFLOPs: 20.26 | 31: iteration 151350/ 173500 | consumed samples: 38745600 | consumed tokens: 79350988800 | elapsed time per iteration (s): 0.78 | learning rate: 2.729E-05 | global batch size: 256 | lm loss: 1.942080E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.562 | TFLOPs: 19.94 | 31: iteration 151360/ 173500 | consumed samples: 38748160 | consumed tokens: 79356231680 | elapsed time per iteration (s): 0.78 | learning rate: 2.728E-05 | global batch size: 256 | lm loss: 1.911801E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.904 | TFLOPs: 19.96 | 31: iteration 151370/ 173500 | consumed samples: 38750720 | consumed tokens: 79361474560 | elapsed time per iteration (s): 0.77 | learning rate: 2.727E-05 | global batch size: 256 | lm loss: 1.938273E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.506 | TFLOPs: 20.18 | 31: iteration 151380/ 173500 | consumed samples: 38753280 | consumed tokens: 79366717440 | elapsed time per iteration (s): 0.80 | learning rate: 2.727E-05 | global batch size: 256 | lm loss: 1.878289E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.944 | TFLOPs: 19.36 | 31: iteration 151390/ 173500 | consumed samples: 38755840 | consumed tokens: 79371960320 | elapsed time per iteration (s): 0.78 | learning rate: 2.726E-05 | global batch size: 256 | lm loss: 1.931236E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.213 | TFLOPs: 19.80 | 31: iteration 151400/ 173500 | consumed samples: 38758400 | consumed tokens: 79377203200 | elapsed time per iteration (s): 0.82 | learning 
rate: 2.725E-05 | global batch size: 256 | lm loss: 1.915040E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.917 | TFLOPs: 18.87 | 31: iteration 151410/ 173500 | consumed samples: 38760960 | consumed tokens: 79382446080 | elapsed time per iteration (s): 0.74 | learning rate: 2.725E-05 | global batch size: 256 | lm loss: 1.932368E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.435 | TFLOPs: 21.02 | 31: iteration 151420/ 173500 | consumed samples: 38763520 | consumed tokens: 79387688960 | elapsed time per iteration (s): 0.74 | learning rate: 2.724E-05 | global batch size: 256 | lm loss: 1.889954E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.490 | TFLOPs: 21.02 | 31: iteration 151430/ 173500 | consumed samples: 38766080 | consumed tokens: 79392931840 | elapsed time per iteration (s): 0.78 | learning rate: 2.723E-05 | global batch size: 256 | lm loss: 1.910192E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.150 | TFLOPs: 19.73 | 31: iteration 151440/ 173500 | consumed samples: 38768640 | consumed tokens: 79398174720 | elapsed time per iteration (s): 0.80 | learning rate: 2.723E-05 | global batch size: 256 | lm loss: 1.919578E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.059 | TFLOPs: 19.24 | 31: iteration 151450/ 173500 | consumed samples: 38771200 | consumed tokens: 79403417600 | elapsed time per iteration (s): 0.83 | learning rate: 2.722E-05 | global batch size: 256 | lm loss: 1.908791E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.768 | TFLOPs: 18.56 | 31: iteration 151460/ 
173500 | consumed samples: 38773760 | consumed tokens: 79408660480 | elapsed time per iteration (s): 0.80 | learning rate: 2.721E-05 | global batch size: 256 | lm loss: 1.958681E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.229 | TFLOPs: 19.25 | 31: iteration 151470/ 173500 | consumed samples: 38776320 | consumed tokens: 79413903360 | elapsed time per iteration (s): 0.76 | learning rate: 2.721E-05 | global batch size: 256 | lm loss: 1.918825E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.190 | TFLOPs: 20.40 | 31: iteration 151480/ 173500 | consumed samples: 38778880 | consumed tokens: 79419146240 | elapsed time per iteration (s): 0.78 | learning rate: 2.720E-05 | global batch size: 256 | lm loss: 1.903231E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.860 | TFLOPs: 19.77 | 31: iteration 151490/ 173500 | consumed samples: 38781440 | consumed tokens: 79424389120 | elapsed time per iteration (s): 0.75 | learning rate: 2.719E-05 | global batch size: 256 | lm loss: 1.899529E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.619 | TFLOPs: 20.79 | 31: iteration 151500/ 173500 | consumed samples: 38784000 | consumed tokens: 79429632000 | elapsed time per iteration (s): 0.79 | learning rate: 2.719E-05 | global batch size: 256 | lm loss: 1.927995E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.861 | TFLOPs: 19.53 | 31: iteration 151510/ 173500 | consumed samples: 38786560 | consumed tokens: 79434874880 | elapsed time per iteration (s): 0.76 | learning rate: 2.718E-05 | global batch size: 256 | lm loss: 1.935787E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 338.525 | TFLOPs: 20.48 | 31: iteration 151520/ 173500 | consumed samples: 38789120 | consumed tokens: 79440117760 | elapsed time per iteration (s): 0.75 | learning rate: 2.718E-05 | global batch size: 256 | lm loss: 1.931630E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.775 | TFLOPs: 20.68 | 31: iteration 151530/ 173500 | consumed samples: 38791680 | consumed tokens: 79445360640 | elapsed time per iteration (s): 0.78 | learning rate: 2.717E-05 | global batch size: 256 | lm loss: 1.917239E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.279 | TFLOPs: 19.86 | 31: iteration 151540/ 173500 | consumed samples: 38794240 | consumed tokens: 79450603520 | elapsed time per iteration (s): 0.79 | learning rate: 2.716E-05 | global batch size: 256 | lm loss: 1.898553E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.599 | TFLOPs: 19.64 | 31: iteration 151550/ 173500 | consumed samples: 38796800 | consumed tokens: 79455846400 | elapsed time per iteration (s): 0.79 | learning rate: 2.716E-05 | global batch size: 256 | lm loss: 1.932465E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.177 | TFLOPs: 19.49 | 31: iteration 151560/ 173500 | consumed samples: 38799360 | consumed tokens: 79461089280 | elapsed time per iteration (s): 0.76 | learning rate: 2.715E-05 | global batch size: 256 | lm loss: 1.878909E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.728 | TFLOPs: 20.37 | 31: iteration 151570/ 173500 | consumed samples: 38801920 | consumed tokens: 79466332160 | elapsed time per iteration (s): 0.78 | learning rate: 
2.714E-05 | global batch size: 256 | lm loss: 1.893715E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.859 | TFLOPs: 19.96 | 31: iteration 151580/ 173500 | consumed samples: 38804480 | consumed tokens: 79471575040 | elapsed time per iteration (s): 0.76 | learning rate: 2.714E-05 | global batch size: 256 | lm loss: 1.931701E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.147 | TFLOPs: 20.40 | 31: iteration 151590/ 173500 | consumed samples: 38807040 | consumed tokens: 79476817920 | elapsed time per iteration (s): 0.75 | learning rate: 2.713E-05 | global batch size: 256 | lm loss: 1.914066E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.509 | TFLOPs: 20.60 | 31: iteration 151600/ 173500 | consumed samples: 38809600 | consumed tokens: 79482060800 | elapsed time per iteration (s): 0.74 | learning rate: 2.712E-05 | global batch size: 256 | lm loss: 1.914226E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.081 | TFLOPs: 20.82 | 31: iteration 151610/ 173500 | consumed samples: 38812160 | consumed tokens: 79487303680 | elapsed time per iteration (s): 0.80 | learning rate: 2.712E-05 | global batch size: 256 | lm loss: 1.941901E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.567 | TFLOPs: 19.33 | 31: iteration 151620/ 173500 | consumed samples: 38814720 | consumed tokens: 79492546560 | elapsed time per iteration (s): 0.73 | learning rate: 2.711E-05 | global batch size: 256 | lm loss: 1.917322E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.085 | TFLOPs: 21.12 | 31: iteration 151630/ 173500 | 
consumed samples: 38817280 | consumed tokens: 79497789440 | elapsed time per iteration (s): 0.78 | learning rate: 2.710E-05 | global batch size: 256 | lm loss: 1.908340E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.718 | TFLOPs: 19.89 | 31: iteration 151640/ 173500 | consumed samples: 38819840 | consumed tokens: 79503032320 | elapsed time per iteration (s): 0.80 | learning rate: 2.710E-05 | global batch size: 256 | lm loss: 1.918004E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.412 | TFLOPs: 19.38 | 31: iteration 151650/ 173500 | consumed samples: 38822400 | consumed tokens: 79508275200 | elapsed time per iteration (s): 0.79 | learning rate: 2.709E-05 | global batch size: 256 | lm loss: 1.917998E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.029 | TFLOPs: 19.72 | 31: iteration 151660/ 173500 | consumed samples: 38824960 | consumed tokens: 79513518080 | elapsed time per iteration (s): 0.77 | learning rate: 2.709E-05 | global batch size: 256 | lm loss: 1.926674E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.225 | TFLOPs: 20.22 | 31: iteration 151670/ 173500 | consumed samples: 38827520 | consumed tokens: 79518760960 | elapsed time per iteration (s): 0.74 | learning rate: 2.708E-05 | global batch size: 256 | lm loss: 1.902280E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.749 | TFLOPs: 20.80 | 31: iteration 151680/ 173500 | consumed samples: 38830080 | consumed tokens: 79524003840 | elapsed time per iteration (s): 0.79 | learning rate: 2.707E-05 | global batch size: 256 | lm loss: 1.922964E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 326.048 | TFLOPs: 19.73 | 31: iteration 151690/ 173500 | consumed samples: 38832640 | consumed tokens: 79529246720 | elapsed time per iteration (s): 0.71 | learning rate: 2.707E-05 | global batch size: 256 | lm loss: 1.916203E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 360.844 | TFLOPs: 21.83 | 31: iteration 151700/ 173500 | consumed samples: 38835200 | consumed tokens: 79534489600 | elapsed time per iteration (s): 0.75 | learning rate: 2.706E-05 | global batch size: 256 | lm loss: 1.933153E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.676 | TFLOPs: 20.67 | 31: iteration 151710/ 173500 | consumed samples: 38837760 | consumed tokens: 79539732480 | elapsed time per iteration (s): 0.75 | learning rate: 2.705E-05 | global batch size: 256 | lm loss: 1.909313E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.449 | TFLOPs: 20.60 | 31: iteration 151720/ 173500 | consumed samples: 38840320 | consumed tokens: 79544975360 | elapsed time per iteration (s): 0.74 | learning rate: 2.705E-05 | global batch size: 256 | lm loss: 1.939956E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.142 | TFLOPs: 21.06 | 31: iteration 151730/ 173500 | consumed samples: 38842880 | consumed tokens: 79550218240 | elapsed time per iteration (s): 0.78 | learning rate: 2.704E-05 | global batch size: 256 | lm loss: 1.895654E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.661 | TFLOPs: 19.76 | 31: iteration 151740/ 173500 | consumed samples: 38845440 | consumed tokens: 79555461120 | elapsed time per iteration (s): 0.76 | learning rate: 
2.703E-05 | global batch size: 256 | lm loss: 1.925107E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.733 | TFLOPs: 20.49 | 31: iteration 151750/ 173500 | consumed samples: 38848000 | consumed tokens: 79560704000 | elapsed time per iteration (s): 0.75 | learning rate: 2.703E-05 | global batch size: 256 | lm loss: 1.919421E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.903 | TFLOPs: 20.56 | 31: iteration 151760/ 173500 | consumed samples: 38850560 | consumed tokens: 79565946880 | elapsed time per iteration (s): 0.78 | learning rate: 2.702E-05 | global batch size: 256 | lm loss: 1.888966E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.495 | TFLOPs: 19.81 | 31: iteration 151770/ 173500 | consumed samples: 38853120 | consumed tokens: 79571189760 | elapsed time per iteration (s): 0.76 | learning rate: 2.702E-05 | global batch size: 256 | lm loss: 1.888413E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.800 | TFLOPs: 20.50 | 31: iteration 151780/ 173500 | consumed samples: 38855680 | consumed tokens: 79576432640 | elapsed time per iteration (s): 0.75 | learning rate: 2.701E-05 | global batch size: 256 | lm loss: 1.906623E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.782 | TFLOPs: 20.68 | 31: iteration 151790/ 173500 | consumed samples: 38858240 | consumed tokens: 79581675520 | elapsed time per iteration (s): 0.78 | learning rate: 2.700E-05 | global batch size: 256 | lm loss: 1.931335E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.891 | TFLOPs: 19.84 | 31: iteration 151800/ 173500 | 
consumed samples: 38860800 | consumed tokens: 79586918400 | elapsed time per iteration (s): 0.75 | learning rate: 2.700E-05 | global batch size: 256 | lm loss: 1.915083E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.802 | TFLOPs: 20.74 | 31: iteration 151810/ 173500 | consumed samples: 38863360 | consumed tokens: 79592161280 | elapsed time per iteration (s): 0.75 | learning rate: 2.699E-05 | global batch size: 256 | lm loss: 1.927727E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.070 | TFLOPs: 20.63 | 31: iteration 151820/ 173500 | consumed samples: 38865920 | consumed tokens: 79597404160 | elapsed time per iteration (s): 0.78 | learning rate: 2.698E-05 | global batch size: 256 | lm loss: 1.934198E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.355 | TFLOPs: 19.80 | 31: iteration 151830/ 173500 | consumed samples: 38868480 | consumed tokens: 79602647040 | elapsed time per iteration (s): 0.74 | learning rate: 2.698E-05 | global batch size: 256 | lm loss: 1.935402E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.977 | TFLOPs: 20.87 | 31: iteration 151840/ 173500 | consumed samples: 38871040 | consumed tokens: 79607889920 | elapsed time per iteration (s): 0.76 | learning rate: 2.697E-05 | global batch size: 256 | lm loss: 1.914249E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.261 | TFLOPs: 20.40 | 31: iteration 151850/ 173500 | consumed samples: 38873600 | consumed tokens: 79613132800 | elapsed time per iteration (s): 0.72 | learning rate: 2.696E-05 | global batch size: 256 | lm loss: 1.924050E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 355.745 | TFLOPs: 21.52 | 31: iteration 151860/ 173500 | consumed samples: 38876160 | consumed tokens: 79618375680 | elapsed time per iteration (s): 0.74 | learning rate: 2.696E-05 | global batch size: 256 | lm loss: 1.909476E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.142 | TFLOPs: 21.00 | 31: iteration 151870/ 173500 | consumed samples: 38878720 | consumed tokens: 79623618560 | elapsed time per iteration (s): 0.74 | learning rate: 2.695E-05 | global batch size: 256 | lm loss: 1.944876E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.412 | TFLOPs: 20.96 | 31: iteration 151880/ 173500 | consumed samples: 38881280 | consumed tokens: 79628861440 | elapsed time per iteration (s): 0.75 | learning rate: 2.695E-05 | global batch size: 256 | lm loss: 1.929185E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.046 | TFLOPs: 20.69 | 31: iteration 151890/ 173500 | consumed samples: 38883840 | consumed tokens: 79634104320 | elapsed time per iteration (s): 0.74 | learning rate: 2.694E-05 | global batch size: 256 | lm loss: 1.925278E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.557 | TFLOPs: 21.03 | 31: iteration 151900/ 173500 | consumed samples: 38886400 | consumed tokens: 79639347200 | elapsed time per iteration (s): 0.75 | learning rate: 2.693E-05 | global batch size: 256 | lm loss: 1.920764E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.630 | TFLOPs: 20.73 | 31: iteration 151910/ 173500 | consumed samples: 38888960 | consumed tokens: 79644590080 | elapsed time per iteration (s): 0.76 | learning rate: 
2.693E-05 | global batch size: 256 | lm loss: 1.923917E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.098 | TFLOPs: 20.33 | 31: iteration 151920/ 173500 | consumed samples: 38891520 | consumed tokens: 79649832960 | elapsed time per iteration (s): 0.76 | learning rate: 2.692E-05 | global batch size: 256 | lm loss: 1.921232E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.952 | TFLOPs: 20.51 | 31: iteration 151930/ 173500 | consumed samples: 38894080 | consumed tokens: 79655075840 | elapsed time per iteration (s): 0.76 | learning rate: 2.691E-05 | global batch size: 256 | lm loss: 1.885778E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.049 | TFLOPs: 20.27 | 31: iteration 151940/ 173500 | consumed samples: 38896640 | consumed tokens: 79660318720 | elapsed time per iteration (s): 0.73 | learning rate: 2.691E-05 | global batch size: 256 | lm loss: 1.899970E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.184 | TFLOPs: 21.19 | 31: iteration 151950/ 173500 | consumed samples: 38899200 | consumed tokens: 79665561600 | elapsed time per iteration (s): 0.74 | learning rate: 2.690E-05 | global batch size: 256 | lm loss: 1.927170E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.566 | TFLOPs: 20.91 | 31: iteration 151960/ 173500 | consumed samples: 38901760 | consumed tokens: 79670804480 | elapsed time per iteration (s): 0.74 | learning rate: 2.689E-05 | global batch size: 256 | lm loss: 1.924084E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.795 | TFLOPs: 20.86 | 31: iteration 151970/ 173500 | 
consumed samples: 38904320 | consumed tokens: 79676047360 | elapsed time per iteration (s): 0.76 | learning rate: 2.689E-05 | global batch size: 256 | lm loss: 1.910103E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.143 | TFLOPs: 20.46 | 31: iteration 151980/ 173500 | consumed samples: 38906880 | consumed tokens: 79681290240 | elapsed time per iteration (s): 0.81 | learning rate: 2.688E-05 | global batch size: 256 | lm loss: 1.942290E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.510 | TFLOPs: 19.21 | 31: iteration 151990/ 173500 | consumed samples: 38909440 | consumed tokens: 79686533120 | elapsed time per iteration (s): 0.79 | learning rate: 2.688E-05 | global batch size: 256 | lm loss: 1.909590E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.491 | TFLOPs: 19.63 | 0: [2022-11-27 04:18:29,011] [INFO] [logging.py:68:log_dist] [Rank 0] step=152000, skipped=0, lr=[2.6869667028068037e-05, 2.6869667028068037e-05, 2.6869667028068037e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 152000/ 173500 | consumed samples: 38912000 | consumed tokens: 79691776000 | elapsed time per iteration (s): 0.73 | learning rate: 2.687E-05 | global batch size: 256 | lm loss: 1.918742E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.316 | TFLOPs: 21.13 | 0: steps: 152000 loss: 1.9506 iter time (s): 0.791 samples/sec: 323.526 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 152000 | lm loss value: 1.907954E+00 | lm loss PPL: 6.739285E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 152000 to 
checkpoints_1b1long 0: [2022-11-27 04:18:29,261] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step152000 is begin to save! 0: [2022-11-27 04:18:29,273] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_01-model_00-model_states.pt... 0: [2022-11-27 04:18:29,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_01-model_00-model_states.pt. 0: [2022-11-27 04:18:29,487] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_03-model_00-model_states.pt... 0: [2022-11-27 04:18:29,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_03-model_00-model_states.pt. 0: [2022-11-27 04:18:29,573] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_04-model_00-model_states.pt... 0: [2022-11-27 04:18:29,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_04-model_00-model_states.pt. 0: [2022-11-27 04:18:29,653] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_05-model_00-model_states.pt... 0: [2022-11-27 04:18:29,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_05-model_00-model_states.pt. 0: [2022-11-27 04:18:29,730] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_06-model_00-model_states.pt... 0: [2022-11-27 04:18:29,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_06-model_00-model_states.pt. 0: [2022-11-27 04:18:29,810] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_07-model_00-model_states.pt... 
0: [2022-11-27 04:18:29,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_07-model_00-model_states.pt. 0: [2022-11-27 04:18:29,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_08-model_00-model_states.pt... 0: [2022-11-27 04:18:29,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_08-model_00-model_states.pt. 0: [2022-11-27 04:18:29,961] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_09-model_00-model_states.pt... 0: [2022-11-27 04:18:30,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_09-model_00-model_states.pt. 0: [2022-11-27 04:18:30,037] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_10-model_00-model_states.pt... 0: [2022-11-27 04:18:30,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_10-model_00-model_states.pt. 0: [2022-11-27 04:18:30,113] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_11-model_00-model_states.pt... 0: [2022-11-27 04:18:30,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_11-model_00-model_states.pt. 0: [2022-11-27 04:18:30,188] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_12-model_00-model_states.pt... 0: [2022-11-27 04:18:30,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_12-model_00-model_states.pt. 0: [2022-11-27 04:18:30,264] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 04:18:30,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_13-model_00-model_states.pt. 0: [2022-11-27 04:18:30,339] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_14-model_00-model_states.pt... 0: [2022-11-27 04:18:30,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_14-model_00-model_states.pt. 0: [2022-11-27 04:18:30,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_15-model_00-model_states.pt... 0: [2022-11-27 04:18:30,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_15-model_00-model_states.pt. 0: [2022-11-27 04:18:30,488] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_16-model_00-model_states.pt... 0: [2022-11-27 04:18:30,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_16-model_00-model_states.pt. 0: [2022-11-27 04:18:30,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_17-model_00-model_states.pt... 0: [2022-11-27 04:18:30,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_17-model_00-model_states.pt. 0: [2022-11-27 04:18:30,640] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_18-model_00-model_states.pt... 0: [2022-11-27 04:18:30,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_18-model_00-model_states.pt. 0: [2022-11-27 04:18:30,713] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 04:18:30,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_19-model_00-model_states.pt. 0: [2022-11-27 04:18:30,790] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_20-model_00-model_states.pt... 0: [2022-11-27 04:18:30,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_20-model_00-model_states.pt. 0: [2022-11-27 04:18:30,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_21-model_00-model_states.pt... 0: [2022-11-27 04:18:30,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_21-model_00-model_states.pt. 0: [2022-11-27 04:18:30,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_22-model_00-model_states.pt... 0: [2022-11-27 04:18:31,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_22-model_00-model_states.pt. 0: [2022-11-27 04:18:31,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_23-model_00-model_states.pt... 0: [2022-11-27 04:18:31,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_23-model_00-model_states.pt. 0: [2022-11-27 04:18:31,092] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_24-model_00-model_states.pt... 0: [2022-11-27 04:18:31,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_24-model_00-model_states.pt. 0: [2022-11-27 04:18:31,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 04:18:31,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_25-model_00-model_states.pt. 0: [2022-11-27 04:18:31,243] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_26-model_00-model_states.pt... 0: [2022-11-27 04:18:31,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_26-model_00-model_states.pt. 0: [2022-11-27 04:18:31,319] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_27-model_00-model_states.pt... 0: [2022-11-27 04:18:31,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_27-model_00-model_states.pt. 0: [2022-11-27 04:18:31,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_28-model_00-model_states.pt... 0: [2022-11-27 04:18:31,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_28-model_00-model_states.pt. 0: [2022-11-27 04:18:31,479] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/layer_30-model_00-model_states.pt... 0: [2022-11-27 04:18:31,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/layer_30-model_00-model_states.pt. 0: [2022-11-27 04:18:31,482] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step152000/mp_rank_00_model_states.pt 0: [2022-11-27 04:18:31,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/mp_rank_00_model_states.pt... 0: [2022-11-27 04:18:31,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/mp_rank_00_model_states.pt. 
0: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
7: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 
8: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 
16: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 
3: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 
23: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 
24: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 
22: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 
18: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 
6: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
16: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
15: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 
28: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 
22: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 
18: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 
27: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 
8: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
15: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 
14: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 
17: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 
8: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 
23: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 
0: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
20: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:18:31,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:18:31,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:18:31,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 04:18:31,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 7: [2022-11-27 04:18:31,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:18:31,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 19: [2022-11-27 04:18:31,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:18:31,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 
19: [2022-11-27 04:18:31,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-27 04:18:31,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 11: [2022-11-27 04:18:31,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:18:31,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 04:18:31,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 3: [2022-11-27 04:18:31,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:18:31,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 04:18:31,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 28: [2022-11-27 04:18:31,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:18:31,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:18:31,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 04:18:31,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 
16: [2022-11-27 04:18:31,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:18:31,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:18:31,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 18: [2022-11-27 04:18:31,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 16: [2022-11-27 04:18:31,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 18: [2022-11-27 04:18:31,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 4: [2022-11-27 04:18:31,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:18:31,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 04:18:31,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 27: [2022-11-27 04:18:31,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:18:31,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
27: [2022-11-27 04:18:31,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 18: [2022-11-27 04:18:31,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 27: [2022-11-27 04:18:31,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 18: [2022-11-27 04:18:31,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 7: [2022-11-27 04:18:31,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:18:31,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 04:18:31,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 19: [2022-11-27 04:18:31,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:18:31,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:18:31,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 04:18:31,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 
3: [2022-11-27 04:18:31,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 04:18:31,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 11: [2022-11-27 04:18:31,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:18:31,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 04:18:31,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 23: [2022-11-27 04:18:31,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:18:31,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 04:18:31,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 2: [2022-11-27 04:18:31,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:18:31,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-27 04:18:31,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 04:18:31,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 13: [2022-11-27 04:18:31,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:18:31,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 2: [2022-11-27 04:18:31,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 13: [2022-11-27 04:18:31,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 04:18:31,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 17: [2022-11-27 04:18:31,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:18:31,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 04:18:31,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 5: [2022-11-27 04:18:31,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-27 04:18:31,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 29: [2022-11-27 04:18:31,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:18:31,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:18:31,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 29: [2022-11-27 04:18:31,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 04:18:31,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 27: [2022-11-27 04:18:31,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:18:31,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 27: [2022-11-27 04:18:31,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 5: [2022-11-27 04:18:31,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 27: [2022-11-27 04:18:31,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 22: [2022-11-27 04:18:31,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
22: [2022-11-27 04:18:31,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 04:18:31,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 1: [2022-11-27 04:18:31,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:18:31,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:18:31,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:18:31,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 25: [2022-11-27 04:18:31,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 1: [2022-11-27 04:18:31,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 25: [2022-11-27 04:18:31,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 24: [2022-11-27 04:18:31,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 04:18:31,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 16: [2022-11-27 04:18:31,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
16: [2022-11-27 04:18:31,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 10: [2022-11-27 04:18:31,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:18:31,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 10: [2022-11-27 04:18:31,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 8: [2022-11-27 04:18:31,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:18:31,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:18:31,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 8: [2022-11-27 04:18:31,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 04:18:31,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 04:18:31,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 8: [2022-11-27 04:18:31,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 0: [2022-11-27 04:18:31,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
25: [2022-11-27 04:18:31,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:18:31,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 0: [2022-11-27 04:18:31,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 04:18:31,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 4: [2022-11-27 04:18:31,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:18:31,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 4: [2022-11-27 04:18:31,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 04:18:31,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 10: [2022-11-27 04:18:31,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:18:31,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 04:18:31,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 21: [2022-11-27 04:18:31,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 
20: [2022-11-27 04:18:31,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:18:31,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 20: [2022-11-27 04:18:31,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 04:18:31,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 21: [2022-11-27 04:18:31,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 28: [2022-11-27 04:18:31,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 24: [2022-11-27 04:18:31,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:18:31,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 20: [2022-11-27 04:18:31,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:18:31,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
24: [2022-11-27 04:18:31,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 28: [2022-11-27 04:18:31,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 24: [2022-11-27 04:18:31,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 20: [2022-11-27 04:18:31,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 28: [2022-11-27 04:18:31,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 20: [2022-11-27 04:18:31,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 17: [2022-11-27 04:18:31,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:18:31,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 04:18:31,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 19: [2022-11-27 04:18:31,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:18:31,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 04:18:31,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 
30: [2022-11-27 04:18:31,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:18:31,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 11: [2022-11-27 04:18:31,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:18:31,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 11: [2022-11-27 04:18:31,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 04:18:31,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 30: [2022-11-27 04:18:31,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:18:31,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:18:31,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 7: [2022-11-27 04:18:31,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 04:18:31,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 30: [2022-11-27 04:18:31,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 
1: [2022-11-27 04:18:31,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:18:31,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 04:18:31,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 18: [2022-11-27 04:18:31,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:18:31,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 13: [2022-11-27 04:18:31,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:18:31,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 13: [2022-11-27 04:18:31,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 04:18:31,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 3: [2022-11-27 04:18:31,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:18:31,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 04:18:31,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 
5: [2022-11-27 04:18:31,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:18:31,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 04:18:31,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 24: [2022-11-27 04:18:31,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:18:31,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 04:18:31,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 12: [2022-11-27 04:18:31,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:18:31,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:18:31,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 29: [2022-11-27 04:18:31,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 12: [2022-11-27 04:18:31,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 29: [2022-11-27 04:18:31,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 
4: [2022-11-27 04:18:31,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:18:31,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:18:31,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 12: [2022-11-27 04:18:31,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:18:31,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 12: [2022-11-27 04:18:31,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 9: [2022-11-27 04:18:31,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:18:31,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 9: [2022-11-27 04:18:31,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:18:31,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 
9: [2022-11-27 04:18:31,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 12: [2022-11-27 04:18:31,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 9: [2022-11-27 04:18:31,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 9: [2022-11-27 04:18:31,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 04:18:31,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 17: [2022-11-27 04:18:31,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:18:31,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 04:18:31,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 30: [2022-11-27 04:18:31,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:18:31,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:18:31,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 04:18:31,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 
30: [2022-11-27 04:18:31,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 16: [2022-11-27 04:18:31,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:18:31,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 16: [2022-11-27 04:18:31,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 04:18:31,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 15: [2022-11-27 04:18:31,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:18:31,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 04:18:31,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:18:31,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:18:31,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 
15: [2022-11-27 04:18:31,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 0: [2022-11-27 04:18:31,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 15: [2022-11-27 04:18:31,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 27: [2022-11-27 04:18:31,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:18:31,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 27: [2022-11-27 04:18:31,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 04:18:31,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 0: [2022-11-27 04:18:31,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:18:31,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:18:31,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
29: [2022-11-27 04:18:31,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 23: [2022-11-27 04:18:31,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:18:31,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 1: [2022-11-27 04:18:31,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 23: [2022-11-27 04:18:31,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 1: [2022-11-27 04:18:31,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 23: [2022-11-27 04:18:31,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 28: [2022-11-27 04:18:31,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:18:31,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:18:31,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 04:18:31,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 13: [2022-11-27 04:18:31,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-27 04:18:31,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 26: [2022-11-27 04:18:31,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:18:31,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:18:31,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 26: [2022-11-27 04:18:31,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 04:18:31,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 04:18:31,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:18:31,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 26: [2022-11-27 04:18:31,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 23: [2022-11-27 04:18:31,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
26: [2022-11-27 04:18:31,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 23: [2022-11-27 04:18:31,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 26: [2022-11-27 04:18:31,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 23: [2022-11-27 04:18:31,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 23: [2022-11-27 04:18:31,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:18:31,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:18:31,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 10: [2022-11-27 04:18:31,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 23: [2022-11-27 04:18:31,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 10: [2022-11-27 04:18:31,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 18: [2022-11-27 04:18:31,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:18:31,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
18: [2022-11-27 04:18:31,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 04:18:31,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 19: [2022-11-27 04:18:31,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 04:18:31,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 10: [2022-11-27 04:18:31,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:18:31,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 04:18:31,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 8: [2022-11-27 04:18:31,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:18:31,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 04:18:31,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 14: [2022-11-27 04:18:31,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-27 04:18:31,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 04:18:31,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 20: [2022-11-27 04:18:31,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:18:31,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 04:18:31,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 5: [2022-11-27 04:18:31,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:18:31,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 04:18:31,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 8: [2022-11-27 04:18:31,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:18:31,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 04:18:31,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 9: [2022-11-27 04:18:31,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-27 04:18:31,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 04:18:31,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 3: [2022-11-27 04:18:31,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:18:31,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 04:18:31,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 21: [2022-11-27 04:18:31,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:18:31,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:18:31,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 04:18:31,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 24: [2022-11-27 04:18:31,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 7: [2022-11-27 04:18:31,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-27 04:18:31,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 04:18:31,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 25: [2022-11-27 04:18:31,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:18:31,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 21: [2022-11-27 04:18:31,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:18:31,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 25: [2022-11-27 04:18:31,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 21: [2022-11-27 04:18:31,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 25: [2022-11-27 04:18:31,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 25: [2022-11-27 04:18:31,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:18:31,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 04:18:31,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 
28: [2022-11-27 04:18:31,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 04:18:31,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 28: [2022-11-27 04:18:31,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:18:31,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 04:18:31,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 15: [2022-11-27 04:18:31,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:18:31,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 04:18:31,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 4: [2022-11-27 04:18:31,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:18:31,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 22: [2022-11-27 04:18:31,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:18:31,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 
22: [2022-11-27 04:18:31,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 04:18:31,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 22: [2022-11-27 04:18:31,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:18:31,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 04:18:31,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 30: [2022-11-27 04:18:31,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:18:31,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 04:18:31,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 19: [2022-11-27 04:18:31,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:18:31,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 04:18:31,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 27: [2022-11-27 04:18:31,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
12: [2022-11-27 04:18:31,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:18:31,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 27: [2022-11-27 04:18:31,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 12: [2022-11-27 04:18:31,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 27: [2022-11-27 04:18:31,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 1: [2022-11-27 04:18:31,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:18:31,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 04:18:31,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 20: [2022-11-27 04:18:31,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:18:31,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 04:18:31,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 2: [2022-11-27 04:18:31,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-27 04:18:31,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 04:18:31,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 11: [2022-11-27 04:18:31,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:18:31,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 04:18:31,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 2: [2022-11-27 04:18:31,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:18:31,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 04:18:31,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 14: [2022-11-27 04:18:31,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:18:31,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 04:18:31,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 0: [2022-11-27 04:18:31,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-27 04:18:31,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 04:18:31,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 31: [2022-11-27 04:18:31,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:18:31,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:18:31,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:18:31,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:18:31,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
31: [2022-11-27 04:18:31,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 04:18:31,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 04:18:31,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 04:18:31,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 04:18:31,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 31: [2022-11-27 04:18:31,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 31: [2022-11-27 04:18:31,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 31: [2022-11-27 04:18:31,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 17: [2022-11-27 04:18:31,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 04:18:31,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 18: [2022-11-27 04:18:31,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
18: [2022-11-27 04:18:31,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 16: [2022-11-27 04:18:31,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:18:31,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 16: [2022-11-27 04:18:31,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 04:18:31,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 0: [2022-11-27 04:18:31,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 04:18:31,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 13: [2022-11-27 04:18:31,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:18:31,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 04:18:31,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 21: [2022-11-27 04:18:31,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-27 04:18:31,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 04:18:31,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 22: [2022-11-27 04:18:31,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:18:31,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 04:18:31,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 20: [2022-11-27 04:18:31,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:18:31,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 04:18:31,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 17: [2022-11-27 04:18:31,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:18:31,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 04:18:31,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 12: [2022-11-27 04:18:31,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-27 04:18:31,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 04:18:31,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 6: [2022-11-27 04:18:31,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:18:31,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:18:31,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:18:31,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:18:31,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 04:18:31,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 04:18:31,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 04:18:31,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 04:18:31,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 
6: [2022-11-27 04:18:31,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 6: [2022-11-27 04:18:31,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 6: [2022-11-27 04:18:31,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 5: [2022-11-27 04:18:31,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:18:31,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 04:18:31,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 6: [2022-11-27 04:18:31,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:18:31,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 04:18:31,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 29: [2022-11-27 04:18:31,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:18:31,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 04:18:31,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 
28: [2022-11-27 04:18:31,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:18:31,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 04:18:31,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 9: [2022-11-27 04:18:31,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:18:31,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 04:18:31,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 31: [2022-11-27 04:18:31,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:18:31,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 04:18:31,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 10: [2022-11-27 04:18:31,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:18:31,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 04:18:31,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 
15: [2022-11-27 04:18:31,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:18:31,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:18:31,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 0: [2022-11-27 04:18:31,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:18:31,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 26: [2022-11-27 04:18:31,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 0: [2022-11-27 04:18:31,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 04:18:31,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 26: [2022-11-27 04:18:31,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 4: [2022-11-27 04:18:31,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:18:31,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 04:18:31,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 
30: [2022-11-27 04:18:31,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:18:31,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 04:18:31,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 27: [2022-11-27 04:18:31,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:18:31,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 04:18:31,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 1: [2022-11-27 04:18:31,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:18:31,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 04:18:31,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 24: [2022-11-27 04:18:31,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
24: [2022-11-27 04:18:31,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 21: [2022-11-27 04:18:31,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:18:31,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 04:18:31,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 24: [2022-11-27 04:18:31,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 25: [2022-11-27 04:18:31,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:18:31,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:18:31,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 8: [2022-11-27 04:18:31,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 25: [2022-11-27 04:18:31,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 8: [2022-11-27 04:18:31,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 16: [2022-11-27 04:18:31,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 
16: [2022-11-27 04:18:31,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 04:18:31,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 2: [2022-11-27 04:18:31,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:18:31,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 23: [2022-11-27 04:18:31,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:18:31,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 23: [2022-11-27 04:18:31,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 04:18:31,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 11: [2022-11-27 04:18:31,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:18:31,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 04:18:31,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 7: [2022-11-27 04:18:31,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-27 04:18:31,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 04:18:31,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 14: [2022-11-27 04:18:31,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:18:31,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 04:18:31,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 3: [2022-11-27 04:18:31,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:18:31,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 04:18:31,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 13: [2022-11-27 04:18:31,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:18:31,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 04:18:31,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 18: [2022-11-27 04:18:31,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
18: [2022-11-27 04:18:31,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-27 04:18:31,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 17: [2022-11-27 04:18:31,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:18:31,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 04:18:31,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 5: [2022-11-27 04:18:31,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:18:31,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 04:18:31,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 22: [2022-11-27 04:18:31,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:18:31,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 04:18:31,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 12: [2022-11-27 04:18:31,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-27 04:18:31,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 04:18:31,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 31: [2022-11-27 04:18:31,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:18:31,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 04:18:31,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 20: [2022-11-27 04:18:31,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:18:31,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:18:31,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 20: [2022-11-27 04:18:31,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 10: [2022-11-27 04:18:31,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 20: [2022-11-27 04:18:31,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 19: [2022-11-27 04:18:31,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
6: [2022-11-27 04:18:31,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:18:31,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 6: [2022-11-27 04:18:31,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 19: [2022-11-27 04:18:31,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 6: [2022-11-27 04:18:31,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 8: [2022-11-27 04:18:31,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:18:31,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 04:18:31,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 30: [2022-11-27 04:18:31,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:18:31,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
30: [2022-11-27 04:18:31,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 26: [2022-11-27 04:18:31,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 30: [2022-11-27 04:18:31,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 26: [2022-11-27 04:18:31,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 7: [2022-11-27 04:18:31,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:18:31,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 04:18:31,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 23: [2022-11-27 04:18:31,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:18:31,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 04:18:31,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 15: [2022-11-27 04:18:31,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-27 04:18:31,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 04:18:31,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 11: [2022-11-27 04:18:31,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:18:31,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 04:18:31,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 24: [2022-11-27 04:18:31,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:18:31,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 04:18:31,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 28: [2022-11-27 04:18:31,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:18:31,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:18:31,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 04:18:31,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 
28: [2022-11-27 04:18:31,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 04:18:31,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 4: [2022-11-27 04:18:31,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:18:31,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 04:18:31,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 9: [2022-11-27 04:18:31,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:18:31,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 04:18:31,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 0: [2022-11-27 04:18:31,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:18:31,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 04:18:31,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 3: [2022-11-27 04:18:31,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-27 04:18:31,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 04:18:31,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 25: [2022-11-27 04:18:31,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:18:31,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 04:18:31,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 16: [2022-11-27 04:18:31,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:18:31,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 04:18:31,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 21: [2022-11-27 04:18:31,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:18:31,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 04:18:31,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 5: [2022-11-27 04:18:31,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-27 04:18:31,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 04:18:31,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 13: [2022-11-27 04:18:31,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:18:31,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 04:18:31,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 2: [2022-11-27 04:18:31,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:18:31,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 04:18:31,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 1: [2022-11-27 04:18:31,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:18:31,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 04:18:31,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 14: [2022-11-27 04:18:31,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-27 04:18:31,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 04:18:31,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 27: [2022-11-27 04:18:31,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:18:31,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:18:31,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 04:18:31,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 27: [2022-11-27 04:18:31,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 04:18:31,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 12: [2022-11-27 04:18:31,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:18:31,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 04:18:31,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 19: [2022-11-27 04:18:31,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
19: [2022-11-27 04:18:31,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 04:18:31,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 20: [2022-11-27 04:18:31,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:18:31,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 04:18:31,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 18: [2022-11-27 04:18:31,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:18:31,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 04:18:31,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 31: [2022-11-27 04:18:31,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:18:31,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 04:18:31,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 17: [2022-11-27 04:18:31,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 
17: [2022-11-27 04:18:31,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 04:18:31,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 15: [2022-11-27 04:18:31,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:18:31,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 04:18:31,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 8: [2022-11-27 04:18:31,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:18:31,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 04:18:31,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 23: [2022-11-27 04:18:31,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:18:31,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 04:18:31,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 10: [2022-11-27 04:18:31,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-27 04:18:31,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 26: [2022-11-27 04:18:31,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:18:31,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 26: [2022-11-27 04:18:31,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 04:18:31,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 3: [2022-11-27 04:18:31,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:18:31,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 04:18:31,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 29: [2022-11-27 04:18:31,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:18:31,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:18:31,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 04:18:31,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 
24: [2022-11-27 04:18:31,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-27 04:18:31,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 4: [2022-11-27 04:18:31,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:18:31,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 04:18:31,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 30: [2022-11-27 04:18:31,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:18:31,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 04:18:31,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 7: [2022-11-27 04:18:31,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:18:31,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 04:18:31,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 11: [2022-11-27 04:18:31,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-27 04:18:31,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 04:18:31,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 28: [2022-11-27 04:18:31,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:18:31,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 04:18:31,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 6: [2022-11-27 04:18:31,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:18:31,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 04:18:31,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 9: [2022-11-27 04:18:31,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:18:31,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 04:18:31,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 1: [2022-11-27 04:18:31,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-27 04:18:31,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 04:18:31,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 2: [2022-11-27 04:18:31,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:18:31,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 04:18:31,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 27: [2022-11-27 04:18:31,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:18:31,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 04:18:31,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 16: [2022-11-27 04:18:31,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:18:31,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 04:18:31,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 13: [2022-11-27 04:18:31,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-27 04:18:31,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 04:18:31,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 25: [2022-11-27 04:18:31,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:18:31,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-27 04:18:31,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 0: [2022-11-27 04:18:31,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:18:31,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 04:18:31,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 31: [2022-11-27 04:18:31,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:18:31,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 04:18:31,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 5: [2022-11-27 04:18:31,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
18: [2022-11-27 04:18:31,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:18:31,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 04:18:31,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 18: [2022-11-27 04:18:31,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 04:18:31,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 17: [2022-11-27 04:18:31,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:18:31,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 04:18:31,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 30: [2022-11-27 04:18:31,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:18:31,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 04:18:31,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 22: [2022-11-27 04:18:31,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
22: [2022-11-27 04:18:31,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 04:18:31,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 9: [2022-11-27 04:18:31,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:18:31,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:18:31,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 14: [2022-11-27 04:18:31,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 9: [2022-11-27 04:18:31,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 14: [2022-11-27 04:18:31,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 29: [2022-11-27 04:18:31,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:18:31,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 21: [2022-11-27 04:18:31,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:18:31,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 
21: [2022-11-27 04:18:31,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 04:18:31,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 8: [2022-11-27 04:18:31,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:18:31,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 04:18:31,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 23: [2022-11-27 04:18:31,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:18:31,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 04:18:31,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 2: [2022-11-27 04:18:31,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:18:31,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 04:18:31,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 10: [2022-11-27 04:18:31,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-27 04:18:31,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 04:18:31,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 19: [2022-11-27 04:18:31,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:18:31,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 04:18:31,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 1: [2022-11-27 04:18:31,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:18:31,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:18:31,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 1: [2022-11-27 04:18:31,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 7: [2022-11-27 04:18:31,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 1: [2022-11-27 04:18:31,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 24: [2022-11-27 04:18:31,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-27 04:18:31,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 15: [2022-11-27 04:18:31,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:18:31,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 15: [2022-11-27 04:18:31,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 04:18:31,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 12: [2022-11-27 04:18:31,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:18:31,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:18:31,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 04:18:31,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 28: [2022-11-27 04:18:31,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 04:18:31,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 16: [2022-11-27 04:18:31,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
11: [2022-11-27 04:18:31,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:18:31,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 04:18:31,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 11: [2022-11-27 04:18:31,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 6: [2022-11-27 04:18:31,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:18:31,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 6: [2022-11-27 04:18:31,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 04:18:31,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 0: [2022-11-27 04:18:31,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:18:31,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 04:18:31,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 26: [2022-11-27 04:18:31,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
26: [2022-11-27 04:18:31,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 04:18:31,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 27: [2022-11-27 04:18:31,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:18:31,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 04:18:31,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 22: [2022-11-27 04:18:31,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:18:31,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:18:31,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 04:18:31,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 13: [2022-11-27 04:18:31,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 04:18:31,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 4: [2022-11-27 04:18:31,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-27 04:18:31,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 04:18:31,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 25: [2022-11-27 04:18:31,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:18:31,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 04:18:31,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 9: [2022-11-27 04:18:31,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:18:31,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 04:18:31,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 14: [2022-11-27 04:18:31,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:18:31,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 04:18:31,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 20: [2022-11-27 04:18:31,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 
20: [2022-11-27 04:18:31,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 04:18:31,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 26: [2022-11-27 04:18:31,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:18:31,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 04:18:31,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 21: [2022-11-27 04:18:31,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:18:31,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 04:18:31,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 3: [2022-11-27 04:18:31,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:18:31,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step152000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 04:18:31,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step152000 is ready now! 
0: successfully saved checkpoint at iteration 152000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2556.85 31: iteration 152010/ 173500 | consumed samples: 38914560 | consumed tokens: 79697018880 | elapsed time per iteration (s): 1.62 | learning rate: 2.686E-05 | global batch size: 256 | lm loss: 1.908020E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 157.780 | TFLOPs: 9.55 | 31: iteration 152020/ 173500 | consumed samples: 38917120 | consumed tokens: 79702261760 | elapsed time per iteration (s): 0.81 | learning rate: 2.686E-05 | global batch size: 256 | lm loss: 1.911938E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.444 | TFLOPs: 19.20 | 31: iteration 152030/ 173500 | consumed samples: 38919680 | consumed tokens: 79707504640 | elapsed time per iteration (s): 0.77 | learning rate: 2.685E-05 | global batch size: 256 | lm loss: 1.941529E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.105 | TFLOPs: 20.09 | 31: iteration 152040/ 173500 | consumed samples: 38922240 | consumed tokens: 79712747520 | elapsed time per iteration (s): 0.78 | learning rate: 2.684E-05 | global batch size: 256 | lm loss: 1.932194E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.346 | TFLOPs: 19.74 | 31: iteration 152050/ 173500 | consumed samples: 38924800 | consumed tokens: 79717990400 | elapsed time per iteration (s): 0.83 | learning rate: 2.684E-05 | global batch size: 256 | lm loss: 1.921551E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.542 | TFLOPs: 18.67 | 31: iteration 152060/ 173500 | consumed samples: 38927360 | consumed tokens: 79723233280 | elapsed time per iteration (s): 
0.86 | learning rate: 2.683E-05 | global batch size: 256 | lm loss: 1.938054E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.523 | TFLOPs: 17.94 | 31: iteration 152070/ 173500 | consumed samples: 38929920 | consumed tokens: 79728476160 | elapsed time per iteration (s): 0.78 | learning rate: 2.683E-05 | global batch size: 256 | lm loss: 1.907069E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.064 | TFLOPs: 19.91 | 31: iteration 152080/ 173500 | consumed samples: 38932480 | consumed tokens: 79733719040 | elapsed time per iteration (s): 0.76 | learning rate: 2.682E-05 | global batch size: 256 | lm loss: 1.900313E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.278 | TFLOPs: 20.28 | 31: iteration 152090/ 173500 | consumed samples: 38935040 | consumed tokens: 79738961920 | elapsed time per iteration (s): 0.75 | learning rate: 2.681E-05 | global batch size: 256 | lm loss: 1.920274E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.299 | TFLOPs: 20.71 | 31: iteration 152100/ 173500 | consumed samples: 38937600 | consumed tokens: 79744204800 | elapsed time per iteration (s): 0.77 | learning rate: 2.681E-05 | global batch size: 256 | lm loss: 1.891113E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.587 | TFLOPs: 20.18 | 31: iteration 152110/ 173500 | consumed samples: 38940160 | consumed tokens: 79749447680 | elapsed time per iteration (s): 0.73 | learning rate: 2.680E-05 | global batch size: 256 | lm loss: 1.909566E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.311 | TFLOPs: 21.31 | 31: 
iteration 152120/ 173500 | consumed samples: 38942720 | consumed tokens: 79754690560 | elapsed time per iteration (s): 0.76 | learning rate: 2.679E-05 | global batch size: 256 | lm loss: 1.910214E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.050 | TFLOPs: 20.33 | 31: iteration 152130/ 173500 | consumed samples: 38945280 | consumed tokens: 79759933440 | elapsed time per iteration (s): 0.75 | learning rate: 2.679E-05 | global batch size: 256 | lm loss: 1.901978E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.294 | TFLOPs: 20.65 | 31: iteration 152140/ 173500 | consumed samples: 38947840 | consumed tokens: 79765176320 | elapsed time per iteration (s): 0.75 | learning rate: 2.678E-05 | global batch size: 256 | lm loss: 1.919711E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.448 | TFLOPs: 20.54 | 31: iteration 152150/ 173500 | consumed samples: 38950400 | consumed tokens: 79770419200 | elapsed time per iteration (s): 0.78 | learning rate: 2.678E-05 | global batch size: 256 | lm loss: 1.906559E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.275 | TFLOPs: 19.98 | 31: iteration 152160/ 173500 | consumed samples: 38952960 | consumed tokens: 79775662080 | elapsed time per iteration (s): 0.75 | learning rate: 2.677E-05 | global batch size: 256 | lm loss: 1.927163E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.252 | TFLOPs: 20.52 | 31: iteration 152170/ 173500 | consumed samples: 38955520 | consumed tokens: 79780904960 | elapsed time per iteration (s): 0.76 | learning rate: 2.676E-05 | global batch size: 256 | lm loss: 1.921264E+00 | grad norm: 0.181 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.396 | TFLOPs: 20.41 | 31: iteration 152180/ 173500 | consumed samples: 38958080 | consumed tokens: 79786147840 | elapsed time per iteration (s): 0.75 | learning rate: 2.676E-05 | global batch size: 256 | lm loss: 1.946523E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.208 | TFLOPs: 20.58 | 31: iteration 152190/ 173500 | consumed samples: 38960640 | consumed tokens: 79791390720 | elapsed time per iteration (s): 0.73 | learning rate: 2.675E-05 | global batch size: 256 | lm loss: 1.928648E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.487 | TFLOPs: 21.26 | 31: iteration 152200/ 173500 | consumed samples: 38963200 | consumed tokens: 79796633600 | elapsed time per iteration (s): 0.77 | learning rate: 2.674E-05 | global batch size: 256 | lm loss: 1.926096E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.328 | TFLOPs: 20.10 | 31: iteration 152210/ 173500 | consumed samples: 38965760 | consumed tokens: 79801876480 | elapsed time per iteration (s): 0.78 | learning rate: 2.674E-05 | global batch size: 256 | lm loss: 1.939350E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.696 | TFLOPs: 19.95 | 31: iteration 152220/ 173500 | consumed samples: 38968320 | consumed tokens: 79807119360 | elapsed time per iteration (s): 0.80 | learning rate: 2.673E-05 | global batch size: 256 | lm loss: 1.921639E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.645 | TFLOPs: 19.40 | 31: iteration 152230/ 173500 | consumed samples: 38970880 | consumed tokens: 79812362240 | elapsed time per iteration (s): 0.74 | 
learning rate: 2.673E-05 | global batch size: 256 | lm loss: 1.903844E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.626 | TFLOPs: 20.79 | 31: iteration 152240/ 173500 | consumed samples: 38973440 | consumed tokens: 79817605120 | elapsed time per iteration (s): 0.77 | learning rate: 2.672E-05 | global batch size: 256 | lm loss: 1.918181E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.008 | TFLOPs: 20.15 | 31: iteration 152250/ 173500 | consumed samples: 38976000 | consumed tokens: 79822848000 | elapsed time per iteration (s): 0.73 | learning rate: 2.671E-05 | global batch size: 256 | lm loss: 1.908907E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.373 | TFLOPs: 21.14 | 31: iteration 152260/ 173500 | consumed samples: 38978560 | consumed tokens: 79828090880 | elapsed time per iteration (s): 0.79 | learning rate: 2.671E-05 | global batch size: 256 | lm loss: 1.936866E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.187 | TFLOPs: 19.49 | 31: iteration 152270/ 173500 | consumed samples: 38981120 | consumed tokens: 79833333760 | elapsed time per iteration (s): 0.76 | learning rate: 2.670E-05 | global batch size: 256 | lm loss: 1.935527E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.908 | TFLOPs: 20.50 | 31: iteration 152280/ 173500 | consumed samples: 38983680 | consumed tokens: 79838576640 | elapsed time per iteration (s): 0.74 | learning rate: 2.669E-05 | global batch size: 256 | lm loss: 1.897205E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.416 | TFLOPs: 20.96 | 31: iteration 
152290/ 173500 | consumed samples: 38986240 | consumed tokens: 79843819520 | elapsed time per iteration (s): 0.75 | learning rate: 2.669E-05 | global batch size: 256 | lm loss: 1.927302E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.127 | TFLOPs: 20.52 | 31: iteration 152300/ 173500 | consumed samples: 38988800 | consumed tokens: 79849062400 | elapsed time per iteration (s): 0.75 | learning rate: 2.668E-05 | global batch size: 256 | lm loss: 1.913842E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.179 | TFLOPs: 20.52 | 31: iteration 152310/ 173500 | consumed samples: 38991360 | consumed tokens: 79854305280 | elapsed time per iteration (s): 0.76 | learning rate: 2.668E-05 | global batch size: 256 | lm loss: 1.919532E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.268 | TFLOPs: 20.28 | 31: iteration 152320/ 173500 | consumed samples: 38993920 | consumed tokens: 79859548160 | elapsed time per iteration (s): 0.75 | learning rate: 2.667E-05 | global batch size: 256 | lm loss: 1.892442E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.059 | TFLOPs: 20.57 | 31: iteration 152330/ 173500 | consumed samples: 38996480 | consumed tokens: 79864791040 | elapsed time per iteration (s): 0.75 | learning rate: 2.666E-05 | global batch size: 256 | lm loss: 1.896877E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.047 | TFLOPs: 20.75 | 31: iteration 152340/ 173500 | consumed samples: 38999040 | consumed tokens: 79870033920 | elapsed time per iteration (s): 0.75 | learning rate: 2.666E-05 | global batch size: 256 | lm loss: 1.896939E+00 | grad norm: 0.196 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.810 | TFLOPs: 20.68 | 31: iteration 152350/ 173500 | consumed samples: 39001600 | consumed tokens: 79875276800 | elapsed time per iteration (s): 0.80 | learning rate: 2.665E-05 | global batch size: 256 | lm loss: 1.917265E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.727 | TFLOPs: 19.40 | 31: iteration 152360/ 173500 | consumed samples: 39004160 | consumed tokens: 79880519680 | elapsed time per iteration (s): 0.80 | learning rate: 2.664E-05 | global batch size: 256 | lm loss: 1.911575E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.221 | TFLOPs: 19.37 | 31: iteration 152370/ 173500 | consumed samples: 39006720 | consumed tokens: 79885762560 | elapsed time per iteration (s): 0.80 | learning rate: 2.664E-05 | global batch size: 256 | lm loss: 1.954045E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.102 | TFLOPs: 19.30 | 31: iteration 152380/ 173500 | consumed samples: 39009280 | consumed tokens: 79891005440 | elapsed time per iteration (s): 0.76 | learning rate: 2.663E-05 | global batch size: 256 | lm loss: 1.932517E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.642 | TFLOPs: 20.31 | 31: iteration 152390/ 173500 | consumed samples: 39011840 | consumed tokens: 79896248320 | elapsed time per iteration (s): 0.78 | learning rate: 2.663E-05 | global batch size: 256 | lm loss: 1.902684E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.249 | TFLOPs: 19.86 | 31: iteration 152400/ 173500 | consumed samples: 39014400 | consumed tokens: 79901491200 | elapsed time per iteration (s): 0.79 | learning 
rate: 2.662E-05 | global batch size: 256 | lm loss: 1.920073E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.955 | TFLOPs: 19.66 | 31: iteration 152410/ 173500 | consumed samples: 39016960 | consumed tokens: 79906734080 | elapsed time per iteration (s): 0.73 | learning rate: 2.661E-05 | global batch size: 256 | lm loss: 1.931332E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.585 | TFLOPs: 21.21 | 31: iteration 152420/ 173500 | consumed samples: 39019520 | consumed tokens: 79911976960 | elapsed time per iteration (s): 0.76 | learning rate: 2.661E-05 | global batch size: 256 | lm loss: 1.906760E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.520 | TFLOPs: 20.30 | 31: iteration 152430/ 173500 | consumed samples: 39022080 | consumed tokens: 79917219840 | elapsed time per iteration (s): 0.76 | learning rate: 2.660E-05 | global batch size: 256 | lm loss: 1.903667E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.080 | TFLOPs: 20.27 | 31: iteration 152440/ 173500 | consumed samples: 39024640 | consumed tokens: 79922462720 | elapsed time per iteration (s): 0.72 | learning rate: 2.659E-05 | global batch size: 256 | lm loss: 1.902027E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.637 | TFLOPs: 21.39 | 31: iteration 152450/ 173500 | consumed samples: 39027200 | consumed tokens: 79927705600 | elapsed time per iteration (s): 0.73 | learning rate: 2.659E-05 | global batch size: 256 | lm loss: 1.931849E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.375 | TFLOPs: 21.08 | 31: iteration 152460/ 
173500 | consumed samples: 39029760 | consumed tokens: 79932948480 | elapsed time per iteration (s): 0.78 | learning rate: 2.658E-05 | global batch size: 256 | lm loss: 1.950945E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.269 | TFLOPs: 19.80 | 31: iteration 152470/ 173500 | consumed samples: 39032320 | consumed tokens: 79938191360 | elapsed time per iteration (s): 0.91 | learning rate: 2.658E-05 | global batch size: 256 | lm loss: 1.907174E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 281.365 | TFLOPs: 17.02 | 31: iteration 152480/ 173500 | consumed samples: 39034880 | consumed tokens: 79943434240 | elapsed time per iteration (s): 0.78 | learning rate: 2.657E-05 | global batch size: 256 | lm loss: 1.925401E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.499 | TFLOPs: 19.93 | 31: iteration 152490/ 173500 | consumed samples: 39037440 | consumed tokens: 79948677120 | elapsed time per iteration (s): 0.87 | learning rate: 2.656E-05 | global batch size: 256 | lm loss: 1.896271E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.288 | TFLOPs: 17.80 | 31: iteration 152500/ 173500 | consumed samples: 39040000 | consumed tokens: 79953920000 | elapsed time per iteration (s): 0.88 | learning rate: 2.656E-05 | global batch size: 256 | lm loss: 1.950023E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.643 | TFLOPs: 17.64 | 31: iteration 152510/ 173500 | consumed samples: 39042560 | consumed tokens: 79959162880 | elapsed time per iteration (s): 0.87 | learning rate: 2.655E-05 | global batch size: 256 | lm loss: 1.915328E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 293.701 | TFLOPs: 17.77 | 31: iteration 152520/ 173500 | consumed samples: 39045120 | consumed tokens: 79964405760 | elapsed time per iteration (s): 0.86 | learning rate: 2.655E-05 | global batch size: 256 | lm loss: 1.887470E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.679 | TFLOPs: 17.95 | 31: iteration 152530/ 173500 | consumed samples: 39047680 | consumed tokens: 79969648640 | elapsed time per iteration (s): 0.85 | learning rate: 2.654E-05 | global batch size: 256 | lm loss: 1.950290E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.148 | TFLOPs: 18.28 | 31: iteration 152540/ 173500 | consumed samples: 39050240 | consumed tokens: 79974891520 | elapsed time per iteration (s): 0.88 | learning rate: 2.653E-05 | global batch size: 256 | lm loss: 1.933886E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.653 | TFLOPs: 17.58 | 31: iteration 152550/ 173500 | consumed samples: 39052800 | consumed tokens: 79980134400 | elapsed time per iteration (s): 0.86 | learning rate: 2.653E-05 | global batch size: 256 | lm loss: 1.904428E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.092 | TFLOPs: 18.03 | 31: iteration 152560/ 173500 | consumed samples: 39055360 | consumed tokens: 79985377280 | elapsed time per iteration (s): 0.82 | learning rate: 2.652E-05 | global batch size: 256 | lm loss: 1.920628E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.799 | TFLOPs: 18.80 | 31: iteration 152570/ 173500 | consumed samples: 39057920 | consumed tokens: 79990620160 | elapsed time per iteration (s): 0.79 | learning rate: 
2.651E-05 | global batch size: 256 | lm loss: 1.910834E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.908 | TFLOPs: 19.72 | 31: iteration 152580/ 173500 | consumed samples: 39060480 | consumed tokens: 79995863040 | elapsed time per iteration (s): 0.82 | learning rate: 2.651E-05 | global batch size: 256 | lm loss: 1.920579E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.086 | TFLOPs: 19.00 | 31: iteration 152590/ 173500 | consumed samples: 39063040 | consumed tokens: 80001105920 | elapsed time per iteration (s): 0.80 | learning rate: 2.650E-05 | global batch size: 256 | lm loss: 1.920430E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.574 | TFLOPs: 19.33 | 31: iteration 152600/ 173500 | consumed samples: 39065600 | consumed tokens: 80006348800 | elapsed time per iteration (s): 0.78 | learning rate: 2.650E-05 | global batch size: 256 | lm loss: 1.931130E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.463 | TFLOPs: 19.81 | 31: iteration 152610/ 173500 | consumed samples: 39068160 | consumed tokens: 80011591680 | elapsed time per iteration (s): 0.77 | learning rate: 2.649E-05 | global batch size: 256 | lm loss: 1.905590E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.858 | TFLOPs: 20.14 | 31: iteration 152620/ 173500 | consumed samples: 39070720 | consumed tokens: 80016834560 | elapsed time per iteration (s): 0.78 | learning rate: 2.648E-05 | global batch size: 256 | lm loss: 1.923689E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.363 | TFLOPs: 19.87 | 31: iteration 152630/ 173500 | 
consumed samples: 39073280 | consumed tokens: 80022077440 | elapsed time per iteration (s): 0.83 | learning rate: 2.648E-05 | global batch size: 256 | lm loss: 1.914907E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.487 | TFLOPs: 18.72 | 31: iteration 152640/ 173500 | consumed samples: 39075840 | consumed tokens: 80027320320 | elapsed time per iteration (s): 0.81 | learning rate: 2.647E-05 | global batch size: 256 | lm loss: 1.923532E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.343 | TFLOPs: 19.08 | 31: iteration 152650/ 173500 | consumed samples: 39078400 | consumed tokens: 80032563200 | elapsed time per iteration (s): 0.81 | learning rate: 2.647E-05 | global batch size: 256 | lm loss: 1.913174E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.389 | TFLOPs: 19.14 | 31: iteration 152660/ 173500 | consumed samples: 39080960 | consumed tokens: 80037806080 | elapsed time per iteration (s): 0.79 | learning rate: 2.646E-05 | global batch size: 256 | lm loss: 1.913362E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.016 | TFLOPs: 19.60 | 31: iteration 152670/ 173500 | consumed samples: 39083520 | consumed tokens: 80043048960 | elapsed time per iteration (s): 0.79 | learning rate: 2.645E-05 | global batch size: 256 | lm loss: 1.894766E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.549 | TFLOPs: 19.57 | 31: iteration 152680/ 173500 | consumed samples: 39086080 | consumed tokens: 80048291840 | elapsed time per iteration (s): 0.73 | learning rate: 2.645E-05 | global batch size: 256 | lm loss: 1.924780E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 348.475 | TFLOPs: 21.08 | 31: iteration 152690/ 173500 | consumed samples: 39088640 | consumed tokens: 80053534720 | elapsed time per iteration (s): 0.81 | learning rate: 2.644E-05 | global batch size: 256 | lm loss: 1.924208E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.768 | TFLOPs: 19.10 | 31: iteration 152700/ 173500 | consumed samples: 39091200 | consumed tokens: 80058777600 | elapsed time per iteration (s): 0.77 | learning rate: 2.643E-05 | global batch size: 256 | lm loss: 1.912054E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.319 | TFLOPs: 20.04 | 31: iteration 152710/ 173500 | consumed samples: 39093760 | consumed tokens: 80064020480 | elapsed time per iteration (s): 0.76 | learning rate: 2.643E-05 | global batch size: 256 | lm loss: 1.931200E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.016 | TFLOPs: 20.45 | 31: iteration 152720/ 173500 | consumed samples: 39096320 | consumed tokens: 80069263360 | elapsed time per iteration (s): 0.78 | learning rate: 2.642E-05 | global batch size: 256 | lm loss: 1.918736E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.958 | TFLOPs: 19.96 | 31: iteration 152730/ 173500 | consumed samples: 39098880 | consumed tokens: 80074506240 | elapsed time per iteration (s): 0.76 | learning rate: 2.642E-05 | global batch size: 256 | lm loss: 1.897779E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.285 | TFLOPs: 20.40 | 31: iteration 152740/ 173500 | consumed samples: 39101440 | consumed tokens: 80079749120 | elapsed time per iteration (s): 0.80 | learning rate: 
2.641E-05 | global batch size: 256 | lm loss: 1.931903E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.383 | TFLOPs: 19.26 | 31: iteration 152750/ 173500 | consumed samples: 39104000 | consumed tokens: 80084992000 | elapsed time per iteration (s): 0.80 | learning rate: 2.640E-05 | global batch size: 256 | lm loss: 1.927498E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.434 | TFLOPs: 19.45 | 31: iteration 152760/ 173500 | consumed samples: 39106560 | consumed tokens: 80090234880 | elapsed time per iteration (s): 0.81 | learning rate: 2.640E-05 | global batch size: 256 | lm loss: 1.878472E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.215 | TFLOPs: 19.07 | 31: iteration 152770/ 173500 | consumed samples: 39109120 | consumed tokens: 80095477760 | elapsed time per iteration (s): 0.78 | learning rate: 2.639E-05 | global batch size: 256 | lm loss: 1.901977E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.512 | TFLOPs: 19.93 | 31: iteration 152780/ 173500 | consumed samples: 39111680 | consumed tokens: 80100720640 | elapsed time per iteration (s): 0.82 | learning rate: 2.639E-05 | global batch size: 256 | lm loss: 1.938742E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.074 | TFLOPs: 18.82 | 31: iteration 152790/ 173500 | consumed samples: 39114240 | consumed tokens: 80105963520 | elapsed time per iteration (s): 0.73 | learning rate: 2.638E-05 | global batch size: 256 | lm loss: 1.916207E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.150 | TFLOPs: 21.24 | 31: iteration 152800/ 173500 | 
consumed samples: 39116800 | consumed tokens: 80111206400 | elapsed time per iteration (s): 0.72 | learning rate: 2.637E-05 | global batch size: 256 | lm loss: 1.935423E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.882 | TFLOPs: 21.41 | 31: iteration 152810/ 173500 | consumed samples: 39119360 | consumed tokens: 80116449280 | elapsed time per iteration (s): 0.79 | learning rate: 2.637E-05 | global batch size: 256 | lm loss: 1.904847E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.286 | TFLOPs: 19.50 | 31: iteration 152820/ 173500 | consumed samples: 39121920 | consumed tokens: 80121692160 | elapsed time per iteration (s): 0.75 | learning rate: 2.636E-05 | global batch size: 256 | lm loss: 1.912210E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.580 | TFLOPs: 20.79 | 31: iteration 152830/ 173500 | consumed samples: 39124480 | consumed tokens: 80126935040 | elapsed time per iteration (s): 0.87 | learning rate: 2.636E-05 | global batch size: 256 | lm loss: 1.919045E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.201 | TFLOPs: 17.80 | 31: iteration 152840/ 173500 | consumed samples: 39127040 | consumed tokens: 80132177920 | elapsed time per iteration (s): 0.82 | learning rate: 2.635E-05 | global batch size: 256 | lm loss: 1.900143E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.799 | TFLOPs: 18.98 | 31: iteration 152850/ 173500 | consumed samples: 39129600 | consumed tokens: 80137420800 | elapsed time per iteration (s): 0.79 | learning rate: 2.634E-05 | global batch size: 256 | lm loss: 1.905704E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 322.401 | TFLOPs: 19.50 | 31: iteration 152860/ 173500 | consumed samples: 39132160 | consumed tokens: 80142663680 | elapsed time per iteration (s): 0.96 | learning rate: 2.634E-05 | global batch size: 256 | lm loss: 1.929216E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 265.449 | TFLOPs: 16.06 | 31: iteration 152870/ 173500 | consumed samples: 39134720 | consumed tokens: 80147906560 | elapsed time per iteration (s): 0.85 | learning rate: 2.633E-05 | global batch size: 256 | lm loss: 1.925630E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.954 | TFLOPs: 18.15 | 31: iteration 152880/ 173500 | consumed samples: 39137280 | consumed tokens: 80153149440 | elapsed time per iteration (s): 0.86 | learning rate: 2.633E-05 | global batch size: 256 | lm loss: 1.917712E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.022 | TFLOPs: 17.97 | 31: iteration 152890/ 173500 | consumed samples: 39139840 | consumed tokens: 80158392320 | elapsed time per iteration (s): 0.84 | learning rate: 2.632E-05 | global batch size: 256 | lm loss: 1.921073E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.552 | TFLOPs: 18.55 | 31: iteration 152900/ 173500 | consumed samples: 39142400 | consumed tokens: 80163635200 | elapsed time per iteration (s): 0.82 | learning rate: 2.631E-05 | global batch size: 256 | lm loss: 1.892301E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.258 | TFLOPs: 18.83 | 31: iteration 152910/ 173500 | consumed samples: 39144960 | consumed tokens: 80168878080 | elapsed time per iteration (s): 1.18 | learning rate: 
2.631E-05 | global batch size: 256 | lm loss: 1.904496E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 217.365 | TFLOPs: 13.15 | 31: iteration 152920/ 173500 | consumed samples: 39147520 | consumed tokens: 80174120960 | elapsed time per iteration (s): 0.79 | learning rate: 2.630E-05 | global batch size: 256 | lm loss: 1.921924E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.246 | TFLOPs: 19.56 | 31: iteration 152930/ 173500 | consumed samples: 39150080 | consumed tokens: 80179363840 | elapsed time per iteration (s): 0.89 | learning rate: 2.630E-05 | global batch size: 256 | lm loss: 1.907247E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.596 | TFLOPs: 17.40 | 31: iteration 152940/ 173500 | consumed samples: 39152640 | consumed tokens: 80184606720 | elapsed time per iteration (s): 0.79 | learning rate: 2.629E-05 | global batch size: 256 | lm loss: 1.918804E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.434 | TFLOPs: 19.51 | 31: iteration 152950/ 173500 | consumed samples: 39155200 | consumed tokens: 80189849600 | elapsed time per iteration (s): 0.85 | learning rate: 2.628E-05 | global batch size: 256 | lm loss: 1.918152E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.363 | TFLOPs: 18.29 | 31: iteration 152960/ 173500 | consumed samples: 39157760 | consumed tokens: 80195092480 | elapsed time per iteration (s): 0.79 | learning rate: 2.628E-05 | global batch size: 256 | lm loss: 1.899327E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.725 | TFLOPs: 19.71 | 31: iteration 152970/ 173500 | 
consumed samples: 39160320 | consumed tokens: 80200335360 | elapsed time per iteration (s): 0.75 | learning rate: 2.627E-05 | global batch size: 256 | lm loss: 1.924362E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.744 | TFLOPs: 20.67 | 31: iteration 152980/ 173500 | consumed samples: 39162880 | consumed tokens: 80205578240 | elapsed time per iteration (s): 0.75 | learning rate: 2.626E-05 | global batch size: 256 | lm loss: 1.932756E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.862 | TFLOPs: 20.68 | 31: iteration 152990/ 173500 | consumed samples: 39165440 | consumed tokens: 80210821120 | elapsed time per iteration (s): 0.73 | learning rate: 2.626E-05 | global batch size: 256 | lm loss: 1.931568E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.621 | TFLOPs: 21.33 | 31: iteration 153000/ 173500 | consumed samples: 39168000 | consumed tokens: 80216064000 | elapsed time per iteration (s): 0.72 | learning rate: 2.625E-05 | global batch size: 256 | lm loss: 1.906381E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.263 | TFLOPs: 21.61 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 153000 | lm loss value: 1.784699E+00 | lm loss PPL: 5.957786E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 153000 to checkpoints_1b1long 0: [2022-11-27 04:31:50,522] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step153000 is begin to save! 
0: [2022-11-27 04:31:50,531] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_01-model_00-model_states.pt... 0: [2022-11-27 04:31:50,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_01-model_00-model_states.pt. 0: [2022-11-27 04:31:50,751] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_03-model_00-model_states.pt... 0: [2022-11-27 04:31:50,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_03-model_00-model_states.pt. 0: [2022-11-27 04:31:50,836] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_04-model_00-model_states.pt... 0: [2022-11-27 04:31:50,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_04-model_00-model_states.pt. 0: [2022-11-27 04:31:50,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_05-model_00-model_states.pt... 0: [2022-11-27 04:31:50,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_05-model_00-model_states.pt. 0: [2022-11-27 04:31:50,994] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_06-model_00-model_states.pt... 0: [2022-11-27 04:31:51,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_06-model_00-model_states.pt. 0: [2022-11-27 04:31:51,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_07-model_00-model_states.pt... 0: [2022-11-27 04:31:51,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_07-model_00-model_states.pt. 
0: [2022-11-27 04:31:51,150] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_08-model_00-model_states.pt... 0: [2022-11-27 04:31:51,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_08-model_00-model_states.pt. 0: [2022-11-27 04:31:51,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_09-model_00-model_states.pt... 0: [2022-11-27 04:31:51,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_09-model_00-model_states.pt. 0: [2022-11-27 04:31:51,309] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_10-model_00-model_states.pt... 0: [2022-11-27 04:31:51,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_10-model_00-model_states.pt. 0: [2022-11-27 04:31:51,383] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_11-model_00-model_states.pt... 0: [2022-11-27 04:31:51,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_11-model_00-model_states.pt. 0: [2022-11-27 04:31:51,460] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_12-model_00-model_states.pt... 0: [2022-11-27 04:31:51,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_12-model_00-model_states.pt. 0: [2022-11-27 04:31:51,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_13-model_00-model_states.pt... 0: [2022-11-27 04:31:51,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_13-model_00-model_states.pt. 
0: [2022-11-27 04:31:51,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_14-model_00-model_states.pt... 0: [2022-11-27 04:31:51,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_14-model_00-model_states.pt. 0: [2022-11-27 04:31:51,682] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_15-model_00-model_states.pt... 0: [2022-11-27 04:31:51,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_15-model_00-model_states.pt. 0: [2022-11-27 04:31:51,757] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_16-model_00-model_states.pt... 0: [2022-11-27 04:31:51,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_16-model_00-model_states.pt. 0: [2022-11-27 04:31:51,829] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_17-model_00-model_states.pt... 0: [2022-11-27 04:31:51,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_17-model_00-model_states.pt. 0: [2022-11-27 04:31:51,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_18-model_00-model_states.pt... 0: [2022-11-27 04:31:51,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_18-model_00-model_states.pt. 0: [2022-11-27 04:31:51,980] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_19-model_00-model_states.pt... 0: [2022-11-27 04:31:52,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_19-model_00-model_states.pt. 
0: [2022-11-27 04:31:52,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_20-model_00-model_states.pt... 0: [2022-11-27 04:31:52,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_20-model_00-model_states.pt. 0: [2022-11-27 04:31:52,127] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_21-model_00-model_states.pt... 0: [2022-11-27 04:31:52,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_21-model_00-model_states.pt. 0: [2022-11-27 04:31:52,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_22-model_00-model_states.pt... 0: [2022-11-27 04:31:52,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_22-model_00-model_states.pt. 0: [2022-11-27 04:31:52,275] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_23-model_00-model_states.pt... 0: [2022-11-27 04:31:52,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_23-model_00-model_states.pt. 0: [2022-11-27 04:31:52,352] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_24-model_00-model_states.pt... 0: [2022-11-27 04:31:52,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_24-model_00-model_states.pt. 0: [2022-11-27 04:31:52,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_25-model_00-model_states.pt... 0: [2022-11-27 04:31:52,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_25-model_00-model_states.pt. 
0: [2022-11-27 04:31:52,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_26-model_00-model_states.pt... 0: [2022-11-27 04:31:52,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_26-model_00-model_states.pt. 0: [2022-11-27 04:31:52,573] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_27-model_00-model_states.pt... 0: [2022-11-27 04:31:52,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_27-model_00-model_states.pt. 0: [2022-11-27 04:31:52,647] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_28-model_00-model_states.pt... 0: [2022-11-27 04:31:52,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_28-model_00-model_states.pt. 0: [2022-11-27 04:31:52,718] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/layer_30-model_00-model_states.pt... 0: [2022-11-27 04:31:52,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/layer_30-model_00-model_states.pt. 0: [2022-11-27 04:31:52,723] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step153000/mp_rank_00_model_states.pt 0: [2022-11-27 04:31:52,723] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/mp_rank_00_model_states.pt... 0: [2022-11-27 04:31:52,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/mp_rank_00_model_states.pt. 0: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
6: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 
10: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 
12: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 
23: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 
24: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 
22: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
26: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 
5: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 
8: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 
2: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
15: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 
28: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 
22: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 
19: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
7: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
1: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 
25: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 
30: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 
27: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 
1: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
28: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 
19: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
31: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 
0: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:31:52,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:31:52,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:31:52,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 30: [2022-11-27 04:31:52,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:31:52,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 30: [2022-11-27 04:31:52,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 04:31:52,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 10: [2022-11-27 04:31:52,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:31:52,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 04:31:52,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 
16: [2022-11-27 04:31:52,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:31:52,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:31:52,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 23: [2022-11-27 04:31:52,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 16: [2022-11-27 04:31:52,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 23: [2022-11-27 04:31:52,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 7: [2022-11-27 04:31:52,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:31:52,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 04:31:52,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 28: [2022-11-27 04:31:52,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:31:52,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
27: [2022-11-27 04:31:52,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:31:52,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 21: [2022-11-27 04:31:52,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 27: [2022-11-27 04:31:52,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 28: [2022-11-27 04:31:52,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 21: [2022-11-27 04:31:52,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 27: [2022-11-27 04:31:52,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 24: [2022-11-27 04:31:52,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:31:52,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:31:52,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 2: [2022-11-27 04:31:52,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
24: [2022-11-27 04:31:52,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 29: [2022-11-27 04:31:52,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 2: [2022-11-27 04:31:52,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 24: [2022-11-27 04:31:52,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 2: [2022-11-27 04:31:52,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 18: [2022-11-27 04:31:52,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:31:52,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-27 04:31:52,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 8: [2022-11-27 04:31:52,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:31:52,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 04:31:52,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 13: [2022-11-27 04:31:52,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-27 04:31:52,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 04:31:52,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 2: [2022-11-27 04:31:52,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:31:52,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 04:31:52,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 15: [2022-11-27 04:31:52,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:31:52,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 04:31:52,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 11: [2022-11-27 04:31:52,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:31:52,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 04:31:52,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 22: [2022-11-27 04:31:52,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
22: [2022-11-27 04:31:52,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 9: [2022-11-27 04:31:52,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:31:52,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 9: [2022-11-27 04:31:52,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 04:31:52,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 23: [2022-11-27 04:31:52,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:31:52,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 04:31:52,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 20: [2022-11-27 04:31:52,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:31:52,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
20: [2022-11-27 04:31:52,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 11: [2022-11-27 04:31:52,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:31:52,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 20: [2022-11-27 04:31:52,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 29: [2022-11-27 04:31:52,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 6: [2022-11-27 04:31:52,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:31:52,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:31:52,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 04:31:52,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 04:31:52,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 6: [2022-11-27 04:31:52,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 5: [2022-11-27 04:31:52,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
8: [2022-11-27 04:31:52,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:31:52,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:31:52,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 04:31:52,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:31:52,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 8: [2022-11-27 04:31:52,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 19: [2022-11-27 04:31:52,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 5: [2022-11-27 04:31:52,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 8: [2022-11-27 04:31:52,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 5: [2022-11-27 04:31:52,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 19: [2022-11-27 04:31:52,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 22: [2022-11-27 04:31:52,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
22: [2022-11-27 04:31:52,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 04:31:52,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 10: [2022-11-27 04:31:52,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:31:52,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 04:31:52,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 4: [2022-11-27 04:31:52,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:31:52,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:31:52,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
4: [2022-11-27 04:31:52,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 21: [2022-11-27 04:31:52,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 26: [2022-11-27 04:31:52,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 4: [2022-11-27 04:31:52,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 21: [2022-11-27 04:31:52,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 26: [2022-11-27 04:31:52,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 3: [2022-11-27 04:31:52,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:31:52,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:31:52,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 
3: [2022-11-27 04:31:52,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 04:31:52,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 30: [2022-11-27 04:31:52,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 3: [2022-11-27 04:31:52,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 3: [2022-11-27 04:31:52,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 30: [2022-11-27 04:31:52,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 13: [2022-11-27 04:31:52,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:31:52,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 20: [2022-11-27 04:31:52,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:31:52,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 20: [2022-11-27 04:31:52,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 04:31:52,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 
11: [2022-11-27 04:31:52,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 24: [2022-11-27 04:31:52,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:31:52,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 24: [2022-11-27 04:31:52,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 04:31:52,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 14: [2022-11-27 04:31:52,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:31:52,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:31:52,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 26: [2022-11-27 04:31:52,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 14: [2022-11-27 04:31:52,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 26: [2022-11-27 04:31:52,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 9: [2022-11-27 04:31:52,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
11: [2022-11-27 04:31:52,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:31:52,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 11: [2022-11-27 04:31:52,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 9: [2022-11-27 04:31:52,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 11: [2022-11-27 04:31:52,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 14: [2022-11-27 04:31:52,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:31:52,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 04:31:52,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 9: [2022-11-27 04:31:52,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:31:52,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
9: [2022-11-27 04:31:52,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 19: [2022-11-27 04:31:52,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 9: [2022-11-27 04:31:52,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 19: [2022-11-27 04:31:52,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 27: [2022-11-27 04:31:52,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:31:52,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:31:52,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 8: [2022-11-27 04:31:52,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 04:31:52,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 30: [2022-11-27 04:31:52,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:31:52,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 
30: [2022-11-27 04:31:52,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 04:31:52,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 9: [2022-11-27 04:31:52,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:31:52,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:31:52,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 24: [2022-11-27 04:31:52,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 9: [2022-11-27 04:31:52,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 24: [2022-11-27 04:31:52,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 13: [2022-11-27 04:31:52,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:31:52,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
13: [2022-11-27 04:31:52,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 7: [2022-11-27 04:31:52,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 13: [2022-11-27 04:31:52,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 7: [2022-11-27 04:31:52,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 16: [2022-11-27 04:31:52,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:31:52,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 04:31:52,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 7: [2022-11-27 04:31:52,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:31:52,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 04:31:52,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 28: [2022-11-27 04:31:52,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:31:52,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
20: [2022-11-27 04:31:52,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:31:52,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 29: [2022-11-27 04:31:52,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 23: [2022-11-27 04:31:52,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:31:52,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 23: [2022-11-27 04:31:52,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 20: [2022-11-27 04:31:52,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 23: [2022-11-27 04:31:52,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 28: [2022-11-27 04:31:52,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 20: [2022-11-27 04:31:52,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 26: [2022-11-27 04:31:52,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-27 04:31:52,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 04:31:52,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 4: [2022-11-27 04:31:52,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:31:52,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 11: [2022-11-27 04:31:52,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:31:52,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:31:52,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 11: [2022-11-27 04:31:52,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 04:31:52,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 0: [2022-11-27 04:31:52,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:31:52,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 04:31:52,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 
0: [2022-11-27 04:31:52,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:31:52,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:31:52,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 04:31:52,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 04:31:52,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 0: [2022-11-27 04:31:52,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 20: [2022-11-27 04:31:52,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:31:52,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 04:31:52,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 28: [2022-11-27 04:31:52,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:31:52,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 
28: [2022-11-27 04:31:52,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 04:31:52,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 17: [2022-11-27 04:31:52,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 04:31:52,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 17: [2022-11-27 04:31:52,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:31:52,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 04:31:52,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 17: [2022-11-27 04:31:52,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:31:52,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 04:31:52,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 24: [2022-11-27 04:31:52,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
24: [2022-11-27 04:31:52,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 04:31:52,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 28: [2022-11-27 04:31:52,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:31:52,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:31:52,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:31:52,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:31:52,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-27 04:31:52,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 04:31:52,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 04:31:52,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 25: [2022-11-27 04:31:52,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 25: [2022-11-27 04:31:52,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 
3: [2022-11-27 04:31:52,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:31:52,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:31:52,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 10: [2022-11-27 04:31:52,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 04:31:52,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 3: [2022-11-27 04:31:52,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 27: [2022-11-27 04:31:52,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:31:52,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 15: [2022-11-27 04:31:52,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:31:52,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 15: [2022-11-27 04:31:52,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 04:31:52,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 
2: [2022-11-27 04:31:52,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:31:52,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 04:31:52,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 26: [2022-11-27 04:31:52,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:31:52,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 04:31:52,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 31: [2022-11-27 04:31:52,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:31:52,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:31:52,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:31:52,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
31: [2022-11-27 04:31:52,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 04:31:52,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:31:52,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 18: [2022-11-27 04:31:52,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:31:52,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 31: [2022-11-27 04:31:52,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 31: [2022-11-27 04:31:52,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 18: [2022-11-27 04:31:52,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 31: [2022-11-27 04:31:52,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 18: [2022-11-27 04:31:52,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 16: [2022-11-27 04:31:52,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 31: [2022-11-27 04:31:52,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 
18: [2022-11-27 04:31:52,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 04:31:52,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 18: [2022-11-27 04:31:52,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:31:52,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 04:31:52,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 1: [2022-11-27 04:31:52,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:31:52,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 04:31:52,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 21: [2022-11-27 04:31:52,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:31:52,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 23: [2022-11-27 04:31:52,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
21: [2022-11-27 04:31:52,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:31:52,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 23: [2022-11-27 04:31:52,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 21: [2022-11-27 04:31:52,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 23: [2022-11-27 04:31:52,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 14: [2022-11-27 04:31:52,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:31:52,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 14: [2022-11-27 04:31:52,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 04:31:52,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 3: [2022-11-27 04:31:52,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:31:52,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 19: [2022-11-27 04:31:52,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
19: [2022-11-27 04:31:52,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 3: [2022-11-27 04:31:52,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 19: [2022-11-27 04:31:52,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 15: [2022-11-27 04:31:52,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:31:52,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 04:31:52,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 25: [2022-11-27 04:31:52,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:31:52,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 4: [2022-11-27 04:31:52,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:31:52,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 4: [2022-11-27 04:31:52,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 04:31:52,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 
17: [2022-11-27 04:31:52,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:31:52,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 04:31:52,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 1: [2022-11-27 04:31:52,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:31:52,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:31:52,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 04:31:52,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 04:31:52,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 1: [2022-11-27 04:31:52,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 4: [2022-11-27 04:31:52,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:31:52,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 04:31:52,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 
2: [2022-11-27 04:31:52,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:31:52,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 04:31:52,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 12: [2022-11-27 04:31:52,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:31:52,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 04:31:52,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 6: [2022-11-27 04:31:52,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:31:52,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:31:52,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 04:31:52,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 04:31:52,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 6: [2022-11-27 04:31:52,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 
15: [2022-11-27 04:31:52,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:31:52,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 04:31:52,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 28: [2022-11-27 04:31:52,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 04:31:52,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 12: [2022-11-27 04:31:52,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:31:52,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:31:52,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 04:31:52,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 12: [2022-11-27 04:31:52,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 04:31:52,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 30: [2022-11-27 04:31:52,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 
30: [2022-11-27 04:31:52,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 04:31:52,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 14: [2022-11-27 04:31:52,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:31:52,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:31:52,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 22: [2022-11-27 04:31:52,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 14: [2022-11-27 04:31:52,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 22: [2022-11-27 04:31:52,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 22: [2022-11-27 04:31:52,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:31:52,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 04:31:52,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 16: [2022-11-27 04:31:52,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
16: [2022-11-27 04:31:52,885] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 04:31:52,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 5: [2022-11-27 04:31:52,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:31:52,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 04:31:52,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 13: [2022-11-27 04:31:52,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:31:52,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:31:52,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 5: [2022-11-27 04:31:52,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 13: [2022-11-27 04:31:52,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 5: [2022-11-27 04:31:52,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 8: [2022-11-27 04:31:52,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
0: [2022-11-27 04:31:52,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 8: [2022-11-27 04:31:52,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 04:31:52,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 0: [2022-11-27 04:31:52,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 29: [2022-11-27 04:31:52,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:31:52,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 04:31:52,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 27: [2022-11-27 04:31:52,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:31:52,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 04:31:52,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 17: [2022-11-27 04:31:52,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 
17: [2022-11-27 04:31:52,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 04:31:52,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 10: [2022-11-27 04:31:52,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:31:52,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 04:31:52,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 7: [2022-11-27 04:31:52,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:31:52,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 04:31:52,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 6: [2022-11-27 04:31:52,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:31:52,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 04:31:52,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 11: [2022-11-27 04:31:52,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-27 04:31:52,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 04:31:52,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 18: [2022-11-27 04:31:52,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:31:52,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 04:31:52,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 9: [2022-11-27 04:31:52,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:31:52,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 04:31:52,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 24: [2022-11-27 04:31:52,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:31:52,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 04:31:52,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 0: [2022-11-27 04:31:52,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-27 04:31:52,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 04:31:52,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 25: [2022-11-27 04:31:52,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:31:52,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 04:31:52,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 31: [2022-11-27 04:31:52,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:31:52,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 04:31:52,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 1: [2022-11-27 04:31:52,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:31:52,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 04:31:52,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 12: [2022-11-27 04:31:52,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-27 04:31:52,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 04:31:52,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 5: [2022-11-27 04:31:52,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:31:52,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 04:31:52,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 30: [2022-11-27 04:31:52,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:31:52,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 04:31:52,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 16: [2022-11-27 04:31:52,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:31:52,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 04:31:52,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 27: [2022-11-27 04:31:52,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
27: [2022-11-27 04:31:52,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 04:31:52,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 2: [2022-11-27 04:31:52,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:31:52,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 04:31:52,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 22: [2022-11-27 04:31:52,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:31:52,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 04:31:52,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 3: [2022-11-27 04:31:52,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:31:52,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 04:31:52,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 21: [2022-11-27 04:31:52,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
23: [2022-11-27 04:31:52,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:31:52,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 04:31:52,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 23: [2022-11-27 04:31:52,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 04:31:52,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 19: [2022-11-27 04:31:52,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:31:52,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 04:31:52,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 4: [2022-11-27 04:31:52,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:31:52,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 04:31:52,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 10: [2022-11-27 04:31:52,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-27 04:31:52,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 04:31:52,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 20: [2022-11-27 04:31:52,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:31:52,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:31:52,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 28: [2022-11-27 04:31:52,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 04:31:52,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 20: [2022-11-27 04:31:52,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 8: [2022-11-27 04:31:52,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:31:52,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 04:31:52,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 26: [2022-11-27 04:31:52,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
26: [2022-11-27 04:31:52,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 04:31:52,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 7: [2022-11-27 04:31:52,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:31:52,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 04:31:52,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 15: [2022-11-27 04:31:52,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:31:52,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 04:31:52,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 13: [2022-11-27 04:31:52,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:31:52,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 04:31:52,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 29: [2022-11-27 04:31:52,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-27 04:31:52,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 04:31:52,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 6: [2022-11-27 04:31:52,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:31:52,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 04:31:52,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 17: [2022-11-27 04:31:52,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:31:52,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 04:31:52,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 11: [2022-11-27 04:31:52,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:31:52,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 04:31:52,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 14: [2022-11-27 04:31:52,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-27 04:31:52,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 04:31:52,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 0: [2022-11-27 04:31:52,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:31:52,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 04:31:52,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 24: [2022-11-27 04:31:52,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:31:52,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 04:31:52,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 18: [2022-11-27 04:31:52,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:31:52,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 04:31:52,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 30: [2022-11-27 04:31:52,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-27 04:31:52,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 04:31:52,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 9: [2022-11-27 04:31:52,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:31:52,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 04:31:52,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 25: [2022-11-27 04:31:52,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:31:52,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 04:31:52,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 16: [2022-11-27 04:31:52,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:31:52,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 04:31:52,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 1: [2022-11-27 04:31:52,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-27 04:31:52,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 04:31:52,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 28: [2022-11-27 04:31:52,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:31:52,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:31:52,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 04:31:52,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 28: [2022-11-27 04:31:52,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 04:31:52,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 20: [2022-11-27 04:31:52,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:31:52,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 04:31:52,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 26: [2022-11-27 04:31:52,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 
26: [2022-11-27 04:31:52,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 04:31:52,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 23: [2022-11-27 04:31:52,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:31:52,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 04:31:52,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 22: [2022-11-27 04:31:52,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:31:52,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:31:52,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 2: [2022-11-27 04:31:52,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 22: [2022-11-27 04:31:52,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 2: [2022-11-27 04:31:52,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 10: [2022-11-27 04:31:52,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-27 04:31:52,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 04:31:52,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 0: [2022-11-27 04:31:52,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:31:52,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 04:31:52,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 21: [2022-11-27 04:31:52,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:31:52,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 04:31:52,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 13: [2022-11-27 04:31:52,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:31:52,982] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 04:31:52,982] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 31: [2022-11-27 04:31:52,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
31: [2022-11-27 04:31:52,982] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 04:31:52,982] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 8: [2022-11-27 04:31:52,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:31:52,982] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 04:31:52,982] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 29: [2022-11-27 04:31:52,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:31:52,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 04:31:52,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 4: [2022-11-27 04:31:52,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:31:52,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 3: [2022-11-27 04:31:52,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:31:52,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 
3: [2022-11-27 04:31:52,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 04:31:52,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 9: [2022-11-27 04:31:52,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:31:52,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:31:52,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 04:31:52,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 15: [2022-11-27 04:31:52,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 04:31:52,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 18: [2022-11-27 04:31:52,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:31:52,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 04:31:52,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 17: [2022-11-27 04:31:52,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 
17: [2022-11-27 04:31:52,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 04:31:52,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 24: [2022-11-27 04:31:52,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:31:52,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:31:52,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 24: [2022-11-27 04:31:52,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 6: [2022-11-27 04:31:52,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 19: [2022-11-27 04:31:52,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:31:52,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 27: [2022-11-27 04:31:52,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:31:52,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 04:31:52,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 
27: [2022-11-27 04:31:52,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 04:31:52,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 11: [2022-11-27 04:31:52,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:31:52,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 04:31:52,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 28: [2022-11-27 04:31:52,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:31:52,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:31:52,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 04:31:52,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 14: [2022-11-27 04:31:52,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:31:52,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 04:31:52,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 
16: [2022-11-27 04:31:52,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:31:52,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 04:31:52,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 25: [2022-11-27 04:31:52,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:31:52,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 04:31:52,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 26: [2022-11-27 04:31:52,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:31:52,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 04:31:52,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 20: [2022-11-27 04:31:52,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:31:52,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 04:31:52,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 
5: [2022-11-27 04:31:52,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:31:52,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:31:52,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 10: [2022-11-27 04:31:52,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:31:52,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 5: [2022-11-27 04:31:52,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 12: [2022-11-27 04:31:52,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 10: [2022-11-27 04:31:52,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 04:31:52,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 9: [2022-11-27 04:31:52,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:31:52,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 04:31:52,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 
28: [2022-11-27 04:31:52,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 04:31:52,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 22: [2022-11-27 04:31:52,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:31:52,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:31:52,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 04:31:52,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 30: [2022-11-27 04:31:52,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 04:31:52,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 4: [2022-11-27 04:31:52,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:31:52,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 04:31:52,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 15: [2022-11-27 04:31:52,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-27 04:31:52,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 04:31:52,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 13: [2022-11-27 04:31:52,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:31:52,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 04:31:52,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 2: [2022-11-27 04:31:52,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:31:52,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 04:31:52,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 0: [2022-11-27 04:31:52,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:31:52,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 04:31:52,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 8: [2022-11-27 04:31:52,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-27 04:31:52,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 04:31:52,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 31: [2022-11-27 04:31:52,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:31:52,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:31:52,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 04:31:52,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 11: [2022-11-27 04:31:52,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 04:31:52,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 3: [2022-11-27 04:31:52,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:31:52,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 04:31:52,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 25: [2022-11-27 04:31:52,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
7: [2022-11-27 04:31:52,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:31:52,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 04:31:52,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 18: [2022-11-27 04:31:52,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:31:52,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 14: [2022-11-27 04:31:52,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:31:52,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 7: [2022-11-27 04:31:52,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 29: [2022-11-27 04:31:52,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:31:52,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 14: [2022-11-27 04:31:52,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 04:31:52,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 
29: [2022-11-27 04:31:52,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 04:31:52,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 24: [2022-11-27 04:31:52,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:31:52,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 04:31:52,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 6: [2022-11-27 04:31:52,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:31:53,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 1: [2022-11-27 04:31:52,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:31:53,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 1: [2022-11-27 04:31:53,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 04:31:53,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 19: [2022-11-27 04:31:53,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 
19: [2022-11-27 04:31:53,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 04:31:53,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 12: [2022-11-27 04:31:53,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:31:53,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:31:53,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 17: [2022-11-27 04:31:53,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 12: [2022-11-27 04:31:53,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 17: [2022-11-27 04:31:53,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 23: [2022-11-27 04:31:53,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:31:53,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 04:31:53,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 27: [2022-11-27 04:31:53,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
27: [2022-11-27 04:31:53,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 04:31:53,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 21: [2022-11-27 04:31:53,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:31:53,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 12: [2022-11-27 04:31:53,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:31:53,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 8: [2022-11-27 04:31:53,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:31:53,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 04:31:53,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 8: [2022-11-27 04:31:53,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 04:31:53,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 16: [2022-11-27 04:31:53,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
16: [2022-11-27 04:31:53,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 04:31:53,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 26: [2022-11-27 04:31:53,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:31:53,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 04:31:53,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 20: [2022-11-27 04:31:53,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:31:53,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:31:53,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 13: [2022-11-27 04:31:53,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 20: [2022-11-27 04:31:53,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 13: [2022-11-27 04:31:53,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 30: [2022-11-27 04:31:53,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-27 04:31:53,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 04:31:53,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 5: [2022-11-27 04:31:53,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:31:53,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 04:31:53,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 31: [2022-11-27 04:31:53,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:31:53,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 04:31:53,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 3: [2022-11-27 04:31:53,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:31:53,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 04:31:53,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 19: [2022-11-27 04:31:53,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
10: [2022-11-27 04:31:53,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:31:53,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 10: [2022-11-27 04:31:53,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 19: [2022-11-27 04:31:53,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 10: [2022-11-27 04:31:53,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 1: [2022-11-27 04:31:53,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:31:53,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 04:31:53,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 4: [2022-11-27 04:31:53,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:31:53,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 04:31:53,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 28: [2022-11-27 04:31:53,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
15: [2022-11-27 04:31:53,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:31:53,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:31:53,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 23: [2022-11-27 04:31:53,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 15: [2022-11-27 04:31:53,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 23: [2022-11-27 04:31:53,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 7: [2022-11-27 04:31:53,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:31:53,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:31:53,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 04:31:53,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 
31: [2022-11-27 04:31:53,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 22: [2022-11-27 04:31:53,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:31:53,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 31: [2022-11-27 04:31:53,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 22: [2022-11-27 04:31:53,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 27: [2022-11-27 04:31:53,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:31:53,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:31:53,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 04:31:53,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 21: [2022-11-27 04:31:53,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 04:31:53,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 2: [2022-11-27 04:31:53,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-27 04:31:53,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 04:31:53,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 28: [2022-11-27 04:31:53,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 04:31:53,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 14: [2022-11-27 04:31:53,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:31:53,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 04:31:53,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 12: [2022-11-27 04:31:53,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:31:53,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 04:31:53,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 1: [2022-11-27 04:31:53,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-27 04:31:53,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 04:31:53,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 29: [2022-11-27 04:31:53,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:31:53,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step153000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 04:31:53,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step153000 is ready now! 0: successfully saved checkpoint at iteration 153000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2498.95 31: iteration 153010/ 173500 | consumed samples: 39170560 | consumed tokens: 80221306880 | elapsed time per iteration (s): 1.02 | learning rate: 2.625E-05 | global batch size: 256 | lm loss: 1.913617E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.513 | TFLOPs: 15.22 | 31: iteration 153020/ 173500 | consumed samples: 39173120 | consumed tokens: 80226549760 | elapsed time per iteration (s): 0.79 | learning rate: 2.624E-05 | global batch size: 256 | lm loss: 1.922755E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.778 | TFLOPs: 19.53 | 31: iteration 153030/ 173500 | consumed samples: 39175680 | consumed tokens: 80231792640 | elapsed time per iteration (s): 0.85 | learning rate: 2.623E-05 | global batch size: 256 | lm loss: 1.921164E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.363 | TFLOPs: 18.29 | 31: iteration 
153040/ 173500 | consumed samples: 39178240 | consumed tokens: 80237035520 | elapsed time per iteration (s): 0.79 | learning rate: 2.623E-05 | global batch size: 256 | lm loss: 1.917927E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.719 | TFLOPs: 19.64 | 31: iteration 153050/ 173500 | consumed samples: 39180800 | consumed tokens: 80242278400 | elapsed time per iteration (s): 0.82 | learning rate: 2.622E-05 | global batch size: 256 | lm loss: 1.936221E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.733 | TFLOPs: 18.86 | 31: iteration 153060/ 173500 | consumed samples: 39183360 | consumed tokens: 80247521280 | elapsed time per iteration (s): 0.80 | learning rate: 2.622E-05 | global batch size: 256 | lm loss: 1.926436E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.910 | TFLOPs: 19.35 | 31: iteration 153070/ 173500 | consumed samples: 39185920 | consumed tokens: 80252764160 | elapsed time per iteration (s): 0.86 | learning rate: 2.621E-05 | global batch size: 256 | lm loss: 1.932214E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.395 | TFLOPs: 17.93 | 31: iteration 153080/ 173500 | consumed samples: 39188480 | consumed tokens: 80258007040 | elapsed time per iteration (s): 0.78 | learning rate: 2.620E-05 | global batch size: 256 | lm loss: 1.937055E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.535 | TFLOPs: 19.82 | 31: iteration 153090/ 173500 | consumed samples: 39191040 | consumed tokens: 80263249920 | elapsed time per iteration (s): 0.80 | learning rate: 2.620E-05 | global batch size: 256 | lm loss: 1.915904E+00 | grad norm: 0.190 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.783 | TFLOPs: 19.47 | 31: iteration 153100/ 173500 | consumed samples: 39193600 | consumed tokens: 80268492800 | elapsed time per iteration (s): 0.84 | learning rate: 2.619E-05 | global batch size: 256 | lm loss: 1.931720E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.474 | TFLOPs: 18.42 | 31: iteration 153110/ 173500 | consumed samples: 39196160 | consumed tokens: 80273735680 | elapsed time per iteration (s): 0.82 | learning rate: 2.619E-05 | global batch size: 256 | lm loss: 1.928439E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.908 | TFLOPs: 18.93 | 31: iteration 153120/ 173500 | consumed samples: 39198720 | consumed tokens: 80278978560 | elapsed time per iteration (s): 0.84 | learning rate: 2.618E-05 | global batch size: 256 | lm loss: 1.881842E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.443 | TFLOPs: 18.54 | 31: iteration 153130/ 173500 | consumed samples: 39201280 | consumed tokens: 80284221440 | elapsed time per iteration (s): 0.81 | learning rate: 2.617E-05 | global batch size: 256 | lm loss: 1.911939E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.171 | TFLOPs: 19.13 | 31: iteration 153140/ 173500 | consumed samples: 39203840 | consumed tokens: 80289464320 | elapsed time per iteration (s): 0.81 | learning rate: 2.617E-05 | global batch size: 256 | lm loss: 1.930583E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.821 | TFLOPs: 19.11 | 31: iteration 153150/ 173500 | consumed samples: 39206400 | consumed tokens: 80294707200 | elapsed time per iteration (s): 0.76 | learning 
rate: 2.616E-05 | global batch size: 256 | lm loss: 1.870597E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.988 | TFLOPs: 20.45 | 31: iteration 153160/ 173500 | consumed samples: 39208960 | consumed tokens: 80299950080 | elapsed time per iteration (s): 0.75 | learning rate: 2.616E-05 | global batch size: 256 | lm loss: 1.887459E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.605 | TFLOPs: 20.67 | 31: iteration 153170/ 173500 | consumed samples: 39211520 | consumed tokens: 80305192960 | elapsed time per iteration (s): 0.78 | learning rate: 2.615E-05 | global batch size: 256 | lm loss: 1.890877E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.173 | TFLOPs: 19.73 | 31: iteration 153180/ 173500 | consumed samples: 39214080 | consumed tokens: 80310435840 | elapsed time per iteration (s): 0.78 | learning rate: 2.614E-05 | global batch size: 256 | lm loss: 1.935035E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.101 | TFLOPs: 19.97 | 31: iteration 153190/ 173500 | consumed samples: 39216640 | consumed tokens: 80315678720 | elapsed time per iteration (s): 0.84 | learning rate: 2.614E-05 | global batch size: 256 | lm loss: 1.910012E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.037 | TFLOPs: 18.39 | 31: iteration 153200/ 173500 | consumed samples: 39219200 | consumed tokens: 80320921600 | elapsed time per iteration (s): 0.75 | learning rate: 2.613E-05 | global batch size: 256 | lm loss: 1.905905E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.904 | TFLOPs: 20.56 | 31: iteration 153210/ 
173500 | consumed samples: 39221760 | consumed tokens: 80326164480 | elapsed time per iteration (s): 0.82 | learning rate: 2.613E-05 | global batch size: 256 | lm loss: 1.926722E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.747 | TFLOPs: 18.98 | 31: iteration 153220/ 173500 | consumed samples: 39224320 | consumed tokens: 80331407360 | elapsed time per iteration (s): 2.34 | learning rate: 2.612E-05 | global batch size: 256 | lm loss: 1.891868E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 109.257 | TFLOPs: 6.61 | 31: iteration 153230/ 173500 | consumed samples: 39226880 | consumed tokens: 80336650240 | elapsed time per iteration (s): 0.77 | learning rate: 2.611E-05 | global batch size: 256 | lm loss: 1.900071E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.016 | TFLOPs: 20.15 | 31: iteration 153240/ 173500 | consumed samples: 39229440 | consumed tokens: 80341893120 | elapsed time per iteration (s): 0.78 | learning rate: 2.611E-05 | global batch size: 256 | lm loss: 1.931178E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.260 | TFLOPs: 19.92 | 31: iteration 153250/ 173500 | consumed samples: 39232000 | consumed tokens: 80347136000 | elapsed time per iteration (s): 0.78 | learning rate: 2.610E-05 | global batch size: 256 | lm loss: 1.902047E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.327 | TFLOPs: 19.74 | 31: iteration 153260/ 173500 | consumed samples: 39234560 | consumed tokens: 80352378880 | elapsed time per iteration (s): 0.85 | learning rate: 2.610E-05 | global batch size: 256 | lm loss: 1.903467E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 302.007 | TFLOPs: 18.27 | 31: iteration 153270/ 173500 | consumed samples: 39237120 | consumed tokens: 80357621760 | elapsed time per iteration (s): 0.80 | learning rate: 2.609E-05 | global batch size: 256 | lm loss: 1.922125E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.208 | TFLOPs: 19.25 | 31: iteration 153280/ 173500 | consumed samples: 39239680 | consumed tokens: 80362864640 | elapsed time per iteration (s): 0.74 | learning rate: 2.609E-05 | global batch size: 256 | lm loss: 1.906966E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.322 | TFLOPs: 21.01 | 31: iteration 153290/ 173500 | consumed samples: 39242240 | consumed tokens: 80368107520 | elapsed time per iteration (s): 0.74 | learning rate: 2.608E-05 | global batch size: 256 | lm loss: 1.926100E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.023 | TFLOPs: 20.87 | 31: iteration 153300/ 173500 | consumed samples: 39244800 | consumed tokens: 80373350400 | elapsed time per iteration (s): 0.77 | learning rate: 2.607E-05 | global batch size: 256 | lm loss: 1.918825E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.328 | TFLOPs: 20.04 | 31: iteration 153310/ 173500 | consumed samples: 39247360 | consumed tokens: 80378593280 | elapsed time per iteration (s): 0.75 | learning rate: 2.607E-05 | global batch size: 256 | lm loss: 1.925904E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.420 | TFLOPs: 20.66 | 31: iteration 153320/ 173500 | consumed samples: 39249920 | consumed tokens: 80383836160 | elapsed time per iteration (s): 0.77 | learning rate: 
2.606E-05 | global batch size: 256 | lm loss: 1.914469E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.459 | TFLOPs: 19.99 | 31: iteration 153330/ 173500 | consumed samples: 39252480 | consumed tokens: 80389079040 | elapsed time per iteration (s): 0.81 | learning rate: 2.606E-05 | global batch size: 256 | lm loss: 1.914142E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.303 | TFLOPs: 19.20 | 31: iteration 153340/ 173500 | consumed samples: 39255040 | consumed tokens: 80394321920 | elapsed time per iteration (s): 1.22 | learning rate: 2.605E-05 | global batch size: 256 | lm loss: 1.896308E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 209.432 | TFLOPs: 12.67 | 31: iteration 153350/ 173500 | consumed samples: 39257600 | consumed tokens: 80399564800 | elapsed time per iteration (s): 0.79 | learning rate: 2.604E-05 | global batch size: 256 | lm loss: 1.929509E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.811 | TFLOPs: 19.59 | 31: iteration 153360/ 173500 | consumed samples: 39260160 | consumed tokens: 80404807680 | elapsed time per iteration (s): 0.75 | learning rate: 2.604E-05 | global batch size: 256 | lm loss: 1.880998E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.710 | TFLOPs: 20.55 | 31: iteration 153370/ 173500 | consumed samples: 39262720 | consumed tokens: 80410050560 | elapsed time per iteration (s): 0.79 | learning rate: 2.603E-05 | global batch size: 256 | lm loss: 1.919958E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.622 | TFLOPs: 19.64 | 31: iteration 153380/ 173500 | 
consumed samples: 39265280 | consumed tokens: 80415293440 | elapsed time per iteration (s): 0.79 | learning rate: 2.603E-05 | global batch size: 256 | lm loss: 1.934236E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.139 | TFLOPs: 19.49 | 31: iteration 153390/ 173500 | consumed samples: 39267840 | consumed tokens: 80420536320 | elapsed time per iteration (s): 0.79 | learning rate: 2.602E-05 | global batch size: 256 | lm loss: 1.915760E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.917 | TFLOPs: 19.54 | 31: iteration 153400/ 173500 | consumed samples: 39270400 | consumed tokens: 80425779200 | elapsed time per iteration (s): 0.75 | learning rate: 2.601E-05 | global batch size: 256 | lm loss: 1.916405E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.976 | TFLOPs: 20.69 | 31: iteration 153410/ 173500 | consumed samples: 39272960 | consumed tokens: 80431022080 | elapsed time per iteration (s): 0.75 | learning rate: 2.601E-05 | global batch size: 256 | lm loss: 1.896036E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.769 | TFLOPs: 20.62 | 31: iteration 153420/ 173500 | consumed samples: 39275520 | consumed tokens: 80436264960 | elapsed time per iteration (s): 0.97 | learning rate: 2.600E-05 | global batch size: 256 | lm loss: 1.922462E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 263.575 | TFLOPs: 15.95 | 31: iteration 153430/ 173500 | consumed samples: 39278080 | consumed tokens: 80441507840 | elapsed time per iteration (s): 0.75 | learning rate: 2.600E-05 | global batch size: 256 | lm loss: 1.932143E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 341.869 | TFLOPs: 20.68 | 31: iteration 153440/ 173500 | consumed samples: 39280640 | consumed tokens: 80446750720 | elapsed time per iteration (s): 0.80 | learning rate: 2.599E-05 | global batch size: 256 | lm loss: 1.922876E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.469 | TFLOPs: 19.45 | 31: iteration 153450/ 173500 | consumed samples: 39283200 | consumed tokens: 80451993600 | elapsed time per iteration (s): 0.71 | learning rate: 2.598E-05 | global batch size: 256 | lm loss: 1.913909E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 360.142 | TFLOPs: 21.79 | 31: iteration 153460/ 173500 | consumed samples: 39285760 | consumed tokens: 80457236480 | elapsed time per iteration (s): 0.74 | learning rate: 2.598E-05 | global batch size: 256 | lm loss: 1.908294E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.784 | TFLOPs: 20.80 | 31: iteration 153470/ 173500 | consumed samples: 39288320 | consumed tokens: 80462479360 | elapsed time per iteration (s): 0.73 | learning rate: 2.597E-05 | global batch size: 256 | lm loss: 1.933374E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.482 | TFLOPs: 21.20 | 31: iteration 153480/ 173500 | consumed samples: 39290880 | consumed tokens: 80467722240 | elapsed time per iteration (s): 0.76 | learning rate: 2.597E-05 | global batch size: 256 | lm loss: 1.890119E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.050 | TFLOPs: 20.45 | 31: iteration 153490/ 173500 | consumed samples: 39293440 | consumed tokens: 80472965120 | elapsed time per iteration (s): 0.85 | learning rate: 
2.596E-05 | global batch size: 256 | lm loss: 1.909318E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.475 | TFLOPs: 18.18 | 31: iteration 153500/ 173500 | consumed samples: 39296000 | consumed tokens: 80478208000 | elapsed time per iteration (s): 0.79 | learning rate: 2.595E-05 | global batch size: 256 | lm loss: 1.911787E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.638 | TFLOPs: 19.52 | 31: iteration 153510/ 173500 | consumed samples: 39298560 | consumed tokens: 80483450880 | elapsed time per iteration (s): 0.83 | learning rate: 2.595E-05 | global batch size: 256 | lm loss: 1.920157E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.502 | TFLOPs: 18.60 | 31: iteration 153520/ 173500 | consumed samples: 39301120 | consumed tokens: 80488693760 | elapsed time per iteration (s): 0.94 | learning rate: 2.594E-05 | global batch size: 256 | lm loss: 1.907107E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 271.232 | TFLOPs: 16.41 | 31: iteration 153530/ 173500 | consumed samples: 39303680 | consumed tokens: 80493936640 | elapsed time per iteration (s): 0.82 | learning rate: 2.594E-05 | global batch size: 256 | lm loss: 1.910268E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.656 | TFLOPs: 18.85 | 31: iteration 153540/ 173500 | consumed samples: 39306240 | consumed tokens: 80499179520 | elapsed time per iteration (s): 0.78 | learning rate: 2.593E-05 | global batch size: 256 | lm loss: 1.901154E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.757 | TFLOPs: 19.83 | 31: iteration 153550/ 173500 | 
consumed samples: 39308800 | consumed tokens: 80504422400 | elapsed time per iteration (s): 0.86 | learning rate: 2.593E-05 | global batch size: 256 | lm loss: 1.931778E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.211 | TFLOPs: 18.04 | 31: iteration 153560/ 173500 | consumed samples: 39311360 | consumed tokens: 80509665280 | elapsed time per iteration (s): 0.80 | learning rate: 2.592E-05 | global batch size: 256 | lm loss: 1.907337E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.794 | TFLOPs: 19.47 | 31: iteration 153570/ 173500 | consumed samples: 39313920 | consumed tokens: 80514908160 | elapsed time per iteration (s): 0.83 | learning rate: 2.591E-05 | global batch size: 256 | lm loss: 1.933277E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.312 | TFLOPs: 18.71 | 31: iteration 153580/ 173500 | consumed samples: 39316480 | consumed tokens: 80520151040 | elapsed time per iteration (s): 0.79 | learning rate: 2.591E-05 | global batch size: 256 | lm loss: 1.918099E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.877 | TFLOPs: 19.53 | 31: iteration 153590/ 173500 | consumed samples: 39319040 | consumed tokens: 80525393920 | elapsed time per iteration (s): 0.83 | learning rate: 2.590E-05 | global batch size: 256 | lm loss: 1.917054E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.931 | TFLOPs: 18.57 | 31: iteration 153600/ 173500 | consumed samples: 39321600 | consumed tokens: 80530636800 | elapsed time per iteration (s): 0.79 | learning rate: 2.590E-05 | global batch size: 256 | lm loss: 1.925974E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 322.038 | TFLOPs: 19.48 | 31: iteration 153610/ 173500 | consumed samples: 39324160 | consumed tokens: 80535879680 | elapsed time per iteration (s): 0.83 | learning rate: 2.589E-05 | global batch size: 256 | lm loss: 1.903661E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.300 | TFLOPs: 18.59 | 31: iteration 153620/ 173500 | consumed samples: 39326720 | consumed tokens: 80541122560 | elapsed time per iteration (s): 0.83 | learning rate: 2.588E-05 | global batch size: 256 | lm loss: 1.914553E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.860 | TFLOPs: 18.62 | 31: iteration 153630/ 173500 | consumed samples: 39329280 | consumed tokens: 80546365440 | elapsed time per iteration (s): 0.83 | learning rate: 2.588E-05 | global batch size: 256 | lm loss: 1.912880E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.368 | TFLOPs: 18.60 | 31: iteration 153640/ 173500 | consumed samples: 39331840 | consumed tokens: 80551608320 | elapsed time per iteration (s): 0.82 | learning rate: 2.587E-05 | global batch size: 256 | lm loss: 1.921184E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.651 | TFLOPs: 18.98 | 31: iteration 153650/ 173500 | consumed samples: 39334400 | consumed tokens: 80556851200 | elapsed time per iteration (s): 0.81 | learning rate: 2.587E-05 | global batch size: 256 | lm loss: 1.898011E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.629 | TFLOPs: 19.16 | 31: iteration 153660/ 173500 | consumed samples: 39336960 | consumed tokens: 80562094080 | elapsed time per iteration (s): 0.80 | learning rate: 
2.586E-05 | global batch size: 256 | lm loss: 1.936484E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.271 | TFLOPs: 19.25 | 31: iteration 153670/ 173500 | consumed samples: 39339520 | consumed tokens: 80567336960 | elapsed time per iteration (s): 0.90 | learning rate: 2.586E-05 | global batch size: 256 | lm loss: 1.917208E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 283.152 | TFLOPs: 17.13 | 31: iteration 153680/ 173500 | consumed samples: 39342080 | consumed tokens: 80572579840 | elapsed time per iteration (s): 0.88 | learning rate: 2.585E-05 | global batch size: 256 | lm loss: 1.925827E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.684 | TFLOPs: 17.59 | 31: iteration 153690/ 173500 | consumed samples: 39344640 | consumed tokens: 80577822720 | elapsed time per iteration (s): 0.82 | learning rate: 2.584E-05 | global batch size: 256 | lm loss: 1.915367E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.600 | TFLOPs: 18.79 | 31: iteration 153700/ 173500 | consumed samples: 39347200 | consumed tokens: 80583065600 | elapsed time per iteration (s): 0.98 | learning rate: 2.584E-05 | global batch size: 256 | lm loss: 1.936805E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 261.059 | TFLOPs: 15.79 | 31: iteration 153710/ 173500 | consumed samples: 39349760 | consumed tokens: 80588308480 | elapsed time per iteration (s): 0.84 | learning rate: 2.583E-05 | global batch size: 256 | lm loss: 1.885466E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.554 | TFLOPs: 18.36 | 31: iteration 153720/ 173500 | 
consumed samples: 39352320 | consumed tokens: 80593551360 | elapsed time per iteration (s): 0.85 | learning rate: 2.583E-05 | global batch size: 256 | lm loss: 1.914672E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.393 | TFLOPs: 18.29 | 31: iteration 153730/ 173500 | consumed samples: 39354880 | consumed tokens: 80598794240 | elapsed time per iteration (s): 0.86 | learning rate: 2.582E-05 | global batch size: 256 | lm loss: 1.908101E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.260 | TFLOPs: 18.10 | 31: iteration 153740/ 173500 | consumed samples: 39357440 | consumed tokens: 80604037120 | elapsed time per iteration (s): 0.82 | learning rate: 2.581E-05 | global batch size: 256 | lm loss: 1.882775E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.066 | TFLOPs: 18.82 | 31: iteration 153750/ 173500 | consumed samples: 39360000 | consumed tokens: 80609280000 | elapsed time per iteration (s): 0.79 | learning rate: 2.581E-05 | global batch size: 256 | lm loss: 1.932336E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.865 | TFLOPs: 19.59 | 31: iteration 153760/ 173500 | consumed samples: 39362560 | consumed tokens: 80614522880 | elapsed time per iteration (s): 0.80 | learning rate: 2.580E-05 | global batch size: 256 | lm loss: 1.930094E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.029 | TFLOPs: 19.30 | 31: iteration 153770/ 173500 | consumed samples: 39365120 | consumed tokens: 80619765760 | elapsed time per iteration (s): 0.79 | learning rate: 2.580E-05 | global batch size: 256 | lm loss: 1.884115E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 322.908 | TFLOPs: 19.54 | 31: iteration 153780/ 173500 | consumed samples: 39367680 | consumed tokens: 80625008640 | elapsed time per iteration (s): 0.91 | learning rate: 2.579E-05 | global batch size: 256 | lm loss: 1.930262E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 281.766 | TFLOPs: 17.05 | 31: iteration 153790/ 173500 | consumed samples: 39370240 | consumed tokens: 80630251520 | elapsed time per iteration (s): 0.93 | learning rate: 2.579E-05 | global batch size: 256 | lm loss: 1.901487E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 276.206 | TFLOPs: 16.71 | 31: iteration 153800/ 173500 | consumed samples: 39372800 | consumed tokens: 80635494400 | elapsed time per iteration (s): 0.84 | learning rate: 2.578E-05 | global batch size: 256 | lm loss: 1.895801E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.807 | TFLOPs: 18.50 | 31: iteration 153810/ 173500 | consumed samples: 39375360 | consumed tokens: 80640737280 | elapsed time per iteration (s): 0.99 | learning rate: 2.577E-05 | global batch size: 256 | lm loss: 1.921234E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 259.607 | TFLOPs: 15.71 | 31: iteration 153820/ 173500 | consumed samples: 39377920 | consumed tokens: 80645980160 | elapsed time per iteration (s): 0.85 | learning rate: 2.577E-05 | global batch size: 256 | lm loss: 1.914899E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.458 | TFLOPs: 18.30 | 31: iteration 153830/ 173500 | consumed samples: 39380480 | consumed tokens: 80651223040 | elapsed time per iteration (s): 0.75 | learning rate: 
2.576E-05 | global batch size: 256 | lm loss: 1.890987E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.232 | TFLOPs: 20.52 | 31: iteration 153840/ 173500 | consumed samples: 39383040 | consumed tokens: 80656465920 | elapsed time per iteration (s): 0.84 | learning rate: 2.576E-05 | global batch size: 256 | lm loss: 1.923433E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.043 | TFLOPs: 18.51 | 31: iteration 153850/ 173500 | consumed samples: 39385600 | consumed tokens: 80661708800 | elapsed time per iteration (s): 0.74 | learning rate: 2.575E-05 | global batch size: 256 | lm loss: 1.916815E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.546 | TFLOPs: 21.03 | 31: iteration 153860/ 173500 | consumed samples: 39388160 | consumed tokens: 80666951680 | elapsed time per iteration (s): 0.85 | learning rate: 2.574E-05 | global batch size: 256 | lm loss: 1.927174E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.680 | TFLOPs: 18.25 | 31: iteration 153870/ 173500 | consumed samples: 39390720 | consumed tokens: 80672194560 | elapsed time per iteration (s): 0.79 | learning rate: 2.574E-05 | global batch size: 256 | lm loss: 1.924527E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.988 | TFLOPs: 19.66 | 31: iteration 153880/ 173500 | consumed samples: 39393280 | consumed tokens: 80677437440 | elapsed time per iteration (s): 0.76 | learning rate: 2.573E-05 | global batch size: 256 | lm loss: 1.914386E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.948 | TFLOPs: 20.38 | 31: iteration 153890/ 173500 | 
consumed samples: 39395840 | consumed tokens: 80682680320 | elapsed time per iteration (s): 0.79 | learning rate: 2.573E-05 | global batch size: 256 | lm loss: 1.913814E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.945 | TFLOPs: 19.66 | 31: iteration 153900/ 173500 | consumed samples: 39398400 | consumed tokens: 80687923200 | elapsed time per iteration (s): 0.78 | learning rate: 2.572E-05 | global batch size: 256 | lm loss: 1.919273E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.767 | TFLOPs: 19.77 | 31: iteration 153910/ 173500 | consumed samples: 39400960 | consumed tokens: 80693166080 | elapsed time per iteration (s): 0.73 | learning rate: 2.572E-05 | global batch size: 256 | lm loss: 1.893807E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.983 | TFLOPs: 21.23 | 31: iteration 153920/ 173500 | consumed samples: 39403520 | consumed tokens: 80698408960 | elapsed time per iteration (s): 0.77 | learning rate: 2.571E-05 | global batch size: 256 | lm loss: 1.882920E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.676 | TFLOPs: 20.07 | 31: iteration 153930/ 173500 | consumed samples: 39406080 | consumed tokens: 80703651840 | elapsed time per iteration (s): 0.77 | learning rate: 2.570E-05 | global batch size: 256 | lm loss: 1.922111E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.171 | TFLOPs: 20.04 | 31: iteration 153940/ 173500 | consumed samples: 39408640 | consumed tokens: 80708894720 | elapsed time per iteration (s): 0.74 | learning rate: 2.570E-05 | global batch size: 256 | lm loss: 1.930141E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 344.043 | TFLOPs: 20.81 | 31: iteration 153950/ 173500 | consumed samples: 39411200 | consumed tokens: 80714137600 | elapsed time per iteration (s): 0.74 | learning rate: 2.569E-05 | global batch size: 256 | lm loss: 1.923210E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.350 | TFLOPs: 20.95 | 31: iteration 153960/ 173500 | consumed samples: 39413760 | consumed tokens: 80719380480 | elapsed time per iteration (s): 0.78 | learning rate: 2.569E-05 | global batch size: 256 | lm loss: 1.910203E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.654 | TFLOPs: 19.94 | 31: iteration 153970/ 173500 | consumed samples: 39416320 | consumed tokens: 80724623360 | elapsed time per iteration (s): 0.75 | learning rate: 2.568E-05 | global batch size: 256 | lm loss: 1.913967E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.719 | TFLOPs: 20.73 | 31: iteration 153980/ 173500 | consumed samples: 39418880 | consumed tokens: 80729866240 | elapsed time per iteration (s): 0.83 | learning rate: 2.568E-05 | global batch size: 256 | lm loss: 1.917708E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.970 | TFLOPs: 18.69 | 31: iteration 153990/ 173500 | consumed samples: 39421440 | consumed tokens: 80735109120 | elapsed time per iteration (s): 0.74 | learning rate: 2.567E-05 | global batch size: 256 | lm loss: 1.914083E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.328 | TFLOPs: 20.89 | 0: [2022-11-27 04:45:37,285] [INFO] [logging.py:68:log_dist] [Rank 0] step=154000, skipped=0, lr=[2.5664028527469924e-05, 2.5664028527469924e-05, 
2.5664028527469924e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 154000/ 173500 | consumed samples: 39424000 | consumed tokens: 80740352000 | elapsed time per iteration (s): 0.74 | learning rate: 2.566E-05 | global batch size: 256 | lm loss: 1.932085E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.278 | TFLOPs: 20.95 | 0: steps: 154000 loss: 1.8564 iter time (s): 0.809 samples/sec: 316.483 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 154000 | lm loss value: 1.849655E+00 | lm loss PPL: 6.357623E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 154000 to checkpoints_1b1long 0: [2022-11-27 04:45:37,655] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step154000 is begin to save! 0: [2022-11-27 04:45:37,666] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_01-model_00-model_states.pt... 0: [2022-11-27 04:45:37,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_01-model_00-model_states.pt. 0: [2022-11-27 04:45:37,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_03-model_00-model_states.pt... 0: [2022-11-27 04:45:37,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_03-model_00-model_states.pt. 0: [2022-11-27 04:45:37,981] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_04-model_00-model_states.pt... 0: [2022-11-27 04:45:38,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_04-model_00-model_states.pt. 
0: [2022-11-27 04:45:38,057] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_05-model_00-model_states.pt... 0: [2022-11-27 04:45:38,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_05-model_00-model_states.pt. 0: [2022-11-27 04:45:38,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_06-model_00-model_states.pt... 0: [2022-11-27 04:45:38,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_06-model_00-model_states.pt. 0: [2022-11-27 04:45:38,207] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_07-model_00-model_states.pt... 0: [2022-11-27 04:45:38,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_07-model_00-model_states.pt. 0: [2022-11-27 04:45:38,280] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_08-model_00-model_states.pt... 0: [2022-11-27 04:45:38,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_08-model_00-model_states.pt. 0: [2022-11-27 04:45:38,355] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_09-model_00-model_states.pt... 0: [2022-11-27 04:45:38,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_09-model_00-model_states.pt. 0: [2022-11-27 04:45:38,427] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_10-model_00-model_states.pt... 0: [2022-11-27 04:45:38,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_10-model_00-model_states.pt. 
0: [2022-11-27 04:45:38,502] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_11-model_00-model_states.pt... 0: [2022-11-27 04:45:38,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_11-model_00-model_states.pt. 0: [2022-11-27 04:45:38,574] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_12-model_00-model_states.pt... 0: [2022-11-27 04:45:38,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_12-model_00-model_states.pt. 0: [2022-11-27 04:45:38,654] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_13-model_00-model_states.pt... 0: [2022-11-27 04:45:38,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_13-model_00-model_states.pt. 0: [2022-11-27 04:45:38,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_14-model_00-model_states.pt... 0: [2022-11-27 04:45:38,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_14-model_00-model_states.pt. 0: [2022-11-27 04:45:38,801] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_15-model_00-model_states.pt... 0: [2022-11-27 04:45:38,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_15-model_00-model_states.pt. 0: [2022-11-27 04:45:38,875] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_16-model_00-model_states.pt... 0: [2022-11-27 04:45:38,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_16-model_00-model_states.pt. 
0: [2022-11-27 04:45:38,948] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_17-model_00-model_states.pt... 0: [2022-11-27 04:45:39,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_17-model_00-model_states.pt. 0: [2022-11-27 04:45:39,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_18-model_00-model_states.pt... 0: [2022-11-27 04:45:39,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_18-model_00-model_states.pt. 0: [2022-11-27 04:45:39,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_19-model_00-model_states.pt... 0: [2022-11-27 04:45:39,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_19-model_00-model_states.pt. 0: [2022-11-27 04:45:39,168] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_20-model_00-model_states.pt... 0: [2022-11-27 04:45:39,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_20-model_00-model_states.pt. 0: [2022-11-27 04:45:39,240] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_21-model_00-model_states.pt... 0: [2022-11-27 04:45:39,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_21-model_00-model_states.pt. 0: [2022-11-27 04:45:39,314] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_22-model_00-model_states.pt... 0: [2022-11-27 04:45:39,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_22-model_00-model_states.pt. 
0: [2022-11-27 04:45:39,387] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_23-model_00-model_states.pt... 0: [2022-11-27 04:45:39,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_23-model_00-model_states.pt. 0: [2022-11-27 04:45:39,458] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_24-model_00-model_states.pt... 0: [2022-11-27 04:45:39,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_24-model_00-model_states.pt. 0: [2022-11-27 04:45:39,531] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_25-model_00-model_states.pt... 0: [2022-11-27 04:45:39,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_25-model_00-model_states.pt. 0: [2022-11-27 04:45:39,604] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_26-model_00-model_states.pt... 0: [2022-11-27 04:45:39,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_26-model_00-model_states.pt. 0: [2022-11-27 04:45:39,679] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_27-model_00-model_states.pt... 0: [2022-11-27 04:45:39,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_27-model_00-model_states.pt. 0: [2022-11-27 04:45:39,750] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_28-model_00-model_states.pt... 0: [2022-11-27 04:45:39,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_28-model_00-model_states.pt. 
0: [2022-11-27 04:45:39,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/layer_30-model_00-model_states.pt... 0: [2022-11-27 04:45:39,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/layer_30-model_00-model_states.pt. 0: [2022-11-27 04:45:39,827] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step154000/mp_rank_00_model_states.pt 0: [2022-11-27 04:45:39,827] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/mp_rank_00_model_states.pt... 0: [2022-11-27 04:45:39,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/mp_rank_00_model_states.pt. 0: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
5: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
9: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 
16: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
3: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 
25: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
14: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 
22: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 
18: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
6: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 
10: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
3: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 
23: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 
24: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 
22: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 
26: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 
6: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
8: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 
20: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 
31: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 
26: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
13: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 
29: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
15: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:45:39,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:45:39,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:45:39,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 04:45:39,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 
20: [2022-11-27 04:45:39,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:45:39,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 04:45:39,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 10: [2022-11-27 04:45:39,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:45:39,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:45:39,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 12: [2022-11-27 04:45:39,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 10: [2022-11-27 04:45:39,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 12: [2022-11-27 04:45:39,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 13: [2022-11-27 04:45:39,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:45:39,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 04:45:39,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 
5: [2022-11-27 04:45:39,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:45:39,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 18: [2022-11-27 04:45:39,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:45:39,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 18: [2022-11-27 04:45:39,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 04:45:39,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 1: [2022-11-27 04:45:39,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:45:39,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 04:45:39,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 24: [2022-11-27 04:45:39,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:45:39,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 04:45:39,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 
2: [2022-11-27 04:45:39,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:45:39,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 04:45:39,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 16: [2022-11-27 04:45:39,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:45:39,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 04:45:39,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 26: [2022-11-27 04:45:39,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:45:39,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 11: [2022-11-27 04:45:39,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:45:39,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 
11: [2022-11-27 04:45:39,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 0: [2022-11-27 04:45:39,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:45:39,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 0: [2022-11-27 04:45:39,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 20: [2022-11-27 04:45:39,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:45:39,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 20: [2022-11-27 04:45:39,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 04:45:39,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 22: [2022-11-27 04:45:39,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:45:39,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 04:45:39,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 6: [2022-11-27 04:45:39,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
15: [2022-11-27 04:45:39,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:45:39,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 15: [2022-11-27 04:45:39,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 6: [2022-11-27 04:45:39,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 15: [2022-11-27 04:45:39,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 0: [2022-11-27 04:45:39,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:45:39,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 04:45:39,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 1: [2022-11-27 04:45:39,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:45:39,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 10: [2022-11-27 04:45:39,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:45:39,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 
10: [2022-11-27 04:45:39,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 04:45:39,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 21: [2022-11-27 04:45:39,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:45:39,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 04:45:39,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 26: [2022-11-27 04:45:39,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:45:39,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 04:45:39,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 13: [2022-11-27 04:45:39,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:45:39,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:45:39,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 
13: [2022-11-27 04:45:39,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 30: [2022-11-27 04:45:39,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 13: [2022-11-27 04:45:39,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 30: [2022-11-27 04:45:39,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 04:45:39,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 30: [2022-11-27 04:45:39,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 9: [2022-11-27 04:45:39,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:45:39,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 04:45:39,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 2: [2022-11-27 04:45:39,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:45:39,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:45:39,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
2: [2022-11-27 04:45:39,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 29: [2022-11-27 04:45:39,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 2: [2022-11-27 04:45:39,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 29: [2022-11-27 04:45:39,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 04:45:39,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 18: [2022-11-27 04:45:39,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:45:39,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 18: [2022-11-27 04:45:39,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-27 04:45:39,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 25: [2022-11-27 04:45:39,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:45:39,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 04:45:39,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 
12: [2022-11-27 04:45:39,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:45:39,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 04:45:39,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 9: [2022-11-27 04:45:39,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:45:39,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 04:45:39,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 22: [2022-11-27 04:45:39,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:45:39,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 04:45:39,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 24: [2022-11-27 04:45:39,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:45:39,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
24: [2022-11-27 04:45:39,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 04:45:39,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 13: [2022-11-27 04:45:39,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:45:39,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 24: [2022-11-27 04:45:39,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 13: [2022-11-27 04:45:39,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 04:45:39,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 16: [2022-11-27 04:45:39,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:45:39,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 04:45:39,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 25: [2022-11-27 04:45:39,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
25: [2022-11-27 04:45:39,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 27: [2022-11-27 04:45:39,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:45:39,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 27: [2022-11-27 04:45:39,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 1: [2022-11-27 04:45:39,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:45:39,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 1: [2022-11-27 04:45:39,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 04:45:39,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 17: [2022-11-27 04:45:39,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:45:39,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-27 04:45:39,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 04:45:39,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 04:45:39,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 17: [2022-11-27 04:45:39,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 20: [2022-11-27 04:45:39,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:45:39,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 04:45:39,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 26: [2022-11-27 04:45:39,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:45:39,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 04:45:39,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 31: [2022-11-27 04:45:39,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-27 04:45:39,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 04:45:39,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 27: [2022-11-27 04:45:39,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:45:39,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 04:45:39,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 6: [2022-11-27 04:45:39,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:45:39,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 04:45:39,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:45:39,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:45:39,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 04:45:39,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 
18: [2022-11-27 04:45:39,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:45:39,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 04:45:39,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 6: [2022-11-27 04:45:39,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 6: [2022-11-27 04:45:39,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 04:45:39,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 12: [2022-11-27 04:45:39,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:45:39,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:45:39,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 22: [2022-11-27 04:45:39,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 12: [2022-11-27 04:45:39,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 22: [2022-11-27 04:45:39,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 
21: [2022-11-27 04:45:39,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:45:39,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:45:39,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 04:45:39,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 04:45:39,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 21: [2022-11-27 04:45:39,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 25: [2022-11-27 04:45:39,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:45:39,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 04:45:39,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 24: [2022-11-27 04:45:39,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:45:39,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 04:45:39,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 
31: [2022-11-27 04:45:39,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:45:39,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 25: [2022-11-27 04:45:39,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:45:39,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:45:39,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 9: [2022-11-27 04:45:39,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 25: [2022-11-27 04:45:39,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 9: [2022-11-27 04:45:39,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 30: [2022-11-27 04:45:39,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:45:39,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:45:39,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 
30: [2022-11-27 04:45:39,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 04:45:39,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 04:45:39,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 15: [2022-11-27 04:45:39,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:45:39,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:45:39,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 15: [2022-11-27 04:45:39,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 04:45:39,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 04:45:39,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 15: [2022-11-27 04:45:39,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 31: [2022-11-27 04:45:39,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 
31: [2022-11-27 04:45:39,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 04:45:39,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 27: [2022-11-27 04:45:39,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:45:39,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 04:45:39,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 0: [2022-11-27 04:45:39,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:45:39,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 16: [2022-11-27 04:45:39,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:45:39,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 0: [2022-11-27 04:45:39,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 16: [2022-11-27 04:45:39,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 0: [2022-11-27 04:45:39,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-27 04:45:39,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 04:45:39,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 29: [2022-11-27 04:45:39,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:45:39,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 04:45:39,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 16: [2022-11-27 04:45:39,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:45:39,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 04:45:39,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 12: [2022-11-27 04:45:39,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:45:39,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 04:45:39,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 17: [2022-11-27 04:45:39,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 
21: [2022-11-27 04:45:39,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:45:39,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:45:39,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 04:45:39,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 04:45:39,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 21: [2022-11-27 04:45:39,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 17: [2022-11-27 04:45:39,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 21: [2022-11-27 04:45:39,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 15: [2022-11-27 04:45:39,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:45:39,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 04:45:39,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 20: [2022-11-27 04:45:39,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 
2: [2022-11-27 04:45:39,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:45:39,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 2: [2022-11-27 04:45:39,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 20: [2022-11-27 04:45:39,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 2: [2022-11-27 04:45:39,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 11: [2022-11-27 04:45:39,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:45:39,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 04:45:39,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 11: [2022-11-27 04:45:39,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:45:39,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 04:45:39,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 11: [2022-11-27 04:45:39,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-27 04:45:39,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 04:45:39,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 22: [2022-11-27 04:45:39,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:45:39,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 13: [2022-11-27 04:45:39,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:45:39,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 13: [2022-11-27 04:45:39,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 04:45:39,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 18: [2022-11-27 04:45:39,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:45:39,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 04:45:39,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 6: [2022-11-27 04:45:39,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-27 04:45:39,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 5: [2022-11-27 04:45:39,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:45:39,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 5: [2022-11-27 04:45:39,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 04:45:39,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 5: [2022-11-27 04:45:39,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:45:39,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 04:45:39,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 5: [2022-11-27 04:45:39,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:45:39,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 04:45:39,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 2: [2022-11-27 04:45:39,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-27 04:45:39,982] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 04:45:39,982] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 8: [2022-11-27 04:45:39,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:45:39,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:45:39,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:45:39,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:45:39,982] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 04:45:39,982] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 04:45:39,982] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 04:45:39,982] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 04:45:39,982] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 
8: [2022-11-27 04:45:39,982] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 8: [2022-11-27 04:45:39,982] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 8: [2022-11-27 04:45:39,982] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 29: [2022-11-27 04:45:39,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:45:39,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 04:45:39,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 7: [2022-11-27 04:45:39,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:45:39,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:45:39,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-27 04:45:39,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 04:45:39,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 04:45:39,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 04:45:39,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 7: [2022-11-27 04:45:39,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 7: [2022-11-27 04:45:39,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 9: [2022-11-27 04:45:39,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:45:39,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 04:45:39,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 1: [2022-11-27 04:45:39,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:45:39,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 04:45:39,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 
26: [2022-11-27 04:45:39,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:45:39,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 04:45:39,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 10: [2022-11-27 04:45:39,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:45:39,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 04:45:39,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 14: [2022-11-27 04:45:39,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:45:39,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:45:39,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:45:39,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 04:45:39,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 
14: [2022-11-27 04:45:39,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 04:45:39,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 04:45:39,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 14: [2022-11-27 04:45:39,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 3: [2022-11-27 04:45:39,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:45:39,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 04:45:39,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:45:39,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:45:39,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:45:39,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 
31: [2022-11-27 04:45:39,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 3: [2022-11-27 04:45:39,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 04:45:39,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 31: [2022-11-27 04:45:39,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 3: [2022-11-27 04:45:39,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 3: [2022-11-27 04:45:39,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 23: [2022-11-27 04:45:39,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:45:39,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:45:39,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:45:39,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-27 04:45:39,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 04:45:39,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 04:45:39,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 04:45:39,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 04:45:39,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 23: [2022-11-27 04:45:39,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 23: [2022-11-27 04:45:39,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 23: [2022-11-27 04:45:39,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 27: [2022-11-27 04:45:39,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:45:39,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 04:45:39,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 11: [2022-11-27 04:45:39,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-27 04:45:39,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 04:45:39,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 19: [2022-11-27 04:45:39,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:45:39,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:45:39,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:45:39,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:45:39,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 04:45:39,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 04:45:39,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 04:45:39,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-27 04:45:39,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 
19: [2022-11-27 04:45:39,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 19: [2022-11-27 04:45:39,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 19: [2022-11-27 04:45:39,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 28: [2022-11-27 04:45:39,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:45:39,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:45:39,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:45:39,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:45:39,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 04:45:39,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 04:45:39,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 04:45:39,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 04:45:39,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 
28: [2022-11-27 04:45:39,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 28: [2022-11-27 04:45:39,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 28: [2022-11-27 04:45:39,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 6: [2022-11-27 04:45:40,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:45:40,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 04:45:40,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 4: [2022-11-27 04:45:40,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:45:40,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:45:40,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:45:40,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-27 04:45:40,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 04:45:40,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 04:45:40,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 04:45:40,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 4: [2022-11-27 04:45:40,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 04:45:40,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 4: [2022-11-27 04:45:40,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 4: [2022-11-27 04:45:40,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 0: [2022-11-27 04:45:40,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:45:40,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 04:45:40,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 12: [2022-11-27 04:45:40,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-27 04:45:40,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 04:45:40,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 5: [2022-11-27 04:45:40,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:45:40,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 04:45:40,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 4: [2022-11-27 04:45:40,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:45:40,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 04:45:40,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 24: [2022-11-27 04:45:40,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:45:40,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-27 04:45:40,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 10: [2022-11-27 04:45:40,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
28: [2022-11-27 04:45:40,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:45:40,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 28: [2022-11-27 04:45:40,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 10: [2022-11-27 04:45:40,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 28: [2022-11-27 04:45:40,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 8: [2022-11-27 04:45:40,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:45:40,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 04:45:40,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 29: [2022-11-27 04:45:40,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:45:40,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 04:45:40,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 25: [2022-11-27 04:45:40,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
25: [2022-11-27 04:45:40,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 04:45:40,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 21: [2022-11-27 04:45:40,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:45:40,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 04:45:40,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 20: [2022-11-27 04:45:40,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:45:40,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 04:45:40,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 30: [2022-11-27 04:45:40,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:45:40,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 04:45:40,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 13: [2022-11-27 04:45:40,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-27 04:45:40,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 04:45:40,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 31: [2022-11-27 04:45:40,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:45:40,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 04:45:40,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 26: [2022-11-27 04:45:40,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:45:40,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 04:45:40,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 16: [2022-11-27 04:45:40,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:45:40,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 04:45:40,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 22: [2022-11-27 04:45:40,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
22: [2022-11-27 04:45:40,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 04:45:40,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 17: [2022-11-27 04:45:40,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:45:40,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 04:45:40,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 19: [2022-11-27 04:45:40,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:45:40,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 04:45:40,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 15: [2022-11-27 04:45:40,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:45:40,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 04:45:40,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 23: [2022-11-27 04:45:40,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
23: [2022-11-27 04:45:40,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 04:45:40,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 9: [2022-11-27 04:45:40,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:45:40,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 04:45:40,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 14: [2022-11-27 04:45:40,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:45:40,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 04:45:40,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 3: [2022-11-27 04:45:40,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:45:40,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 04:45:40,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 2: [2022-11-27 04:45:40,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-27 04:45:40,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 04:45:40,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 27: [2022-11-27 04:45:40,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:45:40,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 18: [2022-11-27 04:45:40,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:45:40,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 18: [2022-11-27 04:45:40,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 04:45:40,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 7: [2022-11-27 04:45:40,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:45:40,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 04:45:40,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 1: [2022-11-27 04:45:40,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-27 04:45:40,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 04:45:40,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 6: [2022-11-27 04:45:40,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:45:40,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 04:45:40,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 12: [2022-11-27 04:45:40,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:45:40,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 04:45:40,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 0: [2022-11-27 04:45:40,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:45:40,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:45:40,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 04:45:40,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 
11: [2022-11-27 04:45:40,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:45:40,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 04:45:40,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 10: [2022-11-27 04:45:40,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:45:40,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 04:45:40,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 28: [2022-11-27 04:45:40,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:45:40,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 04:45:40,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 4: [2022-11-27 04:45:40,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:45:40,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 04:45:40,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 
5: [2022-11-27 04:45:40,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:45:40,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 04:45:40,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 13: [2022-11-27 04:45:40,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:45:40,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 04:45:40,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 24: [2022-11-27 04:45:40,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:45:40,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 04:45:40,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 29: [2022-11-27 04:45:40,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:45:40,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 04:45:40,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 
0: [2022-11-27 04:45:40,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 04:45:40,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 31: [2022-11-27 04:45:40,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:45:40,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 04:45:40,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 20: [2022-11-27 04:45:40,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:45:40,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:45:40,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 8: [2022-11-27 04:45:40,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 20: [2022-11-27 04:45:40,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 8: [2022-11-27 04:45:40,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 26: [2022-11-27 04:45:40,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-27 04:45:40,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 04:45:40,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 22: [2022-11-27 04:45:40,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:45:40,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:45:40,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 30: [2022-11-27 04:45:40,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 04:45:40,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 22: [2022-11-27 04:45:40,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 21: [2022-11-27 04:45:40,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:45:40,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 04:45:40,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 14: [2022-11-27 04:45:40,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-27 04:45:40,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 16: [2022-11-27 04:45:40,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:45:40,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 16: [2022-11-27 04:45:40,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 17: [2022-11-27 04:45:40,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:45:40,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 17: [2022-11-27 04:45:40,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 04:45:40,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 15: [2022-11-27 04:45:40,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:45:40,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 04:45:40,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 23: [2022-11-27 04:45:40,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-27 04:45:40,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 04:45:40,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 19: [2022-11-27 04:45:40,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:45:40,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 04:45:40,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 3: [2022-11-27 04:45:40,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:45:40,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 04:45:40,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 9: [2022-11-27 04:45:40,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:45:40,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 04:45:40,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 1: [2022-11-27 04:45:40,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-27 04:45:40,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 0: [2022-11-27 04:45:40,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:45:40,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 0: [2022-11-27 04:45:40,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 04:45:40,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 2: [2022-11-27 04:45:40,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:45:40,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:45:40,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 04:45:40,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 6: [2022-11-27 04:45:40,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:45:40,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 04:45:40,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 
2: [2022-11-27 04:45:40,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 04:45:40,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 27: [2022-11-27 04:45:40,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:45:40,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:45:40,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 27: [2022-11-27 04:45:40,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 18: [2022-11-27 04:45:40,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 27: [2022-11-27 04:45:40,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 11: [2022-11-27 04:45:40,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:45:40,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 04:45:40,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 12: [2022-11-27 04:45:40,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-27 04:45:40,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 04:45:40,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 25: [2022-11-27 04:45:40,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:45:40,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 04:45:40,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 10: [2022-11-27 04:45:40,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:45:40,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 04:45:40,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 5: [2022-11-27 04:45:40,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:45:40,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 04:45:40,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 28: [2022-11-27 04:45:40,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
28: [2022-11-27 04:45:40,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 04:45:40,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 24: [2022-11-27 04:45:40,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:45:40,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 04:45:40,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 4: [2022-11-27 04:45:40,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:45:40,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 04:45:40,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 8: [2022-11-27 04:45:40,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:45:40,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 04:45:40,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 30: [2022-11-27 04:45:40,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
30: [2022-11-27 04:45:40,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 04:45:40,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 26: [2022-11-27 04:45:40,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:45:40,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 04:45:40,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 17: [2022-11-27 04:45:40,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:45:40,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 04:45:40,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 31: [2022-11-27 04:45:40,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:45:40,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 20: [2022-11-27 04:45:40,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:45:40,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 
20: [2022-11-27 04:45:40,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 04:45:40,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 29: [2022-11-27 04:45:40,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:45:40,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 04:45:40,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 13: [2022-11-27 04:45:40,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:45:40,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 04:45:40,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 22: [2022-11-27 04:45:40,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:45:40,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 04:45:40,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 15: [2022-11-27 04:45:40,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-27 04:45:40,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 04:45:40,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 21: [2022-11-27 04:45:40,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:45:40,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 04:45:40,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 19: [2022-11-27 04:45:40,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:45:40,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 04:45:40,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 18: [2022-11-27 04:45:40,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:45:40,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:45:40,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
16: [2022-11-27 04:45:40,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 18: [2022-11-27 04:45:40,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 16: [2022-11-27 04:45:40,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 18: [2022-11-27 04:45:40,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 04:45:40,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 18: [2022-11-27 04:45:40,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 8: [2022-11-27 04:45:40,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:45:40,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 30: [2022-11-27 04:45:40,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:45:40,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 30: [2022-11-27 04:45:40,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 04:45:40,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 
23: [2022-11-27 04:45:40,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:45:40,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:45:40,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 6: [2022-11-27 04:45:40,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:45:40,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 27: [2022-11-27 04:45:40,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 04:45:40,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 6: [2022-11-27 04:45:40,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 04:45:40,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 12: [2022-11-27 04:45:40,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:45:40,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
12: [2022-11-27 04:45:40,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 20: [2022-11-27 04:45:40,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 13: [2022-11-27 04:45:40,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:45:40,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 20: [2022-11-27 04:45:40,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 13: [2022-11-27 04:45:40,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 04:45:40,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 22: [2022-11-27 04:45:40,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:45:40,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 04:45:40,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 26: [2022-11-27 04:45:40,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
26: [2022-11-27 04:45:40,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 17: [2022-11-27 04:45:40,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:45:40,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 17: [2022-11-27 04:45:40,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 04:45:40,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 2: [2022-11-27 04:45:40,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:45:40,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 04:45:40,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 1: [2022-11-27 04:45:40,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:45:40,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 3: [2022-11-27 04:45:40,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-27 04:45:40,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 1: [2022-11-27 04:45:40,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 3: [2022-11-27 04:45:40,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 9: [2022-11-27 04:45:40,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:45:40,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 04:45:40,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 11: [2022-11-27 04:45:40,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:45:40,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:45:40,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 0: [2022-11-27 04:45:40,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 11: [2022-11-27 04:45:40,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 0: [2022-11-27 04:45:40,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 
29: [2022-11-27 04:45:40,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:45:40,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 04:45:40,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 5: [2022-11-27 04:45:40,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:45:40,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:45:40,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 10: [2022-11-27 04:45:40,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:45:40,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 5: [2022-11-27 04:45:40,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 10: [2022-11-27 04:45:40,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 31: [2022-11-27 04:45:40,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 10: [2022-11-27 04:45:40,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 
14: [2022-11-27 04:45:40,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:45:40,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 04:45:40,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 24: [2022-11-27 04:45:40,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:45:40,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 04:45:40,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 25: [2022-11-27 04:45:40,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:45:40,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:45:40,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 19: [2022-11-27 04:45:40,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 25: [2022-11-27 04:45:40,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 19: [2022-11-27 04:45:40,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 
21: [2022-11-27 04:45:40,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:45:40,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 04:45:40,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 15: [2022-11-27 04:45:40,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:45:40,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 04:45:40,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 28: [2022-11-27 04:45:40,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:45:40,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 14: [2022-11-27 04:45:40,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:45:40,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 04:45:40,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 
7: [2022-11-27 04:45:40,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:45:40,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 04:45:40,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 4: [2022-11-27 04:45:40,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:45:40,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:45:40,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 9: [2022-11-27 04:45:40,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 4: [2022-11-27 04:45:40,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 9: [2022-11-27 04:45:40,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 28: [2022-11-27 04:45:40,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 2: [2022-11-27 04:45:40,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-27 04:45:40,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 04:45:40,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 23: [2022-11-27 04:45:40,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:45:40,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 04:45:40,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 27: [2022-11-27 04:45:40,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:45:40,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 16: [2022-11-27 04:45:40,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:45:40,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 16: [2022-11-27 04:45:40,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 04:45:40,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 14: [2022-11-27 04:45:40,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-27 04:45:40,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 04:45:40,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 3: [2022-11-27 04:45:40,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:45:40,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 04:45:40,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:45:40,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 3: [2022-11-27 04:45:40,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 04:45:40,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 7: [2022-11-27 04:45:40,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:45:40,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 04:45:40,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 7: [2022-11-27 04:45:40,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-27 04:45:40,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step154000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 04:45:40,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step154000 is ready now! 0: successfully saved checkpoint at iteration 154000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2609.89 31: iteration 154010/ 173500 | consumed samples: 39426560 | consumed tokens: 80745594880 | elapsed time per iteration (s): 1.04 | learning rate: 2.566E-05 | global batch size: 256 | lm loss: 1.913832E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.246 | TFLOPs: 14.84 | 31: iteration 154020/ 173500 | consumed samples: 39429120 | consumed tokens: 80750837760 | elapsed time per iteration (s): 0.79 | learning rate: 2.565E-05 | global batch size: 256 | lm loss: 1.907764E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.023 | TFLOPs: 19.60 | 31: iteration 154030/ 173500 | consumed samples: 39431680 | consumed tokens: 80756080640 | elapsed time per iteration (s): 0.74 | learning rate: 2.565E-05 | global batch size: 256 | lm loss: 1.920095E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.428 | TFLOPs: 20.96 | 31: iteration 154040/ 173500 | consumed samples: 39434240 | consumed tokens: 80761323520 | elapsed time per iteration (s): 0.74 | learning rate: 2.564E-05 | global batch size: 256 | lm loss: 1.892192E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.610 | TFLOPs: 20.85 | 31: iteration 154050/ 173500 | consumed samples: 39436800 | consumed tokens: 80766566400 | elapsed time per iteration (s): 0.78 | learning rate: 2.564E-05 | 
global batch size: 256 | lm loss: 1.893055E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.312 | TFLOPs: 19.92 | 31: iteration 154060/ 173500 | consumed samples: 39439360 | consumed tokens: 80771809280 | elapsed time per iteration (s): 0.77 | learning rate: 2.563E-05 | global batch size: 256 | lm loss: 1.903123E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.693 | TFLOPs: 20.13 | 31: iteration 154070/ 173500 | consumed samples: 39441920 | consumed tokens: 80777052160 | elapsed time per iteration (s): 0.77 | learning rate: 2.562E-05 | global batch size: 256 | lm loss: 1.912304E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.650 | TFLOPs: 20.18 | 31: iteration 154080/ 173500 | consumed samples: 39444480 | consumed tokens: 80782295040 | elapsed time per iteration (s): 0.76 | learning rate: 2.562E-05 | global batch size: 256 | lm loss: 1.903261E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.154 | TFLOPs: 20.34 | 31: iteration 154090/ 173500 | consumed samples: 39447040 | consumed tokens: 80787537920 | elapsed time per iteration (s): 0.74 | learning rate: 2.561E-05 | global batch size: 256 | lm loss: 1.936163E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.168 | TFLOPs: 20.82 | 31: iteration 154100/ 173500 | consumed samples: 39449600 | consumed tokens: 80792780800 | elapsed time per iteration (s): 0.73 | learning rate: 2.561E-05 | global batch size: 256 | lm loss: 1.922774E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.193 | TFLOPs: 21.31 | 31: iteration 154110/ 173500 | consumed 
samples: 39452160 | consumed tokens: 80798023680 | elapsed time per iteration (s): 0.77 | learning rate: 2.560E-05 | global batch size: 256 | lm loss: 1.928732E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.924 | TFLOPs: 20.14 | 31: iteration 154120/ 173500 | consumed samples: 39454720 | consumed tokens: 80803266560 | elapsed time per iteration (s): 0.76 | learning rate: 2.560E-05 | global batch size: 256 | lm loss: 1.912864E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.081 | TFLOPs: 20.39 | 31: iteration 154130/ 173500 | consumed samples: 39457280 | consumed tokens: 80808509440 | elapsed time per iteration (s): 0.77 | learning rate: 2.559E-05 | global batch size: 256 | lm loss: 1.893502E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.991 | TFLOPs: 20.15 | 31: iteration 154140/ 173500 | consumed samples: 39459840 | consumed tokens: 80813752320 | elapsed time per iteration (s): 0.89 | learning rate: 2.558E-05 | global batch size: 256 | lm loss: 1.922791E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.522 | TFLOPs: 17.45 | 31: iteration 154150/ 173500 | consumed samples: 39462400 | consumed tokens: 80818995200 | elapsed time per iteration (s): 0.79 | learning rate: 2.558E-05 | global batch size: 256 | lm loss: 1.897628E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.546 | TFLOPs: 19.57 | 31: iteration 154160/ 173500 | consumed samples: 39464960 | consumed tokens: 80824238080 | elapsed time per iteration (s): 0.76 | learning rate: 2.557E-05 | global batch size: 256 | lm loss: 1.899163E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 336.075 | TFLOPs: 20.33 | 31: iteration 154170/ 173500 | consumed samples: 39467520 | consumed tokens: 80829480960 | elapsed time per iteration (s): 0.77 | learning rate: 2.557E-05 | global batch size: 256 | lm loss: 1.918431E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.422 | TFLOPs: 20.11 | 31: iteration 154180/ 173500 | consumed samples: 39470080 | consumed tokens: 80834723840 | elapsed time per iteration (s): 0.77 | learning rate: 2.556E-05 | global batch size: 256 | lm loss: 1.905406E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.023 | TFLOPs: 20.09 | 31: iteration 154190/ 173500 | consumed samples: 39472640 | consumed tokens: 80839966720 | elapsed time per iteration (s): 0.79 | learning rate: 2.556E-05 | global batch size: 256 | lm loss: 1.867603E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.574 | TFLOPs: 19.64 | 31: iteration 154200/ 173500 | consumed samples: 39475200 | consumed tokens: 80845209600 | elapsed time per iteration (s): 0.79 | learning rate: 2.555E-05 | global batch size: 256 | lm loss: 1.912212E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.696 | TFLOPs: 19.64 | 31: iteration 154210/ 173500 | consumed samples: 39477760 | consumed tokens: 80850452480 | elapsed time per iteration (s): 0.80 | learning rate: 2.554E-05 | global batch size: 256 | lm loss: 1.911404E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.669 | TFLOPs: 19.34 | 31: iteration 154220/ 173500 | consumed samples: 39480320 | consumed tokens: 80855695360 | elapsed time per iteration (s): 0.76 | learning rate: 2.554E-05 | global 
batch size: 256 | lm loss: 1.897359E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.241 | TFLOPs: 20.28 | 31: iteration 154230/ 173500 | consumed samples: 39482880 | consumed tokens: 80860938240 | elapsed time per iteration (s): 0.75 | learning rate: 2.553E-05 | global batch size: 256 | lm loss: 1.922742E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.305 | TFLOPs: 20.65 | 31: iteration 154240/ 173500 | consumed samples: 39485440 | consumed tokens: 80866181120 | elapsed time per iteration (s): 0.75 | learning rate: 2.553E-05 | global batch size: 256 | lm loss: 1.918657E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.217 | TFLOPs: 20.70 | 31: iteration 154250/ 173500 | consumed samples: 39488000 | consumed tokens: 80871424000 | elapsed time per iteration (s): 0.81 | learning rate: 2.552E-05 | global batch size: 256 | lm loss: 1.885699E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.720 | TFLOPs: 19.10 | 31: iteration 154260/ 173500 | consumed samples: 39490560 | consumed tokens: 80876666880 | elapsed time per iteration (s): 0.77 | learning rate: 2.552E-05 | global batch size: 256 | lm loss: 1.945142E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.987 | TFLOPs: 20.08 | 31: iteration 154270/ 173500 | consumed samples: 39493120 | consumed tokens: 80881909760 | elapsed time per iteration (s): 0.75 | learning rate: 2.551E-05 | global batch size: 256 | lm loss: 1.927511E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.958 | TFLOPs: 20.69 | 31: iteration 154280/ 173500 | consumed samples: 
39495680 | consumed tokens: 80887152640 | elapsed time per iteration (s): 0.80 | learning rate: 2.550E-05 | global batch size: 256 | lm loss: 1.962742E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.998 | TFLOPs: 19.48 | 31: iteration 154290/ 173500 | consumed samples: 39498240 | consumed tokens: 80892395520 | elapsed time per iteration (s): 0.79 | learning rate: 2.550E-05 | global batch size: 256 | lm loss: 1.912482E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.337 | TFLOPs: 19.68 | 31: iteration 154300/ 173500 | consumed samples: 39500800 | consumed tokens: 80897638400 | elapsed time per iteration (s): 0.78 | learning rate: 2.549E-05 | global batch size: 256 | lm loss: 1.903358E+00 | grad norm: 0.223 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.413 | TFLOPs: 19.75 | 31: iteration 154310/ 173500 | consumed samples: 39503360 | consumed tokens: 80902881280 | elapsed time per iteration (s): 0.75 | learning rate: 2.549E-05 | global batch size: 256 | lm loss: 1.937754E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.809 | TFLOPs: 20.62 | 31: iteration 154320/ 173500 | consumed samples: 39505920 | consumed tokens: 80908124160 | elapsed time per iteration (s): 0.76 | learning rate: 2.548E-05 | global batch size: 256 | lm loss: 1.917157E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.282 | TFLOPs: 20.34 | 31: iteration 154330/ 173500 | consumed samples: 39508480 | consumed tokens: 80913367040 | elapsed time per iteration (s): 0.80 | learning rate: 2.548E-05 | global batch size: 256 | lm loss: 1.917455E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 319.435 | TFLOPs: 19.32 | 31: iteration 154340/ 173500 | consumed samples: 39511040 | consumed tokens: 80918609920 | elapsed time per iteration (s): 0.91 | learning rate: 2.547E-05 | global batch size: 256 | lm loss: 1.918859E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 279.976 | TFLOPs: 16.94 | 31: iteration 154350/ 173500 | consumed samples: 39513600 | consumed tokens: 80923852800 | elapsed time per iteration (s): 0.82 | learning rate: 2.546E-05 | global batch size: 256 | lm loss: 1.919171E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.724 | TFLOPs: 18.86 | 31: iteration 154360/ 173500 | consumed samples: 39516160 | consumed tokens: 80929095680 | elapsed time per iteration (s): 0.82 | learning rate: 2.546E-05 | global batch size: 256 | lm loss: 1.909360E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.704 | TFLOPs: 18.86 | 31: iteration 154370/ 173500 | consumed samples: 39518720 | consumed tokens: 80934338560 | elapsed time per iteration (s): 0.79 | learning rate: 2.545E-05 | global batch size: 256 | lm loss: 1.933146E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.834 | TFLOPs: 19.59 | 31: iteration 154380/ 173500 | consumed samples: 39521280 | consumed tokens: 80939581440 | elapsed time per iteration (s): 0.83 | learning rate: 2.545E-05 | global batch size: 256 | lm loss: 1.912132E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.845 | TFLOPs: 18.68 | 31: iteration 154390/ 173500 | consumed samples: 39523840 | consumed tokens: 80944824320 | elapsed time per iteration (s): 0.80 | learning rate: 2.544E-05 | global batch 
size: 256 | lm loss: 1.905236E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.522 | TFLOPs: 19.33 | 31: iteration 154400/ 173500 | consumed samples: 39526400 | consumed tokens: 80950067200 | elapsed time per iteration (s): 0.79 | learning rate: 2.544E-05 | global batch size: 256 | lm loss: 1.932259E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.493 | TFLOPs: 19.51 | 31: iteration 154410/ 173500 | consumed samples: 39528960 | consumed tokens: 80955310080 | elapsed time per iteration (s): 0.80 | learning rate: 2.543E-05 | global batch size: 256 | lm loss: 1.931001E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.276 | TFLOPs: 19.44 | 31: iteration 154420/ 173500 | consumed samples: 39531520 | consumed tokens: 80960552960 | elapsed time per iteration (s): 0.78 | learning rate: 2.543E-05 | global batch size: 256 | lm loss: 1.891701E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.047 | TFLOPs: 19.85 | 31: iteration 154430/ 173500 | consumed samples: 39534080 | consumed tokens: 80965795840 | elapsed time per iteration (s): 0.80 | learning rate: 2.542E-05 | global batch size: 256 | lm loss: 1.901415E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.385 | TFLOPs: 19.38 | 31: iteration 154440/ 173500 | consumed samples: 39536640 | consumed tokens: 80971038720 | elapsed time per iteration (s): 0.74 | learning rate: 2.541E-05 | global batch size: 256 | lm loss: 1.899232E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.494 | TFLOPs: 20.96 | 31: iteration 154450/ 173500 | consumed samples: 39539200 
| consumed tokens: 80976281600 | elapsed time per iteration (s): 0.72 | learning rate: 2.541E-05 | global batch size: 256 | lm loss: 1.919895E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.889 | TFLOPs: 21.59 | 31: iteration 154460/ 173500 | consumed samples: 39541760 | consumed tokens: 80981524480 | elapsed time per iteration (s): 0.74 | learning rate: 2.540E-05 | global batch size: 256 | lm loss: 1.935600E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.416 | TFLOPs: 21.02 | 31: iteration 154470/ 173500 | consumed samples: 39544320 | consumed tokens: 80986767360 | elapsed time per iteration (s): 0.76 | learning rate: 2.540E-05 | global batch size: 256 | lm loss: 1.908916E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.971 | TFLOPs: 20.33 | 31: iteration 154480/ 173500 | consumed samples: 39546880 | consumed tokens: 80992010240 | elapsed time per iteration (s): 0.77 | learning rate: 2.539E-05 | global batch size: 256 | lm loss: 1.934128E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.597 | TFLOPs: 20.18 | 31: iteration 154490/ 173500 | consumed samples: 39549440 | consumed tokens: 80997253120 | elapsed time per iteration (s): 0.80 | learning rate: 2.539E-05 | global batch size: 256 | lm loss: 1.897023E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.081 | TFLOPs: 19.30 | 31: iteration 154500/ 173500 | consumed samples: 39552000 | consumed tokens: 81002496000 | elapsed time per iteration (s): 0.79 | learning rate: 2.538E-05 | global batch size: 256 | lm loss: 1.908144E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 324.803 | TFLOPs: 19.65 | 31: iteration 154510/ 173500 | consumed samples: 39554560 | consumed tokens: 81007738880 | elapsed time per iteration (s): 1.02 | learning rate: 2.537E-05 | global batch size: 256 | lm loss: 1.897543E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.704 | TFLOPs: 15.17 | 31: iteration 154520/ 173500 | consumed samples: 39557120 | consumed tokens: 81012981760 | elapsed time per iteration (s): 0.76 | learning rate: 2.537E-05 | global batch size: 256 | lm loss: 1.920205E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.350 | TFLOPs: 20.47 | 31: iteration 154530/ 173500 | consumed samples: 39559680 | consumed tokens: 81018224640 | elapsed time per iteration (s): 0.80 | learning rate: 2.536E-05 | global batch size: 256 | lm loss: 1.916711E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.095 | TFLOPs: 19.36 | 31: iteration 154540/ 173500 | consumed samples: 39562240 | consumed tokens: 81023467520 | elapsed time per iteration (s): 0.80 | learning rate: 2.536E-05 | global batch size: 256 | lm loss: 1.947214E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.489 | TFLOPs: 19.45 | 31: iteration 154550/ 173500 | consumed samples: 39564800 | consumed tokens: 81028710400 | elapsed time per iteration (s): 0.76 | learning rate: 2.535E-05 | global batch size: 256 | lm loss: 1.898547E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.399 | TFLOPs: 20.29 | 31: iteration 154560/ 173500 | consumed samples: 39567360 | consumed tokens: 81033953280 | elapsed time per iteration (s): 0.80 | learning rate: 2.535E-05 | global batch size: 
256 | lm loss: 1.908142E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.230 | TFLOPs: 19.25 | 31: iteration 154570/ 173500 | consumed samples: 39569920 | consumed tokens: 81039196160 | elapsed time per iteration (s): 0.79 | learning rate: 2.534E-05 | global batch size: 256 | lm loss: 1.927055E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.490 | TFLOPs: 19.51 | 31: iteration 154580/ 173500 | consumed samples: 39572480 | consumed tokens: 81044439040 | elapsed time per iteration (s): 0.73 | learning rate: 2.534E-05 | global batch size: 256 | lm loss: 1.937914E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.615 | TFLOPs: 21.09 | 31: iteration 154590/ 173500 | consumed samples: 39575040 | consumed tokens: 81049681920 | elapsed time per iteration (s): 0.75 | learning rate: 2.533E-05 | global batch size: 256 | lm loss: 1.893369E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.515 | TFLOPs: 20.54 | 31: iteration 154600/ 173500 | consumed samples: 39577600 | consumed tokens: 81054924800 | elapsed time per iteration (s): 0.74 | learning rate: 2.532E-05 | global batch size: 256 | lm loss: 1.937139E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.015 | TFLOPs: 20.81 | 31: iteration 154610/ 173500 | consumed samples: 39580160 | consumed tokens: 81060167680 | elapsed time per iteration (s): 0.82 | learning rate: 2.532E-05 | global batch size: 256 | lm loss: 1.912284E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.175 | TFLOPs: 18.89 | 31: iteration 154620/ 173500 | consumed samples: 39582720 | 
consumed tokens: 81065410560 | elapsed time per iteration (s): 0.77 | learning rate: 2.531E-05 | global batch size: 256 | lm loss: 1.927202E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.397 | TFLOPs: 20.17 | 31: iteration 154630/ 173500 | consumed samples: 39585280 | consumed tokens: 81070653440 | elapsed time per iteration (s): 0.79 | learning rate: 2.531E-05 | global batch size: 256 | lm loss: 1.911294E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.376 | TFLOPs: 19.68 | 31: iteration 154640/ 173500 | consumed samples: 39587840 | consumed tokens: 81075896320 | elapsed time per iteration (s): 0.79 | learning rate: 2.530E-05 | global batch size: 256 | lm loss: 1.919000E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.497 | TFLOPs: 19.51 | 31: iteration 154650/ 173500 | consumed samples: 39590400 | consumed tokens: 81081139200 | elapsed time per iteration (s): 0.99 | learning rate: 2.530E-05 | global batch size: 256 | lm loss: 1.919082E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 258.717 | TFLOPs: 15.65 | 31: iteration 154660/ 173500 | consumed samples: 39592960 | consumed tokens: 81086382080 | elapsed time per iteration (s): 0.89 | learning rate: 2.529E-05 | global batch size: 256 | lm loss: 1.894933E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.258 | TFLOPs: 17.50 | 31: iteration 154670/ 173500 | consumed samples: 39595520 | consumed tokens: 81091624960 | elapsed time per iteration (s): 0.97 | learning rate: 2.529E-05 | global batch size: 256 | lm loss: 1.907484E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 263.634 | TFLOPs: 15.95 | 31: iteration 154680/ 173500 | consumed samples: 39598080 | consumed tokens: 81096867840 | elapsed time per iteration (s): 2.81 | learning rate: 2.528E-05 | global batch size: 256 | lm loss: 1.905871E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 90.978 | TFLOPs: 5.50 | 31: iteration 154690/ 173500 | consumed samples: 39600640 | consumed tokens: 81102110720 | elapsed time per iteration (s): 0.91 | learning rate: 2.527E-05 | global batch size: 256 | lm loss: 1.935587E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 280.364 | TFLOPs: 16.96 | 31: iteration 154700/ 173500 | consumed samples: 39603200 | consumed tokens: 81107353600 | elapsed time per iteration (s): 0.92 | learning rate: 2.527E-05 | global batch size: 256 | lm loss: 1.926514E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 278.882 | TFLOPs: 16.87 | 31: iteration 154710/ 173500 | consumed samples: 39605760 | consumed tokens: 81112596480 | elapsed time per iteration (s): 0.90 | learning rate: 2.526E-05 | global batch size: 256 | lm loss: 1.933309E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.702 | TFLOPs: 17.22 | 31: iteration 154720/ 173500 | consumed samples: 39608320 | consumed tokens: 81117839360 | elapsed time per iteration (s): 0.87 | learning rate: 2.526E-05 | global batch size: 256 | lm loss: 1.924047E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.681 | TFLOPs: 17.89 | 31: iteration 154730/ 173500 | consumed samples: 39610880 | consumed tokens: 81123082240 | elapsed time per iteration (s): 0.81 | learning rate: 2.525E-05 | global batch size: 256 
| lm loss: 1.913553E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.154 | TFLOPs: 19.07 | 31: iteration 154740/ 173500 | consumed samples: 39613440 | consumed tokens: 81128325120 | elapsed time per iteration (s): 0.76 | learning rate: 2.525E-05 | global batch size: 256 | lm loss: 1.880090E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.105 | TFLOPs: 20.33 | 31: iteration 154750/ 173500 | consumed samples: 39616000 | consumed tokens: 81133568000 | elapsed time per iteration (s): 0.77 | learning rate: 2.524E-05 | global batch size: 256 | lm loss: 1.922814E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.232 | TFLOPs: 20.22 | 31: iteration 154760/ 173500 | consumed samples: 39618560 | consumed tokens: 81138810880 | elapsed time per iteration (s): 0.80 | learning rate: 2.524E-05 | global batch size: 256 | lm loss: 1.919446E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.177 | TFLOPs: 19.37 | 31: iteration 154770/ 173500 | consumed samples: 39621120 | consumed tokens: 81144053760 | elapsed time per iteration (s): 0.81 | learning rate: 2.523E-05 | global batch size: 256 | lm loss: 1.913123E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.122 | TFLOPs: 19.06 | 31: iteration 154780/ 173500 | consumed samples: 39623680 | consumed tokens: 81149296640 | elapsed time per iteration (s): 0.79 | learning rate: 2.522E-05 | global batch size: 256 | lm loss: 1.922981E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.602 | TFLOPs: 19.58 | 31: iteration 154790/ 173500 | consumed samples: 39626240 | 
consumed tokens: 81154539520 | elapsed time per iteration (s): 0.92 | learning rate: 2.522E-05 | global batch size: 256 | lm loss: 1.930927E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 279.041 | TFLOPs: 16.88 | 31: iteration 154800/ 173500 | consumed samples: 39628800 | consumed tokens: 81159782400 | elapsed time per iteration (s): 0.87 | learning rate: 2.521E-05 | global batch size: 256 | lm loss: 1.920501E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.696 | TFLOPs: 17.77 | 31: iteration 154810/ 173500 | consumed samples: 39631360 | consumed tokens: 81165025280 | elapsed time per iteration (s): 0.91 | learning rate: 2.521E-05 | global batch size: 256 | lm loss: 1.913221E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.178 | TFLOPs: 17.07 | 31: iteration 154820/ 173500 | consumed samples: 39633920 | consumed tokens: 81170268160 | elapsed time per iteration (s): 0.86 | learning rate: 2.520E-05 | global batch size: 256 | lm loss: 1.887804E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.375 | TFLOPs: 17.99 | 31: iteration 154830/ 173500 | consumed samples: 39636480 | consumed tokens: 81175511040 | elapsed time per iteration (s): 0.91 | learning rate: 2.520E-05 | global batch size: 256 | lm loss: 1.904890E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.769 | TFLOPs: 17.11 | 31: iteration 154840/ 173500 | consumed samples: 39639040 | consumed tokens: 81180753920 | elapsed time per iteration (s): 0.84 | learning rate: 2.519E-05 | global batch size: 256 | lm loss: 1.891010E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 304.939 | TFLOPs: 18.45 | 31: iteration 154850/ 173500 | consumed samples: 39641600 | consumed tokens: 81185996800 | elapsed time per iteration (s): 0.83 | learning rate: 2.519E-05 | global batch size: 256 | lm loss: 1.905098E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.711 | TFLOPs: 18.74 | 31: iteration 154860/ 173500 | consumed samples: 39644160 | consumed tokens: 81191239680 | elapsed time per iteration (s): 0.84 | learning rate: 2.518E-05 | global batch size: 256 | lm loss: 1.918731E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.845 | TFLOPs: 18.38 | 31: iteration 154870/ 173500 | consumed samples: 39646720 | consumed tokens: 81196482560 | elapsed time per iteration (s): 0.83 | learning rate: 2.517E-05 | global batch size: 256 | lm loss: 1.896406E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.381 | TFLOPs: 18.60 | 31: iteration 154880/ 173500 | consumed samples: 39649280 | consumed tokens: 81201725440 | elapsed time per iteration (s): 0.83 | learning rate: 2.517E-05 | global batch size: 256 | lm loss: 1.917571E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.857 | TFLOPs: 18.75 | 31: iteration 154890/ 173500 | consumed samples: 39651840 | consumed tokens: 81206968320 | elapsed time per iteration (s): 0.85 | learning rate: 2.516E-05 | global batch size: 256 | lm loss: 1.914062E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.848 | TFLOPs: 18.20 | 31: iteration 154900/ 173500 | consumed samples: 39654400 | consumed tokens: 81212211200 | elapsed time per iteration (s): 0.82 | learning rate: 2.516E-05 | global batch size: 
256 | lm loss: 1.929816E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.692 | TFLOPs: 18.98 | 31: iteration 154910/ 173500 | consumed samples: 39656960 | consumed tokens: 81217454080 | elapsed time per iteration (s): 0.76 | learning rate: 2.515E-05 | global batch size: 256 | lm loss: 1.934160E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.584 | TFLOPs: 20.48 | 31: iteration 154920/ 173500 | consumed samples: 39659520 | consumed tokens: 81222696960 | elapsed time per iteration (s): 0.79 | learning rate: 2.515E-05 | global batch size: 256 | lm loss: 1.929733E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.937 | TFLOPs: 19.60 | 31: iteration 154930/ 173500 | consumed samples: 39662080 | consumed tokens: 81227939840 | elapsed time per iteration (s): 0.80 | learning rate: 2.514E-05 | global batch size: 256 | lm loss: 1.942499E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.390 | TFLOPs: 19.26 | 31: iteration 154940/ 173500 | consumed samples: 39664640 | consumed tokens: 81233182720 | elapsed time per iteration (s): 0.72 | learning rate: 2.514E-05 | global batch size: 256 | lm loss: 1.917943E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.501 | TFLOPs: 21.63 | 31: iteration 154950/ 173500 | consumed samples: 39667200 | consumed tokens: 81238425600 | elapsed time per iteration (s): 0.79 | learning rate: 2.513E-05 | global batch size: 256 | lm loss: 1.912073E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.972 | TFLOPs: 19.66 | 31: iteration 154960/ 173500 | consumed samples: 39669760 | 
consumed tokens: 81243668480 | elapsed time per iteration (s): 0.78 | learning rate: 2.513E-05 | global batch size: 256 | lm loss: 1.884091E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.818 | TFLOPs: 19.83 | 31: iteration 154970/ 173500 | consumed samples: 39672320 | consumed tokens: 81248911360 | elapsed time per iteration (s): 0.79 | learning rate: 2.512E-05 | global batch size: 256 | lm loss: 1.906972E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.627 | TFLOPs: 19.70 | 31: iteration 154980/ 173500 | consumed samples: 39674880 | consumed tokens: 81254154240 | elapsed time per iteration (s): 0.80 | learning rate: 2.511E-05 | global batch size: 256 | lm loss: 1.912694E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.378 | TFLOPs: 19.38 | 31: iteration 154990/ 173500 | consumed samples: 39677440 | consumed tokens: 81259397120 | elapsed time per iteration (s): 0.80 | learning rate: 2.511E-05 | global batch size: 256 | lm loss: 1.916692E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.775 | TFLOPs: 19.47 | 31: iteration 155000/ 173500 | consumed samples: 39680000 | consumed tokens: 81264640000 | elapsed time per iteration (s): 0.76 | learning rate: 2.510E-05 | global batch size: 256 | lm loss: 1.889745E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.212 | TFLOPs: 20.40 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 155000 | lm loss value: 1.845436E+00 | lm loss PPL: 6.330856E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving 
checkpoint at iteration 155000 to checkpoints_1b1long 0: [2022-11-27 04:59:22,137] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step155000 is begin to save! 0: [2022-11-27 04:59:22,150] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_01-model_00-model_states.pt... 0: [2022-11-27 04:59:22,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_01-model_00-model_states.pt. 0: [2022-11-27 04:59:22,371] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_03-model_00-model_states.pt... 0: [2022-11-27 04:59:22,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_03-model_00-model_states.pt. 0: [2022-11-27 04:59:22,457] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_04-model_00-model_states.pt... 0: [2022-11-27 04:59:22,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_04-model_00-model_states.pt. 0: [2022-11-27 04:59:22,541] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_05-model_00-model_states.pt... 0: [2022-11-27 04:59:22,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_05-model_00-model_states.pt. 0: [2022-11-27 04:59:22,619] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_06-model_00-model_states.pt... 0: [2022-11-27 04:59:22,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_06-model_00-model_states.pt. 0: [2022-11-27 04:59:22,698] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_07-model_00-model_states.pt... 
0: [2022-11-27 04:59:22,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_07-model_00-model_states.pt. 0: [2022-11-27 04:59:22,773] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_08-model_00-model_states.pt... 0: [2022-11-27 04:59:22,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_08-model_00-model_states.pt. 0: [2022-11-27 04:59:22,851] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_09-model_00-model_states.pt... 0: [2022-11-27 04:59:22,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_09-model_00-model_states.pt. 0: [2022-11-27 04:59:22,923] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_10-model_00-model_states.pt... 0: [2022-11-27 04:59:22,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_10-model_00-model_states.pt. 0: [2022-11-27 04:59:22,999] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_11-model_00-model_states.pt... 0: [2022-11-27 04:59:23,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_11-model_00-model_states.pt. 0: [2022-11-27 04:59:23,069] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_12-model_00-model_states.pt... 0: [2022-11-27 04:59:23,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_12-model_00-model_states.pt. 0: [2022-11-27 04:59:23,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 04:59:23,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_13-model_00-model_states.pt. 0: [2022-11-27 04:59:23,219] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_14-model_00-model_states.pt... 0: [2022-11-27 04:59:23,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_14-model_00-model_states.pt. 0: [2022-11-27 04:59:23,291] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_15-model_00-model_states.pt... 0: [2022-11-27 04:59:23,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_15-model_00-model_states.pt. 0: [2022-11-27 04:59:23,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_16-model_00-model_states.pt... 0: [2022-11-27 04:59:23,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_16-model_00-model_states.pt. 0: [2022-11-27 04:59:23,438] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_17-model_00-model_states.pt... 0: [2022-11-27 04:59:23,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_17-model_00-model_states.pt. 0: [2022-11-27 04:59:23,511] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_18-model_00-model_states.pt... 0: [2022-11-27 04:59:23,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_18-model_00-model_states.pt. 0: [2022-11-27 04:59:23,584] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 04:59:23,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_19-model_00-model_states.pt. 0: [2022-11-27 04:59:23,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_20-model_00-model_states.pt... 0: [2022-11-27 04:59:23,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_20-model_00-model_states.pt. 0: [2022-11-27 04:59:23,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_21-model_00-model_states.pt... 0: [2022-11-27 04:59:23,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_21-model_00-model_states.pt. 0: [2022-11-27 04:59:23,802] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_22-model_00-model_states.pt... 0: [2022-11-27 04:59:23,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_22-model_00-model_states.pt. 0: [2022-11-27 04:59:23,875] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_23-model_00-model_states.pt... 0: [2022-11-27 04:59:23,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_23-model_00-model_states.pt. 0: [2022-11-27 04:59:23,946] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_24-model_00-model_states.pt... 0: [2022-11-27 04:59:24,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_24-model_00-model_states.pt. 0: [2022-11-27 04:59:24,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 04:59:24,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_25-model_00-model_states.pt. 0: [2022-11-27 04:59:24,092] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_26-model_00-model_states.pt... 0: [2022-11-27 04:59:24,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_26-model_00-model_states.pt. 0: [2022-11-27 04:59:24,166] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_27-model_00-model_states.pt... 0: [2022-11-27 04:59:24,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_27-model_00-model_states.pt. 0: [2022-11-27 04:59:24,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_28-model_00-model_states.pt... 0: [2022-11-27 04:59:24,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_28-model_00-model_states.pt. 0: [2022-11-27 04:59:24,312] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/layer_30-model_00-model_states.pt... 0: [2022-11-27 04:59:24,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/layer_30-model_00-model_states.pt. 0: [2022-11-27 04:59:24,314] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step155000/mp_rank_00_model_states.pt 0: [2022-11-27 04:59:24,314] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/mp_rank_00_model_states.pt... 0: [2022-11-27 04:59:24,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/mp_rank_00_model_states.pt. 
0: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
7: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 
2: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
12: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
11: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 
22: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 
18: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
4: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 
16: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
3: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 
20: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
31: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 
21: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
7: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 
10: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 
25: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
31: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 
18: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 
9: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
15: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 
28: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 26: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 
19: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 28: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 
24: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 31: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 24: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 21: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 19: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 
27: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:59:24,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:59:24,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:59:24,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:59:24,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 04:59:24,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 0: [2022-11-27 04:59:24,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:59:24,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 04:59:24,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 18: [2022-11-27 04:59:24,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
18: [2022-11-27 04:59:24,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 04:59:24,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 18: [2022-11-27 04:59:24,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:59:24,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 04:59:24,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 0: [2022-11-27 04:59:24,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:59:24,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 04:59:24,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 3: [2022-11-27 04:59:24,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:59:24,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 04:59:24,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 27: [2022-11-27 04:59:24,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
27: [2022-11-27 04:59:24,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 04:59:24,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 17: [2022-11-27 04:59:24,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:59:24,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:59:24,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 3: [2022-11-27 04:59:24,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 17: [2022-11-27 04:59:24,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 3: [2022-11-27 04:59:24,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 1: [2022-11-27 04:59:24,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:59:24,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 04:59:24,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 6: [2022-11-27 04:59:24,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-27 04:59:24,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 04:59:24,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 3: [2022-11-27 04:59:24,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:59:24,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:59:24,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 04:59:24,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 24: [2022-11-27 04:59:24,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 04:59:24,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 6: [2022-11-27 04:59:24,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:59:24,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 04:59:24,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 31: [2022-11-27 04:59:24,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 
31: [2022-11-27 04:59:24,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 04:59:24,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 31: [2022-11-27 04:59:24,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:59:24,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:59:24,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 04:59:24,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 04:59:24,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 31: [2022-11-27 04:59:24,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 12: [2022-11-27 04:59:24,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:59:24,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 04:59:24,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 24: [2022-11-27 04:59:24,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 
24: [2022-11-27 04:59:24,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 19: [2022-11-27 04:59:24,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:59:24,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 19: [2022-11-27 04:59:24,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-27 04:59:24,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 17: [2022-11-27 04:59:24,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:59:24,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:59:24,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 04:59:24,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 04:59:24,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 17: [2022-11-27 04:59:24,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 15: [2022-11-27 04:59:24,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-27 04:59:24,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:59:24,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:59:24,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 04:59:24,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 8: [2022-11-27 04:59:24,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:59:24,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 15: [2022-11-27 04:59:24,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 8: [2022-11-27 04:59:24,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 15: [2022-11-27 04:59:24,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 0: [2022-11-27 04:59:24,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:59:24,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 15: [2022-11-27 04:59:24,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
0: [2022-11-27 04:59:24,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 8: [2022-11-27 04:59:24,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:59:24,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:59:24,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 22: [2022-11-27 04:59:24,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 0: [2022-11-27 04:59:24,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 8: [2022-11-27 04:59:24,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 22: [2022-11-27 04:59:24,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 19: [2022-11-27 04:59:24,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:59:24,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 04:59:24,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 15: [2022-11-27 04:59:24,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-27 04:59:24,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 04:59:24,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 24: [2022-11-27 04:59:24,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:59:24,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:59:24,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:59:24,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:59:24,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 10: [2022-11-27 04:59:24,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 24: [2022-11-27 04:59:24,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 10: [2022-11-27 04:59:24,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
11: [2022-11-27 04:59:24,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 04:59:24,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 04:59:24,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 11: [2022-11-27 04:59:24,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 6: [2022-11-27 04:59:24,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:59:24,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:59:24,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 04:59:24,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 26: [2022-11-27 04:59:24,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 11: [2022-11-27 04:59:24,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:59:24,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 11: [2022-11-27 04:59:24,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-27 04:59:24,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 04:59:24,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 18: [2022-11-27 04:59:24,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:59:24,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 18: [2022-11-27 04:59:24,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 11: [2022-11-27 04:59:24,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 18: [2022-11-27 04:59:24,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 17: [2022-11-27 04:59:24,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:59:24,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:59:24,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
17: [2022-11-27 04:59:24,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 04:59:24,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 9: [2022-11-27 04:59:24,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:59:24,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 31: [2022-11-27 04:59:24,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 17: [2022-11-27 04:59:24,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 17: [2022-11-27 04:59:24,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 9: [2022-11-27 04:59:24,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 31: [2022-11-27 04:59:24,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 25: [2022-11-27 04:59:24,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:59:24,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
22: [2022-11-27 04:59:24,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 5: [2022-11-27 04:59:24,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:59:24,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 22: [2022-11-27 04:59:24,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 25: [2022-11-27 04:59:24,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 5: [2022-11-27 04:59:24,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 25: [2022-11-27 04:59:24,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:59:24,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 20: [2022-11-27 04:59:24,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:59:24,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:59:24,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
25: [2022-11-27 04:59:24,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:59:24,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:59:24,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 5: [2022-11-27 04:59:24,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 20: [2022-11-27 04:59:24,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 25: [2022-11-27 04:59:24,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 04:59:24,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 5: [2022-11-27 04:59:24,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 20: [2022-11-27 04:59:24,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 25: [2022-11-27 04:59:24,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 5: [2022-11-27 04:59:24,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
25: [2022-11-27 04:59:24,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 5: [2022-11-27 04:59:24,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 25: [2022-11-27 04:59:24,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 5: [2022-11-27 04:59:24,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 25: [2022-11-27 04:59:24,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 25: [2022-11-27 04:59:24,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 5: [2022-11-27 04:59:24,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:59:24,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 04:59:24,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 5: [2022-11-27 04:59:24,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:59:24,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 04:59:24,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 3: [2022-11-27 04:59:24,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-27 04:59:24,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:59:24,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 04:59:24,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 04:59:24,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 3: [2022-11-27 04:59:24,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 8: [2022-11-27 04:59:24,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:59:24,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 04:59:24,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 22: [2022-11-27 04:59:24,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:59:24,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:59:24,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
22: [2022-11-27 04:59:24,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 19: [2022-11-27 04:59:24,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 1: [2022-11-27 04:59:24,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:59:24,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:59:24,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 19: [2022-11-27 04:59:24,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 1: [2022-11-27 04:59:24,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 04:59:24,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 04:59:24,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 1: [2022-11-27 04:59:24,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 20: [2022-11-27 04:59:24,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 04:59:24,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
27: [2022-11-27 04:59:24,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:59:24,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 04:59:24,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 19: [2022-11-27 04:59:24,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:59:24,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 04:59:24,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 6: [2022-11-27 04:59:24,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:59:24,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:59:24,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 23: [2022-11-27 04:59:24,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 6: [2022-11-27 04:59:24,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 23: [2022-11-27 04:59:24,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
16: [2022-11-27 04:59:24,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:59:24,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:59:24,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 04:59:24,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 04:59:24,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 16: [2022-11-27 04:59:24,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 5: [2022-11-27 04:59:24,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:59:24,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 04:59:24,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 28: [2022-11-27 04:59:24,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:59:24,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
22: [2022-11-27 04:59:24,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:59:24,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:59:24,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 6: [2022-11-27 04:59:24,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 12: [2022-11-27 04:59:24,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:59:24,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 6: [2022-11-27 04:59:24,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 12: [2022-11-27 04:59:24,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 04:59:24,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 9: [2022-11-27 04:59:24,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:59:24,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 04:59:24,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
4: [2022-11-27 04:59:24,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:59:24,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 04:59:24,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 24: [2022-11-27 04:59:24,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:59:24,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 04:59:24,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 4: [2022-11-27 04:59:24,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:59:24,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 15: [2022-11-27 04:59:24,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:59:24,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 15: [2022-11-27 04:59:24,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 04:59:24,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
23: [2022-11-27 04:59:24,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:59:24,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 4: [2022-11-27 04:59:24,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:59:24,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 04:59:24,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 23: [2022-11-27 04:59:24,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 4: [2022-11-27 04:59:24,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:59:24,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 04:59:24,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 14: [2022-11-27 04:59:24,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:59:24,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-27 04:59:24,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 14: [2022-11-27 04:59:24,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:59:24,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 14: [2022-11-27 04:59:24,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 04:59:24,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 14: [2022-11-27 04:59:24,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 04:59:24,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 22: [2022-11-27 04:59:24,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:59:24,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 04:59:24,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 14: [2022-11-27 04:59:24,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:59:24,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
8: [2022-11-27 04:59:24,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:59:24,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 14: [2022-11-27 04:59:24,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 26: [2022-11-27 04:59:24,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 8: [2022-11-27 04:59:24,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 14: [2022-11-27 04:59:24,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 26: [2022-11-27 04:59:24,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:59:24,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 26: [2022-11-27 04:59:24,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 14: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-27 04:59:24,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 16: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:59:24,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 4: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:59:24,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 14: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 4: [2022-11-27 04:59:24,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 16: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
16: [2022-11-27 04:59:24,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 20: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:59:24,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 16: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 20: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 27: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:59:24,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 13: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-27 04:59:24,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 04:59:24,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 13: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 25: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:59:24,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 24: [2022-11-27 04:59:24,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 13: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 24: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
13: [2022-11-27 04:59:24,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 04:59:24,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 23: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:59:24,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 23: [2022-11-27 04:59:24,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 13: [2022-11-27 04:59:24,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 23: [2022-11-27 04:59:24,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 18: [2022-11-27 04:59:24,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:59:24,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 04:59:24,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 3: [2022-11-27 04:59:24,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-27 04:59:24,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 11: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:59:24,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 11: [2022-11-27 04:59:24,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 31: [2022-11-27 04:59:24,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:59:24,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 04:59:24,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 10: [2022-11-27 04:59:24,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:59:24,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 19: [2022-11-27 04:59:24,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-27 04:59:24,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 04:59:24,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 10: [2022-11-27 04:59:24,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:59:24,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 10: [2022-11-27 04:59:24,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 04:59:24,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 1: [2022-11-27 04:59:24,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:59:24,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:59:24,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 1: [2022-11-27 04:59:24,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:59:24,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:59:24,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
1: [2022-11-27 04:59:24,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 04:59:24,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 04:59:24,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 04:59:24,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 1: [2022-11-27 04:59:24,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 1: [2022-11-27 04:59:24,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 19: [2022-11-27 04:59:24,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:59:24,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 04:59:24,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 10: [2022-11-27 04:59:24,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:59:24,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 04:59:24,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
10: [2022-11-27 04:59:24,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:59:24,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:59:24,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 10: [2022-11-27 04:59:24,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 04:59:24,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 26: [2022-11-27 04:59:24,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 12: [2022-11-27 04:59:24,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:59:24,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 04:59:24,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 30: [2022-11-27 04:59:24,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:59:24,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 04:59:24,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
9: [2022-11-27 04:59:24,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:59:24,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 30: [2022-11-27 04:59:24,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:59:24,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:59:24,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:59:24,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:59:24,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:59:24,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
9: [2022-11-27 04:59:24,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 8: [2022-11-27 04:59:24,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 04:59:24,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 30: [2022-11-27 04:59:24,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 17: [2022-11-27 04:59:24,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 9: [2022-11-27 04:59:24,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 8: [2022-11-27 04:59:24,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 8: [2022-11-27 04:59:24,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 30: [2022-11-27 04:59:24,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 17: [2022-11-27 04:59:24,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 24: [2022-11-27 04:59:24,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
24: [2022-11-27 04:59:24,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 12: [2022-11-27 04:59:24,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:59:24,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 12: [2022-11-27 04:59:24,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 04:59:24,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 23: [2022-11-27 04:59:24,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:59:24,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 9: [2022-11-27 04:59:24,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:59:24,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 16: [2022-11-27 04:59:24,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:59:24,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
11: [2022-11-27 04:59:24,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:59:24,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 11: [2022-11-27 04:59:24,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 16: [2022-11-27 04:59:24,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 11: [2022-11-27 04:59:24,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 16: [2022-11-27 04:59:24,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 7: [2022-11-27 04:59:24,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:59:24,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:59:24,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 04:59:24,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 04:59:24,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 7: [2022-11-27 04:59:24,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
30: [2022-11-27 04:59:24,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:59:24,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:59:24,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 04:59:24,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 04:59:24,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 30: [2022-11-27 04:59:24,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 20: [2022-11-27 04:59:24,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:59:24,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 04:59:24,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 6: [2022-11-27 04:59:24,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:59:24,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 04:59:24,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
27: [2022-11-27 04:59:24,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:59:24,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 04:59:24,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 30: [2022-11-27 04:59:24,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:59:24,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 04:59:24,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 18: [2022-11-27 04:59:24,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:59:24,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-27 04:59:24,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 23: [2022-11-27 04:59:24,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:59:24,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 04:59:24,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
20: [2022-11-27 04:59:24,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:59:24,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 2: [2022-11-27 04:59:24,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:59:24,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:59:24,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:59:24,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:59:24,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:59:24,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:59:24,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
2: [2022-11-27 04:59:24,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 04:59:24,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 04:59:24,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 04:59:24,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 04:59:24,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 2: [2022-11-27 04:59:24,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 04:59:24,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 04:59:24,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 2: [2022-11-27 04:59:24,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 2: [2022-11-27 04:59:24,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 2: [2022-11-27 04:59:24,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 2: [2022-11-27 04:59:24,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
20: [2022-11-27 04:59:24,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:59:24,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:59:24,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 0: [2022-11-27 04:59:24,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 20: [2022-11-27 04:59:24,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 0: [2022-11-27 04:59:24,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 14: [2022-11-27 04:59:24,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:59:24,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 04:59:24,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 9: [2022-11-27 04:59:24,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:59:24,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 04:59:24,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
27: [2022-11-27 04:59:24,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:59:24,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 04:59:24,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 26: [2022-11-27 04:59:24,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:59:24,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-27 04:59:24,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 27: [2022-11-27 04:59:24,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:59:24,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 04:59:24,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 4: [2022-11-27 04:59:24,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:59:24,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-27 04:59:24,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 4: [2022-11-27 04:59:24,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 04:59:24,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 22: [2022-11-27 04:59:24,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:59:24,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 15: [2022-11-27 04:59:24,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 22: [2022-11-27 04:59:24,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 29: [2022-11-27 04:59:24,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:59:24,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 04:59:24,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 29: [2022-11-27 04:59:24,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-27 04:59:24,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 04:59:24,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 21: [2022-11-27 04:59:24,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:59:24,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:59:24,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:59:24,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:59:24,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:59:24,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-27 04:59:24,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 04:59:24,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 04:59:24,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 04:59:24,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 21: [2022-11-27 04:59:24,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 04:59:24,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 04:59:24,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 04:59:24,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 21: [2022-11-27 04:59:24,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 21: [2022-11-27 04:59:24,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 21: [2022-11-27 04:59:24,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 21: [2022-11-27 04:59:24,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
29: [2022-11-27 04:59:24,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:59:24,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:59:24,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 04:59:24,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:59:24,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 04:59:24,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 29: [2022-11-27 04:59:24,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 29: [2022-11-27 04:59:24,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 04:59:24,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 12: [2022-11-27 04:59:24,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:59:24,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 04:59:24,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
7: [2022-11-27 04:59:24,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:59:24,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:59:24,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:59:24,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 04:59:24,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 04:59:24,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 7: [2022-11-27 04:59:24,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 04:59:24,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 7: [2022-11-27 04:59:24,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 0: [2022-11-27 04:59:24,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 04:59:24,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
28: [2022-11-27 04:59:24,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 04:59:24,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 04:59:24,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 28: [2022-11-27 04:59:24,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 28: [2022-11-27 04:59:24,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:59:24,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 04:59:24,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 28: [2022-11-27 04:59:24,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:59:24,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 04:59:24,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
28: [2022-11-27 04:59:24,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 04:59:24,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 28: [2022-11-27 04:59:24,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:59:24,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 04:59:24,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 31: [2022-11-27 04:59:24,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:59:24,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 04:59:24,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 13: [2022-11-27 04:59:24,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:59:24,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 04:59:24,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 5: [2022-11-27 04:59:24,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-27 04:59:24,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 04:59:24,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 10: [2022-11-27 04:59:24,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:59:24,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 04:59:24,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 7: [2022-11-27 04:59:24,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:59:24,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 04:59:24,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 16: [2022-11-27 04:59:24,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:59:24,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 04:59:24,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 1: [2022-11-27 04:59:24,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-27 04:59:24,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 04:59:24,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 20: [2022-11-27 04:59:24,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:59:24,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 04:59:24,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 2: [2022-11-27 04:59:24,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:59:24,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 04:59:24,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 8: [2022-11-27 04:59:24,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:59:24,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 04:59:24,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 27: [2022-11-27 04:59:24,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 
11: [2022-11-27 04:59:24,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 27: [2022-11-27 04:59:24,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 04:59:24,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 11: [2022-11-27 04:59:24,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 04:59:24,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 17: [2022-11-27 04:59:24,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:59:24,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 04:59:24,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 25: [2022-11-27 04:59:24,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:59:24,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 04:59:24,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 0: [2022-11-27 04:59:24,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-27 04:59:24,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 04:59:24,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 3: [2022-11-27 04:59:24,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:59:24,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 04:59:24,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 24: [2022-11-27 04:59:24,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-27 04:59:24,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 04:59:24,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 6: [2022-11-27 04:59:24,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:59:24,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 04:59:24,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 28: [2022-11-27 04:59:24,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
28: [2022-11-27 04:59:24,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 04:59:24,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 15: [2022-11-27 04:59:24,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:59:24,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 04:59:24,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 19: [2022-11-27 04:59:24,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:59:24,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 04:59:24,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 30: [2022-11-27 04:59:24,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:59:24,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 04:59:24,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 26: [2022-11-27 04:59:24,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-27 04:59:24,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 04:59:24,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 12: [2022-11-27 04:59:24,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:59:24,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 04:59:24,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 9: [2022-11-27 04:59:24,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:59:24,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 04:59:24,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 21: [2022-11-27 04:59:24,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:59:24,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 04:59:24,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 22: [2022-11-27 04:59:24,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
22: [2022-11-27 04:59:24,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 04:59:24,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 23: [2022-11-27 04:59:24,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:59:24,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 04:59:24,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 5: [2022-11-27 04:59:24,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:59:24,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 04:59:24,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 7: [2022-11-27 04:59:24,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:59:24,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:59:24,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 04:59:24,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
18: [2022-11-27 04:59:24,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 04:59:24,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 16: [2022-11-27 04:59:24,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:59:24,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 1: [2022-11-27 04:59:24,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 16: [2022-11-27 04:59:24,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 1: [2022-11-27 04:59:24,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 04:59:24,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 6: [2022-11-27 04:59:24,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:59:24,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 04:59:24,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 31: [2022-11-27 04:59:24,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-27 04:59:24,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 04:59:24,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 29: [2022-11-27 04:59:24,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:59:24,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 04:59:24,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 14: [2022-11-27 04:59:24,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:59:24,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 04:59:24,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 10: [2022-11-27 04:59:24,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:59:24,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 04:59:24,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 0: [2022-11-27 04:59:24,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
2: [2022-11-27 04:59:24,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:59:24,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 2: [2022-11-27 04:59:24,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 04:59:24,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 0: [2022-11-27 04:59:24,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 13: [2022-11-27 04:59:24,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:59:24,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 04:59:24,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 19: [2022-11-27 04:59:24,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-27 04:59:24,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 04:59:24,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 4: [2022-11-27 04:59:24,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
15: [2022-11-27 04:59:24,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:59:24,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 04:59:24,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 15: [2022-11-27 04:59:24,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 04:59:24,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 20: [2022-11-27 04:59:24,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-27 04:59:24,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 04:59:24,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 8: [2022-11-27 04:59:24,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:59:24,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 04:59:24,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 9: [2022-11-27 04:59:24,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-27 04:59:24,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 04:59:24,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 12: [2022-11-27 04:59:24,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:59:24,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 27: [2022-11-27 04:59:24,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:59:24,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 27: [2022-11-27 04:59:24,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 04:59:24,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 25: [2022-11-27 04:59:24,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:59:24,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 11: [2022-11-27 04:59:24,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 25: [2022-11-27 04:59:24,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
11: [2022-11-27 04:59:24,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 04:59:24,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 21: [2022-11-27 04:59:24,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-27 04:59:24,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 04:59:24,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 17: [2022-11-27 04:59:24,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-27 04:59:24,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 04:59:24,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 22: [2022-11-27 04:59:24,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 04:59:24,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 04:59:24,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 3: [2022-11-27 04:59:24,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-27 04:59:24,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 04:59:24,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 31: [2022-11-27 04:59:24,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:59:24,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 23: [2022-11-27 04:59:24,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 31: [2022-11-27 04:59:24,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 23: [2022-11-27 04:59:24,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 04:59:24,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 16: [2022-11-27 04:59:24,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 18: [2022-11-27 04:59:24,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
16: [2022-11-27 04:59:24,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 18: [2022-11-27 04:59:24,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 04:59:24,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 16: [2022-11-27 04:59:24,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 30: [2022-11-27 04:59:24,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:59:24,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 04:59:24,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 26: [2022-11-27 04:59:24,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-27 04:59:24,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 04:59:24,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 0: [2022-11-27 04:59:24,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-27 04:59:24,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 29: [2022-11-27 04:59:24,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:59:24,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 29: [2022-11-27 04:59:24,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 04:59:24,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 4: [2022-11-27 04:59:24,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:59:24,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:59:24,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 24: [2022-11-27 04:59:24,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:59:24,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 4: [2022-11-27 04:59:24,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
24: [2022-11-27 04:59:24,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 7: [2022-11-27 04:59:24,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 24: [2022-11-27 04:59:24,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 14: [2022-11-27 04:59:24,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 28: [2022-11-27 04:59:24,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:59:24,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 28: [2022-11-27 04:59:24,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 14: [2022-11-27 04:59:24,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 10: [2022-11-27 04:59:24,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:59:24,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 28: [2022-11-27 04:59:24,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 10: [2022-11-27 04:59:24,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
30: [2022-11-27 04:59:24,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 23: [2022-11-27 04:59:24,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 30: [2022-11-27 04:59:24,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 23: [2022-11-27 04:59:24,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 04:59:24,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 30: [2022-11-27 04:59:24,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 29: [2022-11-27 04:59:24,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-27 04:59:24,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 04:59:24,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 13: [2022-11-27 04:59:24,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:59:24,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 04:59:24,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 
13: [2022-11-27 04:59:24,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:59:24,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step155000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 04:59:24,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step155000 is ready now! 0: successfully saved checkpoint at iteration 155000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2514.72 31: iteration 155010/ 173500 | consumed samples: 39682560 | consumed tokens: 81269882880 | elapsed time per iteration (s): 1.07 | learning rate: 2.510E-05 | global batch size: 256 | lm loss: 1.894225E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.985 | TFLOPs: 14.46 | 31: iteration 155020/ 173500 | consumed samples: 39685120 | consumed tokens: 81275125760 | elapsed time per iteration (s): 0.79 | learning rate: 2.509E-05 | global batch size: 256 | lm loss: 1.904819E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.176 | TFLOPs: 19.55 | 31: iteration 155030/ 173500 | consumed samples: 39687680 | consumed tokens: 81280368640 | elapsed time per iteration (s): 0.90 | learning rate: 2.509E-05 | global batch size: 256 | lm loss: 1.897643E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 283.879 | TFLOPs: 17.17 | 31: iteration 155040/ 173500 | consumed samples: 39690240 | consumed tokens: 81285611520 | elapsed time per iteration (s): 0.81 | learning rate: 2.508E-05 | global batch size: 256 | lm loss: 1.934067E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.653 
| TFLOPs: 19.10 | 31: iteration 155050/ 173500 | consumed samples: 39692800 | consumed tokens: 81290854400 | elapsed time per iteration (s): 0.79 | learning rate: 2.508E-05 | global batch size: 256 | lm loss: 1.918916E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.521 | TFLOPs: 19.57 | 31: iteration 155060/ 173500 | consumed samples: 39695360 | consumed tokens: 81296097280 | elapsed time per iteration (s): 0.82 | learning rate: 2.507E-05 | global batch size: 256 | lm loss: 1.952304E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.856 | TFLOPs: 18.99 | 31: iteration 155070/ 173500 | consumed samples: 39697920 | consumed tokens: 81301340160 | elapsed time per iteration (s): 0.82 | learning rate: 2.507E-05 | global batch size: 256 | lm loss: 1.928148E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.292 | TFLOPs: 18.89 | 31: iteration 155080/ 173500 | consumed samples: 39700480 | consumed tokens: 81306583040 | elapsed time per iteration (s): 0.84 | learning rate: 2.506E-05 | global batch size: 256 | lm loss: 1.890044E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.001 | TFLOPs: 18.51 | 31: iteration 155090/ 173500 | consumed samples: 39703040 | consumed tokens: 81311825920 | elapsed time per iteration (s): 0.83 | learning rate: 2.505E-05 | global batch size: 256 | lm loss: 1.900873E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.877 | TFLOPs: 18.63 | 31: iteration 155100/ 173500 | consumed samples: 39705600 | consumed tokens: 81317068800 | elapsed time per iteration (s): 0.87 | learning rate: 2.505E-05 | global batch size: 256 | lm loss: 1.929295E+00 | grad norm: 
0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.758 | TFLOPs: 17.89 | 31: iteration 155110/ 173500 | consumed samples: 39708160 | consumed tokens: 81322311680 | elapsed time per iteration (s): 0.82 | learning rate: 2.504E-05 | global batch size: 256 | lm loss: 1.937244E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.286 | TFLOPs: 18.95 | 31: iteration 155120/ 173500 | consumed samples: 39710720 | consumed tokens: 81327554560 | elapsed time per iteration (s): 0.79 | learning rate: 2.504E-05 | global batch size: 256 | lm loss: 1.917091E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.951 | TFLOPs: 19.54 | 31: iteration 155130/ 173500 | consumed samples: 39713280 | consumed tokens: 81332797440 | elapsed time per iteration (s): 0.76 | learning rate: 2.503E-05 | global batch size: 256 | lm loss: 1.928673E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.823 | TFLOPs: 20.50 | 31: iteration 155140/ 173500 | consumed samples: 39715840 | consumed tokens: 81338040320 | elapsed time per iteration (s): 0.80 | learning rate: 2.503E-05 | global batch size: 256 | lm loss: 1.934671E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.251 | TFLOPs: 19.43 | 31: iteration 155150/ 173500 | consumed samples: 39718400 | consumed tokens: 81343283200 | elapsed time per iteration (s): 0.77 | learning rate: 2.502E-05 | global batch size: 256 | lm loss: 1.919785E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.230 | TFLOPs: 20.16 | 31: iteration 155160/ 173500 | consumed samples: 39720960 | consumed tokens: 81348526080 | elapsed time 
per iteration (s): 0.88 | learning rate: 2.502E-05 | global batch size: 256 | lm loss: 1.909086E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.380 | TFLOPs: 17.51 | 31: iteration 155170/ 173500 | consumed samples: 39723520 | consumed tokens: 81353768960 | elapsed time per iteration (s): 0.78 | learning rate: 2.501E-05 | global batch size: 256 | lm loss: 1.911467E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.254 | TFLOPs: 19.98 | 31: iteration 155180/ 173500 | consumed samples: 39726080 | consumed tokens: 81359011840 | elapsed time per iteration (s): 0.78 | learning rate: 2.501E-05 | global batch size: 256 | lm loss: 1.914613E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.929 | TFLOPs: 19.84 | 31: iteration 155190/ 173500 | consumed samples: 39728640 | consumed tokens: 81364254720 | elapsed time per iteration (s): 0.78 | learning rate: 2.500E-05 | global batch size: 256 | lm loss: 1.904875E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.804 | TFLOPs: 19.83 | 31: iteration 155200/ 173500 | consumed samples: 39731200 | consumed tokens: 81369497600 | elapsed time per iteration (s): 0.77 | learning rate: 2.499E-05 | global batch size: 256 | lm loss: 1.923276E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.221 | TFLOPs: 20.22 | 31: iteration 155210/ 173500 | consumed samples: 39733760 | consumed tokens: 81374740480 | elapsed time per iteration (s): 0.79 | learning rate: 2.499E-05 | global batch size: 256 | lm loss: 1.906310E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.238 | TFLOPs: 
19.68 | 31: iteration 155220/ 173500 | consumed samples: 39736320 | consumed tokens: 81379983360 | elapsed time per iteration (s): 0.74 | learning rate: 2.498E-05 | global batch size: 256 | lm loss: 1.920579E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.102 | TFLOPs: 20.94 | 31: iteration 155230/ 173500 | consumed samples: 39738880 | consumed tokens: 81385226240 | elapsed time per iteration (s): 0.75 | learning rate: 2.498E-05 | global batch size: 256 | lm loss: 1.922933E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.530 | TFLOPs: 20.60 | 31: iteration 155240/ 173500 | consumed samples: 39741440 | consumed tokens: 81390469120 | elapsed time per iteration (s): 0.77 | learning rate: 2.497E-05 | global batch size: 256 | lm loss: 1.916769E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.590 | TFLOPs: 20.12 | 31: iteration 155250/ 173500 | consumed samples: 39744000 | consumed tokens: 81395712000 | elapsed time per iteration (s): 0.78 | learning rate: 2.497E-05 | global batch size: 256 | lm loss: 1.924070E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.537 | TFLOPs: 19.82 | 31: iteration 155260/ 173500 | consumed samples: 39746560 | consumed tokens: 81400954880 | elapsed time per iteration (s): 0.76 | learning rate: 2.496E-05 | global batch size: 256 | lm loss: 1.914593E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.279 | TFLOPs: 20.47 | 31: iteration 155270/ 173500 | consumed samples: 39749120 | consumed tokens: 81406197760 | elapsed time per iteration (s): 0.82 | learning rate: 2.496E-05 | global batch size: 256 | lm loss: 1.903156E+00 | grad norm: 0.179 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.000 | TFLOPs: 18.81 | 31: iteration 155280/ 173500 | consumed samples: 39751680 | consumed tokens: 81411440640 | elapsed time per iteration (s): 0.83 | learning rate: 2.495E-05 | global batch size: 256 | lm loss: 1.912506E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.755 | TFLOPs: 18.56 | 31: iteration 155290/ 173500 | consumed samples: 39754240 | consumed tokens: 81416683520 | elapsed time per iteration (s): 0.79 | learning rate: 2.495E-05 | global batch size: 256 | lm loss: 1.918666E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.144 | TFLOPs: 19.55 | 31: iteration 155300/ 173500 | consumed samples: 39756800 | consumed tokens: 81421926400 | elapsed time per iteration (s): 0.80 | learning rate: 2.494E-05 | global batch size: 256 | lm loss: 1.934023E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.647 | TFLOPs: 19.34 | 31: iteration 155310/ 173500 | consumed samples: 39759360 | consumed tokens: 81427169280 | elapsed time per iteration (s): 0.79 | learning rate: 2.494E-05 | global batch size: 256 | lm loss: 1.918460E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.969 | TFLOPs: 19.72 | 31: iteration 155320/ 173500 | consumed samples: 39761920 | consumed tokens: 81432412160 | elapsed time per iteration (s): 0.89 | learning rate: 2.493E-05 | global batch size: 256 | lm loss: 1.930274E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.168 | TFLOPs: 17.31 | 31: iteration 155330/ 173500 | consumed samples: 39764480 | consumed tokens: 81437655040 | elapsed time per 
iteration (s): 0.85 | learning rate: 2.492E-05 | global batch size: 256 | lm loss: 1.882460E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.021 | TFLOPs: 18.21 | 31: iteration 155340/ 173500 | consumed samples: 39767040 | consumed tokens: 81442897920 | elapsed time per iteration (s): 0.80 | learning rate: 2.492E-05 | global batch size: 256 | lm loss: 1.909768E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.469 | TFLOPs: 19.45 | 31: iteration 155350/ 173500 | consumed samples: 39769600 | consumed tokens: 81448140800 | elapsed time per iteration (s): 0.80 | learning rate: 2.491E-05 | global batch size: 256 | lm loss: 1.933878E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.748 | TFLOPs: 19.46 | 31: iteration 155360/ 173500 | consumed samples: 39772160 | consumed tokens: 81453383680 | elapsed time per iteration (s): 0.78 | learning rate: 2.491E-05 | global batch size: 256 | lm loss: 1.936369E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.878 | TFLOPs: 19.90 | 31: iteration 155370/ 173500 | consumed samples: 39774720 | consumed tokens: 81458626560 | elapsed time per iteration (s): 0.81 | learning rate: 2.490E-05 | global batch size: 256 | lm loss: 1.910950E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.369 | TFLOPs: 19.20 | 31: iteration 155380/ 173500 | consumed samples: 39777280 | consumed tokens: 81463869440 | elapsed time per iteration (s): 0.79 | learning rate: 2.490E-05 | global batch size: 256 | lm loss: 1.905403E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.154 | TFLOPs: 
19.61 | 31: iteration 155390/ 173500 | consumed samples: 39779840 | consumed tokens: 81469112320 | elapsed time per iteration (s): 0.80 | learning rate: 2.489E-05 | global batch size: 256 | lm loss: 1.913234E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.918 | TFLOPs: 19.35 | 31: iteration 155400/ 173500 | consumed samples: 39782400 | consumed tokens: 81474355200 | elapsed time per iteration (s): 0.83 | learning rate: 2.489E-05 | global batch size: 256 | lm loss: 1.890096E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.649 | TFLOPs: 18.55 | 31: iteration 155410/ 173500 | consumed samples: 39784960 | consumed tokens: 81479598080 | elapsed time per iteration (s): 0.77 | learning rate: 2.488E-05 | global batch size: 256 | lm loss: 1.935004E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.488 | TFLOPs: 19.99 | 31: iteration 155420/ 173500 | consumed samples: 39787520 | consumed tokens: 81484840960 | elapsed time per iteration (s): 0.95 | learning rate: 2.488E-05 | global batch size: 256 | lm loss: 1.905931E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 269.096 | TFLOPs: 16.28 | 31: iteration 155430/ 173500 | consumed samples: 39790080 | consumed tokens: 81490083840 | elapsed time per iteration (s): 0.78 | learning rate: 2.487E-05 | global batch size: 256 | lm loss: 1.932945E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.006 | TFLOPs: 19.78 | 31: iteration 155440/ 173500 | consumed samples: 39792640 | consumed tokens: 81495326720 | elapsed time per iteration (s): 0.81 | learning rate: 2.487E-05 | global batch size: 256 | lm loss: 1.959515E+00 | grad norm: 0.177 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.780 | TFLOPs: 19.22 | 31: iteration 155450/ 173500 | consumed samples: 39795200 | consumed tokens: 81500569600 | elapsed time per iteration (s): 0.82 | learning rate: 2.486E-05 | global batch size: 256 | lm loss: 1.919581E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.965 | TFLOPs: 18.93 | 31: iteration 155460/ 173500 | consumed samples: 39797760 | consumed tokens: 81505812480 | elapsed time per iteration (s): 0.80 | learning rate: 2.486E-05 | global batch size: 256 | lm loss: 1.905035E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.242 | TFLOPs: 19.25 | 31: iteration 155470/ 173500 | consumed samples: 39800320 | consumed tokens: 81511055360 | elapsed time per iteration (s): 0.85 | learning rate: 2.485E-05 | global batch size: 256 | lm loss: 1.934001E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.164 | TFLOPs: 18.28 | 31: iteration 155480/ 173500 | consumed samples: 39802880 | consumed tokens: 81516298240 | elapsed time per iteration (s): 0.82 | learning rate: 2.484E-05 | global batch size: 256 | lm loss: 1.924644E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.487 | TFLOPs: 18.84 | 31: iteration 155490/ 173500 | consumed samples: 39805440 | consumed tokens: 81521541120 | elapsed time per iteration (s): 0.83 | learning rate: 2.484E-05 | global batch size: 256 | lm loss: 1.899781E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.318 | TFLOPs: 18.71 | 31: iteration 155500/ 173500 | consumed samples: 39808000 | consumed tokens: 81526784000 | elapsed time per 
iteration (s): 0.84 | learning rate: 2.483E-05 | global batch size: 256 | lm loss: 1.908531E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.587 | TFLOPs: 18.49 | 31: iteration 155510/ 173500 | consumed samples: 39810560 | consumed tokens: 81532026880 | elapsed time per iteration (s): 0.81 | learning rate: 2.483E-05 | global batch size: 256 | lm loss: 1.911237E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.438 | TFLOPs: 19.08 | 31: iteration 155520/ 173500 | consumed samples: 39813120 | consumed tokens: 81537269760 | elapsed time per iteration (s): 0.80 | learning rate: 2.482E-05 | global batch size: 256 | lm loss: 1.935753E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.406 | TFLOPs: 19.44 | 31: iteration 155530/ 173500 | consumed samples: 39815680 | consumed tokens: 81542512640 | elapsed time per iteration (s): 0.79 | learning rate: 2.482E-05 | global batch size: 256 | lm loss: 1.905597E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.300 | TFLOPs: 19.68 | 31: iteration 155540/ 173500 | consumed samples: 39818240 | consumed tokens: 81547755520 | elapsed time per iteration (s): 0.82 | learning rate: 2.481E-05 | global batch size: 256 | lm loss: 1.916821E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.108 | TFLOPs: 18.88 | 31: iteration 155550/ 173500 | consumed samples: 39820800 | consumed tokens: 81552998400 | elapsed time per iteration (s): 0.81 | learning rate: 2.481E-05 | global batch size: 256 | lm loss: 1.932023E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.905 | TFLOPs: 
19.11 | 31: iteration 155560/ 173500 | consumed samples: 39823360 | consumed tokens: 81558241280 | elapsed time per iteration (s): 0.80 | learning rate: 2.480E-05 | global batch size: 256 | lm loss: 1.924741E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.663 | TFLOPs: 19.40 | 31: iteration 155570/ 173500 | consumed samples: 39825920 | consumed tokens: 81563484160 | elapsed time per iteration (s): 0.85 | learning rate: 2.480E-05 | global batch size: 256 | lm loss: 1.920412E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.680 | TFLOPs: 18.13 | 31: iteration 155580/ 173500 | consumed samples: 39828480 | consumed tokens: 81568727040 | elapsed time per iteration (s): 0.84 | learning rate: 2.479E-05 | global batch size: 256 | lm loss: 1.883576E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.558 | TFLOPs: 18.36 | 31: iteration 155590/ 173500 | consumed samples: 39831040 | consumed tokens: 81573969920 | elapsed time per iteration (s): 0.92 | learning rate: 2.479E-05 | global batch size: 256 | lm loss: 1.946094E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 278.320 | TFLOPs: 16.84 | 31: iteration 155600/ 173500 | consumed samples: 39833600 | consumed tokens: 81579212800 | elapsed time per iteration (s): 0.79 | learning rate: 2.478E-05 | global batch size: 256 | lm loss: 1.938542E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.773 | TFLOPs: 19.53 | 31: iteration 155610/ 173500 | consumed samples: 39836160 | consumed tokens: 81584455680 | elapsed time per iteration (s): 0.80 | learning rate: 2.478E-05 | global batch size: 256 | lm loss: 1.914475E+00 | grad norm: 0.188 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.315 | TFLOPs: 19.26 | 31: iteration 155620/ 173500 | consumed samples: 39838720 | consumed tokens: 81589698560 | elapsed time per iteration (s): 0.89 | learning rate: 2.477E-05 | global batch size: 256 | lm loss: 1.931410E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.098 | TFLOPs: 17.43 | 31: iteration 155630/ 173500 | consumed samples: 39841280 | consumed tokens: 81594941440 | elapsed time per iteration (s): 0.84 | learning rate: 2.476E-05 | global batch size: 256 | lm loss: 1.876303E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.132 | TFLOPs: 18.34 | 31: iteration 155640/ 173500 | consumed samples: 39843840 | consumed tokens: 81600184320 | elapsed time per iteration (s): 0.84 | learning rate: 2.476E-05 | global batch size: 256 | lm loss: 1.897844E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.736 | TFLOPs: 18.44 | 31: iteration 155650/ 173500 | consumed samples: 39846400 | consumed tokens: 81605427200 | elapsed time per iteration (s): 0.81 | learning rate: 2.475E-05 | global batch size: 256 | lm loss: 1.882852E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.058 | TFLOPs: 19.12 | 31: iteration 155660/ 173500 | consumed samples: 39848960 | consumed tokens: 81610670080 | elapsed time per iteration (s): 0.85 | learning rate: 2.475E-05 | global batch size: 256 | lm loss: 1.922958E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.467 | TFLOPs: 18.18 | 31: iteration 155670/ 173500 | consumed samples: 39851520 | consumed tokens: 81615912960 | elapsed time per 
iteration (s): 0.79 | learning rate: 2.474E-05 | global batch size: 256 | lm loss: 1.911278E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.241 | TFLOPs: 19.62 | 31: iteration 155680/ 173500 | consumed samples: 39854080 | consumed tokens: 81621155840 | elapsed time per iteration (s): 0.80 | learning rate: 2.474E-05 | global batch size: 256 | lm loss: 1.894971E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.742 | TFLOPs: 19.34 | 31: iteration 155690/ 173500 | consumed samples: 39856640 | consumed tokens: 81626398720 | elapsed time per iteration (s): 0.81 | learning rate: 2.473E-05 | global batch size: 256 | lm loss: 1.914459E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.853 | TFLOPs: 19.17 | 31: iteration 155700/ 173500 | consumed samples: 39859200 | consumed tokens: 81631641600 | elapsed time per iteration (s): 0.80 | learning rate: 2.473E-05 | global batch size: 256 | lm loss: 1.917828E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.744 | TFLOPs: 19.28 | 31: iteration 155710/ 173500 | consumed samples: 39861760 | consumed tokens: 81636884480 | elapsed time per iteration (s): 0.82 | learning rate: 2.472E-05 | global batch size: 256 | lm loss: 1.931232E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.342 | TFLOPs: 18.90 | 31: iteration 155720/ 173500 | consumed samples: 39864320 | consumed tokens: 81642127360 | elapsed time per iteration (s): 0.82 | learning rate: 2.472E-05 | global batch size: 256 | lm loss: 1.961260E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.312 | TFLOPs: 
18.95 | 31: iteration 155730/ 173500 | consumed samples: 39866880 | consumed tokens: 81647370240 | elapsed time per iteration (s): 0.80 | learning rate: 2.471E-05 | global batch size: 256 | lm loss: 1.896623E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.721 | TFLOPs: 19.46 | 31: iteration 155740/ 173500 | consumed samples: 39869440 | consumed tokens: 81652613120 | elapsed time per iteration (s): 0.80 | learning rate: 2.471E-05 | global batch size: 256 | lm loss: 1.897884E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.300 | TFLOPs: 19.44 | 31: iteration 155750/ 173500 | consumed samples: 39872000 | consumed tokens: 81657856000 | elapsed time per iteration (s): 0.81 | learning rate: 2.470E-05 | global batch size: 256 | lm loss: 1.941495E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.327 | TFLOPs: 19.02 | 31: iteration 155760/ 173500 | consumed samples: 39874560 | consumed tokens: 81663098880 | elapsed time per iteration (s): 0.83 | learning rate: 2.470E-05 | global batch size: 256 | lm loss: 1.911740E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.905 | TFLOPs: 18.63 | 31: iteration 155770/ 173500 | consumed samples: 39877120 | consumed tokens: 81668341760 | elapsed time per iteration (s): 0.82 | learning rate: 2.469E-05 | global batch size: 256 | lm loss: 1.930801E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.972 | TFLOPs: 18.81 | 31: iteration 155780/ 173500 | consumed samples: 39879680 | consumed tokens: 81673584640 | elapsed time per iteration (s): 0.81 | learning rate: 2.469E-05 | global batch size: 256 | lm loss: 1.920065E+00 | grad norm: 0.195 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.317 | TFLOPs: 19.14 | 31: iteration 155790/ 173500 | consumed samples: 39882240 | consumed tokens: 81678827520 | elapsed time per iteration (s): 0.85 | learning rate: 2.468E-05 | global batch size: 256 | lm loss: 1.923053E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.829 | TFLOPs: 18.20 | 31: iteration 155800/ 173500 | consumed samples: 39884800 | consumed tokens: 81684070400 | elapsed time per iteration (s): 0.80 | learning rate: 2.468E-05 | global batch size: 256 | lm loss: 1.924885E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.404 | TFLOPs: 19.26 | 31: iteration 155810/ 173500 | consumed samples: 39887360 | consumed tokens: 81689313280 | elapsed time per iteration (s): 0.80 | learning rate: 2.467E-05 | global batch size: 256 | lm loss: 1.925140E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.013 | TFLOPs: 19.42 | 31: iteration 155820/ 173500 | consumed samples: 39889920 | consumed tokens: 81694556160 | elapsed time per iteration (s): 0.80 | learning rate: 2.466E-05 | global batch size: 256 | lm loss: 1.909439E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.392 | TFLOPs: 19.38 | 31: iteration 155830/ 173500 | consumed samples: 39892480 | consumed tokens: 81699799040 | elapsed time per iteration (s): 0.81 | learning rate: 2.466E-05 | global batch size: 256 | lm loss: 1.917058E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.596 | TFLOPs: 19.03 | 31: iteration 155840/ 173500 | consumed samples: 39895040 | consumed tokens: 81705041920 | elapsed time per 
iteration (s): 0.81 | learning rate: 2.465E-05 | global batch size: 256 | lm loss: 1.905831E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.138 | TFLOPs: 19.19 | 31: iteration 155850/ 173500 | consumed samples: 39897600 | consumed tokens: 81710284800 | elapsed time per iteration (s): 0.79 | learning rate: 2.465E-05 | global batch size: 256 | lm loss: 1.927224E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.571 | TFLOPs: 19.58 | 31: iteration 155860/ 173500 | consumed samples: 39900160 | consumed tokens: 81715527680 | elapsed time per iteration (s): 0.75 | learning rate: 2.464E-05 | global batch size: 256 | lm loss: 1.916430E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.631 | TFLOPs: 20.67 | 31: iteration 155870/ 173500 | consumed samples: 39902720 | consumed tokens: 81720770560 | elapsed time per iteration (s): 0.78 | learning rate: 2.464E-05 | global batch size: 256 | lm loss: 1.882108E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.982 | TFLOPs: 19.90 | 31: iteration 155880/ 173500 | consumed samples: 39905280 | consumed tokens: 81726013440 | elapsed time per iteration (s): 0.77 | learning rate: 2.463E-05 | global batch size: 256 | lm loss: 1.928555E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.690 | TFLOPs: 20.19 | 31: iteration 155890/ 173500 | consumed samples: 39907840 | consumed tokens: 81731256320 | elapsed time per iteration (s): 0.80 | learning rate: 2.463E-05 | global batch size: 256 | lm loss: 1.920964E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.874 | TFLOPs: 
19.47 | 31: iteration 155900/ 173500 | consumed samples: 39910400 | consumed tokens: 81736499200 | elapsed time per iteration (s): 0.79 | learning rate: 2.462E-05 | global batch size: 256 | lm loss: 1.928674E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.005 | TFLOPs: 19.60 | 31: iteration 155910/ 173500 | consumed samples: 39912960 | consumed tokens: 81741742080 | elapsed time per iteration (s): 0.75 | learning rate: 2.462E-05 | global batch size: 256 | lm loss: 1.910847E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.405 | TFLOPs: 20.65 | 31: iteration 155920/ 173500 | consumed samples: 39915520 | consumed tokens: 81746984960 | elapsed time per iteration (s): 0.75 | learning rate: 2.461E-05 | global batch size: 256 | lm loss: 1.914604E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.484 | TFLOPs: 20.60 | 31: iteration 155930/ 173500 | consumed samples: 39918080 | consumed tokens: 81752227840 | elapsed time per iteration (s): 0.78 | learning rate: 2.461E-05 | global batch size: 256 | lm loss: 1.909891E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.076 | TFLOPs: 19.97 | 31: iteration 155940/ 173500 | consumed samples: 39920640 | consumed tokens: 81757470720 | elapsed time per iteration (s): 0.76 | learning rate: 2.460E-05 | global batch size: 256 | lm loss: 1.894314E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.956 | TFLOPs: 20.45 | 31: iteration 155950/ 173500 | consumed samples: 39923200 | consumed tokens: 81762713600 | elapsed time per iteration (s): 0.76 | learning rate: 2.460E-05 | global batch size: 256 | lm loss: 1.901815E+00 | grad norm: 0.179 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.687 | TFLOPs: 20.31 | 31: iteration 155960/ 173500 | consumed samples: 39925760 | consumed tokens: 81767956480 | elapsed time per iteration (s): 0.79 | learning rate: 2.459E-05 | global batch size: 256 | lm loss: 1.917694E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.304 | TFLOPs: 19.68 | 31: iteration 155970/ 173500 | consumed samples: 39928320 | consumed tokens: 81773199360 | elapsed time per iteration (s): 0.76 | learning rate: 2.459E-05 | global batch size: 256 | lm loss: 1.912929E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.245 | TFLOPs: 20.40 | 31: iteration 155980/ 173500 | consumed samples: 39930880 | consumed tokens: 81778442240 | elapsed time per iteration (s): 0.80 | learning rate: 2.458E-05 | global batch size: 256 | lm loss: 1.916986E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.790 | TFLOPs: 19.41 | 31: iteration 155990/ 173500 | consumed samples: 39933440 | consumed tokens: 81783685120 | elapsed time per iteration (s): 0.82 | learning rate: 2.458E-05 | global batch size: 256 | lm loss: 1.907157E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.165 | TFLOPs: 18.89 | 0: [2022-11-27 05:12:52,109] [INFO] [logging.py:68:log_dist] [Rank 0] step=156000, skipped=0, lr=[2.4571227150894576e-05, 2.4571227150894576e-05, 2.4571227150894576e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 156000/ 173500 | consumed samples: 39936000 | consumed tokens: 81788928000 | elapsed time per iteration (s): 0.84 | learning rate: 2.457E-05 | global batch size: 256 | lm loss: 1.907119E+00 | grad norm: 0.191 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.149 | TFLOPs: 18.34 | 0: steps: 156000 loss: 1.9336 iter time (s): 0.812 samples/sec: 315.199 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 156000 | lm loss value: 1.836297E+00 | lm loss PPL: 6.273268E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 156000 to checkpoints_1b1long 0: [2022-11-27 05:12:52,402] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step156000 is begin to save! 0: [2022-11-27 05:12:52,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_01-model_00-model_states.pt... 0: [2022-11-27 05:12:52,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_01-model_00-model_states.pt. 0: [2022-11-27 05:12:52,630] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_03-model_00-model_states.pt... 0: [2022-11-27 05:12:52,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_03-model_00-model_states.pt. 0: [2022-11-27 05:12:52,708] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_04-model_00-model_states.pt... 0: [2022-11-27 05:12:52,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_04-model_00-model_states.pt. 0: [2022-11-27 05:12:52,784] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_05-model_00-model_states.pt... 0: [2022-11-27 05:12:52,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_05-model_00-model_states.pt. 
0: [2022-11-27 05:12:52,858] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_06-model_00-model_states.pt... 0: [2022-11-27 05:12:52,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_06-model_00-model_states.pt. 0: [2022-11-27 05:12:52,933] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_07-model_00-model_states.pt... 0: [2022-11-27 05:12:53,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_07-model_00-model_states.pt. 0: [2022-11-27 05:12:53,007] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_08-model_00-model_states.pt... 0: [2022-11-27 05:12:53,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_08-model_00-model_states.pt. 0: [2022-11-27 05:12:53,078] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_09-model_00-model_states.pt... 0: [2022-11-27 05:12:53,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_09-model_00-model_states.pt. 0: [2022-11-27 05:12:53,154] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_10-model_00-model_states.pt... 0: [2022-11-27 05:12:53,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_10-model_00-model_states.pt. 0: [2022-11-27 05:12:53,240] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_11-model_00-model_states.pt... 0: [2022-11-27 05:12:53,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_11-model_00-model_states.pt. 
0: [2022-11-27 05:12:53,321] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_12-model_00-model_states.pt... 0: [2022-11-27 05:12:53,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_12-model_00-model_states.pt. 0: [2022-11-27 05:12:53,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_13-model_00-model_states.pt... 0: [2022-11-27 05:12:53,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_13-model_00-model_states.pt. 0: [2022-11-27 05:12:53,472] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_14-model_00-model_states.pt... 0: [2022-11-27 05:12:53,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_14-model_00-model_states.pt. 0: [2022-11-27 05:12:53,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_15-model_00-model_states.pt... 0: [2022-11-27 05:12:53,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_15-model_00-model_states.pt. 0: [2022-11-27 05:12:53,622] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_16-model_00-model_states.pt... 0: [2022-11-27 05:12:53,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_16-model_00-model_states.pt. 0: [2022-11-27 05:12:53,697] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_17-model_00-model_states.pt... 0: [2022-11-27 05:12:53,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_17-model_00-model_states.pt. 
0: [2022-11-27 05:12:53,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_18-model_00-model_states.pt... 0: [2022-11-27 05:12:53,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_18-model_00-model_states.pt. 0: [2022-11-27 05:12:53,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_19-model_00-model_states.pt... 0: [2022-11-27 05:12:53,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_19-model_00-model_states.pt. 0: [2022-11-27 05:12:53,918] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_20-model_00-model_states.pt... 0: [2022-11-27 05:12:53,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_20-model_00-model_states.pt. 0: [2022-11-27 05:12:53,994] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_21-model_00-model_states.pt... 0: [2022-11-27 05:12:54,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_21-model_00-model_states.pt. 0: [2022-11-27 05:12:54,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_22-model_00-model_states.pt... 0: [2022-11-27 05:12:54,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_22-model_00-model_states.pt. 0: [2022-11-27 05:12:54,144] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_23-model_00-model_states.pt... 0: [2022-11-27 05:12:54,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_23-model_00-model_states.pt. 
0: [2022-11-27 05:12:54,219] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_24-model_00-model_states.pt... 0: [2022-11-27 05:12:54,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_24-model_00-model_states.pt. 0: [2022-11-27 05:12:54,293] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_25-model_00-model_states.pt... 0: [2022-11-27 05:12:54,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_25-model_00-model_states.pt. 0: [2022-11-27 05:12:54,368] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_26-model_00-model_states.pt... 0: [2022-11-27 05:12:54,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_26-model_00-model_states.pt. 0: [2022-11-27 05:12:54,442] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_27-model_00-model_states.pt... 0: [2022-11-27 05:12:54,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_27-model_00-model_states.pt. 0: [2022-11-27 05:12:54,517] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_28-model_00-model_states.pt... 0: [2022-11-27 05:12:54,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_28-model_00-model_states.pt. 0: [2022-11-27 05:12:54,592] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/layer_30-model_00-model_states.pt... 0: [2022-11-27 05:12:54,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/layer_30-model_00-model_states.pt. 
0: [2022-11-27 05:12:54,595] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step156000/mp_rank_00_model_states.pt 0: [2022-11-27 05:12:54,595] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/mp_rank_00_model_states.pt... 0: [2022-11-27 05:12:54,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/mp_rank_00_model_states.pt. 0: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
5: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 
8: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 
2: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
12: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 
28: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 
29: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 
17: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 
26: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
6: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 
8: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 
13: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
15: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 
24: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 
21: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 
19: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
7: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 
16: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 
23: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 
22: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 
4: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 
23: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 
27: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 
20: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:12:54,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:12:54,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:12:54,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 05:12:54,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 23: [2022-11-27 05:12:54,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-27 05:12:54,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 05:12:54,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 24: [2022-11-27 05:12:54,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:12:54,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 5: [2022-11-27 05:12:54,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:12:54,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 29: [2022-11-27 05:12:54,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:12:54,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 05:12:54,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 29: [2022-11-27 05:12:54,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 05:12:54,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 15: [2022-11-27 05:12:54,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-27 05:12:54,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 05:12:54,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 3: [2022-11-27 05:12:54,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:12:54,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:12:54,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 18: [2022-11-27 05:12:54,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 3: [2022-11-27 05:12:54,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 18: [2022-11-27 05:12:54,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 10: [2022-11-27 05:12:54,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:12:54,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 17: [2022-11-27 05:12:54,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
10: [2022-11-27 05:12:54,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 14: [2022-11-27 05:12:54,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 10: [2022-11-27 05:12:54,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 14: [2022-11-27 05:12:54,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 17: [2022-11-27 05:12:54,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 05:12:54,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 0: [2022-11-27 05:12:54,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:12:54,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:12:54,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
0: [2022-11-27 05:12:54,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 31: [2022-11-27 05:12:54,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 19: [2022-11-27 05:12:54,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 0: [2022-11-27 05:12:54,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 31: [2022-11-27 05:12:54,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 19: [2022-11-27 05:12:54,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 7: [2022-11-27 05:12:54,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 28: [2022-11-27 05:12:54,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:12:54,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 05:12:54,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 21: [2022-11-27 05:12:54,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:12:54,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
21: [2022-11-27 05:12:54,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 13: [2022-11-27 05:12:54,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 12: [2022-11-27 05:12:54,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 21: [2022-11-27 05:12:54,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 13: [2022-11-27 05:12:54,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 12: [2022-11-27 05:12:54,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 05:12:54,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 1: [2022-11-27 05:12:54,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:12:54,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 05:12:54,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 2: [2022-11-27 05:12:54,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-27 05:12:54,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 05:12:54,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 8: [2022-11-27 05:12:54,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:12:54,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:12:54,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 2: [2022-11-27 05:12:54,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 8: [2022-11-27 05:12:54,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 2: [2022-11-27 05:12:54,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 4: [2022-11-27 05:12:54,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:12:54,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 05:12:54,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 25: [2022-11-27 05:12:54,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
25: [2022-11-27 05:12:54,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 05:12:54,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 18: [2022-11-27 05:12:54,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:12:54,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-27 05:12:54,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 2: [2022-11-27 05:12:54,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:12:54,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 05:12:54,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 28: [2022-11-27 05:12:54,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 05:12:54,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 20: [2022-11-27 05:12:54,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:12:54,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
20: [2022-11-27 05:12:54,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 05:12:54,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 10: [2022-11-27 05:12:54,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 0: [2022-11-27 05:12:54,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:12:54,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 3: [2022-11-27 05:12:54,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:12:54,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 05:12:54,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 28: [2022-11-27 05:12:54,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:12:54,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:12:54,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-27 05:12:54,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 24: [2022-11-27 05:12:54,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 1: [2022-11-27 05:12:54,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:12:54,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 24: [2022-11-27 05:12:54,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 1: [2022-11-27 05:12:54,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 05:12:54,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 6: [2022-11-27 05:12:54,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:12:54,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 05:12:54,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 21: [2022-11-27 05:12:54,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-27 05:12:54,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 05:12:54,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 19: [2022-11-27 05:12:54,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:12:54,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:12:54,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 14: [2022-11-27 05:12:54,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 19: [2022-11-27 05:12:54,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 14: [2022-11-27 05:12:54,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 6: [2022-11-27 05:12:54,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:12:54,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 05:12:54,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 4: [2022-11-27 05:12:54,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-27 05:12:54,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 05:12:54,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 5: [2022-11-27 05:12:54,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:12:54,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:12:54,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 5: [2022-11-27 05:12:54,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 20: [2022-11-27 05:12:54,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:12:54,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 31: [2022-11-27 05:12:54,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 
5: [2022-11-27 05:12:54,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 20: [2022-11-27 05:12:54,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 5: [2022-11-27 05:12:54,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 5: [2022-11-27 05:12:54,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 20: [2022-11-27 05:12:54,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 18: [2022-11-27 05:12:54,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:12:54,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 20: [2022-11-27 05:12:54,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:12:54,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 20: [2022-11-27 05:12:54,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 05:12:54,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 8: [2022-11-27 05:12:54,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
13: [2022-11-27 05:12:54,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 22: [2022-11-27 05:12:54,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-27 05:12:54,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:12:54,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 13: [2022-11-27 05:12:54,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 22: [2022-11-27 05:12:54,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 05:12:54,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 8: [2022-11-27 05:12:54,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 13: [2022-11-27 05:12:54,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 22: [2022-11-27 05:12:54,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 22: [2022-11-27 05:12:54,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 29: [2022-11-27 05:12:54,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-27 05:12:54,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 05:12:54,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 24: [2022-11-27 05:12:54,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:12:54,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 05:12:54,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 25: [2022-11-27 05:12:54,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:12:54,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 12: [2022-11-27 05:12:54,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:12:54,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 25: [2022-11-27 05:12:54,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 12: [2022-11-27 05:12:54,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 15: [2022-11-27 05:12:54,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-27 05:12:54,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 05:12:54,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 22: [2022-11-27 05:12:54,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 05:12:54,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 05:12:54,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 0: [2022-11-27 05:12:54,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:12:54,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:12:54,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:12:54,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 23: [2022-11-27 05:12:54,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 05:12:54,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 0: [2022-11-27 05:12:54,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 
0: [2022-11-27 05:12:54,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 05:12:54,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 23: [2022-11-27 05:12:54,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:12:54,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 05:12:54,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 3: [2022-11-27 05:12:54,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:12:54,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 05:12:54,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 21: [2022-11-27 05:12:54,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-27 05:12:54,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 05:12:54,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 7: [2022-11-27 05:12:54,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-27 05:12:54,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 05:12:54,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 5: [2022-11-27 05:12:54,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:12:54,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 05:12:54,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 18: [2022-11-27 05:12:54,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:12:54,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 05:12:54,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 13: [2022-11-27 05:12:54,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:12:54,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 05:12:54,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 1: [2022-11-27 05:12:54,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
19: [2022-11-27 05:12:54,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:12:54,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 1: [2022-11-27 05:12:54,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 19: [2022-11-27 05:12:54,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 1: [2022-11-27 05:12:54,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 15: [2022-11-27 05:12:54,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:12:54,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 05:12:54,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 22: [2022-11-27 05:12:54,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-27 05:12:54,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 05:12:54,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 12: [2022-11-27 05:12:54,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
1: [2022-11-27 05:12:54,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:12:54,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 7: [2022-11-27 05:12:54,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:12:54,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 1: [2022-11-27 05:12:54,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 12: [2022-11-27 05:12:54,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 7: [2022-11-27 05:12:54,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 10: [2022-11-27 05:12:54,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:12:54,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 10: [2022-11-27 05:12:54,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 05:12:54,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 17: [2022-11-27 05:12:54,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 
17: [2022-11-27 05:12:54,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 05:12:54,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 17: [2022-11-27 05:12:54,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-27 05:12:54,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 05:12:54,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 12: [2022-11-27 05:12:54,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 20: [2022-11-27 05:12:54,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:12:54,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 20: [2022-11-27 05:12:54,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 12: [2022-11-27 05:12:54,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 20: [2022-11-27 05:12:54,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 9: [2022-11-27 05:12:54,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-27 05:12:54,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:12:54,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:12:54,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:12:54,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:12:54,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:12:54,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 05:12:54,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 05:12:54,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 10: [2022-11-27 05:12:54,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 14: [2022-11-27 05:12:54,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 9: [2022-11-27 05:12:54,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 
9: [2022-11-27 05:12:54,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 9: [2022-11-27 05:12:54,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 10: [2022-11-27 05:12:54,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 14: [2022-11-27 05:12:54,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 26: [2022-11-27 05:12:54,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:12:54,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:12:54,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:12:54,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 05:12:54,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 
26: [2022-11-27 05:12:54,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 05:12:54,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 05:12:54,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 05:12:54,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 26: [2022-11-27 05:12:54,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 26: [2022-11-27 05:12:54,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 19: [2022-11-27 05:12:54,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:12:54,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 05:12:54,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 25: [2022-11-27 05:12:54,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:12:54,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 05:12:54,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 
29: [2022-11-27 05:12:54,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:12:54,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 05:12:54,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 2: [2022-11-27 05:12:54,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:12:54,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 05:12:54,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 8: [2022-11-27 05:12:54,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:12:54,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 05:12:54,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 4: [2022-11-27 05:12:54,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:12:54,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 05:12:54,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 
21: [2022-11-27 05:12:54,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-27 05:12:54,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 05:12:54,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 29: [2022-11-27 05:12:54,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:12:54,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 0: [2022-11-27 05:12:54,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:12:54,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 0: [2022-11-27 05:12:54,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 05:12:54,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 14: [2022-11-27 05:12:54,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:12:54,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 05:12:54,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 
13: [2022-11-27 05:12:54,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:12:54,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 05:12:54,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 26: [2022-11-27 05:12:54,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:12:54,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 05:12:54,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 25: [2022-11-27 05:12:54,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:12:54,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:12:54,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 4: [2022-11-27 05:12:54,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 11: [2022-11-27 05:12:54,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-27 05:12:54,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:12:54,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:12:54,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 11: [2022-11-27 05:12:54,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:12:54,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 05:12:54,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 25: [2022-11-27 05:12:54,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 11: [2022-11-27 05:12:54,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 11: [2022-11-27 05:12:54,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 11: [2022-11-27 05:12:54,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 05:12:54,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 05:12:54,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 
11: [2022-11-27 05:12:54,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 30: [2022-11-27 05:12:54,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:12:54,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:12:54,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:12:54,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:12:54,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 17: [2022-11-27 05:12:54,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:12:54,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 6: [2022-11-27 05:12:54,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
30: [2022-11-27 05:12:54,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 05:12:54,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 6: [2022-11-27 05:12:54,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 30: [2022-11-27 05:12:54,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 6: [2022-11-27 05:12:54,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 30: [2022-11-27 05:12:54,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 17: [2022-11-27 05:12:54,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 30: [2022-11-27 05:12:54,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 30: [2022-11-27 05:12:54,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 17: [2022-11-27 05:12:54,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 23: [2022-11-27 05:12:54,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-27 05:12:54,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 05:12:54,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 31: [2022-11-27 05:12:54,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:12:54,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 24: [2022-11-27 05:12:54,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:12:54,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 24: [2022-11-27 05:12:54,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-27 05:12:54,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 5: [2022-11-27 05:12:54,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:12:54,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 05:12:54,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 16: [2022-11-27 05:12:54,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
16: [2022-11-27 05:12:54,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:12:54,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:12:54,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 05:12:54,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 05:12:54,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 05:12:54,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 16: [2022-11-27 05:12:54,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 16: [2022-11-27 05:12:54,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 3: [2022-11-27 05:12:54,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:12:54,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 05:12:54,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 6: [2022-11-27 05:12:54,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
31: [2022-11-27 05:12:54,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:12:54,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 05:12:54,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 31: [2022-11-27 05:12:54,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 05:12:54,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 7: [2022-11-27 05:12:54,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:12:54,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 05:12:54,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 17: [2022-11-27 05:12:54,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-27 05:12:54,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 05:12:54,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 27: [2022-11-27 05:12:54,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
27: [2022-11-27 05:12:54,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:12:54,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:12:54,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:12:54,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 05:12:54,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 05:12:54,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 05:12:54,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 05:12:54,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 27: [2022-11-27 05:12:54,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 27: [2022-11-27 05:12:54,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 27: [2022-11-27 05:12:54,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 
28: [2022-11-27 05:12:54,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 05:12:54,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 28: [2022-11-27 05:12:54,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-27 05:12:54,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 05:12:54,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 18: [2022-11-27 05:12:54,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:12:54,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 05:12:54,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 1: [2022-11-27 05:12:54,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:12:54,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 05:12:54,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 28: [2022-11-27 05:12:54,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-27 05:12:54,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 05:12:54,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 4: [2022-11-27 05:12:54,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:12:54,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 05:12:54,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 8: [2022-11-27 05:12:54,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:12:54,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 05:12:54,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 28: [2022-11-27 05:12:54,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:12:54,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 05:12:54,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 23: [2022-11-27 05:12:54,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
23: [2022-11-27 05:12:54,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 05:12:54,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 28: [2022-11-27 05:12:54,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 05:12:54,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 14: [2022-11-27 05:12:54,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:12:54,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 05:12:54,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 30: [2022-11-27 05:12:54,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:12:54,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 05:12:54,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 2: [2022-11-27 05:12:54,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-27 05:12:54,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 05:12:54,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 12: [2022-11-27 05:12:54,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:12:54,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:12:54,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 3: [2022-11-27 05:12:54,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 12: [2022-11-27 05:12:54,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 3: [2022-11-27 05:12:54,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 9: [2022-11-27 05:12:54,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:12:54,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 05:12:54,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 21: [2022-11-27 05:12:54,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
21: [2022-11-27 05:12:54,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 05:12:54,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 26: [2022-11-27 05:12:54,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:12:54,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-27 05:12:54,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 24: [2022-11-27 05:12:54,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:12:54,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 25: [2022-11-27 05:12:54,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:12:54,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 31: [2022-11-27 05:12:54,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:12:54,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 05:12:54,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 
31: [2022-11-27 05:12:54,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 05:12:54,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 10: [2022-11-27 05:12:54,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:12:54,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 05:12:54,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 11: [2022-11-27 05:12:54,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:12:54,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 05:12:54,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 7: [2022-11-27 05:12:54,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:12:54,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 05:12:54,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 8: [2022-11-27 05:12:54,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-27 05:12:54,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 05:12:54,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 22: [2022-11-27 05:12:54,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 05:12:54,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 05:12:54,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 16: [2022-11-27 05:12:54,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:12:54,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 05:12:54,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 20: [2022-11-27 05:12:54,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-27 05:12:54,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 05:12:54,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 6: [2022-11-27 05:12:54,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
15: [2022-11-27 05:12:54,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:12:54,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 05:12:54,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 15: [2022-11-27 05:12:54,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 05:12:54,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 18: [2022-11-27 05:12:54,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:12:54,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 05:12:54,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 13: [2022-11-27 05:12:54,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:12:54,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 0: [2022-11-27 05:12:54,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:12:54,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 
0: [2022-11-27 05:12:54,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 05:12:54,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 19: [2022-11-27 05:12:54,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:12:54,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 05:12:54,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 1: [2022-11-27 05:12:54,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:12:54,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 05:12:54,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 5: [2022-11-27 05:12:54,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:12:54,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 05:12:54,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 23: [2022-11-27 05:12:54,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-27 05:12:54,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 05:12:54,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 27: [2022-11-27 05:12:54,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:12:54,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 05:12:54,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 4: [2022-11-27 05:12:54,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:12:54,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 05:12:54,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 3: [2022-11-27 05:12:54,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:12:54,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 05:12:54,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 14: [2022-11-27 05:12:54,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-27 05:12:54,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 05:12:54,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 24: [2022-11-27 05:12:54,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:12:54,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 21: [2022-11-27 05:12:54,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-27 05:12:54,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 24: [2022-11-27 05:12:54,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 21: [2022-11-27 05:12:54,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 2: [2022-11-27 05:12:54,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:12:54,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 05:12:54,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 9: [2022-11-27 05:12:54,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-27 05:12:54,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 05:12:54,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 15: [2022-11-27 05:12:54,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:12:54,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 05:12:54,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 28: [2022-11-27 05:12:54,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-27 05:12:54,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 05:12:54,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 7: [2022-11-27 05:12:54,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:12:54,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 05:12:54,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 0: [2022-11-27 05:12:54,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
11: [2022-11-27 05:12:54,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:12:54,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 11: [2022-11-27 05:12:54,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 05:12:54,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 0: [2022-11-27 05:12:54,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 10: [2022-11-27 05:12:54,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:12:54,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:12:54,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 10: [2022-11-27 05:12:54,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 19: [2022-11-27 05:12:54,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 10: [2022-11-27 05:12:54,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 22: [2022-11-27 05:12:54,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
22: [2022-11-27 05:12:54,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 05:12:54,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 2: [2022-11-27 05:12:54,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:12:54,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 05:12:54,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 18: [2022-11-27 05:12:54,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:12:54,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:12:54,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 05:12:54,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 1: [2022-11-27 05:12:54,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:12:54,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 05:12:54,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 
1: [2022-11-27 05:12:54,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 05:12:54,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 30: [2022-11-27 05:12:54,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:12:54,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:12:54,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 05:12:54,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 30: [2022-11-27 05:12:54,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 05:12:54,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 5: [2022-11-27 05:12:54,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:12:54,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:12:54,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 05:12:54,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 
6: [2022-11-27 05:12:54,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:12:54,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 05:12:54,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 6: [2022-11-27 05:12:54,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 05:12:54,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 4: [2022-11-27 05:12:54,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:12:54,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 05:12:54,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 13: [2022-11-27 05:12:54,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:12:54,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:12:54,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 05:12:54,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 
27: [2022-11-27 05:12:54,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 05:12:54,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 26: [2022-11-27 05:12:54,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:12:54,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 05:12:54,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 25: [2022-11-27 05:12:54,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 17: [2022-11-27 05:12:54,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:12:54,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 17: [2022-11-27 05:12:54,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 05:12:54,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 25: [2022-11-27 05:12:54,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 28: [2022-11-27 05:12:54,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
29: [2022-11-27 05:12:54,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:12:54,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 05:12:54,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 29: [2022-11-27 05:12:54,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:12:54,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 05:12:54,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 31: [2022-11-27 05:12:54,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:12:54,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 05:12:54,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 12: [2022-11-27 05:12:54,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:12:54,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 05:12:54,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 
3: [2022-11-27 05:12:54,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:12:54,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 05:12:54,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 9: [2022-11-27 05:12:54,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:12:54,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 05:12:54,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 21: [2022-11-27 05:12:54,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-27 05:12:54,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 05:12:54,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 8: [2022-11-27 05:12:54,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 20: [2022-11-27 05:12:54,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
8: [2022-11-27 05:12:54,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 20: [2022-11-27 05:12:54,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 8: [2022-11-27 05:12:54,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 20: [2022-11-27 05:12:54,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 11: [2022-11-27 05:12:54,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:12:54,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 05:12:54,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 28: [2022-11-27 05:12:54,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 05:12:54,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 25: [2022-11-27 05:12:54,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:12:54,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 05:12:54,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 
12: [2022-11-27 05:12:54,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:12:54,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 05:12:54,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 26: [2022-11-27 05:12:54,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:12:54,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 05:12:54,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 18: [2022-11-27 05:12:54,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:12:54,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:12:54,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 05:12:54,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 6: [2022-11-27 05:12:54,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 05:12:54,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 
5: [2022-11-27 05:12:54,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:12:54,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 05:12:54,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 1: [2022-11-27 05:12:54,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:12:54,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 05:12:54,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 0: [2022-11-27 05:12:54,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:12:54,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 05:12:54,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 30: [2022-11-27 05:12:54,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:12:54,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 05:12:54,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 
27: [2022-11-27 05:12:54,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:12:54,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 05:12:54,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 15: [2022-11-27 05:12:54,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:12:54,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:12:54,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 14: [2022-11-27 05:12:54,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 30: [2022-11-27 05:12:54,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:12:54,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 7: [2022-11-27 05:12:54,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
30: [2022-11-27 05:12:54,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 14: [2022-11-27 05:12:54,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 7: [2022-11-27 05:12:54,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 30: [2022-11-27 05:12:54,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 7: [2022-11-27 05:12:54,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 10: [2022-11-27 05:12:54,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:12:54,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 05:12:54,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 24: [2022-11-27 05:12:54,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:12:54,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 05:12:54,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 20: [2022-11-27 05:12:54,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 
20: [2022-11-27 05:12:54,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 23: [2022-11-27 05:12:54,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:12:54,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 20: [2022-11-27 05:12:54,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 23: [2022-11-27 05:12:54,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 8: [2022-11-27 05:12:54,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 05:12:54,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 23: [2022-11-27 05:12:54,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 22: [2022-11-27 05:12:54,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 05:12:54,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:12:54,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
22: [2022-11-27 05:12:54,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 05:12:54,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 15: [2022-11-27 05:12:54,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 22: [2022-11-27 05:12:54,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 22: [2022-11-27 05:12:54,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 15: [2022-11-27 05:12:54,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 2: [2022-11-27 05:12:54,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:12:54,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:12:54,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 
16: [2022-11-27 05:12:54,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 2: [2022-11-27 05:12:54,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 19: [2022-11-27 05:12:54,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 16: [2022-11-27 05:12:54,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 16: [2022-11-27 05:12:54,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:12:54,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 16: [2022-11-27 05:12:54,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 19: [2022-11-27 05:12:54,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 16: [2022-11-27 05:12:54,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 13: [2022-11-27 05:12:54,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:12:54,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 29: [2022-11-27 05:12:54,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
13: [2022-11-27 05:12:54,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 29: [2022-11-27 05:12:54,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 05:12:54,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 24: [2022-11-27 05:12:54,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:12:54,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 05:12:54,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 25: [2022-11-27 05:12:54,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:12:54,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 05:12:54,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 27: [2022-11-27 05:12:54,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:12:54,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-27 05:12:54,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 31: [2022-11-27 05:12:54,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:12:54,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 27: [2022-11-27 05:12:54,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 4: [2022-11-27 05:12:54,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:12:54,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 27: [2022-11-27 05:12:54,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 31: [2022-11-27 05:12:54,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 4: [2022-11-27 05:12:54,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 05:12:54,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 17: [2022-11-27 05:12:54,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-27 05:12:54,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 05:12:54,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 11: [2022-11-27 05:12:54,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:12:54,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 05:12:54,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 26: [2022-11-27 05:12:54,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:12:54,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:12:54,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 3: [2022-11-27 05:12:54,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 26: [2022-11-27 05:12:54,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 3: [2022-11-27 05:12:54,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 7: [2022-11-27 05:12:54,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-27 05:12:54,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 05:12:54,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 28: [2022-11-27 05:12:54,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-27 05:12:54,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 05:12:54,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 29: [2022-11-27 05:12:54,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:12:54,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 05:12:54,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 12: [2022-11-27 05:12:54,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:12:54,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 05:12:54,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 13: [2022-11-27 05:12:54,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-27 05:12:54,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 05:12:54,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 8: [2022-11-27 05:12:54,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:12:54,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 05:12:54,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 21: [2022-11-27 05:12:54,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 05:12:54,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 05:12:54,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 20: [2022-11-27 05:12:54,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 05:12:54,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 05:12:54,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 17: [2022-11-27 05:12:54,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 
17: [2022-11-27 05:12:54,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 05:12:54,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 10: [2022-11-27 05:12:54,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:12:54,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 05:12:54,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 31: [2022-11-27 05:12:54,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:12:54,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 05:12:54,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 16: [2022-11-27 05:12:54,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:12:54,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 05:12:54,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 19: [2022-11-27 05:12:54,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
19: [2022-11-27 05:12:54,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step156000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 05:12:54,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step156000 is ready now! 0: successfully saved checkpoint at iteration 156000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2478.23 31: iteration 156010/ 173500 | consumed samples: 39938560 | consumed tokens: 81794170880 | elapsed time per iteration (s): 1.09 | learning rate: 2.457E-05 | global batch size: 256 | lm loss: 1.921967E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.764 | TFLOPs: 14.20 | 31: iteration 156020/ 173500 | consumed samples: 39941120 | consumed tokens: 81799413760 | elapsed time per iteration (s): 0.83 | learning rate: 2.456E-05 | global batch size: 256 | lm loss: 1.927389E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.815 | TFLOPs: 18.62 | 31: iteration 156030/ 173500 | consumed samples: 39943680 | consumed tokens: 81804656640 | elapsed time per iteration (s): 0.81 | learning rate: 2.456E-05 | global batch size: 256 | lm loss: 1.901775E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.204 | TFLOPs: 19.01 | 31: iteration 156040/ 173500 | consumed samples: 39946240 | consumed tokens: 81809899520 | elapsed time per iteration (s): 0.85 | learning rate: 2.455E-05 | global batch size: 256 | lm loss: 1.896161E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.783 | TFLOPs: 18.32 | 31: iteration 156050/ 173500 | consumed samples: 39948800 | consumed tokens: 81815142400 | elapsed time per iteration (s): 0.87 | learning rate: 2.455E-05 | 
global batch size: 256 | lm loss: 1.919929E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.066 | TFLOPs: 17.73 | 31: iteration 156060/ 173500 | consumed samples: 39951360 | consumed tokens: 81820385280 | elapsed time per iteration (s): 0.81 | learning rate: 2.454E-05 | global batch size: 256 | lm loss: 1.892917E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.255 | TFLOPs: 19.19 | 31: iteration 156070/ 173500 | consumed samples: 39953920 | consumed tokens: 81825628160 | elapsed time per iteration (s): 0.80 | learning rate: 2.454E-05 | global batch size: 256 | lm loss: 1.901754E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.064 | TFLOPs: 19.24 | 31: iteration 156080/ 173500 | consumed samples: 39956480 | consumed tokens: 81830871040 | elapsed time per iteration (s): 0.81 | learning rate: 2.453E-05 | global batch size: 256 | lm loss: 1.902427E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.690 | TFLOPs: 19.16 | 31: iteration 156090/ 173500 | consumed samples: 39959040 | consumed tokens: 81836113920 | elapsed time per iteration (s): 0.74 | learning rate: 2.452E-05 | global batch size: 256 | lm loss: 1.926283E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.053 | TFLOPs: 21.00 | 31: iteration 156100/ 173500 | consumed samples: 39961600 | consumed tokens: 81841356800 | elapsed time per iteration (s): 0.77 | learning rate: 2.452E-05 | global batch size: 256 | lm loss: 1.919279E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.711 | TFLOPs: 20.01 | 31: iteration 156110/ 173500 | consumed 
samples: 39964160 | consumed tokens: 81846599680 | elapsed time per iteration (s): 0.78 | learning rate: 2.451E-05 | global batch size: 256 | lm loss: 1.915100E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.962 | TFLOPs: 19.78 | 31: iteration 156120/ 173500 | consumed samples: 39966720 | consumed tokens: 81851842560 | elapsed time per iteration (s): 0.77 | learning rate: 2.451E-05 | global batch size: 256 | lm loss: 1.906985E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.637 | TFLOPs: 20.24 | 31: iteration 156130/ 173500 | consumed samples: 39969280 | consumed tokens: 81857085440 | elapsed time per iteration (s): 0.79 | learning rate: 2.450E-05 | global batch size: 256 | lm loss: 1.902964E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.635 | TFLOPs: 19.64 | 31: iteration 156140/ 173500 | consumed samples: 39971840 | consumed tokens: 81862328320 | elapsed time per iteration (s): 0.80 | learning rate: 2.450E-05 | global batch size: 256 | lm loss: 1.887745E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.100 | TFLOPs: 19.37 | 31: iteration 156150/ 173500 | consumed samples: 39974400 | consumed tokens: 81867571200 | elapsed time per iteration (s): 0.78 | learning rate: 2.449E-05 | global batch size: 256 | lm loss: 1.918853E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.497 | TFLOPs: 19.75 | 31: iteration 156160/ 173500 | consumed samples: 39976960 | consumed tokens: 81872814080 | elapsed time per iteration (s): 0.76 | learning rate: 2.449E-05 | global batch size: 256 | lm loss: 1.914427E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 336.158 | TFLOPs: 20.34 | 31: iteration 156170/ 173500 | consumed samples: 39979520 | consumed tokens: 81878056960 | elapsed time per iteration (s): 0.76 | learning rate: 2.448E-05 | global batch size: 256 | lm loss: 1.926549E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.922 | TFLOPs: 20.38 | 31: iteration 156180/ 173500 | consumed samples: 39982080 | consumed tokens: 81883299840 | elapsed time per iteration (s): 0.74 | learning rate: 2.448E-05 | global batch size: 256 | lm loss: 1.916570E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.330 | TFLOPs: 20.89 | 31: iteration 156190/ 173500 | consumed samples: 39984640 | consumed tokens: 81888542720 | elapsed time per iteration (s): 0.81 | learning rate: 2.447E-05 | global batch size: 256 | lm loss: 1.942071E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.460 | TFLOPs: 19.15 | 31: iteration 156200/ 173500 | consumed samples: 39987200 | consumed tokens: 81893785600 | elapsed time per iteration (s): 0.80 | learning rate: 2.447E-05 | global batch size: 256 | lm loss: 1.923938E+00 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.629 | TFLOPs: 19.40 | 31: iteration 156210/ 173500 | consumed samples: 39989760 | consumed tokens: 81899028480 | elapsed time per iteration (s): 0.74 | learning rate: 2.446E-05 | global batch size: 256 | lm loss: 1.915860E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.079 | TFLOPs: 20.82 | 31: iteration 156220/ 173500 | consumed samples: 39992320 | consumed tokens: 81904271360 | elapsed time per iteration (s): 0.72 | learning rate: 2.446E-05 | global 
batch size: 256 | lm loss: 1.916372E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.143 | TFLOPs: 21.42 | 31: iteration 156230/ 173500 | consumed samples: 39994880 | consumed tokens: 81909514240 | elapsed time per iteration (s): 0.71 | learning rate: 2.445E-05 | global batch size: 256 | lm loss: 1.892628E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 359.098 | TFLOPs: 21.72 | 31: iteration 156240/ 173500 | consumed samples: 39997440 | consumed tokens: 81914757120 | elapsed time per iteration (s): 0.76 | learning rate: 2.445E-05 | global batch size: 256 | lm loss: 1.957643E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.673 | TFLOPs: 20.25 | 31: iteration 156250/ 173500 | consumed samples: 40000000 | consumed tokens: 81920000000 | elapsed time per iteration (s): 0.76 | learning rate: 2.444E-05 | global batch size: 256 | lm loss: 1.921379E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.691 | TFLOPs: 20.37 | 31: iteration 156260/ 173500 | consumed samples: 40002560 | consumed tokens: 81925242880 | elapsed time per iteration (s): 0.73 | learning rate: 2.444E-05 | global batch size: 256 | lm loss: 1.908352E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.625 | TFLOPs: 21.27 | 31: iteration 156270/ 173500 | consumed samples: 40005120 | consumed tokens: 81930485760 | elapsed time per iteration (s): 0.79 | learning rate: 2.443E-05 | global batch size: 256 | lm loss: 1.917536E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.686 | TFLOPs: 19.64 | 31: iteration 156280/ 173500 | consumed samples: 
40007680 | consumed tokens: 81935728640 | elapsed time per iteration (s): 0.73 | learning rate: 2.443E-05 | global batch size: 256 | lm loss: 1.925640E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.660 | TFLOPs: 21.34 | 31: iteration 156290/ 173500 | consumed samples: 40010240 | consumed tokens: 81940971520 | elapsed time per iteration (s): 0.77 | learning rate: 2.442E-05 | global batch size: 256 | lm loss: 1.938209E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.163 | TFLOPs: 20.22 | 31: iteration 156300/ 173500 | consumed samples: 40012800 | consumed tokens: 81946214400 | elapsed time per iteration (s): 0.75 | learning rate: 2.442E-05 | global batch size: 256 | lm loss: 1.894063E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.647 | TFLOPs: 20.61 | 31: iteration 156310/ 173500 | consumed samples: 40015360 | consumed tokens: 81951457280 | elapsed time per iteration (s): 0.79 | learning rate: 2.441E-05 | global batch size: 256 | lm loss: 1.898482E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.035 | TFLOPs: 19.72 | 31: iteration 156320/ 173500 | consumed samples: 40017920 | consumed tokens: 81956700160 | elapsed time per iteration (s): 0.76 | learning rate: 2.441E-05 | global batch size: 256 | lm loss: 1.935847E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.370 | TFLOPs: 20.29 | 31: iteration 156330/ 173500 | consumed samples: 40020480 | consumed tokens: 81961943040 | elapsed time per iteration (s): 0.81 | learning rate: 2.440E-05 | global batch size: 256 | lm loss: 1.918344E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 315.525 | TFLOPs: 19.09 | 31: iteration 156340/ 173500 | consumed samples: 40023040 | consumed tokens: 81967185920 | elapsed time per iteration (s): 0.77 | learning rate: 2.440E-05 | global batch size: 256 | lm loss: 1.922547E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.371 | TFLOPs: 20.05 | 31: iteration 156350/ 173500 | consumed samples: 40025600 | consumed tokens: 81972428800 | elapsed time per iteration (s): 0.77 | learning rate: 2.439E-05 | global batch size: 256 | lm loss: 1.902145E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.231 | TFLOPs: 20.22 | 31: iteration 156360/ 173500 | consumed samples: 40028160 | consumed tokens: 81977671680 | elapsed time per iteration (s): 0.74 | learning rate: 2.439E-05 | global batch size: 256 | lm loss: 1.931088E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.460 | TFLOPs: 20.96 | 31: iteration 156370/ 173500 | consumed samples: 40030720 | consumed tokens: 81982914560 | elapsed time per iteration (s): 0.77 | learning rate: 2.438E-05 | global batch size: 256 | lm loss: 1.943603E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.404 | TFLOPs: 20.05 | 31: iteration 156380/ 173500 | consumed samples: 40033280 | consumed tokens: 81988157440 | elapsed time per iteration (s): 0.78 | learning rate: 2.438E-05 | global batch size: 256 | lm loss: 1.897417E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.992 | TFLOPs: 19.90 | 31: iteration 156390/ 173500 | consumed samples: 40035840 | consumed tokens: 81993400320 | elapsed time per iteration (s): 0.79 | learning rate: 2.437E-05 | global batch 
size: 256 | lm loss: 1.927340E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.538 | TFLOPs: 19.51 | 31: iteration 156400/ 173500 | consumed samples: 40038400 | consumed tokens: 81998643200 | elapsed time per iteration (s): 0.76 | learning rate: 2.437E-05 | global batch size: 256 | lm loss: 1.905568E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.313 | TFLOPs: 20.35 | 31: iteration 156410/ 173500 | consumed samples: 40040960 | consumed tokens: 82003886080 | elapsed time per iteration (s): 0.77 | learning rate: 2.436E-05 | global batch size: 256 | lm loss: 1.907631E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.713 | TFLOPs: 20.19 | 31: iteration 156420/ 173500 | consumed samples: 40043520 | consumed tokens: 82009128960 | elapsed time per iteration (s): 0.74 | learning rate: 2.436E-05 | global batch size: 256 | lm loss: 1.933627E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.626 | TFLOPs: 20.97 | 31: iteration 156430/ 173500 | consumed samples: 40046080 | consumed tokens: 82014371840 | elapsed time per iteration (s): 0.80 | learning rate: 2.435E-05 | global batch size: 256 | lm loss: 1.917098E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.174 | TFLOPs: 19.31 | 31: iteration 156440/ 173500 | consumed samples: 40048640 | consumed tokens: 82019614720 | elapsed time per iteration (s): 0.76 | learning rate: 2.435E-05 | global batch size: 256 | lm loss: 1.915630E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.342 | TFLOPs: 20.29 | 31: iteration 156450/ 173500 | consumed samples: 40051200 
| consumed tokens: 82024857600 | elapsed time per iteration (s): 0.89 | learning rate: 2.434E-05 | global batch size: 256 | lm loss: 1.924751E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.252 | TFLOPs: 17.44 | 31: iteration 156460/ 173500 | consumed samples: 40053760 | consumed tokens: 82030100480 | elapsed time per iteration (s): 0.78 | learning rate: 2.434E-05 | global batch size: 256 | lm loss: 1.933270E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.145 | TFLOPs: 19.91 | 31: iteration 156470/ 173500 | consumed samples: 40056320 | consumed tokens: 82035343360 | elapsed time per iteration (s): 0.77 | learning rate: 2.433E-05 | global batch size: 256 | lm loss: 1.899621E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.336 | TFLOPs: 20.17 | 31: iteration 156480/ 173500 | consumed samples: 40058880 | consumed tokens: 82040586240 | elapsed time per iteration (s): 0.84 | learning rate: 2.433E-05 | global batch size: 256 | lm loss: 1.885520E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.149 | TFLOPs: 18.52 | 31: iteration 156490/ 173500 | consumed samples: 40061440 | consumed tokens: 82045829120 | elapsed time per iteration (s): 0.84 | learning rate: 2.432E-05 | global batch size: 256 | lm loss: 1.901046E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.780 | TFLOPs: 18.38 | 31: iteration 156500/ 173500 | consumed samples: 40064000 | consumed tokens: 82051072000 | elapsed time per iteration (s): 0.79 | learning rate: 2.432E-05 | global batch size: 256 | lm loss: 1.895489E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 324.674 | TFLOPs: 19.64 | 31: iteration 156510/ 173500 | consumed samples: 40066560 | consumed tokens: 82056314880 | elapsed time per iteration (s): 0.80 | learning rate: 2.431E-05 | global batch size: 256 | lm loss: 1.924730E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.710 | TFLOPs: 19.46 | 31: iteration 156520/ 173500 | consumed samples: 40069120 | consumed tokens: 82061557760 | elapsed time per iteration (s): 0.87 | learning rate: 2.431E-05 | global batch size: 256 | lm loss: 1.915471E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.567 | TFLOPs: 17.82 | 31: iteration 156530/ 173500 | consumed samples: 40071680 | consumed tokens: 82066800640 | elapsed time per iteration (s): 0.84 | learning rate: 2.430E-05 | global batch size: 256 | lm loss: 1.916517E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.645 | TFLOPs: 18.37 | 31: iteration 156540/ 173500 | consumed samples: 40074240 | consumed tokens: 82072043520 | elapsed time per iteration (s): 0.78 | learning rate: 2.430E-05 | global batch size: 256 | lm loss: 1.932528E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.836 | TFLOPs: 19.83 | 31: iteration 156550/ 173500 | consumed samples: 40076800 | consumed tokens: 82077286400 | elapsed time per iteration (s): 0.76 | learning rate: 2.429E-05 | global batch size: 256 | lm loss: 1.883902E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.368 | TFLOPs: 20.41 | 31: iteration 156560/ 173500 | consumed samples: 40079360 | consumed tokens: 82082529280 | elapsed time per iteration (s): 0.74 | learning rate: 2.429E-05 | global batch size: 
256 | lm loss: 1.920678E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.560 | TFLOPs: 21.03 | 31: iteration 156570/ 173500 | consumed samples: 40081920 | consumed tokens: 82087772160 | elapsed time per iteration (s): 0.80 | learning rate: 2.428E-05 | global batch size: 256 | lm loss: 1.905308E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.355 | TFLOPs: 19.26 | 31: iteration 156580/ 173500 | consumed samples: 40084480 | consumed tokens: 82093015040 | elapsed time per iteration (s): 0.78 | learning rate: 2.428E-05 | global batch size: 256 | lm loss: 1.959257E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.270 | TFLOPs: 19.74 | 31: iteration 156590/ 173500 | consumed samples: 40087040 | consumed tokens: 82098257920 | elapsed time per iteration (s): 0.75 | learning rate: 2.427E-05 | global batch size: 256 | lm loss: 1.943895E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.547 | TFLOPs: 20.72 | 31: iteration 156600/ 173500 | consumed samples: 40089600 | consumed tokens: 82103500800 | elapsed time per iteration (s): 0.74 | learning rate: 2.427E-05 | global batch size: 256 | lm loss: 1.921609E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.644 | TFLOPs: 20.85 | 31: iteration 156610/ 173500 | consumed samples: 40092160 | consumed tokens: 82108743680 | elapsed time per iteration (s): 0.79 | learning rate: 2.426E-05 | global batch size: 256 | lm loss: 1.898547E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.470 | TFLOPs: 19.57 | 31: iteration 156620/ 173500 | consumed samples: 40094720 | 
consumed tokens: 82113986560 | elapsed time per iteration (s): 0.79 | learning rate: 2.426E-05 | global batch size: 256 | lm loss: 1.936146E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.524 | TFLOPs: 19.63 | 31: iteration 156630/ 173500 | consumed samples: 40097280 | consumed tokens: 82119229440 | elapsed time per iteration (s): 0.83 | learning rate: 2.425E-05 | global batch size: 256 | lm loss: 1.914840E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.906 | TFLOPs: 18.75 | 31: iteration 156640/ 173500 | consumed samples: 40099840 | consumed tokens: 82124472320 | elapsed time per iteration (s): 0.80 | learning rate: 2.425E-05 | global batch size: 256 | lm loss: 1.916731E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.487 | TFLOPs: 19.27 | 31: iteration 156650/ 173500 | consumed samples: 40102400 | consumed tokens: 82129715200 | elapsed time per iteration (s): 0.76 | learning rate: 2.424E-05 | global batch size: 256 | lm loss: 1.934015E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.238 | TFLOPs: 20.28 | 31: iteration 156660/ 173500 | consumed samples: 40104960 | consumed tokens: 82134958080 | elapsed time per iteration (s): 0.83 | learning rate: 2.424E-05 | global batch size: 256 | lm loss: 1.931754E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.222 | TFLOPs: 18.65 | 31: iteration 156670/ 173500 | consumed samples: 40107520 | consumed tokens: 82140200960 | elapsed time per iteration (s): 0.75 | learning rate: 2.423E-05 | global batch size: 256 | lm loss: 1.885380E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 343.185 | TFLOPs: 20.76 | 31: iteration 156680/ 173500 | consumed samples: 40110080 | consumed tokens: 82145443840 | elapsed time per iteration (s): 0.72 | learning rate: 2.423E-05 | global batch size: 256 | lm loss: 1.902823E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.911 | TFLOPs: 21.59 | 31: iteration 156690/ 173500 | consumed samples: 40112640 | consumed tokens: 82150686720 | elapsed time per iteration (s): 0.76 | learning rate: 2.422E-05 | global batch size: 256 | lm loss: 1.908710E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.439 | TFLOPs: 20.47 | 31: iteration 156700/ 173500 | consumed samples: 40115200 | consumed tokens: 82155929600 | elapsed time per iteration (s): 0.73 | learning rate: 2.422E-05 | global batch size: 256 | lm loss: 1.899589E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.668 | TFLOPs: 21.15 | 31: iteration 156710/ 173500 | consumed samples: 40117760 | consumed tokens: 82161172480 | elapsed time per iteration (s): 0.78 | learning rate: 2.421E-05 | global batch size: 256 | lm loss: 1.922972E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.599 | TFLOPs: 19.94 | 31: iteration 156720/ 173500 | consumed samples: 40120320 | consumed tokens: 82166415360 | elapsed time per iteration (s): 0.73 | learning rate: 2.421E-05 | global batch size: 256 | lm loss: 1.889960E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.025 | TFLOPs: 21.24 | 31: iteration 156730/ 173500 | consumed samples: 40122880 | consumed tokens: 82171658240 | elapsed time per iteration (s): 0.78 | learning rate: 2.420E-05 | global batch size: 
256 | lm loss: 1.934060E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.922 | TFLOPs: 19.96 | 31: iteration 156740/ 173500 | consumed samples: 40125440 | consumed tokens: 82176901120 | elapsed time per iteration (s): 0.74 | learning rate: 2.420E-05 | global batch size: 256 | lm loss: 1.927489E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.669 | TFLOPs: 20.91 | 31: iteration 156750/ 173500 | consumed samples: 40128000 | consumed tokens: 82182144000 | elapsed time per iteration (s): 0.78 | learning rate: 2.419E-05 | global batch size: 256 | lm loss: 1.913372E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.543 | TFLOPs: 19.82 | 31: iteration 156760/ 173500 | consumed samples: 40130560 | consumed tokens: 82187386880 | elapsed time per iteration (s): 0.76 | learning rate: 2.419E-05 | global batch size: 256 | lm loss: 1.888599E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.442 | TFLOPs: 20.41 | 31: iteration 156770/ 173500 | consumed samples: 40133120 | consumed tokens: 82192629760 | elapsed time per iteration (s): 0.74 | learning rate: 2.418E-05 | global batch size: 256 | lm loss: 1.913301E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.554 | TFLOPs: 20.91 | 31: iteration 156780/ 173500 | consumed samples: 40135680 | consumed tokens: 82197872640 | elapsed time per iteration (s): 0.72 | learning rate: 2.418E-05 | global batch size: 256 | lm loss: 1.915923E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.931 | TFLOPs: 21.53 | 31: iteration 156790/ 173500 | consumed samples: 40138240 | 
consumed tokens: 82203115520 | elapsed time per iteration (s): 0.76 | learning rate: 2.417E-05 | global batch size: 256 | lm loss: 1.909034E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.462 | TFLOPs: 20.29 | 31: iteration 156800/ 173500 | consumed samples: 40140800 | consumed tokens: 82208358400 | elapsed time per iteration (s): 0.75 | learning rate: 2.417E-05 | global batch size: 256 | lm loss: 1.926198E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.800 | TFLOPs: 20.56 | 31: iteration 156810/ 173500 | consumed samples: 40143360 | consumed tokens: 82213601280 | elapsed time per iteration (s): 0.78 | learning rate: 2.416E-05 | global batch size: 256 | lm loss: 1.921026E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.022 | TFLOPs: 19.78 | 31: iteration 156820/ 173500 | consumed samples: 40145920 | consumed tokens: 82218844160 | elapsed time per iteration (s): 0.74 | learning rate: 2.416E-05 | global batch size: 256 | lm loss: 1.901938E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.784 | TFLOPs: 20.92 | 31: iteration 156830/ 173500 | consumed samples: 40148480 | consumed tokens: 82224087040 | elapsed time per iteration (s): 0.81 | learning rate: 2.415E-05 | global batch size: 256 | lm loss: 1.913306E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.519 | TFLOPs: 19.09 | 31: iteration 156840/ 173500 | consumed samples: 40151040 | consumed tokens: 82229329920 | elapsed time per iteration (s): 0.83 | learning rate: 2.415E-05 | global batch size: 256 | lm loss: 1.936532E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 310.293 | TFLOPs: 18.77 | 31: iteration 156850/ 173500 | consumed samples: 40153600 | consumed tokens: 82234572800 | elapsed time per iteration (s): 0.85 | learning rate: 2.414E-05 | global batch size: 256 | lm loss: 1.917669E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.355 | TFLOPs: 18.23 | 31: iteration 156860/ 173500 | consumed samples: 40156160 | consumed tokens: 82239815680 | elapsed time per iteration (s): 1.20 | learning rate: 2.414E-05 | global batch size: 256 | lm loss: 1.912150E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 213.566 | TFLOPs: 12.92 | 31: iteration 156870/ 173500 | consumed samples: 40158720 | consumed tokens: 82245058560 | elapsed time per iteration (s): 0.86 | learning rate: 2.413E-05 | global batch size: 256 | lm loss: 1.914639E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.343 | TFLOPs: 17.93 | 31: iteration 156880/ 173500 | consumed samples: 40161280 | consumed tokens: 82250301440 | elapsed time per iteration (s): 0.86 | learning rate: 2.413E-05 | global batch size: 256 | lm loss: 1.925277E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.885 | TFLOPs: 18.08 | 31: iteration 156890/ 173500 | consumed samples: 40163840 | consumed tokens: 82255544320 | elapsed time per iteration (s): 0.87 | learning rate: 2.412E-05 | global batch size: 256 | lm loss: 1.917426E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.773 | TFLOPs: 17.71 | 31: iteration 156900/ 173500 | consumed samples: 40166400 | consumed tokens: 82260787200 | elapsed time per iteration (s): 0.87 | learning rate: 2.412E-05 | global batch size: 
256 | lm loss: 1.925936E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.725 | TFLOPs: 17.89 | 31: iteration 156910/ 173500 | consumed samples: 40168960 | consumed tokens: 82266030080 | elapsed time per iteration (s): 0.84 | learning rate: 2.411E-05 | global batch size: 256 | lm loss: 1.896560E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.299 | TFLOPs: 18.41 | 31: iteration 156920/ 173500 | consumed samples: 40171520 | consumed tokens: 82271272960 | elapsed time per iteration (s): 0.77 | learning rate: 2.411E-05 | global batch size: 256 | lm loss: 1.916105E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.311 | TFLOPs: 20.10 | 31: iteration 156930/ 173500 | consumed samples: 40174080 | consumed tokens: 82276515840 | elapsed time per iteration (s): 0.82 | learning rate: 2.410E-05 | global batch size: 256 | lm loss: 1.929575E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.406 | TFLOPs: 18.84 | 31: iteration 156940/ 173500 | consumed samples: 40176640 | consumed tokens: 82281758720 | elapsed time per iteration (s): 0.79 | learning rate: 2.410E-05 | global batch size: 256 | lm loss: 1.920844E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.698 | TFLOPs: 19.58 | 31: iteration 156950/ 173500 | consumed samples: 40179200 | consumed tokens: 82287001600 | elapsed time per iteration (s): 0.78 | learning rate: 2.409E-05 | global batch size: 256 | lm loss: 1.898681E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.502 | TFLOPs: 19.93 | 31: iteration 156960/ 173500 | consumed samples: 40181760 | 
consumed tokens: 82292244480 | elapsed time per iteration (s): 0.76 | learning rate: 2.409E-05 | global batch size: 256 | lm loss: 1.890163E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.918 | TFLOPs: 20.32 | 31: iteration 156970/ 173500 | consumed samples: 40184320 | consumed tokens: 82297487360 | elapsed time per iteration (s): 0.82 | learning rate: 2.408E-05 | global batch size: 256 | lm loss: 1.924441E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.476 | TFLOPs: 18.84 | 31: iteration 156980/ 173500 | consumed samples: 40186880 | consumed tokens: 82302730240 | elapsed time per iteration (s): 0.80 | learning rate: 2.408E-05 | global batch size: 256 | lm loss: 1.899951E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.463 | TFLOPs: 19.39 | 31: iteration 156990/ 173500 | consumed samples: 40189440 | consumed tokens: 82307973120 | elapsed time per iteration (s): 0.83 | learning rate: 2.407E-05 | global batch size: 256 | lm loss: 1.910546E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.620 | TFLOPs: 18.61 | 31: iteration 157000/ 173500 | consumed samples: 40192000 | consumed tokens: 82313216000 | elapsed time per iteration (s): 0.77 | learning rate: 2.407E-05 | global batch size: 256 | lm loss: 1.887090E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.585 | TFLOPs: 20.24 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 157000 | lm loss value: 1.938900E+00 | lm loss PPL: 6.951104E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving 
checkpoint at iteration 157000 to checkpoints_1b1long 0: [2022-11-27 05:26:04,087] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step157000 is begin to save! 0: [2022-11-27 05:26:04,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_01-model_00-model_states.pt... 0: [2022-11-27 05:26:04,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_01-model_00-model_states.pt. 0: [2022-11-27 05:26:04,312] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_03-model_00-model_states.pt... 0: [2022-11-27 05:26:04,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_03-model_00-model_states.pt. 0: [2022-11-27 05:26:04,392] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_04-model_00-model_states.pt... 0: [2022-11-27 05:26:04,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_04-model_00-model_states.pt. 0: [2022-11-27 05:26:04,477] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_05-model_00-model_states.pt... 0: [2022-11-27 05:26:04,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_05-model_00-model_states.pt. 0: [2022-11-27 05:26:04,556] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_06-model_00-model_states.pt... 0: [2022-11-27 05:26:04,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_06-model_00-model_states.pt. 0: [2022-11-27 05:26:04,634] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_07-model_00-model_states.pt... 
0: [2022-11-27 05:26:04,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_07-model_00-model_states.pt. 0: [2022-11-27 05:26:04,710] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_08-model_00-model_states.pt... 0: [2022-11-27 05:26:04,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_08-model_00-model_states.pt. 0: [2022-11-27 05:26:04,784] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_09-model_00-model_states.pt... 0: [2022-11-27 05:26:04,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_09-model_00-model_states.pt. 0: [2022-11-27 05:26:04,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_10-model_00-model_states.pt... 0: [2022-11-27 05:26:04,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_10-model_00-model_states.pt. 0: [2022-11-27 05:26:04,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_11-model_00-model_states.pt... 0: [2022-11-27 05:26:05,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_11-model_00-model_states.pt. 0: [2022-11-27 05:26:05,009] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_12-model_00-model_states.pt... 0: [2022-11-27 05:26:05,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_12-model_00-model_states.pt. 0: [2022-11-27 05:26:05,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 05:26:05,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_13-model_00-model_states.pt. 0: [2022-11-27 05:26:05,159] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_14-model_00-model_states.pt... 0: [2022-11-27 05:26:05,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_14-model_00-model_states.pt. 0: [2022-11-27 05:26:05,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_15-model_00-model_states.pt... 0: [2022-11-27 05:26:05,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_15-model_00-model_states.pt. 0: [2022-11-27 05:26:05,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_16-model_00-model_states.pt... 0: [2022-11-27 05:26:05,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_16-model_00-model_states.pt. 0: [2022-11-27 05:26:05,382] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_17-model_00-model_states.pt... 0: [2022-11-27 05:26:05,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_17-model_00-model_states.pt. 0: [2022-11-27 05:26:05,455] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_18-model_00-model_states.pt... 0: [2022-11-27 05:26:05,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_18-model_00-model_states.pt. 0: [2022-11-27 05:26:05,531] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 05:26:05,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_19-model_00-model_states.pt. 0: [2022-11-27 05:26:05,604] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_20-model_00-model_states.pt... 0: [2022-11-27 05:26:05,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_20-model_00-model_states.pt. 0: [2022-11-27 05:26:05,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_21-model_00-model_states.pt... 0: [2022-11-27 05:26:05,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_21-model_00-model_states.pt. 0: [2022-11-27 05:26:05,752] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_22-model_00-model_states.pt... 0: [2022-11-27 05:26:05,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_22-model_00-model_states.pt. 0: [2022-11-27 05:26:05,829] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_23-model_00-model_states.pt... 0: [2022-11-27 05:26:05,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_23-model_00-model_states.pt. 0: [2022-11-27 05:26:05,904] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_24-model_00-model_states.pt... 0: [2022-11-27 05:26:05,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_24-model_00-model_states.pt. 0: [2022-11-27 05:26:05,977] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 05:26:06,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_25-model_00-model_states.pt. 0: [2022-11-27 05:26:06,050] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_26-model_00-model_states.pt... 0: [2022-11-27 05:26:06,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_26-model_00-model_states.pt. 0: [2022-11-27 05:26:06,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_27-model_00-model_states.pt... 0: [2022-11-27 05:26:06,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_27-model_00-model_states.pt. 0: [2022-11-27 05:26:06,199] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_28-model_00-model_states.pt... 0: [2022-11-27 05:26:06,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_28-model_00-model_states.pt. 0: [2022-11-27 05:26:06,273] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/layer_30-model_00-model_states.pt... 0: [2022-11-27 05:26:06,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/layer_30-model_00-model_states.pt. 0: [2022-11-27 05:26:06,275] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step157000/mp_rank_00_model_states.pt 0: [2022-11-27 05:26:06,275] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/mp_rank_00_model_states.pt... 0: [2022-11-27 05:26:06,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/mp_rank_00_model_states.pt. 
0: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 16: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
12: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
4: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 
16: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
3: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 
20: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 
28: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 
22: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 
18: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 
27: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
7: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
16: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 
25: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 
24: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 
22: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 
18: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
5: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
16: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
11: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 
30: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 
0: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 
13: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 
30: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 
21: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:26:06,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:26:06,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:26:06,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 05:26:06,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 10: [2022-11-27 05:26:06,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:26:06,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 05:26:06,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 16: [2022-11-27 05:26:06,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
16: [2022-11-27 05:26:06,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 05:26:06,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 23: [2022-11-27 05:26:06,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:26:06,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 05:26:06,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 1: [2022-11-27 05:26:06,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:26:06,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 05:26:06,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 0: [2022-11-27 05:26:06,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:26:06,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 05:26:06,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 0: [2022-11-27 05:26:06,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-27 05:26:06,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 05:26:06,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 4: [2022-11-27 05:26:06,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:26:06,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:26:06,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:26:06,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:26:06,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
4: [2022-11-27 05:26:06,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 15: [2022-11-27 05:26:06,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 29: [2022-11-27 05:26:06,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 05:26:06,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 4: [2022-11-27 05:26:06,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 15: [2022-11-27 05:26:06,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 29: [2022-11-27 05:26:06,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 29: [2022-11-27 05:26:06,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 27: [2022-11-27 05:26:06,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 05:26:06,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 6: [2022-11-27 05:26:06,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-27 05:26:06,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 05:26:06,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 25: [2022-11-27 05:26:06,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:26:06,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:26:06,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 25: [2022-11-27 05:26:06,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 05:26:06,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 25: [2022-11-27 05:26:06,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:26:06,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:26:06,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 25: [2022-11-27 05:26:06,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 05:26:06,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 
10: [2022-11-27 05:26:06,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:26:06,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 10: [2022-11-27 05:26:06,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 7: [2022-11-27 05:26:06,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 10: [2022-11-27 05:26:06,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 24: [2022-11-27 05:26:06,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:26:06,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:26:06,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:26:06,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 8: [2022-11-27 05:26:06,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 2: [2022-11-27 05:26:06,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
8: [2022-11-27 05:26:06,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 2: [2022-11-27 05:26:06,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 05:26:06,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 24: [2022-11-27 05:26:06,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 24: [2022-11-27 05:26:06,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-27 05:26:06,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 7: [2022-11-27 05:26:06,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:26:06,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 05:26:06,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 1: [2022-11-27 05:26:06,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:26:06,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 05:26:06,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 
4: [2022-11-27 05:26:06,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:26:06,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 30: [2022-11-27 05:26:06,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:26:06,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 4: [2022-11-27 05:26:06,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 30: [2022-11-27 05:26:06,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 27: [2022-11-27 05:26:06,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:26:06,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 05:26:06,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 3: [2022-11-27 05:26:06,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:26:06,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-27 05:26:06,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 05:26:06,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 05:26:06,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 3: [2022-11-27 05:26:06,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 15: [2022-11-27 05:26:06,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 20: [2022-11-27 05:26:06,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:26:06,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:26:06,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 18: [2022-11-27 05:26:06,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 15: [2022-11-27 05:26:06,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 18: [2022-11-27 05:26:06,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 
20: [2022-11-27 05:26:06,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 05:26:06,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 6: [2022-11-27 05:26:06,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:26:06,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:26:06,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 8: [2022-11-27 05:26:06,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 6: [2022-11-27 05:26:06,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 8: [2022-11-27 05:26:06,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 0: [2022-11-27 05:26:06,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:26:06,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 05:26:06,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 18: [2022-11-27 05:26:06,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-27 05:26:06,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 05:26:06,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 23: [2022-11-27 05:26:06,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:26:06,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:26:06,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 16: [2022-11-27 05:26:06,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 12: [2022-11-27 05:26:06,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:26:06,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 16: [2022-11-27 05:26:06,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 12: [2022-11-27 05:26:06,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 29: [2022-11-27 05:26:06,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:26:06,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 
12: [2022-11-27 05:26:06,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:26:06,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 12: [2022-11-27 05:26:06,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 05:26:06,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 29: [2022-11-27 05:26:06,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 16: [2022-11-27 05:26:06,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:26:06,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 05:26:06,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 24: [2022-11-27 05:26:06,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 17: [2022-11-27 05:26:06,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:26:06,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 05:26:06,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 
17: [2022-11-27 05:26:06,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 05:26:06,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 17: [2022-11-27 05:26:06,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-27 05:26:06,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 05:26:06,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 17: [2022-11-27 05:26:06,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-27 05:26:06,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 05:26:06,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 10: [2022-11-27 05:26:06,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:26:06,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
10: [2022-11-27 05:26:06,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 23: [2022-11-27 05:26:06,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 10: [2022-11-27 05:26:06,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 23: [2022-11-27 05:26:06,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 20: [2022-11-27 05:26:06,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-27 05:26:06,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 05:26:06,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 30: [2022-11-27 05:26:06,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:26:06,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 05:26:06,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 25: [2022-11-27 05:26:06,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
25: [2022-11-27 05:26:06,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 30: [2022-11-27 05:26:06,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:26:06,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 21: [2022-11-27 05:26:06,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 05:26:06,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 05:26:06,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-27 05:26:06,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 21: [2022-11-27 05:26:06,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 30: [2022-11-27 05:26:06,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 21: [2022-11-27 05:26:06,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 30: [2022-11-27 05:26:06,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 24: [2022-11-27 05:26:06,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
24: [2022-11-27 05:26:06,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 21: [2022-11-27 05:26:06,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:26:06,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 21: [2022-11-27 05:26:06,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 17: [2022-11-27 05:26:06,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 21: [2022-11-27 05:26:06,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 17: [2022-11-27 05:26:06,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 05:26:06,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 3: [2022-11-27 05:26:06,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:26:06,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 05:26:06,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 15: [2022-11-27 05:26:06,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
5: [2022-11-27 05:26:06,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:26:06,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 05:26:06,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 15: [2022-11-27 05:26:06,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 05:26:06,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 0: [2022-11-27 05:26:06,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:26:06,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 05:26:06,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 5: [2022-11-27 05:26:06,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:26:06,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 05:26:06,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 28: [2022-11-27 05:26:06,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
27: [2022-11-27 05:26:06,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:26:06,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 05:26:06,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 28: [2022-11-27 05:26:06,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 1: [2022-11-27 05:26:06,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 28: [2022-11-27 05:26:06,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 1: [2022-11-27 05:26:06,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 28: [2022-11-27 05:26:06,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:26:06,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 28: [2022-11-27 05:26:06,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 05:26:06,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 18: [2022-11-27 05:26:06,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
5: [2022-11-27 05:26:06,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:26:06,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 5: [2022-11-27 05:26:06,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 18: [2022-11-27 05:26:06,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 5: [2022-11-27 05:26:06,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 30: [2022-11-27 05:26:06,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:26:06,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 16: [2022-11-27 05:26:06,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:26:06,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 16: [2022-11-27 05:26:06,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 05:26:06,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 6: [2022-11-27 05:26:06,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
10: [2022-11-27 05:26:06,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:26:06,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 10: [2022-11-27 05:26:06,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 6: [2022-11-27 05:26:06,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 5: [2022-11-27 05:26:06,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:26:06,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 5: [2022-11-27 05:26:06,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 05:26:06,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 2: [2022-11-27 05:26:06,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:26:06,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 05:26:06,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 8: [2022-11-27 05:26:06,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-27 05:26:06,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:26:06,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:26:06,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 05:26:06,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 2: [2022-11-27 05:26:06,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 8: [2022-11-27 05:26:06,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 8: [2022-11-27 05:26:06,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 2: [2022-11-27 05:26:06,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 4: [2022-11-27 05:26:06,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:26:06,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 05:26:06,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 3: [2022-11-27 05:26:06,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-27 05:26:06,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 05:26:06,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 25: [2022-11-27 05:26:06,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:26:06,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 28: [2022-11-27 05:26:06,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:26:06,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 18: [2022-11-27 05:26:06,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:26:06,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 05:26:06,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 11: [2022-11-27 05:26:06,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:26:06,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-27 05:26:06,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 05:26:06,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 05:26:06,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 11: [2022-11-27 05:26:06,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:26:06,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 4: [2022-11-27 05:26:06,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:26:06,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:26:06,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 4: [2022-11-27 05:26:06,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 23: [2022-11-27 05:26:06,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 11: [2022-11-27 05:26:06,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 4: [2022-11-27 05:26:06,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 
23: [2022-11-27 05:26:06,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 7: [2022-11-27 05:26:06,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:26:06,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 05:26:06,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 20: [2022-11-27 05:26:06,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-27 05:26:06,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 05:26:06,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 12: [2022-11-27 05:26:06,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:26:06,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 05:26:06,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 6: [2022-11-27 05:26:06,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 21: [2022-11-27 05:26:06,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 
6: [2022-11-27 05:26:06,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 05:26:06,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 21: [2022-11-27 05:26:06,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 05:26:06,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 28: [2022-11-27 05:26:06,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 05:26:06,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 19: [2022-11-27 05:26:06,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:26:06,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 05:26:06,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 7: [2022-11-27 05:26:06,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:26:06,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 29: [2022-11-27 05:26:06,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
7: [2022-11-27 05:26:06,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 29: [2022-11-27 05:26:06,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 05:26:06,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 20: [2022-11-27 05:26:06,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-27 05:26:06,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 05:26:06,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 1: [2022-11-27 05:26:06,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:26:06,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:26:06,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 19: [2022-11-27 05:26:06,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 1: [2022-11-27 05:26:06,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 19: [2022-11-27 05:26:06,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 
9: [2022-11-27 05:26:06,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:26:06,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:26:06,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:26:06,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 05:26:06,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 05:26:06,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 9: [2022-11-27 05:26:06,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 05:26:06,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 9: [2022-11-27 05:26:06,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 26: [2022-11-27 05:26:06,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:26:06,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
27: [2022-11-27 05:26:06,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 15: [2022-11-27 05:26:06,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:26:06,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 27: [2022-11-27 05:26:06,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 15: [2022-11-27 05:26:06,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 26: [2022-11-27 05:26:06,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 05:26:06,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 26: [2022-11-27 05:26:06,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:26:06,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 05:26:06,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 26: [2022-11-27 05:26:06,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
26: [2022-11-27 05:26:06,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 05:26:06,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 19: [2022-11-27 05:26:06,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:26:06,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 05:26:06,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 31: [2022-11-27 05:26:06,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:26:06,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:26:06,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:26:06,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 
31: [2022-11-27 05:26:06,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 05:26:06,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 05:26:06,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 05:26:06,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 05:26:06,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 1: [2022-11-27 05:26:06,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:26:06,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 31: [2022-11-27 05:26:06,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 1: [2022-11-27 05:26:06,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 31: [2022-11-27 05:26:06,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 1: [2022-11-27 05:26:06,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 20: [2022-11-27 05:26:06,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 
20: [2022-11-27 05:26:06,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 05:26:06,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 5: [2022-11-27 05:26:06,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:26:06,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 05:26:06,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 13: [2022-11-27 05:26:06,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:26:06,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:26:06,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:26:06,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-27 05:26:06,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 05:26:06,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 05:26:06,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 05:26:06,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 05:26:06,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 13: [2022-11-27 05:26:06,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 13: [2022-11-27 05:26:06,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 13: [2022-11-27 05:26:06,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 29: [2022-11-27 05:26:06,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:26:06,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 05:26:06,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 22: [2022-11-27 05:26:06,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
22: [2022-11-27 05:26:06,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 05:26:06,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 22: [2022-11-27 05:26:06,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-27 05:26:06,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 05:26:06,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 22: [2022-11-27 05:26:06,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 05:26:06,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 05:26:06,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 22: [2022-11-27 05:26:06,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 05:26:06,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 05:26:06,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 14: [2022-11-27 05:26:06,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-27 05:26:06,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:26:06,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:26:06,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:26:06,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 05:26:06,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 05:26:06,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 14: [2022-11-27 05:26:06,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 14: [2022-11-27 05:26:06,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 05:26:06,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 05:26:06,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 14: [2022-11-27 05:26:06,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 16: [2022-11-27 05:26:06,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
16: [2022-11-27 05:26:06,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 05:26:06,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 0: [2022-11-27 05:26:06,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:26:06,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 05:26:06,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 18: [2022-11-27 05:26:06,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:26:06,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 05:26:06,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 17: [2022-11-27 05:26:06,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-27 05:26:06,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 05:26:06,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 25: [2022-11-27 05:26:06,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
25: [2022-11-27 05:26:06,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 05:26:06,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 10: [2022-11-27 05:26:06,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:26:06,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 05:26:06,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 30: [2022-11-27 05:26:06,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:26:06,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 05:26:06,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 24: [2022-11-27 05:26:06,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:26:06,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 05:26:06,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 13: [2022-11-27 05:26:06,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
12: [2022-11-27 05:26:06,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:26:06,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 12: [2022-11-27 05:26:06,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 13: [2022-11-27 05:26:06,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 12: [2022-11-27 05:26:06,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 2: [2022-11-27 05:26:06,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:26:06,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:26:06,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 05:26:06,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 23: [2022-11-27 05:26:06,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 4: [2022-11-27 05:26:06,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:26:06,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 
4: [2022-11-27 05:26:06,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 05:26:06,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 7: [2022-11-27 05:26:06,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:26:06,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 05:26:06,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 8: [2022-11-27 05:26:06,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:26:06,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 05:26:06,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 3: [2022-11-27 05:26:06,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:26:06,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 05:26:06,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 31: [2022-11-27 05:26:06,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-27 05:26:06,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 05:26:06,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 27: [2022-11-27 05:26:06,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:26:06,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 05:26:06,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 14: [2022-11-27 05:26:06,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:26:06,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 05:26:06,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 15: [2022-11-27 05:26:06,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:26:06,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 05:26:06,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 6: [2022-11-27 05:26:06,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-27 05:26:06,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 05:26:06,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 22: [2022-11-27 05:26:06,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-27 05:26:06,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 05:26:06,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 28: [2022-11-27 05:26:06,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:26:06,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 28: [2022-11-27 05:26:06,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 11: [2022-11-27 05:26:06,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 05:26:06,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 28: [2022-11-27 05:26:06,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 21: [2022-11-27 05:26:06,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-27 05:26:06,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 05:26:06,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 26: [2022-11-27 05:26:06,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:26:06,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 05:26:06,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 5: [2022-11-27 05:26:06,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:26:06,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 05:26:06,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 19: [2022-11-27 05:26:06,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:26:06,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 05:26:06,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 0: [2022-11-27 05:26:06,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-27 05:26:06,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 05:26:06,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 1: [2022-11-27 05:26:06,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:26:06,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 05:26:06,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 20: [2022-11-27 05:26:06,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-27 05:26:06,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 05:26:06,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 29: [2022-11-27 05:26:06,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:26:06,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 05:26:06,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 17: [2022-11-27 05:26:06,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 
17: [2022-11-27 05:26:06,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 05:26:06,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 9: [2022-11-27 05:26:06,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:26:06,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 05:26:06,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 10: [2022-11-27 05:26:06,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:26:06,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 05:26:06,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 4: [2022-11-27 05:26:06,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:26:06,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 05:26:06,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 25: [2022-11-27 05:26:06,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
18: [2022-11-27 05:26:06,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:26:06,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 18: [2022-11-27 05:26:06,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 05:26:06,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 25: [2022-11-27 05:26:06,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 2: [2022-11-27 05:26:06,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:26:06,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 05:26:06,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 7: [2022-11-27 05:26:06,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:26:06,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 12: [2022-11-27 05:26:06,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:26:06,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 
12: [2022-11-27 05:26:06,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 05:26:06,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 23: [2022-11-27 05:26:06,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:26:06,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 05:26:06,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 13: [2022-11-27 05:26:06,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:26:06,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 05:26:06,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 30: [2022-11-27 05:26:06,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:26:06,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 05:26:06,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 6: [2022-11-27 05:26:06,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-27 05:26:06,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 05:26:06,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 15: [2022-11-27 05:26:06,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:26:06,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 05:26:06,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 31: [2022-11-27 05:26:06,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:26:06,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 05:26:06,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 22: [2022-11-27 05:26:06,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-27 05:26:06,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 05:26:06,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 8: [2022-11-27 05:26:06,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-27 05:26:06,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 05:26:06,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 27: [2022-11-27 05:26:06,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:26:06,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 05:26:06,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 14: [2022-11-27 05:26:06,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:26:06,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 05:26:06,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 21: [2022-11-27 05:26:06,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-27 05:26:06,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 05:26:06,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 28: [2022-11-27 05:26:06,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-27 05:26:06,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 05:26:06,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 11: [2022-11-27 05:26:06,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:26:06,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 05:26:06,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 19: [2022-11-27 05:26:06,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:26:06,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:26:06,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 05:26:06,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 5: [2022-11-27 05:26:06,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 05:26:06,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 9: [2022-11-27 05:26:06,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-27 05:26:06,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 05:26:06,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 3: [2022-11-27 05:26:06,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:26:06,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 05:26:06,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 20: [2022-11-27 05:26:06,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:26:06,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 20: [2022-11-27 05:26:06,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 29: [2022-11-27 05:26:06,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 20: [2022-11-27 05:26:06,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 29: [2022-11-27 05:26:06,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 1: [2022-11-27 05:26:06,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-27 05:26:06,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 05:26:06,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 0: [2022-11-27 05:26:06,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 17: [2022-11-27 05:26:06,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-27 05:26:06,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 16: [2022-11-27 05:26:06,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:26:06,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 17: [2022-11-27 05:26:06,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 16: [2022-11-27 05:26:06,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 05:26:06,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 05:26:06,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 16: [2022-11-27 05:26:06,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 
25: [2022-11-27 05:26:06,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:26:06,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 05:26:06,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 0: [2022-11-27 05:26:06,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 05:26:06,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 13: [2022-11-27 05:26:06,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:26:06,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 05:26:06,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 15: [2022-11-27 05:26:06,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:26:06,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
15: [2022-11-27 05:26:06,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 24: [2022-11-27 05:26:06,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 4: [2022-11-27 05:26:06,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:26:06,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:26:06,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 18: [2022-11-27 05:26:06,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:26:06,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 4: [2022-11-27 05:26:06,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 24: [2022-11-27 05:26:06,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 18: [2022-11-27 05:26:06,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 4: [2022-11-27 05:26:06,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 18: [2022-11-27 05:26:06,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 
24: [2022-11-27 05:26:06,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 12: [2022-11-27 05:26:06,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:26:06,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 05:26:06,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 2: [2022-11-27 05:26:06,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:26:06,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 05:26:06,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 30: [2022-11-27 05:26:06,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:26:06,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 05:26:06,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 10: [2022-11-27 05:26:06,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-27 05:26:06,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 05:26:06,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 31: [2022-11-27 05:26:06,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:26:06,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 05:26:06,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 23: [2022-11-27 05:26:06,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:26:06,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 05:26:06,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 6: [2022-11-27 05:26:06,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:26:06,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 05:26:06,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 5: [2022-11-27 05:26:06,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-27 05:26:06,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 05:26:06,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 21: [2022-11-27 05:26:06,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-27 05:26:06,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 05:26:06,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 8: [2022-11-27 05:26:06,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:26:06,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:26:06,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 14: [2022-11-27 05:26:06,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 05:26:06,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 8: [2022-11-27 05:26:06,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 27: [2022-11-27 05:26:06,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
27: [2022-11-27 05:26:06,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 05:26:06,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 17: [2022-11-27 05:26:06,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-27 05:26:06,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 05:26:06,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 12: [2022-11-27 05:26:06,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:26:06,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:26:06,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 30: [2022-11-27 05:26:06,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 05:26:06,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 12: [2022-11-27 05:26:06,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 4: [2022-11-27 05:26:06,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-27 05:26:06,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 05:26:06,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 0: [2022-11-27 05:26:06,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:26:06,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:26:06,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 0: [2022-11-27 05:26:06,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 16: [2022-11-27 05:26:06,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:26:06,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 0: [2022-11-27 05:26:06,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 16: [2022-11-27 05:26:06,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 05:26:06,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 25: [2022-11-27 05:26:06,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
2: [2022-11-27 05:26:06,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:26:06,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 05:26:06,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 2: [2022-11-27 05:26:06,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 13: [2022-11-27 05:26:06,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:26:06,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 2: [2022-11-27 05:26:06,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 13: [2022-11-27 05:26:06,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 7: [2022-11-27 05:26:06,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:26:06,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 05:26:06,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 24: [2022-11-27 05:26:06,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
24: [2022-11-27 05:26:06,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 05:26:06,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 1: [2022-11-27 05:26:06,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:26:06,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:26:06,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 05:26:06,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 11: [2022-11-27 05:26:06,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:26:06,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 11: [2022-11-27 05:26:06,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 18: [2022-11-27 05:26:06,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:26:06,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 11: [2022-11-27 05:26:06,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 
18: [2022-11-27 05:26:06,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 05:26:06,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 31: [2022-11-27 05:26:06,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:26:06,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 23: [2022-11-27 05:26:06,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:26:06,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 23: [2022-11-27 05:26:06,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 05:26:06,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 27: [2022-11-27 05:26:06,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:26:06,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 05:26:06,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 14: [2022-11-27 05:26:06,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-27 05:26:06,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 05:26:06,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 20: [2022-11-27 05:26:06,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:26:06,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 20: [2022-11-27 05:26:06,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 10: [2022-11-27 05:26:06,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 05:26:06,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 20: [2022-11-27 05:26:06,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 6: [2022-11-27 05:26:06,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:26:06,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 29: [2022-11-27 05:26:06,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:26:06,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 
29: [2022-11-27 05:26:06,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 05:26:06,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 19: [2022-11-27 05:26:06,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:26:06,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 05:26:06,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 21: [2022-11-27 05:26:06,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 05:26:06,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 05:26:06,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 28: [2022-11-27 05:26:06,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-27 05:26:06,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
28: [2022-11-27 05:26:06,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 05:26:06,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 05:26:06,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 28: [2022-11-27 05:26:06,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 22: [2022-11-27 05:26:06,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-27 05:26:06,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 05:26:06,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 22: [2022-11-27 05:26:06,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 05:26:06,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 05:26:06,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 11: [2022-11-27 05:26:06,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-27 05:26:06,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 05:26:06,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 7: [2022-11-27 05:26:06,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:26:06,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:26:06,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 05:26:06,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 3: [2022-11-27 05:26:06,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 05:26:06,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 3: [2022-11-27 05:26:06,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:26:06,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 05:26:06,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 8: [2022-11-27 05:26:06,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
19: [2022-11-27 05:26:06,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:26:06,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 19: [2022-11-27 05:26:06,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 8: [2022-11-27 05:26:06,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 19: [2022-11-27 05:26:06,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 11: [2022-11-27 05:26:06,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:26:06,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 05:26:06,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 9: [2022-11-27 05:26:06,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:26:06,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 05:26:06,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 26: [2022-11-27 05:26:06,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-27 05:26:06,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 05:26:06,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 9: [2022-11-27 05:26:06,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:26:06,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 05:26:06,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 28: [2022-11-27 05:26:06,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-27 05:26:06,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 05:26:06,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 19: [2022-11-27 05:26:06,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:26:06,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 05:26:06,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 26: [2022-11-27 05:26:06,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
26: [2022-11-27 05:26:06,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 05:26:06,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:26:06,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 26: [2022-11-27 05:26:06,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 05:26:06,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 9: [2022-11-27 05:26:06,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:26:06,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step157000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 05:26:06,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step157000 is ready now! 
0: successfully saved checkpoint at iteration 157000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2533.72 31: iteration 157010/ 173500 | consumed samples: 40194560 | consumed tokens: 82318458880 | elapsed time per iteration (s): 1.05 | learning rate: 2.406E-05 | global batch size: 256 | lm loss: 1.922148E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.982 | TFLOPs: 14.76 | 31: iteration 157020/ 173500 | consumed samples: 40197120 | consumed tokens: 82323701760 | elapsed time per iteration (s): 0.75 | learning rate: 2.406E-05 | global batch size: 256 | lm loss: 1.951424E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.752 | TFLOPs: 20.61 | 31: iteration 157030/ 173500 | consumed samples: 40199680 | consumed tokens: 82328944640 | elapsed time per iteration (s): 0.80 | learning rate: 2.405E-05 | global batch size: 256 | lm loss: 1.924096E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.406 | TFLOPs: 19.32 | 31: iteration 157040/ 173500 | consumed samples: 40202240 | consumed tokens: 82334187520 | elapsed time per iteration (s): 0.79 | learning rate: 2.405E-05 | global batch size: 256 | lm loss: 1.878898E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.241 | TFLOPs: 19.49 | 31: iteration 157050/ 173500 | consumed samples: 40204800 | consumed tokens: 82339430400 | elapsed time per iteration (s): 0.77 | learning rate: 2.404E-05 | global batch size: 256 | lm loss: 1.924940E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.482 | TFLOPs: 20.05 | 31: iteration 157060/ 173500 | consumed samples: 40207360 | consumed tokens: 82344673280 | elapsed time per iteration (s): 
0.74 | learning rate: 2.404E-05 | global batch size: 256 | lm loss: 1.903540E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.051 | TFLOPs: 20.81 | 31: iteration 157070/ 173500 | consumed samples: 40209920 | consumed tokens: 82349916160 | elapsed time per iteration (s): 0.75 | learning rate: 2.403E-05 | global batch size: 256 | lm loss: 1.923671E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.574 | TFLOPs: 20.72 | 31: iteration 157080/ 173500 | consumed samples: 40212480 | consumed tokens: 82355159040 | elapsed time per iteration (s): 0.77 | learning rate: 2.403E-05 | global batch size: 256 | lm loss: 1.907728E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.509 | TFLOPs: 20.12 | 31: iteration 157090/ 173500 | consumed samples: 40215040 | consumed tokens: 82360401920 | elapsed time per iteration (s): 0.77 | learning rate: 2.402E-05 | global batch size: 256 | lm loss: 1.925749E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.461 | TFLOPs: 20.05 | 31: iteration 157100/ 173500 | consumed samples: 40217600 | consumed tokens: 82365644800 | elapsed time per iteration (s): 0.77 | learning rate: 2.402E-05 | global batch size: 256 | lm loss: 1.918562E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.630 | TFLOPs: 20.24 | 31: iteration 157110/ 173500 | consumed samples: 40220160 | consumed tokens: 82370887680 | elapsed time per iteration (s): 0.75 | learning rate: 2.401E-05 | global batch size: 256 | lm loss: 1.929701E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.167 | TFLOPs: 20.64 | 31: 
iteration 157120/ 173500 | consumed samples: 40222720 | consumed tokens: 82376130560 | elapsed time per iteration (s): 0.76 | learning rate: 2.401E-05 | global batch size: 256 | lm loss: 1.898336E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.654 | TFLOPs: 20.37 | 31: iteration 157130/ 173500 | consumed samples: 40225280 | consumed tokens: 82381373440 | elapsed time per iteration (s): 0.75 | learning rate: 2.400E-05 | global batch size: 256 | lm loss: 1.928889E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.058 | TFLOPs: 20.63 | 31: iteration 157140/ 173500 | consumed samples: 40227840 | consumed tokens: 82386616320 | elapsed time per iteration (s): 0.77 | learning rate: 2.400E-05 | global batch size: 256 | lm loss: 1.903466E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.833 | TFLOPs: 20.14 | 31: iteration 157150/ 173500 | consumed samples: 40230400 | consumed tokens: 82391859200 | elapsed time per iteration (s): 0.74 | learning rate: 2.399E-05 | global batch size: 256 | lm loss: 1.903989E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.948 | TFLOPs: 21.05 | 31: iteration 157160/ 173500 | consumed samples: 40232960 | consumed tokens: 82397102080 | elapsed time per iteration (s): 0.82 | learning rate: 2.399E-05 | global batch size: 256 | lm loss: 1.927782E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.626 | TFLOPs: 18.79 | 31: iteration 157170/ 173500 | consumed samples: 40235520 | consumed tokens: 82402344960 | elapsed time per iteration (s): 0.80 | learning rate: 2.398E-05 | global batch size: 256 | lm loss: 1.927916E+00 | grad norm: 0.179 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.817 | TFLOPs: 19.35 | 31: iteration 157180/ 173500 | consumed samples: 40238080 | consumed tokens: 82407587840 | elapsed time per iteration (s): 0.78 | learning rate: 2.398E-05 | global batch size: 256 | lm loss: 1.938576E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.342 | TFLOPs: 19.92 | 31: iteration 157190/ 173500 | consumed samples: 40240640 | consumed tokens: 82412830720 | elapsed time per iteration (s): 0.81 | learning rate: 2.398E-05 | global batch size: 256 | lm loss: 1.907209E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.809 | TFLOPs: 19.23 | 31: iteration 157200/ 173500 | consumed samples: 40243200 | consumed tokens: 82418073600 | elapsed time per iteration (s): 0.84 | learning rate: 2.397E-05 | global batch size: 256 | lm loss: 1.923689E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.428 | TFLOPs: 18.48 | 31: iteration 157210/ 173500 | consumed samples: 40245760 | consumed tokens: 82423316480 | elapsed time per iteration (s): 0.71 | learning rate: 2.397E-05 | global batch size: 256 | lm loss: 1.904741E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 362.758 | TFLOPs: 21.95 | 31: iteration 157220/ 173500 | consumed samples: 40248320 | consumed tokens: 82428559360 | elapsed time per iteration (s): 0.78 | learning rate: 2.396E-05 | global batch size: 256 | lm loss: 1.919462E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.215 | TFLOPs: 19.74 | 31: iteration 157230/ 173500 | consumed samples: 40250880 | consumed tokens: 82433802240 | elapsed time per iteration (s): 0.76 | 
learning rate: 2.396E-05 | global batch size: 256 | lm loss: 1.899300E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.487 | TFLOPs: 20.42 | 31: iteration 157240/ 173500 | consumed samples: 40253440 | consumed tokens: 82439045120 | elapsed time per iteration (s): 0.75 | learning rate: 2.395E-05 | global batch size: 256 | lm loss: 1.913113E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.120 | TFLOPs: 20.58 | 31: iteration 157250/ 173500 | consumed samples: 40256000 | consumed tokens: 82444288000 | elapsed time per iteration (s): 0.71 | learning rate: 2.395E-05 | global batch size: 256 | lm loss: 1.921750E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 359.154 | TFLOPs: 21.73 | 31: iteration 157260/ 173500 | consumed samples: 40258560 | consumed tokens: 82449530880 | elapsed time per iteration (s): 0.78 | learning rate: 2.394E-05 | global batch size: 256 | lm loss: 1.901003E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.649 | TFLOPs: 19.82 | 31: iteration 157270/ 173500 | consumed samples: 40261120 | consumed tokens: 82454773760 | elapsed time per iteration (s): 0.75 | learning rate: 2.394E-05 | global batch size: 256 | lm loss: 1.907287E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.289 | TFLOPs: 20.77 | 31: iteration 157280/ 173500 | consumed samples: 40263680 | consumed tokens: 82460016640 | elapsed time per iteration (s): 0.77 | learning rate: 2.393E-05 | global batch size: 256 | lm loss: 1.930962E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.647 | TFLOPs: 20.00 | 31: iteration 
157290/ 173500 | consumed samples: 40266240 | consumed tokens: 82465259520 | elapsed time per iteration (s): 0.73 | learning rate: 2.393E-05 | global batch size: 256 | lm loss: 1.936672E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.444 | TFLOPs: 21.26 | 31: iteration 157300/ 173500 | consumed samples: 40268800 | consumed tokens: 82470502400 | elapsed time per iteration (s): 0.76 | learning rate: 2.392E-05 | global batch size: 256 | lm loss: 1.932249E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.256 | TFLOPs: 20.34 | 31: iteration 157310/ 173500 | consumed samples: 40271360 | consumed tokens: 82475745280 | elapsed time per iteration (s): 0.77 | learning rate: 2.392E-05 | global batch size: 256 | lm loss: 1.916874E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.701 | TFLOPs: 20.19 | 31: iteration 157320/ 173500 | consumed samples: 40273920 | consumed tokens: 82480988160 | elapsed time per iteration (s): 0.77 | learning rate: 2.391E-05 | global batch size: 256 | lm loss: 1.885481E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.475 | TFLOPs: 20.11 | 31: iteration 157330/ 173500 | consumed samples: 40276480 | consumed tokens: 82486231040 | elapsed time per iteration (s): 0.76 | learning rate: 2.391E-05 | global batch size: 256 | lm loss: 1.898424E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.446 | TFLOPs: 20.48 | 31: iteration 157340/ 173500 | consumed samples: 40279040 | consumed tokens: 82491473920 | elapsed time per iteration (s): 0.76 | learning rate: 2.390E-05 | global batch size: 256 | lm loss: 1.920797E+00 | grad norm: 0.178 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.057 | TFLOPs: 20.33 | 31: iteration 157350/ 173500 | consumed samples: 40281600 | consumed tokens: 82496716800 | elapsed time per iteration (s): 0.78 | learning rate: 2.390E-05 | global batch size: 256 | lm loss: 1.911532E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.233 | TFLOPs: 19.74 | 31: iteration 157360/ 173500 | consumed samples: 40284160 | consumed tokens: 82501959680 | elapsed time per iteration (s): 0.76 | learning rate: 2.389E-05 | global batch size: 256 | lm loss: 1.894346E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.621 | TFLOPs: 20.36 | 31: iteration 157370/ 173500 | consumed samples: 40286720 | consumed tokens: 82507202560 | elapsed time per iteration (s): 1.14 | learning rate: 2.389E-05 | global batch size: 256 | lm loss: 1.900241E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 224.444 | TFLOPs: 13.58 | 31: iteration 157380/ 173500 | consumed samples: 40289280 | consumed tokens: 82512445440 | elapsed time per iteration (s): 0.76 | learning rate: 2.388E-05 | global batch size: 256 | lm loss: 1.924882E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.165 | TFLOPs: 20.40 | 31: iteration 157390/ 173500 | consumed samples: 40291840 | consumed tokens: 82517688320 | elapsed time per iteration (s): 0.73 | learning rate: 2.388E-05 | global batch size: 256 | lm loss: 1.930820E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.111 | TFLOPs: 21.30 | 31: iteration 157400/ 173500 | consumed samples: 40294400 | consumed tokens: 82522931200 | elapsed time per iteration (s): 0.82 | learning 
rate: 2.387E-05 | global batch size: 256 | lm loss: 1.912615E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.701 | TFLOPs: 18.92 | 31: iteration 157410/ 173500 | consumed samples: 40296960 | consumed tokens: 82528174080 | elapsed time per iteration (s): 0.80 | learning rate: 2.387E-05 | global batch size: 256 | lm loss: 1.914878E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.689 | TFLOPs: 19.40 | 31: iteration 157420/ 173500 | consumed samples: 40299520 | consumed tokens: 82533416960 | elapsed time per iteration (s): 0.81 | learning rate: 2.386E-05 | global batch size: 256 | lm loss: 1.919588E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.277 | TFLOPs: 19.01 | 31: iteration 157430/ 173500 | consumed samples: 40302080 | consumed tokens: 82538659840 | elapsed time per iteration (s): 0.77 | learning rate: 2.386E-05 | global batch size: 256 | lm loss: 1.922428E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.924 | TFLOPs: 20.08 | 31: iteration 157440/ 173500 | consumed samples: 40304640 | consumed tokens: 82543902720 | elapsed time per iteration (s): 0.83 | learning rate: 2.386E-05 | global batch size: 256 | lm loss: 1.920633E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.164 | TFLOPs: 18.76 | 31: iteration 157450/ 173500 | consumed samples: 40307200 | consumed tokens: 82549145600 | elapsed time per iteration (s): 0.80 | learning rate: 2.385E-05 | global batch size: 256 | lm loss: 1.941006E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.928 | TFLOPs: 19.48 | 31: iteration 157460/ 
173500 | consumed samples: 40309760 | consumed tokens: 82554388480 | elapsed time per iteration (s): 0.81 | learning rate: 2.385E-05 | global batch size: 256 | lm loss: 1.957445E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.344 | TFLOPs: 19.08 | 31: iteration 157470/ 173500 | consumed samples: 40312320 | consumed tokens: 82559631360 | elapsed time per iteration (s): 0.78 | learning rate: 2.384E-05 | global batch size: 256 | lm loss: 1.902898E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.367 | TFLOPs: 19.80 | 31: iteration 157480/ 173500 | consumed samples: 40314880 | consumed tokens: 82564874240 | elapsed time per iteration (s): 0.82 | learning rate: 2.384E-05 | global batch size: 256 | lm loss: 1.916776E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.697 | TFLOPs: 18.92 | 31: iteration 157490/ 173500 | consumed samples: 40317440 | consumed tokens: 82570117120 | elapsed time per iteration (s): 0.79 | learning rate: 2.383E-05 | global batch size: 256 | lm loss: 1.924665E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.552 | TFLOPs: 19.51 | 31: iteration 157500/ 173500 | consumed samples: 40320000 | consumed tokens: 82575360000 | elapsed time per iteration (s): 0.81 | learning rate: 2.383E-05 | global batch size: 256 | lm loss: 1.907140E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.473 | TFLOPs: 19.21 | 31: iteration 157510/ 173500 | consumed samples: 40322560 | consumed tokens: 82580602880 | elapsed time per iteration (s): 0.78 | learning rate: 2.382E-05 | global batch size: 256 | lm loss: 1.916541E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 326.549 | TFLOPs: 19.76 | 31: iteration 157520/ 173500 | consumed samples: 40325120 | consumed tokens: 82585845760 | elapsed time per iteration (s): 0.81 | learning rate: 2.382E-05 | global batch size: 256 | lm loss: 1.906969E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.817 | TFLOPs: 19.05 | 31: iteration 157530/ 173500 | consumed samples: 40327680 | consumed tokens: 82591088640 | elapsed time per iteration (s): 0.78 | learning rate: 2.381E-05 | global batch size: 256 | lm loss: 1.940891E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.548 | TFLOPs: 19.82 | 31: iteration 157540/ 173500 | consumed samples: 40330240 | consumed tokens: 82596331520 | elapsed time per iteration (s): 0.84 | learning rate: 2.381E-05 | global batch size: 256 | lm loss: 1.887947E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.651 | TFLOPs: 18.43 | 31: iteration 157550/ 173500 | consumed samples: 40332800 | consumed tokens: 82601574400 | elapsed time per iteration (s): 0.82 | learning rate: 2.380E-05 | global batch size: 256 | lm loss: 1.916420E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.930 | TFLOPs: 18.99 | 31: iteration 157560/ 173500 | consumed samples: 40335360 | consumed tokens: 82606817280 | elapsed time per iteration (s): 0.81 | learning rate: 2.380E-05 | global batch size: 256 | lm loss: 1.912918E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.711 | TFLOPs: 19.22 | 31: iteration 157570/ 173500 | consumed samples: 40337920 | consumed tokens: 82612060160 | elapsed time per iteration (s): 0.82 | learning rate: 
2.379E-05 | global batch size: 256 | lm loss: 1.919428E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.947 | TFLOPs: 18.81 | 31: iteration 157580/ 173500 | consumed samples: 40340480 | consumed tokens: 82617303040 | elapsed time per iteration (s): 0.85 | learning rate: 2.379E-05 | global batch size: 256 | lm loss: 1.926587E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.380 | TFLOPs: 18.17 | 31: iteration 157590/ 173500 | consumed samples: 40343040 | consumed tokens: 82622545920 | elapsed time per iteration (s): 0.79 | learning rate: 2.378E-05 | global batch size: 256 | lm loss: 1.926821E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.623 | TFLOPs: 19.58 | 31: iteration 157600/ 173500 | consumed samples: 40345600 | consumed tokens: 82627788800 | elapsed time per iteration (s): 0.81 | learning rate: 2.378E-05 | global batch size: 256 | lm loss: 1.905773E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.205 | TFLOPs: 19.19 | 31: iteration 157610/ 173500 | consumed samples: 40348160 | consumed tokens: 82633031680 | elapsed time per iteration (s): 0.82 | learning rate: 2.377E-05 | global batch size: 256 | lm loss: 1.925497E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.861 | TFLOPs: 18.87 | 31: iteration 157620/ 173500 | consumed samples: 40350720 | consumed tokens: 82638274560 | elapsed time per iteration (s): 0.79 | learning rate: 2.377E-05 | global batch size: 256 | lm loss: 1.914940E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.078 | TFLOPs: 19.48 | 31: iteration 157630/ 173500 | 
consumed samples: 40353280 | consumed tokens: 82643517440 | elapsed time per iteration (s): 0.83 | learning rate: 2.377E-05 | global batch size: 256 | lm loss: 1.903457E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.457 | TFLOPs: 18.66 | 31: iteration 157640/ 173500 | consumed samples: 40355840 | consumed tokens: 82648760320 | elapsed time per iteration (s): 0.96 | learning rate: 2.376E-05 | global batch size: 256 | lm loss: 1.903529E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 266.432 | TFLOPs: 16.12 | 31: iteration 157650/ 173500 | consumed samples: 40358400 | consumed tokens: 82654003200 | elapsed time per iteration (s): 0.85 | learning rate: 2.376E-05 | global batch size: 256 | lm loss: 1.893220E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.211 | TFLOPs: 18.22 | 31: iteration 157660/ 173500 | consumed samples: 40360960 | consumed tokens: 82659246080 | elapsed time per iteration (s): 0.87 | learning rate: 2.375E-05 | global batch size: 256 | lm loss: 1.938782E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.815 | TFLOPs: 17.84 | 31: iteration 157670/ 173500 | consumed samples: 40363520 | consumed tokens: 82664488960 | elapsed time per iteration (s): 0.83 | learning rate: 2.375E-05 | global batch size: 256 | lm loss: 1.908162E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.672 | TFLOPs: 18.61 | 31: iteration 157680/ 173500 | consumed samples: 40366080 | consumed tokens: 82669731840 | elapsed time per iteration (s): 0.80 | learning rate: 2.374E-05 | global batch size: 256 | lm loss: 1.903060E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 318.664 | TFLOPs: 19.28 | 31: iteration 157690/ 173500 | consumed samples: 40368640 | consumed tokens: 82674974720 | elapsed time per iteration (s): 1.00 | learning rate: 2.374E-05 | global batch size: 256 | lm loss: 1.925199E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 256.894 | TFLOPs: 15.54 | 31: iteration 157700/ 173500 | consumed samples: 40371200 | consumed tokens: 82680217600 | elapsed time per iteration (s): 0.80 | learning rate: 2.373E-05 | global batch size: 256 | lm loss: 1.907636E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.136 | TFLOPs: 19.25 | 31: iteration 157710/ 173500 | consumed samples: 40373760 | consumed tokens: 82685460480 | elapsed time per iteration (s): 0.73 | learning rate: 2.373E-05 | global batch size: 256 | lm loss: 1.930108E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.498 | TFLOPs: 21.08 | 31: iteration 157720/ 173500 | consumed samples: 40376320 | consumed tokens: 82690703360 | elapsed time per iteration (s): 0.78 | learning rate: 2.372E-05 | global batch size: 256 | lm loss: 1.921524E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.070 | TFLOPs: 19.85 | 31: iteration 157730/ 173500 | consumed samples: 40378880 | consumed tokens: 82695946240 | elapsed time per iteration (s): 0.75 | learning rate: 2.372E-05 | global batch size: 256 | lm loss: 1.903184E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.834 | TFLOPs: 20.62 | 31: iteration 157740/ 173500 | consumed samples: 40381440 | consumed tokens: 82701189120 | elapsed time per iteration (s): 0.92 | learning rate: 
2.371E-05 | global batch size: 256 | lm loss: 1.894672E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 278.146 | TFLOPs: 16.83 | 31: iteration 157750/ 173500 | consumed samples: 40384000 | consumed tokens: 82706432000 | elapsed time per iteration (s): 0.76 | learning rate: 2.371E-05 | global batch size: 256 | lm loss: 1.878095E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.937 | TFLOPs: 20.38 | 31: iteration 157760/ 173500 | consumed samples: 40386560 | consumed tokens: 82711674880 | elapsed time per iteration (s): 0.75 | learning rate: 2.370E-05 | global batch size: 256 | lm loss: 1.922230E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.534 | TFLOPs: 20.78 | 31: iteration 157770/ 173500 | consumed samples: 40389120 | consumed tokens: 82716917760 | elapsed time per iteration (s): 0.76 | learning rate: 2.370E-05 | global batch size: 256 | lm loss: 1.919078E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.643 | TFLOPs: 20.43 | 31: iteration 157780/ 173500 | consumed samples: 40391680 | consumed tokens: 82722160640 | elapsed time per iteration (s): 0.80 | learning rate: 2.369E-05 | global batch size: 256 | lm loss: 1.932986E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.776 | TFLOPs: 19.41 | 31: iteration 157790/ 173500 | consumed samples: 40394240 | consumed tokens: 82727403520 | elapsed time per iteration (s): 0.77 | learning rate: 2.369E-05 | global batch size: 256 | lm loss: 1.924608E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.539 | TFLOPs: 20.18 | 31: iteration 157800/ 173500 | 
consumed samples: 40396800 | consumed tokens: 82732646400 | elapsed time per iteration (s): 0.74 | learning rate: 2.369E-05 | global batch size: 256 | lm loss: 1.923060E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.113 | TFLOPs: 21.06 | 31: iteration 157810/ 173500 | consumed samples: 40399360 | consumed tokens: 82737889280 | elapsed time per iteration (s): 0.73 | learning rate: 2.368E-05 | global batch size: 256 | lm loss: 1.934045E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.356 | TFLOPs: 21.26 | 31: iteration 157820/ 173500 | consumed samples: 40401920 | consumed tokens: 82743132160 | elapsed time per iteration (s): 0.74 | learning rate: 2.368E-05 | global batch size: 256 | lm loss: 1.912499E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.135 | TFLOPs: 20.88 | 31: iteration 157830/ 173500 | consumed samples: 40404480 | consumed tokens: 82748375040 | elapsed time per iteration (s): 0.74 | learning rate: 2.367E-05 | global batch size: 256 | lm loss: 1.918689E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.356 | TFLOPs: 20.83 | 31: iteration 157840/ 173500 | consumed samples: 40407040 | consumed tokens: 82753617920 | elapsed time per iteration (s): 0.76 | learning rate: 2.367E-05 | global batch size: 256 | lm loss: 1.912834E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.152 | TFLOPs: 20.34 | 31: iteration 157850/ 173500 | consumed samples: 40409600 | consumed tokens: 82758860800 | elapsed time per iteration (s): 0.77 | learning rate: 2.366E-05 | global batch size: 256 | lm loss: 1.932466E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 333.007 | TFLOPs: 20.15 | 31: iteration 157860/ 173500 | consumed samples: 40412160 | consumed tokens: 82764103680 | elapsed time per iteration (s): 0.81 | learning rate: 2.366E-05 | global batch size: 256 | lm loss: 1.934565E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.930 | TFLOPs: 19.17 | 31: iteration 157870/ 173500 | consumed samples: 40414720 | consumed tokens: 82769346560 | elapsed time per iteration (s): 0.81 | learning rate: 2.365E-05 | global batch size: 256 | lm loss: 1.916305E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.530 | TFLOPs: 19.09 | 31: iteration 157880/ 173500 | consumed samples: 40417280 | consumed tokens: 82774589440 | elapsed time per iteration (s): 0.78 | learning rate: 2.365E-05 | global batch size: 256 | lm loss: 1.919254E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.928 | TFLOPs: 19.96 | 31: iteration 157890/ 173500 | consumed samples: 40419840 | consumed tokens: 82779832320 | elapsed time per iteration (s): 0.74 | learning rate: 2.364E-05 | global batch size: 256 | lm loss: 1.935667E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.428 | TFLOPs: 20.96 | 31: iteration 157900/ 173500 | consumed samples: 40422400 | consumed tokens: 82785075200 | elapsed time per iteration (s): 0.76 | learning rate: 2.364E-05 | global batch size: 256 | lm loss: 1.915358E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.046 | TFLOPs: 20.51 | 31: iteration 157910/ 173500 | consumed samples: 40424960 | consumed tokens: 82790318080 | elapsed time per iteration (s): 0.75 | learning rate: 
2.363E-05 | global batch size: 256 | lm loss: 1.899405E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.743 | TFLOPs: 20.67 | 31: iteration 157920/ 173500 | consumed samples: 40427520 | consumed tokens: 82795560960 | elapsed time per iteration (s): 0.78 | learning rate: 2.363E-05 | global batch size: 256 | lm loss: 1.919338E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.060 | TFLOPs: 19.97 | 31: iteration 157930/ 173500 | consumed samples: 40430080 | consumed tokens: 82800803840 | elapsed time per iteration (s): 0.74 | learning rate: 2.363E-05 | global batch size: 256 | lm loss: 1.915141E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.749 | TFLOPs: 20.86 | 31: iteration 157940/ 173500 | consumed samples: 40432640 | consumed tokens: 82806046720 | elapsed time per iteration (s): 0.73 | learning rate: 2.362E-05 | global batch size: 256 | lm loss: 1.900262E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.390 | TFLOPs: 21.08 | 31: iteration 157950/ 173500 | consumed samples: 40435200 | consumed tokens: 82811289600 | elapsed time per iteration (s): 0.76 | learning rate: 2.362E-05 | global batch size: 256 | lm loss: 1.924563E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.955 | TFLOPs: 20.51 | 31: iteration 157960/ 173500 | consumed samples: 40437760 | consumed tokens: 82816532480 | elapsed time per iteration (s): 0.75 | learning rate: 2.361E-05 | global batch size: 256 | lm loss: 1.905134E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.423 | TFLOPs: 20.59 | 31: iteration 157970/ 173500 | 
consumed samples: 40440320 | consumed tokens: 82821775360 | elapsed time per iteration (s): 0.77 | learning rate: 2.361E-05 | global batch size: 256 | lm loss: 1.910441E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.263 | TFLOPs: 20.16 | 31: iteration 157980/ 173500 | consumed samples: 40442880 | consumed tokens: 82827018240 | elapsed time per iteration (s): 0.80 | learning rate: 2.360E-05 | global batch size: 256 | lm loss: 1.894049E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.714 | TFLOPs: 19.40 | 31: iteration 157990/ 173500 | consumed samples: 40445440 | consumed tokens: 82832261120 | elapsed time per iteration (s): 0.79 | learning rate: 2.360E-05 | global batch size: 256 | lm loss: 1.924360E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.001 | TFLOPs: 19.54 | 0: [2022-11-27 05:39:14,739] [INFO] [logging.py:68:log_dist] [Rank 0] step=158000, skipped=0, lr=[2.3592725009494674e-05, 2.3592725009494674e-05, 2.3592725009494674e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 158000/ 173500 | consumed samples: 40448000 | consumed tokens: 82837504000 | elapsed time per iteration (s): 0.82 | learning rate: 2.359E-05 | global batch size: 256 | lm loss: 1.914504E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.040 | TFLOPs: 18.88 | 0: steps: 158000 loss: 1.9811 iter time (s): 0.786 samples/sec: 325.691 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 158000 | lm loss value: 1.870924E+00 | lm loss PPL: 6.494292E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 158000 to 
checkpoints_1b1long 0: [2022-11-27 05:39:15,053] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step158000 is begin to save! 0: [2022-11-27 05:39:15,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_01-model_00-model_states.pt... 0: [2022-11-27 05:39:15,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_01-model_00-model_states.pt. 0: [2022-11-27 05:39:15,287] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_03-model_00-model_states.pt... 0: [2022-11-27 05:39:15,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_03-model_00-model_states.pt. 0: [2022-11-27 05:39:15,368] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_04-model_00-model_states.pt... 0: [2022-11-27 05:39:15,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_04-model_00-model_states.pt. 0: [2022-11-27 05:39:15,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_05-model_00-model_states.pt... 0: [2022-11-27 05:39:15,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_05-model_00-model_states.pt. 0: [2022-11-27 05:39:15,521] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_06-model_00-model_states.pt... 0: [2022-11-27 05:39:15,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_06-model_00-model_states.pt. 0: [2022-11-27 05:39:15,600] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_07-model_00-model_states.pt... 
0: [2022-11-27 05:39:15,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_07-model_00-model_states.pt. 0: [2022-11-27 05:39:15,678] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_08-model_00-model_states.pt... 0: [2022-11-27 05:39:15,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_08-model_00-model_states.pt. 0: [2022-11-27 05:39:15,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_09-model_00-model_states.pt... 0: [2022-11-27 05:39:15,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_09-model_00-model_states.pt. 0: [2022-11-27 05:39:15,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_10-model_00-model_states.pt... 0: [2022-11-27 05:39:15,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_10-model_00-model_states.pt. 0: [2022-11-27 05:39:15,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_11-model_00-model_states.pt... 0: [2022-11-27 05:39:15,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_11-model_00-model_states.pt. 0: [2022-11-27 05:39:15,982] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_12-model_00-model_states.pt... 0: [2022-11-27 05:39:16,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_12-model_00-model_states.pt. 0: [2022-11-27 05:39:16,059] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 05:39:16,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_13-model_00-model_states.pt. 0: [2022-11-27 05:39:16,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_14-model_00-model_states.pt... 0: [2022-11-27 05:39:16,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_14-model_00-model_states.pt. 0: [2022-11-27 05:39:16,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_15-model_00-model_states.pt... 0: [2022-11-27 05:39:16,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_15-model_00-model_states.pt. 0: [2022-11-27 05:39:16,281] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_16-model_00-model_states.pt... 0: [2022-11-27 05:39:16,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_16-model_00-model_states.pt. 0: [2022-11-27 05:39:16,354] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_17-model_00-model_states.pt... 0: [2022-11-27 05:39:16,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_17-model_00-model_states.pt. 0: [2022-11-27 05:39:16,431] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_18-model_00-model_states.pt... 0: [2022-11-27 05:39:16,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_18-model_00-model_states.pt. 0: [2022-11-27 05:39:16,503] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 05:39:16,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_19-model_00-model_states.pt. 0: [2022-11-27 05:39:16,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_20-model_00-model_states.pt... 0: [2022-11-27 05:39:16,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_20-model_00-model_states.pt. 0: [2022-11-27 05:39:16,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_21-model_00-model_states.pt... 0: [2022-11-27 05:39:16,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_21-model_00-model_states.pt. 0: [2022-11-27 05:39:16,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_22-model_00-model_states.pt... 0: [2022-11-27 05:39:16,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_22-model_00-model_states.pt. 0: [2022-11-27 05:39:16,804] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_23-model_00-model_states.pt... 0: [2022-11-27 05:39:16,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_23-model_00-model_states.pt. 0: [2022-11-27 05:39:16,878] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_24-model_00-model_states.pt... 0: [2022-11-27 05:39:16,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_24-model_00-model_states.pt. 0: [2022-11-27 05:39:16,951] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 05:39:17,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_25-model_00-model_states.pt. 0: [2022-11-27 05:39:17,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_26-model_00-model_states.pt... 0: [2022-11-27 05:39:17,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_26-model_00-model_states.pt. 0: [2022-11-27 05:39:17,099] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_27-model_00-model_states.pt... 0: [2022-11-27 05:39:17,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_27-model_00-model_states.pt. 0: [2022-11-27 05:39:17,170] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_28-model_00-model_states.pt... 0: [2022-11-27 05:39:17,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_28-model_00-model_states.pt. 0: [2022-11-27 05:39:17,247] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/layer_30-model_00-model_states.pt... 0: [2022-11-27 05:39:17,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/layer_30-model_00-model_states.pt. 0: [2022-11-27 05:39:17,250] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step158000/mp_rank_00_model_states.pt 0: [2022-11-27 05:39:17,250] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/mp_rank_00_model_states.pt... 0: [2022-11-27 05:39:17,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/mp_rank_00_model_states.pt. 
6: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 16: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 
30: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 
6: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
8: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
16: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 
12: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 
23: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
14: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 
30: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 
18: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 
0: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 
7: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
8: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
3: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 
20: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
31: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 
18: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 
7: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
3: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
28: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 
17: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
20: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 
27: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:39:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 
9: [2022-11-27 05:39:17,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:39:17,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:39:17,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:39:17,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 05:39:17,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 28: [2022-11-27 05:39:17,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:39:17,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:39:17,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 05:39:17,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 24: [2022-11-27 05:39:17,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
24: [2022-11-27 05:39:17,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 27: [2022-11-27 05:39:17,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:39:17,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 4: [2022-11-27 05:39:17,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:39:17,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 05:39:17,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 4: [2022-11-27 05:39:17,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 05:39:17,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 9: [2022-11-27 05:39:17,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:39:17,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
9: [2022-11-27 05:39:17,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 26: [2022-11-27 05:39:17,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 9: [2022-11-27 05:39:17,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 26: [2022-11-27 05:39:17,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 4: [2022-11-27 05:39:17,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:39:17,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 05:39:17,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 2: [2022-11-27 05:39:17,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:39:17,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:39:17,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:39:17,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 
2: [2022-11-27 05:39:17,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 23: [2022-11-27 05:39:17,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 26: [2022-11-27 05:39:17,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 2: [2022-11-27 05:39:17,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 12: [2022-11-27 05:39:17,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 23: [2022-11-27 05:39:17,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 26: [2022-11-27 05:39:17,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 12: [2022-11-27 05:39:17,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 6: [2022-11-27 05:39:17,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 22: [2022-11-27 05:39:17,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
6: [2022-11-27 05:39:17,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 22: [2022-11-27 05:39:17,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 6: [2022-11-27 05:39:17,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 22: [2022-11-27 05:39:17,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 10: [2022-11-27 05:39:17,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:39:17,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:39:17,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:39:17,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 05:39:17,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 13: [2022-11-27 05:39:17,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 10: [2022-11-27 05:39:17,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 13: [2022-11-27 05:39:17,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
10: [2022-11-27 05:39:17,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 30: [2022-11-27 05:39:17,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:39:17,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 05:39:17,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 20: [2022-11-27 05:39:17,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:39:17,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:39:17,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 20: [2022-11-27 05:39:17,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 30: [2022-11-27 05:39:17,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 20: [2022-11-27 05:39:17,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 15: [2022-11-27 05:39:17,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:39:17,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
15: [2022-11-27 05:39:17,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 12: [2022-11-27 05:39:17,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 15: [2022-11-27 05:39:17,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 12: [2022-11-27 05:39:17,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 15: [2022-11-27 05:39:17,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:39:17,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 05:39:17,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 7: [2022-11-27 05:39:17,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:39:17,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 05:39:17,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 16: [2022-11-27 05:39:17,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:39:17,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 
24: [2022-11-27 05:39:17,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:39:17,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 05:39:17,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 05:39:17,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 16: [2022-11-27 05:39:17,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 24: [2022-11-27 05:39:17,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 05:39:17,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 14: [2022-11-27 05:39:17,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:39:17,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 05:39:17,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 14: [2022-11-27 05:39:17,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-27 05:39:17,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 05:39:17,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 10: [2022-11-27 05:39:17,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:39:17,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 05:39:17,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 22: [2022-11-27 05:39:17,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 05:39:17,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 05:39:17,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 6: [2022-11-27 05:39:17,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:39:17,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
6: [2022-11-27 05:39:17,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 29: [2022-11-27 05:39:17,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 6: [2022-11-27 05:39:17,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 20: [2022-11-27 05:39:17,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:39:17,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 20: [2022-11-27 05:39:17,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 11: [2022-11-27 05:39:17,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 20: [2022-11-27 05:39:17,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 11: [2022-11-27 05:39:17,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 0: [2022-11-27 05:39:17,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:39:17,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
0: [2022-11-27 05:39:17,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 05:39:17,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 5: [2022-11-27 05:39:17,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:39:17,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:39:17,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 05:39:17,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 23: [2022-11-27 05:39:17,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 05:39:17,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 5: [2022-11-27 05:39:17,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:39:17,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:39:17,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 05:39:17,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
5: [2022-11-27 05:39:17,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 05:39:17,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 1: [2022-11-27 05:39:17,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:39:17,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:39:17,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 05:39:17,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 05:39:17,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 1: [2022-11-27 05:39:17,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 16: [2022-11-27 05:39:17,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:39:17,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 05:39:17,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 4: [2022-11-27 05:39:17,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
13: [2022-11-27 05:39:17,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:39:17,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:39:17,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 13: [2022-11-27 05:39:17,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 15: [2022-11-27 05:39:17,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 4: [2022-11-27 05:39:17,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 13: [2022-11-27 05:39:17,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 15: [2022-11-27 05:39:17,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 9: [2022-11-27 05:39:17,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:39:17,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
9: [2022-11-27 05:39:17,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 8: [2022-11-27 05:39:17,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 9: [2022-11-27 05:39:17,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 8: [2022-11-27 05:39:17,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 20: [2022-11-27 05:39:17,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-27 05:39:17,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 05:39:17,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 2: [2022-11-27 05:39:17,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:39:17,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:39:17,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 05:39:17,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
27: [2022-11-27 05:39:17,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 05:39:17,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 4: [2022-11-27 05:39:17,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:39:17,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 05:39:17,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 27: [2022-11-27 05:39:17,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:39:17,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 12: [2022-11-27 05:39:17,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:39:17,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 05:39:17,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 27: [2022-11-27 05:39:17,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 11: [2022-11-27 05:39:17,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
13: [2022-11-27 05:39:17,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:39:17,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 13: [2022-11-27 05:39:17,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 11: [2022-11-27 05:39:17,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 13: [2022-11-27 05:39:17,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 14: [2022-11-27 05:39:17,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:39:17,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:39:17,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 26: [2022-11-27 05:39:17,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 14: [2022-11-27 05:39:17,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 26: [2022-11-27 05:39:17,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 30: [2022-11-27 05:39:17,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-27 05:39:17,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 28: [2022-11-27 05:39:17,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 30: [2022-11-27 05:39:17,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 28: [2022-11-27 05:39:17,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 28: [2022-11-27 05:39:17,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-27 05:39:17,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 05:39:17,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 28: [2022-11-27 05:39:17,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-27 05:39:17,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 05:39:17,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 28: [2022-11-27 05:39:17,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-27 05:39:17,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 05:39:17,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 10: [2022-11-27 05:39:17,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:39:17,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:39:17,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:39:17,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 8: [2022-11-27 05:39:17,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:39:17,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 1: [2022-11-27 05:39:17,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 8: [2022-11-27 05:39:17,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 1: [2022-11-27 05:39:17,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 8: [2022-11-27 05:39:17,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
26: [2022-11-27 05:39:17,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 8: [2022-11-27 05:39:17,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:39:17,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 8: [2022-11-27 05:39:17,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 05:39:17,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 24: [2022-11-27 05:39:17,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:39:17,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 5: [2022-11-27 05:39:17,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:39:17,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 24: [2022-11-27 05:39:17,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 5: [2022-11-27 05:39:17,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 9: [2022-11-27 05:39:17,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-27 05:39:17,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:39:17,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 05:39:17,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 0: [2022-11-27 05:39:17,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:39:17,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 9: [2022-11-27 05:39:17,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 0: [2022-11-27 05:39:17,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 05:39:17,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 2: [2022-11-27 05:39:17,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:39:17,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 
2: [2022-11-27 05:39:17,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 16: [2022-11-27 05:39:17,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 2: [2022-11-27 05:39:17,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 16: [2022-11-27 05:39:17,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 22: [2022-11-27 05:39:17,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:39:17,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:39:17,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 22: [2022-11-27 05:39:17,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 19: [2022-11-27 05:39:17,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 15: [2022-11-27 05:39:17,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 22: [2022-11-27 05:39:17,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 15: [2022-11-27 05:39:17,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
19: [2022-11-27 05:39:17,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 7: [2022-11-27 05:39:17,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:39:17,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:39:17,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:39:17,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 7: [2022-11-27 05:39:17,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 05:39:17,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 30: [2022-11-27 05:39:17,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 7: [2022-11-27 05:39:17,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 7: [2022-11-27 05:39:17,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 7: [2022-11-27 05:39:17,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-27 05:39:17,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 05:39:17,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 6: [2022-11-27 05:39:17,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:39:17,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 05:39:17,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 0: [2022-11-27 05:39:17,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:39:17,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:39:17,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 05:39:17,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 05:39:17,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 0: [2022-11-27 05:39:17,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 23: [2022-11-27 05:39:17,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-27 05:39:17,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 05:39:17,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 20: [2022-11-27 05:39:17,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-27 05:39:17,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 05:39:17,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 14: [2022-11-27 05:39:17,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:39:17,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 05:39:17,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 12: [2022-11-27 05:39:17,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:39:17,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 05:39:17,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 29: [2022-11-27 05:39:17,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-27 05:39:17,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 05:39:17,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 27: [2022-11-27 05:39:17,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:39:17,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 05:39:17,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 8: [2022-11-27 05:39:17,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:39:17,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 05:39:17,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 24: [2022-11-27 05:39:17,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:39:17,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-27 05:39:17,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 13: [2022-11-27 05:39:17,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-27 05:39:17,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 05:39:17,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 22: [2022-11-27 05:39:17,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-27 05:39:17,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 05:39:17,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 2: [2022-11-27 05:39:17,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:39:17,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 05:39:17,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 19: [2022-11-27 05:39:17,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:39:17,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 1: [2022-11-27 05:39:17,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:39:17,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
1: [2022-11-27 05:39:17,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 05:39:17,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 6: [2022-11-27 05:39:17,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:39:17,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 05:39:17,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 19: [2022-11-27 05:39:17,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:39:17,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 05:39:17,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 25: [2022-11-27 05:39:17,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:39:17,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:39:17,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:39:17,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
25: [2022-11-27 05:39:17,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:39:17,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:39:17,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 19: [2022-11-27 05:39:17,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:39:17,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 29: [2022-11-27 05:39:17,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 19: [2022-11-27 05:39:17,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 25: [2022-11-27 05:39:17,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 05:39:17,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 29: [2022-11-27 05:39:17,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 19: [2022-11-27 05:39:17,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
25: [2022-11-27 05:39:17,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 05:39:17,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 29: [2022-11-27 05:39:17,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 25: [2022-11-27 05:39:17,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 25: [2022-11-27 05:39:17,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 25: [2022-11-27 05:39:17,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 11: [2022-11-27 05:39:17,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:39:17,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 05:39:17,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 3: [2022-11-27 05:39:17,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:39:17,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:39:17,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-27 05:39:17,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:39:17,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 05:39:17,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 05:39:17,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 05:39:17,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 05:39:17,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 3: [2022-11-27 05:39:17,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 3: [2022-11-27 05:39:17,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 3: [2022-11-27 05:39:17,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 6: [2022-11-27 05:39:17,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:39:17,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 05:39:17,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
18: [2022-11-27 05:39:17,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:39:17,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:39:17,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:39:17,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:39:17,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 05:39:17,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 05:39:17,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 05:39:17,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 05:39:17,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 18: [2022-11-27 05:39:17,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 18: [2022-11-27 05:39:17,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 18: [2022-11-27 05:39:17,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
17: [2022-11-27 05:39:17,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-27 05:39:17,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-27 05:39:17,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-27 05:39:17,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-27 05:39:17,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 05:39:17,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 05:39:17,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 05:39:17,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 05:39:17,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 17: [2022-11-27 05:39:17,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 17: [2022-11-27 05:39:17,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 17: [2022-11-27 05:39:17,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
1: [2022-11-27 05:39:17,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:39:17,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 05:39:17,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 5: [2022-11-27 05:39:17,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:39:17,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 05:39:17,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 9: [2022-11-27 05:39:17,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:39:17,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 05:39:17,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 10: [2022-11-27 05:39:17,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:39:17,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 05:39:17,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
23: [2022-11-27 05:39:17,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:39:17,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 05:39:17,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 4: [2022-11-27 05:39:17,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:39:17,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 05:39:17,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 28: [2022-11-27 05:39:17,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-27 05:39:17,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 05:39:17,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 7: [2022-11-27 05:39:17,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:39:17,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 05:39:17,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
26: [2022-11-27 05:39:17,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:39:17,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 05:39:17,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 16: [2022-11-27 05:39:17,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:39:17,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 05:39:17,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 14: [2022-11-27 05:39:17,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:39:17,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 05:39:17,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 13: [2022-11-27 05:39:17,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-27 05:39:17,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 15: [2022-11-27 05:39:17,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:39:17,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:39:17,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 15: [2022-11-27 05:39:17,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 05:39:17,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 0: [2022-11-27 05:39:17,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 05:39:17,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 30: [2022-11-27 05:39:17,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:39:17,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 17: [2022-11-27 05:39:17,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:39:17,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
17: [2022-11-27 05:39:17,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 05:39:17,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 11: [2022-11-27 05:39:17,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:39:17,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 05:39:17,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 20: [2022-11-27 05:39:17,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 05:39:17,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 05:39:17,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 12: [2022-11-27 05:39:17,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:39:17,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 05:39:17,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 8: [2022-11-27 05:39:17,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-27 05:39:17,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 05:39:17,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 19: [2022-11-27 05:39:17,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:39:17,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 05:39:17,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 3: [2022-11-27 05:39:17,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:39:17,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 05:39:17,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 24: [2022-11-27 05:39:17,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:39:17,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 05:39:17,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 2: [2022-11-27 05:39:17,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-27 05:39:17,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 05:39:17,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 18: [2022-11-27 05:39:17,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:39:17,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 05:39:17,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 27: [2022-11-27 05:39:17,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:39:17,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 05:39:17,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 6: [2022-11-27 05:39:17,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:39:17,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 05:39:17,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 29: [2022-11-27 05:39:17,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
29: [2022-11-27 05:39:17,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 05:39:17,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 22: [2022-11-27 05:39:17,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 05:39:17,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 05:39:17,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 9: [2022-11-27 05:39:17,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:39:17,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:39:17,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 9: [2022-11-27 05:39:17,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 23: [2022-11-27 05:39:17,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 9: [2022-11-27 05:39:17,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 10: [2022-11-27 05:39:17,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-27 05:39:17,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 05:39:17,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 1: [2022-11-27 05:39:17,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:39:17,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 05:39:17,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 0: [2022-11-27 05:39:17,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:39:17,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:39:17,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 05:39:17,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 15: [2022-11-27 05:39:17,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:39:17,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 05:39:17,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
14: [2022-11-27 05:39:17,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:39:17,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:39:17,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 14: [2022-11-27 05:39:17,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 16: [2022-11-27 05:39:17,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 14: [2022-11-27 05:39:17,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 7: [2022-11-27 05:39:17,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:39:17,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 05:39:17,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 17: [2022-11-27 05:39:17,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-27 05:39:17,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 05:39:17,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
30: [2022-11-27 05:39:17,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:39:17,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:39:17,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 30: [2022-11-27 05:39:17,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 29: [2022-11-27 05:39:17,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 30: [2022-11-27 05:39:17,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 12: [2022-11-27 05:39:17,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:39:17,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:39:17,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 11: [2022-11-27 05:39:17,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 12: [2022-11-27 05:39:17,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 11: [2022-11-27 05:39:17,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
19: [2022-11-27 05:39:17,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:39:17,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 05:39:17,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 28: [2022-11-27 05:39:17,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:39:17,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:39:17,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 1: [2022-11-27 05:39:17,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:39:17,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 1: [2022-11-27 05:39:17,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 05:39:17,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 27: [2022-11-27 05:39:17,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
27: [2022-11-27 05:39:17,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 05:39:17,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 8: [2022-11-27 05:39:17,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:39:17,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:39:17,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 13: [2022-11-27 05:39:17,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 8: [2022-11-27 05:39:17,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 13: [2022-11-27 05:39:17,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 24: [2022-11-27 05:39:17,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:39:17,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 05:39:17,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 5: [2022-11-27 05:39:17,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-27 05:39:17,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 26: [2022-11-27 05:39:17,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:39:17,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 26: [2022-11-27 05:39:17,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 05:39:17,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 28: [2022-11-27 05:39:17,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 05:39:17,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 20: [2022-11-27 05:39:17,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:39:17,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 20: [2022-11-27 05:39:17,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 6: [2022-11-27 05:39:17,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 20: [2022-11-27 05:39:17,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
6: [2022-11-27 05:39:17,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 2: [2022-11-27 05:39:17,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:39:17,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 05:39:17,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 4: [2022-11-27 05:39:17,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:39:17,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 05:39:17,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 22: [2022-11-27 05:39:17,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-27 05:39:17,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 05:39:17,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 0: [2022-11-27 05:39:17,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 05:39:17,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
25: [2022-11-27 05:39:17,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:39:17,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:39:17,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 05:39:17,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 10: [2022-11-27 05:39:17,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 05:39:17,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 18: [2022-11-27 05:39:17,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:39:17,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 05:39:17,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 23: [2022-11-27 05:39:17,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:39:17,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 05:39:17,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
25: [2022-11-27 05:39:17,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:39:17,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 05:39:17,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 4: [2022-11-27 05:39:17,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:39:17,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 05:39:17,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 14: [2022-11-27 05:39:17,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:39:17,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 05:39:17,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 28: [2022-11-27 05:39:17,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-27 05:39:17,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 05:39:17,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
26: [2022-11-27 05:39:17,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:39:17,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 05:39:17,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 25: [2022-11-27 05:39:17,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:39:17,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 05:39:17,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 9: [2022-11-27 05:39:17,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:39:17,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 05:39:17,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 16: [2022-11-27 05:39:17,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:39:17,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 05:39:17,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
7: [2022-11-27 05:39:17,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:39:17,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 05:39:17,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 8: [2022-11-27 05:39:17,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:39:17,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 05:39:17,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 18: [2022-11-27 05:39:17,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:39:17,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-27 05:39:17,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 3: [2022-11-27 05:39:17,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:39:17,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 05:39:17,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
17: [2022-11-27 05:39:17,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-27 05:39:17,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 05:39:17,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 0: [2022-11-27 05:39:17,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:39:17,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 05:39:17,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 30: [2022-11-27 05:39:17,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:39:17,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 05:39:17,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 20: [2022-11-27 05:39:17,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-27 05:39:17,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 05:39:17,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
29: [2022-11-27 05:39:17,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:39:17,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 05:39:17,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 15: [2022-11-27 05:39:17,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:39:17,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 05:39:17,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 13: [2022-11-27 05:39:17,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:39:17,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:39:17,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 05:39:17,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 13: [2022-11-27 05:39:17,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 05:39:17,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
1: [2022-11-27 05:39:17,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:39:17,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 05:39:17,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 11: [2022-11-27 05:39:17,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:39:17,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 05:39:17,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 6: [2022-11-27 05:39:17,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:39:17,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 05:39:17,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 12: [2022-11-27 05:39:17,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:39:17,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 05:39:17,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
9: [2022-11-27 05:39:17,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:39:17,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 05:39:17,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 29: [2022-11-27 05:39:17,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:39:17,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 05:39:17,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 27: [2022-11-27 05:39:17,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:39:17,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 05:39:17,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 19: [2022-11-27 05:39:17,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:39:17,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
19: [2022-11-27 05:39:17,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 05:39:17,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 12: [2022-11-27 05:39:17,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 05:39:17,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 10: [2022-11-27 05:39:17,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:39:17,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:39:17,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 05:39:17,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 13: [2022-11-27 05:39:17,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 05:39:17,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 14: [2022-11-27 05:39:17,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-27 05:39:17,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 05:39:17,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 30: [2022-11-27 05:39:17,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:39:17,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 05:39:17,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 23: [2022-11-27 05:39:17,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:39:17,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 05:39:17,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 15: [2022-11-27 05:39:17,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:39:17,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 05:39:17,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 24: [2022-11-27 05:39:17,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-27 05:39:17,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 05:39:17,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:39:17,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 24: [2022-11-27 05:39:17,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 16: [2022-11-27 05:39:17,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:39:17,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 16: [2022-11-27 05:39:17,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 05:39:17,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 28: [2022-11-27 05:39:17,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-27 05:39:17,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 8: [2022-11-27 05:39:17,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
22: [2022-11-27 05:39:17,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 05:39:17,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:39:17,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 22: [2022-11-27 05:39:17,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 05:39:17,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 8: [2022-11-27 05:39:17,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 22: [2022-11-27 05:39:17,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 22: [2022-11-27 05:39:17,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 11: [2022-11-27 05:39:17,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:39:17,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 05:39:17,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 2: [2022-11-27 05:39:17,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-27 05:39:17,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 05:39:17,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 0: [2022-11-27 05:39:17,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:39:17,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 17: [2022-11-27 05:39:17,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:39:17,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 3: [2022-11-27 05:39:17,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 0: [2022-11-27 05:39:17,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 3: [2022-11-27 05:39:17,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 17: [2022-11-27 05:39:17,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 05:39:17,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 5: [2022-11-27 05:39:17,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-27 05:39:17,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 05:39:17,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 20: [2022-11-27 05:39:17,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 28: [2022-11-27 05:39:17,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 21: [2022-11-27 05:39:17,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 20: [2022-11-27 05:39:17,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 21: [2022-11-27 05:39:17,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 05:39:17,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 20: [2022-11-27 05:39:17,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 4: [2022-11-27 05:39:17,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:39:17,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 05:39:17,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
25: [2022-11-27 05:39:17,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:39:17,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 05:39:17,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 2: [2022-11-27 05:39:17,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:39:17,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:39:17,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 7: [2022-11-27 05:39:17,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 2: [2022-11-27 05:39:17,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 31: [2022-11-27 05:39:17,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:39:17,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 31: [2022-11-27 05:39:17,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 05:39:17,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
18: [2022-11-27 05:39:17,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:39:17,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 05:39:17,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 27: [2022-11-27 05:39:17,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:39:17,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 05:39:17,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 21: [2022-11-27 05:39:17,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-27 05:39:17,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 05:39:17,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 31: [2022-11-27 05:39:17,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:39:17,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 
31: [2022-11-27 05:39:17,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:39:17,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 05:39:17,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 05:39:17,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 05:39:17,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 31: [2022-11-27 05:39:17,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 31: [2022-11-27 05:39:17,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 26: [2022-11-27 05:39:17,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:39:17,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 05:39:17,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 21: [2022-11-27 05:39:17,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-27 05:39:17,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
21: [2022-11-27 05:39:17,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 05:39:17,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 21: [2022-11-27 05:39:17,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 05:39:17,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 21: [2022-11-27 05:39:17,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-27 05:39:17,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-27 05:39:17,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-27 05:39:17,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 05:39:17,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-27 05:39:17,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 05:39:17,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 05:39:17,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 21: [2022-11-27 05:39:17,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 05:39:17,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 21: [2022-11-27 05:39:17,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 21: [2022-11-27 05:39:17,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 31: [2022-11-27 05:39:17,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:39:17,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 05:39:17,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 31: [2022-11-27 05:39:17,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:39:17,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 
31: [2022-11-27 05:39:17,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 05:39:17,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 05:39:17,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 31: [2022-11-27 05:39:17,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 31: [2022-11-27 05:39:17,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:39:17,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step158000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 05:39:17,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step158000 is ready now! 
0: successfully saved checkpoint at iteration 158000 to checkpoints_1b1long
31: time (ms) | save-checkpoint: 2563.15
31: iteration 158010/ 173500 | consumed samples: 40450560 | consumed tokens: 82842746880 | elapsed time per iteration (s): 1.00 | learning rate: 2.359E-05 | global batch size: 256 | lm loss: 1.921149E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 256.160 | TFLOPs: 15.50 |
31: iteration 158020/ 173500 | consumed samples: 40453120 | consumed tokens: 82847989760 | elapsed time per iteration (s): 0.73 | learning rate: 2.358E-05 | global batch size: 256 | lm loss: 1.922044E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.401 | TFLOPs: 21.14 |
31: iteration 158030/ 173500 | consumed samples: 40455680 | consumed tokens: 82853232640 | elapsed time per iteration (s): 0.76 | learning rate: 2.358E-05 | global batch size: 256 | lm loss: 1.889804E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.692 | TFLOPs: 20.37 |
31: iteration 158040/ 173500 | consumed samples: 40458240 | consumed tokens: 82858475520 | elapsed time per iteration (s): 0.77 | learning rate: 2.357E-05 | global batch size: 256 | lm loss: 1.923356E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.900 | TFLOPs: 20.20 |
31: iteration 158050/ 173500 | consumed samples: 40460800 | consumed tokens: 82863718400 | elapsed time per iteration (s): 0.81 | learning rate: 2.357E-05 | global batch size: 256 | lm loss: 1.887821E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.736 | TFLOPs: 19.22 |
31: iteration 158060/ 173500 | consumed samples: 40463360 | consumed tokens: 82868961280 | elapsed time per iteration (s): 0.76 | learning rate: 2.357E-05 | global batch size: 256 | lm loss: 1.930962E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.830 | TFLOPs: 20.26 |
31: iteration 158070/ 173500 | consumed samples: 40465920 | consumed tokens: 82874204160 | elapsed time per iteration (s): 0.87 | learning rate: 2.356E-05 | global batch size: 256 | lm loss: 1.933293E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.894 | TFLOPs: 17.72 |
31: iteration 158080/ 173500 | consumed samples: 40468480 | consumed tokens: 82879447040 | elapsed time per iteration (s): 0.77 | learning rate: 2.356E-05 | global batch size: 256 | lm loss: 1.884317E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.951 | TFLOPs: 20.02 |
31: iteration 158090/ 173500 | consumed samples: 40471040 | consumed tokens: 82884689920 | elapsed time per iteration (s): 0.78 | learning rate: 2.355E-05 | global batch size: 256 | lm loss: 1.918756E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.883 | TFLOPs: 19.84 |
31: iteration 158100/ 173500 | consumed samples: 40473600 | consumed tokens: 82889932800 | elapsed time per iteration (s): 0.81 | learning rate: 2.355E-05 | global batch size: 256 | lm loss: 1.947183E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.364 | TFLOPs: 19.14 |
31: iteration 158110/ 173500 | consumed samples: 40476160 | consumed tokens: 82895175680 | elapsed time per iteration (s): 0.74 | learning rate: 2.354E-05 | global batch size: 256 | lm loss: 1.908663E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.235 | TFLOPs: 20.95 |
31: iteration 158120/ 173500 | consumed samples: 40478720 | consumed tokens: 82900418560 | elapsed time per iteration (s): 0.75 | learning rate: 2.354E-05 | global batch size: 256 | lm loss: 1.899744E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.265 | TFLOPs: 20.59 |
31: iteration 158130/ 173500 | consumed samples: 40481280 | consumed tokens: 82905661440 | elapsed time per iteration (s): 0.74 | learning rate: 2.353E-05 | global batch size: 256 | lm loss: 1.924730E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.844 | TFLOPs: 20.98 |
31: iteration 158140/ 173500 | consumed samples: 40483840 | consumed tokens: 82910904320 | elapsed time per iteration (s): 0.77 | learning rate: 2.353E-05 | global batch size: 256 | lm loss: 1.930313E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.542 | TFLOPs: 20.24 |
31: iteration 158150/ 173500 | consumed samples: 40486400 | consumed tokens: 82916147200 | elapsed time per iteration (s): 0.79 | learning rate: 2.352E-05 | global batch size: 256 | lm loss: 1.903997E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.422 | TFLOPs: 19.51 |
31: iteration 158160/ 173500 | consumed samples: 40488960 | consumed tokens: 82921390080 | elapsed time per iteration (s): 0.78 | learning rate: 2.352E-05 | global batch size: 256 | lm loss: 1.910284E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.530 | TFLOPs: 19.94 |
31: iteration 158170/ 173500 | consumed samples: 40491520 | consumed tokens: 82926632960 | elapsed time per iteration (s): 0.72 | learning rate: 2.351E-05 | global batch size: 256 | lm loss: 1.899434E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.032 | TFLOPs: 21.66 |
31: iteration 158180/ 173500 | consumed samples: 40494080 | consumed tokens: 82931875840 | elapsed time per iteration (s): 0.72 | learning rate: 2.351E-05 | global batch size: 256 | lm loss: 1.954929E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.740 | TFLOPs: 21.40 |
31: iteration 158190/ 173500 | consumed samples: 40496640 | consumed tokens: 82937118720 | elapsed time per iteration (s): 0.89 | learning rate: 2.351E-05 | global batch size: 256 | lm loss: 1.915114E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.531 | TFLOPs: 17.46 |
31: iteration 158200/ 173500 | consumed samples: 40499200 | consumed tokens: 82942361600 | elapsed time per iteration (s): 0.74 | learning rate: 2.350E-05 | global batch size: 256 | lm loss: 1.904583E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.810 | TFLOPs: 20.92 |
31: iteration 158210/ 173500 | consumed samples: 40501760 | consumed tokens: 82947604480 | elapsed time per iteration (s): 0.78 | learning rate: 2.350E-05 | global batch size: 256 | lm loss: 1.918970E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.122 | TFLOPs: 19.79 |
31: iteration 158220/ 173500 | consumed samples: 40504320 | consumed tokens: 82952847360 | elapsed time per iteration (s): 0.78 | learning rate: 2.349E-05 | global batch size: 256 | lm loss: 1.902900E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.615 | TFLOPs: 19.76 |
31: iteration 158230/ 173500 | consumed samples: 40506880 | consumed tokens: 82958090240 | elapsed time per iteration (s): 0.79 |
learning rate: 2.349E-05 | global batch size: 256 | lm loss: 1.917399E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.922 | TFLOPs: 19.60 | 31: iteration 158240/ 173500 | consumed samples: 40509440 | consumed tokens: 82963333120 | elapsed time per iteration (s): 0.80 | learning rate: 2.348E-05 | global batch size: 256 | lm loss: 1.924112E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.298 | TFLOPs: 19.38 | 31: iteration 158250/ 173500 | consumed samples: 40512000 | consumed tokens: 82968576000 | elapsed time per iteration (s): 0.76 | learning rate: 2.348E-05 | global batch size: 256 | lm loss: 1.920566E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.135 | TFLOPs: 20.40 | 31: iteration 158260/ 173500 | consumed samples: 40514560 | consumed tokens: 82973818880 | elapsed time per iteration (s): 0.80 | learning rate: 2.347E-05 | global batch size: 256 | lm loss: 1.903841E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.060 | TFLOPs: 19.42 | 31: iteration 158270/ 173500 | consumed samples: 40517120 | consumed tokens: 82979061760 | elapsed time per iteration (s): 1.51 | learning rate: 2.347E-05 | global batch size: 256 | lm loss: 1.912140E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 169.170 | TFLOPs: 10.23 | 31: iteration 158280/ 173500 | consumed samples: 40519680 | consumed tokens: 82984304640 | elapsed time per iteration (s): 0.76 | learning rate: 2.346E-05 | global batch size: 256 | lm loss: 1.913337E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.554 | TFLOPs: 20.48 | 31: iteration 
158290/ 173500 | consumed samples: 40522240 | consumed tokens: 82989547520 | elapsed time per iteration (s): 0.76 | learning rate: 2.346E-05 | global batch size: 256 | lm loss: 1.918941E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.259 | TFLOPs: 20.34 | 31: iteration 158300/ 173500 | consumed samples: 40524800 | consumed tokens: 82994790400 | elapsed time per iteration (s): 0.79 | learning rate: 2.346E-05 | global batch size: 256 | lm loss: 1.911179E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.602 | TFLOPs: 19.52 | 31: iteration 158310/ 173500 | consumed samples: 40527360 | consumed tokens: 83000033280 | elapsed time per iteration (s): 0.74 | learning rate: 2.345E-05 | global batch size: 256 | lm loss: 1.939344E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.124 | TFLOPs: 20.88 | 31: iteration 158320/ 173500 | consumed samples: 40529920 | consumed tokens: 83005276160 | elapsed time per iteration (s): 0.74 | learning rate: 2.345E-05 | global batch size: 256 | lm loss: 1.918129E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.016 | TFLOPs: 21.05 | 31: iteration 158330/ 173500 | consumed samples: 40532480 | consumed tokens: 83010519040 | elapsed time per iteration (s): 0.85 | learning rate: 2.344E-05 | global batch size: 256 | lm loss: 1.901852E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.721 | TFLOPs: 18.13 | 31: iteration 158340/ 173500 | consumed samples: 40535040 | consumed tokens: 83015761920 | elapsed time per iteration (s): 0.73 | learning rate: 2.344E-05 | global batch size: 256 | lm loss: 1.911031E+00 | grad norm: 0.184 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.379 | TFLOPs: 21.14 | 31: iteration 158350/ 173500 | consumed samples: 40537600 | consumed tokens: 83021004800 | elapsed time per iteration (s): 0.76 | learning rate: 2.343E-05 | global batch size: 256 | lm loss: 1.905248E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.607 | TFLOPs: 20.36 | 31: iteration 158360/ 173500 | consumed samples: 40540160 | consumed tokens: 83026247680 | elapsed time per iteration (s): 0.77 | learning rate: 2.343E-05 | global batch size: 256 | lm loss: 1.936089E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.847 | TFLOPs: 20.08 | 31: iteration 158370/ 173500 | consumed samples: 40542720 | consumed tokens: 83031490560 | elapsed time per iteration (s): 1.01 | learning rate: 2.342E-05 | global batch size: 256 | lm loss: 1.904151E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 253.378 | TFLOPs: 15.33 | 31: iteration 158380/ 173500 | consumed samples: 40545280 | consumed tokens: 83036733440 | elapsed time per iteration (s): 0.81 | learning rate: 2.342E-05 | global batch size: 256 | lm loss: 1.923103E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.421 | TFLOPs: 19.08 | 31: iteration 158390/ 173500 | consumed samples: 40547840 | consumed tokens: 83041976320 | elapsed time per iteration (s): 0.87 | learning rate: 2.342E-05 | global batch size: 256 | lm loss: 1.943409E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.595 | TFLOPs: 17.88 | 31: iteration 158400/ 173500 | consumed samples: 40550400 | consumed tokens: 83047219200 | elapsed time per iteration (s): 0.80 | learning 
rate: 2.341E-05 | global batch size: 256 | lm loss: 1.927860E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.038 | TFLOPs: 19.30 | 31: iteration 158410/ 173500 | consumed samples: 40552960 | consumed tokens: 83052462080 | elapsed time per iteration (s): 0.82 | learning rate: 2.341E-05 | global batch size: 256 | lm loss: 1.904861E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.265 | TFLOPs: 18.83 | 31: iteration 158420/ 173500 | consumed samples: 40555520 | consumed tokens: 83057704960 | elapsed time per iteration (s): 0.81 | learning rate: 2.340E-05 | global batch size: 256 | lm loss: 1.893419E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.903 | TFLOPs: 19.17 | 31: iteration 158430/ 173500 | consumed samples: 40558080 | consumed tokens: 83062947840 | elapsed time per iteration (s): 0.80 | learning rate: 2.340E-05 | global batch size: 256 | lm loss: 1.937736E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.003 | TFLOPs: 19.48 | 31: iteration 158440/ 173500 | consumed samples: 40560640 | consumed tokens: 83068190720 | elapsed time per iteration (s): 0.81 | learning rate: 2.339E-05 | global batch size: 256 | lm loss: 1.931740E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.894 | TFLOPs: 19.11 | 31: iteration 158450/ 173500 | consumed samples: 40563200 | consumed tokens: 83073433600 | elapsed time per iteration (s): 0.81 | learning rate: 2.339E-05 | global batch size: 256 | lm loss: 1.919587E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.897 | TFLOPs: 19.11 | 31: iteration 158460/ 
173500 | consumed samples: 40565760 | consumed tokens: 83078676480 | elapsed time per iteration (s): 0.83 | learning rate: 2.338E-05 | global batch size: 256 | lm loss: 1.917313E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.532 | TFLOPs: 18.73 | 31: iteration 158470/ 173500 | consumed samples: 40568320 | consumed tokens: 83083919360 | elapsed time per iteration (s): 0.82 | learning rate: 2.338E-05 | global batch size: 256 | lm loss: 1.880135E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.278 | TFLOPs: 18.83 | 31: iteration 158480/ 173500 | consumed samples: 40570880 | consumed tokens: 83089162240 | elapsed time per iteration (s): 0.83 | learning rate: 2.338E-05 | global batch size: 256 | lm loss: 1.889219E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.548 | TFLOPs: 18.73 | 31: iteration 158490/ 173500 | consumed samples: 40573440 | consumed tokens: 83094405120 | elapsed time per iteration (s): 0.85 | learning rate: 2.337E-05 | global batch size: 256 | lm loss: 1.933253E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.865 | TFLOPs: 18.32 | 31: iteration 158500/ 173500 | consumed samples: 40576000 | consumed tokens: 83099648000 | elapsed time per iteration (s): 0.80 | learning rate: 2.337E-05 | global batch size: 256 | lm loss: 1.950331E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.255 | TFLOPs: 19.31 | 31: iteration 158510/ 173500 | consumed samples: 40578560 | consumed tokens: 83104890880 | elapsed time per iteration (s): 0.78 | learning rate: 2.336E-05 | global batch size: 256 | lm loss: 1.933446E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 329.771 | TFLOPs: 19.95 | 31: iteration 158520/ 173500 | consumed samples: 40581120 | consumed tokens: 83110133760 | elapsed time per iteration (s): 0.89 | learning rate: 2.336E-05 | global batch size: 256 | lm loss: 1.911602E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.965 | TFLOPs: 17.48 | 31: iteration 158530/ 173500 | consumed samples: 40583680 | consumed tokens: 83115376640 | elapsed time per iteration (s): 0.81 | learning rate: 2.335E-05 | global batch size: 256 | lm loss: 1.884871E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.453 | TFLOPs: 19.08 | 31: iteration 158540/ 173500 | consumed samples: 40586240 | consumed tokens: 83120619520 | elapsed time per iteration (s): 0.78 | learning rate: 2.335E-05 | global batch size: 256 | lm loss: 1.896103E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.508 | TFLOPs: 19.87 | 31: iteration 158550/ 173500 | consumed samples: 40588800 | consumed tokens: 83125862400 | elapsed time per iteration (s): 0.92 | learning rate: 2.334E-05 | global batch size: 256 | lm loss: 1.938931E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 278.939 | TFLOPs: 16.88 | 31: iteration 158560/ 173500 | consumed samples: 40591360 | consumed tokens: 83131105280 | elapsed time per iteration (s): 0.79 | learning rate: 2.334E-05 | global batch size: 256 | lm loss: 1.932380E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.733 | TFLOPs: 19.65 | 31: iteration 158570/ 173500 | consumed samples: 40593920 | consumed tokens: 83136348160 | elapsed time per iteration (s): 0.79 | learning rate: 
2.333E-05 | global batch size: 256 | lm loss: 1.910119E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.068 | TFLOPs: 19.54 | 31: iteration 158580/ 173500 | consumed samples: 40596480 | consumed tokens: 83141591040 | elapsed time per iteration (s): 0.81 | learning rate: 2.333E-05 | global batch size: 256 | lm loss: 1.905354E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.294 | TFLOPs: 19.07 | 31: iteration 158590/ 173500 | consumed samples: 40599040 | consumed tokens: 83146833920 | elapsed time per iteration (s): 0.78 | learning rate: 2.333E-05 | global batch size: 256 | lm loss: 1.920611E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.405 | TFLOPs: 19.87 | 31: iteration 158600/ 173500 | consumed samples: 40601600 | consumed tokens: 83152076800 | elapsed time per iteration (s): 0.78 | learning rate: 2.332E-05 | global batch size: 256 | lm loss: 1.942455E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.931 | TFLOPs: 19.78 | 31: iteration 158610/ 173500 | consumed samples: 40604160 | consumed tokens: 83157319680 | elapsed time per iteration (s): 0.78 | learning rate: 2.332E-05 | global batch size: 256 | lm loss: 1.950647E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.105 | TFLOPs: 19.91 | 31: iteration 158620/ 173500 | consumed samples: 40606720 | consumed tokens: 83162562560 | elapsed time per iteration (s): 0.94 | learning rate: 2.331E-05 | global batch size: 256 | lm loss: 1.899791E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 272.316 | TFLOPs: 16.47 | 31: iteration 158630/ 173500 | 
consumed samples: 40609280 | consumed tokens: 83167805440 | elapsed time per iteration (s): 0.81 | learning rate: 2.331E-05 | global batch size: 256 | lm loss: 1.905498E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.393 | TFLOPs: 19.14 | 31: iteration 158640/ 173500 | consumed samples: 40611840 | consumed tokens: 83173048320 | elapsed time per iteration (s): 0.99 | learning rate: 2.330E-05 | global batch size: 256 | lm loss: 1.911807E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 258.445 | TFLOPs: 15.64 | 31: iteration 158650/ 173500 | consumed samples: 40614400 | consumed tokens: 83178291200 | elapsed time per iteration (s): 0.92 | learning rate: 2.330E-05 | global batch size: 256 | lm loss: 1.915462E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 279.117 | TFLOPs: 16.89 | 31: iteration 158660/ 173500 | consumed samples: 40616960 | consumed tokens: 83183534080 | elapsed time per iteration (s): 0.82 | learning rate: 2.330E-05 | global batch size: 256 | lm loss: 1.894568E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.617 | TFLOPs: 18.85 | 31: iteration 158670/ 173500 | consumed samples: 40619520 | consumed tokens: 83188776960 | elapsed time per iteration (s): 0.80 | learning rate: 2.329E-05 | global batch size: 256 | lm loss: 1.925847E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.283 | TFLOPs: 19.26 | 31: iteration 158680/ 173500 | consumed samples: 40622080 | consumed tokens: 83194019840 | elapsed time per iteration (s): 0.94 | learning rate: 2.329E-05 | global batch size: 256 | lm loss: 1.885167E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 271.300 | TFLOPs: 16.41 | 31: iteration 158690/ 173500 | consumed samples: 40624640 | consumed tokens: 83199262720 | elapsed time per iteration (s): 0.82 | learning rate: 2.328E-05 | global batch size: 256 | lm loss: 1.929343E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.822 | TFLOPs: 18.99 | 31: iteration 158700/ 173500 | consumed samples: 40627200 | consumed tokens: 83204505600 | elapsed time per iteration (s): 0.83 | learning rate: 2.328E-05 | global batch size: 256 | lm loss: 1.923606E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.639 | TFLOPs: 18.55 | 31: iteration 158710/ 173500 | consumed samples: 40629760 | consumed tokens: 83209748480 | elapsed time per iteration (s): 0.93 | learning rate: 2.327E-05 | global batch size: 256 | lm loss: 1.896735E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 275.618 | TFLOPs: 16.67 | 31: iteration 158720/ 173500 | consumed samples: 40632320 | consumed tokens: 83214991360 | elapsed time per iteration (s): 0.82 | learning rate: 2.327E-05 | global batch size: 256 | lm loss: 1.923341E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.780 | TFLOPs: 18.98 | 31: iteration 158730/ 173500 | consumed samples: 40634880 | consumed tokens: 83220234240 | elapsed time per iteration (s): 0.85 | learning rate: 2.326E-05 | global batch size: 256 | lm loss: 1.888471E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.031 | TFLOPs: 18.21 | 31: iteration 158740/ 173500 | consumed samples: 40637440 | consumed tokens: 83225477120 | elapsed time per iteration (s): 0.81 | learning rate: 
2.326E-05 | global batch size: 256 | lm loss: 1.926948E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.759 | TFLOPs: 19.10 | 31: iteration 158750/ 173500 | consumed samples: 40640000 | consumed tokens: 83230720000 | elapsed time per iteration (s): 0.82 | learning rate: 2.326E-05 | global batch size: 256 | lm loss: 1.929981E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.243 | TFLOPs: 18.89 | 31: iteration 158760/ 173500 | consumed samples: 40642560 | consumed tokens: 83235962880 | elapsed time per iteration (s): 0.85 | learning rate: 2.325E-05 | global batch size: 256 | lm loss: 1.893431E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.813 | TFLOPs: 18.14 | 31: iteration 158770/ 173500 | consumed samples: 40645120 | consumed tokens: 83241205760 | elapsed time per iteration (s): 0.78 | learning rate: 2.325E-05 | global batch size: 256 | lm loss: 1.926295E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.831 | TFLOPs: 19.83 | 31: iteration 158780/ 173500 | consumed samples: 40647680 | consumed tokens: 83246448640 | elapsed time per iteration (s): 0.82 | learning rate: 2.324E-05 | global batch size: 256 | lm loss: 1.909245E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.604 | TFLOPs: 18.85 | 31: iteration 158790/ 173500 | consumed samples: 40650240 | consumed tokens: 83251691520 | elapsed time per iteration (s): 0.82 | learning rate: 2.324E-05 | global batch size: 256 | lm loss: 1.905476E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.869 | TFLOPs: 18.81 | 31: iteration 158800/ 173500 | 
consumed samples: 40652800 | consumed tokens: 83256934400 | elapsed time per iteration (s): 0.87 | learning rate: 2.323E-05 | global batch size: 256 | lm loss: 1.939745E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.598 | TFLOPs: 17.76 | 31: iteration 158810/ 173500 | consumed samples: 40655360 | consumed tokens: 83262177280 | elapsed time per iteration (s): 0.78 | learning rate: 2.323E-05 | global batch size: 256 | lm loss: 1.896785E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.529 | TFLOPs: 19.75 | 31: iteration 158820/ 173500 | consumed samples: 40657920 | consumed tokens: 83267420160 | elapsed time per iteration (s): 0.83 | learning rate: 2.322E-05 | global batch size: 256 | lm loss: 1.932262E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.642 | TFLOPs: 18.67 | 31: iteration 158830/ 173500 | consumed samples: 40660480 | consumed tokens: 83272663040 | elapsed time per iteration (s): 0.80 | learning rate: 2.322E-05 | global batch size: 256 | lm loss: 1.932730E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.574 | TFLOPs: 19.45 | 31: iteration 158840/ 173500 | consumed samples: 40663040 | consumed tokens: 83277905920 | elapsed time per iteration (s): 0.80 | learning rate: 2.322E-05 | global batch size: 256 | lm loss: 1.898713E+00 | grad norm: 0.217 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.132 | TFLOPs: 19.31 | 31: iteration 158850/ 173500 | consumed samples: 40665600 | consumed tokens: 83283148800 | elapsed time per iteration (s): 0.84 | learning rate: 2.321E-05 | global batch size: 256 | lm loss: 1.922660E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 306.585 | TFLOPs: 18.55 | 31: iteration 158860/ 173500 | consumed samples: 40668160 | consumed tokens: 83288391680 | elapsed time per iteration (s): 0.79 | learning rate: 2.321E-05 | global batch size: 256 | lm loss: 1.917867E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.074 | TFLOPs: 19.55 | 31: iteration 158870/ 173500 | consumed samples: 40670720 | consumed tokens: 83293634560 | elapsed time per iteration (s): 0.84 | learning rate: 2.320E-05 | global batch size: 256 | lm loss: 1.917481E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.263 | TFLOPs: 18.41 | 31: iteration 158880/ 173500 | consumed samples: 40673280 | consumed tokens: 83298877440 | elapsed time per iteration (s): 0.86 | learning rate: 2.320E-05 | global batch size: 256 | lm loss: 1.912424E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.619 | TFLOPs: 18.07 | 31: iteration 158890/ 173500 | consumed samples: 40675840 | consumed tokens: 83304120320 | elapsed time per iteration (s): 0.79 | learning rate: 2.319E-05 | global batch size: 256 | lm loss: 1.924060E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.338 | TFLOPs: 19.50 | 31: iteration 158900/ 173500 | consumed samples: 40678400 | consumed tokens: 83309363200 | elapsed time per iteration (s): 0.81 | learning rate: 2.319E-05 | global batch size: 256 | lm loss: 1.918242E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.824 | TFLOPs: 19.11 | 31: iteration 158910/ 173500 | consumed samples: 40680960 | consumed tokens: 83314606080 | elapsed time per iteration (s): 0.80 | learning rate: 
2.319E-05 | global batch size: 256 | lm loss: 1.917547E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.215 | TFLOPs: 19.37 | 31: iteration 158920/ 173500 | consumed samples: 40683520 | consumed tokens: 83319848960 | elapsed time per iteration (s): 0.84 | learning rate: 2.318E-05 | global batch size: 256 | lm loss: 1.906349E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.976 | TFLOPs: 18.45 | 31: iteration 158930/ 173500 | consumed samples: 40686080 | consumed tokens: 83325091840 | elapsed time per iteration (s): 0.80 | learning rate: 2.318E-05 | global batch size: 256 | lm loss: 1.908245E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.063 | TFLOPs: 19.30 | 31: iteration 158940/ 173500 | consumed samples: 40688640 | consumed tokens: 83330334720 | elapsed time per iteration (s): 0.93 | learning rate: 2.317E-05 | global batch size: 256 | lm loss: 1.928740E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 274.609 | TFLOPs: 16.61 | 31: iteration 158950/ 173500 | consumed samples: 40691200 | consumed tokens: 83335577600 | elapsed time per iteration (s): 0.84 | learning rate: 2.317E-05 | global batch size: 256 | lm loss: 1.947509E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.771 | TFLOPs: 18.44 | 31: iteration 158960/ 173500 | consumed samples: 40693760 | consumed tokens: 83340820480 | elapsed time per iteration (s): 0.98 | learning rate: 2.316E-05 | global batch size: 256 | lm loss: 1.914029E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 261.199 | TFLOPs: 15.80 | 31: iteration 158970/ 173500 | 
consumed samples: 40696320 | consumed tokens: 83346063360 | elapsed time per iteration (s): 0.84 | learning rate: 2.316E-05 | global batch size: 256 | lm loss: 1.913161E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.592 | TFLOPs: 18.43 |
31: iteration 158980/ 173500 | consumed samples: 40698880 | consumed tokens: 83351306240 | elapsed time per iteration (s): 0.83 | learning rate: 2.316E-05 | global batch size: 256 | lm loss: 1.905403E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.009 | TFLOPs: 18.63 |
31: iteration 158990/ 173500 | consumed samples: 40701440 | consumed tokens: 83356549120 | elapsed time per iteration (s): 0.78 | learning rate: 2.315E-05 | global batch size: 256 | lm loss: 1.906499E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.830 | TFLOPs: 19.89 |
31: iteration 159000/ 173500 | consumed samples: 40704000 | consumed tokens: 83361792000 | elapsed time per iteration (s): 0.77 | learning rate: 2.315E-05 | global batch size: 256 | lm loss: 1.933634E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.341 | TFLOPs: 19.98 |
31: --------------------------------------------------------------------------------------------
31: valid loss at iteration 159000 | lm loss value: 1.792614E+00 | lm loss PPL: 6.005131E+00 |
31: --------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 159000 to checkpoints_1b1long
0: [2022-11-27 05:52:57,167] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step159000 is begin to save!
0: [2022-11-27 05:52:57,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_01-model_00-model_states.pt...
0: [2022-11-27 05:52:57,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_01-model_00-model_states.pt.
0: [2022-11-27 05:52:57,433] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_03-model_00-model_states.pt...
0: [2022-11-27 05:52:57,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_03-model_00-model_states.pt.
0: [2022-11-27 05:52:57,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_04-model_00-model_states.pt...
0: [2022-11-27 05:52:57,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_04-model_00-model_states.pt.
0: [2022-11-27 05:52:57,590] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_05-model_00-model_states.pt...
0: [2022-11-27 05:52:57,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_05-model_00-model_states.pt.
0: [2022-11-27 05:52:57,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_06-model_00-model_states.pt...
0: [2022-11-27 05:52:57,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_06-model_00-model_states.pt.
0: [2022-11-27 05:52:57,745] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_07-model_00-model_states.pt...
0: [2022-11-27 05:52:57,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_07-model_00-model_states.pt.
0: [2022-11-27 05:52:57,819] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_08-model_00-model_states.pt...
0: [2022-11-27 05:52:57,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_08-model_00-model_states.pt.
0: [2022-11-27 05:52:57,904] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_09-model_00-model_states.pt...
0: [2022-11-27 05:52:57,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_09-model_00-model_states.pt.
0: [2022-11-27 05:52:57,982] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_10-model_00-model_states.pt...
0: [2022-11-27 05:52:58,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_10-model_00-model_states.pt.
0: [2022-11-27 05:52:58,057] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_11-model_00-model_states.pt...
0: [2022-11-27 05:52:58,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_11-model_00-model_states.pt.
0: [2022-11-27 05:52:58,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_12-model_00-model_states.pt...
0: [2022-11-27 05:52:58,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_12-model_00-model_states.pt.
0: [2022-11-27 05:52:58,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_13-model_00-model_states.pt...
0: [2022-11-27 05:52:58,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_13-model_00-model_states.pt.
0: [2022-11-27 05:52:58,283] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_14-model_00-model_states.pt...
0: [2022-11-27 05:52:58,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_14-model_00-model_states.pt.
0: [2022-11-27 05:52:58,356] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_15-model_00-model_states.pt...
0: [2022-11-27 05:52:58,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_15-model_00-model_states.pt.
0: [2022-11-27 05:52:58,433] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_16-model_00-model_states.pt...
0: [2022-11-27 05:52:58,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_16-model_00-model_states.pt.
0: [2022-11-27 05:52:58,510] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_17-model_00-model_states.pt...
0: [2022-11-27 05:52:58,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_17-model_00-model_states.pt.
0: [2022-11-27 05:52:58,586] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_18-model_00-model_states.pt...
0: [2022-11-27 05:52:58,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_18-model_00-model_states.pt.
0: [2022-11-27 05:52:58,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_19-model_00-model_states.pt...
0: [2022-11-27 05:52:58,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_19-model_00-model_states.pt.
0: [2022-11-27 05:52:58,736] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_20-model_00-model_states.pt... 0: [2022-11-27 05:52:58,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_20-model_00-model_states.pt. 0: [2022-11-27 05:52:58,814] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_21-model_00-model_states.pt... 0: [2022-11-27 05:52:58,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_21-model_00-model_states.pt. 0: [2022-11-27 05:52:58,889] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_22-model_00-model_states.pt... 0: [2022-11-27 05:52:58,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_22-model_00-model_states.pt. 0: [2022-11-27 05:52:58,968] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_23-model_00-model_states.pt... 0: [2022-11-27 05:52:59,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_23-model_00-model_states.pt. 0: [2022-11-27 05:52:59,042] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_24-model_00-model_states.pt... 0: [2022-11-27 05:52:59,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_24-model_00-model_states.pt. 0: [2022-11-27 05:52:59,126] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_25-model_00-model_states.pt... 0: [2022-11-27 05:52:59,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_25-model_00-model_states.pt. 
0: [2022-11-27 05:52:59,207] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_26-model_00-model_states.pt... 0: [2022-11-27 05:52:59,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_26-model_00-model_states.pt. 0: [2022-11-27 05:52:59,283] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_27-model_00-model_states.pt... 0: [2022-11-27 05:52:59,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_27-model_00-model_states.pt. 0: [2022-11-27 05:52:59,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_28-model_00-model_states.pt... 0: [2022-11-27 05:52:59,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_28-model_00-model_states.pt. 0: [2022-11-27 05:52:59,434] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/layer_30-model_00-model_states.pt... 0: [2022-11-27 05:52:59,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/layer_30-model_00-model_states.pt. 0: [2022-11-27 05:52:59,437] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step159000/mp_rank_00_model_states.pt 0: [2022-11-27 05:52:59,437] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/mp_rank_00_model_states.pt... 0: [2022-11-27 05:52:59,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/mp_rank_00_model_states.pt. 31: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 
6: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
7: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
9: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 16: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
2: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
12: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 
25: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 
24: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 
30: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 
27: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
7: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 
1: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
3: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 
11: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 
14: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 
17: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 
26: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
9: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
3: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
14: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 
18: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
16: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 31: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 
31: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 
6: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 23: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 24: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 29: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 30: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 20: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 
24: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 18: [2022-11-27 05:52:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:52:59,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 17: [2022-11-27 05:52:59,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-27 05:52:59,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 05:52:59,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 16: [2022-11-27 05:52:59,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:52:59,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 05:52:59,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 25: [2022-11-27 05:52:59,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
25: [2022-11-27 05:52:59,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-27 05:52:59,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 27: [2022-11-27 05:52:59,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:52:59,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 05:52:59,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 29: [2022-11-27 05:52:59,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:52:59,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 05:52:59,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 0: [2022-11-27 05:52:59,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:52:59,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 05:52:59,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 10: [2022-11-27 05:52:59,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-27 05:52:59,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 05:52:59,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 21: [2022-11-27 05:52:59,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-27 05:52:59,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 13: [2022-11-27 05:52:59,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:52:59,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 05:52:59,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 21: [2022-11-27 05:52:59,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 6: [2022-11-27 05:52:59,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:52:59,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
6: [2022-11-27 05:52:59,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 9: [2022-11-27 05:52:59,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 6: [2022-11-27 05:52:59,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 9: [2022-11-27 05:52:59,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 18: [2022-11-27 05:52:59,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:52:59,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 31: [2022-11-27 05:52:59,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:52:59,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 20: [2022-11-27 05:52:59,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:52:59,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 20: [2022-11-27 05:52:59,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 31: [2022-11-27 05:52:59,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 
20: [2022-11-27 05:52:59,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 13: [2022-11-27 05:52:59,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:52:59,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 05:52:59,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 2: [2022-11-27 05:52:59,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:52:59,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:52:59,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 05:52:59,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 24: [2022-11-27 05:52:59,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 05:52:59,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 17: [2022-11-27 05:52:59,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 
17: [2022-11-27 05:52:59,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 05:52:59,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 20: [2022-11-27 05:52:59,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 05:52:59,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 6: [2022-11-27 05:52:59,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:52:59,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 05:52:59,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 20: [2022-11-27 05:52:59,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 2: [2022-11-27 05:52:59,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:52:59,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 05:52:59,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 29: [2022-11-27 05:52:59,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
29: [2022-11-27 05:52:59,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 05:52:59,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 2: [2022-11-27 05:52:59,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:52:59,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 05:52:59,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 1: [2022-11-27 05:52:59,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:52:59,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:52:59,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 05:52:59,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 30: [2022-11-27 05:52:59,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 05:52:59,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 30: [2022-11-27 05:52:59,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-27 05:52:59,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 05:52:59,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 10: [2022-11-27 05:52:59,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:52:59,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 05:52:59,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 18: [2022-11-27 05:52:59,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:52:59,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:52:59,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 16: [2022-11-27 05:52:59,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 18: [2022-11-27 05:52:59,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 16: [2022-11-27 05:52:59,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 25: [2022-11-27 05:52:59,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
25: [2022-11-27 05:52:59,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 05:52:59,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 24: [2022-11-27 05:52:59,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:52:59,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:52:59,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-27 05:52:59,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 31: [2022-11-27 05:52:59,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 05:52:59,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 5: [2022-11-27 05:52:59,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:52:59,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 05:52:59,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 0: [2022-11-27 05:52:59,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
12: [2022-11-27 05:52:59,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:52:59,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:52:59,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 12: [2022-11-27 05:52:59,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 05:52:59,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 05:52:59,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 12: [2022-11-27 05:52:59,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 0: [2022-11-27 05:52:59,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 17: [2022-11-27 05:52:59,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:52:59,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:52:59,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 
17: [2022-11-27 05:52:59,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 27: [2022-11-27 05:52:59,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 17: [2022-11-27 05:52:59,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 27: [2022-11-27 05:52:59,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 30: [2022-11-27 05:52:59,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 05:52:59,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 29: [2022-11-27 05:52:59,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:52:59,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 05:52:59,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 26: [2022-11-27 05:52:59,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:52:59,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 05:52:59,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 
6: [2022-11-27 05:52:59,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:52:59,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 05:52:59,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 13: [2022-11-27 05:52:59,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:52:59,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 05:52:59,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 9: [2022-11-27 05:52:59,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:52:59,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:52:59,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
9: [2022-11-27 05:52:59,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 5: [2022-11-27 05:52:59,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 9: [2022-11-27 05:52:59,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 05:52:59,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 9: [2022-11-27 05:52:59,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 5: [2022-11-27 05:52:59,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 11: [2022-11-27 05:52:59,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:52:59,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:52:59,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 05:52:59,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 05:52:59,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 11: [2022-11-27 05:52:59,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 
11: [2022-11-27 05:52:59,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:52:59,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 05:52:59,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 18: [2022-11-27 05:52:59,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:52:59,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 05:52:59,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 16: [2022-11-27 05:52:59,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:52:59,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:52:59,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 13: [2022-11-27 05:52:59,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
25: [2022-11-27 05:52:59,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 16: [2022-11-27 05:52:59,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 25: [2022-11-27 05:52:59,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 13: [2022-11-27 05:52:59,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 05:52:59,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 24: [2022-11-27 05:52:59,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 20: [2022-11-27 05:52:59,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:52:59,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 05:52:59,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 2: [2022-11-27 05:52:59,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-27 05:52:59,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 20: [2022-11-27 05:52:59,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 2: [2022-11-27 05:52:59,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 20: [2022-11-27 05:52:59,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 18: [2022-11-27 05:52:59,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:52:59,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 10: [2022-11-27 05:52:59,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 21: [2022-11-27 05:52:59,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:52:59,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 10: [2022-11-27 05:52:59,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 21: [2022-11-27 05:52:59,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 10: [2022-11-27 05:52:59,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 
21: [2022-11-27 05:52:59,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 12: [2022-11-27 05:52:59,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:52:59,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 05:52:59,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 22: [2022-11-27 05:52:59,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-27 05:52:59,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 05:52:59,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 1: [2022-11-27 05:52:59,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:52:59,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 05:52:59,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 1: [2022-11-27 05:52:59,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-27 05:52:59,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 17: [2022-11-27 05:52:59,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:52:59,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:52:59,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 17: [2022-11-27 05:52:59,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 1: [2022-11-27 05:52:59,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 17: [2022-11-27 05:52:59,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 1: [2022-11-27 05:52:59,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 30: [2022-11-27 05:52:59,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:52:59,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 05:52:59,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 14: [2022-11-27 05:52:59,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-27 05:52:59,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 20: [2022-11-27 05:52:59,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:52:59,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 20: [2022-11-27 05:52:59,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 05:52:59,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 11: [2022-11-27 05:52:59,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:52:59,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 05:52:59,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 24: [2022-11-27 05:52:59,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:52:59,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
24: [2022-11-27 05:52:59,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 26: [2022-11-27 05:52:59,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 05:52:59,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 26: [2022-11-27 05:52:59,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:52:59,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 26: [2022-11-27 05:52:59,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 05:52:59,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 16: [2022-11-27 05:52:59,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:52:59,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 05:52:59,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 6: [2022-11-27 05:52:59,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:52:59,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
6: [2022-11-27 05:52:59,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 05:52:59,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 27: [2022-11-27 05:52:59,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 05:52:59,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 10: [2022-11-27 05:52:59,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:52:59,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 05:52:59,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 26: [2022-11-27 05:52:59,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:52:59,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 05:52:59,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 12: [2022-11-27 05:52:59,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-27 05:52:59,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 05:52:59,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 9: [2022-11-27 05:52:59,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:52:59,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 05:52:59,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 14: [2022-11-27 05:52:59,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:52:59,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:52:59,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 05:52:59,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 05:52:59,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 14: [2022-11-27 05:52:59,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 22: [2022-11-27 05:52:59,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 
22: [2022-11-27 05:52:59,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 05:52:59,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 05:52:59,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 31: [2022-11-27 05:52:59,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 22: [2022-11-27 05:52:59,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 31: [2022-11-27 05:52:59,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 22: [2022-11-27 05:52:59,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 31: [2022-11-27 05:52:59,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 1: [2022-11-27 05:52:59,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:52:59,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 05:52:59,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 27: [2022-11-27 05:52:59,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
5: [2022-11-27 05:52:59,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:52:59,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 27: [2022-11-27 05:52:59,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 5: [2022-11-27 05:52:59,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 27: [2022-11-27 05:52:59,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 5: [2022-11-27 05:52:59,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:52:59,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:52:59,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 05:52:59,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 0: [2022-11-27 05:52:59,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:52:59,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 05:52:59,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 
21: [2022-11-27 05:52:59,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-27 05:52:59,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 05:52:59,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 05:52:59,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 21: [2022-11-27 05:52:59,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 05:52:59,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 3: [2022-11-27 05:52:59,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:52:59,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:52:59,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-27 05:52:59,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 05:52:59,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 05:52:59,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 05:52:59,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 3: [2022-11-27 05:52:59,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 3: [2022-11-27 05:52:59,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 29: [2022-11-27 05:52:59,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:52:59,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 05:52:59,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 7: [2022-11-27 05:52:59,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:52:59,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-27 05:52:59,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 05:52:59,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 05:52:59,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 7: [2022-11-27 05:52:59,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 7: [2022-11-27 05:52:59,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:52:59,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 05:52:59,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 31: [2022-11-27 05:52:59,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:52:59,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 05:52:59,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 23: [2022-11-27 05:52:59,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:52:59,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-27 05:52:59,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:52:59,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:52:59,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 05:52:59,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 05:52:59,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 05:52:59,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 05:52:59,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 23: [2022-11-27 05:52:59,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 23: [2022-11-27 05:52:59,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 23: [2022-11-27 05:52:59,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 5: [2022-11-27 05:52:59,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 05:52:59,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 
5: [2022-11-27 05:52:59,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:52:59,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 05:52:59,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 4: [2022-11-27 05:52:59,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:52:59,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:52:59,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:52:59,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 05:52:59,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 05:52:59,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 05:52:59,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 4: [2022-11-27 05:52:59,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 4: [2022-11-27 05:52:59,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 
8: [2022-11-27 05:52:59,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:52:59,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:52:59,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:52:59,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:52:59,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:52:59,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 11: [2022-11-27 05:52:59,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 8: [2022-11-27 05:52:59,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 05:52:59,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 11: [2022-11-27 05:52:59,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 
8: [2022-11-27 05:52:59,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 05:52:59,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 8: [2022-11-27 05:52:59,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 8: [2022-11-27 05:52:59,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 8: [2022-11-27 05:52:59,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 4: [2022-11-27 05:52:59,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:52:59,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 05:52:59,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 28: [2022-11-27 05:52:59,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-27 05:52:59,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-27 05:52:59,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-27 05:52:59,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
28: [2022-11-27 05:52:59,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 05:52:59,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 05:52:59,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 28: [2022-11-27 05:52:59,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 05:52:59,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 05:52:59,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 0: [2022-11-27 05:52:59,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 28: [2022-11-27 05:52:59,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 28: [2022-11-27 05:52:59,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 0: [2022-11-27 05:52:59,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:52:59,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 
0: [2022-11-27 05:52:59,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 05:52:59,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 25: [2022-11-27 05:52:59,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:52:59,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 05:52:59,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 19: [2022-11-27 05:52:59,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:52:59,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:52:59,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:52:59,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:52:59,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 
19: [2022-11-27 05:52:59,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-27 05:52:59,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 05:52:59,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 05:52:59,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 19: [2022-11-27 05:52:59,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 19: [2022-11-27 05:52:59,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 05:52:59,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 05:52:59,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 19: [2022-11-27 05:52:59,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 19: [2022-11-27 05:52:59,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 2: [2022-11-27 05:52:59,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-27 05:52:59,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 05:52:59,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 15: [2022-11-27 05:52:59,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:52:59,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:52:59,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:52:59,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:52:59,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-27 05:52:59,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 05:52:59,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 05:52:59,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 05:52:59,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 05:52:59,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 05:52:59,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 15: [2022-11-27 05:52:59,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 15: [2022-11-27 05:52:59,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 15: [2022-11-27 05:52:59,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 15: [2022-11-27 05:52:59,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 12: [2022-11-27 05:52:59,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-27 05:52:59,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 05:52:59,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 29: [2022-11-27 05:52:59,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:52:59,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 05:52:59,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 30: [2022-11-27 05:52:59,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:52:59,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 05:52:59,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 8: [2022-11-27 05:52:59,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:52:59,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 05:52:59,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 18: [2022-11-27 05:52:59,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
18: [2022-11-27 05:52:59,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 05:52:59,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 21: [2022-11-27 05:52:59,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-27 05:52:59,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 05:52:59,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 13: [2022-11-27 05:52:59,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:52:59,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 05:52:59,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 16: [2022-11-27 05:52:59,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:52:59,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 28: [2022-11-27 05:52:59,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:52:59,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 
23: [2022-11-27 05:52:59,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:52:59,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 05:52:59,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 28: [2022-11-27 05:52:59,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 05:52:59,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 6: [2022-11-27 05:52:59,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:52:59,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 05:52:59,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 14: [2022-11-27 05:52:59,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:52:59,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 05:52:59,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 9: [2022-11-27 05:52:59,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-27 05:52:59,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 05:52:59,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 20: [2022-11-27 05:52:59,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-27 05:52:59,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 05:52:59,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 3: [2022-11-27 05:52:59,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:52:59,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 05:52:59,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 1: [2022-11-27 05:52:59,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:52:59,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 05:52:59,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 24: [2022-11-27 05:52:59,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-27 05:52:59,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 05:52:59,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 31: [2022-11-27 05:52:59,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:52:59,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 05:52:59,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 7: [2022-11-27 05:52:59,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:52:59,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 05:52:59,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 22: [2022-11-27 05:52:59,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-27 05:52:59,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 05:52:59,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 17: [2022-11-27 05:52:59,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-27 05:52:59,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 05:52:59,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 27: [2022-11-27 05:52:59,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:52:59,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 10: [2022-11-27 05:52:59,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:52:59,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 10: [2022-11-27 05:52:59,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 05:52:59,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 26: [2022-11-27 05:52:59,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:52:59,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 05:52:59,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 5: [2022-11-27 05:52:59,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-27 05:52:59,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 05:52:59,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 4: [2022-11-27 05:52:59,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:52:59,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:52:59,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 19: [2022-11-27 05:52:59,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 4: [2022-11-27 05:52:59,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 19: [2022-11-27 05:52:59,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 25: [2022-11-27 05:52:59,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:52:59,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 05:52:59,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 11: [2022-11-27 05:52:59,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-27 05:52:59,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 05:52:59,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 12: [2022-11-27 05:52:59,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:52:59,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 05:52:59,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 2: [2022-11-27 05:52:59,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:52:59,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 05:52:59,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 15: [2022-11-27 05:52:59,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:52:59,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 05:52:59,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 18: [2022-11-27 05:52:59,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
18: [2022-11-27 05:52:59,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-27 05:52:59,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 30: [2022-11-27 05:52:59,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:52:59,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:52:59,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 30: [2022-11-27 05:52:59,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 0: [2022-11-27 05:52:59,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:52:59,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 30: [2022-11-27 05:52:59,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 0: [2022-11-27 05:52:59,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 05:52:59,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 29: [2022-11-27 05:52:59,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
29: [2022-11-27 05:52:59,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 05:52:59,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 28: [2022-11-27 05:52:59,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-27 05:52:59,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 05:52:59,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 20: [2022-11-27 05:52:59,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-27 05:52:59,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 05:52:59,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 21: [2022-11-27 05:52:59,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 05:52:59,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 05:52:59,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 13: [2022-11-27 05:52:59,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-27 05:52:59,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 05:52:59,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 8: [2022-11-27 05:52:59,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:52:59,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 05:52:59,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 23: [2022-11-27 05:52:59,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:52:59,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 6: [2022-11-27 05:52:59,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:52:59,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 6: [2022-11-27 05:52:59,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 05:52:59,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 3: [2022-11-27 05:52:59,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
9: [2022-11-27 05:52:59,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:52:59,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 3: [2022-11-27 05:52:59,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 05:52:59,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 9: [2022-11-27 05:52:59,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 24: [2022-11-27 05:52:59,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:52:59,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 05:52:59,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 31: [2022-11-27 05:52:59,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:52:59,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 05:52:59,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 22: [2022-11-27 05:52:59,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
22: [2022-11-27 05:52:59,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 05:52:59,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 16: [2022-11-27 05:52:59,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:52:59,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 05:52:59,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 26: [2022-11-27 05:52:59,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:52:59,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 05:52:59,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 7: [2022-11-27 05:52:59,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:52:59,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 05:52:59,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 17: [2022-11-27 05:52:59,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-27 05:52:59,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 05:52:59,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 0: [2022-11-27 05:52:59,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:52:59,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 27: [2022-11-27 05:52:59,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:52:59,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 27: [2022-11-27 05:52:59,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 05:52:59,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 1: [2022-11-27 05:52:59,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:52:59,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 05:52:59,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 10: [2022-11-27 05:52:59,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-27 05:52:59,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 05:52:59,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 5: [2022-11-27 05:52:59,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:52:59,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-27 05:52:59,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 05:52:59,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 4: [2022-11-27 05:52:59,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:52:59,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 05:52:59,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 11: [2022-11-27 05:52:59,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:52:59,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 05:52:59,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 
5: [2022-11-27 05:52:59,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 05:52:59,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 25: [2022-11-27 05:52:59,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:52:59,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 05:52:59,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 12: [2022-11-27 05:52:59,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:52:59,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 05:52:59,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 2: [2022-11-27 05:52:59,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:52:59,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 05:52:59,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 30: [2022-11-27 05:52:59,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 
30: [2022-11-27 05:52:59,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 05:52:59,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 15: [2022-11-27 05:52:59,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:52:59,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 05:52:59,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 23: [2022-11-27 05:52:59,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:52:59,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 05:52:59,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 16: [2022-11-27 05:52:59,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:52:59,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 05:52:59,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 21: [2022-11-27 05:52:59,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-27 05:52:59,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 05:52:59,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 13: [2022-11-27 05:52:59,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:52:59,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 05:52:59,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 29: [2022-11-27 05:52:59,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:52:59,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 05:52:59,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 8: [2022-11-27 05:52:59,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:52:59,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 05:52:59,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 18: [2022-11-27 05:52:59,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
18: [2022-11-27 05:52:59,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 05:52:59,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 6: [2022-11-27 05:52:59,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:52:59,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 05:52:59,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 28: [2022-11-27 05:52:59,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-27 05:52:59,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 05:52:59,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 20: [2022-11-27 05:52:59,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:52:59,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:52:59,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
20: [2022-11-27 05:52:59,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 14: [2022-11-27 05:52:59,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 20: [2022-11-27 05:52:59,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 24: [2022-11-27 05:52:59,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 14: [2022-11-27 05:52:59,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 24: [2022-11-27 05:52:59,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 28: [2022-11-27 05:52:59,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:52:59,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:52:59,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 05:52:59,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 28: [2022-11-27 05:52:59,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 05:52:59,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 
1: [2022-11-27 05:52:59,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:52:59,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 05:52:59,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 23: [2022-11-27 05:52:59,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-27 05:52:59,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 05:52:59,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 25: [2022-11-27 05:52:59,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 16: [2022-11-27 05:52:59,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 25: [2022-11-27 05:52:59,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 16: [2022-11-27 05:52:59,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 2: [2022-11-27 05:52:59,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
25: [2022-11-27 05:52:59,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 16: [2022-11-27 05:52:59,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 2: [2022-11-27 05:52:59,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 05:52:59,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 30: [2022-11-27 05:52:59,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 05:52:59,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 05:52:59,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 18: [2022-11-27 05:52:59,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:52:59,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 22: [2022-11-27 05:52:59,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 18: [2022-11-27 05:52:59,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 
22: [2022-11-27 05:52:59,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 05:52:59,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 0: [2022-11-27 05:52:59,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:52:59,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 19: [2022-11-27 05:52:59,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:52:59,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 19: [2022-11-27 05:52:59,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 05:52:59,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 17: [2022-11-27 05:52:59,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:52:59,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:52:59,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
17: [2022-11-27 05:52:59,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 05:52:59,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 12: [2022-11-27 05:52:59,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 11: [2022-11-27 05:52:59,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 12: [2022-11-27 05:52:59,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 11: [2022-11-27 05:52:59,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 10: [2022-11-27 05:52:59,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:52:59,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:52:59,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 05:52:59,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 31: [2022-11-27 05:52:59,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:52:59,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
14: [2022-11-27 05:52:59,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 05:52:59,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 31: [2022-11-27 05:52:59,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 8: [2022-11-27 05:52:59,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 31: [2022-11-27 05:52:59,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 8: [2022-11-27 05:52:59,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 6: [2022-11-27 05:52:59,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 17: [2022-11-27 05:52:59,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:52:59,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:52:59,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
6: [2022-11-27 05:52:59,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 3: [2022-11-27 05:52:59,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 17: [2022-11-27 05:52:59,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 26: [2022-11-27 05:52:59,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 6: [2022-11-27 05:52:59,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 5: [2022-11-27 05:52:59,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:52:59,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 17: [2022-11-27 05:52:59,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 26: [2022-11-27 05:52:59,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 4: [2022-11-27 05:52:59,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
5: [2022-11-27 05:52:59,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 4: [2022-11-27 05:52:59,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 5: [2022-11-27 05:52:59,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 4: [2022-11-27 05:52:59,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 27: [2022-11-27 05:52:59,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:52:59,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 05:52:59,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 13: [2022-11-27 05:52:59,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:52:59,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:52:59,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 05:52:59,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 
4: [2022-11-27 05:52:59,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 05:52:59,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 24: [2022-11-27 05:52:59,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 05:52:59,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 05:52:59,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 9: [2022-11-27 05:52:59,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:52:59,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 05:52:59,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 20: [2022-11-27 05:52:59,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-27 05:52:59,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 05:52:59,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 21: [2022-11-27 05:52:59,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-27 05:52:59,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 05:52:59,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 14: [2022-11-27 05:52:59,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:52:59,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 05:52:59,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 10: [2022-11-27 05:52:59,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:52:59,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 05:52:59,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 15: [2022-11-27 05:52:59,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:52:59,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 05:52:59,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 3: [2022-11-27 05:52:59,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-27 05:52:59,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 05:52:59,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 31: [2022-11-27 05:52:59,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-27 05:52:59,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 05:52:59,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 29: [2022-11-27 05:52:59,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 26: [2022-11-27 05:52:59,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 29: [2022-11-27 05:52:59,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 26: [2022-11-27 05:52:59,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 05:52:59,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 29: [2022-11-27 05:52:59,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 27: [2022-11-27 05:52:59,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
22: [2022-11-27 05:52:59,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 27: [2022-11-27 05:52:59,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 22: [2022-11-27 05:52:59,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 27: [2022-11-27 05:52:59,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 22: [2022-11-27 05:52:59,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 3: [2022-11-27 05:52:59,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:52:59,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 05:52:59,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 22: [2022-11-27 05:52:59,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-27 05:52:59,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 05:52:59,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 7: [2022-11-27 05:52:59,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-27 05:52:59,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 05:52:59,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 7: [2022-11-27 05:52:59,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:52:59,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 05:52:59,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 7: [2022-11-27 05:52:59,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:52:59,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step159000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 05:52:59,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step159000 is ready now! 
0: successfully saved checkpoint at iteration 159000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2634.28 31: iteration 159010/ 173500 | consumed samples: 40706560 | consumed tokens: 83367034880 | elapsed time per iteration (s): 1.01 | learning rate: 2.314E-05 | global batch size: 256 | lm loss: 1.916256E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 253.183 | TFLOPs: 15.32 | 31: iteration 159020/ 173500 | consumed samples: 40709120 | consumed tokens: 83372277760 | elapsed time per iteration (s): 0.86 | learning rate: 2.314E-05 | global batch size: 256 | lm loss: 1.930573E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.647 | TFLOPs: 18.07 | 31: iteration 159030/ 173500 | consumed samples: 40711680 | consumed tokens: 83377520640 | elapsed time per iteration (s): 0.75 | learning rate: 2.313E-05 | global batch size: 256 | lm loss: 1.906297E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.254 | TFLOPs: 20.77 | 31: iteration 159040/ 173500 | consumed samples: 40714240 | consumed tokens: 83382763520 | elapsed time per iteration (s): 0.87 | learning rate: 2.313E-05 | global batch size: 256 | lm loss: 1.916061E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.706 | TFLOPs: 17.89 | 31: iteration 159050/ 173500 | consumed samples: 40716800 | consumed tokens: 83388006400 | elapsed time per iteration (s): 0.95 | learning rate: 2.313E-05 | global batch size: 256 | lm loss: 1.923355E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 269.339 | TFLOPs: 16.29 | 31: iteration 159060/ 173500 | consumed samples: 40719360 | consumed tokens: 83393249280 | elapsed time per iteration (s): 
0.92 | learning rate: 2.312E-05 | global batch size: 256 | lm loss: 1.941236E+00 | grad norm: 0.215 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 276.948 | TFLOPs: 16.75 | 31: iteration 159070/ 173500 | consumed samples: 40721920 | consumed tokens: 83398492160 | elapsed time per iteration (s): 0.86 | learning rate: 2.312E-05 | global batch size: 256 | lm loss: 1.926404E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.608 | TFLOPs: 18.07 | 31: iteration 159080/ 173500 | consumed samples: 40724480 | consumed tokens: 83403735040 | elapsed time per iteration (s): 0.89 | learning rate: 2.311E-05 | global batch size: 256 | lm loss: 1.913879E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.714 | TFLOPs: 17.35 | 31: iteration 159090/ 173500 | consumed samples: 40727040 | consumed tokens: 83408977920 | elapsed time per iteration (s): 0.88 | learning rate: 2.311E-05 | global batch size: 256 | lm loss: 1.920046E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.527 | TFLOPs: 17.52 | 31: iteration 159100/ 173500 | consumed samples: 40729600 | consumed tokens: 83414220800 | elapsed time per iteration (s): 0.92 | learning rate: 2.310E-05 | global batch size: 256 | lm loss: 1.918867E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 277.451 | TFLOPs: 16.79 | 31: iteration 159110/ 173500 | consumed samples: 40732160 | consumed tokens: 83419463680 | elapsed time per iteration (s): 0.88 | learning rate: 2.310E-05 | global batch size: 256 | lm loss: 1.922553E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.547 | TFLOPs: 17.58 | 31: 
iteration 159120/ 173500 | consumed samples: 40734720 | consumed tokens: 83424706560 | elapsed time per iteration (s): 0.79 | learning rate: 2.310E-05 | global batch size: 256 | lm loss: 1.916985E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.757 | TFLOPs: 19.53 | 31: iteration 159130/ 173500 | consumed samples: 40737280 | consumed tokens: 83429949440 | elapsed time per iteration (s): 0.82 | learning rate: 2.309E-05 | global batch size: 256 | lm loss: 1.905034E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.190 | TFLOPs: 18.95 | 31: iteration 159140/ 173500 | consumed samples: 40739840 | consumed tokens: 83435192320 | elapsed time per iteration (s): 0.83 | learning rate: 2.309E-05 | global batch size: 256 | lm loss: 1.892165E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.128 | TFLOPs: 18.70 | 31: iteration 159150/ 173500 | consumed samples: 40742400 | consumed tokens: 83440435200 | elapsed time per iteration (s): 0.87 | learning rate: 2.308E-05 | global batch size: 256 | lm loss: 1.891966E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.534 | TFLOPs: 17.82 | 31: iteration 159160/ 173500 | consumed samples: 40744960 | consumed tokens: 83445678080 | elapsed time per iteration (s): 0.81 | learning rate: 2.308E-05 | global batch size: 256 | lm loss: 1.931985E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.968 | TFLOPs: 19.05 | 31: iteration 159170/ 173500 | consumed samples: 40747520 | consumed tokens: 83450920960 | elapsed time per iteration (s): 0.82 | learning rate: 2.307E-05 | global batch size: 256 | lm loss: 1.917409E+00 | grad norm: 0.186 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.937 | TFLOPs: 18.93 | 31: iteration 159180/ 173500 | consumed samples: 40750080 | consumed tokens: 83456163840 | elapsed time per iteration (s): 0.85 | learning rate: 2.307E-05 | global batch size: 256 | lm loss: 1.925137E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.085 | TFLOPs: 18.28 | 31: iteration 159190/ 173500 | consumed samples: 40752640 | consumed tokens: 83461406720 | elapsed time per iteration (s): 0.85 | learning rate: 2.307E-05 | global batch size: 256 | lm loss: 1.886926E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.462 | TFLOPs: 18.18 | 31: iteration 159200/ 173500 | consumed samples: 40755200 | consumed tokens: 83466649600 | elapsed time per iteration (s): 0.85 | learning rate: 2.306E-05 | global batch size: 256 | lm loss: 1.931352E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.056 | TFLOPs: 18.27 | 31: iteration 159210/ 173500 | consumed samples: 40757760 | consumed tokens: 83471892480 | elapsed time per iteration (s): 0.81 | learning rate: 2.306E-05 | global batch size: 256 | lm loss: 1.914854E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.624 | TFLOPs: 19.15 | 31: iteration 159220/ 173500 | consumed samples: 40760320 | consumed tokens: 83477135360 | elapsed time per iteration (s): 0.83 | learning rate: 2.305E-05 | global batch size: 256 | lm loss: 1.892133E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.702 | TFLOPs: 18.74 | 31: iteration 159230/ 173500 | consumed samples: 40762880 | consumed tokens: 83482378240 | elapsed time per iteration (s): 0.78 | 
learning rate: 2.305E-05 | global batch size: 256 | lm loss: 1.919878E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.788 | TFLOPs: 19.77 | 31: iteration 159240/ 173500 | consumed samples: 40765440 | consumed tokens: 83487621120 | elapsed time per iteration (s): 0.75 | learning rate: 2.304E-05 | global batch size: 256 | lm loss: 1.879324E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.918 | TFLOPs: 20.62 | 31: iteration 159250/ 173500 | consumed samples: 40768000 | consumed tokens: 83492864000 | elapsed time per iteration (s): 0.83 | learning rate: 2.304E-05 | global batch size: 256 | lm loss: 1.920965E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.274 | TFLOPs: 18.65 | 31: iteration 159260/ 173500 | consumed samples: 40770560 | consumed tokens: 83498106880 | elapsed time per iteration (s): 0.79 | learning rate: 2.304E-05 | global batch size: 256 | lm loss: 1.918230E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.756 | TFLOPs: 19.59 | 31: iteration 159270/ 173500 | consumed samples: 40773120 | consumed tokens: 83503349760 | elapsed time per iteration (s): 0.72 | learning rate: 2.303E-05 | global batch size: 256 | lm loss: 1.902898E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.403 | TFLOPs: 21.38 | 31: iteration 159280/ 173500 | consumed samples: 40775680 | consumed tokens: 83508592640 | elapsed time per iteration (s): 0.72 | learning rate: 2.303E-05 | global batch size: 256 | lm loss: 1.908707E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.184 | TFLOPs: 21.43 | 31: iteration 
159290/ 173500 | consumed samples: 40778240 | consumed tokens: 83513835520 | elapsed time per iteration (s): 0.72 | learning rate: 2.302E-05 | global batch size: 256 | lm loss: 1.936066E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.899 | TFLOPs: 21.47 | 31: iteration 159300/ 173500 | consumed samples: 40780800 | consumed tokens: 83519078400 | elapsed time per iteration (s): 0.79 | learning rate: 2.302E-05 | global batch size: 256 | lm loss: 1.924100E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.650 | TFLOPs: 19.70 | 31: iteration 159310/ 173500 | consumed samples: 40783360 | consumed tokens: 83524321280 | elapsed time per iteration (s): 0.77 | learning rate: 2.301E-05 | global batch size: 256 | lm loss: 1.909705E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.761 | TFLOPs: 20.19 | 31: iteration 159320/ 173500 | consumed samples: 40785920 | consumed tokens: 83529564160 | elapsed time per iteration (s): 1.03 | learning rate: 2.301E-05 | global batch size: 256 | lm loss: 1.890625E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.603 | TFLOPs: 15.04 | 31: iteration 159330/ 173500 | consumed samples: 40788480 | consumed tokens: 83534807040 | elapsed time per iteration (s): 0.74 | learning rate: 2.301E-05 | global batch size: 256 | lm loss: 1.940158E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.381 | TFLOPs: 20.96 | 31: iteration 159340/ 173500 | consumed samples: 40791040 | consumed tokens: 83540049920 | elapsed time per iteration (s): 0.74 | learning rate: 2.300E-05 | global batch size: 256 | lm loss: 1.918874E+00 | grad norm: 0.190 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.331 | TFLOPs: 20.89 | 31: iteration 159350/ 173500 | consumed samples: 40793600 | consumed tokens: 83545292800 | elapsed time per iteration (s): 0.73 | learning rate: 2.300E-05 | global batch size: 256 | lm loss: 1.943425E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.827 | TFLOPs: 21.16 | 31: iteration 159360/ 173500 | consumed samples: 40796160 | consumed tokens: 83550535680 | elapsed time per iteration (s): 0.75 | learning rate: 2.299E-05 | global batch size: 256 | lm loss: 1.945042E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.253 | TFLOPs: 20.71 | 31: iteration 159370/ 173500 | consumed samples: 40798720 | consumed tokens: 83555778560 | elapsed time per iteration (s): 0.73 | learning rate: 2.299E-05 | global batch size: 256 | lm loss: 1.922529E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.809 | TFLOPs: 21.22 | 31: iteration 159380/ 173500 | consumed samples: 40801280 | consumed tokens: 83561021440 | elapsed time per iteration (s): 0.77 | learning rate: 2.298E-05 | global batch size: 256 | lm loss: 1.927122E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.268 | TFLOPs: 20.04 | 31: iteration 159390/ 173500 | consumed samples: 40803840 | consumed tokens: 83566264320 | elapsed time per iteration (s): 0.82 | learning rate: 2.298E-05 | global batch size: 256 | lm loss: 1.922506E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.008 | TFLOPs: 18.88 | 31: iteration 159400/ 173500 | consumed samples: 40806400 | consumed tokens: 83571507200 | elapsed time per iteration (s): 0.77 | learning 
rate: 2.298E-05 | global batch size: 256 | lm loss: 1.933325E+00 | grad norm: 0.258 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.468 | TFLOPs: 20.11 | 31: iteration 159410/ 173500 | consumed samples: 40808960 | consumed tokens: 83576750080 | elapsed time per iteration (s): 0.83 | learning rate: 2.297E-05 | global batch size: 256 | lm loss: 1.932170E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.190 | TFLOPs: 18.71 | 31: iteration 159420/ 173500 | consumed samples: 40811520 | consumed tokens: 83581992960 | elapsed time per iteration (s): 0.80 | learning rate: 2.297E-05 | global batch size: 256 | lm loss: 1.920907E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.929 | TFLOPs: 19.35 | 31: iteration 159430/ 173500 | consumed samples: 40814080 | consumed tokens: 83587235840 | elapsed time per iteration (s): 0.81 | learning rate: 2.296E-05 | global batch size: 256 | lm loss: 1.897501E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.272 | TFLOPs: 19.19 | 31: iteration 159440/ 173500 | consumed samples: 40816640 | consumed tokens: 83592478720 | elapsed time per iteration (s): 0.81 | learning rate: 2.296E-05 | global batch size: 256 | lm loss: 1.904075E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.167 | TFLOPs: 19.13 | 31: iteration 159450/ 173500 | consumed samples: 40819200 | consumed tokens: 83597721600 | elapsed time per iteration (s): 0.79 | learning rate: 2.296E-05 | global batch size: 256 | lm loss: 1.907009E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.500 | TFLOPs: 19.57 | 31: iteration 159460/ 
173500 | consumed samples: 40821760 | consumed tokens: 83602964480 | elapsed time per iteration (s): 0.91 | learning rate: 2.295E-05 | global batch size: 256 | lm loss: 1.906858E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.585 | TFLOPs: 17.10 | 31: iteration 159470/ 173500 | consumed samples: 40824320 | consumed tokens: 83608207360 | elapsed time per iteration (s): 0.81 | learning rate: 2.295E-05 | global batch size: 256 | lm loss: 1.917870E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.314 | TFLOPs: 19.14 | 31: iteration 159480/ 173500 | consumed samples: 40826880 | consumed tokens: 83613450240 | elapsed time per iteration (s): 0.79 | learning rate: 2.294E-05 | global batch size: 256 | lm loss: 1.912700E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.069 | TFLOPs: 19.61 | 31: iteration 159490/ 173500 | consumed samples: 40829440 | consumed tokens: 83618693120 | elapsed time per iteration (s): 0.81 | learning rate: 2.294E-05 | global batch size: 256 | lm loss: 1.903562E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.506 | TFLOPs: 19.21 | 31: iteration 159500/ 173500 | consumed samples: 40832000 | consumed tokens: 83623936000 | elapsed time per iteration (s): 0.83 | learning rate: 2.293E-05 | global batch size: 256 | lm loss: 1.913708E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.256 | TFLOPs: 18.59 | 31: iteration 159510/ 173500 | consumed samples: 40834560 | consumed tokens: 83629178880 | elapsed time per iteration (s): 0.80 | learning rate: 2.293E-05 | global batch size: 256 | lm loss: 1.910676E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 319.741 | TFLOPs: 19.34 | 31: iteration 159520/ 173500 | consumed samples: 40837120 | consumed tokens: 83634421760 | elapsed time per iteration (s): 0.79 | learning rate: 2.293E-05 | global batch size: 256 | lm loss: 1.880506E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.494 | TFLOPs: 19.63 | 31: iteration 159530/ 173500 | consumed samples: 40839680 | consumed tokens: 83639664640 | elapsed time per iteration (s): 0.78 | learning rate: 2.292E-05 | global batch size: 256 | lm loss: 1.940483E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.070 | TFLOPs: 19.97 | 31: iteration 159540/ 173500 | consumed samples: 40842240 | consumed tokens: 83644907520 | elapsed time per iteration (s): 0.79 | learning rate: 2.292E-05 | global batch size: 256 | lm loss: 1.907084E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.869 | TFLOPs: 19.53 | 31: iteration 159550/ 173500 | consumed samples: 40844800 | consumed tokens: 83650150400 | elapsed time per iteration (s): 0.75 | learning rate: 2.291E-05 | global batch size: 256 | lm loss: 1.917188E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.317 | TFLOPs: 20.77 | 31: iteration 159560/ 173500 | consumed samples: 40847360 | consumed tokens: 83655393280 | elapsed time per iteration (s): 0.76 | learning rate: 2.291E-05 | global batch size: 256 | lm loss: 1.902464E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.352 | TFLOPs: 20.35 | 31: iteration 159570/ 173500 | consumed samples: 40849920 | consumed tokens: 83660636160 | elapsed time per iteration (s): 0.74 | learning rate: 
2.291E-05 | global batch size: 256 | lm loss: 1.942056E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.584 | TFLOPs: 21.03 | 31: iteration 159580/ 173500 | consumed samples: 40852480 | consumed tokens: 83665879040 | elapsed time per iteration (s): 0.74 | learning rate: 2.290E-05 | global batch size: 256 | lm loss: 1.925774E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.828 | TFLOPs: 20.80 | 31: iteration 159590/ 173500 | consumed samples: 40855040 | consumed tokens: 83671121920 | elapsed time per iteration (s): 0.80 | learning rate: 2.290E-05 | global batch size: 256 | lm loss: 1.919371E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.177 | TFLOPs: 19.25 | 31: iteration 159600/ 173500 | consumed samples: 40857600 | consumed tokens: 83676364800 | elapsed time per iteration (s): 0.75 | learning rate: 2.289E-05 | global batch size: 256 | lm loss: 1.914341E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.784 | TFLOPs: 20.68 | 31: iteration 159610/ 173500 | consumed samples: 40860160 | consumed tokens: 83681607680 | elapsed time per iteration (s): 0.75 | learning rate: 2.289E-05 | global batch size: 256 | lm loss: 1.909348E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.429 | TFLOPs: 20.78 | 31: iteration 159620/ 173500 | consumed samples: 40862720 | consumed tokens: 83686850560 | elapsed time per iteration (s): 0.76 | learning rate: 2.288E-05 | global batch size: 256 | lm loss: 1.888454E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.314 | TFLOPs: 20.35 | 31: iteration 159630/ 173500 | 
consumed samples: 40865280 | consumed tokens: 83692093440 | elapsed time per iteration (s): 0.74 | learning rate: 2.288E-05 | global batch size: 256 | lm loss: 1.897822E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.784 | TFLOPs: 20.92 | 31: iteration 159640/ 173500 | consumed samples: 40867840 | consumed tokens: 83697336320 | elapsed time per iteration (s): 0.75 | learning rate: 2.288E-05 | global batch size: 256 | lm loss: 1.914399E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.561 | TFLOPs: 20.54 | 31: iteration 159650/ 173500 | consumed samples: 40870400 | consumed tokens: 83702579200 | elapsed time per iteration (s): 0.76 | learning rate: 2.287E-05 | global batch size: 256 | lm loss: 1.909110E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.155 | TFLOPs: 20.28 | 31: iteration 159660/ 173500 | consumed samples: 40872960 | consumed tokens: 83707822080 | elapsed time per iteration (s): 0.74 | learning rate: 2.287E-05 | global batch size: 256 | lm loss: 1.906642E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.011 | TFLOPs: 20.87 | 31: iteration 159670/ 173500 | consumed samples: 40875520 | consumed tokens: 83713064960 | elapsed time per iteration (s): 0.76 | learning rate: 2.286E-05 | global batch size: 256 | lm loss: 1.910449E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.518 | TFLOPs: 20.42 | 31: iteration 159680/ 173500 | consumed samples: 40878080 | consumed tokens: 83718307840 | elapsed time per iteration (s): 0.92 | learning rate: 2.286E-05 | global batch size: 256 | lm loss: 1.892922E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 278.399 | TFLOPs: 16.84 | 31: iteration 159690/ 173500 | consumed samples: 40880640 | consumed tokens: 83723550720 | elapsed time per iteration (s): 0.74 | learning rate: 2.286E-05 | global batch size: 256 | lm loss: 1.890229E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.507 | TFLOPs: 20.84 | 31: iteration 159700/ 173500 | consumed samples: 40883200 | consumed tokens: 83728793600 | elapsed time per iteration (s): 0.81 | learning rate: 2.285E-05 | global batch size: 256 | lm loss: 1.927661E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.265 | TFLOPs: 19.19 | 31: iteration 159710/ 173500 | consumed samples: 40885760 | consumed tokens: 83734036480 | elapsed time per iteration (s): 0.78 | learning rate: 2.285E-05 | global batch size: 256 | lm loss: 1.922442E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.721 | TFLOPs: 19.83 | 31: iteration 159720/ 173500 | consumed samples: 40888320 | consumed tokens: 83739279360 | elapsed time per iteration (s): 0.78 | learning rate: 2.284E-05 | global batch size: 256 | lm loss: 1.879642E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.408 | TFLOPs: 19.87 | 31: iteration 159730/ 173500 | consumed samples: 40890880 | consumed tokens: 83744522240 | elapsed time per iteration (s): 0.81 | learning rate: 2.284E-05 | global batch size: 256 | lm loss: 1.917488E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.138 | TFLOPs: 19.07 | 31: iteration 159740/ 173500 | consumed samples: 40893440 | consumed tokens: 83749765120 | elapsed time per iteration (s): 0.79 | learning rate: 
2.284E-05 | global batch size: 256 | lm loss: 1.925368E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.055 | TFLOPs: 19.73 | 31: iteration 159750/ 173500 | consumed samples: 40896000 | consumed tokens: 83755008000 | elapsed time per iteration (s): 0.83 | learning rate: 2.283E-05 | global batch size: 256 | lm loss: 1.908653E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.632 | TFLOPs: 18.67 | 31: iteration 159760/ 173500 | consumed samples: 40898560 | consumed tokens: 83760250880 | elapsed time per iteration (s): 0.79 | learning rate: 2.283E-05 | global batch size: 256 | lm loss: 1.913352E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.740 | TFLOPs: 19.65 | 31: iteration 159770/ 173500 | consumed samples: 40901120 | consumed tokens: 83765493760 | elapsed time per iteration (s): 0.78 | learning rate: 2.282E-05 | global batch size: 256 | lm loss: 1.893957E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.670 | TFLOPs: 19.88 | 31: iteration 159780/ 173500 | consumed samples: 40903680 | consumed tokens: 83770736640 | elapsed time per iteration (s): 0.72 | learning rate: 2.282E-05 | global batch size: 256 | lm loss: 1.912494E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.895 | TFLOPs: 21.59 | 31: iteration 159790/ 173500 | consumed samples: 40906240 | consumed tokens: 83775979520 | elapsed time per iteration (s): 0.79 | learning rate: 2.281E-05 | global batch size: 256 | lm loss: 1.919879E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.882 | TFLOPs: 19.59 | 31: iteration 159800/ 173500 | 
consumed samples: 40908800 | consumed tokens: 83781222400 | elapsed time per iteration (s): 0.75 | learning rate: 2.281E-05 | global batch size: 256 | lm loss: 1.912075E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.212 | TFLOPs: 20.76 | 31: iteration 159810/ 173500 | consumed samples: 40911360 | consumed tokens: 83786465280 | elapsed time per iteration (s): 0.76 | learning rate: 2.281E-05 | global batch size: 256 | lm loss: 1.888802E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.862 | TFLOPs: 20.38 | 31: iteration 159820/ 173500 | consumed samples: 40913920 | consumed tokens: 83791708160 | elapsed time per iteration (s): 0.76 | learning rate: 2.280E-05 | global batch size: 256 | lm loss: 1.893306E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.877 | TFLOPs: 20.38 | 31: iteration 159830/ 173500 | consumed samples: 40916480 | consumed tokens: 83796951040 | elapsed time per iteration (s): 0.75 | learning rate: 2.280E-05 | global batch size: 256 | lm loss: 1.910086E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.742 | TFLOPs: 20.67 | 31: iteration 159840/ 173500 | consumed samples: 40919040 | consumed tokens: 83802193920 | elapsed time per iteration (s): 0.83 | learning rate: 2.279E-05 | global batch size: 256 | lm loss: 1.899002E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.970 | TFLOPs: 18.75 | 31: iteration 159850/ 173500 | consumed samples: 40921600 | consumed tokens: 83807436800 | elapsed time per iteration (s): 0.75 | learning rate: 2.279E-05 | global batch size: 256 | lm loss: 1.908117E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 339.255 | TFLOPs: 20.52 | 31: iteration 159860/ 173500 | consumed samples: 40924160 | consumed tokens: 83812679680 | elapsed time per iteration (s): 0.76 | learning rate: 2.279E-05 | global batch size: 256 | lm loss: 1.906290E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.720 | TFLOPs: 20.37 | 31: iteration 159870/ 173500 | consumed samples: 40926720 | consumed tokens: 83817922560 | elapsed time per iteration (s): 0.81 | learning rate: 2.278E-05 | global batch size: 256 | lm loss: 1.917420E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.940 | TFLOPs: 19.17 | 31: iteration 159880/ 173500 | consumed samples: 40929280 | consumed tokens: 83823165440 | elapsed time per iteration (s): 0.73 | learning rate: 2.278E-05 | global batch size: 256 | lm loss: 1.909654E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.487 | TFLOPs: 21.32 | 31: iteration 159890/ 173500 | consumed samples: 40931840 | consumed tokens: 83828408320 | elapsed time per iteration (s): 0.88 | learning rate: 2.277E-05 | global batch size: 256 | lm loss: 1.902496E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.617 | TFLOPs: 17.64 | 31: iteration 159900/ 173500 | consumed samples: 40934400 | consumed tokens: 83833651200 | elapsed time per iteration (s): 0.75 | learning rate: 2.277E-05 | global batch size: 256 | lm loss: 1.905996E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.105 | TFLOPs: 20.52 | 31: iteration 159910/ 173500 | consumed samples: 40936960 | consumed tokens: 83838894080 | elapsed time per iteration (s): 0.72 | learning rate: 
2.277E-05 | global batch size: 256 | lm loss: 1.899156E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.776 | TFLOPs: 21.46 | 31: iteration 159920/ 173500 | consumed samples: 40939520 | consumed tokens: 83844136960 | elapsed time per iteration (s): 0.72 | learning rate: 2.276E-05 | global batch size: 256 | lm loss: 1.882587E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.646 | TFLOPs: 21.52 | 31: iteration 159930/ 173500 | consumed samples: 40942080 | consumed tokens: 83849379840 | elapsed time per iteration (s): 0.78 | learning rate: 2.276E-05 | global batch size: 256 | lm loss: 1.937506E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.013 | TFLOPs: 19.78 | 31: iteration 159940/ 173500 | consumed samples: 40944640 | consumed tokens: 83854622720 | elapsed time per iteration (s): 0.78 | learning rate: 2.275E-05 | global batch size: 256 | lm loss: 1.931535E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.947 | TFLOPs: 19.84 | 31: iteration 159950/ 173500 | consumed samples: 40947200 | consumed tokens: 83859865600 | elapsed time per iteration (s): 0.74 | learning rate: 2.275E-05 | global batch size: 256 | lm loss: 1.895773E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.161 | TFLOPs: 21.06 | 31: iteration 159960/ 173500 | consumed samples: 40949760 | consumed tokens: 83865108480 | elapsed time per iteration (s): 0.76 | learning rate: 2.275E-05 | global batch size: 256 | lm loss: 1.914100E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.233 | TFLOPs: 20.28 | 31: iteration 159970/ 173500 | 
consumed samples: 40952320 | consumed tokens: 83870351360 | elapsed time per iteration (s): 0.77 | learning rate: 2.274E-05 | global batch size: 256 | lm loss: 1.949790E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.159 | TFLOPs: 20.09 | 31: iteration 159980/ 173500 | consumed samples: 40954880 | consumed tokens: 83875594240 | elapsed time per iteration (s): 0.78 | learning rate: 2.274E-05 | global batch size: 256 | lm loss: 1.921221E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.514 | TFLOPs: 19.81 | 31: iteration 159990/ 173500 | consumed samples: 40957440 | consumed tokens: 83880837120 | elapsed time per iteration (s): 0.81 | learning rate: 2.273E-05 | global batch size: 256 | lm loss: 1.904386E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.907 | TFLOPs: 19.11 | 0: [2022-11-27 06:06:14,487] [INFO] [logging.py:68:log_dist] [Rank 0] step=160000, skipped=0, lr=[2.2729831288017337e-05, 2.2729831288017337e-05, 2.2729831288017337e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 160000/ 173500 | consumed samples: 40960000 | consumed tokens: 83886080000 | elapsed time per iteration (s): 0.82 | learning rate: 2.273E-05 | global batch size: 256 | lm loss: 1.927923E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.664 | TFLOPs: 18.85 | 0: steps: 160000 loss: 1.8944 iter time (s): 0.805 samples/sec: 318.196 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 160000 | lm loss value: 1.861396E+00 | lm loss PPL: 6.432709E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 160000 to 
checkpoints_1b1long 0: [2022-11-27 06:06:14,768] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step160000 is begin to save! 0: [2022-11-27 06:06:14,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_01-model_00-model_states.pt... 0: [2022-11-27 06:06:14,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_01-model_00-model_states.pt. 0: [2022-11-27 06:06:14,991] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_03-model_00-model_states.pt... 0: [2022-11-27 06:06:15,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_03-model_00-model_states.pt. 0: [2022-11-27 06:06:15,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_04-model_00-model_states.pt... 0: [2022-11-27 06:06:15,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_04-model_00-model_states.pt. 0: [2022-11-27 06:06:15,148] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_05-model_00-model_states.pt... 0: [2022-11-27 06:06:15,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_05-model_00-model_states.pt. 0: [2022-11-27 06:06:15,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_06-model_00-model_states.pt... 0: [2022-11-27 06:06:15,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_06-model_00-model_states.pt. 0: [2022-11-27 06:06:15,303] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_07-model_00-model_states.pt... 
0: [2022-11-27 06:06:15,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_07-model_00-model_states.pt. 0: [2022-11-27 06:06:15,381] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_08-model_00-model_states.pt... 0: [2022-11-27 06:06:15,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_08-model_00-model_states.pt. 0: [2022-11-27 06:06:15,453] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_09-model_00-model_states.pt... 0: [2022-11-27 06:06:15,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_09-model_00-model_states.pt. 0: [2022-11-27 06:06:15,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_10-model_00-model_states.pt... 0: [2022-11-27 06:06:15,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_10-model_00-model_states.pt. 0: [2022-11-27 06:06:15,603] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_11-model_00-model_states.pt... 0: [2022-11-27 06:06:15,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_11-model_00-model_states.pt. 0: [2022-11-27 06:06:15,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_12-model_00-model_states.pt... 0: [2022-11-27 06:06:15,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_12-model_00-model_states.pt. 0: [2022-11-27 06:06:15,756] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 06:06:15,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_13-model_00-model_states.pt. 0: [2022-11-27 06:06:15,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_14-model_00-model_states.pt... 0: [2022-11-27 06:06:15,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_14-model_00-model_states.pt. 0: [2022-11-27 06:06:15,906] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_15-model_00-model_states.pt... 0: [2022-11-27 06:06:15,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_15-model_00-model_states.pt. 0: [2022-11-27 06:06:15,982] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_16-model_00-model_states.pt... 0: [2022-11-27 06:06:16,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_16-model_00-model_states.pt. 0: [2022-11-27 06:06:16,057] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_17-model_00-model_states.pt... 0: [2022-11-27 06:06:16,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_17-model_00-model_states.pt. 0: [2022-11-27 06:06:16,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_18-model_00-model_states.pt... 0: [2022-11-27 06:06:16,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_18-model_00-model_states.pt. 0: [2022-11-27 06:06:16,207] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 06:06:16,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_19-model_00-model_states.pt. 0: [2022-11-27 06:06:16,279] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_20-model_00-model_states.pt... 0: [2022-11-27 06:06:16,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_20-model_00-model_states.pt. 0: [2022-11-27 06:06:16,355] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_21-model_00-model_states.pt... 0: [2022-11-27 06:06:16,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_21-model_00-model_states.pt. 0: [2022-11-27 06:06:16,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_22-model_00-model_states.pt... 0: [2022-11-27 06:06:16,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_22-model_00-model_states.pt. 0: [2022-11-27 06:06:16,502] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_23-model_00-model_states.pt... 0: [2022-11-27 06:06:16,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_23-model_00-model_states.pt. 0: [2022-11-27 06:06:16,577] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_24-model_00-model_states.pt... 0: [2022-11-27 06:06:16,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_24-model_00-model_states.pt. 0: [2022-11-27 06:06:16,649] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 06:06:16,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_25-model_00-model_states.pt. 0: [2022-11-27 06:06:16,726] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_26-model_00-model_states.pt... 0: [2022-11-27 06:06:16,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_26-model_00-model_states.pt. 0: [2022-11-27 06:06:16,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_27-model_00-model_states.pt... 0: [2022-11-27 06:06:16,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_27-model_00-model_states.pt. 0: [2022-11-27 06:06:16,874] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_28-model_00-model_states.pt... 0: [2022-11-27 06:06:16,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_28-model_00-model_states.pt. 0: [2022-11-27 06:06:16,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/layer_30-model_00-model_states.pt... 0: [2022-11-27 06:06:16,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/layer_30-model_00-model_states.pt. 0: [2022-11-27 06:06:16,952] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step160000/mp_rank_00_model_states.pt 0: [2022-11-27 06:06:16,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/mp_rank_00_model_states.pt... 0: [2022-11-27 06:06:16,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/mp_rank_00_model_states.pt. 
0: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
7: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 
8: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 
2: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
15: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 
23: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 
24: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 
17: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 
26: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
5: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
1: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 
13: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 
23: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 
14: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 
17: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 
27: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
16: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 
23: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 
17: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 
27: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
2: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 
29: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
12: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 
8: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:06:17,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:06:17,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:06:17,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 19: [2022-11-27 06:06:17,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-27 06:06:17,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 1: [2022-11-27 06:06:17,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 06:06:17,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 27: [2022-11-27 06:06:17,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
27: [2022-11-27 06:06:17,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 06:06:17,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 6: [2022-11-27 06:06:17,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:06:17,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 06:06:17,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 22: [2022-11-27 06:06:17,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-27 06:06:17,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 06:06:17,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 21: [2022-11-27 06:06:17,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:06:17,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 06:06:17,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 18: [2022-11-27 06:06:17,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
18: [2022-11-27 06:06:17,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-27 06:06:17,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 17: [2022-11-27 06:06:17,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-27 06:06:17,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 06:06:17,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 13: [2022-11-27 06:06:17,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:06:17,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 06:06:17,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 16: [2022-11-27 06:06:17,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-27 06:06:17,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 06:06:17,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 23: [2022-11-27 06:06:17,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
23: [2022-11-27 06:06:17,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 06:06:17,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 9: [2022-11-27 06:06:17,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:06:17,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 06:06:17,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 10: [2022-11-27 06:06:17,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:06:17,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 06:06:17,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 26: [2022-11-27 06:06:17,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-27 06:06:17,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 06:06:17,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 20: [2022-11-27 06:06:17,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 
20: [2022-11-27 06:06:17,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 06:06:17,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 7: [2022-11-27 06:06:17,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:06:17,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:06:17,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 31: [2022-11-27 06:06:17,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 7: [2022-11-27 06:06:17,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 31: [2022-11-27 06:06:17,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 6: [2022-11-27 06:06:17,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:06:17,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
6: [2022-11-27 06:06:17,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 25: [2022-11-27 06:06:17,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:06:17,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 6: [2022-11-27 06:06:17,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 30: [2022-11-27 06:06:17,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 25: [2022-11-27 06:06:17,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 06:06:17,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 14: [2022-11-27 06:06:17,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:06:17,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:06:17,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 06:06:17,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 
1: [2022-11-27 06:06:17,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 06:06:17,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 12: [2022-11-27 06:06:17,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:06:17,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 19: [2022-11-27 06:06:17,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:06:17,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 19: [2022-11-27 06:06:17,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 06:06:17,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 27: [2022-11-27 06:06:17,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:06:17,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
27: [2022-11-27 06:06:17,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 13: [2022-11-27 06:06:17,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 27: [2022-11-27 06:06:17,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 13: [2022-11-27 06:06:17,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 10: [2022-11-27 06:06:17,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:06:17,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 06:06:17,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 15: [2022-11-27 06:06:17,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:06:17,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:06:17,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 06:06:17,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 06:06:17,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 
15: [2022-11-27 06:06:17,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 31: [2022-11-27 06:06:17,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:06:17,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 06:06:17,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 8: [2022-11-27 06:06:17,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:06:17,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 06:06:17,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 3: [2022-11-27 06:06:17,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:06:17,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 06:06:17,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 9: [2022-11-27 06:06:17,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-27 06:06:17,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 06:06:17,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 17: [2022-11-27 06:06:17,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-27 06:06:17,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 06:06:17,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 21: [2022-11-27 06:06:17,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:06:17,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 06:06:17,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 7: [2022-11-27 06:06:17,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:06:17,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 06:06:17,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 16: [2022-11-27 06:06:17,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
16: [2022-11-27 06:06:17,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 06:06:17,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 4: [2022-11-27 06:06:17,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:06:17,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 06:06:17,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 22: [2022-11-27 06:06:17,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 06:06:17,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 26: [2022-11-27 06:06:17,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 22: [2022-11-27 06:06:17,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 26: [2022-11-27 06:06:17,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 06:06:17,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 25: [2022-11-27 06:06:17,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
20: [2022-11-27 06:06:17,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 25: [2022-11-27 06:06:17,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 06:06:17,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 20: [2022-11-27 06:06:17,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 06:06:17,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 19: [2022-11-27 06:06:17,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 06:06:17,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 06:06:17,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 1: [2022-11-27 06:06:17,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:06:17,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 06:06:17,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 15: [2022-11-27 06:06:17,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-27 06:06:17,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 06:06:17,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 27: [2022-11-27 06:06:17,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-27 06:06:17,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 06:06:17,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 13: [2022-11-27 06:06:17,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:06:17,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 06:06:17,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 18: [2022-11-27 06:06:17,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-27 06:06:17,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 06:06:17,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 14: [2022-11-27 06:06:17,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-27 06:06:17,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 06:06:17,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 4: [2022-11-27 06:06:17,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:06:17,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 26: [2022-11-27 06:06:17,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:06:17,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 26: [2022-11-27 06:06:17,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 06:06:17,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 23: [2022-11-27 06:06:17,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 29: [2022-11-27 06:06:17,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 23: [2022-11-27 06:06:17,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 06:06:17,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 
29: [2022-11-27 06:06:17,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 06:06:17,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 0: [2022-11-27 06:06:17,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:06:17,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:06:17,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 06:06:17,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 6: [2022-11-27 06:06:17,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:06:17,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:06:17,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 6: [2022-11-27 06:06:17,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 0: [2022-11-27 06:06:17,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 6: [2022-11-27 06:06:17,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 
23: [2022-11-27 06:06:17,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:06:17,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 28: [2022-11-27 06:06:17,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:06:17,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 23: [2022-11-27 06:06:17,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 8: [2022-11-27 06:06:17,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 23: [2022-11-27 06:06:17,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 8: [2022-11-27 06:06:17,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 06:06:17,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 0: [2022-11-27 06:06:17,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:06:17,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-27 06:06:17,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 06:06:17,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 06:06:17,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 0: [2022-11-27 06:06:17,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 5: [2022-11-27 06:06:17,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:06:17,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 16: [2022-11-27 06:06:17,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 19: [2022-11-27 06:06:17,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:06:17,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 16: [2022-11-27 06:06:17,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 19: [2022-11-27 06:06:17,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 16: [2022-11-27 06:06:17,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 
19: [2022-11-27 06:06:17,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 3: [2022-11-27 06:06:17,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:06:17,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:06:17,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 06:06:17,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 11: [2022-11-27 06:06:17,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 06:06:17,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 11: [2022-11-27 06:06:17,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 25: [2022-11-27 06:06:17,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:06:17,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 06:06:17,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 
25: [2022-11-27 06:06:17,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-27 06:06:17,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 29: [2022-11-27 06:06:17,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-27 06:06:17,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 06:06:17,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 9: [2022-11-27 06:06:17,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:06:17,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 06:06:17,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 31: [2022-11-27 06:06:17,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:06:17,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 06:06:17,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 14: [2022-11-27 06:06:17,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-27 06:06:17,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 06:06:17,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 21: [2022-11-27 06:06:17,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:06:17,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 06:06:17,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 30: [2022-11-27 06:06:17,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:06:17,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 06:06:17,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 17: [2022-11-27 06:06:17,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:06:17,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-27 06:06:17,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 17: [2022-11-27 06:06:17,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 10: [2022-11-27 06:06:17,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 17: [2022-11-27 06:06:17,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 24: [2022-11-27 06:06:17,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:06:17,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:06:17,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 06:06:17,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 06:06:17,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 24: [2022-11-27 06:06:17,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 21: [2022-11-27 06:06:17,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:06:17,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
21: [2022-11-27 06:06:17,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 7: [2022-11-27 06:06:17,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 21: [2022-11-27 06:06:17,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 7: [2022-11-27 06:06:17,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 2: [2022-11-27 06:06:17,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:06:17,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:06:17,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:06:17,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 06:06:17,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 2: [2022-11-27 06:06:17,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 06:06:17,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 06:06:17,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 
2: [2022-11-27 06:06:17,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 6: [2022-11-27 06:06:17,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:06:17,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 1: [2022-11-27 06:06:17,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:06:17,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 5: [2022-11-27 06:06:17,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:06:17,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 1: [2022-11-27 06:06:17,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 06:06:17,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 5: [2022-11-27 06:06:17,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 22: [2022-11-27 06:06:17,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
22: [2022-11-27 06:06:17,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 06:06:17,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 22: [2022-11-27 06:06:17,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 06:06:17,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 06:06:17,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 27: [2022-11-27 06:06:17,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 18: [2022-11-27 06:06:17,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 27: [2022-11-27 06:06:17,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 18: [2022-11-27 06:06:17,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 06:06:17,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 27: [2022-11-27 06:06:17,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 11: [2022-11-27 06:06:17,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-27 06:06:17,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 06:06:17,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 0: [2022-11-27 06:06:17,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:06:17,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 06:06:17,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 29: [2022-11-27 06:06:17,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-27 06:06:17,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 06:06:17,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 3: [2022-11-27 06:06:17,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:06:17,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 12: [2022-11-27 06:06:17,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:06:17,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 
12: [2022-11-27 06:06:17,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 06:06:17,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 8: [2022-11-27 06:06:17,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:06:17,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 06:06:17,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 24: [2022-11-27 06:06:17,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:06:17,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:06:17,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 10: [2022-11-27 06:06:17,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 06:06:17,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 24: [2022-11-27 06:06:17,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 31: [2022-11-27 06:06:17,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-27 06:06:17,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 06:06:17,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 20: [2022-11-27 06:06:17,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:06:17,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 06:06:17,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 28: [2022-11-27 06:06:17,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 06:06:17,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 28: [2022-11-27 06:06:17,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-27 06:06:17,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 06:06:17,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 28: [2022-11-27 06:06:17,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-27 06:06:17,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 06:06:17,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 15: [2022-11-27 06:06:17,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:06:17,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 06:06:17,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 7: [2022-11-27 06:06:17,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:06:17,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 06:06:17,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 18: [2022-11-27 06:06:17,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-27 06:06:17,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 06:06:17,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 17: [2022-11-27 06:06:17,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
30: [2022-11-27 06:06:17,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 17: [2022-11-27 06:06:17,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 30: [2022-11-27 06:06:17,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 06:06:17,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 4: [2022-11-27 06:06:17,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:06:17,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 06:06:17,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 20: [2022-11-27 06:06:17,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:06:17,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 06:06:17,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 17: [2022-11-27 06:06:17,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 23: [2022-11-27 06:06:17,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
23: [2022-11-27 06:06:17,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 06:06:17,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 5: [2022-11-27 06:06:17,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:06:17,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 06:06:17,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 14: [2022-11-27 06:06:17,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:06:17,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 06:06:17,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 13: [2022-11-27 06:06:17,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:06:17,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 06:06:17,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 4: [2022-11-27 06:06:17,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-27 06:06:17,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 06:06:17,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 2: [2022-11-27 06:06:17,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:06:17,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 06:06:17,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 3: [2022-11-27 06:06:17,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:06:17,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 06:06:17,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 9: [2022-11-27 06:06:17,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 28: [2022-11-27 06:06:17,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-27 06:06:17,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 06:06:17,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 
9: [2022-11-27 06:06:17,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 06:06:17,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 24: [2022-11-27 06:06:17,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:06:17,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 26: [2022-11-27 06:06:17,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:06:17,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 26: [2022-11-27 06:06:17,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-27 06:06:17,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 16: [2022-11-27 06:06:17,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 06:06:17,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 06:06:17,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 12: [2022-11-27 06:06:17,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-27 06:06:17,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 06:06:17,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 21: [2022-11-27 06:06:17,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:06:17,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 06:06:17,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 19: [2022-11-27 06:06:17,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-27 06:06:17,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 06:06:17,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 25: [2022-11-27 06:06:17,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-27 06:06:17,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 06:06:17,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 6: [2022-11-27 06:06:17,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-27 06:06:17,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 06:06:17,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 1: [2022-11-27 06:06:17,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:06:17,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 06:06:17,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 8: [2022-11-27 06:06:17,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:06:17,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 06:06:17,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 7: [2022-11-27 06:06:17,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:06:17,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 06:06:17,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 27: [2022-11-27 06:06:17,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-27 06:06:17,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 06:06:17,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 22: [2022-11-27 06:06:17,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-27 06:06:17,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 06:06:17,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 29: [2022-11-27 06:06:17,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 06:06:17,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 06:06:17,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 0: [2022-11-27 06:06:17,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:06:17,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 06:06:17,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 11: [2022-11-27 06:06:17,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-27 06:06:17,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 06:06:17,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 31: [2022-11-27 06:06:17,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:06:17,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 06:06:17,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 15: [2022-11-27 06:06:17,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:06:17,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 30: [2022-11-27 06:06:17,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:06:17,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 30: [2022-11-27 06:06:17,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 06:06:17,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 18: [2022-11-27 06:06:17,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
18: [2022-11-27 06:06:17,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 06:06:17,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 13: [2022-11-27 06:06:17,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:06:17,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 06:06:17,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 17: [2022-11-27 06:06:17,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-27 06:06:17,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 06:06:17,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 9: [2022-11-27 06:06:17,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:06:17,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 06:06:17,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 20: [2022-11-27 06:06:17,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-27 06:06:17,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 06:06:17,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 2: [2022-11-27 06:06:17,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:06:17,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 06:06:17,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 5: [2022-11-27 06:06:17,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:06:17,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 06:06:17,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 4: [2022-11-27 06:06:17,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:06:17,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 06:06:17,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 14: [2022-11-27 06:06:17,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-27 06:06:17,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 06:06:17,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 3: [2022-11-27 06:06:17,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:06:17,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 06:06:17,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 23: [2022-11-27 06:06:17,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-27 06:06:17,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 06:06:17,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 26: [2022-11-27 06:06:17,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-27 06:06:17,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 06:06:17,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 12: [2022-11-27 06:06:17,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-27 06:06:17,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 06:06:17,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 25: [2022-11-27 06:06:17,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-27 06:06:17,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 06:06:17,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 28: [2022-11-27 06:06:17,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 16: [2022-11-27 06:06:17,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 06:06:17,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 06:06:17,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 6: [2022-11-27 06:06:17,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:06:17,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 06:06:17,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 
24: [2022-11-27 06:06:17,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:06:17,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 21: [2022-11-27 06:06:17,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:06:17,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 24: [2022-11-27 06:06:17,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 21: [2022-11-27 06:06:17,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 1: [2022-11-27 06:06:17,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:06:17,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 06:06:17,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 0: [2022-11-27 06:06:17,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 19: [2022-11-27 06:06:17,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-27 06:06:17,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 06:06:17,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 10: [2022-11-27 06:06:17,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:06:17,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 06:06:17,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 7: [2022-11-27 06:06:17,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:06:17,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 06:06:17,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 27: [2022-11-27 06:06:17,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-27 06:06:17,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 06:06:17,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 8: [2022-11-27 06:06:17,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-27 06:06:17,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 06:06:17,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 22: [2022-11-27 06:06:17,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-27 06:06:17,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 06:06:17,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 11: [2022-11-27 06:06:17,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:06:17,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 06:06:17,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 15: [2022-11-27 06:06:17,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:06:17,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 06:06:17,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 29: [2022-11-27 06:06:17,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-27 06:06:17,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 06:06:17,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 28: [2022-11-27 06:06:17,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 06:06:17,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 31: [2022-11-27 06:06:17,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:06:17,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 06:06:17,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 30: [2022-11-27 06:06:17,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:06:17,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 06:06:17,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 17: [2022-11-27 06:06:17,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 
17: [2022-11-27 06:06:17,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 06:06:17,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 10: [2022-11-27 06:06:17,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:06:17,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 06:06:17,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 20: [2022-11-27 06:06:17,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:06:17,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 06:06:17,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 18: [2022-11-27 06:06:17,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 06:06:17,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 06:06:17,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 13: [2022-11-27 06:06:17,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-27 06:06:17,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 06:06:17,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 0: [2022-11-27 06:06:17,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 06:06:17,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 4: [2022-11-27 06:06:17,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:06:17,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 2: [2022-11-27 06:06:17,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:06:17,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 2: [2022-11-27 06:06:17,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 06:06:17,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 3: [2022-11-27 06:06:17,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-27 06:06:17,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 06:06:17,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 16: [2022-11-27 06:06:17,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-27 06:06:17,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 06:06:17,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 5: [2022-11-27 06:06:17,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:06:17,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 06:06:17,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 14: [2022-11-27 06:06:17,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:06:17,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 06:06:17,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 23: [2022-11-27 06:06:17,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-27 06:06:17,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 06:06:17,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 9: [2022-11-27 06:06:17,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:06:17,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 06:06:17,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 28: [2022-11-27 06:06:17,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-27 06:06:17,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 06:06:17,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 12: [2022-11-27 06:06:17,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:06:17,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 06:06:17,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 26: [2022-11-27 06:06:17,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-27 06:06:17,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 06:06:17,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 1: [2022-11-27 06:06:17,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:06:17,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 06:06:17,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 21: [2022-11-27 06:06:17,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:06:17,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 06:06:17,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 24: [2022-11-27 06:06:17,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:06:17,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 06:06:17,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 19: [2022-11-27 06:06:17,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
19: [2022-11-27 06:06:17,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 06:06:17,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 6: [2022-11-27 06:06:17,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:06:17,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 06:06:17,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 27: [2022-11-27 06:06:17,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-27 06:06:17,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 06:06:17,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 25: [2022-11-27 06:06:17,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 06:06:17,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 06:06:17,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 7: [2022-11-27 06:06:17,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-27 06:06:17,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 06:06:17,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 8: [2022-11-27 06:06:17,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:06:17,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 06:06:17,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 22: [2022-11-27 06:06:17,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 06:06:17,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 06:06:17,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 0: [2022-11-27 06:06:17,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:06:17,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 06:06:17,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 15: [2022-11-27 06:06:17,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-27 06:06:17,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 06:06:17,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 31: [2022-11-27 06:06:17,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:06:17,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 06:06:17,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 18: [2022-11-27 06:06:17,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-27 06:06:17,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 06:06:17,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 30: [2022-11-27 06:06:17,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:06:17,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 06:06:17,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 29: [2022-11-27 06:06:17,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-27 06:06:17,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 06:06:17,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 10: [2022-11-27 06:06:17,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:06:17,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 06:06:17,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 11: [2022-11-27 06:06:17,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:06:17,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 06:06:17,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 17: [2022-11-27 06:06:17,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-27 06:06:17,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 06:06:17,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 13: [2022-11-27 06:06:17,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-27 06:06:17,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 06:06:17,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 14: [2022-11-27 06:06:17,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:06:17,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 06:06:17,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 5: [2022-11-27 06:06:17,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:06:17,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 06:06:17,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 20: [2022-11-27 06:06:17,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:06:17,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-27 06:06:17,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 20: [2022-11-27 06:06:17,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 06:06:17,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 4: [2022-11-27 06:06:17,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 2: [2022-11-27 06:06:17,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:06:17,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 06:06:17,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 3: [2022-11-27 06:06:17,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:06:17,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 06:06:17,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 28: [2022-11-27 06:06:17,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 26: [2022-11-27 06:06:17,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
26: [2022-11-27 06:06:17,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 06:06:17,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 21: [2022-11-27 06:06:17,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:06:17,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 06:06:17,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 25: [2022-11-27 06:06:17,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 23: [2022-11-27 06:06:17,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 25: [2022-11-27 06:06:17,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 23: [2022-11-27 06:06:17,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 06:06:17,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 25: [2022-11-27 06:06:17,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 19: [2022-11-27 06:06:17,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 
19: [2022-11-27 06:06:17,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 06:06:17,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 1: [2022-11-27 06:06:17,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:06:17,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:06:17,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 1: [2022-11-27 06:06:17,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 9: [2022-11-27 06:06:17,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 1: [2022-11-27 06:06:17,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 16: [2022-11-27 06:06:17,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-27 06:06:17,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 06:06:17,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 6: [2022-11-27 06:06:17,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
15: [2022-11-27 06:06:17,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:06:17,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 15: [2022-11-27 06:06:17,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 6: [2022-11-27 06:06:17,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 15: [2022-11-27 06:06:17,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 28: [2022-11-27 06:06:17,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 06:06:17,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 22: [2022-11-27 06:06:17,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-27 06:06:17,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 24: [2022-11-27 06:06:17,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 22: [2022-11-27 06:06:17,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 
24: [2022-11-27 06:06:17,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 27: [2022-11-27 06:06:17,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:06:17,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 27: [2022-11-27 06:06:17,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 06:06:17,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 18: [2022-11-27 06:06:17,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 06:06:17,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 06:06:17,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 30: [2022-11-27 06:06:17,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:06:17,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 06:06:17,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 31: [2022-11-27 06:06:17,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 
8: [2022-11-27 06:06:17,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:06:17,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 8: [2022-11-27 06:06:17,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 31: [2022-11-27 06:06:17,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 8: [2022-11-27 06:06:17,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 0: [2022-11-27 06:06:17,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:06:17,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 06:06:17,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 2: [2022-11-27 06:06:17,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:06:17,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 06:06:17,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 17: [2022-11-27 06:06:17,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 
26: [2022-11-27 06:06:17,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 17: [2022-11-27 06:06:17,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 06:06:17,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 26: [2022-11-27 06:06:17,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 06:06:17,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 20: [2022-11-27 06:06:17,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:06:17,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 06:06:17,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 10: [2022-11-27 06:06:17,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:06:17,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 06:06:17,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 5: [2022-11-27 06:06:17,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-27 06:06:17,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 06:06:17,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 11: [2022-11-27 06:06:17,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:06:17,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 06:06:17,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 25: [2022-11-27 06:06:17,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-27 06:06:17,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 24: [2022-11-27 06:06:17,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:06:17,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 25: [2022-11-27 06:06:17,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 
24: [2022-11-27 06:06:17,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 13: [2022-11-27 06:06:17,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 24: [2022-11-27 06:06:17,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 8: [2022-11-27 06:06:17,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:06:17,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 06:06:17,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 23: [2022-11-27 06:06:17,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:06:17,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:06:17,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 23: [2022-11-27 06:06:17,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 13: [2022-11-27 06:06:17,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 23: [2022-11-27 06:06:17,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 
4: [2022-11-27 06:06:17,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 9: [2022-11-27 06:06:17,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 28: [2022-11-27 06:06:17,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:06:17,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:06:17,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 3: [2022-11-27 06:06:17,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 9: [2022-11-27 06:06:17,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 28: [2022-11-27 06:06:17,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 3: [2022-11-27 06:06:17,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 29: [2022-11-27 06:06:17,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 16: [2022-11-27 06:06:17,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:06:17,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
28: [2022-11-27 06:06:17,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 29: [2022-11-27 06:06:17,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 16: [2022-11-27 06:06:17,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 29: [2022-11-27 06:06:17,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 12: [2022-11-27 06:06:17,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 16: [2022-11-27 06:06:17,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 12: [2022-11-27 06:06:17,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 14: [2022-11-27 06:06:17,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:06:17,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 06:06:17,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 29: [2022-11-27 06:06:17,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
29: [2022-11-27 06:06:17,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 06:06:17,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 12: [2022-11-27 06:06:17,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:06:17,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 06:06:17,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 7: [2022-11-27 06:06:17,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:06:17,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 06:06:17,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 5: [2022-11-27 06:06:17,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:06:17,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 06:06:17,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 11: [2022-11-27 06:06:17,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-27 06:06:17,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step160000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 06:06:17,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step160000 is ready now! 0: successfully saved checkpoint at iteration 160000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2726.95 31: iteration 160010/ 173500 | consumed samples: 40962560 | consumed tokens: 83891322880 | elapsed time per iteration (s): 1.12 | learning rate: 2.273E-05 | global batch size: 256 | lm loss: 1.902878E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.050 | TFLOPs: 13.80 | 31: iteration 160020/ 173500 | consumed samples: 40965120 | consumed tokens: 83896565760 | elapsed time per iteration (s): 0.94 | learning rate: 2.272E-05 | global batch size: 256 | lm loss: 1.913455E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 273.786 | TFLOPs: 16.56 | 31: iteration 160030/ 173500 | consumed samples: 40967680 | consumed tokens: 83901808640 | elapsed time per iteration (s): 0.75 | learning rate: 2.272E-05 | global batch size: 256 | lm loss: 1.907585E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.437 | TFLOPs: 20.66 | 31: iteration 160040/ 173500 | consumed samples: 40970240 | consumed tokens: 83907051520 | elapsed time per iteration (s): 1.16 | learning rate: 2.271E-05 | global batch size: 256 | lm loss: 1.915362E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 220.843 | TFLOPs: 13.36 | 31: iteration 160050/ 173500 | consumed samples: 40972800 | consumed tokens: 83912294400 | elapsed time per iteration (s): 0.90 | learning rate: 2.271E-05 | 
global batch size: 256 | lm loss: 1.950615E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.989 | TFLOPs: 17.24 | 31: iteration 160060/ 173500 | consumed samples: 40975360 | consumed tokens: 83917537280 | elapsed time per iteration (s): 0.74 | learning rate: 2.271E-05 | global batch size: 256 | lm loss: 1.936717E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.912 | TFLOPs: 20.93 | 31: iteration 160070/ 173500 | consumed samples: 40977920 | consumed tokens: 83922780160 | elapsed time per iteration (s): 0.75 | learning rate: 2.270E-05 | global batch size: 256 | lm loss: 1.898087E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.175 | TFLOPs: 20.58 | 31: iteration 160080/ 173500 | consumed samples: 40980480 | consumed tokens: 83928023040 | elapsed time per iteration (s): 0.71 | learning rate: 2.270E-05 | global batch size: 256 | lm loss: 1.898763E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.253 | TFLOPs: 21.67 | 31: iteration 160090/ 173500 | consumed samples: 40983040 | consumed tokens: 83933265920 | elapsed time per iteration (s): 0.75 | learning rate: 2.269E-05 | global batch size: 256 | lm loss: 1.910866E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.163 | TFLOPs: 20.70 | 31: iteration 160100/ 173500 | consumed samples: 40985600 | consumed tokens: 83938508800 | elapsed time per iteration (s): 0.75 | learning rate: 2.269E-05 | global batch size: 256 | lm loss: 1.886376E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.616 | TFLOPs: 20.61 | 31: iteration 160110/ 173500 | consumed 
samples: 40988160 | consumed tokens: 83943751680 | elapsed time per iteration (s): 0.81 | learning rate: 2.269E-05 | global batch size: 256 | lm loss: 1.942165E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.444 | TFLOPs: 19.20 | 31: iteration 160120/ 173500 | consumed samples: 40990720 | consumed tokens: 83948994560 | elapsed time per iteration (s): 0.80 | learning rate: 2.268E-05 | global batch size: 256 | lm loss: 1.922155E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.495 | TFLOPs: 19.33 | 31: iteration 160130/ 173500 | consumed samples: 40993280 | consumed tokens: 83954237440 | elapsed time per iteration (s): 0.83 | learning rate: 2.268E-05 | global batch size: 256 | lm loss: 1.927050E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.472 | TFLOPs: 18.72 | 31: iteration 160140/ 173500 | consumed samples: 40995840 | consumed tokens: 83959480320 | elapsed time per iteration (s): 0.82 | learning rate: 2.267E-05 | global batch size: 256 | lm loss: 1.918102E+00 | grad norm: 0.216 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.009 | TFLOPs: 19.00 | 31: iteration 160150/ 173500 | consumed samples: 40998400 | consumed tokens: 83964723200 | elapsed time per iteration (s): 0.83 | learning rate: 2.267E-05 | global batch size: 256 | lm loss: 1.926547E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.635 | TFLOPs: 18.55 | 31: iteration 160160/ 173500 | consumed samples: 41000960 | consumed tokens: 83969966080 | elapsed time per iteration (s): 0.82 | learning rate: 2.267E-05 | global batch size: 256 | lm loss: 1.899254E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 314.033 | TFLOPs: 19.00 | 31: iteration 160170/ 173500 | consumed samples: 41003520 | consumed tokens: 83975208960 | elapsed time per iteration (s): 0.85 | learning rate: 2.266E-05 | global batch size: 256 | lm loss: 1.933017E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.895 | TFLOPs: 18.14 | 31: iteration 160180/ 173500 | consumed samples: 41006080 | consumed tokens: 83980451840 | elapsed time per iteration (s): 0.80 | learning rate: 2.266E-05 | global batch size: 256 | lm loss: 1.899633E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.224 | TFLOPs: 19.37 | 31: iteration 160190/ 173500 | consumed samples: 41008640 | consumed tokens: 83985694720 | elapsed time per iteration (s): 0.80 | learning rate: 2.265E-05 | global batch size: 256 | lm loss: 1.904910E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.061 | TFLOPs: 19.24 | 31: iteration 160200/ 173500 | consumed samples: 41011200 | consumed tokens: 83990937600 | elapsed time per iteration (s): 0.81 | learning rate: 2.265E-05 | global batch size: 256 | lm loss: 1.909068E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.875 | TFLOPs: 19.23 | 31: iteration 160210/ 173500 | consumed samples: 41013760 | consumed tokens: 83996180480 | elapsed time per iteration (s): 0.81 | learning rate: 2.265E-05 | global batch size: 256 | lm loss: 1.924522E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.398 | TFLOPs: 19.02 | 31: iteration 160220/ 173500 | consumed samples: 41016320 | consumed tokens: 84001423360 | elapsed time per iteration (s): 0.80 | learning rate: 2.264E-05 | global 
batch size: 256 | lm loss: 1.910018E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.501 | TFLOPs: 19.33 | 31: iteration 160230/ 173500 | consumed samples: 41018880 | consumed tokens: 84006666240 | elapsed time per iteration (s): 0.79 | learning rate: 2.264E-05 | global batch size: 256 | lm loss: 1.907304E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.332 | TFLOPs: 19.56 | 31: iteration 160240/ 173500 | consumed samples: 41021440 | consumed tokens: 84011909120 | elapsed time per iteration (s): 0.80 | learning rate: 2.263E-05 | global batch size: 256 | lm loss: 1.906551E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.352 | TFLOPs: 19.44 | 31: iteration 160250/ 173500 | consumed samples: 41024000 | consumed tokens: 84017152000 | elapsed time per iteration (s): 0.82 | learning rate: 2.263E-05 | global batch size: 256 | lm loss: 1.904871E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.558 | TFLOPs: 18.91 | 31: iteration 160260/ 173500 | consumed samples: 41026560 | consumed tokens: 84022394880 | elapsed time per iteration (s): 0.78 | learning rate: 2.263E-05 | global batch size: 256 | lm loss: 1.882222E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.090 | TFLOPs: 19.85 | 31: iteration 160270/ 173500 | consumed samples: 41029120 | consumed tokens: 84027637760 | elapsed time per iteration (s): 0.94 | learning rate: 2.262E-05 | global batch size: 256 | lm loss: 1.953024E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 273.206 | TFLOPs: 16.53 | 31: iteration 160280/ 173500 | consumed samples: 
41031680 | consumed tokens: 84032880640 | elapsed time per iteration (s): 0.81 | learning rate: 2.262E-05 | global batch size: 256 | lm loss: 1.899373E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.572 | TFLOPs: 19.21 | 31: iteration 160290/ 173500 | consumed samples: 41034240 | consumed tokens: 84038123520 | elapsed time per iteration (s): 0.80 | learning rate: 2.261E-05 | global batch size: 256 | lm loss: 1.890613E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.920 | TFLOPs: 19.29 | 31: iteration 160300/ 173500 | consumed samples: 41036800 | consumed tokens: 84043366400 | elapsed time per iteration (s): 0.82 | learning rate: 2.261E-05 | global batch size: 256 | lm loss: 1.898244E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.777 | TFLOPs: 18.80 | 31: iteration 160310/ 173500 | consumed samples: 41039360 | consumed tokens: 84048609280 | elapsed time per iteration (s): 0.86 | learning rate: 2.261E-05 | global batch size: 256 | lm loss: 1.894702E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.880 | TFLOPs: 18.08 | 31: iteration 160320/ 173500 | consumed samples: 41041920 | consumed tokens: 84053852160 | elapsed time per iteration (s): 0.80 | learning rate: 2.260E-05 | global batch size: 256 | lm loss: 1.926733E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.063 | TFLOPs: 19.42 | 31: iteration 160330/ 173500 | consumed samples: 41044480 | consumed tokens: 84059095040 | elapsed time per iteration (s): 0.81 | learning rate: 2.260E-05 | global batch size: 256 | lm loss: 1.911412E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 314.226 | TFLOPs: 19.01 | 31: iteration 160340/ 173500 | consumed samples: 41047040 | consumed tokens: 84064337920 | elapsed time per iteration (s): 0.83 | learning rate: 2.259E-05 | global batch size: 256 | lm loss: 1.918482E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.980 | TFLOPs: 18.63 | 31: iteration 160350/ 173500 | consumed samples: 41049600 | consumed tokens: 84069580800 | elapsed time per iteration (s): 0.80 | learning rate: 2.259E-05 | global batch size: 256 | lm loss: 1.895927E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.022 | TFLOPs: 19.36 | 31: iteration 160360/ 173500 | consumed samples: 41052160 | consumed tokens: 84074823680 | elapsed time per iteration (s): 0.81 | learning rate: 2.259E-05 | global batch size: 256 | lm loss: 1.919510E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.834 | TFLOPs: 19.17 | 31: iteration 160370/ 173500 | consumed samples: 41054720 | consumed tokens: 84080066560 | elapsed time per iteration (s): 0.80 | learning rate: 2.258E-05 | global batch size: 256 | lm loss: 1.888886E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.021 | TFLOPs: 19.36 | 31: iteration 160380/ 173500 | consumed samples: 41057280 | consumed tokens: 84085309440 | elapsed time per iteration (s): 0.80 | learning rate: 2.258E-05 | global batch size: 256 | lm loss: 1.899783E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.354 | TFLOPs: 19.38 | 31: iteration 160390/ 173500 | consumed samples: 41059840 | consumed tokens: 84090552320 | elapsed time per iteration (s): 0.79 | learning rate: 2.258E-05 | global batch 
size: 256 | lm loss: 1.909965E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.530 | TFLOPs: 19.69 | 31: iteration 160400/ 173500 | consumed samples: 41062400 | consumed tokens: 84095795200 | elapsed time per iteration (s): 0.82 | learning rate: 2.257E-05 | global batch size: 256 | lm loss: 1.936876E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.757 | TFLOPs: 18.86 | 31: iteration 160410/ 173500 | consumed samples: 41064960 | consumed tokens: 84101038080 | elapsed time per iteration (s): 0.78 | learning rate: 2.257E-05 | global batch size: 256 | lm loss: 1.927576E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.367 | TFLOPs: 19.80 | 31: iteration 160420/ 173500 | consumed samples: 41067520 | consumed tokens: 84106280960 | elapsed time per iteration (s): 0.78 | learning rate: 2.256E-05 | global batch size: 256 | lm loss: 1.892563E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.553 | TFLOPs: 19.82 | 31: iteration 160430/ 173500 | consumed samples: 41070080 | consumed tokens: 84111523840 | elapsed time per iteration (s): 0.83 | learning rate: 2.256E-05 | global batch size: 256 | lm loss: 1.912469E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.236 | TFLOPs: 18.65 | 31: iteration 160440/ 173500 | consumed samples: 41072640 | consumed tokens: 84116766720 | elapsed time per iteration (s): 0.79 | learning rate: 2.256E-05 | global batch size: 256 | lm loss: 1.910494E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.445 | TFLOPs: 19.51 | 31: iteration 160450/ 173500 | consumed samples: 41075200 
| consumed tokens: 84122009600 | elapsed time per iteration (s): 0.96 | learning rate: 2.255E-05 | global batch size: 256 | lm loss: 1.912246E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 267.773 | TFLOPs: 16.20 | 31: iteration 160460/ 173500 | consumed samples: 41077760 | consumed tokens: 84127252480 | elapsed time per iteration (s): 0.75 | learning rate: 2.255E-05 | global batch size: 256 | lm loss: 1.930689E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.842 | TFLOPs: 20.62 | 31: iteration 160470/ 173500 | consumed samples: 41080320 | consumed tokens: 84132495360 | elapsed time per iteration (s): 0.80 | learning rate: 2.254E-05 | global batch size: 256 | lm loss: 1.890805E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.134 | TFLOPs: 19.43 | 31: iteration 160480/ 173500 | consumed samples: 41082880 | consumed tokens: 84137738240 | elapsed time per iteration (s): 0.77 | learning rate: 2.254E-05 | global batch size: 256 | lm loss: 1.923603E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.913 | TFLOPs: 20.14 | 31: iteration 160490/ 173500 | consumed samples: 41085440 | consumed tokens: 84142981120 | elapsed time per iteration (s): 0.75 | learning rate: 2.254E-05 | global batch size: 256 | lm loss: 1.915016E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.243 | TFLOPs: 20.77 | 31: iteration 160500/ 173500 | consumed samples: 41088000 | consumed tokens: 84148224000 | elapsed time per iteration (s): 0.74 | learning rate: 2.253E-05 | global batch size: 256 | lm loss: 1.939074E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 344.043 | TFLOPs: 20.81 | 31: iteration 160510/ 173500 | consumed samples: 41090560 | consumed tokens: 84153466880 | elapsed time per iteration (s): 0.80 | learning rate: 2.253E-05 | global batch size: 256 | lm loss: 1.903786E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.704 | TFLOPs: 19.40 | 31: iteration 160520/ 173500 | consumed samples: 41093120 | consumed tokens: 84158709760 | elapsed time per iteration (s): 0.73 | learning rate: 2.252E-05 | global batch size: 256 | lm loss: 1.913137E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.186 | TFLOPs: 21.19 | 31: iteration 160530/ 173500 | consumed samples: 41095680 | consumed tokens: 84163952640 | elapsed time per iteration (s): 0.74 | learning rate: 2.252E-05 | global batch size: 256 | lm loss: 1.908735E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.611 | TFLOPs: 20.91 | 31: iteration 160540/ 173500 | consumed samples: 41098240 | consumed tokens: 84169195520 | elapsed time per iteration (s): 0.78 | learning rate: 2.252E-05 | global batch size: 256 | lm loss: 1.916656E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.687 | TFLOPs: 19.82 | 31: iteration 160550/ 173500 | consumed samples: 41100800 | consumed tokens: 84174438400 | elapsed time per iteration (s): 0.83 | learning rate: 2.251E-05 | global batch size: 256 | lm loss: 1.918730E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.438 | TFLOPs: 18.66 | 31: iteration 160560/ 173500 | consumed samples: 41103360 | consumed tokens: 84179681280 | elapsed time per iteration (s): 0.75 | learning rate: 2.251E-05 | global batch size: 
256 | lm loss: 1.916372E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.531 | TFLOPs: 20.60 | 31: iteration 160570/ 173500 | consumed samples: 41105920 | consumed tokens: 84184924160 | elapsed time per iteration (s): 0.75 | learning rate: 2.251E-05 | global batch size: 256 | lm loss: 1.893941E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.202 | TFLOPs: 20.52 | 31: iteration 160580/ 173500 | consumed samples: 41108480 | consumed tokens: 84190167040 | elapsed time per iteration (s): 0.77 | learning rate: 2.250E-05 | global batch size: 256 | lm loss: 1.889338E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.414 | TFLOPs: 20.17 | 31: iteration 160590/ 173500 | consumed samples: 41111040 | consumed tokens: 84195409920 | elapsed time per iteration (s): 0.75 | learning rate: 2.250E-05 | global batch size: 256 | lm loss: 1.894252E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.549 | TFLOPs: 20.72 | 31: iteration 160600/ 173500 | consumed samples: 41113600 | consumed tokens: 84200652800 | elapsed time per iteration (s): 0.78 | learning rate: 2.249E-05 | global batch size: 256 | lm loss: 1.911230E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.233 | TFLOPs: 19.80 | 31: iteration 160610/ 173500 | consumed samples: 41116160 | consumed tokens: 84205895680 | elapsed time per iteration (s): 0.78 | learning rate: 2.249E-05 | global batch size: 256 | lm loss: 1.895865E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.755 | TFLOPs: 19.95 | 31: iteration 160620/ 173500 | consumed samples: 41118720 | 
consumed tokens: 84211138560 | elapsed time per iteration (s): 0.85 | learning rate: 2.249E-05 | global batch size: 256 | lm loss: 1.877555E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.348 | TFLOPs: 18.29 | 31: iteration 160630/ 173500 | consumed samples: 41121280 | consumed tokens: 84216381440 | elapsed time per iteration (s): 0.75 | learning rate: 2.248E-05 | global batch size: 256 | lm loss: 1.918198E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.986 | TFLOPs: 20.63 | 31: iteration 160640/ 173500 | consumed samples: 41123840 | consumed tokens: 84221624320 | elapsed time per iteration (s): 0.75 | learning rate: 2.248E-05 | global batch size: 256 | lm loss: 1.916991E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.828 | TFLOPs: 20.74 | 31: iteration 160650/ 173500 | consumed samples: 41126400 | consumed tokens: 84226867200 | elapsed time per iteration (s): 0.75 | learning rate: 2.247E-05 | global batch size: 256 | lm loss: 1.905304E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.257 | TFLOPs: 20.71 | 31: iteration 160660/ 173500 | consumed samples: 41128960 | consumed tokens: 84232110080 | elapsed time per iteration (s): 0.77 | learning rate: 2.247E-05 | global batch size: 256 | lm loss: 1.904752E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.383 | TFLOPs: 20.05 | 31: iteration 160670/ 173500 | consumed samples: 41131520 | consumed tokens: 84237352960 | elapsed time per iteration (s): 0.80 | learning rate: 2.247E-05 | global batch size: 256 | lm loss: 1.906469E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 318.243 | TFLOPs: 19.25 | 31: iteration 160680/ 173500 | consumed samples: 41134080 | consumed tokens: 84242595840 | elapsed time per iteration (s): 0.77 | learning rate: 2.246E-05 | global batch size: 256 | lm loss: 1.917027E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.016 | TFLOPs: 20.21 | 31: iteration 160690/ 173500 | consumed samples: 41136640 | consumed tokens: 84247838720 | elapsed time per iteration (s): 0.79 | learning rate: 2.246E-05 | global batch size: 256 | lm loss: 1.957241E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.399 | TFLOPs: 19.63 | 31: iteration 160700/ 173500 | consumed samples: 41139200 | consumed tokens: 84253081600 | elapsed time per iteration (s): 0.76 | learning rate: 2.246E-05 | global batch size: 256 | lm loss: 1.913169E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.993 | TFLOPs: 20.27 | 31: iteration 160710/ 173500 | consumed samples: 41141760 | consumed tokens: 84258324480 | elapsed time per iteration (s): 0.73 | learning rate: 2.245E-05 | global batch size: 256 | lm loss: 1.899536E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.903 | TFLOPs: 21.11 | 31: iteration 160720/ 173500 | consumed samples: 41144320 | consumed tokens: 84263567360 | elapsed time per iteration (s): 0.75 | learning rate: 2.245E-05 | global batch size: 256 | lm loss: 1.906989E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.924 | TFLOPs: 20.63 | 31: iteration 160730/ 173500 | consumed samples: 41146880 | consumed tokens: 84268810240 | elapsed time per iteration (s): 0.81 | learning rate: 2.244E-05 | global batch size: 
256 | lm loss: 1.913643E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.949 | TFLOPs: 19.05 | 31: iteration 160740/ 173500 | consumed samples: 41149440 | consumed tokens: 84274053120 | elapsed time per iteration (s): 0.76 | learning rate: 2.244E-05 | global batch size: 256 | lm loss: 1.923111E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.602 | TFLOPs: 20.30 | 31: iteration 160750/ 173500 | consumed samples: 41152000 | consumed tokens: 84279296000 | elapsed time per iteration (s): 0.74 | learning rate: 2.244E-05 | global batch size: 256 | lm loss: 1.873730E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.058 | TFLOPs: 20.94 | 31: iteration 160760/ 173500 | consumed samples: 41154560 | consumed tokens: 84284538880 | elapsed time per iteration (s): 0.76 | learning rate: 2.243E-05 | global batch size: 256 | lm loss: 1.901390E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.091 | TFLOPs: 20.39 | 31: iteration 160770/ 173500 | consumed samples: 41157120 | consumed tokens: 84289781760 | elapsed time per iteration (s): 0.79 | learning rate: 2.243E-05 | global batch size: 256 | lm loss: 1.927322E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.369 | TFLOPs: 19.50 | 31: iteration 160780/ 173500 | consumed samples: 41159680 | consumed tokens: 84295024640 | elapsed time per iteration (s): 0.75 | learning rate: 2.242E-05 | global batch size: 256 | lm loss: 1.927545E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.743 | TFLOPs: 20.61 | 31: iteration 160790/ 173500 | consumed samples: 41162240 | 
consumed tokens: 84300267520 | elapsed time per iteration (s): 0.77 | learning rate: 2.242E-05 | global batch size: 256 | lm loss: 1.921371E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.818 | TFLOPs: 20.13 | 31: iteration 160800/ 173500 | consumed samples: 41164800 | consumed tokens: 84305510400 | elapsed time per iteration (s): 0.76 | learning rate: 2.242E-05 | global batch size: 256 | lm loss: 1.933874E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.629 | TFLOPs: 20.43 | 31: iteration 160810/ 173500 | consumed samples: 41167360 | consumed tokens: 84310753280 | elapsed time per iteration (s): 0.76 | learning rate: 2.241E-05 | global batch size: 256 | lm loss: 1.914892E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.851 | TFLOPs: 20.44 | 31: iteration 160820/ 173500 | consumed samples: 41169920 | consumed tokens: 84315996160 | elapsed time per iteration (s): 0.77 | learning rate: 2.241E-05 | global batch size: 256 | lm loss: 1.909742E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.115 | TFLOPs: 20.15 | 31: iteration 160830/ 173500 | consumed samples: 41172480 | consumed tokens: 84321239040 | elapsed time per iteration (s): 0.78 | learning rate: 2.241E-05 | global batch size: 256 | lm loss: 1.934189E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.649 | TFLOPs: 19.82 | 31: iteration 160840/ 173500 | consumed samples: 41175040 | consumed tokens: 84326481920 | elapsed time per iteration (s): 0.73 | learning rate: 2.240E-05 | global batch size: 256 | lm loss: 1.919936E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 351.920 | TFLOPs: 21.29 | 31: iteration 160850/ 173500 | consumed samples: 41177600 | consumed tokens: 84331724800 | elapsed time per iteration (s): 0.77 | learning rate: 2.240E-05 | global batch size: 256 | lm loss: 1.909204E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.509 | TFLOPs: 20.12 | 31: iteration 160860/ 173500 | consumed samples: 41180160 | consumed tokens: 84336967680 | elapsed time per iteration (s): 0.78 | learning rate: 2.239E-05 | global batch size: 256 | lm loss: 1.915010E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.193 | TFLOPs: 19.92 | 31: iteration 160870/ 173500 | consumed samples: 41182720 | consumed tokens: 84342210560 | elapsed time per iteration (s): 0.76 | learning rate: 2.239E-05 | global batch size: 256 | lm loss: 1.931957E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.089 | TFLOPs: 20.33 | 31: iteration 160880/ 173500 | consumed samples: 41185280 | consumed tokens: 84347453440 | elapsed time per iteration (s): 0.80 | learning rate: 2.239E-05 | global batch size: 256 | lm loss: 1.906399E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.908 | TFLOPs: 19.41 | 31: iteration 160890/ 173500 | consumed samples: 41187840 | consumed tokens: 84352696320 | elapsed time per iteration (s): 0.74 | learning rate: 2.238E-05 | global batch size: 256 | lm loss: 1.918188E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.388 | TFLOPs: 20.90 | 31: iteration 160900/ 173500 | consumed samples: 41190400 | consumed tokens: 84357939200 | elapsed time per iteration (s): 0.76 | learning rate: 2.238E-05 | global batch size: 
256 | lm loss: 1.919441E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.722 | TFLOPs: 20.49 | 31: iteration 160910/ 173500 | consumed samples: 41192960 | consumed tokens: 84363182080 | elapsed time per iteration (s): 0.76 | learning rate: 2.238E-05 | global batch size: 256 | lm loss: 1.920734E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.128 | TFLOPs: 20.33 | 31: iteration 160920/ 173500 | consumed samples: 41195520 | consumed tokens: 84368424960 | elapsed time per iteration (s): 0.80 | learning rate: 2.237E-05 | global batch size: 256 | lm loss: 1.910196E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.000 | TFLOPs: 19.30 | 31: iteration 160930/ 173500 | consumed samples: 41198080 | consumed tokens: 84373667840 | elapsed time per iteration (s): 0.77 | learning rate: 2.237E-05 | global batch size: 256 | lm loss: 1.903526E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.320 | TFLOPs: 20.04 | 31: iteration 160940/ 173500 | consumed samples: 41200640 | consumed tokens: 84378910720 | elapsed time per iteration (s): 0.82 | learning rate: 2.236E-05 | global batch size: 256 | lm loss: 1.925218E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.882 | TFLOPs: 18.81 | 31: iteration 160950/ 173500 | consumed samples: 41203200 | consumed tokens: 84384153600 | elapsed time per iteration (s): 0.77 | learning rate: 2.236E-05 | global batch size: 256 | lm loss: 1.909497E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.377 | TFLOPs: 20.23 | 31: iteration 160960/ 173500 | consumed samples: 41205760 | 
consumed tokens: 84389396480 | elapsed time per iteration (s): 1.06 | learning rate: 2.236E-05 | global batch size: 256 | lm loss: 1.908508E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.533 | TFLOPs: 14.67 | 31: iteration 160970/ 173500 | consumed samples: 41208320 | consumed tokens: 84394639360 | elapsed time per iteration (s): 0.78 | learning rate: 2.235E-05 | global batch size: 256 | lm loss: 1.897197E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.574 | TFLOPs: 19.76 | 31: iteration 160980/ 173500 | consumed samples: 41210880 | consumed tokens: 84399882240 | elapsed time per iteration (s): 0.81 | learning rate: 2.235E-05 | global batch size: 256 | lm loss: 1.914455E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.550 | TFLOPs: 19.03 | 31: iteration 160990/ 173500 | consumed samples: 41213440 | consumed tokens: 84405125120 | elapsed time per iteration (s): 0.80 | learning rate: 2.235E-05 | global batch size: 256 | lm loss: 1.891976E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.279 | TFLOPs: 19.26 | 31: iteration 161000/ 173500 | consumed samples: 41216000 | consumed tokens: 84410368000 | elapsed time per iteration (s): 0.82 | learning rate: 2.234E-05 | global batch size: 256 | lm loss: 1.910809E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.117 | TFLOPs: 18.82 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 161000 | lm loss value: 1.827668E+00 | lm loss PPL: 6.219369E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving 
checkpoint at iteration 161000 to checkpoints_1b1long 0: [2022-11-27 06:19:33,780] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step161000 is begin to save! 0: [2022-11-27 06:19:33,790] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_01-model_00-model_states.pt... 0: [2022-11-27 06:19:34,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_01-model_00-model_states.pt. 0: [2022-11-27 06:19:34,005] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_03-model_00-model_states.pt... 0: [2022-11-27 06:19:34,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_03-model_00-model_states.pt. 0: [2022-11-27 06:19:34,091] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_04-model_00-model_states.pt... 0: [2022-11-27 06:19:34,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_04-model_00-model_states.pt. 0: [2022-11-27 06:19:34,168] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_05-model_00-model_states.pt... 0: [2022-11-27 06:19:34,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_05-model_00-model_states.pt. 0: [2022-11-27 06:19:34,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_06-model_00-model_states.pt... 0: [2022-11-27 06:19:34,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_06-model_00-model_states.pt. 0: [2022-11-27 06:19:34,320] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_07-model_00-model_states.pt... 
0: [2022-11-27 06:19:34,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_07-model_00-model_states.pt. 0: [2022-11-27 06:19:34,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_08-model_00-model_states.pt... 0: [2022-11-27 06:19:34,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_08-model_00-model_states.pt. 0: [2022-11-27 06:19:34,472] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_09-model_00-model_states.pt... 0: [2022-11-27 06:19:34,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_09-model_00-model_states.pt. 0: [2022-11-27 06:19:34,545] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_10-model_00-model_states.pt... 0: [2022-11-27 06:19:34,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_10-model_00-model_states.pt. 0: [2022-11-27 06:19:34,620] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_11-model_00-model_states.pt... 0: [2022-11-27 06:19:34,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_11-model_00-model_states.pt. 0: [2022-11-27 06:19:34,696] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_12-model_00-model_states.pt... 0: [2022-11-27 06:19:34,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_12-model_00-model_states.pt. 0: [2022-11-27 06:19:34,779] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 06:19:34,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_13-model_00-model_states.pt. 0: [2022-11-27 06:19:34,857] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_14-model_00-model_states.pt... 0: [2022-11-27 06:19:34,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_14-model_00-model_states.pt. 0: [2022-11-27 06:19:34,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_15-model_00-model_states.pt... 0: [2022-11-27 06:19:35,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_15-model_00-model_states.pt. 0: [2022-11-27 06:19:35,009] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_16-model_00-model_states.pt... 0: [2022-11-27 06:19:35,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_16-model_00-model_states.pt. 0: [2022-11-27 06:19:35,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_17-model_00-model_states.pt... 0: [2022-11-27 06:19:35,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_17-model_00-model_states.pt. 0: [2022-11-27 06:19:35,163] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_18-model_00-model_states.pt... 0: [2022-11-27 06:19:35,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_18-model_00-model_states.pt. 0: [2022-11-27 06:19:35,242] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 06:19:35,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_19-model_00-model_states.pt. 0: [2022-11-27 06:19:35,320] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_20-model_00-model_states.pt... 0: [2022-11-27 06:19:35,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_20-model_00-model_states.pt. 0: [2022-11-27 06:19:35,419] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_21-model_00-model_states.pt... 0: [2022-11-27 06:19:35,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_21-model_00-model_states.pt. 0: [2022-11-27 06:19:35,495] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_22-model_00-model_states.pt... 0: [2022-11-27 06:19:35,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_22-model_00-model_states.pt. 0: [2022-11-27 06:19:35,567] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_23-model_00-model_states.pt... 0: [2022-11-27 06:19:35,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_23-model_00-model_states.pt. 0: [2022-11-27 06:19:35,644] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_24-model_00-model_states.pt... 0: [2022-11-27 06:19:35,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_24-model_00-model_states.pt. 0: [2022-11-27 06:19:35,716] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 06:19:35,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_25-model_00-model_states.pt. 0: [2022-11-27 06:19:35,791] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_26-model_00-model_states.pt... 0: [2022-11-27 06:19:35,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_26-model_00-model_states.pt. 0: [2022-11-27 06:19:35,867] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_27-model_00-model_states.pt... 0: [2022-11-27 06:19:35,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_27-model_00-model_states.pt. 0: [2022-11-27 06:19:35,938] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_28-model_00-model_states.pt... 0: [2022-11-27 06:19:36,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_28-model_00-model_states.pt. 0: [2022-11-27 06:19:36,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/layer_30-model_00-model_states.pt... 0: [2022-11-27 06:19:36,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/layer_30-model_00-model_states.pt. 0: [2022-11-27 06:19:36,017] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step161000/mp_rank_00_model_states.pt 0: [2022-11-27 06:19:36,017] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/mp_rank_00_model_states.pt... 0: [2022-11-27 06:19:36,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/mp_rank_00_model_states.pt. 
31: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
5: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 
9: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
1: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
2: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
15: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 
24: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 
21: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 
27: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
10: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
12: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 
28: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 
22: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 
26: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
7: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
13: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 
24: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 
21: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 
0: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 
12: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
31: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 
19: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 
24: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 31: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 
20: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:19:36,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:19:36,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:19:36,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:19:36,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 15: [2022-11-27 06:19:36,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 24: [2022-11-27 06:19:36,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 15: [2022-11-27 06:19:36,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 1: [2022-11-27 06:19:36,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-27 06:19:36,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 23: [2022-11-27 06:19:36,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:19:36,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 23: [2022-11-27 06:19:36,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 06:19:36,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 0: [2022-11-27 06:19:36,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:19:36,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 06:19:36,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 24: [2022-11-27 06:19:36,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:19:36,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 06:19:36,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 7: [2022-11-27 06:19:36,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-27 06:19:36,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 06:19:36,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 8: [2022-11-27 06:19:36,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:19:36,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 06:19:36,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 28: [2022-11-27 06:19:36,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:19:36,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 28: [2022-11-27 06:19:36,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 7: [2022-11-27 06:19:36,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 28: [2022-11-27 06:19:36,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 7: [2022-11-27 06:19:36,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 14: [2022-11-27 06:19:36,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-27 06:19:36,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 06:19:36,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 10: [2022-11-27 06:19:36,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:19:36,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 06:19:36,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 21: [2022-11-27 06:19:36,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:19:36,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 17: [2022-11-27 06:19:36,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:19:36,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 18: [2022-11-27 06:19:36,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-27 06:19:36,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 06:19:36,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 
17: [2022-11-27 06:19:36,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 06:19:36,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 0: [2022-11-27 06:19:36,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:19:36,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 06:19:36,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 31: [2022-11-27 06:19:36,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 22: [2022-11-27 06:19:36,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:19:36,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 22: [2022-11-27 06:19:36,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 31: [2022-11-27 06:19:36,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 22: [2022-11-27 06:19:36,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 4: [2022-11-27 06:19:36,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-27 06:19:36,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 9: [2022-11-27 06:19:36,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:19:36,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 9: [2022-11-27 06:19:36,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 06:19:36,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 19: [2022-11-27 06:19:36,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-27 06:19:36,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 13: [2022-11-27 06:19:36,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 19: [2022-11-27 06:19:36,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 13: [2022-11-27 06:19:36,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 06:19:36,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 12: [2022-11-27 06:19:36,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-27 06:19:36,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 06:19:36,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 13: [2022-11-27 06:19:36,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:19:36,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 06:19:36,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 18: [2022-11-27 06:19:36,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 06:19:36,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 06:19:36,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 3: [2022-11-27 06:19:36,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:19:36,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 06:19:36,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 10: [2022-11-27 06:19:36,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
15: [2022-11-27 06:19:36,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 23: [2022-11-27 06:19:36,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:19:36,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:19:36,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 23: [2022-11-27 06:19:36,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 10: [2022-11-27 06:19:36,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 15: [2022-11-27 06:19:36,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 23: [2022-11-27 06:19:36,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 10: [2022-11-27 06:19:36,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 31: [2022-11-27 06:19:36,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:19:36,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 
31: [2022-11-27 06:19:36,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 10: [2022-11-27 06:19:36,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 31: [2022-11-27 06:19:36,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 24: [2022-11-27 06:19:36,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:19:36,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 21: [2022-11-27 06:19:36,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:19:36,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 24: [2022-11-27 06:19:36,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 21: [2022-11-27 06:19:36,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 18: [2022-11-27 06:19:36,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:19:36,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
18: [2022-11-27 06:19:36,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 12: [2022-11-27 06:19:36,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 18: [2022-11-27 06:19:36,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 12: [2022-11-27 06:19:36,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 3: [2022-11-27 06:19:36,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:19:36,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:19:36,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:19:36,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 14: [2022-11-27 06:19:36,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 2: [2022-11-27 06:19:36,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:19:36,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 14: [2022-11-27 06:19:36,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 
1: [2022-11-27 06:19:36,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 2: [2022-11-27 06:19:36,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 06:19:36,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 1: [2022-11-27 06:19:36,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 2: [2022-11-27 06:19:36,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:19:36,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 06:19:36,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 7: [2022-11-27 06:19:36,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:19:36,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 06:19:36,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 29: [2022-11-27 06:19:36,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 
29: [2022-11-27 06:19:36,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 06:19:36,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 29: [2022-11-27 06:19:36,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-27 06:19:36,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 06:19:36,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 8: [2022-11-27 06:19:36,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:19:36,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 06:19:36,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 0: [2022-11-27 06:19:36,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 28: [2022-11-27 06:19:36,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-27 06:19:36,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-27 06:19:36,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 06:19:36,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 06:19:36,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 28: [2022-11-27 06:19:36,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 23: [2022-11-27 06:19:36,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-27 06:19:36,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 06:19:36,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 13: [2022-11-27 06:19:36,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:19:36,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 06:19:36,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 9: [2022-11-27 06:19:36,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-27 06:19:36,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 06:19:36,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 17: [2022-11-27 06:19:36,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-27 06:19:36,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 06:19:36,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 8: [2022-11-27 06:19:36,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:19:36,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 06:19:36,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 15: [2022-11-27 06:19:36,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:19:36,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
15: [2022-11-27 06:19:36,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 3: [2022-11-27 06:19:36,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 15: [2022-11-27 06:19:36,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 22: [2022-11-27 06:19:36,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:19:36,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 22: [2022-11-27 06:19:36,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 06:19:36,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 18: [2022-11-27 06:19:36,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-27 06:19:36,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 16: [2022-11-27 06:19:36,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 06:19:36,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
29: [2022-11-27 06:19:36,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:19:36,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 16: [2022-11-27 06:19:36,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 29: [2022-11-27 06:19:36,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 18: [2022-11-27 06:19:36,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 16: [2022-11-27 06:19:36,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 06:19:36,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 29: [2022-11-27 06:19:36,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 16: [2022-11-27 06:19:36,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 14: [2022-11-27 06:19:36,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:19:36,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 16: [2022-11-27 06:19:36,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 
16: [2022-11-27 06:19:36,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 1: [2022-11-27 06:19:36,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 16: [2022-11-27 06:19:36,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 14: [2022-11-27 06:19:36,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 1: [2022-11-27 06:19:36,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 14: [2022-11-27 06:19:36,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 1: [2022-11-27 06:19:36,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 1: [2022-11-27 06:19:36,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 24: [2022-11-27 06:19:36,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:19:36,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
24: [2022-11-27 06:19:36,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 20: [2022-11-27 06:19:36,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 24: [2022-11-27 06:19:36,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 20: [2022-11-27 06:19:36,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 17: [2022-11-27 06:19:36,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-27 06:19:36,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 06:19:36,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 31: [2022-11-27 06:19:36,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:19:36,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 06:19:36,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 4: [2022-11-27 06:19:36,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 17: [2022-11-27 06:19:36,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 
4: [2022-11-27 06:19:36,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 17: [2022-11-27 06:19:36,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 4: [2022-11-27 06:19:36,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 17: [2022-11-27 06:19:36,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 13: [2022-11-27 06:19:36,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:19:36,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 06:19:36,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 4: [2022-11-27 06:19:36,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:19:36,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 06:19:36,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 2: [2022-11-27 06:19:36,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-27 06:19:36,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 06:19:36,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 21: [2022-11-27 06:19:36,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:19:36,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 06:19:36,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 23: [2022-11-27 06:19:36,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-27 06:19:36,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 06:19:36,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 7: [2022-11-27 06:19:36,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:19:36,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 06:19:36,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 4: [2022-11-27 06:19:36,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-27 06:19:36,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 06:19:36,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 21: [2022-11-27 06:19:36,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:19:36,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 06:19:36,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 10: [2022-11-27 06:19:36,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:19:36,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 06:19:36,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 19: [2022-11-27 06:19:36,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-27 06:19:36,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 06:19:36,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 12: [2022-11-27 06:19:36,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-27 06:19:36,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 14: [2022-11-27 06:19:36,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:19:36,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:19:36,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 14: [2022-11-27 06:19:36,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 31: [2022-11-27 06:19:36,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 14: [2022-11-27 06:19:36,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 31: [2022-11-27 06:19:36,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 16: [2022-11-27 06:19:36,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-27 06:19:36,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 06:19:36,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 22: [2022-11-27 06:19:36,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
22: [2022-11-27 06:19:36,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 06:19:36,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 3: [2022-11-27 06:19:36,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:19:36,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 06:19:36,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 20: [2022-11-27 06:19:36,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:19:36,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 6: [2022-11-27 06:19:36,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:19:36,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:19:36,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:19:36,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-27 06:19:36,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 06:19:36,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 06:19:36,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 20: [2022-11-27 06:19:36,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 6: [2022-11-27 06:19:36,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 06:19:36,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 6: [2022-11-27 06:19:36,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 6: [2022-11-27 06:19:36,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 6: [2022-11-27 06:19:36,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 0: [2022-11-27 06:19:36,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:19:36,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 19: [2022-11-27 06:19:36,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
0: [2022-11-27 06:19:36,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 19: [2022-11-27 06:19:36,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 06:19:36,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 06:19:36,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 06:19:36,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 28: [2022-11-27 06:19:36,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 19: [2022-11-27 06:19:36,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 28: [2022-11-27 06:19:36,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 15: [2022-11-27 06:19:36,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:19:36,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 06:19:36,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 25: [2022-11-27 06:19:36,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
25: [2022-11-27 06:19:36,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 06:19:36,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 12: [2022-11-27 06:19:36,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:19:36,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 06:19:36,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 25: [2022-11-27 06:19:36,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-27 06:19:36,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-27 06:19:36,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-27 06:19:36,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 06:19:36,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 25: [2022-11-27 06:19:36,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 
0: [2022-11-27 06:19:36,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 29: [2022-11-27 06:19:36,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:19:36,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 2: [2022-11-27 06:19:36,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 29: [2022-11-27 06:19:36,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 2: [2022-11-27 06:19:36,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 29: [2022-11-27 06:19:36,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 2: [2022-11-27 06:19:36,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 28: [2022-11-27 06:19:36,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 8: [2022-11-27 06:19:36,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:19:36,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 06:19:36,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 
9: [2022-11-27 06:19:36,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:19:36,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:19:36,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 06:19:36,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 06:19:36,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 9: [2022-11-27 06:19:36,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 26: [2022-11-27 06:19:36,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-27 06:19:36,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-27 06:19:36,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 06:19:36,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 06:19:36,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 
26: [2022-11-27 06:19:36,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 06:19:36,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 06:19:36,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 26: [2022-11-27 06:19:36,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 27: [2022-11-27 06:19:36,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-27 06:19:36,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-27 06:19:36,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-27 06:19:36,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 
27: [2022-11-27 06:19:36,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 06:19:36,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 06:19:36,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 06:19:36,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 27: [2022-11-27 06:19:36,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 27: [2022-11-27 06:19:36,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 27: [2022-11-27 06:19:36,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 06:19:36,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 20: [2022-11-27 06:19:36,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:19:36,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 06:19:36,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 30: [2022-11-27 06:19:36,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
30: [2022-11-27 06:19:36,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:19:36,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:19:36,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:19:36,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 06:19:36,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 06:19:36,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 06:19:36,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 06:19:36,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 30: [2022-11-27 06:19:36,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 30: [2022-11-27 06:19:36,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 30: [2022-11-27 06:19:36,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 11: [2022-11-27 06:19:36,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-27 06:19:36,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:19:36,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:19:36,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:19:36,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 06:19:36,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 06:19:36,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 06:19:36,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 11: [2022-11-27 06:19:36,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 06:19:36,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 11: [2022-11-27 06:19:36,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 11: [2022-11-27 06:19:36,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 5: [2022-11-27 06:19:36,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-27 06:19:36,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:19:36,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:19:36,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:19:36,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 06:19:36,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 06:19:36,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 5: [2022-11-27 06:19:36,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 5: [2022-11-27 06:19:36,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 06:19:36,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 06:19:36,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 5: [2022-11-27 06:19:36,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 23: [2022-11-27 06:19:36,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-27 06:19:36,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 06:19:36,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 22: [2022-11-27 06:19:36,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 06:19:36,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 06:19:36,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 20: [2022-11-27 06:19:36,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:19:36,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:19:36,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 24: [2022-11-27 06:19:36,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 20: [2022-11-27 06:19:36,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 24: [2022-11-27 06:19:36,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 10: [2022-11-27 06:19:36,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-27 06:19:36,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 06:19:36,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 0: [2022-11-27 06:19:36,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:19:36,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 06:19:36,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 1: [2022-11-27 06:19:36,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:19:36,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 06:19:36,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 15: [2022-11-27 06:19:36,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:19:36,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 06:19:36,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 5: [2022-11-27 06:19:36,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-27 06:19:36,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 06:19:36,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 4: [2022-11-27 06:19:36,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:19:36,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 06:19:36,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 18: [2022-11-27 06:19:36,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-27 06:19:36,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 06:19:36,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 7: [2022-11-27 06:19:36,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:19:36,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 06:19:36,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 17: [2022-11-27 06:19:36,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-27 06:19:36,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 06:19:36,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 11: [2022-11-27 06:19:36,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:19:36,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 06:19:36,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 14: [2022-11-27 06:19:36,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:19:36,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 25: [2022-11-27 06:19:36,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:19:36,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 25: [2022-11-27 06:19:36,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 06:19:36,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 13: [2022-11-27 06:19:36,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-27 06:19:36,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 06:19:36,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 30: [2022-11-27 06:19:36,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:19:36,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 06:19:36,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 31: [2022-11-27 06:19:36,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:19:36,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 06:19:36,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 26: [2022-11-27 06:19:36,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-27 06:19:36,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 06:19:36,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 19: [2022-11-27 06:19:36,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-27 06:19:36,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 06:19:36,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 20: [2022-11-27 06:19:36,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:19:36,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 06:19:36,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 6: [2022-11-27 06:19:36,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:19:36,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 06:19:36,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 28: [2022-11-27 06:19:36,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-27 06:19:36,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 06:19:36,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 2: [2022-11-27 06:19:36,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-27 06:19:36,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 06:19:36,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 3: [2022-11-27 06:19:36,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:19:36,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 06:19:36,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 9: [2022-11-27 06:19:36,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:19:36,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 06:19:36,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 12: [2022-11-27 06:19:36,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:19:36,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 06:19:36,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 16: [2022-11-27 06:19:36,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-27 06:19:36,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 06:19:36,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 22: [2022-11-27 06:19:36,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-27 06:19:36,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 06:19:36,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 27: [2022-11-27 06:19:36,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-27 06:19:36,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 06:19:36,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 8: [2022-11-27 06:19:36,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 29: [2022-11-27 06:19:36,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:19:36,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 06:19:36,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 
29: [2022-11-27 06:19:36,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 06:19:36,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 21: [2022-11-27 06:19:36,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:19:36,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 06:19:36,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 10: [2022-11-27 06:19:36,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 23: [2022-11-27 06:19:36,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:19:36,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 06:19:36,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 23: [2022-11-27 06:19:36,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 06:19:36,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 0: [2022-11-27 06:19:36,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
1: [2022-11-27 06:19:36,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:19:36,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 1: [2022-11-27 06:19:36,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 0: [2022-11-27 06:19:36,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 1: [2022-11-27 06:19:36,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 5: [2022-11-27 06:19:36,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:19:36,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 06:19:36,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 24: [2022-11-27 06:19:36,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:19:36,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-27 06:19:36,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 15: [2022-11-27 06:19:36,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-27 06:19:36,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 06:19:36,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 7: [2022-11-27 06:19:36,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:19:36,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 06:19:36,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 18: [2022-11-27 06:19:36,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 06:19:36,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 06:19:36,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 17: [2022-11-27 06:19:36,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-27 06:19:36,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 06:19:36,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 14: [2022-11-27 06:19:36,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-27 06:19:36,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 06:19:36,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 4: [2022-11-27 06:19:36,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:19:36,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 06:19:36,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 11: [2022-11-27 06:19:36,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:19:36,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 06:19:36,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 13: [2022-11-27 06:19:36,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:19:36,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 06:19:36,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 2: [2022-11-27 06:19:36,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
26: [2022-11-27 06:19:36,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:19:36,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 25: [2022-11-27 06:19:36,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 26: [2022-11-27 06:19:36,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 2: [2022-11-27 06:19:36,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 26: [2022-11-27 06:19:36,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 25: [2022-11-27 06:19:36,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 06:19:36,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 28: [2022-11-27 06:19:36,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-27 06:19:36,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 06:19:36,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 3: [2022-11-27 06:19:36,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-27 06:19:36,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 06:19:36,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 16: [2022-11-27 06:19:36,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 06:19:36,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 06:19:36,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 30: [2022-11-27 06:19:36,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:19:36,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:19:36,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 06:19:36,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 20: [2022-11-27 06:19:36,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 06:19:36,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 31: [2022-11-27 06:19:36,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
31: [2022-11-27 06:19:36,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 06:19:36,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 6: [2022-11-27 06:19:36,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:19:36,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 06:19:36,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 9: [2022-11-27 06:19:36,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:19:36,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 06:19:36,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 21: [2022-11-27 06:19:36,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:19:36,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 06:19:36,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 19: [2022-11-27 06:19:36,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-27 06:19:36,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 06:19:36,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 29: [2022-11-27 06:19:36,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-27 06:19:36,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 06:19:36,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 12: [2022-11-27 06:19:36,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:19:36,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 06:19:36,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 27: [2022-11-27 06:19:36,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-27 06:19:36,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 06:19:36,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 23: [2022-11-27 06:19:36,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 
23: [2022-11-27 06:19:36,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 06:19:36,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 24: [2022-11-27 06:19:36,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:19:36,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 06:19:36,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 22: [2022-11-27 06:19:36,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 06:19:36,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 06:19:36,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 8: [2022-11-27 06:19:36,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:19:36,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 06:19:36,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 1: [2022-11-27 06:19:36,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-27 06:19:36,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 06:19:36,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 18: [2022-11-27 06:19:36,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-27 06:19:36,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 06:19:36,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 5: [2022-11-27 06:19:36,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:19:36,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 06:19:36,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 10: [2022-11-27 06:19:36,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:19:36,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 06:19:36,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 4: [2022-11-27 06:19:36,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
15: [2022-11-27 06:19:36,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:19:36,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 06:19:36,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 15: [2022-11-27 06:19:36,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 06:19:36,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 0: [2022-11-27 06:19:36,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:19:36,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 06:19:36,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 7: [2022-11-27 06:19:36,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:19:36,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 06:19:36,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 11: [2022-11-27 06:19:36,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-27 06:19:36,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 06:19:36,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 17: [2022-11-27 06:19:36,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-27 06:19:36,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 06:19:36,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 14: [2022-11-27 06:19:36,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:19:36,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 06:19:36,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 30: [2022-11-27 06:19:36,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:19:36,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 06:19:36,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 13: [2022-11-27 06:19:36,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-27 06:19:36,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 06:19:36,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 16: [2022-11-27 06:19:36,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-27 06:19:36,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 06:19:36,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 9: [2022-11-27 06:19:36,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 28: [2022-11-27 06:19:36,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:19:36,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 06:19:36,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 28: [2022-11-27 06:19:36,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 6: [2022-11-27 06:19:36,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
26: [2022-11-27 06:19:36,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:19:36,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 26: [2022-11-27 06:19:36,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 6: [2022-11-27 06:19:36,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 26: [2022-11-27 06:19:36,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 2: [2022-11-27 06:19:36,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:19:36,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 06:19:36,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 31: [2022-11-27 06:19:36,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:19:36,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 21: [2022-11-27 06:19:36,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:19:36,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 
21: [2022-11-27 06:19:36,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 06:19:36,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 1: [2022-11-27 06:19:36,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:19:36,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 06:19:36,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 24: [2022-11-27 06:19:36,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:19:36,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 06:19:36,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 18: [2022-11-27 06:19:36,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 06:19:36,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 06:19:36,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 12: [2022-11-27 06:19:36,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-27 06:19:36,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 06:19:36,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 3: [2022-11-27 06:19:36,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 28: [2022-11-27 06:19:36,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 3: [2022-11-27 06:19:36,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 06:19:36,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 23: [2022-11-27 06:19:36,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-27 06:19:36,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 06:19:36,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 15: [2022-11-27 06:19:36,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:19:36,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 06:19:36,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 
20: [2022-11-27 06:19:36,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:19:36,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 19: [2022-11-27 06:19:36,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:19:36,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 20: [2022-11-27 06:19:36,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 19: [2022-11-27 06:19:36,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 7: [2022-11-27 06:19:36,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 20: [2022-11-27 06:19:36,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 19: [2022-11-27 06:19:36,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 4: [2022-11-27 06:19:36,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:19:36,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 06:19:36,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 
8: [2022-11-27 06:19:36,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 25: [2022-11-27 06:19:36,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:19:36,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 06:19:36,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 25: [2022-11-27 06:19:36,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 06:19:36,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 14: [2022-11-27 06:19:36,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 27: [2022-11-27 06:19:36,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:19:36,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 06:19:36,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 27: [2022-11-27 06:19:36,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 06:19:36,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 
0: [2022-11-27 06:19:36,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:19:36,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 06:19:36,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 5: [2022-11-27 06:19:36,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:19:36,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 06:19:36,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 10: [2022-11-27 06:19:36,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:19:36,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 29: [2022-11-27 06:19:36,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:19:36,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 22: [2022-11-27 06:19:36,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
29: [2022-11-27 06:19:36,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 22: [2022-11-27 06:19:36,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 06:19:36,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 29: [2022-11-27 06:19:36,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 11: [2022-11-27 06:19:36,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:19:36,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 06:19:36,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 17: [2022-11-27 06:19:36,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-27 06:19:36,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 06:19:36,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 13: [2022-11-27 06:19:36,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:19:36,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 
30: [2022-11-27 06:19:36,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 13: [2022-11-27 06:19:36,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 30: [2022-11-27 06:19:36,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 2: [2022-11-27 06:19:36,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:19:36,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 2: [2022-11-27 06:19:36,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 06:19:36,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 28: [2022-11-27 06:19:36,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-27 06:19:36,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 27: [2022-11-27 06:19:36,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-27 06:19:36,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 06:19:36,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 
6: [2022-11-27 06:19:36,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:19:36,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 06:19:36,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 20: [2022-11-27 06:19:36,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:19:36,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 06:19:36,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 3: [2022-11-27 06:19:36,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:19:36,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:19:36,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 3: [2022-11-27 06:19:36,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 06:19:36,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 31: [2022-11-27 06:19:36,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 
22: [2022-11-27 06:19:36,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-27 06:19:36,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 06:19:36,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 8: [2022-11-27 06:19:36,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 19: [2022-11-27 06:19:36,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:19:36,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 06:19:36,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 19: [2022-11-27 06:19:36,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 06:19:36,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 21: [2022-11-27 06:19:36,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:19:36,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 28: [2022-11-27 06:19:36,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 
21: [2022-11-27 06:19:36,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 9: [2022-11-27 06:19:36,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 06:19:36,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 21: [2022-11-27 06:19:36,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 25: [2022-11-27 06:19:36,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-27 06:19:36,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 16: [2022-11-27 06:19:36,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 25: [2022-11-27 06:19:36,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 16: [2022-11-27 06:19:36,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 06:19:36,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 25: [2022-11-27 06:19:36,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 
25: [2022-11-27 06:19:36,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 06:19:36,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 26: [2022-11-27 06:19:36,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-27 06:19:36,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-27 06:19:36,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 06:19:36,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 06:19:36,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 26: [2022-11-27 06:19:36,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 29: [2022-11-27 06:19:36,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:19:36,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
29: [2022-11-27 06:19:36,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 12: [2022-11-27 06:19:36,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step161000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 29: [2022-11-27 06:19:36,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 12: [2022-11-27 06:19:36,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step161000 is ready now! 0: successfully saved checkpoint at iteration 161000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2556.68 31: iteration 161010/ 173500 | consumed samples: 41218560 | consumed tokens: 84415610880 | elapsed time per iteration (s): 1.09 | learning rate: 2.234E-05 | global batch size: 256 | lm loss: 1.920200E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.284 | TFLOPs: 14.17 | 31: iteration 161020/ 173500 | consumed samples: 41221120 | consumed tokens: 84420853760 | elapsed time per iteration (s): 0.83 | learning rate: 2.233E-05 | global batch size: 256 | lm loss: 1.918559E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.763 | TFLOPs: 18.62 | 31: iteration 161030/ 173500 | consumed samples: 41223680 | consumed tokens: 84426096640 | elapsed time per iteration (s): 0.80 | learning rate: 2.233E-05 | global batch size: 256 | lm loss: 1.924685E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.220 | TFLOPs: 19.43 | 31: iteration 161040/ 173500 | consumed samples: 41226240 | consumed tokens: 84431339520 | elapsed time per iteration (s): 0.78 | learning rate: 2.233E-05 | global batch size: 256 | lm loss: 
1.916333E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.974 | TFLOPs: 19.90 | 31: iteration 161050/ 173500 | consumed samples: 41228800 | consumed tokens: 84436582400 | elapsed time per iteration (s): 0.79 | learning rate: 2.232E-05 | global batch size: 256 | lm loss: 1.906310E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.314 | TFLOPs: 19.62 | 31: iteration 161060/ 173500 | consumed samples: 41231360 | consumed tokens: 84441825280 | elapsed time per iteration (s): 0.79 | learning rate: 2.232E-05 | global batch size: 256 | lm loss: 1.892774E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.309 | TFLOPs: 19.68 | 31: iteration 161070/ 173500 | consumed samples: 41233920 | consumed tokens: 84447068160 | elapsed time per iteration (s): 0.84 | learning rate: 2.232E-05 | global batch size: 256 | lm loss: 1.938551E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.433 | TFLOPs: 18.42 | 31: iteration 161080/ 173500 | consumed samples: 41236480 | consumed tokens: 84452311040 | elapsed time per iteration (s): 0.79 | learning rate: 2.231E-05 | global batch size: 256 | lm loss: 1.922887E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.489 | TFLOPs: 19.63 | 31: iteration 161090/ 173500 | consumed samples: 41239040 | consumed tokens: 84457553920 | elapsed time per iteration (s): 0.79 | learning rate: 2.231E-05 | global batch size: 256 | lm loss: 1.914230E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.377 | TFLOPs: 19.50 | 31: iteration 161100/ 173500 | consumed samples: 41241600 | consumed tokens: 
84462796800 | elapsed time per iteration (s): 0.82 | learning rate: 2.230E-05 | global batch size: 256 | lm loss: 1.935792E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.870 | TFLOPs: 18.81 | 31: iteration 161110/ 173500 | consumed samples: 41244160 | consumed tokens: 84468039680 | elapsed time per iteration (s): 0.85 | learning rate: 2.230E-05 | global batch size: 256 | lm loss: 1.909639E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.104 | TFLOPs: 18.22 | 31: iteration 161120/ 173500 | consumed samples: 41246720 | consumed tokens: 84473282560 | elapsed time per iteration (s): 0.82 | learning rate: 2.230E-05 | global batch size: 256 | lm loss: 1.896797E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.828 | TFLOPs: 18.93 | 31: iteration 161130/ 173500 | consumed samples: 41249280 | consumed tokens: 84478525440 | elapsed time per iteration (s): 0.82 | learning rate: 2.229E-05 | global batch size: 256 | lm loss: 1.907550E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.684 | TFLOPs: 18.92 | 31: iteration 161140/ 173500 | consumed samples: 41251840 | consumed tokens: 84483768320 | elapsed time per iteration (s): 0.79 | learning rate: 2.229E-05 | global batch size: 256 | lm loss: 1.906894E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.416 | TFLOPs: 19.63 | 31: iteration 161150/ 173500 | consumed samples: 41254400 | consumed tokens: 84489011200 | elapsed time per iteration (s): 0.79 | learning rate: 2.229E-05 | global batch size: 256 | lm loss: 1.917483E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 322.625 | TFLOPs: 19.52 | 31: iteration 161160/ 173500 | consumed samples: 41256960 | consumed tokens: 84494254080 | elapsed time per iteration (s): 0.75 | learning rate: 2.228E-05 | global batch size: 256 | lm loss: 1.895243E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.192 | TFLOPs: 20.58 | 31: iteration 161170/ 173500 | consumed samples: 41259520 | consumed tokens: 84499496960 | elapsed time per iteration (s): 0.82 | learning rate: 2.228E-05 | global batch size: 256 | lm loss: 1.898237E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.540 | TFLOPs: 18.91 | 31: iteration 161180/ 173500 | consumed samples: 41262080 | consumed tokens: 84504739840 | elapsed time per iteration (s): 0.81 | learning rate: 2.228E-05 | global batch size: 256 | lm loss: 1.909216E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.594 | TFLOPs: 19.21 | 31: iteration 161190/ 173500 | consumed samples: 41264640 | consumed tokens: 84509982720 | elapsed time per iteration (s): 0.78 | learning rate: 2.227E-05 | global batch size: 256 | lm loss: 1.920035E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.203 | TFLOPs: 19.86 | 31: iteration 161200/ 173500 | consumed samples: 41267200 | consumed tokens: 84515225600 | elapsed time per iteration (s): 0.73 | learning rate: 2.227E-05 | global batch size: 256 | lm loss: 1.908643E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.125 | TFLOPs: 21.30 | 31: iteration 161210/ 173500 | consumed samples: 41269760 | consumed tokens: 84520468480 | elapsed time per iteration (s): 0.75 | learning rate: 2.226E-05 | global batch size: 256 | lm loss: 
1.897708E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.196 | TFLOPs: 20.58 | 31: iteration 161220/ 173500 | consumed samples: 41272320 | consumed tokens: 84525711360 | elapsed time per iteration (s): 0.75 | learning rate: 2.226E-05 | global batch size: 256 | lm loss: 1.930022E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.156 | TFLOPs: 20.76 | 31: iteration 161230/ 173500 | consumed samples: 41274880 | consumed tokens: 84530954240 | elapsed time per iteration (s): 0.77 | learning rate: 2.226E-05 | global batch size: 256 | lm loss: 1.901863E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.882 | TFLOPs: 20.14 | 31: iteration 161240/ 173500 | consumed samples: 41277440 | consumed tokens: 84536197120 | elapsed time per iteration (s): 0.78 | learning rate: 2.225E-05 | global batch size: 256 | lm loss: 1.902022E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.387 | TFLOPs: 19.87 | 31: iteration 161250/ 173500 | consumed samples: 41280000 | consumed tokens: 84541440000 | elapsed time per iteration (s): 0.82 | learning rate: 2.225E-05 | global batch size: 256 | lm loss: 1.933382E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.682 | TFLOPs: 18.98 | 31: iteration 161260/ 173500 | consumed samples: 41282560 | consumed tokens: 84546682880 | elapsed time per iteration (s): 0.86 | learning rate: 2.225E-05 | global batch size: 256 | lm loss: 1.937494E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.438 | TFLOPs: 18.05 | 31: iteration 161270/ 173500 | consumed samples: 41285120 | consumed tokens: 
84551925760 | elapsed time per iteration (s): 0.88 | learning rate: 2.224E-05 | global batch size: 256 | lm loss: 1.939705E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.206 | TFLOPs: 17.56 | 31: iteration 161280/ 173500 | consumed samples: 41287680 | consumed tokens: 84557168640 | elapsed time per iteration (s): 0.90 | learning rate: 2.224E-05 | global batch size: 256 | lm loss: 1.887850E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.572 | TFLOPs: 17.28 | 31: iteration 161290/ 173500 | consumed samples: 41290240 | consumed tokens: 84562411520 | elapsed time per iteration (s): 0.92 | learning rate: 2.224E-05 | global batch size: 256 | lm loss: 1.925229E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 279.001 | TFLOPs: 16.88 | 31: iteration 161300/ 173500 | consumed samples: 41292800 | consumed tokens: 84567654400 | elapsed time per iteration (s): 0.87 | learning rate: 2.223E-05 | global batch size: 256 | lm loss: 1.909921E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.859 | TFLOPs: 17.90 | 31: iteration 161310/ 173500 | consumed samples: 41295360 | consumed tokens: 84572897280 | elapsed time per iteration (s): 1.03 | learning rate: 2.223E-05 | global batch size: 256 | lm loss: 1.911083E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.523 | TFLOPs: 15.04 | 31: iteration 161320/ 173500 | consumed samples: 41297920 | consumed tokens: 84578140160 | elapsed time per iteration (s): 0.92 | learning rate: 2.222E-05 | global batch size: 256 | lm loss: 1.877136E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 279.750 | TFLOPs: 16.92 | 31: iteration 161330/ 173500 | consumed samples: 41300480 | consumed tokens: 84583383040 | elapsed time per iteration (s): 0.86 | learning rate: 2.222E-05 | global batch size: 256 | lm loss: 1.872793E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.371 | TFLOPs: 18.11 | 31: iteration 161340/ 173500 | consumed samples: 41303040 | consumed tokens: 84588625920 | elapsed time per iteration (s): 0.91 | learning rate: 2.222E-05 | global batch size: 256 | lm loss: 1.887064E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 281.344 | TFLOPs: 17.02 | 31: iteration 161350/ 173500 | consumed samples: 41305600 | consumed tokens: 84593868800 | elapsed time per iteration (s): 0.83 | learning rate: 2.221E-05 | global batch size: 256 | lm loss: 1.944110E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.184 | TFLOPs: 18.58 | 31: iteration 161360/ 173500 | consumed samples: 41308160 | consumed tokens: 84599111680 | elapsed time per iteration (s): 0.82 | learning rate: 2.221E-05 | global batch size: 256 | lm loss: 1.915181E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.506 | TFLOPs: 18.97 | 31: iteration 161370/ 173500 | consumed samples: 41310720 | consumed tokens: 84604354560 | elapsed time per iteration (s): 0.82 | learning rate: 2.221E-05 | global batch size: 256 | lm loss: 1.917027E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.963 | TFLOPs: 18.87 | 31: iteration 161380/ 173500 | consumed samples: 41313280 | consumed tokens: 84609597440 | elapsed time per iteration (s): 0.76 | learning rate: 2.220E-05 | global batch size: 256 | lm loss: 
1.905044E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.636 | TFLOPs: 20.31 | 31: iteration 161390/ 173500 | consumed samples: 41315840 | consumed tokens: 84614840320 | elapsed time per iteration (s): 0.83 | learning rate: 2.220E-05 | global batch size: 256 | lm loss: 1.921996E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.512 | TFLOPs: 18.60 | 31: iteration 161400/ 173500 | consumed samples: 41318400 | consumed tokens: 84620083200 | elapsed time per iteration (s): 0.74 | learning rate: 2.220E-05 | global batch size: 256 | lm loss: 1.902463E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.163 | TFLOPs: 20.88 | 31: iteration 161410/ 173500 | consumed samples: 41320960 | consumed tokens: 84625326080 | elapsed time per iteration (s): 0.78 | learning rate: 2.219E-05 | global batch size: 256 | lm loss: 1.898561E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.191 | TFLOPs: 19.98 | 31: iteration 161420/ 173500 | consumed samples: 41323520 | consumed tokens: 84630568960 | elapsed time per iteration (s): 0.87 | learning rate: 2.219E-05 | global batch size: 256 | lm loss: 1.911043E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.684 | TFLOPs: 17.89 | 31: iteration 161430/ 173500 | consumed samples: 41326080 | consumed tokens: 84635811840 | elapsed time per iteration (s): 0.90 | learning rate: 2.218E-05 | global batch size: 256 | lm loss: 1.953460E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.386 | TFLOPs: 17.27 | 31: iteration 161440/ 173500 | consumed samples: 41328640 | consumed tokens: 
84641054720 | elapsed time per iteration (s): 0.90 | learning rate: 2.218E-05 | global batch size: 256 | lm loss: 1.900345E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.769 | TFLOPs: 17.29 | 31: iteration 161450/ 173500 | consumed samples: 41331200 | consumed tokens: 84646297600 | elapsed time per iteration (s): 0.84 | learning rate: 2.218E-05 | global batch size: 256 | lm loss: 1.937178E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.818 | TFLOPs: 18.50 | 31: iteration 161460/ 173500 | consumed samples: 41333760 | consumed tokens: 84651540480 | elapsed time per iteration (s): 0.87 | learning rate: 2.217E-05 | global batch size: 256 | lm loss: 1.899225E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.562 | TFLOPs: 17.88 | 31: iteration 161470/ 173500 | consumed samples: 41336320 | consumed tokens: 84656783360 | elapsed time per iteration (s): 0.77 | learning rate: 2.217E-05 | global batch size: 256 | lm loss: 1.920378E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.875 | TFLOPs: 20.02 | 31: iteration 161480/ 173500 | consumed samples: 41338880 | consumed tokens: 84662026240 | elapsed time per iteration (s): 0.76 | learning rate: 2.217E-05 | global batch size: 256 | lm loss: 1.915031E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.926 | TFLOPs: 20.26 | 31: iteration 161490/ 173500 | consumed samples: 41341440 | consumed tokens: 84667269120 | elapsed time per iteration (s): 0.74 | learning rate: 2.216E-05 | global batch size: 256 | lm loss: 1.917085E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 344.247 | TFLOPs: 20.83 | 31: iteration 161500/ 173500 | consumed samples: 41344000 | consumed tokens: 84672512000 | elapsed time per iteration (s): 0.81 | learning rate: 2.216E-05 | global batch size: 256 | lm loss: 1.938213E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.614 | TFLOPs: 19.03 | 31: iteration 161510/ 173500 | consumed samples: 41346560 | consumed tokens: 84677754880 | elapsed time per iteration (s): 0.75 | learning rate: 2.216E-05 | global batch size: 256 | lm loss: 1.887968E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.605 | TFLOPs: 20.55 | 31: iteration 161520/ 173500 | consumed samples: 41349120 | consumed tokens: 84682997760 | elapsed time per iteration (s): 0.71 | learning rate: 2.215E-05 | global batch size: 256 | lm loss: 1.917256E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 359.568 | TFLOPs: 21.75 | 31: iteration 161530/ 173500 | consumed samples: 41351680 | consumed tokens: 84688240640 | elapsed time per iteration (s): 0.79 | learning rate: 2.215E-05 | global batch size: 256 | lm loss: 1.900909E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.733 | TFLOPs: 19.59 | 31: iteration 161540/ 173500 | consumed samples: 41354240 | consumed tokens: 84693483520 | elapsed time per iteration (s): 0.76 | learning rate: 2.214E-05 | global batch size: 256 | lm loss: 1.908415E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.813 | TFLOPs: 20.44 | 31: iteration 161550/ 173500 | consumed samples: 41356800 | consumed tokens: 84698726400 | elapsed time per iteration (s): 0.78 | learning rate: 2.214E-05 | global batch size: 256 | lm loss: 
1.942887E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.716 | TFLOPs: 19.77 | 31: iteration 161560/ 173500 | consumed samples: 41359360 | consumed tokens: 84703969280 | elapsed time per iteration (s): 0.83 | learning rate: 2.214E-05 | global batch size: 256 | lm loss: 1.915910E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.677 | TFLOPs: 18.73 | 31: iteration 161570/ 173500 | consumed samples: 41361920 | consumed tokens: 84709212160 | elapsed time per iteration (s): 0.84 | learning rate: 2.213E-05 | global batch size: 256 | lm loss: 1.901708E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.098 | TFLOPs: 18.46 | 31: iteration 161580/ 173500 | consumed samples: 41364480 | consumed tokens: 84714455040 | elapsed time per iteration (s): 0.80 | learning rate: 2.213E-05 | global batch size: 256 | lm loss: 1.939084E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.383 | TFLOPs: 19.26 | 31: iteration 161590/ 173500 | consumed samples: 41367040 | consumed tokens: 84719697920 | elapsed time per iteration (s): 0.82 | learning rate: 2.213E-05 | global batch size: 256 | lm loss: 1.919188E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.248 | TFLOPs: 18.95 | 31: iteration 161600/ 173500 | consumed samples: 41369600 | consumed tokens: 84724940800 | elapsed time per iteration (s): 0.78 | learning rate: 2.212E-05 | global batch size: 256 | lm loss: 1.925204E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.759 | TFLOPs: 19.89 | 31: iteration 161610/ 173500 | consumed samples: 41372160 | consumed tokens: 
84730183680 | elapsed time per iteration (s): 0.80 | learning rate: 2.212E-05 | global batch size: 256 | lm loss: 1.882417E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.931 | TFLOPs: 19.29 | 31: iteration 161620/ 173500 | consumed samples: 41374720 | consumed tokens: 84735426560 | elapsed time per iteration (s): 0.83 | learning rate: 2.212E-05 | global batch size: 256 | lm loss: 1.942569E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.856 | TFLOPs: 18.68 | 31: iteration 161630/ 173500 | consumed samples: 41377280 | consumed tokens: 84740669440 | elapsed time per iteration (s): 0.80 | learning rate: 2.211E-05 | global batch size: 256 | lm loss: 1.895766E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.210 | TFLOPs: 19.37 | 31: iteration 161640/ 173500 | consumed samples: 41379840 | consumed tokens: 84745912320 | elapsed time per iteration (s): 0.83 | learning rate: 2.211E-05 | global batch size: 256 | lm loss: 1.923543E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.437 | TFLOPs: 18.66 | 31: iteration 161650/ 173500 | consumed samples: 41382400 | consumed tokens: 84751155200 | elapsed time per iteration (s): 0.84 | learning rate: 2.211E-05 | global batch size: 256 | lm loss: 1.875130E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.022 | TFLOPs: 18.51 | 31: iteration 161660/ 173500 | consumed samples: 41384960 | consumed tokens: 84756398080 | elapsed time per iteration (s): 0.78 | learning rate: 2.210E-05 | global batch size: 256 | lm loss: 1.916061E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 330.302 | TFLOPs: 19.98 | 31: iteration 161670/ 173500 | consumed samples: 41387520 | consumed tokens: 84761640960 | elapsed time per iteration (s): 0.77 | learning rate: 2.210E-05 | global batch size: 256 | lm loss: 1.896695E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.606 | TFLOPs: 20.24 | 31: iteration 161680/ 173500 | consumed samples: 41390080 | consumed tokens: 84766883840 | elapsed time per iteration (s): 0.76 | learning rate: 2.210E-05 | global batch size: 256 | lm loss: 1.894251E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.600 | TFLOPs: 20.42 | 31: iteration 161690/ 173500 | consumed samples: 41392640 | consumed tokens: 84772126720 | elapsed time per iteration (s): 0.75 | learning rate: 2.209E-05 | global batch size: 256 | lm loss: 1.923434E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.730 | TFLOPs: 20.61 | 31: iteration 161700/ 173500 | consumed samples: 41395200 | consumed tokens: 84777369600 | elapsed time per iteration (s): 0.74 | learning rate: 2.209E-05 | global batch size: 256 | lm loss: 1.919850E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.758 | TFLOPs: 20.80 | 31: iteration 161710/ 173500 | consumed samples: 41397760 | consumed tokens: 84782612480 | elapsed time per iteration (s): 0.77 | learning rate: 2.208E-05 | global batch size: 256 | lm loss: 1.918564E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.455 | TFLOPs: 20.11 | 31: iteration 161720/ 173500 | consumed samples: 41400320 | consumed tokens: 84787855360 | elapsed time per iteration (s): 0.72 | learning rate: 2.208E-05 | global batch size: 256 | lm loss: 
1.904312E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.767 | TFLOPs: 21.40 | 31: iteration 161730/ 173500 | consumed samples: 41402880 | consumed tokens: 84793098240 | elapsed time per iteration (s): 0.76 | learning rate: 2.208E-05 | global batch size: 256 | lm loss: 1.916011E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.260 | TFLOPs: 20.40 | 31: iteration 161740/ 173500 | consumed samples: 41405440 | consumed tokens: 84798341120 | elapsed time per iteration (s): 0.74 | learning rate: 2.207E-05 | global batch size: 256 | lm loss: 1.915169E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.998 | TFLOPs: 20.99 | 31: iteration 161750/ 173500 | consumed samples: 41408000 | consumed tokens: 84803584000 | elapsed time per iteration (s): 0.76 | learning rate: 2.207E-05 | global batch size: 256 | lm loss: 1.917591E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.289 | TFLOPs: 20.28 | 31: iteration 161760/ 173500 | consumed samples: 41410560 | consumed tokens: 84808826880 | elapsed time per iteration (s): 0.78 | learning rate: 2.207E-05 | global batch size: 256 | lm loss: 1.935600E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.412 | TFLOPs: 19.81 | 31: iteration 161770/ 173500 | consumed samples: 41413120 | consumed tokens: 84814069760 | elapsed time per iteration (s): 0.75 | learning rate: 2.206E-05 | global batch size: 256 | lm loss: 1.904933E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.738 | TFLOPs: 20.55 | 31: iteration 161780/ 173500 | consumed samples: 41415680 | consumed tokens: 
84819312640 | elapsed time per iteration (s): 0.77 | learning rate: 2.206E-05 | global batch size: 256 | lm loss: 1.896386E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.836 | TFLOPs: 20.08 | 31: iteration 161790/ 173500 | consumed samples: 41418240 | consumed tokens: 84824555520 | elapsed time per iteration (s): 0.75 | learning rate: 2.206E-05 | global batch size: 256 | lm loss: 1.935485E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.035 | TFLOPs: 20.57 | 31: iteration 161800/ 173500 | consumed samples: 41420800 | consumed tokens: 84829798400 | elapsed time per iteration (s): 0.79 | learning rate: 2.205E-05 | global batch size: 256 | lm loss: 1.916784E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.298 | TFLOPs: 19.50 | 31: iteration 161810/ 173500 | consumed samples: 41423360 | consumed tokens: 84835041280 | elapsed time per iteration (s): 0.81 | learning rate: 2.205E-05 | global batch size: 256 | lm loss: 1.911640E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.571 | TFLOPs: 19.21 | 31: iteration 161820/ 173500 | consumed samples: 41425920 | consumed tokens: 84840284160 | elapsed time per iteration (s): 0.79 | learning rate: 2.205E-05 | global batch size: 256 | lm loss: 1.926275E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.613 | TFLOPs: 19.70 | 31: iteration 161830/ 173500 | consumed samples: 41428480 | consumed tokens: 84845527040 | elapsed time per iteration (s): 0.82 | learning rate: 2.204E-05 | global batch size: 256 | lm loss: 1.889736E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 313.369 | TFLOPs: 18.96 | 31: iteration 161840/ 173500 | consumed samples: 41431040 | consumed tokens: 84850769920 | elapsed time per iteration (s): 0.81 | learning rate: 2.204E-05 | global batch size: 256 | lm loss: 1.903447E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.716 | TFLOPs: 19.10 | 31: iteration 161850/ 173500 | consumed samples: 41433600 | consumed tokens: 84856012800 | elapsed time per iteration (s): 0.80 | learning rate: 2.204E-05 | global batch size: 256 | lm loss: 1.920261E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.215 | TFLOPs: 19.43 | 31: iteration 161860/ 173500 | consumed samples: 41436160 | consumed tokens: 84861255680 | elapsed time per iteration (s): 0.81 | learning rate: 2.203E-05 | global batch size: 256 | lm loss: 1.917636E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.827 | TFLOPs: 19.23 | 31: iteration 161870/ 173500 | consumed samples: 41438720 | consumed tokens: 84866498560 | elapsed time per iteration (s): 0.87 | learning rate: 2.203E-05 | global batch size: 256 | lm loss: 1.901711E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.085 | TFLOPs: 17.73 | 31: iteration 161880/ 173500 | consumed samples: 41441280 | consumed tokens: 84871741440 | elapsed time per iteration (s): 0.78 | learning rate: 2.203E-05 | global batch size: 256 | lm loss: 1.937508E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.598 | TFLOPs: 19.76 | 31: iteration 161890/ 173500 | consumed samples: 41443840 | consumed tokens: 84876984320 | elapsed time per iteration (s): 0.84 | learning rate: 2.202E-05 | global batch size: 256 | lm loss: 
1.929396E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.974 | TFLOPs: 18.45 | 31: iteration 161900/ 173500 | consumed samples: 41446400 | consumed tokens: 84882227200 | elapsed time per iteration (s): 0.84 | learning rate: 2.202E-05 | global batch size: 256 | lm loss: 1.896273E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.556 | TFLOPs: 18.49 | 31: iteration 161910/ 173500 | consumed samples: 41448960 | consumed tokens: 84887470080 | elapsed time per iteration (s): 0.85 | learning rate: 2.201E-05 | global batch size: 256 | lm loss: 1.909247E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.837 | TFLOPs: 18.14 | 31: iteration 161920/ 173500 | consumed samples: 41451520 | consumed tokens: 84892712960 | elapsed time per iteration (s): 0.82 | learning rate: 2.201E-05 | global batch size: 256 | lm loss: 1.891391E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.558 | TFLOPs: 18.97 | 31: iteration 161930/ 173500 | consumed samples: 41454080 | consumed tokens: 84897955840 | elapsed time per iteration (s): 0.86 | learning rate: 2.201E-05 | global batch size: 256 | lm loss: 1.916073E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.953 | TFLOPs: 18.09 | 31: iteration 161940/ 173500 | consumed samples: 41456640 | consumed tokens: 84903198720 | elapsed time per iteration (s): 0.80 | learning rate: 2.200E-05 | global batch size: 256 | lm loss: 1.930785E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.635 | TFLOPs: 19.34 | 31: iteration 161950/ 173500 | consumed samples: 41459200 | consumed tokens: 
84908441600 | elapsed time per iteration (s): 0.78 | learning rate: 2.200E-05 | global batch size: 256 | lm loss: 1.903069E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.535 | TFLOPs: 19.94 | 31: iteration 161960/ 173500 | consumed samples: 41461760 | consumed tokens: 84913684480 | elapsed time per iteration (s): 0.77 | learning rate: 2.200E-05 | global batch size: 256 | lm loss: 1.915069E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.077 | TFLOPs: 20.21 | 31: iteration 161970/ 173500 | consumed samples: 41464320 | consumed tokens: 84918927360 | elapsed time per iteration (s): 0.73 | learning rate: 2.199E-05 | global batch size: 256 | lm loss: 1.915699E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.180 | TFLOPs: 21.18 | 31: iteration 161980/ 173500 | consumed samples: 41466880 | consumed tokens: 84924170240 | elapsed time per iteration (s): 0.73 | learning rate: 2.199E-05 | global batch size: 256 | lm loss: 1.907421E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.556 | TFLOPs: 21.09 | 31: iteration 161990/ 173500 | consumed samples: 41469440 | consumed tokens: 84929413120 | elapsed time per iteration (s): 0.74 | learning rate: 2.199E-05 | global batch size: 256 | lm loss: 1.904102E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.959 | TFLOPs: 20.81 | 0: [2022-11-27 06:33:00,089] [INFO] [logging.py:68:log_dist] [Rank 0] step=162000, skipped=0, lr=[2.1983700493183342e-05, 2.1983700493183342e-05, 2.1983700493183342e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 162000/ 173500 | consumed samples: 41472000 | consumed tokens: 84934656000 | elapsed 
time per iteration (s): 0.74 | learning rate: 2.198E-05 | global batch size: 256 | lm loss: 1.951315E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.326 | TFLOPs: 20.83 | 0: steps: 162000 loss: 1.9741 iter time (s): 0.797 samples/sec: 321.052 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 162000 | lm loss value: 1.878700E+00 | lm loss PPL: 6.544990E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 162000 to checkpoints_1b1long 0: [2022-11-27 06:33:00,375] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step162000 is begin to save! 0: [2022-11-27 06:33:00,572] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_01-model_00-model_states.pt... 0: [2022-11-27 06:33:00,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_01-model_00-model_states.pt. 0: [2022-11-27 06:33:00,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_03-model_00-model_states.pt... 0: [2022-11-27 06:33:00,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_03-model_00-model_states.pt. 0: [2022-11-27 06:33:00,892] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_04-model_00-model_states.pt... 0: [2022-11-27 06:33:00,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_04-model_00-model_states.pt. 0: [2022-11-27 06:33:00,969] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_05-model_00-model_states.pt... 
0: [2022-11-27 06:33:01,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_05-model_00-model_states.pt. 0: [2022-11-27 06:33:01,045] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_06-model_00-model_states.pt... 0: [2022-11-27 06:33:01,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_06-model_00-model_states.pt. 0: [2022-11-27 06:33:01,121] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_07-model_00-model_states.pt... 0: [2022-11-27 06:33:01,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_07-model_00-model_states.pt. 0: [2022-11-27 06:33:01,197] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_08-model_00-model_states.pt... 0: [2022-11-27 06:33:01,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_08-model_00-model_states.pt. 0: [2022-11-27 06:33:01,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_09-model_00-model_states.pt... 0: [2022-11-27 06:33:01,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_09-model_00-model_states.pt. 0: [2022-11-27 06:33:01,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_10-model_00-model_states.pt... 0: [2022-11-27 06:33:01,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_10-model_00-model_states.pt. 0: [2022-11-27 06:33:01,426] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_11-model_00-model_states.pt... 
0: [2022-11-27 06:33:01,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_11-model_00-model_states.pt. 0: [2022-11-27 06:33:01,502] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_12-model_00-model_states.pt... 0: [2022-11-27 06:33:01,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_12-model_00-model_states.pt. 0: [2022-11-27 06:33:01,581] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_13-model_00-model_states.pt... 0: [2022-11-27 06:33:01,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_13-model_00-model_states.pt. 0: [2022-11-27 06:33:01,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_14-model_00-model_states.pt... 0: [2022-11-27 06:33:01,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_14-model_00-model_states.pt. 0: [2022-11-27 06:33:01,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_15-model_00-model_states.pt... 0: [2022-11-27 06:33:01,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_15-model_00-model_states.pt. 0: [2022-11-27 06:33:01,806] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_16-model_00-model_states.pt... 0: [2022-11-27 06:33:01,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_16-model_00-model_states.pt. 0: [2022-11-27 06:33:01,883] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_17-model_00-model_states.pt... 
0: [2022-11-27 06:33:01,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_17-model_00-model_states.pt. 0: [2022-11-27 06:33:01,958] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_18-model_00-model_states.pt... 0: [2022-11-27 06:33:02,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_18-model_00-model_states.pt. 0: [2022-11-27 06:33:02,033] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_19-model_00-model_states.pt... 0: [2022-11-27 06:33:02,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_19-model_00-model_states.pt. 0: [2022-11-27 06:33:02,109] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_20-model_00-model_states.pt... 0: [2022-11-27 06:33:02,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_20-model_00-model_states.pt. 0: [2022-11-27 06:33:02,182] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_21-model_00-model_states.pt... 0: [2022-11-27 06:33:02,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_21-model_00-model_states.pt. 0: [2022-11-27 06:33:02,258] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_22-model_00-model_states.pt... 0: [2022-11-27 06:33:02,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_22-model_00-model_states.pt. 0: [2022-11-27 06:33:02,333] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_23-model_00-model_states.pt... 
0: [2022-11-27 06:33:02,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_23-model_00-model_states.pt. 0: [2022-11-27 06:33:02,408] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_24-model_00-model_states.pt... 0: [2022-11-27 06:33:02,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_24-model_00-model_states.pt. 0: [2022-11-27 06:33:02,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_25-model_00-model_states.pt... 0: [2022-11-27 06:33:02,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_25-model_00-model_states.pt. 0: [2022-11-27 06:33:02,556] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_26-model_00-model_states.pt... 0: [2022-11-27 06:33:02,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_26-model_00-model_states.pt. 0: [2022-11-27 06:33:02,631] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_27-model_00-model_states.pt... 0: [2022-11-27 06:33:02,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_27-model_00-model_states.pt. 0: [2022-11-27 06:33:02,706] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_28-model_00-model_states.pt... 0: [2022-11-27 06:33:02,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_28-model_00-model_states.pt. 0: [2022-11-27 06:33:02,779] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/layer_30-model_00-model_states.pt... 
0: [2022-11-27 06:33:02,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/layer_30-model_00-model_states.pt. 0: [2022-11-27 06:33:02,782] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step162000/mp_rank_00_model_states.pt 0: [2022-11-27 06:33:02,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/mp_rank_00_model_states.pt... 0: [2022-11-27 06:33:02,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/mp_rank_00_model_states.pt. 0: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 
13: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 
20: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 
11: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 
17: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 
19: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
7: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 
8: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 
16: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 
3: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 
25: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
14: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 
21: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 
27: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 
8: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
2: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 
25: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
31: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 
26: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
10: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 
31: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
6: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 31: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 
31: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 31: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 
31: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:33:02,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:33:02,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:33:02,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 06:33:02,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 25: [2022-11-27 06:33:02,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
25: [2022-11-27 06:33:02,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 06:33:02,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 6: [2022-11-27 06:33:02,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:33:02,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 1: [2022-11-27 06:33:02,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:33:02,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 1: [2022-11-27 06:33:02,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 06:33:02,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 26: [2022-11-27 06:33:02,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-27 06:33:02,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 06:33:02,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 16: [2022-11-27 06:33:02,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
16: [2022-11-27 06:33:02,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 06:33:02,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 18: [2022-11-27 06:33:02,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 19: [2022-11-27 06:33:02,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 18: [2022-11-27 06:33:02,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 19: [2022-11-27 06:33:02,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 18: [2022-11-27 06:33:02,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 19: [2022-11-27 06:33:02,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 20: [2022-11-27 06:33:02,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:33:02,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 3: [2022-11-27 06:33:02,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:33:02,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 
3: [2022-11-27 06:33:02,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 06:33:02,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 5: [2022-11-27 06:33:02,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:33:02,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:33:02,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 06:33:02,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 06:33:02,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 5: [2022-11-27 06:33:02,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 28: [2022-11-27 06:33:02,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 27: [2022-11-27 06:33:02,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-27 06:33:02,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 06:33:02,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 
0: [2022-11-27 06:33:02,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 28: [2022-11-27 06:33:02,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 22: [2022-11-27 06:33:02,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 17: [2022-11-27 06:33:02,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 28: [2022-11-27 06:33:02,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 22: [2022-11-27 06:33:02,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 0: [2022-11-27 06:33:02,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 22: [2022-11-27 06:33:02,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 0: [2022-11-27 06:33:02,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 17: [2022-11-27 06:33:02,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 06:33:02,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 21: [2022-11-27 06:33:02,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-27 06:33:02,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 06:33:02,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 30: [2022-11-27 06:33:02,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:33:02,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 23: [2022-11-27 06:33:02,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:33:02,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 23: [2022-11-27 06:33:02,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 06:33:02,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 25: [2022-11-27 06:33:02,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 29: [2022-11-27 06:33:02,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-27 06:33:02,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 06:33:02,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 
8: [2022-11-27 06:33:02,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 16: [2022-11-27 06:33:02,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 25: [2022-11-27 06:33:02,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 8: [2022-11-27 06:33:02,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 25: [2022-11-27 06:33:02,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 6: [2022-11-27 06:33:02,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:33:02,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 16: [2022-11-27 06:33:02,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 06:33:02,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 6: [2022-11-27 06:33:02,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 06:33:02,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 1: [2022-11-27 06:33:02,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
12: [2022-11-27 06:33:02,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 27: [2022-11-27 06:33:02,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:33:02,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 12: [2022-11-27 06:33:02,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 27: [2022-11-27 06:33:02,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 1: [2022-11-27 06:33:02,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 12: [2022-11-27 06:33:02,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 9: [2022-11-27 06:33:02,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:33:02,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 06:33:02,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 27: [2022-11-27 06:33:02,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 0: [2022-11-27 06:33:02,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
15: [2022-11-27 06:33:02,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:33:02,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 28: [2022-11-27 06:33:02,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:33:02,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 06:33:02,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 15: [2022-11-27 06:33:02,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 18: [2022-11-27 06:33:02,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 06:33:02,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 06:33:02,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 23: [2022-11-27 06:33:02,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-27 06:33:02,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 06:33:02,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 
6: [2022-11-27 06:33:02,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:33:02,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:33:02,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 15: [2022-11-27 06:33:02,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 6: [2022-11-27 06:33:02,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 15: [2022-11-27 06:33:02,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 7: [2022-11-27 06:33:02,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:33:02,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:33:02,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 06:33:02,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 06:33:02,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 7: [2022-11-27 06:33:02,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 
22: [2022-11-27 06:33:02,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-27 06:33:02,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 06:33:02,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 12: [2022-11-27 06:33:02,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 19: [2022-11-27 06:33:02,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:33:02,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 06:33:02,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 19: [2022-11-27 06:33:02,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-27 06:33:02,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 28: [2022-11-27 06:33:02,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 06:33:02,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 30: [2022-11-27 06:33:02,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
30: [2022-11-27 06:33:02,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 06:33:02,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 20: [2022-11-27 06:33:02,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:33:02,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 06:33:02,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 9: [2022-11-27 06:33:02,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:33:02,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 06:33:02,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 1: [2022-11-27 06:33:02,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 25: [2022-11-27 06:33:02,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
1: [2022-11-27 06:33:02,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 25: [2022-11-27 06:33:02,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-27 06:33:02,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 1: [2022-11-27 06:33:02,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 3: [2022-11-27 06:33:02,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:33:02,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 06:33:02,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 3: [2022-11-27 06:33:02,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:33:02,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 06:33:02,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 21: [2022-11-27 06:33:02,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:33:02,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
21: [2022-11-27 06:33:02,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 20: [2022-11-27 06:33:02,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 21: [2022-11-27 06:33:02,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 27: [2022-11-27 06:33:02,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:33:02,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 27: [2022-11-27 06:33:02,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 29: [2022-11-27 06:33:02,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-27 06:33:02,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 27: [2022-11-27 06:33:02,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 29: [2022-11-27 06:33:02,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 23: [2022-11-27 06:33:02,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
23: [2022-11-27 06:33:02,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 06:33:02,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 8: [2022-11-27 06:33:02,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:33:02,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:33:02,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 5: [2022-11-27 06:33:02,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 8: [2022-11-27 06:33:02,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 5: [2022-11-27 06:33:02,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 26: [2022-11-27 06:33:02,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-27 06:33:02,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 06:33:02,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 22: [2022-11-27 06:33:02,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
22: [2022-11-27 06:33:02,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 06:33:02,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 19: [2022-11-27 06:33:02,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 06:33:02,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 06:33:02,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 17: [2022-11-27 06:33:02,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:33:02,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:33:02,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 06:33:02,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 17: [2022-11-27 06:33:02,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 06:33:02,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 17: [2022-11-27 06:33:02,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-27 06:33:02,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 06:33:02,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 18: [2022-11-27 06:33:02,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-27 06:33:02,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 06:33:02,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 26: [2022-11-27 06:33:02,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-27 06:33:02,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-27 06:33:02,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 7: [2022-11-27 06:33:02,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:33:02,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 06:33:02,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 16: [2022-11-27 06:33:02,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 
16: [2022-11-27 06:33:02,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 06:33:02,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 28: [2022-11-27 06:33:02,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-27 06:33:02,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 06:33:02,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 8: [2022-11-27 06:33:02,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:33:02,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 06:33:02,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 15: [2022-11-27 06:33:02,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:33:02,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 06:33:02,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 31: [2022-11-27 06:33:02,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 
31: [2022-11-27 06:33:02,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:33:02,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 06:33:02,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 06:33:02,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 31: [2022-11-27 06:33:02,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 30: [2022-11-27 06:33:02,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:33:02,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 06:33:02,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 29: [2022-11-27 06:33:02,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-27 06:33:02,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 06:33:02,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 9: [2022-11-27 06:33:02,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-27 06:33:02,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 06:33:02,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 21: [2022-11-27 06:33:02,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:33:02,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 06:33:02,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 31: [2022-11-27 06:33:02,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:33:02,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 06:33:02,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 24: [2022-11-27 06:33:02,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:33:02,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
24: [2022-11-27 06:33:02,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 06:33:02,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-27 06:33:02,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 24: [2022-11-27 06:33:02,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 5: [2022-11-27 06:33:02,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:33:02,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 06:33:02,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 1: [2022-11-27 06:33:02,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:33:02,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 06:33:02,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 14: [2022-11-27 06:33:02,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:33:02,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-27 06:33:02,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 06:33:02,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:33:02,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 06:33:02,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 14: [2022-11-27 06:33:02,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 14: [2022-11-27 06:33:02,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 06:33:02,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 20: [2022-11-27 06:33:02,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:33:02,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 06:33:02,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 6: [2022-11-27 06:33:02,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-27 06:33:02,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 06:33:02,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 12: [2022-11-27 06:33:02,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:33:02,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 06:33:02,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 13: [2022-11-27 06:33:02,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:33:02,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:33:02,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:33:02,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 06:33:02,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 
13: [2022-11-27 06:33:02,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 06:33:02,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 06:33:02,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 13: [2022-11-27 06:33:02,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 26: [2022-11-27 06:33:02,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 06:33:02,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 06:33:02,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 25: [2022-11-27 06:33:02,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-27 06:33:02,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 06:33:02,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 13: [2022-11-27 06:33:02,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-27 06:33:02,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 06:33:02,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 27: [2022-11-27 06:33:02,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-27 06:33:02,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 06:33:02,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 17: [2022-11-27 06:33:02,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-27 06:33:02,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 06:33:02,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 11: [2022-11-27 06:33:02,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:33:02,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-27 06:33:02,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 06:33:02,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 06:33:02,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 11: [2022-11-27 06:33:02,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 24: [2022-11-27 06:33:02,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:33:02,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 06:33:02,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 3: [2022-11-27 06:33:02,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:33:02,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 06:33:02,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 16: [2022-11-27 06:33:02,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
16: [2022-11-27 06:33:02,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 06:33:02,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 0: [2022-11-27 06:33:02,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:33:02,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 06:33:02,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 23: [2022-11-27 06:33:02,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-27 06:33:02,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 06:33:02,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 15: [2022-11-27 06:33:02,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:33:02,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 06:33:02,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 10: [2022-11-27 06:33:02,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-27 06:33:02,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:33:02,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 06:33:02,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 06:33:02,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 10: [2022-11-27 06:33:02,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:33:02,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 10: [2022-11-27 06:33:02,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 06:33:02,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 9: [2022-11-27 06:33:02,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:33:02,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 06:33:02,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 2: [2022-11-27 06:33:02,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-27 06:33:02,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:33:02,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 06:33:02,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 06:33:02,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 2: [2022-11-27 06:33:02,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 11: [2022-11-27 06:33:02,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:33:02,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:33:02,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 2: [2022-11-27 06:33:02,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 11: [2022-11-27 06:33:02,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 31: [2022-11-27 06:33:02,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:33:02,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 
31: [2022-11-27 06:33:02,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 06:33:02,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 19: [2022-11-27 06:33:02,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-27 06:33:02,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 06:33:02,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 30: [2022-11-27 06:33:02,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:33:02,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 06:33:02,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 4: [2022-11-27 06:33:02,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:33:02,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-27 06:33:02,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 06:33:02,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 06:33:02,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 4: [2022-11-27 06:33:02,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 22: [2022-11-27 06:33:02,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 06:33:02,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 06:33:02,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 21: [2022-11-27 06:33:02,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:33:02,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 06:33:02,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 18: [2022-11-27 06:33:02,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
18: [2022-11-27 06:33:02,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 06:33:02,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 4: [2022-11-27 06:33:02,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:33:02,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 06:33:02,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 8: [2022-11-27 06:33:02,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:33:02,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 06:33:02,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 14: [2022-11-27 06:33:02,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:33:02,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 06:33:02,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 28: [2022-11-27 06:33:02,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
28: [2022-11-27 06:33:02,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 06:33:02,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 7: [2022-11-27 06:33:02,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:33:02,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 06:33:02,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 29: [2022-11-27 06:33:02,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-27 06:33:02,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 06:33:02,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 6: [2022-11-27 06:33:03,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:33:03,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 06:33:03,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 20: [2022-11-27 06:33:03,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
20: [2022-11-27 06:33:03,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 06:33:03,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 12: [2022-11-27 06:33:03,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:33:03,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 06:33:03,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 26: [2022-11-27 06:33:03,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-27 06:33:03,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 06:33:03,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 5: [2022-11-27 06:33:03,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:33:03,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 06:33:03,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 27: [2022-11-27 06:33:03,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-27 06:33:03,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 06:33:03,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 17: [2022-11-27 06:33:03,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-27 06:33:03,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 06:33:03,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 25: [2022-11-27 06:33:03,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:33:03,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 25: [2022-11-27 06:33:03,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 06:33:03,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 15: [2022-11-27 06:33:03,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 06:33:03,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 0: [2022-11-27 06:33:03,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-27 06:33:03,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 06:33:03,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 3: [2022-11-27 06:33:03,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:33:03,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 06:33:03,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 24: [2022-11-27 06:33:03,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:33:03,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-27 06:33:03,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 23: [2022-11-27 06:33:03,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-27 06:33:03,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 06:33:03,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 16: [2022-11-27 06:33:03,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
16: [2022-11-27 06:33:03,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 06:33:03,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 11: [2022-11-27 06:33:03,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:33:03,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 06:33:03,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 31: [2022-11-27 06:33:03,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:33:03,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 06:33:03,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 10: [2022-11-27 06:33:03,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:33:03,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 06:33:03,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 1: [2022-11-27 06:33:03,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-27 06:33:03,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 06:33:03,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 2: [2022-11-27 06:33:03,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:33:03,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 06:33:03,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 19: [2022-11-27 06:33:03,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-27 06:33:03,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 22: [2022-11-27 06:33:03,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 19: [2022-11-27 06:33:03,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 22: [2022-11-27 06:33:03,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 06:33:03,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 9: [2022-11-27 06:33:03,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-27 06:33:03,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 06:33:03,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 30: [2022-11-27 06:33:03,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:33:03,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 06:33:03,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 18: [2022-11-27 06:33:03,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-27 06:33:03,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 06:33:03,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 21: [2022-11-27 06:33:03,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:33:03,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 06:33:03,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 13: [2022-11-27 06:33:03,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-27 06:33:03,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 06:33:03,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 4: [2022-11-27 06:33:03,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:33:03,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 06:33:03,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 7: [2022-11-27 06:33:03,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:33:03,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:33:03,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 8: [2022-11-27 06:33:03,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 7: [2022-11-27 06:33:03,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 8: [2022-11-27 06:33:03,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 28: [2022-11-27 06:33:03,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-27 06:33:03,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 06:33:03,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 6: [2022-11-27 06:33:03,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:33:03,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 06:33:03,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 14: [2022-11-27 06:33:03,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:33:03,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 06:33:03,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 12: [2022-11-27 06:33:03,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:33:03,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 06:33:03,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 26: [2022-11-27 06:33:03,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
26: [2022-11-27 06:33:03,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 06:33:03,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 27: [2022-11-27 06:33:03,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-27 06:33:03,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 0: [2022-11-27 06:33:03,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 27: [2022-11-27 06:33:03,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 29: [2022-11-27 06:33:03,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 06:33:03,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 06:33:03,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 3: [2022-11-27 06:33:03,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:33:03,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
3: [2022-11-27 06:33:03,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 06:33:03,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 5: [2022-11-27 06:33:03,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 06:33:03,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 15: [2022-11-27 06:33:03,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:33:03,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 06:33:03,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 23: [2022-11-27 06:33:03,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:33:03,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 23: [2022-11-27 06:33:03,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 06:33:03,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 
24: [2022-11-27 06:33:03,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 17: [2022-11-27 06:33:03,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:33:03,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 17: [2022-11-27 06:33:03,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 06:33:03,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 20: [2022-11-27 06:33:03,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:33:03,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 06:33:03,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 25: [2022-11-27 06:33:03,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-27 06:33:03,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 06:33:03,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 11: [2022-11-27 06:33:03,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-27 06:33:03,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 06:33:03,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 16: [2022-11-27 06:33:03,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 06:33:03,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 06:33:03,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 1: [2022-11-27 06:33:03,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:33:03,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 06:33:03,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 10: [2022-11-27 06:33:03,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:33:03,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 06:33:03,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 
0: [2022-11-27 06:33:03,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 06:33:03,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 31: [2022-11-27 06:33:03,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:33:03,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 06:33:03,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 9: [2022-11-27 06:33:03,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:33:03,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 06:33:03,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 13: [2022-11-27 06:33:03,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:33:03,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 06:33:03,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 19: [2022-11-27 06:33:03,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-27 06:33:03,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 06:33:03,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 30: [2022-11-27 06:33:03,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:33:03,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 22: [2022-11-27 06:33:03,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:33:03,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 22: [2022-11-27 06:33:03,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 06:33:03,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 2: [2022-11-27 06:33:03,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:33:03,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 06:33:03,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 4: [2022-11-27 06:33:03,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
18: [2022-11-27 06:33:03,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:33:03,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 06:33:03,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 18: [2022-11-27 06:33:03,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 06:33:03,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 21: [2022-11-27 06:33:03,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:33:03,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 06:33:03,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 28: [2022-11-27 06:33:03,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-27 06:33:03,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 06:33:03,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 7: [2022-11-27 06:33:03,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-27 06:33:03,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 06:33:03,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 8: [2022-11-27 06:33:03,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:33:03,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 06:33:03,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 14: [2022-11-27 06:33:03,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:33:03,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 06:33:03,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 6: [2022-11-27 06:33:03,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:33:03,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 29: [2022-11-27 06:33:03,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:33:03,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 
29: [2022-11-27 06:33:03,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 06:33:03,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 5: [2022-11-27 06:33:03,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:33:03,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 06:33:03,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 20: [2022-11-27 06:33:03,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:33:03,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 06:33:03,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 1: [2022-11-27 06:33:03,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:33:03,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 06:33:03,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 17: [2022-11-27 06:33:03,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 
17: [2022-11-27 06:33:03,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 06:33:03,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 12: [2022-11-27 06:33:03,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:33:03,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 06:33:03,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 26: [2022-11-27 06:33:03,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-27 06:33:03,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 06:33:03,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 25: [2022-11-27 06:33:03,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-27 06:33:03,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 06:33:03,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 15: [2022-11-27 06:33:03,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-27 06:33:03,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 06:33:03,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 27: [2022-11-27 06:33:03,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-27 06:33:03,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 06:33:03,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 3: [2022-11-27 06:33:03,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:33:03,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 06:33:03,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 16: [2022-11-27 06:33:03,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 06:33:03,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 9: [2022-11-27 06:33:03,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 16: [2022-11-27 06:33:03,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 
9: [2022-11-27 06:33:03,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 06:33:03,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 31: [2022-11-27 06:33:03,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:33:03,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 06:33:03,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 19: [2022-11-27 06:33:03,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-27 06:33:03,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 06:33:03,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 24: [2022-11-27 06:33:03,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:33:03,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
24: [2022-11-27 06:33:03,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 10: [2022-11-27 06:33:03,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:33:03,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 10: [2022-11-27 06:33:03,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 06:33:03,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 24: [2022-11-27 06:33:03,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 0: [2022-11-27 06:33:03,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 4: [2022-11-27 06:33:03,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:33:03,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 06:33:03,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 22: [2022-11-27 06:33:03,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 
22: [2022-11-27 06:33:03,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 06:33:03,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 1: [2022-11-27 06:33:03,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:33:03,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:33:03,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 1: [2022-11-27 06:33:03,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 21: [2022-11-27 06:33:03,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 1: [2022-11-27 06:33:03,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 7: [2022-11-27 06:33:03,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:33:03,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 06:33:03,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 26: [2022-11-27 06:33:03,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
26: [2022-11-27 06:33:03,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 06:33:03,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 2: [2022-11-27 06:33:03,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:33:03,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 06:33:03,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 20: [2022-11-27 06:33:03,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:33:03,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:33:03,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 06:33:03,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 12: [2022-11-27 06:33:03,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 06:33:03,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 17: [2022-11-27 06:33:03,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
14: [2022-11-27 06:33:03,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:33:03,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 17: [2022-11-27 06:33:03,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 06:33:03,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 14: [2022-11-27 06:33:03,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 8: [2022-11-27 06:33:03,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:33:03,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 06:33:03,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 11: [2022-11-27 06:33:03,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:33:03,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 06:33:03,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 28: [2022-11-27 06:33:03,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-27 06:33:03,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 06:33:03,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 18: [2022-11-27 06:33:03,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 06:33:03,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 06:33:03,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 23: [2022-11-27 06:33:03,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 27: [2022-11-27 06:33:03,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 23: [2022-11-27 06:33:03,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 06:33:03,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 6: [2022-11-27 06:33:03,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 27: [2022-11-27 06:33:03,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 06:33:03,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 
6: [2022-11-27 06:33:03,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 06:33:03,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 24: [2022-11-27 06:33:03,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:33:03,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 06:33:03,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 5: [2022-11-27 06:33:03,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:33:03,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 06:33:03,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 25: [2022-11-27 06:33:03,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:33:03,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:33:03,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 06:33:03,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 
25: [2022-11-27 06:33:03,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 06:33:03,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 3: [2022-11-27 06:33:03,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:33:03,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 06:33:03,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 29: [2022-11-27 06:33:03,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-27 06:33:03,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 06:33:03,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 0: [2022-11-27 06:33:03,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:33:03,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-27 06:33:03,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 0: [2022-11-27 06:33:03,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 13: [2022-11-27 06:33:03,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 0: [2022-11-27 06:33:03,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 10: [2022-11-27 06:33:03,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:33:03,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 06:33:03,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 15: [2022-11-27 06:33:03,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:33:03,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 06:33:03,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 13: [2022-11-27 06:33:03,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-27 06:33:03,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 06:33:03,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 14: [2022-11-27 06:33:03,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:33:03,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 06:33:03,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 30: [2022-11-27 06:33:03,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:33:03,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 06:33:03,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 23: [2022-11-27 06:33:03,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-27 06:33:03,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 06:33:03,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 7: [2022-11-27 06:33:03,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-27 06:33:03,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 06:33:03,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 8: [2022-11-27 06:33:03,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:33:03,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:33:03,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 21: [2022-11-27 06:33:03,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 8: [2022-11-27 06:33:03,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 21: [2022-11-27 06:33:03,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 11: [2022-11-27 06:33:03,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:33:03,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 06:33:03,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 24: [2022-11-27 06:33:03,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 
9: [2022-11-27 06:33:03,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:33:03,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 9: [2022-11-27 06:33:03,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 06:33:03,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 10: [2022-11-27 06:33:03,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 22: [2022-11-27 06:33:03,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:33:03,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 24: [2022-11-27 06:33:03,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 22: [2022-11-27 06:33:03,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 06:33:03,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 10: [2022-11-27 06:33:03,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 28: [2022-11-27 06:33:03,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 
28: [2022-11-27 06:33:03,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 06:33:03,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 16: [2022-11-27 06:33:03,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 06:33:03,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 06:33:03,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 29: [2022-11-27 06:33:03,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 06:33:03,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 06:33:03,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 31: [2022-11-27 06:33:03,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:33:03,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 06:33:03,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 11: [2022-11-27 06:33:03,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-27 06:33:03,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 06:33:03,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 2: [2022-11-27 06:33:03,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:33:03,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 06:33:03,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 4: [2022-11-27 06:33:03,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:33:03,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 06:33:03,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 18: [2022-11-27 06:33:03,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 06:33:03,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 06:33:03,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 2: [2022-11-27 06:33:03,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-27 06:33:03,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 06:33:03,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 4: [2022-11-27 06:33:03,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:33:03,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 06:33:03,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 19: [2022-11-27 06:33:03,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-27 06:33:03,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step162000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 06:33:03,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step162000 is ready now! 
0: successfully saved checkpoint at iteration 162000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2806.40 31: iteration 162010/ 173500 | consumed samples: 41474560 | consumed tokens: 84939898880 | elapsed time per iteration (s): 1.06 | learning rate: 2.198E-05 | global batch size: 256 | lm loss: 1.918399E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.612 | TFLOPs: 14.62 | 31: iteration 162020/ 173500 | consumed samples: 41477120 | consumed tokens: 84945141760 | elapsed time per iteration (s): 0.76 | learning rate: 2.198E-05 | global batch size: 256 | lm loss: 1.917868E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.842 | TFLOPs: 20.38 | 31: iteration 162030/ 173500 | consumed samples: 41479680 | consumed tokens: 84950384640 | elapsed time per iteration (s): 0.73 | learning rate: 2.197E-05 | global batch size: 256 | lm loss: 1.895841E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.027 | TFLOPs: 21.36 | 31: iteration 162040/ 173500 | consumed samples: 41482240 | consumed tokens: 84955627520 | elapsed time per iteration (s): 0.79 | learning rate: 2.197E-05 | global batch size: 256 | lm loss: 1.902491E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.414 | TFLOPs: 19.69 | 31: iteration 162050/ 173500 | consumed samples: 41484800 | consumed tokens: 84960870400 | elapsed time per iteration (s): 0.74 | learning rate: 2.197E-05 | global batch size: 256 | lm loss: 1.892518E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.693 | TFLOPs: 20.91 | 31: iteration 162060/ 173500 | consumed samples: 41487360 | consumed tokens: 84966113280 | elapsed time per iteration (s): 
0.77 | learning rate: 2.196E-05 | global batch size: 256 | lm loss: 1.902135E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.628 | TFLOPs: 20.06 | 31: iteration 162070/ 173500 | consumed samples: 41489920 | consumed tokens: 84971356160 | elapsed time per iteration (s): 0.74 | learning rate: 2.196E-05 | global batch size: 256 | lm loss: 1.927691E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.821 | TFLOPs: 20.92 | 31: iteration 162080/ 173500 | consumed samples: 41492480 | consumed tokens: 84976599040 | elapsed time per iteration (s): 0.78 | learning rate: 2.196E-05 | global batch size: 256 | lm loss: 1.909317E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.916 | TFLOPs: 19.96 | 31: iteration 162090/ 173500 | consumed samples: 41495040 | consumed tokens: 84981841920 | elapsed time per iteration (s): 0.75 | learning rate: 2.195E-05 | global batch size: 256 | lm loss: 1.908893E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.953 | TFLOPs: 20.57 | 31: iteration 162100/ 173500 | consumed samples: 41497600 | consumed tokens: 84987084800 | elapsed time per iteration (s): 0.75 | learning rate: 2.195E-05 | global batch size: 256 | lm loss: 1.915382E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.505 | TFLOPs: 20.54 | 31: iteration 162110/ 173500 | consumed samples: 41500160 | consumed tokens: 84992327680 | elapsed time per iteration (s): 0.79 | learning rate: 2.195E-05 | global batch size: 256 | lm loss: 1.913723E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.660 | TFLOPs: 19.58 | 31: 
iteration 162120/ 173500 | consumed samples: 41502720 | consumed tokens: 84997570560 | elapsed time per iteration (s): 0.75 | learning rate: 2.194E-05 | global batch size: 256 | lm loss: 1.895772E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.010 | TFLOPs: 20.63 | 31: iteration 162130/ 173500 | consumed samples: 41505280 | consumed tokens: 85002813440 | elapsed time per iteration (s): 0.75 | learning rate: 2.194E-05 | global batch size: 256 | lm loss: 1.912265E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.480 | TFLOPs: 20.54 | 31: iteration 162140/ 173500 | consumed samples: 41507840 | consumed tokens: 85008056320 | elapsed time per iteration (s): 0.78 | learning rate: 2.194E-05 | global batch size: 256 | lm loss: 1.900504E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.595 | TFLOPs: 19.76 | 31: iteration 162150/ 173500 | consumed samples: 41510400 | consumed tokens: 85013299200 | elapsed time per iteration (s): 0.74 | learning rate: 2.193E-05 | global batch size: 256 | lm loss: 1.914560E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.814 | TFLOPs: 20.92 | 31: iteration 162160/ 173500 | consumed samples: 41512960 | consumed tokens: 85018542080 | elapsed time per iteration (s): 0.79 | learning rate: 2.193E-05 | global batch size: 256 | lm loss: 1.908221E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.707 | TFLOPs: 19.58 | 31: iteration 162170/ 173500 | consumed samples: 41515520 | consumed tokens: 85023784960 | elapsed time per iteration (s): 0.79 | learning rate: 2.193E-05 | global batch size: 256 | lm loss: 1.920430E+00 | grad norm: 0.209 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.103 | TFLOPs: 19.73 | 31: iteration 162180/ 173500 | consumed samples: 41518080 | consumed tokens: 85029027840 | elapsed time per iteration (s): 0.75 | learning rate: 2.192E-05 | global batch size: 256 | lm loss: 1.935108E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.633 | TFLOPs: 20.73 | 31: iteration 162190/ 173500 | consumed samples: 41520640 | consumed tokens: 85034270720 | elapsed time per iteration (s): 0.73 | learning rate: 2.192E-05 | global batch size: 256 | lm loss: 1.907811E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.672 | TFLOPs: 21.28 | 31: iteration 162200/ 173500 | consumed samples: 41523200 | consumed tokens: 85039513600 | elapsed time per iteration (s): 0.80 | learning rate: 2.192E-05 | global batch size: 256 | lm loss: 1.887920E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.843 | TFLOPs: 19.35 | 31: iteration 162210/ 173500 | consumed samples: 41525760 | consumed tokens: 85044756480 | elapsed time per iteration (s): 0.79 | learning rate: 2.191E-05 | global batch size: 256 | lm loss: 1.915096E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.371 | TFLOPs: 19.50 | 31: iteration 162220/ 173500 | consumed samples: 41528320 | consumed tokens: 85049999360 | elapsed time per iteration (s): 0.79 | learning rate: 2.191E-05 | global batch size: 256 | lm loss: 1.900543E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.549 | TFLOPs: 19.57 | 31: iteration 162230/ 173500 | consumed samples: 41530880 | consumed tokens: 85055242240 | elapsed time per iteration (s): 0.73 | 
learning rate: 2.191E-05 | global batch size: 256 | lm loss: 1.913197E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.748 | TFLOPs: 21.10 | 31: iteration 162240/ 173500 | consumed samples: 41533440 | consumed tokens: 85060485120 | elapsed time per iteration (s): 0.81 | learning rate: 2.190E-05 | global batch size: 256 | lm loss: 1.920358E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.421 | TFLOPs: 19.20 | 31: iteration 162250/ 173500 | consumed samples: 41536000 | consumed tokens: 85065728000 | elapsed time per iteration (s): 0.77 | learning rate: 2.190E-05 | global batch size: 256 | lm loss: 1.882966E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.710 | TFLOPs: 20.07 | 31: iteration 162260/ 173500 | consumed samples: 41538560 | consumed tokens: 85070970880 | elapsed time per iteration (s): 0.72 | learning rate: 2.190E-05 | global batch size: 256 | lm loss: 1.913876E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.059 | TFLOPs: 21.42 | 31: iteration 162270/ 173500 | consumed samples: 41541120 | consumed tokens: 85076213760 | elapsed time per iteration (s): 0.76 | learning rate: 2.189E-05 | global batch size: 256 | lm loss: 1.909517E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.908 | TFLOPs: 20.50 | 31: iteration 162280/ 173500 | consumed samples: 41543680 | consumed tokens: 85081456640 | elapsed time per iteration (s): 0.80 | learning rate: 2.189E-05 | global batch size: 256 | lm loss: 1.925655E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.548 | TFLOPs: 19.39 | 31: iteration 
162290/ 173500 | consumed samples: 41546240 | consumed tokens: 85086699520 | elapsed time per iteration (s): 0.75 | learning rate: 2.189E-05 | global batch size: 256 | lm loss: 1.915457E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.466 | TFLOPs: 20.60 | 31: iteration 162300/ 173500 | consumed samples: 41548800 | consumed tokens: 85091942400 | elapsed time per iteration (s): 1.69 | learning rate: 2.188E-05 | global batch size: 256 | lm loss: 1.899895E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 151.910 | TFLOPs: 9.19 | 31: iteration 162310/ 173500 | consumed samples: 41551360 | consumed tokens: 85097185280 | elapsed time per iteration (s): 0.79 | learning rate: 2.188E-05 | global batch size: 256 | lm loss: 1.929296E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.723 | TFLOPs: 19.58 | 31: iteration 162320/ 173500 | consumed samples: 41553920 | consumed tokens: 85102428160 | elapsed time per iteration (s): 0.83 | learning rate: 2.188E-05 | global batch size: 256 | lm loss: 1.895594E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.867 | TFLOPs: 18.69 | 31: iteration 162330/ 173500 | consumed samples: 41556480 | consumed tokens: 85107671040 | elapsed time per iteration (s): 0.81 | learning rate: 2.187E-05 | global batch size: 256 | lm loss: 1.911959E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.223 | TFLOPs: 19.01 | 31: iteration 162340/ 173500 | consumed samples: 41559040 | consumed tokens: 85112913920 | elapsed time per iteration (s): 0.84 | learning rate: 2.187E-05 | global batch size: 256 | lm loss: 1.923401E+00 | grad norm: 0.195 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.495 | TFLOPs: 18.36 | 31: iteration 162350/ 173500 | consumed samples: 41561600 | consumed tokens: 85118156800 | elapsed time per iteration (s): 0.80 | learning rate: 2.187E-05 | global batch size: 256 | lm loss: 1.909472E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.204 | TFLOPs: 19.25 | 31: iteration 162360/ 173500 | consumed samples: 41564160 | consumed tokens: 85123399680 | elapsed time per iteration (s): 0.75 | learning rate: 2.186E-05 | global batch size: 256 | lm loss: 1.921611E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.184 | TFLOPs: 20.52 | 31: iteration 162370/ 173500 | consumed samples: 41566720 | consumed tokens: 85128642560 | elapsed time per iteration (s): 0.75 | learning rate: 2.186E-05 | global batch size: 256 | lm loss: 1.905680E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.736 | TFLOPs: 20.55 | 31: iteration 162380/ 173500 | consumed samples: 41569280 | consumed tokens: 85133885440 | elapsed time per iteration (s): 0.79 | learning rate: 2.186E-05 | global batch size: 256 | lm loss: 1.916650E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.785 | TFLOPs: 19.59 | 31: iteration 162390/ 173500 | consumed samples: 41571840 | consumed tokens: 85139128320 | elapsed time per iteration (s): 0.72 | learning rate: 2.185E-05 | global batch size: 256 | lm loss: 1.931991E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.132 | TFLOPs: 21.48 | 31: iteration 162400/ 173500 | consumed samples: 41574400 | consumed tokens: 85144371200 | elapsed time per iteration (s): 0.82 | learning 
rate: 2.185E-05 | global batch size: 256 | lm loss: 1.922758E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.561 | TFLOPs: 18.97 | 31: iteration 162410/ 173500 | consumed samples: 41576960 | consumed tokens: 85149614080 | elapsed time per iteration (s): 0.76 | learning rate: 2.185E-05 | global batch size: 256 | lm loss: 1.907079E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.574 | TFLOPs: 20.48 | 31: iteration 162420/ 173500 | consumed samples: 41579520 | consumed tokens: 85154856960 | elapsed time per iteration (s): 0.74 | learning rate: 2.184E-05 | global batch size: 256 | lm loss: 1.903896E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.132 | TFLOPs: 20.94 | 31: iteration 162430/ 173500 | consumed samples: 41582080 | consumed tokens: 85160099840 | elapsed time per iteration (s): 0.78 | learning rate: 2.184E-05 | global batch size: 256 | lm loss: 1.893695E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.341 | TFLOPs: 19.80 | 31: iteration 162440/ 173500 | consumed samples: 41584640 | consumed tokens: 85165342720 | elapsed time per iteration (s): 0.76 | learning rate: 2.184E-05 | global batch size: 256 | lm loss: 1.892523E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.936 | TFLOPs: 20.50 | 31: iteration 162450/ 173500 | consumed samples: 41587200 | consumed tokens: 85170585600 | elapsed time per iteration (s): 0.74 | learning rate: 2.183E-05 | global batch size: 256 | lm loss: 1.925095E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.217 | TFLOPs: 20.82 | 31: iteration 162460/ 
173500 | consumed samples: 41589760 | consumed tokens: 85175828480 | elapsed time per iteration (s): 0.75 | learning rate: 2.183E-05 | global batch size: 256 | lm loss: 1.888647E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.280 | TFLOPs: 20.65 | 31: iteration 162470/ 173500 | consumed samples: 41592320 | consumed tokens: 85181071360 | elapsed time per iteration (s): 0.76 | learning rate: 2.183E-05 | global batch size: 256 | lm loss: 1.948589E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.333 | TFLOPs: 20.41 | 31: iteration 162480/ 173500 | consumed samples: 41594880 | consumed tokens: 85186314240 | elapsed time per iteration (s): 0.72 | learning rate: 2.182E-05 | global batch size: 256 | lm loss: 1.932699E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.094 | TFLOPs: 21.60 | 31: iteration 162490/ 173500 | consumed samples: 41597440 | consumed tokens: 85191557120 | elapsed time per iteration (s): 0.74 | learning rate: 2.182E-05 | global batch size: 256 | lm loss: 1.912222E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.415 | TFLOPs: 20.90 | 31: iteration 162500/ 173500 | consumed samples: 41600000 | consumed tokens: 85196800000 | elapsed time per iteration (s): 0.91 | learning rate: 2.182E-05 | global batch size: 256 | lm loss: 1.885999E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 280.762 | TFLOPs: 16.99 | 31: iteration 162510/ 173500 | consumed samples: 41602560 | consumed tokens: 85202042880 | elapsed time per iteration (s): 0.83 | learning rate: 2.181E-05 | global batch size: 256 | lm loss: 1.880901E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 309.302 | TFLOPs: 18.71 | 31: iteration 162520/ 173500 | consumed samples: 41605120 | consumed tokens: 85207285760 | elapsed time per iteration (s): 0.86 | learning rate: 2.181E-05 | global batch size: 256 | lm loss: 1.937054E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.910 | TFLOPs: 18.08 | 31: iteration 162530/ 173500 | consumed samples: 41607680 | consumed tokens: 85212528640 | elapsed time per iteration (s): 0.74 | learning rate: 2.181E-05 | global batch size: 256 | lm loss: 1.886569E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.351 | TFLOPs: 20.95 | 31: iteration 162540/ 173500 | consumed samples: 41610240 | consumed tokens: 85217771520 | elapsed time per iteration (s): 0.80 | learning rate: 2.180E-05 | global batch size: 256 | lm loss: 1.914249E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.964 | TFLOPs: 19.42 | 31: iteration 162550/ 173500 | consumed samples: 41612800 | consumed tokens: 85223014400 | elapsed time per iteration (s): 0.74 | learning rate: 2.180E-05 | global batch size: 256 | lm loss: 1.917123E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.346 | TFLOPs: 21.01 | 31: iteration 162560/ 173500 | consumed samples: 41615360 | consumed tokens: 85228257280 | elapsed time per iteration (s): 0.79 | learning rate: 2.180E-05 | global batch size: 256 | lm loss: 1.896468E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.036 | TFLOPs: 19.72 | 31: iteration 162570/ 173500 | consumed samples: 41617920 | consumed tokens: 85233500160 | elapsed time per iteration (s): 0.83 | learning rate: 
2.179E-05 | global batch size: 256 | lm loss: 1.877854E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.136 | TFLOPs: 18.70 | 31: iteration 162580/ 173500 | consumed samples: 41620480 | consumed tokens: 85238743040 | elapsed time per iteration (s): 0.79 | learning rate: 2.179E-05 | global batch size: 256 | lm loss: 1.911742E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.304 | TFLOPs: 19.62 | 31: iteration 162590/ 173500 | consumed samples: 41623040 | consumed tokens: 85243985920 | elapsed time per iteration (s): 0.78 | learning rate: 2.179E-05 | global batch size: 256 | lm loss: 1.901090E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.318 | TFLOPs: 19.80 | 31: iteration 162600/ 173500 | consumed samples: 41625600 | consumed tokens: 85249228800 | elapsed time per iteration (s): 0.84 | learning rate: 2.178E-05 | global batch size: 256 | lm loss: 1.887224E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.508 | TFLOPs: 18.48 | 31: iteration 162610/ 173500 | consumed samples: 41628160 | consumed tokens: 85254471680 | elapsed time per iteration (s): 0.81 | learning rate: 2.178E-05 | global batch size: 256 | lm loss: 1.910750E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.207 | TFLOPs: 19.01 | 31: iteration 162620/ 173500 | consumed samples: 41630720 | consumed tokens: 85259714560 | elapsed time per iteration (s): 0.80 | learning rate: 2.178E-05 | global batch size: 256 | lm loss: 1.915909E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.797 | TFLOPs: 19.35 | 31: iteration 162630/ 173500 | 
consumed samples: 41633280 | consumed tokens: 85264957440 | elapsed time per iteration (s): 0.81 | learning rate: 2.177E-05 | global batch size: 256 | lm loss: 1.902635E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.605 | TFLOPs: 19.21 | 31: iteration 162640/ 173500 | consumed samples: 41635840 | consumed tokens: 85270200320 | elapsed time per iteration (s): 0.79 | learning rate: 2.177E-05 | global batch size: 256 | lm loss: 1.905197E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.140 | TFLOPs: 19.67 | 31: iteration 162650/ 173500 | consumed samples: 41638400 | consumed tokens: 85275443200 | elapsed time per iteration (s): 0.76 | learning rate: 2.177E-05 | global batch size: 256 | lm loss: 1.926866E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.068 | TFLOPs: 20.27 | 31: iteration 162660/ 173500 | consumed samples: 41640960 | consumed tokens: 85280686080 | elapsed time per iteration (s): 0.75 | learning rate: 2.176E-05 | global batch size: 256 | lm loss: 1.914546E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.315 | TFLOPs: 20.65 | 31: iteration 162670/ 173500 | consumed samples: 41643520 | consumed tokens: 85285928960 | elapsed time per iteration (s): 0.75 | learning rate: 2.176E-05 | global batch size: 256 | lm loss: 1.900294E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.915 | TFLOPs: 20.68 | 31: iteration 162680/ 173500 | consumed samples: 41646080 | consumed tokens: 85291171840 | elapsed time per iteration (s): 0.77 | learning rate: 2.176E-05 | global batch size: 256 | lm loss: 1.951754E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 331.803 | TFLOPs: 20.07 | 31: iteration 162690/ 173500 | consumed samples: 41648640 | consumed tokens: 85296414720 | elapsed time per iteration (s): 0.74 | learning rate: 2.175E-05 | global batch size: 256 | lm loss: 1.911728E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.207 | TFLOPs: 21.01 | 31: iteration 162700/ 173500 | consumed samples: 41651200 | consumed tokens: 85301657600 | elapsed time per iteration (s): 0.78 | learning rate: 2.175E-05 | global batch size: 256 | lm loss: 1.879571E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.533 | TFLOPs: 19.81 | 31: iteration 162710/ 173500 | consumed samples: 41653760 | consumed tokens: 85306900480 | elapsed time per iteration (s): 0.74 | learning rate: 2.175E-05 | global batch size: 256 | lm loss: 1.920557E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.682 | TFLOPs: 20.97 | 31: iteration 162720/ 173500 | consumed samples: 41656320 | consumed tokens: 85312143360 | elapsed time per iteration (s): 0.78 | learning rate: 2.174E-05 | global batch size: 256 | lm loss: 1.902280E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.166 | TFLOPs: 19.79 | 31: iteration 162730/ 173500 | consumed samples: 41658880 | consumed tokens: 85317386240 | elapsed time per iteration (s): 0.75 | learning rate: 2.174E-05 | global batch size: 256 | lm loss: 1.924972E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.563 | TFLOPs: 20.60 | 31: iteration 162740/ 173500 | consumed samples: 41661440 | consumed tokens: 85322629120 | elapsed time per iteration (s): 0.78 | learning rate: 
2.174E-05 | global batch size: 256 | lm loss: 1.915472E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.807 | TFLOPs: 19.89 | 31: iteration 162750/ 173500 | consumed samples: 41664000 | consumed tokens: 85327872000 | elapsed time per iteration (s): 0.75 | learning rate: 2.173E-05 | global batch size: 256 | lm loss: 1.937732E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.921 | TFLOPs: 20.75 | 31: iteration 162760/ 173500 | consumed samples: 41666560 | consumed tokens: 85333114880 | elapsed time per iteration (s): 0.84 | learning rate: 2.173E-05 | global batch size: 256 | lm loss: 1.906486E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.906 | TFLOPs: 18.51 | 31: iteration 162770/ 173500 | consumed samples: 41669120 | consumed tokens: 85338357760 | elapsed time per iteration (s): 0.76 | learning rate: 2.173E-05 | global batch size: 256 | lm loss: 1.916945E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.267 | TFLOPs: 20.34 | 31: iteration 162780/ 173500 | consumed samples: 41671680 | consumed tokens: 85343600640 | elapsed time per iteration (s): 0.77 | learning rate: 2.172E-05 | global batch size: 256 | lm loss: 1.919708E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.914 | TFLOPs: 20.08 | 31: iteration 162790/ 173500 | consumed samples: 41674240 | consumed tokens: 85348843520 | elapsed time per iteration (s): 0.74 | learning rate: 2.172E-05 | global batch size: 256 | lm loss: 1.885716E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.894 | TFLOPs: 20.80 | 31: iteration 162800/ 173500 | 
consumed samples: 41676800 | consumed tokens: 85354086400 | elapsed time per iteration (s): 0.81 | learning rate: 2.172E-05 | global batch size: 256 | lm loss: 1.894516E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.607 | TFLOPs: 19.21 | 31: iteration 162810/ 173500 | consumed samples: 41679360 | consumed tokens: 85359329280 | elapsed time per iteration (s): 0.76 | learning rate: 2.171E-05 | global batch size: 256 | lm loss: 1.922957E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.214 | TFLOPs: 20.46 | 31: iteration 162820/ 173500 | consumed samples: 41681920 | consumed tokens: 85364572160 | elapsed time per iteration (s): 0.75 | learning rate: 2.171E-05 | global batch size: 256 | lm loss: 1.899191E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.530 | TFLOPs: 20.60 | 31: iteration 162830/ 173500 | consumed samples: 41684480 | consumed tokens: 85369815040 | elapsed time per iteration (s): 0.72 | learning rate: 2.171E-05 | global batch size: 256 | lm loss: 1.914727E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.711 | TFLOPs: 21.58 | 31: iteration 162840/ 173500 | consumed samples: 41687040 | consumed tokens: 85375057920 | elapsed time per iteration (s): 0.78 | learning rate: 2.171E-05 | global batch size: 256 | lm loss: 1.917951E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.676 | TFLOPs: 19.76 | 31: iteration 162850/ 173500 | consumed samples: 41689600 | consumed tokens: 85380300800 | elapsed time per iteration (s): 0.73 | learning rate: 2.170E-05 | global batch size: 256 | lm loss: 1.938282E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 348.378 | TFLOPs: 21.08 | 31: iteration 162860/ 173500 | consumed samples: 41692160 | consumed tokens: 85385543680 | elapsed time per iteration (s): 0.73 | learning rate: 2.170E-05 | global batch size: 256 | lm loss: 1.903336E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.813 | TFLOPs: 21.22 | 31: iteration 162870/ 173500 | consumed samples: 41694720 | consumed tokens: 85390786560 | elapsed time per iteration (s): 0.73 | learning rate: 2.170E-05 | global batch size: 256 | lm loss: 1.924612E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.886 | TFLOPs: 21.11 | 31: iteration 162880/ 173500 | consumed samples: 41697280 | consumed tokens: 85396029440 | elapsed time per iteration (s): 0.78 | learning rate: 2.169E-05 | global batch size: 256 | lm loss: 1.858658E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.534 | TFLOPs: 19.88 | 31: iteration 162890/ 173500 | consumed samples: 41699840 | consumed tokens: 85401272320 | elapsed time per iteration (s): 0.76 | learning rate: 2.169E-05 | global batch size: 256 | lm loss: 1.898132E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.382 | TFLOPs: 20.35 | 31: iteration 162900/ 173500 | consumed samples: 41702400 | consumed tokens: 85406515200 | elapsed time per iteration (s): 0.77 | learning rate: 2.169E-05 | global batch size: 256 | lm loss: 1.921400E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.722 | TFLOPs: 20.01 | 31: iteration 162910/ 173500 | consumed samples: 41704960 | consumed tokens: 85411758080 | elapsed time per iteration (s): 0.75 | learning rate: 
2.168E-05 | global batch size: 256 | lm loss: 1.925765E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.382 | TFLOPs: 20.59 | 31: iteration 162920/ 173500 | consumed samples: 41707520 | consumed tokens: 85417000960 | elapsed time per iteration (s): 0.75 | learning rate: 2.168E-05 | global batch size: 256 | lm loss: 1.904780E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.744 | TFLOPs: 20.55 | 31: iteration 162930/ 173500 | consumed samples: 41710080 | consumed tokens: 85422243840 | elapsed time per iteration (s): 0.73 | learning rate: 2.168E-05 | global batch size: 256 | lm loss: 1.924367E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.861 | TFLOPs: 21.17 | 31: iteration 162940/ 173500 | consumed samples: 41712640 | consumed tokens: 85427486720 | elapsed time per iteration (s): 0.78 | learning rate: 2.167E-05 | global batch size: 256 | lm loss: 1.924326E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.936 | TFLOPs: 19.96 | 31: iteration 162950/ 173500 | consumed samples: 41715200 | consumed tokens: 85432729600 | elapsed time per iteration (s): 0.80 | learning rate: 2.167E-05 | global batch size: 256 | lm loss: 1.924134E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.713 | TFLOPs: 19.28 | 31: iteration 162960/ 173500 | consumed samples: 41717760 | consumed tokens: 85437972480 | elapsed time per iteration (s): 0.83 | learning rate: 2.167E-05 | global batch size: 256 | lm loss: 1.907979E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.225 | TFLOPs: 18.71 | 31: iteration 162970/ 173500 | 
consumed samples: 41720320 | consumed tokens: 85443215360 | elapsed time per iteration (s): 0.83 | learning rate: 2.166E-05 | global batch size: 256 | lm loss: 1.924835E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.191 | TFLOPs: 18.71 |
31: iteration 162980/ 173500 | consumed samples: 41722880 | consumed tokens: 85448458240 | elapsed time per iteration (s): 0.83 | learning rate: 2.166E-05 | global batch size: 256 | lm loss: 1.899372E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.715 | TFLOPs: 18.62 |
31: iteration 162990/ 173500 | consumed samples: 41725440 | consumed tokens: 85453701120 | elapsed time per iteration (s): 0.80 | learning rate: 2.166E-05 | global batch size: 256 | lm loss: 1.897832E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.478 | TFLOPs: 19.45 |
31: iteration 163000/ 173500 | consumed samples: 41728000 | consumed tokens: 85458944000 | elapsed time per iteration (s): 0.82 | learning rate: 2.165E-05 | global batch size: 256 | lm loss: 1.901123E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.120 | TFLOPs: 18.88 |
31: --------------------------------------------------------------------------------------------
31: valid loss at iteration 163000 | lm loss value: 1.815137E+00 | lm loss PPL: 6.141918E+00 |
31: --------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 163000 to checkpoints_1b1long
0: [2022-11-27 06:46:06,643] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step163000 is begin to save!
0: [2022-11-27 06:46:06,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_01-model_00-model_states.pt... 0: [2022-11-27 06:46:06,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_01-model_00-model_states.pt. 0: [2022-11-27 06:46:06,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_03-model_00-model_states.pt... 0: [2022-11-27 06:46:06,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_03-model_00-model_states.pt. 0: [2022-11-27 06:46:06,959] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_04-model_00-model_states.pt... 0: [2022-11-27 06:46:07,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_04-model_00-model_states.pt. 0: [2022-11-27 06:46:07,042] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_05-model_00-model_states.pt... 0: [2022-11-27 06:46:07,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_05-model_00-model_states.pt. 0: [2022-11-27 06:46:07,124] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_06-model_00-model_states.pt... 0: [2022-11-27 06:46:07,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_06-model_00-model_states.pt. 0: [2022-11-27 06:46:07,213] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_07-model_00-model_states.pt... 0: [2022-11-27 06:46:07,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_07-model_00-model_states.pt. 
0: [2022-11-27 06:46:07,292] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_08-model_00-model_states.pt... 0: [2022-11-27 06:46:07,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_08-model_00-model_states.pt. 0: [2022-11-27 06:46:07,368] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_09-model_00-model_states.pt... 0: [2022-11-27 06:46:07,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_09-model_00-model_states.pt. 0: [2022-11-27 06:46:07,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_10-model_00-model_states.pt... 0: [2022-11-27 06:46:07,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_10-model_00-model_states.pt. 0: [2022-11-27 06:46:07,521] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_11-model_00-model_states.pt... 0: [2022-11-27 06:46:07,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_11-model_00-model_states.pt. 0: [2022-11-27 06:46:07,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_12-model_00-model_states.pt... 0: [2022-11-27 06:46:07,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_12-model_00-model_states.pt. 0: [2022-11-27 06:46:07,678] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_13-model_00-model_states.pt... 0: [2022-11-27 06:46:07,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_13-model_00-model_states.pt. 
0: [2022-11-27 06:46:07,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_14-model_00-model_states.pt... 0: [2022-11-27 06:46:07,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_14-model_00-model_states.pt. 0: [2022-11-27 06:46:07,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_15-model_00-model_states.pt... 0: [2022-11-27 06:46:07,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_15-model_00-model_states.pt. 0: [2022-11-27 06:46:07,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_16-model_00-model_states.pt... 0: [2022-11-27 06:46:07,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_16-model_00-model_states.pt. 0: [2022-11-27 06:46:07,980] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_17-model_00-model_states.pt... 0: [2022-11-27 06:46:08,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_17-model_00-model_states.pt. 0: [2022-11-27 06:46:08,057] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_18-model_00-model_states.pt... 0: [2022-11-27 06:46:08,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_18-model_00-model_states.pt. 0: [2022-11-27 06:46:08,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_19-model_00-model_states.pt... 0: [2022-11-27 06:46:08,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_19-model_00-model_states.pt. 
0: [2022-11-27 06:46:08,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_20-model_00-model_states.pt... 0: [2022-11-27 06:46:08,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_20-model_00-model_states.pt. 0: [2022-11-27 06:46:08,280] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_21-model_00-model_states.pt... 0: [2022-11-27 06:46:08,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_21-model_00-model_states.pt. 0: [2022-11-27 06:46:08,355] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_22-model_00-model_states.pt... 0: [2022-11-27 06:46:08,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_22-model_00-model_states.pt. 0: [2022-11-27 06:46:08,432] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_23-model_00-model_states.pt... 0: [2022-11-27 06:46:08,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_23-model_00-model_states.pt. 0: [2022-11-27 06:46:08,507] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_24-model_00-model_states.pt... 0: [2022-11-27 06:46:08,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_24-model_00-model_states.pt. 0: [2022-11-27 06:46:08,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_25-model_00-model_states.pt... 0: [2022-11-27 06:46:08,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_25-model_00-model_states.pt. 
0: [2022-11-27 06:46:08,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_26-model_00-model_states.pt... 0: [2022-11-27 06:46:08,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_26-model_00-model_states.pt. 0: [2022-11-27 06:46:08,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_27-model_00-model_states.pt... 0: [2022-11-27 06:46:08,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_27-model_00-model_states.pt. 0: [2022-11-27 06:46:08,806] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_28-model_00-model_states.pt... 0: [2022-11-27 06:46:08,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_28-model_00-model_states.pt. 0: [2022-11-27 06:46:08,880] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/layer_30-model_00-model_states.pt... 0: [2022-11-27 06:46:08,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/layer_30-model_00-model_states.pt. 0: [2022-11-27 06:46:08,882] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step163000/mp_rank_00_model_states.pt 0: [2022-11-27 06:46:08,882] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/mp_rank_00_model_states.pt... 0: [2022-11-27 06:46:08,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/mp_rank_00_model_states.pt. 0: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
6: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
9: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
1: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
13: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
15: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
11: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
14: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 
22: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 
18: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 
6: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
4: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
16: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
12: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 
23: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 
29: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 
17: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
0: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 
16: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 
28: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 
21: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
0: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 
11: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 
5: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 22: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 26: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 20: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 
26: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:46:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:46:09,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:46:09,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 06:46:09,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 4: [2022-11-27 06:46:09,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:46:09,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 06:46:09,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 20: [2022-11-27 06:46:09,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:46:09,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 06:46:09,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 
29: [2022-11-27 06:46:09,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-27 06:46:09,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 0: [2022-11-27 06:46:09,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 29: [2022-11-27 06:46:09,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 0: [2022-11-27 06:46:09,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 06:46:09,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 9: [2022-11-27 06:46:09,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:46:09,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 06:46:09,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 6: [2022-11-27 06:46:09,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:46:09,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 06:46:09,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 
30: [2022-11-27 06:46:09,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:46:09,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 06:46:09,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 23: [2022-11-27 06:46:09,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-27 06:46:09,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 06:46:09,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 2: [2022-11-27 06:46:09,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:46:09,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 06:46:09,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 14: [2022-11-27 06:46:09,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:46:09,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
14: [2022-11-27 06:46:09,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 20: [2022-11-27 06:46:09,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 14: [2022-11-27 06:46:09,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 20: [2022-11-27 06:46:09,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 5: [2022-11-27 06:46:09,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:46:09,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 22: [2022-11-27 06:46:09,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:46:09,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 22: [2022-11-27 06:46:09,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 06:46:09,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 29: [2022-11-27 06:46:09,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-27 06:46:09,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 06:46:09,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 8: [2022-11-27 06:46:09,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:46:09,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 06:46:09,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 8: [2022-11-27 06:46:09,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:46:09,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 24: [2022-11-27 06:46:09,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:46:09,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 24: [2022-11-27 06:46:09,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 16: [2022-11-27 06:46:09,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:46:09,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 
16: [2022-11-27 06:46:09,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 7: [2022-11-27 06:46:09,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 16: [2022-11-27 06:46:09,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 7: [2022-11-27 06:46:09,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 06:46:09,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 11: [2022-11-27 06:46:09,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:46:09,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 06:46:09,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 11: [2022-11-27 06:46:09,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 23: [2022-11-27 06:46:09,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
11: [2022-11-27 06:46:09,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 23: [2022-11-27 06:46:09,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 11: [2022-11-27 06:46:09,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 23: [2022-11-27 06:46:09,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 31: [2022-11-27 06:46:09,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:46:09,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:46:09,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:46:09,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 31: [2022-11-27 06:46:09,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 06:46:09,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 06:46:09,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 3: [2022-11-27 06:46:09,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 
31: [2022-11-27 06:46:09,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 4: [2022-11-27 06:46:09,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:46:09,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:46:09,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 21: [2022-11-27 06:46:09,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 4: [2022-11-27 06:46:09,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 21: [2022-11-27 06:46:09,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 21: [2022-11-27 06:46:09,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:46:09,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:46:09,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 06:46:09,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 
21: [2022-11-27 06:46:09,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 06:46:09,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 22: [2022-11-27 06:46:09,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-27 06:46:09,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 06:46:09,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 14: [2022-11-27 06:46:09,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:46:09,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 24: [2022-11-27 06:46:09,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:46:09,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 26: [2022-11-27 06:46:09,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:46:09,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 06:46:09,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 
26: [2022-11-27 06:46:09,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 06:46:09,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 16: [2022-11-27 06:46:09,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 27: [2022-11-27 06:46:09,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 16: [2022-11-27 06:46:09,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 06:46:09,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 27: [2022-11-27 06:46:09,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 06:46:09,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 1: [2022-11-27 06:46:09,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:46:09,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 06:46:09,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 26: [2022-11-27 06:46:09,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
26: [2022-11-27 06:46:09,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 06:46:09,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 5: [2022-11-27 06:46:09,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:46:09,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 06:46:09,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 26: [2022-11-27 06:46:09,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-27 06:46:09,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 06:46:09,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 20: [2022-11-27 06:46:09,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:46:09,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 06:46:09,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 5: [2022-11-27 06:46:09,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
3: [2022-11-27 06:46:09,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:46:09,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 06:46:09,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 3: [2022-11-27 06:46:09,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 06:46:09,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 2: [2022-11-27 06:46:09,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:46:09,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 06:46:09,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 0: [2022-11-27 06:46:09,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:46:09,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:46:09,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-27 06:46:09,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 06:46:09,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 06:46:09,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 1: [2022-11-27 06:46:09,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 6: [2022-11-27 06:46:09,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:46:09,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:46:09,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:46:09,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 06:46:09,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 21: [2022-11-27 06:46:09,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 6: [2022-11-27 06:46:09,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 6: [2022-11-27 06:46:09,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 
7: [2022-11-27 06:46:09,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:46:09,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 7: [2022-11-27 06:46:09,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 06:46:09,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 30: [2022-11-27 06:46:09,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:46:09,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 24: [2022-11-27 06:46:09,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:46:09,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 24: [2022-11-27 06:46:09,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-27 06:46:09,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 25: [2022-11-27 06:46:09,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 
25: [2022-11-27 06:46:09,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-27 06:46:09,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 9: [2022-11-27 06:46:09,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:46:09,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 06:46:09,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 22: [2022-11-27 06:46:09,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 06:46:09,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 06:46:09,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 8: [2022-11-27 06:46:09,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 29: [2022-11-27 06:46:09,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:46:09,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
29: [2022-11-27 06:46:09,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 8: [2022-11-27 06:46:09,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 06:46:09,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 15: [2022-11-27 06:46:09,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 06:46:09,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 29: [2022-11-27 06:46:09,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 30: [2022-11-27 06:46:09,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:46:09,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 06:46:09,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 5: [2022-11-27 06:46:09,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:46:09,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 06:46:09,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 
15: [2022-11-27 06:46:09,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:46:09,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 06:46:09,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 23: [2022-11-27 06:46:09,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-27 06:46:09,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 06:46:09,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 25: [2022-11-27 06:46:09,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-27 06:46:09,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 06:46:09,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 9: [2022-11-27 06:46:09,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:46:09,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 06:46:09,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 
11: [2022-11-27 06:46:09,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:46:09,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 06:46:09,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 27: [2022-11-27 06:46:09,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-27 06:46:09,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 06:46:09,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 2: [2022-11-27 06:46:09,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:46:09,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 06:46:09,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 30: [2022-11-27 06:46:09,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:46:09,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 06:46:09,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 
27: [2022-11-27 06:46:09,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-27 06:46:09,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 23: [2022-11-27 06:46:09,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 27: [2022-11-27 06:46:09,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 23: [2022-11-27 06:46:09,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 24: [2022-11-27 06:46:09,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 23: [2022-11-27 06:46:09,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 24: [2022-11-27 06:46:09,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 06:46:09,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 16: [2022-11-27 06:46:09,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-27 06:46:09,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 06:46:09,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 
2: [2022-11-27 06:46:09,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:46:09,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 14: [2022-11-27 06:46:09,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:46:09,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 14: [2022-11-27 06:46:09,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 06:46:09,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 26: [2022-11-27 06:46:09,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-27 06:46:09,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 06:46:09,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 13: [2022-11-27 06:46:09,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:46:09,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:46:09,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
13: [2022-11-27 06:46:09,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 06:46:09,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 0: [2022-11-27 06:46:09,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 13: [2022-11-27 06:46:09,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 13: [2022-11-27 06:46:09,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 0: [2022-11-27 06:46:09,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 14: [2022-11-27 06:46:09,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:46:09,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:46:09,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 31: [2022-11-27 06:46:09,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 14: [2022-11-27 06:46:09,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 31: [2022-11-27 06:46:09,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 
28: [2022-11-27 06:46:09,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:46:09,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:46:09,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 06:46:09,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 20: [2022-11-27 06:46:09,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:46:09,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 06:46:09,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 16: [2022-11-27 06:46:09,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-27 06:46:09,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 3: [2022-11-27 06:46:09,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:46:09,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 16: [2022-11-27 06:46:09,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 
8: [2022-11-27 06:46:09,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 06:46:09,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 27: [2022-11-27 06:46:09,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:46:09,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 06:46:09,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 27: [2022-11-27 06:46:09,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 06:46:09,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 3: [2022-11-27 06:46:09,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:46:09,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 06:46:09,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 12: [2022-11-27 06:46:09,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-27 06:46:09,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 06:46:09,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 29: [2022-11-27 06:46:09,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 06:46:09,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 06:46:09,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 4: [2022-11-27 06:46:09,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:46:09,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:46:09,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 06:46:09,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 06:46:09,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 4: [2022-11-27 06:46:09,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 1: [2022-11-27 06:46:09,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-27 06:46:09,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 31: [2022-11-27 06:46:09,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:46:09,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 31: [2022-11-27 06:46:09,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 06:46:09,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 15: [2022-11-27 06:46:09,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:46:09,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 06:46:09,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 22: [2022-11-27 06:46:09,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 06:46:09,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 06:46:09,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 12: [2022-11-27 06:46:09,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-27 06:46:09,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 06:46:09,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 12: [2022-11-27 06:46:09,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 18: [2022-11-27 06:46:09,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-27 06:46:09,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-27 06:46:09,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 06:46:09,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:46:09,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 18: [2022-11-27 06:46:09,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 12: [2022-11-27 06:46:09,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 
18: [2022-11-27 06:46:09,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-27 06:46:09,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 18: [2022-11-27 06:46:09,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 06:46:09,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 06:46:09,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 18: [2022-11-27 06:46:09,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 18: [2022-11-27 06:46:09,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 1: [2022-11-27 06:46:09,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:46:09,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 06:46:09,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 18: [2022-11-27 06:46:09,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 
18: [2022-11-27 06:46:09,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 06:46:09,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 7: [2022-11-27 06:46:09,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:46:09,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 06:46:09,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 0: [2022-11-27 06:46:09,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:46:09,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:46:09,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 0: [2022-11-27 06:46:09,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 28: [2022-11-27 06:46:09,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 7: [2022-11-27 06:46:09,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 28: [2022-11-27 06:46:09,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 
0: [2022-11-27 06:46:09,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 28: [2022-11-27 06:46:09,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-27 06:46:09,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 06:46:09,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 28: [2022-11-27 06:46:09,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-27 06:46:09,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-27 06:46:09,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 06:46:09,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 28: [2022-11-27 06:46:09,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 06:46:09,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 6: [2022-11-27 06:46:09,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:46:09,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-27 06:46:09,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 06:46:09,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 06:46:09,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 6: [2022-11-27 06:46:09,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 25: [2022-11-27 06:46:09,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-27 06:46:09,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-27 06:46:09,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 06:46:09,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 06:46:09,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 25: [2022-11-27 06:46:09,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 13: [2022-11-27 06:46:09,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-27 06:46:09,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 06:46:09,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 19: [2022-11-27 06:46:09,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-27 06:46:09,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-27 06:46:09,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 11: [2022-11-27 06:46:09,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:46:09,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 06:46:09,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 5: [2022-11-27 06:46:09,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:46:09,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 06:46:09,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 4: [2022-11-27 06:46:09,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-27 06:46:09,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 06:46:09,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 0: [2022-11-27 06:46:09,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 06:46:09,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 17: [2022-11-27 06:46:09,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-27 06:46:09,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-27 06:46:09,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-27 06:46:09,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 06:46:09,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 06:46:09,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 06:46:09,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 17: [2022-11-27 06:46:09,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 
17: [2022-11-27 06:46:09,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 17: [2022-11-27 06:46:09,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-27 06:46:09,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-27 06:46:09,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 06:46:09,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 06:46:09,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 17: [2022-11-27 06:46:09,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 10: [2022-11-27 06:46:09,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:46:09,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:46:09,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:46:09,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-27 06:46:09,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 06:46:09,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 06:46:09,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 06:46:09,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 06:46:09,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 10: [2022-11-27 06:46:09,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 10: [2022-11-27 06:46:09,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 10: [2022-11-27 06:46:09,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 19: [2022-11-27 06:46:09,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-27 06:46:09,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 
19: [2022-11-27 06:46:09,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 06:46:09,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 06:46:09,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 19: [2022-11-27 06:46:09,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 19: [2022-11-27 06:46:09,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 06:46:09,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 06:46:09,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 25: [2022-11-27 06:46:09,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-27 06:46:09,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 06:46:09,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 21: [2022-11-27 06:46:09,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-27 06:46:09,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 06:46:09,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 2: [2022-11-27 06:46:09,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:46:09,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 06:46:09,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 27: [2022-11-27 06:46:09,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-27 06:46:09,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 06:46:09,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 7: [2022-11-27 06:46:09,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:46:09,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 06:46:09,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 9: [2022-11-27 06:46:09,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-27 06:46:09,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 06:46:09,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 23: [2022-11-27 06:46:09,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-27 06:46:09,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 06:46:09,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 31: [2022-11-27 06:46:09,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:46:09,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 06:46:09,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 20: [2022-11-27 06:46:09,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:46:09,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 06:46:09,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 29: [2022-11-27 06:46:09,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
29: [2022-11-27 06:46:09,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 06:46:09,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 28: [2022-11-27 06:46:09,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-27 06:46:09,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 06:46:09,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 26: [2022-11-27 06:46:09,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-27 06:46:09,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 06:46:09,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 30: [2022-11-27 06:46:09,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:46:09,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 06:46:09,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 19: [2022-11-27 06:46:09,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-27 06:46:09,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 06:46:09,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 8: [2022-11-27 06:46:09,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:46:09,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 06:46:09,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 22: [2022-11-27 06:46:09,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 06:46:09,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 06:46:09,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 1: [2022-11-27 06:46:09,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:46:09,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 06:46:09,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 13: [2022-11-27 06:46:09,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-27 06:46:09,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 06:46:09,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 14: [2022-11-27 06:46:09,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:46:09,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 06:46:09,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 0: [2022-11-27 06:46:09,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:46:09,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 06:46:09,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 3: [2022-11-27 06:46:09,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:46:09,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 06:46:09,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 5: [2022-11-27 06:46:09,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-27 06:46:09,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 06:46:09,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 16: [2022-11-27 06:46:09,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-27 06:46:09,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 06:46:09,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 11: [2022-11-27 06:46:09,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:46:09,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 06:46:09,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 24: [2022-11-27 06:46:09,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:46:09,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 06:46:09,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 4: [2022-11-27 06:46:09,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-27 06:46:09,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 06:46:09,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 15: [2022-11-27 06:46:09,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:46:09,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 06:46:09,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 12: [2022-11-27 06:46:09,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:46:09,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 06:46:09,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 18: [2022-11-27 06:46:09,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-27 06:46:09,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 06:46:09,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 10: [2022-11-27 06:46:09,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-27 06:46:09,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 06:46:09,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 21: [2022-11-27 06:46:09,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-27 06:46:09,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 06:46:09,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 2: [2022-11-27 06:46:09,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:46:09,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 06:46:09,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 25: [2022-11-27 06:46:09,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-27 06:46:09,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 06:46:09,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 6: [2022-11-27 06:46:09,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-27 06:46:09,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 06:46:09,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 7: [2022-11-27 06:46:09,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:46:09,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 06:46:09,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 27: [2022-11-27 06:46:09,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 17: [2022-11-27 06:46:09,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 27: [2022-11-27 06:46:09,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 06:46:09,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 17: [2022-11-27 06:46:09,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 06:46:09,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 9: [2022-11-27 06:46:09,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-27 06:46:09,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 06:46:09,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 29: [2022-11-27 06:46:09,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-27 06:46:09,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 06:46:09,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 28: [2022-11-27 06:46:09,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-27 06:46:09,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 11: [2022-11-27 06:46:09,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:46:09,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 23: [2022-11-27 06:46:09,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:46:09,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 
23: [2022-11-27 06:46:09,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 06:46:09,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 20: [2022-11-27 06:46:09,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:46:09,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 06:46:09,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 26: [2022-11-27 06:46:09,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-27 06:46:09,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-27 06:46:09,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 28: [2022-11-27 06:46:09,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 31: [2022-11-27 06:46:09,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:46:09,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 06:46:09,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 
22: [2022-11-27 06:46:09,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-27 06:46:09,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 8: [2022-11-27 06:46:09,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 22: [2022-11-27 06:46:09,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 8: [2022-11-27 06:46:09,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 06:46:09,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 1: [2022-11-27 06:46:09,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:46:09,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 06:46:09,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 19: [2022-11-27 06:46:09,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-27 06:46:09,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 06:46:09,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 
0: [2022-11-27 06:46:09,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:46:09,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 06:46:09,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 3: [2022-11-27 06:46:09,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:46:09,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 06:46:09,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 5: [2022-11-27 06:46:09,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:46:09,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 06:46:09,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 30: [2022-11-27 06:46:09,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-27 06:46:09,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 14: [2022-11-27 06:46:09,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:46:09,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 14: [2022-11-27 06:46:09,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 06:46:09,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 13: [2022-11-27 06:46:09,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:46:09,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 06:46:09,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 15: [2022-11-27 06:46:09,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:46:09,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 06:46:09,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 4: [2022-11-27 06:46:09,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
16: [2022-11-27 06:46:09,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:46:09,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 06:46:09,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 16: [2022-11-27 06:46:09,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 06:46:09,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 24: [2022-11-27 06:46:09,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:46:09,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 06:46:09,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 2: [2022-11-27 06:46:09,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:46:09,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-27 06:46:09,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 2: [2022-11-27 06:46:09,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 6: [2022-11-27 06:46:09,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 2: [2022-11-27 06:46:09,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 18: [2022-11-27 06:46:09,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 06:46:09,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 10: [2022-11-27 06:46:09,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:46:09,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 18: [2022-11-27 06:46:09,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 10: [2022-11-27 06:46:09,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 21: [2022-11-27 06:46:09,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-27 06:46:09,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 06:46:09,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 12: [2022-11-27 06:46:09,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:46:09,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 06:46:09,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 25: [2022-11-27 06:46:09,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-27 06:46:09,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 06:46:09,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 17: [2022-11-27 06:46:09,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-27 06:46:09,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 06:46:09,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 27: [2022-11-27 06:46:09,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
27: [2022-11-27 06:46:09,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 06:46:09,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 9: [2022-11-27 06:46:09,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:46:09,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 06:46:09,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 7: [2022-11-27 06:46:09,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:46:09,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 06:46:09,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 29: [2022-11-27 06:46:09,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-27 06:46:09,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 06:46:09,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 23: [2022-11-27 06:46:09,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-27 06:46:09,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 06:46:09,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 20: [2022-11-27 06:46:09,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:46:09,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 06:46:09,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 31: [2022-11-27 06:46:09,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:46:09,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 06:46:09,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 26: [2022-11-27 06:46:09,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-27 06:46:09,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 06:46:09,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 28: [2022-11-27 06:46:09,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-27 06:46:09,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 30: [2022-11-27 06:46:09,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:46:09,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 06:46:09,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 14: [2022-11-27 06:46:09,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:46:09,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 06:46:09,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 28: [2022-11-27 06:46:09,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 22: [2022-11-27 06:46:09,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-27 06:46:09,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 06:46:09,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 3: [2022-11-27 06:46:09,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-27 06:46:09,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 06:46:09,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 11: [2022-11-27 06:46:09,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:46:09,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 06:46:09,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 8: [2022-11-27 06:46:09,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:46:09,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 06:46:09,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 24: [2022-11-27 06:46:09,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:46:09,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 06:46:09,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 19: [2022-11-27 06:46:09,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
19: [2022-11-27 06:46:09,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 06:46:09,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 14: [2022-11-27 06:46:09,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:46:09,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 06:46:09,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 5: [2022-11-27 06:46:09,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:46:09,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 21: [2022-11-27 06:46:09,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:46:09,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 21: [2022-11-27 06:46:09,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 06:46:09,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 2: [2022-11-27 06:46:09,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-27 06:46:09,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 06:46:09,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 16: [2022-11-27 06:46:09,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 06:46:09,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 06:46:09,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 13: [2022-11-27 06:46:09,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:46:09,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 06:46:09,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 6: [2022-11-27 06:46:09,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:46:09,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
6: [2022-11-27 06:46:09,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 4: [2022-11-27 06:46:09,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 6: [2022-11-27 06:46:09,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 4: [2022-11-27 06:46:09,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 1: [2022-11-27 06:46:09,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:46:09,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:46:09,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 1: [2022-11-27 06:46:09,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 0: [2022-11-27 06:46:09,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 1: [2022-11-27 06:46:09,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 27: [2022-11-27 06:46:09,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 
27: [2022-11-27 06:46:09,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 06:46:09,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 9: [2022-11-27 06:46:09,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:46:09,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 06:46:09,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 22: [2022-11-27 06:46:09,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-27 06:46:09,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 12: [2022-11-27 06:46:09,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 22: [2022-11-27 06:46:09,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 12: [2022-11-27 06:46:09,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 06:46:09,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 10: [2022-11-27 06:46:09,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
15: [2022-11-27 06:46:09,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:46:09,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 15: [2022-11-27 06:46:09,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 10: [2022-11-27 06:46:09,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 15: [2022-11-27 06:46:09,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 8: [2022-11-27 06:46:09,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:46:09,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 06:46:09,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 11: [2022-11-27 06:46:09,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:46:09,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 06:46:09,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 28: [2022-11-27 06:46:09,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
25: [2022-11-27 06:46:09,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 28: [2022-11-27 06:46:09,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 18: [2022-11-27 06:46:09,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 06:46:09,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 06:46:09,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 25: [2022-11-27 06:46:09,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 28: [2022-11-27 06:46:09,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 23: [2022-11-27 06:46:09,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 25: [2022-11-27 06:46:09,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 23: [2022-11-27 06:46:09,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 06:46:09,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 10: [2022-11-27 06:46:09,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-27 06:46:09,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 06:46:09,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 26: [2022-11-27 06:46:09,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-27 06:46:09,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 06:46:09,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 7: [2022-11-27 06:46:09,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:46:09,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 06:46:09,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 20: [2022-11-27 06:46:09,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 06:46:09,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 06:46:09,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 29: [2022-11-27 06:46:09,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
29: [2022-11-27 06:46:09,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 06:46:09,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 31: [2022-11-27 06:46:09,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 30: [2022-11-27 06:46:09,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 31: [2022-11-27 06:46:09,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 30: [2022-11-27 06:46:09,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 31: [2022-11-27 06:46:09,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 30: [2022-11-27 06:46:09,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 16: [2022-11-27 06:46:09,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 06:46:09,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 06:46:09,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 3: [2022-11-27 06:46:09,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-27 06:46:09,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 06:46:09,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 24: [2022-11-27 06:46:09,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-27 06:46:09,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-27 06:46:09,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 15: [2022-11-27 06:46:09,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:46:09,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 06:46:09,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 17: [2022-11-27 06:46:09,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-27 06:46:09,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 06:46:09,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 19: [2022-11-27 06:46:09,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
12: [2022-11-27 06:46:09,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 19: [2022-11-27 06:46:09,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 06:46:09,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 12: [2022-11-27 06:46:09,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 06:46:09,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:46:09,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 12: [2022-11-27 06:46:09,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 06:46:09,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 15: [2022-11-27 06:46:09,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:46:09,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 06:46:09,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 13: [2022-11-27 06:46:09,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-27 06:46:09,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 06:46:09,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 13: [2022-11-27 06:46:09,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:46:09,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step163000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 06:46:09,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step163000 is ready now! 0: successfully saved checkpoint at iteration 163000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2603.98 31: iteration 163010/ 173500 | consumed samples: 41730560 | consumed tokens: 85464186880 | elapsed time per iteration (s): 1.10 | learning rate: 2.165E-05 | global batch size: 256 | lm loss: 1.914179E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.070 | TFLOPs: 14.10 | 31: iteration 163020/ 173500 | consumed samples: 41733120 | consumed tokens: 85469429760 | elapsed time per iteration (s): 0.82 | learning rate: 2.165E-05 | global batch size: 256 | lm loss: 1.905578E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.455 | TFLOPs: 18.96 | 31: iteration 163030/ 173500 | consumed samples: 41735680 | consumed tokens: 85474672640 | elapsed time per iteration (s): 0.87 | learning rate: 2.165E-05 | global batch size: 256 | lm loss: 1.922327E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.476 | TFLOPs: 17.82 | 31: iteration 
163040/ 173500 | consumed samples: 41738240 | consumed tokens: 85479915520 | elapsed time per iteration (s): 0.92 | learning rate: 2.164E-05 | global batch size: 256 | lm loss: 1.922837E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 278.479 | TFLOPs: 16.85 | 31: iteration 163050/ 173500 | consumed samples: 41740800 | consumed tokens: 85485158400 | elapsed time per iteration (s): 0.82 | learning rate: 2.164E-05 | global batch size: 256 | lm loss: 1.913099E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.982 | TFLOPs: 18.81 | 31: iteration 163060/ 173500 | consumed samples: 41743360 | consumed tokens: 85490401280 | elapsed time per iteration (s): 0.86 | learning rate: 2.164E-05 | global batch size: 256 | lm loss: 1.923699E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.949 | TFLOPs: 18.09 | 31: iteration 163070/ 173500 | consumed samples: 41745920 | consumed tokens: 85495644160 | elapsed time per iteration (s): 0.83 | learning rate: 2.163E-05 | global batch size: 256 | lm loss: 1.898239E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.297 | TFLOPs: 18.59 | 31: iteration 163080/ 173500 | consumed samples: 41748480 | consumed tokens: 85500887040 | elapsed time per iteration (s): 0.82 | learning rate: 2.163E-05 | global batch size: 256 | lm loss: 1.871123E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.402 | TFLOPs: 18.90 | 31: iteration 163090/ 173500 | consumed samples: 41751040 | consumed tokens: 85506129920 | elapsed time per iteration (s): 0.80 | learning rate: 2.163E-05 | global batch size: 256 | lm loss: 1.897165E+00 | grad norm: 0.205 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.122 | TFLOPs: 19.37 | 31: iteration 163100/ 173500 | consumed samples: 41753600 | consumed tokens: 85511372800 | elapsed time per iteration (s): 0.82 | learning rate: 2.162E-05 | global batch size: 256 | lm loss: 1.947639E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.992 | TFLOPs: 18.87 | 31: iteration 163110/ 173500 | consumed samples: 41756160 | consumed tokens: 85516615680 | elapsed time per iteration (s): 0.82 | learning rate: 2.162E-05 | global batch size: 256 | lm loss: 1.937281E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.092 | TFLOPs: 18.94 | 31: iteration 163120/ 173500 | consumed samples: 41758720 | consumed tokens: 85521858560 | elapsed time per iteration (s): 1.07 | learning rate: 2.162E-05 | global batch size: 256 | lm loss: 1.898072E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.182 | TFLOPs: 14.47 | 31: iteration 163130/ 173500 | consumed samples: 41761280 | consumed tokens: 85527101440 | elapsed time per iteration (s): 0.83 | learning rate: 2.161E-05 | global batch size: 256 | lm loss: 1.916498E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.259 | TFLOPs: 18.59 | 31: iteration 163140/ 173500 | consumed samples: 41763840 | consumed tokens: 85532344320 | elapsed time per iteration (s): 0.80 | learning rate: 2.161E-05 | global batch size: 256 | lm loss: 1.905161E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.151 | TFLOPs: 19.37 | 31: iteration 163150/ 173500 | consumed samples: 41766400 | consumed tokens: 85537587200 | elapsed time per iteration (s): 0.82 | learning 
rate: 2.161E-05 | global batch size: 256 | lm loss: 1.902536E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.148 | TFLOPs: 18.94 | 31: iteration 163160/ 173500 | consumed samples: 41768960 | consumed tokens: 85542830080 | elapsed time per iteration (s): 0.84 | learning rate: 2.160E-05 | global batch size: 256 | lm loss: 1.919552E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.856 | TFLOPs: 18.50 | 31: iteration 163170/ 173500 | consumed samples: 41771520 | consumed tokens: 85548072960 | elapsed time per iteration (s): 0.80 | learning rate: 2.160E-05 | global batch size: 256 | lm loss: 1.905829E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.504 | TFLOPs: 19.39 | 31: iteration 163180/ 173500 | consumed samples: 41774080 | consumed tokens: 85553315840 | elapsed time per iteration (s): 0.83 | learning rate: 2.160E-05 | global batch size: 256 | lm loss: 1.901543E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.982 | TFLOPs: 18.63 | 31: iteration 163190/ 173500 | consumed samples: 41776640 | consumed tokens: 85558558720 | elapsed time per iteration (s): 0.84 | learning rate: 2.160E-05 | global batch size: 256 | lm loss: 1.930113E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.959 | TFLOPs: 18.39 | 31: iteration 163200/ 173500 | consumed samples: 41779200 | consumed tokens: 85563801600 | elapsed time per iteration (s): 0.77 | learning rate: 2.159E-05 | global batch size: 256 | lm loss: 1.897535E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.341 | TFLOPs: 19.98 | 31: iteration 163210/ 
173500 | consumed samples: 41781760 | consumed tokens: 85569044480 | elapsed time per iteration (s): 0.87 | learning rate: 2.159E-05 | global batch size: 256 | lm loss: 1.929961E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.747 | TFLOPs: 17.77 | 31: iteration 163220/ 173500 | consumed samples: 41784320 | consumed tokens: 85574287360 | elapsed time per iteration (s): 0.93 | learning rate: 2.159E-05 | global batch size: 256 | lm loss: 1.912414E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 274.267 | TFLOPs: 16.59 | 31: iteration 163230/ 173500 | consumed samples: 41786880 | consumed tokens: 85579530240 | elapsed time per iteration (s): 0.80 | learning rate: 2.158E-05 | global batch size: 256 | lm loss: 1.899485E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.850 | TFLOPs: 19.35 | 31: iteration 163240/ 173500 | consumed samples: 41789440 | consumed tokens: 85584773120 | elapsed time per iteration (s): 0.86 | learning rate: 2.158E-05 | global batch size: 256 | lm loss: 1.921561E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.284 | TFLOPs: 18.11 | 31: iteration 163250/ 173500 | consumed samples: 41792000 | consumed tokens: 85590016000 | elapsed time per iteration (s): 0.84 | learning rate: 2.158E-05 | global batch size: 256 | lm loss: 1.908656E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.099 | TFLOPs: 18.52 | 31: iteration 163260/ 173500 | consumed samples: 41794560 | consumed tokens: 85595258880 | elapsed time per iteration (s): 0.81 | learning rate: 2.157E-05 | global batch size: 256 | lm loss: 1.906185E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 316.635 | TFLOPs: 19.16 | 31: iteration 163270/ 173500 | consumed samples: 41797120 | consumed tokens: 85600501760 | elapsed time per iteration (s): 0.82 | learning rate: 2.157E-05 | global batch size: 256 | lm loss: 1.926820E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.990 | TFLOPs: 19.00 | 31: iteration 163280/ 173500 | consumed samples: 41799680 | consumed tokens: 85605744640 | elapsed time per iteration (s): 0.77 | learning rate: 2.157E-05 | global batch size: 256 | lm loss: 1.940832E+00 | grad norm: 0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.127 | TFLOPs: 20.03 | 31: iteration 163290/ 173500 | consumed samples: 41802240 | consumed tokens: 85610987520 | elapsed time per iteration (s): 0.79 | learning rate: 2.156E-05 | global batch size: 256 | lm loss: 1.926102E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.491 | TFLOPs: 19.63 | 31: iteration 163300/ 173500 | consumed samples: 41804800 | consumed tokens: 85616230400 | elapsed time per iteration (s): 0.81 | learning rate: 2.156E-05 | global batch size: 256 | lm loss: 1.898182E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.347 | TFLOPs: 19.02 | 31: iteration 163310/ 173500 | consumed samples: 41807360 | consumed tokens: 85621473280 | elapsed time per iteration (s): 0.81 | learning rate: 2.156E-05 | global batch size: 256 | lm loss: 1.900657E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.668 | TFLOPs: 19.16 | 31: iteration 163320/ 173500 | consumed samples: 41809920 | consumed tokens: 85626716160 | elapsed time per iteration (s): 0.81 | learning rate: 
2.156E-05 | global batch size: 256 | lm loss: 1.901460E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.361 | TFLOPs: 19.20 | 31: iteration 163330/ 173500 | consumed samples: 41812480 | consumed tokens: 85631959040 | elapsed time per iteration (s): 0.85 | learning rate: 2.155E-05 | global batch size: 256 | lm loss: 1.894004E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.193 | TFLOPs: 18.16 | 31: iteration 163340/ 173500 | consumed samples: 41815040 | consumed tokens: 85637201920 | elapsed time per iteration (s): 0.82 | learning rate: 2.155E-05 | global batch size: 256 | lm loss: 1.912946E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.336 | TFLOPs: 18.90 | 31: iteration 163350/ 173500 | consumed samples: 41817600 | consumed tokens: 85642444800 | elapsed time per iteration (s): 0.81 | learning rate: 2.155E-05 | global batch size: 256 | lm loss: 1.875974E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.515 | TFLOPs: 19.21 | 31: iteration 163360/ 173500 | consumed samples: 41820160 | consumed tokens: 85647687680 | elapsed time per iteration (s): 0.83 | learning rate: 2.154E-05 | global batch size: 256 | lm loss: 1.926375E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.145 | TFLOPs: 18.58 | 31: iteration 163370/ 173500 | consumed samples: 41822720 | consumed tokens: 85652930560 | elapsed time per iteration (s): 1.03 | learning rate: 2.154E-05 | global batch size: 256 | lm loss: 1.917972E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.015 | TFLOPs: 15.06 | 31: iteration 163380/ 173500 | 
consumed samples: 41825280 | consumed tokens: 85658173440 | elapsed time per iteration (s): 0.76 | learning rate: 2.154E-05 | global batch size: 256 | lm loss: 1.910277E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.097 | TFLOPs: 20.39 | 31: iteration 163390/ 173500 | consumed samples: 41827840 | consumed tokens: 85663416320 | elapsed time per iteration (s): 0.75 | learning rate: 2.153E-05 | global batch size: 256 | lm loss: 1.914702E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.983 | TFLOPs: 20.69 | 31: iteration 163400/ 173500 | consumed samples: 41830400 | consumed tokens: 85668659200 | elapsed time per iteration (s): 0.86 | learning rate: 2.153E-05 | global batch size: 256 | lm loss: 1.937613E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.421 | TFLOPs: 17.93 | 31: iteration 163410/ 173500 | consumed samples: 41832960 | consumed tokens: 85673902080 | elapsed time per iteration (s): 0.75 | learning rate: 2.153E-05 | global batch size: 256 | lm loss: 1.931256E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.465 | TFLOPs: 20.78 | 31: iteration 163420/ 173500 | consumed samples: 41835520 | consumed tokens: 85679144960 | elapsed time per iteration (s): 0.75 | learning rate: 2.153E-05 | global batch size: 256 | lm loss: 1.900333E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.725 | TFLOPs: 20.55 | 31: iteration 163430/ 173500 | consumed samples: 41838080 | consumed tokens: 85684387840 | elapsed time per iteration (s): 0.79 | learning rate: 2.152E-05 | global batch size: 256 | lm loss: 1.902141E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 324.642 | TFLOPs: 19.64 | 31: iteration 163440/ 173500 | consumed samples: 41840640 | consumed tokens: 85689630720 | elapsed time per iteration (s): 0.74 | learning rate: 2.152E-05 | global batch size: 256 | lm loss: 1.931998E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.932 | TFLOPs: 20.87 | 31: iteration 163450/ 173500 | consumed samples: 41843200 | consumed tokens: 85694873600 | elapsed time per iteration (s): 0.79 | learning rate: 2.152E-05 | global batch size: 256 | lm loss: 1.902800E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.227 | TFLOPs: 19.68 | 31: iteration 163460/ 173500 | consumed samples: 41845760 | consumed tokens: 85700116480 | elapsed time per iteration (s): 0.85 | learning rate: 2.151E-05 | global batch size: 256 | lm loss: 1.924335E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.407 | TFLOPs: 18.17 | 31: iteration 163470/ 173500 | consumed samples: 41848320 | consumed tokens: 85705359360 | elapsed time per iteration (s): 0.79 | learning rate: 2.151E-05 | global batch size: 256 | lm loss: 1.903606E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.479 | TFLOPs: 19.69 | 31: iteration 163480/ 173500 | consumed samples: 41850880 | consumed tokens: 85710602240 | elapsed time per iteration (s): 0.90 | learning rate: 2.151E-05 | global batch size: 256 | lm loss: 1.905756E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.243 | TFLOPs: 17.20 | 31: iteration 163490/ 173500 | consumed samples: 41853440 | consumed tokens: 85715845120 | elapsed time per iteration (s): 0.90 | learning rate: 
2.150E-05 | global batch size: 256 | lm loss: 1.901480E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.797 | TFLOPs: 17.29 | 31: iteration 163500/ 173500 | consumed samples: 41856000 | consumed tokens: 85721088000 | elapsed time per iteration (s): 0.82 | learning rate: 2.150E-05 | global batch size: 256 | lm loss: 1.934307E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.587 | TFLOPs: 18.85 | 31: iteration 163510/ 173500 | consumed samples: 41858560 | consumed tokens: 85726330880 | elapsed time per iteration (s): 1.04 | learning rate: 2.150E-05 | global batch size: 256 | lm loss: 1.900640E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.099 | TFLOPs: 14.83 | 31: iteration 163520/ 173500 | consumed samples: 41861120 | consumed tokens: 85731573760 | elapsed time per iteration (s): 0.90 | learning rate: 2.150E-05 | global batch size: 256 | lm loss: 1.899879E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.969 | TFLOPs: 17.24 | 31: iteration 163530/ 173500 | consumed samples: 41863680 | consumed tokens: 85736816640 | elapsed time per iteration (s): 0.99 | learning rate: 2.149E-05 | global batch size: 256 | lm loss: 1.900368E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 259.281 | TFLOPs: 15.69 | 31: iteration 163540/ 173500 | consumed samples: 41866240 | consumed tokens: 85742059520 | elapsed time per iteration (s): 0.96 | learning rate: 2.149E-05 | global batch size: 256 | lm loss: 1.911574E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 266.392 | TFLOPs: 16.12 | 31: iteration 163550/ 173500 | 
consumed samples: 41868800 | consumed tokens: 85747302400 | elapsed time per iteration (s): 0.97 | learning rate: 2.149E-05 | global batch size: 256 | lm loss: 1.900375E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 262.816 | TFLOPs: 15.90 | 31: iteration 163560/ 173500 | consumed samples: 41871360 | consumed tokens: 85752545280 | elapsed time per iteration (s): 0.89 | learning rate: 2.148E-05 | global batch size: 256 | lm loss: 1.916093E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.095 | TFLOPs: 17.37 | 31: iteration 163570/ 173500 | consumed samples: 41873920 | consumed tokens: 85757788160 | elapsed time per iteration (s): 0.91 | learning rate: 2.148E-05 | global batch size: 256 | lm loss: 1.928127E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 279.985 | TFLOPs: 16.94 | 31: iteration 163580/ 173500 | consumed samples: 41876480 | consumed tokens: 85763031040 | elapsed time per iteration (s): 0.86 | learning rate: 2.148E-05 | global batch size: 256 | lm loss: 1.921299E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.043 | TFLOPs: 17.91 | 31: iteration 163590/ 173500 | consumed samples: 41879040 | consumed tokens: 85768273920 | elapsed time per iteration (s): 0.81 | learning rate: 2.147E-05 | global batch size: 256 | lm loss: 1.907332E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.791 | TFLOPs: 19.10 | 31: iteration 163600/ 173500 | consumed samples: 41881600 | consumed tokens: 85773516800 | elapsed time per iteration (s): 0.78 | learning rate: 2.147E-05 | global batch size: 256 | lm loss: 1.914476E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 327.558 | TFLOPs: 19.82 | 31: iteration 163610/ 173500 | consumed samples: 41884160 | consumed tokens: 85778759680 | elapsed time per iteration (s): 0.77 | learning rate: 2.147E-05 | global batch size: 256 | lm loss: 1.925743E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.746 | TFLOPs: 20.01 | 31: iteration 163620/ 173500 | consumed samples: 41886720 | consumed tokens: 85784002560 | elapsed time per iteration (s): 0.75 | learning rate: 2.147E-05 | global batch size: 256 | lm loss: 1.934282E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.136 | TFLOPs: 20.76 | 31: iteration 163630/ 173500 | consumed samples: 41889280 | consumed tokens: 85789245440 | elapsed time per iteration (s): 0.75 | learning rate: 2.146E-05 | global batch size: 256 | lm loss: 1.904697E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.374 | TFLOPs: 20.53 | 31: iteration 163640/ 173500 | consumed samples: 41891840 | consumed tokens: 85794488320 | elapsed time per iteration (s): 0.95 | learning rate: 2.146E-05 | global batch size: 256 | lm loss: 1.922335E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 268.068 | TFLOPs: 16.22 | 31: iteration 163650/ 173500 | consumed samples: 41894400 | consumed tokens: 85799731200 | elapsed time per iteration (s): 0.79 | learning rate: 2.146E-05 | global batch size: 256 | lm loss: 1.910275E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.404 | TFLOPs: 19.57 | 31: iteration 163660/ 173500 | consumed samples: 41896960 | consumed tokens: 85804974080 | elapsed time per iteration (s): 0.79 | learning rate: 
2.145E-05 | global batch size: 256 | lm loss: 1.907038E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.541 | TFLOPs: 19.63 | 31: iteration 163670/ 173500 | consumed samples: 41899520 | consumed tokens: 85810216960 | elapsed time per iteration (s): 0.85 | learning rate: 2.145E-05 | global batch size: 256 | lm loss: 1.926258E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.520 | TFLOPs: 18.12 | 31: iteration 163680/ 173500 | consumed samples: 41902080 | consumed tokens: 85815459840 | elapsed time per iteration (s): 0.87 | learning rate: 2.145E-05 | global batch size: 256 | lm loss: 1.925244E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.222 | TFLOPs: 17.80 | 31: iteration 163690/ 173500 | consumed samples: 41904640 | consumed tokens: 85820702720 | elapsed time per iteration (s): 0.85 | learning rate: 2.144E-05 | global batch size: 256 | lm loss: 1.906450E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.227 | TFLOPs: 18.28 | 31: iteration 163700/ 173500 | consumed samples: 41907200 | consumed tokens: 85825945600 | elapsed time per iteration (s): 0.79 | learning rate: 2.144E-05 | global batch size: 256 | lm loss: 1.903633E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.390 | TFLOPs: 19.56 | 31: iteration 163710/ 173500 | consumed samples: 41909760 | consumed tokens: 85831188480 | elapsed time per iteration (s): 0.83 | learning rate: 2.144E-05 | global batch size: 256 | lm loss: 1.902386E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.930 | TFLOPs: 18.57 | 31: iteration 163720/ 173500 | 
consumed samples: 41912320 | consumed tokens: 85836431360 | elapsed time per iteration (s): 0.89 | learning rate: 2.144E-05 | global batch size: 256 | lm loss: 1.911910E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.719 | TFLOPs: 17.41 | 31: iteration 163730/ 173500 | consumed samples: 41914880 | consumed tokens: 85841674240 | elapsed time per iteration (s): 0.84 | learning rate: 2.143E-05 | global batch size: 256 | lm loss: 1.926396E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.540 | TFLOPs: 18.42 | 31: iteration 163740/ 173500 | consumed samples: 41917440 | consumed tokens: 85846917120 | elapsed time per iteration (s): 0.80 | learning rate: 2.143E-05 | global batch size: 256 | lm loss: 1.891831E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.296 | TFLOPs: 19.44 | 31: iteration 163750/ 173500 | consumed samples: 41920000 | consumed tokens: 85852160000 | elapsed time per iteration (s): 0.81 | learning rate: 2.143E-05 | global batch size: 256 | lm loss: 1.905630E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.335 | TFLOPs: 19.14 | 31: iteration 163760/ 173500 | consumed samples: 41922560 | consumed tokens: 85857402880 | elapsed time per iteration (s): 0.81 | learning rate: 2.142E-05 | global batch size: 256 | lm loss: 1.921194E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.840 | TFLOPs: 19.23 | 31: iteration 163770/ 173500 | consumed samples: 41925120 | consumed tokens: 85862645760 | elapsed time per iteration (s): 0.80 | learning rate: 2.142E-05 | global batch size: 256 | lm loss: 1.899998E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 320.522 | TFLOPs: 19.39 | 31: iteration 163780/ 173500 | consumed samples: 41927680 | consumed tokens: 85867888640 | elapsed time per iteration (s): 0.93 | learning rate: 2.142E-05 | global batch size: 256 | lm loss: 1.916456E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 274.986 | TFLOPs: 16.64 | 31: iteration 163790/ 173500 | consumed samples: 41930240 | consumed tokens: 85873131520 | elapsed time per iteration (s): 0.80 | learning rate: 2.142E-05 | global batch size: 256 | lm loss: 1.934360E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.194 | TFLOPs: 19.25 | 31: iteration 163800/ 173500 | consumed samples: 41932800 | consumed tokens: 85878374400 | elapsed time per iteration (s): 0.80 | learning rate: 2.141E-05 | global batch size: 256 | lm loss: 1.919137E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.045 | TFLOPs: 19.42 | 31: iteration 163810/ 173500 | consumed samples: 41935360 | consumed tokens: 85883617280 | elapsed time per iteration (s): 0.96 | learning rate: 2.141E-05 | global batch size: 256 | lm loss: 1.923143E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 266.689 | TFLOPs: 16.13 | 31: iteration 163820/ 173500 | consumed samples: 41937920 | consumed tokens: 85888860160 | elapsed time per iteration (s): 0.80 | learning rate: 2.141E-05 | global batch size: 256 | lm loss: 1.899754E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.418 | TFLOPs: 19.26 | 31: iteration 163830/ 173500 | consumed samples: 41940480 | consumed tokens: 85894103040 | elapsed time per iteration (s): 0.79 | learning rate: 
2.140E-05 | global batch size: 256 | lm loss: 1.913028E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.375 | TFLOPs: 19.56 | 31: iteration 163840/ 173500 | consumed samples: 41943040 | consumed tokens: 85899345920 | elapsed time per iteration (s): 0.79 | learning rate: 2.140E-05 | global batch size: 256 | lm loss: 1.908489E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.104 | TFLOPs: 19.55 | 31: iteration 163850/ 173500 | consumed samples: 41945600 | consumed tokens: 85904588800 | elapsed time per iteration (s): 0.80 | learning rate: 2.140E-05 | global batch size: 256 | lm loss: 1.884448E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.636 | TFLOPs: 19.28 | 31: iteration 163860/ 173500 | consumed samples: 41948160 | consumed tokens: 85909831680 | elapsed time per iteration (s): 0.90 | learning rate: 2.140E-05 | global batch size: 256 | lm loss: 1.892496E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.272 | TFLOPs: 17.26 | 31: iteration 163870/ 173500 | consumed samples: 41950720 | consumed tokens: 85915074560 | elapsed time per iteration (s): 0.81 | learning rate: 2.139E-05 | global batch size: 256 | lm loss: 1.940265E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.032 | TFLOPs: 19.18 | 31: iteration 163880/ 173500 | consumed samples: 41953280 | consumed tokens: 85920317440 | elapsed time per iteration (s): 0.80 | learning rate: 2.139E-05 | global batch size: 256 | lm loss: 1.929182E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.521 | TFLOPs: 19.39 | 31: iteration 163890/ 173500 | 
consumed samples: 41955840 | consumed tokens: 85925560320 | elapsed time per iteration (s): 0.83 | learning rate: 2.139E-05 | global batch size: 256 | lm loss: 1.926052E+00 | grad norm: 0.249 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.685 | TFLOPs: 18.67 | 31: iteration 163900/ 173500 | consumed samples: 41958400 | consumed tokens: 85930803200 | elapsed time per iteration (s): 0.92 | learning rate: 2.138E-05 | global batch size: 256 | lm loss: 1.913283E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 277.010 | TFLOPs: 16.76 | 31: iteration 163910/ 173500 | consumed samples: 41960960 | consumed tokens: 85936046080 | elapsed time per iteration (s): 0.82 | learning rate: 2.138E-05 | global batch size: 256 | lm loss: 1.905930E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.947 | TFLOPs: 18.81 | 31: iteration 163920/ 173500 | consumed samples: 41963520 | consumed tokens: 85941288960 | elapsed time per iteration (s): 0.77 | learning rate: 2.138E-05 | global batch size: 256 | lm loss: 1.907230E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.323 | TFLOPs: 20.10 | 31: iteration 163930/ 173500 | consumed samples: 41966080 | consumed tokens: 85946531840 | elapsed time per iteration (s): 0.79 | learning rate: 2.138E-05 | global batch size: 256 | lm loss: 1.902610E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.352 | TFLOPs: 19.62 | 31: iteration 163940/ 173500 | consumed samples: 41968640 | consumed tokens: 85951774720 | elapsed time per iteration (s): 0.81 | learning rate: 2.137E-05 | global batch size: 256 | lm loss: 1.927502E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 316.764 | TFLOPs: 19.16 | 31: iteration 163950/ 173500 | consumed samples: 41971200 | consumed tokens: 85957017600 | elapsed time per iteration (s): 0.78 | learning rate: 2.137E-05 | global batch size: 256 | lm loss: 1.892862E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.021 | TFLOPs: 19.90 | 31: iteration 163960/ 173500 | consumed samples: 41973760 | consumed tokens: 85962260480 | elapsed time per iteration (s): 0.81 | learning rate: 2.137E-05 | global batch size: 256 | lm loss: 1.902934E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.611 | TFLOPs: 19.09 | 31: iteration 163970/ 173500 | consumed samples: 41976320 | consumed tokens: 85967503360 | elapsed time per iteration (s): 0.86 | learning rate: 2.136E-05 | global batch size: 256 | lm loss: 1.912464E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.753 | TFLOPs: 18.01 | 31: iteration 163980/ 173500 | consumed samples: 41978880 | consumed tokens: 85972746240 | elapsed time per iteration (s): 0.83 | learning rate: 2.136E-05 | global batch size: 256 | lm loss: 1.931509E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.387 | TFLOPs: 18.72 | 31: iteration 163990/ 173500 | consumed samples: 41981440 | consumed tokens: 85977989120 | elapsed time per iteration (s): 0.85 | learning rate: 2.136E-05 | global batch size: 256 | lm loss: 1.923400E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.902 | TFLOPs: 18.20 | 0: [2022-11-27 07:00:06,822] [INFO] [logging.py:68:log_dist] [Rank 0] step=164000, skipped=0, lr=[2.1355330909017464e-05, 2.1355330909017464e-05, 
2.1355330909017464e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 164000/ 173500 | consumed samples: 41984000 | consumed tokens: 85983232000 | elapsed time per iteration (s): 0.93 | learning rate: 2.136E-05 | global batch size: 256 | lm loss: 1.895098E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 274.075 | TFLOPs: 16.58 | 0: steps: 164000 loss: 1.8377 iter time (s): 0.808 samples/sec: 316.856 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 164000 | lm loss value: 1.951030E+00 | lm loss PPL: 7.035932E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 164000 to checkpoints_1b1long 0: [2022-11-27 07:00:07,089] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step164000 is begin to save! 0: [2022-11-27 07:00:07,104] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_01-model_00-model_states.pt... 0: [2022-11-27 07:00:07,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_01-model_00-model_states.pt. 0: [2022-11-27 07:00:07,369] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_03-model_00-model_states.pt... 0: [2022-11-27 07:00:07,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_03-model_00-model_states.pt. 0: [2022-11-27 07:00:07,451] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_04-model_00-model_states.pt... 0: [2022-11-27 07:00:07,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_04-model_00-model_states.pt. 
0: [2022-11-27 07:00:07,529] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_05-model_00-model_states.pt... 0: [2022-11-27 07:00:07,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_05-model_00-model_states.pt. 0: [2022-11-27 07:00:07,608] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_06-model_00-model_states.pt... 0: [2022-11-27 07:00:07,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_06-model_00-model_states.pt. 0: [2022-11-27 07:00:07,682] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_07-model_00-model_states.pt... 0: [2022-11-27 07:00:07,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_07-model_00-model_states.pt. 0: [2022-11-27 07:00:07,759] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_08-model_00-model_states.pt... 0: [2022-11-27 07:00:07,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_08-model_00-model_states.pt. 0: [2022-11-27 07:00:07,838] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_09-model_00-model_states.pt... 0: [2022-11-27 07:00:07,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_09-model_00-model_states.pt. 0: [2022-11-27 07:00:07,917] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_10-model_00-model_states.pt... 0: [2022-11-27 07:00:07,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_10-model_00-model_states.pt. 
0: [2022-11-27 07:00:07,996] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_11-model_00-model_states.pt... 0: [2022-11-27 07:00:08,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_11-model_00-model_states.pt. 0: [2022-11-27 07:00:08,070] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_12-model_00-model_states.pt... 0: [2022-11-27 07:00:08,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_12-model_00-model_states.pt. 0: [2022-11-27 07:00:08,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_13-model_00-model_states.pt... 0: [2022-11-27 07:00:08,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_13-model_00-model_states.pt. 0: [2022-11-27 07:00:08,225] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_14-model_00-model_states.pt... 0: [2022-11-27 07:00:08,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_14-model_00-model_states.pt. 0: [2022-11-27 07:00:08,299] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_15-model_00-model_states.pt... 0: [2022-11-27 07:00:08,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_15-model_00-model_states.pt. 0: [2022-11-27 07:00:08,371] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_16-model_00-model_states.pt... 0: [2022-11-27 07:00:08,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_16-model_00-model_states.pt. 
0: [2022-11-27 07:00:08,447] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_17-model_00-model_states.pt... 0: [2022-11-27 07:00:08,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_17-model_00-model_states.pt. 0: [2022-11-27 07:00:08,521] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_18-model_00-model_states.pt... 0: [2022-11-27 07:00:08,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_18-model_00-model_states.pt. 0: [2022-11-27 07:00:08,595] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_19-model_00-model_states.pt... 0: [2022-11-27 07:00:08,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_19-model_00-model_states.pt. 0: [2022-11-27 07:00:08,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_20-model_00-model_states.pt... 0: [2022-11-27 07:00:08,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_20-model_00-model_states.pt. 0: [2022-11-27 07:00:08,744] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_21-model_00-model_states.pt... 0: [2022-11-27 07:00:08,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_21-model_00-model_states.pt. 0: [2022-11-27 07:00:08,816] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_22-model_00-model_states.pt... 0: [2022-11-27 07:00:08,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_22-model_00-model_states.pt. 
0: [2022-11-27 07:00:08,890] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_23-model_00-model_states.pt... 0: [2022-11-27 07:00:08,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_23-model_00-model_states.pt. 0: [2022-11-27 07:00:08,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_24-model_00-model_states.pt... 0: [2022-11-27 07:00:09,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_24-model_00-model_states.pt. 0: [2022-11-27 07:00:09,039] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_25-model_00-model_states.pt... 0: [2022-11-27 07:00:09,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_25-model_00-model_states.pt. 0: [2022-11-27 07:00:09,113] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_26-model_00-model_states.pt... 0: [2022-11-27 07:00:09,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_26-model_00-model_states.pt. 0: [2022-11-27 07:00:09,187] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_27-model_00-model_states.pt... 0: [2022-11-27 07:00:09,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_27-model_00-model_states.pt. 0: [2022-11-27 07:00:09,260] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_28-model_00-model_states.pt... 0: [2022-11-27 07:00:09,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_28-model_00-model_states.pt. 
0: [2022-11-27 07:00:09,334] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/layer_30-model_00-model_states.pt... 0: [2022-11-27 07:00:09,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/layer_30-model_00-model_states.pt. 0: [2022-11-27 07:00:09,336] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step164000/mp_rank_00_model_states.pt 0: [2022-11-27 07:00:09,336] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/mp_rank_00_model_states.pt... 0: [2022-11-27 07:00:09,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/mp_rank_00_model_states.pt. 0: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
6: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
7: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
10: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
2: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 
25: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 
24: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 
17: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 
26: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
6: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
8: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
13: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
15: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 
11: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 
29: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 
21: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 
0: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
16: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 
15: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 
24: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 
18: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
2: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 
31: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
24: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:00:09,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:00:09,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:00:09,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
19: [2022-11-27 07:00:09,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 07:00:09,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 10: [2022-11-27 07:00:09,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:00:09,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 07:00:09,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 23: [2022-11-27 07:00:09,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:00:09,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 07:00:09,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 3: [2022-11-27 07:00:09,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:00:09,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:00:09,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 07:00:09,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 
31: [2022-11-27 07:00:09,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 07:00:09,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 20: [2022-11-27 07:00:09,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:00:09,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:00:09,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:00:09,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 10: [2022-11-27 07:00:09,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 20: [2022-11-27 07:00:09,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 10: [2022-11-27 07:00:09,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 20: [2022-11-27 07:00:09,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 20: [2022-11-27 07:00:09,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 21: [2022-11-27 07:00:09,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-27 07:00:09,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 6: [2022-11-27 07:00:09,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:00:09,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:00:09,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 30: [2022-11-27 07:00:09,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 6: [2022-11-27 07:00:09,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 30: [2022-11-27 07:00:09,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 6: [2022-11-27 07:00:09,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 21: [2022-11-27 07:00:09,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:00:09,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 30: [2022-11-27 07:00:09,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:00:09,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 
30: [2022-11-27 07:00:09,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 07:00:09,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 18: [2022-11-27 07:00:09,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:00:09,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 07:00:09,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 18: [2022-11-27 07:00:09,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:00:09,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:00:09,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 12: [2022-11-27 07:00:09,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 07:00:09,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 18: [2022-11-27 07:00:09,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 27: [2022-11-27 07:00:09,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
27: [2022-11-27 07:00:09,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 07:00:09,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 24: [2022-11-27 07:00:09,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:00:09,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:00:09,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 07:00:09,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 07:00:09,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 24: [2022-11-27 07:00:09,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 26: [2022-11-27 07:00:09,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:00:09,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 07:00:09,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 19: [2022-11-27 07:00:09,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
19: [2022-11-27 07:00:09,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 07:00:09,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 11: [2022-11-27 07:00:09,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:00:09,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 07:00:09,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 13: [2022-11-27 07:00:09,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:00:09,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 07:00:09,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 31: [2022-11-27 07:00:09,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:00:09,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 21: [2022-11-27 07:00:09,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-27 07:00:09,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 07:00:09,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 31: [2022-11-27 07:00:09,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 12: [2022-11-27 07:00:09,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:00:09,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:00:09,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 23: [2022-11-27 07:00:09,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:00:09,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 13: [2022-11-27 07:00:09,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 12: [2022-11-27 07:00:09,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 23: [2022-11-27 07:00:09,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 5: [2022-11-27 07:00:09,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
10: [2022-11-27 07:00:09,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:00:09,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 5: [2022-11-27 07:00:09,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 10: [2022-11-27 07:00:09,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 5: [2022-11-27 07:00:09,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 10: [2022-11-27 07:00:09,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 29: [2022-11-27 07:00:09,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:00:09,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 07:00:09,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 20: [2022-11-27 07:00:09,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:00:09,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 07:00:09,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 
11: [2022-11-27 07:00:09,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:00:09,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 07:00:09,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 16: [2022-11-27 07:00:09,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:00:09,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 07:00:09,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 29: [2022-11-27 07:00:09,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:00:09,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 07:00:09,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 17: [2022-11-27 07:00:09,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 
17: [2022-11-27 07:00:09,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 07:00:09,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:00:09,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 17: [2022-11-27 07:00:09,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 07:00:09,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 3: [2022-11-27 07:00:09,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:00:09,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:00:09,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 26: [2022-11-27 07:00:09,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 3: [2022-11-27 07:00:09,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 26: [2022-11-27 07:00:09,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 16: [2022-11-27 07:00:09,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 
16: [2022-11-27 07:00:09,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 07:00:09,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 27: [2022-11-27 07:00:09,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:00:09,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:00:09,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:00:09,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 18: [2022-11-27 07:00:09,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 26: [2022-11-27 07:00:09,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 19: [2022-11-27 07:00:09,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:00:09,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 18: [2022-11-27 07:00:09,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 
19: [2022-11-27 07:00:09,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-27 07:00:09,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 27: [2022-11-27 07:00:09,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 13: [2022-11-27 07:00:09,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:00:09,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 07:00:09,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 30: [2022-11-27 07:00:09,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:00:09,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 07:00:09,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 11: [2022-11-27 07:00:09,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:00:09,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 07:00:09,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 
24: [2022-11-27 07:00:09,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:00:09,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:00:09,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 07:00:09,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 24: [2022-11-27 07:00:09,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 07:00:09,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 29: [2022-11-27 07:00:09,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:00:09,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 07:00:09,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 19: [2022-11-27 07:00:09,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:00:09,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 07:00:09,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 
3: [2022-11-27 07:00:09,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:00:09,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 07:00:09,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 10: [2022-11-27 07:00:09,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:00:09,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 07:00:09,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 12: [2022-11-27 07:00:09,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:00:09,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:00:09,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 07:00:09,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 27: [2022-11-27 07:00:09,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 07:00:09,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 
0: [2022-11-27 07:00:09,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:00:09,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:00:09,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 07:00:09,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 07:00:09,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 0: [2022-11-27 07:00:09,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 6: [2022-11-27 07:00:09,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:00:09,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 07:00:09,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 23: [2022-11-27 07:00:09,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:00:09,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 07:00:09,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 
31: [2022-11-27 07:00:09,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:00:09,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 07:00:09,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 22: [2022-11-27 07:00:09,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:00:09,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 07:00:09,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 0: [2022-11-27 07:00:09,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:00:09,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 07:00:09,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 6: [2022-11-27 07:00:09,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-27 07:00:09,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 07:00:09,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:00:09,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 6: [2022-11-27 07:00:09,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 07:00:09,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 11: [2022-11-27 07:00:09,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:00:09,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:00:09,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 13: [2022-11-27 07:00:09,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 11: [2022-11-27 07:00:09,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 13: [2022-11-27 07:00:09,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 22: [2022-11-27 07:00:09,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
22: [2022-11-27 07:00:09,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 12: [2022-11-27 07:00:09,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:00:09,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 12: [2022-11-27 07:00:09,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 07:00:09,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 22: [2022-11-27 07:00:09,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:00:09,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 07:00:09,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 20: [2022-11-27 07:00:09,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:00:09,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 07:00:09,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 18: [2022-11-27 07:00:09,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
18: [2022-11-27 07:00:09,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 07:00:09,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 31: [2022-11-27 07:00:09,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:00:09,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 07:00:09,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 7: [2022-11-27 07:00:09,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:00:09,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:00:09,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 07:00:09,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 7: [2022-11-27 07:00:09,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 07:00:09,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 7: [2022-11-27 07:00:09,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-27 07:00:09,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 07:00:09,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 3: [2022-11-27 07:00:09,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:00:09,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 07:00:09,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 30: [2022-11-27 07:00:09,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:00:09,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:00:09,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 07:00:09,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 24: [2022-11-27 07:00:09,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:00:09,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-27 07:00:09,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 5: [2022-11-27 07:00:09,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 13: [2022-11-27 07:00:09,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 24: [2022-11-27 07:00:09,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 5: [2022-11-27 07:00:09,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 5: [2022-11-27 07:00:09,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:00:09,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 24: [2022-11-27 07:00:09,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 5: [2022-11-27 07:00:09,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 29: [2022-11-27 07:00:09,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:00:09,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
29: [2022-11-27 07:00:09,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 5: [2022-11-27 07:00:09,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 29: [2022-11-27 07:00:09,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 5: [2022-11-27 07:00:09,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 26: [2022-11-27 07:00:09,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:00:09,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 07:00:09,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 17: [2022-11-27 07:00:09,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:00:09,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 07:00:09,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 17: [2022-11-27 07:00:09,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-27 07:00:09,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 07:00:09,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 21: [2022-11-27 07:00:09,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:00:09,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 07:00:09,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 16: [2022-11-27 07:00:09,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:00:09,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 07:00:09,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 19: [2022-11-27 07:00:09,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:00:09,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 07:00:09,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 27: [2022-11-27 07:00:09,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
27: [2022-11-27 07:00:09,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 07:00:09,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 8: [2022-11-27 07:00:09,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:00:09,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:00:09,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:00:09,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:00:09,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 07:00:09,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 07:00:09,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 07:00:09,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 07:00:09,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 
8: [2022-11-27 07:00:09,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 8: [2022-11-27 07:00:09,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 8: [2022-11-27 07:00:09,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 23: [2022-11-27 07:00:09,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:00:09,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 07:00:09,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 14: [2022-11-27 07:00:09,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:00:09,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 07:00:09,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 14: [2022-11-27 07:00:09,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:00:09,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 07:00:09,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 
16: [2022-11-27 07:00:09,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:00:09,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 07:00:09,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 2: [2022-11-27 07:00:09,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:00:09,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 07:00:09,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 2: [2022-11-27 07:00:09,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:00:09,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 07:00:09,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 5: [2022-11-27 07:00:09,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:00:09,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 07:00:09,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 
9: [2022-11-27 07:00:09,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:00:09,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:00:09,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 07:00:09,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 07:00:09,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:00:09,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 8: [2022-11-27 07:00:09,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:00:09,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 9: [2022-11-27 07:00:09,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 8: [2022-11-27 07:00:09,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 9: [2022-11-27 07:00:09,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 8: [2022-11-27 07:00:09,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 
6: [2022-11-27 07:00:09,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:00:09,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 07:00:09,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 14: [2022-11-27 07:00:09,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:00:09,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 07:00:09,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 11: [2022-11-27 07:00:09,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:00:09,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 07:00:09,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 28: [2022-11-27 07:00:09,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-27 07:00:09,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-27 07:00:09,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
28: [2022-11-27 07:00:09,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-27 07:00:09,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 07:00:09,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 07:00:09,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 07:00:09,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 07:00:09,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 28: [2022-11-27 07:00:09,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 28: [2022-11-27 07:00:09,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 28: [2022-11-27 07:00:09,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 26: [2022-11-27 07:00:09,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:00:09,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-27 07:00:09,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 
25: [2022-11-27 07:00:09,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:00:09,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:00:09,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:00:09,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:00:09,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 07:00:09,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 07:00:09,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 07:00:09,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 25: [2022-11-27 07:00:09,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 25: [2022-11-27 07:00:09,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 25: [2022-11-27 07:00:09,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-27 07:00:09,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 
15: [2022-11-27 07:00:09,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:00:09,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:00:09,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:00:09,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 07:00:09,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:00:09,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 17: [2022-11-27 07:00:09,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:00:09,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 15: [2022-11-27 07:00:09,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 3: [2022-11-27 07:00:09,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:00:09,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 
15: [2022-11-27 07:00:09,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 3: [2022-11-27 07:00:09,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 15: [2022-11-27 07:00:09,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 15: [2022-11-27 07:00:09,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 17: [2022-11-27 07:00:09,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 3: [2022-11-27 07:00:09,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 17: [2022-11-27 07:00:09,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 2: [2022-11-27 07:00:09,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:00:09,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 07:00:09,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 0: [2022-11-27 07:00:09,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 07:00:09,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 
29: [2022-11-27 07:00:09,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:00:09,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 07:00:09,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 28: [2022-11-27 07:00:09,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-27 07:00:09,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 07:00:09,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 0: [2022-11-27 07:00:09,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:00:09,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 07:00:09,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 2: [2022-11-27 07:00:09,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:00:09,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 07:00:09,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 
15: [2022-11-27 07:00:09,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:00:09,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 07:00:09,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 1: [2022-11-27 07:00:09,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:00:09,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:00:09,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:00:09,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:00:09,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-27 07:00:09,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 07:00:09,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 07:00:09,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 07:00:09,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 07:00:09,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 1: [2022-11-27 07:00:09,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 1: [2022-11-27 07:00:09,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 1: [2022-11-27 07:00:09,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 07:00:09,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 1: [2022-11-27 07:00:09,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 10: [2022-11-27 07:00:09,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-27 07:00:09,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 07:00:09,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 30: [2022-11-27 07:00:09,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:00:09,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 07:00:09,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 12: [2022-11-27 07:00:09,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:00:09,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 07:00:09,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 20: [2022-11-27 07:00:09,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:00:09,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 07:00:09,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 27: [2022-11-27 07:00:09,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
27: [2022-11-27 07:00:09,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 07:00:09,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 18: [2022-11-27 07:00:09,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:00:09,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 07:00:09,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 4: [2022-11-27 07:00:09,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:00:09,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:00:09,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:00:09,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-27 07:00:09,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 07:00:09,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 07:00:09,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 07:00:09,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 07:00:09,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 4: [2022-11-27 07:00:09,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 4: [2022-11-27 07:00:09,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 4: [2022-11-27 07:00:09,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 31: [2022-11-27 07:00:09,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:00:09,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 07:00:09,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 23: [2022-11-27 07:00:09,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-27 07:00:09,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 07:00:09,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 25: [2022-11-27 07:00:09,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:00:09,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 07:00:09,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 7: [2022-11-27 07:00:09,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:00:09,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 07:00:09,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 14: [2022-11-27 07:00:09,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:00:09,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 07:00:09,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 4: [2022-11-27 07:00:09,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-27 07:00:09,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 07:00:09,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 24: [2022-11-27 07:00:09,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:00:09,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-27 07:00:09,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 1: [2022-11-27 07:00:09,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:00:09,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 07:00:09,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 9: [2022-11-27 07:00:09,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:00:09,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 07:00:09,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 16: [2022-11-27 07:00:09,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
16: [2022-11-27 07:00:09,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 07:00:09,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 22: [2022-11-27 07:00:09,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:00:09,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 07:00:09,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 13: [2022-11-27 07:00:09,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:00:09,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 07:00:09,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 19: [2022-11-27 07:00:09,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:00:09,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 07:00:09,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 6: [2022-11-27 07:00:09,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-27 07:00:09,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 07:00:09,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 21: [2022-11-27 07:00:09,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:00:09,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 07:00:09,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 26: [2022-11-27 07:00:09,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:00:09,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 07:00:09,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 0: [2022-11-27 07:00:09,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:00:09,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
0: [2022-11-27 07:00:09,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 11: [2022-11-27 07:00:09,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 0: [2022-11-27 07:00:09,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 11: [2022-11-27 07:00:09,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 17: [2022-11-27 07:00:09,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:00:09,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 07:00:09,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 30: [2022-11-27 07:00:09,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:00:09,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 07:00:09,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 3: [2022-11-27 07:00:09,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-27 07:00:09,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 07:00:09,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 27: [2022-11-27 07:00:09,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:00:09,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 07:00:09,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 28: [2022-11-27 07:00:09,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:00:09,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:00:09,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
29: [2022-11-27 07:00:09,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 12: [2022-11-27 07:00:09,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 28: [2022-11-27 07:00:09,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 29: [2022-11-27 07:00:09,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 12: [2022-11-27 07:00:09,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 20: [2022-11-27 07:00:09,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 28: [2022-11-27 07:00:09,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 20: [2022-11-27 07:00:09,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 07:00:09,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 10: [2022-11-27 07:00:09,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:00:09,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 
10: [2022-11-27 07:00:09,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 31: [2022-11-27 07:00:09,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 10: [2022-11-27 07:00:09,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 31: [2022-11-27 07:00:09,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 5: [2022-11-27 07:00:09,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:00:09,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 07:00:09,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 4: [2022-11-27 07:00:09,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:00:09,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 07:00:09,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 23: [2022-11-27 07:00:09,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
23: [2022-11-27 07:00:09,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 07:00:09,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 8: [2022-11-27 07:00:09,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:00:09,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 07:00:09,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 15: [2022-11-27 07:00:09,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:00:09,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 07:00:09,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 18: [2022-11-27 07:00:09,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:00:09,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 07:00:09,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 14: [2022-11-27 07:00:09,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-27 07:00:09,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 07:00:09,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 22: [2022-11-27 07:00:09,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:00:09,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 07:00:09,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 24: [2022-11-27 07:00:09,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:00:09,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 07:00:09,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 1: [2022-11-27 07:00:09,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:00:09,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 07:00:09,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 7: [2022-11-27 07:00:09,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
2: [2022-11-27 07:00:09,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:00:09,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 07:00:09,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 2: [2022-11-27 07:00:09,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 07:00:09,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 9: [2022-11-27 07:00:09,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:00:09,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 07:00:09,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 25: [2022-11-27 07:00:09,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:00:09,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 07:00:09,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 16: [2022-11-27 07:00:09,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 
16: [2022-11-27 07:00:09,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 07:00:09,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 21: [2022-11-27 07:00:09,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:00:09,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 07:00:09,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 6: [2022-11-27 07:00:09,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:00:09,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 07:00:09,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 13: [2022-11-27 07:00:09,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:00:09,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 07:00:09,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 19: [2022-11-27 07:00:09,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-27 07:00:09,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 07:00:09,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 29: [2022-11-27 07:00:09,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:00:09,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 07:00:09,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 26: [2022-11-27 07:00:09,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:00:09,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 07:00:09,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 11: [2022-11-27 07:00:09,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:00:09,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 07:00:09,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 28: [2022-11-27 07:00:09,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-27 07:00:09,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 07:00:09,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 10: [2022-11-27 07:00:09,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:00:09,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 07:00:09,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 8: [2022-11-27 07:00:09,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:00:09,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 07:00:09,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 15: [2022-11-27 07:00:09,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:00:09,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 07:00:09,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 17: [2022-11-27 07:00:09,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-27 07:00:09,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 07:00:09,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 30: [2022-11-27 07:00:09,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:00:09,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 07:00:09,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 5: [2022-11-27 07:00:09,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:00:09,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 07:00:09,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 27: [2022-11-27 07:00:09,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:00:09,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 07:00:09,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 0: [2022-11-27 07:00:09,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-27 07:00:09,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 07:00:09,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 4: [2022-11-27 07:00:09,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:00:09,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 07:00:09,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 18: [2022-11-27 07:00:09,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:00:09,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 07:00:09,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 23: [2022-11-27 07:00:09,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:00:09,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 07:00:09,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 20: [2022-11-27 07:00:09,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-27 07:00:09,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 07:00:09,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 31: [2022-11-27 07:00:09,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:00:09,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 07:00:09,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 12: [2022-11-27 07:00:09,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:00:09,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 07:00:09,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 14: [2022-11-27 07:00:09,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:00:09,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 07:00:09,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 7: [2022-11-27 07:00:09,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-27 07:00:09,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 07:00:09,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 13: [2022-11-27 07:00:09,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:00:09,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 07:00:09,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 1: [2022-11-27 07:00:09,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:00:09,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:00:09,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 2: [2022-11-27 07:00:09,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
25: [2022-11-27 07:00:09,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 2: [2022-11-27 07:00:09,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 1: [2022-11-27 07:00:09,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 25: [2022-11-27 07:00:09,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 2: [2022-11-27 07:00:09,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 21: [2022-11-27 07:00:09,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:00:09,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:00:09,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 07:00:09,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 25: [2022-11-27 07:00:09,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 29: [2022-11-27 07:00:09,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:00:09,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 
29: [2022-11-27 07:00:09,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 07:00:09,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 0: [2022-11-27 07:00:09,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:00:09,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 07:00:09,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 10: [2022-11-27 07:00:09,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:00:09,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 07:00:09,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 28: [2022-11-27 07:00:09,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-27 07:00:09,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 07:00:09,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 26: [2022-11-27 07:00:09,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
30: [2022-11-27 07:00:09,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:00:09,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:00:09,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 30: [2022-11-27 07:00:09,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 26: [2022-11-27 07:00:09,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 24: [2022-11-27 07:00:09,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 22: [2022-11-27 07:00:09,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:00:09,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 6: [2022-11-27 07:00:09,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:00:09,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
22: [2022-11-27 07:00:09,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 30: [2022-11-27 07:00:09,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 22: [2022-11-27 07:00:09,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 6: [2022-11-27 07:00:09,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 16: [2022-11-27 07:00:09,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 6: [2022-11-27 07:00:09,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 16: [2022-11-27 07:00:09,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 27: [2022-11-27 07:00:09,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:00:09,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:00:09,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 07:00:09,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 
19: [2022-11-27 07:00:09,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 07:00:09,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 5: [2022-11-27 07:00:09,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:00:09,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 11: [2022-11-27 07:00:09,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:00:09,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 5: [2022-11-27 07:00:09,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 11: [2022-11-27 07:00:09,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 17: [2022-11-27 07:00:09,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:00:09,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 07:00:09,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 8: [2022-11-27 07:00:09,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-27 07:00:09,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 07:00:09,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 15: [2022-11-27 07:00:09,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:00:09,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 4: [2022-11-27 07:00:09,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:00:09,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 07:00:09,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 15: [2022-11-27 07:00:09,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 23: [2022-11-27 07:00:09,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:00:09,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 07:00:09,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 9: [2022-11-27 07:00:09,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-27 07:00:09,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 07:00:09,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 20: [2022-11-27 07:00:09,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:00:09,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 07:00:09,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 31: [2022-11-27 07:00:09,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:00:09,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 07:00:09,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 18: [2022-11-27 07:00:09,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:00:09,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-27 07:00:09,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 18: [2022-11-27 07:00:09,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 07:00:09,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 12: [2022-11-27 07:00:09,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 7: [2022-11-27 07:00:09,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:00:09,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 07:00:09,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 3: [2022-11-27 07:00:09,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:00:09,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:00:09,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 07:00:09,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 07:00:09,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 
3: [2022-11-27 07:00:09,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 14: [2022-11-27 07:00:09,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:00:09,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:00:09,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 07:00:09,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 16: [2022-11-27 07:00:09,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 07:00:09,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 22: [2022-11-27 07:00:09,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:00:09,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 07:00:09,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 24: [2022-11-27 07:00:09,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-27 07:00:09,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 07:00:09,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 7: [2022-11-27 07:00:09,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:00:09,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 07:00:09,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 22: [2022-11-27 07:00:09,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:00:09,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 07:00:09,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 9: [2022-11-27 07:00:09,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:00:09,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 07:00:09,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 14: [2022-11-27 07:00:09,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-27 07:00:09,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 07:00:09,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 9: [2022-11-27 07:00:09,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:00:09,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 07:00:09,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 2: [2022-11-27 07:00:09,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:00:09,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 07:00:09,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 2: [2022-11-27 07:00:09,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:00:09,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step164000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 07:00:09,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step164000 is ready now! 
0: successfully saved checkpoint at iteration 164000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2607.91 31: iteration 164010/ 173500 | consumed samples: 41986560 | consumed tokens: 85988474880 | elapsed time per iteration (s): 1.07 | learning rate: 2.135E-05 | global batch size: 256 | lm loss: 1.898921E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.871 | TFLOPs: 14.45 | 31: iteration 164020/ 173500 | consumed samples: 41989120 | consumed tokens: 85993717760 | elapsed time per iteration (s): 0.85 | learning rate: 2.135E-05 | global batch size: 256 | lm loss: 1.915859E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.024 | TFLOPs: 18.15 | 31: iteration 164030/ 173500 | consumed samples: 41991680 | consumed tokens: 85998960640 | elapsed time per iteration (s): 0.83 | learning rate: 2.135E-05 | global batch size: 256 | lm loss: 1.915399E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.252 | TFLOPs: 18.77 | 31: iteration 164040/ 173500 | consumed samples: 41994240 | consumed tokens: 86004203520 | elapsed time per iteration (s): 0.81 | learning rate: 2.134E-05 | global batch size: 256 | lm loss: 1.937212E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.283 | TFLOPs: 19.19 | 31: iteration 164050/ 173500 | consumed samples: 41996800 | consumed tokens: 86009446400 | elapsed time per iteration (s): 0.82 | learning rate: 2.134E-05 | global batch size: 256 | lm loss: 1.890612E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.999 | TFLOPs: 19.00 | 31: iteration 164060/ 173500 | consumed samples: 41999360 | consumed tokens: 86014689280 | elapsed time per iteration (s): 
0.80 | learning rate: 2.134E-05 | global batch size: 256 | lm loss: 1.925394E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.973 | TFLOPs: 19.30 | 31: iteration 164070/ 173500 | consumed samples: 42001920 | consumed tokens: 86019932160 | elapsed time per iteration (s): 0.97 | learning rate: 2.134E-05 | global batch size: 256 | lm loss: 1.918066E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 264.488 | TFLOPs: 16.00 | 31: iteration 164080/ 173500 | consumed samples: 42004480 | consumed tokens: 86025175040 | elapsed time per iteration (s): 0.80 | learning rate: 2.133E-05 | global batch size: 256 | lm loss: 1.913893E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.372 | TFLOPs: 19.26 | 31: iteration 164090/ 173500 | consumed samples: 42007040 | consumed tokens: 86030417920 | elapsed time per iteration (s): 0.84 | learning rate: 2.133E-05 | global batch size: 256 | lm loss: 1.907682E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.273 | TFLOPs: 18.47 | 31: iteration 164100/ 173500 | consumed samples: 42009600 | consumed tokens: 86035660800 | elapsed time per iteration (s): 0.84 | learning rate: 2.133E-05 | global batch size: 256 | lm loss: 1.913319E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.258 | TFLOPs: 18.41 | 31: iteration 164110/ 173500 | consumed samples: 42012160 | consumed tokens: 86040903680 | elapsed time per iteration (s): 0.79 | learning rate: 2.132E-05 | global batch size: 256 | lm loss: 1.905572E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.567 | TFLOPs: 19.64 | 31: 
iteration 164120/ 173500 | consumed samples: 42014720 | consumed tokens: 86046146560 | elapsed time per iteration (s): 0.82 | learning rate: 2.132E-05 | global batch size: 256 | lm loss: 1.907064E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.575 | TFLOPs: 18.79 | 31: iteration 164130/ 173500 | consumed samples: 42017280 | consumed tokens: 86051389440 | elapsed time per iteration (s): 0.83 | learning rate: 2.132E-05 | global batch size: 256 | lm loss: 1.904186E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.059 | TFLOPs: 18.58 | 31: iteration 164140/ 173500 | consumed samples: 42019840 | consumed tokens: 86056632320 | elapsed time per iteration (s): 0.82 | learning rate: 2.132E-05 | global batch size: 256 | lm loss: 1.894028E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.701 | TFLOPs: 18.98 | 31: iteration 164150/ 173500 | consumed samples: 42022400 | consumed tokens: 86061875200 | elapsed time per iteration (s): 0.81 | learning rate: 2.131E-05 | global batch size: 256 | lm loss: 1.918853E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.547 | TFLOPs: 19.15 | 31: iteration 164160/ 173500 | consumed samples: 42024960 | consumed tokens: 86067118080 | elapsed time per iteration (s): 1.05 | learning rate: 2.131E-05 | global batch size: 256 | lm loss: 1.907588E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.105 | TFLOPs: 14.77 | 31: iteration 164170/ 173500 | consumed samples: 42027520 | consumed tokens: 86072360960 | elapsed time per iteration (s): 0.79 | learning rate: 2.131E-05 | global batch size: 256 | lm loss: 1.899932E+00 | grad norm: 0.183 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.587 | TFLOPs: 19.64 | 31: iteration 164180/ 173500 | consumed samples: 42030080 | consumed tokens: 86077603840 | elapsed time per iteration (s): 0.82 | learning rate: 2.130E-05 | global batch size: 256 | lm loss: 1.913854E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.796 | TFLOPs: 18.92 | 31: iteration 164190/ 173500 | consumed samples: 42032640 | consumed tokens: 86082846720 | elapsed time per iteration (s): 0.80 | learning rate: 2.130E-05 | global batch size: 256 | lm loss: 1.900247E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.425 | TFLOPs: 19.38 | 31: iteration 164200/ 173500 | consumed samples: 42035200 | consumed tokens: 86088089600 | elapsed time per iteration (s): 0.86 | learning rate: 2.130E-05 | global batch size: 256 | lm loss: 1.937594E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.009 | TFLOPs: 18.09 | 31: iteration 164210/ 173500 | consumed samples: 42037760 | consumed tokens: 86093332480 | elapsed time per iteration (s): 0.77 | learning rate: 2.130E-05 | global batch size: 256 | lm loss: 1.921656E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.388 | TFLOPs: 19.99 | 31: iteration 164220/ 173500 | consumed samples: 42040320 | consumed tokens: 86098575360 | elapsed time per iteration (s): 0.83 | learning rate: 2.129E-05 | global batch size: 256 | lm loss: 1.899599E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.085 | TFLOPs: 18.76 | 31: iteration 164230/ 173500 | consumed samples: 42042880 | consumed tokens: 86103818240 | elapsed time per iteration (s): 0.77 | 
learning rate: 2.129E-05 | global batch size: 256 | lm loss: 1.881677E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.099 | TFLOPs: 20.03 | 31: iteration 164240/ 173500 | consumed samples: 42045440 | consumed tokens: 86109061120 | elapsed time per iteration (s): 0.77 | learning rate: 2.129E-05 | global batch size: 256 | lm loss: 1.934319E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.178 | TFLOPs: 20.10 | 31: iteration 164250/ 173500 | consumed samples: 42048000 | consumed tokens: 86114304000 | elapsed time per iteration (s): 0.84 | learning rate: 2.129E-05 | global batch size: 256 | lm loss: 1.912581E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.021 | TFLOPs: 18.33 | 31: iteration 164260/ 173500 | consumed samples: 42050560 | consumed tokens: 86119546880 | elapsed time per iteration (s): 0.80 | learning rate: 2.128E-05 | global batch size: 256 | lm loss: 1.912359E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.296 | TFLOPs: 19.26 | 31: iteration 164270/ 173500 | consumed samples: 42053120 | consumed tokens: 86124789760 | elapsed time per iteration (s): 0.79 | learning rate: 2.128E-05 | global batch size: 256 | lm loss: 1.894067E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.930 | TFLOPs: 19.66 | 31: iteration 164280/ 173500 | consumed samples: 42055680 | consumed tokens: 86130032640 | elapsed time per iteration (s): 0.82 | learning rate: 2.128E-05 | global batch size: 256 | lm loss: 1.915993E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.777 | TFLOPs: 18.98 | 31: iteration 
164290/ 173500 | consumed samples: 42058240 | consumed tokens: 86135275520 | elapsed time per iteration (s): 0.80 | learning rate: 2.127E-05 | global batch size: 256 | lm loss: 1.906426E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.275 | TFLOPs: 19.44 | 31: iteration 164300/ 173500 | consumed samples: 42060800 | consumed tokens: 86140518400 | elapsed time per iteration (s): 0.94 | learning rate: 2.127E-05 | global batch size: 256 | lm loss: 1.918808E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 273.264 | TFLOPs: 16.53 | 31: iteration 164310/ 173500 | consumed samples: 42063360 | consumed tokens: 86145761280 | elapsed time per iteration (s): 0.79 | learning rate: 2.127E-05 | global batch size: 256 | lm loss: 1.916595E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.093 | TFLOPs: 19.61 | 31: iteration 164320/ 173500 | consumed samples: 42065920 | consumed tokens: 86151004160 | elapsed time per iteration (s): 0.79 | learning rate: 2.127E-05 | global batch size: 256 | lm loss: 1.917307E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.872 | TFLOPs: 19.53 | 31: iteration 164330/ 173500 | consumed samples: 42068480 | consumed tokens: 86156247040 | elapsed time per iteration (s): 0.79 | learning rate: 2.126E-05 | global batch size: 256 | lm loss: 1.894183E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.152 | TFLOPs: 19.55 | 31: iteration 164340/ 173500 | consumed samples: 42071040 | consumed tokens: 86161489920 | elapsed time per iteration (s): 0.78 | learning rate: 2.126E-05 | global batch size: 256 | lm loss: 1.918486E+00 | grad norm: 0.199 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.727 | TFLOPs: 19.77 | 31: iteration 164350/ 173500 | consumed samples: 42073600 | consumed tokens: 86166732800 | elapsed time per iteration (s): 0.73 | learning rate: 2.126E-05 | global batch size: 256 | lm loss: 1.906843E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.821 | TFLOPs: 21.22 | 31: iteration 164360/ 173500 | consumed samples: 42076160 | consumed tokens: 86171975680 | elapsed time per iteration (s): 0.78 | learning rate: 2.125E-05 | global batch size: 256 | lm loss: 1.928803E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.286 | TFLOPs: 19.92 | 31: iteration 164370/ 173500 | consumed samples: 42078720 | consumed tokens: 86177218560 | elapsed time per iteration (s): 0.75 | learning rate: 2.125E-05 | global batch size: 256 | lm loss: 1.906566E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.849 | TFLOPs: 20.74 | 31: iteration 164380/ 173500 | consumed samples: 42081280 | consumed tokens: 86182461440 | elapsed time per iteration (s): 0.84 | learning rate: 2.125E-05 | global batch size: 256 | lm loss: 1.922879E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.841 | TFLOPs: 18.44 | 31: iteration 164390/ 173500 | consumed samples: 42083840 | consumed tokens: 86187704320 | elapsed time per iteration (s): 0.82 | learning rate: 2.125E-05 | global batch size: 256 | lm loss: 1.903506E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.587 | TFLOPs: 18.91 | 31: iteration 164400/ 173500 | consumed samples: 42086400 | consumed tokens: 86192947200 | elapsed time per iteration (s): 0.82 | learning 
rate: 2.124E-05 | global batch size: 256 | lm loss: 1.913668E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.512 | TFLOPs: 18.97 | 31: iteration 164410/ 173500 | consumed samples: 42088960 | consumed tokens: 86198190080 | elapsed time per iteration (s): 0.78 | learning rate: 2.124E-05 | global batch size: 256 | lm loss: 1.942446E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.914 | TFLOPs: 19.90 | 31: iteration 164420/ 173500 | consumed samples: 42091520 | consumed tokens: 86203432960 | elapsed time per iteration (s): 0.75 | learning rate: 2.124E-05 | global batch size: 256 | lm loss: 1.901438E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.110 | TFLOPs: 20.76 | 31: iteration 164430/ 173500 | consumed samples: 42094080 | consumed tokens: 86208675840 | elapsed time per iteration (s): 0.81 | learning rate: 2.124E-05 | global batch size: 256 | lm loss: 1.915964E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.250 | TFLOPs: 19.19 | 31: iteration 164440/ 173500 | consumed samples: 42096640 | consumed tokens: 86213918720 | elapsed time per iteration (s): 0.77 | learning rate: 2.123E-05 | global batch size: 256 | lm loss: 1.926823E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.896 | TFLOPs: 20.20 | 31: iteration 164450/ 173500 | consumed samples: 42099200 | consumed tokens: 86219161600 | elapsed time per iteration (s): 3.56 | learning rate: 2.123E-05 | global batch size: 256 | lm loss: 1.902906E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 71.863 | TFLOPs: 4.35 | 31: iteration 164460/ 173500 
| consumed samples: 42101760 | consumed tokens: 86224404480 | elapsed time per iteration (s): 4.69 | learning rate: 2.123E-05 | global batch size: 256 | lm loss: 1.891792E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 54.574 | TFLOPs: 3.30 | 31: iteration 164470/ 173500 | consumed samples: 42104320 | consumed tokens: 86229647360 | elapsed time per iteration (s): 0.78 | learning rate: 2.122E-05 | global batch size: 256 | lm loss: 1.884905E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.101 | TFLOPs: 19.79 | 31: iteration 164480/ 173500 | consumed samples: 42106880 | consumed tokens: 86234890240 | elapsed time per iteration (s): 0.77 | learning rate: 2.122E-05 | global batch size: 256 | lm loss: 1.880561E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.368 | TFLOPs: 19.99 | 31: iteration 164490/ 173500 | consumed samples: 42109440 | consumed tokens: 86240133120 | elapsed time per iteration (s): 0.72 | learning rate: 2.122E-05 | global batch size: 256 | lm loss: 1.908145E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.390 | TFLOPs: 21.50 | 31: iteration 164500/ 173500 | consumed samples: 42112000 | consumed tokens: 86245376000 | elapsed time per iteration (s): 0.75 | learning rate: 2.122E-05 | global batch size: 256 | lm loss: 1.917730E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.001 | TFLOPs: 20.75 | 31: iteration 164510/ 173500 | consumed samples: 42114560 | consumed tokens: 86250618880 | elapsed time per iteration (s): 0.76 | learning rate: 2.121E-05 | global batch size: 256 | lm loss: 1.909388E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 334.888 | TFLOPs: 20.26 | 31: iteration 164520/ 173500 | consumed samples: 42117120 | consumed tokens: 86255861760 | elapsed time per iteration (s): 0.81 | learning rate: 2.121E-05 | global batch size: 256 | lm loss: 1.911980E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.042 | TFLOPs: 19.18 | 31: iteration 164530/ 173500 | consumed samples: 42119680 | consumed tokens: 86261104640 | elapsed time per iteration (s): 0.77 | learning rate: 2.121E-05 | global batch size: 256 | lm loss: 1.905801E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.668 | TFLOPs: 20.07 | 31: iteration 164540/ 173500 | consumed samples: 42122240 | consumed tokens: 86266347520 | elapsed time per iteration (s): 0.78 | learning rate: 2.121E-05 | global batch size: 256 | lm loss: 1.912303E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.582 | TFLOPs: 19.88 | 31: iteration 164550/ 173500 | consumed samples: 42124800 | consumed tokens: 86271590400 | elapsed time per iteration (s): 0.76 | learning rate: 2.120E-05 | global batch size: 256 | lm loss: 1.921967E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.926 | TFLOPs: 20.26 | 31: iteration 164560/ 173500 | consumed samples: 42127360 | consumed tokens: 86276833280 | elapsed time per iteration (s): 0.74 | learning rate: 2.120E-05 | global batch size: 256 | lm loss: 1.914776E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.656 | TFLOPs: 20.85 | 31: iteration 164570/ 173500 | consumed samples: 42129920 | consumed tokens: 86282076160 | elapsed time per iteration (s): 0.78 | learning rate: 
2.120E-05 | global batch size: 256 | lm loss: 1.890206E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.057 | TFLOPs: 19.97 | 31: iteration 164580/ 173500 | consumed samples: 42132480 | consumed tokens: 86287319040 | elapsed time per iteration (s): 0.74 | learning rate: 2.120E-05 | global batch size: 256 | lm loss: 1.920292E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.549 | TFLOPs: 20.97 | 31: iteration 164590/ 173500 | consumed samples: 42135040 | consumed tokens: 86292561920 | elapsed time per iteration (s): 0.79 | learning rate: 2.119E-05 | global batch size: 256 | lm loss: 1.929804E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.534 | TFLOPs: 19.57 | 31: iteration 164600/ 173500 | consumed samples: 42137600 | consumed tokens: 86297804800 | elapsed time per iteration (s): 0.78 | learning rate: 2.119E-05 | global batch size: 256 | lm loss: 1.909661E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.645 | TFLOPs: 19.88 | 31: iteration 164610/ 173500 | consumed samples: 42140160 | consumed tokens: 86303047680 | elapsed time per iteration (s): 0.84 | learning rate: 2.119E-05 | global batch size: 256 | lm loss: 1.903016E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.268 | TFLOPs: 18.41 | 31: iteration 164620/ 173500 | consumed samples: 42142720 | consumed tokens: 86308290560 | elapsed time per iteration (s): 0.78 | learning rate: 2.118E-05 | global batch size: 256 | lm loss: 1.908294E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.250 | TFLOPs: 19.92 | 31: iteration 164630/ 173500 | 
consumed samples: 42145280 | consumed tokens: 86313533440 | elapsed time per iteration (s): 0.77 | learning rate: 2.118E-05 | global batch size: 256 | lm loss: 1.880314E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.617 | TFLOPs: 20.24 | 31: iteration 164640/ 173500 | consumed samples: 42147840 | consumed tokens: 86318776320 | elapsed time per iteration (s): 0.88 | learning rate: 2.118E-05 | global batch size: 256 | lm loss: 1.942954E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.138 | TFLOPs: 17.55 | 31: iteration 164650/ 173500 | consumed samples: 42150400 | consumed tokens: 86324019200 | elapsed time per iteration (s): 0.82 | learning rate: 2.118E-05 | global batch size: 256 | lm loss: 1.913341E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.062 | TFLOPs: 18.88 | 31: iteration 164660/ 173500 | consumed samples: 42152960 | consumed tokens: 86329262080 | elapsed time per iteration (s): 0.82 | learning rate: 2.117E-05 | global batch size: 256 | lm loss: 1.917870E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.838 | TFLOPs: 18.93 | 31: iteration 164670/ 173500 | consumed samples: 42155520 | consumed tokens: 86334504960 | elapsed time per iteration (s): 0.76 | learning rate: 2.117E-05 | global batch size: 256 | lm loss: 1.900210E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.844 | TFLOPs: 20.44 | 31: iteration 164680/ 173500 | consumed samples: 42158080 | consumed tokens: 86339747840 | elapsed time per iteration (s): 0.81 | learning rate: 2.117E-05 | global batch size: 256 | lm loss: 1.934083E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 314.361 | TFLOPs: 19.02 | 31: iteration 164690/ 173500 | consumed samples: 42160640 | consumed tokens: 86344990720 | elapsed time per iteration (s): 0.79 | learning rate: 2.117E-05 | global batch size: 256 | lm loss: 1.912716E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.763 | TFLOPs: 19.59 | 31: iteration 164700/ 173500 | consumed samples: 42163200 | consumed tokens: 86350233600 | elapsed time per iteration (s): 0.81 | learning rate: 2.116E-05 | global batch size: 256 | lm loss: 1.903577E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.452 | TFLOPs: 19.02 | 31: iteration 164710/ 173500 | consumed samples: 42165760 | consumed tokens: 86355476480 | elapsed time per iteration (s): 0.78 | learning rate: 2.116E-05 | global batch size: 256 | lm loss: 1.899625E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.714 | TFLOPs: 19.77 | 31: iteration 164720/ 173500 | consumed samples: 42168320 | consumed tokens: 86360719360 | elapsed time per iteration (s): 0.79 | learning rate: 2.116E-05 | global batch size: 256 | lm loss: 1.936832E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.731 | TFLOPs: 19.71 | 31: iteration 164730/ 173500 | consumed samples: 42170880 | consumed tokens: 86365962240 | elapsed time per iteration (s): 0.79 | learning rate: 2.116E-05 | global batch size: 256 | lm loss: 1.920574E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.488 | TFLOPs: 19.69 | 31: iteration 164740/ 173500 | consumed samples: 42173440 | consumed tokens: 86371205120 | elapsed time per iteration (s): 0.82 | learning rate: 
2.115E-05 | global batch size: 256 | lm loss: 1.917350E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.917 | TFLOPs: 18.99 | 31: iteration 164750/ 173500 | consumed samples: 42176000 | consumed tokens: 86376448000 | elapsed time per iteration (s): 0.78 | learning rate: 2.115E-05 | global batch size: 256 | lm loss: 1.897044E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.450 | TFLOPs: 19.81 | 31: iteration 164760/ 173500 | consumed samples: 42178560 | consumed tokens: 86381690880 | elapsed time per iteration (s): 0.81 | learning rate: 2.115E-05 | global batch size: 256 | lm loss: 1.923294E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.684 | TFLOPs: 19.22 | 31: iteration 164770/ 173500 | consumed samples: 42181120 | consumed tokens: 86386933760 | elapsed time per iteration (s): 0.91 | learning rate: 2.114E-05 | global batch size: 256 | lm loss: 1.926415E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.615 | TFLOPs: 17.10 | 31: iteration 164780/ 173500 | consumed samples: 42183680 | consumed tokens: 86392176640 | elapsed time per iteration (s): 0.80 | learning rate: 2.114E-05 | global batch size: 256 | lm loss: 1.930961E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.998 | TFLOPs: 19.42 | 31: iteration 164790/ 173500 | consumed samples: 42186240 | consumed tokens: 86397419520 | elapsed time per iteration (s): 0.81 | learning rate: 2.114E-05 | global batch size: 256 | lm loss: 1.918001E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.369 | TFLOPs: 19.14 | 31: iteration 164800/ 173500 | 
consumed samples: 42188800 | consumed tokens: 86402662400 | elapsed time per iteration (s): 0.83 | learning rate: 2.114E-05 | global batch size: 256 | lm loss: 1.933714E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.943 | TFLOPs: 18.69 | 31: iteration 164810/ 173500 | consumed samples: 42191360 | consumed tokens: 86407905280 | elapsed time per iteration (s): 0.82 | learning rate: 2.113E-05 | global batch size: 256 | lm loss: 1.899943E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.381 | TFLOPs: 18.96 | 31: iteration 164820/ 173500 | consumed samples: 42193920 | consumed tokens: 86413148160 | elapsed time per iteration (s): 0.83 | learning rate: 2.113E-05 | global batch size: 256 | lm loss: 1.928267E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.718 | TFLOPs: 18.56 | 31: iteration 164830/ 173500 | consumed samples: 42196480 | consumed tokens: 86418391040 | elapsed time per iteration (s): 0.82 | learning rate: 2.113E-05 | global batch size: 256 | lm loss: 1.907641E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.286 | TFLOPs: 18.83 | 31: iteration 164840/ 173500 | consumed samples: 42199040 | consumed tokens: 86423633920 | elapsed time per iteration (s): 0.80 | learning rate: 2.113E-05 | global batch size: 256 | lm loss: 1.922379E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.431 | TFLOPs: 19.39 | 31: iteration 164850/ 173500 | consumed samples: 42201600 | consumed tokens: 86428876800 | elapsed time per iteration (s): 0.83 | learning rate: 2.112E-05 | global batch size: 256 | lm loss: 1.916369E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 307.417 | TFLOPs: 18.60 | 31: iteration 164860/ 173500 | consumed samples: 42204160 | consumed tokens: 86434119680 | elapsed time per iteration (s): 0.83 | learning rate: 2.112E-05 | global batch size: 256 | lm loss: 1.880532E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.356 | TFLOPs: 18.72 | 31: iteration 164870/ 173500 | consumed samples: 42206720 | consumed tokens: 86439362560 | elapsed time per iteration (s): 0.84 | learning rate: 2.112E-05 | global batch size: 256 | lm loss: 1.890348E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.207 | TFLOPs: 18.46 | 31: iteration 164880/ 173500 | consumed samples: 42209280 | consumed tokens: 86444605440 | elapsed time per iteration (s): 0.84 | learning rate: 2.112E-05 | global batch size: 256 | lm loss: 1.910822E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.205 | TFLOPs: 18.46 | 31: iteration 164890/ 173500 | consumed samples: 42211840 | consumed tokens: 86449848320 | elapsed time per iteration (s): 0.82 | learning rate: 2.111E-05 | global batch size: 256 | lm loss: 1.882046E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.838 | TFLOPs: 18.80 | 31: iteration 164900/ 173500 | consumed samples: 42214400 | consumed tokens: 86455091200 | elapsed time per iteration (s): 0.82 | learning rate: 2.111E-05 | global batch size: 256 | lm loss: 1.910427E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.354 | TFLOPs: 18.84 | 31: iteration 164910/ 173500 | consumed samples: 42216960 | consumed tokens: 86460334080 | elapsed time per iteration (s): 0.80 | learning rate: 
2.111E-05 | global batch size: 256 | lm loss: 1.904125E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.630 | TFLOPs: 19.28 | 31: iteration 164920/ 173500 | consumed samples: 42219520 | consumed tokens: 86465576960 | elapsed time per iteration (s): 0.83 | learning rate: 2.111E-05 | global batch size: 256 | lm loss: 1.924875E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.768 | TFLOPs: 18.74 | 31: iteration 164930/ 173500 | consumed samples: 42222080 | consumed tokens: 86470819840 | elapsed time per iteration (s): 0.81 | learning rate: 2.110E-05 | global batch size: 256 | lm loss: 1.922740E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.127 | TFLOPs: 19.00 | 31: iteration 164940/ 173500 | consumed samples: 42224640 | consumed tokens: 86476062720 | elapsed time per iteration (s): 0.80 | learning rate: 2.110E-05 | global batch size: 256 | lm loss: 1.928417E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.361 | TFLOPs: 19.32 | 31: iteration 164950/ 173500 | consumed samples: 42227200 | consumed tokens: 86481305600 | elapsed time per iteration (s): 0.87 | learning rate: 2.110E-05 | global batch size: 256 | lm loss: 1.910352E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.298 | TFLOPs: 17.86 | 31: iteration 164960/ 173500 | consumed samples: 42229760 | consumed tokens: 86486548480 | elapsed time per iteration (s): 0.80 | learning rate: 2.110E-05 | global batch size: 256 | lm loss: 1.915355E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.403 | TFLOPs: 19.44 | 31: iteration 164970/ 173500 | 
consumed samples: 42232320 | consumed tokens: 86491791360 | elapsed time per iteration (s): 0.82 | learning rate: 2.109E-05 | global batch size: 256 | lm loss: 1.916650E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.963 | TFLOPs: 18.93 |
31: iteration 164980/ 173500 | consumed samples: 42234880 | consumed tokens: 86497034240 | elapsed time per iteration (s): 0.81 | learning rate: 2.109E-05 | global batch size: 256 | lm loss: 1.942065E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.206 | TFLOPs: 19.07 |
31: iteration 164990/ 173500 | consumed samples: 42237440 | consumed tokens: 86502277120 | elapsed time per iteration (s): 0.83 | learning rate: 2.109E-05 | global batch size: 256 | lm loss: 1.899158E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.352 | TFLOPs: 18.72 |
31: iteration 165000/ 173500 | consumed samples: 42240000 | consumed tokens: 86507520000 | elapsed time per iteration (s): 0.81 | learning rate: 2.109E-05 | global batch size: 256 | lm loss: 1.874508E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.246 | TFLOPs: 19.01 |
31: --------------------------------------------------------------------------------------------
31: valid loss at iteration 165000 | lm loss value: 1.854709E+00 | lm loss PPL: 6.389841E+00 |
31: --------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 165000 to checkpoints_1b1long
0: [2022-11-27 07:14:44,489] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step165000 is begin to save!
0: [2022-11-27 07:14:44,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_01-model_00-model_states.pt... 0: [2022-11-27 07:14:44,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_01-model_00-model_states.pt. 0: [2022-11-27 07:14:44,722] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_03-model_00-model_states.pt... 0: [2022-11-27 07:14:44,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_03-model_00-model_states.pt. 0: [2022-11-27 07:14:44,802] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_04-model_00-model_states.pt... 0: [2022-11-27 07:14:44,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_04-model_00-model_states.pt. 0: [2022-11-27 07:14:44,881] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_05-model_00-model_states.pt... 0: [2022-11-27 07:14:44,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_05-model_00-model_states.pt. 0: [2022-11-27 07:14:44,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_06-model_00-model_states.pt... 0: [2022-11-27 07:14:45,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_06-model_00-model_states.pt. 0: [2022-11-27 07:14:45,039] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_07-model_00-model_states.pt... 0: [2022-11-27 07:14:45,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_07-model_00-model_states.pt. 
0: [2022-11-27 07:14:45,118] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_08-model_00-model_states.pt... 0: [2022-11-27 07:14:45,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_08-model_00-model_states.pt. 0: [2022-11-27 07:14:45,192] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_09-model_00-model_states.pt... 0: [2022-11-27 07:14:45,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_09-model_00-model_states.pt. 0: [2022-11-27 07:14:45,268] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_10-model_00-model_states.pt... 0: [2022-11-27 07:14:45,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_10-model_00-model_states.pt. 0: [2022-11-27 07:14:45,345] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_11-model_00-model_states.pt... 0: [2022-11-27 07:14:45,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_11-model_00-model_states.pt. 0: [2022-11-27 07:14:45,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_12-model_00-model_states.pt... 0: [2022-11-27 07:14:45,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_12-model_00-model_states.pt. 0: [2022-11-27 07:14:45,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_13-model_00-model_states.pt... 0: [2022-11-27 07:14:45,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_13-model_00-model_states.pt. 
0: [2022-11-27 07:14:45,577] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_14-model_00-model_states.pt... 0: [2022-11-27 07:14:45,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_14-model_00-model_states.pt. 0: [2022-11-27 07:14:45,653] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_15-model_00-model_states.pt... 0: [2022-11-27 07:14:45,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_15-model_00-model_states.pt. 0: [2022-11-27 07:14:45,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_16-model_00-model_states.pt... 0: [2022-11-27 07:14:45,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_16-model_00-model_states.pt. 0: [2022-11-27 07:14:45,801] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_17-model_00-model_states.pt... 0: [2022-11-27 07:14:45,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_17-model_00-model_states.pt. 0: [2022-11-27 07:14:45,879] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_18-model_00-model_states.pt... 0: [2022-11-27 07:14:45,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_18-model_00-model_states.pt. 0: [2022-11-27 07:14:45,954] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_19-model_00-model_states.pt... 0: [2022-11-27 07:14:46,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_19-model_00-model_states.pt. 
0: [2022-11-27 07:14:46,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_20-model_00-model_states.pt... 0: [2022-11-27 07:14:46,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_20-model_00-model_states.pt. 0: [2022-11-27 07:14:46,104] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_21-model_00-model_states.pt... 0: [2022-11-27 07:14:46,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_21-model_00-model_states.pt. 0: [2022-11-27 07:14:46,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_22-model_00-model_states.pt... 0: [2022-11-27 07:14:46,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_22-model_00-model_states.pt. 0: [2022-11-27 07:14:46,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_23-model_00-model_states.pt... 0: [2022-11-27 07:14:46,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_23-model_00-model_states.pt. 0: [2022-11-27 07:14:46,331] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_24-model_00-model_states.pt... 0: [2022-11-27 07:14:46,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_24-model_00-model_states.pt. 0: [2022-11-27 07:14:46,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_25-model_00-model_states.pt... 0: [2022-11-27 07:14:46,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_25-model_00-model_states.pt. 
0: [2022-11-27 07:14:46,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_26-model_00-model_states.pt... 0: [2022-11-27 07:14:46,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_26-model_00-model_states.pt. 0: [2022-11-27 07:14:46,556] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_27-model_00-model_states.pt... 0: [2022-11-27 07:14:46,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_27-model_00-model_states.pt. 0: [2022-11-27 07:14:46,632] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_28-model_00-model_states.pt... 0: [2022-11-27 07:14:46,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_28-model_00-model_states.pt. 0: [2022-11-27 07:14:46,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/layer_30-model_00-model_states.pt... 0: [2022-11-27 07:14:46,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/layer_30-model_00-model_states.pt. 0: [2022-11-27 07:14:46,709] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step165000/mp_rank_00_model_states.pt 0: [2022-11-27 07:14:46,709] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/mp_rank_00_model_states.pt... 0: [2022-11-27 07:14:46,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/mp_rank_00_model_states.pt. 0: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
6: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
4: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 
1: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
13: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 
11: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 
22: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 
18: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 
0: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
10: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 
2: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
20: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
28: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 
29: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 
18: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 
27: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
9: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
20: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 
31: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 
26: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
1: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
28: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 
0: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 
14: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
30: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:14:46,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:14:46,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:14:46,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 07:14:46,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 9: [2022-11-27 07:14:46,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:14:46,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 07:14:46,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 25: [2022-11-27 07:14:46,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:14:46,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 1: [2022-11-27 07:14:46,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-27 07:14:46,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:14:46,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 1: [2022-11-27 07:14:46,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 30: [2022-11-27 07:14:46,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:14:46,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 07:14:46,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 1: [2022-11-27 07:14:46,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 30: [2022-11-27 07:14:46,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 07:14:46,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 10: [2022-11-27 07:14:46,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:14:46,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 19: [2022-11-27 07:14:46,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
10: [2022-11-27 07:14:46,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 19: [2022-11-27 07:14:46,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 2: [2022-11-27 07:14:46,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:14:46,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 2: [2022-11-27 07:14:46,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 07:14:46,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 10: [2022-11-27 07:14:46,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:14:46,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 21: [2022-11-27 07:14:46,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:14:46,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:14:46,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:14:46,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 
14: [2022-11-27 07:14:46,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:14:46,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 07:14:46,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 26: [2022-11-27 07:14:46,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 14: [2022-11-27 07:14:46,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 21: [2022-11-27 07:14:46,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 21: [2022-11-27 07:14:46,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 26: [2022-11-27 07:14:46,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:14:46,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 26: [2022-11-27 07:14:46,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 26: [2022-11-27 07:14:46,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 07:14:46,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 
19: [2022-11-27 07:14:46,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:14:46,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 24: [2022-11-27 07:14:46,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:14:46,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 11: [2022-11-27 07:14:46,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:14:46,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 11: [2022-11-27 07:14:46,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 07:14:46,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 24: [2022-11-27 07:14:46,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 8: [2022-11-27 07:14:46,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:14:46,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 07:14:46,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 
6: [2022-11-27 07:14:46,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:14:46,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 20: [2022-11-27 07:14:46,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:14:46,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 20: [2022-11-27 07:14:46,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 07:14:46,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 7: [2022-11-27 07:14:46,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:14:46,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 07:14:46,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 15: [2022-11-27 07:14:46,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:14:46,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:14:46,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-27 07:14:46,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 07:14:46,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 07:14:46,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 3: [2022-11-27 07:14:46,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 15: [2022-11-27 07:14:46,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 07:14:46,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 12: [2022-11-27 07:14:46,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:14:46,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 07:14:46,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 4: [2022-11-27 07:14:46,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:14:46,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 07:14:46,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 
14: [2022-11-27 07:14:46,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:14:46,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 31: [2022-11-27 07:14:46,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:14:46,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 8: [2022-11-27 07:14:46,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:14:46,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 31: [2022-11-27 07:14:46,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 07:14:46,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 8: [2022-11-27 07:14:46,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 9: [2022-11-27 07:14:46,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:14:46,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 07:14:46,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 
18: [2022-11-27 07:14:46,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:14:46,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 07:14:46,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 27: [2022-11-27 07:14:46,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:14:46,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 0: [2022-11-27 07:14:46,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:14:46,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:14:46,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 27: [2022-11-27 07:14:46,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 0: [2022-11-27 07:14:46,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 17: [2022-11-27 07:14:46,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 
18: [2022-11-27 07:14:46,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 0: [2022-11-27 07:14:46,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 17: [2022-11-27 07:14:46,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 07:14:46,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 17: [2022-11-27 07:14:46,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:14:46,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 07:14:46,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 20: [2022-11-27 07:14:46,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:14:46,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:14:46,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 21: [2022-11-27 07:14:46,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 20: [2022-11-27 07:14:46,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 
21: [2022-11-27 07:14:46,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 11: [2022-11-27 07:14:46,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:14:46,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 07:14:46,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 19: [2022-11-27 07:14:46,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:14:46,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 07:14:46,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 6: [2022-11-27 07:14:46,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:14:46,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 07:14:46,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 6: [2022-11-27 07:14:46,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:14:46,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
6: [2022-11-27 07:14:46,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 4: [2022-11-27 07:14:46,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 2: [2022-11-27 07:14:46,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:14:46,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 4: [2022-11-27 07:14:46,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 2: [2022-11-27 07:14:46,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 07:14:46,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 1: [2022-11-27 07:14:46,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:14:46,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 07:14:46,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 17: [2022-11-27 07:14:46,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:14:46,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
17: [2022-11-27 07:14:46,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 12: [2022-11-27 07:14:46,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 07:14:46,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:14:46,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 12: [2022-11-27 07:14:46,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 12: [2022-11-27 07:14:46,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 07:14:46,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 25: [2022-11-27 07:14:46,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:14:46,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 07:14:46,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 18: [2022-11-27 07:14:46,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 
18: [2022-11-27 07:14:46,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 25: [2022-11-27 07:14:46,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:14:46,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 23: [2022-11-27 07:14:46,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:14:46,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:14:46,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 07:14:46,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 23: [2022-11-27 07:14:46,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:14:46,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 07:14:46,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
31: [2022-11-27 07:14:46,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 23: [2022-11-27 07:14:46,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 07:14:46,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 23: [2022-11-27 07:14:46,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 31: [2022-11-27 07:14:46,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 23: [2022-11-27 07:14:46,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 23: [2022-11-27 07:14:46,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 7: [2022-11-27 07:14:46,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:14:46,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 07:14:46,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 31: [2022-11-27 07:14:46,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:14:46,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-27 07:14:46,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 31: [2022-11-27 07:14:46,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 19: [2022-11-27 07:14:46,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 31: [2022-11-27 07:14:46,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 4: [2022-11-27 07:14:46,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:14:46,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 07:14:46,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 11: [2022-11-27 07:14:46,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:14:46,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 07:14:46,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 15: [2022-11-27 07:14:46,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-27 07:14:46,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 07:14:46,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 14: [2022-11-27 07:14:46,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:14:46,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 07:14:46,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 21: [2022-11-27 07:14:46,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:14:46,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 07:14:46,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 5: [2022-11-27 07:14:46,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:14:46,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-27 07:14:46,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 07:14:46,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 10: [2022-11-27 07:14:46,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:14:46,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 5: [2022-11-27 07:14:46,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 10: [2022-11-27 07:14:46,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 07:14:46,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 7: [2022-11-27 07:14:46,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:14:46,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 24: [2022-11-27 07:14:46,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:14:46,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 6: [2022-11-27 07:14:46,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
24: [2022-11-27 07:14:46,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 6: [2022-11-27 07:14:46,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 07:14:46,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 24: [2022-11-27 07:14:46,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 1: [2022-11-27 07:14:46,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:14:46,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 24: [2022-11-27 07:14:46,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:14:46,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:14:46,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 30: [2022-11-27 07:14:46,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:14:46,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
0: [2022-11-27 07:14:46,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 24: [2022-11-27 07:14:46,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 0: [2022-11-27 07:14:46,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 13: [2022-11-27 07:14:46,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:14:46,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 30: [2022-11-27 07:14:46,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 26: [2022-11-27 07:14:46,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 13: [2022-11-27 07:14:46,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 07:14:46,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 30: [2022-11-27 07:14:46,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 26: [2022-11-27 07:14:46,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 23: [2022-11-27 07:14:46,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-27 07:14:46,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 07:14:46,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 27: [2022-11-27 07:14:46,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:14:46,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 07:14:46,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 12: [2022-11-27 07:14:46,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:14:46,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 07:14:46,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 8: [2022-11-27 07:14:46,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:14:46,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 07:14:46,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 27: [2022-11-27 07:14:46,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
3: [2022-11-27 07:14:46,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:14:46,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 20: [2022-11-27 07:14:46,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:14:46,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 13: [2022-11-27 07:14:46,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:14:46,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 27: [2022-11-27 07:14:46,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 13: [2022-11-27 07:14:46,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:14:46,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 13: [2022-11-27 07:14:46,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 17: [2022-11-27 07:14:46,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
13: [2022-11-27 07:14:46,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 17: [2022-11-27 07:14:46,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 13: [2022-11-27 07:14:46,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 20: [2022-11-27 07:14:46,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 17: [2022-11-27 07:14:46,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 13: [2022-11-27 07:14:46,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 8: [2022-11-27 07:14:46,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:14:46,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 07:14:46,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 5: [2022-11-27 07:14:46,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:14:46,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 07:14:46,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 
14: [2022-11-27 07:14:46,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:14:46,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 07:14:46,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 18: [2022-11-27 07:14:46,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:14:46,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 3: [2022-11-27 07:14:46,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:14:46,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 3: [2022-11-27 07:14:46,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 07:14:46,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 5: [2022-11-27 07:14:46,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:14:46,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 07:14:46,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 
31: [2022-11-27 07:14:46,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:14:46,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 07:14:46,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 11: [2022-11-27 07:14:46,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:14:46,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 07:14:46,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 30: [2022-11-27 07:14:46,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:14:46,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:14:46,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 07:14:46,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 30: [2022-11-27 07:14:46,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 07:14:46,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 
7: [2022-11-27 07:14:46,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:14:46,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 07:14:46,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 20: [2022-11-27 07:14:46,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:14:46,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 07:14:46,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 9: [2022-11-27 07:14:46,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:14:46,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 07:14:46,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 4: [2022-11-27 07:14:46,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:14:46,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 07:14:46,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 
26: [2022-11-27 07:14:46,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:14:46,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 07:14:46,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 0: [2022-11-27 07:14:46,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:14:46,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 07:14:46,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 27: [2022-11-27 07:14:46,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:14:46,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:14:46,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 27: [2022-11-27 07:14:46,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 10: [2022-11-27 07:14:46,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 27: [2022-11-27 07:14:46,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 
15: [2022-11-27 07:14:46,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:14:46,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 07:14:46,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 5: [2022-11-27 07:14:46,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:14:46,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 07:14:46,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 16: [2022-11-27 07:14:46,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:14:46,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:14:46,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:14:46,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 07:14:46,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 
25: [2022-11-27 07:14:46,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:14:46,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 07:14:46,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 25: [2022-11-27 07:14:46,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 16: [2022-11-27 07:14:46,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 16: [2022-11-27 07:14:46,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 25: [2022-11-27 07:14:46,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 6: [2022-11-27 07:14:46,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:14:46,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 13: [2022-11-27 07:14:46,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:14:46,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 
13: [2022-11-27 07:14:46,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 07:14:46,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 9: [2022-11-27 07:14:46,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:14:46,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 07:14:46,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 19: [2022-11-27 07:14:46,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:14:46,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 07:14:46,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 28: [2022-11-27 07:14:46,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-27 07:14:46,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:14:46,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-27 07:14:46,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 07:14:46,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 28: [2022-11-27 07:14:46,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-27 07:14:46,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 07:14:46,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 07:14:46,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 07:14:46,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 28: [2022-11-27 07:14:46,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 28: [2022-11-27 07:14:46,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 1: [2022-11-27 07:14:46,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:14:46,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 0: [2022-11-27 07:14:46,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
1: [2022-11-27 07:14:46,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 2: [2022-11-27 07:14:46,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:14:46,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 07:14:46,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 12: [2022-11-27 07:14:46,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:14:46,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 07:14:46,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 30: [2022-11-27 07:14:46,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:14:46,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 07:14:46,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 0: [2022-11-27 07:14:46,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 07:14:46,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 
10: [2022-11-27 07:14:46,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:14:46,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 07:14:46,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 25: [2022-11-27 07:14:46,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:14:46,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 07:14:46,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 27: [2022-11-27 07:14:46,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:14:46,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 07:14:46,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 8: [2022-11-27 07:14:46,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:14:46,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 07:14:46,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 
29: [2022-11-27 07:14:46,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:14:46,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:14:46,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:14:46,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:14:46,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 07:14:46,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 07:14:46,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 07:14:46,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 07:14:46,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 29: [2022-11-27 07:14:46,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 29: [2022-11-27 07:14:46,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 29: [2022-11-27 07:14:46,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 
21: [2022-11-27 07:14:46,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:14:46,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 07:14:46,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 17: [2022-11-27 07:14:46,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:14:46,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 07:14:46,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 22: [2022-11-27 07:14:46,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:14:46,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:14:46,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:14:46,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 
22: [2022-11-27 07:14:46,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 07:14:46,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 07:14:46,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 07:14:46,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 07:14:46,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 22: [2022-11-27 07:14:46,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 22: [2022-11-27 07:14:46,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 22: [2022-11-27 07:14:46,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 9: [2022-11-27 07:14:46,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:14:46,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 07:14:46,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 26: [2022-11-27 07:14:46,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
26: [2022-11-27 07:14:46,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 07:14:46,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 29: [2022-11-27 07:14:46,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:14:46,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 07:14:46,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 3: [2022-11-27 07:14:46,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:14:46,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 07:14:46,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 0: [2022-11-27 07:14:46,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:14:46,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 07:14:46,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 30: [2022-11-27 07:14:46,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-27 07:14:46,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 07:14:46,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 11: [2022-11-27 07:14:46,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:14:46,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 07:14:46,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 31: [2022-11-27 07:14:46,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:14:46,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 07:14:46,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 15: [2022-11-27 07:14:46,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:14:46,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 07:14:46,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 20: [2022-11-27 07:14:46,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
20: [2022-11-27 07:14:46,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 07:14:46,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 18: [2022-11-27 07:14:46,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:14:46,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 07:14:46,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 14: [2022-11-27 07:14:46,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:14:46,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 07:14:46,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 4: [2022-11-27 07:14:46,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:14:46,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 07:14:46,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 23: [2022-11-27 07:14:46,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-27 07:14:46,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 07:14:46,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 16: [2022-11-27 07:14:46,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:14:46,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 07:14:46,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 24: [2022-11-27 07:14:46,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:14:46,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-27 07:14:46,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 7: [2022-11-27 07:14:46,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:14:46,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 07:14:46,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 2: [2022-11-27 07:14:46,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-27 07:14:46,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 07:14:46,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 13: [2022-11-27 07:14:46,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:14:46,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 07:14:46,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 5: [2022-11-27 07:14:46,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:14:46,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 07:14:46,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 1: [2022-11-27 07:14:46,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:14:46,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 07:14:46,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 28: [2022-11-27 07:14:46,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-27 07:14:46,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 07:14:46,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 25: [2022-11-27 07:14:46,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:14:46,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 07:14:46,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 22: [2022-11-27 07:14:46,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:14:46,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 07:14:46,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 19: [2022-11-27 07:14:46,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:14:46,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 07:14:46,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 6: [2022-11-27 07:14:46,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-27 07:14:46,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 07:14:46,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 10: [2022-11-27 07:14:46,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:14:46,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:14:46,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 07:14:46,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 27: [2022-11-27 07:14:46,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 07:14:46,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 17: [2022-11-27 07:14:46,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:14:46,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 07:14:46,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 26: [2022-11-27 07:14:46,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
21: [2022-11-27 07:14:46,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:14:46,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 07:14:46,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 21: [2022-11-27 07:14:46,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 07:14:46,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 12: [2022-11-27 07:14:46,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:14:46,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 07:14:46,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 0: [2022-11-27 07:14:46,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:14:46,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 07:14:46,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 3: [2022-11-27 07:14:46,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-27 07:14:46,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 07:14:46,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 15: [2022-11-27 07:14:46,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:14:46,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 07:14:46,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 31: [2022-11-27 07:14:46,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:14:46,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 07:14:46,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 24: [2022-11-27 07:14:46,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:14:46,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 07:14:46,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 20: [2022-11-27 07:14:46,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-27 07:14:46,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 07:14:46,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 9: [2022-11-27 07:14:46,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:14:46,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:14:46,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 11: [2022-11-27 07:14:46,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 9: [2022-11-27 07:14:46,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 11: [2022-11-27 07:14:46,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 14: [2022-11-27 07:14:46,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:14:46,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 07:14:46,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 23: [2022-11-27 07:14:46,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
23: [2022-11-27 07:14:46,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 07:14:46,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 29: [2022-11-27 07:14:46,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:14:46,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 07:14:46,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 18: [2022-11-27 07:14:46,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:14:46,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 07:14:46,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 13: [2022-11-27 07:14:46,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:14:46,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 07:14:46,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 16: [2022-11-27 07:14:46,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-27 07:14:46,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 07:14:46,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 2: [2022-11-27 07:14:46,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:14:46,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 07:14:46,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 4: [2022-11-27 07:14:46,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 28: [2022-11-27 07:14:46,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:14:46,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 07:14:46,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 28: [2022-11-27 07:14:46,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 07:14:46,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 7: [2022-11-27 07:14:46,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-27 07:14:46,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 07:14:46,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 1: [2022-11-27 07:14:46,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:14:46,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 07:14:46,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 22: [2022-11-27 07:14:46,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:14:46,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 07:14:46,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 30: [2022-11-27 07:14:46,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:14:46,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 07:14:46,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 5: [2022-11-27 07:14:46,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-27 07:14:46,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 07:14:46,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 19: [2022-11-27 07:14:46,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:14:46,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 07:14:46,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 8: [2022-11-27 07:14:46,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:14:46,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 07:14:46,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 27: [2022-11-27 07:14:46,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:14:46,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 07:14:46,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 6: [2022-11-27 07:14:46,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-27 07:14:46,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 07:14:46,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 21: [2022-11-27 07:14:46,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:14:46,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 07:14:46,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 10: [2022-11-27 07:14:46,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:14:46,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 07:14:46,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 25: [2022-11-27 07:14:46,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:14:46,982] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-27 07:14:46,982] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 12: [2022-11-27 07:14:46,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-27 07:14:46,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 07:14:46,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 8: [2022-11-27 07:14:46,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:14:46,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 07:14:46,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 26: [2022-11-27 07:14:46,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:14:46,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-27 07:14:46,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 0: [2022-11-27 07:14:46,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:14:46,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 07:14:46,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 17: [2022-11-27 07:14:46,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 
17: [2022-11-27 07:14:46,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 07:14:46,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 9: [2022-11-27 07:14:46,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:14:46,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 07:14:46,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 18: [2022-11-27 07:14:46,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:14:46,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 07:14:46,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 29: [2022-11-27 07:14:46,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:14:46,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 07:14:46,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 20: [2022-11-27 07:14:46,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
20: [2022-11-27 07:14:46,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 07:14:46,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 31: [2022-11-27 07:14:46,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:14:46,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:14:46,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 31: [2022-11-27 07:14:46,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 30: [2022-11-27 07:14:46,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 31: [2022-11-27 07:14:46,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 3: [2022-11-27 07:14:46,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:14:46,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 07:14:46,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 16: [2022-11-27 07:14:46,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 
16: [2022-11-27 07:14:46,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 07:14:46,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 23: [2022-11-27 07:14:46,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:14:46,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 07:14:46,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 4: [2022-11-27 07:14:46,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:14:46,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 07:14:46,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 24: [2022-11-27 07:14:46,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:14:46,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 07:14:46,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 0: [2022-11-27 07:14:46,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-27 07:14:46,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 07:14:46,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 7: [2022-11-27 07:14:46,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:14:46,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 11: [2022-11-27 07:14:46,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:14:46,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 11: [2022-11-27 07:14:46,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 07:14:46,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 1: [2022-11-27 07:14:46,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:14:46,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 14: [2022-11-27 07:14:46,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:14:46,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 
14: [2022-11-27 07:14:46,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 07:14:46,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 2: [2022-11-27 07:14:46,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:14:46,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 26: [2022-11-27 07:14:46,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:14:46,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 26: [2022-11-27 07:14:46,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 07:14:46,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 15: [2022-11-27 07:14:46,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:14:46,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 07:14:46,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 13: [2022-11-27 07:14:47,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-27 07:14:47,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 07:14:47,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 21: [2022-11-27 07:14:47,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:14:47,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:14:47,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 07:14:47,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 17: [2022-11-27 07:14:47,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:14:47,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 07:14:47,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 5: [2022-11-27 07:14:47,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:14:47,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 07:14:47,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 
18: [2022-11-27 07:14:47,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:14:47,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 07:14:47,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 8: [2022-11-27 07:14:47,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:14:47,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 07:14:47,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 8: [2022-11-27 07:14:47,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 07:14:47,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 22: [2022-11-27 07:14:47,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:14:47,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 16: [2022-11-27 07:14:47,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:14:47,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 
16: [2022-11-27 07:14:47,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 24: [2022-11-27 07:14:47,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:14:47,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 07:14:47,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 7: [2022-11-27 07:14:47,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:14:47,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 19: [2022-11-27 07:14:47,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:14:47,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 29: [2022-11-27 07:14:47,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:14:47,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 7: [2022-11-27 07:14:47,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 19: [2022-11-27 07:14:47,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 
29: [2022-11-27 07:14:47,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 07:14:47,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 3: [2022-11-27 07:14:47,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:14:47,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 07:14:47,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 20: [2022-11-27 07:14:47,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:14:47,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 07:14:47,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 23: [2022-11-27 07:14:47,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:14:47,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 07:14:47,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 10: [2022-11-27 07:14:47,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
31: [2022-11-27 07:14:47,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:14:47,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:14:47,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 4: [2022-11-27 07:14:47,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:14:47,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 31: [2022-11-27 07:14:47,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 6: [2022-11-27 07:14:47,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 31: [2022-11-27 07:14:47,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 6: [2022-11-27 07:14:47,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 4: [2022-11-27 07:14:47,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 07:14:47,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 15: [2022-11-27 07:14:47,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
30: [2022-11-27 07:14:47,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:14:47,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 30: [2022-11-27 07:14:47,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 15: [2022-11-27 07:14:47,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 30: [2022-11-27 07:14:47,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 25: [2022-11-27 07:14:47,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:14:47,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 07:14:47,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 13: [2022-11-27 07:14:47,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:14:47,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 07:14:47,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 28: [2022-11-27 07:14:47,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 
12: [2022-11-27 07:14:47,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:14:47,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 07:14:47,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 14: [2022-11-27 07:14:47,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:14:47,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 07:14:47,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 16: [2022-11-27 07:14:47,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:14:47,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 07:14:47,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 2: [2022-11-27 07:14:47,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:14:47,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 07:14:47,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 
11: [2022-11-27 07:14:47,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:14:47,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 07:14:47,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 22: [2022-11-27 07:14:47,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:14:47,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 07:14:47,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 9: [2022-11-27 07:14:47,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:14:47,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 07:14:47,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 28: [2022-11-27 07:14:47,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 07:14:47,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 28: [2022-11-27 07:14:47,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
28: [2022-11-27 07:14:47,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 07:14:47,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 28: [2022-11-27 07:14:47,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-27 07:14:47,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step165000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 07:14:47,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step165000 is ready now! 0: successfully saved checkpoint at iteration 165000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2564.58 31: iteration 165010/ 173500 | consumed samples: 42242560 | consumed tokens: 86512762880 | elapsed time per iteration (s): 1.12 | learning rate: 2.108E-05 | global batch size: 256 | lm loss: 1.896802E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.163 | TFLOPs: 13.86 | 31: iteration 165020/ 173500 | consumed samples: 42245120 | consumed tokens: 86518005760 | elapsed time per iteration (s): 0.93 | learning rate: 2.108E-05 | global batch size: 256 | lm loss: 1.901961E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 276.509 | TFLOPs: 16.73 | 31: iteration 165030/ 173500 | consumed samples: 42247680 | consumed tokens: 86523248640 | elapsed time per iteration (s): 0.76 | learning rate: 2.108E-05 | global batch size: 256 | lm loss: 1.910737E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.467 | TFLOPs: 20.42 | 31: iteration 
165040/ 173500 | consumed samples: 42250240 | consumed tokens: 86528491520 | elapsed time per iteration (s): 0.84 | learning rate: 2.108E-05 | global batch size: 256 | lm loss: 1.942700E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.220 | TFLOPs: 18.53 | 31: iteration 165050/ 173500 | consumed samples: 42252800 | consumed tokens: 86533734400 | elapsed time per iteration (s): 0.73 | learning rate: 2.107E-05 | global batch size: 256 | lm loss: 1.890506E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.989 | TFLOPs: 21.23 | 31: iteration 165060/ 173500 | consumed samples: 42255360 | consumed tokens: 86538977280 | elapsed time per iteration (s): 0.77 | learning rate: 2.107E-05 | global batch size: 256 | lm loss: 1.917609E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.127 | TFLOPs: 20.21 | 31: iteration 165070/ 173500 | consumed samples: 42257920 | consumed tokens: 86544220160 | elapsed time per iteration (s): 0.74 | learning rate: 2.107E-05 | global batch size: 256 | lm loss: 1.914507E+00 | grad norm: 0.212 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.693 | TFLOPs: 21.03 | 31: iteration 165080/ 173500 | consumed samples: 42260480 | consumed tokens: 86549463040 | elapsed time per iteration (s): 0.80 | learning rate: 2.107E-05 | global batch size: 256 | lm loss: 1.882921E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.268 | TFLOPs: 19.38 | 31: iteration 165090/ 173500 | consumed samples: 42263040 | consumed tokens: 86554705920 | elapsed time per iteration (s): 0.74 | learning rate: 2.106E-05 | global batch size: 256 | lm loss: 1.880760E+00 | grad norm: 0.189 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.761 | TFLOPs: 20.86 | 31: iteration 165100/ 173500 | consumed samples: 42265600 | consumed tokens: 86559948800 | elapsed time per iteration (s): 0.75 | learning rate: 2.106E-05 | global batch size: 256 | lm loss: 1.923342E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.756 | TFLOPs: 20.61 | 31: iteration 165110/ 173500 | consumed samples: 42268160 | consumed tokens: 86565191680 | elapsed time per iteration (s): 0.81 | learning rate: 2.106E-05 | global batch size: 256 | lm loss: 1.916329E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.714 | TFLOPs: 19.16 | 31: iteration 165120/ 173500 | consumed samples: 42270720 | consumed tokens: 86570434560 | elapsed time per iteration (s): 0.75 | learning rate: 2.106E-05 | global batch size: 256 | lm loss: 1.915469E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.260 | TFLOPs: 20.71 | 31: iteration 165130/ 173500 | consumed samples: 42273280 | consumed tokens: 86575677440 | elapsed time per iteration (s): 0.78 | learning rate: 2.105E-05 | global batch size: 256 | lm loss: 1.890687E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.558 | TFLOPs: 19.88 | 31: iteration 165140/ 173500 | consumed samples: 42275840 | consumed tokens: 86580920320 | elapsed time per iteration (s): 0.77 | learning rate: 2.105E-05 | global batch size: 256 | lm loss: 1.917703E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.425 | TFLOPs: 20.11 | 31: iteration 165150/ 173500 | consumed samples: 42278400 | consumed tokens: 86586163200 | elapsed time per iteration (s): 0.77 | learning 
rate: 2.105E-05 | global batch size: 256 | lm loss: 1.898442E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.434 | TFLOPs: 20.11 | 31: iteration 165160/ 173500 | consumed samples: 42280960 | consumed tokens: 86591406080 | elapsed time per iteration (s): 0.75 | learning rate: 2.105E-05 | global batch size: 256 | lm loss: 1.898903E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.346 | TFLOPs: 20.77 | 31: iteration 165170/ 173500 | consumed samples: 42283520 | consumed tokens: 86596648960 | elapsed time per iteration (s): 0.78 | learning rate: 2.104E-05 | global batch size: 256 | lm loss: 1.888653E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.869 | TFLOPs: 19.96 | 31: iteration 165180/ 173500 | consumed samples: 42286080 | consumed tokens: 86601891840 | elapsed time per iteration (s): 0.74 | learning rate: 2.104E-05 | global batch size: 256 | lm loss: 1.900672E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.505 | TFLOPs: 20.84 | 31: iteration 165190/ 173500 | consumed samples: 42288640 | consumed tokens: 86607134720 | elapsed time per iteration (s): 0.74 | learning rate: 2.104E-05 | global batch size: 256 | lm loss: 1.907489E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.349 | TFLOPs: 20.83 | 31: iteration 165200/ 173500 | consumed samples: 42291200 | consumed tokens: 86612377600 | elapsed time per iteration (s): 0.79 | learning rate: 2.104E-05 | global batch size: 256 | lm loss: 1.897856E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.625 | TFLOPs: 19.52 | 31: iteration 165210/ 
173500 | consumed samples: 42293760 | consumed tokens: 86617620480 | elapsed time per iteration (s): 0.90 | learning rate: 2.103E-05 | global batch size: 256 | lm loss: 1.888453E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.286 | TFLOPs: 17.26 | 31: iteration 165220/ 173500 | consumed samples: 42296320 | consumed tokens: 86622863360 | elapsed time per iteration (s): 0.76 | learning rate: 2.103E-05 | global batch size: 256 | lm loss: 1.915176E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.621 | TFLOPs: 20.30 | 31: iteration 165230/ 173500 | consumed samples: 42298880 | consumed tokens: 86628106240 | elapsed time per iteration (s): 0.75 | learning rate: 2.103E-05 | global batch size: 256 | lm loss: 1.894466E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.227 | TFLOPs: 20.64 | 31: iteration 165240/ 173500 | consumed samples: 42301440 | consumed tokens: 86633349120 | elapsed time per iteration (s): 0.77 | learning rate: 2.103E-05 | global batch size: 256 | lm loss: 1.904815E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.507 | TFLOPs: 20.12 | 31: iteration 165250/ 173500 | consumed samples: 42304000 | consumed tokens: 86638592000 | elapsed time per iteration (s): 0.78 | learning rate: 2.102E-05 | global batch size: 256 | lm loss: 1.899813E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.491 | TFLOPs: 19.75 | 31: iteration 165260/ 173500 | consumed samples: 42306560 | consumed tokens: 86643834880 | elapsed time per iteration (s): 0.79 | learning rate: 2.102E-05 | global batch size: 256 | lm loss: 1.885203E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 324.685 | TFLOPs: 19.64 | 31: iteration 165270/ 173500 | consumed samples: 42309120 | consumed tokens: 86649077760 | elapsed time per iteration (s): 0.82 | learning rate: 2.102E-05 | global batch size: 256 | lm loss: 1.924279E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.493 | TFLOPs: 18.84 | 31: iteration 165280/ 173500 | consumed samples: 42311680 | consumed tokens: 86654320640 | elapsed time per iteration (s): 0.81 | learning rate: 2.102E-05 | global batch size: 256 | lm loss: 1.875722E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.678 | TFLOPs: 19.04 | 31: iteration 165290/ 173500 | consumed samples: 42314240 | consumed tokens: 86659563520 | elapsed time per iteration (s): 0.83 | learning rate: 2.101E-05 | global batch size: 256 | lm loss: 1.906374E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.480 | TFLOPs: 18.66 | 31: iteration 165300/ 173500 | consumed samples: 42316800 | consumed tokens: 86664806400 | elapsed time per iteration (s): 0.83 | learning rate: 2.101E-05 | global batch size: 256 | lm loss: 1.896016E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.067 | TFLOPs: 18.58 | 31: iteration 165310/ 173500 | consumed samples: 42319360 | consumed tokens: 86670049280 | elapsed time per iteration (s): 0.86 | learning rate: 2.101E-05 | global batch size: 256 | lm loss: 1.925368E+00 | grad norm: 0.224 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.632 | TFLOPs: 18.01 | 31: iteration 165320/ 173500 | consumed samples: 42321920 | consumed tokens: 86675292160 | elapsed time per iteration (s): 0.79 | learning rate: 
2.101E-05 | global batch size: 256 | lm loss: 1.898498E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.752 | TFLOPs: 19.53 | 31: iteration 165330/ 173500 | consumed samples: 42324480 | consumed tokens: 86680535040 | elapsed time per iteration (s): 1.09 | learning rate: 2.100E-05 | global batch size: 256 | lm loss: 1.929447E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.938 | TFLOPs: 14.27 | 31: iteration 165340/ 173500 | consumed samples: 42327040 | consumed tokens: 86685777920 | elapsed time per iteration (s): 0.77 | learning rate: 2.100E-05 | global batch size: 256 | lm loss: 1.897680E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.386 | TFLOPs: 20.17 | 31: iteration 165350/ 173500 | consumed samples: 42329600 | consumed tokens: 86691020800 | elapsed time per iteration (s): 0.77 | learning rate: 2.100E-05 | global batch size: 256 | lm loss: 1.937367E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.547 | TFLOPs: 20.18 | 31: iteration 165360/ 173500 | consumed samples: 42332160 | consumed tokens: 86696263680 | elapsed time per iteration (s): 0.73 | learning rate: 2.100E-05 | global batch size: 256 | lm loss: 1.906316E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.299 | TFLOPs: 21.19 | 31: iteration 165370/ 173500 | consumed samples: 42334720 | consumed tokens: 86701506560 | elapsed time per iteration (s): 0.81 | learning rate: 2.099E-05 | global batch size: 256 | lm loss: 1.925821E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.415 | TFLOPs: 19.02 | 31: iteration 165380/ 173500 | 
consumed samples: 42337280 | consumed tokens: 86706749440 | elapsed time per iteration (s): 0.80 | learning rate: 2.099E-05 | global batch size: 256 | lm loss: 1.903339E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.257 | TFLOPs: 19.25 | 31: iteration 165390/ 173500 | consumed samples: 42339840 | consumed tokens: 86711992320 | elapsed time per iteration (s): 0.77 | learning rate: 2.099E-05 | global batch size: 256 | lm loss: 1.897638E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.590 | TFLOPs: 20.12 | 31: iteration 165400/ 173500 | consumed samples: 42342400 | consumed tokens: 86717235200 | elapsed time per iteration (s): 0.75 | learning rate: 2.099E-05 | global batch size: 256 | lm loss: 1.935147E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.997 | TFLOPs: 20.75 | 31: iteration 165410/ 173500 | consumed samples: 42344960 | consumed tokens: 86722478080 | elapsed time per iteration (s): 0.82 | learning rate: 2.098E-05 | global batch size: 256 | lm loss: 1.908226E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.352 | TFLOPs: 18.90 | 31: iteration 165420/ 173500 | consumed samples: 42347520 | consumed tokens: 86727720960 | elapsed time per iteration (s): 0.74 | learning rate: 2.098E-05 | global batch size: 256 | lm loss: 1.922606E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.324 | TFLOPs: 20.89 | 31: iteration 165430/ 173500 | consumed samples: 42350080 | consumed tokens: 86732963840 | elapsed time per iteration (s): 0.77 | learning rate: 2.098E-05 | global batch size: 256 | lm loss: 1.935303E+00 | grad norm: 0.225 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 332.873 | TFLOPs: 20.14 | 31: iteration 165440/ 173500 | consumed samples: 42352640 | consumed tokens: 86738206720 | elapsed time per iteration (s): 0.82 | learning rate: 2.098E-05 | global batch size: 256 | lm loss: 1.926922E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.268 | TFLOPs: 18.89 | 31: iteration 165450/ 173500 | consumed samples: 42355200 | consumed tokens: 86743449600 | elapsed time per iteration (s): 0.79 | learning rate: 2.097E-05 | global batch size: 256 | lm loss: 1.905888E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.697 | TFLOPs: 19.70 | 31: iteration 165460/ 173500 | consumed samples: 42357760 | consumed tokens: 86748692480 | elapsed time per iteration (s): 0.84 | learning rate: 2.097E-05 | global batch size: 256 | lm loss: 1.918656E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.228 | TFLOPs: 18.47 | 31: iteration 165470/ 173500 | consumed samples: 42360320 | consumed tokens: 86753935360 | elapsed time per iteration (s): 0.83 | learning rate: 2.097E-05 | global batch size: 256 | lm loss: 1.906413E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.203 | TFLOPs: 18.59 | 31: iteration 165480/ 173500 | consumed samples: 42362880 | consumed tokens: 86759178240 | elapsed time per iteration (s): 0.83 | learning rate: 2.097E-05 | global batch size: 256 | lm loss: 1.924292E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.152 | TFLOPs: 18.58 | 31: iteration 165490/ 173500 | consumed samples: 42365440 | consumed tokens: 86764421120 | elapsed time per iteration (s): 0.81 | learning rate: 
2.096E-05 | global batch size: 256 | lm loss: 1.900113E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.363 | TFLOPs: 19.20 | 31: iteration 165500/ 173500 | consumed samples: 42368000 | consumed tokens: 86769664000 | elapsed time per iteration (s): 0.80 | learning rate: 2.096E-05 | global batch size: 256 | lm loss: 1.911055E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.774 | TFLOPs: 19.41 | 31: iteration 165510/ 173500 | consumed samples: 42370560 | consumed tokens: 86774906880 | elapsed time per iteration (s): 0.84 | learning rate: 2.096E-05 | global batch size: 256 | lm loss: 1.891534E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.814 | TFLOPs: 18.38 | 31: iteration 165520/ 173500 | consumed samples: 42373120 | consumed tokens: 86780149760 | elapsed time per iteration (s): 0.80 | learning rate: 2.096E-05 | global batch size: 256 | lm loss: 1.903275E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.807 | TFLOPs: 19.41 | 31: iteration 165530/ 173500 | consumed samples: 42375680 | consumed tokens: 86785392640 | elapsed time per iteration (s): 0.79 | learning rate: 2.095E-05 | global batch size: 256 | lm loss: 1.948649E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.569 | TFLOPs: 19.58 | 31: iteration 165540/ 173500 | consumed samples: 42378240 | consumed tokens: 86790635520 | elapsed time per iteration (s): 0.82 | learning rate: 2.095E-05 | global batch size: 256 | lm loss: 1.898996E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.009 | TFLOPs: 18.88 | 31: iteration 165550/ 173500 | 
consumed samples: 42380800 | consumed tokens: 86795878400 | elapsed time per iteration (s): 0.80 | learning rate: 2.095E-05 | global batch size: 256 | lm loss: 1.906629E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.231 | TFLOPs: 19.37 | 31: iteration 165560/ 173500 | consumed samples: 42383360 | consumed tokens: 86801121280 | elapsed time per iteration (s): 0.83 | learning rate: 2.095E-05 | global batch size: 256 | lm loss: 1.930653E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.683 | TFLOPs: 18.74 | 31: iteration 165570/ 173500 | consumed samples: 42385920 | consumed tokens: 86806364160 | elapsed time per iteration (s): 0.83 | learning rate: 2.095E-05 | global batch size: 256 | lm loss: 1.920798E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.127 | TFLOPs: 18.70 | 31: iteration 165580/ 173500 | consumed samples: 42388480 | consumed tokens: 86811607040 | elapsed time per iteration (s): 0.83 | learning rate: 2.094E-05 | global batch size: 256 | lm loss: 1.898351E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.944 | TFLOPs: 18.69 | 31: iteration 165590/ 173500 | consumed samples: 42391040 | consumed tokens: 86816849920 | elapsed time per iteration (s): 0.80 | learning rate: 2.094E-05 | global batch size: 256 | lm loss: 1.890376E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.741 | TFLOPs: 19.28 | 31: iteration 165600/ 173500 | consumed samples: 42393600 | consumed tokens: 86822092800 | elapsed time per iteration (s): 0.80 | learning rate: 2.094E-05 | global batch size: 256 | lm loss: 1.918587E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 319.221 | TFLOPs: 19.31 | 31: iteration 165610/ 173500 | consumed samples: 42396160 | consumed tokens: 86827335680 | elapsed time per iteration (s): 0.81 | learning rate: 2.094E-05 | global batch size: 256 | lm loss: 1.909184E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.975 | TFLOPs: 19.24 | 31: iteration 165620/ 173500 | consumed samples: 42398720 | consumed tokens: 86832578560 | elapsed time per iteration (s): 0.87 | learning rate: 2.093E-05 | global batch size: 256 | lm loss: 1.934675E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.709 | TFLOPs: 17.83 | 31: iteration 165630/ 173500 | consumed samples: 42401280 | consumed tokens: 86837821440 | elapsed time per iteration (s): 0.82 | learning rate: 2.093E-05 | global batch size: 256 | lm loss: 1.924467E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.149 | TFLOPs: 18.82 | 31: iteration 165640/ 173500 | consumed samples: 42403840 | consumed tokens: 86843064320 | elapsed time per iteration (s): 0.85 | learning rate: 2.093E-05 | global batch size: 256 | lm loss: 1.902752E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.177 | TFLOPs: 18.22 | 31: iteration 165650/ 173500 | consumed samples: 42406400 | consumed tokens: 86848307200 | elapsed time per iteration (s): 0.82 | learning rate: 2.093E-05 | global batch size: 256 | lm loss: 1.899324E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.332 | TFLOPs: 18.96 | 31: iteration 165660/ 173500 | consumed samples: 42408960 | consumed tokens: 86853550080 | elapsed time per iteration (s): 0.79 | learning rate: 
2.092E-05 | global batch size: 256 | lm loss: 1.913025E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.593 | TFLOPs: 19.70 | 31: iteration 165670/ 173500 | consumed samples: 42411520 | consumed tokens: 86858792960 | elapsed time per iteration (s): 0.79 | learning rate: 2.092E-05 | global batch size: 256 | lm loss: 1.929089E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.408 | TFLOPs: 19.69 | 31: iteration 165680/ 173500 | consumed samples: 42414080 | consumed tokens: 86864035840 | elapsed time per iteration (s): 0.80 | learning rate: 2.092E-05 | global batch size: 256 | lm loss: 1.922303E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.974 | TFLOPs: 19.30 | 31: iteration 165690/ 173500 | consumed samples: 42416640 | consumed tokens: 86869278720 | elapsed time per iteration (s): 0.79 | learning rate: 2.092E-05 | global batch size: 256 | lm loss: 1.942544E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.233 | TFLOPs: 19.62 | 31: iteration 165700/ 173500 | consumed samples: 42419200 | consumed tokens: 86874521600 | elapsed time per iteration (s): 0.75 | learning rate: 2.091E-05 | global batch size: 256 | lm loss: 1.904760E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.087 | TFLOPs: 20.51 | 31: iteration 165710/ 173500 | consumed samples: 42421760 | consumed tokens: 86879764480 | elapsed time per iteration (s): 0.74 | learning rate: 2.091E-05 | global batch size: 256 | lm loss: 1.932821E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.292 | TFLOPs: 20.89 | 31: iteration 165720/ 173500 | 
consumed samples: 42424320 | consumed tokens: 86885007360 | elapsed time per iteration (s): 0.76 | learning rate: 2.091E-05 | global batch size: 256 | lm loss: 1.898852E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.309 | TFLOPs: 20.47 | 31: iteration 165730/ 173500 | consumed samples: 42426880 | consumed tokens: 86890250240 | elapsed time per iteration (s): 0.71 | learning rate: 2.091E-05 | global batch size: 256 | lm loss: 1.921515E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 360.947 | TFLOPs: 21.84 | 31: iteration 165740/ 173500 | consumed samples: 42429440 | consumed tokens: 86895493120 | elapsed time per iteration (s): 0.75 | learning rate: 2.091E-05 | global batch size: 256 | lm loss: 1.908685E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.053 | TFLOPs: 20.57 | 31: iteration 165750/ 173500 | consumed samples: 42432000 | consumed tokens: 86900736000 | elapsed time per iteration (s): 0.76 | learning rate: 2.090E-05 | global batch size: 256 | lm loss: 1.907587E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.533 | TFLOPs: 20.36 | 31: iteration 165760/ 173500 | consumed samples: 42434560 | consumed tokens: 86905978880 | elapsed time per iteration (s): 0.83 | learning rate: 2.090E-05 | global batch size: 256 | lm loss: 1.916609E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.285 | TFLOPs: 18.65 | 31: iteration 165770/ 173500 | consumed samples: 42437120 | consumed tokens: 86911221760 | elapsed time per iteration (s): 2.57 | learning rate: 2.090E-05 | global batch size: 256 | lm loss: 1.906480E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 99.752 | TFLOPs: 6.03 | 31: iteration 165780/ 173500 | consumed samples: 42439680 | consumed tokens: 86916464640 | elapsed time per iteration (s): 0.92 | learning rate: 2.090E-05 | global batch size: 256 | lm loss: 1.946006E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 279.664 | TFLOPs: 16.92 | 31: iteration 165790/ 173500 | consumed samples: 42442240 | consumed tokens: 86921707520 | elapsed time per iteration (s): 0.93 | learning rate: 2.089E-05 | global batch size: 256 | lm loss: 1.899635E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 273.970 | TFLOPs: 16.57 | 31: iteration 165800/ 173500 | consumed samples: 42444800 | consumed tokens: 86926950400 | elapsed time per iteration (s): 0.98 | learning rate: 2.089E-05 | global batch size: 256 | lm loss: 1.892888E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 260.097 | TFLOPs: 15.74 | 31: iteration 165810/ 173500 | consumed samples: 42447360 | consumed tokens: 86932193280 | elapsed time per iteration (s): 0.92 | learning rate: 2.089E-05 | global batch size: 256 | lm loss: 1.904183E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 279.298 | TFLOPs: 16.90 | 31: iteration 165820/ 173500 | consumed samples: 42449920 | consumed tokens: 86937436160 | elapsed time per iteration (s): 0.98 | learning rate: 2.089E-05 | global batch size: 256 | lm loss: 1.909177E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 260.244 | TFLOPs: 15.74 | 31: iteration 165830/ 173500 | consumed samples: 42452480 | consumed tokens: 86942679040 | elapsed time per iteration (s): 0.97 | learning rate: 
2.088E-05 | global batch size: 256 | lm loss: 1.900088E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 264.143 | TFLOPs: 15.98 | 31: iteration 165840/ 173500 | consumed samples: 42455040 | consumed tokens: 86947921920 | elapsed time per iteration (s): 0.94 | learning rate: 2.088E-05 | global batch size: 256 | lm loss: 1.890018E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 271.368 | TFLOPs: 16.42 | 31: iteration 165850/ 173500 | consumed samples: 42457600 | consumed tokens: 86953164800 | elapsed time per iteration (s): 0.90 | learning rate: 2.088E-05 | global batch size: 256 | lm loss: 1.936082E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 283.536 | TFLOPs: 17.15 | 31: iteration 165860/ 173500 | consumed samples: 42460160 | consumed tokens: 86958407680 | elapsed time per iteration (s): 0.84 | learning rate: 2.088E-05 | global batch size: 256 | lm loss: 1.903800E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.302 | TFLOPs: 18.47 | 31: iteration 165870/ 173500 | consumed samples: 42462720 | consumed tokens: 86963650560 | elapsed time per iteration (s): 0.87 | learning rate: 2.088E-05 | global batch size: 256 | lm loss: 1.883503E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.460 | TFLOPs: 17.87 | 31: iteration 165880/ 173500 | consumed samples: 42465280 | consumed tokens: 86968893440 | elapsed time per iteration (s): 0.87 | learning rate: 2.087E-05 | global batch size: 256 | lm loss: 1.919355E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.337 | TFLOPs: 17.75 | 31: iteration 165890/ 173500 | 
consumed samples: 42467840 | consumed tokens: 86974136320 | elapsed time per iteration (s): 0.87 | learning rate: 2.087E-05 | global batch size: 256 | lm loss: 1.899404E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.951 | TFLOPs: 17.90 | 31: iteration 165900/ 173500 | consumed samples: 42470400 | consumed tokens: 86979379200 | elapsed time per iteration (s): 0.85 | learning rate: 2.087E-05 | global batch size: 256 | lm loss: 1.902444E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.715 | TFLOPs: 18.25 | 31: iteration 165910/ 173500 | consumed samples: 42472960 | consumed tokens: 86984622080 | elapsed time per iteration (s): 0.86 | learning rate: 2.087E-05 | global batch size: 256 | lm loss: 1.922656E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.327 | TFLOPs: 17.93 | 31: iteration 165920/ 173500 | consumed samples: 42475520 | consumed tokens: 86989864960 | elapsed time per iteration (s): 0.85 | learning rate: 2.086E-05 | global batch size: 256 | lm loss: 1.940715E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.620 | TFLOPs: 18.25 | 31: iteration 165930/ 173500 | consumed samples: 42478080 | consumed tokens: 86995107840 | elapsed time per iteration (s): 0.81 | learning rate: 2.086E-05 | global batch size: 256 | lm loss: 1.882225E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.441 | TFLOPs: 19.08 | 31: iteration 165940/ 173500 | consumed samples: 42480640 | consumed tokens: 87000350720 | elapsed time per iteration (s): 0.89 | learning rate: 2.086E-05 | global batch size: 256 | lm loss: 1.936238E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 286.053 | TFLOPs: 17.31 | 31: iteration 165950/ 173500 | consumed samples: 42483200 | consumed tokens: 87005593600 | elapsed time per iteration (s): 0.85 | learning rate: 2.086E-05 | global batch size: 256 | lm loss: 1.916073E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.434 | TFLOPs: 18.30 | 31: iteration 165960/ 173500 | consumed samples: 42485760 | consumed tokens: 87010836480 | elapsed time per iteration (s): 0.86 | learning rate: 2.085E-05 | global batch size: 256 | lm loss: 1.899737E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.813 | TFLOPs: 18.08 | 31: iteration 165970/ 173500 | consumed samples: 42488320 | consumed tokens: 87016079360 | elapsed time per iteration (s): 0.80 | learning rate: 2.085E-05 | global batch size: 256 | lm loss: 1.904411E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.098 | TFLOPs: 19.30 | 31: iteration 165980/ 173500 | consumed samples: 42490880 | consumed tokens: 87021322240 | elapsed time per iteration (s): 0.81 | learning rate: 2.085E-05 | global batch size: 256 | lm loss: 1.911819E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.793 | TFLOPs: 19.10 | 31: iteration 165990/ 173500 | consumed samples: 42493440 | consumed tokens: 87026565120 | elapsed time per iteration (s): 0.78 | learning rate: 2.085E-05 | global batch size: 256 | lm loss: 1.932058E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.999 | TFLOPs: 19.90 | 0: [2022-11-27 07:28:40,292] [INFO] [logging.py:68:log_dist] [Rank 0] step=166000, skipped=0, lr=[2.0845563261196566e-05, 2.0845563261196566e-05, 
2.0845563261196566e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 166000/ 173500 | consumed samples: 42496000 | consumed tokens: 87031808000 | elapsed time per iteration (s): 0.84 | learning rate: 2.085E-05 | global batch size: 256 | lm loss: 1.900582E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.087 | TFLOPs: 18.34 | 0: steps: 166000 loss: 1.9552 iter time (s): 0.852 samples/sec: 300.635 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 166000 | lm loss value: 1.862033E+00 | lm loss PPL: 6.436809E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 166000 to checkpoints_1b1long 0: [2022-11-27 07:28:40,666] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step166000 is begin to save! 0: [2022-11-27 07:28:40,678] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_01-model_00-model_states.pt... 0: [2022-11-27 07:28:40,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_01-model_00-model_states.pt. 0: [2022-11-27 07:28:40,905] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_03-model_00-model_states.pt... 0: [2022-11-27 07:28:40,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_03-model_00-model_states.pt. 0: [2022-11-27 07:28:40,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_04-model_00-model_states.pt... 0: [2022-11-27 07:28:41,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_04-model_00-model_states.pt. 
0: [2022-11-27 07:28:41,069] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_05-model_00-model_states.pt... 0: [2022-11-27 07:28:41,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_05-model_00-model_states.pt. 0: [2022-11-27 07:28:41,144] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_06-model_00-model_states.pt... 0: [2022-11-27 07:28:41,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_06-model_00-model_states.pt. 0: [2022-11-27 07:28:41,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_07-model_00-model_states.pt... 0: [2022-11-27 07:28:41,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_07-model_00-model_states.pt. 0: [2022-11-27 07:28:41,306] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_08-model_00-model_states.pt... 0: [2022-11-27 07:28:41,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_08-model_00-model_states.pt. 0: [2022-11-27 07:28:41,388] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_09-model_00-model_states.pt... 0: [2022-11-27 07:28:41,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_09-model_00-model_states.pt. 0: [2022-11-27 07:28:41,464] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_10-model_00-model_states.pt... 0: [2022-11-27 07:28:41,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_10-model_00-model_states.pt. 
0: [2022-11-27 07:28:41,540] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_11-model_00-model_states.pt... 0: [2022-11-27 07:28:41,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_11-model_00-model_states.pt. 0: [2022-11-27 07:28:41,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_12-model_00-model_states.pt... 0: [2022-11-27 07:28:41,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_12-model_00-model_states.pt. 0: [2022-11-27 07:28:41,696] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_13-model_00-model_states.pt... 0: [2022-11-27 07:28:41,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_13-model_00-model_states.pt. 0: [2022-11-27 07:28:41,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_14-model_00-model_states.pt... 0: [2022-11-27 07:28:41,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_14-model_00-model_states.pt. 0: [2022-11-27 07:28:41,855] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_15-model_00-model_states.pt... 0: [2022-11-27 07:28:41,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_15-model_00-model_states.pt. 0: [2022-11-27 07:28:41,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_16-model_00-model_states.pt... 0: [2022-11-27 07:28:42,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_16-model_00-model_states.pt. 
0: [2022-11-27 07:28:42,009] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_17-model_00-model_states.pt... 0: [2022-11-27 07:28:42,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_17-model_00-model_states.pt. 0: [2022-11-27 07:28:42,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_18-model_00-model_states.pt... 0: [2022-11-27 07:28:42,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_18-model_00-model_states.pt. 0: [2022-11-27 07:28:42,163] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_19-model_00-model_states.pt... 0: [2022-11-27 07:28:42,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_19-model_00-model_states.pt. 0: [2022-11-27 07:28:42,240] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_20-model_00-model_states.pt... 0: [2022-11-27 07:28:42,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_20-model_00-model_states.pt. 0: [2022-11-27 07:28:42,320] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_21-model_00-model_states.pt... 0: [2022-11-27 07:28:42,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_21-model_00-model_states.pt. 0: [2022-11-27 07:28:42,406] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_22-model_00-model_states.pt... 0: [2022-11-27 07:28:42,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_22-model_00-model_states.pt. 
0: [2022-11-27 07:28:42,485] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_23-model_00-model_states.pt... 0: [2022-11-27 07:28:42,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_23-model_00-model_states.pt. 0: [2022-11-27 07:28:42,564] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_24-model_00-model_states.pt... 0: [2022-11-27 07:28:42,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_24-model_00-model_states.pt. 0: [2022-11-27 07:28:42,641] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_25-model_00-model_states.pt... 0: [2022-11-27 07:28:42,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_25-model_00-model_states.pt. 0: [2022-11-27 07:28:42,718] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_26-model_00-model_states.pt... 0: [2022-11-27 07:28:42,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_26-model_00-model_states.pt. 0: [2022-11-27 07:28:42,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_27-model_00-model_states.pt... 0: [2022-11-27 07:28:42,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_27-model_00-model_states.pt. 0: [2022-11-27 07:28:42,870] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_28-model_00-model_states.pt... 0: [2022-11-27 07:28:42,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_28-model_00-model_states.pt. 
0: [2022-11-27 07:28:42,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/layer_30-model_00-model_states.pt... 0: [2022-11-27 07:28:42,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/layer_30-model_00-model_states.pt. 0: [2022-11-27 07:28:42,949] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step166000/mp_rank_00_model_states.pt 0: [2022-11-27 07:28:42,949] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/mp_rank_00_model_states.pt... 0: [2022-11-27 07:28:42,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/mp_rank_00_model_states.pt. 0: [2022-11-27 07:28:43,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
7: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 
8: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:28:43,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:28:43,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:28:43,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:28:43,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
2: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 
25: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 
28: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 
17: [2022-11-27 07:28:43,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:28:43,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:28:43,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 
19: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:28:43,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:28:43,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
8: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
3: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 
25: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 
29: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:28:43,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:28:43,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:28:43,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 
18: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:28:43,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
6: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:28:43,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
10: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 
15: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 
24: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 
17: [2022-11-27 07:28:43,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:28:43,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 
27: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
10: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 
28: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 
26: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:28:43,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:28:43,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 
1: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 
5: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:28:43,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:28:43,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:28:43,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-27 07:28:43,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 0: [2022-11-27 07:28:43,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 7: [2022-11-27 07:28:43,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 0: [2022-11-27 07:28:43,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 14: [2022-11-27 07:28:43,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:28:43,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 07:28:43,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 3: [2022-11-27 07:28:43,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:28:43,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:28:43,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 15: [2022-11-27 07:28:43,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 3: [2022-11-27 07:28:43,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 
15: [2022-11-27 07:28:43,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 28: [2022-11-27 07:28:43,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:28:43,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:28:43,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:28:43,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 23: [2022-11-27 07:28:43,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 21: [2022-11-27 07:28:43,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:28:43,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 23: [2022-11-27 07:28:43,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 21: [2022-11-27 07:28:43,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 07:28:43,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 4: [2022-11-27 07:28:43,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
14: [2022-11-27 07:28:43,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:28:43,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 14: [2022-11-27 07:28:43,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 4: [2022-11-27 07:28:43,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 14: [2022-11-27 07:28:43,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 13: [2022-11-27 07:28:43,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:28:43,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 07:28:43,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 28: [2022-11-27 07:28:43,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 07:28:43,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 28: [2022-11-27 07:28:43,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 
28: [2022-11-27 07:28:43,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 07:28:43,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 22: [2022-11-27 07:28:43,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:28:43,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 07:28:43,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 8: [2022-11-27 07:28:43,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:28:43,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:28:43,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 07:28:43,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 15: [2022-11-27 07:28:43,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:28:43,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 8: [2022-11-27 07:28:43,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 
15: [2022-11-27 07:28:43,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 07:28:43,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 6: [2022-11-27 07:28:43,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:28:43,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 07:28:43,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 20: [2022-11-27 07:28:43,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:28:43,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 4: [2022-11-27 07:28:43,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:28:43,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 4: [2022-11-27 07:28:43,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 07:28:43,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 0: [2022-11-27 07:28:43,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
21: [2022-11-27 07:28:43,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:28:43,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 21: [2022-11-27 07:28:43,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 0: [2022-11-27 07:28:43,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 25: [2022-11-27 07:28:43,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:28:43,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 14: [2022-11-27 07:28:43,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:28:43,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
14: [2022-11-27 07:28:43,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 2: [2022-11-27 07:28:43,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 25: [2022-11-27 07:28:43,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 14: [2022-11-27 07:28:43,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 2: [2022-11-27 07:28:43,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 25: [2022-11-27 07:28:43,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 6: [2022-11-27 07:28:43,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:28:43,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 07:28:43,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 12: [2022-11-27 07:28:43,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:28:43,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 07:28:43,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 
12: [2022-11-27 07:28:43,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:28:43,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:28:43,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 15: [2022-11-27 07:28:43,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 12: [2022-11-27 07:28:43,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 15: [2022-11-27 07:28:43,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 23: [2022-11-27 07:28:43,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:28:43,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 07:28:43,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 10: [2022-11-27 07:28:43,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:28:43,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 07:28:43,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 
27: [2022-11-27 07:28:43,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:28:43,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 07:28:43,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 7: [2022-11-27 07:28:43,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:28:43,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 07:28:43,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 29: [2022-11-27 07:28:43,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:28:43,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:28:43,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
29: [2022-11-27 07:28:43,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 2: [2022-11-27 07:28:43,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 07:28:43,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 29: [2022-11-27 07:28:43,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 2: [2022-11-27 07:28:43,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 2: [2022-11-27 07:28:43,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 4: [2022-11-27 07:28:43,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:28:43,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 07:28:43,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 3: [2022-11-27 07:28:43,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:28:43,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 07:28:43,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 
1: [2022-11-27 07:28:43,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:28:43,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 07:28:43,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 17: [2022-11-27 07:28:43,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:28:43,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:28:43,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:28:43,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 07:28:43,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 1: [2022-11-27 07:28:43,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 17: [2022-11-27 07:28:43,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 1: [2022-11-27 07:28:43,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 17: [2022-11-27 07:28:43,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 
17: [2022-11-27 07:28:43,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:28:43,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 07:28:43,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 17: [2022-11-27 07:28:43,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:28:43,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:28:43,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:28:43,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 6: [2022-11-27 07:28:43,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 17: [2022-11-27 07:28:43,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 6: [2022-11-27 07:28:43,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 1: [2022-11-27 07:28:43,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 07:28:43,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 
29: [2022-11-27 07:28:43,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:28:43,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:28:43,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 07:28:43,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 13: [2022-11-27 07:28:43,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 07:28:43,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 28: [2022-11-27 07:28:43,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:28:43,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:28:43,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 
14: [2022-11-27 07:28:43,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 21: [2022-11-27 07:28:43,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 14: [2022-11-27 07:28:43,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 28: [2022-11-27 07:28:43,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 07:28:43,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-27 07:28:43,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 21: [2022-11-27 07:28:43,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 28: [2022-11-27 07:28:43,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 07:28:43,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 13: [2022-11-27 07:28:43,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:28:43,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 07:28:43,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 
19: [2022-11-27 07:28:43,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:28:43,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 07:28:43,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 12: [2022-11-27 07:28:43,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:28:43,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:28:43,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 07:28:43,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 27: [2022-11-27 07:28:43,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 07:28:43,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 17: [2022-11-27 07:28:43,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:28:43,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-27 07:28:43,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 07:28:43,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 17: [2022-11-27 07:28:43,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 07:28:43,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 23: [2022-11-27 07:28:43,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:28:43,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 07:28:43,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 25: [2022-11-27 07:28:43,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:28:43,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 07:28:43,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 15: [2022-11-27 07:28:43,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-27 07:28:43,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 07:28:43,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 3: [2022-11-27 07:28:43,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:28:43,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 07:28:43,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 23: [2022-11-27 07:28:43,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:28:43,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 07:28:43,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 1: [2022-11-27 07:28:43,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:28:43,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 8: [2022-11-27 07:28:43,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-27 07:28:43,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 1: [2022-11-27 07:28:43,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 8: [2022-11-27 07:28:43,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 18: [2022-11-27 07:28:43,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:28:43,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 07:28:43,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 18: [2022-11-27 07:28:43,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:28:43,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-27 07:28:43,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 4: [2022-11-27 07:28:43,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:28:43,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 07:28:43,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 
22: [2022-11-27 07:28:43,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:28:43,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 07:28:43,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 22: [2022-11-27 07:28:43,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:28:43,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 07:28:43,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 22: [2022-11-27 07:28:43,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:28:43,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 07:28:43,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 3: [2022-11-27 07:28:43,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:28:43,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 07:28:43,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 
19: [2022-11-27 07:28:43,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:28:43,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:28:43,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:28:43,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 07:28:43,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 8: [2022-11-27 07:28:43,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 19: [2022-11-27 07:28:43,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 8: [2022-11-27 07:28:43,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 19: [2022-11-27 07:28:43,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 10: [2022-11-27 07:28:43,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:28:43,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 07:28:43,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 
13: [2022-11-27 07:28:43,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:28:43,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 07:28:43,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 18: [2022-11-27 07:28:43,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:28:43,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 10: [2022-11-27 07:28:43,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:28:43,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 10: [2022-11-27 07:28:43,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 17: [2022-11-27 07:28:43,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:28:43,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:28:43,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 
8: [2022-11-27 07:28:43,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 18: [2022-11-27 07:28:43,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:28:43,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 18: [2022-11-27 07:28:43,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 17: [2022-11-27 07:28:43,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 18: [2022-11-27 07:28:43,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 17: [2022-11-27 07:28:43,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 20: [2022-11-27 07:28:43,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:28:43,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:28:43,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 07:28:43,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 07:28:43,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 
20: [2022-11-27 07:28:43,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 20: [2022-11-27 07:28:43,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:28:43,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 07:28:43,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 21: [2022-11-27 07:28:43,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:28:43,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 07:28:43,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 25: [2022-11-27 07:28:43,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:28:43,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 2: [2022-11-27 07:28:43,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:28:43,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 07:28:43,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 
25: [2022-11-27 07:28:43,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 25: [2022-11-27 07:28:43,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:28:43,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 07:28:43,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 0: [2022-11-27 07:28:43,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:28:43,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:28:43,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 07:28:43,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 6: [2022-11-27 07:28:43,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:28:43,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 07:28:43,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 31: [2022-11-27 07:28:43,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
31: [2022-11-27 07:28:43,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 07:28:43,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 27: [2022-11-27 07:28:43,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:28:43,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:28:43,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 07:28:43,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 07:28:43,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 27: [2022-11-27 07:28:43,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 21: [2022-11-27 07:28:43,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:28:43,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 07:28:43,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 7: [2022-11-27 07:28:43,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-27 07:28:43,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 07:28:43,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 9: [2022-11-27 07:28:43,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:28:43,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:28:43,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:28:43,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:28:43,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 07:28:43,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 07:28:43,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 07:28:43,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 07:28:43,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 
9: [2022-11-27 07:28:43,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 9: [2022-11-27 07:28:43,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 9: [2022-11-27 07:28:43,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 25: [2022-11-27 07:28:43,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:28:43,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 07:28:43,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 19: [2022-11-27 07:28:43,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:28:43,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-27 07:28:43,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:28:43,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 0: [2022-11-27 07:28:43,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 07:28:43,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 
19: [2022-11-27 07:28:43,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 07:28:43,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 16: [2022-11-27 07:28:43,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:28:43,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:28:43,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:28:43,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:28:43,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 07:28:43,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 07:28:43,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 07:28:43,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 
16: [2022-11-27 07:28:43,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 07:28:43,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 16: [2022-11-27 07:28:43,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 16: [2022-11-27 07:28:43,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 30: [2022-11-27 07:28:43,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:28:43,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:28:43,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:28:43,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-27 07:28:43,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 07:28:43,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 07:28:43,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 07:28:43,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 07:28:43,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 30: [2022-11-27 07:28:43,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 30: [2022-11-27 07:28:43,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 30: [2022-11-27 07:28:43,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 29: [2022-11-27 07:28:43,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:28:43,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 07:28:43,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 29: [2022-11-27 07:28:43,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
29: [2022-11-27 07:28:43,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:28:43,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 07:28:43,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 07:28:43,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 29: [2022-11-27 07:28:43,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 5: [2022-11-27 07:28:43,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:28:43,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:28:43,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:28:43,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-27 07:28:43,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 07:28:43,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 07:28:43,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 07:28:43,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 07:28:43,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 5: [2022-11-27 07:28:43,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 5: [2022-11-27 07:28:43,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 5: [2022-11-27 07:28:43,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 12: [2022-11-27 07:28:43,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:28:43,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 07:28:43,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 1: [2022-11-27 07:28:43,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-27 07:28:43,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 07:28:43,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 16: [2022-11-27 07:28:43,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:28:43,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 07:28:43,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 24: [2022-11-27 07:28:43,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:28:43,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:28:43,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:28:43,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 
24: [2022-11-27 07:28:43,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 07:28:43,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-27 07:28:43,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 07:28:43,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 11: [2022-11-27 07:28:43,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:28:43,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:28:43,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:28:43,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:28:43,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 24: [2022-11-27 07:28:43,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 24: [2022-11-27 07:28:43,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 24: [2022-11-27 07:28:43,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 
11: [2022-11-27 07:28:43,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:28:43,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 07:28:43,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 07:28:43,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 07:28:43,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 07:28:43,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 11: [2022-11-27 07:28:43,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 11: [2022-11-27 07:28:43,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 11: [2022-11-27 07:28:43,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 07:28:43,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 11: [2022-11-27 07:28:43,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 28: [2022-11-27 07:28:43,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-27 07:28:43,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 07:28:43,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 15: [2022-11-27 07:28:43,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:28:43,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 07:28:43,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 26: [2022-11-27 07:28:43,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:28:43,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-27 07:28:43,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 30: [2022-11-27 07:28:43,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:28:43,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 07:28:43,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 14: [2022-11-27 07:28:43,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-27 07:28:43,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 07:28:43,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 18: [2022-11-27 07:28:43,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:28:43,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 07:28:43,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 22: [2022-11-27 07:28:43,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:28:43,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 07:28:43,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 13: [2022-11-27 07:28:43,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:28:43,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 07:28:43,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 0: [2022-11-27 07:28:43,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-27 07:28:43,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 07:28:43,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 27: [2022-11-27 07:28:43,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:28:43,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 07:28:43,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 24: [2022-11-27 07:28:43,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:28:43,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 07:28:43,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 3: [2022-11-27 07:28:44,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:28:44,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 07:28:44,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 6: [2022-11-27 07:28:44,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-27 07:28:44,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 07:28:44,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 10: [2022-11-27 07:28:44,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:28:44,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 07:28:44,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 21: [2022-11-27 07:28:44,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:28:44,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 07:28:44,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 8: [2022-11-27 07:28:44,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:28:44,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 07:28:44,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 9: [2022-11-27 07:28:44,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-27 07:28:44,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 07:28:44,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 7: [2022-11-27 07:28:44,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:28:44,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 07:28:44,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 11: [2022-11-27 07:28:44,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:28:44,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 07:28:44,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 19: [2022-11-27 07:28:44,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:28:44,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 07:28:44,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 25: [2022-11-27 07:28:44,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 
25: [2022-11-27 07:28:44,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 07:28:44,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 17: [2022-11-27 07:28:44,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:28:44,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 07:28:44,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 16: [2022-11-27 07:28:44,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:28:44,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 12: [2022-11-27 07:28:44,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:28:44,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 12: [2022-11-27 07:28:44,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 07:28:44,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 14: [2022-11-27 07:28:44,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-27 07:28:44,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 07:28:44,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 20: [2022-11-27 07:28:44,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:28:44,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 07:28:44,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 2: [2022-11-27 07:28:44,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:28:44,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 07:28:44,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 13: [2022-11-27 07:28:44,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:28:44,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 07:28:44,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 30: [2022-11-27 07:28:44,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-27 07:28:44,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 07:28:44,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 18: [2022-11-27 07:28:44,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:28:44,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 07:28:44,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 26: [2022-11-27 07:28:44,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:28:44,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 07:28:44,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 1: [2022-11-27 07:28:44,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:28:44,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 
22: [2022-11-27 07:28:44,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 1: [2022-11-27 07:28:44,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 22: [2022-11-27 07:28:44,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 1: [2022-11-27 07:28:44,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 24: [2022-11-27 07:28:44,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 28: [2022-11-27 07:28:44,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:28:44,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 07:28:44,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 28: [2022-11-27 07:28:44,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 0: [2022-11-27 07:28:44,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 28: [2022-11-27 07:28:44,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 29: [2022-11-27 07:28:44,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
0: [2022-11-27 07:28:44,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 29: [2022-11-27 07:28:44,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 0: [2022-11-27 07:28:44,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:28:44,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 29: [2022-11-27 07:28:44,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 0: [2022-11-27 07:28:44,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 30: [2022-11-27 07:28:44,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:28:44,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 30: [2022-11-27 07:28:44,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 07:28:44,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 4: [2022-11-27 07:28:44,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:28:44,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-27 07:28:44,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 07:28:44,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 07:28:44,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 4: [2022-11-27 07:28:44,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 9: [2022-11-27 07:28:44,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:28:44,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 07:28:44,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 2: [2022-11-27 07:28:44,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:28:44,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 07:28:44,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 14: [2022-11-27 07:28:44,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:28:44,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
14: [2022-11-27 07:28:44,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 8: [2022-11-27 07:28:44,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 14: [2022-11-27 07:28:44,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 8: [2022-11-27 07:28:44,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 15: [2022-11-27 07:28:44,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:28:44,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:28:44,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:28:44,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 07:28:44,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 11: [2022-11-27 07:28:44,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 15: [2022-11-27 07:28:44,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 15: [2022-11-27 07:28:44,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 
11: [2022-11-27 07:28:44,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 23: [2022-11-27 07:28:44,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:28:44,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 07:28:44,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 28: [2022-11-27 07:28:44,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:28:44,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:28:44,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 12: [2022-11-27 07:28:44,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:28:44,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 12: [2022-11-27 07:28:44,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 2: [2022-11-27 07:28:44,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-27 07:28:44,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 12: [2022-11-27 07:28:44,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 2: [2022-11-27 07:28:44,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 13: [2022-11-27 07:28:44,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:28:44,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 07:28:44,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 7: [2022-11-27 07:28:44,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:28:44,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 07:28:44,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 20: [2022-11-27 07:28:44,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:28:44,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 07:28:44,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 
3: [2022-11-27 07:28:44,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:28:44,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 07:28:44,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 25: [2022-11-27 07:28:44,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:28:44,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 07:28:44,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 22: [2022-11-27 07:28:44,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:28:44,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:28:44,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 11: [2022-11-27 07:28:44,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 22: [2022-11-27 07:28:44,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 11: [2022-11-27 07:28:44,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 
29: [2022-11-27 07:28:44,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:28:44,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:28:44,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 07:28:44,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 07:28:44,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 29: [2022-11-27 07:28:44,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 19: [2022-11-27 07:28:44,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:28:44,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 6: [2022-11-27 07:28:44,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:28:44,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 16: [2022-11-27 07:28:44,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 
21: [2022-11-27 07:28:44,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:28:44,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 21: [2022-11-27 07:28:44,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 6: [2022-11-27 07:28:44,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 16: [2022-11-27 07:28:44,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 07:28:44,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 21: [2022-11-27 07:28:44,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 21: [2022-11-27 07:28:44,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:28:44,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:28:44,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 12: [2022-11-27 07:28:44,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:28:44,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 
12: [2022-11-27 07:28:44,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 07:28:44,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 25: [2022-11-27 07:28:44,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 14: [2022-11-27 07:28:44,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:28:44,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 25: [2022-11-27 07:28:44,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 14: [2022-11-27 07:28:44,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 8: [2022-11-27 07:28:44,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:28:44,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 07:28:44,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 18: [2022-11-27 07:28:44,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-27 07:28:44,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 07:28:44,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 30: [2022-11-27 07:28:44,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:28:44,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:28:44,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 07:28:44,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 17: [2022-11-27 07:28:44,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 07:28:44,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 22: [2022-11-27 07:28:44,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:28:44,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 1: [2022-11-27 07:28:44,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
20: [2022-11-27 07:28:44,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:28:44,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 1: [2022-11-27 07:28:44,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 07:28:44,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 20: [2022-11-27 07:28:44,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 07:28:44,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 6: [2022-11-27 07:28:44,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:28:44,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 24: [2022-11-27 07:28:44,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:28:44,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 15: [2022-11-27 07:28:44,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
24: [2022-11-27 07:28:44,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-27 07:28:44,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 15: [2022-11-27 07:28:44,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 07:28:44,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 18: [2022-11-27 07:28:44,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:28:44,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 07:28:44,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 31: [2022-11-27 07:28:44,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:28:44,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 07:28:44,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 9: [2022-11-27 07:28:44,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-27 07:28:44,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 07:28:44,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 10: [2022-11-27 07:28:44,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 28: [2022-11-27 07:28:44,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 07:28:44,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 10: [2022-11-27 07:28:44,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 07:28:44,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 28: [2022-11-27 07:28:44,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-27 07:28:44,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 07:28:44,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 16: [2022-11-27 07:28:44,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:28:44,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
16: [2022-11-27 07:28:44,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 07:28:44,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 13: [2022-11-27 07:28:44,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 7: [2022-11-27 07:28:44,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:28:44,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 7: [2022-11-27 07:28:44,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 07:28:44,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 0: [2022-11-27 07:28:44,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:28:44,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 07:28:44,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 4: [2022-11-27 07:28:44,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-27 07:28:44,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 07:28:44,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 9: [2022-11-27 07:28:44,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:28:44,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:28:44,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 9: [2022-11-27 07:28:44,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 4: [2022-11-27 07:28:44,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 9: [2022-11-27 07:28:44,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 19: [2022-11-27 07:28:44,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:28:44,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 07:28:44,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 1: [2022-11-27 07:28:44,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-27 07:28:44,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 07:28:44,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 20: [2022-11-27 07:28:44,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:28:44,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 07:28:44,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 7: [2022-11-27 07:28:44,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:28:44,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 07:28:44,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 24: [2022-11-27 07:28:44,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:28:44,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 07:28:44,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 26: [2022-11-27 07:28:44,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
26: [2022-11-27 07:28:44,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 07:28:44,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 26: [2022-11-27 07:28:44,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:28:44,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 07:28:44,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 17: [2022-11-27 07:28:44,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:28:44,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 07:28:44,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 26: [2022-11-27 07:28:44,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:28:44,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 10: [2022-11-27 07:28:44,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:28:44,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 
10: [2022-11-27 07:28:44,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 07:28:44,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 3: [2022-11-27 07:28:44,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:28:44,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 07:28:44,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 26: [2022-11-27 07:28:44,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:28:44,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 07:28:44,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 5: [2022-11-27 07:28:44,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:28:44,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 23: [2022-11-27 07:28:44,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:28:44,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 
5: [2022-11-27 07:28:44,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:28:44,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 5: [2022-11-27 07:28:44,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 23: [2022-11-27 07:28:44,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 5: [2022-11-27 07:28:44,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 5: [2022-11-27 07:28:44,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:28:44,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 07:28:44,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 23: [2022-11-27 07:28:44,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:28:44,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
23: [2022-11-27 07:28:44,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 07:28:44,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 07:28:44,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 23: [2022-11-27 07:28:44,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 26: [2022-11-27 07:28:44,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:28:44,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 07:28:44,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 5: [2022-11-27 07:28:44,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:28:44,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 07:28:44,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 3: [2022-11-27 07:28:44,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-27 07:28:44,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 07:28:44,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 2: [2022-11-27 07:28:44,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:28:44,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 07:28:44,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 6: [2022-11-27 07:28:44,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:28:44,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 07:28:44,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 26: [2022-11-27 07:28:44,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:28:44,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 07:28:44,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 31: [2022-11-27 07:28:44,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
31: [2022-11-27 07:28:44,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 07:28:44,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 31: [2022-11-27 07:28:44,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:28:44,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 07:28:44,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 10: [2022-11-27 07:28:44,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:28:44,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 07:28:44,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 27: [2022-11-27 07:28:44,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:28:44,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 07:28:44,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 27: [2022-11-27 07:28:44,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
27: [2022-11-27 07:28:44,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:28:44,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 07:28:44,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 07:28:44,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 27: [2022-11-27 07:28:44,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 31: [2022-11-27 07:28:44,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:28:44,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:28:44,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-27 07:28:44,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 07:28:44,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 07:28:44,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step166000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 07:28:44,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 31: [2022-11-27 07:28:44,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 31: [2022-11-27 07:28:44,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step166000 is ready now! 0: successfully saved checkpoint at iteration 166000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 3920.31 31: iteration 166010/ 173500 | consumed samples: 42498560 | consumed tokens: 87037050880 | elapsed time per iteration (s): 1.22 | learning rate: 2.084E-05 | global batch size: 256 | lm loss: 1.893128E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 210.373 | TFLOPs: 12.73 | 31: iteration 166020/ 173500 | consumed samples: 42501120 | consumed tokens: 87042293760 | elapsed time per iteration (s): 0.96 | learning rate: 2.084E-05 | global batch size: 256 | lm loss: 1.894680E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 266.364 | TFLOPs: 16.11 | 31: iteration 166030/ 173500 | consumed samples: 42503680 | consumed tokens: 87047536640 | elapsed time per iteration (s): 0.75 | learning rate: 2.084E-05 | global batch size: 256 | lm loss: 1.890733E+00 | grad norm: 0.199 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.210 | TFLOPs: 20.64 | 31: iteration 166040/ 173500 | consumed samples: 42506240 | consumed tokens: 87052779520 | elapsed time per iteration (s): 0.79 | learning rate: 2.084E-05 | global batch size: 256 | lm loss: 1.898819E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.952 | TFLOPs: 19.72 | 31: iteration 166050/ 173500 | consumed samples: 42508800 | consumed tokens: 87058022400 | elapsed time per iteration (s): 0.81 | learning rate: 2.083E-05 | global batch size: 256 | lm loss: 1.912546E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.023 | TFLOPs: 19.12 | 31: iteration 166060/ 173500 | consumed samples: 42511360 | consumed tokens: 87063265280 | elapsed time per iteration (s): 0.81 | learning rate: 2.083E-05 | global batch size: 256 | lm loss: 1.908150E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.551 | TFLOPs: 19.21 | 31: iteration 166070/ 173500 | consumed samples: 42513920 | consumed tokens: 87068508160 | elapsed time per iteration (s): 0.72 | learning rate: 2.083E-05 | global batch size: 256 | lm loss: 1.907313E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 354.736 | TFLOPs: 21.46 | 31: iteration 166080/ 173500 | consumed samples: 42516480 | consumed tokens: 87073751040 | elapsed time per iteration (s): 0.82 | learning rate: 2.083E-05 | global batch size: 256 | lm loss: 1.899173E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.836 | TFLOPs: 18.99 | 31: iteration 166090/ 173500 | consumed samples: 42519040 | consumed tokens: 87078993920 | elapsed time per 
iteration (s): 0.75 | learning rate: 2.083E-05 | global batch size: 256 | lm loss: 1.919942E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.988 | TFLOPs: 20.63 | 31: iteration 166100/ 173500 | consumed samples: 42521600 | consumed tokens: 87084236800 | elapsed time per iteration (s): 0.80 | learning rate: 2.082E-05 | global batch size: 256 | lm loss: 1.909072E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.372 | TFLOPs: 19.26 | 31: iteration 166110/ 173500 | consumed samples: 42524160 | consumed tokens: 87089479680 | elapsed time per iteration (s): 0.78 | learning rate: 2.082E-05 | global batch size: 256 | lm loss: 1.934206E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.702 | TFLOPs: 19.83 | 31: iteration 166120/ 173500 | consumed samples: 42526720 | consumed tokens: 87094722560 | elapsed time per iteration (s): 0.78 | learning rate: 2.082E-05 | global batch size: 256 | lm loss: 1.884528E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.218 | TFLOPs: 19.74 | 31: iteration 166130/ 173500 | consumed samples: 42529280 | consumed tokens: 87099965440 | elapsed time per iteration (s): 0.80 | learning rate: 2.082E-05 | global batch size: 256 | lm loss: 1.927527E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.883 | TFLOPs: 19.47 | 31: iteration 166140/ 173500 | consumed samples: 42531840 | consumed tokens: 87105208320 | elapsed time per iteration (s): 0.77 | learning rate: 2.081E-05 | global batch size: 256 | lm loss: 1.913344E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.061 | TFLOPs: 
20.15 | 31: iteration 166150/ 173500 | consumed samples: 42534400 | consumed tokens: 87110451200 | elapsed time per iteration (s): 0.77 | learning rate: 2.081E-05 | global batch size: 256 | lm loss: 1.912009E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.417 | TFLOPs: 19.99 | 31: iteration 166160/ 173500 | consumed samples: 42536960 | consumed tokens: 87115694080 | elapsed time per iteration (s): 0.75 | learning rate: 2.081E-05 | global batch size: 256 | lm loss: 1.908540E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.208 | TFLOPs: 20.64 | 31: iteration 166170/ 173500 | consumed samples: 42539520 | consumed tokens: 87120936960 | elapsed time per iteration (s): 0.76 | learning rate: 2.081E-05 | global batch size: 256 | lm loss: 1.901561E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.149 | TFLOPs: 20.34 | 31: iteration 166180/ 173500 | consumed samples: 42542080 | consumed tokens: 87126179840 | elapsed time per iteration (s): 0.79 | learning rate: 2.081E-05 | global batch size: 256 | lm loss: 1.900823E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.395 | TFLOPs: 19.56 | 31: iteration 166190/ 173500 | consumed samples: 42544640 | consumed tokens: 87131422720 | elapsed time per iteration (s): 0.77 | learning rate: 2.080E-05 | global batch size: 256 | lm loss: 1.912688E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.604 | TFLOPs: 20.00 | 31: iteration 166200/ 173500 | consumed samples: 42547200 | consumed tokens: 87136665600 | elapsed time per iteration (s): 0.79 | learning rate: 2.080E-05 | global batch size: 256 | lm loss: 1.902888E+00 | grad norm: 0.182 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.089 | TFLOPs: 19.49 | 31: iteration 166210/ 173500 | consumed samples: 42549760 | consumed tokens: 87141908480 | elapsed time per iteration (s): 0.75 | learning rate: 2.080E-05 | global batch size: 256 | lm loss: 1.891000E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.852 | TFLOPs: 20.74 | 31: iteration 166220/ 173500 | consumed samples: 42552320 | consumed tokens: 87147151360 | elapsed time per iteration (s): 0.76 | learning rate: 2.080E-05 | global batch size: 256 | lm loss: 1.891662E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.870 | TFLOPs: 20.50 | 31: iteration 166230/ 173500 | consumed samples: 42554880 | consumed tokens: 87152394240 | elapsed time per iteration (s): 0.70 | learning rate: 2.079E-05 | global batch size: 256 | lm loss: 1.913182E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 366.949 | TFLOPs: 22.20 | 31: iteration 166240/ 173500 | consumed samples: 42557440 | consumed tokens: 87157637120 | elapsed time per iteration (s): 0.73 | learning rate: 2.079E-05 | global batch size: 256 | lm loss: 1.886330E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.977 | TFLOPs: 21.23 | 31: iteration 166250/ 173500 | consumed samples: 42560000 | consumed tokens: 87162880000 | elapsed time per iteration (s): 0.78 | learning rate: 2.079E-05 | global batch size: 256 | lm loss: 1.954746E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.546 | TFLOPs: 19.94 | 31: iteration 166260/ 173500 | consumed samples: 42562560 | consumed tokens: 87168122880 | elapsed time per 
iteration (s): 0.75 | learning rate: 2.079E-05 | global batch size: 256 | lm loss: 1.913432E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.627 | TFLOPs: 20.67 | 31: iteration 166270/ 173500 | consumed samples: 42565120 | consumed tokens: 87173365760 | elapsed time per iteration (s): 0.78 | learning rate: 2.079E-05 | global batch size: 256 | lm loss: 1.919615E+00 | grad norm: 0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.793 | TFLOPs: 19.83 | 31: iteration 166280/ 173500 | consumed samples: 42567680 | consumed tokens: 87178608640 | elapsed time per iteration (s): 0.78 | learning rate: 2.078E-05 | global batch size: 256 | lm loss: 1.896075E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.047 | TFLOPs: 19.85 | 31: iteration 166290/ 173500 | consumed samples: 42570240 | consumed tokens: 87183851520 | elapsed time per iteration (s): 0.77 | learning rate: 2.078E-05 | global batch size: 256 | lm loss: 1.897796E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.805 | TFLOPs: 20.13 | 31: iteration 166300/ 173500 | consumed samples: 42572800 | consumed tokens: 87189094400 | elapsed time per iteration (s): 0.77 | learning rate: 2.078E-05 | global batch size: 256 | lm loss: 1.905276E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.002 | TFLOPs: 20.21 | 31: iteration 166310/ 173500 | consumed samples: 42575360 | consumed tokens: 87194337280 | elapsed time per iteration (s): 0.76 | learning rate: 2.078E-05 | global batch size: 256 | lm loss: 1.898934E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.656 | TFLOPs: 
20.31 | 31: iteration 166320/ 173500 | consumed samples: 42577920 | consumed tokens: 87199580160 | elapsed time per iteration (s): 0.81 | learning rate: 2.078E-05 | global batch size: 256 | lm loss: 1.921845E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.078 | TFLOPs: 19.18 | 31: iteration 166330/ 173500 | consumed samples: 42580480 | consumed tokens: 87204823040 | elapsed time per iteration (s): 0.75 | learning rate: 2.077E-05 | global batch size: 256 | lm loss: 1.899099E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.965 | TFLOPs: 20.69 | 31: iteration 166340/ 173500 | consumed samples: 42583040 | consumed tokens: 87210065920 | elapsed time per iteration (s): 0.77 | learning rate: 2.077E-05 | global batch size: 256 | lm loss: 1.913536E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.009 | TFLOPs: 20.21 | 31: iteration 166350/ 173500 | consumed samples: 42585600 | consumed tokens: 87215308800 | elapsed time per iteration (s): 0.75 | learning rate: 2.077E-05 | global batch size: 256 | lm loss: 1.883998E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.043 | TFLOPs: 20.69 | 31: iteration 166360/ 173500 | consumed samples: 42588160 | consumed tokens: 87220551680 | elapsed time per iteration (s): 0.73 | learning rate: 2.077E-05 | global batch size: 256 | lm loss: 1.915121E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.608 | TFLOPs: 21.09 | 31: iteration 166370/ 173500 | consumed samples: 42590720 | consumed tokens: 87225794560 | elapsed time per iteration (s): 0.74 | learning rate: 2.076E-05 | global batch size: 256 | lm loss: 1.916830E+00 | grad norm: 0.189 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.209 | TFLOPs: 20.88 | 31: iteration 166380/ 173500 | consumed samples: 42593280 | consumed tokens: 87231037440 | elapsed time per iteration (s): 0.80 | learning rate: 2.076E-05 | global batch size: 256 | lm loss: 1.901727E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.492 | TFLOPs: 19.33 | 31: iteration 166390/ 173500 | consumed samples: 42595840 | consumed tokens: 87236280320 | elapsed time per iteration (s): 0.76 | learning rate: 2.076E-05 | global batch size: 256 | lm loss: 1.907420E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.387 | TFLOPs: 20.35 | 31: iteration 166400/ 173500 | consumed samples: 42598400 | consumed tokens: 87241523200 | elapsed time per iteration (s): 0.78 | learning rate: 2.076E-05 | global batch size: 256 | lm loss: 1.896753E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.848 | TFLOPs: 19.77 | 31: iteration 166410/ 173500 | consumed samples: 42600960 | consumed tokens: 87246766080 | elapsed time per iteration (s): 0.80 | learning rate: 2.076E-05 | global batch size: 256 | lm loss: 1.918144E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.077 | TFLOPs: 19.42 | 31: iteration 166420/ 173500 | consumed samples: 42603520 | consumed tokens: 87252008960 | elapsed time per iteration (s): 0.80 | learning rate: 2.075E-05 | global batch size: 256 | lm loss: 1.895262E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.104 | TFLOPs: 19.37 | 31: iteration 166430/ 173500 | consumed samples: 42606080 | consumed tokens: 87257251840 | elapsed time per 
iteration (s): 0.80 | learning rate: 2.075E-05 | global batch size: 256 | lm loss: 1.913878E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.829 | TFLOPs: 19.29 | 31: iteration 166440/ 173500 | consumed samples: 42608640 | consumed tokens: 87262494720 | elapsed time per iteration (s): 0.79 | learning rate: 2.075E-05 | global batch size: 256 | lm loss: 1.914878E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.171 | TFLOPs: 19.55 | 31: iteration 166450/ 173500 | consumed samples: 42611200 | consumed tokens: 87267737600 | elapsed time per iteration (s): 0.84 | learning rate: 2.075E-05 | global batch size: 256 | lm loss: 1.925664E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.198 | TFLOPs: 18.52 | 31: iteration 166460/ 173500 | consumed samples: 42613760 | consumed tokens: 87272980480 | elapsed time per iteration (s): 0.79 | learning rate: 2.075E-05 | global batch size: 256 | lm loss: 1.920911E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.535 | TFLOPs: 19.57 | 31: iteration 166470/ 173500 | consumed samples: 42616320 | consumed tokens: 87278223360 | elapsed time per iteration (s): 0.81 | learning rate: 2.074E-05 | global batch size: 256 | lm loss: 1.916992E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.021 | TFLOPs: 19.18 | 31: iteration 166480/ 173500 | consumed samples: 42618880 | consumed tokens: 87283466240 | elapsed time per iteration (s): 0.78 | learning rate: 2.074E-05 | global batch size: 256 | lm loss: 1.898569E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.919 | TFLOPs: 
19.84 | 31: iteration 166490/ 173500 | consumed samples: 42621440 | consumed tokens: 87288709120 | elapsed time per iteration (s): 0.81 | learning rate: 2.074E-05 | global batch size: 256 | lm loss: 1.909245E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.394 | TFLOPs: 19.08 | 31: iteration 166500/ 173500 | consumed samples: 42624000 | consumed tokens: 87293952000 | elapsed time per iteration (s): 0.80 | learning rate: 2.074E-05 | global batch size: 256 | lm loss: 1.933229E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.886 | TFLOPs: 19.41 | 31: iteration 166510/ 173500 | consumed samples: 42626560 | consumed tokens: 87299194880 | elapsed time per iteration (s): 0.78 | learning rate: 2.073E-05 | global batch size: 256 | lm loss: 1.913048E+00 | grad norm: 0.230 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.736 | TFLOPs: 19.77 | 31: iteration 166520/ 173500 | consumed samples: 42629120 | consumed tokens: 87304437760 | elapsed time per iteration (s): 0.82 | learning rate: 2.073E-05 | global batch size: 256 | lm loss: 1.916549E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.800 | TFLOPs: 18.86 | 31: iteration 166530/ 173500 | consumed samples: 42631680 | consumed tokens: 87309680640 | elapsed time per iteration (s): 0.84 | learning rate: 2.073E-05 | global batch size: 256 | lm loss: 1.928032E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.978 | TFLOPs: 18.51 | 31: iteration 166540/ 173500 | consumed samples: 42634240 | consumed tokens: 87314923520 | elapsed time per iteration (s): 0.81 | learning rate: 2.073E-05 | global batch size: 256 | lm loss: 1.900870E+00 | grad norm: 0.193 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.849 | TFLOPs: 19.17 | 31: iteration 166550/ 173500 | consumed samples: 42636800 | consumed tokens: 87320166400 | elapsed time per iteration (s): 0.80 | learning rate: 2.073E-05 | global batch size: 256 | lm loss: 1.910639E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.253 | TFLOPs: 19.44 | 31: iteration 166560/ 173500 | consumed samples: 42639360 | consumed tokens: 87325409280 | elapsed time per iteration (s): 0.79 | learning rate: 2.072E-05 | global batch size: 256 | lm loss: 1.891495E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.939 | TFLOPs: 19.54 | 31: iteration 166570/ 173500 | consumed samples: 42641920 | consumed tokens: 87330652160 | elapsed time per iteration (s): 0.81 | learning rate: 2.072E-05 | global batch size: 256 | lm loss: 1.885953E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.940 | TFLOPs: 19.23 | 31: iteration 166580/ 173500 | consumed samples: 42644480 | consumed tokens: 87335895040 | elapsed time per iteration (s): 0.83 | learning rate: 2.072E-05 | global batch size: 256 | lm loss: 1.895904E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.635 | TFLOPs: 18.55 | 31: iteration 166590/ 173500 | consumed samples: 42647040 | consumed tokens: 87341137920 | elapsed time per iteration (s): 0.81 | learning rate: 2.072E-05 | global batch size: 256 | lm loss: 1.913783E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.241 | TFLOPs: 19.07 | 31: iteration 166600/ 173500 | consumed samples: 42649600 | consumed tokens: 87346380800 | elapsed time per 
iteration (s): 0.84 | learning rate: 2.072E-05 | global batch size: 256 | lm loss: 1.870556E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.301 | TFLOPs: 18.35 | 31: iteration 166610/ 173500 | consumed samples: 42652160 | consumed tokens: 87351623680 | elapsed time per iteration (s): 0.83 | learning rate: 2.071E-05 | global batch size: 256 | lm loss: 1.899318E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.033 | TFLOPs: 18.70 | 31: iteration 166620/ 173500 | consumed samples: 42654720 | consumed tokens: 87356866560 | elapsed time per iteration (s): 0.86 | learning rate: 2.071E-05 | global batch size: 256 | lm loss: 1.897492E+00 | grad norm: 0.217 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 299.224 | TFLOPs: 18.10 | 31: iteration 166630/ 173500 | consumed samples: 42657280 | consumed tokens: 87362109440 | elapsed time per iteration (s): 0.76 | learning rate: 2.071E-05 | global batch size: 256 | lm loss: 1.896084E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.790 | TFLOPs: 20.44 | 31: iteration 166640/ 173500 | consumed samples: 42659840 | consumed tokens: 87367352320 | elapsed time per iteration (s): 0.78 | learning rate: 2.071E-05 | global batch size: 256 | lm loss: 1.909877E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.240 | TFLOPs: 19.92 | 31: iteration 166650/ 173500 | consumed samples: 42662400 | consumed tokens: 87372595200 | elapsed time per iteration (s): 0.74 | learning rate: 2.071E-05 | global batch size: 256 | lm loss: 1.935005E+00 | grad norm: 0.216 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.731 | TFLOPs: 
20.92 | 31: iteration 166660/ 173500 | consumed samples: 42664960 | consumed tokens: 87377838080 | elapsed time per iteration (s): 0.75 | learning rate: 2.070E-05 | global batch size: 256 | lm loss: 1.908829E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.493 | TFLOPs: 20.66 | 31: iteration 166670/ 173500 | consumed samples: 42667520 | consumed tokens: 87383080960 | elapsed time per iteration (s): 0.75 | learning rate: 2.070E-05 | global batch size: 256 | lm loss: 1.933231E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.798 | TFLOPs: 20.62 | 31: iteration 166680/ 173500 | consumed samples: 42670080 | consumed tokens: 87388323840 | elapsed time per iteration (s): 0.77 | learning rate: 2.070E-05 | global batch size: 256 | lm loss: 1.903839E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.015 | TFLOPs: 20.15 | 31: iteration 166690/ 173500 | consumed samples: 42672640 | consumed tokens: 87393566720 | elapsed time per iteration (s): 0.73 | learning rate: 2.070E-05 | global batch size: 256 | lm loss: 1.929366E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.783 | TFLOPs: 21.16 | 31: iteration 166700/ 173500 | consumed samples: 42675200 | consumed tokens: 87398809600 | elapsed time per iteration (s): 0.76 | learning rate: 2.070E-05 | global batch size: 256 | lm loss: 1.930289E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.255 | TFLOPs: 20.28 | 31: iteration 166710/ 173500 | consumed samples: 42677760 | consumed tokens: 87404052480 | elapsed time per iteration (s): 0.74 | learning rate: 2.069E-05 | global batch size: 256 | lm loss: 1.911146E+00 | grad norm: 0.183 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.249 | TFLOPs: 20.83 | 31: iteration 166720/ 173500 | consumed samples: 42680320 | consumed tokens: 87409295360 | elapsed time per iteration (s): 0.73 | learning rate: 2.069E-05 | global batch size: 256 | lm loss: 1.906539E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.242 | TFLOPs: 21.31 | 31: iteration 166730/ 173500 | consumed samples: 42682880 | consumed tokens: 87414538240 | elapsed time per iteration (s): 0.77 | learning rate: 2.069E-05 | global batch size: 256 | lm loss: 1.898509E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.530 | TFLOPs: 20.06 | 31: iteration 166740/ 173500 | consumed samples: 42685440 | consumed tokens: 87419781120 | elapsed time per iteration (s): 0.78 | learning rate: 2.069E-05 | global batch size: 256 | lm loss: 1.890874E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.118 | TFLOPs: 19.97 | 31: iteration 166750/ 173500 | consumed samples: 42688000 | consumed tokens: 87425024000 | elapsed time per iteration (s): 0.75 | learning rate: 2.069E-05 | global batch size: 256 | lm loss: 1.905724E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.999 | TFLOPs: 20.69 | 31: iteration 166760/ 173500 | consumed samples: 42690560 | consumed tokens: 87430266880 | elapsed time per iteration (s): 0.75 | learning rate: 2.068E-05 | global batch size: 256 | lm loss: 1.894197E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.888 | TFLOPs: 20.74 | 31: iteration 166770/ 173500 | consumed samples: 42693120 | consumed tokens: 87435509760 | elapsed time per 
iteration (s): 0.79 | learning rate: 2.068E-05 | global batch size: 256 | lm loss: 1.926284E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.008 | TFLOPs: 19.72 | 31: iteration 166780/ 173500 | consumed samples: 42695680 | consumed tokens: 87440752640 | elapsed time per iteration (s): 0.76 | learning rate: 2.068E-05 | global batch size: 256 | lm loss: 1.880956E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.395 | TFLOPs: 20.47 | 31: iteration 166790/ 173500 | consumed samples: 42698240 | consumed tokens: 87445995520 | elapsed time per iteration (s): 0.75 | learning rate: 2.068E-05 | global batch size: 256 | lm loss: 1.893691E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.506 | TFLOPs: 20.60 | 31: iteration 166800/ 173500 | consumed samples: 42700800 | consumed tokens: 87451238400 | elapsed time per iteration (s): 0.74 | learning rate: 2.068E-05 | global batch size: 256 | lm loss: 1.924453E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.454 | TFLOPs: 20.96 | 31: iteration 166810/ 173500 | consumed samples: 42703360 | consumed tokens: 87456481280 | elapsed time per iteration (s): 0.74 | learning rate: 2.067E-05 | global batch size: 256 | lm loss: 1.921639E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.034 | TFLOPs: 20.93 | 31: iteration 166820/ 173500 | consumed samples: 42705920 | consumed tokens: 87461724160 | elapsed time per iteration (s): 0.77 | learning rate: 2.067E-05 | global batch size: 256 | lm loss: 1.917681E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.657 | TFLOPs: 
20.00 | 31: iteration 166830/ 173500 | consumed samples: 42708480 | consumed tokens: 87466967040 | elapsed time per iteration (s): 0.75 | learning rate: 2.067E-05 | global batch size: 256 | lm loss: 1.913164E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.428 | TFLOPs: 20.60 | 31: iteration 166840/ 173500 | consumed samples: 42711040 | consumed tokens: 87472209920 | elapsed time per iteration (s): 0.76 | learning rate: 2.067E-05 | global batch size: 256 | lm loss: 1.908734E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.662 | TFLOPs: 20.31 | 31: iteration 166850/ 173500 | consumed samples: 42713600 | consumed tokens: 87477452800 | elapsed time per iteration (s): 0.81 | learning rate: 2.066E-05 | global batch size: 256 | lm loss: 1.931656E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.904 | TFLOPs: 19.17 | 31: iteration 166860/ 173500 | consumed samples: 42716160 | consumed tokens: 87482695680 | elapsed time per iteration (s): 0.78 | learning rate: 2.066E-05 | global batch size: 256 | lm loss: 1.919881E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.903 | TFLOPs: 19.78 | 31: iteration 166870/ 173500 | consumed samples: 42718720 | consumed tokens: 87487938560 | elapsed time per iteration (s): 0.77 | learning rate: 2.066E-05 | global batch size: 256 | lm loss: 1.904561E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.955 | TFLOPs: 20.02 | 31: iteration 166880/ 173500 | consumed samples: 42721280 | consumed tokens: 87493181440 | elapsed time per iteration (s): 0.77 | learning rate: 2.066E-05 | global batch size: 256 | lm loss: 1.932985E+00 | grad norm: 0.187 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.784 | TFLOPs: 20.19 | 31: iteration 166890/ 173500 | consumed samples: 42723840 | consumed tokens: 87498424320 | elapsed time per iteration (s): 0.83 | learning rate: 2.066E-05 | global batch size: 256 | lm loss: 1.923622E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.250 | TFLOPs: 18.77 | 31: iteration 166900/ 173500 | consumed samples: 42726400 | consumed tokens: 87503667200 | elapsed time per iteration (s): 0.79 | learning rate: 2.066E-05 | global batch size: 256 | lm loss: 1.912687E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.106 | TFLOPs: 19.55 | 31: iteration 166910/ 173500 | consumed samples: 42728960 | consumed tokens: 87508910080 | elapsed time per iteration (s): 0.75 | learning rate: 2.065E-05 | global batch size: 256 | lm loss: 1.897973E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.720 | TFLOPs: 20.67 | 31: iteration 166920/ 173500 | consumed samples: 42731520 | consumed tokens: 87514152960 | elapsed time per iteration (s): 0.78 | learning rate: 2.065E-05 | global batch size: 256 | lm loss: 1.906457E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.623 | TFLOPs: 19.76 | 31: iteration 166930/ 173500 | consumed samples: 42734080 | consumed tokens: 87519395840 | elapsed time per iteration (s): 0.77 | learning rate: 2.065E-05 | global batch size: 256 | lm loss: 1.935161E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.487 | TFLOPs: 20.18 | 31: iteration 166940/ 173500 | consumed samples: 42736640 | consumed tokens: 87524638720 | elapsed time per 
iteration (s): 0.74 | learning rate: 2.065E-05 | global batch size: 256 | lm loss: 1.902756E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.609 | TFLOPs: 20.91 | 31: iteration 166950/ 173500 | consumed samples: 42739200 | consumed tokens: 87529881600 | elapsed time per iteration (s): 0.78 | learning rate: 2.065E-05 | global batch size: 256 | lm loss: 1.919248E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.221 | TFLOPs: 19.92 | 31: iteration 166960/ 173500 | consumed samples: 42741760 | consumed tokens: 87535124480 | elapsed time per iteration (s): 0.79 | learning rate: 2.064E-05 | global batch size: 256 | lm loss: 1.940053E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.233 | TFLOPs: 19.68 | 31: iteration 166970/ 173500 | consumed samples: 42744320 | consumed tokens: 87540367360 | elapsed time per iteration (s): 0.80 | learning rate: 2.064E-05 | global batch size: 256 | lm loss: 1.918665E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.193 | TFLOPs: 19.37 | 31: iteration 166980/ 173500 | consumed samples: 42746880 | consumed tokens: 87545610240 | elapsed time per iteration (s): 0.83 | learning rate: 2.064E-05 | global batch size: 256 | lm loss: 1.893122E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.198 | TFLOPs: 18.77 | 31: iteration 166990/ 173500 | consumed samples: 42749440 | consumed tokens: 87550853120 | elapsed time per iteration (s): 0.82 | learning rate: 2.064E-05 | global batch size: 256 | lm loss: 1.902474E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.329 | TFLOPs: 
18.96 | 31: iteration 167000/ 173500 | consumed samples: 42752000 | consumed tokens: 87556096000 | elapsed time per iteration (s): 0.80 | learning rate: 2.064E-05 | global batch size: 256 | lm loss: 1.915057E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.024 | TFLOPs: 19.36 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 167000 | lm loss value: 1.795668E+00 | lm loss PPL: 6.023500E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 167000 to checkpoints_1b1long 0: [2022-11-27 07:41:45,214] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step167000 is begin to save! 0: [2022-11-27 07:41:45,228] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_01-model_00-model_states.pt... 0: [2022-11-27 07:41:45,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_01-model_00-model_states.pt. 0: [2022-11-27 07:41:45,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_03-model_00-model_states.pt... 0: [2022-11-27 07:41:45,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_03-model_00-model_states.pt. 0: [2022-11-27 07:41:45,527] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_04-model_00-model_states.pt... 0: [2022-11-27 07:41:45,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_04-model_00-model_states.pt. 0: [2022-11-27 07:41:45,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_05-model_00-model_states.pt... 
0: [2022-11-27 07:41:45,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_05-model_00-model_states.pt. 0: [2022-11-27 07:41:45,689] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_06-model_00-model_states.pt... 0: [2022-11-27 07:41:45,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_06-model_00-model_states.pt. 0: [2022-11-27 07:41:45,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_07-model_00-model_states.pt... 0: [2022-11-27 07:41:45,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_07-model_00-model_states.pt. 0: [2022-11-27 07:41:45,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_08-model_00-model_states.pt... 0: [2022-11-27 07:41:45,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_08-model_00-model_states.pt. 0: [2022-11-27 07:41:45,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_09-model_00-model_states.pt... 0: [2022-11-27 07:41:45,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_09-model_00-model_states.pt. 0: [2022-11-27 07:41:45,999] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_10-model_00-model_states.pt... 0: [2022-11-27 07:41:46,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_10-model_00-model_states.pt. 0: [2022-11-27 07:41:46,078] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_11-model_00-model_states.pt... 
0: [2022-11-27 07:41:46,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_11-model_00-model_states.pt. 0: [2022-11-27 07:41:46,158] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_12-model_00-model_states.pt... 0: [2022-11-27 07:41:46,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_12-model_00-model_states.pt. 0: [2022-11-27 07:41:46,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_13-model_00-model_states.pt... 0: [2022-11-27 07:41:46,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_13-model_00-model_states.pt. 0: [2022-11-27 07:41:46,312] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_14-model_00-model_states.pt... 0: [2022-11-27 07:41:46,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_14-model_00-model_states.pt. 0: [2022-11-27 07:41:46,388] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_15-model_00-model_states.pt... 0: [2022-11-27 07:41:46,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_15-model_00-model_states.pt. 0: [2022-11-27 07:41:46,466] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_16-model_00-model_states.pt... 0: [2022-11-27 07:41:46,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_16-model_00-model_states.pt. 0: [2022-11-27 07:41:46,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_17-model_00-model_states.pt... 
0: [2022-11-27 07:41:46,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_17-model_00-model_states.pt. 0: [2022-11-27 07:41:46,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_18-model_00-model_states.pt... 0: [2022-11-27 07:41:46,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_18-model_00-model_states.pt. 0: [2022-11-27 07:41:46,699] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_19-model_00-model_states.pt... 0: [2022-11-27 07:41:46,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_19-model_00-model_states.pt. 0: [2022-11-27 07:41:46,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_20-model_00-model_states.pt... 0: [2022-11-27 07:41:46,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_20-model_00-model_states.pt. 0: [2022-11-27 07:41:46,855] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_21-model_00-model_states.pt... 0: [2022-11-27 07:41:46,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_21-model_00-model_states.pt. 0: [2022-11-27 07:41:46,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_22-model_00-model_states.pt... 0: [2022-11-27 07:41:47,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_22-model_00-model_states.pt. 0: [2022-11-27 07:41:47,009] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_23-model_00-model_states.pt... 
0: [2022-11-27 07:41:47,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_23-model_00-model_states.pt. 0: [2022-11-27 07:41:47,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_24-model_00-model_states.pt... 0: [2022-11-27 07:41:47,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_24-model_00-model_states.pt. 0: [2022-11-27 07:41:47,166] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_25-model_00-model_states.pt... 0: [2022-11-27 07:41:47,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_25-model_00-model_states.pt. 0: [2022-11-27 07:41:47,241] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_26-model_00-model_states.pt... 0: [2022-11-27 07:41:47,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_26-model_00-model_states.pt. 0: [2022-11-27 07:41:47,318] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_27-model_00-model_states.pt... 0: [2022-11-27 07:41:47,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_27-model_00-model_states.pt. 0: [2022-11-27 07:41:47,399] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_28-model_00-model_states.pt... 0: [2022-11-27 07:41:47,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_28-model_00-model_states.pt. 0: [2022-11-27 07:41:47,475] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/layer_30-model_00-model_states.pt... 
0: [2022-11-27 07:41:47,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/layer_30-model_00-model_states.pt. 0: [2022-11-27 07:41:47,479] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step167000/mp_rank_00_model_states.pt 0: [2022-11-27 07:41:47,479] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/mp_rank_00_model_states.pt... 0: [2022-11-27 07:41:47,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/mp_rank_00_model_states.pt. 0: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 
5: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
1: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:41:47,559] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:41:47,559] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 
12: [2022-11-27 07:41:47,559] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:41:47,559] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 
25: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 
28: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 
29: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 
21: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:41:47,559] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 
27: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
7: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
10: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:41:47,559] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:41:47,559] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 
13: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:41:47,559] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:41:47,559] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 
25: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
14: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:41:47,559] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 
17: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 
19: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
4: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:41:47,559] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
2: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:41:47,559] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 
28: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:41:47,559] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 
18: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:41:47,559] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 
8: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 
28: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:41:47,559] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 
19: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:41:47,559] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 
19: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:41:47,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:41:47,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:41:47,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 10: [2022-11-27 07:41:47,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:41:47,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 
10: [2022-11-27 07:41:47,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 07:41:47,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 21: [2022-11-27 07:41:47,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:41:47,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 07:41:47,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 20: [2022-11-27 07:41:47,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:41:47,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:41:47,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 30: [2022-11-27 07:41:47,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 20: [2022-11-27 07:41:47,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 30: [2022-11-27 07:41:47,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 0: [2022-11-27 07:41:47,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
17: [2022-11-27 07:41:47,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:41:47,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 07:41:47,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 17: [2022-11-27 07:41:47,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 07:41:47,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 17: [2022-11-27 07:41:47,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:41:47,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:41:47,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 
17: [2022-11-27 07:41:47,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 9: [2022-11-27 07:41:47,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 29: [2022-11-27 07:41:47,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 9: [2022-11-27 07:41:47,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:41:47,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 9: [2022-11-27 07:41:47,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 17: [2022-11-27 07:41:47,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 9: [2022-11-27 07:41:47,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 07:41:47,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 10: [2022-11-27 07:41:47,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:41:47,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 07:41:47,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 
12: [2022-11-27 07:41:47,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:41:47,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 07:41:47,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 19: [2022-11-27 07:41:47,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:41:47,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 23: [2022-11-27 07:41:47,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:41:47,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 23: [2022-11-27 07:41:47,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 07:41:47,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 1: [2022-11-27 07:41:47,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:41:47,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-27 07:41:47,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 07:41:47,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 28: [2022-11-27 07:41:47,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:41:47,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:41:47,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:41:47,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 3: [2022-11-27 07:41:47,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 1: [2022-11-27 07:41:47,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 3: [2022-11-27 07:41:47,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 1: [2022-11-27 07:41:47,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 07:41:47,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 
28: [2022-11-27 07:41:47,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 07:41:47,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 18: [2022-11-27 07:41:47,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:41:47,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 07:41:47,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 12: [2022-11-27 07:41:47,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:41:47,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 16: [2022-11-27 07:41:47,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:41:47,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:41:47,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 16: [2022-11-27 07:41:47,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 
16: [2022-11-27 07:41:47,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 19: [2022-11-27 07:41:47,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 16: [2022-11-27 07:41:47,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 19: [2022-11-27 07:41:47,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 16: [2022-11-27 07:41:47,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 07:41:47,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 20: [2022-11-27 07:41:47,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:41:47,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 07:41:47,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 7: [2022-11-27 07:41:47,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:41:47,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-27 07:41:47,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 07:41:47,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 07:41:47,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 7: [2022-11-27 07:41:47,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 28: [2022-11-27 07:41:47,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:41:47,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:41:47,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 15: [2022-11-27 07:41:47,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:41:47,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 15: [2022-11-27 07:41:47,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 07:41:47,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 25: [2022-11-27 07:41:47,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 
25: [2022-11-27 07:41:47,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 07:41:47,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 29: [2022-11-27 07:41:47,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:41:47,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:41:47,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 07:41:47,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 30: [2022-11-27 07:41:47,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 07:41:47,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 18: [2022-11-27 07:41:47,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:41:47,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 07:41:47,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 26: [2022-11-27 07:41:47,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 
26: [2022-11-27 07:41:47,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-27 07:41:47,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 10: [2022-11-27 07:41:47,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:41:47,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:41:47,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 07:41:47,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 27: [2022-11-27 07:41:47,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 07:41:47,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 9: [2022-11-27 07:41:47,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:41:47,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:41:47,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 07:41:47,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 
27: [2022-11-27 07:41:47,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:41:47,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 07:41:47,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 07:41:47,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 27: [2022-11-27 07:41:47,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 4: [2022-11-27 07:41:47,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:41:47,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:41:47,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 8: [2022-11-27 07:41:47,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 4: [2022-11-27 07:41:47,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 8: [2022-11-27 07:41:47,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 4: [2022-11-27 07:41:47,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-27 07:41:47,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 07:41:47,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:41:47,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 4: [2022-11-27 07:41:47,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 07:41:47,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 30: [2022-11-27 07:41:47,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:41:47,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:41:47,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 28: [2022-11-27 07:41:47,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 30: [2022-11-27 07:41:47,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 28: [2022-11-27 07:41:47,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 30: [2022-11-27 07:41:47,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 
1: [2022-11-27 07:41:47,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 0: [2022-11-27 07:41:47,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:41:47,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:41:47,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-27 07:41:47,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 16: [2022-11-27 07:41:47,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:41:47,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:41:47,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 22: [2022-11-27 07:41:47,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 10: [2022-11-27 07:41:47,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:41:47,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 22: [2022-11-27 07:41:47,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 
10: [2022-11-27 07:41:47,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 22: [2022-11-27 07:41:47,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:41:47,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 22: [2022-11-27 07:41:47,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 07:41:47,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 22: [2022-11-27 07:41:47,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:41:47,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 07:41:47,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 0: [2022-11-27 07:41:47,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:41:47,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-27 07:41:47,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 0: [2022-11-27 07:41:47,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 7: [2022-11-27 07:41:47,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 0: [2022-11-27 07:41:47,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 21: [2022-11-27 07:41:47,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:41:47,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 07:41:47,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 17: [2022-11-27 07:41:47,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:41:47,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:41:47,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 07:41:47,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 18: [2022-11-27 07:41:47,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
17: [2022-11-27 07:41:47,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 18: [2022-11-27 07:41:47,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 17: [2022-11-27 07:41:47,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 18: [2022-11-27 07:41:47,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 30: [2022-11-27 07:41:47,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:41:47,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 07:41:47,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 4: [2022-11-27 07:41:47,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:41:47,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 19: [2022-11-27 07:41:47,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:41:47,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 
19: [2022-11-27 07:41:47,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 07:41:47,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 3: [2022-11-27 07:41:47,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:41:47,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 07:41:47,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 12: [2022-11-27 07:41:47,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:41:47,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 07:41:47,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 29: [2022-11-27 07:41:47,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:41:47,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-27 07:41:47,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 07:41:47,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 12: [2022-11-27 07:41:47,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:41:47,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 12: [2022-11-27 07:41:47,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 29: [2022-11-27 07:41:47,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 12: [2022-11-27 07:41:47,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 25: [2022-11-27 07:41:47,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:41:47,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:41:47,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 07:41:47,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 
25: [2022-11-27 07:41:47,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 07:41:47,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 16: [2022-11-27 07:41:47,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:41:47,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:41:47,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:41:47,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 26: [2022-11-27 07:41:47,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 19: [2022-11-27 07:41:47,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 27: [2022-11-27 07:41:47,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:41:47,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 26: [2022-11-27 07:41:47,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 19: [2022-11-27 07:41:47,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 
26: [2022-11-27 07:41:47,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:41:47,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 07:41:47,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 27: [2022-11-27 07:41:47,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 07:41:47,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 1: [2022-11-27 07:41:47,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:41:47,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:41:47,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 1: [2022-11-27 07:41:47,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 26: [2022-11-27 07:41:47,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 1: [2022-11-27 07:41:47,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 15: [2022-11-27 07:41:47,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-27 07:41:47,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 07:41:47,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 8: [2022-11-27 07:41:47,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:41:47,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 07:41:47,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 9: [2022-11-27 07:41:47,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:41:47,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:41:47,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:41:47,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
9: [2022-11-27 07:41:47,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 20: [2022-11-27 07:41:47,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 14: [2022-11-27 07:41:47,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:41:47,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 20: [2022-11-27 07:41:47,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 14: [2022-11-27 07:41:47,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 07:41:47,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 07:41:47,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:41:47,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 14: [2022-11-27 07:41:47,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 07:41:47,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 14: [2022-11-27 07:41:47,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 
14: [2022-11-27 07:41:47,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 07:41:47,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 15: [2022-11-27 07:41:47,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:41:47,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 13: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-27 07:41:47,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 07:41:47,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 07:41:47,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 07:41:47,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 13: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 13: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 13: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 31: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 
31: [2022-11-27 07:41:47,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 15: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:41:47,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 07:41:47,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 07:41:47,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 15: [2022-11-27 07:41:47,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 31: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 31: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 31: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 15: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 8: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
25: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:41:47,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 8: [2022-11-27 07:41:47,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 28: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:41:47,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 07:41:47,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 25: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 8: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 
28: [2022-11-27 07:41:47,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 24: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 24: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 28: [2022-11-27 07:41:47,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 28: [2022-11-27 07:41:47,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-27 07:41:47,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 07:41:47,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 21: [2022-11-27 07:41:47,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:41:47,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 07:41:47,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 7: [2022-11-27 07:41:47,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:41:47,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
23: [2022-11-27 07:41:47,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:41:47,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 23: [2022-11-27 07:41:47,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 07:41:47,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 0: [2022-11-27 07:41:47,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 7: [2022-11-27 07:41:47,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 23: [2022-11-27 07:41:47,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 23: [2022-11-27 07:41:47,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 0: [2022-11-27 07:41:47,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 24: [2022-11-27 07:41:47,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:41:47,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 07:41:47,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 
21: [2022-11-27 07:41:47,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:41:47,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:41:47,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:41:47,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 5: [2022-11-27 07:41:47,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 07:41:47,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 21: [2022-11-27 07:41:47,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 5: [2022-11-27 07:41:47,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 5: [2022-11-27 07:41:47,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 8: [2022-11-27 07:41:47,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:41:47,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 07:41:47,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 
5: [2022-11-27 07:41:47,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:41:47,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 07:41:47,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 3: [2022-11-27 07:41:47,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:41:47,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 07:41:47,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 1: [2022-11-27 07:41:47,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:41:47,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 07:41:47,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 19: [2022-11-27 07:41:47,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:41:47,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 07:41:47,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 
22: [2022-11-27 07:41:47,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:41:47,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 07:41:47,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 11: [2022-11-27 07:41:47,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:41:47,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:41:47,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:41:47,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 07:41:47,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 07:41:47,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 07:41:47,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 11: [2022-11-27 07:41:47,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 11: [2022-11-27 07:41:47,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 
6: [2022-11-27 07:41:47,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:41:47,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:41:47,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:41:47,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 07:41:47,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 07:41:47,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 07:41:47,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 6: [2022-11-27 07:41:47,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 6: [2022-11-27 07:41:47,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 20: [2022-11-27 07:41:47,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 
20: [2022-11-27 07:41:47,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 07:41:47,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:41:47,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 20: [2022-11-27 07:41:47,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 07:41:47,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 26: [2022-11-27 07:41:47,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:41:47,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 07:41:47,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 17: [2022-11-27 07:41:47,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:41:47,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 07:41:47,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 12: [2022-11-27 07:41:47,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-27 07:41:47,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 07:41:47,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 27: [2022-11-27 07:41:47,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:41:47,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 07:41:47,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 18: [2022-11-27 07:41:47,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:41:47,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-27 07:41:47,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 4: [2022-11-27 07:41:47,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:41:47,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 07:41:47,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 10: [2022-11-27 07:41:47,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-27 07:41:47,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 07:41:47,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 30: [2022-11-27 07:41:47,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:41:47,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 07:41:47,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 0: [2022-11-27 07:41:47,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:41:47,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 07:41:47,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 9: [2022-11-27 07:41:47,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:41:47,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 07:41:47,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 31: [2022-11-27 07:41:47,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-27 07:41:47,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 23: [2022-11-27 07:41:47,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:41:47,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 23: [2022-11-27 07:41:47,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 07:41:47,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 13: [2022-11-27 07:41:47,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:41:47,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:41:47,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 22: [2022-11-27 07:41:47,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 13: [2022-11-27 07:41:47,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 22: [2022-11-27 07:41:47,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 25: [2022-11-27 07:41:47,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
3: [2022-11-27 07:41:47,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:41:47,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 3: [2022-11-27 07:41:47,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 07:41:47,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 25: [2022-11-27 07:41:47,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 29: [2022-11-27 07:41:47,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:41:47,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 07:41:47,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 14: [2022-11-27 07:41:47,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:41:47,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 07:41:47,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 16: [2022-11-27 07:41:47,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 
16: [2022-11-27 07:41:47,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 07:41:47,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 15: [2022-11-27 07:41:47,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:41:47,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 07:41:47,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 28: [2022-11-27 07:41:47,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-27 07:41:47,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 07:41:47,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 7: [2022-11-27 07:41:47,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:41:47,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 07:41:47,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 8: [2022-11-27 07:41:47,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-27 07:41:47,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 07:41:47,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 1: [2022-11-27 07:41:47,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:41:47,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 07:41:47,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 11: [2022-11-27 07:41:47,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:41:47,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 07:41:47,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 5: [2022-11-27 07:41:47,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:41:47,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 07:41:47,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 6: [2022-11-27 07:41:47,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-27 07:41:47,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 07:41:47,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 24: [2022-11-27 07:41:47,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:41:47,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 07:41:47,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 10: [2022-11-27 07:41:47,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:41:47,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 07:41:47,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 17: [2022-11-27 07:41:47,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:41:47,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:41:47,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 07:41:47,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 
27: [2022-11-27 07:41:47,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 07:41:47,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 12: [2022-11-27 07:41:47,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:41:47,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 07:41:47,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 19: [2022-11-27 07:41:47,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:41:47,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 07:41:47,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 20: [2022-11-27 07:41:47,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:41:47,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 26: [2022-11-27 07:41:47,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:41:47,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 
26: [2022-11-27 07:41:47,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 07:41:47,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 21: [2022-11-27 07:41:47,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:41:47,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 07:41:47,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 4: [2022-11-27 07:41:47,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:41:47,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 07:41:47,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 31: [2022-11-27 07:41:47,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:41:47,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 07:41:47,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 22: [2022-11-27 07:41:47,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
22: [2022-11-27 07:41:47,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 07:41:47,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 9: [2022-11-27 07:41:47,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:41:47,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 07:41:47,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 0: [2022-11-27 07:41:47,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:41:47,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:41:47,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 0: [2022-11-27 07:41:47,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 30: [2022-11-27 07:41:47,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 0: [2022-11-27 07:41:47,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 18: [2022-11-27 07:41:47,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
18: [2022-11-27 07:41:47,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 07:41:47,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 29: [2022-11-27 07:41:47,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:41:47,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 07:41:47,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 13: [2022-11-27 07:41:47,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:41:47,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 07:41:47,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 23: [2022-11-27 07:41:47,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:41:47,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 07:41:47,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 25: [2022-11-27 07:41:47,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 
25: [2022-11-27 07:41:47,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 07:41:47,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 28: [2022-11-27 07:41:47,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-27 07:41:47,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 07:41:47,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 16: [2022-11-27 07:41:47,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:41:47,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 07:41:47,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 14: [2022-11-27 07:41:47,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:41:47,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 07:41:47,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 3: [2022-11-27 07:41:47,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-27 07:41:47,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 07:41:47,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 7: [2022-11-27 07:41:47,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:41:47,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:41:47,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 15: [2022-11-27 07:41:47,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 7: [2022-11-27 07:41:47,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 15: [2022-11-27 07:41:47,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 19: [2022-11-27 07:41:47,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:41:47,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 07:41:47,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 8: [2022-11-27 07:41:47,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-27 07:41:47,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 07:41:47,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 21: [2022-11-27 07:41:47,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:41:47,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 07:41:47,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 6: [2022-11-27 07:41:47,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:41:47,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 07:41:47,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 24: [2022-11-27 07:41:47,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:41:47,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-27 07:41:47,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 11: [2022-11-27 07:41:47,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-27 07:41:47,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 07:41:47,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 12: [2022-11-27 07:41:47,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:41:47,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 07:41:47,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 1: [2022-11-27 07:41:47,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:41:47,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 07:41:47,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 4: [2022-11-27 07:41:47,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:41:47,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:41:47,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 07:41:47,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 
10: [2022-11-27 07:41:47,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 07:41:47,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 17: [2022-11-27 07:41:47,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:41:47,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 07:41:47,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 31: [2022-11-27 07:41:47,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:41:47,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 07:41:47,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 0: [2022-11-27 07:41:47,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:41:47,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 07:41:47,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 20: [2022-11-27 07:41:47,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
20: [2022-11-27 07:41:47,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 07:41:47,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 5: [2022-11-27 07:41:47,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:41:47,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 26: [2022-11-27 07:41:47,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:41:47,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 26: [2022-11-27 07:41:47,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 07:41:47,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 27: [2022-11-27 07:41:47,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:41:47,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 07:41:47,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 30: [2022-11-27 07:41:47,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-27 07:41:47,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 07:41:47,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 9: [2022-11-27 07:41:47,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:41:47,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 07:41:47,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 22: [2022-11-27 07:41:47,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:41:47,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 07:41:47,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 23: [2022-11-27 07:41:47,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:41:47,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 07:41:47,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 2: [2022-11-27 07:41:47,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-27 07:41:47,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 07:41:47,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 2: [2022-11-27 07:41:47,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:41:47,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 07:41:47,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 25: [2022-11-27 07:41:47,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:41:47,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 07:41:47,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 3: [2022-11-27 07:41:47,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:41:47,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 07:41:47,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 29: [2022-11-27 07:41:47,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-27 07:41:47,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 07:41:47,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 14: [2022-11-27 07:41:47,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:41:47,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 07:41:47,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 16: [2022-11-27 07:41:47,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:41:47,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 07:41:47,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 28: [2022-11-27 07:41:47,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-27 07:41:47,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 07:41:47,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 15: [2022-11-27 07:41:47,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-27 07:41:47,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 07:41:47,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 2: [2022-11-27 07:41:47,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:41:47,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 07:41:47,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 2: [2022-11-27 07:41:47,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:41:47,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 07:41:47,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 7: [2022-11-27 07:41:47,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:41:47,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 07:41:47,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 8: [2022-11-27 07:41:47,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-27 07:41:47,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 07:41:47,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 13: [2022-11-27 07:41:47,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:41:47,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 07:41:47,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 2: [2022-11-27 07:41:47,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:41:47,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 07:41:47,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 11: [2022-11-27 07:41:47,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:41:47,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:41:47,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 07:41:47,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 
21: [2022-11-27 07:41:47,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 07:41:47,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 18: [2022-11-27 07:41:47,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:41:47,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:41:47,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 6: [2022-11-27 07:41:47,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:41:47,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 6: [2022-11-27 07:41:47,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 24: [2022-11-27 07:41:47,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 07:41:47,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 6: [2022-11-27 07:41:47,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 4: [2022-11-27 07:41:47,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-27 07:41:47,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 1: [2022-11-27 07:41:47,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:41:47,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 1: [2022-11-27 07:41:47,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 07:41:47,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 0: [2022-11-27 07:41:47,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:41:47,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 07:41:47,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 10: [2022-11-27 07:41:47,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:41:47,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-27 07:41:47,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 10: [2022-11-27 07:41:47,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 07:41:47,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 7: [2022-11-27 07:41:47,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 14: [2022-11-27 07:41:47,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:41:47,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 07:41:47,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 17: [2022-11-27 07:41:47,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:41:47,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 07:41:47,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 27: [2022-11-27 07:41:47,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
27: [2022-11-27 07:41:47,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 07:41:47,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 29: [2022-11-27 07:41:47,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:41:47,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 07:41:47,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 30: [2022-11-27 07:41:47,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:41:47,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 12: [2022-11-27 07:41:47,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:41:47,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 9: [2022-11-27 07:41:47,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
12: [2022-11-27 07:41:47,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 31: [2022-11-27 07:41:47,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:41:47,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 31: [2022-11-27 07:41:47,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 12: [2022-11-27 07:41:47,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 31: [2022-11-27 07:41:47,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 9: [2022-11-27 07:41:47,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 26: [2022-11-27 07:41:47,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 28: [2022-11-27 07:41:47,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:41:47,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 07:41:47,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 15: [2022-11-27 07:41:47,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
28: [2022-11-27 07:41:47,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 07:41:47,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 15: [2022-11-27 07:41:47,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 23: [2022-11-27 07:41:47,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:41:47,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 23: [2022-11-27 07:41:47,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 07:41:47,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 13: [2022-11-27 07:41:47,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:41:47,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 07:41:47,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 16: [2022-11-27 07:41:47,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-27 07:41:47,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 07:41:47,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 18: [2022-11-27 07:41:47,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:41:47,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:41:47,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 07:41:47,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 3: [2022-11-27 07:41:47,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:41:47,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 07:41:47,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 8: [2022-11-27 07:41:47,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
3: [2022-11-27 07:41:47,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 8: [2022-11-27 07:41:47,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 07:41:47,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 3: [2022-11-27 07:41:47,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 11: [2022-11-27 07:41:47,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:41:47,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 07:41:47,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 25: [2022-11-27 07:41:47,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:41:47,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 07:41:47,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 20: [2022-11-27 07:41:47,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 
20: [2022-11-27 07:41:47,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 07:41:47,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 22: [2022-11-27 07:41:47,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:41:47,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 07:41:47,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 6: [2022-11-27 07:41:47,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:41:47,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 07:41:47,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 5: [2022-11-27 07:41:47,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:41:47,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 07:41:47,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 24: [2022-11-27 07:41:47,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 
5: [2022-11-27 07:41:47,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:41:47,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 07:41:47,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 24: [2022-11-27 07:41:47,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 07:41:47,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 11: [2022-11-27 07:41:47,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:41:47,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 07:41:47,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 24: [2022-11-27 07:41:47,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:41:47,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-27 07:41:47,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 6: [2022-11-27 07:41:47,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-27 07:41:47,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 07:41:47,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 5: [2022-11-27 07:41:47,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:41:47,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 07:41:47,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 19: [2022-11-27 07:41:47,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:41:47,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 07:41:47,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 2: [2022-11-27 07:41:47,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:41:47,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:41:47,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-27 07:41:47,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 07:41:47,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 07:41:47,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step167000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 07:41:47,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 2: [2022-11-27 07:41:47,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 2: [2022-11-27 07:41:47,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step167000 is ready now! 0: successfully saved checkpoint at iteration 167000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2659.50 31: iteration 167010/ 173500 | consumed samples: 42754560 | consumed tokens: 87561338880 | elapsed time per iteration (s): 1.11 | learning rate: 2.063E-05 | global batch size: 256 | lm loss: 1.914428E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.886 | TFLOPs: 13.91 | 31: iteration 167020/ 173500 | consumed samples: 42757120 | consumed tokens: 87566581760 | elapsed time per iteration (s): 0.81 | learning rate: 2.063E-05 | global batch size: 256 | lm loss: 1.895853E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.123 | TFLOPs: 19.19 | 31: iteration 167030/ 173500 | consumed samples: 42759680 | consumed tokens: 87571824640 | elapsed time per iteration (s): 0.74 | learning rate: 2.063E-05 | global batch size: 256 | lm loss: 1.906960E+00 | grad norm: 0.202 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.085 | TFLOPs: 20.88 | 31: iteration 167040/ 173500 | consumed samples: 42762240 | consumed tokens: 87577067520 | elapsed time per iteration (s): 0.74 | learning rate: 2.063E-05 | global batch size: 256 | lm loss: 1.896185E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.519 | TFLOPs: 20.84 | 31: iteration 167050/ 173500 | consumed samples: 42764800 | consumed tokens: 87582310400 | elapsed time per iteration (s): 0.79 | learning rate: 2.063E-05 | global batch size: 256 | lm loss: 1.918875E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.967 | TFLOPs: 19.66 | 31: iteration 167060/ 173500 | consumed samples: 42767360 | consumed tokens: 87587553280 | elapsed time per iteration (s): 0.84 | learning rate: 2.062E-05 | global batch size: 256 | lm loss: 1.899190E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.200 | TFLOPs: 18.34 | 31: iteration 167070/ 173500 | consumed samples: 42769920 | consumed tokens: 87592796160 | elapsed time per iteration (s): 0.76 | learning rate: 2.062E-05 | global batch size: 256 | lm loss: 1.892395E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.484 | TFLOPs: 20.36 | 31: iteration 167080/ 173500 | consumed samples: 42772480 | consumed tokens: 87598039040 | elapsed time per iteration (s): 0.79 | learning rate: 2.062E-05 | global batch size: 256 | lm loss: 1.889056E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.467 | TFLOPs: 19.69 | 31: iteration 167090/ 173500 | consumed samples: 42775040 | consumed tokens: 87603281920 | elapsed time per iteration (s): 
0.82 | learning rate: 2.062E-05 | global batch size: 256 | lm loss: 1.926375E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.957 | TFLOPs: 18.93 | 31: iteration 167100/ 173500 | consumed samples: 42777600 | consumed tokens: 87608524800 | elapsed time per iteration (s): 0.89 | learning rate: 2.062E-05 | global batch size: 256 | lm loss: 1.892424E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 287.924 | TFLOPs: 17.42 | 31: iteration 167110/ 173500 | consumed samples: 42780160 | consumed tokens: 87613767680 | elapsed time per iteration (s): 0.84 | learning rate: 2.061E-05 | global batch size: 256 | lm loss: 1.906934E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.872 | TFLOPs: 18.38 | 31: iteration 167120/ 173500 | consumed samples: 42782720 | consumed tokens: 87619010560 | elapsed time per iteration (s): 0.80 | learning rate: 2.061E-05 | global batch size: 256 | lm loss: 1.935049E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.541 | TFLOPs: 19.39 | 31: iteration 167130/ 173500 | consumed samples: 42785280 | consumed tokens: 87624253440 | elapsed time per iteration (s): 0.79 | learning rate: 2.061E-05 | global batch size: 256 | lm loss: 1.893113E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.289 | TFLOPs: 19.56 | 31: iteration 167140/ 173500 | consumed samples: 42787840 | consumed tokens: 87629496320 | elapsed time per iteration (s): 0.77 | learning rate: 2.061E-05 | global batch size: 256 | lm loss: 1.892495E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.501 | TFLOPs: 20.24 | 31: 
iteration 167150/ 173500 | consumed samples: 42790400 | consumed tokens: 87634739200 | elapsed time per iteration (s): 0.77 | learning rate: 2.061E-05 | global batch size: 256 | lm loss: 1.937551E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.330 | TFLOPs: 20.11 | 31: iteration 167160/ 173500 | consumed samples: 42792960 | consumed tokens: 87639982080 | elapsed time per iteration (s): 0.77 | learning rate: 2.060E-05 | global batch size: 256 | lm loss: 1.870657E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.541 | TFLOPs: 20.24 | 31: iteration 167170/ 173500 | consumed samples: 42795520 | consumed tokens: 87645224960 | elapsed time per iteration (s): 0.78 | learning rate: 2.060E-05 | global batch size: 256 | lm loss: 1.937475E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.987 | TFLOPs: 19.90 | 31: iteration 167180/ 173500 | consumed samples: 42798080 | consumed tokens: 87650467840 | elapsed time per iteration (s): 0.75 | learning rate: 2.060E-05 | global batch size: 256 | lm loss: 1.915152E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.886 | TFLOPs: 20.56 | 31: iteration 167190/ 173500 | consumed samples: 42800640 | consumed tokens: 87655710720 | elapsed time per iteration (s): 0.75 | learning rate: 2.060E-05 | global batch size: 256 | lm loss: 1.912956E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.264 | TFLOPs: 20.52 | 31: iteration 167200/ 173500 | consumed samples: 42803200 | consumed tokens: 87660953600 | elapsed time per iteration (s): 0.79 | learning rate: 2.060E-05 | global batch size: 256 | lm loss: 1.907781E+00 | grad norm: 0.193 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.361 | TFLOPs: 19.68 | 31: iteration 167210/ 173500 | consumed samples: 42805760 | consumed tokens: 87666196480 | elapsed time per iteration (s): 0.88 | learning rate: 2.060E-05 | global batch size: 256 | lm loss: 1.908882E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.600 | TFLOPs: 17.58 | 31: iteration 167220/ 173500 | consumed samples: 42808320 | consumed tokens: 87671439360 | elapsed time per iteration (s): 0.80 | learning rate: 2.059E-05 | global batch size: 256 | lm loss: 1.923854E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.583 | TFLOPs: 19.45 | 31: iteration 167230/ 173500 | consumed samples: 42810880 | consumed tokens: 87676682240 | elapsed time per iteration (s): 0.76 | learning rate: 2.059E-05 | global batch size: 256 | lm loss: 1.929999E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.217 | TFLOPs: 20.40 | 31: iteration 167240/ 173500 | consumed samples: 42813440 | consumed tokens: 87681925120 | elapsed time per iteration (s): 0.82 | learning rate: 2.059E-05 | global batch size: 256 | lm loss: 1.896862E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.761 | TFLOPs: 18.92 | 31: iteration 167250/ 173500 | consumed samples: 42816000 | consumed tokens: 87687168000 | elapsed time per iteration (s): 0.76 | learning rate: 2.059E-05 | global batch size: 256 | lm loss: 1.900694E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.832 | TFLOPs: 20.26 | 31: iteration 167260/ 173500 | consumed samples: 42818560 | consumed tokens: 87692410880 | elapsed time per iteration (s): 0.78 | 
learning rate: 2.059E-05 | global batch size: 256 | lm loss: 1.897045E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.066 | TFLOPs: 19.91 | 31: iteration 167270/ 173500 | consumed samples: 42821120 | consumed tokens: 87697653760 | elapsed time per iteration (s): 0.80 | learning rate: 2.058E-05 | global batch size: 256 | lm loss: 1.905806E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.057 | TFLOPs: 19.42 | 31: iteration 167280/ 173500 | consumed samples: 42823680 | consumed tokens: 87702896640 | elapsed time per iteration (s): 0.78 | learning rate: 2.058E-05 | global batch size: 256 | lm loss: 1.953469E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.740 | TFLOPs: 19.77 | 31: iteration 167290/ 173500 | consumed samples: 42826240 | consumed tokens: 87708139520 | elapsed time per iteration (s): 0.83 | learning rate: 2.058E-05 | global batch size: 256 | lm loss: 1.932050E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.008 | TFLOPs: 18.57 | 31: iteration 167300/ 173500 | consumed samples: 42828800 | consumed tokens: 87713382400 | elapsed time per iteration (s): 0.80 | learning rate: 2.058E-05 | global batch size: 256 | lm loss: 1.903327E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.316 | TFLOPs: 19.32 | 31: iteration 167310/ 173500 | consumed samples: 42831360 | consumed tokens: 87718625280 | elapsed time per iteration (s): 0.78 | learning rate: 2.058E-05 | global batch size: 256 | lm loss: 1.887042E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.250 | TFLOPs: 19.74 | 31: iteration 
167320/ 173500 | consumed samples: 42833920 | consumed tokens: 87723868160 | elapsed time per iteration (s): 0.95 | learning rate: 2.057E-05 | global batch size: 256 | lm loss: 1.885982E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 268.691 | TFLOPs: 16.26 | 31: iteration 167330/ 173500 | consumed samples: 42836480 | consumed tokens: 87729111040 | elapsed time per iteration (s): 0.80 | learning rate: 2.057E-05 | global batch size: 256 | lm loss: 1.918579E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.348 | TFLOPs: 19.32 | 31: iteration 167340/ 173500 | consumed samples: 42839040 | consumed tokens: 87734353920 | elapsed time per iteration (s): 0.81 | learning rate: 2.057E-05 | global batch size: 256 | lm loss: 1.903489E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.736 | TFLOPs: 19.22 | 31: iteration 167350/ 173500 | consumed samples: 42841600 | consumed tokens: 87739596800 | elapsed time per iteration (s): 0.85 | learning rate: 2.057E-05 | global batch size: 256 | lm loss: 1.920285E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.457 | TFLOPs: 18.30 | 31: iteration 167360/ 173500 | consumed samples: 42844160 | consumed tokens: 87744839680 | elapsed time per iteration (s): 0.82 | learning rate: 2.057E-05 | global batch size: 256 | lm loss: 1.889932E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.716 | TFLOPs: 18.80 | 31: iteration 167370/ 173500 | consumed samples: 42846720 | consumed tokens: 87750082560 | elapsed time per iteration (s): 0.83 | learning rate: 2.057E-05 | global batch size: 256 | lm loss: 1.917210E+00 | grad norm: 0.205 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.533 | TFLOPs: 18.73 | 31: iteration 167380/ 173500 | consumed samples: 42849280 | consumed tokens: 87755325440 | elapsed time per iteration (s): 0.83 | learning rate: 2.056E-05 | global batch size: 256 | lm loss: 1.887158E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.036 | TFLOPs: 18.70 | 31: iteration 167390/ 173500 | consumed samples: 42851840 | consumed tokens: 87760568320 | elapsed time per iteration (s): 0.81 | learning rate: 2.056E-05 | global batch size: 256 | lm loss: 1.936562E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.066 | TFLOPs: 19.12 | 31: iteration 167400/ 173500 | consumed samples: 42854400 | consumed tokens: 87765811200 | elapsed time per iteration (s): 0.86 | learning rate: 2.056E-05 | global batch size: 256 | lm loss: 1.925678E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.259 | TFLOPs: 18.04 | 31: iteration 167410/ 173500 | consumed samples: 42856960 | consumed tokens: 87771054080 | elapsed time per iteration (s): 0.83 | learning rate: 2.056E-05 | global batch size: 256 | lm loss: 1.902950E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.562 | TFLOPs: 18.61 | 31: iteration 167420/ 173500 | consumed samples: 42859520 | consumed tokens: 87776296960 | elapsed time per iteration (s): 0.82 | learning rate: 2.056E-05 | global batch size: 256 | lm loss: 1.939384E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.401 | TFLOPs: 18.96 | 31: iteration 167430/ 173500 | consumed samples: 42862080 | consumed tokens: 87781539840 | elapsed time per iteration (s): 0.83 | learning 
rate: 2.055E-05 | global batch size: 256 | lm loss: 1.886285E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.586 | TFLOPs: 18.61 | 31: iteration 167440/ 173500 | consumed samples: 42864640 | consumed tokens: 87786782720 | elapsed time per iteration (s): 0.81 | learning rate: 2.055E-05 | global batch size: 256 | lm loss: 1.889574E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.450 | TFLOPs: 19.02 | 31: iteration 167450/ 173500 | consumed samples: 42867200 | consumed tokens: 87792025600 | elapsed time per iteration (s): 0.85 | learning rate: 2.055E-05 | global batch size: 256 | lm loss: 1.909471E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.546 | TFLOPs: 18.24 | 31: iteration 167460/ 173500 | consumed samples: 42869760 | consumed tokens: 87797268480 | elapsed time per iteration (s): 0.80 | learning rate: 2.055E-05 | global batch size: 256 | lm loss: 1.890484E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.614 | TFLOPs: 19.28 | 31: iteration 167470/ 173500 | consumed samples: 42872320 | consumed tokens: 87802511360 | elapsed time per iteration (s): 0.77 | learning rate: 2.055E-05 | global batch size: 256 | lm loss: 1.912289E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.167 | TFLOPs: 20.10 | 31: iteration 167480/ 173500 | consumed samples: 42874880 | consumed tokens: 87807754240 | elapsed time per iteration (s): 0.75 | learning rate: 2.055E-05 | global batch size: 256 | lm loss: 1.873006E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.230 | TFLOPs: 20.52 | 31: iteration 167490/ 
173500 | consumed samples: 42877440 | consumed tokens: 87812997120 | elapsed time per iteration (s): 0.76 | learning rate: 2.054E-05 | global batch size: 256 | lm loss: 1.903901E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.773 | TFLOPs: 20.37 | 31: iteration 167500/ 173500 | consumed samples: 42880000 | consumed tokens: 87818240000 | elapsed time per iteration (s): 0.74 | learning rate: 2.054E-05 | global batch size: 256 | lm loss: 1.883764E+00 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.351 | TFLOPs: 20.95 | 31: iteration 167510/ 173500 | consumed samples: 42882560 | consumed tokens: 87823482880 | elapsed time per iteration (s): 0.72 | learning rate: 2.054E-05 | global batch size: 256 | lm loss: 1.883779E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.010 | TFLOPs: 21.66 | 31: iteration 167520/ 173500 | consumed samples: 42885120 | consumed tokens: 87828725760 | elapsed time per iteration (s): 0.79 | learning rate: 2.054E-05 | global batch size: 256 | lm loss: 1.925147E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.916 | TFLOPs: 19.72 | 31: iteration 167530/ 173500 | consumed samples: 42887680 | consumed tokens: 87833968640 | elapsed time per iteration (s): 0.75 | learning rate: 2.054E-05 | global batch size: 256 | lm loss: 1.929139E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.970 | TFLOPs: 20.57 | 31: iteration 167540/ 173500 | consumed samples: 42890240 | consumed tokens: 87839211520 | elapsed time per iteration (s): 0.81 | learning rate: 2.053E-05 | global batch size: 256 | lm loss: 1.900832E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 315.152 | TFLOPs: 19.07 | 31: iteration 167550/ 173500 | consumed samples: 42892800 | consumed tokens: 87844454400 | elapsed time per iteration (s): 0.80 | learning rate: 2.053E-05 | global batch size: 256 | lm loss: 1.896240E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.108 | TFLOPs: 19.37 | 31: iteration 167560/ 173500 | consumed samples: 42895360 | consumed tokens: 87849697280 | elapsed time per iteration (s): 0.86 | learning rate: 2.053E-05 | global batch size: 256 | lm loss: 1.888897E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.830 | TFLOPs: 17.96 | 31: iteration 167570/ 173500 | consumed samples: 42897920 | consumed tokens: 87854940160 | elapsed time per iteration (s): 0.93 | learning rate: 2.053E-05 | global batch size: 256 | lm loss: 1.878139E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 274.839 | TFLOPs: 16.63 | 31: iteration 167580/ 173500 | consumed samples: 42900480 | consumed tokens: 87860183040 | elapsed time per iteration (s): 0.82 | learning rate: 2.053E-05 | global batch size: 256 | lm loss: 1.900261E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.364 | TFLOPs: 18.90 | 31: iteration 167590/ 173500 | consumed samples: 42903040 | consumed tokens: 87865425920 | elapsed time per iteration (s): 0.81 | learning rate: 2.053E-05 | global batch size: 256 | lm loss: 1.900778E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.916 | TFLOPs: 19.05 | 31: iteration 167600/ 173500 | consumed samples: 42905600 | consumed tokens: 87870668800 | elapsed time per iteration (s): 0.82 | learning rate: 
2.052E-05 | global batch size: 256 | lm loss: 1.886498E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.806 | TFLOPs: 18.80 | 31: iteration 167610/ 173500 | consumed samples: 42908160 | consumed tokens: 87875911680 | elapsed time per iteration (s): 0.81 | learning rate: 2.052E-05 | global batch size: 256 | lm loss: 1.932862E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.025 | TFLOPs: 19.06 | 31: iteration 167620/ 173500 | consumed samples: 42910720 | consumed tokens: 87881154560 | elapsed time per iteration (s): 0.82 | learning rate: 2.052E-05 | global batch size: 256 | lm loss: 1.913681E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.980 | TFLOPs: 18.87 | 31: iteration 167630/ 173500 | consumed samples: 42913280 | consumed tokens: 87886397440 | elapsed time per iteration (s): 0.81 | learning rate: 2.052E-05 | global batch size: 256 | lm loss: 1.889567E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.277 | TFLOPs: 19.07 | 31: iteration 167640/ 173500 | consumed samples: 42915840 | consumed tokens: 87891640320 | elapsed time per iteration (s): 0.82 | learning rate: 2.052E-05 | global batch size: 256 | lm loss: 1.911251E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.282 | TFLOPs: 18.95 | 31: iteration 167650/ 173500 | consumed samples: 42918400 | consumed tokens: 87896883200 | elapsed time per iteration (s): 0.80 | learning rate: 2.051E-05 | global batch size: 256 | lm loss: 1.919352E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.787 | TFLOPs: 19.47 | 31: iteration 167660/ 173500 | 
consumed samples: 42920960 | consumed tokens: 87902126080 | elapsed time per iteration (s): 0.79 | learning rate: 2.051E-05 | global batch size: 256 | lm loss: 1.913920E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.170 | TFLOPs: 19.67 | 31: iteration 167670/ 173500 | consumed samples: 42923520 | consumed tokens: 87907368960 | elapsed time per iteration (s): 0.80 | learning rate: 2.051E-05 | global batch size: 256 | lm loss: 1.899925E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.336 | TFLOPs: 19.32 | 31: iteration 167680/ 173500 | consumed samples: 42926080 | consumed tokens: 87912611840 | elapsed time per iteration (s): 0.82 | learning rate: 2.051E-05 | global batch size: 256 | lm loss: 1.882679E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.622 | TFLOPs: 18.85 | 31: iteration 167690/ 173500 | consumed samples: 42928640 | consumed tokens: 87917854720 | elapsed time per iteration (s): 0.80 | learning rate: 2.051E-05 | global batch size: 256 | lm loss: 1.918208E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.451 | TFLOPs: 19.39 | 31: iteration 167700/ 173500 | consumed samples: 42931200 | consumed tokens: 87923097600 | elapsed time per iteration (s): 0.82 | learning rate: 2.051E-05 | global batch size: 256 | lm loss: 1.885702E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.171 | TFLOPs: 18.95 | 31: iteration 167710/ 173500 | consumed samples: 42933760 | consumed tokens: 87928340480 | elapsed time per iteration (s): 0.81 | learning rate: 2.050E-05 | global batch size: 256 | lm loss: 1.883498E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 314.606 | TFLOPs: 19.03 | 31: iteration 167720/ 173500 | consumed samples: 42936320 | consumed tokens: 87933583360 | elapsed time per iteration (s): 0.84 | learning rate: 2.050E-05 | global batch size: 256 | lm loss: 1.904080E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.256 | TFLOPs: 18.47 | 31: iteration 167730/ 173500 | consumed samples: 42938880 | consumed tokens: 87938826240 | elapsed time per iteration (s): 0.75 | learning rate: 2.050E-05 | global batch size: 256 | lm loss: 1.867147E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.314 | TFLOPs: 20.71 | 31: iteration 167740/ 173500 | consumed samples: 42941440 | consumed tokens: 87944069120 | elapsed time per iteration (s): 0.92 | learning rate: 2.050E-05 | global batch size: 256 | lm loss: 1.913331E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 276.908 | TFLOPs: 16.75 | 31: iteration 167750/ 173500 | consumed samples: 42944000 | consumed tokens: 87949312000 | elapsed time per iteration (s): 0.76 | learning rate: 2.050E-05 | global batch size: 256 | lm loss: 1.922214E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.766 | TFLOPs: 20.43 | 31: iteration 167760/ 173500 | consumed samples: 42946560 | consumed tokens: 87954554880 | elapsed time per iteration (s): 0.75 | learning rate: 2.050E-05 | global batch size: 256 | lm loss: 1.934443E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.624 | TFLOPs: 20.67 | 31: iteration 167770/ 173500 | consumed samples: 42949120 | consumed tokens: 87959797760 | elapsed time per iteration (s): 0.77 | learning rate: 
2.049E-05 | global batch size: 256 | lm loss: 1.894297E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.222 | TFLOPs: 20.16 | 31: iteration 167780/ 173500 | consumed samples: 42951680 | consumed tokens: 87965040640 | elapsed time per iteration (s): 0.75 | learning rate: 2.049E-05 | global batch size: 256 | lm loss: 1.900155E+00 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.771 | TFLOPs: 20.74 | 31: iteration 167790/ 173500 | consumed samples: 42954240 | consumed tokens: 87970283520 | elapsed time per iteration (s): 0.75 | learning rate: 2.049E-05 | global batch size: 256 | lm loss: 1.928578E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.347 | TFLOPs: 20.53 | 31: iteration 167800/ 173500 | consumed samples: 42956800 | consumed tokens: 87975526400 | elapsed time per iteration (s): 0.73 | learning rate: 2.049E-05 | global batch size: 256 | lm loss: 1.911717E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.820 | TFLOPs: 21.28 | 31: iteration 167810/ 173500 | consumed samples: 42959360 | consumed tokens: 87980769280 | elapsed time per iteration (s): 0.75 | learning rate: 2.049E-05 | global batch size: 256 | lm loss: 1.893802E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.189 | TFLOPs: 20.64 | 31: iteration 167820/ 173500 | consumed samples: 42961920 | consumed tokens: 87986012160 | elapsed time per iteration (s): 1.01 | learning rate: 2.049E-05 | global batch size: 256 | lm loss: 1.925794E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 252.857 | TFLOPs: 15.30 | 31: iteration 167830/ 173500 | 
consumed samples: 42964480 | consumed tokens: 87991255040 | elapsed time per iteration (s): 0.78 | learning rate: 2.048E-05 | global batch size: 256 | lm loss: 1.934616E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.566 | TFLOPs: 19.94 | 31: iteration 167840/ 173500 | consumed samples: 42967040 | consumed tokens: 87996497920 | elapsed time per iteration (s): 0.83 | learning rate: 2.048E-05 | global batch size: 256 | lm loss: 1.891367E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.616 | TFLOPs: 18.61 | 31: iteration 167850/ 173500 | consumed samples: 42969600 | consumed tokens: 88001740800 | elapsed time per iteration (s): 0.74 | learning rate: 2.048E-05 | global batch size: 256 | lm loss: 1.923393E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.025 | TFLOPs: 20.99 | 31: iteration 167860/ 173500 | consumed samples: 42972160 | consumed tokens: 88006983680 | elapsed time per iteration (s): 0.80 | learning rate: 2.048E-05 | global batch size: 256 | lm loss: 1.905315E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.930 | TFLOPs: 19.48 | 31: iteration 167870/ 173500 | consumed samples: 42974720 | consumed tokens: 88012226560 | elapsed time per iteration (s): 0.81 | learning rate: 2.048E-05 | global batch size: 256 | lm loss: 1.884774E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.846 | TFLOPs: 19.23 | 31: iteration 167880/ 173500 | consumed samples: 42977280 | consumed tokens: 88017469440 | elapsed time per iteration (s): 0.85 | learning rate: 2.048E-05 | global batch size: 256 | lm loss: 1.889660E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 299.464 | TFLOPs: 18.12 | 31: iteration 167890/ 173500 | consumed samples: 42979840 | consumed tokens: 88022712320 | elapsed time per iteration (s): 0.80 | learning rate: 2.047E-05 | global batch size: 256 | lm loss: 1.920297E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.627 | TFLOPs: 19.34 | 31: iteration 167900/ 173500 | consumed samples: 42982400 | consumed tokens: 88027955200 | elapsed time per iteration (s): 0.87 | learning rate: 2.047E-05 | global batch size: 256 | lm loss: 1.918228E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.562 | TFLOPs: 17.88 | 31: iteration 167910/ 173500 | consumed samples: 42984960 | consumed tokens: 88033198080 | elapsed time per iteration (s): 0.81 | learning rate: 2.047E-05 | global batch size: 256 | lm loss: 1.928117E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.666 | TFLOPs: 19.10 | 31: iteration 167920/ 173500 | consumed samples: 42987520 | consumed tokens: 88038440960 | elapsed time per iteration (s): 0.77 | learning rate: 2.047E-05 | global batch size: 256 | lm loss: 1.914202E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.259 | TFLOPs: 20.16 | 31: iteration 167930/ 173500 | consumed samples: 42990080 | consumed tokens: 88043683840 | elapsed time per iteration (s): 0.90 | learning rate: 2.047E-05 | global batch size: 256 | lm loss: 1.884070E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.476 | TFLOPs: 17.21 | 31: iteration 167940/ 173500 | consumed samples: 42992640 | consumed tokens: 88048926720 | elapsed time per iteration (s): 0.76 | learning rate: 
2.047E-05 | global batch size: 256 | lm loss: 1.933580E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.108 | TFLOPs: 20.45 | 31: iteration 167950/ 173500 | consumed samples: 42995200 | consumed tokens: 88054169600 | elapsed time per iteration (s): 0.79 | learning rate: 2.046E-05 | global batch size: 256 | lm loss: 1.917206E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.836 | TFLOPs: 19.59 | 31: iteration 167960/ 173500 | consumed samples: 42997760 | consumed tokens: 88059412480 | elapsed time per iteration (s): 0.78 | learning rate: 2.046E-05 | global batch size: 256 | lm loss: 1.921662E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.918 | TFLOPs: 19.90 | 31: iteration 167970/ 173500 | consumed samples: 43000320 | consumed tokens: 88064655360 | elapsed time per iteration (s): 0.77 | learning rate: 2.046E-05 | global batch size: 256 | lm loss: 1.885617E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.457 | TFLOPs: 20.11 | 31: iteration 167980/ 173500 | consumed samples: 43002880 | consumed tokens: 88069898240 | elapsed time per iteration (s): 0.74 | learning rate: 2.046E-05 | global batch size: 256 | lm loss: 1.931570E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.338 | TFLOPs: 20.83 | 31: iteration 167990/ 173500 | consumed samples: 43005440 | consumed tokens: 88075141120 | elapsed time per iteration (s): 0.81 | learning rate: 2.046E-05 | global batch size: 256 | lm loss: 1.892831E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.319 | TFLOPs: 19.14 | 0: [2022-11-27 07:55:10,568] 
[INFO] [logging.py:68:log_dist] [Rank 0] step=168000, skipped=0, lr=[2.0455079592202583e-05, 2.0455079592202583e-05, 2.0455079592202583e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 168000/ 173500 | consumed samples: 43008000 | consumed tokens: 88080384000 | elapsed time per iteration (s): 0.75 | learning rate: 2.046E-05 | global batch size: 256 | lm loss: 1.908325E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.713 | TFLOPs: 20.67 | 0: steps: 168000 loss: 1.9276 iter time (s): 0.789 samples/sec: 324.500 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 168000 | lm loss value: 1.792156E+00 | lm loss PPL: 6.002382E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 168000 to checkpoints_1b1long 0: [2022-11-27 07:55:10,830] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step168000 is begin to save! 0: [2022-11-27 07:55:10,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_01-model_00-model_states.pt... 0: [2022-11-27 07:55:11,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_01-model_00-model_states.pt. 0: [2022-11-27 07:55:11,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_03-model_00-model_states.pt... 0: [2022-11-27 07:55:11,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_03-model_00-model_states.pt. 0: [2022-11-27 07:55:11,156] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_04-model_00-model_states.pt... 
0: [2022-11-27 07:55:11,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_04-model_00-model_states.pt. 0: [2022-11-27 07:55:11,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_05-model_00-model_states.pt... 0: [2022-11-27 07:55:11,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_05-model_00-model_states.pt. 0: [2022-11-27 07:55:11,312] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_06-model_00-model_states.pt... 0: [2022-11-27 07:55:11,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_06-model_00-model_states.pt. 0: [2022-11-27 07:55:11,389] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_07-model_00-model_states.pt... 0: [2022-11-27 07:55:11,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_07-model_00-model_states.pt. 0: [2022-11-27 07:55:11,467] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_08-model_00-model_states.pt... 0: [2022-11-27 07:55:11,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_08-model_00-model_states.pt. 0: [2022-11-27 07:55:11,553] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_09-model_00-model_states.pt... 0: [2022-11-27 07:55:11,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_09-model_00-model_states.pt. 0: [2022-11-27 07:55:11,631] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_10-model_00-model_states.pt... 
0: [2022-11-27 07:55:11,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_10-model_00-model_states.pt. 0: [2022-11-27 07:55:11,705] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_11-model_00-model_states.pt... 0: [2022-11-27 07:55:11,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_11-model_00-model_states.pt. 0: [2022-11-27 07:55:11,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_12-model_00-model_states.pt... 0: [2022-11-27 07:55:11,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_12-model_00-model_states.pt. 0: [2022-11-27 07:55:11,867] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_13-model_00-model_states.pt... 0: [2022-11-27 07:55:11,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_13-model_00-model_states.pt. 0: [2022-11-27 07:55:11,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_14-model_00-model_states.pt... 0: [2022-11-27 07:55:12,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_14-model_00-model_states.pt. 0: [2022-11-27 07:55:12,030] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_15-model_00-model_states.pt... 0: [2022-11-27 07:55:12,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_15-model_00-model_states.pt. 0: [2022-11-27 07:55:12,110] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_16-model_00-model_states.pt... 
0: [2022-11-27 07:55:12,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_16-model_00-model_states.pt. 0: [2022-11-27 07:55:12,186] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_17-model_00-model_states.pt... 0: [2022-11-27 07:55:12,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_17-model_00-model_states.pt. 0: [2022-11-27 07:55:12,263] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_18-model_00-model_states.pt... 0: [2022-11-27 07:55:12,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_18-model_00-model_states.pt. 0: [2022-11-27 07:55:12,339] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_19-model_00-model_states.pt... 0: [2022-11-27 07:55:12,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_19-model_00-model_states.pt. 0: [2022-11-27 07:55:12,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_20-model_00-model_states.pt... 0: [2022-11-27 07:55:12,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_20-model_00-model_states.pt. 0: [2022-11-27 07:55:12,493] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_21-model_00-model_states.pt... 0: [2022-11-27 07:55:12,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_21-model_00-model_states.pt. 0: [2022-11-27 07:55:12,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_22-model_00-model_states.pt... 
0: [2022-11-27 07:55:12,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_22-model_00-model_states.pt. 0: [2022-11-27 07:55:12,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_23-model_00-model_states.pt... 0: [2022-11-27 07:55:12,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_23-model_00-model_states.pt. 0: [2022-11-27 07:55:12,730] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_24-model_00-model_states.pt... 0: [2022-11-27 07:55:12,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_24-model_00-model_states.pt. 0: [2022-11-27 07:55:12,809] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_25-model_00-model_states.pt... 0: [2022-11-27 07:55:12,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_25-model_00-model_states.pt. 0: [2022-11-27 07:55:12,886] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_26-model_00-model_states.pt... 0: [2022-11-27 07:55:12,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_26-model_00-model_states.pt. 0: [2022-11-27 07:55:12,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_27-model_00-model_states.pt... 0: [2022-11-27 07:55:13,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_27-model_00-model_states.pt. 0: [2022-11-27 07:55:13,045] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_28-model_00-model_states.pt... 
0: [2022-11-27 07:55:13,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_28-model_00-model_states.pt. 0: [2022-11-27 07:55:13,123] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/layer_30-model_00-model_states.pt... 0: [2022-11-27 07:55:13,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/layer_30-model_00-model_states.pt. 0: [2022-11-27 07:55:13,127] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step168000/mp_rank_00_model_states.pt 0: [2022-11-27 07:55:13,127] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/mp_rank_00_model_states.pt... 0: [2022-11-27 07:55:13,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/mp_rank_00_model_states.pt. 0: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
5: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
8: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
13: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 
23: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 
24: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 
30: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 
19: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
5: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
8: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 
16: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
12: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 
11: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 
22: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 
27: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 
10: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
12: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
14: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 
18: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
9: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 
25: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 
17: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 26: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 19: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
15: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 28: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 
19: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 20: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 29: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 30: [2022-11-27 07:55:13,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:55:13,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-27 07:55:13,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 07:55:13,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 24: [2022-11-27 07:55:13,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:55:13,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 07:55:13,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 9: [2022-11-27 07:55:13,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:55:13,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 07:55:13,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 11: [2022-11-27 07:55:13,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:55:13,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 07:55:13,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 30: [2022-11-27 07:55:13,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
30: [2022-11-27 07:55:13,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 07:55:13,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 16: [2022-11-27 07:55:13,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:55:13,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:55:13,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 07:55:13,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 0: [2022-11-27 07:55:13,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 07:55:13,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 7: [2022-11-27 07:55:13,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:55:13,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 07:55:13,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 2: [2022-11-27 07:55:13,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-27 07:55:13,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 07:55:13,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 23: [2022-11-27 07:55:13,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:55:13,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:55:13,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 4: [2022-11-27 07:55:13,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:55:13,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 5: [2022-11-27 07:55:13,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 8: [2022-11-27 07:55:13,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:55:13,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 4: [2022-11-27 07:55:13,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 07:55:13,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 
8: [2022-11-27 07:55:13,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 07:55:13,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 0: [2022-11-27 07:55:13,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:55:13,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 07:55:13,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 24: [2022-11-27 07:55:13,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:55:13,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-27 07:55:13,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 26: [2022-11-27 07:55:13,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:55:13,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-27 07:55:13,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 10: [2022-11-27 07:55:13,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-27 07:55:13,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 07:55:13,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 31: [2022-11-27 07:55:13,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:55:13,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 07:55:13,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 25: [2022-11-27 07:55:13,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:55:13,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:55:13,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 25: [2022-11-27 07:55:13,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 22: [2022-11-27 07:55:13,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 25: [2022-11-27 07:55:13,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 12: [2022-11-27 07:55:13,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-27 07:55:13,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 07:55:13,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 29: [2022-11-27 07:55:13,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:55:13,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 20: [2022-11-27 07:55:13,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:55:13,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:55:13,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 20: [2022-11-27 07:55:13,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:55:13,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 20: [2022-11-27 07:55:13,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 07:55:13,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 
20: [2022-11-27 07:55:13,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 07:55:13,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 25: [2022-11-27 07:55:13,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 27: [2022-11-27 07:55:13,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:55:13,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 07:55:13,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 3: [2022-11-27 07:55:13,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:55:13,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 07:55:13,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 31: [2022-11-27 07:55:13,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:55:13,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 07:55:13,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 
6: [2022-11-27 07:55:13,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:55:13,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:55:13,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 21: [2022-11-27 07:55:13,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 6: [2022-11-27 07:55:13,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 21: [2022-11-27 07:55:13,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 9: [2022-11-27 07:55:13,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:55:13,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 07:55:13,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 14: [2022-11-27 07:55:13,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:55:13,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
14: [2022-11-27 07:55:13,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 21: [2022-11-27 07:55:13,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 14: [2022-11-27 07:55:13,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 21: [2022-11-27 07:55:13,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 28: [2022-11-27 07:55:13,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:55:13,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:55:13,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 07:55:13,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 30: [2022-11-27 07:55:13,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:55:13,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 07:55:13,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 6: [2022-11-27 07:55:13,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
4: [2022-11-27 07:55:13,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:55:13,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 4: [2022-11-27 07:55:13,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 6: [2022-11-27 07:55:13,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 4: [2022-11-27 07:55:13,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 16: [2022-11-27 07:55:13,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:55:13,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:55:13,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 12: [2022-11-27 07:55:13,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 16: [2022-11-27 07:55:13,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 12: [2022-11-27 07:55:13,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 0: [2022-11-27 07:55:13,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-27 07:55:13,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 07:55:13,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 7: [2022-11-27 07:55:13,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:55:13,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 07:55:13,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 21: [2022-11-27 07:55:13,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:55:13,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 07:55:13,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 15: [2022-11-27 07:55:13,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:55:13,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 13: [2022-11-27 07:55:13,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:55:13,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 
13: [2022-11-27 07:55:13,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 3: [2022-11-27 07:55:13,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:55:13,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 10: [2022-11-27 07:55:13,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:55:13,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 10: [2022-11-27 07:55:13,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 07:55:13,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 3: [2022-11-27 07:55:13,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 28: [2022-11-27 07:55:13,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 07:55:13,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 28: [2022-11-27 07:55:13,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
28: [2022-11-27 07:55:13,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 07:55:13,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 8: [2022-11-27 07:55:13,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 28: [2022-11-27 07:55:13,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:55:13,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 28: [2022-11-27 07:55:13,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 8: [2022-11-27 07:55:13,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 28: [2022-11-27 07:55:13,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 10: [2022-11-27 07:55:13,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:55:13,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 3: [2022-11-27 07:55:13,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:55:13,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 
3: [2022-11-27 07:55:13,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 07:55:13,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 17: [2022-11-27 07:55:13,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:55:13,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 07:55:13,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 17: [2022-11-27 07:55:13,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:55:13,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 07:55:13,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 8: [2022-11-27 07:55:13,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:55:13,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-27 07:55:13,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 8: [2022-11-27 07:55:13,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 17: [2022-11-27 07:55:13,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 8: [2022-11-27 07:55:13,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 17: [2022-11-27 07:55:13,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:55:13,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:55:13,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 17: [2022-11-27 07:55:13,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 22: [2022-11-27 07:55:13,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 17: [2022-11-27 07:55:13,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 16: [2022-11-27 07:55:13,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
16: [2022-11-27 07:55:13,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 07:55:13,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 26: [2022-11-27 07:55:13,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:55:13,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 07:55:13,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 14: [2022-11-27 07:55:13,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:55:13,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 5: [2022-11-27 07:55:13,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:55:13,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 5: [2022-11-27 07:55:13,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 07:55:13,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 5: [2022-11-27 07:55:13,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
30: [2022-11-27 07:55:13,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:55:13,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 25: [2022-11-27 07:55:13,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:55:13,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 25: [2022-11-27 07:55:13,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 13: [2022-11-27 07:55:13,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:55:13,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 30: [2022-11-27 07:55:13,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 26: [2022-11-27 07:55:13,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:55:13,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 07:55:13,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 30: [2022-11-27 07:55:13,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 
26: [2022-11-27 07:55:13,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 10: [2022-11-27 07:55:13,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:55:13,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 10: [2022-11-27 07:55:13,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 20: [2022-11-27 07:55:13,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:55:13,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:55:13,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:55:13,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 
20: [2022-11-27 07:55:13,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 22: [2022-11-27 07:55:13,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 20: [2022-11-27 07:55:13,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 22: [2022-11-27 07:55:13,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 20: [2022-11-27 07:55:13,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 3: [2022-11-27 07:55:13,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:55:13,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:55:13,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 3: [2022-11-27 07:55:13,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 12: [2022-11-27 07:55:13,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 3: [2022-11-27 07:55:13,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 12: [2022-11-27 07:55:13,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 
24: [2022-11-27 07:55:13,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:55:13,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 07:55:13,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 2: [2022-11-27 07:55:13,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:55:13,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 07:55:13,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 15: [2022-11-27 07:55:13,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:55:13,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 07:55:13,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 1: [2022-11-27 07:55:13,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:55:13,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 07:55:13,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 
31: [2022-11-27 07:55:13,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:55:13,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 07:55:13,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 27: [2022-11-27 07:55:13,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:55:13,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 07:55:13,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 29: [2022-11-27 07:55:13,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:55:13,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 07:55:13,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 27: [2022-11-27 07:55:13,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:55:13,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
27: [2022-11-27 07:55:13,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 0: [2022-11-27 07:55:13,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 21: [2022-11-27 07:55:13,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:55:13,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 21: [2022-11-27 07:55:13,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 0: [2022-11-27 07:55:13,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 21: [2022-11-27 07:55:13,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 9: [2022-11-27 07:55:13,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:55:13,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 8: [2022-11-27 07:55:13,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:55:13,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 
8: [2022-11-27 07:55:13,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 07:55:13,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 5: [2022-11-27 07:55:13,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:55:13,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 07:55:13,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 25: [2022-11-27 07:55:13,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:55:13,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 07:55:13,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 7: [2022-11-27 07:55:13,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:55:13,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 07:55:13,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 11: [2022-11-27 07:55:13,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-27 07:55:13,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:55:13,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:55:13,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 11: [2022-11-27 07:55:13,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 07:55:13,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 29: [2022-11-27 07:55:13,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:55:13,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 11: [2022-11-27 07:55:13,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 11: [2022-11-27 07:55:13,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 29: [2022-11-27 07:55:13,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 07:55:13,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 23: [2022-11-27 07:55:13,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
23: [2022-11-27 07:55:13,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 07:55:13,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 28: [2022-11-27 07:55:13,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:55:13,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:55:13,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:55:13,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 14: [2022-11-27 07:55:13,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 30: [2022-11-27 07:55:13,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 14: [2022-11-27 07:55:13,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 26: [2022-11-27 07:55:13,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:55:13,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 07:55:13,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 
11: [2022-11-27 07:55:13,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:55:13,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 07:55:13,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 1: [2022-11-27 07:55:13,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:55:13,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 23: [2022-11-27 07:55:13,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:55:13,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 23: [2022-11-27 07:55:13,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 07:55:13,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 19: [2022-11-27 07:55:13,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:55:13,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 07:55:13,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 
28: [2022-11-27 07:55:13,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 07:55:13,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 24: [2022-11-27 07:55:13,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:55:13,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 0: [2022-11-27 07:55:13,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:55:13,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:55:13,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 24: [2022-11-27 07:55:13,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 16: [2022-11-27 07:55:13,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 6: [2022-11-27 07:55:13,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:55:13,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 07:55:13,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 
19: [2022-11-27 07:55:13,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:55:13,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 07:55:13,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 4: [2022-11-27 07:55:13,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:55:13,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:55:13,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 07:55:13,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 4: [2022-11-27 07:55:13,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 07:55:13,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 15: [2022-11-27 07:55:13,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:55:13,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 07:55:13,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 
2: [2022-11-27 07:55:13,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:55:13,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 07:55:13,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 27: [2022-11-27 07:55:13,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:55:13,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 07:55:13,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 31: [2022-11-27 07:55:13,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:55:13,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:55:13,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 13: [2022-11-27 07:55:13,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 07:55:13,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 31: [2022-11-27 07:55:13,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 
19: [2022-11-27 07:55:13,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:55:13,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 07:55:13,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 19: [2022-11-27 07:55:13,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:55:13,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-27 07:55:13,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 18: [2022-11-27 07:55:13,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:55:13,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:55:13,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
18: [2022-11-27 07:55:13,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 07:55:13,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-27 07:55:13,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 07:55:13,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 18: [2022-11-27 07:55:13,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 18: [2022-11-27 07:55:13,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 1: [2022-11-27 07:55:13,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:55:13,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:55:13,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 07:55:13,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 1: [2022-11-27 07:55:13,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 07:55:13,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 
7: [2022-11-27 07:55:13,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:55:13,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 07:55:13,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 22: [2022-11-27 07:55:13,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:55:13,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:55:13,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 29: [2022-11-27 07:55:13,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 07:55:13,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 22: [2022-11-27 07:55:13,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 23: [2022-11-27 07:55:13,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:55:13,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 07:55:13,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 
20: [2022-11-27 07:55:13,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:55:13,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 07:55:13,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 17: [2022-11-27 07:55:13,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:55:13,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 07:55:13,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 12: [2022-11-27 07:55:13,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:55:13,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 07:55:13,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 6: [2022-11-27 07:55:13,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:55:13,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 07:55:13,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 
0: [2022-11-27 07:55:13,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 07:55:13,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 25: [2022-11-27 07:55:13,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:55:13,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 07:55:13,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 26: [2022-11-27 07:55:13,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:55:13,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 07:55:13,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 11: [2022-11-27 07:55:13,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:55:13,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 07:55:13,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 14: [2022-11-27 07:55:13,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-27 07:55:13,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 07:55:13,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 8: [2022-11-27 07:55:13,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:55:13,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 07:55:13,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 14: [2022-11-27 07:55:13,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:55:13,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 07:55:13,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 15: [2022-11-27 07:55:13,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:55:13,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 07:55:13,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 28: [2022-11-27 07:55:13,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 
28: [2022-11-27 07:55:13,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 07:55:13,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 9: [2022-11-27 07:55:13,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:55:13,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 07:55:13,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 3: [2022-11-27 07:55:13,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:55:13,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 07:55:13,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 24: [2022-11-27 07:55:13,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:55:13,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 07:55:13,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 31: [2022-11-27 07:55:13,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 
31: [2022-11-27 07:55:13,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 07:55:13,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 30: [2022-11-27 07:55:13,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:55:13,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 07:55:13,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 19: [2022-11-27 07:55:13,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:55:13,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 07:55:13,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 21: [2022-11-27 07:55:13,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:55:13,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 07:55:13,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 10: [2022-11-27 07:55:13,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-27 07:55:13,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 07:55:13,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 5: [2022-11-27 07:55:13,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:55:13,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 07:55:13,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 4: [2022-11-27 07:55:13,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:55:13,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:55:13,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 2: [2022-11-27 07:55:13,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 4: [2022-11-27 07:55:13,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 2: [2022-11-27 07:55:13,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 1: [2022-11-27 07:55:13,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-27 07:55:13,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 07:55:13,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 23: [2022-11-27 07:55:13,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:55:13,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 07:55:13,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 18: [2022-11-27 07:55:13,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:55:13,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 07:55:13,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 22: [2022-11-27 07:55:13,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:55:13,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 07:55:13,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 16: [2022-11-27 07:55:13,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
27: [2022-11-27 07:55:13,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:55:13,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 13: [2022-11-27 07:55:13,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:55:13,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 13: [2022-11-27 07:55:13,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 07:55:13,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 27: [2022-11-27 07:55:13,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 07:55:13,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 29: [2022-11-27 07:55:13,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:55:13,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 07:55:13,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 12: [2022-11-27 07:55:13,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-27 07:55:13,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 07:55:13,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 0: [2022-11-27 07:55:13,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:55:13,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 07:55:13,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 7: [2022-11-27 07:55:13,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:55:13,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 07:55:13,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 20: [2022-11-27 07:55:13,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-27 07:55:13,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 07:55:13,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 11: [2022-11-27 07:55:13,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-27 07:55:13,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 07:55:13,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 6: [2022-11-27 07:55:13,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:55:13,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 07:55:13,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 26: [2022-11-27 07:55:13,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:55:13,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 07:55:13,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 8: [2022-11-27 07:55:13,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:55:13,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 07:55:13,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 17: [2022-11-27 07:55:13,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 
17: [2022-11-27 07:55:13,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 07:55:13,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 25: [2022-11-27 07:55:13,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:55:13,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 07:55:13,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 14: [2022-11-27 07:55:13,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:55:13,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 07:55:13,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 28: [2022-11-27 07:55:13,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-27 07:55:13,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 07:55:13,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 3: [2022-11-27 07:55:13,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-27 07:55:13,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 07:55:13,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 31: [2022-11-27 07:55:13,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:55:13,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 07:55:13,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 5: [2022-11-27 07:55:13,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:55:13,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 07:55:13,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 16: [2022-11-27 07:55:13,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:55:13,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 07:55:13,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 30: [2022-11-27 07:55:13,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
30: [2022-11-27 07:55:13,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 07:55:13,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 9: [2022-11-27 07:55:13,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:55:13,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:55:13,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:55:13,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 07:55:13,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 21: [2022-11-27 07:55:13,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 24: [2022-11-27 07:55:13,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 21: [2022-11-27 07:55:13,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 19: [2022-11-27 07:55:13,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:55:13,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 
19: [2022-11-27 07:55:13,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 07:55:13,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 2: [2022-11-27 07:55:13,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:55:13,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 07:55:13,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 23: [2022-11-27 07:55:13,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:55:13,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 07:55:13,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 10: [2022-11-27 07:55:13,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:55:13,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 07:55:13,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 15: [2022-11-27 07:55:13,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-27 07:55:13,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 07:55:13,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 4: [2022-11-27 07:55:13,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:55:13,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 07:55:13,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 22: [2022-11-27 07:55:13,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:55:13,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 07:55:13,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 27: [2022-11-27 07:55:13,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:55:13,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 07:55:13,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 29: [2022-11-27 07:55:13,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-27 07:55:13,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 18: [2022-11-27 07:55:13,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:55:13,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 18: [2022-11-27 07:55:13,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 07:55:13,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 1: [2022-11-27 07:55:13,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:55:13,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 07:55:13,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 12: [2022-11-27 07:55:13,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:55:13,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
12: [2022-11-27 07:55:13,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 0: [2022-11-27 07:55:13,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 12: [2022-11-27 07:55:13,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 0: [2022-11-27 07:55:13,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 7: [2022-11-27 07:55:13,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:55:13,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 07:55:13,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 13: [2022-11-27 07:55:13,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:55:13,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 07:55:13,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 20: [2022-11-27 07:55:13,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 
20: [2022-11-27 07:55:13,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 07:55:13,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 6: [2022-11-27 07:55:13,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:55:13,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 07:55:13,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 26: [2022-11-27 07:55:13,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:55:13,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 07:55:13,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 25: [2022-11-27 07:55:13,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-27 07:55:13,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 07:55:13,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 11: [2022-11-27 07:55:13,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
14: [2022-11-27 07:55:13,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:55:13,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 07:55:13,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 17: [2022-11-27 07:55:13,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:55:13,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 11: [2022-11-27 07:55:13,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 07:55:13,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 17: [2022-11-27 07:55:13,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 8: [2022-11-27 07:55:13,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:55:13,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 07:55:13,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 10: [2022-11-27 07:55:13,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-27 07:55:13,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 07:55:13,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 31: [2022-11-27 07:55:13,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:55:13,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 07:55:13,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 28: [2022-11-27 07:55:13,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-27 07:55:13,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 07:55:13,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 3: [2022-11-27 07:55:13,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:55:13,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 07:55:13,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 24: [2022-11-27 07:55:13,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
24: [2022-11-27 07:55:13,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 07:55:13,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 21: [2022-11-27 07:55:13,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:55:13,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 07:55:13,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 5: [2022-11-27 07:55:13,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:55:13,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 07:55:13,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 19: [2022-11-27 07:55:13,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:55:13,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 07:55:13,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 30: [2022-11-27 07:55:13,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
30: [2022-11-27 07:55:13,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 07:55:13,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 9: [2022-11-27 07:55:13,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:55:13,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 07:55:13,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 15: [2022-11-27 07:55:13,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:55:13,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 07:55:13,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 4: [2022-11-27 07:55:13,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:55:13,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 07:55:13,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 16: [2022-11-27 07:55:13,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 
16: [2022-11-27 07:55:13,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 07:55:13,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 23: [2022-11-27 07:55:13,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:55:13,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 07:55:13,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 0: [2022-11-27 07:55:13,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:55:13,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 07:55:13,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 22: [2022-11-27 07:55:13,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:55:13,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 07:55:13,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 13: [2022-11-27 07:55:13,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-27 07:55:13,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 07:55:13,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 2: [2022-11-27 07:55:13,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:55:13,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 07:55:13,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 6: [2022-11-27 07:55:13,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:55:13,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 07:55:13,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 17: [2022-11-27 07:55:13,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-27 07:55:13,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 07:55:13,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 12: [2022-11-27 07:55:13,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-27 07:55:13,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 07:55:13,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 29: [2022-11-27 07:55:13,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:55:13,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 07:55:13,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 26: [2022-11-27 07:55:13,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-27 07:55:13,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 07:55:13,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 14: [2022-11-27 07:55:13,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:55:13,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 07:55:13,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 1: [2022-11-27 07:55:13,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
25: [2022-11-27 07:55:13,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:55:13,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 25: [2022-11-27 07:55:13,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 07:55:13,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 1: [2022-11-27 07:55:13,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 24: [2022-11-27 07:55:13,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-27 07:55:13,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 07:55:13,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 16: [2022-11-27 07:55:13,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 07:55:13,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 07:55:13,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 20: [2022-11-27 07:55:13,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-27 07:55:13,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 07:55:13,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 11: [2022-11-27 07:55:13,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:55:13,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 07:55:13,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 21: [2022-11-27 07:55:13,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:55:13,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 21: [2022-11-27 07:55:13,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 07:55:13,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 7: [2022-11-27 07:55:13,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 07:55:13,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 8: [2022-11-27 07:55:13,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-27 07:55:13,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 07:55:13,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 15: [2022-11-27 07:55:13,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:55:13,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 07:55:13,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 4: [2022-11-27 07:55:13,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:55:13,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 07:55:13,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 9: [2022-11-27 07:55:13,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:55:13,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:55:13,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 07:55:13,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 
27: [2022-11-27 07:55:13,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 07:55:13,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 3: [2022-11-27 07:55:13,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:55:13,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 07:55:13,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 18: [2022-11-27 07:55:13,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 19: [2022-11-27 07:55:13,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:55:13,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 19: [2022-11-27 07:55:13,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 18: [2022-11-27 07:55:13,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 19: [2022-11-27 07:55:13,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 23: [2022-11-27 07:55:13,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
23: [2022-11-27 07:55:13,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 5: [2022-11-27 07:55:13,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 23: [2022-11-27 07:55:13,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 5: [2022-11-27 07:55:13,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 07:55:13,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 10: [2022-11-27 07:55:13,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:55:13,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:55:13,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 12: [2022-11-27 07:55:13,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 31: [2022-11-27 07:55:13,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 10: [2022-11-27 07:55:13,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 
12: [2022-11-27 07:55:13,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 31: [2022-11-27 07:55:13,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 12: [2022-11-27 07:55:13,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 13: [2022-11-27 07:55:13,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 30: [2022-11-27 07:55:13,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:55:13,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 07:55:13,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 30: [2022-11-27 07:55:13,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 07:55:13,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 1: [2022-11-27 07:55:13,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 29: [2022-11-27 07:55:13,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 28: [2022-11-27 07:55:13,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
29: [2022-11-27 07:55:13,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 07:55:13,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 1: [2022-11-27 07:55:13,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 28: [2022-11-27 07:55:13,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 1: [2022-11-27 07:55:13,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 28: [2022-11-27 07:55:13,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 15: [2022-11-27 07:55:13,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:55:13,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 07:55:13,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 27: [2022-11-27 07:55:13,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-27 07:55:13,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 07:55:13,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 
2: [2022-11-27 07:55:13,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:55:13,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 07:55:13,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 18: [2022-11-27 07:55:13,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:55:13,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 07:55:13,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 22: [2022-11-27 07:55:13,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 07:55:13,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 07:55:13,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 1: [2022-11-27 07:55:13,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-27 07:55:13,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 7: [2022-11-27 07:55:13,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:55:13,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 1: [2022-11-27 07:55:13,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 7: [2022-11-27 07:55:13,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 18: [2022-11-27 07:55:13,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-27 07:55:13,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step168000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 07:55:13,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step168000 is ready now! 
0: successfully saved checkpoint at iteration 168000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2633.88 31: iteration 168010/ 173500 | consumed samples: 43010560 | consumed tokens: 88085626880 | elapsed time per iteration (s): 1.01 | learning rate: 2.045E-05 | global batch size: 256 | lm loss: 1.902492E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 252.316 | TFLOPs: 15.26 | 31: iteration 168020/ 173500 | consumed samples: 43013120 | consumed tokens: 88090869760 | elapsed time per iteration (s): 0.84 | learning rate: 2.045E-05 | global batch size: 256 | lm loss: 1.913328E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.303 | TFLOPs: 18.47 | 31: iteration 168030/ 173500 | consumed samples: 43015680 | consumed tokens: 88096112640 | elapsed time per iteration (s): 0.78 | learning rate: 2.045E-05 | global batch size: 256 | lm loss: 1.898744E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.025 | TFLOPs: 19.78 | 31: iteration 168040/ 173500 | consumed samples: 43018240 | consumed tokens: 88101355520 | elapsed time per iteration (s): 0.79 | learning rate: 2.045E-05 | global batch size: 256 | lm loss: 1.891080E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.050 | TFLOPs: 19.60 | 31: iteration 168050/ 173500 | consumed samples: 43020800 | consumed tokens: 88106598400 | elapsed time per iteration (s): 0.79 | learning rate: 2.045E-05 | global batch size: 256 | lm loss: 1.920042E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.168 | TFLOPs: 19.67 | 31: iteration 168060/ 173500 | consumed samples: 43023360 | consumed tokens: 88111841280 | elapsed time per iteration (s): 
0.84 | learning rate: 2.045E-05 | global batch size: 256 | lm loss: 1.906485E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.169 | TFLOPs: 18.40 | 31: iteration 168070/ 173500 | consumed samples: 43025920 | consumed tokens: 88117084160 | elapsed time per iteration (s): 0.97 | learning rate: 2.044E-05 | global batch size: 256 | lm loss: 1.929542E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 264.581 | TFLOPs: 16.01 | 31: iteration 168080/ 173500 | consumed samples: 43028480 | consumed tokens: 88122327040 | elapsed time per iteration (s): 1.15 | learning rate: 2.044E-05 | global batch size: 256 | lm loss: 1.903713E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 221.888 | TFLOPs: 13.42 | 31: iteration 168090/ 173500 | consumed samples: 43031040 | consumed tokens: 88127569920 | elapsed time per iteration (s): 0.95 | learning rate: 2.044E-05 | global batch size: 256 | lm loss: 1.891058E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 268.450 | TFLOPs: 16.24 | 31: iteration 168100/ 173500 | consumed samples: 43033600 | consumed tokens: 88132812800 | elapsed time per iteration (s): 1.04 | learning rate: 2.044E-05 | global batch size: 256 | lm loss: 1.915263E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.955 | TFLOPs: 14.88 | 31: iteration 168110/ 173500 | consumed samples: 43036160 | consumed tokens: 88138055680 | elapsed time per iteration (s): 0.89 | learning rate: 2.044E-05 | global batch size: 256 | lm loss: 1.883723E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.188 | TFLOPs: 17.43 | 31: 
iteration 168120/ 173500 | consumed samples: 43038720 | consumed tokens: 88143298560 | elapsed time per iteration (s): 0.92 | learning rate: 2.044E-05 | global batch size: 256 | lm loss: 1.923076E+00 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 276.880 | TFLOPs: 16.75 | 31: iteration 168130/ 173500 | consumed samples: 43041280 | consumed tokens: 88148541440 | elapsed time per iteration (s): 0.88 | learning rate: 2.043E-05 | global batch size: 256 | lm loss: 1.910848E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.433 | TFLOPs: 17.51 | 31: iteration 168140/ 173500 | consumed samples: 43043840 | consumed tokens: 88153784320 | elapsed time per iteration (s): 0.88 | learning rate: 2.043E-05 | global batch size: 256 | lm loss: 1.900657E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 289.982 | TFLOPs: 17.54 | 31: iteration 168150/ 173500 | consumed samples: 43046400 | consumed tokens: 88159027200 | elapsed time per iteration (s): 0.85 | learning rate: 2.043E-05 | global batch size: 256 | lm loss: 1.907532E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.606 | TFLOPs: 18.31 | 31: iteration 168160/ 173500 | consumed samples: 43048960 | consumed tokens: 88164270080 | elapsed time per iteration (s): 0.78 | learning rate: 2.043E-05 | global batch size: 256 | lm loss: 1.922331E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.016 | TFLOPs: 19.78 | 31: iteration 168170/ 173500 | consumed samples: 43051520 | consumed tokens: 88169512960 | elapsed time per iteration (s): 1.04 | learning rate: 2.043E-05 | global batch size: 256 | lm loss: 1.926127E+00 | grad norm: 0.214 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.369 | TFLOPs: 14.90 | 31: iteration 168180/ 173500 | consumed samples: 43054080 | consumed tokens: 88174755840 | elapsed time per iteration (s): 0.87 | learning rate: 2.043E-05 | global batch size: 256 | lm loss: 1.924069E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.871 | TFLOPs: 17.90 | 31: iteration 168190/ 173500 | consumed samples: 43056640 | consumed tokens: 88179998720 | elapsed time per iteration (s): 0.90 | learning rate: 2.042E-05 | global batch size: 256 | lm loss: 1.920680E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.432 | TFLOPs: 17.27 | 31: iteration 168200/ 173500 | consumed samples: 43059200 | consumed tokens: 88185241600 | elapsed time per iteration (s): 0.84 | learning rate: 2.042E-05 | global batch size: 256 | lm loss: 1.925247E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.235 | TFLOPs: 18.34 | 31: iteration 168210/ 173500 | consumed samples: 43061760 | consumed tokens: 88190484480 | elapsed time per iteration (s): 0.83 | learning rate: 2.042E-05 | global batch size: 256 | lm loss: 1.899958E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.244 | TFLOPs: 18.71 | 31: iteration 168220/ 173500 | consumed samples: 43064320 | consumed tokens: 88195727360 | elapsed time per iteration (s): 0.85 | learning rate: 2.042E-05 | global batch size: 256 | lm loss: 1.915518E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.306 | TFLOPs: 18.23 | 31: iteration 168230/ 173500 | consumed samples: 43066880 | consumed tokens: 88200970240 | elapsed time per iteration (s): 1.07 | 
learning rate: 2.042E-05 | global batch size: 256 | lm loss: 1.936219E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.423 | TFLOPs: 14.48 | 31: iteration 168240/ 173500 | consumed samples: 43069440 | consumed tokens: 88206213120 | elapsed time per iteration (s): 0.91 | learning rate: 2.042E-05 | global batch size: 256 | lm loss: 1.896347E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 281.613 | TFLOPs: 17.04 | 31: iteration 168250/ 173500 | consumed samples: 43072000 | consumed tokens: 88211456000 | elapsed time per iteration (s): 0.90 | learning rate: 2.041E-05 | global batch size: 256 | lm loss: 1.868925E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 284.244 | TFLOPs: 17.20 | 31: iteration 168260/ 173500 | consumed samples: 43074560 | consumed tokens: 88216698880 | elapsed time per iteration (s): 0.87 | learning rate: 2.041E-05 | global batch size: 256 | lm loss: 1.906295E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.551 | TFLOPs: 17.82 | 31: iteration 168270/ 173500 | consumed samples: 43077120 | consumed tokens: 88221941760 | elapsed time per iteration (s): 0.82 | learning rate: 2.041E-05 | global batch size: 256 | lm loss: 1.932743E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.738 | TFLOPs: 18.92 | 31: iteration 168280/ 173500 | consumed samples: 43079680 | consumed tokens: 88227184640 | elapsed time per iteration (s): 0.88 | learning rate: 2.041E-05 | global batch size: 256 | lm loss: 1.915910E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.904 | TFLOPs: 17.66 | 31: iteration 
168290/ 173500 | consumed samples: 43082240 | consumed tokens: 88232427520 | elapsed time per iteration (s): 0.78 | learning rate: 2.041E-05 | global batch size: 256 | lm loss: 1.892026E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.920 | TFLOPs: 19.84 | 31: iteration 168300/ 173500 | consumed samples: 43084800 | consumed tokens: 88237670400 | elapsed time per iteration (s): 0.78 | learning rate: 2.041E-05 | global batch size: 256 | lm loss: 1.914828E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.749 | TFLOPs: 19.83 | 31: iteration 168310/ 173500 | consumed samples: 43087360 | consumed tokens: 88242913280 | elapsed time per iteration (s): 0.80 | learning rate: 2.041E-05 | global batch size: 256 | lm loss: 1.921884E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.928 | TFLOPs: 19.42 | 31: iteration 168320/ 173500 | consumed samples: 43089920 | consumed tokens: 88248156160 | elapsed time per iteration (s): 0.77 | learning rate: 2.040E-05 | global batch size: 256 | lm loss: 1.903755E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.888 | TFLOPs: 20.20 | 31: iteration 168330/ 173500 | consumed samples: 43092480 | consumed tokens: 88253399040 | elapsed time per iteration (s): 0.79 | learning rate: 2.040E-05 | global batch size: 256 | lm loss: 1.910175E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.524 | TFLOPs: 19.51 | 31: iteration 168340/ 173500 | consumed samples: 43095040 | consumed tokens: 88258641920 | elapsed time per iteration (s): 0.74 | learning rate: 2.040E-05 | global batch size: 256 | lm loss: 1.892725E+00 | grad norm: 0.201 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.782 | TFLOPs: 20.86 | 31: iteration 168350/ 173500 | consumed samples: 43097600 | consumed tokens: 88263884800 | elapsed time per iteration (s): 0.79 | learning rate: 2.040E-05 | global batch size: 256 | lm loss: 1.899765E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.326 | TFLOPs: 19.50 | 31: iteration 168360/ 173500 | consumed samples: 43100160 | consumed tokens: 88269127680 | elapsed time per iteration (s): 0.73 | learning rate: 2.040E-05 | global batch size: 256 | lm loss: 1.907874E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.346 | TFLOPs: 21.32 | 31: iteration 168370/ 173500 | consumed samples: 43102720 | consumed tokens: 88274370560 | elapsed time per iteration (s): 0.79 | learning rate: 2.040E-05 | global batch size: 256 | lm loss: 1.901563E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.255 | TFLOPs: 19.50 | 31: iteration 168380/ 173500 | consumed samples: 43105280 | consumed tokens: 88279613440 | elapsed time per iteration (s): 0.76 | learning rate: 2.039E-05 | global batch size: 256 | lm loss: 1.920401E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.517 | TFLOPs: 20.36 | 31: iteration 168390/ 173500 | consumed samples: 43107840 | consumed tokens: 88284856320 | elapsed time per iteration (s): 0.77 | learning rate: 2.039E-05 | global batch size: 256 | lm loss: 1.917445E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.324 | TFLOPs: 20.04 | 31: iteration 168400/ 173500 | consumed samples: 43110400 | consumed tokens: 88290099200 | elapsed time per iteration (s): 0.81 | learning 
rate: 2.039E-05 | global batch size: 256 | lm loss: 1.939747E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.030 | TFLOPs: 19.12 | 31: iteration 168410/ 173500 | consumed samples: 43112960 | consumed tokens: 88295342080 | elapsed time per iteration (s): 0.78 | learning rate: 2.039E-05 | global batch size: 256 | lm loss: 1.908561E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.187 | TFLOPs: 19.98 | 31: iteration 168420/ 173500 | consumed samples: 43115520 | consumed tokens: 88300584960 | elapsed time per iteration (s): 0.78 | learning rate: 2.039E-05 | global batch size: 256 | lm loss: 1.894500E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.786 | TFLOPs: 19.83 | 31: iteration 168430/ 173500 | consumed samples: 43118080 | consumed tokens: 88305827840 | elapsed time per iteration (s): 0.78 | learning rate: 2.039E-05 | global batch size: 256 | lm loss: 1.895661E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.122 | TFLOPs: 19.91 | 31: iteration 168440/ 173500 | consumed samples: 43120640 | consumed tokens: 88311070720 | elapsed time per iteration (s): 0.79 | learning rate: 2.039E-05 | global batch size: 256 | lm loss: 1.926740E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.589 | TFLOPs: 19.58 | 31: iteration 168450/ 173500 | consumed samples: 43123200 | consumed tokens: 88316313600 | elapsed time per iteration (s): 0.82 | learning rate: 2.038E-05 | global batch size: 256 | lm loss: 1.922144E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.081 | TFLOPs: 19.00 | 31: iteration 168460/ 
173500 | consumed samples: 43125760 | consumed tokens: 88321556480 | elapsed time per iteration (s): 0.81 | learning rate: 2.038E-05 | global batch size: 256 | lm loss: 1.887925E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.580 | TFLOPs: 19.03 | 31: iteration 168470/ 173500 | consumed samples: 43128320 | consumed tokens: 88326799360 | elapsed time per iteration (s): 0.75 | learning rate: 2.038E-05 | global batch size: 256 | lm loss: 1.916095E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.260 | TFLOPs: 20.65 | 31: iteration 168480/ 173500 | consumed samples: 43130880 | consumed tokens: 88332042240 | elapsed time per iteration (s): 0.81 | learning rate: 2.038E-05 | global batch size: 256 | lm loss: 1.899318E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.987 | TFLOPs: 19.18 | 31: iteration 168490/ 173500 | consumed samples: 43133440 | consumed tokens: 88337285120 | elapsed time per iteration (s): 0.93 | learning rate: 2.038E-05 | global batch size: 256 | lm loss: 1.917299E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 276.271 | TFLOPs: 16.71 | 31: iteration 168500/ 173500 | consumed samples: 43136000 | consumed tokens: 88342528000 | elapsed time per iteration (s): 0.79 | learning rate: 2.038E-05 | global batch size: 256 | lm loss: 1.934986E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.815 | TFLOPs: 19.59 | 31: iteration 168510/ 173500 | consumed samples: 43138560 | consumed tokens: 88347770880 | elapsed time per iteration (s): 0.83 | learning rate: 2.037E-05 | global batch size: 256 | lm loss: 1.948920E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 309.982 | TFLOPs: 18.75 | 31: iteration 168520/ 173500 | consumed samples: 43141120 | consumed tokens: 88353013760 | elapsed time per iteration (s): 0.95 | learning rate: 2.037E-05 | global batch size: 256 | lm loss: 1.923812E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 270.032 | TFLOPs: 16.34 | 31: iteration 168530/ 173500 | consumed samples: 43143680 | consumed tokens: 88358256640 | elapsed time per iteration (s): 0.82 | learning rate: 2.037E-05 | global batch size: 256 | lm loss: 1.921372E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.599 | TFLOPs: 18.79 | 31: iteration 168540/ 173500 | consumed samples: 43146240 | consumed tokens: 88363499520 | elapsed time per iteration (s): 0.87 | learning rate: 2.037E-05 | global batch size: 256 | lm loss: 1.924460E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.584 | TFLOPs: 17.76 | 31: iteration 168550/ 173500 | consumed samples: 43148800 | consumed tokens: 88368742400 | elapsed time per iteration (s): 0.82 | learning rate: 2.037E-05 | global batch size: 256 | lm loss: 1.912251E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.267 | TFLOPs: 18.89 | 31: iteration 168560/ 173500 | consumed samples: 43151360 | consumed tokens: 88373985280 | elapsed time per iteration (s): 0.82 | learning rate: 2.037E-05 | global batch size: 256 | lm loss: 1.932484E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.669 | TFLOPs: 18.86 | 31: iteration 168570/ 173500 | consumed samples: 43153920 | consumed tokens: 88379228160 | elapsed time per iteration (s): 0.84 | learning rate: 
2.037E-05 | global batch size: 256 | lm loss: 1.868935E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.318 | TFLOPs: 18.41 | 31: iteration 168580/ 173500 | consumed samples: 43156480 | consumed tokens: 88384471040 | elapsed time per iteration (s): 0.81 | learning rate: 2.036E-05 | global batch size: 256 | lm loss: 1.906472E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.145 | TFLOPs: 19.19 | 31: iteration 168590/ 173500 | consumed samples: 43159040 | consumed tokens: 88389713920 | elapsed time per iteration (s): 0.81 | learning rate: 2.036E-05 | global batch size: 256 | lm loss: 1.907520E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.187 | TFLOPs: 19.07 | 31: iteration 168600/ 173500 | consumed samples: 43161600 | consumed tokens: 88394956800 | elapsed time per iteration (s): 0.84 | learning rate: 2.036E-05 | global batch size: 256 | lm loss: 1.896305E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.783 | TFLOPs: 18.38 | 31: iteration 168610/ 173500 | consumed samples: 43164160 | consumed tokens: 88400199680 | elapsed time per iteration (s): 0.82 | learning rate: 2.036E-05 | global batch size: 256 | lm loss: 1.896227E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.612 | TFLOPs: 18.97 | 31: iteration 168620/ 173500 | consumed samples: 43166720 | consumed tokens: 88405442560 | elapsed time per iteration (s): 0.80 | learning rate: 2.036E-05 | global batch size: 256 | lm loss: 1.907439E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.288 | TFLOPs: 19.26 | 31: iteration 168630/ 173500 | 
consumed samples: 43169280 | consumed tokens: 88410685440 | elapsed time per iteration (s): 0.84 | learning rate: 2.036E-05 | global batch size: 256 | lm loss: 1.897793E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.881 | TFLOPs: 18.50 | 31: iteration 168640/ 173500 | consumed samples: 43171840 | consumed tokens: 88415928320 | elapsed time per iteration (s): 0.79 | learning rate: 2.036E-05 | global batch size: 256 | lm loss: 1.913210E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.533 | TFLOPs: 19.63 | 31: iteration 168650/ 173500 | consumed samples: 43174400 | consumed tokens: 88421171200 | elapsed time per iteration (s): 0.81 | learning rate: 2.035E-05 | global batch size: 256 | lm loss: 1.906285E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.128 | TFLOPs: 19.19 | 31: iteration 168660/ 173500 | consumed samples: 43176960 | consumed tokens: 88426414080 | elapsed time per iteration (s): 0.83 | learning rate: 2.035E-05 | global batch size: 256 | lm loss: 1.891016E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.845 | TFLOPs: 18.74 | 31: iteration 168670/ 173500 | consumed samples: 43179520 | consumed tokens: 88431656960 | elapsed time per iteration (s): 0.88 | learning rate: 2.035E-05 | global batch size: 256 | lm loss: 1.927876E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.887 | TFLOPs: 17.60 | 31: iteration 168680/ 173500 | consumed samples: 43182080 | consumed tokens: 88436899840 | elapsed time per iteration (s): 0.84 | learning rate: 2.035E-05 | global batch size: 256 | lm loss: 1.913811E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 303.466 | TFLOPs: 18.36 | 31: iteration 168690/ 173500 | consumed samples: 43184640 | consumed tokens: 88442142720 | elapsed time per iteration (s): 0.87 | learning rate: 2.035E-05 | global batch size: 256 | lm loss: 1.916561E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 295.882 | TFLOPs: 17.90 | 31: iteration 168700/ 173500 | consumed samples: 43187200 | consumed tokens: 88447385600 | elapsed time per iteration (s): 0.87 | learning rate: 2.035E-05 | global batch size: 256 | lm loss: 1.906538E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.444 | TFLOPs: 17.75 | 31: iteration 168710/ 173500 | consumed samples: 43189760 | consumed tokens: 88452628480 | elapsed time per iteration (s): 0.96 | learning rate: 2.035E-05 | global batch size: 256 | lm loss: 1.876039E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 267.482 | TFLOPs: 16.18 | 31: iteration 168720/ 173500 | consumed samples: 43192320 | consumed tokens: 88457871360 | elapsed time per iteration (s): 0.81 | learning rate: 2.034E-05 | global batch size: 256 | lm loss: 1.906574E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.978 | TFLOPs: 19.12 | 31: iteration 168730/ 173500 | consumed samples: 43194880 | consumed tokens: 88463114240 | elapsed time per iteration (s): 0.81 | learning rate: 2.034E-05 | global batch size: 256 | lm loss: 1.894353E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.963 | TFLOPs: 19.11 | 31: iteration 168740/ 173500 | consumed samples: 43197440 | consumed tokens: 88468357120 | elapsed time per iteration (s): 0.78 | learning rate: 
2.034E-05 | global batch size: 256 | lm loss: 1.896384E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.673 | TFLOPs: 19.82 | 31: iteration 168750/ 173500 | consumed samples: 43200000 | consumed tokens: 88473600000 | elapsed time per iteration (s): 0.80 | learning rate: 2.034E-05 | global batch size: 256 | lm loss: 1.918509E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.991 | TFLOPs: 19.36 | 31: iteration 168760/ 173500 | consumed samples: 43202560 | consumed tokens: 88478842880 | elapsed time per iteration (s): 0.82 | learning rate: 2.034E-05 | global batch size: 256 | lm loss: 1.902995E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.484 | TFLOPs: 18.90 | 31: iteration 168770/ 173500 | consumed samples: 43205120 | consumed tokens: 88484085760 | elapsed time per iteration (s): 0.86 | learning rate: 2.034E-05 | global batch size: 256 | lm loss: 1.939396E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.648 | TFLOPs: 18.01 | 31: iteration 168780/ 173500 | consumed samples: 43207680 | consumed tokens: 88489328640 | elapsed time per iteration (s): 0.85 | learning rate: 2.034E-05 | global batch size: 256 | lm loss: 1.911685E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.499 | TFLOPs: 18.18 | 31: iteration 168790/ 173500 | consumed samples: 43210240 | consumed tokens: 88494571520 | elapsed time per iteration (s): 0.83 | learning rate: 2.033E-05 | global batch size: 256 | lm loss: 1.930094E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.411 | TFLOPs: 18.66 | 31: iteration 168800/ 173500 | 
consumed samples: 43212800 | consumed tokens: 88499814400 | elapsed time per iteration (s): 0.84 | learning rate: 2.033E-05 | global batch size: 256 | lm loss: 1.896604E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.070 | TFLOPs: 18.52 | 31: iteration 168810/ 173500 | consumed samples: 43215360 | consumed tokens: 88505057280 | elapsed time per iteration (s): 0.88 | learning rate: 2.033E-05 | global batch size: 256 | lm loss: 1.912486E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.970 | TFLOPs: 17.60 | 31: iteration 168820/ 173500 | consumed samples: 43217920 | consumed tokens: 88510300160 | elapsed time per iteration (s): 0.78 | learning rate: 2.033E-05 | global batch size: 256 | lm loss: 1.907137E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.651 | TFLOPs: 19.88 | 31: iteration 168830/ 173500 | consumed samples: 43220480 | consumed tokens: 88515543040 | elapsed time per iteration (s): 0.82 | learning rate: 2.033E-05 | global batch size: 256 | lm loss: 1.893117E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.112 | TFLOPs: 18.82 | 31: iteration 168840/ 173500 | consumed samples: 43223040 | consumed tokens: 88520785920 | elapsed time per iteration (s): 0.81 | learning rate: 2.033E-05 | global batch size: 256 | lm loss: 1.926260E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.277 | TFLOPs: 19.19 | 31: iteration 168850/ 173500 | consumed samples: 43225600 | consumed tokens: 88526028800 | elapsed time per iteration (s): 0.79 | learning rate: 2.033E-05 | global batch size: 256 | lm loss: 1.911550E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 323.389 | TFLOPs: 19.56 | 31: iteration 168860/ 173500 | consumed samples: 43228160 | consumed tokens: 88531271680 | elapsed time per iteration (s): 0.84 | learning rate: 2.032E-05 | global batch size: 256 | lm loss: 1.913085E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.747 | TFLOPs: 18.38 | 31: iteration 168870/ 173500 | consumed samples: 43230720 | consumed tokens: 88536514560 | elapsed time per iteration (s): 1.05 | learning rate: 2.032E-05 | global batch size: 256 | lm loss: 1.897153E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.983 | TFLOPs: 14.70 | 31: iteration 168880/ 173500 | consumed samples: 43233280 | consumed tokens: 88541757440 | elapsed time per iteration (s): 0.80 | learning rate: 2.032E-05 | global batch size: 256 | lm loss: 1.920541E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.166 | TFLOPs: 19.37 | 31: iteration 168890/ 173500 | consumed samples: 43235840 | consumed tokens: 88547000320 | elapsed time per iteration (s): 0.82 | learning rate: 2.032E-05 | global batch size: 256 | lm loss: 1.890069E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.104 | TFLOPs: 18.82 | 31: iteration 168900/ 173500 | consumed samples: 43238400 | consumed tokens: 88552243200 | elapsed time per iteration (s): 0.78 | learning rate: 2.032E-05 | global batch size: 256 | lm loss: 1.919123E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.933 | TFLOPs: 19.78 | 31: iteration 168910/ 173500 | consumed samples: 43240960 | consumed tokens: 88557486080 | elapsed time per iteration (s): 0.80 | learning rate: 
2.032E-05 | global batch size: 256 | lm loss: 1.910162E+00 | grad norm: 0.212 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.052 | TFLOPs: 19.42 | 31: iteration 168920/ 173500 | consumed samples: 43243520 | consumed tokens: 88562728960 | elapsed time per iteration (s): 0.79 | learning rate: 2.032E-05 | global batch size: 256 | lm loss: 1.908836E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.064 | TFLOPs: 19.48 | 31: iteration 168930/ 173500 | consumed samples: 43246080 | consumed tokens: 88567971840 | elapsed time per iteration (s): 0.79 | learning rate: 2.031E-05 | global batch size: 256 | lm loss: 1.924173E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.139 | TFLOPs: 19.49 | 31: iteration 168940/ 173500 | consumed samples: 43248640 | consumed tokens: 88573214720 | elapsed time per iteration (s): 0.83 | learning rate: 2.031E-05 | global batch size: 256 | lm loss: 1.903879E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.068 | TFLOPs: 18.64 | 31: iteration 168950/ 173500 | consumed samples: 43251200 | consumed tokens: 88578457600 | elapsed time per iteration (s): 0.76 | learning rate: 2.031E-05 | global batch size: 256 | lm loss: 1.899823E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.370 | TFLOPs: 20.29 | 31: iteration 168960/ 173500 | consumed samples: 43253760 | consumed tokens: 88583700480 | elapsed time per iteration (s): 0.76 | learning rate: 2.031E-05 | global batch size: 256 | lm loss: 1.893684E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.221 | TFLOPs: 20.40 | 31: iteration 168970/ 173500 | 
consumed samples: 43256320 | consumed tokens: 88588943360 | elapsed time per iteration (s): 0.77 | learning rate: 2.031E-05 | global batch size: 256 | lm loss: 1.902973E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.443 | TFLOPs: 20.23 | 31: iteration 168980/ 173500 | consumed samples: 43258880 | consumed tokens: 88594186240 | elapsed time per iteration (s): 0.81 | learning rate: 2.031E-05 | global batch size: 256 | lm loss: 1.900304E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.405 | TFLOPs: 19.08 | 31: iteration 168990/ 173500 | consumed samples: 43261440 | consumed tokens: 88599429120 | elapsed time per iteration (s): 0.83 | learning rate: 2.031E-05 | global batch size: 256 | lm loss: 1.905730E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.004 | TFLOPs: 18.69 | 31: iteration 169000/ 173500 | consumed samples: 43264000 | consumed tokens: 88604672000 | elapsed time per iteration (s): 0.82 | learning rate: 2.030E-05 | global batch size: 256 | lm loss: 1.895292E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.854 | TFLOPs: 18.87 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 169000 | lm loss value: 1.849568E+00 | lm loss PPL: 6.357072E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 169000 to checkpoints_1b1long 0: [2022-11-27 08:09:10,387] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step169000 is begin to save! 
0: [2022-11-27 08:09:10,399] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_01-model_00-model_states.pt... 0: [2022-11-27 08:09:10,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_01-model_00-model_states.pt. 0: [2022-11-27 08:09:10,633] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_03-model_00-model_states.pt... 0: [2022-11-27 08:09:10,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_03-model_00-model_states.pt. 0: [2022-11-27 08:09:10,711] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_04-model_00-model_states.pt... 0: [2022-11-27 08:09:10,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_04-model_00-model_states.pt. 0: [2022-11-27 08:09:10,794] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_05-model_00-model_states.pt... 0: [2022-11-27 08:09:10,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_05-model_00-model_states.pt. 0: [2022-11-27 08:09:10,874] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_06-model_00-model_states.pt... 0: [2022-11-27 08:09:10,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_06-model_00-model_states.pt. 0: [2022-11-27 08:09:10,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_07-model_00-model_states.pt... 0: [2022-11-27 08:09:11,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_07-model_00-model_states.pt. 
0: [2022-11-27 08:09:11,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_08-model_00-model_states.pt... 0: [2022-11-27 08:09:11,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_08-model_00-model_states.pt. 0: [2022-11-27 08:09:11,107] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_09-model_00-model_states.pt... 0: [2022-11-27 08:09:11,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_09-model_00-model_states.pt. 0: [2022-11-27 08:09:11,183] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_10-model_00-model_states.pt... 0: [2022-11-27 08:09:11,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_10-model_00-model_states.pt. 0: [2022-11-27 08:09:11,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_11-model_00-model_states.pt... 0: [2022-11-27 08:09:11,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_11-model_00-model_states.pt. 0: [2022-11-27 08:09:11,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_12-model_00-model_states.pt... 0: [2022-11-27 08:09:11,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_12-model_00-model_states.pt. 0: [2022-11-27 08:09:11,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_13-model_00-model_states.pt... 0: [2022-11-27 08:09:11,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_13-model_00-model_states.pt. 
0: [2022-11-27 08:09:11,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_14-model_00-model_states.pt... 0: [2022-11-27 08:09:11,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_14-model_00-model_states.pt. 0: [2022-11-27 08:09:11,571] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_15-model_00-model_states.pt... 0: [2022-11-27 08:09:11,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_15-model_00-model_states.pt. 0: [2022-11-27 08:09:11,646] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_16-model_00-model_states.pt... 0: [2022-11-27 08:09:11,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_16-model_00-model_states.pt. 0: [2022-11-27 08:09:11,722] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_17-model_00-model_states.pt... 0: [2022-11-27 08:09:11,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_17-model_00-model_states.pt. 0: [2022-11-27 08:09:11,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_18-model_00-model_states.pt... 0: [2022-11-27 08:09:11,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_18-model_00-model_states.pt. 0: [2022-11-27 08:09:11,873] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_19-model_00-model_states.pt... 0: [2022-11-27 08:09:11,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_19-model_00-model_states.pt. 
0: [2022-11-27 08:09:11,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_20-model_00-model_states.pt... 0: [2022-11-27 08:09:12,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_20-model_00-model_states.pt. 0: [2022-11-27 08:09:12,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_21-model_00-model_states.pt... 0: [2022-11-27 08:09:12,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_21-model_00-model_states.pt. 0: [2022-11-27 08:09:12,099] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_22-model_00-model_states.pt... 0: [2022-11-27 08:09:12,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_22-model_00-model_states.pt. 0: [2022-11-27 08:09:12,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_23-model_00-model_states.pt... 0: [2022-11-27 08:09:12,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_23-model_00-model_states.pt. 0: [2022-11-27 08:09:12,248] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_24-model_00-model_states.pt... 0: [2022-11-27 08:09:12,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_24-model_00-model_states.pt. 0: [2022-11-27 08:09:12,322] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_25-model_00-model_states.pt... 0: [2022-11-27 08:09:12,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_25-model_00-model_states.pt. 
0: [2022-11-27 08:09:12,400] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_26-model_00-model_states.pt... 0: [2022-11-27 08:09:12,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_26-model_00-model_states.pt. 0: [2022-11-27 08:09:12,472] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_27-model_00-model_states.pt... 0: [2022-11-27 08:09:12,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_27-model_00-model_states.pt. 0: [2022-11-27 08:09:12,552] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_28-model_00-model_states.pt... 0: [2022-11-27 08:09:12,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_28-model_00-model_states.pt. 0: [2022-11-27 08:09:12,628] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/layer_30-model_00-model_states.pt... 0: [2022-11-27 08:09:12,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/layer_30-model_00-model_states.pt. 0: [2022-11-27 08:09:12,630] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step169000/mp_rank_00_model_states.pt 0: [2022-11-27 08:09:12,630] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/mp_rank_00_model_states.pt... 0: [2022-11-27 08:09:12,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/mp_rank_00_model_states.pt. 0: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
6: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 
8: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 
2: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 
20: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
28: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 
22: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 
19: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 
4: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
16: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
20: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 
31: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 
18: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
7: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 
16: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 
25: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 
29: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 
18: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
9: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 
20: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 
22: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
4: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 
11: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
9: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
14: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:09:14,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:09:14,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:09:14,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 08:09:14,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 4: [2022-11-27 08:09:14,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 16: [2022-11-27 08:09:14,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:09:14,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 16: [2022-11-27 08:09:14,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 08:09:14,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 4: [2022-11-27 08:09:14,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 
0: [2022-11-27 08:09:14,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:09:14,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:09:14,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:09:14,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 2: [2022-11-27 08:09:14,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 3: [2022-11-27 08:09:14,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 2: [2022-11-27 08:09:14,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 18: [2022-11-27 08:09:14,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:09:14,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
18: [2022-11-27 08:09:14,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 15: [2022-11-27 08:09:14,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 18: [2022-11-27 08:09:14,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 15: [2022-11-27 08:09:14,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 24: [2022-11-27 08:09:14,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-27 08:09:14,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:09:14,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 24: [2022-11-27 08:09:14,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 08:09:14,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 24: [2022-11-27 08:09:14,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-27 08:09:14,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 
30: [2022-11-27 08:09:14,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 08:09:14,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 31: [2022-11-27 08:09:14,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-27 08:09:14,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 08:09:14,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 6: [2022-11-27 08:09:14,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:09:14,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:09:14,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 08:09:14,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 25: [2022-11-27 08:09:14,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 23: [2022-11-27 08:09:14,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:09:14,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 
23: [2022-11-27 08:09:14,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 08:09:14,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 26: [2022-11-27 08:09:14,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-27 08:09:14,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 08:09:14,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 3: [2022-11-27 08:09:14,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:09:14,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:09:14,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 13: [2022-11-27 08:09:14,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 3: [2022-11-27 08:09:14,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 13: [2022-11-27 08:09:14,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 7: [2022-11-27 08:09:14,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-27 08:09:14,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 08:09:14,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 16: [2022-11-27 08:09:14,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-27 08:09:14,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 08:09:14,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 27: [2022-11-27 08:09:14,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-27 08:09:14,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 08:09:14,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 28: [2022-11-27 08:09:14,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 23: [2022-11-27 08:09:14,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-27 08:09:14,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 08:09:14,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 
18: [2022-11-27 08:09:14,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:09:14,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 08:09:14,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 27: [2022-11-27 08:09:14,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:09:14,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:09:14,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 29: [2022-11-27 08:09:14,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:09:14,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 29: [2022-11-27 08:09:14,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 27: [2022-11-27 08:09:14,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 15: [2022-11-27 08:09:14,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-27 08:09:14,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 08:09:14,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 29: [2022-11-27 08:09:14,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 7: [2022-11-27 08:09:14,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 27: [2022-11-27 08:09:14,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 7: [2022-11-27 08:09:14,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 08:09:14,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 0: [2022-11-27 08:09:14,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:09:14,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 08:09:14,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 29: [2022-11-27 08:09:14,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 
29: [2022-11-27 08:09:14,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 08:09:14,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 26: [2022-11-27 08:09:14,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 08:09:14,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 4: [2022-11-27 08:09:14,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 26: [2022-11-27 08:09:14,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 4: [2022-11-27 08:09:14,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 16: [2022-11-27 08:09:14,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-27 08:09:14,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 4: [2022-11-27 08:09:14,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 16: [2022-11-27 08:09:14,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 31: [2022-11-27 08:09:14,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 
31: [2022-11-27 08:09:14,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 08:09:14,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 30: [2022-11-27 08:09:14,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:09:14,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 08:09:14,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 25: [2022-11-27 08:09:14,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:09:14,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 24: [2022-11-27 08:09:14,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:09:14,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 24: [2022-11-27 08:09:14,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 13: [2022-11-27 08:09:14,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 24: [2022-11-27 08:09:14,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 
13: [2022-11-27 08:09:14,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 08:09:14,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 2: [2022-11-27 08:09:14,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:09:14,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:09:14,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 22: [2022-11-27 08:09:14,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:09:14,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 2: [2022-11-27 08:09:14,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 22: [2022-11-27 08:09:14,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 30: [2022-11-27 08:09:14,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:09:14,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 22: [2022-11-27 08:09:14,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 
30: [2022-11-27 08:09:14,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 08:09:14,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 14: [2022-11-27 08:09:14,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:09:14,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 28: [2022-11-27 08:09:14,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 14: [2022-11-27 08:09:14,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 08:09:14,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 28: [2022-11-27 08:09:14,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 14: [2022-11-27 08:09:14,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 14: [2022-11-27 08:09:14,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 28: [2022-11-27 08:09:14,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 
28: [2022-11-27 08:09:14,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 08:09:14,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 13: [2022-11-27 08:09:14,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:09:14,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 08:09:14,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 29: [2022-11-27 08:09:14,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 08:09:14,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 08:09:14,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 21: [2022-11-27 08:09:14,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:09:14,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 08:09:14,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 9: [2022-11-27 08:09:14,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-27 08:09:14,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 08:09:14,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 14: [2022-11-27 08:09:14,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:09:14,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 08:09:14,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 22: [2022-11-27 08:09:14,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 08:09:14,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 08:09:14,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-27 08:09:14,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 3: [2022-11-27 08:09:14,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
22: [2022-11-27 08:09:14,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 3: [2022-11-27 08:09:14,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 22: [2022-11-27 08:09:14,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 3: [2022-11-27 08:09:14,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 28: [2022-11-27 08:09:14,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:09:14,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 28: [2022-11-27 08:09:14,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 21: [2022-11-27 08:09:14,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 08:09:14,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 28: [2022-11-27 08:09:14,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 6: [2022-11-27 08:09:14,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-27 08:09:14,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 08:09:14,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 24: [2022-11-27 08:09:14,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 08:09:14,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 16: [2022-11-27 08:09:14,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 24: [2022-11-27 08:09:14,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 30: [2022-11-27 08:09:14,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 16: [2022-11-27 08:09:14,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 08:09:14,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 23: [2022-11-27 08:09:14,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
30: [2022-11-27 08:09:14,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 23: [2022-11-27 08:09:14,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 30: [2022-11-27 08:09:14,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 23: [2022-11-27 08:09:14,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 13: [2022-11-27 08:09:14,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:09:14,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 08:09:14,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 15: [2022-11-27 08:09:14,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:09:14,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 08:09:14,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 27: [2022-11-27 08:09:14,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:09:14,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
27: [2022-11-27 08:09:14,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 11: [2022-11-27 08:09:14,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 08:09:14,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 11: [2022-11-27 08:09:14,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:09:14,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 08:09:14,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 27: [2022-11-27 08:09:14,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 11: [2022-11-27 08:09:14,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:09:14,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 26: [2022-11-27 08:09:14,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:09:14,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 11: [2022-11-27 08:09:14,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
29: [2022-11-27 08:09:14,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:09:14,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 29: [2022-11-27 08:09:14,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 11: [2022-11-27 08:09:14,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 29: [2022-11-27 08:09:14,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 26: [2022-11-27 08:09:14,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 08:09:14,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 6: [2022-11-27 08:09:14,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:09:14,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 08:09:14,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 4: [2022-11-27 08:09:14,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:09:14,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 
4: [2022-11-27 08:09:14,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 18: [2022-11-27 08:09:14,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 4: [2022-11-27 08:09:14,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 18: [2022-11-27 08:09:14,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 4: [2022-11-27 08:09:14,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:09:14,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 5: [2022-11-27 08:09:14,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:09:14,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 4: [2022-11-27 08:09:14,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 5: [2022-11-27 08:09:14,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 5: [2022-11-27 08:09:14,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-27 08:09:14,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 08:09:14,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 9: [2022-11-27 08:09:14,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:09:14,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:09:14,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 5: [2022-11-27 08:09:14,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 9: [2022-11-27 08:09:14,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 5: [2022-11-27 08:09:14,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 5: [2022-11-27 08:09:14,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:09:14,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 08:09:14,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 23: [2022-11-27 08:09:14,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-27 08:09:14,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 0: [2022-11-27 08:09:14,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:09:14,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 23: [2022-11-27 08:09:14,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 0: [2022-11-27 08:09:14,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 15: [2022-11-27 08:09:14,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:09:14,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 08:09:14,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 27: [2022-11-27 08:09:14,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-27 08:09:14,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 08:09:14,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 28: [2022-11-27 08:09:14,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
10: [2022-11-27 08:09:14,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:09:14,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:09:14,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:09:14,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:09:14,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 08:09:14,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 08:09:14,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 08:09:14,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 10: [2022-11-27 08:09:14,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 10: [2022-11-27 08:09:14,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 21: [2022-11-27 08:09:14,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
10: [2022-11-27 08:09:14,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 21: [2022-11-27 08:09:14,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 10: [2022-11-27 08:09:14,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 21: [2022-11-27 08:09:14,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 2: [2022-11-27 08:09:14,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:09:14,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 08:09:14,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 26: [2022-11-27 08:09:14,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-27 08:09:14,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 08:09:14,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 25: [2022-11-27 08:09:14,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:09:14,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-27 08:09:14,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 25: [2022-11-27 08:09:14,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 17: [2022-11-27 08:09:14,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:09:14,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:09:14,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 25: [2022-11-27 08:09:14,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 17: [2022-11-27 08:09:14,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 08:09:14,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 7: [2022-11-27 08:09:14,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 08:09:14,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 0: [2022-11-27 08:09:14,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-27 08:09:14,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 28: [2022-11-27 08:09:14,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 0: [2022-11-27 08:09:14,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 28: [2022-11-27 08:09:14,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 3: [2022-11-27 08:09:14,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:09:14,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 08:09:14,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 12: [2022-11-27 08:09:14,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:09:14,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:09:14,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:09:14,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-27 08:09:14,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 08:09:14,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 08:09:14,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 08:09:14,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 08:09:14,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 12: [2022-11-27 08:09:14,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 12: [2022-11-27 08:09:14,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 12: [2022-11-27 08:09:14,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 2: [2022-11-27 08:09:14,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:09:14,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 08:09:14,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 11: [2022-11-27 08:09:14,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-27 08:09:14,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 08:09:14,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 8: [2022-11-27 08:09:14,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:09:14,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:09:14,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:09:14,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:09:14,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 08:09:14,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 08:09:14,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 8: [2022-11-27 08:09:14,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 31: [2022-11-27 08:09:14,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
8: [2022-11-27 08:09:14,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 08:09:14,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 08:09:14,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 8: [2022-11-27 08:09:14,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 31: [2022-11-27 08:09:14,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 08:09:14,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 0: [2022-11-27 08:09:14,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 08:09:14,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 17: [2022-11-27 08:09:14,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:09:14,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 08:09:14,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 25: [2022-11-27 08:09:14,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
25: [2022-11-27 08:09:14,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 08:09:14,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 17: [2022-11-27 08:09:14,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:09:14,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 08:09:14,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 18: [2022-11-27 08:09:14,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:09:14,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 08:09:14,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 31: [2022-11-27 08:09:14,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 08:09:14,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 08:09:14,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 1: [2022-11-27 08:09:14,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-27 08:09:14,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:09:14,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:09:14,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:09:14,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 08:09:14,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 08:09:14,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 1: [2022-11-27 08:09:14,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 1: [2022-11-27 08:09:14,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 08:09:14,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 08:09:14,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 1: [2022-11-27 08:09:14,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 31: [2022-11-27 08:09:14,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-27 08:09:14,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 08:09:14,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 25: [2022-11-27 08:09:14,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:09:14,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 08:09:14,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 20: [2022-11-27 08:09:14,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-27 08:09:14,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-27 08:09:14,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 08:09:14,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 
20: [2022-11-27 08:09:14,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 08:09:14,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 08:09:14,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 08:09:14,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 08:09:14,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 20: [2022-11-27 08:09:14,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 20: [2022-11-27 08:09:14,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 20: [2022-11-27 08:09:14,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 19: [2022-11-27 08:09:14,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:09:14,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:09:14,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:09:14,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 
19: [2022-11-27 08:09:14,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:09:14,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 08:09:14,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-27 08:09:14,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 08:09:14,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 08:09:14,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 08:09:14,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 19: [2022-11-27 08:09:14,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 19: [2022-11-27 08:09:14,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 19: [2022-11-27 08:09:14,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 19: [2022-11-27 08:09:14,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 6: [2022-11-27 08:09:14,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-27 08:09:14,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 08:09:14,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 20: [2022-11-27 08:09:14,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-27 08:09:14,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 08:09:14,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 29: [2022-11-27 08:09:14,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-27 08:09:14,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 08:09:14,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 10: [2022-11-27 08:09:14,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:09:14,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 08:09:14,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 16: [2022-11-27 08:09:14,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 
16: [2022-11-27 08:09:14,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 08:09:14,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 5: [2022-11-27 08:09:14,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:09:14,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 08:09:14,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 22: [2022-11-27 08:09:14,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-27 08:09:14,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 08:09:14,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 24: [2022-11-27 08:09:14,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-27 08:09:14,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-27 08:09:14,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 13: [2022-11-27 08:09:14,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-27 08:09:14,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 08:09:14,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 14: [2022-11-27 08:09:14,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:09:14,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 08:09:14,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 30: [2022-11-27 08:09:14,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:09:14,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 08:09:14,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 3: [2022-11-27 08:09:14,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:09:14,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 08:09:14,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 7: [2022-11-27 08:09:14,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-27 08:09:14,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 08:09:14,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 4: [2022-11-27 08:09:14,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:09:14,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 08:09:14,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 27: [2022-11-27 08:09:14,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 27: [2022-11-27 08:09:14,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 08:09:14,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 21: [2022-11-27 08:09:14,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:09:14,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 12: [2022-11-27 08:09:14,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:09:14,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 
12: [2022-11-27 08:09:14,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 08:09:14,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 23: [2022-11-27 08:09:14,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-27 08:09:14,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 08:09:14,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 26: [2022-11-27 08:09:14,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-27 08:09:14,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-27 08:09:14,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 0: [2022-11-27 08:09:14,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:09:14,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 08:09:14,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 15: [2022-11-27 08:09:14,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-27 08:09:14,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 08:09:14,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 17: [2022-11-27 08:09:14,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:09:14,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 08:09:14,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 28: [2022-11-27 08:09:14,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-27 08:09:14,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 08:09:14,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 1: [2022-11-27 08:09:14,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:09:14,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
1: [2022-11-27 08:09:14,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 8: [2022-11-27 08:09:14,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 1: [2022-11-27 08:09:14,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 8: [2022-11-27 08:09:14,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 2: [2022-11-27 08:09:14,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:09:14,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 08:09:14,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 9: [2022-11-27 08:09:14,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:09:14,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 08:09:14,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 10: [2022-11-27 08:09:14,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-27 08:09:14,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 08:09:14,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 11: [2022-11-27 08:09:14,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:09:14,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 08:09:14,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 6: [2022-11-27 08:09:14,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:09:14,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 08:09:14,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 31: [2022-11-27 08:09:14,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 08:09:14,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 08:09:14,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 25: [2022-11-27 08:09:14,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 
25: [2022-11-27 08:09:14,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 08:09:14,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 12: [2022-11-27 08:09:14,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:09:14,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 08:09:14,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 20: [2022-11-27 08:09:14,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-27 08:09:14,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 08:09:14,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 22: [2022-11-27 08:09:14,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 08:09:14,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 08:09:14,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 18: [2022-11-27 08:09:14,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
18: [2022-11-27 08:09:14,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 08:09:14,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 24: [2022-11-27 08:09:14,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-27 08:09:14,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 08:09:14,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 29: [2022-11-27 08:09:14,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-27 08:09:14,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 08:09:14,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 7: [2022-11-27 08:09:14,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:09:14,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 08:09:14,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 19: [2022-11-27 08:09:14,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
19: [2022-11-27 08:09:14,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 08:09:14,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 14: [2022-11-27 08:09:14,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:09:14,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 08:09:14,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 4: [2022-11-27 08:09:14,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:09:14,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 13: [2022-11-27 08:09:14,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:09:14,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 13: [2022-11-27 08:09:14,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 08:09:14,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 3: [2022-11-27 08:09:14,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-27 08:09:14,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 08:09:14,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 15: [2022-11-27 08:09:14,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:09:14,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 08:09:14,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 21: [2022-11-27 08:09:14,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 27: [2022-11-27 08:09:14,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:09:14,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 08:09:14,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 27: [2022-11-27 08:09:14,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 08:09:14,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 23: [2022-11-27 08:09:14,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
23: [2022-11-27 08:09:14,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 08:09:14,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 0: [2022-11-27 08:09:14,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:09:14,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 28: [2022-11-27 08:09:14,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:09:14,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 5: [2022-11-27 08:09:14,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:09:14,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 08:09:14,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 30: [2022-11-27 08:09:14,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:09:14,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 08:09:14,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 
16: [2022-11-27 08:09:14,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-27 08:09:14,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 08:09:14,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 26: [2022-11-27 08:09:14,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-27 08:09:14,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 08:09:14,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 17: [2022-11-27 08:09:14,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:09:14,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 08:09:14,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 1: [2022-11-27 08:09:14,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:09:14,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 08:09:14,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 
8: [2022-11-27 08:09:14,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:09:14,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 08:09:14,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 9: [2022-11-27 08:09:14,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:09:14,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 08:09:14,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 10: [2022-11-27 08:09:14,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:09:14,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 08:09:14,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 31: [2022-11-27 08:09:14,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-27 08:09:14,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 08:09:14,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 
25: [2022-11-27 08:09:14,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:09:14,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 08:09:14,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 18: [2022-11-27 08:09:14,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:09:14,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 14: [2022-11-27 08:09:14,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:09:14,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 14: [2022-11-27 08:09:14,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 08:09:14,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 6: [2022-11-27 08:09:14,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:09:14,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 08:09:14,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 
20: [2022-11-27 08:09:14,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-27 08:09:14,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 08:09:14,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 2: [2022-11-27 08:09:14,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:09:14,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 08:09:14,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 16: [2022-11-27 08:09:14,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-27 08:09:14,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 08:09:14,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 26: [2022-11-27 08:09:14,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-27 08:09:14,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 08:09:14,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 
4: [2022-11-27 08:09:14,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:09:14,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 08:09:14,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 1: [2022-11-27 08:09:14,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 22: [2022-11-27 08:09:14,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 22: [2022-11-27 08:09:14,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 08:09:14,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 1: [2022-11-27 08:09:14,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 08:09:14,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 27: [2022-11-27 08:09:14,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:09:14,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
27: [2022-11-27 08:09:14,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 5: [2022-11-27 08:09:14,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 27: [2022-11-27 08:09:14,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 5: [2022-11-27 08:09:14,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 29: [2022-11-27 08:09:14,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 24: [2022-11-27 08:09:14,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-27 08:09:14,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 29: [2022-11-27 08:09:14,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 24: [2022-11-27 08:09:14,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 29: [2022-11-27 08:09:14,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 19: [2022-11-27 08:09:14,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-27 08:09:14,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 08:09:14,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 11: [2022-11-27 08:09:14,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:09:14,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:09:14,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 12: [2022-11-27 08:09:14,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 11: [2022-11-27 08:09:14,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 12: [2022-11-27 08:09:14,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 21: [2022-11-27 08:09:14,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:09:14,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 08:09:14,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 9: [2022-11-27 08:09:14,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
3: [2022-11-27 08:09:14,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:09:14,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 08:09:14,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 3: [2022-11-27 08:09:14,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 15: [2022-11-27 08:09:14,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:09:14,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 15: [2022-11-27 08:09:14,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 7: [2022-11-27 08:09:14,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:09:14,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 7: [2022-11-27 08:09:14,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 08:09:14,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 30: [2022-11-27 08:09:14,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
30: [2022-11-27 08:09:14,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 08:09:14,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 23: [2022-11-27 08:09:14,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-27 08:09:14,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 08:09:14,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 13: [2022-11-27 08:09:14,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:09:14,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 08:09:14,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 20: [2022-11-27 08:09:14,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-27 08:09:14,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 08:09:14,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 17: [2022-11-27 08:09:14,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-27 08:09:14,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 08:09:14,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 25: [2022-11-27 08:09:14,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:09:14,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:09:14,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 11: [2022-11-27 08:09:14,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 25: [2022-11-27 08:09:14,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 11: [2022-11-27 08:09:14,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 6: [2022-11-27 08:09:14,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:09:14,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 08:09:14,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 5: [2022-11-27 08:09:14,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-27 08:09:14,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 31: [2022-11-27 08:09:14,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-27 08:09:14,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 08:09:14,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 19: [2022-11-27 08:09:14,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:09:14,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 08:09:14,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 2: [2022-11-27 08:09:14,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:09:14,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
2: [2022-11-27 08:09:14,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 18: [2022-11-27 08:09:14,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 2: [2022-11-27 08:09:14,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 18: [2022-11-27 08:09:14,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 8: [2022-11-27 08:09:14,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:09:14,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 08:09:14,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 10: [2022-11-27 08:09:14,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:09:14,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 08:09:14,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 24: [2022-11-27 08:09:14,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 
24: [2022-11-27 08:09:14,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 08:09:14,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 22: [2022-11-27 08:09:14,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-27 08:09:14,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 08:09:14,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 16: [2022-11-27 08:09:14,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 08:09:14,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 08:09:14,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 5: [2022-11-27 08:09:14,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 30: [2022-11-27 08:09:14,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:09:14,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 08:09:14,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 
28: [2022-11-27 08:09:14,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 08:09:14,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 28: [2022-11-27 08:09:14,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-27 08:09:14,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 08:09:14,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 4: [2022-11-27 08:09:14,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:09:14,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 08:09:14,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 27: [2022-11-27 08:09:14,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-27 08:09:14,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 08:09:14,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 14: [2022-11-27 08:09:14,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-27 08:09:14,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 08:09:14,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 13: [2022-11-27 08:09:14,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:09:14,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 08:09:14,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 23: [2022-11-27 08:09:14,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-27 08:09:14,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 08:09:14,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 21: [2022-11-27 08:09:14,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:09:14,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 08:09:14,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 26: [2022-11-27 08:09:14,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
26: [2022-11-27 08:09:14,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 08:09:14,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 15: [2022-11-27 08:09:14,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:09:14,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 08:09:14,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 29: [2022-11-27 08:09:14,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-27 08:09:14,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 08:09:14,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 3: [2022-11-27 08:09:14,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:09:14,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 08:09:14,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 0: [2022-11-27 08:09:14,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-27 08:09:14,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 08:09:14,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 0: [2022-11-27 08:09:14,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:09:14,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 08:09:14,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 7: [2022-11-27 08:09:14,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:09:14,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 08:09:14,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 28: [2022-11-27 08:09:14,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-27 08:09:14,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 08:09:14,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 14: [2022-11-27 08:09:14,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-27 08:09:14,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 08:09:14,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 1: [2022-11-27 08:09:14,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:09:14,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 08:09:14,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 17: [2022-11-27 08:09:14,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:09:14,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 08:09:14,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 22: [2022-11-27 08:09:14,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 08:09:14,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 08:09:14,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 12: [2022-11-27 08:09:14,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-27 08:09:14,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 08:09:14,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 21: [2022-11-27 08:09:14,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:09:14,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 08:09:14,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 17: [2022-11-27 08:09:14,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:09:14,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 08:09:14,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 2: [2022-11-27 08:09:14,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:09:14,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 08:09:14,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 8: [2022-11-27 08:09:14,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-27 08:09:14,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 08:09:14,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 9: [2022-11-27 08:09:14,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:09:14,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 08:09:14,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:09:14,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 9: [2022-11-27 08:09:14,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step169000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 08:09:14,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step169000 is ready now! 
0: successfully saved checkpoint at iteration 169000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 3858.96 31: iteration 169010/ 173500 | consumed samples: 43266560 | consumed tokens: 88609914880 | elapsed time per iteration (s): 1.21 | learning rate: 2.030E-05 | global batch size: 256 | lm loss: 1.897545E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 211.855 | TFLOPs: 12.82 | 31: iteration 169020/ 173500 | consumed samples: 43269120 | consumed tokens: 88615157760 | elapsed time per iteration (s): 0.80 | learning rate: 2.030E-05 | global batch size: 256 | lm loss: 1.901807E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.156 | TFLOPs: 19.25 | 31: iteration 169030/ 173500 | consumed samples: 43271680 | consumed tokens: 88620400640 | elapsed time per iteration (s): 0.79 | learning rate: 2.030E-05 | global batch size: 256 | lm loss: 1.921622E+00 | grad norm: 0.215 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.851 | TFLOPs: 19.59 | 31: iteration 169040/ 173500 | consumed samples: 43274240 | consumed tokens: 88625643520 | elapsed time per iteration (s): 0.84 | learning rate: 2.030E-05 | global batch size: 256 | lm loss: 1.907557E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.373 | TFLOPs: 18.35 | 31: iteration 169050/ 173500 | consumed samples: 43276800 | consumed tokens: 88630886400 | elapsed time per iteration (s): 0.78 | learning rate: 2.030E-05 | global batch size: 256 | lm loss: 1.912355E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.549 | TFLOPs: 19.94 | 31: iteration 169060/ 173500 | consumed samples: 43279360 | consumed tokens: 88636129280 | elapsed time per iteration (s): 
0.77 | learning rate: 2.030E-05 | global batch size: 256 | lm loss: 1.896154E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.255 | TFLOPs: 20.04 | 31: iteration 169070/ 173500 | consumed samples: 43281920 | consumed tokens: 88641372160 | elapsed time per iteration (s): 0.73 | learning rate: 2.030E-05 | global batch size: 256 | lm loss: 1.901565E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.775 | TFLOPs: 21.10 | 31: iteration 169080/ 173500 | consumed samples: 43284480 | consumed tokens: 88646615040 | elapsed time per iteration (s): 0.78 | learning rate: 2.029E-05 | global batch size: 256 | lm loss: 1.892704E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.458 | TFLOPs: 19.81 | 31: iteration 169090/ 173500 | consumed samples: 43287040 | consumed tokens: 88651857920 | elapsed time per iteration (s): 0.80 | learning rate: 2.029E-05 | global batch size: 256 | lm loss: 1.911544E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.357 | TFLOPs: 19.44 | 31: iteration 169100/ 173500 | consumed samples: 43289600 | consumed tokens: 88657100800 | elapsed time per iteration (s): 0.81 | learning rate: 2.029E-05 | global batch size: 256 | lm loss: 1.907533E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.392 | TFLOPs: 19.02 | 31: iteration 169110/ 173500 | consumed samples: 43292160 | consumed tokens: 88662343680 | elapsed time per iteration (s): 0.84 | learning rate: 2.029E-05 | global batch size: 256 | lm loss: 1.915947E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.287 | TFLOPs: 18.47 | 31: 
iteration 169120/ 173500 | consumed samples: 43294720 | consumed tokens: 88667586560 | elapsed time per iteration (s): 0.82 | learning rate: 2.029E-05 | global batch size: 256 | lm loss: 1.913799E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.909 | TFLOPs: 18.93 | 31: iteration 169130/ 173500 | consumed samples: 43297280 | consumed tokens: 88672829440 | elapsed time per iteration (s): 0.79 | learning rate: 2.029E-05 | global batch size: 256 | lm loss: 1.916653E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.365 | TFLOPs: 19.62 | 31: iteration 169140/ 173500 | consumed samples: 43299840 | consumed tokens: 88678072320 | elapsed time per iteration (s): 0.84 | learning rate: 2.029E-05 | global batch size: 256 | lm loss: 1.907018E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.373 | TFLOPs: 18.41 | 31: iteration 169150/ 173500 | consumed samples: 43302400 | consumed tokens: 88683315200 | elapsed time per iteration (s): 0.78 | learning rate: 2.028E-05 | global batch size: 256 | lm loss: 1.912837E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.534 | TFLOPs: 19.88 | 31: iteration 169160/ 173500 | consumed samples: 43304960 | consumed tokens: 88688558080 | elapsed time per iteration (s): 0.79 | learning rate: 2.028E-05 | global batch size: 256 | lm loss: 1.916351E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.768 | TFLOPs: 19.65 | 31: iteration 169170/ 173500 | consumed samples: 43307520 | consumed tokens: 88693800960 | elapsed time per iteration (s): 0.76 | learning rate: 2.028E-05 | global batch size: 256 | lm loss: 1.878845E+00 | grad norm: 0.204 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.688 | TFLOPs: 20.31 | 31: iteration 169180/ 173500 | consumed samples: 43310080 | consumed tokens: 88699043840 | elapsed time per iteration (s): 0.78 | learning rate: 2.028E-05 | global batch size: 256 | lm loss: 1.933753E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.458 | TFLOPs: 19.93 | 31: iteration 169190/ 173500 | consumed samples: 43312640 | consumed tokens: 88704286720 | elapsed time per iteration (s): 0.73 | learning rate: 2.028E-05 | global batch size: 256 | lm loss: 1.916829E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.367 | TFLOPs: 21.26 | 31: iteration 169200/ 173500 | consumed samples: 43315200 | consumed tokens: 88709529600 | elapsed time per iteration (s): 0.79 | learning rate: 2.028E-05 | global batch size: 256 | lm loss: 1.903063E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.084 | TFLOPs: 19.49 | 31: iteration 169210/ 173500 | consumed samples: 43317760 | consumed tokens: 88714772480 | elapsed time per iteration (s): 0.79 | learning rate: 2.028E-05 | global batch size: 256 | lm loss: 1.905245E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.268 | TFLOPs: 19.68 | 31: iteration 169220/ 173500 | consumed samples: 43320320 | consumed tokens: 88720015360 | elapsed time per iteration (s): 0.75 | learning rate: 2.028E-05 | global batch size: 256 | lm loss: 1.888106E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.904 | TFLOPs: 20.74 | 31: iteration 169230/ 173500 | consumed samples: 43322880 | consumed tokens: 88725258240 | elapsed time per iteration (s): 0.78 | 
learning rate: 2.027E-05 | global batch size: 256 | lm loss: 1.919171E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.989 | TFLOPs: 19.78 | 31: iteration 169240/ 173500 | consumed samples: 43325440 | consumed tokens: 88730501120 | elapsed time per iteration (s): 0.78 | learning rate: 2.027E-05 | global batch size: 256 | lm loss: 1.935501E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.748 | TFLOPs: 19.89 | 31: iteration 169250/ 173500 | consumed samples: 43328000 | consumed tokens: 88735744000 | elapsed time per iteration (s): 0.88 | learning rate: 2.027E-05 | global batch size: 256 | lm loss: 1.900845E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.690 | TFLOPs: 17.65 | 31: iteration 169260/ 173500 | consumed samples: 43330560 | consumed tokens: 88740986880 | elapsed time per iteration (s): 0.73 | learning rate: 2.027E-05 | global batch size: 256 | lm loss: 1.903880E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.028 | TFLOPs: 21.12 | 31: iteration 169270/ 173500 | consumed samples: 43333120 | consumed tokens: 88746229760 | elapsed time per iteration (s): 0.73 | learning rate: 2.027E-05 | global batch size: 256 | lm loss: 1.888325E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.694 | TFLOPs: 21.34 | 31: iteration 169280/ 173500 | consumed samples: 43335680 | consumed tokens: 88751472640 | elapsed time per iteration (s): 0.73 | learning rate: 2.027E-05 | global batch size: 256 | lm loss: 1.958841E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.603 | TFLOPs: 21.15 | 31: iteration 
169290/ 173500 | consumed samples: 43338240 | consumed tokens: 88756715520 | elapsed time per iteration (s): 0.77 | learning rate: 2.027E-05 | global batch size: 256 | lm loss: 1.920776E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.214 | TFLOPs: 20.16 | 31: iteration 169300/ 173500 | consumed samples: 43340800 | consumed tokens: 88761958400 | elapsed time per iteration (s): 0.78 | learning rate: 2.027E-05 | global batch size: 256 | lm loss: 1.920929E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.082 | TFLOPs: 19.79 | 31: iteration 169310/ 173500 | consumed samples: 43343360 | consumed tokens: 88767201280 | elapsed time per iteration (s): 0.75 | learning rate: 2.026E-05 | global batch size: 256 | lm loss: 1.892870E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.917 | TFLOPs: 20.75 | 31: iteration 169320/ 173500 | consumed samples: 43345920 | consumed tokens: 88772444160 | elapsed time per iteration (s): 0.78 | learning rate: 2.026E-05 | global batch size: 256 | lm loss: 1.916615E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.579 | TFLOPs: 19.88 | 31: iteration 169330/ 173500 | consumed samples: 43348480 | consumed tokens: 88777687040 | elapsed time per iteration (s): 0.73 | learning rate: 2.026E-05 | global batch size: 256 | lm loss: 1.887374E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.861 | TFLOPs: 21.35 | 31: iteration 169340/ 173500 | consumed samples: 43351040 | consumed tokens: 88782929920 | elapsed time per iteration (s): 0.80 | learning rate: 2.026E-05 | global batch size: 256 | lm loss: 1.918564E+00 | grad norm: 0.196 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.085 | TFLOPs: 19.42 | 31: iteration 169350/ 173500 | consumed samples: 43353600 | consumed tokens: 88788172800 | elapsed time per iteration (s): 0.72 | learning rate: 2.026E-05 | global batch size: 256 | lm loss: 1.901536E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.003 | TFLOPs: 21.66 | 31: iteration 169360/ 173500 | consumed samples: 43356160 | consumed tokens: 88793415680 | elapsed time per iteration (s): 0.85 | learning rate: 2.026E-05 | global batch size: 256 | lm loss: 1.907676E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 302.265 | TFLOPs: 18.29 | 31: iteration 169370/ 173500 | consumed samples: 43358720 | consumed tokens: 88798658560 | elapsed time per iteration (s): 0.78 | learning rate: 2.026E-05 | global batch size: 256 | lm loss: 1.923529E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.171 | TFLOPs: 19.97 | 31: iteration 169380/ 173500 | consumed samples: 43361280 | consumed tokens: 88803901440 | elapsed time per iteration (s): 0.76 | learning rate: 2.026E-05 | global batch size: 256 | lm loss: 1.919742E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.190 | TFLOPs: 20.46 | 31: iteration 169390/ 173500 | consumed samples: 43363840 | consumed tokens: 88809144320 | elapsed time per iteration (s): 0.73 | learning rate: 2.025E-05 | global batch size: 256 | lm loss: 1.943404E+00 | grad norm: 0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 352.551 | TFLOPs: 21.33 | 31: iteration 169400/ 173500 | consumed samples: 43366400 | consumed tokens: 88814387200 | elapsed time per iteration (s): 0.74 | learning 
rate: 2.025E-05 | global batch size: 256 | lm loss: 1.897043E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.261 | TFLOPs: 21.01 | 31: iteration 169410/ 173500 | consumed samples: 43368960 | consumed tokens: 88819630080 | elapsed time per iteration (s): 0.71 | learning rate: 2.025E-05 | global batch size: 256 | lm loss: 1.918274E+00 | grad norm: 0.216 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 359.867 | TFLOPs: 21.77 | 31: iteration 169420/ 173500 | consumed samples: 43371520 | consumed tokens: 88824872960 | elapsed time per iteration (s): 0.78 | learning rate: 2.025E-05 | global batch size: 256 | lm loss: 1.908989E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.099 | TFLOPs: 19.97 | 31: iteration 169430/ 173500 | consumed samples: 43374080 | consumed tokens: 88830115840 | elapsed time per iteration (s): 0.76 | learning rate: 2.025E-05 | global batch size: 256 | lm loss: 1.935698E+00 | grad norm: 0.218 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.408 | TFLOPs: 20.41 | 31: iteration 169440/ 173500 | consumed samples: 43376640 | consumed tokens: 88835358720 | elapsed time per iteration (s): 0.87 | learning rate: 2.025E-05 | global batch size: 256 | lm loss: 1.923063E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.719 | TFLOPs: 17.77 | 31: iteration 169450/ 173500 | consumed samples: 43379200 | consumed tokens: 88840601600 | elapsed time per iteration (s): 0.74 | learning rate: 2.025E-05 | global batch size: 256 | lm loss: 1.902694E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.950 | TFLOPs: 20.99 | 31: iteration 169460/ 
173500 | consumed samples: 43381760 | consumed tokens: 88845844480 | elapsed time per iteration (s): 0.77 | learning rate: 2.025E-05 | global batch size: 256 | lm loss: 1.923386E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.223 | TFLOPs: 20.10 | 31: iteration 169470/ 173500 | consumed samples: 43384320 | consumed tokens: 88851087360 | elapsed time per iteration (s): 0.73 | learning rate: 2.024E-05 | global batch size: 256 | lm loss: 1.926220E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.205 | TFLOPs: 21.19 | 31: iteration 169480/ 173500 | consumed samples: 43386880 | consumed tokens: 88856330240 | elapsed time per iteration (s): 0.77 | learning rate: 2.024E-05 | global batch size: 256 | lm loss: 1.906624E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.025 | TFLOPs: 20.03 | 31: iteration 169490/ 173500 | consumed samples: 43389440 | consumed tokens: 88861573120 | elapsed time per iteration (s): 0.73 | learning rate: 2.024E-05 | global batch size: 256 | lm loss: 1.884687E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.090 | TFLOPs: 21.18 | 31: iteration 169500/ 173500 | consumed samples: 43392000 | consumed tokens: 88866816000 | elapsed time per iteration (s): 0.77 | learning rate: 2.024E-05 | global batch size: 256 | lm loss: 1.908323E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.727 | TFLOPs: 20.01 | 31: iteration 169510/ 173500 | consumed samples: 43394560 | consumed tokens: 88872058880 | elapsed time per iteration (s): 0.73 | learning rate: 2.024E-05 | global batch size: 256 | lm loss: 1.868941E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 351.031 | TFLOPs: 21.24 | 31: iteration 169520/ 173500 | consumed samples: 43397120 | consumed tokens: 88877301760 | elapsed time per iteration (s): 0.74 | learning rate: 2.024E-05 | global batch size: 256 | lm loss: 1.901084E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.463 | TFLOPs: 20.90 | 31: iteration 169530/ 173500 | consumed samples: 43399680 | consumed tokens: 88882544640 | elapsed time per iteration (s): 0.75 | learning rate: 2.024E-05 | global batch size: 256 | lm loss: 1.899146E+00 | grad norm: 0.220 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.922 | TFLOPs: 20.56 | 31: iteration 169540/ 173500 | consumed samples: 43402240 | consumed tokens: 88887787520 | elapsed time per iteration (s): 0.74 | learning rate: 2.024E-05 | global batch size: 256 | lm loss: 1.904149E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.046 | TFLOPs: 21.06 | 31: iteration 169550/ 173500 | consumed samples: 43404800 | consumed tokens: 88893030400 | elapsed time per iteration (s): 0.75 | learning rate: 2.023E-05 | global batch size: 256 | lm loss: 1.903624E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.275 | TFLOPs: 20.65 | 31: iteration 169560/ 173500 | consumed samples: 43407360 | consumed tokens: 88898273280 | elapsed time per iteration (s): 0.78 | learning rate: 2.023E-05 | global batch size: 256 | lm loss: 1.910478E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.975 | TFLOPs: 19.96 | 31: iteration 169570/ 173500 | consumed samples: 43409920 | consumed tokens: 88903516160 | elapsed time per iteration (s): 0.91 | learning rate: 
2.023E-05 | global batch size: 256 | lm loss: 1.899402E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 282.086 | TFLOPs: 17.07 | 31: iteration 169580/ 173500 | consumed samples: 43412480 | consumed tokens: 88908759040 | elapsed time per iteration (s): 0.78 | learning rate: 2.023E-05 | global batch size: 256 | lm loss: 1.907034E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.530 | TFLOPs: 19.94 | 31: iteration 169590/ 173500 | consumed samples: 43415040 | consumed tokens: 88914001920 | elapsed time per iteration (s): 0.78 | learning rate: 2.023E-05 | global batch size: 256 | lm loss: 1.880693E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.724 | TFLOPs: 19.89 | 31: iteration 169600/ 173500 | consumed samples: 43417600 | consumed tokens: 88919244800 | elapsed time per iteration (s): 0.77 | learning rate: 2.023E-05 | global batch size: 256 | lm loss: 1.873995E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.004 | TFLOPs: 20.21 | 31: iteration 169610/ 173500 | consumed samples: 43420160 | consumed tokens: 88924487680 | elapsed time per iteration (s): 0.73 | learning rate: 2.023E-05 | global batch size: 256 | lm loss: 1.900270E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.715 | TFLOPs: 21.10 | 31: iteration 169620/ 173500 | consumed samples: 43422720 | consumed tokens: 88929730560 | elapsed time per iteration (s): 0.74 | learning rate: 2.023E-05 | global batch size: 256 | lm loss: 1.917149E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.064 | TFLOPs: 21.00 | 31: iteration 169630/ 173500 | 
consumed samples: 43425280 | consumed tokens: 88934973440 | elapsed time per iteration (s): 0.75 | learning rate: 2.023E-05 | global batch size: 256 | lm loss: 1.910435E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.770 | TFLOPs: 20.74 | 31: iteration 169640/ 173500 | consumed samples: 43427840 | consumed tokens: 88940216320 | elapsed time per iteration (s): 0.77 | learning rate: 2.022E-05 | global batch size: 256 | lm loss: 1.915708E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.487 | TFLOPs: 20.18 | 31: iteration 169650/ 173500 | consumed samples: 43430400 | consumed tokens: 88945459200 | elapsed time per iteration (s): 0.77 | learning rate: 2.022E-05 | global batch size: 256 | lm loss: 1.923689E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.931 | TFLOPs: 20.02 | 31: iteration 169660/ 173500 | consumed samples: 43432960 | consumed tokens: 88950702080 | elapsed time per iteration (s): 0.80 | learning rate: 2.022E-05 | global batch size: 256 | lm loss: 1.941276E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.983 | TFLOPs: 19.30 | 31: iteration 169670/ 173500 | consumed samples: 43435520 | consumed tokens: 88955944960 | elapsed time per iteration (s): 0.72 | learning rate: 2.022E-05 | global batch size: 256 | lm loss: 1.894980E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.097 | TFLOPs: 21.54 | 31: iteration 169680/ 173500 | consumed samples: 43438080 | consumed tokens: 88961187840 | elapsed time per iteration (s): 0.82 | learning rate: 2.022E-05 | global batch size: 256 | lm loss: 1.898050E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 312.580 | TFLOPs: 18.91 | 31: iteration 169690/ 173500 | consumed samples: 43440640 | consumed tokens: 88966430720 | elapsed time per iteration (s): 0.83 | learning rate: 2.022E-05 | global batch size: 256 | lm loss: 1.905431E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.996 | TFLOPs: 18.69 | 31: iteration 169700/ 173500 | consumed samples: 43443200 | consumed tokens: 88971673600 | elapsed time per iteration (s): 0.80 | learning rate: 2.022E-05 | global batch size: 256 | lm loss: 1.908199E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.223 | TFLOPs: 19.31 | 31: iteration 169710/ 173500 | consumed samples: 43445760 | consumed tokens: 88976916480 | elapsed time per iteration (s): 0.81 | learning rate: 2.022E-05 | global batch size: 256 | lm loss: 1.934202E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.456 | TFLOPs: 19.14 | 31: iteration 169720/ 173500 | consumed samples: 43448320 | consumed tokens: 88982159360 | elapsed time per iteration (s): 0.73 | learning rate: 2.022E-05 | global batch size: 256 | lm loss: 1.905301E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.720 | TFLOPs: 21.22 | 31: iteration 169730/ 173500 | consumed samples: 43450880 | consumed tokens: 88987402240 | elapsed time per iteration (s): 0.74 | learning rate: 2.021E-05 | global batch size: 256 | lm loss: 1.902800E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.118 | TFLOPs: 20.88 | 31: iteration 169740/ 173500 | consumed samples: 43453440 | consumed tokens: 88992645120 | elapsed time per iteration (s): 0.78 | learning rate: 
2.021E-05 | global batch size: 256 | lm loss: 1.947176E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.378 | TFLOPs: 19.93 | 31: iteration 169750/ 173500 | consumed samples: 43456000 | consumed tokens: 88997888000 | elapsed time per iteration (s): 0.74 | learning rate: 2.021E-05 | global batch size: 256 | lm loss: 1.924532E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.375 | TFLOPs: 21.02 | 31: iteration 169760/ 173500 | consumed samples: 43458560 | consumed tokens: 89003130880 | elapsed time per iteration (s): 0.79 | learning rate: 2.021E-05 | global batch size: 256 | lm loss: 1.905660E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.265 | TFLOPs: 19.68 | 31: iteration 169770/ 173500 | consumed samples: 43461120 | consumed tokens: 89008373760 | elapsed time per iteration (s): 0.75 | learning rate: 2.021E-05 | global batch size: 256 | lm loss: 1.917962E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.143 | TFLOPs: 20.76 | 31: iteration 169780/ 173500 | consumed samples: 43463680 | consumed tokens: 89013616640 | elapsed time per iteration (s): 0.75 | learning rate: 2.021E-05 | global batch size: 256 | lm loss: 1.912528E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.179 | TFLOPs: 20.64 | 31: iteration 169790/ 173500 | consumed samples: 43466240 | consumed tokens: 89018859520 | elapsed time per iteration (s): 0.73 | learning rate: 2.021E-05 | global batch size: 256 | lm loss: 1.922537E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 350.031 | TFLOPs: 21.18 | 31: iteration 169800/ 173500 | 
consumed samples: 43468800 | consumed tokens: 89024102400 | elapsed time per iteration (s): 0.77 | learning rate: 2.021E-05 | global batch size: 256 | lm loss: 1.904126E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.004 | TFLOPs: 20.09 | 31: iteration 169810/ 173500 | consumed samples: 43471360 | consumed tokens: 89029345280 | elapsed time per iteration (s): 0.83 | learning rate: 2.020E-05 | global batch size: 256 | lm loss: 1.907034E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.047 | TFLOPs: 18.64 | 31: iteration 169820/ 173500 | consumed samples: 43473920 | consumed tokens: 89034588160 | elapsed time per iteration (s): 0.77 | learning rate: 2.020E-05 | global batch size: 256 | lm loss: 1.895971E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.502 | TFLOPs: 20.18 | 31: iteration 169830/ 173500 | consumed samples: 43476480 | consumed tokens: 89039831040 | elapsed time per iteration (s): 0.76 | learning rate: 2.020E-05 | global batch size: 256 | lm loss: 1.897580E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.675 | TFLOPs: 20.43 | 31: iteration 169840/ 173500 | consumed samples: 43479040 | consumed tokens: 89045073920 | elapsed time per iteration (s): 0.92 | learning rate: 2.020E-05 | global batch size: 256 | lm loss: 1.889758E+00 | grad norm: 0.220 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 278.315 | TFLOPs: 16.84 | 31: iteration 169850/ 173500 | consumed samples: 43481600 | consumed tokens: 89050316800 | elapsed time per iteration (s): 0.80 | learning rate: 2.020E-05 | global batch size: 256 | lm loss: 1.908781E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 318.661 | TFLOPs: 19.28 | 31: iteration 169860/ 173500 | consumed samples: 43484160 | consumed tokens: 89055559680 | elapsed time per iteration (s): 0.75 | learning rate: 2.020E-05 | global batch size: 256 | lm loss: 1.905285E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.539 | TFLOPs: 20.66 | 31: iteration 169870/ 173500 | consumed samples: 43486720 | consumed tokens: 89060802560 | elapsed time per iteration (s): 0.78 | learning rate: 2.020E-05 | global batch size: 256 | lm loss: 1.900510E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.154 | TFLOPs: 19.97 | 31: iteration 169880/ 173500 | consumed samples: 43489280 | consumed tokens: 89066045440 | elapsed time per iteration (s): 0.74 | learning rate: 2.020E-05 | global batch size: 256 | lm loss: 1.891987E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.865 | TFLOPs: 20.98 | 31: iteration 169890/ 173500 | consumed samples: 43491840 | consumed tokens: 89071288320 | elapsed time per iteration (s): 0.78 | learning rate: 2.020E-05 | global batch size: 256 | lm loss: 1.903100E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.838 | TFLOPs: 19.83 | 31: iteration 169900/ 173500 | consumed samples: 43494400 | consumed tokens: 89076531200 | elapsed time per iteration (s): 0.71 | learning rate: 2.020E-05 | global batch size: 256 | lm loss: 1.914958E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 360.172 | TFLOPs: 21.79 | 31: iteration 169910/ 173500 | consumed samples: 43496960 | consumed tokens: 89081774080 | elapsed time per iteration (s): 0.74 | learning rate: 
2.019E-05 | global batch size: 256 | lm loss: 1.908104E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.392 | TFLOPs: 20.90 | 31: iteration 169920/ 173500 | consumed samples: 43499520 | consumed tokens: 89087016960 | elapsed time per iteration (s): 0.79 | learning rate: 2.019E-05 | global batch size: 256 | lm loss: 1.912184E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.548 | TFLOPs: 19.63 | 31: iteration 169930/ 173500 | consumed samples: 43502080 | consumed tokens: 89092259840 | elapsed time per iteration (s): 0.78 | learning rate: 2.019E-05 | global batch size: 256 | lm loss: 1.908820E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.503 | TFLOPs: 19.75 | 31: iteration 169940/ 173500 | consumed samples: 43504640 | consumed tokens: 89097502720 | elapsed time per iteration (s): 0.75 | learning rate: 2.019E-05 | global batch size: 256 | lm loss: 1.913730E+00 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.309 | TFLOPs: 20.59 | 31: iteration 169950/ 173500 | consumed samples: 43507200 | consumed tokens: 89102745600 | elapsed time per iteration (s): 0.76 | learning rate: 2.019E-05 | global batch size: 256 | lm loss: 1.928821E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.173 | TFLOPs: 20.28 | 31: iteration 169960/ 173500 | consumed samples: 43509760 | consumed tokens: 89107988480 | elapsed time per iteration (s): 0.81 | learning rate: 2.019E-05 | global batch size: 256 | lm loss: 1.925994E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.856 | TFLOPs: 19.23 | 31: iteration 169970/ 173500 | 
consumed samples: 43512320 | consumed tokens: 89113231360 | elapsed time per iteration (s): 0.75 | learning rate: 2.019E-05 | global batch size: 256 | lm loss: 1.886374E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.267 | TFLOPs: 20.59 | 31: iteration 169980/ 173500 | consumed samples: 43514880 | consumed tokens: 89118474240 | elapsed time per iteration (s): 0.79 | learning rate: 2.019E-05 | global batch size: 256 | lm loss: 1.891649E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.637 | TFLOPs: 19.70 | 31: iteration 169990/ 173500 | consumed samples: 43517440 | consumed tokens: 89123717120 | elapsed time per iteration (s): 0.76 | learning rate: 2.019E-05 | global batch size: 256 | lm loss: 1.906383E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.781 | TFLOPs: 20.37 | 0: [2022-11-27 08:22:07,813] [INFO] [logging.py:68:log_dist] [Rank 0] step=170000, skipped=0, lr=[2.0184402348785326e-05, 2.0184402348785326e-05, 2.0184402348785326e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 170000/ 173500 | consumed samples: 43520000 | consumed tokens: 89128960000 | elapsed time per iteration (s): 0.78 | learning rate: 2.018E-05 | global batch size: 256 | lm loss: 1.909716E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.821 | TFLOPs: 19.89 | 0: steps: 170000 loss: 1.8586 iter time (s): 0.803 samples/sec: 318.939 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 170000 | lm loss value: 1.883842E+00 | lm loss PPL: 6.578732E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 170000 to 
checkpoints_1b1long 0: [2022-11-27 08:22:08,127] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step170000 is begin to save! 0: [2022-11-27 08:22:08,138] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_01-model_00-model_states.pt... 0: [2022-11-27 08:22:08,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_01-model_00-model_states.pt. 0: [2022-11-27 08:22:08,358] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_03-model_00-model_states.pt... 0: [2022-11-27 08:22:08,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_03-model_00-model_states.pt. 0: [2022-11-27 08:22:08,441] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_04-model_00-model_states.pt... 0: [2022-11-27 08:22:08,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_04-model_00-model_states.pt. 0: [2022-11-27 08:22:08,526] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_05-model_00-model_states.pt... 0: [2022-11-27 08:22:08,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_05-model_00-model_states.pt. 0: [2022-11-27 08:22:08,616] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_06-model_00-model_states.pt... 0: [2022-11-27 08:22:08,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_06-model_00-model_states.pt. 0: [2022-11-27 08:22:08,694] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_07-model_00-model_states.pt... 
0: [2022-11-27 08:22:08,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_07-model_00-model_states.pt. 0: [2022-11-27 08:22:08,768] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_08-model_00-model_states.pt... 0: [2022-11-27 08:22:08,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_08-model_00-model_states.pt. 0: [2022-11-27 08:22:08,850] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_09-model_00-model_states.pt... 0: [2022-11-27 08:22:08,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_09-model_00-model_states.pt. 0: [2022-11-27 08:22:08,923] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_10-model_00-model_states.pt... 0: [2022-11-27 08:22:08,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_10-model_00-model_states.pt. 0: [2022-11-27 08:22:08,999] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_11-model_00-model_states.pt... 0: [2022-11-27 08:22:09,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_11-model_00-model_states.pt. 0: [2022-11-27 08:22:09,074] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_12-model_00-model_states.pt... 0: [2022-11-27 08:22:09,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_12-model_00-model_states.pt. 0: [2022-11-27 08:22:09,147] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 08:22:09,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_13-model_00-model_states.pt. 0: [2022-11-27 08:22:09,221] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_14-model_00-model_states.pt... 0: [2022-11-27 08:22:09,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_14-model_00-model_states.pt. 0: [2022-11-27 08:22:09,296] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_15-model_00-model_states.pt... 0: [2022-11-27 08:22:09,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_15-model_00-model_states.pt. 0: [2022-11-27 08:22:09,370] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_16-model_00-model_states.pt... 0: [2022-11-27 08:22:09,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_16-model_00-model_states.pt. 0: [2022-11-27 08:22:09,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_17-model_00-model_states.pt... 0: [2022-11-27 08:22:09,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_17-model_00-model_states.pt. 0: [2022-11-27 08:22:09,519] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_18-model_00-model_states.pt... 0: [2022-11-27 08:22:09,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_18-model_00-model_states.pt. 0: [2022-11-27 08:22:09,592] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 08:22:09,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_19-model_00-model_states.pt. 0: [2022-11-27 08:22:09,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_20-model_00-model_states.pt... 0: [2022-11-27 08:22:09,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_20-model_00-model_states.pt. 0: [2022-11-27 08:22:09,743] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_21-model_00-model_states.pt... 0: [2022-11-27 08:22:09,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_21-model_00-model_states.pt. 0: [2022-11-27 08:22:09,816] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_22-model_00-model_states.pt... 0: [2022-11-27 08:22:09,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_22-model_00-model_states.pt. 0: [2022-11-27 08:22:09,892] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_23-model_00-model_states.pt... 0: [2022-11-27 08:22:09,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_23-model_00-model_states.pt. 0: [2022-11-27 08:22:09,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_24-model_00-model_states.pt... 0: [2022-11-27 08:22:10,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_24-model_00-model_states.pt. 0: [2022-11-27 08:22:10,041] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 08:22:10,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_25-model_00-model_states.pt. 0: [2022-11-27 08:22:10,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_26-model_00-model_states.pt... 0: [2022-11-27 08:22:10,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_26-model_00-model_states.pt. 0: [2022-11-27 08:22:10,190] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_27-model_00-model_states.pt... 0: [2022-11-27 08:22:10,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_27-model_00-model_states.pt. 0: [2022-11-27 08:22:10,266] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_28-model_00-model_states.pt... 0: [2022-11-27 08:22:10,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_28-model_00-model_states.pt. 0: [2022-11-27 08:22:10,339] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/layer_30-model_00-model_states.pt... 0: [2022-11-27 08:22:10,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/layer_30-model_00-model_states.pt. 0: [2022-11-27 08:22:10,343] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step170000/mp_rank_00_model_states.pt 0: [2022-11-27 08:22:10,343] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/mp_rank_00_model_states.pt... 0: [2022-11-27 08:22:10,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/mp_rank_00_model_states.pt. 
0: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
4: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
1: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 
13: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
15: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 20: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 
23: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 
24: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 
17: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 
27: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
6: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 
8: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
13: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 
20: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
14: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 
22: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 
21: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 
27: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 
10: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 
20: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 
30: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 
7: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 
24: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
5: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 
5: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:22:10,420] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 20: [2022-11-27 08:22:10,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:22:10,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:22:10,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 08:22:10,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 20: [2022-11-27 08:22:10,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 08:22:10,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 12: [2022-11-27 08:22:10,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-27 08:22:10,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 08:22:10,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 23: [2022-11-27 08:22:10,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-27 08:22:10,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 08:22:10,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 6: [2022-11-27 08:22:10,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:22:10,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:22:10,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 11: [2022-11-27 08:22:10,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:22:10,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 6: [2022-11-27 08:22:10,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 
11: [2022-11-27 08:22:10,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 30: [2022-11-27 08:22:10,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 11: [2022-11-27 08:22:10,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 15: [2022-11-27 08:22:10,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:22:10,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 08:22:10,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 24: [2022-11-27 08:22:10,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 26: [2022-11-27 08:22:10,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-27 08:22:10,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
26: [2022-11-27 08:22:10,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 24: [2022-11-27 08:22:10,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 26: [2022-11-27 08:22:10,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 08:22:10,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 24: [2022-11-27 08:22:10,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 26: [2022-11-27 08:22:10,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 11: [2022-11-27 08:22:10,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:22:10,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 08:22:10,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 1: [2022-11-27 08:22:10,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:22:10,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-27 08:22:10,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 08:22:10,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 1: [2022-11-27 08:22:10,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 08:22:10,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 15: [2022-11-27 08:22:10,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:22:10,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 1: [2022-11-27 08:22:10,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:22:10,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 1: [2022-11-27 08:22:10,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 08:22:10,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 8: [2022-11-27 08:22:10,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-27 08:22:10,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 08:22:10,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 21: [2022-11-27 08:22:10,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:22:10,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 30: [2022-11-27 08:22:10,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:22:10,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 30: [2022-11-27 08:22:10,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 08:22:10,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 7: [2022-11-27 08:22:10,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:22:10,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 08:22:10,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 16: [2022-11-27 08:22:10,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
16: [2022-11-27 08:22:10,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 08:22:10,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 14: [2022-11-27 08:22:10,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:22:10,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:22:10,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 08:22:10,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 14: [2022-11-27 08:22:10,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 08:22:10,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 10: [2022-11-27 08:22:10,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:22:10,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 08:22:10,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 31: [2022-11-27 08:22:10,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 
31: [2022-11-27 08:22:10,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 6: [2022-11-27 08:22:10,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 31: [2022-11-27 08:22:10,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 6: [2022-11-27 08:22:10,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 08:22:10,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 9: [2022-11-27 08:22:10,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:22:10,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:22:10,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 08:22:10,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 25: [2022-11-27 08:22:10,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 08:22:10,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 0: [2022-11-27 08:22:10,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-27 08:22:10,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 4: [2022-11-27 08:22:10,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 20: [2022-11-27 08:22:10,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:22:10,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 20: [2022-11-27 08:22:10,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 4: [2022-11-27 08:22:10,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 20: [2022-11-27 08:22:10,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 4: [2022-11-27 08:22:10,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 25: [2022-11-27 08:22:10,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:22:10,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 08:22:10,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 10: [2022-11-27 08:22:10,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-27 08:22:10,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 13: [2022-11-27 08:22:10,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:22:10,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 13: [2022-11-27 08:22:10,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 08:22:10,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:22:10,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 13: [2022-11-27 08:22:10,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 14: [2022-11-27 08:22:10,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:22:10,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 14: [2022-11-27 08:22:10,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 08:22:10,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 8: [2022-11-27 08:22:10,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
4: [2022-11-27 08:22:10,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:22:10,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 4: [2022-11-27 08:22:10,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 8: [2022-11-27 08:22:10,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 4: [2022-11-27 08:22:10,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 26: [2022-11-27 08:22:10,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-27 08:22:10,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 08:22:10,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 11: [2022-11-27 08:22:10,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:22:10,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 08:22:10,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 27: [2022-11-27 08:22:10,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 
27: [2022-11-27 08:22:10,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 08:22:10,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 22: [2022-11-27 08:22:10,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:22:10,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 22: [2022-11-27 08:22:10,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 08:22:10,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 8: [2022-11-27 08:22:10,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 22: [2022-11-27 08:22:10,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:22:10,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:22:10,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 22: [2022-11-27 08:22:10,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 08:22:10,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 
21: [2022-11-27 08:22:10,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 08:22:10,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 5: [2022-11-27 08:22:10,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:22:10,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:22:10,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:22:10,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:22:10,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 08:22:10,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 08:22:10,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 08:22:10,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 
25: [2022-11-27 08:22:10,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 5: [2022-11-27 08:22:10,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 25: [2022-11-27 08:22:10,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 5: [2022-11-27 08:22:10,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 2: [2022-11-27 08:22:10,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:22:10,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 08:22:10,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 12: [2022-11-27 08:22:10,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:22:10,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 08:22:10,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 9: [2022-11-27 08:22:10,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-27 08:22:10,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 08:22:10,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 24: [2022-11-27 08:22:10,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 24: [2022-11-27 08:22:10,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-27 08:22:10,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 31: [2022-11-27 08:22:10,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-27 08:22:10,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 08:22:10,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 1: [2022-11-27 08:22:10,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:22:10,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 08:22:10,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 7: [2022-11-27 08:22:10,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-27 08:22:10,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 08:22:10,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 16: [2022-11-27 08:22:10,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-27 08:22:10,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 20: [2022-11-27 08:22:10,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 16: [2022-11-27 08:22:10,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 9: [2022-11-27 08:22:10,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 20: [2022-11-27 08:22:10,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 9: [2022-11-27 08:22:10,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 20: [2022-11-27 08:22:10,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 20: [2022-11-27 08:22:10,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:22:10,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 
20: [2022-11-27 08:22:10,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 15: [2022-11-27 08:22:10,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 20: [2022-11-27 08:22:10,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 15: [2022-11-27 08:22:10,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 08:22:10,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 7: [2022-11-27 08:22:10,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:22:10,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 08:22:10,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 18: [2022-11-27 08:22:10,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:22:10,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
18: [2022-11-27 08:22:10,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 08:22:10,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 08:22:10,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 18: [2022-11-27 08:22:10,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 5: [2022-11-27 08:22:10,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:22:10,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 08:22:10,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 23: [2022-11-27 08:22:10,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:22:10,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 23: [2022-11-27 08:22:10,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 08:22:10,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 
8: [2022-11-27 08:22:10,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 08:22:10,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 6: [2022-11-27 08:22:10,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:22:10,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 08:22:10,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 19: [2022-11-27 08:22:10,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:22:10,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 08:22:10,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 30: [2022-11-27 08:22:10,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 26: [2022-11-27 08:22:10,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:22:10,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 08:22:10,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 
26: [2022-11-27 08:22:10,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 08:22:10,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 10: [2022-11-27 08:22:10,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:22:10,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 08:22:10,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 1: [2022-11-27 08:22:10,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:22:10,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 17: [2022-11-27 08:22:10,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:22:10,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 17: [2022-11-27 08:22:10,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 08:22:10,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 17: [2022-11-27 08:22:10,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-27 08:22:10,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 08:22:10,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 17: [2022-11-27 08:22:10,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:22:10,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 08:22:10,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 24: [2022-11-27 08:22:10,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:22:10,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:22:10,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 24: [2022-11-27 08:22:10,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 11: [2022-11-27 08:22:10,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 24: [2022-11-27 08:22:10,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 14: [2022-11-27 08:22:10,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-27 08:22:10,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 08:22:10,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 22: [2022-11-27 08:22:10,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:22:10,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 22: [2022-11-27 08:22:10,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 4: [2022-11-27 08:22:10,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 08:22:10,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:22:10,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 4: [2022-11-27 08:22:10,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 22: [2022-11-27 08:22:10,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 4: [2022-11-27 08:22:10,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 14: [2022-11-27 08:22:10,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-27 08:22:10,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 08:22:10,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 23: [2022-11-27 08:22:10,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-27 08:22:10,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 08:22:10,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 15: [2022-11-27 08:22:10,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:22:10,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 08:22:10,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 30: [2022-11-27 08:22:10,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:22:10,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 08:22:10,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 16: [2022-11-27 08:22:10,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 
16: [2022-11-27 08:22:10,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 08:22:10,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 27: [2022-11-27 08:22:10,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-27 08:22:10,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 08:22:10,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 21: [2022-11-27 08:22:10,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:22:10,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 08:22:10,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 13: [2022-11-27 08:22:10,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:22:10,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 08:22:10,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 9: [2022-11-27 08:22:10,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-27 08:22:10,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 08:22:10,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 29: [2022-11-27 08:22:10,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 08:22:10,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-27 08:22:10,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 29: [2022-11-27 08:22:10,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 08:22:10,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 08:22:10,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 08:22:10,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 
29: [2022-11-27 08:22:10,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 08:22:10,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 08:22:10,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 29: [2022-11-27 08:22:10,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 29: [2022-11-27 08:22:10,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 17: [2022-11-27 08:22:10,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:22:10,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 08:22:10,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 18: [2022-11-27 08:22:10,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:22:10,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:22:10,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 7: [2022-11-27 08:22:10,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-27 08:22:10,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 18: [2022-11-27 08:22:10,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 18: [2022-11-27 08:22:10,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 7: [2022-11-27 08:22:10,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 18: [2022-11-27 08:22:10,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 31: [2022-11-27 08:22:10,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 08:22:10,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 08:22:10,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 31: [2022-11-27 08:22:10,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-27 08:22:10,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 08:22:10,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 0: [2022-11-27 08:22:10,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-27 08:22:10,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 27: [2022-11-27 08:22:10,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:22:10,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 27: [2022-11-27 08:22:10,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 08:22:10,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 2: [2022-11-27 08:22:10,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:22:10,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 08:22:10,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 25: [2022-11-27 08:22:10,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:22:10,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 08:22:10,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 19: [2022-11-27 08:22:10,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 
19: [2022-11-27 08:22:10,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-27 08:22:10,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 0: [2022-11-27 08:22:10,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:22:10,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 08:22:10,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 19: [2022-11-27 08:22:10,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:22:10,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 08:22:10,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 21: [2022-11-27 08:22:10,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:22:10,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 08:22:10,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 27: [2022-11-27 08:22:10,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
27: [2022-11-27 08:22:10,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 08:22:10,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 28: [2022-11-27 08:22:10,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:22:10,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:22:10,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 08:22:10,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 28: [2022-11-27 08:22:10,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 08:22:10,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 23: [2022-11-27 08:22:10,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:22:10,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 23: [2022-11-27 08:22:10,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 08:22:10,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 
0: [2022-11-27 08:22:10,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 08:22:10,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 28: [2022-11-27 08:22:10,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 28: [2022-11-27 08:22:10,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-27 08:22:10,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 08:22:10,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 08:22:10,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 28: [2022-11-27 08:22:10,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 3: [2022-11-27 08:22:10,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:22:10,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:22:10,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:22:10,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-27 08:22:10,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 08:22:10,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 08:22:10,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 08:22:10,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 08:22:10,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 3: [2022-11-27 08:22:10,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 3: [2022-11-27 08:22:10,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 3: [2022-11-27 08:22:10,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 22: [2022-11-27 08:22:10,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-27 08:22:10,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 08:22:10,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 16: [2022-11-27 08:22:10,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
16: [2022-11-27 08:22:10,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 08:22:10,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 6: [2022-11-27 08:22:10,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:22:10,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 08:22:10,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 13: [2022-11-27 08:22:10,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:22:10,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 08:22:10,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 2: [2022-11-27 08:22:10,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:22:10,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 08:22:10,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 10: [2022-11-27 08:22:10,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-27 08:22:10,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 08:22:10,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 24: [2022-11-27 08:22:10,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 08:22:10,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 08:22:10,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 0: [2022-11-27 08:22:10,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:22:10,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 08:22:10,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 26: [2022-11-27 08:22:10,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 08:22:10,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 08:22:10,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 5: [2022-11-27 08:22:10,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-27 08:22:10,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 08:22:10,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 11: [2022-11-27 08:22:10,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:22:10,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 08:22:10,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 1: [2022-11-27 08:22:10,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:22:10,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 08:22:10,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 20: [2022-11-27 08:22:10,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-27 08:22:10,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 08:22:10,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 14: [2022-11-27 08:22:10,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-27 08:22:10,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 08:22:10,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 8: [2022-11-27 08:22:10,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:22:10,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 08:22:10,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 25: [2022-11-27 08:22:10,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:22:10,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:22:10,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 08:22:10,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 25: [2022-11-27 08:22:10,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 08:22:10,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 23: [2022-11-27 08:22:10,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 
23: [2022-11-27 08:22:10,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 08:22:10,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 31: [2022-11-27 08:22:10,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-27 08:22:10,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 08:22:10,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 30: [2022-11-27 08:22:10,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:22:10,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 08:22:10,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 15: [2022-11-27 08:22:10,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:22:10,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 08:22:10,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 3: [2022-11-27 08:22:10,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-27 08:22:10,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 08:22:10,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 12: [2022-11-27 08:22:10,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:22:10,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:22:10,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 18: [2022-11-27 08:22:10,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 12: [2022-11-27 08:22:10,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 18: [2022-11-27 08:22:10,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 7: [2022-11-27 08:22:10,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:22:10,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 08:22:10,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 4: [2022-11-27 08:22:10,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-27 08:22:10,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 08:22:10,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 19: [2022-11-27 08:22:10,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:22:10,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 08:22:10,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 27: [2022-11-27 08:22:10,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:22:10,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:22:10,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 16: [2022-11-27 08:22:10,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:22:10,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 
16: [2022-11-27 08:22:10,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 13: [2022-11-27 08:22:10,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:22:10,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 27: [2022-11-27 08:22:10,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 16: [2022-11-27 08:22:10,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 13: [2022-11-27 08:22:10,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 17: [2022-11-27 08:22:10,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 27: [2022-11-27 08:22:10,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 13: [2022-11-27 08:22:10,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 17: [2022-11-27 08:22:10,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 22: [2022-11-27 08:22:10,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
22: [2022-11-27 08:22:10,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 08:22:10,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 6: [2022-11-27 08:22:10,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:22:10,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 08:22:10,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 11: [2022-11-27 08:22:10,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:22:10,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 08:22:10,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 0: [2022-11-27 08:22:10,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:22:10,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 08:22:10,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 2: [2022-11-27 08:22:10,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-27 08:22:10,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 08:22:10,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 29: [2022-11-27 08:22:10,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-27 08:22:10,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 08:22:10,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 28: [2022-11-27 08:22:10,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-27 08:22:10,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 08:22:10,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 5: [2022-11-27 08:22:10,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:22:10,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 08:22:10,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 24: [2022-11-27 08:22:10,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
24: [2022-11-27 08:22:10,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 08:22:10,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 15: [2022-11-27 08:22:10,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:22:10,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 08:22:10,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 8: [2022-11-27 08:22:10,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:22:10,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:22:10,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:22:10,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 10: [2022-11-27 08:22:10,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 8: [2022-11-27 08:22:10,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 10: [2022-11-27 08:22:10,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 
25: [2022-11-27 08:22:10,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 08:22:10,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 26: [2022-11-27 08:22:10,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:22:10,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 26: [2022-11-27 08:22:10,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 08:22:10,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 12: [2022-11-27 08:22:10,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 08:22:10,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 31: [2022-11-27 08:22:10,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 08:22:10,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 08:22:10,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 3: [2022-11-27 08:22:10,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-27 08:22:10,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 08:22:10,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 13: [2022-11-27 08:22:10,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:22:10,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 08:22:10,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 23: [2022-11-27 08:22:10,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-27 08:22:10,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 08:22:10,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 9: [2022-11-27 08:22:10,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:22:10,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 08:22:10,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 7: [2022-11-27 08:22:10,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-27 08:22:10,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 08:22:10,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 17: [2022-11-27 08:22:10,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:22:10,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 08:22:10,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 4: [2022-11-27 08:22:10,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:22:10,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 08:22:10,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 11: [2022-11-27 08:22:10,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:22:10,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
11: [2022-11-27 08:22:10,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 6: [2022-11-27 08:22:10,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:22:10,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 6: [2022-11-27 08:22:10,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 1: [2022-11-27 08:22:10,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 6: [2022-11-27 08:22:10,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 1: [2022-11-27 08:22:10,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 18: [2022-11-27 08:22:10,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:22:10,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 08:22:10,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 28: [2022-11-27 08:22:10,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:22:10,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-27 08:22:10,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 08:22:10,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 29: [2022-11-27 08:22:10,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-27 08:22:10,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 08:22:10,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 2: [2022-11-27 08:22:10,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:22:10,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 08:22:10,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 21: [2022-11-27 08:22:10,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:22:10,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
21: [2022-11-27 08:22:10,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 30: [2022-11-27 08:22:10,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 21: [2022-11-27 08:22:10,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 30: [2022-11-27 08:22:10,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 20: [2022-11-27 08:22:10,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-27 08:22:10,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 08:22:10,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 22: [2022-11-27 08:22:10,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-27 08:22:10,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 08:22:10,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 28: [2022-11-27 08:22:10,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 08:22:10,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 
27: [2022-11-27 08:22:10,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-27 08:22:10,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 08:22:10,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 24: [2022-11-27 08:22:10,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 26: [2022-11-27 08:22:10,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 24: [2022-11-27 08:22:10,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 08:22:10,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 26: [2022-11-27 08:22:10,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 15: [2022-11-27 08:22:10,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 26: [2022-11-27 08:22:10,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 15: [2022-11-27 08:22:10,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 08:22:10,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 
1: [2022-11-27 08:22:10,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:22:10,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 08:22:10,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 5: [2022-11-27 08:22:10,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:22:10,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:22:10,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 12: [2022-11-27 08:22:10,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 5: [2022-11-27 08:22:10,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 12: [2022-11-27 08:22:10,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 19: [2022-11-27 08:22:10,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:22:10,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 08:22:10,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 
10: [2022-11-27 08:22:10,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:22:10,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 08:22:10,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 31: [2022-11-27 08:22:10,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-27 08:22:10,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 08:22:10,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 7: [2022-11-27 08:22:10,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:22:10,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:22:10,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 3: [2022-11-27 08:22:10,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 7: [2022-11-27 08:22:10,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 3: [2022-11-27 08:22:10,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 
9: [2022-11-27 08:22:10,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:22:10,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 08:22:10,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 25: [2022-11-27 08:22:10,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:22:10,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 08:22:10,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 14: [2022-11-27 08:22:10,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:22:10,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 08:22:10,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 23: [2022-11-27 08:22:10,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-27 08:22:10,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 08:22:10,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 
21: [2022-11-27 08:22:10,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:22:10,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 08:22:10,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 8: [2022-11-27 08:22:10,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:22:10,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 08:22:10,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 30: [2022-11-27 08:22:10,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:22:10,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 08:22:10,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 16: [2022-11-27 08:22:10,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-27 08:22:10,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 08:22:10,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 
20: [2022-11-27 08:22:10,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 08:22:10,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 08:22:10,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 11: [2022-11-27 08:22:10,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:22:10,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 08:22:10,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 17: [2022-11-27 08:22:10,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:22:10,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 08:22:10,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 5: [2022-11-27 08:22:10,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:22:10,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 08:22:10,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 
0: [2022-11-27 08:22:10,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:22:10,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:22:10,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 08:22:10,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 18: [2022-11-27 08:22:10,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:22:10,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:22:10,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 08:22:10,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-27 08:22:10,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 18: [2022-11-27 08:22:10,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 13: [2022-11-27 08:22:10,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-27 08:22:10,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 23: [2022-11-27 08:22:10,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:22:10,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:22:10,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 23: [2022-11-27 08:22:10,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 29: [2022-11-27 08:22:10,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 23: [2022-11-27 08:22:10,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 29: [2022-11-27 08:22:10,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 17: [2022-11-27 08:22:10,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 10: [2022-11-27 08:22:10,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 29: [2022-11-27 08:22:10,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 
4: [2022-11-27 08:22:10,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:22:10,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 4: [2022-11-27 08:22:10,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 10: [2022-11-27 08:22:10,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 08:22:10,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 4: [2022-11-27 08:22:10,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 24: [2022-11-27 08:22:10,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-27 08:22:10,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-27 08:22:10,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 29: [2022-11-27 08:22:10,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:22:10,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 
29: [2022-11-27 08:22:10,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 08:22:10,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 30: [2022-11-27 08:22:10,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 08:22:10,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 9: [2022-11-27 08:22:10,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:22:10,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 08:22:10,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 16: [2022-11-27 08:22:10,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 08:22:10,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 08:22:10,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 15: [2022-11-27 08:22:10,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-27 08:22:10,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 08:22:10,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 22: [2022-11-27 08:22:10,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 08:22:10,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 08:22:10,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 20: [2022-11-27 08:22:10,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-27 08:22:10,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 08:22:10,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 26: [2022-11-27 08:22:10,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-27 08:22:10,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 1: [2022-11-27 08:22:10,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 26: [2022-11-27 08:22:10,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 
1: [2022-11-27 08:22:10,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 08:22:10,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 12: [2022-11-27 08:22:10,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:22:10,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 31: [2022-11-27 08:22:10,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:22:10,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 27: [2022-11-27 08:22:10,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 31: [2022-11-27 08:22:10,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 08:22:10,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 27: [2022-11-27 08:22:10,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 08:22:10,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 21: [2022-11-27 08:22:10,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 
21: [2022-11-27 08:22:10,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 6: [2022-11-27 08:22:10,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:22:10,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 2: [2022-11-27 08:22:10,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:22:10,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:22:10,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 08:22:10,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 2: [2022-11-27 08:22:10,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 19: [2022-11-27 08:22:10,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 2: [2022-11-27 08:22:10,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 19: [2022-11-27 08:22:10,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 4: [2022-11-27 08:22:10,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-27 08:22:10,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 08:22:10,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 19: [2022-11-27 08:22:10,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:22:10,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 08:22:10,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 14: [2022-11-27 08:22:10,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:22:10,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 08:22:10,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 25: [2022-11-27 08:22:10,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:22:10,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 08:22:10,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 
0: [2022-11-27 08:22:10,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 08:22:10,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 16: [2022-11-27 08:22:10,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 08:22:10,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 08:22:10,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 8: [2022-11-27 08:22:10,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:22:10,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 7: [2022-11-27 08:22:10,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:22:10,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 2: [2022-11-27 08:22:10,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
7: [2022-11-27 08:22:10,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 2: [2022-11-27 08:22:10,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 08:22:10,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 27: [2022-11-27 08:22:10,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:22:10,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 22: [2022-11-27 08:22:10,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 27: [2022-11-27 08:22:10,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 3: [2022-11-27 08:22:10,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 22: [2022-11-27 08:22:10,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 27: [2022-11-27 08:22:10,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 3: [2022-11-27 08:22:10,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 22: [2022-11-27 08:22:10,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 
3: [2022-11-27 08:22:10,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 10: [2022-11-27 08:22:10,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:22:10,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 08:22:10,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 24: [2022-11-27 08:22:10,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 24: [2022-11-27 08:22:10,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 08:22:10,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 13: [2022-11-27 08:22:10,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:22:10,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 08:22:10,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 6: [2022-11-27 08:22:10,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-27 08:22:10,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 08:22:10,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 28: [2022-11-27 08:22:10,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-27 08:22:10,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 08:22:10,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 28: [2022-11-27 08:22:10,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-27 08:22:10,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-27 08:22:10,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 08:22:10,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step170000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 08:22:10,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 28: [2022-11-27 08:22:10,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step170000 is ready now! 
0: successfully saved checkpoint at iteration 170000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2529.37 31: iteration 170010/ 173500 | consumed samples: 43522560 | consumed tokens: 89134202880 | elapsed time per iteration (s): 1.06 | learning rate: 2.018E-05 | global batch size: 256 | lm loss: 1.921572E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.537 | TFLOPs: 14.61 | 31: iteration 170020/ 173500 | consumed samples: 43525120 | consumed tokens: 89139445760 | elapsed time per iteration (s): 0.84 | learning rate: 2.018E-05 | global batch size: 256 | lm loss: 1.900493E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.062 | TFLOPs: 18.39 | 31: iteration 170030/ 173500 | consumed samples: 43527680 | consumed tokens: 89144688640 | elapsed time per iteration (s): 0.82 | learning rate: 2.018E-05 | global batch size: 256 | lm loss: 1.906515E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.478 | TFLOPs: 18.84 | 31: iteration 170040/ 173500 | consumed samples: 43530240 | consumed tokens: 89149931520 | elapsed time per iteration (s): 0.80 | learning rate: 2.018E-05 | global batch size: 256 | lm loss: 1.918585E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.599 | TFLOPs: 19.40 | 31: iteration 170050/ 173500 | consumed samples: 43532800 | consumed tokens: 89155174400 | elapsed time per iteration (s): 0.78 | learning rate: 2.018E-05 | global batch size: 256 | lm loss: 1.891455E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.431 | TFLOPs: 19.81 | 31: iteration 170060/ 173500 | consumed samples: 43535360 | consumed tokens: 89160417280 | elapsed time per iteration (s): 
0.81 | learning rate: 2.018E-05 | global batch size: 256 | lm loss: 1.911513E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.254 | TFLOPs: 19.07 | 31: iteration 170070/ 173500 | consumed samples: 43537920 | consumed tokens: 89165660160 | elapsed time per iteration (s): 0.81 | learning rate: 2.018E-05 | global batch size: 256 | lm loss: 1.906199E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.300 | TFLOPs: 19.07 | 31: iteration 170080/ 173500 | consumed samples: 43540480 | consumed tokens: 89170903040 | elapsed time per iteration (s): 0.79 | learning rate: 2.018E-05 | global batch size: 256 | lm loss: 1.901988E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.891 | TFLOPs: 19.72 | 31: iteration 170090/ 173500 | consumed samples: 43543040 | consumed tokens: 89176145920 | elapsed time per iteration (s): 0.80 | learning rate: 2.018E-05 | global batch size: 256 | lm loss: 1.920683E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.953 | TFLOPs: 19.30 | 31: iteration 170100/ 173500 | consumed samples: 43545600 | consumed tokens: 89181388800 | elapsed time per iteration (s): 0.80 | learning rate: 2.017E-05 | global batch size: 256 | lm loss: 1.904809E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.974 | TFLOPs: 19.48 | 31: iteration 170110/ 173500 | consumed samples: 43548160 | consumed tokens: 89186631680 | elapsed time per iteration (s): 0.82 | learning rate: 2.017E-05 | global batch size: 256 | lm loss: 1.904903E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.197 | TFLOPs: 18.95 | 31: 
iteration 170120/ 173500 | consumed samples: 43550720 | consumed tokens: 89191874560 | elapsed time per iteration (s): 0.78 | learning rate: 2.017E-05 | global batch size: 256 | lm loss: 1.900838E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.281 | TFLOPs: 19.86 | 31: iteration 170130/ 173500 | consumed samples: 43553280 | consumed tokens: 89197117440 | elapsed time per iteration (s): 0.81 | learning rate: 2.017E-05 | global batch size: 256 | lm loss: 1.891664E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.220 | TFLOPs: 19.13 | 31: iteration 170140/ 173500 | consumed samples: 43555840 | consumed tokens: 89202360320 | elapsed time per iteration (s): 0.87 | learning rate: 2.017E-05 | global batch size: 256 | lm loss: 1.918832E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 294.708 | TFLOPs: 17.83 | 31: iteration 170150/ 173500 | consumed samples: 43558400 | consumed tokens: 89207603200 | elapsed time per iteration (s): 0.84 | learning rate: 2.017E-05 | global batch size: 256 | lm loss: 1.909001E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.803 | TFLOPs: 18.50 | 31: iteration 170160/ 173500 | consumed samples: 43560960 | consumed tokens: 89212846080 | elapsed time per iteration (s): 0.78 | learning rate: 2.017E-05 | global batch size: 256 | lm loss: 1.903289E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.616 | TFLOPs: 19.76 | 31: iteration 170170/ 173500 | consumed samples: 43563520 | consumed tokens: 89218088960 | elapsed time per iteration (s): 0.79 | learning rate: 2.017E-05 | global batch size: 256 | lm loss: 1.900697E+00 | grad norm: 0.182 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.443 | TFLOPs: 19.57 | 31: iteration 170180/ 173500 | consumed samples: 43566080 | consumed tokens: 89223331840 | elapsed time per iteration (s): 0.81 | learning rate: 2.017E-05 | global batch size: 256 | lm loss: 1.892204E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.660 | TFLOPs: 19.22 | 31: iteration 170190/ 173500 | consumed samples: 43568640 | consumed tokens: 89228574720 | elapsed time per iteration (s): 0.77 | learning rate: 2.016E-05 | global batch size: 256 | lm loss: 1.896632E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.428 | TFLOPs: 19.99 | 31: iteration 170200/ 173500 | consumed samples: 43571200 | consumed tokens: 89233817600 | elapsed time per iteration (s): 0.75 | learning rate: 2.016E-05 | global batch size: 256 | lm loss: 1.902724E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.923 | TFLOPs: 20.62 | 31: iteration 170210/ 173500 | consumed samples: 43573760 | consumed tokens: 89239060480 | elapsed time per iteration (s): 0.73 | learning rate: 2.016E-05 | global batch size: 256 | lm loss: 1.881999E+00 | grad norm: 0.213 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.837 | TFLOPs: 21.10 | 31: iteration 170220/ 173500 | consumed samples: 43576320 | consumed tokens: 89244303360 | elapsed time per iteration (s): 0.75 | learning rate: 2.016E-05 | global batch size: 256 | lm loss: 1.902172E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.737 | TFLOPs: 20.73 | 31: iteration 170230/ 173500 | consumed samples: 43578880 | consumed tokens: 89249546240 | elapsed time per iteration (s): 5.15 | 
learning rate: 2.016E-05 | global batch size: 256 | lm loss: 1.901264E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 49.751 | TFLOPs: 3.01 | 31: iteration 170240/ 173500 | consumed samples: 43581440 | consumed tokens: 89254789120 | elapsed time per iteration (s): 0.76 | learning rate: 2.016E-05 | global batch size: 256 | lm loss: 1.941269E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.432 | TFLOPs: 20.35 | 31: iteration 170250/ 173500 | consumed samples: 43584000 | consumed tokens: 89260032000 | elapsed time per iteration (s): 23.13 | learning rate: 2.016E-05 | global batch size: 256 | lm loss: 1.903370E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 11.066 | TFLOPs: 0.67 | 31: iteration 170260/ 173500 | consumed samples: 43586560 | consumed tokens: 89265274880 | elapsed time per iteration (s): 0.73 | learning rate: 2.016E-05 | global batch size: 256 | lm loss: 1.918259E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.593 | TFLOPs: 21.09 | 31: iteration 170270/ 173500 | consumed samples: 43589120 | consumed tokens: 89270517760 | elapsed time per iteration (s): 0.77 | learning rate: 2.016E-05 | global batch size: 256 | lm loss: 1.876850E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.399 | TFLOPs: 20.05 | 31: iteration 170280/ 173500 | consumed samples: 43591680 | consumed tokens: 89275760640 | elapsed time per iteration (s): 0.84 | learning rate: 2.016E-05 | global batch size: 256 | lm loss: 1.930031E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.416 | TFLOPs: 18.54 | 31: iteration 
170290/ 173500 | consumed samples: 43594240 | consumed tokens: 89281003520 | elapsed time per iteration (s): 0.83 | learning rate: 2.016E-05 | global batch size: 256 | lm loss: 1.887561E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.038 | TFLOPs: 18.70 | 31: iteration 170300/ 173500 | consumed samples: 43596800 | consumed tokens: 89286246400 | elapsed time per iteration (s): 0.78 | learning rate: 2.015E-05 | global batch size: 256 | lm loss: 1.912201E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.529 | TFLOPs: 19.81 | 31: iteration 170310/ 173500 | consumed samples: 43599360 | consumed tokens: 89291489280 | elapsed time per iteration (s): 0.81 | learning rate: 2.015E-05 | global batch size: 256 | lm loss: 1.897285E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.399 | TFLOPs: 19.14 | 31: iteration 170320/ 173500 | consumed samples: 43601920 | consumed tokens: 89296732160 | elapsed time per iteration (s): 0.82 | learning rate: 2.015E-05 | global batch size: 256 | lm loss: 1.907912E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.785 | TFLOPs: 18.86 | 31: iteration 170330/ 173500 | consumed samples: 43604480 | consumed tokens: 89301975040 | elapsed time per iteration (s): 0.84 | learning rate: 2.015E-05 | global batch size: 256 | lm loss: 1.924426E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.330 | TFLOPs: 18.47 | 31: iteration 170340/ 173500 | consumed samples: 43607040 | consumed tokens: 89307217920 | elapsed time per iteration (s): 0.82 | learning rate: 2.015E-05 | global batch size: 256 | lm loss: 1.904235E+00 | grad norm: 0.187 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.814 | TFLOPs: 18.80 | 31: iteration 170350/ 173500 | consumed samples: 43609600 | consumed tokens: 89312460800 | elapsed time per iteration (s): 0.80 | learning rate: 2.015E-05 | global batch size: 256 | lm loss: 1.919617E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.017 | TFLOPs: 19.30 | 31: iteration 170360/ 173500 | consumed samples: 43612160 | consumed tokens: 89317703680 | elapsed time per iteration (s): 0.84 | learning rate: 2.015E-05 | global batch size: 256 | lm loss: 1.929529E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.266 | TFLOPs: 18.47 | 31: iteration 170370/ 173500 | consumed samples: 43614720 | consumed tokens: 89322946560 | elapsed time per iteration (s): 0.84 | learning rate: 2.015E-05 | global batch size: 256 | lm loss: 1.915792E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.598 | TFLOPs: 18.37 | 31: iteration 170380/ 173500 | consumed samples: 43617280 | consumed tokens: 89328189440 | elapsed time per iteration (s): 0.83 | learning rate: 2.015E-05 | global batch size: 256 | lm loss: 1.888298E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.125 | TFLOPs: 18.70 | 31: iteration 170390/ 173500 | consumed samples: 43619840 | consumed tokens: 89333432320 | elapsed time per iteration (s): 0.91 | learning rate: 2.015E-05 | global batch size: 256 | lm loss: 1.909240E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 280.969 | TFLOPs: 17.00 | 31: iteration 170400/ 173500 | consumed samples: 43622400 | consumed tokens: 89338675200 | elapsed time per iteration (s): 0.94 | learning 
rate: 2.014E-05 | global batch size: 256 | lm loss: 1.897849E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 271.948 | TFLOPs: 16.45 | 31: iteration 170410/ 173500 | consumed samples: 43624960 | consumed tokens: 89343918080 | elapsed time per iteration (s): 1.03 | learning rate: 2.014E-05 | global batch size: 256 | lm loss: 1.920171E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.437 | TFLOPs: 15.09 | 31: iteration 170420/ 173500 | consumed samples: 43627520 | consumed tokens: 89349160960 | elapsed time per iteration (s): 0.96 | learning rate: 2.014E-05 | global batch size: 256 | lm loss: 1.896869E+00 | grad norm: 0.212 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 265.911 | TFLOPs: 16.09 | 31: iteration 170430/ 173500 | consumed samples: 43630080 | consumed tokens: 89354403840 | elapsed time per iteration (s): 0.99 | learning rate: 2.014E-05 | global batch size: 256 | lm loss: 1.907321E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 259.590 | TFLOPs: 15.70 | 31: iteration 170440/ 173500 | consumed samples: 43632640 | consumed tokens: 89359646720 | elapsed time per iteration (s): 0.98 | learning rate: 2.014E-05 | global batch size: 256 | lm loss: 1.890984E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 261.221 | TFLOPs: 15.80 | 31: iteration 170450/ 173500 | consumed samples: 43635200 | consumed tokens: 89364889600 | elapsed time per iteration (s): 0.93 | learning rate: 2.014E-05 | global batch size: 256 | lm loss: 1.887939E+00 | grad norm: 0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 274.230 | TFLOPs: 16.59 | 31: iteration 170460/ 
173500 | consumed samples: 43637760 | consumed tokens: 89370132480 | elapsed time per iteration (s): 0.94 | learning rate: 2.014E-05 | global batch size: 256 | lm loss: 1.934706E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 272.120 | TFLOPs: 16.46 | 31: iteration 170470/ 173500 | consumed samples: 43640320 | consumed tokens: 89375375360 | elapsed time per iteration (s): 0.90 | learning rate: 2.014E-05 | global batch size: 256 | lm loss: 1.913751E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 283.755 | TFLOPs: 17.17 | 31: iteration 170480/ 173500 | consumed samples: 43642880 | consumed tokens: 89380618240 | elapsed time per iteration (s): 0.84 | learning rate: 2.014E-05 | global batch size: 256 | lm loss: 1.929996E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 304.316 | TFLOPs: 18.41 | 31: iteration 170490/ 173500 | consumed samples: 43645440 | consumed tokens: 89385861120 | elapsed time per iteration (s): 0.87 | learning rate: 2.014E-05 | global batch size: 256 | lm loss: 1.914438E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.447 | TFLOPs: 17.75 | 31: iteration 170500/ 173500 | consumed samples: 43648000 | consumed tokens: 89391104000 | elapsed time per iteration (s): 0.79 | learning rate: 2.014E-05 | global batch size: 256 | lm loss: 1.923289E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.537 | TFLOPs: 19.63 | 31: iteration 170510/ 173500 | consumed samples: 43650560 | consumed tokens: 89396346880 | elapsed time per iteration (s): 0.76 | learning rate: 2.013E-05 | global batch size: 256 | lm loss: 1.926124E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 335.748 | TFLOPs: 20.31 | 31: iteration 170520/ 173500 | consumed samples: 43653120 | consumed tokens: 89401589760 | elapsed time per iteration (s): 0.97 | learning rate: 2.013E-05 | global batch size: 256 | lm loss: 1.913140E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 264.084 | TFLOPs: 15.98 | 31: iteration 170530/ 173500 | consumed samples: 43655680 | consumed tokens: 89406832640 | elapsed time per iteration (s): 0.82 | learning rate: 2.013E-05 | global batch size: 256 | lm loss: 1.927829E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.262 | TFLOPs: 18.83 | 31: iteration 170540/ 173500 | consumed samples: 43658240 | consumed tokens: 89412075520 | elapsed time per iteration (s): 0.88 | learning rate: 2.013E-05 | global batch size: 256 | lm loss: 1.893602E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.348 | TFLOPs: 17.63 | 31: iteration 170550/ 173500 | consumed samples: 43660800 | consumed tokens: 89417318400 | elapsed time per iteration (s): 0.86 | learning rate: 2.013E-05 | global batch size: 256 | lm loss: 1.894902E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.152 | TFLOPs: 17.98 | 31: iteration 170560/ 173500 | consumed samples: 43663360 | consumed tokens: 89422561280 | elapsed time per iteration (s): 0.84 | learning rate: 2.013E-05 | global batch size: 256 | lm loss: 1.920479E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.104 | TFLOPs: 18.34 | 31: iteration 170570/ 173500 | consumed samples: 43665920 | consumed tokens: 89427804160 | elapsed time per iteration (s): 0.86 | learning rate: 
2.013E-05 | global batch size: 256 | lm loss: 1.891954E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.724 | TFLOPs: 17.95 | 31: iteration 170580/ 173500 | consumed samples: 43668480 | consumed tokens: 89433047040 | elapsed time per iteration (s): 0.87 | learning rate: 2.013E-05 | global batch size: 256 | lm loss: 1.918533E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.156 | TFLOPs: 17.74 | 31: iteration 170590/ 173500 | consumed samples: 43671040 | consumed tokens: 89438289920 | elapsed time per iteration (s): 0.83 | learning rate: 2.013E-05 | global batch size: 256 | lm loss: 1.913099E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.385 | TFLOPs: 18.66 | 31: iteration 170600/ 173500 | consumed samples: 43673600 | consumed tokens: 89443532800 | elapsed time per iteration (s): 0.86 | learning rate: 2.013E-05 | global batch size: 256 | lm loss: 1.908684E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.700 | TFLOPs: 18.07 | 31: iteration 170610/ 173500 | consumed samples: 43676160 | consumed tokens: 89448775680 | elapsed time per iteration (s): 0.82 | learning rate: 2.013E-05 | global batch size: 256 | lm loss: 1.892731E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.827 | TFLOPs: 18.93 | 31: iteration 170620/ 173500 | consumed samples: 43678720 | consumed tokens: 89454018560 | elapsed time per iteration (s): 0.83 | learning rate: 2.012E-05 | global batch size: 256 | lm loss: 1.904365E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.722 | TFLOPs: 18.74 | 31: iteration 170630/ 173500 | 
consumed samples: 43681280 | consumed tokens: 89459261440 | elapsed time per iteration (s): 0.87 | learning rate: 2.012E-05 | global batch size: 256 | lm loss: 1.890382E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.841 | TFLOPs: 17.72 | 31: iteration 170640/ 173500 | consumed samples: 43683840 | consumed tokens: 89464504320 | elapsed time per iteration (s): 0.74 | learning rate: 2.012E-05 | global batch size: 256 | lm loss: 1.908031E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.689 | TFLOPs: 20.97 | 31: iteration 170650/ 173500 | consumed samples: 43686400 | consumed tokens: 89469747200 | elapsed time per iteration (s): 0.77 | learning rate: 2.012E-05 | global batch size: 256 | lm loss: 1.884404E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.094 | TFLOPs: 20.03 | 31: iteration 170660/ 173500 | consumed samples: 43688960 | consumed tokens: 89474990080 | elapsed time per iteration (s): 0.78 | learning rate: 2.012E-05 | global batch size: 256 | lm loss: 1.906510E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.543 | TFLOPs: 19.88 | 31: iteration 170670/ 173500 | consumed samples: 43691520 | consumed tokens: 89480232960 | elapsed time per iteration (s): 0.79 | learning rate: 2.012E-05 | global batch size: 256 | lm loss: 1.899412E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.545 | TFLOPs: 19.51 | 31: iteration 170680/ 173500 | consumed samples: 43694080 | consumed tokens: 89485475840 | elapsed time per iteration (s): 0.79 | learning rate: 2.012E-05 | global batch size: 256 | lm loss: 1.902776E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 323.860 | TFLOPs: 19.59 | 31: iteration 170690/ 173500 | consumed samples: 43696640 | consumed tokens: 89490718720 | elapsed time per iteration (s): 0.79 | learning rate: 2.012E-05 | global batch size: 256 | lm loss: 1.905535E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.507 | TFLOPs: 19.63 | 31: iteration 170700/ 173500 | consumed samples: 43699200 | consumed tokens: 89495961600 | elapsed time per iteration (s): 0.80 | learning rate: 2.012E-05 | global batch size: 256 | lm loss: 1.898420E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.655 | TFLOPs: 19.46 | 31: iteration 170710/ 173500 | consumed samples: 43701760 | consumed tokens: 89501204480 | elapsed time per iteration (s): 0.86 | learning rate: 2.012E-05 | global batch size: 256 | lm loss: 1.922184E+00 | grad norm: 0.215 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 298.446 | TFLOPs: 18.06 | 31: iteration 170720/ 173500 | consumed samples: 43704320 | consumed tokens: 89506447360 | elapsed time per iteration (s): 0.81 | learning rate: 2.012E-05 | global batch size: 256 | lm loss: 1.908320E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.775 | TFLOPs: 19.16 | 31: iteration 170730/ 173500 | consumed samples: 43706880 | consumed tokens: 89511690240 | elapsed time per iteration (s): 0.78 | learning rate: 2.012E-05 | global batch size: 256 | lm loss: 1.899792E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.845 | TFLOPs: 19.77 | 31: iteration 170740/ 173500 | consumed samples: 43709440 | consumed tokens: 89516933120 | elapsed time per iteration (s): 0.82 | learning rate: 
2.011E-05 | global batch size: 256 | lm loss: 1.881430E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.699 | TFLOPs: 18.92 | 31: iteration 170750/ 173500 | consumed samples: 43712000 | consumed tokens: 89522176000 | elapsed time per iteration (s): 0.85 | learning rate: 2.011E-05 | global batch size: 256 | lm loss: 1.907842E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 301.296 | TFLOPs: 18.23 | 31: iteration 170760/ 173500 | consumed samples: 43714560 | consumed tokens: 89527418880 | elapsed time per iteration (s): 0.83 | learning rate: 2.011E-05 | global batch size: 256 | lm loss: 1.880110E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.582 | TFLOPs: 18.61 | 31: iteration 170770/ 173500 | consumed samples: 43717120 | consumed tokens: 89532661760 | elapsed time per iteration (s): 0.83 | learning rate: 2.011E-05 | global batch size: 256 | lm loss: 1.928715E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.158 | TFLOPs: 18.70 | 31: iteration 170780/ 173500 | consumed samples: 43719680 | consumed tokens: 89537904640 | elapsed time per iteration (s): 0.83 | learning rate: 2.011E-05 | global batch size: 256 | lm loss: 1.931475E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.034 | TFLOPs: 18.57 | 31: iteration 170790/ 173500 | consumed samples: 43722240 | consumed tokens: 89543147520 | elapsed time per iteration (s): 0.82 | learning rate: 2.011E-05 | global batch size: 256 | lm loss: 1.910457E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.727 | TFLOPs: 18.80 | 31: iteration 170800/ 173500 | 
consumed samples: 43724800 | consumed tokens: 89548390400 | elapsed time per iteration (s): 0.83 | learning rate: 2.011E-05 | global batch size: 256 | lm loss: 1.908862E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.673 | TFLOPs: 18.73 | 31: iteration 170810/ 173500 | consumed samples: 43727360 | consumed tokens: 89553633280 | elapsed time per iteration (s): 0.83 | learning rate: 2.011E-05 | global batch size: 256 | lm loss: 1.908244E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.370 | TFLOPs: 18.60 | 31: iteration 170820/ 173500 | consumed samples: 43729920 | consumed tokens: 89558876160 | elapsed time per iteration (s): 0.80 | learning rate: 2.011E-05 | global batch size: 256 | lm loss: 1.912399E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.912 | TFLOPs: 19.47 | 31: iteration 170830/ 173500 | consumed samples: 43732480 | consumed tokens: 89564119040 | elapsed time per iteration (s): 0.79 | learning rate: 2.011E-05 | global batch size: 256 | lm loss: 1.909947E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.520 | TFLOPs: 19.57 | 31: iteration 170840/ 173500 | consumed samples: 43735040 | consumed tokens: 89569361920 | elapsed time per iteration (s): 0.74 | learning rate: 2.011E-05 | global batch size: 256 | lm loss: 1.895571E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.564 | TFLOPs: 20.85 | 31: iteration 170850/ 173500 | consumed samples: 43737600 | consumed tokens: 89574604800 | elapsed time per iteration (s): 0.74 | learning rate: 2.011E-05 | global batch size: 256 | lm loss: 1.895255E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 347.742 | TFLOPs: 21.04 | 31: iteration 170860/ 173500 | consumed samples: 43740160 | consumed tokens: 89579847680 | elapsed time per iteration (s): 0.77 | learning rate: 2.010E-05 | global batch size: 256 | lm loss: 1.909345E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.064 | TFLOPs: 20.09 | 31: iteration 170870/ 173500 | consumed samples: 43742720 | consumed tokens: 89585090560 | elapsed time per iteration (s): 0.72 | learning rate: 2.010E-05 | global batch size: 256 | lm loss: 1.916146E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.564 | TFLOPs: 21.39 | 31: iteration 170880/ 173500 | consumed samples: 43745280 | consumed tokens: 89590333440 | elapsed time per iteration (s): 0.73 | learning rate: 2.010E-05 | global batch size: 256 | lm loss: 1.903337E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 351.182 | TFLOPs: 21.25 | 31: iteration 170890/ 173500 | consumed samples: 43747840 | consumed tokens: 89595576320 | elapsed time per iteration (s): 0.74 | learning rate: 2.010E-05 | global batch size: 256 | lm loss: 1.891768E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.608 | TFLOPs: 20.91 | 31: iteration 170900/ 173500 | consumed samples: 43750400 | consumed tokens: 89600819200 | elapsed time per iteration (s): 0.79 | learning rate: 2.010E-05 | global batch size: 256 | lm loss: 1.918955E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.870 | TFLOPs: 19.59 | 31: iteration 170910/ 173500 | consumed samples: 43752960 | consumed tokens: 89606062080 | elapsed time per iteration (s): 0.76 | learning rate: 
2.010E-05 | global batch size: 256 | lm loss: 1.916139E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.057 | TFLOPs: 20.27 | 31: iteration 170920/ 173500 | consumed samples: 43755520 | consumed tokens: 89611304960 | elapsed time per iteration (s): 0.77 | learning rate: 2.010E-05 | global batch size: 256 | lm loss: 1.910987E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.508 | TFLOPs: 20.12 | 31: iteration 170930/ 173500 | consumed samples: 43758080 | consumed tokens: 89616547840 | elapsed time per iteration (s): 0.77 | learning rate: 2.010E-05 | global batch size: 256 | lm loss: 1.921780E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.908 | TFLOPs: 20.02 | 31: iteration 170940/ 173500 | consumed samples: 43760640 | consumed tokens: 89621790720 | elapsed time per iteration (s): 0.78 | learning rate: 2.010E-05 | global batch size: 256 | lm loss: 1.919660E+00 | grad norm: 0.215 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.966 | TFLOPs: 19.90 | 31: iteration 170950/ 173500 | consumed samples: 43763200 | consumed tokens: 89627033600 | elapsed time per iteration (s): 0.75 | learning rate: 2.010E-05 | global batch size: 256 | lm loss: 1.913669E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.562 | TFLOPs: 20.66 | 31: iteration 170960/ 173500 | consumed samples: 43765760 | consumed tokens: 89632276480 | elapsed time per iteration (s): 0.75 | learning rate: 2.010E-05 | global batch size: 256 | lm loss: 1.910907E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.022 | TFLOPs: 20.69 | 31: iteration 170970/ 173500 | 
consumed samples: 43768320 | consumed tokens: 89637519360 | elapsed time per iteration (s): 0.74 | learning rate: 2.010E-05 | global batch size: 256 | lm loss: 1.875002E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.717 | TFLOPs: 20.79 |
31: iteration 170980/ 173500 | consumed samples: 43770880 | consumed tokens: 89642762240 | elapsed time per iteration (s): 0.79 | learning rate: 2.010E-05 | global batch size: 256 | lm loss: 1.907462E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.213 | TFLOPs: 19.61 |
31: iteration 170990/ 173500 | consumed samples: 43773440 | consumed tokens: 89648005120 | elapsed time per iteration (s): 0.75 | learning rate: 2.009E-05 | global batch size: 256 | lm loss: 1.907104E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.988 | TFLOPs: 20.69 |
31: iteration 171000/ 173500 | consumed samples: 43776000 | consumed tokens: 89653248000 | elapsed time per iteration (s): 0.73 | learning rate: 2.009E-05 | global batch size: 256 | lm loss: 1.920370E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.102 | TFLOPs: 21.36 |
31: --------------------------------------------------------------------------------------------
31: valid loss at iteration 171000 | lm loss value: 1.850100E+00 | lm loss PPL: 6.360456E+00 |
31: --------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 171000 to checkpoints_1b1long
0: [2022-11-27 08:40:13,568] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step171000 is begin to save!
0: [2022-11-27 08:40:13,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_01-model_00-model_states.pt... 0: [2022-11-27 08:40:13,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_01-model_00-model_states.pt. 0: [2022-11-27 08:40:13,812] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_03-model_00-model_states.pt... 0: [2022-11-27 08:40:13,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_03-model_00-model_states.pt. 0: [2022-11-27 08:40:13,897] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_04-model_00-model_states.pt... 0: [2022-11-27 08:40:13,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_04-model_00-model_states.pt. 0: [2022-11-27 08:40:13,977] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_05-model_00-model_states.pt... 0: [2022-11-27 08:40:14,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_05-model_00-model_states.pt. 0: [2022-11-27 08:40:14,054] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_06-model_00-model_states.pt... 0: [2022-11-27 08:40:14,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_06-model_00-model_states.pt. 0: [2022-11-27 08:40:14,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_07-model_00-model_states.pt... 0: [2022-11-27 08:40:14,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_07-model_00-model_states.pt. 
0: [2022-11-27 08:40:14,207] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_08-model_00-model_states.pt... 0: [2022-11-27 08:40:14,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_08-model_00-model_states.pt. 0: [2022-11-27 08:40:14,287] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_09-model_00-model_states.pt... 0: [2022-11-27 08:40:14,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_09-model_00-model_states.pt. 0: [2022-11-27 08:40:14,361] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_10-model_00-model_states.pt... 0: [2022-11-27 08:40:14,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_10-model_00-model_states.pt. 0: [2022-11-27 08:40:14,438] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_11-model_00-model_states.pt... 0: [2022-11-27 08:40:14,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_11-model_00-model_states.pt. 0: [2022-11-27 08:40:14,517] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_12-model_00-model_states.pt... 0: [2022-11-27 08:40:14,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_12-model_00-model_states.pt. 0: [2022-11-27 08:40:14,591] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_13-model_00-model_states.pt... 0: [2022-11-27 08:40:14,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_13-model_00-model_states.pt. 
0: [2022-11-27 08:40:14,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_14-model_00-model_states.pt... 0: [2022-11-27 08:40:14,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_14-model_00-model_states.pt. 0: [2022-11-27 08:40:14,749] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_15-model_00-model_states.pt... 0: [2022-11-27 08:40:14,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_15-model_00-model_states.pt. 0: [2022-11-27 08:40:14,824] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_16-model_00-model_states.pt... 0: [2022-11-27 08:40:14,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_16-model_00-model_states.pt. 0: [2022-11-27 08:40:14,900] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_17-model_00-model_states.pt... 0: [2022-11-27 08:40:14,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_17-model_00-model_states.pt. 0: [2022-11-27 08:40:14,974] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_18-model_00-model_states.pt... 0: [2022-11-27 08:40:15,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_18-model_00-model_states.pt. 0: [2022-11-27 08:40:15,051] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_19-model_00-model_states.pt... 0: [2022-11-27 08:40:15,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_19-model_00-model_states.pt. 
0: [2022-11-27 08:40:15,127] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_20-model_00-model_states.pt... 0: [2022-11-27 08:40:15,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_20-model_00-model_states.pt. 0: [2022-11-27 08:40:15,203] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_21-model_00-model_states.pt... 0: [2022-11-27 08:40:15,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_21-model_00-model_states.pt. 0: [2022-11-27 08:40:15,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_22-model_00-model_states.pt... 0: [2022-11-27 08:40:15,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_22-model_00-model_states.pt. 0: [2022-11-27 08:40:15,355] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_23-model_00-model_states.pt... 0: [2022-11-27 08:40:15,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_23-model_00-model_states.pt. 0: [2022-11-27 08:40:15,433] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_24-model_00-model_states.pt... 0: [2022-11-27 08:40:15,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_24-model_00-model_states.pt. 0: [2022-11-27 08:40:15,510] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_25-model_00-model_states.pt... 0: [2022-11-27 08:40:15,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_25-model_00-model_states.pt. 
0: [2022-11-27 08:40:15,583] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_26-model_00-model_states.pt... 0: [2022-11-27 08:40:15,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_26-model_00-model_states.pt. 0: [2022-11-27 08:40:15,659] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_27-model_00-model_states.pt... 0: [2022-11-27 08:40:15,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_27-model_00-model_states.pt. 0: [2022-11-27 08:40:15,739] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_28-model_00-model_states.pt... 0: [2022-11-27 08:40:15,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_28-model_00-model_states.pt. 0: [2022-11-27 08:40:15,812] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/layer_30-model_00-model_states.pt... 0: [2022-11-27 08:40:15,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/layer_30-model_00-model_states.pt. 0: [2022-11-27 08:40:15,817] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step171000/mp_rank_00_model_states.pt 0: [2022-11-27 08:40:15,817] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/mp_rank_00_model_states.pt... 0: [2022-11-27 08:40:15,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/mp_rank_00_model_states.pt. 0: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
6: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
9: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 
1: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
3: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
15: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 
24: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 
22: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 
17: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
26: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 
27: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
4: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
16: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 
20: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 
28: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 
17: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
0: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 
8: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
20: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 
14: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 
17: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
2: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 20: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 
28: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 
27: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 
18: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:40:15,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:40:15,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:40:15,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 08:40:15,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 28: [2022-11-27 08:40:15,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 28: [2022-11-27 08:40:15,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 08:40:15,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 8: [2022-11-27 08:40:15,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:40:15,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 08:40:15,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 
30: [2022-11-27 08:40:15,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:40:15,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 30: [2022-11-27 08:40:15,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 24: [2022-11-27 08:40:15,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-27 08:40:15,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 20: [2022-11-27 08:40:15,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 24: [2022-11-27 08:40:15,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 20: [2022-11-27 08:40:15,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 08:40:15,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 29: [2022-11-27 08:40:15,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 08:40:15,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 08:40:15,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 
18: [2022-11-27 08:40:15,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:40:15,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 08:40:15,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 20: [2022-11-27 08:40:15,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-27 08:40:15,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 08:40:15,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 10: [2022-11-27 08:40:15,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:40:15,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:40:15,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
10: [2022-11-27 08:40:15,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 08:40:15,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 14: [2022-11-27 08:40:15,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 10: [2022-11-27 08:40:15,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 10: [2022-11-27 08:40:15,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 14: [2022-11-27 08:40:15,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 6: [2022-11-27 08:40:15,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:40:15,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 08:40:15,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 14: [2022-11-27 08:40:15,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:40:15,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 08:40:15,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 
29: [2022-11-27 08:40:15,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:40:15,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 29: [2022-11-27 08:40:15,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 18: [2022-11-27 08:40:15,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 15: [2022-11-27 08:40:15,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:40:15,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 29: [2022-11-27 08:40:15,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 18: [2022-11-27 08:40:15,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 15: [2022-11-27 08:40:15,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 08:40:15,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 08:40:15,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 15: [2022-11-27 08:40:15,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 
25: [2022-11-27 08:40:15,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:40:15,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 08:40:15,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 28: [2022-11-27 08:40:15,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:40:15,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:40:15,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 8: [2022-11-27 08:40:15,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:40:15,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 8: [2022-11-27 08:40:15,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 08:40:15,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 3: [2022-11-27 08:40:15,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-27 08:40:15,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 08:40:15,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 27: [2022-11-27 08:40:15,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 22: [2022-11-27 08:40:15,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 08:40:15,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 08:40:15,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 27: [2022-11-27 08:40:15,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 08:40:15,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 29: [2022-11-27 08:40:15,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-27 08:40:15,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 08:40:15,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 4: [2022-11-27 08:40:15,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-27 08:40:15,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 08:40:15,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 6: [2022-11-27 08:40:15,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:40:15,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 08:40:15,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 24: [2022-11-27 08:40:15,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 27: [2022-11-27 08:40:15,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 24: [2022-11-27 08:40:15,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 08:40:15,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 27: [2022-11-27 08:40:15,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 08:40:15,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 19: [2022-11-27 08:40:15,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 
19: [2022-11-27 08:40:15,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 08:40:15,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 25: [2022-11-27 08:40:15,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:40:15,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 08:40:15,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 2: [2022-11-27 08:40:15,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:40:15,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 08:40:15,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 6: [2022-11-27 08:40:15,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:40:15,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 08:40:15,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 1: [2022-11-27 08:40:15,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
3: [2022-11-27 08:40:15,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:40:15,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 08:40:15,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 3: [2022-11-27 08:40:15,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 08:40:15,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 8: [2022-11-27 08:40:15,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:40:15,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 12: [2022-11-27 08:40:15,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:40:15,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 12: [2022-11-27 08:40:15,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:40:15,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 08:40:15,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 
12: [2022-11-27 08:40:15,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 08:40:15,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 19: [2022-11-27 08:40:15,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:40:15,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:40:15,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 08:40:15,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-27 08:40:15,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 19: [2022-11-27 08:40:15,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 3: [2022-11-27 08:40:15,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:40:15,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 08:40:15,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 4: [2022-11-27 08:40:15,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-27 08:40:15,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 08:40:15,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 7: [2022-11-27 08:40:15,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:40:15,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 08:40:15,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 15: [2022-11-27 08:40:15,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:40:15,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 08:40:15,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 27: [2022-11-27 08:40:15,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:40:15,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
27: [2022-11-27 08:40:15,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 30: [2022-11-27 08:40:15,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 27: [2022-11-27 08:40:15,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 30: [2022-11-27 08:40:15,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 18: [2022-11-27 08:40:15,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:40:15,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 08:40:15,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 30: [2022-11-27 08:40:15,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:40:15,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 18: [2022-11-27 08:40:15,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 24: [2022-11-27 08:40:15,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
27: [2022-11-27 08:40:15,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:40:15,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 18: [2022-11-27 08:40:15,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 27: [2022-11-27 08:40:15,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 24: [2022-11-27 08:40:15,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 27: [2022-11-27 08:40:15,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 24: [2022-11-27 08:40:15,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 18: [2022-11-27 08:40:15,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 22: [2022-11-27 08:40:15,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 22: [2022-11-27 08:40:15,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 08:40:15,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 22: [2022-11-27 08:40:15,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 
22: [2022-11-27 08:40:15,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 22: [2022-11-27 08:40:15,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 1: [2022-11-27 08:40:15,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:40:15,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:40:15,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 6: [2022-11-27 08:40:15,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:40:15,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 08:40:15,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 6: [2022-11-27 08:40:15,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 1: [2022-11-27 08:40:15,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 6: [2022-11-27 08:40:15,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 
28: [2022-11-27 08:40:15,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 08:40:15,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 28: [2022-11-27 08:40:15,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-27 08:40:15,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 29: [2022-11-27 08:40:15,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:40:15,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 28: [2022-11-27 08:40:15,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 29: [2022-11-27 08:40:15,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 12: [2022-11-27 08:40:15,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 29: [2022-11-27 08:40:15,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 12: [2022-11-27 08:40:15,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 20: [2022-11-27 08:40:15,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 
20: [2022-11-27 08:40:15,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 20: [2022-11-27 08:40:15,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 08:40:15,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 08:40:15,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 20: [2022-11-27 08:40:15,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 13: [2022-11-27 08:40:15,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:40:15,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 08:40:15,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:40:15,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 13: [2022-11-27 08:40:15,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 08:40:15,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 7: [2022-11-27 08:40:15,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-27 08:40:15,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:40:15,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 08:40:15,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 14: [2022-11-27 08:40:15,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:40:15,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:40:15,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 7: [2022-11-27 08:40:15,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 14: [2022-11-27 08:40:15,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 08:40:15,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 08:40:15,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 14: [2022-11-27 08:40:15,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 3: [2022-11-27 08:40:15,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
8: [2022-11-27 08:40:15,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:40:15,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 08:40:15,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 8: [2022-11-27 08:40:15,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 08:40:15,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 0: [2022-11-27 08:40:15,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:40:15,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:40:15,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 0: [2022-11-27 08:40:15,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:40:15,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 
0: [2022-11-27 08:40:15,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 08:40:15,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 08:40:15,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 0: [2022-11-27 08:40:15,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 15: [2022-11-27 08:40:15,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:40:15,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 08:40:15,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 4: [2022-11-27 08:40:15,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:40:15,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 08:40:15,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 11: [2022-11-27 08:40:15,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:40:15,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-27 08:40:15,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:40:15,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 08:40:15,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 08:40:15,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 11: [2022-11-27 08:40:15,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 08:40:15,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 11: [2022-11-27 08:40:15,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 25: [2022-11-27 08:40:15,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:40:15,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 08:40:15,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 10: [2022-11-27 08:40:15,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-27 08:40:15,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 08:40:15,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 0: [2022-11-27 08:40:15,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:40:15,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:40:15,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 19: [2022-11-27 08:40:15,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 08:40:15,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 0: [2022-11-27 08:40:15,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 12: [2022-11-27 08:40:15,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:40:15,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 08:40:15,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 13: [2022-11-27 08:40:15,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-27 08:40:15,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 08:40:15,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 22: [2022-11-27 08:40:15,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 08:40:15,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 08:40:15,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 0: [2022-11-27 08:40:15,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:40:15,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:40:15,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 08:40:15,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 1: [2022-11-27 08:40:15,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:40:15,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
1: [2022-11-27 08:40:15,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 30: [2022-11-27 08:40:15,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 1: [2022-11-27 08:40:15,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 30: [2022-11-27 08:40:15,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 23: [2022-11-27 08:40:15,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-27 08:40:15,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 23: [2022-11-27 08:40:15,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 08:40:15,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 08:40:15,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 23: [2022-11-27 08:40:15,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 23: [2022-11-27 08:40:15,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 
23: [2022-11-27 08:40:15,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 08:40:15,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 2: [2022-11-27 08:40:15,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:40:15,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 08:40:15,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 24: [2022-11-27 08:40:15,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-27 08:40:15,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 08:40:15,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 10: [2022-11-27 08:40:15,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:40:15,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 08:40:15,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 26: [2022-11-27 08:40:15,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-27 08:40:15,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-27 08:40:15,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-27 08:40:15,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 08:40:15,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 5: [2022-11-27 08:40:15,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:40:15,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:40:15,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 26: [2022-11-27 08:40:15,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 5: [2022-11-27 08:40:15,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-27 08:40:15,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 08:40:15,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 26: [2022-11-27 08:40:15,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 5: [2022-11-27 08:40:15,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 26: [2022-11-27 08:40:15,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 5: [2022-11-27 08:40:15,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 5: [2022-11-27 08:40:15,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 5: [2022-11-27 08:40:15,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 26: [2022-11-27 08:40:15,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 5: [2022-11-27 08:40:15,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 08:40:15,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 
0: [2022-11-27 08:40:15,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 08:40:15,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 17: [2022-11-27 08:40:15,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:40:15,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:40:15,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:40:15,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:40:15,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 08:40:15,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 08:40:15,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 17: [2022-11-27 08:40:15,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 
17: [2022-11-27 08:40:15,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 08:40:15,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 08:40:15,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 17: [2022-11-27 08:40:15,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 29: [2022-11-27 08:40:15,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 08:40:15,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 08:40:15,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 6: [2022-11-27 08:40:15,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:40:15,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 08:40:15,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 20: [2022-11-27 08:40:15,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 
20: [2022-11-27 08:40:15,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 08:40:15,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 21: [2022-11-27 08:40:15,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:40:15,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:40:15,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:40:15,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:40:15,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 08:40:15,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 08:40:15,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 08:40:15,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 08:40:15,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 
21: [2022-11-27 08:40:15,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 21: [2022-11-27 08:40:15,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 21: [2022-11-27 08:40:15,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 28: [2022-11-27 08:40:15,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:40:15,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:40:15,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 16: [2022-11-27 08:40:15,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:40:15,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 16: [2022-11-27 08:40:15,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 08:40:15,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 08:40:15,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
16: [2022-11-27 08:40:15,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 08:40:15,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 08:40:15,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 08:40:15,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 16: [2022-11-27 08:40:15,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 08:40:15,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 16: [2022-11-27 08:40:15,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 16: [2022-11-27 08:40:15,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 10: [2022-11-27 08:40:15,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:40:15,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 08:40:15,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 
28: [2022-11-27 08:40:15,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 08:40:15,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 28: [2022-11-27 08:40:15,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 28: [2022-11-27 08:40:15,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 08:40:15,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 9: [2022-11-27 08:40:15,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:40:15,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:40:15,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:40:15,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-27 08:40:15,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 08:40:15,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 08:40:15,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 08:40:15,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 08:40:15,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 9: [2022-11-27 08:40:15,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 9: [2022-11-27 08:40:15,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 9: [2022-11-27 08:40:15,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 18: [2022-11-27 08:40:15,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:40:15,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 08:40:15,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 4: [2022-11-27 08:40:15,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-27 08:40:15,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 08:40:15,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 15: [2022-11-27 08:40:15,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:40:15,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 08:40:15,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 17: [2022-11-27 08:40:16,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:40:16,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 08:40:16,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 31: [2022-11-27 08:40:16,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-27 08:40:16,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 31: [2022-11-27 08:40:16,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 
31: [2022-11-27 08:40:16,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 08:40:16,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 31: [2022-11-27 08:40:16,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 08:40:16,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-27 08:40:16,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 31: [2022-11-27 08:40:16,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 31: [2022-11-27 08:40:16,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 08:40:16,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 08:40:16,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 08:40:16,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 31: [2022-11-27 08:40:16,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 31: [2022-11-27 08:40:16,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 
0: [2022-11-27 08:40:16,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:40:16,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 08:40:16,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 8: [2022-11-27 08:40:16,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:40:16,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 08:40:16,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 21: [2022-11-27 08:40:16,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:40:16,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 08:40:16,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 27: [2022-11-27 08:40:16,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 
27: [2022-11-27 08:40:16,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 19: [2022-11-27 08:40:16,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 27: [2022-11-27 08:40:16,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 19: [2022-11-27 08:40:16,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 08:40:16,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 30: [2022-11-27 08:40:16,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:40:16,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 08:40:16,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 16: [2022-11-27 08:40:16,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-27 08:40:16,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 08:40:16,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 1: [2022-11-27 08:40:16,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
25: [2022-11-27 08:40:16,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:40:16,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 25: [2022-11-27 08:40:16,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 11: [2022-11-27 08:40:16,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:40:16,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 25: [2022-11-27 08:40:16,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 11: [2022-11-27 08:40:16,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 08:40:16,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 24: [2022-11-27 08:40:16,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:40:16,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-27 08:40:16,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 24: [2022-11-27 08:40:16,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 12: [2022-11-27 08:40:16,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 24: [2022-11-27 08:40:16,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 23: [2022-11-27 08:40:16,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-27 08:40:16,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 08:40:16,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 7: [2022-11-27 08:40:16,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:40:16,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 08:40:16,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 26: [2022-11-27 08:40:16,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 
26: [2022-11-27 08:40:16,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 08:40:16,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 13: [2022-11-27 08:40:16,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:40:16,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 08:40:16,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 14: [2022-11-27 08:40:16,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:40:16,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 08:40:16,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 22: [2022-11-27 08:40:16,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 08:40:16,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 08:40:16,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 9: [2022-11-27 08:40:16,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-27 08:40:16,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 2: [2022-11-27 08:40:16,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:40:16,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:40:16,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 2: [2022-11-27 08:40:16,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 3: [2022-11-27 08:40:16,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 2: [2022-11-27 08:40:16,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 3: [2022-11-27 08:40:16,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 20: [2022-11-27 08:40:16,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-27 08:40:16,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 08:40:16,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 28: [2022-11-27 08:40:16,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 
28: [2022-11-27 08:40:16,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 08:40:16,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 6: [2022-11-27 08:40:16,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:40:16,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 08:40:16,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 10: [2022-11-27 08:40:16,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:40:16,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 08:40:16,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 4: [2022-11-27 08:40:16,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:40:16,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 08:40:16,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 0: [2022-11-27 08:40:16,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-27 08:40:16,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 08:40:16,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 29: [2022-11-27 08:40:16,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-27 08:40:16,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 08:40:16,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 31: [2022-11-27 08:40:16,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 08:40:16,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 08:40:16,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 17: [2022-11-27 08:40:16,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:40:16,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 08:40:16,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 15: [2022-11-27 08:40:16,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-27 08:40:16,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 08:40:16,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 30: [2022-11-27 08:40:16,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:40:16,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 08:40:16,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 8: [2022-11-27 08:40:16,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:40:16,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 08:40:16,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 27: [2022-11-27 08:40:16,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-27 08:40:16,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 08:40:16,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 18: [2022-11-27 08:40:16,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 
18: [2022-11-27 08:40:16,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 08:40:16,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 13: [2022-11-27 08:40:16,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:40:16,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:40:16,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 08:40:16,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 19: [2022-11-27 08:40:16,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 08:40:16,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 16: [2022-11-27 08:40:16,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 08:40:16,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 08:40:16,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 12: [2022-11-27 08:40:16,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-27 08:40:16,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 08:40:16,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 23: [2022-11-27 08:40:16,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-27 08:40:16,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 08:40:16,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 9: [2022-11-27 08:40:16,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:40:16,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 08:40:16,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 7: [2022-11-27 08:40:16,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:40:16,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 08:40:16,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 14: [2022-11-27 08:40:16,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-27 08:40:16,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 08:40:16,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 2: [2022-11-27 08:40:16,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:40:16,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 11: [2022-11-27 08:40:16,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:40:16,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 11: [2022-11-27 08:40:16,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 08:40:16,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 26: [2022-11-27 08:40:16,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 24: [2022-11-27 08:40:16,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
26: [2022-11-27 08:40:16,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 24: [2022-11-27 08:40:16,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 08:40:16,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 26: [2022-11-27 08:40:16,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 22: [2022-11-27 08:40:16,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 22: [2022-11-27 08:40:16,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 08:40:16,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 5: [2022-11-27 08:40:16,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:40:16,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 08:40:16,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 21: [2022-11-27 08:40:16,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-27 08:40:16,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 08:40:16,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 1: [2022-11-27 08:40:16,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:40:16,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 08:40:16,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 3: [2022-11-27 08:40:16,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:40:16,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 08:40:16,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 10: [2022-11-27 08:40:16,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:40:16,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 08:40:16,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 18: [2022-11-27 08:40:16,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
20: [2022-11-27 08:40:16,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:40:16,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 08:40:16,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 20: [2022-11-27 08:40:16,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 25: [2022-11-27 08:40:16,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 20: [2022-11-27 08:40:16,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 25: [2022-11-27 08:40:16,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 08:40:16,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 28: [2022-11-27 08:40:16,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-27 08:40:16,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 08:40:16,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 29: [2022-11-27 08:40:16,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 
6: [2022-11-27 08:40:16,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 29: [2022-11-27 08:40:16,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 08:40:16,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 6: [2022-11-27 08:40:16,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 08:40:16,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 5: [2022-11-27 08:40:16,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:40:16,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 08:40:16,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 31: [2022-11-27 08:40:16,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-27 08:40:16,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 08:40:16,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 0: [2022-11-27 08:40:16,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-27 08:40:16,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 08:40:16,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 4: [2022-11-27 08:40:16,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:40:16,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 08:40:16,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 15: [2022-11-27 08:40:16,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:40:16,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 08:40:16,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 17: [2022-11-27 08:40:16,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:40:16,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 08:40:16,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 8: [2022-11-27 08:40:16,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-27 08:40:16,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 08:40:16,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 30: [2022-11-27 08:40:16,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:40:16,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 08:40:16,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 3: [2022-11-27 08:40:16,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:40:16,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 08:40:16,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 19: [2022-11-27 08:40:16,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:40:16,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 08:40:16,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 14: [2022-11-27 08:40:16,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-27 08:40:16,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 08:40:16,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 16: [2022-11-27 08:40:16,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 16: [2022-11-27 08:40:16,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 08:40:16,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 25: [2022-11-27 08:40:16,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:40:16,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-27 08:40:16,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 27: [2022-11-27 08:40:16,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-27 08:40:16,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 08:40:16,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 1: [2022-11-27 08:40:16,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-27 08:40:16,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 08:40:16,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 9: [2022-11-27 08:40:16,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:40:16,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 08:40:16,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 0: [2022-11-27 08:40:16,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:40:16,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 08:40:16,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 12: [2022-11-27 08:40:16,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:40:16,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 08:40:16,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 16: [2022-11-27 08:40:16,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
16: [2022-11-27 08:40:16,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 08:40:16,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 27: [2022-11-27 08:40:16,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-27 08:40:16,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 08:40:16,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 19: [2022-11-27 08:40:16,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:40:16,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 08:40:16,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 9: [2022-11-27 08:40:16,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:40:16,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 08:40:16,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 8: [2022-11-27 08:40:16,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-27 08:40:16,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 08:40:16,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 14: [2022-11-27 08:40:16,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 22: [2022-11-27 08:40:16,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-27 08:40:16,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 11: [2022-11-27 08:40:16,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:40:16,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 22: [2022-11-27 08:40:16,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 11: [2022-11-27 08:40:16,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 08:40:16,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 14: [2022-11-27 08:40:16,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 30: [2022-11-27 08:40:16,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 
30: [2022-11-27 08:40:16,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 08:40:16,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 1: [2022-11-27 08:40:16,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:40:16,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 29: [2022-11-27 08:40:16,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:40:16,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 29: [2022-11-27 08:40:16,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 08:40:16,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 10: [2022-11-27 08:40:16,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:40:16,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 08:40:16,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 20: [2022-11-27 08:40:16,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
20: [2022-11-27 08:40:16,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 08:40:16,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 2: [2022-11-27 08:40:16,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:40:16,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 08:40:16,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 3: [2022-11-27 08:40:16,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:40:16,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:40:16,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 3: [2022-11-27 08:40:16,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 21: [2022-11-27 08:40:16,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 3: [2022-11-27 08:40:16,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 23: [2022-11-27 08:40:16,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 
23: [2022-11-27 08:40:16,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 08:40:16,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 18: [2022-11-27 08:40:16,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:40:16,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-27 08:40:16,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 17: [2022-11-27 08:40:16,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:40:16,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 08:40:16,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 2: [2022-11-27 08:40:16,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:40:16,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 25: [2022-11-27 08:40:16,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:40:16,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 
25: [2022-11-27 08:40:16,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 24: [2022-11-27 08:40:16,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:40:16,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 24: [2022-11-27 08:40:16,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 28: [2022-11-27 08:40:16,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 24: [2022-11-27 08:40:16,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 6: [2022-11-27 08:40:16,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:40:16,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 08:40:16,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 4: [2022-11-27 08:40:16,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:40:16,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 08:40:16,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 
15: [2022-11-27 08:40:16,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:40:16,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 08:40:16,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 31: [2022-11-27 08:40:16,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 08:40:16,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 13: [2022-11-27 08:40:16,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 31: [2022-11-27 08:40:16,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 13: [2022-11-27 08:40:16,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 08:40:16,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 12: [2022-11-27 08:40:16,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:40:16,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 08:40:16,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 
7: [2022-11-27 08:40:16,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:40:16,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 08:40:16,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 26: [2022-11-27 08:40:16,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-27 08:40:16,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 08:40:16,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 28: [2022-11-27 08:40:16,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 08:40:16,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 21: [2022-11-27 08:40:16,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:40:16,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 08:40:16,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 22: [2022-11-27 08:40:16,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 
22: [2022-11-27 08:40:16,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 08:40:16,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 13: [2022-11-27 08:40:16,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:40:16,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 08:40:16,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:40:16,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 13: [2022-11-27 08:40:16,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 08:40:16,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 5: [2022-11-27 08:40:16,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:40:16,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 08:40:16,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 23: [2022-11-27 08:40:16,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 
24: [2022-11-27 08:40:16,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 23: [2022-11-27 08:40:16,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 08:40:16,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 24: [2022-11-27 08:40:16,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-27 08:40:16,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 7: [2022-11-27 08:40:16,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:40:16,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 08:40:16,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 7: [2022-11-27 08:40:16,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:40:16,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 08:40:16,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 11: [2022-11-27 08:40:16,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-27 08:40:16,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 08:40:16,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 11: [2022-11-27 08:40:16,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:40:16,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 08:40:16,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 23: [2022-11-27 08:40:16,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 23: [2022-11-27 08:40:16,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 08:40:16,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 26: [2022-11-27 08:40:16,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-27 08:40:16,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 08:40:16,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 26: [2022-11-27 08:40:16,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 
26: [2022-11-27 08:40:16,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step171000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 08:40:16,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step171000 is ready now! 0: successfully saved checkpoint at iteration 171000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2593.26 31: iteration 171010/ 173500 | consumed samples: 43778560 | consumed tokens: 89658490880 | elapsed time per iteration (s): 1.08 | learning rate: 2.009E-05 | global batch size: 256 | lm loss: 1.911320E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.911 | TFLOPs: 14.39 | 31: iteration 171020/ 173500 | consumed samples: 43781120 | consumed tokens: 89663733760 | elapsed time per iteration (s): 0.82 | learning rate: 2.009E-05 | global batch size: 256 | lm loss: 1.940623E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.105 | TFLOPs: 19.00 | 31: iteration 171030/ 173500 | consumed samples: 43783680 | consumed tokens: 89668976640 | elapsed time per iteration (s): 0.80 | learning rate: 2.009E-05 | global batch size: 256 | lm loss: 1.898717E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.450 | TFLOPs: 19.33 | 31: iteration 171040/ 173500 | consumed samples: 43786240 | consumed tokens: 89674219520 | elapsed time per iteration (s): 0.80 | learning rate: 2.009E-05 | global batch size: 256 | lm loss: 1.912236E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.186 | TFLOPs: 19.43 | 31: iteration 171050/ 173500 | consumed samples: 43788800 | consumed tokens: 89679462400 | elapsed time per iteration (s): 0.82 | learning rate: 2.009E-05 | 
global batch size: 256 | lm loss: 1.922761E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.425 | TFLOPs: 18.96 | 31: iteration 171060/ 173500 | consumed samples: 43791360 | consumed tokens: 89684705280 | elapsed time per iteration (s): 0.78 | learning rate: 2.009E-05 | global batch size: 256 | lm loss: 1.899818E+00 | grad norm: 0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.807 | TFLOPs: 19.89 | 31: iteration 171070/ 173500 | consumed samples: 43793920 | consumed tokens: 89689948160 | elapsed time per iteration (s): 0.79 | learning rate: 2.009E-05 | global batch size: 256 | lm loss: 1.922923E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.779 | TFLOPs: 19.59 | 31: iteration 171080/ 173500 | consumed samples: 43796480 | consumed tokens: 89695191040 | elapsed time per iteration (s): 0.82 | learning rate: 2.009E-05 | global batch size: 256 | lm loss: 1.918169E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.049 | TFLOPs: 18.94 | 31: iteration 171090/ 173500 | consumed samples: 43799040 | consumed tokens: 89700433920 | elapsed time per iteration (s): 0.82 | learning rate: 2.009E-05 | global batch size: 256 | lm loss: 1.904084E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.649 | TFLOPs: 18.97 | 31: iteration 171100/ 173500 | consumed samples: 43801600 | consumed tokens: 89705676800 | elapsed time per iteration (s): 0.80 | learning rate: 2.009E-05 | global batch size: 256 | lm loss: 1.896032E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.841 | TFLOPs: 19.47 | 31: iteration 171110/ 173500 | consumed 
samples: 43804160 | consumed tokens: 89710919680 | elapsed time per iteration (s): 0.78 | learning rate: 2.009E-05 | global batch size: 256 | lm loss: 1.898029E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.176 | TFLOPs: 19.79 | 31: iteration 171120/ 173500 | consumed samples: 43806720 | consumed tokens: 89716162560 | elapsed time per iteration (s): 0.80 | learning rate: 2.009E-05 | global batch size: 256 | lm loss: 1.912154E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.677 | TFLOPs: 19.46 | 31: iteration 171130/ 173500 | consumed samples: 43809280 | consumed tokens: 89721405440 | elapsed time per iteration (s): 0.77 | learning rate: 2.008E-05 | global batch size: 256 | lm loss: 1.898608E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.522 | TFLOPs: 20.24 | 31: iteration 171140/ 173500 | consumed samples: 43811840 | consumed tokens: 89726648320 | elapsed time per iteration (s): 0.77 | learning rate: 2.008E-05 | global batch size: 256 | lm loss: 1.902661E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.557 | TFLOPs: 20.18 | 31: iteration 171150/ 173500 | consumed samples: 43814400 | consumed tokens: 89731891200 | elapsed time per iteration (s): 0.78 | learning rate: 2.008E-05 | global batch size: 256 | lm loss: 1.894473E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.084 | TFLOPs: 19.97 | 31: iteration 171160/ 173500 | consumed samples: 43816960 | consumed tokens: 89737134080 | elapsed time per iteration (s): 0.76 | learning rate: 2.008E-05 | global batch size: 256 | lm loss: 1.930412E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 337.845 | TFLOPs: 20.44 | 31: iteration 171170/ 173500 | consumed samples: 43819520 | consumed tokens: 89742376960 | elapsed time per iteration (s): 0.74 | learning rate: 2.008E-05 | global batch size: 256 | lm loss: 1.910757E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.008 | TFLOPs: 20.99 | 31: iteration 171180/ 173500 | consumed samples: 43822080 | consumed tokens: 89747619840 | elapsed time per iteration (s): 0.72 | learning rate: 2.008E-05 | global batch size: 256 | lm loss: 1.908540E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 355.046 | TFLOPs: 21.48 | 31: iteration 171190/ 173500 | consumed samples: 43824640 | consumed tokens: 89752862720 | elapsed time per iteration (s): 0.75 | learning rate: 2.008E-05 | global batch size: 256 | lm loss: 1.906909E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.285 | TFLOPs: 20.71 | 31: iteration 171200/ 173500 | consumed samples: 43827200 | consumed tokens: 89758105600 | elapsed time per iteration (s): 0.74 | learning rate: 2.008E-05 | global batch size: 256 | lm loss: 1.906756E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 343.632 | TFLOPs: 20.79 | 31: iteration 171210/ 173500 | consumed samples: 43829760 | consumed tokens: 89763348480 | elapsed time per iteration (s): 0.76 | learning rate: 2.008E-05 | global batch size: 256 | lm loss: 1.920794E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.475 | TFLOPs: 20.36 | 31: iteration 171220/ 173500 | consumed samples: 43832320 | consumed tokens: 89768591360 | elapsed time per iteration (s): 0.74 | learning rate: 2.008E-05 | global 
batch size: 256 | lm loss: 1.930467E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.806 | TFLOPs: 20.98 | 31: iteration 171230/ 173500 | consumed samples: 43834880 | consumed tokens: 89773834240 | elapsed time per iteration (s): 0.76 | learning rate: 2.008E-05 | global batch size: 256 | lm loss: 1.906703E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.849 | TFLOPs: 20.44 | 31: iteration 171240/ 173500 | consumed samples: 43837440 | consumed tokens: 89779077120 | elapsed time per iteration (s): 0.76 | learning rate: 2.008E-05 | global batch size: 256 | lm loss: 1.930507E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.260 | TFLOPs: 20.28 | 31: iteration 171250/ 173500 | consumed samples: 43840000 | consumed tokens: 89784320000 | elapsed time per iteration (s): 0.74 | learning rate: 2.008E-05 | global batch size: 256 | lm loss: 1.930377E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.624 | TFLOPs: 20.91 | 31: iteration 171260/ 173500 | consumed samples: 43842560 | consumed tokens: 89789562880 | elapsed time per iteration (s): 0.80 | learning rate: 2.008E-05 | global batch size: 256 | lm loss: 1.910597E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.080 | TFLOPs: 19.42 | 31: iteration 171270/ 173500 | consumed samples: 43845120 | consumed tokens: 89794805760 | elapsed time per iteration (s): 0.84 | learning rate: 2.007E-05 | global batch size: 256 | lm loss: 1.942727E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.542 | TFLOPs: 18.55 | 31: iteration 171280/ 173500 | consumed samples: 
43847680 | consumed tokens: 89800048640 | elapsed time per iteration (s): 0.85 | learning rate: 2.007E-05 | global batch size: 256 | lm loss: 1.890910E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 300.317 | TFLOPs: 18.17 | 31: iteration 171290/ 173500 | consumed samples: 43850240 | consumed tokens: 89805291520 | elapsed time per iteration (s): 0.81 | learning rate: 2.007E-05 | global batch size: 256 | lm loss: 1.926917E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.918 | TFLOPs: 19.17 | 31: iteration 171300/ 173500 | consumed samples: 43852800 | consumed tokens: 89810534400 | elapsed time per iteration (s): 0.80 | learning rate: 2.007E-05 | global batch size: 256 | lm loss: 1.902180E+00 | grad norm: 0.216 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.182 | TFLOPs: 19.37 | 31: iteration 171310/ 173500 | consumed samples: 43855360 | consumed tokens: 89815777280 | elapsed time per iteration (s): 0.78 | learning rate: 2.007E-05 | global batch size: 256 | lm loss: 1.912120E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.317 | TFLOPs: 19.74 | 31: iteration 171320/ 173500 | consumed samples: 43857920 | consumed tokens: 89821020160 | elapsed time per iteration (s): 0.80 | learning rate: 2.007E-05 | global batch size: 256 | lm loss: 1.878787E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.345 | TFLOPs: 19.44 | 31: iteration 171330/ 173500 | consumed samples: 43860480 | consumed tokens: 89826263040 | elapsed time per iteration (s): 0.77 | learning rate: 2.007E-05 | global batch size: 256 | lm loss: 1.894439E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 331.153 | TFLOPs: 20.03 | 31: iteration 171340/ 173500 | consumed samples: 43863040 | consumed tokens: 89831505920 | elapsed time per iteration (s): 0.78 | learning rate: 2.007E-05 | global batch size: 256 | lm loss: 1.905643E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.947 | TFLOPs: 19.96 | 31: iteration 171350/ 173500 | consumed samples: 43865600 | consumed tokens: 89836748800 | elapsed time per iteration (s): 0.76 | learning rate: 2.007E-05 | global batch size: 256 | lm loss: 1.915976E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.178 | TFLOPs: 20.40 | 31: iteration 171360/ 173500 | consumed samples: 43868160 | consumed tokens: 89841991680 | elapsed time per iteration (s): 0.78 | learning rate: 2.007E-05 | global batch size: 256 | lm loss: 1.911234E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.936 | TFLOPs: 19.84 | 31: iteration 171370/ 173500 | consumed samples: 43870720 | consumed tokens: 89847234560 | elapsed time per iteration (s): 0.76 | learning rate: 2.007E-05 | global batch size: 256 | lm loss: 1.924678E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.470 | TFLOPs: 20.36 | 31: iteration 171380/ 173500 | consumed samples: 43873280 | consumed tokens: 89852477440 | elapsed time per iteration (s): 0.83 | learning rate: 2.007E-05 | global batch size: 256 | lm loss: 1.897452E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.082 | TFLOPs: 18.64 | 31: iteration 171390/ 173500 | consumed samples: 43875840 | consumed tokens: 89857720320 | elapsed time per iteration (s): 0.84 | learning rate: 2.007E-05 | global batch 
size: 256 | lm loss: 1.905289E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.193 | TFLOPs: 18.34 | 31: iteration 171400/ 173500 | consumed samples: 43878400 | consumed tokens: 89862963200 | elapsed time per iteration (s): 0.82 | learning rate: 2.007E-05 | global batch size: 256 | lm loss: 1.913506E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.161 | TFLOPs: 18.88 | 31: iteration 171410/ 173500 | consumed samples: 43880960 | consumed tokens: 89868206080 | elapsed time per iteration (s): 0.82 | learning rate: 2.007E-05 | global batch size: 256 | lm loss: 1.927643E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.900 | TFLOPs: 18.87 | 31: iteration 171420/ 173500 | consumed samples: 43883520 | consumed tokens: 89873448960 | elapsed time per iteration (s): 0.83 | learning rate: 2.007E-05 | global batch size: 256 | lm loss: 1.892341E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 310.003 | TFLOPs: 18.75 | 31: iteration 171430/ 173500 | consumed samples: 43886080 | consumed tokens: 89878691840 | elapsed time per iteration (s): 0.78 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.930240E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.002 | TFLOPs: 19.78 | 31: iteration 171440/ 173500 | consumed samples: 43888640 | consumed tokens: 89883934720 | elapsed time per iteration (s): 0.83 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.902713E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.946 | TFLOPs: 18.69 | 31: iteration 171450/ 173500 | consumed samples: 43891200 
| consumed tokens: 89889177600 | elapsed time per iteration (s): 0.82 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.930328E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.276 | TFLOPs: 18.83 | 31: iteration 171460/ 173500 | consumed samples: 43893760 | consumed tokens: 89894420480 | elapsed time per iteration (s): 0.87 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.923239E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.793 | TFLOPs: 17.77 | 31: iteration 171470/ 173500 | consumed samples: 43896320 | consumed tokens: 89899663360 | elapsed time per iteration (s): 0.84 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.926322E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.523 | TFLOPs: 18.54 | 31: iteration 171480/ 173500 | consumed samples: 43898880 | consumed tokens: 89904906240 | elapsed time per iteration (s): 0.80 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.913845E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.750 | TFLOPs: 19.47 | 31: iteration 171490/ 173500 | consumed samples: 43901440 | consumed tokens: 89910149120 | elapsed time per iteration (s): 0.80 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.940603E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.933 | TFLOPs: 19.42 | 31: iteration 171500/ 173500 | consumed samples: 43904000 | consumed tokens: 89915392000 | elapsed time per iteration (s): 0.80 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.899660E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 318.417 | TFLOPs: 19.26 | 31: iteration 171510/ 173500 | consumed samples: 43906560 | consumed tokens: 89920634880 | elapsed time per iteration (s): 0.82 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.901517E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.546 | TFLOPs: 18.85 | 31: iteration 171520/ 173500 | consumed samples: 43909120 | consumed tokens: 89925877760 | elapsed time per iteration (s): 0.80 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.946602E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.990 | TFLOPs: 19.30 | 31: iteration 171530/ 173500 | consumed samples: 43911680 | consumed tokens: 89931120640 | elapsed time per iteration (s): 0.77 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.898139E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.822 | TFLOPs: 20.07 | 31: iteration 171540/ 173500 | consumed samples: 43914240 | consumed tokens: 89936363520 | elapsed time per iteration (s): 0.75 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.932317E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.649 | TFLOPs: 20.61 | 31: iteration 171550/ 173500 | consumed samples: 43916800 | consumed tokens: 89941606400 | elapsed time per iteration (s): 0.76 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.950058E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.860 | TFLOPs: 20.44 | 31: iteration 171560/ 173500 | consumed samples: 43919360 | consumed tokens: 89946849280 | elapsed time per iteration (s): 0.74 | learning rate: 2.006E-05 | global batch size: 
256 | lm loss: 1.938952E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.283 | TFLOPs: 21.07 | 31: iteration 171570/ 173500 | consumed samples: 43921920 | consumed tokens: 89952092160 | elapsed time per iteration (s): 0.77 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.908837E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 333.692 | TFLOPs: 20.19 | 31: iteration 171580/ 173500 | consumed samples: 43924480 | consumed tokens: 89957335040 | elapsed time per iteration (s): 0.80 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.911043E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.226 | TFLOPs: 19.43 | 31: iteration 171590/ 173500 | consumed samples: 43927040 | consumed tokens: 89962577920 | elapsed time per iteration (s): 0.74 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.922290E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.394 | TFLOPs: 20.83 | 31: iteration 171600/ 173500 | consumed samples: 43929600 | consumed tokens: 89967820800 | elapsed time per iteration (s): 0.77 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.919413E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.069 | TFLOPs: 20.21 | 31: iteration 171610/ 173500 | consumed samples: 43932160 | consumed tokens: 89973063680 | elapsed time per iteration (s): 0.75 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.942078E+00 | grad norm: 0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.808 | TFLOPs: 20.68 | 31: iteration 171620/ 173500 | consumed samples: 43934720 | 
consumed tokens: 89978306560 | elapsed time per iteration (s): 0.74 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.895600E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.580 | TFLOPs: 20.91 | 31: iteration 171630/ 173500 | consumed samples: 43937280 | consumed tokens: 89983549440 | elapsed time per iteration (s): 0.81 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.904858E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.969 | TFLOPs: 19.24 | 31: iteration 171640/ 173500 | consumed samples: 43939840 | consumed tokens: 89988792320 | elapsed time per iteration (s): 0.72 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.906140E+00 | grad norm: 0.215 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.318 | TFLOPs: 21.37 | 31: iteration 171650/ 173500 | consumed samples: 43942400 | consumed tokens: 89994035200 | elapsed time per iteration (s): 0.73 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.891837E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.158 | TFLOPs: 21.12 | 31: iteration 171660/ 173500 | consumed samples: 43944960 | consumed tokens: 89999278080 | elapsed time per iteration (s): 0.76 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.902082E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.883 | TFLOPs: 20.26 | 31: iteration 171670/ 173500 | consumed samples: 43947520 | consumed tokens: 90004520960 | elapsed time per iteration (s): 0.76 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.878191E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 336.491 | TFLOPs: 20.36 | 31: iteration 171680/ 173500 | consumed samples: 43950080 | consumed tokens: 90009763840 | elapsed time per iteration (s): 0.74 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.883082E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.060 | TFLOPs: 20.81 | 31: iteration 171690/ 173500 | consumed samples: 43952640 | consumed tokens: 90015006720 | elapsed time per iteration (s): 0.77 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.901164E+00 | grad norm: 0.215 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.089 | TFLOPs: 20.21 | 31: iteration 171700/ 173500 | consumed samples: 43955200 | consumed tokens: 90020249600 | elapsed time per iteration (s): 0.76 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.912438E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.469 | TFLOPs: 20.48 | 31: iteration 171710/ 173500 | consumed samples: 43957760 | consumed tokens: 90025492480 | elapsed time per iteration (s): 0.96 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.918040E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 266.497 | TFLOPs: 16.12 | 31: iteration 171720/ 173500 | consumed samples: 43960320 | consumed tokens: 90030735360 | elapsed time per iteration (s): 0.78 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.904286E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.289 | TFLOPs: 19.74 | 31: iteration 171730/ 173500 | consumed samples: 43962880 | consumed tokens: 90035978240 | elapsed time per iteration (s): 0.76 | learning rate: 2.005E-05 | global batch size: 
256 | lm loss: 1.901070E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.730 | TFLOPs: 20.31 | 31: iteration 171740/ 173500 | consumed samples: 43965440 | consumed tokens: 90041221120 | elapsed time per iteration (s): 0.74 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.879055E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.373 | TFLOPs: 20.83 | 31: iteration 171750/ 173500 | consumed samples: 43968000 | consumed tokens: 90046464000 | elapsed time per iteration (s): 0.82 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.889316E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.593 | TFLOPs: 18.97 | 31: iteration 171760/ 173500 | consumed samples: 43970560 | consumed tokens: 90051706880 | elapsed time per iteration (s): 0.73 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.887847E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.514 | TFLOPs: 21.08 | 31: iteration 171770/ 173500 | consumed samples: 43973120 | consumed tokens: 90056949760 | elapsed time per iteration (s): 0.76 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.889876E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.873 | TFLOPs: 20.38 | 31: iteration 171780/ 173500 | consumed samples: 43975680 | consumed tokens: 90062192640 | elapsed time per iteration (s): 0.72 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.901543E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 357.740 | TFLOPs: 21.64 | 31: iteration 171790/ 173500 | consumed samples: 43978240 | 
consumed tokens: 90067435520 | elapsed time per iteration (s): 0.78 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.904140E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.128 | TFLOPs: 19.79 | 31: iteration 171800/ 173500 | consumed samples: 43980800 | consumed tokens: 90072678400 | elapsed time per iteration (s): 0.76 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.883115E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.477 | TFLOPs: 20.36 | 31: iteration 171810/ 173500 | consumed samples: 43983360 | consumed tokens: 90077921280 | elapsed time per iteration (s): 0.76 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.915970E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.430 | TFLOPs: 20.47 | 31: iteration 171820/ 173500 | consumed samples: 43985920 | consumed tokens: 90083164160 | elapsed time per iteration (s): 0.75 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.933585E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 340.702 | TFLOPs: 20.61 | 31: iteration 171830/ 173500 | consumed samples: 43988480 | consumed tokens: 90088407040 | elapsed time per iteration (s): 0.89 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.912580E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 288.959 | TFLOPs: 17.48 | 31: iteration 171840/ 173500 | consumed samples: 43991040 | consumed tokens: 90093649920 | elapsed time per iteration (s): 0.74 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.878289E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 346.136 | TFLOPs: 20.94 | 31: iteration 171850/ 173500 | consumed samples: 43993600 | consumed tokens: 90098892800 | elapsed time per iteration (s): 0.76 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.909247E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.532 | TFLOPs: 20.30 | 31: iteration 171860/ 173500 | consumed samples: 43996160 | consumed tokens: 90104135680 | elapsed time per iteration (s): 0.80 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.927178E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.556 | TFLOPs: 19.33 | 31: iteration 171870/ 173500 | consumed samples: 43998720 | consumed tokens: 90109378560 | elapsed time per iteration (s): 0.82 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.926717E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.047 | TFLOPs: 18.94 | 31: iteration 171880/ 173500 | consumed samples: 44001280 | consumed tokens: 90114621440 | elapsed time per iteration (s): 0.79 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.918873E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.763 | TFLOPs: 19.59 | 31: iteration 171890/ 173500 | consumed samples: 44003840 | consumed tokens: 90119864320 | elapsed time per iteration (s): 0.82 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.942966E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.991 | TFLOPs: 18.87 | 31: iteration 171900/ 173500 | consumed samples: 44006400 | consumed tokens: 90125107200 | elapsed time per iteration (s): 0.79 | learning rate: 2.004E-05 | global batch size: 
256 | lm loss: 1.916260E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.979 | TFLOPs: 19.60 | 31: iteration 171910/ 173500 | consumed samples: 44008960 | consumed tokens: 90130350080 | elapsed time per iteration (s): 0.81 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.921652E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.680 | TFLOPs: 19.22 | 31: iteration 171920/ 173500 | consumed samples: 44011520 | consumed tokens: 90135592960 | elapsed time per iteration (s): 0.81 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.905729E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.753 | TFLOPs: 19.16 | 31: iteration 171930/ 173500 | consumed samples: 44014080 | consumed tokens: 90140835840 | elapsed time per iteration (s): 0.82 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.895209E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.193 | TFLOPs: 18.89 | 31: iteration 171940/ 173500 | consumed samples: 44016640 | consumed tokens: 90146078720 | elapsed time per iteration (s): 0.81 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.937871E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.740 | TFLOPs: 19.22 | 31: iteration 171950/ 173500 | consumed samples: 44019200 | consumed tokens: 90151321600 | elapsed time per iteration (s): 0.79 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.924668E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.921 | TFLOPs: 19.66 | 31: iteration 171960/ 173500 | consumed samples: 44021760 | 
consumed tokens: 90156564480 | elapsed time per iteration (s): 0.78 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.920712E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.636 | TFLOPs: 19.88 | 31: iteration 171970/ 173500 | consumed samples: 44024320 | consumed tokens: 90161807360 | elapsed time per iteration (s): 0.76 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.893264E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.229 | TFLOPs: 20.28 | 31: iteration 171980/ 173500 | consumed samples: 44026880 | consumed tokens: 90167050240 | elapsed time per iteration (s): 0.79 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.926114E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.237 | TFLOPs: 19.49 | 31: iteration 171990/ 173500 | consumed samples: 44029440 | consumed tokens: 90172293120 | elapsed time per iteration (s): 0.98 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.902483E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 261.338 | TFLOPs: 15.81 | 0: [2022-11-27 08:53:23,344] [INFO] [logging.py:68:log_dist] [Rank 0] step=172000, skipped=0, lr=[2.0033893682955986e-05, 2.0033893682955986e-05, 2.0033893682955986e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 31: iteration 172000/ 173500 | consumed samples: 44032000 | consumed tokens: 90177536000 | elapsed time per iteration (s): 0.80 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.904080E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.452 | TFLOPs: 19.39 | 0: steps: 172000 loss: 1.9138 iter time (s): 0.932 samples/sec: 274.618 31: 
-------------------------------------------------------------------------------------------- 31: valid loss at iteration 172000 | lm loss value: 1.895318E+00 | lm loss PPL: 6.654665E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 172000 to checkpoints_1b1long 0: [2022-11-27 08:53:23,624] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step172000 is begin to save! 0: [2022-11-27 08:53:23,638] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_01-model_00-model_states.pt... 0: [2022-11-27 08:53:23,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_01-model_00-model_states.pt. 0: [2022-11-27 08:53:23,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_03-model_00-model_states.pt... 0: [2022-11-27 08:53:23,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_03-model_00-model_states.pt. 0: [2022-11-27 08:53:23,946] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_04-model_00-model_states.pt... 0: [2022-11-27 08:53:24,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_04-model_00-model_states.pt. 0: [2022-11-27 08:53:24,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_05-model_00-model_states.pt... 0: [2022-11-27 08:53:24,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_05-model_00-model_states.pt. 0: [2022-11-27 08:53:24,110] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_06-model_00-model_states.pt... 
0: [2022-11-27 08:53:24,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_06-model_00-model_states.pt. 0: [2022-11-27 08:53:24,185] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_07-model_00-model_states.pt... 0: [2022-11-27 08:53:24,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_07-model_00-model_states.pt. 0: [2022-11-27 08:53:24,263] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_08-model_00-model_states.pt... 0: [2022-11-27 08:53:24,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_08-model_00-model_states.pt. 0: [2022-11-27 08:53:24,336] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_09-model_00-model_states.pt... 0: [2022-11-27 08:53:24,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_09-model_00-model_states.pt. 0: [2022-11-27 08:53:24,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_10-model_00-model_states.pt... 0: [2022-11-27 08:53:24,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_10-model_00-model_states.pt. 0: [2022-11-27 08:53:24,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_11-model_00-model_states.pt... 0: [2022-11-27 08:53:24,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_11-model_00-model_states.pt. 0: [2022-11-27 08:53:24,568] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_12-model_00-model_states.pt... 
0: [2022-11-27 08:53:24,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_12-model_00-model_states.pt. 0: [2022-11-27 08:53:24,644] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_13-model_00-model_states.pt... 0: [2022-11-27 08:53:24,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_13-model_00-model_states.pt. 0: [2022-11-27 08:53:24,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_14-model_00-model_states.pt... 0: [2022-11-27 08:53:24,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_14-model_00-model_states.pt. 0: [2022-11-27 08:53:24,795] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_15-model_00-model_states.pt... 0: [2022-11-27 08:53:24,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_15-model_00-model_states.pt. 0: [2022-11-27 08:53:24,871] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_16-model_00-model_states.pt... 0: [2022-11-27 08:53:24,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_16-model_00-model_states.pt. 0: [2022-11-27 08:53:24,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_17-model_00-model_states.pt... 0: [2022-11-27 08:53:25,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_17-model_00-model_states.pt. 0: [2022-11-27 08:53:25,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_18-model_00-model_states.pt... 
0: [2022-11-27 08:53:25,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_18-model_00-model_states.pt. 0: [2022-11-27 08:53:25,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_19-model_00-model_states.pt... 0: [2022-11-27 08:53:25,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_19-model_00-model_states.pt. 0: [2022-11-27 08:53:25,173] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_20-model_00-model_states.pt... 0: [2022-11-27 08:53:25,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_20-model_00-model_states.pt. 0: [2022-11-27 08:53:25,248] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_21-model_00-model_states.pt... 0: [2022-11-27 08:53:25,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_21-model_00-model_states.pt. 0: [2022-11-27 08:53:25,321] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_22-model_00-model_states.pt... 0: [2022-11-27 08:53:25,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_22-model_00-model_states.pt. 0: [2022-11-27 08:53:25,399] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_23-model_00-model_states.pt... 0: [2022-11-27 08:53:25,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_23-model_00-model_states.pt. 0: [2022-11-27 08:53:25,474] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_24-model_00-model_states.pt... 
0: [2022-11-27 08:53:25,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_24-model_00-model_states.pt. 0: [2022-11-27 08:53:25,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_25-model_00-model_states.pt... 0: [2022-11-27 08:53:25,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_25-model_00-model_states.pt. 0: [2022-11-27 08:53:25,625] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_26-model_00-model_states.pt... 0: [2022-11-27 08:53:25,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_26-model_00-model_states.pt. 0: [2022-11-27 08:53:25,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_27-model_00-model_states.pt... 0: [2022-11-27 08:53:25,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_27-model_00-model_states.pt. 0: [2022-11-27 08:53:25,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_28-model_00-model_states.pt... 0: [2022-11-27 08:53:25,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_28-model_00-model_states.pt. 0: [2022-11-27 08:53:25,851] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/layer_30-model_00-model_states.pt... 0: [2022-11-27 08:53:25,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/layer_30-model_00-model_states.pt. 
0: [2022-11-27 08:53:25,853] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step172000/mp_rank_00_model_states.pt 0: [2022-11-27 08:53:25,853] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/mp_rank_00_model_states.pt... 0: [2022-11-27 08:53:25,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/mp_rank_00_model_states.pt. 0: [2022-11-27 08:53:25,933] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:53:25,933] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 
5: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 
9: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
1: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:53:25,933] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:53:25,933] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
3: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 
15: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 20: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 20: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 
11: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
31: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 
30: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 
19: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:53:25,933] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 
6: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 
8: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:53:25,933] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:53:25,933] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:53:25,933] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:53:25,933] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-27 08:53:25,933] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
13: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 20: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 
23: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 
22: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 
21: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 
5: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
16: [2022-11-27 08:53:25,933] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 
25: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
31: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 18: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 
26: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 19: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
13: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 20: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 23: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 22: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 
30: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 30: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:53:25,933] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 20: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 
28: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 24: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 26: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 27: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 28: [2022-11-27 08:53:25,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 31: [2022-11-27 08:53:25,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-27 08:53:25,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 08:53:25,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 21: [2022-11-27 08:53:25,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 
21: [2022-11-27 08:53:25,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 21: [2022-11-27 08:53:25,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 24: [2022-11-27 08:53:25,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-27 08:53:25,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 08:53:25,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 20: [2022-11-27 08:53:25,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-27 08:53:25,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 08:53:25,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 19: [2022-11-27 08:53:25,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:53:25,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 08:53:25,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 0: [2022-11-27 08:53:25,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
28: [2022-11-27 08:53:25,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 27: [2022-11-27 08:53:25,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:53:25,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 28: [2022-11-27 08:53:25,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 0: [2022-11-27 08:53:25,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 3: [2022-11-27 08:53:25,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 28: [2022-11-27 08:53:25,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 0: [2022-11-27 08:53:25,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 3: [2022-11-27 08:53:25,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 31: [2022-11-27 08:53:25,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 08:53:25,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 08:53:25,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 
27: [2022-11-27 08:53:25,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 08:53:25,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 14: [2022-11-27 08:53:25,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:53:25,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:53:25,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 08:53:25,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 08:53:25,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 14: [2022-11-27 08:53:25,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 2: [2022-11-27 08:53:25,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:53:25,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 08:53:25,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 16: [2022-11-27 08:53:25,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 
22: [2022-11-27 08:53:25,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:53:25,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 16: [2022-11-27 08:53:25,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 22: [2022-11-27 08:53:25,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 16: [2022-11-27 08:53:25,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 22: [2022-11-27 08:53:25,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 9: [2022-11-27 08:53:25,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:53:25,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 08:53:25,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 10: [2022-11-27 08:53:25,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-27 08:53:25,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 19: [2022-11-27 08:53:25,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 10: [2022-11-27 08:53:25,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 23: [2022-11-27 08:53:25,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:53:25,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 23: [2022-11-27 08:53:25,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 23: [2022-11-27 08:53:25,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 28: [2022-11-27 08:53:25,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:53:25,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:53:25,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:53:25,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
9: [2022-11-27 08:53:25,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 08:53:25,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 5: [2022-11-27 08:53:25,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 08:53:25,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 08:53:25,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 5: [2022-11-27 08:53:25,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 1: [2022-11-27 08:53:25,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:53:25,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 12: [2022-11-27 08:53:25,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 20: [2022-11-27 08:53:25,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:53:25,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 
17: [2022-11-27 08:53:25,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:53:25,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 12: [2022-11-27 08:53:25,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 20: [2022-11-27 08:53:25,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 17: [2022-11-27 08:53:25,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 12: [2022-11-27 08:53:25,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 20: [2022-11-27 08:53:25,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 30: [2022-11-27 08:53:25,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 17: [2022-11-27 08:53:25,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 30: [2022-11-27 08:53:25,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 16: [2022-11-27 08:53:25,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 
16: [2022-11-27 08:53:25,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 08:53:25,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 17: [2022-11-27 08:53:25,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:53:25,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:53:25,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 10: [2022-11-27 08:53:25,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 17: [2022-11-27 08:53:25,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 10: [2022-11-27 08:53:25,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 28: [2022-11-27 08:53:25,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 08:53:25,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 4: [2022-11-27 08:53:26,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:53:26,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 
21: [2022-11-27 08:53:26,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 27: [2022-11-27 08:53:26,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:53:26,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 31: [2022-11-27 08:53:26,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:53:26,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 27: [2022-11-27 08:53:26,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 4: [2022-11-27 08:53:26,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 27: [2022-11-27 08:53:26,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 31: [2022-11-27 08:53:26,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 08:53:26,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 8: [2022-11-27 08:53:26,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-27 08:53:26,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 08:53:26,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 0: [2022-11-27 08:53:26,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:53:26,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 08:53:26,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 22: [2022-11-27 08:53:26,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-27 08:53:26,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 08:53:26,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 2: [2022-11-27 08:53:26,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 24: [2022-11-27 08:53:26,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 
2: [2022-11-27 08:53:26,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 24: [2022-11-27 08:53:26,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 2: [2022-11-27 08:53:26,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 24: [2022-11-27 08:53:26,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 8: [2022-11-27 08:53:26,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:53:26,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 08:53:26,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 15: [2022-11-27 08:53:26,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:53:26,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 2: [2022-11-27 08:53:26,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:53:26,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 
2: [2022-11-27 08:53:26,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 08:53:26,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 4: [2022-11-27 08:53:26,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:53:26,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 08:53:26,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 20: [2022-11-27 08:53:26,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 29: [2022-11-27 08:53:26,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:53:26,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 29: [2022-11-27 08:53:26,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 
3: [2022-11-27 08:53:26,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 29: [2022-11-27 08:53:26,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 3: [2022-11-27 08:53:26,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 29: [2022-11-27 08:53:26,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 20: [2022-11-27 08:53:26,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 29: [2022-11-27 08:53:26,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 20: [2022-11-27 08:53:26,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 29: [2022-11-27 08:53:26,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 30: [2022-11-27 08:53:26,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:53:26,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 08:53:26,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 28: [2022-11-27 08:53:26,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 
26: [2022-11-27 08:53:26,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-27 08:53:26,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 28: [2022-11-27 08:53:26,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 26: [2022-11-27 08:53:26,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 08:53:26,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 08:53:26,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 26: [2022-11-27 08:53:26,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 28: [2022-11-27 08:53:26,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 19: [2022-11-27 08:53:26,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:53:26,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 
19: [2022-11-27 08:53:26,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-27 08:53:26,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 19: [2022-11-27 08:53:26,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 19: [2022-11-27 08:53:26,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 12: [2022-11-27 08:53:26,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:53:26,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 08:53:26,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 26: [2022-11-27 08:53:26,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:53:26,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:53:26,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:53:26,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 08:53:26,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 
29: [2022-11-27 08:53:26,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 22: [2022-11-27 08:53:26,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:53:26,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 29: [2022-11-27 08:53:26,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 22: [2022-11-27 08:53:26,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 29: [2022-11-27 08:53:26,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 22: [2022-11-27 08:53:26,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 26: [2022-11-27 08:53:26,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 5: [2022-11-27 08:53:26,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 26: [2022-11-27 08:53:26,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 31: [2022-11-27 08:53:26,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 
31: [2022-11-27 08:53:26,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 31: [2022-11-27 08:53:26,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 5: [2022-11-27 08:53:26,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:53:26,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 08:53:26,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 9: [2022-11-27 08:53:26,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:53:26,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 20: [2022-11-27 08:53:26,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:53:26,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 20: [2022-11-27 08:53:26,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 08:53:26,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 3: [2022-11-27 08:53:26,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
12: [2022-11-27 08:53:26,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:53:26,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 12: [2022-11-27 08:53:26,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 3: [2022-11-27 08:53:26,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 12: [2022-11-27 08:53:26,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 1: [2022-11-27 08:53:26,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:53:26,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 14: [2022-11-27 08:53:26,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:53:26,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 14: [2022-11-27 08:53:26,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 08:53:26,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 26: [2022-11-27 08:53:26,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 
0: [2022-11-27 08:53:26,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:53:26,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 26: [2022-11-27 08:53:26,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 08:53:26,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 17: [2022-11-27 08:53:26,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 08:53:26,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 0: [2022-11-27 08:53:26,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 08:53:26,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 6: [2022-11-27 08:53:26,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:53:26,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 08:53:26,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 8: [2022-11-27 08:53:26,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-27 08:53:26,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 08:53:26,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 27: [2022-11-27 08:53:26,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-27 08:53:26,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 08:53:26,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 29: [2022-11-27 08:53:26,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-27 08:53:26,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 10: [2022-11-27 08:53:26,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 29: [2022-11-27 08:53:26,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 10: [2022-11-27 08:53:26,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 08:53:26,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 9: [2022-11-27 08:53:26,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-27 08:53:26,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 08:53:26,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 6: [2022-11-27 08:53:26,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:53:26,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:53:26,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 08:53:26,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 08:53:26,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 6: [2022-11-27 08:53:26,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 14: [2022-11-27 08:53:26,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:53:26,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 08:53:26,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 10: [2022-11-27 08:53:26,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-27 08:53:26,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 2: [2022-11-27 08:53:26,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:53:26,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 2: [2022-11-27 08:53:26,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 24: [2022-11-27 08:53:26,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:53:26,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 24: [2022-11-27 08:53:26,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 24: [2022-11-27 08:53:26,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 11: [2022-11-27 08:53:26,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:53:26,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:53:26,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:53:26,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-27 08:53:26,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 08:53:26,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 08:53:26,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 08:53:26,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 08:53:26,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 11: [2022-11-27 08:53:26,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 11: [2022-11-27 08:53:26,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 11: [2022-11-27 08:53:26,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 4: [2022-11-27 08:53:26,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:53:26,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 08:53:26,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 21: [2022-11-27 08:53:26,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
21: [2022-11-27 08:53:26,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 08:53:26,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 19: [2022-11-27 08:53:26,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:53:26,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 08:53:26,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 25: [2022-11-27 08:53:26,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:53:26,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:53:26,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 08:53:26,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 4: [2022-11-27 08:53:26,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:53:26,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 08:53:26,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 
31: [2022-11-27 08:53:26,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:53:26,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 24: [2022-11-27 08:53:26,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:53:26,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 08:53:26,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 31: [2022-11-27 08:53:26,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 25: [2022-11-27 08:53:26,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 24: [2022-11-27 08:53:26,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 31: [2022-11-27 08:53:26,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 25: [2022-11-27 08:53:26,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 
25: [2022-11-27 08:53:26,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 24: [2022-11-27 08:53:26,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 8: [2022-11-27 08:53:26,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:53:26,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 25: [2022-11-27 08:53:26,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 8: [2022-11-27 08:53:26,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 08:53:26,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 27: [2022-11-27 08:53:26,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-27 08:53:26,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 08:53:26,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 22: [2022-11-27 08:53:26,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
22: [2022-11-27 08:53:26,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 08:53:26,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 23: [2022-11-27 08:53:26,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 23: [2022-11-27 08:53:26,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 08:53:26,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 1: [2022-11-27 08:53:26,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:53:26,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 08:53:26,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 0: [2022-11-27 08:53:26,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:53:26,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
0: [2022-11-27 08:53:26,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 15: [2022-11-27 08:53:26,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 08:53:26,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 0: [2022-11-27 08:53:26,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 1: [2022-11-27 08:53:26,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:53:26,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 3: [2022-11-27 08:53:26,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:53:26,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 3: [2022-11-27 08:53:26,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 08:53:26,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 17: [2022-11-27 08:53:26,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 
17: [2022-11-27 08:53:26,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 08:53:26,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 15: [2022-11-27 08:53:26,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:53:26,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 08:53:26,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 5: [2022-11-27 08:53:26,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 23: [2022-11-27 08:53:26,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 23: [2022-11-27 08:53:26,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 08:53:26,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 30: [2022-11-27 08:53:26,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:53:26,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 
5: [2022-11-27 08:53:26,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 30: [2022-11-27 08:53:26,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 08:53:26,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 5: [2022-11-27 08:53:26,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 30: [2022-11-27 08:53:26,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 30: [2022-11-27 08:53:26,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 23: [2022-11-27 08:53:26,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-27 08:53:26,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 08:53:26,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 16: [2022-11-27 08:53:26,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 08:53:26,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 08:53:26,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 
16: [2022-11-27 08:53:26,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 16: [2022-11-27 08:53:26,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 08:53:26,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 15: [2022-11-27 08:53:26,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:53:26,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 08:53:26,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 7: [2022-11-27 08:53:26,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:53:26,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:53:26,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:53:26,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:53:26,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 08:53:26,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 
7: [2022-11-27 08:53:26,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 08:53:26,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 08:53:26,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 08:53:26,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 7: [2022-11-27 08:53:26,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 7: [2022-11-27 08:53:26,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 13: [2022-11-27 08:53:26,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:53:26,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:53:26,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:53:26,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 08:53:26,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-27 08:53:26,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 08:53:26,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 08:53:26,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 13: [2022-11-27 08:53:26,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 13: [2022-11-27 08:53:26,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 13: [2022-11-27 08:53:26,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 08:53:26,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 28: [2022-11-27 08:53:26,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-27 08:53:26,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 08:53:26,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 26: [2022-11-27 08:53:26,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 
26: [2022-11-27 08:53:26,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 08:53:26,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 0: [2022-11-27 08:53:26,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:53:26,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 08:53:26,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 17: [2022-11-27 08:53:26,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:53:26,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 08:53:26,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 20: [2022-11-27 08:53:26,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 08:53:26,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 08:53:26,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 18: [2022-11-27 08:53:26,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 
18: [2022-11-27 08:53:26,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:53:26,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:53:26,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 08:53:26,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:53:26,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 08:53:26,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 08:53:26,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 18: [2022-11-27 08:53:26,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 08:53:26,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 18: [2022-11-27 08:53:26,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 18: [2022-11-27 08:53:26,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 18: [2022-11-27 08:53:26,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 
18: [2022-11-27 08:53:26,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 08:53:26,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 23: [2022-11-27 08:53:26,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-27 08:53:26,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 08:53:26,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 10: [2022-11-27 08:53:26,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:53:26,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 08:53:26,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 2: [2022-11-27 08:53:26,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:53:26,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 08:53:26,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 9: [2022-11-27 08:53:26,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-27 08:53:26,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 08:53:26,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 8: [2022-11-27 08:53:26,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:53:26,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 08:53:26,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 14: [2022-11-27 08:53:26,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:53:26,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 08:53:26,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 12: [2022-11-27 08:53:26,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:53:26,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 08:53:26,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 28: [2022-11-27 08:53:26,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 
28: [2022-11-27 08:53:26,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 08:53:26,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 11: [2022-11-27 08:53:26,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:53:26,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 08:53:26,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 6: [2022-11-27 08:53:26,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:53:26,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 08:53:26,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 27: [2022-11-27 08:53:26,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-27 08:53:26,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 08:53:26,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 30: [2022-11-27 08:53:26,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-27 08:53:26,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 08:53:26,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 25: [2022-11-27 08:53:26,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:53:26,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 08:53:26,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 16: [2022-11-27 08:53:26,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 08:53:26,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 08:53:26,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 4: [2022-11-27 08:53:26,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:53:26,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 08:53:26,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 22: [2022-11-27 08:53:26,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
22: [2022-11-27 08:53:26,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 08:53:26,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 19: [2022-11-27 08:53:26,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:53:26,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 08:53:26,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 13: [2022-11-27 08:53:26,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:53:26,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 08:53:26,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 15: [2022-11-27 08:53:26,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:53:26,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 08:53:26,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 24: [2022-11-27 08:53:26,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 
24: [2022-11-27 08:53:26,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-27 08:53:26,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 1: [2022-11-27 08:53:26,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:53:26,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 08:53:26,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 7: [2022-11-27 08:53:26,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:53:26,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 08:53:26,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 3: [2022-11-27 08:53:26,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:53:26,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 08:53:26,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 20: [2022-11-27 08:53:26,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 
20: [2022-11-27 08:53:26,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 08:53:26,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 29: [2022-11-27 08:53:26,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-27 08:53:26,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 21: [2022-11-27 08:53:26,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:53:26,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 08:53:26,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 29: [2022-11-27 08:53:26,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 5: [2022-11-27 08:53:26,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:53:26,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 08:53:26,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 26: [2022-11-27 08:53:26,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-27 08:53:26,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 08:53:26,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 0: [2022-11-27 08:53:26,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:53:26,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:53:26,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 08:53:26,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 14: [2022-11-27 08:53:26,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 23: [2022-11-27 08:53:26,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:53:26,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 17: [2022-11-27 08:53:26,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 14: [2022-11-27 08:53:26,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 
23: [2022-11-27 08:53:26,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 17: [2022-11-27 08:53:26,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 23: [2022-11-27 08:53:26,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 31: [2022-11-27 08:53:26,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 31: [2022-11-27 08:53:26,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 31: [2022-11-27 08:53:26,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 2: [2022-11-27 08:53:26,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:53:26,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 08:53:26,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 9: [2022-11-27 08:53:26,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:53:26,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 08:53:26,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 
8: [2022-11-27 08:53:26,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:53:26,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 08:53:26,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 18: [2022-11-27 08:53:26,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:53:26,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 08:53:26,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 11: [2022-11-27 08:53:26,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:53:26,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 08:53:26,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 6: [2022-11-27 08:53:26,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:53:26,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 08:53:26,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 
30: [2022-11-27 08:53:26,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:53:26,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 08:53:26,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 27: [2022-11-27 08:53:26,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 27: [2022-11-27 08:53:26,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 16: [2022-11-27 08:53:26,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 27: [2022-11-27 08:53:26,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 16: [2022-11-27 08:53:26,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 28: [2022-11-27 08:53:26,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 16: [2022-11-27 08:53:26,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 12: [2022-11-27 08:53:26,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-27 08:53:26,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 08:53:26,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 25: [2022-11-27 08:53:26,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:53:26,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-27 08:53:26,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 24: [2022-11-27 08:53:26,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 08:53:26,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 08:53:26,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 4: [2022-11-27 08:53:26,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:53:26,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 22: [2022-11-27 08:53:26,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:53:26,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 
22: [2022-11-27 08:53:26,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 08:53:26,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 7: [2022-11-27 08:53:26,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:53:26,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 08:53:26,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 13: [2022-11-27 08:53:26,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:53:26,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 08:53:26,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 31: [2022-11-27 08:53:26,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 08:53:26,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 08:53:26,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 15: [2022-11-27 08:53:26,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-27 08:53:26,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 08:53:26,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 19: [2022-11-27 08:53:26,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:53:26,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 08:53:26,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 28: [2022-11-27 08:53:26,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 08:53:26,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 29: [2022-11-27 08:53:26,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 08:53:26,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 08:53:26,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 20: [2022-11-27 08:53:26,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:53:26,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
20: [2022-11-27 08:53:26,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 3: [2022-11-27 08:53:26,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 20: [2022-11-27 08:53:26,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 3: [2022-11-27 08:53:26,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 5: [2022-11-27 08:53:26,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:53:26,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 08:53:26,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 21: [2022-11-27 08:53:26,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:53:26,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 08:53:26,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 1: [2022-11-27 08:53:26,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-27 08:53:26,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 08:53:26,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 26: [2022-11-27 08:53:26,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-27 08:53:26,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 08:53:26,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 0: [2022-11-27 08:53:26,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:53:26,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 08:53:26,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 10: [2022-11-27 08:53:26,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:53:26,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 08:53:26,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 17: [2022-11-27 08:53:26,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-27 08:53:26,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 08:53:26,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 18: [2022-11-27 08:53:26,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:53:26,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 08:53:26,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 9: [2022-11-27 08:53:26,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:53:26,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 08:53:26,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 8: [2022-11-27 08:53:26,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:53:26,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 08:53:26,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 23: [2022-11-27 08:53:26,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
23: [2022-11-27 08:53:26,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 08:53:26,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 10: [2022-11-27 08:53:26,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:53:26,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 08:53:26,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 2: [2022-11-27 08:53:26,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:53:26,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 08:53:26,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 14: [2022-11-27 08:53:26,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:53:26,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 08:53:26,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 12: [2022-11-27 08:53:26,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-27 08:53:26,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 08:53:26,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 6: [2022-11-27 08:53:26,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:53:26,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 08:53:26,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 28: [2022-11-27 08:53:26,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-27 08:53:26,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 08:53:26,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 30: [2022-11-27 08:53:26,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:53:26,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 08:53:26,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 27: [2022-11-27 08:53:26,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 
27: [2022-11-27 08:53:26,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 27: [2022-11-27 08:53:26,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 1: [2022-11-27 08:53:26,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:53:26,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 08:53:26,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 24: [2022-11-27 08:53:26,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 24: [2022-11-27 08:53:26,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 08:53:26,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 4: [2022-11-27 08:53:26,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:53:26,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 08:53:26,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 11: [2022-11-27 08:53:26,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-27 08:53:26,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 08:53:26,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 16: [2022-11-27 08:53:26,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 08:53:26,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 08:53:26,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 29: [2022-11-27 08:53:26,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 29: [2022-11-27 08:53:26,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 08:53:26,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 9: [2022-11-27 08:53:26,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:53:26,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 19: [2022-11-27 08:53:26,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 08:53:26,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 
9: [2022-11-27 08:53:26,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 7: [2022-11-27 08:53:26,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:53:26,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 7: [2022-11-27 08:53:26,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 08:53:26,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 3: [2022-11-27 08:53:26,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:53:26,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 20: [2022-11-27 08:53:26,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:53:26,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 21: [2022-11-27 08:53:26,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 3: [2022-11-27 08:53:26,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 
20: [2022-11-27 08:53:26,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 21: [2022-11-27 08:53:26,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 20: [2022-11-27 08:53:26,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 31: [2022-11-27 08:53:26,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:53:26,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:53:26,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 31: [2022-11-27 08:53:26,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 8: [2022-11-27 08:53:26,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 31: [2022-11-27 08:53:26,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 22: [2022-11-27 08:53:26,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-27 08:53:26,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 08:53:26,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 
5: [2022-11-27 08:53:26,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 26: [2022-11-27 08:53:26,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:53:26,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 14: [2022-11-27 08:53:26,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:53:26,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 26: [2022-11-27 08:53:26,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 08:53:26,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 14: [2022-11-27 08:53:26,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 08:53:26,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 15: [2022-11-27 08:53:26,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 23: [2022-11-27 08:53:26,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 
15: [2022-11-27 08:53:26,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 08:53:26,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 23: [2022-11-27 08:53:26,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 08:53:26,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 11: [2022-11-27 08:53:26,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:53:26,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 08:53:26,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 2: [2022-11-27 08:53:26,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:53:26,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 10: [2022-11-27 08:53:26,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:53:26,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 
10: [2022-11-27 08:53:26,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 08:53:26,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 0: [2022-11-27 08:53:26,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:53:26,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:53:26,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-27 08:53:26,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-27 08:53:26,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 4: [2022-11-27 08:53:26,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:53:26,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 08:53:26,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 
17: [2022-11-27 08:53:26,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 27: [2022-11-27 08:53:26,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 17: [2022-11-27 08:53:26,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 27: [2022-11-27 08:53:26,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 08:53:26,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 25: [2022-11-27 08:53:26,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:53:26,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 24: [2022-11-27 08:53:26,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:53:26,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 13: [2022-11-27 08:53:26,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
24: [2022-11-27 08:53:26,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 1: [2022-11-27 08:53:26,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:53:26,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 08:53:26,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 24: [2022-11-27 08:53:26,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 1: [2022-11-27 08:53:26,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 08:53:26,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 30: [2022-11-27 08:53:26,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 08:53:26,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 08:53:26,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 0: [2022-11-27 08:53:26,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 08:53:26,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 
21: [2022-11-27 08:53:26,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-27 08:53:26,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 08:53:26,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 28: [2022-11-27 08:53:26,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-27 08:53:26,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 08:53:26,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 12: [2022-11-27 08:53:26,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:53:26,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 08:53:26,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 6: [2022-11-27 08:53:26,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:53:26,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-27 08:53:26,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 29: [2022-11-27 08:53:26,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:53:26,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:53:26,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 08:53:26,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 08:53:26,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 7: [2022-11-27 08:53:26,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 6: [2022-11-27 08:53:26,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 29: [2022-11-27 08:53:26,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 22: [2022-11-27 08:53:26,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 29: [2022-11-27 08:53:26,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 
22: [2022-11-27 08:53:26,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 08:53:26,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 16: [2022-11-27 08:53:26,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-27 08:53:26,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 08:53:26,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 15: [2022-11-27 08:53:26,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:53:26,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 08:53:26,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 3: [2022-11-27 08:53:26,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:53:26,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 08:53:26,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 25: [2022-11-27 08:53:26,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
25: [2022-11-27 08:53:26,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 08:53:26,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 13: [2022-11-27 08:53:26,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:53:26,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 08:53:26,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 25: [2022-11-27 08:53:26,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 08:53:26,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step172000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 08:53:26,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step172000 is ready now! 
0: successfully saved checkpoint at iteration 172000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2595.00 31: iteration 172010/ 173500 | consumed samples: 44034560 | consumed tokens: 90182778880 | elapsed time per iteration (s): 1.05 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.908090E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.373 | TFLOPs: 14.78 | 31: iteration 172020/ 173500 | consumed samples: 44037120 | consumed tokens: 90188021760 | elapsed time per iteration (s): 0.78 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.923689E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.812 | TFLOPs: 19.89 | 31: iteration 172030/ 173500 | consumed samples: 44039680 | consumed tokens: 90193264640 | elapsed time per iteration (s): 0.79 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.918430E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.336 | TFLOPs: 19.68 | 31: iteration 172040/ 173500 | consumed samples: 44042240 | consumed tokens: 90198507520 | elapsed time per iteration (s): 0.82 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.893862E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.794 | TFLOPs: 18.98 | 31: iteration 172050/ 173500 | consumed samples: 44044800 | consumed tokens: 90203750400 | elapsed time per iteration (s): 0.81 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.918965E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.079 | TFLOPs: 19.18 | 31: iteration 172060/ 173500 | consumed samples: 44047360 | consumed tokens: 90208993280 | elapsed time per iteration (s): 
0.79 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.932498E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.612 | TFLOPs: 19.70 | 31: iteration 172070/ 173500 | consumed samples: 44049920 | consumed tokens: 90214236160 | elapsed time per iteration (s): 1.44 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.925413E+00 | grad norm: 0.212 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 177.729 | TFLOPs: 10.75 | 31: iteration 172080/ 173500 | consumed samples: 44052480 | consumed tokens: 90219479040 | elapsed time per iteration (s): 0.82 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.915238E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.046 | TFLOPs: 18.88 | 31: iteration 172090/ 173500 | consumed samples: 44055040 | consumed tokens: 90224721920 | elapsed time per iteration (s): 0.81 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.909793E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.678 | TFLOPs: 19.22 | 31: iteration 172100/ 173500 | consumed samples: 44057600 | consumed tokens: 90229964800 | elapsed time per iteration (s): 0.80 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.904195E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.194 | TFLOPs: 19.43 | 31: iteration 172110/ 173500 | consumed samples: 44060160 | consumed tokens: 90235207680 | elapsed time per iteration (s): 0.81 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.925795E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.652 | TFLOPs: 19.10 | 31: 
iteration 172120/ 173500 | consumed samples: 44062720 | consumed tokens: 90240450560 | elapsed time per iteration (s): 0.77 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.929168E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.944 | TFLOPs: 20.02 | 31: iteration 172130/ 173500 | consumed samples: 44065280 | consumed tokens: 90245693440 | elapsed time per iteration (s): 0.77 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.869738E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.265 | TFLOPs: 20.10 | 31: iteration 172140/ 173500 | consumed samples: 44067840 | consumed tokens: 90250936320 | elapsed time per iteration (s): 0.78 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.902067E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.488 | TFLOPs: 19.75 | 31: iteration 172150/ 173500 | consumed samples: 44070400 | consumed tokens: 90256179200 | elapsed time per iteration (s): 0.78 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.917875E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.022 | TFLOPs: 19.97 | 31: iteration 172160/ 173500 | consumed samples: 44072960 | consumed tokens: 90261422080 | elapsed time per iteration (s): 0.78 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.907863E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.214 | TFLOPs: 19.74 | 31: iteration 172170/ 173500 | consumed samples: 44075520 | consumed tokens: 90266664960 | elapsed time per iteration (s): 0.79 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.918136E+00 | grad norm: 0.191 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.232 | TFLOPs: 19.62 | 31: iteration 172180/ 173500 | consumed samples: 44078080 | consumed tokens: 90271907840 | elapsed time per iteration (s): 0.86 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.911643E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.496 | TFLOPs: 17.94 | 31: iteration 172190/ 173500 | consumed samples: 44080640 | consumed tokens: 90277150720 | elapsed time per iteration (s): 0.79 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.917602E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.315 | TFLOPs: 19.68 | 31: iteration 172200/ 173500 | consumed samples: 44083200 | consumed tokens: 90282393600 | elapsed time per iteration (s): 0.80 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.908276E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.995 | TFLOPs: 19.48 | 31: iteration 172210/ 173500 | consumed samples: 44085760 | consumed tokens: 90287636480 | elapsed time per iteration (s): 0.80 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.913712E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.411 | TFLOPs: 19.44 | 31: iteration 172220/ 173500 | consumed samples: 44088320 | consumed tokens: 90292879360 | elapsed time per iteration (s): 0.83 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.916290E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 307.261 | TFLOPs: 18.59 | 31: iteration 172230/ 173500 | consumed samples: 44090880 | consumed tokens: 90298122240 | elapsed time per iteration (s): 0.78 | 
learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.891904E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.990 | TFLOPs: 19.78 | 31: iteration 172240/ 173500 | consumed samples: 44093440 | consumed tokens: 90303365120 | elapsed time per iteration (s): 0.96 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.913222E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 266.491 | TFLOPs: 16.12 | 31: iteration 172250/ 173500 | consumed samples: 44096000 | consumed tokens: 90308608000 | elapsed time per iteration (s): 1.04 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.914689E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.459 | TFLOPs: 14.85 | 31: iteration 172260/ 173500 | consumed samples: 44098560 | consumed tokens: 90313850880 | elapsed time per iteration (s): 0.77 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.908197E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.443 | TFLOPs: 20.05 | 31: iteration 172270/ 173500 | consumed samples: 44101120 | consumed tokens: 90319093760 | elapsed time per iteration (s): 0.80 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.922236E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.252 | TFLOPs: 19.43 | 31: iteration 172280/ 173500 | consumed samples: 44103680 | consumed tokens: 90324336640 | elapsed time per iteration (s): 0.82 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.899404E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.336 | TFLOPs: 18.84 | 31: iteration 
172290/ 173500 | consumed samples: 44106240 | consumed tokens: 90329579520 | elapsed time per iteration (s): 0.83 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.898748E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.316 | TFLOPs: 18.65 | 31: iteration 172300/ 173500 | consumed samples: 44108800 | consumed tokens: 90334822400 | elapsed time per iteration (s): 0.77 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.879965E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.453 | TFLOPs: 19.99 | 31: iteration 172310/ 173500 | consumed samples: 44111360 | consumed tokens: 90340065280 | elapsed time per iteration (s): 0.82 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.888539E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.578 | TFLOPs: 18.97 | 31: iteration 172320/ 173500 | consumed samples: 44113920 | consumed tokens: 90345308160 | elapsed time per iteration (s): 1.04 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.869070E+00 | grad norm: 0.225 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.230 | TFLOPs: 14.90 | 31: iteration 172330/ 173500 | consumed samples: 44116480 | consumed tokens: 90350551040 | elapsed time per iteration (s): 0.82 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.903550E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 312.628 | TFLOPs: 18.91 | 31: iteration 172340/ 173500 | consumed samples: 44119040 | consumed tokens: 90355793920 | elapsed time per iteration (s): 0.81 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.892402E+00 | grad norm: 0.203 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.640 | TFLOPs: 19.03 | 31: iteration 172350/ 173500 | consumed samples: 44121600 | consumed tokens: 90361036800 | elapsed time per iteration (s): 0.80 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.897102E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.895 | TFLOPs: 19.29 | 31: iteration 172360/ 173500 | consumed samples: 44124160 | consumed tokens: 90366279680 | elapsed time per iteration (s): 1.12 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.896528E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.381 | TFLOPs: 13.88 | 31: iteration 172370/ 173500 | consumed samples: 44126720 | consumed tokens: 90371522560 | elapsed time per iteration (s): 0.82 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.905906E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.801 | TFLOPs: 18.98 | 31: iteration 172380/ 173500 | consumed samples: 44129280 | consumed tokens: 90376765440 | elapsed time per iteration (s): 0.83 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.896910E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.319 | TFLOPs: 18.71 | 31: iteration 172390/ 173500 | consumed samples: 44131840 | consumed tokens: 90382008320 | elapsed time per iteration (s): 0.84 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.918960E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 306.154 | TFLOPs: 18.52 | 31: iteration 172400/ 173500 | consumed samples: 44134400 | consumed tokens: 90387251200 | elapsed time per iteration (s): 0.84 | learning 
rate: 2.002E-05 | global batch size: 256 | lm loss: 1.924533E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.517 | TFLOPs: 18.48 | 31: iteration 172410/ 173500 | consumed samples: 44136960 | consumed tokens: 90392494080 | elapsed time per iteration (s): 0.82 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.905423E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 313.116 | TFLOPs: 18.94 | 31: iteration 172420/ 173500 | consumed samples: 44139520 | consumed tokens: 90397736960 | elapsed time per iteration (s): 0.97 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.910399E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 262.685 | TFLOPs: 15.89 | 31: iteration 172430/ 173500 | consumed samples: 44142080 | consumed tokens: 90402979840 | elapsed time per iteration (s): 0.86 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.900472E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 296.336 | TFLOPs: 17.93 | 31: iteration 172440/ 173500 | consumed samples: 44144640 | consumed tokens: 90408222720 | elapsed time per iteration (s): 0.78 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.914851E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.810 | TFLOPs: 19.89 | 31: iteration 172450/ 173500 | consumed samples: 44147200 | consumed tokens: 90413465600 | elapsed time per iteration (s): 0.81 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.904688E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.881 | TFLOPs: 19.23 | 31: iteration 172460/ 
173500 | consumed samples: 44149760 | consumed tokens: 90418708480 | elapsed time per iteration (s): 0.72 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.937159E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 353.569 | TFLOPs: 21.39 | 31: iteration 172470/ 173500 | consumed samples: 44152320 | consumed tokens: 90423951360 | elapsed time per iteration (s): 0.77 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.903730E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.570 | TFLOPs: 20.00 | 31: iteration 172480/ 173500 | consumed samples: 44154880 | consumed tokens: 90429194240 | elapsed time per iteration (s): 0.78 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.917243E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.821 | TFLOPs: 19.77 | 31: iteration 172490/ 173500 | consumed samples: 44157440 | consumed tokens: 90434437120 | elapsed time per iteration (s): 0.76 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.929945E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.719 | TFLOPs: 20.37 | 31: iteration 172500/ 173500 | consumed samples: 44160000 | consumed tokens: 90439680000 | elapsed time per iteration (s): 0.77 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.906175E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.954 | TFLOPs: 20.08 | 31: iteration 172510/ 173500 | consumed samples: 44162560 | consumed tokens: 90444922880 | elapsed time per iteration (s): 0.74 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.924328E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 343.817 | TFLOPs: 20.80 | 31: iteration 172520/ 173500 | consumed samples: 44165120 | consumed tokens: 90450165760 | elapsed time per iteration (s): 0.78 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.911958E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.351 | TFLOPs: 19.74 | 31: iteration 172530/ 173500 | consumed samples: 44167680 | consumed tokens: 90455408640 | elapsed time per iteration (s): 0.77 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.889545E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.726 | TFLOPs: 20.13 | 31: iteration 172540/ 173500 | consumed samples: 44170240 | consumed tokens: 90460651520 | elapsed time per iteration (s): 0.83 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.912024E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 308.967 | TFLOPs: 18.69 | 31: iteration 172550/ 173500 | consumed samples: 44172800 | consumed tokens: 90465894400 | elapsed time per iteration (s): 0.75 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.913962E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.697 | TFLOPs: 20.67 | 31: iteration 172560/ 173500 | consumed samples: 44175360 | consumed tokens: 90471137280 | elapsed time per iteration (s): 0.74 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.902633E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 345.817 | TFLOPs: 20.92 | 31: iteration 172570/ 173500 | consumed samples: 44177920 | consumed tokens: 90476380160 | elapsed time per iteration (s): 1.04 | learning rate: 
2.001E-05 | global batch size: 256 | lm loss: 1.912594E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.775 | TFLOPs: 14.87 | 31: iteration 172580/ 173500 | consumed samples: 44180480 | consumed tokens: 90481623040 | elapsed time per iteration (s): 0.79 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.939221E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.495 | TFLOPs: 19.51 | 31: iteration 172590/ 173500 | consumed samples: 44183040 | consumed tokens: 90486865920 | elapsed time per iteration (s): 0.75 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.949126E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.949 | TFLOPs: 20.57 | 31: iteration 172600/ 173500 | consumed samples: 44185600 | consumed tokens: 90492108800 | elapsed time per iteration (s): 0.88 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.903891E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.012 | TFLOPs: 17.67 | 31: iteration 172610/ 173500 | consumed samples: 44188160 | consumed tokens: 90497351680 | elapsed time per iteration (s): 0.72 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.906013E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 356.529 | TFLOPs: 21.57 | 31: iteration 172620/ 173500 | consumed samples: 44190720 | consumed tokens: 90502594560 | elapsed time per iteration (s): 0.76 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.919422E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.045 | TFLOPs: 20.33 | 31: iteration 172630/ 173500 | 
consumed samples: 44193280 | consumed tokens: 90507837440 | elapsed time per iteration (s): 0.79 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.901743E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.408 | TFLOPs: 19.69 | 31: iteration 172640/ 173500 | consumed samples: 44195840 | consumed tokens: 90513080320 | elapsed time per iteration (s): 0.73 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.887107E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.363 | TFLOPs: 21.14 | 31: iteration 172650/ 173500 | consumed samples: 44198400 | consumed tokens: 90518323200 | elapsed time per iteration (s): 0.78 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.898201E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 326.410 | TFLOPs: 19.75 | 31: iteration 172660/ 173500 | consumed samples: 44200960 | consumed tokens: 90523566080 | elapsed time per iteration (s): 0.76 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.902734E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 335.400 | TFLOPs: 20.29 | 31: iteration 172670/ 173500 | consumed samples: 44203520 | consumed tokens: 90528808960 | elapsed time per iteration (s): 0.73 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.901566E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 348.335 | TFLOPs: 21.07 | 31: iteration 172680/ 173500 | consumed samples: 44206080 | consumed tokens: 90534051840 | elapsed time per iteration (s): 0.89 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.922120E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 288.167 | TFLOPs: 17.43 | 31: iteration 172690/ 173500 | consumed samples: 44208640 | consumed tokens: 90539294720 | elapsed time per iteration (s): 0.77 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.899119E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 334.564 | TFLOPs: 20.24 | 31: iteration 172700/ 173500 | consumed samples: 44211200 | consumed tokens: 90544537600 | elapsed time per iteration (s): 0.90 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.890503E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.527 | TFLOPs: 17.27 | 31: iteration 172710/ 173500 | consumed samples: 44213760 | consumed tokens: 90549780480 | elapsed time per iteration (s): 0.88 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.892782E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.872 | TFLOPs: 17.66 | 31: iteration 172720/ 173500 | consumed samples: 44216320 | consumed tokens: 90555023360 | elapsed time per iteration (s): 0.78 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.907452E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.888 | TFLOPs: 19.84 | 31: iteration 172730/ 173500 | consumed samples: 44218880 | consumed tokens: 90560266240 | elapsed time per iteration (s): 0.77 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.911002E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 331.230 | TFLOPs: 20.04 | 31: iteration 172740/ 173500 | consumed samples: 44221440 | consumed tokens: 90565509120 | elapsed time per iteration (s): 0.96 | learning rate: 
2.001E-05 | global batch size: 256 | lm loss: 1.895216E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 267.799 | TFLOPs: 16.20 | 31: iteration 172750/ 173500 | consumed samples: 44224000 | consumed tokens: 90570752000 | elapsed time per iteration (s): 0.88 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.901863E+00 | grad norm: 0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 290.535 | TFLOPs: 17.58 | 31: iteration 172760/ 173500 | consumed samples: 44226560 | consumed tokens: 90575994880 | elapsed time per iteration (s): 0.88 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.927585E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.223 | TFLOPs: 17.62 | 31: iteration 172770/ 173500 | consumed samples: 44229120 | consumed tokens: 90581237760 | elapsed time per iteration (s): 0.88 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.918182E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 292.053 | TFLOPs: 17.67 | 31: iteration 172780/ 173500 | consumed samples: 44231680 | consumed tokens: 90586480640 | elapsed time per iteration (s): 0.90 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.903862E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.724 | TFLOPs: 17.29 | 31: iteration 172790/ 173500 | consumed samples: 44234240 | consumed tokens: 90591723520 | elapsed time per iteration (s): 0.88 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.892035E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 291.391 | TFLOPs: 17.63 | 31: iteration 172800/ 173500 | 
consumed samples: 44236800 | consumed tokens: 90596966400 | elapsed time per iteration (s): 0.89 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.918371E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 286.412 | TFLOPs: 17.33 | 31: iteration 172810/ 173500 | consumed samples: 44239360 | consumed tokens: 90602209280 | elapsed time per iteration (s): 0.86 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.891486E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 297.944 | TFLOPs: 18.02 | 31: iteration 172820/ 173500 | consumed samples: 44241920 | consumed tokens: 90607452160 | elapsed time per iteration (s): 0.79 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.900096E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.134 | TFLOPs: 19.67 | 31: iteration 172830/ 173500 | consumed samples: 44244480 | consumed tokens: 90612695040 | elapsed time per iteration (s): 0.78 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.921468E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.266 | TFLOPs: 19.86 | 31: iteration 172840/ 173500 | consumed samples: 44247040 | consumed tokens: 90617937920 | elapsed time per iteration (s): 0.81 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.888322E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.382 | TFLOPs: 19.14 | 31: iteration 172850/ 173500 | consumed samples: 44249600 | consumed tokens: 90623180800 | elapsed time per iteration (s): 0.74 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.936643E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 346.759 | TFLOPs: 20.98 | 31: iteration 172860/ 173500 | consumed samples: 44252160 | consumed tokens: 90628423680 | elapsed time per iteration (s): 0.81 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.893483E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.793 | TFLOPs: 19.10 | 31: iteration 172870/ 173500 | consumed samples: 44254720 | consumed tokens: 90633666560 | elapsed time per iteration (s): 0.80 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.914665E+00 | grad norm: 0.212 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.505 | TFLOPs: 19.33 | 31: iteration 172880/ 173500 | consumed samples: 44257280 | consumed tokens: 90638909440 | elapsed time per iteration (s): 0.81 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.891232E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.713 | TFLOPs: 19.22 | 31: iteration 172890/ 173500 | consumed samples: 44259840 | consumed tokens: 90644152320 | elapsed time per iteration (s): 0.83 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.888502E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 309.369 | TFLOPs: 18.72 | 31: iteration 172900/ 173500 | consumed samples: 44262400 | consumed tokens: 90649395200 | elapsed time per iteration (s): 0.87 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.895229E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 293.198 | TFLOPs: 17.74 | 31: iteration 172910/ 173500 | consumed samples: 44264960 | consumed tokens: 90654638080 | elapsed time per iteration (s): 0.80 | learning rate: 
2.001E-05 | global batch size: 256 | lm loss: 1.897193E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.829 | TFLOPs: 19.47 | 31: iteration 172920/ 173500 | consumed samples: 44267520 | consumed tokens: 90659880960 | elapsed time per iteration (s): 0.79 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.925870E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.242 | TFLOPs: 19.49 | 31: iteration 172930/ 173500 | consumed samples: 44270080 | consumed tokens: 90665123840 | elapsed time per iteration (s): 0.77 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.904150E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.323 | TFLOPs: 19.98 | 31: iteration 172940/ 173500 | consumed samples: 44272640 | consumed tokens: 90670366720 | elapsed time per iteration (s): 0.80 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.897655E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.308 | TFLOPs: 19.38 | 31: iteration 172950/ 173500 | consumed samples: 44275200 | consumed tokens: 90675609600 | elapsed time per iteration (s): 0.77 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.918740E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 330.824 | TFLOPs: 20.01 | 31: iteration 172960/ 173500 | consumed samples: 44277760 | consumed tokens: 90680852480 | elapsed time per iteration (s): 0.76 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.920543E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.349 | TFLOPs: 20.47 | 31: iteration 172970/ 173500 | 
consumed samples: 44280320 | consumed tokens: 90686095360 | elapsed time per iteration (s): 0.78 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.916235E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 328.760 | TFLOPs: 19.89 | 31: iteration 172980/ 173500 | consumed samples: 44282880 | consumed tokens: 90691338240 | elapsed time per iteration (s): 0.76 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.884861E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 336.289 | TFLOPs: 20.34 | 31: iteration 172990/ 173500 | consumed samples: 44285440 | consumed tokens: 90696581120 | elapsed time per iteration (s): 0.77 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.897220E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.184 | TFLOPs: 20.10 | 31: iteration 173000/ 173500 | consumed samples: 44288000 | consumed tokens: 90701824000 | elapsed time per iteration (s): 0.90 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.904908E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 285.501 | TFLOPs: 17.27 | 31: -------------------------------------------------------------------------------------------- 31: valid loss at iteration 173000 | lm loss value: 1.817153E+00 | lm loss PPL: 6.154312E+00 | 31: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 173000 to checkpoints_1b1long 0: [2022-11-27 09:07:09,698] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step173000 is begin to save! 
0: [2022-11-27 09:07:09,717] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_01-model_00-model_states.pt... 0: [2022-11-27 09:07:09,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_01-model_00-model_states.pt. 0: [2022-11-27 09:07:09,937] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_03-model_00-model_states.pt... 0: [2022-11-27 09:07:10,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_03-model_00-model_states.pt. 0: [2022-11-27 09:07:10,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_04-model_00-model_states.pt... 0: [2022-11-27 09:07:10,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_04-model_00-model_states.pt. 0: [2022-11-27 09:07:10,105] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_05-model_00-model_states.pt... 0: [2022-11-27 09:07:10,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_05-model_00-model_states.pt. 0: [2022-11-27 09:07:10,189] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_06-model_00-model_states.pt... 0: [2022-11-27 09:07:10,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_06-model_00-model_states.pt. 0: [2022-11-27 09:07:10,266] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_07-model_00-model_states.pt... 0: [2022-11-27 09:07:10,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_07-model_00-model_states.pt. 
0: [2022-11-27 09:07:10,345] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_08-model_00-model_states.pt... 0: [2022-11-27 09:07:10,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_08-model_00-model_states.pt. 0: [2022-11-27 09:07:10,424] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_09-model_00-model_states.pt... 0: [2022-11-27 09:07:10,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_09-model_00-model_states.pt. 0: [2022-11-27 09:07:10,502] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_10-model_00-model_states.pt... 0: [2022-11-27 09:07:10,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_10-model_00-model_states.pt. 0: [2022-11-27 09:07:10,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_11-model_00-model_states.pt... 0: [2022-11-27 09:07:10,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_11-model_00-model_states.pt. 0: [2022-11-27 09:07:10,654] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_12-model_00-model_states.pt... 0: [2022-11-27 09:07:10,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_12-model_00-model_states.pt. 0: [2022-11-27 09:07:10,730] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_13-model_00-model_states.pt... 0: [2022-11-27 09:07:10,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_13-model_00-model_states.pt. 
0: [2022-11-27 09:07:10,806] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_14-model_00-model_states.pt... 0: [2022-11-27 09:07:10,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_14-model_00-model_states.pt. 0: [2022-11-27 09:07:10,881] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_15-model_00-model_states.pt... 0: [2022-11-27 09:07:10,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_15-model_00-model_states.pt. 0: [2022-11-27 09:07:10,954] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_16-model_00-model_states.pt... 0: [2022-11-27 09:07:11,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_16-model_00-model_states.pt. 0: [2022-11-27 09:07:11,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_17-model_00-model_states.pt... 0: [2022-11-27 09:07:11,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_17-model_00-model_states.pt. 0: [2022-11-27 09:07:11,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_18-model_00-model_states.pt... 0: [2022-11-27 09:07:11,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_18-model_00-model_states.pt. 0: [2022-11-27 09:07:11,185] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_19-model_00-model_states.pt... 0: [2022-11-27 09:07:11,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_19-model_00-model_states.pt. 
0: [2022-11-27 09:07:11,260] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_20-model_00-model_states.pt... 0: [2022-11-27 09:07:11,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_20-model_00-model_states.pt. 0: [2022-11-27 09:07:11,345] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_21-model_00-model_states.pt... 0: [2022-11-27 09:07:11,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_21-model_00-model_states.pt. 0: [2022-11-27 09:07:11,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_22-model_00-model_states.pt... 0: [2022-11-27 09:07:11,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_22-model_00-model_states.pt. 0: [2022-11-27 09:07:11,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_23-model_00-model_states.pt... 0: [2022-11-27 09:07:11,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_23-model_00-model_states.pt. 0: [2022-11-27 09:07:11,571] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_24-model_00-model_states.pt... 0: [2022-11-27 09:07:11,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_24-model_00-model_states.pt. 0: [2022-11-27 09:07:11,648] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_25-model_00-model_states.pt... 0: [2022-11-27 09:07:11,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_25-model_00-model_states.pt. 
0: [2022-11-27 09:07:11,722] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_26-model_00-model_states.pt... 0: [2022-11-27 09:07:11,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_26-model_00-model_states.pt. 0: [2022-11-27 09:07:11,794] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_27-model_00-model_states.pt... 0: [2022-11-27 09:07:11,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_27-model_00-model_states.pt. 0: [2022-11-27 09:07:11,870] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_28-model_00-model_states.pt... 0: [2022-11-27 09:07:11,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_28-model_00-model_states.pt. 0: [2022-11-27 09:07:11,942] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/layer_30-model_00-model_states.pt... 0: [2022-11-27 09:07:11,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/layer_30-model_00-model_states.pt. 0: [2022-11-27 09:07:11,946] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step173000/mp_rank_00_model_states.pt 0: [2022-11-27 09:07:11,946] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/mp_rank_00_model_states.pt... 0: [2022-11-27 09:07:11,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/mp_rank_00_model_states.pt. 0: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 
6: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
4: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 
16: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 2: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
15: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 20: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 25: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 25: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 
23: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 28: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 24: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 24: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
14: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 31: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 31: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 31: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 31: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 29: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 29: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 29: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 
22: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 22: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 21: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 
18: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 26: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 26: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 27: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 
27: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
7: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
1: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 16: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 
12: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 20: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 20: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 20: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 20: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 
25: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 23: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 23: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 31: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 
31: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 31: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 29: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 22: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 30: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 21: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 
18: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 18: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 19: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 27: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
6: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 
1: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 16: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 20: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 
25: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 23: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 23: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 28: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 24: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 29: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 
22: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 17: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 18: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 26: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 26: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 19: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 
6: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
15: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 25: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 28: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 29: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 22: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 22: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 
30: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 17: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 21: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 19: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-27 09:07:12,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
3: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 25: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 28: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 24: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 30: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 17: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 17: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 21: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 
13: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:07:12,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:07:12,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:07:12,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 09:07:12,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 20: [2022-11-27 09:07:12,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-27 09:07:12,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 09:07:12,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 26: [2022-11-27 09:07:12,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 26: [2022-11-27 09:07:12,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 09:07:12,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 
8: [2022-11-27 09:07:12,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:07:12,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 09:07:12,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 31: [2022-11-27 09:07:12,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 31: [2022-11-27 09:07:12,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 31: [2022-11-27 09:07:12,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 13: [2022-11-27 09:07:12,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 29: [2022-11-27 09:07:12,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:07:12,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 29: [2022-11-27 09:07:12,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 13: [2022-11-27 09:07:12,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 29: [2022-11-27 09:07:12,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 
3: [2022-11-27 09:07:12,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:07:12,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 15: [2022-11-27 09:07:12,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 3: [2022-11-27 09:07:12,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 12: [2022-11-27 09:07:12,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 15: [2022-11-27 09:07:12,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 3: [2022-11-27 09:07:12,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 12: [2022-11-27 09:07:12,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 15: [2022-11-27 09:07:12,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 21: [2022-11-27 09:07:12,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-27 09:07:12,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 09:07:12,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 
25: [2022-11-27 09:07:12,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-27 09:07:12,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 25: [2022-11-27 09:07:12,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 29: [2022-11-27 09:07:12,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 29: [2022-11-27 09:07:12,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 09:07:12,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 20: [2022-11-27 09:07:12,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 20: [2022-11-27 09:07:12,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 09:07:12,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 7: [2022-11-27 09:07:12,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:07:12,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 09:07:12,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 
27: [2022-11-27 09:07:12,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-27 09:07:12,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 09:07:12,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 26: [2022-11-27 09:07:12,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 24: [2022-11-27 09:07:12,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 26: [2022-11-27 09:07:12,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 24: [2022-11-27 09:07:12,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 3: [2022-11-27 09:07:12,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 26: [2022-11-27 09:07:12,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 3: [2022-11-27 09:07:12,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 09:07:12,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 24: [2022-11-27 09:07:12,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 
31: [2022-11-27 09:07:12,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 31: [2022-11-27 09:07:12,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 31: [2022-11-27 09:07:12,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 5: [2022-11-27 09:07:12,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:07:12,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 09:07:12,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 23: [2022-11-27 09:07:12,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 23: [2022-11-27 09:07:12,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 21: [2022-11-27 09:07:12,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 
23: [2022-11-27 09:07:12,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 23: [2022-11-27 09:07:12,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 21: [2022-11-27 09:07:12,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 23: [2022-11-27 09:07:12,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 23: [2022-11-27 09:07:12,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 28: [2022-11-27 09:07:12,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 21: [2022-11-27 09:07:12,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 5: [2022-11-27 09:07:12,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:07:12,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 9: [2022-11-27 09:07:12,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:07:12,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 
9: [2022-11-27 09:07:12,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 09:07:12,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 21: [2022-11-27 09:07:12,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 09:07:12,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 09:07:12,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 18: [2022-11-27 09:07:12,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-27 09:07:12,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 09:07:12,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 8: [2022-11-27 09:07:12,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 30: [2022-11-27 09:07:12,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 
8: [2022-11-27 09:07:12,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 30: [2022-11-27 09:07:12,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 30: [2022-11-27 09:07:12,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 8: [2022-11-27 09:07:12,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 30: [2022-11-27 09:07:12,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-27 09:07:12,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 31: [2022-11-27 09:07:12,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-27 09:07:12,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 09:07:12,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 30: [2022-11-27 09:07:12,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 12: [2022-11-27 09:07:12,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-27 09:07:12,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 09:07:12,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 18: [2022-11-27 09:07:12,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 18: [2022-11-27 09:07:12,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 18: [2022-11-27 09:07:12,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 26: [2022-11-27 09:07:12,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-27 09:07:12,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 09:07:12,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 15: [2022-11-27 09:07:12,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 09:07:12,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 09:07:12,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 24: [2022-11-27 09:07:12,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 
24: [2022-11-27 09:07:12,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 09:07:12,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 16: [2022-11-27 09:07:12,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 16: [2022-11-27 09:07:12,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 16: [2022-11-27 09:07:12,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 21: [2022-11-27 09:07:12,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-27 09:07:12,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 09:07:12,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 28: [2022-11-27 09:07:12,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 09:07:12,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 28: [2022-11-27 09:07:12,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
28: [2022-11-27 09:07:12,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 09:07:12,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 25: [2022-11-27 09:07:12,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-27 09:07:12,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 13: [2022-11-27 09:07:12,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 25: [2022-11-27 09:07:12,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 13: [2022-11-27 09:07:12,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 17: [2022-11-27 09:07:12,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 17: [2022-11-27 09:07:12,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:07:12,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 
17: [2022-11-27 09:07:12,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 17: [2022-11-27 09:07:12,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 09:07:12,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 17: [2022-11-27 09:07:12,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 29: [2022-11-27 09:07:12,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 29: [2022-11-27 09:07:12,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 09:07:12,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 25: [2022-11-27 09:07:12,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 25: [2022-11-27 09:07:12,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 25: [2022-11-27 09:07:12,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 13: [2022-11-27 09:07:12,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 3: [2022-11-27 09:07:12,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
4: [2022-11-27 09:07:12,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:07:12,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:07:12,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 4: [2022-11-27 09:07:12,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 09:07:12,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 13: [2022-11-27 09:07:12,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 3: [2022-11-27 09:07:12,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 4: [2022-11-27 09:07:12,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 4: [2022-11-27 09:07:12,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 3: [2022-11-27 09:07:12,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 0: [2022-11-27 09:07:12,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 11: [2022-11-27 09:07:12,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-27 09:07:12,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 09:07:12,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 09:07:12,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 09:07:12,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 11: [2022-11-27 09:07:12,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 20: [2022-11-27 09:07:12,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 28: [2022-11-27 09:07:12,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 20: [2022-11-27 09:07:12,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 09:07:12,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 28: [2022-11-27 09:07:12,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 28: [2022-11-27 09:07:12,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 20: [2022-11-27 09:07:12,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 
20: [2022-11-27 09:07:12,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 09:07:12,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 24: [2022-11-27 09:07:12,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 24: [2022-11-27 09:07:12,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 09:07:12,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 16: [2022-11-27 09:07:12,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-27 09:07:12,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 09:07:12,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 22: [2022-11-27 09:07:12,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 09:07:12,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 31: [2022-11-27 09:07:12,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 22: [2022-11-27 09:07:12,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 
18: [2022-11-27 09:07:12,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 27: [2022-11-27 09:07:12,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 25: [2022-11-27 09:07:12,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 31: [2022-11-27 09:07:12,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 25: [2022-11-27 09:07:12,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 18: [2022-11-27 09:07:12,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 27: [2022-11-27 09:07:12,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 18: [2022-11-27 09:07:12,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 25: [2022-11-27 09:07:12,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 27: [2022-11-27 09:07:12,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 31: [2022-11-27 09:07:12,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 27: [2022-11-27 09:07:12,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 
27: [2022-11-27 09:07:12,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 09:07:12,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 9: [2022-11-27 09:07:12,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:07:12,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 09:07:12,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 13: [2022-11-27 09:07:12,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:07:12,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 09:07:12,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 29: [2022-11-27 09:07:12,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 09:07:12,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 09:07:12,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 23: [2022-11-27 09:07:12,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 
23: [2022-11-27 09:07:12,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 5: [2022-11-27 09:07:12,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:07:12,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 09:07:12,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 23: [2022-11-27 09:07:12,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 2: [2022-11-27 09:07:12,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 09:07:12,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 09:07:12,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 15: [2022-11-27 09:07:12,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 09:07:12,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 09:07:12,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 30: [2022-11-27 09:07:12,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 
30: [2022-11-27 09:07:12,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 09:07:12,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 09:07:12,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 09:07:12,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 30: [2022-11-27 09:07:12,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 16: [2022-11-27 09:07:12,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 0: [2022-11-27 09:07:12,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 16: [2022-11-27 09:07:12,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 09:07:12,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 0: [2022-11-27 09:07:12,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 09:07:12,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 0: [2022-11-27 09:07:12,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-27 09:07:12,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 09:07:12,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 12: [2022-11-27 09:07:12,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 22: [2022-11-27 09:07:12,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:07:12,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 22: [2022-11-27 09:07:12,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 12: [2022-11-27 09:07:12,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 22: [2022-11-27 09:07:12,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 7: [2022-11-27 09:07:12,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:07:12,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 09:07:12,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 14: [2022-11-27 09:07:12,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-27 09:07:12,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 09:07:12,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 09:07:12,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 09:07:12,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 09:07:12,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 09:07:12,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 14: [2022-11-27 09:07:12,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 14: [2022-11-27 09:07:12,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 18: [2022-11-27 09:07:12,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-27 09:07:12,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 09:07:12,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 11: [2022-11-27 09:07:12,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
3: [2022-11-27 09:07:12,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 11: [2022-11-27 09:07:12,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 09:07:12,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 7: [2022-11-27 09:07:12,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 3: [2022-11-27 09:07:12,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 7: [2022-11-27 09:07:12,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 3: [2022-11-27 09:07:12,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 7: [2022-11-27 09:07:12,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:07:12,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 7: [2022-11-27 09:07:12,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 09:07:12,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 11: [2022-11-27 09:07:12,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-27 09:07:12,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 22: [2022-11-27 09:07:12,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:07:12,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 11: [2022-11-27 09:07:12,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 22: [2022-11-27 09:07:12,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 5: [2022-11-27 09:07:12,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 22: [2022-11-27 09:07:12,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 5: [2022-11-27 09:07:12,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 8: [2022-11-27 09:07:12,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:07:12,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 09:07:12,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 19: [2022-11-27 09:07:12,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 
2: [2022-11-27 09:07:12,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 15: [2022-11-27 09:07:12,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 19: [2022-11-27 09:07:12,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 09:07:12,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 2: [2022-11-27 09:07:12,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 15: [2022-11-27 09:07:12,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 09:07:12,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 19: [2022-11-27 09:07:12,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 2: [2022-11-27 09:07:12,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 19: [2022-11-27 09:07:12,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 19: [2022-11-27 09:07:12,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 8: [2022-11-27 09:07:12,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
24: [2022-11-27 09:07:12,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:07:12,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 24: [2022-11-27 09:07:12,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 8: [2022-11-27 09:07:12,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 24: [2022-11-27 09:07:12,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 28: [2022-11-27 09:07:12,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:07:12,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:07:12,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 09:07:12,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 23: [2022-11-27 09:07:12,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-27 09:07:12,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 09:07:12,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 
17: [2022-11-27 09:07:12,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 17: [2022-11-27 09:07:12,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 17: [2022-11-27 09:07:12,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 17: [2022-11-27 09:07:12,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-27 09:07:12,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 09:07:12,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 2: [2022-11-27 09:07:12,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 09:07:12,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 09:07:12,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 9: [2022-11-27 09:07:12,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:07:12,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 09:07:12,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 
1: [2022-11-27 09:07:12,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 09:07:12,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 09:07:12,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 28: [2022-11-27 09:07:12,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 09:07:12,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 19: [2022-11-27 09:07:12,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-27 09:07:12,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 09:07:12,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 1: [2022-11-27 09:07:12,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 09:07:12,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 09:07:12,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 09:07:12,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 
1: [2022-11-27 09:07:12,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 09:07:12,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 27: [2022-11-27 09:07:12,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-27 09:07:12,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 09:07:12,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 12: [2022-11-27 09:07:12,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:07:12,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 09:07:12,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 19: [2022-11-27 09:07:12,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-27 09:07:12,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-27 09:07:12,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 26: [2022-11-27 09:07:12,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 
26: [2022-11-27 09:07:12,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 09:07:12,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 10: [2022-11-27 09:07:12,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:07:12,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 19: [2022-11-27 09:07:12,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 19: [2022-11-27 09:07:12,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 10: [2022-11-27 09:07:12,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:07:12,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 19: [2022-11-27 09:07:12,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 
10: [2022-11-27 09:07:12,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 09:07:12,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 09:07:12,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 10: [2022-11-27 09:07:12,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 10: [2022-11-27 09:07:12,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 09:07:12,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 09:07:12,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 10: [2022-11-27 09:07:12,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 0: [2022-11-27 09:07:12,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 09:07:12,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 09:07:12,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 0: [2022-11-27 09:07:12,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-27 09:07:12,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 09:07:12,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 6: [2022-11-27 09:07:12,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:07:12,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:07:12,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:07:12,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:07:12,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:07:12,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 09:07:12,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 09:07:12,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 6: [2022-11-27 09:07:12,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 09:07:12,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 
6: [2022-11-27 09:07:12,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 09:07:12,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 09:07:12,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 6: [2022-11-27 09:07:12,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 6: [2022-11-27 09:07:12,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 0: [2022-11-27 09:07:12,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 09:07:12,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 29: [2022-11-27 09:07:12,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-27 09:07:12,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 09:07:12,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 31: [2022-11-27 09:07:12,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-27 09:07:12,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 09:07:12,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 11: [2022-11-27 09:07:12,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 09:07:12,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 09:07:12,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 17: [2022-11-27 09:07:12,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-27 09:07:12,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 09:07:12,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 26: [2022-11-27 09:07:12,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-27 09:07:12,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 09:07:12,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 20: [2022-11-27 09:07:12,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 
20: [2022-11-27 09:07:12,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 09:07:12,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 10: [2022-11-27 09:07:12,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:07:12,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 09:07:12,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 21: [2022-11-27 09:07:12,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 21: [2022-11-27 09:07:12,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 09:07:12,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 24: [2022-11-27 09:07:12,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 30: [2022-11-27 09:07:12,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 
24: [2022-11-27 09:07:12,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 30: [2022-11-27 09:07:12,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 24: [2022-11-27 09:07:12,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 30: [2022-11-27 09:07:12,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 13: [2022-11-27 09:07:12,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:07:12,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 09:07:12,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 9: [2022-11-27 09:07:12,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:07:12,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 09:07:12,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 18: [2022-11-27 09:07:12,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-27 09:07:12,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 09:07:12,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 7: [2022-11-27 09:07:12,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:07:12,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 09:07:12,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 3: [2022-11-27 09:07:12,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 09:07:12,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 09:07:12,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 4: [2022-11-27 09:07:12,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:07:12,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 09:07:12,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 14: [2022-11-27 09:07:12,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-27 09:07:12,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 09:07:12,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 5: [2022-11-27 09:07:12,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:07:12,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 09:07:12,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 8: [2022-11-27 09:07:12,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 23: [2022-11-27 09:07:12,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:07:12,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 23: [2022-11-27 09:07:12,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 8: [2022-11-27 09:07:12,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 23: [2022-11-27 09:07:12,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 28: [2022-11-27 09:07:12,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 
22: [2022-11-27 09:07:12,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-27 09:07:12,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 22: [2022-11-27 09:07:12,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 28: [2022-11-27 09:07:12,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 09:07:12,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 1: [2022-11-27 09:07:12,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 09:07:12,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 09:07:12,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 16: [2022-11-27 09:07:12,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 09:07:12,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 09:07:12,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 15: [2022-11-27 09:07:12,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-27 09:07:12,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 09:07:12,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 2: [2022-11-27 09:07:12,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 09:07:12,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 09:07:12,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 12: [2022-11-27 09:07:12,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:07:12,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 09:07:12,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 27: [2022-11-27 09:07:12,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 27: [2022-11-27 09:07:12,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 09:07:12,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 19: [2022-11-27 09:07:12,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-27 09:07:12,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 09:07:12,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 0: [2022-11-27 09:07:12,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 09:07:12,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 09:07:12,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 6: [2022-11-27 09:07:12,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:07:12,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 09:07:12,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 29: [2022-11-27 09:07:12,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 29: [2022-11-27 09:07:12,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 09:07:12,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 26: [2022-11-27 09:07:12,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 
26: [2022-11-27 09:07:12,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 09:07:12,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 11: [2022-11-27 09:07:12,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 09:07:12,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 09:07:12,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 25: [2022-11-27 09:07:12,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-27 09:07:12,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 09:07:12,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 10: [2022-11-27 09:07:12,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:07:12,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 09:07:12,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 31: [2022-11-27 09:07:12,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 
31: [2022-11-27 09:07:12,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 09:07:12,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 20: [2022-11-27 09:07:12,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-27 09:07:12,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 09:07:12,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 21: [2022-11-27 09:07:12,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 17: [2022-11-27 09:07:12,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 21: [2022-11-27 09:07:12,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 21: [2022-11-27 09:07:12,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 17: [2022-11-27 09:07:12,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 17: [2022-11-27 09:07:12,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 24: [2022-11-27 09:07:12,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 
24: [2022-11-27 09:07:12,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 24: [2022-11-27 09:07:12,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 30: [2022-11-27 09:07:12,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-27 09:07:12,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 09:07:12,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 13: [2022-11-27 09:07:12,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:07:12,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 09:07:12,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 18: [2022-11-27 09:07:12,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-27 09:07:12,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 09:07:12,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 9: [2022-11-27 09:07:12,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-27 09:07:12,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 09:07:12,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 14: [2022-11-27 09:07:12,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 09:07:12,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 09:07:12,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 1: [2022-11-27 09:07:12,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 09:07:12,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 09:07:12,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 8: [2022-11-27 09:07:12,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:07:12,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 09:07:12,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 4: [2022-11-27 09:07:12,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-27 09:07:12,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 09:07:12,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 5: [2022-11-27 09:07:12,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:07:12,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 09:07:12,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 28: [2022-11-27 09:07:12,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-27 09:07:12,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 25: [2022-11-27 09:07:12,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 28: [2022-11-27 09:07:12,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 25: [2022-11-27 09:07:12,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 09:07:12,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 3: [2022-11-27 09:07:12,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-27 09:07:12,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 09:07:12,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 12: [2022-11-27 09:07:12,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:07:12,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 09:07:12,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 22: [2022-11-27 09:07:12,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 22: [2022-11-27 09:07:12,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 22: [2022-11-27 09:07:12,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 7: [2022-11-27 09:07:12,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:07:12,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 09:07:12,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 16: [2022-11-27 09:07:12,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
16: [2022-11-27 09:07:12,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 16: [2022-11-27 09:07:12,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 2: [2022-11-27 09:07:12,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 09:07:12,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 09:07:12,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 15: [2022-11-27 09:07:12,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 19: [2022-11-27 09:07:12,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 15: [2022-11-27 09:07:12,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 19: [2022-11-27 09:07:12,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 15: [2022-11-27 09:07:12,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 19: [2022-11-27 09:07:12,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 23: [2022-11-27 09:07:12,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
23: [2022-11-27 09:07:12,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 09:07:12,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 0: [2022-11-27 09:07:12,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 09:07:12,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 09:07:12,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 6: [2022-11-27 09:07:12,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:07:12,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 09:07:12,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 27: [2022-11-27 09:07:12,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 27: [2022-11-27 09:07:12,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 27: [2022-11-27 09:07:12,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 31: [2022-11-27 09:07:12,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 
31: [2022-11-27 09:07:12,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 09:07:12,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 26: [2022-11-27 09:07:12,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 26: [2022-11-27 09:07:12,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 09:07:12,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 29: [2022-11-27 09:07:12,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-27 09:07:12,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 09:07:12,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 11: [2022-11-27 09:07:12,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 09:07:12,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 09:07:12,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 10: [2022-11-27 09:07:12,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-27 09:07:12,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 09:07:12,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 21: [2022-11-27 09:07:12,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 21: [2022-11-27 09:07:12,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 24: [2022-11-27 09:07:12,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 24: [2022-11-27 09:07:12,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 21: [2022-11-27 09:07:12,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 24: [2022-11-27 09:07:12,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 20: [2022-11-27 09:07:12,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 20: [2022-11-27 09:07:12,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 20: [2022-11-27 09:07:12,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 17: [2022-11-27 09:07:12,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 
17: [2022-11-27 09:07:12,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 09:07:12,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 18: [2022-11-27 09:07:12,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-27 09:07:12,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 09:07:12,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 30: [2022-11-27 09:07:12,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 09:07:12,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 09:07:12,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 13: [2022-11-27 09:07:12,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:07:12,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 09:07:12,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 25: [2022-11-27 09:07:12,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
25: [2022-11-27 09:07:12,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 25: [2022-11-27 09:07:12,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 3: [2022-11-27 09:07:12,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 09:07:12,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 09:07:12,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 9: [2022-11-27 09:07:12,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:07:12,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 09:07:12,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 23: [2022-11-27 09:07:12,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 14: [2022-11-27 09:07:12,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:07:12,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
14: [2022-11-27 09:07:12,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 23: [2022-11-27 09:07:12,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 7: [2022-11-27 09:07:12,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 23: [2022-11-27 09:07:12,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 14: [2022-11-27 09:07:12,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 7: [2022-11-27 09:07:12,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 5: [2022-11-27 09:07:12,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:07:12,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 09:07:12,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 8: [2022-11-27 09:07:12,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:07:12,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 09:07:12,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 
28: [2022-11-27 09:07:12,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 28: [2022-11-27 09:07:12,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 09:07:12,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 10: [2022-11-27 09:07:12,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:07:12,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 09:07:12,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 19: [2022-11-27 09:07:12,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 1: [2022-11-27 09:07:12,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 19: [2022-11-27 09:07:12,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 09:07:12,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 1: [2022-11-27 09:07:12,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 09:07:12,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 
12: [2022-11-27 09:07:12,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:07:12,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 09:07:12,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 26: [2022-11-27 09:07:12,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 26: [2022-11-27 09:07:12,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 26: [2022-11-27 09:07:12,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 0: [2022-11-27 09:07:12,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 09:07:12,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 09:07:12,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 3: [2022-11-27 09:07:12,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 09:07:12,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 09:07:12,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 
27: [2022-11-27 09:07:12,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 30: [2022-11-27 09:07:12,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 23: [2022-11-27 09:07:12,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 27: [2022-11-27 09:07:12,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 13: [2022-11-27 09:07:12,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 27: [2022-11-27 09:07:12,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 30: [2022-11-27 09:07:12,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 13: [2022-11-27 09:07:12,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 23: [2022-11-27 09:07:12,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 09:07:12,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 11: [2022-11-27 09:07:12,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
30: [2022-11-27 09:07:12,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 13: [2022-11-27 09:07:12,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 11: [2022-11-27 09:07:12,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 09:07:12,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 29: [2022-11-27 09:07:12,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 22: [2022-11-27 09:07:12,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 29: [2022-11-27 09:07:12,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 29: [2022-11-27 09:07:12,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 22: [2022-11-27 09:07:12,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 09:07:12,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 4: [2022-11-27 09:07:12,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-27 09:07:12,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 09:07:12,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 28: [2022-11-27 09:07:12,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 28: [2022-11-27 09:07:12,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 09:07:12,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 18: [2022-11-27 09:07:12,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 18: [2022-11-27 09:07:12,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 24: [2022-11-27 09:07:12,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 18: [2022-11-27 09:07:12,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 24: [2022-11-27 09:07:12,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 09:07:12,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 14: [2022-11-27 09:07:12,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-27 09:07:12,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 31: [2022-11-27 09:07:12,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 14: [2022-11-27 09:07:12,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 31: [2022-11-27 09:07:12,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 21: [2022-11-27 09:07:12,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 31: [2022-11-27 09:07:12,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 21: [2022-11-27 09:07:12,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 09:07:12,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 27: [2022-11-27 09:07:12,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-27 09:07:12,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 09:07:12,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 16: [2022-11-27 09:07:12,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 
20: [2022-11-27 09:07:12,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 16: [2022-11-27 09:07:12,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 16: [2022-11-27 09:07:12,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 20: [2022-11-27 09:07:12,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 20: [2022-11-27 09:07:12,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 15: [2022-11-27 09:07:12,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 09:07:12,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 09:07:12,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 9: [2022-11-27 09:07:12,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:07:12,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 09:07:12,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 6: [2022-11-27 09:07:12,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-27 09:07:12,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 09:07:12,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 25: [2022-11-27 09:07:12,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-27 09:07:12,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-27 09:07:12,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 17: [2022-11-27 09:07:12,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 17: [2022-11-27 09:07:12,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 09:07:12,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 12: [2022-11-27 09:07:12,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:07:12,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 09:07:12,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 4: [2022-11-27 09:07:12,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-27 09:07:12,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 09:07:12,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 1: [2022-11-27 09:07:12,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 09:07:12,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 09:07:12,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 7: [2022-11-27 09:07:12,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:07:12,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 09:07:12,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 8: [2022-11-27 09:07:12,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:07:12,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 09:07:12,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 5: [2022-11-27 09:07:12,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-27 09:07:12,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 09:07:12,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 2: [2022-11-27 09:07:12,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 09:07:12,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 09:07:12,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 15: [2022-11-27 09:07:12,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 09:07:12,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 09:07:12,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 22: [2022-11-27 09:07:12,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 09:07:12,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 09:07:12,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 1: [2022-11-27 09:07:12,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-27 09:07:12,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 09:07:12,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 16: [2022-11-27 09:07:12,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 09:07:12,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 09:07:12,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 14: [2022-11-27 09:07:12,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 09:07:12,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 09:07:12,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 4: [2022-11-27 09:07:12,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:07:12,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 09:07:12,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 22: [2022-11-27 09:07:12,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 
22: [2022-11-27 09:07:12,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 22: [2022-11-27 09:07:12,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 2: [2022-11-27 09:07:12,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 09:07:12,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 09:07:12,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 16: [2022-11-27 09:07:12,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 09:07:12,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 09:07:12,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 2: [2022-11-27 09:07:12,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 09:07:12,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 09:07:12,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173000 is ready now! 
0: successfully saved checkpoint at iteration 173000 to checkpoints_1b1long 31: time (ms) | save-checkpoint: 2578.48 31: iteration 173010/ 173500 | consumed samples: 44290560 | consumed tokens: 90707066880 | elapsed time per iteration (s): 1.62 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.904263E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 158.470 | TFLOPs: 9.59 | 31: iteration 173020/ 173500 | consumed samples: 44293120 | consumed tokens: 90712309760 | elapsed time per iteration (s): 0.92 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.928885E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 278.218 | TFLOPs: 16.83 | 31: iteration 173030/ 173500 | consumed samples: 44295680 | consumed tokens: 90717552640 | elapsed time per iteration (s): 0.78 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.911105E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.861 | TFLOPs: 19.83 | 31: iteration 173040/ 173500 | consumed samples: 44298240 | consumed tokens: 90722795520 | elapsed time per iteration (s): 0.74 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.920517E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 347.111 | TFLOPs: 21.00 | 31: iteration 173050/ 173500 | consumed samples: 44300800 | consumed tokens: 90728038400 | elapsed time per iteration (s): 0.78 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.909507E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.976 | TFLOPs: 19.84 | 31: iteration 173060/ 173500 | consumed samples: 44303360 | consumed tokens: 90733281280 | elapsed time per iteration (s): 
0.75 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.911711E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.605 | TFLOPs: 20.73 | 31: iteration 173070/ 173500 | consumed samples: 44305920 | consumed tokens: 90738524160 | elapsed time per iteration (s): 0.76 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.933150E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 337.182 | TFLOPs: 20.40 | 31: iteration 173080/ 173500 | consumed samples: 44308480 | consumed tokens: 90743767040 | elapsed time per iteration (s): 0.78 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.912162E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.796 | TFLOPs: 19.95 | 31: iteration 173090/ 173500 | consumed samples: 44311040 | consumed tokens: 90749009920 | elapsed time per iteration (s): 0.74 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.899204E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.486 | TFLOPs: 20.84 | 31: iteration 173100/ 173500 | consumed samples: 44313600 | consumed tokens: 90754252800 | elapsed time per iteration (s): 0.75 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.865439E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.735 | TFLOPs: 20.73 | 31: iteration 173110/ 173500 | consumed samples: 44316160 | consumed tokens: 90759495680 | elapsed time per iteration (s): 0.80 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.898860E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.115 | TFLOPs: 19.25 | 31: 
iteration 173120/ 173500 | consumed samples: 44318720 | consumed tokens: 90764738560 | elapsed time per iteration (s): 0.75 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.914153E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 339.165 | TFLOPs: 20.52 | 31: iteration 173130/ 173500 | consumed samples: 44321280 | consumed tokens: 90769981440 | elapsed time per iteration (s): 0.79 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.899960E+00 | grad norm: 0.226 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.561 | TFLOPs: 19.70 | 31: iteration 173140/ 173500 | consumed samples: 44323840 | consumed tokens: 90775224320 | elapsed time per iteration (s): 0.82 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.908772E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.888 | TFLOPs: 18.87 | 31: iteration 173150/ 173500 | consumed samples: 44326400 | consumed tokens: 90780467200 | elapsed time per iteration (s): 0.82 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.903797E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 311.571 | TFLOPs: 18.85 | 31: iteration 173160/ 173500 | consumed samples: 44328960 | consumed tokens: 90785710080 | elapsed time per iteration (s): 0.79 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.904300E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.913 | TFLOPs: 19.72 | 31: iteration 173170/ 173500 | consumed samples: 44331520 | consumed tokens: 90790952960 | elapsed time per iteration (s): 0.81 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.883578E+00 | grad norm: 0.198 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 317.829 | TFLOPs: 19.23 | 31: iteration 173180/ 173500 | consumed samples: 44334080 | consumed tokens: 90796195840 | elapsed time per iteration (s): 0.80 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.916105E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.843 | TFLOPs: 19.35 | 31: iteration 173190/ 173500 | consumed samples: 44336640 | consumed tokens: 90801438720 | elapsed time per iteration (s): 0.80 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.905784E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 321.491 | TFLOPs: 19.45 | 31: iteration 173200/ 173500 | consumed samples: 44339200 | consumed tokens: 90806681600 | elapsed time per iteration (s): 0.80 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.900571E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.153 | TFLOPs: 19.37 | 31: iteration 173210/ 173500 | consumed samples: 44341760 | consumed tokens: 90811924480 | elapsed time per iteration (s): 0.81 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.916140E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.713 | TFLOPs: 19.04 | 31: iteration 173220/ 173500 | consumed samples: 44344320 | consumed tokens: 90817167360 | elapsed time per iteration (s): 0.93 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.920739E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 274.783 | TFLOPs: 16.62 | 31: iteration 173230/ 173500 | consumed samples: 44346880 | consumed tokens: 90822410240 | elapsed time per iteration (s): 0.81 | 
learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.892728E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 314.773 | TFLOPs: 19.04 | 31: iteration 173240/ 173500 | consumed samples: 44349440 | consumed tokens: 90827653120 | elapsed time per iteration (s): 0.81 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.906567E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 315.019 | TFLOPs: 19.06 | 31: iteration 173250/ 173500 | consumed samples: 44352000 | consumed tokens: 90832896000 | elapsed time per iteration (s): 0.79 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.884517E+00 | grad norm: 0.335 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.228 | TFLOPs: 19.49 | 31: iteration 173260/ 173500 | consumed samples: 44354560 | consumed tokens: 90838138880 | elapsed time per iteration (s): 0.78 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.907717E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.629 | TFLOPs: 19.94 | 31: iteration 173270/ 173500 | consumed samples: 44357120 | consumed tokens: 90843381760 | elapsed time per iteration (s): 0.79 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.903922E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 322.056 | TFLOPs: 19.48 | 31: iteration 173280/ 173500 | consumed samples: 44359680 | consumed tokens: 90848624640 | elapsed time per iteration (s): 0.79 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.895554E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 325.554 | TFLOPs: 19.70 | 31: iteration 
173290/ 173500 | consumed samples: 44362240 | consumed tokens: 90853867520 | elapsed time per iteration (s): 0.81 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.906209E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.732 | TFLOPs: 19.16 | 31: iteration 173300/ 173500 | consumed samples: 44364800 | consumed tokens: 90859110400 | elapsed time per iteration (s): 0.78 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.926336E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 327.054 | TFLOPs: 19.79 | 31: iteration 173310/ 173500 | consumed samples: 44367360 | consumed tokens: 90864353280 | elapsed time per iteration (s): 0.79 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.932574E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.703 | TFLOPs: 19.64 | 31: iteration 173320/ 173500 | consumed samples: 44369920 | consumed tokens: 90869596160 | elapsed time per iteration (s): 0.81 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.889217E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 316.631 | TFLOPs: 19.16 | 31: iteration 173330/ 173500 | consumed samples: 44372480 | consumed tokens: 90874839040 | elapsed time per iteration (s): 0.80 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.906006E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 319.172 | TFLOPs: 19.31 | 31: iteration 173340/ 173500 | consumed samples: 44375040 | consumed tokens: 90880081920 | elapsed time per iteration (s): 0.76 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.883444E+00 | grad norm: 0.191 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.448 | TFLOPs: 20.48 | 31: iteration 173350/ 173500 | consumed samples: 44377600 | consumed tokens: 90885324800 | elapsed time per iteration (s): 0.79 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.880316E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.418 | TFLOPs: 19.63 | 31: iteration 173360/ 173500 | consumed samples: 44380160 | consumed tokens: 90890567680 | elapsed time per iteration (s): 0.75 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.887113E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 342.614 | TFLOPs: 20.73 | 31: iteration 173370/ 173500 | consumed samples: 44382720 | consumed tokens: 90895810560 | elapsed time per iteration (s): 0.76 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.912958E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 338.068 | TFLOPs: 20.45 | 31: iteration 173380/ 173500 | consumed samples: 44385280 | consumed tokens: 90901053440 | elapsed time per iteration (s): 0.80 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.931065E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 320.153 | TFLOPs: 19.37 | 31: iteration 173390/ 173500 | consumed samples: 44387840 | consumed tokens: 90906296320 | elapsed time per iteration (s): 0.77 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.917292E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 332.776 | TFLOPs: 20.13 | 31: iteration 173400/ 173500 | consumed samples: 44390400 | consumed tokens: 90911539200 | elapsed time per iteration (s): 0.75 | learning 
rate: 2.000E-05 | global batch size: 256 | lm loss: 1.917133E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 341.663 | TFLOPs: 20.67 | 31: iteration 173410/ 173500 | consumed samples: 44392960 | consumed tokens: 90916782080 | elapsed time per iteration (s): 0.73 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.906620E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 349.571 | TFLOPs: 21.15 | 31: iteration 173420/ 173500 | consumed samples: 44395520 | consumed tokens: 90922024960 | elapsed time per iteration (s): 0.71 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.878956E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 358.095 | TFLOPs: 21.66 | 31: iteration 173430/ 173500 | consumed samples: 44398080 | consumed tokens: 90927267840 | elapsed time per iteration (s): 0.74 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.895397E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 344.584 | TFLOPs: 20.85 | 31: iteration 173440/ 173500 | consumed samples: 44400640 | consumed tokens: 90932510720 | elapsed time per iteration (s): 0.79 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.915573E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 324.101 | TFLOPs: 19.61 | 31: iteration 173450/ 173500 | consumed samples: 44403200 | consumed tokens: 90937753600 | elapsed time per iteration (s): 0.84 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.887219E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 303.706 | TFLOPs: 18.37 | 31: iteration 173460/ 
173500 | consumed samples: 44405760 | consumed tokens: 90942996480 | elapsed time per iteration (s): 0.74 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.913278E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 346.567 | TFLOPs: 20.97 | 31: iteration 173470/ 173500 | consumed samples: 44408320 | consumed tokens: 90948239360 | elapsed time per iteration (s): 0.78 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.912518E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 329.824 | TFLOPs: 19.95 | 31: iteration 173480/ 173500 | consumed samples: 44410880 | consumed tokens: 90953482240 | elapsed time per iteration (s): 0.79 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.911126E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 323.068 | TFLOPs: 19.54 | 31: iteration 173490/ 173500 | consumed samples: 44413440 | consumed tokens: 90958725120 | elapsed time per iteration (s): 0.84 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.944413E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 305.226 | TFLOPs: 18.47 | 31: iteration 173500/ 173500 | consumed samples: 44416000 | consumed tokens: 90963968000 | elapsed time per iteration (s): 0.80 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.873877E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 318.348 | TFLOPs: 19.26 | 0: [after training is done] datetime: 2022-11-27 09:13:51 0: saving checkpoint at iteration 173500 to checkpoints_1b1long 31: ------------------------------------------------------------------------------------------------------------ 31: valid loss at the end 
of training for val data | lm loss value: 1.870674E+00 | lm loss PPL: 6.492670E+00 | 31: ------------------------------------------------------------------------------------------------------------ 0: [2022-11-27 09:13:52,071] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step173500 is begin to save! 0: [2022-11-27 09:13:52,080] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_01-model_00-model_states.pt... 0: [2022-11-27 09:13:52,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_01-model_00-model_states.pt. 0: [2022-11-27 09:13:52,302] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_03-model_00-model_states.pt... 0: [2022-11-27 09:13:52,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_03-model_00-model_states.pt. 0: [2022-11-27 09:13:52,382] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_04-model_00-model_states.pt... 0: [2022-11-27 09:13:52,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_04-model_00-model_states.pt. 0: [2022-11-27 09:13:52,461] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_05-model_00-model_states.pt... 0: [2022-11-27 09:13:52,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_05-model_00-model_states.pt. 0: [2022-11-27 09:13:52,540] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_06-model_00-model_states.pt... 0: [2022-11-27 09:13:52,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_06-model_00-model_states.pt. 
0: [2022-11-27 09:13:52,616] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_07-model_00-model_states.pt... 0: [2022-11-27 09:13:52,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_07-model_00-model_states.pt. 0: [2022-11-27 09:13:52,696] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_08-model_00-model_states.pt... 0: [2022-11-27 09:13:52,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_08-model_00-model_states.pt. 0: [2022-11-27 09:13:52,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_09-model_00-model_states.pt... 0: [2022-11-27 09:13:52,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_09-model_00-model_states.pt. 0: [2022-11-27 09:13:52,849] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_10-model_00-model_states.pt... 0: [2022-11-27 09:13:52,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_10-model_00-model_states.pt. 0: [2022-11-27 09:13:52,927] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_11-model_00-model_states.pt... 0: [2022-11-27 09:13:53,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_11-model_00-model_states.pt. 0: [2022-11-27 09:13:53,001] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_12-model_00-model_states.pt... 0: [2022-11-27 09:13:53,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_12-model_00-model_states.pt. 
0: [2022-11-27 09:13:53,080] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_13-model_00-model_states.pt... 0: [2022-11-27 09:13:53,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_13-model_00-model_states.pt. 0: [2022-11-27 09:13:53,156] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_14-model_00-model_states.pt... 0: [2022-11-27 09:13:53,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_14-model_00-model_states.pt. 0: [2022-11-27 09:13:53,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_15-model_00-model_states.pt... 0: [2022-11-27 09:13:53,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_15-model_00-model_states.pt. 0: [2022-11-27 09:13:53,307] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_16-model_00-model_states.pt... 0: [2022-11-27 09:13:53,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_16-model_00-model_states.pt. 0: [2022-11-27 09:13:53,379] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_17-model_00-model_states.pt... 0: [2022-11-27 09:13:53,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_17-model_00-model_states.pt. 0: [2022-11-27 09:13:53,454] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_18-model_00-model_states.pt... 0: [2022-11-27 09:13:53,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_18-model_00-model_states.pt. 
0: [2022-11-27 09:13:53,529] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_19-model_00-model_states.pt... 0: [2022-11-27 09:13:53,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_19-model_00-model_states.pt. 0: [2022-11-27 09:13:53,608] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_20-model_00-model_states.pt... 0: [2022-11-27 09:13:53,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_20-model_00-model_states.pt. 0: [2022-11-27 09:13:53,682] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_21-model_00-model_states.pt... 0: [2022-11-27 09:13:53,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_21-model_00-model_states.pt. 0: [2022-11-27 09:13:53,760] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_22-model_00-model_states.pt... 0: [2022-11-27 09:13:53,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_22-model_00-model_states.pt. 0: [2022-11-27 09:13:53,833] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_23-model_00-model_states.pt... 0: [2022-11-27 09:13:53,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_23-model_00-model_states.pt. 0: [2022-11-27 09:13:53,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_24-model_00-model_states.pt... 0: [2022-11-27 09:13:53,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_24-model_00-model_states.pt. 
0: [2022-11-27 09:13:53,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_25-model_00-model_states.pt... 0: [2022-11-27 09:13:54,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_25-model_00-model_states.pt. 0: [2022-11-27 09:13:54,062] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_26-model_00-model_states.pt... 0: [2022-11-27 09:13:54,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_26-model_00-model_states.pt. 0: [2022-11-27 09:13:54,138] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_27-model_00-model_states.pt... 0: [2022-11-27 09:13:54,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_27-model_00-model_states.pt. 0: [2022-11-27 09:13:54,214] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_28-model_00-model_states.pt... 0: [2022-11-27 09:13:54,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_28-model_00-model_states.pt. 0: [2022-11-27 09:13:54,291] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/layer_30-model_00-model_states.pt... 0: [2022-11-27 09:13:54,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/layer_30-model_00-model_states.pt. 0: [2022-11-27 09:13:54,293] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b1long/global_step173500/mp_rank_00_model_states.pt 0: [2022-11-27 09:13:54,293] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/mp_rank_00_model_states.pt... 
0: [2022-11-27 09:13:54,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/mp_rank_00_model_states.pt. 0: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
4: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 
10: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 16: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt... 16: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt... 16: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt... 16: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt... 16: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt... 2: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
13: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
20: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt... 20: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt... 25: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt... 25: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt... 25: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt... 23: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt... 23: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 28: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt... 24: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
14: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 31: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt... 29: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt... 29: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt... 29: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt... 22: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt... 30: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt... 30: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt... 30: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt... 17: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt... 
17: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt... 17: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt... 17: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt... 17: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt... 21: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt... 21: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt... 18: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt... 18: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt... 18: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt... 18: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt... 26: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt... 
19: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt... 19: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt... 19: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt... 19: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt... 19: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt... 27: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt... 27: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt... 27: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt... 27: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 
6: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
10: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 16: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt... 16: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt... 16: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt... 2: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
2: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
20: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt... 20: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt... 25: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt... 25: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt... 23: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 28: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt... 28: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt... 28: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt... 
24: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt... 24: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt... 24: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt... 24: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 31: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt... 29: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt... 29: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt... 22: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt... 22: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt... 
22: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt... 30: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt... 17: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt... 17: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt... 17: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt... 21: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt... 21: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt... 21: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt... 21: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt... 18: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt... 26: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt... 
19: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt... 19: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt... 27: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt... 27: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 
10: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 2: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 20: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt... 25: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt... 25: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt... 
23: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt... 23: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 28: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt... 24: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt... 24: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt... 24: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
31: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt... 29: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt... 22: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt... 30: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt... 21: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt... 18: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt... 26: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt... 26: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt... 19: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt... 27: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
0: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 10: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 
15: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 20: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt... 25: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt... 23: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt... 23: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt... 23: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt... 28: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt... 28: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 31: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt... 29: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt... 
22: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt... 30: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt... 21: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt... 18: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt... 26: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt... 27: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
10: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 20: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt... 28: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 31: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt... 29: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt... 22: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt... 30: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt... 
18: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt... 26: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 20: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt... 31: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt... 31: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt... 22: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt... 
30: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt... 26: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt... 26: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt... 31: [2022-11-27 09:13:54,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt... 26: [2022-11-27 09:13:54,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt. 26: [2022-11-27 09:13:54,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_212_mp_rank_00_optim_states.pt 26: [2022-11-27 09:13:54,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 18: [2022-11-27 09:13:54,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt. 18: [2022-11-27 09:13:54,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_144_mp_rank_00_optim_states.pt 18: [2022-11-27 09:13:54,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 20: [2022-11-27 09:13:54,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt. 
20: [2022-11-27 09:13:54,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_166_mp_rank_00_optim_states.pt 20: [2022-11-27 09:13:54,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 17: [2022-11-27 09:13:54,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt. 17: [2022-11-27 09:13:54,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_140_mp_rank_00_optim_states.pt 17: [2022-11-27 09:13:54,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 6: [2022-11-27 09:13:54,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:13:54,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:13:54,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 9: [2022-11-27 09:13:54,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 6: [2022-11-27 09:13:54,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 9: [2022-11-27 09:13:54,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 29: [2022-11-27 09:13:54,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt. 
29: [2022-11-27 09:13:54,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_233_mp_rank_00_optim_states.pt 29: [2022-11-27 09:13:54,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 25: [2022-11-27 09:13:54,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt. 24: [2022-11-27 09:13:54,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt. 25: [2022-11-27 09:13:54,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_200_mp_rank_00_optim_states.pt 24: [2022-11-27 09:13:54,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_192_mp_rank_00_optim_states.pt 24: [2022-11-27 09:13:54,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 8: [2022-11-27 09:13:54,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 25: [2022-11-27 09:13:54,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 8: [2022-11-27 09:13:54,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 09:13:54,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 28: [2022-11-27 09:13:54,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt. 
31: [2022-11-27 09:13:54,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:13:54,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:13:54,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 31: [2022-11-27 09:13:54,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_255_mp_rank_00_optim_states.pt 7: [2022-11-27 09:13:54,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 13: [2022-11-27 09:13:54,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 31: [2022-11-27 09:13:54,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 7: [2022-11-27 09:13:54,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 13: [2022-11-27 09:13:54,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 10: [2022-11-27 09:13:54,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 16: [2022-11-27 09:13:54,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt. 
23: [2022-11-27 09:13:54,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt. 16: [2022-11-27 09:13:54,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_133_mp_rank_00_optim_states.pt 23: [2022-11-27 09:13:54,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_191_mp_rank_00_optim_states.pt 0: [2022-11-27 09:13:54,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:13:54,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 16: [2022-11-27 09:13:54,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 23: [2022-11-27 09:13:54,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 10: [2022-11-27 09:13:54,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 4: [2022-11-27 09:13:54,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:13:54,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 09:13:54,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 3: [2022-11-27 09:13:54,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
30: [2022-11-27 09:13:54,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:13:54,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 3: [2022-11-27 09:13:54,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 11: [2022-11-27 09:13:54,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 30: [2022-11-27 09:13:54,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_241_mp_rank_00_optim_states.pt 5: [2022-11-27 09:13:54,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 3: [2022-11-27 09:13:54,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 11: [2022-11-27 09:13:54,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 30: [2022-11-27 09:13:54,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 5: [2022-11-27 09:13:54,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 11: [2022-11-27 09:13:54,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 27: [2022-11-27 09:13:54,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt. 
27: [2022-11-27 09:13:54,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_222_mp_rank_00_optim_states.pt 27: [2022-11-27 09:13:54,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 14: [2022-11-27 09:13:54,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 09:13:54,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 09:13:54,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 8: [2022-11-27 09:13:54,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 31: [2022-11-27 09:13:54,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:13:54,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
8: [2022-11-27 09:13:54,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 31: [2022-11-27 09:13:54,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_250_mp_rank_00_optim_states.pt 9: [2022-11-27 09:13:54,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 8: [2022-11-27 09:13:54,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 31: [2022-11-27 09:13:54,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 19: [2022-11-27 09:13:54,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:13:54,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 19: [2022-11-27 09:13:54,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_154_mp_rank_00_optim_states.pt 19: [2022-11-27 09:13:54,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 19: [2022-11-27 09:13:54,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt. 19: [2022-11-27 09:13:54,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_157_mp_rank_00_optim_states.pt 19: [2022-11-27 09:13:54,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 
26: [2022-11-27 09:13:54,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt. 26: [2022-11-27 09:13:54,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_210_mp_rank_00_optim_states.pt 26: [2022-11-27 09:13:54,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 15: [2022-11-27 09:13:54,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 09:13:54,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 09:13:54,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 7: [2022-11-27 09:13:54,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:13:54,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 09:13:54,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 11: [2022-11-27 09:13:54,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-27 09:13:54,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 22: [2022-11-27 09:13:54,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt. 11: [2022-11-27 09:13:54,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 22: [2022-11-27 09:13:54,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt. 22: [2022-11-27 09:13:54,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_176_mp_rank_00_optim_states.pt 22: [2022-11-27 09:13:54,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 15: [2022-11-27 09:13:54,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 22: [2022-11-27 09:13:54,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_177_mp_rank_00_optim_states.pt 15: [2022-11-27 09:13:54,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 22: [2022-11-27 09:13:54,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 15: [2022-11-27 09:13:54,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 18: [2022-11-27 09:13:54,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt. 
18: [2022-11-27 09:13:54,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_146_mp_rank_00_optim_states.pt 18: [2022-11-27 09:13:54,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 3: [2022-11-27 09:13:54,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 24: [2022-11-27 09:13:54,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:13:54,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 3: [2022-11-27 09:13:54,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 13: [2022-11-27 09:13:54,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 3: [2022-11-27 09:13:54,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 13: [2022-11-27 09:13:54,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 24: [2022-11-27 09:13:54,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_193_mp_rank_00_optim_states.pt 0: [2022-11-27 09:13:54,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-27 09:13:54,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 24: [2022-11-27 09:13:54,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 0: [2022-11-27 09:13:54,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 29: [2022-11-27 09:13:54,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt. 29: [2022-11-27 09:13:54,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_238_mp_rank_00_optim_states.pt 29: [2022-11-27 09:13:54,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 5: [2022-11-27 09:13:54,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:13:54,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 27: [2022-11-27 09:13:54,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:13:54,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 4: [2022-11-27 09:13:54,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 5: [2022-11-27 09:13:54,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 
4: [2022-11-27 09:13:54,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 23: [2022-11-27 09:13:54,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt. 27: [2022-11-27 09:13:54,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_220_mp_rank_00_optim_states.pt 27: [2022-11-27 09:13:54,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 23: [2022-11-27 09:13:54,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_187_mp_rank_00_optim_states.pt 23: [2022-11-27 09:13:54,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 25: [2022-11-27 09:13:54,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt. 25: [2022-11-27 09:13:54,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_206_mp_rank_00_optim_states.pt 6: [2022-11-27 09:13:54,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 25: [2022-11-27 09:13:54,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 6: [2022-11-27 09:13:54,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 09:13:54,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 
20: [2022-11-27 09:13:54,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt. 20: [2022-11-27 09:13:54,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_164_mp_rank_00_optim_states.pt 20: [2022-11-27 09:13:54,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 30: [2022-11-27 09:13:54,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt. 30: [2022-11-27 09:13:54,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_245_mp_rank_00_optim_states.pt 7: [2022-11-27 09:13:54,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 30: [2022-11-27 09:13:54,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 7: [2022-11-27 09:13:54,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 09:13:54,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 10: [2022-11-27 09:13:54,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:13:54,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 09:13:54,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 
12: [2022-11-27 09:13:54,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:13:54,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 17: [2022-11-27 09:13:54,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:13:54,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 17: [2022-11-27 09:13:54,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_139_mp_rank_00_optim_states.pt 12: [2022-11-27 09:13:54,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 17: [2022-11-27 09:13:54,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 12: [2022-11-27 09:13:54,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 2: [2022-11-27 09:13:54,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 09:13:54,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:13:54,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 2: [2022-11-27 09:13:54,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-27 09:13:54,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 09:13:54,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 09:13:54,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 1: [2022-11-27 09:13:54,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 2: [2022-11-27 09:13:54,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 09:13:54,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 21: [2022-11-27 09:13:54,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt. 2: [2022-11-27 09:13:54,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 21: [2022-11-27 09:13:54,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt. 21: [2022-11-27 09:13:54,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_175_mp_rank_00_optim_states.pt 1: [2022-11-27 09:13:54,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 21: [2022-11-27 09:13:54,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 
1: [2022-11-27 09:13:54,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 31: [2022-11-27 09:13:54,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt. 21: [2022-11-27 09:13:54,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_169_mp_rank_00_optim_states.pt 21: [2022-11-27 09:13:54,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 16: [2022-11-27 09:13:54,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt. 31: [2022-11-27 09:13:54,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_248_mp_rank_00_optim_states.pt 16: [2022-11-27 09:13:54,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_135_mp_rank_00_optim_states.pt 31: [2022-11-27 09:13:54,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 16: [2022-11-27 09:13:54,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 29: [2022-11-27 09:13:54,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt. 29: [2022-11-27 09:13:54,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_236_mp_rank_00_optim_states.pt 29: [2022-11-27 09:13:54,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 
11: [2022-11-27 09:13:54,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 23: [2022-11-27 09:13:54,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt. 11: [2022-11-27 09:13:54,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 13: [2022-11-27 09:13:54,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 23: [2022-11-27 09:13:54,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_184_mp_rank_00_optim_states.pt 11: [2022-11-27 09:13:54,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 13: [2022-11-27 09:13:54,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 23: [2022-11-27 09:13:54,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 13: [2022-11-27 09:13:54,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 12: [2022-11-27 09:13:54,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:13:54,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 09:13:54,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 
6: [2022-11-27 09:13:54,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:13:54,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 09:13:54,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 14: [2022-11-27 09:13:54,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 09:13:54,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 09:13:54,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 28: [2022-11-27 09:13:54,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_231_mp_rank_00_optim_states.pt 28: [2022-11-27 09:13:54,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 28: [2022-11-27 09:13:54,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt. 28: [2022-11-27 09:13:54,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_229_mp_rank_00_optim_states.pt 28: [2022-11-27 09:13:54,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 28: [2022-11-27 09:13:54,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt. 
28: [2022-11-27 09:13:54,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_228_mp_rank_00_optim_states.pt 28: [2022-11-27 09:13:54,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 25: [2022-11-27 09:13:54,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt. 25: [2022-11-27 09:13:54,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_203_mp_rank_00_optim_states.pt 25: [2022-11-27 09:13:54,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 27: [2022-11-27 09:13:54,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt. 27: [2022-11-27 09:13:54,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_221_mp_rank_00_optim_states.pt 27: [2022-11-27 09:13:54,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 5: [2022-11-27 09:13:54,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:13:54,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 09:13:54,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 3: [2022-11-27 09:13:54,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-27 09:13:54,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 09:13:54,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 18: [2022-11-27 09:13:54,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt. 18: [2022-11-27 09:13:54,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_150_mp_rank_00_optim_states.pt 18: [2022-11-27 09:13:54,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 15: [2022-11-27 09:13:54,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 09:13:54,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 09:13:54,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 24: [2022-11-27 09:13:54,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt. 24: [2022-11-27 09:13:54,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_195_mp_rank_00_optim_states.pt 24: [2022-11-27 09:13:54,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 14: [2022-11-27 09:13:54,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
20: [2022-11-27 09:13:54,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt. 14: [2022-11-27 09:13:54,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 09:13:54,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 17: [2022-11-27 09:13:54,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt. 20: [2022-11-27 09:13:54,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_165_mp_rank_00_optim_states.pt 17: [2022-11-27 09:13:54,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_137_mp_rank_00_optim_states.pt 11: [2022-11-27 09:13:54,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 17: [2022-11-27 09:13:54,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 11: [2022-11-27 09:13:54,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 20: [2022-11-27 09:13:54,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 11: [2022-11-27 09:13:54,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 19: [2022-11-27 09:13:54,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt. 
4: [2022-11-27 09:13:54,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 19: [2022-11-27 09:13:54,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_153_mp_rank_00_optim_states.pt 4: [2022-11-27 09:13:54,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 19: [2022-11-27 09:13:54,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 4: [2022-11-27 09:13:54,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 0: [2022-11-27 09:13:54,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 09:13:54,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 9: [2022-11-27 09:13:54,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 22: [2022-11-27 09:13:54,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt. 0: [2022-11-27 09:13:54,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 
9: [2022-11-27 09:13:54,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 22: [2022-11-27 09:13:54,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_181_mp_rank_00_optim_states.pt 9: [2022-11-27 09:13:54,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 22: [2022-11-27 09:13:54,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 10: [2022-11-27 09:13:54,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:13:54,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 09:13:54,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 30: [2022-11-27 09:13:54,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt. 30: [2022-11-27 09:13:54,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_242_mp_rank_00_optim_states.pt 30: [2022-11-27 09:13:54,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 0: [2022-11-27 09:13:54,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 09:13:54,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 
8: [2022-11-27 09:13:54,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:13:54,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 09:13:54,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 16: [2022-11-27 09:13:54,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt. 16: [2022-11-27 09:13:54,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_128_mp_rank_00_optim_states.pt 16: [2022-11-27 09:13:54,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 21: [2022-11-27 09:13:54,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt. 21: [2022-11-27 09:13:54,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_170_mp_rank_00_optim_states.pt 21: [2022-11-27 09:13:54,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 26: [2022-11-27 09:13:54,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt. 26: [2022-11-27 09:13:54,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_213_mp_rank_00_optim_states.pt 26: [2022-11-27 09:13:54,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 
18: [2022-11-27 09:13:54,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt. 18: [2022-11-27 09:13:54,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_147_mp_rank_00_optim_states.pt 18: [2022-11-27 09:13:54,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 6: [2022-11-27 09:13:54,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:13:54,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 09:13:54,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 1: [2022-11-27 09:13:54,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 09:13:54,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 09:13:54,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 09:13:54,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 1: [2022-11-27 09:13:54,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-27 09:13:54,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 09:13:54,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 29: [2022-11-27 09:13:54,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt. 1: [2022-11-27 09:13:54,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 1: [2022-11-27 09:13:54,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 29: [2022-11-27 09:13:54,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_232_mp_rank_00_optim_states.pt 29: [2022-11-27 09:13:54,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 20: [2022-11-27 09:13:54,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt. 20: [2022-11-27 09:13:54,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_162_mp_rank_00_optim_states.pt 20: [2022-11-27 09:13:54,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 26: [2022-11-27 09:13:54,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt. 
26: [2022-11-27 09:13:54,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_211_mp_rank_00_optim_states.pt 26: [2022-11-27 09:13:54,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 9: [2022-11-27 09:13:54,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:13:54,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 09:13:54,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 24: [2022-11-27 09:13:54,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt. 24: [2022-11-27 09:13:54,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_196_mp_rank_00_optim_states.pt 24: [2022-11-27 09:13:54,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 28: [2022-11-27 09:13:54,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt. 28: [2022-11-27 09:13:54,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_227_mp_rank_00_optim_states.pt 28: [2022-11-27 09:13:54,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 2: [2022-11-27 09:13:54,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
3: [2022-11-27 09:13:54,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 2: [2022-11-27 09:13:54,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 3: [2022-11-27 09:13:54,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 2: [2022-11-27 09:13:54,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 3: [2022-11-27 09:13:54,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 30: [2022-11-27 09:13:54,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt. 30: [2022-11-27 09:13:54,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_243_mp_rank_00_optim_states.pt 30: [2022-11-27 09:13:54,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 7: [2022-11-27 09:13:54,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:13:54,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 09:13:54,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 17: [2022-11-27 09:13:54,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt. 
17: [2022-11-27 09:13:54,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_142_mp_rank_00_optim_states.pt 17: [2022-11-27 09:13:54,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 8: [2022-11-27 09:13:54,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:13:54,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 09:13:54,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 19: [2022-11-27 09:13:54,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt. 19: [2022-11-27 09:13:54,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_156_mp_rank_00_optim_states.pt 19: [2022-11-27 09:13:54,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 4: [2022-11-27 09:13:54,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:13:54,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 09:13:54,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 12: [2022-11-27 09:13:54,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-27 09:13:54,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 09:13:54,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 16: [2022-11-27 09:13:54,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt. 16: [2022-11-27 09:13:54,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_131_mp_rank_00_optim_states.pt 16: [2022-11-27 09:13:54,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 25: [2022-11-27 09:13:54,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt. 25: [2022-11-27 09:13:54,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_205_mp_rank_00_optim_states.pt 25: [2022-11-27 09:13:54,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 31: [2022-11-27 09:13:54,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt. 31: [2022-11-27 09:13:54,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_254_mp_rank_00_optim_states.pt 31: [2022-11-27 09:13:54,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 23: [2022-11-27 09:13:54,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt. 
23: [2022-11-27 09:13:54,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_189_mp_rank_00_optim_states.pt 23: [2022-11-27 09:13:54,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 5: [2022-11-27 09:13:54,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:13:54,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 09:13:54,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 13: [2022-11-27 09:13:54,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:13:54,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 09:13:54,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 15: [2022-11-27 09:13:54,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 09:13:54,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 09:13:54,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 22: [2022-11-27 09:13:54,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt. 
22: [2022-11-27 09:13:54,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_183_mp_rank_00_optim_states.pt 22: [2022-11-27 09:13:54,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 21: [2022-11-27 09:13:54,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:13:54,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 21: [2022-11-27 09:13:54,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_168_mp_rank_00_optim_states.pt 21: [2022-11-27 09:13:54,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 10: [2022-11-27 09:13:54,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 09:13:54,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 0: [2022-11-27 09:13:54,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 09:13:54,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 09:13:54,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 14: [2022-11-27 09:13:54,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-27 09:13:54,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 09:13:54,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 1: [2022-11-27 09:13:54,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 09:13:54,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 11: [2022-11-27 09:13:54,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 1: [2022-11-27 09:13:54,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 20: [2022-11-27 09:13:54,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt. 11: [2022-11-27 09:13:54,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 20: [2022-11-27 09:13:54,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_167_mp_rank_00_optim_states.pt 11: [2022-11-27 09:13:54,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 20: [2022-11-27 09:13:54,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 18: [2022-11-27 09:13:54,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt. 
18: [2022-11-27 09:13:54,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_148_mp_rank_00_optim_states.pt 18: [2022-11-27 09:13:54,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 27: [2022-11-27 09:13:54,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt. 27: [2022-11-27 09:13:54,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_219_mp_rank_00_optim_states.pt 27: [2022-11-27 09:13:54,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 26: [2022-11-27 09:13:54,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt. 26: [2022-11-27 09:13:54,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_209_mp_rank_00_optim_states.pt 26: [2022-11-27 09:13:54,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 29: [2022-11-27 09:13:54,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt. 29: [2022-11-27 09:13:54,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_237_mp_rank_00_optim_states.pt 29: [2022-11-27 09:13:54,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 6: [2022-11-27 09:13:54,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
28: [2022-11-27 09:13:54,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:13:54,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 09:13:54,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 28: [2022-11-27 09:13:54,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_225_mp_rank_00_optim_states.pt 28: [2022-11-27 09:13:54,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 30: [2022-11-27 09:13:54,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt. 30: [2022-11-27 09:13:54,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_246_mp_rank_00_optim_states.pt 30: [2022-11-27 09:13:54,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 31: [2022-11-27 09:13:54,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt. 31: [2022-11-27 09:13:54,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_252_mp_rank_00_optim_states.pt 31: [2022-11-27 09:13:54,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 24: [2022-11-27 09:13:54,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt. 
24: [2022-11-27 09:13:54,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_199_mp_rank_00_optim_states.pt 24: [2022-11-27 09:13:54,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 3: [2022-11-27 09:13:54,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 09:13:54,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 09:13:54,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 9: [2022-11-27 09:13:54,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:13:54,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 17: [2022-11-27 09:13:54,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:13:54,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 17: [2022-11-27 09:13:54,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_138_mp_rank_00_optim_states.pt 17: [2022-11-27 09:13:54,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 7: [2022-11-27 09:13:54,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-27 09:13:54,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 09:13:54,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 19: [2022-11-27 09:13:54,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt. 19: [2022-11-27 09:13:54,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_159_mp_rank_00_optim_states.pt 19: [2022-11-27 09:13:54,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 2: [2022-11-27 09:13:54,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 09:13:54,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 09:13:54,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 4: [2022-11-27 09:13:54,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:13:54,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 09:13:54,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 8: [2022-11-27 09:13:54,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-27 09:13:54,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 12: [2022-11-27 09:13:54,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:13:54,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 12: [2022-11-27 09:13:54,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 09:13:54,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 23: [2022-11-27 09:13:54,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt. 23: [2022-11-27 09:13:54,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_185_mp_rank_00_optim_states.pt 23: [2022-11-27 09:13:54,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 16: [2022-11-27 09:13:54,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt. 16: [2022-11-27 09:13:54,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_132_mp_rank_00_optim_states.pt 16: [2022-11-27 09:13:54,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 13: [2022-11-27 09:13:54,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-27 09:13:54,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 09:13:54,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 25: [2022-11-27 09:13:54,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt. 25: [2022-11-27 09:13:54,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_202_mp_rank_00_optim_states.pt 25: [2022-11-27 09:13:54,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 15: [2022-11-27 09:13:54,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 09:13:54,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 09:13:54,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 10: [2022-11-27 09:13:54,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:13:54,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 09:13:54,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 21: [2022-11-27 09:13:54,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt. 
21: [2022-11-27 09:13:54,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_172_mp_rank_00_optim_states.pt 21: [2022-11-27 09:13:54,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 0: [2022-11-27 09:13:54,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 09:13:54,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 09:13:54,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 22: [2022-11-27 09:13:54,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt. 22: [2022-11-27 09:13:54,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_178_mp_rank_00_optim_states.pt 22: [2022-11-27 09:13:54,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 5: [2022-11-27 09:13:54,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:13:54,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 09:13:54,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 14: [2022-11-27 09:13:54,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-27 09:13:54,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 09:13:54,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 11: [2022-11-27 09:13:54,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 09:13:54,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 09:13:54,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 27: [2022-11-27 09:13:54,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt. 27: [2022-11-27 09:13:54,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_217_mp_rank_00_optim_states.pt 27: [2022-11-27 09:13:54,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 20: [2022-11-27 09:13:54,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt. 20: [2022-11-27 09:13:54,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_161_mp_rank_00_optim_states.pt 20: [2022-11-27 09:13:54,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 29: [2022-11-27 09:13:54,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt. 
29: [2022-11-27 09:13:54,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_234_mp_rank_00_optim_states.pt 29: [2022-11-27 09:13:54,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 26: [2022-11-27 09:13:54,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt. 18: [2022-11-27 09:13:54,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt. 26: [2022-11-27 09:13:54,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_208_mp_rank_00_optim_states.pt 18: [2022-11-27 09:13:54,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_151_mp_rank_00_optim_states.pt 26: [2022-11-27 09:13:54,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 18: [2022-11-27 09:13:54,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 1: [2022-11-27 09:13:54,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 09:13:54,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 09:13:54,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 24: [2022-11-27 09:13:54,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt. 
24: [2022-11-27 09:13:54,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_194_mp_rank_00_optim_states.pt 24: [2022-11-27 09:13:54,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 3: [2022-11-27 09:13:54,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 09:13:54,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 09:13:54,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 28: [2022-11-27 09:13:54,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt. 28: [2022-11-27 09:13:54,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_224_mp_rank_00_optim_states.pt 28: [2022-11-27 09:13:54,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 2: [2022-11-27 09:13:54,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 09:13:54,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 09:13:54,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 9: [2022-11-27 09:13:54,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-27 09:13:54,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 09:13:54,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 6: [2022-11-27 09:13:54,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:13:54,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:13:54,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 09:13:54,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 7: [2022-11-27 09:13:54,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 09:13:54,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 30: [2022-11-27 09:13:54,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt. 30: [2022-11-27 09:13:54,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_240_mp_rank_00_optim_states.pt 30: [2022-11-27 09:13:54,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 19: [2022-11-27 09:13:54,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt. 
19: [2022-11-27 09:13:54,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_158_mp_rank_00_optim_states.pt 19: [2022-11-27 09:13:54,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 17: [2022-11-27 09:13:54,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt. 17: [2022-11-27 09:13:54,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_136_mp_rank_00_optim_states.pt 17: [2022-11-27 09:13:54,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 12: [2022-11-27 09:13:54,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:13:54,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 09:13:54,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 4: [2022-11-27 09:13:54,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:13:54,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 09:13:54,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 31: [2022-11-27 09:13:54,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt. 
31: [2022-11-27 09:13:54,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_251_mp_rank_00_optim_states.pt 31: [2022-11-27 09:13:54,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 16: [2022-11-27 09:13:54,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt. 16: [2022-11-27 09:13:54,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_134_mp_rank_00_optim_states.pt 16: [2022-11-27 09:13:54,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 25: [2022-11-27 09:13:54,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt. 25: [2022-11-27 09:13:54,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_201_mp_rank_00_optim_states.pt 25: [2022-11-27 09:13:54,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 8: [2022-11-27 09:13:54,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:13:54,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 09:13:54,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 13: [2022-11-27 09:13:54,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-27 09:13:54,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 09:13:54,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 27: [2022-11-27 09:13:54,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt. 27: [2022-11-27 09:13:54,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_218_mp_rank_00_optim_states.pt 27: [2022-11-27 09:13:54,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 23: [2022-11-27 09:13:54,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt. 23: [2022-11-27 09:13:54,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_188_mp_rank_00_optim_states.pt 23: [2022-11-27 09:13:54,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 5: [2022-11-27 09:13:54,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:13:54,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 09:13:54,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 10: [2022-11-27 09:13:54,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-27 09:13:54,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 09:13:54,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 15: [2022-11-27 09:13:54,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 09:13:54,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 09:13:54,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 21: [2022-11-27 09:13:54,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt. 21: [2022-11-27 09:13:54,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_171_mp_rank_00_optim_states.pt 21: [2022-11-27 09:13:54,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 22: [2022-11-27 09:13:54,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt. 22: [2022-11-27 09:13:54,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_179_mp_rank_00_optim_states.pt 22: [2022-11-27 09:13:54,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 0: [2022-11-27 09:13:54,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
14: [2022-11-27 09:13:54,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 0: [2022-11-27 09:13:54,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 14: [2022-11-27 09:13:54,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 0: [2022-11-27 09:13:54,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 14: [2022-11-27 09:13:54,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 26: [2022-11-27 09:13:54,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt. 26: [2022-11-27 09:13:54,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_215_mp_rank_00_optim_states.pt 26: [2022-11-27 09:13:54,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 18: [2022-11-27 09:13:54,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt. 18: [2022-11-27 09:13:54,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_145_mp_rank_00_optim_states.pt 18: [2022-11-27 09:13:54,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 29: [2022-11-27 09:13:54,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt. 
29: [2022-11-27 09:13:54,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_235_mp_rank_00_optim_states.pt 29: [2022-11-27 09:13:54,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 11: [2022-11-27 09:13:54,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 09:13:54,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 09:13:54,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 1: [2022-11-27 09:13:54,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 09:13:54,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 09:13:54,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 9: [2022-11-27 09:13:54,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:13:54,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 09:13:54,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 24: [2022-11-27 09:13:54,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt. 
24: [2022-11-27 09:13:54,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_198_mp_rank_00_optim_states.pt 24: [2022-11-27 09:13:54,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 4: [2022-11-27 09:13:54,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 3: [2022-11-27 09:13:54,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:13:54,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 09:13:54,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 3: [2022-11-27 09:13:54,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 09:13:54,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 20: [2022-11-27 09:13:54,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt. 20: [2022-11-27 09:13:54,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_163_mp_rank_00_optim_states.pt 20: [2022-11-27 09:13:54,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 6: [2022-11-27 09:13:54,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-27 09:13:54,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 09:13:54,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 30: [2022-11-27 09:13:54,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt. 30: [2022-11-27 09:13:54,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_244_mp_rank_00_optim_states.pt 30: [2022-11-27 09:13:54,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 2: [2022-11-27 09:13:54,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 09:13:54,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 09:13:54,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 31: [2022-11-27 09:13:54,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt. 31: [2022-11-27 09:13:54,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_253_mp_rank_00_optim_states.pt 31: [2022-11-27 09:13:54,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 19: [2022-11-27 09:13:54,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt. 
19: [2022-11-27 09:13:54,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_155_mp_rank_00_optim_states.pt 19: [2022-11-27 09:13:54,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 8: [2022-11-27 09:13:54,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:13:54,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 09:13:54,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 17: [2022-11-27 09:13:54,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt. 17: [2022-11-27 09:13:54,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_143_mp_rank_00_optim_states.pt 17: [2022-11-27 09:13:54,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 7: [2022-11-27 09:13:54,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:13:54,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 09:13:54,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 28: [2022-11-27 09:13:54,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt. 
28: [2022-11-27 09:13:54,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_226_mp_rank_00_optim_states.pt 28: [2022-11-27 09:13:54,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 12: [2022-11-27 09:13:54,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:13:54,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 09:13:54,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 16: [2022-11-27 09:13:54,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt. 16: [2022-11-27 09:13:54,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_129_mp_rank_00_optim_states.pt 16: [2022-11-27 09:13:54,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 13: [2022-11-27 09:13:54,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:13:54,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 09:13:54,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 25: [2022-11-27 09:13:54,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt. 
25: [2022-11-27 09:13:54,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_207_mp_rank_00_optim_states.pt 25: [2022-11-27 09:13:54,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 10: [2022-11-27 09:13:54,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:13:54,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 09:13:54,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 15: [2022-11-27 09:13:54,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 23: [2022-11-27 09:13:54,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt. 15: [2022-11-27 09:13:54,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 09:13:54,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 23: [2022-11-27 09:13:54,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_190_mp_rank_00_optim_states.pt 23: [2022-11-27 09:13:54,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 26: [2022-11-27 09:13:54,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt. 
26: [2022-11-27 09:13:54,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_214_mp_rank_00_optim_states.pt 26: [2022-11-27 09:13:54,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 5: [2022-11-27 09:13:54,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:13:54,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 09:13:54,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 30: [2022-11-27 09:13:54,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt. 30: [2022-11-27 09:13:54,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_247_mp_rank_00_optim_states.pt 30: [2022-11-27 09:13:54,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 4: [2022-11-27 09:13:54,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:13:54,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 09:13:54,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 27: [2022-11-27 09:13:54,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt. 
13: [2022-11-27 09:13:54,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 19: [2022-11-27 09:13:54,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:13:54,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 27: [2022-11-27 09:13:54,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_223_mp_rank_00_optim_states.pt 13: [2022-11-27 09:13:54,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 19: [2022-11-27 09:13:54,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_152_mp_rank_00_optim_states.pt 27: [2022-11-27 09:13:54,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 19: [2022-11-27 09:13:54,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 1: [2022-11-27 09:13:54,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 21: [2022-11-27 09:13:54,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt. 31: [2022-11-27 09:13:54,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt. 
21: [2022-11-27 09:13:54,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_174_mp_rank_00_optim_states.pt 1: [2022-11-27 09:13:54,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 21: [2022-11-27 09:13:54,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 10: [2022-11-27 09:13:54,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 1: [2022-11-27 09:13:54,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 31: [2022-11-27 09:13:54,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_249_mp_rank_00_optim_states.pt 3: [2022-11-27 09:13:54,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 31: [2022-11-27 09:13:54,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 10: [2022-11-27 09:13:54,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 3: [2022-11-27 09:13:54,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 10: [2022-11-27 09:13:54,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 3: [2022-11-27 09:13:54,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 
12: [2022-11-27 09:13:54,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:13:54,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 09:13:54,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 23: [2022-11-27 09:13:54,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt. 23: [2022-11-27 09:13:54,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_186_mp_rank_00_optim_states.pt 23: [2022-11-27 09:13:54,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 0: [2022-11-27 09:13:54,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 15: [2022-11-27 09:13:54,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 24: [2022-11-27 09:13:54,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt. 0: [2022-11-27 09:13:54,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-27 09:13:54,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 15: [2022-11-27 09:13:54,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 24: [2022-11-27 09:13:54,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_197_mp_rank_00_optim_states.pt 0: [2022-11-27 09:13:54,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 24: [2022-11-27 09:13:54,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 0: [2022-11-27 09:13:54,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 15: [2022-11-27 09:13:54,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 11: [2022-11-27 09:13:54,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 0: [2022-11-27 09:13:54,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 11: [2022-11-27 09:13:54,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 09:13:54,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 29: [2022-11-27 09:13:54,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt. 
17: [2022-11-27 09:13:54,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt. 29: [2022-11-27 09:13:54,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_239_mp_rank_00_optim_states.pt 17: [2022-11-27 09:13:54,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_141_mp_rank_00_optim_states.pt 8: [2022-11-27 09:13:54,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 29: [2022-11-27 09:13:54,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 17: [2022-11-27 09:13:54,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 8: [2022-11-27 09:13:54,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 2: [2022-11-27 09:13:54,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 14: [2022-11-27 09:13:54,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:13:54,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 25: [2022-11-27 09:13:54,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt. 
2: [2022-11-27 09:13:54,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 14: [2022-11-27 09:13:54,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 25: [2022-11-27 09:13:54,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_204_mp_rank_00_optim_states.pt 14: [2022-11-27 09:13:54,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 25: [2022-11-27 09:13:54,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 2: [2022-11-27 09:13:54,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 14: [2022-11-27 09:13:54,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 09:13:54,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 14: [2022-11-27 09:13:54,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 16: [2022-11-27 09:13:54,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt. 22: [2022-11-27 09:13:54,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:13:54,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
16: [2022-11-27 09:13:54,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_130_mp_rank_00_optim_states.pt 22: [2022-11-27 09:13:54,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_180_mp_rank_00_optim_states.pt 16: [2022-11-27 09:13:54,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 9: [2022-11-27 09:13:54,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 22: [2022-11-27 09:13:54,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt. 22: [2022-11-27 09:13:54,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 28: [2022-11-27 09:13:54,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt. 22: [2022-11-27 09:13:54,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_182_mp_rank_00_optim_states.pt 9: [2022-11-27 09:13:54,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 28: [2022-11-27 09:13:54,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_230_mp_rank_00_optim_states.pt 22: [2022-11-27 09:13:54,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 28: [2022-11-27 09:13:54,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 
7: [2022-11-27 09:13:54,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:13:54,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 09:13:54,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 21: [2022-11-27 09:13:54,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt. 21: [2022-11-27 09:13:54,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_173_mp_rank_00_optim_states.pt 21: [2022-11-27 09:13:54,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 18: [2022-11-27 09:13:54,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt. 18: [2022-11-27 09:13:54,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_149_mp_rank_00_optim_states.pt 18: [2022-11-27 09:13:54,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 27: [2022-11-27 09:13:54,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:13:54,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
27: [2022-11-27 09:13:54,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_216_mp_rank_00_optim_states.pt 5: [2022-11-27 09:13:54,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 27: [2022-11-27 09:13:54,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 5: [2022-11-27 09:13:54,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 20: [2022-11-27 09:13:54,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt. 20: [2022-11-27 09:13:54,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_160_mp_rank_00_optim_states.pt 20: [2022-11-27 09:13:54,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 6: [2022-11-27 09:13:54,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:13:54,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b1long/global_step173500/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 09:13:54,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step173500 is ready now! 0: successfully saved checkpoint at iteration 173500 to checkpoints_1b1long